A Survey on Evaluation of Large Language Models

Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S. Yu, Qiang Yang, and Xing Xie

Y. Chang, X. Wang, Y. Wu and Y. Chang are with the School of Artificial Intelligence, Jilin University, Changchun, China. The first two authors contributed equally. J. Wang, X. Yi, and X. Xie are with Microsoft Research, Beijing, China. K. Zhu is with Institute of Automation, CAS, Beijing, China. H. Chen is with Carnegie Mellon University, PA, USA. L. Yang, C. Wang, and Y. Zhang are with Westlake University, Hangzhou, China. Y. Wang and W. Ye are with Peking University, Beijing, China. P. Yu is with the University of Illinois at Chicago, IL, USA. Q. Yang is with Hong Kong University of Science and Technology, Kowloon, Hong Kong. Correspondence to: Yuan Wu (yuanwu@jlu.edu.cn) and Jindong Wang (jindong.wang@microsoft.com).

Manuscript received April 19, 2005; revised August 26, 2015.

Abstract

Large language models (LLMs) are gaining increasing popularity in both academia and industry, owing to their unprecedented performance in various applications. As LLMs continue to play a vital role in both research and daily use, their evaluation becomes increasingly critical, not only at the task level, but also at the societal level for better understanding of their potential risks. Over the past years, significant efforts have been made to examine LLMs from various perspectives. This paper presents a comprehensive review of these evaluation methods for LLMs, focusing on three key dimensions: what to evaluate, where to evaluate, and how to evaluate. Firstly, we provide an overview from the perspective of evaluation tasks, encompassing general natural language processing tasks, reasoning, medical usage, ethics, education, natural and social sciences, agent applications, and other areas. Secondly, we answer the ‘where’ and ‘how’ questions by diving into the evaluation methods and benchmarks, which serve as crucial components in assessing the performance of LLMs. Then, we summarize the success and failure cases of LLMs in different tasks. Finally, we shed light on several future challenges that lie ahead in LLMs evaluation. Our aim is to offer invaluable insights to researchers in the realm of LLMs evaluation, thereby aiding the development of more proficient LLMs. Our key point is that evaluation should be treated as an essential discipline to better assist the development of LLMs. We consistently maintain the related open-source materials at: https://github.com/MLGroupJLU/LLM-eval-survey.

Index Terms:

Large language models, evaluation, model assessment, benchmark

1 Introduction

Understanding the essence of intelligence and establishing whether a machine embodies it poses a compelling question for scientists. It is generally agreed upon that authentic intelligence equips us with reasoning capabilities, enables us to test hypotheses, and prepares us for future eventualities (Khalfa, 1994). In particular, Artificial Intelligence (AI) researchers focus on the development of machine-based intelligence, as opposed to biologically based intellect (McCarthy, 2007). Proper measurement helps to understand intelligence. For instance, measures for general intelligence in human individuals often encompass IQ tests (Brody, 1999).

LLMs evaluation
  What to evaluate (Sec. 3)
    Natural language processing
      Natural language understanding:
        (1) Sentiment analysis: (Bang et al., 2023) / (Liang et al., 2022) / (Lopez-Lira and Tang, 2023) / (Qin et al., 2023) / (Wang et al., 2023j) / (Zhang et al., 2023d)
        (2) Text classification: (Liang et al., 2022) / (Peña et al., 2023) / (Yang and Menczer, 2023)
        (3) Natural language inference: (Lee et al., 2023) / (Qin et al., 2023)
        (4) Others: (Choi et al., 2023) / (Riccardi and Desai, 2023) / (Tao et al., 2023)
      Reasoning: (Bang et al., 2023) / (Bian et al., 2023) / (Frieder et al., 2023) / (Fu et al., 2023b) / (Liévin et al., 2022) / (Liu et al., 2023b) / (Orrù et al., 2023) / (Qin et al., 2023) / (Saparov et al., 2023) / (Wu et al., 2023c) / (Xu et al., 2023a) / (Zhuang et al., 2023)
      Natural language generation:
        (1) Summarization: (Bang et al., 2023) / (Liang et al., 2022) / (Pu and Demberg, 2023) / (Qin et al., 2023)
        (2) Dialogue: (Bang et al., 2023) / (Lin and Chen, 2023) / (Qin et al., 2023)
        (3) Translation: (Bang et al., 2023) / (Lyu et al., 2023a) / (Wang et al., 2023d)
        (4) Question answering: (Bai et al., 2023) / (Bang et al., 2023) / (Bian et al., 2023) / (Laskar et al., 2023) / (Liang et al., 2022) / (Qin et al., 2023)
        (5) Others: (Chen et al., 2023) / (Chia et al., 2023) / (Pu and Demberg, 2023)
      Multilingual: (Abdelali et al., 2023) / (Ahuja et al., 2023) / (Bang et al., 2023) / (Lai et al., 2023) / (Zhang et al., 2023c)
      Factuality: (Gekhman et al., 2023) / (Honovich et al., 2022) / (Manakul et al., 2023a) / (Min et al., 2023) / (Pezeshkpour, 2023) / (Wang et al., 2023b)
    Robustness / Ethics / Biases / Trustworthiness
      Robustness: (Li et al., 2023c) / (Wang et al., 2022) / (Wang et al., 2023c) / (Yang et al., 2022) / (Zhao et al., 2023b) / (Zhu et al., 2023) / (Zhuo et al., 2023b)
      Ethics and biases: (Cao et al., 2023) / (Deshpande et al., 2023) / (Dhamala et al., 2021) / (Ferrara, 2023) / (Gehman et al., 2020) / (Hartmann et al., 2023) / (Hendrycks et al., 2020a) / (Parrish et al., 2022) / (Rutinowski et al., 2023) / (Sheng et al., 2021) / (Simmons, 2022) / (Wang et al., 2023e) / (Zhuo et al., 2023a)
      Trustworthiness: (Hagendorff and Fabi, 2023) / (Wang et al., 2023a)
    Social science: (Deroy et al., 2023) / (Frank, 2023) / (Nay et al., 2023) / (Wu et al., 2023a) / (Ziems et al., 2023)
    Natural science & engineering
      Mathematics: (Arora et al., 2023) / (Bubeck et al., 2023) / (Collins et al., 2023) / (Dao and Le, 2023) / (Wei et al., 2023) / (Wu et al., 2023b) / (Yuan et al., 2023b)
      General science: (Arora et al., 2023) / (Castro Nascimento and Pimentel, 2023) / (Guo et al., 2023)
      Engineering: (Bubeck et al., 2023) / (Liu et al., 2023c) / (Pallagani et al., 2023) / (Sridhara et al., 2023) / (Valmeekam et al., 2022) / (Valmeekam et al., 2023) / (Zhuang et al., 2023)
    Medical applications
      Medical queries: (Chervenak et al., 2023) / (Duong and Solomon, 2023) / (Hamidi and Roberts, 2023) / (Holmes et al., 2023) / (Jahan et al., 2023) / (Johnson et al., 2023) / (Samaan et al., 2023) / (Thirunavukarasu et al., 2023)
      Medical examination: (Gilson et al., 2023) / (Kung et al., 2023)
      Medical assistants: (Cascella et al., 2023) / (Khan et al., 2023) / (Lahat et al., 2023) / (Lyu et al., 2023b) / (Oh et al., 2023) / (Wang et al., 2023i)
    Agent applications: (Huang et al., 2023a) / (Karpas et al., 2022) / (Parisi et al., 2022) / (Schick et al., 2023) / (Shen et al., 2023)
    Other applications
      Education: (Dai et al., 2023b) / (de Winter, 2023) / (Hellas et al., 2023) / (Wang and Demszky, 2023) / (Wei et al., 2023)
      Search and recommendation: (Dai et al., 2023a) / (Fan et al., 2023) / (Sun et al., 2023) / (Thakur et al., 2021) / (Xu et al., 2023c) / (Zhang et al., 2023a)
      Personality testing: (Bodroza et al., 2023) / (Jentzsch and Kersting, 2023) / (Safdari et al., 2023) / (Song et al., 2023) / (Wang et al., 2023f)
      Specific tasks: (Lanzi and Loiacono, 2023) / (Le and Zhang, 2023) / (Wang et al., 2023h)
  Where to evaluate (Sec. 4)
    General benchmarks: MME (Fu et al., 2023a) / AlpacaEval (Li et al., 2023d) / Chatbot Arena (LMSYS, 2023) / Xiezhi (Gu et al., 2023) / C-Eval (Huang et al., 2023b) / DynaBench (Kiela et al., 2021) / OpenLLM (HuggingFace, 2023) / HELM (Liang et al., 2022) / Big-Bench (Srivastava et al., 2022) / PandaLM (Wang et al., 2023h) / GLUE-X (Yang et al., 2022) / KoLA (Yu et al., 2023) / MT-Bench (Zheng et al., 2023) / AGIEval (Zhong et al., 2023) / PromptBench (Zhu et al., 2023)
    Specific benchmarks: SOCKET (Choi et al., 2023) / CUAD (Hendrycks et al., 2021b) / TRUSTGPT (Huang et al., 2023c) / MATH (Hendrycks et al., 2021c) / APPS (Hendrycks et al., 2021a) / API-Bank (Li et al., 2023a) / ARB (Sawada et al., 2023) / MultiMedQA (Singhal et al., 2022) / CVALUES (Xu et al., 2023b) / ToolBench (ToolBench, 2023) / M3Exam (Zhang et al., 2023c) / GAOKAO-Bench (Zhang et al., 2023e)
  How to evaluate (Sec. 5)
    Evaluation criterion
      Automatic evaluation: (Bang et al., 2023) / (Jain et al., 2023) / (Lin and Chen, 2023) / (Qin et al., 2023) / (Wang et al., 2023h)
      Human evaluation: (Bang et al., 2023) / (Bubeck et al., 2023) / (Liang et al., 2022) / (Ziems et al., 2023)
  Summary (Sec. 6)
    Tasks: success and failure cases of LLMs
    Benchmark and evaluations
      Human-in-the-loop: AdaVision (Gao et al., 2022) / AdaTest (Ribeiro and Lundberg, 2022)
      Crowd-sourcing testing: DynaBench (Kiela et al., 2021) / DynaBoard (Ma et al., 2021) / DynamicTempLAMA (Margatina et al., 2023) / DynaTask (Thrush et al., 2022)
      More challenging tasks: HELM (Liang et al., 2022) / AdaFilter (Phang et al., 2021) / CheckList (Ribeiro et al., 2020) / Big-Bench (Srivastava et al., 2022) / DeepTest (Tian et al., 2018) / PromptBench (Zhu et al., 2023)
  Grand challenges (Sec. 7)
    Challenges: (1) Designing AGI benchmarks (2) Complete behavioral evaluation (3) Robustness evaluation (4) Dynamic and evolving evaluation (5) Principled and trustworthy evaluation (6) Unified evaluation that supports all LLMs tasks (7) Beyond evaluation: LLMs enhancement

Figure 1: Structure of this paper.

Within the scope of AI, the Turing Test (Turing, 2009), a widely recognized test for assessing intelligence by discerning if responses are of human or machine origin, has been a longstanding objective in AI evolution. It is generally believed among researchers that a computing machine that successfully passes the Turing Test can be regarded as intelligent. Consequently, when viewed from a wider lens, the chronicle of AI can be depicted as the timeline of creation and evaluation of intelligent models and algorithms. With each emergence of a novel AI model or algorithm, researchers invariably scrutinize its capabilities in real-world scenarios through evaluation using specific and challenging tasks. For instance, the Perceptron algorithm (Gallant et al., 1990), touted as an Artificial General Intelligence (AGI) approach in the 1950s, was later revealed as inadequate due to its inability to resolve the XOR problem. The subsequent rise and application of Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) and deep learning (LeCun et al., 2015) have marked both progress and setbacks in the AI landscape. A significant takeaway from previous attempts is the paramount importance of AI evaluation, which serves as a critical tool to identify current system limitations and inform the design of more powerful models.

Recently, large language models (LLMs) have incited substantial interest across both academic and industrial domains (Wei et al., 2022a; Bommasani et al., 2021; Zhao et al., 2023a). As demonstrated by existing work (Bubeck et al., 2023), the great performance of LLMs has raised the promise that they could be AGI in this era. LLMs possess the capabilities to solve diverse tasks, in contrast to prior models confined to solving specific tasks. Due to their great performance in handling different applications such as general natural language tasks and domain-specific ones, LLMs are increasingly used by individuals with critical information needs, such as students or patients.

Evaluation is of paramount importance to the success of LLMs for several reasons. First, evaluating LLMs helps us better understand the strengths and weaknesses of LLMs. For instance, the PromptBench (Zhu et al., 2023) benchmark illustrates that current LLMs are sensitive to adversarial prompts, so careful prompt engineering is necessary for better performance. Second, better evaluations can provide better guidance for human-LLM interaction, which could inspire future interaction design and implementation. Third, the broad applicability of LLMs underscores the paramount importance of ensuring their safety and reliability, particularly in safety-sensitive sectors such as financial institutions and healthcare facilities. Finally, as LLMs are becoming larger with more emergent abilities, existing evaluation protocols may not be enough to evaluate their capabilities and potential risks. Therefore, we aim to raise the community's awareness of the importance of LLMs evaluation by reviewing the current evaluation protocols and, most importantly, to shed light on future research about designing new LLMs evaluation protocols.

With the introduction of ChatGPT (OpenAI, 2023a) and GPT-4 (OpenAI, 2023b), there have been a number of research efforts aiming at evaluating ChatGPT and other LLMs from different aspects (Figure 2), encompassing a range of factors such as natural language tasks, reasoning, robustness, trustworthiness, medical applications, and ethical considerations. Despite these efforts, a comprehensive overview capturing the entire gamut of evaluations is still lacking. Furthermore, the ongoing evolution of LLMs has also presented novel aspects for evaluation, thereby challenging existing evaluation protocols and reinforcing the need for thorough, multifaceted evaluation techniques. While existing research such as Bubeck et al. (2023) claimed that GPT-4 can be seen as showing sparks of AGI, others contest this claim due to the human-crafted nature of its evaluation approach.

This paper serves as the first comprehensive survey on evaluation of large language models. As depicted in Figure 1, we explore existing work in three dimensions: 1) What to evaluate, 2) Where to evaluate, and 3) How to evaluate. Specifically, “what to evaluate” encapsulates existing evaluation tasks for LLMs, “where to evaluate” involves selecting appropriate datasets and benchmarks for evaluation, while “how to evaluate” is concerned with the evaluation process given appropriate tasks and datasets. These three dimensions are integral to the evaluation of LLMs. We subsequently discuss potential future challenges in the realm of LLMs evaluation.

The contributions of this paper are as follows:

1. We provide a comprehensive overview of LLMs evaluations from three aspects: what to evaluate, where to evaluate, and how to evaluate. Our categorization is general and encompasses the entire life cycle of LLMs evaluation.
2. Regarding what to evaluate, we summarize existing tasks in various areas and obtain insightful conclusions on the success and failure cases of LLMs (Sec. 6), providing experience for future research.
3. As for where to evaluate, we summarize evaluation metrics, datasets, and benchmarks to provide a profound understanding of current LLMs evaluations. In terms of how to evaluate, we explore current protocols and summarize novel evaluation approaches.
4. We further discuss future challenges in evaluating LLMs. We open-source and maintain the related materials of LLMs evaluation at https://github.com/MLGroupJLU/LLM-eval-survey to foster a collaborative community for better evaluations.

The paper is organized as follows. In Sec. 2, we provide the basic information of LLMs and AI model evaluation. Then, Sec. 3 reviews existing work from the aspects of “what to evaluate”. After that, Sec. 4 is the “where to evaluate” part, which summarizes existing datasets and benchmarks. Sec. 5 discusses how to perform the evaluation. In Sec. 6, we summarize the key findings of this paper. We discuss grand future challenges in Sec. 7 and Sec. 8 concludes the paper.

2 Background

2.1 Large Language Models

Language models (LMs) (Gao and Lin, 2004; Kombrink et al., 2011; Devlin et al., 2018) are computational models that have the capability to understand and generate human language. LMs have the transformative ability to predict the likelihood of word sequences or generate new text based on a given input. N-gram models (Brown et al., 1992), the most common type of LM, estimate word probabilities based on the preceding context. However, LMs also face challenges, such as the issue of rare or unseen words, the problem of overfitting, and the difficulty in capturing complex linguistic phenomena. Researchers are continuously working on improving LM architectures and training methods to address these challenges.

Large Language Models (LLMs) (Kasneci et al., 2023; Zhao et al., 2023a; Chen et al., 2021) are advanced language models with massive parameter sizes and exceptional learning capabilities. The core module behind many LLMs such as GPT-3 (Floridi and Chiriatti, 2020), InstructGPT (Ouyang et al., 2022), and GPT-4 (OpenAI, 2023b) is the self-attention module in the Transformer (Vaswani et al., 2017), which serves as the fundamental building block for language modeling tasks. Transformers have revolutionized the field of NLP with their ability to handle sequential data efficiently, allowing for parallelization and capturing long-range dependencies in text. One key feature of LLMs is in-context learning (Brown et al., 2020), where the model is trained to generate text based on a given context or prompt. This enables LLMs to generate more coherent and contextually relevant responses, making them suitable for interactive and conversational applications. Reinforcement Learning from Human Feedback (RLHF) (Ziegler et al., 2019; Christiano et al., 2017) is another crucial aspect of LLMs. This technique involves fine-tuning the model using human-generated responses as rewards, allowing the model to learn from its mistakes and improve its performance over time.

Figure 2: Trend of LLMs evaluation papers over time (2020 - Jun. 2023, including Jul. 2023.).

In an autoregressive language model, such as GPT-3 and PaLM (Chowdhery et al., 2022), given a context sequence $X$, the LM task is to predict the next token $y$. The model is trained by maximizing the probability of the given token sequence conditioned on the context, i.e., $P(y|X)=P(y|x_{1},x_{2},...,x_{t-1})$, where $x_{1},x_{2},...,x_{t-1}$ are the tokens in the context sequence, and $t$ is the current position. By using the chain rule, the conditional probability can be decomposed into a product of probabilities at each position:

$$P(y|X)=\prod_{t=1}^{T}P(y_{t}|x_{1},x_{2},...,x_{t-1}),$$

where $T$ is the sequence length. In this way, the model predicts each token at each position in an autoregressive manner, generating a complete text sequence.
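As a minimal sketch of the chain-rule factorization above, the following Python snippet scores a token sequence by summing per-position log-probabilities. The lookup table `TOY_BIGRAM` and the helper `next_token_probs` are hypothetical stand-ins for an LLM's next-token distribution (a real model would condition on the full context, not only the last token); neither comes from the surveyed work.

```python
import math

# Hypothetical toy next-token table: P(next token | previous token).
TOY_BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}


def next_token_probs(context):
    """Return P(token | context); this toy model only looks at the last token."""
    return TOY_BIGRAM[context[-1]]


def sequence_log_prob(tokens, start="<s>"):
    """Sum log P(y_t | preceding tokens), i.e. the log of the chain-rule product."""
    context = [start]
    total = 0.0
    for tok in tokens:
        total += math.log(next_token_probs(context)[tok])
        context.append(tok)  # the scored token becomes part of the context
    return total


log_p = sequence_log_prob(["the", "cat", "sat"])
prob = math.exp(log_p)  # product of the three per-position probabilities
```

Working in log space avoids numerical underflow when the product runs over long sequences, which is why likelihood-based scoring is usually reported as a sum of log-probabilities.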

One common approach to interacting with LLMs is prompt engineering (Zhou et al., 2022; White et al., 2023; Clavié et al., 2023), where users design and provide specific prompt texts to guide LLMs in generating desired responses or completing specific tasks. This is widely adopted in existing evaluation efforts. People can also engage in question-and-answer interactions (Jansson et al., 2021), where they pose questions to the model and receive answers, or engage in dialogue interactions, having natural language conversations with LLMs. In conclusion, LLMs, with their Transformer architecture, in-context learning, and RLHF capabilities, have revolutionized NLP and hold promise in various applications. TABLE I provides a brief comparison of traditional ML, deep learning, and LLMs.

TABLE I: Comparison of traditional ML, deep learning, and LLMs

Comparison	Traditional ML	DL	LLMs
Training Data Size	Large	Large	Very large
Feature Engineering	Manual	Automatic	Automatic
Model Complexity	Limited	Complex	Very Complex
Interpretability	Good	Poor	Poorer
Performance	Moderate	High	Highest
Hardware Requirements	Low	High	Very High

2.2 AI Model Evaluation

AI model evaluation is an essential step in assessing the performance of a model. There are some standard model evaluation protocols, including $k$-fold cross-validation, holdout validation, leave-one-out cross-validation (LOOCV), bootstrap, and reduced set (Kohavi et al., 1995; Berrar, 2019). For instance, $k$-fold cross-validation divides the dataset into $k$ parts, with each part used in turn as the test set and the rest as the training set, which reduces the loss of training data and yields a relatively more accurate estimate of model performance (Fushiki, 2011); holdout validation divides the dataset into training and test sets, with a smaller computational cost but potentially more significant bias; LOOCV is a special case of $k$-fold cross-validation in which each test set consists of a single data point (Wong, 2015); reduced set trains the model with one dataset and tests it with the remaining data, which is computationally simple, but its applicability is limited. The appropriate evaluation method should be chosen according to the specific problem and data characteristics for more reliable performance indicators.
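The splitting schemes described above can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `k_fold_indices` and `holdout_indices` are hypothetical, the data is not shuffled (in practice one would usually shuffle first), and the 20% holdout fraction is an arbitrary assumption.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    if not 1 < k <= n_samples:
        raise ValueError("k must satisfy 1 < k <= n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


def holdout_indices(n_samples, test_fraction=0.2):
    """Single train/test split: the last `test_fraction` of the samples is held out."""
    n_test = max(1, int(round(n_samples * test_fraction)))
    indices = list(range(n_samples))
    return indices[:-n_test], indices[-n_test:]


folds = list(k_fold_indices(10, 3))      # 3-fold cross-validation on 10 samples
loocv = list(k_fold_indices(5, 5))       # LOOCV is k-fold with k = n_samples
train_idx, test_idx = holdout_indices(10, test_fraction=0.2)
```

Setting `k` equal to the number of samples recovers LOOCV, which makes the trade-off explicit: more folds mean more training data per fold but also more training runs.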

Figure 3 illustrates the evaluation process of AI models, including LLMs. Some evaluation protocols may not be feasible to evaluate deep learning models due to the extensive training size. Thus, evaluation on a static validation set has long been the standard choice for deep learning models. For instance, computer vision models leverage static test sets such as ImageNet (Deng et al., 2009) and MS COCO (Lin et al., 2014) for evaluation. LLMs also use GLUE (Wang et al., 2018) or SuperGLUE (Wang et al., 2019) as the common test sets.

As LLMs are becoming more popular with even poorer interpretability, existing evaluation protocols may not be enough to evaluate the true capabilities of LLMs thoroughly. We will introduce recent evaluations of LLMs in Sec. 5.

Figure 3: The evaluation process of AI models.

3 What to Evaluate

On what tasks should we evaluate LLMs to show their performance? On what tasks can we claim the strengths and weaknesses of LLMs? In this section, we divide existing tasks into the following categories: natural language processing; robustness, ethics, biases, and trustworthiness; social sciences; natural science and engineering; medical applications; agent applications (using LLMs as agents); and other applications.¹

¹ Note that LLMs are evaluated in various tasks and the categorization in this paper is only one possible way for classification of these works. There are certainly other taxonomies.

3.1 Natural Language Processing Tasks

The initial objective behind the development of language models, particularly large language models, was to enhance performance on natural language processing tasks, encompassing both understanding and generation. Consequently, the majority of evaluation research has been primarily focused on natural language tasks. TABLE II summarizes the evaluation aspects of existing research, and we mainly highlight their conclusions in the following.²

² Several NLP areas have intersections and thus our categorization of these areas is only one possible way to categorize.

TABLE II: Summary of evaluation on natural language processing tasks: NLU (Natural Language Understanding, including SA (Sentiment Analysis), TC (Text Classification), NLI (Natural Language Inference) and other NLU tasks), Rng. (Reasoning), NLG (Natural Language Generation, including Summ. (Summarization), Dlg. (Dialogue), Tran. (Translation), QA (Question Answering) and other NLG tasks), and Mul. (Multilingual tasks) (ordered by the name of the first author).

	NLU				Rng.	NLG					Mul.
Reference	SA	TC	NLI	Others	Rng.	Summ.	Dlg.	Tran.	QA	Others	Mul.
(Abdelali et al.,, 2023)											✓
(Ahuja et al.,, 2023)											✓
(Bian et al.,, 2023)					✓				✓
(Bang et al.,, 2023)	✓				✓	✓	✓	✓	✓		✓
(Bai et al.,, 2023)									✓
(Chen et al.,, 2023)										✓
(Choi et al.,, 2023)				✓
(Chia et al.,, 2023)										✓
(Frieder et al.,, 2023)					✓
(Fu et al., 2023b, )					✓
(Gekhman et al.,, 2023)						✓
(Honovich et al.,, 2022)			✓			✓	✓			✓
(Lai et al.,, 2023)											✓
(Laskar et al.,, 2023)	✓		✓		✓	✓		✓	✓	✓	✓
(Lopez-Lira and Tang,, 2023)	✓
(Liang et al.,, 2022)	✓	✓				✓			✓
(Lee et al.,, 2023)			✓
(Lin and Chen,, 2023)							✓
(Liévin et al.,, 2022)					✓
(Liu et al., 2023b, )					✓
(Lyu et al., 2023a, )									✓
(Manakul et al., 2023a, )									✓	✓
(Min et al.,, 2023)										✓
(Orrù et al.,, 2023)					✓
(Peña et al.,, 2023)		✓
(Pu and Demberg,, 2023)						✓				✓
(Pezeshkpour,, 2023)										✓
(Qin et al.,, 2023)	✓		✓		✓	✓	✓		✓
(Riccardi and Desai,, 2023)				✓
(Saparov et al.,, 2023)					✓
(Tao et al.,, 2023)				✓
(Wang et al., 2023d, )								✓
(Wang et al., 2023j, )	✓
(Wang et al., 2023b, )			✓						✓
(Wu et al., 2023c, )					✓
(Xu et al., 2023a, )					✓
(Yang and Menczer,, 2023)		✓
(Zhang et al., 2023d, )	✓
(Zhang et al., 2023c, )											✓
(Zhuang et al.,, 2023)					✓

3.1.1 Natural language understanding

Natural language understanding represents a wide spectrum of tasks that aims to obtain a better understanding of the input sequence. We summarize recent efforts in LLMs evaluation from several aspects.

Sentiment analysis is a task that analyzes and interprets the text to determine the emotional inclination. It is typically a binary (positive and negative) or three-class (positive, neutral, and negative) classification problem. Evaluating sentiment analysis tasks is a popular direction. Liang et al. (2022) and Zeng et al. (2022) showed that the performance of the models on this task is usually high. ChatGPT’s sentiment analysis prediction performance is superior to traditional sentiment analysis methods (Lopez-Lira and Tang, 2023) and comes close to that of GPT-3.5 (Qin et al., 2023). In fine-grained sentiment and emotion cause analysis, ChatGPT also exhibits exceptional performance (Wang et al., 2023j). In low-resource learning environments, LLMs exhibit significant advantages over small language models (Zhang et al., 2023d), but the ability of ChatGPT to understand low-resource languages is limited (Bang et al., 2023). In conclusion, LLMs have demonstrated commendable performance in sentiment analysis tasks. Future work should focus on enhancing their capability to understand emotions in under-resourced languages.

Text classification and sentiment analysis are related fields, but text classification does not focus only on sentiment; it covers the processing of all kinds of texts and tasks. The work of Liang et al. (2022) showed that GLM-130B was the best-performing model, with an overall accuracy of 85.8% for miscellaneous text classification. Yang and Menczer (2023) found that ChatGPT can produce credibility ratings for a wide range of news outlets, and these ratings have a moderate correlation with those from human experts. Furthermore, ChatGPT achieves acceptable accuracy in a binary classification scenario (AUC=0.89). Peña et al. (2023) discussed the problem of topic classification for public affairs documents and showed that using an LLM backbone in combination with SVM classifiers is a useful strategy for multi-label topic classification in the domain of public affairs, with accuracies over 85%. Overall, LLMs perform well on text classification and can even handle text classification tasks in unconventional problem settings.
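For readers unfamiliar with the AUC figure quoted above, the following sketch shows one standard way to compute it for a binary classifier: as the fraction of (positive, negative) pairs in which the positive example receives the higher score, counting ties as one half. The labels and scores are made-up toy values, not data from Yang and Menczer (2023), and the function name `roc_auc` is hypothetical.

```python
def roc_auc(labels, scores):
    """Pairwise-ranking AUC: P(score of a positive > score of a negative), ties = 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example (assumed values): 1 = credible outlet, 0 = not credible.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the measure does not depend on choosing a decision threshold.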

Natural language inference (NLI) is the task of determining whether the given “hypothesis” logically follows from the “premise”. Qin et al. (2023) showed that ChatGPT outperforms GPT-3.5 for NLI tasks. They also found that ChatGPT excels at handling factual input, which could be attributed to its RLHF training process favoring human feedback. However, Lee et al. (2023) observed that LLMs perform poorly in the scope of NLI and further fail in representing human disagreement, which indicates that LLMs still have substantial room for improvement in this field.

Semantic understanding refers to the meaning or understanding of language and its associated concepts. It involves the interpretation and comprehension of words, phrases, sentences and the relationships between them. Semantic processing goes beyond the surface level and focuses on understanding the underlying meaning and intent. Tao et al. (2023) comprehensively evaluated the event semantic processing abilities of LLMs, covering understanding, reasoning, and prediction about event semantics. Results indicated that LLMs possess an understanding of individual events, but their capacity to perceive the semantic similarity among events is constrained. In reasoning tasks, LLMs exhibit robust reasoning abilities in causal and intentional relations, yet their performance in other relation types is comparatively weaker. In prediction tasks, LLMs exhibit enhanced predictive capabilities for future events with increased contextual information. Riccardi and Desai (2023) explored the semantic proficiency of LLMs and showed that these models perform poorly in evaluating basic phrases. Furthermore, GPT-3.5 and Bard cannot distinguish between meaningful and nonsense phrases, consistently classifying highly nonsense phrases as meaningful. GPT-4 shows significant improvements, but its performance is still significantly lower than that of humans. In summary, the performance of LLMs in semantic understanding tasks is poor. Future work can start from this aspect and focus on improving the performance of LLMs on semantic understanding.

In the field of social knowledge understanding, Choi et al. (2023) evaluated how well models perform at learning and recognizing concepts of social knowledge. The results revealed that, despite having far fewer parameters, fine-tuned supervised models such as BERT achieve much better performance than state-of-the-art LLMs used in a zero-shot setting, such as GPT (Radford et al., 2018) and GPT-J-6B (Wang and Komatsuzaki, 2021). This result demonstrates that supervised models significantly outperform zero-shot models, highlighting that an increase in parameters does not necessarily guarantee a higher level of social knowledge in this particular scenario.

3.1.2 Reasoning

The task of reasoning poses significant challenges for an intelligent AI model. To effectively tackle reasoning tasks, the models need to not only comprehend the provided information but also utilize reasoning and inference to deduce answers when explicit responses are absent. TABLE II reveals that there is a growing interest in evaluating the reasoning ability of LLMs, as evidenced by the increasing number of articles focusing on exploring this aspect. Currently, the evaluation of reasoning tasks can be broadly categorized into mathematical reasoning, commonsense reasoning, logical reasoning, and domain-specific reasoning.

ChatGPT exhibits a strong capability for arithmetic reasoning by outperforming GPT-3.5 in the majority of tasks (Qin et al., 2023). However, its proficiency in mathematical reasoning still requires improvement (Bang et al., 2023; Frieder et al., 2023; Zhuang et al., 2023). On symbolic reasoning tasks, ChatGPT is mostly worse than GPT-3.5, which may be because ChatGPT is prone to uncertain responses, leading to poor performance (Bang et al., 2023). Based on the poor performance of LLMs on counterfactual task variants, Wu et al. (2023c) showed that current LLMs have certain limitations in abstract reasoning ability. In logical reasoning, Liu et al. (2023b) indicated that ChatGPT and GPT-4 outperform traditional fine-tuning methods on most benchmarks, demonstrating their superiority in logical reasoning. However, both models face challenges when handling new and out-of-distribution data. ChatGPT does not perform as well as other LLMs, including GPT-3.5 and BARD (Xu et al., 2023a; Qin et al., 2023). This is because ChatGPT is designed explicitly for chatting, so it does an excellent job of maintaining rationality. FLAN-T5, LLaMA, GPT-3.5, and PaLM perform well in general deductive reasoning tasks (Saparov et al., 2023). GPT-3.5 is not good at staying oriented when reasoning in the inductive setting (Xu et al., 2023a). For multi-step reasoning, Fu et al. (2023b) showed that PaLM and Claude2 are the only two model families that achieve performance similar to (but still worse than) the GPT model family. Moreover, LLaMA-65B is the most robust open-source LLM to date, performing close to code-davinci-002. Some papers separately evaluate the performance of ChatGPT on some reasoning tasks: ChatGPT generally performs poorly on commonsense reasoning tasks, but relatively better than on non-text semantic reasoning (Bang et al., 2023). Meanwhile, ChatGPT also lacks spatial reasoning ability, but exhibits better temporal reasoning.
Finally, while the performance of ChatGPT is acceptable on causal and analogical reasoning, it performs poorly on multi-hop reasoning, which is similar to the weakness of other LLMs on complex reasoning (Ott et al., 2023). In professional domain reasoning tasks, zero-shot InstructGPT and Codex are capable of complex medical reasoning tasks, but still need to be further improved (Liévin et al., 2022). In terms of language insight issues, Orrù et al. (2023) demonstrated the potential of ChatGPT for solving verbal insight problems, as ChatGPT’s performance was comparable to that of human participants. It should be noted that most of the above conclusions are obtained for specific datasets. Overall, LLMs show great potential in reasoning and show a continuous improvement trend, but still face many challenges and limitations, requiring more in-depth research and optimization.

3.1.3 Natural language generation

Natural language generation (NLG) evaluates the capabilities of LLMs in generating specific texts, which consists of several tasks, including summarization, dialogue generation, machine translation, question answering, and other open-ended generation applications.

Summarization is a generation task that aims to produce a concise summary of a given text. In this evaluation, Liang et al., (2022) found that TNLG v2 (530B) (Smith et al., 2022) achieved the highest score in both scenarios, followed by OPT (175B) (Zhang et al., 2022) in second place. ChatGPT sometimes generates a summary that is longer than the input document (Bang et al., 2023). The fine-tuned Bart (Lewis et al., 2019) is still better than zero-shot ChatGPT. Specifically, ChatGPT demonstrates zero-shot performance comparable to text-davinci-002 (Bang et al., 2023), but performs worse than GPT-3.5 (Qin et al., 2023). In controllable text summarization, Pu and Demberg, (2023) showed that ChatGPT summaries are slightly more extractive (i.e., containing more content copied directly from the source) compared to human summaries. These findings indicate that LLMs, particularly ChatGPT, achieve only moderate performance in summarization tasks, and that their summarization and generalization abilities still require further improvement.
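The two observations above (summaries that are more extractive, and summaries longer than the input) can be checked with simple surface statistics. The following is a minimal sketch, not the procedure used in the cited studies: it assumes whitespace tokenization and measures extractiveness as the fraction of summary n-grams that also occur in the source. The function names `extractiveness` and `is_longer_than_source` are hypothetical and introduced only for illustration.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def extractiveness(source: str, summary: str, n: int = 2) -> float:
    """Fraction of summary n-grams that also appear in the source (0.0 to 1.0).

    Assumption: lowercased whitespace tokenization; real studies may use
    other tokenizers and other extractiveness measures.
    """
    source_ngrams = ngrams(source.lower().split(), n)
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0
    return len(summary_ngrams & source_ngrams) / len(summary_ngrams)


def is_longer_than_source(source: str, summary: str) -> bool:
    """Flag the failure case where the summary has more tokens than the input."""
    return len(summary.split()) > len(source.split())


source_text = "the cat sat on the mat near the door"
copied_summary = "the cat sat on the mat"
abstractive_summary = "a feline rested on a rug"

copied_score = extractiveness(source_text, copied_summary)
abstractive_score = extractiveness(source_text, abstractive_summary)
```

A summary copied verbatim from the source scores at the top of the range, while a fully reworded one scores at the bottom; comparing the average score of model summaries with that of human summaries gives a rough extractiveness comparison.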

Evaluating the performance of LLMs on dialogue tasks is crucial to the development of dialogue systems and improving the human-computer interaction. Through such evaluation, the natural language processing ability, context understanding ability and generation ability of the model can be improved, so as to realize a more intelligent and more natural dialogue system. Both Claude and ChatGPT generally achieve better performance across all dimensions when compared to GPT-3.5 (Qin et al.,, 2023; Lin and Chen,, 2023). When comparing the Claude and ChatGPT models, both models demonstrate competitive performance across different evaluation dimensions, with Claude slightly outperforming ChatGPT in specific configurations. Bang et al., (2023) conducted tests on ChatGPT for response generation in different dialogue settings: 1) Knowledge-Grounded Open-Domain Dialogue and 2) Task-Oriented Dialogue. The automatic evaluation results revealed that ChatGPT’s performance is comparatively lower than that of GPT-2 fine-tuned on the dataset for knowledge-grounded open-domain dialogue. In task-oriented dialogue, ChatGPT’s performance is acceptable; however, it tends to make errors in the presence of the following challenges: long-term multi-turn dependency, fundamental reasoning failure, and extrinsic hallucination.

While LLMs are not explicitly trained for translation tasks, they can still demonstrate strong performance. Wang et al., 2023d demonstrated that ChatGPT and GPT-4 exhibit superior performance in comparison to commercial machine translation (MT) systems, as evaluated by humans. Additionally, they outperform most document-level NMT methods in terms of sacreBLEU scores. During contrastive testing, ChatGPT shows lower accuracy in comparison to traditional translation models. However, GPT-4 demonstrates a robust capability in explaining discourse knowledge, even though it may occasionally select incorrect translation candidates. The findings of Bang et al., (2023) indicated that ChatGPT performs X $\to$ Eng translation well, but still lacks the ability to perform Eng $\to$ X translation. Lyu et al., 2023a investigated several research directions in MT utilizing LLMs. This study contributes to the advancement of MT research and highlights the potential of LLMs in enhancing translation capabilities. In summary, while LLMs perform satisfactorily in several translation tasks, there is still room for improvement, e.g., enhancing the translation capability from English to non-English languages.

Question answering is a crucial technology in the field of human-computer interaction, and it has found wide application in scenarios like search engines, intelligent customer service, and question answering systems. The measurement of accuracy and efficiency in QA models will have significant implications for these applications. According to Liang et al., (2022), among all the evaluated models, InstructGPT davinci v2 (175B) exhibited the highest performance in terms of accuracy, robustness, and fairness across the 9 question answering scenarios. Both GPT-3.5 and ChatGPT demonstrate significant advancements compared to GPT-3 in their ability to answer general knowledge questions. In most domains, ChatGPT surpasses GPT-3.5 by more than 2% in terms of performance (Bian et al., 2023; Qin et al., 2023). However, ChatGPT performs slightly weaker than GPT-3.5 on the CommonsenseQA and Social IQA benchmarks. This can be attributed to ChatGPT’s cautious nature, as it tends to decline to provide an answer when there is insufficient information available. Fine-tuned models, such as Vicuna and ChatGPT, exhibit exceptional performance with near-perfect scores, surpassing models that lack supervised fine-tuning by a significant margin (Bang et al., 2023; Bai et al., 2023). Laskar et al., (2023) evaluated the effectiveness of ChatGPT on a range of academic datasets, including various tasks such as answering questions, summarizing text, generating code, reasoning with commonsense, solving math problems, translating languages, detecting bias, and addressing ethical issues. Overall, LLMs showcase strong performance on QA tasks and hold the potential for further enhancing their proficiency in social, event, and temporal commonsense knowledge in the future.

There are also other generation tasks to explore. In the field of sentence style transfer, Pu and Demberg, (2023) demonstrated that ChatGPT surpasses the previous SOTA supervised model through training on the same subset for few-shot learning, as evident from the higher BLEU score. However, when it comes to controlling the formality of sentence style, ChatGPT’s performance still differs significantly from human behavior. In writing tasks, Chia et al., (2023) discovered that LLMs exhibit consistent performance across various categories such as informative, professional, argumentative, and creative writing. This finding implies that LLMs possess a general proficiency in writing capabilities. In text generation quality, Chen et al., (2023) revealed that ChatGPT excels in assessing text quality from multiple angles, even in the absence of reference texts, surpassing the performance of most existing automated metrics. Employing ChatGPT to generate numerical scores for text quality emerged as the most reliable and effective approach among the various testing methods studied.

3.1.4 Multilingual tasks

While English is the predominant language, many LLMs are trained on mixed-language training data. The combination of multilingual data indeed helps LLMs gain the ability to process inputs and generate responses in different languages, making them widely adopted and accepted across the globe. However, due to the relatively recent emergence of this technology, LLMs are primarily evaluated on English data, leading to a potential oversight of evaluating their multilingual performance. To address this, several articles have provided comprehensive, open, and independent evaluations of LLMs’ performance on various NLP tasks in different non-English languages. These evaluations offer valuable insights and perspectives for future research and applications.

Abdelali et al., (2023) evaluated the performance of ChatGPT in standard Arabic NLP tasks and observed that ChatGPT exhibits lower performance compared to SOTA models in the zero-shot setting for most tasks. Bang et al., (2023); Zhang et al., 2023c; Lai et al., (2023); Ahuja et al., (2023) utilized a greater number of languages across multiple datasets, encompassing a wider range of tasks, and conducted a more comprehensive evaluation of LLMs, including BLOOM, Vicuna, Claude, ChatGPT, and GPT-4. The results indicated that these LLMs perform poorly on non-Latin languages and languages with limited resources. Even when the input is translated to English and used as the query, generative LLMs still display subpar performance across tasks and languages compared to SOTA models (Ahuja et al., 2023). Furthermore, Bang et al., (2023) highlighted that ChatGPT still faces a limitation in translating sentences written in non-Latin script languages with rich linguistic resources. These findings demonstrate that there are numerous challenges and ample opportunities for enhancement in multilingual tasks for LLMs. Future research should prioritize achieving multilingual balance and addressing the challenges faced by non-Latin languages and low-resource languages, with the aim of better supporting users worldwide. At the same time, attention should be paid to the impartiality and neutrality of the language in order to mitigate any potential biases, including English bias or other biases, that could impact multilingual applications.

3.1.5 Factuality

Factuality in the context of LLMs refers to the extent to which the information or answers provided by the model align with real-world truths and verifiable facts. Factuality in LLMs significantly impacts a variety of tasks and downstream applications, such as question answering systems, information extraction, text summarization, dialogue systems, and automated fact-checking, where incorrect or inconsistent information could lead to substantial misunderstandings and misinterpretations. Evaluating factuality is of great importance in order to trust and efficiently use these models. This includes the ability of these models to maintain consistency with known facts, avoid generating misleading or false information (known as “factual hallucination”), and effectively learn and recall factual knowledge. A range of methodologies have been proposed to measure and improve the factuality of LLMs.

Wang et al., 2023b assessed the internal knowledge capabilities of several large models, namely InstructGPT, ChatGPT-3.5, GPT-4, and BingChat (Microsoft,, 2023), by examining their ability to answer open questions based on the Natural Questions (Kwiatkowski et al.,, 2019) and TriviaQA (Joshi et al.,, 2017) datasets. The evaluation process involved human assessment. The results of the study indicated that while GPT-4 and BingChat can provide correct answers for more than 80% of the questions, there is still a remaining gap of over 15% to achieve complete accuracy. In the work of Honovich et al., (2022), they conducted a review of current factual consistency evaluation methods and highlighted the absence of a unified comparison framework and the limited reference value of related scores compared to binary labels. To address this, they transformed existing fact consistency tasks into binary labels, specifically considering only whether there is a factual conflict with the input text, without factoring in external knowledge. The research discovered that fact evaluation methods founded on natural language inference and question generation-question answering exhibit superior performance and can complement each other. Pezeshkpour, (2023) proposed a novel metric, based on information theory, to assess the inclusion of specific knowledge in LLMs. The metric utilized the concept of uncertainty in knowledge to measure factualness, calculated by LLMs filling in prompts and examining the probability distribution of the answer. The paper discussed two methods for injecting knowledge into LLMs: explicit inclusion of knowledge in the prompts and implicit fine-tuning of the LLMs using knowledge-related data. The study demonstrated that this approach surpasses traditional ranking methods by achieving an accuracy improvement of over 30%. Gekhman et al., (2023) improved the method for evaluating fact consistency in summarization tasks. 
The authors proposed a novel approach that trains student NLI models on summaries generated by multiple models and annotated by LLMs for factual consistency. The trained student model was then used to evaluate the factual consistency of summaries. Manakul et al., 2023a operated on two hypotheses regarding how LLMs generate factual or hallucinated responses. They proposed three formulas (BERTScore (Zhang et al., 2019), MQAG (Manakul et al., 2023b) and n-gram) to evaluate factuality and employed alternative LLMs to gather token probabilities for black-box language models. The study discovered that simply computing sentence likelihood or entropy helped validate the factuality of the responses. Min et al., (2023) broke down text generated by LLMs into individual “atomic” facts, which were then evaluated for their correctness. The FActScore is used to measure the performance of estimators through the calculation of F1 scores. The paper tested various estimators and revealed that current estimators still have some way to go in effectively addressing the task. Lin et al., (2021) introduced the TruthfulQA dataset, designed to cause models to make mistakes. Multiple language models were tested on their ability to provide factual answers. The findings from these experiments suggest that simply scaling up model sizes may not necessarily improve their truthfulness, and recommendations are provided for the training approach. This dataset has become widely used for evaluating the factuality of LLMs (Kadavath et al., 2022; Touvron et al., 2023; OpenAI, 2023b; Wei et al., 2022b).
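Two of the ideas surveyed above lend themselves to a small illustration: using the uncertainty of a model's answer distribution as a factuality signal, and scoring a generation by the share of its atomic facts that are supported. The sketch below is a toy version under stated assumptions, not the implementation of any cited paper. It assumes the answer probabilities have already been obtained from a model, and it stands in a plain set of strings for the knowledge source; the names `answer_entropy` and `supported_fraction` are hypothetical.

```python
import math
from typing import Iterable, List, Set


def answer_entropy(probs: List[float]) -> float:
    """Shannon entropy, in bits, of a distribution over candidate answers.

    Lower entropy means the model concentrates probability on few answers.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    entropy = 0.0
    for p in probs:
        q = p / total  # renormalize in case the list is a truncated top-k
        if q > 0:
            entropy -= q * math.log2(q)
    return entropy


def supported_fraction(atomic_facts: Iterable[str], knowledge: Set[str]) -> float:
    """Share of atomic facts found in a knowledge source.

    Assumption: exact string membership stands in for a real verifier
    (retrieval plus an entailment or QA model).
    """
    facts = list(atomic_facts)
    if not facts:
        return 0.0
    return sum(fact in knowledge for fact in facts) / len(facts)


# A confident distribution has low entropy; a uniform one has the maximum.
confident = answer_entropy([0.97, 0.01, 0.01, 0.01])
uniform = answer_entropy([0.25, 0.25, 0.25, 0.25])

# Hypothetical atomic facts extracted from a generated biography.
knowledge_source = {"born in 1950", "studied physics", "lives in Paris"}
generated_facts = ["born in 1950", "studied physics", "lives in Paris", "won an Oscar"]
score = supported_fraction(generated_facts, knowledge_source)
```

In practice the hard parts are the ones this sketch abstracts away: obtaining calibrated probabilities from black-box models, splitting text into atomic facts, and verifying each fact against a reliable source.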

3.2 Robustness, Ethics, Bias, and Trustworthiness

TABLE III: Summary of LLMs evaluation on robustness, ethics, biases, and trustworthiness (ordered by the name of the first author).

Reference	Robustness	Ethics and Biases	Trustworthiness
(Cao et al.,, 2023)		✓
(Dhamala et al.,, 2021)		✓
(Deshpande et al.,, 2023)		✓
(Ferrara,, 2023)		✓
(Gehman et al.,, 2020)		✓
(Hartmann et al.,, 2023)		✓
(Hendrycks et al., 2020a, )		✓
(Hagendorff and Fabi,, 2023)			✓
(Li et al., 2023c, )	✓
(Parrish et al.,, 2022)		✓
(Rutinowski et al.,, 2023)		✓
(Sheng et al.,, 2021)		✓
(Simmons,, 2022)		✓
(Wang et al.,, 2022)	✓
Wang et al., 2023c	✓
(Wang et al., 2023a, )	✓	✓	✓
(Wang et al., 2023e, )		✓
(Yang et al.,, 2022)	✓
(Zhao et al., 2023b, )	✓
(Zhuo et al., 2023b, )	✓
(Zhu et al.,, 2023)	✓
(Zhuo et al., 2023a, )		✓

The evaluation of LLMs encompasses the crucial aspects of robustness, ethics, biases, and trustworthiness. These factors have gained increasing importance in assessing the performance of LLMs comprehensively.

3.2.1 Robustness

Robustness studies the stability of a system when facing unexpected inputs. Specifically, out-of-distribution (OOD) (Wang et al., 2022) and adversarial robustness are two popular research topics for robustness. Wang et al., 2023c is an early work that evaluated ChatGPT and other LLMs from both the adversarial and OOD perspectives using existing benchmarks such as AdvGLUE (Wang et al., 2021a), ANLI (Nie et al., 2019), and DDXPlus (Fansi Tchango et al., 2022) datasets. Zhuo et al., 2023b evaluated the robustness of semantic parsing. Yang et al., (2022) evaluated OOD robustness by extending the GLUE (Wang et al., 2018) dataset. For vision-language models, Zhao et al., 2023b evaluated LLMs on visual input and transferred them to other visual-linguistic models, revealing the vulnerability of visual input. The results of this study emphasize the potential risks to overall system security when visual input is manipulated. Li et al., 2023c provided an overview of OOD evaluation for language models: adversarial robustness, domain generalization, and dataset biases. The authors compared and unified the three research lines, summarized the data-generating processes and evaluation protocols for each line, and highlighted the challenges and opportunities for future work.

For adversarial robustness, Zhu et al., (2023) evaluated the robustness of LLMs to prompts by proposing a unified benchmark called PromptBench. They comprehensively evaluated adversarial text attacks at multiple levels (character, word, sentence, and semantics). The results showed that contemporary LLMs are vulnerable to adversarial prompts, highlighting the importance of the models’ robustness when facing adversarial inputs. As for new adversarial datasets, Wang et al., 2023a introduced the use of the AdvGLUE++ benchmark data for assessing adversarial robustness and implemented a new evaluation protocol to scrutinize machine ethics via jailbreaking system prompts.
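To make the idea of a character-level prompt attack concrete, the sketch below perturbs a prompt by swapping adjacent letters and compares accuracy before and after. It is an illustration only and does not reproduce PromptBench or its attacks: the swap perturbation, the `accuracy_under_attack` helper, and the keyword-based `toy_model` are all hypothetical stand-ins, and a real evaluation would query an actual LLM.

```python
import random
from typing import Callable, List, Tuple


def swap_adjacent_chars(text: str, rate: float, rng: random.Random) -> str:
    """Character-level perturbation: swap neighbouring letters with probability `rate`."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip the swapped pair so a letter moves at most once
        else:
            i += 1
    return "".join(chars)


def accuracy(model: Callable[[str], str],
             data: List[Tuple[str, str]],
             prompt: str) -> float:
    """Fraction of examples answered correctly with the given prompt."""
    correct = sum(model(prompt + " " + text) == label for text, label in data)
    return correct / len(data)


def accuracy_under_attack(model: Callable[[str], str],
                          data: List[Tuple[str, str]],
                          prompt: str,
                          rate: float = 0.3,
                          seed: int = 0) -> Tuple[float, float]:
    """Return (clean accuracy, accuracy with a perturbed prompt)."""
    rng = random.Random(seed)
    attacked_prompt = swap_adjacent_chars(prompt, rate, rng)
    return accuracy(model, data, prompt), accuracy(model, data, attacked_prompt)


def toy_model(text: str) -> str:
    """Hypothetical stand-in for an LLM: it only works if it sees the task keyword."""
    if "sentiment" not in text:
        return "unknown"
    return "positive" if "good" in text else "negative"


dataset = [("good movie", "positive"), ("bad movie", "negative")]
clean_acc, attacked_acc = accuracy_under_attack(
    toy_model, dataset, "Classify the sentiment:", rate=1.0
)
```

The gap between the two accuracies is a simple robustness signal. Word-, sentence-, and semantic-level attacks follow the same pattern with a different perturbation function.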

3.2.2 Ethics and bias

LLMs have been found to internalize, spread, and potentially magnify harmful information existing in the crawled training corpora, typically toxic language such as offensiveness, hate speech, and insults (Gehman et al., 2020), as well as social biases like stereotypes towards people with a particular demographic identity (e.g., gender, race, religion, occupation and ideology) (Sheng et al., 2021). More recently, Zhuo et al., 2023a used conventional testing sets and metrics (Gehman et al., 2020; Dhamala et al., 2021; Parrish et al., 2022) to perform a systematic evaluation of ChatGPT’s toxicity and social bias, finding that it still exhibits noxious content to some extent. Taking a further step, Deshpande et al., (2023) introduced role-playing into the model and observed an increase in generated toxicity of up to 6x. Furthermore, such role-playing also caused biased toxicity towards specific entities. Different from simply measuring social biases, Ferrara, (2023) investigated the sources, underlying mechanisms and corresponding ethical consequences of these biases potentially produced by ChatGPT. Beyond social biases, LLMs have also been assessed for political tendency and personality traits (Rutinowski et al., 2023; Hartmann et al., 2023) based on questionnaires such as the Political Compass Test and the MBTI test, demonstrating a propensity for progressive views and an ENFJ personality type. In addition, LLMs like GPT-3 were found to have moral biases (Simmons, 2022) in terms of the Moral Foundation theory (Graham et al., 2013). The study conducted by Hendrycks et al., 2020a reveals that existing LMs have potential in ethical judgment, but still need improvement. Moreover, in the assessment of GPT-4 alignment, Wang et al., 2023e discovered a systematic bias. ChatGPT is also observed to exhibit some bias on cultural values (Cao et al., 2023).
Wang et al., 2023a also incorporated an evaluation dataset specifically aimed at gauging stereotype bias, using both targeted and untargeted system prompts. All these ethical issues might elicit serious risks, impeding the deployment of LLMs and having a profound negative impact on society.

3.2.3 Trustworthiness

Some work focuses on other trustworthiness problems in addition to robustness and ethics. (Footnote 3: The term ‘trustworthiness’ in this section refers to other work that contains more than robustness and ethics.) In their 2023 study, DecodingTrust, Wang et al., 2023a offered a multifaceted exploration of trustworthiness vulnerabilities in the GPT models, especially GPT-3.5 and GPT-4. Their evaluation expanded beyond the typical trustworthiness concerns to include eight critical aspects: toxicity, stereotype bias, adversarial and out-of-distribution robustness, robustness to adversarial demonstrations, privacy, machine ethics, and fairness. DecodingTrust’s investigation employs an array of newly constructed scenarios, tasks, and metrics. They revealed that while GPT-4 often showcases improved trustworthiness over GPT-3.5 in standard evaluations, it is simultaneously more susceptible to attacks.

In another study by Hagendorff and Fabi, (2023), LLMs with enhanced cognitive abilities were evaluated. They found that these models can avoid common human intuitions and cognitive errors, demonstrating super-rational performance. By utilizing cognitive reflection tests and semantic illusion experiments, the researchers gained insights into the psychological aspects of LLMs. This method offers new perspectives for evaluating model biases and ethical issues that may not have been previously identified.

3.3 Social Science

Social science involves the study of human society and individual behavior, including economics, sociology, political science, law, and other disciplines. Evaluating the performance of LLMs in social science is important for academic research, policy formulation, and social problem-solving. Such evaluations can help improve the applicability and quality of models in the social sciences, increasing understanding of human societies and promoting social progress.

Wu et al., 2023a evaluated the potential use of LLMs in addressing scaling and measurement issues in social science and found that LLMs can generate meaningful responses regarding political ideology and significantly improve text-as-data methods in social science.

In computational social science (CSS) tasks, Ziems et al., (2023) presented a comprehensive evaluation of LLMs on several CSS tasks. During classification tasks, LLMs exhibit the lowest absolute performance on event argument extraction, character tropes, implicit hate, and empathy classification, achieving accuracy below 40%. These tasks either involve complex structures (event arguments) or subjective expert taxonomies with semantics that differ from those learned during LLM pretraining. Conversely, LLMs achieve the best performance on misinformation, stance, and emotion classification. When it comes to generation tasks, LLMs often produce explanations that surpass the quality of gold references provided by crowdworkers. In summary, while LLMs can greatly enhance the traditional CSS research pipeline, they cannot completely replace it.

Some articles also evaluate LLMs on legal tasks. The zero-shot performance of LLMs is mediocre in legal case judgment summarization. LLMs have several problems, including incomplete sentences and words, meaningless merging of sentences, and more serious errors such as inconsistent and hallucinated information (Deroy et al., 2023). The results showed that further improvement is necessary for LLMs to be useful for case judgment summarization by legal experts. Nay et al., (2023) indicated that LLMs, particularly when combined with prompting enhancements and the correct legal texts, could perform better but not yet at expert tax lawyer levels.

Lastly, within the realm of psychology, (Frank,, 2023) adopted an interdisciplinary approach and drew insights from developmental psychology and comparative psychology to explore alternative methods for evaluating the capabilities of LLMs. By integrating different perspectives, researchers can deepen their understanding of the essence of cognition and effectively leverage the potential of advanced technologies such as large language models, while mitigating potential risks.

In conclusion, the utilization of LLMs has significantly benefited individuals in addressing social science-related tasks, leading to improved work efficiency. The outputs produced by LLMs serve as valuable resources for enhancing productivity. However, it is crucial to acknowledge that existing LLMs cannot completely replace human professionals in this domain.

3.4 Natural Science and Engineering

Evaluating the performance of LLMs in natural science and engineering fields can help guide applications and development in scientific research, technology development, and engineering studies.

TABLE IV: Summary of evaluations on natural science and engineering tasks based on three aspects: Mathematics, Science and Engineering (ordered by the name of the first author).

Reference	Mathematics	Science	Engineering
(Arora et al.,, 2023)	✓	✓
(Bubeck et al.,, 2023)	✓		✓
(Castro Nascimento and Pimentel,, 2023)		✓
(Collins et al.,, 2023)	✓
(Dao and Le,, 2023)	✓
(Guo et al.,, 2023)		✓
(Liu et al., 2023c, )			✓
(Pallagani et al.,, 2023)			✓
(Sridhara et al.,, 2023)			✓
(Valmeekam et al.,, 2022)			✓
(Valmeekam et al.,, 2023)			✓
(Wei et al.,, 2023)	✓
(Wu et al., 2023b, )	✓
(Yuan et al., 2023b, )	✓
(Zhuang et al.,, 2023)			✓

3.4.1 Mathematics

For fundamental mathematical problems, most large language models (LLMs) demonstrate proficiency in addition and subtraction, and possess some capability in multiplication. However, they face challenges when it comes to division, exponentiation, trigonometric functions, and logarithmic functions. On the other hand, LLMs exhibit competence in handling decimal numbers, negative numbers, and irrational numbers (Yuan et al., 2023b). In terms of performance, ChatGPT and GPT-4 outperform other models significantly, showcasing their superiority in solving mathematical tasks (Wei et al., 2023). These two models have a distinct advantage in dealing with large numbers (greater than 1e12) and complex, lengthy mathematical queries. GPT-4 outperforms ChatGPT, with an accuracy gain of 10 percentage points and a 50% reduction in relative error, due to its superior division and trigonometry abilities, proper understanding of irrational numbers, and consistent step-by-step calculation of long expressions.
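The two quantities mentioned above, accuracy and relative error, can be computed from pairs of predicted and reference values. The sketch below uses the common definition of relative error, |prediction - truth| / |truth|, and counts a prediction as correct when its relative error is within a tolerance. The tolerance value and the function names are assumptions made for illustration; the cited studies may define both metrics differently.

```python
from typing import List, Tuple


def relative_error(prediction: float, truth: float, eps: float = 1e-12) -> float:
    """|prediction - truth| / |truth|, guarded against a zero reference value."""
    return abs(prediction - truth) / max(abs(truth), eps)


def arithmetic_scores(pairs: List[Tuple[float, float]],
                      tolerance: float = 1e-3) -> Tuple[float, float]:
    """Return (accuracy, mean relative error) over (prediction, truth) pairs.

    Assumption: an answer counts as correct if its relative error is at most
    `tolerance`; this threshold is illustrative, not taken from the literature.
    """
    if not pairs:
        raise ValueError("at least one (prediction, truth) pair is required")
    errors = [relative_error(pred, truth) for pred, truth in pairs]
    accuracy = sum(err <= tolerance for err in errors) / len(errors)
    mean_error = sum(errors) / len(errors)
    return accuracy, mean_error


# Hypothetical model outputs paired with reference answers.
results = [(100.0, 100.0), (110.0, 100.0)]
acc, mean_rel_err = arithmetic_scores(results)
```

Reporting both numbers is useful because they capture different failures: accuracy counts exact (or near-exact) hits, while relative error shows how far off the misses are, which matters for large numbers.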

When confronted with complex and challenging mathematical problems, LLMs exhibit subpar performance. Specifically, GPT-3 demonstrates nearly random performance, while GPT-3.5 shows improvement, and GPT-4 performs the best (Arora et al.,, 2023). Despite the advancements made in the new models, it is important to note that the peak performance remains relatively low compared to that of experts and these models lack the capability to engage in mathematical research (Bubeck et al.,, 2023). The specific tasks of algebraic manipulation and calculation continue to pose challenges for GPTs (Collins et al.,, 2023; Bubeck et al.,, 2023). The primary reasons behind GPT-4’s low performance in these tasks are errors in algebraic manipulation and difficulties in retrieving pertinent domain-specific concepts. Wu et al., 2023b evaluated the use of GPT-4 on difficult high school competition problems and GPT-4 reached 60% accuracy on half of the categories. Intermediate algebra and precalculus can only be solved with a low accuracy rate of around 20%. ChatGPT is not good at answering questions on topics including derivatives and applications, Oxyz spatial calculus and spatial geometry (Dao and Le,, 2023). Dao and Le, (2023); Wei et al., (2023) showed that ChatGPT’s performance worsens as task difficulty increases: it correctly answered 83% of the questions at the recognition level, 62% at the comprehension level, 27% at the application level, and only 10% at the highest cognitive complexity level. Given those problems at higher knowledge levels tend to be more complex, requiring in-depth understanding and problem-solving skills, such results are to be expected.

These results indicate that the effectiveness of LLMs is highly influenced by the complexity of problems they encounter. This finding holds significant implications for the design and development of optimized artificial intelligence systems capable of successfully handling these challenging tasks.

3.4.2 General science

Further improvements are needed in the application of LLMs in the field of chemistry. Castro Nascimento and Pimentel, (2023) presented five straightforward tasks from various subareas of chemistry to assess ChatGPT’s comprehension of the subject, with accuracy ranging from 25% to 100%. Guo et al., (2023) created a comprehensive benchmark that encompasses 8 practical chemistry tasks, which is designed to assess the performance of LLMs (including GPT-4, GPT-3.5, and Davinci-003) for each chemistry task. Based on the experiment results, GPT-4 demonstrates superior performance compared to the other two models. (Arora et al.,, 2023) showed that LLMs perform worse on physics problems than chemistry problems, probably because chemistry problems have lower inference complexity than physics problems in this setting. There are limited evaluation studies on LLMs in the field of general science, and the current findings indicate that further improvement is needed in the performance of LLMs within this domain.

3.4.3 Engineering

Within engineering, the tasks can be organized in ascending order of difficulty, including code generation, software engineering, and commonsense planning.

In code generation tasks, the smaller LLMs trained for the tasks are competitive in performance, and CodeGen-16B (Nijkamp et al., 2022) is comparable in performance to ChatGPT using a larger parameter setting, reaching about a 78% match (Liu et al., 2023c). Despite facing challenges in mastering and comprehending certain fundamental concepts in programming languages, ChatGPT showcases a commendable level of coding ability (Zhuang et al., 2023). Specifically, ChatGPT has developed superior skills in dynamic programming, greedy algorithms, and search, surpassing highly capable college students, but it struggles with data structures, trees, and graph theory. GPT-4 demonstrates an advanced ability to generate code based on given instructions, comprehend existing code, reason about code execution, simulate the impact of instructions, articulate outcomes in natural language, and execute pseudocode effectively (Bubeck et al., 2023).

In software engineering tasks, ChatGPT generally performs well and provides detailed responses, often surpassing both human expert output and SOTA output. However, for certain tasks such as code vulnerability detection and information retrieval-based test prioritization, the current version of ChatGPT fails to provide accurate answers, rendering it unsuitable for these specific tasks (Sridhara et al.,, 2023).

In commonsense planning tasks, LLMs may not perform well, even in simple planning tasks where humans excel (Valmeekam et al., 2022, 2023). Pallagani et al., (2023) demonstrated that the fine-tuned CodeT5 (Wang et al., 2021b) performs the best across all considered domains, with the shortest inference time. The same study also explored the capability of LLMs for plan generalization and found that their generalization capabilities appear to be limited. Overall, LLMs can handle simple engineering tasks, but they perform poorly on complex engineering tasks.

3.5 Medical Applications

The application of LLMs in the medical field has recently received significant attention. As a result, this section aims to provide a comprehensive review of the ongoing efforts dedicated to implementing LLMs in medical applications. We have categorized these applications into three aspects as shown in TABLE V: medical query, medical examination, and medical assistants. A detailed examination of these categories will enhance our understanding of the potential impact and advantages that LLMs can bring to the medical domain.

3.5.1 Medical queries

The significance of evaluating LLMs on medical queries lies in providing accurate and reliable medical answers to meet the needs of healthcare professionals and patients for high-quality medical information. As shown in TABLE V, the majority of LLMs evaluations in the medical field concentrate on medical queries. ChatGPT generated relatively accurate information for various medical queries, including genetics (Duong and Solomon,, 2023), radiation oncology physics (Holmes et al.,, 2023), biomedicine (Jahan et al.,, 2023), and many other medical disciplines (Samaan et al.,, 2023; Johnson et al.,, 2023; Hamidi and Roberts,, 2023), demonstrating its effectiveness in the field of medical queries to a certain extent. As for the limitations, Thirunavukarasu et al., (2023) assessed ChatGPT’s performance in primary care and found that its average score in the student comprehensive assessment falls below the passing score, indicating room for improvement. Chervenak et al., (2023) highlighted that while ChatGPT can generate responses similar to existing sources in fertility-related clinical prompts, its limitations in reliably citing sources and potential for fabricating information restrict its clinical utility.

TABLE V: Summary of evaluations on medical applications based on the three aspects: Med. queries, Med. ass. (Medical assistants), and Med. exam. (Medical examination) (ordered by the name of the first author).

Reference	Med. queries	Med. exam.	Med. ass.
(Cascella et al., 2023)			✓
(Chervenak et al., 2023)	✓
(Duong and Solomon, 2023)	✓
(Gilson et al., 2023)		✓
(Hamidi and Roberts, 2023)	✓
(Holmes et al., 2023)	✓
(Jahan et al., 2023)	✓
(Johnson et al., 2023)	✓
(Khan et al., 2023)			✓
(Kung et al., 2023)		✓
(Lahat et al., 2023)			✓
(Lyu et al., 2023b)			✓
(Oh et al., 2023)			✓
(Samaan et al., 2023)	✓
(Thirunavukarasu et al., 2023)	✓
(Wang et al., 2023i)			✓

3.5.2 Medical examination

The studies by Gilson et al. (2023) and Kung et al. (2023) evaluated the performance of LLMs in medical examination assessment through the United States Medical Licensing Examination (USMLE, https://www.usmle.org/). In the study of Gilson et al. (2023), ChatGPT’s performance in answering USMLE Step 1 and Step 2 exam questions was assessed using novel multiple-choice question sets. The results indicated that ChatGPT achieves varying accuracies across different datasets. However, the presence of out-of-context information was found to be lower compared to the correct answer in the NBME-Free-Step1 and NBME-Free-Step2 datasets. Kung et al. (2023) showed that ChatGPT achieves or approaches the passing threshold in these exams with no tailored training. The model demonstrates high consistency and insight, indicating its potential to assist in medical education and clinical decision-making. ChatGPT can be used as a tool to answer medical questions, provide explanations, and support decision-making processes. This offers additional resources and support for medical students and clinicians in their educational and clinical practices. Moreover, Sharma et al. (2023) found that answers generated by ChatGPT are more context-aware with better deductive reasoning abilities compared to Google search results.

3.5.3 Medical assistants

In the field of medical assistance, LLMs demonstrate potential applications, including research on identifying gastrointestinal diseases (Lahat et al., 2023), dementia diagnosis (Wang et al., 2023i), accelerating the evaluation of COVID-19 literature (Khan et al., 2023), and their overall potential in healthcare (Cascella et al., 2023). However, there are also limitations and challenges, such as lack of originality, high input requirements, resource constraints, uncertainty in answers, and potential risks related to misdiagnosis and patient privacy issues.

Moreover, several studies have evaluated the performance and feasibility of ChatGPT in the medical education field. In the study by Oh et al. (2023), ChatGPT, specifically the GPT-3.5 and GPT-4 models, was evaluated in terms of its understanding of surgical clinical information and its potential impact on surgical education and training. The results indicate an overall accuracy of 46.8% for GPT-3.5 and 76.4% for GPT-4, demonstrating a significant performance difference between the two models. Notably, GPT-4 consistently performs well across different subspecialties, suggesting its capability to comprehend complex clinical information and enhance surgical education and training. Another study by Lyu et al. (2023b) explores the feasibility of utilizing ChatGPT in clinical education, particularly in translating radiology reports into easily understandable language. The findings demonstrate that ChatGPT effectively translates radiology reports into accessible language and provides general recommendations. Furthermore, the quality of the translated reports improves with GPT-4 compared to ChatGPT. These findings suggest that employing LLMs in clinical education is feasible, although further efforts are needed to address limitations and unlock their full potential.

3.6 Agent Applications

Instead of focusing solely on general language tasks, LLMs can be utilized as powerful tools in various domains. Equipping LLMs with external tools can greatly expand the capabilities of the model. Huang et al. (2023a) introduced KOSMOS-1, which is capable of understanding general patterns, following instructions, and learning based on context. The MRKL study (Karpas et al., 2022) emphasized the importance of understanding when and how to utilize external symbolic tools, as this knowledge is dependent on the capabilities of LLMs, particularly when these tools can reliably perform functions. Additionally, two other studies, Toolformer (Schick et al., 2023) and TALM (Parisi et al., 2022), explored the utilization of tools to enhance language models. Toolformer employs a training approach to determine the optimal usage of specific APIs and integrates the obtained results into subsequent token predictions. On the other hand, TALM combines non-differentiable tools with text-based methods to augment language models and employs an iterative technique known as “self-play”, guided by minimal tool demonstrations. Furthermore, Shen et al. (2023) proposed the HuggingGPT framework, which leverages LLMs to connect various AI models within the machine learning community (such as Hugging Face), aiming to address AI tasks.
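To make the idea of integrating tool results into generated text concrete, the following is a minimal sketch of the inference-time step only. It is not the training procedure of Toolformer or TALM. The inline call syntax `[Calculator(...)]`, the `TOOLS` registry, and the function names are assumptions made for illustration.

```python
import ast
import operator
import re

# Arithmetic operators that the hypothetical calculator tool accepts.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate a basic arithmetic expression without calling eval()."""

    def _eval(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")

    value = _eval(ast.parse(expression, mode="eval").body)
    return f"{value:.2f}"


# Hypothetical tool registry: tool name -> callable taking the raw argument string.
TOOLS = {"Calculator": calculator}

# Assumed marker syntax for an inline tool call: [ToolName(argument)]
CALL_PATTERN = re.compile(r"\[(\w+)\(([^\]]*)\)\]")


def execute_tool_calls(text: str) -> str:
    """Replace each known tool call with the call followed by its result."""

    def _run(match: re.Match) -> str:
        name, arg = match.group(1), match.group(2)
        tool = TOOLS.get(name)
        if tool is None:
            # Unknown tools are left untouched rather than guessed at.
            return match.group(0)
        return f"[{name}({arg}) -> {tool(arg)}]"

    return CALL_PATTERN.sub(_run, text)


draft = "Out of 1400 participants, 400 [Calculator(400 / 1400)] passed."
augmented = execute_tool_calls(draft)
print(augmented)
```

In a real system the augmented text would be fed back to the model so that subsequent tokens can condition on the tool result; here the string is only printed.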

3.7 Other Applications

In addition to the categories mentioned above, there have been evaluations of LLMs in various other domains, including education, search and recommendation, personality testing, and specific applications.

3.7.1 Education

LLMs have shown promise in revolutionizing the field of education. They have the potential to make significant contributions in several areas, such as assisting students in improving their writing skills, facilitating better comprehension of complex concepts, expediting the delivery of information, and providing personalized feedback to enhance student engagement. These applications aim to create more efficient and interactive learning experiences, offering students a broader range of educational opportunities. However, to fully harness the potential of LLMs in education, extensive research and ongoing refinement are necessary.

The evaluation of LLMs for educational assistance aims to investigate and assess their potential contributions to the field of education. Such evaluations can be conducted from various perspectives. According to Dai et al. (2023b), ChatGPT demonstrates the ability to generate detailed, fluent, and coherent feedback that surpasses that of human teachers. It can accurately assess student assignments and provide feedback on task completion, thereby assisting in the development of student skills. However, ChatGPT’s responses may lack novelty or insightful perspectives regarding teaching improvement (Wang and Demszky, 2023). Additionally, the study conducted by Hellas et al. (2023) revealed that LLMs can successfully identify at least one actual problem in student code, although instances of misjudgment are also observed. In conclusion, the utilization of LLMs shows promise in addressing program logic issues, although challenges remain in achieving proficiency in output formatting. It is important to note that while these models can provide valuable insights, they may still generate errors similar to those made by students.

In educational testing, researchers aim to evaluate the application effectiveness of LLMs, including automatic scoring, question generation, and learning guidance. de Winter (2023) showed that ChatGPT achieves an average of 71.8% correctness, which is comparable to the average score of all participating students. Subsequently, the evaluation was conducted using GPT-4, which achieved a score of 8.33. Furthermore, this evaluation showed the effectiveness of bootstrapping, which introduces randomness via the “temperature” parameter, in diagnosing incorrect answers. Zhang et al. (2023b) claimed that GPT-3.5 can solve MIT math and EECS exams, with GPT-4 achieving better performance. However, the evaluation turned out to be unfair because the correct answers were accidentally included in the prompts.

TABLE VI: Summary of evaluations on other applications based on the four aspects: Edu. (Education), Sea. & Rec. (Search and Recommendation), Pers. Test. (Personality Testing) and Specific applications (ordered by the name of the first author).

Reference	Edu.	Sea. & Rec.	Pers. Test.	Specific applications
(Bodroza et al., 2023)			✓
(Dai et al., 2023b)	✓
(de Winter, 2023)	✓
(Dai et al., 2023a)		✓
(Fan et al., 2023)		✓
(Hellas et al., 2023)	✓
(Jentzsch and Kersting, 2023)			✓
(Lanzi and Loiacono, 2023)				✓
(Le and Zhang, 2023)				✓
(Sun et al., 2023)		✓
(Song et al., 2023)			✓
(Safdari et al., 2023)			✓
(Thakur et al., 2021)		✓
(Wang and Demszky, 2023)	✓
(Wang et al., 2023f)			✓
(Wang et al., 2023h)				✓
(Xu et al., 2023c)		✓
(Zhang et al., 2023a)		✓

3.7.2 Search and recommendation

The assessment of LLMs in search and recommendation can be broadly categorized into two areas. Firstly, in the realm of information retrieval, Sun et al. (2023) investigated the effectiveness of generative ranking algorithms, such as ChatGPT and GPT-4, for information retrieval tasks. Experimental results demonstrate that guided ChatGPT and GPT-4 exhibit competitive performance on popular benchmark tests, even outperforming supervised methods. Additionally, distilling ChatGPT’s ranking capability into a specialized model yields superior performance when the model is trained on 10K ChatGPT-generated data compared to training on 400K annotated MS MARCO data, as measured on the BEIR dataset (Thakur et al., 2021). Furthermore, Xu et al. (2023c) conducted a randomized online experiment to investigate the behavioral differences of users when performing information retrieval tasks using search engine and chatbot tools. Participants were divided into two groups: one using tools similar to ChatGPT and the other using tools similar to Google Search. The results show that the ChatGPT group spent less time on all tasks, while the difference in overall task performance between the two groups is not significant.

Secondly, moving to the domain of recommendation systems, LLMs have emerged as essential components that leverage their natural language processing capabilities to comprehend user preferences, item descriptions, and contextual information (Fan et al., 2023). By incorporating LLMs into recommendation pipelines, these systems can offer more accurate and personalized recommendations, thereby improving user experience and overall recommendation quality. However, it is crucial to address the potential risks associated with using LLMs for recommendations. Recent research by Zhang et al. (2023a) has highlighted the issue of unfair recommendations generated by ChatGPT. This emphasizes the importance of evaluating fairness when employing LLMs in recommendation scenarios. Dai et al. (2023a) suggest that ChatGPT exhibits strong performance in recommender systems. The use of listwise ranking is found to strike the best balance between cost and performance. Furthermore, ChatGPT shows promise in addressing the cold-start problem and providing interpretable recommendations. Moreover, the research by Yuan et al. (2023a) and Li et al. (2023b) demonstrated the promising potential of the modality-based recommendation model (MoRec) and text-based collaborative filtering (TCF) in recommendation systems.

3.7.3 Personality testing

Personality testing aims to measure individuals’ personality traits and behavioral tendencies, and LLMs as powerful natural language processing models have been widely applied in such tasks.

Research conducted by Bodroza et al. (2023) investigated the personality features of Davinci-003 when used as a chatbot and found variations in the consistency of its answers, despite its exhibiting prosocial characteristics. However, there remains uncertainty regarding whether the chatbot’s responses are driven by conscious self-reflection or algorithmic processes. Song et al. (2023) examined the manifestation of personality in language models and discovered that many models perform unreliably in self-assessment tests and exhibit inherent biases. Therefore, it is necessary to develop specific machine personality measurement tools to enhance reliability. These studies offer vital insights to better understand LLMs in personality testing. Safdari et al. (2023) proposed a comprehensive approach to conduct effective psychometric testing for the personality traits in the text generated by LLMs. In order to evaluate the emotional intelligence of LLMs, Wang et al. (2023f) developed a new psychometric assessment method. By referencing a framework constructed from over 500 adults, the authors tested various mainstream LLMs. The results showed that most LLMs achieve above-average scores in emotional quotient (EQ), with GPT-4 scoring 117, surpassing 89% of human participants. However, a multivariate pattern analysis indicated that certain LLMs achieve human-level performance without relying on mechanisms resembling those found in humans. This is evident from the distinct differences in the quality of their representational patterns, as compared to humans. Jentzsch and Kersting (2023) discussed the challenges of incorporating humor into LLMs, particularly ChatGPT. They found that while ChatGPT demonstrates impressive capabilities in NLP tasks, it falls short in generating humorous responses. This study emphasizes the importance of humor in human communication and the difficulties that LLMs face in capturing the subtleties and context-dependent nature of humor. It discusses the limitations of current approaches and highlights the need for further research to develop more sophisticated models that can effectively understand and generate humor.

3.7.4 Specific applications

Moreover, various research endeavors have been conducted to explore the application and evaluation of LLMs across a wide spectrum of tasks, such as game design (Lanzi and Loiacono, 2023), model performance assessment (Wang et al., 2023h), and log parsing (Le and Zhang, 2023). Collectively, these findings enhance our comprehension of the practical implications associated with the utilization of LLMs across diverse tasks. They shed light on the potential and limitations of these models while providing valuable insights for performance improvement.

4 Where to Evaluate: Datasets and Benchmarks

TABLE VII: Summary of existing LLMs evaluation benchmarks (ordered by the name of the first author).

Benchmark	Focus	Domain	Evaluation Criteria
SOCKET (Choi et al., 2023)	Social knowledge	Specific downstream task	Social language understanding
MME (Fu et al., 2023a)	Multimodal LLMs	General language task	Ability of perception and cognition
Xiezhi (Gu et al., 2023)	Comprehensive domain knowledge	General language task	Overall performance across multiple benchmarks
CUAD (Hendrycks et al., 2021b)	Legal contract review	Specific downstream task	Legal contract understanding
TRUSTGPT (Huang et al., 2023c)	Ethics	Specific downstream task	Toxicity, bias, and value-alignment
MMLU (Hendrycks et al., 2020b)	Text models	General language task	Multitask accuracy
MATH (Hendrycks et al., 2021c)	Mathematical problem	Specific downstream task	Mathematical ability
APPS (Hendrycks et al., 2021a)	Coding challenge competence	Specific downstream task	Code generation ability
C-Eval (Huang et al., 2023b)	Chinese evaluation	General language task	52 Exams in a Chinese context
OpenLLM (HuggingFace, 2023)	Chatbots	General language task	Leaderboard rankings
DynaBench (Kiela et al., 2021)	Dynamic evaluation	General language task	NLI, QA, sentiment, and hate speech
Chatbot Arena (LMSYS, 2023)	Chat assistants	General language task	Crowdsourcing and Elo rating system
AlpacaEval (Li et al., 2023d)	Automated evaluation	General language task	Metrics, robustness, and diversity
HELM (Liang et al., 2022)	Transparency of language models	General language task	Multi-metric
API-Bank (Li et al., 2023a)	Tool utilization	Specific downstream task	API call, retrieval, and planning
M3KE (Liu et al., 2023a)	Multi-task	General language task	Multi-task accuracy
ARB (Sawada et al., 2023)	Advanced reasoning ability	Specific downstream task	Multidomain advanced reasoning ability
Big-Bench (Srivastava et al., 2022)	Capabilities and limitations of LMs	General language task	Model performance and calibration
MultiMedQA (Singhal et al., 2022)	Medical QA	Specific downstream task	Model performance, medical knowledge, and reasoning ability
CVALUES (Xu et al., 2023b)	Safety and responsibility	Specific downstream task	Alignment ability of LLMs
ToolBench (ToolBench, 2023)	Software tools	Specific downstream task	Execution success rate
PandaLM (Wang et al., 2023h)	Instruction tuning	General language task	Winrate judged by PandaLM
GLUE-X (Yang et al., 2022)	OOD robustness for NLU tasks	General language task	OOD robustness
KoLA (Yu et al., 2023)	Knowledge-oriented evaluation	General language task	Self-contrast metrics
AGIEval (Zhong et al., 2023)	Human-centered foundational models	General language task	General
PromptBench (Zhu et al., 2023)	Adversarial prompt resilience	General language task	Adversarial robustness
MT-Bench (Zheng et al., 2023)	Multi-turn conversation	General language task	Winrate judged by GPT-4
M3Exam (Zhang et al., 2023c)	Human exams	Specific downstream task	Task-specific metrics
GAOKAO-Bench (Zhang et al., 2023e)	Chinese Gaokao examination	Specific downstream task	Accuracy and scoring rate

LLMs evaluation datasets are used to test and compare the performance of different language models on various tasks, as described in Sec. 3. These datasets, such as GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019), aim to simulate real-world language processing scenarios and cover diverse tasks such as text classification, machine translation, reading comprehension, and dialogue generation. This section will not discuss any single dataset for language models but benchmarks for LLMs.

As benchmarks for LLMs are evolving, a variety of benchmarks have emerged to evaluate their performance. In this study, we compile a selection of 29 popular benchmarks, as shown in TABLE VII. (Note that, as the evaluation of LLMs is a hot research area, it is very likely that we cannot cover all benchmarks. We welcome suggestions and comments to improve this list.) Each benchmark focuses on different aspects and evaluation criteria, providing valuable contributions to their respective domains. For a better summarization, we divide these benchmarks into two categories: benchmarks for general language tasks and benchmarks for specific downstream tasks.

4.1 Benchmarks for General Tasks

LLMs are designed to solve a vast majority of tasks. To this end, existing benchmarks tend to evaluate the performance in different tasks.

Chatbot Arena (LMSYS, 2023) and MT-Bench (Zheng et al., 2023) are two significant benchmarks that contribute to the evaluation and advancement of chatbot models and LLMs in different contexts. Chatbot Arena provides a platform to assess and compare diverse chatbot models through user engagement and voting. Users can engage with anonymous models and express their preferences via voting. The platform gathers a significant volume of votes, facilitating the evaluation of models’ performance in realistic scenarios. Chatbot Arena provides valuable insights into the strengths and limitations of chatbot models, thereby contributing to the progress of chatbot research.
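TABLE VII lists an Elo rating system as the evaluation criterion of Chatbot Arena. The sketch below shows how pairwise votes can be turned into ratings with the standard Elo update. The initial rating of 1000, the update factor `k=32`, and the model names are illustrative assumptions, not values taken from the platform.

```python
from collections import defaultdict


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, model_a: str, model_b: str, outcome: float, k: float = 32.0) -> None:
    """Update both ratings in place after one vote.

    outcome is 1.0 if model_a wins, 0.0 if model_b wins, and 0.5 for a tie.
    """
    expected_a = expected_score(ratings[model_a], ratings[model_b])
    ratings[model_a] += k * (outcome - expected_a)
    ratings[model_b] += k * ((1.0 - outcome) - (1.0 - expected_a))


# Every model starts from the same (assumed) initial rating.
ratings = defaultdict(lambda: 1000.0)

# Hypothetical votes: (model shown as A, model shown as B, outcome for A).
votes = [
    ("model_x", "model_y", 1.0),
    ("model_x", "model_y", 1.0),
    ("model_y", "model_x", 0.5),
]
for model_a, model_b, outcome in votes:
    update_elo(ratings, model_a, model_b, outcome)

for name, rating in sorted(ratings.items(), key=lambda item: -item[1]):
    print(f"{name}: {rating:.1f}")
```

Because each update adds to one model exactly what it subtracts from the other, the total rating mass stays constant, so ratings are only meaningful relative to each other.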

Meanwhile, MT-Bench evaluates LLMs on multi-turn dialogues using a comprehensive set of questions specifically designed for assessing the capabilities of models in handling multi-turn conversations. MT-Bench possesses several distinguishing features that differentiate it from conventional evaluation methodologies. Notably, it excels in simulating dialogue scenarios representative of real-world settings, thereby facilitating a more precise evaluation of a model’s practical performance. Moreover, MT-Bench effectively overcomes the limitations in traditional evaluation approaches, particularly in gauging a model’s competence in handling intricate multi-turn dialogue inquiries.

Instead of focusing on specific tasks and evaluation metrics, HELM (Liang et al., 2022) provides a comprehensive assessment of LLMs. It evaluates language models across various aspects such as language understanding, generation, coherence, context sensitivity, common-sense reasoning, and domain-specific knowledge. HELM aims to holistically evaluate the performance of language models across different tasks and domains. In addition, Xiezhi (Gu et al., 2023) presents a comprehensive suite for assessing the knowledge level of large-scale language models in different subject areas. The evaluation conducted through Xiezhi enables researchers to comprehend the notable limitations inherent in these models and facilitates a deeper comprehension of their capabilities in diverse fields. For evaluating language models beyond their existing capacities, Big-Bench (Srivastava et al., 2022) introduces a diverse collection of 204 challenging tasks contributed by 450 authors from 132 institutions. These tasks cover various domains such as math, childhood development, linguistics, biology, common-sense reasoning, social bias, physics, software development, etc. Moreover, MME (Fu et al., 2023a) serves as an extensive evaluative benchmark specifically designed for multimodal large language models (MLLM), aiming to assess their perceptual and cognitive aptitudes. MME employs meticulously crafted instruction-answer pairs alongside succinct instruction design, thereby guaranteeing equitable evaluation conditions.

KoLA (Yu et al., 2023), a Knowledge-Oriented LLMs Evaluation Benchmark, is specially designed to evaluate the language understanding and reasoning abilities of LLMs. It emphasizes the comprehension and utilization of semantic knowledge and inference. KoLA serves as a crucial platform for researchers to assess the depth of LLMs’ understanding and reasoning, thereby propelling progress in language comprehension models. To allow for crowd-sourcing evaluations in language tasks, DynaBench (Kiela et al., 2021) is designed for conducting dynamic benchmark testing. It explores new research directions, such as the effects of closed-loop integration, the characteristics of distributional shifts, annotator efficiency, the influence of expert annotators, and model robustness against targeted adversarial attacks in interactive environments. Additionally, it contributes to advancing research on dynamic data collection and conducting cross-task analysis in the domain of general human-computer interaction. M3KE (Liu et al., 2023a) is a further general-task benchmark, listed in TABLE VII as measuring multi-task accuracy.

The development of standardized benchmarks for evaluating LLMs on diverse tasks has been an important research focus. MMLU (Hendrycks et al., 2020b) provides a comprehensive suite of tests for assessing text models in multi-task contexts. AlpacaEval (Li et al., 2023d) stands as an automated evaluation benchmark, which places its focus on assessing the performance of LLMs across various natural language processing tasks. It provides a range of metrics, robustness measures, and diversity evaluations to gauge the capabilities of LLMs. AlpacaEval has significantly contributed to advancing LLMs in diverse domains and promoting a deeper understanding of their performance. Furthermore, AGIEval (Zhong et al., 2023) serves as a dedicated evaluation framework for assessing the performance of foundation models in the domain of human-centric standardized exams. Moreover, OpenLLM (HuggingFace, 2023) functions as an evaluation benchmark by offering a public competition platform for comparing and assessing the performance of different LLMs on various tasks. It encourages researchers to submit their models and compete on different tasks, driving progress and competition in the field of LLM research.

As for tasks beyond standard performance, there are benchmarks designed for OOD, adversarial robustness, and fine-tuning. GLUE-X (Yang et al., 2022) is a novel attempt to create a unified benchmark aimed at evaluating the robustness of NLP models in OOD scenarios. This benchmark emphasizes the significance of robustness in NLP and provides insights into measuring and enhancing the robustness of models. PromptBench (Zhu et al., 2023) centers on the resilience of LLMs to adversarial prompts, as summarized in TABLE VII. It provides a standardized evaluation framework to compare how different prompts affect model performance and to assess adversarial robustness. To ensure impartial and equitable evaluation, PandaLM (Wang et al., 2023h) is introduced as a discriminative large-scale language model specifically designed to differentiate among multiple high-proficiency LLMs through training. In contrast to conventional evaluation datasets that predominantly emphasize objective correctness, PandaLM incorporates crucial subjective elements, including relative conciseness, clarity, adherence to instructions, comprehensiveness, and formality.

4.2 Benchmarks for Specific Downstream Tasks

Other than benchmarks for general tasks, there exist benchmarks specifically designed for certain downstream tasks.

MultiMedQA (Singhal et al., 2022) is a medical QA benchmark that focuses on medical examinations, medical research, and consumer healthcare questions. It consists of seven datasets related to medical QA, including six existing datasets and one new dataset. The goal of this benchmark is to evaluate the performance of LLMs in terms of clinical knowledge and QA abilities. To assess the performance of LLMs in advanced reasoning tasks across multiple domains, ARB (Sawada et al., 2023) has been introduced. Additionally, TRUSTGPT (Huang et al., 2023c) is specifically tailored to address ethical considerations within the context of LLMs, with a particular focus on toxicity, bias, and value alignment.

Other specific benchmarks include C-Eval (Huang et al., 2023b), which is the first extensive benchmark to assess the advanced knowledge and reasoning capabilities of foundation models in Chinese. M3Exam (Zhang et al., 2023c) provides a unique and comprehensive evaluation framework that incorporates multiple languages, modalities, and levels to test the general capabilities of LLMs in diverse contexts. Additionally, GAOKAO-Bench (Zhang et al., 2023e) provides a comprehensive evaluation benchmark for gauging the proficiency of large language models in intricate and context-specific tasks, utilizing questions sourced from the Chinese Gaokao examination. On the other hand, SOCKET (Choi et al., 2023) serves as an NLP benchmark designed to evaluate the performance of LLMs in learning and recognizing social knowledge concepts. It consists of several tasks and case studies to assess the limitations of LLMs in social capabilities. MATH (Hendrycks et al., 2021c) concentrates on assessing reasoning and problem-solving proficiencies of AI models within the domain of mathematics. APPS (Hendrycks et al., 2021a) is a more comprehensive and rigorous benchmark for evaluating code generation, measuring the ability of language models to generate Python code according to natural language specifications. CUAD (Hendrycks et al., 2021b) is an expert-annotated, domain-specific legal contract review dataset that presents a challenging research benchmark and potential for enhancing deep learning models’ performance in contract understanding tasks. CVALUES (Xu et al., 2023b) introduces a humanistic evaluation benchmark to assess the alignment of LLMs with safety and responsibility standards.

In addition to existing evaluation benchmarks, there is a research gap in assessing the effectiveness of utilizing tools for LLMs. To address this gap, the API-Bank benchmark (Li et al., 2023a) is introduced as the first benchmark explicitly designed for tool-augmented LLMs. It comprises a comprehensive Tool-Augmented LLM workflow, encompassing 53 commonly used API tools and 264 annotated dialogues with a total of 568 API calls. Furthermore, the ToolBench project (ToolBench, 2023) aims to empower the development of large language models that effectively leverage the capabilities of general-purpose tools. By providing a platform for creating optimized instruction datasets, the ToolBench project seeks to drive progress in language models and enhance their practical applications.

5 How to Evaluate

In this section, we introduce two common evaluation methods: automatic evaluation and human evaluation. In fact, the taxonomy of “how to evaluate” is also not definite. Our categorization is based on whether or not the evaluation criterion can be automatically computed. If it can be automatically calculated, we categorize it into automatic evaluation; otherwise, it falls into human evaluation.

TABLE VIII: Summary of new LLMs evaluation protocols.

Method	References
Human-in-the-loop	AdaVision (Gao et al., 2022), AdaTest (Ribeiro and Lundberg, 2022)
Crowd-sourcing testing	DynaBench (Kiela et al., 2021), DynaBoard (Ma et al., 2021), DynamicTempLAMA (Margatina et al., 2023), DynaTask (Thrush et al., 2022)
More challenging tests	HELM (Liang et al., 2022), AdaFilter (Phang et al., 2021), CheckList (Ribeiro et al., 2020), Big-Bench (Srivastava et al., 2022), DeepTest (Tian et al., 2018)

5.1 Automatic Evaluation

Automated evaluation of LLMs is a common and perhaps the most popular evaluation method. It usually uses standard metrics or indicators and evaluation tools to assess the performance of models, such as accuracy, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and BERTScore (Zhang et al., 2019), to name a few. For instance, we can use the BLEU score to quantify the similarity between the model-generated text and the reference text, as a proxy for quality, in a machine translation task. In fact, most of the existing evaluation efforts adopt this evaluation protocol due to its objectivity, automatic computing, and simplicity. Thus, most of the deterministic tasks, such as natural language understanding and math problems, often adopt this evaluation protocol.
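As a concrete illustration of such automatically computable metrics, the sketch below implements exact-match accuracy and the clipped n-gram precision that forms the core of BLEU. It is a simplification: full BLEU additionally combines several n-gram orders with a geometric mean and applies a brevity penalty. Whitespace tokenization, case-insensitive matching, and the function names are assumptions made for illustration.

```python
from collections import Counter


def exact_match_accuracy(predictions, references):
    """Fraction of predictions equal to their reference (case-insensitive)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return hits / len(predictions)


def _ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_ngram_precision(candidate: str, reference: str, n: int = 1) -> float:
    """Share of candidate n-grams found in the reference.

    Each candidate n-gram is credited at most as many times as it occurs
    in the reference, so repeating a correct word does not inflate the score.
    """
    cand_counts = _ngrams(candidate.split(), n)
    ref_counts = _ngrams(reference.split(), n)
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return overlap / total


accuracy = exact_match_accuracy(["4", "Paris"], ["4", "London"])
unigram = clipped_ngram_precision("the the the the", "the cat sat on the mat", n=1)
bigram = clipped_ngram_precision("the cat sat", "the cat sat on the mat", n=2)
print(accuracy, unigram, bigram)
```

The clipping step is what distinguishes this from naive word overlap: the degenerate candidate "the the the the" is credited for only two of its four tokens because the reference contains "the" twice.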

Compared with human evaluation, automatic evaluation does not require intensive human participation, which saves costs and time. For example, both Qin et al. (2023) and Bang et al. (2023) use automated evaluation methods to evaluate a large number of tasks. Recently, with the development of LLMs, some advanced automatic evaluation techniques have also been designed to help evaluate. Lin and Chen (2023) proposed LLM-EVAL, a unified multidimensional automatic evaluation method for open-domain conversations with LLMs. PandaLM (Wang et al., 2023h) can achieve reproducible and automated language model assessment by training an LLM that serves as the “judge” to evaluate different models. Proposing a self-supervised evaluation framework, Jain et al. (2023) enabled a more efficient form of evaluating models in real-world deployment settings by eliminating the need for laborious labeling of new data.
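When a model serves as the judge, a common summary statistic is the win rate over pairwise comparisons (TABLE VII lists win rate as the criterion for PandaLM and MT-Bench). The following sketch only shows the bookkeeping. The `Judge` signature, the verdict labels, counting a tie as half a win, and the placeholder `length_judge` are all assumptions made for illustration; a real judge would be a trained or prompted LLM.

```python
from typing import Callable, Iterable, Tuple

# Assumed judge interface: (instruction, response_a, response_b) -> "A", "B", or "tie".
Judge = Callable[[str, str, str], str]


def win_rate(judge: Judge, examples: Iterable[Tuple[str, str, str]]) -> float:
    """Win rate of system A against system B, counting each tie as half a win."""
    wins = ties = total = 0
    for instruction, response_a, response_b in examples:
        verdict = judge(instruction, response_a, response_b)
        if verdict == "A":
            wins += 1
        elif verdict == "tie":
            ties += 1
        elif verdict != "B":
            raise ValueError(f"unexpected verdict: {verdict!r}")
        total += 1
    if total == 0:
        return 0.0
    return (wins + 0.5 * ties) / total


def length_judge(instruction: str, response_a: str, response_b: str) -> str:
    """Placeholder judge that prefers the shorter response.

    This stands in for a judge model only so that the example runs.
    """
    if len(response_a) < len(response_b):
        return "A"
    if len(response_b) < len(response_a):
        return "B"
    return "tie"


examples = [
    ("Define BLEU.", "An n-gram overlap metric.", "BLEU is a metric that measures n-gram overlap."),
    ("Name a benchmark.", "One example is the MMLU benchmark.", "MMLU."),
    ("Say hi.", "Hello", "Hi!!!"),
]
rate = win_rate(length_judge, examples)
print(rate)
```

Swapping `length_judge` for a call to an actual judge model leaves `win_rate` unchanged, which is the point of keeping the judge behind a narrow interface.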

Due to the large volume of automatic evaluation papers, we will not introduce them in detail. The principle of automatic evaluation is in fact the same as that of other AI model evaluation processes: we use standard metrics to compute certain values, which serve as indicators of model performance.

5.2 Human Evaluation

The increasingly strengthened capabilities of LLMs have certainly gone beyond standard evaluation metrics on general natural language tasks. Therefore, human evaluation becomes a natural choice in some non-standard cases where automatic evaluation is not suitable. For instance, in open generation tasks where embedded similarity metrics (such as BERTScore) are not enough, human evaluation is more reliable (Novikova et al., 2017). While some generation tasks can adopt certain automatic evaluation protocols, human evaluation is often preferable in these tasks because a generated output can be better than the standard answer.

Human evaluation of LLMs assesses the quality and accuracy of model-generated results through human participation. Compared with automatic evaluation, manual evaluation is closer to the actual application scenario and can provide more comprehensive and accurate feedback. In the manual evaluation of LLMs, evaluators (such as experts, researchers, or ordinary users) are usually invited to evaluate the results generated by the model. For example, Ziems et al. (2023) used annotations from experts for generation. Liang et al. (2022) performed human evaluation on summarization and disinformation scenarios for 6 models, and Bang et al. (2023) evaluated analogical reasoning tasks. The seminal evaluation work by Bubeck et al. (2023) ran a series of human-crafted tests on GPT-4 and found that GPT-4 performs close to or even exceeds human performance on multiple tasks. This kind of evaluation requires human evaluators to actually test and compare the performance of the models, not just evaluate them through automated metrics. Note that even human evaluations can have high variance and instability, which could be due to cultural and individual differences (Peng et al., 1997). In practical applications, the two evaluation methods are weighed and combined according to the actual situation.
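One common way to quantify the variance between human evaluators is an inter-annotator agreement statistic. The sketch below computes Cohen's kappa for two annotators who labeled the same items. The function name and the assumption of exactly two annotators are illustrative choices, not taken from any of the studies cited above.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed
    agreement and p_e is the agreement expected by chance given each
    annotator's label frequencies.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both annotators must label the same items")
    n = len(labels_a)
    if n == 0:
        raise ValueError("at least one labeled item is required")

    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

A kappa near 1 indicates agreement well beyond chance, while a value near 0 suggests that the ratings are too unstable to support firm conclusions about the models being compared.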

6 Summary

In this section, we summarize the key findings based on our review in sections 3, 4, and 5.

First of all, we would like to highlight that, despite all the effort spent on summarizing existing work on evaluation, there is no evidence that any single evaluation protocol or benchmark is the most useful and successful; rather, each has different characteristics and focuses. This also indicates that no single model performs best on all kinds of tasks. The purpose of this survey is to go beyond simply determining the “best” benchmark or evaluation protocol. By summarizing and analyzing existing efforts on LLMs evaluation, we may identify the current success and failure cases of LLMs, derive new trends for evaluation protocols, and, most importantly, propose new challenges and opportunities for future research.

6.1 Task: Success and Failure Cases of LLMs

We now summarize the success and failure cases of LLMs in different tasks. Note that all the following conclusions are based on existing evaluation efforts, and the results depend on the specific datasets used.

6.1.1 What can LLMs do well?

• LLMs demonstrate proficiency in generating text by producing fluent and precise linguistic expressions.
• LLMs obtain impressive performance in tasks involving language understanding, such as sentiment analysis and text classification.
• LLMs exhibit robust contextual comprehension, enabling them to generate coherent responses that align with the given input.
• LLMs achieve satisfying performance across several natural language processing tasks, including machine translation, text generation, and question answering.

6.1.2 When can LLMs fail?

• LLMs may exhibit biases and inaccuracies during the generation process, resulting in biased outputs.
• LLMs have limited abilities in comprehending complex logic and reasoning tasks, often experiencing confusion or making errors in intricate contexts.
• LLMs face constraints in handling extensive datasets and long-term memory, which can pose challenges in processing lengthy texts and tasks involving long-term dependencies.
• LLMs have limitations in incorporating real-time or dynamic information, making them less suitable for tasks that require up-to-date knowledge or rapid adaptation to changing contexts.
• LLMs are sensitive to prompts, especially adversarial prompts, which has motivated new evaluations and algorithms to improve their robustness.
• In text summarization, LLMs may show subpar performance on particular evaluation metrics, which can potentially be attributed to limitations or inadequacies of those metrics.
• LLMs do not achieve satisfying performance in counterfactual tasks.

6.2 Benchmark and Evaluation Protocol

With the rapid development and widespread use of LLMs, evaluating them in practical applications and research has become crucial. This evaluation process should include not only task-level evaluation but also a deep understanding of the potential risks they pose from a societal perspective. In this section, we summarize existing benchmarks and evaluation protocols in TABLE VIII.

First, there is a shift from objective calculation to human-in-the-loop testing, which allows for greater human feedback during the evaluation process. AdaVision (Gao et al., 2022), an interactive process for testing vision models, enables users to label a small amount of data for model correctness, which helps users identify and fix coherent failure modes. In AdaTest (Ribeiro and Lundberg, 2022), the user filters test samples by selecting only high-quality tests and organizing them into semantically related topics.

Second, a move from static to crowd-sourced test sets is becoming more common. Tools like DynaBench (Kiela et al., 2021), DynaBoard (Ma et al., 2021), and DynaTask (Thrush et al., 2022) rely on crowdworkers to create and test hard samples. Additionally, DynamicTempLAMA (Margatina et al., 2023) allows for dynamically constructed time-related tests.

Third, there is a shift from a unified to a challenging setting in evaluating machine learning models. While unified settings involve a test set with no preference for any specific task, challenging settings create test sets for specific tasks. Tools like DeepTest (Tian et al., 2018) use seeds to generate input transformations for testing, CheckList (Ribeiro et al., 2020) builds test sets based on templates, and AdaFilter (Phang et al., 2021) adversarially constructs tests. However, it is worth noting that AdaFilter may not be entirely fair as it relies on adversarial examples. HELM (Liang et al., 2022) evaluates LLMs from different aspects, while the Big-Bench (Srivastava et al., 2022) platform is used to design hard tasks for machine learning models to tackle. PromptBench (Zhu et al., 2023) aims to evaluate the adversarial robustness of LLMs by creating adversarial prompts, which is a more challenging setting; its results show that current LLMs are not robust to adversarial prompts.
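To make the idea of template-based test construction concrete, the sketch below expands a template with slot values into a set of test cases and measures how often a model fails on them. It is only an illustration of the general technique: `expand_template`, `failure_rate`, and the model callable are hypothetical names, and none of this is the API of CheckList or any other tool named above.

```python
from itertools import product


def expand_template(template, slots):
    """Fill every combination of slot values into a format-style template.

    `slots` maps a placeholder name to the list of values it may take.
    Slot names are iterated in sorted order so the output is deterministic.
    """
    names = sorted(slots)
    value_lists = [slots[name] for name in names]
    return [template.format(**dict(zip(names, values)))
            for values in product(*value_lists)]


def failure_rate(model, cases, expected_label):
    """Fraction of test cases for which `model` misses the expected label.

    `model` is any callable mapping a string to a label; in practice it
    would wrap a call to the system under test (an assumption here).
    """
    if not cases:
        raise ValueError("cases must be non-empty")
    failures = sum(model(case) != expected_label for case in cases)
    return failures / len(cases)
```

For example, expanding `"{name} is not {adj}."` over two names and two positive adjectives yields four negation tests that share one expected label, so a single failure rate summarizes how well a model handles that linguistic capability.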

7 Grand Challenges and Opportunities for Future Research

Evaluation as a new discipline: Our summarization inspires us to redesign a wide spectrum of aspects related to evaluation in the era of LLMs. In this section, we present several grand challenges. Our key point is that evaluation should be treated as an essential discipline to drive the success of LLMs and other AI models. Existing protocols are not enough to thoroughly evaluate the true capabilities of LLMs, which poses grand challenges and triggers new opportunities for future research on LLMs evaluation.

7.1 Designing AGI Benchmarks

As we discussed earlier, while all tasks can potentially serve as evaluation tools for LLMs, the question remains as to which can truly measure AGI capabilities. As we expect LLMs to demonstrate AGI abilities, a comprehensive understanding of the differences between human and AGI capacities becomes crucial in the creation of AGI benchmarks. The prevailing trend seems to conceptualize AGI as a superhuman entity, thereby utilizing cross-disciplinary knowledge from fields such as education, psychology, and social sciences to design innovative benchmarks. Nonetheless, there remains a plethora of unresolved issues. For instance, does it make sense to use human values as a starting point for test construction, or should alternative perspectives be considered? The process of developing suitable AGI benchmarks presents many open questions demanding further exploration.

7.2 Complete Behavioral Evaluation

An ideal AGI evaluation should contain not only standard benchmarks on common tasks, but also evaluations on open tasks such as complete behavioral tests. By behavioral test, we mean that AGI models should also be evaluated in an open environment. For instance, by treating LLMs as the central controller, we can construct evaluations on a robot manipulated by LLMs to test its behaviors in real situations. By treating LLMs as completely intelligent machines, the evaluation of their multi-modal dimensions should also be considered. In fact, complete behavioral evaluations are complementary to standard AGI benchmarks, and the two should work together for better testing.

7.3 Robustness Evaluation

Beyond general tasks, it is crucial for LLMs to maintain robustness against a wide variety of inputs in order to perform optimally for end-users, given their extensive integration into daily life. For instance, the same prompt expressed with different grammar and wording could lead ChatGPT and other LLMs to generate different results, indicating that current LLMs are not robust to their inputs. While there is some prior work on robustness evaluation (Wang et al., 2023c; Zhu et al., 2023), there is much room for advancement, such as including more diverse evaluation sets, examining more evaluation aspects, and developing more efficient evaluations to generate robustness tasks. Concurrently, the concept and definition of robustness are constantly evolving. It is thus vital to consider updating the evaluation system to better align with emerging requirements related to ethics and bias.
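A simple way to quantify this kind of sensitivity is to compare accuracy under a clean prompt with accuracy under a perturbed version of the same prompt. The sketch below does so with a relative drop rate. All names are illustrative assumptions, and `toy_model` is a deliberately brittle stand-in for a real LLM call, not a description of how any actual model behaves.

```python
def accuracy_under_prompt(model, prompt_template, examples):
    """Accuracy of `model` when every input is wrapped in `prompt_template`.

    `examples` is a list of (text, label) pairs and `prompt_template`
    contains a `{text}` placeholder.
    """
    if not examples:
        raise ValueError("examples must be non-empty")
    correct = sum(model(prompt_template.format(text=text)) == label
                  for text, label in examples)
    return correct / len(examples)


def performance_drop_rate(clean_accuracy, perturbed_accuracy):
    """Relative accuracy lost when moving from the clean to the perturbed prompt."""
    if clean_accuracy == 0:
        return 0.0
    return 1.0 - perturbed_accuracy / clean_accuracy


def toy_model(prompt):
    """Hypothetical stand-in for an LLM: it ignores the input text
    whenever the word "sentiment" in the instruction is misspelled."""
    if "sentiment" not in prompt:
        return "negative"
    return "positive" if "great" in prompt else "negative"


EXAMPLES = [
    ("a great movie", "positive"),
    ("a dull movie", "negative"),
    ("great acting", "positive"),
    ("boring plot", "negative"),
]
CLEAN_PROMPT = "Classify the sentiment: {text}"
PERTURBED_PROMPT = "Classify the sentimnet: {text}"  # single-typo perturbation

clean_acc = accuracy_under_prompt(toy_model, CLEAN_PROMPT, EXAMPLES)
perturbed_acc = accuracy_under_prompt(toy_model, PERTURBED_PROMPT, EXAMPLES)
drop = performance_drop_rate(clean_acc, perturbed_acc)
```

Reporting the relative drop rather than the raw perturbed accuracy makes results comparable across models whose clean accuracy differs.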

7.4 Dynamic and Evolving Evaluation

Existing evaluation protocols for most AI tasks rely on static and public benchmarks, i.e., the evaluation datasets and protocols are often publicly available. While this facilitates rapid and convenient evaluation within the community, it cannot accurately assess the evolving abilities of LLMs, given their rapid rate of development. The capabilities of LLMs may improve over time, and this improvement cannot be consistently measured by existing static benchmarks. On the other hand, as LLMs grow increasingly powerful with larger model sizes and training set sizes, static and public benchmarks are likely to be memorized by LLMs, resulting in potential training data contamination. Therefore, developing dynamic and evolving evaluation systems is the key to providing a fair evaluation of LLMs.
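One common heuristic for flagging possible contamination is to measure how many word n-grams of a benchmark item also occur in a sample of training text. The sketch below implements that heuristic under simplifying assumptions (lowercased whitespace tokens, an in-memory corpus, and an arbitrary default of n=8). The function names are hypothetical, and a high ratio is only a signal that warrants inspection, not proof of memorization.

```python
def word_ngrams(text, n):
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_ratio(benchmark_item, corpus_documents, n=8):
    """Fraction of the item's n-grams that also appear in the corpus.

    Returns 0.0 when the item is shorter than n tokens, since it then
    contains no n-grams to compare.
    """
    item_ngrams = word_ngrams(benchmark_item, n)
    if not item_ngrams:
        return 0.0
    corpus_ngrams = set()
    for document in corpus_documents:
        corpus_ngrams |= word_ngrams(document, n)
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)
```

Such a check requires access to (a sample of) the training data, which is often unavailable for closed models; this is one reason dynamically generated test sets are attractive.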

7.5 Principled and Trustworthy Evaluation

When introducing an evaluation system, it is crucial to ascertain its integrity and trustworthiness. Therefore, the necessity for trustworthy computing extends to the requirement for reliable evaluation systems as well. This poses a challenging research question that intertwines with measurement theory, probability, and numerous other domains. For instance, how can we ensure that dynamic testing truly generates out-of-distribution examples? There is a scarcity of research in this domain, and it is hoped that future work will aim to scrutinize not only the algorithms but the evaluation system itself.

7.6 Unified Evaluation that Supports All LLMs Tasks

There are many other research areas of LLMs, and we need to develop evaluation systems that can support all kinds of tasks, such as value alignment, safety, verification, interdisciplinary research, fine-tuning, and others. For instance, PandaLM (Wang et al., 2023h) is an evaluation system that assists LLMs fine-tuning by providing an open-source evaluation model, which can automatically assess the performance of fine-tuning. We expect evaluation systems to become more general and to serve as assistance in specific LLMs tasks.

7.7 Beyond Evaluation: LLMs Enhancement

Ultimately, evaluation is not the end goal but rather the starting point. Following the evaluation, there are undoubtedly conclusions to be drawn regarding performance, robustness, stability, and other factors. A proficient evaluation system should not only offer benchmark results but should also deliver insightful analysis, recommendations, and guidance for future research and development. For instance, PromptBench (Zhu et al., 2023) provides not only robustness evaluation results on adversarial prompts but also a comprehensive analysis through attention visualization, elucidating how adversarial texts can result in erroneous responses. The system further offers a word frequency analysis to identify robust and non-robust words in the test sets, thus providing prompt engineering guidance for end users. Subsequent research can leverage these findings to enhance LLMs. As another example, Wang et al. (2023g) first explored the performance of large vision-language models on imbalanced (long-tailed) tasks, which demonstrates the limitations of current large models. Then, they explored different methodologies to enhance the performance on these tasks. In summary, enhancement after evaluation helps to build better LLMs, and much can be done in the future.

8 Conclusion

Evaluation carries profound significance, becoming imperative in the advancement of AI models, especially within the context of large language models. This paper presents the first survey to give a comprehensive overview of the evaluation of LLMs from three aspects: what to evaluate, how to evaluate, and where to evaluate. By encapsulating evaluation tasks, protocols, and benchmarks, our aim is to augment understanding of the current status of LLMs, elucidate their strengths and limitations, and furnish insights for future LLMs progression.

Our survey reveals that current LLMs exhibit certain limitations in numerous tasks, notably reasoning and robustness tasks. Concurrently, the need for contemporary evaluation systems to adapt and evolve remains evident, ensuring the accurate assessment of LLMs’ inherent capabilities and limitations. We identify several grand challenges that future research should address, with the aspiration that LLMs can progressively enhance their service to humanity.

Disclaimer

The goal of this paper is mainly to summarize and discuss existing evaluation efforts on large language models. Results and conclusions in each paper are original contributions of their corresponding authors, particularly for potential issues in ethics and biases. This paper may discuss some side effects of LLMs and the only intention is to foster a better understanding of large language models.

Additionally, due to the evolution of LLMs, especially online services such as Claude and ChatGPT, it is very likely that they will become stronger and that some of their limitations described in this paper will be mitigated (and new limitations may arise). We encourage interested readers to take this survey as a reference for future research and to conduct real experiments on current systems when performing evaluations.

Finally, the evaluation of LLMs is continuously developing, thus we may miss some new papers or benchmarks. We welcome all constructive feedback and suggestions to help make this survey better.

References

Abdelali et al., (2023) Abdelali, A., Mubarak, H., Chowdhury, S. A., Hasanain, M., Mousi, B., Boughorbel, S., Kheir, Y. E., Izham, D., Dalvi, F., Hawasly, M., et al. (2023). Benchmarking arabic ai with large language models. arXiv preprint arXiv:2305.14982.
Ahuja et al., (2023) Ahuja, K., Hada, R., Ochieng, M., Jain, P., Diddee, H., Maina, S., Ganu, T., Segal, S., Axmed, M., Bali, K., et al. (2023). Mega: Multilingual evaluation of generative ai. arXiv preprint arXiv:2303.12528.
Arora et al., (2023) Arora, D., Singh, H. G., et al. (2023). Have llms advanced enough? a challenging problem solving benchmark for large language models. arXiv preprint arXiv:2305.15074.
Bai et al., (2023) Bai, Y., Ying, J., Cao, Y., Lv, X., He, Y., Wang, X., Yu, J., Zeng, K., Xiao, Y., Lyu, H., et al. (2023). Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181.
Bang et al., (2023) Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., et al. (2023). A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.
Berrar, (2019) Berrar, D. (2019). Cross-validation.
Bian et al., (2023) Bian, N., Han, X., Sun, L., Lin, H., Lu, Y., and He, B. (2023). Chatgpt is a knowledgeable but inexperienced solver: An investigation of commonsense problem in large language models. arXiv preprint arXiv:2303.16421.
Bodroza et al., (2023) Bodroza, B., Dinic, B. M., and Bojic, L. (2023). Personality testing of gpt-3: Limited temporal reliability, but highlighted social desirability of gpt-3’s personality instruments results. arXiv preprint arXiv:2306.04308.
Bommasani et al., (2021) Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
Brody, (1999) Brody, N. (1999). What is intelligence? International Review of Psychiatry, 11(1):19–25.
Brown et al., (1992) Brown, P. F., Della Pietra, V. J., Desouza, P. V., Lai, J. C., and Mercer, R. L. (1992). Class-based n-gram models of natural language. Computational linguistics, 18(4):467–480.
Brown et al., (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Bubeck et al., (2023) Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712.
Cao et al., (2023) Cao, Y., Zhou, L., Lee, S., Cabello, L., Chen, M., and Hershcovich, D. (2023). Assessing cross-cultural alignment between chatgpt and human societies: An empirical study. In Proceedings of the First Workshop on Cross-Cultural Considerations in NLP (C3NLP), pages 53–67.
Cascella et al., (2023) Cascella, M., Montomoli, J., Bellini, V., and Bignami, E. (2023). Evaluating the feasibility of chatgpt in healthcare: an analysis of multiple clinical and research scenarios. Journal of Medical Systems, 47(1):33.
Castro Nascimento and Pimentel, (2023) Castro Nascimento, C. M. and Pimentel, A. S. (2023). Do large language models understand chemistry? a conversation with chatgpt. Journal of Chemical Information and Modeling, 63(6):1649–1655.
Chen et al., (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. (2021). Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.
Chen et al., (2023) Chen, Y., Wang, R., Jiang, H., Shi, S., and Xu, R. (2023). Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study. arXiv preprint arXiv:2304.00723.
Chervenak et al., (2023) Chervenak, J., Lieman, H., Blanco-Breindel, M., and Jindal, S. (2023). The promise and peril of using a large language model to obtain clinical information: Chatgpt performs strongly as a fertility counseling tool with limitations. Fertility and Sterility.
Chia et al., (2023) Chia, Y. K., Hong, P., Bing, L., and Poria, S. (2023). Instructeval: Towards holistic evaluation of instruction-tuned large language models. arXiv preprint arXiv:2306.04757.
Choi et al., (2023) Choi, M., Pei, J., Kumar, S., Shu, C., and Jurgens, D. (2023). Do llms understand social knowledge? evaluating the sociability of large language models with socket benchmark. arXiv preprint arXiv:2305.14938.
Chowdhery et al., (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. (2022). Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
Christiano et al., (2017) Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
Clavié et al., (2023) Clavié, B., Ciceu, A., Naylor, F., Soulié, G., and Brightwell, T. (2023). Large language models in the workplace: A case study on prompt engineering for job type classification. In International Conference on Applications of Natural Language to Information Systems, pages 3–17. Springer.
Collins et al., (2023) Collins, K. M., Jiang, A. Q., Frieder, S., Wong, L., Zilka, M., Bhatt, U., Lukasiewicz, T., Wu, Y., Tenenbaum, J. B., Hart, W., et al. (2023). Evaluating language models for mathematics through interactions. arXiv preprint arXiv:2306.01694.
Cortes and Vapnik, (1995) Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine learning, 20:273–297.
(27) Dai, S., Shao, N., Zhao, H., Yu, W., Si, Z., Xu, C., Sun, Z., Zhang, X., and Xu, J. (2023a). Uncovering chatgpt’s capabilities in recommender systems. arXiv preprint arXiv:2305.02182.
(28) Dai, W., Lin, J., Jin, F., Li, T., Tsai, Y.-S., Gasevic, D., and Chen, G. (2023b). Can large language models provide feedback to students? a case study on chatgpt.
Dao and Le, (2023) Dao, X.-Q. and Le, N.-B. (2023). Investigating the effectiveness of chatgpt in mathematical reasoning and problem solving: Evidence from the vietnamese national high school graduation examination. arXiv preprint arXiv:2306.06331.
de Winter, (2023) de Winter, J. C. (2023). Can chatgpt pass high school exams on english language comprehension. Researchgate. Preprint.
Deng et al., (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee.
Deroy et al., (2023) Deroy, A., Ghosh, K., and Ghosh, S. (2023). How ready are pre-trained abstractive models and llms for legal case judgement summarization? arXiv preprint arXiv:2306.01248.
Deshpande et al., (2023) Deshpande, A., Murahari, V., Rajpurohit, T., Kalyan, A., and Narasimhan, K. (2023). Toxicity in chatgpt: Analyzing persona-assigned language models. arXiv preprint arXiv:2304.05335.
Devlin et al., (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Dhamala et al., (2021) Dhamala, J., Sun, T., Kumar, V., Krishna, S., Pruksachatkun, Y., Chang, K.-W., and Gupta, R. (2021). Bold: Dataset and metrics for measuring biases in open-ended language generation. In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, pages 862–872.
Duong and Solomon, (2023) Duong, D. and Solomon, B. D. (2023). Analysis of large-language model versus human performance for genetics questions. European Journal of Human Genetics, pages 1–3.
Fan et al., (2023) Fan, W., Zhao, Z., Li, J., Liu, Y., Mei, X., Wang, Y., Tang, J., and Li, Q. (2023). Recommender systems in the era of large language models (llms).
Fansi Tchango et al., (2022) Fansi Tchango, A., Goel, R., Wen, Z., Martel, J., and Ghosn, J. (2022). Ddxplus: A new dataset for automatic medical diagnosis. Advances in Neural Information Processing Systems, 35:31306–31318.
Ferrara, (2023) Ferrara, E. (2023). Should chatgpt be biased? challenges and risks of bias in large language models. arXiv preprint arXiv:2304.03738.
Floridi and Chiriatti, (2020) Floridi, L. and Chiriatti, M. (2020). Gpt-3: Its nature, scope, limits, and consequences. Minds and Machines, 30:681–694.
Frank, (2023) Frank, M. C. (2023). Baby steps in evaluating the capacities of large language models. Nature Reviews Psychology, pages 1–2.
Frieder et al., (2023) Frieder, S., Pinchetti, L., Griffiths, R.-R., Salvatori, T., Lukasiewicz, T., Petersen, P. C., Chevalier, A., and Berner, J. (2023). Mathematical capabilities of chatgpt. arXiv preprint arXiv:2301.13867.
(43) Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Qiu, Z., Lin, W., Yang, J., Zheng, X., et al. (2023a). Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
(44) Fu, Y., Ou, L., Chen, M., Wan, Y., Peng, H., and Khot, T. (2023b). Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance. arXiv preprint arXiv:2305.17306.
Fushiki, (2011) Fushiki, T. (2011). Estimation of prediction error by using k-fold cross-validation. Statistics and Computing, 21:137–146.
Gallant et al., (1990) Gallant, S. I. et al. (1990). Perceptron-based learning algorithms. IEEE Transactions on neural networks, 1(2):179–191.
Gao et al., (2022) Gao, I., Ilharco, G., Lundberg, S., and Ribeiro, M. T. (2022). Adaptive testing of computer vision models. arXiv preprint arXiv:2212.02774.
Gao and Lin, (2004) Gao, J. and Lin, C.-Y. (2004). Introduction to the special issue on statistical language modeling.
Gehman et al., (2020) Gehman, S., Gururangan, S., Sap, M., Choi, Y., and Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 3356–3369.
Gekhman et al., (2023) Gekhman, Z., Herzig, J., Aharoni, R., Elkind, C., and Szpektor, I. (2023). Trueteacher: Learning factual consistency evaluation with large language models. arXiv preprint arXiv:2305.11171.
Gilson et al., (2023) Gilson, A., Safranek, C. W., Huang, T., Socrates, V., Chi, L., Taylor, R. A., Chartash, D., et al. (2023). How does chatgpt perform on the united states medical licensing examination? the implications of large language models for medical education and knowledge assessment. JMIR Medical Education, 9(1):e45312.
Graham et al., (2013) Graham, J., Haidt, J., Koleva, S., Motyl, M., Iyer, R., Wojcik, S. P., and Ditto, P. H. (2013). Moral foundations theory: The pragmatic validity of moral pluralism. In Advances in experimental social psychology, volume 47, pages 55–130. Elsevier.
Gu et al., (2023) Gu, Z., Zhu, X., Ye, H., Zhang, L., Wang, J., Jiang, S., Xiong, Z., Li, Z., He, Q., Xu, R., et al. (2023). Xiezhi: An ever-updating benchmark for holistic domain knowledge evaluation. arXiv preprint arXiv:2306.05783.
Guo et al., (2023) Guo, T., Guo, K., Liang, Z., Guo, Z., Chawla, N. V., Wiest, O., Zhang, X., et al. (2023). What indeed can gpt models do in chemistry? a comprehensive benchmark on eight tasks. arXiv preprint arXiv:2305.18365.
Hagendorff and Fabi, (2023) Hagendorff, T. and Fabi, S. (2023). Human-like intuitive behavior and reasoning biases emerged in language models – and disappeared in gpt-4.
Hamidi and Roberts, (2023) Hamidi, A. and Roberts, K. (2023). Evaluation of ai chatbots for patient-specific ehr questions. arXiv preprint arXiv:2306.02549.
Hartmann et al., (2023) Hartmann, J., Schwenzow, J., and Witte, M. (2023). The political ideology of conversational ai: Converging evidence on chatgpt’s pro-environmental, left-libertarian orientation. arXiv preprint arXiv:2301.01768.
Hellas et al., (2023) Hellas, A., Leinonen, J., Sarsa, S., Koutcheme, C., Kujanpää, L., and Sorva, J. (2023). Exploring the responses of large language models to beginner programmers’ help requests. arXiv preprint arXiv:2306.05715.
(59) Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. (2021a). Measuring coding challenge competence with apps. arXiv preprint arXiv:2105.09938.
(60) Hendrycks, D., Burns, C., Basart, S., Critch, A., Li, J., Song, D., and Steinhardt, J. (2020a). Aligning ai with shared human values. arXiv preprint arXiv:2008.02275.
(61) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. (2020b). Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300.
(62) Hendrycks, D., Burns, C., Chen, A., and Ball, S. (2021b). Cuad: An expert-annotated nlp dataset for legal contract review. arXiv preprint arXiv:2103.06268.
(63) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. (2021c). Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
Holmes et al., (2023) Holmes, J., Liu, Z., Zhang, L., Ding, Y., Sio, T. T., McGee, L. A., Ashman, J. B., Li, X., Liu, T., Shen, J., et al. (2023). Evaluating large language models on a highly-specialized topic, radiation oncology physics. arXiv preprint arXiv:2304.01938.
Honovich et al., (2022) Honovich, O., Aharoni, R., Herzig, J., Taitelbaum, H., Kukliansy, D., Cohen, V., Scialom, T., Szpektor, I., Hassidim, A., and Matias, Y. (2022). True: Re-evaluating factual consistency evaluation. arXiv preprint arXiv:2204.04991.
(66) Huang, S., Dong, L., Wang, W., Hao, Y., Singhal, S., Ma, S., Lv, T., Cui, L., Mohammed, O. K., Liu, Q., et al. (2023a). Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
(67) Huang, Y., Bai, Y., Zhu, Z., Zhang, J., Zhang, J., Su, T., Liu, J., Lv, C., Zhang, Y., Lei, J., et al. (2023b). C-eval: A multi-level multi-discipline chinese evaluation suite for foundation models. arXiv preprint arXiv:2305.08322.
(68) Huang, Y., Zhang, Q., Yu, P. S., and Sun, L. (2023c). Trustgpt: A benchmark for trustworthy and responsible large language models.
HuggingFace, (2023) HuggingFace (2023). Open-source large language models leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard.
Jahan et al., (2023) Jahan, I., Laskar, M. T. R., Peng, C., and Huang, J. (2023). Evaluation of chatgpt on biomedical tasks: A zero-shot comparison with fine-tuned generative transformers. arXiv preprint arXiv:2306.04504.
Jain et al., (2023) Jain, N., Saifullah, K., Wen, Y., Kirchenbauer, J., Shu, M., Saha, A., Goldblum, M., Geiping, J., and Goldstein, T. (2023). Bring your own data! self-supervised evaluation for large language models. arXiv preprint arXiv:2306.13651.
Jansson et al., (2021) Jansson, M., Hrastinski, S., Stenbom, S., and Enoksson, F. (2021). Online question and answer sessions: How students support their own and other students’ processes of inquiry in a text-based learning environment. The Internet and Higher Education, 51:100817.
Jentzsch and Kersting, (2023) Jentzsch, S. and Kersting, K. (2023). Chatgpt is fun, but it is not funny! humor is still challenging large language models. arXiv preprint arXiv:2306.04563.
Johnson et al., (2023) Johnson, D., Goodman, R., Patrinely, J., Stone, C., Zimmerman, E., Donald, R., Chang, S., Berkowitz, S., Finn, A., Jahangir, E., et al. (2023). Assessing the accuracy and reliability of ai-generated medical responses: an evaluation of the chat-gpt model.
Joshi et al., (2017) Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. (2017). Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Vancouver, Canada. Association for Computational Linguistics.
Kadavath et al., (2022) Kadavath, S., Conerly, T., Askell, A., Henighan, T. J., Drain, D., Perez, E., Schiefer, N., Dodds, Z., DasSarma, N., Tran-Johnson, E., Johnston, S., El-Showk, S., Jones, A., Elhage, N., Hume, T., Chen, A., Bai, Y., Bowman, S., Fort, S., Ganguli, D., Hernandez, D., Jacobson, J., Kernion, J., Kravec, S., Lovitt, L., Ndousse, K., Olsson, C., Ringer, S., Amodei, D., Brown, T. B., Clark, J., Joseph, N., Mann, B., McCandlish, S., Olah, C., and Kaplan, J. (2022). Language models (mostly) know what they know. ArXiv, abs/2207.05221.
Karpas et al., (2022) Karpas, E., Abend, O., Belinkov, Y., Lenz, B., Lieber, O., Ratner, N., Shoham, Y., Bata, H., Levine, Y., Leyton-Brown, K., et al. (2022). Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445.
Kasneci et al., (2023) Kasneci, E., Seßler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., et al. (2023). Chatgpt for good? on opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274.
Khalfa, (1994) Khalfa, J. (1994). What is intelligence?
Khan et al., (2023) Khan, Y. A., Hokia, C., Xu, J., and Ehlert, B. (2023). covllm: Large language models for covid-19 biomedical literature. arXiv preprint arXiv:2306.04926.
Kiela et al., (2021) Kiela, D., Bartolo, M., Nie, Y., Kaushik, D., Geiger, A., Wu, Z., Vidgen, B., Prasad, G., Singh, A., Ringshia, P., et al. (2021). Dynabench: Rethinking benchmarking in nlp. arXiv preprint arXiv:2104.14337.
Kohavi et al., (1995) Kohavi, R. et al. (1995). A study of cross-validation and bootstrap for accuracy estimation and model selection. In IJCAI, volume 14, pages 1137–1145. Montreal, Canada.
Kombrink et al., (2011) Kombrink, S., Mikolov, T., Karafiát, M., and Burget, L. (2011). Recurrent neural network based language modeling in meeting recognition. In Interspeech, volume 11, pages 2877–2880.
Kung et al., (2023) Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., et al. (2023). Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198.
Kwiatkowski et al., (2019) Kwiatkowski, T., Palomaki, J., Redfield, O., Collins, M., Parikh, A., Alberti, C., Epstein, D., Polosukhin, I., Kelcey, M., Devlin, J., Lee, K., Toutanova, K. N., Jones, L., Chang, M.-W., Dai, A., Uszkoreit, J., Le, Q., and Petrov, S. (2019). Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics.
Lahat et al., (2023) Lahat, A., Shachar, E., Avidan, B., Shatz, Z., Glicksberg, B. S., and Klang, E. (2023). Evaluating the use of large language model in identifying top research questions in gastroenterology. Scientific reports, 13(1):4164.
Lai et al., (2023) Lai, V. D., Ngo, N. T., Veyseh, A. P. B., Man, H., Dernoncourt, F., Bui, T., and Nguyen, T. H. (2023). Chatgpt beyond english: Towards a comprehensive evaluation of large language models in multilingual learning. arXiv preprint arXiv:2304.05613.
Lanzi and Loiacono, (2023) Lanzi, P. L. and Loiacono, D. (2023). Chatgpt and other large language models as evolutionary engines for online interactive collaborative game design. arXiv preprint arXiv:2303.02155.
Laskar et al., (2023) Laskar, M. T. R., Bari, M. S., Rahman, M., Bhuiyan, M. A. H., Joty, S., and Huang, J. X. (2023). A systematic study and comprehensive evaluation of chatgpt on benchmark datasets. arXiv preprint arXiv:2305.18486.
Le and Zhang, (2023) Le, V.-H. and Zhang, H. (2023). An evaluation of log parsing with chatgpt. arXiv preprint arXiv:2306.01590.
LeCun et al., (2015) LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436–444.
Lee et al., (2023) Lee, N., An, N. M., and Thorne, J. (2023). Can large language models infer and disagree like humans? arXiv preprint arXiv:2305.13788.
Lewis et al., (2019) Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. (2019). Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.
Li et al., (2023a) Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., and Li, Y. (2023a). Api-bank: A benchmark for tool-augmented llms.
Li et al., (2023b) Li, R., Deng, W., Cheng, Y., Yuan, Z., Zhang, J., and Yuan, F. (2023b). Exploring the upper limits of text-based collaborative filtering using large language models: Discoveries and insights. arXiv preprint arXiv:2305.11700.
Li et al., (2023c) Li, X., Liu, M., Gao, S., and Buntine, W. (2023c). A survey on out-of-distribution evaluation of neural nlp models.
Li et al., (2023d) Li, X., Zhang, T., Dubois, Y., Taori, R., Gulrajani, I., Guestrin, C., Liang, P., and Hashimoto, T. B. (2023d). Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
Liang et al., (2022) Liang, P., Bommasani, R., Lee, T., Tsipras, D., Soylu, D., Yasunaga, M., Zhang, Y., Narayanan, D., Wu, Y., Kumar, A., et al. (2022). Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.
Liévin et al., (2022) Liévin, V., Hother, C. E., and Winther, O. (2022). Can large language models reason about medical questions? arXiv preprint arXiv:2207.08143.
Lin, (2004) Lin, C.-Y. (2004). Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
Lin et al., (2021) Lin, S., Hilton, J., and Evans, O. (2021). Truthfulqa: Measuring how models mimic human falsehoods. arXiv preprint arXiv:2109.07958.
Lin et al., (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
Lin and Chen, (2023) Lin, Y.-T. and Chen, Y.-N. (2023). Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models. arXiv preprint arXiv:2305.13711.
Liu et al., (2023a) Liu, C., Jin, R., Ren, Y., Yu, L., Dong, T., Peng, X., Zhang, S., Peng, J., Zhang, P., Lyu, Q., Su, X., Liu, Q., and Xiong, D. (2023a). M3ke: A massive multi-level multi-subject knowledge evaluation benchmark for chinese large language models.
Liu et al., (2023b) Liu, H., Ning, R., Teng, Z., Liu, J., Zhou, Q., and Zhang, Y. (2023b). Evaluating the logical reasoning ability of chatgpt and gpt-4.
Liu et al., (2023c) Liu, J., Xia, C. S., Wang, Y., and Zhang, L. (2023c). Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. arXiv preprint arXiv:2305.01210.
LMSYS, (2023) LMSYS (2023). Chatbot arena: Benchmarking llms in the wild with elo ratings. https://lmsys.org.
Lopez-Lira and Tang, (2023) Lopez-Lira, A. and Tang, Y. (2023). Can chatgpt forecast stock price movements? return predictability and large language models. arXiv preprint arXiv:2304.07619.
Lyu et al., (2023a) Lyu, C., Xu, J., and Wang, L. (2023a). New trends in machine translation using large language models: Case examples with chatgpt. arXiv preprint arXiv:2305.01181.
Lyu et al., (2023b) Lyu, Q., Tan, J., Zapadka, M. E., Ponnatapuram, J., Niu, C., Wang, G., and Whitlow, C. T. (2023b). Translating radiology reports into plain language using chatgpt and gpt-4 with prompt learning: Promising results, limitations, and potential. arXiv preprint arXiv:2303.09038.
Ma et al., (2021) Ma, Z., Ethayarajh, K., Thrush, T., Jain, S., Wu, L., Jia, R., Potts, C., Williams, A., and Kiela, D. (2021). Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351–10367.
Manakul et al., (2023a) Manakul, P., Liusie, A., and Gales, M. J. (2023a). Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
Manakul et al., (2023b) Manakul, P., Liusie, A., and Gales, M. J. F. (2023b). Mqag: Multiple-choice question answering and generation for assessing information consistency in summarization.
Margatina et al., (2023) Margatina, K., Wang, S., Vyas, Y., John, N. A., Benajiba, Y., and Ballesteros, M. (2023). Dynamic benchmarking of masked language models on temporal concept drift with multiple views. arXiv preprint arXiv:2302.12297.
McCarthy, (2007) McCarthy, J. (2007). What is artificial intelligence.
Microsoft, (2023) Microsoft (2023). Bing chat. https://www.bing.com/new.
Min et al., (2023) Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. (2023). Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
Nay et al., (2023) Nay, J. J., Karamardian, D., Lawsky, S. B., Tao, W., Bhat, M., Jain, R., Lee, A. T., Choi, J. H., and Kasai, J. (2023). Large language models as tax attorneys: A case study in legal capabilities emergence. arXiv preprint arXiv:2306.07075.
Nie et al., (2019) Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. (2019). Adversarial nli: A new benchmark for natural language understanding. arXiv preprint arXiv:1910.14599.
Nijkamp et al., (2022) Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. (2022). Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474.
Novikova et al., (2017) Novikova, J., Dušek, O., Curry, A. C., and Rieser, V. (2017). Why we need new evaluation metrics for nlg. arXiv preprint arXiv:1707.06875.
Oh et al., (2023) Oh, N., Choi, G.-S., and Lee, W. Y. (2023). Chatgpt goes to the operating room: evaluating gpt-4 performance and its potential in surgical education and training in the era of large language models. Annals of Surgical Treatment and Research, 104(5):269.
OpenAI, (2023a) OpenAI (2023a). https://chat.openai.com/chat.
OpenAI, (2023b) OpenAI (2023b). Gpt-4 technical report.
Orrù et al., (2023) Orrù, G., Piarulli, A., Conversano, C., and Gemignani, A. (2023). Human-like problem-solving abilities in large language models using chatgpt. Frontiers in Artificial Intelligence, 6.
Ott et al., (2023) Ott, S., Hebenstreit, K., Liévin, V., Hother, C. E., Moradi, M., Mayrhauser, M., Praas, R., Winther, O., and Samwald, M. (2023). Thoughtsource: A central hub for large language model reasoning data. arXiv preprint arXiv:2301.11596.
Ouyang et al., (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Pallagani et al., (2023) Pallagani, V., Muppasani, B., Murugesan, K., Rossi, F., Srivastava, B., Horesh, L., Fabiano, F., and Loreggia, A. (2023). Understanding the capabilities of large language models for automated planning. arXiv preprint arXiv:2305.16151.
Papineni et al., (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318.
Parisi et al., (2022) Parisi, A., Zhao, Y., and Fiedel, N. (2022). Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255.
Parrish et al., (2022) Parrish, A., Chen, A., Nangia, N., Padmakumar, V., Phang, J., Thompson, J., Htut, P. M., and Bowman, S. (2022). Bbq: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, pages 2086–2105.
Peña et al., (2023) Peña, A., Morales, A., Fierrez, J., Serna, I., Ortega-Garcia, J., Puente, I., Cordova, J., and Cordova, G. (2023). Leveraging large language models for topic classification in the domain of public affairs. arXiv preprint arXiv:2306.02864.
Peng et al., (1997) Peng, K., Nisbett, R. E., and Wong, N. Y. (1997). Validity problems comparing values across cultures and possible solutions. Psychological methods, 2(4):329.
Pezeshkpour, (2023) Pezeshkpour, P. (2023). Measuring and modifying factual knowledge in large language models. arXiv preprint arXiv:2306.06264.
Phang et al., (2021) Phang, J., Chen, A., Huang, W., and Bowman, S. R. (2021). Adversarially constructed evaluation sets are more challenging, but may not be fair. arXiv preprint arXiv:2111.08181.
Pu and Demberg, (2023) Pu, D. and Demberg, V. (2023). Chatgpt vs human-authored text: Insights into controllable text summarization and sentence style transfer.
Qin et al., (2023) Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., and Yang, D. (2023). Is chatgpt a general-purpose natural language processing task solver? arXiv preprint arXiv:2302.06476.
Radford et al., (2018) Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., et al. (2018). Improving language understanding by generative pre-training.
Ribeiro and Lundberg, (2022) Ribeiro, M. T. and Lundberg, S. (2022). Adaptive testing and debugging of nlp models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3253–3267.
Ribeiro et al., (2020) Ribeiro, M. T., Wu, T., Guestrin, C., and Singh, S. (2020). Beyond accuracy: Behavioral testing of nlp models with checklist. arXiv preprint arXiv:2005.04118.
Riccardi and Desai, (2023) Riccardi, N. and Desai, R. H. (2023). The two word test: A semantic benchmark for large language models. arXiv preprint arXiv:2306.04610.
Rutinowski et al., (2023) Rutinowski, J., Franke, S., Endendyk, J., Dormuth, I., and Pauly, M. (2023). The self-perception and political biases of chatgpt. arXiv preprint arXiv:2304.07333.
Safdari et al., (2023) Safdari, M., Serapio-García, G., Crepy, C., Fitz, S., Romero, P., Sun, L., Abdulhai, M., Faust, A., and Matarić, M. (2023). Personality traits in large language models. arXiv preprint arXiv:2307.00184.
Samaan et al., (2023) Samaan, J. S., Yeo, Y. H., Rajeev, N., Hawley, L., Abel, S., Ng, W. H., Srinivasan, N., Park, J., Burch, M., Watson, R., et al. (2023). Assessing the accuracy of responses by the language model chatgpt to questions regarding bariatric surgery. Obesity Surgery, pages 1–7.
Saparov et al., (2023) Saparov, A., Pang, R. Y., Padmakumar, V., Joshi, N., Kazemi, S. M., Kim, N., and He, H. (2023). Testing the general deductive reasoning capacity of large language models using ood examples. arXiv preprint arXiv:2305.15269.
Sawada et al., (2023) Sawada, T., Paleka, D., Havrilla, A., Tadepalli, P., Vidas, P., Kranias, A., Nay, J. J., Gupta, K., and Komatsuzaki, A. (2023). Arb: Advanced reasoning benchmark for large language models.
Schick et al., (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. (2023). Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761.
Sharma et al., (2023) Sharma, P., Thapa, K., Dhakal, P., Upadhaya, M. D., Adhikari, S., and Khanal, S. R. (2023). Performance of chatgpt on usmle: Unlocking the potential of large language models for ai-assisted medical education. arXiv preprint arXiv:2307.00112.
Shen et al., (2023) Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. (2023). Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580.
Sheng et al., (2021) Sheng, E., Chang, K.-W., Natarajan, P., and Peng, N. (2021). Societal biases in language generation: Progress and challenges. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4275–4293.
Simmons, (2022) Simmons, G. (2022). Moral mimicry: Large language models produce moral rationalizations tailored to political identity. arXiv preprint arXiv:2209.12106.
Singhal et al., (2022) Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al. (2022). Large language models encode clinical knowledge. arXiv preprint arXiv:2212.13138.
Smith et al., (2022) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al. (2022). Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
Song et al., (2023) Song, X., Gupta, A., Mohebbizadeh, K., Hu, S., and Singh, A. (2023). Have large language models developed a personality?: Applicability of self-assessment tests in measuring personality in llms. arXiv preprint arXiv:2305.14693.
Sridhara et al., (2023) Sridhara, G., Mazumdar, S., et al. (2023). Chatgpt: A study on its utility for ubiquitous software engineering tasks. arXiv preprint arXiv:2305.16837.
Srivastava et al., (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. (2022). Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.
Sun et al., (2023) Sun, W., Yan, L., Ma, X., Ren, P., Yin, D., and Ren, Z. (2023). Is chatgpt good at search? investigating large language models as re-ranking agent. arXiv preprint arXiv:2304.09542.
Tao et al., (2023) Tao, Z., Jin, Z., Bai, X., Zhao, H., Feng, Y., Li, J., and Hu, W. (2023). Eveval: A comprehensive evaluation of event semantics for large language models. arXiv preprint arXiv:2305.15268.
Thakur et al., (2021) Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., and Gurevych, I. (2021). Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.
Thirunavukarasu et al., (2023) Thirunavukarasu, A. J., Hassan, R., Mahmood, S., Sanghera, R., Barzangi, K., El Mukashfi, M., and Shah, S. (2023). Trialling a large language model (chatgpt) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care. JMIR Medical Education, 9(1):e46599.
Thrush et al., (2022) Thrush, T., Tirumala, K., Gupta, A., Bartolo, M., Rodriguez, P., Kane, T., Rojas, W. G., Mattson, P., Williams, A., and Kiela, D. (2022). Dynatask: A framework for creating dynamic ai benchmark tasks. arXiv preprint arXiv:2204.01906.
Tian et al., (2018) Tian, Y., Pei, K., Jana, S., and Ray, B. (2018). Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th international conference on software engineering, pages 303–314.
ToolBench, (2023) ToolBench (2023). Open-source tools learning benchmarks. https://github.com/sambanova/toolbench.
Touvron et al., (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. (2023). Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Turing, (2009) Turing, A. M. (2009). Computing machinery and intelligence. Springer.
Valmeekam et al., (2023) Valmeekam, K., Marquez, M., Sreedharan, S., and Kambhampati, S. (2023). On the planning abilities of large language models–a critical investigation. arXiv preprint arXiv:2305.15771.
Valmeekam et al., (2022) Valmeekam, K., Olmo, A., Sreedharan, S., and Kambhampati, S. (2022). Large language models still can’t plan (a benchmark for llms on planning and reasoning about change). arXiv preprint arXiv:2206.10498.
Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Wang et al., (2019) Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32.
Wang et al., (2018) Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. (2018). Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Wang et al., (2023a) Wang, B., Chen, W., Pei, H., Xie, C., Kang, M., Zhang, C., Xu, C., Xiong, Z., Dutta, R., Schaeffer, R., Truong, S. T., Arora, S., Mazeika, M., Hendrycks, D., Lin, Z., Cheng, Y., Koyejo, S., Song, D., and Li, B. (2023a). Decodingtrust: A comprehensive assessment of trustworthiness in gpt models.
Wang and Komatsuzaki, (2021) Wang, B. and Komatsuzaki, A. (2021). Gpt-j-6b: A 6 billion parameter autoregressive language model.
Wang et al., (2021a) Wang, B., Xu, C., Wang, S., Gan, Z., Cheng, Y., Gao, J., Awadallah, A. H., and Li, B. (2021a). Adversarial glue: A multi-task benchmark for robustness evaluation of language models. arXiv preprint arXiv:2111.02840.
Wang et al., (2023b) Wang, C., Cheng, S., Xu, Z., Ding, B., Wang, Y., and Zhang, Y. (2023b). Evaluating open question answering evaluation. arXiv preprint arXiv:2305.12421.
Wang et al., (2023c) Wang, J., Hu, X., Hou, W., Chen, H., Zheng, R., Wang, Y., Yang, L., Huang, H., Ye, W., Geng, X., et al. (2023c). On the robustness of chatgpt: An adversarial and out-of-distribution perspective. In ICLR workshop on Trustworthy and Reliable Large-Scale Machine Learning Models.
Wang et al., (2022) Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., and Yu, P. (2022). Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering.
Wang et al., (2023d) Wang, L., Lyu, C., Ji, T., Zhang, Z., Yu, D., Shi, S., and Tu, Z. (2023d). Document-level machine translation with large language models. arXiv preprint arXiv:2304.02210.
Wang et al., (2023e) Wang, P., Li, L., Chen, L., Zhu, D., Lin, B., Cao, Y., Liu, Q., Liu, T., and Sui, Z. (2023e). Large language models are not fair evaluators. arXiv preprint arXiv:2305.17926.
Wang and Demszky, (2023) Wang, R. E. and Demszky, D. (2023). Is chatgpt a good teacher coach? measuring zero-shot performance for scoring and providing actionable insights on classroom instruction. arXiv preprint arXiv:2306.03090.
Wang et al., (2023f) Wang, X., Li, X., Yin, Z., Wu, Y., and Jia, L. (2023f). Emotional intelligence of large language models.
Wang et al., (2021b) Wang, Y., Wang, W., Joty, S., and Hoi, S. C. (2021b). Codet5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation. arXiv preprint arXiv:2109.00859.
Wang et al., (2023g) Wang, Y., Yu, Z., Wang, J., Heng, Q., Chen, H., Ye, W., Xie, R., Xie, X., and Zhang, S. (2023g). Exploring vision-language models for imbalanced learning. arXiv preprint arXiv:2304.01457.
Wang et al., (2023h) Wang, Y., Yu, Z., Zeng, Z., Yang, L., Wang, C., Chen, H., Jiang, C., Xie, R., Wang, J., Xie, X., et al. (2023h). Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. arXiv preprint arXiv:2306.05087.
Wang et al., (2023i) Wang, Z., Li, R., Dong, B., Wang, J., Li, X., Liu, N., Mao, C., Zhang, W., Dong, L., Gao, J., et al. (2023i). Can llms like gpt-4 outperform traditional ai tools in dementia diagnosis? maybe, but not today. arXiv preprint arXiv:2306.01499.
Wang et al., (2023j) Wang, Z., Xie, Q., Ding, Z., Feng, Y., and Xia, R. (2023j). Is chatgpt a good sentiment analyzer? a preliminary study.
Wei et al., (2022a) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. (2022a). Emergent abilities of large language models. arXiv preprint arXiv:2206.07682.
Wei et al., (2022b) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. (2022b). Emergent abilities of large language models. Trans. Mach. Learn. Res., 2022.
Wei et al., (2023) Wei, T., Luan, J., Liu, W., Dong, S., and Wang, B. (2023). Cmath: Can your language model pass chinese elementary school math test?
White et al., (2023) White, J., Fu, Q., Hays, S., Sandborn, M., Olea, C., Gilbert, H., Elnashar, A., Spencer-Smith, J., and Schmidt, D. C. (2023). A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv preprint arXiv:2302.11382.
Wong, (2015) Wong, T.-T. (2015). Performance evaluation of classification algorithms by k-fold and leave-one-out cross validation. Pattern Recognition, 48(9):2839–2846.
Wu et al., (2023a) Wu, P. Y., Tucker, J. A., Nagler, J., and Messing, S. (2023a). Large language models can be used to estimate the ideologies of politicians in a zero-shot learning setting. arXiv preprint arXiv:2303.12057.
Wu et al., (2023b) Wu, Y., Jia, F., Zhang, S., Wu, Q., Li, H., Zhu, E., Wang, Y., Lee, Y. T., Peng, R., and Wang, C. (2023b). An empirical study on challenging math problem solving with gpt-4. arXiv preprint arXiv:2306.01337.
Wu et al., (2023c) Wu, Z., Qiu, L., Ross, A., Akyürek, E., Chen, B., Wang, B., Kim, N., Andreas, J., and Kim, Y. (2023c). Reasoning or reciting? exploring the capabilities and limitations of language models through counterfactual tasks. arXiv preprint arXiv:2307.02477.
Xu et al., (2023a) Xu, F., Lin, Q., Han, J., Zhao, T., Liu, J., and Cambria, E. (2023a). Are large language models really good logical reasoners? a comprehensive evaluation from deductive, inductive and abductive views. arXiv preprint arXiv:2306.09841.
Xu et al., (2023b) Xu, G., Liu, J., Yan, M., Xu, H., Si, J., Zhou, Z., Yi, P., Gao, X., Sang, J., Zhang, R., Zhang, J., Peng, C., Huang, F., and Zhou, J. (2023b). Cvalues: Measuring the values of chinese large language models from safety to responsibility.
Xu et al., (2023c) Xu, R., Feng, Y., and Chen, H. (2023c). Chatgpt vs. google: A comparative study of search performance and user experience. arXiv preprint arXiv:2307.01135.
Yang and Menczer, (2023) Yang, K.-C. and Menczer, F. (2023). Large language models can rate news outlet credibility. arXiv preprint arXiv:2304.00228.
Yang et al., (2022) Yang, L., Zhang, S., Qin, L., Li, Y., Wang, Y., Liu, H., Wang, J., Xie, X., and Zhang, Y. (2022). Glue-x: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073.
Yu et al., (2023) Yu, J., Wang, X., Tu, S., Cao, S., Zhang-Li, D., Lv, X., Peng, H., Yao, Z., Zhang, X., Li, H., et al. (2023). Kola: Carefully benchmarking world knowledge of large language models. arXiv preprint arXiv:2306.09296.
Yuan et al., (2023a) Yuan, Z., Yuan, F., Song, Y., Li, Y., Fu, J., Yang, F., Pan, Y., and Ni, Y. (2023a). Where to go next for recommender systems? id- vs. modality-based recommender models revisited.
Yuan et al., (2023b) Yuan, Z., Yuan, H., Tan, C., Wang, W., and Huang, S. (2023b). How well do large language models perform in arithmetic tasks? arXiv preprint arXiv:2304.02015.
Zeng et al., (2022) Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al. (2022). Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414.
Zhang et al., (2023a) Zhang, J., Bao, K., Zhang, Y., Wang, W., Feng, F., and He, X. (2023a). Is chatgpt fair for recommendation? evaluating fairness in large language model recommendation. arXiv preprint arXiv:2305.07609.
Zhang et al., (2022) Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V., et al. (2022). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Zhang et al., (2023b) Zhang, S. J., Florin, S., Lee, A. N., Niknafs, E., Marginean, A., Wang, A., Tyser, K., Chin, Z., Hicke, Y., Singh, N., et al. (2023b). Exploring the mit mathematics and eecs curriculum using large language models. arXiv preprint arXiv:2306.08997.
Zhang et al., (2019) Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., and Artzi, Y. (2019). Bertscore: Evaluating text generation with bert. arXiv preprint arXiv:1904.09675.
Zhang et al., (2023c) Zhang, W., Aljunied, S. M., Gao, C., Chia, Y. K., and Bing, L. (2023c). M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. arXiv preprint arXiv:2306.05179.
Zhang et al., (2023d) Zhang, W., Deng, Y., Liu, B., Pan, S. J., and Bing, L. (2023d). Sentiment analysis in the era of large language models: A reality check. arXiv preprint arXiv:2305.15005.
Zhang et al., (2023e) Zhang, X., Li, C., Zong, Y., Ying, Z., He, L., and Qiu, X. (2023e). Evaluating the performance of large language models on gaokao benchmark. arXiv preprint arXiv:2305.12474.
Zhao et al., (2023a) Zhao, W. X., Zhou, K., Li, J., Tang, T., Wang, X., Hou, Y., Min, Y., Zhang, B., Zhang, J., Dong, Z., et al. (2023a). A survey of large language models. arXiv preprint arXiv:2303.18223.
Zhao et al., (2023b) Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N.-M., and Lin, M. (2023b). On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934.
Zheng et al., (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. (2023). Judging llm-as-a-judge with mt-bench and chatbot arena.
Zhong et al., (2023) Zhong, W., Cui, R., Guo, Y., Liang, Y., Lu, S., Wang, Y., Saied, A., Chen, W., and Duan, N. (2023). Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364.
Zhou et al., (2022) Zhou, Y., Muresanu, A. I., Han, Z., Paster, K., Pitis, S., Chan, H., and Ba, J. (2022). Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910.
Zhu et al., (2023) Zhu, K., Wang, J., Zhou, J., Wang, Z., Chen, H., Wang, Y., Yang, L., Ye, W., Gong, N. Z., Zhang, Y., et al. (2023). Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528.
Zhuang et al., (2023) Zhuang, Y., Liu, Q., Ning, Y., Huang, W., Lv, R., Huang, Z., Zhao, G., Zhang, Z., Mao, Q., Wang, S., et al. (2023). Efficiently measuring the cognitive ability of llms: An adaptive testing perspective. arXiv preprint arXiv:2306.10512.
Zhuo et al., (2023a) Zhuo, T. Y., Huang, Y., Chen, C., and Xing, Z. (2023a). Exploring ai ethics of chatgpt: A diagnostic analysis. arXiv preprint arXiv:2301.12867.
Zhuo et al., (2023b) Zhuo, T. Y., Li, Z., Huang, Y., Li, Y.-F., Wang, W., Haffari, G., and Shiri, F. (2023b). On robustness of prompt-based semantic parsing with large pre-trained language model: An empirical study on codex. arXiv preprint arXiv:2301.12868.
Ziegler et al., (2019) Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. (2019). Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593.
Ziems et al., (2023) Ziems, C., Held, W., Shaikh, O., Chen, J., Zhang, Z., and Yang, D. (2023). Can large language models transform computational social science? arXiv preprint arXiv:2305.03514.