MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

Kaining Ying    Fanqing Meng    Jin Wang    Zhiqian Li    Han Lin    Yue Yang    Hao Zhang    Wenbo Zhang    Yuqi Lin    Shuo Liu    Jiayi Lei    Quanfeng Lu    Runjian Chen    Peng Xu    Renrui Zhang    Haozhe Zhang    Peng Gao    Yali Wang    Yu Qiao    Ping Luo    Kaipeng Zhang    Wenqi Shao
Abstract

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 30 LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.

Machine Learning, ICML

1 Introduction

In recent years, Large Vision-Language Models (LVLMs) (Zhang et al., 2023a; Yang et al., 2023a; Liu et al., 2023b) have emerged as powerful tools for advancing artificial intelligence, demonstrating remarkable progress in various domains such as visual dialogue, video analysis and document understanding. Driven by diverse and high-quality instruction fine-tuning data mined from various fields, LVLMs will continue to advance towards multitask AGI (Team, 2023a; Bai et al., 2023). As pointed out in Levels of AGI (Morris et al., 2023), the breadth (generality) of tasks is a fundamental criterion for different levels of AGI. A multitask AGI model can perform a wide range of tasks across different domains with human-like proficiency, which could revolutionize many fields such as personalized education (Latif et al., 2023) and medical diagnosis (Singhal et al., 2023). Therefore, it is crucial to build a comprehensive evaluation benchmark to track multitask AGI development.

However, evaluating LVLMs significantly lags behind their development (Morris et al., 2023; Yue et al., 2023b; Liu et al., 2024b). A line of work attempts to bridge this gap by proposing various multimodal evaluation benchmarks. Examples include LVLM-eHub (Xu et al., 2023), MMBench (Liu et al., 2023c), MME (Fu et al., 2023), and SEED-Bench (Li et al., 2023a), which propose dimensions of multimodal capabilities and corresponding test samples. However, these benchmarks have limited coverage of multimodal tasks while testing rudimentary capabilities like visual recognition and text-scarce OCR. Therefore, they cannot fulfil the requirement of the breadth of tasks (Morris et al., 2023). Moreover, recent LVLMs continue to excel in these benchmarks. For instance, InternLM-XComposer2 (Dong et al., 2024) achieved 2242.7/2800 and 79.6/100 overall performance on MME and MMBench, respectively. Other works, such as MathVista (Lu et al., 2023) and MMMU (Yue et al., 2023a), focus on discipline knowledge understanding and reasoning but are constrained to visual questions with scientific diagram images, limiting their breadth for benchmarking multitask AGI.

Figure 1: Visualization of MMT-Bench. Our MMT-Bench consists of 32 meta-tasks (middle ring) which are decomposed into 162 subtasks (outer ring). For each meta-task, we denote the number of subtasks in it and illustrate one example of the pair of the image and the question (see task hierarchy in Table A2 to Table A4 of Appendix). MMT-Bench can be comprehensive enough to evaluate the multitask performance of LVLMs.

To address this challenge, we introduce MMT-Bench, a new benchmark designed to comprehensively assess LVLMs in multimodal multitask understanding. The breadth of MMT-Bench is reflected in three aspects. First, MMT-Bench is meticulously curated and comprises 31K multi-choice visual questions covering 32 core meta-tasks and a total of 162 subtasks (Fig. 1), 8.1 times as many tasks as MMBench (Liu et al., 2023c). Second, it encompasses 13 image types such as natural scenes, synthetic images, depth maps, text-rich images, paintings, screenshots, point clouds, and medical images (Fig. 2). Such diversity requires the model to interpret a wide variety of visual inputs. Third, MMT-Bench spans multimodal scenarios such as vehicle driving, GUI navigation, and embodied AI, testing 14 kinds of multimodal capabilities including visual recognition, localization, reasoning, OCR, counting, 3D perception, and temporal understanding (Fig. 2).

We assess 30 publicly available LVLMs under various input modes for best evaluation performance. Our findings highlight the significant challenges posed by MMT-Bench. For instance, GPT-4V only achieves 62.0/100 and 55.6/100 overall scores across all subtasks and subtasks except for visual recognition tasks, respectively, indicating significant room for improvement towards multitask AGI. Thanks to the extensive coverage of multimodal tasks, MMT-Bench enables the evaluation of LVLMs using a task map. This facilitates the discovery of both in- and out-of-domain tasks, providing valuable insights for multimodal commercial applications and ongoing efforts to enhance LVLMs. We summarize the findings as follows:

Table 1: The comparison between MMT-Bench and existing evaluation benchmarks. MMT-Bench consists of massive samples and multimodal tasks compared with other benchmarks. I, T, V, and P respectively represent image, text, video, and point cloud.
Benchmark | # Sample | # Meta-task | # Task | Modality | Data Collection: Source | Data Collection: Answer Type
SEED-Bench (Li et al., 2023a) | 19K | 12 | 12 | I + T + V | Annotated | Multi-Choice
MMBench (Liu et al., 2023c) | 3K | 2 | 20 | I + T | Repurposed | Multi-Choice
MM-VET (Yu et al., 2023) | 0.2K | 6 | N/A | I + T | Repurposed | Multi-Choice
MMMU (Yue et al., 2023b) | 11.5K | 6 | 30 | I + T | Annotated | Multi-Choice/Open
Tiny LVLM-eHub (Shao et al., 2023) | 2.1K | 5 | 42 | I + T | Repurposed | Multi-Choice/Open
MMT-Bench | 31K | 32 | 162 | I + T + V + P | Repurposed | Multi-Choice

  • The open-source model InternVL-Chat takes the leading position on MMT-Bench, surpassing closed-source models such as QWen-VL-Plus, GPT-4V, and GeminiProVision.

  • The comprehensive error analyses conducted on 162 multimodal tasks reveal that top-performing LVLMs such as InternVL-Chat, GPT-4V, and GeminiProVision are predominantly prone to perception, reasoning, and knowledge errors.

  • The taxonomy analysis shows that current LVLMs perform well in tasks related to visual recognition and description which are in-domain tasks, yet fall short in tasks related to localization and pixel perception which are out-of-domain tasks.

  • BLIP2, which does not undergo instruction tuning, outperforms most LVLMs that are tuned on millions of instruction-following samples, implying that instruction tuning with data from some tasks can even hurt generalization to other tasks.

  • Certain tasks show improved performance with specific prompting methods, such as multi-image and coordinate-related tasks, as well as those involving visual referring prompts. However, most models do not exhibit improved performance with visual prompting, suggesting potential areas for future enhancement.

  • Model performance significantly improves with an increase in size (7B to 13B) for both LLaVA-v1.5 and LLaVA-v1.5-XTuner. Upgrading the LLM from InternLM to InternLM2 also enhances the performance of LLaVA.

Overall, the contributions of this work are three-fold. i) We build a new evaluation benchmark called MMT-Bench for multimodal multitask comprehension, allowing us to measure progress on the path to multitask AGI. ii) We evaluate various publicly available LVLMs on MMT-Bench, revealing that current LVLMs, including InternVL-Chat, GPT-4V, and GeminiProVision, achieve only moderate performance in multitask intelligence. iii) We present a taskonomy analysis by evaluating LVLMs on a task map built upon MMT-Bench, facilitating the discovery of both in- and out-of-domain tasks relative to current LVLMs. We anticipate that MMT-Bench will inspire the community to push the boundaries of LVLM research and development, driving us closer to the realization of truly intelligent multimodal systems. MMT-Bench is open-sourced at https://github.com/OpenGVLab/MMT-Bench.

Figure 2: An illustration of our pipeline for data collection. First, given a task name, we retrieve its related datasets from the internet. Then we collate them in a uniform data format - metadata. Finally, we generate questions with choices and answers from metadata using manually designed rules or ChatGPT. Our benchmarks cover capabilities evaluation with diverse image types.

2 Related Work

LVLM. As Large Language Models (LLMs) continue to achieve impressive results (Bai et al., 2023; Team, 2023b; Touvron et al., 2023a, b; Zheng et al., 2023; Chung et al., 2022), academic emphasis is increasingly shifting towards the exploration and development of Large Vision-Language Models (LVLMs) to strengthen the multimodal understanding and generative capabilities of models. Some notable open-source LVLMs, such as mPLUG-Owl2 (Ye et al., 2023b), LLaVA (Liu et al., 2023b), and LLaMA-Adapter (Gao et al., 2023; Zhang et al., 2023b), adopt LLMs as their backbone and process visual features through these LLMs, thereby integrating text and visual inputs. In addition, closed-source models like Gemini (Team, 2023a) and GPT-4V (Yang et al., 2023b) have demonstrated remarkable results across numerous tasks. We aim to undertake an in-depth and comprehensive exploration of LVLMs and their capabilities by testing them on massive multimodal tasks.

LVLM Evaluation. Recently, LVLMs have demonstrated remarkable capabilities in handling many vision-language tasks, which makes previous single-task benchmarks (Antol et al., 2015; Hudson & Manning, 2019; Krishna et al., 2017; Lin et al., 2014; Marino et al., 2019) insufficient to provide comprehensive evaluations of current LVLMs. To this end, recent LVLM evaluation benchmarks aim to provide relatively holistic evaluations of the overall reasoning capabilities of LVLMs, such as OwlEval (Ye et al., 2023a), LVLM-eHub (Xu et al., 2023), SEED-Bench (Li et al., 2023a), LAMM (Yin et al., 2023), MM-Vet (Yu et al., 2023) and MMBench (Liu et al., 2023c). However, these benchmarks cover only a small range of multimodal tasks and vision-language skills, making them not comprehensive enough to assess multitask AGI capabilities. In addition, recent studies have presented LVLM benchmarks that require expert-level domain knowledge, such as MathVista (Lu et al., 2023) and MMMU (Yue et al., 2023a). In comparison, our proposed MMT-Bench covers an extensive range of multimodal reasoning capabilities with sufficient test samples from various modalities, as shown in Table 1, and requires expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench poses significant challenges for the current state-of-the-art LVLMs.

Multitask Analysis. Characterizing tasks and establishing inter-task relationships is an effective means of multitask analysis (Ilharco et al., 2023; Achille et al., 2019; Zamir et al., 2018; Wallace et al., 2021), with wide applications in areas such as meta-learning and transfer learning. A notable line of research is Taskonomy (Zamir et al., 2018), which uses transfer learning to model the structure of the space of visual tasks, thereby exploiting the interconnections among visual tasks to avoid redundancy in learning. Task2Vec (Achille et al., 2019) extracts Fisher information as task vectors, which are used in meta-learning. In our paper, thanks to the large amount of task data collected, we evaluate LVLMs on a task map and identify the tasks that are challenging for current LVLMs.

3 MMT-Bench

In this section, we describe how to build the task hierarchy in Sec. 3.1 and the pipeline of data collection in Sec. 3.2.

3.1 Tasks

Hierarchical Task Structure. We use a hierarchical structure to include as many multimodal tasks as possible in MMT-Bench. First, all co-authors brainstorm meta-tasks for multimodal understanding. We then obtain 32 meta-tasks by deduplicating and filtering for important tasks, as depicted in Fig. 1. Second, we decompose each meta-task into several subtasks. A subtask is kept in MMT-Bench according to three criteria: i) whether the subtask examines a basic multimodal capability; ii) whether the subtask challenges current LVLMs; iii) whether test samples for the subtask are publicly accessible. After selection, MMT-Bench comprises 162 subtasks, 3.8 times as many as Tiny LVLM-eHub (42 tasks), which previously contained the most tasks (Shao et al., 2023). The detailed comparison between MMT-Bench and previous benchmarks is provided in Table 1. We also present the whole hierarchical structure in Table A2 of the Appendix.

3.2 Data Collection

We design an efficient pipeline (see Fig. 2) to construct multi-choice visual question evaluation data for each subtask. The data collection is completed by dozens of co-authors specializing in artificial intelligence.

Datasets Search. We conduct comprehensive searches for related datasets using various sources such as Google, Papers with Code, Kaggle, and ChatGPT, based on the name of the subtask. After downloading the datasets, we carefully assess their suitability for evaluating the subtask, ensuring usability and relevance. While most tasks have multiple datasets available, a few have only one publicly accessible dataset.

Metadata Construction. We define a uniform format, the metadata, to collate downloaded datasets. It enables the further generation of visual questions and answers. Each sample of metadata consists of images and meta-information. The meta-information (see Fig. 2) includes the necessary information to generate questions and answers for the evaluation and also includes manual annotations of required capabilities and the type of visual prompt (i.e., input image). For evaluation efficiency, in each task, we keep the maximum number of samples at 200 by random sampling, and each dataset comprises the same number of samples.
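The sampling rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `sample_task`, the per-dataset share computed by floor division, and the cap at the size of the smallest dataset are all assumptions, since the text only states that each task keeps at most 200 samples and that each dataset contributes the same number.

```python
import random


def sample_task(datasets, cap=200, seed=0):
    """Randomly subsample the metadata of one task.

    datasets: dict mapping a dataset name to its list of metadata samples.
    Returns a dict with the same keys and an equal number of samples per
    dataset, such that the task total does not exceed `cap`.
    """
    rng = random.Random(seed)
    # Assumption: split the cap evenly and never ask a dataset for more
    # samples than the smallest dataset can provide.
    per_dataset = cap // len(datasets)
    per_dataset = min(per_dataset, min(len(s) for s in datasets.values()))
    return {name: rng.sample(samples, per_dataset)
            for name, samples in datasets.items()}
```

With three source datasets of 500, 150, and 90 samples, this sketch keeps 66 samples from each, for 198 in total.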

Question and Answer Generation. For each subtask, we generate multi-choice visual questions (with at most eight choices, depending on the task) together with their choices and answers from the metadata. Specifically, depending on the task, we manually design rules or use ChatGPT with well-designed prompts for efficient and high-quality generation. For example, in sketch2image retrieval, we use the corresponding image as the ground-truth answer and generate the other choices by randomly sampling other images from the metadata. In video captioning, we use ChatGPT to write confusing wrong choices.
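As a rough sketch of the rule-based case described above (sketch2image retrieval), the following builds one multi-choice question by sampling distractor images from the metadata. The function name, the metadata layout (sample id mapped to image path), and the question wording are hypothetical; only the idea of using the paired image as the answer and random other images as distractors comes from the text.

```python
import random
import string


def build_retrieval_question(query_id, metadata, num_choices=4, seed=0):
    """Build one multi-choice question for a retrieval-style subtask.

    metadata: dict mapping a sample id to its image path (assumed unique).
    The image paired with `query_id` is the ground-truth answer.
    """
    assert 2 <= num_choices <= 8  # the benchmark uses at most eight choices
    rng = random.Random(seed)
    answer = metadata[query_id]
    # Distractors are drawn at random from the other samples in the metadata.
    pool = [img for sid, img in metadata.items() if sid != query_id]
    choices = rng.sample(pool, num_choices - 1) + [answer]
    rng.shuffle(choices)
    letters = string.ascii_uppercase[:num_choices]
    return {
        "question": "Which image matches the given sketch?",  # hypothetical wording
        "options": dict(zip(letters, choices)),
        "answer": letters[choices.index(answer)],
    }
```

Shuffling the choices keeps the position of the correct option from being predictable, which matters when a frequency-based guessing baseline is reported alongside the models.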

Dataset Statistics. MMT-Bench comprises 31,325 meticulously curated multi-choice questions with 13 input image types such as natural scenes, synthetic images, text-rich images, and medical images (see Fig. 2), covering 32 core meta-tasks and 162 subtasks for multitask multimodal understanding. Compared to previous LVLM benchmarks (Yue et al., 2023a; Xu et al., 2023) that address limited image types and skills, questions in MMT-Bench span diverse multimodal scenarios such as GUI navigation and document understanding, testing 14 kinds of capabilities including visual recognition, localization, reasoning, OCR, counting, 3D perception, and temporal understanding, as shown in Fig. 2. These features ensure that MMT-Bench meets the requirement of task breadth for evaluating multitask AGI.

Table 2: Quantitative results for 30 LVLMs across 32 meta-tasks, with R̄ denoting the average rank. Accuracy is the metric. Each model occupies two rows: in the first row, Overall and R̄ are computed across all subtasks; in the second row, they are computed excluding visual recognition (VR). The maximum value of each meta-task is bolded. Meta-task names are abbreviated, with full terms in Sec. C of the appendix.
Row 1: Model Overall R̄ VR Loc OCR Count HLN IR 3D VC VG DU AR PLP I2IT RR IQT Emo
Row 2: Overall (excl. VR) R̄ (excl. VR) VI MemU VPU AND KD VCR IEJ MIA CIM TU VP MedU AUD DKR EA GN
Frequency Guess 31.7 26.1 30.0 28.2 30.4 28.2 43.4 29.9 26.5 28.2 29.1 37.6 30.0 29.4 30.8 33.5 18.0 30.1
32.2 25.9 52.1 32.8 29.3 44.4 33.7 27.0 30.0 46.5 28.5 29.1 29.5 30.9 29.7 29.4 28.0 29.0
Random Guess 28.5 30.0 27.1 28.1 27.2 25.0 41.6 24.3 25.5 25.0 24.8 30.3 25.4 26.6 21.2 33.4 10.5 25.4
28.9 29.9 50.8 25.5 31.4 36.5 32.2 28.0 25.0 48.5 26.8 27.0 28.8 27.8 26.8 25.4 27.5 24.4
InternVL-Chat-v1.2-34B 63.4 5.7 81.3 59.4 60.5 66.4 82.4 56.3 45.5 82.3 49.4 68.3 52.6 37.4 32.8 55.0 84.0 48.7
58.2 5.7 61.5 62.5 58.2 57.0 62.2 76.0 31.0 82.8 56.8 45.2 41.8 71.8 57.8 49.4 74.5 41.2
Qwen-VL-Plus 62.3 6.7 82.6 55.3 65.6 61.1 69.9 40.7 46.5 86.5 43.6 77.3 53.4 43.1 37.8 53.0 84.5 41.6
56.6 6.8 50.3 61.0 67.5 58.8 55.3 76.5 31.8 81.5 61.3 45.5 33.7 73.3 59.5 46.8 85.0 32.6
GPT-4V 62.0 8.3 85.3 55.6 68.0 51.6 69.6 44.9 42.0 80.3 25.0 69.8 47.7 48.2 31.8 52.5 80.0 45.1
55.5 8.6 47.9 61.0 60.2 51.4 53.6 73.0 43.4 70.2 55.2 44.6 53.3 74.0 55.6 53.4 80.9 39.7
GeminiProVision 61.6 8.3 84.7 43.6 59.5 56.4 65.9 68.4 45.2 80.1 33.0 71.6 57.4 40.3 31.5 58.5 11.0 55.2
55.1 8.5 47.5 75.8 50.9 47.4 49.5 86.5 35.0 70.2 33.3 40.5 46.0 82.6 59.5 49.2 74.5 33.4
LLaVA-NEXT-34B 60.8 7.5 76.7 61.0 64.1 66.3 70.1 38.8 48.5 85.9 56.2 69.1 50.6 41.9 22.8 54.9 76.5 50.3
56.3 7.5 57.8 55.5 57.2 61.2 62.7 75.0 22.2 77.8 43.0 45.4 40.2 61.9 55.1 48.1 80.0 41.4
XComposer2 55.7 11.7 75.3 47.9 43.9 51.0 69.5 32.4 40.5 73.7 42.6 62.0 46.3 43.9 31.5 50.5 8.0 53.6
50.0 11.7 52.6 71.2 56.1 56.2 41.5 83.0 43.8 80.8 61.2 36.6 36.3 53.5 48.8 43.8 50.5 29.4
BLIP2 54.8 12.8 75.1 54.1 48.1 29.8 66.1 27.4 47.8 78.7 33.5 43.0 51.1 46.1 28.2 53.0 14.0 43.1
49.1 12.8 55.6 76.2 39.8 43.7 60.2 77.0 29.8 62.8 73.0 42.7 43.2 60.1 44.6 37.0 80.5 33.4
Yi-VL-34B 54.2 14.3 74.6 47.0 58.0 59.4 65.8 28.8 38.8 74.0 41.5 56.4 40.4 38.4 19.5 51.7 68.5 39.7
48.6 14.3 51.3 56.2 61.2 52.4 49.5 71.5 25.5 66.0 48.0 39.2 32.0 59.6 48.2 44.3 57.0 32.4
Monkey-Chat 53.4 15.5 79.0 40.1 51.0 43.6 63.1 26.8 46.5 68.9 27.5 51.1 49.3 32.2 29.5 61.8 11.0 45.1
46.0 15.8 55.3 69.5 43.6 44.6 36.3 85.5 26.0 58.8 61.7 36.8 33.3 68.0 43.6 38.1 46.0 29.8
DeepSeek-VL-7B 53.2 15.0 75.6 42.0 61.1 44.5 60.6 30.5 47.2 69.1 38.4 51.9 44.8 38.3 23.5 48.8 37.0 43.8
46.5 15.2 47.7 59.8 53.5 45.4 41.0 41.0 38.8 35.0 67.2 33.1 30.7 69.7 48.8 36.4 67.5 36.8
Yi-VL-6B 53.2 14.7 73.5 49.4 53.1 56.2 63.9 26.0 43.5 63.4 42.1 55.2 43.8 35.3 26.8 48.8 47.0 46.1
47.5 14.5 55.8 54.5 49.2 53.0 51.8 65.5 34.2 52.0 43.3 37.6 37.0 60.6 46.9 40.2 48.0 34.8
LLaVA-NEXT-13B 53.0 15.0 74.0 35.6 51.8 59.2 63.6 32.7 50.0 75.0 44.6 53.6 46.5 34.0 26.2 50.0 50.0 44.5
46.8 14.9 57.5 55.0 32.2 49.6 38.9 47.0 18.0 36.5 59.8 38.9 22.5 55.8 55.7 38.5 70.0 41.0
TransCore-M 52.7 13.1 73.6 40.5 50.4 54.5 71.9 27.5 45.0 75.6 35.1 45.3 46.9 38.3 25.0 53.2 15.0 46.3
46.9 12.9 55.6 76.8 51.9 43.7 38.6 85.5 34.2 52.8 65.8 29.7 28.8 61.1 46.5 38.4 39.5 35.6
QWen-VL-Chat 52.5 16.0 77.5 33.7 46.9 46.7 63.9 27.5 45.0 73.0 26.5 51.5 50.9 32.7 30.5 57.4 13.5 45.4
45.4 16.3 50.9 74.2 42.4 40.2 35.9 86.0 30.0 49.2 58.3 37.3 30.8 67.1 45.4 35.6 55.0 30.2
Claude3V-Haiku 52.2 17.7 74.3 44.8 54.4 51.1 63.6 34.6 38.2 67.6 26.9 69.8 46.2 35.5 22.8 50.0 59.5 35.2
46.4 17.7 42.9 53.8 43.2 41.2 53.3 70.5 31.5 34.8 52.5 35.9 34.2 62.7 34.1 40.4 54.5 35.1
XComposer 52.1 17.1 75.4 40.4 44.1 39.9 66.5 49.7 47.0 72.1 27.2 36.6 47.9 39.6 24.5 50.2 14.0 45.9
45.6 17.3 53.4 63.8 40.6 43.4 42.3 78.0 29.0 66.2 52.3 33.1 28.3 55.6 40.8 39.3 38.5 34.2
mPLUG-Owl2 52.0 17.3 76.5 45.8 44.5 47.6 63.4 27.6 45.2 66.6 33.0 42.4 45.2 41.6 25.5 52.0 18.0 42.0
45.0 17.5 58.5 59.0 40.1 49.4 32.9 85.5 30.0 55.0 57.7 31.9 27.3 63.4 45.5 38.1 35.0 27.8
RBDash-v1-13B 51.8 15.7 72.2 42.2 53.6 51.6 66.6 26.3 40.8 75.5 36.9 48.1 47.1 38.3 22.5 55.9 14.0 43.4
46.1 15.3 57.1 67.5 51.4 45.7 33.2 78.0 39.0 32.0 64.2 31.6 25.5 59.3 46.3 38.1 53.5 32.4
LLaVA-v1.5-13B 51.7 15.3 73.8 38.8 51.8 55.1 65.8 27.2 39.8 70.4 37.4 45.7 46.6 37.6 28.0 58.2 13.5 45.3
45.7 15.2 58.1 66.0 43.9 48.3 31.4 79.0 35.8 28.5 62.5 33.3 27.5 58.6 46.6 39.4 40.5 37.5
CogVLM-Chat 51.6 17.5 77.7 24.7 48.5 49.8 66.0 26.1 42.2 69.8 28.8 49.1 46.3 33.2 23.8 61.6 14.0 50.3
44.2 17.9 52.4 75.5 39.8 43.4 28.2 82.0 28.0 70.8 45.8 35.5 28.3 65.9 44.9 36.9 48.0 29.9
ShareGPT4V-7B 51.5 16.4 74.2 36.0 47.8 50.9 62.4 27.8 45.2 71.6 35.4 47.9 46.2 39.2 21.8 59.8 14.0 44.3
45.1 16.4 54.5 70.5 47.1 48.2 26.3 83.0 27.8 38.0 64.3 32.1 30.0 60.8 46.1 38.9 42.0 28.9
LLaVA-NEXT-7B 51.1 18.1 73.3 29.5 52.0 56.8 59.9 28.7 43.2 69.8 37.0 49.7 47.9 32.6 22.8 49.0 47.5 48.1
44.6 18.0 57.8 54.0 38.5 44.3 34.6 42.5 18.8 32.5 67.8 39.1 23.3 55.5 53.5 37.0 65.0 31.6
LLaVA-v1.5-13B-XTuner 51.1 16.8 72.5 40.7 46.8 54.1 66.5 26.4 47.5 68.8 35.6 47.0 44.2 38.3 26.0 52.4 14.0 51.0
45.1 16.5 54.4 66.5 47.9 52.0 28.8 82.0 39.2 37.0 56.8 28.3 28.3 49.1 44.4 37.3 33.5 40.9
LLaVA-InternLM2-7B 50.8 17.5 73.3 38.9 49.5 51.8 67.8 27.7 49.5 66.4 36.9 37.7 43.7 35.1 14.2 58.0 0.0 51.1
44.4 17.4 52.3 62.5 45.1 57.2 35.2 83.0 34.2 55.8 58.2 26.8 18.5 57.8 45.1 33.7 35.5 35.2
LLaVA-v1.5-7B-XTuner 50.2 19.5 72.5 41.1 46.0 49.9 62.1 26.0 45.5 66.4 35.3 42.8 45.8 42.5 25.5 53.9 11.5 44.2
43.9 19.3 60.1 56.5 42.6 47.2 28.4 80.5 32.2 41.2 63.2 29.9 24.2 52.5 43.4 37.2 32.0 30.5
SharedCaptioner 49.9 19.6 72.8 41.8 47.8 46.2 63.1 27.0 44.2 61.9 27.0 39.5 46.7 33.5 25.0 59.5 14.5 39.9
43.2 19.5 55.1 53.8 45.4 38.3 33.6 82.5 20.2 57.8 56.8 32.6 28.7 59.4 44.7 38.4 45.0 29.6
LLaVA-InternLM-7B 49.7 19.6 70.1 38.7 47.6 46.0 62.0 25.5 42.0 65.0 26.5 43.9 45.6 38.3 25.0 52.4 14.0 47.0
43.9 19.3 57.5 58.2 45.6 46.5 33.2 75.5 33.0 57.0 59.7 28.0 27.3 52.0 42.2 38.1 46.5 37.6
LLaVA-v1.5-7B 49.5 20.3 72.8 34.3 45.0 47.5 61.6 26.1 44.8 68.1 34.0 40.8 46.6 36.0 22.2 58.0 12.5 42.5
43.1 20.3 57.6 70.5 33.3 49.1 31.6 81.0 27.8 37.5 62.3 31.7 27.5 56.8 45.1 35.6 42.5 20.4
LLaMA-Adapter-v2-7B 40.4 27.5 62.3 32.5 35.0 30.1 46.5 24.1 33.8 34.8 25.2 30.2 43.9 33.1 18.2 44.9 11.0 36.0
34.1 27.4 36.4 40.5 33.8 30.4 34.9 71.0 33.2 42.2 35.8 31.1 25.8 52.0 29.1 32.0 25.0 29.9
VisualGLM-6B 38.6 27.1 55.0 33.1 33.8 31.1 39.2 26.0 36.8 40.5 31.1 39.1 39.2 32.4 26.8 43.8 14.0 33.1
33.9 27.0 28.9 44.8 27.1 34.5 35.2 65.0 28.0 35.8 48.2 30.8 23.5 44.0 26.2 29.6 37.5 21.1

4 Experiments

In this section, we conduct a comprehensive evaluation of 30 LVLMs on MMT-Bench. Sec. 4.1 presents the selected LVLMs and the evaluation methods. The quantitative evaluation of each meta-task is provided in Sec. 4.2. We present the analysis of specific tasks with different prompt methods in Sec. 4.3. Furthermore, we give an error analysis of three representative LVLMs in Sec. 4.4.

4.1 Evaluation Details

Selected LVLMs. For completeness, we test 30 representative LVLMs varying in parameters, vision encoders (InternVL (Chen et al., 2023b), EVA-CLIP-ViT (Sun et al., 2023), CLIP-ViT (Radford et al., 2021)), and LLMs (QWen (Bai et al., 2023), InternLM (Team, 2023b), LLaMA (Touvron et al., 2023a, b), Vicuna (Zheng et al., 2023), Flan-T5 (Chung et al., 2022)). For details, see Appendix D.1.

Evaluation Methods. In MMT-Bench, samples are in a multi-choice format, e.g., ‘What is this? Options: (A) Dog (B) Cat’. To extract the choice from LVLMs’ responses, we follow OpenCompass’ protocol (Contributors, 2023a): 1) Check if the response includes option letters (A/B); 2) Check for option content (‘dog’/‘cat’); 3) Use ChatGPT for extraction. If these steps fail, we set the model selection as option letter Z to avoid random assignment (Yue et al., 2023a). Accuracy is the primary metric.
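The three-step extraction with the Z fallback can be sketched as below. This is a simplified illustration rather than the OpenCompass implementation: the regular expressions are assumptions, and `llm_extract` is a hypothetical callable standing in for the ChatGPT-based step.

```python
import re


def extract_choice(response, options, llm_extract=None):
    """Map a free-form model response to an option letter.

    options: dict such as {"A": "Dog", "B": "Cat"}.
    llm_extract: optional callable (response, options) -> letter or None,
    a placeholder for the ChatGPT-based extraction step.
    Returns an option letter, or "Z" when every step fails.
    """
    letters = "".join(options)  # e.g. "AB"
    # Step 1: an option letter, either parenthesized anywhere ("(B)") or
    # leading the response on its own ("A", "A.", "A:").
    match = re.search(rf"\(([{letters}])\)", response) or re.match(
        rf"\s*([{letters}])\s*(?:[.:)]|$)", response)
    if match:
        return match.group(1)
    # Step 2: the content of exactly one option appears in the response.
    lowered = response.lower()
    hits = [key for key, text in options.items() if text.lower() in lowered]
    if len(hits) == 1:
        return hits[0]
    # Step 3: defer to an LLM-based extractor when one is supplied.
    if llm_extract is not None:
        guess = llm_extract(response, options)
        if guess in options:
            return guess
    # Fallback: "Z" never matches a real option, so the sample counts as wrong.
    return "Z"
```

Returning Z rather than a random letter means an unparseable response is always scored as incorrect, so accuracy is not inflated by lucky guesses.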

4.2 Overall Evaluation

This section evaluates LVLMs on MMT-Bench alongside the Random Guess and Frequency Guess baselines. We report the overall score for all meta-tasks as well as the best performance on each meta-task in Table 2. The detailed results of each subtask are provided in Sec. L of the Appendix. Various prompt settings for all tasks are investigated. We summarize the key findings as follows.

i) The Comprehensive Challenge of MMT-Bench. The benchmark poses significant challenges: even advanced models such as InternVL-Chat, GPT-4V, and GeminiProVision achieve just 63.4%, 62.0%, and 61.6% accuracy, respectively, indicating substantial room for improvement. Notably, when Visual Recognition (VR), one of its strongest areas with a score of 84.7%, is removed, GeminiProVision’s overall performance drops to 55.1%, which is far from satisfactory. The varied task dimensions of MMT-Bench demand wide-ranging capabilities for strong performance, underscoring the benchmark’s breadth and rigor.

ii) Open-Source vs. Closed-Source LVLMs. The performance of most open-source models lags behind that of closed-source models. However, the leading open-source LVLM, InternVL-Chat-V1.2-34B, has demonstrated remarkable performance, outperforming proprietary models such as GPT-4V and GeminiProVision in overall accuracy. This suggests that by scaling model size, optimizing training regimes, and leveraging diverse high-quality data, open-source LVLMs can rival and even exceed advanced proprietary models. It is an encouraging result for the open-source community and paves the way for more high-performance yet cost-effective solutions in academia and industry.

iii) The Influence of LLMs and Model Scaling. As shown in Table 2, model performance improves with an increase in size (7B to 13B) for both LLaVA-v1.5 (49.5 to 51.7 overall) and LLaVA-v1.5-XTuner (50.2 to 51.1). Upgrading the LLM from InternLM to InternLM2 also enhances the performance of LLaVA (49.7 to 50.8), suggesting that larger or improved LLMs boost multitask performance with unchanged training data and visual encoders.

iv) Model Performance across Different Meta-Tasks. Most LVLMs excel in Visual Recognition (VR) and Visual Captioning (VC) tasks, highlighting their ability to recognize ‘what’ an object is and to describe the content of an image. However, most LVLMs struggle with fine-grained perception tasks (localization, pixel-level perception, etc.) and complex reasoning tasks (e.g., image evaluation judgment).

v) BLIP2 without Instruction Tuning. BLIP2 stands out among open-source models: without instruction-following training, it outperforms most LLaVA variants trained with extensive instruction-following data. Although instruction-tuned models can give responses that align better with human preference than BLIP2 in open-set QA on some tasks (Liu et al., 2023b), they perform worse than BLIP2 in the closed-set setting of MMT-Bench. This reflects MMT-Bench’s multitask challenges and suggests using the taxonomy of MMT-Bench to expand supervised fine-tuning datasets in future work.

4.3 Specific Task and Prompt Methods Analysis

In this section, we evaluate specific tasks using different prompts for LVLMs.

Figure 3: (a)-(h): Comparing the performance of LVLMs between settings of multiple-images prompt (denoted as M) and single-image prompt (denoted as S). Please check Appendix D.2 for the full task names of task name abbreviations. (i): Comparison of different prompting methods for visual referring prompting-related tasks. Here we select 14 subtasks from the MMT-Bench. We only report the average accuracy here. Zoom in for better view.

Prompting LVLMs with multi-images vs single-image. Here we explore the effects of multi-image and single-image prompts on the performance of LVLMs. To this end, we collect 28 tasks in MMT-Bench that usually require multiple images as input, such as image retrieval and video captioning. For multi-image prompting, we first evaluate LVLMs that are inherently designed to support multiple images as input (dubbed Multi-Images LVLMs), including mPLUG-Owl2, QWen-VL-Chat, and GeminiProVision. For a more comprehensive comparison, we also assess LVLMs that are mainly trained on single-image prompts (dubbed Single-Image LVLMs), including BLIP2, SharedCaptioner, ShareGPT4V-7B, Monkey, and LLaVA-v1.5-7B. Following previous studies (Dai et al., 2023; Li et al., 2023c), we input each image individually to Single-Image LVLMs and concatenate all output visual embeddings before feeding them into the LLM. The multi-image prompts designed for Multi-Images LVLMs and Single-Image LVLMs are summarized in Sec. D.2 of the Appendix. For single-image prompting, we manually combine multiple images into one image and feed it into the LVLMs (see examples in Fig. 1).

The detailed performance comparisons are presented in Fig. 3(a)-(h). We have several observations: i) Multi-images tasks posed significant challenges to current LVLMs, where the best accuracy achieved by GeminiProVision is only 53.8. ii) For Multi-Images LVLMs, providing multiple images as prompts instead of a single image boosted the overall performance on these tasks, demonstrating their capabilities to extract beneficial information from multiple images. For instance, for the task of face retrieval (FR), the performance of GeminiProVision increased from 30.5 to 92.5 when providing multiple images as visual prompts. iii) For Single-Image LVLMs, multi-image prompts also help improve the overall performance of most models, except for Monkey. To our surprise, BLIP2 achieved significant performance gain when switching to a multi-image prompt setting, especially on tasks like general action recognition (GAR) and video captioning (VC). These results highlight the potential of LVLMs to learn more robust unified representations of multiple modalities.

Most LVLMs Show Poor Generalization in Visual Referring Prompting. Visual referring prompting is an impressive prompting technique that entails direct image edits like drawing bounding boxes or masks to guide LVLMs to focus on specific regions (Yang et al., 2023a). We select 14 tasks (see Sec. D.3) involving visual referring prompting to explore the influence of different prompting methods on the final results. We compared three additional settings: using text prompts for bounding boxes in normalized ([0,1]) and pixel ([0, h or w]) formats, and combining visual and text prompts. As depicted in Fig. 3(i), visual prompting (blue curve) significantly lags behind other settings, a disparity mainly attributed to the lack of visual prompting data in most LVLMs during the Supervised Fine-Tuning (SFT) stage.
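To make the two text-prompt settings concrete, the sketch below renders a bounding box either in normalized [0, 1] coordinates or in pixel coordinates. The function name, the two-decimal rounding, and the phrase "the region [...]" are illustrative assumptions; the prompt templates actually used are those given in the appendix.

```python
def bbox_text_prompt(box, width, height, normalized=True):
    """Render a bounding box as text for a referring prompt.

    box: (x1, y1, x2, y2) in pixel coordinates of a width x height image.
    normalized=True divides x by width and y by height so values lie in [0, 1];
    normalized=False keeps integer pixel coordinates.
    """
    x1, y1, x2, y2 = box
    if normalized:
        coords = [x1 / width, y1 / height, x2 / width, y2 / height]
        body = ", ".join(f"{c:.2f}" for c in coords)
    else:
        body = ", ".join(str(int(c)) for c in box)
    return f"the region [{body}]"  # hypothetical phrasing


# Example: the same box on a 640 x 480 image in both formats.
print(bbox_text_prompt((64, 48, 320, 240), 640, 480))         # normalized
print(bbox_text_prompt((64, 48, 320, 240), 640, 480, False))  # pixel
```

The normalized form is independent of image resolution, whereas the pixel form requires the model to relate coordinates to the actual image size.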

4.4 Error Analysis

Figure 4: Distribution of error types for GPT-4V, GeminiProVision and InternVL-Chat-V1.2.

To analyze the error distribution of LVLMs on the MMT-Bench, we examined three LVLMs: GPT-4V, GeminiProVision, and InternVL-Chat-V1.2 (InternVL). Specifically, we randomly selected up to 5 incorrectly answered questions per subtask for each model. Task-specific experts among the co-authors then analyzed these error samples to identify the underlying reasons for the mistakes, yielding the error distribution presented in Fig. 4. For definitions and case studies of these six error types, please refer to Sec. G in the appendix.
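The sampling and tallying procedure described above could look roughly like this. The record fields (`subtask`, `prediction`, `answer`) and both function names are hypothetical, and the error labels in the real study come from manual expert analysis rather than from code.

```python
import random
from collections import Counter, defaultdict


def sample_errors(records, k=5, seed=0):
    """Pick up to k incorrectly answered questions per subtask.

    records: iterable of dicts with keys "subtask", "prediction", "answer".
    """
    rng = random.Random(seed)
    wrong = defaultdict(list)
    for record in records:
        if record["prediction"] != record["answer"]:
            wrong[record["subtask"]].append(record)
    return {task: rng.sample(errs, min(k, len(errs)))
            for task, errs in wrong.items()}


def error_distribution(labels):
    """Turn a list of expert-assigned error labels into percentages."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}
```

Capping the sample at a fixed number per subtask keeps subtasks with many errors from dominating the resulting distribution.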

As shown in Fig. 4, perception error stands out as the most common type of error across all models, with GPT-4V exhibiting a significantly lower perception error rate (51%) compared to GeminiProVision (76.9%) and InternVL (67.2%), indicating its superior performance in perception tasks. Reasoning error emerges as the second most prevalent error type, with InternVL having the highest reasoning error rate (14.8%), followed by GeminiProVision (10.4%) and GPT-4V (9.94%), highlighting the challenges all models face in complex reasoning tasks.

Additionally, the proportion of lack of knowledge errors is similar across the three models, ranging from 6.99% to 9.0%. This suggests that insufficient knowledge is a common issue. However, GPT-4V has notably higher error rates in lack of capability (19%) and refusing to answer (6.11%) than the other models, which may be attributed to its more honest approach in acknowledging its limitations and refusing to answer certain questions.

InternVL stands out for its high error rate in failing to follow instructions (6.64%), significantly surpassing GPT-4V (2.99%) and GeminiProVision (1.14%), indicating its struggle in comprehending and executing instructions effectively. On the other hand, annotation error contributes the least to the overall error distribution, implying that the quality of data annotation is high and has a minimal impact on model performance.

To enhance the performance of these LVLMs, future improvements should focus on addressing the specific error types identified. By targeting perception and reasoning capabilities, tackling the lack of knowledge, and refining the ability to follow instructions, developers can work towards creating more accurate and reliable models. GPT-4V’s honest approach to its limitations also highlights the importance of transparency in AI systems, which can be further explored and incorporated into future model designs.

5 Taskonomy Analysis

Thanks to the extensive coverage of tasks in the MMT-Bench, we can evaluate the multimodal performance of LVLMs on a task map. In this way, the roles of different tasks in multimodal capability can be systematically interpreted by analyzing relationships between tasks in the map.

Figure 5: Visualization of task maps and hierarchical clustering with task map. Please zoom in for better visualizations.

5.1 Analytical Tools

Task map. To investigate the relationships between subtasks, we represent each subtask as a task vector, following Ilharco et al. (2023). Formally, a task vector is defined as the difference between the weights fine-tuned on task data $D_t$ and the initial weights $W_0$ of a probing model, as given by $V_t = \arg\min_{W} \mathcal{L}(W \mid D_t) - W_0$, where the subscript $t$ denotes the task and $\mathcal{L}$ is the task loss. Three steps are adopted to obtain $V_t$. First, we use pre-trained QwenVL-Chat as the probing model because QwenVL-Chat achieves good results on most subtasks, which helps acquire promising task vectors. Second, we construct task data $D_t$ by adapting all multi-choice VQA samples into instruction-following data for each subtask. Third, unlike TaskVec (Ilharco et al., 2023), which fine-tunes the whole model, we fine-tune QwenVL-Chat for 3 epochs using LoRA (Hu et al., 2021) for all 162 subtasks, which reduces the length of the task vector from 9.6B to 3.5M and consumes less storage. With the task vectors, a task map can be constructed as $\mathcal{G} = \{G_{st}\}_{s,t=1}^{T}$, where $G_{st} = 1 - \cos(V_s, V_t)$ denotes the cosine distance between tasks $s$ and $t$, and $T = 162$ denotes the total number of subtasks. By definition, $0 \le G_{st} \le 2$.
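A minimal sketch of the task-map construction is given below. Short Python lists stand in for the flattened LoRA parameters, and the function names (`task_vector`, `cosine_distance`, `task_map`) are illustrative assumptions rather than part of the released codebase.

```python
import math


def task_vector(finetuned, initial):
    """V_t: element-wise difference between fine-tuned and initial weights."""
    return [w - w0 for w, w0 in zip(finetuned, initial)]


def cosine_distance(u, v):
    """G_st = 1 - cos(u, v), which lies in [0, 2]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def task_map(vectors):
    """Pairwise cosine-distance matrix over all task vectors."""
    T = len(vectors)
    return [[cosine_distance(vectors[s], vectors[t]) for t in range(T)]
            for s in range(T)]


# Toy example: three tasks fine-tuned from the same 2-parameter initial weights.
W0 = [0.5, 0.5]
finetuned = [[1.5, 0.5], [0.5, 1.5], [-0.5, 0.5]]
vectors = [task_vector(w, W0) for w in finetuned]
G = task_map(vectors)
for row in G:
    print(row)
```

In this toy case the first and third task vectors point in opposite directions, so their distance reaches the upper bound of 2, while orthogonal vectors are at distance 1.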

Table 3: The relationship between task distance threshold δ (normalized by the maximum task distance on the task map) and the consistency of LVLMs performance ranking τδ. We see that LVLMs have a more consistent performance ranking when two tasks get closer to each other.
δ 1 1/2 1/4 1/6 1/8
τδ 0.29 0.31 0.32 0.41 0.60

Ranking correlation: Kendall’s tau τ. To quantitatively evaluate LVLMs on a task map, we use Kendall’s tau $\tau$ to measure the ranking correlation between the performance sequences of LVLMs on different subtasks. The intuition is that model A would be superior to model B on task $t$ if model A performs better than model B on task $s$ when the task distance $G_{st}$ is small. Kendall’s tau is defined as $\tau_{st} = \frac{2}{M(M-1)} \sum_{1 \le m < n \le M} \mathrm{sign}\big((P_m^s - P_n^s)(P_m^t - P_n^t)\big)$, where $P_m^s$ denotes the performance of model $m$ on task $s$ and $M$ is the number of LVLMs. The function $\mathrm{sign}(\cdot)$ returns $-1$ if the argument is negative and $1$ otherwise. When $\tau_{st} = 1$, LVLMs have a completely consistent performance ranking on tasks $s$ and $t$.
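The definition above translates directly into a few lines of code. The sketch below follows the stated sign convention literally, so a tied pair (product equal to zero) counts as +1; this differs from tie-corrected variants such as tau-b found in statistics libraries. The function names and the toy accuracy values are assumptions for illustration.

```python
from itertools import combinations


def sign(x):
    """Return -1 for negative arguments and 1 otherwise, as defined in the text."""
    return -1 if x < 0 else 1


def kendall_tau(perf_s, perf_t):
    """Ranking correlation between model performances on tasks s and t.

    perf_s[m] and perf_t[m] hold the score of model m on each task.
    """
    M = len(perf_s)
    total = sum(
        sign((perf_s[m] - perf_s[n]) * (perf_t[m] - perf_t[n]))
        for m, n in combinations(range(M), 2)
    )
    return 2.0 * total / (M * (M - 1))


# Toy accuracies of four models on three tasks.
task_a = [40, 55, 60, 72]
task_b = [31, 45, 52, 66]   # same model ranking as task_a
task_c = [70, 61, 50, 42]   # reversed model ranking

tau_same = kendall_tau(task_a, task_b)
tau_reversed = kendall_tau(task_a, task_c)
print(tau_same, tau_reversed)
```

Identical rankings give a value of 1 and fully reversed rankings give a value of -1.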

5.2 Findings on Task Map

LVLMs obtain a more consistent performance ranking on tasks closer to each other. We assess whether LVLMs achieve consistent performance on two tasks that are close to each other. To measure this consistency, we employ Kendall’s tau as introduced in Sec. 5.1. Specifically, we consider all subtask pairs whose distance is within a threshold and calculate their average Kendall’s tau, given by $\tau_\delta = \frac{1}{T} \sum_{s=1}^{T} \frac{1}{|\Delta_s|} \sum_{t \in \Delta_s} \tau_{st}$, where $\Delta_s = \{t : G_{st} \le \delta\}$ and $\delta$ is a threshold used to control the proximity between two tasks. As shown in Table 3, as the threshold $\delta$ decreases, the task distance becomes smaller and $\tau_\delta$ increases. This suggests that LVLMs obtain a more consistent performance ranking on tasks closer to each other. Hence, the performance of LVLMs on a new task can be predicted if it is close to one of the MMT-Bench subtasks.
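The thresholded average can be sketched as below. Following the caption of Table 3, the threshold is scaled by the maximum distance on the task map. The sketch reads the definition of the neighbourhood literally, so each task is its own neighbour; whether self-pairs were excluded in the reported numbers is not stated, so this is an assumption, as are the function name and the toy matrices.

```python
def average_tau(G, tau, delta):
    """Average Kendall's tau over task pairs with normalized distance <= delta.

    G[s][t] is the task distance and tau[s][t] the ranking correlation.
    """
    T = len(G)
    g_max = max(max(row) for row in G)  # normalization constant from Table 3's caption
    total = 0.0
    for s in range(T):
        # Neighbourhood of task s; it always contains s itself because G[s][s] = 0.
        neighbours = [t for t in range(T) if G[s][t] <= delta * g_max]
        total += sum(tau[s][t] for t in neighbours) / len(neighbours)
    return total / T


# Toy task map and ranking correlations for three tasks.
G = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
tau = [[1.0, 0.5, -0.5],
       [0.5, 1.0, 0.0],
       [-0.5, 0.0, 1.0]]

tau_all = average_tau(G, tau, delta=1.0)    # every pair is included
tau_near = average_tau(G, tau, delta=0.5)   # only pairs with distance <= 1.0
print(tau_all, tau_near)
```

In this toy setting, tightening the threshold raises the average correlation, which mirrors the trend reported in Table 3.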

Out-of-Domain (OoD) tasks discovery. The OoD tasks mean tasks that the current model struggles to handle. Discovering OoD tasks can provide insights for future evaluation efforts and the development of stronger LVLMs. Since model performance on different tasks is related to task distances, we hypothesize that OoD tasks would be grouped in local regions on the task map. Therefore, we conduct hierarchical clustering on the task map to find OoD tasks. Specifically, 162 subtasks are grouped into 12 clusters as shown in Fig. 5. We use two criteria to identify clusters containing OoD tasks. First, LVLMs would achieve poor performance on OoD tasks. In this regard, we calculate the average multimodal performance within each task cluster over all LVLM models. Second, the performance of LVLMs on OoD tasks would be inconsistent with the overall multimodal score in Table 2 because LVLMs with competitive overall scores would even fail to solve OoD tasks. Hence, we calculate the average ranking correlation τ within each cluster. We present these statistics in Table 4 and provide a detailed analysis with the clustering results in Appendix A.
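For reference, a bare-bones agglomerative clustering over a precomputed task-distance matrix might look like the sketch below. Average linkage is an assumption made here, since the linkage criterion is not specified in the text, and the function name is hypothetical. In practice, a library routine for hierarchical clustering would be the more efficient choice.

```python
def agglomerative_clusters(G, num_clusters):
    """Merge the two closest clusters until num_clusters remain.

    G is a symmetric distance matrix; average linkage is assumed.
    """
    clusters = [[i] for i in range(len(G))]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(G[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i stays valid
    return [sorted(c) for c in clusters]


# Toy task map: tasks 0 and 1 are close, as are tasks 2 and 3.
G = [[0.0, 0.1, 0.9, 1.0],
     [0.1, 0.0, 0.8, 0.9],
     [0.9, 0.8, 0.0, 0.2],
     [1.0, 0.9, 0.2, 0.0]]
groups = agglomerative_clusters(G, num_clusters=2)
print(groups)
```

Applied to the full task map, the same procedure would be run with 162 tasks and 12 target clusters.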

We can see that clusters 8, 9, and 11 achieve low multimodal accuracy and ranking correlation τ. In Sec. 4.2, we found that the models struggle with fine-grained visual tasks, such as detection. Through the analysis of these clusters, we similarly find that current multimodal large models cannot perform fine-grained visual cognition and understanding of positional and spatial relationships, such as localization and detection tasks. Moreover, they exhibit poor performance in tasks related to new data structures or types of images, showing a lack of proficiency in handling tasks related to GUI and special data structures like tables.

Table 4: The number of tasks within each cluster after hierarchical clustering, and the Kendall’s tau τ between the average performance of the model on these tasks and the overall performance of the model.
Cluster 1 2 3 4 5 6 7 8 9 10 11 12
# Tasks 11 53 16 16 9 8 7 16 4 9 10 3
τ 0.54 0.73 0.57 0.48 -0.05 0.62 0.63 0.34 0.12 0.57 0.38 0.59
Acc 40.4 64.7 61.9 39.9 55.9 30.0 33.1 40.2 31.4 61.2 33.2 50.7

In-domain tasks discovery. In-domain tasks are tasks that most current multimodal large models can handle correctly. Discovering in-domain tasks guides the commercial application of LVLMs in specific scenarios. In contrast to OoD tasks, we identify in-domain tasks by looking for clusters with large ranking correlation τ and high multimodal accuracy. From Table 4, we can see that clusters 2, 3, and 10 achieve relatively high accuracy and large ranking correlation τ. We observe that current multimodal large models possess strong high-level visual comprehension capabilities, enabling them to effectively handle visual recognition tasks, even when dealing with specialized images such as medical images, which is also found in Sec. 4.2. Moreover, they benefit from the powerful LLMs to accurately describe images. We provide a detailed analysis along with the clustering results in Appendix A.
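The two selection criteria can be made concrete with the statistics of Table 4. The numeric cut-offs in the sketch below are illustrative assumptions chosen for this example; the text identifies the clusters qualitatively and does not state explicit thresholds.

```python
# Cluster id -> (Kendall's tau, average accuracy), copied from Table 4.
TABLE4 = {
    1: (0.54, 40.4), 2: (0.73, 64.7), 3: (0.57, 61.9), 4: (0.48, 39.9),
    5: (-0.05, 55.9), 6: (0.62, 30.0), 7: (0.63, 33.1), 8: (0.34, 40.2),
    9: (0.12, 31.4), 10: (0.57, 61.2), 11: (0.38, 33.2), 12: (0.59, 50.7),
}


def ood_clusters(stats, tau_max=0.4, acc_max=45.0):
    """Clusters with both low ranking correlation and low accuracy.

    The thresholds are hypothetical and only serve this illustration.
    """
    return sorted(c for c, (tau, acc) in stats.items()
                  if tau < tau_max and acc < acc_max)


def in_domain_clusters(stats, tau_min=0.55, acc_min=60.0):
    """Clusters with both high ranking correlation and high accuracy.

    The thresholds are hypothetical and only serve this illustration.
    """
    return sorted(c for c, (tau, acc) in stats.items()
                  if tau > tau_min and acc > acc_min)


ood = ood_clusters(TABLE4)
in_domain = in_domain_clusters(TABLE4)
print("OoD clusters:", ood)
print("In-domain clusters:", in_domain)
```

Requiring both criteria matters: cluster 5 has the lowest τ but moderate accuracy, and cluster 6 has the lowest accuracy but a high τ, so neither is flagged as OoD under these cut-offs.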

6 Conclusion and Discussion

In this work, we introduce MMT-Bench, a comprehensive benchmark designed to evaluate LVLMs in multimodal multitask understanding. The breadth of MMT-Bench is highlighted by its meticulously curated dataset of 31,325 multi-choice questions covering 162 multimodal tasks. Our evaluation reveals significant challenges that MMT-Bench poses for current LVLMs. We present a taskonomy analysis of LVLMs on the task map, allowing us to predict the performance on a new task. Our goal with MMT-Bench is to measure the progress on the path to multitask AGI. We acknowledge that MMT-Bench may not be sufficient as a standard for determining whether multitask AGI has been achieved, as it is impossible to include all multimodal tasks. However, we believe that strong performance on MMT-Bench is necessary for a multitask AGI. We will continue to expand the task set of MMT-Bench. We believe that MMT-Bench will inspire further research and development in LVLMs, bringing us closer to the realization of truly intelligent multimodal systems.

Broader Impact. The development and widespread adoption of MMT-Bench as a benchmark for evaluating large vision-language models (LVLMs) have the potential to significantly impact the field of artificial intelligence. While MMT-Bench offers valuable insights and guidance for advancing LVLM research, it is important to consider its broader impact, including ethical considerations and potential societal consequences.

One potential positive impact of MMT-Bench is its role in driving advancements in LVLM technology, leading to improved performance and capabilities in various multimodal tasks. This could benefit numerous applications, such as visual dialogue, video analysis, and document understanding, ultimately enhancing user experiences and productivity.

However, it is crucial to recognize and address potential negative impacts as well. One of the primary limitations of MMT-Bench is its reliance on curated data, which may inadvertently introduce biases based on the sources and methodologies used for data collection. For example, the performance of each meta-task is obtained by taking the average over all subtasks, which may lead to biased assessment because meta-tasks comprise different numbers of subtasks. Moreover, the selection of tasks and subtasks in MMT-Bench may only partially capture the diversity of real-world scenarios, leading to a limited understanding of LVLMs’ capabilities across different domains and populations. Furthermore, the data collection process might disproportionately represent certain demographics or contexts, which can lead to biased evaluations of LVLMs’ performance.

Another concern is that the benchmark’s emphasis on performance metrics such as overall scores and task-specific accuracies may oversimplify the evaluation process and obscure nuanced differences in LVLMs’ performance. This could mask disparities in model performance across demographic groups or domains, contributing to the perpetuation of biases and inequities in AI systems. We are dedicated to collecting as many multimodal tasks as possible into our MMT-Bench for unbiased evaluation.

References

  • Achille et al. (2019) Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., and Perona, P. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6430–6439, 2019.
  • AI et al. (2024) 01.AI, Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., Yu, K., Liu, P., Liu, Q., Yue, S., Yang, S., Yang, S., Yu, T., Xie, W., Huang, W., Hu, X., Ren, X., Niu, X., Nie, P., Xu, Y., Liu, Y., Wang, Y., Cai, Y., Gu, Z., Liu, Z., and Dai, Z. Yi: Open foundation models by 01.ai, 2024.
  • Anthropic (2023) Anthropic. Claude, 2023. URL https://www.anthropic.com. Accessed: 2023-04-18.
  • Antol et al. (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433, 2015.
  • Bai et al. (2023) Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  • Chen et al. (2023a) Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
  • Chen et al. (2023b) Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., and Dai, J. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
  • Chung et al. (2022) Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models, 2022. URL https://arxiv.org/abs/2210.11416.
  • Contributors (2023a) Contributors, O. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023a.
  • Contributors (2023b) Contributors, T.-M. Transcore-m. https://github.com/PCIResearch/TransCore-M, 2023b.
  • Contributors (2023c) Contributors, X. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023c.
  • Dai et al. (2023) Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  • Ding et al. (2021) Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., Yang, H., et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
  • Dong et al. (2024) Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., Zhang, W., Li, Y., Yan, H., Gao, Y., Zhang, X., Li, W., Li, J., Chen, K., He, C., Zhang, X., Qiao, Y., Lin, D., and Wang, J. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
  • Fu et al. (2023) Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • Gao et al. (2023) Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., and Qiao, Y. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  • Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  • Hudson & Manning (2019) Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6700–6709, 2019.
  • Ilharco et al. (2023) Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023.
  • Krishna et al. (2017) Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
  • Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
  • Latif et al. (2023) Latif, E., Mai, G., Nyaaba, M., Wu, X., Liu, N., Lu, G., Li, S., Liu, T., and Zhai, X. Agi: Artificial general intelligence for education, 2023.
  • Li et al. (2023a) Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
  • Li et al. (2023b) Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b.
  • Li et al. (2023c) Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023c.
  • Li et al. (2023d) Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., and Bai, X. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023d.
  • Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
  • Liu et al. (2023a) Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2023a.
  • Liu et al. (2023b) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning, 2023b.
  • Liu et al. (2024a) Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.
  • Liu et al. (2024b) Liu, S., Ying, K., Zhang, H., Yang, Y., Lin, Y., Zhang, T., Li, C., Qiao, Y., Luo, P., Shao, W., and Zhang, K. Convbench: A multi-turn conversation evaluation benchmark with hierarchical capability for large vision-language models, 2024b.
  • Liu et al. (2023c) Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c.
  • Lu et al. (2024) Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., Sun, Y., Deng, C., Xu, H., Xie, Z., and Ruan, C. Deepseek-vl: Towards real-world vision-language understanding, 2024.
  • Lu et al. (2023) Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  • Marino et al. (2019) Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3195–3204, 2019.
  • Morris et al. (2023) Morris, M. R., Sohl-dickstein, J., Fiedel, N., Warkentin, T., Dafoe, A., Faust, A., Farabet, C., and Legg, S. Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023.
  • Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.
  • RBDash-Team (2023) RBDash-Team. Rbdash. https://github.com/RBDash-Team/RBDash, 2023.
  • Shao et al. (2023) Shao, W., Hu, Y., Gao, P., Lei, M., Zhang, K., Meng, F., Xu, P., Huang, S., Li, H., Qiao, Y., et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729, 2023.
  • Singhal et al. (2023) Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
  • Sun et al. (2023) Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
  • Team (2023a) Team, G. Gemini: A family of highly capable multimodal models, 2023a.
  • Team (2023b) Team, I. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023b.
  • Team (2023c) Team, Q. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023c.
  • Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models, 2023a.
  • Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b.
  • Wallace et al. (2021) Wallace, B., Wu, Z., and Hariharan, B. Can we characterize tasks without labels or features? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1245–1254, 2021.
  • Wang et al. (2023) Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., and Tang, J. Cogvlm: Visual expert for pretrained language models. 2023.
  • Xu et al. (2023) Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., and Luo, P. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
  • Yang et al. (2023a) Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and Wang, L. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023a.
  • Yang et al. (2023b) Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and Wang, L. The dawn of lmms: Preliminary explorations with gpt-4v(ision). arXiv preprint arXiv:2309.17421, 2023b.
  • Yang et al. (2023c) Yang, Z., Liu, J., Han, Y., Chen, X., Huang, Z., Fu, B., and Yu, G. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023c.
  • Ye et al. (2023a) Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023a.
  • Ye et al. (2023b) Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian, Q., Zhang, J., Huang, F., and Zhou, J. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023b.
  • Yin et al. (2023) Yin, Z., Wang, J., Cao, J., Shi, Z., Liu, D., Li, M., Sheng, L., Bai, L., Huang, X., Wang, Z., et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023.
  • Yu et al. (2023) Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  • Yue et al. (2023a) Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023a.
  • Yue et al. (2023b) Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023b.
  • Zamir et al. (2018) Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3712–3722, 2018.
  • Zhang et al. (2023a) Zhang, P., Dong, X., Wang, B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Zhang, W., Yan, H., Zhang, X., Li, W., Li, J., Chen, K., He, C., Zhang, X., Qiao, Y., Lin, D., and Wang, J. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023a.
  • Zhang et al. (2023b) Zhang, R., Han, J., Liu, C., Gao, P., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., and Qiao, Y. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
  • Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.

In this appendix, we provide further details as follows:

  • Sec. A: Presents hierarchical clustering and more analyses on the task map constructed from our MMT-Bench.

  • Sec. B: Includes details on sample size, visual input types, and capabilities of LVLMs evaluated for each subtask.

  • Sec. C: Enumerates task abbreviations used throughout the paper.

  • Sec. D: Presents detailed model configurations and experimental details in multi-images and visual prompting.

  • Sec. E: Compares the performance on tasks involving pixel coordinates and normalized coordinates.

  • Sec. F: Compares the performance of LVLMs on different image types and multimodal capabilities.

  • Sec. G: Illustrates error cases of GPT-4V, GeminiProVision, and InternVL-Chat on 32 meta-tasks in MMT-Bench.

  • Sec. H: Compares MMT-Bench with other benchmarks on OCR-related tasks.

  • Sec. I: Presents some details about the benchmark construction.

  • Sec. J: Discusses the OpenCompass protocol used in MMT-Bench and other alternatives.

  • Sec. K: Gives the computational resources used in evaluation.

  • Sec. L: Provides the detailed performance of 30 models across all 162 subtasks on MMT-Bench.

Appendix A Task Map

We perform hierarchical clustering on the task map, as shown in Fig. 5. With the number of clusters set to 12, we analyze the clustering results of the task map and the model performance on the corresponding tasks. Here, we list the names of the tasks within each cluster in Table F.

Figure A1: Visualization of model performance on different tasks. Different colours signify the respective categories formed after clustering, arranged from left to right, starting from the first category through to the twelfth. Please zoom in for better visualizations.

Out-of-Domain (OoD) tasks discovery. We can see that clusters 8, 9, and 11 achieve low multimodal accuracy and ranking correlation τ. From these clusters, we find that current multimodal large models lack the ability to perform fine-grained visual cognition and understanding of positional and spatial relationships, such as localization and detection tasks. Moreover, they exhibit poor performance in tasks related to new data structures or types of images, showing a lack of proficiency in handling tasks related to GUI and special data structures like tables.

  • Cluster 8 mainly involves detection, tracking, and localization tasks, all of which are related to the localization of objects within images. This indicates that current large multimodal models lack fine-grained visual cognition and understanding of positional and spatial relationships.

  • Tasks in cluster 9 are centered around GUI navigation, a novel task type requiring strong visual understanding, object localization, and expert knowledge in operating mobile devices (Yang et al., 2023c). This suggests that current large multimodal models need further optimization for GUI-related tasks.

  • Apart from detection and localization tasks, cluster 11 also includes tasks involving the recognition of special images or their conversion into structured text. The former requires models to possess spatial cognition and fine-grained visual capabilities, while the latter demands robust OCR abilities and extensive knowledge (such as understanding and outputting the basic structure of code or tables). The LVLMs we tested currently fall short in this respect.

In-Domain tasks discovery. From Table 4, we can see that clusters 2, 3, and 10 achieve relatively high accuracy and large ranking correlation τ. We observe that current multimodal large models possess strong high-level visual comprehension capabilities, enabling them to effectively handle visual recognition tasks, even when dealing with specialized images such as medical images. Moreover, they benefit from the powerful LLMs to accurately describe images.

  • Cluster 2 mainly comprises visual recognition tasks, which require the model to possess certain high-level visual capabilities, yet these tasks are relatively simple. Examining Table 2 and Fig. A1, we observe that the model’s performance within this cluster is generally good. This validates that the current multimodal large models possess fundamental abilities for visual-semantic understanding, allowing them to fulfil recognition tasks.

  • Cluster 3 mainly includes visual recognition tasks as well, yet extends to cover sophisticated visual understanding tasks that require primary specialist knowledge, such as medicine and emotion. Within this cluster, the model demonstrates large τ and high accuracy, suggesting that current multimodal models pay attention to tasks necessitating the infusion of domain-specific knowledge, beyond just natural images. This implies a certain ability to handle problems in specialized fields.

  • In Cluster 10, LVLMs achieve good performance on tasks related to the visual description of the image. This indicates that current large multimodal models can describe images well, which likely stems from the fact that these models are typically tuned on massive image-text pairs.

Appendix B Hierarchical Structure of MMT-Bench

In Table A2 to Table A4, we present all 32 meta-tasks from MMT-Bench, encompassing a total of 162 subtasks. These tables include details on sample size, visual input types, and capabilities of LVLMs evaluated for each subtask.

Table A1: Abbreviations used in this paper and their corresponding full terms.
Abbreviation | Full Term | Abbreviation | Full Term
Meta-Task
VR | Visual Recognition | VI | Visual Illusion
Loc | Localization | MemU | Meme Understanding
OCR | OCR | VPU | Visual Prompt Understanding
Count | Counting | AND | Anomaly Detection
HLN | Hallucination | KD | Keypoint Detection
IR | Image Retrieval | VCR | Visual Commonsense Reasoning
3D | 3D | IEJ | Image Evaluation Judgement
VC | Visual Captioning | MIA | Multiple Image Analysis
VG | Visual Grounding | CIM | Cross Image Matching
DU | Doc Understanding | TU | Temporal Understanding
AR | Action Recognition | VCo | Visual Code
PLP | Pixel Level Perception | MedU | Medical Understanding
I2IT | Image-to-image Translation | AUD | Autonomous Driving
RR | Relation Reasoning | DKR | Discipline Knowledge Reasoning
IQT | Intelligence Quotient Test | EA | Embodied AI
Emo | Emotion | GN | GUI Navigation
Subtask
AQS | Action Quality Assessment | SODRD | Salient Object Detection RGBD
FECR | Facial Expression Change Recognition | SLR | Sign Language Recognition
FR | Face Retrieval | SOT | Single Object Tracking
GAR | General Action Recognition | S2IR | Sketch2image Retrieval
HR | Handwritten Retrieval | SD | Spot the Diff
I2IR | Image2image Retrieval | SS | Spot the Similarity
IC | Image Colorization | TA | Temporal Anticipation
MVU | Meme Video Understanding | TL | Temporal Localization
ME | MEVIS | TO | Temporal Ordering
MIC | Multiple Image Captioning | T2IR | Text2image Retrieval
NIP | Next Image Prediction | 3DCR | 3D CAD Recognition
OSD | One-shot Detection | 3DIR | 3D Indoor Recognition
PRe | Person Reid | VR | Vehicle Retrieval
PT | Point Tracking | VC | Video Captioning

Appendix C Task Abbreviations

Given the extensive number of tasks and models tested within the benchmark, we employ abbreviations to condense the manuscript. The abbreviations used throughout the paper are shown in Table A1.

Appendix D More Experimental Details

D.1 LVLMs Model Details

Table A5 summarizes the LVLMs used in this paper, including their parameter sizes, visual encoders, and LLMs. Note that we follow OpenCompass’ protocol (Contributors, 2023a) to conduct the evaluation. Inference time varies across models. For instance, the smaller LLaVA-v1.5-7B (Liu et al., 2023a) model takes only 12 minutes to complete the evaluation using 8 GPUs, while the larger InternVL-Chat-V1.2-34B model (Chen et al., 2023b) requires 79 minutes and around 80GB of memory. Our open-source codebase supports multi-GPU distributed inference, which effectively accelerates the inference process.

D.2 Multi-Images Prompt Experimental Details

For the 28 tasks that require multiple images as input, Tables A6-A9 give the full task name for each task abbreviation. The same tables also present the prompt examples designed for Single-Image LVLMs and Multi-Images LVLMs.
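The two prompt styles can be illustrated with a minimal sketch modelled on the retrieval examples in Table A6. The function names and the exact whitespace are assumptions made for illustration; they are not taken from the MMT-Bench codebase.

```python
from typing import List

IMAGE_TOKEN = "<image>"  # image placeholder, written as in Tables A6-A9


def format_options(options: List[str]) -> str:
    """Render options as lettered lines: 'A. ...', 'B. ...', and so on."""
    return "\n".join(f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options))


def candidate_options(num_candidates: int) -> List[str]:
    """Option texts 'Candidate 1' .. 'Candidate N'."""
    return [f"Candidate {i}" for i in range(1, num_candidates + 1)]


def single_image_style_prompt(num_candidates: int, target: str) -> str:
    """Single-image style: all placeholders first, then text describing their order."""
    images = IMAGE_TOKEN * (num_candidates + 1)  # one query plus the candidates
    question = (
        f"Please retrieve the most similar {target} to the query in the candidates. "
        "The first image is the query image and the remaining images are "
        f"candidates from Candidate 1 to Candidate {num_candidates}."
    )
    options = format_options(candidate_options(num_candidates))
    return f"Question: {images} {question}\nOptions:\n{options}"


def multi_image_style_prompt(num_candidates: int, target: str) -> str:
    """Multi-image style: each placeholder is interleaved with the text naming it."""
    candidates = ", ".join(
        f"Candidate {i}: {IMAGE_TOKEN}" for i in range(1, num_candidates + 1)
    )
    question = (
        f"Please retrieve the most similar {target} to the query: {IMAGE_TOKEN} "
        f"in the candidates: {candidates}."
    )
    options = format_options(candidate_options(num_candidates))
    return f"Question: {question}\nOptions:\n{options}"


if __name__ == "__main__":
    print(single_image_style_prompt(4, "person"))
    print()
    print(multi_image_style_prompt(4, "person"))
```

Both prompts carry the same five image placeholders and the same options; only the position of the placeholders relative to the text differs.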

D.3 Visual Referring Prompting Experimental Details

In Section 4.3, we compare the efficacy of visual prompting with alternative prompting strategies across 14 tasks. These encompass human interaction understanding, social relation recognition, human-object interaction recognition, animal keypoint detection, vehicle keypoint detection, human keypoint detection, clothes keypoint detection, scene text recognition, interactive segmentation, instance captioning, multiple instance captioning, one-shot detection, single object tracking, and counting by visual prompting.

Appendix E Pixel Coordinates vs. Normalized Coordinates

In Fig. A2, we analyze the performance across 19 detection-related tasks, specifically point tracking, image matting, pixel recognition, polygon localization, pixel localization, depth estimation, MEVIS, remote sensing object detection, rotated object detection, small object detection, camouflage object detection, salient object detection in RGB-D, transparent object detection, face detection, object detection, salient object detection in RGB, referring detection, reason segmentation, and image dense captioning. These tasks span Localization, Pixel-level Perception, and Visual Captioning, and we compare the outcomes under two coordinate formats: pixel coordinates and normalized coordinates. Notably, GeminiProVision lags behind top open-source LVLMs like BLIP2 and XComposer2, which have been extensively trained with detection data. We attribute the preference of most models for normalized coordinates to the use of normalized coordinates in their training instruction templates.
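The relation between the two coordinate formats can be sketched as follows. The box layout [x, y, w, h] follows the prompt examples in Table A7; normalizing to the range [0, 1] by image width and height is an assumption, since the exact normalized format is not spelled out here, and the function names are hypothetical.

```python
from typing import List, Sequence


def pixel_to_normalized(box: Sequence[float], width: int, height: int) -> List[float]:
    """Convert a pixel [x, y, w, h] box to [0, 1] coordinates (assumed convention)."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    x, y, w, h = box
    # Horizontal quantities scale with the width, vertical ones with the height.
    return [x / width, y / height, w / width, h / height]


def normalized_to_pixel(box: Sequence[float], width: int, height: int) -> List[int]:
    """Convert a normalized [x, y, w, h] box back to integer pixel coordinates."""
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    x, y, w, h = box
    return [round(x * width), round(y * height), round(w * width), round(h * height)]


if __name__ == "__main__":
    # Image size and box taken from the MEVIS prompt example in Table A7.
    pixel_box = [203, 0, 1011, 944]
    normalized_box = pixel_to_normalized(pixel_box, width=1920, height=945)
    print([f"{v:.3f}" for v in normalized_box])
    print(normalized_to_pixel(normalized_box, width=1920, height=945))
```

A model trained on normalized templates sees values that do not depend on image resolution, which is one plausible reason for the preference noted above.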

Appendix F Analysis on Images Types and Capabilities

Performance with Different Visual Types. We compare the performance of 20 LVLMs across 13 types of visual input in Fig. A3. Most LVLMs struggle with Scientific Diagrams because the associated tasks are difficult: many of them, including Science and "Ravens Progressive Matrices", require complex reasoning, a capability that current LVLMs largely lack.

Performance Across Multimodal Capabilities. We also compare the performance of 20 LVLMs across 14 multimodal capabilities in Fig. A4. GeminiProVision again shows a clear advantage on most capabilities, especially retrieval and multi-image analysis (which involve recognizing and matching multiple images), where it vastly outperforms the open-source LVLMs. This advantage likely stems from GeminiProVision’s support for multi-image input and its strong generalization ability, and it suggests that open-source models should focus more on multi-image and video understanding.
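One plausible way to obtain per-type or per-capability scores is to average the accuracy of every subtask tagged with that visual input type or capability in Tables A2 to A4. The sketch below assumes this unweighted averaging, which is not confirmed here as the aggregation actually used; the function name is hypothetical and the accuracy values are placeholders, not results from the paper.

```python
from collections import defaultdict
from typing import Dict, List


def accuracy_by_tag(
    subtask_accuracy: Dict[str, float],
    subtask_tags: Dict[str, List[str]],
) -> Dict[str, float]:
    """Average subtask accuracy per tag (a capability or a visual input type).

    A subtask with several tags contributes to each of them.
    """
    buckets: Dict[str, List[float]] = defaultdict(list)
    for subtask, accuracy in subtask_accuracy.items():
        for tag in subtask_tags.get(subtask, []):
            buckets[tag].append(accuracy)
    return {tag: sum(values) / len(values) for tag, values in buckets.items()}


if __name__ == "__main__":
    # Capability tags as listed in Table A2.
    tags = {
        "Reason Seg": ["Visual Reasoning", "Visual Localization"],
        "Referring Detection": ["Visual Localization"],
        "Doc Vqa": ["Document Understanding"],
    }
    # Placeholder accuracies for illustration only (NOT reported results).
    accuracy = {"Reason Seg": 0.5, "Referring Detection": 0.25, "Doc Vqa": 0.75}
    for tag, score in sorted(accuracy_by_tag(accuracy, tags).items()):
        print(f"{tag}: {score:.3f}")
```

The same function works for visual input types by passing the "Visual Input Type" column as the tag mapping instead of the "Capability" column.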

Details of task clustering on the task map of our MMT-Bench.
Meta-Task Subtask # subtasks
Cluster ID: 1
Visual Prompt Understanding Visual Prompt Understanding, Som (Set-of-marks) Recognition 2
Pixel Level Perception Image Matting 1
Visual Recognition Color Recognition, Abstract Visual Recognition 2
Discipline Knowledge Reasoning Science, Tech Engineering, Health Medicine, Humanities Social Science, Business, Art Design 6
Cluster ID: 2
Visual Recognition Waste Recognition, Logo and Brand Recognition, Animals Recognition, Weapon Recognition, Celebrity Recognition, Shape Recognition, Age Gender Race Recognition, Rock Recognition, Painting Recognition, Gesture Recognition, Vehicle Recognition, Astronomical Recognition, Fashion Recognition, Musical Instrument Recognition, Disaster Recognition, Sports Recognition, Building Recognition, Texture Material Recognition, Plant Recognition, Film and Television Recognition, Animated Character Recognition, Electronic Object Recognition, Scene Recognition, National Flag Recognition, Profession Recognition, Weather Recognition, Food Recognition 27
Relation Reasoning Human Object Interaction Recognition, Human Interaction Understanding 2
Action Recognition Image-based Action Recognition, Sign Language Recognition, General Action Recognition 4
Emotion Scene Emotion Recognition, Artwork Emotion Recognition, Facial Expression Recognition, Micro Expression Recognition, Body Emotion Recognition 5
Image Evaluation Judgement Lvlm Response Judgement 1
Visual Commonsense Reasoning WhoopsVQA 1
Hallucination Attribute Hallucination 1
Counting Counting by Visual Prompting, Crowd Counting 2
Medical Understanding Other Biological Attributes 1
Autonomous Driving Traffic Sign Understanding 1
OCR Font Recognition, Scene Text Recognition 2
Pixel Level Perception Pixel Recognition 1
Anomaly Detection Face Mask Anomaly Detection 1
Multiple Image Analysis Spot the Diff 1
Visual Captioning Instance Captioning 1
Doc Understanding Clock Reading, Doc VQA 2
Meme Understanding Meme Image Understanding 1
Cluster ID: 3
Medical Understanding Medical Modality Recognition, Lesion Grading, Disease Diagnose, Anatomy Identification 4
Visual Captioning Multiple Image Captioning, Writing Poetry from Image 2
Emotion Facial Expression Change Recognition 1
Visual Recognition Image Season Recognition, Sculpture Recognition, Chemical Apparatus Recognition, Landmark Recognition, Religious Recognition 5
Hallucination Relation Hallucination 1
Relation Reasoning Social Relation Recognition 1
OCR Handwritten Text Recognition 1
Temporal Understanding Temporal Anticipation 1
Cluster ID: 4
Intelligence Quotient Test Ravens Progressive Matrices 1
Temporal Understanding Temporal Localization 1
Autonomous Driving Traffic Participants Understanding, Temporal Sequence Understanding, Multiple View Image Understanding 3
Counting Counting by Category, Counting by Reasoning 2
Hallucination Order Hallucination 1
Doc Understanding Visual Document Information Extraction, Chart VQA 2
Action Recognition Action Quality Assessment, 2
3D 3D Cad Recognition, 3D Indoor Recognition 2
Anomaly Detection Industrial Produce Anomaly Detection 1
Image Evaluation Judgement Image Quality Assessment 1
Pixel Level Perception Depth Estimation 1
Cluster ID: 5
Multiple Image Analysis Spot the Similarity 1
Visual Illusion Color Assimilation, Geometrical Relativity, Color Constancy, Color Contrast, Geometrical Perspective 5
Autonomous Driving Traffic Light Understanding 1
Visual Recognition Deepfake Detection 1
Anomaly Detection Helmet Anomaly Detection 1
Cluster ID: 6
Image Retrieval Vehicle Retrieval, Image2image Retrieval, Sketch2image Retrieval, Face Retrieval, Text2image Retrieval, Handwritten Retrieval, Person Reid 7
Image-to-image Translation Image Colorization 1
Cluster ID: 7
Visual Code Eqn2latex, 2
Keypoint Detection Clothes Keypoint Detection 1
OCR Handwritten Mathematical Expression Recognition 1
Pixel Level Perception Interactive Segmentation 1
Temporal Understanding Temporal Ordering 1
Visual Captioning Image Dense Captioning 1
Action Recognition Gaze Estimation 1
Cluster ID: 8
Localization Salient Object Detection RGB, Camouflage Object Detection, Face Detection, Object Detection, Small Object Detection, Salient Object Detection RGBD, Rotated Object Detection, Remote Sensing Object Detection, Transparent Object Detection 9
Visual Grounding Referring Detection, Reason Seg 2
Cross Image Matching Point Tracking, One Shot Detection, 3
Image-to-image Translation Jigsaw Puzzle Solving 1
Cross Image Matching Single Object Tracking 1
Pixel Level Perception Pixel Localization 1
Cluster ID: 9
GUI Navigation Web Shopping, GUI General, Google Apps, GUI Install 4
Cluster ID: 10
Visual Captioning Multiple Instance Captioning, Image Captioning Paragraph, Image Captioning 3
Anomaly Detection Traffic Anomaly Detection 1
Doc Understanding Chart to Text 1
Hallucination Exist Hallucination 1
Relation Reasoning Scene Graph Recognition 1
Embodied AI Navigation 1
Anomaly Detection Behavior Anomaly Detection 1
Cluster ID: 11
Doc Understanding Table Structure Recognition, Chart to Table 2
Keypoint Detection Furniture Keypoint Detection, Vehicle Keypoint Detection, Human Keypoint Detection, Animal Keypoint Detection 4
Pixel Level Perception Polygon Localization, 2
Temporal Understanding Next Image Prediction 1
Visual Code Sketch2code, Screenshot2code 2
Cluster ID: 12
Meme Understanding Meme Video Understanding 1
Temporal Understanding Mevis 1
Visual Captioning Video Captioning 1
Table A2: MMT-Bench subtask details (part 1): including sample number, visual input types, and evaluated LVLM capabilities.
Subtask Name Sample Num Visual Input Type Capability
Visual Grounding
Reason Seg 196 Natural Image Visual Reasoning,Visual Localization
Referring Detection 200 Natural Image Visual Localization
Doc Understanding
Doc Vqa 200 Text-rich Image Document Understanding
Visual Document Information Extraction 200 Text-rich Image Document Understanding
Chart To Text 200 Chart Image Document Understanding
Chart To Table 200 Chart Image Document Understanding
Clock Reading 200 Abstract Image Visual Recognition,Document Understanding
Chart Vqa 200 Chart Image Document Understanding
Table Structure Recognition 46 Chart Image Document Understanding
Action Recognition
Gaze Estimation 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Image Based Action Recognition 200 Natural Image Visual Recognition
General Action Recognition 200 Natural Image Visual Recognition,Multi-Images Analysis
Action Quality Assessment 200 Natural Image Visual Recognition,Multi-Images Analysis,Expert Knowledge Utilization
Sign Language Recognition 200 Natural Image Visual Recognition,Multi-Images Analysis
Localization
Remote Sensing Object Detection 200 Remote Sensing Image Visual Recognition,Visual Localization
Rotated Object Detection 90 Remote Sensing Image Visual Recognition,Visual Localization
Small Object Detection 200 Natural Image Visual Recognition,Visual Localization
Camouflage Object Detection 200 Natural Image Visual Recognition,Visual Localization
Salient Object Detection Rgbd 200 Natural Image,Depth Map Visual Localization
Transparent Object Detection 200 Natural Image Visual Recognition,Visual Localization
Face Detection 200 Natural Image Visual Recognition,Visual Localization
Object Detection 200 Natural Image Visual Recognition,Visual Localization
Salient Object Detection Rgb 200 Natural Image Visual Localization
Visual Recognition
Deepfake Detection 200 Natural Image,Synthetic Image Visual Recognition,Visual Reasoning,Expert Knowledge Utilization
Weather Recognition 194 Natural Image Visual Recognition
Image Season Recognition 200 Natural Image Visual Recognition
Gesture Recognition 200 Natural Image Visual Recognition
Musical Instrument Recognition 200 Natural Image Visual Recognition
Food Recognition 200 Natural Image Visual Recognition
Landmark Recognition 50 Natural Image Visual Recognition,Expert Knowledge Utilization
Scene Recognition 200 Natural Image Visual Recognition
Animals Recognition 200 Natural Image Visual Recognition
Chemical Apparatus Recognition 200 Natural Image Visual Recognition
Rock Recognition 200 Natural Image Visual Recognition,Expert Knowledge Utilization
Fashion Recognition 200 Natural Image Visual Recognition
Logo And Brand Recognition 200 Natural Image Visual Recognition
Astronomical Recognition 94 Natural Image Visual Recognition,Expert Knowledge Utilization
Painting Recognition 200 Painting Image Visual Recognition,Expert Knowledge Utilization
Color Recognition 200 Synthetic Image Visual Recognition
Plant Recognition 200 Natural Image Visual Recognition
Shape Recognition 200 Synthetic Image Visual Recognition
Profession Recognition 200 Natural Image Visual Recognition
Building Recognition 200 Natural Image Visual Recognition,Expert Knowledge Utilization
Electronic Object Recognition 200 Natural Image Visual Recognition
Sports Recognition 200 Natural Image Visual Recognition
Disaster Recognition 200 Natural Image Visual Recognition
Celebrity Recognition 200 Natural Image Visual Recognition
Vehicle Recognition 200 Natural Image Visual Recognition
National Flag Recognition 200 Synthetic Image Visual Recognition
Abstract Visual Recognition 200 Abstract Image Visual Recognition
Animated Character Recognition 200 Synthetic Image Visual Recognition
Texture Material Recognition 200 Natural Image Visual Recognition
Film And Television Recognition 200 Synthetic Image Visual Recognition,Expert Knowledge Utilization
Sculpture Recognition 50 Natural Image Visual Recognition,Expert Knowledge Utilization
Age Gender Race Recognition 200 Natural Image Visual Recognition
Weapon Recognition 200 Natural Image Visual Recognition
Religious Recognition 200 Natural Image,Synthetic Image Visual Recognition,Expert Knowledge Utilization
Waste Recognition 200 Natural Image Visual Recognition,Expert Knowledge Utilization
Table A3: MMT-Bench subtask details (part 2): including sample number, visual input types, and evaluated LVLM capabilities.
Subtask Name Sample Num Visual Input Type Capability
Gui Navigation
Gui General 200 Screenshot Image Visual Reasoning,Visual Localization
Google Apps 200 Screenshot Image Visual Reasoning,Visual Localization
Web Shopping 200 Screenshot Image Visual Reasoning,Visual Localization
Gui Install 200 Screenshot Image Visual Reasoning,Visual Localization
OCR
Font Recognition 200 Text-rich Image OCR
Handwritten Text Recognition 100 Text-rich Image OCR
Handwritten Mathematical Expression Recognition 100 Text-rich Image OCR
Scene Text Recognition 200 Natural Image,Text-rich Image OCR
Image-to-image Translation
Jigsaw Puzzle Solving 200 Natural Image Visual Recognition,Visual Reasoning
Image Colorization 200 Natural Image Pixel Perception
Temporal Understanding
Next Img Prediction 200 Visual Mark Temporal Understanding
Mevis 200 Natural Image Temporal Understanding
Temporal Anticipation 200 Natural Image Temporal Understanding
Temporal Ordering 200 Natural Image Temporal Understanding
Temporal Localization 193 Natural Image Temporal Understanding
Relation Reasoning
Social Relation Recognition 200 Natural Image Visual Recognition,Visual Reasoning
Human Object Interaction Recognition 200 Natural Image Visual Recognition,Visual Reasoning
Scene Graph Recognition 200 Natural Image Visual Recognition,Visual Reasoning
Human Interaction Understanding 200 Natural Image Visual Recognition,Visual Reasoning
Discipline Knowledge Reasoning
Science 127 Scientific Diagram Visual Reasoning,Expert Knowledge Utilization
Health Medicine 140 Natural Image,Chart Image,Medical Image Visual Reasoning,Expert Knowledge Utilization
Art Design 110 Synthetic Image,Text-rich Image,Painting Image Visual Reasoning,Expert Knowledge Utilization
Humanities Social Science 112 Synthetic Image,Painting Image Visual Reasoning,Expert Knowledge Utilization
Tech Engineering 182 Chart Image,Scientific Diagram Visual Reasoning,Expert Knowledge Utilization
Business 120 Text-rich Image,Chart Image Visual Reasoning,Expert Knowledge Utilization
Intelligence Quotient Test
Ravens Progressive Matrices 200 Scientific Diagram Visual Reasoning,Expert Knowledge Utilization
Embodied AI
Navigation 200 Synthetic Image Visual Reasoning,Expert Knowledge Utilization
Emotion
Facial Expression Change Recognition 200 Natural Image Visual Recognition,Temporal Understanding
Scene Emotion Recognition 200 Natural Image Visual Recognition
Micro Expression Recognition 200 Natural Image Visual Recognition
Artwork Emotion Recognition 200 Painting Image Visual Recognition
Body Emotion Recognition 200 Natural Image Visual Recognition
Facial Expression Recognition 200 Natural Image Visual Recognition
Visual Illusion
Color Constancy 72 Synthetic Image Visual Recognition,Visual Reasoning
Color Assimilation 200 Synthetic Image Visual Recognition,Visual Reasoning
Geometrical Relativity 200 Synthetic Image Visual Recognition,Visual Reasoning
Geometrical Perspective 120 Synthetic Image Visual Recognition,Visual Reasoning
Color Contrast 200 Synthetic Image Visual Recognition,Visual Reasoning
Meme Understanding
Meme Video Understanding 200 Natural Image Visual Description
Meme Image Understanding 200 Synthetic Image Visual Description
Counting
Counting By Visual Prompting 200 Natural Image Visual Recognition,Counting
Counting By Category 800 Natural Image Visual Recognition,Counting
Crowd Counting 200 Natural Image Visual Recognition,Counting
Counting By Reasoning 200 Natural Image Visual Recognition,Counting
Hallucination
Order Hallucination 200 Natural Image Visual Recognition,Visual Reasoning,Visual Description
Relation Hallucination 200 Natural Image Visual Recognition,Visual Reasoning,Visual Description
Attribute Hallucination 200 Natural Image Visual Recognition,Visual Reasoning,Visual Description
Exist Hallucination 200 Natural Image Visual Recognition,Visual Reasoning
Image Retrieval
Person Reid 200 Natural Image Retrieval,Multi-Images Analysis
Sketch2image Retrieval 200 Natural Image,Text-rich Image Retrieval,Multi-Images Analysis
Face Retrieval 200 Natural Image Retrieval,Multi-Images Analysis
Handwritten Retrieval 200 Text-rich Image Retrieval,OCR,Multi-Images Analysis
Vehicle Retrieval 200 Natural Image Retrieval,Multi-Images Analysis
Image2image Retrieval 200 Natural Image Retrieval,Multi-Images Analysis
Text2image Retrieval 200 Natural Image Retrieval,Multi-Images Analysis
Visual Prompt Understanding
Som Recognition 199 Natural Image,Visual Mark Visual Recognition,Visual Reasoning,Visual Localization,Visual Prompting Understanding
Visual Prompt Understanding 200 Natural Image,Visual Mark Visual Recognition,Visual Reasoning,Visual Localization,Visual Prompting Understanding
Table A4: MMT-Bench subtask details (part 3): including sample number, visual input types, and evaluated LVLM capabilities.
Subtask Name Sample Num Visual Input Type Capability
Anomaly Detection
Industrial Produce Anomaly Detection 200 Natural Image Visual Recognition,Counting
Face Mask Anomaly Detection 200 Natural Image Visual Recognition
Helmet Anomaly Detection 200 Natural Image Visual Recognition,Visual Localization
Behavior Anomaly Detection 200 Natural Image Visual Recognition,Multi-Images Analysis
Traffic Anomaly Detection 200 Natural Image Visual Recognition
Keypoint Detection
Furniture Keypoint Detection 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Human Keypoint Detection 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Clothes Keypoint Detection 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Animal Keypoint Detection 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Vehicle Keypoint Detection 92 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Visual Commonsense Reasoning
Whoops 200 Synthetic Image Visual Recognition,Visual Reasoning
Visual Code
Eqn2latex 200 Text-rich Image,Scientific Diagram OCR,Document Understanding,Expert Knowledge Utilization
Screenshot2code 200 Screenshot Image Document Understanding,Expert Knowledge Utilization
Sketch2code 200 Scientific Diagram Document Understanding,Expert Knowledge Utilization
Image Evaluation Judgement
Image Quality Assessment 200 Natural Image Visual Reasoning
Lvlm Response Judgement 200 Synthetic Image,Chart Image Visual Reasoning
Pixel Level Perception
Polygon Localization 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Interactive Segmentation 141 Natural Image Visual Localization,Pixel Perception
Depth Estimation 200 Natural Image Pixel Perception,3D Perception
Pixel Recognition 200 Natural Image Visual Recognition,Pixel Perception
Pixel Localization 200 Natural Image Visual Recognition,Visual Localization,Pixel Perception
Image Matting 200 Natural Image Pixel Perception
Multiple Image Analysis
Spot The Similarity 200 Natural Image,Synthetic Image Multi-Images Analysis
Spot The Diff 200 Natural Image Multi-Images Analysis
3D
3D Cad Recognition 200 3d Image Multi-Images Analysis,3D Perception
3D Indoor Recognition 200 3d Image Multi-Images Analysis,3D Perception
Medical Understanding
Anatomy Identification 200 Medical Image Visual Recognition,Expert Knowledge Utilization
Medical Modality Recognition 200 Medical Image Visual Recognition,Expert Knowledge Utilization
Other Biological Attributes 200 Medical Image Visual Recognition,Expert Knowledge Utilization
Disease Diagnose 200 Medical Image Visual Recognition,Expert Knowledge Utilization
Lesion Grading 200 Medical Image Visual Recognition,Expert Knowledge Utilization
Cross Image Matching
One Shot Detection 200 Natural Image Visual Localization
Point Tracking 200 Natural Image Visual Localization
Single Object Tracking 200 Natural Image Visual Localization
Visual Captioning
Video Captioning 200 Natural Image Visual Description,Temporal Understanding
Image Captioning Paragraph 200 Natural Image Visual Description
Image Captioning 200 Natural Image Visual Description
Instance Captioning 200 Natural Image Visual Description
Image Dense Captioning 197 Natural Image Visual Description
Multiple Instance Captioning 200 Natural Image Visual Description
Multiple Image Captioning 200 Natural Image Visual Description,Multi-Images Analysis
Writing Poetry From Image 200 Natural Image,Text-rich Image Visual Description
Autonomous Driving
Traffic Participants Understanding 200 Natural Image Counting
Multiple View Image Understanding 200 Natural Image Visual Reasoning,Multi-Images Analysis,Counting
Traffic Sign Understanding 200 Natural Image Visual Reasoning,Expert Knowledge Utilization
Temporal Sequence Understanding 200 Natural Image Visual Reasoning,Temporal Understanding
Traffic Light Understanding 200 Natural Image Visual Recognition
Table A5: Model architecture of 30 LVLMs evaluated on MMT-Bench.
Models Parameters Vision Encoder LLM
GPT-4V (Yang et al., 2023a) - - -
GeminiProVision (Team, 2023a) - - -
QWen-VL-Plus (Team, 2023c) - - -
Claude3V-Haiku (Anthropic, 2023) - - -
LLaVA-Next-34B (Liu et al., 2024a) 34.8B CLIP ViT-L/14 Nous-Hermes-2-Yi-34B
LLaVA-Next-13B (Liu et al., 2024a) 13.4B CLIP ViT-L/14 Vicuna-v1.5-13B
LLaVA-Next-7B (Liu et al., 2024a) 7.1B CLIP ViT-L/14 Vicuna-v1.5-7B
Yi-VL-34B (AI et al., 2024) 34.6B CLIP ViT-H/14 Nous-Hermes-2-Yi-34B
Yi-VL-6B (AI et al., 2024) 6.6B CLIP ViT-H/14 Yi-6B
InternVL-Chat-V1.2 (Chen et al., 2023b) 40B InternViT-6B Nous-Hermes-2-Yi-34B
DeepSeek-VL-7B (Lu et al., 2024) 7.3B SAM-B & SigLIP-L DeepSeek-7B
Monkey (Li et al., 2023d) 9.8B CLIP-ViT-BigHuge Qwen-7B
XComposer (Zhang et al., 2023a) 8B EVA-CLIP-G InternLM-7B
XComposer2 (Dong et al., 2024) 7B CLIP ViT-L/14 InternLM2-7B
ShareGPT4V (Chen et al., 2023a) 7.2B CLIP ViT-L/14 Vicuna-v1.5-7B
SharedCaptioner (Chen et al., 2023a) 8B EVA-G InternLM-7B
mPLUG-Owl2 (Ye et al., 2023b) 8.2B CLIP ViT-L/14 LLaMA2-7B
LLaVA-v1.5-7B (Liu et al., 2023b, a) 7.2B CLIP ViT-L/14 Vicuna-v1.5-7B
LLaVA-v1.5-13B (Liu et al., 2023b, a) 13.4B CLIP ViT-L/14 Vicuna-v1.5-13B
LLaVA-InternLM2-7B (Contributors, 2023c) 8.1B CLIP ViT-L/14 InternLM2-7B
LLaVA-InternLM-7B (Contributors, 2023c) 7.6B CLIP ViT-L/14 InternLM-7B
LLaVA-v1.5-7B-Xtuner (Contributors, 2023c) 7.2B CLIP ViT-L/14 Vicuna-v1.5-7B
LLaVA-v1.5-13B-Xtuner (Contributors, 2023c) 13.4B CLIP ViT-L/14 Vicuna-v1.5-13B
LLaMA-Adapter-v2 (Gao et al., 2023) 7B CLIP ViT-L/14 LLaMA-7B
VisualGLM (Ding et al., 2021) 8B EVA-CLIP ChatGLM-6B
CogVLM (Wang et al., 2023) 17B EVA-CLIP-E Vicuna-v1.5-7B
TransCore-M (Contributors, 2023b) 13.4B CLIP ViT-L/14 PCITransGPT-13B
RBDash-v1 (RBDash-Team, 2023) 13.4B CLIP ViT-L/14 Vicuna-v1.5-13B
BLIP2 (Li et al., 2023b) 12.1B EVA-CLIP ViT-G/14 Flan-T5-XXL
QWenVL (Bai et al., 2023) 9.6B CLIP ViT-G/16 QWen-7B
Table A6: Abbreviations for tasks requiring multiple images as inputs (part one). Here we also present the designed prompt examples we used for Single-Image LVLMs and Multi-Images LVLMs.
Task Abbreviation Task Name Prompt Example for Single Image LVLMs Prompt Example for Multiple Image LVLMs
AQS
action quality
assessment
Question: <image><image><image><image>
What is the most probable action quality assessment
number obtained by the person in the video?
Options:
A. 35.99
B. 28.0
C. 11.27
D. 44.98
Question: <image><image><image><image>
What is the most probable action quality assessment
number obtained by the person in the video?
Options:
A. 35.99
B. 28.0
C. 11.27
D. 44.98
FECR
facial expression
change recognition
Question: <image><image>What is the change
of expression from the first image to the second image?
Options:
A. disgust to happy
B. happy to sadness
C. anger to surprise
D. disgust to fear
Question: What is the change of expression from
Image 1: <image>to Image 2: <image>?
Options:
A. disgust to happy
B. happy to sadness
C. anger to surprise
D. disgust to fear
FR
face
retrieval
Question: <image><image><image><image>
<image>Please retrieve the most similar person to the query
in the candidates. The first image is the query image and
the remaining images are candidates from Candidate 1 to
Candidate 4.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: Please retrieve the most similar person to the query:
<image>in the candidates: Candidate 1: <image>,
Candidate 2: <image>, Candidate 3: <image>,
Candidate 4: <image>.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
GAR
general action
recognition
Question: <image><image><image><image>
What is the action performed by the person in the video?
Options:
A. rock scissors paper
B. sword fighting
C. fencing
D. balloon blowing
Question: <image><image><image><image>
What is the action performed by the person in the video?
Options:
A. rock scissors paper
B. sword fighting
C. fencing
D. balloon blowing
HR
handwritten
retrieval
Question: <image><image><image><image>
<image>Please retrieve the most similar handwritten
text snapshot to the query in the candidates.
The first image is the query image and the remaining
images are candidates from Candidate 1 to Candidate 4.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: Please retrieve the most similar handwritten text
snapshot to the query: <image>in the candidates:
Candidate 1: <image>, Candidate 2: <image>,
Candidate 3: <image>, Candidate 4: <image>.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
I2IR
image2image
retrieval
Question: <image><image><image><image>
<image>Please retrieve the most similar scene to the query
in the candidates. The first image is the query image
and the remaining images are candidates from Candidate 1
to Candidate 4.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: Please retrieve the most similar scene to the
query: <image>in the candidates: Candidate 1: <image>,
Candidate 2: <image>, Candidate 3: <image>,
Candidate 4: <image>.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
IC
image
colorization
Question: <image><image><image><image>
The following images are candidates from Candidate 1
to Candidate 4, which are from the same picture
consisting of four styles: grayscale, original, warm, and sepia.
Which one is the original picture?
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: The following images: Candidate 1: <image>,
Candidate 2: <image>, Candidate 3: <image>,
Candidate 4: <image>, are from the same picture,
which consists of four styles: grayscale, original,
warm, and sepia. Which one is the original picture?
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
MVU
meme video
understanding
Question: <image><image><image><image>
Please generate a description for this meme
Options:
A. From beneath the toilet door panel, a hand is reaching
out with an upward-facing palm to receive chopsticks
and a spoon from someone outside.
B. The hand is asking for help to get out of the bathroom.
C. The hand is actually reaching out for a handshake.
D. A person is handing over toilet paper instead of
chopsticks and a spoon.
Question: <image><image><image><image>
Please generate a description for this meme
Options:
A. From beneath the toilet door panel, a hand is
reaching out with an upward-facing palm to receive
chopsticks and a spoon from someone outside.
B. The hand is asking for help to get out of the bathroom.
C. The hand is actually reaching out for a handshake.
D. A person is handing over toilet paper instead of
chopsticks and a spoon.
Table A7: Abbreviations for tasks requiring multiple images as inputs (part two). Here we also present the designed prompt examples we used for Single-Image LVLMs and Multi-Images LVLMs.
Task Abbreviation Task Name Prompt Example for Single Image LVLMs Prompt Example for Multiple Image LVLMs
ME mevis
Question: <image><image><image><image>
<image><image><image><image><image>
I have provided several frames from a video, and
I will also provide a caption. Provide the output for
the detected area in the format [x, y, w, h].
This format represents the bounding box,
where [x, y, w, h] are the coordinates of the top-left
corner of the bounding box, as well as its width and height.
Note that the width of the input image is 1920 and
the height is 945.
CAPTION: little girl feeding rabbit
Options:
A. [70, 0, 993, 1007]
B. [203, 0, 1011, 944]
C. [70, 0, 1011, 944]
D. [196, 38, 652, 277]
Question: <image><image><image><image>
<image><image><image><image><image>
I have provided several frames from a video, and I will also
provide a caption. Provide the output for the detected area
in the format [x, y, w, h]. This format represents the bounding box,
where [x, y, w, h] are the coordinates of the top-left corner
of the bounding box, as well as its width and height.
Note that the width of the input image is 1920 and the height is 945.
CAPTION: little girl feeding rabbit
Options:
A. [70, 0, 993, 1007]
B. [203, 0, 1011, 944]
C. [70, 0, 1011, 944]
D. [196, 38, 652, 277]
MIC
multiple image
captioning
Question: <image><image><image><image>
<image>Describe this set of images briefly.
Options:
A. I took a cab to return to the hotel
B. the front of the mall was somewhat crowded .
i ran past them and took the escalator down .
after shopping for a few hours , i returned to the street .
i tried to catch a cab but a bush blocked me .
i decided to just walk back to my hotel .
C. the mall was empty and I took the stairs up
D. I quickly caught a bus to my hotel
Question: Describe this set of images:
<image><image><image><image><image>briefly.
Options:
A. I took a cab to return to the hotel
B. the front of the mall was somewhat crowded .
i ran past them and took the escalator down .
after shopping for a few hours , i returned to the street .
i tried to catch a cab but a bush blocked me .
i decided to just walk back to my hotel .
C. the mall was empty and I took the stairs up
D. I quickly caught a bus to my hotel
NIP
next img
prediction
Question: <image><image><image><image>
<image>Please predict the last 10 frames in the
candidates of the video based on the first 10 frames of
the input video. Note that the order is from left to right.
The first four images are candidates from Candidate 1
to Candidate 4 and the last image shows the first 10 frames
of the input video.
Options:
A. Candidate 1: last 10 frames
B. Candidate 2: last 10 frames
C. Candidate 3: last 10 frames
D. Candidate 4: last 10 frames
Question: Please predict the last 10 frames in the
candidates: Candidate 1: <image>, Candidate 2: <image>,
Candidate 3: <image>, Candidate 4: <image>, of the video:
based on the first 10 frames of the input video: <image>.
Note that the order is from left to right
Options:
A. Candidate 1: last 10 frames
B. Candidate 2: last 10 frames
C. Candidate 3: last 10 frames
D. Candidate 4: last 10 frames
OSD
one shot
detection
Question: <image><image>According to the prompts in the
Support Image (marked in red), please detect the corresponding
object in the Query Image. The first image is the Support Image
and the second image is the Query Image.
Provide the output for the object in the format [x, y, w, h].
This format represents the bounding box, where [x, y, w, h] are the
coordinates of the top-left corner of the bounding box,
as well as its width and height.
Note that the width of the input RGB image is 224
and the height is 224.
Options:
A. [0, 0, 511, 2]
B. [0, 0, 426, 1]
C. [1, 1, 511, 2]
D. [0, 0, 499, 2]
Question: According to the prompts in the Support Image
(marked in red): <image>, please detect the corresponding
object in the Query Image: <image>.
Provide the output for the object in the format [x, y, w, h].
This format represents the bounding box,
where [x, y, w, h] are the coordinates of the top-left corner
of the bounding box, as well as its width and height.
Note that the width of the input RGB image is 224 and
the height is 224.
Options:
A. [0, 0, 511, 2]
B. [0, 0, 426, 1]
C. [1, 1, 511, 2]
D. [0, 0, 499, 2]
PRe person reid
Question: <image><image><image><image>
<image>Please retrieve the most similar person to
the query in the candidates. The first image is the query
image and the remaining images are candidates from
Candidate 1 to Candidate 4.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: Please retrieve the most similar person to the
query: <image>in the candidates: Candidate 1: <image>,
Candidate 2: <image>, Candidate 3: <image>,
Candidate 4: <image>.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
PT point tracking
Question: <image><image>What is the position coordinates
of the point with coordinates ([0.711, 0.154]) in the first image
within the second image? Note that the width of the input
RGB image is 256 and the height is 256.
Options:
A. [0.336, 0.241]
B. [0.754, 0.592]
C. [0.711, 0.154]
D. [0.814, 0.269]
Question: What is the position coordinates of the point
with coordinates ([0.711, 0.154]) in Frame 1: <image>
within the Frame 2: <image>?
Note that the width of the input RGB image is 256
and the height is 256.
Options:
A. [0.336, 0.241]
B. [0.754, 0.592]
C. [0.711, 0.154]
D. [0.814, 0.269]
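The [x, y, w, h] convention used in several prompts above can be made concrete with a short sketch. The helpers `xywh_to_xyxy` and `fits_in_image` are hypothetical names introduced here for illustration and are not part of the MMT-Bench code; the option values and image size are copied from the MeViS example in Table A7.

```python
from typing import Dict, Sequence, Tuple


def xywh_to_xyxy(box: Sequence[float]) -> Tuple[float, float, float, float]:
    """Convert [x, y, w, h] (top-left corner, width, height) to corner form."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def fits_in_image(box: Sequence[float], width: int, height: int) -> bool:
    """Return True if an [x, y, w, h] box lies entirely inside the image."""
    x1, y1, x2, y2 = xywh_to_xyxy(box)
    return 0 <= x1 <= x2 <= width and 0 <= y1 <= y2 <= height


# Options from the MeViS example (stated image size: 1920 x 945).
options: Dict[str, Sequence[float]] = {
    "A": [70, 0, 993, 1007],
    "B": [203, 0, 1011, 944],
    "C": [70, 0, 1011, 944],
    "D": [196, 38, 652, 277],
}

in_bounds = {key: fits_in_image(box, 1920, 945) for key, box in options.items()}
print(in_bounds)
```

Under this reading of the format, option A extends past the stated image height of 945, while the other three options stay inside the frame.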
Table A8: Abbreviations for tasks requiring multiple images as inputs (part three). Here we also present the designed prompt examples we used for Single-Image LVLMs and Multi-Images LVLMs.
Task Abbreviation Task Name Prompt Example for Single Image LVLMs Prompt example for Multiple Image LVLMs
SODRD
salient object
detection rgbd
Question: <image><image>The first image is RGB image
and the second image is the corresponding depth map.
Please detect the salient foreground object in this RGB
image and represent them using a single bounding box.
Provide the output for the detected area in the format [x, y, w, h].
This format represents the bounding box, where [x, y, w, h] are the
coordinates of the top-left corner of the bounding box, as well as
its width and height.
Note that the width of the input RGB image is 640 and the height is 480.
Options:
A. [267, 105, 119, 209]
B. [85, 307, 65, 79]
C. [318, 294, 111, 156]
D. [267, 105, 135, 241]
Question: The first image is RGB image: <image>
and the second image is the corresponding depth map: <image>.
Please detect the salient foreground object in this RGB image
and represent them using a single bounding box.
Provide the output for the detected area in the format [x, y, w, h].
This format represents the bounding box,
where [x, y, w, h] are the coordinates of the top-left corner of
the bounding box, as well as its width and height.
Note that the width of the input RGB image is 640
and the height is 480.
Options:
A. [267, 105, 119, 209]
B. [85, 307, 65, 79]
C. [318, 294, 111, 156]
D. [267, 105, 135, 241]
SLR
sign language
recognition
Question: <image><image><image><image>
What is the sign language gesture performed
by the person in the video?
Options:
A. fashionable
B. trendy
C. fascinating
D. cool
Question: <image><image><image><image>
What is the sign language gesture performed
by the person in the video?
Options:
A. fashionable
B. trendy
C. fascinating
D. cool
SOT
single object
tracking
Question: <image><image>Here is an object (marked as RED box)
in the first image. Please give the coordinations
of this object in the second image.
Provide the output for the object in the format [x, y, w, h].
This format represents the bounding box,
where [x, y, w, h] are the coordinates of the top-left corner
of the bounding box, as well as its width and height.
Note that the width of the input RGB image is 1280
and the height is 720.
Options:
A. [148.0, 187.0, 918, 487]
B. [148.0, 187.0, 792.0, 533.0]
C. [0, 187, 792.0, 533.0]
D. [149, 451, 263, 24]
Question: Here is an object (marked as RED box)
in the Frame 1: <image>. Please give the coordinations
of this object in the Frame 2: <image>.
Provide the output for the object in the format [x, y, w, h].
This format represents the bounding box,
where [x, y, w, h] are the coordinates of the top-left corner
of the bounding box, as well as its width and height.
Note that the width of the input RGB image is 1280
and the height is 720.
Options:
A. [148.0, 187.0, 918, 487]
B. [148.0, 187.0, 792.0, 533.0]
C. [0, 187, 792.0, 533.0]
D. [149, 451, 263, 24]
S2IR
sketch2image
retrieval
Question: <image><image><image><image>
Please retrieve the most similar image to the Query
Image in the candidate Images. The first image is the
query image and the remaining images are candidates
from Candidate 1 to Candidate 3.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
Question: Please retrieve the most similar image to the
Query Image: <image>in the candidate Images:
Candidate 1: <image>, Candidate 2: <image>,
Candidate 3: <image>.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
SD spot the diff
Question: <image><image>The following is a description
of the differences between two pictures. Which one is incorrect?
Options:
A. The images show different types of flowers in full bloom,
with colorful petals and green leaves.
B. there is a car driving by in the right picture
C. there is a car leaving the lot in the left picture
Question: The following is a description of the differences
between two pictures: <image><image>. Which one is incorrect?
Options:
A. The images show different types of flowers in full bloom,
with colorful petals and green leaves.
B. there is a car driving by in the right picture
C. there is a car leaving the lot in the left picture
SS
spot the
similarity
Question: <image><image>Are there any
similarities between the two pictures?
Options:
A. Yes
B. No
Question: <image><image>Are there any
similarities between the two pictures?
Options:
A. Yes
B. No
TA
temporal
anticipation
Question: <image><image><image><image>
What will the person do next with the medicine?
Options:
A. Apply topically
B. Inject intravenously
C. Throw away
D. Eat
Question: <image><image><image><image>
What will the person do next with the medicine?
Options:
A. Apply topically
B. Inject intravenously
C. Throw away
D. Eat
TL
temporal
localization
Question: <image><image><image><image>
Given the sequence of images, please identify the image
consistent with the text description: Billiards.
The image index starts from 0.
Options:
A. Image 0
B. Image 1
C. Image 2
D. Image 3
Question: Given the sequence of images: Image 0: <image>,
Image 1: <image>, Image 2: <image>, Image 3: <image>,
please identify the image consistent with the text
description: Billiards.
Options:
A. Image 0
B. Image 1
C. Image 2
D. Image 3
Table A9: Abbreviations for tasks requiring multiple images as inputs (part four). Here we also present the designed prompt examples we used for Single-Image LVLMs and Multi-Images LVLMs.
Task Abbreviation Task Name Prompt Example for Single Image LVLMs Prompt example for Multiple Image LVLMs
TO temporal ordering
Question: <image><image><image><image>
Please predict the order of the following pictures,
and give each picture a sequential index.
This index starts from 0. The larger the index, the later the order.
Options:
A. [3, 0, 2, 1]
B. [2, 0, 1, 3]
C. [0, 2, 1, 3]
D. [1, 3, 2, 0]
Question: Please predict the order of the following pictures:
<image><image><image><image>, and give each
picture a sequential index.
This index starts from 0. The larger the index, the later the order.
Options:
A. [3, 0, 2, 1]
B. [2, 0, 1, 3]
C. [0, 2, 1, 3]
D. [1, 3, 2, 0]
T2IR
text2image
retrieval
Question: <image><image><image><image>
Please find the most relevant picture among the candidate images
for this description.
The given images are candidates from Candidate 1 to Candidate 4.
Description:
this flower has petals that are green with stringy purple stamen
this flower is white and blue in color, with petals that are
oval shaped.
the petals on this flower are white with an elaborate pistil.
the flower is unique because the petals aren’t separated and
they have a round tip
this flower has blue petals as well as a green and purple pistil.
this flower has thick and pale green petals under a thick fringe of
purple and white.
this flower has petals that are white and has stringy stamen
this flower has white oblong petals and white flat filaments.
a flower with long and narrow petals that are whtie.
a flower with long and narrow petals that are whtie.
Options: A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: Please find the most relevant picture among the
candidate images: Candidate 1: <image>, Candidate 2: <image>,
Candidate 3: <image>, Candidate 4: <image>, for this description.
Description:
this flower has petals that are green with stringy purple stamen
this flower is white and blue in color, with petals that are
oval shaped.
the petals on this flower are white with an elaborate pistil.
the flower is unique because the petals aren’t separated and
they have a round tip
this flower has blue petals as well as a green and purple pistil.
this flower has thick and pale green petals under a thick fringe of
purple and white.
this flower has petals that are white and has stringy stamen
this flower has white oblong petals and white flat filaments.
a flower with long and narrow petals that are whtie.
a flower with long and narrow petals that are whtie.
Options: A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
3DCR
3D cad
recognition
Question: <image><image><image><image>
<image><image>What is the category of the point
cloud based on the multi-view of the point cloud?
Options:
A. telephone
B. chair
C. table
D. sofa
Question: <image><image><image><image>
<image><image>What is the category of the point
cloud based on the multi-view of the point cloud?
Options:
A. telephone
B. chair
C. table
D. sofa
3DIR
3D indoor
recognition
Question: <image><image><image><image>
<image><image>What is the category of the point cloud
based on the multi-view of the point cloud?
Options:
A. sink
B. bed
C. cabinet
D. bag
Question: <image><image><image><image>
<image><image>What is the category of the point
cloud based on the multi-view of the point cloud?
Options:
A. sink
B. bed
C. cabinet
D. bag
VR vehicle retrieval
Question: <image><image><image><image>
<image>Please retrieve the most similar vehicle
to the query in the candidates. The first image is the query
image and the remaining images are candidates from
Candidate 1 to Candidate 4.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
Question: Please retrieve the most similar vehicle to
the query: <image>in the candidates:
Candidate 1: <image>, Candidate 2: <image>,
Candidate 3: <image>, Candidate 4: <image>.
Options:
A. Candidate 1
B. Candidate 2
C. Candidate 3
D. Candidate 4
VC video captioning
Question: <image><image><image><image>
Please generate textual descriptions for a sequence of video frames.
Options:
A. a woman is speaking into a microphone
B. a man is playing guitar on stage
C. a man is speaking into a microphone
D. a man is typing on a computer keyboard
Question: Please generate textual descriptions for
a sequence of video frames:
<image><image><image><image>.
Options:
A. a woman is speaking into a microphone
B. a man is playing guitar on stage
C. a man is speaking into a microphone
D. a man is typing on a computer keyboard
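The two prompt styles shown in Tables A7 to A9 differ mainly in where the `<image>` placeholders sit: the single-image style groups them at the front of the question, while the multiple-image style interleaves them with labels. The following is a minimal sketch of that difference using the person re-identification example. The function names (`format_options`, `grouped_prompt`, `labelled_images`) are assumptions made for illustration, not the authors' implementation.

```python
from typing import Sequence

IMAGE_TOKEN = "<image>"


def format_options(options: Sequence[str]) -> str:
    """Render choices as 'A. ...', 'B. ...' lines under an 'Options:' header."""
    lines = [f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(options)]
    return "Options:\n" + "\n".join(lines)


def grouped_prompt(instruction: str, options: Sequence[str], num_images: int) -> str:
    """Single-image style: all placeholders come first, then the instruction."""
    return f"Question: {IMAGE_TOKEN * num_images}{instruction}\n{format_options(options)}"


def labelled_images(labels: Sequence[str]) -> str:
    """Produce 'Candidate 1: <image>, Candidate 2: <image>, ...'."""
    return ", ".join(f"{label}: {IMAGE_TOKEN}" for label in labels)


candidates = [f"Candidate {i}" for i in range(1, 5)]

# Grouped placeholders, as in the "Single Image LVLMs" column.
single = grouped_prompt(
    "Please retrieve the most similar person to the query in the candidates. "
    "The first image is the query image and the remaining images are "
    "candidates from Candidate 1 to Candidate 4.",
    candidates,
    num_images=5,
)

# Interleaved, labelled placeholders, as in the "Multiple Image LVLMs" column.
multi = (
    f"Question: Please retrieve the most similar person to the query: {IMAGE_TOKEN}"
    f"in the candidates: {labelled_images(candidates)}.\n"
    f"{format_options(candidates)}"
)

print(single)
print(multi)
```

Both prompts carry the same five placeholders and the same options; only the placement of the placeholders relative to the instruction text changes.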
Figure A2: Comparison of coordinate formats for detection tasks across 19 MMT-Bench subtasks, reporting average accuracy.
Figure A3: The performance of 20 LVLMs across 13 types of visual input.
Figure A4: The performance of 20 LVLMs across 14 capabilities.

Appendix G Case Study

Table A10: Table index of case study figures by meta-task with associated (error) categories for each LVLM.
Case Figure Meta-task Subtask GPT-4V GeminiProVision InternVL-Chat
Fig. A5 Visual Recognition Landmark Recognition Lack of Knowledge No Error No Error
Fig. A6 Object Localization Camouflaged Object Detection Lack of Capability Perception Error Perception Error
Fig. A7 Pixel-level Recognition Image Matting Perception Error No Error Perception Error
Fig. A8 OCR Handwritten Text Recognition No Error Perception Error Perception Error
Fig. A9 Visual Prompt Understanding Visual Prompt Understanding No Error Perception Error Fail to Follow Instruct No Error
Fig. A10 Retrieval Sketch to Image Retrieval Perception Error No Error Perception Error Reasoning Error
Fig. A11 Counting Counting by Reasoning Perception Error Perception Error No Error
Fig. A12 Keypoint Detection Human Keypoint Detection Refuse to Answer Perception Error Fail to Follow Instruct Fail to Follow Instruct
Fig. A13 Action Recognition Sign Language Recognition Lack of Capability Perception Error Perception Error
Fig. A14 Visual Hallucination Exist Hallucination No Error Reasoning Error Perception Error
Fig. A15 Anomaly Detection Industrial Produce Anomaly Detection Lack of Knowledge No Error Perception Error
Fig. A16 Image-to-Image Translation Jigsaw Puzzle Solving No Error Perception Error Perception Error
Fig. A17 Visual Summary Image Captioning Paragraph Perception Error No Error Perception Error
Fig. A18 Intelligence Quotient Test Ravens Progressive Matrices No Error Reasoning Error Reasoning Error
Fig. A19 Emotional Quotient Test Scene Emotion Recognition Perception Error Reasoning Error Reasoning Error No Error
Fig. A20 Visual Grounding Referring Detection Perception Error Perception Error Fail to Follow Instruct
Fig. A21 Visual Commonsense Reasoning Whoops Reasoning Error Perception Error Perception Error
Fig. A22 Chart, Doc Understanding Clock Reading Perception Error Perception Error Perception Error
Fig. A23 Relation Reasoning Scene Graph Recognition No Error Perception Error No Error
Fig. A24 Meme Understanding Meme Image Understanding Perception Error No Error No Error
Fig. A25 Multi-Image Analysis Spot the Diff No Error No Error No Error
Fig. A26 Temporal Understanding Temporal Ordering Perception Error No Error Perception Error
Fig. A27 Cross-Image Matching Single Object Tracking Lack of Capability Perception Error Perception Error
Fig. A28 Visual Coding Equation to Latex Perception Error Perception Error No Error
Fig. A29 Visual Illusion Color Constancy Perception Error No Error Perception Error
Fig. A30 Image Evaluation Judgement LVLM Response Judgement Reasoning Error No Error Perception Error
Fig. A31 3D Perception 3D CAD Recognition Lack of Capability No Error No Error
Fig. A32 Embodied Agent Navigation Fail to Follow Instruct Fail to Follow Instruct Fail to Follow Instruct
Fig. A33 Medical Understanding Medical Modality Recognition No Error No Error Perception Error
Fig. A34 Autonomous Driving Traffic Light Understanding Refuse to Answer No Error No Error
Fig. A35 GUI Navigation Installation Perception Error Perception Error Perception Error
Fig. A36 Discipline Knowledge Reasoning Art and Design Lack of Knowledge Lack of Knowledge Lack of Knowledge

In this section, we present a case study analysis of the error types made by GPT-4V, GeminiProVision, and InternVL-Chat on various meta-tasks in MMT-Bench. We classify the errors into the following six categories:

Perception Error: LVLMs fail to recognize, classify, or detect the objects or content in images. Most LVLMs are constrained by the representation power of their visual encoders, making this the most common type of error. See examples in Fig. A6, Fig. A8, etc.

Reasoning Error: LVLMs correctly recognize and perceive the visual content but make errors in reasoning, leading to incorrect answers. See examples in Fig. A21, Fig. A30, etc.

Lack of Knowledge: LVLMs lack the domain-specific knowledge required to answer specialized questions, such as the location of a landmark (see Fig. A5) or the creation date of a particular painting (see Fig. A36).

Lack of Capability: LVLMs do not have the capability to solve the corresponding tasks. This error type is particularly evident in GPT-4V, which tends to respond more honestly when it lacks the ability to handle certain tasks. In contrast, other LVLMs are inclined to generate outputs even when their accuracy is relatively low. See examples in Fig. A6, Fig. A13.

Refuse to Answer: LVLMs, such as GPT-4V or Gemini, refuse to answer questions that are anthropocentric or sensitive in nature. See examples in Fig. A12, Fig. A34.

Fail to Follow Instruct: LVLMs fail to correctly understand instructions and provide erroneous answers. For example, LVLMs may not understand the specified conditions in the instruction (see Fig. A9) or may ignore the instruction altogether and instead generate a caption for the given image (see Fig. A12).
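As a rough illustration of how the categories above could be tallied per model, the sketch below encodes them as an enumeration and counts labels for three rows copied from Table A10 (Fig. A5, A6, and A7). The names `ErrorType`, `CASES`, and `tally` are hypothetical and introduced only for this example; "No Error" is included because it appears as a label in the table, although it is not one of the six error categories.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Tuple


class ErrorType(Enum):
    NO_ERROR = "No Error"
    PERCEPTION = "Perception Error"
    REASONING = "Reasoning Error"
    LACK_OF_KNOWLEDGE = "Lack of Knowledge"
    LACK_OF_CAPABILITY = "Lack of Capability"
    REFUSE_TO_ANSWER = "Refuse to Answer"
    FAIL_TO_FOLLOW_INSTRUCT = "Fail to Follow Instruct"


MODELS = ("GPT-4V", "GeminiProVision", "InternVL-Chat")

# Three rows copied from Table A10; labels are listed in MODELS order.
CASES: Dict[str, Tuple[str, str, str]] = {
    "Fig. A5": ("Lack of Knowledge", "No Error", "No Error"),
    "Fig. A6": ("Lack of Capability", "Perception Error", "Perception Error"),
    "Fig. A7": ("Perception Error", "No Error", "Perception Error"),
}


def tally(cases: Dict[str, Tuple[str, str, str]]) -> Dict[str, Counter]:
    """Count how often each label occurs for each model."""
    counts = {model: Counter() for model in MODELS}
    for labels in cases.values():
        for model, label in zip(MODELS, labels):
            # ErrorType(label) raises ValueError on an unknown label string.
            counts[model][ErrorType(label)] += 1
    return counts


for model, counter in tally(CASES).items():
    print(model, {err.value: n for err, n in counter.items()})
```

Looking labels up through the enumeration means a misspelled category fails loudly instead of silently creating a new bucket.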

Figure A5: A sample case of visual recognition (landmark recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A6: A sample case of object localization (camouflaged object detection). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A7: A sample case of pixel-level recognition (image matting). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A8: A sample case of OCR (handwritten text recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A9: A sample case of visual prompt understanding (visual prompt understanding). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A10: A sample case of retrieval (sketch2image retrieval). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A11: A sample case of counting (counting by reasoning). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A12: A sample case of keypoint detection (human keypoint detection). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A13: A sample case of action recognition (sign language recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A14: A sample case of visual hallucination (exist hallucination). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A15: A sample case of anomaly detection (industrial produce anomaly detection). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A16: A sample case of image-to-image translation (jigsaw puzzle solving). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A17: A sample case of visual summary (image captioning paragraph). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A18: A sample case of intelligence quotient test (ravens progressive matrices). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A19: A sample case of emotional quotient test (scene emotion recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A20: A sample case of visual grounding (referring detection). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A21: A sample case of visual commonsense reasoning (whoops). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A22: A sample case of chart, doc understanding (clock reading). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A23: A sample case of relation reasoning (scene graph recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A24: A sample case of meme understanding (meme image understanding). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A25: A sample case of multi-image analysis (spot the difference). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A26: A sample case of temporal understanding (temporal ordering). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A27: A sample case of cross-image matching (single object tracking). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A28: A sample case of visual coding (equation to latex). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A29: A sample case of visual illusion (color constancy). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A30: A sample case of image evaluation judgement (LVLM response judgement). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A31: A sample case of 3D perception (3D CAD recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A32: A sample case of embodied agent (navigation). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A33: A sample case of medical understanding (medical modality recognition). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A34: A sample case of autonomous driving (traffic light understanding). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A35: A sample case of GUI navigation (installation). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.
Figure A36: A sample case of discipline knowledge reasoning (art and design). Green highlights the right answer. Red highlights the wrong answer. Back to Table Index.

Appendix H Comparison of MMT-Bench with Other Benchmarks on OCR-Related Tasks

Table A11: Statistics of different evaluation benchmarks on OCR-related samples. The number of tokens is calculated by the tiktoken package from OpenAI.
Benchmark Sample Num Task Type Words Number (Average Min Median Max Std) Tokens Number (Average Min Median Max Std)
MME (Fu et al., 2023) 40 1 2.5 1 2 5 1 3.9 1 3 8 1.6
MMBench (dev+test) (Liu et al., 2023c) 608 - 7.3 1 6 54 7 8.3 1 6 78 9.3
Tiny-LVLM-eHub (Shao et al., 2023) 600 1 1 1 1 1 0 2.2 1 2 8 1.1
MMT-Bench (Ours) 600 4 14.8 1 1.5 103 22.7 20.4