MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
Abstract
Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench comprises meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering core meta-tasks and subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving LVLMs such as the proprietary GPT-4V, GeminiProVision, and open-sourced InternVL-Chat, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.
1 Introduction
In recent years, Large Vision-Language Models (LVLMs) (Zhang et al., 2023a; Yang et al., 2023a; Liu et al., 2023b) have emerged as powerful tools for advancing artificial intelligence, demonstrating remarkable progress in various domains such as visual dialogue, video analysis and document understanding. Driven by diverse and high-quality instruction fine-tuning data mined from various fields, LVLMs will continue to advance towards multitask AGI (Team, 2023a; Bai et al., 2023). As pointed out in Levels of AGI (Morris et al., 2023), the breadth (generality) of tasks is a fundamental criterion for different levels of AGI. A multitask AGI model can perform a wide range of tasks across different domains with human-like proficiency, which could revolutionize many fields such as personalized education (Latif et al., 2023) and medical diagnosis (Singhal et al., 2023). Therefore, it is crucial to build a comprehensive evaluation benchmark to track multitask AGI development.
However, evaluating LVLMs significantly lags behind their development (Morris et al., 2023; Yue et al., 2023b; Liu et al., 2024b). A line of work attempts to bridge this gap by proposing various multimodal evaluation benchmarks. Examples include LVLM-eHub (Xu et al., 2023), MMBench (Liu et al., 2023c), MME (Fu et al., 2023), and SEED-Bench (Li et al., 2023a), which propose dimensions of multimodal capabilities and corresponding test samples. However, these benchmarks have limited coverage of multimodal tasks while testing rudimentary capabilities like visual recognition and text-scarce OCR. Therefore, they cannot fulfil the requirement of the breadth of tasks (Morris et al., 2023). Moreover, recent LVLMs continue to excel in these benchmarks. For instance, InternLM-XComposer2 (Dong et al., 2024) achieved / and / overall performance on MME and MMBench, respectively. Other works, such as MathVista (Lu et al., 2023) and MMMU (Yue et al., 2023a), focus on discipline knowledge understanding and reasoning but are constrained to visual questions with scientific diagram images, limiting their breadth for benchmarking multitask AGI.
To address this challenge, we introduce MMT-Bench, a new benchmark designed to comprehensively assess LVLMs in multimodal multitask understanding. The breadth of MMT-Bench is reflected in three aspects. First, MMT-Bench is meticulously curated and comprises 31K multi-choice visual questions covering 32 core meta-tasks and a total of 162 subtasks (Fig. 1), about eight times as many tasks as MMBench (Liu et al., 2023c; see Table 1). Second, it encompasses diverse image types such as natural scenes, synthetic images, depth maps, text-rich images, paintings, screenshots, point clouds, medical images, etc. (Fig. 2). Such diversity requires the model to interpret a wide variety of visual inputs. Third, MMT-Bench spans multimodal scenarios such as vehicle driving, GUI navigation, and embodied AI, testing multimodal capabilities including visual recognition, localization, reasoning, OCR, counting, 3D perception, temporal understanding, etc. (Fig. 2).
We assess publicly available LVLMs under various input modes for best evaluation performance. Our findings highlight the significant challenges posed by MMT-Bench. For instance, GPT-4V only achieves overall scores of 62.0% across all subtasks and 55.5% across all subtasks except visual recognition tasks (Table 2), indicating significant room for improvement towards multitask AGI. Thanks to the extensive coverage of multimodal tasks, MMT-Bench enables the evaluation of LVLMs using a task map. This facilitates the discovery of both in- and out-of-domain tasks, providing valuable insights for multimodal commercial applications and ongoing efforts to enhance LVLMs. We summarize the findings as follows:
Benchmark | # Sample | # Meta-task | # Task | Modality | Source | Answer Type |
---|---|---|---|---|---|---|
SEED-Bench (Li et al., 2023a) | 19K | 12 | 12 | I + T + V | Annotated | Multi-Choice |
MMBench (Liu et al., 2023c) | 3K | 2 | 20 | I + T | Repurposed | Multi-Choice |
MM-VET (Yu et al., 2023) | 0.2K | 6 | N/A | I + T | Repurposed | Multi-Choice |
MMMU (Yue et al., 2023b) | 11.5K | 6 | 30 | I + T | Annotated | Multi-Choice/Open |
Tiny LVLM-eHub (Shao et al., 2023) | 2.1K | 5 | 42 | I + T | Repurposed | Multi-Choice/Open |
MMT-Bench | 31K | 32 | 162 | I + T + V + P | Repurposed | Multi-Choice |
- The open-source model InternVL-Chat takes the leading position on MMT-Bench, surpassing closed-source models such as QWen-VL-Plus, GPT-4V, and GeminiProVision.
- The comprehensive error analyses conducted on multimodal tasks reveal that top-performing LVLMs such as InternVL-Chat, GPT-4V, and GeminiProVision are predominantly prone to perception, reasoning, and knowledge errors.
- The taskonomy analysis shows that current LVLMs perform well on tasks related to visual recognition and description, which are in-domain tasks, yet fall short on tasks related to localization and pixel perception, which are out-of-domain tasks.
- BLIP2, which does not undergo instruction tuning, even outperforms most LVLMs that are tuned on millions of instruction-following samples, implying that instruction tuning with data from some tasks can hurt generalization to other tasks.
- Certain tasks, such as multi-image and coordinate-related tasks as well as those involving visual referring prompts, show improved performance with specific prompting methods. However, most models do not exhibit improved performance with visual prompting, suggesting potential areas for future enhancement.
- Model performance significantly improves with an increase in size (7B to 13B) for both LLaVA-v1.5 and LLaVA-v1.5-XTuner. Upgrading the LLM from InternLM to InternLM2 also enhances the performance of LLaVA.
Overall, the contributions of this work are three-fold. i) We build a new evaluation benchmark called MMT-Bench for multimodal multitask comprehension, allowing us to measure the progress on the path to multitask AGI. ii) We evaluate various publicly available LVLMs on MMT-Bench, revealing that current LVLMs, including InternVL-Chat, GPT-4V, and GeminiProVision, achieve only modest performance in multitask intelligence. iii) We present a taskonomy analysis by evaluating LVLMs on a task map built upon MMT-Bench, facilitating the discovery of both in- and out-of-domain tasks relative to current LVLMs. We anticipate that MMT-Bench will inspire the community to push the boundaries of LVLM research and development, driving us closer to the realization of truly intelligent multimodal systems. MMT-Bench is open-sourced at https://github.com/OpenGVLab/MMT-Bench.
2 Related Work
LVLM. As Large Language Models (LLMs) continue to garner impressive achievements (Bai et al., 2023; Team, 2023b; Touvron et al., 2023a, b; Zheng et al., 2023; Chung et al., 2022), academic emphasis is increasingly shifting towards the exploration and development of Large Vision-Language Models (LVLMs) to bolster the multimodal understanding and generative capabilities of models. Some notable open-source LVLMs, such as mPLUG-Owl2 (Ye et al., 2023b), LLaVA (Liu et al., 2023b), and LLaMA-Adapter (Gao et al., 2023; Zhang et al., 2023b), adopt LLMs as their backbone and process visual features through these LLMs, thereby integrating text and visual inputs. In addition, closed-source models like Gemini (Team, 2023a) and GPT-4V (Yang et al., 2023b) have demonstrated remarkable results across numerous tasks. We aim to undertake an in-depth and comprehensive exploration of LVLMs and their capabilities by testing them on massive multimodal tasks.
LVLM Evaluation. Recently, LVLMs have demonstrated remarkable capabilities in handling many vision-language tasks, which makes previous single-task benchmarks (Antol et al., 2015; Hudson & Manning, 2019; Krishna et al., 2017; Lin et al., 2014; Marino et al., 2019) insufficient to provide comprehensive evaluations of current LVLMs. To this end, current LVLM evaluation benchmarks aim to provide relatively holistic evaluations of the overall reasoning capabilities of LVLMs, such as OwlEval (Ye et al., 2023a), LVLM-eHub (Xu et al., 2023), SEED-Bench (Li et al., 2023a), LAMM (Yin et al., 2023), MM-Vet (Yu et al., 2023), and MMBench (Liu et al., 2023c). However, these benchmarks cover only a small range of multimodal tasks and vision-language skills, making them not comprehensive enough to assess multitask AGI capabilities. Besides, recent studies have also presented LVLM benchmarks that require expert-level domain knowledge, such as MathVista (Lu et al., 2023) and MMMU (Yue et al., 2023a). In comparison, our proposed MMT-Bench covers an extensive range of multimodal reasoning capabilities with sufficient test samples from various modalities, as shown in Table 1, and requires expert knowledge and deliberate visual recognition, localization, reasoning, and planning. MMT-Bench poses significant challenges for the current state-of-the-art LVLMs.
Multitask Analysis. Characterizing various tasks and establishing inter-task relationships is an effective means for multitask analysis (Ilharco et al., 2023; Achille et al., 2019; Zamir et al., 2018; Wallace et al., 2021), with wide applications in areas such as meta-learning and transfer learning. A substantial amount of research has built on Taskonomy (Zamir et al., 2018), which uses transfer learning to model the structure of the space of visual tasks, thereby harnessing the interconnections among visual tasks to avoid redundancy in learning. Task2Vec (Achille et al., 2019) extracts Fisher information as task vectors, which are used in meta-learning. In our paper, thanks to the vast amount of task data collected, we evaluate LVLMs on a task map and identify the tasks that are challenging for current LVLMs.
3 MMT-Bench
In this section, we describe how to build the task hierarchy in Sec. 3.1 and the pipeline of data collection in Sec. 3.2.
3.1 Tasks
Hierarchical Task Structure. We use a hierarchical structure to include as many multimodal tasks as possible in MMT-Bench. First, all co-authors brainstorm candidate meta-tasks for multimodal understanding. We then obtain the 32 meta-tasks depicted in Fig. 1 by deduplicating the candidates and filtering for important tasks. Second, we decompose each meta-task into several subtasks. A subtask is kept in MMT-Bench according to three criteria: i) whether the subtask examines a basic multimodal capability; ii) whether the subtask challenges current LVLMs; and iii) whether test samples for the subtask are publicly accessible. After selection, MMT-Bench comprises 162 subtasks, nearly four times as many as Tiny LVLM-eHub (42 tasks; Shao et al., 2023), which previously contained the most tasks. The detailed comparison between MMT-Bench and previous benchmarks is provided in Table 1. We also present the whole hierarchical structure in Table A2 of the Appendix.
3.2 Data Collection
We design an efficient pipeline (see Fig. 2) to construct multi-choice visual question evaluation data for each subtask. The data collection was completed by dozens of co-authors specializing in artificial intelligence.
Datasets Search. We conduct comprehensive searches for related datasets using various sources such as Google, Papers With Code, Kaggle, and ChatGPT, based on the name of the subtask. After downloading the datasets, we meticulously assess their suitability for evaluating the subtask, ensuring usability and relevance. While most tasks have multiple datasets available, a few have only one publicly accessible dataset.
Metadata Construction. We define a uniform format, the metadata, to collate downloaded datasets. It enables the subsequent generation of visual questions and answers. Each sample of metadata consists of images and meta-information. The meta-information (see Fig. 2) includes the information necessary to generate questions and answers for the evaluation, together with manual annotations of the required capabilities and the type of visual prompt (i.e., input image). For evaluation efficiency, we cap the number of samples in each task at 200 by random sampling, and each dataset within a task contributes the same number of samples.
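The sampling rule above can be illustrated with a minimal Python sketch. It assumes each dataset is simply a list of sample records and splits the 200-sample budget evenly; the function name `sample_task_metadata`, the fixed seed, and the way remainders and undersized datasets are handled are illustrative assumptions, not details stated in the paper.

```python
import random

def sample_task_metadata(datasets, max_samples=200, seed=0):
    """Randomly draw an equal share of samples from every dataset of one task.

    `datasets` maps a (hypothetical) dataset name to its list of samples.
    The total number of returned samples never exceeds `max_samples`.
    """
    rng = random.Random(seed)
    per_dataset = max_samples // len(datasets)  # equal share per dataset
    # Assumption: if some dataset is smaller than its share, shrink every
    # share so that all datasets still contribute the same number.
    per_dataset = min(per_dataset, min(len(s) for s in datasets.values()))
    return {name: rng.sample(samples, per_dataset)
            for name, samples in datasets.items()}
```

With three datasets, for example, each would contribute 66 samples (198 in total) under this integer-division choice.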
Question and Answer Generation. For each subtask, we generate multi-choice visual questions (with at most eight choices, depending on the task) together with their choices and answers from the metadata. Specifically, depending on the task, we either manually design rules or use ChatGPT with well-designed prompts for efficient and high-quality generation. For example, in sketch2image retrieval, we use the corresponding image as the ground-truth answer and generate the other choices by randomly sampling other images from the metadata. In video captioning, we use ChatGPT to write confusing wrong choices.
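The rule-based case (sketch2image retrieval) can be sketched as follows. This is only an illustration under stated assumptions: the function name `build_retrieval_question`, the question wording, the default of four choices, and the record layout are hypothetical; only the idea of pairing the ground-truth image with randomly sampled distractors and the cap of eight choices come from the text.

```python
import random
import string

def build_retrieval_question(query, gt_image, image_pool, num_choices=4, seed=0):
    """Build one multi-choice question: ground-truth image plus random distractors."""
    if not 2 <= num_choices <= 8:  # the benchmark allows at most eight choices
        raise ValueError("num_choices must be between 2 and 8")
    rng = random.Random(seed)
    # Distractors are other images from the metadata, never the ground truth.
    candidates = [img for img in image_pool if img != gt_image]
    choices = rng.sample(candidates, num_choices - 1) + [gt_image]
    rng.shuffle(choices)  # avoid a fixed position for the correct answer
    letters = string.ascii_uppercase[:num_choices]
    options = dict(zip(letters, choices))
    answer = letters[choices.index(gt_image)]
    # Hypothetical question template, used here only for illustration.
    return {"question": f"Which image matches the sketch {query}?",
            "options": options, "answer": answer}
```

Shuffling the choices keeps the correct option from always landing on the same letter, which would otherwise make a position-based guess trivially strong.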
Dataset Statistics. MMT-Bench comprises meticulously curated multi-choice questions with input image types such as natural scenes, synthetic images, text-rich images, medical images, etc. (see Fig. 2), covering core meta-tasks and subtasks for multitask multimodal understanding. Compared to previous LVLM benchmarks (Yue et al., 2023a; Xu et al., 2023) that address limited image types and skills, questions in MMT-Bench span diverse multimodal scenarios such as GUI navigation and document understanding, testing capabilities including visual recognition, localization, reasoning, OCR, counting, 3D perception, temporal understanding, etc., as shown in Fig. 2. These features ensure that MMT-Bench meets the requirement of task breadth for evaluating multitask AGI.
Model | Overall | | VR | Loc | OCR | Count | HLN | IR | 3D | VC | VG | DU | AR | PLP | I2IT | RR | IQT | Emo |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Overall∗ | | VI | MemU | VPU | AND | KD | VCR | IEJ | MIA | CIM | TU | VP | MedU | AUD | DKR | EA | GN | |
Frequency Guess | 31.7 | 26.1 | 30.0 | 28.2 | 30.4 | 28.2 | 43.4 | 29.9 | 26.5 | 28.2 | 29.1 | 37.6 | 30.0 | 29.4 | 30.8 | 33.5 | 18.0 | 30.1 |
32.2 | 25.9 | 52.1 | 32.8 | 29.3 | 44.4 | 33.7 | 27.0 | 30.0 | 46.5 | 28.5 | 29.1 | 29.5 | 30.9 | 29.7 | 29.4 | 28.0 | 29.0 | |
Random Guess | 28.5 | 30.0 | 27.1 | 28.1 | 27.2 | 25.0 | 41.6 | 24.3 | 25.5 | 25.0 | 24.8 | 30.3 | 25.4 | 26.6 | 21.2 | 33.4 | 10.5 | 25.4 |
28.9 | 29.9 | 50.8 | 25.5 | 31.4 | 36.5 | 32.2 | 28.0 | 25.0 | 48.5 | 26.8 | 27.0 | 28.8 | 27.8 | 26.8 | 25.4 | 27.5 | 24.4 | |
InternVL-Chat-v1.2-34B | 63.4 | 5.7 | 81.3 | 59.4 | 60.5 | 66.4 | 82.4 | 56.3 | 45.5 | 82.3 | 49.4 | 68.3 | 52.6 | 37.4 | 32.8 | 55.0 | 84.0 | 48.7 |
58.2 | 5.7 | 61.5 | 62.5 | 58.2 | 57.0 | 62.2 | 76.0 | 31.0 | 82.8 | 56.8 | 45.2 | 41.8 | 71.8 | 57.8 | 49.4 | 74.5 | 41.2 | |
Qwen-VL-Plus | 62.3 | 6.7 | 82.6 | 55.3 | 65.6 | 61.1 | 69.9 | 40.7 | 46.5 | 86.5 | 43.6 | 77.3 | 53.4 | 43.1 | 37.8 | 53.0 | 84.5 | 41.6 |
56.6 | 6.8 | 50.3 | 61.0 | 67.5 | 58.8 | 55.3 | 76.5 | 31.8 | 81.5 | 61.3 | 45.5 | 33.7 | 73.3 | 59.5 | 46.8 | 85.0 | 32.6 | |
GPT-4V | 62.0 | 8.3 | 85.3 | 55.6 | 68.0 | 51.6 | 69.6 | 44.9 | 42.0 | 80.3 | 25.0 | 69.8 | 47.7 | 48.2 | 31.8 | 52.5 | 80.0 | 45.1 |
55.5 | 8.6 | 47.9 | 61.0 | 60.2 | 51.4 | 53.6 | 73.0 | 43.4 | 70.2 | 55.2 | 44.6 | 53.3 | 74.0 | 55.6 | 53.4 | 80.9 | 39.7 | |
GeminiProVision | 61.6 | 8.3 | 84.7 | 43.6 | 59.5 | 56.4 | 65.9 | 68.4 | 45.2 | 80.1 | 33.0 | 71.6 | 57.4 | 40.3 | 31.5 | 58.5 | 11.0 | 55.2 |
55.1 | 8.5 | 47.5 | 75.8 | 50.9 | 47.4 | 49.5 | 86.5 | 35.0 | 70.2 | 33.3 | 40.5 | 46.0 | 82.6 | 59.5 | 49.2 | 74.5 | 33.4 | |
LLaVA-NEXT-34B | 60.8 | 7.5 | 76.7 | 61.0 | 64.1 | 66.3 | 70.1 | 38.8 | 48.5 | 85.9 | 56.2 | 69.1 | 50.6 | 41.9 | 22.8 | 54.9 | 76.5 | 50.3 |
56.3 | 7.5 | 57.8 | 55.5 | 57.2 | 61.2 | 62.7 | 75.0 | 22.2 | 77.8 | 43.0 | 45.4 | 40.2 | 61.9 | 55.1 | 48.1 | 80.0 | 41.4 | |
XComposer2 | 55.7 | 11.7 | 75.3 | 47.9 | 43.9 | 51.0 | 69.5 | 32.4 | 40.5 | 73.7 | 42.6 | 62.0 | 46.3 | 43.9 | 31.5 | 50.5 | 8.0 | 53.6 |
50.0 | 11.7 | 52.6 | 71.2 | 56.1 | 56.2 | 41.5 | 83.0 | 43.8 | 80.8 | 61.2 | 36.6 | 36.3 | 53.5 | 48.8 | 43.8 | 50.5 | 29.4 | |
BLIP2 | 54.8 | 12.8 | 75.1 | 54.1 | 48.1 | 29.8 | 66.1 | 27.4 | 47.8 | 78.7 | 33.5 | 43.0 | 51.1 | 46.1 | 28.2 | 53.0 | 14.0 | 43.1 |
49.1 | 12.8 | 55.6 | 76.2 | 39.8 | 43.7 | 60.2 | 77.0 | 29.8 | 62.8 | 73.0 | 42.7 | 43.2 | 60.1 | 44.6 | 37.0 | 80.5 | 33.4 | |
Yi-VL-34B | 54.2 | 14.3 | 74.6 | 47.0 | 58.0 | 59.4 | 65.8 | 28.8 | 38.8 | 74.0 | 41.5 | 56.4 | 40.4 | 38.4 | 19.5 | 51.7 | 68.5 | 39.7 |
48.6 | 14.3 | 51.3 | 56.2 | 61.2 | 52.4 | 49.5 | 71.5 | 25.5 | 66.0 | 48.0 | 39.2 | 32.0 | 59.6 | 48.2 | 44.3 | 57.0 | 32.4 | |
Monkey-Chat | 53.4 | 15.5 | 79.0 | 40.1 | 51.0 | 43.6 | 63.1 | 26.8 | 46.5 | 68.9 | 27.5 | 51.1 | 49.3 | 32.2 | 29.5 | 61.8 | 11.0 | 45.1 |
46.0 | 15.8 | 55.3 | 69.5 | 43.6 | 44.6 | 36.3 | 85.5 | 26.0 | 58.8 | 61.7 | 36.8 | 33.3 | 68.0 | 43.6 | 38.1 | 46.0 | 29.8 | |
DeepSeek-VL-7B | 53.2 | 15.0 | 75.6 | 42.0 | 61.1 | 44.5 | 60.6 | 30.5 | 47.2 | 69.1 | 38.4 | 51.9 | 44.8 | 38.3 | 23.5 | 48.8 | 37.0 | 43.8 |
46.5 | 15.2 | 47.7 | 59.8 | 53.5 | 45.4 | 41.0 | 41.0 | 38.8 | 35.0 | 67.2 | 33.1 | 30.7 | 69.7 | 48.8 | 36.4 | 67.5 | 36.8 | |
Yi-VL-6B | 53.2 | 14.7 | 73.5 | 49.4 | 53.1 | 56.2 | 63.9 | 26.0 | 43.5 | 63.4 | 42.1 | 55.2 | 43.8 | 35.3 | 26.8 | 48.8 | 47.0 | 46.1 |
47.5 | 14.5 | 55.8 | 54.5 | 49.2 | 53.0 | 51.8 | 65.5 | 34.2 | 52.0 | 43.3 | 37.6 | 37.0 | 60.6 | 46.9 | 40.2 | 48.0 | 34.8 | |
LLaVA-NEXT-13B | 53.0 | 15.0 | 74.0 | 35.6 | 51.8 | 59.2 | 63.6 | 32.7 | 50.0 | 75.0 | 44.6 | 53.6 | 46.5 | 34.0 | 26.2 | 50.0 | 50.0 | 44.5 |
46.8 | 14.9 | 57.5 | 55.0 | 32.2 | 49.6 | 38.9 | 47.0 | 18.0 | 36.5 | 59.8 | 38.9 | 22.5 | 55.8 | 55.7 | 38.5 | 70.0 | 41.0 | |
TransCore-M | 52.7 | 13.1 | 73.6 | 40.5 | 50.4 | 54.5 | 71.9 | 27.5 | 45.0 | 75.6 | 35.1 | 45.3 | 46.9 | 38.3 | 25.0 | 53.2 | 15.0 | 46.3 |
46.9 | 12.9 | 55.6 | 76.8 | 51.9 | 43.7 | 38.6 | 85.5 | 34.2 | 52.8 | 65.8 | 29.7 | 28.8 | 61.1 | 46.5 | 38.4 | 39.5 | 35.6 | |
QWen-VL-Chat | 52.5 | 16.0 | 77.5 | 33.7 | 46.9 | 46.7 | 63.9 | 27.5 | 45.0 | 73.0 | 26.5 | 51.5 | 50.9 | 32.7 | 30.5 | 57.4 | 13.5 | 45.4 |
45.4 | 16.3 | 50.9 | 74.2 | 42.4 | 40.2 | 35.9 | 86.0 | 30.0 | 49.2 | 58.3 | 37.3 | 30.8 | 67.1 | 45.4 | 35.6 | 55.0 | 30.2 | |
Claude3V-Haiku | 52.2 | 17.7 | 74.3 | 44.8 | 54.4 | 51.1 | 63.6 | 34.6 | 38.2 | 67.6 | 26.9 | 69.8 | 46.2 | 35.5 | 22.8 | 50.0 | 59.5 | 35.2 |
46.4 | 17.7 | 42.9 | 53.8 | 43.2 | 41.2 | 53.3 | 70.5 | 31.5 | 34.8 | 52.5 | 35.9 | 34.2 | 62.7 | 34.1 | 40.4 | 54.5 | 35.1 | |
XComposer | 52.1 | 17.1 | 75.4 | 40.4 | 44.1 | 39.9 | 66.5 | 49.7 | 47.0 | 72.1 | 27.2 | 36.6 | 47.9 | 39.6 | 24.5 | 50.2 | 14.0 | 45.9 |
45.6 | 17.3 | 53.4 | 63.8 | 40.6 | 43.4 | 42.3 | 78.0 | 29.0 | 66.2 | 52.3 | 33.1 | 28.3 | 55.6 | 40.8 | 39.3 | 38.5 | 34.2 | |
mPLUG-Owl2 | 52.0 | 17.3 | 76.5 | 45.8 | 44.5 | 47.6 | 63.4 | 27.6 | 45.2 | 66.6 | 33.0 | 42.4 | 45.2 | 41.6 | 25.5 | 52.0 | 18.0 | 42.0 |
45.0 | 17.5 | 58.5 | 59.0 | 40.1 | 49.4 | 32.9 | 85.5 | 30.0 | 55.0 | 57.7 | 31.9 | 27.3 | 63.4 | 45.5 | 38.1 | 35.0 | 27.8 | |
RBDash-v1-13B | 51.8 | 15.7 | 72.2 | 42.2 | 53.6 | 51.6 | 66.6 | 26.3 | 40.8 | 75.5 | 36.9 | 48.1 | 47.1 | 38.3 | 22.5 | 55.9 | 14.0 | 43.4 |
46.1 | 15.3 | 57.1 | 67.5 | 51.4 | 45.7 | 33.2 | 78.0 | 39.0 | 32.0 | 64.2 | 31.6 | 25.5 | 59.3 | 46.3 | 38.1 | 53.5 | 32.4 | |
LLaVA-v1.5-13B | 51.7 | 15.3 | 73.8 | 38.8 | 51.8 | 55.1 | 65.8 | 27.2 | 39.8 | 70.4 | 37.4 | 45.7 | 46.6 | 37.6 | 28.0 | 58.2 | 13.5 | 45.3 |
45.7 | 15.2 | 58.1 | 66.0 | 43.9 | 48.3 | 31.4 | 79.0 | 35.8 | 28.5 | 62.5 | 33.3 | 27.5 | 58.6 | 46.6 | 39.4 | 40.5 | 37.5 | |
CogVLM-Chat | 51.6 | 17.5 | 77.7 | 24.7 | 48.5 | 49.8 | 66.0 | 26.1 | 42.2 | 69.8 | 28.8 | 49.1 | 46.3 | 33.2 | 23.8 | 61.6 | 14.0 | 50.3 |
44.2 | 17.9 | 52.4 | 75.5 | 39.8 | 43.4 | 28.2 | 82.0 | 28.0 | 70.8 | 45.8 | 35.5 | 28.3 | 65.9 | 44.9 | 36.9 | 48.0 | 29.9 | |
ShareGPT4V-7B | 51.5 | 16.4 | 74.2 | 36.0 | 47.8 | 50.9 | 62.4 | 27.8 | 45.2 | 71.6 | 35.4 | 47.9 | 46.2 | 39.2 | 21.8 | 59.8 | 14.0 | 44.3 |
45.1 | 16.4 | 54.5 | 70.5 | 47.1 | 48.2 | 26.3 | 83.0 | 27.8 | 38.0 | 64.3 | 32.1 | 30.0 | 60.8 | 46.1 | 38.9 | 42.0 | 28.9 | |
LLaVA-NEXT-7B | 51.1 | 18.1 | 73.3 | 29.5 | 52.0 | 56.8 | 59.9 | 28.7 | 43.2 | 69.8 | 37.0 | 49.7 | 47.9 | 32.6 | 22.8 | 49.0 | 47.5 | 48.1 |
44.6 | 18.0 | 57.8 | 54.0 | 38.5 | 44.3 | 34.6 | 42.5 | 18.8 | 32.5 | 67.8 | 39.1 | 23.3 | 55.5 | 53.5 | 37.0 | 65.0 | 31.6 | |
LLaVA-v1.5-13B-XTuner | 51.1 | 16.8 | 72.5 | 40.7 | 46.8 | 54.1 | 66.5 | 26.4 | 47.5 | 68.8 | 35.6 | 47.0 | 44.2 | 38.3 | 26.0 | 52.4 | 14.0 | 51.0 |
45.1 | 16.5 | 54.4 | 66.5 | 47.9 | 52.0 | 28.8 | 82.0 | 39.2 | 37.0 | 56.8 | 28.3 | 28.3 | 49.1 | 44.4 | 37.3 | 33.5 | 40.9 | |
LLaVA-InternLM2-7B | 50.8 | 17.5 | 73.3 | 38.9 | 49.5 | 51.8 | 67.8 | 27.7 | 49.5 | 66.4 | 36.9 | 37.7 | 43.7 | 35.1 | 14.2 | 58.0 | 0.0 | 51.1 |
44.4 | 17.4 | 52.3 | 62.5 | 45.1 | 57.2 | 35.2 | 83.0 | 34.2 | 55.8 | 58.2 | 26.8 | 18.5 | 57.8 | 45.1 | 33.7 | 35.5 | 35.2 | |
LLaVA-v1.5-7B-XTuner | 50.2 | 19.5 | 72.5 | 41.1 | 46.0 | 49.9 | 62.1 | 26.0 | 45.5 | 66.4 | 35.3 | 42.8 | 45.8 | 42.5 | 25.5 | 53.9 | 11.5 | 44.2 |
43.9 | 19.3 | 60.1 | 56.5 | 42.6 | 47.2 | 28.4 | 80.5 | 32.2 | 41.2 | 63.2 | 29.9 | 24.2 | 52.5 | 43.4 | 37.2 | 32.0 | 30.5 | |
SharedCaptioner | 49.9 | 19.6 | 72.8 | 41.8 | 47.8 | 46.2 | 63.1 | 27.0 | 44.2 | 61.9 | 27.0 | 39.5 | 46.7 | 33.5 | 25.0 | 59.5 | 14.5 | 39.9 |
43.2 | 19.5 | 55.1 | 53.8 | 45.4 | 38.3 | 33.6 | 82.5 | 20.2 | 57.8 | 56.8 | 32.6 | 28.7 | 59.4 | 44.7 | 38.4 | 45.0 | 29.6 | |
LLaVA-InternLM-7B | 49.7 | 19.6 | 70.1 | 38.7 | 47.6 | 46.0 | 62.0 | 25.5 | 42.0 | 65.0 | 26.5 | 43.9 | 45.6 | 38.3 | 25.0 | 52.4 | 14.0 | 47.0 |
43.9 | 19.3 | 57.5 | 58.2 | 45.6 | 46.5 | 33.2 | 75.5 | 33.0 | 57.0 | 59.7 | 28.0 | 27.3 | 52.0 | 42.2 | 38.1 | 46.5 | 37.6 | |
LLaVA-v1.5-7B | 49.5 | 20.3 | 72.8 | 34.3 | 45.0 | 47.5 | 61.6 | 26.1 | 44.8 | 68.1 | 34.0 | 40.8 | 46.6 | 36.0 | 22.2 | 58.0 | 12.5 | 42.5 |
43.1 | 20.3 | 57.6 | 70.5 | 33.3 | 49.1 | 31.6 | 81.0 | 27.8 | 37.5 | 62.3 | 31.7 | 27.5 | 56.8 | 45.1 | 35.6 | 42.5 | 20.4 | |
LLaMA-Adapter-v2-7B | 40.4 | 27.5 | 62.3 | 32.5 | 35.0 | 30.1 | 46.5 | 24.1 | 33.8 | 34.8 | 25.2 | 30.2 | 43.9 | 33.1 | 18.2 | 44.9 | 11.0 | 36.0 |
34.1 | 27.4 | 36.4 | 40.5 | 33.8 | 30.4 | 34.9 | 71.0 | 33.2 | 42.2 | 35.8 | 31.1 | 25.8 | 52.0 | 29.1 | 32.0 | 25.0 | 29.9 | |
VisualGLM-6B | 38.6 | 27.1 | 55.0 | 33.1 | 33.8 | 31.1 | 39.2 | 26.0 | 36.8 | 40.5 | 31.1 | 39.1 | 39.2 | 32.4 | 26.8 | 43.8 | 14.0 | 33.1 |
33.9 | 27.0 | 28.9 | 44.8 | 27.1 | 34.5 | 35.2 | 65.0 | 28.0 | 35.8 | 48.2 | 30.8 | 23.5 | 44.0 | 26.2 | 29.6 | 37.5 | 21.1 |
4 Experiments
In this section, we conduct a comprehensive evaluation of 30 LVLMs on MMT-Bench. Sec. 4.1 presents the selected LVLMs and the evaluation methods. The quantitative evaluation of each meta-task is provided in Sec. 4.2. We present the analysis of specific tasks with different prompt methods in Sec. 4.3. Furthermore, we give an error analysis of three representative LVLMs in Sec. 4.4.
4.1 Evaluation Details
Selected LVLMs. For completeness, we test 30 representative LVLMs varying in parameters, vision encoders (InternVL (Chen et al., 2023b), EVA-CLIP-ViT (Sun et al., 2023), CLIP-ViT (Radford et al., 2021)), and LLMs (QWen (Bai et al., 2023), InternLM (Team, 2023b), LLaMA (Touvron et al., 2023a, b), Vicuna (Zheng et al., 2023), Flan-T5 (Chung et al., 2022)). For details, see Appendix D.1.
Evaluation Methods. In MMT-Bench, samples are in a multi-choice format, e.g., ‘What is this? Options: (A) Dog (B) Cat’. To extract the choice from LVLMs’ responses, we follow OpenCompass’ protocol (Contributors, 2023a): 1) Check if the response includes option letters (A/B); 2) Check for option content (‘dog’/‘cat’); 3) Use ChatGPT for extraction. If these steps fail, we set the model selection as option letter Z to avoid random assignment (Yue et al., 2023a). Accuracy is the primary metric.
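The extraction cascade described above can be sketched in a few lines of Python. This is a simplified illustration, not the OpenCompass implementation: the regular expression, the rule of accepting a match only when it is unambiguous, and the `llm_extractor` callable (a hypothetical stand-in for the ChatGPT step) are all assumptions. In particular, the letter heuristic would misread a response that begins with the article "A".

```python
import re

def extract_choice(response, options, llm_extractor=None):
    """Map a free-form model response to an option letter, or 'Z' on failure.

    `options` maps letters to option text, e.g. {"A": "Dog", "B": "Cat"}.
    `llm_extractor` is a hypothetical callable standing in for the ChatGPT step.
    """
    # Step 1: look for a standalone option letter such as "A" or "(A)".
    letter_hits = {m for m in re.findall(r"\b([A-Z])\b", response) if m in options}
    if len(letter_hits) == 1:
        return letter_hits.pop()
    # Step 2: look for the option content itself (case-insensitive).
    lowered = response.lower()
    content_hits = [k for k, text in options.items() if text.lower() in lowered]
    if len(content_hits) == 1:
        return content_hits[0]
    # Step 3: defer to an LLM-based extractor if one is supplied.
    if llm_extractor is not None:
        guess = llm_extractor(response, options)
        if guess in options:
            return guess
    # All steps failed: return 'Z' so the sample is scored as incorrect.
    return "Z"

def accuracy(predictions, answers):
    """Fraction of predictions that equal the ground-truth letters."""
    return sum(p == a for p, a in zip(predictions, answers)) / len(answers)
```

Because "Z" never matches a real option, a failed extraction always counts as a wrong answer instead of receiving a random chance of being correct.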
4.2 Overall Evaluation
This section evaluates LVLMs on MMT-Bench alongside the Random Guess and Frequency Guess baselines. We report the overall score for all meta-tasks as well as the best performance on each meta-task in Table 2. The detailed results of each subtask are provided in Sec. L of the Appendix. Various prompt settings for all tasks are investigated. We summarize the key findings as follows.
i) The Comprehensive Challenge of MMT-Bench. The benchmark poses significant challenges: even advanced models such as InternVL-Chat, GPT-4V, and GeminiProVision achieve just 63.4%, 62.0%, and 61.6% accuracy, respectively, indicating substantial room for improvement. Notably, when Visual Recognition (VR), one of GeminiProVision's strongest areas with a score of 84.7%, is removed, its overall performance drops to 55.1%. The varied task dimensions of MMT-Bench demand wide-ranging capabilities for optimal performance, emphasizing the benchmark's extensive and rigorous criteria.
ii) Open-Source vs. Closed-Source LVLMs. The performance of most open-source models lags behind that of closed-source models. However, the leading open-source LVLM, InternVL-Chat-V1.2-34B, has demonstrated remarkable performance, outperforming sophisticated proprietary models such as GPT-4V and GeminiProVision in overall accuracy. This achievement suggests that by scaling model size, optimizing training regimes, and leveraging diverse high-quality data, open-source LVLMs can rival and even exceed the capabilities of advanced proprietary models. It is encouraging for the open-source community and paves the way for more high-performance yet cost-effective solutions in academia and industry.
iii) The Influence of LLMs and Model Scaling. As shown in Table 2, model performance significantly improves with an increase in size (7B to 13B) for both LLaVA-v1.5 and LLaVA-v1.5-XTuner. Upgrading the LLM from InternLM to InternLM2 also enhances the performance of LLaVA, suggesting that larger or improved LLMs boost multitask performance even with unchanged training data and visual encoders.
iv) Model Performance across Different Meta-Tasks. Most LVLMs excel in Visual Recognition (VR) and Visual Captioning (VC) tasks, highlighting the ability of LVLMs to recognize 'what' an object is and to describe the content shown in the image. However, most LVLMs struggle with fine-grained perception tasks (localization, pixel-level perception, etc.) and complex reasoning tasks (image evaluation judgment).
v) BLIP2 without Instruction Tuning. Among open-source models, BLIP2 performs impressively without instruction-following training, outdoing LLaVA models trained with extensive instruction-following data. Although instruction-tuned models can give responses that align better with human preference than BLIP2 in open-set QA on some tasks (Liu et al., 2023b), they perform worse than BLIP2 in the closed-set setting of MMT-Bench. This reflects MMT-Bench's multitask challenges and hints at using the taxonomy of MMT-Bench to expand the supervised fine-tuning dataset for future advancement.
4.3 Specific Task and Prompt Methods Analysis
In this section, we evaluate specific tasks using different prompts for LVLMs.
Prompting LVLMs with multi-images vs. single-image. Here we explore how multi-image prompts and single-image prompts affect the performance of LVLMs. To this end, we gather the tasks in MMT-Bench that usually require multiple images as input, such as image retrieval and video captioning. For multi-image prompting, we first evaluate LVLMs that are inherently designed to support multiple images as input (dubbed Multi-Images LVLMs), including mPLUG-Owl2, QWen-VL-Chat, and GeminiProVision. For a more comprehensive comparison, we also assess LVLMs that were mainly trained on single-image prompts (dubbed Single-Image LVLMs), including BLIP2, SharedCaptioner, ShareGPT4V-7B, Monkey, and LLaVA-v1.5-7B. Following previous studies (Dai et al., 2023; Li et al., 2023c), we input each image individually to Single-Image LVLMs and concatenate all output visual embeddings before feeding them into the LLM. The multi-image prompts designed for Multi-Images LVLMs and Single-Image LVLMs are summarized in Appendix Sec. D.2. For single-image prompting, we manually combine multiple images into one image and feed it into LVLMs (see examples in Fig. 1).
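The multi-image procedure for Single-Image LVLMs (encode each image separately, then concatenate the visual embeddings) can be sketched as below. This is a toy illustration: `encode_image` is a hypothetical stand-in for a model's vision encoder and projector, and embeddings are represented as plain Python lists so that the example runs without any deep-learning library.

```python
def concat_visual_tokens(images, encode_image):
    """Encode every image on its own, then join the token sequences in order.

    `encode_image` is a hypothetical stand-in for the LVLM's vision encoder
    (plus projector); it returns a list of embedding vectors for one image.
    """
    visual_tokens = []
    for image in images:
        visual_tokens.extend(encode_image(image))  # append along the token axis
    return visual_tokens


# Toy encoder for illustration: two 3-dimensional tokens per image.
def toy_encoder(image_id):
    return [[float(image_id)] * 3 for _ in range(2)]

tokens = concat_visual_tokens([0, 1, 2], toy_encoder)
print(len(tokens))  # 6 tokens: 3 images x 2 tokens each
```

The concatenated sequence would then be placed in the LLM input alongside the text prompt; how each model interleaves the two is model-specific and not shown here.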
The detailed performance comparisons are presented in Fig. 3(a)-(h). We have several observations: i) Multi-images tasks posed significant challenges to current LVLMs, where the best accuracy achieved by GeminiProVision is only . ii) For Multi-Images LVLMs, providing multiple images as prompts instead of a single image boosted the overall performance on these tasks, demonstrating their capabilities to extract beneficial information from multiple images. For instance, for the task of face retrieval (FR), the performance of GeminiProVision increased from to when providing multiple images as visual prompts. iii) For Single-Image LVLMs, multi-image prompts also help improve the overall performance of most models, except for Monkey. To our surprise, BLIP2 achieved significant performance gain when switching to a multi-image prompt setting, especially on tasks like general action recognition (GAR) and video captioning (VC). These results highlight the potential of LVLMs to learn more robust unified representations of multiple modalities.
Most LVLMs Show Poor Generalization in Visual Referring Prompting. Visual referring prompting is an impressive prompting technique that entails direct image edits like drawing bounding boxes or masks to guide LVLMs to focus on specific regions (Yang et al., 2023a). We select tasks (see Sec. D.3) involving visual referring prompting to explore the influence of different prompting methods on the final results. We compared three additional settings: using text prompts for bounding boxes in normalized ([0,1]) and pixel ([0, h or w]) formats, and combining visual and text prompts. As depicted in Fig. 3(i), visual prompting (blue curve) significantly lags behind other settings, a disparity mainly attributed to the lack of visual prompting data in most LVLMs during the Supervised Fine-Tuning (SFT) stage.
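To make the two text-prompt formats concrete, the following sketch renders a pixel-space bounding box either in normalized [0, 1] coordinates or in raw pixel coordinates. The function name `bbox_text_prompt`, the two-decimal rounding, and the phrase wrapped around the coordinates are illustrative assumptions; the actual prompt templates are not given in this section.

```python
def bbox_text_prompt(box, width, height, normalized=True):
    """Describe a pixel-space box (x1, y1, x2, y2) as text for the prompt."""
    x1, y1, x2, y2 = box
    if normalized:
        # Scale x by the image width and y by the image height into [0, 1].
        coords = [x1 / width, y1 / height, x2 / width, y2 / height]
        formatted = ", ".join(f"{c:.2f}" for c in coords)
    else:
        # Keep raw pixel coordinates in [0, w] and [0, h].
        formatted = ", ".join(str(int(c)) for c in (x1, y1, x2, y2))
    # Hypothetical wording; only the coordinate format is the point here.
    return f"the region with bounding box [{formatted}]"


box = (100, 100, 200, 300)
print(bbox_text_prompt(box, width=400, height=500, normalized=True))
print(bbox_text_prompt(box, width=400, height=500, normalized=False))
```

In the visual referring setting, the same box would instead be drawn onto the image, and the combined setting would use both the drawn box and one of these text forms.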
4.4 Error Analysis
To analyze the error distribution of LVLMs on the MMT-Bench, we examined three LVLMs: GPT-4V, GeminiProVision, and InternVL-Chat-V1.2 (InternVL). Specifically, we randomly selected up to 5 incorrectly answered questions per subtask for each model. Task-specific experts among the co-authors then analyzed these error samples to identify the underlying reasons for the mistakes, yielding the error distribution presented in Fig. 4. For definitions and case studies of these six error types, please refer to Sec. G in the appendix.
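The selection of error cases can be sketched as follows, assuming each evaluation record is a dictionary with `subtask`, `prediction`, and `answer` fields. These field names, the function name `sample_errors`, and the fixed seed are hypothetical; the manual categorization by experts is of course not automated here.

```python
import random
from collections import defaultdict

def sample_errors(records, per_subtask=5, seed=0):
    """Pick up to `per_subtask` wrongly answered questions from each subtask."""
    rng = random.Random(seed)
    wrong = defaultdict(list)
    for rec in records:
        if rec["prediction"] != rec["answer"]:  # keep only incorrect answers
            wrong[rec["subtask"]].append(rec)
    # Subtasks with fewer than `per_subtask` errors contribute all of them.
    return {task: rng.sample(items, min(per_subtask, len(items)))
            for task, items in wrong.items()}
```

Running this once per model yields the pool of error samples that would then be handed to the annotators.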
As shown in Fig. 4, perception error is the most common error type across all models. GPT-4V exhibits a significantly lower perception error rate (51%) than GeminiProVision (76.9%) and InternVL (67.2%), indicating its superior performance in perception tasks. Reasoning error is the second most prevalent error type, with InternVL having the highest reasoning error rate (14.8%), followed by GeminiProVision (10.4%) and GPT-4V (9.94%), which highlights the challenges all models face in complex reasoning tasks.
Additionally, the proportion of lack-of-knowledge errors is similar across the three models, ranging from 6.99% to 9.0%, which suggests that insufficient knowledge is a common issue. However, GPT-4V has notably higher error rates in lack of capability (19%) and refusing to answer (6.11%) than the other models, which may be attributed to its more honest approach of acknowledging its limitations and declining to answer certain questions.
InternVL stands out for its high error rate in failing to follow instructions (6.64%), significantly surpassing GPT-4V (2.99%) and GeminiProVision (1.14%), indicating its struggle in comprehending and executing instructions effectively. On the other hand, annotation error contributes the least to the overall error distribution, implying that the quality of data annotation is high and has a minimal impact on model performance.
To enhance the performance of these LVLMs, future improvements should focus on the specific error types identified. By targeting perception and reasoning capabilities, tackling the lack of knowledge, and refining the ability to follow instructions, developers can work towards more accurate and reliable vision-language models. GPT-4V’s honest approach to its limitations also highlights the importance of transparency in AI systems, which can be further explored and incorporated into future model designs.
5 Taskonomy Analysis
Thanks to the extensive coverage of tasks in the MMT-Bench, we can evaluate the multimodal performance of LVLMs on a task map. In this way, the roles of different tasks in multimodal capability can be systematically interpreted by analyzing relationships between tasks in the map.
5.1 Analytical Tools
Task map. To investigate the relationships between subtasks, we quantify each subtask as a task vector, following Ilharco et al. (2023). Formally, a task vector is defined as the difference between the weights fine-tuned on the task data and the initial weights of a probing model, as given by $V_t = \arg\min_{\theta} L_t(\theta \mid \theta_0) - \theta_0$, where the subscript $t$ denotes the task, $\theta_0$ denotes the initial weights of the probing model, and $L_t$ is the task loss. Three steps are adopted to obtain $V_t$. First, we use pre-trained QwenVL-Chat as the probing model because QwenVL-Chat achieves good results on most subtasks, which helps acquire promising task vectors. Second, we construct task data by adapting all multi-choice VQA samples into instruction-following data for each subtask. Third, unlike TaskVec (Ilharco et al., 2023), which fine-tunes the whole model, we fine-tune QwenVL-Chat with LoRA (Hu et al., 2021) for a fixed number of epochs on every subtask, which reduces the length of the task vector from the order of billions (B) to millions (M) of entries and consumes less storage. With the task vectors, a task map can be constructed as $D \in \mathbb{R}^{N \times N}$ with $D_{ij} = d_{\cos}(V_i, V_j)$, where $d_{\cos}(V_i, V_j) = 1 - \frac{V_i^{\top} V_j}{\|V_i\| \, \|V_j\|}$ denotes the cosine distance between tasks $i$ and $j$, and $N$ denotes the total number of subtasks. By definition, $0 \le D_{ij} \le 2$.
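As a minimal sketch of this construction, assuming each subtask's fine-tuned and initial (LoRA) weights are available as dictionaries of NumPy arrays, a task vector and the pairwise cosine-distance map could be computed as below. The names `task_vector` and `task_map` and the dictionary layout are illustrative assumptions, not the authors' released code.

```python
import numpy as np


def task_vector(finetuned, initial):
    """Flatten (fine-tuned - initial) weights into one 1-D task vector.

    Both arguments are dicts mapping parameter names to equally shaped
    arrays (a hypothetical layout chosen for this sketch).
    """
    return np.concatenate(
        [(finetuned[k] - initial[k]).ravel() for k in sorted(initial)]
    )


def task_map(task_vectors):
    """Pairwise cosine distance (1 - cosine similarity) between task vectors."""
    V = np.asarray(task_vectors, dtype=np.float64)
    # Normalise each row; this assumes no task vector is all zeros.
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    # Clip floating-point noise so every entry stays inside [0, 2].
    D = np.clip(1.0 - U @ U.T, 0.0, 2.0)
    np.fill_diagonal(D, 0.0)
    return D
```

Stacking one task vector per subtask into an array of shape (number of subtasks, vector length) and passing it to `task_map` yields a symmetric matrix with zeros on the diagonal.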
Threshold | 1 | | | | |
---|---|---|---|---|---|
Avg. Kendall's tau | 0.29 | 0.31 | 0.32 | 0.41 | 0.60 |
Ranking correlation: Kendall’s tau $\tau$. To quantitatively evaluate LVLMs on a task map, we use Kendall’s tau to measure the ranking correlation between the performance sequences of LVLMs on different subtasks. The intuition is that model $m$ would be superior to model $n$ on task $j$ if model $m$ performs better than model $n$ on task $i$ when the distance between tasks $i$ and $j$ is small. Kendall’s tau is defined as $\tau_{ij} = \frac{2}{M(M-1)} \sum_{m<n} \sigma(P_i^m - P_i^n) \, \sigma(P_j^m - P_j^n)$, where $P_i^m$ denotes the performance of model $m$ on task $i$ and $M$ is the number of LVLMs. The function $\sigma(\cdot)$ returns $-1$ if the argument is negative and $1$ otherwise. When $\tau_{ij} = 1$, LVLMs have a completely consistent performance ranking on tasks $i$ and $j$.
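The definition above can be sketched in a few lines of plain Python. The function names are illustrative; the sign convention (negative maps to -1, everything else to 1) follows the text, so ties are counted as agreement, which differs from the tie handling of tau-b as computed by default in `scipy.stats.kendalltau`.

```python
from itertools import combinations


def sign(x):
    """Return -1 for negative arguments and 1 otherwise (zero maps to 1)."""
    return -1 if x < 0 else 1


def kendall_tau(perf_i, perf_j):
    """Ranking correlation between two tasks.

    perf_i[m] and perf_j[m] are the scores of model m on task i and task j.
    The result is the mean sign agreement over all unordered model pairs.
    """
    if len(perf_i) != len(perf_j) or len(perf_i) < 2:
        raise ValueError("need two equally long score lists with >= 2 models")
    pairs = list(combinations(range(len(perf_i)), 2))
    total = sum(
        sign(perf_i[m] - perf_i[n]) * sign(perf_j[m] - perf_j[n])
        for m, n in pairs
    )
    return total / len(pairs)


print(kendall_tau([10, 20, 30], [1, 2, 3]))  # 1.0, identical ranking
print(kendall_tau([10, 20, 30], [3, 2, 1]))  # -1.0, reversed ranking
```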
5.2 Findings on Task Map
LVLMs obtain a more consistent performance ranking on tasks closer to each other. We assess whether LVLMs achieve consistent performance on two tasks that are close to each other. To measure this consistency, we employ the Kendall’s tau metric introduced in Sec. 5.1. Specifically, we consider all subtask pairs in which the two tasks are close to each other and calculate their average Kendall’s tau $\bar{\tau}$, given by $\bar{\tau} = \frac{1}{|\mathcal{S}_\epsilon|} \sum_{(i,j) \in \mathcal{S}_\epsilon} \tau_{ij}$, where $\tau_{ij}$ is Kendall’s tau between tasks $i$ and $j$, $\mathcal{S}_\epsilon$ is the set of subtask pairs $(i, j)$ whose distance on the task map is at most $\epsilon$, and $\epsilon$ is a threshold used to control the proximity between two tasks. As shown in Table 3, as the threshold $\epsilon$ decreases, the task distance becomes smaller and $\bar{\tau}$ increases. This suggests that LVLMs obtain a more consistent performance ranking on tasks closer to each other. Hence, the performance of LVLMs on a new task can be predicted if it is close to one of the MMT-Bench subtasks.
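A minimal sketch of the thresholded average, assuming a precomputed task-distance matrix and a matrix of pairwise Kendall's tau values, both symmetric and indexed by subtask. The function name and the use of "less than or equal" for the threshold test are assumptions.

```python
import numpy as np


def average_tau_within(distance, tau, threshold):
    """Mean Kendall's tau over subtask pairs whose distance is <= threshold.

    `distance` and `tau` are symmetric (N, N) arrays; each unordered pair
    (i, j) with i < j is counted once. Returns NaN if no pair qualifies.
    """
    distance = np.asarray(distance, dtype=np.float64)
    tau = np.asarray(tau, dtype=np.float64)
    i_idx, j_idx = np.triu_indices(distance.shape[0], k=1)
    close = distance[i_idx, j_idx] <= threshold
    if not close.any():
        return float("nan")
    return float(tau[i_idx, j_idx][close].mean())
```

Sweeping `threshold` over a decreasing sequence of values and recording the result gives one row of averages of the kind reported in Table 3.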
Out-of-Domain (OoD) tasks discovery. OoD tasks are tasks that current models struggle to handle. Discovering OoD tasks can provide insights for future evaluation efforts and the development of stronger LVLMs. Since model performance on different tasks is related to task distances, we hypothesize that OoD tasks are grouped in local regions of the task map. Therefore, we conduct hierarchical clustering on the task map to find OoD tasks. Specifically, the 162 subtasks are grouped into 12 clusters, as shown in Fig. 5. We use two criteria to identify clusters containing OoD tasks. First, LVLMs achieve poor performance on OoD tasks, so we calculate the average multimodal performance within each task cluster over all LVLMs. Second, the performance of LVLMs on OoD tasks is inconsistent with the overall multimodal score in Table 2, because even LVLMs with competitive overall scores can fail on OoD tasks; hence, we calculate the average ranking correlation within each cluster. We present these statistics in Table 4 and provide a detailed analysis with the clustering results in Appendix A.
From Table 4, we can see that clusters 8, 9, and 11 achieve both low multimodal accuracy and low ranking correlation $\tau$. In Sec. 4.2, we find that the models struggle with fine-grained visual tasks such as detection. Through the analysis of these clusters, we similarly find that current multimodal large models cannot perform fine-grained visual cognition or understand positional and spatial relationships, as required by localization and detection tasks. Moreover, they perform poorly on tasks involving new data structures or image types, showing a lack of proficiency in handling GUI-related tasks and special data structures such as tables.
Cluster | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
# Tasks | 11 | 53 | 16 | 16 | 9 | 8 | 7 | 16 | 4 | 9 | 10 | 3 |
Avg. Kendall's tau | 0.54 | 0.73 | 0.57 | 0.48 | -0.05 | 0.62 | 0.63 | 0.34 | 0.12 | 0.57 | 0.38 | 0.59 |
Acc | 40.4 | 64.7 | 61.9 | 39.9 | 55.9 | 30.0 | 33.1 | 40.2 | 31.4 | 61.2 | 33.2 | 50.7 |
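The per-cluster statistics in Table 4 can be approximated with SciPy's hierarchical clustering, as sketched below. The linkage method ("average"), the function names, and the reading of "average ranking correlation within each cluster" as the mean pairwise Kendall's tau among the tasks of a cluster are assumptions; the paper does not specify them in this section, and another reading (correlation of each task's ranking with the overall score) is also plausible.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def cluster_tasks(distance, n_clusters, method="average"):
    """Cut a hierarchical clustering of the task map into n_clusters groups."""
    # linkage expects the condensed (upper-triangular) form of the matrix.
    condensed = squareform(np.asarray(distance, dtype=np.float64), checks=False)
    tree = linkage(condensed, method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")


def cluster_stats(labels, tau, acc):
    """Per-cluster task count, mean pairwise tau, and mean accuracy.

    tau: (N, N) pairwise Kendall's tau; acc: (M, N) model-by-task accuracy.
    """
    tau = np.asarray(tau, dtype=np.float64)
    acc = np.asarray(acc, dtype=np.float64)
    stats = {}
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        if len(idx) > 1:
            i, j = np.triu_indices(len(idx), k=1)
            mean_tau = float(tau[np.ix_(idx, idx)][i, j].mean())
        else:
            mean_tau = float("nan")  # a single task has no pairs
        stats[int(c)] = {
            "n_tasks": int(len(idx)),
            "tau": mean_tau,
            "acc": float(acc[:, idx].mean()),
        }
    return stats
```

Clusters that score low on both `tau` and `acc` are candidates for OoD tasks, while clusters that score high on both are candidates for in-domain tasks.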
In-domain tasks discovery. In-domain tasks are tasks that most current multimodal large models can handle correctly. Discovering in-domain tasks guides the commercial application of LVLMs in specific scenarios. In contrast to OoD tasks, we identify in-domain tasks by looking for clusters with large ranking correlation and high multimodal accuracy. From Table 4, we can see that clusters 2, 3, and 10 achieve relatively high accuracy and large ranking correlation $\tau$. We observe that current multimodal large models possess strong high-level visual comprehension capabilities, enabling them to handle visual recognition tasks effectively even on specialized images such as medical images, which is also found in Sec. 4.2. Moreover, they benefit from powerful LLMs to describe images accurately. We provide a detailed analysis along with the clustering results in Appendix A.
6 Conclusion and Discussion
In this work, we introduce MMT-Bench, a comprehensive benchmark designed to evaluate LVLMs in multimodal multitask understanding. The breadth of MMT-Bench is highlighted by its meticulously curated dataset of multi-choice questions covering multimodal tasks. Our evaluation reveals that MMT-Bench poses significant challenges for current LVLMs. We present a taskonomy analysis of LVLMs on the task map, which allows us to predict performance on a new task. Our goal with MMT-Bench is to measure progress on the path to multitask AGI. We acknowledge that MMT-Bench may not be sufficient as a standard for determining whether multitask AGI has been achieved, as it is impossible to include all multimodal tasks. However, we believe that strong performance on MMT-Bench is necessary for a multitask AGI. We will continue to expand the task set of MMT-Bench. We believe that MMT-Bench will inspire further research and development in LVLMs, bringing us closer to the realization of truly intelligent multimodal systems.
Broader Impact. The development and widespread adoption of MMT-Bench as a benchmark for evaluating large vision-language models (LVLMs) have the potential to significantly impact the field of artificial intelligence. While MMT-Bench offers valuable insights and guidance for advancing LVLM research, it is important to consider its broader impact, including ethical considerations and potential societal consequences.
One potential positive impact of MMT-Bench is its role in driving advancements in LVLM technology, leading to improved performance and capabilities in various multimodal tasks. This could benefit numerous applications, such as visual dialogue, video analysis, and document understanding, ultimately enhancing user experiences and productivity.
However, it is crucial to recognize and address potential negative impacts as well. One of the primary limitations of MMT-Bench is its reliance on curated data, which may inadvertently introduce biases based on the sources and methodologies used for data collection. For example, the performance of each meta-task is obtained by taking the average over all subtasks, which may lead to biased assessment because meta-tasks comprise different numbers of subtasks. Moreover, the selection of tasks and subtasks in MMT-Bench may only partially capture the diversity of real-world scenarios, leading to a limited understanding of LVLMs’ capabilities across different domains and populations. Furthermore, the data collection process might disproportionately represent certain demographics or contexts, which can lead to biased evaluations of LVLMs’ performance.
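The averaging issue mentioned above can be illustrated with a small worked example. The accuracy values and meta-task names below are made up purely for illustration; the sketch only shows how an average over meta-task scores can differ from an average over all subtasks when meta-tasks contain different numbers of subtasks.

```python
from statistics import mean


def overall_scores(subtask_acc):
    """Compare two ways of aggregating subtask accuracies.

    subtask_acc maps a meta-task name to the list of its subtask accuracies.
    Returns (per-meta-task means, mean over meta-tasks, mean over subtasks).
    """
    per_meta = {name: mean(values) for name, values in subtask_acc.items()}
    meta_avg = mean(per_meta.values())
    subtask_avg = mean(a for values in subtask_acc.values() for a in values)
    return per_meta, meta_avg, subtask_avg


# Hypothetical numbers: one meta-task with four subtasks, one with a single subtask.
per_meta, meta_avg, subtask_avg = overall_scores(
    {"meta_A": [80, 60, 70, 90], "meta_B": [30]}
)
print(per_meta)     # {'meta_A': 75, 'meta_B': 30}
print(meta_avg)     # 52.5
print(subtask_avg)  # 66
```

Here the single subtask of `meta_B` carries as much weight in the meta-task average as the four subtasks of `meta_A` combined.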
Another concern is that the benchmark’s emphasis on performance metrics such as overall scores and task-specific accuracies may oversimplify the evaluation process and obscure nuanced differences in LVLMs’ performance. This could mask disparities in model performance across demographic groups or domains, contributing to the perpetuation of biases and inequities in AI systems. We are dedicated to collecting as many multimodal tasks as possible into MMT-Bench to reduce such bias in evaluation.
References
- Achille et al. (2019) Achille, A., Lam, M., Tewari, R., Ravichandran, A., Maji, S., Fowlkes, C. C., Soatto, S., and Perona, P. Task2vec: Task embedding for meta-learning. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 6430–6439, 2019.
- AI et al. (2024) 01.AI, Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., Yu, K., Liu, P., Liu, Q., Yue, S., Yang, S., Yang, S., Yu, T., Xie, W., Huang, W., Hu, X., Ren, X., Niu, X., Nie, P., Xu, Y., Liu, Y., Wang, Y., Cai, Y., Gu, Z., Liu, Z., and Dai, Z. Yi: Open foundation models by 01.ai, 2024.
- Anthropic (2023) Anthropic. Claude, 2023. URL https://www.anthropic.com. Accessed: 2023-04-18.
- Antol et al. (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., and Parikh, D. Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pp. 2425–2433, 2015.
- Bai et al. (2023) Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., and Zhou, J. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Chen et al. (2023a) Chen, L., Li, J., Dong, X., Zhang, P., He, C., Wang, J., Zhao, F., and Lin, D. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023a.
- Chen et al. (2023b) Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., and Dai, J. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023b.
- Chung et al. (2022) Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, E., Wang, X., Dehghani, M., Brahma, S., Webson, A., Gu, S. S., Dai, Z., Suzgun, M., Chen, X., Chowdhery, A., Narang, S., Mishra, G., Yu, A., Zhao, V., Huang, Y., Dai, A., Yu, H., Petrov, S., Chi, E. H., Dean, J., Devlin, J., Roberts, A., Zhou, D., Le, Q. V., and Wei, J. Scaling instruction-finetuned language models, 2022. URL https://arxiv.org/abs/2210.11416.
- Contributors (2023a) Contributors, O. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023a.
- Contributors (2023b) Contributors, T.-M. Transcore-m. https://github.com/PCIResearch/TransCore-M, 2023b.
- Contributors (2023c) Contributors, X. Xtuner: A toolkit for efficiently fine-tuning llm. https://github.com/InternLM/xtuner, 2023c.
- Dai et al. (2023) Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., Li, B., Fung, P., and Hoi, S. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- Ding et al. (2021) Ding, M., Yang, Z., Hong, W., Zheng, W., Zhou, C., Yin, D., Lin, J., Zou, X., Shao, Z., Yang, H., et al. Cogview: Mastering text-to-image generation via transformers. Advances in Neural Information Processing Systems, 34:19822–19835, 2021.
- Dong et al. (2024) Dong, X., Zhang, P., Zang, Y., Cao, Y., Wang, B., Ouyang, L., Wei, X., Zhang, S., Duan, H., Cao, M., Zhang, W., Li, Y., Yan, H., Gao, Y., Zhang, X., Li, W., Li, J., Chen, K., He, C., Zhang, X., Qiao, Y., Lin, D., and Wang, J. Internlm-xcomposer2: Mastering free-form text-image composition and comprehension in vision-language large model. arXiv preprint arXiv:2401.16420, 2024.
- Fu et al. (2023) Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
- Gao et al. (2023) Gao, P., Han, J., Zhang, R., Lin, Z., Geng, S., Zhou, A., Zhang, W., Lu, P., He, C., Yue, X., Li, H., and Qiao, Y. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
- Hu et al. (2021) Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
- Hudson & Manning (2019) Hudson, D. A. and Manning, C. D. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6700–6709, 2019.
- Ilharco et al. (2023) Ilharco, G., Ribeiro, M. T., Wortsman, M., Gururangan, S., Schmidt, L., Hajishirzi, H., and Farhadi, A. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations (ICLR 2023), 2023.
- Krishna et al. (2017) Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
- Langley (2000) Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA, 2000. Morgan Kaufmann.
- Latif et al. (2023) Latif, E., Mai, G., Nyaaba, M., Wu, X., Liu, N., Lu, G., Li, S., Liu, T., and Zhai, X. Agi: Artificial general intelligence for education, 2023.
- Li et al. (2023a) Li, B., Wang, R., Wang, G., Ge, Y., Ge, Y., and Shan, Y. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
- Li et al. (2023b) Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b.
- Li et al. (2023c) Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al. Mvbench: A comprehensive multi-modal video understanding benchmark. arXiv preprint arXiv:2311.17005, 2023c.
- Li et al. (2023d) Li, Z., Yang, B., Liu, Q., Ma, Z., Zhang, S., Yang, J., Sun, Y., Liu, Y., and Bai, X. Monkey: Image resolution and text label are important things for large multi-modal models. arXiv preprint arXiv:2311.06607, 2023d.
- Lin et al. (2014) Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pp. 740–755. Springer, 2014.
- Liu et al. (2023a) Liu, H., Li, C., Li, Y., and Lee, Y. J. Improved baselines with visual instruction tuning, 2023a.
- Liu et al. (2023b) Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning, 2023b.
- Liu et al. (2024a) Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.
- Liu et al. (2024b) Liu, S., Ying, K., Zhang, H., Yang, Y., Lin, Y., Zhang, T., Li, C., Qiao, Y., Luo, P., Shao, W., and Zhang, K. Convbench: A multi-turn conversation evaluation benchmark with hierarchical capability for large vision-language models, 2024b.
- Liu et al. (2023c) Liu, Y., Duan, H., Zhang, Y., Li, B., Zhang, S., Zhao, W., Yuan, Y., Wang, J., He, C., Liu, Z., et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023c.
- Lu et al. (2024) Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., Sun, Y., Deng, C., Xu, H., Xie, Z., and Ruan, C. Deepseek-vl: Towards real-world vision-language understanding, 2024.
- Lu et al. (2023) Lu, P., Bansal, H., Xia, T., Liu, J., Li, C., Hajishirzi, H., Cheng, H., Chang, K.-W., Galley, M., and Gao, J. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
- Marino et al. (2019) Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp. 3195–3204, 2019.
- Morris et al. (2023) Morris, M. R., Sohl-dickstein, J., Fiedel, N., Warkentin, T., Dafoe, A., Faust, A., Farabet, C., and Legg, S. Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023.
- Radford et al. (2021) Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., and Sutskever, I. Learning transferable visual models from natural language supervision, 2021.
- RBDash-Team (2023) RBDash-Team. Rbdash. https://github.com/RBDash-Team/RBDash, 2023.
- Shao et al. (2023) Shao, W., Hu, Y., Gao, P., Lei, M., Zhang, K., Meng, F., Xu, P., Huang, S., Li, H., Qiao, Y., et al. Tiny lvlm-ehub: Early multimodal experiments with bard. arXiv preprint arXiv:2308.03729, 2023.
- Singhal et al. (2023) Singhal, K., Azizi, S., Tu, T., Mahdavi, S. S., Wei, J., Chung, H. W., Scales, N., Tanwani, A., Cole-Lewis, H., Pfohl, S., et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
- Sun et al. (2023) Sun, Q., Fang, Y., Wu, L., Wang, X., and Cao, Y. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023.
- Team (2023a) Team, G. Gemini: A family of highly capable multimodal models, 2023a.
- Team (2023b) Team, I. Internlm: A multilingual language model with progressively enhanced capabilities. https://github.com/InternLM/InternLM, 2023b.
- Team (2023c) Team, Q. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023c.
- Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models, 2023a.
- Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M.-A., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Wallace et al. (2021) Wallace, B., Wu, Z., and Hariharan, B. Can we characterize tasks without labels or features? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1245–1254, 2021.
- Wang et al. (2023) Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., Song, X., Xu, J., Xu, B., Li, J., Dong, Y., Ding, M., and Tang, J. Cogvlm: Visual expert for pretrained language models. 2023.
- Xu et al. (2023) Xu, P., Shao, W., Zhang, K., Gao, P., Liu, S., Lei, M., Meng, F., Huang, S., Qiao, Y., and Luo, P. Lvlm-ehub: A comprehensive evaluation benchmark for large vision-language models. arXiv preprint arXiv:2306.09265, 2023.
- Yang et al. (2023a) Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and Wang, L. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023a.
- Yang et al. (2023b) Yang, Z., Li, L., Lin, K., Wang, J., Lin, C.-C., Liu, Z., and Wang, L. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1, 2023b.
- Yang et al. (2023c) Yang, Z., Liu, J., Han, Y., Chen, X., Huang, Z., Fu, B., and Yu, G. Appagent: Multimodal agents as smartphone users. arXiv preprint arXiv:2312.13771, 2023c.
- Ye et al. (2023a) Ye, Q., Xu, H., Xu, G., Ye, J., Yan, M., Zhou, Y., Wang, J., Hu, A., Shi, P., Shi, Y., et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023a.
- Ye et al. (2023b) Ye, Q., Xu, H., Ye, J., Yan, M., Hu, A., Liu, H., Qian, Q., Zhang, J., Huang, F., and Zhou, J. mplug-owl2: Revolutionizing multi-modal large language model with modality collaboration, 2023b.
- Yin et al. (2023) Yin, Z., Wang, J., Cao, J., Shi, Z., Liu, D., Li, M., Sheng, L., Bai, L., Huang, X., Wang, Z., et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687, 2023.
- Yu et al. (2023) Yu, W., Yang, Z., Li, L., Wang, J., Lin, K., Liu, Z., Wang, X., and Wang, L. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
- Yue et al. (2023a) Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., Wei, C., Yu, B., Yuan, R., Sun, R., Yin, M., Zheng, B., Yang, Z., Liu, Y., Huang, W., Sun, H., Su, Y., and Chen, W. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023a.
- Yue et al. (2023b) Yue, X., Ni, Y., Zhang, K., Zheng, T., Liu, R., Zhang, G., Stevens, S., Jiang, D., Ren, W., Sun, Y., et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023b.
- Zamir et al. (2018) Zamir, A. R., Sax, A., Shen, W., Guibas, L. J., Malik, J., and Savarese, S. Taskonomy: Disentangling task transfer learning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3712–3722, 2018.
- Zhang et al. (2023a) Zhang, P., Dong, X., Wang, B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., Ding, S., Zhang, S., Duan, H., Zhang, W., Yan, H., Zhang, X., Li, W., Li, J., Chen, K., He, C., Zhang, X., Qiao, Y., Lin, D., and Wang, J. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023a.
- Zhang et al. (2023b) Zhang, R., Han, J., Liu, C., Gao, P., Zhou, A., Hu, X., Yan, S., Lu, P., Li, H., and Qiao, Y. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199, 2023b.
- Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., and Stoica, I. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023.
In this appendix, we provide further details as follows:
- Sec. A: Presents hierarchical clustering and further analyses of the task map constructed from MMT-Bench.
- Sec. B: Includes details on sample size, visual input types, and capabilities of LVLMs evaluated for each subtask.
- Sec. C: Enumerates task abbreviations used throughout the paper.
- Sec. D: Presents detailed model configurations and experimental details for the multi-image and visual prompting experiments.
- Sec. E: Compares the performance on tasks involving pixel coordinates and normalized coordinates.
- Sec. F: Compares the performance of LVLMs on different image types and multimodal capabilities.
- Sec. G: Illustrates error cases of GPT-4V, GeminiProVision, and InternVL-Chat on meta-tasks in MMT-Bench.
- Sec. H: Compares MMT-Bench with other benchmarks on OCR-related tasks.
- Sec. I: Presents details about the benchmark construction.
- Sec. J: Discusses the OpenCompass protocol used in MMT-Bench and other alternatives.
- Sec. K: Reports the computational resources used in evaluation.
- Sec. L: Provides the detailed performance of 30 models across all 162 subtasks on MMT-Bench.
Appendix A Task Map
We perform hierarchical clustering on the task map, as shown in Fig. 5. Setting the number of clusters to 12, we analyze the clustering results of the task map and the model performance on the corresponding tasks. Here, we list the names of the tasks within each cluster in Table F.
Out-of-Domain (OoD) tasks discovery. From Table 4, we can see that clusters 8, 9, and 11 achieve both low multimodal accuracy and low ranking correlation $\tau$. From these clusters, we find that current multimodal large models lack the ability to perform fine-grained visual cognition and to understand positional and spatial relationships, as required by localization and detection tasks. Moreover, they perform poorly on tasks involving new data structures or image types, showing a lack of proficiency in handling GUI-related tasks and special data structures such as tables.
- One of these clusters mainly involves detection, tracking, and localization tasks, all of which are related to the localization of objects within images. This indicates that current large multimodal models lack fine-grained visual cognition and understanding of positional and spatial relationships.
- Tasks in another cluster are centered around GUI navigation, a novel task type requiring strong visual understanding, object localization, and expert knowledge in operating mobile devices (Yang et al., 2023c). This suggests that current large multimodal models need further optimization for GUI-related tasks.
- Apart from detection and localization tasks, the remaining cluster also includes tasks involving the recognition of special images or their conversion into structured text. The former requires models to possess spatial cognition and fine-grained visual capabilities, while the latter demands robust OCR abilities and extensive knowledge (such as understanding and outputting the basic structure of code or tables). The LVLMs we tested currently fall short in this respect.
In-Domain tasks discovery. From Table 4, we can see that clusters 2, 3, and 10 achieve relatively high accuracy and large ranking correlation $\tau$. We observe that current multimodal large models possess strong high-level visual comprehension capabilities, enabling them to handle visual recognition tasks effectively, even when dealing with specialized images such as medical images. Moreover, they benefit from powerful LLMs to describe images accurately.
- One of these clusters mainly comprises visual recognition tasks, which require the model to possess certain high-level visual capabilities, yet these tasks are relatively simple. Examining Table 2 and Fig. A1, we observe that model performance within this cluster is generally good. This validates that current multimodal large models possess fundamental abilities for visual-semantic understanding, allowing them to fulfil recognition tasks.
- Another cluster also mainly includes visual recognition tasks, yet extends to sophisticated visual understanding tasks that require basic specialist knowledge, such as medicine and emotion. Within this cluster, the models demonstrate large $\tau$ and high accuracy, suggesting that current multimodal models pay attention to tasks that require domain-specific knowledge beyond natural images. This implies a certain ability to handle problems in specialized fields.
- In the remaining cluster, LVLMs achieve good performance on tasks related to the visual description of images. This indicates that current large multimodal models can describe images well, which likely stems from the fact that these models are typically tuned on massive image-text pairs.
Appendix B Hierarchical Structure of MMT-Bench
In Table A2 to Table A4, we present all meta-tasks from MMT-Bench, encompassing a total of 162 subtasks. These tables include details on sample size, visual input types, and capabilities of LVLMs evaluated for each subtask.
Abbreviation | Full Term | Abbreviation | Full Term |
---|---|---|---|
Meta-Task | |||
VR | Visual Recognition | VI | Visual Illusion |
Loc | Localization | MemU | Meme Understanding |
OCR | OCR | VPU | Visual Prompt Understanding |
Count | Counting | AND | Anomaly Detection |
HLN | Hallucination | KD | Keypoint Detection |
IR | Image Retrieval | VCR | Visual Commonsense Reasoning |
3D | 3D | IEJ | Image Evaluation Judgement |
VC | Visual Captioning | MIA | Multiple Image Analysis |
VG | Visual Grounding | CIM | Cross Image Matching |
DU | Doc Understanding | TU | Temporal Understanding |
AR | Action Recognition | VCo | Visual Code |
PLP | Pixel Level Perception | MedU | Medical Understanding |
I2IT | Image-to-image Translation | AUD | Autonomous Driving |
RR | Relation Reasoning | DKR | Discipline Knowledge Reasoning |
IQT | Intelligence Quotient Test | EA | Embodied AI |
Emo | Emotion | GN | GUI Navigation |
Subtask | |||
AQS | Action Quality Assessment | SODRD | Salient Object Detection RGBD |
FECR | Facial Expression Change Recognition | SLR | Sign Language Recognition |
FR | Face Retrieval | SOT | Single Object Tracking |
GAR | General Action Recognition | S2IR | Sketch2image Retrieval |
HR | Handwritten Retrieval | SD | Spot the Diff |
I2IR | Image2image Retrieval | SS | Spot the Similarity |
IC | Image Colorization | TA | Temporal Anticipation |
MVU | Meme Video Understanding | TL | Temporal Localization |
ME | MEVIS | TO | Temporal Ordering |
MIC | Multiple Image Captioning | T2IR | Text2image Retrieval |
NIP | Next Image Prediction | 3DCR | 3D CAD Recognition |
OSD | One-shot Detection | 3DIR | 3D Indoor Recognition |
PRe | Person Reid | VR | Vehicle Retrieval |
PT | Point Tracking | VC | Video Captioning |
Appendix C Task Abbreviations
Given the extensive number of tasks and models tested within the benchmark, we employ abbreviations to condense the manuscript. The abbreviations used throughout the paper are shown in Table A1.
Appendix D More Experimental Details
D.1 LVLMs Model Details
Table A5 summarizes the LVLMs used in this paper, including their parameter sizes, visual encoders, and LLMs. Note that we follow the OpenCompass protocol (Contributors, 2023a) to conduct the evaluation. Inference time varies across models. For instance, the smaller LLaVA-v1.5-7B (Liu et al., 2023a) model takes only minutes to complete the evaluation using 8 GPUs, while the larger InternVL-Chat-V1.2-34B model (Chen et al., 2023b) requires minutes and around 80GB of memory. Our open-source codebase supports multi-GPU distributed inference, effectively accelerating the inference process.
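As an illustration of how evaluation samples might be split across GPUs, the following minimal sketch assigns samples to workers in round-robin fashion. It is not the OpenCompass or MMT-Bench implementation; the function name `shard` and the sample indices are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard(samples: Sequence[T], rank: int, world_size: int) -> List[T]:
    """Return the samples assigned to worker `rank` out of `world_size` workers.

    Round-robin assignment (hypothetical scheme, chosen for illustration):
    worker r receives samples r, r + world_size, r + 2 * world_size, ...
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return list(samples[rank::world_size])


# Hypothetical sample indices; a real run would use the benchmark's question IDs.
sample_ids = list(range(20))
shards = [shard(sample_ids, r, 8) for r in range(8)]
```

Each worker would then run inference on its own shard, and the per-shard predictions would be merged before scoring. Round-robin keeps shard sizes within one sample of each other.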
D.2 Multi-Images Prompt Experimental Details
D.3 Visual Referring Prompting Experimental Details
In Section 4.3, we explore the differential efficacy of visual prompting compared to alternative prompting strategies across a spectrum of 14 distinct tasks. These encompass human interaction understanding, social relation recognition, human-object interaction recognition, animal keypoint detection, vehicle keypoint detection, human keypoint detection, clothes keypoint detection, scene text recognition, interactive segmentation, instance captioning, multiple instance captioning, one-shot detection, single object tracking, and counting by visual prompting.
Appendix E Pixel Coordinates vs. Normalized Coordinates
In Fig. A2, we analyze the performance across detection-related tasks, specifically point tracking, image matting, pixel recognition, polygon localization, pixel localization, depth estimation, MEVIS, remote sensing object detection, rotated object detection, small object detection, camouflage object detection, salient object detection in RGB-D, transparent object detection, face detection, object detection, salient object detection in RGB, referring detection, reason segmentation, and image dense captioning. These tasks span Localization, Pixel-level Perception, and Visual Captioning, comparing outcomes under two different coordinate formats. Notably, GeminiProVision lags behind top open-source LVLMs like BLIP2 and XComposer2, which have been extensively trained with detection data. The preference for normalized coordinates among most models is attributed to their use in the training instruction templates.
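The two coordinate formats differ only by a rescaling with respect to the image size. The sketch below shows one common convention, assuming boxes are given as (x_min, y_min, x_max, y_max) and normalized coordinates lie in [0, 1]; the exact format used in the benchmark prompts may differ, and the helper names are hypothetical.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def to_normalized(box: Box, width: int, height: int) -> Box:
    """Map a pixel-space box to [0, 1] coordinates relative to the image size."""
    x0, y0, x1, y1 = box
    return (x0 / width, y0 / height, x1 / width, y1 / height)


def to_pixel(box: Box, width: int, height: int) -> Box:
    """Map a normalized box back to pixel coordinates."""
    x0, y0, x1, y1 = box
    return (x0 * width, y0 * height, x1 * width, y1 * height)


# Hypothetical box in a 640x480 image.
pixel_box = (64.0, 48.0, 320.0, 240.0)
norm_box = to_normalized(pixel_box, 640, 480)
round_trip = to_pixel(norm_box, 640, 480)
```

Normalized coordinates are independent of image resolution, which is one plausible reason why instruction templates favor them.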
Appendix F Analysis on Images Types and Capabilities
Performance with Different Visual Types. We compare the performance of 20 LVLMs across 13 types of visual input in Fig. A3. Most LVLMs struggle with Scientific Diagrams because the associated tasks are difficult: many of them, including Science and "Raven’s Progressive Matrices," require complex reasoning, a capability that current LVLMs largely lack.
Performance Across Multimodal Capabilities. We also compare the performance of LVLMs across multimodal capabilities in Fig. A4. GeminiProVision once again shows clear superiority across most capabilities, especially in retrieval and multi-image analysis (which involve recognizing and matching multiple images), vastly outperforming open-source LVLMs. This superiority stems from GeminiProVision’s support for multi-image input and its strong generalization ability, which suggests that future open-source models should focus on multi-image and video understanding.
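A per-capability score can be derived from per-subtask accuracies and the capability tags listed in the subtask tables. The sketch below assumes an unweighted mean over all subtasks carrying a given tag; the paper does not state its aggregation rule here, and the accuracy values are made-up placeholders, not reported results.

```python
from collections import defaultdict
from typing import Dict, List

# Capability tags as listed in the subtask tables.
capabilities: Dict[str, List[str]] = {
    "Reason Seg": ["Visual Reasoning", "Visual Localization"],
    "Referring Detection": ["Visual Localization"],
    "Person Reid": ["Retrieval", "Multi-Images Analysis"],
}

# Placeholder accuracies for illustration only (NOT results from the paper).
accuracy: Dict[str, float] = {
    "Reason Seg": 0.40,
    "Referring Detection": 0.60,
    "Person Reid": 0.50,
}


def mean_by_capability(
    caps: Dict[str, List[str]], acc: Dict[str, float]
) -> Dict[str, float]:
    """Average subtask accuracy over every subtask tagged with each capability."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for subtask, tags in caps.items():
        for tag in tags:
            buckets[tag].append(acc[subtask])
    return {tag: sum(vals) / len(vals) for tag, vals in buckets.items()}


per_capability = mean_by_capability(capabilities, accuracy)
```

Because a subtask can carry several tags, its accuracy contributes to every capability it is tagged with. The same grouping applies to visual input types by swapping the tag dictionary.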
Meta-Task | Subtask | # subtasks |
Cluster ID: 1 | ||
Visual Prompt Understanding | Visual Prompt Understanding, Som (Set-of-marks) Recognition | 2 |
Pixel Level Perception | Image Matting | 1 |
Visual Recognition | Color Recognition, Abstract Visual Recognition | 2 |
Discipline Knowledge Reasoning | Science, Tech Engineering, Health Medicine, Humanities Social Science, Business, Art Design | 6 |
Cluster ID: 2 | ||
Visual Recognition | Waste recognition, Logo and Brand Recognition, Animals Recognition, Weapon Recognition, Celebrity Recognition, Shape Recognition, Age Gender Race Recognition, Rock Recognition, Painting Recognition, Gesture Recognition, Vehicle Recognition, Astronomical Recognition, Fashion Recognition, Musical Instrument Recognition, Disaster Recognition, Sports Recognition, Building Recognition, Texture Material Recognition, Plant Recognition, Film and Television Recognition, Animated Character Recognition, Electronic Object Recognition, Scene Recognition, National Flag Recognition, Profession Recognition, Weather Recognition, Food Recognition | 27 |
Relation Reasoning | Human Object Interaction Recognition, Human Interaction Understanding | 2 |
Action Recognition | Image-based Action Recognition, Sign Language Recognition, General Action Recognition | 4 |
Emotion | Scene Emotion Recognition, Artwork Emotion Recognition, Facial Expression Recognition, Micro Expression Recognition, Body Emotion Recognition | 5 |
Image Evaluation Judgement | Lvlm Response Judgement | 1 |
Visual Commonsense Reasoning | WhoopsVQA | 1 |
Hallucination | Attribute Hallucination | 1 |
Counting | Counting by Visual Prompting, Crowd Counting | 2 |
Medical Understanding | Other Biological Attributes | 1 |
Autonomous Driving | Traffic Sign Understanding | 1 |
OCR | Font Recognition, Scene Text Recognition | 2 |
Pixel Level Perception | Pixel Recognition | 1 |
Anomaly Detection | Face Mask Anomaly Detection | 1 |
Multiple Image Analysis | Spot the Diff | 1 |
Visual Captioning | Instance Captioning | 1 |
Doc Understanding | Clock Reading, Doc VQA | 2 |
Meme Understanding | Meme Image Understanding | 1 |
Cluster ID: 3 | ||
Medical Understanding | Medical Modality Recognition, Lesion Grading, Disease Diagnose, Anatomy Identification | 3 |
Visual Captioning | Multiple Image Captioning, Writing Poetry from Image | 2 |
Emotion | Facial Expression Change Recognition | 1 |
Visual Recognition | Image Season Recognition, Sculpture Recognition, Chemical Apparatus Recognition, Landmark Recognition, Religious Recognition | 5 |
Hallucination | Relation Hallucination | 1 |
Relation Reasoning | Social Relation Recognition | 1 |
OCR | Handwritten Text Recognition | 1 |
Temporal Understanding | Temporal Anticipation | 1 |
Cluster ID: 4 | ||
Intelligence Quotient Test | Ravens Progressive Matrices | 1 |
Temporal Understanding | Temporal Localization | 1 |
Autonomous Driving | Traffic Participants Understanding, Temporal Sequence Understanding, Multiple View Image Understanding | 3 |
Counting | Counting by Category, Counting by Reasoning | 2 |
Hallucination | Order Hallucination | 1 |
Doc Understanding | Visual Document Information Extraction, Chart VQA | 2 |
Action Recognition | Action Quality Assessment, | 2 |
3D | 3D CAD Recognition, 3D Indoor Recognition | 2 |
Anomaly Detection | Industrial Produce Anomaly Detection | 1 |
Image Evaluation Judgement | Image Quality Assessment | 1 |
Low Level Vision | Depth Estimation | 1 |
Cluster ID: 5 | ||
Multiple Image Analysis | Spot the Similarity | 1 |
Visual Illusion | Color Assimilation, Geometrical Relativity, Color Constancy, Color Contrast, Geometrical Perspective | 5 |
Autonomous Driving | Traffic Light Understanding | 1 |
Visual Recognition | Deepfake Detection | 1 |
Anomaly Detection | Helmet Anomaly Detection | 1 |
Cluster ID: 6 | ||
Image Retrieval | Vehicle Retrieval, Image2image Retrieval, Sketch2image Retrieval, Face Retrieval, Text2image Retrieval, Handwritten Retrieval, Person Reid | 7 |
Image-to-image Translation | Image Colorization | 1 |
Cluster ID: 7 | ||
Visual Code | Eqn2latex, | 2 |
Keypoint Detection | Clothes Keypoint Detection | 1 |
OCR | Handwritten Math Expression Recognition | 1 |
Pixel Level Perception | Interactive Segmentation | 1 |
Temporal Understanding | Temporal Ordering | 1 |
Visual Captioning | Image Dense Captioning | 1 |
Action Recognition | Gaze Estimation | 1 |
Cluster ID: 8 | ||
Localization | Salient Object Detection RGB, Camouflage Object Detection, Face Detection, Object Detection, Small Object Detection, Salient Object Detection RGBD, Rotated Object Detection, Remote Sensing Object Detection, Transparent Object Detection | 9 |
Visual Grounding | Referring Detection, Reason Seg | 2 |
Cross Image Matching | Point Tracking, One Shot Detection, | 3 |
Image-to-image Translation | Jigsaw Puzzle Solving | 1 |
Cross Image Matching | Single Object Tracking | 1 |
Pixel Level Perception | Pixel Localization | 1 |
Cluster ID: 9 | ||
GUI Navigation | Web Shopping, GUI General, Google Apps, GUI Install | 4 |
Cluster ID: 10 | ||
Visual Captioning | Multiple Instance Captioning, Image Captioning Paragraph, Image Captioning | 3 |
Anomaly Detection | Traffic Anomaly Detection | 1 |
Doc Understanding | Chart to Text | 1 |
Hallucination | Exist Hallucination | 1 |
Relation Reasoning | Scene Graph Recognition | 1 |
Embodied AI | Navigation | 1 |
Anomaly Detection | Behavior Anomaly Detection | 1 |
Cluster ID: 11 | ||
Doc Understanding | Table Structure Recognition, Chart to Table | 2 |
Keypoint Detection | Furniture Keypoint Detection, Vehicle Keypoint Detection, Human Keypoint Detection, Animal Keypoint Detection | 4 |
Pixel Level Perception | Polygon Localization, | 2 |
Temporal Understanding | Next Image Prediction | 1 |
Visual Code | Sketch2code, Screenshot2code | 2 |
Cluster ID: 12 | ||
Meme Understanding | Meme Video Understanding | 1 |
Temporal Understanding | Mevis | 1 |
Visual Captioning | Video Captioning | 1 |
Subtask Name | Sample Num | Visual Input Type | Capability |
Visual Grounding | |||
Reason Seg | 196 | Natural Image | Visual Reasoning,Visual Localization |
Referring Detection | 200 | Natural Image | Visual Localization |
Doc Understanding | |||
Doc Vqa | 200 | Text-rich Image | Document Understanding |
Visual Document Information Extraction | 200 | Text-rich Image | Document Understanding |
Chart To Text | 200 | Chart Image | Document Understanding |
Chart To Table | 200 | Chart Image | Document Understanding |
Clock Reading | 200 | Abstract Image | Visual Recognition,Document Understanding |
Chart Vqa | 200 | Chart Image | Document Understanding |
Table Structure Recognition | 46 | Chart Image | Document Understanding |
Action Recognition | |||
Gaze Estimation | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Image Based Action Recognition | 200 | Natural Image | Visual Recognition |
General Action Recognition | 200 | Natural Image | Visual Recognition,Multi-Images Analysis |
Action Quality Assessment | 200 | Natural Image | Visual Recognition,Multi-Images Analysis,Expert Knowledge Utilization |
Sign Language Recognition | 200 | Natural Image | Visual Recognition,Multi-Images Analysis |
Localization | |||
Remote Sensing Object Detection | 200 | Remote Sensing Image | Visual Recognition,Visual Localization |
Rotated Object Detection | 90 | Remote Sensing Image | Visual Recognition,Visual Localization |
Small Object Detection | 200 | Natural Image | Visual Recognition,Visual Localization |
Camouflage Object Detection | 200 | Natural Image | Visual Recognition,Visual Localization |
Salient Object Detection Rgbd | 200 | Natural Image,Depth Map | Visual Localization |
Transparent Object Detection | 200 | Natural Image | Visual Recognition,Visual Localization |
Face Detection | 200 | Natural Image | Visual Recognition,Visual Localization |
Object Detection | 200 | Natural Image | Visual Recognition,Visual Localization |
Salient Object Detection Rgb | 200 | Natural Image | Visual Localization |
Visual Recognition | |||
Deepfake Detection | 200 | Natural Image,Synthetic Image | Visual Recognition,Visual Reasoning,Expert Knowledge Utilization |
Weather Recognition | 194 | Natural Image | Visual Recognition |
Image Season Recognition | 200 | Natural Image | Visual Recognition |
Gesture Recognition | 200 | Natural Image | Visual Recognition |
Musical Instrument Recognition | 200 | Natural Image | Visual Recognition |
Food Recognition | 200 | Natural Image | Visual Recognition |
Landmark Recognition | 50 | Natural Image | Visual Recognition,Expert Knowledge Utilization |
Scene Recognition | 200 | Natural Image | Visual Recognition |
Animals Recognition | 200 | Natural Image | Visual Recognition |
Chemical Apparatus Recognition | 200 | Natural Image | Visual Recognition |
Rock Recognition | 200 | Natural Image | Visual Recognition,Expert Knowledge Utilization |
Fashion Recognition | 200 | Natural Image | Visual Recognition |
Logo And Brand Recognition | 200 | Natural Image | Visual Recognition |
Astronomical Recognition | 94 | Natural Image | Visual Recognition,Expert Knowledge Utilization |
Painting Recognition | 200 | Painting Image | Visual Recognition,Expert Knowledge Utilization |
Color Recognition | 200 | Synthetic Image | Visual Recognition |
Plant Recognition | 200 | Natural Image | Visual Recognition |
Shape Recognition | 200 | Synthetic Image | Visual Recognition |
Profession Recognition | 200 | Natural Image | Visual Recognition |
Building Recognition | 200 | Natural Image | Visual Recognition,Expert Knowledge Utilization |
Electronic Object Recognition | 200 | Natural Image | Visual Recognition |
Sports Recognition | 200 | Natural Image | Visual Recognition |
Disaster Recognition | 200 | Natural Image | Visual Recognition |
Celebrity Recognition | 200 | Natural Image | Visual Recognition |
Vehicle Recognition | 200 | Natural Image | Visual Recognition |
National Flag Recognition | 200 | Synthetic Image | Visual Recognition |
Abstract Visual Recognition | 200 | Abstract Image | Visual Recognition |
Animated Character Recognition | 200 | Synthetic Image | Visual Recognition |
Texture Material Recognition | 200 | Natural Image | Visual Recognition |
Film And Television Recognition | 200 | Synthetic Image | Visual Recognition,Expert Knowledge Utilization |
Sculpture Recognition | 50 | Natural Image | Visual Recognition,Expert Knowledge Utilization |
Age Gender Race Recognition | 200 | Natural Image | Visual Recognition |
Weapon Recognition | 200 | Natural Image | Visual Recognition |
Religious Recognition | 200 | Natural Image,Synthetic Image | Visual Recognition,Expert Knowledge Utilization |
Waste Recognition | 200 | Natural Image | Visual Recognition,Expert Knowledge Utilization |
Subtask Name | Sample Num | Visual Input Type | Capability |
Gui Navigation | |||
Gui General | 200 | Screenshot Image | Visual Reasoning,Visual Localization |
Google Apps | 200 | Screenshot Image | Visual Reasoning,Visual Localization |
Web Shopping | 200 | Screenshot Image | Visual Reasoning,Visual Localization |
Gui Install | 200 | Screenshot Image | Visual Reasoning,Visual Localization |
OCR | |||
Font Recognition | 200 | Text-rich Image | OCR |
Handwritten Text Recognition | 100 | Text-rich Image | OCR |
Handwritten Mathematical Expression Recognition | 100 | Text-rich Image | OCR |
Scene Text Recognition | 200 | Natural Image,Text-rich Image | OCR |
Image-to-image Translation | |||
Jigsaw Puzzle Solving | 200 | Natural Image | Visual Recognition,Visual Reasoning |
Image Colorization | 200 | Natural Image | Pixel Perception |
Temporal Understanding | |||
Next Img Prediction | 200 | Visual Mark | Temporal Understanding |
Mevis | 200 | Natural Image | Temporal Understanding |
Temporal Anticipation | 200 | Natural Image | Temporal Understanding |
Temporal Ordering | 200 | Natural Image | Temporal Understanding |
Temporal Localization | 193 | Natural Image | Temporal Understanding |
Relation Reasoning | |||
Social Relation Recognition | 200 | Natural Image | Visual Recognition,Visual Reasoning |
Human Object Interaction Recognition | 200 | Natural Image | Visual Recognition,Visual Reasoning |
Scene Graph Recognition | 200 | Natural Image | Visual Recognition,Visual Reasoning |
Human Interaction Understanding | 200 | Natural Image | Visual Recognition,Visual Reasoning |
Discipline Knowledge Reasoning | |||
Science | 127 | Scientific Diagram | Visual Reasoning,Expert Knowledge Utilization |
Health Medicine | 140 | Natural Image,Chart Image,Medical Image | Visual Reasoning,Expert Knowledge Utilization |
Art Design | 110 | Synthetic Image,Text-rich Image,Painting Image | Visual Reasoning,Expert Knowledge Utilization |
Humanities Social Science | 112 | Synthetic Image,Painting Image | Visual Reasoning,Expert Knowledge Utilization |
Tech Engineering | 182 | Chart Image,Scientific Diagram | Visual Reasoning,Expert Knowledge Utilization |
Business | 120 | Text-rich Image,Chart Image | Visual Reasoning,Expert Knowledge Utilization |
Intelligence Quotient Test | |||
Ravens Progressive Matrices | 200 | Scientific Diagram | Visual Reasoning,Expert Knowledge Utilization |
Embodied AI | |||
Navigation | 200 | Synthetic Image | Visual Reasoning,Expert Knowledge Utilization |
Emotion | |||
Facial Expression Change Recognition | 200 | Natural Image | Visual Recognition,Temporal Understanding |
Scene Emotion Recognition | 200 | Natural Image | Visual Recognition |
Micro Expression Recognition | 200 | Natural Image | Visual Recognition |
Artwork Emotion Recognition | 200 | Painting Image | Visual Recognition |
Body Emotion Recognition | 200 | Natural Image | Visual Recognition |
Facial Expression Recognition | 200 | Natural Image | Visual Recognition |
Visual Illusion | |||
Color Constancy | 72 | Synthetic Image | Visual Recognition,Visual Reasoning |
Color Assimilation | 200 | Synthetic Image | Visual Recognition,Visual Reasoning |
Geometrical Relativity | 200 | Synthetic Image | Visual Recognition,Visual Reasoning |
Geometrical Perspective | 120 | Synthetic Image | Visual Recognition,Visual Reasoning |
Color Contrast | 200 | Synthetic Image | Visual Recognition,Visual Reasoning |
Meme Understanding | |||
Meme Video Understanding | 200 | Natural Image | Visual Description |
Meme Image Understanding | 200 | Synthetic Image | Visual Description |
Counting | |||
Counting By Visual Prompting | 200 | Natural Image | Visual Recognition,Counting |
Counting By Category | 800 | Natural Image | Visual Recognition,Counting |
Crowd Counting | 200 | Natural Image | Visual Recognition,Counting |
Counting By Reasoning | 200 | Natural Image | Visual Recognition,Counting |
Hallucination | |||
Order Hallucination | 200 | Natural Image | Visual Recognition,Visual Reasoning,Visual Description |
Relation Hallucination | 200 | Natural Image | Visual Recognition,Visual Reasoning,Visual Description |
Attribute Hallucination | 200 | Natural Image | Visual Recognition,Visual Reasoning,Visual Description |
Exist Hallucination | 200 | Natural Image | Visual Recognition,Visual Reasoning |
Image Retrieval | |||
Person Reid | 200 | Natural Image | Retrieval,Multi-Images Analysis |
Sketch2image Retrieval | 200 | Natural Image,Text-rich Image | Retrieval,Multi-Images Analysis |
Face Retrieval | 200 | Natural Image | Retrieval,Multi-Images Analysis |
Handwritten Retrieval | 200 | Text-rich Image | Retrieval,OCR,Multi-Images Analysis |
Vehicle Retrieval | 200 | Natural Image | Retrieval,Multi-Images Analysis |
Image2image Retrieval | 200 | Natural Image | Retrieval,Multi-Images Analysis |
Text2image Retrieval | 200 | Natural Image | Retrieval,Multi-Images Analysis |
Visual Prompt Understanding | |||
Som Recognition | 199 | Natural Image,Visual Mark | Visual Recognition,Visual Reasoning,Visual Localization,Visual Prompting Understanding |
Visual Prompt Understanding | 200 | Natural Image,Visual Mark | Visual Recognition,Visual Reasoning,Visual Localization,Visual Prompting Understanding |
Subtask Name | Sample Num | Visual Input Type | Capability |
Anomaly Detection | |||
Industrial Produce Anomaly Detection | 200 | Natural Image | Visual Recognition,Counting |
Face Mask Anomaly Detection | 200 | Natural Image | Visual Recognition |
Helmet Anomaly Detection | 200 | Natural Image | Visual Recognition,Visual Localization |
Behavior Anomaly Detection | 200 | Natural Image | Visual Recognition,Multi-Images Analysis |
Traffic Anomaly Detection | 200 | Natural Image | Visual Recognition |
Keypoint Detection | |||
Furniture Keypoint Detection | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Human Keypoint Detection | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Clothes Keypoint Detection | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Animal Keypoint Detection | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Vehicle Keypoint Detection | 92 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Visual Commonsense Reasoning | |||
Whoops | 200 | Synthetic Image | Visual Recognition,Visual Reasoning |
Visual Code | |||
Eqn2latex | 200 | Text-rich Image,Scientific Diagram | OCR,Document Understanding,Expert Knowledge Utilization |
Screenshot2code | 200 | Screenshot Image | Document Understanding,Expert Knowledge Utilization |
Sketch2code | 200 | Scientific Diagram | Document Understanding,Expert Knowledge Utilization |
Image Evaluation Judgement | |||
Image Quality Assessment | 200 | Natural Image | Visual Reasoning |
Lvlm Response Judgement | 200 | Synthetic Image,Chart Image | Visual Reasoning |
Pixel Level Perception | |||
Polygon Localization | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Interactive Segmentation | 141 | Natural Image | Visual Localization,Pixel Perception |
Depth Estimation | 200 | Natural Image | Pixel Perception,3D Perception |
Pixel Recognition | 200 | Natural Image | Visual Recognition,Pixel Perception |
Pixel Localization | 200 | Natural Image | Visual Recognition,Visual Localization,Pixel Perception |
Image Matting | 200 | Natural Image | Pixel Perception |
Multiple Image Analysis | |||
Spot The Similarity | 200 | Natural Image,Synthetic Image | Multi-Images Analysis |
Spot The Diff | 200 | Natural Image | Multi-Images Analysis |
3D | |||
3D Cad Recognition | 200 | 3d Image | Multi-Images Analysis,3D Perception |
3D Indoor Recognition | 200 | 3d Image | Multi-Images Analysis,3D Perception |
Medical Understanding | |||
Anatomy Identification | 200 | Medical Image | Visual Recognition,Expert Knowledge Utilization |
Medical Modality Recognition | 200 | Medical Image | Visual Recognition,Expert Knowledge Utilization |
Other Biological Attributes | 200 | Medical Image | Visual Recognition,Expert Knowledge Utilization |
Disease Diagnose | 200 | Medical Image | Visual Recognition,Expert Knowledge Utilization |
Lesion Grading | 200 | Medical Image | Visual Recognition,Expert Knowledge Utilization |
Cross Image Matching | |||
One Shot Detection | 200 | Natural Image | Visual Localization |
Point Tracking | 200 | Natural Image | Visual Localization |
Single Object Tracking | 200 | Natural Image | Visual Localization |
Visual Captioning | |||
Video Captioning | 200 | Natural Image | Visual Description,Temporal Understanding |
Image Captioning Paragraph | 200 | Natural Image | Visual Description |
Image Captioning | 200 | Natural Image | Visual Description |
Instance Captioning | 200 | Natural Image | Visual Description |
Image Dense Captioning | 197 | Natural Image | Visual Description |
Multiple Instance Captioning | 200 | Natural Image | Visual Description |
Multiple Image Captioning | 200 | Natural Image | Visual Description,Multi-Images Analysis |
Writing Poetry From Image | 200 | Natural Image,Text-rich Image | Visual Description |
Autonomous Driving | |||
Traffic Participants Understanding | 200 | Natural Image | Counting |
Multiple View Image Understanding | 200 | Natural Image | Visual Reasoning,Multi-Images Analysis,Counting |
Traffic Sign Understanding | 200 | Natural Image | Visual Reasoning,Expert Knowledge Utilization |
Temporal Sequence Understanding | 200 | Natural Image | Visual Reasoning,Temporal Understanding |
Traffic Light Understanding | 200 | Natural Image | Visual Recognition |
Models | Parameters | Vision Encoder | LLM |
---|---|---|---|
GPT-4V (Yang et al., 2023a) | - | - | - |
GeminiProVision (Team, 2023a) | - | - | - |
QWen-VL-Plus (Team, 2023c) | - | - | - |
Claude3V-Haiku (Anthropic, 2023) | - | - | - |
LLaVA-Next-34B (Liu et al., 2024a) | 34.8B | CLIP ViT-L/14 | Nous-Hermes-2-Yi-34B |
LLaVA-Next-13B (Liu et al., 2024a) | 13.4B | CLIP ViT-L/14 | Vicuna-v1.5-13B |
LLaVA-Next-7B (Liu et al., 2024a) | 7.1B | CLIP ViT-L/14 | Vicuna-v1.5-7B |
Yi-VL-34B (AI et al., 2024) | 34.6B | CLIP ViT-H/14 | Nous-Hermes-2-Yi-34B |
Yi-VL-6B (AI et al., 2024) | 6.6B | CLIP ViT-H/14 | Yi-6B |
InternVL-Chat-V1.2 (Chen et al., 2023b) | 40B | InternViT-6B | Nous-Hermes-2-Yi-34B |
DeepSeek-VL-7B (Lu et al., 2024) | 7.3B | SAM-B & SigLIP-L | DeepSeek-7B |
Monkey (Li et al., 2023d) | 9.8B | CLIP-ViT-BigHuge | Qwen-7B |
XComposer (Zhang et al., 2023a) | 8B | EVA-CLIP-G | InternLM-7B |
XComposer2 (Dong et al., 2024) | 7B | CLIP ViT-L/14 | InternLM2-7B |
ShareGPT4V (Chen et al., 2023a) | 7.2B | CLIP ViT-L/14 | Vicuna-v1.5-7B |
SharedCaptioner (Chen et al., 2023a) | 8B | EVA-G | InternLM-7B |
mPLUG-Owl2 (Ye et al., 2023b) | 8.2B | CLIP ViT-L/14 | LLaMA2-7B |
LLaVA-v1.5-7B (Liu et al., 2023b, a) | 7.2B | CLIP ViT-L/14 | Vicuna-v1.5-7B |
LLaVA-v1.5-13B (Liu et al., 2023b, a) | 13.4B | CLIP ViT-L/14 | Vicuna-v1.5-13B |
LLaVA-InternLM2-7B (Contributors, 2023c) | 8.1B | CLIP ViT-L/14 | InternLM2-7B |
LLaVA-InternLM-7B (Contributors, 2023c) | 7.6B | CLIP ViT-L/14 | InternLM-7B |
LLaVA-v1.5-7B-Xtuner (Contributors, 2023c) | 7.2B | CLIP ViT-L/14 | Vicuna-v1.5-7B |
LLaVA-v1.5-13B-Xtuner (Contributors, 2023c) | 13.4B | CLIP ViT-L/14 | Vicuna-v1.5-13B |
LLaMA-Adapter-v2 (Gao et al., 2023) | 7B | CLIP ViT-L/14 | LLaMA-7B |
VisualGLM (Ding et al., 2021) | 8B | EVA-CLIP | ChatGLM-6B |
CogVLM (Wang et al., 2023) | 17B | EVA-CLIP-E | Vicuna-v1.5-7B |
TransCore-M (Contributors, 2023b) | 13.4B | CLIP ViT-L/14 | PCITransGPT-13B |
RBDash-v1 (RBDash-Team, 2023) | 13.4B | CLIP ViT-L/14 | Vicuna-v1.5-13B |
BLIP2 (Li et al., 2023b) | 12.1B | EVA-CLIP ViT-G/14 | Flan-T5-XXL |
QWenVL (Bai et al., 2023) | 9.6B | CLIP ViT-G/16 | QWen-7B |
Task Abbreviation | Task Name | Prompt Example for Single Image LVLMs | Prompt Example for Multiple Image LVLMs |
---|---|---|---|
AQS | Action Quality Assessment | | |
FECR | Facial Expression Change Recognition | | |
FR | Face Retrieval | | |
GAR | General Action Recognition | | |
HR | Handwritten Retrieval | | |
I2IR | Image2image Retrieval | | |
IC | Image Colorization | | |
MVU | Meme Video Understanding | | |
ME | MEVIS | | |
MIC | Multiple Image Captioning | | |
NIP | Next Image Prediction | | |
OSD | One-shot Detection | | |
PRe | Person Reid | | |
PT | Point Tracking | | |
SODRD | Salient Object Detection RGBD | | |
SLR | Sign Language Recognition | | |
SOT | Single Object Tracking | | |
S2IR | Sketch2image Retrieval | | |
SD | Spot the Diff | | |
SS | Spot the Similarity | | |
TA | Temporal Anticipation | | |
TL | Temporal Localization | | |
TO | Temporal Ordering | | |
T2IR | Text2image Retrieval | | |
3DCR | 3D CAD Recognition | | |
3DIR | 3D Indoor Recognition | | |
VR | Vehicle Retrieval | | |
VC | Video Captioning | | |
Appendix G Case Study
Case Figure | Meta-task | Subtask | GPT-4V | GeminiProVision | InternVL-Chat |
---|---|---|---|---|---|
Fig. A5 | Visual Recognition | Landmark Recognition | |||
Fig. A6 | Object Localization | Camouflaged Object Detection | |||
Fig. A7 | Pixel-level Recognition | Image Matting | |||
Fig. A8 | OCR | Handwritten Text Recognition | |||
Fig. A9 | Visual Prompt Understanding | Visual Prompt Understanding | |||
Fig. A10 | Retrieval | Sketch to Image Retrieval | |||
Fig. A11 | Counting | Counting by Reasoning | |||
Fig. A12 | Keypoint Detection | Human Keypoint Detection | |||
Fig. A13 | Action Recognition | Sign Language Recognition | |||
Fig. A14 | Visual Hallucination | Exist Hallucination | |||
Fig. A15 | Anomaly Detection | Industrial Produce Anomaly Detection | |||
Fig. A16 | Image-to-Image Translation | Jigsaw Puzzle Solving | |||
Fig. A17 | Visual Summary | Image Captioning Paragraph | |||
Fig. A18 | Intelligence Quotient Test | Ravens Progressive Matrices | |||
Fig. A19 | Emotional Quotient Test | Scene Emotion Recognition | |||
Fig. A20 | Visual Grounding | Referring Detection | |||
Fig. A21 | Visual Commonsense Reasoning | Whoops | |||
Fig. A22 | Chart, Doc Understanding | Clock Reading | |||
Fig. A23 | Relation Reasoning | Scene Graph Recognition | |||
Fig. A24 | Meme Understanding | Meme Image Understanding | |||
Fig. A25 | Multi-Image Analysis | Spot the Diff | |||
Fig. A26 | Temporal Understanding | Temporal Ordering | |||
Fig. A27 | Cross-Image Matching | Single Object Tracking | |||
Fig. A28 | Visual Coding | Equation to Latex | |||
Fig. A29 | Visual Illusion | Color Constancy | |||
Fig. A30 | Image Evaluation Judgement | LVLM Response Judgement | |||
Fig. A31 | 3D Perception | 3D CAD Recognition | |||
Fig. A32 | Embodied Agent | Navigation | |||
Fig. A33 | Medical Understanding | Medical Modality Recognition | |||
Fig. A34 | Autonomous Driving | Traffic Light Understanding | |||
Fig. A35 | GUI Navigation | Installation | |||
Fig. A36 | Discipline Knowledge Reasoning | Art and Design | |||
In this section, we present a case study analysis of the error types made by GPT-4V, GeminiProVision, and InternVL-Chat on various meta-tasks in MMT-Bench. We classify the errors into the following six categories:
Perception error: LVLMs fail to recognize, classify, or detect the objects or content in images. Most LVLMs are constrained by the representation power of their visual encoders, making this the most common type of error. See examples in Fig. A6 and Fig. A8, among others.
Reasoning error: LVLMs correctly recognize and perceive the visual content but make errors in reasoning, leading to incorrect answers. See examples in Fig. A21 and Fig. A30, among others.
Lack of knowledge: LVLMs lack the domain-specific knowledge required to answer specialized questions, such as the location of a landmark (see Fig. A5) or the creation date of a particular painting (see Fig. A36).
Lack of capability: LVLMs do not have the capability to solve the corresponding tasks. This error type is particularly evident in GPT-4V, which tends to respond more honestly when it lacks the ability to handle certain tasks. In contrast, other LVLMs are inclined to generate outputs even when their accuracy is relatively low. See examples in Fig. A6 and Fig. A13.
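A case-study annotation of this kind can be summarized by tallying the labeled error categories per model. The following is a minimal sketch and not the authors' tooling: the `ErrorType` enum, the `tally_errors` helper, and the model names are hypothetical, the enum covers only the four categories described above, and the sample annotations are placeholders rather than results from the paper.

```python
from collections import Counter
from enum import Enum


class ErrorType(Enum):
    """Hypothetical labels for the error categories described above."""
    PERCEPTION = "perception"          # fails to recognize, classify, or detect content
    REASONING = "reasoning"            # perceives correctly but reasons incorrectly
    KNOWLEDGE = "lack_of_knowledge"    # missing domain-specific knowledge
    CAPABILITY = "lack_of_capability"  # cannot solve the task at all


def tally_errors(annotations):
    """Count error types per model.

    `annotations` is an iterable of (model_name, ErrorType) pairs,
    one pair per incorrectly answered case.
    """
    counts = {}
    for model, error in annotations:
        counts.setdefault(model, Counter())[error] += 1
    return counts


# Placeholder annotations for illustration only; NOT results from the paper.
example = [
    ("model_a", ErrorType.PERCEPTION),
    ("model_a", ErrorType.PERCEPTION),
    ("model_a", ErrorType.REASONING),
    ("model_b", ErrorType.CAPABILITY),
]

for model, counter in tally_errors(example).items():
    # Report the most frequent error category for each model.
    error, n = counter.most_common(1)[0]
    print(model, error.name, n)
```

Keeping the categories in an enum rather than free-form strings makes it harder for inconsistent spellings of the same label to split one category into several counts.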
Appendix H Comparison of MMT-Bench with Other Benchmarks on OCR-Related Tasks
Benchmark | Sample Num | Task Type | Words Avg | Words Min | Words Median | Words Max | Words Std | Tokens Avg | Tokens Min | Tokens Median | Tokens Max | Tokens Std |
---|---|---|---|---|---|---|---|---|---|---|---|---|
MME (Fu et al., 2023) | 40 | 1 | 2.5 | 1 | 2 | 5 | 1 | 3.9 | 1 | 3 | 8 | 1.6 |
MMBench (dev+test) (Liu et al., 2023c) | 608 | - | 7.3 | 1 | 6 | 54 | 7 | 8.3 | 1 | 6 | 78 | 9.3 |
Tiny-LVLM-eHub (Shao et al., 2023) | 600 | 1 | 1 | 1 | 1 | 1 | 0 | 2.2 | 1 | 2 | 8 | 1.1 |
MMT-Bench (Ours) | 600 | 4 | 14.8 | 1 | 1.5 | 103 | 22.7 | 20.4 |