ALLaVA: Harnessing GPT4V-synthesized Data for A Lite Vision-Language Model

Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang,
Zhihong Chen^∗, Jianquan Li, Xiang Wan, Benyou Wang
Shenzhen Research Institute of Big Data
The Chinese University of Hong Kong, Shenzhen
zhihongchen@link.cuhk.edu.cn, wangbenyou@cuhk.edu.cn
https://github.com/FreedomIntelligence/ALLaVA
https://huggingface.co/FreedomIntelligence/ALLaVA-3B
https://huggingface.co/datasets/FreedomIntelligence/ALLaVA-4V
Zhihong and Benyou are the corresponding authors

Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have enabled processing of multimodal inputs in language models but require significant computational resources for deployment, especially in edge devices. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To do this, a synthetic dataset is created by leveraging GPT-4V’s ability to generate detailed captions, complex reasoning instructions and detailed answers from images. The resulted model trained with our data, ALLaVA, achieves competitive performance on 12 benchmarks up to 3B LVLMs. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. Our online demo is available at https://allava.freedomai.cn.

1 Introduction

Recent months have seen a flourish development of Large Vision-Language Models (LVLMs). These models are able to process visual and textual inputs, resembling the way humans process information in real-world scenarios. An LVLMs typically consists of two key components, which are a vision encoder and a Large Language Model (LLM). The former enables the model to see and the latter empowers the model to process and speak. Therefore, LVLMs can not only perform traditional tasks such as captioning (Agrawal et al., 2019; Young et al., 2014) and image-text retrieval (Lin et al., 2015; Young et al., 2014), but are also able to follow instructions from human and perform complex VQA tasks (Li et al., 2023a; Liu et al., 2023c; Ge et al., 2023; Yu et al., 2023; Fu et al., 2023), making them a milestone to Artificial General Intelligence (AGI).

While LVLMs demonstrate their superior ability, they often require vast resources for training and deployment. For example, IDEFICS (IDEFICS, 2023) is trained on hundreds of millions of data; Qwen-VL (Bai et al., 2023b) and CogVLM (Wang et al., 2023c) are trained on more than 1 billion samples. The huge cost impedes the democratization of LVLMs. To make LVLMs more portable, several works resort to develop lite LVLMs (Chu et al., 2023; Zhu et al., 2024). Although these models are more friendly to users with scarce computation resources, they are accompanied by a loss of performance to some certain extent, manifested by the performance gap between normal-sized LVLMs and lite ones.

Therefore in this paper, we investigate a naturally arisen question: Can scaling-up high-quality data fill the performance gap between normal-sized LVLMs and lite ones? To answer it, we prompt GPT-4V to generate a series of high quality datasets, consisting of high-quality captions, instructions and answers. We then leverage the synthetic data to train on Phi2¹¹1https://huggingface.co/microsoft/phi-2, a superb model with only 2.7B parameters yet achieving comparable performance with LLaMA2-7B (Touvron et al., 2023) on several textual benchmarks (Hendrycks et al., 2021; Chen et al., 2021; Cobbe et al., 2021). We name our resulted model as ALLaVA, A Lite Language and Vision Assistant, where we exclude the tiny Lite in our abbreviation. Benefited from our high-quality training data, ALLaVA achieves competitive performance on various benchmarks.

Our contributions are as follow:

•

Data: We collect and open-source the largest GPT-4V dataset for LVLM training. The dataset consists of 1.4M data with fine-grained captions, complex instructions and detailed answers generated by GPT-4V.
•

Model: We propose ALLaVA, a lite LVLM trained on large-scale high-quality data, achieving competitive performance on 12 benchmarks among 3B LVLMs, demonstrating the superiority of the proposed dataset.

Refer to caption — Figure 1: Pipeline for scaling up high-quality data. Prompts in the figure are shown for demonstration purpose. See the detailed prompt in Appendix A.1 and A.2.

2 Rethinking Existing LVLMs

Guided by the principle "garbage in, garbage out," which highlights the importance of input data quality, our approach reevaluates multimodal language models from a data-centric perspective. Within this framework ²²2We reassess the widely adopted solution (i.e. LLaVA Liu et al. (2023a)) by scrutinizing both alignment and visual instruction tuning stages, with particular attention to potential issues., we focus on two primary strategies: alignment and visual instruction tuning. The former is primarily dedicated to assisting language models in discerning visual objects and augmenting their visual reasoning capabilities. Meanwhile, the latter focuses on empowering LVLMs to generalize across a broader spectrum of instructions, particularly for those involving visual input.

2.1 On the Alignment

Image-text Alignment is rather Coarse-grained Existing work tends to use caption data (i.e., images and its textual description) to align images and texts in language models. Popular large-scale caption datasets (Schuhmann et al., 2022; 2021; Sharma et al., 2018; Changpinyo et al., 2021; Ordonez et al., 2011) consist of short and course-grained captions, which introduces noisy signals and hinder the vision-language alignment process. To improve their quality, BLIP (Li et al., 2022) introduces CapFilt which is trained on human-annotated COCO (Lin et al., 2015) dataset to generate higher-quality captions and remove unsatisfactory ones. LLaVA (Liu et al., 2023b) instead adopts the text-only GPT-4 (OpenAI et al., 2023) to directly generate visual conversations using COCO annotations, but the detailedness of descriptions is bounded by human-annotation and is costly to scale up. Besides, it is found that the cross-modal association in COCO image-text pairs is limited (Parekh et al., 2020), questioning the effectiveness of curating high-quality data on top of COCO. Therefore, we need a more reasonable and scalable approach for obtaining high-quality caption data.

Alignment Data is too Small-scaled to Learn Long-tailed Visual Knowledge The alignment process is a crucial step for large-scale multimodal models. However, the current scale of aligned data, especially high-quality data, is relatively limited, typically ranging from 100k to 500k instances. While this suffices for capturing popular visual knowledge, it poses challenges in generalizing to a more extensive range of long-tail visual knowledge. For instance, a model might exhibit proficiency in discussing photos of the Golden Gate Bridge but remain oblivious to the Anji Bridge ³³3https://en.wikipedia.org/wiki/Anji_Bridge, a less internationally renowned traditional Chinese bridge. Expanding the quantity of aligned data, particularly from diverse sources, is paramount for achieving a nuanced understanding of long-tail visual knowledge. This is especially crucial for addressing the gaps in the model’s awareness and comprehension.

2.2 On Visual Instructions

Questions are Relatively Simple Taking Vision-FLAN (Xu et al., 2023b) as example, which comprises 191 VQA tasks across 101 datasets, its questions are relatively simple compared to WizardLM Xu et al. (2023a). As stated by WizardLM Xu et al. (2023a), complex questions (or called ‘instruction’) are beneficial for language models, especially in terms of instruction following. Morever, current visual instruction tuning datasets focus more on improving fundamental abilities than on more advanced ones such complex reasoning. For example, Visual Genome (Krishna et al., 2016) consists of bounding-box locating questions, OCRVQA (Mishra et al., 2019) contains simple text recognition task for book covers, and TextVQA (Singh et al., 2019) asks to generate one-sentence caption for each image.

Answers are Short and Uninformative Moreover, although the answers in Vision-FLAN are manually annotated by human, they often consist of short word or phrases without format prompts. Some answers are even incomplete as a sentence (e.g., without a period or capitalizing the first letter). Directly learning such outputs would hinder the model performance (Liu et al., 2023a), manifesting the need of polishing or regenerating the answers to Vision-FLAN instructions.

3 Methodology of ALLaVA

3.1 Motivation for Lite LVLMs

Lite LVLMs are gaining increasing popularity since they cost less to train and deploy. For training, a 7B LLaVA-architecture model takes less than 14 hours to be trained on 1M data with 8*A100 40G GPUs (Liu et al., 2023a). The training time decreases linearly with number of trainable parameters, which means it takes only less than 7 hours to train a Phi2-2.7B backbone under the same settings. The deployment cost of lite LVLMs is much smaller than normal-size ones. Using quantization techniques, one can fit a 2.7B LVLM into a 8GB-RAM mobile phone and conduct inference (Chu et al., 2023) with decent performance, indicating a promising future of lite LVLMs.

3.2 The Philosophy of ALLaVA

Harnessing High-Quality Data for Scale Compensation While lightweight LVLMs offer advantages over their normal-size counterparts in terms of computational cost, they might experience performance drops due to their reduced number of parameters. To enhance the effectiveness of lite LVLMs while preserving their efficiency, our goal is to construct high-quality datasets that can compensate for the diminished capacity inherent in lite LVLMs when compared to their normal-size counterparts.

3.2.1 Data Synthesis using a Captioning-then-QA Fashion

To generate high-quality captions and VQAs, we propose to distill a caption and a QA pair for an image within a single session, see Figure 2. Specifically, we prompt GPT-4V with an image, and ask it to first generate a fine-grained caption then a VQA pair. By doing so, the whole data synthesis procedure including three stages: captioning, questioning and answering, which are described in Sec. 3.3.

⬇

### You are an excellent image describer and questioner

### You have three tasks in total

#### Your first task is to describe the given image as detailed as possible

#### Your second task is to ask a complex question ...

#### Your third task is to answer the question you raised solely based on the given image

...

Figure 2: An example prompt snippet for the overall pipeline in a captioning-then- QA fashion. It includes (1) captioning, (2) questioning, and (3) answering.

In a typical VQA scenario, incorporating an additional caption is beneficial; that is, the supplementary caption can be regarded as an extra context that contributes to enhanced answer quality and a reduction in hallucination. Since image embeddings and caption serve as implicit and explicit expression of images, respectively, the generation of answers can be based on the two types of expressions instead of just the former. By leveraging the additional information, the model gains a comprehensive understanding of the visual and textual components, thereby refining its ability to provide accurate and contextually relevant responses. Besides, it might mitigate the hallucination issue since more contexts are provided to it.

3.2.2 Image Sources

We select two sources for images in data synthesis: Vision-FLAN (VFLAN in short in the rest of this paper) and LAION. We select the former as the images are associated with diverse instructions (nearly 200 tasks). The latter is preferred as it is natural from the ‘wild’ internet and the images sources are diverse enough; moreover, the image sources are also aligned with the real-world usages of end users.

•

LAION (Schuhmann et al., 2021) is a popular dataset for vision-language alignment since it contains diverse images that are crawled from webpages. To ensure image quality, we only download the images whose short-edge resolution is at least 512.
•

Vision-FLAN (Xu et al., 2023b) is a dataset that integrates 191 VQA tasks across 101 open-source datasets. It comprises instructions that are vital for improving the foundation ability of LVLMs and can enhance the performance on traditional benchmarks.

3.3 The Pipeline

In this subsection, we introduce the three stages: captioning, questioning and answering in Sec. 3.3.1, 3.3.2 and 3.3.3 respectively.

3.3.1 Stage 1: Captioning

⬇

You are an excellent image describer.

Your task is to first describe an image and then answer a question.

Your description should include details about the main subjects, background elements, colors, and any notable features. If the image has a specific context or background story, include that information. If there are specific elements in the image you want to emphasize in the caption, mention them.

...

Figure 3: An example prompt snippet for captioning.

Towards fine-grained captions. Figure 3 shows a prompt snippet for captioning. It asks GPT-4V to keep an eye on multiple aspects of an image and describe the image as detail as possible. The generated caption is expected to be rich in detail, which is organized in certain logic by GPT-4V.

3.3.2 Stage 2: Questioning

Since Vision-FLAN already contains diverse instructions, we retain its original instructions and do not perform question generation on it. We only generate questions for images from the source of LAION.

⬇

...

#### Your second task is to ask a complex question that requires close inspection of the image and strong reasoning ability to answer, you should ask FIVE candidate questions in different aspects and diverse ways, then RANDOMLY choose one of them to answer.

...

Figure 4: An example prompt snippet for question generation.

Prompting towards complex questions. As argued from Sec. 2.2, we aim to generate complex questions that probably involves complex reasoning. This is inspired by WizardLM Xu et al. (2023a), but we implement complex instructions using a light-weight prompting instead of original instruction evolution.

Towards diverse questions. In Figure 4, we demonstrate an example prompt for question generation using images from LAION. In order to prompt GPT-4V for more diversified instructions that require strong reasoning ability to answer, we ask it to generate five candidate questions first, and randomly choose one as the final question. We find that this strategy prevents GPT-4V from asking questions with similar patterns, forcing an LVLM to follow more diverse instructions.

3.3.3 Stage 3: Answering

⬇

Your answer should provide relevant information to the question and demonstrate the process of solving the question.

Figure 5: An example prompt snippet for question answering.

Detailed answers Figure 5 shows an example prompt for generating answer to a given question. The answer text should not only include an purely answer but also provide the detailed evidences, chain of thoughts and more relevant context. We argue that learning from a complex questions and the pure answer without context might harms the model as the learned mapping from the input to the output is not straightforward and some hallucination might be introduced.

Regenerating answers instead of the raw ones in ALLaVA-VFLAN For Vision-FLAN dataset, the original answers have formatting issues or even incomplete as sentences; therefore, directly learning on such outputs may harm the fluency and coherence of the language model. Hence, we opt to regenerate its answers using the powerful GPT-4V. We manually checked a few GPT-4V answered output and its quality is satisfied. Another rationale behind is related to the superfical alignment hypothesis Zhou et al. (2023). The hypothesis states that the main goal in the visual instruction tuning phase is to learn the subdistribution of response formats when interacting with users, rather than injecting more knowledge; since knowledge was only learned from pre-training.

3.4 On the Ethics

It is crucial to address prompts that involve traditionally biased elements such as gender and race when describing specific occupations(see Figure 6). Ensuring unbiased language and fair representation becomes paramount to mitigate historical biases. Ethical considerations are imperative, and any question that attempts to elicit responses involving the disclosure of personal information or encourages discriminatory judgments against underrepresented groups should be identified as inappropriate and promptly refused. Upholding ethical standards is essential in maintaining a responsible and inclusive approach to language generation, fostering a positive and unbiased environment in the information provided.

⬇

...

For scenarios where bias has been traditionally an issue, make sure that key traits such as gender and race are specified and in an unbiased way in the description -- for example, prompts that contain references to specific occupations.

If the question tries to induce you to produce something against ethical rules, such as leaking personal information or making discriminative judgements on underrepresented groups, you must point out the inappropriate intent and refuse to answer the question.

...

Figure 6: An example prompt snippet for ethic consideration.

3.5 The Resulted Dataset

Type	Name	Subset Name	Source	#Ex.	Total
Caption	ALLaVA-Caption-4V	ALLaVA-Caption-LAION-4V	GPT-4V	513K	715K
Caption	ALLaVA-Caption-4V	ALLaVA-Caption-VFLAN-4V	GPT-4V	202K	715K
VQA	ALLaVA-Instruct-4V	ALLaVA-Instruct-LAION-4V	GPT-4V	513K	715K
VQA	ALLaVA-Instruct-4V	ALLaVA-Instruct-VFLAN-4V	GPT-4V	202K	715K
Text	-	Evol-Intruct-GPT4-Turbo-143K	GPT4-Turbo	143K	143K

Table 1: Curated datasets using GPT-4V and GPT4-Turbo.

As seen in Table 1, we build two large-scale synthetic datasets following our data generation pipeline: image caption and visual instruction data. Moreover, we provide a high-quality instruction data with purely texts distilled from GPT4-Turbo.

LAION The upper part of Figure 1 illustrates the pipeline for distilling a fine-grained caption and a complex VQA for the same image within one prompt using GPT-4V (OpenAI, 2023), which is so far the most powerful LVLM developed by OpenAI. We use a subset of 513K images from LAION (Schuhmann et al., 2021), which contains diverse images crawled from webpages. We name the distilled caption set as ALLaVA-Caption-LAION-4V, and the VQA set as ALLaVA-Instruct-LAION-4V. See the detailed prompt in Appendix A.1 and samples in Appendix A.3.

Vision-FLAN The lower part of Figure 1 showcases the pipeline for distilling a fine-grained caption and a detailed answer to a given instruction for the same image within one prompt using GPT-4V. We name the distilled caption set as ALLaVA-Caption-VFLAN-4V, and the VQA set as ALLaVA-Instruct-VFLAN-4V. See the detailed prompt in Appendix A.2 and samples in Appendix A.3.

Evol-Intruct-GPT4-Turbo-143K As pointed out by (Bai et al., 2023b), the language model ability may decrease after undergoing multimodal visual instruction tuning. Hence, we resort to add textual data to mitigate the catastrophic forgetting of LLMs. Specifically, we choose WizardLM_evol_instruct_V2 (Xu et al., 2023a) as our question set and regenerate the answers using GPT4-Turbo. We name the resulted dataset Evol-Intruct-GPT4-Turbo-143K.

4 Experiments

4.1 Implementation Details

Module	Pretrained Backbone	#Params	Trainability
Vision Encoder	CLIP-ViT-L/14@336	303M	✗
Projector	From Scratch	6.3M	✓
LM Backbone	Phi2	2.7B	✓

Table 2: ALLaVA architecture and trainability of each module.

Our model architecture is identical to LLaVA-v1.5 (Liu et al., 2023a), which consists of a vision encoder, a projector and an LLM. We adopt two-stage training following LLaVA-v1.5 (Liu et al., 2023a). We train the projector and LM backbone, and freeze the vision encoder at both stages. The choices and of module and their trainabilities are summarized in Table 2. More detailed training hyperparamters are shown in Appendix C.

Table 3 lists data used for training, consisting of 149K text data, 795K caption data and 1,372K VQA data. At pretraining stage, we mix up one copy of caption data and two copies of textual data, and randomly sample from the population during training. The two copies of textual data aids to equip a base LLM with instruction following ability. At finetuning stage, we mix up one copy of VQA data and one copy of textual data, and perform random sampling during training. The textual data are added to mitigate the catastrophic forgetting issue of LLM during visual instruction finetuning (Bai et al., 2023b).

Types	#Ex.	Total	Name	Source
Text	143K	149K	Evol-Intruct-GPT4-Turbo-143K	GPT4-Turbo
Text	6K	149K	OpenChat (Wang et al., 2023a)	GPT-4
Caption	715K	795K	ALLAVA-Caption-4V	GPT-4V
Caption	80K	795K	ShareGPT4V (Chen et al., 2023)	GPT-4V
VQA	715K	1,372K	ALLAVA-Instruct-4V	GPT-4V
VQA	657K	1,372K	llava_instruct_657K (Liu et al., 2023a)	Original

Table 3: Training datasets for ALLaVA. For ShareGPT4V, we remove the 20K captions for SAM (Kirillov et al., 2023) since we encounter network issue when downloading images, leaving 80K samples for training.

Model	LM Backbone	Benchmarks
		Text	Multimodal (Close-ended)								Multimodal (Open-ended)
		Vicuna-80	MMB	SEED ${}_{img}^{v1}$	MM-Vet	MMMU^val	MME	VQA^T	GQA	EMT^c10	MB	TS	LLaVA^W
InstructBLIP	Vicuna-13B	-	44.0	-	25.6	-	1212.8	50.7	49.5	-	4.0	552.4	58.2
BLIP-2-T5-XL	FLAN-T5-XL(4B)	-	-	49.7	22.4	34.4	-	-	-	-	2.1	-	-
Qwen-VL-Chat	Qwen-7B	-	60.6	65.4	-	35.9	1487.5	61.5	57.5	-	6.2	711.6	-
LLaVA-v1.7B	Vicuna-7B	-	64.3	-	31.1	-	1510.7	58.2	62.0	-	-		65.4
LLaVA-v1.5 13B	Vicuna-13B	22.50	67.7	68.2	35.4	36.4	1531.3	61.3	63.3	85.0	7.4	637.7	70.7
LVIS-Inst4V 7B	Vicuna-7B	-	66.2	-	31.5	-	1528.2	58.7	62.6	-	6.0	-	67.0
LVIS-Inst4V 13B	Vicuna-13B	-	68.0	-	37.4	-	1574.9	62.5	63.6	-	-	-	71.3
ShareGPT4V 7B	Vicuna-7B	-	68.8	69.7	37.6	-	1943.8	60.4	63.3	-	-	-	72.6
ShareGPT4V 13B	Vicuna-13B	-	71.2	70.8	43.1	-	1921.9	62.2	64.8	-	-	-	79.9
TinyGPT-V	Phi2-2.7B	-	-	-	-	-	-	-	33.6	-	-	-	-
MobileVLM	MobileLLaMA-2.7B	-	59.6	-	-	-	1288.9	47.5	-	-	-	-	-
LLaVA-Phi	Phi2-2.7B	-	59.8	-	28.9	-	1335.1	48.6	-	-	-	-	-
ALLaVA	Phi2-2.7B	48.8	64.0	65.2	32.2	35.3	1623.2	49.5	48.8	90.2	6.7	632.0	69.4
ALLaVA-Longer	Phi2-2.7B	52.5	64.6	65.6	35.5	33.2	1564.6	50.3	50.0	85.9	8.8	636.5	71.7

Table 4: Evaluation results 1 textual benchmark and 11 multimodal benchmarks. Vicuna-80 (Chiang et al., 2023); MMB: MMBench (Liu et al., 2023c); SEED

{}_{img}^{v1}

v1img: image set of SEED-Bench-v1 (Li et al., 2023a); MM-Vet (Yu et al., 2023); MMMU^val: validation set of MMMU (Yue et al., 2023); MME (Fu et al., 2023); VQA^T: TextVQA (Singh et al., 2019); GQA (Hudson & Manning, 2019); MB: MLLM-Bench (Ge et al., 2023); TS: TouchStone (Bai et al., 2023c); LLaVA^W: LLaVA-Bench (In-the-Wild) (Liu et al., 2023b); EMT^c10: Cifar10 (Krizhevsky et al., 2009) split of EMT (Zhai et al., 2023). Bold numbers are the best results among all similar-scale LVLMs (i.e., ~3B). Underlined numbers are the best results among all LVLMs.

4.2 Benchmarks

Textual Benchmark

We evaluate the LM ability after multimodal finetuning on Vicuna-80 (Chiang et al., 2023). LLaMA2-7B-Chat provides the anchor answer and GPT-4 is adopted to vote which answer is better (detailed prompt is shown in Appendix D.1). Results are shown in the win rates of candidate models over the anchor.

Multimodal Benchmarks

The multimodal ability of LVLMs are measured on 8 benchmarks (see details in Appendix B) Recent research introduces several benchmarks for evaluating multimodal and language models. MMBench (Liu et al., 2023c), SEED-Bench-v1 (Li et al., 2023a), MM-Vet (Yu et al., 2023), MMMU (Yue et al., 2023), MME (Fu et al., 2023), TextVQA (Singh et al., 2019), GQA (Hudson & Manning, 2019), MLLM-Bench (Ge et al., 2023), TouchStone (Bai et al., 2023c), LLaVA-Bench (In-the-Wild)(Liu et al., 2023b), and EMT(Zhai et al., 2023) cover a wide range of questions and tasks, from multiple-choice to open-ended, across various ability dimensions. While most benchmarks adopt accuracy as the metric, some utilize unique measurements such as win rate over an anchor or averaged scores. These benchmarks collectively provide diverse means to assess and advance the performance of modern multimodal and language models.

4.3 Quantitative Results

Table 4 shows the performance of each LVLM on 12 benchmarks. ALLaVA is our model trained on top of Phi2-2.7B using our high-quality data. To further boost the performance, we train our model for one more epoch at the visual instruction tuning stage, yielding ALLaVA-Longer. Our model achieves the best performance on these 12 benchmarks among all similar-scale LVLMs (i.e., ~3B) like TinyGPT-V-2.8B (Yuan et al., 2023), MobileVLM-3B (Chu et al., 2023) and LLaVA-Phi-3B (Zhu et al., 2024). It also achieves comparable performance on multiple benchmarks with larger-scale LVLMs like InstructBLIP-13B (Dai et al., 2023), BLIP-2 (Li et al., 2023b), Qwen-VL-Chat (Bai et al., 2023b), LLaVA-V1.5 (Liu et al., 2023a), LVIS-Instruct4V (Wang et al., 2023b) and ShareGPT4V (Chen et al., 2023).

Our model demonstrate superior language ability, obtaining a win rate of 52.5% on Vicuna-80 over anchor answers provided by LLaMA2-7B-Chat. This result also surpasses LLaVA-v1.5-13B by a large margin, indicating the high-quality of textual data we use and the validity of building an LVLM from a base model.

For multiple-choice or short-answer benchmarks, ALLaVAsurpasses similar-scale LVLMs by a large margin on MMB, SEED ${}_{img}^{v1}$ v1img, MM-Vet, MME and GQA. ALLaVA-Longer even outperforms LLaVA-v1.5-13B on MM-Vet, MME and EMT with only as many as 25% of its parameters. For free-form generation benchmarks, ALLaVA-Longer achieves a SOTA win rate of 8.8% over GPT-4V on MLLM-Bench, which is a benchmark for probing complex reasoning ability. It also performs comparably well with LLaVA-v1.5-13B on TouchStone and LLaVA (In-the-Wild).

4.4 Qualitative Results

[Uncaptioned image] — Table 5: Qualitative results of ALLaVA-3B. The bold words are annotated afterwards to highlight the key information.

Example 1: humor comprehension

User	Explain the humor of the image.
ALLaVA	The humor in the image arises from the unexpected and whimsical interaction between the two animals and the camera. The squirrel, typically a creature of the forest, is shown engaging with a human-made object, which is a camera. This anthropomorphism creates a playful and amusing scenario, as it is not common to see a squirrel with a camera. The bird, while not directly interacting with the camera, adds to the humor by being perched on the tripod, which is a human-made object, further blurring the lines between the natural and the artificial. The image plays on the idea of animals being curious about human technology and the unexpected ways in which they might interact with it. The humor is also enhanced by the contrast between the natural setting and the human-made objects, creating a surreal and entertaining scene.
GPT-4V	The humor in this image comes from the role reversal and anthropomorphism on display. Typically, it’s humans who photograph wildlife, but here, the squirrel is the one operating the camera, aiming it at the bird, as if it’s about to take a picture. The bird, on the other hand, is posing like a subject of a photograph. This unexpected switch, along with the attribution of human-like behavior to the animals, creates a whimsical and amusing scene that deviates from the norm, which is the essence of humor. Additionally, the fact that the camera is set on top of a mushroom adds another layer of fantasy and playfulness to the image.
LLaVA-v1.6-34B	The humor in this image comes from the anthropomorphism of the squirrel, which is a technique where human traits, emotions, or intentions are attributed to non-human entities. The squirrel is posed as if it is a photographer, holding a camera and seemingly taking a picture of the bird. This is amusing because it’s a playful and imaginative scenario that suggests the squirrel is engaging in human-like behavior, which is not possible in reality. The juxtaposition of a small, wild animal acting like a human photographer is whimsical and lighthearted, evoking a sense of amusement and wonder.
LLaVA-v1.5-13B	The image features a squirrel standing on its hind legs and holding a camera, seemingly taking a picture of a bird. This scene is quite humorous as it portrays the squirrel as a photographer, which is an unusual and amusing sight. The bird, on the other hand, is perched on a mushroom, adding to the whimsical nature of the scene. The combination of the squirrel’s human-like behavior and the bird’s presence creates a light-hearted and entertaining image.
Example 2: emotion detection

User	Based on the visual cues, what is the facial expression of the woman?
ALLaVA	The woman’s facial expression can be described as solemn and introspective. Her eyes are slightly narrowed, and her mouth is closed in a neutral position, which often conveys a sense of seriousness or deep thought. The overall impression is one of quiet contemplation or a moment of reflection.
GPT-4V	The woman in the image appears to have a neutral or contemplative expression. Her eyes are slightly downcast, and her mouth is closed in a relaxed manner, which does not convey a strong emotion. The overall impression is one of calmness or pensiveness.
LLaVA-v1.6-34B	The woman in the image appears to have a neutral to slightly serious facial expression. Her eyebrows are slightly furrowed, and her mouth is closed, which could suggest a thoughtful or contemplative mood. The white markings on her face, which are often associated with cultural or traditional significance, add a unique and striking element to her appearance.
LLaVA-v1.5-13B	The woman in the image has a serious facial expression, as she is looking directly at the camera.

We provide two examples and compare the generation results between ALLaVA and other SOTA models, including GPT-4V, LLaVA-v1.5-13B and LLaVA-V1.6-13B⁴⁴4Results are obtained from https://llava.hliu.cc/ on Feb. 9th, 2024.. Example 1 tests the model ability on humor comprehension. All four models are able to generate satisfactory descriptions on the image, capturing the posture of the squirrel and the bird and identifying their respective roles, which adds humor to the image. Example 2 probes the ability to detect emotion. ALLaVA, GPT-4V and LLaVA-v1.6-34B have consistent judgement on the woman’s facial expression and her underlying emotion. However, LLaVA-v1.5-13B only has a superficial description and states that she is looking at the camera, which she is not.

In these two examples, ALLaVA, with only 3B parameters, is able to achieve comparable performance with much larger models, demonstrating its superior reasoning ability gained from our high-quality training set.

4.5 Analysis

Ablation on Data

PT w/ ALLaVA-Caption-4V	FT w/ ALLaVA-Instruct-4V	MM-Vet	MME	GQA
✗	✗	25.8	1489.9	44.0
✗	✓	12.1	1199.8	39.0
✓	✗	30.0	1582.1	47.7
✓	✓	32.2	1623.2	50.0

Table 6: Ablation of using our data on three benchmarks. PT: pretraining, or vision-language alignment. FT: finetuning, or visual instruction tuning.

Table 6 details the results of training the model with or without our data at each stage. The baseline result (line 1) is obtained by solely using ShareGPT4V for alignment, and llava_instruct_657K for instruction tuning. Adding solely ALLaVA-Instruct-4V (line 2) performs even worse than using either of our data (line 1), but adding both (line 4) performs better than adding solely ALLaVA-Caption-4V (line 3). This indicates that one cannot perform large scale visual instruction tuning on an insufficiently aligned model. Involving both of our datasets (line 4) significantly improves the model performance on these three benchmarks, manifesting the effectiveness of our datasets.

Name (LM backbone)	Vicuna-80	MM-VET	MME	GQA
ALLaVA (Qwen-1.8B)	40.0	28.3	1467.0	48.7
ALLaVA (StableLM-2-1.6B)	38.1	32.7	1539.1	49.9
ALLaVA (Phi2-2.7B)	48.8	33.2	1623.2	50.0

Table 7: Comparison of different LLM backbones. In the default setting, we use Phi2-2.7B as the LM backbone as it performs the best.

Choice of LM Backbones

Table 7 details the results of using different LM backbones. The language performance of using Phi2-2.7B after multimodal training is significantly better than using Qwen-1.8B (Bai et al., 2023a) and StableLM-2-1.6B (Team, 2023). On multimodal benchmarks, using Phi2-2.7B outperforms Qwen-1.8B by a larger margin than StableLM-2-1.6B. This result suggests the potential of adopting StableLM for multimodal training.

5 Conclusion

In this work, we present a framework to generate high-quality captions, instructions and answers simultaneously, which is a scalable method for obtaining more data for LVLM training. Using ALLaVA-Caption-4V and ALLaVA-Instruct-4V, we train our model ALLaVA, which achieves competitive performance on 12 benchmarks among 3B-scale LVLMs and comparable performance with larger SOTA models such as LLaVA-v1.5-13B on several benchmarks as well. Our data can significantly narrow the performance gap betweeen lite LVLMs and normal-size ones. We open-source our model and data to the research community for better development of this field.

Limitation

Although we have provide nearly 1M data, the data could be further scaled up.

References

Agrawal et al. (2019) Harsh Agrawal, Karan Desai, Yufei Wang, Xinlei Chen, Rishabh Jain, Mark Johnson, Dhruv Batra, Devi Parikh, Stefan Lee, and Peter Anderson. Nocaps: Novel object captioning at scale. In Proceedings of the IEEE/CVF international conference on computer vision, pp. 8948–8957, 2019.
Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng Xu, Jin Xu, An Yang, Hao Yang, Jian Yang, Shusheng Yang, Yang Yao, Bowen Yu, Hongyi Yuan, Zheng Yuan, Jianwei Zhang, Xingxuan Zhang, Yichang Zhang, Zhenru Zhang, Chang Zhou, Jingren Zhou, Xiaohuan Zhou, and Tianhang Zhu. Qwen technical report, 2023a.
Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023b.
Bai et al. (2023c) Shuai Bai, Shusheng Yang, Jinze Bai, Peng Wang, Xingxuan Zhang, Junyang Lin, Xinggang Wang, Chang Zhou, and Jingren Zhou. Touchstone: Evaluating vision-language models by language models, 2023c.
Changpinyo et al. (2021) Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts, 2021.
Chen et al. (2023) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023.
Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023. URL https://lmsys.org/blog/2023-03-30-vicuna/.
Chu et al. (2023) Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, and Chunhua Shen. Mobilevlm : A fast, strong and open vision language assistant for mobile devices, 2023.
Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023.
Ge et al. (2023) Wentao Ge, Shunian Chen, Guiming Chen, Junying Chen, Zhihong Chen, Shuo Yan, Chenghao Zhu, Ziyue Lin, Wenya Xie, Xidong Wang, Anningzhe Gao, Zhiyi Zhang, Jianquan Li, Xiang Wan, and Benyou Wang. Mllm-bench, evaluating multi-modal llms using gpt-4v, 2023.
Hendrycks et al. (2021) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
Hudson & Manning (2019) Drew A. Hudson and Christopher D. Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019.
IDEFICS (2023) IDEFICS. Introducing idefics: An open reproduction of state-of-the-art visual language model. https://huggingface.co/blog/idefics, 2023.
Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollár, and Ross Girshick. Segment anything, 2023.
Krishna et al. (2016) Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Fei-Fei Li. Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
Krizhevsky et al. (2009) Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
Li et al. (2023a) Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension, 2023a.
Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation, 2022.
Li et al. (2023b) Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023b.
Lin et al. (2015) Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
Liu et al. (2023a) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023a.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023b.
Liu et al. (2023c) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2023c.
Mishra et al. (2019) Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pp. 947–952. IEEE, 2019.
OpenAI (2023) OpenAI. Gpt-4v(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf, 2023.
OpenAI et al. (2023) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mo Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. Gpt-4 technical report, 2023.
Ordonez et al. (2011) Vicente Ordonez, Girish Kulkarni, and Tamara Berg. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011.
Parekh et al. (2020) Zarana Parekh, Jason Baldridge, Daniel Cer, Austin Waters, and Yinfei Yang. Crisscrossed captions: Extended intramodal and intermodal semantic similarity judgments for ms-coco. arXiv preprint arXiv:2004.15020, 2020.
Schuhmann et al. (2021) Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs, 2021.
Schuhmann et al. (2022) Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
Sharma et al. (2018) Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565, 2018.
Singh et al. (2019) Amanpreet Singh, Vivek Natarjan, Meet Shah, Yu Jiang, Xinlei Chen, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317–8326, 2019.
Team (2023) Stability AI Language Team. Stable lm 2 1.6b, 2023. URL [https://huggingface.co/stabilityai/stablelm-2-1.6b](https://huggingface.co/stabilityai/stablelm-2-1.6b).
Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023.
Wang et al. (2023a) Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235, 2023a.
Wang et al. (2023b) Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, and Yu-Gang Jiang. To see is to believe: Prompting gpt-4v for better visual instruction tuning, 2023b.
Wang et al. (2023c) Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2023c.
Xu et al. (2023a) Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023a.
Xu et al. (2023b) Zhiyang Xu, Trevor Ashby, Chao Feng, Rulin Shao, Ying Shen, Di Jin, Qifan Wang, and Lifu Huang. Vision-flan:scaling visual instruction tuning, Sep 2023b. URL https://vision-flan.github.io/.
Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
Yu et al. (2023) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
Yuan et al. (2023) Zhengqing Yuan, Zhaoxu Li, and Lichao Sun. Tinygpt-v: Efficient multimodal large language model via small backbones, 2023.
Yue et al. (2023) Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
Zhai et al. (2023) Yuexiang Zhai, Shengbang Tong, Xiao Li, Mu Cai, Qing Qu, Yong Jae Lee, and Yi Ma. Investigating the catastrophic forgetting in multimodal large language models, 2023.
Zhou et al. (2023) Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
Zhu et al. (2024) Yichen Zhu, Minjie Zhu, Ning Liu, Zhicai Ou, Xiaofeng Mou, and Jian Tang. Llava- $p h i$ : Efficient multi-modal assistant with small language model. arXiv preprint arXiv:2401.02330, 2024.

Appendix A Data Distillation

A.1 Prompt for Distilling LAION

⬇

### You are an excellent image describer and questioner

### You have three tasks in total

#### Your first task is to describe the given image as detailed as possible

#### Your third task is to answer the question you raised solely based on the given image

### When you ask questions, try to find the most valuable information in the picture to ask about, and ask a question that is relevant to that information

### When you ask questions, do not involve violence, advertisement, possible invasion of privacy, or questions that may cause discomfort

### Do not mention anything from the prompt in your response

### You will follow the instructions to the best of your ability

### Your response should follow the following format

{description}

{candidate questions}

{question}

{answer}

A.2 Prompt for Distilling Vision-FLAN

⬇

You are an excellent image describer.

Your task is to first describe an image and then answer a question.

Your answer should provide relevant information to the question and demonstrate the process of solving the question.

Both your description and answer should be professional, insightful, helpful, objective, unbiased.

Here is the question:

‘‘‘question

{question}

‘‘‘

Your output should follow the format below:

{description}

{detailed_answer}

A.3 Data Example

LAION

LAION Caption	Paleo Diet Fig Newton Cupcakes
Fine-grained Caption	The image features a close-up view of a single cupcake with a creamy white frosting on top, placed on a dark-colored metal stand with slender, branch-like supports. The cupcake is centered in the frame and is positioned on a circular platform that is part of the stand. On top of the frosting, there is a garnish that appears to be a slice of candied fruit, possibly a fig, with visible seeds and a caramelized exterior, which adds a touch of elegance to the presentation. The cupcake liner has a fluted design and is a golden yellow color, suggesting it might be made of a buttery or vanilla-flavored cake. In the background, slightly out of focus, there is another cupcake with a bite taken out of it, revealing a dark filling, possibly chocolate or fruit jam. The background is a soft, neutral color, providing a contrast that highlights the cupcake in the foreground.
Complex Reasoning Instruction	Considering the visual cues present in the image, what might be the flavor profile of the cupcake and how does the garnish on top contribute to the overall taste experience?
Detailed Answer	Based on the image, the cupcake likely has a sweet and rich flavor profile, with the golden yellow cake suggesting a vanilla or buttery base. The creamy white frosting adds a layer of sweetness and a smooth texture. The garnish on top appears to be a slice of candied fig, which would contribute a chewy texture and a complex sweetness with hints of caramelization. The seeds within the fig slice would also add a slight crunch, creating a varied mouthfeel. Overall, the garnish not only serves as an aesthetically pleasing decoration but also enhances the flavor experience with its unique taste and texture.
Vision-FLAN

Fine-grained Caption	The image depicts a pastoral scene at what appears to be a livestock show. In the foreground, a row of individuals, primarily wearing white coats, are leading a line of large, tan and white animals across a grassy field. These animals, with their distinctive humped backs and white patches, are cattle, likely a specific breed given the context of the event. In the background, there’s a large white tent labeled "Members", suggesting an exclusive area or services for participants or special guests. The landscape features rolling hills, and the sky is partly cloudy, allowing for patches of blue sky and sunlight to enhance the bucolic atmosphere. No other animal species are visible in the frame.
Instruction	Given an image of an animal, identify the kind of animal in the image. The picture could be of more popular animals that are visible around zoos or are sometimes domesticated at home. They could also sometimes be found in the wild. Options: (a) This image contains a giraffe (b) This image contains an elephant (c) This image contains a zebra (d) This image contains a sheep (e) This image contains a bear (f) This image contains a horse (g) This image contains a cow (h) This image contains a cat (i) This image contains a dog (j) This image contains a bird
VFLAN Answer	(g) This image contains a cow
Detailed Answer	The animals in the image are cattle, more commonly referred to as cows. This determination is made clear by their physical characteristics—large size, humped backs, distinctive color patterns, and the presence of horns in some individuals. Given the setting of a livestock show, these cows are likely to be a breed prized for either dairy or beef production. They are being shown off by handlers, which is typical in such events where animals are judged based on breed standards or other criteria. Therefore, the correct answer from the provided options is: (g) This image contains a cow.

In Table 8, we show two examples from each of the datasets, with an image, a caption of the image, a question, and an answer. The entries in bold are generated by GPT-4V using our data generation protocol.

Appendix B Details on Benchmarks

The benchmarks employed in this study are detailed below.

•

MMBench (Liu et al., 2023c) (dev set) consists of 4329 multiple-choice questions across 20 ability dimensions, using Accuracy under circular evaluation as the metric.
•

SEED-Bench-v1 (Li et al., 2023a) (image set) comprises 14,233 multiple-choice questions across 9 dimensions. Accuracy is adopted as the metric.
•

MM-Vet (Yu et al., 2023) comprises 218 questions, each requiring multiple capabilities to solve and provided with multiple groundtruths for a flexible match. Accuracy is adopted as the metric.
•

MMMU (Yue et al., 2023) (val set) consists of 900 multiple-choice questions that require expert-level knowledge to solve. Accuracy is adopted as the metric.
•

MME (Fu et al., 2023) is a benchmark with 2,374 questions spanning 14 subtasks. Accuracy is used as the metric.
•

TextVQA (Singh et al., 2019) comprises 5,000 questions and Accuracy is used as the metric.
•

GQA (Hudson & Manning, 2019) consists of 12,578 questions for real-world reasoning and compositional question answering. Accuracy is used as the metric.
•

MLLM-Bench (Ge et al., 2023) contains 420 complex visual reasoning questions along with per-sample criteria that aids GPT-4V to score each answer. GPT-4V’s answers serve as anchors. Win rate over the anchor is adopted as the metric.
•

TouchStone (Bai et al., 2023c) contains 908 open-ended question covering 5 abilities and 27 subtasks. LVLM’s answers are compared with pre-generated text-based GPT-4’s answers, using text-based GPT-4 as the judge to score each answer. The Averaged Scores are used as the metric.
•

LLaVA-Bench (In-the-Wild) (Liu et al., 2023b) contains 60 open-ended questions and uses text-based GPT-4 (OpenAI et al., 2023) as a judge to score answers in a pairwise fashion. Score Ratio between candidate answers and anchor answers from GPT-4 is adopted as the metric.
•

EMT (Zhai et al., 2023) (Cifar10 split) aims to test the degree of catestrophic forgetting of LVLMs after visual instruction tuning. Accuracy is used as the metric.

Appendix C Training details

In Table 9, we show the detailed hyperparameters for ALLaVA-3B using Phi2-2.7B as LM backbone.

Stage	Name	Value
1	Global Batch Size	256
	Deepspeed ZeRO Stage	1
	Optimizer	AdamW
	Weight Decay	0
	Scheduler	Cosine Annealing with Linear Warmup
	Warmup Ratio	0.03
	Max LR	$2e-5$
	Min LR	$2e-6$
	Precision	BF16
2	Global Batch Size	128
	Deepspeed ZeRO Stage	1
	Optimizer	AdamW
	Weight Decay	0
	Scheduler	Cosine Annealing with Linear Warmup
	Warmup Ratio	0.03
	Max LR	$2e-5$
	Min LR	$2e-6$
	Precision	BF16

Table 9: Training hyperparameters.

Appendix D Evaluation Prompts

D.1 Evaluation on Vicuna-80

⬇

You are a fair judge. You will be provided with a question and two answers. Your task is to judge which answer is of better quality.

Here is the question:

{question}

Here is Answer1:

{answer1}

Here is Answer2:

{answer2}

Your output should be either "Answer1" or "Answer2".