An Open and Comprehensive Pipeline
for Unified Object Grounding and Detection

Xiangyu Zhao

{}^{1}

Yicheng Chen

{}^{1}

Shilin Xu

{}^{1}

Xiangtai Li

{}^{1}

Xinjiang Wang

{}^{2}

Yining Li

{}^{1}

Haian Huang

{}^{1,}

²²2Project lead

{}^{1}

Shanghai AI Lab

{}^{2}

SenseTime Research
{zhaoxiangyu, chenyicheng, huanghaian}@pjlab.org.cn

Abstract

Grounding-DINO is a state-of-the-art open-set detection model that tackles multiple vision tasks including Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC). Its effectiveness has led to its widespread adoption as a mainstream architecture for various downstream applications. However, despite its significance, the original Grounding-DINO model lacks comprehensive public technical details due to the unavailability of its training code. To bridge this gap, we present MM-Grounding-DINO, an open-source, comprehensive, and user-friendly pipeline, which is built with the MMDetection toolbox. It adopts abundant vision datasets for pre-training and various detection and grounding datasets for fine-tuning. We give a comprehensive analysis of each reported result and detailed settings for reproduction. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO-Tiny outperforms the Grounding-DINO-Tiny baseline. We release all our models to the research community. Codes and trained models are released at https://github.com/open-mmlab/mmdetection/tree/main/configs/mm_grounding_dino.

Figure 1: (a)Open-Vocabulary Detection(OVD). (b)Phrase Grounding(PG). (c)Referring Expression Comprehension(REC).

Refer to caption — Figure 2: Results on various benchmarks. MM-Grounding-DINO outperforms other grounding models on a broad range of tasks.

1 Introduction

The task of object detection typically involves inputting an image into a model to obtain proposals, which are then matched with text through multi-modal alignment, making it a key component of most state-of-the-art multi-modal understanding architectures. Presently, object detection can be subdivided into three sub-tasks according to the type of input text: Open-Vocabulary Detection (OVD), Phrase Grounding (PG), and Referring Expression Comprehension (REC).

Following zero-shot settings, OVD models are trained on base categories but require predicting both base and novel categories within a large-scale language vocabulary [29]. Phrase grounding task takes not only a category but a phrase that describes all the candidate categories as input and output corresponding boxes [25]. The primary aim of REC task is to accurately identify the target designated by a given textual description and subsequently demarcate its position utilizing a bounding box [9].

In recent years, numerous vision grounding and detection models have been explored to solve the tasks above. Among these grounding models, Grounding-DINO [20] has been made as a main-stream architecture with superior performance. Based on a closed-set detector DINO [34], Grounding-DINO-Large achieves state-of-the-art zero-shot performance on COCO [17](mAP 52.5) without any COCO training data. Grounding-DINO executes the integration of vision and language modality at various stages, encompassing feature enhancer, query selection module, and decoder. This profound fusion approach significantly enhances the detection of objects in an open-set context and DETR-based structure makes it an end-to-end network without any hard-crafted module.

Given that Grounding-DINO has demonstrated superior precision across the aforementioned three downstream tasks, yet is not entirely open-source (with only test and demo codes available), we rebuild the Grounding-DINO model utilizing the MMDetection toolbox [4] within the OpenMMLab project, adhering to the official test codes of Grounding-DINO. The structure of the model remains almost unchanged except for the modifications during initialization. Based on the Grounding-DINO framework, we propose to apply more datasets for pretraining, including COCO, Objects365 [27], GRIT [23], V3Det [28], RefCOCO [13], RefCOCO+ [33], RefCOCOg [22], GQA [11] / Flickr30k Entities [24](combination also named as Golden-G dataset [12]), results in a stronger Grounding-DINO-based model which we call MM-Grounding-DINO. Since the Cap4M dataset [25] used by Grounding-DINO is not open-source, we have opted for the GRIT and V3Det datasets as substitutes in our study.

We further extend all available benchmarks for OVD, PG and REC evaluation, including COCO, LVIS [8], RefCOCO/+/g, Flickr30k Entities, ODinW13/35 [15], gRefCOCO [19] and Description Detection Dataset( $D^{3}$ ) [30]. To our knowledge, we are the first to implement a framework that facilitates systematic evaluation across such an extensive array of datasets. All evaluation metrics are readily available in MMDetection. Pretrained with a large number of data, MM-Grounding-DINO-Tiny achieves zero-shot 50.6 mAP on COCO, 41.4 mAP on LVIS mini, and comprehensively surpasses Grounding-DINO-Tiny in REC task, detailed results are shown in Section 3. We hope that our pipeline will serve as a valuable resource for further investigations in OVD, PG, and REC tasks.

The contributions of our paper are as follows:

1.

We propose MM-Grounding-DINO, a comprehensive and open-sourced grounding pipeline based on Grounding-DINO and pretrained with abundant vision datasets, which comprehensively address OVD, PG, and REC tasks.
2.

We take the lead in extending all available benchmarks for OVD, PG, and REC evaluation, including COCO, LVIS, RefCOCO/+/g, Flickr30K Entities, ODinW13/35, gRefCOCO and $D^{3}$ . All evaluation metrics are readily available in MMDetection.
3.

We extensively evaluate the transfer ability of our models by fine-tuning our model through a multitude of external special datasets.

2 Approach

In this section, we introduce the model and datasets in detail. Unless otherwise specified, MM-G denotes MM-Grounding-DINO. G-DINO refers to Grounding-DINO. O365 means Objects365 V1 and GoldG refers to the combination of GQA and Flickr30k Entities in the following sections.

2.1 Model

Table 1: Variants of MM-G/G-DINO along with their corresponding pretraining datasets. The same sign(i.e., a, b, c) next to each variant denotes comparable groups. Since Cap4M is non-open-source, GRIT and V3Det are utilized as substitute datasets. VG: Visual Genome. RefC: RefCOCO/+/g. INB: ImageNetBoxes.

Model	Training Datasets	Image Consumption
MDETR	COCO, RefC, VG, GoldG	52M (40 Ep $\times$ 1.3M Img)
GLIP	O365, OpenImages, VG, INB, COCO, RefC, GoldG, Cap24M	64M (64 Bs $\times$ 1M Iter)
GLIPv2 (stage II)	COCO, LVIS, PhraseCut, Cap16M	5.36M (24 Ep $\times$ 0.2M Img + 8 Ep $\times$ 0.07M Img)
G-DINO-Tiny(a)	O365	18.3M (30 Ep $\times$ 128 Bs $\times$ 4755 Iter)
G-DINO-Tiny(b)	O365, GoldG	41.3M (30 Ep $\times$ 128 Bs $\times$ 10763 Iter)
G-DINO-Tiny(c)	O365, GoldG, Cap4M	-
MM-G-Tiny(a)	O365	18.3M (30 Ep $\times$ 128 Bs $\times$ 4755 Iter)
MM-G-Tiny(b)	O365, GoldG	41.3M (30 Ep $\times$ 128 Bs $\times$ 10763 Iter)
MM-G-Tiny(c1)	O365, GoldG, GRIT	56.3M (30 Ep $\times$ 128 Bs $\times$ 14669 Iter)
MM-G-Tiny(c2)	O365, GoldG, V3Det	46.8M (30 Ep $\times$ 128 Bs $\times$ 12196 Iter)
MM-G-Tiny(c3)	O365, GoldG, GRIT, V3Det	61.8M (30 Ep $\times$ 128Bs $\times$ 16102 Iter)
MM-G-Large	COCO, RefC, O365V2, GoldG, GRIT, Open-Images, V3Det	-

As we mentioned in Section An Open and Comprehensive Pipeline for Unified Object Grounding and Detection, our model is based on Grounding-DINO [20] and almost remains unchanged. Our framework is shown in Figure 3. Given images with shape $[Batchsize,3,H,W]$ and text descriptions, our model can align the description with corresponding generated bounding boxes. The components of our model contain a text backbone for extracting text features, an image backbone for extracting image features, a feature enhancer for deeply fusing image and text features, a language-guided query selection module for query initialization, and a cross-modality decoder for box refinement. More details of the structure are drawn in [20].

Feature Extraction and Fusion. Given an image-text pair, we employ an image backbone to extract image features at multiple scales, and concurrently, a text backbone is utilized for the extraction of text features. Then we fed both features into a feature enhancer module for cross-modality fusion. In the feature enhancer module, the text features and image features are first fused through a Bi-Attention Block containing both text-to-image cross-attention and image-to-text cross-attention layers. Then the fused text features and image features are additionally enhanced with vanilla self-attention and deformable self-attention layers followed by an FFN layer respectively, as drawn in Algorithm 1.

Algorithm 1 Feature Enhancer Layer

1:fused_image, fused_text =

2:BiAttentionBlock(image_features, text_features)

3:fused_text =

4:FFN( SelfAttentionLayer(fused_text) )

5:fused_image =

6:FFN( DeformableAttentionLayer(fused_image) )

Language-Guided Query Selection. To optimize the utilization of text for guiding object detection, Grounding-DINO designed a language-guided query selection module. Language-guided query selection module selects $num\_query$ proposals based on the cosine similarity with input text features as decoder queries. The parameter $num\_query$ denotes the number of queries fed into the decoder, and it has been configured to a value of 900 in our implementation, following DINO [34]. The input queries for the decoder are composed of two components: the content part and the position part. The position part represents dynamic anchor boxes and is initialized based on the output of the language-guided query selection module, while the content part is initialized as an all-zero learnable query.

Cross-modality Decoder. The cross-modality decoder layer in Grounding-DINO is designed to further incorporate text and image features for cross-modality learning. After self-attention, the architecture incorporates an image cross-attention layer, succeeded by a text cross-attention layer, and culminating in an FFN layer. Each decoder layer possesses an additional text cross-attention layer in comparison to the DINO decoder layer. This enhancement is necessitated by the requirement to inject textual information into queries, thereby facilitating the model’s performance.

Training Loss. The L1 loss and the GIOU [26] loss are implemented for the box regression branch. Following GLIP [16], we utilize focal loss [18] as a contrastive loss between the predicted boxes and language tokens for classification. Every predicted box would be multiplied with all language tokens to compute the similarity between them. Box regression and classification loss are jointly utilized for the computation of bipartite matching loss [3]. Consistent with Grounding-DINO, we incorporate auxiliary loss for each decoder layer as well as the encoder outputs.”

Difference. The primary difference between MM-G and G-DINO lies in the contrastive embedding module. Motivated by CLIP [25], we add bias while initializing the contrastive embedding module. This can significantly reduce the initial loss value and accelerate the convergence of our model.

The implementation codes are shown in Algorithm 2.

Algorithm 2 Contrastive Embedding

1:res = visual_feat @ text_feat.transpose(-1, -2)

2:res = res / math.sqrt(visual_feat.shape[-1])

3:res = res + self.bias

2.2 Datasets Preparation

Our data format is motivated by the format in Open Grounding-DINO [35] and modified with the format in MMDetection. Since MM-Grounding-DINO is designed to address three tasks with datasets of different kinds of annotations, we divided the 15 datasets we used into three groups separately. Comprehensive details of the datasets are shown in Table 2. It’s noteworthy that the entirety of data in GRIT, which exceeds 13 million, is not fully utilized per epoch during the training process. Instead, it is partitioned into segments of 500,000 for each epoch.

OVD Datasets. The datasets we use for training include COCO [17], Objects365V1 [27], Objects365V2 [27], V3Det [28], Open-Images, and the evaluation datasets contain COCO, LVIS [8], ODinW12/35 [15].

PG Datasets. The training datasets encompass GQA [11], GRIT [23], Flickr30K Entities [24], while Flickr30K Entities dataset is also used for evaluation.

REC Datasets The training datasets include RefCOCO [13], RefCOCO+ [33], RefCOCOg [22]. For evaluation, we utilize a broader set of datasets, which contain RefCOCO, RefCOCO+, RefCOCOg, gRefCOCO [19], and Description Detection Dataset( $D^{3}$ ) [30].

Table 2: Comprehensive details of datasets utilized in MM-Grounding-DINO. Datasets denoted with an asterisk (*) can be utilized for both training and evaluation in our framework. Benchmarks are exclusively utilized for evaluation, while the remaining datasets are only for training.

Dataset	Task	Images	Instances	categories
COCO¹¹1Both utilized for training and evaluation [17]	OVD	123K	896K	80
Objects365-V1 [27]	OVD	638K	10M	365
Objects365-V2 [27]	OVD	1.7M	25M	365
OpenImages-V6	OVD	1.5M	14M	600
V3Det [28]	OVD	245K	1753K	13029
Flickr30k Entities¹¹1 [24]	PG	31K	275K	-
GQA [11]	PG	113K	-	-
GRIT [23]	PG	9M	137M	-
RefCOCO¹¹1 [13]	REC	19K	50K	-
RefCOCO+¹¹1 [33]	REC	19K	49K	-
RefCOCOg¹¹1 [22]	REC	26K	54K	-
LVIS [8]	OVD Benchmark	164K	2M	1000
ODinW [15]	OVD Benchmark	20K	135K	314
gRefCOCO [19]	REC Benchmark	19K	60K	-
$D^{3}$ [30]	REC Benchmark	10K	18K	-

2.3 Training Settings

Input Rules of Text. For OVD training, we concatenate all categories in detection datasets as a long string, like ”People. Ball. Racket. Cat.”. For PG and REC tasks, following M-DETR [12], during the pre-training phase, we annotate every object that is referred to within the text, which results in a slight modification in the model’s application for this task. For instance, during pre-training, given the caption ”The woman wearing a blue dress standing next to the rose bush.”, MM-Grounding-DINO would be trained to predict bounding boxes for all referred objects such as the woman, the blue dress, and the rose bush.

Model Variants. Similar to Grounding-DINO, we choose a well-pretrained BERT-based-uncased [6] model as our language encoder and Swin Transformer [21] as image backbone. We compare different combinations of datasets in MM-G-tiny and G-DINO-Tiny. The selection of training datasets is contingent upon the scale of the image backbone, as shown in Table 1.

Data Augmentation. Besides random resize, random clip, and random flip, we also introduce random negative samples in data augmentation. We concatenate the categories or text descriptions, which are randomly sampled from other images as negative examples, with ground-truth descriptions serving as positive examples. This can effectively suppress the hallucination phenomena generated by the model, thus the model will not predict objects not exist in the image.

Computing Resources. We trained our MM-G-Tiny on 32 NVIDIA 3090 GPUs with a total batch size of 128 for 30 epochs. Since the computational cost of MM-G-Large is extremely high, MM-G-Large model is still in training.

3 Main Results

3.1 Zero-shot Transfer

In zero-shot settings, MM-G models are initially trained on base datasets and subsequently assessed on novel datasets. Additionally, we present a set of results derived from fine-tuning to facilitate a comprehensive comparison of our model with Grounding-DINO. This approach ensures a robust evaluation of the model’s performance and its relative standing in the field.

Table 3: Results on COCO benchmark. All MM-G-T variants outperformed their counterparts, notably MM-G-T(c1) achieving 50.5 mAP in a zero-shot setting.

Model	Backbone	Lr schd	COCO mAP
GLIP	Swin-T	zero-shot	46.6
G-DINO-T(a)	Swin-T	zero-shot	46.7
G-DINO-T(b)	Swin-T	zero-shot	48.1
G-DINO-T(c)	Swin-T	zero-shot	48.4
MM-G-T(a)	Swin-T	zero-shot	48.5(+1.8)
MM-G-T(b)	Swin-T	zero-shot	50.4(+2.3)
MM-G-T(c1)	Swin-T	zero-shot	50.5(+2.1)
MM-G-T(c2)	Swin-T	zero-shot	50.6(+2.2)
MM-G-T(c3)	Swin-T	zero-shot	50.4(+2.0)

Table 4: Results on LVIS benchmark. All MM-G-T variants outperformed their counterparts. Significantly, MM-G-T(c3) reached 41.4 AP in the Mini Val zero-shot setting and showed a 17.3 improvement following 12 epochs of fine-tuning.

Model

Backbone

Lr schd

MiniVal

APr

MiniVal

APc

MiniVal

APf

MiniVal

Val1.0

APr

Val1.0

APc

Val1.0

APf

Val1.0

GLIP

Swin-T

zero-shot

18.1

21.2

33.1

26.7

10.8

14.7

29.0

19.6

G-DINO-T(c)

Swin-T

zero-shot

18.8

24.2

34.7

28.8

10.1

15.3

29.9

20.1

MM-G-T(b)

Swin-T

zero-shot

28.1

30.2

42.0

35.7

17.1

22.4

36.5

27.0

MM-G-T(c1)

Swin-T

zero-shot

26.6

32.4

41.8

36.5(+7.7)

17.3

22.6

36.4

27.1(+7.0)

MM-G-T(c2)

Swin-T

zero-shot

33.0

35.6

45.9

40.5(+11.7)

17.3

22.6

36.4

27.1(+7.0)

MM-G-T(c3)

Swin-T

zero-shot

34.2

37.4

46.2

41.4(+12.6)

23.6

27.6

40.5

31.9(+11.8)

COCO Benchmark. We conduct an evaluation of MM-Grounding-DINO pretrained on O365 dataset and other PG/REC datasets. Following Grounding-DINO, COCO dataset is utilized for establishing a zero-shot learning baseline. We compare MM-Grounding-DINO-Tiny with Grounding-DINO-Tiny in Table 3. It’s shown in the result that even MM-G(a) trained with O365 only(mAP 48.5) can outperform G-DINO(c) trained with O365, Gold-G, and Cap4M(mAP 48.4), which proves the efficiency of our model. Trained with objects365, Gold-G and GRIT, MM-G-T(c) demonstrates a performance of 50.5 mAP, which improves 2.1 AP over G-DINO(c) on COCO benchmark. This is achieved without the model being exposed to any COCO images during training, and GRIT data we use() is even less than Cap4M(4M). There are two potential explanations for this:

•

Our training strategy, particularly the additional bias during initialization, aids in the convergence of the model.
•

O365 dataset encompasses the categories of COCO dataset. Consequently, our model has been extensively trained on the O365 dataset and naturally exhibits improved accuracy on the COCO dataset. This assertion is indirectly validated by the comparatively lower performance observed when the model is evaluated on other datasets.

It has also been observed that the incorporation of V3Det dataset does not contribute positively to the COCO zero-shot evaluation, and may even have detrimental effects.

LVIS Benchmark. LVIS dataset constitutes a long-tail detection dataset, encompassing in more than 1000 distinct categories for evaluation. Following Grounding-DINO, LVIS is also utilized for zero-shot OVD evaluation. We compare MM-Grounding-DINO-Tiny with Grounding-DINO-Tiny in Table 4. We observe that despite MM-G(a) being trained by O365 and GoldG in the absence of Cap4M, it still manages to surpass G-DINO(c) by +6.9AP on both LVIS MiniVal and Val. MM-G(c1) surpasses G-DINO(c) by +7.7AP on MiniVal and +7.0AP on Val, upon the addition of V3Det, MM-G(c3) experiences a substantial improvement of nearly 5 AP, reaching 41.4 mAP on MiniVal and 31.9 mAP on Val, which surpasses G-DINO(c) by a significant +12.6 AP on MiniVal and +11.8 AP on Val! The potential reasons could be categorized into two aspects:

•

Model exhibits more comprehensive training on LVIS categories vocabulary.
•

V3Det includes more than 13k categories which may cover a large part of LVIS’s categories, a similar conclusion is also drawn in [31].

Table 5: Results on RefCOCO/+/g. The training details of Grounding-DINO have not been released, so the learning schedule for fine-tuning is unknown.

Method	Backbone	Setting	RefCOCO			RefCOCO+			RefCOCOg
			val	testA	testB	val	testA	testB	val	test
G-DINO-T(c)	Swin-T	zero-shot	50.8	57.4	45.0	51.6	57.3	46.4	60.4	59.7
MM-G-T(b)	Swin-T	zero-shot	53.1	59.7	46.4	53.1	58.9	47.9	61.2	61.1
MM-G-T(c1)	Swin-T	zero-shot	53.4	58.8	46.8	53.5	59.0	47.9	62.7	62.6
MM-G-T(c2)	Swin-T	zero-shot	52.1	58.4	45.4	52.5	58.2	46.9	61.7	61.0
MM-G-T(c3)	Swin-T	zero-shot	53.1	59.1	46.8	52.7	58.7	48.4	62.9	62.9
G-DINO-T(c)	Swin-T	-	89.2	91.9	86.0	81.1	87.4	74.7	84.2	84.9
MM-G-T(c3)	Swin-T	5e	89.5	91.4	86.6	82.1	87.5	74.0	85.5	85.8

ODinW Benchmark. ODinW (Object Detection in the Wild) benchmark represents a more rigorous benchmark designed to assess model performance within real-world contexts. It consists of 35 object detection datasets, each of which is augmented with external knowledge. We utilize ODinW13/35 to evaluate the transferability of our model, summary results are shown in Table 6. Our MM-G-T(c3) demonstrates superior performance over G-DINO-T(c) and achieves scores of 53.3 mAP and 28.4 mAP on ODinW13 and ODinW35 respectively, which proves the robust transferability of our model. It is evident that a wide vocabulary holds substantial significance for ODinW datasets. Upon the integration of V3Det, the model’s performance experienced a substantial enhancement. The primary reason for this improvement is that V3Det encompasses a broader range of categories within ODinW. Detailed results of each sub-dataset are shown in Appendix A.3.

Table 6: Zero-shot domain transfer on ODinW.

Model	Backbone	ODinW13	ODinW35
G-DINO-T(c)	Swin-T	51.4	22.7
MM-G-T(b)	Swin-T	45.3	20.2
MM-G-T(c1)	Swin-T	51.1	22.8
MM-G-T(c2)	Swin-T	51.1	22.8
MM-G-T(c3)	Swin-T	53.3(+1.9)	28.4(+5.7)

Table 7: Results on gRefCOCO benchmark. Note that the threshold value is set to 0.6. More results can refer to Appendix A.1.

Method	Backbone	gRefCOCO
		Pr(F1=1, IoU $\geqslant$ 0.5) / N-acc
		val	testA	testB
G-DINO-T(c)	Swin-T	40.5/83.8	29.3/82.9	30.0/86.1
MM-G-T(c3)	Swin-T	40.6/83.1	29.2/84.3	31.6/96.7

RefCOCO/+/g and gRefCOCO Benchmark. We also evaluate MM-G’s zero-shot ability on REC task. RefCOCO, RefCOCO+, and RefCOCOg are established for REC evaluation, results are shown in Table 5. Compared with RefCOCO, gRefCOCO broadens its scope to encompass multi-target expressions, which entail the specification of multiple target objects via a single expression. Additionally, gRefCOCO accommodates no-target expressions that do not refer to any object within the image. This augmentation introduces a markedly elevated degree of versatility to input expressions, consequently enhancing the practicality and robustness of REC in real-world applications. We also conduct an evaluation on gRefCOCO benchmark to assess the zero-shot capabilities of REC, with the results being presented in Table 7. Our model is able to surpass the baseline across all zero-shot evaluation metrics on RefCOCO, and can either surpass or approximately equal G-DINO on gRefCOCO. From the results, it can be inferred that V3Det dataset can not provide any benefit for REC task.

Description Detection Dataset( $\mathbf{D^{3}}$ ) Benchmark. $D^{3}$ is characterized by its flexible language expressions, ranging from concise category names to extensive descriptions, and it ensures comprehensive annotation of all objects described across all images without omission. Sentences in $D^{3}$ are slightly longer than an ordinary word, therefore, it does not require a high level of understanding ability from the model. In fact, it leans more towards OVD task. In addition, there are 24,282 positive object-text pairs and 7,788,626 negative pairs in $D^{3}$ , which imposes a stringent demand on the model’s ability to distinguish negative objects. We report our results in Table 8. From the results, we observe that MM-G-T(c1) with GRIT and G-DINO-T(c) trained with Cap4M have demonstrated comparable performance. In particular, MM-G-T(c1) exhibits advancements in long sentences, while G-DINO-T(c) shows progress when dealing with short sentences. This will be elaborated in detail in Section 3.2. After incorporating V3Det, which contains a large number of precise annotations, the performance of MM-G-T(c3) on short sentences surpassed G-DINO-T(c) while the performance on long sentences gets worse. This is primarily due to the fact that the majority of text annotations in V3Det are short sentences.

Table 8: Zero-shot transfer on

D^{3}

. FULL, PRES, and ABS denote evaluation on all descriptions, presence descriptions only, and absence descriptions only respectively. s/m/l/vl denote short, middle, long and very long.

Method	mode	G-DINO-T(c)	MM-G-T(b)	MM-G-T(c1)	MM-G-T(c2)	MM-G-T(c3)
FULL/s/m/l/vl	concat	17.2/18.0/18.7/14.8/16.3	15.6/17.3/16.7/14.3/13.1	17.0/17.7/18.0/15.7/15.7	16.2/17.4/16.8/14.9/15.4	17.5/23.4/18.3/14.7/13.8
FULL/s/m/l/vl	parallel	22.3/28.2/24.8/19.1/13.9	21.7/24.7/24.0/20.2/13.7	22.5/25.6/25.1/20.5/14.9	22.3/25.6/24.5/20.6/14.7	22.9/28.1/25.4/20.4/14.4
PRES/s/m/l/vl	concat	17.8/18.3/19.2/15.2/17.3	16.4/18.4/17.3/14.5/14.2	17.9/19.0/18.3/16.5/17.5	16.6/18.8/17.1/15.1/15.0	18.0/23.7/18.6/15.4/13.3
PRES/s/m/l/vl	parallel	21.0/27.0/22.8/17.5/12.5	21.3/25.5/22.8/19.2/12.9	21.5/25.2/23.0/19.0/15.0	21.6/25.7/23.0/19.5/14.8	21.9/27.4/23.2/19.1/14.2
ABS/s/m/l/vl	concat	15.4/17.1/16.4/13.6/14.9	13.4/13.4/14.5/13.5/11.9	14.5/13.1/16.7/13.6/13.3	14.8/12.5/15.6/14.3/15.8	15.9/22.2/17.1/12.5/14.4
ABS/s/m/l/vl	parallel	26.0/32.0/33.0/23.6/15.5	22.8/22.2/28.7/22.9/14.7	25.6/26.8/33.9/24.5/14.7	24.1/24.9/30.7/23.8/14.7	26.0/30.3/34.1/23.9/14.6

Table 9: Fine-tune Results on LVIS benchmark.

Model

Backbone

Setting

MiniVal

APr

MiniVal

APc

MiniVal

APf

MiniVal

Val1.0

APr

Val1.0

APc

Val1.0

APf

Val1.0

MM-G-T(c3)

Swin-T

zero-shot

34.2

37.4

46.2

41.4

23.6

27.6

40.5

31.9

MM-G-T(c3)

Swin-T

open-set 1x

50.7(+16.5)

58.8(+21.4)

60.1(+13.9)

58.7(+17.3)

45.2(+21.6)

50.2(+12.6)

56.1(+15.6)

51.7(+19.8)

MM-G-T(c3)

Swin-T

open vocabulary 1x

43.2(+9.0)

57.4(+20.0)

59.3(+13.1)

57.1(+15.7)

Table 10: Fine-tune Results on COCO benchmark. Both close-set and open-set fine-tuning achieve a large improvement over the pre-trained model.

Model	Backbone	Setting	mAP
GLIP	Swin-T	zero-shot	46.6
G-DINO-T(c)	Swin-T	zero-shot	48.4
MM-G-T(c1)	Swin-T	zero-shot	50.5(+2.1)
MM-G-T(c2)	Swin-T	zero-shot	50.6(+2.2)
MM-G-T(c3)	Swin-T	zero-shot	50.4(+2.0)
Faster R-CNN	R-50	close-set 1x	37.4
Cascade R-CNN	R-50	close-set 1x	40.3
ATSS	R-50	close-set 1x	39.4
TOOD	R-50	close-set 1x	42.4
DINO	R-50	close-set 1x	50.1
GLIP	Swin-T	close-set 1x	55.4
G-DINO-T(c)	Swin-T	close-set 1x	58.1
MM-G-T(c3)	Swin-T	close-set 1x	58.2(+7.8)
MM-G-T(c3)	Swin-T	open-set 1x	54.7(+4.3)

3.2 Analysis for GRIT

GRIT[23] is a large dataset employed as our substitute for Cap4M created in GLIP [16], given that the latter is not open-source. However, as shown in the results above, the performance of GRIT doesn’t meet our expectations. For OVD task, MM-G-T(c1) with GRIT only improve +0.1 AP on COCO in Table 3 and +0.1 AP(Val) on LVIS in Table 4 than MM-G-T(b) without GRIT. For REC task, the gain brought by GRIT is relatively low on RefCOCO and gRefCOCO in Table 5 and 7. From our observation of images and annotations in GRIT, the primary reasons can be enumerated as follows:

•

The text annotation of GRIT comes from phrases or sentences extracted by spaCy[10] from captions in COYO-700M and LAION-2B, including a large number of abstract phrases like human names, events, facilities, and Geo-Political entities, which could potentially lead to the misdirection of the model.
•

In GRIT dataset, the majority of images are accompanied by a singular annotation. The single annotation encompasses a long sentence which is actually the whole caption of the image and a noisy box which approximately spans the full extent of the image.

However, it is noteworthy that the large-scale data of GRIT still serves a purpose. MM-G-T(c1) with GRIT surpass MM-G-T(b) by 5.8/2.6 AP on ODinW13/35 in Table 6, which is on par with G-DINO-T(c) pretrained with Cap4M. We thus Observed from Table 8, MM-G-T(c1) with GRIT and G-DINO-T(c) with Cap4M have demonstrated comparable performance on $D^{3}$ . Fortuitously, the single long text annotation of GRIT contributes to the enhancement of MM-G-T(c1)’s performance on long sentences.

Table 11: Fine-tune results on downstream tasks. * denotes 12 epochs of fine-tuning setting. †denotes 50 epochs of fine-tune setting. Under the same fine-tuning setting, our MM-Grounding-DINO outperforms previous methods by a large margin for most downstream tasks. The evaluation metric is box AP.

Method	Backbone	Fine-tune	RTTS¹¹footnotemark: 1	RUOD¹¹footnotemark: 1	Brain Tumor²²footnotemark: 2	Cityscapes²²footnotemark: 2	People in Painting²²footnotemark: 2
Faster R-CNN	R-50	✓	48.1	52.4	43.5	30.1	17.0
Cascade R-CNN	R-50	✓	50.8	55.3	46.2	31.8	18.0
ATSS	R-50	✓	48.2	55.7	-	-	-
TOOD	R-50	✓	50.8	57.4	-	-	-
DINO	R-50	✓	-	-	46.4	34.5	12.0
Cascade-DINO	R-50	✓	-	-	48.6	34.8	13.4
MM-GDINO	Swin-T	✗	49.8	29.8	0.4	34.2	23.1
MM-GDINO	Swin-T	✓	69.1	65.5	47.5	51.5	38.9

3.3 Validation through Fine-tuning

The default fine-tuning in this report is based on MM-G-T(c3) pre-trained model.

3.3.1 Fine-tuning on COCO/LVIS

Fine-tune on COCO. We implemented three mainstream fine-tuning approaches with MM-Grounding-DINO to thoroughly evaluate its capabilities: close-set fine-tuning, open-set continuing pretraining fine-tuning, and open-vocabulary fine-tuning. The latter two fine-tuning methods are designed to keep model’s generalizability while enhancing performance on the COCO dataset.

•

In close-set fine-tuning, we fine-tuned our model using a close-set algorithm, optimizing specifically for COCO dataset. Post fine-tuning, the textual input was restricted to COCO categories.
•

In the open-set continuing pretraining fine-tuning, we derived two distinct methods based on the same training strategy in pretraining phase. The first involves lowering the learning rate and freezing certain modules, then continue training on COCO dataset. The second method combines the COCO dataset with other pre-training datasets of MM-G-T(c3) for continue training.
•

For open-vocabulary fine-tuning, we categorized the dataset into base and novel categories. During fine-tuning, only base categories were utilized. Subsequently, we evaluated the model’s performance across both base and novel categories.

As shown in Table 10, MM-G-T significantly improved performance on the COCO dataset through both close-set fine-tuning and open-set continuing pretraining fine-tuning. Notably, MM-G-T achieved a 7.8 mAP increase after 12 epochs of close-set fine-tuning, reaching 58.2 mAP. For more results regarding open-vocabulary fine-tuning, please refer to Table 15 in Appendix A.4.

Fine-tune on LVIS. LVIS dataset, characterized by its long-tail distribution, encompasses 1203 categories. Given this extensive categorization, we exclusively employed open-set continue pretraining fine-tuning and open vocabulary fine-tuning for this dataset.

As illustrated in Table 9, the open-set continuing pretraining fine-tuning significantly enhanced MM-G-T’s performance. Notably, MM-G-T achieved a substantial increase of 9.0 mAP in the Apr metric for Mini LVIS.

3.3.2 Fine-tuning on REC

Fine-tune for RefCOCO/+/g. We further evaluate our model by fine-tuning on REC task as detailed in Table 5. Following MDETR [12], we adapted the fine-tuning phase to phrase grounding, consistent with the pre-training. The results, presented in Table 5, indicate a marked improvement in REC task performance after just 5 epochs of fine-tuning. This suggests that the current RefCOCO/+/g dataset, along with its evaluation metric, might be overly simplistic. Utilizing phrase grounding for fine-tuning on this task still leads to significant enhancements. We look forward to the emergence of a more robust and rigorous evaluation metric for further advancing REC task proficiency.

3.3.3 Fine-tuning on Downstream Tasks

To comprehensively show the generalizability of MM-Grounding-DINO, we extended its evaluation to various downstream tasks. In the fine-tuning settings, models are initially trained on expansive datasets and are then specifically trained using the training sets from the respective downstream tasks.

Object Detection in the Haze. Our study utilized the Real-world Task-driven Testing Set (RTTS), comprising 4,322 real-world hazy images predominantly featuring traffic and driving scenarios[14]. The RTTS dataset encompasses a variety of common categories within hazy conditions, offering an apt platform to access our model’s efficacy and generalizability in diverse environments. We adopted the same dehazing and detection joint pipeline proposed in the benchmark. Impressively, MM-Grounding-DINO reached 69.1 AP after 12 epochs of fine-tuning, surpassing previous standards by a large margin as shown in Table 11.

Object Detection Underwater. In this study, we evaluated the performance of MM-Grounding-DINO on the Real-world Underwater Object Detection dataset (RUOD)[7]. This dataset comprises 14,000 high-resolution images with 74,903 labeled instances. Characterized by its diverse range of categories, object scales, image scales, object densities, and category densities, the dataset also introduces a range of underwater challenges. These include haze-like effects, color casts, light interference, and complex marine objects. This evaluation leveraged the RUOD dataset to ascertain the capabilities of our model in a distinct image domain, while simultaneously engaging with a subset of commonly encountered objects.

Table 11 shows that in the zero-shot setting, the MM-Grounding-DINO achieved a mAP of 29.8, primarily due to distribution mismatches between the training dataset, which mainly consists of terrestrial images, and the RUOD. However, after 12 epochs of fine-tuning, the model showed an improvement of 35.7 mAP, thereby setting a new benchmark. This performance surpasses the previous state-of-the-art by 8.1 mAP.

Object Detection for Brain Tumor. We further extended our evaluation to the medical domain, utilizing the Brain tumor dataset[2]. Notably, this dataset is unique in its labeling approach, as it only utilizes numerical identifiers without providing descriptive label information. As detailed in Table 11, the performance of MM-Grounding-DINO underperformed Cascade-DINO[32]. We hypothesize that the relatively suboptimal results of our model could be attributed to the challenges posed by the dataset’s reliance on purely numerical labels, especially in scenarios where the textual context is completely unknown.

Object Detection for Cityscapes. Cityscapes[5] is an extensive urban street scene collection, comprising 3k training images and 500 validation images. It features a broad and varied set of stereo video sequences captured in the streets of 50 different cities, accompanied by high-quality, pixel-level annotations. This dataset assessed our model’s performance in recognizing common objects encountered in daily life. Notably, in Table 11, we can observe that our pretrained MM-Grounding-DINO already performed on par with fine-tuned models without the need for any dataset-specific training. After 50 epochs of fine-tuning, it had an improvement of 17.3 mAP, reaching new state-of-the-art.

Object Detection for People in Painting. People in Paintings [1] was originally created by Raya AI as a part of RF100, an initiative to establish a new object detection benchmark for model generalizability. The annotations in this dataset exclusively pertain to the figures depicted in paintings. As illustrated in Table 11, our MM-Grounding-DINO model has already outperformed the performance of fine-tuned models in a zero-shot setting. Following a fine-tuning of 50 epochs, it demonstrated a significant improvement, achieving an increase of +15.8AP, setting a new benchmark of 38.9 mAP.

4 Conclusion

In this paper, We propose MM-Grounding-DINO, a comprehensive and open-sourced grounding baseline based on Grounding-DINO and pretrained with abundant vision datasets, and comprehensively address OVD, PG, and REC tasks. We extend all available benchmarks for OVD, PG, and REC evaluation, and all evaluation metrics are readily available in MMDetection. The extensive experiments on the benchmarks mentioned demonstrate that our MM-Grounding-DINO outperforms (or is on par with) the Grounding-DINO baseline. We hope that our pipeline will serve as a valuable resource for further investigations in grounding and detection tasks.

References

100 [2023] Roboflow 100. people in paintings dataset. https://universe.roboflow.com/roboflow-100/people-in-paintings, 2023. visited on 2023-12-21.
AABBCCEEFFGG [2022] AABBCCEEFFGG. Brain tumor detection dataset. https://universe.roboflow.com/aabbcceeffgg/brain-tumor-detection-69d9s, 2022. visited on 2023-12-21.
Carion et al. [2020] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers, 2020.
Chen et al. [2019] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
Cordts et al. [2016] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding, 2016.
Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019.
Fu et al. [2023] Chenping Fu, Risheng Liu, Xin Fan, Puyang Chen, Hao Fu, Wanqi Yuan, Ming Zhu, and Zhongxuan Luo. Rethinking general underwater object detection: Datasets, challenges, and solutions. Neurocomputing, 517:243–256, 2023.
Gupta et al. [2019] Agrim Gupta, Piotr Dollár, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation, 2019.
He et al. [2023] Shuting He, Henghui Ding, Chang Liu, and Xudong Jiang. Grec: Generalized referring expression comprehension, 2023.
Honnibal et al. [2020] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength Natural Language Processing in Python. 2020.
Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
Kamath et al. [2021] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr – modulated detection for end-to-end multi-modal understanding, 2021.
Kazemzadeh et al. [2014] Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, Doha, Qatar, 2014. Association for Computational Linguistics.
Li et al. [2019] Boyi Li, Wenqi Ren, Dengpan Fu, Dacheng Tao, Dan Feng, Wenjun Zeng, and Zhangyang Wang. Benchmarking single image dehazing and beyond, 2019.
Li et al. [2022a] Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Houdong Hu, Zicheng Liu, Yong Jae Lee, and Jianfeng Gao. Elevater: A benchmark and toolkit for evaluating language-augmented visual models, 2022a.
Li et al. [2022b] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. Grounded language-image pre-training, 2022b.
Lin et al. [2015] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.
Lin et al. [2018] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection, 2018.
Liu et al. [2023a] Chang Liu, Henghui Ding, and Xudong Jiang. Gres: Generalized referring expression segmentation, 2023a.
Liu et al. [2023b] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding dino: Marrying dino with grounded pre-training for open-set object detection, 2023b.
Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows, 2021.
Mao et al. [2016] Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and comprehension of unambiguous object descriptions, 2016.
Peng et al. [2023] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world, 2023.
Plummer et al. [2017] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. IJCV, 123(1):74–93, 2017.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021.
Rezatofighi et al. [2019] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression, 2019.
Shao et al. [2019] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8429–8438, 2019.
Wang et al. [2023] Jiaqi Wang, Pan Zhang, Tao Chu, Yuhang Cao, Yujie Zhou, Tong Wu, Bin Wang, Conghui He, and Dahua Lin. V3det: Vast vocabulary visual detection dataset, 2023.
[29] Jianzong Wu, Xiangtai Li, Shilin Xu, Haobo Yuan, Henghui Ding, Yibo Yang, Jiangning Zhang, Yunhai Tong, Xudong Jiang, Bernard Ghanem, and Dacheng Tao. Towards open vocabulary learning: A survey.
Xie et al. [2023] Chi Xie, Zhao Zhang, Yixuan Wu, Feng Zhu, Rui Zhao, and Shuang Liang. Described object detection: Liberating object detection with flexible expressions, 2023.
Yang et al. [2023] Haosen Yang, Chuofan Ma, Bin Wen, Yi Jiang, Zehuan Yuan, and Xiatian Zhu. Recognize any regions, 2023.
Ye et al. [2023] Mingqiao Ye, Lei Ke, Siyuan Li, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, and Fisher Yu. Cascade-detr: Delving into high-quality universal object detection, 2023.
Yu et al. [2016] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling context in referring expressions, 2016.
Zhang et al. [2022] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection, 2022.
Zuwei Long [2023] Wei Li Zuwei Long. Open grounding dino:the third party implementation of the paper grounding dino. https://github.com/longzw1997/Open-GroundingDino, 2023.

Supplementary Material

Appendix A More Results

A.1 Detailed Results on gRefCOCO

In our experiments, we initially set the default threshold to 0.7 following [19]. Then we conducted an extensive series of tests with varying threshold values. The impact of these different thresholds on our results is detailed in Table12. We observed distinct effects of the threshold adjustments on the outputs. Specifically, a threshold of 0.8 yielded the highest F1 score for the validation set. In contrast, for both test sets A and B, a lower threshold of 0.5 proved more effective. This leads to an anticipation for the development of a more robust evaluation metric for this dataset. It’s noteworthy that following the fine-tuning process (threshold is set to 0.7), all subsets of gRefCOCO demonstrated significant improvements.

A.2 Detailed Results on Flickr30K Entities

As shown in Table13, MM-G-T(c) exhibits lower performance compared to G-DINO-T on Flickr30K Entities. Given that the GoldG dataset includes images from Flickr30K Entities, it’s important to note that these results do not represent a zero-shot scenario. The observed performance differences could be attributed to variations in training strategies and settings.

A.3 Detailed Results on ODinW datasets

We provide the details of the 35 datasets we use in Table14. Considering the rarity of categories within the ODinW13/35 dataset, the additional concepts brought by GRIT and V3Det datasets prove to be beneficial.

A.4 Open-Vocabulary Fine-tuning on COCO

As elaborated in section3.3.1, the results in Table15 show that despite fine-tuning solely on base categories, there is an observable +1.5 mAP enhancement in the novel categories. This finding demonstrates the effectiveness of open-vocabulary fine-tuning in preserving the model’s generalizability.

Table 12: Results on gRefCOCO benchmark with different threshold values.

Method	Threshold	Setting	gRefCOCO
			Pr(F1=1, IoU $\geqslant$ 0.5)
			val	testA	testB
G-DINO-T(c)	0.5	zero-shot	39.3	31.9	30.0
G-DINO-T(c)	0.6	zero-shot	40.5	29.3	30.0
G-DINO-T(c)	0.7	zero-shot	41.3	27.2	29.7
G-DINO-T(c)	0.8	zero-shot	41.5	25.1	29.1
MM-G-T(c3)	0.5	zero-shot	39.4	33.1	33.0
MM-G-T(c3)	0.6	zero-shot	40.6	29.2	31.6
MM-G-T(c3)	0.7	zero-shot	41.0	26.1	30.4
MM-G-T(c3)	0.8	zero-shot	41.1	23.8	29.5
MM-G-T(c3)	0.7	fine-tune	45.1(+4.1)	42.5(+16.4)	40.3(+9.9)

Appendix B Visualization

B.1 Visualization on Pre-training Dataset

In Figure 4, we present visualizations of the pretraining datasets. Our analysis of these datasets revealed several noisy elements that could potentially undermine the training effectiveness. For example, some captions contain function words without substantive content, such as ’Who’ in the top-left image and proper nouns like people’s names in the bottom-left image. Additionally, the GRIT dataset, which utilizes GLIP for generating pseudo labels, may have inaccuracies in annotations. This is evident in the bottom-right image, where the box annotations appear to be incorrectly assigned. There are similar situations in GQA dataset. In the top-right image, the same phrase ’a woman’ in a caption is assigned to different boxes, which contradicts the phrase grounding settings.

Table 13: Results on Flickr30K

Model	Pre-Train Data	Val R@1	Val R@5	Val R@10	Test R@1	Test R@5	Test R@10
GLIP-T	O365,GoldG	84.9	94.9	96.3	85.6	95.4	96.7
GLIP-T	O365,GoldG,CC3M,SBU	85.3	95.5	96.9	86.0	95.9	97.2
G-DINO-T(c)	O365,GoldG,Cap4M	87.8	96.6	98.0	88.1	96.9	98.2
MM-G-T(b)	O365,GoldG	85.5	95.6	97.2	86.2	95.7	97.4
MM-G-T(c1)	O365,GoldG,GRIT	86.7	95.8	97.6	87.0	96.2	97.7
MM-G-T(c2)	O365,GoldG,GRIT	86.7	95.8	97.6	87.0	96.2	97.7
MM-G-T(c3)	O365,GoldG,GRIT,V3Det	86.7	96.0	97.6	87.2	96.2	97.7

Table 14: Zero-shot Results on ODinW35

Method	OdinW13	ODinW35	G-DINO-T(c)	MM-G-T(b)	MM-G-T(c1)	MM-G-T(c2)	MM-G-T(c3)
AerialMaritimeDrone_large	✓	✓	0.173	0.133	0.155	0.177	0.151
AerialMaritimeDrone_tiled		✓	0.206	0.170	0.225	0.184	0.206
AmericanSignLanguageLetters		✓	0.002	0.016	0.020	0.011	0.007
Aquarium	✓	✓	0.195	0.252	0.261	0.266	0.283
BCCD		✓	0.161	0.069	0.118	0.083	0.077
boggleBoards		✓	0.000	0.002	0.001	0.001	0.002
brackishUnderwater		✓	0.021	0.033	0.021	0.025	0.025
ChessPieces		✓	0.000	0.000	0.000	0.000	0.000
CottontailRabbits	✓	✓	0.806	0.771	0.810	0.778	0.786
dice		✓	0.004	0.002	0.005	0.001	0.001
DroneControl		✓	0.042	0.047	0.097	0.088	0.074
EgoHands_generic		✓	0.608	0.527	0.537	0.506	0.519
EgoHands_specific		✓	0.002	0.001	0.005	0.007	0.003
EgoHands	✓		0.608	0.499	0.537	0.506	0.519
HardHatWorkers		✓	0.046	0.048	0.070	0.070	0.108
MaskWearing		✓	0.004	0.009	0.004	0.011	0.009
MountainDewCommercial		✓	0.430	0.453	0.465	0.194	0.430
NorthAmericaMushrooms	✓	✓	0.471	0.331	0.462	0.669	0.767
openPoetryVision		✓	0.000	0.001	0.000	0.000	0.000
OxfordPets_by_breed		✓	0.003	0.002	0.004	0.006	0.004
OxfordPets_by_species		✓	0.011	0.019	0.016	0.020	0.015
PKLot		✓	0.001	0.004	0.002	0.008	0.007
Packages	✓	✓	0.695	0.707	0.687	0.710	0.706
PascalVOC	✓	✓	0.563	0.565	0.580	0.566	0.566
pistols	✓	✓	0.726	0.585	0.709	0.671	0.729
plantdoc		✓	0.005	0.005	0.007	0.008	0.011
pothole	✓	✓	0.215	0.136	0.219	0.077	0.168
Raccoons	✓	✓	0.549	0.469	0.511	0.553	0.535
selfdrivingCar		✓	0.089	0.091	0.076	0.094	0.083
ShellfishOpenImages	✓	✓	0.393	0.321	0.437	0.519	0.488
ThermalCheetah		✓	0.087	0.063	0.081	0.030	0.045
thermalDogsAndPeople	✓	✓	0.657	0.556	0.603	0.493	0.543
UnoCards		✓	0.006	0.012	0.010	0.009	0.005
VehiclesOpenImages	✓	✓	0.613	0.566	0.603	0.614	0.615
WildfireSmoke		✓	0.134	0.106	0.154	0.042	0.127
websiteScreenshots		✓	0.012	0.02	0.016	0.016	0.016
ODinW13 Average			0.514	0.453	0.511	0.516	0.533
ODinW35 Average			0.227	0.202	0.228	0.214	0.284

Table 15: Open vocabulary fine-tune results on COCO benchmark.

Model	Backbone	Setting	mAP			AP@50
			box	Base	Novel	box	Base	Novel
MM-G-T(c3)	Swin-T	zero-shot	51.1	48.4	58.9	66.7	64.0	74.2
MM-G-T(c3)	Swin-T	open vocabulary 1x	57.2(+8.8)	56.1(+7.7)	60.4(+1.5)	73.6(+6.9)	73.0(+9.0)	75.3(+1.1)

B.2 Visualization on Model Predictions

For both Figure 6 and Figure 5, the ground-truth annotations are depicted on the left, while the predictions made by our model are presented on the right.

Limitations of Evaluation. Our visualization-based analysis of the evaluation process has revealed inaccuracies in ground-truth annotations of the evaluation dataset. This is evident in Figure 5, concerning the ’girl’ object, our model’s prediction appears to be more precise compared to the existing annotations.

Limitations of Model. During the pretraining phase, although the model has access to the entire caption, it tends to prioritize nouns, which are crucial for phrase grounding settings. For instance, in the caption ’horseman without helmet’ depicted in Figure 5(a), the model primarily focuses on ’horseman’ and ’helmet’, yet the key relational term ‘without’ was disregarded. This leads to an incapacity to differentiate between ’helmet’ and ’without helmet’. Additionally, the model struggles with interpreting certain detailed descriptions, such as in Figure 5(b), the model incorrectly detected ’railings being crossed by horse’. In terms of position description in the caption, the model only achieves suboptimal performance as shown in Figure 5(c), which confused the object on the left with the object on the right. In Figure 5, our model additionally predicts ’frisbee’ due to the phrase grounding settings, which leads to a lower performance in evaluation.