¹ DAMO Academy, Alibaba Group
² University of Rochester
³ Hupan Lab, 310023, Hangzhou, China

CT-GLIP: 3D Grounded Language-Image Pretraining with CT Scans and Radiology Reports for Full-Body Scenarios

Jingyang Lin² (✉)*, Yingda Xia¹ (✉), Jianpeng Zhang¹˒³, Ke Yan¹˒³, Le Lu¹, Jiebo Luo², Ling Zhang¹

* Work was done during an internship at Alibaba DAMO Academy.
(✉) Corresponding authors: jlin81@ur.rochester.edu, yingda.xia@alibaba-inc.com.

Abstract

Medical Vision-Language Pretraining (Med-VLP) establishes a connection between visual content from medical images and the relevant textual descriptions. Existing Med-VLP methods primarily focus on 2D images depicting a single body part, notably chest X-rays. In this paper, we extend the scope of Med-VLP to encompass 3D images, specifically targeting full-body scenarios, by using a multimodal dataset of CT images and reports. Compared with its 2D counterpart, 3D VLP must capture essential semantics from the significantly sparser representation of 3D imaging. We introduce CT-GLIP (Grounded Language-Image Pretraining with CT scans), a novel method that constructs organ-level image-text pairs to enhance multimodal contrastive learning, aligning grounded visual features with precise diagnostic text. Additionally, we develop an abnormality dictionary to augment contrastive learning with diverse contrastive pairs. Our method, trained on a multimodal CT dataset comprising 44,011 organ-level vision-text pairs from 17,702 patients across 104 organs, can identify organs and abnormalities in a zero-shot manner using natural language. The performance of CT-GLIP is validated on a separate test set of 1,130 patients, focusing on the 16 most frequent abnormalities across 7 organs. The experimental results show our model’s superior performance over the standard CLIP framework across zero-shot and fine-tuning scenarios, using both CNN and ViT architectures.

Keywords: Medical Vision-Language Pretraining · CT · Grounded Contrastive Learning

1 Introduction

Vision-language pretraining (VLP) has become a fundamental training paradigm in vision-language (VL) research, enabling VL frameworks to learn universal and transferable vision-language representations in a weakly-supervised manner [5, 9, 17]. Recent attempts in Medical VLP (Med-VLP) [4, 7, 12, 14, 23, 25, 27, 28] have demonstrated the effectiveness of the VLP paradigm in medical imaging. By leveraging large-scale medical image-report paired data derived from routine clinical practice, these methods decrease the dependency on expensive annotated data and alleviate the workload on radiologists. However, previous works mainly focus on 2D medical images of a single body part (e.g., the chest) due to data scarcity. This limitation restricts the application of Med-VLP in broader medical contexts, particularly for 3D medical images, which are not only more complex but also constitute the primary workload in radiology departments and offer a richer, more detailed view of patient anatomy and abnormalities.

Figure 1(a): Radiology report preprocessing.

To this end, our goal is to expand Med-VLP to 3D images covering the full body. Compared with 2D VLP, the sparse representations of 3D images complicate the alignment of textual descriptions with the corresponding visual concepts. Therefore, we seek to design an efficient 3D Med-VLP training paradigm for full-body 3D imaging.

In this paper, we present an innovative method, named CT-GLIP (Grounded Language-Image Pretraining with CT scans), that reorganizes grounded (i.e., organ-level) vision-text pairs for multimodal contrastive learning, reducing the complexity of 3D vision-text alignment on both the visual and the textual side. For the 3D images, we use TotalSegmentator [24] to generate segmentation masks that locate 104 organs. For the radiology reports, we use LLaMA-2 [20], followed by manual checking, to break down the original report into several diagnostic descriptions, one for each organ, as shown in Fig. 1a. The simplified grounded visual and textual components make it possible to efficiently associate organ-level visual concepts with concise diagnostic descriptions. Technically, CT-GLIP consists of two objectives, for organ-text and abnormality-text alignment, respectively. Organ-text alignment aims to understand basic medical visual concepts. Meanwhile, abnormality-text alignment associates the abnormal visual components with the corresponding text descriptions, facilitating zero-shot abnormality detection as shown in Fig. 1b. Furthermore, contrastive learning benefits from a larger number of diverse contrastive pairs [3, 6, 16, 18, 21, 22], but large-scale 3D models can only be trained with small mini-batch sizes. To mitigate this limitation, we have developed an abnormality dictionary containing a variety of abnormal text descriptions. This dictionary substantially increases the availability of diverse negative pairs, thereby improving the effectiveness of contrastive learning.

Our study curates a multimodal CT image-report dataset with 44,011 pairs from 17,702 patients covering 104 organs, and develops validation and test datasets with 643 and 1,130 patients, respectively, targeting 16 common abnormalities across 7 organs. In Section 3, the proposed CT-GLIP outperforms whole image-report alignment [17] in 3D imaging. It achieves notable zero-shot performance for organ classification and abnormality detection, and enhances tumor segmentation and detection for both CNN and ViT-based models.

The main contributions of CT-GLIP are summarized as follows: (1) we propose a novel mechanism to reorganize grounded vision-text pairs for efficient 3D Med-VLP; (2) we build an abnormality dictionary that scales up the number and diversity of contrastive pairs; (3) empirical results in both zero-shot and fine-tuning settings demonstrate the superiority of the grounded image-report alignment mechanism over whole image-report alignment in 3D imaging scenarios.

2 Related Work

General VLP. Vision-language pretraining aims to develop multimodal foundational models that enhance performance across a variety of tasks involving both vision and language. CLIP [17] and ALIGN [9] represent significant milestones in the VLP field. These models have highlighted the critical role of language supervision in enhancing both computer vision and natural language processing tasks. BLIP [10] strives to unify vision-language understanding and generation during pretraining. GLIP [11] seeks to learn object-level, language-aware, and semantically rich visual representations through local vision-language alignment.

Medical VLP. Medical image-report paired data, derived from routine clinical practice, promotes the development of Med-VLP. ConVIRT [28] adapts the CLIP methodology to medical imaging, enabling the matching of chest X-ray images with their corresponding reports. Building upon this, MedCLIP [23] refines the approach by processing images and texts separately, which effectively scales the available training data at a low cost. CheXzero [19] further advances the field by creating a system capable of detecting pathologies in a zero-shot fashion. MedKLIP [25] incorporates additional medical knowledge to enhance the joint analysis of images and language. LoVT [15] and GLoRIA [7] introduce localized vision-language alignment, sharing similar motivations with our work. However, our research distinguishes itself by focusing on 3D medical imaging, whose significantly sparser representation poses additional challenges. In addition, we emphasize the critical importance of not only grounded visual representations but also high-quality, organ-level text descriptions.

2.0.1 Methodology

In this section, we delve into the design of our proposed pretraining mechanism for organ-level vision-language pretraining. Our pretraining methodology consists of organ-text alignment and abnormality-text alignment. These alignments serve as the foundation for our overall objective and enable the tasks of zero-shot organ classification and zero-shot abnormality detection.

Problem Formulation. The basic motivation for the multimodal contrastive loss is to learn visual concepts from text by training an image encoder $f$ and a text encoder $g$ to maximize the similarity between corresponding image and text feature embeddings while minimizing it for non-corresponding pairs. For a batch of $N$ image-text pairs $(V_{n}, T_{n})$, we first obtain the $i$-th normalized visual feature $\boldsymbol{v}_{i}=f(V_{i})$ and the $i$-th normalized textual feature $\boldsymbol{t}_{i}=g(T_{i})$. The loss for a single pair is then:

$$\mathcal{L}_{i}=-\log\frac{\exp(\boldsymbol{v}_{i}^{T}\boldsymbol{t}_{i}/\tau)}{\sum^{N}_{k=1}\exp(\boldsymbol{v}_{i}^{T}\boldsymbol{t}_{k}/\tau)}-\log\frac{\exp(\boldsymbol{v}_{i}^{T}\boldsymbol{t}_{i}/\tau)}{\sum^{N}_{k=1}\exp(\boldsymbol{v}_{k}^{T}\boldsymbol{t}_{i}/\tau)}, \tag{1}$$

where $\tau$ is a temperature parameter. The total loss is $\mathcal{L}=\frac{1}{N}\sum_{i=1}^{N}\mathcal{L}_{i}$ .
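The symmetric loss above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, and it assumes the temperature divides the similarity inside the exponential, as in the standard CLIP formulation.

```python
import numpy as np


def log_softmax(logits: np.ndarray, axis: int) -> np.ndarray:
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(v: np.ndarray, t: np.ndarray, tau: float = 0.07) -> float:
    """Batch mean of the per-pair image-text contrastive loss.

    v, t: (N, D) arrays of L2-normalized visual and textual features,
    where row i of v is paired with row i of t.
    """
    logits = v @ t.T / tau                       # logits[i, k] = v_i^T t_k / tau
    diag = np.arange(len(v))
    # Image-to-text term: normalize over all texts k for a fixed image i.
    image_to_text = -log_softmax(logits, axis=1)[diag, diag]
    # Text-to-image term: normalize over all images k for a fixed text i.
    text_to_image = -log_softmax(logits, axis=0)[diag, diag]
    return float(np.mean(image_to_text + text_to_image))
```

With perfectly aligned pairs the diagonal logits dominate and the loss approaches zero; mismatched pairs are penalized in both directions.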

Organ-Text Alignment. The motivation for organ-text alignment is to learn visual concepts from the supervision contained in the expert language model, which enables our model to understand basic medical visual concepts. Following previous work [23], we adopt ClinicalBERT [1] as an expert text encoder to compute the embeddings of text descriptions. To achieve organ-text alignment, we obtain organ-level visual embeddings and the corresponding textual embeddings. Specifically, given a CT image $V_{i}$, the vision encoder projects it into the representation space and produces a feature map $\boldsymbol{v}_{i}$. Based on the multi-organ segmentation pseudo-label, we apply organ-level average pooling over each segmented organ mask to obtain a set of organ-level features $\{\boldsymbol{z}_{ij}\}^{M}_{j=1}$, where $M$ is the number of organs in the given CT image. For each organ, we generate its textual description $T_{ij}$ by inserting the organ name into a predefined template, such as “this is a {organ} in the CT scan”. We then feed the organ descriptions into the expert text encoder to produce organ-level text embeddings $\{\boldsymbol{t}_{ij}\}^{M}_{j=1}$. Our training objective $\mathcal{L}_{\text{OT}}$ then aligns the organ and text features as follows:

$${\mathcal{L}_{\text{OT}}}_{i}=\frac{1}{M}\sum^{M}_{j=1}\left(-\log\frac{\exp(\boldsymbol{z}_{ij}^{T}\boldsymbol{t}_{ij}/\tau)}{\sum^{M}_{k=1}\exp(\boldsymbol{z}_{ij}^{T}\boldsymbol{t}_{ik}/\tau)}-\log\frac{\exp(\boldsymbol{z}_{ij}^{T}\boldsymbol{t}_{ij}/\tau)}{\sum^{M}_{k=1}\exp(\boldsymbol{z}_{ik}^{T}\boldsymbol{t}_{ij}/\tau)}\right), \tag{2}$$

where the temperature parameter $\tau$ is set as 0.07. Furthermore, to enhance the utilization of the given pseudo-segmentation label $\tilde{y}$ , we introduce an additional segmentation head to predict organs at the pixel level. The segmentation objective $\mathcal{L}_{\text{segm}}$ is a mixture of cross-entropy loss and dice loss.
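The organ-level average pooling step can be illustrated as follows. This is a minimal sketch under stated assumptions, not the released code: the function name is hypothetical, the segmentation mask is assumed to have the same spatial resolution as the feature map, label 0 is assumed to be background, at least one organ is assumed to be present, and the final L2 normalization is an assumption made so that dot products act as cosine similarities.

```python
import numpy as np


def organ_average_pooling(feature_map: np.ndarray, organ_mask: np.ndarray):
    """Average the dense features over the voxels of each segmented organ.

    feature_map: (C, D, H, W) visual features from the image encoder.
    organ_mask:  (D, H, W) integer pseudo-labels; 0 is background, j > 0 is organ j.
    Returns (organ_ids, z), where z has shape (M, C) with L2-normalized rows.
    """
    channels = feature_map.shape[0]
    flat_features = feature_map.reshape(channels, -1)    # (C, D*H*W)
    flat_mask = organ_mask.reshape(-1)                    # (D*H*W,)
    organ_ids = np.unique(flat_mask)
    organ_ids = organ_ids[organ_ids != 0]                 # organs present in this scan
    # One pooled feature vector per organ: mean over that organ's voxels.
    z = np.stack([flat_features[:, flat_mask == j].mean(axis=1) for j in organ_ids])
    z = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    return organ_ids, z
```

The resulting rows of `z` play the role of the organ-level features paired with the organ descriptions in Eq. (2).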

Abnormality-Text Alignment. The goal of abnormality-text alignment is to integrate knowledge of abnormalities into the multimodal model. The training procedure of abnormality-text alignment is illustrated in Figure 2. As in organ-text alignment, we first extract organ-level visual feature embeddings $\{\boldsymbol{z}_{ij}\}^{M}_{j=1}$ from the given CT image $V_{i}$. Unlike in organ-text alignment, we organize $M$ diagnostic descriptions, consisting of $M^{\prime}$ real organ-level diagnostic descriptions for abnormal organs and $M-M^{\prime}$ descriptions generated from a predefined template (e.g., “no evident abnormality in {organ}”) for normal organs. Furthermore, to scale up the number of negative pairs for abnormality-text alignment [6, 16], we introduce an abnormality dictionary storing diverse text descriptions of abnormalities for 104 organs. In particular, for each normal organ, we look up $T$ abnormal descriptions from the abnormality dictionary, which yields $B=(M-M^{\prime})\times T$ abnormal descriptions in total. These $B$ abnormal descriptions provide additional negative pairs that help multimodal contrastive learning distinguish among diseases. All $M+B$ text descriptions are then fed into the expert language model, producing the text embeddings $\{\boldsymbol{t}_{ij}\}_{j=1}^{M+B}$. Given the organ-level paired vision and text embeddings, the training objective of abnormality-text alignment is:

$${\mathcal{L}_{\text{AT}}}_{i}=\frac{1}{M}\sum^{M}_{j=1}\left(-\log\frac{\exp(\boldsymbol{z}_{ij}^{T}\boldsymbol{t}_{ij}/\tau)}{\sum^{M+B}_{k=1}\exp(\boldsymbol{z}_{ij}^{T}\boldsymbol{t}_{ik}/\tau)}-\log\frac{\exp(\boldsymbol{z}_{ij}^{T}\boldsymbol{t}_{ij}/\tau)}{\sum^{M}_{k=1}\exp(\boldsymbol{z}_{ik}^{T}\boldsymbol{t}_{ij}/\tau)}\right). \tag{3}$$
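To make the role of the abnormality dictionary concrete, the following NumPy sketch computes the abnormality-text loss for a single scan. It is an illustration under assumptions, not the authors' code: the function name and argument layout are hypothetical, embeddings are assumed to be L2-normalized, and the temperature is assumed to scale the similarity inside the exponential. The $B$ dictionary entries enlarge only the vision-to-text denominator, as in Eq. (3).

```python
import numpy as np


def abnormality_text_loss(z, t_paired, t_dict, tau=0.07):
    """Abnormality-text contrastive loss for one CT scan.

    z:        (M, D) organ-level visual embeddings.
    t_paired: (M, D) text embeddings; row j is the description of organ j.
    t_dict:   (B, D) abnormal descriptions drawn from the dictionary,
              used only as extra negatives in the vision-to-text direction.
    """
    m = len(z)
    diag = np.arange(m)
    t_all = np.concatenate([t_paired, t_dict], axis=0)       # (M + B, D)

    # Vision-to-text: each organ is contrasted against all M + B descriptions.
    v2t = z @ t_all.T / tau                                   # (M, M + B)
    v2t = v2t - v2t.max(axis=1, keepdims=True)                # stability shift
    v2t_loss = np.log(np.exp(v2t).sum(axis=1)) - v2t[diag, diag]

    # Text-to-vision: each paired description is contrasted against the M organs.
    t2v = z @ t_paired.T / tau                                # t2v[k, j] = z_k^T t_j / tau
    t2v = t2v - t2v.max(axis=0, keepdims=True)
    t2v_loss = np.log(np.exp(t2v).sum(axis=0)) - t2v[diag, diag]

    return float(np.mean(v2t_loss + t2v_loss))
```

Because every dictionary entry adds a positive term to the vision-to-text denominator, supplying more negatives can only make the objective harder, which is the intended effect when mini-batches are small.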

Overall objective. The overall objective of our organ-level vision-language alignment combines the organ-text contrastive loss $\mathcal{L}_{\text{OT}}$, the abnormality-text contrastive loss $\mathcal{L}_{\text{AT}}$, and the auxiliary segmentation loss $\mathcal{L}_{\text{segm}}$ (cross-entropy and Dice losses supervised by pseudo organ segmentation masks):

$$\mathcal{L}=\lambda_{1}\mathcal{L}_{\text{OT}}+\lambda_{2}\mathcal{L}_{\text{AT}}+\lambda_{3}\mathcal{L}_{\text{segm}}, \tag{4}$$

where the weights $\lambda_{1}$ , $\lambda_{2}$ , and $\lambda_{3}$ are set to 0.5, 0.5, and 1.0, respectively.

3 Experiments

Dataset for pretraining. For the proposed CT-GLIP, we collect a multimodal dataset of CT images and reports containing 17,702 consecutive patients with 44,011 organ-level vision-text pairs.

Pretraining details. For the vision part, we employ representative CNN-based and ViT-based vision encoders, namely nnUNet [8] and MiT [26]. To preserve low-level semantics, we feed the feature map with the highest resolution into the organ-level average pooling. On top of the organ-level average pooling, we add a two-layer MLP (768-$d$ hidden layer with ReLU). For the language part, we adopt BioClinicalBERT [1] as an expert text encoder. We keep the expert text encoder frozen [13] to avoid catastrophic forgetting caused by CT-specific domain data. The batch size is 8, distributed over 4 V100 GPUs. We train CT-GLIP for 20 epochs, at which point the training loss has converged. We adopt the Adam optimizer with a weight decay of 3e-5, an initial learning rate of 1e-3, and a final learning rate of 1e-6, following a cosine decay schedule.
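For reference, a cosine decay between the stated initial and final learning rates can be written as below. This is a sketch only: the text does not say whether the schedule is updated per step or per epoch or whether warm-up is used, so the step-based form and the function name are assumptions.

```python
import math


def cosine_decay_lr(step: int, total_steps: int,
                    lr_init: float = 1e-3, lr_final: float = 1e-6) -> float:
    """Learning rate that decays from lr_init to lr_final along a half cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)   # clamp to [0, 1]
    return lr_final + 0.5 * (lr_init - lr_final) * (1.0 + math.cos(math.pi * progress))
```

The rate starts at `lr_init`, passes through the midpoint of the two rates halfway through training, and ends at `lr_final`.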

3.1 Zero-shot Evaluation

Dataset for zero-shot evaluation. To assess the zero-shot capabilities, we further build additional datasets, including 643 patients for validation and 1,130 patients for testing, specifically focusing on the 16 most frequent abnormalities. Please refer to the supplementary material for more details about the evaluation datasets and zero-shot inference for organ-text and abnormality-text alignment.

Baseline model. We adopt vanilla CLIP [17] as our baseline model, which employs standard image-level contrastive pairs.

Impact of abnormality-text (AT) alignment. In zero-shot abnormality detection, AT alignment greatly improves performance over vanilla CLIP for both architectures (Table 1). Vanilla CLIP reaches an AUC of only about 52%, barely above chance, which indicates that image-level contrastive learning can hardly learn useful information from 3D CT-report pairs. In contrast, CT-GLIP facilitates effective VL alignment over the sparse representations of 3D images. With the nnUNet backbone, the full CT-GLIP improves the F1 score by 15.1 and the AUC by 16.4 percentage points over vanilla CLIP. The gains with the MiT backbone are even more pronounced: 15.6 points in F1 score and 19.5 points in AUC.

Impact of organ-text (OT) alignment. Table 1 shows that OT alignment equips our model with a strong capability for zero-shot organ classification: top-1 accuracy reaches 86.9% and 85.5% for nnUNet and MiT, respectively, whereas the variants without OT alignment stay near 0%. Moreover, OT alignment further improves zero-shot abnormality detection, suggesting that basic visual concept understanding serves as a foundation for better abnormality detection. According to Table 1, adding OT alignment raises the F1 score by 1.58 (nnUNet) and 1.23 (MiT) points and the AUC by 0.76 and 0.85 points, respectively.

Impact of abnormality dictionary. The abnormality dictionary aims to provide more diverse negative samples, since contrastive learning benefits from a large number of negatives [16]. We set the size of the abnormality dictionary to 512, as larger sizes bring no further performance gain. On top of AT and OT alignment, the abnormality dictionary improves zero-shot abnormality detection for both nnUNet and MiT.

Table 1: The performance of zero-shot organ classification and pathology detection. OT align indicates organ-text alignment, AT align refers to abnormality-text alignment and A-Dict denotes abnormality dictionary. Top-1 accuracy, PPV (Positive Predictive Value), Sensitivity, F1 score, and AUC are shown in %.

| Backbone | Method | Organ Classification: Top-1 Acc $\uparrow$ | Abnormality Detection: PPV $\uparrow$ | Sensitivity $\uparrow$ | F1 $\uparrow$ | AUC $\uparrow$ |
|---|---|---|---|---|---|---|
| nnUNet (CNN-based) | Vanilla CLIP [17] | 0.00 | 32.75 | 35.19 | 33.93 | 52.23 |
| nnUNet (CNN-based) | CT-GLIP: +AT align | 0.03 | 35.24 | 70.66 | 47.02 | 66.00 |
| nnUNet (CNN-based) | CT-GLIP: +AT align +OT align | 86.92 | 39.07 | 64.11 | 48.60 | 66.76 |
| nnUNet (CNN-based) | CT-GLIP: +AT align +OT align +A-Dict | 86.24 | 39.24 | 72.85 | 49.02 | 68.63 |
| MiT (ViT-based) | Vanilla CLIP [17] | 0.00 | 34.01 | 40.43 | 36.94 | 52.37 |
| MiT (ViT-based) | CT-GLIP: +AT align | 0.07 | 37.65 | 74.24 | 49.96 | 69.27 |
| MiT (ViT-based) | CT-GLIP: +AT align +OT align | 85.46 | 38.24 | 77.43 | 51.19 | 70.12 |
| MiT (ViT-based) | CT-GLIP: +AT align +OT align +A-Dict | 84.93 | 39.47 | 78.59 | 52.55 | 71.90 |

3.2 Fine-Tuning Evaluation on Cancer Screening

Dataset and evaluation. To evaluate the proposed CT-GLIP in a downstream fine-tuning setting, we prepare an in-house dataset of 700 non-contrast CT scans from 700 patients, targeting seven of the most prevalent types of cancer (lung, breast, liver, esophagus, stomach, colorectum, and pancreas), with 100 patients for each type. This dataset is designed to validate the adaptability and performance of our pretrained model in the segmentation and detection of these cancer types on non-contrast CT scans, which is an emerging and challenging clinical task [2]. Seven board-certified radiologists manually annotated the pixel-level masks of the tumors, all of which were confirmed by histopathology. We randomly split the dataset into 448, 112, and 140 cases for the training, validation, and test sets, respectively. Tumor segmentation is evaluated by the Dice score. For the patient-level detection of each tumor type (the presence or absence of the tumor), we use the predicted 3D volume of the respective tumor as the score for computing the AUC [29].
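The two evaluation metrics can be sketched as follows. This is an illustrative sketch, not the evaluation code used for the reported numbers: function names are hypothetical, masks are assumed to be binary arrays, and the AUC is computed through its pairwise-ranking (Mann-Whitney) form with the predicted tumor volume as the patient-level score.

```python
import numpy as np


def dice_score(pred, target, eps: float = 1e-8) -> float:
    """Dice overlap between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))


def volume_auc(pred_volumes, has_tumor) -> float:
    """Patient-level AUC using the predicted tumor volume as the score.

    Equals the fraction of (positive, negative) patient pairs in which the
    positive patient has the larger predicted volume; ties count as one half.
    Assumes at least one positive and one negative patient.
    """
    scores = np.asarray(pred_volumes, dtype=float)
    labels = np.asarray(has_tumor, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))
```

A model that segments nothing for a tumor type assigns volume zero to every patient, so all pairs tie and the AUC is 0.5.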

Fine-tuning strategy. We employ the same two backbone architectures, i.e., nnUNet [8] and MiT [26]. For the nnUNet backbone, we use the original training schedule and the self-configured architecture, with our pretrained model as the only change (initialization). The batch size is 8, and we train all experiments for 125k iterations. For the MiT backbone, we add a UNet-style decoder for the segmentation task, freeze the MiT encoder for the initial 25k iterations, and then tune the whole encoder-decoder network for another 100k iterations. The optimizer for fine-tuning MiT is RAdam with an initial learning rate of 0.001 and a polynomial learning rate decay.
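The MiT fine-tuning schedule can be summarized in code as below. This is a sketch under assumptions: the exponent of the polynomial decay is not given in the text (0.9 is used here as a common default), the decay is assumed to span all 125k iterations, and the helper names are hypothetical.

```python
def poly_lr(iteration: int, total_iters: int = 125_000,
            lr_init: float = 1e-3, power: float = 0.9) -> float:
    """Polynomial learning-rate decay from lr_init at iteration 0 to 0 at the end."""
    return lr_init * (1.0 - iteration / total_iters) ** power


def encoder_is_trainable(iteration: int, freeze_iters: int = 25_000) -> bool:
    """The encoder stays frozen for the first freeze_iters iterations."""
    return iteration >= freeze_iters
```

In a training loop, `encoder_is_trainable(it)` would gate whether encoder parameters receive gradient updates, while `poly_lr(it)` sets the optimizer's learning rate.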

Table 2: The performance of downstream fine-tuning on the task of cancer screening of pancreas (Pan), breast (Bre), stomach (Sto), colorectum (Col), lung (Lung), esophagus (Eso), and liver (Liv). Tumor segmentation is evaluated via DSC (%), and the performance of cancer screening is evaluated via AUC (%).

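| Metric | Backbone | Method | Pan | Bre | Sto | Col | Lung | Eso | Liv | Mean |
|---|---|---|---|---|---|---|---|---|---|---|
| DSC $\uparrow$ | nnUNet (CNN-based) | Scratch | 38.77 | 13.20 | 20.36 | 24.18 | 40.45 | 52.94 | 19.25 | 29.88 |
| DSC $\uparrow$ | nnUNet (CNN-based) | Vanilla CLIP [17] | 50.58 | 19.28 | 19.98 | 24.31 | 47.15 | 56.45 | 16.43 | 33.45 |
| DSC $\uparrow$ | nnUNet (CNN-based) | CT-GLIP (ours) | 52.42 | 20.59 | 23.13 | 26.16 | 48.89 | 53.25 | 18.44 | 34.70 |
| DSC $\uparrow$ | MiT (ViT-based) | Scratch | 35.18 | 19.13 | 0.00 | 18.56 | 11.12 | 40.32 | 34.41 | 22.68 |
| DSC $\uparrow$ | MiT (ViT-based) | Vanilla CLIP [17] | 36.02 | 19.96 | 17.47 | 28.78 | 33.39 | 50.19 | 28.75 | 30.65 |
| DSC $\uparrow$ | MiT (ViT-based) | CT-GLIP (ours) | 39.85 | 22.84 | 27.23 | 34.15 | 42.32 | 46.60 | 37.39 | 35.77 |
| AUC $\uparrow$ | nnUNet (CNN-based) | Scratch | 92.19 | 70.23 | 83.48 | 90.72 | 74.91 | 100.00 | 63.31 | 82.12 |
| AUC $\uparrow$ | nnUNet (CNN-based) | Vanilla CLIP [17] | 96.69 | 80.97 | 83.17 | 86.99 | 92.05 | 100.00 | 71.39 | 87.32 |
| AUC $\uparrow$ | nnUNet (CNN-based) | CT-GLIP (ours) | 97.73 | 81.49 | 90.64 | 90.54 | 92.43 | 93.70 | 80.35 | 89.55 |
| AUC $\uparrow$ | MiT (ViT-based) | Scratch | 97.63 | 77.91 | 50.00 | 90.26 | 74.05 | 98.74 | 77.99 | 80.94 |
| AUC $\uparrow$ | MiT (ViT-based) | Vanilla CLIP [17] | 90.10 | 85.83 | 78.40 | 95.65 | 78.45 | 99.55 | 81.03 | 87.00 |
| AUC $\uparrow$ | MiT (ViT-based) | CT-GLIP (ours) | 91.48 | 81.31 | 87.79 | 95.03 | 85.76 | 96.35 | 82.46 | 88.60 |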

Results. For both backbones, CT-GLIP outperforms, in terms of mean performance, both the models trained from scratch and those initialized with vanilla CLIP pretraining, as shown in Table 2. With the nnUNet backbone, CT-GLIP exceeds training from scratch and vanilla CLIP initialization by 4.8 and 1.3 percentage points in mean tumor segmentation Dice score, and by 7.4 and 2.2 points in mean cancer detection AUC, respectively. With the MiT backbone, the respective improvements are 13.1 and 5.1 points for tumor segmentation, and 7.7 and 1.6 points for cancer detection. In general, pretraining with either CLIP or CT-GLIP improves performance by a large margin, illustrating the importance of pre-learned representations for this clinically significant task. The advantage over vanilla CLIP further illustrates the efficacy of our method in leveraging visual-textual associations for enhanced tumor-related image representations.
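As a quick arithmetic check, the mean Dice scores for the nnUNet backbone and the resulting gains can be recomputed from the per-organ values in Table 2. The snippet below only restates numbers from the table; the variable names are illustrative.

```python
from statistics import mean

# Per-organ DSC (%) for the nnUNet backbone, in the column order of Table 2:
# Pan, Bre, Sto, Col, Lung, Eso, Liv.
dsc_nnunet = {
    "Scratch":      [38.77, 13.20, 20.36, 24.18, 40.45, 52.94, 19.25],
    "Vanilla CLIP": [50.58, 19.28, 19.98, 24.31, 47.15, 56.45, 16.43],
    "CT-GLIP":      [52.42, 20.59, 23.13, 26.16, 48.89, 53.25, 18.44],
}

mean_dsc = {method: mean(values) for method, values in dsc_nnunet.items()}
gain_over_scratch = mean_dsc["CT-GLIP"] - mean_dsc["Scratch"]
gain_over_clip = mean_dsc["CT-GLIP"] - mean_dsc["Vanilla CLIP"]

for method, value in mean_dsc.items():
    print(f"{method}: mean DSC = {value:.2f}")
print(f"Gain over scratch: {gain_over_scratch:.2f}, over vanilla CLIP: {gain_over_clip:.2f}")
```

The recomputed means agree with the Mean column of Table 2 to within rounding.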

4 Conclusion

In this study, we have expanded VLP into 3D medical imaging, particularly full-body CT scans, by generating grounded (organ-level) image-text pairs and enhancing learning pair diversity with an abnormality dictionary. Our proposed CT-GLIP overcomes sparse data challenges and shows promise in zero-shot recognition of organs and abnormalities, with implications for improving the downstream task of multi-cancer screening. This research establishes new benchmarks for evaluating 3D VLP’s potential in medical diagnostics.

References

[1] Alsentzer, E., Murphy, J.R., Boag, W., Weng, W.H., Jin, D., Naumann, T., McDermott, M.: Publicly available clinical bert embeddings. In: NAACL. pp. 72–78 (2019)
[2] Cao, K., Xia, Y., Yao, J., Han, X., Lambert, L., Zhang, T., Tang, W., Jin, G., Jiang, H., Fang, X., et al.: Large-scale pancreatic cancer detection via non-contrast ct and deep learning. Nature Medicine pp. 3033–3043 (2023)
[3] Caron, M., Misra, I., Mairal, J., Goyal, P., Bojanowski, P., Joulin, A.: Unsupervised learning of visual features by contrasting cluster assignments. In: NeurIPS. pp. 9912–9924 (2020)
[4] Chauhan, G., Liao, R., Wells, W., Andreas, J., Wang, X., Berkowitz, S., Horng, S., Szolovits, P., Golland, P.: Joint modeling of chest radiographs and radiology reports for pulmonary edema assessment. In: MICCAI. pp. 529–539 (2020)
[5] Gan, Z., Li, L., Li, C., Wang, L., Liu, Z., Gao, J., et al.: Vision-language pre-training: Basics, recent advances, and future trends. Foundations and Trends® in Computer Graphics and Vision pp. 163–352 (2022)
[6] He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR. pp. 9729–9738 (2020)
[7] Huang, S.C., Shen, L., Lungren, M.P., Yeung, S.: Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In: ICCV. pp. 3942–3951 (2021)
[8] Isensee, F., Jaeger, P.F., Kohl, S., Wasserthal, J., Koehler, G., Norajitra, T., Wirkert, S., Maier-Hein, K.H.: nnu-net: a self-configuring method for deep learning-based biomedical image segmentation. Nature Methods pp. 1–9 (2020)
[9] Jia, C., Yang, Y., Xia, Y., Chen, Y.T., Parekh, Z., Pham, H., Le, Q., Sung, Y.H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: ICML. pp. 4904–4916 (2021)
[10] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: ICML. pp. 12888–12900 (2022)
[11] Li, L.H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., Hwang, J.N., et al.: Grounded language-image pre-training. In: CVPR. pp. 10965–10975 (2022)
[12] Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: Pmc-clip: Contrastive language-image pre-training using biomedical documents. In: MICCAI. pp. 525–536 (2023)
[13] Liu, C., Cheng, S., Chen, C., Qiao, M., Zhang, W., Shah, A., Bai, W., Arcucci, R.: M-flag: Medical vision-language pre-training with frozen language models and latent space geometry optimization. In: MICCAI (2023)
[14] Lu, M.Y., Chen, B., Zhang, A., Williamson, D.F., Chen, R.J., Ding, T., Le, L.P., Chuang, Y.S., Mahmood, F.: Visual language pretrained multiple instance zero-shot transfer for histopathology images. In: CVPR. pp. 19764–19775 (2023)
[15] Müller, P., Kaissis, G., Zou, C., Rueckert, D.: Joint learning of localized representations from medical images and reports. In: ECCV. pp. 685–701 (2022)
[16] Oord, A.v.d., Li, Y., Vinyals, O.: Representation learning with contrastive predictive coding. arXiv:1807.03748 (2018)
[17] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763 (2021)
[18] Tian, Y., Krishnan, D., Isola, P.: Contrastive multiview coding. In: ECCV. pp. 776–794 (2020)
[19] Tiu, E., Talius, E., Patel, P., Langlotz, C.P., Ng, A.Y., Rajpurkar, P.: Expert-level detection of pathologies from unannotated chest x-ray images via self-supervised learning. Nature Biomedical Engineering pp. 1399–1406 (2022)
[20] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288 (2023)
[21] Wang, Y., Lin, J., Cai, Q., Pan, Y., Yao, T., Chao, H., Mei, T.: A low rank promoting prior for unsupervised contrastive learning. IEEE TPAMI pp. 2667–2681 (2022)
[22] Wang, Y., Lin, J., Zou, J., Pan, Y., Yao, T., Mei, T.: Improving self-supervised learning with automated unsupervised outlier arbitration. In: NeurIPS. pp. 27617–27630 (2021)
[23] Wang, Z., Wu, Z., Agarwal, D., Sun, J.: Medclip: Contrastive learning from unpaired medical images and text. In: EMNLP. pp. 3876–3887 (2022)
[24] Wasserthal, J., Breit, H.C., Meyer, M.T., Pradella, M., Hinck, D., Sauter, A.W., Heye, T., Boll, D.T., Cyriac, J., Yang, S., et al.: Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images. Radiology: Artificial Intelligence (2023)
[25] Wu, C., Zhang, X., Zhang, Y., Wang, Y., Xie, W.: Medklip: Medical knowledge enhanced language-image pre-training. In: ICCV. pp. 21372–21383 (2023)
[26] Xie, Y., Zhang, J., Xia, Y., Wu, Q.: Unimiss: Universal medical self-supervised learning via breaking dimensionality barrier. In: ECCV. pp. 558–575 (2022)
[27] You, K., Gu, J., Ham, J., Park, B., Kim, J., Hong, E.K., Baek, W., Roh, B.: Cxr-clip: Toward large scale chest x-ray language-image pre-training. In: MICCAI. pp. 101–111 (2023)
[28] Zhang, Y., Jiang, H., Miura, Y., Manning, C.D., Langlotz, C.P.: Contrastive learning of medical visual representations from paired images and text. In: MLHC. pp. 2–25 (2022)
[29] Zhu, Z., Xia, Y., Xie, L., Fishman, E.K., Yuille, A.L.: Multi-scale coarse-to-fine segmentation for screening pancreatic ductal adenocarcinoma. In: MICCAI. pp. 3–12 (2019)

Appendix A Supplementary Material

A.1 Details about Zero-shot Evaluation

16 representative abnormalities over 7 common organs. The evaluation of zero-shot abnormality detection requires the model to identify whether a given organ is abnormal. As shown in Fig. 3a, we first select the 7 most frequent organs: the spleen, pancreas, aorta, gallbladder, kidney, liver, and lung. Then, for each of these 7 organs, we select the 1-3 most common abnormalities. Figure 3b shows a word cloud of the radiology reports in our dataset. Table 3 lists the abnormalities selected for these organs. Figure 4 shows the AUC of zero-shot abnormality detection over the 16 abnormalities with the MiT backbone. The results demonstrate the superiority of our proposed CT-GLIP.

Table 3: The 16 representative abnormalities across 7 organs.

| Organ | Abnormalities |
|---|---|
| Spleen | splenomegaly, spleen calcification |
| Pancreas | acute pancreatitis, chronic pancreatitis, pancreatic duct stones |
| Aorta | arteriosclerosis of aorta |
| Kidney | kidney stone, renal cyst |
| Liver | fatty liver, hepatic cyst, hepatic calcification |
| Lung | old lesions in lung, pulmonary nodules, pulmonary fibrous lesion |

Inference for zero-shot organ classification. Fig. 5 illustrates the inference process for zero-shot organ classification. We first generate organ descriptions for all 104 organs using a given template and convert these descriptions into text embeddings with the expert text encoder. Meanwhile, the CT scan and its multi-organ segmentation are fed into the 3D image encoder to produce organ-level visual embeddings. For each organ, the class whose text embedding is closest to the organ's visual embedding is predicted as its label. This approach allows CT-GLIP to perform 104-way organ classification on CT scans using only organ descriptions of the possible outcomes, without task-specific classification training. The top-1 accuracy of zero-shot organ classification is reported in Table 1.
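The inference procedure described above amounts to a nearest-text-embedding lookup. The sketch below illustrates it with NumPy; the function names are hypothetical, the prompt template is the one quoted in the methodology section, and embeddings are assumed to be L2-normalized so that the dot product is a cosine similarity. Computing the embeddings themselves (image encoder, text encoder) is outside the scope of this sketch.

```python
import numpy as np


def build_organ_prompts(organ_names):
    """Fill the organ template once per candidate class."""
    return [f"this is a {name} in the CT scan" for name in organ_names]


def zero_shot_organ_classification(z, text_embeddings, organ_names):
    """Predict one organ name per organ-level visual embedding.

    z:               (M, D) organ-level visual embeddings of one scan.
    text_embeddings: (K, D) embeddings of the K organ prompts.
    """
    similarity = z @ text_embeddings.T          # (M, K) cosine similarities
    best = similarity.argmax(axis=1)            # closest prompt per organ
    return [organ_names[k] for k in best]


def top1_accuracy(predicted, reference) -> float:
    """Fraction of organs whose predicted name matches the reference name."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)
```

In the paper's setting, `organ_names` would hold the 104 structure names, giving a 104-way classification per segmented organ.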

Inference for zero-shot abnormality detection. Fig. 1b illustrates the inference process of zero-shot abnormality detection. For each test CT image, we provide a pair of normal and abnormal text descriptions together with the corresponding organ segmentation. For each targeted abnormality, we compute the similarity between the organ-level grounded visual features and both the normal and the abnormal textual embeddings, and predict the class with the higher similarity score. Zero-shot abnormality detection thus operates as a binary classification task.
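The binary decision can be sketched as follows. The comparison of the two similarities follows the description above; the softmax-style probability returned alongside the decision is an assumption added here to provide a continuous score (for example, for computing AUC), and the function name is hypothetical.

```python
import numpy as np


def detect_abnormality(z_organ, t_normal, t_abnormal, tau: float = 0.07):
    """Compare one organ embedding with a normal and an abnormal description.

    z_organ, t_normal, t_abnormal: (D,) L2-normalized embeddings.
    Returns (is_abnormal, p_abnormal).
    """
    s_normal = float(z_organ @ t_normal) / tau
    s_abnormal = float(z_organ @ t_abnormal) / tau
    # Two-way softmax over the pair of similarities, written as a sigmoid.
    p_abnormal = float(1.0 / (1.0 + np.exp(s_normal - s_abnormal)))
    # Higher similarity to the abnormal description means "abnormal".
    return s_abnormal > s_normal, p_abnormal
```

For a liver, for instance, the pair of prompts could be "no evident abnormality in liver" and "fatty liver", with one such comparison per targeted abnormality.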