Department of Engineering, University of Palermo, Palermo, 90128, Sicily, Italy
Corresponding author: salvatore.contino01@unipa.it

MedPix 2.0: A Comprehensive Multimodal Biomedical Dataset for Advanced AI Applications

Irene Siragusa 11    Salvatore Contino Massimo La Ciura 11** 0000-0002-7476-1545 11    Rosario Alicata 11    Roberto Pirrone 11 0000-0001-9453-510X
Abstract

The increasing interest in developing Artificial Intelligence applications in the medical domain suffers from the lack of high-quality datasets, mainly due to privacy-related issues. Moreover, the recent rise of Multimodal Large Language Models (MLLM) creates a need for multimodal medical datasets, where clinical reports and findings are attached to the corresponding CT or MR scans. This paper illustrates the entire workflow for building the MedPix 2.0 dataset. Starting from the well-known multimodal dataset MedPix®, mainly used by physicians, nurses and healthcare students for Continuing Medical Education purposes, a semi-automatic pipeline was developed to extract visual and textual data, followed by a manual curation procedure in which noisy samples were removed, thus creating a MongoDB database. Along with the dataset, we developed a GUI aimed at efficiently navigating the MongoDB instance and obtaining the raw data, which can be easily used for training and/or fine-tuning MLLMs. To support this point, we also propose a CLIP-based model trained on MedPix 2.0 for scan classification tasks.

Keywords:
MedPix MongoDB Biomedical Dataset MLLM CLIP Decision Support System

1 Introduction

The rise of computer-based applications in recent years has strongly favoured the digitisation of historically analogue processes in the management and analysis of biomedical data. In turn, the emergence of technologies based on Artificial Intelligence (AI) has prompted the development of increasingly precise models that support diagnosis and the construction of personalised treatments. In the biomedical domain, different sub-areas can benefit from AI-based systems, ranging from the administrative field (e.g. better management of queues in emergency areas, centralised integration of medical records) to the clinical one, where AI can efficiently extract useful features to reach a reliable diagnosis.

One of the fundamental requirements of these applications is trustworthiness: they must support the physician with high confidence by providing reliable predictions and classifications. AI models, however, require a considerable amount of data to achieve these results, which makes their expansion into the biomedical domain much more complex than in other domains. One of the main problems lies in the availability of data sets that allow the scientific community to develop new AI approaches. This problem arises mainly from the sensitive nature of the data, which is subject to privacy constraints; this makes it difficult to build public data sets containing images and/or clinical reports for the scientific community.

To start overcoming these obstacles, the European Community has established the European Health Data Space (EHDS). This is a health-specific ecosystem composed of common rules, standards and practices, infrastructure and a governance framework that aims to empower people through increased digital access to and control of their personal electronic health data. The EHDS promotes a single market for electronic health record systems, relevant medical devices, and high-risk artificial intelligence systems. Finally, the EHDS aims at providing research that makes use of health data with a trusted framework for controlling the whole analytical process [13, 17]. The systemic approach proposed by the EHDS will provide a controlled pool of both shared data and applications, thus allowing AI in the medical field to access certified and controlled data, overcoming both medical and engineering obstacles and laying solid foundations on which a new generation of AI-based health applications can rely.

In view of the implications of the EHDS on the application side, researchers wishing to develop MLLMs in the biomedical domain should be able to find and optimise the available datasets in a way that makes the most of public resources. To date, few datasets contain both images (CT and/or MRI) and medical reports. One of the most renowned is MedPix® (https://medpix.nlm.nih.gov/home), a free open-access online database of medical images, teaching cases, and clinical topics, integrating images and textual metadata. MedPix® includes more than 12,000 patient case scenarios, 9,000 topics, and nearly 59,000 images. The contribution of this paper lies in a new organisation of MedPix® as a non-relational database based on MongoDB, which makes its content accessible and structured through views that create ready-made subsets for training MLLMs. Database queries are further simplified by a user-friendly GUI, which allows users to pose a query and browse the results in much the same way as on the original website. In addition, we used our MongoDB data source to create a data set for training a CLIP-based model that classifies an input medical image, providing information about both the scanning modality and the body part shown.

The paper is arranged as follows: Section 2 illustrates the State Of The Art (SOTA) for medical multimodal data sets; the building process of the curated data set is detailed in Section 3, along with the MongoDB implementation and the proposed GUI. The experimental results of an MLLM trained to demonstrate the effectiveness of MedPix 2.0 are reported and discussed in Section 4. Concluding remarks and future work are drawn in Section 5.

2 Related Works

Medical data sets for AI applications suffer from diverse problems, related both to the data and to the peculiarities of the domain. First, there is a privacy issue: since clinical data contain private information about the patient, the creation of such data sets has to start with an anonymization phase. To overcome this problem, researchers either rely on open-access and textbook data or collaborate with hospitals to create data sets. The first two sources allow for large scalability, since they rely on already anonymous data. On the other hand, anonymization must be carried out from scratch when dealing with hospitals. Moreover, multimodal medical data are scarce when compared with data sets from other domains, such as MS-COCO [8].

One of the most used open-access multimodal databases for developing medical datasets is PubMed Central® (PMC, https://www.ncbi.nlm.nih.gov/pmc/). This is a widely used free archive of biomedical scientific literature, from which it is possible to build one’s own datasets via semi-automatic procedures. In PMC, data are anonymized, and high-quality captions can be extracted from the medical research papers the images belong to. The following multimodal data sets were extracted from PMC: ROCO [12], MedICaT [16], and PMC-OA [9]. ROCO contains pairs of radiology images and the corresponding captions, and it incorporates an out-of-class set to improve prediction and classification performance. MedICaT is disjoint from ROCO, is mainly composed of radiology images, and provides manual annotations for sub-figures. PMC-OA is larger than the previous ones, and it covers a variety of diagnostic procedures, diseases, and findings, while introducing sub-figure separation.

VQA-RAD [6] is a data set derived from MedPix®, and it collects a subset of radiological images, while providing Question-Answer (QA) pairs validated by domain experts.

Another source of available high-quality data are textbooks: PathVQA [4] is a Visual Question Answering (VQA) data set that collects both closed- and open-ended QA pairs, which are extracted from both pathology textbooks and online digital libraries via a semi-automated pipeline.

On the other side, open-access data sets like MIMIC-CXR [5], IU-Xray [2] and SLAKE [10] are manually annotated by domain experts. Both MIMIC-CXR and IU-Xray are chest radiography data sets derived from hospital clinical cases. Both of them contain semi-structured radiology reports describing the radiological findings of the related images. SLAKE, on the contrary, collects images from different open-access radiology datasets and provides manual annotations and QA pairs given by experienced doctors in English and Chinese.

The aforementioned data sets are multimodal, and they are strongly focused on VQA tasks. Some specialized data sets also exist that provide only visual data, such as UniToChest [1], or only free-form textual data [15]. Other works worth mentioning are UniToBrain [3], which integrates a technical report with the scanning modality, and the E3C corpus [11], a multilingual data set of clinical reports.

Although the proposed MedPix 2.0 data set may be close to VQA-RAD, the two differ in both the sampling strategy and the samples themselves. In MedPix 2.0, we do not integrate QA pairs, but different QA pairs can be created from the structured textual information provided in the data set itself, following the structure of VQA-RAD. The creation of more complex tasks, such as document summarization or understanding, is thus also possible.

With respect to the existing multimodal data sets, MedPix 2.0:

  1. is derived from a free open-access source, and it has no privacy-related issues;
  2. offers a balanced variety of CT and MRI scans of different body parts;
  3. provides a complete structured clinical case for each image.

Thanks to the annotation scheme we selected for the JSON documents in MedPix 2.0, the last point makes the data set suited to various tasks beyond document-level retrieval. Unfortunately, images are provided in PNG format, which limits visual processing with respect to raw images in DICOM format.

3 MedPix 2.0

MedPix® is a free open-access multimodal online database of medical images, teaching cases, and clinical topics, managed by the National Library of Medicine (NLM) of the National Institutes of Health (NIH). It mainly serves as a support system for Continuing Medical Education (CME) of physicians, nurses, and healthcare students. The database collects clinical cases related to more than 12,000 patients. Each case contains at least one medical image, and the corresponding findings, discussion notes, diagnosis, differential diagnosis, treatment, and follow up. Textual information is reported in a semi-structured format. Attached to the clinical case, there is the topic section, where the disease under investigation is discussed in detail from an academic and general perspective.

In Fig. 1 an example from the MedPix® website is reported.

Figure 1: An example from the MedPix® web site where the original screenshots have been rearranged for visualizing purposes, and the colored labels have been added for clarity.

Despite the richness of the data set, its free availability, the possibility to add new cases, and the ability to find clinical cases of interest using either keywords, the body part, or the disease, it is not possible to access the raw data. This limits the usage of MedPix® for training multimodal AI systems. We therefore decided to create a brand new structured version of this data set, because it represents a high-quality source for AI-based medical applications. MedPix 2.0 has been built essentially as a MongoDB instance that is released along with a suitable GUI aimed both at general-purpose querying and at extracting training data for AI models. Referring to Fig. 1, a MongoDB version of the data set was built using a semi-automated pipeline to create two kinds of JSON documents: one collecting the information in the screenshots labeled as DESCRIPTION, and one gathering the information in the screenshots labeled as CASE and TOPIC.

3.1 Dataset extraction

We decided to focus on the part of MedPix® that involves cases related to two diagnostic modalities, namely Computed Tomography (CT) and Magnetic Resonance Imaging (MRI). First, the images in the considered split were downloaded via Open-i® (https://openi.nlm.nih.gov/s), and noisy samples were manually removed or modified accordingly: among the downloaded images there were pathology exams and teaching materials, such as annotated slides, that are useless for training a medical MLLM. The purpose of this data cleaning stage was to obtain images suitable as input data for training an MLLM. An automatic scraping pipeline was implemented to extract the textual data related to the selected images using Selenium (https://www.selenium.dev/) and Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/). Finally, two kinds of JSON documents were devised to store, respectively, the information strictly connected with the images (descriptions document) and the information related to a clinical case (case-topic document). A one-to-many relation was created between a clinical case and its images by embedding the U_id defined for a case-topic document in the descriptions document attached to each image of that clinical case. An example of the two kinds of JSON documents is shown in Fig. 2.
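The one-to-many link between a clinical case and its images can be illustrated with a minimal Python sketch. Only the `U_id` key comes from the text; every other field name and value used here (`image`, `Type`, `Case`, `Topic`, `Title`, the sample identifiers) is a hypothetical placeholder and not the actual MedPix 2.0 schema.

```python
import json
from collections import defaultdict

# Hypothetical case-topic document: one per clinical case.
case_topic = {
    "U_id": "case-0001",                  # identifier defined for the case-topic document
    "Case": {"Title": "example case"},    # placeholder content
    "Topic": {"Title": "example topic"},  # placeholder content
}

# Hypothetical descriptions documents: one per image, each embedding the
# U_id of the clinical case the image belongs to.
descriptions = [
    {"image": "img-0001", "U_id": "case-0001", "Type": "CT"},
    {"image": "img-0002", "U_id": "case-0001", "Type": "MR"},
]


def group_images_by_case(description_docs):
    """Return a mapping from case U_id to the list of attached image ids."""
    images = defaultdict(list)
    for doc in description_docs:
        images[doc["U_id"]].append(doc["image"])
    return dict(images)


images_by_case = group_images_by_case(descriptions)
print(json.dumps(images_by_case, indent=2))
```

Embedding the case identifier in each image-level document keeps the case-topic document free of a growing image list, so new images can be attached to a case without rewriting it.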

Figure 2: An example of the JSON documents created in MedPix 2.0: (a) descriptions document, (b) case-topic document.

3.2 MongoDB representation

To process the data in MedPix 2.0 properly, we built a MongoDB database to host all the JSON documents along with the images. This architectural choice stems not only from the nature of the data in MedPix 2.0 but also from the high flexibility of MongoDB and its scalability to distributed environments. In such environments, private multimodal medical data could be added to the original collections under the constraint that they are not moved away from the site where they were generated, as is the case for hospital-generated information.

We built a MongoDB instance made of two collections, namely Image_Descriptions, containing the descriptions documents, and Clinical_reports, containing all the case-topic documents. In our implementation, the images are stored in a separate folder, and they are accessed using a file:// URL built from their U_id. Finally, a view called Image_Reports allows direct access to both collections via their U_id.
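A view joining the two collections could be expressed as a MongoDB aggregation pipeline. The sketch below is an assumption about how such a view might be defined, not the released implementation: it assumes that both collections expose the shared key as `U_id`, that `U_id` is unique in Clinical_reports, and that images live in a folder named `images` with a `.png` extension. The helper names `image_url` and `join_in_memory` are hypothetical.

```python
from pathlib import Path

# Aggregation pipeline that could back a view such as Image_Reports:
# each descriptions document is joined with its case-topic document.
image_reports_pipeline = [
    {"$lookup": {
        "from": "Clinical_reports",
        "localField": "U_id",
        "foreignField": "U_id",
        "as": "report",
    }},
    # $lookup yields an array; $unwind flattens it and drops unmatched documents.
    {"$unwind": "$report"},
]

# With a running MongoDB server and pymongo, the view could be created as:
#   db.create_collection("Image_Reports",
#                        viewOn="Image_Descriptions",
#                        pipeline=image_reports_pipeline)


def image_url(image_id, image_dir="images", extension=".png"):
    """Build a file:// URL for an image from its identifier (assumed layout)."""
    return (Path(image_dir) / f"{image_id}{extension}").resolve().as_uri()


def join_in_memory(description_docs, report_docs):
    """Pure-Python stand-in for the $lookup/$unwind pipeline above."""
    reports = {doc["U_id"]: doc for doc in report_docs}
    return [
        {**desc, "report": reports[desc["U_id"]]}
        for desc in description_docs
        if desc["U_id"] in reports
    ]
```

The in-memory join is only meant to make the semantics of the pipeline easy to inspect without a database server.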

A user-friendly GUI was built using PyQt5 (https://www.riverbankcomputing.com/software/pyqt/) to allow easy access to the database and to query it for the desired data, for visualization and/or download. As shown in Fig. 3(a), it is possible to select either the collection or the view to be queried and to enter the query input in the corresponding fields. Fig. 3(b) reports an example of the answer to a query: samples matching the query are reported as a list of JSON objects, and the user can save the list or view a specific clinical case and/or image selected from the query results. An example of this view is shown in Fig. 4, where a MedPix®-like GUI is reproduced to improve usability for users of the original website. In our GUI, the curated images scraped along with the texts are shown by default, but the user can choose to download the original (non-curated) data related to the clinical case under investigation.

Figure 3: An example of the interface for querying the database where the user asks for CT scans of female patients (a), and the provided output (b).
Figure 4: An example of the view of a selected clinical case retrieved by the query posed in Fig. 3. The image, and the relative description are reported along with the information of the clinical case the image belongs to.
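The query of Fig. 3, CT scans of female patients, can be written as a MongoDB-style filter. The following sketch is illustrative only: the field names `Type` and `Description.Sex` and the sample documents are assumptions, and the small `matches` helper implements just equality on dotted keys, a tiny subset of MongoDB filter semantics.

```python
def get_path(doc, dotted_key):
    """Follow a dotted key such as 'Description.Sex' through nested dicts."""
    value = doc
    for part in dotted_key.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def matches(doc, query):
    """Equality-only matching on (possibly dotted) keys."""
    return all(get_path(doc, key) == expected for key, expected in query.items())


# Hypothetical filter for "CT scans of female patients".
query = {"Type": "CT", "Description.Sex": "female"}

# Hypothetical documents standing in for the queried collection.
docs = [
    {"U_id": "c1", "Type": "CT", "Description": {"Sex": "female"}},
    {"U_id": "c2", "Type": "MR", "Description": {"Sex": "female"}},
    {"U_id": "c3", "Type": "CT", "Description": {"Sex": "male"}},
]

hits = [doc["U_id"] for doc in docs if matches(doc, query)]
print(hits)  # ['c1']
```

With pymongo and a live database, a filter dictionary of this shape could be passed directly to `collection.find(query)`.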

MedPix 2.0 and its querying interface can be a valid tool for both physicians and AI researchers, since the desired data can be easily downloaded for further applications, like training DL models, where a large amount of structured data is required. The structured output downloaded from the interface can be used in turn to fill a fixed template and to provide an MLLM with a rich textual prompt. Multimodal tasks can also be addressed if the textual descriptions are combined with the corresponding image, as we demonstrate in Section 4.
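Filling a fixed template from a structured record can be sketched in a few lines of Python. The template wording, the placeholder names, and the record keys (`Type`, `Location`, `Findings`) are assumptions made for illustration; the actual keys depend on the downloaded JSON.

```python
from string import Template

# Hypothetical fixed template; the placeholder names are illustrative only.
PROMPT_TEMPLATE = Template(
    "Modality: $modality\n"
    "Body part: $location\n"
    "Findings: $findings\n"
    "Question: $question"
)


def build_prompt(record, question):
    """Fill the template from one structured record returned by a query."""
    return PROMPT_TEMPLATE.substitute(
        modality=record.get("Type", "unknown"),
        location=record.get("Location", "unknown"),
        findings=record.get("Findings", "not reported"),
        question=question,
    )


prompt = build_prompt(
    {"Type": "CT", "Location": "Head"},
    "What is the most likely diagnosis?",
)
print(prompt)
```

Missing keys fall back to explicit defaults, so an incomplete record still produces a well-formed prompt instead of raising an error.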

4 Training an MLLM using MedPix 2.0

The training-oriented reorganisation of MedPix 2.0 makes it possible to train a multimodal deep neural network without further pre-processing of the data. To validate this claim, we used a data set extracted from MedPix 2.0 to train CLIP [14], one of the most recent and widespread multimodal models. In particular, CLIP achieves competitive performance in zero-shot contexts on a wide range of classification datasets by learning the relationships between images and their textual descriptions [18]. The architecture is well known and consists of an image encoder and a text encoder. CLIP’s training scheme was kept consistent with what has been reported in the literature: a contrastive pre-training phase using a large corpus of image-text pairs, a second phase that creates the classifier from the text labels, and finally the zero-shot prediction. A schematic of this training is shown in Fig. 5.

Figure 5: The three characteristic phases of CLIP. 1) Contrastive pre-training, 2) Creation of the classifier from text labels and 3) Zero-shot prediction.
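The contrastive objective and the zero-shot step can be made concrete with a small, dependency-free Python sketch. It follows the symmetric cross-entropy formulation described for CLIP [14], but it is a toy illustration under stated assumptions: embeddings are hand-written two-dimensional vectors rather than encoder outputs, and the temperature is fixed at 0.07 instead of being learned.

```python
import math


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def similarity_matrix(image_embs, text_embs, temperature=0.07):
    """Cosine similarities between every image and every text, scaled by 1/temperature."""
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in texts]
            for i in images]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric loss: the k-th image should match the k-th text and vice versa."""
    logits = similarity_matrix(image_embs, text_embs, temperature)
    n = len(logits)
    image_to_text = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = sum(cross_entropy(columns[k], k) for k in range(n)) / n
    return (image_to_text + text_to_image) / 2


def zero_shot_predict(image_emb, label_embs, labels):
    """Pick the label whose text embedding is most similar to the image embedding."""
    logits = similarity_matrix([image_emb], label_embs)[0]
    best = max(range(len(labels)), key=logits.__getitem__)
    return labels[best]


# Toy embeddings: matched pairs are aligned, swapped pairs are not.
matched = clip_contrastive_loss([[1, 0], [0, 1]], [[1, 0], [0, 1]])
swapped = clip_contrastive_loss([[1, 0], [0, 1]], [[0, 1], [1, 0]])
prediction = zero_shot_predict([0.9, 0.1], [[1, 0], [0, 1]], ["CT", "MRI"])
```

In a real setting the label embeddings would come from the text encoder applied to one prompt per class, which is what turns the label set into a classifier without any additional training.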

CLIP’s architecture is modular, that is, one can use specific encoders other than the default ones for both the visual and the textual part. This feature makes CLIP a framework and not simply a model.
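This modularity can be pictured as a thin wrapper that accepts any pair of encoders projecting into the same embedding space. The class and the toy encoders below are hypothetical and exist only to show the idea of swapping one encoder while keeping the other; they are not CLIP's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Vector = Sequence[float]


@dataclass
class DualEncoder:
    """Pairs any image encoder with any text encoder sharing an embedding size."""
    image_encoder: Callable[[object], Vector]
    text_encoder: Callable[[str], Vector]

    def score(self, image, text):
        """Dot product between the image embedding and the text embedding."""
        img = self.image_encoder(image)
        txt = self.text_encoder(text)
        if len(img) != len(txt):
            raise ValueError("encoders must project into the same space")
        return sum(a * b for a, b in zip(img, txt))


# Toy stand-ins for real encoders (hypothetical, for illustration only).
def toy_image_encoder(image):
    return [float(sum(image)), 1.0]


def toy_text_encoder(text):
    return [float(len(text)), 1.0]


def other_text_encoder(text):
    return [float(text.count(" ") + 1), 1.0]


default_model = DualEncoder(toy_image_encoder, toy_text_encoder)
swapped_model = DualEncoder(toy_image_encoder, other_text_encoder)
```

Only the text encoder changes between the two instances, which mirrors an experimental setup where one encoder is replaced while the other is kept fixed.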

Leveraging this peculiarity, we started a trial training phase to select the visual encoder best suited to our purposes, while keeping the textual encoder unchanged. Specifically, a ViT-L/14 and a RN50x16 were tested in combination with CLIP’s default textual encoder. The trial phase favoured the RN50x16, which, combined with CLIP’s textual encoder, obtained an accuracy of 58% in the scanning modality identification task, as shown in Fig. 6.

Figure 6: Histogram of the results obtained using RN50x16 and CLIP encoders.

Once the visual encoder had been identified, we started testing a new textual encoder, BioBERT [7], in order to improve the overall classification performance. BioBERT was first trained on the textual data obtained from MedPix 2.0 via the GUI, using the Masked Language Model objective, and was then integrated with the RN50x16 encoder for fine-tuning on the selected training data. By doing so, the performance increased considerably, reaching an accuracy of 88% in the scanning modality prediction task, as shown in Fig. 7(a). We also evaluated the performance of the obtained model on the location classification task, as shown in Fig. 7(b).

Figure 7: Histogram of the results obtained using RN50x16 and BioBERT encoders for the modality identification (a) and the location classification task (b).
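For readers unfamiliar with masked language modelling, the token corruption step can be sketched as follows. The text does not detail the masking settings used here, so the sketch assumes the standard BERT recipe, which BioBERT inherits: 15% of the tokens are selected, and of those 80% are replaced by a mask token, 10% by a random token, and 10% are left unchanged. The function name and the whitespace tokenisation are illustrative assumptions.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the standard BERT recipe.

    labels[i] holds the original token at selected positions and None elsewhere,
    so that the loss is computed on the selected positions only.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with the mask token
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels


# Toy sentence, tokenised on whitespace for illustration only.
tokens = "axial ct image shows a mass in the left frontal lobe".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5,
                                rng=random.Random(0))
```

The model is then trained to recover `labels` from `corrupted`, which adapts the text encoder to the vocabulary of the clinical reports before it is paired with the visual encoder.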

5 Conclusions and Future work

In this paper we presented MedPix 2.0, a multimodal dataset of clinical reports and CT and MR scans. We devised a semi-automated pipeline to download and curate the images in the original data set, while structuring the textual information as collections of JSON documents, which were used to build a MongoDB instance. The NoSQL version of the data set can be accessed and queried with a purpose-built GUI. Using the GUI, one can browse the data set in the same manner as in the original MedPix® and download the structured output of the query, which is suitable for training AI models. To demonstrate this point, we developed a CLIP-based model for multimodal classification of CT/MR scans with respect to both the scanning modality and the body part shown in the input slice.

The presented dataset and its MongoDB interface represent, in our view, a relevant starting point for the development of multimodal AI models in the medical domain, such as Information Extraction systems tailored for clinical reports, automated analysis of medical images, or Generative AI models for clinical report generation as part of a Medical Decision Support System. All these systems can rely on MedPix 2.0 as a structured source of data containing both clinical cases and medical explanations about the disease found.

The structured JSON documents also encode implicit knowledge on the domain. We are currently working on a generative model that relies on a Knowledge Graph (KG) built from MedPix 2.0 to generate diagnostic findings given some preliminary information. The information retrieved by analyzing the keys of the JSON documents can be easily structured as nodes (patients, diagnosis, treatment, exams, …) connected via edges symbolizing the relations between them, and enriched by adding attributes at node or edge level. This re-organization of the data can reveal patterns in the relations that otherwise cannot be easily recognized, and the resulting graph, apart from AI-based applications, can be browsed by doctors for diagnostic or research purposes using a proper GUI. The scalability of the developed MongoDB database makes it suitable for future extensions, with the possibility to add new clinical cases, provided that they comply with privacy regulations and follow the required information structure. Moreover, the inherently distributed nature of MongoDB allows for creating huge databases across different wards, where the data owned by a single institution do not need to be moved out of the hospital, which would violate privacy regulations. The structure of MedPix 2.0 could also serve as a guide to develop suitable connectors to share allowed data in the EHDS. New cases can also be easily added to a KG that can be queried and used in a more complex pipeline via a Retrieval Augmented Generation (RAG) approach, to retrieve relevant pieces of information for generating a more precise report with a generative LLM. Further improvement is also under way on the interface, providing the user with advanced data visualization tools, like the possibility to interactively compare similar cases, thus helping physicians during the diagnostic phase.
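The mapping from JSON keys to graph nodes and edges mentioned above can be sketched as follows. This is a speculative illustration of work described as ongoing: apart from `U_id`, the key names (`Diagnosis`, `Treatment`, `Exam`) and the relation names are hypothetical, and the graph is held in plain Python sets rather than in a graph database.

```python
def case_to_graph(case):
    """Turn one structured case into (nodes, edges).

    nodes: set of (node_type, value) pairs
    edges: set of (source_node, relation, target_node) triples
    """
    patient = ("patient", case["U_id"])
    nodes, edges = {patient}, set()
    relations = {  # hypothetical JSON key -> relation name
        "Diagnosis": "diagnosed_with",
        "Treatment": "treated_with",
        "Exam": "underwent",
    }
    for key, relation in relations.items():
        value = case.get(key)
        if value:
            node = (key.lower(), value)
            nodes.add(node)
            edges.add((patient, relation, node))
    return nodes, edges


# Hypothetical case with a missing Treatment key.
nodes, edges = case_to_graph(
    {"U_id": "c1", "Diagnosis": "meningioma", "Exam": "MRI"}
)
```

Because nodes are keyed by type and value, two cases sharing a diagnosis end up connected to the same node once their graphs are merged, which is the kind of pattern that is hard to see in separate documents.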

5.1 Data and Code Availability

5.1.1 Acknowledgements

This work is supported by the cup project B73C22000810001, project code ECS_00000022 “SAMOTHRACE” (Sicilian MicronanoTech Research And Innovation Center).

5.1.2

The authors declare that they have no relevant or material financial interests that relate to the research described in this paper.

References

  • [1] Chaudhry, H.A.H., Renzulli, R., Perlo, D., Santinelli, F., Tibaldi, S., Cristiano, C., Grosso, M., Limerutti, G., Fiandrotti, A., Grangetto, M., et al.: Unitochest: A lung image dataset for segmentation of cancerous nodules on ct scans. In: International Conference on Image Analysis and Processing. pp. 185–196. Springer (2022)
  • [2] Demner-Fushman, D., Kohli, M.D., Rosenman, M.B., Shooshan, S.E., Rodriguez, L., Antani, S., Thoma, G.R., McDonald, C.J.: Preparing a collection of radiology examinations for distribution and retrieval. Journal of the American Medical Informatics Association 23(2), 304–310 (2016)
  • [3] Gava, U., D’Agata, F., Bennink, E., Tartaglione, E., Perlo, D., Vernone, A., Bertolino, F., Ficiarà, E., Cicerale, A., Pizzagalli, F., Guiot, C., Grangetto, M., Bergui, M.: Unitobrain (2021). https://doi.org/10.21227/x8ea-vh16, https://dx.doi.org/10.21227/x8ea-vh16
  • [4] He, X., Zhang, Y., Mou, L., Xing, E., Xie, P.: Pathvqa: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286 (2020)
  • [5] Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6(1), 317 (2019)
  • [6] Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Scientific data 5(1), 1–10 (2018)
  • [7] Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C.H., Kang, J.: Biobert: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4), 1234–1240 (Feb 2020). https://doi.org/10.1093/bioinformatics/btz682, arXiv:1901.08746 [cs]
  • [8] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
  • [9] Lin, W., Zhao, Z., Zhang, X., Wu, C., Zhang, Y., Wang, Y., Xie, W.: Pmc-clip: Contrastive language-image pre-training using biomedical documents. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 525–536. Springer (2023)
  • [10] Liu, B., Zhan, L.M., Xu, L., Ma, L., Yang, Y., Wu, X.M.: Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In: 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI). pp. 1650–1654. IEEE (2021)
  • [11] Magnini, B., Altuna, B., Lavelli, A., Speranza, M., Zanoli, R.: The e3c project: Collection and annotation of a multilingual corpus of clinical cases. Proceedings of the Seventh Italian Conference on Computational Linguistics CLiC-it 2020 (2020), https://api.semanticscholar.org/CorpusID:229293442
  • [12] Pelka, O., Koitka, S., Rückert, J., Nensa, F., Friedrich, C.M.: Radiology objects in context (roco): a multimodal image dataset. In: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3. pp. 180–189. Springer (2018)
  • [13] Penedo, A.C.: The regulation of data spaces under the eu data strategy: Towards the ‘act-ification’ of the fifth european freedom for data? European Journal of Law and Technology 15(1) (May 2024), https://www.ejlt.org/index.php/ejlt/article/view/995
  • [14] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
  • [15] Schulz, S., Ševa, J., Rodriguez, S., Ostendorff, M., Rehm, G.: Named entities in medical case reports: Corpus and experiments. In: Proceedings of the Twelfth Language Resources and Evaluation Conference. pp. 4495–4500. European Language Resources Association, Marseille, France (May 2020), https://aclanthology.org/2020.lrec-1.553
  • [16] Subramanian, S., Wang, L.L., Mehta, S., Bogin, B., van Zuylen, M., Parasa, S., Singh, S., Gardner, M., Hajishirzi, H.: Medicat: A dataset of medical images, captions, and textual references. arXiv preprint arXiv:2010.06000 (2020)
  • [17] Terzis, P., Echeverria, E.O.S.: Interoperability and governance in the european health data space regulation. Medical Law International (Apr 2023). https://doi.org/10.1177/09685332231165692, https://journals.sagepub.com/doi/full/10.1177/09685332231165692
  • [18] Van, M.H., Verma, P., Wu, X.: On large visual language models for medical imaging analysis: An empirical study (arXiv:2402.14162) (Feb 2024). https://doi.org/10.48550/arXiv.2402.14162, http://arxiv.org/abs/2402.14162, arXiv:2402.14162 [cs]