MedInsight: A Multi-Source Context Augmentation Framework for Generating Patient-Centric Medical Responses using Large Language Models

Subash Neupane, Shaswata Mitra, Sudip Mittal, Noorbakhsh Amiri Golilarz, Shahram Rahimi, Amin Amirlatifi
Mississippi State University, 665 Perry St, Mississippi State, MS 39762, USA
sn922@msstate.edu, sm3843@msstate.edu, mittal@cse.msstate.edu, amiri@cse.msstate.edu, rahimi@cse.msstate.edu, amin@che.msstate.edu

Abstract.

Large Language Models (LLMs) have shown impressive capabilities in generating human-like responses. However, their lack of domain-specific knowledge limits their applicability in healthcare settings, where contextual and comprehensive responses are vital. To address this challenge and enable the generation of patient-centric responses that are contextually relevant and comprehensive, we propose MedInsight, a novel retrieval-augmented framework that augments LLM inputs (prompts) with relevant background information from multiple sources. MedInsight extracts pertinent details from the patient’s medical record or consultation transcript. It then integrates information from authoritative medical textbooks and curated web resources based on the patient’s health history and condition. By constructing an augmented context combining the patient’s record with relevant medical knowledge, MedInsight generates enriched, patient-specific responses tailored for healthcare applications such as diagnosis, treatment recommendations, or patient education. Experiments on the MTSamples dataset validate MedInsight’s effectiveness in generating contextually appropriate medical responses. Quantitative evaluation using the Ragas metric and TruLens for answer similarity and answer correctness demonstrates the model’s efficacy. Furthermore, human evaluation studies involving Subject Matter Experts (SMEs) confirm MedInsight’s utility, with moderate inter-rater agreement on the relevance and correctness of the generated responses.

Large Language Model (LLM), Context augmentation, Retrieval Augmented Generation, Healthcare, Patient, Caregiver

1. Introduction

In the healthcare domain, providing contextual and comprehensive medical information tailored to individual patients is crucial for enabling effective care. However, existing approaches often struggle to deliver personalized responses due to the distributed nature of medical data across multiple sources like patient records, medical literature, and online resources. While recent advances in Large Language Models (LLMs) have demonstrated their potential for understanding and communicating medical knowledge, their training objective of next-token prediction can lead to information loss, ‘memory distortion’ (peng2023check, ), and the generation of plausible but incorrect content, known as hallucinations (huang2023survey, ). These shortcomings highlight the need for techniques to augment LLMs with contextually relevant information from diverse sources to ensure the delivery of reliable, patient-centric responses.

To address the challenge of adapting LLMs for specialized domains like healthcare, two main approaches have emerged: fine-tuning and augmenting the models with external knowledge. Fine-tuning involves further training a pre-trained LLM on domain-specific data to optimize its performance for targeted applications (ovadia2023fine, ). However, this method can be computationally expensive, limited by data availability, and susceptible to catastrophic forgetting, where the model forgets previously learned knowledge (ovadia2023fine, ; kirkpatrick2017overcoming, ; goodfellow2013empirical, ; chen2020recall, ; luo2023empirical, ). An alternative approach is In-Context Learning (ICL), which aims to enhance LLM effectiveness on new tasks by modifying the input prompts without changing the model weights (chen2021meta, ; radford2019language, ; lampinen2022can, ). A prominent implementation of ICL is Retrieval Augmented Generation (RAG) (lewis2020retrieval, ; neelakantan2022text, ), where information retrieval techniques are used to extract relevant knowledge from external sources and integrate it into the LLM’s generated text. By augmenting the model’s input with retrieved contextual information, RAG can adapt LLMs to domain-specific tasks without the drawbacks associated with fine-tuning.

While a patient’s medical transcript captures their history and current condition, the background information required for comprehensive care, such as details about diseases, symptoms, diagnoses, and treatments, is often distributed across multiple sources like medical literature, clinical guidelines, and online knowledge bases. This fragmentation of relevant medical knowledge poses a significant challenge in providing personalized responses tailored to a patient’s unique context. To address this challenge and effectively leverage the disparate sources of information, we propose MedInsight, a RAG framework capable of generating tailored medical responses by augmenting a patient’s specific context from their transcript with pertinent background knowledge retrieved from various authoritative sources.

Generating patient-centric responses to medical queries requires augmenting the patient’s context with relevant knowledge extracted from authoritative sources. Unlike tasks focused on specific facts, medical questions often necessitate a deeper understanding of relationships across multiple contexts spanning the patient’s history, current symptoms, lab findings, and background medical knowledge. For instance, consider the question "For a patient experiencing a Chronic Obstructive Pulmonary Disease (COPD) exacerbation, what specific management strategies or interventions would you recommend to improve respiratory symptoms and overall lung function?" Answering this effectively demands comprehending the overall patient context from their transcript, while also incorporating pertinent information about COPD exacerbations, treatment strategies, and respiratory management from medical knowledge sources. Figure 1 illustrates how MedInsight augments context from multiple sources to craft patient-centric responses. The prompt (query) is merged with relevant information extracted from the patient’s medical transcript (i.e. patient unique context) and authoritative medical knowledge sources like textbooks. This formulates an augmented context combining the patient’s details with related medical concepts, enabling MedInsight to generate a contextually relevant, personalized response tailored to the specific patient’s needs.

Figure 1. MedInsight’s context augmentation approach for generating patient-centric responses. Patient context from medical transcripts and prompt is augmented with relevant medical knowledge from authoritative sources into a comprehensive context input to the language model, enabling personalized patient-centric response generation.

By contextualizing patient information with medical knowledge, we empower patients and caregivers with the insights and tools necessary to enhance self-care and patient care and to optimize healthcare delivery. Our RAG approach generates patient-centric responses by leveraging the patient’s unique context and extracting relevant medical knowledge based on the patient’s history for a specific input. This enables both patients and caregivers to access relevant and personalized information, facilitating informed decision-making and improving patient outcomes.

The major contributions of this paper are:

• We demonstrate the possibility of tailoring medical recommendations to a specific patient need using patient context (medical transcript and health record). This equips caregivers with the knowledge and tools necessary to elevate patient care, enhance treatment results, and optimize the efficiency of healthcare delivery.

• We build a retrieval-augmented question-answering system that generates patient-centric responses by leveraging the patient’s unique context and extracting medical knowledge based on the patient’s medical history for a specific input.

• We showcase and evaluate MedInsight’s proficiency in generating accurate and patient-centric responses through both qualitative and quantitative metrics.

The rest of the paper is organized as follows: Section 2 discusses the background and related works, and Section 3 describes MedInsight’s architecture and methodology. In Section 4 we present our experiments and evaluation and discuss our results. Finally, Section 5 concludes the paper.

2. Background and Related Works

In this section, we first provide the foundational background on Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). Subsequently, we explore relevant works that have leveraged these methodologies within the healthcare domain.

2.1. Large Language Models

Language models (LMs) are computational models that have the ability to comprehend and generate human language. They learn to model the probability distribution of text and can subsequently predict the likelihood of word sequences or generate new text based on the input (chang2023survey, ). For instance, traditional language models like N-gram (brown1992class, ) estimate the probability of a word by considering its context in preceding text.

Large Language Models (LLMs), on the other hand, are advanced language models with very large parameter counts, trained on massive crawls of Internet text, and they exhibit exceptional learning capabilities. Recently, they have emerged as pivotal catalysts in several research domains including but not limited to Natural Language Processing (NLP) (radford2019language, ; brown2020language, ), cybersecurity (mitra2024localintel, ), and recommender systems (zhang2023recommendation, ; hou2023large, ). State-of-the-art LLMs such as OpenAI GPT (radford2019language, ), Google’s PaLM (chowdhery2022palm, ), and Meta’s LLaMA 2 (touvron2023llama, ) primarily leverage the Transformer architecture (vaswani2017attention, ). Existing LLMs follow diverse transformer architectures and pre-training objectives, such as employing solely decoders (as seen in GPT-2 and GPT-3), using only encoders (as exemplified by BERT (devlin2018bert, ) and RoBERTa (liu2019roberta, )), or adopting encoder-decoder structures (as seen in BART). Models can be trained through distinct methodologies: employing an autoregressive approach, wherein the objective is to predict the subsequent word given the left-hand context; utilizing a masking technique, akin to a fill-in-the-blanks problem, where the goal is to predict masked words with context on both sides; or adopting a strategy where the sequence is intentionally corrupted, followed by the task of predicting the original sequence. These models showcase exceptional proficiency in understanding and generating language, producing responses that closely emulate human expression and intention, as exemplified by the emergence of chatbot applications such as ChatGPT. LLMs effectively apply their acquired knowledge and reasoning abilities to handle various downstream tasks including Named Entity Recognition (NER), text summarization, question-answering, and more. Figure 1(a) depicts a base LLM for these downstream tasks. This capability is attributed to inherent features of LLMs such as prompting or in-context learning (brown2020language, ), achieved through the provision of appropriate instructions or prompts (zhu2023large, ).

2.2. Retrieval Augmented Generation

Pre-trained LLMs, including models such as GPT and LLaMA, have demonstrated their capability to acquire thorough and detailed knowledge from training data. They leverage this acquired knowledge to generate text based on the provided input. Nevertheless, these models face significant limitations that impede wider deployment. For example, they may produce predictions that, while appearing plausible, are not factual, commonly referred to as hallucinations (huang2023survey, ). This tendency is particularly pronounced when queries go beyond the model’s training data or require current and updated information. To address this limitation, a promising approach is Retrieval Augmented Generation (RAG), introduced by (lewis2020retrieval, ) in mid-2020. This method integrates external data retrieval into the generative process, thereby augmenting the model’s capacity to offer precise and pertinent responses. An illustration of a simple RAG can be observed in Figure 1(b).

RAG comprises three essential elements: a knowledge database, a retriever, and an LLM. The knowledge database can house an extensive array of texts sourced from diverse outlets, tailored to the specific domain. For instance, in the medical domain, it might encompass information about medical conditions (diseases), their symptoms, preventive measures, diagnoses, and recommended medications, among other relevant content. The retriever employs a text encoder to compute an embedding vector for each text in the knowledge database. When presented with a user’s query like “What is Kawasaki disease and how does it impact my child’s health?”, the retriever utilizes the text encoder to output an embedding vector for the question. Depending on the implementation, the retrieval component may be based on Dense Passage Retrieval (DPR) (karpukhin2020dense, ) and may follow a bi-encoder architecture. Next, the $k$ texts in the medical knowledge database whose embeddings have the largest similarity (e.g., cosine similarity) to the embedding of the given question are extracted; these are referred to as the retrieved texts. Subsequently, these $k$ retrieved texts serve as the augmented context for the LLM to generate an answer to the provided question.
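The retrieval step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `embed` function is a toy bag-of-words stand-in for a real text encoder, the function names and the three knowledge-base sentences are hypothetical, and none of this is the implementation used in our experiments.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (hypothetical stand-in for a real text encoder)."""
    tokens = [t.strip(".,?!").lower() for t in text.split()]
    return Counter(t for t in tokens if t)

def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve_top_k(query, knowledge_base, k=2):
    """Return the k texts whose embeddings are most similar to the query embedding."""
    q_vec = embed(query)
    scored = [(cosine_similarity(q_vec, embed(doc)), doc) for doc in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]

# Illustrative knowledge base (assumed example texts, not taken from our corpus).
knowledge_base = [
    "Kawasaki disease causes fever and inflammation of blood vessels in children.",
    "Asthma is a chronic condition that narrows the airways.",
    "Treatment of Kawasaki disease often includes immunoglobulin and aspirin.",
]
top_docs = retrieve_top_k(
    "What is Kawasaki disease and how does it impact my child?", knowledge_base, k=2
)
```

In a real pipeline the toy encoder would be replaced by a dense embedding model and the linear scan by a vector index, but the ranking logic stays the same.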

2.3. LLMs in Healthcare

In the realm of Natural Language Processing (NLP), LLMs have sparked a revolution with their outstanding performance (bakker2022fine, ) in various tasks like summarization, question-answering, and Natural Language Generation (NLG) (bubeck2023sparks, ). Their versatile utility is prompting researchers to actively explore potential applications in the healthcare domain. This is evidenced by the success of ChatGPT in attaining a passing grade in United States Medical Licensing Examinations (USMLE) (kung2023performance, ). Additionally, a version of Med-PaLM2 that was fine-tuned using medical data has recently achieved state-of-the-art results, attaining the level of expertise demonstrated by human clinicians (singhal2023towards, ). Although LLMs have impressive capacities to generate human-like responses for different downstream tasks such as task-oriented question and answering, applying LLMs to medical domains is still challenging. This is because an LLM may lack comprehensive expertise in medical knowledge to which it has not been exposed. For instance, a model trained exclusively on texts written by William Shakespeare would struggle to perform well when queried about the symptoms of a disease (ovadia2023fine, ). In order to mitigate this issue, researchers are currently augmenting LLMs with external knowledge. Zakka et al., (zakka2023almanac, ) introduced Almanac, a framework augmented with retrieval capabilities for medical guidelines and treatment recommendations. This framework was created to respond to 130 clinical questions formulated by a panel of five board-certified clinicians and resident physicians. The results demonstrated that Almanac surpassed GPT-4 in terms of factuality, safety, and correctness. This suggests that the incorporation of retrieval systems results in more precise and dependable responses to clinical inquiries.

Likewise, the authors in (kang2023knowledge, ) presented a novel approach called KARD. This method fine-tunes small LLMs to generate rationales obtained from language models with augmented knowledge retrieved from an external knowledge base or from a non-parametric memory. Lozano et al. (lozano2023clinfo, ), on the other hand, introduced Clinfo.ai, an open-source workflow that incorporates end-to-end retrieval-augmented LLM chains. This workflow is specifically designed for querying, evaluating, and synthesizing medical literature into concise summaries to address questions on demand. To enhance the accuracy of LLMs such as GPT-3/4 on biomedical data, the authors in (soong2023improving, ) employed retrieval methods. Subsequently, they qualitatively evaluated the performance of GPT-3.5 and GPT-4 against a custom RetA LLM using a set of 19 questions. The results indicated that LLMs, when used alone, exhibited more instances of hallucination compared to the customized approach. To incorporate general knowledge from LLMs into specific domains, Wang et al. (wang2023augmenting, ) introduced a method called Large-scale Language Models Augmented with medical textbooks (LLM-AMT). This method integrates medical textbooks as an external database for a question-answering system. Empirical evaluations suggested an enhancement in the accuracy of responses when utilizing LLM-AMT.

Although the approaches mentioned above have made notable strides in augmenting LLMs with external knowledge for the healthcare domain, they often rely on a single knowledge source or focus on specific medical tasks. However, in real-world healthcare scenarios, relevant information is typically distributed across multiple fragmented sources, such as patient medical records, clinical guidelines, research literature, and online knowledge bases. This data segmentation and diversity of knowledge sources pose a significant challenge in providing comprehensive and personalized medical responses tailored to a patient’s unique context and history. As depicted in Fig. 1(c), our framework addresses this challenge by integrating information from multiple sources and contextualizing patients’ details with relevant medical knowledge. This multi-source context augmentation approach enables MedInsight to overcome the limitations of relying on a single knowledge source and provides personalized medical information that accounts for the fragmented nature of healthcare data.

3. MedInsight Architecture and Methodology

The architecture of our framework, referred to as MedInsight, is delineated across three integral phases, each playing a distinct role in the functioning of the framework, as illustrated in Figure 3. The three phases encompass patient context retrieval, medical knowledge retrieval and response generation. The initial phase revolves around patient context retrieval, specifically focusing on harnessing the patient’s medical transcript to construct a comprehensive and patient-specific knowledge repository. In the subsequent phase, the framework extends its scope to encompass medical knowledge retrieval, leveraging trusted medical sources such as medical textbooks and web platforms (e.g., Mayo Clinic and WebMD). This phase ensures that our framework is well-versed in a variety of subject matter related to medical conditions, treatments, drugs, and general health and wellness. The final phase, response generation, is the synthesis point where the accumulated contextual knowledge from both patient data or history and medical knowledge converges to craft a contextually relevant patient-centric response. This three-phase approach guarantees a robust and adaptive framework capable of delivering contextual and relevant information to meet the diverse needs of patients and caregivers.

In the following subsections, we explain each of these phases in greater detail.

3.1. Patient Context Retrieval

In the healthcare domain, healthcare providers and patients engage in dialogues to address the health concerns of the patient. Subsequently, these conversations are transcribed to ensure the accuracy and comprehensiveness of patient records. Fig. 4(A) provides a glimpse of a doctor-patient interaction. Within this context, a medical transcript can be defined as a written or typed document that meticulously records a patient’s medical history, diagnosis, treatment, and pertinent details. These transcripts are crafted based on the notes taken by healthcare professionals, including doctors, nurses, or medical transcribers, and may be derived from recordings of the interactions that occur during a patient’s engagement with the healthcare provider. A representative instance of a transcribed medical document is presented in Figure 4(B). During this phase, we first convert the unstructured medical transcript into a structured format. This transformation is accomplished through a zero-shot prompting strategy using OpenAI’s GPT-3.5-Turbo, accessed through API calls. The process involves categorizing and annotating the unstructured medical transcript into three distinct categories: patient history and symptoms, executed diagnostics, and prescribed medications along with further instructions. Figure 4(C) provides an illustrative example of a pre-processed and annotated unique patient context resulting from this process. Further insights into the data pre-processing techniques and strategies employed for this undertaking are discussed in Section 4.1.1.

The second step in this phase is the transformation of the structured and annotated transcript into a vector database. This process relies on two crucial components: an embedding model and a vector database. Embedding refers to the technique used for representing natural language words and documents in a way that captures their meanings. This representation is typically a real-valued vector in low-dimensional space. In our MedInsight framework, the embedding model is used to vectorize the pre-processed patient’s unique contexts so that they can be stored in a vector database. This allows performing a semantic search for relevant knowledge retrieval. A Vector Database (VD), in turn, is a specific type of database that stores data in the form of high-dimensional vectors. These vectors are mathematical representations derived from raw data, such as unstructured texts, that capture word-level semantic meaning. The advantage of a vector database is that it facilitates rapid similarity search and retrieval of data based on their vector distances, in contrast to traditional databases that search based on exact matches or predefined criteria. In our MedInsight framework, Chroma DB (Chroma, website: trychroma.com), an in-memory VD, allows storing and retrieving the most relevant pre-processed patient unique context based on the semantic or contextual meaning of the prompt.

Figure 4. Illustration of the transformation of the doctor-patient interaction into the patient’s specific context: A represents the interaction between the doctor and patient, B denotes an unstructured medical transcript, and C indicates the annotated structured representation of the patient’s unique context.

3.2. Medical Knowledge Retrieval

A crucial phase in MedInsight’s framework is the comprehensive augmentation of the patient’s context with relevant medical knowledge retrieved from authoritative sources like textbooks and trusted web platforms (e.g., Mayo Clinic, WebMD). Medical textbooks serve as authoritative repositories of factual information about conditions, symptoms, diagnoses, and treatments, while web platforms provide a diverse array of healthcare resources accessible to both professionals and the general public. This context augmentation phase takes the relevant chunks extracted from the patient’s medical transcript in the initial phase and combines them with the user’s query to establish a broader context. MedInsight then leverages this comprehensive context to retrieve the most pertinent medical knowledge from the available sources.

Figure 5. An illustrative example of retrieving medical knowledge that is tailored to a patient’s unique context when responding to a medical query.

To illustrate, Fig. 5 provides an example of medical knowledge retrieval based on a patient’s unique context and a specific medical query. In this particular scenario, the patient’s unique context involves a male subject experiencing an allergic reaction, for which he is prescribed an EpiPen® for emergencies. However, the instructions on how to use it are not specified in this patient’s context. The medical knowledge retriever then searches for information on this topic and returns the most relevant documents based on the patient’s context and medical query. The capability of MedInsight to search for pertinent information based on the provided context and query makes it a comprehensive and supportive technology for both patients and caregivers. The results are personalized, as evident in the figure, where a patient is seeking medical advice tailored to his specific context.

However, traditional RAG pipelines can encounter challenges with context overflow when retrieving evidence from various sources. When documents are ingested into the storage system, the specific queries it will later face are generally unknown, so the information most relevant to a query may be buried within otherwise irrelevant text. Passing full documents to the LLM can be computationally expensive and may degrade performance. To address this, MedInsight employs contextual compression retrievers available through LangChain, as depicted in Fig. 6. Instead of returning entire documents, these retrievers compress the content using the query context, extracting only the most relevant portions. This contextual compression ensures the extraction of concise, relevant medical knowledge, facilitating the subsequent phase of patient-centric response generation in MedInsight.
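The idea behind contextual compression can be illustrated with a small extractive filter. The sketch below is not LangChain's API and not our implementation: `compress_document`, the stopword list, and the word-overlap criterion are hypothetical simplifications (LLM-based compressors are typically used in practice), and the example passage is illustrative only.

```python
import re

# Small illustrative stopword list (an assumption; real systems use richer signals).
STOPWORDS = {"a", "an", "the", "is", "of", "to", "in", "and", "how",
             "do", "i", "my", "for", "it", "be", "on"}

def content_words(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOPWORDS}

def compress_document(document, query, min_overlap=1):
    """Keep only sentences sharing at least `min_overlap` content words with the query."""
    query_words = content_words(query)
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = [s for s in sentences if len(content_words(s) & query_words) >= min_overlap]
    return " ".join(kept)

document = (
    "Epinephrine auto-injectors are used for severe allergic reactions. "
    "The clinic is open on weekdays. "
    "Inject the auto-injector into the outer thigh and seek emergency care."
)
query = "How do I use the prescribed auto-injector in an emergency?"
compressed = compress_document(document, query)
```

Here the sentence about clinic hours is dropped because it shares no content word with the query, so only the query-relevant portion of the retrieved document would be passed on to the generator.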

3.3. Response Generation

The conclusive phase in the MedInsight framework is response generation, tasked with generating comprehensive, patient-centric responses by leveraging the augmented context obtained from multiple sources in the preceding phases. As discussed earlier, MedInsight augments the patient’s context extracted from their medical transcript by retrieving relevant medical knowledge from authoritative textbooks and trusted web platforms using contextual compression techniques. This context augmentation process ensures that the response generation phase has access to a comprehensive, personalized context that combines the individual’s unique medical details with pertinent background information tailored to their specific condition and query.

Utilizing an LLM, this final phase tailors responses to individual patients or caregivers based on user prompts and the augmented context comprising retrieved medical documents and the patient’s context. Our approach integrates both proprietary and open-source LLMs for response generation. Initially, we employ OpenAI’s GPT-3.5-Turbo model to generate responses for 100 sets of questions (refer to Section 4.1 for details on selected medical specialties and corresponding questions). Next, we employ the same pipeline to generate responses using Mistral-7B-Instruct, facilitating a comparative assessment of closed-source versus open-source models in patient-centric response generation tasks. The findings of this evaluation are presented in Section 4.3.1. Fig. 7 illustrates an instance of a patient-centric response generated by MedInsight, addressing a medical query: “How do I use the prescribed EpiPen® in case of an emergency?” The response incorporates the patient’s medical context extracted from their transcript, relevant evidence retrieved from external knowledge sources based on that context, and the query itself. The contextual aspects integrated from multiple sources are highlighted in red for clarity.

Figure 7. Illustrating a patient-centric response generated by MedInsight, this example highlights how the patient’s context, extracted medical knowledge, and their medical query are contextualized, as indicated by the red highlights.

4. Experimental Results and Discussion

Our framework, MedInsight, can function as a personalized support system for patients, aiding them in comprehending their medical conditions while also offering broader medical information on general topics. Additionally, caregivers can benefit from this system by posing personalized questions related to a patient, thereby optimizing caregiving. Here, we present a proof-of-concept experiment demonstrating the feasibility of our framework.

4.1. Dataset Preparation and Description

Securing access to publicly available medical data poses a challenge due to the stringent privacy regulations imposed by HIPAA. Because of this, in this study, we leverage the medical transcripts sourced from MTSamples (mtsamples, ) as the foundational dataset (or patient context) for our investigation.

The MTSamples dataset is fully synthetic, meaning it contains no actual patient information and has been artificially generated. This repository comprises transcriptions of 5000 medical reports spanning a diverse array of over 40 medical specialties. From this range, we chose 10 specialties for focused analysis: Allergy/Immunology, Pulmonary/Cardiovascular, Dermatology, Gastroenterology, General Medicine, Orthopedic, Neurology, Podiatry, Urology, and Pediatrics – Neonatal, as depicted in Table 1. Within each category, we select 5 transcripts at random, each representing a patient’s unique context. We then generate questions for each context, employing a zero-shot prompting strategy with GPT-3.5-Turbo. This approach allows us to create a pair of synthetic questions for every distinct patient context. In total, we compile a collection of approximately 100 questions, each uniquely tailored to individual patients. The selected transcripts encompass a wide age spectrum, ranging from infants aged 2 months to adults aged 93, and they represent both genders. Covering a wide range of ages and both genders makes our dataset diverse, which in turn makes MedInsight more comprehensive.

For medical knowledge retrieval we carefully curated medical textbooks for each medical speciality to make our retriever comprehensive (see Section 3.2 for details). In total, we collect 12 medical textbooks in PDF format, with 10 directly related to specific medical specialties and 2 serving as medical encyclopedias. Notably, some of these books exceeded 8000 pages. The total token counts for the books are presented in Table 1. For a detailed insight into pre-processing via the splitting and chunking strategies employed for these extensive texts please refer to Section 4.2.

4.1.1. Dataset Pre-processing

The initial stage involves the preprocessing of synthetic raw data obtained from the MTSamples dataset. This raw data comprises summaries of patient contexts devoid of annotations. Through contextual analysis, we categorize the data into three distinct groups: patient history, executed diagnostics, and prescribed medications and further instructions. This categorization serves as a pivotal factor in guiding MedInsight’s retrievers to precisely identify pertinent information within the text documents. Consequently, this approach facilitates the generation of contextually relevant, patient-centric responses. For the preprocessing of raw data, we employed a zero-shot prompting strategy, instructing the GPT-3.5-Turbo model to annotate the provided patient context into the aforementioned three categories. Table 2 illustrates an example of a patient context and its annotated output after applying the zero-shot prompting technique.
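To make the annotation step more concrete, the following sketch builds a zero-shot prompt over the three categories and splits a model response back into those categories. It is a hedged illustration only: the function names and the parsing logic are hypothetical, the model call itself is deliberately omitted, and `mock_response` is a hand-written stand-in loosely modeled on the example in Table 2 rather than actual model output.

```python
CATEGORIES = [
    "Patient history and symptom",
    "Executed diagnostics",
    "Prescribed medications & Instruction",
]

def build_annotation_prompt(transcript):
    """Assemble a zero-shot prompt asking for a summary under the three categories."""
    bullet_list = "\n".join(f"- {name}" for name in CATEGORIES)
    return (
        "Given the following medical transcript of a patient, create a detailed "
        "summary by categories. The summary should be divided into the following "
        f"categories:\n{bullet_list}\n\nMedical Transcript:\n{transcript}"
    )

def parse_annotation(response):
    """Split a response into categories by locating each 'Category:' heading."""
    positions = []
    for name in CATEGORIES:
        idx = response.find(name + ":")
        if idx != -1:
            positions.append((idx, name))
    positions.sort()
    annotated = {}
    for i, (idx, name) in enumerate(positions):
        start = idx + len(name) + 1  # skip the heading and its colon
        end = positions[i + 1][0] if i + 1 < len(positions) else len(response)
        annotated[name] = response[start:end].strip(" \n•-")
    return annotated

prompt = build_annotation_prompt("An 83-year-old diabetic female presents today ...")

# Hand-written mock of a model response (assumption, for illustration only).
mock_response = (
    "• Patient history and symptom: 83-year-old diabetic female requesting foot care.\n"
    "• Executed diagnostics: Onychocryptosis diagnosed on examination.\n"
    "• Prescribed medications & Instruction: None stated in the transcript."
)
annotated_context = parse_annotation(mock_response)
```

Storing the result as a dictionary keyed by category is one simple way to keep the annotated context structured before it is chunked and embedded.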

Table 1. Medical speciality dataset curated from MTSamples.

Medical Speciality	# of Transcript(s)	# Selected	# of Token(s)
Allergy / Immunology	7	5	335,807
Pulmonary / Cardiovascular	372	5	643,733
Dermatology	29	5	652,882
Gastroenterology	230	5	108,893
General Medicine	259	5	541,243
Orthopedic	355	5	481,512
Neurology	223	5	1,921,963
Podiatry	47	5	82,270
Urology	158	5	791,163
Pediatrics - Neonatal	70	5	1,356,563
Total	1750	50	6,263,147

4.2. Implementation Details

In the course of our experiments, we employed OpenAI’s proprietary GPT-3.5-Turbo and the open-source Mistral-7B-Instruct as the foundational generators for our framework. The models’ temperature was intentionally set to 0 to minimize randomness in the responses. GPT-3.5-Turbo, accessed through its API, handled the patient context annotation task using zero-shot prompting strategies. For both patient context retrieval and medical knowledge retrieval, we constructed a vector database using Chroma; however, the chunking and splitting strategies differed. The patient context, being relatively smaller in size, was split into chunks of size 500 with an overlap of 200. In contrast, the medical context, curated from web platforms and medical textbooks, was significantly larger, with some medical books comprising more than 8 thousand pages. Hence, we opted for a chunk size of 2500 with an overlap of 500 for the medical knowledge retriever. The base embedding model in our RAG pipeline is text-embedding-ada-002. During the evidence retrieval stage, we utilized the contextual compression retriever available through LangChain. Instead of immediately returning retrieved documents as-is, this retriever compresses them using the context of the given query, ensuring that only relevant information is returned. Such retrievers enhance the efficiency and effectiveness of the document retrieval process, resulting in better user experiences and optimized resource utilization. The generation of contextual, patient-centric responses involved both GPT-3.5-Turbo and an 8-bit quantized Mistral-7B-Instruct model. The experiment was conducted in the Google Colab Jupyter environment with the standard CPU runtime for GPT-3.5-Turbo. Mistral-7B-Instruct was downloaded from Hugging Face (mistral, ) and run locally. We utilized an Intel i9-12900 CPU, a GeForce RTX™ 3090 Ti GPU with 24 GB of memory, and 128 GB of RAM to run Mistral. A high-level overview of our algorithm is provided in Algorithm 1.
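The two chunking configurations can be illustrated with a simple sliding-window splitter. This is a sketch under assumptions: we read the reported settings as a chunk size of 500 with an overlap of 200 for patient context and 2500 with 500 for medical knowledge, the unit is taken to be characters, and `split_with_overlap` is a hypothetical helper rather than the text splitter used in our pipeline.

```python
def split_with_overlap(text, chunk_size, overlap):
    """Split text into windows of `chunk_size` characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
        start += step
    return chunks

# Synthetic placeholder texts (assumption: lengths chosen only for illustration).
patient_text = "".join(chr(97 + i % 26) for i in range(1200))
medical_text = "y" * 6000

patient_chunks = split_with_overlap(patient_text, chunk_size=500, overlap=200)
medical_chunks = split_with_overlap(medical_text, chunk_size=2500, overlap=500)
```

The overlap means consecutive chunks repeat a portion of text, which reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary, at the cost of storing some content twice.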

Algorithm 1 MedInsight Framework
Input: Prompt $p$
Require: a patient context $D_{p}$, a patient context retriever $R_{p}$, medical knowledge $D_{m}$, a medical context retriever $R_{m}$, retrieval augmented generator $G$
Variables: $D_{p}R_{p}$ is retrieved patient context, $D_{m}R_{m}$ is retrieved medical knowledge
Output: Patient-centric response ($C$)
1: Request $D_{p}$ with prompt $p$;
2: Retrieve $D_{p}$ with patient retriever $R_{p}$ -> Return relevant patient context $D_{p}R_{p}$;
3: Request $D_{m}$ with prompt $p$;
4: Retrieve $D_{m}$ with medical retriever $R_{m}$ -> Return relevant medical knowledge $D_{m}R_{m}$;
5: Concatenate $D_{p}R_{p} \oplus D_{m}R_{m} \oplus p$;
6: Use $G$ to generate patient-centric response using concatenated context -> $G(D_{p}R_{p} \oplus D_{m}R_{m} \oplus p)$;
7: return $C$
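The control flow of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not our implementation: the keyword-overlap retriever stands in for the Chroma-backed retrievers, the generator is any callable that maps a string to a string, and all names (`make_keyword_retriever`, `medinsight_respond`) and the section labels in the concatenated context are hypothetical.

```python
from typing import Callable, List

Retriever = Callable[[str], List[str]]
Generator = Callable[[str], str]


def make_keyword_retriever(documents: List[str], top_k: int = 2) -> Retriever:
    """Toy stand-in for R_p / R_m: rank documents by the number of words shared with the prompt."""
    def retrieve(prompt: str) -> List[str]:
        query = set(prompt.lower().split())
        scored = [(len(query & set(doc.lower().split())), doc) for doc in documents]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        # Keep only documents that share at least one word with the prompt.
        return [doc for score, doc in ranked[:top_k] if score > 0]
    return retrieve


def medinsight_respond(
    prompt: str,
    patient_retriever: Retriever,
    medical_retriever: Retriever,
    generator: Generator,
) -> str:
    """Follow steps 1-7 of Algorithm 1 with caller-supplied components."""
    patient_context = patient_retriever(prompt)    # steps 1-2: D_p R_p
    medical_context = medical_retriever(prompt)    # steps 3-4: D_m R_m
    augmented = "\n\n".join([                      # step 5: D_p R_p (+) D_m R_m (+) p
        "Patient context:\n" + "\n".join(patient_context),
        "Medical knowledge:\n" + "\n".join(medical_context),
        "Question:\n" + prompt,
    ])
    return generator(augmented)                    # steps 6-7: C = G(...)
```

Passing the retrievers and the generator as arguments keeps the sketch independent of any particular vector store or LLM client, so either GPT-3.5-turbo or Mistral-7B-Instruct could be wrapped as the `generator` callable.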

4.3. Evaluation

Table 2. Example highlighting the transformation of unstructured patient context into a structured format using a zero-shot prompting strategy. Due to space limitations, the table presents only a snapshot of the actual data.

Prompt Template	Unstructured Patient Context	Annotated Unique Patient Context
prompt_template = """Given the following {medical transcript} of a patient, create a detailed summary by categories. The summary should be divided into the following categories: • Patient history and symptom • Executed diagnostics • Prescribed medications & Instruction: Medical Transcript: {transcript1.txt} """	An 83-year-old diabetic female presents today stating that she would like diabetic foot care.,O - ,On examination, the lateral aspect of her left great toenail is deeply ingrown. Her toenails are thick and opaque. Vibratory sensation appears to be intact. Dorsal pedal pulses are 1/4. There is no hair growth seen on her toes, feet or lower legs. Her feet are warm to the touch…discolored	• Patient history and symptom: The patient is an 83-year-old diabetic female presenting with a complaint of diabetic foot care…discolored. • Executed diagnostics: …the diagnosis made is onychocryptosis,… • Prescribed medications & Instruction: The transcript does not … prescribed medications…
prompt_template = """Given the following {medical transcript} of a patient, create a detailed summary by categories. The summary should be divided into the following categories: • Patient history and symptom • Executed diagnostics • Prescribed medications & Instruction: Medical Transcript: {transcript2.txt} """	The patient is admitted for shortness of breath, continues to do fairly well. The patient has chronic atrial fibrillation, on anticoagulation, INR of 1.72…cardiologist regarding aortic stenosis. She may need a surgical intervention in this regard, which I explained to her. The patient will be discharged home on medical management and she has an appointment to see her cardiologist in the next few days.,In the interim, if she changes her mind or if she has any concerns, I have requested to call me back.	• Patient history and symptom: The patient was admitted for shortness of breath…atrial fibrillation and is on anticoagulation with an INR of 1.72…severe aortic stenosis… • Executed diagnostics: Physical examination showed vital signs…systolic murmur in the aortic area,…impression was made of shortness… • Prescribed medications & Instruction: Continue current medications…cardiologist regarding aortic stenosis, as surgical intervention may be necessary.

Our evaluation study follows a dual-pronged experimental approach. For the quantitative assessment of our framework’s performance in generating contextually relevant responses, we employ a set of metrics that includes Ragas scores (es2023ragas, ) and TruLens scores (TruLens, ). Following the quantitative assessment of our model’s overall effectiveness, we proceeded with a Subject-Matter Expert (SME) evaluation study to validate the answer generation capabilities of MedInsight. Due to the resource-intensive nature of this evaluation, we engaged a panel of four medical residents. Their task involved scoring answers to 100 questions spanning all medical specialties, as explained in Section 4.1, considering two critical aspects: factual correctness and relevance to the patient’s unique context.

4.3.1. Quantitative Evaluation

To quantitatively evaluate the performance of MedInsight in generating contextually relevant patient-centric responses, we utilize the RAGAS (es2023ragas, ) and TruLens (TruLens, ) frameworks. These frameworks offer comprehensive metrics specifically designed for assessing RAG pipelines. We chose them over popular alternatives like BLEU (papineni2002bleu, ) and ROUGE (rouge2004package, ), which are not aligned with our specific context. BLEU is primarily used to evaluate machine translation tasks, while ROUGE is specialized for evaluating text summarization tasks. Both metrics focus on structural similarity between ground truth and generated sentences, which may not be suitable for our case, where sentences can be structurally different but factually similar. Traditional metrics fail to capture this nuance. Given that MedInsight predominantly functions as a RAG-based question-answering system with context mapping, these conventional metrics are not well-suited to gauge its effectiveness.
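The limitation of surface-overlap metrics can be seen in a small sketch. The function below is a ROUGE-1-style unigram F1 written from scratch for illustration; it is not the official BLEU or ROUGE implementation, and both example sentences are hypothetical rather than drawn from our data.

```python
from collections import Counter


def unigram_f1(reference: str, candidate: str) -> float:
    """F1 over the bag of lowercase whitespace tokens shared by reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())  # clipped token matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Two hypothetical answers that convey similar advice with no words in common.
reference = "the patient should continue anticoagulation therapy"
paraphrase = "she needs to keep taking her blood thinner"

identical_score = unigram_f1(reference, reference)
paraphrase_score = unigram_f1(reference, paraphrase)
```

Because the two sentences share no tokens, the overlap score for the paraphrase is zero even though a clinician would read them as similar advice, which is the kind of case where embedding-based answer similarity is more informative.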

Table 3. Quantitative evaluation using both proprietary and open-source LLMs for the answer generation task. Under Ragas, gpt-3.5-turbo has a slight edge on answer similarity (0.93 vs. 0.92) and scores about 7 percentage points higher than Mistral-7B-Instruct on answer correctness (0.84 vs. 0.77).

Evaluation Framework	Model	Average Similarity Score	Average Correctness Score
Ragas	gpt-3.5-turbo	0.93	0.84
Ragas	Mistral-7B-Instruct	0.92	0.77
TruLens	gpt-3.5-turbo	0.90	-
TruLens	Mistral-7B-Instruct	0.91	-

For answers generated by GPT-3.5-turbo, we report a mean RAGAS score of 0.93 for answer similarity and 0.8409 for answer correctness. Figure 7(a) depicts the answer similarity scores, and Figure 7(b) presents the answer correctness scores for answers generated by GPT-3.5-Turbo. For Mistral-7B-Instruct, we report a mean score of 0.92 for answer similarity and 0.77 for answer correctness. Figure 7(c) illustrates the answer similarity scores, while Figure 7(d) presents the answer correctness scores for responses generated by Mistral-7B-Instruct. Similarly, for TruLens, we report an average answer similarity score of 0.91 for GPT-3.5-Turbo generated answers and 0.90 for Mistral-7B generated answers. For both frameworks, the scores range between 0 and 1, where 1 signifies optimal generation.

The results in Table 3 suggest that GPT-3.5-turbo holds a slight advantage over Mistral-7B-Instruct in Ragas answer similarity. It also scores about 7 percentage points higher on answer correctness (0.84 vs. 0.77). Overall, the results indicate that MedInsight delivers contextually relevant answers in retrieval-augmented question-answering tasks.

4.3.2. Qualitative Evaluation

After evaluating the effectiveness of our model through quantitative metrics, we proceed to a human evaluation study to validate MedInsight’s proficiency in the patient-centric response generation task. Given the substantial cost associated with this evaluation, we engage a panel of 4 medical professionals, comprising medical doctors and residents, to evaluate the generated responses using a scoring mechanism. Evaluation is conducted on two key aspects: factual correctness and contextual relevancy. The first criterion measures the accuracy and relevance of the generated answer to the given question, while the second criterion assesses the contextual appropriateness of the retrieved information, taking into account both the question and the patient’s unique context.

Medical experts were provided with 10 distinct medical specialties, each comprising 5 unique patient contexts with 2 question-answer pairs per context (see Section 4.1 for details about the questions). Their assignment was to manually evaluate the quality of the answers generated across a total of 100 question-answer pairs. To evaluate factual correctness and contextual relevance, a 5-point Likert scale (allen2007likert, ) was employed, ranging from 1 (indicating "Factually Incorrect and Contextually irrelevant") to 5 (indicating "Factually Accurate and Contextually relevant"). Medical professionals were instructed to score responses on this scale. The average score obtained was 4.66 out of 5. This step allowed us to establish the ground truth for quantitative evaluation. Additionally, we evaluated inter-rater agreement, as depicted in Table 4, using the Fleiss Kappa measure (mchugh2012interrater, ). The evaluation revealed a moderate overall agreement among medical experts on the generated responses, standing at 0.60 with a standard error of 0.029.

Table 4. Fleiss Multirater Kappa Analysis: Moderate agreement (0.60) among four raters evaluating 98 sets of answers across 10 medical specialties.

Overall Categories ${}^{a,b}$	Kappa ($K$)	Standard Error
Overall Agreement	0.600708945	0.029630951
a. Data contains 98 question-answer pairs evaluated by 4 raters using Fleiss Kappa.
b. Rating category values are case-sensitive.
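
For readers unfamiliar with the statistic, Fleiss' kappa can be computed from per-item category counts as in the sketch below. This is the standard textbook formula written out for illustration and is not necessarily the tool that produced Table 4; the function names and the example ratings are hypothetical and are not our study data.

```python
from collections import Counter
from typing import List, Sequence


def ratings_to_counts(ratings: Sequence[Sequence[int]], categories: Sequence[int]) -> List[List[int]]:
    """Turn per-item rater scores into a table of how many raters chose each category."""
    return [[Counter(item)[category] for category in categories] for item in ratings]


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa for a table of shape (items x categories) with a fixed number of raters per item."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    total = n_items * n_raters
    # Share of all ratings that fall into each category.
    p_category = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1)) for row in counts]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_category)
    if p_expected == 1.0:
        return float("nan")  # undefined when every rating falls in a single category
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical Likert scores (1-5) from four raters on five answers.
example_ratings = [[5, 5, 5, 4], [4, 4, 5, 4], [5, 5, 5, 5], [3, 4, 4, 4], [5, 4, 5, 5]]
example_kappa = fleiss_kappa(ratings_to_counts(example_ratings, categories=[1, 2, 3, 4, 5]))
```

Kappa corrects the raw agreement for the agreement expected by chance, which is why a panel that mostly assigns high scores can show a high average rating while kappa remains only moderate.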

5. Conclusion and Future Work

Large Language Models possess remarkable capabilities in generating contextual responses. However, their application in the healthcare domain is limited due to deficiencies in domain-specific knowledge. To address this, we have developed a novel framework called MedInsight, aiding patients in better understanding their medical history, diagnosis, and prescriptions through retrieval-augmented question-answering. The main objective of MedInsight is to empower patients with insights for improving and optimizing both patient care and healthcare delivery.

To achieve this, MedInsight employs a context augmentation approach that combines medical knowledge from multiple sources like medical textbooks and web platforms with patients’ unique medical context from their transcripts. The developed method comprises three stages: First, relevant details from medical transcripts and health records are extracted to understand the patient’s context. Second, MedInsight retrieves trusted and relevant clinical information from external resources like WebMD and Mayo Clinic to augment the patient’s context. Finally, the augmented context, encompassing the patient’s details and retrieved medical knowledge, is utilized to generate patient-centric responses to the user prompt.

We evaluated the effectiveness of MedInsight’s context augmentation approach using the MTSamples dataset with ten medical conditions and fifty unique patient contexts. Results demonstrated MedInsight’s efficacy in generating contextually relevant responses. The RAGAS framework revealed promising scores for answer similarity (0.93 for GPT-3.5-turbo and 0.92 for Mistral-7B-Instruct) and answer correctness (0.84 for GPT-3.5-turbo and 0.77 for Mistral-7B-Instruct).

In the future, we aim to further optimize the retrievers and to investigate the effectiveness of the RAG approach compared with fine-tuning when augmenting context from multiple sources to generate contextually relevant patient-centric responses.

Acknowledgments

This work was supported by PATENT Lab (Predictive Analytics and TEchnology iNTegration Laboratory) at the Department of Computer Science and Engineering, Mississippi State University. The authors would like to thank the SMEs for their assistance in the qualitative evaluation. The views and conclusions are those of the authors.

References

[1] Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
[2] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232, 2023.
[3] Oded Ovadia, Menachem Brief, Moshik Mishaeli, and Oren Elisha. Fine-tuning or retrieval? comparing knowledge injection in llms. arXiv preprint arXiv:2312.05934, 2023.
[4] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
[5] Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211, 2013.
[6] Sanyuan Chen, Yutai Hou, Yiming Cui, Wanxiang Che, Ting Liu, and Xiangzhan Yu. Recall and learn: Fine-tuning deep pretrained language models with less forgetting. arXiv preprint arXiv:2004.12651, 2020.
[7] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv preprint arXiv:2308.08747, 2023.
[8] Yanda Chen, Ruiqi Zhong, Sheng Zha, George Karypis, and He He. Meta-learning via language model in-context tuning. arXiv preprint arXiv:2110.07814, 2021.
[9] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
[10] Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329, 2022.
[11] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
[12] Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, et al. Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005, 2022.
[13] Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Kaijie Zhu, Hao Chen, Linyi Yang, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
[14] PF Brown, PV DeSouza, RL Mercer, VJ Della Pietra, and JC Lai. Class-based n-gram models of natural language. Comput. Linguist, (1950), 1992.
[15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
[16] Shaswata Mitra, Subash Neupane, Trisha Chakraborty, Sudip Mittal, Aritran Piplai, Manas Gaur, and Shahram Rahimi. Localintel: Generating organizational threat intelligence from global and local cyber knowledge. arXiv preprint arXiv:2401.10036, 2024.
[17] Junjie Zhang, Ruobing Xie, Yupeng Hou, Wayne Xin Zhao, Leyu Lin, and Ji-Rong Wen. Recommendation as instruction following: A large language model empowered recommendation approach. arXiv preprint arXiv:2305.07001, 2023.
[18] Yupeng Hou, Junjie Zhang, Zihan Lin, Hongyu Lu, Ruobing Xie, Julian McAuley, and Wayne Xin Zhao. Large language models are zero-shot rankers for recommender systems. arXiv preprint arXiv:2305.08845, 2023.
[19] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
[20] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[22] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[23] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
[24] Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107, 2023.
[25] Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906, 2020.
[26] Michiel Bakker, Martin Chadwick, Hannah Sheahan, Michael Tessler, Lucy Campbell-Gillingham, Jan Balaguer, Nat McAleese, Amelia Glaese, John Aslanides, Matt Botvinick, et al. Fine-tuning language models to find agreement among humans with diverse preferences. Advances in Neural Information Processing Systems, 35:38176–38189, 2022.
[27] Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
[28] Tiffany H Kung, Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, et al. Performance of chatgpt on usmle: Potential for ai-assisted medical education using large language models. PLoS digital health, 2(2):e0000198, 2023.
[29] Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023.
[30] Cyril Zakka, Akash Chaurasia, Rohan Shad, Alex R Dalal, Jennifer L Kim, Michael Moor, Kevin Alexander, Euan Ashley, Jack Boyd, Kathleen Boyd, et al. Almanac: Retrieval-augmented language models for clinical medicine. Research Square, 2023.
[31] Minki Kang, Seanie Lee, Jinheon Baek, Kenji Kawaguchi, and Sung Ju Hwang. Knowledge-augmented reasoning distillation for small language models in knowledge-intensive tasks. arXiv preprint arXiv:2305.18395, 2023.
[32] Alejandro Lozano, Scott L Fleming, Chia-Chun Chiang, and Nigam Shah. Clinfo.ai: An open-source retrieval-augmented large language model system for answering medical questions using scientific literature. In Pacific Symposium on Biocomputing 2024, pages 8–23. World Scientific, 2023.
[33] David Soong, Sriram Sridhar, Han Si, Jan-Samuel Wagner, Ana Caroline Costa Sá, Christina Y Yu, Kubra Karagoz, Meijian Guan, Hisham Hamadeh, and Brandon W Higgs. Improving accuracy of gpt-3/4 results on biomedical data using a retrieval-augmented language model. arXiv preprint arXiv:2305.17116, 2023.
[34] Yubo Wang, Xueguang Ma, and Wenhu Chen. Augmenting black-box llms with medical textbooks for clinical question answering. arXiv preprint arXiv:2309.02233, 2023.
[35] [Online] MTSAMPLES. Transcribed medical transcription sample reports and examples. https://mtsamples.com/.
[36] [Online] Hugging Face. Thebloke/mistral-7b-instruct-v0.2-gguf. https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF.
[37] Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. Ragas: Automated evaluation of retrieval augmented generation. arXiv preprint arXiv:2309.15217, 2023.
[38] TruLens. Trulens: Don’t just vibe check your llm app!
[39] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
[40] Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. In Proceedings of Workshop on Text Summarization of ACL, Spain, volume 5, 2004.
[41] I Elaine Allen and Christopher A Seaman. Likert scales and data analyses. Quality progress, 40(7):64–65, 2007.
[42] Mary L McHugh. Interrater reliability: the kappa statistic. Biochemia medica, 22(3):276–282, 2012.