Measuring and Enhancing Trustworthiness of LLMs in RAG
through Grounded Attributions and Learning to Refuse

Maojia Song¹* Shang Hong Sim¹* Rishabh Bhardwaj¹
Hai Leong Chieu² Navonil Majumder¹ Soujanya Poria¹

¹ Singapore University of Technology and Design, ² DSO National Laboratories, Singapore
{maojia_song, shanghong_sim, rishabh_bhardwaj}@mymail.sutd.edu.sg
chaileon@dso.org.sg
{navonil_majumder, sporia}@sutd.edu.sg
* These authors contributed equally.

Abstract

LLMs are an integral part of retrieval-augmented generation (RAG) systems. While many studies focus on evaluating the quality of end-to-end RAG systems, there is a lack of research on understanding the appropriateness of an LLM for the RAG task. Thus, we introduce a new metric, Trust-Score, that provides a holistic evaluation of the trustworthiness of LLMs in an RAG framework. We show that various prompting methods, such as in-context learning, fail to adapt LLMs effectively to the RAG task. Thus, we propose Trust-Align, a framework to align LLMs for higher Trust-Score. LLaMA-3-8b, aligned with our method, significantly outperforms open-source LLMs of comparable sizes on ASQA ( $\uparrow$ 10.7), QAMPARI ( $\uparrow$ 29.2) and ELI5 ( $\uparrow$ 14.9). We release our code at: https://github.com/declare-lab/trust-align.


1 Introduction

Hallucination in Large Language Models (LLMs) is a significant concern in generative AI, where the models produce information that appears plausible but is factually incorrect Ji et al. (2023). Examples include falsely accusing individuals of crimes The Independent (2023), generating fictitious judicial cases Bohannon (2023), and creating historically inaccurate images Business Insider (2023). Such instances raise concerns about the reliability of LLMs as tools for accessing accurate information.

Rather than directly using LLMs as an information source, incorporating them into a Retrieval-Augmented Generation (RAG) framework has become a popular approach to enhance the credibility of generated information. A typical RAG system, thus, consists of a large corpus of documents, a retriever that finds the top-K reference documents relevant to a query, and an LLM that composes the response and presents it to the user in a well-formatted manner. Notably, the role of the LLM shifts from being a source of information (in a non-RAG setup) to a consolidator of the information supplied by the retriever, with consolidation conditioned on the question asked.

There has been a significant amount of research on studying and reducing hallucinations in LLMs. For instance, Bai et al. (2024) examines hallucinations due to incorrect access to parametric knowledge. However, there is a lack of understanding of how these LLMs behave when they are required to rely solely on external (non-parametric) knowledge provided to them. An early work by Gao et al. (2023b) focuses on evaluating the RAG system in an end-to-end fashion, thereby entangling the shortcomings of the retrieval with the errors in the final LLM output. Naturally, such an evaluation scheme is not conducive to isolating the role of LLMs in a RAG setup.

In this work, we propose Trust-Score—a novel holistic metric to exclusively evaluate the trustworthiness of LLMs for RAG. Trust-Score assesses an LLM across multiple dimensions: 1) The ability to discern which questions can be answered or refused based on the provided documents (Grounded Refusals); 2) Gold claim recall scores for the answerable responses (Exact Match Recall); 3) The extent to which generated claims are supported by the corresponding citations (Citation Recall); and 4) The relevance of the citations (Citation Precision).

Our investigation shows that many state-of-the-art systems, including GPT-4 and Claude-3.5-Sonnet, heavily rely on their internal parametric knowledge acquired during parameter tuning phases to answer questions OpenAI (2023); Anthropic (2024). This limits their suitability for RAG tasks, where models should base responses solely on provided documents, leading to a low Trust-Score. Moreover, prompting approaches intended to enhance model trustworthiness have been found ineffective, as the responsiveness of the models becomes overly sensitive to the prompt. This leads to extreme Answered Ratio (AR%) values, indicating indiscriminate answering or refusal.

Thus, we propose an alignment framework, Trust-Align, to tune LLMs towards generating document-grounded responses and achieving higher Trust-Score. The framework builds an alignment dataset of 19K samples, each comprising a question, documents, a positive (preferred) response $r^{+}$ , and a negative (unpreferred) response $r^{-}$ . This dataset is designed to address the five hallucination types we identify: Inaccurate Answer, Over Responsiveness, Excessive Refusal, Overcitation, and Improper Citation. First, we collect a diverse and high-quality seed set of questions $q$ , followed by gathering the relevant (oracle) documents $D$ , and then perform extensive data augmentation. Positive responses are generated by stitching the gold claims together using GPT-4, while negative responses are derived from high-ranked hallucinations of a generic RAG fine-tuned model.

Figure 1: Trust-Score.

Evaluation on the benchmark datasets shows that the models trained with Trust-Align outperform the competitive baselines w.r.t. Trust-Score: 10.73%, 29.24%, and 14.88% on ASQA, QAMPARI, and ELI5, respectively. Trust-Align significantly enhances the ability of models to correctly refuse or provide answers as compared to the baselines, with refusal metric scores increased by 9.87% for ASQA, 22.53% for QAMPARI, and 5.32% for ELI5. Moreover, Trust-Align improves citation quality, with citation groundedness scores increasing by 26.67% for ASQA, 31.96% for QAMPARI, and 29.30% for ELI5. Due to recall gamification, we observe mixed results on exact match recall: a notable increase for QAMPARI (33.23%) and ELI5 (10.04%), but a decrease of 4.34% for ASQA.

We show that Trust-Align combined with DPO improves trustworthiness more effectively than prompting or SFT methods. Our augmented data leads to significant gains in Trust-Score, with an increase of 1.50% on ASQA, 1.78% on QAMPARI, and 2.23% on ELI5. Additionally, ablation studies highlight the importance of using data specific to each hallucination subtype: removing the subsegment of data for any subtype results in a measurable decrease in Trust-Score. Moreover, we find that aligning with refusal samples in Trust-Align produces the highest Trust-Score, emphasizing the critical role of including refusal samples during training. Our key contributions in this work are as follows:

• We are the first to study hallucinations of LLMs in a RAG setup, where model responses should be exclusively grounded in retrieved documents rather than the model’s parametric knowledge.

• We define answerability, a crucial concept for determining if the provided documents are sufficient to answer the question.

• To measure LLM performance under RAG, we introduce Trust-Score, a holistic metric for quantifying LLM hallucinations in the RAG setup.

• We propose Trust-Align, an alignment framework designed to improve the trustworthiness of LLMs in RAG. It first creates an alignment dataset of 19K samples with positive (gold) and negative (unpreferred) responses, followed by applying the DPO algorithm on the model.

2 Problem Description

2.1 Task Setup

Given a question $q$ and a set of retrieved documents $\mathcal{D}$ as input, the LLM is instructed to generate a response $S$ which consists of a set of citation-grounded statements $\{s_{1},\ldots,s_{n}\}$ ; each statement $s_{i}$ is followed by a set of citations $\mathcal{C}_{i}=\{c_{i,1},c_{i,2},\ldots\}$ referring to the documents in $\mathcal{D}$ (for QAMPARI, we treat each entity in the response list as a statement). If $\mathcal{D}$ is not sufficient to answer $q$ , the gold response would be a refusal statement, such as, “I apologize, but I couldn’t find an answer to your question in the search results”.

2.2 When is refusal expected?

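To label a sample as a ground truth refusal, we first define the notion of answerability: a question $q$ is answerable if the provided documents $\mathcal{D}$ are sufficient to answer it, and unanswerable otherwise. For an unanswerable question, the ground truth is a refusal.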

A refusal response contains no claims or citations but provides a generic message conveying the LLM’s inability to respond to $q$ .

Nuances of answerability.

Determining answerability can be challenging. To determine answerability, we use a system that evaluates the entailment of gold claims against provided documents, referred to as the Natural Language Inference (NLI) system. An NLI system can range from a simple exact match (EM) identifier to an LLM or even a human evaluator, with answerability determined based on $q, D$ and biases of the NLI (for EM, the bias is that a $q$ is answerable if an exact match of claims is present in $D$ ). These biases can be useful in specific RAG applications, such as solving mathematical problems where the documents provide a formula and the question assigns values to variables. The choice of NLI depends on whether the RAG system requires the LLM to have mathematical understanding. Ideally, to prevent improper evaluations, the NLI model used to construct the gold claims should also be used to evaluate the LLM responses.

In this paper, our focus is on evaluating the generic comprehension capabilities of LLMs without specialized knowledge. Thus, we use two NLI mechanisms: 1) identifying whether an exact match of claims is present in the gold claims, and 2) using a Machine Learning (ML) model to determine if the documents can entail the gold claims. The ML-based NLI model is used for multiple purposes, such as alignment dataset construction (data/training) and evaluating generated responses (metric/testing). For this, we adopt the NLI model from Rashkin et al. (2022). $\phi(c_{ij},s_{i})=1$ if $c_{ij}$ (premise) entails $s_{i}$ (hypothesis); otherwise, 0. To determine answerability, we employ the TRUE-based method Honovich et al. (2022) to assess whether a gold claim can be entailed by a given document.

The knowledge grounding problem.

Typically, LLMs are designed to perform question-answering tasks, where response generation heavily relies on the parametric (internal) knowledge acquired during their pre-training, tuning, and alignment phases OpenAI (2023); Anthropic (2024). Thus, most of their knowledge is grounded in parametric memory. This makes them inherently less suitable for RAG applications, where the knowledge generated by the LLM is expected to be grounded in input documents. RAG is analogous to a reading comprehension task, where the answers must come from the provided passage (documents in RAG) rather than the prior knowledge of the person taking the test. Thus, any reliance on parametric knowledge can result in statements that are not fully grounded in the documents, including providing answers to unanswerable questions. Our investigation shows that state-of-the-art models, such as GPT-4 and Claude-3.5-Sonnet, overtly rely on parametric knowledge even when used in a RAG setting (we show a detailed analysis in Sections D.1 and D.2).

2.3 Hallucination in LLM in RAG

For the task of RAG, we define hallucination in an LLM as any error where the generated response is not grounded on the provided documents. We categorize hallucination into five types: (1) Inaccurate Answer - The generated statements $S$ fail to cover claims in the gold response, (2) Over Responsiveness - The model answers an unanswerable (refusal) question, (3) Excessive Refusal - The model refuses to answer an answerable question, (4) Overcitation - The model generates redundant citations, (5) Improper Citation - The model’s citation(s) do not support the statement.

Next, we introduce a comprehensive metric to effectively measure hallucinations in LLMs.

3 Metrics for LLM-in-RAG

Given a question $q$ and the corresponding ground truth response $A_{G}=\{a_{g1},\ldots,a_{gn}\}$ consisting of gold claims, we define the claims obtainable from the provided documents as $A_{D}=\{a_{d1},\ldots,a_{dn}\}$ and the claims generated in the response as $A_{R}=\{a_{r1},\ldots,a_{rn}\}$ . We aim to measure two aspects of an LLM in RAG: 1) the Correctness of the generated claims (Response Truthfulness); and 2) the Correctness of citations generated (Attribution Groundedness).

Insufficiency of the existing metrics.

The existing metric measures Response Truthfulness by first computing the per-sample exact match recall (EM_r) score for gold claims $A_{G}$ Gao et al. (2023b), disregarding how many of these claims are obtainable from $D$ . This is followed by averaging the recall scores across samples to obtain a single score for the dataset. This method introduces inconsistencies: models that rely on parametric knowledge ( $\mathcal{M}_{p}$ ) may generate gold claims not found in $D$ , leading to an artificially inflated recall value. In contrast, an ideal LLM ( $\mathcal{M}_{i}$ ) would rely solely on $D$ to generate responses (a desired trait) and would be constrained by an upper recall limit of $\frac{|A_{G}\cap A_{D}|}{|A_{G}|}$ , which varies depending on the question. This approach presents two key problems: (1) Recall Consolidation: Since the measurement range depends on the claims present in $D$ , it is infeasible to provide a consistent, consolidated EM_r score across the dataset, (2) Recall Gamification: $\mathcal{M}_{p}$ may have a higher upper limit on EM_r (up to 1) because they can generate gold claims not present in $D$ (an undesirable trait), unlike $\mathcal{M}_{i}$ that depend entirely on $D$ .
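As a hypothetical illustration of the two problems (the numbers below are invented for exposition and do not come from the datasets), consider a single question with four gold claims, of which only two are obtainable from the documents:

```latex
|A_{G}| = 4,\qquad |A_{G}\cap A_{D}| = 2
\;\Longrightarrow\;
\text{EM}_{r}(\mathcal{M}_{i}) \le \frac{|A_{G}\cap A_{D}|}{|A_{G}|} = \frac{2}{4} = 0.5,
\qquad
\text{EM}_{r}(\mathcal{M}_{p}) \le 1 .
```

On this sample the document-faithful model $\mathcal{M}_{i}$ is capped at 0.5, while $\mathcal{M}_{p}$ can reach 1 by producing the two claims absent from $D$ ; since the cap changes from question to question, averaging such scores mixes incomparable ranges.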

Answer Calibration.

To address the challenges of recall consolidation and gamification in existing evaluation metrics, we propose new metrics that measure sample-wise recall score based on the fraction of gold claims that can be obtained from $D$ . Specifically, this involves restricting the gold claims to the calibrated set $A_{G}\cap A_{D}$ and measuring the exact match (EM) recall against it. This approach sets a maximum recall limit of 1 for all models. For dataset-wide scoring, we consolidate per-sample EM recall scores using two methods: 1) EM ${}^{\alpha}_{\text{AC}}$ : The average recall score across samples answered by the LLM, i.e., samples where $A_{R}\neq\emptyset$ ; 2) EM ${}^{\beta}_{\text{AC}}$ : The average recall score across samples that are answerable, i.e., samples where ${A_{G}\cap A_{D}}\neq\emptyset$ . Notably, both EM ${}^{\alpha}_{\text{AC}}$ and EM ${}^{\beta}_{\text{AC}}$ sum over samples that are both answered and answerable, differing primarily in their normalization values. These metrics, illustrated in Fig. 1, are then combined into a single score, EM ${}^{\text{F1}}_{\text{AC}}$ , which serves as a comprehensive measure of how well the LLM grounds its claims on the document $D$ . This combined metric not only facilitates the consolidation of recall but also addresses issues related to recall gamification.
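The calibrated scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: claims are represented as plain strings in sets, the per-sample recall is taken to be $|A_{R}\cap A_{G}\cap A_{D}| / |A_{G}\cap A_{D}|$ , and the combination into EM ${}^{\text{F1}}_{\text{AC}}$ is assumed to be a harmonic mean (suggested by the name, defined precisely in Fig. 1). The function name `calibrated_em_scores` is hypothetical.

```python
from typing import List, Set, Tuple


def calibrated_em_scores(
    gold: List[Set[str]],      # A_G per sample
    in_docs: List[Set[str]],   # A_D per sample (claims obtainable from D)
    response: List[Set[str]],  # A_R per sample (empty set = refusal)
) -> Tuple[float, float, float]:
    """Return (EM_alpha, EM_beta, EM_F1) under the assumptions in the lead-in."""
    recall_sum = 0.0
    n_answered = 0
    n_answerable = 0
    for a_g, a_d, a_r in zip(gold, in_docs, response):
        calibrated = a_g & a_d            # gold claims that D can support
        answered = len(a_r) > 0
        answerable = len(calibrated) > 0
        n_answered += int(answered)
        n_answerable += int(answerable)
        # Only samples that are both answered and answerable contribute.
        if answered and answerable:
            recall_sum += len(a_r & calibrated) / len(calibrated)
    em_alpha = recall_sum / n_answered if n_answered else 0.0
    em_beta = recall_sum / n_answerable if n_answerable else 0.0
    total = em_alpha + em_beta
    # Assumption: harmonic mean of the two normalizations.
    em_f1 = 2 * em_alpha * em_beta / total if total else 0.0
    return em_alpha, em_beta, em_f1
```

Because the numerator only counts claims in $A_{G}\cap A_{D}$ , a model that produces gold claims missing from the documents gains nothing, and answering an unanswerable question lowers EM ${}^{\alpha}_{\text{AC}}$ through its denominator.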

Scoring Refusals.

An important capability of an LLM in RAG is its ability to identify when a question is unanswerable based on the provided documents $D$ . To measure this, we introduce a metric called Grounded Refusals. This metric evaluates the model’s refusal performance by calculating dataset-wide precision and recall for both ground-truth answerable cases and refusals. These values are then combined into their respective F1 scores, F1_ref for refusals and F1_ans for answerable cases. The final score, F1 ${}_{\text{RG}}$ , is the average of these two F1 scores, as shown in Fig. 1.
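A minimal sketch of this computation, following Eqs. (2) to (7), is given below. The boolean-list representation and the function names are assumptions made for illustration; an undefined precision or recall is treated as 0 here, which is also an assumption.

```python
from typing import List


def _f1(precision: float, recall: float) -> float:
    """Harmonic mean, with 0 returned when both inputs are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def grounded_refusal_f1(answerable: List[bool], answered: List[bool]) -> float:
    """F1_RG: mean of F1_ref and F1_ans over the dataset.

    answerable[i] is the ground truth for sample i; answered[i] is True
    when the model produced an answer and False when it refused.
    """
    pairs = list(zip(answerable, answered))
    n_refused = sum(1 for _, r in pairs if not r)
    n_gold_refusal = sum(1 for g, _ in pairs if not g)
    n_answered = sum(1 for _, r in pairs if r)
    n_answerable = sum(1 for g, _ in pairs if g)
    correct_refusals = sum(1 for g, r in pairs if not g and not r)
    correct_answers = sum(1 for g, r in pairs if g and r)

    p_ref = correct_refusals / n_refused if n_refused else 0.0
    r_ref = correct_refusals / n_gold_refusal if n_gold_refusal else 0.0
    p_ans = correct_answers / n_answered if n_answered else 0.0
    r_ans = correct_answers / n_answerable if n_answerable else 0.0
    return 0.5 * (_f1(p_ref, r_ref) + _f1(p_ans, r_ans))


# ASQA split used in the results table: 610 answerable, 338 unanswerable.
asqa = [True] * 610 + [False] * 338
print(round(100 * grounded_refusal_f1(asqa, [True] * 948), 2))   # always answer
print(round(100 * grounded_refusal_f1(asqa, [False] * 948), 2))  # always refuse
```

On the ASQA split, a model that always answers scores 39.15 and one that always refuses scores 26.28, which are the F1 ${}_{\text{RG}}$ values the results table reports for rows with AR of 100.00 and 0.00 respectively.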

Measuring Attribution Groundedness.

While Response Truthfulness metrics like EM ${}^{\text{F1}}_{\text{AC}}$ and F1 ${}_{\text{RG}}$ evaluate the quality of generated claims, it is equally important to measure how well these statements are supported by relevant citations, which we call Attribution Groundedness. To this end, we adopt two sub-metrics from Gao et al. (2023b): Citation Recall (CR) and Citation Precision (CP). To compute CR, we first determine if a generated statement $s_{i}$ is supported by its cited documents using an NLI model (which checks if the cited document entails the statement), thus obtaining per-statement recall scores $\text{CR}^{s_{i}}$ . We then average these within each response and take the mean across all samples to obtain the final CR score (Figure 1). To compute CP, we first score each citation $c_{i,j}$ of a statement $s_{i}$ , followed by computing the average across citations in a response $S$ (sample-wise score). The dataset-wide citation score is computed by averaging the citation scores across all the samples. To provide a single metric for Attribution Groundedness, we calculate the harmonic mean of CP and CR, resulting in the final score, F1 ${}_{\text{CG}}$ .
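The sketch below illustrates how CR, CP, and F1 ${}_{\text{CG}}$ could be computed from Eqs. (13) to (16). It is illustrative only: the entailment function `phi` is passed in by the caller, and the `toy_phi` used in the example is a hypothetical substring check standing in for a real NLI model. Eq. (13) is applied literally, and a response without any citation is assigned a precision of 0, which is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Entails = Callable[[Sequence[str], str], bool]   # phi(premises, statement)
Response = List[Tuple[str, List[str]]]           # [(statement, cited documents)]


def citation_scores(responses: List[Response], phi: Entails) -> Tuple[float, float, float]:
    """Return (CR, CP, F1_CG) averaged over answered (non-empty) responses."""
    cr_samples, cp_samples = [], []
    for response in responses:
        if not response:          # refusals carry no statements or citations
            continue
        recalls, precisions = [], []
        for statement, cites in response:
            # CR^{s_i}: do the cited documents jointly entail the statement?
            recalls.append(1.0 if phi(cites, statement) else 0.0)
            for j, cite in enumerate(cites):
                others = cites[:j] + cites[j + 1:]
                # Eq. (13), read literally. With a single citation, `others`
                # is empty and the outcome depends on how phi treats that.
                precise = phi([cite], statement) or not phi(others, statement)
                precisions.append(1.0 if precise else 0.0)
        cr_samples.append(sum(recalls) / len(recalls))
        cp_samples.append(sum(precisions) / len(precisions) if precisions else 0.0)
    if not cr_samples:
        return 0.0, 0.0, 0.0
    cr = sum(cr_samples) / len(cr_samples)
    cp = sum(cp_samples) / len(cp_samples)
    f1 = 2 * cp * cr / (cp + cr) if cp + cr else 0.0
    return cr, cp, f1


def toy_phi(premises: Sequence[str], statement: str) -> bool:
    """Hypothetical stand-in for an NLI model: verbatim containment."""
    return any(statement in premise for premise in premises)


example = [[("Paris is the capital",
             ["Paris is the capital of France.", "Berlin is in Germany."])]]
print(citation_scores(example, toy_phi))
```

In the example, the statement is supported (CR of 1), but the second citation neither supports it alone nor is needed by the first, so half of the citations are counted as precise.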

Thus, we define a new metric, Trust-Score, as follows:

$\text{Trust-Score}=\frac{1}{3}\left(\text{F1}_{\text{RG}}+\text{EM}^{\text{F1}}_{\text{AC}}+\text{F1}_{\text{CG}}\right).$
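As a quick arithmetic check of this definition, the snippet below averages the three components reported for DPO-LLaMA-3-8b on ASQA in the results table (F1 ${}_{\text{RG}}$ 65.49, EM ${}^{\text{F1}}_{\text{AC}}$ 53.94, F1 ${}_{\text{CG}}$ 88.26). The function name `trust_score` is chosen here for illustration.

```python
def trust_score(f1_rg: float, em_f1_ac: float, f1_cg: float) -> float:
    """Unweighted mean of the three component scores."""
    return (f1_rg + em_f1_ac + f1_cg) / 3.0


# Components reported for DPO-LLaMA-3-8b on ASQA.
print(round(trust_score(65.49, 53.94, 88.26), 2))  # 69.23, the reported TRUST value
```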

Responsiveness.

To measure the answering tendency of an LLM, we define Responsiveness. It is the fraction of answered questions, denoted by the Answered Ratio (AR %), which is calculated as $\text{AR \%}=\frac{\text{\# answered}}{\text{\# total questions}}$ . A model is expected to show a high AR% for answerable questions and a low AR% for unanswerable ones, with the scores expected to align with the dataset distribution.

4 The Trust-Align Framework

To align LLMs towards trustworthiness, we propose a new framework, Trust-Align. The framework constructs an LLM trustworthiness alignment dataset, where each sample in the dataset consists of a question $q$ , a set of retrieved documents $D$ , and a pair of positive (preferred) and negative (unpreferred) responses ( $r^{+}$ , $r^{-}$ ). The positive response corresponds to an answer that encompasses expected gold claims for $q$ and corresponding citations referring to the documents. If $D$ is not sufficient to answer $q$ , $r^{+}$ is assigned a refusal response, while $r^{-}$ is its non-refusal counterpart. We build the dataset in multiple steps: 1) Obtain a set of high-quality and diverse questions, 2) Obtain documents for each question, 3) Augment $(q,D)$ pairs to cover diverse hallucination types, 4) Construct positive responses entailing gold claims, and 5) Construct negative (unpreferred) responses by prompting a fine-tuned model and observing its hallucinations.

Collecting Quality Questions.

The dataset construction begins by collecting a set of high-quality (challenging) and diverse questions from the source datasets, i.e., ASQA, QAMPARI, and ELI5, referred to as seed samples. To collect such samples, we first divide the questions in a dataset into $k$ clusters. After identifying the diverse clusters, we assign each a quality score ranging from 1 to 7. The quality of a cluster is determined by how difficult it is to answer the questions without requiring additional information, i.e., a higher score corresponds to a higher difficulty. We then select clusters with a quality score of 4 or higher and sample the desired number of questions from these top clusters. Suppose we have three clusters, $C_{1},C_{2},C_{3}$ , with respective sizes $N_{1},N_{2},N_{3}$ , where $N_{c}=N_{1}+N_{2}+N_{3}$ . To sample $N_{s}$ questions from the clusters, we sample $N_{s}\times\frac{N_{i}}{N_{c}}$ questions from cluster $C_{i}$ . If this number exceeds the available questions in the cluster, we randomly sample the remaining questions from the filtered-out clusters (those with a quality score below 4). This process ensures that the seed set prioritizes both high quality and diversity. For this paper, we set $N_{s}$ to 3K, 3K, and 4K for ASQA, QAMPARI, and ELI5, respectively, resulting in approximately 10K questions in the seed set.
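The proportional sampling step can be sketched as below. This is a minimal illustration assuming clusters and their quality scores are already available as dictionaries; the function name, the rounding of quotas, and the fixed random seed are choices made here and are not specified in the text.

```python
import random
from typing import Dict, List


def sample_seed_questions(
    clusters: Dict[str, List[str]],  # cluster id -> questions
    quality: Dict[str, int],         # cluster id -> quality score (1 to 7)
    n_s: int,                        # number of seed questions wanted
    min_quality: int = 4,
    seed: int = 0,
) -> List[str]:
    """Sample about n_s * N_i / N_c questions from each retained cluster."""
    rng = random.Random(seed)
    kept = {c: qs for c, qs in clusters.items() if quality[c] >= min_quality}
    dropped = [q for c, qs in clusters.items() if quality[c] < min_quality for q in qs]
    n_c = sum(len(qs) for qs in kept.values())

    sampled: List[str] = []
    for qs in kept.values():
        quota = round(n_s * len(qs) / n_c) if n_c else 0
        # A quota larger than the cluster is capped at the cluster size.
        sampled.extend(rng.sample(qs, min(quota, len(qs))))

    # Fill any shortfall from the filtered-out (lower quality) clusters.
    shortfall = n_s - len(sampled)
    if shortfall > 0:
        sampled.extend(rng.sample(dropped, min(shortfall, len(dropped))))
    return sampled[:n_s]
```

With retained clusters of sizes 6 and 4 and a request for 5 questions, the quotas are 3 and 2; a request for 12 exhausts both clusters and draws the remaining 2 from the filtered-out pool.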

Collecting D’s.

Next, we collect documents relevant to each question in the seed set. To do this, we query Wikipedia and Common Crawl to retrieve the 100 most relevant documents. We filter out seed questions for which the retriever fails to retrieve relevant documents. Furthermore, we identify 5 documents that are as effective for the model as the 100 documents in terms of achieving the EM recall value; we refer to such documents as oracle documents for question $q$ (we provide clustering and document retrieval details in Appendix B). Notably, to compute EM, gold claims are obtained from respective source datasets.

Augmenting (q,D) set.

Now that we have the questions and the most relevant (oracle) documents, our goal is to create samples of diverse types (i.e., different proportions of relevant documents for the same question) that can trigger multiple hallucinations from LLMs (Section 2.3). As illustrated in Fig. 3, for answerable questions, we first utilize the identified entailment patterns to generate all possible combinations of documents, then select $k$ combinations that cover diverse patterns. To create samples with unanswerable questions, we select documents that are similar to gold-claim-entailing documents but do not entail any gold claims. To minimize the risk of introducing bias in citation indices, we shuffle the order of documents in each sample. As a result, we generate approximately 70K question-document pairs.

After obtaining $(q,D)$ pairs for the alignment dataset, we obtain positive and negative responses ( $r^{+},r^{-}$ ) for each pair—an essential component of the dataset signaling the model’s preferred and unpreferred responses. To achieve this, we introduce a response generation pipeline.

Obtaining $\mathbf{r^{+}}$ .

We develop an automated data labeling pipeline that synthesizes natural responses from gold claims and maps each statement to the corresponding documents for embedded in-line citations. The gold claims are obtained from the source datasets (ASQA, QAMPARI, ELI5) and calibrated to the provided documents, i.e., filtering out claims that cannot be derived from $D$ . We first split the questions into answerable and unanswerable samples based on whether the provided documents entail the gold claims. For an answerable sample, consisting of a question $q$ , a set of documents $\mathcal{D}$ , and a list of (calibrated) gold claims, we prompt GPT-4 to generate a natural response by stitching together the gold claims using a template (Table 6). The prompt template asks GPT-4 to label each gold claim used with its index from the provided list (e.g., "[Gold Claim X]"), allowing for later matching of claims to documents. For unanswerable questions, a refusal response is assigned. Additional details are provided in Section B.1. To generate citations corresponding to each statement generated, we map the "[Gold Claim X]" labels to the appropriate documents. First, we extract all such labels from a sentence (which may contain multiple claims and labels). Then, we greedily identify the smallest combination of documents that covers these claims, minimizing over-citation. Details of this process are illustrated in Fig. 4.
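The greedy citation-mapping step resembles a greedy set cover, sketched below. The data layout (claim indices per sentence and a document-to-claims entailment map) and the function name `greedy_citations` are assumptions for illustration; the actual procedure is the one illustrated in Fig. 4. A greedy choice is an approximation and does not always find the smallest possible document set.

```python
from typing import Dict, List, Set


def greedy_citations(claims: Set[int], doc_entails: Dict[int, Set[int]]) -> List[int]:
    """Pick documents that cover the sentence's gold-claim indices.

    claims: gold-claim indices extracted from one sentence.
    doc_entails: document id -> set of gold-claim indices it entails.
    """
    uncovered = set(claims)
    chosen: List[int] = []
    while uncovered and doc_entails:
        # Take the document covering the most still-uncovered claims;
        # ties go to the lower document id.
        best = max(doc_entails, key=lambda d: (len(doc_entails[d] & uncovered), -d))
        gain = doc_entails[best] & uncovered
        if not gain:              # remaining claims are not covered by any document
            break
        chosen.append(best)
        uncovered -= gain
    return sorted(chosen)


# A sentence using gold claims 1, 2 and 3; document 2 entails claims 1 and 2.
print(greedy_citations({1, 2, 3}, {1: {1}, 2: {1, 2}, 3: {3}, 4: {2}}))  # [2, 3]
```

Here documents 1 and 4 are never cited because document 2 already covers their claims, which is how the step limits over-citation.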

Obtaining $\mathbf{r^{-}}$ .

To create high-quality preference data, we aim to obtain quality negative (unpreferred) responses. We first fine-tune LLaMA-2-7b on the training set of the source datasets, creating $\mathcal{M}_{sft}$ (details in Section B.1). We then test $\mathcal{M}_{sft}$ on the above-obtained dataset with approximately 70K questions and identify that 40K responses exhibit hallucinations. Table 1 shows the severity computation ( $e_{i}$ ) and the frequency of each hallucination type ( $w_{i}$ ). Thus, we can compute hallucination severity for each sample:

$e_{q}=\sum_{i=1}^{5}w_{i}e_{i}$	(1)

Hallucination type	Count	Frequency ( $w_{i}$ )	Severity ( $e_{i}$ )
Unwarranted Refusal	8,786	0.50	$I_{(A_{g}\neq\emptyset,A_{r}=\emptyset)}$
Over Responsiveness	13,067	0.50	$I_{(A_{g}=\emptyset,A_{r}\neq\emptyset)}$
Overcitation	12,656	0.34	1 - CP
Improper Citation	9,592	0.26	1 - CR
Inaccurate Claims	14,783	0.40	1 - EM ${}^{\text{F1}}_{\text{AC}}$

Table 1: Fraction of each hallucination amongst all the observed hallucinations in $\mathcal{M}_{sft}$ (40,985), with possible overlap. $e_{i}$ shows the severity computation of each hallucination. $I_{\text{condition}}$ = 1 if the condition is true; otherwise it is 0. See Fig. 5 for the detailed breakdown of the last three errors.

To obtain good negative samples, we first rank each of the 40K responses according to their severity score $e_{q}$ . We then select the top 50% of the corresponding samples for both answerable and unanswerable responses. This completes the alignment data construction phase of Trust-Align, yielding 19K samples with all the desired attributes $(\mathbf{q,D,r^{+},r^{-}})$ .
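A minimal sketch of the severity score in Eq. (1) and the top-50% selection is shown below, using the weights $w_{i}$ from Table 1. It assumes per-sample CP, CR, and EM ${}^{\text{F1}}_{\text{AC}}$ values in $[0,1]$ are supplied by the caller; how the citation and claim terms are handled for refused responses is not specified here. The names `severity` and `top_half` are illustrative, and `top_half` would be applied separately to the answerable and unanswerable groups.

```python
from typing import Dict, List, Tuple

# Weights w_i from Table 1.
WEIGHTS: Dict[str, float] = {
    "unwarranted_refusal": 0.50,
    "over_responsiveness": 0.50,
    "overcitation": 0.34,
    "improper_citation": 0.26,
    "inaccurate_claims": 0.40,
}


def severity(answerable: bool, answered: bool, cp: float, cr: float, em_f1_ac: float) -> float:
    """e_q = sum_i w_i * e_i with the e_i column of Table 1."""
    errors = {
        "unwarranted_refusal": 1.0 if answerable and not answered else 0.0,
        "over_responsiveness": 1.0 if not answerable and answered else 0.0,
        "overcitation": 1.0 - cp,
        "improper_citation": 1.0 - cr,
        "inaccurate_claims": 1.0 - em_f1_ac,
    }
    return sum(WEIGHTS[name] * errors[name] for name in WEIGHTS)


def top_half(scored: List[Tuple[str, float]]) -> List[str]:
    """Ids of the most severe half of (sample id, e_q) pairs."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[: len(ranked) // 2]]


# An answered, answerable sample with CP = 0.5, CR = 1.0, EM = 0.75:
# 0.34 * 0.5 + 0.26 * 0.0 + 0.40 * 0.25 = 0.27
print(round(severity(True, True, cp=0.5, cr=1.0, em_f1_ac=0.75), 2))
```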

5 Experimental Setup

Evaluation datasets.

We evaluate on the test sets of attributable factoid and long-form question-answering tasks from ASQA Stelmakh et al. (2023), QAMPARI Amouyal et al. (2023), and ELI5 Fan et al. (2019). Additionally, we include ExpertQA Malaviya et al. (2024) for generalization evaluation. For each question, we append the top 5 documents obtained using the retriever. For ELI5 and ExpertQA, the ground truth answers are decomposed into three claims. The dataset statistics are shown at the top of Section 6.

Baselines.

We evaluate the effectiveness of the Trust-Align framework under two settings, default prompting and refusal prompting, as shown in Table 15. We compare our models trained with the Trust-Align framework against five competitive baseline methods: In-Context Learning (ICLCite) Gao et al. (2023b), Post-hoc Attribute Gao et al. (2023a), Post-hoc Search Gao et al. (2023b), Self-RAG Asai et al. (2024), and FRONT Huang et al. (2024b). The details of these baselines are given in Section G.3.

		ASQA (610 answerable, 338 unanswerable)					QAMPARI (295 answerable, 705 unanswerable)					ELI5 (207 answerable, 793 unanswerable)
		Responsiveness	Trustworthiness				Responsiveness	Trustworthiness				Responsiveness	Trustworthiness
			Truthfulness		Attr. Groundedness			Truthfulness		Attr. Groundedness			Truthfulness		Attr. Groundedness	
Method	Prompt (R = refusal, D = default)	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST
		LLaMA-2-7b
ICL	R	0.00	0.00	26.28	0.00	8.76	0.00	0.00	41.35	0.00	13.78	0.50	0.00	46.71	0.00	15.57
PostCite	R	10.44	0.07	35.23	0.00	11.77	34.40	0.00	57.34	9.50	22.28	0.90	1.86	44.98	5.04	17.29
PostAttr	R	10.44	0.07	35.23	0.00	11.77	34.40	0.00	57.34	3.78	20.37	0.90	1.86	44.98	0.00	15.61
Self-RAG	R	100.00	45.19	39.15	63.49	49.28	96.00	6.81	28.23	19.95	18.33	73.50	14.94	40.20	13.80	22.98
FRONT	R	100.00	60.47	39.15	68.86	56.16	100.00	17.27	22.78	24.26	21.44	100.00	21.66	17.15	52.72	30.51
ICL	D	94.30	50.38	49.51	43.67	47.85	93.60	8.36	31.02	3.88	14.42	95.30	19.83	22.82	16.30	19.65
PostCite	D	88.71	2.30	50.82	0.98	18.03	56.30	0.00	49.18	7.73	18.97	83.90	11.95	30.05	4.90	15.63
PostAttr	D	87.24	2.32	51.56	0.43	18.10	51.10	0.00	49.50	4.70	18.07	84.00	11.94	29.74	0.93	14.20
Self-RAG	D	98.00	46.82	41.16	56.59	48.19	96.20	7.72	27.08	15.44	16.75	97.90	13.16	19.62	10.31	14.36
		LLaMA-2-13b
ICL	R	17.41	21.52	41.40	13.83	25.58	26.50	0.44	59.57	0.00	20.00	46.40	19.97	54.81	4.73	26.50
PostCite	R	90.51	2.21	49.91	1.53	17.88	100.00	0.00	22.78	8.05	10.28	76.60	2.27	38.05	0.72	13.68
PostAttr	R	90.51	2.21	49.91	0.17	17.43	100.00	0.00	22.78	2.95	8.58	76.60	2.27	38.05	0.09	13.47
Self-RAG	R	100.00	48.52	39.15	69.79	52.49	72.70	2.71	48.58	26.91	26.07	22.10	12.77	58.68	24.54	32.00
ICL	D	97.57	49.16	44.06	9.35	34.19	97.80	0.00	26.20	0.00	8.73	96.50	20.93	21.06	2.80	14.93
PostCite	D	89.77	0.04	50.33	0.00	16.79	63.00	0.00	47.20	7.14	18.11	7.00	3.62	45.31	4.73	17.89
PostAttr	D	89.24	0.04	51.46	0.00	17.17	58.50	0.00	48.86	4.56	17.81	6.70	3.66	48.41	0.71	17.59
Self-RAG	D	97.68	48.93	42.74	63.39	51.69	96.30	3.66	27.15	21.06	17.29	98.00	12.19	19.07	6.68	12.65
		LLaMA-3-8b
ICL	R	1.48	3.01	28.58	86.50	39.36	3.90	5.92	48.60	20.24	24.92	0.00	0.00	44.23	0.00	14.74
PostCite	R	77.53	32.98	53.31	28.01	38.10	87.00	6.10	34.52	8.42	16.35	62.00	20.80	45.88	8.06	24.91
PostAttr	R	77.53	32.98	53.31	5.95	30.75	87.00	6.10	34.52	1.64	14.09	62.00	20.80	45.88	1.25	22.64
ICL	D	89.66	58.28	55.62	61.59	58.50	70.80	5.82	50.50	4.81	20.38	84.60	23.69	33.11	31.03	29.28
PostCite	D	97.26	34.80	43.56	17.89	32.08	92.00	2.45	30.07	11.14	14.55	98.90	19.00	18.47	6.33	14.60
PostAttr	D	97.47	34.75	42.98	3.18	26.97	93.00	2.43	29.95	5.65	12.68	98.90	19.00	18.26	1.02	12.76
		Our Models
SFT-LLaMA-2-7b	R	80.17	53.21	63.43	79.61	65.42	31.60	33.76	71.13	46.37	50.42	29.50	21.58	63.30	39.59	41.49
SFT-LLaMA-3-8b	R	68.99	52.35	66.06	80.95	66.45	24.20	33.85	71.11	48.01	50.99	23.60	22.57	65.06	46.85	44.83
DPO-LLaMA-2-7b	R	65.30	52.48	66.12	83.94	67.51	31.10	32.09	71.83	51.33	51.75	21.60	22.54	63.27	48.43	44.75
DPO-LLaMA-3-8b	R	56.43	53.94	65.49	88.26	69.23	23.10	35.94	71.11	58.87	55.31	15.50	22.81	64.00	53.84	46.88
$\Delta$			$\downarrow$ 4.34	$\uparrow$ 9.87	$\uparrow$ 26.67	$\uparrow$ 10.73		$\uparrow$ 33.23	$\uparrow$ 22.53	$\uparrow$ 31.96	$\uparrow$ 29.24		$\uparrow$ 10.04	$\uparrow$ 5.32	$\uparrow$ 29.30	$\uparrow$ 14.88

	ASQA					QAMPARI					ELI5
	Responsiveness	Trustworthiness				Responsiveness	Trustworthiness				Responsiveness	Trustworthiness
		Truthfulness		Attr. Groundedness			Truthfulness		Attr. Groundedness			Truthfulness		Attr. Groundedness	
Model	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST
DPO-LLaMA-2-7b	65.30	52.48	66.12	83.94	67.51	31.10	32.09	71.83	51.33	51.75	21.60	22.54	63.27	48.43	44.75
Trust-Align w/o. augmented instructions	79.43	53.54	63.33	81.15	66.01	32.20	33.14	70.82	45.94	49.97	29.50	23.98	63.30	40.28	42.52
Trust-Align w/o. answer HT	77.74	53.29	63.70	81.20	66.06	33.40	33.56	71.36	46.17	50.36	27.60	23.47	63.56	38.28	41.77
Trust-Align w/o. citation HT	77.32	52.55	63.88	81.51	65.98	33.10	34.13	71.40	46.91	50.81	26.70	22.65	64.33	42.81	43.26
Trust-Align w/o. refusal HT	79.11	53.55	63.33	81.85	66.24	31.10	34.40	71.35	48.12	51.29	28.30	22.93	64.05	41.18	42.72
GPT-4 as critic	70.36	54.91	65.29	78.47	66.22	25.90	30.77	70.29	48.87	49.98	23.50	17.27	62.24	42.38	40.63

	Model	Responsiveness (AR%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST
Only Answerable	DPO-LLaMA-2-7b	100	51.79	39.15	77.37	56.10
Only Answerable	DPO-LLaMA-3-8b	100	56.54	39.15	81.39	59.03
With Refusal	DPO-LLaMA-2-7b	65.30	52.48	66.12	83.94	67.51
With Refusal	DPO-LLaMA-3-8b	56.43	53.94	65.49	88.26	69.23

In-Context Learning Models
Model	AR (%)	EM ${}^{\text{F1}}_{\text{AC}}$	F1 ${}_{\text{RG}}$	F1 ${}_{\text{CG}}$	TRUST
ICL-LLaMA-2 7B	0.51	0.00	41.01	9.52	16.84
ICL-LLaMA-3 8B	0.65	2.82	42.50	69.46	38.26
ICL-GPT-3.5	59.47	36.65	56.39	63.91	52.32
ICL-GPT-4	72.20	41.21	52.91	69.70	54.61
ICL-Claude 3.5	73.95	11.68	51.91	10.70	24.76
Direct Preference Optimization Models
DPO-LLaMA-2-7B	17.75	23.99	66.63	64.96	51.86
DPO-LLaMA-3-8B	16.41	27.36	68.05	70.11	54.85

$\displaystyle\text{P}_{\text{ref}}$	$\displaystyle=\frac{\|\neg A_{r}\cap\neg A_{g}\|}{\|\neg A_{r}\|}$	(2)
$\displaystyle\text{R}_{\text{ref}}$	$\displaystyle=\frac{\|\neg A_{r}\cap\neg A_{g}\|}{\|\neg A_{g}\|}$	(3)
$\displaystyle\text{F1}_{\text{ref}}$	$\displaystyle=\frac{\text{2P}_{\text{ref}}\cdot\text{R}_{\text{ref}}}{\text{P}% _{\text{ref}}+\text{R}_{\text{ref}}},$	(4)

$\displaystyle\text{P}_{\text{ans}}$	$\displaystyle=\frac{\|A_{r}\cap A_{g}\|}{\|A_{r}\|}$	(5)
$\displaystyle\text{R}_{\text{ans}}$	$\displaystyle=\frac{\|A_{r}\cap A_{g}\|}{\|A_{g}\|}$	(6)
$\displaystyle\text{F1}_{\text{ans}}$	$\displaystyle=\frac{2\text{P}_{\text{ans}}\cdot\text{R}_{\text{ans}}}{\text{P}_{\text{ans}}+\text{R}_{\text{ans}}}$	(7)
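As an illustration of Eqs. (2)-(7), the following minimal Python sketch computes the refusal and answered precision, recall, and F1 from sets of question identifiers. It assumes that $A_{r}$ is the set of questions the model answered and $A_{g}$ the set of ground-truth answerable questions; the function and argument names are hypothetical, and a zero denominator is mapped to 0 by convention. Averaging the two F1 values into F1 ${}_{\text{RG}}$ is an assumption that is consistent with the reported tables.

```python
from typing import Dict, Set, Tuple


def precision_recall_f1(hits: int, n_predicted: int, n_gold: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1); any zero denominator yields 0.0."""
    precision = hits / n_predicted if n_predicted else 0.0
    recall = hits / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def grounded_refusal_scores(
    all_questions: Set[str], answered: Set[str], answerable: Set[str]
) -> Dict[str, float]:
    """Eqs. (2)-(7). `answered` plays the role of A_r, `answerable` of A_g."""
    refused = all_questions - answered          # complement of A_r
    unanswerable = all_questions - answerable   # complement of A_g

    p_ref, r_ref, f1_ref = precision_recall_f1(
        len(refused & unanswerable), len(refused), len(unanswerable)
    )
    p_ans, r_ans, f1_ans = precision_recall_f1(
        len(answered & answerable), len(answered), len(answerable)
    )
    return {
        "P_ref": p_ref, "R_ref": r_ref, "F1_ref": f1_ref,
        "P_ans": p_ans, "R_ans": r_ans, "F1_ans": f1_ans,
        # Assumption: F1_RG is the unweighted mean of the two F1 scores.
        "F1_RG": (f1_ref + f1_ans) / 2,
    }


# Example: 948 questions, 610 answerable, and a model that refuses everything.
questions = {str(i) for i in range(948)}
scores = grounded_refusal_scores(questions, set(), {str(i) for i in range(610)})
```

In this example the model is credited with perfect refusal recall, but its refusal precision is only the fraction of questions that are truly unanswerable, and its answered F1 is zero.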

$\displaystyle\text{CP}^{c_{j}}$	$\displaystyle=\phi(c_{i,j},s_{i})\quad\text{OR}\quad\neg\phi(\{c_{i,k}\mid k\neq j\},s_{i})$	(13)

$\displaystyle\text{CR}$	$\displaystyle=\frac{1}{\|A_{r}\|}\sum_{S\in A_{r}^{s}}\frac{1}{\|S\|}\sum_{s_{i}\in S}\text{CR}^{s_{i}}$	(14)
$\displaystyle\text{CP}$	$\displaystyle=\frac{1}{\|A_{r}\|}\sum_{C\in A_{r}^{c}}\frac{1}{\|C\|}\sum_{c_{j}\in C}\text{CP}^{c_{j}}$	(15)
$\displaystyle\text{F1}_{\text{CG}}$	$\displaystyle=\frac{2\cdot\text{CP}\cdot\text{CR}}{\text{CP}+\text{CR}}$	(16)
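As an illustration of Eqs. (13)-(16), the following minimal Python sketch aggregates citation recall, citation precision, and their F1 over answered responses. The entailment function $\phi$ is represented by a caller-supplied callable `phi`; `toy_phi` is a purely illustrative substring check and not an NLI model. The statement-level recall $\text{CR}^{s_{i}}$ is assumed here to be $\phi$ applied jointly to all passages cited for $s_{i}$, and empty sums are mapped to 0 by convention. All names are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# phi(passages, statement) -> True if the passages jointly entail the statement.
Entails = Callable[[Sequence[str], str], bool]
# One response: a list of (statement s_i, passages cited for s_i).
Response = List[Tuple[str, List[str]]]


def citation_precision_flag(j: int, cited: List[str], statement: str, phi: Entails) -> bool:
    """Eq. (13): citation j counts if it supports the statement on its own,
    or if the remaining citations do not support the statement without it."""
    others = [c for k, c in enumerate(cited) if k != j]
    return phi([cited[j]], statement) or not phi(others, statement)


def citation_scores(responses: List[Response], phi: Entails) -> Tuple[float, float, float]:
    """Eqs. (14)-(16): return (CR, CP, F1_CG) averaged over answered responses."""
    if not responses:
        return 0.0, 0.0, 0.0
    cr_sum = 0.0
    cp_sum = 0.0
    for response in responses:
        # Assumption: CR^{s_i} = phi(all passages cited for s_i, s_i).
        recalls = [phi(cited, s) for s, cited in response]
        flags = [
            citation_precision_flag(j, cited, s, phi)
            for s, cited in response
            for j in range(len(cited))
        ]
        cr_sum += sum(recalls) / len(recalls) if recalls else 0.0
        cp_sum += sum(flags) / len(flags) if flags else 0.0
    cr = cr_sum / len(responses)
    cp = cp_sum / len(responses)
    f1 = 2 * cp * cr / (cp + cr) if (cp + cr) else 0.0
    return cr, cp, f1


def toy_phi(passages: Sequence[str], statement: str) -> bool:
    """Illustrative stand-in for phi: any passage contains the statement text."""
    return any(statement.lower() in p.lower() for p in passages)


example = [[("paris is the capital", ["Paris is the capital of France.",
                                      "Berlin is in Germany."])]]
cr, cp, f1_cg = citation_scores(example, toy_phi)
```

In the example, the statement is supported by its citations as a whole, but the second citation neither supports the statement alone nor is needed by the first, so it lowers citation precision.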

Question	How many state parks are there in Virginia?
Gold Answer	38
Retrieved document	Virginia has 30 National Park Service units, such as Great Falls Park and the Appalachian Trail, and one national park, the Shenandoah National Park. With over 500 miles of trails, including 38 miles of the iconic Appalachian Trail, it’s a paradise for hikers, nature lovers, and those seeking serene mountain landscapes.
Substring match	The gold answer string is matched in the document; as such, the question is judged answerable.
TRUE Judgement	The gold answer is not entailed by the document; as such, the question is unanswerable given the document.
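The failure of substring matching in this case can be reproduced with a minimal Python sketch. The function names are hypothetical, the document string is an excerpt of the retrieved document above, and `entails` is a caller-supplied stand-in for an entailment judge such as TRUE rather than the actual model.

```python
from typing import Callable, Iterable


def substring_answerable(gold_answers: Iterable[str], documents: Iterable[str]) -> bool:
    """Substring heuristic: answerable if any gold answer string occurs in any document."""
    docs = [d.lower() for d in documents]
    return any(ans.lower() in d for ans in gold_answers for d in docs)


def entailment_answerable(
    claim: str, documents: Iterable[str], entails: Callable[[str, str], bool]
) -> bool:
    """Entailment check: answerable only if some document entails the full claim.
    `entails(document, claim)` is a hypothetical judge supplied by the caller."""
    return any(entails(d, claim) for d in documents)


# Excerpt of the retrieved document from the case study.
doc = ("Virginia has 30 National Park Service units. With over 500 miles of trails, "
       "including 38 miles of the iconic Appalachian Trail")

# The string "38" occurs in the document, but it refers to miles of trail,
# not to the number of state parks.
matched = substring_answerable(["38"], [doc])
```

Here `matched` is true even though the document says nothing about the number of state parks, which is the false positive that the entailment-based judgement is meant to avoid.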

Model	ASQA		QAMPARI		ELI5
Model	AR (%)	$\text{P}_{\text{score}}$	AR (%)	$\text{P}_{\text{score}}$	AR (%)	$\text{P}_{\text{score}}$
ICL-LLaMA-2 7B	0.00	0.00	0.00	0.00	0.50	0.00
ICL-LLaMA-3 8B	1.48	1.79	3.90	16.92	0.00	0.00
ICL-GPT-3.5	71.20	9.74	65.30	11.45	49.00	7.89
ICL-GPT-4	86.81	12.71	73.40	13.05	61.50	9.05
ICL-Claude-3.5	84.60	12.99	69.80	12.55	59.00	1.76
DPO-LLaMA-2-7B	65.30	8.15	31.10	8.45	21.60	5.56
DPO-LLaMA-3-8B	56.42	8.65	23.10	8.97	15.50	7.26

Presence	$\displaystyle=\frac{1}{\|\mathcal{N}_{e}\|}\sum_{q_{i}\in{\mathcal{A}_{e}}}\frac{\|A_{R}^{e}\cap A_{D}\|}{\|A_{R}^{e}\|}$	(19)
Absence	$\displaystyle=\frac{1}{\|\mathcal{N}_{e}\|}\sum_{q_{i}\in{\mathcal{A}_{e}}}\frac{\|A_{R}^{e}-(A_{R}^{e}\cap A_{D})\|}{\|A_{R}^{e}\|}$	(20)

	ASQA	QAMPARI	ELI5	ExpertQA
Total # of Samples	948	1000	1000	2169
# Answerable Samples	610	295	207	682
# Unanswerable Samples	338	705	793	1487

Model	Prompt	AR%	EM_reg	EM ${}^{\alpha}_{\text{AC}}$	EM ${}^{\beta}_{\text{AC}}$	EM ${}^{\text{F1}}_{\text{AC}}$	$\text{R}_{\text{ref}}$	$\text{P}_{\text{ref}}$	$\text{F1}_{\text{ref}}$	$\text{R}_{\text{ans}}$	$\text{P}_{\text{ans}}$	$\text{F1}_{\text{ans}}$	F1 ${}_{\text{RG}}$	CR	CP	F1 ${}_{\text{CG}}$	Trust-Score
		LLaMA-2-7b
ICL	R	0.00	12.78	0.00	0.00	0.00	100.00	35.65	52.57	0.00	0.00	0.00	26.28	0.00	0.00	0.00	8.76
PostCite	R	10.44	8.49	0.25	0.04	0.07	90.53	36.04	51.56	10.98	67.68	18.90	35.23	0.00	0.00	0.00	11.77
PostAttr	R	10.44	8.49	0.25	0.04	0.07	90.53	36.04	51.56	10.98	67.68	18.90	35.23	0.00	0.00	0.00	11.77
Self-RAG	R	100.00	28.87	37.13	57.71	45.19	0.00	0.00	0.00	100.00	64.35	78.31	39.15	59.27	68.35	63.49	49.28
FRONT	D	100.00	40.72	49.69	77.22	60.47	0.00	0.00	0.00	100.00	64.35	78.31	39.15	68.45	69.27	68.86	56.16
ICL	D	94.30	32.29	42.06	62.79	50.38	11.54	72.22	19.90	97.54	66.55	79.12	49.51	44.21	43.14	43.67	47.85
PostCite	D	88.71	1.91	1.98	2.73	2.30	16.27	51.40	24.72	91.48	66.35	76.91	50.82	0.98	0.98	0.98	18.03
PostAttr	D	87.24	1.91	2.01	2.73	2.32	18.05	50.41	26.58	90.16	66.51	76.55	51.56	0.43	0.43	0.43	18.10
Self-RAG	D	98.00	30.11	38.63	59.41	46.82	2.37	42.11	4.48	98.20	64.48	77.84	41.16	50.69	64.05	56.59	48.19
		LLaMA-2-13b
ICL	R	17.41	9.17	50.54	13.67	21.52	86.39	37.29	52.10	19.51	72.12	30.71	41.40	10.94	18.81	13.83	25.58
PostCite	R	90.51	1.88	1.89	2.66	2.21	14.20	53.33	22.43	93.11	66.20	77.38	49.91	1.53	1.53	1.53	17.88
PostAttr	R	90.51	1.88	1.89	2.66	2.21	14.20	53.33	22.43	93.11	66.20	77.38	49.91	0.17	0.17	0.17	17.43
Self-RAG	R	100.00	30.82	39.87	61.96	48.52	0.00	0.00	0.00	100.00	64.35	78.31	39.15	66.42	73.52	69.79	52.49
ICL	D	97.57	33.31	40.57	62.35	49.16	5.03	73.91	9.42	99.02	65.30	78.70	44.06	7.22	13.25	9.35	34.19
PostCite	D	89.77	0.06	0.03	0.04	0.04	15.09	52.58	23.45	92.46	66.27	77.21	50.33	0.00	0.00	0.00	16.79
PostAttr	D	89.24	0.06	0.03	0.04	0.04	16.57	54.90	25.45	92.46	66.67	77.47	51.46	0.00	0.00	0.00	17.17
Self-RAG	D	97.68	31.36	40.53	61.73	48.93	3.85	59.09	7.22	98.52	64.90	78.26	42.74	58.31	69.44	63.39	51.69
		LLaMA-3-8b
ICL	R	1.48	0.69	67.14	1.54	3.01	99.70	36.08	52.99	2.13	92.86	4.17	28.58	92.86	80.95	86.50	39.36
PostCite	R	77.53	22.15	30.17	36.36	32.98	27.51	43.66	33.76	80.33	66.67	72.86	53.31	28.01	28.01	28.01	38.10
PostAttr	R	77.53	22.15	30.17	36.36	32.98	27.51	43.66	33.76	80.33	66.67	72.86	53.31	5.95	5.95	5.95	30.75
ICL	D	89.66	36.41	49.83	70.17	58.28	20.41	70.41	31.65	95.25	68.35	79.59	55.62	61.40	61.77	61.59	58.50
PostCite	D	97.26	27.65	28.91	43.69	34.80	4.73	61.54	8.79	98.36	65.08	78.33	43.56	17.89	17.89	17.89	32.08
PostAttr	D	97.47	27.65	28.84	43.69	34.75	4.14	58.33	7.73	98.36	64.94	78.23	42.98	3.18	3.18	3.18	26.97
		Closed-source Models
GPT-3.5	R	71.20	27.30	50.36	55.72	52.91	48.82	60.44	54.01	82.30	74.37	78.13	66.07	84.66	83.24	83.94	67.64
GPT-4	R	86.81	37.93	54.81	73.95	62.96	28.99	78.40	42.33	95.57	70.84	81.37	61.85	85.82	82.93	84.35	69.72
Claude-3.5	R	84.60	36.29	52.79	69.41	59.97	34.02	78.77	47.52	94.92	72.19	82.01	64.77	67.29	69.43	68.35	64.36
GPT-3.5	D	94.41	34.67	46.27	67.88	55.03	14.20	90.57	24.55	99.18	67.60	80.40	52.48	78.13	77.95	78.04	61.85
GPT-4	D	92.72	41.13	52.58	76.65	62.37	16.86	82.61	28.01	98.03	68.03	80.32	54.17	79.48	79.92	79.70	65.41
Claude-3.5	D	82.49	32.68	47.64	62.86	54.20	37.87	77.11	50.79	93.77	73.15	82.18	66.49	57.41	60.44	58.88	59.86
		Our Models
SFT-LLaMA-2-7b	R	80.17	29.21	47.96	59.76	53.21	36.69	65.96	47.15	89.51	71.84	79.71	63.43	83.36	76.18	79.61	65.42
SFT-LLaMA-3-8b	R	68.99	25.22	50.59	54.24	52.35	51.18	58.84	54.75	80.16	74.77	77.37	66.06	86.09	76.38	80.95	66.45
DPO-LLaMA-2-7b	R	65.30	25.04	52.10	52.87	52.48	55.33	56.84	56.07	76.72	75.61	76.16	66.12	85.35	82.57	83.94	67.51
DPO-LLaMA-3-8b	R	56.43	23.53	57.72	50.63	53.94	64.79	53.03	58.32	68.20	77.76	72.66	65.49	88.93	87.60	88.26	69.23
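The composite columns of the table above can be recomputed from their components. The following minimal Python sketch assumes, consistently with the reported values, that EM ${}^{\text{F1}}_{\text{AC}}$ is the harmonic mean of EM ${}^{\alpha}_{\text{AC}}$ and EM ${}^{\beta}_{\text{AC}}$, that F1 ${}_{\text{RG}}$ is the mean of $\text{F1}_{\text{ref}}$ and $\text{F1}_{\text{ans}}$, and that Trust-Score is the mean of EM ${}^{\text{F1}}_{\text{AC}}$, F1 ${}_{\text{RG}}$, and F1 ${}_{\text{CG}}$. The function names are hypothetical.

```python
from typing import Tuple


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two scores; 0.0 when both are zero."""
    return 2 * a * b / (a + b) if (a + b) else 0.0


def trust_score(
    em_alpha: float, em_beta: float, f1_ref: float, f1_ans: float, f1_cg: float
) -> Tuple[float, float, float]:
    """Return (EM^F1_AC, F1_RG, Trust-Score) from the component columns."""
    em_f1_ac = harmonic_mean(em_alpha, em_beta)
    f1_rg = (f1_ref + f1_ans) / 2
    return em_f1_ac, f1_rg, (em_f1_ac + f1_rg + f1_cg) / 3


# Component values taken from the DPO-LLaMA-2-7b row of the table above.
em_f1_ac, f1_rg, trust = trust_score(52.10, 52.87, 56.07, 76.16, 83.94)
```

Up to rounding of the inputs, the recomputed values agree with the reported 52.48, 66.12, and 67.51 for that row.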

Model	Prompt	AR%	EM_reg	EM ${}^{\alpha}_{\text{AC}}$	EM ${}^{\beta}_{\text{AC}}$	EM ${}^{\text{F1}}_{\text{AC}}$	$\text{R}_{\text{ref}}$	$\text{P}_{\text{ref}}$	$\text{F1}_{\text{ref}}$	$\text{R}_{\text{ans}}$	$\text{P}_{\text{ans}}$	$\text{F1}_{\text{ans}}$	F1 ${}_{\text{RG}}$	CR	CP	F1 ${}_{\text{CG}}$	Trust-Score
		LLaMA-2-7b
ICL	R	0.50	2.63	0.00	0.00	0.00	100.00	79.70	88.70	2.42	100.00	4.72	46.71	0.00	0.00	0.00	15.57
PostCite	R	0.90	6.33	22.22	0.97	1.86	99.12	79.31	88.12	0.97	22.22	1.85	44.98	5.04	5.04	5.04	17.29
PostAttr	R	0.90	6.33	22.22	0.97	1.86	99.12	79.31	88.12	0.97	22.22	1.85	44.98	0.00	0.00	0.00	15.61
Self-RAG	R	73.50	6.80	9.57	33.98	14.94	29.13	87.17	43.67	83.57	23.54	36.73	40.20	12.34	15.65	13.80	22.98
FRONT	D	100.00	9.57	13.07	63.12	21.66	0.00	0.00	0.00	100.00	20.70	34.30	17.15	52.44	53.01	52.72	30.51
ICL	D	95.30	12.03	12.07	55.56	19.83	5.55	93.62	10.48	98.55	21.41	35.17	22.82	15.73	16.92	16.30	19.65
PostCite	D	83.90	8.13	7.45	30.19	11.95	16.14	79.50	26.83	84.06	20.74	33.27	30.05	4.90	4.90	4.90	15.63
PostAttr	D	84.00	8.13	7.44	30.19	11.94	15.89	78.75	26.44	83.57	20.60	33.05	29.74	0.93	0.93	0.93	14.20
Self-RAG	D	97.90	8.13	7.97	37.68	13.16	2.40	90.48	4.67	99.03	20.94	34.57	19.62	9.01	12.05	10.31	14.36
		LLaMA-2-13b
ICL	R	46.40	6.90	14.44	32.37	19.97	58.39	86.38	69.68	64.73	28.88	39.94	54.81	3.79	6.28	4.73	26.50
PostCite	R	76.60	2.27	1.44	5.31	2.27	25.73	87.18	39.73	85.51	23.11	36.38	38.05	0.72	0.72	0.72	13.68
PostAttr	R	76.60	2.27	1.44	5.31	2.27	25.73	87.18	39.73	85.51	23.11	36.38	38.05	0.09	0.09	0.09	13.47
Self-RAG	R	22.10	2.40	12.37	13.20	12.77	81.59	83.06	82.32	36.23	33.94	35.05	58.68	22.09	27.60	24.54	32.00
ICL	D	96.50	13.07	12.71	59.26	20.93	3.91	88.57	7.49	98.07	21.04	34.64	21.06	2.45	3.25	2.80	14.93
PostCite	D	7.00	0.57	7.14	2.42	3.62	92.18	78.60	84.85	3.86	11.43	5.78	45.31	4.73	4.73	4.73	17.89
PostAttr	D	6.70	0.57	7.46	2.42	3.66	93.44	79.42	85.86	7.25	22.39	10.95	48.41	0.71	0.71	0.71	17.59
Self-RAG	D	98.00	9.73	7.38	34.94	12.19	2.02	80.00	3.94	98.07	20.71	34.20	19.07	5.71	8.06	6.68	12.65
		LLaMA-3-8b
ICL	R	0.00	0.00	0.00	0.00	0.00	100.00	79.30	88.46	0.00	0.00	0.00	44.23	0.00	0.00	0.00	14.74
PostCite	R	62.00	10.80	13.87	41.55	20.80	40.86	85.26	55.24	72.95	24.35	36.52	45.88	8.06	8.06	8.06	24.91
PostAttr	R	62.00	10.80	13.87	41.55	20.80	40.86	85.26	55.24	72.95	24.35	36.52	45.88	1.25	1.25	1.25	22.64
ICL	D	84.60	11.90	14.74	60.23	23.69	17.65	90.91	29.57	93.24	22.81	36.66	33.11	31.32	30.74	31.03	29.28
PostCite	D	98.90	17.40	11.49	54.91	19.00	1.26	90.91	2.49	99.52	20.83	34.45	18.47	6.33	6.33	6.33	14.60
PostAttr	D	98.90	17.40	11.49	54.91	19.00	1.13	81.82	2.24	99.03	20.73	34.28	18.26	1.02	1.02	1.02	12.76
		Closed-source Models
GPT-3.5	R	49.00	8.47	23.03	54.51	32.38	58.26	90.59	70.91	76.81	32.45	45.62	58.27	56.57	58.03	57.29	49.31
GPT-4	R	61.50	10.50	22.09	65.62	33.05	45.65	94.03	61.46	88.89	29.92	44.77	53.11	61.33	62.35	61.84	49.33
Claude-3.5	R	59.00	2.87	7.66	21.82	11.34	48.05	92.93	63.34	85.99	30.17	44.67	54.00	11.64	13.34	12.43	25.92
GPT-3.5	D	93.50	14.33	14.58	65.86	23.88	7.57	92.31	13.99	97.58	21.60	35.38	24.68	46.46	46.10	46.28	31.61
GPT-4	D	82.80	15.00	18.18	72.71	29.09	21.19	97.67	34.82	98.07	24.52	39.23	37.02	48.20	48.47	48.33	38.15
Claude-3.5	D	56.60	3.40	7.89	21.58	11.56	51.07	93.32	66.01	85.99	31.45	46.05	56.03	10.22	12.43	11.22	26.27
		Our Models
SFT-LLaMA-2-7b	R	29.50	3.80	18.36	26.17	21.58	77.05	86.67	81.58	54.59	38.31	45.02	63.30	45.25	35.19	39.59	41.49
SFT-LLaMA-3-8b	R	23.60	3.27	21.19	24.15	22.57	82.98	86.13	84.52	48.79	42.80	45.60	65.06	51.77	42.79	46.85	44.83
DPO-LLaMA-2-7b	R	21.60	3.30	22.07	23.03	22.54	83.98	84.95	84.46	43.00	41.20	42.08	63.27	48.46	46.29	47.35	44.39
DPO-LLaMA-3-8b	R	15.50	2.77	24.30	18.20	20.81	89.66	84.14	86.81	35.27	47.10	40.33	63.57	50.75	49.74	50.24	44.87

Insufficient case
Question: Why do burns blister and why do burn wounds remain warm long after the injury occurred?
Label: Burn blisters occur when the second layer of the skin is damaged, they occur to protect the underlying skin layers from more damage and infection. You could see it as the bodys/skins natural bandage, so never pop them. The skin remain warm because of the increased blood in the area to repair and replace the damaged skin.

Decomposed claims: 1. Burn blisters occur when the second layer of skin is damaged. 2. Burn wounds remain warm due to increased blood flow to the area to repair and replace damaged skin.

Missing points: 1. Protection and Infection: The first claim does not mention that the blisters protect the underlying skin from more damage and infection, which is a significant part of the explanation in the answer. 2. Never Pop Them: The answer advises against popping blisters, which is a preventive measure not mentioned in the claims.
Redundant case
Question: How do fitness trackers know that you actually sleeping but not just laying there resting, being awake?
Label: Your heart beats slows down when you sleep, they will use a mixture of heart rate and how long you haven’t moved to determine how you’ve slept

Decomposed claims: 1. The combined factors of heart rate and inactivity determine sleep assessment. 2. Fitness trackers consider the duration of inactivity to assess sleep. 3. A slowed heart rate is an indicator of sleep that fitness trackers monitor.

Redundant point: The first claim already summarises the core statement; the last two claims merely expand on it with more detail.

Coverage Critic Prompt
[INSTRUCTION]
You will be given Question and the corresponding correct answers, along with a candidate answer and reference facts. Please follow these steps to process the candidate answer:
1. Carefully read and understand the given Question, the list of correct answers, and the candidate answer.
2. For each given correct answer, first determine if there is a conflict with the candidate answer:
- If there is no conflict, and it is included in the candidate answer, extract the matched term from the candidate answer and classify them as "upvote".
- If there is a conflict, identify the specific conflicting span within the candidate answer (accurately pinpoint the details), classify it as "downvote", then only minimally modify the conflicting part of the candidate answer to correct it according to the corresponding correct answer (using context from the reference fact). Classify the modified span as "revise".
- If there is a conflict, but it is not included in the candidate answer, extend the candidate answer to include the correct answer (using material from the corresponding part of the reference facts), and classify the extended portion as "revise".
3. At the end of your response, provide the following:
- The final revised candidate answer that includes all correct answers and has no conflicts (if no modification is needed, output the original one).
[TASK]
Question: {QUESTION}
Correct Answers: {SHORT_ANS}
Candidate Answer: {CANDIDATE}
Reference Facts: {FACT}
Citation Critic Prompt
[INSTRUCTION]
Given a question and a list of CLAIMs, use the provided FACTs to determine which numbered FACTs togeter SUPPORT, OPPOSE, or are IRRELEVANT to each CLAIM. Follow these to give your judgement:
1. "SUPPORT" means the FACT directly participates in supporting the factuality of the CLAIM. The CLAIM should be strongly implied by the FACT.
2. "OPPOSE" means the FACT contributes to prove the CLAIM contains at least one factual error.
3. "IRRELEVANT" means the FACT does not contribute directly to either SUPPORT or OPPOSE the given CLAIM.
4. Carefully read the given question and FACTs to ensure you have a clear understanding of them.
5. For each CLAIM, analyze its content to show all factual arguments and assertions.
6. Look into the details of each FACT, and find factual-related points of each FACT.
7. Before determining your final judgement for all CLAIMs at the end, state your reasoning and evidence first.
8. In your final judgement, give a numbered list with each line corresponding to a CLAIM. Then, for each CLAIM, separately list the index of each FACT for "SUPPORT", "OPPOSE", and "IRRELEVANT", with the format [FACT X], where X is the index of the FACT starting from 1. For example, suppose we have two CLAIMs and three FACTs in total: "/n/n1. SUPPORT: [FACT 1][FACT 3], OPPOSE: NONE, IRRELEVANT: [FACT 2]/n/n2. SUPPORT: NONE, OPPOSE: [FACT 2], IRRELEVANT: [FACT 1][FACT 3]". If no FACT, then just give "NONE".
[TASK]
Question: {QUESTION}
CLAIM: {CLAIM_PLACEHOLDER}
FACTs: {FACT_PLACEHOLDER}
