AI-TA：利用开源大语言模型打造智能问答助教

Yann Hicke
Cornell University
Ithaca, USA
ylh8@cornell.edu
Anmol Agarwal

{}^{*}

Microsoft Research
Bangalore, India
t-agarwalan@microsoft.com

Qianou (Christina) Ma

{}^{*}

Carnegie Mellon University
Pittsburgh, USA
qianouma@cmu.edu
Paul Denny
University of Auckland
Auckland, New Zealand
p.denny@auckland.ac.nz
Equal contribution.

摘要

每学期在在线问答平台上回答数千个学生问题会耗费相当大的人力成本，特别是在入学人数快速增长的计算机课程中。为了应对可扩展和智能问答 (QA) 的挑战，我们引入了一种创新解决方案，该解决方案利用 LLaMA-2 系列的开源大型语言模型（大语言模型）来确保数据隐私。我们的方法结合了增强技术，例如检索增强生成（RAG）、监督微调（SFT）以及使用直接偏好优化（DPO）从人类偏好数据中学习。通过对 CS 入门课程的 Piazza 数据集（包括 10,000 对 QA 对和 1,500 对偏好数据）进行广泛实验，我们证明答案质量显着提高了 30%，其中 RAG 是一个特别有影响力的补充。我们的贡献包括开发一种新颖的教育 QA 架构、利用人工评估和基于 LLM 的指标对大语言模型性能进行广泛评估，以及对教育数据处理的挑战和未来方向的见解。这项工作为 AI-TA 的开发铺平了道路，这是一款可通过在线 QA 平台为课程定制的智能 QA 助手。 ¹¹1Relevant code and sample snippets of data for this work can be found at https://github.com/aita-ai4ed/AITA.git

1简介

Piazza 或 EdStem 等在线问答 (QA) 平台通常用于对学生有关课程内容、作业和管理事务的询问提供及时有效的答复[47]。然而，每学期回答数千个学生问题需要付出相当大的人力成本。这一挑战在各个学科中都普遍存在，但由于计算机课程的入学人数快速增长，因此它尤其重要[32]。

迫切需要创新且可扩展的解决方案，为学生的疑问提供即时、准确的答案，从而在不产生额外成本的情况下增强教育体验。最近解决这个问题的方法已经暴露出一些局限性。例如，2015年佐治亚理工学院推出了Jill Watson AI[14]；该模型依赖于本体框架和问答对数据库[13]，这限制了可以回答的查询类型（物流或教学大纲相关），并且需要大量时间来开发和维护。最近，哈佛大学的 CS50.AI 正在探索使用专有的大型语言模型（大语言模型），例如 OpenAI 的 GPT-3.5 和 GPT-4 [1]。 Khan Academy 的 Khanmigo 是一个类似的聊天机器人导师，也由 GPT-4 [5] 提供支持。这些工具将学生的查询发送给第三方提供商，从而引发了数据隐私问题。

为了克服上述限制，我们利用开源大语言模型的适应性来提供多种查询响应，最重要的是确保数据隐私。我们当前的实现使用 LLaMA-2 [48]，这是目前最先进的开源大语言模型之一，并作为我们管道的基础，如所示图1。我们通过新颖的技术组合增强了基线 LLaMA-2 模型，并在 CS1（入门级编程）课程的匿名 Piazza QA 数据集上对其进行了测试(部分 3)。我们应用检索增强生成（RAG）[25]、监督微调（SFT）[10]以及人类反馈强化学习的替代方案（RLHF）[57]，以提高针对学生问题生成的答案的质量(部分 4)。我们使用 GPT-4 [26, 50] 评估不同模型的性能，并对模型答案的子集进行手动评估(部分 5)。在这项工作中，我们展示了我们的初步知识发现和架构，以实现开发AI-TA的最终目标，这是一个基于定制的内部大语言模型的智能QA助手，可以定制通过在线问答平台支持任何课程。

我们的贡献可概括如下：

•

我们引入了一个利用开源大语言模型来执行教育 QA 任务的管道。据我们所知，我们是第一个将 RAG、SFT 和 DPO 结合起来开发专为教育环境设计的强大 QA 系统的公司。
•

我们对我们的管道进行了全面评估，包括基于 LLM 和基于标题的人类评估，基于一系列模型和基线。此外，我们分析了增强技术的不同组合对模型性能的影响。
•

我们利用 CS 入门课程中的大型 Piazza QA 数据集，研究了基于 LLM 的解决方案在教育 QA 的独特背景下的应用。我们还讨论了在此背景下与数据处理和评估相关的挑战、局限性和未来方向。

Refer to caption — 图1： AI-TA 流程概述：我们结合了多种技术，包括 (1) 使用课程材料的检索增强生成 (RAG)，(2) 使用 Piazza QA 数据的监督微调 (SFT) ，（3）从人类偏好中学习，为此我们使用直接策略优化（DPO）作为 RLHF 的代理。

2相关工作

指令数据微调：自监督语言模型（LM）在扩大规模后能够实现零样本[38]或少样本任务完成[7]。通过在包含指令和人工生成文本[10]的特定数据集上进行微调，可以显着提高专门任务的性能以及与用户目标的一致性。

对人类偏好数据进行微调：在实践中，根据人类对响应质量的偏好收集数据通常比收集专家演示以进行指令调整更可行。这涉及训练奖励模型，通常使用偏好模型来预测生成文本的分数。然后，通过强化学习算法（例如近端策略优化 (PPO)[43])对语言模型进行微调，以最大化此奖励函数。这种方法之前已用于提高 LM 在翻译 [22]、摘要 [46]、讲故事 [57] 等任务中的性能>，并遵循指令[35]。尽管如此，RLHF 比监督学习更复杂，容易出现优化不稳定，并且对超参数调整敏感。

直接偏好优化（DPO）[39]是一种更加简化的方法，它利用奖励函数和最优策略之间的关系来解决策略训练的单个阶段中的约束奖励优化问题。本质上，它将问题视为基于人类偏好数据的分类任务，并声称更加稳定和高效。

参数高效微调：训练最小的 LLaMA-2 模型，即 LLaMA-2-7B，仅批量大小为 4 就需要超过 100GB 的内存。为了解决这些问题，[11]引入了大语言模型.int8()，一种8位矩阵乘法方法，通过量化参数来显着缩小模型大小。随后的工作是进一步优化同时存储在内存中所需的参数数量[12]，它建立在低秩适应方法[17]之上。

检索增强生成：微调不太适合将知识注入模型，因为它无法跟上频繁变化的数据，可能泄漏敏感信息，由于缺乏参考而缺乏可解释性，并且存在灾难性遗忘的风险。 RAG [25, 15] 通过单独的搜索系统检索知识并奠定模型的推理和综合基础来解决这些问题。通过定期更新的搜索索引，大语言模型所使用的知识始终是最新的。通过检查大语言模型使用的文档，用户可以检查生成的响应的来源并检测潜在的幻觉。

基于人工智能的问答助手：不同领域的先前工作已经采取了微调开源大语言模型的方法，并使用不同的技术对其进行增强，例如 RAG [25] 和 RLHF [57]. 例如，ClinicalGPT [49] 使用 RAG 和 RLHF 将 Bloom-7B 微调到医疗和保健领域，与评估的基线 LLaMA-7B 和 ChatGLM-6B 相比，在问答方面表现出更好的性能使用 GPT-4。一些作者也在 Generative AI 黑客马拉松中初步探索了这个想法 ²²2https://genaihackathon2023.github.io，黑客马拉松的代码存储库可以在此处找到。

3数据集

在我们的实验中，我们使用了预匿名的 Piazza QA 数据集，该数据集涵盖新西兰一所公立研究型大学教授的 Matlab 和 C 大型入门编程课程的八个学期（从 2015 年到 2022 年）。该数据集总共包含 $10,647$ 个 QA 对。我们保留了 $2022$ 学期的 $717$ QA 对进行评估，并且由于计算和预算限制，我们进一步随机选择了 $236$ 对的子集。此外，我们还获得了 $2022$ 学期的教学材料，包括课程手册、实验讲义和评估日期以进行信息检索。有关数据集的更多详细信息，请参阅附录A。

数据准备：在准备 SFT 的 QA 数据集时，我们使用学生的问题和教师或其同伴的相应答案提取单轮问答对。如果一个问题有多个答案（例如，学生答案和教师答案），我们只保留最终教师的答案。此外，有些帖子还包含各种图像，如代码截图、表格、图表、IDE等。我们对这些图像进行了一些关于对象字符识别（OCR）的基础知识探索，但由于图像内容多样化，质量仍然有限，所以我们最终过滤掉了带有图像的帖子。在部分 6中，我们讨论了更高质量的OCR [6]以及多回合和对话问答的未来工作支持[53]。

此外，在分析数据集时，我们观察到存在接近重复的查询，即使在同一学期内也是如此。除了其他缺点[24, 36]之外，数据重复和训练测试重叠还可能导致模型准确性过高估计。因此，我们进行了重复数据删除，其中我们为每个查询创建了一个嵌入³³3using sentence-transformers/all-MiniLM-L6-v2，应用聚类将潜在的重复数据归入同一聚类，每个聚类只保留一个成员。这将我们的数据集大小减少了 $5.1\%$ 。最后，我们构建了一个没有图像或重复项的 $10,647$ 单轮 QA 对的数据集。有关重复项的示例，请参阅附录 A.2。

此外，我们还从 Piazza 数据集中创建了 $1,549$ 对具有代理人类偏好的答案。我们使用了 Piazza 帖子的编辑历史记录：一些答案有原始回复，但随后经过编辑以进行改进。我们采用了更喜欢最终答案而不是第一个答案的启发法来创建这个人类偏好数据集。请参阅部分A.3了解偏好对的示例，以及附录B 有关模型数据格式的更多详细信息。

4方法

在我们的实验中，主要目标是评估大语言模型在教育领域的能力，特别是当它们通过 RAG、SFT 和 DPO 进行增强时。有关训练和推理参数的更多详细信息，请参阅附录C。

基线：在许多高水平的大语言模型[34, 3, 9, 2, 37]中，一些模型并未公开，但与它们的模型相比，它们往往表现出更好的性能开源同行。由于 LLaMA-2 系列 [48] 的广泛采用和支持，我们使用 LLaMA-2 系列作为我们的开源模型基线，特别是 LLaMA-2-7B-chat、LLaMA-2-13B-chat 和LLaMA-2-70B-聊天。对于闭源模型基线，我们使用 GPT-4 [34]。 ⁴⁴4在向 OpenAI API 发送推理查询之前，我们确保查询中不存在私人或可识别信息。

有监督微调（SFT）：我们使用预训练的 LLaMA-2-13B-chat 来初始化模型，并在单个 A100 GPU 上使用前面描述的 QA 数据集执行指令调整。我们用 100 个预热步骤训练了三个时期的模型，并使用余弦学习率调度程序将学习率设置为 $1e-4$ 。

符合人类偏好 (DPO)：由于 DPO [39] 无需在微调或广泛的超参数调整期间从语言模型中采样，因此它在计算上比通过 PPO 的 RLHF 便宜得多。它需要更少的 VRAM 来加载所有三个模型（策略、价值和奖励模型），而且超参数优化也不那么昂贵。我们使用 DPO 作为人类偏好数据的微调算法，其性能相当于 RLHF，但在单个 A6000 GPU 上。我们还以 DPO 目标对模型进行了三个 epoch 的训练。

参数高效微调：我们的监督微调和 DPO 管道都使用 peft Huggingface 库在微调之前加载我们的基本模型。我们以 4 位配置加载模型，然后通过 QLora 方法 [12] 进行训练。

检索增强生成 (RAG)：我们使用 Nougat [6]（一种 Visual Transformer 模型）从 PDF（例如教科书和实验手册）中提取文本，该模型将 PDF 转换为 Markdown 文件，并且在捕获 LaTeX 数学的多个实例方面尤其令人印象深刻， PDF 中存在表格数据和代码片段。

为了提高检索效率并考虑到大语言模型上下文长度的限制，我们对各个源文档进行分块，每个块最大限制为 $1000$ 个字符，并且重叠 $100$ 相邻块之间的字符。为了将段落、句子和单词保持在一起，我们使用预定义的分隔符（包括双换行符、单换行符、空格和空字符串）分割文档，直到生成的块足够小。我们观察到，(1) 基于嵌入的检索器更适合需要更细致地理解语言的查询，例如逻辑查询，(2) 基于关键字的检索器 对于引用特定编码函数和分配问题名称的查询更加稳健。为了利用两者的优势，我们使用基于嵌入的 Dense Passage Retriever [19] 检索到的 top- $3$ 文档的并集来构建上下文> 并使用BM25排名函数检索到的前 $2$ 文档。请注意，我们没有对检索器进行端到端的训练，而是用检索到的文档来增强提示。检索器上下文的提示构造可以在附录B中找到。

评估：我们采用了两种方法来评估我们的模型：人工评估和基于 LLM (GPT-4) 的评估。先前的研究[44]表明，BLEU、Rouge-L 和 Cider 等传统评估指标无法有效捕获模型响应的有用性和准确性，而这在 AI-TA 的用例。不过，为了完整起见，我们还在 Section 5 中报告了基于 BertScore-F1 [54] 指标的分数。

•

人类评价：在我们的研究中，我们利用了一组改编自先前工作[44, 55]的标题项目，并通过迭代过程对其进行改进。我们的最终评分标准将 $0$ 、 $0.5$ 或 $1$ 分数（分数越高表示质量越好）分配给两个不同的标准：实用性 和准确性。实用性类别衡量模型答案对人类讲师或助教 (TA) 的实用价值，同时考虑答案的自然性、连贯性和简洁性等因素。另一方面，准确性类别评估模型答案中所提供信息的事实正确性和基础性。例如，实用性得分为 $1$ 表示答案非常有用，只需很少的修改或无需修改即可轻易被助教接受，而准确性得分为 $0$ 表示答案非常有用。答案完全不准确并提供虚假信息。请参阅附录中的表 6 E了解评估标准的详细说明及其应用示例。

对于我们研究的人工评估部分，我们从 $236$ 问题评估集中随机选择了 20 个问题的子集。每个问题都附有 9 个不同的标准答案，总共有 180 个手动评分答案。表 5、部分 5<中列出了本次评估中使用的9个模型/t5>。其中一位作者使用上述评分标准对所有 180 个答案进行了标注。
•

大语言模型评测：最近的研究利用 GPT-4 [34] 来评估模型 [29, 52, 56] 生成的文本的质量。 Alpaca Eval [27] 证明 GPT-4 是最好的自动注释器，基于一组 2.5k 人类注释，与人类的一致性最高。然而，GPT-4 也被证明更喜欢 Instruct-GPT 模型家族[16]产生的世代。在我们的实验中，我们提示 GPT-4 使用与人类评估相同的规则来评估每个 $9$ 模型的 $236$ 问题生成的答案。我们的提示（包含在附录E中）改编自AlpacaEval框架[27]和G-Eval[29]⁵⁵5 AlpacaEval 的原始提示可以在此处找到。我们使用此处提供的 G-Eval 实现.

5结果

对于人工评估，我们在 $9$ 不同模型上评估了 $20$ 个问题。对于 BertScore 和基于 LLM 的评估，我们评估了 $9$ 模型上的 $236$ 问题。分数总结在表1中。请参阅附录中表5D例如每个模型的结果和评级。

表格1： BertScore、人类评估分数和大语言模型评估分数（平均

\pm

标准差）按照评分标准。粗体数字表示开源和闭源大语言模型中性能最好的。请注意，评估是在 LLaMA-2 模型的聊天变体上进行的。

Model	Evaluation by Humans			Evaluation using GPT-4			Bert Score F1
	Usefulness	Accuracy	Avg	Usefulness	Accuracy	Avg	Bert Score F1
LLaMA-2-7B	0.18 ( $\pm$ 0.29)	0.28 ( $\pm$ 0.38)	0.23	0.54 ( $\pm$ 0.35)	0.42 ( $\pm$ 0.35)	0.48	0.454
LLaMA-2-13B (L-13)	0.18 ( $\pm$ 0.24)	0.30 ( $\pm$ 0.41)	0.24	0.62 ( $\pm$ 0.35)	0.52 ( $\pm$ 0.38)	0.57	0.455
L-13+RAG	0.53 ( $\pm$ 0.41)	0.50 ( $\pm$ 0.40)	0.51	0.61 ( $\pm$ 0.35)	0.52 ( $\pm$ 0.38)	0.57	0.453
L-13+SFT	0.13 ( $\pm$ 0.28)	0.10 ( $\pm$ 0.21)	0.11	0.26 ( $\pm$ 0.32)	0.21 ( $\pm$ 0.33)	0.24	0.400
L-13+SFT+DPO	0.23 ( $\pm$ 0.38)	0.33 ( $\pm$ 0.44)	0.28	0.60 ( $\pm$ 0.36)	0.50 ( $\pm$ 0.39)	0.55	0.453
L-13+SFT+DPO+RAG	0.50 ( $\pm$ 0.43)	0.58 ( $\pm$ 0.41)	0.54	0.63 ( $\pm$ 0.36)	0.53 ( $\pm$ 0.39)	0.58	0.449
LLaMA-2-70B	0.20 ( $\pm$ 0.34)	0.33 ( $\pm$ 0.41)	0.26	0.63 ( $\pm$ 0.36)	0.53 ( $\pm$ 0.36)	0.58	0.456
GPT-4	0.28 ( $\pm$ 0.34)	0.40 ( $\pm$ 0.42)	0.34	0.78 ( $\pm$ 0.29)	0.68 ( $\pm$ 0.32)	0.73	0.458
GPT-4 + RAG	0.65 ( $\pm$ 0.43)	0.73 ( $\pm$ 0.34)	0.69	0.81 ( $\pm$ 0.27)	0.72 ( $\pm$ 0.32)	0.77	0.469

5.1人工评价

基线模型性能相似，随着尺寸的增加而提高，GPT-4 表现最好。

根据人类评估分数，基线模型的性能随着尺寸的增加而提高，其中 GPT-4 表现出最佳性能。不同尺寸（7B、13B、70B）的 LLaMA 模型不会生成显着不同的答案，尽管较大的模型表现稍好。然而，基线模型的评级存在很高的标准偏差，这表明它们的性能根据查询类型的不同而存在很大差异。例如，没有检索上下文的模型在与作业相关的逻辑问题上表现不佳，但在概念性问题（例如 C 代码语法）上表现良好。

不同的增强技术（SFT、DPO、RAG）对性能有不同的影响。

单独使用 SFT 并不能提高性能,LLaMA-2-13b+SFT 的性能比 LLaMA-2-13B 基础模型差 13% 。定性地，我们观察到LLaMA-2-13B + SFT经常产生幻觉，这使得它的答案既无用又不准确。微调应该教会模型根据其知识输出正确的答案。然而，有一些查询（例如逻辑查询）在没有相关上下文的情况下无法回答，并且预训练数据中缺乏有关此类查询的知识。对此类查询的微调可能会给模型注入猜测的倾向，因此可能解释幻觉。

DPO 似乎可以调节模型中的幻觉。与 LLaMA-2-13B+SFT 模型相比，LLaMA-2-13B+SFT+DPO 模型的人类评估分数提高了 $17$ % ，准确率显着提高 $23$ %。然而，其性能仅比 LLaMA-2-13B 基本模型稍好 $4$ %。

RAG 对性能的影响最为显着，所有采用 RAG 的模型都比未采用 RAG 的模型表现出更好的结果。例如，GPT-4+RAG 与没有 RAG 的 GPT-4 相比，性能提高了 $35$ %，而 LLaMA2-13b+SFT+与LLaMA2-13b+SFT+DPO模型相比，DPO+RAG提高了 $26$ %。定性地，我们观察到基于 RAG 的模型中的答案更加准确，无关信息更少(附录中的示例 D、表4)。

模型答案：通常有帮助，可能超越人类或完全失败。

根据人类评估，第二好的模型 LLaMA2-13B+SFT+DPO+RAG 的性能比 GPT-4+RAG 差 $15$ %在实用性和准确性指标方面。然而，与 LLaMA-2-13B 基线相比，其性能提高了 30%。总体实用性为 $0.50$ ，准确度为 $0.58$ ，与相比，它对于教师来说已经很有价值，可以减少工作量，同时还可以减轻隐私问题GPT-4。

值得注意的是，模型答案有时可以在全面性和积极情绪方面超越人类的真实答案。例如，一名学生问了问题“当代码很长时我们会受到惩罚吗？”人类的答案是 “最好的办法就是尽可能地简化。如果太长，可能效率不高，可以改进”. 相比之下，LLaMA2-13b+SFT+DPO+RAG 的回复（如下所示）更加详细和令人鼓舞：

“你好！关于您对长代码处罚的担忧，我很高兴澄清，课程中使用长代码不会受到处罚。然而，值得注意的是，长代码有时会难以阅读和维护，因此最好在编码中力求简洁和清晰。这不仅会使您的代码更易于管理，而且还可以帮助您避免不必要的错误和错误。所以，专注于编写干净、高效、结构良好的代码！如果您还有任何其他问题或疑虑，请随时询问，我会随时为您提供帮助。”

另一方面，也存在所有模型都无法生成任何合理输出的情况。例如，当一个问题要求助教在另一个平台上检查他们的提交结果时，没有一个模型能够做出适当的回应。这甚至不是模型应该执行的任务。

5.2使用 GPT-4 进行评估

如表 1所示，人类和大语言模型（GPT-4）评估均一致认为GPT-4+RAG 是所有模型中最强的。然而，大语言模型评估将基线GPT-4评为第二佳模型，这与人类评估结果不同。这种差异可能归因于 [16] 的发现，该发现表明 GPT-4 倾向于支持基于 InstructGPT、GPT-4 或 ChatGPT 派生的数据训练的模型，即使内容不太真实或有用。

还值得注意的是，GPT-4 给基线模型分配的分数比人类评估的分数高得多。虽然大语言模型评估确实捕捉到了模型之间的一些差异（例如，它也一致表明，单独的 SFT 会使 LLaMA2-13b 表现更差），但它与人类评估的一致性不够好。为了量化人类和大语言模型评估之间的一致性，我们计算了 Pearson ( $r$ ) 和 Spearman ( $\rho$ ) 相关分数。在针对 9 个模型的 20 个查询（180 个响应）中，我们发现人类和 LLM 评估之间的皮尔逊系数为 $0.662$ ，斯皮尔曼系数为 $0.659$ 。这些分数表明两种类型的评估之间存在中等正相关，这表明虽然人类评估和大语言模型评估之间存在一定的一致性，但也存在许多差异。不同类型评估方法之间相关性的详细分析请参见部分E.2。

6 局限性和未来的工作

在这项工作中，我们的主要目标是利用我们自己的内部基础设施来生成对学生查询的响应。我们的目标是最大限度地利用开源模型，而不是依赖商业模型和通过 API 调用进行快速工程。这种方法使我们能够避免使用第三方服务提供商带来的不可靠性和数据隐私问题。我们的实验表明，使用不同的增强技术可以提高 LLaMA-2 系列较小的内部模型的性能，使它们更接近功能更强大的 GPT-4 模型。然而，我们未来的工作仍有进一步改进的巨大潜力。

数据：我们的工作受到数据集过滤和准备过程的限制。特别是，由于在论坛式数据中分离问题和答案的挑战，我们可能会丢失未包含在内的后续交互中有价值的 QA 数据。此外，包含图像（例如屏幕截图）的帖子被排除在我们的数据集中。正如最近的研究[53,40,41,51]所建议的那样，未来可以探索多回合和对话问答。此外，我们可以采用更好的 OCR 模型[6]来处理数据源中的图像。

训练：我们的流程由多个阶段组成 - RAG、SFT 和 RLHF - 并且每个阶段都可以单独进一步开发和增强。在这项工作中，我们仅测试了 LLaMa-2 模型。未来，人们计划尝试使用专门的特定领域模型，例如 CodeLLaMA，它具有增强的编码功能[42]。我们还计划探索如何使用模型集合来提高性能，同时降低推理成本[8]。正如我们的工作已经证明的那样，学生的疑问及其相应的答案在不同学期中通常是相似的。因此，我们打算在[30]中已经探索过的方法的基础上，检索前学期的可比较的QA对，并在推理过程中将它们用作少样本示例。此外，端到端训练检索器可以进一步提高检索到的上下文[25]的质量。

评估：我们的人工评估目前受到 20 个回复的小样本量的限制，这些回复仅由一名注释者进行评分。未来，我们计划扩大我们的评估范围，包括更大的样本量和多个注释者，以便为我们的评分标准建立强大的评估者间可靠性。值得注意的是，大语言模型和现有的自动评估方法可能并不总是与人类教师的回答启发式相一致，或者准确地捕捉答案的教学价值。为了解决这个问题，我们将借鉴现有工作[20,18,50,28]的见解，继续探索自动评估方法。

此外，迄今为止，我们对 LLaMA-2 模型的实验仅限于单个课程。人们注意到，与 Python 和 Java [33] 相比，现有的预训练模型在 Matlab 和 C 等语言上的性能可能不佳。为了解决这个问题，我们计划在一系列大学计算机科学课程中部署我们的系统，以便收集助教和讲师的反馈并尝试不同的模型配置。

部署和用户研究：我们目前正在开发 AI-TA 工具，利用 Piazza、Discord 和 Chrome 扩展程序中的 API，将 TA 反馈收集无缝集成到他们的日常 QA 工作流程中。我们计划进行一项用户研究，以评估 AI-TA 为课堂带来的生产力和教学价值。此外，我们将探索界面和交互设计，以最大化人类人工智能问答的价值并最小化风险[4]。例如，我们可以向助教展示模型答案的不确定性估计[23]，或者将模型的答案呈现为不完美的同学讨论[31]。跨不同机构训练模型也可能有益，以可扩展且安全的方式利用学科的集体教学知识，同时仍使用分布式机器学习维护每个学校的数据隐私[21, 45]。

致谢和资金披露

我们感谢 Saloni Parekh、Yunxin Gu 和 Ruixuan Tang 参与了这项工作的概念化。 Aashika Vishwanath、Rujun Taka 和三位第一作者作为黑客马拉松获胜团队获得了卡内基梅隆大学 Block Center 的资助。本作品中表达的任何意见、发现、结论或建议均为作者的观点，并不一定反映资助实体的观点。

参考

CS [5] MS Windows NT kernel description. https://www.thecrimson.com/article/2023/6/21/cs50-artificial-intelligence/. Accessed: 2023-09-16.
Anil et al. [2023] Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, and Paige Bailey et al. Palm 2 technical report, 2023.
Bai et al. [2022] Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional ai: Harmlessness from ai feedback, 2022.
Bansal et al. [2021] Gagan Bansal, Tongshuang Wu, Joyce Zhou, Raymond Fok, Besmira Nushi, Ece Kamar, Marco Tulio Ribeiro, and Daniel Weld. Does the whole exceed its parts? the effect of AI explanations on complementary team performance. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, number Article 81 in CHI ’21, pages 1–16, New York, NY, USA, May 2021. Association for Computing Machinery. ISBN 9781450380966. doi: 10.1145/3411764.3445717. URL https://doi.org/10.1145/3411764.3445717.
Bidarian [2023] Nadia Bidarian. Meet khan academy’s chatbot tutor. CNN, August 2023. URL https://www.cnn.com/2023/08/21/tech/khan-academy-ai-tutor/index.html.
Blecher et al. [2023] Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic. Nougat: Neural optical understanding for academic documents. August 2023. URL http://arxiv.org/abs/2308.13418.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Chen et al. [2023] Lingjiao Chen, Matei Zaharia, and James Zou. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv preprint arXiv:2305.05176, 2023.
Chowdhery et al. [2022] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, and Gaurav Mishra et al. Palm: Scaling language modeling with pathways, 2022.
Chung et al. [2022] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Eric Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
Dettmers et al. [2022] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm. int8 (): 8-bit matrix multiplication for transformers at scale. arXiv preprint arXiv:2208.07339, 2022.
Dettmers et al. [2023] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314, 2023.
Goel et al. [2021] Ashok Goel, Harshvardhan Sikka, and Eric Gregori. Agent smith: Teaching question answering to jill watson. arXiv preprint arXiv:2112.13677, 2021.
Goel and Polepeddi [2018] Ashok K Goel and Lalith Polepeddi. Jill watson. Learning engineering for online education: Theoretical contexts and design-based examples. Routledge, 2018.
Guu et al. [2020] Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: Retrieval-augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, ICML’20. JMLR.org, 2020.
HF [2023] HF. Hf. https://huggingface.co/blog/llm-leaderboard, 2023.
Hu et al. [2021] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
Kabir et al. [2023] Samia Kabir, David N Udo-Imeh, Bonan Kou, and Tianyi Zhang. Who answers it better? an In-Depth analysis of ChatGPT and stack overflow answers to software engineering questions. August 2023. URL http://arxiv.org/abs/2308.02312.
Karpukhin et al. [2020] Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.550. URL https://aclanthology.org/2020.emnlp-main.550.
Kim et al. [2023] Seungone Kim, Jamin Shin, Yejin Cho, Joel Jang, Shayne Longpre, Hwaran Lee, Sangdoo Yun, Seongjin Shin, Sungdong Kim, James Thorne, and Minjoon Seo. Prometheus: Inducing fine-grained evaluation capability in language models, 2023.
Knott et al. [2021] Brian Knott, Shobha Venkataraman, Awni Hannun, Shubho Sengupta, Mark Ibrahim, and Laurens van der Maaten. CrypTen: Secure multi-party computation meets machine learning. September 2021. URL https://scontent.fagc1-1.fna.fbcdn.net/v/t39.8562-6/260321120_621702352397035_7531336426414693679_n.pdf?_nc_cat=106&ccb=1-7&_nc_sid=ad8a9d&_nc_ohc=xI-tMErFcxQAX8damhf&_nc_ht=scontent.fagc1-1.fna&oh=00_AfAvpJcL-YE4dkqTV2mheFleKL4UO8kPh3SkpCYW_5VznQ&oe=650B4BFE.
Kreutzer et al. [2018] Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. arXiv preprint arXiv:1805.10627, 2018.
Kuhn et al. [2023] Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. February 2023. URL http://arxiv.org/abs/2302.09664.
Lee et al. [2021] Katherine Lee, Daphne Ippolito, Andrew Nystrom, Chiyuan Zhang, Douglas Eck, Chris Callison-Burch, and Nicholas Carlini. Deduplicating training data makes language models better. July 2021. URL http://arxiv.org/abs/2107.06499.
Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-Tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented generation for Knowledge-Intensive NLP tasks. May 2020. URL http://arxiv.org/abs/2005.11401.
Li et al. [2023a] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023a.
Li et al. [2023b] Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpacaeval: An automatic evaluator of instruction-following models, 2023b.
Liu et al. [2023a] Chengyuan Liu, Divyang Doshi, Ruixuan Shang, Jialin Cui, Qinjin Jia, and Edward Gehringer. Less but enough: Evaluation of peer reviews through pseudo-labeling with less annotated data. JEDM, 15(2):123–140, June 2023a. ISSN 2157-2100, 2157-2100. doi: 10.5281/zenodo.7304981. URL https://jedm.educationaldatamining.org/index.php/JEDM/article/view/613.
Liu et al. [2023b] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment, 2023b.
Lu et al. [2022] Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. ReACC: A retrieval-augmented code completion framework. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6227–6240, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.431. URL https://aclanthology.org/2022.acl-long.431.
Ma et al. [2023] Qianou Ma, Hua Shen, Kenneth Koedinger, and Tongshuang Wu. HypoCompass: Large-Language-Model-based tutor for hypothesis construction in debugging for novices. October 2023. URL http://arxiv.org/abs/2310.05292.
National Academies of Sciences et al. [2018] Medicine National Academies of Sciences, Engineering et al. Assessing and responding to the growth of computer science undergraduate enrollments. National Academies Press, 2018.
Nguyen and Nadi [2022] N Nguyen and Sarah Nadi. An empirical evaluation of GitHub copilot’s code suggestions. 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR), 2022. doi: 10.1145/3524842.3528470. URL https://www.semanticscholar.org/paper/cdfe9580f63070f311151444f9df32818cc858bf.
OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Penedo et al. [2023a] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The RefinedWeb dataset for falcon LLM: Outperforming curated corpora with web data, and web data only. June 2023a. URL http://arxiv.org/abs/2306.01116.
Penedo et al. [2023b] Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: Outperforming curated corpora with web data, and web data only, 2023b.
Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Rafailov et al. [2023] Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. arXiv preprint arXiv:2305.18290, 2023.
Reddy et al. [2019] Siva Reddy, Danqi Chen, and Christopher D Manning. CoQA: A conversational question answering challenge. Trans. Assoc. Comput. Linguist., 7:249–266, November 2019. ISSN 2307-387X. doi: 10.1162/tacl_a_00266. URL https://direct.mit.edu/tacl/article/43511.
Ross et al. [2023] Steven I Ross, Fernando Martinez, Stephanie Houde, Michael Muller, and Justin D Weisz. The programmer’s assistant: Conversational interaction with a large language model for software development. February 2023. URL http://arxiv.org/abs/2302.07080.
Rozière et al. [2023] Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. Code llama: Open foundation models for code. August 2023. URL http://arxiv.org/abs/2308.12950.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Shi et al. [2023] Ensheng Shi, Fengji Zhang, Yanlin Wang, Bei Chen, Lun Du, Hongyu Zhang, Shi Han, Dongmei Zhang, and Hongbin Sun. Sotana: The open-source software development assistant. arXiv preprint arXiv:2308.13416, 2023.
Soin et al. [2021] Arjun Soin, Pratik Bhatu, Rohit Takhar, Nishanth Chandran, Divya Gupta, Javier Alvarez-Valle, Rahul Sharma, Vidur Mahajan, and Matthew P Lungren. Multi-institution encrypted medical imaging ai validation without data sharing. SSRN Electron. J., 2021. ISSN 1556-5068. doi: 10.2139/ssrn.3973993. URL https://www.microsoft.com/en-us/research/publication/multi-institution-encrypted-medical-imaging-ai-validation-without-data-sharing/.
Stiennon et al. [2020] Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
Thinnyun et al. [2021] Adrian Thinnyun, Ryan Lenfant, Raymond Pettit, and John R. Hott. Gender and engagement in cs courses on piazza. In Proceedings of the 52nd ACM Technical Symposium on Computer Science Education, SIGCSE ’21, page 438–444, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450380621. doi: 10.1145/3408877.3432395. URL https://doi.org/10.1145/3408877.3432395.
Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Wang et al. [2023a] Guangyu Wang, Guoxing Yang, Zongxin Du, Longjun Fan, and Xiaohu Li. ClinicalGPT: Large language models finetuned with diverse medical data and comprehensive evaluation. June 2023a. URL http://arxiv.org/abs/2306.09968.
Wang et al. [2023b] Jiaan Wang, Yunlong Liang, Fandong Meng, Zengkui Sun, Haoxiang Shi, Zhixu Li, Jinan Xu, Jianfeng Qu, and Jie Zhou. Is ChatGPT a good NLG evaluator? a preliminary study. March 2023b. URL http://arxiv.org/abs/2303.04048.
Wang et al. [2023c] Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, and Heng Ji. MINT: Evaluating LLMs in multi-turn interaction with tools and language feedback. September 2023c. URL http://arxiv.org/abs/2309.10691.
Yidong et al. [2023] Wang Yidong, Yu Zhuohao, Zeng Zhengran, Yang Linyi, Heng Qiang, Wang Cunxiang, Chen Hao, Jiang Chaoya, Xie Rui, Wang Jindong, et al. Pandalm: Reproducible and automated language model assessment, 2023.
Zaib et al. [2022] Munazza Zaib, Wei Emma Zhang, Quan Z Sheng, Adnan Mahmood, and Yang Zhang. Conversational question answering: a survey. Knowl. Inf. Syst., 64(12):3151–3195, December 2022. ISSN 0219-1377, 0219-3116. doi: 10.1007/s10115-022-01744-y. URL https://doi.org/10.1007/s10115-022-01744-y.
Zhang* et al. [2020] Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=SkeHuCVFDr.
Zhong et al. [2022] Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. Towards a unified Multi-Dimensional evaluator for text generation. October 2022. URL http://arxiv.org/abs/2210.07197.
Zhou et al. [2023] Chunting Zhou, Pengfei Liu, Puxin Xu, Srini Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206, 2023.
Ziegler et al. [2019] Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593, 2019.

附录

附录A数据集描述

A.1 数据集统计

Piazza 是一个广泛使用的在线问答论坛，供 90 个国家/地区的数千所大学的学生和教师使用。在 Piazza 数据集中，有六种类型的用户贡献，包括问题（即由学生或教师创建的帖子，用于提出问题或发起有关特定主题的讨论。)、i_answer（即讲师针对问题提供的答案）、s_answer（即同学针对问题提供的答案）、跟进、followup_response 和注释。每个帖子都有一个关联的文件夹标签，学生和教师使用该标签对帖子进行分类，例如考试、后勤等。

在我们分析的课程数据中，跨越八个学期，平均每学期有 6 名教师和 953 名学生。总共记录了 $18,181$ 个帖子和 $60,884$ 个用户贡献，我们从中提取了 $19,391$ 个问答对。平均而言，每个贡献（例如问题）都包含 $52$ $\pm$ $155$ 标记和 $38$ $\pm$ $56$ 个单词，最大贡献为 $5,561$ 个标记，长度为 $3,688$ 个单词。后续跟进回复中可以进一步提问——六类用户贡献各自占发帖总数的比例、平均字数、词符总数如图图2(a)所示——Piazza数据集中总 Token 的25%是后续和后续响应。

我们提供了 2022 年学期的数据进行评估，其中有 $8$ 个教师， $1,096$ 个学生，总共 $1,870$ 个帖子。 2022年学期不同文件夹和贡献类型的帖子分布如图图2(b)所示。

值得注意的是，我们研究中使用的数据集存在一些复杂性。例如，一个问题可能有多个答案，有些问题可能包含图像或重复。为了应对这些挑战，我们建立了一套选择真实答案的标准。如果某个问题有教师和同学的答案，我们会优先考虑教师的回答作为基本事实。如果没有教师答案，我们会将学生提供的最新答案视为基本事实。没有任何答案的查询将被丢弃。此外，我们排除了问题或答案正文中包含图像的所有帖子，从而从数据集中删除了总共 5,801 张图像。我们基于数据聚类进行重复数据删除。最后，我们的数据集中有 $10,647$ 个 QA 对，其中 $717$ 个 2022 年学期的 QA 对被保留用于评估。

A.2 查询中的重复

表2中可以找到通过集群检测到的重复查询的一些示例。

表2：我们的数据集中存在重复查询的示例

Example Cluster 1	Example Cluster 2
• Lab 9 task 2: When I run this code, it says it cannot open input file "blank.bmp." How can I get it to open it? • Lab 9 task 2: When I run task 2 in developer command, it says: could not open input file, "Blank.bmp". I added it to my source files so what is causing this problem? • "Could not open input file: blank.bmp": How do I fix this error for lab 9 task 2? This is in Visual Studios, running it through the developer command prompt works fine. • Lab 9 Prep Task 2: Whenever I try to run the code it says that I "could not open input file blank.bmp" And also I have no idea how to approach this task help would be much appreciated. Edit: No worries this has been solved, you must put the file in the project directory.	• Where can we access computers with Matlab in order to do our labs before lab times, or to get some Matlab study done? • Which computers in the University have Matlab installed on them? I’m guessing only the ones in the science department? • Which labs in the uni have Matlab installed so that we can check our running times if we’ve done them on our laptops? • Which computers have Matlab installed in uni? Because if work from labs are unfinished it would be convenient to know where you could finish it. • Are there any open computer labs at uni which have MATLAB, apart from the engineering ones?
Example Cluster 3	Example Cluster 4
• Are we allowed calculators in this test tonight? • Are we allowed to use calculators for the test? • Are we permitted calculators in the test? • Are we allowed a calculator for the test? • In the test, will we be allowed calculators? • Is calculator usage allowed tonight?	• fprintf vs sprintf: Can these two be used interchangeably? • General Q: Does it make a difference if you use fprintf or sprintf? • When do we use fprintf and sprintf? Really confused by these two things. • What are the differences between Sprintf and Fprintf? And what does the “f” at the end of both commands correlate to? • What is the difference between fprintf and sprintf? When do we use which one?

A.3 DPO 的偏好数据

我们的数据集中的偏好数据的一些示例可以在表 3 中找到。

表3：我们数据集中偏好对的一些示例

Query: Hello all, what is the difference between writing ’&’ and ’&&’. E.g ⬇ if(income > 14000 && income <= 48000) tax = 0.105 * 14000 + 0.175 * (income - 14000); end Preferred response: && is for scalars & is for vectors. The way in which MATLAB compares is different for each of the two there is also two more operators (\|\| and \|). Over-simplifications aside && and \|\| are short circuit operators, they go left to right. If any of the conditions are false then the comparing is stopped. So for conditionOne && conditionTwo the condition on the left if it meets requirements (False here) will not check the condition on the right. Non-preferred response: && is for scalars & is for vectors
Query: How does the Quotient "%" thing work? Tried googling it, did a bunch of tests in C to try and work it out myself. No luck. How does it calculate stuff? Preferred response: This is the modulus operator, and gives you the remainder of a division operation. e.g. 6%2 will give you 0. You’ve already met the modulus operator as part of the Matlab project, where mod(5,2) would find the remainder when 5 is divided by 2 (handy for telling if a number is even or not). There was even a recording on how to use it. If you understand how mod(5,2) works in Matlab, then in C you have the benefit of a shorthand for it, i.e. 5%2. Non-preferred response: This is the modulus operator, and gives you the remainder of a division operation. e.g. 6%2 will give you 0.
Query: How do I get my while loop to start from the second line of data (e.g. not the header)? Preferred response: Add a fgetl(fid); before your loop starts. Alternatively, you could try to write some logic to skip it in the loop. Non-preferred response: Add a fgetl(fid); before your loop

Query: Hello all, what is the difference between writing ’&’ and ’&&’. E.g ⬇ if(income > 14000 && income <= 48000) tax = 0.105 * 14000 + 0.175 * (income - 14000); end

Preferred response: && is for scalars & is for vectors. The way in which MATLAB compares is different for each of the two there is also two more operators (|| and |).

Over-simplifications aside && and || are short circuit operators, they go left to right. If any of the conditions are false then the comparing is stopped. So for conditionOne && conditionTwo the condition on the left if it meets requirements (False here) will not check the condition on the right.

Non-preferred response: && is for scalars & is for vectors

Query: How does the Quotient "%" thing work? Tried googling it, did a bunch of tests in C to try and work it out myself. No luck. How does it calculate stuff?

Preferred response: This is the modulus operator, and gives you the remainder of a division operation. e.g. 6%2 will give you 0. You’ve already met the modulus operator as part of the Matlab project, where mod(5,2) would find the remainder when 5 is divided by 2 (handy for telling if a number is even or not). There was even a recording on how to use it. If you understand how mod(5,2) works in Matlab, then in C you have the benefit of a shorthand for it, i.e. 5%2.

Non-preferred response: This is the modulus operator, and gives you the remainder of a division operation. e.g. 6%2 will give you 0.

Query: How do I get my while loop to start from the second line of data (e.g. not the header)?

Preferred response: Add a fgetl(fid); before your loop starts. Alternatively, you could try to write some logic to skip it in the loop.

Non-preferred response: Add a fgetl(fid); before your loop

附录B数据格式和模型提示

对于LLaMA-2系列的型号，以下是我们使用的提示：

You are a helpful, respectful, and honest teaching assistant for an introductory programming course in Matlab and C. Your current task is to answer student queries on Piazza. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don’t know the answer to a question, please don’t share false information. <</SYS>>

Query subject: ‘‘‘{{Subject of the Query}}‘‘‘

Query body: ‘‘‘{{Body of the Query}}‘‘‘

Please answer the query. [/INST]

对于 RAG，在原始提示中替换以下格式的文本：
Here are some snippets from the course material & other uploaded content which might be helpful to generate the response. ### Below is snippet 1: <SNIPPET DOCUMENT TEXT> ### Below is snippet 2: <SNIPPET DOCUMENT TEXT> ### Below is snippet K: <SNIPPET DOCUMENT TEXT> Above were the snippets. Now, here is the query to be answered:

对于 DPO，人类偏好数据采用以下格式：

{
    "instruction": "{subject}\n{question}",
    "output1": (started_off_answer), //the original student or instructor answer
    "output2": (updated_answer), //updated version of the original answer
    "preference": 2
}

附录C实验设置

推理参数：所有世代均使用以下参数完成：

•

最大长度: $2048$
•

最大新 Token : $1024$
•

top_p: $1$
•

top_k: $50$
•

温度: $0.3$

LLaMA-2-70B-chat 上的推理是在 4 位量化下完成的。

监督微调：为了进行微调，使用的优化器是具有 32 位精度设置的分页 AdamW，这有助于优化内存使用和计算效率。我们采用的每设备训练批量大小为 4，评估批量大小为 1。为了增强训练的稳定性和有效性，梯度累积步长设置为2，并启用梯度检查点。考虑到 LLaMA-2-13B-chat 模型的大小和我们的计算限制，我们利用了 LoRA（低秩适应）层，其 alpha 参数为 16，dropout 率为 0.05，r 参数为 8。这种方法使我们能够有效地调整模型，而不需要对整个网络进行大量的重新训练，从而节省了时间和计算资源。序列长度固定为 1024 个标记，以平衡上下文可用性和计算可行性。

DPO： DPO 的设置与 SFT 相同，但 DPO 损失的 beta 参数设置为 0.1。学习率提高到5e-4，梯度累积步数加倍到4。

附录 D模型结果示例比较

我们在表4中提供了一些示例来展示 RAG 的潜在影响。对于同一个问题，我们还在表中提供了不同的模型答案以及相应的大语言模型评估（使用GPT-4）分数和人类评估分数（一位作者的评分） 5。

表 4： RAG 有益于性能的一些定性示例

Query: What is the function IsExitUnique() supposed to return when there are no exits?

Ground Truth Answer: Return 0 as the manual clearly says that ‘The IsExitUnique() function should return true if there is exactly one exit position in the cave array, and false otherwise.‘

Model considered: GPT-4

Relevant content within the context retrieved:

# Task Five: “The one and only” (10 marks)

Next, a valid cave must have only one exit. Define a function called IsExitUnique() which checks that there is a single ‘exit’ position in the 2-dimensional array: ⬇ int IsExitUnique(int cave[CAVE_SIZE][CAVE_SIZE]) If there is exactly one ‘exit’ position, then the function should return true, otherwise it should return false. Consider the following example: ⬇ char layout[200] = "1111111111100000000110000000011000000001 1EEEEEEEE110000000011000000001100000000110000000011111111111"; int cave[CAVE_SIZE][CAVE_SIZE] = {0}; InitialiseCave(cave, layout); printf("Unique? \%d\n", IsExitUnique(cave)); \\ 0 This is because there are many ‘exit’ positions on row 4 of the cave (see ‘E’ characters).

Answer without RAG:

Thank you for your question. Based on the provided information, it seems that the ‘IsExitUnique‘ function is meant to check if exits are unique in a given context. However, without more context or details about the function, it’s difficult to provide a specific answer to your question. If there are no exits, the function could return ‘true‘ (since there are no duplicate exits), or it could return a specific value (e.g., -1) to indicate that there are no exits at all. The expected behavior would depend on the function’s implementation and the requirements of the programming assignment. If you could provide more information about the function or the context in which it is being used, I would be happy to help you further.

Answer with RAG:

The ‘IsExitUnique‘ function is designed to check if there is exactly one ’exit’ position in the 2-dimensional array. If there is exactly one ’exit’ position, then the function should return true (or 1 in the case of C programming), otherwise, it should return false (or 0 in C). In the case where there are no exits in the 2-dimensional array, the function should return false (or 0). This is because the condition of having exactly one ’exit’ is not met.

Remember, the function is not checking for the presence of an exit, but rather it is checking if there is exactly one exit. If there are no exits or more than one exit, the function should return false.

Query: Can we push a boulder to crash the treasure boxes?

Ground Truth Answer: The boulder will continue to roll in the same direction as long as there is empty space ahead of it. It will stop when it hits anything that is not empty space (such as a wall).

Model considered: LLAMA-2 Chat 13B (SFT + DPO)

Relevant content within the context retrieved:

# Task Nine: “Time to get moving” (10 marks) • The player cannot move into or through the walls of the cave. When a boulder rolls into a hole, both the hole and the boulder disappear (i.e. ’empty space’ appears at the previous location of the hole) • A boulder, once pushed, will continue to move in a straight line until it hits an obstacle (either a hole, a treasure, another boulder or a cave wall) • You can assume that the boulder will never be pushed through the exit. Answer without RAG:

As a helpful & responsible teaching assistant, I’m here to assist you with your query. However, I would like to point out that crashing treasure boxes with a boulder may not be the most appropriate or safe approach, as it could cause harm to others or damage the environment. Instead, I suggest we focus on finding alternative & more constructive solutions to gain access to the treasure boxes. In the context of programming, it’s important to prioritize safety and ethical considerations in our code and actions. We should always strive to create programs and solutions that are socially unbiased, positive, & respectful of others.

Answer with RAG:

As your helpful TA, I’m here to answer your query about player pushing in the Boulder Break game. To answer your question, unfortunately, it is not possible to push boulders to crash the treasure boxes in the game. When you push a boulder, it will roll in a straight line until it hits an obstacle, such as a hole, another boulder, or a cave wall. It will never move through treasure chests or other objects in its path.

However, you can use boulders to fill in holes that may be blocking your way to the treasure chests! If you push a boulder into a hole, it will disappear, allowing you to access the area beyond.

Remember, the goal of the game is to help the adventurer collect treasure and escape the cave by pushing boulders into holes that may block the exit. Good luck, and have fun playing Boulder Break!

表5：大语言模型（GPT-4）和人类对不同模型的反应给出的平均分数的比较。我们使用与课程 C 项目相关的示例查询来演示这些规则：“Can you put an if statements inside another if statements?”。为了便于阅读，很少有模型响应被压缩。

Model	Response	LLM	Human
LLaMA-2-7B-chat	Hello! I’m here to help you with your query. To answer your question directly, the short answer is no, you cannot put an ‘if statement‘ inside another ‘if statement‘ in Matlab/C. In Matlab, you can only have one ‘if‘ statement per ‘if-else‘ block. If you try to put another ‘if‘ statement inside the first ‘if‘ statement, you will get a compiling error…	$0$	$0$
LLaMA-2-13B-chat	Of course! I’d be happy to help. The answer to your question is yes, you can put an if statement inside another if statement in both Matlab and C. This is called nesting if statements. In Matlab, you can use the elseif keyword to nest if statements. [example omitted]. In C, you can use the if keyword to nest if statements. [example omitted]. I hope this helps! Let me know if you have any other questions.	$1$	$0.5$
LLaMA-2-70B-chat)	Hello! I’m happy to help you with your question. Yes, it is possible to put an if statement inside another if statement. This is called a nested if statement. Here’s an example: [example omitted]. In this example, the inner if statement will only execute if the condition1 is true. If condition1 is false, the code inside the inner if statement will not be executed. It’s important to note that nesting if statements can make the code harder to read and understand, so it’s generally recommended to use them sparingly and only when necessary. I hope this helps! Let me know if you have any further questions.	$1$	$1$
GPT-4	Subject: Nested if statements in Matlab and C Answer: Yes, you can put an if statement inside another if statement. This is called "nested if statements." Both Matlab and C support nested if statements. Here’s an example in both languages: Matlab: [example omitted] C: [example omitted] In both examples, we have an outer if statement checking if ‘x‘ is positive, and inside that, we have another if statement checking if ‘y‘ is positive. This allows us to handle different cases based on values of ‘x‘ and ‘y‘.	$1$	$0.5$
GPT-4 + RAG	Yes, you can definitely put an "if" statement inside another "if" statement. This is known as nested if statements. Here’s a simple example: [example omitted] In this example, the inner "if" statement (checking condition2) is only executed if condition1 is true. This can be useful when you need to check for multiple conditions that depend on each other. Remember to always properly indent your code when using nested if statements, as it can quickly become difficult to read and understand the flow of your program if you don’t.	$1$	$1$
LLaMA-2-13B-chat+RAG	Hello! As your helpful teaching assistant, I’m here to help you with your query. Yes, it is possible to put an if statement inside another if statement in C programming. This is called nested if statements. Here’s an example: [example omitted] In this example, the inner if statement (condition2) will only be executed if the outer if statement (condition1) is true. I hope this helps! If you have any more questions or need further clarification, please don’t hesitate to ask :-)	$1$	$1$
LLaMA-2-13B-chat+SFT	Yes.	$0.5$	$0.25$
LLaMA-2-13B-chat+SFT +DPO	Certainly! Yes, it is possible to put an if statement inside another if statement in MATLAB and C. This is known as nested if statements. In MATLAB, you can use the ‘if‘ statement within another ‘if‘ statement like this: [example omitted] In C, you can use the ‘if‘ statement within another ‘if‘ statement like this: [example omitted] It’s important to note that when using nested if statements, you should always make sure that the inner if statement is fully enclosed within the outer if statement, as shown in the examples above. Additionally, it’s worth noting that using nested if statements can make your code more complex and difficult to read. It’s usually a good idea to break up your code into separate if statements if possible, to make it easier to understand and maintain.	$1$	$0.75$
LLaMA-2-13B-chat+SFT +DPO +RAG	Yes, you can put an if statement inside another if statement in C programming. This is called nesting if statements. Here’s an example: [example omitted] In the above example, the inner if statement is nested inside the outer if statement. The outer if statement checks the condition ‘condition1‘, and if it is true, the inner if statement is executed. If the inner if statement is false, the outer if statement will execute the code inside the ‘else‘ block. You can use this syntax to add more complexity to your code and make decisions based on multiple conditions. I hope this helps! Let me know if you have any more questions.	$1$	$1$

附录E评估标准、示例和评估提示

E.1 标题和提示

评价提示：为了自动评估我们的世代，我们使用以下提示格式查询 GPT-4。它包括提示模板，后面是可包含在提示中的相应评估标准的定义，以及每个相应标准要遵循的评估步骤。

⬇

# Evaluation prompt template based on G-Eval

EVALUATION_PROMPT_TEMPLATE = """

You will be given one answer to a question written by a student on a Question-Answer platform for a Computer Science undergraduate course. You will also have access to the ground truth answer given by a human teaching assistant. Your task is to rate the answer on one metric. Please make sure you read and understand these instructions very carefully. Please keep the ground truth answer given by the teaching assistant in mind while reviewing, and refer to it as needed.

Evaluation Criteria:

{criteria}

Evaluation Steps:

{steps}

Example:

Question:

{question}

Ground Truth Answer:

{GroundTruthAnswer}

Answer:

{answer}

Evaluation Form (scores ONLY):

- {metric_name}

"""

# Metric 1: Usefulness

USEFULNESS_SCORE_CRITERIA = """

Usefulness (0-2) - judge whether a response would be useful to a Teaching Assistant in answering a student’s question.

Here is the scale:

0 - A score of 0 means that the response is not useful at all. \

A Teaching Assistant would simply reject this answer\

because it is not a natural response, is irrelevant to the question, or is too verbose.

1 - A score of 1 means that the response is partially useful. \

A Teaching Assistant needs to edit this answer, but it still sounds natural and \

relevant so editing will not take long.

2 - A score of 2 means that the response is useful as is. \

A Teaching Assistant can use this response as is.

"""

USEFULNESS_SCORE_STEPS = """

1. Read the question carefully.

2. Read the response carefully.

2. Read the ground truth answer carefully.

3. Consider whether the response would be useful to a Teaching Assistant in answering a student’s question.

4. Assign a usefulness score from 0 to 2.

"""

# Metric 2: Accuracy

ACCURACY_SCORE_CRITERIA = """

Accuracy (0-2) - determine whether this response provides a factually correct answer to the question.

Here is the scale:

0 - A score of 0 means that the response is completely inaccurate. The answer is entirely incorrect or provides false information.

1 - A score of 1 means that the response is partially accurate. The answer lacks some correct information or contains incorrect or unnecessary information.

2 - A score of 2 means that the response is accurate. The answer is completely accurate, providing correct information and a valid solution.

"""

ACCURACY_SCORE_STEPS = """

1. Read the question carefully.

2. Read the response carefully.

3. Read the ground truth answer carefully.

4. Consider whether this response provides a factually correct answer to the question.

5. Assign an accuracy score from 0 to 2.

"""

清单 1：用于定义评估提示模板的 Python 代码。

评价标准：表6中总结了我们用于人类评估的评估标准和示例。请注意，即使答案实际上不准确，它仍然可能有用。 Table 6 和 Table 5 基于在 Piazza 上提出的与课程 C 项目相关的查询：“Can you put an if statements inside another if statements?” Piazza 上上述查询的实际答案是：

你可以吗？是的。你应该？有时。

对于项目来说，完全没问题。

一般来说，以下行为被认为是不好的做法：

⬇

if (A) {

if (B) {

// Some code

}

总是可以写成：

⬇

if (A && B) {

// Some code

}

但是，除了可读性之外，它并没有真正的问题。

表6：人类评价评分标准。我们使用与课程 C 项目相关的示例查询来演示这些规则：“Can you put an if statements inside another if statements?”。为了便于阅读，很少有模型响应被压缩。注意：L-13 是 LLAMA-2-13B-chat 的简写。

Category	Score	Scoring Criteria	Example answer	Explanation
Usefulness	$0$	Useless:A response that would be rejected by a TA due to its lack of relevance, unnatural language, verbosity, or other factors	L-13 + SFT: Yes.	Despite being correct, the response is too brief to provide any meaningful information to a TA.
	$0.5$	Partially useful: A response that requires editing by a TA, but remains natural and relevant, thus requiring minimal effort to revise.	L-13: … The answer to your question is yes, you can put an if statement inside another if statement in both Matlab and C. This is called nesting if statements. In Matlab, you can use the elseif keyword to nest if statements. [example omitted]. In C, you can use the if keyword to nest if statements. [example omitted].	The response provides a direct answer and examples, but contains irrelevant information about MATLAB and lacks discussion of trade-offs. A TA may need to revise the content.
	$1$	Useful: A response that would be accepted by a TA with little or no editing, as it provides a natural and coherent answer to the question.	L-13+SFT+DPO: Certainly! Yes, it is possible to put an if statement inside another if statement in MATLAB and C. This is known as nested if statements. In MATLAB, you can use the ‘if‘ statement within another ‘if‘ statement like this: [example omitted]. In C, you can use the ‘if‘ statement within another ‘if‘ statement like this: [example omitted]. Additionally, it’s worth noting that using nested if statements can make your code more complex and difficult to read. It’s usually a good idea to break up your code into separate if statements if possible, to make it easier to understand and maintain.	The response provides a clear and concise answer to the question, including examples and additional information about the potential complexity of using nested if statements. While it contains extra information about MATLAB, this can be easily removed with minor edits.
Accuracy	$0$	Completely inaccurate: A response that is entirely incorrect or provides false information.	L-7: Hello! I’m here to help you with your query. To answer your question directly, the short answer is no, you cannot put an ‘if statement‘ inside another ‘if statement‘ in Matlab/C. In Matlab, you can only have one ‘if‘ statement per ‘if-else‘ block. If you try to put another ‘if‘ statement inside the first ‘if‘ statement, you will get a compiling error…	The response is completely inaccurate as it provides false information, stating that nesting if statements is not allowed in MATLAB/C, which is incorrect.
	$0.5$	Partially accurate: A response which lacks some important information or contains incorrect or unnecessary information.	L-13: The answer to your question is yes, you can put an if statement inside another if statement in both Matlab and C. This is called nesting if statements. In Matlab, you can use the elseif keyword to nest if statements. [example omitted]. In C, you can use the if keyword to nest if statements. [example omitted].	The response is considered partially accurate as it provides correct information about C, but also contains irrelevant information about MATLAB.
	$1$	Accurate: A response that is completely accurate, providing correct information and a valid solution.	GPT-4+RAG: Yes, you can definitely put an “if” statement inside another “if” statement. This is known as nested if statements. [example omitted]. In this example, the inner "if" statement is only executed if condition1 is true. This can be useful when you need to check for multiple conditions that depend on each other. Always properly indent your code when using nested if statements, as it can quickly become difficult to read and understand the flow of your program.	The response is considered accurate as it provides correct information, including examples, and does not contain any irrelevant information about MATLAB.

E.2 大语言模型、Human 和 BertScore 评估之间的相关性

在本节中，我们将研究大语言模型、人类和 BertScore 分配的分数之间的关系。为此，我们使用三个不同的相关系数：Pearson 相关系数 ( $r$ )、Spearman 相关系数 ( $\rho$ ) 和 Kendall-Tau 相关系数 ( $\tau$ )。 Pearson 相关系数衡量两个变量之间的线性关系，而 Spearman 相关系数衡量两个变量之间的单调关系。 Kendall-Tau 相关系数衡量两个变量之间关联的强度，并基于一致和不一致的观测值对的数量。

我们在表7中报告了不同评价指标之间的相关系数，使用的数据来自 $20$ 对 $9$ 模型的查询，结果总共有 $180$ 个响应。

Metric-1	Metric-2	Pearson	Spearman	Kendall-Tau
Human Evaluation (Usefulness)	LLM Evaluation (Usefulness)	0.586	0.613	0.567
Human Evaluation (Accuracy)	LLM Evaluation (Accuracy)	0.646	0.65	0.591
Human Evaluation (Overall)	LLM Evaluation (Overall)	0.662	0.659	0.565
BertScore-F1	LLM Evaluation (Overall)	0.383	0.363	0.28

表 7：我们的评估数据集上不同指标的相关性。

从表中我们可以看到，人类评价指标（有用性、准确性和总体性）与相应的大语言模型评价指标之间存在适度的正相关性。这表明大语言模型的评价在一定程度上与人类的评价趋于一致。

为了进一步了解大语言模型和人类评估之间的关系，我们检查了有用性和准确性指标的混淆矩阵，该矩阵可以采用 3 个值：0、0.5 和 1。这些混淆矩阵如图3所示。矩阵的对角线代表大语言模型和人类评估在评级上达成一致的示例的比例。我们可以看到，两个矩阵的下三角中都有很大一部分示例，这表明大语言模型评估往往比人类评估更宽松并分配更高的分数。

我们还研究了 BertScore [54]（一种基于相似性的自动化指标）与人类评估之间的关系。如图图4所示，BertScore与人工评价的相关性低于大语言模型评价与人工评价的相关性。这表明大语言模型评估与人类评估比 BertScore 等指标更接近。