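In this work, we introduce ChatQA, a family of conversational question answering (QA) models that achieve GPT-4-level accuracy. Specifically, we propose a two-stage instruction tuning method that significantly improves the zero-shot conversational QA results of large language models (LLMs). To handle retrieval-augmented generation in conversational QA, we fine-tune a dense retriever on a multi-turn QA dataset, which yields results comparable to using the state-of-the-art query rewriting model while substantially reducing deployment cost. Notably, our ChatQA-70B outperforms GPT-4 in average score on 10 conversational QA datasets (54.14 vs. 53.90), without relying on any synthetic data from OpenAI GPT models.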

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may be bottlenecked by human performance level; moreover, these separate, frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training.

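Understanding context is key to understanding human language, and large language models (LLMs) have shown an increasingly impressive ability to do so. However, although the evaluation of LLMs spans many areas of natural language processing, limited attention has been paid to probing their linguistic capability to understand contextual features. This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models. The benchmark comprises four distinct tasks and nine datasets, all featuring prompts designed to assess a model's ability to understand context. First, we evaluate the performance of LLMs under the in-context learning pretraining scenario. Experimental results indicate that pre-trained dense models struggle to understand more nuanced contextual features compared with state-of-the-art fine-tuned models. Second, as LLM compression is of growing importance in both research and real-world applications, we assess the context understanding of quantized models under in-context learning settings. We find that 3-bit post-training quantization leads to varying degrees of performance reduction on our benchmark. We conduct an extensive analysis of these scenarios to substantiate our experimental results.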

We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

Low Resource Pipeline for Spoken Language Understanding via Weak Supervision

