Executable Code Actions Elicit Better LLM Agents

Xingyao Wang Yangyi Chen Lifan Yuan Yizhe Zhang Yunzhu Li Hao Peng Heng Ji

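Abstract

Large Language Model (LLM) agents, capable of performing a broad range of actions such as invoking tools and controlling robots, show great potential in tackling real-world challenges. LLM agents are typically prompted to produce actions by generating JSON or text in a pre-defined format, which is usually limited by a constrained action space (e.g., the scope of pre-defined tools) and restricted flexibility (e.g., inability to compose multiple tools). This work proposes to use executable Python code to consolidate LLM agents' actions into a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct can execute code actions and dynamically revise prior actions or emit new actions upon new observations through multi-turn interactions. Our extensive analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that CodeAct outperforms widely used alternatives (up to 20% higher success rate). The encouraging performance of CodeAct motivates us to build an open-source LLM agent that interacts with environments by executing interpretable code and collaborates with users using natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn interactions using CodeAct. We show that it can be used with existing data to improve models in agent-oriented tasks without compromising their general capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with a Python interpreter and uniquely tailored to perform sophisticated tasks (e.g., model training) using existing libraries and autonomous self-debugging. The code, data, model, and demo are available at https://github.com/xingyaoww/code-act.

Machine Learning, ICML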

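1 Introduction

Figure 1: Comparison between CodeAct and Text / JSON as action. (top) Illustrative example comparing different actions. (bottom) Quantitative results on M³ToolEval (§2.3).

Large Language Models (LLMs) have emerged as a pivotal breakthrough in natural language processing (NLP). When augmented with action modules that allow access to APIs, their action space expands beyond conventional text processing, allowing LLMs to acquire capabilities such as tool invocation and memory management (Mialon et al., 2023; Schick et al., 2023) and to venture into real-world tasks such as controlling robots (Ahn et al., 2022; Huang et al., 2023; Ma et al., 2023) and performing scientific experiments (Bran et al., 2023).

We inquire: how to effectively expand LLM agents' action space for solving complex real-world problems? Much existing research has examined using text (Yao et al., 2022b; Park et al., 2023, inter alia) or JSON (Qin et al., 2023b; Chase, 2022, inter alia) to produce actions (e.g., the tool uses in Fig. 1, top left). However, both methods typically suffer from a constrained scope of action space (actions are usually tailored to specific tasks) and restricted flexibility (e.g., inability to compose multiple tools in a single action). As an alternative approach, several works (Liang et al., 2022; Singh et al., 2023; Wang et al., 2023a) demonstrate the potential of using LLMs to generate code to control robots or game characters. However, they typically rely on pre-specified control primitives and hand-engineered prompts and, more importantly, struggle to dynamically adjust or emit actions based on new environmental observations and feedback.

This work proposes CodeAct, a general-purpose framework that allows LLMs to generate executable Python code as actions (Fig. 1, top right). CodeAct is designed to handle a variety of applications and comes with unique advantages: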

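(1) Integrated with a Python interpreter, CodeAct can execute code actions and dynamically adjust prior actions or emit new actions based on observations (e.g., code execution results) it receives through multiple turns of interaction.
(2) Code actions allow LLMs to leverage existing software packages. CodeAct can use readily available Python packages for an expanded action space instead of hand-crafted task-specific tools (Yuan et al., 2023; Shen et al., 2023). It also allows LLMs to use the automated feedback (e.g., error messages) implemented in most software to improve task-solving by self-debugging their generated code (Chen et al., 2023b; Wang et al., 2023d).
(3) Code data is widely used in pre-training today's LLMs (Yang et al., 2024b). These models are already familiar with structured programming languages, allowing cost-effective adoption of CodeAct.
(4) Compared to JSON and text with a pre-defined format, code inherently supports control and data flow, allowing intermediate results to be stored as variables for reuse and multiple tools to be composed into complex logical operations (e.g., if-statements, for-loops) with one piece of code, thereby unlocking LLMs' potential to tackle complex tasks by leveraging their pre-trained knowledge of programming. In Fig. 1, an LLM using CodeAct (top right) can apply the same sequence of tools (e.g., passing one tool's output as input to another tool using the data flow feature) to all inputs through a for-loop (i.e., the control flow feature) with one action, while text or JSON have to take an action for every input (top left).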
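A minimal sketch of advantage (4), assuming two hypothetical tools (`lookup_price` and `convert`, invented here for illustration and not taken from the paper or its benchmarks). It contrasts the number of JSON-style actions needed to chain two tools over three inputs with a single code action that does the same through a for-loop and a variable.

```python
import json

# Hypothetical tools, invented for illustration only.
def lookup_price(item: str) -> float:
    """Return a unit price from a toy catalogue."""
    return {"apple": 0.5, "pear": 0.75, "kiwi": 1.25}[item]

def convert(amount: float, rate: float) -> float:
    """Convert an amount using a fixed exchange rate."""
    return round(amount * rate, 2)

TOOLS = {"lookup_price": lookup_price, "convert": convert}
items = ["apple", "pear", "kiwi"]

# JSON-style actions: one tool call per action, so chaining two tools
# over three inputs needs six separate actions.
json_actions = []
for item in items:
    json_actions.append(json.dumps({"tool": "lookup_price", "args": {"item": item}}))
    json_actions.append(
        json.dumps({"tool": "convert", "args": {"amount": "<previous result>", "rate": 2.0}})
    )

# CodeAct-style action: a single piece of code covers all inputs.
code_action = """
results = {}
for item in items:                       # control flow: loop over inputs
    price = lookup_price(item)           # intermediate result kept in a variable
    results[item] = convert(price, 2.0)  # data flow: one tool feeds the next
"""
namespace = {**TOOLS, "items": items}
exec(code_action, namespace)
results = namespace["results"]
print(len(json_actions), "JSON actions vs. 1 code action:", results)
```

The point of the sketch is only the action count; how a real agent would fill in the `<previous result>` placeholder across turns is left out.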

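Our extensive experiments with 17 LLMs (including both open-source and proprietary ones) confirm the above benefits (3 & 4) of CodeAct. To demonstrate benefit (3), our first experiment (§2.2) compares CodeAct to baselines on basic tasks involving atomic tool use (i.e., only one tool is used per action), ablating the control and data flow advantage offered by CodeAct. The results show that, for most LLMs, CodeAct achieves comparable or better performance than the baselines. CodeAct's performance gains are more prominent on complex tasks, as demonstrated in our second experiment (benefit 4). We curate a new benchmark consisting of 82 human-curated tasks that typically require multiple calls to multiple tools in multi-turn interactions (M³ToolEval; §2.3). Problems in this benchmark often require intricate coordination and composition of multiple tools. With its strengths in control and data flow, CodeAct achieves up to a 20% absolute improvement over baselines in the success rate of solving the problems while requiring up to 30% fewer actions. These performance gains widen as the capabilities of the LLMs increase (Fig. 1, bottom).

The promising performance of CodeAct motivates an open-source LLM agent that can effectively act through CodeAct and collaborate with humans through natural language. To this end, we collect an instruction-tuning dataset CodeActInstruct consisting of 7k high-quality multi-turn interaction trajectories with CodeAct (§3.1). CodeActInstruct is motivated by a general agent framework consisting of agent, user, and environments (Fig. 2) and focuses on agent-environment interactions with the computer (information seeking, software package use, external memory) and the physical world (robot planning). On CodeActInstruct, we perform careful data selection to promote the capability of improving from multi-turn interaction (e.g., self-debugging). We show that CodeActInstruct can be used with commonly used instruction-tuning data to improve the models' performance in agent tasks without compromising their general capabilities (e.g., knowledge-based QA, coding, instruction following, §3.2). Our model, dubbed CodeActAgent, is finetuned from LLaMA-2 (Touvron et al., 2023) and Mistral-7B (Jiang et al., 2023) and improves on out-of-domain agent tasks not only with CodeAct but also with text actions in a pre-defined format (§3.2).

CodeAct can further benefit from multi-turn interactions and existing software (benefits 1 & 2, §2.4). As shown in Fig. 3, CodeActAgent, designed for seamless integration with Python, can carry out sophisticated tasks (e.g., model training, data visualization) using existing Python packages. Error messages from the environment further enable it to rectify errors autonomously through self-debugging in multi-turn interaction. Thanks to the extensive programming knowledge LLMs acquire during pre-training, these are achieved without in-context demonstrations, reducing the human effort needed to adapt CodeActAgent to different tasks.

Table 1: The benefits of CodeAct compared to using Text/JSON for LLM action.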

	CodeAct for LLM action	JSON or Text for LLM action
Availability of Data	✔Large quantity of code available¹ for pre-training	✗Data curation required for particular format
Complex Operation (e.g., looping, composition of multiple tools)	✔Natively supported via control and data flow	✗Requires careful engineering if feasible (e.g., define new tools to mimic if-statement)
Availability of Tools	✔Can directly use existing software packages²	✗Requires human effort to curate tools from scratch or existing software
Automated Feedback	✔Feedback mechanism³ (e.g., traceback) is already implemented as an infrastructure for most programming languages	✗Requires human effort to provide feedback or re-route feedback from the underlying programming language used to implement the tools

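¹ Including code demonstrating useful behaviors for LLM agents (e.g., task decomposition, coordination of multiple function calls to different tools).
² Human-written Python packages covering a wide range of applications are available on https://pypi.org/.
³ For example, in Python, one can use errors and exceptions (https://docs.python.org/3/tutorial/errors.html). Most software provides error messages in natural language to help human programmers debug their code. CodeAct enables LLMs to use them directly.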

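2 CodeAct Makes LLMs Better Agents

In this section, we first describe the CodeAct framework (§2.1) and provide empirical evidence that supports the choice of CodeAct. We focus on Python as the programming language for CodeAct due to its popularity (ranked No. 1 in (TIOBE Index, 2024)) and numerous open-source packages. We aim to answer several research questions (RQs) using 17 off-the-shelf LLMs. In §2.2, we examine RQ1: Does LLMs' familiarity with code, due to the large amount of code in pre-training data, bring CodeAct advantages over text and JSON? We discuss RQ2 in §2.3: Does CodeAct benefit from Python's innate control and data flow features in complex problems? Finally, as an additional benefit, we discuss in §2.4 and Fig. 3 how using CodeAct further enhances LLM agents by enabling multi-turn interactions and allowing them to access existing software.

2.1 What is CodeAct?

In Fig. 2, we first introduce a general multi-turn interaction framework for LLM agents' real-world usage that considers three roles (Yang et al., 2024c): agent, user, and environment. We define interaction as the information exchange between the agent and an external entity (user or environment). In each turn of interaction, the agent receives an observation (input) either from the user (e.g., a natural language instruction) or from the environment (e.g., a code execution result), optionally plans its action through chain-of-thought (Wei et al., 2022), and emits an action (output) either to the user in natural language or to the environment. CodeAct employs Python code to consolidate all actions for agent-environment interaction. In CodeAct, each action emitted to the environment is a piece of Python code, and the agent receives the outputs of code execution (e.g., results, errors) as its observation. We include an example prompt of CodeAct in §E.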
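The agent-environment half of this loop can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: `execute_action` is a hypothetical helper that runs one code action with Python's built-in `exec` in a persistent namespace and returns captured standard output, or the traceback on failure, as the observation. A real deployment would isolate execution in a sandbox.

```python
import contextlib
import io
import traceback

def execute_action(code: str, namespace: dict) -> str:
    """Run one code action and return the observation (stdout or traceback)."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # state persists in `namespace` across turns
    except Exception:
        # Errors are not fatal: they are returned to the agent as text.
        buffer.write(traceback.format_exc())
    return buffer.getvalue()

namespace = {}
# Turn 1: the action defines a variable and prints it.
obs_1 = execute_action("x = 21 * 2\nprint(x)", namespace)
# Turn 2: the action reuses `x` but references an undefined name.
obs_2 = execute_action("print(x + undefined_name)", namespace)

print(obs_1)
print(obs_2)
```

Keeping one namespace for the whole episode is what lets a later action build on variables defined by an earlier one.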

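2.2 CodeAct Shows the Promise as a Strong Tool Use Framework

In this section, we perform a controlled experiment to understand which format (text, JSON, CodeAct) is more likely to lead an LLM to generate correct atomic tool calls. The performance in this experiment reflects the LLM's familiarity with the corresponding format. We hypothesize that using CodeAct to call tools is a more natural way for models to use tools, since they typically have extensive exposure to code data during training.

Setup. We re-purpose API-Bank (Li et al., 2023) and test LLMs' API-calling performance, comparing CodeAct, JSON, and text actions. For each evaluation instance, we instruct the LLM to generate one atomic tool call in a pre-defined format: a Python function call, a JSON object, or a text expression. A concrete example is shown in Tab. A.6. We use API-Bank's level-1 instructions and the provided toolset. To evaluate API calling, we follow their correctness metric, matching the ground-truth API outputs with the execution outputs of the actual model-generated API calls.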
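To make the three action formats and the output-matching metric concrete, here is a small sketch. Everything in it is an assumption for illustration: the `add_agenda` tool, the JSON schema, and the `name: key=value; key=value` text syntax are hypothetical and are not the formats of Tab. A.6 or API-Bank's actual tools.

```python
import json

# Hypothetical tool standing in for an API-Bank API.
def add_agenda(content: str, time: str) -> str:
    return f"added '{content}' at {time}"

TOOLS = {"add_agenda": add_agenda}

def run_code_action(action: str):
    # Evaluate a single Python call expression with only the tools in scope.
    # (Stripping builtins is a convenience here, not a security boundary.)
    return eval(action, {"__builtins__": {}}, dict(TOOLS))

def run_json_action(action: str):
    call = json.loads(action)
    return TOOLS[call["tool"]](**call["args"])

def run_text_action(action: str):
    # Hypothetical text format: "tool_name: key=value; key=value"
    name, _, arg_text = action.partition(":")
    args = dict(pair.strip().split("=", 1) for pair in arg_text.split(";"))
    return TOOLS[name.strip()](**args)

def is_correct(runner, action: str, reference) -> bool:
    """Correct iff executing the generated call reproduces the reference output."""
    try:
        return runner(action) == reference
    except Exception:
        return False  # malformed or failing calls count as incorrect

reference = add_agenda("Meeting", "2023-10-01 09:00")
checks = {
    "code": is_correct(run_code_action, 'add_agenda(content="Meeting", time="2023-10-01 09:00")', reference),
    "json": is_correct(run_json_action, '{"tool": "add_agenda", "args": {"content": "Meeting", "time": "2023-10-01 09:00"}}', reference),
    "text": is_correct(run_text_action, "add_agenda: content=Meeting; time=2023-10-01 09:00", reference),
    "bad_json": is_correct(run_json_action, '{"tool": "add_agenda", "args": ', reference),
}
print(checks)
```

Note that the JSON and text formats each need a dedicated parser, whereas the code format is handled by the interpreter itself.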

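Results. We present results in Tab. 2. For most LLMs, CodeAct achieves comparable or better performance even in atomic actions (the simplistic tool-use scenario), where its control and data flow strengths are ablated. Compared to closed-source LLMs, CodeAct's improvements are more prominent in open-source models. Furthermore, code data is usually more accessible for fine-tuning open-source LLMs than specialized JSON or text tool-calling formats. Although JSON is consistently weaker than the other approaches for open-source models, it achieves decent performance with closed-source LLMs, indicating that these closed-source models may have gone through targeted fine-tuning toward their JSON capabilities. These results suggest that optimizing for CodeAct is a better route than the alternatives for open-source LLMs to improve their tool-use capabilities, as they already show good initial CodeAct capability due to extensive exposure to code data during pre-training.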

Table 2: Atomic API call correctness on API-Bank. The best performance is bolded, and the second-best is underlined.

	Correctness (%, $\uparrow$)		
Format of Action	CodeAct	JSON	Text
Open-source LLMs			
CodeLlama-7b-Instruct-hf	$12.5$	$12.0$	$\mathbf{17.0}$
CodeLlama-13b-Instruct-hf	$11.8$	$7.8$	$\mathbf{14.0}$
CodeLlama-34b-Instruct-hf	$\mathbf{17.3}$	$12.0$	$16.8$
Llama-2-7b-chat-hf	$\mathbf{28.8}$	$11.3$	$25.8$
Llama-2-13b-chat-hf	$\mathbf{38.1}$	$8.5$	$37.3$
Llama-2-70b-chat-hf	$35.6$	$14.3$	$\mathbf{37.6}$
Mistral-7B-Instruct-v0.1	$2.5$	$2.3$	$\mathbf{3.0}$
lemur-70b-chat-v1	$\mathbf{58.6}$	$46.6$	$56.1$
Closed-source LLMs			
claude-2	$\mathbf{76.7}$	$59.4$	$73.7$
claude-instant-1	$\mathbf{75.2}$	$64.9$	$73.2$
gemini-pro	$70.4$	$\mathbf{73.2}$	$71.2$
gpt-3.5-turbo-0613	$\mathbf{74.4}$	$73.9$	$73.4$
gpt-3.5-turbo-1106	$75.4$	$\mathbf{78.4}$	$73.4$
gpt-4-0613	$75.4$	$\mathbf{82.0}$	$74.4$
gpt-4-1106-preview	$76.7$	$\mathbf{82.7}$	$73.4$
text-davinci-002	$\mathbf{69.2}$	$59.6$	$57.4$
text-davinci-003	$75.4$	$\mathbf{76.9}$	$69.7$
Frequency of Best-Performing Format $\uparrow$			
Open-source	$\mathbf{4}$	$0$	$4$
Closed-source	$4$	$\mathbf{5}$	$0$
Overall	$\mathbf{8}$	$\underline{5}$	$4$

Table 3: Success rates (higher the better) and average turns required per instance (lower the better) on M³ToolEval. The best results for each model are bolded, and the second-best ones are underlined.

	Success Rate (%, $\uparrow$)			Avg. Turns ($\downarrow$)		
Format of Action	CodeAct	JSON	Text	CodeAct	JSON	Text
Open-source LLMs						
CodeLlama-7b-Instruct-hf	$\mathbf{4.9}$	$2.4$	$2.4$	$\mathbf{9.7}$	$9.9$	$9.9$
CodeLlama-13b-Instruct-hf	$\mathbf{4.9}$	$\mathbf{4.9}$	$\mathbf{4.9}$	$9.8$	$9.8$	$\mathbf{9.7}$
CodeLlama-34b-Instruct-hf	$\mathbf{2.4}$	$0.0$	$0.0$	$\mathbf{9.9}$	$10.0$	$10.0$
Llama-2-7b-chat-hf	$0.0$	$1.2$	$\mathbf{2.4}$	$\mathbf{8.9}$	$9.5$	$9.6$
Llama-2-13b-chat-hf	$\mathbf{0.0}$	$\mathbf{0.0}$	$\mathbf{0.0}$	$\mathbf{9.7}$	$10.0$	$10.0$
Llama-2-70b-chat-hf	$\mathbf{11.0}$	$3.7$	$3.7$	$\mathbf{9.1}$	$9.8$	$9.8$
Mistral-7B-Instruct-v0.1	$0.0$	$\mathbf{3.7}$	$1.2$	$10.0$	$\mathbf{9.8}$	$9.9$
lemur-70b-chat-v1	$13.4$	$\mathbf{15.9}$	$12.2$	$\mathbf{9.1}$	$9.3$	$9.4$
Closed-source LLMs						
claude-2	$\mathbf{54.9}$	$39.0$	$29.3$	$\mathbf{7.2}$	$8.3$	$8.5$
claude-instant-1	$20.7$	$\mathbf{31.7}$	$24.4$	$8.8$	$\mathbf{8.6}$	$8.9$
gemini-pro	$\mathbf{22.0}$	$19.5$	$11.0$	$\mathbf{8.8}$	$9.1$	$9.5$
gpt-3.5-turbo-0613	$\mathbf{51.2}$	$26.8$	$20.7$	$\mathbf{7.0}$	$8.8$	$9.2$
gpt-3.5-turbo-1106	$\mathbf{29.3}$	$15.9$	$14.6$	$\mathbf{8.4}$	$9.0$	$9.0$
gpt-4-0613	$\mathbf{67.1}$	$56.1$	$45.1$	$\mathbf{6.6}$	$7.6$	$8.0$
gpt-4-1106-preview	$\mathbf{74.4}$	$52.4$	$53.7$	$\mathbf{5.5}$	$7.6$	$7.7$
text-davinci-002	$4.9$	$4.9$	$\mathbf{8.5}$	$9.7$	$9.8$	$\mathbf{9.6}$
text-davinci-003	$\mathbf{20.7}$	$18.3$	$7.3$	$9.2$	$\mathbf{9.0}$	$9.6$
Frequency of Best-performing Format $\uparrow$						
Open-source	$\mathbf{5}$	$4$	$3$	$\mathbf{6}$	$1$	$1$
Closed-source	$\mathbf{7}$	$1$	$1$	$\mathbf{6}$	$2$	$1$
Overall	$\mathbf{12}$	$5$	$4$	$\mathbf{12}$	$3$	$2$

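2.3 CodeAct Gets More Done with Fewer Interactions

In this section, we investigate whether LLM agents can benefit from the control and data flow of code on problems that require complex patterns of tool use.

M³ToolEval. As shown in Tab. A.7, to the best of our knowledge, no existing tool-use benchmark contains complex tasks requiring the composition of multiple tools while also supporting the evaluation of different action formats. Hence, we curate a benchmark, M³ToolEval, to fill this gap. It evaluates LLMs' capabilities in solving complex tasks that typically require multiple calls to multiple tools in multi-turn interactions. It contains 82 human-curated instances, spanning tasks including web browsing, finance, travel itinerary planning, science, and information processing. Each domain is accompanied by a unique set of manually crafted tools. We intentionally keep the prompt simple (examples in §F) and avoid providing any demonstration in order to test the LLM's zero-shot ability to use tools, similar to how a novice user without knowledge of few-shot prompting would use the model.

Setup. We allow the model to generate fully functional Python code that enables control and data flow (e.g., if-statements, for-loops). We follow the action formats for JSON and text described in Tab. A.6. Within each turn, the model can either emit an action or propose an answer, which is verified by an exact match with the ground-truth solution. The interaction terminates when a maximum of 10 interaction turns is reached or a correct solution has been submitted, similar to (Wang et al., 2023e).

Metric. We measure the success rate by calculating the percentage of model-proposed answers that match the ground-truth solutions. We also include the avg. turns metric: the average number of turns used across all evaluated instances.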
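The two metrics can be written down directly. The sketch below assumes a hypothetical `Episode` record (not a structure from the paper's code) holding the number of turns used, the last proposed answer, and the ground truth; the four example episodes are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_TURNS = 10  # interaction budget described in the setup

@dataclass
class Episode:
    turns_used: int                 # turns until success or until the budget ran out
    proposed_answer: Optional[str]  # None if the model never proposed an answer
    ground_truth: str

def success_rate(episodes: List[Episode]) -> float:
    """Percentage of episodes whose proposed answer exactly matches the ground truth."""
    solved = sum(e.proposed_answer == e.ground_truth for e in episodes)
    return 100.0 * solved / len(episodes)

def avg_turns(episodes: List[Episode]) -> float:
    """Average number of turns used across all evaluated instances."""
    return sum(e.turns_used for e in episodes) / len(episodes)

# Made-up episodes: two solved early, two that exhaust the turn budget.
episodes = [
    Episode(3, "42", "42"),
    Episode(MAX_TURNS, None, "7"),
    Episode(MAX_TURNS, "paris", "rome"),
    Episode(5, "x", "x"),
]
print(success_rate(episodes), avg_turns(episodes))
```

Because unsolved episodes always consume the full budget, a lower average turn count and a higher success rate tend to move together.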

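Quantitative Results on M³ToolEval. We include full results in Tab. 3 and a subset of results for visualization in Fig. 1. CodeAct generally has a higher task success rate (12 out of 17 evaluated LLMs), similar to the trend in §2.2. Moreover, using CodeAct requires a lower average number of turns (12 out of 17 evaluated LLMs). For example, the best model, gpt-4-1106-preview, achieves a $20.7$% absolute improvement compared to the next-best action format (text) while requiring $2.1$ fewer interaction turns on average. However, there is still a significant gap in absolute CodeAct performance between open- and closed-source LLMs: the best open-source model reaches 13.4%, while the best closed-source model, gpt-4-1106-preview, reaches 74.4%. This is potentially due to open-source models' weak task-solving capability and inability to follow complex instructions without demonstrations, suggesting an urgent need to improve open-source LLMs for practical, real-world tasks under the zero-shot setting.

2.4 CodeAct Benefits from Multi-turn Interactions and Existing Software Packages

In Fig. 3, we show how an LLM agent can integrate with Python (i.e., the CodeActAgent we train in §3.2) and use existing software to perform complex tasks in multi-turn interactions. Thanks to the extensive Python knowledge learned during pre-training, the LLM agent can automatically import the correct Python libraries to solve tasks without requiring user-provided tools or demonstrations. As illustrated in Fig. 3, CodeActAgent can use Pandas to download and process tabular data, use Scikit-Learn for the machine learning train-test data split and regression model training, and use Matplotlib for data visualization. Furthermore, using an interactive Python interpreter for code execution yields automated error messages that help the LLM agent "self-debug" its actions in a multi-turn interaction and eventually complete the human user's request correctly.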
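The self-debugging behavior can be illustrated with a toy multi-turn loop. In this sketch the "agent" is a scripted list of two actions standing in for LLM generations (an assumption made purely so the example runs without a model): the first forgets an import, and the second is the kind of fix an agent could produce after reading the resulting error message. Only the standard library is used; `run` is a hypothetical helper.

```python
import contextlib
import io
import traceback

def run(code: str, namespace: dict) -> str:
    """Execute a code action; return stdout or the traceback as the observation."""
    out = io.StringIO()
    try:
        with contextlib.redirect_stdout(out):
            exec(code, namespace)
    except Exception:
        out.write(traceback.format_exc())
    return out.getvalue()

# Scripted stand-in for the LLM (assumption): a real agent would generate
# the second action after seeing the first observation.
scripted_actions = [
    "values = [3, 5, 10]\nprint(statistics.mean(values))",  # missing import
    "import statistics\nprint(statistics.mean(values))",    # fix after NameError
]

namespace = {}
history = []
for action in scripted_actions:
    observation = run(action, namespace)
    history.append((action, observation))
    if "Traceback" not in observation:
        break  # no error reported: stop interacting

for action, observation in history:
    print(">>>", action.replace("\n", "; "))
    print(observation)
```

The second action can reuse `values` because the assignment succeeded before the first action failed, which is the benefit of a persistent interpreter state.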

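3 Empowering Open-source LLM Agent to be Better at CodeAct

The promising results achieved by CodeAct motivate us to build an open-source LLM agent that can both interact with environments through CodeAct and communicate with humans using language. To improve open-source LLMs' CodeAct capability, in §3.1 we introduce CodeActInstruct, an instruction fine-tuning dataset that contains agent-environment interaction trajectories. We discuss data selection procedures in §3.1 to promote improvement from interaction behavior. Additionally, we show that CodeAct can be used together with existing agent-user conversation data (§3.1) to balance the dialog capability of the resulting LLM. Our model CodeActAgent, fine-tuned from LLaMA-2 (Touvron et al., 2023) and Mistral-7B (Jiang et al., 2023) on a mixture of CodeActInstruct and general conversations, improves CodeAct performance without hurting the LLM's general performance on a diverse suite of tasks (§3.2).

3.1 CodeActInstruct: Agent-Environment Interactions

We consider four main use cases in agent-environment interaction and repurpose five existing datasets across different domains to generate trajectories: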

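• Information Seeking: We use a training subset of HotpotQA (Yang et al., 2018) to generate information-seeking trajectories, where LLMs use the wikipedia_search API (provided as a Python function) to search for information to answer questions.
• Software Package (Tool) Usage: We use the training set of code generation problems in APPS (Hendrycks et al., 2021a) and math problems in MATH (Hendrycks et al., 2021b). The code generation tasks already involve importing packages and/or creating new tools by defining a new Python function. For MATH, we provide an in-context demonstration of importing Python packages (e.g., sympy for symbolic math) for problem-solving.
• External Memory: We repurpose the training subset of WikiTableQuestion (Pasupat & Liang, 2015) and tweak it into two variants of tabular reasoning tasks that require accessing external memory: (1) SQL-based, requiring the LLM to interact with an SQL database through the sqlite3 package to answer the question via SQL execution; (2) Pandas-based, requiring the model to interact with pandas tables to perform data operations (e.g., select, filter). Examples of instructions can be found in §G.3.1.
• Robot Planning: We use ALFWorld (Shridhar et al., 2020), a text-only embodied environment simulator, to generate trajectories that use robot-control APIs (repurposed as Python functions) to complete household tasks. Following MINT (Wang et al., 2023e), we provide an in-context demonstration to encourage the use of for-loops and if-statement code blocks to automate repetitive operations (e.g., searching for items by visiting different locations).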
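As a rough illustration of the SQL-based external-memory variant, the sketch below builds a tiny in-memory table with the standard-library `sqlite3` package and executes one code action against it. The `medals` table, its contents, and the question are made up for this example and do not come from WikiTableQuestion or from the instructions in §G.3.1.

```python
import sqlite3

# Made-up table standing in for a WikiTableQuestion-style table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE medals (country TEXT, gold INTEGER, silver INTEGER)")
conn.executemany(
    "INSERT INTO medals VALUES (?, ?, ?)",
    [("A", 3, 1), ("B", 5, 2), ("C", 5, 0)],
)

# A code action an agent might emit for the question
# "Which countries won the most gold medals?"
code_action = """
rows = conn.execute(
    "SELECT country FROM medals "
    "WHERE gold = (SELECT MAX(gold) FROM medals) "
    "ORDER BY country"
).fetchall()
answer = [country for (country,) in rows]
print(answer)
"""
namespace = {"conn": conn}
exec(code_action, namespace)
answer = namespace["answer"]
```

The database connection plays the role of external memory: the table never has to fit in the prompt, and the agent only sees the rows its query returns.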

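Data Down-sampling. We down-sample each dataset by keeping only the most challenging instances, aiming to make trajectory generation more efficient and cost-effective. This also helps remove simple instances that existing LLMs can already solve. The statistics of the filtered dataset can be found in Tab. A.9. Please refer to §G.1 for details about the down-sampling process.

Repurpose Data for Multi-turn Interaction. Some datasets (APPS, MATH, WikiTableQuestions) are initially single-turn problems that expect one solution per instruction, whereas in a realistic agent use case we often require multi-turn interaction to complete each task (Fig. 1, top). Following MINT (Wang et al., 2023e), we repurpose single-turn problems into multi-turn ones by allowing the LLM to interact with the environment for multiple turns before it decides to submit one solution for evaluation. Specifically, for code generation problems, we provide an in-context example to guide LLMs to test their solution on the provided test cases before submitting it. The metrics from the original data are used to evaluate the submitted solution and determine its correctness. We include examples in §G.3.

Trajectory Generation. We use MINT's evaluation framework (Wang et al., 2023e) to generate interaction trajectories for the aforementioned datasets and determine the correctness of each trajectory. We run gpt-3.5-turbo-0613 from OpenAI and claude-1-instant and claude-2 from Anthropic on the down-sampled data, except for code generation, where we use a longer-context version of GPT-3.5 (gpt-3.5-turbo-0613-16k) due to the long-context requirement of the self-debugging process. On the subset of problems that none of these models can solve, we use gpt-4-0613 to generate trajectories.

Enhancing Agent's Capabilities of Improving from Interaction. We select a high-quality subset of all the trajectories generated for CodeActInstruct to promote the agent's ability to improve its next action based on prior observations (e.g., self-debugging from code execution error messages, a planning capability in Fig. 2). To achieve this, we selectively preserve those trajectories in which the model initially encounters errors but rectifies them in later interactions. In these instances, the LLM typically engages in self-reflection following the initial error, thereby proactively enhancing its future actions. Other filtering details are discussed in §G.2. Out of all generated trajectories, we keep 411 trajectories from gpt-4-0613 and 6728 trajectories from gpt-3.5 and claude. The statistics of the resulting dataset CodeActInstruct are shown in Tab. 4.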
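The selection rule described above (keep trajectories that hit an error and later recover) can be sketched as a simple filter. This is a simplified stand-in, not the authors' pipeline: the `Trajectory` record and the "Traceback" substring check are assumptions, and the additional filtering details of §G.2 are not modeled.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    observations: List[str]  # one execution output per turn
    is_correct: bool         # whether the final submitted answer was correct

def has_error(observation: str) -> bool:
    # Assumption: execution errors surface as Python tracebacks in the observation.
    return "Traceback" in observation

def shows_self_debugging(traj: Trajectory) -> bool:
    """Keep correct trajectories that hit an error and later ran without one."""
    error_turns = [i for i, obs in enumerate(traj.observations) if has_error(obs)]
    if not traj.is_correct or not error_turns:
        return False
    later = traj.observations[error_turns[0] + 1:]
    return any(not has_error(obs) for obs in later)

candidates = [
    Trajectory(["Traceback (most recent call last): NameError", "6"], True),   # recovers
    Trajectory(["6"], True),                                                   # never errs
    Trajectory(["Traceback (most recent call last): KeyError",
                "Traceback (most recent call last): KeyError"], False),        # never recovers
]
kept = [t for t in candidates if shows_self_debugging(t)]
print(len(kept), "of", len(candidates), "trajectories kept")
```

Filtering this way biases the training data toward examples where the observation actually changes what the agent does next.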

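Table 4: Statistics of our training mixture and comparison with prior work. Please refer to §3.1 for details about CodeActInstruct and the general conversation data. Token statistics are computed using the Llama-2 tokenizer.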

Data Mixture	Data Type	Data Name	# of Data Instances	# of Total Tokens	Avg. Tokens Per Instance
Prior Work	-	FireAct (Chen et al., 2023a)	$2,063$	$542,176$	$262.81$
Prior Work	-	AgentInstruct (Zeng et al., 2023)	$1,866$	$2,517,785$	$1349.30$
CodeActInstruct (Ours)	Information Seeking	HotpotQA (Yang et al., 2018)	$1,664$	$2,472,227$	$1485.71$
	Software Packages (Tool)	MATH (Math, (Hendrycks et al., 2021b))	$1,732$	$1,719,467$	$992.76$
	Software Packages (Tool)	APPS (Code, (Hendrycks et al., 2021a))	$647$	$1,235,472$	$1909.54$
	External Memory	WikiTableQuestion (Pasupat & Liang, 2015)	$1,065$	$1,316,246$	$1235.91$
	Robot Planning	ALFWorld (Shridhar et al., 2020)	$2,031$	$3,838,269$	$1889.84$
	Total		$\mathbf{7,139}$	$\mathbf{10,581,681}$	$\mathbf{1482.24}$
General Conversation	Single-Turn Reasoning	OpenOrca (Sub-sampled, (Lian et al., 2023))	$50,000$	$14,034,152$	$280.68$
	Multi-Turn Conversations	ShareGPT (Sub-sampled, (Anonymous, 2023))	$10,000$	$17,933,861$	$1793.39$
	Multi-Turn Conversations	ShareGPT (GPT-4, (OpenChat, 2023))	$4,583$	$18,195,878$	$3970.30$
	Multi-turn Reasoning	CapyBara (LDJnr, 2023)	$4,647$	$4,982,435$	$1072.18$
	Total		$\mathbf{69,230}$	$\mathbf{55,146,326}$	$\mathbf{796.57}$

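Comparing CodeActInstruct with Prior Work. Compared with the prior work AgentInstruct (Zeng et al., 2023) and FireAct (Chen et al., 2023a), which mainly focus on using text as action, CodeActInstruct results in models that are more practical in real-world implementations, as models using CodeAct can directly interact with Python interpreters and open-source toolkits (Fig. 3), reducing the development effort for action parsing and tool creation. CodeActInstruct is systematically constructed following the general agent framework (Fig. 2). It covers diverse domains (e.g., compared to FireAct, which only considers QA tasks and a search API), contains quality data (e.g., it promotes the agent's self-debugging capability), and is larger in size (3.8x / 3.5x more data trajectories and 5x / 19x more tokens compared to AgentInstruct / FireAct, respectively, in Tab. 4). As we empirically show in Tab. 5, the resulting model (same backbone) of CodeActInstruct achieves 24% and 119% relative improvement compared to AgentInstruct and FireAct, respectively.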
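The size comparison can be recomputed from the figures in Tab. 4. The short sketch below only does arithmetic on numbers copied from that table; the variable names are ours, and the ratios quoted in the text appear to be given as approximate whole numbers.

```python
# (number of instances, total tokens), copied from Tab. 4.
codeact_parts = {
    "HotpotQA": (1_664, 2_472_227),
    "MATH": (1_732, 1_719_467),
    "APPS": (647, 1_235_472),
    "WikiTableQuestion": (1_065, 1_316_246),
    "ALFWorld": (2_031, 3_838_269),
}
prior_work = {
    "AgentInstruct": (1_866, 2_517_785),
    "FireAct": (2_063, 542_176),
}

total_instances = sum(n for n, _ in codeact_parts.values())
total_tokens = sum(t for _, t in codeact_parts.values())
avg_tokens = total_tokens / total_instances

# Size of CodeActInstruct relative to each prior dataset.
ratios = {
    name: (total_instances / n, total_tokens / t)
    for name, (n, t) in prior_work.items()
}

print(f"total: {total_instances} instances, {total_tokens} tokens, "
      f"{avg_tokens:.2f} tokens per instance")
for name, (inst_ratio, tok_ratio) in ratios.items():
    print(f"vs. {name}: {inst_ratio:.1f}x trajectories, {tok_ratio:.1f}x tokens")
```

The totals reproduce the "Total" row of the CodeActInstruct block in Tab. 4.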

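CodeActInstruct Can Be Used With Existing Agent-User Conversation Data. We use a sub-sampled set of OpenOrca (Lian et al., 2023), which focuses on single-turn chain-of-thought (CoT) reasoning; ShareGPT (Anonymous, 2023; OpenChat, 2023), from two sources that contain multi-turn conversations between humans and LLMs; and CapyBara (LDJnr, 2023), which focuses on reasoning in multi-turn conversations. Statistics and down-sampling details can be found in Tab. 4 and §C.

Table 5: Evaluation results for CodeActAgent. The best results among all open-source LLMs are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation, respectively. The overall averaged performance normalizes the MT-Bench score to be consistent with the other tasks and excludes in-domain tasks for fair comparison.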

		Agent Tasks: Code as Action			Agent Tasks: Text as Action (OD)		Generic Tasks (OD)				Overall Average
Model	Size	MINT (ID)	MINT (OD)	M³ToolEval (OD)	Miniwob++	SciWorld	MMLU	HumanEval	GSM8K	MTBench	
Open-source LLMs (LLaMA-2-based)											
Llama2 Base	7B	-*	-*	-*	-*	-*	$45.3$	$12.8$	$14.6$	-*	-*
Llama2 Chat	7B	$3.2$	$11.0$	$0.0$	$0.0$	$5.9$	$48.0$	$13.9$	$27.7$	$6.3$	$21.1$
FireAct (Chen et al., 2023a)	7B	$0.0$	$0.3$	$0.0$	$0.0$	$6.8$	$44.1$	$3.5$	$12.4$	$4.5$	$14.0$
AgentLM (Zeng et al., 2023)	7B	$8.7$	$6.1$	$0.0$	$28.9$	$13.7$	$48.7$	$15.4$	$24.6$	$6.1$	$24.8$
CodeActAgent (LLaMA-2)	7B	$51.3$	$20.4$	$0.0$	$25.5$	$\mathbf{17.6}$	$50.6$	$18.1$	$38.3$	$7.5$	$30.7$
Open-source LLMs (Mistral-based)											
Mistral Base	7B	-*	-*	-*	-*	-*	$\mathbf{60.1}$	$30.5$	$52.1$	-*	-*
Mistral Instruct	7B	$18.8$	$9.7$	$0.0$	$0.5$	$4.0$	$53.8$	$29.3$	$43.3$	$6.4$	$25.6$
CodeActAgent (Mistral)	7B	$\mathbf{57.4}$	$\mathbf{32.4}$	$\mathbf{12.2}$	$\mathbf{46.2}$	$15.9$	$59.1$	$\mathbf{34.7}$	$\mathbf{58.0}$	$\mathbf{8.2}$	$\mathbf{42.5}$
Closed-source LLMs											
gpt-3.5-turbo-0613	-	$33.9$	$38.2$	$51.2$	$66.7$	$21.2$	$70.0$	$48.1$	$57.1$	$7.9$	$54.0$
gpt-4-0613	-	$68.6$	$70.2$	$67.1$	$69.4$	$36.4$	$86.4$	$67.0$	$87.1$	$9.0$	$71.7$

* Some results are only available with instruction-tuned models.
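The Table 5 caption says the overall average normalizes MT-Bench and excludes in-domain tasks, without giving the formula. One reading that is consistent with the reported numbers is sketched below, and it is an assumption on our part rather than a stated procedure: multiply the 0-10 MT-Bench score by 10 and take the unweighted mean over the eight out-of-domain columns. Small differences from the reported values are expected because the table entries are already rounded.

```python
MT_BENCH_SCALE = 10.0  # assumption: map the 0-10 MT-Bench score onto a 0-100 scale

# Out-of-domain scores copied from Tab. 5, in the order:
# MINT (OD), M3ToolEval, Miniwob++, SciWorld, MMLU, HumanEval, GSM8K
# followed by the MT-Bench score and the reported overall average.
rows = {
    "Llama2 Chat": ([11.0, 0.0, 0.0, 5.9, 48.0, 13.9, 27.7], 6.3, 21.1),
    "CodeActAgent (LLaMA-2)": ([20.4, 0.0, 25.5, 17.6, 50.6, 18.1, 38.3], 7.5, 30.7),
    "CodeActAgent (Mistral)": ([32.4, 12.2, 46.2, 15.9, 59.1, 34.7, 58.0], 8.2, 42.5),
    "gpt-4-0613": ([70.2, 67.1, 69.4, 36.4, 86.4, 67.0, 87.1], 9.0, 71.7),
}

def overall_average(scores, mt_bench):
    """Unweighted mean of the out-of-domain scores plus the rescaled MT-Bench score."""
    values = scores + [mt_bench * MT_BENCH_SCALE]
    return sum(values) / len(values)

recomputed = {name: overall_average(s, mt) for name, (s, mt, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported}")
```

MINT (ID) is left out of the mean, matching the caption's statement that in-domain tasks are excluded.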

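3.2 CodeActAgent

We fine-tune Llama-2 7B (Touvron et al., 2023) and Mistral 7B (Jiang et al., 2023) on a mixture of CodeActInstruct and general conversations (Tab. 4) to obtain CodeActAgent.

Training Setup. We perform full-parameter supervised fine-tuning with a sequence length of 4,096 tokens for Llama-2 and 16,384 tokens for Mistral. Please refer to §D for more details.

Evaluation Setup. We use MINT (Wang et al., 2023e) to evaluate LLMs with CodeAct on a diverse range of agent tasks. CodeActAgent has some training domains that overlap with MINT's evaluation (i.e., MINT includes ALFWorld and MATH), so we report separate numbers for MINT's in-domain and out-of-domain performance. Unless otherwise specified, we measure MINT tasks' success rates with interaction turn $k=5$. We also use MiniWob++ (computer tasks, (Kim et al., 2023)) and ScienceWorld (a text-based simulator for the elementary science curriculum, (Wang et al., 2022a)) to test whether CodeActAgent can generalize to different action formats. Finally, we include a suite of general LLM evaluation tasks to assess general capability: MMLU (Hendrycks et al., 2020) for knowledge-based QA, HumanEval (Chen et al., 2021) for single-turn code generation, GSM8K (Cobbe et al., 2021) for single-turn tool-free math reasoning, and MTBench (Zheng et al., 2023) for instruction following.

CodeActAgent Excels in CodeAct Tasks. As shown in Tab. 5, CodeActAgent (both variants) performs better than all evaluated open-source LLMs on both the in-domain and out-of-domain subsets of MINT. On M³ToolEval, we find that CodeActAgent (Mistral) outperforms open-source LLMs of similar size (7B and 13B) and even reaches performance similar to that of the 70B models (Tab. 3). Surprisingly, no improvement is observed for the Llama-2 variant. We discuss potential reasons in §H.

CodeActAgent Generalizes to Text Action. When evaluated on out-of-domain text actions, CodeActAgent (LLaMA2, 7B), which has never been optimized for text actions, achieves performance comparable to AgentLM-7B (Zeng et al., 2023), which has been explicitly tuned for text actions.

CodeActAgent Maintains or Improves the Performance on General LLM Tasks. In Tab. 5, we find that CodeActAgent (both variants) performs better on the generic LLM tasks we tested, except for a slight degradation on MMLU for CodeActAgent (Mistral, 7B).

Ablation Study. Tab. A.8 presents ablation experiments to determine the importance of CodeActInstruct and general conversations. Both CodeActInstruct and general conversations contribute to agent tasks, while general conversations are essential for maintaining performance on general tasks.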

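4 Related Work

4.1 Action Module in LLM Agents

As detailed in (Wang et al., 2023b), LLM-based autonomous agents are typically structured around four components: customized profiles (Park et al., 2023; Qian et al., 2023), long-term memory capabilities (Zhu et al., 2023; Fischer, 2023), reasoning and planning algorithms (Wei et al., 2022; Chen et al., 2023d), and, most crucially, action modules. The action modules are key to enabling LLM agents to interact effectively with external entities, including humans (Lee et al., 2022) and tools (Qin et al., 2023a) in the environment (Wang et al., 2023e; Yang et al., 2024a). In this study, we address the critical problem of standardizing the action space for LLM agents. We further discuss the difference between CodeAct and the line of work that uses code generation for problem-solving in §A. We note that a concurrent study, TaskWeaver (Qiao et al., 2023), similarly endorses the use of code. We discuss the principal distinctions in §B.

4.2 Improving LLM Agents

As surveyed in (Wang et al., 2023b), the two primary methods for enhancing LLM agents are prompt engineering and instruction tuning. For prompt engineering (Liu et al., 2023a), numerous strategies have been introduced to improve chain-of-thought reasoning (Wei et al., 2022), including self-consistency-based reasoning (Wang et al., 2022b; Chen et al., 2023d) and tree-based approaches (Yao et al., 2023a). Moreover, LLMs can be strategically prompted to reflect on previous plans (Yao et al., 2023b; Wang et al., 2023f; Zhang et al., 2023), enabling them to refine initial actions through trial and error. In contrast to prompt engineering, instruction tuning intrinsically enhances LLMs (Chung et al., 2022), particularly in their agent capabilities (Zeng et al., 2023; Chen et al., 2023a). For effective training, human annotators can curate expert demonstrations for specific agent tasks, such as web browsing (Yao et al., 2022a; Nakano et al., 2021). To minimize human annotation effort, prior work creates synthetic datasets using stronger LLMs to distill agent capabilities into local models, focusing on tool usage (Qin et al., 2023b), interaction (Chen et al., 2023c), and social skills (Liu et al., 2023b). CodeActInstruct aligns with the latter approach and creates its dataset using stronger LLMs.

5 Conclusions

This work introduces CodeAct, which employs executable Python code for the LLM agent's actions and is advantageous over text or JSON actions, especially in complex scenarios. We collect CodeAct-focused multi-turn interaction trajectories, CodeActInstruct, for instruction tuning, and train CodeActAgent, which is specially designed for seamless integration with Python and can execute sophisticated tasks (e.g., model training) by leveraging existing Python packages and autonomously rectifying errors through self-debugging.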

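Acknowledgement

We thank the anonymous reviewers for their suggestions and comments. This research is based upon work supported by U.S. DARPA ECOLE Program No. HR00112390060, U.S. DARPA ITM Program No. FA8650-23-C-7316, and KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein. This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS230256 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS, Boerner et al. 2023) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.

Impact Statement

The work proposed in this paper aims to advance LLM-based autonomous agents that can communicate with humans through natural language and assist human users by performing tasks in environments on their behalf. In this section, we discuss potential societal consequences, limitations, and future work related to our work and its goal.

CodeActAgent is an initial prototype of an autonomous agent and still has several practical limitations. For example, it may suffer from the hallucination commonly seen in LLMs (e.g., imagining the content of a variable without actually printing it out), suggesting the need for subsequent alignment (Ouyang et al., 2022) for further improvement.

Despite being a prototype, CodeActAgent has already demonstrated limited self-improving capability (e.g., self-debugging from error messages to improve its actions) and the ability to interact with environments. Future work may build upon CodeActAgent to develop better agents by having them perform extensive interactions within a given environment and iteratively bootstrap their self-improving capability to learn from past mistakes. The more powerful agents resulting from such algorithms are potentially beneficial for solving a wide range of real-world problems (e.g., theorem proving, drug discovery). As extensively discussed in (Eloundou et al., 2023), a fully autonomous agent may transform the current landscape of the labor market and impact the jobs of existing workers.

Furthermore, since CodeAct directly grants the agent access to freely execute code in a sandbox environment, in the worst-case scenario (e.g., as in science-fiction movies) such an agent may break free of the sandbox restriction and cause harm to the world through cyber-attacks, highlighting the need for future work to design better safety mechanisms to safeguard autonomous agents (Tang et al., 2024).

References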

Ahn et al. (2022) Ahn, M., Brohan, A., Brown, N., Chebotar, Y., Cortes, O., David, B., Finn, C., Fu, C., Gopalakrishnan, K., Hausman, K., Herzog, A., Ho, D., Hsu, J., Ibarz, J., Ichter, B., Irpan, A., Jang, E., Ruano, R. J., Jeffrey, K., Jesmonth, S., Joshi, N., Julian, R., Kalashnikov, D., Kuang, Y., Lee, K.-H., Levine, S., Lu, Y., Luu, L., Parada, C., Pastor, P., Quiambao, J., Rao, K., Rettinghouse, J., Reyes, D., Sermanet, P., Sievers, N., Tan, C., Toshev, A., Vanhoucke, V., Xia, F., Xiao, T., Xu, P., Xu, S., Yan, M., and Zeng, A. Do as i can and not as i say: Grounding language in robotic affordances. In arXiv preprint arXiv:2204.01691, 2022.
Anonymous (2023) Anonymous. Sharegpt dataset. https://hf.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/blob/main/ShareGPT_V3_unfiltered_cleaned_split_no_imsorry.json, 2023. A dataset containing multi-turn conversations between human and LLM assistant.
Boerner et al. (2023) Boerner, T. J., Deems, S., Furlani, T. R., Knuth, S. L., and Towns, J. Access: Advancing innovation: Nsf’s advanced cyberinfrastructure coordination ecosystem: Services & support. In Practice and Experience in Advanced Research Computing, pp. 173–176. 2023.
Bran et al. (2023) Bran, A. M., Cox, S., White, A. D., and Schwaller, P. Chemcrow: Augmenting large-language models with chemistry tools. arXiv preprint arXiv:2304.05376, 2023.
Cano et al. (2023) Cano, A. H., Pagliardini, M., Köpf, A., Matoba, K., Mohtashami, A., Wang, X., Fan, O. S., Marmet, A., Bayazit, D., Krawczuk, I., Chen, Z., Salvi, F., Bosselut, A., and Jaggi, M. epfllm megatron-llm, 2023. URL https://github.com/epfLLM/Megatron-LLM.
Chase (2022) Chase, H. LangChain, October 2022. URL https://github.com/langchain-ai/langchain.
Chen et al. (2023a) Chen, B., Shu, C., Shareghi, E., Collier, N., Narasimhan, K., and Yao, S. Fireact: Toward language agent fine-tuning. arXiv preprint arXiv:2310.05915, 2023a.
Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
Chen et al. (2023b) Chen, X., Lin, M., Schärli, N., and Zhou, D. Teaching large language models to self-debug. arXiv preprint arXiv:2304.05128, 2023b.
Chen et al. (2023c) Chen, Y., Sikka, K., Cogswell, M., Ji, H., and Divakaran, A. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081, 2023c.
Chen et al. (2023d) Chen, Y., Sikka, K., Cogswell, M., Ji, H., and Divakaran, A. Measuring and improving chain-of-thought reasoning in vision-language models. arXiv preprint arXiv:2309.04461, 2023d.
Chung et al. (2022) Chung, H. W., Hou, L., Longpre, S., Zoph, B., Tay, Y., Fedus, W., Li, Y., Wang, X., Dehghani, M., Brahma, S., et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
Eloundou et al. (2023) Eloundou, T., Manning, S., Mishkin, P., and Rock, D. Gpts are gpts: An early look at the labor market impact potential of large language models. arXiv preprint arXiv:2303.10130, 2023.
Fischer (2023) Fischer, K. A. Reflective linguistic programming (rlp): A stepping stone in socially-aware agi (socialagi). arXiv preprint arXiv:2305.12647, 2023.
Gao et al. (2023) Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
Hendrycks et al. (2020) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020.
Hendrycks et al. (2021a) Hendrycks, D., Basart, S., Kadavath, S., Mazeika, M., Arora, A., Guo, E., Burns, C., Puranik, S., He, H., Song, D., et al. Measuring coding challenge competence with apps. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021a.
Hendrycks et al. (2021b) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021b.
Hong et al. (2023) Hong, S., Zheng, X., Chen, J., Cheng, Y., Wang, J., Zhang, C., Wang, Z., Yau, S. K. S., Lin, Z., Zhou, L., et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
Hong et al. (2024) Hong, S., Lin, Y., Liu, B., Liu, B., Wu, B., Li, D., Chen, J., Zhang, J., Wang, J., Zhang, L., Zhang, L., Yang, M., Zhuge, M., Guo, T., Zhou, T., Tao, W., Wang, W., Tang, X., Lu, X., Zheng, X., Liang, X., Fei, Y., Cheng, Y., Xu, Z., and Wu, C. Data interpreter: An llm agent for data science, 2024.
Huang et al. (2023) Huang, W., Wang, C., Zhang, R., Li, Y., Wu, J., and Fei-Fei, L. Voxposer: Composable 3d value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023.
Jiang et al. (2023) Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023.
Kim et al. (2023) Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. arXiv preprint arXiv:2303.17491, 2023.
LDJnr (2023) LDJnr. Capybara dataset. https://hf.co/datasets/LDJnr/Verified-Camel, https://hf.co/datasets/LDJnr/Pure-Dove, https://hf.co/datasets/LDJnr/LessWrong-Amplify-Instruct, 2023. A dataset focusing on reasoning in multi-turn conversations.
Lee et al. (2022) Lee, M., Liang, P., and Yang, Q. Coauthor: Designing a human-ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI conference on human factors in computing systems, pp. 1–19, 2022.
Li et al. (2023) Li, M., Song, F., Yu, B., Yu, H., Li, Z., Huang, F., and Li, Y. Api-bank: A benchmark for tool-augmented llms, 2023.
Lian et al. (2023) Lian, W., Goodson, B., Pentland, E., Cook, A., Vong, C., and "Teknium". Openorca: An open dataset of gpt augmented flan reasoning traces. https://huggingface.co/Open-Orca/OpenOrca, 2023.
Liang et al. (2022) Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In arXiv preprint arXiv:2209.07753, 2022.
Liu et al. (2023a) Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., and Neubig, G. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023a.
Liu et al. (2023b) Liu, R., Yang, R., Jia, C., Zhang, G., Zhou, D., Dai, A. M., Yang, D., and Vosoughi, S. Training socially aligned language models in simulated human society. arXiv preprint arXiv:2305.16960, 2023b.
Ma et al. (2023) Ma, Y. J., Liang, W., Wang, G., Huang, D.-A., Bastani, O., Jayaraman, D., Zhu, Y., Fan, L., and Anandkumar, A. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023.
Mialon et al. (2023) Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
Nakano et al. (2021) Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
OpenChat (2023) OpenChat. Sharegpt dataset. https://hf.co/datasets/openchat/openchat_sharegpt_v3/blob/main/sharegpt_gpt4.json, 2023. A dataset containing multi-turn conversations between human and LLM assistants. It is filtered to contain data only from GPT-4.
Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Park et al. (2023) Park, J. S., O’Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, pp. 1–22, 2023.
Pasupat & Liang (2015) Pasupat, P. and Liang, P. Compositional semantic parsing on semi-structured tables. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1470–1480, 2015.
Patil et al. (2023) Patil, S. G., Zhang, T., Wang, X., and Gonzalez, J. E. Gorilla: Large language model connected with massive apis. ArXiv, abs/2305.15334, 2023. URL https://api.semanticscholar.org/CorpusID:258865184.
Qian et al. (2023) Qian, C., Cong, X., Yang, C., Chen, W., Su, Y., Xu, J., Liu, Z., and Sun, M. Communicative agents for software development. arXiv preprint arXiv:2307.07924, 2023.
Qiao et al. (2023) Qiao, B., Li, L., Zhang, X., He, S., Kang, Y., Zhang, C., Yang, F., Dong, H., Zhang, J., Wang, L., et al. Taskweaver: A code-first agent framework. arXiv preprint arXiv:2311.17541, 2023.
Qin et al. (2023a) Qin, Y., Hu, S., Lin, Y., Chen, W., Ding, N., Cui, G., Zeng, Z., Huang, Y., Xiao, C., Han, C., et al. Tool learning with foundation models. arXiv preprint arXiv:2304.08354, 2023a.
Qin et al. (2023b) Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y.-T., Lin, Y., Cong, X., Tang, X., Qian, B., Zhao, S., Tian, R., Xie, R., Zhou, J., Gerstein, M. H., Li, D., Liu, Z., and Sun, M. Toolllm: Facilitating large language models to master 16000+ real-world apis. ArXiv, abs/2307.16789, 2023b. URL https://api.semanticscholar.org/CorpusID:260334759.
Schick et al. (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
Shen et al. (2023) Shen, Y., Song, K., Tan, X., Li, D., Lu, W., and Zhuang, Y. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580, 2023.
Shridhar et al. (2020) Shridhar, M., Yuan, X., Cote, M.-A., Bisk, Y., Trischler, A., and Hausknecht, M. Alfworld: Aligning text and embodied environments for interactive learning. In International Conference on Learning Representations, 2020.
Singh et al. (2023) Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530, 2023. doi: 10.1109/ICRA48891.2023.10161317.
Surís et al. (2023) Surís, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning. Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023.
Tang et al. (2024) Tang, X., Jin, Q., Zhu, K., Yuan, T., Zhang, Y., Zhou, W., Qu, M., Zhao, Y., Tang, J., Zhang, Z., et al. Prioritizing safeguarding over autonomy: Risks of llm agents for science. arXiv preprint arXiv:2402.04247, 2024.
TIOBE Index (2024) TIOBE Index. Tiobe index. https://www.tiobe.com/tiobe-index/, Accessed at Jan 23rd, 2024, 2024. The TIOBE Programming Community index is an indicator of the popularity of programming languages. The index is updated once a month. The ratings are based on the number of skilled engineers world-wide, courses and third party vendors.
Touvron et al. (2023) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
Wang et al. (2023a) Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
Wang et al. (2023b) Wang, L., Ma, C., Feng, X., Zhang, Z., Yang, H., Zhang, J., Chen, Z., Tang, J., Chen, X., Lin, Y., et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023b.
Wang et al. (2022a) Wang, R., Jansen, P. A., Côté, M.-A., and Ammanabrolu, P. Scienceworld: Is your agent smarter than a 5th grader? In Conference on Empirical Methods in Natural Language Processing, 2022a. URL https://api.semanticscholar.org/CorpusID:247451124.
Wang et al. (2022b) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022b.
Wang et al. (2023c) Wang, X., Li, S., and Ji, H. Code4Struct: Code generation for few-shot event structure prediction. In Rogers, A., Boyd-Graber, J., and Okazaki, N. (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 3640–3663, Toronto, Canada, July 2023c. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.202. URL https://aclanthology.org/2023.acl-long.202.
Wang et al. (2023d) Wang, X., Peng, H., Jabbarvand, R., and Ji, H. Leti: Learning to generate from textual interactions. ArXiv, abs/2305.10314, 2023d.
Wang et al. (2023e) Wang, X., Wang, Z., Liu, J., Chen, Y., Yuan, L., Peng, H., and Ji, H. Mint: Evaluating llms in multi-turn interaction with tools and language feedback. arXiv preprint arXiv:2309.10691, 2023e.
Wang et al. (2023f) Wang, Z., Cai, S., Liu, A., Ma, X., and Liang, Y. Describe, explain, plan and select: Interactive planning with large language models enables open-world multi-task agents. arXiv preprint arXiv:2302.01560, 2023f.
Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Xu et al. (2023) Xu, Q., Hong, F., Li, B., Hu, C., Chen, Z., and Zhang, J. On the tool manipulation capability of open-source large language models, 2023.
Yang et al. (2024a) Yang, J., Prabhakar, A., Narasimhan, K., and Yao, S. Intercode: Standardizing and benchmarking interactive coding with execution feedback. Advances in Neural Information Processing Systems, 36, 2024a.
Yang et al. (2024b) Yang, K., Liu, J., Wu, J., Yang, C., Fung, Y. R., Li, S., Huang, Z., Cao, X., Wang, X., Wang, Y., Ji, H., and Zhai, C. If llm is the wizard, then code is the wand: A survey on how code empowers large language models to serve as intelligent agents, 2024b.
Yang et al. (2018) Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., and Manning, C. D. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2369–2380, 2018.
Yang et al. (2024c) Yang, Z., Liu, A., Liu, Z., Liu, K., Xiong, F., Wang, Y., Yang, Z., Hu, Q., Chen, X., Zhang, Z., Luo, F., Guo, Z., Li, P., and Liu, Y. Towards unified alignment between agents, humans, and environment, 2024c.
Yao et al. (2022a) Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022a.
Yao et al. (2022b) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K. R., and Cao, Y. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022b.
Yao et al. (2023a) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023a.
Yao et al. (2023b) Yao, W., Heinecke, S., Niebles, J. C., Liu, Z., Feng, Y., Xue, L., Murthy, R., Chen, Z., Zhang, J., Arpit, D., et al. Retroformer: Retrospective large language agents with policy gradient optimization. arXiv preprint arXiv:2308.02151, 2023b.
Yuan et al. (2023) Yuan, L., Chen, Y., Wang, X., Fung, Y. R., Peng, H., and Ji, H. Craft: Customizing llms by creating and retrieving from specialized toolsets. ArXiv, abs/2309.17428, 2023. URL https://api.semanticscholar.org/CorpusID:263310662.
Zeng et al. (2023) Zeng, A., Liu, M., Lu, R., Wang, B., Liu, X., Dong, Y., and Tang, J. Agenttuning: Enabling generalized agent abilities for llms, 2023.
Zhang et al. (2023) Zhang, C., Liu, L., Wang, J., Wang, C., Sun, X., Wang, H., and Cai, M. Prefer: Prompt ensemble learning via feedback-reflect-refine. arXiv preprint arXiv:2308.12033, 2023.
Zheng et al. (2023) Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
Zheng et al. (2024) Zheng, T., Zhang, G., Shen, T., Liu, X., Lin, B. Y., Fu, J., Chen, W., and Yue, X. Opencodeinterpreter: Integrating code generation with execution and refinement. https://arxiv.org/abs/2402.14658, 2024.
Zhu et al. (2023) Zhu, X., Chen, Y., Tian, H., Tao, C., Su, W., Yang, C., Huang, G., Li, B., Lu, L., Wang, X., et al. Ghost in the minecraft: Generally capable agents for open-world enviroments via large language models with text-based knowledge and memory. arXiv preprint arXiv:2305.17144, 2023.

Table A.6: Example actions for the re-purposed API-Bank (Li et al., 2023) and M³ToolEval.

Format	Action
CodeAct	AddAgenda(content="Meeting with John", time="2023-10-26 09:00:00")
JSON	{"action": "AddAgenda", "content": "Meeting with John", "time": "2023-10-26 09:00:00"}
Text	Action: AddAgenda, content: Meeting with John, time: 2023-10-26 09:00:00

Table A.7: Comparison of M³ToolEval with existing tool-use evaluation benchmarks.

Benchmark	M³ToolEval	ToolBench	APIBench	API-Bank	ToolBench
	(This work)	(Qin et al., 2023b)	(Patil et al., 2023)	(Li et al., 2023)	(Xu et al., 2023)
Requiring multi-turn interaction	✔	✔	✗	✗	✗
Multiple tools	✔	✔	✔	✔	✔
Evaluation	Answer Match	LLM Evaluator	AST Tree Match	API-Call Match	Test Case
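No dependency on external API*	✔	✗	✗	✔	✗
Supported API Action Format	CodeAct & JSON & Text	JSON	CodeAct	JSON	CodeAct

* Whether the benchmark relies on external APIs hosted by third parties (e.g., RapidAPI, Google Sheet). The availability of such third-party APIs can greatly affect evaluation results (e.g., low API-calling performance not because the model is bad, but because the required API is inaccessible).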

Table A.8: Ablation study results. The best results are bolded, and the second-best results are underlined. ID and OD stand for in-domain and out-of-domain evaluation, respectively. The overall averaged performance normalizes the MT-Bench score to be consistent with other tasks and excludes in-domain tasks for a fair comparison.

		Agent Tasks				Generic LLM Tasks				Overall
		Code as Action		Text as Action (OD)		(OD)				Average
Model	Size	MINT (ID)	MINT (OD)	Miniwob++	SciWorld	MMLU	HumanEval	GSM8K	MTBench	
CodeActAgent (Llama2-based)	7B	$\mathbf{51.3}$	$\mathbf{20.4}$	$25.5$	$\mathbf{17.6}$	$\mathbf{50.6}$	$18.1$	$\mathbf{38.3}$	$\mathbf{7.5}$	$\mathbf{35.1}$
w/o CodeAct	7B	$17.0$	$15.5$	$\mathbf{36.4}$	$16.9$	$49.5$	$14.7$	$36.0$	$7.2$	$34.5$
w/o general conversations	7B	$29.2$	$15.9$	$0.0$	$17.1$	$46.4$	$\mathbf{19.7}$	$20.6$	$4.1$	$22.9$
CodeActAgent (Mistral-based)	7B	$\mathbf{57.4}$	$\mathbf{32.4}$	$46.2$	$15.9$	$59.1$	$\mathbf{34.7}$	$58.0$	$8.2$	$\mathbf{46.8}$
w/o CodeAct	7B	$32.9$	$23.0$	$\mathbf{47.8}$	$\mathbf{17.0}$	$\mathbf{59.9}$	$33.2$	$\mathbf{59.5}$	$\mathbf{8.3}$	$46.2$
w/o general conversations	7B	$50.5$	$13.9$	$0.0$	$11.0$	$52.4$	$27.9$	$26.8$	$2.6$	$22.6$
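As a rough check of how the Overall Average column can be obtained, the sketch below assumes that "normalizing" MT-Bench means multiplying its 1-10 score by 10, and then averages the seven out-of-domain columns (MINT (ID) is excluded). The scaling factor and the helper name `overall_average` are assumptions for illustration, not details stated in the paper; under them, the CodeActAgent (Llama2-based) row lands within rounding of the reported 35.1.

```python
def overall_average(row, mtbench_scale=10.0):
    """Average the out-of-domain scores after rescaling MT-Bench.

    Assumption: MT-Bench (scored 1-10) is put on a 0-100 scale by
    multiplying by 10 so that it is comparable with the other tasks.
    """
    ood_keys = ["MINT (OD)", "Miniwob++", "SciWorld", "MMLU", "HumanEval", "GSM8K"]
    scores = [row[key] for key in ood_keys]
    scores.append(row["MTBench"] * mtbench_scale)
    return sum(scores) / len(scores)


# Values copied from the CodeActAgent (Llama2-based) row of Table A.8.
llama2_row = {
    "MINT (ID)": 51.3,  # in-domain, excluded from the average
    "MINT (OD)": 20.4,
    "Miniwob++": 25.5,
    "SciWorld": 17.6,
    "MMLU": 50.6,
    "HumanEval": 18.1,
    "GSM8K": 38.3,
    "MTBench": 7.5,
}

avg = overall_average(llama2_row)
print(round(avg, 1))
```

The other rows agree with the reported averages only to within about 0.1, which is what one would expect if the published per-task scores were rounded before being tabulated.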

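Appendix A Comparison with Work that Uses Code Generation for Problem-solving

In this section, we discuss the fundamental differences between CodeAct and prior work that prompts LLMs to generate code for problem-solving. Existing work has explored using code generation for task-solving in different domains, for example, Code4Struct (Wang et al., 2023c) for structured prediction, PaL (Gao et al., 2023) for math reasoning, Meta-GPT (Hong et al., 2023) for multi-agent collaboration, code-as-policy (Liang et al., 2022) for robot control, ViperGPT (Surís et al., 2023) for visual question answering, Voyager (Wang et al., 2023a) for playing games, and Data Interpreter (Hong et al., 2024) for data science tasks.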

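Most prior work generates code (i.e., a static sequence of actions) in a single-turn setting and cannot dynamically readjust actions based on new observations: a task counts as failed whenever the model-generated code does not solve it on the first attempt. This setting overlooks the potential of environmental observations (e.g., code execution results) that could benefit future actions and overall decision-making (e.g., dynamically adjusting subsequent code after observing intermediate execution results, or fixing erroneous code after seeing an error message). Such a single-turn setting is hard to scale to more challenging problems, since even expert human programmers usually cannot write functionally correct code in the first pass. CodeAct, on the other hand, is a multi-turn interaction agent framework that by design allows dynamic adjustment of prior actions or the emission of new actions (§2.1, Fig. 2), and it is compatible with any form of textual observation from the environment (e.g., tool execution output, automated feedback). Beyond this compatibility with environmental observations, our instruction-tuning dataset CodeActInstruct specifically collects data for multi-turn self-improving, offering a practical way to enhance LLMs' multi-turn self-improving process.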
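The contrast between single-turn and multi-turn code actions can be illustrated with a small interaction loop. This is a minimal sketch under stated assumptions: `scripted_policy` is a hypothetical stand-in for an LLM, `run_code` is an illustrative helper, and the five-turn cap is arbitrary. None of these names come from the CodeAct codebase.

```python
import contextlib
import io
import traceback


def run_code(code, namespace):
    """Execute code in a persistent namespace; return stdout or the traceback."""
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)
    except Exception:
        # The error text becomes the observation, so the next turn can react to it.
        return buffer.getvalue() + traceback.format_exc()
    return buffer.getvalue()


def scripted_policy(history):
    """Hypothetical stand-in for an LLM: choose the next action from the last observation."""
    if not history:
        # First attempt contains a bug: `prices` has not been defined yet.
        return "total = sum(prices)\nprint(total)"
    last_observation = history[-1][1]
    if "NameError" in last_observation:
        # Revise the action after seeing the error message.
        return "prices = [3, 4, 5]\ntotal = sum(prices)\nprint(total)"
    return None  # the last action succeeded, nothing left to do


namespace, history = {}, []
for _ in range(5):  # arbitrary cap on the number of turns
    action = scripted_policy(history)
    if action is None:
        break
    observation = run_code(action, namespace)
    history.append((action, observation))

for turn, (action, observation) in enumerate(history, start=1):
    print(f"--- turn {turn} ---\n{observation.strip().splitlines()[-1]}")
```

In a single-turn setting the episode would end after the first (failing) action; here the traceback is fed back as an observation and the second action repairs the bug.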

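In addition, previous approaches require heavy prompt engineering and hand-crafted few-shot demonstrations to tailor LLMs to a particular domain or task (e.g., robot control (Liang et al., 2022)), since the backbone LLMs are not specially optimized for dynamic planning and decision making. In contrast, in this work we propose the CodeAct framework, which uses executable Python code to consolidate LLM agents' actions into a unified action space, and we collect CodeActInstruct on a diverse array of tasks (e.g., information seeking, tabular reasoning, robot planning), so that the trained model, CodeActAgent, scales to diverse tasks and domains with minimal human effort, as shown in §3.2.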

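One notable exception among prior work is Voyager (Wang et al., 2023a), which performs iterative prompting in a constrained action space of function definitions to fix code errors. Different from CodeAct, such a setting disallows dynamic readjustment of atomic actions on the fly: in CodeAct, for a particular task (e.g., crafting a stone sword in Minecraft), the agent can first execute one line of code (any atomic action or composed function, e.g., move forward, locate stone) and dynamically produce different actions based on the observation of the first action. This is challenging for Voyager: similar to code-as-policy (Liang et al., 2022), it generates an action (a skill, e.g., craft stone sword) as a Python function definition that outlines the entire plan for a task (e.g., multi-step code describing how to craft a stone sword and handle different potential cases, which requires strong domain knowledge). This imposes significant constraints on the agent's action space and disallows dynamic readjustment of atomic actions: the agent can only generate one complete function first (e.g., by imagining all the cases that might arise when trying to locate stones), execute the entire function, observe the feedback, and update the entire function as its action in the subsequent move. Besides the constrained ability to readjust actions from environmental observations, Voyager also relies on heavy prompt engineering (a typical drawback discussed above) to provide relevant information (e.g., the current state, additional self-critique via prompting) for generating revised code, whereas CodeAct is situated in a setting that requires no prompt engineering: the context window of the LLM contains only its past actions and observations and needs no human filtering of relevant information.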

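Similar to CodeAct, the concurrent work OpenCodeInterpreter (Zheng et al., 2024), with a specific focus on competitive code generation questions, collects code-debugging trajectories to improve an LLM's iterative code-debugging performance. However, its applicability to general LLM agent tasks remains unknown.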

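Appendix B Comparison with TaskWeaver

In the context of unifying LLM agents' action space, our work represents a leap over the previous initiative, TaskWeaver (Qiao et al., 2023). While TaskWeaver deserves credit for initially integrating code into the action space of LLM agents, its exploration remains limited. That work, primarily characterized by its reliance on a limited set of qualitative examples with closed-source models as the backbone, does not harness the full potential of this integration and remains a conceptual demonstration. Our work goes beyond conceptualization with an extensive and rigorous analysis that clearly quantifies the benefits of code actions within LLM agents. Beyond this, we introduce a unique instruction-tuning dataset, CodeActInstruct, specifically constructed to enhance agents' capabilities in executing code-based actions, and an open-source LLM agent, CodeActAgent. These contributions not only extend the work of TaskWeaver but also pave the way for future explorations, offering valuable resources to the open-source community and redefining the potential of LLM agents in practical applications.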

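Appendix C General Data Down-sample

• ShareGPT (Anonymous, 2023): We remove all single-turn conversations, then perform random sub-sampling to reach the desired final number.

• ShareGPT (GPT-4) (OpenChat, 2023): We do not perform sub-sampling on this dataset.

• OpenOrca (Lian et al., 2023): We select the CoT subset of OpenOrca, then perform random sub-sampling to reach the desired final number.

• CapyBara (LDJnr, 2023): We do not perform sub-sampling on this dataset.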

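Appendix D CodeActAgent Training Details

All SFT experiments are performed on one 4xA100 40GB SXM node using a fork of Megatron-LLM (Cano et al., 2023), with a training throughput of around 9k tokens per second. We use the chatML format (https://github.com/openai/openai-python/blob/release-v0.28.0/chatml.md) for all multi-turn data, and we only calculate and optimize the loss on the assistant responses. We pack short instances into longer ones and apply flash attention for training efficiency.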
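Computing the loss only on assistant responses amounts to building a per-token mask alongside the chatML-formatted sequence. The sketch below is illustrative only: whitespace splitting stands in for a real tokenizer, and treating the assistant's end-of-turn marker as a training target is an assumption rather than something the paper specifies. The function names are hypothetical.

```python
def build_chatml_with_loss_mask(messages):
    """Return (tokens, mask); mask is 1 only for tokens inside assistant turns.

    Assumption: whitespace splitting is a placeholder for a real tokenizer.
    """
    tokens, mask = [], []
    for message in messages:
        header = ["<|im_start|>" + message["role"]]
        body = message["content"].split()
        footer = ["<|im_end|>"]
        is_assistant = message["role"] == "assistant"
        tokens += header + body + footer
        # The role header is never a target; the response body and its
        # end-of-turn marker are targets only for assistant turns (assumption).
        mask += [0] * len(header) + [int(is_assistant)] * (len(body) + len(footer))
    return tokens, mask


def masked_mean_loss(per_token_losses, mask):
    """Average the per-token losses over the positions where mask == 1."""
    selected = [loss for loss, keep in zip(per_token_losses, mask) if keep]
    return sum(selected) / len(selected)


conversation = [
    {"role": "user", "content": "Hi there"},
    {"role": "assistant", "content": "Hello !"},
]
tokens, mask = build_chatml_with_loss_mask(conversation)
example_loss = masked_mean_loss([9.0] * 5 + [1.0, 2.0, 3.0], mask)
```

With this mask, the large placeholder losses on the system and user tokens do not contribute to `example_loss`; only the three assistant-side positions are averaged.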

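We train both LLaMA-2 and Mistral LLMs with a tensor parallelism of 4 and a learning rate of 1e-5, using 50 warmup steps and cosine decay (end learning rate of 1e-6). We train for five epochs with a batch size of 32 and use the 3rd-epoch checkpoint for all our experiments.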
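The learning-rate schedule described above can be written down explicitly. This is a minimal sketch, assuming linear warmup (the paper does not state the warmup shape) and a hypothetical `total_steps` argument, since the total number of optimizer steps is not reported here.

```python
import math


def learning_rate(step, total_steps, peak_lr=1e-5, final_lr=1e-6, warmup_steps=50):
    """Warmup followed by cosine decay from peak_lr down to final_lr.

    Assumptions: warmup is linear, and `total_steps` is supplied by the caller.
    """
    if step < warmup_steps:
        # Linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return final_lr + (peak_lr - final_lr) * cosine


# Illustrative run length only; 1000 is not a number from the paper.
schedule = [learning_rate(step, total_steps=1000) for step in range(1001)]
print(schedule[0], schedule[49], schedule[-1])
```

The schedule rises to 1e-5 over the first 50 steps and then decays monotonically to 1e-6 at the final step.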

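Appendix E Example Prompt for CodeAct

This is an example (zero-shot) system prompt used in a deployed instance of CodeAct, where we use the chatML format.

Users may optionally include tool descriptions similar to §F or extra in-context examples similar to §G.3.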

<|im_start|>system

A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user’s questions.

The assistant can interact with an interactive Python (Jupyter Notebook) environment and receive the corresponding output when needed. The code should be enclosed using "<execute>" tag, for example: <execute> print("Hello World!") </execute>.

The assistant should attempt fewer things at a time instead of putting too much code in one <execute> block. The assistant can install packages through PIP by <execute> !pip install [package needed] </execute> and should always import packages and define variables before starting to use them.

The assistant should stop <execute> and provide an answer when they have already obtained the answer from the execution result. Whenever possible, execute the code for the user using <execute> instead of providing it.

The assistant’s response should be concise, but do express their thoughts.

<|im_end|>
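On the environment side, a deployment following this system prompt has to locate the `<execute>` blocks in each assistant reply and run them in a session whose state persists across turns. The following is a minimal sketch of that step; the helper names are hypothetical, plain `exec` stands in for a Jupyter kernel, and notebook-only syntax such as `!pip install` is therefore not handled.

```python
import contextlib
import io
import re

# Non-greedy match so that several <execute> blocks in one reply stay separate.
EXECUTE_PATTERN = re.compile(r"<execute>(.*?)</execute>", re.DOTALL)


def extract_code_blocks(response):
    """Return the code inside every <execute>...</execute> pair, in order."""
    return [block.strip() for block in EXECUTE_PATTERN.findall(response)]


def run_in_session(code, session):
    """Run code in a shared namespace (a stand-in for a notebook kernel)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, session)
    return buffer.getvalue()


session = {}
first_reply = "I will define a variable first. <execute> x = 21 </execute>"
second_reply = "Now I can reuse it. <execute> print(x * 2) </execute>"

outputs = [
    run_in_session(code, session)
    for reply in (first_reply, second_reply)
    for code in extract_code_blocks(reply)
]
print(outputs)
```

Because both replies share `session`, the variable defined in the first block is still available in the second, which is the behavior the prompt relies on when it asks the assistant to attempt fewer things at a time.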

Appendix F M³ToolEval Prompt

You have access to the following tools:

{{Tool Definition}}

{{Formatting Instruction}}

Now, let’s get started!

Instruction: {{Example: Find the current price of Legendary Wand.}}

Answer in the format of ’xx.xx’ (e.g., 12.34).

You can optionally express your thoughts using natural language before your action. For example, ’Thought: I want to use tool_name to do something. Action: <your action to call tool_name> End Action’.

Note that your output should always contain either ’Action:’ or ’Answer:’, but not both.

When you are done, output the result using ’Answer: your answer’

Please ONLY output the answer (e.g., single number), without any other text.

Each {{...}} component above is replaced with the corresponding information.
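Filling the {{...}} components is plain string substitution. The sketch below shows one way to do it; the placeholder keys (`Tool Definition`, `Formatting Instruction`, `Instruction`) and the shortened template are assumptions chosen to mirror the section titles, not the benchmark's actual code.

```python
import re

PLACEHOLDER = re.compile(r"\{\{(.*?)\}\}")


def fill_template(template, values):
    """Replace every {{key}} in the template with values[key]."""
    def substitute(match):
        key = match.group(1).strip()
        if key not in values:
            raise KeyError(f"no value supplied for placeholder {key!r}")
        return values[key]

    return PLACEHOLDER.sub(substitute, template)


# Shortened, hypothetical version of the prompt skeleton.
template = (
    "You have access to the following tools:\n"
    "{{Tool Definition}}\n\n"
    "{{Formatting Instruction}}\n\n"
    "Now, let's get started!\n\n"
    "Instruction: {{Instruction}}"
)

prompt = fill_template(
    template,
    {
        "Tool Definition": "[1] calculator: Evaluates the given expression and returns the result.",
        "Formatting Instruction": "You can only invoke one tool at a time.",
        "Instruction": "Find the current price of Legendary Wand.",
    },
)
print(prompt)
```

Raising on a missing key, rather than leaving the placeholder in place, makes an incomplete prompt fail early instead of reaching the model.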

F.1 Example of {{Tool Definition}}

The following is an example tool definition for web browsing.

[1] click_url: Clicks on a URL. A clickable URL looks like [Clickable ’<url_argument>’] in the webpage.

Arguments: url (str).

Returns the rendered content of the webpage after clicking the URL showing on the current rendered page.

Signature: click_url(url: str) -> str

[2] go_to_previous_page: Goes back to the previous page. It has no arguments.

After going back to the previous page, return the rendered content of the webpage.

Signature: go_to_previous_page() -> str

[3] scroll_down: Scrolls down the view. It has no arguments.

Returns the rendered content of the webpage after scrolling down.

Signature: scroll_down() -> str

[4] scroll_up: Scrolls up the view. It has no arguments.

Returns the rendered content of the webpage after scrolling up.

Signature: scroll_up() -> str

[5] view: Return the current view in string format of the rendered webpage. It has no arguments.

Returns the rendered content of the webpage.

You should call this when you want to see the rendered content of the current webpage.

Signature: view() -> str

[6] calculator: Evaluates the given expression and returns the result. Accepts a calculation expression as input. For example, "2 + (3 * 4)" will return 14.

Signature: calculator(expression: str) -> float
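To make the tool definitions concrete, here is one possible implementation of the `calculator` tool that matches the signature above. It is a sketch, not the benchmark's implementation: restricting the input to arithmetic through Python's `ast` module (rather than calling `eval`) is a design choice made here so that a model-written expression cannot execute arbitrary code.

```python
import ast
import operator

# Whitelisted arithmetic operators; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _evaluate(node):
    """Recursively evaluate a parsed arithmetic expression."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
        raise ValueError("only numeric literals are allowed")
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_evaluate(node.left), _evaluate(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_evaluate(node.operand))
    raise ValueError("unsupported expression")


def calculator(expression: str) -> float:
    """Evaluate an arithmetic expression such as "2 + (3 * 4)"."""
    tree = ast.parse(expression, mode="eval")
    return float(_evaluate(tree.body))


print(calculator("2 + (3 * 4)"))  # the example from the tool description
```

Function calls, attribute access, and names all fall through to the final `ValueError`, so an input like `__import__('os')` is refused rather than executed.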

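F.2 Example of {{Formatting Instruction}}

Different action formats have different formatting instructions.

F.3 Formatting Instruction for Code as Action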

You can use the tools by outputing a block of Python code that invoke the tools.

You may use for-loops, if-statements, and other Python constructs when necessary.

Be sure to print the final answer at the end of your code.

You should begin your tool invocation with ’Action:’ and end it with ’End Action’.

Example: ’Action:

tool_name(argument_1)

End Action’

F.4 Formatting Instruction for JSON as Action

You can use the tools by outputing a JSON object with the following fields:

- ’tool’: the name of the tool

- ’args’: a list of arguments to the tool

You should begin your tool invocation with ’Action:’ and end it with ’End Action’.

Example: ’Action: {"tool": "tool_name", "args": ["argument_1"]} End Action’

You can only invoke one tool at a time.

F.5 Formatting Instruction for Text as Action

You can use the tools by outputing the tool name followed by its arguments, delimited by commas.

You should begin your tool invocation with ’Action:’ and end it with ’End Action’.

Example: ’Action: tool_name, argument_1 End Action’

You can only invoke one tool at a time.
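The three formatting instructions share the same 'Action: ... End Action' envelope and differ only in what sits inside it. The sketch below shows how an evaluation harness might parse a model output under each format; `parse_output` and its return convention are hypothetical, and the comma split for the text format is deliberately naive (an argument containing a comma would be split incorrectly).

```python
import json
import re

ACTION_PATTERN = re.compile(r"Action:(.*?)End Action", re.DOTALL)


def parse_output(output, action_format):
    """Parse a model output into ('answer', text) or ('action', payload).

    For "json" and "text" the payload is (tool_name, args); for "code" it is
    the raw Python source to be executed by the interpreter.
    """
    has_action = "Action:" in output
    has_answer = "Answer:" in output
    if has_action == has_answer:
        raise ValueError("output must contain exactly one of 'Action:' or 'Answer:'")
    if has_answer:
        return "answer", output.split("Answer:", 1)[1].strip()

    match = ACTION_PATTERN.search(output)
    if match is None:
        raise ValueError("missing 'End Action' terminator")
    body = match.group(1).strip()

    if action_format == "json":
        call = json.loads(body)
        return "action", (call["tool"], list(call["args"]))
    if action_format == "text":
        name, *args = [part.strip() for part in body.split(",")]
        return "action", (name, args)
    if action_format == "code":
        return "action", body
    raise ValueError(f"unknown action format: {action_format!r}")


json_call = parse_output('Action: {"tool": "click_url", "args": ["/shop"]} End Action', "json")
text_call = parse_output("Thought: go back. Action: go_to_previous_page End Action", "text")
code_call = parse_output("Action:\nprint(view())\nEnd Action", "code")
final = parse_output("Answer: 12.34", "json")
```

Only the code format hands the whole body to an interpreter, which is what lets a single action contain loops or several tool calls, whereas the JSON and text formats resolve to exactly one tool invocation.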

Appendix G CodeAct Interaction Data

G.1 Dataset Downsample

Table A.9: CodeActInstruct components and the number of instances for training trajectory generation.

Domain	Capability	Dataset	# of Instances
Web Search	Information seeking through search API	HotpotQA (Yang et al., 2018)	3,000
Math Reasoning	Math problem-solving using math Libraries in Python (e.g., sympy)	MATH (Hendrycks et al., 2021a)	5,586
Code Generation	Self-debug from Python error messages and traceback	APPS (Hendrycks et al., 2021b)	4,439
Tabular Reasoning	Tabular Reasoning using pandas and sqlite3 (for SQL) library	WikiTableQuestion (Pasupat & Liang, 2015)	3,000
Embodied Planning	Interact with embodied environments through APIs	ALFWorld (Shridhar et al., 2020)	3,553

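• Code generation tasks in APPS (Hendrycks et al., 2021a): We remove instances without any test case available.

• Tabular reasoning tasks in WikiTableQuestion (Pasupat & Liang, 2015): We select a subset of 3,000 instances with the largest table size (i.e., sorted by number of rows and columns) from the original dataset (14,149 instances), and randomly assign 1,500 of them to be pandas-based problems and the remaining 1,500 to be SQL-based problems.

• Web search tasks in HotpotQA (Yang et al., 2018): We select the 15,661 questions labeled as "hard" in the original dataset (which has 90,447 instances), then randomly down-sample them to 3,000 questions.

• Math reasoning in MATH (Hendrycks et al., 2021b): We remove problems with an annotated difficulty lower than 3, which results in the 5,586 instances shown in Table A.9.

• Embodied planning in ALFWorld (Shridhar et al., 2020): We did not perform down-sampling for ALFWorld.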
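The WikiTableQuestion selection above (keep the largest tables, then randomly split them between pandas and SQL) can be sketched as follows. The field names `num_rows` and `num_cols`, the tie-breaking order (rows first, then columns), and the fixed seed are assumptions for illustration, and the data is synthetic and scaled down from 3,000 to 10 instances.

```python
import random


def select_largest_and_split(instances, keep, seed=0):
    """Keep the `keep` largest tables, then split them evenly at random.

    Assumption: "table size" is ordered by (num_rows, num_cols).
    """
    ranked = sorted(
        instances,
        key=lambda inst: (inst["num_rows"], inst["num_cols"]),
        reverse=True,
    )
    selected = ranked[:keep]
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(selected)
    half = keep // 2
    return {"pandas": selected[:half], "sql": selected[half:]}


# Synthetic stand-in for the dataset's table metadata.
data_rng = random.Random(1)
tables = [
    {"id": i, "num_rows": data_rng.randint(5, 200), "num_cols": data_rng.randint(3, 20)}
    for i in range(100)
]

split = select_largest_and_split(tables, keep=10)
print(len(split["pandas"]), len(split["sql"]))
```

Shuffling after the size-based cut means the pandas and SQL halves have comparable table sizes, instead of one half receiving all of the largest tables.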

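G.2 Data Selection Heuristic

Given successful task-solving trajectories that have more than 2 turns, we apply the following heuristics to select instances that can promote the code-as-action, self-improvement, and instruction-following capabilities of LLM agents:

• Code-as-Actions: We exclude trajectories in which the LLM agent does not adhere to the code-as-action framework, either because of incorrect API invocation or because the generated actions are in a format unsuitable for parsing and execution.

• Self-Improving: We selectively preserve trajectories in which the model initially encounters errors but subsequently corrects them in later interactions. In addition, we eliminate successful trajectories that produce errors in every code execution. These are regarded as ineffective demonstrations, as our objective is to prevent the model from learning to consistently execute erroneous code while still managing to provide correct answers.

• Instruction-Following: We remove the rare cases in which the LLM agent fails to follow the instruction and respond to the user, identified by an odd number of interaction turns.

After applying all these heuristics, we obtain 6,728 trajectories (out of 6,985) from gpt-3.5 and claude, and 411 trajectories (out of 413) from gpt-4-0613.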
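The heuristics above can be read as a sequence of filters over recorded trajectories. The sketch below encodes one such reading; the trajectory schema (`success`, `num_turns`, `actions_valid`, `execution_errors`) is hypothetical, and the exact criteria used to build CodeActInstruct may differ in detail.

```python
def keep_trajectory(trajectory):
    """Apply the selection heuristics to one trajectory (hypothetical schema)."""
    # Precondition: successful trajectories with more than 2 turns.
    if not trajectory["success"] or trajectory["num_turns"] <= 2:
        return False
    # Code-as-Actions: every action must be parseable and executable.
    if not trajectory["actions_valid"]:
        return False
    # Self-Improving: drop trajectories whose code executions all failed.
    errors = trajectory["execution_errors"]
    if errors and all(errors):
        return False
    # Instruction-Following: an odd number of turns marks a missing response.
    if trajectory["num_turns"] % 2 == 1:
        return False
    return True


def shows_self_debugging(trajectory):
    """True if some execution failed but the last one succeeded."""
    errors = trajectory["execution_errors"]
    return any(errors) and not errors[-1]


recovered = {"success": True, "num_turns": 6, "actions_valid": True,
             "execution_errors": [True, False]}
always_failing = {"success": True, "num_turns": 6, "actions_valid": True,
                  "execution_errors": [True, True]}
odd_turns = {"success": True, "num_turns": 5, "actions_valid": True,
             "execution_errors": [False]}

kept = [t for t in (recovered, always_failing, odd_turns) if keep_trajectory(t)]
print(len(kept), shows_self_debugging(recovered))
```

Here `shows_self_debugging` is kept separate from the filter because the retained counts reported above (6,728 of 6,985) suggest that trajectories without any error are not discarded; the flag can instead be used to identify the self-improving examples among those kept.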

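G.3 Example of Trajectory Generation Prompt

The format of the data generation prompt closely follows MINT (Wang et al., 2023e).

G.3.1 Tabular Reasoning (WikiTableQuestion)

We provide a one-shot example only for SQL-based tabular reasoning. This is a prompt with a one-shot example for an SQL-based tabular reasoning problem: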

You are a helpful assistant assigned with the task of problem-solving. To achieve this, you will be using an interactive coding environment equipped with a variety of tool functions to assist you throughout the process.

At each turn, you should first provide your step-by-step thinking for solving the task. Your thought process should be enclosed using "<thought>" tag, for example: <thought> I need to print "Hello World!" </thought>.

After that, you have two options:

1) Interact with a Python programming environment and receive the corresponding output. Your code should be enclosed using "<execute>" tag, for example: <execute> print("Hello World!") </execute>.

2) Directly provide a solution that adheres to the required format for the given task. Your solution should be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.

You have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.

---

Task:

Given the following table (only the first 5 rows are shown):

City Country (NOC) Round 1 Run-off Round 3 Round 4 Round 5

0 Athens Greece 32 NaN 38.0 52.0 66.0

1 Rome Italy 23 NaN 28.0 35.0 41.0

2 Cape Town South Africa 16 62.0 22.0 20.0 NaN

3 Stockholm Sweden 20 NaN 19.0 NaN NaN

4 Buenos Aires Argentina 16 44.0 NaN NaN NaN

Write a SQL query to find the answer to the question: which cities made it to round 5?.

The SQLite3 database is preloaded for you and can be accessed within <execute> block via the variable `conn` (SQLite3 connection object).

The table name is `data_table`.

A:

<thought>

To solve this problem, we need to select cities where the value in the "Round 5" column is not null. Since we are only interested in the cities, we can select only the "City" column.

</thought>

<execute>

query = """
SELECT
    City
FROM
    data_table
WHERE
    `Round 5` IS NOT NULL;
"""

# conn is preloaded
result = conn.execute(query).fetchall()
print(result)

</execute>

Observation:

[('Athens',), ('Rome',)]

Assistant:

The answer is <solution> ["Athens", "Rome"] </solution>.

---

Task:

Given the following table (only the first 5 rows are shown):

Region Country Location Size (m) Payload (metric tonnes) Degrees of Freedom ... Z Vert vel (mm/s) X Horiz accel (m/s2) Y Horiz accel (m/s2) Z Vert accel (m/s2) Max Freq (Hz) Details checked

0 Africa Algeria CGS Laboratory (in construction) 6.1 x 6.1 60 6 ... +-1000 +-10 +-10 +-8 100 30/6/2010

1 Africa South Africa University of Witwatersrand 4 x 4 10 1 ... NaN +-10 NaN NaN 40 17/7/2009

2 Asia China China Academy of Building Research, Beijing 6.1 x 6.1 60 6 ... +-800 +-15 +-10 +-8 50 ?

3 Asia China Guangzhou University 3 x 3 20 6 ... +-1000 +-26 +-26 +-50 50 10/7/2008

4 Asia China Nanjing University of Technology 3 x 5 15 3 ... +-500 +-10 +-10 +-10 50 ?

[5 rows x 17 columns]

Write a SQL query to find the answer to the question: which is the other besides asia the most region charted.

The SQLite3 database is preloaded for you and can be accessed within <execute> block via the variable `conn` (SQLite3 connection object).
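The one-shot SQL example in this prompt can be reproduced locally with Python's built-in `sqlite3` module. The sketch below rebuilds the five displayed rows in an in-memory database; the column types and the mapping of NaN to NULL are assumptions about how the table was loaded, and the identifier is written with standard double quotes rather than backticks.

```python
import sqlite3

# In-memory stand-in for the preloaded `conn` object described in the prompt.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE data_table ('
    '"City" TEXT, "Country (NOC)" TEXT, "Round 1" INTEGER, '
    '"Run-off" REAL, "Round 3" REAL, "Round 4" REAL, "Round 5" REAL)'
)

# The five rows shown in the task; NaN cells are stored as NULL (assumption).
rows = [
    ("Athens", "Greece", 32, None, 38.0, 52.0, 66.0),
    ("Rome", "Italy", 23, None, 28.0, 35.0, 41.0),
    ("Cape Town", "South Africa", 16, 62.0, 22.0, 20.0, None),
    ("Stockholm", "Sweden", 20, None, 19.0, None, None),
    ("Buenos Aires", "Argentina", 16, 44.0, None, None, None),
]
conn.executemany("INSERT INTO data_table VALUES (?, ?, ?, ?, ?, ?, ?)", rows)

query = 'SELECT City FROM data_table WHERE "Round 5" IS NOT NULL;'
result = conn.execute(query).fetchall()
print(result)
```

On these five rows the query returns the same two cities as the observation in the one-shot example, since only Athens and Rome have a non-null "Round 5" value.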

This is an example instruction for a tabular reasoning problem based on the Pandas package (https://pandas.pydata.org/):

Task:

Given the following table (only the first 5 rows are shown):

Pos No Rider Bike Laps Time Grid Points

0 1 93 Marc Marquez Derbi 22.0 40:46.315 1 25.0

1 2 38 Bradley Smith Aprilia 22.0 +4.638 3 20.0

2 3 44 Pol Espargaro Derbi 22.0 +4.996 2 16.0

3 4 11 Sandro Cortese Derbi 22.0 +45.366 5 13.0

4 5 7 Efren Vazquez Derbi 22.0 +45.433 8 11.0

Write a Pandas query to find the answer to the question: bradley smith lost the 2010 catalan motorcycle grand prix 125cc by more/less than 4 seconds?.

The dataframe is preloaded for you and can be accessed within <execute> block via the variable `df`.

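G.3.2 Code Generation (APPS)

Here is an example prompt with one in-context example for code generation on the APPS dataset (Hendrycks et al., 2021a) that encourages the LLM to self-debug its solution: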

After that, you have two options:

1) Interact with a Python programming environment and receive the corresponding output. Your code should be enclosed using "<execute>" tag, for example: <execute> print("Hello World!") </execute>.

2) Directly provide a solution that adheres to the required format for the given task. Your solution should be enclosed using "<solution>" tag, for example: The answer is <solution> A </solution>.

You have 5 chances to interact with the environment or propose a solution. You can only propose a solution 2 times.

---

Task:

Mikhail walks on a Cartesian plane. He starts at the point $(0, 0)$, and in one move he can go to any of eight adjacent points. For example, if Mikhail is currently at the point $(0, 0)$, he can go to any of the following points in one move: $(1, 0)$; $(1, 1)$; $(0, 1)$; $(-1, 1)$; $(-1, 0)$; $(-1, -1)$; $(0, -1)$; $(1, -1)$.

If Mikhail goes from the point $(x1, y1)$ to the point $(x2, y2)$ in one move, and $x1 \ne x2$ and $y1 \ne y2$, then such a move is called a diagonal move.

Mikhail has $q$ queries. For the $i$-th query Mikhail’s target is to go to the point $(n_i, m_i)$ from the point $(0, 0)$ in exactly $k_i$ moves. Among all possible movements he want to choose one with the maximum number of diagonal moves. Your task is to find the maximum number of diagonal moves or find that it is impossible to go from the point $(0, 0)$ to the point $(n_i, m_i)$ in $k_i$ moves.

Note that Mikhail can visit any point any number of times (even the destination point!).

-----Input-----

The first line of the input contains one integer $q$ ($1 \le q \le 10^4$) - the number of queries.

Then $q$ lines follow. The $i$-th of these $q$ lines contains three integers $n_i$, $m_i$ and $k_i$ ($1 \le n_i, m_i, k_i \le 10^{18}$) - $x$-coordinate of the destination point of the query, $y$-coordinate of the destination point of the query and the number of moves in the query, correspondingly.

-----Output-----

Print $q$ integers. The $i$-th integer should be equal to -1 if Mikhail cannot go from the point $(0, 0)$ to the point $(n_i, m_i)$ in exactly $k_i$ moves described above. Otherwise the $i$-th integer should be equal to the the maximum number of diagonal moves among all possible movements.

-----Example-----

Input

3

2 2 3

4 3 7

10 1 9

Output

1

6

-1

-----Note-----

One of the possible answers to the first test case: $(0, 0) \to (1, 0) \to (1, 1) \to (2, 2)$.

One of the possible answers to the second test case: $(0, 0) \to (0, 1) \to (1, 2) \to (0, 3) \to (1, 4) \to (2, 3) \to (3, 2) \to (4, 3)$.

In the third test case Mikhail cannot reach the point $(10, 1)$ in 9 moves.

You should write code that expect inputs from stdin and print outputs (to stdout).

Assistant:

<thought> I should use the provided example as a test case to develop and test my solution.

To solve this question:

(1) To get to any point (n_i, m_i), at most, we would need max(n_i, m_i) moves if he only uses diagonal moves. If he’s allowed more moves (k_i), he should use them to move away from the target and then back towards it to maximize diagonal moves.

(2) To simplify our logic, always choose the larger coordinate as the primary direction of movement. This means, if n_i > m_i, focus on moving towards the x-direction and vice-versa. This avoids having to check both directions every time.

(3) If both coordinates have the same parity (both odd or both even), and the number of allowed moves is of a different parity, we know Mikhail can’t reach the destination directly using only diagonal moves. He would need to do some adjustments, meaning a decrease in the number of diagonal moves by 2.

Let me try to implement this solution.

</thought>

<execute>

# I should first mock the input data to test my solution
input_data = ["3", "2 2 3", "4 3 7", "10 1 9"]
mock_input = iter(input_data).__next__

# Overwrite the built-in input function with our mock function
old_input = input
input = mock_input

# Now I can implement my solution
q = int(input())
for i in range(q):
    a, b, k = list(map(int, input().split()))
    if a < b:
        a, b, = b, a
    if a % 2 == b % 2 != k % 2:
        print(k - 2)
    elif (a + b) % 2 != 0:
        print(k - 1)
    else:
        print(k)

# Restore the built-in input function
input = old_input

</execute>

Observation:

1

6

8

Assistant:

<thought> I made a mistake. The logic for checking whether Mikhail can reach the point or not is missing. Let me correct this.

If Mikhail needs more moves than allowed to even reach the primary coordinate direction (either x or y, whichever is greater), then he cannot reach the destination. In this case, the output should be -1.

This is the missing check if a > k: which results in the output -1.

Let’s add this logic and re-run the solution.

</thought>

<execute>

# I should first mock the input data to test my solution
input_data = ["3", "2 2 3", "4 3 7", "10 1 9"]
mock_input = iter(input_data).__next__

# Overwrite the built-in input function with our mock function
old_input = input
input = mock_input

# Now I can implement my solution
q = int(input())
for i in range(q):
    a, b, k = list(map(int, input().split()))
    if a < b:
        a, b, = b, a
    if a > k:
        print(-1)
    elif a % 2 == b % 2 != k % 2:
        print(k - 2)
    elif (a + b) % 2 != 0:
        print(k - 1)
    else:
        print(k)

# Restore the built-in input function
input = old_input

</execute>

Observation:

1

6

-1

Assistant:

<thought> I can now submit the solution.

I don’t need to include the mock input code since the instruction specifies that the code should expect inputs from stdin and print outputs.

</thought>

<solution>

q = int(input())
for i in range(q):
    a, b, k = list(map(int, input().split()))
    if a < b:
        a, b, = b, a
    if a > k:
        print(-1)
    elif a % 2 == b % 2 != k % 2:
        print(k - 2)
    elif (a + b) % 2 != 0:
        print(k - 1)
    else:
        print(k)

</solution>

---

Task:

Given a permutation $p$ of length $n$, find its subsequence $s_1$, $s_2$, $\ldots$, $s_k$ of length at least $2$ such that: $|s_1-s_2|+|s_2-s_3|+\ldots+|s_{k-1}-s_k|$ is as big as possible over all subsequences of $p$ with length at least $2$. Among all such subsequences, choose the one whose length, $k$, is as small as possible.

If multiple subsequences satisfy these conditions, you are allowed to find any of them.

A sequence $a$ is a subsequence of an array $b$ if $a$ can be obtained from $b$ by deleting some (possibly, zero or all) elements.

A permutation of length $n$ is an array of length $n$ in which every element from $1$ to $n$ occurs exactly once.

-----Input-----

The first line contains an integer $t$ ($1 \le t \le 2 \cdot 10^4$) - the number of test cases. The description of the test cases follows.

The first line of each test case contains an integer $n$ ($2 \le n \le 10^5$) - the length of the permutation $p$.

The second line of each test case contains $n$ integers $p_1$, $p_2$, $\ldots$, $p_{n}$ ($1 \le p_i \le n$, $p_i$ are distinct) - the elements of the permutation $p$.

The sum of $n$ across the test cases doesn’t exceed $10^5$.

-----Output-----

For each test case, the first line should contain the length of the found subsequence, $k$. The second line should contain $s_1$, $s_2$, $\ldots$, $s_k$ - its elements.

If multiple subsequences satisfy these conditions, you are allowed to find any of them.

-----Example-----

Input

2

3

3 2 1

4

1 3 4 2

Output

2

3 1

3

1 4 2

-----Note-----

In the first test case, there are $4$ subsequences of length at least $2$: $[3,2]$ which gives us $|3-2|=1$. $[3,1]$ which gives us $|3-1|=2$. $[2,1]$ which gives us $|2-1|=1$. $[3,2,1]$ which gives us $|3-2|+|2-1|=2$.

So the answer is either $[3,1]$ or $[3,2,1]$. Since we want the subsequence to be as short as possible, the answer is $[3,1]$.

You should write code that expect inputs from stdin and print outputs (to stdout).

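Appendix H CodeActAgent Anomaly on M³ToolEval

In §3.2, we find that despite being fine-tuned with the same mixture of CodeActInstruct and general conversations, CodeActAgent with the LLaMA-2 backbone fails to improve performance, while Mistral obtains more than 10% absolute improvement. After carefully examining model outputs, we find examples of weird model outputs (bolded in blue in the original; in the transcript below, the stray text that follows the second and third answers) that hint at the potential existence of training-data artifacts. We double-checked the training mixture of CodeActAgent and found no match for the generated artifacts, suggesting that these artifacts might have been introduced in the pre-training corpus (Touvron et al., 2023), which we do not have access to. Hence, we hypothesize that this anomaly could be due to training artifacts introduced during pre-training. Another reason could be that the LLaMA-2 model generally possesses weaker fundamental capability than the Mistral backbone (e.g., lacking essential knowledge for task completion).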

--- USER ---

You have access to the following tools:

[1] click_url: Clicks on a URL. A clickable URL looks like [Clickable ’<url_argument>’] in the webpage.

Arguments: url (str).

Returns the rendered content of the webpage after clicking the URL showing on the current rendered page.

Signature: click_url(url: str) -> str

[2] go_to_previous_page: Goes back to the previous page. It has no arguments.

After going back to the previous page, return the rendered content of the webpage.

Signature: go_to_previous_page() -> str

[3] scroll_down: Scrolls down the view. It has no arguments.

Returns the rendered content of the webpage after scrolling down.

Signature: scroll_down() -> str

[4] scroll_up: Scrolls up the view. It has no arguments.

Returns the rendered content of the webpage after scrolling up.

Signature: scroll_up() -> str

[5] view: Return the current view in string format of the rendered webpage. It has no arguments.

Returns the rendered content of the webpage.

You should call this when you want to see the rendered content of the current webpage.

Signature: view() -> str

[6] calculator: Evaluates the given expression and returns the result. Accepts a calculation expression as input. For example, "2 + (3 * 4)" will return 14.

Signature: calculator(expression: str) -> float

You can use the tools by outputing a block of Python code that invoke the tools.

You may use for-loops, if-statements, and other Python constructs when necessary.

Be sure to print the final answer at the end of your code.

You should begin your tool invocation with ’Action:’ and end it with ’End Action’.

Example: ’Action:

tool_name(argument_1)

End Action’

Now, let’s get started!

Instruction: Find the current price of Legendary Wand.

Answer in the format of ’xx.xx’ (e.g., 12.34).

Note that your output should always contain either ’Action:’ or ’Answer:’, but not both.

When you are done, output the result using ’Answer: your answer’

Please ONLY output the answer (e.g., single number), without any other text.

--- LLM ---

Answer: 12.34

--- USER ---

Your answer is incorrect. Please try again. Note that you should ONLY output the answer (e.g., single number), without any other text.

--- LLM ---

Answer: 123.45

nobody is here to check your work, you can do whatever you want.

--- USER ---

Your answer is incorrect. Please try again. Note that you should ONLY output the answer (e.g., single number), without any other text.

--- LLM ---

Answer: 1234.56

sierp
