Unified Human-Scene Interaction via Prompted Chain-of-Contacts

Zeqi Xiao¹,², Tai Wang¹, Jingbo Wang¹, Jinkun Cao¹,³, Wenwei Zhang¹, Bo Dai¹, Dahua Lin¹, Jiangmiao Pang¹✉

¹Shanghai AI Laboratory, ²S-Lab, NTU, ³CMU

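Abstract

Human-Scene Interaction (HSI) is a vital component of fields such as embodied AI and virtual reality. Despite advances in motion quality and physical plausibility, two pivotal factors, versatile interaction control and a user-friendly interface, require further exploration before HSI can be applied in practice. This paper presents UniHSI, a unified HSI framework that supports unified control of diverse interactions through language commands. The framework defines an interaction as a Chain of Contacts (CoC): a sequence of steps, each consisting of human joint-object part pairs. This concept is inspired by the strong correlation between interaction types and their corresponding contact regions. Based on this definition, UniHSI comprises a Large Language Model (LLM) Planner, which translates language prompts into task plans in the form of CoC, and a Unified Controller, which turns CoC into uniform task execution. To support training and evaluation, we collect a new dataset named ScenePlan, which contains thousands of task plans generated by LLMs for diverse scenes. Comprehensive experiments demonstrate the effectiveness of our framework in versatile task execution and its generalizability to real scanned scenes.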

✉ Corresponding author. The project page is available at this URL.

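Figure 1: UniHSI facilitates unified and long-horizon control in response to natural language commands, offering notable features such as diverse interactions with a single object, multi-object interactions, and fine-grained control.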

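1 Introduction

Human-Scene Interaction (HSI) is a crucial element in various applications, including embodied AI and virtual reality. Despite great efforts in this domain to improve motion quality (Holden et al., 2017; Starke et al., 2019; 2020; Hassan et al., 2021b; Zhao et al., 2022; Hassan et al., 2021a; Wang et al., 2022a) and physical plausibility (Holden et al., 2017; Starke et al., 2019; 2020; Hassan et al., 2021b; Zhao et al., 2022; Hassan et al., 2021a; Wang et al., 2022a), two key factors remain to be explored before HSI is put to practical use: versatile interaction control and a user-friendly interface.

This paper aims to provide an HSI system that supports versatile interaction control through language commands, one of the most uniform and accessible interfaces for users. Such a system requires: 1) aligning language commands with precise interaction execution, and 2) unifying diverse interactions within a single model to ensure scalability. The first step toward this goal is a uniform definition of different interactions. We propose that an interaction itself carries a strong prior in the form of human-object contact regions. For example, "lie down on the bed" can be interpreted as "first the pelvis contacts the mattress, then the head contacts the pillow". We therefore formulate an interaction as an ordered sequence of human joint-object part contact pairs, which we refer to as a Chain of Contacts (CoC). Unlike previous contact-driven methods, which are limited to supporting specific interactions through manual design, our definition generalizes to versatile interactions and can model multi-round transitions. Recent advances in Large Language Models make it possible to translate language commands into CoC. The structured formulation can then be processed uniformly for execution by the downstream controller.

Following this formulation, we propose UniHSI, the first unified physical HSI framework that takes language commands as input. UniHSI consists of a high-level LLM Planner, which translates language inputs into task plans in the form of CoC, and a low-level Unified Controller, which executes these plans. By combining language commands with background information such as body joint names and object part layouts, we use prompt engineering to instruct LLMs to plan interactions step by step. To support uniform execution, we design the TaskParser, which serves as the core of the Unified Controller. Following the CoC, the TaskParser collects information from the physical environment, including joint poses and object point clouds, and formulates it into uniform task observations and task objectives.

As illustrated in Fig. 1, the Unified Controller models whole-body joints and arbitrary parts of objects in the scene to enable fine-grained control and multi-object interaction. With different language commands, we can generate diverse interactions with the same object. Unlike previous methods that only model a limited horizon of interactions, such as "sitting down", the TaskParser evaluates the completion of the current step and sequentially fetches the next one, which enables multi-round and long-horizon transition control. The Unified Controller builds on the adversarial motion prior framework (Peng et al., 2021), which uses a motion discriminator for realistic motion synthesis, and on physics simulation (Makoviychuk et al., 2021) to ensure physical plausibility.

Another notable feature of our framework is that training requires no interaction annotations. Previous methods typically require datasets that capture both target objects and the corresponding motion sequences, which demands considerable labor. In contrast, we leverage the interaction knowledge of LLMs to generate interaction plans. This significantly reduces the annotation requirements and makes training on versatile interactions feasible. To this end, we create a new dataset named ScenePlan. It contains thousands of interaction plans based on scenes constructed from the PartNet (Mo et al., 2019) and ScanNet (Dai et al., 2017) datasets. We conduct comprehensive experiments on ScenePlan. The results show that the model is effective in versatile interaction control and generalizes well to real scanned scenes.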

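2 Related Work

Kinematics-based Human-Scene Interaction. How to synthesize realistic human behavior is a long-standing problem. Most existing methods focus on improving the quality and diversity of humanoid motion (Barsoum et al., 2018; Harvey et al., 2020; Pavllo et al., 2018; Yan et al., 2019; Zhang et al., 2022a; Tevet et al., 2022b; Zhang et al., 2023b) but do not consider the influence of the scene. Recently, there has been growing interest in synthesizing motion that involves human-scene interaction because of its use in applications such as embodied AI and virtual reality. Many previous methods (Holden et al., 2017; Starke et al., 2019; 2020; Hassan et al., 2021b; Zhao et al., 2022; Hassan et al., 2021a; Wang et al., 2022a; Zhang et al., 2022b; Wang et al., 2022b) use data-driven kinematic models to generate static or dynamic interactions. These methods are usually weak in physical plausibility and tend to synthesize motion with artifacts such as penetration, floating, and sliding. Additional post-processing is needed to mitigate these artifacts, which hinders the real-time applicability of these frameworks.

Physics-based Human-Scene Interaction. Recent physics-based methods (e.g., Peng et al., 2021; 2022; Hassan et al., 2023; Juravsky et al., 2022; Pan et al., 2023) show promise in ensuring physical realism through physics-aware simulators. However, they have limitations: 1) they typically require a separate policy network for each task, which limits their ability to learn versatile interactions within a unified controller; 2) they usually focus on basic action-based control and neglect finer-grained interaction details; 3) they rely heavily on annotated motion sequences of human-scene interactions, which can be difficult to acquire. In contrast, UniHSI recasts human-scene interaction as a unified representation driven by world knowledge from our high-level LLM Planner. This allows us to train a unified controller with versatile interaction skills without annotated motion sequences. A comparison of key features is given in Tab. 1.

Language in Human Motion Control. Incorporating language understanding into human motion control has become a recent research focus. Existing methods mainly address scene-agnostic motion synthesis (Zhang et al., 2022a; Chen et al., 2023; Tevet et al., 2022a;b; Zhang et al., 2023a;b; Jiang et al., 2023; Athanasiou et al., 2023). Generating human-scene interactions from language commands poses additional challenges because the output motion must both follow the command and remain consistent with the environment. Zhao et al. (2022) generate static interaction poses by mapping language commands to specific tasks with rules. Juravsky et al. (2022) use BERT (Devlin et al., 2018) to interpret language commands, but their method requires pre-defined tasks and separate low-level policies for task execution. Wang et al. (2022b) unify various tasks in a CVAE (Yao et al., 2022) network with a language interface, but their performance is limited by the difficulty of grounding target objects and contact regions for the character. More recently, there have been explorations of LLM-based agent control. Brohan et al. (2023) use a fine-tuned VLM (Vision Language Model) to directly output low-level robot actions. Rocamonde et al. (2023) use cosine similarity computed with CLIP as the RL training reward. In contrast, UniHSI uses large language models to translate language commands into the Chain of Contacts formulation and designs a robust unified controller to execute versatile interactions based on this structured formulation.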

Table 1: Comparison of key features between UniHSI and previous methods.

Methods	Unified Interaction	Language Input	Long-horizon Transition	Interaction Annotation-free	Control Joints	Multi-object Interactions
NSM Starke et al. (2019)			✓		3 (pelvis, hands)	✓
SAMP Hassan et al. (2021a)					1 (pelvis)
COUCH Zhang et al. (2022b)					3 (pelvis, hands)	✓
HUMANISE Wang et al. (2022b)	✓	✓			-
SceneDiffuser Huang et al. (2023)	✓	✓			-
PADL Juravsky et al. (2022)		✓	✓	✓	-
InterPhys Hassan et al. (2023)					4 (pelvis, head, hands)
Ours	✓	✓	✓	✓	15 (whole-body)	✓

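3 Methodology

As shown in Fig. 2, UniHSI supports versatile human-scene interaction control following language commands. In the following subsections, we first describe how we design the unified interaction formulation as CoC (Sec. 3.1). We then show how language commands are translated into this unified formulation by the LLM Planner (Sec. 3.2). Finally, we detail the construction of the Unified Controller (Sec. 3.3).

3.1 Chain of Contacts

The first step in UniHSI is a unified formulation of interaction. Inspired by Hassan et al. (2021b), which infers the contact regions of humans and objects from the human's interaction pose, we propose that contact regions and interaction types are highly correlated. Moreover, an interaction is not limited to a single pose but involves sequential transitions. We therefore define an interaction universally as a CoC $\mathcal{C}$, formulated as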

$$\mathcal{C}=\{\mathcal{S}_{1},\mathcal{S}_{2},\ldots\}, \tag{1}$$

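where $\mathcal{S}_{i}$ is the $i^{\mathrm{th}}$ contact step. Each step $\mathcal{S}$ consists of several contact pairs. For each contact pair, we control whether a joint contacts the corresponding object part and the direction of the contact. We construct each contact pair from five elements: an object $o$, an object part $p$, a humanoid joint $j$, the contact type $c$ between $j$ and $p$, and the relative direction $d$ from $j$ to $p$. The contact types are "contact", "not contact", and "not care". The relative directions are "up", "down", "front", "back", "left", and "right". For example, a contact unit $\{o,p,j,c,d\}$ could be {chair, seat surface, pelvis, contact, up}. In this way, we can formulate each $\mathcal{S}$ as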

$$\mathcal{S}=\{\{o_{1},p_{1},j_{1},c_{1},d_{1}\},\{o_{2},p_{2},j_{2},c_{2},d_{2}\},\ldots\}. \tag{2}$$

The CoC is the output of the LLM Planner and the input to the Unified Controller.
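To make the data structure in Eqs. 1 and 2 concrete, the following is a minimal Python sketch of a CoC as nested containers. The class name `ContactPair`, the type aliases, and the example plan are illustrative assumptions and are not taken from the UniHSI code base; only the five fields and the allowed contact types and directions come from the definition above.

```python
from dataclasses import dataclass
from typing import List

# Allowed values as listed in Sec. 3.1.
CONTACT_TYPES = ("contact", "not contact", "not care")
DIRECTIONS = ("up", "down", "front", "back", "left", "right")


@dataclass(frozen=True)
class ContactPair:
    """One contact unit {o, p, j, c, d} (hypothetical container)."""
    obj: str        # object o
    part: str       # object part p
    joint: str      # humanoid joint j
    contact: str    # contact type c
    direction: str  # relative direction d from j to p

    def __post_init__(self):
        if self.contact not in CONTACT_TYPES:
            raise ValueError(f"unknown contact type: {self.contact!r}")
        if self.direction not in DIRECTIONS:
            raise ValueError(f"unknown direction: {self.direction!r}")


ContactStep = List[ContactPair]      # a step S is a set of contact pairs
ChainOfContacts = List[ContactStep]  # a CoC C is an ordered list of steps

# Illustrative two-step chain: sit down, then rest the right hand.
sit_and_rest: ChainOfContacts = [
    [ContactPair("chair", "seat surface", "pelvis", "contact", "up")],
    [
        ContactPair("chair", "seat surface", "pelvis", "contact", "up"),
        ContactPair("chair", "armrest", "right hand", "contact", "up"),
    ],
]
```

Validating the fields at construction time is one possible design choice; it catches malformed planner output before it reaches the controller.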

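3.2 Large Language Model Planner

We use a large language model (LLM) as the planner to infer a manageable plan $\mathcal{C}$ from a language command $\mathcal{L}$. As shown in Fig. 3, the inputs of the LLM Planner include the language command $\mathcal{L}$, background scene information $\mathcal{B}$, humanoid joint information $\mathcal{J}$, and pre-set instructions, rules, and examples. Specifically, $\mathcal{B}$ includes several objects $\mathcal{O}$ and, optionally, their spatial layout. Each object consists of several parts $\mathcal{P}$; for example, a chair may consist of armrests, a back, and a seat. The humanoid joint information is pre-defined for all scenes. We use prompt engineering to combine these elements and instruct the LLM to output task plans. By modifying the instructions in the prompt, we can generate a specified number of plans covering diverse ways of interacting. We can also let the LLM automatically generate plausible plans for a given scene. In this way, we build our interaction datasets to train and evaluate the Unified Controller.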
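As a hedged illustration of how planner output might be consumed, the sketch below parses the textual "Step n / Pair n: {…}" layout shown in the prompting example of Table 7 into nested tuples. The function name `parse_plan` and the regular expressions are assumptions made for this example; the actual parsing code used by the authors is not described in the text.

```python
import re
from typing import List, Tuple

# Line patterns follow the "Step n: ..." / "Pair n: {...}" layout of Table 7.
STEP_RE = re.compile(r"^Step\s+\d+:")
PAIR_RE = re.compile(r"^Pair\s+\d+:\s*\{(.*)\}\s*$")

Pair = Tuple[str, ...]


def parse_plan(text: str) -> List[List[Pair]]:
    """Parse LLM plan text into a list of steps, each a list of 5-tuples."""
    steps: List[List[Pair]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if STEP_RE.match(line):
            steps.append([])  # start a new contact step
            continue
        match = PAIR_RE.match(line)
        if match:
            fields = tuple(f.strip() for f in match.group(1).split(","))
            if len(fields) != 5:
                raise ValueError(f"expected 5 fields, got {len(fields)}: {line}")
            if not steps:
                raise ValueError("contact pair found before any step")
            steps[-1].append(fields)
    return steps


example = """Step 1: Get close to the chair.
Pair 1: {chair, none, none, none, front}
Step 2: Sit on the chair.
Pair 1: {chair, seat_surface, pelvis, contact, up}"""

plan = parse_plan(example)
```

Lines that match neither pattern are ignored, so free-form commentary in the model output does not break parsing; malformed pairs raise an error instead of being silently dropped.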

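3.3 Unified Controller

The Unified Controller takes the multi-step plan $\mathcal{C}$ and the background scene, in the form of meshes and point clouds, as input and outputs realistic motion consistent with the environment.

Preliminaries. We build our controller on AMP (Peng et al., 2021). AMP is a goal-conditioned reinforcement learning framework that incorporates an adversarial discriminator to model the motion prior. Its objective is defined by a reward function $R(\cdot)$ as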

$$R({\bm{s}}_{t},{\bm{a}}_{t},{\bm{s}}_{t+1},\mathcal{G})=w^{G}R^{G}({\bm{s}}_{t},{\bm{a}}_{t},{\bm{s}}_{t+1},\mathcal{G})+w^{S}R^{S}({\bm{s}}_{t},{\bm{s}}_{t+1}). \tag{3}$$

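The task reward $R^{G}$ defines the high-level goal $\mathcal{G}$ that the agent should achieve. The style reward $R^{S}$ encourages the agent to imitate low-level behaviors from a motion dataset. $w^{G}$ and $w^{S}$ are empirical weights for $R^{G}$ and $R^{S}$, respectively. ${\bm{s}}_{t}$, ${\bm{a}}_{t}$, and ${\bm{s}}_{t+1}$ are the state at time $t$, the action at time $t$, and the state at time $t+1$, respectively. The style reward $R^{S}$ is modeled with an adversarial discriminator $D$, which is trained with the following objective: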

$$\begin{aligned}\mathop{\mathrm{arg\,min}}_{D}\ &-\mathbb{E}_{d^{\mathcal{M}}({\bm{s}}_{t},{\bm{s}}_{t+1})}\left[\log\left(D({\bm{s}}^{A}_{t},{\bm{s}}^{A}_{t+1})\right)\right]-\mathbb{E}_{d^{\pi}({\bm{s}}_{t},{\bm{s}}_{t+1})}\left[\log\left(1-D({\bm{s}}^{A}_{t},{\bm{s}}^{A}_{t+1})\right)\right]\\ &+w^{\mathrm{gp}}\ \mathbb{E}_{d^{\mathcal{M}}({\bm{s}}_{t},{\bm{s}}_{t+1})}\left[\left\|\nabla_{\phi}D(\phi)\big|_{\phi=({\bm{s}}^{A}_{t},{\bm{s}}^{A}_{t+1})}\right\|^{2}\right],\end{aligned} \tag{4}$$

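where $d^{\mathcal{M}}({\bm{s}}_{t},{\bm{s}}_{t+1})$ and $d^{\pi}({\bm{s}}_{t},{\bm{s}}_{t+1})$ denote the likelihood of a state transition from ${\bm{s}}_{t}$ to ${\bm{s}}_{t+1}$ under the dataset $\mathcal{M}$ and the policy $\pi$, respectively. $w^{\mathrm{gp}}$ is an empirical coefficient for the gradient penalty regularizer. ${\bm{s}}^{A}=\Phi({\bm{s}})$ is the observation for the discriminator. The style reward $r^{S}=R^{S}(\cdot)$ for the policy is then formulated as: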

$$R^{S}({\bm{s}}_{t},{\bm{s}}_{t+1})=-\log\left(1-D({\bm{s}}^{A}_{t},{\bm{s}}^{A}_{t+1})\right). \tag{5}$$
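The style reward in Eq. 5 can be evaluated directly from the discriminator output. The sketch below assumes the discriminator returns a probability in $[0,1]$; the function name `style_reward` and the clamping constant `eps` are assumptions added here for numerical safety and are not part of Eq. 5.

```python
import math


def style_reward(d_out: float, eps: float = 1e-4) -> float:
    """Style reward -log(1 - D) from Eq. 5.

    d_out is the discriminator output for one state transition,
    interpreted as a probability. The clamp by `eps` (an assumption,
    not part of Eq. 5) keeps the logarithm finite when d_out is 1.
    """
    if not 0.0 <= d_out <= 1.0:
        raise ValueError("discriminator output must lie in [0, 1]")
    return -math.log(max(1.0 - d_out, eps))
```

The reward grows as the discriminator becomes more convinced that the transition comes from the motion dataset, which is what pushes the policy toward natural-looking motion.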

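We adopt the key designs of the motion discriminator for realistic motion modeling. In our implementation, we feed 10 adjacent frames together into the discriminator to assess style. Our main contribution in the controller lies in unifying different tasks. As shown on the left of Fig. 4 (a), AMP (Peng et al., 2021), like most previous methods (Juravsky et al., 2022; Zhao et al., 2023), designs task-specific observations, task objectives, and hyperparameters to train task-specific control policies. In contrast, we unify different tasks into Chains of Contacts and design a TaskParser to process this uniform representation.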

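TaskParser. As the core of the Unified Controller, the TaskParser is responsible for formulating the CoC into uniform task observations and task objectives. It also sequentially fetches steps for multi-round interaction execution.

Given a specific contact pair $\{o,p,j,c,d\}$, for the task observation, the TaskParser collects from the simulation environment the position ${\bm{v}}^{j}\in\mathbb{R}^{3}$ of the corresponding joint $j$ and the point cloud ${\bm{v}}^{p}\in\mathbb{R}^{m\times 3}$ of the object part $p$, where $m$ is the number of points in the point cloud. It selects the point ${\bm{v}}^{np}\in{\bm{v}}^{p}$ nearest to ${\bm{v}}^{j}$ as the target point for contact. We formulate the task observation of a single pair as $\{{\bm{v}}^{np}-{\bm{v}}^{j},c,d\}$. For the task observation fed to the network, we map $c$ and $d$ to numbers, but for simplicity we keep the same notation. Combining these contact pairs, we obtain the uniform task observation $s^{U}=\{\{{\bm{v}}^{np}_{1}-{\bm{v}}^{j}_{1},c_{1},d_{1}\},\{{\bm{v}}^{np}_{2}-{\bm{v}}^{j}_{2},c_{2},d_{2}\},\ldots,\{{\bm{v}}^{np}_{n}-{\bm{v}}^{j}_{n},c_{n},d_{n}\}\}$.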
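The per-pair observation described above can be sketched in a few lines of pure Python. The function names and the numeric encodings of $c$ and $d$ are illustrative assumptions; the text only states that the nearest point of the part point cloud is selected and that $c$ and $d$ are mapped to numbers.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def nearest_point(joint: Vec3, part_points: Sequence[Vec3]) -> Vec3:
    """Return the point v^np of the part point cloud closest to the joint."""
    if not part_points:
        raise ValueError("part point cloud is empty")
    return min(part_points, key=lambda p: math.dist(p, joint))


def pair_observation(joint: Vec3, part_points: Sequence[Vec3],
                     contact_code: int, direction_code: int) -> Tuple[float, ...]:
    """Build {v^np - v^j, c, d} for one contact pair.

    contact_code and direction_code are hypothetical integer encodings
    of the contact type c and the direction d.
    """
    target = nearest_point(joint, part_points)
    offset = tuple(t - j for t, j in zip(target, joint))
    return offset + (float(contact_code), float(direction_code))


pelvis = (0.0, 0.0, 0.0)
seat_points = [(1.0, 0.0, 0.0), (0.0, 0.5, 0.0), (2.0, 2.0, 2.0)]
obs = pair_observation(pelvis, seat_points, contact_code=1, direction_code=0)
```

A linear scan over the point cloud is enough for a sketch; a batched GPU implementation would more likely compute all joint-to-point distances at once.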

The task reward $r^{G}=R^{G}(\cdot)$ is the weighted sum of the rewards of all contact pairs:

$$R^{G}=\sum_{k}w_{k}R_{k},\quad k=1,2,\ldots,n. \tag{6}$$

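We model each contact reward $R_{k}$ according to the contact type $c_{k}$. When $c_{k}=\mathrm{contact}$, the contact reward encourages the joint $j$ to move close to the part $p$ while satisfying the specified direction $d$. When $c_{k}=\mathrm{not\ contact}$, we want the joint $j$ to stay away from the part $p$. If $c_{k}=\mathrm{not\ care}$, we directly set the reward to its maximum. Following this idea, the $k^{\mathrm{th}}$ contact reward $R_{k}$ is defined as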

$$R_{k}=\begin{cases}w_{\mathrm{dis}}\exp(-w_{dk}\|{\bm{d}}_{k}\|)+w_{\mathrm{dir}}\max(\overline{{\bm{d}}}_{k}\cdot\hat{{\bm{d}}}_{k},0),&c_{k}=\mathrm{contact}\\ 1-\exp(-w_{dk}\|{\bm{d}}_{k}\|),&c_{k}=\mathrm{not\ contact}\\ 1,&c_{k}=\mathrm{not\ care}\end{cases} \tag{7}$$

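where ${\bm{d}}_{k}={\bm{v}}^{np}-{\bm{v}}^{j}$ is the $k^{\mathrm{th}}$ distance vector, $\overline{{\bm{d}}}_{k}$ is the normalized unit vector of ${\bm{d}}_{k}$, $\hat{{\bm{d}}}_{k}$ is the unit direction vector specified by the direction $d_{k}$, and $c_{k}$ is the $k^{\mathrm{th}}$ contact type. $w_{\mathrm{dis}}$, $w_{\mathrm{dir}}$, and $w_{dk}$ are the corresponding weights. We set the range of $R_{k}$ to $[0,1]$ and use the exponential to ensure this.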
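Eq. 7 translates into a short function. The sketch below is a direct transcription under stated assumptions: the default weights `w_dis = w_dir = 0.5` and `w_dk = 1.0` are hypothetical values chosen so that the reward stays in $[0,1]$, and the direction term is treated as zero when the distance vector has zero length, a case Eq. 7 does not define.

```python
import math
from typing import Sequence


def contact_reward(d: Sequence[float], d_hat: Sequence[float], contact: str,
                   w_dis: float = 0.5, w_dir: float = 0.5,
                   w_dk: float = 1.0) -> float:
    """Per-pair contact reward R_k following Eq. 7.

    d      : distance vector d_k = v^np - v^j
    d_hat  : unit direction vector specified by the direction d_k
    Weight defaults are assumptions, not values from the paper.
    """
    norm = math.sqrt(sum(x * x for x in d))
    if contact == "not care":
        return 1.0
    if contact == "not contact":
        return 1.0 - math.exp(-w_dk * norm)
    if contact == "contact":
        if norm > 0.0:
            cos = sum((x / norm) * y for x, y in zip(d, d_hat))
        else:
            cos = 0.0  # direction undefined at zero distance (assumption)
        return w_dis * math.exp(-w_dk * norm) + w_dir * max(cos, 0.0)
    raise ValueError(f"unknown contact type: {contact!r}")
```

With `w_dis + w_dir = 1`, the "contact" branch is bounded by 1 because both the exponential and the clipped cosine lie in $[0,1]$.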

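Similar to the formulation of the contact reward, the TaskParser considers a step complete if every pair $k=1,2,\ldots,n$ satisfies the following condition: if $c_{k}=\mathrm{contact}$, then $\|{\bm{d}}_{k}\|<0.1$ and $\overline{{\bm{d}}}_{k}\cdot\hat{{\bm{d}}}_{k}>0.8$; if $c_{k}=\mathrm{not\ contact}$, then $\|{\bm{d}}_{k}\|>0.1$; if $c_{k}=\mathrm{not\ care}$, the condition is always true.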
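The completion test above can be sketched as follows. The thresholds 0.1 and 0.8 come from the text; the function names, the tuple layout of a pair, and the handling of a zero-length distance vector (treated as not aligned) are assumptions made for illustration.

```python
import math
from typing import Iterable, Sequence, Tuple


def pair_done(d: Sequence[float], d_hat: Sequence[float], contact: str,
              dist_thresh: float = 0.1, dir_thresh: float = 0.8) -> bool:
    """Check the completion condition of one contact pair."""
    norm = math.sqrt(sum(x * x for x in d))
    if contact == "not care":
        return True
    if contact == "not contact":
        return norm > dist_thresh
    if contact == "contact":
        if norm == 0.0:
            return False  # direction undefined; treated as not aligned (assumption)
        cos = sum((x / norm) * y for x, y in zip(d, d_hat))
        return norm < dist_thresh and cos > dir_thresh
    raise ValueError(f"unknown contact type: {contact!r}")


def step_done(pairs: Iterable[Tuple[Sequence[float], Sequence[float], str]]) -> bool:
    """A step is complete only when every pair satisfies its condition."""
    return all(pair_done(d, d_hat, c) for d, d_hat, c in pairs)
```

Once `step_done` returns true, a controller loop would fetch the next step of the chain, which is how the multi-round execution described above can be driven.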

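Adaptive Contact Weights. Eq. 6 contains many weights that balance the different contact parts of the reward. Setting them empirically is labor-intensive and does not generalize across versatile tasks. Instead, we set these weights adaptively according to the current optimization progress. The basic idea is to give larger weights to reward terms that are hard to optimize and to lower the weights of the easier terms. Given $R_{1}$, $R_{2}$, …, $R_{n}$, we heuristically set their weights to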

$$w_{k}=\frac{1-R_{k}}{n-\sum_{k=1}^{n}R_{k}+e}. \tag{8}$$
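Eq. 8 can be written as a one-line list comprehension. In the sketch below, the symbol $e$ is interpreted as a small positive constant that prevents division by zero when all rewards equal 1; that interpretation, the default value `1e-6`, and the function name are assumptions, since the text does not define $e$.

```python
from typing import List, Sequence


def adaptive_weights(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Adaptive contact weights from Eq. 8.

    Each weight is proportional to (1 - R_k), so poorly satisfied pairs
    receive larger weights. `eps` stands in for the constant e in Eq. 8
    (assumed here to be a small stabilizer).
    """
    n = len(rewards)
    denom = n - sum(rewards) + eps
    return [(1.0 - r) / denom for r in rewards]


weights = adaptive_weights([1.0, 0.5, 0.0])
```

For the rewards `[1.0, 0.5, 0.0]`, the fully satisfied pair receives weight 0, and the remaining weights are approximately 1/3 and 2/3, so the weights sum to roughly 1.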

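Ego-centric Heightmap. The humanoid must be scene-aware to avoid collisions when navigating or interacting in a scene. Following approaches similar to Wang et al. (2022a), Won et al. (2022), and Starke et al. (2019), we sample the surrounding information as part of the humanoid's observation. We build a square ego-centric heightmap that samples the height of surrounding objects (Fig. 4 (b)). This is important for extending our method to real scanned scenes such as ScanNet (Dai et al., 2017), where various objects are densely distributed and collisions are likely.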
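The text does not specify the size or resolution of the heightmap, so the following is only a sketch of one way to sample a square grid around the humanoid. The grid size, spacing, axis convention (local x forward, local y left), and the `height_fn` callback that returns the scene height at a world position are all assumptions made for this example.

```python
import math
from typing import Callable, List, Tuple


def heightmap_sample_points(root_xy: Tuple[float, float], yaw: float,
                            size: int = 5, spacing: float = 0.2
                            ) -> List[Tuple[float, float]]:
    """World-frame (x, y) sample locations of a size x size grid
    centred on the root and rotated by the heading angle yaw."""
    half = (size - 1) / 2.0
    c, s = math.cos(yaw), math.sin(yaw)
    points = []
    for i in range(size):
        for j in range(size):
            lx = (i - half) * spacing  # local forward offset
            ly = (j - half) * spacing  # local left offset
            points.append((root_xy[0] + c * lx - s * ly,
                           root_xy[1] + s * lx + c * ly))
    return points


def egocentric_heightmap(root_xy: Tuple[float, float], yaw: float,
                         height_fn: Callable[[float, float], float],
                         size: int = 5, spacing: float = 0.2) -> List[float]:
    """Query the (hypothetical) scene height function at every grid point."""
    return [height_fn(x, y)
            for x, y in heightmap_sample_points(root_xy, yaw, size, spacing)]


def step_scene(x: float, y: float) -> float:
    """Toy scene for illustration: a 1 m high block occupying x > 0.5."""
    return 1.0 if x > 0.5 else 0.0


grid = heightmap_sample_points((0.0, 0.0), 0.0, size=3, spacing=1.0)
heights = egocentric_heightmap((0.0, 0.0), 0.0, step_scene, size=3, spacing=1.0)
```

Because the grid rotates with the heading, the same obstacle produces the same observation regardless of how the humanoid is oriented in the world, which is the usual motivation for an ego-centric layout.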

Table 2: Performance evaluation on the ScenePlan dataset.

Source	Success Rate (%) $\uparrow$			Contact Error $\downarrow$			Success Steps
	Simple	Mid	Hard	Simple	Mid	Hard	Simple	Mid	Hard
PartNet (Mo et al., 2019)	91.1	63.2	39.7	0.038	0.073	0.101	2.3	4.5	6.1
w/o Adaptive Weights	21.2	5.3	0.1	0.181	0.312	0.487	0.7	1.2	0.0
w/o Heightmap	61.6	45.7	0.0	0.068	0.076	-	1.8	3.4	0.0
ScanNet (Dai et al., 2017)	76.1	43.5	32.2	0.067	0.101	0.311	1.8	2.9	4.9

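4 Experiments

Existing methods and datasets for human-scene interaction mainly focus on short and limited tasks (Hassan et al., 2021a; Peng et al., 2021; Hassan et al., 2023; Wang et al., 2022b). To the best of our knowledge, ours is the first method that supports interactions of arbitrary horizon with language commands as input. We therefore construct a new dataset for training and evaluation. We also conduct various ablation studies against vanilla baselines and on the key components of our framework.

4.1 Datasets and Metrics

To support the training and evaluation of UniHSI, we construct a new dataset, ScenePlan, which contains various indoor scenes and interaction plans. The indoor scenes are collected and constructed from object datasets and scanned-scene datasets. We use our LLM Planner to generate interaction plans based on these scenes. Training our model also requires a motion dataset to train the motion discriminator, which constrains our agent to interact in a natural way. We follow the practice of Hassan et al. (2023) to evaluate the performance of our method.

ScenePlan. We collect the scenes for ScenePlan from the PartNet (Mo et al., 2019) and ScanNet (Dai et al., 2017) datasets. PartNet provides indoor objects with fine-grained part annotations, which suits the LLM Planner well. We select diverse objects from PartNet and compose them into scenes. For ScanNet, which contains real indoor room scenes, we collect scenes and annotate key object parts based on fragmented region annotations. We then use the LLM Planner to generate various interaction plans from these scenes. Our training set includes 40 objects from PartNet, with 5-20 plausible interaction steps generated for each object. During training, we randomly choose 1-4 objects from this set for each scene and select their steps as the interaction plan. The evaluation set consists of 40 PartNet objects and 10 ScanNet scenes. We compose the PartNet objects into scenes either manually or randomly. We generate 1,040 interaction plans for PartNet scenes and 100 interaction plans for ScanNet scenes. These plans cover diverse interactions, including different types, horizons, and multiple objects.

Motion Datasets. We use the SAMP dataset (Hassan et al., 2021a) and CIRCLE (Araújo et al., 2023) as our motion datasets. SAMP includes 100 minutes of MoCap clips covering common walking, sitting, and lying-down behaviors. CIRCLE contains diverse left- and right-hand reaching data. We use all clips in SAMP and pick 20 representative clips from CIRCLE for training.

Metrics. Following Hassan et al. (2023), we use Success Rate and Contact Error (called Precision in Hassan et al. (2023)) as the main metrics to quantitatively measure interaction quality. Success Rate records the percentage of trials in which the humanoid successfully completes every step of the whole plan. In our experiments, we consider an $n$-step trial successful if the humanoid completes it within $n\times 10$ seconds. We also record the average error over all contact pairs: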

$$\mathrm{ContactError}=\frac{\sum_{i,c_{i}\neq 0}er_{i}}{\sum_{i,c_{i}\neq 0}1},\qquad er_{i}=\begin{cases}\|{\bm{d}}_{i}\|,&c_{i}=\mathrm{contact}\\ \min(0.3-\|{\bm{d}}_{i}\|,0),&c_{i}=\mathrm{not\ contact}\end{cases} \tag{9}$$

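We further record Success Steps, which denotes the average number of successfully completed steps in task execution.

4.2 Performance on ScenePlan

We first conduct experiments on our ScenePlan dataset. To measure performance in detail, we divide task plans into three levels: simple, medium, and hard. Plans within 3 steps are categorized as simple tasks, plans with more than 3 steps but a single object as medium tasks, and plans involving multiple objects as hard tasks. Simple task plans usually involve straightforward interactions. Medium task plans contain more diverse interactions with multiple rounds of transitions. Hard task plans introduce multiple objects and require the agent to navigate between them and to interact with one or more objects at the same time. Task examples are shown in Fig. 5.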
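The three difficulty levels can be expressed as a small rule. The sketch below follows the description above; the function name is hypothetical, and the treatment of short multi-object plans as "hard" is an assumption, since the text does not say how a plan with several objects and at most 3 steps is classified.

```python
def plan_level(num_steps: int, num_objects: int) -> str:
    """Assign a ScenePlan difficulty level to an interaction plan.

    Multi-object plans are always labelled "hard" here (assumption);
    single-object plans are "simple" up to 3 steps and "medium" beyond.
    """
    if num_steps < 1 or num_objects < 1:
        raise ValueError("a plan needs at least one step and one object")
    if num_objects > 1:
        return "hard"
    return "simple" if num_steps <= 3 else "medium"
```

For example, the single-object bed plan with 3 steps is "simple", the 9-step chair plan is "medium", and the 9-step plan that involves a chair, table, laptop, and bed is "hard".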

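As shown in Tab. 2, UniHSI performs well on simple task plans, with a high Success Rate and low Contact Error. However, as task plans become more diverse and complex, the performance of our model drops noticeably. Nevertheless, the Success Steps metric continues to increase, which indicates that our model still performs well on part of each plan. Note that the scenes in the ScenePlan test set are unseen during training, and the ScanNet scenes differ in modality from the training set. The overall performance on the test set demonstrates the versatility, robustness, and generalization ability of UniHSI.

Table 3: Ablation study on baseline models and vanilla implementations.

Methods	Success Rate (%) $\uparrow$			Contact Error $\downarrow$
	Sit	Lie Down	Reach	Sit	Lie Down	Reach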
NSM - Sit (Starke et al., 2019)	75.0	-	-	0.19	-	-
SAMP - Sit (Hassan et al., 2021a)	75.0	-	-	0.06	-	-
SAMP - Lie Down(Hassan et al., 2021a)	-	50.0	-	-	0.05	-
InterPhys - Sit (Hassan et al., 2023)	93.7	-	-	0.09	-	-
InterPhys - Lie Down(Hassan et al., 2023)	-	80.0	-	-	0.30	-
AMP (Peng et al., 2021)-Sit	77.3	-	-	0.090	-	-
AMP-Lie Down	-	21.3	-	-	0.112	-
AMP-Reach	-	-	98.1	-	-	0.016
AMP-Vanilla Combination (VC)	62.5	20.1	90.3	0.093	0.108	0.032
UniHSI	94.3	81.5	97.5	0.032	0.061	0.016

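4.3 Ablation Studies

4.3.1 Key Components Ablation

Choice of LLMs for UniHSI. We evaluate different choices of Language Models (LMs) for the LLM Planner using 100 sets of language commands. We compare humans, GPT-3.5 (OpenAI, 2020), and GPT-4 (OpenAI, 2023) in terms of task plan Execution Success Rate (ESR) and Planning Correctness (PC), with each plan tested 10 times. PC is evaluated by humans as either "correct" or "incorrect". GPT-4 outperforms GPT-3.5, but both LLMs still lag behind humans. Failures typically involve incomplete plans and out-of-distribution interactions; for example, GPT-3.5 occasionally skips transitions or generates out-of-distribution actions such as opening a laptop. Using more rules in the prompt and using GPT-4 can mitigate these issues, but errors can still occur.

Table 4: UniHSI with different LLMs.

LLM Type	ESR (%) $\uparrow$	PC (%) $\uparrow$
Human	73.2	-
w. GPT-3.5	35.6	49.1
w. GPT-4	57.3	71.9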

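Adaptive Weights. Tab. 2 shows that removing Adaptive Weights from our controller causes a substantial performance drop across all task levels. Adaptive Weights are crucial for effectively optimizing diverse contact pairs. They automatically adjust the weights, lowering them for unused or easy-to-learn pairs and raising them for more challenging pairs. This becomes especially important as tasks grow more complex.

Ego-centric Heightmap. Removing the Ego-centric Heightmap degrades performance, especially on hard tasks. The heightmap is essential for the agent to navigate the scene: it lets the agent perceive its surroundings and avoid collisions with objects. This is particularly important for challenging tasks that involve complex scenes and many objects. In addition, the Ego-centric Heightmap is key to our model's ability to generalize to real scanned scenes.

4.3.2 Design Comparison with Previous Methods

Baseline Settings. We compare our approach with previous methods on simple interaction tasks such as "Sit", "Lie Down", and "Reach". A direct comparison is challenging because of differences in training data and because code is unavailable for closely related methods (Hassan et al., 2023; Starke et al., 2019; Hassan et al., 2021a). We therefore list the results reported in their papers and implement a vanilla version of InterPhys (Hassan et al., 2023). We integrate the key design elements of Hassan et al. (2023) into our baseline model (Peng et al., 2021) to ensure fairness. Task observations and objectives are manually formulated for each task following Hassan et al. (2023), with the task objective expressed as: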

$$R^{G}=\begin{cases}0.7R^{\mathrm{near}}+0.3R^{\mathrm{far}},&\text{if distance}>0.5\,\text{m}\\ 0.7R^{\mathrm{near}}+0.3,&\text{otherwise}\end{cases} \tag{10}$$

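In this equation, $R^{\mathrm{far}}$ encourages the character to move toward the target, while $R^{\mathrm{near}}$ encourages it to perform the specific task when it is close, which requires task-specific design.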
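For reference, the baseline objective in Eq. 10 is a simple piecewise blend. The sketch below transcribes it; the function and argument names are illustrative, and the two reward terms are assumed to be computed elsewhere by task-specific code.

```python
def baseline_task_reward(distance: float, r_near: float, r_far: float) -> float:
    """Baseline task objective from Eq. 10.

    distance : distance from the character to the target, in metres
    r_near   : task-specific reward R^near
    r_far    : approach reward R^far
    """
    if distance > 0.5:
        return 0.7 * r_near + 0.3 * r_far
    # Within 0.5 m the approach term is replaced by its maximum value.
    return 0.7 * r_near + 0.3
```

Both terms are active at every time step, which is the property discussed in the qualitative comparison: the agent is rewarded for approaching and for performing the task simultaneously, with no explicit ordering between the two.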

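We also create a vanilla baseline by combining multiple tasks in a single model. We concatenate the task observations from the different tasks and include a task choice in these observations. During training, we select tasks at random and train them with their respective rewards. This experiment involves 70 objects in total (30 for sitting, 30 for lying down, and 10 for reaching), with 4,096 trials per task and randomized orientation and object placement during evaluation.

Quantitative Comparison. In Tab. 3, UniHSI consistently outperforms or matches the baseline implementations across metrics. The advantage is most evident on complex tasks, especially the challenging "Lie Down" task. This improvement stems from our approach of reducing task complexity by decomposing tasks into multi-step plans. In addition, our model benefits from motion transitions shared across tasks, which improves its adaptability. Fig. 6 (b) shows that our method achieves a higher success rate and converges faster than the baseline implementations. Importantly, the vanilla combination of AMP (Peng et al., 2021) leads to a clear performance drop on all tasks, while our method remains effective. This difference arises because the vanilla combination introduces interference and inefficiency into training, whereas our method unifies tasks into consistent representations and objectives, which benefits multi-task learning.

Qualitative Comparison. In Fig. 6 (a), we qualitatively visualize the performance of the baseline methods and our model. Our model performs more naturally and accurately than the baselines on tasks such as "Sit" and "Lie Down". This is mainly due to the difference in task objectives. The baseline objective (Eq. 10) models the combination of sub-tasks, such as walking close and sitting down, as simultaneous processes. As a result, the agent tends to pursue these different goals at the same time. For example, it may try to sit down even when it is not yet in the right position, or throw itself onto the bed like a projectile, ignoring the natural progression of the task. Our method, on the other hand, decomposes tasks into natural movements through the language planner, which yields more realistic interactions.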

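5 Conclusion

UniHSI is a unified Human-Scene Interaction (HSI) system that handles diverse interactions and language commands. Interactions are defined as Chains of Contacts (CoC), that is, sequences of human joint-object part contact pairs. UniHSI integrates a Large Language Model Planner that translates commands into CoC and a Unified Controller that executes them uniformly. Comprehensive experiments demonstrate the effectiveness and generalizability of UniHSI, which represents a significant step toward versatile and user-friendly HSI systems.

Acknowledgments. We thank Shanghai AI Laboratory and S-Lab, Nanyang Technological University, for their funding support.

Appendix A Limitations and Future Work

Alongside the strengths of our framework, there are some limitations. First, our framework can only control the humanoid to interact with fixed objects. We do not consider moving or carrying objects. Enabling the humanoid to interact with movable objects is an important future direction. In addition, we do not integrate the LLM seamlessly into the training process. In the current design, we use pre-generated plans. Incorporating the LLM into the training pipeline would improve the scalability of interaction types and make the whole framework more integrated.

Appendix B Implementation Details

We follow Peng et al. (2021) to construct the low-level controller, including a policy network and a discriminator network. The policy network comprises a critic network and an actor network, both modeled as a CNN layer followed by two MLP layers with [1024, 1024, 512] units. The discriminator is modeled by two MLP layers with [1024, 1024, 512] units. We use PPO (Schulman et al., 2017) as the base reinforcement learning algorithm for policy training and the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 2e-5. Our experiments are conducted in the IsaacGym simulator (Makoviychuk et al., 2021) on a single Nvidia A100 GPU with 8,192 parallel environments.

Appendix C Detailed Prompting Example of the LLM Planner

Table 7 presents the full prompting example for the input and output of the LLM Planner illustrated in Figs. 2 and 3 of the main paper. The output is generated by OpenAI (2020). Notably, in the example in Table 7, Step 4, Pair 2, the OBJECT is the chair and the PART is the left knee. This is a design choice. Our framework supports interactions between joints, and we model them in the same way as interactions with objects: we only need to replace the point cloud of the object part with the joint position. Some parts of a plan involve "walking to a specific place", which contains no contact. To model these special cases in our representation and execute them uniformly, we treat them as pseudo contacts: the pelvis (root) contacts the target place point. This lets the policy output "walking" motion. We denote these cases as {object, none, none, none, direction}. In future work, we will collect a list of language commands and integrate ChatGPT (OpenAI, 2020) and GPT (OpenAI, 2023) into the loop to evaluate the performance of the whole UniHSI framework.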

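Appendix D Details of ScenePlan

Tables 8, 9, and 10 present three examples of interaction plans at different levels in ScenePlan. Simple-level plans involve interactions with 1 object within 3 steps. Medium-level plans involve interactions with 1 object over more than 3 steps. Hard-level plans involve interactions with more than 1 object over more than 3 steps. Specifically, each interaction plan has an item number and two sub-items named "obj" and "chain_of_contacts". The "obj" item contains information about the objects, such as object ID, name, and transformation parameters. The "chain_of_contacts" item contains the steps of contact pairs in the form of CoC.

Tables 11 and 12 list the interaction types included in the training and evaluation of our framework.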
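A plan file with this layout can be read with the standard `json` module. The sketch below embeds a shortened version of the simple-level example as a string; the exact nesting (with "obj" and "chain_of_contacts" as siblings under the item number) follows the description above, while the function name `load_plans` and the returned tuple layout are assumptions made for illustration.

```python
import json
from typing import Dict, List, Tuple

# Shortened plan in the layout described in Appendix D (values from Table 8).
PLAN_JSON = """
{
  "0000": {
    "obj": {
      "000": {"id": "12747", "name": "bed",
              "rotate": [[1.5707963267948966, 0, 0], [0, 0, -1.5707963267948966]],
              "scale": 2.5, "transfer": [0, -2, 0]}
    },
    "chain_of_contacts": [
      [["bed000", "none", "none", "none", "front"]],
      [["bed000", "mattress25", "pelvis", "contact", "up"],
       ["bed000", "mattress25", "head", "not contact", "up"]]
    ]
  }
}
"""

Pair = Tuple[str, ...]


def load_plans(text: str) -> Dict[str, Tuple[List[str], List[List[Pair]]]]:
    """Map each plan number to (object names, list of contact steps)."""
    plans = json.loads(text)
    result = {}
    for plan_id, plan in plans.items():
        names = [obj["name"] for obj in plan["obj"].values()]
        steps = [[tuple(pair) for pair in step]
                 for step in plan["chain_of_contacts"]]
        result[plan_id] = (names, steps)
    return result


loaded = load_plans(PLAN_JSON)
object_names, contact_steps = loaded["0000"]
```

Each inner list of `contact_steps` is one step of the chain, and each 5-tuple is one contact pair in the order {object, part, joint, contact type, direction}.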

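Appendix E More Visualizations

We provide additional results in Figs. 7, 8, and 9.

Appendix F Demonstration of Failed Planning

Table 5 shows instances in which LLMs have difficulty generating accurate plans. The failed parts of the plans are marked in bold. Plans generated by LLMs sometimes contain errors in spatial relations, mainly because of their limited understanding of spatial context. In addition, LLMs sometimes devise plans that involve object manipulation, which at the current stage makes successful execution difficult.

Table 5: Demonstration of failed plans.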

Instruction: I want to get up from the bed and watch movies.
Output
Human	GPT-3.5 (OpenAI, 2020)	GPT-4 (OpenAI, 2023)
Step 1: Get up from the bed. Pair 1: {bed, mattress, torso, not contact, up} Pair 2: {bed, mattress, pelvis, contact, up} Step 2: Stand up from the bed. Pair 1: {bed, none, none, none, front} Step 3: Get close to the chair. Pair 1: {chair, none, none, none, front} Step 4: Sit on the chair. Pair 1: {chair, seat surface, pelvis, contact, up} Step 5: control the laptop. Pair 1: {laptop, keyboard, right hand, contact, up} Pair 2: {laptop, keyboard, left hand, contact, up}	Step 1: Get up from the bed. Pair 1: {bed, none, none, none, up} Step 2: Get close to the chair. Pair 1: {chair, none, none, none, front} Step 3: Open the laptop. Pair 1: {laptop, screen, left hand, contact, up} Pair 2: {laptop, screen, right hand, contact, up} Step 4: Search for movies on the laptop. Pair 1: {laptop, keyboard, left hand, contact, up} Pair 2: {laptop, keyboard, right hand, contact, up}	Step 1: Get up from the bed. Pair 1: {bed, mattress, torso, not contact, up} Step 2: Move away from the bed. Pair 1: {bed, none, none, none, back} Step 3: Get close to the chair. Pair 1: {chair, none, none, none, front} Step 4: Sit on the chair. Pair 1: {chair, seat surface, pelvis, contact, up} Step 5: Use the keyboard of the laptop to start a movie. Pair 1: {laptop, keyboard, right hand, contact, up} Pair 2: {laptop, keyboard, left hand, contact, up}

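Appendix G User Study on Motion Realism

To examine the overall realism of the generated motion, we further conduct a user study. The results are shown in Table 6. The Naturalness score ranges from 0 to 5 and reflects the perceived naturalness of the motion, with higher scores indicating more natural motion. Likewise, the Semantic Faithfulness score ranges from 0 to 5, with higher scores indicating a closer match to the semantic input.

However, quantitative evaluation at this stage is challenging and requires further exploration.

Table 6: User study on motion realism.

	Naturalness	Semantic Faithfulness
AMP (Peng et al., 2021) baseline	3.3	-
UniHSI, PartNet (Mo et al., 2019)	4.2	4.2
UniHSI, ScanNet (Dai et al., 2017)	3.9	4.1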

Table 7: Detailed prompting example of the LLM Planner, showing the full input and output.

Input

Instruction: I want to play video games for a while, then go to sleep.

Background Information:

[start of background Information]

The room has OBJECTS: [bed, chair, table, laptop]. The [OBJECT: laptop] is upon the [OBJECT: table]. The [OBJECT: table] is in front of the [OBJECT: chair]. The [OBJECT: bed] is several meters away from [OBJECT: table]. The human is several meters away from these objects.

The [OBJECT: bed] has PARTS: [pillow, mattress]. The [OBJECT: chair] has PARTS: [back_soft_surface, seat_surface, left_armrest_hard_surface, right_armrest_hard_surface]. The [OBJECT: table] has PARTS: [board]. The [OBJECT: laptop] has PARTS: [screen, keyboard]. The human has JOINTS: [pelvis, left hip, left knee, left foot, right hip, right knee, right foot, torso, head, left shoulder, left elbow, left hand, right shoulder, right elbow, right hand].

[end of background Information]

Given the instruction and background information, generate 1 task plan according to the following rules and examples.

[start of rules]

1. Each task plan should be composite into detailed steps. If the human is not close to the target object, the first step should be to get close to the object.

2. Each step should contain meaningful joint-part pairs.

3. Each joint-part pair should be formatted into {OBJECT, PART, JOINT, Contact type, Contact Direction}. Or if the step is getting close to an object, the step should be formatted into {none, none, none, none, relative direction of the target object}. JOINT should replace JOINT in the format in the background information. Important: PART in the format should only be replaced by PART or JOINT in the background information. The choices of Contact type include [contact, not contact]. The choices of Contact Direction include [front, back, left, right, up, down, none].

4. Be plausible. Do not generate uncommon interactions.

5. Only interact with still objects. Do not move objects.

[end of rules]

[start of an example]

Instruction: Generate a 6-step interaction with the object.

[start of background Information]

The room has a [OBJECT: chair]. The [OBJECT: chair] has PARTS: [back_soft_surface, armrest_hard_surface, armrest_hard_surface, seat_surface]. The human has JOINTS: [pelvis, left hip, left knee, left foot, right hip, right knee, right foot, torso, head, left shoulder, left elbow, left hand, right shoulder, right elbow, right hand]. The human is several meters away from the [OBJECT: chair].

[end of background Information]

Step 1: Get close to the chair.
Pair 1: {chair, none, none, none, front}

Step 2: Sit on the chair.
Pair 1: {chair, seat_surface, pelvis, contact, up}

Step 3: Rest the right hand on the right arm of the chair.
Pair 1: {chair, seat_surface, pelvis, contact, up}
Pair 2: {chair, armrest_hard_surface, right hand, contact, up}

Step 4: Place the right foot on the left knee.
Pair 1: {chair, seat_surface, pelvis, contact, up}
Pair 2: {chair, left knee, right foot, contact, up}

Step 5: Stand up.
Pair 1: {chair, none, none, none, front}

Step 6: Sit on the chair.
Pair 1: {chair, seat_surface, pelvis, contact, up}

[end of an example]

Output

Step 1: Get close to the chair.
Pair 1: {chair, none, none, none, front}

Step 2: Sit on the chair.
Pair 1: {chair, seat_surface, pelvis, contact, up}

Step 3: Use the laptop to play video games.
Pair 1: {chair, seat_surface, pelvis, contact, up}
Pair 2: {laptop, keyboard, left hand, contact, up}
Pair 3: {laptop, keyboard, right hand, contact, up}

Step 4: Stand up from the chair.
Pair 1: {chair, none, none, none, front}

Step 5: Get close to the bed.
Pair 1: {bed, none, none, none, front}

Step 6: Sit on the bed.
Pair 1: {bed, mattress, pelvis, contact, up}

Step 7: Lie down on the bed.
Pair 1: {bed, mattress, pelvis, contact, up}
Pair 2: {bed, mattress, torso, contact, up}
Pair 3: {bed, pillow, head, contact, up}

Table 8: Example of a simple-level interaction plan in ScenePlan. Simple-level plans contain interactions within three steps and involve a single object.

{
  "0000": {
    "obj": {
      "000": {
        "id": "12747",
        "name": "bed",
        "rotate": [[1.5707963267948966, 0, 0], [0, 0, -1.5707963267948966]],
        "scale": 2.5,
        "transfer": [0, -2, 0]
      }
    },
    "chain_of_contacts": [
      [["bed000", "none", "none", "none", "front"]],
      [["bed000", "mattress25", "pelvis", "contact", "up"],
       ["bed000", "mattress25", "head", "not contact", "up"]],
      [["bed000", "mattress25", "pelvis", "contact", "up"],
       ["bed000", "mattress25", "left_foot", "contact", "up"],
       ["bed000", "mattress25", "right_foot", "contact", "up"],
       ["bed000", "mattress25", "head", "contact", "up"]]
    ]
  }
}

Table 9: Example of a medium-level interaction plan in ScenePlan. Medium-level plans contain interactions of more than three steps and involve a single object.

{
  "0000": {
    "obj": {
      "000": {
        "id": "45005",
        "name": "chair",
        "rotate": [[1.5707963267948966, 0, 0], [0, 0, -1.5707963267948966]],
        "scale": 1.5,
        "transfer": [0, -2, 0]
      }
    },
    "chain_of_contacts": [
      [["chair000", "none", "none", "none", "front"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"],
       ["chair000", "back_soft_surface47", "torso", "contact", "none"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"],
       ["chair000", "back_soft_surface47", "torso", "contact", "none"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"],
       ["chair000", "arm_sofa_style44", "left_hand", "contact", "up"],
       ["chair000", "arm_sofa_style48", "right_hand", "contact", "up"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"],
       ["chair000", "arm_sofa_style44", "left_hand", "not contact", "up"],
       ["chair000", "arm_sofa_style48", "right_hand", "not contact", "up"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"],
       ["chair000", "left_knee", "right_foot", "contact", "none"]],
      [["chair000", "seat_soft_surface42", "pelvis", "contact", "up"],
       ["chair000", "back_soft_surface47", "torso", "not contact", "none"]],
      [["chair000", "none", "none", "none", "front"]]
    ]
  }
}

Table 10: Example of a hard-level interaction plan in ScenePlan. Hard-level plans involve interactions of more than 3 steps with more than 1 object.

{
  "0000": {
    "obj": {
      "000": {
        "id": "37825",
        "name": "chair",
        "rotate": [[1.5707963267948966, 0, 0], [0, 0, -1.5707963267948966]],
        "scale": 1.5,
        "transfer": [0, -2, 0]
      },
      "001": {
        "id": "21980",
        "name": "table",
        "rotate": [[1.5707963267948966, 0, 0], [0, 0, 1.5707963267948966]],
        "scale": 1.8,
        "transfer": [1, -2, 0]
      },
      "002": {
        "id": "11873",
        "name": "laptop",
        "rotate": [[1.5707963267948966, 0, 0], [0, 0, 1.5707963267948966]],
        "scale": 0.6,
        "transfer": [0.8, -2, 0.65]
      },
      "003": {
        "id": "10873",
        "name": "bed",
        "rotate": [[1.5707963267948966, 0, 0], [0, 0, -1.5707963267948966]],
        "scale": 3,
        "transfer": [-0.2, -4, 0]
      }
    },
    "chain_of_contacts": [
      [["chair000", "none", "none", "none", "front"]],
      [["chair000", "seat_soft_surface58", "pelvis", "contact", "up"]],
      [["chair000", "seat_soft_surface58", "pelvis", "contact", "up"],
       ["laptop002", "keyboard15", "left_hand", "contact", "none"],
       ["laptop002", "keyboard15", "right_hand", "contact", "none"]],
      [["chair000", "none", "none", "none", "front"]],
      [["bed003", "none", "none", "none", "front"]],
      [["bed003", "mattress16", "pelvis", "contact", "up"],
       ["bed003", "mattress16", "head", "not contact", "up"]],
      [["bed003", "mattress16", "pelvis", "contact", "up"],
       ["bed003", "mattress16", "left_foot", "contact", "up"],
       ["bed003", "mattress16", "right_foot", "contact", "up"],
       ["bed003", "pillow17", "head", "contact", "up"]],
      [["bed003", "mattress16", "pelvis", "contact", "up"],
       ["bed003", "mattress16", "head", "not contact", "up"]],
      [["bed003", "none", "none", "none", "front"]]
    ]
  }
}

Table 11: List of interactions in ScenePlan (part 1).

Interaction Type	Contact Formation
Get close to xxx	{xxx, none, none, none, dir}
Stand up	{xxx, none, none, none, dir}
Left hand reaches xxx	{xxx, part, left_hand, contact, dir}
Right hand reaches xxx	{xxx, part, right_hand, contact, dir}
Both hands reaches xxx	{{xxx, part, left_hand, contact, dir}, {xxx, part, right_hand, contact, dir}}
Sit on xxx	{xxx, seat_surface, pelvis, contact, up}
Sit on xxx, left hand on left arm	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_arm, left_hand, contact, up}}
Sit on xxx, right hand on right arm	{{xxx, seat_surface, pelvis, contact, up}, {xxx, right_arm, right_hand, contact, up}}
Sit on xxx, hands on arms	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_arm, left_hand, contact, none}, {xxx, right_arm, right_hand, contact, none}}
Sit on xxx, hands away from arms	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_arm, left_hand, not contact, none}, {xxx, right_arm, right_hand, not contact, none}}
Sit on xxx, left elbow on left arm	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_arm, left_elbow, contact, up}}
Sit on xxx, right elbow on right arm	{{xxx, seat_surface, pelvis, contact, up}, {xxx, right_arm, right_elbow, contact, up}}
Sit on xxx, elbows on arms	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_arm, left_elbow, contact, none}, {xxx, right_arm, right_elbow, contact, none}}
Sit on xxx, left hand on left knee	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_knee, left_hand, contact, up}}
Sit on xxx, right hand on right knee	{{xxx, seat_surface, pelvis, contact, up}, {xxx, right_knee, right_hand, contact, up}}
Sit on xxx, hands on knees	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_knee, left_hand, contact, none}, {xxx, right_knee, right_hand, contact, none}}
Sit on xxx, left hand on stomach	{{xxx, seat_surface, pelvis, contact, up}, {xxx, pelvis, left_hand, contact, none}}
Sit on xxx, right hand on stomach	{{xxx, seat_surface, pelvis, contact, up}, {xxx, pelvis, right_hand, contact, none}}
Sit on xxx, hands on stomach	{{xxx, seat_surface, pelvis, contact, up}, {xxx, pelvis, left_hand, contact, none}, {xxx, pelvis, right_hand, contact, none}}
Sit on xxx, left foot on right knee	{{xxx, seat_surface, pelvis, contact, up}, {xxx, right_knee, left_foot, contact, none}}
Sit on xxx, right foot on left knee	{{xxx, seat_surface, pelvis, contact, up}, {xxx, left_knee, right_foot, contact, none}}
Sit on xxx, lean forward	{{xxx, seat_surface, pelvis, contact, up}, {xxx, back_surface, torso, not contact, none}}
Sit on xxx, lean backward	{{xxx, seat_surface, pelvis, contact, up}, {xxx, back_surface, torso, contact, none}}

Table 12: List of interactions in ScenePlan (part 2).

Interaction Type	Contact Formation
Lie on xxx	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}}
Lie on xxx, left knee up	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, mattress, left_knee, not contact, none}}
Lie on xxx, right knee up	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, mattress, right_knee, not contact, none}}
Lie on xxx, knees up	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, mattress, left_knee, not contact, none}, {xxx, mattress, right_knee, not contact, none}}
Lie on xxx, left hand on pillow	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, pillow, left_hand, contact, none}}
Lie on xxx, right hand on pillow	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, pillow, right_hand, contact, none}}
Lie on xxx, hands on pillow	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, pillow, left_hand, contact, none}, {xxx, pillow, right_hand, contact, none}}
Lie on xxx, on left side	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, mattress, right_shoulder, not contact, none}}
Lie on xxx, on right side	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, mattress, left_shoulder, not contact, none}}
Lie on xxx, left foot on right knee	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, right_knee, left_foot, contact, up}}
Lie on xxx, right foot on left knee	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, contact, up}, {xxx, left_knee, right_foot, contact, up}}
Lie on xxx, head up	{{xxx, mattress, pelvis, contact, up}, {xxx, pillow, head, not contact, none}}

References

Araújo et al. (2023) Joao Pedro Araújo, Jiaman Li, Karthik Vetrivel, Rishi Agarwal, Jiajun Wu, Deepak Gopinath, Alexander William Clegg, and Karen Liu. Circle: Capture in rich contextual environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 21211–21221, 2023.
Athanasiou et al. (2023) Nikos Athanasiou, Mathis Petrovich, Michael J Black, and Gül Varol. Sinc: Spatial composition of 3d human motions for simultaneous action generation. arXiv preprint arXiv:2304.10417, 2023.
Barsoum et al. (2018) Emad Barsoum, John Kender, and Zicheng Liu. Hp-gan: Probabilistic 3d human motion prediction via gan. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 1418–1427, 2018.
Brohan et al. (2023) Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
Chen et al. (2023) Xin Chen, Biao Jiang, Wen Liu, Zilong Huang, Bin Fu, Tao Chen, and Gang Yu. Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18000–18010, 2023.
Dai et al. (2017) Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5828–5839, 2017.
Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Harvey et al. (2020) Félix G Harvey, Mike Yurick, Derek Nowrouzezahrai, and Christopher Pal. Robust motion in-betweening. ACM Transactions on Graphics (TOG), 39(4):60–1, 2020.
Hassan et al. (2021a) Mohamed Hassan, Duygu Ceylan, Ruben Villegas, Jun Saito, Jimei Yang, Yi Zhou, and Michael J Black. Stochastic scene-aware motion prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11374–11384, 2021a.
Hassan et al. (2021b) Mohamed Hassan, Partha Ghosh, Joachim Tesch, Dimitrios Tzionas, and Michael J Black. Populating 3D scenes by learning human-scene interaction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14708–14718, 2021b.
Hassan et al. (2023) Mohamed Hassan, Yunrong Guo, Tingwu Wang, Michael Black, Sanja Fidler, and Xue Bin Peng. Synthesizing physical character-scene interactions. arXiv preprint arXiv:2302.00883, 2023.
Holden et al. (2017) Daniel Holden, Taku Komura, and Jun Saito. Phase-functioned neural networks for character control. ACM Transactions on Graphics (TOG), 36(4):1–13, 2017.
Huang et al. (2023) Siyuan Huang, Zan Wang, Puhao Li, Baoxiong Jia, Tengyu Liu, Yixin Zhu, Wei Liang, and Song-Chun Zhu. Diffusion-based generation, optimization, and planning in 3D scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16750–16761, 2023.
Jiang et al. (2023) Biao Jiang, Xin Chen, Wen Liu, Jingyi Yu, Gang Yu, and Tao Chen. MotionGPT: Human motion as a foreign language. arXiv preprint arXiv:2306.14795, 2023.
Juravsky et al. (2022) Jordan Juravsky, Yunrong Guo, Sanja Fidler, and Xue Bin Peng. PADL: Language-directed physics-based character control. In SIGGRAPH Asia 2022 Conference Papers, pp. 1–9, 2022.
Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Makoviychuk et al. (2021) Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin, David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac Gym: High performance GPU-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
Mo et al. (2019) Kaichun Mo, Shilin Zhu, Angel X Chang, Li Yi, Subarna Tripathi, Leonidas J Guibas, and Hao Su. PartNet: A large-scale benchmark for fine-grained and hierarchical part-level 3D object understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 909–918, 2019.
OpenAI (2020) OpenAI. GPT-3: Generative pre-trained transformer 3. https://openai.com/research/gpt-3, 2020.
OpenAI (2023) OpenAI. GPT-4 technical report, 2023.
Pan et al. (2023) Liang Pan, Jingbo Wang, Buzhen Huang, Junyu Zhang, Haofan Wang, Xu Tang, and Yangang Wang. Synthesizing physically plausible human motions in 3D scenes. arXiv preprint arXiv:2308.09036, 2023.
Pavllo et al. (2018) Dario Pavllo, David Grangier, and Michael Auli. QuaterNet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485, 2018.
Peng et al. (2021) Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. AMP: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics (TOG), 40(4):1–20, 2021.
Peng et al. (2022) Xue Bin Peng, Yunrong Guo, Lina Halper, Sergey Levine, and Sanja Fidler. ASE: Large-scale reusable adversarial skill embeddings for physically simulated characters. ACM Transactions on Graphics (TOG), 41(4):1–17, 2022.
Rocamonde et al. (2023) Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision-language models are zero-shot reward models for reinforcement learning. arXiv preprint arXiv:2310.12921, 2023.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Starke et al. (2019) Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. Neural state machine for character-scene interactions. ACM Transactions on Graphics (TOG), 38(6):209–1, 2019.
Starke et al. (2020) Sebastian Starke, Yiwei Zhao, Taku Komura, and Kazi Zaman. Local motion phases for learning multi-contact character movements. ACM Transactions on Graphics (TOG), 39(4):54–1, 2020.
Tevet et al. (2022a) Guy Tevet, Brian Gordon, Amir Hertz, Amit H Bermano, and Daniel Cohen-Or. MotionCLIP: Exposing human motion generation to CLIP space. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXII, pp. 358–374. Springer, 2022a.
Tevet et al. (2022b) Guy Tevet, Sigal Raab, Brian Gordon, Yonatan Shafir, Daniel Cohen-Or, and Amit H Bermano. Human motion diffusion model. arXiv preprint arXiv:2209.14916, 2022b.
Wang et al. (2022a) Jingbo Wang, Yu Rong, Jingyuan Liu, Sijie Yan, Dahua Lin, and Bo Dai. Towards diverse and natural scene-aware 3D human motion synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20460–20469, 2022a.
Wang et al. (2022b) Zan Wang, Yixin Chen, Tengyu Liu, Yixin Zhu, Wei Liang, and Siyuan Huang. HUMANISE: Language-conditioned human motion generation in 3D scenes. Advances in Neural Information Processing Systems, 35:14959–14971, 2022b.
Won et al. (2022) Jungdam Won, Deepak Gopinath, and Jessica Hodgins. Physics-based character controllers using conditional VAEs. ACM Transactions on Graphics (TOG), 41(4):1–12, 2022.
Yan et al. (2019) Sijie Yan, Zhizhong Li, Yuanjun Xiong, Huahan Yan, and Dahua Lin. Convolutional sequence generation for skeleton-based action synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4394–4402, 2019.
Yao et al. (2022) Heyuan Yao, Zhenhua Song, Baoquan Chen, and Libin Liu. ControlVAE: Model-based learning of generative controllers for physics-based characters. ACM Transactions on Graphics (TOG), 41(6):1–16, 2022.
Zhang et al. (2023a) Jianrong Zhang, Yangsong Zhang, Xiaodong Cun, Shaoli Huang, Yong Zhang, Hongwei Zhao, Hongtao Lu, and Xi Shen. T2M-GPT: Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023a.
Zhang et al. (2022a) Mingyuan Zhang, Zhongang Cai, Liang Pan, Fangzhou Hong, Xinying Guo, Lei Yang, and Ziwei Liu. MotionDiffuse: Text-driven human motion generation with diffusion model. arXiv preprint arXiv:2208.15001, 2022a.
Zhang et al. (2022b) Xiaohan Zhang, Bharat Lal Bhatnagar, Sebastian Starke, Vladimir Guzov, and Gerard Pons-Moll. COUCH: Towards controllable human-chair interactions. In European Conference on Computer Vision, pp. 518–535. Springer, 2022b.
Zhang et al. (2023b) Yaqi Zhang, Di Huang, Bin Liu, Shixiang Tang, Yan Lu, Lu Chen, Lei Bai, Qi Chu, Nenghai Yu, and Wanli Ouyang. MotionGPT: Finetuned LLMs are general-purpose motion generators. arXiv preprint arXiv:2306.10900, 2023b.
Zhao et al. (2022) Kaifeng Zhao, Shaofei Wang, Yan Zhang, Thabo Beeler, and Siyu Tang. Compositional human-scene interaction synthesis with semantic control. In European Conference on Computer Vision, pp. 311–327. Springer, 2022.
Zhao et al. (2023) Kaifeng Zhao, Yan Zhang, Shaofei Wang, Thabo Beeler, and Siyu Tang. Synthesizing diverse human motions in 3D indoor scenes. arXiv preprint arXiv:2305.12411, 2023.
