AAAI 2023
On Grounded Planning for Embodied Tasks with Language Models

Bill Yuchen Lin1*, Chengsong Huang2*, Qian Liu3, Wenda Gu1, Sam Sommerer1, Xiang Ren1
* The first two authors contributed equally.
Abstract

Language models (LMs) have demonstrated their capability in possessing commonsense knowledge of the physical world, a crucial aspect of performing tasks in everyday life. However, it remains unclear whether they have the capacity to generate grounded, executable plans for embodied tasks. This is a challenging task as LMs lack the ability to perceive the environment through vision and feedback from the physical environment. In this paper, we address this important research question and present the first investigation into the topic. Our novel problem formulation, named G-PlanET, inputs a high-level goal and a data table about objects in a specific environment, and then outputs a step-by-step actionable plan for a robotic agent to follow. To facilitate the study, we establish an evaluation protocol and design a dedicated metric, KAS, to assess the quality of the plans. Our experiments demonstrate that the use of tables for encoding the environment and an iterative decoding strategy can significantly enhance the LMs’ ability in grounded planning. Our analysis also reveals interesting and non-trivial findings.

Project website: https://yuchenlin.xyz/g-planet/

1 Introduction

Pre-trained language models (LMs) demonstrate exceptional proficiency in a wide range of natural language processing (NLP) tasks such as question answering, machine translation, and summarization. They indeed capture some commonsense knowledge about our physical world such as “birds can fly”. However, the question of whether LMs can exhibit reasoning abilities within a grounded, realistic setting remains an open issue. This is because LMs lack the sensory experiences and physical interactions with the environment that enable human beings to grasp the nuances of real-life situations and plan for completing tasks.

Figure 1: The task of grounded planning for embodied tasks (G-PlanET). The input to the LMs is a goal with a specific environment, and the output is a step-by-step plan that can guide a robot to complete the task.

Embodied robotics learning is a growing field that seeks to create artificial intelligence agents capable of navigating and performing tasks within real-world environments, typically simulated through physical engines such as AI2THOR (Kolve et al. 2017). The ALFRED benchmark (Shridhar et al. 2020) represents one of the pioneering datasets that bridges the gap between NLP and robotics, providing a platform for investigating language-directed agents. The objective of these studies is to design and test agents that can translate language instructions into sequences of low-level actions that enable the agent to manipulate objects within an environment and achieve a desired outcome (e.g., cleaning an object and placing it elsewhere).

However, the primary emphasis of the ALFRED benchmark and related datasets is on the comprehension of pre-established plans, rather than the ability to reason and independently plan within a realistic environment. Prior research focuses on the capacity of agents to comprehend and execute step-by-step plans, but not on their capacity for decomposing tasks and generating such plans, which represents a more advanced skill. Additionally, the role of LMs has received limited examination in the context of these benchmarks, where they are mainly used as encoders for embedding token sequences, rather than for planning or reasoning.

Prior studies have explored the planning capability of LMs, with Huang et al. (2022) demonstrating that GPT-3 and similar models are capable of generating general plans for executing everyday tasks. However, these plans lack grounding in a realistic environment, as LMs are not environment-specific. As a result, these plans are not necessarily executable by agents. For instance, in the context of an ALFRED task to “move a teapot from the stove to a shelf,” embodied agents require knowledge of the location of the teapot and the path to reach it. Humans, on the other hand, can readily observe the location of the teapot on the stove and their current position in the kitchen, allowing them to formulate a grounded plan that starts with “turn right and walk to the stove.” This highlights the need for generating detailed, step-by-step action sequences for robotic agents to use in their execution processes.

Can LMs also learn grounded planning ability? How should we evaluate and improve LMs for grounded planning? To address these questions, we propose a study on the ability of language models for grounded planning for embodied tasks (G-PlanET). Our approach involves providing LMs with two inputs: a high-level task description and a realistic environment in the form of an object table. The output is a plan consisting of executable, step-by-step actions. We formulate G-PlanET as a language generation task and focus on encoder-decoder language models such as BART (Lewis et al. 2020).

In order to establish a dataset and evaluation protocol for G-PlanET, we leveraged the ALFRED data by developing a suite of data conversion programs. They extract the object information from the environment and format it into data tables, thereby enabling models to access observations from realistic scenarios. Additionally, we formulated a new evaluation metric, referred to as KAS, that is more appropriate for the task than existing ones for text generation. As regards the methodology of G-PlanET, we suggest flattening an object table into a sequence of tokens and appending it to the task description as input to the model. The base LMs are then fine-tuned with these seq2seq data to learn to generate plans. Furthermore, we propose a simple yet effective decoding strategy that iteratively generates subsequent steps by incorporating the previous generation into the input. Our empirical results and analysis indicate that incorporating object tables into inputs and the proposed iterative decoding strategies are both crucial for enhancing the performance of language models in G-PlanET.

To summarize, our main contributions are:

  • The task of G-PlanET: To the best of our knowledge, this is one of the first studies to investigate the ability of LMs for embodied planning in realistic environments. G-PlanET is crucial for advancing the grounded generalization of large LMs and bridging the gap between NLP and embodied intelligence. (Sec. 2)

  • A comprehensive evaluation protocol: We put significant effort into converting the ALFRED and AI2THOR data into data tables to support the evaluation of G-PlanET. We also created a new evaluation metric, KAS, to effectively assess the plans generated by the LMs.

  • Improving LMs for G-PlanET: We present two simple but effective components for enhancing the grounded planning ability of LMs - flattening object tables and an iterative decoding strategy. Our experiments show that these components lead to notable performance gains. (Sec. 3) Also, through extensive experimentation and in-depth analysis, we have gained a deeper understanding of the behavior of LMs for G-PlanET and present a series of non-trivial findings in our study.

2 Problem Formulation

Figure 2: The overall workflow of the proposed methods. First, we extract the object table from the realistic environment. Then we flatten the table into a sequence of tokens $E$ (Sec. 3.2). We provide two learning methods for generating plans: 1) generate the whole plan $S_1, S_2, \ldots, S_T$ and 2) iteratively decode $S_{t+1}$ (Sec. 3.3).

Here we present the background knowledge, the problem formulation and the data sources for G-PlanET.

2.1 Background Knowledge

Embodied tasks.

The ALFRED benchmark (Shridhar et al. 2020) is among the first benchmarks focusing on embodied tasks in realistic environments, although most of the examples are household tasks. It aims to test the ability of agents to execute embodied tasks in real-world scenarios. Specifically, the agents need to understand language-based instructions and output a sequence of actions to interact with an engine named AI2-THOR (Kolve et al. 2017), such that the given tasks can be achieved.

Language instructions.

Language instructions play an important role in the ALFRED benchmark. The embodied tasks are annotated with a high-level goal and a low-level plan (i.e., a sequence of executable actions for robots) in natural language, which are both inputs to the agents. The agents need to understand such language instructions and parse them into action templates. Note that the agents do not need to plan for the task, as they already have the step-by-step instructions to follow.

Task planning.

Prior works show that large pre-trained language models (LMs) such as GPT-3 (Brown et al. 2020) can generate general procedures for completing a task. However, such plans are not aligned with the particular environment in which we are interested. This is because these methods never encode the environment as part of the inputs to LMs for grounding the plans to the given environment. Therefore, such non-grounded plans are hardly useful in guiding agents to work in real-world situations.

2.2 G-PlanET with LMs

As discussed in Sec. 2.1, the ALFRED benchmark does not explicitly test the planning ability, while prior works on planning with LMs have not considered grounding to a specific environment. In this work, we focus on evaluating and improving the ability to generate grounded plans for embodied tasks with LMs, which we dub G-PlanET. It has been an underexplored open problem for both the robotics and NLP communities.

Task formulation.

The task we aim to study in this paper is essentially a language generation problem. Specifically, the input is two-fold: 1) a high-level goal $G$ and 2) a specific environment $E$ that the agents need to ground to. The expected output is a step-by-step actionable plan $S = \{S_1, S_2, \ldots\}$ that achieves the given goal in the specific environment. The goal $G$ and the plan $S$ are in the form of natural language, while the environment $E$ can be viewed as a data table consisting of the object information in a room. Figure 2 shows an illustrative example and we will discuss more details in Section 3.2.

2.3 Data for G-PlanET

To build a large-scale dataset for studying the G-PlanET task, we re-use the goals and the plans of ALFRED and extract object information from AI2THOR for the aligned environment. The ALFRED dataset uses the AI2THOR engine to provide an interactive environment for agents with an egocentric vision to perform actions. However, the dataset does not contain explicit data about objects in the environment (e.g., their coordinates, rotations, and spatial relationships with each other).

We develop a suite of conversion programs for using AI2THOR to re-purpose the ALFRED benchmark for evaluating the methods shown in Section 3. We managed to get a structured data table to describe the environment of each task in the ALFRED dataset. We explore the AI2THOR engine and write conversion programs such that we can get full observations of all objects: properties (movable, openable, etc.), positions (3D coordinates & rotation), sizes, and spatial relationships (e.g., object A is on the top of object B). We believe our variant of the ALFRED data will be a great resource for the community to study G-PlanET and future directions in grounded reasoning.

3 Methods

Herein, we introduce the methods that we adopt or propose to address the G-PlanET problem. First of all, we present the base language models that are encoder-decoder architectures. Then, we show in detail how we encode the environment data and integrate them with the seq2seq learning frameworks. Finally, we propose an iterative decoding strategy that significantly improves performance.

3.1 Base Language Models

Pretrained encoder-decoder language models, such as BART (Lewis et al. 2020) and T5 (Raffel et al. 2020), have achieved promising performance in many well-known language generation tasks such as summarization and question answering. They also show great potential for general commonsense reasoning tasks such as CommonsenseQA (Talmor et al. 2019), suggesting that these large LMs have common sense to some extent. As G-PlanET can also be viewed as a text generation problem, we use these LMs as the backbone for developing further planning methods, hoping that their common sense can be grounded in real-world situations for embodied tasks.

Vanilla baseline methods.

As shown in many papers, BART and T5, when sizes are similar, show comparable performance in many generation tasks. Thus, we use BART-base and BART-large as two selected LMs for evaluation. The simplest and most straightforward baseline method of using such LMs to solve G-PlanET is to ignore the environment and only use the goal as the sole input. Then, we fine-tune the base LMs with the training data and expect they can directly output the whole plan as a single sequence of tokens (including special separator tokens). This simple method does not allow the LMs to perceive the environment, although training from the large-scale data can still teach the LMs some general strategies for planning. Therefore, we see this as an important baseline method to analyze.

3.2 Encoding Realistic Environments

To enable the LMs to perceive an environment, we need to encode the object tables described in Sec. 2.2. Following prior works in table-based NLP tasks (Chen et al. 2020; Liu et al. 2022b), we flatten a table into token sequences row by row, thus creating a linearized version of an object table. Then, we append the flattened table after the goal to form a complete input sequence. Thus, the input side of the encoder-decoder finally has the environment information for generating a grounded plan.

Considering the maximum sequence length, we only encode objects by their type, position, rotation, and receptacle parent. The object type not only tells what an object is but also implies commonsense affordances (e.g., a microwave can heat up something, a knife can slice something), which are very important for planning. The position information is essential for agents to navigate and find objects, thus playing an important part in planning. The rotation is also useful for some objects that can only be used with a certain orientation (e.g., a refrigerator can only be opened when the agent is in front of it). An object and its receptacle have a close spatial connection (e.g., a pen is on a desk; an apple is in a fridge). Every object has a unique identifier such that objects of the same type can be referred to precisely when they are receptacles of others. In addition, the agent is represented as a special object.
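The row-by-row flattening described above can be sketched as follows. This is a minimal illustration, not the authors' conversion code: the column names, the in-row separator " ; ", the two-decimal rounding, and the helper names `flatten_row` and `build_input` are all assumptions made for the example. Only the "Env:" prefix and the [SEP] row separator follow the input format reported in the implementation details.

```python
from typing import Dict, List

# Hypothetical column order; the paper encodes the object type, position,
# rotation, and receptacle parent, plus a unique identifier per object.
COLUMNS = ["id", "type", "position", "rotation", "parent"]


def flatten_row(obj: Dict[str, object]) -> str:
    """Linearize one object (one table row) into a flat string."""
    parts = []
    for col in COLUMNS:
        value = obj.get(col, "none")
        if isinstance(value, (list, tuple)):
            # Round coordinates so that each number costs fewer tokens.
            value = " ".join(f"{v:.2f}" for v in value)
        parts.append(f"{col} : {value}")
    return " ; ".join(parts)


def build_input(goal: str, objects: List[Dict[str, object]]) -> str:
    """Append the flattened object table after the goal."""
    rows = [flatten_row(o) for o in objects]
    return f"{goal} Env: " + " [SEP] ".join(rows)


# The agent itself is represented as a special object in the table.
example_objects = [
    {"id": "Agent", "type": "Agent", "position": (0.0, 0.9, 1.5),
     "rotation": (0, 90, 0), "parent": "none"},
    {"id": "Laptop_1", "type": "Laptop", "position": (1.25, 0.5, -0.75),
     "rotation": (0, 180, 0), "parent": "CoffeeTable_1"},
]
example_input = build_input("Close the laptop.", example_objects)
```

Because a room contains about 74 objects on average (Table 2), a real pipeline would also need to truncate or filter rows to respect the model's maximum input length; that step is omitted here.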

3.3 Iterative Decoding Strategy

Adding the flattened table of object information to the input sequences indeed improves the LMs in terms of their perception of the realistic environments, which forms the foundation of grounded planning. However, the thinking process is still limited by the conventional seq2seq learning framework, which assumes LMs should output a complete plan by a single pass of decoding. We argue that a thoughtful planning process should carefully handle the coherence of each step, otherwise errors accumulate and cause a failed plan.

Therefore, we propose a simple yet effective decoding strategy that learns to iteratively generate a plan step by step. Specifically, we append the steps generated up to the current step $t$ to the input sequence (i.e., Input = $[G + S_1 + \cdots + S_t (+ E)]$) for generating the next step (i.e., Output = $S_{t+1}$). This iterative decoding process ends when the LM generates the special token END. In the training stage, we use the ground-truth references for $S_t$; in the inference stage, we do not have such references, so we use the model predictions as $S_t$.

Notably, in contrast to the conventional seq2seq learning process, the iterative decoding strategy needs to run the encoder-decoder model $N+1$ times to generate a plan with $N$ steps. We argue that the additional computation cost of re-encoding is worthwhile. Imagine how we humans plan a task in a room. It is natural for us to come up with the plan step by step, and it is very likely that the most useful information for generating different steps concerns different objects. Therefore, a temporally dynamic attention mechanism is favorable in planning with LMs. Our iterative decoding strategy encourages the encoder-decoder architectures to learn such ability.
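The inference-time loop of the iterative strategy can be sketched as below. It is an illustration under stated assumptions rather than the authors' implementation: `next_step` is a hypothetical callable standing in for one encoder-decoder pass of a fine-tuned model, `toy_model` is a scripted stand-in so the sketch runs without any weights, and the `max_steps` safety cap and the exact way the history is concatenated are choices made for the example.

```python
from typing import Callable, List

END_TOKEN = "END"


def iterative_decode(goal: str, env: str,
                     next_step: Callable[[str], str],
                     max_steps: int = 20) -> List[str]:
    """Generate a plan one step at a time.

    Each call to `next_step` re-encodes the goal, all previously
    generated steps, and the flattened environment, then returns
    either the next step or the END token.
    """
    steps: List[str] = []
    while len(steps) < max_steps:
        history = " ".join(f"Step {i + 1}: {s}" for i, s in enumerate(steps))
        parts = [goal, history, f"Env: {env}"]
        model_input = " ".join(p for p in parts if p)
        prediction = next_step(model_input).strip()
        if prediction == END_TOKEN:
            break
        steps.append(prediction)
    return steps


def toy_model(model_input: str) -> str:
    """Scripted stand-in for a fine-tuned LM (assumption for the demo)."""
    script = [
        "Turn right and walk to the stove.",
        "Pick up the teapot.",
        "Put the teapot on the shelf.",
    ]
    done = model_input.count("Step ")  # how many steps are already in the input
    return script[done] if done < len(script) else END_TOKEN


plan = iterative_decode("Move a teapot from the stove to a shelf.",
                        "id : Teapot_1 ; parent : StoveBurner_1", toy_model)
```

For a plan with N steps the callable is invoked N+1 times, the last call producing END, which matches the cost noted above. At training time the history would be filled with ground-truth steps instead of model predictions.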

3.4 Other Methods

Pretrained table encoders.

Since we provide environmental information in a tabular format and BART has not been pre-trained on tabular input, BART may not be able to use this information well. Therefore, we employ TaPEx (Liu et al. 2022b), the state-of-the-art pre-trained language model on tabular data. Using SQL execution as the only pre-training task, TaPEx achieves better tabular reasoning capability than BART, and thus we expect TaPEx can make full use of the environmental information represented by the table in our task.

In-context few-shot learning with GPT-J.

Finally, to explore whether large-scale language models can master the task with few-shot examples, we also experimented with few-shot performance on a larger language model GPT-J 6B.

4 Evaluation

How do we evaluate a method for G-PlanET? Due to the novelty of the problem setup, it is challenging to evaluate and analyze the methods. In this section, we present a general evaluation protocol and a complementary metric to measure the quality of generated plans. We report the main experimental results with the proposed evaluation protocol. We leave the analysis to Sec. 5.

4.1 Metrics

Step-wise evaluation.

Conventional evaluation metrics such as BLEU (Papineni et al. 2002) and ROUGE (Lin 2004) measure the similarity between generated text and truth references as a whole, which is suitable for translation and summarization. However, the output text of planning tasks such as our G-PlanET is highly structured. A plan naturally can be split into a sequence of step-by-step actions. Using the conventional way to evaluate plans inevitably breaks such internal structures and will lead to inaccurate measurement. For example, if the first step of the generated plan is the same as the last step of the reference plan, the conventional evaluation will still assign a high score to such a generated plan, even though it is not useful at all. Therefore, we argue that it is much more reasonable to evaluate the similarity of a pair of plans step by step. Specifically, we first align the generations and the truths and compute the scores of every step by multiple metrics (see Footnote 2). Then, we aggregate the final score by taking the average of all steps. We also consider other temporal weighting aggregation for more analysis in Sec. 5.

Footnote 2: The ALFRED authors ensure that the references consist of atomic action steps and all references share the same length. Therefore, we consider the length of truth plans as the standard: when the generated plan has more steps than the truth plans, we cut off the extra steps; when the generation has fewer steps than the references, we duplicate the last step to make them even for step-wise evaluation.
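The alignment rule (cut off extra steps, duplicate the last step when the generation is too short) and the uniform averaging can be sketched as follows. The function names, the handling of an empty generation, and the exact-match `metric` used in the example are assumptions made for illustration; in the paper the per-step scores come from CIDEr, SPICE, and KAS.

```python
from typing import Callable, List


def align_to_reference(generated: List[str], reference_len: int) -> List[str]:
    """Make the generated plan exactly `reference_len` steps long."""
    if reference_len <= 0:
        return []
    if not generated:
        # Assumption: an empty generation is scored as empty steps.
        return [""] * reference_len
    if len(generated) >= reference_len:
        return generated[:reference_len]          # cut off extra steps
    padding = [generated[-1]] * (reference_len - len(generated))
    return generated + padding                    # duplicate the last step


def stepwise_score(generated: List[str], reference: List[str],
                   metric: Callable[[str, str], float]) -> float:
    """Average a per-step metric over the aligned steps."""
    aligned = align_to_reference(generated, len(reference))
    scores = [metric(g, r) for g, r in zip(aligned, reference)]
    return sum(scores) / len(scores) if scores else 0.0


def exact_match(pred: str, ref: str) -> float:
    """Toy per-step metric used only for this example."""
    return 1.0 if pred == ref else 0.0


score = stepwise_score(["turn left", "open the fridge"],
                       ["turn left", "walk to the fridge", "open the fridge"],
                       exact_match)
```

In this example the two-step generation is padded to three steps by repeating "open the fridge", so the first and third steps match the reference and the second does not.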

Data Split                                Unseen Room Layouts     Seen Room Layouts
Methods                                   CIDEr   SPICE   KAS     CIDEr   SPICE   KAS
BART-base (vanilla)                       0.9417  0.1378  0.2455  0.8231  0.1277  0.2197
BART-large (vanilla)                      1.4632  0.3168  0.4069  1.4414  0.3161  0.3900
GPT-J-6B                                  1.1968  0.2655  0.3622  1.1047  0.2509  0.3370
BART-base w/table                         1.6706  0.3692  0.4584  1.6230  0.3595  0.4339
BART-large w/table                        1.6630  0.3491  0.4411  1.5865  0.3393  0.4204
BART-large (TaPEx)                        2.8824  0.5054  0.6373  2.7432  0.4944  0.6045
BART-base w/table + iterative decoding    2.9147  0.5107  0.6334  2.8582  0.5118  0.6124
BART-large w/table + iterative decoding   2.8580  0.5194  0.6518  2.8799  0.5096  0.6326
BART-large (TaPEx) + iterative decoding   2.8440  0.5210  0.6313  2.6959  0.5036  0.6074
Table 1: Experimental results for G-PlanET with different base LMs. The methods are grouped by model type, by whether the environment is encoded, and by decoding strategy.

Measuring grounded plans.

A unique challenge in evaluating G-PlanET is accounting for the grounded nature of plans. Metrics such as BLEU, METEOR, and ROUGE do not give a suitable penalty when a plan is similar to the reference in terms of word usage, yet leads to totally different states in an interactive environment for embodied tasks. For example, there is only a one-word difference between “turn to the left” and “turn to the right”, but agents that faithfully follow these instructions can arrive at very different places.

The LM-based metrics, e.g., BERTScore (Zhang et al. 2020), are not suitable either because the neural embeddings of “left” and “right” are also very similar. Plus, the grounded plans for G-PlanET are object-centric in a context and very similar to the captions of a sequence of events by visual perception, for which these metrics are not specifically designed. Considering these limitations, we use two typical metrics that are widely used for captions and devise a new metric for complementary measurement.

The first two metrics are CIDEr (Vedantam, Zitnick, and Parikh 2015) and SPICE (Anderson et al. 2016), which are both widely used for tasks where the outputs are highly contextualized and describe natural scenarios in everyday life, e.g., VaTex (Wang et al. 2019) and CommonGen (Lin et al. 2020). In particular, SPICE parses both the generation and references to scene graphs, a graph-based semantic representation. Then, it calculates the edge-based F1 score to measure the similarity between each step. Note that SPICE computation has a special focus on the propositions. This is particularly favorable for evaluating G-PlanET since there are many actions in the grounded plans, where propositions can be seen as atomic units for evaluation.

Key Action Score (KAS).

Following SPICE, a step in a plan can be deconstructed into several propositions that are represented as edges. However, not all propositions in SPICE are necessarily important in evaluating plans for G-PlanET. Moreover, SPICE relies on an external parser that is expensive to run and sometimes produces noisy outputs. Also, most of the truth plans in the ALFRED annotations are overly specific, and it is not necessary for a plan to cover all details. Therefore, we devise a metric that focuses on the key actions of the generated plans and checks if they are part of the references, named Key Action Score (KAS).

Specifically, we extract a set of key action phrases from each step in the generated plan $\hat{S}_i$ and the truth reference $S_i$, respectively. We denote these two sets as $\hat{S}_i = \{\hat{a}_1, \hat{a}_2, \ldots\}$ and $S_i = \{a_1, a_2, \ldots\}$. Then, we check how many action phrases in $\hat{S}_i$ are covered by the truth set $S_i$; this precision is the KAS score for the $i$-th step in the plan. To increase the matching quality, we curate a set of rules and a dictionary to map the actions that share the same behaviors. For example, “turn to the left” and “turn left” are counted as a single match; “go straight” and “walk straight” can be matched too. In addition, we break compound nouns so that partial matches receive partial scores, for smoother scoring (e.g., “xxx on the table” vs “xxx on the coffee table”). Simply put, the KAS metric looks at the key actions extracted from the plans and checks if these important elements can be (fuzzy) matched to count as a valid step.
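A simplified sketch of the per-step precision is given below. It assumes the key action phrases have already been extracted, and it replaces the curated rules and dictionary, which are not reproduced in the paper text, with a tiny hypothetical synonym map and stopword list. The partial scoring for compound nouns is omitted, so this is an approximation of the idea rather than the actual KAS implementation.

```python
import re
from typing import Dict, Iterable, Set

# Hypothetical stand-ins for the curated mapping rules and dictionary.
SYNONYMS: Dict[str, str] = {"walk": "go", "move": "go"}
STOPWORDS: Set[str] = {"to", "the", "a", "an"}


def normalize(phrase: str) -> str:
    """Lowercase, drop stopwords, and map synonymous verbs together."""
    tokens = re.findall(r"[a-z]+", phrase.lower())
    kept = [SYNONYMS.get(t, t) for t in tokens if t not in STOPWORDS]
    return " ".join(kept)


def step_kas(pred_actions: Iterable[str], ref_actions: Iterable[str]) -> float:
    """Precision of predicted key actions against the reference set."""
    pred = {normalize(a) for a in pred_actions}
    ref = {normalize(a) for a in ref_actions}
    pred.discard("")  # ignore phrases that normalize to nothing
    if not pred:
        return 0.0
    return len(pred & ref) / len(pred)


same = step_kas(["turn to the left", "walk straight"],
                ["turn left", "go straight"])
wrong_turn = step_kas(["turn right", "go straight"],
                      ["turn left", "go straight"])
```

With this normalization, "turn to the left" matches "turn left" and "walk straight" matches "go straight", while a wrong turning direction halves the score of a two-action step instead of being hidden by word overlap.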

4.2 Experimental Setup

Data statistics.

Table 2 shows some statistics of our dataset that we described in Sec. 2.3. We follow the data split in ALFRED to split the train, valid, and test dataset. The data split is based on whether the room layout has been seen in the training tasks. It is usually easier for robotic agents to map instructions to low-level actions in seen rooms than in unseen rooms. However, for the planning ability that we want to study with G-PlanET in this paper, the two splits do not differ very much. We keep using this split to make the results consistent and convenient for people who want to connect our results with the ALFRED results.

split       train     valid               test
aspect      -         seen      unseen    seen      unseen
# tasks     21,025    820       821       705       694
avg. |G|    9.26      9.32      9.26      10.3      9.95
avg. # O    73.71     74.21     77.91     75.31     73.9
avg. # T    6.72      6.79      6.26      6.95      6.63
avg. |Si|   11.24     11.13     11.49     9.84      10.19
Table 2: avg. |G| is the average length of the goal and avg. |Si| is the average length of each step. avg. # O is the average number of objects in each room and avg. # T is the average number of steps.

Implementation details.

In single-pass decoding, we format the output sequences as follows: “Step 1: [S1] | Step 2: [S2] | ... | END”. When appending the flattened table of objects, we format the input as “[G] Env: [row 1] [SEP] [row 2] ...”, where [row i] is the sequence for the i-th object, including its id, type, coordinates, rotation, parent receptacles, etc. Due to the page limit, we leave the details of the data, methods, and hyper-parameters to the Appendix, which is linked from our project website.
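A small sketch of serializing a plan into the single-pass target format and parsing a decoded string back into steps is shown below. The exact spacing around the "|" separator and the helper names `format_plan` and `parse_plan` are assumptions; only the "Step k:" prefixes, the "|" separator, and the END token come from the description above.

```python
from typing import List


def format_plan(steps: List[str]) -> str:
    """Serialize steps as 'Step 1: ... | Step 2: ... | END'."""
    parts = [f"Step {i}: {s}" for i, s in enumerate(steps, start=1)]
    return " | ".join(parts + ["END"])


def parse_plan(text: str) -> List[str]:
    """Recover the list of steps from a decoded single-pass output."""
    steps: List[str] = []
    for chunk in text.split("|"):
        chunk = chunk.strip()
        if chunk == "END":
            break          # ignore anything generated after END
        if not chunk:
            continue       # skip empty segments from doubled separators
        prefix, sep, body = chunk.partition(":")
        if sep and prefix.strip().lower().startswith("step"):
            steps.append(body.strip())   # drop the 'Step k:' prefix
        else:
            steps.append(chunk)
    return steps


target = format_plan(["Turn left.", "Pick up the mug."])
recovered = parse_plan(target)
```

Parsing stops at the first END, so any text a model emits after the end marker is discarded before step-wise evaluation.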

4.3 Main Results

We report the main results in Table 1, and leave the deeper analysis to the next section. To sum up, we find that encoding the object table as part of the inputs significantly improves the performance (e.g., for BART-base on unseen layouts, KAS rises from 0.2455 to 0.4584), and that pre-training on other table-related tasks benefits G-PlanET considerably. The iterative decoding strategy is also an important component that can further improve the results to some extent.

Case Study of Table Effect

Although we have added the environment information E to the input, it remains a question whether the model uses this information effectively. To verify this, we present a case study. In a number of instances, we observe that introducing environmental information is helpful. Here is one example:

  • truth: Close the laptop that is on the table.

  • vanilla: Close the laptop and pick it up from the bed

  • w/ table: Pick up the laptop on the coffee table.

As shown in the example, the model successfully identified the location of a laptop with the help of the object table.

[Figure 3 plots. x-axis: p (0 to 1); y-axis: Performance; series: Vanilla, Table, TaPEx, Table Iter, TaPEx Iter.]
Figure 3: The step-wise reweighting results of KAS (Left) and SPICE (Right). The x-axis indicates the parameter p in the geometric distribution and also the importance of the preceding step, and the y-axis indicates the weighted result of each step. A larger coefficient means that the previous step is more important.

Effect of model sizes.

Table 1 shows that small models can perform as well as or even better than large models in some cases. This is mainly due to the following reasons. 1) The sentences in plans are simpler than those in other NLG tasks, with a smaller vocabulary and shorter length, which leaves the generation power of large models unexpressed. 2) G-PlanET is a task that examines the ability to plan rather than to write. Whether this ability changes with model size remains to be explored. 3) For scenarios with the table, the form of the task differs from that of traditional generation tasks, so the training phase has a greater impact. Models with fewer parameters are tuned more sufficiently with limited data.

5 Analysis

In this section, we analyze the performance of the methods in Table 1 in depth from multiple aspects and provide non-trivial findings that can help future research. For a fair comparison, all analytical experiments were performed with the BART-large model on the unseen split of the test data.

5.1 Temporal Re-weighting of Scores

When we compute the overall score of a plan with a metric, we average the scores of the individual steps. However, in a realistic environment, there are causality constraints for an agent to complete the steps: some tasks can only be done when their prerequisite steps are finished. For example, only when the agent arrives at the microwave can it heat the bread in its hands.

Therefore, the earlier steps in a plan should be of higher importance, while our previous evaluation is based on a uniform distribution of the weights across steps. To this end, we adopt geometric distribution to re-weight the step-wise importance for weighted aggregation. The geometric distribution can be used to model the number of failures before the first success in repeated mutually independent Bernoulli trials, each with a probability of success p.

$$f(x) = p(1-p)^{x}, \quad 0 < p < 1$$
[Figure 4 plot. x-axis: Step Range with bins (-∞,-5], (-5,-3], (-3,-1], 0, (0,+∞); y-axis: Count (0 to 500); series: Vanilla, Table, TaPEx, Table Iter, TaPEx Iter.]
Figure 4: Error statistics for the predicted number of steps.

This suits our setup well because when the first step is incorrect, a generated plan for ALFRED can hardly be executed to completion. The range of p in the original setting of the geometric distribution is restricted to between 0 and 1. When p=0, each step has the same weight (uniform importance), which is exactly what we have done in Table 1. When p=1, the first step is the only thing we look at for evaluation, meaning that the other steps will be given zero weights for aggregation.
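The re-weighted aggregation can be sketched as below, with step x (counted from 0) receiving a weight proportional to p(1-p)^x. Two details are assumptions, since the text does not spell them out: the weights are normalized to sum to one over the steps of a plan, and the endpoint p=0 is treated as the uniform limit of the normalized weights, because the raw formula gives all-zero weights there.

```python
from typing import List


def geometric_weights(num_steps: int, p: float) -> List[float]:
    """Normalized step weights proportional to p * (1 - p) ** x."""
    if num_steps <= 0:
        return []
    if p <= 0.0:
        # Assumed limit as p -> 0: every step gets the same weight.
        return [1.0 / num_steps] * num_steps
    if p >= 1.0:
        # Only the first step counts.
        return [1.0] + [0.0] * (num_steps - 1)
    raw = [p * (1.0 - p) ** x for x in range(num_steps)]
    total = sum(raw)
    return [w / total for w in raw]


def reweighted_score(step_scores: List[float], p: float) -> float:
    """Aggregate per-step scores with geometric weights."""
    weights = geometric_weights(len(step_scores), p)
    return sum(w * s for w, s in zip(weights, step_scores))


step_scores = [1.0, 0.0, 0.5, 1.0]
uniform = reweighted_score(step_scores, 0.0)     # plain average
first_only = reweighted_score(step_scores, 1.0)  # score of the first step
```

Sweeping p from 0 to 1 with a function like this moves the aggregate from the plain average used in Table 1 toward the score of the first step alone.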

Figure 3 shows the results on the unseen subset, which is more realistic. The performance of the iterative and non-iterative approaches is very close when only the first step is considered. This is mainly because iterative methods behave like non-iterative methods when generating the first step, and differ only from the second step onward. At the same time, there is an overall downward trend in performance as the focus moves to the earlier steps. The main reason is that the later a subtask is, the closer it is to the high-level instruction. For example, if the task goal is to place the sponge in the sink, the final step must be to place the sponge in the sink. This makes the last step easy to generate, resulting in high performance. We also see that the performance of the non-iterative methods first rises and then falls in KAS, while it trends downward in SPICE. The main reason is an error in the number of steps predicted by the non-iterative methods, which is explained next.

5.2 Error Analysis on the Lengths of Plans

We found a large gap between iterative and non-iterative methods in predicting the number of steps in a task, which may be an important reason for the final performance difference. As shown in Figure 4, iterative methods are more likely to predict the correct number of steps, while non-iterative methods tend to underestimate it. In our evaluation framework, the missing follow-up steps of non-iterative methods are filled in by duplicating the last generated step. This might explain both the poorer performance of non-iterative methods and why their performance first increases during step re-weighting.
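The length-error statistics can be tallied with a short helper like the one below. The sign convention (predicted length minus reference length, so that negative values mean the plan is too short) and the bin edges are assumptions inferred from the description of Figure 4; the function names are illustrative.

```python
from collections import Counter
from typing import Iterable, Tuple


def length_error_bin(pred_len: int, ref_len: int) -> str:
    """Bucket the signed error in the number of steps."""
    diff = pred_len - ref_len  # negative: the plan is too short
    if diff <= -5:
        return "(-inf,-5]"
    if diff <= -3:
        return "(-5,-3]"
    if diff <= -1:
        return "(-3,-1]"
    if diff == 0:
        return "0"
    return "(0,+inf)"


def length_error_histogram(pairs: Iterable[Tuple[int, int]]) -> Counter:
    """Count plans per error bin from (predicted_len, reference_len) pairs."""
    return Counter(length_error_bin(p, r) for p, r in pairs)


# Toy lengths for illustration only; these are not results from the paper.
histogram = length_error_histogram([(3, 8), (7, 7), (7, 7), (6, 7), (9, 7)])
```

A method that predicts plan lengths well concentrates its counts in the "0" bin, while systematic underestimation shows up as mass in the negative bins.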

5.3 Impact of Task Length on Performance

Although all the tasks in the dataset are daily-life tasks, they differ in difficulty. A simple measure of a task's difficulty is the number of steps it requires. Figure 5 shows that the quality of the generated steps decreases as the number of task steps increases. The figure also shows that the methods differ relatively little on shorter tasks, and that the performance of all methods degrades rapidly on the longest tasks. The iterative approach brings larger performance benefits on longer tasks. This may be because it makes better use of the state changes caused by intermediate steps and fixes some previous errors.

(0,5](5,8](8,10](10,13)0.00.150.300.450.60Step RangePerformanceVanillaTableTaPExTable IterTaPEx Iter
Figure 5: KAS on tasks with different numbers of steps. Because the small number of samples for some lengths causes large variance, we group the tasks into step-count intervals and report statistics per interval.
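
The interval grouping used for Figure 5 can be sketched as below. The bin edges are read off the figure's x-axis; everything else (the function names, the choice of the mean as the per-interval statistic, and the toy scores) is an assumption made for illustration rather than a description of our actual analysis script.

```python
from typing import Dict, List, Optional, Tuple

# (low, high, right_closed, label), following the x-axis labels of Figure 5.
BINS: List[Tuple[int, int, bool, str]] = [
    (0, 5, True, "(0,5]"),
    (5, 8, True, "(5,8]"),
    (8, 10, True, "(8,10]"),
    (10, 13, False, "(10,13)"),
]


def bin_label(num_steps: int) -> Optional[str]:
    """Return the interval label for a task length, or None if it fits no bin."""
    for low, high, right_closed, label in BINS:
        upper_ok = num_steps <= high if right_closed else num_steps < high
        if low < num_steps and upper_ok:
            return label
    return None


def mean_score_by_bin(tasks: List[Tuple[int, float]]) -> Dict[str, float]:
    """Average a task-level score within each step-count interval."""
    grouped: Dict[str, List[float]] = {}
    for num_steps, score in tasks:
        label = bin_label(num_steps)
        if label is not None:
            grouped.setdefault(label, []).append(score)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}


# Toy (number of steps, score) pairs with made-up scores.
toy_tasks = [(3, 0.5), (5, 0.25), (7, 0.5), (12, 0.125)]
bin_means = mean_score_by_bin(toy_tasks)
```

Grouping by interval trades resolution for stability: lengths with very few samples are pooled with neighbouring lengths instead of being reported on their own.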

6 Related Work

Grounded commonsense reasoning.

ALFWorld (Shridhar et al. 2021) also uses an LM to generate the next step in a text game that is based on ALFRED. ScienceWorld (Wang et al. 2022) designs a text game to test whether LMs have learned to reason about commonsense. SayCan (Ahn et al. 2022) also uses an LM to find the potential next step in the real world. All three of these works only aim to learn the next step. Our methods share a similar motivation with Decision Transformer (Chen et al. 2021) and Behavior Cloning (Farag and Saleh 2018), but we work on very different applications.

Table-based NLP.

Our work is closely related to two lines of tabular data usage in NLP: the modeling of tabular representations and the application of a table as an intermediate representation. For the first line of work, there is rich literature on modeling tabular representations, including TabNet (Arik and Pfister 2021), TAPAS (Herzig et al. 2020), TaBERT (Yin et al. 2020) and TaPEx (Liu et al. 2022b). We have explored the impact of state-of-the-art table representation models (e.g., TaPEx) on our task in experiments. As for the second line of work, previous work has explored the use of tables in several downstream tasks, including visual question answering (Yi et al. 2018), code modeling (Pashakhanloo et al. 2022), and numerical reasoning (Pi et al. 2022; Yoran, Talmor, and Berant 2022). Different from them, our work is the first to explore the use of tabular representations in embodied tasks.

ALFRED Agents.

Since the appearance of ALFRED, several studies have been published on embodied tasks in realistic environments. E.T. (Pashevich, Schmid, and Sun 2021) first encoded the history with a transformer to solve compositional tasks and showed that pretraining and joint training with synthetic instructions can improve performance. FILM (Min et al. 2022) proposed an explicit spatial memory and a semantic search policy to provide a more effective representation for state tracking and guidance. LEBP (Liu et al. 2022a), the current published state-of-the-art method, generates a sequence of sub-steps by understanding the language instruction and uses predefined action templates to complete the sub-steps. We also tried to use these methods to evaluate our generated low-level instructions. However, because the low-level instructions are of limited importance to these agents, there is no conspicuous gap between our generated instructions and the ones in ALFRED.

7 Conclusion

In this work, we present the first investigation into grounded planning for embodied tasks using language models. The G-PlanET problem is of utmost significance for advancing the embodied intelligence of LMs and constitutes a critical step towards artificial general intelligence. To evaluate the performance of encoder-decoder LMs in solving G-PlanET, we developed a benchmark as well as a specialized evaluation metric named KAS to assess the quality of generated plans. Furthermore, we propose two methods for improving LMs’ ability in G-PlanET: flattening object tables and an iterative decoding strategy. Our experiments and analyses demonstrate their effectiveness and yield non-trivial findings. We expect this study to encourage further research into G-PlanET and pave the way for integrating LMs and embodied tasks in realistic environments.

The main limitations of this work on the new task G-PlanET are as follows:

  • Evaluation: Although we have adopted and devised automatic metrics for evaluating methods for G-PlanET, there is not yet a straightforward way for us to test the ultimate success rates of such plans (if they were executed by oracle agents). We tried using state-of-the-art ALFRED agents such as FILM (Min et al. 2022), but they did not show obvious differences when given step-by-step instructions (even with the oracle version). We believe more human evaluation will help us further refine the metrics, although it can be very expensive, because human annotators must interact with the 3D engine while following these instructions in order to assess the quality of such plans.

  • Methods: Flattening object tables into sequences of tokens row by row is straightforward but might not be optimal. The number of objects can be huge for a complicated room. How can we narrow down the important objects at each step? We argue that a more advanced version of attention modules for dynamic table encoding is needed. We may not need to input the whole table for decoding at all steps. As a preliminary study, we created a retrieval augmentation method that only includes the oracle objects (those mentioned in the next step) as the input, but we saw little improvement. We think that incorporating more physical rules and mathematical computation over the object features will yield further improvement.
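
The two ideas in the Methods limitation, row-by-row flattening and keeping only the objects mentioned in the next step, can be sketched as follows. This is a hedged illustration only: the serialization format, the column names, and the substring matching are all hypothetical stand-ins, and they are not the exact linearization or oracle retrieval used in our experiments.

```python
from typing import Dict, List

Row = Dict[str, str]


def flatten_table(rows: List[Row], columns: List[str]) -> str:
    """Serialize an object table row by row into one string (hypothetical format)."""
    header = "columns: " + " | ".join(columns)
    body = [
        f"row {i}: " + " | ".join(str(row.get(col, "")) for col in columns)
        for i, row in enumerate(rows, start=1)
    ]
    return " ".join([header] + body)


def keep_mentioned_objects(rows: List[Row], step_text: str, name_column: str = "name") -> List[Row]:
    """Keep only rows whose object name appears in the given step text.

    Simple substring matching stands in for an oracle list of objects.
    """
    text = step_text.lower()
    kept = []
    for row in rows:
        name = row.get(name_column, "").lower()
        if name and name in text:
            kept.append(row)
    return kept


# Toy object table with made-up entries and hypothetical columns.
toy_rows = [
    {"name": "sponge", "parent": "countertop"},
    {"name": "sink", "parent": "counter"},
    {"name": "fridge", "parent": "floor"},
]
filtered = keep_mentioned_objects(toy_rows, "Place the sponge in the sink.")
flat = flatten_table(filtered, ["name", "parent"])
```

The flattened input grows linearly with the number of objects, which is why filtering before flattening is attractive for large rooms, even though the oracle variant brought little improvement in our preliminary study.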

References

  • Ahn et al. (2022) Ahn, M.; Brohan, A.; Brown, N.; Chebotar, Y.; Cortes, O.; David, B.; Finn, C.; Gopalakrishnan, K.; Hausman, K.; Herzog, A.; Ho, D.; Hsu, J.; Ibarz, J.; Ichter, B.; Irpan, A.; Jang, E.; Ruano, R. J.; Jeffrey, K.; Jesmonth, S.; Joshi, N. J.; Julian, R. C.; Kalashnikov, D.; Kuang, Y.; Lee, K.-H.; Levine, S.; Lu, Y.; Luu, L.; Parada, C.; Pastor, P.; Quiambao, J.; Rao, K.; Rettinghouse, J.; Reyes, D. M.; Sermanet, P.; Sievers, N.; Tan, C.; Toshev, A.; Vanhoucke, V.; Xia, F.; Xiao, T.; Xu, P.; Xu, S.; and Yan, M. 2022. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In Conference on Robot Learning.
  • Anderson et al. (2016) Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. SPICE: Semantic Propositional Image Caption Evaluation. In Proc. of ECCV.
  • Arik and Pfister (2021) Arik, S. Ö.; and Pfister, T. 2021. TabNet: Attentive Interpretable Tabular Learning. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, The Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2-9, 2021, 6679–6687. AAAI Press.
  • Brown et al. (2020) Brown, T. B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; Agarwal, S.; Herbert-Voss, A.; Krueger, G.; Henighan, T.; Child, R.; Ramesh, A.; Ziegler, D. M.; Wu, J.; Winter, C.; Hesse, C.; Chen, M.; Sigler, E.; Litwin, M.; Gray, S.; Chess, B.; Clark, J.; Berner, C.; McCandlish, S.; Radford, A.; Sutskever, I.; and Amodei, D. 2020. Language Models are Few-Shot Learners. In Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; and Lin, H., eds., Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  • Chen et al. (2021) Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; and Mordatch, I. 2021. Decision Transformer: Reinforcement Learning via Sequence Modeling. In Ranzato, M.; Beygelzimer, A.; Dauphin, Y. N.; Liang, P.; and Vaughan, J. W., eds., Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, 15084–15097.
  • Chen et al. (2020) Chen, W.; Wang, H.; Chen, J.; Zhang, Y.; Wang, H.; Li, S.; Zhou, X.; and Wang, W. Y. 2020. TabFact: A Large-scale Dataset for Table-based Fact Verification. In Proc. of ICLR. OpenReview.net.
  • Farag and Saleh (2018) Farag, W. A.; and Saleh, Z. 2018. Behavior Cloning for Autonomous Driving using Convolutional Neural Networks. 2018 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT).
  • Herzig et al. (2020) Herzig, J.; Nowak, P. K.; Müller, T.; Piccinno, F.; and Eisenschlos, J. 2020. TaPas: Weakly Supervised Table Parsing via Pre-training. In Proc. of ACL, 4320–4333. Online: Association for Computational Linguistics.
  • Huang et al. (2022) Huang, W.; Abbeel, P.; Pathak, D.; and Mordatch, I. 2022. Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvári, C.; Niu, G.; and Sabato, S., eds., International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, 9118–9147. PMLR.
  • Kolve et al. (2017) Kolve, E.; Mottaghi, R.; Han, W.; VanderBilt, E.; Weihs, L.; Herrasti, A.; Deitke, M.; Ehsani, K.; Gordon, D.; Zhu, Y.; Kembhavi, A.; Gupta, A. K.; and Farhadi, A. 2017. AI2-THOR: An Interactive 3D Environment for Visual AI. ArXiv preprint, abs/1712.05474.
  • Lewis et al. (2020) Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; and Zettlemoyer, L. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proc. of ACL, 7871–7880. Online: Association for Computational Linguistics.
  • Lin et al. (2020) Lin, B. Y.; Zhou, W.; Shen, M.; Zhou, P.; Bhagavatula, C.; Choi, Y.; and Ren, X. 2020. CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2020, 1823–1840. Online: Association for Computational Linguistics.
  • Lin (2004) Lin, C.-Y. 2004. ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, 74–81. Barcelona, Spain: Association for Computational Linguistics.
  • Liu et al. (2022a) Liu, H.; Liu, Y.; He, H.; and Yang, H. 2022a. LEBP - Language Expectation & Binding Policy: A Two-Stream Framework for Embodied Vision-and-Language Interaction Task Learning Agents. ArXiv preprint, abs/2203.04637.
  • Liu et al. (2022b) Liu, Q.; Chen, B.; Guo, J.; Ziyadi, M.; Lin, Z.; Chen, W.; and Lou, J.-G. 2022b. TAPEX: Table Pre-training via Learning a Neural SQL Executor. In Proc. of ICLR. OpenReview.net.
  • Min et al. (2022) Min, S. Y.; Chaplot, D. S.; Ravikumar, P. K.; Bisk, Y.; and Salakhutdinov, R. 2022. FILM: Following Instructions in Language with Modular Methods. In Proc. of ICLR. OpenReview.net.
  • Papineni et al. (2002) Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proc. of ACL, 311–318. Philadelphia, Pennsylvania, USA: Association for Computational Linguistics.
  • Pashakhanloo et al. (2022) Pashakhanloo, P.; Naik, A.; Wang, Y.; Dai, H.; Maniatis, P.; and Naik, M. 2022. CodeTrek: Flexible Modeling of Code using an Extensible Relational Representation. In Proc. of ICLR. OpenReview.net.
  • Pashevich, Schmid, and Sun (2021) Pashevich, A.; Schmid, C.; and Sun, C. 2021. Episodic Transformer for Vision-and-Language Navigation. In 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021, 15922–15932. IEEE.
  • Pi et al. (2022) Pi, X.; Liu, Q.; Chen, B.; Ziyadi, M.; Lin, Z.; Fu, Q.; Gao, Y.; Lou, J.-G.; and Chen, W. 2022. Reasoning Like Program Executors. In Proc. of EMNLP, 761–779. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
  • Raffel et al. (2020) Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; and Liu, P. J. 2020. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res., 21: 140:1–140:67.
  • Shridhar et al. (2020) Shridhar, M.; Thomason, J.; Gordon, D.; Bisk, Y.; Han, W.; Mottaghi, R.; Zettlemoyer, L.; and Fox, D. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 10737–10746. IEEE.
  • Shridhar et al. (2021) Shridhar, M.; Yuan, X.; Côté, M.-A.; Bisk, Y.; Trischler, A.; and Hausknecht, M. J. 2021. ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. In Proc. of ICLR. OpenReview.net.
  • Talmor et al. (2019) Talmor, A.; Herzig, J.; Lourie, N.; and Berant, J. 2019. CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge. In Proc. of NAACL-HLT, 4149–4158. Minneapolis, Minnesota: Association for Computational Linguistics.
  • Vedantam, Zitnick, and Parikh (2015) Vedantam, R.; Zitnick, C. L.; and Parikh, D. 2015. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015, 4566–4575. IEEE Computer Society.
  • Wang et al. (2022) Wang, R.; Jansen, P.; Côté, M.-A.; and Ammanabrolu, P. 2022. ScienceWorld: Is your Agent Smarter than a 5th Grader? In Proc. of EMNLP, 11279–11298. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics.
  • Wang et al. (2019) Wang, X.; Wu, J.; Chen, J.; Li, L.; Wang, Y.; and Wang, W. Y. 2019. VaTeX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, 4580–4590. IEEE.
  • Yi et al. (2018) Yi, K.; Wu, J.; Gan, C.; Torralba, A.; Kohli, P.; and Tenenbaum, J. 2018. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In Bengio, S.; Wallach, H. M.; Larochelle, H.; Grauman, K.; Cesa-Bianchi, N.; and Garnett, R., eds., Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 1039–1050.
  • Yin et al. (2020) Yin, P.; Neubig, G.; Yih, W.-t.; and Riedel, S. 2020. TaBERT: Pretraining for Joint Understanding of Textual and Tabular Data. In Proc. of ACL, 8413–8426. Online: Association for Computational Linguistics.
  • Yoran, Talmor, and Berant (2022) Yoran, O.; Talmor, A.; and Berant, J. 2022. Turning Tables: Generating Examples from Semi-structured Tables for Endowing Language Models with Reasoning Skills. In Proc. of ACL, 6016–6031. Dublin, Ireland: Association for Computational Linguistics.
  • Zhang et al. (2020) Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K. Q.; and Artzi, Y. 2020. BERTScore: Evaluating Text Generation with BERT. In Proc. of ICLR. OpenReview.net.