DriveMLM: Aligning Multi-Modal Large Language Models with
Behavioral Planning States for Autonomous Driving

Wenhai Wang$^{1,2*}$, Jiangwei Xie$^{3*}$, ChuanYang Hu$^{3*}$, Haoming Zou$^{4*}$, Jianan Fan$^{3*}$, Wenwen Tong$^{3*}$, Yang Wen$^{3*}$, Silei Wu$^{3*}$, Hanming Deng$^{3*}$, Zhiqi Li$^{1,5*}$, Hao Tian$^{3}$, Lewei Lu$^{3}$, Xizhou Zhu$^{6}$, Xiaogang Wang$^{2,3}$, Yu Qiao$^{1}$, Jifeng Dai$^{1,6}$ 🖂

$^{1}$OpenGVLab, Shanghai AI Laboratory
$^{2}$The Chinese University of Hong Kong
$^{3}$SenseTime Research
$^{4}$Stanford University
$^{5}$Nanjing University
$^{6}$Tsinghua University
https://github.com/OpenGVLab/DriveMLM

Abstract

Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of LLMs in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform closed-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between language decisions and vehicle control commands by standardizing the decision states according to an off-the-shelf motion planning module. (2) We employ a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system, which takes driving rules, user commands, and inputs from various sensors (e.g., camera, LiDAR) as input, makes driving decisions, and provides explanations. This model can be plugged into existing AD systems such as Apollo for closed-loop driving. (3) We design an effective data engine to collect a dataset that includes decision states and corresponding explanation annotations for model training and evaluation. We conduct extensive experiments and show that our model achieves a driving score of 76.1 on CARLA Town05 Long, surpassing the Apollo baseline by 4.7 points under the same settings, which demonstrates the effectiveness of our model. We hope this work can serve as a baseline for autonomous driving with LLMs.

1 Introduction

[Figure: (a) Rule-Based Autonomous Driving System [3]]

* Equal contribution. 🖂 Corresponding author (daijifeng@tsinghua.edu.cn).

Autonomous driving (AD) has undergone significant advancements in recent years, evolving from traditional rule-based systems, which rely on a predefined set of rules informed by prior knowledge (see Figure 1(a)), to data-driven, end-to-end systems, as illustrated in Figure 1(b). Despite their advancements, these systems are limited by the constraints of expert knowledge or the diversity of training data. This makes it challenging for them to handle corner cases, even those that human drivers find intuitive to deal with. In contrast to these traditional rule-based or data-driven AD planners, large language models (LLMs), trained on web-scale text corpora, are equipped with extensive world knowledge, robust logical reasoning, and advanced cognitive capabilities. These features position them as potential planners in AD systems, providing a human-like approach to autonomous driving.

Several recent studies [16, 72, 68, 39, 24, 13, 56] have attempted to integrate LLMs into AD systems, focusing on generating language-based decisions in response to driving scenarios. However, these approaches have limitations when it comes to performing closed-loop driving in real-world environments or realistic simulators. This is because the outputs of LLMs are mainly linguistic and conceptual, and cannot be used directly for vehicle control. In traditional modular AD systems [3, 22, 21], the gap between high-level strategic goals and low-level operational actions is bridged by a behavioral planning module, whose decision states can be easily transformed into vehicle control signals by the follow-up motion planning and control modules. This motivates us to align the LLM with the decision states of the behavioral planning module, and further to design an LLM-based closed-loop AD system that can run in real-world environments or realistic simulators by using the aligned LLM for behavioral planning.

Based on this point, we propose DriveMLM, the first LLM-based AD framework that can perform closed-loop autonomous driving in realistic simulators. To achieve this, we have three key designs: (1) We investigate the decision states of the behavioral planning module of the well-developed Apollo system [3], and transform them into forms that can be easily processed by LLMs. (2) We develop a multi-modal LLM (MLLM) planner that accepts the current multi-modal inputs, including multi-view images, LiDAR point clouds, traffic rules, system messages, and user instructions, and predicts the decision state. (3) To obtain enough training data for behavioral planning state alignment, we manually collect 280 hours of driving data on CARLA, and convert them into decision states and corresponding explanation annotations with an efficient data engine. With these designs, we obtain an MLLM planner that makes decisions based on the driving scenes and user requirements, and whose decisions can be easily converted into vehicle control signals for closed-loop driving.

Our work has the following advantages: (1) Benefiting from the aligned decision states, our MLLM planner can be easily integrated with existing modular AD systems, such as Apollo, to achieve closed-loop driving without requiring any major changes or modifications. (2) By taking language instruction as input, our model can handle both user needs (e.g., overtaking a car) and high-level system messages (e.g., defining basic driving logic). This makes our model more flexible and adaptable to different driving situations and corner cases. (3) It can provide interpretability and explain different decisions. This enhances the transparency and trustworthiness of our model, as it can explain its actions and choices to the user.

In summary, the contributions of this work are threefold:

(1) We propose an LLM-based AD framework that bridges the gap between LLM and closed-loop driving by aligning the output of LLMs with the decision states of behavioral planning modules.

(2) To implement this framework, we tailor a set of decision states with forms that can be easily processed by LLMs, design an MLLM planner for decision prediction, and develop a data engine that can effectively generate decision states and corresponding explanation annotation for model training and evaluation.

(3) To validate the effectiveness of our method, we not only evaluate it on closed-loop driving metrics, including driving score (DS) and miles per intervention (MPI), but also use understanding metrics, including accuracy and F1-measure for decision states and BLEU-4, CIDEr, and METEOR for decision explanations, to evaluate the driving understanding capability of our model. Notably, our method achieves 76.1 DS and 0.955 MPI on CARLA Town05 Long, which are 4.7 points higher and 1.25 times better than Apollo, respectively. Moreover, we can change the decision of the MLLM planner by using language instructions to describe special requirements (e.g., yielding to an ambulance) or traffic rules, as shown in Figure 2.

2 Related Work

2.1 Multi-Modal Large Language Models

The swift evolution of Large Language Models (LLMs) [53, 54, 7, 47, 46] has recently given rise to the emergence of multi-modal LLMs (MLLMs) [1, 26, 38, 37, 83, 17, 78, 12, 67, 51, 30, 2, 23, 33, 79, 71, 29, 80], which augment language models with the capacity to analyze and comprehend information from diverse modalities. Prominent instances of such advancements include GPT-4 [46], Flamingo [1], KOSMOS-1 [26], LLaVA series [38, 37], and MiniGPT-4 [83], as well as InstructBLIP [17]. These models have integrated visual instruction tuning methodologies to enhance the MLLMs’ ability to adhere to prescribed instructions. Furthermore, mPLUG-DocOwl [78] has broadened the document comprehension capabilities of MLLMs by incorporating digital document datasets. Concurrently, Shikra [12], VisionLLM [67], KOSMOS-2 [51], LISA [30], and Qwen-VL [2] have augmented MLLMs with visual grounding capabilities, empowering them to detect or segment objects in accordance with user prompts. The introduction of VideoChat [33] and VideoLLaMA [79] has ushered in the integration of video processing capabilities into LLMs. Additionally, NExT-GPT [71] has introduced a modality-switching instruction tuning technique for multi-modal prompt tuning, facilitating the handling of inputs and outputs in any combination of text, images, videos, and audio. ASM [29] and GPT4RoI [80] introduce region-level recognition and understanding capability into LLMs. These endeavors demonstrate the effectiveness and generalizability of LLMs, establishing a foundation for open-world tasks.

2.2 Intelligent Agents with Large Language Models

A burgeoning application of LLMs is their role in facilitating interaction and communication among intelligent agents (e.g., robots, virtual assistants, or game characters) and various entities, including humans, the environment, or even the intelligent agents themselves. Several API-based methods, including Visual ChatGPT [69], MM-REACT [77], HuggingGPT [59], InternGPT [40], ViperGPT [62], ControlLLM [41], and GPT4Tool [76] have attempted to integrate diverse modal APIs with LLMs to accomplish complex tasks in the open world, such as image editing, video processing, and audio synthesis. These methods allow language models to perform complex real-world tasks by following natural language instructions. In parallel, alternative research initiatives, such as Camel [31], AutoGPT [75], MetaGPT [24] and Smallville [50], investigate the utility of LLMs in the context of role-playing conversations or communication games. Additionally, within the domain of embodied AI, works such as PaLM-E [19], EmbodiedGPT [45], and the RT series [5, 6, 48] leverage LLMs to generate natural language actions, thereby controlling embodied agents proficient in executing navigation, manipulation, and interaction tasks within real or 3D environments. These works demonstrate the notable advancements achieved by LLMs in the realm of intelligent agent control.

2.3 Autonomous Driving Models

The development of autonomous driving (AD) models has accelerated rapidly in recent years, giving rise to many disruptive and groundbreaking technologies. Notably, open-source frameworks such as Apollo [3] and Autoware [22] have played pivotal roles by furnishing robust tools and resources, thereby facilitating the development of autonomous driving technology and contributing to its widespread adoption and progression. In terms of AD perception, BEV (Bird’s Eye View) [34, 73, 36, 61] and Occupancy Network [63, 60, 35] have become essential components of autonomous vehicles, helping them better understand the surrounding environment and make corresponding decisions. The decision-making process in conventional autonomous driving systems typically relies on finite state machines [14]. These systems often require the manual creation of numerous rules to determine the states and the conditions for transitioning between them. However, given the ever-changing nature of the world, it is usually laborious to design rules that cover all real-world scenarios. In recent years, end-to-end autonomous driving models have also made remarkable progress, such as UniAD [25], which adopts a novel end-to-end approach that directly integrates perception, prediction, and planning, avoiding the information loss and efficiency issues of the traditional modular design. Recently, open-source simulators [18, 66, 82] have been proposed to bridge the gap between model prediction and closed-loop control. Among them, CARLA [18], featuring comprehensive sensor simulations and realistic environments, is the benchmark most widely used by state-of-the-art methods [27, 58, 57, 28, 15, 10, 11, 9] for evaluating closed-loop performance.

Recent works [16, 72, 43, 68, 39, 13, 56] change our perception by introducing LLMs for driving planning, opening up a new direction for the autonomous driving field. As early explorations, some [68, 56] use ChatGPT and GPT-4 to predict driving decisions. Subsequent works fine-tune LLMs to predict driving signals [13], trajectories [43], or a designed decision space [39], conditioned only on language input. DriveGPT4 [72] fine-tunes a multi-modal LLM to predict control signals. However, DriveGPT4 is constrained to input from a monocular camera, limiting its ability to construct comprehensive scene information. None of the LLM-based works above is evaluated in closed-loop driving on realistic simulators, because either the linguistic decisions of LLMs are hard to transform into reliable control signals, or the direct prediction of control signals by LLMs remains far from real-time closed-loop driving.

3 Proposed Method

3.1 System Overview

The DriveMLM framework integrates the world knowledge and reasoning capabilities of large language models (LLMs) into an autonomous driving (AD) system, achieving closed-loop driving in realistic simulators. As illustrated in Figure 3, this framework has three key designs: (1) Behavioral Planning States Alignment. This part aligns LLM’s linguistic decision outputs with the behavioral planning module of a well-established modular AD system like Apollo [3]. In this way, the output of LLM can be easily transformed into vehicle control signals. (2) MLLM Planner. It is a combination of a multi-modal tokenizer and a multi-modal LLM (MLLM) decoder. The multi-modal tokenizer transforms diverse inputs like multi-view images, LiDAR, traffic rules, and user requirements into unified tokens, and the MLLM decoder makes decisions based on the unified tokens. (3) Efficient Data Collection Strategy. It introduces a tailored data collection method for LLM-based autonomous driving, ensuring a comprehensive dataset encompassing decision states, decision explanations, and user commands.

During inference, the DriveMLM framework leverages multi-modal data to make driving decisions. These data include multi-view images $I\!\in\!\mathbb{R}^{T\times N_{I}\times H\times W\times 3}$ , where $T$ denotes the time length, $N_{I}$ indicates the number of views, and $H$ and $W$ denote the height and width of the images; LiDAR point clouds $L\!\in\!\mathbb{R}^{K\times 4}$ , with $K$ representing the number of points; the system message $M\!\in\!\mathbb{R}^{N_{M}}$ , with $N_{M}$ representing the number of system message tokens; and user instructions $U\!\in\!\mathbb{R}^{N_{U}}$ , where $N_{U}$ stands for the number of user instruction tokens. The system message gathers the task definition, traffic rules, and decision state definitions. These inputs are tokenized by a multi-modal tokenizer, resulting in $X_{I}\!\in\!\mathbb{R}^{N_{I}\times N_{Q}\times D}$ , $X_{L}\!\in\!\mathbb{R}^{N_{Q}\times D}$ , $X_{M}\!\in\!\mathbb{R}^{N_{M}\times D}$ , and $X_{U}\!\in\!\mathbb{R}^{N_{U}\times D}$ , which are the token embeddings of the multi-view images, LiDAR point clouds, system message, and user instructions, respectively. Here, $N_{Q}$ denotes the number of output tokens, which is determined by the number of queries of QFormer [32], and each token embedding has dimension $D$ . Next, these tokens are fed into the MLLM decoder, which generates the decision state token $S$ along with a corresponding explanation $E$ . Finally, the decision state $S$ is passed to a motion planning and control module, which computes the final trajectory for vehicle control.
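As a rough illustration of the shapes above, the following sketch counts how many token embeddings reach the MLLM decoder. It assumes the four token groups are simply concatenated into one sequence; the class name `TokenShapes` and the values chosen for $N_{M}$, $N_{U}$, and $D$ are hypothetical and used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenShapes:
    """Sizes of the tokenizer outputs X_I, X_L, X_M, X_U (illustrative only)."""

    n_views: int    # N_I, number of camera views
    n_queries: int  # N_Q, QFormer queries per view (and for LiDAR)
    n_system: int   # N_M, system message tokens
    n_user: int     # N_U, user instruction tokens
    dim: int        # D, embedding dimension

    def decoder_sequence_length(self) -> int:
        # Assumption: all groups are concatenated along the token axis.
        image_tokens = self.n_views * self.n_queries  # X_I: N_I x N_Q x D
        lidar_tokens = self.n_queries                 # X_L: N_Q x D
        return image_tokens + lidar_tokens + self.n_system + self.n_user


# N_M = 200, N_U = 20, and D = 4096 are made-up values for this example.
shapes = TokenShapes(n_views=4, n_queries=32, n_system=200, n_user=20, dim=4096)
print(shapes.decoder_sequence_length())
```

Note that the time length $T$ does not appear in the count, since $X_{I}$ has no temporal axis.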

3.2 Behavioral Planning States Alignment

Transforming the linguistic choices of Large Language Models (LLMs) into actionable control signals is crucial for vehicle control. To achieve this, we align the LLM’s outputs with the decision states of the behavioral planning module in the popular Apollo system. Following common practice [3], we divide the decision-making process into two categories: speed decision and path decision. Specifically, the speed decision states contain [KEEP, ACCELERATE, DECELERATE, STOP], while the path decision states include [FOLLOW, LEFT_CHANGE, RIGHT_CHANGE, LEFT_BORROW, RIGHT_BORROW].

To enable a language model to make precise predictions among these states, we establish a comprehensive link between linguistic descriptions and decision states, as illustrated in the System Message of Table 1. This correspondence is used as part of the system message and is integrated into the MLLM planner. As a result, once the LLM describes a certain situation, the prediction converges to a clear decision within the decision space. At each time step, one speed decision and one path decision are jointly inferred and sent to the motion planning framework. More detailed definitions of the decision states can be found in the supplementary material.
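The decision space above is small enough to be written down as two enumerations. The sketch below shows one possible way to turn a textual answer such as `RIGHT_CHANGE, KEEP` into a typed pair. It assumes the answer format of the examples in Table 1 (path decision first, then speed decision, separated by a comma); the names `SpeedDecision`, `PathDecision`, and `parse_decision` are hypothetical and not part of the DriveMLM code.

```python
from enum import Enum
from typing import Tuple


class SpeedDecision(Enum):
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"


class PathDecision(Enum):
    FOLLOW = "FOLLOW"
    LEFT_CHANGE = "LEFT_CHANGE"
    RIGHT_CHANGE = "RIGHT_CHANGE"
    LEFT_BORROW = "LEFT_BORROW"
    RIGHT_BORROW = "RIGHT_BORROW"


def parse_decision(text: str) -> Tuple[PathDecision, SpeedDecision]:
    """Parse '<PATH>, <SPEED>' into enum members; raise ValueError otherwise."""
    parts = [part.strip().upper() for part in text.split(",")]
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated states, got {text!r}")
    path_name, speed_name = parts
    try:
        return PathDecision(path_name), SpeedDecision(speed_name)
    except ValueError as exc:
        raise ValueError(f"unknown decision state in {text!r}") from exc


print(parse_decision("RIGHT_CHANGE, KEEP"))
```

Rejecting anything outside the two enumerations keeps free-form text from reaching the downstream motion planner in this sketch.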

System Message: You are a driving assistant to drive the car. You need to follow the navigation command and traffic rules. The traffic rule is … Path decisions include [FOLLOW, …]. Path decision definitions: ‘FOLLOW’ means …, Speed decisions include [KEEP, …]. Speed decision definitions: ‘KEEP’ means …, Given navigation command and driving scene obtained from camera or LiDAR, You should choose a path decision and a speed decision from the predefined options and give the explanation of your decision.
Q1(Human): Caption instruction (e.g. Describe the current driving environment.)
A1(DriveMLM): Caption response (e.g. It is currently daytime. A red car is driving away in front of ego.)
Q2(Human): Navigation instruction (e.g. The navigation command is turned right. Please choose a path decision and a speed decision.)
A2(DriveMLM): Speed and path decision (e.g. RIGHT_CHANGE, KEEP)
Q3(Human): Explanation instruction (e.g. Please explain why to choose these decisions.)
A3(DriveMLM): Explanation response (e.g. Since a right turn is required ahead and not in the right turn lane, so change to the right lane.)
Q4(Human): Instruction (e.g. I’m in a hurry. Can you overtake the front car?)
A4(DriveMLM): Speed and path decision (e.g. LEFT_CHANGE, ACCELERATE)
Q5(Human): Explanation instruction (e.g. Please explain why to choose these decisions.)
A5(DriveMLM): Explanation response (e.g. Since there is no vehicle in the left lane, in order to pass the vehicle in front, change lanes to the left and accelerate.)

Table 1: Examples of system message and interaction between user and DriveMLM system. The system message includes the description of the driving task, the traffic rules, and the definition of decision states. Given driving scenes such as images and user prompts, the driving system can infer the image caption, path, and speed decision, and additional explanation. Complete system messages and prompts are provided in the supplementary.

3.3 MLLM Planner

The MLLM planner of DriveMLM consists of two components: the multi-modal tokenizer and the MLLM decoder. The two components collaborate closely, handling a variety of inputs to accurately determine driving decisions and provide explanations for these decisions.

Multi-Modal Tokenizer. This tokenizer is engineered to handle various forms of input efficiently:

(1) For temporal multi-view images: We use a temporal QFormer to process multi-view images from timestamp $-T$ to $0$ (the current timestamp). First, it takes each view $I_{i}^{-T}$ at timestamp $-T$ and feeds it to ViT-g and QFormer with $N_{Q}$ randomly initialized queries of dimension $D$ . This produces the image token embedding $X_{I_{i}^{-T}}\!\in\!\mathbb{R}^{N_{Q}\times D}$ . Then, using the image token embedding $X_{I_{i}^{-T}}$ as the queries of QFormer, we obtain the image token embedding of the next timestamp, $X_{I_{i}^{-T+1}}$ , by repeating the first step. We repeat these two steps until we obtain the image token embedding of the current timestamp, $X_{I_{i}^{0}}$ , which gathers all temporal information from $-T$ to $0$ . Because the number of tokens per view stays fixed at $N_{Q}$ , this approach avoids the linear increase in resources required to process time-series data as the time length increases.
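The recurrence in step (1) can be sketched in a few lines. In the toy version below, a single-head dot-product cross-attention without learned weights stands in for the real QFormer, and plain lists of floats stand in for ViT-g features; both are simplifying assumptions, and the function names are hypothetical. The point is only the data flow: the output tokens of one timestamp become the queries of the next, so the result always has $N_{Q}$ tokens per view.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def cross_attend(queries: Matrix, features: Matrix) -> Matrix:
    """Toy stand-in for a QFormer: softmax(Q F^T / sqrt(D)) F, no learned weights."""
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
                  for feat in features]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * feat[d] for w, feat in zip(weights, features)) / total
            for d in range(dim)
        ])
    return outputs


def temporal_tokens(frames: List[Matrix], init_queries: Matrix) -> Matrix:
    """Process one view's frames from oldest to newest, reusing tokens as queries."""
    queries = init_queries
    for features in frames:
        queries = cross_attend(queries, features)
    return queries


rng = random.Random(0)


def random_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 4 queries, 8-dim features, 6 patch features per frame.
tokens = temporal_tokens([random_matrix(6, 8) for _ in range(3)], random_matrix(4, 8))
print(len(tokens), len(tokens[0]))
```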

(2) For LiDAR data, we first feed the point clouds to the Sparse Pyramid Transformer (SPT) backbone [74] to extract LiDAR features. Then we employ QFormer with $N_{Q}$ randomly initialized queries of dimension $D$ to obtain the point cloud token embedding $X_{L}\!\in\!\mathbb{R}^{N_{Q}\times D}$ . We concatenate it with the image token embedding.

(3) For system messages and user instructions, we simply treat them as normal text data and use a token embedding layer of LLM to extract their embedding, $X_{M}\!\in\!\mathbb{R}^{N_{M}\times D}$ , $X_{U}\!\in\!\mathbb{R}^{N_{U}\times D}$ .

MLLM Decoder. The decoder is the core that translates the tokenized inputs into decision states and decision explanations. To this end, we design a system message template for LLM-based AD, which is shown in Table 1. We see that the system messages contain a description of the AD tasks, traffic rules, the definition of decision states, and placeholders indicating where each modality’s information is incorporated. This approach ensures that inputs from various modalities and sources are seamlessly integrated.

The output is formatted to provide decision states (see Q2 and A2 of Table 1) and an explanation of the decisions (see Q3 and A3 of Table 1), offering transparency and clarity in the decision-making process. For supervision, our framework follows common practice and uses a cross-entropy loss with next-token prediction. In this way, the MLLM planner can perform detailed understanding and processing of data from different sensors and sources, and transform it into appropriate decisions and explanations.
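For readers unfamiliar with this objective, the sketch below computes a next-token cross-entropy loss from raw logits in plain Python. It assumes the logits are already aligned with their targets (row $t$ scores the token at position $t$) and averages over all positions; whether DriveMLM masks prompt tokens out of the loss is not stated here, so no masking is modeled. The function name is hypothetical.

```python
import math
from typing import Sequence


def next_token_loss(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Mean negative log-likelihood of the target token ids under softmax(logits)."""
    if len(logits) != len(targets) or len(targets) == 0:
        raise ValueError("logits and targets must be non-empty and of equal length")
    total = 0.0
    for row, target in zip(logits, targets):
        peak = max(row)  # log-sum-exp trick for numerical stability
        log_norm = peak + math.log(sum(math.exp(value - peak) for value in row))
        total += log_norm - row[target]
    return total / len(targets)


# With uniform logits over a 4-token vocabulary the loss equals log(4).
uniform = [[0.0, 0.0, 0.0, 0.0]] * 3
print(next_token_loss(uniform, [0, 1, 2]))
```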

3.4 Efficient Data Engine

We propose a data generation pipeline that can create decision states and explanation annotations from various scenarios in the CARLA simulator. This pipeline addresses the limitations of existing driving data, which lack the decision states and detailed explanations needed for training LLM-based AD systems. Our pipeline consists of two main components: data collection and data annotation.

The data collection is designed to improve decision variety while staying realistic. First, various challenging scenarios are constructed in the simulator, each requiring complex driving behaviors to pass through safely. Then, experts, either experienced human drivers or agents, are asked to drive safely through these scenarios, each of which is triggered at one of its many passable locations. Notably, interaction data is generated when the expert randomly raises a driving demand and drives accordingly. Once the expert drives safely to the destination, the data is recorded.

The data annotation mainly focuses on decisions and explanations. First, speed and path decision states are automatically annotated from the experts’ driving trajectories using hand-crafted rules. Second, initial explanation annotations are generated based on the scenario, which is dynamically defined by the elements currently nearby. Third, the generated explanation annotations are refined by human annotators, and their variety is expanded by GPT-3.5. In addition, the interaction content is also refined by human annotators, including cases that either execute or reject human requests. In this way, we avoid both the costly frame-by-frame annotation of decision states and the costly manual writing of explanation annotations from scratch, greatly speeding up our data annotation process.
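To make the idea of rule-based annotation concrete, here is a deliberately simple sketch that labels a speed decision from two consecutive ego speeds. The rule and both thresholds are hypothetical and chosen only for illustration; the actual hand-crafted rules used by the data engine are not given in this section.

```python
def annotate_speed_state(
    speed_now: float,
    speed_next: float,
    stop_eps: float = 0.1,    # hypothetical: below this speed (m/s) counts as stopped
    change_eps: float = 0.5,  # hypothetical: minimum speed change (m/s) to count
) -> str:
    """Label one trajectory step with a speed decision state (illustrative rule)."""
    if speed_next < stop_eps:
        return "STOP"
    delta = speed_next - speed_now
    if delta > change_eps:
        return "ACCELERATE"
    if delta < -change_eps:
        return "DECELERATE"
    return "KEEP"


# Label each step of a short, made-up speed profile (m/s).
speeds = [5.0, 5.1, 7.0, 4.0, 0.0]
print([annotate_speed_state(a, b) for a, b in zip(speeds, speeds[1:])])
```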

4 Experiments

4.1 Data Analysis

We have collected $280$ hours of driving data for training. These data consist of 50k routes, collected in 30 driving scenarios with different weather and lighting conditions across 8 maps (Town01, Town02, Town03, Town04, Town06, Town07, Town10HD, Town12) in CARLA. On average, each scenario has about 200 trigger points on each map, from which one is randomly triggered. Each scenario is either a common or a rare safety-critical driving situation. Details of these scenarios are in the supplementary material. For each frame, we collect images from 4 cameras facing the front, rear, left, and right, as well as the point clouds from a LiDAR sensor mounted at the center of the ego vehicle. All collected data have corresponding explanations and accurate decisions that successfully drive through the scenarios.

Table 2 presents the comparison with the previous datasets designed for driving understanding with natural language. Our data has two unique features. The first is the alignment of behavioral planning states. This enables us to transform the MLLM planner’s output to control signal so that our framework can control vehicles in closed-loop driving. The second is human interaction annotation. It is characterized by natural language instructions given by humans alongside the responding decisions and explanations. The objective is to improve the ability to understand human instructions and respond accordingly.

Dataset	Perception	Reason	Plan	Align	Interact
NuPrompt [70]	$\checkmark$
NuScenes-QA [52]	$\checkmark$	$\checkmark$
Rank2Tell [55]	$\checkmark$	$\checkmark$
BDD-X [72]		$\checkmark$	$\checkmark$
DRAMA [44]		$\checkmark$	$\checkmark$
DriveLM [16]	$\checkmark$	$\checkmark$	$\checkmark$
Ours	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$	$\checkmark$

Table 2: Comparisons of AD datasets for driving understanding. The alignment of behavioral planning states enables us to transform the MLLM planner’s output to control signal for closed-loop driving. The human interaction annotation enhances the model’s understanding of customized language instruction.

4.2 Implementation Details

Our MLLM model is built from LLaMA [23]. Specifically, we use ViT-g/14 from EVA-CLIP [20] as the visual encoder and LLaMA-7B [64] as the LLM. A querying transformer with $N_{Q}$ queries extracts image tokens from the ViT, where we set $N_{Q}=32$ . For the LiDAR encoder, we use the GD-MAE [74] model finetuned on ONCE [42]. Starting from the pre-trained Husky model, we train the MLLM with instruction-following data. We employ the AdamW optimizer with $\beta_{1}=0.9$ , $\beta_{2}=0.95$ , and cosine learning rate decay with a learning rate of $5\times 10^{-5}$ . We train for 2 epochs with a batch size of 256. We train the QFormer and the LLM to ensure the instruction-following ability of the LLM, so that its output follows the predefined format of path decision and speed decision. The resolution of the images input to the MLLM is set to $448\times 448$ .

Method	Type	Acc. (%) $\uparrow$	Path (F1) $\uparrow$			Speed (F1) $\uparrow$				BLEU-4 $\uparrow$	CIDEr $\uparrow$	METEOR $\uparrow$
			follow	change	borrow	keep	accelerate	decelerate	stop			
LLaVA 1.5 [37]	LLM	22.92	0.73	0.00	0.00	0.75	0.00	0.02	0.00	10.00	18.03	23.00
InstructBLIP [17]	LLM	17.92	0.00	0.30	0.08	0.23	0.00	0.28	0.00	9.81	18.61	22.95
Apollo [3]	FSM	18.53	0.76	0.40	0.04	0.54	0.05	0.19	0.37	-	-	-
DriveMLM	LLM	75.23	0.90	0.52	0.89	0.91	0.61	0.66	0.89	40.46	124.91	56.54

Table 3: Results of open-loop evaluation on CARLA Town05. Compared with previous approaches, our method can predict more precise decisions and give better explanations for the decision choice.

For evaluating closed-loop driving performance, we use the widely used Town05 Long benchmark, following previous work [15, 57]. It is worth noting that Town05 is not in our training data. We use Driving Score (DS), Route Completion (RC), and Infraction Score (IS) [18] as the metrics. RC computes the average percentage of routes completed by an agent. IS measures the infraction penalty, a value between $0$ and $1$ that accounts for collisions and violations of traffic rules. Note that IS is only calculated on the completed part of a route. DS is the core metric among the three, and is the product of RC and IS. We also evaluate driving performance using Miles Per Intervention (MPI), a metric widely used in industry. It is computed as the total distance traveled divided by the total number of human takeovers. If the ego-car violates traffic rules or has a collision, it is taken over and then resumes self-driving from a safe location until it reaches its destination. Unlike DS, which terminates the route under certain conditions, MPI requires the ego-car to complete the entire route.
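The two headline metrics can be written out as a small sketch. It assumes DS is computed per route as RC times IS and then averaged over routes, which is our reading of the usual CARLA convention; under that reading the aggregate DS need not equal the product of the aggregate RC and IS. Returning infinity when there are no takeovers is likewise a choice made for this example, and the function names are hypothetical.

```python
from typing import Sequence


def driving_score(route_completion: Sequence[float], infraction_score: Sequence[float]) -> float:
    """Mean over routes of RC_i (in percent) times IS_i (in [0, 1])."""
    if len(route_completion) != len(infraction_score) or not route_completion:
        raise ValueError("need one RC and one IS value per route")
    products = [rc * penalty for rc, penalty in zip(route_completion, infraction_score)]
    return sum(products) / len(products)


def miles_per_intervention(total_miles: float, takeovers: int) -> float:
    """Total distance traveled divided by the number of human takeovers."""
    if takeovers < 0:
        raise ValueError("takeovers must be non-negative")
    if takeovers == 0:
        return float("inf")  # assumption: no takeover means an unbounded MPI
    return total_miles / takeovers


# Made-up values for two routes, purely for illustration.
print(driving_score([100.0, 50.0], [1.0, 0.5]))
print(miles_per_intervention(10.0, 4))
```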

For the open-loop evaluation, we collect 10 routes of each scenario in Town05 obtained and annotated by human drivers as the test set. To evaluate decision prediction, we compute the accuracy of predicted decision pairs and the F1 score of each type of decision. For the explanation prediction task, we use the commonly used metrics in the NLP community, including BLEU-4 [49], CIDEr [65] and METEOR [4]. We compare our method with the popular Apollo, which is based on a finite state machine (FSM) and two MLLM models - LLaVA1.5 [37] and InstructBLIP [17]. These two MLLM models used for comparison were not fine-tuned but instead provided with several examples of input/decision pairs for few-shot adaptation.
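The decision metrics described above can be sketched as follows: pair accuracy counts a prediction as correct only when both the path and the speed decision match, and F1 is computed per decision type. This is a minimal illustration with hypothetical function names and made-up labels; any grouping of states used for reporting (for example, merging left and right variants) is not modeled.

```python
from typing import Sequence, Tuple


def pair_accuracy(pred: Sequence[Tuple[str, str]], gold: Sequence[Tuple[str, str]]) -> float:
    """Percentage of frames whose (path, speed) pair matches the ground truth."""
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    hits = sum(1 for p, g in zip(pred, gold) if p == g)
    return 100.0 * hits / len(gold)


def f1_for_class(pred: Sequence[str], gold: Sequence[str], label: str) -> float:
    """F1 score of a single decision type, treated as one-vs-rest."""
    tp = sum(1 for p, g in zip(pred, gold) if p == label and g == label)
    fp = sum(1 for p, g in zip(pred, gold) if p == label and g != label)
    fn = sum(1 for p, g in zip(pred, gold) if p != label and g == label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up path predictions for four frames.
pred_paths = ["FOLLOW", "FOLLOW", "LEFT_CHANGE", "FOLLOW"]
gold_paths = ["FOLLOW", "LEFT_CHANGE", "LEFT_CHANGE", "FOLLOW"]
print(f1_for_class(pred_paths, gold_paths, "FOLLOW"))
```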

4.3 Evaluation in Closed-Loop Driving

Method	Type	DS $\uparrow$	RC $\uparrow$	IS $\uparrow$	MPI $\uparrow$
Roach [81]	DD	43.6	80.4	0.54	-
Interfuser [57]	DD	68.3	95.0	0.72	0.70
ThinkTwice [27]	DD	70.9	95.5	0.75	0.40
Apollo [3]	FSM	71.4	92.2	0.80	0.76
DriveMLM	LLM	76.1	98.1	0.78	0.96

Table 4: Results of closed-loop driving on CARLA Town05 Long. FSM denotes a Finite State Machine. DD denotes Data Driven. DS denotes Driving Score. RC denotes Route Completion. IS denotes Infraction Score. MPI denotes Miles Per Intervention. DriveMLM has a higher driving score and route completion rate and is also close to Apollo’s infraction penalty, indicating that DriveMLM can make better decisions while following the traffic rules. Meanwhile, DriveMLM also shows advantages in MPI, representing fewer human takeovers at the same mileage.

We evaluate closed-loop driving in CARLA, the most widely used and realistic simulation benchmark publicly available. State-of-the-art methods [81, 57, 27] that are capable of performing closed-loop driving in CARLA are included for performance comparison. The open-source Apollo [3] is also evaluated in CARLA as a baseline. Besides ours, no other LLM-based method has shown the readiness to be deployed and evaluated. All methods are evaluated on the Town05 Long benchmark [15].

Table 4 presents the Driving Score, Route Completion, and Infraction Score. Note that despite being a rule-based method, Apollo achieves almost on-par performance with recent end-to-end methods. DriveMLM surpasses all other methods on Driving Score by a large margin. This suggests that DriveMLM is better at handling state transitions to drive safely through hard cases. The last column in Table 4 presents the results of the MPI evaluation. This metric reflects driving performance more holistically because an agent is required to finish all routes. In other words, the tested agents encounter all situations along all routes. ThinkTwice achieves a better DS but a lower MPI than Interfuser because it frequently crosses the stop line, a behavior for which CARLA imposes only minimal penalties. By contrast, MPI counts each violation of traffic rules as one takeover. DriveMLM also achieves the highest MPI among all methods, suggesting that it avoids more situations that would require a takeover, for a safer driving experience.

4.4 Evaluation of Driving Knowledge

We adopt open-loop evaluation to assess driving knowledge, which includes the decision prediction and the explanation prediction tasks. Table 3 presents the accuracy of predicted decision pairs and the F1-score of each type of decision for decision prediction, and BLEU-4 [49], CIDEr [65], and METEOR [4] for the predicted explanations. For Apollo, manually collected scenarios on Town05 are replayed as input to models in Table 3. The corresponding model states and outputs at every timestamp of the replay are saved as predictions for metric calculation. For the other methods, we give them the corresponding images and the proper prompts as input. By comparing model predictions with our manually collected ground truth, accuracy reveals decision correctness and similarity to human behavior, and the F1-score demonstrates the decision-making capability for each individual type of path and speed decision. DriveMLM achieves the highest accuracy overall, surpassing LLaVA with an accuracy of 40.97%. Compared to the Apollo baseline, the higher F1-score of DriveMLM suggests that it is much more effective than the rule-based state machine at handling various road situations. LLaVA [37], InstructBLIP [17], and our proposed DriveMLM can output explanations of decisions in question-and-answer form. In terms of BLEU-4, CIDEr, and METEOR, DriveMLM achieves the highest performance, indicating that it gives the most reasonable explanations of its decisions.

4.5 Ablation Study

Sensor Modality

Table 5 presents the impact of different input sensor modalities on DriveMLM. Multi-view (MV) images bring a substantial improvement in both path and speed F1-scores, along with an 18.19% increase in accuracy. Compared with directly concatenating temporal tokens, the temporal QFormer yields a larger improvement of 7.4% while preserving multi-modal decision capability, which leads to a 0.05 improvement in the average F1-score on speed decisions. Point clouds do not enhance performance.

MV	CT	TQ	PC	Acc. $(\%)$ $\uparrow$	Path $\uparrow$ (F1 Avg)	Speed $\uparrow$ (F1 Avg)
-	-	-	-	47.83	0.55	0.61
✓	-	-	-	64.54	0.78	0.70
✓	✓	-	-	67.22	0.70	0.68
✓	-	✓	-	75.23	0.78	0.75
✓	-	✓	✓	74.99	0.77	0.75

Table 5: Ablation results of sensor modality and temporal information. MV denotes multi-view images, CT denotes concatenating temporal tokens, TQ denotes temporal QFormer, and PC denotes point clouds. MV + TQ shows the best decision performance, whereas CT brings a small improvement in accuracy but leads to greater computational consumption. PC has little impact on DriveMLM. This might be caused by the large representation gap between the sparse pyramid transformer and the MLLM decoder.

Temporal Module Design

We propose the temporal QFormer module to process the temporal multi-view images. A naive design directly concatenates the query tokens along the temporal dimension, generating $N_{tq}=T\times N_{I}\times N_{Q}$ tokens as LLM input. However, $N_{tq}$ grows linearly with $T$, which leads to a large computational cost. Instead, the temporal QFormer processes the temporal images of each view separately, generating only $N_{I}\times N_{Q}$ tokens for LLM input. The comparison of the two temporal designs is shown in Table 5, which indicates that our temporal module performs better with fewer image tokens. We set $T=2$ by default in our experiments.
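The token-count argument can be made concrete with a small sketch. Only the two formulas come from the text; the reading of $T$ as the number of frames, $N_{I}$ as the number of views, and $N_{Q}$ as the number of queries per view is an assumption, and the sizes used below are hypothetical placeholders rather than the paper's configuration.

```python
def concat_token_count(num_frames: int, num_views: int, num_queries: int) -> int:
    """Tokens when query tokens of all frames are concatenated: T * N_I * N_Q."""
    return num_frames * num_views * num_queries


def temporal_qformer_token_count(num_views: int, num_queries: int) -> int:
    """Tokens when each view's frames are fused into one query set: N_I * N_Q."""
    return num_views * num_queries


# Hypothetical sizes: 4 camera views and 32 queries per view.
N_I, N_Q = 4, 32
for T in (1, 2, 4, 8):
    naive = concat_token_count(T, N_I, N_Q)
    fused = temporal_qformer_token_count(N_I, N_Q)
    print(f"T={T}: concatenation={naive} tokens, temporal QFormer={fused} tokens")
```

Under these assumptions the concatenation design doubles the LLM input length every time $T$ doubles, while the temporal QFormer input length stays constant.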

4.6 Case Study & Visualization

Human Interaction

Figure 4 provides an example of how vehicle control can be achieved through human instructions. The control process involves analyzing the road conditions, making decision choices, and providing explanatory statements. When given the identical instruction to “overtake”, DriveMLM exhibits varying responses based on the analysis of the current traffic conditions. In the scenario where the right lane is occupied and the left lane is available, the system opts to overtake from the left. However, in situations where the given instruction may pose a danger, such as when all lanes are occupied, DriveMLM chooses to refrain from executing the overtaking maneuver and responds appropriately. DriveMLM, in this context, serves as an interface for human-vehicle interaction, which evaluates the reasonableness of the instruction based on traffic dynamics and ensures its compliance with predefined rules before ultimately selecting a course of action.

Performance in Real Scenarios

We apply DriveMLM to the nuScenes dataset [8] to test the zero-shot performance of the developed driving system. We annotate 6,019 frames on the validation set, and the zero-shot decision accuracy is 0.395. Figure 5 presents the results on two real driving scenes, indicating the generalizability of DriveMLM.

5 Conclusion

In this work, we have presented DriveMLM, a novel framework that leverages large language models (LLMs) for autonomous driving (AD). DriveMLM can perform closed-loop AD in realistic simulators by using a multi-modal LLM (MLLM) to model the behavior planning module of a modular AD system. DriveMLM can also generate natural language explanations for its driving decisions, which can enhance the transparency and trustworthiness of the AD system. We have shown that DriveMLM can outperform the Apollo baseline on the CARLA Town05 Long benchmark. We believe that our work can inspire more research on the integration of LLMs and AD.

Supplementary Material

Appendix A Prompt Details

As illustrated in Table A, we provide the complete system message which includes detailed definitions of path decision states and speed decision states. Specifically, our path decision states include 5 states, which are {FOLLOW_LANE, LEFT_LANE_CHANGE, RIGHT_LANE_CHANGE, LEFT_LANE_BORROW, RIGHT_LANE_BORROW}, and our speed decision states include 4 states: {KEEP, ACCELERATE, DECELERATE, STOP}.
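As an illustration of how these decision states could be represented and recovered from a model's text output, here is a minimal sketch. The state names come from the text above; the enum classes, the `parse_decision` helper, and the case-sensitive matching rule are hypothetical and are not taken from the DriveMLM codebase.

```python
import re
from enum import Enum
from typing import Optional, Tuple


class PathDecision(Enum):
    FOLLOW_LANE = "FOLLOW_LANE"
    LEFT_LANE_CHANGE = "LEFT_LANE_CHANGE"
    RIGHT_LANE_CHANGE = "RIGHT_LANE_CHANGE"
    LEFT_LANE_BORROW = "LEFT_LANE_BORROW"
    RIGHT_LANE_BORROW = "RIGHT_LANE_BORROW"


class SpeedDecision(Enum):
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"


def _first_match(enum_cls, text: str):
    """Return the first state of enum_cls named in text as a whole upper-case word."""
    pattern = r"\b(" + "|".join(member.value for member in enum_cls) + r")\b"
    match = re.search(pattern, text)
    return enum_cls(match.group(1)) if match else None


def parse_decision(text: str) -> Optional[Tuple[PathDecision, SpeedDecision]]:
    """Extract a (path, speed) pair, or None if either state is missing."""
    path = _first_match(PathDecision, text)
    speed = _first_match(SpeedDecision, text)
    if path is None or speed is None:
        return None
    return path, speed


reply = "Path: LEFT_LANE_BORROW. Speed: DECELERATE. I keep a safe gap to the obstacle."
print(parse_decision(reply))
```

Matching is case-sensitive on purpose, so that ordinary words such as "keep" or "stop" inside an explanation are not mistaken for decision states.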

In Table B, we detail the prompts utilized for describing the surrounding environments. In Table C, prompts are employed to derive driving decisions based on navigation commands. The prompts listed in Table D are used to elicit explanations from the model regarding its decisions. Finally, in Table E, we present the human instructions that were used to guide the model’s decision-making process.

System Message:
You are a driving assistant to drive the car. You need to follow the navigation command and traffic rules. The traffic rule is
1. Traffic light indications:
   a. Green: Vehicles may proceed.
   b. Yellow: Vehicles already past the stop line can continue.
   c. Red: Vehicles must stop.
2. Vehicle regulations:
   a. Vehicles must not exceed speed limits indicated by signs or road markings.
   b. Vehicles must stop when they meet the stop line.
3. Drivers should note specific traffic signs/markings:
   - Double solid lines: Overtaking is prohibited. Adhere strictly and don’t cross to overtake.
   - Single solid line: Overtaking is restricted. Overtaking is allowed to provide a safe distance and clear visibility, ensuring safety.
4. If special vehicles like police or ambulances are behind, yield and allow them to pass first.
5. Collision with other moving or static objects is not allowed.
Path decision definitions:
‘LEFT_LANE_CHANGE’ refers to a driver’s decision to switch from the current to the adjacent left lane.
‘RIGHT_LANE_CHANGE’ refers to a driver’s decision to switch from the current lane to the adjacent right lane.
‘LEFT_LANE_BORROW’ is when a driver temporarily uses the adjacent left lane, commonly for overtaking or avoiding obstacles.
‘RIGHT_LANE_BORROW’ is when a driver temporarily uses the adjacent right lane, commonly for overtaking or avoiding obstacles.
‘FOLLOW_LANE’ means the driver decides to continue in their current lane.
Speed decision definitions:
‘ACCELERATE’ refers to a driver increasing their speed.
‘DECELERATE’ means the driver reduces their speed.
‘KEEP’ refers to a driver keeping a steady speed.
‘STOP’ means the driver completely halts the vehicle.
Based on the definitions of path decision, and while adhering to traffic rules, please choose a path and speed decision from the predefined options below, considering the current scenario.
Path decisions include [LEFT_LANE_BORROW, RIGHT_LANE_BORROW, LEFT_LANE_CHANGE, RIGHT_LANE_CHANGE, FOLLOW_LANE].
Speed decisions include [ACCELERATE, DECELERATE, KEEP, STOP].
Given the navigation command and driving scene obtained from the camera or LiDAR, You should choose a path decision and a speed decision from the predefined options and give an explanation of your decision.

Table A: System message. The system message includes the description of the driving task, the traffic rules, and the definition of decision states.

1. This is a driving scenario. Please describe the environment. The images are provided by the front, left, right, and back cameras of a vehicle. The point cloud is generated by a LiDAR mounted on the top of the vehicle.
2. Could you provide me with a description of the current surroundings? Visual information from the vehicle’s front, left, right, and back cameras provides the images, while a LiDAR sensor mounted on the top of the vehicle generates a point cloud representation of the environment.
3. Would you kindly provide me with a description of the current surroundings? The vehicle’s front, left, right, and back cameras provide the visual images, while a LiDAR system, mounted on the top of the vehicle, generates a point cloud representation of the environment.
4. Could you please describe the current surroundings to me?
5. Could you kindly provide me with a description of the surrounding environment, please?

Table B: Prompts used to describe the surrounding environment.

1. The navigation command is turn left. Please choose a path decision state and a speed decision state for the ego vehicle.
2. Given the navigation command to turn left, please determine a path decision state and a speed decision state for the ego vehicle.
3. With the navigation instruction to turn left, please select a path and speed decision state for the ego vehicle, considering the current situation.
4. The navigation command is turn left. Please determine the desired state for the path and speed decisions of the ego vehicle.
5. The navigation instruction is to turn left. Please determine the state of the path decision and speed decision for the ego vehicle accordingly.

Table C: Prompts employed for guiding models to adhere to navigation commands. There are three navigation commands: {follow lane, turn left, turn right}.
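The navigation prompts in Table C can be read as templates in which the command is a slot. The sketch below shows one way to instantiate them; the template strings and the three commands come from Table C, while the random choice among templates, the `build_nav_prompt` helper, and its argument names are assumptions made for illustration.

```python
import random

NAV_COMMANDS = ("follow lane", "turn left", "turn right")

# Templates from Table C with the navigation command replaced by a slot.
NAV_TEMPLATES = (
    "The navigation command is {cmd}. Please choose a path decision state "
    "and a speed decision state for the ego vehicle.",
    "Given the navigation command to {cmd}, please determine a path decision "
    "state and a speed decision state for the ego vehicle.",
    "With the navigation instruction to {cmd}, please select a path and speed "
    "decision state for the ego vehicle, considering the current situation.",
    "The navigation command is {cmd}. Please determine the desired state for "
    "the path and speed decisions of the ego vehicle.",
    "The navigation instruction is to {cmd}. Please determine the state of the "
    "path decision and speed decision for the ego vehicle accordingly.",
)


def build_nav_prompt(cmd: str, rng: random.Random) -> str:
    """Fill a randomly chosen Table C template with a navigation command."""
    if cmd not in NAV_COMMANDS:
        raise ValueError(f"unknown navigation command: {cmd!r}")
    return rng.choice(NAV_TEMPLATES).format(cmd=cmd)


print(build_nav_prompt("turn right", random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampled prompt reproducible for a given seed.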

1. Please explain why you chose these decisions.
2. Could you please elaborate on the reasons for choosing these decisions?
3. Could you please justify choosing these decisions?
4. Could you kindly provide a rationale for selecting these decisions?
5. Could you please explain the reasoning behind selecting these decisions?

Table D: Prompts used to elicit explanations for driving decisions.

1. I’m in a hurry. Can you overtake the front car?
2. I’m pressed for time. Can you go through the intersection without stopping at the red light?
3. I’m running short on time. Is it possible for you to utilize the emergency lane to bypass the vehicles ahead?
4. I’m in a rush. Can you go through the intersection without taking into account pedestrians and vehicles?
5. Great view on the left. Can you change to the left lane?
6. There are obstacles ahead. Can you switch to a different lane to bypass?
7. Right turn ahead. Can you switch to the right lane?

Table E: Prompts for guiding the model to make decisions based on human instructions.

Scenario ID	Scenario Name
1 $\star$	YieldBehindEmergencyVehicles
2 $\star$	OvertakingFromLeft
3 $\star$	OvertakingFromRight
4 $\dagger$	LeftBorrowPassObstacle
5 $\dagger$	LeftBorrowPassAccident
6 $\dagger$	LeftInvasionBorrowPassObstacle
7 $\dagger$	LeftInvasionBorrowPassAccident
8 $\dagger$	RightBorrowPassObstacle
9 $\dagger$	RightBorrowPassAccident
10 $\dagger$	RightInvasionBorrowPassObstacle
11 $\dagger$	RightInvasionBorrowPassAccident
12 $\star$	JunctionRightChange
13 $\star$	JunctionLeftChange
14 $\star$	JunctionStraight
15 $\star$	JunctionYieldPedestrian
16 $\dagger$	JunctionYieldPedestrianAfterTurn
17 $\dagger$	YieldJunctionSpecialisedVehicles
18 $\star$	LeftChangeInRoute
19 $\star$	RightChangeInRoute
20 $\dagger$	UnprotectedJunctionLeftTurn
21 $\dagger$	UnprotectedJunctionStraight
22 $\dagger$	UnprotectedJunctionRightTurn
23 $\dagger$	SignedJunctionLeftTurn
24 $\dagger$	SignedJunctionStraight
25 $\dagger$	SignedJunctionRightTurn
26 $\ddagger$	PedestrianBlindSpotA
27 $\ddagger$	PedestrianBlindSpotB
28 $\ddagger$	VehicleBlindSpotA
29 $\ddagger$	VehicleBlindSpotB
30 $\dagger$	FollowerChange

Table F: Scenario list. $\star$ denotes scenarios constructed by ourselves. $\dagger$ and $\ddagger$ denote scenarios from the official CARLA settings and ReasonNet [58], respectively.

Appendix B Scenario Details

Our training data contains 30 common or rare safety-critical scenarios. Table F lists the names of all scenarios and indicates their sources. Non-custom scenarios (marked with $\dagger$ and $\ddagger$) are usually set up by loading preset trigger points, which makes it difficult to use them on other maps. We have therefore made the scenarios dynamic, so that suitable trigger points are found automatically on any map during scenario setup. Note that all scenarios in Table F have been made dynamic in this way.

The customized scenarios are described as follows:

(1) YieldBehindEmergencyVehicles: An emergency vehicle (police car, ambulance, firetruck) is approaching from behind at high speed, and since there are vehicles driving on the left and right side lanes behind it, the ego vehicle needs to change lanes to the left/right to yield to the emergency vehicle.

(2) OvertakingFromLeft: Overtaking from the left due to a slow-moving vehicle ahead.

(3) OvertakingFromRight: Overtaking from the right due to a slow-moving vehicle ahead.

(4) JunctionLeftChange: Turn left at the intersection ahead, but the ego vehicle is not currently in the leftmost left turn lane, so the ego vehicle changes lanes to the left and then turns left through the intersection.

(5) JunctionRightChange: Turn right at the intersection ahead, but the ego vehicle is not currently in the rightmost right turn lane, so the ego vehicle changes lanes to the right and then turns right through the intersection.

(6) JunctionStraight: Go straight ahead at intersections, follow traffic rules, and avoid collisions with other vehicles.

(7) JunctionYieldPedestrian: Pedestrians are crossing the crosswalk at the intersection ahead, so the ego vehicle yields to pedestrians.

(8) LeftChangeInRoute: The vehicle in front of the ego vehicle is moving slowly, so the ego vehicle changes lanes to the left to stop following it.

(9) RightChangeInRoute: The vehicle in front of the ego vehicle is moving slowly, so the ego vehicle changes lanes to the right to stop following it.

Appendix C Comparative Analysis of Diverse Methods

Compared to methods such as Interfuser [57] or Apollo [3], our approach demonstrates superior performance in scenarios that involve unknown obstacles or require common sense. As depicted in Figure A (a), when facing unknown obstacles on the road, previous methods typically either overlook them or halt the vehicle, and both strategies deviate from optimal driving practice. In contrast, our method makes a more logical "borrow lane" decision, effectively preventing accidents. In addition, because previous methods lack real-world common sense and an understanding of traffic rules, they are limited in their ability to handle the diverse special cases that arise in complex driving scenarios. For example, as shown in Figure A (b), when an emergency vehicle approaches from behind, conventional methods fail to yield, whereas our method proactively clears the path for the firetruck.

We posit that since corner cases in driving are virtually infinite, the integration of the chain of thought, augmented with pre-defined traffic knowledge, is particularly vital for decision-making in driving scenarios. Given the inherent characteristics of Large Language Models (LLMs), our DriveMLM demonstrates significant potential for adaptation to varied environments and for being tuned to distinct driving styles across diverse settings.

Appendix D Human Interaction with DriveMLM

In Figure B, we provide more examples of how humans can interact with DriveMLM using natural language. Humans can give driving instructions to DriveMLM or ask it to explain its driving decisions. Leveraging the advantages of large language models, our approach offers enhanced interpretability, contributing to the development of safer autonomous driving systems.

Appendix E Comparison with Other Multi-Modal Large Language Models

As shown in Figure C, in the context of autonomous driving scenarios, LLaVA 1.5 [38] and InstructBLIP [17] fail to adequately comprehend the driving environment, often issuing incorrect instructions and hallucinatory explanations.

Appendix F Compared with GPT-4V

In our comparative analysis with GPT-4V [46], as depicted in Figure D, we noted that GPT-4V generated incorrect driving commands in three scenarios: (a), (b), and (c). Specifically, GPT-4V struggled to accurately perceive road lanes in scenario (a), the motion status of other vehicles in scenario (b), and atypical obstacles in scenario (c). In contrast, our method not only provided sensible driving commands but also offered precise linguistic explanations for each of these scenarios.

Appendix G Zero-Shot Results on nuScenes

We provide more visualization of the zero-shot results of our model on nuScenes. As shown in Figure E, despite our model being trained solely on simulator images, it still exhibits commendable generalization capabilities on real-world data. The robust generalizability of our model significantly enhances its potential for application.

Appendix H Closed Loop Ablation Studies

MV	TQ	PC	DS	RC	IS
-	-	-	36.7	70.6	0.52
✓	-	-	65.2	90.5	0.72
✓	✓	-	76.1	98.1	0.78
✓	✓	✓	72.2	96.3	0.75

Table G: Ablation results on Town05 Long of sensor modality and temporal information. MV denotes multi-view images, TQ denotes temporal QFormer, and PC denotes point clouds.

LLM	Acc. $(\%)$	BLEU-4 $\uparrow$	CIDEr $\uparrow$	METEOR $\uparrow$
LLaMA-7B	47.83	22.03	38.85	40.10
LLaMA-13B	48.92	25.54	75.68	42.50

Table H: Ablation results of LLM size. LLaMA-13B achieves higher decision prediction accuracy and demonstrates more reasonable explanations.

Data Size	35h	70h	140h	280h
Acc. $(\%)$ $\uparrow$	41.83	45.16	46.12	47.83

Table I: Ablation results of training set size. As the size of the training set increases, the decision prediction accuracy gradually improves. Performance can still be enhanced by increasing the training set size. Hence, we plan to further expand the collection of training data.

To explore the impact of various designs in our model, we undertook comprehensive ablation studies within the closed-loop evaluation framework. As indicated in Table G, the incorporation of multi-view input images markedly enhances driving performance compared to only using front-view images. Our proposed Temporal QFormer can further improve the driving performance by a large margin. In our method, the integration of point clouds did not result in further gains, possibly due to the increased challenge of aligning more diverse modalities.

Appendix I Ablation Experiments on Model and Data Scale

Large Language Model Scale. We study how the parameter scale of large language models (LLMs) influences the results of our method using our front-view model. Table H shows that our method achieves better performance with a larger model. However, the improvement from LLaMA-13B [64] is limited, so we select LLaMA-7B for the other experiments due to its higher operating efficiency and lower memory consumption.

Training Set Scale. We also study how the training data size influences the results of our method with our front-view model. Table I shows the performance is still enhanced by increasing the training set size, which indicates that our method can benefit from the scaling laws. Hence, we will expand the collection of training data in the future.

References

Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966, 2023.
Baidu [2019] Baidu. Apollo auto. https://github.com/ApolloAuto/apollo, 2019.
Banerjee and Lavie [2005] Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65–72, 2005.
Brohan et al. [2022] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.
Brohan et al. [2023] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Caesar et al. [2020] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020.
Chen and Krähenbühl [2022] Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17222–17231, 2022.
Chen et al. [2020] Dian Chen, Brady Zhou, Vladlen Koltun, and Philipp Krähenbühl. Learning by cheating. In Conference on Robot Learning, pages 66–75. PMLR, 2020.
Chen et al. [2021] Dian Chen, Vladlen Koltun, and Philipp Krähenbühl. Learning to drive from a world on rails. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15590–15599, 2021.
Chen et al. [2023a] Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm’s referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023a.
Chen et al. [2023b] Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. arXiv preprint arXiv:2310.01957, 2023b.
Chen et al. [2019] Shitao Chen, Zhiqiang Jian, Yuhao Huang, Yu Chen, Zhuoli Zhou, and Nanning Zheng. Autonomous driving: cognitive construction and situation understanding. Science China Information Sciences, 62:1–27, 2019.
Chitta et al. [2022] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. Transfuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
Contributors [2023] DriveLM Contributors. Drivelm: Drive on language. https://github.com/OpenDriveLab/DriveLM, 2023.
Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
Dosovitskiy et al. [2017] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017.
Driess et al. [2023] Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023.
Fang et al. [2023] Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369, 2023.
Fontana [2021] Francesco Fontana. Self-driving cars and openpilot: a complete overview of the framework, 2021.
Foundation [2018] The Autoware Foundation. Autoware: Open-source software for urban autonomous driving. https://github.com/CPFL/Autoware, 2018.
Gao et al. [2023] Peng Gao, Jiaming Han, Renrui Zhang, Ziyi Lin, Shijie Geng, Aojun Zhou, Wei Zhang, Pan Lu, Conghui He, Xiangyu Yue, et al. Llama-adapter v2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
Hong et al. [2023] Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, Chenyu Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
Hu et al. [2023] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.
Huang et al. [2023] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045, 2023.
Jia et al. [2023] Xiaosong Jia, Penghao Wu, Li Chen, Jiangwei Xie, Conghui He, Junchi Yan, and Hongyang Li. Think twice before driving: Towards scalable decoders for end-to-end autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21983–21994, 2023.
Jiang et al. [2023] Bo Jiang, Shaoyu Chen, Qing Xu, Bencheng Liao, Jiajie Chen, Helong Zhou, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077, 2023.
Junqing et al. [2023] He Junqing, Pan Kunhao, Dong Xiaoqun, Song Zhuoyang, Liu Yibo, Liang Yuxin, Wang Hao, Sun Qianguo, Zhang Songxin, Xie Zejian, et al. Never lost in the middle: Improving large language models via attention strengthening question answering. arXiv preprint arXiv:2311.09198, 2023.
Lai et al. [2023] Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
Li et al. [2023a] Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. Camel: Communicative agents for "mind" exploration of large language model society. In Thirty-seventh Conference on Neural Information Processing Systems, 2023a.
Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023b.
Li et al. [2023c] KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023c.
Li et al. [2022] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Yu Qiao, and Jifeng Dai. Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, pages 1–18. Springer, 2022.
Li et al. [2023d] Zhiqi Li, Zhiding Yu, David Austin, Mingsheng Fang, Shiyi Lan, Jan Kautz, and Jose M Alvarez. Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492, 2023d.
Liang et al. [2022] Tingting Liang, Hongwei Xie, Kaicheng Yu, Zhongyu Xia, Zhiwei Lin, Yongtao Wang, Tao Tang, Bing Wang, and Zhi Tang. Bevfusion: A simple and robust lidar-camera fusion framework. Advances in Neural Information Processing Systems, 35:10421–10434, 2022.
Liu et al. [2023a] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023a.
Liu et al. [2023b] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
Liu et al. [2023c] Jiaqi Liu, Peng Hang, Jianqiang Wang, Jian Sun, et al. Mtd-gpt: A multi-task decision-making gpt model for autonomous driving at unsignalized intersections. arXiv preprint arXiv:2307.16118, 2023c.
Liu et al. [2023d] Zhaoyang Liu, Yinan He, Wenhai Wang, Weiyun Wang, Yi Wang, Shoufa Chen, Qinglong Zhang, Yang Yang, Qingyun Li, Jiashuo Yu, et al. Internchat: Solving vision-centric tasks by interacting with chatbots beyond language. arXiv preprint arXiv:2305.05662, 2023d.
Liu et al. [2023e] Zhaoyang Liu, Zeqiang Lai, Zhangwei Gao, Erfei Cui, Xizhou Zhu, Lewei Lu, Qifeng Chen, Yu Qiao, Jifeng Dai, and Wenhai Wang. Controlllm: Augment language models with tools by searching on graphs. arXiv preprint arXiv:2310.17796, 2023e.
Mao et al. [2021] Jiageng Mao, Minzhe Niu, Chenhan Jiang, Hanxue Liang, Jingheng Chen, Xiaodan Liang, Yamin Li, Chaoqiang Ye, Wei Zhang, Zhenguo Li, et al. One million scenes for autonomous driving: Once dataset. arXiv preprint arXiv:2106.11037, 2021.
Mao et al. [2023] Jiageng Mao, Yuxi Qian, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt. arXiv preprint arXiv:2310.01415, 2023.
Movva et al. [2023] Rajiv Movva, Sidhika Balachandar, Kenny Peng, Gabriel Agostini, Nikhil Garg, and Emma Pierson. Large language models shape and are shaped by society: A survey of arxiv publication patterns. arXiv preprint arXiv:2307.10700, 2023.
Mu et al. [2023] Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. Embodiedgpt: Vision-language pre-training via embodied chain of thought. arXiv preprint arXiv:2305.15021, 2023.
OpenAI [2023] OpenAI. GPT-4 Technical Report, 2023. https://cdn.openai.com/papers/gpt-4.pdf.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
Padalkar et al. [2023] Abhishek Padalkar, Acorn Pooley, Ajinkya Jain, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anikait Singh, Anthony Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
Papineni et al. [2002] Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
Park et al. [2023] Joon Sung Park, Joseph C O’Brien, Carrie J Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. arXiv preprint arXiv:2304.03442, 2023.
Peng et al. [2023] Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824, 2023.
Qian et al. [2023] Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. arXiv preprint arXiv:2305.14836, 2023.
Radford et al. [2018] Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. Improving language understanding by generative pre-training. OpenAI, 2018.
Radford et al. [2019] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
Sachdeva et al. [2023] Enna Sachdeva, Nakul Agarwal, Suhas Chundi, Sean Roelofs, Jiachen Li, Behzad Dariush, Chiho Choi, and Mykel Kochenderfer. Rank2tell: A multimodal driving dataset for joint importance ranking and reasoning. arXiv preprint arXiv:2309.06597, 2023.
Sha et al. [2023] Hao Sha, Yao Mu, Yuxuan Jiang, Li Chen, Chenfeng Xu, Ping Luo, Shengbo Eben Li, Masayoshi Tomizuka, Wei Zhan, and Mingyu Ding. Languagempc: Large language models as decision makers for autonomous driving. arXiv preprint arXiv:2310.03026, 2023.
Shao et al. [2023a] Hao Shao, Letian Wang, Ruobing Chen, Hongsheng Li, and Yu Liu. Safety-enhanced autonomous driving using interpretable sensor fusion transformer. In Conference on Robot Learning, pages 726–737. PMLR, 2023a.
Shao et al. [2023b] Hao Shao, Letian Wang, Ruobing Chen, Steven L Waslander, Hongsheng Li, and Yu Liu. ReasonNet: End-to-end driving with temporal and global reasoning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13723–13733, 2023b.
Shen et al. [2023] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with ChatGPT and its friends in Hugging Face. arXiv preprint arXiv:2303.17580, 2023.
Shi et al. [2023] Yining Shi, Kun Jiang, Jiusi Li, Junze Wen, Zelin Qian, Mengmeng Yang, Ke Wang, and Diange Yang. Grid-centric traffic scenario perception for autonomous driving: A comprehensive review. arXiv preprint arXiv:2303.01212, 2023.
Singh and Bankiti [2023] Apoorv Singh and Varun Bankiti. Surround-view vision-based 3D detection for autonomous driving: A survey. arXiv preprint arXiv:2302.06650, 2023.
Surís et al. [2023] Dídac Surís, Sachit Menon, and Carl Vondrick. ViperGPT: Visual inference via Python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
Tong et al. [2023] Wenwen Tong, Chonghao Sima, Tai Wang, Li Chen, Silei Wu, Hanming Deng, Yi Gu, Lewei Lu, Ping Luo, Dahua Lin, et al. Scene as occupancy. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8406–8415, 2023.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Vedantam et al. [2015] Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015.
Vinitsky et al. [2022] Eugene Vinitsky, Nathan Lichtlé, Xiaomeng Yang, Brandon Amos, and Jakob Foerster. Nocturne: a scalable driving benchmark for bringing multi-agent learning one step closer to the real world. Advances in Neural Information Processing Systems, 35:3962–3974, 2022.
Wang et al. [2023] Wenhai Wang, Zhe Chen, Xiaokang Chen, Jiannan Wu, Xizhou Zhu, Gang Zeng, Ping Luo, Tong Lu, Jie Zhou, Yu Qiao, et al. VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
Wen et al. [2023] Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. DiLu: A knowledge-driven approach to autonomous driving with large language models. arXiv preprint arXiv:2309.16292, 2023.
Wu et al. [2023a] Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual ChatGPT: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671, 2023a.
Wu et al. [2023b] Dongming Wu, Wencheng Han, Tiancai Wang, Yingfei Liu, Xiangyu Zhang, and Jianbing Shen. Language prompt for autonomous driving. arXiv preprint arXiv:2309.04379, 2023b.
Wu et al. [2023c] Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. NExT-GPT: Any-to-any multimodal LLM. arXiv preprint arXiv:2309.05519, 2023c.
Xu et al. [2023] Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kenneth KY Wong, Zhenguo Li, and Hengshuang Zhao. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. arXiv preprint arXiv:2310.01412, 2023.
Yang et al. [2023a] Chenyu Yang, Yuntao Chen, Hao Tian, Chenxin Tao, Xizhou Zhu, Zhaoxiang Zhang, Gao Huang, Hongyang Li, Yu Qiao, Lewei Lu, et al. BEVFormer v2: Adapting modern image backbones to bird’s-eye-view recognition via perspective supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17830–17839, 2023a.
Yang et al. [2023b] Honghui Yang, Tong He, Jiaheng Liu, Hua Chen, Boxi Wu, Binbin Lin, Xiaofei He, and Wanli Ouyang. GD-MAE: Generative decoder for MAE pre-training on LiDAR point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9403–9414, 2023b.
Yang et al. [2023c] Hui Yang, Sifu Yue, and Yunzhong He. Auto-GPT for online decision making: Benchmarks and additional opinions. arXiv preprint arXiv:2306.02224, 2023c.
Yang et al. [2023d] Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, and Ying Shan. GPT4Tools: Teaching large language model to use tools via self-instruction. arXiv preprint arXiv:2305.18752, 2023d.
Yang et al. [2023e] Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381, 2023e.
Ye et al. [2023] Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
Zhang et al. [2023a] Hang Zhang, Xin Li, and Lidong Bing. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023a.
Zhang et al. [2023b] Shilong Zhang, Peize Sun, Shoufa Chen, Min Xiao, Wenqi Shao, Wenwei Zhang, Kai Chen, and Ping Luo. GPT4RoI: Instruction tuning large language model on region-of-interest. arXiv preprint arXiv:2307.03601, 2023b.
Zhang et al. [2021] Zhejun Zhang, Alexander Liniger, Dengxin Dai, Fisher Yu, and Luc Van Gool. End-to-end urban driving by imitating a reinforcement learning coach. In Proceedings of the IEEE/CVF international conference on computer vision, pages 15222–15232, 2021.
Zhou et al. [2020] Ming Zhou, Jun Luo, Julian Villella, Yaodong Yang, David Rusu, Jiayu Miao, Weinan Zhang, Montgomery Alban, Iman Fadakar, Zheng Chen, et al. SMARTS: Scalable multi-agent reinforcement learning training school for autonomous driving. arXiv preprint arXiv:2010.09776, 2020.
Zhu et al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
