Hybrid-Prediction Integrated Planning for
Autonomous Driving

Haochen Liu, Zhiyu Huang, Wenhui Huang, Haohan Yang,
Xiaoyu Mo, and Chen Lv^∗ H. Liu, Z. Huang, W. Huang, H. Yang, X. Mo, and C. Lv are with the School of Mechanical and Aerospace Engineering, Nanyang Technological University, 639798, Singapore. (E-mails: {haochen002, zhiyu001, wenhui001}@e.ntu.edu.sg, {haohan.yang, xiaoyu.mo, lyuchen}@ntu.edu.sg)^∗Corresponding author: C. Lv

Abstract

Autonomous driving systems require the ability to fully understand and predict the surrounding environment to make informed decisions in complex scenarios. Recent advancements in learning-based systems have highlighted the importance of integrating prediction and planning modules. However, this integration has brought forth three major challenges: inherent trade-offs by sole prediction, consistency between prediction patterns, and social coherence in prediction and planning. To address these challenges, we introduce a hybrid-prediction integrated planning (HPP) system, which possesses three novelly designed modules. First, we introduce marginal-conditioned occupancy prediction to align joint occupancy with agent-wise perceptions. Our proposed MS-OccFormer module achieves multi-stage alignment per occupancy forecasting with consistent awareness from agent-wise motion predictions. Second, we propose a game-theoretic motion predictor, GTFormer, to model the interactive future among individual agents with their joint predictive awareness. Third, hybrid prediction patterns are concurrently integrated with Ego Planner and optimized by prediction guidance. HPP achieves state-of-the-art performance on the nuScenes dataset, demonstrating superior accuracy and consistency for end-to-end paradigms in prediction and planning. Moreover, we test the long-term open-loop and closed-loop performance of HPP on the Waymo Open Motion Dataset and CARLA benchmark, surpassing other integrated prediction and planning pipelines with enhanced accuracy and compatibility. Project website: https://georgeliu233.github.io/HPP

Index Terms:

Occupancy prediction, motion prediction, integrated prediction and planning, autonomous driving.

I Introduction

Refer to caption — Figure 1: Generic learning-based pipelines in autonomous driving. a) Integrated pipelines learn planning jointly with motion prediction (IPP) or occupancy prediction (IOP) with coupled networks. b) End-to-end pipelines directly map from raw sensor inputs to planning. c) Our proposed HPP establishes a coupled planning system with hybrid-prediction integration and optimization.

Autonomous driving systems (ADS) have made significant progress in perception, prediction, and planning, thanks to the advancement of learning paradigms [1]. However, the performance growth of these independent tasks has come to a halt, prompting a reconsideration of modular design optimization [2, 3, 4, 5]. Fueled by inherent interactions among autonomous vehicles and traffic participants, recent research has placed significant emphasis on the integration of prediction and planning tasks [6]. They seek to achieve concurrent advancements in both prediction and planning .

Integrated pipelines (Fig. 1a) generally connect planning and singular prediction modules involving agent-wise motion trajectories (IPP) [3, 7, 8] or whole-scene occupancy probabilities (IOP) [9, 10, 11]. However, relying on single prediction format inevitably confronts respective shortfalls. Specifically, motion prediction models continuous temporal trajectories tracked for each agent, but encounters inconsistency and exponential cost in joint social patterns from marginal agents [12]. Conversely, exclusive occupancy accurately predicts aligned joint patterns of whole scene agents under bird’s eye view (BEV) perception, but the loss of agent-wise tractability leads to temporal conflicts and omission risks of critical agents [13]. This characteristic highlights the complementarity of motion and occupancy prediction. The inconsistency arising in the singular prediction pattern presents challenges for both IPP and IOP, resulting in incompatible planning learned collectively with misaligned predictions.

The limitations of integrated systems have spurred a renewed focus on end-to-end pipelines (Fig. 1b) [14]. Planning-oriented systems employ a well-organized modular design to address plan guidance under a unified BEV geometry and query-based intermediate interactions [2, 5, 15]. While this approach proves planning superiority and consider a hybrid format of predictions and planning, it falls short in addressing potential conflicts among different predictions. Furthermore, the absence of interactive co-design across predictions and planning results in passive maneuvers and hinders the system learning for social compatibility. The integration of different modules gives rise to three major challenges: inherent trade-offs, including inconsistency and omission risks with sole prediction; discrepancies between joint and marginal patterns in predictions, limiting accuracy and impeding the learning of safe and naturalistic driving maneuvers; and the absence of interactive co-design for compatible planning, considering dependencies of hybrid prediction.

To tackle the aforementioned challenges, we propose a novel integrated pipeline (Fig. 1c) named Hybrid-Prediction integrated Planning (HPP) that optimizes prediction and planning in a co-design process. By combining IPP and IOP, HPP delivers consistent planning while ensuring that hybrid prediction inform each other consistently. To achieve this, HPP leverages Transformer-based queries to channel and aggregate interactions between modules. Additionally, we have developed a refinement process that guides safe planning interacting with hybrid prediction. Specifically, we propose MS-OccFormer to perform marginal-conditioned joint occupancy prediction that aligns and refines consistently with marginal motion prediction. We leverage GTFormer, which is inspired by simulating game-theoretic iterative reasoning of marginal motion prediction with ego vehicle and joint occupancy. HPP concurrently models fine-grained interactions of integrated hybrid prediction patterns in Ego Planner, where we eventually devise a hybrid-prediction-guided refinement mechanism to facilitate safe and compatible planning. The main contributions are listed below:

1.

We propose HPP, a modular co-design optimization ADS paradigm, that consistently interacts among marginal and joint prediction patterns with planning.
2.

We introduce MS-OccFormer, a model that predicts joint occupancy patterns in BEV geometry while being aware of the marginal predictions. We also present GTFormer, a model that performs game-theoretic reasoning among marginal motion predictions in coordination with both planning and occupancy prediction. The ego planner is devised under interactive guidance by hybrid prediction.
3.

HPP is tested on multiple large-scale real-world benchmarks, and extensive testing results demonstrate its state-of-the-art performance in terms of accuracy, safety, and consistency in prediction and planning.

The remainder of this paper is organized as follows: Sec. II reviews integrated prediction and planning pipelines in ADS. Sec. III formulate the methodology of HPP. Comprehensive benchmarks and discussions are presented in Sec. IV. Finally, Sec. V concludes the whole work.

II Related Work

II-A Predictions and Planning in ADS

Prediction and planning modules in the conventional ADS model are separate. Predictions typically define evolving transitions as conditions for safe planning. Learning-based prediction models excel in modeling interactions among diverse agents and scene contexts [12]. Categorized by representations, sparse predictions forecast multi-agent trajectories (MATP) along with detected participants. Leveraging Transformer [16, 17] or GNNs [7, 18] in constructing the social interaction graph [7] or recurrent refining [19], MATP filters multi-agent predictions in scoring combinations of marginal ones for each agent. While achieving agent-wise accuracy, MATP introduces exponential computations and trajectory-wise inconsistency. Dense predictions directly estimate the future distribution of agents jointly from ego-centered occupancy [9, 20, 21]. A notable issue is the loss of agent-wise tractability. Enhancements via trajectories of heatmap sampling [22] or joint trajectory learning [23] exhibit similar consistency issues. In ADS planning, various approaches are well-founded including sampling [24], optimization [25], and learning-based techniques via imitation learning [26] or reinforcement learning [27]. Still, achieving safe and socially compatible driving maneuvers in planning requires interactive awareness and safety guarantees derived from planning-compliant predictions. Furthermore, accumulated errors from detached predictions and planning underscore the need for developing integrated ADS.

II-B Integrated Predictions and Planning in ADS

Navigating dense and interactive traffic requires integrated ADS, which models the simultaneous driving behaviors among social agents and the autonomous vehicle. Intuitive thought is to stack ego vehicles collectively with social agents’ predictions and learn unified trajectories. All interactions are implicitly modeled by Transformer-based IPP [8, 28] or post-processed occupancy prediction[23] under the IOP pipeline. To consider the evolving interactive behaviors between predictions and planning, conditional methods model the behavioral response individually from ego vehicle to social agents predictions [29]. Conditional motion predictions are then integrated into planning reschedules [24, 30] or modeled as non-cooperative games [31]. Still, these one-way interactions bypass the mutual consistency with all agents. Moreover, iterative bi-level optimization [32] significantly slows down learning. Obviated from agent-wise conflicts, hierarchical game-theoretic approaches model the iterative reasoning process [33], updating mutual behaviors for all agents simultaneously [34]. Yet, uniformed agent-wise reasoning lacks specifications between predictions and planning, which should be target-driven. In HPP, we integrate reasoning by introducing an agent-conditioned occupancy to modulate joint behavior in social agents. and an Ego Planner for interactive planning. This integrated co-design enables learning both predictions and planning with flexibility while maintaining mutual awareness by iterative reasoning.

Recent paradigms focus on leveraging output responses for predictions to enhance planning guarantees. Adversarial objectives are employed between joint motion predictions and planning, considering likely [3, 35] or safety-critical patterns [36] for mutual differentiable optimizations. However, similar consistency issues faced by MATP pose additional optimization challenges. Some methods predict dense occupancy jointly as a potential cost for planning guidance [11, 9, 10]. However, intractability issues introduce risks and require laborious filtering. An additional challenge is the lack of integrated modeling leads to inconsistency and limits the interactive process only in optimization. In HPP, integration is considered concurrently for ADS co-design and output optimizations. Through flexible and consistent hybrid prediction and planning via system co-design, hybrid prediction are jointly utilized to refine planning, enhancing safety consistently.

II-C End-to-end Systems and LLMs for Autonomous Driving

End-to-end methods consider a direct mapping from raw sensors to prediction and planning under perception understanding [14]. A typical system is serialized by modules and learns jointly with each objective [37, 4]. However, modular errors are accumulated and geometry is hindered by misaligned predictions and planning. Thus, planning is sampled in trajectory retrieval by predictions [38]. Prominence in BEV perception [39, 40, 41] enables modular integration and learning under unified BEV geometry [42]. This further prompts a planning-oriented system, which organizes and serves all intermediate modules targeting planning under visual [2, 15, 43] or vectorized [5] perceptions. Query-based design channel and propagate modular integration. This prompts recent works incorporating large language models (LLMs) as the motion planner [44] or routing agent [45]. Still, the current end-to-end system focuses more on integration with perception that aligns predictions and planning, leaving paradigm discussions incorporating LLMs [46]. In HPP, we put more emphasis on modular co-design optimization, learning interactive hybrid prediction, and planning informed by perception under query-based integration.

III Hybrid-prediction Integrated Planning

The overview of our proposed HPP is shown in Fig.2, which is defined in Sec. III-A upon query-based modular co-design and optimization of ADS. In the upcoming sections, informed by BEV perception pipelines exhibited in Sec. III-B, HPP manages the framework co-design from MS-OccFormer in Sec. III-C for joint occupancy prediction, sharing mutual consistency with motion prediction in GTFormer. Here we elaborate its hierarchical reasoning model for interactive prediction and planning in Sec. III-D. Followed by hybrid prediction-aware Ego Planner in Sec. III-E, we frame the learning and optimization process for HPP reckoning hybrid prediction guidance in Sec. III-F.

III-A Problem Formulation

As shown in Fig. 2, HPP focuses on addressing integrated predictions and planning challenges for ADS. Informed by perception under BEV geometry, this learning-based system is founded by modular co-design and optimization. Modularized by Transformer basis, HPP leverages categorical queries $\mathbf{Q}$ to aggregate modular outputs and channel interactions as keys and values $\mathbf{K},\mathbf{V}$ by multi-head attention mechanism.

Given multi-view image inputs, the co-design of HPP begins with the perception backbone for BEV features $\mathcal{B}$ . BEV perception further defines an ego-centered area of $H\times W$ with map $\mathcal{S}$ and $N_{A}$ detected agents $\mathcal{A}\in A_{0:N_{A}-1}$ at current timestep $t=0$ , where $A_{0}$ denotes the ego vehicle. Based on scene context $\mathbf{X}=\{\mathcal{B},\mathcal{S},\mathcal{A}\}$ , HPP aims to learn and optimize concurrently for hybrid prediction and planning modules. Specifically, over future horizon $T$ , and set of queries $\mathbf{Q}$ , joint predictions are defined as per-step occupancy probabilities $\mathbf{\hat{O}}_{1:T}=\{o^{h,w}_{t}|o^{h,w}\in[0,1]^{H\times W}\}_{t=1}^{T}$ for all neighbor agents. Simultaneously, the marginal future motions for all agents are defined by $\hat{\mathbf{Y}}^{A}_{M}=\{(\hat{\mathbf{y}}^{1:T,i}_{m},\hat{\mathbf{p}}^{i}_{m})|m\in[1,M]\}^{N_{A}-1}_{i=0}$ denoting multi-modal future trajectories $\hat{\mathbf{y}}\in\mathbb{R}^{N_{A}\times M\times T\times 2}$ and probability $\hat{\mathbf{p}}\in\mathbb{R}^{N_{A}\times M}$ considering $M$ modes of uncertainty. hybrid prediction-aware planning $\tau_{1:T}\in\mathbb{R}^{T\times 2}$ is queried by plan context $\mathcal{E}$ . Subsequently, HPP formulates a collective objective $\mathcal{L}$ for modular learning, and cost criteria $\mathcal{C}$ for optimized hybrid prediction guided planning $\tau^{*}\in\mathbb{R}^{T\times 2}$ . In general, HPP is formulated as:

	$\displaystyle\mathbf{\hat{Y}}^{A}_{M},\mathbf{\hat{O}}_{1:T},\tau_{1:T}$	$\displaystyle=f(\mathbf{X},\mathbf{Q},\mathcal{E}\|\theta),$		(1)
	$\displaystyle\tau^{*}=\mathop{\arg\min}_{\tau}$	$\displaystyle\ \mathcal{C}(\tau,\mathbf{X},\mathbf{\hat{Y}},\mathbf{\hat{O}}),$		(1)

where $f$ denotes the co-design of HPP, and $\theta$ is the model parameters. Specific co-design and formations for each module in HPP are illustrated in the following sections.

III-B Perception Scene Encoding

Perception pipelines in HPP aim to capture scene context $\mathbf{X}$ with raw image inputs under BEV geometry. Scene contexts are then jointly encoded to catch their global relations.

III-B1 BEV Perception

In HPP, multi-view image features $\mathbf{I}\in\mathbb{R}^{6\times H_{in}\times W_{in}\times C}$ are extracted via shared backbones [47] from raw image inputs. We leverage BEV encoder [39] to transform $\mathbf{I}$ into BEV feature $\mathcal{S}$ through recurrent top-down BEV queries $Q_{B}\in\mathbb{R}^{H\times W\times D}$ , where $D$ denotes the hidden dimensions. Founded on BEV backbones, we utilize two DETR-like perception decoders [2] in extracting scene context features for agent $\mathcal{A}$ and map $\mathcal{S}$ from $Q_{B}$ , following $N_{A}$ agent queries $Q_{A}\in\mathbb{R}^{N_{A}\times D}$ and $N_{M}$ map queries $Q_{Map}\in\mathbb{R}^{N_{M}\times D}$ . Note that HPP focuses on integrated prediction and planning. Therefore, it is expected better results for HPP using more advanced BEV perception units.

III-B2 Scene Encoding

to model the global interactions between scene elements with BEV perceptions, we inherit from our previous work [48, 9] to gather and encode separate visual and vectorized scene context features. Specifically, visual features are encoded by BEV queries $Q_{B}$ , and the scene features for map and agent are concatenated and encoded as $Q_{s}=[Q_{Map};Q_{A}]\in\mathbb{R}^{(N_{A}+N_{M})\times D}$ , where $[\cdot;\cdot]$ denotes concatenations. Encoded results $\mathbf{X}=\{Q_{B},Q_{Map},Q_{A}\}$ are then served as input for HPP co-design integrating predictions and planning.

III-C Ms-OccFormer

HPP formulates joint predictions as occupancy $\mathbf{O}_{1:T}$ consistently with BEV geometry in perception. To further tackle the consistency challenges between hybrid prediction, in HPP we propose MS-OccFormer spotlights twp aspects, i.e. Marginal-conditioned occupancy that defines tractable predictions, and Multi-scale prediction-wise integration that deals with the interactive alignments with different granularity. Illustrated in Fig. 3, MS-OccFormer utilizes a streaming-based pipeline to roll out the future horizon $T$ , decoding per-step occupancy prediction based on $L$ levels integration of succeeded step features.

III-C1 Modular Queries

we leverage occupancy queries $Q_{occ}\in\mathbb{R}^{H\times W\times D}$ in multi-scale aggregating for positional and perception features: $Q_{occ}=\operatorname{MLP}([\operatorname{PE}({I}_{B});Q_{B}])$ . Positional grids ${I}_{B}\in\mathbb{R}^{\times H\times W\times 2}$ are encoded using sinusoidal $\operatorname{PE}(\cdot)$ and transformed by multi-layer perceptron (MLP). We further downsample $Q_{occ}$ under $L$ levels $\{Q^{l,0}_{occ}\in\mathbb{R}^{\frac{H}{2^{l}}\times\frac{W}{2^{l}}\times D}\}^{0}_{l=L}$ to recurrently query multi-scale interactions.

III-C2 Marginal Dependencies

To fully extract the interactive marginal prediction features, we conduct an agent-wise fusion (see Fig. 3c) that leverages the marginal future $\hat{\mathbf{Y}}^{A}_{M}$ outputs from GTFormer (Sec. III-D). Multi-modal motion features $Q^{A}_{M}\in\mathbb{R}^{N_{A}\times M\times D}$ are fused with marginal features $\mathbf{H}^{A}_{M}=\operatorname{MLP}(\operatorname{PE}(\mathbf{\hat{y}}))$ and projected by each horizon:

\mathbf{H}^{A}_{traj}=\operatorname{MLP}_{1:T}(\mathbf{\hat{p}}(Q^{A}_{M}+\mathbf{H}^{A}_{M})+Q_{A}),

(2)

where $\mathbf{H}^{A}_{traj}\in\mathbb{R}^{T\times N_{A}\times D}$ denotes the marginal features.

III-C3 Marginal-conditioned Occupancy

The primary challenge in formulating $\mathbf{O}_{1:T}\in\mathbb{R}^{T\times H\times W}$ lies in the intractability with marginal predictions that cause joint inconsistencies. Inspired by instance-level occupancy $\mathbf{O}^{A}_{1:T}\in\mathbb{R}^{T\times H\times W\times N_{A}}$ [2] and conditional methods [24], we propose the marginal-conditioned occupancy prediction task. This models the consistent joint occupancy $p(\mathbf{O}^{A}_{1:T}|\mathbf{Y}^{A}_{M},\mathbf{X})$ over agent-wise marginal predictions. To associate uncertainty and mutual interactions, given final joint decoding features $Q^{L}_{occ}$ and marginal features $\mathbf{H}^{A}_{traj}$ , the marginal-conditioned occupancy will be eventually modeled by dot products:

\mathbf{O}^{A}_{1:T}=\sigma(Q^{L}_{occ}\cdot\operatorname{MLP}(\mathbf{H}^{A}_{traj})^{T}),

(3)

where $\sigma$ denotes the sigmoid function for per-grid probabilities. The original task can be then transformed back $\mathbf{O}_{1:T}=\max_{A}\mathbf{O}^{A}_{1:T}$ for co-design of other HPP modules:

III-C4 Multi-scale Prediction-wise Integration

aims to iteratively align multi-scale interaction features between hybrid prediction in decoding $\mathbf{O}^{A}_{1:T}$ .In Fig. 3a, multi-scale succeeded occupancy features $\{Q^{l,t-1}_{occ}\}^{L}_{l=1}$ query aligned marginal features by attentions at different granularities from two-stage Transformer decoders.

The global integration stage leverages the vanilla Transformer decoders to perform per-grid interactions from flattened high-level joint features $Q^{L,t-1}_{occ}$ with marginal ones. Subsequently, with the upscaling of occupancy features $\{Q^{l,t-1}_{occ}\}^{L-1}_{l=1}$ , the local integration stage focuses on capturing consistency from partial joint behaviors with marginal features. This motivates us to design shift-window multi-head cross-attention (SW-MCA), inspired by SW-MSA in [49]. As depicted in Fig. 3b, we employ the rolling process to simultaneously capture local interactions under shifted windows attention.

To ensure interactive consistency across multi-scale integration, we devise a learnable attention mask $\mathbf{M}^{l}_{1:T}\in\mathbb{R}^{T\times\frac{H}{2^{L-l}}\times\frac{W}{2^{L-l}}\times N_{A}}$ for Transformer decoder that iteratively refines upon interaction results from the previous scale. This aligns the attention modeling based on the previous results Shown in Fig. 3a, for each level, the attention mask gets updated with agent-conditioned occupancy on the current scale level:

\mathbf{\hat{M}}^{l}=\sigma(Q^{l}_{occ}\cdot\operatorname{MLP}_{l}(\mathbf{H}^{A}_{traj})^{T}).

(4)

The attention masks are then iteratively updated following:

\mathbf{M}^{l}=\lambda_{m}\operatorname{Upsample}(\mathbf{M}^{l-1})+(1-\lambda_{m})\mathbf{\hat{M}}^{l},

(5)

where $\lambda_{m}=0.5$ is the update factor. In general, given Transformer decoder at certain stage as $\operatorname{Trans}$ , the prediction-wise integration under scale $l$ of timestep $t$ is defined as:

Q^{l,t}_{occ}=\operatorname{Trans}(q=Q^{l,t-1}_{occ},k,v=\mathbf{H}^{A,t}_{traj},m=\mathbf{M}^{l}_{t}).

(6)

Output joint occupancy features $Q^{L}_{occ}$ will be eventually fused via Equ. 3 for conditioned occupancy predictions $\mathbf{\hat{O}}^{A}_{1:T}$ .

III-D GTFormer

In HPP, ensuring interactive consistency between motion predictions and planning involves the introduction of GTFormer. As depicted in Fig. 4, GTFormer formulates future interactive behaviors as a game-theoretic reasoning process, simulating hierarchical reasoning through $K$ -layers stacked Transformer decoders. Beyond the uniform agent-wise reasoning model of our previous work [34], GTFormer collaborates with MS-OccFormer (Sec. III-C), modeling occupancy interactions to modulate joint future behaviors at each level of reasoning.

III-D1 Modular Queries

We leverage motion queries $Q^{A}_{M}\in\mathbb{R}^{N_{A}\times M\times D}$ to initialize agent-wise interactive reasoning with multi-modal motion intentions ${I}_{A}\in\mathbb{R}^{N_{A}\times M\times 2}$ and agent perception features by: $Q^{A}_{M}=\operatorname{MLP}([\operatorname{PE}({I}_{A});Q_{A}])$ .

III-D2 Joint Dependencies

We transform joint occupancy predictions $\hat{\mathbf{O}}_{1:T}\in\ \mathbb{R}^{T\times H\times W}$ with positional features $I_{B}$ to align with continuous geometry for predictions and planning. Shown in Fig. 4b, we conduct a multiplication with max-pooling to encode occupancy positional features with BEV semantics:

\mathbf{H}_{occ}=\operatorname{ResConv}(\mathop{\max}_{T}\operatorname{Conv}(\operatorname{PE}(I_{B})\hat{\mathbf{O}})+Q_{occ}),

(7)

where $\operatorname{Conv}$ and $\operatorname{ResConv}$ represent convolutional projections and residual layers, respectively. The joint features $\mathbf{H}_{occ}\in\mathbb{R}^{H\times W\times D}$ are then interacted with all agents.

III-D3 Game-Theoretic Transformer Layer

We employ level-k games to model the GTFormer layer (see Fig. 4a) for future interactive behaviors of all agents. Inherited from our previous work [34], we denote agent-wise multi-modal predictions and planning $\hat{\mathbf{Y}}^{A}_{M}$ as player policy $\{\pi_{i}\}^{N_{A}-1}_{i=0}$ . For each player $i$ , the reasoning process is defined as a $K$ -times iterations of policy conditioned on the opponents $\neg i$ policy from last reasoning level: $\pi^{k}_{i}(\mathbf{Y}^{i}_{M}|\pi^{k-1}_{\neg i},\mathbf{X}),k\in[1,K-1]$ . Specifically, the level-0 policy is reasoned independently: $\pi^{0}_{i}(\mathbf{Y}^{i}_{M}|\mathbf{X})$ . We leverage the Gaussian mixture model (GMM) to outline the uncertainty for predictions and planning policy as:

\pi^{k}_{i}=\sum^{M}_{m=1}p(\mathbf{\hat{p}}^{i}_{m}|\pi^{k-1}_{\neg i},\mathbf{X})\sum^{T}_{t=1}\phi(\mathbf{\hat{y}}^{t,i}_{m},\mathbf{\hat{\sigma}}^{t,i}_{m}|\pi^{k-1}_{\neg i},\mathbf{X}),

(8)

where $\phi$ and $\mathbf{\sigma}^{t,i}_{m}\in\mathbb{R}^{N_{A}\times M\times T\times 2}$ represent the density function and variance, respectively for 2D Gaussian distributions.

Specifically, a single layer of GTFormer is shared by level- $k$ policy $\pi^{k}_{0:N_{A}-1}$ (Fig. 4a) across all agents. The $k^{th}$ layer starts with a multi-head self-attention (MHSA), named Mode2Mode for last level motion queries $Q^{A}_{M,k-1}\in\mathbb{R}^{N_{A}\times M\times D}$ to model future interactions across modalities. Four multi-head cross-attention (MHCA) are devised for $Q^{A}_{M,k-1}$ to aggregate the policy conditions separately for reasoning. (1) GT-reasoning interacts with last level components policies $\pi^{k-1}_{0:N_{A}-1}$ . Fused by Equ. 2 with $\mathbf{\hat{p}}^{\neg A}_{M,k-1}$ and $\mathbf{\hat{y}}^{\neg A}_{M,k-1}$ , components features are filtered by future mask [34] to query future reasoning behaviors as $F^{k}_{GT}$ . (2) Mode2Agent and (3) Mode2Map outlines the scene context interactions of agents $Q_{A}$ and maps $Q_{Map}$ as $F^{k}_{A}$ and $F^{k}_{Map}$ . (4) Mode2Occ leverage deformable MHCA (DCA) [39] to model the future interactions with joint occupancy feature. $\mathbf{H}_{occ}$ is queried by A set of offsets $\Delta\in\mathbb{R}^{N_{A}\times M\times N_{p}\times 2}$ referenced by $\mathcal{P}(\mathbf{\hat{y}}^{A}_{M,k-1})$ transformed to pixel coordinate. $N_{p}$ referenced occupancy are then aggregated as $F^{k}_{occ}$ for each agent. In general, all interactive features are concatenated to update the motion query:

Q^{A}_{M,k}=\operatorname{FFN}([F^{k}_{GT}+F^{k}_{A};F^{k}_{Map};F^{k}_{occ}]),

(9)

where $\operatorname{FFN}(\cdot)$ denotes the feed-forward layers for residual MLPs and layer-norm. We omit the reasoning attention of $F^{0}_{GT}$ at $k=0$ for independent future policies. Eventually, updated motion queries $Q^{A}_{M,k}$ are passed through score and trajectory decoders for reasoned GMM policies of predictions and planning.

III-E Ego Planner

HPP introduces the Ego Planner depicted in Fig. 4c. Ego Planner specifies planning-oriented conditions $\mathcal{E}$ based on the planning reasoning policy $\pi^{K-1}_{0}$ . The co-design of HPP facilitates a hybrid-prediction aware Ego Planner that collaborates with GTFormer and MS-OccFormer, defined as $p(\tau|\pi^{K-1}_{0},\hat{\mathbf{Y}},\hat{\mathbf{O}},\mathcal{E},\mathbf{X})$ . This is essential as the game-theoretic process in GTFormer (Sec. III-D) simply models ego planning uniformly with other agents.

III-E1 Modular Queries

In the context of target-driven planning, we define plan queries $Q_{\mathcal{E}}\in\mathbb{R}^{1\times D}$ that integrate planning reasoning features $Q^{0}_{M,K-1}$ using plan context $\mathcal{E}$ through a max-pooling fusion by: $Q_{\mathcal{E}}=\mathop{\max}_{M}\operatorname{MLP}([Q^{0}_{M,K-1};\mathbf{H}_{\mathcal{E}}])$ . Here, $\mathbf{H}_{\mathcal{E}}\in\mathbb{R}^{1\times D}$ encompasses global target features $\mathcal{G}\in\mathbb{R}^{1\times D}$ for encoded navigation commands or coordinates, traffic light features $\operatorname{TL}\in\mathbb{R}^{1\times D}$ , and optional ego status $\mathbf{s}_{\text{ego}}\in\mathbb{R}^{1\times D}$ for ego speed and heading $\{v,\psi\}$ .

III-E2 Hybrid Predictions Dependencies

We directly utilize a stack of $L_{p}$ interaction Transformers parts in the GTFormer layer (Fig. 4a) to model future joint interactions $\hat{\mathbf{O}}$ and marginal interactions $\hat{\mathbf{Y}}$ , informed by reasoned planning features and target-conditioned planning queries $Q_{\mathcal{E}}$ . The GT-Reasoning module incorporates hybrid prediction awareness for marginal motion interactions $\mathbf{\hat{Y}}^{\neg A}_{M,K-1}$ , while the Mode2Occ module handles joint occupancy interactions $\mathbf{\hat{O}}_{1:T}$ . By aggregating interactive future behaviors from hybrid prediction, plan queries pass through an identical motion decoder, resulting in a refined planning trajectory $\tau\in\mathbb{R}^{T\times 2}$ .

III-F System Learning and Optimization Design

We present a collaborative learning and optimization paradigm for the HPP system. With detached BEV backbones, the co-designed framework undergoes a two-stage training process: 1) warm-up learning of two perception decoders using perception objectives $\mathcal{L}_{per}$ [2]; 2) end-to-end supervised learning using all modular objectives $\mathcal{L}$ as:

\mathcal{L}=\mathcal{L}_{per}+\mathcal{L}_{occ}+\mathcal{L}_{GT}+\mathcal{L}_{plan}.

(10)

During inference, hybrid-prediction guided optimization refines the planning $\tau^{*}$ by minimizing cost functions $\mathcal{C}$ . In the following, we elaborate on modular objectives, costs, and co-optimization strategies.

III-F1 Modular Objectives

For precise prediction of marginal-conditioned occupancy $\mathbf{\hat{O}}^{A}_{1:T}$ in MS-OccFormer, we employ a combination of top-k BCE loss and Dice loss [2] jointly for $\mathbf{\hat{O}}^{A}_{1:T}$ and $\mathbf{M}_{1:T}$ , for balanced predictions of occupancy probabilities: $\mathcal{L}_{occ}=\mathcal{L}_{\text{BCE}}+\lambda_{\text{Dice}}\mathcal{L}_{\text{Dice}}$ , with $\lambda_{\text{Dice}}=5$ .

To capture the hierarchical reasoning process for all agents in GTFormer, the min-max objectives for policy in level- $k$ consist of $\mathcal{L}^{k}_{\text{IL}}$ , minimizing imitative behaviors, and $\mathcal{L}^{k}_{\text{col}}$ , maximizing the interactive distance. The overall objective is defined as $\mathcal{L}_{GT}=\sum^{K-1}_{k=0}(\mathcal{L}^{k}_{\text{IL}}+\lambda_{\text{col}}\mathcal{L}^{k}_{\text{col}})$ . Here, $\mathcal{L}^{k}_{\text{IL}}$ represents the negative log-likelihood (NLL) loss for reasoning policies $\pi^{k}_{0:N_{A}-1}$ , with the closest final displacement errors (FDE) for each agent as a positive mixture:

\mathcal{L}^{k}_{\text{IL}}=\sum_{i=0}^{N_{A}-1}\sum_{t=1}^{T}\sum_{m=1}^{M}\mathbf{1}(m=\hat{m}^{i})\mathcal{L}_{\text{NLL}}(\hat{\mathbf{y}}^{t,i}_{m},\mathbf{\sigma}^{t,i}_{m},\hat{\mathbf{p}}^{i}_{m}),

(11)

where $\hat{m}$ denotes the positive index, $\mathcal{L}_{\text{NLL}}$ is defined as:

\small\mathcal{L}_{\text{NLL}}=\log\sigma_{x}+\log\sigma_{y}+\frac{1}{2}((\frac{\Delta_{x}}{\sigma_{x}})^{2}+(\frac{\Delta_{y}}{\sigma_{y}})^{2})-\log(p),

(12)

Here $\sigma_{x},\sigma_{y}\in\mathbf{\sigma}^{t,i}_{m}$ and $\Delta_{xy}=\mathbf{y}^{t,i}_{xy}-\hat{\mathbf{y}}^{t,i}_{m,xy}$ . We leverage cross-entropy loss for $\hat{\mathbf{p}}^{i}_{m}$ in updating scoring. For interactive loss $\mathcal{L}^{k}_{\text{col}}$ , it is modeled by maximize L2 distance $\mathcal{D}$ for closest trajectories within $d_{\text{col}}$ under last level component policies:

\mathcal{L}^{k}_{\text{col}}=\sum_{i=0}^{N_{A}-1}\sum_{t=1}^{T}\sum_{m=1}^{M}\mathop{\max}_{j\neg i,M}\frac{\mathbf{1}(\mathcal{D}<d_{\text{col}})}{(1+\mathcal{D}(\hat{\mathbf{y}}^{i,t,k}_{m},\hat{\mathbf{y}}^{j,t,k-1}_{M}))}.

(13)

To perform refined planning $\tau$ in Ego Planner, it is learned with L2 distance $\mathcal{D}$ : $\mathcal{L}_{plan}=\mathcal{D}(\tau,\mathbf{y}^{0})$ .

III-F2 Cost Profiles

The cost function profiles encompass diverse cost terms $\{c_{i}\}$ , taking into account various aspects of planning performance, categorized as: driving progress, comforts, adherence to rules [9], and, crucially, safety. Importantly, the planning safety is defined by Gaussian potential fields incorporating joint $\hat{\mathbf{O}}$ and marginal $\hat{\mathbf{Y}}$ predictions. Motivated by complementary safety guidance by hybrid prediction, the explicit catalog is as follows:

c^{\operatorname{safe}}_{t}=\sum_{\hat{\mathbf{O}}^{x,y}_{t}\in D_{1}}\phi(\tau_{t},\hat{\textbf{O}}_{t})+\sum^{N_{A}-1}_{i=1}\sum^{M}_{m=1}\sum_{\hat{\textbf{y}}^{t}\in D_{2}}\phi(\tau_{t},\hat{\textbf{y}}^{t,i}_{m}).

(14)

Here, $\phi$ represents the Gaussian density functions as in Equ. 8. $\hat{\mathbf{O}}^{x,y}_{t}=\mathcal{P}^{-1}(\hat{\mathbf{O}}^{h,w}_{t})$ denotes the occupied coordinates, and $\hat{\textbf{y}}^{t,i}_{m}$ is derived from reasoned results of $\mathbf{\hat{Y}}^{A}_{M,K-1}$ . Each potential field is subjected to masking by a distance threshold $D_{1}=5,D_{2}=3$ towards planning trajectories.

III-F3 Optimization

The optimization for hybrid prediction-guided planning is defined as an open-loop optimization problem under finite horizons. The general formulation is as follows:

\mathbf{u}^{*}=\mathop{\arg\min}_{\mathbf{u}}\frac{1}{2}\sum_{i}\|\omega_{i}c_{i}(\mathbf{u},\mathbf{X},\mathbf{\hat{Y}},\mathbf{\hat{O}})\|^{2},

(15)

where $\mathbf{u}$ is the planning variable, and $\omega_{i}$ denotes the weight for cost function $c_{i}$ . For generalized ADS, HPP solves this optimization problem according to different criteria:

Reference routes: Suppose an accessible reference route $\mathcal{I}\in\mathbb{R}^{L_{\text{ref}}\times d_{\text{ref}}}$ that is densely interpolated, HPP transforms all cost profiles under Frenet coordinates to alleviate optimization difficulties. Each reference point $r\in\mathcal{I}$ is assigned with tangential and normal vectors: $[\vec{t}_{r},\vec{n}_{r}]$ . The Cartesian coordinate $\vec{y}=(x,y)$ can then be transformed to $\vec{r}=(s,d)$ via:

\vec{y}(s(t),d(t))=\vec{r}(s(t))+d(t)\vec{n}_{r}(s(t)).

(16)

Path planning: This handles trajectories of lower frequency for the ego vehicle. Direct optimization is performed for $\tau=\mathbf{u}$ from the Ego Planner, producing the optimized path $\tau^{*}$ . Only the safety cost is considered in this operation.

Motion planning: This addresses per-step future states of the ego vehicle. Optimization is conducted using model predictive control (MPC) for control actions $\mathbf{u}=[a,\delta]_{1:T}$ based on inverse dynamics: $\mathbf{u}_{t}=\mathcal{T}^{-1}(\tau_{t+1},\tau_{t})$ [36]. The optimal motion planning is then transformed back through forward dynamics: $\tau^{*}_{t+1}=\mathcal{T}(\tau^{*}_{t},\mathbf{u}^{*}_{t})$ after optimizations.

To solve this non-linear optimization problem, as illustrated in Equ. 15, we utilize the Gauss-Newton method [50]. that iteratively refines the planning variable as the initial value. The cost weights can be meticulously designed or learned directly, as the entire optimization process is fully differentiable [3].

IV Experiments

In this section, we first introduce the experimental settings for the proposed HPP, including testing benchmarks, evaluation metrics, and detailed implementations. Subsequently, HPP is quantitatively compared against existing state-of-the-art methods and systems in predictions and planning. Discussions on ablation studies unveil the effectiveness and mechanism of modular co-design optimizations for HPP. Qualitative results further ablate the characteristics of HPP against certain state-of-the-art baselines.

IV-A Experimental Setup

IV-A1 Testing Benchmarks

To discover the comprehensive performances in predictions and planning for HPP, we summarize three questions to be tackled by benchmark testing: (1) How is the capabilities of HPP as full-stack ADS under interactive real-world cases? (2) How is the long-term horizon performance of HPP in interactive real-world scenarios? (3) How is the long-term driving functioning by HPP in continuous realistic scenarios? These prompt benchmarks accordingly:

(1) nuScenes dataset [51]: This dataset is among the largest and most widely used for full-stack autonomous driving. It includes over 1,000 20-second frames of driving scenarios annotated at 2 Hz, covering four cities worldwide. Benchmarked evaluations [2] involve 6,019 frames for all tasks of ADS under open-loop settings. Testing horizons are defined at $T=3\,\mathrm{s}$ and $T=6\,\mathrm{s}$ for motion predictions.

(2) Waymo open motion dataset (WOMD) [52]: Utilized for long-term motion evaluations, it is the largest real-world dataset for interactive scenarios. It comprises 104,000 20-second frames representing unique scenarios, marked at 10 Hz, and encompasses over 570 km of driving and 1750 km of roadways. To assess long-term real-world performance in planning and motion predictions, evaluations are systematically conducted in the SMARTS benchmark [34]. This includes 400 highly interactive scenarios, each lasting 9 seconds, featuring representative behaviors. Autonomous vehicles are tasked with 5-second long-term planning and predictions in both open-loop and closed-loop configurations. Closed-loop testing involves leveraging the log simulator to replay driving scenarios for online interactions. To further examine the performance of joint predictions in HPP, the system is tested on the Waymo Occupancy Predictions benchmark [53]. This involves predicting over 44,000 driving scenes of occupancy and flow within an 8-second timeframe at a frequency of 1 Hz.

TABLE I: nuScenes Open-loop Planning Testing Results (

{}^{\dagger}:

Lidar inputs;

{}^{\ddagger}:

Ego status;

{}^{*}:

Augmentations)

Methods	Collision rate (%) $\downarrow$				Planning error (m) $\downarrow$
Methods	@1 s	@2 s	@3 s	Avg.	@1 s	@2 s	@3 s	Avg.
NMP^† [54]	-	-	1.92	-	-	-	2.31	-
SA-NMP^† [54]	-	-	1.59	-	-	-	2.05	-
FusionAD^†‡ [15]	0.02	0.08	0.27	0.12	-	-	-	0.81
FF^† [55]	0.06	0.17	1.07	0.43	0.55	1.20	2.54	1.43
EO^† [56]	0.04	0.09	0.88	0.33	0.67	1.36	2.78	1.60
OccNet [57]	0.21	0.59	1.37	0.72	1.29	2.13	2.99	2.13
UniAD [2]	0.05	0.17	0.71	0.31	0.48	0.96	1.65	1.03
HPP^‡	0.03	0.07	0.35	0.15	0.30	0.61	1.15	0.72
HPP	0.03	0.17	0.68	0.29	0.48	0.91	1.54	0.97
Avg. Metrics	@1 s	@2 s	@3 s	Avg.	@1 s	@2 s	@3 s	Avg.
ST-P3 [4]	0.23	0.62	1.27	0.71	1.33	2.11	2.90	2.11
VAD-Base^‡ [5]	0.07	0.10	0.24	0.14	0.17	0.34	0.60	0.37
VAD-Base [5]	0.07	0.17	0.41	0.22	0.41	0.70	1.05	0.72
DeepEM^∗ [58]	0.05	0.15	0.36	0.19	0.25	0.45	0.73	0.48
HPP^‡	0.02	0.04	0.11	0.06	0.26	0.37	0.59	0.40
HPP	0.03	0.08	0.24	0.12	0.41	0.61	0.86	0.63

(3) CARLA simulator [59]: We utilize the Longest6 benchmark [60] for long-term driving evaluations. The autonomous vehicle is assigned a $T=2\,\mathrm{s}$ horizons closed-loop planning task across 36 routes, ranging from 1.6 to 1.8 km, under various driving conditions in six CARLA towns.

IV-A2 Testing Metrics

We adhere to the original benchmarks in the testing metrics configurations. Coinciding with the contributions of HPP, the devised metrics primarily focus on three aspects, i.e., Accuracy, Consistency, and Safety for predictions and planning. Detailed metrics are listed as follows:

(1) Occupancy prediction: Intersections over Union (IoU) and Area Under the Curve (AUC) [61] quantify the overall and per-grid prediction accuracy for occupancy. Video Panoptic Quality (VPQ) [62] is adopted to assess the consistency of occupancy across marginal agents and perceptions.

(2) Motion prediction: Prediction accuracy is assessed using minimum average and final displacement errors (minADE, minFDE) for trajectories, as well as miss rate (MR) for each agent[52]. Consistency is tested through joint displacement errors (JADE, JFDE) for all agents[34], along with End-to-End Prediction Accuracy (EPA) [63] over perceptions.

(3) Planning: For open-loop testing, planning accuracy is evaluated using displacement errors (DE)[2] and the average distance [5]. Consistency in predictions and safety is assessed through collision rates (CR) [64]. In closed-loop settings, external measurements include infractions (IS), vehicle collisions (CV), routes completion (RC), and a driving score (DS) that encompasses overall driving performance [59].

IV-A3 Implementation Details

For fair comparisons, HPP is configured according to each benchmark with carefully devised learning pipelines and system architectures. In the nuScenes dataset, full-size training is conducted with a total batch size of 4. For planning, 10% of the full training set is randomly sampled, and the full set is utilized for the occupancy benchmark in WOMD. Learning occurs in batches of 24. The expert dataset from the CARLA benchmark [28], collected in different towns, is directly used for training in batches of 32.

Training strategies for all benchmarks are aligned using a distributed strategy on four NVIDIA A100 GPUs. The AdamW optimizer is employed with an initial learning rate of 1e-4, and a cosine annealing learning rate strategy is applied. The total number of training epochs is set to 20. We apply the same GPU devices for nuScenes and WOMD for testing. Evaluations for the CARLA benchmark are conducted in one NVIDIA RTX 3080 GPU.

For system architectures, HPP establishes BEV perception ego-centered within $\pm$ 50 m of $H,W=200$ in nuScenes. In WOMD and CARLA benchmarks, privileged perceptions are assumed. Therefore, HPP is developed by encoding perfect scene context inputs according to our previous work [34, 65]. For CARLA benchmarks, BEV remains ego-centered within $\pm 32$ m of $H,W=128$ . BEV settings for WOMD follow the official benchmark guidelines. Path planning is optimized without knowledge of reference routes in nuScenes and CARLA. In WOMD, motion planning is conducted considering reference information. The full-stack HPP in nuScenes considers various queries for agents. In WOMD and CARLA, agents are sorted and filtered to $N_{A}=11$ . We select the ReLU activation function and apply a dropout rate of 0.1. We refer more detailed parameters with notations in Table XII.

IV-B Main Results

IV-B1 Full-stack ADS Performance

We report HPP’s testing performance against recent state-of-the-art methods on the nuScenes dataset. HPP has achieved state-of-the-art results across various key metrics in both prediction and planning.

TABLE II: Open-loop Planning Compared with LLMs (

{}^{\ddagger}:

Ego status)

Methods	Collision rate (%) $\downarrow$				Planning error (m) $\downarrow$
Methods	@1 s	@2 s	@3 s	Avg.	@1 s	@2 s	@3 s	Avg.
GPTDriver^‡ [44]	0.07	0.15	1.10	0.44	0.27	0.74	1.52	0.84
Agent-Driver^‡ [45]	0.02	0.13	0.48	0.21	0.22	0.65	1.34	0.74
HPP^‡	0.03	0.07	0.35	0.30	0.61	1.15	0.72	0.15

TABLE III: Testing Results on nuScenes Occupancy Prediction

Methods	IoU-n. $\uparrow$	IoU-f. $\uparrow$	VPQ-n. $\uparrow$	VPQ-f. $\uparrow$
FIERY[66]	59.4	36.7	50.2	29.9
StretchBEV[41]	55.5	37.1	46.0	29.0
ST-P3[4]	-	38.9	-	32.1
BEVerse[40]	61.4	40.9	54.3	36.1
PowerBEV[67]	62.5	39.4	55.5	33.8
UniAD[2]	63.4	40.2	54.7	33.5
HPP	64.8	40.5	56.4	34.7

(1) Planning results: Described in Table I, HPP achieves state-of-the-art (SOTA) results across all planning horizons (@1 s-@3 s) against various autonomous driving systems in both absolute metrics [2] and average ones [5]. Specifically, HPP presents a 5.9% lower average L2 errors and a 6.5% lower average collision rate compared to UniAD [2], utilizing identical BEV perceptions. This showcases the validity of the modular co-design for HPP in predictions and planning.

In average metrics, HPP reports a 12.5% improvement in planning errors against VAD [5] which highlights superior perception modules. The reasoning design for HPP manifests with a 30% lower collision rate compared to DeepEM [58], which also features reasoning by EM decoding and extra de-noising augmentations. Compared with methods leveraging ego-status $\mathbf{s}_{\text{ego}}$ , a variant of HPP adding ego-status (denoted HPP^‡) in Ego Planner (Sec. III-E) is trained with boasting results. HPP^‡ does not include accelerations in [5] as leakage of ground-truths. In comparison with FusionAD [15], which depends on excessive LIDAR fusion, HPP also demonstrates an 11.8% lower planning error with comparable safety. Our system also exhibits 2.8% lower errors and nearly 20% lower collision rates compared to LLM methods (see Table II). This further substantiates the effectiveness of modular integration of predictions in planning-oriented objectives, as LLM baselines focus more on alignments by language knowledge.

(2) Predictions results: Joint results of occupancy predictions are tested in two ranges (near: $30\times 30$ m; far: $50\times 50$ m) centered on the autonomous vehicle. Shown in Table III, HPP presents advanced accuracy +2.5% and consistency +3.7% compared to [2], thanks to proposed Ms-OccFormer that integrate mutually with motion predictions. Validity of modular co-design in HPP is further manifested by +4% and +1.6% improved IoU and VPQ without extra augmentations, compared with [67] learning single occupancy task.

Marginal results of motion predictions are presented in Table IV. Here, we compare the prediction results averaged from all vehicles (-v.) and measure full agent (-f.) results weighted by categories. HPP reports a 2.8% lower minADE and a +3.2% EPA gain in vehicle predictions, with 3.7% and +6% improved performance predicting all agents compared to baselines [2] under the same perception settings. This highlights the performance gain achieved through game-theoretic reasoning and joint dependencies in GTFormer (Sec. III-D).

TABLE IV: Testing Results on nuScenes Motion Predictions

Methods	minADE $\downarrow$	minFDE $\downarrow$	MR. $\downarrow$	EPA $\uparrow$
ViP3D[63]	1.15	1.95	22.6	0.222
PnPNet[38]	2.05	2.84	24.6	0.226
UniAD-v.[2]	0.71	1.02	15.1	0.456
UniAD-f.[2]	0.911	1.236	15.1	0.314
HPP-v.	0.682	0.947	13.8	0.471
HPP-f.	0.878	1.205	14.5	0.334

TABLE V: Testing Results on Waymo Occupancy Flow Benchmark

Methods	AUC	AUC	EPE	AUC
Methods	-obs. $\uparrow$	-occ. $\uparrow$	-f. $\downarrow$	-FT $\uparrow$
MotionPerceiver	77.1	-	-	-
OFMPNet	77.0	16.5	3.58	76.1
STCNN	74.4	16.8	3.87	73.3
HOPE[20]	80.3	16.5	3.67	83.9
STrajNet[48]	77.8	17.8	3.20	78.5
VectorFlow[21]	75.4	17.3	3.58	76.7
HPP	79.7	19.4	2.95	80.2

It is important to note that HPP focuses on predictions and planning rather than perception. Prediction results could potentially be further enhanced with better perceptions [5] or additional LIDAR inputs [15]. However, HPP still outperforms in final planning results, demonstrating better social compliance denoted as our contributions.

IV-B2 Long-horizon Interactive Performance

We benchmark HPP’s long-horizon performance under WOMD. HPP has underscored advanced performances against numerous SOTA methods in predictions and planning.

(1) Occupancy results: Shown in Table V, HPP demonstrates superior prediction accuracy (+8.8% AUC-occ.) and lower flow error (9.1% EPE-f.) when considering joint flow predictions. The explicit design of MS-OccFormer in HPP has proven its strong accuracy compared to our previous work [65], which only conducts global interactions without marginal awareness. HPP presents a +2.4% improvement in AUC-obs. and +2.2% in flow-traced occupancy.

TABLE VI: Open-loop Planning Testing Results on WOMD Scenarios (SMARTS Benchmark)

Method	Collision rate	Miss rate	Planning error (m) $\downarrow$			Prediction error (m) $\downarrow$
Method	(%) $\downarrow$	(%) $\downarrow$	@1 s	@3 s	@5 s	JADE	JFDE
Vanilla IL [3]	4.25	15.61	0.216	1.273	3.175	–	–
DIM [68]	4.96	17.68	0.483	1.869	3.683	–	–
OPGP [9]	3.79	12.89	0.245	1.672	3.099	–	–
MultiPath++ [69]	2.86	8.61	0.146	0.948	2.719	–	–
MTR-e2e [19]	2.32	8.88	0.141	0.888	2.698	–	–
DIPP [3]	2.33	8.44	0.135	0.928	2.803	0.925	2.059
GameFormer [34]	1.98	7.53	0.129	0.836	2.451	0.853	1.919
HPP	1.85	7.58	0.092	0.881	2.667	0.829	1.965

TABLE VII: Close-loop Planning Testing Results on WOMD Scenarios (SMARTS Benchmark)

Method	Success rate	Progress	Acceleration	Jerk	Lateral acc.	Position error to expert driver ( $\mathrm{m}$ ) $\downarrow$
Method	(%) $\uparrow$	$(\mathrm{m})$ $\uparrow$	( $\mathrm{m}/\mathrm{s}^{2}$ )	( $\mathrm{m}/\mathrm{s}^{3}$ )	( $\mathrm{m}/\mathrm{s}^{2}$ )	@3 s	@5 s	@8 s
Expert	-	54.52	0.529	1.020	0.103	-	-
Vanilla IL [3]	0	6.23	1.588	16.24	0.661	9.355	20.52	46.33
RIP [68]	19.5	12.85	1.445	14.97	0.355	7.035	17.13	38.25
CQL [70]	10	8.28	3.158	25.31	0.152	10.86	21.18	40.17
DIPP [3]	68.12 $\pm$ 5.51	41.08 $\pm$ 5.88	1.44 $\pm$ 0.18	12.58 $\pm$ 3.23	0.31 $\pm$ 0.11	6.22 $\pm$ 0.52	15.55 $\pm$ 1.12	26.10 $\pm$ 3.88
GameFormer [34]	73.16 $\pm$ 6.14	44.94 $\pm$ 7.69	1.19 $\pm$ 0.15	13.63 $\pm$ 2.88	0.32 $\pm$ 0.09	5.89 $\pm$ 0.78	12.43 $\pm$ 0.51	21.02 $\pm$ 2.48
HPP	74.28 $\pm$ 5.49	47.17 $\pm$ 8.92	1.33 $\pm$ 0.18	11.68 $\pm$ 2.76	0.35 $\pm$ 0.09	4.93 $\pm$ 0.78	10.24 $\pm$ 0.82	18.99 $\pm$ 3.05
DIPP (optim.)	92.16 $\pm$ 0.62	51.85 $\pm$ 0.14	0.58 $\pm$ 0.03	1.54 $\pm$ 0.19	0.11 $\pm$ 0.01	2.26 $\pm$ 0.10	5.55 $\pm$ 0.24	12.53 $\pm$ 0.48
GameFormer (optim.)	94.50 $\pm$ 0.66	52.67 $\pm$ 0.33	0.53 $\pm$ 0.02	1.56 $\pm$ 0.23	0.10 $\pm$ 0.01	2.11 $\pm$ 0.21	4.87 $\pm$ 0.18	11.13 $\pm$ 0.33
HPP (optim.)	92.25 $\pm$ 0.85	52.19 $\pm$ 0.41	0.66 $\pm$ 0.02	1.87 $\pm$ 0.28	0.10 $\pm$ 0.01	2.13 $\pm$ 0.29	4.90 $\pm$ 0.26	12.89 $\pm$ 0.38

(2) Open-loop results: Table VI reveals the compelling results of HPP over numerous state-of-the-art (SOTA) methods in open-loop prediction and planning. HPP achieves notable reductions of over 50% in collisions and 26% in planning errors compared to imitation learning baselines [3, 68] that discount predictions. When compared with SOTA motion prediction baselines as imitative planners, HPP also demonstrates 2.3% to 17.6% reductions in planning errors and 21.1% fewer collisions. This underscores the importance of integrated predictions and planning.

In comparison with integrated baselines, HPP presents significant improvements over the IOP framework [9] due to the absence of future interactions for planning and the intractable occupancy in IOP that hampers normal guidance. When compared with IPP systems, HPP delivers superior performance in collision rates (6.6%), joint prediction errors (2.9%), and comparable planning errors. Compared to the state-of-the-art results from our previous works [3, 34], which focus on modeling future interactions and reasoning (see Fig. 5), HPP exhibits superior short-term performance with reduced variance. This further validates the efficacy of co-design integration for joint predictions and planning upon reasoning in HPP.

(3) Closed-loop results: In Table VII, HPP undergoes testing in a replay simulator against state-of-the-art IPP systems [34], IL methods [68], and RL approaches [70]. IL and RL results are significantly compromised due to accumulated distributional shifts in closed-loop interactions with the environment. HPP exhibits a higher success rate (+1.5%) and lower planning errors (13.2%) compared to our previous methods [34], thanks to the hybrid prediction-aware planner designed conditioned on goal context. The closed-loop performance sees substantial improvements with online optimizations. Due to the high cost of the occupancy process in raw data, HPP is re-planned by online refinements at 2 Hz. Testing results demonstrate strong performance against our previous methods [34], which re-plan more frequently at 10 Hz, achieving 8.5% lower closed-loop positional errors by [3].

TABLE VIII: Testing Results on CARLA Longest6 Benchmark

Methods	DS $\uparrow$	RC^∗ $\uparrow$	IS $\uparrow$	CV $\downarrow$
Rule-Based[59]	38.0	29.1	0.84	0.64
KING[36]	45.1	78.3	0.55	1.67
Roach[71]	55.3	88.2	0.62	0.72
PlanT^∗[28]	70.9	83.1	0.87	0.31
HPP^∗	65.5	79.2	0.82	0.51

TABLE IX: Ablation study on modular co-design integration among predictions and planning

ID	MS-OccFormer(wo.)			GTFormer(wo.)			EgoPlanner (wo.)		Occupancy Prediction				Motion Prediction			Planning
ID	Score	Motion	AC-Occ.	Score	Motion	Occ.	Motion	Occ.	IoU-n. $\uparrow$	IoU-f. $\uparrow$	VPQ-n. $\uparrow$	VPQ-f. $\uparrow$	minADE $\downarrow$	MR $\downarrow$	EPA $\uparrow$	ADE $\downarrow$	CR $\downarrow$
1	✗	-	-	-	-	-	-	-	63.5	40.0	54.5	33.7	0.733	16.1	45.6	0.989	0.428
2	-	✗	-	-	-	-	-	-	63.1	39.4	53.6	32.8	0.739	16.2	44.8	0.994	0.441
3	-	✗	✗	-	-	-	-	-	61.9	39.6	50.8	30.9	0.748	16.6	43.8	1.015	0.466
4	-	-	-	✗	-	-	-	-	63.7	39.8	55.0	33.3	0.728	15.9	45.0	0.986	0.402
5	-	-	-	-	✗	-	-	-	63.5	39.5	54.6	33.5	0.734	16.4	44.8	0.992	0.432
6	-	-	-	-	✗	✗	-	-	63.5	39.6	54.8	33.0	0.747	16.2	44.2	0.998	0.436
7	-	-	-	-	-	-	✗	-	63.6	39.8	55.0	33.4	0.726	16.0	45.7	0.989	0.414
8	-	-	-	-	-	-	-	✗	63.5	39.7	54.7	33.3	0.721	15.6	45.4	1.012	0.458
9	-	-	-	-	-	-	✗	✗	63.5	39.8	54.8	33.4	0.726	15.9	45.7	0.997	0.449
0	-	-	-	-	-	-	-	-	63.7	39.8	55.0	33.5	0.724	15.9	45.8	0.985	0.402

IV-B3 Long-term Driving Performance

HPP demonstrates comparable driving capabilities against the state-of-the-art method [28] in the CARLA benchmark (Table VIII). With significant improvements over rule-based agents [59], HPP achieves a $+20.4$ driving score compared to IPP methods [36], which guide planning by adversarial predictions, as well as a $+12.2$ improvement compared to RL methods [71]. Noted that both HPP and the reproduced [28] (^∗) show compromised route completions (RC), likely due to GPU inference issues¹¹1https://github.com/autonomousvision/plant/issues/17.

IV-C Ablation Studies and Discussions

To unleash the internal effectiveness in HPP, comprehensive ablation studies are conducted in nuScenes centered on discussing the roles in modular co-design and characteristics for hybrid prediction in planning guidance.

IV-C1 Roles in Modular Co-design

To elicit the effectiveness of co-design among each module, as presented in Table IX, we purposefully remove certain key designs for integration across each sub-module. Referred to ID 1-9, all ablated baselines are compared to the original HPP (ID.0). Detailed discussions are as follows:

(1) Effects in Ms-OccFormer: For marginal dependencies (Sec. III-C2), when removing score scaling (ID.3 vs. ID.0), we observe a slight trade-off under occupancy range (-0.5 VPQ-n. vs. +0.2 VPQ-f.), with increased collisions ( $+0.026\%$ ). This indicates the scaling captures multi-modal features for joint predictions and enhances the shared occupancy area nearby. Removing motion fusion (ID.2) causes inclusive decreases for lack of marginal alignments. Greater decreases are observed (-1.2 IoU-n. and -3.7 VPQ-n.) in removing agent-conditioned occupancy modeling. It highlights the importance of future interactions and validates the consistency towards motion predictions ( $2.4\%$ lower for prediction errors). Results in planning also reflect more contributions in near-scale joint prediction interactions to planning. Comparing ID.2 and ID.3, without sacrificing far-scale occupancy (-1.2 IoU.-n vs. +0.2 IoU-f.), the planning errors and collisions increase along with a drop in near-scale prediction accuracy. For the key design of multi-scale predictions integration, qualitative (see Fig. 6) and quantitative ablations (Table X) have demonstrated notable improvements from multi-scale attention mask update design (+1.2 IoU-n.) as well as local integration (+1.0 VPQ-n.) for prediction consistency.

(2) Effects in GTFormer: As the core module that enables reasoning capabilities for consistent predictions and planning, we try to discover roles by removing marginal and joint predictions in GTFormer. Compared ID.4 with ID.5, the removal of the reasoning module had a thorough drop, especially planning ( $+0.03\%$ CR) and motion predictions ( $+3.1\%$ MR). This highlights the importance of reasoning in consistency and accuracy for future interactions. Meanwhile, considering ID.5 and ID.6, the increase of prediction errors ( $+1.7\%$ minADE) implies that joint dependencies (Sec. III-D2) are the key in consistent motion predictions upon GT-reasoning. This reflects an enhanced mutual consistency modulated by interactive modeling between joint and marginal predictor co-design.

TABLE X: Ablations on multi-scale prediction-wise integration

Baselines	IoU-n. $\uparrow$	IoU-f. $\uparrow$	VPQ-n. $\uparrow$	VPQ-f. $\uparrow$
w/o. global integration	61.6	38.8	52.0	31.9
w/o. local integration	61.8	38.8	52.8	32.1
w/o. attn. mask $\textbf{M}_{t}$	62.5	39.0	53.8	32.4
HPP	63.7	39.8	55.0	33.5

TABLE XI: Ablation study on hybrid prediction guidance for planning

ID	$\mathcal{L}_{col}$	Occ.	Most-likely	Full-motion	Collision rate (%) $\downarrow$				Planning error (m) $\downarrow$
ID	$\mathcal{L}_{col}$	Occ.	Most-likely	Full-motion	@1 s	@2 s	@3 s	Avg.	@1 s	@2 s	@3 s	Avg.
1	-	-	-	-	0.11	0.31	0.73	0.39	0.19	0.53	1.13	0.61
2	✓	-	-	-	0.07	0.22	0.64	0.31	0.18	0.52	1.09	0.60
3	✓	-	✓	-	0.05	0.11	0.60	0.25	0.19	0.52	1.10	0.60
4	✓	-	-	✓	0.04	0.09	0.53	0.20	0.19	0.52	1.10	0.60
5	✓	✓	-	-	0.03	0.14	0.45	0.20	0.30	0.61	1.16	0.72
6	✓	✓	✓	-	0.03	0.14	0.44	0.20	0.31	0.61	1.17	0.72
0	✓	✓	-	✓	0.03	0.07	0.35	0.15	0.30	0.61	1.15	0.72

(3) Effects in Ego Planner: Validating the roles of plan-conditioning design compared to our previous work [34] in Sec. IV-B2, we further examine the effects of hybrid prediction interactions. In ID.9, we observe marginal improvements in predictions (+0.2 IoU, +0.1 EPA) and more significant enhancements ( $-0.047\%$ CR) in planning with the inclusion of hybrid prediction interactions. This underscores the consistency modeling contributed along with the HPP design. Ablations from ID.7 and ID.8 further suggest a substantial impact of interactions with occupancy predictive features ( $-0.056\%$ CR) over marginal ones ( $-0.012\%$ CR) on planning. Solely motion conditional planning (ID.8) outputs overly optimistic motion predictions (-0.3 MR). This in turn harms the original game-theoretic reasoning, resulting in inferior planning ( $+0.035\%$ CR). These results underscore the modulating effect of joint dependencies in both consistencies for planning and motion predictions.

IV-C2 Roles in hybrid prediction

To further delve into the characteristics of marginal and joint predictions, listed in Table XI, the original HPP (ID.0) is measured with ablations (ID.1-6) guiding: reasoning loss ( $\mathcal{L}_{col}$ ); marginal predictions (most likely, full); and joint predictions (Occ.) during learning and optimizations. Key findings are summarized below:

(1) Consistent reasoning: Comparing ID.1 and ID.2, significant safety improvements ( $-25.8\%$ CR) highlights the consistency role for marginal predictions in reasoning learning ( $\mathcal{L}_{col}$ ). Joint predictions in HPP are leveraged to guide internal marginal consistency that modulates planning.

(2) Complementary influences: hybrid prediction benefit planning in mutual coverage gains compared with sole guidance ( $-33\%$ CR) in ID.4 and ID.5. This is enhanced upon consistency design in HPP. 1) Compared with ID.3, guiding with full motion predictions (ID.2) spawns safe planning ( $-20\%$ CR) without sacrificing accuracy. 2) A good alignment in hybrid prediction results in close performance in ID.5 and ID.6 adding the most likely marginal prediction. These mutual effects ascertain the planning performance in case either of the hybrid prediction is less functioning due to: (1) state error for pose and speed; (2) future uncertainty; or (3) discontinuation under spatial temporal horizons.

IV-D Qualitative Results

Figs. 7 and 8 showcase the qualitative benefits of integrating hybrid prediction in the nuScenes [51] and WOMD [52] benchmarks, respectively. In the comprehensive testing scenario depicted in Fig. 7a, joint guidance from occupancy ensures consistent motion predictions during cruising. Notably, Fig. 7b demonstrates that hybrid prediction awareness contributes to consistent planning, enhancing reasoning behavior for smooth cruising without overly conservative avoidance, thereby mitigating risks near lane boundaries. Fig. 7c further illustrates the complementary effects, where motion predictions take precedence in guiding planning when occupancy is uncertain. The mutual influence is evident in Fig. 7d, where the ego vehicle remains in the lane instead of avoiding, as the occupancy predictions for rear cyclists are constrained by marginal predictions to follow a straight trajectory, maintaining a safe distance from the ego vehicle.

In Fig. 8, qualitative assessments of WOMD interactive scenarios underscore HPP’s superior reasoning capabilities and safety-conscious planning. In Fig. 8a, DIPP [3] encounters failure in an emergency stop, exposing a lack of consistency between motion predictions and planning, while HPP and GameFormer [34] exhibit smoother planning around obstacles. Motion prediction in HPP operates more smoothly with [34], as occupancy predictions modulate the joint behaviors. This is important as the reasoned motion predictions [34] may trade consistency for safety (Fig. 8b) and accuracy (Fig. 8c). High-speed situations (Fig. 8d, Fig. 8e) further demonstrate HPP’s mutual consistency between motion predictions and planning, showcasing its adaptability and robustness in diverse driving environments. These results underscore the efficacy of HPP in enhancing both the consistency and safety aspects of the autonomous driving system.

IV-E Future Outlook

In introducing HPP, we advocate for the adoption of modular co-design and optimization principles to shape ADS. As the notion of a planning-oriented modular learning system gains traction, it is crucial to underscore the significance of well-organized and concurrently integrated modules that mirror the complexities of real-world scenarios. In the future, we see agent-based models, such as LLMs, facilitating connections between various modules to ensure seamless integration. HPP lays the groundwork for module-wise integration, and future endeavors will delve deeper into exploring intermediate-level integration and guidance between prediction and planning.

V Conclusions

In this paper, we develop HPP, a modular co-design optimization framework for autonomous driving systems. The main focus of HPP is on integrating hybrid prediction and planning, wherein three sub-modules, i.e., MS-OccFormer, GTFormer, and Ego Planner address consistency issues and enhance adaptability using the hybrid prediction-guided learning and optimization pipeline. HPP has been extensively evaluated across diverse benchmarks, and the results consistently demonstrate its superior performance in both prediction and planning metrics compared to state-of-the-art methods. The hybrid prediction awareness in HPP, which incorporates joint behaviors and occupancy predictions, improves qualitative consistency, safety, and feasibility in real-world scenarios. By elucidating the roles in modular co-design and the complementary effects that enhance planning through hybrid prediction, HPP showcases its potential to advance the field by tackling challenges in prediction and planning, contributing to the ongoing evolution of autonomous driving frameworks.

TABLE XII: Notations and parameters

Notation	Meaning	Testing Benchmarks
Notation	Meaning	nuScenes	WOMD	CARLA
$H, W$	Size of BEV space	200	128/256	128
	BEV length (m)	50	64	64
$N_{A}$	Max number of agents	-	11/33	11
$N_{M}$	Max number of map	-	100	50
$T$	Future Horizons (s)	3/2/6	5/8	2
	Future frequency (Hz)	2	2/1	2
$M$	Number of modalities	6	6	6
$L$	Occ integration levels	3	3	3
$K$	Reasoning levels	3	3	3
$L_{P}$	Layers in Ego Planner	3	3	3
$h$	Number of attention heads	8	8	8
$D$	Embedding size	256	256	256
	Activation function	$\operatorname{ReLU}$
	Dropout rate	0.1	0.1	0.1
	Batch size	4	24	32
	Learning rate	$1e^{-4}$	$1e^{-4}$	$1e^{-4}$
	Training epochs	20	20	20

References

[1] L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li et al., “Milestones in autonomous driving and intelligent vehicles: Survey of surveys,” IEEE Transactions on Intelligent Vehicles, 2022.
[2] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 853–17 862.
[3] Z. Huang, H. Liu, J. Wu, and C. Lv, “Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving,” arXiv preprint arXiv:2207.10422, 2022.
[4] S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,” in European Conference on Computer Vision. Springer, 2022, pp. 533–549.
[5] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” arXiv preprint arXiv:2303.12077, 2023.
[6] S. Hagedorn, M. Hallgarten, M. Stoll, and A. Condurache, “Rethinking integration of prediction and planning in deep learning-based automated driving systems: a review,” arXiv preprint arXiv:2308.05731, 2023.
[7] X. Mo, Z. Huang, Y. Xing, and C. Lv, “Multi-agent trajectory prediction with heterogeneous edge-enhanced graph attention network,” IEEE Transactions on Intelligent Transportation Systems, 2022.
[8] S. Pini, C. S. Perone, A. Ahuja, A. S. R. Ferreira, M. Niendorf, and S. Zagoruyko, “Safe real-world autonomous driving by learning to predict and plan with a mixture of experts,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 10 069–10 075.
[9] H. Liu, Z. Huang, and C. Lv, “Occupancy prediction-guided neural planner for autonomous driving,” arXiv preprint arXiv:2305.03303, 2023.
[10] Y. Hu, K. Li, P. Liang, J. Qian, Z. Yang, H. Zhang, W. Shao, Z. Ding, W. Xu, and Q. Liu, “Imitation with spatial-temporal heatmap: 2nd place solution for nuplan challenge,” arXiv preprint arXiv:2306.15700, 2023.
[11] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” arXiv preprint arXiv:1812.03079, 2018.
[12] S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, P. Jennings, and A. Mouzakitis, “Deep learning-based vehicle behavior prediction for autonomous driving applications: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 1, pp. 33–47, 2020.
[13] J. Kim, R. Mahjourian, S. Ettinger, M. Bansal, B. White, B. Sapp, and D. Anguelov, “Stopnet: Scalable trajectory and occupancy prediction for urban autonomous driving,” arXiv preprint arXiv:2206.00991, 2022.
[14] L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,” arXiv preprint arXiv:2306.16927, 2023.
[15] T. Ye, W. Jing, C. Hu, S. Huang, L. Gao, F. Li, J. Wang, K. Guo, W. Xiao, W. Mao et al., “Fusionad: Multi-modality fusion for prediction and planning tasks of autonomous driving,” arXiv preprint arXiv:2308.01006, 2023.
[16] X. Jia, P. Wu, L. Chen, Y. Liu, H. Li, and J. Yan, “Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” IEEE transactions on pattern analysis and machine intelligence, 2023.
[17] Z. Huang, X. Mo, and C. Lv, “Multi-modal motion prediction with transformer-based neural network for autonomous driving,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2605–2611.
[18] X. Mo, Y. Xing, H. Liu, and C. Lv, “Map-adaptive multimodal trajectory prediction using hierarchical graph neural networks,” IEEE Robotics and Automation Letters, 2023.
[19] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, 2022.
[20] Y. Hu, W. Shao, B. Jiang, J. Chen, S. Chai, Z. Yang, J. Qian, H. Zhou, and Q. Liu, “Hope: Hierarchical spatial-temporal network for occupancy flow prediction,” arXiv preprint arXiv:2206.10118, 2022.
[21] X. Huang, X. Tian, J. Gu, Q. Sun, and H. Zhao, “Vectorflow: Combining images and vectors for traffic occupancy and flow prediction,” arXiv preprint arXiv:2208.04530, 2022.
[22] T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde, “Thomas: Trajectory heatmap output with learned multi-agent sampling,” arXiv preprint arXiv:2110.06607, 2021.
[23] A. Kamenev, L. Wang, O. B. Bohan, I. Kulkarni, B. Kartal, A. Molchanov, S. Birchfield, D. Nistér, and N. Smolyanskiy, “Predictionnet: Real-time joint probabilistic traffic prediction for planning, control, and simulation,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8936–8942.
[24] Z. Huang, H. Liu, J. Wu, and C. Lv, “Conditional predictive behavior planning with inverse reinforcement learning for human-like autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, 2023.
[25] P. Hang, C. Lv, C. Huang, J. Cai, Z. Hu, and Y. Xing, “An integrated framework of decision making and motion planning for autonomous vehicles considering social behaviors,” IEEE transactions on vehicular technology, vol. 69, no. 12, pp. 14 458–14 469, 2020.
[26] D. Xu, Y. Chen, B. Ivanovic, and M. Pavone, “Bits: Bi-level imitation for traffic simulation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2929–2936.
[27] H. Liu, Z. Huang, X. Mo, and C. Lv, “Augmenting reinforcement learning with transformer-based scene representation learning for decision-making of autonomous driving,” arXiv preprint arXiv:2208.12263, 2022.
[28] K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger, “Plant: Explainable planning transformers via object-level representations,” arXiv preprint arXiv:2210.14222, 2022.
[29] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “Precog: Prediction conditioned on goals in visual multi-agent settings,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2821–2830.
[30] Z. Huang, H. Liu, J. Wu, W. Huang, and C. Lv, “Learning interaction-aware motion prediction model for decision-making in autonomous driving,” arXiv preprint arXiv:2302.03939, 2023.
[31] J. L. V. Espinoza, A. Liniger, W. Schwarting, D. Rus, and L. Van Gool, “Deep interactive motion prediction and planning: Playing games with motion prediction models,” in Learning for Dynamics and Control Conference. PMLR, 2022, pp. 1006–1019.
[32] C. Burger, J. Fischer, F. Bieder, Ö. Ş. Taş, and C. Stiller, “Interaction-aware game-theoretic motion planning for automated vehicles using bi-level optimization,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2022, pp. 3978–3985.
[33] W. Wang, L. Wang, C. Zhang, C. Liu, L. Sun et al., “Social interactions for autonomous driving: A review and perspectives,” Foundations and Trends® in Robotics, vol. 10, no. 3-4, pp. 198–376, 2022.
[34] Z. Huang, H. Liu, and C. Lv, “Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving,” arXiv preprint arXiv:2303.05760, 2023.
[35] P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differentiable and modular control stack for autonomous vehicles,” in Conference on Robot Learning. PMLR, 2023, pp. 2170–2180.
[36] N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya, and A. Geiger, “King: Generating safety-critical driving scenarios for robust imitation via kinematics gradients,” in European Conference on Computer Vision. Springer, 2022, pp. 335–352.
[37] S. Casas, A. Sadat, and R. Urtasun, “Mp3: A unified model to map, perceive, predict and plan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 403–14 412.
[38] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, “Pnpnet: End-to-end perception and prediction with tracking in the loop,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 553–11 562.
[39] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European conference on computer vision. Springer, 2022, pp. 1–18.
[40] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022.
[41] A. K. Akan and F. Güney, “Stretchbev: Stretching future instance prediction spatially and temporally,” in European Conference on Computer Vision. Springer, 2022, pp. 444–460.
[42] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng et al., “Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[43] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7953–7963.
[44] J. Mao, Y. Qian, H. Zhao, and Y. Wang, “Gpt-driver: Learning to drive with gpt,” arXiv preprint arXiv:2310.01415, 2023.
[45] J. Mao, J. Ye, Y. Qian, M. Pavone, and Y. Wang, “A language agent for autonomous driving,” arXiv preprint arXiv:2311.10813, 2023.
[46] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” arXiv preprint arXiv:2312.14150, 2023.
[47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
[48] H. Liu, Z. Huang, and C. Lv, “Multi-modal hierarchical transformer for occupancy flow field prediction in autonomous driving,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 1449–1455.
[49] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
[50] M. Bhardwaj, B. Boots, and M. Mukadam, “Differentiable gaussian process motion planning,” in 2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 10 598–10 604.
[51] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
[52] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou et al., “Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9710–9719.
[53] R. Mahjourian, J. Kim, Y. Chai, M. Tan, B. Sapp, and D. Anguelov, “Occupancy flow fields for motion forecasting in autonomous driving,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5639–5646, 2022.
[54] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8660–8669.
[55] P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan, “Safe local motion planning with self-supervised freespace forecasting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 732–12 741.
[56] T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan, “Differentiable raycasting for self-supervised occupancy forecasting,” in European Conference on Computer Vision. Springer, 2022, pp. 353–369.
[57] W. Tong, C. Sima, T. Wang, L. Chen, S. Wu, H. Deng, Y. Gu, L. Lu, P. Luo, D. Lin et al., “Scene as occupancy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8406–8415.
[58] Z. Chen, M. Ye, S. Xu, T. Cao, and Q. Chen, “Deepemplanner: An em motion planner with iterative interactions,” arXiv preprint arXiv:2311.08100, 2023.
[59] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning. PMLR, 2017, pp. 1–16.
[60] K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger, “Transfuser: Imitation with transformer-based sensor fusion for autonomous driving,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
[61] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[62] D. Kim, S. Woo, J.-Y. Lee, and I. S. Kweon, “Video panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9859–9868.
[63] J. Gu, C. Hu, T. Zhang, X. Chen, Y. Wang, Y. Wang, and H. Zhao, “Vip3d: End-to-end visual trajectory prediction via 3d agent queries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5496–5506.
[64] O. Scheel, L. Bergamini, M. Wolczyk, B. Osiński, and P. Ondruska, “Urban driver: Learning to drive from real-world demonstrations using policy gradients,” in Conference on Robot Learning. PMLR, 2022, pp. 718–728.
[65] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Multimodal motion prediction with stacked transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7577–7586.
[66] A. Hu, Z. Murez, N. Mohan, S. Dudas, J. Hawke, V. Badrinarayanan, R. Cipolla, and A. Kendall, “Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 273–15 282.
[67] P. Li, S. Ding, X. Chen, N. Hanselmann, M. Cordts, and J. Gall, “Powerbev: A powerful yet lightweight framework for instance prediction in bird’s-eye view,” arXiv preprint arXiv:2306.10761, 2023.
[68] N. Rhinehart, R. McAllister, and S. Levine, “Deep imitative models for flexible inference, planning, and control,” arXiv preprint arXiv:1810.06544, 2018.
[69] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov et al., “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7814–7821.
[70] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.
[71] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “End-to-end urban driving by imitating a reinforcement learning coach,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 15 222–15 232.

Hybrid-Prediction Integrated Planning for Autonomous Driving