Hybrid-Prediction Integrated Planning for
Autonomous Driving
Abstract
Autonomous driving systems require the ability to fully understand and predict the surrounding environment to make informed decisions in complex scenarios. Recent advancements in learning-based systems have highlighted the importance of integrating prediction and planning modules. However, this integration has brought forth three major challenges: inherent trade-offs of relying on a sole prediction format, consistency between prediction patterns, and social coherence in prediction and planning. To address these challenges, we introduce a hybrid-prediction integrated planning (HPP) system, which comprises three novel modules. First, we introduce marginal-conditioned occupancy prediction to align joint occupancy with agent-wise perceptions. Our proposed MS-OccFormer module achieves multi-stage alignment per occupancy forecasting with consistent awareness from agent-wise motion predictions. Second, we propose a game-theoretic motion predictor, GTFormer, to model the interactive future among individual agents with their joint predictive awareness. Third, hybrid prediction patterns are concurrently integrated with the Ego Planner and optimized by prediction guidance. HPP achieves state-of-the-art performance on the nuScenes dataset, demonstrating superior accuracy and consistency for end-to-end paradigms in prediction and planning. Moreover, we test the long-term open-loop and closed-loop performance of HPP on the Waymo Open Motion Dataset and CARLA benchmark, surpassing other integrated prediction and planning pipelines with enhanced accuracy and compatibility. Project website: https://georgeliu233.github.io/HPP
Index Terms:
Occupancy prediction, motion prediction, integrated prediction and planning, autonomous driving.
I Introduction
Autonomous driving systems (ADS) have made significant progress in perception, prediction, and planning, thanks to the advancement of learning paradigms [1]. However, the performance growth of these independent tasks has come to a halt, prompting a reconsideration of modular design optimization [2, 3, 4, 5]. Fueled by inherent interactions among autonomous vehicles and traffic participants, recent research has placed significant emphasis on the integration of prediction and planning tasks [6], seeking concurrent advancements in both.
Integrated pipelines (Fig. 1a) generally connect planning and singular prediction modules involving agent-wise motion trajectories (IPP) [3, 7, 8] or whole-scene occupancy probabilities (IOP) [9, 10, 11]. However, relying on a single prediction format inevitably incurs the shortfalls of that format. Specifically, motion prediction models continuous temporal trajectories tracked for each agent, but encounters inconsistency and exponential cost when composing joint social patterns from marginal agents [12]. Conversely, exclusive occupancy accurately predicts aligned joint patterns of whole-scene agents under bird's eye view (BEV) perception, but the loss of agent-wise tractability leads to temporal conflicts and omission risks of critical agents [13]. This characteristic highlights the complementarity of motion and occupancy prediction. The inconsistency arising in a singular prediction pattern presents challenges for both IPP and IOP, resulting in incompatible planning learned collectively with misaligned predictions.
The limitations of integrated systems have spurred a renewed focus on end-to-end pipelines (Fig. 1b) [14]. Planning-oriented systems employ a well-organized modular design to address plan guidance under a unified BEV geometry and query-based intermediate interactions [2, 5, 15]. While this approach demonstrates planning superiority and considers a hybrid format of predictions and planning, it falls short in addressing potential conflicts among different predictions. Furthermore, the absence of interactive co-design across predictions and planning results in passive maneuvers and hinders the system from learning social compatibility. The integration of different modules gives rise to three major challenges: inherent trade-offs, including inconsistency and omission risks with sole prediction; discrepancies between joint and marginal patterns in predictions, limiting accuracy and impeding the learning of safe and naturalistic driving maneuvers; and the absence of interactive co-design for compatible planning that considers the dependencies of hybrid prediction.
To tackle the aforementioned challenges, we propose a novel integrated pipeline (Fig. 1c) named Hybrid-Prediction integrated Planning (HPP) that optimizes prediction and planning in a co-design process. By combining IPP and IOP, HPP delivers consistent planning while ensuring that the hybrid predictions inform each other consistently. To achieve this, HPP leverages Transformer-based queries to channel and aggregate interactions between modules. Additionally, we develop a refinement process that guides safe planning through interaction with hybrid prediction. Specifically, we propose MS-OccFormer to perform marginal-conditioned joint occupancy prediction that aligns and refines consistently with marginal motion prediction. We further propose GTFormer, which simulates game-theoretic iterative reasoning of marginal motion prediction with the ego vehicle and joint occupancy. HPP concurrently models fine-grained interactions of integrated hybrid prediction patterns in the Ego Planner, where we devise a hybrid-prediction-guided refinement mechanism to facilitate safe and compatible planning. The main contributions are listed below:

1. We propose HPP, a modular co-design optimization ADS paradigm that consistently interacts among marginal and joint prediction patterns with planning.

2. We introduce MS-OccFormer, a model that predicts joint occupancy patterns in BEV geometry while being aware of the marginal predictions. We also present GTFormer, a model that performs game-theoretic reasoning among marginal motion predictions in coordination with both planning and occupancy prediction. The Ego Planner is devised under interactive guidance by hybrid prediction.

3. HPP is tested on multiple large-scale real-world benchmarks, and extensive testing results demonstrate its state-of-the-art performance in terms of accuracy, safety, and consistency in prediction and planning.
II Related Work
II-A Predictions and Planning in ADS
Prediction and planning modules in the conventional ADS model are separate. Predictions typically define evolving transitions as conditions for safe planning. Learning-based prediction models excel in modeling interactions among diverse agents and scene contexts [12]. Categorized by representation, sparse predictions forecast multi-agent trajectories (MATP) along with detected participants. Leveraging Transformers [16, 17] or GNNs [7, 18] to construct the social interaction graph [7] or recurrent refinement [19], MATP filters multi-agent predictions by scoring combinations of marginal ones for each agent. While achieving agent-wise accuracy, MATP introduces exponential computation and trajectory-wise inconsistency. Dense predictions directly estimate the future distribution of agents jointly from ego-centered occupancy [9, 20, 21]. A notable issue is the loss of agent-wise tractability. Enhancements via trajectories from heatmap sampling [22] or joint trajectory learning [23] exhibit similar consistency issues. In ADS planning, various approaches are well-founded, including sampling [24], optimization [25], and learning-based techniques via imitation learning [26] or reinforcement learning [27]. Still, achieving safe and socially compatible driving maneuvers in planning requires interactive awareness and safety guarantees derived from planning-compliant predictions. Furthermore, accumulated errors from detached predictions and planning underscore the need for developing integrated ADS.
II-B Integrated Predictions and Planning in ADS
Navigating dense and interactive traffic requires integrated ADS, which models the simultaneous driving behaviors among social agents and the autonomous vehicle. An intuitive approach is to stack the ego vehicle collectively with social agents' predictions and learn unified trajectories. All interactions are implicitly modeled by Transformer-based IPP [8, 28] or post-processed occupancy prediction [23] under the IOP pipeline. To consider the evolving interactive behaviors between predictions and planning, conditional methods model the behavioral response individually from the ego vehicle to social agents' predictions [29]. Conditional motion predictions are then integrated into planning reschedules [24, 30] or modeled as non-cooperative games [31]. Still, these one-way interactions bypass the mutual consistency with all agents. Moreover, iterative bi-level optimization [32] significantly slows down learning. Free from agent-wise conflicts, hierarchical game-theoretic approaches model the iterative reasoning process [33], updating mutual behaviors for all agents simultaneously [34]. Yet, uniform agent-wise reasoning lacks the distinction between predictions and planning, the latter of which should be target-driven. In HPP, we integrate reasoning by introducing an agent-conditioned occupancy to modulate joint behavior in social agents, and an Ego Planner for interactive planning. This integrated co-design enables learning both predictions and planning with flexibility while maintaining mutual awareness by iterative reasoning.
Recent paradigms focus on leveraging output responses for predictions to enhance planning guarantees. Adversarial objectives are employed between joint motion predictions and planning, considering likely [3, 35] or safety-critical patterns [36] for mutual differentiable optimizations. However, similar consistency issues faced by MATP pose additional optimization challenges. Some methods predict dense occupancy jointly as a potential cost for planning guidance [11, 9, 10]. However, intractability issues introduce risks and require laborious filtering. An additional challenge is that the lack of integrated modeling leads to inconsistency and limits the interactive process to optimization only. In HPP, integration is considered concurrently for ADS co-design and output optimizations. Through flexible and consistent hybrid prediction and planning via system co-design, hybrid predictions are jointly utilized to refine planning, enhancing safety consistently.
II-C End-to-end Systems and LLMs for Autonomous Driving
End-to-end methods consider a direct mapping from raw sensors to prediction and planning under perception understanding [14]. A typical system is serialized by modules and learns jointly with each objective [37, 4]. However, modular errors are accumulated and geometry is hindered by misaligned predictions and planning. Thus, planning is sampled in trajectory retrieval by predictions [38]. Prominence in BEV perception [39, 40, 41] enables modular integration and learning under unified BEV geometry [42]. This further prompts a planning-oriented system, which organizes and serves all intermediate modules targeting planning under visual [2, 15, 43] or vectorized [5] perceptions. Query-based designs channel and propagate modular integration. This prompts recent works incorporating large language models (LLMs) as the motion planner [44] or routing agent [45]. Still, current end-to-end systems focus more on integration with perception that aligns predictions and planning, leaving paradigm discussions to works incorporating LLMs [46]. In HPP, we put more emphasis on modular co-design optimization, learning interactive hybrid prediction and planning informed by perception under query-based integration.
III Hybrid-prediction Integrated Planning
The overview of our proposed HPP is shown in Fig. 2. It is defined in Sec. III-A upon query-based modular co-design and optimization of ADS. In the upcoming sections, informed by the BEV perception pipelines presented in Sec. III-B, HPP manages the framework co-design from MS-OccFormer in Sec. III-C for joint occupancy prediction, sharing mutual consistency with motion prediction in GTFormer. We elaborate GTFormer's hierarchical reasoning model for interactive prediction and planning in Sec. III-D. Following the hybrid-prediction-aware Ego Planner in Sec. III-E, we frame the learning and optimization process for HPP, accounting for hybrid prediction guidance, in Sec. III-F.
III-A Problem Formulation
As shown in Fig. 2, HPP focuses on addressing integrated prediction and planning challenges for ADS. Informed by perception under BEV geometry, this learning-based system is founded on modular co-design and optimization. Built on Transformer modules, HPP leverages categorical queries $\mathbf{Q}$ to aggregate modular outputs and channel interactions as keys and values $\mathbf{K},\mathbf{V}$ through the multi-head attention mechanism.
Given multi-view image inputs, the co-design of HPP begins with the perception backbone for BEV features $\mathcal{B}$. BEV perception further defines an ego-centered area of $H\times W$ with map $\mathcal{S}$ and ${N}_{A}$ detected agents $\mathcal{A}\in {A}_{0:{N}_{A}-1}$ at current timestep $t=0$, where ${A}_{0}$ denotes the ego vehicle. Based on scene context $\mathbf{X}=\{\mathcal{B},\mathcal{S},\mathcal{A}\}$, HPP aims to learn and optimize concurrently for hybrid prediction and planning modules. Specifically, over future horizon $T$ and a set of queries $\mathbf{Q}$, joint predictions are defined as per-step occupancy probabilities ${\widehat{\mathbf{O}}}_{1:T}={\{{o}_{t}^{h,w}\mid {o}^{h,w}\in {[0,1]}^{H\times W}\}}_{t=1}^{T}$ for all neighbor agents. Simultaneously, the marginal future motions for all agents are defined by ${\widehat{\mathbf{Y}}}_{M}^{A}={\{({\widehat{\mathbf{y}}}_{m}^{1:T,i},{\widehat{\mathbf{p}}}_{m}^{i})\mid m\in [1,M]\}}_{i=0}^{{N}_{A}-1}$, denoting multi-modal future trajectories $\widehat{\mathbf{y}}\in {\mathbb{R}}^{{N}_{A}\times M\times T\times 2}$ and probability $\widehat{\mathbf{p}}\in {\mathbb{R}}^{{N}_{A}\times M}$ considering $M$ modes of uncertainty. Hybrid-prediction-aware planning ${\tau}_{1:T}\in {\mathbb{R}}^{T\times 2}$ is queried by plan context $\mathcal{E}$. Subsequently, HPP formulates a collective objective $\mathcal{L}$ for modular learning, and cost criteria $\mathcal{C}$ for the optimized, hybrid-prediction-guided planning ${\tau}^{\ast}\in {\mathbb{R}}^{T\times 2}$. In general, HPP is formulated as:
$$\begin{aligned}{\widehat{\mathbf{Y}}}_{M}^{A},{\widehat{\mathbf{O}}}_{1:T},{\tau}_{1:T}&=f(\mathbf{X},\mathbf{Q},\mathcal{E}\mid \theta ),\\ {\tau}^{\ast}&=\underset{\tau}{\mathrm{arg}\,\mathrm{min}}\;\mathcal{C}(\tau ,\mathbf{X},\widehat{\mathbf{Y}},\widehat{\mathbf{O}}),\end{aligned}$$  (1) 
where $f$ denotes the co-design of HPP, and $\theta $ denotes the model parameters. Specific co-design and formulations for each module in HPP are illustrated in the following sections.
III-B Perception Scene Encoding
Perception pipelines in HPP aim to capture scene context $\mathbf{X}$ from raw image inputs under BEV geometry. Scene contexts are then jointly encoded to capture their global relations.
III-B1 BEV Perception
In HPP, multi-view image features $\mathbf{I}\in {\mathbb{R}}^{6\times {H}_{in}\times {W}_{in}\times C}$ are extracted via shared backbones [47] from raw image inputs. We leverage a BEV encoder [39] to transform $\mathbf{I}$ into BEV features $\mathcal{B}$ through recurrent top-down BEV queries ${Q}_{B}\in {\mathbb{R}}^{H\times W\times D}$, where $D$ denotes the hidden dimension. Founded on BEV backbones, we utilize two DETR-like perception decoders [2] to extract scene context features for agents $\mathcal{A}$ and map $\mathcal{S}$ from ${Q}_{B}$, following ${N}_{A}$ agent queries ${Q}_{A}\in {\mathbb{R}}^{{N}_{A}\times D}$ and ${N}_{M}$ map queries ${Q}_{Map}\in {\mathbb{R}}^{{N}_{M}\times D}$. Note that HPP focuses on integrated prediction and planning. Therefore, better results can be expected for HPP with more advanced BEV perception units.
III-B2 Scene Encoding
To model the global interactions between scene elements with BEV perceptions, we follow our previous work [48, 9] to gather and encode separate visual and vectorized scene context features. Specifically, visual features are encoded by BEV queries ${Q}_{B}$, and the scene features for map and agents are concatenated and encoded as ${Q}_{s}=[{Q}_{Map};{Q}_{A}]\in {\mathbb{R}}^{({N}_{A}+{N}_{M})\times D}$, where $[\cdot ;\cdot ]$ denotes concatenation. Encoded results $\mathbf{X}=\{{Q}_{B},{Q}_{Map},{Q}_{A}\}$ then serve as input for the HPP co-design integrating predictions and planning.
III-C MS-OccFormer
HPP formulates joint predictions as occupancy ${\mathbf{O}}_{1:T}$ consistently with BEV geometry in perception. To further tackle the consistency challenges between hybrid predictions, we propose MS-OccFormer, which spotlights two aspects: marginal-conditioned occupancy, which defines tractable predictions, and multi-scale prediction-wise integration, which deals with interactive alignments at different granularities. As illustrated in Fig. 3, MS-OccFormer utilizes a streaming-based pipeline to roll out the future horizon $T$, decoding per-step occupancy predictions based on $L$ levels of integration of succeeded step features.
III-C1 Modular Queries
We leverage occupancy queries ${Q}_{occ}\in {\mathbb{R}}^{H\times W\times D}$ for multi-scale aggregation of positional and perception features: ${Q}_{occ}=\mathrm{MLP}([\mathrm{PE}({I}_{B});{Q}_{B}])$. Positional grids ${I}_{B}\in {\mathbb{R}}^{H\times W\times 2}$ are encoded using sinusoidal $\mathrm{PE}(\cdot )$ and transformed by a multi-layer perceptron (MLP). We further downsample ${Q}_{occ}$ under $L$ levels ${\{{Q}_{occ}^{l,0}\in {\mathbb{R}}^{\frac{H}{{2}^{l}}\times \frac{W}{{2}^{l}}\times D}\}}_{l=L}^{0}$ to recurrently query multi-scale interactions.
III-C2 Marginal Dependencies
To fully extract the interactive marginal prediction features, we conduct an agent-wise fusion (see Fig. 3c) that leverages the marginal future ${\widehat{\mathbf{Y}}}_{M}^{A}$ outputs from GTFormer (Sec. III-D). Multi-modal motion features ${Q}_{M}^{A}\in {\mathbb{R}}^{{N}_{A}\times M\times D}$ are fused with marginal features ${\mathbf{H}}_{M}^{A}=\mathrm{MLP}(\mathrm{PE}(\widehat{\mathbf{y}}))$ and projected by each horizon:
$${\mathbf{H}}_{traj}^{A}={\mathrm{MLP}}_{1:T}(\widehat{\mathbf{p}}({Q}_{M}^{A}+{\mathbf{H}}_{M}^{A})+{Q}_{A}),$$  (2) 
where ${\mathbf{H}}_{traj}^{A}\in {\mathbb{R}}^{T\times {N}_{A}\times D}$ denotes the marginal features.
III-C3 Marginal-conditioned Occupancy
The primary challenge in formulating ${\mathbf{O}}_{1:T}\in {\mathbb{R}}^{T\times H\times W}$ lies in its intractability with respect to marginal predictions, which causes joint inconsistencies. Inspired by instance-level occupancy ${\mathbf{O}}_{1:T}^{A}\in {\mathbb{R}}^{T\times H\times W\times {N}_{A}}$ [2] and conditional methods [24], we propose the marginal-conditioned occupancy prediction task. This models the consistent joint occupancy $p({\mathbf{O}}_{1:T}^{A}\mid {\mathbf{Y}}_{M}^{A},\mathbf{X})$ over agent-wise marginal predictions. To associate uncertainty and mutual interactions, given final joint decoding features ${Q}_{occ}^{L}$ and marginal features ${\mathbf{H}}_{traj}^{A}$, the marginal-conditioned occupancy is modeled by dot products:
$${\mathbf{O}}_{1:T}^{A}=\sigma ({Q}_{occ}^{L}\cdot \mathrm{MLP}{({\mathbf{H}}_{traj}^{A})}^{T}),$$  (3) 
where $\sigma $ denotes the sigmoid function for per-grid probabilities. The original task can then be recovered as ${\mathbf{O}}_{1:T}={\mathrm{max}}_{A}{\mathbf{O}}_{1:T}^{A}$ for the co-design of other HPP modules.
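As a minimal sketch of Equ. 3 and the max over agents, assume NumPy arrays stand in for the decoder outputs: `q_occ` of shape $(T,H,W,D)$ for the joint features and `h_traj` of shape $(T,{N}_{A},D)$ for the marginal features after the MLP projection. The function name and the toy shapes are illustrative choices, not part of the HPP implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def marginal_conditioned_occupancy(q_occ, h_traj):
    """Dot-product decoding of per-agent occupancy (sketch of Equ. 3).

    q_occ:  (T, H, W, D) joint occupancy features (assumed layout).
    h_traj: (T, N_A, D) marginal features, assumed already MLP-projected.
    Returns per-agent occupancy (T, H, W, N_A) and joint occupancy (T, H, W).
    """
    # Per-grid, per-agent logits: inner product over the hidden dimension D.
    logits = np.einsum("thwd,tnd->thwn", q_occ, h_traj)
    occ_agent = sigmoid(logits)
    # Joint occupancy: a cell is as occupied as its most likely agent.
    occ_joint = occ_agent.max(axis=-1)
    return occ_agent, occ_joint

rng = np.random.default_rng(0)
T, H, W, N_A, D = 3, 4, 5, 2, 8
q_occ = rng.normal(size=(T, H, W, D))
h_traj = rng.normal(size=(T, N_A, D))
occ_agent, occ_joint = marginal_conditioned_occupancy(q_occ, h_traj)
print(occ_agent.shape, occ_joint.shape)
```

Keeping the agent axis until the final max is what preserves agent-wise tractability: each channel of `occ_agent` can still be attributed to one marginal prediction.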
III-C4 Multi-scale Prediction-wise Integration
Multi-scale prediction-wise integration aims to iteratively align multi-scale interaction features between hybrid predictions in decoding ${\mathbf{O}}_{1:T}^{A}$. In Fig. 3a, multi-scale succeeded occupancy features ${\{{Q}_{occ}^{l,t-1}\}}_{l=1}^{L}$ query aligned marginal features by attention at different granularities from two-stage Transformer decoders.
The global integration stage leverages vanilla Transformer decoders to perform per-grid interactions from flattened high-level joint features ${Q}_{occ}^{L,t-1}$ with marginal ones. Subsequently, with the upscaling of occupancy features ${\{{Q}_{occ}^{l,t-1}\}}_{l=1}^{L-1}$, the local integration stage focuses on capturing consistency from partial joint behaviors with marginal features. This motivates us to design shift-window multi-head cross-attention (SW-MCA), inspired by SW-MSA in [49]. As depicted in Fig. 3b, we employ the rolling process to simultaneously capture local interactions under shifted-window attention.
To ensure interactive consistency across multi-scale integration, we devise a learnable attention mask ${\mathbf{M}}_{1:T}^{l}\in {\mathbb{R}}^{T\times \frac{H}{{2}^{L-l}}\times \frac{W}{{2}^{L-l}}\times {N}_{A}}$ for the Transformer decoder that iteratively refines upon interaction results from the previous scale, aligning the attention modeling with those results. As shown in Fig. 3a, for each level, the attention mask is updated with agent-conditioned occupancy on the current scale level:
$${\widehat{\mathbf{M}}}^{l}=\sigma ({Q}_{occ}^{l}\cdot {\mathrm{MLP}}_{l}{({\mathbf{H}}_{traj}^{A})}^{T}).$$  (4) 
The attention masks are then iteratively updated following:
$${\mathbf{M}}^{l}={\lambda}_{m}\mathrm{Upsample}({\mathbf{M}}^{l-1})+(1-{\lambda}_{m}){\widehat{\mathbf{M}}}^{l},$$  (5) 
where ${\lambda}_{m}=0.5$ is the update factor. In general, given the Transformer decoder at a certain stage as $\mathrm{Trans}$, the prediction-wise integration under scale $l$ at timestep $t$ is defined as:
$${Q}_{occ}^{l,t}=\mathrm{Trans}(q={Q}_{occ}^{l,t-1},k,v={\mathbf{H}}_{traj}^{A,t},m={\mathbf{M}}_{t}^{l}).$$  (6) 
Output joint occupancy features ${Q}_{occ}^{L}$ will be eventually fused via Equ. 3 for conditioned occupancy predictions ${\widehat{\mathbf{O}}}_{1:T}^{A}$.
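The mask update of Equ. 5 can be illustrated with a short NumPy sketch. It assumes nearest-neighbour 2x upsampling between consecutive scales (the source does not state the upsampling operator) and a per-timestep mask of shape $(h,w,{N}_{A})$; the function names are hypothetical.

```python
import numpy as np

def upsample2x(mask):
    """Nearest-neighbour 2x upsampling over the two spatial axes (assumed operator)."""
    return mask.repeat(2, axis=0).repeat(2, axis=1)

def update_mask(prev_mask, new_mask, lam=0.5):
    """Blend the upsampled coarse mask with the current-scale estimate (Equ. 5).

    prev_mask: (h, w, N_A) mask from the previous, coarser scale.
    new_mask:  (2h, 2w, N_A) agent-conditioned occupancy at the current scale (Equ. 4).
    lam:       update factor, 0.5 in the text.
    """
    return lam * upsample2x(prev_mask) + (1.0 - lam) * new_mask

prev = np.ones((2, 2, 3))    # coarse mask: every cell attended
new = np.zeros((4, 4, 3))    # current-scale estimate: nothing attended
blended = update_mask(prev, new)
print(blended.shape, blended[0, 0, 0])
```

With ${\lambda}_{m}=0.5$ the coarse and fine estimates contribute equally, so a region flagged at a coarse scale keeps influencing attention at the finer scale even if the finer estimate disagrees.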
III-D GTFormer
In HPP, GTFormer is introduced to ensure interactive consistency between motion predictions and planning. As depicted in Fig. 4, GTFormer formulates future interactive behaviors as a game-theoretic reasoning process, simulating hierarchical reasoning through $K$ stacked Transformer decoder layers. Beyond the uniform agent-wise reasoning model of our previous work [34], GTFormer collaborates with MS-OccFormer (Sec. III-C), modeling occupancy interactions to modulate joint future behaviors at each level of reasoning.
III-D1 Modular Queries
We leverage motion queries ${Q}_{M}^{A}\in {\mathbb{R}}^{{N}_{A}\times M\times D}$ to initialize agent-wise interactive reasoning with multi-modal motion intentions ${I}_{A}\in {\mathbb{R}}^{{N}_{A}\times M\times 2}$ and agent perception features by: ${Q}_{M}^{A}=\mathrm{MLP}([\mathrm{PE}({I}_{A});{Q}_{A}])$.
III-D2 Joint Dependencies
We transform joint occupancy predictions ${\widehat{\mathbf{O}}}_{1:T}\in {\mathbb{R}}^{T\times H\times W}$ with positional features ${I}_{B}$ to align with the continuous geometry used for predictions and planning. As shown in Fig. 4b, we conduct a multiplication with max-pooling to encode occupancy positional features with BEV semantics:
$${\mathbf{H}}_{occ}=\mathrm{ResConv}(\underset{T}{\mathrm{max}}\,\mathrm{Conv}(\mathrm{PE}({I}_{B})\widehat{\mathbf{O}})+{Q}_{occ}),$$  (7) 
where $\mathrm{Conv}$ and $\mathrm{ResConv}$ represent convolutional projections and residual layers, respectively. The joint features ${\mathbf{H}}_{occ}\in {\mathbb{R}}^{H\times W\times D}$ then interact with all agents.
III-D3 Game-Theoretic Transformer Layer
We employ level-$k$ games to model the GTFormer layer (see Fig. 4a) for future interactive behaviors of all agents. Following our previous work [34], we denote agent-wise multi-modal predictions and planning ${\widehat{\mathbf{Y}}}_{M}^{A}$ as player policies ${\{{\pi}_{i}\}}_{i=0}^{{N}_{A}-1}$. For each player $i$, the reasoning process is defined as a $K$-level iteration of the policy conditioned on the opponents' ($\neg i$) policies from the last reasoning level: ${\pi}_{i}^{k}({\mathbf{Y}}_{M}^{i}\mid {\pi}_{\neg i}^{k-1},\mathbf{X}),k\in [1,K-1]$. Specifically, the level-0 policy is reasoned independently: ${\pi}_{i}^{0}({\mathbf{Y}}_{M}^{i}\mid \mathbf{X})$. We leverage the Gaussian mixture model (GMM) to outline the uncertainty for prediction and planning policies as:
$${\pi}_{i}^{k}=\sum _{m=1}^{M}p({\widehat{\mathbf{p}}}_{m}^{i}\mid {\pi}_{\neg i}^{k-1},\mathbf{X})\sum _{t=1}^{T}\varphi ({\widehat{\mathbf{y}}}_{m}^{t,i},{\widehat{\sigma}}_{m}^{t,i}\mid {\pi}_{\neg i}^{k-1},\mathbf{X}),$$  (8) 
where $\varphi $ and ${\sigma}_{m}^{t,i}\in {\mathbb{R}}^{{N}_{A}\times M\times T\times 2}$ represent the density function and variance, respectively, for 2D Gaussian distributions.
Specifically, a single layer of GTFormer is shared by the level-$k$ policies ${\pi}_{0:{N}_{A}-1}^{k}$ (Fig. 4a) across all agents. The ${k}^{th}$ layer starts with a multi-head self-attention (MHSA), named Mode2Mode, over the last-level motion queries ${Q}_{M,k-1}^{A}\in {\mathbb{R}}^{{N}_{A}\times M\times D}$ to model future interactions across modalities. Four multi-head cross-attention (MHCA) modules are devised for ${Q}_{M,k-1}^{A}$ to aggregate the policy conditions separately for reasoning. (1) GT-Reasoning interacts with the last-level component policies ${\pi}_{0:{N}_{A}-1}^{k-1}$. Fused by Equ. 2 with ${\widehat{\mathbf{p}}}_{M,k-1}^{\neg A}$ and ${\widehat{\mathbf{y}}}_{M,k-1}^{\neg A}$, component features are filtered by a future mask [34] to query future reasoning behaviors as ${F}_{GT}^{k}$. (2) Mode2Agent and (3) Mode2Map outline the scene context interactions of agents ${Q}_{A}$ and maps ${Q}_{Map}$ as ${F}_{A}^{k}$ and ${F}_{Map}^{k}$. (4) Mode2Occ leverages deformable MHCA (DCA) [39] to model the future interactions with the joint occupancy feature. ${\mathbf{H}}_{occ}$ is queried by a set of offsets $\mathrm{\Delta}\in {\mathbb{R}}^{{N}_{A}\times M\times {N}_{p}\times 2}$ referenced by $\mathcal{P}({\widehat{\mathbf{y}}}_{M,k-1}^{A})$ transformed to pixel coordinates. The ${N}_{p}$ referenced occupancy features are then aggregated as ${F}_{occ}^{k}$ for each agent. In general, all interactive features are concatenated to update the motion query:
$${Q}_{M,k}^{A}=\mathrm{FFN}([{F}_{GT}^{k}+{F}_{A}^{k};{F}_{Map}^{k};{F}_{occ}^{k}]),$$  (9) 
where $\mathrm{FFN}(\cdot )$ denotes the feed-forward layers with residual MLPs and layer normalization. We omit the reasoning attention of ${F}_{GT}^{0}$ at $k=0$ for independent future policies. Eventually, updated motion queries ${Q}_{M,k}^{A}$ are passed through score and trajectory decoders for reasoned GMM policies of predictions and planning.
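The control flow of level-$k$ reasoning can be illustrated independently of the Transformer layers. The toy sketch below is an assumption-laden stand-in: each agent's "policy" is a single 1D position, and `keep_distance` is a hypothetical best response that replaces the learned GTFormer layer. Only the loop structure (level 0 ignores opponents, level $k$ conditions on opponents' level $k-1$ policies) mirrors the text.

```python
import numpy as np

def level_k_policies(goals, num_levels, respond):
    """Iterate level-k reasoning: pi_i^k depends on pi_{-i}^{k-1}.

    goals:      (N_A,) independent intentions, used as the level-0 policies.
    num_levels: K, total number of reasoning levels (0 .. K-1).
    respond:    callable(goal_i, opponents) -> policy of agent i at this level.
    """
    goals = np.asarray(goals, dtype=float)
    policies = [goals.copy()]                        # level 0: no opponent conditioning
    for _ in range(1, num_levels):
        prev = policies[-1]
        current = np.empty_like(prev)
        for i in range(len(prev)):
            opponents = np.delete(prev, i, axis=0)   # opponents' last-level policies
            current[i] = respond(goals[i], opponents)
        policies.append(current)
    return policies

def keep_distance(goal, opponents, d_min=2.0):
    """Hypothetical response: shift away from the nearest opponent until d_min apart."""
    nearest = opponents[np.argmin(np.abs(opponents - goal))]
    gap = goal - nearest
    shortfall = max(0.0, d_min - abs(gap))
    return goal + np.sign(gap) * shortfall

policies = level_k_policies([0.0, 1.0], num_levels=3, respond=keep_distance)
for k, pi in enumerate(policies):
    print(k, pi)
```

All agents are updated simultaneously from the same previous level, which is what distinguishes this scheme from one-way conditional prediction.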
III-E Ego Planner
HPP introduces the Ego Planner depicted in Fig. 4c. The Ego Planner specifies planning-oriented conditions $\mathcal{E}$ based on the planning reasoning policy ${\pi}_{0}^{K-1}$. The co-design of HPP facilitates a hybrid-prediction-aware Ego Planner that collaborates with GTFormer and MS-OccFormer, defined as $p(\tau \mid {\pi}_{0}^{K-1},\widehat{\mathbf{Y}},\widehat{\mathbf{O}},\mathcal{E},\mathbf{X})$. This is essential as the game-theoretic process in GTFormer (Sec. III-D) simply models ego planning uniformly with other agents.
III-E1 Modular Queries
In the context of target-driven planning, we define plan queries ${Q}_{\mathcal{E}}\in {\mathbb{R}}^{1\times D}$ that integrate planning reasoning features ${Q}_{M,K-1}^{0}$ with plan context $\mathcal{E}$ through a max-pooling fusion: ${Q}_{\mathcal{E}}={\mathrm{max}}_{M}\mathrm{MLP}([{Q}_{M,K-1}^{0};{\mathbf{H}}_{\mathcal{E}}])$. Here, ${\mathbf{H}}_{\mathcal{E}}\in {\mathbb{R}}^{1\times D}$ encompasses global target features $\mathcal{G}\in {\mathbb{R}}^{1\times D}$ for encoded navigation commands or coordinates, traffic light features $\mathrm{TL}\in {\mathbb{R}}^{1\times D}$, and optional ego status ${\mathbf{s}}_{\text{ego}}\in {\mathbb{R}}^{1\times D}$ for ego speed and heading $\{v,\psi \}$.
III-E2 Hybrid Prediction Dependencies
We directly reuse the interaction Transformer parts of the GTFormer layer (Fig. 4a), stacked ${L}_{p}$ times, to model future joint interactions $\widehat{\mathbf{O}}$ and marginal interactions $\widehat{\mathbf{Y}}$, informed by reasoned planning features and target-conditioned planning queries ${Q}_{\mathcal{E}}$. The GT-Reasoning module incorporates hybrid prediction awareness for marginal motion interactions ${\widehat{\mathbf{Y}}}_{M,K-1}^{\neg A}$, while the Mode2Occ module handles joint occupancy interactions ${\widehat{\mathbf{O}}}_{1:T}$. After aggregating interactive future behaviors from hybrid prediction, plan queries pass through an identical motion decoder, resulting in a refined planning trajectory $\tau \in {\mathbb{R}}^{T\times 2}$.
III-F System Learning and Optimization Design
We present a collaborative learning and optimization paradigm for the HPP system. With detached BEV backbones, the co-designed framework undergoes a two-stage training process: 1) warm-up learning of two perception decoders using perception objectives ${\mathcal{L}}_{per}$ [2]; 2) end-to-end supervised learning using all modular objectives $\mathcal{L}$ as:
$$\mathcal{L}={\mathcal{L}}_{per}+{\mathcal{L}}_{occ}+{\mathcal{L}}_{GT}+{\mathcal{L}}_{plan}.$$  (10) 
During inference, hybrid-prediction-guided optimization refines the planning ${\tau}^{\ast}$ by minimizing cost functions $\mathcal{C}$. In the following, we elaborate on modular objectives, costs, and co-optimization strategies.
III-F1 Modular Objectives
For precise prediction of marginal-conditioned occupancy ${\widehat{\mathbf{O}}}_{1:T}^{A}$ in MS-OccFormer, we employ a combination of top-k BCE loss and Dice loss [2] jointly for ${\widehat{\mathbf{O}}}_{1:T}^{A}$ and ${\mathbf{M}}_{1:T}$, for balanced predictions of occupancy probabilities: ${\mathcal{L}}_{occ}={\mathcal{L}}_{\text{BCE}}+{\lambda}_{\text{Dice}}{\mathcal{L}}_{\text{Dice}}$, with ${\lambda}_{\text{Dice}}=5$.
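The occupancy objective can be sketched in NumPy as follows. Two details are assumptions rather than statements from the source: "top-k" is read as averaging the BCE over the `k_ratio` fraction of hardest cells, and the Dice loss uses the common soft formulation with a small smoothing constant. The names `topk_bce`, `dice_loss`, and `k_ratio` are hypothetical.

```python
import numpy as np

def topk_bce(pred, target, k_ratio=0.25, eps=1e-7):
    """Mean binary cross-entropy over the hardest k_ratio fraction of cells (assumed reading)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)).ravel()
    k = max(1, int(k_ratio * bce.size))
    return np.sort(bce)[-k:].mean()

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2|P.G| / (|P| + |G|), with smoothing eps."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def occupancy_loss(pred, target, lambda_dice=5.0):
    """L_occ = L_BCE + lambda_Dice * L_Dice, with lambda_Dice = 5 as in the text."""
    return topk_bce(pred, target) + lambda_dice * dice_loss(pred, target)

target = np.zeros((4, 4))
target[1:3, 1:3] = 1.0
good = occupancy_loss(target.copy(), target)   # perfect prediction
bad = occupancy_loss(1.0 - target, target)     # fully inverted prediction
print(good < bad)
```

The Dice term compensates for the class imbalance of sparse occupancy grids, where plain BCE would be dominated by empty cells.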
To capture the hierarchical reasoning process for all agents in GTFormer, the min-max objectives for the policy at level $k$ consist of ${\mathcal{L}}_{\text{IL}}^{k}$, minimizing imitative behaviors, and ${\mathcal{L}}_{\text{col}}^{k}$, maximizing the interactive distance. The overall objective is defined as ${\mathcal{L}}_{GT}={\sum}_{k=0}^{K-1}({\mathcal{L}}_{\text{IL}}^{k}+{\lambda}_{\text{col}}{\mathcal{L}}_{\text{col}}^{k})$. Here, ${\mathcal{L}}_{\text{IL}}^{k}$ represents the negative log-likelihood (NLL) loss for reasoning policies ${\pi}_{0:{N}_{A}-1}^{k}$, with the mode of closest final displacement error (FDE) for each agent taken as the positive mixture component:
$${\mathcal{L}}_{\text{IL}}^{k}=\sum _{i=0}^{{N}_{A}-1}\sum _{t=1}^{T}\sum _{m=1}^{M}\mathbb{1}(m={\widehat{m}}^{i}){\mathcal{L}}_{\text{NLL}}({\widehat{\mathbf{y}}}_{m}^{t,i},{\sigma}_{m}^{t,i},{\widehat{\mathbf{p}}}_{m}^{i}),$$  (11) 
where $\widehat{m}$ denotes the positive index and ${\mathcal{L}}_{\text{NLL}}$ is defined as:
$${\mathcal{L}}_{\text{NLL}}=\mathrm{log}{\sigma}_{x}+\mathrm{log}{\sigma}_{y}+\frac{1}{2}\left({\left(\frac{{\mathrm{\Delta}}_{x}}{{\sigma}_{x}}\right)}^{2}+{\left(\frac{{\mathrm{\Delta}}_{y}}{{\sigma}_{y}}\right)}^{2}\right)-\mathrm{log}\left(p\right).$$  (12) 
Here ${\sigma}_{x},{\sigma}_{y}\in {\sigma}_{m}^{t,i}$ and ${\mathrm{\Delta}}_{xy}={\mathbf{y}}_{xy}^{t,i}-{\widehat{\mathbf{y}}}_{m,xy}^{t,i}$. We leverage the cross-entropy loss for ${\widehat{\mathbf{p}}}_{m}^{i}$ in updating scoring. The interactive loss ${\mathcal{L}}_{\text{col}}^{k}$ is modeled by maximizing the L2 distance $\mathcal{D}$ for the closest trajectories within ${d}_{\text{col}}$ under last-level component policies:
$$  (13) 
The refined planning $\tau $ in the Ego Planner is learned with the L2 distance $\mathcal{D}$: ${\mathcal{L}}_{plan}=\mathcal{D}(\tau ,{\mathbf{y}}^{0})$.
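For a single agent, the imitation term of Equ. 11 and Equ. 12 can be sketched in NumPy as below, using the standard Gaussian-mixture NLL in which the mode probability enters as $-\mathrm{log}\,p$. The array layouts and the function names `positive_mode` and `nll_loss` are illustrative assumptions.

```python
import numpy as np

def positive_mode(pred_trajs, gt_traj):
    """Index of the mode with the smallest final displacement error (FDE).

    pred_trajs: (M, T, 2) multi-modal predictions; gt_traj: (T, 2) ground truth.
    """
    fde = np.linalg.norm(pred_trajs[:, -1] - gt_traj[-1], axis=-1)
    return int(np.argmin(fde))

def nll_loss(pred_trajs, sigmas, probs, gt_traj):
    """Per-agent NLL summed over time for the positive mode (sketch of Equ. 11 and 12).

    sigmas: (M, T, 2) positive standard deviations; probs: (M,) mode probabilities.
    """
    m = positive_mode(pred_trajs, gt_traj)
    delta = gt_traj - pred_trajs[m]                      # (T, 2) residuals
    sx, sy = sigmas[m, :, 0], sigmas[m, :, 1]
    per_step = (np.log(sx) + np.log(sy)
                + 0.5 * ((delta[:, 0] / sx) ** 2 + (delta[:, 1] / sy) ** 2)
                - np.log(probs[m]))
    return per_step.sum()

gt = np.array([[1.0, 0.0], [2.0, 0.0]])
preds = np.stack([gt, np.array([[0.0, 1.0], [0.0, 2.0]])])   # mode 0 matches gt
sigmas = np.ones((2, 2, 2))
probs = np.array([0.5, 0.5])
loss = nll_loss(preds, sigmas, probs, gt)
print(positive_mode(preds, gt), loss)
```

In this toy case the positive mode reproduces the ground truth with unit variance, so only the $-\mathrm{log}\,p$ term remains at each of the two steps.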
III-F2 Cost Profiles
The cost function profiles encompass diverse cost terms $\{{c}_{i}\}$, taking into account various aspects of planning performance, categorized as driving progress, comfort, adherence to rules [9], and, crucially, safety. Planning safety is defined by Gaussian potential fields incorporating joint $\widehat{\mathbf{O}}$ and marginal $\widehat{\mathbf{Y}}$ predictions. Motivated by the complementary safety guidance of hybrid prediction, the explicit formulation is as follows:
$${c}_{t}^{\mathrm{safe}}=\sum _{{\widehat{\mathbf{O}}}_{t}^{x,y}\in {D}_{1}}\varphi ({\tau}_{t},{\widehat{\text{O}}}_{t})+\sum _{i=1}^{{N}_{A}-1}\sum _{m=1}^{M}\sum _{{\widehat{\text{y}}}^{t}\in {D}_{2}}\varphi ({\tau}_{t},{\widehat{\text{y}}}_{m}^{t,i}).$$  (14) 
Here, $\varphi $ represents the Gaussian density functions as in Equ. 8. ${\widehat{\mathbf{O}}}_{t}^{x,y}={\mathcal{P}}^{-1}({\widehat{\mathbf{O}}}_{t}^{h,w})$ denotes the occupied coordinates, and ${\widehat{\text{y}}}_{m}^{t,i}$ is derived from the reasoned results of ${\widehat{\mathbf{Y}}}_{M,K-1}^{A}$. Each potential field is masked by a distance threshold to the planned trajectory, with ${D}_{1}=5$ for occupancy and ${D}_{2}=3$ for motion predictions.
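The structure of Eq. 14 can be sketched for a single planning step as below. Since Equ. 8 is not reproduced in this section, the isotropic Gaussian kernel, its bandwidth `sigma`, and the interpretation of ${D}_{1}$ and ${D}_{2}$ as distances in metres are assumptions made for illustration; all function names are hypothetical.

```python
import math

def gaussian_potential(p, q, sigma=1.0):
    """Assumed isotropic Gaussian kernel between two 2D points."""
    d2 = (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return math.exp(-d2 / (2.0 * sigma ** 2))

def safety_cost(ego_xy, occupied_xy, agent_modes_xy, d1=5.0, d2=3.0, sigma=1.0):
    """Safety cost at one step: occupancy term plus motion term, each distance-masked.

    occupied_xy:    predicted occupied cells back-projected to (x, y) coordinates
    agent_modes_xy: predicted positions of all agents and modes at this step
    """
    cost = 0.0
    for o in occupied_xy:                      # joint (occupancy) guidance, masked by D1
        if math.dist(ego_xy, o) <= d1:
            cost += gaussian_potential(ego_xy, o, sigma)
    for y in agent_modes_xy:                   # marginal (motion) guidance, masked by D2
        if math.dist(ego_xy, y) <= d2:
            cost += gaussian_potential(ego_xy, y, sigma)
    return cost
```

Points outside the thresholds contribute nothing, so either prediction source can still penalize the plan when the other one is empty or uncertain.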
III-F3 Optimization
The optimization for hybrid prediction-guided planning is defined as an open-loop optimization problem under finite horizons. The general formulation is as follows:
$${\mathbf{u}}^{\ast}=\underset{\mathbf{u}}{\mathrm{arg}\mathrm{min}}\frac{1}{2}\sum _{i}{\Vert {\omega}_{i}{c}_{i}(\mathbf{u},\mathbf{X},\widehat{\mathbf{Y}},\widehat{\mathbf{O}})\Vert}^{2},$$  (15) 
where $\mathbf{u}$ is the planning variable, and ${\omega}_{i}$ denotes the weight for cost function ${c}_{i}$. For generalized ADS, HPP solves this optimization problem according to different criteria:
Reference routes: Given an accessible, densely interpolated reference route $\mathcal{I}\in {\mathbb{R}}^{{L}_{\text{ref}}\times {d}_{\text{ref}}}$, HPP transforms all cost profiles into Frenet coordinates to ease the optimization. Each reference point $r\in \mathcal{I}$ is assigned tangential and normal vectors $[{\overrightarrow{t}}_{r},{\overrightarrow{n}}_{r}]$. The Cartesian coordinate $\overrightarrow{y}=(x,y)$ is then related to the Frenet coordinate $(s,d)$ via:
$$\overrightarrow{y}(s(t),d(t))=\overrightarrow{r}(s(t))+d(t){\overrightarrow{n}}_{r}(s(t)).$$  (16) 
Path planning: This handles trajectories of lower frequency for the ego vehicle. Direct optimization is performed for $\tau =\mathbf{u}$ from the Ego Planner, producing the optimized path ${\tau}^{\ast}$. Only the safety cost is considered in this operation.
Motion planning: This addresses per-step future states of the ego vehicle. Optimization is conducted using model predictive control (MPC) for control actions $\mathbf{u}={[a,\delta ]}_{1:T}$ based on inverse dynamics: ${\mathbf{u}}_{t}={\mathcal{T}}^{-1}({\tau}_{t+1},{\tau}_{t})$ [36]. The optimal motion plan is then transformed back through forward dynamics after optimization: ${\tau}_{t+1}^{\ast}=\mathcal{T}({\tau}_{t}^{\ast},{\mathbf{u}}_{t}^{\ast})$.
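The Frenet relation in Eq. 16 can be illustrated with a short sketch that maps $(s,d)$ back to Cartesian coordinates along a polyline reference route. The piecewise-linear route, the left-pointing normal convention, and the function name are assumptions for illustration rather than details given in the text.

```python
import math

def frenet_to_cartesian(route, s, d):
    """Evaluate y(s, d) = r(s) + d * n_r(s) on a polyline reference route.

    route: list of (x, y) reference points
    s:     arc length along the route
    d:     lateral offset (positive to the left of the driving direction)
    """
    segments = list(zip(route, route[1:]))
    if not segments:
        raise ValueError("route needs at least two points")
    acc = 0.0
    for idx, ((x0, y0), (x1, y1)) in enumerate(segments):
        seg = math.hypot(x1 - x0, y1 - y0)
        is_last = idx == len(segments) - 1
        # Use this segment if s falls on it; the last segment also extrapolates.
        if seg > 0.0 and (s <= acc + seg or is_last):
            tx, ty = (x1 - x0) / seg, (y1 - y0) / seg   # tangent vector t_r
            nx, ny = -ty, tx                            # normal vector n_r
            ds = s - acc
            rx, ry = x0 + ds * tx, y0 + ds * ty         # reference point r(s)
            return (rx + d * nx, ry + d * ny)
        acc += seg
    raise ValueError("could not locate s on the route")
```

For example, `frenet_to_cartesian([(0, 0), (10, 0)], 3.0, 2.0)` places a point 3 m along a straight route heading in +x and 2 m to its left.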
To solve the nonlinear optimization problem in Equ. 15, we utilize the Gauss-Newton method [50], which iteratively refines the planning variable starting from its initial value. The cost weights can be hand-designed or learned directly, as the entire optimization process is fully differentiable [3].
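A minimal sketch of a Gauss-Newton solver for the weighted least-squares form of Equ. 15 is given below. It is not the solver used in HPP: the finite-difference Jacobian, the small damping term, and the pure-Python linear solve are assumptions chosen to keep the example self-contained, and `residual_fn` is a hypothetical callable returning the stacked weighted costs ${\omega}_{i}{c}_{i}(\mathbf{u})$.

```python
def numeric_jacobian(residual_fn, u, eps=1e-6):
    """Forward-difference Jacobian of the residual vector with respect to u."""
    r0 = residual_fn(u)
    jac = [[0.0] * len(u) for _ in r0]
    for j in range(len(u)):
        u_step = list(u)
        u_step[j] += eps
        r1 = residual_fn(u_step)
        for i in range(len(r0)):
            jac[i][j] = (r1[i] - r0[i]) / eps
    return r0, jac

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def gauss_newton(residual_fn, u0, iters=10, damping=1e-9):
    """Minimize 0.5 * sum(residual_i(u)^2) by repeated linearization."""
    u = list(u0)
    for _ in range(iters):
        r, jac = numeric_jacobian(residual_fn, u)
        n, k = len(u), len(r)
        # Normal equations: (J^T J + damping * I) step = -J^T r
        jtj = [[sum(jac[q][i] * jac[q][j] for q in range(k))
                + (damping if i == j else 0.0) for j in range(n)] for i in range(n)]
        jtr = [sum(jac[q][i] * r[q] for q in range(k)) for i in range(n)]
        step = solve_linear(jtj, [-g for g in jtr])
        u = [ui + si for ui, si in zip(u, step)]
    return u

# Toy usage: two weighted cost terms pulling u towards (1, 2).
weights = [1.0, 2.0]
u_star = gauss_newton(lambda u: [weights[0] * (u[0] - 1.0),
                                 weights[1] * (u[1] - 2.0)], [0.0, 0.0])
```

In a differentiable pipeline the Jacobian would normally come from automatic differentiation rather than finite differences, which is what allows the cost weights to be learned end to end.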
IV Experiments
In this section, we first introduce the experimental settings for the proposed HPP, including testing benchmarks, evaluation metrics, and implementation details. HPP is then quantitatively compared against existing state-of-the-art methods and systems in prediction and planning. Ablation studies examine the effectiveness and mechanisms of the modular co-design optimizations in HPP. Qualitative results further illustrate the characteristics of HPP against selected state-of-the-art baselines.
IV-A Experimental Setup
IV-A1 Testing Benchmarks
To assess the overall prediction and planning performance of HPP, we pose three questions for benchmark testing: (1) How capable is HPP as a full-stack ADS in interactive real-world cases? (2) How does HPP perform over long horizons in interactive real-world scenarios? (3) How well does HPP drive over the long term in continuous, realistic scenarios? These questions motivate the following benchmarks:
(1) nuScenes dataset [51]: This dataset is among the largest and most widely used for full-stack autonomous driving. It includes over 1,000 20-second driving scenes annotated at 2 Hz, covering four cities worldwide. Benchmarked evaluations [2] involve 6,019 frames for all ADS tasks under open-loop settings. Testing horizons are defined at $T=3\mathrm{s}$ and $T=6\mathrm{s}$ for motion predictions.
(2) Waymo Open Motion Dataset (WOMD) [52]: Used for long-term motion evaluations, it is the largest real-world dataset for interactive scenarios. It comprises 104,000 20-second segments representing unique scenarios, sampled at 10 Hz, and encompasses over 570 hours of driving and 1750 km of roadways. To assess long-term real-world performance in planning and motion prediction, evaluations are systematically conducted in the SMARTS benchmark [34]. This includes 400 highly interactive scenarios, each lasting 9 seconds, featuring representative behaviors. Autonomous vehicles are tasked with 5-second long-term planning and prediction in both open-loop and closed-loop configurations. Closed-loop testing uses the log simulator to replay driving scenarios for online interactions. To further examine the performance of joint predictions in HPP, the system is tested on the Waymo Occupancy Predictions benchmark [53]. This involves predicting occupancy and flow for over 44,000 driving scenes within an 8-second timeframe at a frequency of 1 Hz.
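Collision rate (Col., %) $\downarrow $ and planning error (L2, m) $\downarrow $ at each horizon.

| Methods | Col. @1 s | Col. @2 s | Col. @3 s | Col. Avg. | L2 @1 s | L2 @2 s | L2 @3 s | L2 Avg. |
|---|---|---|---|---|---|---|---|---|
| NMP^{†} [54] | – | – | 1.92 | – | – | – | 2.31 | – |
| SA-NMP^{†} [54] | – | – | 1.59 | – | – | – | 2.05 | – |
| FusionAD^{†‡} [15] | 0.02 | 0.08 | 0.27 | 0.12 | – | – | – | 0.81 |
| FF^{†} [55] | 0.06 | 0.17 | 1.07 | 0.43 | 0.55 | 1.20 | 2.54 | 1.43 |
| EO^{†} [56] | 0.04 | 0.09 | 0.88 | 0.33 | 0.67 | 1.36 | 2.78 | 1.60 |
| OccNet [57] | 0.21 | 0.59 | 1.37 | 0.72 | 1.29 | 2.13 | 2.99 | 2.13 |
| UniAD [2] | 0.05 | 0.17 | 0.71 | 0.31 | 0.48 | 0.96 | 1.65 | 1.03 |
| HPP^{‡} | 0.03 | 0.07 | 0.35 | 0.15 | 0.30 | 0.61 | 1.15 | 0.72 |
| HPP | 0.03 | 0.17 | 0.68 | 0.29 | 0.48 | 0.91 | 1.54 | 0.97 |
| Avg. Metrics | @1 s | @2 s | @3 s | Avg. | @1 s | @2 s | @3 s | Avg. |
| ST-P3 [4] | 0.23 | 0.62 | 1.27 | 0.71 | 1.33 | 2.11 | 2.90 | 2.11 |
| VAD-Base^{‡} [5] | 0.07 | 0.10 | 0.24 | 0.14 | 0.17 | 0.34 | 0.60 | 0.37 |
| VAD-Base [5] | 0.07 | 0.17 | 0.41 | 0.22 | 0.41 | 0.70 | 1.05 | 0.72 |
| DeepEM^{∗} [58] | 0.05 | 0.15 | 0.36 | 0.19 | 0.25 | 0.45 | 0.73 | 0.48 |
| HPP^{‡} | 0.02 | 0.04 | 0.11 | 0.06 | 0.26 | 0.37 | 0.59 | 0.40 |
| HPP | 0.03 | 0.08 | 0.24 | 0.12 | 0.41 | 0.61 | 0.86 | 0.63 |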
(3) CARLA simulator [59]: We use the Longest6 benchmark [60] for long-term driving evaluations. The autonomous vehicle is assigned a closed-loop planning task with a $T=2\mathrm{s}$ horizon across 36 routes, ranging from 1.6 to 1.8 km, under various driving conditions in six CARLA towns.
IV-A2 Testing Metrics
We adhere to the metric configurations of the original benchmarks. In line with the contributions of HPP, the metrics primarily focus on three aspects of prediction and planning: accuracy, consistency, and safety. Detailed metrics are listed as follows:
(1) Occupancy prediction: Intersection over Union (IoU) and Area Under the Curve (AUC) [61] quantify the overall and per-grid prediction accuracy for occupancy. Video Panoptic Quality (VPQ) [62] is adopted to assess the consistency of occupancy across marginal agents and perceptions.
(2) Motion prediction: Prediction accuracy is assessed using minimum average and final displacement errors (minADE, minFDE) for trajectories, as well as miss rate (MR) for each agent [52]. Consistency is tested through joint displacement errors (JADE, JFDE) for all agents [34], along with End-to-end Prediction Accuracy (EPA) [63] over perceptions.
(3) Planning: For open-loop testing, planning accuracy is evaluated using displacement errors (DE) [2] and the average distance [5]. Consistency in predictions and safety is assessed through collision rates (CR) [64]. In closed-loop settings, external measurements include the infraction score (IS), vehicle collisions (CV), route completion (RC), and a driving score (DS) that summarizes overall driving performance [59].
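The displacement-based motion metrics listed above can be sketched as follows. This is an illustrative implementation only: the 2 m miss threshold is an assumption (each benchmark defines its own), the minima are taken independently per metric, and the function names are hypothetical.

```python
import math

def displacement_errors(pred, gt):
    """Average and final displacement error of one predicted trajectory."""
    dists = [math.dist(p, g) for p, g in zip(pred, gt)]
    return sum(dists) / len(dists), dists[-1]

def min_ade_fde(modes, gt):
    """minADE and minFDE over the M predicted modes of one agent."""
    ades, fdes = zip(*(displacement_errors(m, gt) for m in modes))
    return min(ades), min(fdes)

def miss_rate(agents_modes, agents_gt, threshold=2.0):
    """Fraction of agents whose best final displacement exceeds the threshold."""
    misses = sum(
        1 for modes, gt in zip(agents_modes, agents_gt)
        if min_ade_fde(modes, gt)[1] > threshold
    )
    return misses / len(agents_gt)
```

Joint metrics such as JADE and JFDE differ in that a single mode index is shared across all agents in a scene before taking the minimum, which is what makes them sensitive to consistency between agents.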
IV-A3 Implementation Details
For fair comparisons, HPP is configured for each benchmark with carefully devised learning pipelines and system architectures. On the nuScenes dataset, full-size training is conducted with a total batch size of 4. In WOMD, 10% of the full training set is randomly sampled for planning, and the full set is used for the occupancy benchmark; training uses batches of 24. The expert dataset from the CARLA benchmark [28], collected in different towns, is directly used for training in batches of 32.
Training for all benchmarks uses the same distributed strategy on four NVIDIA A100 GPUs. The AdamW optimizer is employed with an initial learning rate of 1e-4 and a cosine annealing learning rate schedule. The total number of training epochs is set to 20. We use the same GPU devices for testing on nuScenes and WOMD. Evaluations for the CARLA benchmark are conducted on one NVIDIA RTX 3080 GPU.
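For reference, a cosine annealing schedule with the stated initial learning rate and epoch count can be written as below. The per-epoch update granularity and the zero minimum learning rate are assumptions, since the text does not specify them, and the function name is hypothetical.

```python
import math

def cosine_annealing_lr(step, total_steps, lr_init=1e-4, lr_min=0.0):
    """Cosine annealing: decay from lr_init to lr_min over total_steps."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start of each of the 20 epochs, plus the final value.
schedule = [cosine_annealing_lr(epoch, 20) for epoch in range(21)]
```

The schedule starts at the initial rate, passes through half of it at the midpoint, and decays smoothly towards the minimum by the last epoch.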
For system architectures, HPP establishes ego-centered BEV perception within $\pm 50$ m with $H,W=200$ in nuScenes. In the WOMD and CARLA benchmarks, privileged perception is assumed. Therefore, HPP is developed by encoding perfect scene context inputs following our previous work [34, 65]. For the CARLA benchmark, the BEV remains ego-centered within $\pm 32$ m with $H,W=128$. BEV settings for WOMD follow the official benchmark guidelines. Path planning is optimized without knowledge of reference routes in nuScenes and CARLA. In WOMD, motion planning is conducted with reference information. The full-stack HPP in nuScenes considers various queries for agents. In WOMD and CARLA, agents are sorted and filtered to ${N}_{A}=11$. We select the ReLU activation function and apply a dropout rate of 0.1. More detailed parameters and notations are listed in Table XII.
IV-B Main Results
IV-B1 Full-stack ADS Performance
We report HPP’s testing performance against recent state-of-the-art methods on the nuScenes dataset. HPP achieves state-of-the-art results across various key metrics in both prediction and planning.
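Collision rate (Col., %) $\downarrow $ and planning error (L2, m) $\downarrow $ at each horizon.

| Methods | Col. @1 s | Col. @2 s | Col. @3 s | Col. Avg. | L2 @1 s | L2 @2 s | L2 @3 s | L2 Avg. |
|---|---|---|---|---|---|---|---|---|
| GPT-Driver^{‡} [44] | 0.07 | 0.15 | 1.10 | 0.44 | 0.27 | 0.74 | 1.52 | 0.84 |
| Agent-Driver^{‡} [45] | 0.02 | 0.13 | 0.48 | 0.21 | 0.22 | 0.65 | 1.34 | 0.74 |
| HPP^{‡} | 0.03 | 0.07 | 0.35 | 0.15 | 0.30 | 0.61 | 1.15 | 0.72 |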
| Methods | IoU-n. $\uparrow $ | IoU-f. $\uparrow $ | VPQ-n. $\uparrow $ | VPQ-f. $\uparrow $ |
|---|---|---|---|---|
| FIERY [66] | 59.4 | 36.7 | 50.2 | 29.9 |
| StretchBEV [41] | 55.5 | 37.1 | 46.0 | 29.0 |
| ST-P3 [4] | – | 38.9 | – | 32.1 |
| BEVerse [40] | 61.4 | 40.9 | 54.3 | 36.1 |
| PowerBEV [67] | 62.5 | 39.4 | 55.5 | 33.8 |
| UniAD [2] | 63.4 | 40.2 | 54.7 | 33.5 |
| HPP | 64.8 | 40.5 | 56.4 | 34.7 |
(1) Planning results: As shown in Table I, HPP achieves state-of-the-art (SOTA) results across all planning horizons (@1 s to @3 s) against various autonomous driving systems in both absolute metrics [2] and average ones [5]. Specifically, HPP yields a 5.9% lower average L2 error and a 6.5% lower average collision rate than UniAD [2] while using identical BEV perception. This supports the validity of the modular co-design of prediction and planning in HPP.
In average metrics, HPP reports a 12.5% improvement in planning error over VAD [5], which features stronger perception modules. The reasoning design of HPP yields a 30% lower collision rate than DeepEM [58], which also performs reasoning through EM decoding with extra denoising augmentations. For comparison with methods that leverage the ego status ${\mathbf{s}}_{\text{ego}}$, we train a variant of HPP that adds the ego status to the Ego Planner (Sec. III-E), denoted HPP^{‡}, which further improves the results. Unlike [5], HPP^{‡} does not take accelerations as input, since they leak ground-truth information. In comparison with FusionAD [15], which relies on additional LiDAR fusion, HPP also demonstrates an 11.8% lower planning error with comparable safety. Our system also exhibits 2.8% lower errors and nearly 20% lower collision rates compared to LLM methods (see Table II). This further substantiates the effectiveness of modular integration of predictions under planning-oriented objectives, as LLM baselines focus more on alignment through language knowledge.
(2) Prediction results: Joint occupancy predictions are tested in two ranges (near: $30\times 30$ m; far: $50\times 50$ m) centered on the autonomous vehicle. As shown in Table III, HPP presents improved accuracy (+2.5%) and consistency (+3.7%) compared to [2], thanks to the proposed MSOccFormer, which integrates mutually with motion predictions. The validity of the modular co-design in HPP is further shown by +4% and +1.6% improvements in IoU and VPQ without extra augmentations, compared with [67], which learns the occupancy task alone.
Marginal results of motion predictions are presented in Table IV. Here, we compare the prediction results averaged over all vehicles (v.) and measure full-agent (f.) results weighted by categories. HPP reports a 2.8% lower minADE and a +3.2% EPA gain in vehicle predictions, with 3.7% and 6% improvements, respectively, when predicting all agents, compared to the baseline [2] under the same perception settings. This highlights the performance gain achieved through game-theoretic reasoning and joint dependencies in GTFormer (Sec. III-D).
| Methods | minADE $\downarrow $ | minFDE $\downarrow $ | MR $\downarrow $ | EPA $\uparrow $ |
|---|---|---|---|---|
| ViP3D [63] | 1.15 | 1.95 | 22.6 | 0.222 |
| PnPNet [38] | 2.05 | 2.84 | 24.6 | 0.226 |
| UniAD-v. [2] | 0.71 | 1.02 | 15.1 | 0.456 |
| UniAD-f. [2] | 0.911 | 1.236 | 15.1 | 0.314 |
| HPP-v. | 0.682 | 0.947 | 13.8 | 0.471 |
| HPP-f. | 0.878 | 1.205 | 14.5 | 0.334 |
| Methods | AUC-obs. $\uparrow $ | AUC-occ. $\uparrow $ | EPE-f. $\downarrow $ | AUC-FT $\uparrow $ |
|---|---|---|---|---|
| MotionPerceiver | 77.1 | – | – | – |
| OFMPNet | 77.0 | 16.5 | 3.58 | 76.1 |
| STCNN | 74.4 | 16.8 | 3.87 | 73.3 |
| HOPE [20] | 80.3 | 16.5 | 3.67 | 83.9 |
| STrajNet [48] | 77.8 | 17.8 | 3.20 | 78.5 |
| VectorFlow [21] | 75.4 | 17.3 | 3.58 | 76.7 |
| HPP | 79.7 | 19.4 | 2.95 | 80.2 |
It is important to note that HPP focuses on prediction and planning rather than perception. Prediction results could potentially be further enhanced with better perception [5] or additional LiDAR inputs [15]. Nevertheless, HPP still achieves better final planning results, demonstrating the improved social compliance targeted by our contributions.
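The relative improvements quoted against UniAD [2] can be checked directly from the averaged values reported in Table I. The helper below is a hypothetical convenience function; the table values give reductions of roughly 5.8% in average planning error and 6.5% in average collision rate, close to the figures quoted in the text (small differences may stem from rounding of the tabulated values).

```python
def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline

# Averaged nuScenes planning metrics from Table I (UniAD vs. HPP).
l2_gain = relative_reduction(1.03, 0.97)   # average planning error (m)
cr_gain = relative_reduction(0.31, 0.29)   # average collision rate (%)
```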
IV-B2 Long-horizon Interactive Performance
We benchmark HPP’s long-horizon performance on WOMD. HPP shows strong performance against numerous SOTA methods in prediction and planning.
(1) Occupancy results: As shown in Table V, HPP demonstrates superior prediction accuracy (+8.8% AUC-occ.) and lower flow error ($-9.1\%$ EPE-f.) when considering joint flow predictions. The explicit design of MSOccFormer in HPP shows strong accuracy compared to our previous work [65], which only conducts global interactions without marginal awareness. HPP presents a +2.4% improvement in AUC-obs. and +2.2% in flow-traced occupancy.
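| Method | Collision rate (%) $\downarrow $ | Miss rate (%) $\downarrow $ | Planning error @1 s (m) $\downarrow $ | Planning error @3 s (m) $\downarrow $ | Planning error @5 s (m) $\downarrow $ | Prediction JADE (m) $\downarrow $ | Prediction JFDE (m) $\downarrow $ |
|---|---|---|---|---|---|---|---|
| Vanilla IL [3] | 4.25 | 15.61 | 0.216 | 1.273 | 3.175 | – | – |
| DIM [68] | 4.96 | 17.68 | 0.483 | 1.869 | 3.683 | – | – |
| OPGP [9] | 3.79 | 12.89 | 0.245 | 1.672 | 3.099 | – | – |
| MultiPath++ [69] | 2.86 | 8.61 | 0.146 | 0.948 | 2.719 | – | – |
| MTR-e2e [19] | 2.32 | 8.88 | 0.141 | 0.888 | 2.698 | – | – |
| DIPP [3] | 2.33 | 8.44 | 0.135 | 0.928 | 2.803 | 0.925 | 2.059 |
| GameFormer [34] | 1.98 | 7.53 | 0.129 | 0.836 | 2.451 | 0.853 | 1.919 |
| HPP | 1.85 | 7.58 | 0.092 | 0.881 | 2.667 | 0.829 | 1.965 |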
| Method | Success rate (%) $\uparrow $ | Progress (m) $\uparrow $ | Acceleration ($\mathrm{m}/{\mathrm{s}}^{2}$) | Jerk ($\mathrm{m}/{\mathrm{s}}^{3}$) | Lateral acc. ($\mathrm{m}/{\mathrm{s}}^{2}$) | Position error to expert @3 s (m) $\downarrow $ | @5 s (m) $\downarrow $ | @8 s (m) $\downarrow $ |
|---|---|---|---|---|---|---|---|---|
| Expert | – | 54.52 | 0.529 | 1.020 | 0.103 | – | – | – |
| Vanilla IL [3] | 0 | 6.23 | 1.588 | 16.24 | 0.661 | 9.355 | 20.52 | 46.33 |
| RIP [68] | 19.5 | 12.85 | 1.445 | 14.97 | 0.355 | 7.035 | 17.13 | 38.25 |
| CQL [70] | 10 | 8.28 | 3.158 | 25.31 | 0.152 | 10.86 | 21.18 | 40.17 |
| DIPP [3] | 68.12$\pm $5.51 | 41.08$\pm $5.88 | 1.44$\pm $0.18 | 12.58$\pm $3.23 | 0.31$\pm $0.11 | 6.22$\pm $0.52 | 15.55$\pm $1.12 | 26.10$\pm $3.88 |
| GameFormer [34] | 73.16$\pm $6.14 | 44.94$\pm $7.69 | 1.19$\pm $0.15 | 13.63$\pm $2.88 | 0.32$\pm $0.09 | 5.89$\pm $0.78 | 12.43$\pm $0.51 | 21.02$\pm $2.48 |
| HPP | 74.28$\pm $5.49 | 47.17$\pm $8.92 | 1.33$\pm $0.18 | 11.68$\pm $2.76 | 0.35$\pm $0.09 | 4.93$\pm $0.78 | 10.24$\pm $0.82 | 18.99$\pm $3.05 |
| DIPP (optim.) | 92.16$\pm $0.62 | 51.85$\pm $0.14 | 0.58$\pm $0.03 | 1.54$\pm $0.19 | 0.11$\pm $0.01 | 2.26$\pm $0.10 | 5.55$\pm $0.24 | 12.53$\pm $0.48 |
| GameFormer (optim.) | 94.50$\pm $0.66 | 52.67$\pm $0.33 | 0.53$\pm $0.02 | 1.56$\pm $0.23 | 0.10$\pm $0.01 | 2.11$\pm $0.21 | 4.87$\pm $0.18 | 11.13$\pm $0.33 |
| HPP (optim.) | 92.25$\pm $0.85 | 52.19$\pm $0.41 | 0.66$\pm $0.02 | 1.87$\pm $0.28 | 0.10$\pm $0.01 | 2.13$\pm $0.29 | 4.90$\pm $0.26 | 12.89$\pm $0.38 |
(2) Open-loop results: Table VI shows the results of HPP against numerous state-of-the-art (SOTA) methods in open-loop prediction and planning. HPP achieves reductions of over 50% in collisions and 26% in planning errors compared to imitation learning baselines [3, 68] that do not use predictions. When compared with SOTA motion prediction baselines used as imitative planners, HPP also demonstrates 2.3% to 17.6% reductions in planning errors and 21.1% fewer collisions. This underscores the importance of integrating prediction and planning.
In comparison with integrated baselines, HPP presents significant improvements over the IOP framework [9], which lacks future interactions for planning and whose intractable occupancy hampers effective guidance. When compared with IPP systems, HPP delivers superior performance in collision rates ($-6.6\%$) and joint prediction errors ($-2.9\%$), with comparable planning errors. Compared to the state-of-the-art results from our previous works [3, 34], which focus on modeling future interactions and reasoning (see Fig. 5), HPP exhibits superior short-term performance with reduced variance. This further validates the efficacy of co-design integration for joint predictions and planning upon reasoning in HPP.
(3) Closed-loop results: In Table VII, HPP is tested in a replay simulator against state-of-the-art IPP systems [34], IL methods [68], and RL approaches [70]. IL and RL results are significantly compromised by accumulated distributional shift in closed-loop interactions with the environment. HPP exhibits a higher success rate (+1.5%) and lower planning errors ($-13.2\%$) compared to our previous method [34], thanks to the hybrid prediction-aware planner conditioned on goal context. The closed-loop performance improves substantially with online optimization. Due to the high cost of processing occupancy from raw data, HPP is replanned by online refinement at 2 Hz. Testing results remain strong against our previous methods [34], which replan more frequently at 10 Hz, with 8.5% lower closed-loop position errors than [3].
| Methods | DS $\uparrow $ | RC^{∗} $\uparrow $ | IS $\uparrow $ | CV $\downarrow $ |
|---|---|---|---|---|
| Rule-based [59] | 38.0 | 29.1 | 0.84 | 0.64 |
| KING [36] | 45.1 | 78.3 | 0.55 | 1.67 |
| Roach [71] | 55.3 | 88.2 | 0.62 | 0.72 |
| PlanT^{∗} [28] | 70.9 | 83.1 | 0.87 | 0.31 |
| HPP^{∗} | 65.5 | 79.2 | 0.82 | 0.51 |
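Removed components (✗) are grouped by module: MS = MSOccFormer (w/o.), GT = GTFormer (w/o.), EP = Ego Planner (w/o.). Metrics are grouped into occupancy prediction (IoU, VPQ), motion prediction (minADE, MR, EPA), and planning (ADE, CR).

| ID | MS: Score | MS: Motion | MS: AC-Occ. | GT: Score | GT: Motion | GT: Occ. | EP: Motion | EP: Occ. | IoU-n. $\uparrow $ | IoU-f. $\uparrow $ | VPQ-n. $\uparrow $ | VPQ-f. $\uparrow $ | minADE $\downarrow $ | MR $\downarrow $ | EPA $\uparrow $ | ADE $\downarrow $ | CR $\downarrow $ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ✗ |  |  |  |  |  |  |  | 63.5 | 40.0 | 54.5 | 33.7 | 0.733 | 16.1 | 45.6 | 0.989 | 0.428 |
| 2 |  | ✗ |  |  |  |  |  |  | 63.1 | 39.4 | 53.6 | 32.8 | 0.739 | 16.2 | 44.8 | 0.994 | 0.441 |
| 3 |  | ✗ | ✗ |  |  |  |  |  | 61.9 | 39.6 | 50.8 | 30.9 | 0.748 | 16.6 | 43.8 | 1.015 | 0.466 |
| 4 |  |  |  | ✗ |  |  |  |  | 63.7 | 39.8 | 55.0 | 33.3 | 0.728 | 15.9 | 45.0 | 0.986 | 0.402 |
| 5 |  |  |  |  | ✗ |  |  |  | 63.5 | 39.5 | 54.6 | 33.5 | 0.734 | 16.4 | 44.8 | 0.992 | 0.432 |
| 6 |  |  |  |  | ✗ | ✗ |  |  | 63.5 | 39.6 | 54.8 | 33.0 | 0.747 | 16.2 | 44.2 | 0.998 | 0.436 |
| 7 |  |  |  |  |  |  | ✗ |  | 63.6 | 39.8 | 55.0 | 33.4 | 0.726 | 16.0 | 45.7 | 0.989 | 0.414 |
| 8 |  |  |  |  |  |  |  | ✗ | 63.5 | 39.7 | 54.7 | 33.3 | 0.721 | 15.6 | 45.4 | 1.012 | 0.458 |
| 9 |  |  |  |  |  |  | ✗ | ✗ | 63.5 | 39.8 | 54.8 | 33.4 | 0.726 | 15.9 | 45.7 | 0.997 | 0.449 |
| 0 |  |  |  |  |  |  |  |  | 63.7 | 39.8 | 55.0 | 33.5 | 0.724 | 15.9 | 45.8 | 0.985 | 0.402 |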
IV-B3 Long-term Driving Performance
HPP demonstrates driving capabilities comparable to the state-of-the-art method [28] in the CARLA benchmark (Table VIII). With significant improvements over rule-based agents [59], HPP achieves a $+20.4$ driving score compared to IPP methods [36], which guide planning by adversarial predictions, as well as a $+12.2$ improvement compared to RL methods [71]. Note that both HPP and the reproduced [28] (^{∗}) show compromised route completion (RC), likely due to GPU inference issues (see https://github.com/autonomousvision/plant/issues/17).
IV-C Ablation Studies and Discussions
To examine what drives the effectiveness of HPP, comprehensive ablation studies are conducted on nuScenes, focusing on the roles of modular co-design and the characteristics of hybrid prediction in planning guidance.
IV-C1 Roles in Modular Co-design
To assess the effectiveness of co-design among the modules, as presented in Table IX, we purposefully remove certain key integration designs from each submodule. All ablated baselines (ID.1-9) are compared to the original HPP (ID.0). Detailed discussions are as follows:
(1) Effects in MSOccFormer: For marginal dependencies (Sec. III-C2), removing score scaling (ID.1 vs. ID.0) yields a slight trade-off across occupancy ranges ($-0.5$ VPQ-n. vs. $+0.2$ VPQ-f.) with increased collisions ($+0.026\%$). This indicates that the scaling captures multi-modal features for joint predictions and enhances the shared occupancy area nearby. Removing motion fusion (ID.2) causes across-the-board decreases owing to the lack of marginal alignment. Greater decreases ($-1.2$ IoU-n. and $-3.7$ VPQ-n.) are observed when agent-conditioned occupancy modeling is removed (ID.3). This highlights the importance of future interactions and validates the consistency towards motion predictions ($2.4\%$ lower for prediction errors). Results in planning also reflect the larger contribution of near-scale joint prediction interactions to planning. Comparing ID.2 and ID.3, without sacrificing far-scale occupancy ($-1.2$ IoU-n. vs. $+0.2$ IoU-f.), the planning errors and collisions increase along with a drop in near-scale prediction accuracy. For the key design of multi-scale prediction integration, qualitative (see Fig. 6) and quantitative ablations (Table X) demonstrate notable improvements from the multi-scale attention mask update design (+1.2 IoU-n.) as well as local integration (+1.0 VPQ-n.) for prediction consistency.
(2) Effects in GTFormer: As GTFormer is the core module that enables reasoning capabilities for consistent predictions and planning, we investigate its role by removing marginal and joint predictions in GTFormer. Comparing ID.4 with ID.5, removing the reasoning module causes an overall drop, especially in planning ($+0.03\%$ CR) and motion predictions ($+3.1\%$ MR). This highlights the importance of reasoning for the consistency and accuracy of future interactions. Meanwhile, comparing ID.5 and ID.6, the increase in prediction errors ($+1.7\%$ minADE) implies that joint dependencies (Sec. III-D2) are key to consistent motion predictions upon GT-reasoning. This reflects an enhanced mutual consistency modulated by interactive modeling in the co-design of the joint and marginal predictors.
| Baselines | IoU-n. $\uparrow $ | IoU-f. $\uparrow $ | VPQ-n. $\uparrow $ | VPQ-f. $\uparrow $ |
|---|---|---|---|---|
| w/o. global integration | 61.6 | 38.8 | 52.0 | 31.9 |
| w/o. local integration | 61.8 | 38.8 | 52.8 | 32.1 |
| w/o. attn. mask ${\text{M}}_{t}$ | 62.5 | 39.0 | 53.8 | 32.4 |
| HPP | 63.7 | 39.8 | 55.0 | 33.5 |
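Collision rate (Col., %) $\downarrow $ and planning error (L2, m) $\downarrow $ at each horizon.

| ID | ${\mathcal{L}}_{col}$ | Occ. | Most-likely | Full-motion | Col. @1 s | Col. @2 s | Col. @3 s | Col. Avg. | L2 @1 s | L2 @2 s | L2 @3 s | L2 Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 |  |  |  |  | 0.11 | 0.31 | 0.73 | 0.39 | 0.19 | 0.53 | 1.13 | 0.61 |
| 2 | ✓ |  |  |  | 0.07 | 0.22 | 0.64 | 0.31 | 0.18 | 0.52 | 1.09 | 0.60 |
| 3 | ✓ |  | ✓ |  | 0.05 | 0.11 | 0.60 | 0.25 | 0.19 | 0.52 | 1.10 | 0.60 |
| 4 | ✓ |  |  | ✓ | 0.04 | 0.09 | 0.53 | 0.20 | 0.19 | 0.52 | 1.10 | 0.60 |
| 5 | ✓ | ✓ |  |  | 0.03 | 0.14 | 0.45 | 0.20 | 0.30 | 0.61 | 1.16 | 0.72 |
| 6 | ✓ | ✓ | ✓ |  | 0.03 | 0.14 | 0.44 | 0.20 | 0.31 | 0.61 | 1.17 | 0.72 |
| 0 | ✓ | ✓ |  | ✓ | 0.03 | 0.07 | 0.35 | 0.15 | 0.30 | 0.61 | 1.15 | 0.72 |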
(3) Effects in Ego Planner: Having validated the plan-conditioning design against our previous work [34] in Sec. IV-B2, we further examine the effects of hybrid prediction interactions. Relative to ID.9, we observe marginal improvements in predictions (+0.2 IoU, +0.1 EPA) and more significant enhancements in planning ($-0.047\%$ CR) when hybrid prediction interactions are included. This underscores the consistency modeling contributed by the HPP design. Ablations ID.7 and ID.8 further suggest a more substantial impact on planning from interactions with occupancy predictive features ($-0.056\%$ CR) than from marginal ones ($-0.012\%$ CR). Planning conditioned solely on motion (ID.8) outputs overly optimistic motion predictions ($-0.3$ MR). This in turn harms the original game-theoretic reasoning, resulting in inferior planning ($+0.035\%$ CR). These results underscore the modulating effect of joint dependencies on consistency in both planning and motion predictions.
IV-C2 Roles in Hybrid Prediction
To further examine the characteristics of marginal and joint predictions, as listed in Table XI, the original HPP (ID.0) is compared with ablations (ID.1-6) on the guidance used during learning and optimization: reasoning loss (${\mathcal{L}}_{col}$); marginal predictions (most likely, full); and joint predictions (Occ.). Key findings are summarized below:
(1) Consistent reasoning: Comparing ID.1 and ID.2, the significant safety improvement ($-25.8\%$ CR) highlights the role of consistency for marginal predictions in reasoning learning (${\mathcal{L}}_{col}$). Joint predictions in HPP are leveraged to guide internal marginal consistency, which in turn modulates planning.
(2) Complementary influences: Hybrid prediction benefits planning through mutual coverage gains compared with sole guidance ($-33\%$ CR) in ID.4 and ID.5. This is enhanced by the consistency design in HPP. 1) Compared with ID.3, guiding with full motion predictions (ID.4) yields safer planning ($-20\%$ CR) without sacrificing accuracy. 2) Good alignment in hybrid prediction results in close performance between ID.5 and ID.6, which adds the most likely marginal prediction. These mutual effects secure the planning performance in case either of the hybrid predictions is less functional due to: (1) state errors in pose and speed; (2) future uncertainty; or (3) discontinuities over spatial-temporal horizons.
IV-D Qualitative Results
Figs. 7 and 8 showcase the qualitative benefits of integrating hybrid prediction in the nuScenes [51] and WOMD [52] benchmarks, respectively. In the comprehensive testing scenario depicted in Fig. 7a, joint guidance from occupancy ensures consistent motion predictions during cruising. Notably, Fig. 7b demonstrates that hybrid prediction awareness contributes to consistent planning, enhancing reasoning behavior for smooth cruising without overly conservative avoidance, thereby mitigating risks near lane boundaries. Fig. 7c further illustrates the complementary effects, where motion predictions take precedence in guiding planning when occupancy is uncertain. The mutual influence is evident in Fig. 7d, where the ego vehicle remains in the lane instead of swerving, as the occupancy predictions for rear cyclists are constrained by marginal predictions to follow a straight trajectory, maintaining a safe distance from the ego vehicle.
In Fig. 8, qualitative assessments of WOMD interactive scenarios underscore HPP’s superior reasoning capabilities and safety-conscious planning. In Fig. 8a, DIPP [3] fails in an emergency stop, exposing a lack of consistency between motion predictions and planning, while HPP and GameFormer [34] exhibit smoother planning around obstacles. Motion prediction in HPP is smoother than in [34], as occupancy predictions modulate the joint behaviors. This is important because the reasoned motion predictions of [34] may trade consistency for safety (Fig. 8b) and accuracy (Fig. 8c). High-speed situations (Fig. 8d, Fig. 8e) further demonstrate HPP’s mutual consistency between motion predictions and planning, showcasing its adaptability and robustness in diverse driving environments. These results underscore the efficacy of HPP in enhancing both the consistency and safety of the autonomous driving system.
IV-E Future Outlook
In introducing HPP, we advocate for the adoption of modular co-design and optimization principles to shape ADS. As the notion of a planning-oriented modular learning system gains traction, it is crucial to underscore the significance of well-organized and concurrently integrated modules that mirror the complexities of real-world scenarios. In the future, we see agent-based models, such as LLMs, facilitating connections between various modules to ensure seamless integration. HPP lays the groundwork for module-wise integration, and future work will explore intermediate-level integration and guidance between prediction and planning.
V Conclusions
In this paper, we develop HPP, a modular co-design optimization framework for autonomous driving systems. The main focus of HPP is on integrating hybrid prediction and planning, wherein three submodules, i.e., MSOccFormer, GTFormer, and Ego Planner, address consistency issues and enhance adaptability using the hybrid prediction-guided learning and optimization pipeline. HPP has been extensively evaluated across diverse benchmarks, and the results consistently demonstrate its superior performance in both prediction and planning metrics compared to state-of-the-art methods. The hybrid prediction awareness in HPP, which incorporates joint behaviors and occupancy predictions, improves qualitative consistency, safety, and feasibility in real-world scenarios. By elucidating the roles in modular co-design and the complementary effects that enhance planning through hybrid prediction, HPP showcases its potential to advance the field by tackling challenges in prediction and planning, contributing to the ongoing evolution of autonomous driving frameworks.
Notation  Meaning                    Testing Benchmarks
                                     nuScenes            WOMD                CARLA
$H,W$     Size of BEV space          200                 128/256             128
-         BEV length (m)             50                  64                  64
$N_A$     Max number of agents       -                   11/33               11
$N_M$     Max number of map elements -                   100                 50
$T$       Future horizons (s)        3/2/6               5/8                 2
-         Future frequency (Hz)      2                   2/1                 2
$M$       Number of modalities       6                   6                   6
$L$       Occ integration levels     3                   3                   3
$K$       Reasoning levels           3                   3                   3
$L_P$     Layers in Ego Planner      3                   3                   3
$h$       Number of attention heads  8                   8                   8
$D$       Embedding size             256                 256                 256
-         Activation function        $\mathrm{ReLU}$     $\mathrm{ReLU}$     $\mathrm{ReLU}$
-         Dropout rate               0.1                 0.1                 0.1
-         Batch size                 4                   24                  32
-         Learning rate              $1\times10^{-4}$    $1\times10^{-4}$    $1\times10^{-4}$
-         Training epochs            20                  20                  20
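As an illustration only, the hyperparameters in the table above could be collected into a small configuration object per benchmark. The sketch below is not taken from the HPP code base: the class name, field names, and the helper `future_steps` are hypothetical, cells that the table leaves empty are stored as `None`, and the learning-rate entry is read as 1e-4.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class BenchmarkConfig:
    """Per-benchmark settings transcribed from the hyperparameter table.

    All field names are illustrative assumptions, not HPP identifiers.
    Tuples hold every value listed in a cell (e.g. "128/256" -> (128, 256)).
    """
    bev_size: Tuple[int, ...]               # H, W: size of BEV space
    bev_length_m: int                       # BEV length in metres
    max_agents: Optional[Tuple[int, ...]]   # N_A; None where the cell is empty
    max_map_elements: Optional[int]         # N_M; None where the cell is empty
    horizons_s: Tuple[int, ...]             # T: future horizons in seconds
    frequencies_hz: Tuple[int, ...]         # future frequency in Hz
    batch_size: int
    # Values shared by all three benchmarks in the table.
    num_modalities: int = 6                 # M
    occ_integration_levels: int = 3         # L
    reasoning_levels: int = 3               # K
    planner_layers: int = 3                 # L_P
    attention_heads: int = 8                # h
    embedding_size: int = 256               # D
    activation: str = "relu"
    dropout: float = 0.1
    learning_rate: float = 1e-4             # assumed reading of the table entry
    epochs: int = 20


CONFIGS = {
    "nuScenes": BenchmarkConfig(
        bev_size=(200,), bev_length_m=50, max_agents=None,
        max_map_elements=None, horizons_s=(3, 2, 6),
        frequencies_hz=(2,), batch_size=4),
    "WOMD": BenchmarkConfig(
        bev_size=(128, 256), bev_length_m=64, max_agents=(11, 33),
        max_map_elements=100, horizons_s=(5, 8),
        frequencies_hz=(2, 1), batch_size=24),
    "CARLA": BenchmarkConfig(
        bev_size=(128,), bev_length_m=64, max_agents=(11,),
        max_map_elements=50, horizons_s=(2,),
        frequencies_hz=(2,), batch_size=32),
}


def future_steps(horizon_s: int, frequency_hz: int) -> int:
    """Number of predicted future frames for one horizon/frequency pair."""
    return horizon_s * frequency_hz


if __name__ == "__main__":
    carla = CONFIGS["CARLA"]
    # A 2 s horizon sampled at 2 Hz gives 4 future frames.
    print(future_steps(carla.horizons_s[0], carla.frequencies_hz[0]))
```

Which horizon pairs with which frequency on WOMD (5 s or 8 s at 2 Hz or 1 Hz) is not stated in the table, so the sketch stores both lists and leaves the pairing to the caller.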
References
 [1] L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li et al., “Milestones in autonomous driving and intelligent vehicles: Survey of surveys,” IEEE Transactions on Intelligent Vehicles, 2022.
 [2] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang et al., “Planning-oriented autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 17 853–17 862.
 [3] Z. Huang, H. Liu, J. Wu, and C. Lv, “Differentiable integrated motion prediction and planning with learnable cost function for autonomous driving,” arXiv preprint arXiv:2207.10422, 2022.
 [4] S. Hu, L. Chen, P. Wu, H. Li, J. Yan, and D. Tao, “St-p3: End-to-end vision-based autonomous driving via spatial-temporal feature learning,” in European Conference on Computer Vision. Springer, 2022, pp. 533–549.
 [5] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang, “Vad: Vectorized scene representation for efficient autonomous driving,” arXiv preprint arXiv:2303.12077, 2023.
 [6] S. Hagedorn, M. Hallgarten, M. Stoll, and A. Condurache, “Rethinking integration of prediction and planning in deep learning-based automated driving systems: a review,” arXiv preprint arXiv:2308.05731, 2023.
 [7] X. Mo, Z. Huang, Y. Xing, and C. Lv, “Multi-agent trajectory prediction with heterogeneous edge-enhanced graph attention network,” IEEE Transactions on Intelligent Transportation Systems, 2022.
 [8] S. Pini, C. S. Perone, A. Ahuja, A. S. R. Ferreira, M. Niendorf, and S. Zagoruyko, “Safe real-world autonomous driving by learning to predict and plan with a mixture of experts,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 10 069–10 075.
 [9] H. Liu, Z. Huang, and C. Lv, “Occupancy prediction-guided neural planner for autonomous driving,” arXiv preprint arXiv:2305.03303, 2023.
 [10] Y. Hu, K. Li, P. Liang, J. Qian, Z. Yang, H. Zhang, W. Shao, Z. Ding, W. Xu, and Q. Liu, “Imitation with spatial-temporal heatmap: 2nd place solution for nuplan challenge,” arXiv preprint arXiv:2306.15700, 2023.
 [11] M. Bansal, A. Krizhevsky, and A. Ogale, “Chauffeurnet: Learning to drive by imitating the best and synthesizing the worst,” arXiv preprint arXiv:1812.03079, 2018.
 [12] S. Mozaffari, O. Y. Al-Jarrah, M. Dianati, P. Jennings, and A. Mouzakitis, “Deep learning-based vehicle behavior prediction for autonomous driving applications: A review,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 1, pp. 33–47, 2020.
 [13] J. Kim, R. Mahjourian, S. Ettinger, M. Bansal, B. White, B. Sapp, and D. Anguelov, “Stopnet: Scalable trajectory and occupancy prediction for urban autonomous driving,” arXiv preprint arXiv:2206.00991, 2022.
 [14] L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,” arXiv preprint arXiv:2306.16927, 2023.
 [15] T. Ye, W. Jing, C. Hu, S. Huang, L. Gao, F. Li, J. Wang, K. Guo, W. Xiao, W. Mao et al., “Fusionad: Multi-modality fusion for prediction and planning tasks of autonomous driving,” arXiv preprint arXiv:2308.01006, 2023.
 [16] X. Jia, P. Wu, L. Chen, Y. Liu, H. Li, and J. Yan, “Hdgt: Heterogeneous driving graph transformer for multi-agent trajectory prediction via scene encoding,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
 [17] Z. Huang, X. Mo, and C. Lv, “Multi-modal motion prediction with transformer-based neural network for autonomous driving,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 2605–2611.
 [18] X. Mo, Y. Xing, H. Liu, and C. Lv, “Map-adaptive multimodal trajectory prediction using hierarchical graph neural networks,” IEEE Robotics and Automation Letters, 2023.
 [19] S. Shi, L. Jiang, D. Dai, and B. Schiele, “Motion transformer with global intention localization and local movement refinement,” Advances in Neural Information Processing Systems, 2022.
 [20] Y. Hu, W. Shao, B. Jiang, J. Chen, S. Chai, Z. Yang, J. Qian, H. Zhou, and Q. Liu, “Hope: Hierarchical spatial-temporal network for occupancy flow prediction,” arXiv preprint arXiv:2206.10118, 2022.
 [21] X. Huang, X. Tian, J. Gu, Q. Sun, and H. Zhao, “Vectorflow: Combining images and vectors for traffic occupancy and flow prediction,” arXiv preprint arXiv:2208.04530, 2022.
 [22] T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde, “Thomas: Trajectory heatmap output with learned multi-agent sampling,” arXiv preprint arXiv:2110.06607, 2021.
 [23] A. Kamenev, L. Wang, O. B. Bohan, I. Kulkarni, B. Kartal, A. Molchanov, S. Birchfield, D. Nistér, and N. Smolyanskiy, “Predictionnet: Real-time joint probabilistic traffic prediction for planning, control, and simulation,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 8936–8942.
 [24] Z. Huang, H. Liu, J. Wu, and C. Lv, “Conditional predictive behavior planning with inverse reinforcement learning for human-like autonomous driving,” IEEE Transactions on Intelligent Transportation Systems, 2023.
 [25] P. Hang, C. Lv, C. Huang, J. Cai, Z. Hu, and Y. Xing, “An integrated framework of decision making and motion planning for autonomous vehicles considering social behaviors,” IEEE transactions on vehicular technology, vol. 69, no. 12, pp. 14 458–14 469, 2020.
 [26] D. Xu, Y. Chen, B. Ivanovic, and M. Pavone, “Bits: Bi-level imitation for traffic simulation,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 2929–2936.
 [27] H. Liu, Z. Huang, X. Mo, and C. Lv, “Augmenting reinforcement learning with transformer-based scene representation learning for decision-making of autonomous driving,” arXiv preprint arXiv:2208.12263, 2022.
 [28] K. Renz, K. Chitta, O.-B. Mercea, A. Koepke, Z. Akata, and A. Geiger, “Plant: Explainable planning transformers via object-level representations,” arXiv preprint arXiv:2210.14222, 2022.
 [29] N. Rhinehart, R. McAllister, K. Kitani, and S. Levine, “Precog: Prediction conditioned on goals in visual multi-agent settings,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 2821–2830.
 [30] Z. Huang, H. Liu, J. Wu, W. Huang, and C. Lv, “Learning interaction-aware motion prediction model for decision-making in autonomous driving,” arXiv preprint arXiv:2302.03939, 2023.
 [31] J. L. V. Espinoza, A. Liniger, W. Schwarting, D. Rus, and L. Van Gool, “Deep interactive motion prediction and planning: Playing games with motion prediction models,” in Learning for Dynamics and Control Conference. PMLR, 2022, pp. 1006–1019.
 [32] C. Burger, J. Fischer, F. Bieder, Ö. Ş. Taş, and C. Stiller, “Interaction-aware game-theoretic motion planning for automated vehicles using bi-level optimization,” in 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2022, pp. 3978–3985.
 [33] W. Wang, L. Wang, C. Zhang, C. Liu, L. Sun et al., “Social interactions for autonomous driving: A review and perspectives,” Foundations and Trends® in Robotics, vol. 10, no. 3-4, pp. 198–376, 2022.
 [34] Z. Huang, H. Liu, and C. Lv, “Gameformer: Game-theoretic modeling and learning of transformer-based interactive prediction and planning for autonomous driving,” arXiv preprint arXiv:2303.05760, 2023.
 [35] P. Karkus, B. Ivanovic, S. Mannor, and M. Pavone, “Diffstack: A differentiable and modular control stack for autonomous vehicles,” in Conference on Robot Learning. PMLR, 2023, pp. 2170–2180.
 [36] N. Hanselmann, K. Renz, K. Chitta, A. Bhattacharyya, and A. Geiger, “King: Generating safety-critical driving scenarios for robust imitation via kinematics gradients,” in European Conference on Computer Vision. Springer, 2022, pp. 335–352.
 [37] S. Casas, A. Sadat, and R. Urtasun, “Mp3: A unified model to map, perceive, predict and plan,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14 403–14 412.
 [38] M. Liang, B. Yang, W. Zeng, Y. Chen, R. Hu, S. Casas, and R. Urtasun, “Pnpnet: End-to-end perception and prediction with tracking in the loop,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11 553–11 562.
 [39] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Y. Qiao, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” in European Conference on Computer Vision. Springer, 2022, pp. 1–18.
 [40] Y. Zhang, Z. Zhu, W. Zheng, J. Huang, G. Huang, J. Zhou, and J. Lu, “Beverse: Unified perception and prediction in birds-eye-view for vision-centric autonomous driving,” arXiv preprint arXiv:2205.09743, 2022.
 [41] A. K. Akan and F. Güney, “Stretchbev: Stretching future instance prediction spatially and temporally,” in European Conference on Computer Vision. Springer, 2022, pp. 444–460.
 [42] H. Li, C. Sima, J. Dai, W. Wang, L. Lu, H. Wang, J. Zeng, Z. Li, J. Yang, H. Deng et al., “Delving into the devils of bird’s-eye-view perception: A review, evaluation and recipe,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
 [43] X. Jia, Y. Gao, L. Chen, J. Yan, P. L. Liu, and H. Li, “Driveadapter: Breaking the coupling barrier of perception and planning in end-to-end autonomous driving,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7953–7963.
 [44] J. Mao, Y. Qian, H. Zhao, and Y. Wang, “Gpt-driver: Learning to drive with gpt,” arXiv preprint arXiv:2310.01415, 2023.
 [45] J. Mao, J. Ye, Y. Qian, M. Pavone, and Y. Wang, “A language agent for autonomous driving,” arXiv preprint arXiv:2311.10813, 2023.
 [46] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, P. Luo, A. Geiger, and H. Li, “Drivelm: Driving with graph visual question answering,” arXiv preprint arXiv:2312.14150, 2023.
 [47] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
 [48] H. Liu, Z. Huang, and C. Lv, “Multi-modal hierarchical transformer for occupancy flow field prediction in autonomous driving,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 1449–1455.
 [49] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 012–10 022.
 [50] M. Bhardwaj, B. Boots, and M. Mukadam, “Differentiable gaussian process motion planning,” in 2020 IEEE international conference on robotics and automation (ICRA). IEEE, 2020, pp. 10 598–10 604.
 [51] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631.
 [52] S. Ettinger, S. Cheng, B. Caine, C. Liu, H. Zhao, S. Pradhan, Y. Chai, B. Sapp, C. R. Qi, Y. Zhou et al., “Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9710–9719.
 [53] R. Mahjourian, J. Kim, Y. Chai, M. Tan, B. Sapp, and D. Anguelov, “Occupancy flow fields for motion forecasting in autonomous driving,” IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 5639–5646, 2022.
 [54] W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8660–8669.
 [55] P. Hu, A. Huang, J. Dolan, D. Held, and D. Ramanan, “Safe local motion planning with self-supervised freespace forecasting,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 732–12 741.
 [56] T. Khurana, P. Hu, A. Dave, J. Ziglar, D. Held, and D. Ramanan, “Differentiable raycasting for self-supervised occupancy forecasting,” in European Conference on Computer Vision. Springer, 2022, pp. 353–369.
 [57] W. Tong, C. Sima, T. Wang, L. Chen, S. Wu, H. Deng, Y. Gu, L. Lu, P. Luo, D. Lin et al., “Scene as occupancy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8406–8415.
 [58] Z. Chen, M. Ye, S. Xu, T. Cao, and Q. Chen, “Deepemplanner: An em motion planner with iterative interactions,” arXiv preprint arXiv:2311.08100, 2023.
 [59] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, “Carla: An open urban driving simulator,” in Conference on robot learning. PMLR, 2017, pp. 1–16.
 [60] K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger, “Transfuser: Imitation with transformer-based sensor fusion for autonomous driving,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
 [61] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
 [62] D. Kim, S. Woo, J.-Y. Lee, and I. S. Kweon, “Video panoptic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9859–9868.
 [63] J. Gu, C. Hu, T. Zhang, X. Chen, Y. Wang, Y. Wang, and H. Zhao, “Vip3d: End-to-end visual trajectory prediction via 3d agent queries,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 5496–5506.
 [64] O. Scheel, L. Bergamini, M. Wolczyk, B. Osiński, and P. Ondruska, “Urban driver: Learning to drive from real-world demonstrations using policy gradients,” in Conference on Robot Learning. PMLR, 2022, pp. 718–728.
 [65] Y. Liu, J. Zhang, L. Fang, Q. Jiang, and B. Zhou, “Multimodal motion prediction with stacked transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7577–7586.
 [66] A. Hu, Z. Murez, N. Mohan, S. Dudas, J. Hawke, V. Badrinarayanan, R. Cipolla, and A. Kendall, “Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 273–15 282.
 [67] P. Li, S. Ding, X. Chen, N. Hanselmann, M. Cordts, and J. Gall, “Powerbev: A powerful yet lightweight framework for instance prediction in bird’s-eye view,” arXiv preprint arXiv:2306.10761, 2023.
 [68] N. Rhinehart, R. McAllister, and S. Levine, “Deep imitative models for flexible inference, planning, and control,” arXiv preprint arXiv:1810.06544, 2018.
 [69] B. Varadarajan, A. Hefny, A. Srivastava, K. S. Refaat, N. Nayakanti, A. Cornman, K. Chen, B. Douillard, C. P. Lam, D. Anguelov et al., “Multipath++: Efficient information fusion and trajectory aggregation for behavior prediction,” in 2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022, pp. 7814–7821.
 [70] A. Kumar, A. Zhou, G. Tucker, and S. Levine, “Conservative q-learning for offline reinforcement learning,” Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.
 [71] Z. Zhang, A. Liniger, D. Dai, F. Yu, and L. Van Gool, “End-to-end urban driving by imitating a reinforcement learning coach,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15 222–15 232.