Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented Dialogue

Huifang Du^*
Tongji University
duhuifang@tongji.edu.cn
Shuqin Li^*
Hangzhou Dianzi University
shuqinlee9683@gmail.com
Minghao Wu
Monash University
minghao.wu@monash.edu
Xuejing Feng
Tongji University
fengxuejing@tongji.edu.cn
Yuan-Fang Li
Monash University
yuanfang.li@monash.edu
Haofen Wang
Tongji University
carter.whfcarter@gmail.com

Abstract

Reinforcement learning (RL) is a powerful approach to enhance task-oriented dialogue (TOD) systems. However, existing RL methods tend to focus mainly on generation tasks, such as dialogue policy learning (DPL) or response generation (RG), while neglecting dialogue state tracking (DST) for understanding. This narrow focus prevents systems from achieving globally optimal performance because it overlooks the interdependence between understanding and generation. Additionally, RL methods face challenges with sparse and delayed rewards, which complicate training and optimization. To address these issues, we extend RL to both understanding and generation tasks by introducing step-by-step rewards throughout token generation. The understanding reward increases as more slots are correctly filled in DST, while the generation reward grows with the accurate inclusion of user requests. Our approach provides a balanced optimization aligned with task completion. Experimental results demonstrate that our approach effectively enhances the performance of TOD systems and achieves new state-of-the-art results on three widely used datasets: MultiWOZ2.0, MultiWOZ2.1, and In-Car. Our approach also shows superior few-shot ability in low-resource settings compared to current models.

^* These authors contributed equally to this work.

Figure 1: A task-oriented dialogue system needs to successfully perform both understanding and generation to achieve its dialogue goals.

1 Introduction

The rapid advancements in pre-trained language models (PLMs) have significantly influenced a variety of real-world applications Devlin et al. (2018); Raffel et al. (2020); Chung et al. (2024). Among these, the development of task-oriented dialogue (TOD) systems stands out as particularly impactful Wen et al. (2017); Hosseini-Asl et al. (2020). Typically, a TOD system comprises several components He et al. (2022b); Feng et al. (2023), as shown in Figure 1, including dialogue state tracking (DST) for tracking the user’s belief state Chen et al. (2020); Guo et al. (2023), dialogue policy learning (DPL) for generating dialogue acts Zhao et al. (2024); Zhang et al. (2019), and response generation (RG) for generating system responses Pei et al. (2020); Chen et al. (2019). More recently, there has been growing interest in constructing end-to-end (E2E) TOD systems based on PLMs to equip models with all these essential capabilities He et al. (2022b); Hosseini-Asl et al. (2020); Feng et al. (2023); Yu et al. (2023).

Building on these advancements, recent research explores offline reinforcement learning (RL) to further optimize TOD systems by learning goal-oriented conversational strategies Lu et al. (2019); Jang et al. (2021); Feng et al. (2023). However, current RL approaches typically focus on enhancing the generation component, such as generating dialogue acts (DPL task) Li et al. (2023) or system responses (RG task) Yu et al. (2023). This biased focus prevents the systems from reaching optimal performance by ignoring the crucial interdependence between understanding and generation. Furthermore, RL for TOD systems often faces issues with sparse and delayed rewards Lu et al. (2019); Abdulhai et al. (2023), which are only provided upon reaching the goal at the dialogue or turn level Kwan et al. (2023); Lu et al. (2019); Abdulhai et al. (2023). This leads to insufficient exploration and unstable training. While many efforts have tried to mitigate these issues by offering dense rewards, the reward functions in these methods tend to be complex, which may limit their generalization Li et al. (2020); Feng et al. (2023).

In this work, we design a simple but effective reward function that jointly optimizes the understanding and generation components in an end-to-end manner to achieve globally optimal performance. We combine an understanding reward and a generation reward that are provided at every token generation step, reinforcing learning step by step. The understanding reward is the growing proportion of correctly filled slots in the DST process, while the generation reward is measured by the correct inclusion of user requests in the DPL and RG processes. We conduct extensive experiments using two model backbones, the Flan-T5 base and Flan-T5 large models Chung et al. (2024), on three widely used benchmarks: MultiWOZ2.0, MultiWOZ2.1, and In-Car. The results show that our approach significantly improves model performance against strong baselines, establishing new state-of-the-art results. We also show that our approach outperforms current models in low-resource conditions, highlighting its adaptability in real-world scenarios where data is limited.

Our contributions to this work are summarized as follows:

• We introduce a novel approach that integrates RL into both understanding (DST) and generation (DPL and RG) components in an end-to-end manner, which promotes a balanced optimization for TOD systems.

• To tackle the challenges of sparse and delayed rewards in RL for TOD systems, we propose a combined reward mechanism that provides progressive feedback during token generation. This step-by-step reward significantly enhances efficiency.

• Experimental results show that our approach establishes new state-of-the-art results on multiple benchmarks (MultiWOZ2.0, MultiWOZ2.1, and In-Car). Furthermore, the method shows superior performance in low-resource conditions.

2 Related Work

In this section, we review works on TOD systems utilizing both pipeline and E2E methods, the integration of reinforcement learning (RL), and the design of reward functions for RL. Additionally, we discuss the role of large language models (LLMs) in TOD systems.

Pipeline and End-to-End Approaches.

Pipeline approaches are characterized by their modular structure, where dialogue state tracking (DST) Chen et al. (2020); Guo et al. (2023), dialogue policy learning (DPL) Zhao et al. (2024); Zhang et al. (2019), and response generation (RG) Pei et al. (2020); Chen et al. (2019) are processed sequentially. They offer interpretability and modularity but often struggle to capture the overall context of conversations Kwan et al. (2023). In contrast, E2E approaches directly map input utterances to system responses without explicit intermediate representations He et al. (2022b); Yang et al. (2021); He et al. (2022a). Some models, such as SPACE-3 He et al. (2022a), UBAR Yang et al. (2021), and PPTOD Su et al. (2022a), restructure all sub-tasks into a single sequence prediction through pre-training and fine-tuning. However, supervised fine-tuning (SFT) focuses more on learning at the token level than on the particular requirements, which limits the model’s ability to complete specific tasks.

RL-Based Policy Learning.

RL can be leveraged to enhance model performance by tailoring it to the specific requirements of TOD tasks. However, RL models face challenges due to large action spaces and sparse rewards Feng et al. (2023); Zhang et al. (2019); Wu et al. (2019). Some studies use deep reinforcement learning (DRL) methods like Deep Q-Networks (DQN) Peng et al. (2018); Jang et al. (2021) to improve policies with simulated user interactions. Hierarchical RL (HRL) breaks tasks into sub-tasks, creating a policy hierarchy Peng et al. (2017); Liu et al. (2020), while feudal RL (FRL) abstracts state and action spaces for more general policies Gao et al. (2018); Casanueva et al. (2018). These methods primarily focus on dialogue policy learning with complex algorithmic designs and often lack a robust understanding of user intentions, resulting in suboptimal performance.

Reward Design for TOD.

Recent studies have found offline RL to be a promising method for stabilizing training with static datasets Snell et al. (2023); Feng et al. (2023). Following the offline principle, many methods design rewards at the dialog and turn level when a goal is achieved Kwan et al. (2023); Lu et al. (2019); Tang et al. (2018), but reward signals remain sparse. Inverse reinforcement learning (IRL) and reward shaping techniques have been introduced to learn denser rewards and encourage faster learning Li et al. (2020); Takanobu et al. (2019). However, IRL can be computationally intensive, and reward shaping might result in unintended behaviors if not carefully designed Arora and Doshi (2021); Gupta et al. (2024). Alternatively, some methods employ rewards for every token, which may lack semantic significance towards the dialogue goal Yu et al. (2023); Gupta et al. (2024). Our approach provides progressive rewards directly towards the dialogue goal.

Large Language Models for TOD.

LLMs have demonstrated impressive capabilities in understanding and generating text for various tasks Ouyang et al. (2022a); OpenAI (2023); Chowdhery et al. (2023); Wu et al. (2024c). However, LLMs underperform compared to specialized task-specific models Hudeček and Dušek (2023); Li et al. (2023); Wu et al. (2024b). Fine-tuning LLMs for specific tasks is also computationally inefficient. These limitations have led to a growing interest in prompt engineering approaches that leverage in-context learning without requiring parameter updates Wei et al. (2022); Wang et al. (2022); Yao et al. (2024); Wu et al. (2024a). Yet, even with such prompting, LLMs still tend to perform less effectively on TOD tasks Yang et al. (2024).

3 Preliminary

3.1 Supervised Fine-Tuning for TOD

The TOD task is typically modeled as an E2E problem and addressed by a seq2seq model (e.g., T5) using supervised fine-tuning (SFT). The input of the model can be represented as $\text{I}_{t}=[\text{prefix}:u_{t-1}:bs_{t-1}:da_{t-1}:sr_{t-1}:u_{t}]$ , where $[\cdot:\cdot]$ denotes the concatenation operator, $u_{t}$ represents the current user utterance, and $u_{t-1}$ , $bs_{t-1}$ , $da_{t-1}$ , and $sr_{t-1}$ represent the user utterance, belief state (BS), dialogue act (DA), and system response (SR) at turn $t-1$ , respectively. The prefix instruction is “translate dialogue to belief state, dialogue action, and system response: [input]”. The model is fine-tuned to maximize the likelihood of successively generating the correct BS, DA, and SR given the input:

$$\mathcal{L}_{\theta}=\sum_{t=1}^{T}\log P(bs_{t},da_{t},sr_{t}\mid\text{I}_{t};\theta), \tag{1}$$

where $\theta$ represents the parameters of the model.
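The input construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing code: the `Turn` container, the function names, the whitespace separator, and the bracketed span formats in the example are all assumptions made for this sketch. Only the prefix string and the field order come from the text.

```python
from dataclasses import dataclass
from typing import Optional

# Prefix instruction quoted in the text.
PREFIX = "translate dialogue to belief state, dialogue action, and system response:"


@dataclass
class Turn:
    """One dialogue turn (hypothetical container used only for this sketch)."""
    user: str          # u_t
    belief_state: str  # bs_t
    dialogue_act: str  # da_t
    response: str      # sr_t


def build_input(prev: Optional[Turn], user_utterance: str, sep: str = " ") -> str:
    """Concatenate [prefix : u_{t-1} : bs_{t-1} : da_{t-1} : sr_{t-1} : u_t].

    The separator is an assumption; the text only specifies concatenation.
    """
    parts = [PREFIX]
    if prev is not None:  # the first turn has no previous context
        parts += [prev.user, prev.belief_state, prev.dialogue_act, prev.response]
    parts.append(user_utterance)
    return sep.join(parts)


def build_target(turn: Turn, sep: str = " ") -> str:
    """Target sequence: belief state, then dialogue act, then system response."""
    return sep.join([turn.belief_state, turn.dialogue_act, turn.response])
```

Under these assumptions, each (input, target) pair would be fed to a standard seq2seq cross-entropy loss, which corresponds to maximizing the log-likelihood in Eq. (1).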

3.2 Reinforcement Learning for TOD

Formally, the RL approaches for TOD tasks operate within a Markov Decision Process (MDP) Kaelbling et al. (1998) characterized by the tuple $\langle S,A,P,R,\gamma\rangle$ . The state space $S$ can be represented as a set of states $\mathbf{s}_{i}=\{s_{1},s_{2},\ldots,s_{k}\}$ , where each state includes the dialogue context and history up to the current time step. Each turn in the dialogue is considered an independent episode. An action $a_{\Delta t}\in A$ is the $\Delta t$ -th action taken during an episode, which corresponds to selecting the next token in the dialogue. Transition probability $P(s^{\prime}\mid s,a)$ is the probability of transitioning to state $s^{\prime}$ given action $a$ and state $s$ . The discount factor $\gamma\in[0,1]$ is used to weigh future rewards. The SFT model is used to initialize a policy network $\pi$ , which is subsequently refined to maximize the reward $R$ , using algorithms such as proximal policy optimization (PPO) Schulman et al. (2017).
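To illustrate how the discount factor $\gamma$ weighs future rewards within a single turn-level episode, the following sketch computes the standard discounted return for each token position. The function name and the reward values in the usage note are illustrative assumptions; the text does not specify how returns are computed in the authors' implementation.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_i = r_i + gamma * r_{i+1} + gamma^2 * r_{i+2} + ... for each step i.

    `rewards` holds one scalar per generated token (one action per token).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last token backwards so each step reuses the next one.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns
```

For example, with a reward that arrives only at the last of three tokens, `discounted_returns([0.0, 0.0, 1.0], gamma=0.5)` gives `[0.25, 0.5, 1.0]`: earlier tokens receive only a discounted share of the signal, which is the sparse-reward situation that denser per-token rewards aim to improve.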

4 Main Method

We aim to enhance TOD systems using a combination of SFT and RL. While SFT can provide a stable initial base for RL Ramamurthy et al. (2023); Yu et al. (2023); Li et al. (2023), it treats every ground-truth token as an equally important objective, without prioritizing task-specific goals. We therefore use RL to refine the model so that it optimizes for task completion.

In TOD tasks, accurately understanding user needs (i.e., belief states) is crucial for generating appropriate dialogue acts, which are essential for producing system responses that meet current needs and effectively drive the conversation forward. However, existing RL methods often focus solely on optimizing dialogue policy learning Li et al. (2023); Takanobu et al. (2020) or response generation Yu et al. (2023), neglecting the importance of understanding and the interdependence between understanding and generation. Moreover, these methods typically use sparse rewards at the dialogue or turn level Kwan et al. (2023); Lu et al. (2019); Tang et al. (2018); Abdulhai et al. (2023).

Task completion metrics evaluate whether the model correctly generates the informable and requestable slot values defined in the dialogue schema, reflecting its performance in understanding and generation tasks. During sequence generation, the policy model progressively covers the entries in the informable and requestable lists. Inspired by these metrics, we hypothesize that providing progressive task-oriented rewards during token generation for understanding and generation tasks can enhance TOD systems. The model architecture and our reward function are illustrated in Figure 2. In Section 4.1, we explain how these metrics are measured to support our reward function design. In Section 4.2, we show how our reward function provides continuous, step-by-step feedback, guiding the E2E model through understanding and generation tasks for a more coherent and responsive dialogue system.

4.1 Task Completion Metrics

An informable list and requestable list are commonly predefined for dialog goals in datasets, such as In-Car and MultiWOZ. The informable list contains slots and their values representing the user’s requirements. For example, a user’s preference for a restaurant is characterized by a “cheap” value for the “price range” slot. The Inform metric evaluates whether the system accurately learns user demands as defined in the informable list and then provides a suitable entity in response. The requestable list includes user-requested values, such as “postcode”. The Success metric measures whether the generated DAs or SRs contain all attributes in the requestable list. Therefore, we believe that a slot-value-specific reward derived from the informable list can enhance the system’s understanding of user needs, while the value-specific reward based on the requestable list can improve responsiveness to user requests. Accordingly, we introduce the design of a progressive reward function combining the understanding reward for DST, as well as the generation reward for DPL and RG.

4.2 Step-by-Step Goal-Oriented Reward

Understanding Reward.

We design the understanding reward for DST by measuring the growing proportion of correctly identified slot-value pairs in the informable list during token (action) generation. This reward function directly reflects how well the system understands the user’s needs, which is closely related to the goals of DST. Formally, we denote $SV_{gt}$ as the set of ground-truth slot values in the current turn and $\hat{SV}$ as the set of predicted ones during the token generation:

$$R_{u}=\frac{|SV_{gt}\cap\hat{SV}|\cdot\rho_{u}}{|SV_{gt}|}, \tag{2}$$

where $\rho_{u}=\exp\left(-\alpha\cdot\frac{|SV_{gt}\setminus\hat{SV}|}{|SV_{gt}|}\right)$ is a penalty that grows with the fraction of ground-truth slot-value pairs still missing from the prediction, with $\alpha$ being a tunable parameter that controls the sensitivity of this penalty. The function provides a dense reward that progressively reflects the accuracy of DST.
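Eq. (2) can be written directly as set operations. The sketch below is a minimal illustration: the function name, the default `alpha=1.0`, the slot naming in the usage note, and the choice to return zero when a turn has no ground-truth pairs are assumptions, not details given in the text.

```python
import math


def understanding_reward(gt_pairs, pred_pairs, alpha=1.0):
    """R_u = |SV_gt ∩ SV_hat| * rho_u / |SV_gt|, following Eq. (2).

    Both arguments are iterables of (slot, value) tuples.
    """
    gt = set(gt_pairs)
    pred = set(pred_pairs)
    if not gt:
        # Assumption: no reward signal when the turn has no ground-truth pairs.
        return 0.0
    hits = len(gt & pred)       # correctly predicted slot-value pairs
    missing = len(gt - pred)    # ground-truth pairs not yet predicted
    rho_u = math.exp(-alpha * missing / len(gt))
    return hits * rho_u / len(gt)
```

With two ground-truth pairs, say `("restaurant-pricerange", "cheap")` and `("restaurant-area", "centre")`, predicting only the first yields $0.5\cdot e^{-0.5}\approx 0.30$ for $\alpha=1$, and predicting both yields exactly 1. The reward therefore rises as the belief state is filled in during decoding.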

Generation Reward.

We observe that the accuracy of both DPL and RG depends on how many values from the requestable list are correctly included in their generations. Therefore, we set the same reward function for these two generation tasks. The reward for DPL and RG is the increasing inclusion of values in the user requestable list during each token generation, which measures the system’s ability to fulfill user requests continuously. Formally, $S_{gt}$ is the set of all ground-truth user request values in the current turn, and $\hat{S}$ denotes the predicted values during token generation:

$$R_{g}=\frac{|S_{gt}\cap\hat{S}|\cdot\rho_{g}}{|S_{gt}|}, \tag{3}$$

where the penalty term $\rho_{g}=\exp\left(-\beta\cdot\frac{|S_{gt}\setminus\hat{S}|}{|S_{gt}|}\right)$ penalizes the fraction of values in the ground-truth requestable list that are still missing from the generation, and $\beta$ is a tunable parameter that controls the sensitivity of this penalty. The function provides a dense reward that progressively reflects how well the generation fulfills the user requests.
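The step-by-step character of the generation reward can be illustrated by evaluating Eq. (3) after every generated token. This sketch makes two simplifying assumptions that are not stated in the text: requested values are detected by exact match against single tokens (here delexicalized placeholders such as `[value_postcode]`), and `beta` defaults to 1.0. The function names are hypothetical.

```python
import math


def generation_reward(gt_values, pred_values, beta=1.0):
    """R_g = |S_gt ∩ S_hat| * rho_g / |S_gt|, following Eq. (3)."""
    gt = set(gt_values)
    if not gt:
        return 0.0  # assumption: no signal when nothing was requested
    pred = set(pred_values)
    hits = len(gt & pred)
    missing = len(gt - pred)
    return hits * math.exp(-beta * missing / len(gt)) / len(gt)


def stepwise_rewards(tokens, gt_values, beta=1.0):
    """Reward after each generated token, computed on the prefix so far."""
    gt = set(gt_values)
    seen = set()
    rewards = []
    for token in tokens:
        if token in gt:  # simplification: one token carries one requested value
            seen.add(token)
        rewards.append(generation_reward(gt, seen, beta))
    return rewards
```

For the token sequence `["the", "postcode", "is", "[value_postcode]", "and", "phone", "[value_phone]"]` with both placeholders requested, the reward stays at 0 for the first three tokens, jumps to $0.5\cdot e^{-0.5}$ at the fourth, and reaches 1 at the last. The sequence of rewards is non-decreasing, so every token that satisfies a request is reinforced at the step where it is produced.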

TOD Reward.

To offer a comprehensive reward that evaluates both the understanding and generation performance, we define the TOD reward as a weighted combination of the understanding reward $R_{u}$ and the generation reward $R_{g}$ :

$$R_{tod}=\frac{|SV_{gt}\cap\hat{SV}|\cdot\rho_{u}+|S_{gt}\cap\hat{S}|\cdot\rho_{g}}{|SV_{gt}|+|S_{gt}|}. \tag{4}$$

The combined reward function encourages balanced optimization of both the understanding (DST) and the generation (DPL, RG), which enhances the global robustness of TOD systems. The use of dense rewards derived from the informable and requestable lists ensures continuous feedback during token-level generation. Unlike sparse rewards that only provide feedback at the end of dialogues, our approach offers step-by-step rewards, accelerating the learning process. The progressive nature of the rewards, based on the discrepancies $\rho_{u}$ and $\rho_{g}$ , helps make incremental improvements.
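To make the weighting in Eq. (4) explicit, substituting Eqs. (2) and (3) shows that the TOD reward is a convex combination of the two rewards whose weights are proportional to the sizes of the ground-truth sets. The symbols $w_{u}$ and $w_{g}$ are introduced here only for illustration; the derivation restates the definitions above and adds no new assumption:

```latex
\begin{aligned}
R_{tod}
  &= \frac{|SV_{gt}|\,R_{u} + |S_{gt}|\,R_{g}}{|SV_{gt}| + |S_{gt}|}
   = w_{u}\,R_{u} + w_{g}\,R_{g}, \\
w_{u} &= \frac{|SV_{gt}|}{|SV_{gt}| + |S_{gt}|}, \qquad
w_{g} = \frac{|S_{gt}|}{|SV_{gt}| + |S_{gt}|}, \qquad
w_{u} + w_{g} = 1 .
\end{aligned}
```

For instance, a turn with three ground-truth slot-value pairs and one requested value gives $w_{u}=3/4$ and $w_{g}=1/4$, so the balance between understanding and generation adapts to what each turn actually requires.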

Reward Shaping.

To prevent the policy network $\pi$ from straying too far from the initial model $\pi_{\text{o}}$ , we also add a KL constraint to balance the reward. Formally, the final RL reward function is:

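$$R_{total}=R_{tod}-\beta D_{KL}({\pi}\parallel{\pi_{\text{o}}}), \tag{5}$$

where $R_{tod}$ is the TOD reward from Eq. (4) and $\beta$ here denotes the KL coefficient, which is dynamically adapted during training and is distinct from the tunable $\beta$ in the penalty term $\rho_{g}$ .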
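The KL-shaped reward in Eq. (5) can be sketched as follows. The text only states that the KL coefficient is adapted dynamically; the proportional controller below is one commonly used scheme and is an assumption of this sketch, not a confirmed detail of the authors' setup. The class name, the per-token KL estimate via a log-probability difference, and all default values are likewise illustrative.

```python
def kl_shaped_reward(task_reward, logp_policy, logp_ref, kl_coef):
    """Task reward minus a KL penalty, as in Eq. (5).

    The KL term is approximated per token by log pi(a|s) - log pi_o(a|s)
    for the sampled token (an assumed estimator).
    """
    kl_estimate = logp_policy - logp_ref
    return task_reward - kl_coef * kl_estimate


class AdaptiveKLController:
    """Proportional controller that nudges the KL coefficient toward a target KL.

    Assumed scheme: the coefficient grows when the observed KL exceeds the
    target and shrinks when it falls below it, with a clipped error term.
    """

    def __init__(self, init_coef=0.2, target_kl=0.5, horizon=10000):
        self.coef = init_coef
        self.target_kl = target_kl
        self.horizon = horizon

    def update(self, observed_kl, n_steps):
        # Clip the relative error so a single batch cannot change the coefficient too much.
        error = max(-0.2, min(0.2, observed_kl / self.target_kl - 1.0))
        self.coef *= 1.0 + error * n_steps / self.horizon
        return self.coef
```

In use, the controller would be updated once per batch with the mean observed KL, and the resulting coefficient would scale the penalty applied to every token reward in the next batch. This keeps the policy close to the SFT initialization while the task reward pulls it toward task completion.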

Optimization.

We use natural language policy optimization (NLPO) Ramamurthy et al. (2023), an extension of PPO that incorporates action elimination through a parameterized masking approach. It learns to mask out less relevant tokens using top-p sampling, which restricts the candidate tokens to the smallest set whose cumulative probability reaches a specified threshold. NLPO maintains a separate masking policy that is updated periodically, providing an additional constraint that favors the selection of more task-relevant actions.
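The top-p action elimination described above can be illustrated on a single next-token distribution. This is a minimal sketch of the masking step only, written in plain Python with hypothetical function names; it does not reproduce NLPO's periodically updated masking policy or its PPO update.

```python
def top_p_mask(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches p.

    Returns a list of booleans aligned with `probs`.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = [False] * len(probs)
    cumulative = 0.0
    for i in order:
        keep[i] = True
        cumulative += probs[i]
        if cumulative >= p:
            break
    return keep


def apply_mask(probs, keep):
    """Zero out eliminated tokens and renormalize the remaining probabilities."""
    total = sum(pr for pr, k in zip(probs, keep) if k)
    return [pr / total if k else 0.0 for pr, k in zip(probs, keep)]
```

For the distribution `[0.5, 0.3, 0.15, 0.05]` and `p=0.9`, the first three tokens are kept and the last is masked out, so the policy can only sample from the three most probable actions at that step.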

Method	MultiWOZ2.0				MultiWOZ2.1				In-Car
Method	Inform	Succ.	BLEU	Comb.	Inform	Succ.	BLEU	Comb.	Match	SuccF1	BLEU	Comb.
E2E
SimpleTOD	84.4	70.1	15.0	92.3	85.0	70.5	15.2	93.0	-	-	-	-
DoTS	86.6	74.1	15.1	95.5	86.7	74.2	15.9	96.3	-	-	-	-
PPTOD	89.2	79.4	18.6	102.9	87.1	79.1	19.2	102.3	-	-	-	-
UBAR^†	85.1	71.0	16.2	94.3	86.2	70.3	16.5	94.7	-	-	-	-
LABES	-	-	-	-	76.9	63.3	17.9	88.0	85.8	77.0	22.8	104.2
SPACE-3^∗∗	88.7	78.7	16.3	100.0	90.9	81.0	16.8	102.7	84.7	79.6	18.6	100.7
SPACE-3	95.3	88.0	19.3	111.0	95.6	86.1	19.9	110.8	85.3	83.2	22.9	107.1
GALAXY^∗	93.1	81.0	18.4	105.5	93.5	81.7	18.3	105.9	81.9	83.3	22.0	104.6
GALAXY	94.4	85.3	20.5	110.4	95.3	86.2	20.0	110.8	85.3	83.6	23.0	107.4
RL
MinTL	84.9	74.9	17.9	97.8	-	-	-	-	-	-	-	-
GPT-Critic	90.1	76.6	17.8	101.1	-	-	-	-	-	-	-	-
FanReward	93.1	83.9	18.0	106.5	-	-	-	-	-	-	-	-
Ours_base	92.1	88.3	16.6	106.9	92.7	88.5	16.2	106.8	84.3	83.8	22.8	106.9
Ours_large	96.1	92.4	17.2	111.5	96.9	91.1	16.9	110.9	86.2	86.1	23.0	109.2

Model	Inform	Succ.	BLEU	Comb.
Ours	96.1	92.4	17.2	111.5
$-R_{u}$	91.2	87.0	16.1	105.2
$-R_{g}$	92.1	87.5	15.6	105.4
$-R_{u}-R_{g}$	86.0	81.8	17.2	101.1

Method	MultiWOZ2.0
Method	Inform	Succ.	BLEU	Comb.
Codex^†	76.7	41.5	7.7	66.8
ChatGPT^†	71.8	44.1	10.5	68.4
Claude	78.3	41.2	2.9	62.7
GPT-4o	77.0	53.1	5.2	70.3
DSP w/ ChatGPT^†	95.3	82.3	10.9	99.6
Ours w/ ChatGPT	95.1	91.2	9.8	102.9
Ours_large	96.1	92.4	17.2	111.5
