Nonautoregressive Generative Models for Reranking Recommendation
Abstract.
Contemporary recommendation systems are designed to meet users’ needs by delivering tailored lists of items that align with their specific demands or interests. In a multi-stage recommendation system, reranking plays a crucial role by modeling the intra-list correlations among items. The key challenge of reranking lies in exploring optimal sequences within the combinatorial space of permutations. Recent research proposes a generator-evaluator learning paradigm, where the generator generates multiple feasible sequences and the evaluator picks out the best sequence based on the estimated listwise score. The generator is of vital importance, and generative models are well-suited for the generator function. Current generative models employ an autoregressive strategy for sequence generation. However, deploying autoregressive models in real-time industrial systems is challenging. Firstly, the generator can only generate the target items one by one and hence suffers from slow inference. Secondly, the discrepancy between training and inference causes errors to accumulate. Lastly, the left-to-right generation overlooks information from succeeding items, leading to suboptimal performance.
To address these issues, we propose a Non-AutoRegressive generative model for reranking Recommendation (NAR4Rec) designed to enhance efficiency and effectiveness. To tackle challenges such as sparse training samples and dynamic candidates, we introduce a matching model. Considering the diverse nature of user feedback, we employ a sequence-level unlikelihood training objective to differentiate feasible sequences from infeasible ones. Additionally, to overcome the lack of dependency modeling among target items in nonautoregressive models, we introduce contrastive decoding to capture correlations among these items. Extensive offline experiments validate the superior performance of NAR4Rec over state-of-the-art reranking methods. Online A/B tests reveal that NAR4Rec significantly enhances the user experience. Furthermore, NAR4Rec has been fully deployed in Kuaishou, a popular video app with over 300 million daily active users.
1. Introduction
Recommendation systems offer users personalized item lists tailored to their interests. Various approaches have been proposed to capture user interests, focusing on feature interactions(Cheng et al., 2016; Guo et al., 2017; Lian et al., 2018), user preference modeling(Zhou et al., 2018, 2019b), and so on. However, most existing methods treat individual items separately, neglecting their mutual influence and leading to suboptimal results. Acknowledging that user interactions with one item may correlate with others in the recommendation list(Pei et al., 2019), reranking is introduced to consider contextual information and generate an optimal sequence of recommendation items.
The main challenge in reranking is exploring optimal sequences within the vast space of permutations. Reranking methods are typically categorized into one-stage and two-stage approaches. One-stage methods (Ai et al., 2018; Pei et al., 2019) take candidates as input, estimate a refined score for each item within the permutation, and rerank the items greedily based on these scores. However, one-stage methods encounter an inherent contradiction (Feng et al., 2021; Xi et al., 2021): the reranking operation alters the permutation, introducing mutual influences between items that differ from those in the initial arrangement. Consequently, the refined scores, which are conditioned on the initial permutation, are no longer plausible for the reranked list.
To tackle this challenge, two-stage methods utilize a generator-evaluator framework. Here, the generator creates multiple feasible sequences, and the evaluator selects the optimal sequence based on the estimated listwise score. Within the generator-evaluator framework, the generator plays a crucial role. Generative models (Bello et al., 2018; Feng et al., 2021; Gong et al., 2022; Zhuang et al., 2018) are preferred over heuristic methods (Feng et al., 2021; Lin et al., 2023; Xi et al., 2021; Shi et al., 2023) for the generator function due to the expansive solution space of item permutations. Generative models commonly employ an autoregressive strategy for sequence generation.
However, deploying autoregressive models in real-time industrial recommendation systems presents challenges. Firstly, autoregressive models suffer from inefficient inference: they generate target sequences item by item, so inference time grows linearly with the sequence length. Secondly, autoregressive models exhibit a training-inference discrepancy. During training, these models predict the next item based on the ground truth up to that point, whereas during inference they receive their own previously generated outputs as input. This misalignment introduces accumulated error, where inaccuracies generated in earlier timesteps propagate and accumulate over time. Consequently, the generated sequences deviate from the true distribution of the target sequence. Thirdly, autoregressive models have limited information utilization. The sequential decoding process focuses solely on preceding items, neglecting information from succeeding items. This limitation results in suboptimal performance as the model fails to fully leverage the available context.
To address these challenges, we propose a Non-AutoRegressive generative model for reranking Recommendation (NAR4Rec). Unlike autoregressive models, which generate sequences step by step and rely on their own previous outputs, NAR4Rec generates all items within the target sequence simultaneously.
However, we find it nontrivial to deploy nonautoregressive models in recommendation systems. Firstly, the sparse training samples and dynamic candidates in recommendation systems pose difficulties for learning convergence, which we address by sharing position embeddings and introducing a matching model. Secondly, the diverse nature of user feedback, including both positive and negative interactions, renders maximum likelihood training less suitable. We propose unlikelihood training to distinguish between desirable and undesirable sequences. Lastly, nonautoregressive models assume an independent selection of items at each position in a sequence, which is inadequate when modeling intra-list correlation. Hence we propose contrastive decoding to capture dependencies across items.
To summarize, our contributions are listed as follows:

• We make the first attempt to adopt nonautoregressive models for reranking, which significantly speeds up inference and meets the requirements of real-time recommendation systems.

• We propose a matching model to enhance convergence, a sequence-level unlikelihood training method to guide the generated sequence towards improved overall utility, and a contrastive decoding method to refine current decoding strategies with intra-list correlation.

• Extensive offline experiments demonstrate that NAR4Rec outperforms state-of-the-art methods. Online A/B tests further validate the effectiveness of NAR4Rec. Furthermore, NAR4Rec has been fully deployed in Kuaishou, a real-world video app with over 300 million daily active users, notably improving the user experience.
2. Related work
2.1. Reranking in Recommendation Systems
In contrast to earlier phases like matching and ranking (Burges et al., 2005; Liu et al., 2009), which typically learn a user-specific item-wise scoring function, the core of reranking in recommendation systems lies in modeling correlations within the exposed list. Reranking (Pang et al., 2020; Pei et al., 2019; Xi et al., 2021), building upon candidate items from the ranking stage, selects a subset and determines their order to ensure that the most suitable items are exposed to users. Existing research on reranking falls into two principal categories: one-stage methods (Ai et al., 2018; Pei et al., 2019; Pang et al., 2020) and two-stage methods (Feng et al., 2021; Shi et al., 2023; Lin et al., 2023; Xi et al., 2021).
One-stage methods treat reranking as a retrieval task, recommending the top $k$ items based on scores from a ranking model. These methods refine the initial list distribution using listwise information, optimizing overall recommendation quality. Subsequently, the candidates are reranked by the refined item-wise scores in a greedy manner. The methods differ in the network architecture used to capture listwise information, such as the GRU in DLCM (Ai et al., 2018) and the Transformer in PRM (Pei et al., 2019). However, user feedback for the exposed list is influenced not just by item interest but also by arrangements and surrounding context (Joachims et al., 2017; Lorigo et al., 2008, 2006; Yang, 2017). The reranking operation modifies permutations, thereby introducing influences distinct from the initial permutation. Moreover, one-stage methods, which exclusively model the initial permutation, fall short of capturing alternative permutations. Consequently, those methods struggle to maximize overall user feedback (Xi et al., 2021).
Two-stage methods (Feng et al., 2021; Shi et al., 2023; Lin et al., 2023; Xi et al., 2021) embrace a generator-evaluator framework. In this framework, the generator first generates multiple feasible sequences, and the evaluator then selects the optimal sequence based on the estimated listwise score. This framework allows for a comprehensive exploration of feasible sequences and an informed selection of the best one based on listwise considerations. The generator is particularly crucial because it determines which sequences are considered at all. Common approaches for generators can be categorized into heuristic methods (Feng et al., 2021; Lin et al., 2023; Xi et al., 2021; Shi et al., 2023), such as beam search or item swapping, and generative models (Bello et al., 2018; Feng et al., 2021; Gong et al., 2022; Zhuang et al., 2018). Generative models are more suitable than heuristic methods for reranking given the vast permutation space. These generative models typically adopt a step-greedy strategy that autoregressively decides the item displayed at each position. However, the high computational complexity of online inference limits their application in real-time recommendation systems.
To address the challenges linked with autoregressive generation models, our work investigates the viability of nonautoregressive generative models within the generator-evaluator framework. Nonautoregressive generative models generate the whole target sequence in a single pass, which reduces the computational cost of inference.
2.2. Nonautoregressive Sequence Generation
Nonautoregressive sequence generation (Gu et al., 2018) was initially introduced in machine translation to speed up the decoding process. It has since gained increased attention in natural language processing, e.g., text summarization (Gu et al., 2019; Awasthi et al., 2019) and text error correction (Leng et al., 2021b, a). Specifically, efforts have focused on tackling the absence of target information in nonautoregressive models. Strategies include enhancing the training corpus to mitigate target-side dependencies (Gu et al., 2018; Zhou et al., 2019a) and refining training approaches (Ghazvininejad et al., 2019; Stern et al., 2019) to alleviate learning difficulties.
Although nonautoregressive generation has been explored in text, those conventional techniques are not directly applicable to recommendation systems. We tackle the challenges encountered in recommendation systems to improve the convergence and performance of nonautoregressive models and make the first attempt to integrate nonautoregressive models into reranking within realtime recommender systems.
3. Preliminary
3.1. Reranking problem Formulation
For each user $u$ within the set $U$, a request encompasses a set of user profile features (such as user ID, gender, age), the recent interaction history, and $n$ candidate items denoted as $X=\{x_1,x_2,\cdots,x_n\}$, where $n$ is the number of candidates. Given candidates $X$, the goal of reranking is to propose an item sequence $Y=\{y_1,y_2,\cdots,y_m\}$ that elicits the most favorable feedback from user $u$, where $m$ is the sequence length and $Y$ is the recommended list of the reranking model. We denote the reranking model as $\mathcal{F}(u,\theta,X)$, where $\theta$ is the corresponding parameter. In real-time recommendation systems, reranking acts as the last stage to deliver the ultimate list of recommended items. Typically, $n$ significantly exceeds $m$, with $m$ being less than 10 and $n$ ranging from several tens to hundreds.
In a multi-interaction scenario, users may exhibit distinct types of interaction (e.g., clicks, likes, comments) for each item exposure. Formally, we define the set of user interactions as $B$, and $e_{u,y_i,b}$ represents user $u$’s response to item $y_i$ concerning interaction $b\in B$. Given $Y$, each item $y_i$ has a multi-interaction response $\mathbf{e}_{u,y_i}=[e_{u,y_i,b_1},\dots,e_{u,y_i,b_{|B|}}]$. For all items in $Y$, the overall user response is:
(1)  $$\mathbf{E}_{u,Y}=\left[\begin{array}{ccc}e_{u,y_1,b_1}& \dots & e_{u,y_m,b_1}\\ \vdots & \ddots & \vdots \\ e_{u,y_1,b_{|B|}}& \dots & e_{u,y_m,b_{|B|}}\end{array}\right].$$ 
The overall utility is quantified as the summation of individual item utilities, denoted as $\mathcal{R}(u,Y)={\sum}_{i=1}^{m}\mathcal{R}(u,{y}_{i})$. The utility associated with each item may correspond to a specific interaction type $b$, such as clicks, watch time, or likes. In such cases, the item utility is expressed as $\mathcal{R}(u,{y}_{i})={e}_{u,{y}_{i},b}$. Alternatively, the item utility can be represented as the weighted sum of diverse interactions $\mathcal{R}(u,{y}_{i})={\sum}_{b}{w}_{b}{e}_{u,{y}_{i},b}$, where ${w}_{b}$ denotes the weight for each interaction.
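The weighted-sum form of the utility can be illustrated with a minimal Python sketch. The interaction names, feedback values, and weights below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def item_utility(responses, weights):
    """Weighted sum of one item's interaction responses: sum_b w_b * e_{u,y_i,b}."""
    return sum(weights[b] * e for b, e in responses.items())


def overall_utility(response_list, weights):
    """Overall utility R(u, Y): sum of item utilities over the exposed list."""
    return sum(item_utility(r, weights) for r in response_list)


# Hypothetical feedback for a list of three exposed items (one dict per item).
feedback = [
    {"click": 1, "like": 0, "comment": 0},
    {"click": 1, "like": 1, "comment": 0},
    {"click": 0, "like": 0, "comment": 0},
]
# Hypothetical interaction weights w_b.
weights = {"click": 1.0, "like": 2.0, "comment": 3.0}

total = overall_utility(feedback, weights)
print(total)  # 4.0
```

Setting a single weight to 1 and the others to 0 recovers the single-interaction case $\mathcal{R}(u,y_i)=e_{u,y_i,b}$.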
The reranking objective is to maximize the overall utility $\mathcal{R}(u,Y)$ for a given user $u$:
(2)  $$\max_{\theta}\mathcal{R}(u,Y).$$ 
Reranking introduces a permutation space with exponential size, represented as $\mathcal{O}({A}_{n}^{m})$, where $n$ represents the number of candidates and $m$ represents the number of items to be selected and ordered. Each permutation represents a potential arrangement of items, and users provide unique feedback for each permutation. However, in practical scenarios, users typically encounter only one permutation. Thus, the main challenge in reranking lies in efficiently and effectively determining the optimal permutation given the vast solution space yet extremely sparse real user feedback as training samples.
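To give a sense of scale, the number of ordered selections $A_n^m = n!/(n-m)!$ can be computed directly. The values of $n$ and $m$ below are illustrative choices within the ranges mentioned above, not figures reported by the paper.

```python
import math


def num_permutations(n, m):
    """Number of ordered selections of m items out of n candidates, A_n^m = n! / (n - m)!."""
    return math.perm(n, m)


small = num_permutations(10, 3)   # 10 * 9 * 8
large = num_permutations(100, 6)  # 100 * 99 * 98 * 97 * 96 * 95
print(small)  # 720
print(large)
```

Even for a short list, the count grows far beyond what could be covered by logged user feedback, since each request exposes only one permutation.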
3.2. Autoregressive sequence generation
Given a set of candidate items denoted as $X$, autoregressive models decompose the distribution over potential generated sequences $Y$ into a series of conditional probabilities:
(3)  $$p_{\text{AR}}(Y|X;\theta)=\prod_{i=1}^{m+1}p(y_i|y_{0:i-1},x_{1:n};\theta),$$ 
where the special tokens $y_0$ (e.g., <bos>) and $y_{m+1}$ (e.g., <eos>) denote the beginning and end of target sequences. Importantly, the length of the generated sequence is predetermined and fixed, unlike variable lengths in text.
Factorizing the sequence generation output distribution autoregressively leads to maximum likelihood training with a cross-entropy loss at each timestep:
(4)  $$\mathcal{L}_{\text{AR}}=-\log p_{\text{AR}}(Y|X;\theta)=-\sum_{i=1}^{m+1}\log p(y_i|y_{0:i-1},x_{1:n};\theta).$$ 
The training objective aims to optimize individual conditional probabilities. In training, when the target sequence is known, these probabilities are calculated based on earlier target items rather than model-generated ones, enabling efficient parallelization. In inference, autoregressive models generate the target sequence item by item, capturing the distribution of the target sequence. This makes them well-suited for the reranking task, particularly considering the vast space of possible permutations.
While autoregressive models have proven effective, deploying them in industrial recommendation systems is challenging. Firstly, their sequential decoding process leads to slow inference, introducing latency that hinders real-time application. Secondly, these models, trained to predict based on ground truth, face a discrepancy during inference when they receive their own generated outputs as input. This misalignment may lead to compounded errors, as inaccuracies generated in earlier timesteps accumulate over time, resulting in inconsistent or divergent sequences that deviate from the true distribution of the target sequence. Thirdly, autoregressive models rely on a left-to-right causal attention mechanism, limiting the expressive power of hidden representations, as each item encodes information solely from preceding items (Sun et al., 2019). This constraint impedes optimal representation learning, resulting in suboptimal performance.
3.3. Nonautoregressive sequence generation
To address the aforementioned challenges, nonautoregressive sequence generation eliminates autoregressive dependencies from existing models. Each element’s distribution $p({y}_{i})$ depends solely on the candidates $X$:
(5)  $$p_{\text{NAR}}(Y|X;\theta)=\prod_{i=1}^{m}p(y_i|x_{1:n};\theta).$$ 
Then the loss function for the nonautoregressive model is:
(6)  $$\mathcal{L}_{\text{NAR}}=-\log p_{\text{NAR}}(Y|X;\theta)=-\sum_{i=1}^{m}\log p(y_i|x_{1:n};\theta).$$ 
Despite the removal of the autoregressive structure, the models retain an explicit likelihood function. Training employs a separate cross-entropy loss for each output distribution. Crucially, these distributions can be computed simultaneously during inference, which significantly differs from the sequential process of autoregressive models. This nonautoregressive approach reduces inference latency, thereby enhancing the efficiency of recommendation systems in real-world applications.
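The factorization over positions can be made concrete with a small sketch. It assumes the per-position distributions are already available as a table `probs[i][c]` (a hypothetical layout, positions by candidates), and decodes every position independently with an argmax, which is one simple decoding choice rather than the procedure used by NAR4Rec.

```python
import math


def nar_log_likelihood(probs, target):
    """log p_NAR(Y | X) = sum_i log p(y_i | X); probs[i][c] is p(candidate c at position i)."""
    return sum(math.log(probs[i][c]) for i, c in enumerate(target))


def nar_decode(probs):
    """Pick the most likely candidate at every position independently (no sequential loop dependency)."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]


# Hypothetical per-position distributions over n = 3 candidates for m = 2 positions.
probs = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
decoded = nar_decode(probs)
print(decoded)  # [0, 0]
ll = nar_log_likelihood(probs, [0, 1])
```

The repeated index in the decoded output shows how fully independent per-position choices can ignore one another, which is the dependency gap that later sections set out to close.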
4. Approach
In this section, we present NAR4Rec in detail. We first discuss our nonautoregressive model structure, which estimates the probability with a matching model, in section 4.1. Then, we delve into unlikelihood training, a method aimed at discerning feedback within the recommended sequence, in section 4.2. Next, we propose contrastive decoding to model the dependency in the target sequence in section 4.3. Finally, the sequence evaluator in our generator-evaluator framework is introduced in section 4.4.
4.1. Matching model
Nonautoregressive models encounter challenges in training convergence for two main reasons. Firstly, the sparse nature of training sequences presents learning difficulties. Unlike text sequences that often share linguistic structures, recommended lists in training samples seldom have the same exposures, posing challenges for effective learning from limited data. Secondly, during the reranking stage, the same candidate index may denote different items in different samples, leading to a variable vocabulary as the candidates to be ranked vary across samples.
Conventional models may struggle to handle such variations efficiently. To tackle these challenges, we introduce two key components to our model: a candidates encoder for effectively encoding representations of candidates and a position encoder to capture position-specific information within the target sequence. Initially, we randomly initialize an embedding for each position in the target sequences. Notably, we share these position embeddings across training data to enhance learning on sparse data. Subsequently, we integrate bidirectional self-attention and cross-attention modules to acquire representations for each position, leveraging information from the candidates.
Additionally, to address the challenge posed by the dynamic vocabulary arising from variations in candidates across samples, we employ a matching mechanism. Specifically, we match each candidate with each position in the target sequence, thereby yielding probabilities for each candidate at every position. In the following, we elaborate on the structure of NAR4Rec.
Given a user $u$ and candidates $X=\{x_1,x_2,\cdots,x_n\}$, the hidden representation of $x_i$ is $\mathbf{x}_i\in\mathbb{R}^{d_x}$. We stack the $\mathbf{x}_i$ into a matrix $\mathbf{X}\in\mathbb{R}^{n\times d_x}$. Additionally, we randomly initialize an embedding vector for each position $j$ as $\mathbf{t}_j\in\mathbb{R}^{d_t}$ and stack the $\mathbf{t}_j$ into $\mathbf{T}\in\mathbb{R}^{m\times d_t}$. To align the embedding dimension of $\mathbf{X}$ with that of $\mathbf{T}$, we project them into the same hidden dimension $d$ by a linear projection layer. Consequently, $\mathbf{X}\in\mathbb{R}^{n\times d}$ and $\mathbf{T}\in\mathbb{R}^{m\times d}$.
Candidates Encoder. The candidates encoder adopts the standard Transformer architecture (Vaswani et al., 2017) by stacking $L$ Transformer layers. Each layer mainly consists of two blocks, a self-attention block and a feed-forward network. An input $\mathbf{X}$ for the self-attention block is linearly transformed into query ($\mathbf{Q}_x$), key ($\mathbf{K}_x$) and value ($\mathbf{V}_x$) as follows:
(7)  $$\mathbf{Q}_x=\mathbf{X}\mathbf{W}_x^{Q},\quad \mathbf{K}_x=\mathbf{X}\mathbf{W}_x^{K},\quad \mathbf{V}_x=\mathbf{X}\mathbf{W}_x^{V},$$ 
where $\mathbf{W}_x^{Q}$, $\mathbf{W}_x^{K}$ and $\mathbf{W}_x^{V}$ denote the weight matrices. Then, the self-attention operation is applied as:
(8)  $$\text{Attention}(\mathbf{Q}_x,\mathbf{K}_x,\mathbf{V}_x)=\mathrm{Softmax}\left(\frac{\mathbf{Q}_x\mathbf{K}_x^{T}}{\sqrt{d}}\right)\mathbf{V}_x.$$ 
For the multi-head version of the self-attention mechanism, the input is linearly projected into $\mathbf{Q}_x$, $\mathbf{K}_x$ and $\mathbf{V}_x$ $h$ times, using individual linear projections to smaller dimensions (e.g., $d_k=\frac{d}{h}$). Finally, the output of self-attention (SAN) is
(9)  $$\text{SAN}=[\text{head}_1,\cdots,\text{head}_h]\mathbf{W}_x^{O},\quad \text{head}_i=\text{Attention}(\mathbf{Q}_i,\mathbf{K}_i,\mathbf{V}_i).$$ 
The feed-forward network is typically placed after the self-attention block,
(10)  $$\text{FFN}(\mathbf{X})=\sigma(\mathbf{X}\mathbf{W}_x^{in})\mathbf{W}_x^{out},$$ 
where $\mathbf{W}_x^{in}$ and $\mathbf{W}_x^{out}$ denote the weight matrices of the two linear projections and $\sigma$ denotes a nonlinear activation function.
Position Encoder. The position encoder adopts a Transformer architecture similar to that of the candidates encoder. The key difference is that the position encoder inserts a cross-attention block between the self-attention block and the feed-forward network in each Transformer layer. As can be seen in Fig. 2, in each layer, the cross-attention block receives the hidden representations from the self-attention blocks of both encoders and processes them via a cross-attention operation. Specifically, the hidden representations from the candidates encoder and position encoder are denoted as $\mathbf{X}$ and $\mathbf{T}$, respectively. Similar to the self-attention block, we initially apply linear projections to them:
(11)  $$\mathbf{Q}=\mathbf{T}\mathbf{W}^{Q},\quad \mathbf{K}=\mathbf{X}\mathbf{W}^{K},\quad \mathbf{V}=\mathbf{X}\mathbf{W}^{V}.$$ 
Then, we apply the formula in eq. 9 to $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ to get the output hidden representation. The cross-attention is introduced to capture the correlation between candidates and the target sequence.
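The cross-attention step, in which positions act as queries and candidates supply keys and values, can be sketched in plain Python as follows. This is a single-head illustration under simplifying assumptions: the helper names are hypothetical, the projection weights are set to the identity, and multi-head splitting and the output projection are omitted.

```python
import math


def matmul(A, B):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(T, X, Wq, Wk, Wv):
    """Single-head cross-attention: positions T query the candidates X."""
    Q, K, V = matmul(T, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Q[0])
    K_t = [list(col) for col in zip(*K)]  # K transposed, shape d x n
    scores = matmul(Q, K_t)               # shape m x n
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, V)             # shape m x d


identity = [[1.0, 0.0], [0.0, 1.0]]       # hypothetical projection weights
X = [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]]  # n = 3 identical candidate vectors
T = [[0.5, -0.5], [0.1, 0.3]]             # m = 2 position embeddings
out = cross_attention(T, X, identity, identity, identity)
```

Each output row is a convex combination of candidate value vectors, so with identical candidates every position simply receives that shared vector; the output has one row per position, not per candidate.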
Probability Matrix. To compute the probability matrix, we perform a matrix multiplication on the output hidden representations from the candidates encoder (denoted as $\{\mathbf{x}_1,\mathbf{x}_2,\dots,\mathbf{x}_n\}$) and the position encoder (denoted as $\{\mathbf{t}_1,\mathbf{t}_2,\dots,\mathbf{t}_m\}$). Subsequently, we apply a column-wise softmax function to normalize the scores. Formally, the probability score of placing the $i$th candidate item at the $j$th position is calculated as:
(12)  $$\widehat{p}_{ij}=\frac{\exp(\mathbf{x}_i^{\top}\mathbf{t}_j)}{\sum_{i'=1}^{n}\exp(\mathbf{x}_{i'}^{\top}\mathbf{t}_j)}.$$ 
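The matching step can be sketched as follows: score every candidate against every position with a dot product, then normalize over candidates separately for each position. The vectors are hypothetical two-dimensional stand-ins for the encoder outputs, and the max-subtraction is a standard numerical-stability measure added here, not something stated in the text.

```python
import math


def probability_matrix(X, T):
    """p_hat[i][j]: softmax over candidates i of the score x_i . t_j, for each position j."""
    n, m = len(X), len(T)
    scores = [[sum(a * b for a, b in zip(X[i], T[j])) for j in range(m)] for i in range(n)]
    p_hat = [[0.0] * m for _ in range(n)]
    for j in range(m):
        col_max = max(scores[i][j] for i in range(n))  # subtract max for numerical stability
        exps = [math.exp(scores[i][j] - col_max) for i in range(n)]
        z = sum(exps)
        for i in range(n):
            p_hat[i][j] = exps[i] / z
    return p_hat


# Hypothetical representations: n = 3 candidates, m = 2 positions.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
T = [[2.0, 0.0], [0.0, 2.0]]
p_hat = probability_matrix(X, T)
```

Because the normalization runs over whichever candidates are present in the request, the same code handles candidate sets that change from sample to sample, which is the point of matching instead of predicting over a fixed vocabulary.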
Training objective. NAR4Rec is trained via a cross-entropy loss function, defined as follows:
(13)  $$\mathcal{L}(Y,X)=-\sum_{i=1}^{n}\sum_{j=1}^{m}p_{ij}\log(\widehat{p}_{ij}),$$ 
where $p_{ij}$ is 1 if candidate $x_i$ is placed at position $j$ of $Y$ and 0 otherwise.
4.2. Unlikelihood Training
The discrepancy between text and item sequences hinders the direct application of generative models from text to item recommendation. This disparity arises from the unique characteristics of user interactions in recommendation scenarios. Unlike the structured nature of natural language text, the feedback within recommendation sequences is diverse due to the varied nature of user interactions. While text sequences typically follow conventional language structures to convey information or construct a coherent narrative, user feedback in recommendation sequences is characterized by diverse actions such as clicks or likes, reflecting a diverse and nuanced feedback.
Consequently, the difference in objectives between maximum likelihood training (as in eq. 5) and reranking (as in eq. 2) poses a significant challenge. Although maximum likelihood training effectively captures patterns in text sequences, its applicability diminishes in recommendation scenarios where user preferences are dynamic and subjective. The essence of high-quality recommendations lies not just in sequence patterns from training data but, more crucially, in the user utility of the recommended list. User interactions with recommended items are subjective and context-driven, adding complexity to aligning the training objective with the desired model behavior. To address this challenge, we propose unlikelihood training, guiding the model to assign lower probabilities to undesired generations. This adjustment aligns the training process with the intricate feedback patterns.
Unlikelihood training reduces the model’s probability of generating a negative sequence. Given candidates $X$ and a negative sequence ${Y}_{neg}$, the unlikelihood loss is:
(14)  $$\mathcal{L}_{\text{ul}}({Y}_{neg},X)=-\sum_{i=1}^{n}\sum_{j=1}^{m}p_{ij}\log(1-\widehat{p}_{ij}).$$ 
The loss decreases as ${\widehat{p}}_{ij}$ decreases.
In text generation, where messages are explicit and content-focused, controlling attributes such as topic, style, and sentiment in the output text is relatively straightforward (Li et al., 2020; Welleck et al., 2019). Recommendation sequences, in contrast, involve user feedback with implicit signals. For instance, a lack of interaction with a recommended item may suggest disinterest. This highlights the model’s need to understand both explicit and implicit cues in user feedback. Effective control over generation in recommendation sequences becomes crucial to tailor the output to user preferences and behaviors, thus improving personalized recommendations. Specifically, we classify an item sequence as positive or negative based on the overall utility defined in section 3.1, and the corresponding loss is as follows:
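(15)  $$\mathcal{L}_{\text{ul}}(Y,X)=\begin{cases}-\sum_{i=1}^{n}\sum_{j=1}^{m}p_{ij}\log(\widehat{p}_{ij}), & \mathcal{R}(u,Y)\ge \alpha ,\\ -\sum_{i=1}^{n}\sum_{j=1}^{m}p_{ij}\log(1-\widehat{p}_{ij}), & \mathcal{R}(u,Y)<\alpha ,\end{cases}$$ 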
where $\alpha $ is the threshold for positive and negative sequences.
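A minimal sketch of the thresholded objective is given below. It assumes that sequences whose utility reaches the threshold receive the likelihood loss and all others receive the unlikelihood loss (the treatment of equality is an assumption), and it exploits the fact that $p_{ij}$ is one-hot, so the double sum reduces to one term per position. The probabilities, utilities, and function name are hypothetical.

```python
import math


def sequence_loss(p_hat, target, utility, threshold):
    """Likelihood loss if utility >= threshold (assumed convention), unlikelihood loss otherwise.

    p_hat[i][j] is the predicted probability of candidate i at position j;
    target[j] is the index of the candidate actually shown at position j.
    """
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for j, i in enumerate(target):
        p = p_hat[i][j]
        if utility >= threshold:
            loss -= math.log(p + eps)        # raise the probability of the shown item
        else:
            loss -= math.log(1.0 - p + eps)  # lower the probability of the shown item
    return loss


# Hypothetical probability matrix for n = 3 candidates and m = 2 positions.
p_hat = [[0.8, 0.1], [0.1, 0.7], [0.1, 0.2]]
target = [0, 1]
pos = sequence_loss(p_hat, target, utility=3.0, threshold=1.0)
neg = sequence_loss(p_hat, target, utility=0.0, threshold=1.0)
```

For the same confident prediction, the negative branch yields a larger loss than the positive branch, so gradient descent on it moves probability mass away from the low-utility sequence.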
In summary, beyond the primary goal of learning positive sequence patterns through sequence likelihood, unlikelihood training introduces an additional objective to reduce the likelihood of generating sequences with low utilities, effectively training recommendation models to discern feedback within recommendation sequences.
4.3. Contrastive Decoding
Compared with autoregressive generation, the nonautoregressive approach significantly enhances computational efficiency and makes it feasible to deploy in real-time recommendation systems. However, nonautoregressive generation introduces a conditional independence assumption: each target item’s distribution $p(y_i)$ depends only on the candidates $X$. This deviation from autoregressive models poses challenges in capturing the inherently multimodal nature of the distribution of valid target sequences. Take machine translation as an example: the phrase “thank you” has multiple valid German translations, such as “Vielen Dank” and “Danke”. However, nonautoregressive models may generate implausible translations like “Danke Dank”. The conditional independence assumption in eq. 5 restricts the model’s ability to effectively grasp the multimodal distribution in target sequences.
Essentially, the assumption of conditional independence limits the model’s ability to navigate a vast solution space and identify the most suitable permutation from numerous valid options for a given set of candidates. This limitation is especially evident in recommendation where the number of reasonable target sequences far exceeds those encountered in text. Consequently, nonautoregressive frameworks grapple with the challenge of mitigating the impact of conditional independence to improve their capacity for generating diverse and contextually appropriate target sequences. To tackle this, we propose contrastive decoding to model the cooccurrence relationship between items and thereby improve the target dependency.
Contrastive decoding incorporates a diversity prior that regulates the sequence decoding procedure. This is grounded in the intuition that an effective recommended list needs to be composed of a wide variety of items. In fact, contrastive decoding leverages a similarity score function as a regulator when decoding, capturing the interdependence between various positions in the target sequence.
Formally, given the preceding context $y_{<i}$ (the items already selected for the positions before $i$), at time step $i$, the selection of the output $y_i$ follows
(16)  $$y_i=\underset{x\in X}{\operatorname{argmax}}\left\{(1-\alpha )\times p(x|p_i,X)-\alpha \times \max_{j}\,s(\mathbf{x},\mathbf{x}_j)\right\},$$ 
where $0\le j\le i-1$. In eq. 16, the first term, termed model confidence, denotes the probability of candidate $x$ predicted by the model. The second term, known as the similarity penalty, quantifies how similar candidate $x$ is to the previously selected items, where $s(\mathbf{x},\mathbf{x}_j)$ is the cosine similarity, computed for any two vectors as:
(17)  $$s(\mathbf{t}_i,\mathbf{t}_j)=\frac{\mathbf{t}_i^{\top}\mathbf{t}_j}{\Vert \mathbf{t}_i\Vert \cdot \Vert \mathbf{t}_j\Vert}.$$ 
Specifically, the similarity penalty is defined as the maximum similarity between the representation of $x$ and the representations of all previously selected items. NAR4Rec uses the dot product of item embeddings and position embeddings to compute the probability matrix, so items with similar embeddings tend to receive similar probabilities at a given position. We add this penalty to bring intra-list correlation into decoding.
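The decoding rule can be sketched as a greedy loop over positions that trades model confidence against similarity to already chosen items. The sketch makes two assumptions not stated in the text: an item is shown at most once, and the penalty is zero when nothing has been selected yet. All vectors, probabilities, and the value of `alpha` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def contrastive_decode(p_hat, item_vecs, alpha):
    """Fill positions in order, scoring (1 - alpha) * confidence - alpha * max similarity."""
    n, m = len(p_hat), len(p_hat[0])
    chosen = []
    for j in range(m):
        best, best_score = None, -math.inf
        for i in range(n):
            if i in chosen:  # assumption: an item is shown at most once
                continue
            # Assumption: no penalty for the first position (nothing selected yet).
            penalty = max((cosine(item_vecs[i], item_vecs[k]) for k in chosen), default=0.0)
            score = (1.0 - alpha) * p_hat[i][j] - alpha * penalty
            if score > best_score:
                best, best_score = i, score
        chosen.append(best)
    return chosen


# Hypothetical inputs: items 0 and 1 are near-duplicates, item 2 is different.
item_vecs = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
p_hat = [[0.5, 0.3], [0.4, 0.5], [0.1, 0.2]]  # p_hat[i][j]: candidate i at position j
print(contrastive_decode(p_hat, item_vecs, alpha=0.0))  # [0, 1]
print(contrastive_decode(p_hat, item_vecs, alpha=0.5))  # [0, 2]
```

With `alpha = 0` the rule reduces to picking by model confidence alone; raising `alpha` makes the second position prefer the dissimilar item even though its predicted probability is lower.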
Also, to encourage the language model to learn discriminative and isotropic item representations, we incorporate a contrastive objective into the training of the language model. Specifically, given a sequence $X$, the ${\mathcal{L}}_{\text{position}}$ and ${\mathcal{L}}_{\text{item}}$ are defined as:
(18)  $${\mathcal{L}}_{\text{position}}=\frac{1}{n\times (n1)}\sum _{i=1}^{n}\sum _{j=1,j\ne i}^{n}\mathrm{max}\{0,\rho s({\mathbf{x}}_{i},{\mathbf{x}}_{i})+s({\mathbf{x}}_{i},{\mathbf{x}}_{j})\},$$ 
where $\rho \in [-1,1]$ is a predefined margin and ${\mathbf{x}}_{i}$ is the hidden representation of item ${x}_{i}$ from the candidates encoder.
(19)  $${\mathcal{L}}_{\text{item}}=\frac{1}{m\times (m-1)}\sum _{i=1}^{m}\sum _{j=1,j\ne i}^{m}\mathrm{max}\{0,\rho -s({\mathbf{t}}_{i},{\mathbf{t}}_{i})+s({\mathbf{t}}_{i},{\mathbf{t}}_{j})\},$$ 
where ${\mathbf{t}}_{i}$ is the hidden representation of position ${t}_{i}$ from the position encoder. Intuitively, by training with ${\mathcal{L}}_{\text{position}}$ and ${\mathcal{L}}_{\text{item}}$, the model learns to push apart the representations of distinct tokens. (By definition, the cosine similarity of a representation with itself, e.g. $s({\mathbf{x}}_{i},{\mathbf{x}}_{i})$, is $1.0$.) Therefore, a discriminative and isotropic representation space can be obtained.
The overall training objective $\mathcal{L}$ is then defined as
(20)  $$\mathcal{L}(Y,X)={\mathcal{L}}_{\text{ul}}+{\mathcal{L}}_{\text{position}}+{\mathcal{L}}_{\text{item}},$$ 
where the unlikelihood training objective ${\mathcal{L}}_{\text{ul}}$ is described in eq. 15. Note that when the margin $\rho $ in ${\mathcal{L}}_{\text{position}}$ and ${\mathcal{L}}_{\text{item}}$ equals $0$, every hinge term becomes $\mathrm{max}\{0,-1+s(\cdot ,\cdot )\}=0$ because the cosine similarity never exceeds $1$, so $\mathcal{L}(Y,X)$ degenerates to the vanilla unlikelihood training objective ${\mathcal{L}}_{\text{ul}}$.
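The margin-based terms in eq. 18 and eq. 19 share the same form, so a single helper can illustrate both. The sketch below is a plain-Python illustration under assumptions (the function names and toy vectors are hypothetical, and a real implementation would operate on batched tensors); it also checks numerically that the loss vanishes when $\rho =0$.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors, as in eq. 17."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrastive_loss(reps, rho):
    """Average hinge term over all ordered pairs (i, j) with i != j.

    reps -- list of hidden representations (items for eq. 18,
            positions for eq. 19)
    rho  -- margin in [-1, 1]
    """
    n = len(reps)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            self_sim = cosine(reps[i], reps[i])   # equals 1 for non-zero vectors
            cross_sim = cosine(reps[i], reps[j])
            total += max(0.0, rho - self_sim + cross_sim)
    return total / (n * (n - 1))


identical = [[1.0, 0.0], [1.0, 0.0]]
orthogonal = [[1.0, 0.0], [0.0, 1.0]]
print(contrastive_loss(identical, rho=0.5))   # 0.5: duplicates are penalized
print(contrastive_loss(orthogonal, rho=0.5))  # 0.0: already well separated
print(contrastive_loss(identical, rho=0.0))   # 0.0: a zero margin disables the loss
```

The loss is positive only for pairs whose cosine similarity exceeds $1-\rho $, so a larger margin forces representations further apart.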
4.4. Sequence Evaluator
The sequence evaluator model is designed to estimate the overall utility of a given sequence, as illustrated in fig. 2. The sequence produced by the generator is first encoded with a self-attention layer and a feed-forward layer to capture contextual information. The hidden representation then passes through a linear projection layer to predict the score for a specific target. The overall utility is calculated as the weighted sum of item-wise scores. Finally, the sequence with the highest overall utility is delivered to the user.
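The final selection step can be illustrated with a small sketch. It covers only the weighted sum and the arg-max over candidate sequences, not the self-attention encoder. The data layout, the function names, and the assumption that the weights are attached to prediction targets are all hypothetical choices made for the example.

```python
def sequence_utility(item_scores, target_weights):
    """Weighted sum of item-wise scores for one candidate sequence.

    item_scores[i][t] -- predicted score of the i-th item for target t
    target_weights[t] -- weight of target t (assumed layout)
    """
    return sum(
        weight * score
        for scores in item_scores
        for weight, score in zip(target_weights, scores)
    )


def pick_best_sequence(candidates, target_weights):
    """Return the name of the candidate sequence with the highest utility."""
    return max(
        candidates,
        key=lambda name: sequence_utility(candidates[name], target_weights),
    )


# Two candidate sequences of two items each, scored on two targets.
candidates = {
    "sequence_a": [[0.2, 0.1], [0.4, 0.3]],
    "sequence_b": [[0.5, 0.0], [0.5, 0.1]],
}
weights = [1.0, 2.0]
print(pick_best_sequence(candidates, weights))  # sequence_a
```

Here sequence_a scores about 1.4 and sequence_b about 1.2, so sequence_a is chosen even though sequence_b has the higher scores on the first target.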
5. Experiments
In this section, we conduct extensive offline experiments and online A/B tests to demonstrate the effectiveness of NAR4Rec. We first describe our experimental setup and baselines in section 5.1. For the offline experiments in section 5.2, we compare NAR4Rec with existing baselines in terms of performance as well as training and inference time. We then vary the hyperparameters to analyze the hyperparameter sensitivity of NAR4Rec. To further show the effectiveness of NAR4Rec in a real-time recommendation system, we conduct online A/B tests and ablate our proposed methods in section 5.3.
5.1. Experiment details
Table 1. Dataset statistics.
Dataset | #Requests | #Users | #Ads
Avito | 53,562,269 | 1,324,103 | 23,562,269
Meituan | 230,525,531 | 3,201,922 | 98,525,531
Dataset: To evaluate reranking recommendation, each sample of the dataset should be a sequence that was actually exposed to users rather than a manually constructed sequence. As a public dataset, we choose the Avito dataset. As an industrial dataset, we use real-world data collected from the Kuaishou short-video platform. Details are given in table 1.

• Avito (https://www.kaggle.com/c/avito-context-ad-clicks/data): The Avito dataset is a publicly available collection of user search logs from avito.ru. The dataset comprises over 53 million lists with 1.3 million users and 36 million ads. Each sample corresponds to a search page with multiple ads. The user search logs from the first 21 days are used as the training set and the search logs from the last 7 days are used as the test set. The sequence length in Avito is 5.

• Kuaishou: The Kuaishou dataset is derived from Kuaishou, a widely used short-video application with over 300 million daily active users. Each sample in the dataset represents an actual user request log, which contains user information (e.g., user id, age, gender), candidate items, and user interactions with the exposed items. The dataset consists of a total of 82,230,788 users, 26,835,337 items, and 1,811,625,438 requests. Each request contains 6 items in the exposed item sequence and 60 candidates from ranking.
5.2. Offline experiments
Baselines. We compare the proposed NAR4Rec with 6 state-of-the-art reranking methods. We select DNN and DCN as pointwise baselines, PRM as a one-stage listwise baseline, and EdgeRerank, PIER, and Seq2Slate as two-stage baselines. Crucially, Seq2Slate is an autoregressive generative model. A brief overview of these baseline methods is as follows:

• DNN (Covington et al., 2016): DNN is a basic method for click-through rate prediction, which applies a multilayer perceptron to learn feature interactions.

• DCN (Wang et al., 2017): DCN incorporates feature crossing at each layer, eliminating the need for manual feature engineering while keeping the added complexity over the DNN model minimal.

• PRM (Pei et al., 2019): PRM models the mutual correlation among items by leveraging the self-attention mechanism and then ranks the items by the estimated scores to generate the item sequence.

• EdgeRerank (Gong et al., 2022): EdgeRerank generates the context-aware sequence with adaptive beam search on estimated scores.

• Seq2Slate (Bello et al., 2018): Seq2Slate leverages pointer networks, which are seq2seq models with an attention mechanism, to predict the next item given the items already selected.

• PIER (Shi et al., 2023): PIER applies a hashing algorithm to select top-k candidates from the full permutation based on user interests. The generator and evaluator are then jointly trained to generate better permutations.
Metrics. Because there are no common sequence generation metrics for recommendation, we follow previous work (Shi et al., 2023; Lin et al., 2023). For the Avito dataset, where $n$ and $m$ are both 5, the task is to predict the item-wise click-through rate given a listwise input, and we evaluate the models using three commonly adopted metrics: AUC, LogLoss, and NDCG. For the Kuaishou dataset, where $n$ and $m$ are $60$ and $6$ respectively, the task is to predict whether an item is chosen as one of the $6$ exposed items, and we employ Recall@6, Recall@10, and LogLoss as evaluation metrics.
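For concreteness, two of these metrics can be sketched as follows. The code uses the standard textbook definitions of Recall@k and NDCG with linear gain; whether the experiments use exactly these variants is an assumption, and the function names and toy inputs are illustrative only.

```python
import math


def recall_at_k(scores, exposed, k):
    """Fraction of exposed items found among the k highest-scored candidates.

    scores  -- predicted score per candidate (index = candidate id)
    exposed -- ids of the candidates that were actually exposed
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = set(ranked[:k]) & set(exposed)
    return len(hits) / len(exposed)


def ndcg(scores, labels):
    """NDCG of the ranking induced by scores, given relevance labels."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(labels[i] / math.log2(rank + 2) for rank, i in enumerate(ranked))
    ideal = sum(
        label / math.log2(rank + 2)
        for rank, label in enumerate(sorted(labels, reverse=True))
    )
    return dcg / ideal if ideal > 0 else 0.0


# Candidate 0 is ranked first, candidate 2 second; only candidate 0 was exposed
# among the top 2, so one of the two exposed items is recalled.
print(recall_at_k([0.9, 0.1, 0.8, 0.3], exposed=[0, 3], k=2))  # 0.5
# The single clicked item is ranked last of three.
print(ndcg([0.9, 0.1, 0.8], labels=[0, 1, 0]))  # 0.5
```

In the Kuaishou setting described above, `scores` would have 60 entries and `exposed` would contain 6 ids, with $k$ set to 6 or 10.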
Hyperparameters. For the Avito dataset, the learning rate is $10^{-3}$, the optimizer is Adam, and the batch size is 1024. For the Kuaishou dataset, the learning rate and optimizer are the same as for Avito, but the batch size is 256.
5.2.1. Performance comparison
Here we report the results of our proposed method NAR4Rec. As can be seen in table 2 and table 3, NAR4Rec outperforms the baselines, including recent strong reranking methods (Pei et al., 2019; Bello et al., 2018; Shi et al., 2023). PRM outperforms DNN and DCN by effectively capturing the mutual influence between items. Edgererank also surpasses DNN and DCN by using an adaptive beam search that takes previous item information into account. On Avito, PIER is the strongest baseline owing to its interaction per category. Notably, NAR4Rec improves AUC by 0.0125 over PIER (0.7234 vs. 0.7109). The results of our offline experiments on Kuaishou are shown in table 3, where the evaluation metrics are Recall@6, Recall@10, and LogLoss. Our method achieves superior results on all metrics compared to the baseline models there as well.
Table 2. Offline results on the Avito dataset.
Method | AUC | LogLoss | NDCG
DNN | 0.6614 | 0.0598 | 0.6920
DCN | 0.6623 | 0.0598 | 0.7004
PRM | 0.6881 | 0.0594 | 0.7380
Edgererank | 0.6953 | 0.0574 | 0.7203
PIER | 0.7109 | 0.0409 | 0.7401
Seq2Slate | 0.7034 | 0.0486 | 0.7225
NAR4Rec | 0.7234 | 0.0384 | 0.7409
Table 3. Offline results on the Kuaishou dataset.
Method | Recall@6 | Recall@10 | LogLoss
DNN | 66.47% | 86.65% | 0.6764
DCN | 68.22% | 87.95% | 0.6809
PRM | 73.17% | 92.25% | 0.5328
Edgererank | 73.63% | 92.90% | 0.5252
PIER | 73.50% | 92.44% | 0.5361
NAR4Rec | 74.86% | 93.16% | 0.5199
5.2.2. Training and Inference Time comparison
Since NAR4Rec is the non-autoregressive counterpart of autoregressive generators, we compare its training and inference time on the Avito dataset with that of the autoregressive model Seq2Slate. We also give the training and inference time for the generators of the other baselines in table 4. Because Seq2Slate uses recurrent neural networks as its backbone, both its training and its inference proceed in an autoregressive manner. NAR4Rec requires only 58 minutes to complete training, while Seq2Slate requires 283 minutes. Such a significant reduction in training time (i.e., an approximately 5$\times $ speedup) highlights the computational efficiency of NAR4Rec. The inference speedup of NAR4Rec over Seq2Slate is almost the same as the training speedup. This is expected: the autoregressive Seq2Slate generates the target sequence item by item, while the non-autoregressive NAR4Rec generates all items at once, so when generating a sequence of length 5, NAR4Rec shows an approximately 5$\times $ speedup.
Table 4. Training and inference time on the Avito dataset.
Method | Training Time | Inference Time
DNN | 0.102s | 0.034s
DCN | 0.106s | 0.035s
PRM | 0.109s | 0.036s
Edgererank | 0.105s | 0.035s
PIER | 0.160s | 0.053s
Seq2Slate | 0.558s | 0.186s
NAR4Rec | 0.112s | 0.037s
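The "approximately 5$\times $" figure can be checked directly from the numbers reported above. The short script below only redoes that arithmetic; the variable and function names are illustrative.

```python
# Timings reported in the text (total training minutes) and in the table
# above (training and inference seconds).
training_minutes = {"Seq2Slate": 283, "NAR4Rec": 58}
training_seconds = {"Seq2Slate": 0.558, "NAR4Rec": 0.112}
inference_seconds = {"Seq2Slate": 0.186, "NAR4Rec": 0.037}


def speedup(times):
    """How many times faster NAR4Rec is than Seq2Slate for the given timings."""
    return times["Seq2Slate"] / times["NAR4Rec"]


for name, times in [
    ("training (minutes)", training_minutes),
    ("training (seconds)", training_seconds),
    ("inference (seconds)", inference_seconds),
]:
    print(f"{name}: {speedup(times):.2f}x")
# training (minutes): 4.88x
# training (seconds): 4.98x
# inference (seconds): 5.03x
```

All three ratios are close to 5, the sequence length on Avito, which is consistent with generating all five items at once instead of one at a time.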
5.2.3. Hyperparameter Analysis of NAR4Rec
We further analyze the hyperparameter sensitivity of NAR4Rec. Here, we conduct a series of experiments on NAR4Rec and PIER. As shown in fig. 3, the results are insensitive to variations in the learning rate, batch size, and number of epochs.
Then, we analyze the impact of the weight $\omega $ and margin $\rho $ in the contrastive loss and the impact of the penalty parameter $\alpha $ in contrastive decoding. Figure 4 shows the results of these experiments. For the contrastive loss, we vary $\omega $ while fixing $\rho $=0.5, and vary $\rho $ while fixing $\omega $=0.01. When varying $\alpha $ in contrastive decoding, we set $\rho $=0.5 and $\omega $=0.01 as the default parameters.
5.3. Online experiments
Text sequence generation is often evaluated by human labeling. In recommendation, we instead resort to online A/B experiments to obtain feedback from users and demonstrate the effectiveness of our method.
5.3.1. Experiments setup
In online A/B experiments, we evenly divide the traffic of the entire app into ten buckets. The online baseline is Edgererank (Gong et al., 2022). We assign 20% of the traffic (two of the ten buckets) to NAR4Rec, while the remaining traffic is assigned to Edgererank.
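A traffic split of this kind is commonly implemented by hashing a stable identifier into buckets. The sketch below illustrates that general idea only: the use of MD5 on a user id, the bucket numbering, and every name in the code are assumptions for illustration, not a description of the production system.

```python
import hashlib

NUM_BUCKETS = 10
# Hypothetical choice: buckets 0 and 1 (2 of 10, i.e. 20% of traffic).
TREATMENT_BUCKETS = {0, 1}


def bucket_of(user_id: str) -> int:
    """Map a user id to one of NUM_BUCKETS buckets with a stable hash."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_BUCKETS


def arm_of(user_id: str) -> str:
    """Return which reranker serves this user in the experiment."""
    if bucket_of(user_id) in TREATMENT_BUCKETS:
        return "NAR4Rec"
    return "Edgererank"


# The assignment is deterministic, and roughly 20% of users land in treatment.
users = [f"user-{i}" for i in range(10000)]
share = sum(arm_of(u) == "NAR4Rec" for u in users) / len(users)
print(f"treatment share: {share:.3f}")
```

Hashing the identifier rather than drawing a random number per request keeps each user in the same arm for the whole experiment, which matters when metrics such as Follows accumulate over several days.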
5.3.2. Experimental Results
The experiments ran on the system for ten days, and the results are listed in table 5. NAR4Rec outperforms Edgererank (Gong et al., 2022) by a large margin. With NAR4Rec, users watch more videos (i.e., more Views), spend more time on each video (i.e., more Long Views and Complete Views), and give more positive feedback (i.e., more Likes and Follows) than with Edgererank (Gong et al., 2022).
Table 5. Online A/B test results of NAR4Rec relative to Edgererank.
Views | Likes | Follows | Long Views | Complete Views
+1.161% | +1.71% | +1.15% | +1.82% | +2.45%
5.3.3. Ablation Study on Unlikelihood Training
To show the effectiveness of unlikelihood training, we compare it with vanilla training on all exposed sequences. Unlikelihood training yields more Views and a longer Watch Time; equivalently, vanilla training shows a drop in both metrics.
Method | Views | Watch Time
Vanilla training | -0.370%* | -0.277%*
5.3.4. Ablation Study on Contrastive Decoding
Here we compare contrastive decoding with decoding algorithms commonly used for text sequences and with a diversity-based algorithm (i.e., Deep DPP). These decoding algorithms show a significant drop in Views and Watch Time, suggesting poorer user feedback.
Method | Views | Watch Time
Deep DPP | -0.363%* | -0.361%*
Beam Search | -0.327%* | -0.214%*
Greedy Search | -0.216%* | -0.178%*
Top-k Sampling | -0.254%* | -0.131%
6. Conclusion
In this paper, we provide an overview of the current formulation of reranking in recommendation systems and the challenges associated with it. Although non-autoregressive generation has been explored in natural language processing, conventional techniques are not directly applicable to recommendation systems. We tackle the challenges specific to recommendation to improve the convergence and performance of non-autoregressive models, and we make the first attempt to integrate non-autoregressive models into reranking in real-time recommender systems. Extensive offline experiments and online A/B tests have demonstrated the effectiveness and efficiency of NAR4Rec as a versatile framework for generating sequences with enhanced utility. In future work, we will focus on refining the modeling of sequence utility to further enhance the capabilities of NAR4Rec.
References
 Ai et al. (2018) Qingyao Ai, Keping Bi, Jiafeng Guo, and W Bruce Croft. 2018. Learning a deep listwise context model for ranking refinement. In The 41st international ACM SIGIR conference on research & development in information retrieval. 135–144.
 Awasthi et al. (2019) Abhijeet Awasthi, Sunita Sarawagi, Rasna Goyal, Sabyasachi Ghosh, and Vihari Piratla. 2019. Parallel Iterative Edit Models for Local Sequence Transduction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 4260–4270.
 Bello et al. (2018) Irwan Bello, Sayali Kulkarni, Sagar Jain, Craig Boutilier, Ed Chi, Elad Eban, Xiyang Luo, Alan Mackey, and Ofer Meshi. 2018. Seq2Slate: Re-ranking and slate optimization with RNNs. arXiv preprint arXiv:1810.02019 (2018).
 Burges et al. (2005) Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning. 89–96.
 Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st workshop on deep learning for recommender systems. 7–10.
 Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Proceedings of the 10th ACM conference on recommender systems. 191–198.
 Feng et al. (2021) Yufei Feng, Binbin Hu, Yu Gong, Fei Sun, Qingwen Liu, and Wenwu Ou. 2021. GRN: Generative Rerank Network for Context-wise Recommendation. arXiv preprint arXiv:2104.00860 (2021).
 Ghazvininejad et al. (2019) Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel Decoding of Conditional Masked Language Models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6112–6121.
 Gong et al. (2022) Xudong Gong, Qinlin Feng, Yuan Zhang, Jiangling Qin, Weijie Ding, Biao Li, Peng Jiang, and Kun Gai. 2022. Real-time Short Video Recommendation on Mobile Devices. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3103–3112.
 Gu et al. (2018) Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2018. Non-Autoregressive Neural Machine Translation. In International Conference on Learning Representations.
 Gu et al. (2019) Jiatao Gu, Changhan Wang, and Junbo Zhao. 2019. Levenshtein transformer. Advances in Neural Information Processing Systems 32 (2019).
 Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. In Proceedings of the 26th International Joint Conference on Artificial Intelligence. 1725–1731.
 Joachims et al. (2017) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, and Geri Gay. 2017. Accurately interpreting clickthrough data as implicit feedback. In ACM SIGIR Forum, Vol. 51. ACM New York, NY, USA, 4–11.
 Leng et al. (2021a) Yichong Leng, Xu Tan, Rui Wang, Linchen Zhu, Jin Xu, Wenjie Liu, Linquan Liu, Xiang-Yang Li, Tao Qin, Edward Lin, et al. 2021a. FastCorrect 2: Fast Error Correction on Multiple Candidates for Automatic Speech Recognition. In Findings of the Association for Computational Linguistics: EMNLP 2021. 4328–4337.
 Leng et al. (2021b) Yichong Leng, Xu Tan, Linchen Zhu, Jin Xu, Renqian Luo, Linquan Liu, Tao Qin, Xiangyang Li, Edward Lin, and Tie-Yan Liu. 2021b. Fastcorrect: Fast error correction with edit alignment for automatic speech recognition. Advances in Neural Information Processing Systems 34 (2021), 21708–21719.
 Li et al. (2020) Margaret Li, Stephen Roller, Ilia Kulikov, Sean Welleck, Y-Lan Boureau, Kyunghyun Cho, and Jason Weston. 2020. Don’t Say That! Making Inconsistent Dialogue Unlikely with Unlikelihood Training. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4715–4728.
 Lian et al. (2018) Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun. 2018. xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1754–1763.
 Lin et al. (2023) Xiao Lin, Xiaokai Chen, Chenyang Wang, Hantao Shu, Linfeng Song, Biao Li, et al. 2023. Discrete Conditional Diffusion for Reranking in Recommendation. arXiv preprint arXiv:2308.06982 (2023).
 Liu et al. (2009) Tie-Yan Liu et al. 2009. Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3, 3 (2009), 225–331.
 Lorigo et al. (2008) Lori Lorigo, Maya Haridasan, Hrönn Brynjarsdóttir, Ling Xia, Thorsten Joachims, Geri Gay, Laura Granka, Fabio Pellacini, and Bing Pan. 2008. Eye tracking and online search: Lessons learned and challenges ahead. Journal of the American Society for Information Science and Technology 59, 7 (2008), 1041–1052.
 Lorigo et al. (2006) Lori Lorigo, Bing Pan, Helene Hembrooke, Thorsten Joachims, Laura Granka, and Geri Gay. 2006. The influence of task and gender on search and evaluation behavior using Google. Information processing & management 42, 4 (2006), 1123–1131.
 Pang et al. (2020) Liang Pang, Jun Xu, Qingyao Ai, Yanyan Lan, Xueqi Cheng, and Jirong Wen. 2020. Setrank: Learning a permutation-invariant ranking model for information retrieval. In Proceedings of the 43rd international ACM SIGIR conference on research and development in information retrieval. 499–508.
 Pei et al. (2019) Changhua Pei, Yi Zhang, Yongfeng Zhang, Fei Sun, Xiao Lin, Hanxiao Sun, Jian Wu, Peng Jiang, Junfeng Ge, Wenwu Ou, et al. 2019. Personalized re-ranking for recommendation. In Proceedings of the 13th ACM conference on recommender systems. 3–11.
 Shi et al. (2023) Xiaowen Shi, Fan Yang, Ze Wang, Xiaoxu Wu, Muzhi Guan, Guogang Liao, Wang Yongkang, Xingxing Wang, and Dong Wang. 2023. PIER: Permutation-Level Interest-Based End-to-End Re-ranking Framework in E-commerce. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 4823–4831.
 Stern et al. (2019) Mitchell Stern, William Chan, Jamie Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. In International Conference on Machine Learning. PMLR, 5976–5985.
 Sun et al. (2019) Fei Sun, Jun Liu, Jian Wu, Changhua Pei, Xiao Lin, Wenwu Ou, and Peng Jiang. 2019. BERT4Rec: Sequential recommendation with bidirectional encoder representations from transformer. In Proceedings of the 28th ACM international conference on information and knowledge management. 1441–1450.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
 Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & cross network for ad click predictions. In Proceedings of the ADKDD’17. 1–7.
 Welleck et al. (2019) Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, and Jason Weston. 2019. Neural Text Generation With Unlikelihood Training. In International Conference on Learning Representations.
 Xi et al. (2021) Yunjia Xi, Weiwen Liu, Xinyi Dai, Ruiming Tang, Weinan Zhang, Qing Liu, Xiuqiang He, and Yong Yu. 2021. Context-aware reranking with utility maximization for recommendation. arXiv preprint arXiv:2110.09059 (2021).
 Yang (2017) Ziying Yang. 2017. Relevance judgments: Preferences, scores and ties. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 1373–1373.
 Zhou et al. (2019a) Chunting Zhou, Jiatao Gu, and Graham Neubig. 2019a. Understanding Knowledge Distillation in Non-autoregressive Machine Translation. In International Conference on Learning Representations.
 Zhou et al. (2019b) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019b. Deep interest evolution network for click-through rate prediction. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 5941–5948.
 Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1059–1068.
 Zhuang et al. (2018) Tao Zhuang, Wenwu Ou, and Zhirong Wang. 2018. Globally optimized mutual influence aware ranking in e-commerce search. arXiv preprint arXiv:1805.08524 (2018).