Transformers in Reinforcement Learning: A Survey

Pranav Agarwal, École de Technologie Supérieure / Mila, Montréal, Québec, Canada, pranav.agarwal.1@ens.etsmtl.net
Aamer Abdul Rahman, École de Technologie Supérieure / Mila, Montréal, Québec, Canada, aamer.abdul-rahman.1@ens.etsmtl.net
Pierre-Luc St-Charles, Mila, Applied ML Research Team, Montréal, Québec, Canada, pierreluc.stcharles@mila.quebec
Simon J.D. Prince, University of Bath, Bath, United Kingdom, sjdp23@bath.ac.uk
Samira Ebrahimi Kahou, École de Technologie Supérieure / Mila / CIFAR, Montréal, Québec, Canada, samira.ebrahimi.kahou@gmail.com
Abstract.

Transformers have significantly impacted domains like natural language processing, computer vision, and robotics, where they improve performance compared to other neural networks. This survey explores how transformers are used in reinforcement learning (RL), where they are seen as a promising solution for addressing challenges such as unstable training, credit assignment, lack of interpretability, and partial observability. We begin by providing a brief domain overview of RL, followed by a discussion on the challenges of classical RL algorithms. Next, we delve into the properties of the transformer and its variants and discuss the characteristics that make them well-suited to address the challenges inherent in RL. We examine the application of transformers to various aspects of RL, including representation learning, transition and reward function modeling, and policy optimization. We also discuss recent research that aims to enhance the interpretability and efficiency of transformers in RL, using visualization techniques and efficient training strategies. Often, the transformer architecture must be tailored to the specific needs of a given application. We present a broad overview of how transformers have been adapted for several applications, including robotics, medicine, language modeling, cloud computing, and combinatorial optimization. We conclude by discussing the limitations of using transformers in RL and assess their potential for catalyzing future breakthroughs in this field.

reinforcement learning, transformers, representation learning, neural networks, literature survey
CCS concepts: Computing methodologies → Reinforcement learning; General and reference → Surveys and overviews; Computing methodologies → Neural networks

1. Introduction

Reinforcement learning (RL) is a learning paradigm that enables sequential decision-making by learning from feedback obtained through trial and error. It is usually formalized in terms of a Markov decision process (MDP), which provides a mathematical framework for modeling the interaction between an agent and its environment.

Most RL algorithms optimize the agent’s policy to select actions that maximize the expected cumulative reward. In deep RL, neural networks are used as function approximators for mapping the current state of the environment to the next action and for estimating future returns. This approach is beneficial when dealing with large or continuous state spaces that make tabular methods computationally expensive (Sutton and Barto, 1998) and has been successful in challenging applications (Arulkumaran et al., 2017; Nguyen et al., 2020; Latif et al., 2023). However, standard neural network architectures like convolutional neural networks (CNNs) and recurrent neural networks (RNNs) struggle with long-standing problems in RL. These problems include partial observability (Esslinger et al., 2022), inability to handle high-dimensional state and action spaces (Barto and Mahadevan, 2003), and difficulty in handling long-term dependencies (Chen et al., 2022a).

Partial observability is a challenge in RL (Liu et al., 2022a); in the absence of complete information, the agent may be unable to make optimal decisions. A typical way to address this problem is to integrate the agent’s input (Shao et al., 2019a) over time using CNNs and RNNs. However, RNNs tend to forget information (Pascanu et al., 2013), while CNNs are limited in the number of past-time steps they can process (Karpathy et al., 2014). Various strategies have been proposed to overcome this limitation, including gating mechanisms, gradient clipping, non-saturating activation functions, and manipulating gradient propagation paths (Ribeiro et al., 2020). Sometimes different data modalities, such as text, audio, and images are combined to provide additional information to the agent (Lathuilière et al., 2019; Song et al., 2021; Carta et al., 2020). However, integrating encoders for different modalities increases the model’s architectural complexity. With CNNs and RNNs, it is also difficult to determine which past actions contributed to current rewards (Ma et al., 2021). This is known as the credit assignment problem. These challenges and others, such as training instability, limit the scope of most RL applications to unrealistic virtual environments.

The transformer was first introduced in 2017 (Vaswani et al., 2017) and has rapidly impacted the field of deep learning (Lin et al., 2022), improving the state-of-the-art in natural language processing (NLP) and computer vision (CV) tasks (Tunstall et al., 2022; Khan et al., 2021; Devlin et al., 2019; Petit et al., 2021; Zhong et al., 2021). The key idea behind this neural network architecture is to use a self-attention mechanism to capture long-range relationships within the data. This ability to model large-scale context across sequences initially made transformers well-suited for machine translation tasks. Transformers have since been adapted to tackle more complex tasks like image segmentation (Petit et al., 2021), visual question answering (Zhong et al., 2021), and speech recognition (Dong et al., 2018).

This document surveys the use of transformers in RL. We begin by providing a concise overview of RL (Sec. 2.1) and transformers (Sec. 2.3) that is accessible to readers with a general background in machine learning. We highlight challenges that classical RL approaches face and how transformers can help deal with these challenges (Sec. 2.2 and 2.4). Transformers can be applied to RL in different ways (Fig. 1). We discuss how they can be used to learn representations (Sec. 3), model transition functions (Sec. 4), learn reward functions (Sec. 5) and learn policies (Sec. 6). In Sec.  7 and Sec.  8, we discuss different training and interpretation strategies, and in Sec. 9, we provide an overview of RL applications that use transformers, including robotics, medicine, language modeling, edge-cloud computing, combinatorial optimization, environmental sciences, scheduling, trading, and hyper-parameter optimization. Finally, we discuss limitations and open questions for future research (Sec. 10). With this work, we aim to inspire further research and facilitate the development of RL approaches for real-world applications.

Figure 1. This survey presents a comprehensive overview of the use of transformers in RL. Modeling an RL policy may involve representation learning, modeling the transition function, reward function learning, and policy learning. Transformers can be used across all of these tasks.

2. Background

This section introduces the fundamental concepts of RL and discusses associated challenges. We also provide an overview of transformers and their potential advantages in RL.

2.1. Reinforcement Learning

Reinforcement learning (RL) is a reward-based learning paradigm that enables agents to learn from their experience and improve their performance over time. This is commonly formulated in terms of Markov decision processes (MDPs), in which the agent chooses an action ($\mathbf{a} \in \mathbf{A}$) based on the state ($\mathbf{s} \in \mathbf{S}$) of the environment and receives feedback in the form of rewards ($r$). The MDP framework assumes that the environment satisfies the Markov property, which asserts that the next state is independent of the past states, given the present state and the most recent action. This allows the agent to make decisions based only on the current state without tracking the history of previous states and actions. After the agent takes an action, the state is updated to a new value ($\mathbf{s}' \in \mathbf{S}$), as determined by the transition function $p(\mathbf{s}' \mid \mathbf{s}, \mathbf{a})$, which describes how the environment responds to the agent’s action. The overall goal of RL is to learn how to solve a multi-step problem by maximizing the return $g$, which is the total discounted reward:

$$g = \sum_{t=0}^{T} \gamma^{t} r_{t}, \tag{1}$$

where $r_t$ is the reward received at time step $t$, $T$ is the total number of time steps in the given episode, and $\gamma \in [0,1]$ is a discount factor, affecting the importance given to immediate rewards versus future rewards in a given episode.
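As a small illustration of Eq. 1, the sketch below computes the discounted return of one episode. The function name and the toy reward list are hypothetical and chosen only for this example.

```python
def discounted_return(rewards, gamma):
    """Return g = sum_t gamma**t * r_t for one episode (Eq. 1)."""
    g = 0.0
    # Accumulate from the last reward backwards: g_t = r_t + gamma * g_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical three-step episode
g = discounted_return(rewards, gamma=0.5)
print(g)  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

A smaller discount factor shrinks the contribution of later rewards, which is the trade-off between immediate and future rewards described above.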

There are multiple categories of RL algorithms, each with advantages and disadvantages (AlMahamid and Grolinger, 2021). Choosing the appropriate category relies on factors such as the problem’s complexity, the size of the state and action spaces, and available computational resources. We now briefly review these categories.

Model-Based RL. In model-based RL, a transition function is learned using transitions $(\mathbf{s}, \mathbf{a}, r, \mathbf{s}')$ generated by environmental interaction. This transition function models the probability distribution $p(\mathbf{s}', r \mid \mathbf{s}, \mathbf{a})$ over the subsequent state $\mathbf{s}'$ and reward $r$ given the current state $\mathbf{s}$ and action $\mathbf{a}$. By leveraging this learned model, the agent plans and selects actions that maximize the expected return $g$. However, this approach can be computationally expensive and may suffer from inaccuracies in the learned model, leading to sub-optimal performance (Moerland et al., 2020). Despite these drawbacks, this approach is sample efficient. In other words, it can achieve good performance using relatively few interactions with the environment compared to other methods (Mohanty et al., 2020; Wang et al., 2022c).

Model-Free RL. In model-free RL, optimal actions are learned by direct interaction with the environment. Methods from this category cannot model state-transition dynamics to plan actions, which can result in slower convergence and lower sample efficiency compared to model-based RL (Yarats et al., 2021). However, model-free RL is more adaptable to environmental changes, making it more robust in complex or noisy environments (Mnih et al., 2013; Lillicrap et al., 2016; Schulman et al., 2017). Additionally, it is less computationally expensive as it does not need to learn a model of the environment.

Further, RL methods can be categorized as on-policy, off-policy, and offline RL depending on how the data collection relates to the policy being learned.

On-Policy RL. On-policy RL approaches use the current policy to gather transitions for updating the value function. For instance, in SARSA (Sutton and Barto, 1998), the current policy is used to collect a tuple $(\mathbf{s}, \mathbf{a}, r, \mathbf{s}', \mathbf{a}')$, consisting of the current state-action pair $(\mathbf{s}, \mathbf{a})$, the immediate reward $r$, and the next state-action pair $(\mathbf{s}', \mathbf{a}')$. This is then used for estimating the return of the current state-action pair $Q_{\text{target}}(\mathbf{s}, \mathbf{a})$ and the next state-action pair $Q_{\text{target}}(\mathbf{s}', \mathbf{a}')$. The value function $Q$ is updated using the following temporal difference (TD) learning rule:

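$$Q_{\text{target}}(\mathbf{s}, \mathbf{a}) \leftarrow Q_{\text{target}}(\mathbf{s}, \mathbf{a}) + \alpha \left[ r + \gamma Q_{\text{target}}(\mathbf{s}', \mathbf{a}') - Q_{\text{target}}(\mathbf{s}, \mathbf{a}) \right], \tag{2}$$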

where $\alpha \in (0,1]$ represents the learning rate. Although on-policy methods are comparatively easy to implement, they have several drawbacks. They tend to be sample inefficient (Larsen et al., 2021), requiring significant interaction with the environment to achieve good performance. Additionally, they can be susceptible to policy oscillation and instability (Young and Sutton, 2020), and they lack the flexibility to explore, slowing down the learning process and resulting in sub-optimal policies.
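The TD rule in Eq. 2 can be sketched for a tabular value function as follows. This is a minimal illustration, assuming a dictionary-backed Q table; the function name, the state and action labels, and the numeric values are hypothetical.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Apply the SARSA TD rule (Eq. 2) in place to a tabular Q function."""
    # The target bootstraps from the action the current policy actually took next.
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)          # unseen state-action pairs default to 0
Q[("s1", "right")] = 1.0        # hypothetical existing estimate
new_value = sarsa_update(Q, "s0", "right", r=0.5,
                         s_next="s1", a_next="right", alpha=0.5, gamma=0.9)
print(new_value)  # 0 + 0.5 * (0.5 + 0.9 * 1.0 - 0) = 0.7
```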

Off-Policy RL. Off-policy RL strategies use two policies — a behavior policy and a target policy. The behavior policy collects data that is subsequently used to estimate the expected return of an action under the given target policy. Since the behavior policy is used for data collection, it can explore different states and actions without affecting the current target policy. Thus, off-policy methods are well-suited for understanding the value of a given state and action. Usually, the target policy is updated using the behavior policy via importance sampling (IS). This adjusts the value estimates of the target policy based on the IS ratio between the behavior b(𝐚|𝐬) and target π(𝐚|𝐬) policies:

$$\mathrm{IS} = \frac{\pi(\mathbf{a} \mid \mathbf{s})}{b(\mathbf{a} \mid \mathbf{s})}, \tag{3}$$

and the final value estimate is given as:

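$$Q_{\text{target}}(\mathbf{s}, \mathbf{a}) \leftarrow Q_{\text{target}}(\mathbf{s}, \mathbf{a}) + \alpha \left[ \mathrm{IS} \cdot \left( r + \gamma Q_{\text{target}}(\mathbf{s}', \mathbf{a}') \right) - Q_{\text{target}}(\mathbf{s}, \mathbf{a}) \right]. \tag{4}$$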
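To make Eqs. 3 and 4 concrete, the following sketch applies one importance-weighted update to a single scalar value estimate. The function name, argument names, and numbers are illustrative assumptions, not part of any cited method.

```python
def is_weighted_update(q_sa, q_next, r, pi_prob, b_prob, alpha=0.1, gamma=0.99):
    """One off-policy update of Q(s, a) following Eqs. 3 and 4.

    pi_prob and b_prob are the probabilities that the target policy and the
    behavior policy assign to the action that was actually taken.
    """
    if b_prob <= 0.0:
        raise ValueError("the behavior policy must give the taken action non-zero probability")
    is_ratio = pi_prob / b_prob                       # Eq. 3
    return q_sa + alpha * (is_ratio * (r + gamma * q_next) - q_sa)  # Eq. 4


# Hypothetical numbers: the target policy is twice as likely as the behavior
# policy to pick this action, so the sampled target is up-weighted by 2.
updated = is_weighted_update(q_sa=0.0, q_next=1.0, r=1.0,
                             pi_prob=0.5, b_prob=0.25, alpha=0.5, gamma=0.9)
print(updated)  # 0.5 * (2 * (1 + 0.9)) = 1.9
```

When the two policies agree on the action probability, the ratio is 1 and the update reduces to the form of Eq. 2.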

Offline RL. Offline RL or batch RL (Levine et al., 2020a) uses a static dataset of transitions, denoted by $\mathbf{D} = \{(\mathbf{s}_t, \mathbf{a}_t, r_t, \mathbf{s}'_t)\}_{t=1}^{T}$, collected using a behavior policy, and so does not require interaction with the environment to collect trajectories. Offline RL updates the state-action value function $Q_{\text{target}}$ as:

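$$Q_{\text{target}}(\mathbf{s}, \mathbf{a}) \leftarrow Q_{\text{target}}(\mathbf{s}, \mathbf{a}) + \alpha \left[ r + \gamma \max_{\mathbf{a}'} Q_{\text{target}}(\mathbf{s}', \mathbf{a}') - Q_{\text{target}}(\mathbf{s}, \mathbf{a}) \right], \tag{5}$$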

where the max operator estimates the maximum expected return over all possible actions in the next state. Offline RL is a more practical strategy for safety-critical applications as it does not require interactions with the environment (Shi et al., 2021; Killian et al., 2023). However, the static nature of the dataset does not allow the agent to explore and adapt to new information, potentially limiting performance (Lu et al., 2022).
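A minimal tabular sketch of Eq. 5 applied repeatedly to a fixed dataset is shown below. The two-state toy dataset, the `done` flag that switches off bootstrapping at episode end, and the function name are assumptions added for illustration; practical offline RL methods typically add further machinery that is not shown here.

```python
import numpy as np


def offline_q_learning(dataset, n_states, n_actions, alpha=0.5, gamma=0.9, epochs=50):
    """Sweep a static dataset of (s, a, r, s_next, done) tuples with Eq. 5."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(epochs):
        for s, a, r, s_next, done in dataset:
            # max over next actions; no bootstrap after a terminal transition
            bootstrap = 0.0 if done else gamma * Q[s_next].max()
            Q[s, a] += alpha * (r + bootstrap - Q[s, a])
    return Q


# Hypothetical logged transitions; no environment interaction is needed.
dataset = [
    (0, 1, 0.0, 1, False),  # from state 0, action 1 leads to state 1
    (1, 1, 1.0, 1, True),   # from state 1, action 1 yields reward 1 and ends
    (0, 0, 0.0, 0, False),  # from state 0, action 0 stays in state 0
]
Q = offline_q_learning(dataset, n_states=2, n_actions=2)
greedy_policy = Q.argmax(axis=1)
```

State-action pairs that never appear in the dataset (here, action 0 in state 1) keep their initial value, which reflects the limitation noted above: the agent cannot explore beyond the logged data.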

Multi-Agent Reinforcement Learning. Online, offline, and off-policy learning setups can be used to facilitate adaptive decision-making in dynamic environments with multiple interacting agents (Gronauer and Diepold, 2022). Each of the $I$ agents has its own policy $\pi_i(\mathbf{a}_i \mid \mathbf{s}_i)$, state space $\mathbf{S}_i$, and action space $\mathbf{A}_i$. The agents interact with each other and the environment, and their actions can affect the outcomes of other agents. The goal in multi-agent reinforcement learning (MARL) is to learn a joint policy $\pi(\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_I \mid \mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_I)$ that maximizes the collective reward of all agents. Formally, the objective in MARL can be expressed as maximizing the expected sum of discounted rewards of all the agents:

$$\mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^{t} \sum_{i=1}^{I} r_i(t) \right], \tag{6}$$

where $\gamma$ is the discount factor and $r_i(t)$ is the reward received by agent $i$ at time $t$. In general, there are two main approaches to MARL: decentralized policies (Zhang et al., 2018a), and centralized training (Sharma et al., 2021). Decentralized policies involve each agent independently learning its policy without explicit coordination or communication with other agents. In contrast, centralized training uses a shared value function that considers joint states and actions, enabling agents to communicate and coordinate their actions through a communication protocol. Communication protocols (Foerster et al., 2016) facilitate information exchange and collaboration among agents.

Upside Down RL. Classical RL usually involves optimizing policies by estimating the expected future return. Upside-down RL (Schmidhuber, 2019; Arulkumaran et al., 2022) flips the traditional RL paradigm and uses the desired return g, the horizon h (i.e., the time remaining until the end of the current trial), and the state as inputs. This input acts as a command which is mapped to action probabilities. Upside-down RL offers improved stability compared to classical RL as it avoids the need to estimate the value function, which can introduce instabilities in traditional RL algorithms (Chen et al., 2021a; Sutton and Barto, 1998). The loss function of upside-down RL can be defined as:

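$$\mathcal{L}(\boldsymbol{\theta}) = \sum_{t=0}^{\infty} \left[ \mathbf{a}_t - f(\mathbf{s}_t, g_t, h_t, \boldsymbol{\theta}) \right]^{2}, \tag{7}$$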

where $\boldsymbol{\theta}$ contains the model parameters. The term $\mathbf{a}_t$ is the action at time step $t$, and $f(\mathbf{s}_t, g_t, h_t, \boldsymbol{\theta})$ is the predicted action when conditioned on state $\mathbf{s}_t$, expected future return $g_t$, and horizon $h_t$.
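The loss in Eq. 7 can be sketched as a supervised regression over a finite recorded trajectory. In the example below, the linear predictor, its zero-initialized parameters `theta`, and the toy data are hypothetical; for vector-valued actions the square is taken as a squared Euclidean norm, which is an interpretation made for this sketch.

```python
import numpy as np


def udrl_loss(actions, states, returns, horizons, predict):
    """Sum over t of ||a_t - f(s_t, g_t, h_t)||^2 for one trajectory (Eq. 7)."""
    total = 0.0
    for a, s, g, h in zip(actions, states, returns, horizons):
        error = np.asarray(a) - predict(s, g, h)
        total += float(np.sum(error ** 2))
    return total


# Hypothetical linear command-conditioned model: 2 state dims + return + horizon.
theta = np.zeros((1, 4))


def linear_policy(s, g, h):
    command = np.concatenate([s, [g, h]])  # state plus (desired return, horizon)
    return theta @ command


states = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
actions = [np.array([1.0]), np.array([2.0])]
loss = udrl_loss(actions, states, returns=[3.0, 2.0], horizons=[2, 1],
                 predict=linear_policy)
print(loss)  # with theta = 0 every prediction is 0, so loss = 1**2 + 2**2 = 5.0
```

Note that no value function appears anywhere in this objective: the return and horizon enter only as inputs to the predictor.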

2.2. Challenges in Reinforcement Learning

In this section, we discuss the different challenges of classical RL algorithms.

Curse of Dimensionality. Real-world applications often involve high-dimensional state spaces, which makes it hard for classical RL algorithms to learn optimal policies (Barto and Mahadevan, 2003). This is because the required training data grows exponentially as the data dimensionality increases (Wang et al., 2020a). One way to mitigate this problem is to encode high-dimensional states into a lower-dimensional space. RL policies perform better when trained on encoded low-dimensional data (Yarats et al., 2021).

Partially Observable Environment. A partially observable environment presents a challenge for RL algorithms, as the agent cannot access observations that contain complete information about the environmental state at each time-step (Vinyals et al., 2019; OpenAI et al., 2019). Without complete information, the algorithm may struggle to make the best decision, leading to uncertainty and compromises in performance. To address this, the policy must maintain an internal representation of the state, often in the form of memory, from which the actual state can be estimated (Icarte et al., 2020). Historically, this has often been done with RNNs, but these cannot efficiently model long contexts (Pascanu et al., 2013; Hochreiter, 1998).

Credit Assignment. The term credit assignment refers to the problem of associating the actions taken by an agent with the reward it receives (Mesnard et al., 2021). This is challenging for two reasons: First, the reward may be delayed; the agent may not be able to observe the consequences of its actions until several time steps into the future. Second, other factors or multiple actions may influence the received reward, making it challenging to identify which action led to that reward. Inaccurate credit assignment can lead to slower training and sub-optimal policies (Ausin et al., 2021). Moreover, when the reward is sparse (i.e., when the agent receives little feedback for its actions), the credit-assignment problem becomes even more difficult (Seo et al., 2019). One potential solution is to use models that integrate information across all time steps, which may be better suited for solving this issue (Chen et al., 2021a).

Recent studies have exploited transformers to tackle these three key challenges. Transformers have demonstrated success in modeling long-term dependencies in sequential data while showing promising results in promoting generalization and faster learning in domains such as NLP and CV. We now provide a brief overview of transformers and explore the various ways in which they have been applied to learning optimal RL policies.

2.3. Transformers

Figure 2. a) The dot-product attention mechanism (for 4 embeddings of size 5 each). Input features (𝐗) are processed using the key (𝐊), query (𝐐), and value (𝐕) tensors. Each query undergoes a dot product with every key, and the result is normalized using the softmax operator to compute the attention. This is then used to weigh the values to produce the output. b) The (multi-head) self-attention mechanism is just one component in the transformer block, which also contains residual links, layer normalization, and parallel MLPs that process each input embedding separately.

Transformers are a class of neural network architectures consisting of multiple layers, each containing a multi-head self-attention mechanism, parallel fully-connected networks, residual connections, and layer normalization. Given a sequence of N input embeddings, transformers produce a sequence of N output embeddings, each of which represents the relationship between the corresponding input embedding and the rest of the input sequence. In NLP, the input embeddings may represent words from a given sentence, while in RL, they may represent different states.

The self-attention mechanism allows each input embedding (each row of $\mathbf{X}$) to simultaneously attend to all the other embeddings in the input sequence. It computes one attention score for each pair of inputs. This is done by projecting each input into a query tensor $\mathbf{Q} = \mathbf{X}\mathbf{W}_q$ and a key tensor $\mathbf{K} = \mathbf{X}\mathbf{W}_k$. The attention scores are then computed by taking the dot product of each query vector (row of the query tensor) with every key vector (row of the key tensor), followed by a softmax operation that normalizes the resulting scores such that they add up to one for each query. The attention scores are then used to compute a weighted sum of the rows of the value tensor $\mathbf{V} = \mathbf{X}\mathbf{W}_v$ (see Fig. 2):

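$$\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \operatorname{Softmax}\left( \frac{\mathbf{Q}\mathbf{K}^{\top}}{\sqrt{d_q}} \right) \mathbf{V}. \tag{8}$$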

To help stabilize gradients during training, the dot products are divided by $\sqrt{d_q}$, where $d_q$ is the dimension of each query vector.
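The computation in Eq. 8 can be written in a few lines of NumPy. The sketch below is a single-head version with randomly initialized projection matrices standing in for learned parameters; the shapes follow the example of Fig. 2a (4 embeddings of size 5), and all names are chosen for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention (Eq. 8)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_q = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_q)      # one score per (query, key) pair
    weights = softmax(scores, axis=-1)   # each row sums to one
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5))  # 4 input embeddings of size 5
W_q, W_k, W_v = (rng.normal(size=(5, 5)) for _ in range(3))  # stand-ins for learned weights
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (4, 5) (4, 4)
```

The `weights` matrix has one row per query and one column per key, so its size grows with the square of the sequence length.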

Transformers often compute multiple sets of attention scores in parallel (each with a different set of learned parameters $\mathbf{W}_k, \mathbf{W}_q, \mathbf{W}_v$). This allows the model to attend to multiple aspects of the input sequence simultaneously and is known as multi-head self-attention. Each attention “head” output is concatenated and linearly transformed to produce the final output representation. For applications where the order of the inputs is important, a position encoding is added that allows the network to establish the position of each input.

Figure 3. a) Self-attention computes attention scores for a given segment by considering all input across all time steps. b) Masked self-attention considers the current and past time steps while disregarding the future. c) Transformer XL (Dai et al., 2019) establishes a connection between past segments, incorporating information from multiple previous segments. It calculates attention scores by combining the hidden state of the previous segment, enabling the modeling of longer-term dependencies.

In a transformer block (Fig. 2b), a residual connection is placed around the multi-head self-attention mechanism. This improves training stability by allowing the gradient to flow easily through the network. The output is then processed using layer normalization, which normalizes the activations of each layer across the feature dimension. Each output is processed in parallel by the same MLP. Once more, these are bypassed by a residual connection, and a second layer norm is subsequently added.
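The ordering of operations in the block described above can be sketched as follows. This is a simplified illustration, assuming a single attention head, a two-layer ReLU MLP, and layer normalization without learned scale and shift; the parameter names and sizes are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def layer_norm(x, eps=1e-5):
    # Normalize each embedding across its feature dimension.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def transformer_block(X, p):
    Q, K, V = X @ p["W_q"], X @ p["W_k"], X @ p["W_v"]
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V
    H = layer_norm(X + attn)                         # residual around attention, then layer norm
    mlp = np.maximum(0.0, H @ p["W_1"]) @ p["W_2"]   # same MLP applied to every position
    return layer_norm(H + mlp)                       # second residual and layer norm


rng = np.random.default_rng(0)
d, hidden = 8, 16
shapes = {"W_q": (d, d), "W_k": (d, d), "W_v": (d, d), "W_1": (d, hidden), "W_2": (hidden, d)}
params = {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}
X = rng.normal(size=(4, d))
Y = transformer_block(X, params)
```

Because no position encoding is added in this sketch, permuting the rows of `X` simply permutes the rows of `Y` in the same way, which illustrates why a position encoding is needed when input order matters.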

Architectural Variations. The bidirectional encoder representations from transformers (BERT) model (Devlin et al., 2019) and the generative pre-trained transformer (GPT) (Radford et al., 2018) are two popular variants of the transformer architecture. BERT (Fig. 4) is a transformer encoder in which each output receives information from every input in the self-attention mechanism (Fig. 3a). The goal is to process the incoming data to generate a latent representation that integrates contextual information. This can be particularly useful in RL, as it enables the agent to make informed decisions based on a more comprehensive understanding of the environment (Banino et al., 2022; Wang et al., 2023).

Figure 4. The BERT architecture. The input tokens are converted to word embeddings and passed through transformer layers to produce latent representations. Some input tokens are replaced with mask embedding as part of the formulation for the pre-training task. The output embeddings predict the missing (masked) word using a softmax and multi-class classification loss. In each loss term corresponding to the prediction of a masked word, BERT uses context from before and after the word.

Conversely, GPT (Fig. 5) uses a decoder architecture to auto-regressively generate a sequence of output tokens, considering only the past tokens. Masked self-attention (Fig. 3b) prevents the model from looking ahead during training at tokens that it should not yet know; the attention values associated with future positions are clamped to zero. In RL, this autoregressive nature can be used to implement an RL policy that is conditioned on a sequence of past states and actions (Hernandez et al., 2021; Chen et al., 2021a; Janner et al., 2021). GPT uses multiple blocks, each containing a multi-head attention mechanism. The original transformer (Vaswani et al., 2017) combined these two approaches in an encoder-decoder architecture for machine translation. An encoder architecture processes the incoming sentence, and the decoder auto-regressively produces the output sentence. In doing so, the decoder also considers the attention over the encoder’s latent representation using a “cross-attention” block.
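One common way to obtain zero attention on future positions is to set the corresponding pre-softmax scores to negative infinity. The sketch below shows this for a square score matrix; it is an illustrative assumption about the implementation, not a description of any specific GPT codebase, and the function name is hypothetical.

```python
import numpy as np


def causal_attention_weights(scores):
    """Softmax over scores with every future position masked out.

    scores[i, j] is the raw score of query position i against key position j.
    """
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)  # True where j > i
    masked = np.where(future, -np.inf, scores)
    # The diagonal is never masked, so each row has a finite maximum.
    e = np.exp(masked - masked.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


weights = causal_attention_weights(np.zeros((3, 3)))
print(weights)
```

With equal raw scores, position 0 attends only to itself, position 1 splits its attention evenly over positions 0 and 1, and position 2 over all three.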

Figure 5. The GPT architecture. The tokens are mapped to embeddings through transformer layers with masked self-attention, such that each position attends only to earlier tokens. The goal is to predict the next token at each position correctly. This approach is efficient to train as every word contributes to the final loss.

Vision Transformers. Inspired by the success of transformer-based architectures like BERT and GPT in NLP, Dosovitskiy et al. (2021) proposed the vision transformer (ViT) architecture for processing images. The ViT architecture is suited to a wide range of RL tasks where a policy must be learned from images (Tao et al., 2022; Kargar and Kyrki, 2021; Goulão and Oliveira, 2022). It is a transformer encoder that processes patches of the image (Fig. 6). Each patch is combined with a positional encoding that provides knowledge about its original image location.
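The patch-based input described above can be sketched as follows: the image is cut into non-overlapping patches, each patch is flattened and linearly projected, and a position embedding is added. The 84x84x3 image size, the patch size of 12, the embedding width of 64, and the random matrices standing in for learned parameters are all assumptions made for this example.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    if H % patch or W % patch:
        raise ValueError("image height and width must be divisible by the patch size")
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)   # one row per patch, in raster order


rng = np.random.default_rng(0)
image = np.arange(84 * 84 * 3, dtype=float).reshape(84, 84, 3)  # stand-in observation
patches = patchify(image, patch=12)           # 7 x 7 = 49 patches of size 432

W_embed = rng.normal(scale=0.02, size=(patches.shape[1], 64))   # stand-in for a learned projection
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], 64))  # stand-in for learned positions
tokens = patches @ W_embed + pos_embed        # input sequence for the transformer encoder
print(patches.shape, tokens.shape)  # (49, 432) (49, 64)
```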

Transformer-XL. The computational cost of the self-attention mechanism increases quadratically with input sequence length because the number of pairwise comparisons grows with the square of the number of inputs. Hence, the transformer architectures discussed so far typically partition long input sequences into shorter sequences to reduce memory demands. While this approach helps minimize memory usage, it makes capturing global context challenging. Moreover, the conventional transformer model is limited because it does not consider the boundaries of the input sequence when forming context. Instead, it selects consecutive chunks of symbols without regard for sentence or semantic boundaries. This can result in context fragmentation, where the model lacks the necessary contextual information to accurately predict the first few symbols in a sequence. The transformer-XL (TrXL) architecture (Dai et al., 2019) addresses these issues by dividing the input into segments and incorporating segment-level recurrence (Fig. 3c) and relative positional encodings. By caching and reusing the representation computed for the previous segment during training, TrXL can extend the context and better capture long-term dependencies. Additionally, TrXL can process elements of new segments without recomputing the past segments, leading to faster inference.
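The idea of segment-level recurrence can be sketched for a single attention layer: queries come from the current segment only, while keys and values are computed over the cached previous segment concatenated with the current one. This simplified example omits relative positional encodings and the stacking of layers, and all function names and sizes are hypothetical.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def segment_attention(h, memory, W_q, W_k, W_v):
    """Causal attention of the current segment over [memory, current segment]."""
    context = h if memory is None else np.concatenate([memory, h], axis=0)
    Q, K, V = h @ W_q, context @ W_k, context @ W_v
    n, m = h.shape[0], context.shape[0] - h.shape[0]
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    # Position i may see all cached positions and current positions up to i.
    future = np.triu(np.ones((n, m + n), dtype=bool), k=m + 1)
    return softmax(np.where(future, -np.inf, scores)) @ V


def process_sequence(X, segment_len, weights):
    memory, outputs = None, []
    for start in range(0, len(X), segment_len):
        h = X[start:start + segment_len]
        outputs.append(segment_attention(h, memory, *weights))
        memory = h  # cache this segment's layer input; it is reused, not recomputed
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))                               # 8 steps, embedding size 6
W = tuple(rng.normal(size=(6, 6)) for _ in range(3))      # stand-ins for learned weights
out = process_sequence(X, segment_len=4, weights=W)
```

In this sketch the first position of the second segment can attend to the whole first segment, which is the context that would be lost if segments were processed independently.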

2.4. Key Advantages of Transformers in RL

This section outlines the transformer characteristics that are important for RL applications.

Attention Mechanism. The attention mechanism is crucial in transformers for sequence modeling of states (Benjamins et al., 2022). It enables the RL agent to focus selectively on relevant cues in the environment (Manchin et al., 2019) and ignore redundant features, which leads to faster training. This is particularly useful in high-dimensional state spaces where there are a large number of input elements.

Multi-Modal Architectures. For complex tasks, RL agents may require additional information from different data modalities (Zhang et al., 2018b; Kiran et al., 2022). Past approaches have used different architectures to handle multiple modalities (Ramachandram and Taylor, 2017). However, transformers can process multiple modalities of data (e.g., text, images) effectively (Jaegle et al., 2022; Xu et al., 2022c) using the same architecture.

Parallel Processing. Learning a policy in RL can be computationally expensive, especially for complex tasks requiring many samples (Obando-Ceron and Castro, 2021). RNNs require the sequential processing of inputs, which is inefficient. Transformers are well-suited for parallelization due to their self-attention mechanism, which considers all inputs simultaneously. RL algorithms can exploit transformers to learn more efficient policies in significantly less time.

Scalability. Current RL algorithms struggle to scale effectively to complex tasks that require the integration of multiple skills (Zhan et al., 2017; Kalashnikov et al., 2018). However, the performance of transformers has been shown to improve smoothly as the size of the model, dataset, and compute increases (Kaplan et al., 2020). This ability can potentially be leveraged in RL to create generalist agents capable of performing various tasks in different environments and with different embodiments (Lee et al., 2022).

These points highlight how the properties of transformers make them attractive for RL. In the following sections, we examine the use of transformers in each stage of the RL workflow, including representation learning (Sec. 3), policy learning (Sec. 6), learning the reward function (Sec. 5), and modeling the environment (Sec. 4). Additionally, we cover various training strategies (Sec. 7) and techniques for interpreting RL policies that use transformers (Sec. 8).

3. Representation Learning

Concise and meaningful representations are critical for efficient decision-making (Lesort et al., 2018) in RL. Empirical evidence has shown that training agents directly on high dimensional data, such as image pixels, is sample inefficient (Lake et al., 2016). Therefore, good data representations are crucial for learning RL policies (Lesort et al., 2018; Laskin et al., 2020) since they can enhance performance, convergence speed, and policy stability (Ghosh and Bellemare, 2020).

For instance, in self-driving cars, the raw sensory inputs $\mathbf{o}_t$ (e.g., camera images, LIDAR readings) are high dimensional and often contain redundant information. If these inputs $\mathbf{o}_t$ are mapped to a compact representation $\mathbf{s}_t$, the RL agent can learn more efficiently. Similarly, in game-playing scenarios (Fig. 6), it is helpful to encode and extract relevant features from the pixels to be used as input to the RL algorithm learning the policy. Transformers can produce transferable and discriminative feature representations for diverse data modalities (Zhou et al., 2021; Brown et al., 2020; Zhang et al., 2022d; Choi et al., 2020; Ying et al., 2021).

Figure 6. Representation learning simplifies the information given to RL agents by condensing the original high-dimensional input (here a frame from an Atari game). Transformers can learn low-dimensional representations 𝐬 from high-dimensional observations 𝐨. In this case, this is achieved by employing a vision transformer. The input is provided in multiple patches, and the transformer encoder learns a shared low-dimensional representation contextualized using a <cls> input token.

3.1. Comparisons between Transformers, CNNs, and GNNs

Encoding high-dimensional representations using pre-trained CNN and transformer architectures is an active research area. Both approaches yield comparable performance in computer vision tasks (Woo et al., 2023; Liu et al., 2022c; Malpure et al., 2021). However, several studies (Zhou et al., 2021; Zhang et al., 2022d) have shown that transformers generate more expressive representations than CNNs for tasks where the data distribution differs at training and test time. This advantage stems from the inherent inductive bias of CNNs towards local spatial features, which limits their ability to capture the global dependencies necessary for reasoning (Vo et al., 2017). Transformers can encode the image as a sequence of patches without local convolution and resolution reduction (Fig. 6). Hence, they model the global context in every layer, leading to a stronger representation for learning efficient policies (Dosovitskiy et al., 2021). Transformers exhibit comparable generalization capabilities to graph neural networks (GNNs) on graphs (Dwivedi and Bresson, 2020) and, in some instances, outperform them by capturing long-range semantics (Ying et al., 2021).

Multi-task reinforcement learning (MTRL) is a learning paradigm where an agent is trained to perform multiple tasks simultaneously. It has traditionally relied on GNNs to handle incompatible environments (i.e., differing state-action spaces) (Wang et al., 2018; Huang et al., 2020a). This is due to the ability of GNNs to operate on graphs of variable sizes. However, Kurin et al. (2021) hypothesize that the restrictive nature of message passing in sparse graphs may adversely impact performance. They propose replacing GNNs with transformers which obviates the need to learn multi-hop communication; the transformer can be considered a GNN applied to fully-connected graphs with attention as an edge-to-vertex aggregation operation (Battaglia et al., 2018). This enables a dedicated message-passing scheme for each state and pass, effectively avoiding the requirement for multi-hop message propagation. This overcomes the challenges of gradient propagation and information loss arising from such multi-hop propagation. The transformer-based model of Kurin et al. (2021), Amorpheus, learns better representations and improves performance without imposing a relational inductive bias.

3.2. Advanced Representation Learning using Transformers

Transformers, in combination with other attention mechanisms, enable the learning of expressive representations. SloTTAr (Gopalakrishnan et al., 2023) combines a transformer encoder-decoder architecture with slot attention (Locatello et al., 2020). The transformer encoder focuses on learning spatio-temporal features from action-observation sequences. Utilizing the slot attention mechanism, features are grouped at each temporal location, resulting in K slot representations. The decoder subsequently decodes these slot representations to generate action logits. Notably, this parallelizable process enables faster training compared to existing benchmarks.

In multi-agent reinforcement learning, transformers have proven effective in modeling relations among agents and the environment (Zhang et al., 2022b). Li et al. (2022a) proposed replacing RNNs with a transformer encoder for robust temporal learning. Similarly, Zhang et al. (2022a) used a visual feature extractor based on the ViT architecture to obtain more robust representations for robotic visual exploration. Their network, utilizing self-attention, outperformed CNN backbones in robotic tasks.

Transformers have been widely adopted in scenarios involving the processing of multimodal information. Yang et al. (2022) introduced scene-fusion transformers that fuse observed trajectories and scene information to generate expressive representations for trajectory prediction. To reduce computational complexity, they employ sparse self-attention. Zhang et al. (2023) utilize transformers to integrate visual and text features effectively.

3.3. Enhancing Transferability and Generalization

An inherent difficulty faced by RL is generalizing to new unseen tasks (Levine et al., 2020b). This difficulty results from the intrinsic differences between various RL tasks (e.g., autonomous driving and drug discovery). While meta-learning methods such as model-agnostic meta-learning (MAML) (Finn et al., 2017) have been developed to generalize to new tasks with different distributions using limited data, these methods are hard to use in RL due to poor sample-efficiency and unstable training (Liu et al., 2019).

Transformers have shown great potential for meta-reinforcement learning, as demonstrated by the TrMRL agent of Melo (2022). Transformers are excellent at handling long sequences and capturing dependencies over long periods, which enables them to adapt quickly to new tasks using self-attention. In TrMRL, the proposed agent uses self-attention blocks to create an episodic memory representing a consensus of recent working memories. The transformer architecture encodes the working memory and tasks as a distribution over these memories. During meta-training, the agent learns to differentiate tasks and identify similarities in the embedding space. This approach performs comparably or better than PEARL (Rakelly et al., 2019) and MAML (Finn et al., 2017). It is particularly efficient in memory refinement and task association.

Shang et al. (2022) introduced the state-action-reward transformer (StARformer) to model multiple data distributions by learning transition representations between individual time steps of the state, action, and reward. StARformer consists of the step transformer and the sequence transformer. The step transformer uses self-attention to capture a local representation that understands the relationship between state-action-reward triplets within a single time-step window. The sequence transformer combines these local representations with global representation in the form of pure state features extracted as convolutional features, introducing a Markovian-like inductive bias. This bias helps reduce model capacity while effectively capturing long-term dependencies.

4. Transition Function Learning

The transition function p(𝐬′,r|𝐬,𝐚) describes how the environment transitions from the current state 𝐬 to the next state 𝐬′ and issues rewards r in response to the actions 𝐚 taken by the agent. Learning this function (Fig. 7) and subsequently exploiting it to train an RL agent is known as model-based RL (Sec. 2.1). Model-based RL offers a significant advantage compared to model-free RL approaches (Moerland et al., 2020); it allows the agent to plan future trajectories for each action, improving robustness and safety. Interactions with the external environment can be computationally expensive, particularly when relying on simulations that mimic the real world (Featherstone, 2014). If we learn the transition function, these interactions can be reduced.
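As a rough illustration of learning a transition function from pre-collected trajectories, the sketch below estimates the probability of each (next state, reward) outcome for every (state, action) pair by counting. This tabular estimator is a deliberately simple stand-in for the transformer-based world models discussed in this section; the state names and trajectories are hypothetical.

```python
from collections import Counter, defaultdict


def fit_transition_model(trajectories):
    """Estimate p(next_state, reward | state, action) from logged transitions by counting."""
    counts = defaultdict(Counter)
    for trajectory in trajectories:
        for state, action, reward, next_state in trajectory:
            counts[(state, action)][(next_state, reward)] += 1
    model = {}
    for key, outcomes in counts.items():
        total = sum(outcomes.values())
        # Normalize the counts into an empirical conditional distribution.
        model[key] = {outcome: n / total for outcome, n in outcomes.items()}
    return model


# Two hypothetical logged trajectories of (state, action, reward, next_state) tuples.
trajectories = [
    [("s0", "right", 0.0, "s1"), ("s1", "right", 1.0, "s2")],
    [("s0", "right", 0.0, "s1"), ("s1", "right", 0.0, "s1")],
]
model = fit_transition_model(trajectories)
```

Once such a model is available, a policy can be trained on transitions sampled from `model` rather than from the environment, which is the source of the reduced interaction cost.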

Figure 7. A transformer can be employed to learn the dynamics of the simulator by modeling the transition function. This is accomplished by leveraging pre-collected trajectories of environment interaction. The transformer-based world model is trained using these trajectories. An RL policy π is then trained to learn the task using this learned world model without requiring further environmental interaction.

A standard method in model-based reinforcement learning (MBRL) (Ha and Schmidhuber, 2018) involves training an end-to-end world model to represent the environment’s dynamics accurately. For instance, TransDreamer (Chen et al., 2022a) trains a single model that learns visual representations and dynamics using the evidence lower bound loss (Kingma and Welling, 2019). However, this approach can result in inaccuracies in the learned world model.

The masked world model (MWM) (Seo et al., 2022) addresses this by decoupling visual representation and dynamics learning. This framework utilizes an autoencoder with convolutional layers and ViT to learn visual representations. The autoencoder reconstructs pixels based on masked convolutional features. A latent dynamics model is learned by operating on the representations from the autoencoder. An auxiliary reward prediction objective is introduced for the autoencoder to encode task-relevant information. Importantly, this approach outperforms the strong RNN-based model, DreamerV2 (Hafner et al., 2021) in terms of both sample efficiency and final performance on various robotic tasks.

Learning the dynamics of the world 𝐳_{t+1} ∼ p_G(𝐳_{t+1} | 𝐳_t, 𝐚_t) has been formulated as a sequence modeling problem in imagination with auto-regression over an inner speech (IRIS) (Micheli et al., 2022). This approach takes advantage of the transformer’s ability to process sequences of discrete tokens. IRIS uses a discrete autoencoder to construct a language of image tokens, while a transformer models the dynamics over these tokens. By simulating millions of trajectories accurately, IRIS surpasses recent methods in the Atari 100k benchmark (Bellemare et al., 2015) in just two hours of real-time experience.

Building upon the auto-regressive nature of transformer decoders, Robine et al. (2023) introduces the transformer-based world model (TWM). Based on the TrXL architecture, TWM learns the transition function from real-world episodes while attending to latent states, actions, and rewards associated with each time step. By allowing direct access to previous states instead of viewing them through a compressed recurrent state, the TrXL architecture enables the world model to learn long-term dependencies while maintaining computational efficiency.

5. Reward Learning

The reward function is crucial in RL as it quantifies the desirability of different actions 𝐚 for a given state 𝐬, guiding the learning process. Typically, reward functions are predefined by human experts who carefully consider relevant factors based on their domain knowledge. However, designing an appropriate reward function is challenging in real-world scenarios, requiring a deep understanding of the problem domain. Moreover, manually designing it introduces bias and may lead to sub-optimal behavior.

Recent research has explored different approaches for learning reward functions by integrating human data in various forms, such as real-time feedback, expert demonstrations, preferences, and language instructions. Transformers have proven valuable in these contexts. The transformer architecture is particularly advantageous with non-Markovian rewards, which are characterized by delays and dependence on the sequence of states encountered during an episode (e.g., when rewards are only provided at the end). Transformers efficiently capture dependencies across input sequences, making them well-suited to handle such scenarios.

Figure 8. Nakatani et al. (2022) use a BERT-based reward function r(𝐬,𝐚) to provide feedback to the RL policy π to learn the task of machine translation. Given the input sentence 𝐬, the policy π translates it into a different language. This predicted action 𝐚 is then compared with the optimal action 𝐚+ using BERT, which provides feedback to the agent in the form of a reward r.

The preference transformer (Kim et al., 2023) model captures human preferences by focusing on crucial events and modeling the temporal dependencies inherent in human decision-making processes; it effectively predicts non-Markovian rewards and assigns appropriate importance weights based on the trajectory segment. This approach reduces the effort required for designing reward functions and enables handling complex control tasks such as locomotion, navigation, and manipulation.

To train an RL policy for generating text that aligns with human-labeled ground truth, the bilingual evaluation understudy (BLEU) score (Papineni et al., 2002) is often used as a reward function. However, BLEU does not always correlate strongly with human evaluation. Nakatani et al. (2022) introduce a BERT-based reward function that demonstrates a higher correlation with human evaluation. This approach leverages a pre-trained BERT model (Fig. 8) to assess the semantic similarity between the generated and reference sentences and update the policy accordingly.
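To show how an n-gram overlap score can act as a scalar reward for a generated sentence, the following sketch implements a simplified, sentence-level, BLEU-style score: clipped n-gram precision up to bigrams, combined by a geometric mean and multiplied by a brevity penalty. It is an assumption-laden simplification (single reference, no smoothing, `max_n=2`) and not the full corpus-level BLEU metric, nor the BERT-based reward of Nakatani et al. (2022).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap_reward(candidate, reference, max_n=2):
    """Simplified BLEU-style reward in [0, 1] for one candidate and one reference."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(clipped / total)
    # Brevity penalty discourages candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "the cat sat down".split()
reward_exact = overlap_reward("the cat sat down".split(), reference)
reward_short = overlap_reward("the cat sat".split(), reference)
reward_none = overlap_reward("a dog ran".split(), reference)
```

A score of this kind only rewards surface overlap, which is one reason it can disagree with human judgments when a translation is correct but worded differently.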

6. Policy Learning

Policy learning is central to RL; it involves learning the policy π(𝐬) which the agent uses to select actions 𝐚=π(𝐬) with the objective of maximizing the discounted cumulative reward g. Transformers have been used for modeling π(𝐬) in various scenarios, including off-policy, on-policy, and offline RL.

6.1. Offline RL with the Decision Transformer

Offline RL trains a policy using a limited, static dataset of previously collected experiences. This is different from online or off-policy RL approaches (which continuously interact with the environment to update their policies) since the agent cannot collect experience beyond the fixed dataset, which limits its ability to learn, explore, and improve performance.

Figure 9. The decision transformer approaches offline RL as a sequence-prediction task. The network uses a sequence of states, actions, and return-to-go (RTG) to predict the next action. At inference, the RTG acts as a condition for the policy to generate a given set of actions.

The decision transformer (DT) (Chen et al., 2021a) (Fig. 9) is an offline RL method that uses the upside-down RL paradigm (see Sec. 2.1). It uses a transformer-decoder to predict actions conditioned on past states, past actions, and expected return-to-go (the sum of the future rewards). The parameters are optimized by minimizing the cross-entropy (discrete) or mean square error (continuous) loss between the predicted and actual actions.
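The input construction described above can be sketched as follows: the return-to-go at each step is the undiscounted sum of the rewards from that step onward, and the sequence interleaves return-to-go, state, and action tokens. The helper names and the toy episode are illustrative assumptions; the embedding layers and the transformer itself are omitted.

```python
def returns_to_go(rewards):
    """Return-to-go at step t: the sum of rewards from t to the end of the episode."""
    rtg, running = [], 0.0
    for reward in reversed(rewards):
        running += reward
        rtg.append(running)
    rtg.reverse()
    return rtg


def build_dt_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) tokens in time order."""
    sequence = []
    for g, s, a in zip(returns_to_go(rewards), states, actions):
        sequence.extend([("rtg", g), ("state", s), ("action", a)])
    return sequence


# Hypothetical three-step episode.
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]
rewards = [1.0, 0.0, 2.0]
sequence = build_dt_sequence(states, actions, rewards)
```

At inference time, the first return-to-go token would be set to a desired target return and decremented by each observed reward, so that the model is conditioned on the return still to be achieved.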

DT uses the GPT architecture to address the credit-assignment problem; the self-attention mechanism can associate rewards with the corresponding state-action pairs across long time intervals. This also allows the DT policy to learn effectively even in the presence of distracting rewards (Hung et al., 2018). Empirical experiments demonstrate that the DT outperforms state-of-the-art model-free offline approaches on offline datasets such as Atari and Key-to-Door tasks.

The DT is a model-free approach that predicts actions based on past trajectories without forecasting new states, so it cannot plan future actions. This limitation is addressed by the trajectory transformer (TT) (Janner et al., 2021), an MBRL approach that formulates RL as a conditional sequence modeling problem. TT models past states, actions, and rewards to predict future actions, states, and rewards effectively. Using rewards as inputs prevents myopic behavior and enables the agent to plan future actions through search methods like beam-search (Negrinho et al., 2018).
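The role of beam search in avoiding myopic choices can be illustrated with a generic sketch. Here `expand` is a hypothetical stand-in for a learned sequence model: given a partial sequence, it returns candidate next tokens with a predicted reward for each. The one-dimensional toy world is an assumption made purely for illustration and is unrelated to the benchmarks used by TT.

```python
def beam_search(initial, expand, beam_width, horizon):
    """Keep the beam_width highest-scoring partial sequences at every step."""
    beams = [(0.0, [initial])]
    for _ in range(horizon):
        candidates = []
        for score, sequence in beams:
            for token, step_score in expand(sequence):
                candidates.append((score + step_score, sequence + [token]))
        if not candidates:
            break
        candidates.sort(key=lambda item: item[0], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


# Toy world: positions on a line, with a small reward nearby and a large one farther away.
REWARDS = {-1: 0.5, 2: 5.0}


def expand(sequence):
    position = sequence[-1]
    return [(position + step, REWARDS.get(position + step, 0.0)) for step in (-1, 1)]


greedy_score, greedy_plan = beam_search(0, expand, beam_width=1, horizon=2)
wide_score, wide_plan = beam_search(0, expand, beam_width=2, horizon=2)
```

With a beam width of one, the search commits to the small immediate reward; with a wider beam, the initially less attractive branch is retained and leads to the larger cumulative reward.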

This task-specific conditioning of agents offers flexibility in learning complex tasks. Prompt-based DT (Xu et al., 2022b) enables few-shot adaptation in offline RL. The input trajectory, which acts as a prompt, contains segments of few-shot demonstrations, encoding task-specific information to guide policy generation. This approach allows the agent to exploit offline trajectories collected from different tasks and adapt to new scenarios for generalizing to unseen tasks. Similarly, the text decision transformer (TDT) (Putterman et al., 2021) employs natural language signals to guide policy-based language instruction in the Atari-Frostbite environment.

However, DTs face several challenges. They struggle to learn effectively from sub-optimal trajectories. In a stochastic environment, their performance tends to degrade since the action taken may have been sub-optimal, and the achieved outcomes are merely a result of random environment transition. Insufficient distribution coverage of the environment is another challenge in offline RL approaches like DT. To overcome these challenges, solutions such as Q-learning decision transformer (QDT) (Yamagata et al., 2022) re-label the return-to-go using a more accurate learned Q-function. Environment-stochasticity-independent representations (ESPER) (Paster et al., 2022) addresses performance degradation in stochastic environments by conditioning on average return. Additionally, bootstrapped transformer (BooT) (Wang et al., 2022e) incorporates bootstrapping to generate more offline data. By adopting these approaches, the learning capabilities of DTs can be improved, enabling more effective and robust policies in various scenarios.

6.2. Online RL with Transformers

Transformers have also been applied to online RL, where the agent interacts with the environment while learning. In realistic environments, issues such as noisy sensors, occluded images, or unknown agents introduce the problem of partial observability. This makes it difficult for agents to choose the correct action (Sec. 2.2). Here, retaining recent observations in memory is crucial to help disambiguate the true state. Traditionally, this problem has been approached using RNNs, but transformers can provide better alternatives.

Figure 10. The deep transformer Q-network (Esslinger et al., 2022) employs a transformer decoder to learn a policy Qπ(𝐒) that maps a given state to its corresponding Q value for each action. During training, the network predicts Q values for all the past observations, enabling policy updates via Bellman error, which measures the discrepancy between the current and updated value estimates obtained through the TD error. The network uses the current state’s Q value to predict the optimal action at inference.

The deep transformer Q-network (DTQN) (Esslinger et al., 2022) addresses the challenge of partially observable environments using a transformer decoder architecture. At each time step of training, it receives the agent’s previous k observations and generates k sets of Q-values. This unique training strategy encourages the network to predict Q-values even in contexts with incomplete information, leading to a more robust agent. During evaluation, it selects the action with the highest Q-value from the last time-step in its history (Fig. 10).
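The training and evaluation scheme above can be sketched under the assumption of a standard DQN-style temporal-difference target applied at every position of the history. The Q-values below are hypothetical numbers standing in for network outputs, and the use of a separate set of next-step Q-values (as from a target network) is an assumption of this sketch.

```python
def greedy_action(q_history):
    """At evaluation, act greedily with respect to the Q-values of the last time step."""
    last_q = q_history[-1]
    return max(range(len(last_q)), key=lambda a: last_q[a])


def td_targets(rewards, next_q_history, dones, gamma):
    """One TD target per position in the history: r + gamma * max_a Q(next, a)."""
    return [r if done else r + gamma * max(next_q)
            for r, next_q, done in zip(rewards, next_q_history, dones)]


def mean_squared_td_error(q_history, actions, targets):
    """Average squared error over all k positions, not only the last one."""
    errors = [(q[a] - y) ** 2 for q, a, y in zip(q_history, actions, targets)]
    return sum(errors) / len(errors)


# Hypothetical outputs for a history of k = 3 observations and 2 actions.
q_history = [[0.1, 0.4], [0.3, 0.2], [0.0, 1.0]]
next_q_history = [[0.3, 0.2], [0.0, 1.0], [0.0, 0.0]]
actions, rewards, dones = [1, 0, 1], [0.0, 0.0, 1.0], [False, False, True]

targets = td_targets(rewards, next_q_history, dones, gamma=0.5)
loss = mean_squared_td_error(q_history, actions, targets)
action = greedy_action(q_history)
```

Computing the loss at every position means that early positions, which see only a short history, must also produce useful Q-values, which is the intuition behind the robustness claim above.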

The DTQN incorporates a learned positional encoding, which enables the network to adapt to different domains by learning domain-specific temporal dependencies. This domain-specific encoding matches the temporal dependencies of each environment and allows the DTQN to adapt to environments with varying levels of temporal sensitivity. The DTQN demonstrates superior learning speed and outperforms previous recurrent approaches in various partially observable domains, including gym-gridverse, car flag, and memory cards (Morad et al., 2023).

6.3. Transformers for Multi-Agent Reinforcement Learning

MARL (Sec. 2.1) presents unique challenges as agents learn and adapt their behaviors through interactions with other agents and the environment. One such challenge stems from the model architecture’s fixed input and output dimensions, which means that different tasks must be trained independently from scratch (Shao et al., 2019b; Wang et al., 2020b). Consequently, zero-shot transfer across tasks is limited.

Another challenge arises from the failure to disentangle observations from different agents (Hu et al., 2021). When all information from various agents or environments is treated equally, it can result in misguided decisions by individual agents. This challenge becomes particularly prominent when utilizing a centralized value function, which serves as a shared estimate of actions and state value across multiple agents to guide their behavior (Chen and Tan, 2023). As a result, appropriately assigning credit to individual agents becomes difficult.

Figure 11. The multi-agent transformer (MAT) (Wen et al., 2022) employs an encoder-decoder architecture. The encoder receives a series of observations from the agents and converts them into latent representations. These representations are then fed into the decoder. The decoder generates each agent’s optimal action step-by-step sequentially and auto-regressively. To ensure proper training, the masked attention blocks restrict agents to only access actions from preceding agents.

The universal policy decoupling transformer (UPDeT) (Hu et al., 2021) is designed to handle challenges in tasks with varying observation and action configuration requirements. It achieves this by separating the action space into multiple action groups, effectively matching related observations with corresponding action groups. UPDeT improves the decision-making process by employing a self-attention mechanism and optimizing the policy at the action-group level. This enhances the explainability of decision-making while allowing for high transfer capability to new tasks.

This characteristic is also observed in the multi-agent transformer (MAT) (Wen et al., 2022). MAT (Fig. 11) transforms the joint policy search problem into a sequential decision-making process, allowing for parallel learning of agents’ policies regardless of the number of agents involved. The encoder utilizes the self-attention mechanism to process a sequence of each agent’s observations, capturing their interactions. This generates a sequence of latent representations that are then fed into the decoder. The decoder, in turn, produces each agent’s optimal action in an auto-regressive and sequential manner. As a result, MAT possesses robust generalization capabilities, surpassing multi-agent proximal policy optimization (MAPPO) (Lohse et al., 2021), and heterogeneous-agent proximal policy optimization (HAPPO) (Kuba et al., 2022) in few-shot experiments on multi-agent MuJoCo tasks.
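The sequential, auto-regressive action generation can be sketched as a loop over agents in which each agent conditions only on the actions already chosen by the preceding agents. The function `toy_decoder_step` is a hypothetical placeholder for the learned decoder, and the integer latents are illustrative; this is not the MAT implementation.

```python
def decode_joint_action(observation_latents, decoder_step):
    """Generate one action per agent, each conditioned on the preceding agents' actions."""
    actions = []
    for latent in observation_latents:
        # Agent i sees its latent and the actions of agents 0..i-1 only.
        actions.append(decoder_step(latent, tuple(actions)))
    return actions


def preceding_action_mask(num_agents):
    """mask[i][j] is True when agent i may attend to the action of agent j (j < i)."""
    return [[j < i for j in range(num_agents)] for i in range(num_agents)]


def toy_decoder_step(latent, previous_actions):
    """Hypothetical decoder: any deterministic function of latent and prior actions."""
    return (latent + sum(previous_actions)) % 3


joint_action = decode_joint_action([1, 2, 0], toy_decoder_step)
mask = preceding_action_mask(3)
```

During training, a mask of this form lets all agents' outputs be computed in one parallel pass while still preventing any agent from observing the actions of agents that come later in the ordering.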

However, MARL faces limitations in real-world applications due to the curse of many agents (Wang et al., 2020a), which stems from the exponentially growing state-action space as the number of agents increases. This presents challenges in learning the value functions and policies of the agents, leading to inefficient relational reasoning among them and credit-assignment problems. Concatenating the state-action spaces of individual agents and treating them as a single-agent problem leads to exponential state and time complexity (Zhou et al., 2020). Additionally, independent learning of policies may struggle to converge without cooperation (Gupta et al., 2021).

TransMix (Khan et al., 2022) tackles the challenge through a centralized learning approach, enabling agents to exchange information during training. During policy execution, each agent relies on a partially observable map. The action space in the star-craft multi-agent challenge (SMAC) (Vinyals et al., 2017) encompasses various actions, including moving units, attacking enemies, gathering resources, constructing buildings, and issuing commands to control the game state. Utilizing transformers, TransMix captures global and local contextual interactions among agent Q-values, histories, and global state information, facilitating efficient credit assignment.

The transformer’s ability to reason about relationships between agents improves results in both model-free MARL (regardless of the number of agents) and model-based MARL (with a logarithmic dependence on the number of agents) (Guedj, 2019). Notably, modeling the transformer’s self-attention with other neural network types requires an impractically large number of trainable parameters, highlighting the significance of self-attention in capturing agent interactions (Guedj, 2019). Moreover, the transformer’s performance remains stable across different agent counts, with accuracy impacted by neural network depth (Guedj, 2019), making it highly efficient for MARL.

7. Training Strategy

Training transformers poses challenges due to their reliance on residual branches, which amplify minor parameter perturbations, disrupting model output (Liu et al., 2020); specialized optimizers and weight initializers are needed for successful training. Likewise, the training of RL policies can be unstable (Nikishin et al., 2018) and require distinct strategies for achieving optimal performance. Hence, the integration of transformers into RL is particularly challenging. These challenges can manifest as sudden or extreme changes in performance during training, impeding effective learning and generalization.

The standard transformer architecture is difficult to optimize using RL objectives and needs extensive hyper-parameter tuning, which is time-consuming. Here, we review strategies for training transformers in RL. These include pre-training and transfer learning to expedite learning, improved weight initialization to mitigate gradient issues, and efficient layer utilization for capturing relevant information.

7.1. Pre-Training and Transfer Learning

Transformers can be pre-trained on large, reward-free datasets, providing opportunities to fine-tune when only small annotated datasets are available. Meng et al. (2021) propose using DT to pre-train agents on large, reward-free offline datasets of prior interactions. During pre-training, reward tokens are masked, allowing the transformers to learn to predict actions based on the previous state and action content while extracting behavior from the dataset. This pre-trained model can then be fine-tuned with a small, reward-annotated dataset to learn the skills necessary to achieve the desired behavior based on the reward function.
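The masking step can be illustrated with a small sketch that replaces every reward-related token in a trajectory sequence with a mask value while leaving state and action tokens intact. The token format, the `MASK` placeholder, and the token kind names are assumptions made for illustration and do not reflect the exact input format of Meng et al. (2021).

```python
MASK = None  # placeholder value standing in for a learned mask embedding


def mask_reward_tokens(sequence, reward_kinds=("rtg", "reward")):
    """Hide reward information so the model must predict actions from states and actions alone."""
    return [(kind, MASK) if kind in reward_kinds else (kind, value)
            for kind, value in sequence]


# Hypothetical two-step trajectory in (kind, value) token form.
trajectory = [
    ("rtg", 3.0), ("state", "s0"), ("action", "a0"),
    ("rtg", 2.0), ("state", "s1"), ("action", "a1"),
]
pretraining_input = mask_reward_tokens(trajectory)
```

For fine-tuning on the small reward-annotated dataset, the same sequences would be used without masking, so that the pre-trained model learns to relate its behavior to the reward signal.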

Transfer learning is challenging when the environment dynamics change. A training method for DT (Boustati et al., 2021) addresses this challenge by using counterfactual reasoning. It generates counterfactual trajectories in an alternative environment, which are used to train a more adaptable learning agent. This process aids in regularizing the agent’s internal representation of the environment, enhancing its adaptability to structural changes. Moreover, unsupervised pre-training of vision and sequence encoders has also improved downstream few-shot learning performance (Putterman et al., 2021). By leveraging pre-trained models, the agent can quickly adapt to new, unseen environments and achieve higher performance with limited training data.

7.2. Stabilizing Training

In the RL setting, transformer models require learning rate warmup to prevent divergence caused by backpropagation through the layer normalization modules, which can destabilize optimization. To enhance stability, Melo (2022) proposes to use T-Fixup initialization (Huang et al., 2020b). This applies Xavier initialization (Glorot and Bengio, 2010) to all parameters except input embeddings, eliminating the need for learning rate warmup and layer normalization. It is crucial in environments where learned behavior guides exploration; it addresses instability during early training stages when policies are more exploratory and prevents convergence to sub-optimal policies.
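For reference, the Xavier (Glorot) uniform scheme draws each weight from a uniform distribution whose bound depends on the layer's fan-in and fan-out. The sketch below shows only this base initializer; any additional rescaling that T-Fixup applies on top of it is deliberately omitted, and the layer sizes are hypothetical.

```python
import math
import random


def xavier_uniform(fan_in, fan_out, rng):
    """Sample a fan_out x fan_in weight matrix from U(-b, b), b = sqrt(6 / (fan_in + fan_out))."""
    bound = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
weights = xavier_uniform(fan_in=8, fan_out=4, rng=rng)
bound = math.sqrt(6.0 / (8 + 4))
```

The bound shrinks as layers get wider, which keeps the variance of activations and gradients roughly constant across layers at the start of training.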

The gated transformer-XL (GTrXL) architecture (Parisotto et al., 2020) has demonstrated promising results in stabilizing RL training and improving performance. It improves upon the original TrXL architecture by applying layer normalization exclusively to the input stream within the residual model rather than the shortcut stream. This modification allows the initial input to propagate through all the layers, promoting training stability. GTrXL replaces the residual connection with a gated recurrent unit (GRU)-style gating mechanism. This gating mechanism regulates information flow through the network, controlling the amount of information passed via the shortcut. This added flexibility enhances the model’s adaptability to RL scenarios and facilitates stable training.
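A GRU-style gate that replaces the residual sum can be sketched in scalar form as follows. Here `x` is the shortcut-stream input and `y` is the output of the sub-layer (attention or feed-forward); the scalar weights in `params` and the value of `gate_bias` are illustrative assumptions, whereas a real layer would use weight matrices and vector-valued streams.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def gru_gate(x, y, params, gate_bias):
    """Combine shortcut input x and sub-layer output y with a GRU-style update."""
    reset = sigmoid(params["w_r"] * y + params["u_r"] * x)
    # A positive gate_bias pushes the update gate towards 0, i.e. towards passing x through.
    update = sigmoid(params["w_z"] * y + params["u_z"] * x - gate_bias)
    candidate = math.tanh(params["w_g"] * y + params["u_g"] * (reset * x))
    return (1.0 - update) * x + update * candidate


params = {"w_r": 0.5, "u_r": 0.5, "w_z": 0.5, "u_z": 0.5, "w_g": 0.5, "u_g": 0.5}
x, y = 0.7, -0.3
near_identity = gru_gate(x, y, params, gate_bias=20.0)
mixed = gru_gate(x, y, params, gate_bias=0.0)
```

With a large positive bias the gate output is almost exactly `x`, so the layer initially behaves like an identity map over the shortcut stream, which is one way such a gate can ease optimization early in training.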

8. Interpretability

Interpretability of the learned RL policies is desirable in safety-critical applications like healthcare and autonomous driving (Glanois et al., 2021). This helps in building trust, facilitating debugging, and promoting ethical and fair decision-making. However, achieving interpretability has been a significant challenge and a bottleneck in the progress of RL (Milani et al., 2022; Heuillet et al., 2021).

One way to interpret transformers is to visualize the attention weights using heatmaps (Zhang et al., 2022c). This helps to understand which features are used to learn the particular task. In multi-agent scenarios, these visualizations reveal the localized areas of the input space where the individual agents focus their attention, facilitating coordinated and cooperative behavior. For instance, Motokawa and Sugawara (2021) introduce a multi-agent transformer deep Q-network (MAT-DQN) that integrates transformers into a deep Q-network. Using heatmaps, MAT-DQN provides insights into the important input information that influences the agent’s decision-making process for cooperative behavior.

Analyzing attention heatmaps unveils the agent’s ability to consider other agents, relevant objects, and pertinent tasks, allowing for a clear interpretation of the policy. Such visualization is critical in sparse reward settings, where understanding which past state had the most influence on decision-making is crucial. Attention-augmented memory (AAM) (Qu et al., 2022) exemplifies this by combining the current observation with memory. This enables the agent to understand “what” the agent observes in the current environment and “where” it directs its attention in its memory.

An interesting method for enhancing interpretability involves the use of transformers in neuro-symbolic policies (Bastani et al., 2020). Neuro-symbolic policies combine programs and neural networks to improve interpretability and flexibility in RL tasks. Specifically, a neuro-symbolic transformer is a variant of the traditional transformer model that incorporates programmatic policies into the attention mechanism. Instead of utilizing a neural network, the attention layer employs a program to determine the relevant inputs to focus on. These programmatic policies can take various forms, including decision trees, rule lists, and state machines. This approach improves interpretability by providing a more precise understanding and visualization of why agents attend to specific inputs.

However, it has been demonstrated that attention weights alone are unreliable predictors of the importance of intermediate components in NLP (Serrano and Smith, 2019; Bai et al., 2021), leading to inaccurate explanations of model decisions; learned attention weights often highlight less meaningful tokens and exhibit minimal correlation with other feature importance indicators like gradient-based measures. Furthermore, relying solely on attention weights can result in fragmented explanations that overlook most other computations. Recent work has introduced the assignment of a local relevancy score (Chefer et al., 2021). These scores are propagated through layers to achieve class-based separation and enhance the interpretability of the transformers. This approach holds promise for future research to improve the interpretability of RL policies.

9. Applications

RL has traditionally been constrained to unrealistic scenarios in virtual environments. However, with modern deep neural network architectures, there has been a notable shift towards employing RL to address a broader range of practical challenges. The following section describes real-world applications where RL powered by transformers can make a substantial impact.

9.1. Robotics

In robotics, autonomous agents automate complex real-world tasks; a classic example is autonomous driving. Here, learning the RL policy for trajectory planning is essential: it involves forecasting the future positions of one or more agents in an environment while considering contextual information. This requires adequate planning and coordination among agents by modeling their spatial and temporal interactions.

Several studies have proposed to use transformers for processing sequences of high-dimensional scene observations for predicting actions. A recent study (Kargar and Kyrki, 2022) uses ViT to extract spatial representations from a bird’s-eye view of the ego vehicle to learn driving policies. Compared with CNNs, ViTs are more effective in capturing the global context of the scene. The attention mechanism used in ViTs allows the policy to discern the neighboring cars that are pivotal in the decision-making process of the ego vehicle. As a result, the ViT-based DQN agent outperforms its CNN-based counterparts. Liu et al. (2022b) introduce a transformer architecture to encode heterogeneous information, including the historical state of the ego vehicle and candidate route waypoints, into the scene representation. This approach enhances sample efficiency and results in more diverse and successful driving behaviors during inference. The object memory transformer (Fukushima et al., 2022) explores how long-term histories and first-person views can enhance navigation performance in object navigation tasks. An object scene memory stores long-term scene and object semantics, focusing attention on the most salient event in past observations. The results indicate that incorporating long-term object histories with temporal encoding significantly enhances prediction performance.

Transformers also excel in capturing both spatial relationships and intra-agent interactions, making them ideal for facilitating cooperative exploration and developing intelligent embodied agents. Multi-agent active neural SLAM (MAANS) (Yu et al., 2022) addresses the challenge of cooperative multi-agent exploration (Oroojlooy and Hajinezhad, 2023), where multiple agents collaborate to explore unknown spatial regions. This approach extends the single-agent active neural SLAM (Chaplot et al., 2020) method to the multi-agent setting and utilizes a multi-agent spatial planner with a self-attention-based architecture known as the Spatial-TeamFormer. This hierarchically integrates intra-agent interactions and spatial relationships, employing two layers: an individual spatial encoder that captures spatial features for each agent, and a team relation encoder for reasoning about interactions among agents. To focus on spatial information, the intra-agent self-attention performs spatial self-attention over each agent’s spatial map independently. The team relation encoder focuses on capturing team-wise interactions without leveraging spatial information. This allows MAANS to outperform planning-based competitors in a photo-realistic environment, as shown in experiments on Habitat (Savva et al., 2019).

9.2. Medicine

RL has the potential to assist clinicians; tasks involving diagnosis, report generation, and drug discovery can be considered sequential decision-making problems (Yu et al., 2023).

Disease Diagnosis. Diagnosing a medical condition involves modeling a patient’s information (e.g., treatment history, present signs, and symptoms) to accurately understand the disease. Chen et al. (2022b) propose a model for disease diagnosis called the DxFormer. This employs a decoder-encoder transformer architecture in which the decoder inquires about implicit symptoms, while the encoder performs disease diagnosis by treating the input sequence of symptoms as a sequence classification task. To facilitate symptom inquiry, the decoder is formulated as an agent that interacts with a patient simulator in a serialized manner, generating possible symptom tokens that may co-occur with previously known symptoms and inquiring about them. The inquiry process proceeds until the confidence level in the predicted disease surpasses a selected threshold, thus enabling a more accurate and reliable diagnosis.
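The threshold-based inquiry loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the DxFormer implementation: the toy classifier `diagnose`, the symptom proposer `propose_symptom`, the patient dictionary, and the threshold of 0.9 are all hypothetical stand-ins for the learned encoder, the learned decoder, and the patient simulator.

```python
from typing import Callable, Dict, Optional, Tuple

Symptoms = Dict[str, bool]


def run_inquiry(
    explicit: Symptoms,
    patient: Symptoms,
    diagnose: Callable[[Symptoms], Tuple[str, float]],
    propose_symptom: Callable[[Symptoms], Optional[str]],
    threshold: float = 0.9,
    max_turns: int = 10,
) -> Tuple[str, float, Symptoms]:
    """Ask about symptoms until the diagnosis confidence reaches the threshold."""
    known = dict(explicit)
    disease, confidence = diagnose(known)
    for _ in range(max_turns):
        if confidence >= threshold:
            break  # confident enough: stop inquiring
        symptom = propose_symptom(known)
        if symptom is None:
            break  # nothing left to ask
        # The (simulated) patient answers the inquiry.
        known[symptom] = patient.get(symptom, False)
        disease, confidence = diagnose(known)
    return disease, confidence, known


# Hypothetical stand-in for the encoder: confidence grows with supporting evidence.
def diagnose(known: Symptoms) -> Tuple[str, float]:
    positives = sum(known.get(s, False) for s in ("fever", "cough", "fatigue"))
    return "flu", 0.4 + 0.2 * positives


# Hypothetical stand-in for the decoder: propose the next symptom not yet asked.
def propose_symptom(known: Symptoms) -> Optional[str]:
    for s in ("fever", "cough", "fatigue", "rash"):
        if s not in known:
            return s
    return None


patient = {"fever": True, "cough": True, "fatigue": True, "rash": False}
disease, confidence, known = run_inquiry(
    {"fever": True}, patient, diagnose, propose_symptom
)
```

In this toy run, the loop stops before asking about every symptom because the confidence threshold is reached first.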

Clinical Report Generation. RL can generate medical reports from images by employing appropriate evaluation metrics such as human evaluations or consensus-based image description evaluation (CIDEr) (Vedantam et al., 2015) and BLEU metrics as rewards. Previous approaches to medical image captioning were constrained by their reliance on RNNs for text generation, which often resulted in slow performance and incoherent reports, as highlighted by Xiong et al. (2019). To address this limitation, their work introduces an RL approach based on transformers for medical image captioning. Initially, a pre-trained CNN is employed to identify the region of interest in chest X-ray images. Then, a transformer encoder is utilized to extract the visual features from the identified regions. These features serve as input to the decoder, which generates sentences describing the X-ray scans. Similarly, Miura et al. (2021) use a meshed-memory transformer (M2 Trans) (Cornia et al., 2020) to generate radiology reports, which proves more effective than traditional RNN and transformer models. M2 Trans incorporates a CNN to extract image regions. These regions are then encoded using a memory-augmented attention process. This involves assigning attention weights to the image based on prior knowledge stored in memory matrices that capture relationships between different regions. This model is trained using rewards that aim to enhance the factual completeness and consistency of the generated reports.

Drug Discovery. RL has the potential to accelerate drug discovery efforts. It has been utilized to bias or fine-tune generative models, enabling the generation of molecules with more desirable characteristics, such as bioactivity. Traditional generative models for molecules, such as RNNs or generative adversarial networks (GANs) (Goodfellow et al., 2014), have limitations in satisfying specific constraints, such as synthesizability or desirable physical properties. Recent research (Wang et al., 2021a; Li et al., 2022b; Yang et al., 2021; Liu et al., 2023) uses transformers as generative models for molecular generation. These approaches generate more plausible molecules with rich semantic features. A discriminator grants rewards that guide the policy update of the generator. These works demonstrate that transformer-based methods are significantly better at capturing and utilizing structure-property relations, leading to higher structural diversity and a broader range of scaffold types for the generated molecules.

9.3. Language Modeling

Language modeling involves understanding the sequential context of language to perform diverse tasks like recognition, interpretation, or retrieval. Large language models like GPT leverage pre-training on vast corpora, enabling them to generate fluent, natural language by sampling from the learned distribution, thus minimizing the need for extensive domain-specific knowledge engineering. However, these models face challenges in maintaining task coherence and goal-directedness. Alabdulkarim et al. (2021) use proximal policy optimization (PPO) to fine-tune an existing transformer-based language model specifically for story generation to address this issue. This approach inputs a text prompt and generates a story based on the provided goal. This policy is updated using a reward mechanism that considers the proximity of the generated story to the desired input goal and the frequency of verb occurrence in the story compared to the goal.

Several studies use additive learning to benefit from pre-trained language models with limited data, incorporating a task-specific adapter over the frozen pre-trained language model. Jo et al. (2022) use RL to selectively sample tokens between the general pre-trained language model and the task-specific adapter. The authors argue that this enables the adapter to focus solely on the task-relevant component of the output sequence, making the model more robust to over-fitting. Cohen et al. (2022) introduce a conversational bot powered by RL where pre-trained models encode conversation history. Given that the action space for dialogue systems can be very large, the authors propose limiting the action space to a small set of generated candidate actions at each conversation turn. They use Q-Learning-based RL to allow a dynamic action space at each stage of the conversation.

Increasing the size of language models alone does not necessarily mitigate the risk of toxic biases in the training data. Several RL-based approaches have been proposed to better align these models with the user’s intended objectives. To align GPT-3 with the user’s intentions, Ouyang et al. (2022) introduce InstructGPT. First, the authors collect a set of human-written demonstrations of desired output behavior and fine-tune GPT-3 with supervised learning. Next, a reward model is trained on model outputs ranked from best to worst. Using this reward model, the model is further optimized with RL using PPO. Results demonstrate that the outputs of InstructGPT, with 1.3B parameters, are preferred over those of much larger models such as GPT-3 with 175B parameters. Faal et al. (2023) propose an alternative method to mitigate toxicity in language models via fine-tuning with PPO. They use a reward model based on multi-task learning to mitigate unintended bias in toxicity prediction related to various social identities.
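Training a reward model from ranked outputs is commonly done with a pairwise ranking loss: for each pair, the reward of the preferred output should exceed that of the less preferred one. The sketch below illustrates this idea with NumPy. It is not the InstructGPT code; the function name and the scalar rewards are hypothetical, and in practice the rewards would come from a learned network.

```python
import numpy as np


def pairwise_ranking_loss(r_preferred, r_rejected) -> float:
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over comparison pairs."""
    diff = np.asarray(r_preferred, dtype=float) - np.asarray(r_rejected, dtype=float)
    # -log(sigmoid(d)) equals log(1 + exp(-d)); logaddexp evaluates it stably.
    return float(np.mean(np.logaddexp(0.0, -diff)))


# Hypothetical scalar rewards for three (preferred, rejected) output pairs.
good_ordering = pairwise_ranking_loss([2.0, 1.5, 0.3], [0.1, -0.4, -1.0])
bad_ordering = pairwise_ranking_loss([0.1, -0.4, -1.0], [2.0, 1.5, 0.3])
```

The loss is small when the reward model already orders each pair as the human ranking does, and large when the ordering is reversed.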

9.4. Edge and Cloud Computing

RL is a valuable tool for optimizing the performance of decision-making systems that require real-time adaptation to changing conditions, such as those used in edge and cloud computing. In edge computing, RL can optimize resource-constrained internet of things (IoT) devices’ performance (Chen et al., 2021b). In cloud computing, RL can be used to optimize resource allocation and scheduling in large-scale distributed systems (Gondhi and Gupta, 2017). Integrating transformers with RL in these two settings can be particularly useful as they can handle high-dimensional sensory states (Ho et al., 2019) and sequences of symbolic states (Bhattamishra et al., 2020).

A distributed deep RL algorithm proposed by Wang et al. (2022d) utilizes transformers to model the policy for optimizing offloading strategies in vehicular networks. These networks enable vehicle-to-vehicle communication. To represent the input sub-task priorities and dependencies, a directed acyclic graph (DAG) is used. Thereafter, the attention mechanism employed by the transformer allows for the efficient extraction of state information from this DAG-based topology representation. This facilitates informed offloading decisions. The reward function used in this algorithm optimizes for latency and energy consumption, providing valuable feedback. This approach enables faster convergence of the vehicular agent to equilibrium.

9.5. Combinatorial Optimization

Combinatorial optimization involves finding the values of a set of discrete parameters that minimize cost functions (Mazyavkina et al., 2021). Recently, transformer-based models have shown promise in combinatorial optimization (e.g., for the traveling salesman and routing problems) due to their ability to handle sequential data and model complex relationships between entities.

Travelling Salesman. This problem is a classical combinatorial optimization problem commonly found in crew scheduling applications. It has been formulated as an RL problem by Smith (2022). The optimization version of the problem is NP-hard, while its decision version is NP-complete. The factorial complexity of brute-force algorithms necessitates the development of faster methods. This study uses a decision transformer (DT) to solve this challenge. The DT is fed random walks as input and aims to find the optimal path among all nodes. The DT has the advantage of scaling with pseudo-linear time, as it only needs to predict once per node in the route. This significantly improves over previous methods, such as dynamic programming and simulated annealing, which have polynomial and exponential complexity. However, the DT could not always accurately model the travelling salesman problem, leading to inconsistent performance.
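The contrast between exhaustive search and one prediction per node can be illustrated with a small sketch. The `brute_force` function enumerates every tour, whereas `decode_tour` makes a single choice per node. The scoring function `nearest_score` is a hypothetical stand-in for a learned model such as the DT (here it simply prefers the nearest unvisited node), and the point coordinates are made up.

```python
import itertools
import math


def tour_length(points, tour):
    """Length of the closed tour that returns to its starting node."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


def brute_force(points):
    """Enumerate all (n - 1)! tours starting at node 0 and keep the shortest."""
    rest = range(1, len(points))
    best = min(
        itertools.permutations(rest),
        key=lambda perm: tour_length(points, (0,) + perm),
    )
    return (0,) + best


def decode_tour(points, score_next):
    """Build a tour with one prediction per node, masking visited nodes."""
    tour = [0]
    unvisited = set(range(1, len(points)))
    while unvisited:
        nxt = max(unvisited, key=lambda j: score_next(points, tour, j))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tuple(tour)


def nearest_score(points, tour, candidate):
    # Hypothetical stand-in for a learned policy: prefer the closest node.
    return -math.dist(points[tour[-1]], points[candidate])


points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 2.0)]
optimal = brute_force(points)
greedy = decode_tour(points, nearest_score)
```

The step-wise decoder always returns a valid tour, but, as with the DT, nothing guarantees that it matches the optimal tour found by exhaustive search.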

Routing. Identifying the most efficient route between two nodes in a graph is important in industries such as transportation, logistics, and networking (Mor and Speranza, 2022). Traditional heuristic-based algorithms may not always yield the optimal solution, as adapting to changing conditions is challenging (Wu et al., 2022). GNNs have been used to address these challenges (Lu et al., 2020). However, these may not be sufficient for handling data with complex inter-relationships and structures, motivating the use of transformers. In addition, routing often requires optimizing for multiple constraints, such as cost, time, or distance, which can be tackled using RL. A transformer-based policy has been proposed by Wang and Chen (2022) to tackle the routing problem using a standard transformer encoder with positional encoding, which ensures translation invariance of the input nodes. A GNN layer is used in the decoder, enabling consideration of the graph’s topological structure formed by the node relationships. The policy is then trained using the REINFORCE algorithm (Williams, 1992). This approach improves learning efficiency and optimization accuracy compared to traditional methods while providing better generalization in new scenarios.
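For reference, the REINFORCE update for such an autoregressive routing policy can be written as below. The notation is an assumption introduced here for illustration and is not taken from Wang and Chen (2022): s denotes the problem instance, π the constructed route over n nodes, L(π) its cost, b(s) a baseline used to reduce variance, and J(θ) the expected cost that training minimizes.

```latex
p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1}),
\qquad
J(\theta) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)} \big[ L(\pi) \big],
\qquad
\nabla_\theta J(\theta) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}
\Big[ \big( L(\pi) - b(s) \big) \, \nabla_\theta \log p_\theta(\pi \mid s) \Big]
```

Routes that cost less than the baseline have their log-probability increased by a gradient descent step on J, while routes that cost more have it decreased.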

9.6. Environmental Sciences

RL algorithms can help address climate change by optimizing the behavior of systems and technologies for reducing greenhouse gas emissions and mitigating the impacts of climate change (Strnad et al., 2019). These algorithms can learn and adapt to multiple constraints, optimizing performance without compromising productivity. However, in such settings, RL algorithms must rely on past context stored in memory, and integrating prior knowledge is crucial for their success, suggesting the use of transformers.

Nasir and Durlofsky (2023) formulate the closed-loop reservoir management problem as a partially observable Markov decision process (POMDP). In subsurface flow settings, such as oil reservoirs, the goal is to extract as much oil as possible while minimizing costs and environmental impact. However, this requires making decisions about well pressure settings, which is often complicated by geological model uncertainties. This work models the RL policy with PPO using temporal convolution and gated transformer blocks for an efficient and effective representation of the state. The framework is trained with data generated from flow simulations across an ensemble of prior geological models. After appropriate training, the policy instantaneously maps flow data observed at wells to optimal well pressure settings. This approach helps reduce computational costs and improve decision-making in subsurface flow settings.

Wang et al. (2022b) introduce the transformer-based multi-agent actor-critic framework (T-MAAC), which leverages MARL algorithms to stabilize voltage in power distribution networks. This framework recognizes the need for coordination among multiple units in the grid to handle the rapid changes in power systems resulting from the increased integration of renewable energy. The proposed approach introduces a transformer-based actor that takes the grid state representation as input and outputs the maximum reactive power ratio that each agent in the power distribution network can generate. Subsequently, the critic approximates the global Q-values using the self-attention mechanism to model the correlation between agents across the entire grid. The policy is reinforced through feedback in the form of reward, aiming to control voltage within a safe range while minimizing power loss in the distribution network. This approach consistently enhances the effectiveness of the active voltage control task.

9.7. Scheduling

The scheduling problem involves determining the optimal arrangement of tasks or events within a specified time frame while considering constraints such as resource availability or dependencies between tasks (Allahverdi, 2016). This problem can arise in various contexts, such as scheduling jobs in manufacturing or optimizing computer resource usage, and can be approached using various techniques (Parmentier and T’kindt, 2023). Transformers are now being used to solve scheduling problems.

The job-shop scheduling problem (JSSP) is a classical NP-hard problem that involves scheduling a set of jobs on a set of machines, where each job has to be processed on each machine exactly once, subject to various constraints. An RL approach for solving the JSSP using the disjunctive graph embedded recurrent decoding (DGERD) transformer is proposed by Chen et al. (2023). This work uses the attention mechanism and disjunctive graph embedding to model the JSSP, which allows complex relationships between jobs and machines to be captured. In the context of JSSP, the attention module learns to prioritize certain jobs or machines based on their importance or availability. By doing so, it can generate more efficient and robust schedules. The disjunctive graph embedding converts the JSSP instance into a graph representation to capture the structural properties, enabling better generalization and reducing over-fitting. This acts as an input for the DGERD transformer consisting of a parallel-computing encoder and a recurrent-computing decoder. The encoder takes the disjunctive graph embedding of the JSSP instance and generates a set of hidden representations that capture the relevant features of the input. This hidden representation is then fed to the decoder to generate an output schedule sequentially. The policy is optimized using feedback from the environment in the form of makespan (length of time that elapses from the start of work to the end) and tardiness penalties. This helps in generating schedules that are both fast and reliable.
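To make the makespan feedback concrete, the sketch below evaluates a schedule produced sequentially, as a decoder would emit it. The encoding is an assumption chosen for illustration and is not taken from Chen et al. (2023): each entry of `sequence` is a job index, and its k-th occurrence dispatches the k-th operation of that job. The two-job instance is made up.

```python
from typing import Dict, List, Tuple

# jobs[j] is the ordered list of (machine, duration) operations of job j.
Jobs = List[List[Tuple[int, int]]]


def makespan(jobs: Jobs, sequence: List[int]) -> int:
    """Completion time of the last operation when dispatching jobs in `sequence`."""
    next_op = [0] * len(jobs)          # index of each job's next operation
    job_ready = [0] * len(jobs)        # time at which each job is free again
    machine_ready: Dict[int, int] = {} # time at which each machine is free again
    for j in sequence:
        machine, duration = jobs[j][next_op[j]]
        # An operation starts once both its job and its machine are available.
        start = max(job_ready[j], machine_ready.get(machine, 0))
        end = start + duration
        job_ready[j] = end
        machine_ready[machine] = end
        next_op[j] += 1
    return max(job_ready)


# Hypothetical instance: 2 jobs, 2 machines, each job visits each machine once.
jobs = [[(0, 3), (1, 2)], [(1, 2), (0, 4)]]
interleaved = makespan(jobs, [0, 1, 0, 1])  # both machines work in parallel
blocked = makespan(jobs, [0, 0, 1, 1])      # job 1 waits for job 0 to finish
reward = -interleaved  # one possible reward: shorter schedules score higher
```

The interleaved order finishes in 7 time units and the blocked order in 11, so a policy rewarded with the negative makespan is pushed toward the former.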

9.8. Trading

Stock portfolio optimization involves choosing the optimal combination of assets to obtain the highest possible returns while minimizing risk (Hieu, 2020). This process can be challenging due to the various factors that can impact a portfolio’s performance (Haugh and Lo, 2001), which include market conditions, economic events, and changes in the value of individual stocks. Various techniques can be used to optimize a portfolio, including modern portfolio theory and optimization algorithms (Thakkar and Chaudhari, 2021), and RL is one such approach to automate the trading process. For this application, RL involves training a model to make trading decisions based on historical data and market conditions to maximize the portfolio’s return over time.

Although past performance may not indicate future results, data-driven approaches rely on past features to model a particular stock’s expected future performance. This is because various historical data points, such as price trends, trading volumes, and market sentiment, may hint at a stock’s future performance (Milosevic, 2016). In portfolio optimization, it is necessary to consider both short-term and long-term trends (Ta et al., 2020). The transformer architecture is well-suited for this task.

The first application of transformers in portfolio selection, as introduced by Xu et al. (2020), involved using the relation-aware transformer (RAT). This uses an encoder-decoder transformer architecture to model the RL policy. The encoder takes in the sequential price series of assets, such as stocks and cryptocurrencies, as the input state. It performs sequential feature extraction, comprising a sequential attention layer for capturing patterns in asset prices and a relation attention layer for capturing correlations among assets. The decoder has a network resembling an encoder, with an additional decision-making layer that incorporates leverage and enables accurate decisions, including short sales, for each asset. The final action is determined by combining the initial portfolio vector, the short sale vector, and the reinvestment vector. The agent then receives reward-based feedback, measured as the log return of portfolios. To evaluate the proposed method, real-world cryptocurrency and stock datasets are used, and the method is compared against state-of-the-art portfolio selection methods. The results demonstrate a significant improvement over existing approaches.
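A log-return reward can be computed as in the following sketch. It is a simplified illustration, not the RAT reward: it assumes a long-only portfolio whose weights sum to one, ignores transaction costs and leverage, and uses made-up price relatives (the ratio of each asset's closing price to its previous closing price).

```python
import numpy as np


def log_return(weights, price_relatives) -> float:
    """Log return of one trading period for the given portfolio weights."""
    w = np.asarray(weights, dtype=float)
    y = np.asarray(price_relatives, dtype=float)
    return float(np.log(np.dot(w, y)))


def cumulative_log_return(weight_history, relative_history) -> float:
    """Sum of per-period log returns, i.e. the log of total wealth growth."""
    return sum(log_return(w, y) for w, y in zip(weight_history, relative_history))


# Hypothetical two-asset example over two periods.
weights = [[0.5, 0.5], [1.0, 0.0]]
relatives = [[1.1, 0.9], [1.2, 0.8]]
total = cumulative_log_return(weights, relatives)
```

Because log returns add across periods, maximizing their sum is equivalent to maximizing the final portfolio value.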

9.9. Hyper-Parameter Optimization

Hyper-parameter optimization (HPO) involves finding the optimal set of hyper-parameters for training a machine learning model. Some commonly used hyper-parameters in machine learning models include the learning rate, batch size, number of hidden units in a neural network, and activation function. As such, finding the best combination of these hyper-parameters can be challenging for large models due to their correspondingly large search space (Ali et al., 2023). Manually setting hyper-parameter values is fast but requires expertise and domain knowledge (Shawki et al., 2021). Automated techniques like random search, grid search, or Bayesian optimization can automatically find ideal hyper-parameter combinations (Bergstra and Bengio, 2012; Snoek et al., 2015), but minimizing overall computational costs remains a challenge. Such auto-tuners rarely perform well for complex tasks and are prone to errors with increased model complexity (Shawki et al., 2021).

Attention and memory enhancement (AME) (Xu et al., 2022a) is a transformer-based search algorithm that enhances the selection of hyper-parameters and tackles these challenges. AME utilizes RL and addresses HPO without relying on distribution assumptions. The agent, or the searcher, is modeled using GTrXL and learns a series of state-to-action mappings based on rewards. In this context, the state refers to the combination of evaluated configurations, while an action corresponds to the new configuration chosen by the agent from the search space. Utilizing GTrXL improves the ability to capture relationships among different configurations through memory mechanisms and multi-head attention, thereby enabling attentive sampling. The agent is trained using feedback in the form of rewards, which promotes the generation of high-performance configurations and penalizes those leading to reduced performance. Consequently, it effectively locates high-performance configurations within a vast search space. Results demonstrate that the AME algorithm surpasses other HPO methods like Bayesian optimization, evolutionary algorithms, and random search in terms of adaptability to diverse tasks, efficiency, and stability.

10. Limitations

As discussed above, transformers are gradually being integrated into RL for various applications. Despite these advances, some limitations impede their widespread use. This section details these limitations and provides insights for future research.

Balancing Local & Global Context. In RL, global contextual information is required for efficient high-level planning (Barto and Mahadevan, 2003). This information is combined with additional nearby details, known as the local context, to predict low-level actions precisely. As detailed by Li et al. (2019) and Wang et al. (2021b), transformers may not be as effective as other models in capturing local context. This limitation arises mainly from the self-attention mechanism, which compares queries and keys for all elements in a sequence using the dot product. This point-wise comparison does not directly consider the local context for each sequence position, which may lead to confusion due to noisy local points. Recent studies (Lin et al., 2023; Wang et al., 2022a, 2021c) inspired by CNNs have proposed modifications to the original attention mechanism to balance the local and global context more effectively. These approaches include local window-based boundary-aware attention, allowing the model to focus on a small window of nearby details and the global context when making predictions.
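The idea of restricting attention to nearby positions can be sketched with a banded mask. The code below is a generic single-head local-window attention written with NumPy under our own assumptions. It does not reproduce the boundary-aware variants proposed in the cited studies, and the function name and window size are hypothetical.

```python
import numpy as np


def local_window_attention(q, k, v, window: int):
    """Scaled dot-product attention where position i only sees |i - j| <= window."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    idx = np.arange(n)
    # Banded mask: True where the key lies inside the query's local window.
    mask = np.abs(idx[:, None] - idx[None, :]) <= window
    scores = np.where(mask, scores, -np.inf)
    # Softmax over keys; the diagonal is always unmasked, so each row max is finite.
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v, weights


rng = np.random.default_rng(0)
q = rng.normal(size=(6, 4))
k = rng.normal(size=(6, 4))
v = rng.normal(size=(6, 4))
out, weights = local_window_attention(q, k, v, window=1)
```

A model can combine such a local head with an unmasked global head so that both nearby details and the full sequence inform each prediction.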

Weak Inductive Bias. CNNs and long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) models have a strong inductive bias toward the dataset’s structure, which helps narrow the search space and leads to faster training (d’Ascoli et al., 2021). This makes them better suited for situations with less training data. However, transformers have a relatively weak inductive bias, making them more capable of finding general solutions (Hessel et al., 2019) but more susceptible to over-fitting, especially when less data is available. This limitation can be a significant challenge in RL, where training a policy already requires millions of trajectories. Furthermore, learning models like the decision transformer requires collecting trajectories from learned policies, which can be even more challenging. One approach to counter weaker inductive bias in transformers is to use foundation models (Zhou et al., 2023; Moor et al., 2023). Foundation models are pre-trained on large and diverse datasets, which allows them to learn general patterns that can be applied to a wide range of downstream tasks. By fine-tuning the pre-trained model on a smaller task-specific dataset, state-of-the-art results can be achieved with less data.

Quadratic Complexity. The self-attention mechanism of transformers becomes more computationally expensive as input sequence length increases due to the quadratic increase in pairwise comparisons between tokens (Keles et al., 2023). This limitation, along with hardware and model size constraints, restricts the ability of transformers to process longer input sequences, making them unsuitable for specific tasks that require substantial amounts of contextual information, like document summarization or genome fragment classification. This limitation can also pose challenges in RL for applications requiring extended temporal modeling. However, recent works (Katharopoulos et al., 2020; Lu et al., 2021; Ren et al., 2021) have provided methods to reduce this cost to linear or sub-quadratic, providing new possibilities for using transformers in applications that require longer input sequences.
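One way to avoid the quadratic cost is kernelized (linear) attention in the spirit of Katharopoulos et al. (2020), where the softmax similarity is replaced by a product of feature maps so that keys and values can be summarized once. The sketch below is a simplified, non-causal NumPy illustration under our own assumptions. It uses the feature map elu(x) + 1, and `quadratic_reference` is included only to check that both orders of computation agree.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1, which is strictly positive for every input."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def linear_attention(q, k, v):
    """Cost grows linearly with sequence length n: no n-by-n matrix is formed."""
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v        # (d, d_v) summary of all keys and values
    z = fk.sum(axis=0)   # (d,) normalizer shared by every query
    return (fq @ kv) / (fq @ z)[:, None]


def quadratic_reference(q, k, v):
    """Same result, computed through the explicit n-by-n similarity matrix."""
    fq, fk = feature_map(q), feature_map(k)
    sim = fq @ fk.T
    return (sim / sim.sum(axis=1, keepdims=True)) @ v


rng = np.random.default_rng(0)
q = rng.normal(size=(5, 3))
k = rng.normal(size=(5, 3))
v = rng.normal(size=(5, 2))
fast = linear_attention(q, k, v)
slow = quadratic_reference(q, k, v)
```

Both functions return the same values, but only the reference version materializes the pairwise similarity matrix whose size grows quadratically with the sequence length.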

11. Conclusion

This survey explored the diverse uses of transformers in RL, including representation learning, reward modeling, transition function modeling, and policy learning. While the original transformer architecture has limitations, it can be modified for many RL applications. We showcased the advances in transformers that have broadened the scope of RL to real-world problems in robotics, drug discovery, stock trading, and cloud computing. Finally, we discussed the current limitations of transformers in RL and ongoing research in this field. Given its versatility in addressing challenges such as partial observability, credit assignment, interpretability, and unstable training (issues commonly encountered in traditional RL), we anticipate that the transformer architecture will continue to gain popularity in the RL domain.

Acknowledgement. We thank CIFAR, Google, and CMLabs for funding the project, and Vincent Michalski for the valuable feedback.

References

  • Alabdulkarim et al. (2021) Amal Alabdulkarim, Winston Li, Lara J. Martin, and Mark O. Riedl. 2021. Goal-Directed Story Generation: Augmenting Generative Language Models with Reinforcement Learning. CoRR abs/2112.08593 (2021).
  • Ali et al. (2023) Yasser A Ali, Emad Mahrous Awwad, Muna Al-Razgan, and Ali Maarouf. 2023. Hyperparameter Search for Machine Learning Algorithms for Optimizing the Computational Complexity. Processes 11, 2 (2023), 349.
  • Allahverdi (2016) Ali Allahverdi. 2016. A survey of scheduling problems with no-wait in process. Eur. J. Oper. Res. 255, 3 (2016), 665–686.
  • AlMahamid and Grolinger (2021) Fadi AlMahamid and Katarina Grolinger. 2021. Reinforcement Learning Algorithms: An Overview and Classification. In 34th IEEE Canadian Conference on Electrical and Computer Engineering, CCECE 2021. IEEE, 1–7.
  • Arulkumaran et al. (2022) Kai Arulkumaran, Dylan R. Ashley, Jürgen Schmidhuber, and Rupesh Kumar Srivastava. 2022. All You Need Is Supervised Learning: From Imitation Learning to Meta-RL With Upside Down RL. CoRR abs/2202.11960 (2022).
  • Arulkumaran et al. (2017) Kai Arulkumaran, Marc Peter Deisenroth, Miles Brundage, and Anil Anthony Bharath. 2017. Deep Reinforcement Learning: A Brief Survey. IEEE Signal Process. Mag. 34, 6 (2017), 26–38.
  • Ausin et al. (2021) Markel Sanz Ausin, Mehak Maniktala, Tiffany Barnes, and Min Chi. 2021. Tackling the Credit Assignment Problem in Reinforcement Learning-Induced Pedagogical Policies with Neural Networks. In Artificial Intelligence in Education - 22nd International Conference, AIED, Utrecht, Netherlands, Vol. 12748. Springer, 356–368.
  • Bai et al. (2021) Bing Bai, Jian Liang, Guanhua Zhang, Hao Li, Kun Bai, and Fei Wang. 2021. Why Attentions May Not Be Interpretable?. In KDD ’21: The 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, 25–34.
  • Banino et al. (2022) Andrea Banino, Adrià Puigdomènech Badia, Jacob C. Walker, Tim Scholtes, Jovana Mitrovic, and Charles Blundell. 2022. CoBERL: Contrastive BERT for Reinforcement Learning. In The Tenth International Conference on Learning Representations, ICLR.
  • Barto and Mahadevan (2003) Andrew G. Barto and Sridhar Mahadevan. 2003. Recent Advances in Hierarchical Reinforcement Learning. Discret. Event Dyn. Syst. 13, 1-2 (2003), 41–77.
  • Bastani et al. (2020) Osbert Bastani, Jeevana Priya Inala, and Armando Solar-Lezama. 2020. Interpretable, Verifiable, and Robust Reinforcement Learning via Program Synthesis. In xxAI - Beyond Explainable AI - International Workshop, Held in Conjunction with ICML, Vol. 13200. Springer, 207–228.
  • Battaglia et al. (2018) Peter W. Battaglia, Jessica B. Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinícius Flores Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, Çaglar Gülçehre, H. Francis Song, Andrew J. Ballard, Justin Gilmer, George E. Dahl, Ashish Vaswani, Kelsey R. Allen, Charles Nash, Victoria Langston, Chris Dyer, Nicolas Heess, Daan Wierstra, Pushmeet Kohli, Matthew M. Botvinick, Oriol Vinyals, Yujia Li, and Razvan Pascanu. 2018. Relational inductive biases, deep learning, and graph networks. CoRR abs/1806.01261 (2018).
  • Bellemare et al. (2015) Marc G. Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. 2015. The Arcade Learning Environment: An Evaluation Platform for General Agents (Extended Abstract). In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, Qiang Yang and Michael J. Wooldridge (Eds.). AAAI Press, 4148–4152. http://ijcai.org/Abstract/15/585
  • Benjamins et al. (2022) Carolin Benjamins, Theresa Eimer, Frederik Schubert, Aditya Mohan, André Biedenkapp, Bodo Rosenhahn, Frank Hutter, and Marius Lindauer. 2022. Contextualize Me - The Case for Context in Reinforcement Learning. CoRR abs/2202.04500 (2022).
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. 2012. Random Search for Hyper-Parameter Optimization. J. Mach. Learn. Res. 13 (2012), 281–305.
  • Bhattamishra et al. (2020) Satwik Bhattamishra, Arkil Patel, and Navin Goyal. 2020. On the Computational Power of Transformers and Its Implications in Sequence Modeling. In Proceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, November 19-20, 2020. Association for Computational Linguistics, 455–475.
  • Boustati et al. (2021) Ayman Boustati, Hana Chockler, and Daniel C. McNamee. 2021. Transfer learning with causal counterfactual reasoning in Decision Transformers. CoRR abs/2110.14355 (2021).
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. (2020).
  • Carta et al. (2020) Thomas Carta, Subhajit Chaudhury, Kartik Talamadupula, and Michiaki Tatsubori. 2020. VisualHints: A Visual-Lingual Environment for Multimodal Reinforcement Learning. CoRR abs/2010.13839 (2020).
  • Chaplot et al. (2020) Devendra Singh Chaplot, Dhiraj Gandhi, Saurabh Gupta, Abhinav Gupta, and Ruslan Salakhutdinov. 2020. Learning To Explore Using Active Neural SLAM. In 8th International Conference on Learning Representations, ICLR.
  • Chefer et al. (2021) Hila Chefer, Shir Gur, and Lior Wolf. 2021. Transformer Interpretability Beyond Attention Visualization. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. Computer Vision Foundation / IEEE, 782–791.
  • Chen et al. (2022a) Chang Chen, Yi-Fu Wu, Jaesik Yoon, and Sungjin Ahn. 2022a. TransDreamer: Reinforcement Learning with Transformer World Models. CoRR abs/2202.09481 (2022).
  • Chen et al. (2021a) Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021a. Decision Transformer: Reinforcement Learning via Sequence Modeling. (2021), 15084–15097.
  • Chen et al. (2023) Ruiqi Chen, Wenxin Li, and Hongbing Yang. 2023. A Deep Reinforcement Learning Framework Based on an Attention Mechanism and Disjunctive Graph Embedding for the Job-Shop Scheduling Problem. IEEE Trans. Ind. Informatics 19, 2 (2023), 1322–1331.
  • Chen and Tan (2023) Renlong Chen and Ying Tan. 2023. Credit assignment with predictive contribution measurement in multi-agent reinforcement learning. Neural Networks 164 (2023), 681–690. https://doi.org/10.1016/j.neunet.2023.05.021
  • Chen et al. (2021b) Wuhui Chen, Xiaoyu Qiu, Ting Cai, Hong-Ning Dai, Zibin Zheng, and Yan Zhang. 2021b. Deep Reinforcement Learning for Internet of Things: A Comprehensive Survey. IEEE Commun. Surv. Tutorials 23, 3 (2021), 1659–1692.
  • Chen et al. (2022b) Wei Chen, Cheng Zhong, Jiajie Peng, and Zhongyu Wei. 2022b. DxFormer: A Decoupled Automatic Diagnostic System Based on Decoder-Encoder Transformer with Dense Symptom Representations. CoRR abs/2205.03755 (2022).
  • Choi et al. (2020) Kristy Choi, Curtis Hawthorne, Ian Simon, Monica Dinculescu, and Jesse H. Engel. 2020. Encoding Musical Style with Transformer Autoencoders. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119. PMLR, 1899–1908.
  • Cohen et al. (2022) Deborah Cohen, Moonkyung Ryu, Yinlam Chow, Orgad Keller, Ido Greenberg, Avinatan Hassidim, Michael Fink, Yossi Matias, Idan Szpektor, Craig Boutilier, and Gal Elidan. 2022. Dynamic Planning in Open-Ended Dialogue using Reinforcement Learning. CoRR abs/2208.02294 (2022).
  • Cornia et al. (2020) Marcella Cornia, Matteo Stefanini, Lorenzo Baraldi, and Rita Cucchiara. 2020. Meshed-Memory Transformer for Image Captioning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. Computer Vision Foundation / IEEE, 10575–10584.
  • Dai et al. (2019) Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G. Carbonell, Quoc Viet Le, and Ruslan Salakhutdinov. 2019. Transformer-XL: Attentive Language Models beyond a Fixed-Length Context. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL. Association for Computational Linguistics, 2978–2988.
  • d’Ascoli et al. (2021) Stéphane d’Ascoli, Hugo Touvron, Matthew L. Leavitt, Ari S. Morcos, Giulio Biroli, and Levent Sagun. 2021. ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases. In Proceedings of the 38th ICML, Vol. 139. PMLR, 2286–2296.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 4171–4186.
  • Dong et al. (2018) Linhao Dong, Shuang Xu, and Bo Xu. 2018. Speech-Transformer: A No-Recurrence Sequence-to-Sequence Model for Speech Recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2018, Calgary, AB, Canada, April 15-20, 2018. IEEE, 5884–5888.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In 9th International Conference on Learning Representations, ICLR.
  • Dwivedi and Bresson (2020) Vijay Prakash Dwivedi and Xavier Bresson. 2020. A Generalization of Transformer Networks to Graphs. CoRR abs/2012.09699 (2020).
  • Esslinger et al. (2022) Kevin Esslinger, Robert Platt, and Christopher Amato. 2022. Deep Transformer Q-Networks for Partially Observable Reinforcement Learning. CoRR abs/2206.01078 (2022).
  • Faal et al. (2023) Farshid Faal, Ketra A. Schmitt, and Jia Yuan Yu. 2023. Reward modeling for mitigating toxicity in transformer-based language models. Appl. Intell. 53, 7 (2023), 8421–8435.
  • Featherstone (2014) Roy Featherstone. 2014. Rigid body dynamics algorithms. Springer.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. 2017. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. In Proceedings of the 34th ICML, Vol. 70. PMLR, 1126–1135.
  • Foerster et al. (2016) Jakob N. Foerster, Yannis M. Assael, Nando de Freitas, and Shimon Whiteson. 2016. Learning to Communicate with Deep Multi-Agent Reinforcement Learning. In NeurIPS. 2137–2145.
  • Fukushima et al. (2022) Rui Fukushima, Kei Ota, Asako Kanezaki, Yoko Sasaki, and Yusuke Yoshiyasu. 2022. Object Memory Transformer for Object Goal Navigation. CoRR abs/2203.14708 (2022).
  • Ghosh and Bellemare (2020) Dibya Ghosh and Marc G. Bellemare. 2020. Representations for Stable Off-Policy Reinforcement Learning. In Proceedings of the 37th ICML, Vol. 119. PMLR, 3556–3565.
  • Glanois et al. (2021) Claire Glanois, Paul Weng, Matthieu Zimmer, Dong Li, Tianpei Yang, Jianye Hao, and Wulong Liu. 2021. A Survey on Interpretable Reinforcement Learning. CoRR abs/2112.13112 (2021).
  • Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, Chia Laguna Resort, Sardinia, Italy, May 13-15, 2010 (JMLR Proceedings, Vol. 9), Yee Whye Teh and D. Mike Titterington (Eds.). JMLR.org, 249–256. http://proceedings.mlr.press/v9/glorot10a.html
  • Gondhi and Gupta (2017) Naveen Kumar Gondhi and Ayushi Gupta. 2017. Survey on machine learning based scheduling in cloud computing. In Proceedings of the 2017 International Conference on Intelligent Systems, Metaheuristics & Swarm Intelligence. 57–61.
  • Goodfellow et al. (2014) Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. 2014. Generative Adversarial Networks. CoRR abs/1406.2661 (2014).
  • Gopalakrishnan et al. (2023) Anand Gopalakrishnan, Kazuki Irie, Jürgen Schmidhuber, and Sjoerd van Steenkiste. 2023. Unsupervised Learning of Temporal Abstractions With Slot-Based Transformers. Neural Comput. 35, 4 (2023), 593–626.
  • Goulão and Oliveira (2022) Manuel Goulão and Arlindo L. Oliveira. 2022. Pretraining the Vision Transformer using self-supervised methods for vision based Deep Reinforcement Learning. CoRR abs/2209.10901 (2022).
  • Gronauer and Diepold (2022) Sven Gronauer and Klaus Diepold. 2022. Multi-agent deep reinforcement learning: a survey. Artif. Intell. Rev. 55, 2 (2022), 895–943.
  • Guedj (2019) Benjamin Guedj. 2019. A Primer on PAC-Bayesian Learning. CoRR abs/1901.05353 (2019).
  • Gupta et al. (2021) Nikunj Gupta, G. Srinivasaraghavan, Swarup Kumar Mohalik, and Matthew E. Taylor. 2021. HAMMER: Multi-Level Coordination of Reinforcement Learning Agents via Learned Messaging. CoRR abs/2102.00824 (2021).
  • Ha and Schmidhuber (2018) David Ha and Jürgen Schmidhuber. 2018. World Models. CoRR abs/1803.10122 (2018).
  • Hafner et al. (2021) Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. 2021. Mastering Atari with Discrete World Models. In 9th International Conference on Learning Representations, ICLR.
  • Haugh and Lo (2001) Martin B. Haugh and Andrew W. Lo. 2001. Computational challenges in portfolio management. Comput. Sci. Eng. 3, 3 (2001), 54–59.
  • Hernandez et al. (2021) Alberto Olmo Hernandez, Sarath Sreedharan, and Subbarao Kambhampati. 2021. GPT3-to-plan: Extracting plans from text using GPT-3. CoRR abs/2106.07131 (2021).
  • Hessel et al. (2019) Matteo Hessel, Hado van Hasselt, Joseph Modayil, and David Silver. 2019. On Inductive Biases in Deep Reinforcement Learning. CoRR abs/1907.02908 (2019).
  • Heuillet et al. (2021) Alexandre Heuillet, Fabien Couthouis, and Natalia Díaz Rodríguez. 2021. Explainability in deep reinforcement learning. Knowl. Based Syst. 214 (2021), 106685.
  • Hieu (2020) Le Trung Hieu. 2020. Deep Reinforcement Learning for Stock Portfolio Optimization. CoRR abs/2012.06325 (2020).
  • Ho et al. (2019) Jonathan Ho, Nal Kalchbrenner, Dirk Weissenborn, and Tim Salimans. 2019. Axial Attention in Multidimensional Transformers. CoRR abs/1912.12180 (2019).
  • Hochreiter (1998) Sepp Hochreiter. 1998. The Vanishing Gradient Problem During Learning Recurrent Neural Nets and Problem Solutions. Int. J. Uncertain. Fuzziness Knowl. Based Syst. 6, 2 (1998), 107–116.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Comput. 9, 8 (1997), 1735–1780.
  • Hu et al. (2021) Siyi Hu, Fengda Zhu, Xiaojun Chang, and Xiaodan Liang. 2021. UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers. CoRR abs/2101.08001 (2021).
  • Huang et al. (2020a) Wenlong Huang, Igor Mordatch, and Deepak Pathak. 2020a. One Policy to Control Them All: Shared Modular Policies for Agent-Agnostic Control. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119. PMLR, 4455–4464.
  • Huang et al. (2020b) Xiao Shi Huang, Felipe Pérez, Jimmy Ba, and Maksims Volkovs. 2020b. Improving Transformer Optimization Through Better Initialization. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119. PMLR, 4475–4483.
  • Hung et al. (2018) Chia-Chun Hung, Timothy P. Lillicrap, Josh Abramson, Yan Wu, Mehdi Mirza, Federico Carnevale, Arun Ahuja, and Greg Wayne. 2018. Optimizing Agent Behavior over Long Time Scales by Transporting Value. CoRR abs/1810.06721 (2018).
  • Icarte et al. (2020) Rodrigo Toro Icarte, Richard Anthony Valenzano, Toryn Q. Klassen, Phillip J. K. Christoffersen, Amir-massoud Farahmand, and Sheila A. McIlraith. 2020. The act of remembering: a study in partially observable reinforcement learning. CoRR abs/2010.01753 (2020).
  • Jaegle et al. (2022) Andrew Jaegle, Sebastian Borgeaud, Jean-Baptiste Alayrac, Carl Doersch, Catalin Ionescu, David Ding, Skanda Koppula, Daniel Zoran, Andrew Brock, Evan Shelhamer, Olivier J. Hénaff, Matthew M. Botvinick, Andrew Zisserman, Oriol Vinyals, and João Carreira. 2022. Perceiver IO: A General Architecture for Structured Inputs & Outputs. (2022).
  • Janner et al. (2021) Michael Janner, Qiyang Li, and Sergey Levine. 2021. Offline Reinforcement Learning as One Big Sequence Modeling Problem. In NeurIPS. 1273–1286.
  • Jo et al. (2022) DaeJin Jo, Taehwan Kwon, Eun-Sol Kim, and Sungwoong Kim. 2022. Selective Token Generation for Few-shot Natural Language Generation. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022. International Committee on Computational Linguistics, 5837–5856.
  • Kalashnikov et al. (2018) Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. 2018. Scalable Deep Reinforcement Learning for Vision-Based Robotic Manipulation. In 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Vol. 87. PMLR, 651–673.
  • Kaplan et al. (2020) Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling Laws for Neural Language Models. CoRR abs/2001.08361 (2020).
  • Kargar and Kyrki (2021) Eshagh Kargar and Ville Kyrki. 2021. Vision Transformer for Learning Driving Policies in Complex Multi-Agent Environments. CoRR abs/2109.06514 (2021).
  • Kargar and Kyrki (2022) Eshagh Kargar and Ville Kyrki. 2022. Vision Transformer for Learning Driving Policies in Complex and Dynamic Environments. In IEEE Intelligent Vehicles Symposium. IEEE, 1558–1564.
  • Karpathy et al. (2014) Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. 2014. Large-Scale Video Classification with Convolutional Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR. IEEE Computer Society, 1725–1732.
  • Katharopoulos et al. (2020) Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. 2020. Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention. In Proceedings of the 37th ICML, Vol. 119. PMLR, 5156–5165.
  • Keles et al. (2023) Feyza Duman Keles, Pruthuvi Mahesakya Wijewardena, and Chinmay Hegde. 2023. On The Computational Complexity of Self-Attention. In International Conference on Algorithmic Learning Theory, Vol. 201. PMLR, 597–619.
  • Khan et al. (2022) Muhammad Junaid Khan, Syed Hammad Ahmed, and Gita Sukthankar. 2022. Transformer-Based Value Function Decomposition for Cooperative Multi-Agent Reinforcement Learning in StarCraft. (2022), 113–119.
  • Khan et al. (2021) Salman H. Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. 2021. Transformers in Vision: A Survey. CoRR abs/2101.01169 (2021).
  • Killian et al. (2023) Taylor W. Killian, Sonali Parbhoo, and Marzyeh Ghassemi. 2023. Risk Sensitive Dead-end Identification in Safety-Critical Offline Reinforcement Learning. Trans. Mach. Learn. Res. 2023 (2023).
  • Kim et al. (2023) Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. 2023. Preference Transformer: Modeling Human Preferences using Transformers for RL. CoRR abs/2303.00957 (2023).
  • Kingma and Welling (2019) Diederik P. Kingma and Max Welling. 2019. An Introduction to Variational Autoencoders. Found. Trends Mach. Learn. 12, 4 (2019), 307–392. https://doi.org/10.1561/2200000056
  • Kiran et al. (2022) B. Ravi Kiran, Ibrahim Sobh, Victor Talpaert, Patrick Mannion, Ahmad A. Al Sallab, Senthil Kumar Yogamani, and Patrick Pérez. 2022. Deep Reinforcement Learning for Autonomous Driving: A Survey. IEEE Trans. Intell. Transp. Syst. 23, 6 (2022), 4909–4926.
  • Kuba et al. (2022) Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, and Yaodong Yang. 2022. Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning. In 10th International Conference on Learning Representations, ICLR.
  • Kurin et al. (2021) Vitaly Kurin, Maximilian Igl, Tim Rocktäschel, Wendelin Boehmer, and Shimon Whiteson. 2021. My Body is a Cage: the Role of Morphology in Graph-Based Incompatible Control. In 9th International Conference on Learning Representations, ICLR.
  • Lake et al. (2016) Brenden M. Lake, Tomer David Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. 2016. Building machines that learn and think like people. Behavioral and Brain Sciences 40 (2016).
  • Larsen et al. (2021) Thomas Nakken Larsen, Halvor Ødegård Teigen, Torkel Laache, Damiano Varagnolo, and Adil Rasheed. 2021. Comparing Deep Reinforcement Learning Algorithms’ Ability to Safely Navigate Challenging Waters. Frontiers Robotics AI 8 (2021), 738113.
  • Laskin et al. (2020) Michael Laskin, Aravind Srinivas, and Pieter Abbeel. 2020. CURL: Contrastive Unsupervised Representations for Reinforcement Learning. In Proceedings of the 37th ICML, Vol. 119. PMLR, 5639–5650.
  • Lathuilière et al. (2019) Stéphane Lathuilière, Benoit Massé, Pablo Mesejo, and Radu Horaud. 2019. Neural network based reinforcement learning for audio-visual gaze control in human-robot interaction. Pattern Recognit. Lett. 118 (2019), 61–71.
  • Latif et al. (2023) Siddique Latif, Heriberto Cuayáhuitl, Farrukh Pervez, Fahad Shamshad, Hafiz Shehbaz Ali, and Erik Cambria. 2023. A survey on deep reinforcement learning for audio-based applications. Artif. Intell. Rev. 56, 3 (2023), 2193–2240.
  • Lee et al. (2022) Kuang-Huei Lee, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Sergio Guadarrama, Ian Fischer, Winnie Xu, Eric Jang, Henryk Michalewski, and Igor Mordatch. 2022. Multi-Game Decision Transformers. In NeurIPS.
  • Lesort et al. (2018) Timothée Lesort, Natalia Díaz Rodríguez, Jean-François Goudou, and David Filliat. 2018. State representation learning for control: An overview. Neural Networks 108 (2018), 379–392.
  • Levine et al. (2020a) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020a. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. CoRR abs/2005.01643 (2020).
  • Levine et al. (2020b) Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. 2020b. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems. CoRR abs/2005.01643 (2020).
  • Li et al. (2022b) Chen Li, Chikashige Yamanaka, Kazuma Kaitoh, and Yoshihiro Yamanishi. 2022b. Transformer-based Objective-reinforced Generative Adversarial Network to Generate Desired Molecules. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI. ijcai.org, 3884–3890.
  • Li et al. (2019) Shiyang Li, Xiaoyong Jin, Yao Xuan, Xiyou Zhou, Wenhu Chen, Yu-Xiang Wang, and Xifeng Yan. 2019. Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting. In NeurIPS. 5244–5254.
  • Li et al. (2022a) Weiyuan Li, Ruoxin Hong, Jiwei Shen, and Yue Lu. 2022a. Learning to Navigate in Interactive Environments with the Transformer-based Memory. (2022).
  • Lillicrap et al. (2016) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2016. Continuous control with deep reinforcement learning. In 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings.
  • Lin et al. (2022) Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. 2022. A survey of transformers. AI Open 3 (2022), 111–132.
  • Lin et al. (2023) Xian Lin, Li Yu, Kwang-Ting Cheng, and Zengqiang Yan. 2023. BATFormer: Towards Boundary-Aware Lightweight Transformer for Efficient Medical Image Segmentation. IEEE Journal of Biomedical and Health Informatics (2023).
  • Liu et al. (2022b) Haochen Liu, Zhiyu Huang, Xiaoyu Mo, and Chen Lv. 2022b. Augmenting Reinforcement Learning with Transformer-based Scene Representation Learning for Decision-making of Autonomous Driving. CoRR abs/2208.12263 (2022).
  • Liu et al. (2019) Hao Liu, Richard Socher, and Caiming Xiong. 2019. Taming MAML: Efficient unbiased meta-reinforcement learning. In Proceedings of the 36th ICML, Vol. 97. PMLR, 4061–4071.
  • Liu et al. (2020) Liyuan Liu, Xiaodong Liu, Jianfeng Gao, Weizhu Chen, and Jiawei Han. 2020. Understanding the Difficulty of Training Transformers. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020. Association for Computational Linguistics, 5747–5763.
  • Liu et al. (2022a) Qinghua Liu, Alan Chung, Csaba Szepesvári, and Chi Jin. 2022a. When Is Partially Observable Reinforcement Learning Not Scary? In Conference on Learning Theory, Vol. 178. PMLR, 5175–5220.
  • Liu et al. (2023) Xuhan Liu, Kai Ye, Herman W. T. van Vlijmen, Adriaan P. IJzerman, and Gerard J. P. van Westen. 2023. DrugEx v3: scaffold-constrained drug design with graph transformer-based reinforcement learning. J. Cheminformatics 15, 1 (2023), 24.
  • Liu et al. (2022c) Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. 2022c. A ConvNet for the 2020s. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, 11966–11976.
  • Locatello et al. (2020) Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. 2020. Object-Centric Learning with Slot Attention. In NeurIPS.
  • Lohse et al. (2021) Oliver Lohse, Noah Pütz, and Korbinian Hörmann. 2021. Implementing an Online Scheduling Approach for Production with Multi Agent Proximal Policy Optimization (MAPPO). In Advances in Production Management Systems. Artificial Intelligence for Sustainable and Resilient Production Systems: IFIP WG 5.7 International Conference, APMS 2021, Nantes, France, September 5–9, 2021, Proceedings, Part V. Springer, 586–595.
  • Lu et al. (2022) Cong Lu, Philip J. Ball, Tim G. J. Rudner, Jack Parker-Holder, Michael A. Osborne, and Yee Whye Teh. 2022. Challenges and Opportunities in Offline Reinforcement Learning from Visual Observations. CoRR abs/2206.04779 (2022).
  • Lu et al. (2020) Hao Lu, Xingwen Zhang, and Shuang Yang. 2020. A Learning-based Iterative Method for Solving Vehicle Routing Problems. In 8th International Conference on Learning Representations, ICLR.
  • Lu et al. (2021) Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing Xu, Tao Xiang, and Li Zhang. 2021. SOFT: Softmax-free Transformer with Linear Complexity. In NeurIPS. 21297–21309.
  • Ma et al. (2021) Michel Ma, Pierluca D’Oro, Yoshua Bengio, and Pierre-Luc Bacon. 2021. Long-Term Credit Assignment via Model-based Temporal Shortcuts. In Deep RL Workshop NeurIPS 2021.
  • Malpure et al. (2021) Durvesh Malpure, Onkar Litake, and Rajesh Ingle. 2021. Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block. CoRR abs/2110.05270 (2021).
  • Manchin et al. (2019) Anthony Manchin, Ehsan Abbasnejad, and Anton van den Hengel. 2019. Reinforcement Learning with Attention that Works: A Self-Supervised Approach. In Neural Information Processing - 26th International Conference, ICONIP, Vol. 1143. Springer, 223–230.
  • Mazyavkina et al. (2021) Nina Mazyavkina, Sergey Sviridov, Sergei Ivanov, and Evgeny Burnaev. 2021. Reinforcement learning for combinatorial optimization: A survey. Comput. Oper. Res. 134 (2021), 105400.
  • Melo (2022) Luckeciano C. Melo. 2022. Transformers are Meta-Reinforcement Learners. In International Conference on Machine Learning, ICML, Vol. 162. PMLR, 15340–15359.
  • Meng et al. (2021) Linghui Meng, Muning Wen, Yaodong Yang, Chenyang Le, Xiyun Li, Weinan Zhang, Ying Wen, Haifeng Zhang, Jun Wang, and Bo Xu. 2021. Offline Pre-trained Multi-Agent Decision Transformer: One Big Sequence Model Tackles All SMAC Tasks. CoRR abs/2112.02845 (2021).
  • Mesnard et al. (2021) Thomas Mesnard, Theophane Weber, Fabio Viola, Shantanu Thakoor, Alaa Saade, Anna Harutyunyan, Will Dabney, Thomas S. Stepleton, Nicolas Heess, Arthur Guez, Eric Moulines, Marcus Hutter, Lars Buesing, and Rémi Munos. 2021. Counterfactual Credit Assignment in Model-Free Reinforcement Learning. In Proceedings of the 38th ICML, Vol. 139. PMLR, 7654–7664.
  • Micheli et al. (2022) Vincent Micheli, Eloi Alonso, and François Fleuret. 2022. Transformers are Sample Efficient World Models. CoRR abs/2209.00588 (2022).
  • Milani et al. (2022) Stephanie Milani, Nicholay Topin, Manuela Veloso, and Fei Fang. 2022. A Survey of Explainable Reinforcement Learning. CoRR abs/2202.08434 (2022).
  • Milosevic (2016) Nikola Milosevic. 2016. Equity forecast: Predicting long term stock price movement using machine learning. CoRR abs/1603.00751 (2016).
  • Miura et al. (2021) Yasuhide Miura, Yuhao Zhang, Emily Bao Tsai, Curtis P. Langlotz, and Dan Jurafsky. 2021. Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021. Association for Computational Linguistics, 5288–5304.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. 2013. Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013).
  • Moerland et al. (2020) Thomas M. Moerland, Joost Broekens, and Catholijn M. Jonker. 2020. Model-based Reinforcement Learning: A Survey. CoRR abs/2006.16712 (2020).
  • Mohanty et al. (2020) Sharada P. Mohanty, Jyotish Poonganam, Adrien Gaidon, Andrey Kolobov, Blake Wulfe, Dipam Chakraborty, Grazvydas Semetulskis, João Schapke, Jonas Kubilius, Jurgis Pasukonis, Linas Klimas, Matthew J. Hausknecht, Patrick MacAlpine, Quang Nhat Tran, Thomas Tumiel, Xiaocheng Tang, Xinwei Chen, Christopher Hesse, Jacob Hilton, William Hebgen Guss, Sahika Genc, John Schulman, and Karl Cobbe. 2020. Measuring Sample Efficiency and Generalization in Reinforcement Learning Benchmarks: NeurIPS 2020 Procgen Benchmark. In NeurIPS Competition and Demonstration Track, Vol. 133. PMLR, 361–395.
  • Moor et al. (2023) Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M. Krumholz, Jure Leskovec, Eric J. Topol, and Pranav Rajpurkar. 2023. Foundation models for generalist medical artificial intelligence. Nature 616, 7956 (2023), 259–265.
  • Mor and Speranza (2022) Andrea Mor and Maria Grazia Speranza. 2022. Vehicle routing problems over time: a survey. Ann. Oper. Res. 314, 1 (2022), 255–275.
  • Morad et al. (2023) Steven D. Morad, Ryan Kortvelesy, Matteo Bettini, Stephan Liwicki, and Amanda Prorok. 2023. POPGym: Benchmarking Partially Observable Reinforcement Learning. CoRR abs/2303.01859 (2023). https://doi.org/10.48550/arXiv.2303.01859
  • Motokawa and Sugawara (2021) Yoshinari Motokawa and Toshiharu Sugawara. 2021. MAT-DQN: Toward Interpretable Multi-agent Deep Reinforcement Learning for Coordinated Activities. In 30th International Conference on Artificial Neural Networks, Vol. 12894. Springer, 556–567.
  • Nakatani et al. (2022) Yuki Nakatani, Tomoyuki Kajiwara, and Takashi Ninomiya. 2022. Comparing BERT-based Reward Functions for Deep Reinforcement Learning in Machine Translation. In Proceedings of the 9th Workshop on Asian Translation, WAT@COLING 2022, Gyeongju, Republic of Korea, October 17, 2022. International Conference on Computational Linguistics, 37–43.
  • Nasir and Durlofsky (2023) Yusuf Nasir and Louis J. Durlofsky. 2023. Deep reinforcement learning for optimal well control in subsurface systems with uncertain geology. J. Comput. Phys. 477 (2023), 111945.
  • Negrinho et al. (2018) Renato Negrinho, Matthew R. Gormley, and Geoffrey J. Gordon. 2018. Learning Beam Search Policies via Imitation Learning. In NeurIPS. 10675–10684.
  • Nguyen et al. (2020) Thanh Thi Nguyen, Ngoc Duy Nguyen, and Saeid Nahavandi. 2020. Deep Reinforcement Learning for Multiagent Systems: A Review of Challenges, Solutions, and Applications. IEEE Trans. Cybern. 50, 9 (2020), 3826–3839.
  • Nikishin et al. (2018) Evgenii Nikishin, Pavel Izmailov, Ben Athiwaratkun, Dmitrii Podoprikhin, Timur Garipov, Pavel Shvechikov, Dmitry Vetrov, and Andrew Gordon Wilson. 2018. Improving stability in deep reinforcement learning with weight averaging. In Uncertainty in Artificial Intelligence Workshop on Uncertainty in Deep Learning.
  • Obando-Ceron and Castro (2021) Johan Samir Obando-Ceron and Pablo Samuel Castro. 2021. Revisiting Rainbow: Promoting more insightful and inclusive deep reinforcement learning research. In Proceedings of the 38th International Conference on Machine Learning, ICML, Vol. 139. PMLR, 1373–1383.
  • OpenAI et al. (2019) OpenAI, Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, Jonas Schneider, Nikolas Tezak, Jerry Tworek, Peter Welinder, Lilian Weng, Qiming Yuan, Wojciech Zaremba, and Lei Zhang. 2019. Solving Rubik’s Cube with a Robot Hand. CoRR abs/1910.07113 (2019).
  • Oroojlooy and Hajinezhad (2023) Afshin Oroojlooy and Davood Hajinezhad. 2023. A review of cooperative multi-agent deep reinforcement learning. Appl. Intell. 53, 11 (2023), 13677–13722.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. 2022. Training language models to follow instructions with human feedback. In NeurIPS.
  • Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. ACL, 311–318.
  • Parisotto et al. (2020) Emilio Parisotto, H. Francis Song, Jack W. Rae, Razvan Pascanu, Çaglar Gülçehre, Siddhant M. Jayakumar, Max Jaderberg, Raphaël Lopez Kaufman, Aidan Clark, Seb Noury, Matthew M. Botvinick, Nicolas Heess, and Raia Hadsell. 2020. Stabilizing Transformers for Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119. PMLR, 7487–7498.
  • Parmentier and T’kindt (2023) Axel Parmentier and Vincent T’kindt. 2023. Structured learning based heuristics to solve the single machine scheduling problem with release times and sum of completion times. Eur. J. Oper. Res. 305, 3 (2023), 1032–1041.
  • Pascanu et al. (2013) Razvan Pascanu, Tomás Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, ICML, Vol. 28. JMLR.org, 1310–1318.
  • Paster et al. (2022) Keiran Paster, Sheila A. McIlraith, and Jimmy Ba. 2022. You Can’t Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments. (2022).
  • Petit et al. (2021) Olivier Petit, Nicolas Thome, Clément Rambour, Loic Themyr, Toby Collins, and Luc Soler. 2021. U-Net Transformer: Self and Cross Attention for Medical Image Segmentation. In Machine Learning in Medical Imaging - 12th International Workshop, MLMI, MICCAI, Vol. 12966. Springer, 267–276.
  • Putterman et al. (2021) Aaron Putterman, Kevin Lu, Igor Mordatch, and P. Abbeel. 2021. Pretraining for Language-Conditioned Imitation with Transformers.
  • Qu et al. (2022) Jia Qu, Shotaro Miwa, and Yukiyasu Domae. 2022. Interpretable Navigation Agents Using Attention-Augmented Memory. In IEEE International Conference on Systems, Man, and Cybernetics, SMC 2022, Prague, Czech Republic, October 9-12, 2022. IEEE, 2575–2582.
  • Radford et al. (2018) Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever, et al. 2018. Improving language understanding by generative pre-training. (2018).
  • Rakelly et al. (2019) Kate Rakelly, Aurick Zhou, Chelsea Finn, Sergey Levine, and Deirdre Quillen. 2019. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. In Proceedings of the 36th International Conference on Machine Learning, ICML, Vol. 97. PMLR, 5331–5340.
  • Ramachandram and Taylor (2017) Dhanesh Ramachandram and Graham W. Taylor. 2017. Deep Multimodal Learning: A Survey on Recent Advances and Trends. IEEE Signal Process. Mag. 34, 6 (2017), 96–108.
  • Ren et al. (2021) Hongyu Ren, Hanjun Dai, Zihang Dai, Mengjiao Yang, Jure Leskovec, Dale Schuurmans, and Bo Dai. 2021. Combiner: Full Attention Transformer with Sparse Computation Cost. In NeurIPS. 22470–22482.
  • Ribeiro et al. (2020) Antônio H. Ribeiro, Koen Tiels, Luis Antonio Aguirre, and Thomas B. Schön. 2020. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In The 23rd International Conference on Artificial Intelligence and Statistics, AISTATS, Vol. 108. PMLR, 2370–2380.
  • Robine et al. (2023) Jan Robine, Marc Höftmann, Tobias Uelwer, and Stefan Harmeling. 2023. Transformer-based World Models Are Happy With 100k Interactions. CoRR abs/2303.07109 (2023).
  • Savva et al. (2019) Manolis Savva, Jitendra Malik, Devi Parikh, Dhruv Batra, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, and Vladlen Koltun. 2019. Habitat: A Platform for Embodied AI Research. In IEEE/CVF International Conference on Computer Vision, ICCV. IEEE, 9338–9346.
  • Schmidhuber (2019) Jürgen Schmidhuber. 2019. Reinforcement Learning Upside Down: Don’t Predict Rewards - Just Map Them to Actions. CoRR abs/1912.02875 (2019).
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017).
  • Seo et al. (2019) Minah Seo, Luiz Felipe Vecchietti, Sangkeum Lee, and Dongsoo Har. 2019. Rewards Prediction-Based Credit Assignment for Reinforcement Learning With Sparse Binary Rewards. IEEE Access 7 (2019), 118776–118791.
  • Seo et al. (2022) Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. 2022. Masked World Models for Visual Control. In Conference on Robot Learning, CoRL, Vol. 205. PMLR, 1332–1344.
  • Serrano and Smith (2019) Sofia Serrano and Noah A. Smith. 2019. Is Attention Interpretable? In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL. Association for Computational Linguistics, 2931–2951.
  • Shang et al. (2022) Jinghuan Shang, Kumara Kahatapitiya, Xiang Li, and Michael S. Ryoo. 2022. StARformer: Transformer with State-Action-Reward Representations for Visual Reinforcement Learning. In Computer Vision - ECCV - 17th European Conference, Vol. 13699. Springer, 462–479.
  • Shao et al. (2019a) Kun Shao, Zhentao Tang, Yuanheng Zhu, Nannan Li, and Dongbin Zhao. 2019a. A Survey of Deep Reinforcement Learning in Video Games. CoRR abs/1912.10944 (2019).
  • Shao et al. (2019b) Kun Shao, Yuanheng Zhu, and Dongbin Zhao. 2019b. StarCraft Micromanagement With Reinforcement Learning and Curriculum Transfer Learning. IEEE Trans. Emerg. Top. Comput. Intell. 3, 1 (2019), 73–84.
  • Sharma et al. (2021) Piyush K. Sharma, Rolando Fernandez, Erin G. Zaroukian, Michael R. Dorothy, Anjon Basak, and Derrik E. Asher. 2021. Survey of Recent Multi-Agent Reinforcement Learning Algorithms Utilizing Centralized Training. CoRR abs/2107.14316 (2021).
  • Shawki et al. (2021) N. Shawki, R. Rodriguez Nunez, I. Obeid, and J. Picone. 2021. On automating hyperparameter optimization for deep learning applications. In 2021 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE, 1–7.
  • Shi et al. (2021) Tianyu Shi, Dong Chen, Kaian Chen, and Zhaojian Li. 2021. Offline Reinforcement Learning for Autonomous Driving with Safety and Exploration Enhancement. CoRR abs/2110.07067 (2021).
  • Smith (2022) Carson Smith. 2022. Attention-Based Learning for Combinatorial Optimization. Ph. D. Dissertation. Massachusetts Institute of Technology.
  • Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Md. Mostofa Ali Patwary, Prabhat, and Ryan P. Adams. 2015. Scalable Bayesian Optimization Using Deep Neural Networks. In Proceedings of the 32nd ICML, Vol. 37. JMLR.org, 2171–2180.
  • Song et al. (2021) Hailuo Song, Ao Li, Tong Wang, and Minghui Wang. 2021. Multimodal Deep Reinforcement Learning with Auxiliary Task for Obstacle Avoidance of Indoor Mobile Robot. Sensors 21, 4 (2021), 1363.
  • Strnad et al. (2019) Felix M. Strnad, Wolfram Barfuss, Jonathan F. Donges, and Jobst Heitzig. 2019. Deep reinforcement learning in World-Earth system models to discover sustainable management strategies. CoRR abs/1908.05567 (2019).
  • Sutton and Barto (1998) Richard S. Sutton and Andrew G. Barto. 1998. Reinforcement learning - an introduction. MIT Press.
  • Ta et al. (2020) Van-Dai Ta, Chuan-Ming Liu, and Direselign Addis Tadesse. 2020. Portfolio optimization-based stock prediction using long-short term memory network in quantitative trading. Applied Sciences 10, 2 (2020), 437.
  • Tao et al. (2022) Tianxin Tao, Daniele Reda, and Michiel van de Panne. 2022. Evaluating Vision Transformer Methods for Deep Reinforcement Learning from Pixels. CoRR abs/2204.04905 (2022).
  • Thakkar and Chaudhari (2021) Ankit Thakkar and Kinjal Chaudhari. 2021. A comprehensive survey on portfolio optimization, stock price and trend prediction using particle swarm optimization. Archives of Computational Methods in Engineering 28 (2021), 2133–2164.
  • Tunstall et al. (2022) Lewis Tunstall, Leandro von Werra, and Thomas Wolf. 2022. Natural language processing with transformers. O’Reilly Media, Inc.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In 30th Annual Conference on Neural Information Processing Systems. 5998–6008.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C. Lawrence Zitnick, and Devi Parikh. 2015. CIDEr: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2015, Boston, MA, USA, June 7-12, 2015. IEEE Computer Society, 4566–4575.
  • Vinyals et al. (2019) Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Çaglar Gülçehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. 2019. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nat. 575, 7782 (2019), 350–354.
  • Vinyals et al. (2017) Oriol Vinyals, Timo Ewalds, Sergey Bartunov, Petko Georgiev, Alexander Sasha Vezhnevets, Michelle Yeo, Alireza Makhzani, Heinrich Küttler, John P. Agapiou, Julian Schrittwieser, John Quan, Stephen Gaffney, Stig Petersen, Karen Simonyan, Tom Schaul, Hado van Hasselt, David Silver, Timothy P. Lillicrap, Kevin Calderone, Paul Keet, Anthony Brunasso, David Lawrence, Anders Ekermo, Jacob Repp, and Rodney Tsing. 2017. StarCraft II: A New Challenge for Reinforcement Learning. CoRR abs/1708.04782 (2017).
  • Vo et al. (2017) Quan-Hoang Vo, Huy-Tien Nguyen, Bac Le, and Minh-Le Nguyen. 2017. Multi-channel LSTM-CNN model for Vietnamese sentiment analysis. In 9th International Conference on Knowledge and Systems Engineering, KSE 2017, Hue, Vietnam, October 19-21, 2017. IEEE, 24–29.
  • Wang et al. (2021a) Jike Wang, Chang-Yu Hsieh, Mingyang Wang, Xiaorui Wang, Zhenxing Wu, Dejun Jiang, Benben Liao, Xujun Zhang, Bo Yang, Qiaojun He, Dongsheng Cao, Xi Chen, and Tingjun Hou. 2021a. Multi-constraint molecular generation based on conditional transformer, knowledge distillation and reinforcement learning. Nat. Mach. Intell. 3, 10 (2021), 914–922.
  • Wang et al. (2021c) Jiacheng Wang, Lan Wei, Liansheng Wang, Qichao Zhou, Lei Zhu, and Jing Qin. 2021c. Boundary-Aware Transformers for Skin Lesion Segmentation. In Medical Image Computing and Computer Assisted Intervention - MICCAI 24th International Conference, Vol. 12901. Springer, 206–216.
  • Wang et al. (2022d) Jiayue Wang, Hongbo Zhao, Haoqiang Liu, Liwei Geng, and Zebin Sun. 2022d. A Distributed Vehicle-assisted Computation Offloading Scheme based on DRL in Vehicular Networks. In 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022, Taormina, Italy, May 16-19, 2022. IEEE, 200–209.
  • Wang et al. (2022e) Kerong Wang, Hanye Zhao, Xufang Luo, Kan Ren, Weinan Zhang, and Dongsheng Li. 2022e. Bootstrapped Transformer for Offline Reinforcement Learning. In NeurIPS.
  • Wang et al. (2020a) Lingxiao Wang, Zhuoran Yang, and Zhaoran Wang. 2020a. Breaking the Curse of Many Agents: Provable Mean Embedding Q-Iteration for Mean-Field Reinforcement Learning. In Proceedings of the 37th International Conference on Machine Learning, ICML, Vol. 119. PMLR, 10092–10103.
  • Wang et al. (2022b) Minrui Wang, Mingxiao Feng, Wengang Zhou, and Houqiang Li. 2022b. Stabilizing Voltage in Power Distribution Networks via Multi-Agent Reinforcement Learning with Transformer. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022. ACM, 1899–1909.
  • Wang et al. (2023) Qi Wang, Kenneth H. Lai, and Chunlei Tang. 2023. Solving combinatorial optimization problems over graphs with BERT-Based Deep Reinforcement Learning. Inf. Sci. 619 (2023), 930–946.
  • Wang et al. (2018) Tingwu Wang, Renjie Liao, Jimmy Ba, and Sanja Fidler. 2018. NerveNet: Learning Structured Policy with Graph Neural Networks. In 6th International Conference on Learning Representations, ICLR.
  • Wang et al. (2020b) Weixun Wang, Tianpei Yang, Yong Liu, Jianye Hao, Xiaotian Hao, Yujing Hu, Yingfeng Chen, Changjie Fan, and Yang Gao. 2020b. From Few to More: Large-Scale Dynamic Multiagent Curriculum Learning. In The Thirty-Fourth Conference on Artificial Intelligence, AAAI. AAAI Press, 7293–7300.
  • Wang and Chen (2022) Yang Wang and Zhibin Chen. 2022. A Deep Reinforcement Learning Algorithm Using A New Graph Transformer Model for Routing Problems. In Intelligent Systems and Applications - Proceedings of the Intelligent Systems Conference, IntelliSys, Vol. 544. Springer, 365–379.
  • Wang et al. (2021b) Yifan Wang, Zhichao Min, and Sen Jia. 2021b. Local-Global-Aware Convolutional Transformer for Hyperspectral Image Classification. In 23rd Int Conf on High Performance Computing. IEEE, 1188–1194.
  • Wang et al. (2022a) Zhendong Wang, Xiaodong Cun, Jianmin Bao, Wengang Zhou, Jianzhuang Liu, and Houqiang Li. 2022a. Uformer: A General U-Shaped Transformer for Image Restoration. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, 17662–17672.
  • Wang et al. (2022c) Zhihai Wang, Jie Wang, Qi Zhou, Bin Li, and Houqiang Li. 2022c. Sample-Efficient Reinforcement Learning via Conservative Model-Based Actor-Critic. In Thirty-Sixth AAAI Conference on Artificial Intelligence. AAAI Press, 8612–8620.
  • Wen et al. (2022) Muning Wen, Jakub Grudzien Kuba, Runji Lin, Weinan Zhang, Ying Wen, Jun Wang, and Yaodong Yang. 2022. Multi-Agent Reinforcement Learning is a Sequence Modeling Problem. In NeurIPS.
  • Williams (1992) Ronald J. Williams. 1992. Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning. Mach. Learn. 8 (1992), 229–256.
  • Woo et al. (2023) Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. 2023. ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders. CoRR abs/2301.00808 (2023).
  • Wu et al. (2022) Yaoxin Wu, Wen Song, Zhiguang Cao, Jie Zhang, and Andrew Lim. 2022. Learning Improvement Heuristics for Solving Routing Problems. IEEE Trans. Neural Networks Learn. Syst. 33, 9 (2022), 5057–5069.
  • Xiong et al. (2019) Yuxuan Xiong, Bo Du, and Pingkun Yan. 2019. Reinforced Transformer for Medical Image Captioning. In Machine Learning in Medical Imaging - 10th International Workshop, MLMI, MICCAI, Vol. 11861. Springer, 673–680.
  • Xu et al. (2020) Ke Xu, Yifan Zhang, Deheng Ye, Peilin Zhao, and Mingkui Tan. 2020. Relation-Aware Transformer for Portfolio Policy Learning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020. ijcai.org, 4647–4653.
  • Xu et al. (2022b) Mengdi Xu, Yikang Shen, Shun Zhang, Yuchen Lu, Ding Zhao, Joshua B. Tenenbaum, and Chuang Gan. 2022b. Prompting Decision Transformer for Few-Shot Policy Generalization. In International Conference on Machine Learning, ICML, Vol. 162. PMLR, 24631–24645.
  • Xu et al. (2022a) Nuo Xu, Jianlong Chang, Xing Nie, Chunlei Huo, Shiming Xiang, and Chunhong Pan. 2022a. AME: Attention and Memory Enhancement in Hyper-Parameter Optimization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR. IEEE, 480–489.
  • Xu et al. (2022c) Peng Xu, Xiatian Zhu, and David A. Clifton. 2022c. Multimodal Learning with Transformers: A Survey. CoRR abs/2206.06488 (2022).
  • Yamagata et al. (2022) Taku Yamagata, Ahmed Khalil, and Raúl Santos-Rodríguez. 2022. Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL. CoRR abs/2209.03993 (2022).
  • Yang et al. (2022) Biao Yang, Jicheng Yang, Rongrong Ni, Changchun Yang, and Xiaofeng Liu. 2022. Multi-granularity scenarios understanding network for trajectory prediction. Complex & Intelligent Systems (2022), 1–14.
  • Yang et al. (2021) Lijuan Yang, Guanghui Yang, Zhitong Bing, Yuan Tian, Yuzhen Niu, Liang Huang, and Lei Yang. 2021. Transformer-Based Generative Model Accelerating the Development of Novel BRAF Inhibitors. ACS Omega 6 (2021), 33864–33873.
  • Yarats et al. (2021) Denis Yarats, Amy Zhang, Ilya Kostrikov, Brandon Amos, Joelle Pineau, and Rob Fergus. 2021. Improving Sample Efficiency in Model-Free Reinforcement Learning from Images. In Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI. AAAI Press, 10674–10681.
  • Ying et al. (2021) Chengxuan Ying, Tianle Cai, Shengjie Luo, Shuxin Zheng, Guolin Ke, Di He, Yanming Shen, and Tie-Yan Liu. 2021. Do Transformers Really Perform Badly for Graph Representation?. In NeurIPS. 28877–28888.
  • Young and Sutton (2020) Kenny Young and Richard S. Sutton. 2020. Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning. CoRR abs/2010.15268 (2020).
  • Yu et al. (2023) Chao Yu, Jiming Liu, Shamim Nemati, and Guosheng Yin. 2023. Reinforcement Learning in Healthcare: A Survey. ACM Comput. Surv. 55, 2 (2023), 5:1–5:36.
  • Yu et al. (2022) Chao Yu, Xinyi Yang, Jiaxuan Gao, Huazhong Yang, Yu Wang, and Yi Wu. 2022. Learning Efficient Multi-agent Cooperative Visual Exploration. In Computer Vision - ECCV - 17th European Conference, Vol. 13699. Springer, 497–515.
  • Zhan et al. (2017) Yusen Zhan, Haitham Bou-Ammar, and Matthew E. Taylor. 2017. Scalable lifelong reinforcement learning. Pattern Recognit. 72 (2017), 407–418.
  • Zhang et al. (2022b) Fengzhuo Zhang, Boyi Liu, Kaixin Wang, Vincent Y. F. Tan, Zhuoran Yang, and Zhaoran Wang. 2022b. Relational Reasoning via Set Transformers: Provable Efficiency and Applications to MARL. In NeurIPS.
  • Zhang et al. (2022c) Hao Zhang, Hao Wang, and Zhen Kan. 2022c. Exploiting Transformer in Reinforcement Learning for Interpretable Temporal Logic Motion Planning. CoRR abs/2209.13220 (2022).
  • Zhang et al. (2018b) Jiaping Zhang, Tiancheng Zhao, and Zhou Yu. 2018b. Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, Melbourne, Australia. Association for Computational Linguistics, 140–150.
  • Zhang et al. (2018a) Kaiqing Zhang, Zhuoran Yang, Han Liu, Tong Zhang, and Tamer Basar. 2018a. Fully Decentralized Multi-Agent Reinforcement Learning with Networked Agents. In Proceedings of the 35th ICML, Vol. 80. PMLR, 5867–5876.
  • Zhang et al. (2022d) Li Zhang, Sixiao Zheng, Jiachen Lu, Xinxuan Zhao, Xiatian Zhu, Yanwei Fu, Tao Xiang, and Jianfeng Feng. 2022d. Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective. CoRR abs/2207.09339 (2022).
  • Zhang et al. (2022a) Tianyao Zhang, Xiaoguang Hu, Jin Xiao, and Guofeng Zhang. 2022a. TVENet: Transformer-Based Visual Exploration Network for Mobile Robot in Unseen Environment. IEEE Access 10 (2022), 62056–62072.
  • Zhang et al. (2023) Zhipeng Zhang, Zhimin Wei, Zhongzhen Huang, Rui Niu, and Peng Wang. 2023. One for all: One-stage referring expression comprehension with dynamic reasoning. Neurocomputing 518 (2023), 523–532.
  • Zhong et al. (2021) Huasong Zhong, Jingyuan Chen, Chen Shen, Hanwang Zhang, Jianqiang Huang, and Xian-Sheng Hua. 2021. Self-Adaptive Neural Module Transformer for Visual Question Answering. IEEE Trans. Multim. 23 (2021), 1264–1273.
  • Zhou et al. (2023) Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, Hao Peng, Jianxin Li, Jia Wu, Ziwei Liu, Pengtao Xie, Caiming Xiong, Jian Pei, Philip S. Yu, and Lichao Sun. 2023. A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT. CoRR abs/2302.09419 (2023).
  • Zhou et al. (2021) Hong-Yu Zhou, Chixiang Lu, Sibei Yang, and Yizhou Yu. 2021. ConvNets vs. Transformers: Whose Visual Representations are More Transferable?. In IEEE/CVF International Conference on Computer Vision Workshops, ICCVW. IEEE, 2230–2238.
  • Zhou et al. (2020) Meng Zhou, Ziyu Liu, Pengwei Sui, Yixuan Li, and Yuk Ying Chung. 2020. Learning Implicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning. In NeurIPS.