Imagine, Initialize, and Explore: An Effective Exploration Method in Multi-Agent Reinforcement Learning
Abstract
Effective exploration is crucial to discovering optimal strategies for multi-agent reinforcement learning (MARL) in complex coordination tasks. Existing methods mainly utilize intrinsic rewards to enable committed exploration or use role-based learning for decomposing joint action spaces instead of directly conducting a collective search in the entire action-observation space. However, they often face challenges obtaining specific joint action sequences to reach successful states in long-horizon tasks. To address this limitation, we propose Imagine, Initialize, and Explore (IIE), a novel method that offers a promising solution for efficient multi-agent exploration in complex scenarios. IIE employs a transformer model to imagine how the agents reach a critical state that can influence each other’s transition functions. Then, we initialize the environment at this state using a simulator before the exploration phase. We formulate the imagination as a sequence modeling problem, where the states, observations, prompts, actions, and rewards are predicted autoregressively. The prompt consists of timestep-to-go, return-to-go, influence value, and one-shot demonstration, specifying the desired state and trajectory as well as guiding the action generation. By initializing agents at the critical states, IIE significantly increases the likelihood of discovering potentially important under-explored regions. Despite its simplicity, empirical results demonstrate that our method outperforms multi-agent exploration baselines on the StarCraft Multi-Agent Challenge (SMAC) and SMACv2 environments. Particularly, IIE shows improved performance in the sparse-reward SMAC tasks and produces more effective curricula over the initialized states than other generative methods, such as CVAE-GAN and diffusion models.
Introduction
Recent progress in cooperative multi-agent reinforcement learning (MARL) has shown attractive prospects for real-world applications, such as autonomous driving (Zhou et al. 2021) and active voltage control on power distribution networks (Wang et al. 2021). To utilize global information during training and maintain scalability in execution, centralized training with decentralized execution (CTDE) has become a widely adopted paradigm in MARL. Under this paradigm, value decomposition methods (Rashid et al. 2018; Son et al. 2019; Rashid et al. 2020) factorize the joint $Q$-value as a function of individual utility functions, ensuring consistency between the centralized policy and the individual policies. Consequently, they have achieved state-of-the-art performance in challenging tasks, such as StarCraft unit micromanagement (Samvelyan et al. 2019).
Despite their success, the simple $\u03f5$-greedy exploration strategy used in these methods has been found ineffective in solving coordination tasks with complex state and reward transitions (Mahajan et al. 2019; Wang et al. 2020c; Zheng et al. 2021). To address this limitation, MAVEN (Mahajan et al. 2019) adopts a latent space for hierarchical control, allowing agents to condition their behavior on shared latent variables and enabling committed exploration. EITI and EDTI (Wang et al. 2020c) quantify and characterize the influence of one agent’s behavior on others using mutual information and the difference of expected returns, respectively. By optimizing EITI or EDTI objectives as intrinsic rewards, agents are encouraged to coordinate their exploration and learn policies that optimize team performance. EMC (Zheng et al. 2021), on the other hand, uses prediction errors of individual $Q$-values to capture the novelty of states and the influence from other agents for coordinated exploration. Unfortunately, these methods suffer from the exponential growth of the action-observation space with the number of agents and become inefficient in long-horizon tasks.
It is possible to decompose the complex task rather than directly conducting collective searches across the entire action-observation space. To this end, RODE (Wang et al. 2020b) decomposes joint action spaces into restricted role action spaces by clustering actions based on their effects on the environment and other agents. Low-level role-based policies explore and learn within these restricted action-observation spaces, reducing execution and training complexity. However, RODE still struggles with long-term tasks as it is still unlikely to obtain a long sequence of specific actions to achieve successful states. The Go-Explore family of algorithms (Ecoffet et al. 2019; Guo et al. 2020; Ecoffet et al. 2021) decomposes the exploration into two phases: returning to the previous state and then starting to explore. These methods store the high-scoring trajectories in an archive and return to sampled states from this archive by running a goal-conditioned policy. However, in the multi-agent field, agents should be encouraged to teleport to the states where interactions happen and may lead to critical under-explored regions. These approaches only consider the sparse and deceptive rewards in single-agent settings, making them impractical for MARL with complex reward and transition dependencies among cooperative agents.
Teleporting agents to interaction states that influence each other’s transition function can significantly increase the likelihood of discovering potentially important yet rarely visited states and reduce the exploration space. However, there often exist multiple feasible but inefficient trajectories to reach such interaction states due to the abundance of agents’ tactics and the compositional nature of their functionalities. In light of this, we propose a novel MARL exploration method named Imagine, Initialize, and Explore (IIE). It leverages the GPT architecture (Radford et al. 2018) to imagine trajectories from the initial state to interaction states, acting as a powerful “memorization engine” that can generate diverse agent behaviors. We use target timestep-to-go, return-to-go, influence value, and a one-shot demonstration as prompts to specify the “path” of the imagined trajectory. The influence value is an advantage function that compares an agent’s current action $Q$-value to a counterfactual baseline that marginalizes this agent. Specifically, we obtain several trajectory segments by dividing the exploration episode, where each segment starts at the initial state from the environment and ends at an interaction state with the highest influence value. Then, the GPT model learns to predict states, observations, prompts, actions, and rewards in an autoregressive manner on these segments.
Fig. 1 presents an overview of IIE architecture. Before the exploration phase, the agents are given a new initial state and few-shot demonstrations to construct the prompt and generate the imagined trajectory. The agents operating in partially observable environments can benefit from this imagination by utilizing it to initialize the recurrent neural network state. Then, the agents are initialized to the last state of the imagined trajectory by the simulator and interact with the environment to explore. IIE gradually provides high-influence starting points that can be viewed as auto-curricula that help agents collect crucial under-explored regions. To make the best use of the imagination, we also stitch it with the interaction sequence from exploration for policy training.
The main contributions of this paper are threefold: First, it introduces Imagine, Initialize, and Explore, which leverages the GPT architecture to imagine how the agents reach critical states before exploration. This method bridges sequence modeling and transformers with MARL instead of using GPT as a replacement for reinforcement learning algorithms (Chen et al. 2021). Second, empirical results demonstrate significant performance improvements of IIE on partially observable MARL benchmarks, including StarCraft Multi-Agent Challenge (SMAC) with dense and sparse reward settings as well as SMACv2. And third, guided by the target timestep-to-go, return-to-go, influence value, and a one-shot demonstration, the imagination model can produce more effective curricula and outperforms behavior cloning, CVAE-GAN, and classifier-guided diffusion.
Background
Decentralized Partially Observable Markov Decision Process.
A fully cooperative multi-agent task in the partially observable setting can be formulated as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) (Oliehoek and Amato 2016), consisting of a tuple $G=\u27e8A,S,\mathrm{\Omega},O,U,P,r,\gamma \u27e9$, where $a\in A\equiv \{1,\mathrm{\dots},n\}$ is a set of agents, $S$ is a set of states, and $\mathrm{\Omega}$ is a set of joint observations. At each time step, each agent obtains its observation $o\in \mathrm{\Omega}$ based on the observation function $O(s,a):S\times A\to \mathrm{\Omega}$, and an action-observation history ${\tau}_{a}\in T\equiv {(\mathrm{\Omega}\times U)}^{\ast}$. Each agent $a$ chooses an action ${u}_{a}\in U$ by a stochastic policy ${\pi}_{a}\left({u}_{a}\right|{\tau}_{a}):T\times U\to [0,1]$, which forms a joint action $\mathbf{u}\in \mathbf{U}$. It results in a joint reward $r(s,\mathbf{u})$ and a transit to the next state ${s}^{\prime}\sim P(\cdot |s,\text{\mathbf{u}})$. The formal objective function is to find the joint policy $\bm{\pi}$ that maximizes a joint action-value function ${Q}^{\bm{\pi}}({s}_{t},{\text{\mathbf{u}}}_{t})=r({s}_{t},{\text{\mathbf{u}}}_{t})+\gamma {\mathbb{E}}_{{s}^{\prime}}\left[{V}^{\bm{\pi}}({s}^{\prime})\right]$, where ${V}^{\bm{\pi}}(s)=\mathbb{E}\left[{\sum}_{t=0}^{\mathrm{\infty}}{\gamma}^{t}{r}_{t}|{s}_{0}=s,\bm{\pi}\right]$, and $\gamma \in [0,1)$ is a discounted factor.
Centralized Training with Decentralized Execution.
In this paradigm, agents’ policies are trained with access to global information in a centralized way and executed only based on local histories in a decentralized way (Kraemer and Banerjee 2016). One of the most significant challenges is to guarantee the consistency between the individual policies and the centralized policy, which is also known as Individual-Global Max (Son et al. 2019):
$$\underset{\mathbf{u}}{\mathrm{arg}\mathrm{max}}{Q}_{j}({s}_{t},\mathbf{u})=\left\{\begin{array}{c}{\mathrm{arg}\mathrm{max}}_{{u}^{1}}{Q}^{1}({s}_{t},{u}^{1})\\ \mathrm{\dots}\\ {\mathrm{arg}\mathrm{max}}_{{u}^{n}}{Q}^{n}({s}_{t},{u}^{n})\end{array}\right\}$$ | (1) |
To address this problem, QMIX (Rashid et al. 2018) applies a state-dependent monotonic mixing network ${f}_{s}$ to combine per-agent $Q$-value functions with the joint $Q$-value function ${Q}_{j}(s,\mathbf{u})$. The restricted space of all ${Q}_{j}(s,\mathbf{u})$ that QMIX can be represented as:
$${\mathcal{Q}}^{m}:=\{{Q}_{j}|{Q}_{j}={f}_{s}({Q}^{1}(s,{u}^{1}),\mathrm{\dots},{Q}^{n}(s,{u}^{n}))\},$$ | (2) |
where ${Q}^{a}(s,{u}^{a})\in \mathbb{R}$, $\frac{\partial {f}_{s}}{\partial {Q}^{a}}\u2a7e0,\forall a\in A$.
Method
This section presents a novel multi-agent exploration method, Imagine, Initialize, and Explore, consisting of three key components: (1) the imagination model, which utilizes a transformer model to generate the trajectory representing how agents reach a target state autoregressively, (2) the prompt generator, which specifies the target state and trajectory for imagination, and (3) the environment simulator, which can teleport the agents to a state instantly.
Imagination Model
It has been found that agents operating in a partially observable environment can benefit from the action-observation history (Hausknecht and Stone 2015; Karkus, Hsu, and Lee 2017; Rashid et al. 2018), e.g., a model that has a recall capability such as gated recurrent unit (GRU). Therefore, it is necessary to imagine the trajectory from the initial state to the selected interaction state. We formulate the problem of trajectory generation as a sequence modeling task, where the sequences have the following form:
$$x=\{\mathrm{\dots},{s}_{t},{o}_{t}^{1},\mathrm{\dots},{o}_{t}^{n},\mathcal{P}({s}_{t}),{u}_{t}^{1},\mathrm{\dots},{u}_{t}^{n},{r}_{t},{s}^{t+1},\mathrm{\dots}\},$$ | (3) |
where $t$ represents the timestep, $n$ is the number of agents, and $\mathcal{P}({s}_{t})$ is the prompt for action generation at the state ${s}_{t}$, with the definition provided in the next subsection.
We obtain the token embeddings for states, observations, prompts, actions, and rewards through a linear layer followed by layer normalization. Moreover, an embedding for each timestep is learned and added to each token. The transformer model processes the tokens and performs autoregressive modeling to predict future tokens.
The imagination model is trained to focus on interaction states where the agents can influence each other’s transition function by prioritizing trajectories with a high-influence last state in the sampled batch. Specifically, we split interaction sequences from the sampled batch into $K$ segments, starting from the initial state by the simulator and ending at the states with top-$K$ high-influence levels. To ensure generalization across the entire explored regions, we also enforce all state-action pairs of the sampled batch to be assigned a minimum probability of $\frac{\lambda}{N}$, where $\lambda $ is a hyperparameter, and $N$ is the number of state-action pairs in the batch. The influence $\mathcal{I}$ is defined as an advantage function that compares the $Q$-value for the current action ${u}^{a}$ to a counterfactual baseline, marginalizing out ${u}^{a*}$ at a given state $s$:
$$\mathcal{I}(s)=\underset{a\in A}{\mathrm{max}}\left\{{Q}_{j}(s,\bm{\tau},\text{\mathbf{u}})-{\mathbb{E}}_{{u}^{a*}}{Q}_{j}(s,\bm{\tau},({u}^{a*},{u}^{-a}))\right\},$$ | (4) |
where ${Q}_{j}(s,\bm{\tau},\text{\mathbf{u}})={f}_{s}({Q}^{1}({\tau}^{1},{u}^{1};\theta ),\mathrm{\dots},{Q}^{n}({\tau}^{n},{u}^{n};\theta );\varphi )$ represents the joint $Q$-value function, $-a$ denotes all agents $A$ except agent $a$, ${f}_{s}$ denotes the monotonic mixing network whose non-negative weights are generated by hyper-networks that take the state as input.
The imagination model parameterized by $\psi $ is trained by:
${L}_{m}$ | $$ | (5) | ||
$$ |
where $T$ is the episode length. Since the observation is only related to the current state and the vision range of the agents, we filter out the historical memories in $$ and use ${s}_{t}$ as the input of the observation model ${q}^{\psi}({o}_{t}^{a}|{s}_{t})$.
Prompt For Imagination
The primary objective of our imagination model is to identify critical states and imagine how to reach them from the initial point. However, multiple feasible trajectories exist to reach these states due to the diverse tactics and compositional nature of agents’ functionalities. Despite their feasibility, many of these trajectories are highly inefficient for imagination. For example, agents may wander around their initial points before engaging the enemy, decreasing success rates within a limited episode length. Therefore, a prompt, serving as a condition for action generation, is necessary to specify the target trajectory.
Given the starting state ${s}_{i}$, we propose a prompt generator ${\mathcal{P}}^{\xi}({s}_{i})$ to predict the sequence $\{{s}_{i},{\mathcal{I}}_{i},{\mathcal{T}}_{i},{\mathcal{R}}_{i}\}$, where ${\mathcal{I}}_{i}$ is the influence value, ${\mathcal{T}}_{i}$ is the timestep-to-go representing how quickly we can achieve the interaction state, and ${\mathcal{R}}_{i}={\sum}_{t=i}^{i+\mathcal{T}}{r}_{t}$ is the return-to-go.
As the training datasets for the imagination model contain a mixture of trajectory segments, directly sampling from the prompt generator is unlikely to produce the critical state consistently. Instead, we aim to control the prompt generator to produce frequently visited but high-influence states and the trajectory leading to them. To this end, we sample the target influence value according to the log-probability:
$${\mathcal{I}}_{i}=\mathrm{log}p(\mathcal{I}|{s}_{i})+\kappa (\mathcal{I}-{\mathcal{I}}_{low})/({\mathcal{I}}_{high}-{\mathcal{I}}_{low}),$$ | (6) |
where $\kappa $ is a hyperparameter that controls the preference on high-influence states, ${\mathcal{I}}_{low}$ and ${\mathcal{I}}_{high}$ are the lower bound and upper bound of ${\mathcal{I}}_{i}$, respectively. The target timestep-to-go and return-to-go are obtained in the same way.
In addition, we also provide a one-shot demonstration for the imagination to avoid it being too far away from the target trajectory, also known as the “hallucination” problem in the large language model (McKenna et al. 2023; Manakul, Liusie, and Gales 2023). Specifically, we store the trajectory segments in a few-shot demonstration dataset and use the prompt to characterize the segments. Before imagination, we search for the trajectory whose description has the highest similarity with current prompt $\mathcal{P}({s}_{i})$ and then prepend it into the original input $x$. This process only affects the inference procedure of the model – training remains unaffected and can rely on standard next-token prediction frameworks and infrastructure.
Initialize and Explore
Before exploration, we begin by sampling a prompt $\mathcal{P}({s}_{0})=\{\mathcal{I},\mathcal{T},\mathcal{R}\}$ for imagination given the initial state ${s}_{0}$ from the environment simulator. Then, the agents imagine how to reach the interaction state ${s}_{\mathcal{T}}$ in an autoregressive manner by conditioning on the prompt and the most related trajectory from the few-shot demonstration dataset. After rolling out each imagination step and obtaining the next state, we have the next prompt through decrementing the target timestep-to-go $\mathcal{T}$ and return-to-go $\mathcal{R}$ by $1$ and the predicted reward ${r}_{t}$ from the imagination model, respectively. We maintain the influence value $\mathcal{I}$ as a constant and repeat this process until the target timestep-to-go is zero. The imagined trajectory ${x}_{0:\mathcal{T}}=\{{s}_{0},{o}_{0},{\text{\mathbf{u}}}_{0},{r}_{0},{s}_{1},\mathrm{\dots},{\text{\mathbf{u}}}_{\mathcal{T}},{r}_{\mathcal{T}}\}$ is then used to initialize GRU state of the agent network.
Next, we define a probability $\alpha $ of initializing the agents at state ${s}_{0}$ to emphasize the importance of the target task. With the probability $1-\alpha $, we initialize the agents at the last state ${s}_{\mathcal{T}}$ of the imagined trajectory using the simulator. For StarCraft II, we add a distribution over team unit types, start positions, and health points in the reset function of the environment simulator. We anneal $\alpha $ from $1.0$ to $0.5$ over a fixed number of steps after the pretraining phase of the imagination model. Finally, the agents interact with the environment to collect the online data ${x}_{\mathcal{T}:T}=\{{s}_{\mathcal{T}},{o}_{\mathcal{T}},{\text{\mathbf{u}}}_{\mathcal{T}},{r}_{\mathcal{T}},{s}_{\mathcal{T}+1},\mathrm{\dots},{\text{\mathbf{u}}}_{T},{r}_{T}\}$. We stitch the imagined trajectory and the online data $\mathcal{X}={x}_{0:\mathcal{T}-1}\u2a01{x}_{\mathcal{T}:T}$ as training datasets for policy training.
The individual $Q$-value functions ${Q}^{a}({o}^{a},{\tau}^{a},{u}^{a})$ are optimized jointly to minimize the following loss:
$$\underset{\theta ,\varphi}{\mathrm{min}}\sum _{t=0}^{T-1}{\left[{Q}_{j}({s}_{t},{\bm{\tau}}_{t},{\mathbf{u}}_{t};\theta ,\varphi )-y({s}_{t},{\bm{\tau}}_{t},{\mathbf{u}}_{t})\right]}^{2},$$ | (7) |
where $y({s}_{t},{\bm{\tau}}_{t},{\mathbf{u}}_{t})={r}_{t}+\gamma {\mathrm{max}}_{\mathbf{u}}{Q}_{j}^{\prime}({s}_{t+1},{\bm{\tau}}_{t+1},\mathbf{u})$ is the target value function, and ${Q}_{j}^{\prime}$ is the target network whose parameters are periodically copied from ${Q}_{j}$.
Related Work
Multi-agent Exploration.
Individual exploration suffers from the inconsistency between local and global information as well as the non-stationary problem in multi-agent settings. To address these limitations, Jaques et al. (2019) introduce intrinsic rewards based on “social influence” to incentivize agents in selecting actions that can influence other agents. Similarly, Wang et al. (2020c) leverage mutual information to capture the interdependencies of rewards and transitions, promoting exploration efficiency and facilitating the policies training. EMC (Zheng et al. 2021) leverages prediction errors from individual $Q$-values as intrinsic rewards and uses episodic memory to improve coordinated exploration. MAVEN (Mahajan et al. 2019) employs a hierarchical policy for committed and temporally extended exploration, learning multiple state-action value functions for each agent through a shared latent variable. RODE (Wang et al. 2020b) introduces a role selector to enable informed decisions and learn role-based policies in a smaller action space based on the actions’ effects. However, learning in long-horizon coordination tasks remains challenging due to the exponential state-action spaces.
Go-Explore.
Go-Explore (Ecoffet et al. 2019) is one of the most famous exploration approaches, particularly well-suited for hard-exploration domains with sparse or deceptive rewards. In the Go-Explore family of algorithms (Ecoffet et al. 2021; Guo et al. 2020), an agent returns to a promising state without exploration by running a goal-conditioned policy or restoring the simulator state and then explores from this state. However, these methods are primarily designed for single-agent scenarios, neglecting the interdependencies between agent policies and the strategies of others in MARL. Moreover, these methods require restoring previously visited simulator states or training a goal-conditioned policy to generate actions for returning. This leads to significant computational costs, especially in coordination scenarios involving complex observations and transitions.
Transformer Model.
Several works have explored the integration of transformer models into reinforcement learning (RL) settings. We classify them into two major categories depending on the usage pattern. The first category focuses on representing components in RL algorithms, such as policies and value functions (Parisotto et al. 2020; Parisotto and Salakhutdinov 2021). These methods rely on standard RL algorithmis to update policy, where the transformer only provides the large representation capacity and improves feature extraction. Conversely, the second category aims to replace the RL pipeline with sequence modeling. They autoregressively generate states, actions, and rewards by conditioning on the desired return-to-go during inference (Chen et al. 2021; Lee et al. 2022; Reed et al. 2022). However, there is a risk of unintended behaviors when offline datasets contain destructive biases. Moreover, it remains an open question of how to extend these supervised learning paradigms to online settings and satisfy the scalability in practice.
We utilize a transformer model to imagine the trajectory to reach the interaction state guided by a prompt, offering increased representational capacity and stability compared to existing Go-Explore methods. In contrast to current multi-agent exploration methods, we initialize agents at the states with high influence values. This form of curriculum learning significantly reduces the exploration space and enhances coordination exploration. We aim to bridge sequence modeling and transformers with MARL rather than replacing conventional RL algorithms.
Results
In this section, we conduct empirical experiments to answer the following questions: (1) Is Imagine, Initialize, and Explore (IIE) better than the existing MARL exploration methods in complex cooperative scenarios or sparse reward settings? (2) Can IIE generate a reasonable curriculum of the last states over timesteps and outperform other returning methods? We also investigate the contribution of each component in the proposed prompt to the imagination model.
We conduct experiments on NVIDIA RTX 3090 GPUs. Each task needs to train for about 12 to 20 hours, depending on the number of agents and the episode length limit. We evaluate 32 episodes with decentralized greedy action selection every $10k$ timesteps for each algorithm. All figures are plotted using mean and standard deviation with confidence internal 95%. We conduct five independent runs with different random seeds for each learning curve.
Performance Comparison
In our evaluation, we compare the performance of CW-QMIX (Rashid et al. 2020), QPLEX (Wang et al. 2020a), MAVEN (Mahajan et al. 2019), EMC (Zheng et al. 2021), RODE (Wang et al. 2020b), QMIX (Rashid et al. 2018), MAPPO (Yu et al. 2022), and IIE on the StarCraft Multi-Agent Challenge (SMAC) (Samvelyan et al. 2019) and SMACv2 (Ellis et al. 2022) benchmarks.
SMAC is a partially observable MARL benchmark known for its rich environments and high control complexity. It requires learning policies in a large observation-action space, where agents take various actions, such as “move” in cardinal directions, “stop”, and selecting an enemy to attack. The maximum number of actions and the episode length vary across different scenarios, ranging from 7 to 70 actions and 60 to 400 timesteps, respectively. In contrast, SMACv2 presents additional challenges of stochasticity and generalization. It includes procedurally generated scenarios with start positions, unit types, attack range, and sight range. The agents must generalize their learned policies to previously unseen settings during testing.
Fig. 2 shows that IIE considerably outperforms the state-of-the-art MARL methods in both SMAC and SMACv2 maps with the dense reward setting. This result demonstrates that IIE significantly enhances learning speed and coordination performance in complex tasks. In specific scenarios like 3s5z_vs_3s6z and corridor, EMC shows faster learning in the beginning, which can be attributed to the fact that IIE requires more time to pre-train the imagination model in more complex tasks. However, as the training progresses, IIE excels in providing a more targeted and efficient exploration of complex coordination scenarios, leading to the best final performance across all scenarios.
Sparse-reward Benchmark
We investigate the performance of IIE, EMC, MAVEN, RODE, QMIX, and MAPPO, in addition to two multi-agent exploration methods designed for sparse rewards, including LIIR (Du et al. 2019) and MASER (Jeon et al. 2022), on the SMAC benchmark with the sparse-reward setting. In this setting, global rewards are sparsely given only when one or all enemies are defeated, with no additional reward for state information such as enemy and ally health. This problem is difficult for credit assignments because credit must be propagated from the end of the episode.
The imagination model in IIE can improve robustness in these settings because it makes minimal assumptions on the density of the reward. As shown in Fig. 3, sparse and delayed rewards minimally affect IIE. We hypothesize that the transformer architecture in the imagination model can be effective critics and enable more accurate value prediction. In contrast, QMIX and MAPPO fail to solve these tasks since they heavily rely on densely populated rewards for calculating temporal difference targets or generalized advantage estimates. LIIR, MASER, RODE, and MAVEN exhibit slow learning and instability, particularly on the heterogeneous map 2s3z, suggesting the difficulty of learning such intrinsic rewards, subgoals, roles, or noise-based hierarchical policies in hard-exploration scenarios.
Different Returning Methods
In this section, we seek insight into whether the imagination model in IIE can be thought of as performing efficient curriculum learning and is better than other returning methods. To investigate this, we compare it with behavior cloning (BC), a goal-conditioned policy method (GC-policy) (Ecoffet et al. 2021), and generative models, including CVAE-GAN (Bao et al. 2017) and classifier-guided diffusion (CG-diffusion) (Ajay et al. 2023). Before exploration, we initialize the agents to the last state of the imagination trajectory produced by BC, CVAE-GAN, and CG-diffusion using the environment simulator or run a goal-conditioned policy from GC-policy to return agents to the target state.
We show the performance comparison and visualization results in Fig. 4. IIE provides agents with interaction points for exploration with the shortest relative distance between agents and enemies and decreasing but comparable health levels as training progresses. This contributes to early and more frequent multi-agent interactions against the enemies quickly, reducing the complexity of the multi-agent exploration space. As a result, IIE outperforms BC, GC-policy, CVAE-GAN, and CG-diffusion across all tasks, indicating that the prompt-based imagination can be more effective and has better generalization than simply performing imitation learning or other generative models. We also illustrate the data distribution of imagined trajectories in a 2D space with dimensional reduction via T-SNE. The results show that the imagined trajectory will be close to the target trajectory, which verifies that IIE can capture and learn the dynamics of the multi-agent environment based on the prompt.
CG-diffusion and GC-policy have plateaued in performance, showing a similar trend of emergent complexity with IIE but taking far longer to learn the joint policy. On the one hand, the continuous diffusion models have been extremely successful in vision and audio domains, but they do not perform well in text because of the inherently discrete nature of text (Li et al. 2022). The imagination of the trajectory is more related to text than image generation because it has low redundancy, and the transition between two connected states is essential. On the other hand, the applicability of the current goal-conditioned methods with complex observations is limited, as their flexibility is often constrained to task spaces using low-dimensional parameters and requires well-defined task similarity. CVAE-GAN does not show positive results and keeps generating similar health levels. We hypothesize that the mode collapse problem is a bottleneck for CVAE-GAN. The generator can find data that can easily fool the discriminator because of the unbalanced training data from the exploration. BC shows the worst results because it imitates all past interaction sequences and lacks the generalization ability to avoid sub-optimal solutions. From Fig. 4e-g, we can see that BC prioritizes states with short distances but does not provide reasonable health levels for enemies - the mean health level of enemies far exceeds that of allies, making it difficult or even impossible to achieve any success.
Different Prompts for Imagination
In this section, we conduct ablation studies to analyze the contributions of each component in the prompts, including the timesteps, the accumulated return-to-go (RTG), the constant influence value (IV), and the most related trajectory (T). We integrate the desired timesteps into the following prompts: (1) RTG, (2) IV, (3) RTG with the influence value, denoted as IIE-wo-T, (4) RTG with the most related trajectory, denoted as IIE-wo-IV, (5) IV with the most related trajectory, denoted as IIE-wo-RTG. We compare them with IIE and QMIX on the SMAC benchmark.
Fig. 5 shows that RTG and IIE-wo-IV achieve poor performance in MMM2, 6h_vs_8z, and corridor due to their ignorance of the importance of interactions in multi-agent exploration and limited positive samples for value decomposition. IV and IIE-wo-T perform worse than IIE-wo-IV and IIE with a considerable gap, as they do not exploit one-shot demonstration to guide action generation, leading to a potential risk that the imagined trajectories may deviate significantly from the target trajectory, especially in long-horizon tasks with complex transition functions. IIE outperforms IIE-wo-IV across MMM2, 6h_vs_8z, and 2c_vs_64zg maps because RTG can further specify a trajectory, improving the efficiency and robustness of the imagination learning. However, IIE-wo-IV performs better than IIE in corridor, implying that the imagination model may have some degree of generalization without RTG and provide more diverse samples for policy training.
Conclusion
We proposed Imagine, Initialize, and Explore, which enables us to break down a difficult multi-agent exploration problem into a curriculum of subtasks created by initializing agents at the interaction state - the last state in the imagined trajectory. We empirically evaluated our algorithm and found it outperforms current multi-agent exploration methods and generative models on various benchmarks. We also show that the prompt-based imagination model performs efficient conditional sequence generation and has one-shot generalization. We hope this work will inspire more investigation of sequence-prediction models’ applications in multi-agent reinforcement learning (MARL) rather than using them to replace conventional MARL. In future work, we consider learning continuous prompts to specify the imagination, as opposed to the simple influence value used in this paper.
Acknowledgments
This work was supported in part by National Key R&D Program of China under grant No. 2021ZD0112700, NSFC under grant No. 62125305, No. 62088102, No. 61973246, No. 62203348, and No. U23A20339, the Fundamental Research Funds for the Central Universities under Grant xtr072022001.
References
- Ajay et al. (2023) Ajay, A.; Du, Y.; Gupta, A.; Tenenbaum, J. B.; Jaakkola, T. S.; and Agrawal, P. 2023. Is Conditional Generative Modeling all you need for Decision Making? In The Eleventh International Conference on Learning Representations.
- Ao et al. (2022) Ao, S.; Zhou, T.; Jiang, J.; Long, G.; Song, X.; and Zhang, C. 2022. EAT-C: Environment-Adversarial sub-Task Curriculum for Efficient Reinforcement Learning. In International Conference on Machine Learning, 822–843. PMLR.
- Bao et al. (2017) Bao, J.; Chen, D.; Wen, F.; Li, H.; and Hua, G. 2017. CVAE-GAN: fine-grained image generation through asymmetric training. In Proceedings of the IEEE international conference on computer vision, 2745–2754.
- Chen et al. (2021) Chen, L.; Lu, K.; Rajeswaran, A.; Lee, K.; Grover, A.; Laskin, M.; Abbeel, P.; Srinivas, A.; and Mordatch, I. 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems, 34: 15084–15097.
- Du et al. (2019) Du, Y.; Han, L.; Fang, M.; Liu, J.; Dai, T.; and Tao, D. 2019. Liir: Learning individual intrinsic reward in multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 32.
- Ecoffet et al. (2019) Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2019. Go-explore: a new approach for hard-exploration problems. arXiv preprint arXiv:1901.10995.
- Ecoffet et al. (2021) Ecoffet, A.; Huizinga, J.; Lehman, J.; Stanley, K. O.; and Clune, J. 2021. First return, then explore. Nature, 590(7847): 580–586.
- Ellis et al. (2022) Ellis, B.; Moalla, S.; Samvelyan, M.; Sun, M.; Mahajan, A.; Foerster, J. N.; and Whiteson, S. 2022. SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning. arXiv preprint arXiv:2212.07489.
- Fang et al. (2021) Fang, K.; Zhu, Y.; Savarese, S.; and Fei-Fei, L. 2021. Adaptive Procedural Task Generation for Hard-Exploration Problems. In International Conference on Learning Representations.
- Florensa et al. (2018) Florensa, C.; Held, D.; Geng, X.; and Abbeel, P. 2018. Automatic goal generation for reinforcement learning agents. In International conference on machine learning, 1515–1528. PMLR.
- Guo et al. (2020) Guo, Y.; Choi, J.; Moczulski, M.; Feng, S.; Bengio, S.; Norouzi, M.; and Lee, H. 2020. Memory based trajectory-conditioned policies for learning from sparse rewards. Advances in Neural Information Processing Systems, 33: 4333–4345.
- Hausknecht and Stone (2015) Hausknecht, M.; and Stone, P. 2015. Deep recurrent q-learning for partially observable mdps. In 2015 aaai fall symposium series.
- Janner et al. (2022) Janner, M.; Du, Y.; Tenenbaum, J. B.; and Levine, S. 2022. Planning with diffusion for flexible behavior synthesis. arXiv preprint arXiv:2205.09991.
- Jaques et al. (2019) Jaques, N.; Lazaridou, A.; Hughes, E.; Gulcehre, C.; Ortega, P.; Strouse, D.; Leibo, J. Z.; and De Freitas, N. 2019. Social influence as intrinsic motivation for multi-agent deep reinforcement learning. In International conference on machine learning, 3040–3049. PMLR.
- Jeon et al. (2022) Jeon, J.; Kim, W.; Jung, W.; and Sung, Y. 2022. MASER: Multi-Agent Reinforcement Learning with Subgoals Generated from Experience Replay Buffer. In Chaudhuri, K.; Jegelka, S.; Song, L.; Szepesvari, C.; Niu, G.; and Sabato, S., eds., Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, 10041–10052. PMLR.
- Karkus, Hsu, and Lee (2017) Karkus, P.; Hsu, D.; and Lee, W. S. 2017. Qmdp-net: Deep learning for planning under partial observability. Advances in neural information processing systems, 30.
- Kraemer and Banerjee (2016) Kraemer, L.; and Banerjee, B. 2016. Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing, 190: 82–94.
- Kurach et al. (2020) Kurach, K.; Raichuk, A.; Stanczyk, P.; Zajac, M.; Bachem, O.; Espeholt, L.; Riquelme, C.; Vincent, D.; Michalski, M.; Bousquet, O.; and Gelly, S. 2020. Google research football: A novel reinforcement learning environment. In Proceedings of the AAAI conference on artificial intelligence, volume 34.04.
- Lee et al. (2022) Lee, K.-H.; Nachum, O.; Yang, M. S.; Lee, L.; Freeman, D.; Guadarrama, S.; Fischer, I.; Xu, W.; Jang, E.; Michalewski, H.; et al. 2022. Multi-game decision transformers. Advances in Neural Information Processing Systems, 35: 27921–27936.
- Li et al. (2022) Li, X.; Thickstun, J.; Gulrajani, I.; Liang, P. S.; and Hashimoto, T. B. 2022. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35: 4328–4343.
- Lowe et al. (2017) Lowe, R.; Wu, Y. I.; Tamar, A.; Harb, J.; Pieter Abbeel, O.; and Mordatch, I. 2017. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 30.
- Mahajan et al. (2019) Mahajan, A.; Rashid, T.; Samvelyan, M.; and Whiteson, S. 2019. Maven: Multi-agent variational exploration. Advances in Neural Information Processing Systems, 32.
- Manakul, Liusie, and Gales (2023) Manakul, P.; Liusie, A.; and Gales, M. J. 2023. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
- Matheron, Perrin, and Sigaud (2020) Matheron, G.; Perrin, N.; and Sigaud, O. 2020. PBCS: Efficient exploration and exploitation using a synergy between reinforcement learning and motion planning. In Artificial Neural Networks and Machine Learning–ICANN 2020: 29th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 15–18, 2020, Proceedings, Part II, 295–307. Springer.
- McKenna et al. (2023) McKenna, N.; Li, T.; Cheng, L.; Hosseini, M. J.; Johnson, M.; and Steedman, M. 2023. Sources of Hallucination by Large Language Models on Inference Tasks. arXiv preprint arXiv:2305.14552.
- Oliehoek and Amato (2016) Oliehoek, F. A.; and Amato, C. 2016. A concise introduction to decentralized POMDPs. Springer.
- Parisotto and Salakhutdinov (2021) Parisotto, E.; and Salakhutdinov, R. 2021. Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation. In International Conference on Learning Representations.
- Parisotto et al. (2020) Parisotto, E.; Song, F.; Rae, J.; Pascanu, R.; Gulcehre, C.; Jayakumar, S.; Jaderberg, M.; Kaufman, R. L.; Clark, A.; Noury, S.; et al. 2020. Stabilizing transformers for reinforcement learning. In International conference on machine learning, 7487–7498. PMLR.
- Radford et al. (2018) Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I.; et al. 2018. Improving language understanding by generative pre-training. In OpenAI. OpenAI.
- Rashid et al. (2020) Rashid, T.; Farquhar, G.; Peng, B.; and Whiteson, S. 2020. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in neural information processing systems, 33: 10199–10210.
- Rashid et al. (2018) Rashid, T.; Samvelyan, M.; Schroeder, C.; Farquhar, G.; Foerster, J.; and Whiteson, S. 2018. Qmix: Monotonic value function factorisation for deep multi-agent reinforcement learning. In International Conference on Machine Learning, 4295–4304. PMLR.
- Reed et al. (2022) Reed, S.; Zolna, K.; Parisotto, E.; Colmenarejo, S. G.; Novikov, A.; Barth-Maron, G.; Gimenez, M.; Sulsky, Y.; Kay, J.; Springenberg, J. T.; et al. 2022. A generalist agent. arXiv preprint arXiv:2205.06175.
- Samvelyan et al. (2023) Samvelyan, M.; Khan, A.; Dennis, M. D.; Jiang, M.; Parker-Holder, J.; Foerster, J. N.; Raileanu, R.; and Rocktäschel, T. 2023. MAESTRO: Open-Ended Environment Design for Multi-Agent Reinforcement Learning. In The Eleventh International Conference on Learning Representations.
- Samvelyan et al. (2019) Samvelyan, M.; Rashid, T.; De Witt, C. S.; Farquhar, G.; Nardelli, N.; Rudner, T. G.; Hung, C.-M.; Torr, P. H.; Foerster, J.; and Whiteson, S. 2019. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043.
- Silver et al. (2018) Silver, D.; Hubert, T.; Schrittwieser, J.; Antonoglou, I.; Lai, M.; Guez, A.; Lanctot, M.; Sifre, L.; Kumaran, D.; Graepel, T.; et al. 2018. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419): 1140–1144.
- Son et al. (2019) Son, K.; Kim, D.; Kang, W. J.; Hostallero, D. E.; and Yi, Y. 2019. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, 5887–5896. PMLR.
- Wang et al. (2020a) Wang, J.; Ren, Z.; Liu, T.; Yu, Y.; and Zhang, C. 2020a. QPLEX: Duplex Dueling Multi-Agent Q-Learning. In International Conference on Learning Representations.
- Wang et al. (2021) Wang, J.; Xu, W.; Gu, Y.; Song, W.; and Green, T. C. 2021. Multi-agent reinforcement learning for active voltage control on power distribution networks. Advances in Neural Information Processing Systems, 34: 3271–3284.
- Wang et al. (2020b) Wang, T.; Gupta, T.; Mahajan, A.; Peng, B.; Whiteson, S.; and Zhang, C. 2020b. Rode: Learning roles to decompose multi-agent tasks. arXiv preprint arXiv:2010.01523.
- Wang et al. (2020c) Wang, T.; Wang, J.; Wu, Y.; and Zhang, C. 2020c. Influence-Based Multi-Agent Exploration. In International Conference on Learning Representations.
- Yu et al. (2022) Yu, C.; Velu, A.; Vinitsky, E.; Gao, J.; Wang, Y.; Bayen, A.; and Wu, Y. 2022. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35: 24611–24624.
- Zheng et al. (2021) Zheng, L.; Chen, J.; Wang, J.; He, J.; Hu, Y.; Chen, Y.; Fan, C.; Gao, Y.; and Zhang, C. 2021. Episodic Multi-agent Reinforcement Learning with Curiosity-driven Exploration. Advances in Neural Information Processing Systems, 34.
- Zhou et al. (2021) Zhou, M.; Luo, J.; Villella, J.; Yang, Y.; Rusu, D.; Miao, J.; Zhang, W.; Alban, M.; FADAKAR, I.; Chen, Z.; et al. 2021. SMARTS: An Open-Source Scalable Multi-Agent RL Training School for Autonomous Driving. In Conference on Robot Learning, 264–285. PMLR.
Appendix
Environments
StarCraft II is a real-time strategy game featuring three different races, Protoss, Terran, and Zerg, with different properties and associated strategies. The objective is to build an army powerful enough to destroy the enemy’s base. When battling two armies, players must ensure army units are acting optimally.
StarCraft Multi-Agent Challenge (SMAC) is a partially observable reinforcement learning benchmark built in StarCraft II. An individual agent with parameter sharing controls each allied unit, and a hand-coded built-in StarCraft II AI controls enemy units. The difficulty of the game AI is set to the “very difficult” level. On the SMAC benchmark, agents can access their local observations within the field of view at each time step. The feature vector contains attributes of both allied and enemy units: distance, relative x, relative y, health, shield, and unit_type. In addition, agents can observe the last actions of allied units and the terrain features surrounding them. The global state vector includes the coordinates of all agents relative to the center of the map and other features present in the local observation of agents. The state stores the energy of Medivacs, the cooldown of the rest of the allied units, and the last actions of all agents. Note that the global state information is only available to agents during centralized training. All features in state and local observations are normalized by their maximum values.
After receiving the observations, each agent is allowed to take action from a discrete set which consists of move[direction], attack[enemy_id], stop and no-op. Move direction includes north, south, east, and west. Note that the dead agents can only take no-op action while live agents cannot. For health units, Medivacs use heal[agent_id] actions instead of attack[enemy_id]. Depending on different scenarios, the maximum number of actions varies between $7$ and $70$. Note that agents can only perform the attack[enemy_id] action when the enemy is within its shooting range. At each time step, agents take joint action and receive a positive global reward based on the total damage dealt to the enemy units. In addition, they can receive an extra reward of $10$ points after killing each enemy unit and $200$ points after killing all enemy units. The rewards are scaled to around $20$, so the maximum cumulative reward is achievable in each scenario.
Attribution | Dense reward | Sparse reward |
---|---|---|
Win | +200 | +200 |
One enemy dies | +10 | +10 |
One ally dies | 0 | 0 |
Enemy’s health | + health difference | 0 |
Enemy’s shield | + shield difference | 0 |
Ally’s health | 0 | 0 |
Enemy’s shield | 0 | 0 |
SMACv2 is a new version of SMAC that uses procedural content generation to improve stochasticity, including random team compositions, random start positions, and increasing diversity among unit types by using the true unit attack and sight ranges. Each race has a special unit that should not be generated too often: the colossus in Protoss, the medivac unit in Terran, and the baneling unit in Zerg. All of these special units are spawned with a probability of 10%. The other units used spawn with a probability of 45%. Random start positions in SMACv2 come in two different flavors. In reflect scenarios, the allied units are spawned uniformly randomly on one side of the scenario. The enemy positions are the allied positions reflected in the vertical midpoint of the scenario. This is similar to how units spawned in the original SMAC benchmark but without clustering them together. In surround scenarios, allied units are spawned at the center and surrounded by enemies stationed along the four diagonals. There are two changes to the observation space from SMAC. First, each agent observes their field-of-view direction. Secondly, each agent observes their position in the map as x- and y-coordinates. This is normalized by dividing by the map width and height, respectively. The only change to the state from SMAC was to add the field-of-view direction of each agent to the state. In addition, the fixed attack and sight range are replaced by the values from SC2. The attack range for melee units is imposed to 2 because using the actual attack ranges makes attacking too difficult.
The reward setting for the dense and sparse cases on the SMAC benchmark is shown in Tab. 1.
We use SC2.4.6.2.69232 (the same version for the evaluation of VDN, QMIX, QPLEX, QTRAN, and RODE) instead of SC2.4.10 (the version for the evaluation of MAPPO) and SC2.4.1.4 (the version for the evaluation of LIIR). Performance is not comparable across versions.
Experimental Setup
We adopt the same architectures for QMIX${}^{1}$, QPLEX${}^{1}$, CW-QMIX^{2}^{2}2https://github.com/oxwhirl/wqmix, RODE^{3}^{3}3https://github.com/TonghanWang/RODE, MAVEN^{4}^{4}4https://github.com/AnujMahajanOxf/MAVEN, EMC^{5}^{5}5https://github.com/kikojay/EMC as their official implementations (Samvelyan et al. 2019; Wang et al. 2020a; Rashid et al. 2020; Wang et al. 2020b; Mahajan et al. 2019; Zheng et al. 2021).
Each agent independently learns a policy with fully shared parameters between all policies. We used RMSProp with a learning rate of $5\times {10}^{-4}$ and $\gamma =0.99$, buffer size 5000, mini-batch size 32 for all algorithms. The dimension of each agent’s GRU hidden state is set to 64. For our experiments, we employ an $\u03f5$-greedy exploration scheme for the joint policy, where $\u03f5$ decreases from 1 to 0.05 over 1 million timesteps in 6h_vs_8z, 3s5z_vs_3s6z and corridor, and over 50 thousand timesteps in other maps.
The pretraining phase starts at the beginning of the training and ends at $50k$ and $100k$ for $2M$-step and $5M$-step maps, respectively. During the pretraining, we set the probability $\alpha $ of initializing the agents at state ${s}_{0}$ as $1$. After pretraining, the probability $\alpha $ annealed from $1.0$ to $0.5$ over $50k$ and $100k$ for $2M$-step and $5M$-step maps, respectively.
We build our imagination model implementation based on Decision Transformer^{6}^{6}6https://github.com/kzl/decision-transformer (Chen et al. 2021). The full list of hyperparameters can be found in Tab. 3. We use the maximum timesteps ${E}_{t}^{M}$ in the environment as the context length, which is shown in Tab. 2. The imagination models were trained using the AdamW optimizer.
Scenario | ${E}_{t}^{M}$ | Scenario | ${E}_{t}^{M}$ |
---|---|---|---|
MMM2 | 180 | 6h_vs_8z | 150 |
2c_vs_64zg | 400 | 3s5z_vs_3s6z | 170 |
5m_vs_6m | 70 | 10m_vs_11m | 150 |
corridor | 400 | 3s_vs_5z | 250 |
protoss_5v5 | 200 | terran_5v5 | 200 |
protoss_10v11 | 200 | zerg_5v5 | 200 |
3m | 60 | 8m | 120 |
2m_vs_1z | 150 | 2s3z | 120 |
Hyperparameter | Value | Hyperparameter | Value |
---|---|---|---|
number of layers | 6 | max timesteps | 400 |
attention heads | 8 | weight decay | 0.1 |
embedding dims | 64 | max RTG | 20 |
grad norm clip | 1.0 | learning rate | $6\times {10}^{-4}$ |
Adam betas | (0.9,0.95) | training epochs | 5 |
The prompt generator uses a feedforward network with two hidden layers of size $\{64,32\}$ to encode the state and output $\{{E}_{t}^{M},20,10\}$ dimensional vectors for each modality, i.e., the target timestep, return-to-go, and influence value, respectively. These vectors then turn into class probabilities using a softmax function.
For CW-QMIX, the weight for negative samples is set to $\alpha =0.5$ for all scenarios.
For CG-diffusion, the implementation is based on Diffuser^{1}^{1}1https://github.com/jannerm/diffuser (Janner et al. 2022). We use $N=100$ diffusion steps for all scenarios.
For CVAE-GAN, the generator $G({s}^{\prime}|z,c)$ first encodes the $64$-dimensional input noise $z$ using a $64$-dimensional fully-connected layer, where $c=(s,\mathcal{P}(s))$ is the condition, Then, it produces the reconstructed next state by another fully-connected layer. We apply a sigmoid function at the output layer and scale the output by the range of each modality defined by the task space. Batch normalization is used in all layers. The classifier $C(\mathcal{P}(s)|s)$ uses the same architecture as the prompt generator in IIE and is trained in the same way. In the discriminator $D(G({s}^{\prime}|z,c))$, a feedforward network with two hidden layers of size $\{128,64\}$ is used to encode the state or the output from the generator, and then predict a score. The loss function for CVAE-GAN is the same as the objective in (Bao et al. 2017).
For GC-policy, we treat any trajectory as a successful trail for reaching its final state conditioned on the prompt. Therefore, we train the individual policy for each agent $i$ by maximizing the likelihood of the actions for a reach goal $J({\pi}_{i})={\mathbb{E}}_{D}[{\mathrm{log}}_{{\pi}_{i}}(a|s,\mathcal{P}(s))]$. This policy is parametrized using a neural network that takes as the input state and the goal (prompt), then returns probabilities for a discretized grid of actions of the action space. The neural network concatenates the state and goal together. It passes the concatenated input into a feedforward network with two hidden layers of size $256$ and $128$, respectively, outputting logits for each discretized action. The loss is optimized using the Adam optimizer with learning rate $\alpha =5\times {10}^{-4}$, with a batch size of 256 transitions, taking one gradient step for every step in the environment. In GC-policy, the agents reach the selected state from an archive using the goal-conditioned policy rather than the environment simulator. The archive is a first-in-first-out buffer that stores states with the top-$128$ influence value.
The implementation of MAPPO is consistent with their official repositories^{2}^{2}2https://github.com/zoeyuchao/mappo (Yu et al. 2022). As shown in Tab. 4, all hyper-parameters are left unchanged at the origin best-performing status.
For the baselines in the sparse-reward setting, we adopt the same architecture for LIIR^{3}^{3}3https://github.com/yalidu/liir (Du et al. 2019) and MASER^{4}^{4}4https://github.com/Jiwonjeon9603/MASER (Jeon et al. 2022) as their official implementations.
Hyperparameter | Value | Hyperparameter | Value |
---|---|---|---|
critic lr | 5e-4 | actor lr | 5e-4 |
ppo epoch | 5 | ppo-clip | 0.2 |
optimizer | Adam | batch size | 3200 |
optim eps | 1e-5 | hidden layer | 1 |
gain | 0.01 | training threads | 32 |
rollout threads | 8 | $\gamma $ | 0.99 |
hidden layer dim | 64 | activation | ReLU |
We conduct experiments on an NVIDIA RTX 3090 GPU. Each task needs to train for about 12 to 20 hours, depending on the number of agents and episode length limit of each map. We evaluate 32 episodes with decentralized greedy action selection every $10k$ timesteps for each algorithm.
All figures in the experiments are plotted using mean and standard deviation with confidence internal 95%. We conduct five independent runs with different random seeds for each learning curve.
Pseudocode for Prompt Sampling
Environment Simulator
We design a convenient interface in the reset function, serving as the environment simulator in Imagine, Initialize, and Explore. Suppose all units are initialized at the starting point ${s}_{0}$, and we have already obtained the imagined trajectory ${x}_{0:\mathcal{T}}=\{{s}_{0},{o}_{0},{\text{\mathbf{u}}}_{0},{r}_{0},{s}_{1},\mathrm{\dots},{\text{\mathbf{u}}}_{\mathcal{T}},{r}_{\mathcal{T}}\}$. The last state ${s}_{\mathcal{T}}$ is formulated as a distribution class $\mathcal{D}$ over team unit types, start positions, and health levels.
First, we kill all units through the DebugKillUnit command from the s2clientprotocol.debug_pb2 package. Then, we create new units based on the distributions $\mathcal{D}$ through the DebugCreateUnit command, where we can set the unit_type, the owner (allies or enemies), and initialized positions using the Point2D command. Note that we have to recover the coordinates of all units in the real axis because they have been processed as the relative value to the center of the map in the state and the observations.
Since we cannot directly initialize the health level of a unit through the DebugCreateUnit command, we scale the health value in the raw observation from the controller and kill agents when their health is below the health level. For example, consider the maximal health of the agent $i$ is ${H}_{i}$, the remaining health value at the last state ${s}_{\mathcal{T}}$ is $\beta {H}_{i}$. Suppose the exact health value in the raw observation is ${h}_{i}$. We compute the current health by ${h}_{i}^{c}=\left[{h}_{i}-(1-\beta ){H}_{i}\right]/{H}_{i}$ and use it as the feature of health values for the state and the observations. Moreover, we kill the agent $i$ if ${h}_{i}^{c}\le 0$ after taking the joint action at each timestep.
Imagine, Initialize, Explore Pseudocode
More Related Work
Auto-curricula Learning.
Curriculum learning is a technique that leverages easier datasets or tasks to facilitate training. In reinforcement learning (RL), this approach involves selecting tasks from a predefined set or parameterized space of goals and scenes to expedite performance improvement on the target task (Florensa et al. 2018; Matheron, Perrin, and Sigaud 2020; Fang et al. 2021; Ao et al. 2022). In the multi-agent field, recent works have explored the adaptive selection of co-players in competitive games, where playing against increasingly stronger opponents is crucial to avoid exploitation by other agents (Silver et al. 2018; Samvelyan et al. 2023). These methods use a regret-based curriculum to attain a Nash-Equilibrium policy against every rational agent in every environment. It is important to note that these approaches primarily concentrate on competitive tasks rather than cooperative ones. Additionally, the applicability of the current auto-curricula learning methods to MARL tasks with complex observations is limited, as their flexibility is often constrained to task spaces using low-dimensional parameters and requires well-defined task similarity.
Limitations
The main limitation of this paper is that it is confined to scenarios within the game StarCraft II. This is an environment that, while complex, cannot represent the dynamics of all multi-agent tasks. Evaluation of MARL algorithms, therefore, should not be limited to one benchmark but should target a variety with a range of tasks.
The environment simulator in this paper is a small distribution class that can change how units are generated. It can be easily generalized to many multi-agent benchmarks and applications. For example, we can add a configuration of the initial position of the agents into the reset_world function on the particle world benchmark (Lowe et al. 2017). We can also utilize the AddPlayer function to build a player by defining its coordinates and role on the Google Research Football benchmark (Kurach et al. 2020). We leave the implementation of this functionality as future work.