GFlowNet Foundations

Yoshua Bengio yoshua.bengio@mila.quebec
Mila, Universié de Montréal, CIFAR, IVADO Salem Lahlou11footnotemark: 1 lahlosal@mila.quebec
Mila, Universié de Montréal Tristan Deleu11footnotemark: 1 deleutri@mila.quebec
Mila, Universié de Montréal Edward J. Hu edward@edwardjhu.com
Mila, Universié de Montréal, Microsoft Mo Tiwari motiwari@stanford.edu
Stanford University Emmanuel Bengio bengioem@mila.quebec
Mila, McGill University
Equal Contribution
Abstract

Generative Flow Networks (GFlowNets) have been introduced as a method to sample a diverse set of candidates in an active learning context, with a training objective that makes them approximately sample in proportion to a given reward function. In this paper, we show a number of additional theoretical properties of GFlowNets, including a new local and efficient training objective called detailed balance for the analogy with MCMC. GFlowNets can be used to estimate joint probability distributions and the corresponding marginal distributions where some variables are unspecified and, of particular interest, can represent distributions over composite objects like sets and graphs. GFlowNets amortize the work typically done by computationally expensive MCMC methods in a single but trained generative pass. They could also be used to estimate partition functions and free energies, conditional probabilities of supersets (supergraphs) given a subset (subgraph), as well as marginal distributions over all supersets (supergraphs) of a given set (graph). We introduce variations enabling the estimation of entropy and mutual information, sampling from a Pareto frontier, connections to reward-maximizing policies, and extensions to stochastic environments, continuous actions and modular energy functions.

1 Introduction

Building upon the introduction of Generative Flow Networks (GFlowNets) by Bengio et al. (2021), we provide here an in-depth formal foundation and expansion of the set of theoretical results in ways that may be of interest for the active learning scenario of Bengio et al. (2021) but also much more broadly.

1.1 What is a GFlowNet ?

GFlowNets have properties which make them well-suited to perform amortized probabilistic inference in general, whether for sampling or for marginalizing. Sampling takes place at training time while run-time sampling or computations of marginalized quantities can be done in a single pass through a sequence of constructive stochastic steps. This makes GFlowNets an interesting alternative to Monte-Carlo Markov chains (MCMC) and related to amortized variational inference (Malkin et al., 2023).

Because sampling of a compositional object s can be achieved through a sequence of stochastic steps, very rich multimodal distributions PT(s) over such objects can be represented, and the offline training objectives make it possible to explore and discover modes of the distribution of interest. The key property of GFlowNets is that their sampling policy is trained to make the probability PT(s) of sampling an object s approximately proportional to the value R(s) of a given reward function applied to that object. We also talk of an energy function (s)=logR(s), i.e., the reward function is non-negative and corresponds to an unnormalized probability. Whereas one typically trains a generative model from a dataset of positive examples, a GFlowNet is trained to match the given energy or reward function and convert it into a sampler. We view that sampler as a generative policy because the composite object s is constructed through a sequence of smaller stochastic steps (see Fig. 1), often corresponding to constructively composing different elements of s, like the edges of a graph.

This conversion of an energy function or unnormalized probability function to a sampler is similar to what MCMC methods achieve but once trained, GFlowNets will generate a sample in one shot instead of generating a long sequence of samples whose distribution would gradually approach the desired one. GFlowNets thus avoid the lengthy stochastic search in the space of such objects and the associated mode-mixing intractability challenge of MCMC methods (Jasra et al., 2005; Bengio et al., 2013; Pompe et al., 2020). Multiple iid samples can be obtained from the GFlowNet by calling the sampler multiple times. GFlowNets exchange that intractability of sampling with MCMC for the challenge of amortized training of the generative policy. The latter problem would be equally intractable if the modes of the reward function did not have a inherent (but not necessarily known) structure over which the learner could generalize, i.e., the learner had almost no chance to correctly guess where to find new modes based on (i.e., training on) those it had already visited.

Refer to caption
Figure 1: A diagram of how a GFlowNet iteratively constructs an object. We adopt notation that is common in the reinforcement learning literature: st represents the state of the partially constructed object (in this case, a graph) at time t, at represents the action taken by the GFlowNet at time t to transition to state st+1=T(st,at). In this diagram, the GFlowNet takes a 3-node graph as input and determines an action to take. The action, combined with the environment transition function T(st,at), determines st+1: a four-node graph. This process repeats until an exit action is sampled and the sample is complete.

The energy function or reward function (exponential of minus energy) is evaluated only at the end of the sequential construction process for objects s, in what we call a terminating state. Every such constructive sequence starts in the single initial state s0 and ends in a terminal state. As illustrated in Figure 2, we can visualize the set of all trajectories starting from s0 and ending in a terminal state s. The term ”flow” in ”generative flow networks” refers to unnormalized probabilities that can be learned by GFlowNet learning procedures. The flow in an intermediate state s is a weighted sum of the non-negative rewards of the terminating states reachable from s. Those weights are such as to avoid double-counting: if we were to inject a fixed flow of liquid in s0 and dispatch that liquid in each child of any state s proportionally to the GFlowNet policy for choosing a child of s, we would obtain the flow at each state and the flow at terminating states would match the reward function at those states. As shown in greater detail here and for the first time in the first GFlowNet paper (Bengio et al., 2021), this can be achieved with a flow constraint at each state: the sum of incoming flows must match the sum of outgoing flows.

Refer to caption
Figure 2: Illustration of the structure of a Generative Flow Network (GFlowNet), as a pointed DAG over states s, with particles flowing along edges to represent the flow function. Any object sampled by the GFlowNet policy can be obtained by starting from initial state s0 and then at each step choosing a child with probability proportional to the GFlowNet policy’s transition probability. This process stops when a terminating action is chosen from a terminating state s (yielding a terminal state sf), at which point a reward R(s) is obtained. The figure shows a tiny GFlowNet and the possible trajectories from s0 to any of the terminal states. It illustrates that in general a state can be reached through several trajectories. GFlowNet algorithms learn a policy such that the probability of sampling terminating state s is proportional to R(s). It tries to learn a flow function F(s) and F(ss) over all states (including intermediate states) s and transitions ss with F(s)=R(s) at terminal states and F(s0) being the sum of rewards over all terminal states. A sufficient property to achieve this is that at each state the sum of incoming flows equals the sum of outgoing flows.

1.2 Contributions of this paper

In this paper, an important contribution is the notion of conditional GFlowNet, which enables estimation of intractable sums corresponding to marginalization over many steps of object construction, and can thus be used to compute free energies111In machine learning, a free energy is the logarithm of an unnormalized marginal probability, a generally intractable sum of exponentiated negative energies. over different types of joint distributions, perhaps most interestingly over sets and graphs. This marginalization also enables estimation of entropies, conditional entropies  and mutual information. GFlowNets can thus be generalized to estimate multiple flows corresponding to modeling a rich outcome (rather than a scalar reward function) .

We refer the reader to Bengio et al. (2021) and Sec. 7 for a discussion of related approaches and differences with common generative models and reinforcement learning (RL) methods. In an RL context, two interesting properties of GFlowNets already noted in that paper are that they (1) can be trained in an offline manner with trajectories sampled from a distribution different from the one represented by the GFlowNet and (2) they match the reward function in probability rather than try to find a configuration which maximizes rewards or returns. The latter property is particularly interesting in the context of exploration, to ensure the configurations sampled from the generative policy are both interesting and diverse. It is also interesting to transform GFlowNets into amortized probabilistic inference machines: if we choose the reward function to be a prior (over some random variable) times a likelihood (how well some data is fit given that choice of random variable value), then the GFlowNet policy learns to sample from the corresponding Bayesian posterior (which is proportional to prior times likelihood). The ability of GFlowNets to generate a diverse set of samples then corresponds to the ability to sample from the modes of the target distribution.

An important source of inspiration for GFlowNets is the way information propagates in temporal-difference RL methods (Sutton and Barto, 2018). Both rely on a principle of coherence for credit assignment which may only be achieved asymptotically when training converges. While exact gradient calculation may be intractable, because the number of paths in state space to consider is exponentially large, both methods rely on local coherence between different components and a training objective that states that if all the learned components are coherent with each other locally, then we obtain a system that estimates the quantities of interest globally. Examples include estimation of expected discounted returns in temporal-difference methods and probability measures with GFlowNets.

This paper extends the theory of the original GFlowNet construction (Bengio et al., 2021) in several directions, including a new local training objective called detailed balance (for the analogy with the detailed balance condition of Monte-Carlo Markov chains) which avoids forming explicit sums required by the previously proposed flow matching loss, as well as formulations enabling the calculation of marginal probabilities (or free energies) for subsets of variables, more generally for subsets of larger sets, or subgraphs, their application to estimating entropy and mutual information, and the introduction of an unsupervised form of GFlowNets (the reward function is not needed while training, only observations of outcomes) enabling sampling from a Pareto frontier, for example. Although basic GFlowNets are more similar to bandits (in that a reward is only provided at the end of a sequence of actions), they can be extended to take into account intermediate rewards and thus a notion of return, and sample according to these returns. The original formulation of GFlowNets is also limited to discrete and deterministic environments, while this paper suggests how these two limitations could be lifted. Finally, whereas the basic formulation of GFlowNets assumes a given reward or energy function, this paper considers how the energy function could be jointly learned with the GFlowNet, opening the door to novel energy-based modeling methodologies and a modular structure for both the energy function and the GFlowNet.

1.3 GFlowNets in other works

In addition to the theory presented in this paper, Malkin et al. (2023) and Zimmermann et al. (2022) prove some partial equivalences between GFlowNets and hierarchical variational methods, providing yet more theoretical evidence for the efficacy of GFlowNets in learning to sample proportionally to a given reward function. These works also provide evidence for the superiority of GFlowNets in off-policy settings.

GFlowNets have found a wide array of applications due to the associated diversity of generated samples. In contexts where a cheap proxy for the true reward function exists, GFlowNets have been used to surface samples under which to query the proxy before more expensive evaluation under the true reward function. In these settings, the diversity of samples generated by GFlowNets can be used for robustness to proxy misspecification and to incorporate epistemic uncertainty. For example, Zhang et al. (2023) use GFlowNets to produce sample schedules for operations in a computation graph, where evaluating the runtimes of sample schedules via a proxy is fast but evaluating the same schedules on target hardware is expensive. In active learning problems, Jain et al. (2022, 2023) use GFlowNet sampling as a subroutine inside an active learning loop as a substitute for Bayesian Optimization or RL-based methods. Jain et al. (2022) apply GFlowNets to search for novel anti-microbial peptides, discover DNA sequences that have high binding activity with human transcription factors, and to find proteins with high fluorescence. Additionally, Jain et al. (2023) develops preference-conditional GFlowNets, where a preference weight vector is used to scalarize multiple objective functions into a single reward. The authors apply their techniques to various molecule and DNA sequence generation tasks and find that their methods are able to find different Pareto-optimal samples along the Pareto frontier.

GFlowNets have found applications in several other machine learning problems. For example, Zhang et al. (2022) simultaneously train an energy-based model and a GFlowNet; the energy function is trained with samples from a GFlowNet, which, in turn, uses the energy function to form its reward. Their method results in a generative model for binary vectors in high dimensions, e.g., binarized digits. Deleu et al. (2022) use a GFlowNet for structure learning; the GFlowNet produces samples that approximates the true posterior over causal graphs given a dataset. Their method works on both observational and interventional data, and compares favorably to MCMC- and variational inference-based methods. Hu et al. (2023) find maximum-likelihood estimates of latent variable models with discrete compositional latents by jointly training a GFlowNet to approximately sample from the generally intractable posterior in the E-step of the expectation-maximization (EM) algorithm.

2 Flow Networks and Markovian Flows

2.1 Some elements of graph theory

In this section, we recall some basic definitions and properties of graphs, which are the basis of flow networks and GFlowNets.

Definition 1.

A directed graph is a tuple G=(𝒮,𝔸), where 𝒮 is a finite set of states, and 𝔸 a subset of 𝒮×𝒮 representing directed edges. Elements of 𝔸 are denoted ss and called edges or transitions.
A trajectory in such a graph is a sequence τ=(s1,,sn) of elements of 𝒮 such that every transition stst+1𝔸 and n>1.
We denote sτ to mean that s is in the trajectory τ, i.e., t{1,,n}st=s, and similarly ssτ to mean that t{1,,n1}st=s,st+1=s. For convenience, we also use the notation τ=s1sn. The length of a trajectory is the number of edges in it (the length of τ=(s1,,sn) is thus n1).
A directed acyclic graph (DAG) is a directed graph in which there is no trajectory τ=(s1,,sn) satisfying sn=s1.

Given a DAG G=(𝒮,𝔸), and two states s,s𝒮, if there exists a trajectory in G starting in s and ending in s, then we write s<s. The binary relationship “<” defines a strict partial order (i.e. it is irreflexive, asymmetric and transitive). We write ss if s<s or s=s. The binary relation “” is a (non-strict) partial order (i.e. it is reflexive, antisymmetric and transitive).

If there is no order relation between s and s, we write ss.

Definition 2.

Given a DAG G=(𝒮,𝔸), the parent set of a state s𝒮, which we denote Par(s), contains all of the direct parents of s in G, i.e., Par(s)={s𝒮:ss𝔸}; similarly, the child set Child(s) contains all of the direct children of s in G, i.e., Child(s)={s𝒮:ss𝔸}.

Definition 3.

Given a DAG G=(𝒮,𝔸). G is called a pointed DAG if there exist two states s0,sf𝒮 that satisfy:

s𝒮{s0}s0<s and s𝒮{sf}s<sf.

s0 is called the source state or initial state. sf is called the sink state or final state. Because “<” is a strict partial order, these two states are unique.

A complete trajectory in such a DAG is any trajectory starting in s0 and ending in sf. We denote such a trajectory as τ=(s0,s1,,sn,sn+1=sf).

We denote by 𝒯 the set of all complete trajectories in G, and by 𝒯partial the set of (possibly incomplete) trajectories in G.

A state s𝒮 is called a terminating state if it is a parent of the sink state, i.e. ssf𝔸. The transition ssf is called a terminating edge. We denote by:

  • 𝔸f={ss𝔸,ssf}, the set of non-terminating edges in G,

  • 𝔸f={ss𝔸,s=sf}=𝔸𝔸f, the set of terminating edges in G,

  • 𝒮f={s𝒮,ssf𝔸f}=Par(sf), the set of terminating states in G.

In Fig. 3, we visualize the concepts introduced in the previous definitions.

Refer to caption
Figure 3: Example of a pointed DAG G illustrating the notions of initial state (s0), final or sink state (sf), terminating states in 𝒮f, with a transition to sf called a terminating edge, in 𝔸f. A terminating state may have other children different from the sink state (e.g., the terminating state s7).

Note that the constraint of a single source state and single sink state is only a mathematical convenience since a bijection exists between general DAGs and those with this constraint (by the addition of a unique source/sink state connected to all the other source/sink states).

Definition 4.

Let G be a pointed DAG with source state s0 and sink state sf. A forward (resp. backward) probability function consistent with G is any non-negative function P^F (resp. P^B) defined on 𝔸 that satisfies s𝒮{sf},sChild(s)P^F(ss)=1 (resp. s𝒮{s0},sPar(s)P^B(ss)=1).

With pointed DAGs, consistent forward and backward probability functions, that are probabilities over states, can be used to define probabilities over trajectories, i.e. probability measures on some subsets of 𝒯partial. The following lemma shows how to construct such factorized probability measures:

Lemma 5.

Let G=(𝒮,𝔸) be a pointed DAG, and consider a forward probability function PF^, and a backward probability function PB^ both consistent with G. For any state s𝒮{sf}, we denote by 𝒯s,f𝒯partial the set of trajectories in G starting in s and ending in sf; and for any state s𝒮{s0}, we denote by 𝒯0,s𝒯partial the set of trajectories in G starting in s0 and ending in s.

Consider the extensions of P^F and P^B on 𝒯partial defined by:

τ=(s1,,sn)𝒯partialP^F(τ):=t=1n1P^F(st+1st) (1)
τ=(s1,,sn)𝒯partialP^B(τ):=t=1n1P^B(stst+1) (2)

We have the following:

s𝒮{sf}τ𝒯s,fP^F(τ)=1 (3)
s𝒮{s0}τ𝒯0,sP^B(τ)=1 (4)

Proof For convenience, we will use 𝒯ss,sf to denote the set of trajectories starting with ss and ending in sf, and 𝒯0,ss to denote the set of trajectories starting in s0 and ending with ss. This allows to write:

ssf𝒯s,f=sChild(s)𝒯ss,sf,{𝒯ss,sf,sChild(s)} pairwise disjoint,
ss0𝒯0,s=sPar(s)𝒯0,ss,{𝒯0,ss,sPar(s)} pairwise disjoint.

Additionally, for any ssf, we denote by ds,f the maximum trajectory length in 𝒯s,f; and for any ss0, we denote by d0,s the maximum trajectory length in 𝒯0,s.

We will prove Eq. 3 by strong induction on ds,f and Eq. 4 by strong induction on d0,s.

Base cases: If ds,f=1 and d0,s=1, then 𝒯s,f={(ssf)} and 𝒯0,s={(s0s)}. Hence, τ𝒯s,fP^F(τ)=P^F(ssf)=P^F(sfs)=1 given that sf is the only child of s (otherwise ds,f cannot be 1), and τ𝒯0,sP^B(τ)=P^B(s0s)=1 given that s0 is the only parent of s (otherwise d0,s cannot be 1).

Induction steps: Consider ssf such that ds,f>1 and ss0 such that d0,s>1. Because of the disjoint unions written above, we have:

τ𝒯s,fP^F(τ)=s~Child(s)τ𝒯ss~,fP^F(τ)=s~Child(s)P^F(s~s)τ𝒯s~,fP^F(τ)=1,
τ𝒯0,sP^B(τ)=s~Par(s)τ𝒯0,s~sP^B(τ)=s~Par(s)P^B(s~s)τ𝒯0,s~P^B(τ)=1,

where we used the induction hypotheses in the third equality of each line.

2.2 Trajectories and Flows

We augment pointed DAGs it with a function F called a flow. An analogy which helps to picture flows is a stream of particles flowing through a network where each particle starts at s0 and flowing through some trajectory terminating in sf. The flow F(τ) associated with each complete trajectory τ contains the number of particles sharing the same path τ.

Definition 6.

Given a pointed DAG, a trajectory flow (or “flow”) is any non-negative function F:𝒯+ defined on the set of complete trajectories 𝒯. F induces a measure over the σ-algebra Σ=2𝒯, the power set on the set of complete trajectories 𝒯. In particular, for every subset A𝒯, we have

F(A)=τAF(τ). (5)

The pair (G,F) is called a flow network.

This definition ensures that (𝒯,2𝒯,F) is a measure space. We abuse the notation here, using F to denote both a function of complete trajectories, and its corresponding measure over (𝒯,2𝒯). A special case is when the event A is the singleton trajectory {τ}, where we just write its measure as F(τ). We also abuse the notation to define the flow through either a particular state s, or through a particular edge ss in the following way.

Definition 7.

The flow through a state (or state flow) F:𝒮+ corresponds to the measure of the set of complete trajectories going through a particular state:

F(s):=F({τ𝒯:sτ})=τ𝒯:sτF(τ). (6)

Similarly, the flow through an edge (or edge flow) F:𝔸+ corresponds to the measure of the set of complete trajectories going through a particular edge:

F(ss):=F({τ𝒯:ssτ})=τ𝒯:ssτF(τ). (7)

Note that with this definition, we have F(ss)=0 if ss𝔸 is not an edge in the pointed DAG (since F()=0). We call the flow of a terminating transition F(ssf) a terminating flow. The following proposition relates the state flows and the edge flows:

Proposition 8.

Given a flow network (G,F). The state flows and edge flows satisfy:

s𝒮{sf}F(s)=sChild(s)F(ss) (8)
s𝒮{s0}F(s)=sPar(s)F(ss) (9)

Proof Given ssf, the set of complete trajectories going through s is the (disjoint) union of the sets of trajectories going through ss, for all sChild(s):

{τ𝒯:sτ}=sChild(s){τ𝒯:ssτ}.

Therefore, it follows that:

F(s)=τ:sτF(τ)=sChild(s)τ:ssτF(τ)=sChild(s)F(ss)

Similarly, Eq. 9 follows by writing the set of complete trajectories going though ss0 as the (disjoint) union of the sets of trajectories going through ss for all sPar(s).

2.3 Flow Induced Probability Measures

Definition 9.

Given a flow network (G,F), the total flow Z is the measure of the whole set 𝒯, corresponding to the sum of the flows of all the complete trajectories:

Z:=F(𝒯)=τ𝒯F(τ). (10)

Proposition 10.

The flow through the initial state equals the flow through the final state equals the total flow Z.

Proof Since τ𝒯,s0,sfτ, applying Eq. 6 to s0 and sf yields

F(s0) =τ𝒯F(τ)=Z, (11)
F(sf) =τ𝒯F(τ)=Z. (12)


Intuitively, Prop. 10 justifies the use of the term “flow”, introduced by Bengio et al. (2021), by analogy with a stream of particles flowing from the initial state to the final states.

We use the letter Z in Def. 9, often used to denote the partition function in probabilistic models and statistical mechanics, because it is a normalizing constant which can turn the measure space (𝒯,2𝒯,F) defined above into the probability space (𝒯,2𝒯,P):

Definition 11.

Given a flow network (G,F), the flow probability is the probability measure P over the measurable space (𝒯,2𝒯) associated with F:

A𝒯P(A):=F(A)F(𝒯)=F(A)Z. (13)

For two events A,B𝒯, the conditional probability P(AB) thus satisfies:

P(AB) :=F(AB)F(B). (14)

Similar to the flow F, we abuse the notation P to define the probability of going through a state:

s𝒮P(s):=F(s)Z, (15)

and similarly for the probability of going through an edge. Note that P(s) does not correspond to a distribution over states, in the sense that s𝒮P(s)1; in particular, it is easy to see that P(s0)=1 (in other words, the probability of a trajectory passing through the initial state s0 is 1). Additionally, for a trajectory τ𝒯, we also use the abuse of notation P(τ) instead of P({τ}) to denote the probability of going through a specific trajectory τ.

Definition 12.

Given a flow network (G,F), the forward transition probability operator PF is a function on 𝒮×𝒮, that is a special case of the conditional probabilities induced by F (Eq. 14):

ss𝔸PF(ss):=P(sss)=F(ss)F(s). (16)

Similarly, the backwards transition probability is the operator defined by:

ss𝔸PB(ss):=P(sss)=F(ss)F(s). (17)

Note how PF and PB are consistent with G (in the sense of Def. 4), as a consequence of Prop. 8.

Because flows define probabilities over states and edges, they can be used to define probability distributions over the terminating states of a graph (denoted by 𝒮f=Par(sf)) as follows:

Definition 13.

Given a flow network (G,F), the terminating state probability PT is the probability over terminating states 𝒮f under the flow probability P:

s𝒮fPT(s):=P(ssf)=F(ssf)Z (18)

Contrary to the probability P(s) of going through a state s, the terminating state probability PT is a well-defined distribution over the terminating states s𝒮f, in the following sense:

Proposition 14.

The terminating state probability PT is a well-defined distribution over the terminating states s𝒮f, in that PT(s)0 for all s𝒮f, and

s𝒮fPT(s)=1.

Proof Since the flow F(ssf) is non-negative, it is easy to see that PT(s)0. Moreover, using the definition of 𝒮f=Par(sf), Prop. 8 (relating the edge flows and the state flows), and Prop. 10 (F(sf)=Z), we have

s𝒮fPT(s)=1Zs𝒮fF(ssf)=1ZsPar(sf)F(ssf)=F(sf)Z=1.


The terminating state probability is particularly important in the context of estimating flow networks (see Sec. 3), as it shows that a flow network (G,F) induces a probability distribution over terminating states which is proportional to the terminating flows F(ssf), the normalization constant Z being given by initial flow F(s0).

2.4 Markovian Flows

Defining a flow requires the specification of |𝒯| non-negative values (one for every trajectory τ𝒯), which is generally exponential in the number of graph edges. Markovian flows however have the remarkable property that they can be defined with much fewer “numbers”, given that trajectory flows factorize according to G.

Definition 15.

Let (G,F) be a flow network, with flow probability measure P. F is called a Markovian flow (or equivalently (G,F) a Markovian flow network) if, for any state ss0, outgoing edge ss, and for any trajectory τ=(s0,s1,,sn=s)𝒯partial starting in s0 and ending in s:

P(ssτ)=P(sss)=PF(ss). (19)

Note that the Markovian property does not hold for all of the flows as defined in the previous sections (e.g. Fig. 4). Intuitively, a flow can be considered non-Markovian if a particle in the “flow stream” can remember its past history; if not, its future behavior can only depend on its current state and the flow must be Markovian. In this work, we will primarily be concerned with Markovian flows, though later we will re-introduce a form of memory via state-conditional flows that allow each flow “particle” to remember parts of its history. The following proposition shows that Markovian flows have the property that the flows at (or the probabilities of) complete trajectories factorize according the the graph, and that it is a sufficient condition for defining Markovian flows.

Proposition 16.

Let (G,F) be a flow network, and P the corresponding flow probability. The following three statements are equivalent:

  1. 1.

    F is a Markovian flow

  2. 2.

    There exists a unique probability function P^F consistent with G such that for all complete trajectories τ=(s0,,sn+1=sf):

    P(τ)=t=1n+1P^F(stst1). (20)

    Moreover, the probability function P^F is exactly the forward transition probability associated with the flow probability P: P^F=PF.

  3. 3.

    There exists a unique probability function P^B consistent with G such that for all complete trajectories τ=(s0,,sn+1=sf):

    P(τ)=t=1n+1P^B(st1st). (21)

    Moreover, the probability function P^B is exactly the backwards transition probability associated with the flow probability P: P^B=PB.

Proof Recall from Lemma 5 the notations 𝒯0,s to denote the set of partial trajectories from s0 to s, and 𝒯s,f to denote the set of partial trajectories from s to sf. We will prove the equivalences 12 and 13.

  • 12: Suppose that F is a Markovian flow. Then using the laws of probability, the Markov property in Eq. 19, and P(s0)=1, for some complete trajectory τ=(s0,,sn+1=sf):

    P(τ) =P(s0s1sn+1)=P(s0s1)t=1nP(stst+1s0st)
    =P(s0s1)t=1nP(stst+1st)
    =P(s0)PF(s1s0)t=1nPF(st+1st)
    =t=1n+1PF(stst1),

    where the second line uses to Markov property, and the third line uses the definition of the forward transition probability PF. PF thus satisfies Eq. 20 for all complete trajectories.

    To show uniqueness of PF, assume Eq. 20 is satisfied by some P^F for all complete trajectories. By definition of the forward transition probability:

    PF(ss):=P(sss)=P(ss)P(s).

    Any complete trajectory τ going through a state s can be (uniquely) decomposed into a partial trajectory τ𝒯0,s from s0 to s, and a partial trajectory τ′′𝒯s,f from s to sf. Using the definition of P(s), we have:

    P(s) =τ:sτP(τ)=τ:sτ(stst+1)τP^F(st+1st)
    =[τ𝒯0,s(stst+1)τP^F(st+1st)][τ′′𝒯s,f(stst+1)τ′′P^F(st+1st)]= 1(Lemma 5)
    =τ𝒯0,s(stst+1)τP^F(st+1st).

    Similarly, any complete trajectory going through ss can be (uniquely) decomposed into a partial trajectory τ𝒯0,s from s0 to s, and a partial trajectory τ′′𝒯s,f from s to sf. Again, using the definition of P(ss):

    P(ss) =τ:(ss)τP(τ)=τ:(ss)τ(stst+1)τP^F(st+1st)
    =[τ𝒯0,s(stst+1)τP^F(st+1st)]=P(s)P^F(ss)[τ′′𝒯s,f(stst+1)τ′′P^F(st+1st)]= 1(Lemma 5)
    =P(s)P^F(ss).

    Combining the two results above, we get:

    PF(ss)=P(ss)P(s)=P^F(ss).
  • 21: Suppose that there exists a probability function P^F consistent with G such that for some complete trajectory τ=(s0,,sn+1=sf)

    P(τ)=t=1n+1P^F(stst1).

    For the same reasons as those used to justify the uniqueness in the 12 proof, P^F is necessarily equal to the forward transition probability PF, associated with P.

    We now want to show that the flow F associated with P is Markovian, by showing the Markov property from Eq. 19. Let τ𝒯0,s be any partial trajectory from s0 to s; using the definition of conditional probability:

    P(ssτ)=P(s0ss)P(s0s).

    Following the same idea as above, we will now rewrite P(s0s), as a sum over complete trajectories that share the same prefix trajectory τ. Any such complete trajectory τ can be (uniquely) decomposed into this common prefix τ, and a partial trajectory τ′′𝒯s,f from s to sf.

    P(s0s) =τ:ττP(τ)=τ:ττ(stst+1)τPF(st+1st)
    =[st1stτPF(stst1)][τ′′𝒯s,f(stst+1)τ′′PF(st+1st)]= 1(Lemma 5)
    =st1stτPF(stst1).

    Similarly, any complete trajectory τ that share the same prefix trajectory (s0,,s,s) can be (uniquely) decomposed into this common prefix, and a partial trajectory τ′′𝒯s,f from s to sf, leading to:

    P(s0ss)=P(s0s)PF(ss)

    Combining the two results above, we can conclude that P satisfies the Markov property, and therefore that the flow F is Markovian:

    P(ssτ)=P(s0ss)P(s0s)=PF(ss)=P(sss)
  • {1,2}3: Suppose that F is a Markovian flow. We have shown above that this is equivalent to P being decomposed into a product of forward transition probabilities PF. For some complete trajectory τ=(s0,,sn+1=sf):

    P(τ)=t=1n+1PF(stst1)=t=1n+1P(st1st)P(st1)=t=1n+1P(st1st)P(st)=t=1n+1PB(st1st),

    where the third equality uses the fact that P(s0)=P(sf)=1, and using the definition of the backwards transition probability PB. The proof of uniqueness of PB is similar to that of PF in 12, and uses:

    P(ss) =τ:(ss)τP(τ)=τ:(ss)τ(stst+1)τP^B(stst+1)
    =[τ𝒯0,s(stst+1)τP^B(stst+1)]= 1(Lemma 5)P^B(ss)[τ′′𝒯s,f(stst+1)τ′′P^B(stst+1)]=P(s)
    =P(s)P^B(ss),
  • 31: Similar to the proof of 21, P^B is necessarily equal to the backwards transition probability PB associated with P. Additionally, PB is related to the forward transition probability PF:

    P(ss)=PB(ss)P(s)=PF(ss)P(s).

    We can therefore write the decomposition of P in terms of PF, instead of PB. For some complete trajectory τ=(s0,,sn+1=sf):

    P(τ) =t=1n+1PB(st1st)=t=1n+1P(st1)P(st)PF(st+1st)=P(s0)P(sf)t=1n+1PF(st+1st)
    =t=1n+1PF(st+1st),

    where we used the fact that P(s0)=P(sf)=1. Using “21”, we can conclude that F is a Markovian flow.


The decomposition of Eq. 20 shows how Markovian flows can be used to draw terminating states from the terminating state probability PT (Eq. 18). Namely, we have the following result:

Corollary 17.

Let (G,F) be a Markovian flow network, and PF the corresponding forward transition probability. Consider the procedure starting from s=s0, and iteratively drawing one sample from PF(.s) until reaching sf. Then the probability of the procedure terminating in a state s is PT(s).

Proof First, note that the procedure terminates with probability 1, given that G is acyclic.

For the procedure to terminate in a state s, it means that the trajectory τ𝒯 implicitly constructed during the procedure contains the edge ssf. The probability of the procedure terminating in s is thus:

τ𝒯:ssfτss′′τPF(s′′s)P(τ), according to Eq. 20=P(ssf)=PT(s)


The following proposition shows that, as a consequence of the Prop. 16, we obtain three different parametrizations of Markovian flows.

Proposition 18.

Given a pointed DAG G=(𝒮,𝔸), a Markovian flow on G is completely and uniquely specified by one of the following:

  1. 1.

    the combination of the total flow Z^ and the forward transition probabilities P^F(ss) for all edges ss𝔸,

  2. 2.

    the combination of the total flow Z^ and the backward transition probabilities P^B(ss) for all edges ss𝔸.

  3. 3.

    the combination of the terminating flows F^(ssf) for all terminating edges ssf𝔸f and the backwards transition probabilities P^B(ss) for all non-terminating edges ss𝔸f,

Proof In the first two settings, we define a flow function F:𝒯+, at a trajectory τ=(s0,s1,,sn,sn+1=sf) as:

  1. 1.

    F(τ):=Z^t=1n+1P^F(stst1),

  2. 2.

    F(τ):=Z^t=1n+1P^B(st1st)

We need to prove that it is the only Markovian flow that can be defined for both settings. The proof for the third setting will follow from that of the second setting.

First setting:

First, we need to show that the total flow Z associated with the flow function F (Eq. 10) matches Z^. This is a consequence of Lemma 5:

Z=τ𝒯F(τ)=Z^τ=(s0,s1,,sn+1=sf)𝒯t=1n+1P^F(stst1)=1, according to Lemma 5=Z^

Then, we need to show that the forward transition probability function PF associated with F (Eq. 16) matches P^F, and that the flow F is Markovian. To this end, note that the corresponding flow probability P satisfies Eq. 20. Thus, as a consequence of Prop. 16, F is a Markovian flow, and its forward transition probability function is P^F.

As a last requirement, we need to show that if a Markovian flow F has a partition function Z=Z^ and a forward transition probability function PF=P^F, then it is necessarily equal to F. This is a direct consequence of Prop. 16, given that for any τ=(s0,,sn+1=sf)𝒯:

F(τ)=Zt=1n+1PF(stst1)=Z^t=1n+1P^F(stst1)=F(τ)

Second setting:

First, we show that as a consequence of Lemma 5, the total flow Z associated with F matches Z^:

Z=τ𝒯F(τ)=Z^τ=(s0,s1,,sn+1=sf)𝒯t=1n+1P^B(s1st)=1, according to Lemma 5=Z^

Second, we note that the flow probability P associated with F satisfies Eq. 21. Thus, as a consequence of Prop. 16, F is a Markovian flow, and its backward transition probability function is P^B.

Finally, if a Markovian flow F has a partition function Z=Z^ and a backward transition probability function PB=P^B, then following Prop. 16, τ𝒯,F(τ)=F(τ).

Third setting:

From the terminating flows F^(ssf) and the backwards transition probabilities P^B(ss) for non-terminating edges, we can uniquely define a total flow Z^, and extend P^B to all edges as follows:

Z^ :=sPar(sf)F^(ssf)
P^B(ss) :={P^B(ss)if ssfF^(ssf)Z^otherwise.

This takes us back to the second setting, for which we have already proven that with Z^ and P^B defined for all edges, a Markovian flow is uniquely defined.


2.5 Flow Matching Conditions

In Prop. 18, we saw how forward and backward probability functions can be used to uniquely define a Markovian flow. We will show in the next proposition how non-negative functions of states and edges can be used to define a Markovian flow. Such functions cannot be unconstrained (as P^F and Z^ in Prop. 18 e.g.), as we have seen in Prop. 8.

Proposition 19.

Let G=(𝒮,𝔸) be a pointed DAG. Consider a non-negative function F^ taking as input either a state s𝒮 or a transition ss𝔸. Then F^ corresponds to a flow if and only if the flow matching conditions:

s>s0,F^(s) =sPar(s)F^(ss)
s<sf,F^(s) =s′′Child(s)F^(ss′′) (22)

are satisfied. More specifically, F^ uniquely defines a Markovian flow F matching F^ on states and transitions:

τ=(s0,,sn+1=sf)𝒯F(τ)=t=1n+1F^(st1st)t=1nF^(st). (23)

Proof Necessity is a direct consequence of Prop. 8. Let’s show sufficiency. Let P^F be the forward probability function defined by:

ss𝔸P^F(ss):=F^(ss)F^(s).

P^F is consistent with G given that F^ satisfies the flow matching conditions (Eq. 22). Let Z^=F^(s0). According to Prop. 18, there exists a unique Markovian flow F with forward transition probability function PF=P^F and partition function Z=Z^, and such that for a trajectory τ=(s0,,sn+1=sf)𝒯:

τ=(s0,,sn+1=sf)𝒯F(τ)=Z^t=1n+1P^F(stst1)=t=1n+1F^(st1st)t=1nF^(st). (24)

Additionally, similar to the proof of Prop. 16, we can write for any state ss0:

F(s) =Z^τ𝒯0,s(stst+1)τP^F(st+1st)
=Z^F^(s)F^(s0)τ𝒯0,s(stst+1)τP^B(stst+1)=1, according to Lemma 5
=F^(s),

where P^B(ss):=F^(ss)F^(s) defines a backward probability function consistent with G. And because ss𝔸PF(ss)=P^F(ss), it follows that ss𝔸F(ss)=F^(ss).

To show uniqueness, let’s consider a Markovian flow F that matches F^ on states and edges. Following Prop. 16, for any trajectory τ=(s0,,sn+1=sf)𝒯

F(τ)=Z^t=1n+1P^F(stst1)=t=1n+1F^(st1st)t=1nF^(st)=F(τ).


Note how Eq. 22 can be used to recursively define the flow in all the states if Z is given and either the forward or the backwards transition probabilities are given. Either way, we would start from the flow at one of the extreme states s0 or sf and then distribute it recursively through the directed acyclic graph of the flow network, either going forward or going backward. A setting of particular interest, that will be central in Sec. 3, is when we are given all the terminal flows F(ssf), and we would like to deduce a state flow function F(s) and a forward transition probability function PF(ss) for the rest of the flow network.

Next, we will see how to parametrize Markovian flows using forward and backward probability functions consistent with the DAG. Unlike the condition in Prop. 19, the new condition does not involve a sum over transitions, which could be problematic if each state can have a large number of successors or if the state-space is continuous. Interestingly, the resulting condition is analogous to the detailed balance condition of Monte-Carlo Markov chains.

Definition 20.

Given a pointed DAG G=(𝒮,𝔸), a forward transition probability function P^F and a backward transition probability function P^B consistent with G, P^F and P^B are compatible if there exists an edge flow function F^:𝔸+ such that

ss𝔸P^F(ss) =F^(ss)sChild(s)F^(ss),P^B(ss) =F^(ss)s′′Par(s)F^(s′′s) (25)

Proposition 21.

Let G=(𝒮,𝔸) be a pointed DAG. Consider a non-negative function F^ over states, a forward transition probability function P^F and a backwards transition probability function P^B consistent with G. Then, F^,P^B, and PF^ jointly correspond to a flow if and only if the detailed balance conditions holds:

ss𝔸F^(s)P^F(ss)=F^(s)P^B(ss). (26)

More specifically, F^,P^F, and P^B uniquely define a Markovian flow F matching F^ on states, and with transition probabilities matching P^F and P^B. Furthermore, when this condition is satisfied, the forward and backward transition probability functions P^F and P^B are compatible.

Proof For necessity, consider a flow F, with state flow function denoted F, and forward and backward transitions PF and PB. It is clear from the definition of PF and PB (Def. 12) that Eq. 26 holds. We prove the sufficiency of the condition by first defining the edge flow

ss𝔸F^(ss):=F^(s)P^F(ss). (27)

We then sum both sides of Eq. 26 over s, yielding

s>s0sPar(s)F^(s)P^F(ss)=F^(s)sPar(s)P^B(ss)=F^(s) (28)

where we used the fact that P^B is a normalized probability distribution. Combining this with Eq. 27, we get

s>s0F^(s)=sPar(s)F^(ss) (29)

which is the first equality of the flow-matching condition (Eq. 22) of Prop. 19. We can obtain the second equality by first using the normalization of P^, and then using our definition of the edge flow (Eq. 27):

s>s0F^(s) =F^(s)s′′Child(s)P^F(s′′s)
=s′′Child(s)F^(s)P^F(s′′s)
=s′′Child(s)F^(ss′′). (30)

Following Prop. 19, there exists a unique Markovian flow F with state and edge flows given by F^. Using Eq. 27 and Eq. 26, it follows that F has transition probabilities P^F and P^B as required. The uniqueness is also a consequence of Eq. 27. This proves sufficiency.

To show that P^F and P^B are compatible (Def. 20), we first combine Eq. 27 and Eq. 30 (with relabeling of variables) to obtain

ss𝔸P^F(ss)=F^(ss)sChild(s)F^(ss),

we then isolate P^B in Eq. 26, yielding

ss𝔸P^B(ss)=F^(s)F^(s)P^F(ss)=F^(ss)F^(s)=F^(ss)s′′Par(s)F^(s′′s),

We thus get Eq. 25 of Def. 20, as desired.

At first glance, it may seem that when P^B is unconstrained, the detailed balance condition can trivially be achieved by setting

ss𝔸P^B(ss)=P^F(ss)F^(s)F^(s) (31)

However, because we also have the constraint sPar(s)P^B(ss)=1, then Eq. 31 can only be satisfied if the flows are consistent with the forward transition:

sPar(s)P^F(ss)F^(s)=F^(s).

2.6 Backwards Transitions can be Chosen Freely

Consider the setting in which we are given terminating flows to be matched, i.e. where the goal is to find a flow function with the right terminating flows. This is the setting introduced in Bengio et al. (2021), and that will be studied in Sec. 3. In this case, Prop. 18 tells us that in order to fully determine the forward transition probabilities and the state or state-action flows, it is not sufficient in general to specify only the terminating flows; it is also necessary to specify the backwards transition probabilities on the edges other than the terminal ones (the latter being given by the terminating flows).

What this means is that the terminating flows do not specify the flow completely, e.g., because many different paths can land in the same terminating state. The preference over such different ways to achieve the same final outcome is specified by the backwards transition probability PB (except for PB(ssf) which is a function of the terminating flows and Z). For example, we may want to give equal weight to all parents of a node s, or we may prefer shorter paths, which can be achieved if we keep track in the state s of the length of the shortest path to the node s, or we may let a learner discover a PB that makes learning PF or F easier.

2.7 Equivalence Between Flows

In the previous sections, we have seen that Markovian flows have the property that trajectory flows or probabilities factorize according to the DAG, and we have seen different ways of characterizing Markovian flows. In Sec. 3, we show how to approximate Markovian flows in order to define probability measures over terminating states. In this section, through an equivalence relation between trajectory flows, we justify the focus on Markovian flows. Given a pointed DAG G=(𝒮,𝔸), we denote by:

  • (G): the set of flows on G, i.e. the set of functions from 𝒯, the set of complete trajectories in G, to +,

  • Markov(G): the set of flows in (G) that are Markovian.

Definition 22.

Let G=(𝒮,𝔸) be a pointed DAG, and F1,F2(G) two trajectory flow functions. We say that F1 and F2 are equivalent if they coincide on edge-flows, i.e.:

ss𝔸F1(ss)=F2(ss)

Fig. 4 shows four flow functions in a simple pointed DAG that are pairwise equivalent.

Refer to caption
τ F1(τ) F2(τ) F3(τ) F4(τ)
s0,s2,sf 1 4/5 1 6/5
s0,s1,s2,sf 1 6/5 1 4/5
s0,s2,s3,sf 1 6/5 2 9/5
s0,s1,s2,s3,sf 2 9/5 1 6/5
Figure 4: Equivalent flows and Markovian flows. Flows F1 and F2 are equivalent. F3 and F4 are equivalent, but not equivalent to F1 and F2. F2 and F4 are Markovian. F1 and F3 are not Markovian. F1,F2,F3 and F4 coincide on the terminating flows i.e. at s2sf and s3sf.

This defines an equivalence relation (i.e., a relation that is reflexive, symmetric, and transitive). Hence, each flow F belongs to an equivalence class, and the set of flows (G) can be partitioned into equivalence classes. Note that if two flows are equivalent, then the corresponding state flow functions also coincide (as a direct consequence of Prop. 8).

Proposition 23.

Given a pointed DAG G. If two flow function F1,F2Markov(G) are equivalent, then they are equal. Additionally, for any flow function F(G), there exists a unique Markovian flow function FMarkov(G) such that F and F are equivalent.

Proof Because F1 and F2 are Markovian, then for any trajectory τ=(s0,,sn+1=sf):

F1(τ) =t=1n+1F1(st1st)t=1nF1(st)
=t=1n+1F2(st1st)t=1nF2(st)
=F2(τ),

where we combined the definition of equivalent flows and Prop. 16.

Given a flow function F, because its state and edge flow functions satisfy the flow matching conditions (as a consequence of Prop. 8), then according to Prop. 19, the flow F defined by:

τ:=(s0,,sn+1=sf)𝒯F(τ)=t=1n+1F(st1st)t=1nF(st)

is Markovian, and coincides with F on state and edge flows. Combining this with the statement above, we conclude that F is the unique Markovian flow that is equivalent to F.
The previous proposition shows that in each equivalence class stands out a particular flow function, that has a property the other flows in the same equivalence class don’t have: it is Markovian.

A consequence of this is that, if we care essentially about state and edge flows, instead of dealing with the full set of flows (G), it suffices to restrict any flow learning problem to the set of Markovian flows Markov(G). The advantage of this restriction is that defining a flow requires the specification of F(τ) for all trajectories τ𝒯, whereas defining a Markovian flow requires the specification of F(ss) for all edges ss𝔸, which is generally exponentially smaller than 𝒯 (note that the edge flows still need to satisfy the flow-matching conditions in Prop. 19). Thus, in order to approximate or learn a flow function that satisfies some conditions on its edge or state values, it suffices to approximate or learn a Markovian flow, by learning the edge flow function, which is a much smaller object than the actual flow function.

3 GFlowNets: Learning a Flow

With the theoretical preliminaries established in Sec. 1 and Sec. 2, we now consider the general class of problems introduced by Bengio et al. (2021) where some constraints or preferences over flows are given. Our goal is to find functions such as the state flow function F(s) or the transition probability function P(sss) that best match these desiderata using corresponding estimators F^(s) and P^(sss) which may not correspond to a proper flow. Such learning machines are called Generative Flow Networks (or GFlowNets for short). We focus on scenarios where we are given a target reward function R:𝒮f+, and aim at estimating flows F that satisfy:

s𝒮fF(ssf)=R(s) (32)

Because of the equivalences that exist in the set of flows, then without loss of generality, we choose GFlowNets to approximate Markovian flows only. We are thus interested in the following set of flows:

Markov(G,R)={FMarkov(G),s𝒮fF(ssf)=R(s)} (33)

For now, we informally define a GFlowNet as an estimator of a Markovian flow function FMarkov(G,R). We provide a more formal definition later-on.

With an estimator F^ of such a Markovian flow F, we can define an approximate forward transition probability function P^F, as in Prop. 16, in order to draw trajectories τ𝒯 (the set of complete trajectories in G) by iteratively sampling each state given the previous one, starting at s0 and then with st+1P^F(.st) until we reach the sink state sn+1=sf for some n.

Next, we will clarify how such an estimator can be obtained.

3.1 GFlowNets as an Alternative to MCMC Sampling

The main established methods to approximately sample from the distribution associated with an energy function are Monte-Carlo Markov chain (MCMC) methods, which require significant computation (running a potentially very long Markov chain) to obtain samples. Instead, the GFlowNet approach amortizes upfront computation to train a generator that yields very efficient computation (a single configuration is constructed, no chain needed) for each new sample. For example, Bengio et al. (2021) build a GFlowNet that constructs a molecule via a small sequence of actions, each of which adds an atom or a molecular substructure to an existing molecule represented by a graph, starting from an empty graph. Only one such configuration needs to be considered, in contrast with MCMC methods, which require potentially very long chains of such configurations, and suffer from the challenge of mode-mixing (Jasra et al., 2005; Bengio et al., 2013; Pompe et al., 2020), which can take time exponentially long in the distance between modes. In GFlowNets, this computational challenge is avoided but the computational demand is converted to that of training the GFlowNet. To see how this can be extremely beneficial, consider having already constructed some configurations x and obtained their unnormalized probability or reward R(x). With these pairs (x,R(x)), a machine learning system could potentially generalize about the value of R elsewhere, and if it is a generative model, sample new x’s in places of large R(x). Hence, if there is an underlying statistical structure in how the modes of R are related to each other, a generative learner that generalizes could guess the presence of modes it has not visited yet, taking advantage of the patterns it has already uncovered from the (x,R(x)) pairs it has seen. On the other hand, if there is no structure (the modes are randomly placed), then we should not expect GFlowNets to do significantly better than MCMC because training becomes intractable in high-dimensional spaces (since it requires visiting every area of the configuration space to ascertain its reward).

3.2 GFlowNets and flow-matching losses

We have seen in Sec. 2.4 and Sec. 2.5 different ways of parametrizing a flow. For example, with a partition function and forward transition probabilities, or with edge flows that satisfy the flow matching conditions. Because there are many ways to parametrize GFlowNets, we start with an abstract formulation for them, where o𝒪 represents a parameter configuration (e.g., resulting from or while training of a GFlowNet), Π(o) gives the corresponding probability measure over trajectories τ𝒯, and maps a Markovian flow F to its parametrization o. In the following definition, we show what conditions should be satisfied in order for such a parametrization to be valid.

Definition 24.

Given a pointed DAG G=(𝒮,𝔸), with an initial and sink states s0 and sf respectively, and a target reward function R:𝒮f+, we say that the triplet (𝒪,Π,) is a flow parametrization of (G,R) if:

  1. 1.

    𝒪 is a non-empty set,

  2. 2.

    Π is a function mapping each object o𝒪 to an element Π(o)Δ(𝒯), the set of probability distributions on 𝒯,

  3. 3.

    is an injective functional from Markov(G,R) to 𝒪,

  4. 4.

    For any FMarkov(G,R), Π((F)) is the probability measure associated with the flow F (Def. 11).

To each object o𝒪, the distribution Π(o) implicitly defines a terminating state probability measure:

s𝒮fPT(s):=τ𝒯:ssfτΠ(o)(τ), (34)

where the dependence on o in PT is omitted for clarity.

The intuition behind the introduction of (𝒪,Π,) is that we can define a probability measure over 𝒯 for each object o𝒪, but only some of these objects correspond to a Markovian flow with the right terminating flows. For such objects o (i.e. those that can be written as o=(F) for some flow FMarkov(G,R)), the probability measure PT corresponds to the distribution of interest, according to Def. 13, i.e.:

s𝒮fPT(s)R(s)

GFlowNets thus provide a solution to the generally intractable problem of sampling from a target reward function R, or its associated energy function:

s𝒮f(s):=logR(s) (35)

Directly approximating flows FMarkov(G,R) is a hard problem, whereas with some sets 𝒪, searching for an object o(Markov(G,R))𝒪 is a simpler problem that can be tackled with function approximation techniques.

Note that not the set 𝒪 cannot be arbitrary, as there needs to be a way to define an injective function from Markov(G,R) to 𝒪. Below, for a given DAG G, we show three examples clarifying the abstract concept of parametrization:

Example 1.

Edge-flow parametrization: Consider 𝒪edge=(𝔸f,+), the set of functions from 𝔸f to +, and the functionals edge:Markov(G,R)𝒪edge and Πedge:𝒪edgeΔ(𝒯) defined by:

edge(F):(ss)𝔸fF(ss),
τ=(s0,,sn=sf)𝒯Πedge(F^)(τ)t=1nPF^(stst1),

where

PF^(ss)={F^(ss)s′′sfF^(ss′′)+R(s)if ssfR(s)s′′sfF^(ss′′)+R(s)if s=sf (36)

The injectivity of edge follows directly from Prop. 23 (two Markovian flows that coincide on both their terminating and non-terminating edge flow values are equal). And for any Markovian flow FMarkov(G,R), Πedge(edge(F)) equals the probability measure associated with F, as is shown in Prop. 16.

(𝒪edge,Πedge,edge) is thus a valid flow parametrization of (G,R).

Example 2.

Forward transition probability parametrization: Consider the set 𝒪PF=𝒪1×𝒪2, where 𝒪1=(𝒮{sf},+) is the set of function from 𝒮{sf} to + and 𝒪2 is the set of forward probability functions P^F consistent with G , and the functionals PF:Markov(G,R)𝒪PF and ΠPF:𝒪PFΔ(𝒯) defined by:

PF(F)=(s𝒮{sf}F(s),(ss)𝔸PF(ss)),
τ=(s0,,sn=sf)𝒯ΠPF(F^,P^F)(τ)t=1nP^F(stst1),

where PF is the forward transition probability function associated with F (Eq. 16). To verify that PF is injective, consider F1,F2Markov(G,R) such that PF(F1)=PF(F2). It means that s𝒮f, F1(s)=F2(s), and ss𝔸, F1(ss)F1(s)=F2(ss)F2(s). It follows that ss𝔸, F1(ss)=F2(ss). Which, according to Prop. 23, means that F1=F2. And for any Markovian flow FMarkov(G,R), ΠPF(PF(F)) equals the probability measure associated with F, as is shown in Prop. 16.

(𝒪PF,ΠPF,PF) is thus a valid flow parametrization of (G,R).

Example 3.

Transition probabilities parametrization: Similar to Ex. 2, we can parametrize a Markovian flow using the state-flow function and both its forward and backward transition probabilities, i.e. with 𝒪PFB=𝒪PF×𝒪3, PFB, and ΠPFB defined as:

PFB(F)=(PF(F),(ss)𝔸fPB(ss),),
τ=(s0,,sn=sf)𝒯ΠPFB(F^,P^F,P^B)(τ)t=1nP^F(stst1),

where PB is the function defined by Eq. 17. and 𝒪3 is the set of backward probability functions P^B consistent with G. The injectivity of PFB is a direct consequence of that of PF. And for any Markovian flow F, ΠPFB(PFB(F)) equals the probability measure associated with F, as is shown in Prop.3.

(𝒪PFB,ΠPFB,PFB) is thus a valid flow parametrization of (G,R).

We now have all the ingredients to formally define a GFlowNet:

Definition 25.

A GFlowNet is a tuple (G,R,𝒪,Π,), where:

  • G=(𝒮,𝔸) is a pointed DAG with initial state s0 and sink state sf,

  • R:𝒮f+ a target reward function,

  • (𝒪,Π,) a flow parametrization of (G,R).

Each object o𝒪 is called a GFlowNet configuration. When it is clear from context, we will use the term GFlowNet to refer to both (G,R,𝒪,Π,) and a particular configuration o; similar to how the term “Neural Network” refers to both the class of functions that can be represented with a particular architecture, and to a particular element of that class / weight configuration.

If o(Markov(G,R)), then the corresponding terminating state probability measure (Eq. 34) is proportional to the target reward R.

Once we have a GFlowNet (G,R,𝒪,Π,), we still need a way to find objects o(Markov(G,R))𝒪. To this end, it suffices to design a loss function on 𝒪 that equals zero on objects o(Markov(G,R)) and only on those objects. If our loss function is chosen to be non-negative, then an approximation of the target distribution (on 𝒮f) is obtained by approximating the minimum of the function . This provides a recipe for casting the search problem of interest to a minimization problem, as we typically do in machine learning. Such loss functions can be easily designed for the natural parametrizations we considered in Ex. 1, Ex. 2, and Ex. 3, as we will illustrate below.

Definition 26.

Let (G,R,𝒪,Π,) be a GFlowNet. A flow-matching loss is any function :𝒪+ such that:

o𝒪(o)=0FMarkov(G,R)o=(F) (37)

We say that is edge-decomposable, if there exists a function L:𝒪×𝔸+ such that:

o𝒪(o)=ss𝔸L(o,ss),

We say that is state-decomposable, f there exists a function L:𝒪×𝒮+ such that:

o𝒪(o)=s𝒮L(o,s),

We say that is trajectory-decomposable if there exists a function L:𝒪×𝒯+ such that:

o𝒪(o)=τ𝒯L(o,τ)

As mentioned above, with a such a loss function, our search problems can be written as minimization problems of the form

mino𝒪(o), (38)

which can be tackled with gradient-based learning if the function is differentiable. Note that with an edge-decomposable flow-matching loss, the minimization problem in Eq. 38 is equivalent to:

mino𝒪𝔼(ss)πT[L(o,ss)], (39)

where πT is any full support probability distribution on 𝔸, i.e. a probability distribution such that ss𝔸πT(ss)>0. A similar statement can be made for state-decomposable or trajectory-decomposable flow-matching losses.

Example 4.

Consider the edge-flow parametrization (𝒪edge,Πedge,edge), and the function LFM:𝒪edge×𝒮+ defined for each F^𝒪edge and s𝒮 as

LFM(F^,s)={(log(δ+sPar(s)F^(ss)δ+R(s)+s′′Child(s){sf}F^(ss′′)))2if ssf,0otherwise

where δ0 is a hyper-parameter. The function FM mapping each F^𝒪edge to

FM(F^)=s𝒮LFM(F^,s) (40)

is a flow-matching loss, that is (by definition) state-decomposable.

To see this, let F^𝒪edge such that FM(F^)=0, and extend it to terminating edge:

s𝒮fF^(ssf):=R(s)

Now that F^ is defined for all edges in G, we can write that

s𝒮sPar(s)F^(ss)=s′′Child(s)F^(ss′′).

Which, according to Prop. 19, means that there exists a Markovian flow FMarkov(G,R) such that edge(F)=F^. The converse

FMarkov(G,R)FM(edge(F))=0

is a trivial consequence of Prop. 19.

This is the loss function proposed in Bengio et al. (2021). δ allows to reduce the importance given to small flows (those smaller than δ), and the usage of the square of the log-ratio is justified as a way to ensure that states with large flows do not contribute to the gradients of FM much more than states with small flows.

Example 5.

Detailed-balance loss: Consider the transition probabilities parametrization (𝒪PFB,ΠPFB,PFB), and the function LDB:𝒪PFB×𝔸+ defined for each (F^,P^F,P^B)𝒪PFB and ss𝔸 as

LDB(F^,P^F,P^B,ss)={(log(δ+F^(s)P^F(ss)δ+F^(s)P^B(ss)))2 if ssf,(log(δ+F^(s)P^F(ss)δ+R(s)))2otherwise,

where δ0 is a hyper-parameter. The function DB mapping each (F^,P^F,P^B)𝒪PFB to

DB(F^,P^,P^B))=ss𝔸LDB(F^,P^,P^B,ss)

is a flow-matching loss that is (by definition) edge-decomposable. The proof of this statement is similar to the one of the example above, using Prop. 21.

According to Sec. 2.6, the reward function does not completely specify the flow. Thus, the detailed-balance loss of Ex. 5 can be used with the (𝒪PF,ΠPF,PF) parametrization, using any function P^B𝒪3 as input to the detailed-balance loss.

Example 6.

Trajectory-balance loss: This loss has been introduced in Malkin et al. (2022) for the parametrization (𝒪TB,ΠTB,TB), where 𝒪TB=𝒪1×𝒪2×𝒪3, with 𝒪1=+ parametrizes the partition function Z^, and 𝒪2 and 𝒪3 introduced in Ex. 2 and Ex. 3 (the set of forward and backward probabilities consistent with G). TB maps a Markovian flow in Markov(G,R) to the corresponding triplet (Z,PF,PB), and ΠTB maps a parametrization (Z^,P^F,P^B) to a probability over trajectories defined by P^F as in Ex. 2. Prop. 18 justifies the validity of this parametrization. The loss TB maps each (Z^,P^F,P^B)𝒪TB to:

TB(Z^,P^F,P^B)=τ𝒯LTB(Z^,P^F,P^B,τ),

where

τ=(s0,,sn+1=sf)𝒯LTB(Z^,P^F,P^B,τ)=(logZ^t=1n+1P^F(stst1)R(sn)t=1nP^B(st1st))2. (41)

Malkin et al. (2022) prove that TB is a flow-matching loss and call it trajectory balance. It is trajectory-decomposable by definition.

Training by stochastic gradient descent:

In the examples of the previous section, given a GFlowNet (G,R,𝒪,Π,) and a flow-matching loss , objects o𝒪 are themselves functions or combinations of functions, and we can thus parametrize 𝒪 with function approximators such as Neural Networks. However, most of the times, the evaluation (let alone the minimization) of (o) is intractable, given that even with a full support distribution, only a subset of edges (or states or trajectories) can be visited in finite time. In practice, with an edge-decomposable loss e.g., we resort to a stochastic gradient, such as

oL(o,ss),ssπo (42)

for edge-decomposable losses, or

oL(o,τ),τπo (43)

for trajectory-decomposable losses, where πo, called the training distribution, is a distribution over edges or trajectories that can be associated with Π(o), corresponding to the online setting in RL, or defined in other ways, corresponding to the behavior policy in offline RL, see Sec. 3.3.3 below.

3.3 Extensions

In this section, we discuss possible relaxations to the GFlowNet training paradigm introduced thus far.

3.3.1 Introducing Time Stamps to Allow Cycles

Note that the state-space of a GFlowNet can easily be modified to accommodate an underlying state space for which the transitions do not form a DAG, e.g., to allow cycles. Let 𝒮 be such an underlying state-space. Define the augmented state space 𝒮=𝒮×, where ={0,1,2,} is the set of natural numbers, and st=(st,t) is the augmented state, where t is the position of the state st in the trajectory. With this augmented state space, we automatically avoid cycles. Furthermore, we may design or train the backwards transition probabilities PB(stst+1=(st+1,t+1)) to create a preference for shorter paths towards st+1, as discussed in Sec. 2.6. Note that we can further generalize this setup by replacing with any totally ordered indexing set; the augmented state space will still have an associated DAG. The ordering “<” in the original state-space is lifted to the augmented state-space: (st,t)<(st,t) if and only if t<t and st<st.

3.3.2 Stochastic Rewards

We also consider the setting in which the given reward is stochastic rather than being a deterministic function of the state, yielding training procedures based on stochastic gradient descent. For example, with the trajectory balance loss of Eq. 41, if R(s) is stochastic (even when given s), we can think of what is being really optimized is the squared loss with logR(s) replaced by its expectation (given s). This is a straightforward consequence of minimizing the expected value of a squared error loss (as for example in neural networks trained with a squared error loss and a stochastic target output, where the neural network effectively tries to estimate the expected value of that target).

3.3.3 GFlowNets can be trained offline

As discussed in Sec. 3.2, we do not need to train a GFlowNet using samples from its own trajectory distribution P^=Π(o). Those training trajectories can be drawn from any training distribution πT with full support, as already shown by Bengio et al. (2021). It means that a GFlowNet can be trained offline, as in offline reinforcement learning (Ernst et al., 2005; Riedmiller, 2005; Lange et al., 2012).

It should also be noted that with a proper adaptive choice of πT, and assuming that computing R is cheaper or comparable in cost to running the GFlowNet on a trajectory, it should be more efficient to continuously draw new training samples from πT than to rehearse the same trajectories multiple times. An exception would be rehearsing the trajectories leading to high rewards if these are rare.

How should one choose the training distribution πT? It needs to cover the support of R but if it were uniform it would be very wasteful and if it were equal to the current GFlowNet policy π it might not have sufficient effective support and thus miss modes of R, i.e., regions where R(x) is substantially greater than 0 but R(x)PT(x). Hence the training distribution should be sampled from an exploratory policy that visits places that have not been visited yet and may have a high reward. High epistemic uncertainty around the current policy would make sense and the literature on acquisition functions for Bayesian optimization (Srinivas et al., 2010) may be a good guide. More generally, this means the training distribution should be adaptive. For example, πT could be the policy of a second GFlowNet trained mostly to match a different reward function that is high when the losses observed by the main GFlowNet are large. It would also be good to regularly visit those trajectories corresponding to known large R, i.e., according to samples from π, to make sure those are not forgotten, even temporarily.

3.4 Exploiting Data as Known Terminating States

In some applications we may have access to a dataset of (s,R(s)) pairs and we would like to use them in a purely offline way to train a GFlowNet, or we may want to combine such data with queries of the reward function R to train the GFlowNet. For example, the dataset may contain examples of some of the high-reward terminating states s which would be difficult to obtain by sampling from a randomly initialized GFlowNet. How can we compute a gradient update for the GFlowNet parameters using such (s,R(s)) pairs?

If we choose to parametrize the backwards transition probabilities PB (which is necessary for implementing the detailed balance loss), then we can just sample a trajectory τ leading to s using PB and use these trajectories to update the flows and forward transition probabilities along the traversed transitions. However, this alone is not guaranteed to produce the correct GFlowNet sampling distribution because the empirical distribution over training trajectories τ defined as above does not have full support. Suppose for example that the dataset only contains high-reward terminating states with R(s)=1. The GFlowNet could then just sample trajectories uniformly (which would be wrong, we would like the probability of most states not in the training set to be very small). On the other hand, if we combine the distribution of trajectories leading to terminal transitions in the dataset with a training distribution whose support covers all possible trajectories, then the offline property of GFlowNet guarantees that we can recover a flow-matching model.

4 Conditional Flows and Free energies

A remarkable property of flow networks is that we can recover the normalizing constant Z from the initial state flow F(s0) (Prop. 10). Z also gives us the partition function associated with a given terminal reward function R specifying the terminating flows.

What about internal states s with s0<s<sf? If we had something like a normalizing constant for only the terminating flows achievable from s, we would be able to obtain a form of marginalization given state s, i.e., a conditional probability for terminating states ss, given s. Naturally, one could ask: does the flow F(s) through state s give us that kind of marginalization over only the downstream terminating flows? Unfortunately in general, the answer to this question is no, as illustrated in Fig. 5: in this example F(s2)=4, whereas the sum of terminating flows achievable from s2 is 6 (the terminating states reachable from s2 are {s5,s6,s7}). The discrepancy is caused by the flow through (s0,s1,s5) that contributes to the terminating flow F(s5sf), but not to F(s2) since there is no order relation between s1 and s2.

Refer to caption
(a)
Refer to caption
(b)
Refer to caption
(c)
Figure 5: Example of a state-conditional flow network. (a) The original (Markovian) flow network. (b) The subgraph of states reachable from s2; there is a flow through (s0,s1,s5) that contributed to F(s5sf), but not to F(s2), showing that F(s2) does not marginalize the rewards of its descendant. (c) State-conditional flow network Fs2, which differs from the original flow F on the subgraph, but satisfies the desired marginalization property.

In Sec. 5.3, we show how GFlowNets applied to sampling sets of random variables can be used to estimate the marginal probability for the values given to a subset of the variables. It requires computing the kind of intractable sum discussed above (over the rewards associated with all the descendants of a state s, with s corresponding to such a subset of variables and a descendant to a full specification of all the variables). That motivates the following definition:

Definition 27.

Given a pointed DAG G=(𝒮,𝔸), the corresponding partial order denoted by , and a function :𝒮, called the energy function, we define the free energy (s) of a state s as:

e(s):=s:sse(s). (44)

Free energies are generic formulations for the marginalization operation (i.e. summing over a large number of terms) associated with energy functions, and we find their estimation to open the door to interesting applications where expensive MCMC methods would typically be the main approach otherwise.

4.1 Conditional flow networks

In Sec. 2.2, we defined a flow network as a DAG, augmented with some function F over the set of complete trajectories 𝒯. We can extend this notion of flow networks by conditioning each component on some information x. In general, this conditioning variable can represent any conditioning information, either external to the flow network (but influencing the terminating flows), or internal (e.g., x can be a property of complete trajectories over another flow network, like passing through a particular state).

Definition 28.

Let 𝒳 be a set of conditioning variables. We consider a family of DAGs Gx=(𝒮x,𝒜x) indexed by x𝒳, along with a family of initial and terminal states denoted by (s0x)𝒮x and (sfx)𝒮x respectively. For each DAG Gx, we denote by 𝒯x the set of complete trajectories in Gx, and we denote by 𝒯 their union:

𝒯=x𝒳𝒯x.

A conditional flow network is the specification of 𝒳, the family {Gx,x𝒳}, along with a conditional flow function F, i.e. a function F:𝒳×𝒯+ such that F(x,τ)=0 if τ𝒯x. For clarity, we will denote, for each x𝒳, by Fx the function mapping each τ𝒯x to F(x,τ). Similar to Sec. 2.2, Fx induces a measure of the σ-algebra 2𝒯x for each x.

Conditional flow networks effectively represent a family of flow networks, indexed by the value of x. Since conditional flow networks are defined using the same components as an unconditional flow network, they inherit from all the properties of flow networks for all DAGs Gx and flow functions Fx. In particular, we can directly extend the notion of probability distribution over flows, state and edge flows, forward and backward transition probabilities (Sec. 2.3), of Markovian flows (Sec. 2.4), and any flow matching condition (Sec. 2.5) to conditional flows; the only difference is that now every term explicitly depends of the conditioning variable x.

In Sec. 4.2 and Sec. 4.3, we will elaborate two important examples of conditional flow networks: flow networks conditioned on external information that changes the reward R(sx), and state-conditional flow networks that depend on internal information, i.e., previously visited states.

4.2 Reward-conditional flow networks

Definition 29.

Let 𝒳 be a set of conditioning variables. Consider a flow network given by a pointed DAG G=(𝒮,𝔸) and a flow function F. Consider a family of non-negative functions of 𝒮: {Rx:𝒮f+,x𝒳}. A reward-conditional flow network compatible with the family is a conditional flow network (Def. 28), with Gx=G for every x𝒳, such that the edge-flow functions induced by the conditional flow function F satisfy:

x𝒳s𝒮fFx(ssf)=Rx(s).

We will use the notations Rx(s) and R(sx) interchangeably.

Note that the definition above implies that all the DAGs of a reward-conditional flow network are identical, and only the terminating flows differ amongst the members of the family.

Example 7.

We will see in Sec. 4.4 that we can estimate a conditional flow network using a GFlowNet (Sec. 3), given a reward function R(sx). In an Energy-Based Model, the model Pθ(s) is associated with a given energy function θ(s), parametrized by θ, with

Pθ(s)=exp(θ(s))Z(θ).

This model can be parametrized using a reward-conditional flow network, conditioned on θ with the reward function R(sθ)=exp(θ(s)). We show in Sec. 4.5 how to use such a conditional flow-network to learn an Energy-Based Model.

4.3 State-conditional flow networks

Definition 30.

Consider a flow network given by a DAG G=(𝒮,𝔸) and a flow function F. For each state s𝒮, let Gs be the subgraph of G containing all the states s such that ss. A state-conditional flow network is given by the family {Gs,s𝒮}, along with a conditional flow function F:𝒮×𝒯+, where 𝒯=s𝒮𝒯s, and 𝒯s the set of complete trajectories in 𝒢s, that satisfies:

Fs(ssf)=F(ssf). (45)

Note that in the definition above, we abused the notation F to refer to both flow functions and edge flow functions, but also used Fs to refer to the conditional flow function (or the corresponding edge flow function) τF(s,τ). Unlike the reward-conditional flow networks defined in Sec. 4.2, the structure of the DAG in a state-conditional flow network depends on the anchor state s. In particular, this means that the initial state (s0s)=s changes, but the final state (sfs)=sf remains unchanged, for any state s.

Since the definition of a state-conditional flow network depends on an original flow network, we must ensure that this definition is indeed correct, i.e. that such a state-conditional flow network that satisfies the conditions in Eq. 45 exists.

Proposition 31.

For any flow network given by a DAG G=(𝒮,𝔸) and a flow F, we can define a state-conditional flow network as per Def. 30.

Proof Let s𝒮 be a state. Since the structure of the DAG Gs is clearly well-defined, we just need to show that there exists a flow function Fs:𝒯s+ that satisfies Eq. 45. If such a function exists for every s𝒮, then it would suffice to define the conditional flow function F:𝒮×𝒯+ as:

F(s,τ)={Fs(τ)if τ𝒯s0otherwise.

Let Ass be the set of complete trajectories in 𝒯s terminating in ss; the condition in Eq. 45 then reads:

Fs(ssf)=Fs(Ass)=τAssFs(τ)=F(ssf). (46)

Note that in Eq. 46, F(ssf) is a given quantity because the flow F is known. Since the sets of trajectories {Ass,,ss} form a partition of all the complete trajectories 𝒯s, Eq. 46 is a system of linear equations, whose unknowns are Fs(τ) for all τ𝒯s, where each equation involves separate sets of unknowns. Therefore there exists at least a solution Fs(τ) of this system.

We can construct such a solution in the following way. For some τ𝒯s, we can first start by selecting the complete trajectories τ¯𝒯 that contain τ:

Cτ={τ¯𝒯:ττ¯}.

The key difference between the DAG G and the subgraph Gs though is that G may contain trajectories that terminate in some ss but do not pass through s, and those are therefore not covered by the trajectories of Gs. Let Uss be the set of complete trajectories of G defined as

Uss={τ¯𝒯:s′′>s,s′′τ¯,sτ¯,sτ¯}.

For all τ𝒯s such that τ terminates in some ss, we can therefore construct the flow Fs(τ) as

Fs(τ):=F(Cτ)+1nF(Uss),

where n=|Ass| is the number of trajectories τ𝒯s that terminate in s. It is easy to verify that Fs(τ) is a solution of Eq. 46.

While we saw in Prop. 10 that the initial flow F(s0) was equal to the partition function, the initial state-conditional flow also benefit from a marginalizing property, and is now related to the free energy at s.

Proposition 32.

Given a state-conditional flow network (G,F) as in Def. 30, for any state s, the initial flow of the state-conditional flow network corresponds to marginalizing the terminating flows F(ssf) for ss:

Fs(s0s)=Fs(s)=s:ssF(ssf)=exp((s)),

where (s) is the free energy associated to the energy function (s)=logF(ssf).

Proof This is a direct consequence of Prop. 10, applied to the state-conditional flow function s, along with Def. 27.

Fs(s0s) =τ𝒯sFs(τ)=s:ssFs(ssf)
=s:ssF(ssf)=s:sse(s)


Note that the definition of state-conditional flow networks is consistent with our original definition of (unconditional) flow networks in Section 2.1, in the sense that the original flow network is a valid state-conditional flow networks anchored at the initial state.

Another quantity of interest that state-conditional flow networks allow us to evaluate, is the probability of terminating a trajectory in a state s if all terminating edge flows were diverted towards an earlier state s<s:

Corollary 33.

Consider a flow network given by a DAG G=(𝒮,𝔸) and a flow F, from which we define any state-conditional flow network, as per Def. 30. Given a state s, the flow function Fs induces a probability distribution over {s′′𝒮f:s′′s}𝒮f, that we denote by PT(.s).

Under this measure, the probability of terminating a trajectory in Gs in a state s (i.e. the last edge of the trajectory is ssf) is:

PT(ss)=𝟏sse(s)+(s), (47)

where is the energy function mapping each state s that is parent of sf to logF(ss), and is the corresponding free energy function.

Proof Because Fs is a flow function, Def. 11 and Prop. 10 tell us that:

PT(ss)={Fs(ssf)Fs(s)if ss0otherwise

Combining this with Prop. 32, and Eq. 45, we obtain for ss:

PT(ss) =𝟏ssF(ssf)e(s)
=𝟏sse(s)e(s)


4.4 Conditional GFlowNets

Similar to the way we used a GFlowNet to estimate the flow of a flow network, we can also use a (conditional) GFlowNet in order to estimate a conditional flow network, with given target reward functions. A conditional GFlowNet follows the construction presented in Section 3, with the exception that all quantities to be learned now depend on the conditioning variable x𝒳 (e.g., x is an additional input of the neural network).

All parametrizations and losses presented in Sec. 3.2 could in principle be used to train a conditional GFlowNet, regardless of the conditioning set. Below we discuss yet another loss, first presented in Deleu et al. (2022), that could be used to train both GFlowNets and conditional GFlowNets.

Example 8.

Given a family of DAGs Gx and reward functions Rx indexed by x𝒳, where each state sGx is terminating (i.e. is a parent of sf), and following Exs. 2 and 3, we consider a parametrization given by the forward and backward transition probabilities 𝒪Px=𝒪2x×𝒪3x, where 𝒪2x (resp. 𝒪3x) is the set of forward (resp. backward) probability functions P^F (resp. P^B) consistent with Gx for every x𝒳, and (ΠPx,Px) defined as in Exs. 3 and 3. Each (𝒪Px,ΠPx,Px) is a flow parametrization of (Gx,Rx), which can be trained with an edge-decomposable flow-matching loss, as proved in Deleu et al. (2022), and defined for every ss𝔸f:

DB(P^F,P^B,ss,x)=(logRx(s)PB(ss,x)PF(sfs,x)Rx(s)PF(ss,x)PF(sfs,x))2

4.5 Training Energy-Based Models with a GFlowNet

A GFlowNet can be trained to convert an energy function into an approximate corresponding sampler. Thus, it can be used as an alternative to MCMC sampling (Sec. 3.1). Consider the model Pθ(s) associated with a given parametrized energy function θ(s) with parameters θ: Pθ(s)=eθ(s)Z. Sampling from Pθ(s) could be approximated by sampling from the terminating probability distribution PT(s) of a GFlowNet trained with target terminal reward R(s)=eθ(s) (see Eq. 34). In practice, P^T would be an estimator for the true Pθ because the GFlowNet training objective is not zeroed (insufficient capacity or finite training time). The GFlowNet samples drawn according to P^T could then be used to obtain a stochastic gradient estimator for the negative log-likelihood of observed data x with respect to parameters θ of an energy function θ:

logPθ(x)θ=θ(x)θsPθ(s)θ(s)θ. (48)

An approximate stochastic estimator of the second term could thus be obtained by sampling one or more terminating states sP^T(s), i.e., from the trained GFlowNet’s sampler. Furthermore, if the GFlowNet’s loss is 0, i.e. P^T=Pθ, the gradient estimator would be unbiased.

One could thus potentially jointly train an energy function θ and a corresponding GFlowNet by alternating updates of θ using the above equation (with sampling from Pθ replaced by sampling from P^T) and updates of the GFlowNet using the updated energy function for the target terminal reward.

If we fix F^(ssf)=R(s) by construction (which we can do if the reward function is deterministic), then we can parametrize the energy function with the same neural network that computes the flow, since (s)=logR(s)=logF^(ssf). Hence the same parameters are used for the energy function and for the GFlowNet, which is appealing.

The above strategy for learning jointly an energy function and how to sample from it could be generalized to learning conditional distributions by using a conditional GFlowNet instead. Let x be an observed random variable and h be a hidden variable, with the GFlowNet generating the pair (x,h) in two sub-trajectories: either first generate x and then generate h given x, or first generate h and then generate x given h. This can be achieved by introducing a 6-valued component u in the state to make sure that both h and x are generated before exiting into sf, with the following values and constraints:

s=s0 u=0 (49)
s=sf u=5 (50)
(utut+1) {01,02,11,22,13,24,33,44,35,45} (51)

where u=1 indicates that x is being generated (before h), u=2 indicates that h is being generated (before x), u=3 indicates that h is being generated (conditioned on x), and u=4 indicates that x is being generated (given h). The GFlowNet cannot reach the final state sf until both x and h have been generated. The conditional GFlowNet can thus approximately sample PT(x), PT(hx), PT(h), PT(xh) as well as PT(x,h). If we only want to sample x (or only h), we allow exiting as soon as it is generated (resp. h is generated). See Sec. 5.3 for a more general discussion on how to represent, estimate and sample marginal distributions.

Let us denote by Pθ(x,h)eθ(x,h) the joint distribution over (x,h) associated with the energy function, i.e., with F(ssf). When x is observed but h is not, θ could thus be updated by approximating the marginal log-likelihood gradient

logPθ(x)θ=hPT(hx)θ(x,h)θx,hPT(x,h)θ(x,h)θ (52)

using samples from the estimated terminal sampling probabilities P^T of a trained GFlowNet to approximate in a stochastic gradient way the above sums (using one or a batch of samples).

Note how we now have outer loop updates (of the energy function, i.e., the reward function) from actual data, and an inner loop updates (of the GFlowNet) using the energy function as a driving target for the GFlowNet. How many inner loop updates are necessary for such a scheme to work is an interesting open question but most likely depends on the form of the underlying data generating distribution. If the work on GANs (Goodfellow et al., 2014) is a good analogy, a good strategy may be to interleave updates of the energy function (as minus the log-terminal flow of a GFlowNet) based on a batch of data, and updates of the GFlowNet as a sampler based on both these samples (trajectories can be sampled backwards from a terminating state s using PB) and forward samples from the tempered training policy πT defined by the forward transition probabilities of the GFlowNet.

4.6 Active Learning with a GFlowNet

An interesting variant on the above scheme is one where the GFlowNet sampler is used not just to produce negative examples for the energy function but also to actively explore the environment. Jain et al. (2022) use an active learning scheme where the GFlowNet is used to sample candidates x for which we expect the reward R(x) to be generally large (since the GFlowNet approximately samples proportionally to R(x)). The challenge is that evaluating the true reward R for any x is computationally expensive and can potentially be noisy (for example, a biological assay to measure the binding energy of a drug to a given target protein). Thus, instead of using the true reward directly, the authors introduce a proxy f^ (which approximates the true reward function f), which is used to train the GFlowNet. This would lead to a setup similar to Sec. 4.5, with an inner loop where a GFlowNet is trained to match the proxy f^, and an outer-loop where the proxy f^ is learned in a supervised fashion using (x,y) pairs, where x is proposed by the GFlowNet, and y is the corresponding true reward from the environment (for example, outcome of a biological of chemical assay). It is important to note here that the GFlowNet and the proxy are intricately linked since the coverage of proxy f^ over the domain of x relies on diverse candidates from the GFlowNet. And similarly, since the GFlowNet matches a reward distribution defined by the proxy reward function f^, it also depends on the quality of the true reward function f.

This setup can be further extended by incorporating information about how novel a given candidate is, or how much epistemic uncertainty, u(x,f), there is in the prediction of f^. We can use the acquisition function heuristics (like Upper Confidence Bound (UCB) or Expected Improvement (EI)) from Bayesian optimization (Močkus, 1975; Srinivas et al., 2010) to combine the predicted usefulness f^(x) of configuration x with an estimate of the epistemic uncertainty around that prediction. Using this as the reward can allow the GFlowNet to explore areas where the predicted usefulness is high (f^(x) is large) and at the same time explore areas where there is more information to be gathered about useful configurations of x. The uncertainty over the predictions of f^ with the appropriate acquisition function can provide more control over the exploratory behaviour of GFlowNets.

As discussed by Bengio et al. (2021) when comparing GFlowNets with return-maximizing reinforcement learning methods, an interesting property of sufficiently trained GFlowNets is that they will sample from all the modes of the reward function, which is particularly desirable in a setting where exploration is required, as in active learning. The experiments in the paper also demonstrate this advantage experimentally in terms of the diversity of the solutions sampled by the GFlowNet compared with PPO, an RL method that had previously been used for generating molecular graphs and that tends to focus on a single mode of the reward function.

4.7 Estimating Entropies, Conditional Entropies and Mutual Information

Definition 34.

Given a reward function R with 0R(s)<1 s, we define the entropic reward function R associated with R as:

R(s)=R(s)logR(s). (53)

In brief, in this section, we show that we can estimate entropies by training two GFlowNets: one that estimates flows as usual for a target terminal reward function R(s), and one that estimates flows for the corresponding entropic reward function. We show below that we obtain an estimator of entropy by looking up the flow in the initial state, and if we do this exercise with conditional flows, we get conditional entropy. Once we have the conditional entropy, we can also estimate the mutual information.

Proposition 35.

Consider a flow network (G,F) such that the terminating flows match a given reward function R, i.e. s𝒮f,F(ssf)=R(s), with R(s)<1 for all s, and a second flow network (G,F) with the same pointed DAG, but with a flow function for which the terminating flows match the entropic reward function R (Eq. 53), then the entropy H[S] associated with the terminating state random variable S𝒮f with distribution PT(S=s)=R(s)Z (Eq. 18) is

H[S]:=sPT(s)logPT(s)=F(s0)F(s0)+logF(s0). (54)

Proof First apply the definition of PT(s), then Eq. 11 on both flows:

sPT(s)logPT(s) =sR(s)F(s0)(logR(s)logF(s0))
=(sR(s)logR(s))+(logF(s0)sR(s))F(s0)
=F(s0)F(s0)+logF(s0).

Note that we need R(s)<1 to make sure that the rewards R(s) (and thus the flows) are positive.

Proposition 36.

Given a set 𝒳 of conditioning variables, consider a conditional flow network defined by a conditional flow function F, for which the terminating flows match a target reward R family (conditioned on x𝒳) that satisfies Rx(s)<1 for all s, and a second conditional flow network defined by a conditional flow function F, for which the terminating flows match the entropic reward functions Rx (Eq. 53), then the conditional entropy H[Sx] of random terminating states S𝒮f consistent with condition x is given by

H[Sx]=F(s0x)F(s0x)+logF(s0x). (55)

In particular, for a state-conditional GFlowNet (𝒳=𝒮 is the state space of the DAG), we obtain

H[Ss]=F(ss)F(ss)+logF(ss). (56)

More generally, the mutual information MI(S;X) between the random draw of a terminating state S=s according to PT(sx) and the conditioning random variable X is

MI(S;X)=H[S]EX[H[SX]]=F(s0)F(s0)+logF(s0)EX[F(s0X)F(s0X)+logF(s0X)] (57)

where F(s) and F(s) indicate the unconditional flows (trained with no condition x given) while F(sx) and F(sx) are their conditioned counterparts.

Proof The proof of Eq. 55 follows from the fact that each (Gx,Fx) is a flow network, to which we can apply Prop. 35. Eq. 57 is a direct consequence of the definition of the Mutual Information, Eq. 54 and Eq. 55.

If we have a sampling mechanism for P(X), we can thus approximate the expectation in Eq. 57 by a Monte-Carlo average with draws from P(X).

5 GFlowNets on Sets, Graphs, and to Marginalize Joint Distributions

5.1 Set GFlowNets

We first define an action space for constructing sets and we view the GFlowNet as a means to generate a random set S and to estimate quantities like probabilities, conditional probabilities or marginal probabilities for realizations of this random variable. The elements of those sets are taken from a larger “universe” set 𝒰.

Definition 37.

Given a “universe” set 𝒰, consider the pointed DAG G=(𝒮,𝔸), where 𝒮:=2𝒰{sf} is the set of all subsets of 𝒰 with an additional state sf, s0={} is the empty set, and for any two subsets s,s of 𝒰, ss𝔸a𝒰s,s=s{a}; meaning that each transition in the DAG corresponds to adding one element of 𝒰 to the current subset. Additionally all subsets are connected to sf, i.e. s𝒮,ssf𝔸. A set flow network is a flow network on this graph G, and a set GFlowNet is an estimator of such a flow network, as defined in Sec. 3. The target terminal reward function R:sF(ssf) satisfies:

Z=s2𝒰R(s)<. (58)

A set flow network defines a terminating probability distribution PT on states (see Def. 13 and Eq. 44), with

PT(s)=e(s)+(s0)=F(ssf)F(s0) (59)

where represents the energy logR. Similarly, Cor. 33 provides us with a formula for conditional probabilities of a given superset s of a given set s under PT,

PT(sss):=e(s)+(s)=F(ssf)F(ss) (60)

where indicates free energy (see Def. 27).

Remember that with a GFlowNet with state and edge flow estimator F^, it is not guaranteed that F^(s)=R(s) for all states s𝒮{sf}, so we could estimate probabilities with

P^T(s)=F^(ssf)F^(s0) (61)

or alternatively

P^T(s)=R(s)F^(s0). (62)

Similarly, we can estimate conditional superset probabilities with Eq. 60 or with

P^T(sss)=R(s)F^(ss), (63)

none of which are guaranteed to exactly sum to 1.

We can also compute the marginal probability over all supersets of a given set s, as shown below.

Proposition 38.

Following the notations of Def. 37, let 𝔖(s)={ss} be the set of all supersets of a set s. The probability of drawing any element from 𝔖(s) given a set flow network is

PT(𝔖(s))=ssPT(s)=e(s)Z=F(ss)F(s0). (64)

Proof We can rewrite the sum as follows, first applying the definition of PT (Eq. 18), and Prop. 32:

ssPT(s) =ssF(ssf)F(s0)
=F(ss)F(s0)
=e(s)Z

where we notice that for states that are sets, the order relationship ss is equivalent to the subset relationship ss.

To summarize, a GFlowNet which is trained to match a given energy function (i.e. derived rewards) over sets can be used to represent that distribution, sample from it, estimate the probability of a set under it, estimate the partition function, search for the lowest energy set, sample a conditional distribution over supersets of a given set, estimate that conditional distribution for a given pair of set and superset, compute the marginal probability of a subset (i.e., summing over the probabilities of the supersets), and compute the entropy of the set distribution or of the conditional distribution of supersets of a set. For example, using a GFlowNet to model distributions over sets has been successfully used in Malik et al. (2023), where the goal is to iteratively select an informative batch of points for efficient active learning. The authors additionally show that the distribution over sets defined by a mutual information-based reward (Sec. 4.7) can be approximated to satisfying levels using a neural network as a function approximator for the GFlowNet policy.

5.2 GFlowNet on Graphs

A graph is a special kind of set in which there are two kinds of elements: nodes and edges, with edges being pairs of node indices. Graphs may also have content attached to nodes and/or edges. The set operations described in the previous section can thus be specialized accordingly. Some actions (i.e. edges in the GFlowNet DAG) could insert a node while other actions could insert an edge. The set of allowable actions can be limited, for example to make sure the graph has a single connected component, or to ensure acyclicity. Like for sets in general, one cannot have an action which adds a node or edge which is already in the set. Deleu et al. (2022) used a GFlowNet over DAGs in order to learn an approximation of the posterior distribution over the graphical structure of a Bayesian Network. While the GFlowNet was learned by minimizing some loss derived from flow-matching conditions as in Sec. 3.2, they showed that the resulting distribution is an accurate approximation of the target posterior. Since graphs are sets, all the GFlowNet operations on sets can be applied on graphs.

5.3 Marginalizing over Missing Variables

The ability of GFlowNets to capture probability distributions over sets can be applied to modeling the joint distribution over random variables, to calculating marginal probabilities over given subsets of variable values, and to sampling or computing probabilities for any conditional (e.g., for a subset of variables given another subset of variables).

Let X=(X1,X2,,Xn) be a composite random variable with n element random variables Xi, 1in, each with possible values xi𝒳i (not necessarily numbers). If we are given an energy function or a terminal reward function R(x) to score any instance X=x, we can train a particular kind of set GFlowNet for which the set elements are pairs (i,xi) and at most one element in the set with index i, for all i. The only allowed terminating transitions are when the set has exactly size n and every size-n set s terminates on the next transition.

Note how that GFlowNet can sample an X in any possible order, if PB allows that order. Given an existing set of (i,xi) pairs (represented by a state s={(i,xi)}i), we can estimate the marginal probability of that subset of variables (implicitly summing over all the missing ones, see Eq. 64). We can sample the other variables by setting the state at s and continuing to sample from the GFlowNet’s policy (the learned forward transition probability function). We can sample from a chosen subset S of the other variables by constraining that policy to only add elements which are in S. In addition, we can do all the other things that are feasible on set GFlowNets, such as estimating the partition function, sampling in an order which prefers the early subsequences with the largest marginal probability, searching for the most probable configuration of variables, or estimating the entropy of the distribution.

5.4 Modular Energy Function Decomposition

Let us see how we can apply the graph GFlowNet framework to a special kind of graph: a factor graph (Kschischang et al., 2001) with reusable factors. This will yield a distribution PT(g) over graphs g, each of which is associated with an energy function value (g) (and associated reward R(g)). Energy-based models are convenient because they can decompose a joint probability into independent pieces (possibly corresponding to independent mechanisms, Schölkopf et al., 2012; Goyal et al., 2019; Goyal and Bengio, 2020), each corresponding to a factor of a factor graph. In our case, we would like a shared set of factors 𝔽 to be reusable across many factor graphs g. The factor graph will provide an energy and a probability over a set of random variables 𝒱. Let the graph g={(Fi,vi)}i be written as a set of pieces (Fi,vi), where Fi𝔽 is the index of a factor with energy function term Fi, selected from the pool 𝔽 of possible factors, and where vi=(v1,v2,) is a list of realizations of the random variables Vjvj, where Vj𝒱 is a node of the factor graph. That list defines the edges of the factor graph connecting variable Vj with the j-th argument of Fi. Let us denote Fi(vi) the value of this energy function term Fi applied to those values vi, i.e.,

Fi(vi)=Fi(v1,v2,). (65)

The total energy function of such a graph can then be decomposed as follows

(g)=iFi(vi). (66)

What is interesting with this construction is that the graph GFlowNet can now sample a graph g, possibly given some conditioning observations x: see Sec. 4.5 on how GFlowNets can be trained jointly with an energy function, including the case where only some random variables are observed. Hence, given some observed variables (not necessarily always the same), the graph GFlowNet can sample a latent factor graph containing and connecting (with energy function terms) both observed and latent random variables, and whose structure defines an energy function over values of the joint observed and latent variables.

Not only can we use the compositional nature of the objects generated by a GFlowNet to decompose the total energy into reusable energy terms corresponding to ideally independent mechanisms, but we can also decompose the GFlowNet itself into modules associated with each mechanism. The action space of this graph GFlowNet is fairly complex, with each action corresponding with the addition of a latent variable Vk or the addition of a graph piece (Fi,vi). Such an action is taken in the context of the state of the GFlowNet, which is a partially constructed graph (arising from the previous actions). The GFlowNet and its associated energy function parameters are thus decomposed into modules. Each module knows how to compute an energy function Fi and how to score and sample competitively (against the other modules) a new graph piece (to insert the corresponding factor in the graph).

Consider some observed variables (a subset of the Vj’s with their values vj), collectively denoted x. Consider a graph g among those compatible with x (i.e. with some nodes corresponding to Vj=vj for the observed variables) and denote h the specification of the part of g not already provided by x. We can think about latent variable h as the explanation for the observed x. Note how marginalizing over all the possible h, we can compute the free energy of x. The principles of Sec. 4.5 can be applied to train such an energy-based GFlowNet. It also makes sense to represent a prior over graph structures in the energy function. For example, we may prefer sparse factors (with few arguments), and we may introduce soft or hard constraints having to do with a notion of type that is commonly used in computer programming and in natural language. Each random variable in the graph can have as one of its attributes a type, and each factor energy function argument can expect a type. Energy function terms can be added to construct this prior by favouring graph pieces (Fi,vi=(vk)k) in which the type of variable Vk (of which vk is a realization) matches the type expected of the k-th argument of Fi. This is very similar to attention mechanisms (Bahdanau et al., 2014; Vaswani et al., 2017), which can be seen to match a query (an expected type) with a key (the type associated with an element). For instance, Pan et al. (2023) propose the Forward-Looking GFlowNet framework that exploits the modularity of the energy function, as in Eq. 66 and obtain a better approximation of the target distribution than those resulting from regular losses.

6 Continuous or Hybrid Actions and States

All of the mathematical developments above have used sums over states or actions, with the idea that these would be elements of a discrete space. However, for the most part one can replace these sums by integrals in case the states or actions are either continuous or hybrid (with some discrete components and some continuous components). Beyond this, we discuss below what the presence of continuous-valued actions and states changes to the GFlowNet framework.

Although there are explicit sums respectively over successors and predecessors which come up in Eq. 40, such sums are also hiding in the detailed balance constraint of Eq. 26. Indeed, these sums are implicit as part of the normalizing constant in the conditional density of the next state or previous state in PF(st+1st) and PB(stst+1). We consider below ideas to deal with this challenge.

6.1 Integrable Normalization Constants

We first note that if we can handle a continuous state, we can also handle a hybrid state, as follows. Let the state be decomposed into

s=(si,sx) (67)

where si is discrete and sx is continuous. Then we can decompose any of the transition conditionals as follows:

PF(st+1st)=P(st+1xst+1i,st)P(st+1ist). (68)

We note that this is formally equivalent to decomposing the transition into two transitions, first to perform the discrete choice into the next state, and second to perform the continuous choice into the next state (given the discrete choice). Having continuous-valued inputs to a neural net is no problem. The challenge is to represent continuous densities on the output, with the need to both being able to compute the density of a particular value (say P(st+1xst+1i,st)) and to be able to sample from it. Computing categorical probabilities and sampling from a conditional categorical is standard fare, so we only discuss the continuous conditional. One possibility is to parametrize st+1xst+1i,st with a density for which the normalization constant is a known tractable integral, like the Gaussian. However, that may limit capacity too much, and may prevent a good minimization of the detailed balance or flow-matching loss. One workaround is to augment the discrete part of the state si with extra dimensions corresponding to “cluster IDs”, i.e., partition the continuous density into a mixture. We know that with enough mixture components, we can arbitrarily well approximate densities from a very large family. Other approaches include modeling the conditional density with an autogressive or normalizing flow model (Rezende and Mohamed, 2015, with a different meaning of the word flow), or, like in denoising diffusion models (Sohl-Dickstein et al., 2015), decomposing sampling of sx into several resampling steps, transforming its ditribution from a simple one to complex one.

To guarantee that the detailed balance constraint can be exactly satisfied, we could go further and think about parametrizing the edge flow F((si,sx)(si,sx)), and note that this is the natural parametrization if we use the node-based flow-matching loss. For example, keeping with the Gaussian example, we would now have a joint Gaussian energy in the vector (sx,sx) for each feasible discrete component indexed by (si,si). Note that in practical applications of GFlowNets, not all the transitions that satisfy the order relationship are generally allowed in the GFlowNet’s underlying DAG. For example, with set GFlowNets, the only allowed actions add one element to the set (not an arbitrary number of elements). These constraints on the action space mean that the number of legal (si,si) pairs is manageable and correspond to the number of discrete actions. The overall action is therefore seen as having a discrete part (choosing si given si) and a continuous part (choosing sx given si, si and sx). With such a joint flow formulation, the forward and backward conditional densities can be computed exactly and be compatible with each other.

6.2 GFlowNets in GFlowNets

Another way to implement an edge flow involving continuous variables is to use a lower-level GFlowNet, energy function pair to represent its flow, conditional probabilities and sample from them. Remember that such a pair can be trained following the approach discussed in Sec. 4.5. Instead of a joint Gaussian for (sx,sx) given (si,si) we could have a smaller-scale GFlowNet and energy function (representing an edge flow in the outer GFlowNet) to handle a whole family of transitions of a particular type in a larger-scale outer GFlowNet. Imagine that we have a fairly arbitrary energy function for such a transition, with parameters that we will learn. Then we can also train a GFlowNet to sample in either direction (either from st to st+1 or from st+1 to st) and to evaluate the corresponding normalizing constants (and hence, the corresponding conditional probabilities). The discrete aspect of the state and of its transition may correspond to a family of transitions (e.g., insert a particular type of node in a graph GFlowNet), and a separate GFlowNet, energy function module may be specialized and trained to handle such transitions.

While this paper was under review, Lahlou et al. (2023) extended the theory of GFlowNets to more general state spaces, including continuous ones, and experimentally validated that the usual losses can be used to effectively perform inference in continuous domains.

7 Related Work

There are several classes of related literature that concern the problem of generating a diversity of samples, given some energy or reward signal, in particular:

  • generative models (in particular deep learning ones),

  • RL methods that maximize reward with some form of exploratory behavior or smoothness prior,

  • MCMC methods that solve the problem of sampling from p(x)f(x) in principle,

  • evolutionary methods, that can leverage group diversity through iterations over a population of solutions.

In what follows we discuss these and offer insights into similarities and differences between GFlowNets and these approaches. Note that the literature related to this problem is much larger than we can reference here, and extends to many other subfields of ML, such as GANs (Kumar et al., 2019), VAEs (Kingma and Welling, 2013; Kusner et al., 2017), and normalizing flows (Dinh et al., 2014, 2016; Rezende and Mohamed, 2015). Yet another related type of approach are the Bayesian optimization methods (Močkus, 1975; Srinivas et al., 2010), which have also been used for searching in the space of molecules (Griffiths and Hernández-Lobato, 2017). The main relation with Bayesian optimization methods is that GFlowNets are generative and can thus complement Bayesian optimization methods which scan a tractable list of candidates. When the search space is too large to be able to separately compute a Bayesian optimization acquisition function score on every candidate, using a generative model is appealing. In addition, GFlowNets are used to explore the modes of the distribution rather than to search for the single most dominant mode. This difference is similar to that with classical RL methods, discussed further below.

7.1 Contrast with Generative Models

The main difference between GFlowNets and established deep generative models like VAEs or GANs is that whereas the latter are trained by being provided a finite set of examples sampled from the distribution of interest, a GFlowNet is normally trained by being provided an energy function or a reward function.

This reward function tells us not just about the samples that are likely under the distribution of interest (which we can think of as positive examples) but also about those that are unlikely (which we can think of as negative examples) and also about those in-between (whose reward is not large but is not zero either). If we think of the maximum likelihood training objective in those terms, it is like a reward function that gives a high reward to every training example (seen as a positive example, where the probability should be high) and a zero reward everywhere else. However, other reward functions are possible, as seen in the application of GFlowNets to the discovery of new molecules (Bengio et al., 2021), where the reward is not binary and increases monotonically as a function of the value of a desirable property of the candidate molecule.

Note however that the difference with other generative modeling approaches blurs when we include the learning of the energy function along with the learning of the GFlowNet sampler, as outlined in Sec. 4.5. In that case, the pair comprising the trainable GFlowNet sampler and the trainable energy function achieves a similar objective as a trainable generative model. Note that GFlowNets have been designed for generating discrete variable-size compositional structures (like sets or graphs), for both latent and observed variables, whereas GANs, VAEs or normalizing flows start from the point of view of modeling real-valued fixed-size vectors using real-valued fixed-size latent variables.

An interesting difference between GFlowNets and most generative model training frameworks (typically some variation on maximum likelihood) is in the very nature of the training objective for GFlowNets, which came about in the context of active learning scenarios. Whereas the GFlowNet training pairs (s,R(s)) can come from any distribution over s (any full-support training policy πT), which does not have to be stationary (and indeed will generally not be, in an active learning setting), the maximum likelihood framework is very sensitive to changes in the distribution of the data it sees. This is connected to the “offline learning” property of the flow matching objective (Sec. 3.3.3, among others).

7.2 Contrast with Regularized Reinforcement Learning

The flow-matching loss of GFlowNets (Bengio et al., 2021) arose from the inspiration of the temporal-difference training (Sutton and Barto, 2018) objectives associated with the Bellman equation. The flow-matching equations are analogous to the Bellman equation in the sense that the training objective is local (in time and states), credit assignment propagates through a bootstrap process and tries to fix the parametrization so that these equations are satisfied, knowing that if they were (everywhere), we would obtain the desired properties. However, these desired properties are different, as elaborated in the next paragraph. The context in which GFlowNets were developed is also different from the typical way of thinking about agents learning in some environment: we can think of the deterministic environments of GFlowNets as involving internal actions typically needed by a cognitive agent that needs to perform some kind of inference through a sequence of steps (predict or sample some things given other things), i.e., through actions internal to the agent and controlling its computation. This is in contrast with the origins of RL, focused on the actions of an agent in an external and unknown stochastic environment. GFlowNets were introduced as a tool for learning an internal policy, similar to the use of attention in modern deep learning, where we know the effect of actions, and the composition of these actions defines an inference machinery for that agent.

Classical RL (Sutton and Barto, 2018) control methods work by maximizing return in Markov Decision Processes (MDPs); their focus is on finding the policy πargmaxπVπ(s)s maximizing the expected return Vπ(s), which happens to provably be achieved with a deterministic policy (Sutton and Barto, 2018), even in stochastic MDPs. In a deterministic MDP, of interest here, this means that training an RL agent is a search for the most rewarding trajectory, or in the case of terminal-reward-only MDPs (again of interest here), the most rewarding terminating state.

Another perspective, that emerged out of both the probabilistic inference literature (Toussaint and Storkey, 2006) and the bandits literature (Auer et al., 2002), is concerned with finding policies of the form π(a|s)f(s,a). It turns out that maximizing both return and entropy (Ziebart et al., 2008) of policies in a control setting yield policies such that

p(τ)=[p(s0)t=0T1P(st+1|st,at)]exp(ηt=0T1R(st,at)) (69)

where τ=(s0,a0,s1,a1,,sT) and η can be seen as a temperature parameter. This result can also be found under the control-as-inference framework (Haarnoja et al., 2017; Levine, 2018). In deterministic MDPs with terminal rewards and no discounting of future rewards, this simplifies to p(τ)exp(ηρ(τ)), where ρ is the return.

In recent literature, this entropy maximization (MaxEnt) is often interpreted as a regularization scheme (Nachum et al., 2017), entropy being used either as an intrinsic reward signal or as an explicit regularization objective to be maximized. Another way to understand this scheme is to imagine ourselves in an adversarial bandit setting (Auer et al., 2002) where each arm corresponds to a unique trajectory, drawn with probability exp(ρ(τ)).

An important distinction to make between MaxEnt RL and GFlowNets is that, in the general case they do not find the same result. A GFlowNet learns a policy such that PT(s)R(s), whereas MaxEnt RL (with appropriately chosen temperature and R) learns a policy such that PT(s)n(s)R(s), where n(s) is the number of paths in the DAG of all trajectories that lead to s (a proof is provided in Bengio et al., 2021). An equivalence only exists if the DAG minus sf is a tree rooted at s0, which has been found to be useful (Buesing et al., 2019). What this overweighting by a factor n(s) means practically is that states corresponding to longer sequences (which typically will have exponentially more paths to them) will tend to be sampled much more often (typically exponentially more often) than states corresponding to shorter sequences. Clearly, this breaks the objective of sampling terminating states in proportion to their reward and provides a strong motivation for considering GFlowNets instead.

Another perspective on maximizing entropy in RL is that one can also maximize entropy on the states’ stationary distribution dπ (Ziebart et al., 2008), rather than the policy. In fact, one can show that the objective of training a policy such that PT(s)R(s) is equivalent to training a policy that maximizes r(s,a)=logR(s,a)logdπ(s,a). Unfortunately, computing stationary distributions, although possible (Nachum et al., 2019; Wen et al., 2020), is not always tractable nor precise enough for purposes of reward regularization.

7.3 Contrast with Monte-Carlo Markov Chain methods

MCMC has a long and rich history (Metropolis et al., 1953; Hastings, 1970; Andrieu et al., 2003), and is particularly relevant to the present work, since it is also a principled class of methods towards sampling from PT(s)R(s). MCMC-based methods have already found some amount of success with learned deep neural networks used to drive sampling (Grathwohl et al., 2021; Dai et al., 2020; Xie et al., 2021; Nash and Durkan, 2019; Seff et al., 2019).

An important drawback of MCMC is its reliance on iterative sampling (forming the Markov chain, one configuration at a time, each of which is like a terminating state of a GFlowNet): a new state configuration is obtained at each step of the chain by making a small stochastic change to the configuration in the previous step. Although these methods guarantee that asymptotically (in the length of the chain) we obtain samples drawn from the correct distribution, there is an important set of distributions for which finite chains are unlikely to provide enough diversity of the modes of the distribution.

This is known as the mode-mixing problem (Jasra et al., 2005; Bengio et al., 2013; Pompe et al., 2020): the chances of going from one mode to a neighboring one may become exponentially small (and thus require exponentially long chains) if the modes are separated by a long sequence of low-probability configurations. This can be alleviated by burning more computation (sampling longer chains) but becomes exponentially unsustainable with increased mode separation. The issue can also be reduced by introducing random sampling (e.g., drawing multiple chains) and simulated annealing (Andrieu et al., 2003) to facilitate jumping between modes. However, this becomes less effective in high dimensions and when the modes occupy a tiny volume (which can become an exponentially small fraction of the total space as its dimension increases) since random sampling is unlikely to land in the neighborhood of a mode.

In contrast, GFlowNets belong to the family of amortized sampling methods (which includes VAEs, Kingma and Welling, 2013), where we train a machine learning system to produce samples: we have exchanged the complexity of sampling through long chains for the complexity of training the sampler. The potential advantage of such amortized samplers is when the distribution of interest has generalizable structure: when it is possible to guess reasonably well where high-probability samples can be found, based on the knowledge of a set of known high-probability samples (the training set). This is what makes deep generative models work in the first place and thus suggests that in such high-dimensional settings where modes occupy tiny volumes (as per the manifold hypothesis, Cayton, 2005; Narayanan and Mitter, 2010; Rifai et al., 2011), one can capitalize on the already observed (x,R(x)) pairs (where x is an already visited configuration and R(x) its reward) to “jump” from known modes to yet unvisited ones, even if these are far from the ones already visited.

How well this will work then depends on the ability to generalize of the learner, i.e., on the strength and appropriateness of its inductive biases, as usual in machine learning. In the case where there is no structure at all (and thus no possibility to generalize when learning about the distribution), there is no reason to expect that amortized ML methods will fare better than MCMC. But if there is structure, then the exponential cost of mixing between modes could go away. There is plenty of evidence that ML methods can do a good job in such high-dimensional spaces (like the space of natural images) and this suggests that GFlowNets and other amortized sampling methods would be worth considering where ML generally works well. Molecular graph generation experiments (Bengio et al., 2021) comparing GFlowNets and MCMC methods appear to confirm this.

Another factor to consider (independent of the mode mixing issue) is the amortization of the computational costs: GFlowNets pay a large price upfront to train the network and then a small price (sampling once from PT) to generate each new sample. Instead, MCMC has no upfront cost but pays a lot for each independent sample. Hence, if we want to only sample once, MCMC may be more efficient, whereas if we want to generate a lot of samples, amortized methods may be advantageous. One can imagine settings where GFlowNets and MCMC could be combined to achieve some of the advantages of both approaches.

Evolutionary Methods

Evolutionary methods work similarly to MCMC methods, via an iterative process of stochastic local search, and populations of candidates are found that maximize one or many objectives (Brown et al., 2004; Salimans et al., 2017; Jensen, 2019; Swersky et al., 2020). From such a perspective, they have similar advantages and disadvantages. One practical advantage of these methods is that natural diversity is easily obtainable via group metrics and subpopulation selection (Mouret and Doncieux, 2012). This is not something that is explicitly tackled by GFlowNet, which instead relies on i.i.d. sampling and giving non-zero probability to suboptimal samples as a diversity mechanism.

Sequential Monte-Carlo

Sequential Monte Carlo (SMC, Naesseth et al., 2019; Arulampalam et al., 2002) methods are a class of methods aimed at solving inference problems. Similar to GFlowNets, SMC samplers are trained to sample from a distribution given by its unnormalized probability density, and require forward and backward kernels, as in GFlowNets. Unlike GFlowNets though, they require specifying intermediate targets γt(zt). In addition, with GFlowNets, the reward normally only comes at the end of the sequence, unlike the per-time-step likelihoods used to reweight particles in SMC. GFlowNets can be applied in settings which do not fit the typical particle filter setting, such as those whose intermediate states do not correspond to valid elements of the sample space. The trajectories do not necessarily represent a sequence of latent variables associated with corresponding observations, as in filtering tasks. The GFlowNet trajectory distributions are only defined by the terminal state reward function.

8 Conclusions and Open Questions

This paper extends and deepens the mathematical framework and mathematical properties of GFlowNets (Bengio et al., 2021). It connects the notion of flow in GFlowNets with that of measure over trajectories and introduces a novel training objective (the detailed balance loss) which makes it possible to choose a parametrization separating the backward policy PB which controls preferences over the order in which things are done from the constraints imposed by the target reward function.

An important contribution of this paper is the mathematical framework for marginalization or free energy estimation using GFlowNets. It relies on the simple idea of conditioning the GFlowNet so as to push the ability to estimate a partition function already introduced by Bengio et al. (2021) to a much more general setting. This makes it possible in principle to estimate intractable sums of rewards over the terminating states reachable by an arbitrary state, opening the door to marginalization over supergraphs of graphs, supersets of sets, and supersets of (variable,value) pairs. In turn, this provides formulae for estimating entropies, conditional entropies and mutual information.

Appendix A introduces alternatives to the flow-matching objective which may bypass the slow “bootstrapping” propagation of credit information from the end of the action sequence to its beginnings, with so-called direct credit assignment (which has some similarity to policy gradient). This question also motivated the trajectory balance objective introduced by Malkin et al. (2022) and further work in this direction is warranted.

In an attempt to better discern the links between GFlowNets and the more common forms of RL, Appendix C considers the cases where rewards are provided not just at the end but possibly after every action, and where the environment may be stochastic. It also shows how a greedy policy that maximises returns could be obtained from a trained GFlowNet. Another link with RL is introduced in Appendix D by considering something similar to a value function in GFlowNets, i.e., the expected downstream reward from a given state s. In a similar spirit, Appendix E generalizes GFlowNets to the case where rewards can be earned along the trajectory, thus introducing a notion of return, but keeping the objective of sampling in proportion to the return.

Appendix F generalizes GFlowNets in another direction, considering the possibility of having a family of flows being learned simultaneously over the same graph. This opens the door to a distributional generalization of GFlowNets (by analogy with distributional RL) as well a notion of unsupervised GFlowNet (where the reward function can be defined after the GFlowNet has been trained in an unsupervised fashion) and learning to sample from the Pareto front defined by a set of reward functions.

Many open questions obviously remain, from the extension to continuous actions and states to hierarchical versions of GFlowNets with abstract actions and integrating the energy function in the GFlowNet parametrization itself, enabling an interesting form of modularization and knowledge decomposition. Importantly, many of the mathematical formulations presented in this paper will require empirical validation to ascertain their usefulness, improve these ideas, turn them into impactful algorithms and explore a potentially very broad range of interesting applications, from replacing MCMC or being combined with MCMC in some settings, to probabilistic reasoning to further applications in active learning for scientific discovery.

Acknowledgements

The authors want to acknowledge the useful suggestions and feedback on the paper and its ideas provided by Alexandra Volokhova, Marc Bellemare, Valentin Thomas, Modjtaba Shokrian Zini, Mohammad Pedramfar, and Axel Nguyen Kerbel. They are also grateful for the financial support from CIFAR, Samsung, IBM, Google, Microsoft, JP Morgan Chase, and the Thomas C. Nelson Stanford Interdisciplinary Graduate Fellowship.

This research was funded in part by JPMorgan Chase & Co. Any views or opinions expressed herein are solely those of the authors listed, and may differ from the views and opinions expressed by JPMorgan Chase & Co. or its affiliates. This material is not a product of the Research Department of J.P. Morgan Securities LLC. This material should not be construed as an individual recommendation for any particular client and is not intended as a recommendation of particular securities, financial instruments or strategies for a particular client. This material does not constitute a solicitation or offer in any jurisdiction.

References

  • Andrieu et al. (2003) C. Andrieu, N. De Freitas, A. Doucet, and M. I. Jordan. An introduction to mcmc for machine learning. Machine learning, 50(1):5–43, 2003.
  • Arulampalam et al. (2002) M. Arulampalam, S. Maskell, N. Gordon, and T. Clapp. A tutorial on particle filters for online nonlinear/non-gaussian bayesian tracking. IEEE Transactions on Signal Processing, 50(2):174–188, 2002. doi: 10.1109/78.978374.
  • Auer et al. (2002) P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM journal on computing, 32(1):48–77, 2002.
  • Bahdanau et al. (2014) D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. ICLR’2015, arXiv:1409.0473, 2014.
  • Bai et al. (2019) S. Bai, J. Z. Kolter, and V. Koltun. Deep equilibrium models. CoRR, abs/1909.01377, 2019. URL http://arxiv.org/abs/1909.01377.
  • Bellemare et al. (2017) M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning, 2017.
  • Bengio et al. (2021) E. Bengio, M. Jain, M. Korablyov, D. Precup, and Y. Bengio. Flow network based generative models for non-iterative diverse candidate generation. NeurIPS’2021, arXiv:2106.04399, 2021.
  • Bengio et al. (2013) Y. Bengio, G. Mesnil, Y. Dauphin, and S. Rifai. Better mixing via deep representations. In International conference on machine learning, pages 552–560. PMLR, 2013.
  • Brown et al. (2004) N. Brown, B. McKay, F. Gilardoni, and J. Gasteiger. A graph-based genetic algorithm and its application to the multiobjective evolution of median molecules. Journal of chemical information and computer sciences, 44(3):1079–1087, 2004.
  • Buesing et al. (2019) L. Buesing, N. Heess, and T. Weber. Approximate inference in discrete distributions with monte carlo tree search and value functions, 2019.
  • Cayton (2005) L. Cayton. Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep, 12(1-17):1, 2005.
  • Dai et al. (2020) H. Dai, R. Singh, B. Dai, C. Sutton, and D. Schuurmans. Learning discrete energy-based models via auxiliary-variable local exploration. In Neural Information Processing Systems (NeurIPS), 2020.
  • Deleu et al. (2022) T. Deleu, A. Góis, C. Emezue, M. Rankawat, S. Lacoste-Julien, S. Bauer, and Y. Bengio. Bayesian structure learning with generative flow networks. In Uncertainty in Artificial Intelligence, pages 518–528. PMLR, 2022.
  • Dinh et al. (2014) L. Dinh, D. Krueger, and Y. Bengio. Nice: Non-linear independent components estimation. ICLR’2015 Workshop, arXiv:1410.8516, 2014.
  • Dinh et al. (2016) L. Dinh, J. Sohl-Dickstein, and S. Bengio. Density estimation using real NVP. ICLR’2017, arXiv:1605.08803, 2016.
  • Dosovitskiy and Djolonga (2019) A. Dosovitskiy and J. Djolonga. You only train once: Loss-conditional training of deep networks. In International Conference on Learning Representations, 2019.
  • Ernst et al. (2005) D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
  • Ghosh et al. (2018) D. Ghosh, A. Gupta, and S. Levine. Learning actionable representations with goal-conditioned policies. arXiv preprint arXiv:1811.07819, 2018.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. Advances in neural information processing systems, 27, 2014.
  • Goyal and Bengio (2020) A. Goyal and Y. Bengio. Inductive biases for deep learning of higher-level cognition. arXiv, abs/2011.15091, 2020. https://arxiv.org/abs/2011.15091.
  • Goyal et al. (2019) A. Goyal, A. Lamb, J. Hoffmann, S. Sodhani, S. Levine, Y. Bengio, and B. Schölkopf. Recurrent independent mechanisms. ICLR’2021, arXiv:1909.10893, 2019.
  • Grathwohl et al. (2021) W. Grathwohl, K. Swersky, M. Hashemi, D. Duvenaud, and C. J. Maddison. Oops i took a gradient: Scalable sampling for discrete distributions, 2021.
  • Griffiths and Hernández-Lobato (2017) R.-R. Griffiths and J. M. Hernández-Lobato. Constrained bayesian optimization for automatic chemical design. arXiv preprint arXiv:1709.05501, 2017.
  • Haarnoja et al. (2017) T. Haarnoja, H. Tang, P. Abbeel, and S. Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361. PMLR, 2017.
  • Hastings (1970) W. K. Hastings. Monte carlo sampling methods using markov chains and their applications. Biometrika, 1970.
  • Hu et al. (2023) E. Hu, N. Malkin, M. Jain, K. Everett, A. Graikos, and Y. Bengio. Gflownet-em for learning compositional latent variable models. arvix, 2023.
  • Jain et al. (2022) M. Jain, E. Bengio, A. Hernandez-Garcia, J. Rector-Brooks, B. F. P. Dossou, C. Ekbote, J. Fu, T. Zhang, M. Kilgour, D. Zhang, L. Simine, P. Das, and Y. Bengio. Biological sequence design with gflownets. International Conference on Machine Learning (ICML), 2022.
  • Jain et al. (2023) M. Jain, S. C. Raparthy, A. Hernandez-Garcia, J. Rector-Brooks, Y. Bengio, S. Miret, and E. Bengio. Multi-objective gflownets. arXiv preprint arXiv:2210.12765, 2023.
  • Jasra et al. (2005) A. Jasra, C. C. Holmes, and D. A. Stephens. Markov chain monte carlo methods and the label switching problem in bayesian mixture modeling. Statistical Science, pages 50–67, 2005.
  • Jensen (2019) J. H. Jensen. A graph-based genetic algorithm and generative model/monte carlo tree search for the exploration of chemical space. Chemical science, 10(12):3567–3572, 2019.
  • Kingma and Welling (2013) D. P. Kingma and M. Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Kschischang et al. (2001) F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory, 47(2):498–519, 2001.
  • Kumar et al. (2019) R. Kumar, S. Ozair, A. Goyal, A. Courville, and Y. Bengio. Maximum entropy generators for energy-based models, 2019.
  • Kusner et al. (2017) M. J. Kusner, B. Paige, and J. M. Hernández-Lobato. Grammar variational autoencoder. In International Conference on Machine Learning, pages 1945–1954. PMLR, 2017.
  • Lahlou et al. (2023) S. Lahlou, T. Deleu, P. Lemos, D. Zhang, A. Volokhova, A. Hernández-García, L. N. Ezzine, Y. Bengio, and N. Malkin. A theory of continuous generative flow networks. International Conference on Machine Learning (ICML), 2023.
  • Lange et al. (2012) S. Lange, T. Gabel, and M. Riedmiller. Batch reinforcement learning. In Reinforcement learning, pages 45–73. Springer, 2012.
  • Levine (2018) S. Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint arXiv:1805.00909, 2018.
  • Malik et al. (2023) S. A. Malik, S. Lahlou, A. Jesson, M. Jain, N. Malkin, T. Deleu, Y. Bengio, and Y. Gal. Batchgfn: Generative flow networks for batch active learning. arXiv preprint arXiv: 2306.15058, 2023.
  • Malkin et al. (2022) N. Malkin, M. Jain, E. Bengio, C. Sun, and Y. Bengio. Trajectory balance: Improved credit assignment in gflownets. arXiv preprint arXiv:2201.13259, 2022.
  • Malkin et al. (2023) N. Malkin, S. Lahlou, T. Deleu, X. Ji, E. Hu, K. Everett, D. Zhang, and Y. Bengio. GFlowNets and variational inference. International Conference on Learning Representations (ICLR), 2023.
  • Metropolis et al. (1953) N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
  • Močkus (1975) J. Močkus. On bayesian methods for seeking the extremum. In Optimization techniques IFIP technical conference, pages 400–404. Springer, 1975.
  • Mouret and Doncieux (2012) J.-B. Mouret and S. Doncieux. Encouraging Behavioral Diversity in Evolutionary Robotics: An Empirical Study. Evolutionary Computation, 20(1):91–133, 03 2012. ISSN 1063-6560. doi: 10.1162/EVCO“˙a“˙00048. URL https://doi.org/10.1162/EVCO_a_00048.
  • Nachum et al. (2017) O. Nachum, M. Norouzi, K. Xu, and D. Schuurmans. Bridging the gap between value and policy based reinforcement learning. arXiv preprint arXiv:1702.08892, 2017.
  • Nachum et al. (2019) O. Nachum, Y. Chow, B. Dai, and L. Li. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. arXiv preprint arXiv:1906.04733, 2019.
  • Naesseth et al. (2019) C. A. Naesseth, F. Lindsten, T. B. Schön, et al. Elements of sequential monte carlo. Foundations and Trends® in Machine Learning, 12(3):307–392, 2019.
  • Narayanan and Mitter (2010) H. Narayanan and S. Mitter. Sample complexity of testing the manifold hypothesis. In NIPS’2010, pages 1786–1794, 2010.
  • Nash and Durkan (2019) C. Nash and C. Durkan. Autoregressive energy machines. In International Conference on Machine Learning, pages 1735–1744. PMLR, 2019.
  • Pan et al. (2023) L. Pan, N. Malkin, D. Zhang, and Y. Bengio. Better training of gflownets with local credit and incomplete trajectories. arXiv preprint arXiv: 2302.01687, 2023.
  • Pompe et al. (2020) E. Pompe, C. Holmes, and K. Łatuszyński. A framework for adaptive mcmc targeting multimodal distributions. The Annals of Statistics, 48(5):2930–2952, 2020.
  • Rezende and Mohamed (2015) D. Rezende and S. Mohamed. Variational inference with normalizing flows. In International conference on machine learning, pages 1530–1538. PMLR, 2015.
  • Riedmiller (2005) M. Riedmiller. Neural fitted q iteration–first experiences with a data efficient neural reinforcement learning method. In European conference on machine learning, pages 317–328. Springer, 2005.
  • Rifai et al. (2011) S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller. The manifold tangent classifier. Advances in neural information processing systems, 24:2294–2302, 2011.
  • Salimans et al. (2017) T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative to reinforcement learning, 2017.
  • Schmidhuber (2019) J. Schmidhuber. Reinforcement learning upside down: Don’t predict rewards–just map them to actions. arXiv preprint arXiv:1912.02875, 2019.
  • Schölkopf et al. (2012) B. Schölkopf, D. Janzing, J. Peters, E. Sgouritsa, K. Zhang, and J. Mooij. On causal and anticausal learning. In ICML’2012, pages 1255–1262, 2012.
  • Seff et al. (2019) A. Seff, W. Zhou, F. Damani, A. Doyle, and R. P. Adams. Discrete object generation with reversible inductive construction. arXiv preprint arXiv:1907.08268, 2019.
  • Sohl-Dickstein et al. (2015) J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pages 2256–2265. PMLR, 2015.
  • Srinivas et al. (2010) N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Gaussian process optimization in the bandit setting: No regret and experimental design. In International Conference on Machine Learning (ICML), 2010.
  • Sutton and Barto (2018) R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Swersky et al. (2020) K. Swersky, Y. Rubanova, D. Dohan, and K. Murphy. Amortized bayesian optimization over discrete spaces. In Conference on Uncertainty in Artificial Intelligence, pages 769–778. PMLR, 2020.
  • Toussaint and Storkey (2006) M. Toussaint and A. Storkey. Probabilistic inference for solving discrete and continuous state markov decision processes. In Proceedings of the 23rd international conference on Machine learning, pages 945–952, 2006.
  • Vaswani et al. (2017) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • Wen et al. (2020) J. Wen, B. Dai, L. Li, and D. Schuurmans. Batch stationary distribution estimation. arXiv preprint arXiv:2003.00722, 2020.
  • Xie et al. (2021) Y. Xie, C. Shi, H. Zhou, Y. Yang, W. Zhang, Y. Yu, and L. Li. {MARS}: Markov molecular sampling for multi-objective drug discovery. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=kHSu4ebxFXY.
  • Zhang et al. (2022) D. Zhang, N. Malkin, Z. Liu, A. Volokhova, A. Courville, and Y. Bengio. Generative flow networks for discrete probabilistic modeling. International Conference on Machine Learning (ICML), 2022.
  • Zhang et al. (2023) D. W. Zhang, C. Rainone, M. Peschl, and R. Bondesan. Robust scheduling with gflownets. International Conference on Learning Representations (ICLR), 2023.
  • Ziebart et al. (2008) B. D. Ziebart, A. L. Maas, J. A. Bagnell, A. K. Dey, et al. Maximum entropy inverse reinforcement learning. In Aaai, volume 8, pages 1433–1438. Chicago, IL, USA, 2008.
  • Zimmermann et al. (2022) H. Zimmermann, F. Lindsten, J.-W. van de Meent, and C. A. Naesseth. A variational perspective on generative flow networks. arXiv preprint 2210.07992, 2022.

Appendix A Direct Credit Assignment in GFlowNets

Similarly to temporal-difference methods, which are based on minimizing the mismatch with respect to the Bellman equations, the flow-matching and detailed-balance losses will take many updates (and sampling many trajectories) to propagate a reward mismatch on a terminating state into the transition probabilities inside the flow network. This would be particularly acute for longer trajectories and prompts the question of alternative more direct training objectives. This section attempts to answer the following question: “Given a training trajectory τ, are there more direct ways of assigning credit to the earlier transitions in the trajectory?”. Malkin et al. (2022) provide an alternative answer by introducing the Trajectory Balance loss (Ex. 6).

We can view the process of sampling a trajectory with a GFlowNet as analogous to sampling a sequence of states in a stochastic recurrent neural network. What makes things complicated is that such a neural network (i) does not directly output a prediction to be matched by some target, and (ii) that the state may be discrete (or a combination of discrete and continuous components).

Regarding (i), we recall that the flow in a transition ss (involved in the training objective) is an intractable sum over all the possible trajectories leading to s. However, we may be able to obtain a stochastic gradient that, in average over such trajectories, estimates the desired quantity. For this, we exploit the properties of flows to obtain a stochastic gradient estimator, derived through the next three propositions.

Notation for derivatives through the flow-matching constraint: below we sometimes need to distinguish total derivatives (noted dydx) that take into account the indirect effects due to the flow matching constraint from other derivatives (noted yx) and capturing only direct gradients. Either notation can be used when the constraint does not change the result. Computing indirect gradients through implicit dependencies is an active area of research and commonly utilizes the Implicit Function Theorem, e.g., for implicit layers in fixed-point iteration layers and Deep Equilibrium Models (Bai et al., 2019).

Proposition 39.

Consider the effect of a slight change in the log of the flow at s<s under the flow-matching constraint: it yields a change in the log of the flow at s, following the conditional probability P(s|s):

dlogF(s)dlogF(s)=P(s|s) (70)

where P is the distribution on events over trajectories.

Proof We are going to consider a partition of the complete trajectories going through s into those that also go through s (set 𝒯ss) and those that don’t (set 𝒯ss). By definition we can write:

P(s)=P(𝒯ss)+P(𝒯ss)

Because 𝒯ss={τ𝒯:sτ,sτ}, we can write P(𝒯ss)=P(s)P(s|s). Hence:

F(s)=F(s)P(s|s)+F(𝒯ss).

Additionally, because the flow in s does not influence trajectories in 𝒯ss, then dF(𝒯ss)dF(s)=0; which leads to:

dF(s)dF(s) =P(s|s)
dF(s)dlogF(s) =F(s)P(s|s)
=F(ss)
=P(s|s)F(s)
dlogF(s)dlogF(s) =P(s|s).


Proposition 40.

Consider the effect of a slight change in the log of the flow on the edge ss: we obtain a change in the log of the flow at s following the backward conditional probability PB(s|s):

dlogF(s)dlogF(ss)=PB(s|s) (71)

where P is the distribution on events over trajectories.

Proof We first use the chain rule and properties of the derivatives of the log, and then we use from the flow matching constraint that F(s)=sPar(s)F(ss)dF(s)dF(ss)=1:

dlogF(s)dlogF(ss) =1F(s)dF(s)dF(ss)dF(ss)dlogF(ss)
=F(ss)F(s)dF(s)dF(ss)
=F(ss)F(s)
=PB(s|s).


Intuitively, Eq. 70 and Eq. 71 tell us how a perturbation in the flow at one place would result in a change elsewhere in order to maintain the flow match everywhere, using the flow-matching conditions to propagate infinitesimal changes in flow backwards and forwards. Hence, they only become true as we approach the limit of matched flows, and in practice (with an imperfectly trained GFlowNet) the corresponding expressions will be biased. However, we can exploit them to estimate long-range equilibrium gradients and obtain an estimator of credit assignment across a long trajectory in Prop. 41 below.

In the GFlowNet setting, suppose we parametrize the edge flow estimator Fθ(ss) via parameters θ. In order to understand the effect of a change in θ on our loss function , we must compute the total derivative dLdθ by summing the direct and indirect gradients. In our context, the direct gradient Lθ is due to the explicit change in loss from changing θ (not taking the flow-matching constraint into account) and the indirect gradient includes the induced changes in the flow due to the constraint. With this, we are in a position to formalize unbiased estimators for the total derivative dLdθ in Prop. 41:

Proposition 41.

Let be a flow-matching loss computed along a trajectory τ=(s0,s1,,sf) sampled according to the GFlowNet trajectory distribution P(τ). Let =iL(si) decompose the total loss into per-state losses L(si) along the trajectory. Let θ parametrize the edge flow estimator Fθ(ss). Then, in the limit of the flows becoming matched,

G1iL(si)θ+L(si)logF(si)t=1i1logFθ(st1st)θ (72)

is an unbiased estimator of the total derivative ddθ, as is

G2iL(si)θ+L(si)logF(si)t=1i1sPar(st)PB(s|st)logFθ(sst)θ (73)

as is the convex combination

G=λG1+(1λ)G2 (74)

for any 0λ1. Intuitively, L(si)θ indicates the gradient directly through the occurrences of Fθ in L(si), and F(si) indicates either the forward or backward flow sum present in L(si) to obtain the flow through si.

Proof We will use the partial derivative notation when not considering the indirect influence due to the matching flow constraint and the total derivative notation when considering it, to apply Prop. 39. Consider the expected value under the GFlowNet’s trajectory distribution of the G in Eq. 72, use the previous proposition (Eq. 70) and the chain rule:

E[G1] =E[iL(si)θ+t=1i1L(si)logF(si)logFθ(st1st)θ]
=siP(si)(L(si)θ+ss<siP(ss|si)L(si)logF(si)logFθ(ss)θ)
=siP(si)(L(si)θ
+ss<siPB(s|s)P(s|si)(L(si)logF(si)logFθ(ss)θ))
=siP(si)(L(si)θ
+ss<siL(si)logF(si)dlogF(si)dlogF(s)dlogF(s)dlogFθ(ss)logFθ(ss)θ)
=siP(si)(L(si)θdirect gradients+ss<sidL(si)dlogFθ(ss)logFθ(ss)θindirect gradients)
=E[ddθ]
=dE[]dθ

The above demonstration shows that G in Eq. 74 is asymptotically (as the flows become matched) unbiased when λ=1 because we recover the G1 of Eq. 72. The same proof technique can then be used for G2 which uses transitions sst sampled from PB(s|st) instead of the trajectory transitions st1st and we obtain that the estimator G in Eq. 74 is asymptotically unbiased when λ=0:

E[G2] =E[it=1i1L(si)θ+L(si)logF(si)sPar(s)PB(s|s)logFθ(sst)θ]
=sis<siP(s,si)(L(si)θ+
sPar(s)PB(s|s)L(si)logF(si)logFθ(ss)θ)
=siP(si)(L(si)θ+
s<siP(s|si)sPar(s)PB(s|s)L(si)logF(si)logFθ(ss)θ)
=siP(si)(L(si)θ+
ss<si(L(si)logF(si)dlogF(si)dlogF(s)dlogF(s)dlogFθ(ss)logFθ(ss)θ))
=siP(si)(L(si)θdirect gradients+ss<si(dL(si)dlogFθ(ss)logFθ(ss)θ)indirect gradients)
=dE[]dθ

where the last identity follows the fourth line in the proof for G1. Finally a convex combination of two unbiased estimators is unbiased, so we obtain that G in Eq. 74 is asymptotically unbiased for any 0λ1.


This surprising result says that something very close to policy gradient actually provides an asymptotically (i.e., when flows are matched) unbiased gradient on the parameters of the edge flow, in expectation222the connection becomes clearer when you imagine minus the loss L to be the reward itself, and we see that we immediately get a training signal at earlier times in the sequence with G, similarly to policy gradient. There are also differences because the above proposition relies on staying close to the learning fixed point where the flows are matched.. Note that it only works exactly in an online setting, i.e., when the trajectory is sampled according to the learner’s current policy. Otherwise, the gradient estimator may be biased (it would be biased anyways in practice because the flows are never perfectly matched). However, if instead of sampling trajectories τ from the GFlowNet transition probabilities PF(st+1|st) we sample them from a training distribution P~ with transition probabilities P~(st+1|st) we can calculate the importance weights (by the ratio P(τ)/P~(τ)) and correct the estimator accordingly. Since the training distribution P~ should be broader and have a full support, the importance ratio cannot explode but there could still be the usual numerical problems with the variance of such importance-weighted estimators.

We now consider the setting in which the sampling policy is only slightly different from P^ , which is typically the case because we want the sampling policy to be broader and more exploratory, and because we may be using delayed data, e.g., with a replay buffer. This slight difference may induce a bias but it might still be advantageous to use the above gradient estimator. Note how it does not come in conflict with the gradient of the flow matching loss (which is the first term in G). The expected advantage of using G is that it may initially speed up training by directly providing updates to earlier transitions of a complete trajectory. However, analogous to the trade-off between temporal-difference methods and policy-gradient methods, this may come at the price of higher variance.

This estimator is unbiased when the flows are matched and when the trajectory is sampled according to the GFlowNet’s distribution, but it also makes a lot of intuitive sense: if the estimated flow at si is too small (in the eye of Li) one can clearly push that flow up by increasing the probability of a transition on a path leading to si. Even if we consider a slightly different trajectory sampling distribution, so long as it leads to si we would expect that increasing its probability would increase the probability of ending up in si (see Prop. 39).

If the state has a continuous component, we could also increase the probability of ending up in si by choosing more often a more probable path to si. This could be calculated by backpropagating through the state transitions (with some form of backpropagation through time). However, if the transitions are not fully known or are not differentiable, this approach may be more challenging, and is related to similar questions raised with credit assignment in reinforcement learning with continuous states.

Finally, keep in mind that the more direct credit assignment terms in G have to be combined with the local terms Liθ which make sure that the flows becomes better matched, since flow-matching is a necessary condition for G to be unbiased.

Appendix B Conditional GFlowNets

Definition 42.

Consider a set of conditioning information 𝒳, a family of DAGs 𝒢={Gx=(𝒮x,𝔸x),x𝒳}, a family of target reward functions ={Rx:𝒮xf+,x𝒳}, and a flow parametrization (𝒪x,Πx,x) of (Gx,Rx) for every x𝒳. The three functions 𝒪:x𝒳𝒪x, Π:x𝒳Πx, and :x𝒳x parametrize the family of DAGs and target reward functions, and we say that (𝒪,Π,) form a conditional flow parametrization of (𝒳,𝒢,). The tuple (𝒳,𝒢,,𝒪,Π,) is called a conditional GFlowNet.

For clarity, for an object o𝒪, we will use ox and o(x) interchangeably.

As with the unconditional case, conditional GFlowNets provide a way to sample from different target reward functions Rx simultaneously. Given a configuration o𝒪 of the conditional GFlowNet and a condition x𝒳, πx:=Πx(ox) is a distribution over 𝒯x, the set of complete trajectories in Gx, which implicitly defines a terminating state probability measure in Gx:

x𝒳s𝒮xfPT(sx):=τ𝒯x:ssfxτπx(τ), (75)

where the dependence on o in PT is omitted for clarity.

Conditional GFlowNets cast the problem of sampling from target reward functions to a search problem: searching for objects o𝒪 such that oxx(Markov(Gx,Rx))𝒪x. For such objects o𝒪, the terminating state probability measures correspond to the distribution of interest, i.e.:

x𝒳s𝒮xfPT(sx)Rx(s).

Similar to the unconditional case, we need to design a conditional loss function on 𝒪 that equals zero on such objects o and only on those objects.

Definition 43.

Let (𝒳,𝒢,,𝒪,Π,) be a conditional GFlowNet. A conditional flow-matching loss is any function :𝒪+ such that:

o𝒪(o)=0x𝒳FxMarkov(Gx,Rx)ox=x(Fx). (76)

We say that is condition-decomposable, if there are functions x:𝒪x+ such that:

o𝒪(o)=x𝒳x(ox)

We can obtain conditional flow-matching losses that are condition-decomposable starting from any family of flow-matching losses x for (Gx,Rx,𝒪x,Πx,x) (Def. 26). In particular, we can choose these losses to be state-decomposable, edge-decomposable or trajectory-decomposable.

For instance, if each x=s𝒮xLx(.,s) is state-decomposable, then the simultaneous sampling problem is cast to the following minimization problem:

mino𝒪𝔼(x,s)πT[Lx(o,s)], (77)

where πT is any conditional full support distribution on 𝒳×x𝒳𝒮x, i.e. a probability distribution that satisfies:

x𝒳sx𝒳𝒮xπT(x,s)>0s𝒮x.

Such a conditional full support distribution can be obtained starting from any distribution π𝒳 with full support on 𝒳 and distributions πx with full support on 𝒮x for any x𝒳, as:

πT(x,s)={π𝒳(x)πx(s)if s𝒮x0otherwise.
Example 9.

Consider the set:

𝒪={F^:𝒳×x𝒳𝔸xf+,F^(ssx)=0 if ss𝔸x}.

For each x𝒳 the function F^x:=F^(.x) is an element of 𝒪edge,x(Ex. 1), i.e. it is a function from 𝔸xf to +. Meaning that the set 𝒪 can be seen as a function mapping each x𝒳 to Fx𝒪edge,x. Denoting by :x𝒳edge,x and Π:x𝒳Πedge,x, we obtain a valid conditional flow parametrization (𝒪,,Π).

Instead of learning each function F^x separately, this parametrization enables learning functions F^𝒪 of both the condition x and the non-terminating edge ss, thus exploiting the generalization capabilities of machine learning algorithms not only on edges, but also on conditions.

Consider the functions Lx defined for each x𝒳 as:

Lx(F^x,s)={(log(δ+sPar(s)F^(ssx)δ+R(sx)+s′′Child(s){sfx}F^(ss′′x)))2if ssf,0otherwise

where δ0 is a hyperparameter. The function mapping each F^𝒪 to

(F^)=x𝒳s𝒮xLx(F^x,s),

is a conditional flow-matching loss, that is both condition-decomposable and state-decomposable.

Appendix C Policies in Deterministic and Stochastic Environments

Until now, we have focused on a deterministic environment where state changes can perfectly be calculated from a given action. This makes sense when the actions are cognitive actions, internal to an agent, e.g., sequentially constructing a candidate solution to a problem (an explanation, a plan, an inferred guess, etc). What about the scenario where the actions are external and affect the real world? The outcome are likely to be only imperfectly predictable. To address this scenario, we will now extend the GFlowNet framework to learn a policy π for an agent in an environment that could be deterministic or stochastic.

Definition 44.

A policy π:𝒜×𝒮 is a probability distribution π(a|s) over actions a𝒜 for each state s. To denote the fact that the action space may be restricted based on s, we write 𝒜(s) for the valid actions in state s.

To denote the introduction of actions in the GFlowNet framework, we will decompose transitions in two steps: first an action at is sampled according to a policy π from state st, and then the environment transforms this (in a possibly stochastic way) into a new state st+1.

Definition 45.

We generalize the notion of state as follows: even states are of the form s𝒮 while odd states are of the form (s,a)𝒮×𝒜. The policy π governs the transition from an even state to a compatible next odd state with a𝒜(s), while the environment P(stst+1|st,at) governs the transition from an odd state to the next even state.

As a result of the above definition, the even-to-even transition is summarized by

PF(st+1|st)=atP(stst+1|st,at)π(at|st). (78)

Note that the detailed balance condition, which involves a backward transition PB, will also be decomposed in two parts: (1) for inverting the even-to-odd transition,

PB(st|(st,at))=1 (79)

by definition, and (2) for inverting the odd-to-even transition, we have to actually represent (and learn)

PB((st,at)|st+1). (80)

This conditional distribution incorporates the preference we may have over different paths leading to the same state while consistent with the environment P(stst+1|st,at). The normalization constraint on P^B can guarantee flow-matching via detailed balance, as argued around Eq. 31.

C.1 Known Deterministic Environments

A deterministic environment is perfectly controllable: we can choose the action that leads to the most desired next state, among the valid actions from the previous state. In the case where the environment is deterministic, we can directly apply the results of Sec. 3, as follows. At each time step t and from state st, the agent picks an allowed action at𝒜(st) according to a policy π(at|st). The set of allowed actions should coincide with those actions for which π(at|st)>0. Since the environment is deterministic and known, there is a deterministic function T:𝒮×𝒜𝒮 which gives us the next state st+1=T(st,at). In that case, we can ignore the even/odd state distinction and identify the learnable policy π with the learnable transition probability function PF(st+1|st) of GFlowNets, as follows.

Proposition 46.

In a deterministic environment and a GFlowNet agent with policy π(at|st) and state transitions given by st+1=T(st,at), the transition probability PF(st+1|st) is given by

PF(st+1|st)=a:T(st,a)=st+1π(a|st). (81)

Hence if only one action at can transition from st to st+1=T(st,at), then

PF(st+1|st)=π(at|st). (82)

Proof The result is obtained by marginalizing over a:

PF(st+1|st) =aP(stst+1,at|st)
=aP(stst+1|st,at)π(a|st)
=a:st+1=T(st,a)π(a|st)

with P(stst+1|st,at)=1st+1=T(st,a).

The case with a single possible action to obtain the transition is obtained because the sum contains only one term.

Proposition 47.

In a deterministic environment with st+1=T(st,at), a backwards transition probability function can be derived from a backwards policy πB,

PB(st|st+1)=a:T(st,a)=st+1πB(a|st+1) (83)

and in the case where a single action at explains each transition st+1=T(st,at),

PB(st|st+1)=πB(at|st+1) (84)

Proof The proof goes along exactly the same lines as for Prop. 46.

C.2 Unknown Deterministic Environments

If the environment is deterministic but unknown, we have to learn the transition function T and we should also learn its inverse T1 which recovers the previous state given the next state and the action:

T1:𝒮×𝒜𝒮s.t.T1(T(s,a),a)=s. (85)

Unfortunately, if the state and action spaces are discrete and in high dimension, learning T and T1 in a way that generalizes to unseen transitions333Seen transitions can just be recorded in a table, but in a combinatorial state-space, they will form an exponentially tiny fraction of the ones to be encountered in the future. may be difficult and might be more easily achievable via a continuous relaxation. The methods for stochastic environments could be used for this purpose.

C.3 Stochastic Environments

The setting of stochastic environments is less straightforward but more general. We will decompose the transition as per Eq. 78 but not assume that P(stst+1|st,at) is a dirac. The first thing to note is that we can still obtain a Markovian flow, but that we are not guaranteed to find a policy which matches the desired terminal reward function.

Proposition 48.

In a stochastic environment with environment transitions P(stst+1|st,at), any policy π(a|s) can yield a Markovian flow and it may not be possible to perfectly achieve desired flows F^(ssf)=R(s).

Proof We obtain a flow by satisfying the flow-matching or detailed balance equations for both even and odd steps, which can always be done for the following reason. From the even states, we can define an edge flow

F^(st(st,at))=F^(st)π(at|st)

and the backwards transition is πB(st|(st,at))=1. This leads to the intermediate state flow

F^((st,at))=F^(st(st,at))

since there is only one edge into (st,at), the one starting at st and taking action at. From the odd states, we have the edge flow

F^((st,at)st+1)= F^((st,at))P(stst+1|st,at)
= F^(st)π(at|st)P(stst+1|st,at)

with P(stst+1|st,at) representing the environment, and we obtain the even state flow with the usual formula (Eq. 22)

F^(st+1)=(st,at)F^((st,at)st+1)=(st,at)F^(st)π(at|st)P(stst+1|st,at).

If unknown, the environment transitions P(stst+1|st,at) can be estimated in the usual supervised way by observing the triplets (st,at,st+1) and estimating transition probabilities that approximately maximize the empirical log-likelihood of these observations. However, whereas in a stationary environment the transitions P(stst+1|st,at) do not depend on the policy, the backwards transitions P^B((st,at)|st+1) depend on the forward environment transition and on the state flows, i.e., on the policy. With enough training time and capacity, the forward and backward transitions can be made compatible, but as usual with GFlowNets, in a realistic settings the flow matching equations will not be perfectly achieved.

If there is enough capacity and training time (training to completion), we thus obtain a flow. Then, defining the transition probabilities by the sequential sampling of transitions from the even and odd steps above, we obtain a Markovian flow (Prop. 16).

To show that the desired terminal flows are not necessarily achievable, it is sufficient to identify a counter-example. Consider a terminal reward R(s)>0 while the environment transitions into s have zero probability. In that case, no matter how we choose our policy, we cannot put the desired flow into state s.

Keep also in mind that in practice, even in a completely controllable environment, we will not be guaranteed to find a flow that matches the target terminal reward function simply because of finite capacity and finite training time for the GFlowNet.

Whereas with a deterministic environment for the GFlowNet, one can freely choose PB for non-terminal edges, it is not so for stochastic environments, as argued below. On even-to-odd transitions, PB(s|st,at)=1s=st by construction. On odd-to-even transitions (st,at)st+1 the problem is that the forward transition is not a free parameter (it corresponds to the environment’s P(stst+1|st,at)).

Counterexample 49.

With a GFlowNet with a fixed stochastic environment, it may not be possible for PB(st,at|st+1) to be chosen freely while also matching the flows and the terminal rewards.

For a counterexample, consider the setting in Fig. 6. Suppose R(s′′′)0 and the environment-provided transition T(s′′′|s′′,a′′)=0. Then, in order to match the terminal reward R(s′′′), we must require that PB(s,a|s′′′)0, which means that PB cannot be chosen freely.

Refer to caption
Figure 6: Consider a simple counter-example with only two paths from s to s′′′, with a given R(s′′′)>0. One path goes through s and the other through s′′. From s the only feasible action a leads to s′′′ and similarly from s′′ with action a′′. However, it may be that the environment probability P(s′′′|(s′′,a′′))=0, constraining PB((s′′,a′′)|s′′′)=0. Therefore it may not possible to choose the backward transitions freely while matching the flows and terminal rewards.

Appendix D Expected Downstream Reward and Reward-Maximizing Policy

We have already introduced the probability distribution PT(s) and conditional probabilities PT(s|ss) over terminating states (the states visited just before exiting into sf). More generally, one can consider any distribution Pπ(s) over terminating states arising from some arbitrary choice of GFlowNet policy π and compute the expected reward under this distribution:

Definition 50.

The expected reward after visiting state s of a flow with terminal reward function R, under some distribution over terminating states P, is

VPπ(s):=EPπ(S)[R(S)|Ss]=ssR(s)Pπ(s|ss). (86)

Proposition 51.

When the probability distribution over terminating states is PT given by the flow (see Def. 13), the expected reward under PT is

VPT(s)=ssR(s)2ssR(s). (87)

Proof We apply the definition of conditional PT (Cor. 33) to Eq. 86 and obtain the result.

While we have a simple expression of the expected reward under PT, the expected reward is defined more broadly for the distribution Pπ arising from any policy π. In particular for any policy π(a|s), we can also define the expected reward VPπ under the distribution Pπ over terminating states induced by π. Expected rewards play a role similar to the state and state-action value functions in reinforcement learning, and as a consequence they also satisfy an equivalent of the policy improvement theorem when intermediate rewards are 0 and the discount factor γ=1:

Proposition 52.

Let Pπ be a distribution over terminating states arising from a policy π, and π¯ a greedy policy under the expected reward VPπ, i.e.,

π¯(a|s)=0unless
VPπ((s,a))VPπ((s,a))a. (88)

Then for all s

VPπ¯(s)VPπ(s). (89)

That is, the expected reward under the probability induced by π¯ is no worse than the one induced by π.

Proof Let us denote by π¯(s) the action deterministically chosen by greedy policy π¯ from s, and sn the stochastically sampled terminating state. Then:

VPπ(s) VPπ((s,π¯(s)))
=Eπ¯[VPπ(st+1)|st=s]
Eπ¯[VPπ(st+1,π¯(st+1))|st=s]
=Eπ¯[𝔼π¯[VPπ(st+2)|st=s|st=s]
=Eπ¯[VPπ(st+2)|st=s]
Eπ¯[VPπ(sn)|st=s]
=Eπ¯[R(sn)|st=s]
=VPπ¯(s)

where we have used the fact that, for all s, VPπ(s)VPπ(s,π¯(s)) since π¯ is a greedy policy.


An immediate consequence is the following:

Corollary 53.

There exists a policy π(a|s) that maximizes the expected reward for all states s, namely the greedy policy of Prop. 52 associated with the GFlowNet’s policy π yielding terminal distribution PT(s)=R(s)/Z.

How do can we estimate the expected reward under PT? We just need to train another flow (or set of heads for a GFlowNet, see Appendix F) with R2 as the reward function.

Proposition 54.

Consider two flows F and F, one matching terminal reward function R and the other matching terminal reward function R2. Then the expected reward under PT (the distribution over terminating states defined by the flow F) is

VPT(s)=F(s|s)F(s|s). (90)

Proof We start from Eq. 87 of the above corollary and notice that the numerator is the self-flow for F while the denominator is the self-flow for F (see Eq. 44).

D.1 Preference for High-Reward Early Trajectory

We have seen in Sec. 2.6 that by imposing a particular preference on PB, one can make the GFlowNet sampling mechanism prefer to construct states in some orders more than others, e.g., one could prefer to start with states with larger expected reward (over their potential continuations), using Def. 50. It suffices to define PB(st,at|st+1) so that it puts more probability mass on state-action pairs (st,at) with larger V((st,at)).

Appendix E Intermediate Rewards and Trajectory Returns

Up to now and in the GFlowNet paper (Bengio et al., 2021), we have considered terminal rewards as events happening only once per trajectory, at its end. Consider instead an agent experiencing a complete trajectory τ and declare its return to be the sum of some intermediate environment rewards associated with all the transitions into the sink node from each of the visited states.

Definition 55.

The trajectory return ρ(τ) associated with a partial trajectory τ=(si,si+1,sn,sf) is defined as

ρ(si,si+1,,sn,sf) =t=inR(st)=t=inF(stsf) (91)
ρ(sf) =0 (92)

and the expected future return ρ¯(st) associated with a state st is defined as

ρ¯(st) =E[ρ(st,st+1,st+2,,sn,sf)|st]
=st+1,,snP(st+1,,sn,sf|st)ρ(st,st+1,,sf) (93)

where the expectation is defined under the flow’s probability measure over trajectories (conditioned on the trajectory going through st).

Proposition 56.

The expected future return ρ¯(s) achievable from trajectories starting at s satisfies the following recursion:

ρ¯(s)=R(s)+sChild(s)PF(s|s)ρ¯(s). (94)

Proof From the definition of return (Def. 55), we obtain the following:

ρ¯(st) =st+1,st+2,P(st+1,,sf|st)(R(st)+ρ(st+1,st+2,,sn))
=R(st)+st+1PF(st+1|st)st+2,,snP(st+2,,sn|st+1)ρ(st+1,,sn)
=R(st)+st+1PF(st+1|st)E[ρ(st+1,st+2,)|st+1]
=R(st)+st+1PF(st+1|st)ρ¯(st+1).


One reason why the above recursion is interesting is that it corresponds to the Bellman equation (Sutton and Barto, 2018) for the value function (which is the expected downstream return) in the case of no discounting (with an episodic setting). It is interesting to compare it with one of the equations we obtain for the state flow (Eq. 22):

F(st) =st+1F(st+1)PB(st|st+1)
=R(st)+st+1sfPB(st|st+1)F(st+1)

The two recursions are different: one uses the forward transition to propagate values backward, while the other uses the backward transitions to propagate flows.

Definition 57.

Let us denote r(s) the possibly stochastic environment reward, provided when an agent visits state s (and generally distinct from the GFlowNet terminal reward at s), and consider the environment reward accumulated in the partial trajectory τ=(s0,s1,,sn) leading to state sn=s. Let us call the GFlowNet state return-augmented if the state s includes the accumulated reward ν(s), i.e., there exists a function ν(s)=t=0nr(st). Let us call a GFlowNet with a return-augmented state a return-augmented GFlowNet.

We want to keep track of accumulated intermediate rewards in the GFlowNet state to compute the terminal reward from the state, and thus train GFlowNets to sample in proportion to the accumulated reward:

Proposition 58.

Suppose G is a return-augmented GFlowNet with target terminal reward function R(s) equal to the accumulated environment reward ν(s). Furthermore, suppose G is trained to completion. Then sampling from G produces accumulated reward ν with probability proportional to ν.

Proof Since G is a GFlowNet trained to completion with terminal reward function R(s)=ν(s), we know that sampling from G samples terminal accumulated rewards ν(s) with probability proportional to ν.

Note that such a GFlowNet can only be trained offline, i.e., using trajectories which have terminated and for which we have observed the return.

Note also that if we did not augment the state with the accumulated reward, then the GFlowNet terminal reward would not be a function of the state. Having a return-augmented state also makes it possible to handle stochastic environment rewards in the GFlowNet framework.

Appendix F Multi-Flows, Distributional GFlowNets, Unsupervised GFlowNets and Pareto GFlowNets

Consider an environment with stochastic rewards. As with Def. 57, we could augment the state to include the random accumulated rewards, thus (by Prop. 58) making the GFlowNet sample trajectories with returns ρ occurring with probability proportional to ρ. However, similarly to Distributional RL (Bellemare et al., 2017), it could be interesting to generalize GFlowNets to capture not just the expected value of achievable terminal rewards but also other statistics of its distribution. More generally, we can think of this like a family of GFlowNets, each of which models in its flow a particular future environmental outcome of interest. With the particle analogy of GFlowNets, it would be as if the particles had a colour or label (just like the frequency of each photon in an group of photons travelling together in a beam of light) and that we separately account for the flows associated with all the possible label types.

If the number of outcomes (the number of possible labels) is small, this could be implemented with different output heads of the GFlowNet (e.g., one output for the flow associated with each label). When a trajectory associated with a particular label outcome is observed, the corresponding heads get gradients. A more powerful and general implementation puts the outcome event as an input of the GFlowNet, thus amounting to training a conditional GFlowNet (see Sec. 4.4), and formalized below.

Definition 59.

Let us define the outcome y=f(s) as a known function f(s) of the state s. An outcome can be whether an environment reward takes a particular value, or it can be a vector of important features of s which are sufficient to determine many possible environment reward functions (in particular, f can be the identity function). Let us call the conditional GFlowNet taking y as conditioning input, with conditional flows F(A|y) for events A over the trajectories consistent with reward function

Ry(s)=1f(s)=y (95)

an outcome-conditioned GFlowNet with outcome function f.

We will limit ourselves here to a discrete set of outcomes for simplicity but expect that this approach can be generalized to a continuous set. Note how, if the outcome-conditioned GFlowNet is trained to completion, it makes it possible to only sample terminating states s yielding the chosen outcome y. In principle, this allows sampling objects guaranteed to have a high reward (under the given reward function). In practice, a GFlowNet will never be perfectly trained to completion, and we should think of such an outcome-conditioned GFlowNet similarly to a goal-conditioned policy in RL (Ghosh et al., 2018) or the reward-conditioned upside-down RL (Schmidhuber, 2019). An interesting question for future work is to extend these outome-conditioned GFlowNets to the case of stochastic rewards or stochastic environments.

Definition 60.

A distributional GFlowNet is an outcome-conditioned GFlowNet taking as conditioning input the value of the environment return associated with complete trajectories. This can be achieved by making the GFlowNet return-augmented, so that the return can be read from the terminating state of the trajectory.

Training an outcome-conditioned GFlowNet can only be done offline because the conditioning input (e.g., the final return) may only be known after the trajectory has been sampled. A reasonable contrastive training procedure could thus proceed as follows:

  1. 1.

    Sample a trajectory τ+ according to an unconditional training policy πT.

  2. 2.

    Obtain the outcome y+=f(s+) from the terminating state s+ (occurring just before the sink state sf in τ+).

  3. 3.

    Update the conditional GFlowNet with τ+ and target terminating reward R(s+|y+)=1f(s+)=y+=1.

  4. 4.

    Sample a trajectory τ according to the conditional GFlowNet policy with condition y+.

  5. 5.

    Obtain the actual outcome y=f(s) for the terminating state s (occurring just before the sink state sf) in τ. If the GFlowNet was perfectly trained, we should have y+=y but otherwise, especially if the number of possible outcomes is large, this becomes unlikely.

  6. 6.

    Update the conditional GFlowNet using a flow-matching loss with trajectory τ and target terminating reward R(s|y)=1y=y+ (likely to be 0 if there are many possible values for y).

An interesting question for future work is to consider a smoother reward function instead of the sharp but sparse reward R(s|y)=1f(s)=y as conditional reward, in order to make training easier.

F.1 Defining a reward function a posteriori

The reward function may not be known a priori or it may be known only up to some unknown constants (e.g., defining a Pareto front) or we may wish to generalize GFlowNets so they can be trained or pre-trained in a more unsupervised way, with the specific generative task only specified afterwards. The following important proposition allows us to do just that, and convert an outcome-conditioned GFlowNet into one that samples according to a given reward function, without having to retrain the network.

Proposition 61.

Consider an outcome-conditioned GFlowNet trained to completion with respect to the possible outcomes y=f(s) over terminating states s and a terminal reward function R(s)=r(f(s)) given a posteriori (possibly after training the GFlowNet) as a function r(y) of the outcome y=f(s). Then a GFlowNet with flow Frf(A) over events A which matches target terminal reward function R=rf can be obtained from the flow F(A|y) of the outcome-conditioned GFlowNet via

Frf(A)=yr(y)F(A|y). (96)

For example, the GFlowNet policy πrf(a|s) for terminal reward R=rf can be obtained from

πrf(a|s)=yr(y)F((s,a)|y)yr(y)F(s|y) (97)

or

πrf(a|s)=yr(y)F(s|y)π(a|s,y)yr(y)F(s|y) (98)

where F(s|y), F((s,a)|y) and π(a|s,y) are the outcome-conditioned state flow, state-action flow and action policy respectively.

Proof We first clarify what Frf(A) means:

Frf(A)=sr(f(s))P(A|ssf)=sr(f(s))PB(A|ssf) (99)

where P(A|ssf) is the probability of event A (e.g., a particular state or transition) among the trajectories that end in terminal transition ssf, and it can be determined entirely by PB (considering all the backward paths starting at s and going back to the particular state or transition A). This means that P(A|ssf) does not depend of the choice of reward function, so Eq. 99 is valid for any r.

Clearly, we see that Eq. 96 works in the extreme case where r(y)=1y=y is an indicator function at some specific y value since the sum in Eq. 96 reduces to F(A|y) which corresponds to reward function Ry as per Eq. 95. This yields

F1y(A)=F(A|y)=s1f(s)=yP(A|ssf) (100)

where 1y denotes the function that, given s, returns 1y=s.

To complete the proof let us start with the right-hand side of Eq. 96 and insert the above definition of F(A|y) (Eq. 100), then swap the sums and use the indicator function to cancel the sum over y, and finally apply the definition of Frf(A) in Eq. 99:

yr(y)F(A|y) =yr(y)s1f(s)=yP(A|ssf)
=syr(y)1f(s)=yP(A|ssf)
=sr(f(s))P(A|ssf)
=Frf(A) (101)

recovering the left-hand side of Eq. 96 as desired.

This makes it possible to predict probabilities and perform sampling actions for all the possible outcomes y arising in different states (in the extreme where we know nothing about possible reward functions and the outcome is y=s), and then convert that GFlowNet on the fly to one specialized to a given terminal reward function R=rf. However, we note that it requires more computation for each action at run-time: we have to perform these sums (possibly via Monte-Carlo integration) over the outcome space, and there may be a computational time versus accuracy trade-off in the resulting decisions (based on how many Monte-Carlo samples are used to approximate the above sums).

F.2 Pareto GFlowNets

A related application of these ideas concerns Pareto optimization, where we are not sure about the correct reward function up to a few coefficients forming a convex combination of underlying objectives.

Definition 62.

The Pareto additive terminal reward functions can be written as

Rω(s)=iωifi(s) (102)

where ω{ωWd:ωi0,iωi=1} are convex weights and the outcomes of interest are d objectives yi=fi(s), and W is a discrete set of convex weights.

Definition 63.

The Pareto multiplicative terminal reward functions can be written as

Rω(s)=eiωiei(s) (103)

where ω{ωWd:ωi0,iωi=1} are convex weights and the outcomes of interest are d objectives yi=fi(s), and W is a discrete set of convex weights.

In these cases, we can train a conditional GFlowNet with ω as conditioning input and Rω(s)=R(s|ω) as conditional terminal reward function. At run-time, we can scan the set of ω’s in order to obtain different policies or predicted probabilities or free energies. The above can easily be generalized to a non-convex and non-linear combination of the objectives, so long as the combined objective is parametrized by ω. Note the similarity between this idea (and more generally outcome-conditioned GFlowNets) and the earlier work by Dosovitskiy and Djolonga (2019). The same idea of conditioning by a form of specification of the loss can be applied to GFlowNets to obtain a family of GFlowNets, one for each variant of the loss function.

A useful application of a Pareto GFlowNet with such reward functions is to draw samples from the Pareto frontier. Once the Pareto GFlowNet is trained, we can draw samples from the Pareto frontier by first sampling the convex weights ω and then sampling trajectories. This can be useful in multi-objective optimization or sampling, where we want to draw a diversity of solutions corresponding to different trade-off points of the various objectives.

We could also train an outcome-driven GFlowNet by providing the vector of objective values y as input and exploit prior knowledge about the objectives. For example, we may believe that different objectives can be modeled independently of each other and that F(s|y) (or similarly for F((s,a)|y)) can be written as a basis expansion F(s|y)=i=1dj=1Nϕj(yi)Fi,j(s) with N bases ϕj used to represent the objective yi. This could be advantageous from a generalization point of view if learning about the different objectives should be disentangled from one another (e.g., one is stationary and the other is not).