Yoshua Bengio, Kolya Malkin, Moksh Jain (2022)

(tinyurl link to this page: https://tinyurl.com/gflownet-tutorial)

(see also Emmanuel Bengio's initial blog post on GFlowNets as well as the GFlowNet as Amortized Samplers and Marginalization Estimators tutorial)

The Gist

A GFlowNet is a stochastic policy or generative model trained such that it samples objects $x$ through a sequence of constructive steps, with probability proportional to a reward function $R(x)$, where $R$ is a non-negative integrable (or, in the discrete case, summable) function. This makes a GFlowNet able to sample a diversity of solutions $x$ that have a high value of $R(x)$.
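In symbols, writing $\cal X$ for the (here assumed discrete) set of objects that can be constructed and $P_T$ for the distribution over terminal objects induced by the GFlowNet, the goal is

$$P_T(x) = \frac{R(x)}{Z}, \qquad Z = \sum_{x' \in \cal X} R(x'),$$

with the sum replaced by an integral when $\cal X$ is continuous. The normalizing constant $Z$ is not known in advance, but, as discussed below, it can itself be estimated by the GFlowNet.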

It is called a stochastic policy because it is meant to be applied when constructing $x$ can be done through a sequence of steps, which can be thought of as internal actions (they are meant to construct things like thoughts, plans or explanations). Such sequential construction is convenient for sampling compositional objects, which can often be defined by composing elements in some order, with generally many possible orders leading to the same object.

Figure 1. A constructive trajectory to build a lego object goes through a sequence of steps (each adding a lego block). The same object could be obtained in exponentially many different ways. The state of the GFlowNet describes the object (or partially constructed object) and the trajectories leading into it specify how it could have been built. The forward-sampling policy of the GFlowNet performs one change to the state at a time (e.g., adding one lego block), but we can also define a backward-sampling policy that undoes such changes (e.g., removing a lego block), and they can be combined (e.g., to form a Markov chain by which we can repair our current attempt at a lego boat). When we decide to show our construction, we have reached a terminal state and we may get a reward which, e.g., is higher for boat-looking constructions, and can also be interpreted as exp(-energy), i.e., as an unnormalized probability function. The GFlowNet forward-sampling policy tells us how we could generate the potentially exponentially many boat-looking constructions. An intermediate quantity, the flow, can also be estimated; the flow at the initial state corresponds to a free energy or normalizing constant (the sum of the rewards of all attainable terminal states). This flow can be generalized to estimate marginalized quantities such as marginal and conditional probabilities over subsets of random variables, which in principle require summing over exponentially many paths or terminal states.

A neural net can be used to sample each of these forward-going constructive actions, one at a time. An object $x$ is fully constructed when a special "exit" action is chosen or a deterministic criterion of the state is met (e.g., $x$ has exactly $n$ elements); at that point we have reached a terminal state $x$, after which we can get a reward $R(x)$. Seen as a stochastic policy, the GFlowNet has an action space (for deciding what to do at each step) and a state space (for the partially constructed objects). Each sequence of actions $(a_0, a_1, \ldots)$ forms a trajectory $\tau$ that is a sequence of states $(s_0,s_1,s_2,\ldots)$. There may be many trajectories that lead to the same state $s_t$ (think about how one can construct a set by inserting elements in any order, build a graph by pasting graph pieces together in many possible orders, transform a vector through a sequence of stochastic steps as in Denoising Diffusion, or, as in Figure 1, construct the same lego boat in many different ways). The last state $s_n$ of a complete trajectory $\tau$ is an object $x \in \cal X$ that the GFlowNet can sample, and the training objective aims at making it sample $x$ with probability proportional to $R(x)$. Note that [2] discusses how to generalize this to intermediate rewards rather than only a reward at the end of the trajectory.
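To make the state/action/trajectory structure concrete, here is a minimal toy sketch (our own illustration, not taken from the papers cited here): states are partially built sets, each action adds one element or exits, and many distinct trajectories reach the same terminal object. The reward function below is an arbitrary placeholder.

```python
from itertools import permutations

ELEMENTS = {0, 1, 2}   # universe of elements we may add
EXIT = "exit"          # special action that terminates the trajectory

def actions(state):
    """Available actions from a partially built set: add a missing element, or exit."""
    return sorted(ELEMENTS - state) + [EXIT]

def step(state, action):
    """Deterministic transition s_{t+1} = T(s_t, a_t): add one element to the set."""
    assert action != EXIT
    return state | {action}

def reward(x):
    """Placeholder non-negative reward R(x); here it simply prefers larger sets."""
    return float(len(x)) + 0.1

# Many trajectories lead to the same terminal object: the full set {0, 1, 2}
# can be built in 3! = 6 different element orders, each a distinct trajectory
# (s_0, s_1, s_2, s_3) followed by the exit action.
trajectories_to_full_set = [
    [frozenset(), frozenset(p[:1]), frozenset(p[:2]), frozenset(p)]
    for p in permutations(sorted(ELEMENTS))
]
print(len(trajectories_to_full_set))  # -> 6
```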

A GFlowNet is a generative model because after (and during) training we can sample from it, but in its most basic form its training objective is about matching a reward function $R$ rather than fitting a finite dataset (the usual objective for generative models). Since $R$ is not an external quantity (as in typical RL) but an internal one (e.g., corresponding to an energy function in a world model), the quality of training of a GFlowNet does not depend on the size of a fixed external training dataset but rather on how much compute is available to query $R$ on many trajectories. Training the GFlowNet can be done by repeatedly querying that function, so we can make our GFlowNet as big a neural net as we can afford computationally, without necessarily worrying about overfitting; underfitting, instead, is the real danger. Going back to the lego construction, we are not trying to find the most boat-looking lego structure, but a way to sample from all the boat-looking structures. The sequential construction of objects $x$ (with an arbitrary number of steps) makes it very convenient to generate variable-size structured objects, like sets, graphs, lists, programs or other recursively constructed data structures. Note that a GFlowNet can also define a backward-sampling procedure, i.e., given a constructed object, we can sample a plausible trajectory that could have led to constructing it. The forward-sampling policy is called $P_F$ below and the backward-sampling policy is called $P_B$.
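To make the idea of "matching the reward function" concrete: one widely used way to state the training objective, the trajectory balance criterion, asks that for every complete trajectory $\tau=(s_0,s_1,\ldots,s_n=x)$,

$$Z \prod_{t=0}^{n-1} P_F(s_{t+1}\mid s_t) \;=\; R(x) \prod_{t=0}^{n-1} P_B(s_t\mid s_{t+1}),$$

where $Z>0$ is a learned scalar. When this constraint holds for all trajectories, the probability of terminating at $x$ (summed over all trajectories leading to $x$) is exactly $R(x)/Z$, and $Z$ equals the normalizing constant $\sum_x R(x)$. In practice, one minimizes the squared difference between the logarithms of the two sides over sampled trajectories.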

Importantly, when the reward function $R$ represents the product of a prior (over some random variable) times a likelihood (measuring how well that choice of value of the random variable fits some data), the GFlowNet will learn to sample from the corresponding Bayesian posterior. This makes GFlowNets amortized probabilistic inference machines that can be used to sample latent variables (as in [14]) as well as parameters and theories shared across examples (as in [5]). They can also be used to perform approximate and amortized marginalization (without the need for sampling at run-time), as discussed in this other tutorial, and this is useful for training AI systems that can answer probabilistic questions (e.g., about probabilities or expected values of some variables given others, or to estimate information-theoretic quantities like conditional entropies or mutual information).
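As a worked instance of this point: if $\theta$ is the object being constructed, $\cal D$ denotes the observed data, and the reward is chosen as

$$R(\theta) = p(\theta)\, p(\cal D \mid \theta),$$

then sampling proportionally to $R$ yields

$$P_T(\theta) = \frac{p(\theta)\, p(\cal D \mid \theta)}{\sum_{\theta'} p(\theta')\, p(\cal D \mid \theta')} = p(\theta \mid \cal D),$$

i.e., the Bayesian posterior, and the corresponding normalizing constant estimated by the GFlowNet is the marginal likelihood $p(\cal D)$.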

In terms of neural net architectures, the simplest GFlowNet architecture is one where we have a neural net (as large as we can afford) that outputs a stochastic policy $\pi(a_t|s_t)$, where $s_t$ represents a partially constructed object and $a_t$ is one of the possible actions from $s_t$. In regular GFlowNets, choosing $a_t$ from $s_t$ deterministically yields some $s_{t+1}$, which means that we can also write $\pi(a_t|s_t)=P_F(s_{t+1}|s_t)$ for that policy. It makes sense to consider deterministic transitions when the policy is an internal policy, not operating in some stochastic external environment, but instead performing computational choices (like what to attend to, what computation to perform, what memory to retrieve, etc.). GFlowNets are motivated by this kind of internal policy, which could model cognition, i.e., the sequence of micro-choices about internal computation of a thinking agent, corresponding to a form of probabilistic inference. $P_F$ stands for the "forward" transition probability along the GFlowNet constructive process that generates these structured internal computational constructions. The same neural net is used at every step, but it produces a stochastic output $a_t$, from which a next state $s_{t+1}=T(s_t,a_t)$ is obtained (the form of $T$ is application-dependent, e.g., following the rules of chemistry if $s_t$ is a partially constructed molecule represented as a graph and $a_t$ puts an atom as a specific new node connected to an existing node in the graph). Since the state is a variable-size object, the neural net should have an appropriate architecture for taking such objects as input (e.g., an RNN, a graph neural net or a transformer). We can thus more generally see the iterated application of the GFlowNet's decisions as a particular form of recurrent stochastic network whose hidden recurrent state ($s_t$) is stochastic.
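As an illustration, here is a minimal PyTorch-style sketch (our own, not taken from the papers cited here) of such a forward policy and of rolling out one trajectory. The `env` object and its methods (`initial_state`, `encode`, `step`, `exit_action`) are hypothetical placeholders for the application-dependent state representation and transition function $T$; for brevity, the state encoding is assumed fixed-size and masking of invalid actions is omitted.

```python
import torch
import torch.nn as nn

class ForwardPolicy(nn.Module):
    """Maps a (fixed-size encoding of a) state s_t to logits over actions a_t,
    with the last action index reserved for the special "exit" action."""
    def __init__(self, state_dim: int, num_actions: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_actions),
        )

    def forward(self, state_encoding: torch.Tensor) -> torch.Tensor:
        # Unnormalized log pi(a_t | s_t) = log P_F(s_{t+1} | s_t)
        return self.net(state_encoding)

def sample_trajectory(policy: ForwardPolicy, env):
    """Roll out the forward policy P_F one action at a time until "exit" is chosen.

    `env` is a hypothetical environment assumed to provide: initial_state(),
    encode(state) -> tensor, step(state, action) -> next state, and exit_action (int).
    """
    state = env.initial_state()
    states, log_pf = [state], 0.0   # log P_F(tau) = sum_t log pi(a_t | s_t)
    while True:
        dist = torch.distributions.Categorical(logits=policy(env.encode(state)))
        action = dist.sample()
        log_pf = log_pf + dist.log_prob(action)
        if action.item() == env.exit_action:
            return states, log_pf              # terminal state reached: x = states[-1]
        state = env.step(state, action.item())  # s_{t+1} = T(s_t, a_t)
        states.append(state)
```

The accumulated `log_pf` is the log-probability of the sampled trajectory under $P_F$, which is the quantity needed by training criteria such as the trajectory balance constraint shown earlier.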

Figure 2: The most basic component of a GFlowNet is a neural network that defines its constructive policy, i.e., how to sample the next state $s_{t+1}$ given the previous state $s_t$, through the choice of an action $a_t$. The new state $s_{t+1}$ becomes the input for the next application of the GFlowNet policy, denoted $P_F$ or $\pi$, until a special "exit" action signals the end of the sequence, i.e., that an object $x=s_n$ has been generated, at which point a reward $R(x)$ is provided to encourage or discourage the policy from generating such a solution. The training objective optimizes the parameters of $P_F$ towards making $x$ be generated with probability proportional to $R(x)$.


GFlowNets for Brain-Inspired Reasoning