Yoshua Bengio, Kolya Malkin, Moksh Jain (2022)

(you have to click on the right-pointing triangle of toggle-able sections to see them if they are hidden)

## The Gist

A GFlowNet is a trained stochastic policy or generative model, trained such that it samples objects $x$ through a sequence of constructive steps, with probability proportional to a reward function $R(x)$, where $R$ is a non-negative integrable function. This makes a GFlowNet able to sample a diversity of solutions $x$ that have a high value of $R(x)$.

It is a stochastic policy because it makes sense to apply it when constructing $x$ can be done through a sequence of steps, which can be thought of as internal actions (they are meant to construct things like thoughts or plans or explanations). That sequential construction is convenient to sample compositional objects, which can often be defined by composing elements in some order, with generally many possible orders leading to the same object.

Figure 1. A constructive trajectory to build a lego object goes through a sequence of steps (each adding a lego block). The same object could be obtained in exponentially many different ways. The state of the GFlowNet describes the object (or partially constructed object) and the trajectories leading into it specify how it could have been build. The forward-sampling policy of the GFlowNet performs one change to the state at a time (e.g. adding one lego block), but we can also define a backward-sampling policy that undoes such changes (e.g. removing a lego block), and they can be combined (e.g., to form a Markov chain by which we can fix our current attempt at a lego boat). When we decide to show our construction, we have reached a terminal state and we may get a reward which, e.g., is higher for boat-looking constructions, and can also be interpreted as exp(-energy), i.e., as an unnormalized probability function. The GFlowNet forward-sampling policy tells us how we could generate the potentially exponentially many boat-looking constructions. An intermediate quantity can be estimated, the flow, and the flow in the initial state corresponds to a free energy or normalizing constant (summing over all values of the reward of attainable terminal states). This flow can be generalized to estimate marginalized quantities such as marginal and conditional probabilities over subsets of random variables, which require in principle summing over exponentially many paths or terminal states.

A neural net can be used to sample each of these forward-going constructive actions, one at a time. An object $x$ is done being constructed when a special "exit" action or a deterministic criterion of the state (e.g., $x$ has exactly $n$ elements) has been triggered, when we have reached a terminal state $x$, and after which we can get a reward $R(x)$. Seen as a stochastic policy, the GFlowNet has an action space (for deciding what to do at each step) and a state space (for the partially constructed objects). Each sequence of actions $(a_0, a_1, \ldots)$ forms a trajectory $\tau$ that is a sequence of states $(s_0,s_1,s_2,\ldots)$. There may be many trajectories that lead to the same state $s_t$ (think about how one can construct a set by inserting elements in any order, or build a graph by pasting graph pieces again in many possible orders, or like in Figure 1, constructing the same lego boat in many different ways). The last state $s_n$ of a complete trajectory $\tau$ is an object $x \in \cal X$ that the GFlowNet can sample, and the training objective aims at making it sample $x$ with probability proportional to $R(x)$. Note that [2] discusses how to generalize this to having intermediate rewards and not just at the end of the trajectory.

A GFlowNet is a generative model because after (and during) training we can sample from it, but in its most basic form its training objective is about matching a reward function $R$ rather than fitting a finite dataset (the usual objective for generative models). Since $R$ is not an external quantity (like in typical RL) but an internal quantity (e.g. corresponding to an energy function in a world model), it means that the quality of training of a GFlowNet does not depend on the size of a fixed external training dataset but rather on how much compute is available to query $R$ on many trajectories. Training the GFlowNet can be done by repeatedly querying that function, so we can make our GFlowNet as big a neural net as we can afford computationally, without necessarily worrying about overfitting. Going back to the lego construction, we are not trying to find the most boat-looking lego structure, but a way to sample from all the boat-looking structures. Because of the sequential construction of objects $x$ (with an unbounded number of steps), it is very convenient to generate variable-size structured objects, like sets, graphs, lists, programs or other recursively constructed data structures. Note that a GFlowNet can also define a backward-sampling procedure, i.e., given a constructed object, we could sample a plausible trajectory that could have led to constructing it. The forward-sampling policy is called $P_F$ below and the backward-sampling policy is called $P_B$.

In terms of neural net architectures, the simplest GFlowNet architecture is one where we have a neural net (as large as we can afford) that outputs a stochastic policy $\pi(a_t|s_t)$, where $s_t$ represents a partially constructed object and $a_t$ is one of the possible actions from $s_t$. In regular GFlowNets, choosing $a_t$ from $s_t$ deterministically yields some $s_{t+1}$, which means that we can also write $\pi(a_t|s_t)=P_F(s_{t+1}|s_t)$ for that policy. It makes sense to consider deterministic transitions when the policy is an internal policy, not operating in some stochastic external environment, but instead performing computational choices (like what to attend to, what computation to perform, what memory to retrieve, etc). GFlowNets are motivated by the kind of internal policy which could model cognition, i.e., the sequence of micro-choices about internal computation of a thinking agent corresponding to a form of probabilistic inference. $P_F$ stands for the "forward" transition probability along the GFlowNet constructive process to generate these structured internal computational constructions. The same neural net is used at every step, but it produces a stochastic output $a_t$, from which a next state $s_{t+1}=T(s_t,a_t)$ is obtained (the form of $T$ is application-dependent, e.g., following the rules of chemistry if $s_t$ is a partially constructed molecule represented as a graph and $a_t$ puts an atom as a specific new node connected to an existing node in the graph). Since the state is a variable-size object, the neural net better have an appropriate architecture for taking such objects in input (e.g., an RNN, a graph neural net or a transformer). We can thus more generally see the iterated application of the GFlowNet decisions as a particular form of recurrent stochastic net where the hidden recurrent state ($s_t$) is stochastic.

Figure 2: the most basic component of a GFlowNet is a neural network that defines its constructive policy, i.e., how to sample the next state $s_{t+1}$ given the previous state $s_t$, through the choice of an action $a_t$. The new state $s_{t+1}$ becomes the input for the next application of the GFlowNet policy denoted $P_F$ or $\pi$, until a special "exit" action signals the end of the sequence, that an object $x=s_n$ has been generated, and a reward $R(x)$ is provided to encourage or discourage the policy from generating such a solution. The training objective optimizes the parameters of $P_F$ towards making $x$ be generated with probability proportional to $R(x)$.

### GFlowNets for Brain-Inspired Reasoning

Brain sciences show that conscious reasoning involves a sequential process of thought formation, where at each step a competition takes place among possible thought contents (and relevant parts of the brain with expertise on that content), and each thought involves very few symbolic elements (a handful). This is the heart of the Global Workspace Theory (GWT), initiated by Baars (1993,1997) and extended (among others) by Dehaene et al (2011, 2017, 2020) as well as through Graziano's Attention Schema Theory (2011, 2013, 2017). In addition, it is plausible (and supported by works on the link between Bayesian reasoning and neuroscience) that each such step is stochastic. This suggests that something like a GFlowNet could learn the internal policy that selects that sequence of thoughts. In addition, the work on GFlowNet as amortized inference learners [4,5,7,9 below], both from a Bayesian [7] and variational inference [9] perspectives, would make this kind of computation very useful to achieve probabilistic inference in the brain. GFlowNets can learn a type of System 1 inference machine (the amortized approximate inference $Q$ in variational inference) that is trained to probabilistic select answers to questions of a particular kind so as to be consistent with System 2 modular knowledge (the “world model” $P$ in variational inference). That System 1 inference machinery $Q$ is also crucial to help train the model $P$ (as shown in several of these papers). Unlike typical neural networks, the inference machinery does not need to be trained only from external (real) data, it can take advantage of internally generated (hallucinated) pseudo-data in order to make $Q$ consistent with $P$. The real data is of course important to make $P$ (the world-model) as much as compatible with the real world. The kind of sequential probabilistic inference that GFlowNets can perform is a powerful form of learned reasoning machinery, which could be used for interpretation of sensory inputs, interpretation of selected past observations, planning, and counterfactuals (what if an agent had acted differently in the past). The stochastic selection of just a few elements of content (that go into a thought) make GFlowNets a good candidate to implement the “consciousness priors” introduced by Bengio (2017) and elaborated by Goyal and Bengio (2022). In particular, the GWT bottleneck, when applied to such probabilistic inference, would enforce the inductive bias that the graph of dependencies between high-level concepts (that we can manipulate consciously and reason with) is very sparse (with small degree corresponding to the 5-ish capacity of working memory).

### Where would we (or the brain) get the energy or reward function in practice?

A trained GFlowNet is both a sampler (to generate objects $x$ with probability proportional to $R(x)$) and an inference machine (it can be used to answer questions and predict probabilities about some variables in $x$ given other variables, marginalizing over the others). But that is with respect to a given unnormalized probability function $R(x)$ or energy function ${\cal E}(x) = -\log R(x)$ for that joint probability model. Where would a learning agent get that energy function from?

We are working on several approaches to do that (more details in this section below):