Flow Matching YouTube Video Transcript

Video link: How I Understand Flow Matching

My kids love Play-Doh. Last time, they made a Play-Doh version of their stuffy. It's pretty amazing that we can create arbitrarily complex shapes by squeezing, pressing, stretching, rolling, and twisting a simple ball of Play-Doh.

This is the core idea of flow-based generative models.

In this video, we are going to talk about the main ideas behind normalizing flows, continuous normalizing flows, and a scalable training method called Flow Matching. We'll leave the techniques for scaling up the training for the next video.

Imagine we collect a dataset of images. It would be awesome if we could model the data distribution. We can create new images from this distribution or evaluate the likelihood of a sample. But we don't know what the true data distribution is. We only have the samples.

On the other hand, we have simple base distributions, like a Gaussian distribution, from which we can easily draw samples and evaluate their probability. The idea is to build a generator that transforms a simple distribution into a data distribution.

We can train this generator by maximum likelihood. This corresponds to minimizing the KL divergence between the two distributions.

So, how do we get a likelihood? Here, we have a sampled noise vector \(z\). We can transform this noise sample \(z\) into an image \(x\). If our generator is invertible, then given an image we can recover the corresponding noise \(z\) that generated it.

Can we compute a likelihood like this? Well, not quite.

Let's take a look at a one-dimensional example. Here, our base distribution is a uniform distribution between 0 and 1. Suppose our generator just stretches the \(z\)-value by a factor of 2. We see that the density of our transformed distribution is now half of the original density. This is because the probability contained in these areas must stay the same.

The same concept works for any one-dimensional density function. We can compute the likelihood by multiplying by a scalar factor that accounts for how much the density function is stretched or compressed in the local region. We take the absolute value because an increasing and a decreasing mapping produce the same density.
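In symbols, a sketch of this one-dimensional relation (writing the generator as \(x = g(z)\)):

\[
p_x(x)\,|\mathrm{d}x| = p_z(z)\,|\mathrm{d}z|
\quad\Longrightarrow\quad
p_x(x) = p_z(z)\left|\frac{\mathrm{d}z}{\mathrm{d}x}\right| .
\]

For the stretching example above, \(x = 2z\) gives \(|\mathrm{d}z/\mathrm{d}x| = 1/2\), so the transformed density is half of the original.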

Okay, now let's check the two-dimensional case.

We look at a specific location \(z\) and its local neighborhood. One vector specifies the change in the \(x_1\) and \(x_2\) directions caused by a small change in \(z_1\), and the other specifies the change caused by a small change in \(z_2\). We compute the area spanned by these two vectors using the determinant.

We can now write down the relation that the probabilities in these areas must stay the same. Here's a visual example. Now we can move the \(\Delta z\) to the other side and move them into the determinant. We see that these are just partial derivatives. Transposing the matrix does not affect the determinant.

This matrix has a name — it's the Jacobian matrix.

Let’s simplify this a bit further. We can move the determinant of the Jacobian matrix to the right-hand side and express the reciprocal as the determinant of the inverse transform. This is known as the change of variable formula.

Now, back to the maximum likelihood estimation. Using this formula, we can write the log-likelihood as two terms.
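Written out (a sketch, with \(z = G^{-1}(x)\) and \(J_{G^{-1}}\) the Jacobian of the inverse map), the change of variable formula and the resulting two-term log-likelihood are:

\[
p_x(x) = p_z\big(G^{-1}(x)\big)\left|\det J_{G^{-1}}(x)\right|,
\qquad
\log p_x(x) = \log p_z\big(G^{-1}(x)\big) + \log\left|\det J_{G^{-1}}(x)\right| .
\]

Maximum likelihood training maximizes the expectation of this log-likelihood over the training data.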

How do we compute this?

First, we need an invertible generator \(G\). Second, we need a way to compute the determinant of the Jacobian matrix efficiently.

It’s hard to create a complex transformation with just one single generator. In practice, we compose a collection of generators to gradually transform a simple base distribution into a complicated data distribution. The likelihood computation of such a model is also simple and involves multiplying each individual determinant.
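As a sketch, with intermediate variables \(x_0 = z\), \(x_k = G_k(x_{k-1})\), and \(x = x_K\), the log-likelihood of the composed model is a sum of per-layer terms:

\[
\log p_x(x) = \log p_z(x_0) - \sum_{k=1}^{K} \log\left|\det J_{G_k}(x_{k-1})\right| .
\]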

Now, let's see some examples of invertible generators in which the determinant of the Jacobian matrix is easy to compute.

One popular design is called the coupling layer. The first step is to split the input into two disjoint halves. The first half is passed through unchanged. A neural network takes the first half as input and predicts a scale and a translation vector, which transform the second half by element-wise multiplication and addition. The two halves are then concatenated as the output of the coupling layer.

Let’s ask two questions:

First, is this generator invertible?
Given \(x\), we copy the first half. Then we recompute the scale and translation from that first half and invert the element-wise transform to recover the second half of \(z\). Here, the neural network can be very complex and does not need to be invertible.

Second, can we compute the determinant of the Jacobian matrix efficiently?

Here’s the Jacobian matrix of a coupling layer:

  • The top-left part is an identity matrix since we copy the first half of the input directly to the output.
  • The top-right part is all zeros because the first half of the output does not depend on the second half of the input.
  • The bottom-left part is tricky — it involves a neural network, so it can be complex — but we don’t care because it doesn’t affect the determinant.
  • The bottom-right part is a diagonal matrix because it only has element-wise multiplication and addition.

The determinant is just the product of all the predicted scale values.
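Here is a minimal sketch of an affine coupling layer, assuming PyTorch; the network predicts a log-scale so the scale stays strictly positive. Real implementations (e.g., RealNVP) additionally use masking patterns and stack many such layers.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Minimal affine coupling layer (RealNVP-style sketch)."""

    def __init__(self, dim, hidden=128):
        super().__init__()
        self.half = dim // 2
        # A small MLP predicts log-scale and translation from the first half.
        self.net = nn.Sequential(
            nn.Linear(self.half, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * (dim - self.half)),
        )

    def forward(self, z):
        z1, z2 = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(z1).chunk(2, dim=-1)
        x2 = z2 * torch.exp(log_s) + t           # element-wise scale and shift
        log_det = log_s.sum(dim=-1)              # log|det J| = sum of log-scales
        return torch.cat([z1, x2], dim=-1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(x1).chunk(2, dim=-1)
        z2 = (x2 - t) * torch.exp(-log_s)        # invert the element-wise transform
        return torch.cat([x1, z2], dim=-1)
```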

When we stack these layers together, we need to shuffle these splits around to ensure that all the dimensions are updated. The original paper used a special checkerboard pattern and channel masking to create different splits. This type of permutation is later generalized by invertible 1x1 convolutions. Training with 1x1 convolutions achieves lower negative log-likelihood and can generate high-resolution samples.

Another example is autoregressive flows.

To generate the value at the \(i\)-th position, we use the values before the \(i\)-th position to compute a conditioning signal \(h_i\), and use it to transform \(z_i\) into \(x_i\) with an invertible function \(T\).

Note that this transformer has nothing to do with the transformers we use in language modeling.

We can generate all the outputs following this strategy. If the transformation \(T_i\) is invertible, we can find the corresponding input \(z\). But this inversion process is sequential.

Fortunately, the forward sampling process can be easily parallelized. The Jacobian matrix has a lower triangular structure because it’s autoregressive. This means we don’t need a full Jacobian to compute the determinant — we just need to compute the gradients on the diagonal and multiply them together.
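Concretely, for a triangular Jacobian the determinant reduces to the product of its diagonal entries:

\[
\det J = \prod_i \frac{\partial x_i}{\partial z_i}
\qquad\Longrightarrow\qquad
\log\left|\det J\right| = \sum_i \log\left|\frac{\partial x_i}{\partial z_i}\right| .
\]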

So far, we have seen the coupling blocks and the autoregressive flows. To make the computation tractable, we’ve sacrificed some model expressiveness.

Is it possible to have a free-form Jacobian matrix?

This is the idea behind residual flows.

The layers in residual flows are very simple: it processes the input \(z\) with a neural network \(u\), and adds the output \(u(z)\) back to produce the final output.

Is this invertible?

Given \(x\), can we know the corresponding \(z\)? In general, it is not invertible. But the residual layer becomes invertible if the function \(u\) is a contraction mapping. This means that the distance between two points after the mapping is smaller than the distance before the mapping.

If \(u\) is a contraction mapping, then the map \(f(z) = x - u(z)\) is also a contraction, so it has a unique fixed point \(z^*\). Applying the fixed-point theorem, we get \(z^* = x - u(z^*)\). Shuffling the equation a bit, \(z^* + u(z^*) = x\), so \(z^*\) is exactly the input we want. Since \(z^*\) is unique, we can invert this residual layer.

The theorem also gives us a bonus — it shows us an iterative algorithm to find the unique \(z^*\).
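A minimal sketch of this inversion, assuming PyTorch and that \(u\) has Lipschitz constant below 1 so the iteration converges:

```python
import torch

def invert_residual_layer(x, u, num_iters=50):
    """Invert x = z + u(z) by fixed-point iteration on f(z) = x - u(z),
    assuming u is a contraction so z_{k+1} = x - u(z_k) converges to z*."""
    z = x.clone()
    for _ in range(num_iters):
        z = x - u(z)
    return z
```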

How about the determinant?

With some math, we can expand the log-determinant into an infinite series of matrix traces. But this looks scary — we need to compute traces of the Jacobian raised to the \(k\)-th power and sum up infinitely many terms. How is this possible?
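For reference, the expansion being described is (valid when the Jacobian \(J_u\) of \(u\) has spectral norm less than 1, which the contraction assumption guarantees):

\[
\log\left|\det\big(I + J_u(z)\big)\right|
= \operatorname{tr}\log\big(I + J_u(z)\big)
= \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k}\,\operatorname{tr}\big(J_u(z)^k\big) .
\]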

Luckily, we can use some tricks to simplify the computation.

To estimate the trace of a matrix \(A\), we insert an identity matrix. We can rewrite the identity matrix as the covariance matrix of a random Gaussian vector \(v\) with zero mean and unit variance.

The linearity of expectation allows us to shuffle things around and obtain \(\operatorname{tr}(A) = \mathbb{E}_v[v^\top A v]\). We can now estimate the trace efficiently using Monte Carlo sampling. But we still cannot evaluate infinitely many terms.
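As a quick illustration of this Monte Carlo trace estimate, here is a sketch with an explicit matrix; in residual flows the Jacobian is never formed explicitly, and vector–Jacobian products play the role of \(A v\):

```python
import numpy as np

def hutchinson_trace(A, num_samples=10_000, seed=0):
    """Estimate tr(A) as E[v^T A v] with v ~ N(0, I) (Hutchinson's estimator)."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal((num_samples, A.shape[0]))   # Gaussian probe vectors
    return np.einsum("ni,ij,nj->n", v, A, v).mean()

A = np.random.default_rng(1).standard_normal((10, 10))
print(hutchinson_trace(A), np.trace(A))                  # the two should be close
```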

The Residual Flow paper introduces a trick to compute an unbiased estimate by randomly truncating the series to a finite number of terms (a "Russian roulette" estimator).

Next, we’ll see how we can generalize residual flow ideas to continuous normalizing flows.

Let’s take a closer look at the residual flow method. It gradually transforms a simple base distribution into a data distribution via \(K\) residual layers. Moving these terms around, we get something that looks like a derivative. When we increase the number of layers \(K\) to infinity, we get an ordinary differential equation (ODE) saying that the change in the position of a sample follows the vector field.
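Written out informally, if we scale each residual update by a step size \(\Delta t = 1/K\):

\[
x_{t+\Delta t} = x_t + \Delta t\, u_t(x_t)
\quad\Longrightarrow\quad
\frac{x_{t+\Delta t} - x_t}{\Delta t} = u_t(x_t)
\;\;\xrightarrow{\;\Delta t \to 0\;}\;\;
\frac{\mathrm{d}x_t}{\mathrm{d}t} = u_t(x_t) .
\]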

Our goal is to represent this time-varying vector field with a neural network parameterized by \(\theta\). This is called a neural ODE.

Let’s visualize what this looks like. Here we see the vector field gradually transforms a simple base distribution, like a Gaussian, into a more complicated one. The arrows there specify the time-varying vector fields.

Now we understand how vector fields push samples around in space. How does the probability density change at a specific location?

Let’s use a 1D example to build our intuition. We plot the probability distributions over \(x\) at time \(t\). For a specific position \(x\), we have a probability density of \(p_t(x)\). At time \(t + \varepsilon\), let’s say the probability density becomes lower at the same position.

How can we explain this with our vector field?

At this position, the vector field must have pushed the samples away from the position. We can use the spatial gradient of the vector field to quantify the local “outgoingness” — in other words, how much it diverges.

When the vector field flows into the position, this spatial derivative is negative. The change in probability density over time and the local net outflow of probability must sum to zero.

The same relation holds for higher-dimensional data, so we can replace the one-dimensional spatial derivative with its higher-dimensional counterpart, the divergence operator.

This is known as the continuity equation or transport equation.
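In symbols, for the density \(p_t(x)\) and the vector field \(u_t(x)\):

\[
\frac{\partial p_t(x)}{\partial t} + \nabla \cdot \big(p_t(x)\,u_t(x)\big) = 0 .
\]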

The continuity equation gives us a tool for training continuous normalizing flows using maximum likelihood. Unfortunately, computing the log-likelihood involves integrating the vector field (and its divergence) over time with an ODE solver. This limits the scalability of training continuous normalizing flows on large datasets or high-resolution images.

Next, we’ll use flow matching to enable scalable training of continuous normalizing flows.

Let’s look at the continuity equation again. Instead of focusing on learning the right probability density — which is computationally expensive — we can instead just learn to match the flow.

The time-varying vector field \(u_t\) fully determines the probability path and the final target distribution. This insight leads to the flow matching objective. The goal is to train the neural network to match the vector field.

Here, the probability path interpolates from the base distribution at time zero to the target distribution at time one. This looks great — the training objective is just a simple \(L_2\) regression loss. It’s simple to implement and does not involve integrating the vector field during training.
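Written out, the flow matching objective (with \(v_\theta\) the neural network and \(u_t\) the vector field that generates the probability path \(p_t\)):

\[
\mathcal{L}_{\mathrm{FM}}(\theta)
= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x \sim p_t(x)}
\left\| v_\theta(t, x) - u_t(x) \right\|^2 .
\]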

But something terrible happened — we don’t know what the probability path or the vector field is. If we knew the vector field directly, why would we need this neural network?

The trick is creating training data for the probability path and the vector field using conditioning.

Here, we express the marginal probability path as a mixture of conditional probability paths that vary with some conditioning variable \(z\). Using conditions, we can design a valid conditional probability path and vector field for training.
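In symbols (following the Flow Matching paper, with conditioning distribution \(q(z)\)), the marginal path and marginal vector field are:

\[
p_t(x) = \int p_t(x \mid z)\, q(z)\,\mathrm{d}z,
\qquad
u_t(x) = \int u_t(x \mid z)\,\frac{p_t(x \mid z)\, q(z)}{p_t(x)}\,\mathrm{d}z .
\]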

Let’s say our condition is a single data point in our training dataset — we call it \(x_1\). Here is the equation and the visualization of the conditional probability path.

The vector field is also very simple. Intuitively, for any point \(x\), we move toward the data point \(x_1\). The speed depends on the time, ensuring that we land exactly on the data point \(x_1\) when time equals 1.
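One concrete choice, as in the Flow Matching paper (using a small \(\sigma_{\min}\) so the path ends in a narrow Gaussian around \(x_1\)):

\[
p_t(x \mid x_1) = \mathcal{N}\!\big(x \mid t\,x_1,\; (1 - (1-\sigma_{\min})\,t)^2 I\big),
\qquad
u_t(x \mid x_1) = \frac{x_1 - (1-\sigma_{\min})\,x}{1 - (1-\sigma_{\min})\,t} .
\]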

We can now define the conditional flow matching objective. The conditional probability path and the conditional vector field are both easy to compute. Surprisingly, the gradients of the conditional flow matching objective are the same as those of the unconditional one.
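Written out, the conditional flow matching objective and the key gradient equivalence are:

\[
\mathcal{L}_{\mathrm{CFM}}(\theta)
= \mathbb{E}_{t,\; z \sim q(z),\; x \sim p_t(x \mid z)}
\left\| v_\theta(t, x) - u_t(x \mid z) \right\|^2,
\qquad
\nabla_\theta \mathcal{L}_{\mathrm{CFM}} = \nabla_\theta \mathcal{L}_{\mathrm{FM}} .
\]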

This provides a scalable way of training continuous normalizing flows.

This is great. Let’s look at several other designs.

Here, instead of conditioning only on the data point \(x_1\), we also sample a noise \(x_0\) from the base distribution. The conditional probability path can be a Gaussian distribution with a small variance that moves between \(x_0\) and \(x_1\).

The conditional vector field is constant over time along this path. This is called the independent coupling condition. Methods such as Rectified Flow and Stochastic Interpolants fall under this setting.
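In the deterministic (small-variance) limit, this is just linear interpolation between a noise sample and a data point, with a constant velocity:

\[
x_t = (1-t)\,x_0 + t\,x_1,
\qquad
u_t(x_t \mid x_0, x_1) = x_1 - x_0 .
\]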

We can go beyond simple pairwise conditioning as well.

For example, we can draw multiple samples from the base distribution and multiple data points from the training dataset. We then establish the correspondence between them using optimal transport, and create probability paths and vector fields for training.

This helps us create straighter paths for more stable training and faster inference speed.

Here’s a summary of three examples of conditional flow matching designs. Compared with independent coupling, adding some coupling within each mini-batch leads to straighter probability paths.

Now, let’s visualize the training process.

We sample a data point \(x_1\) from the dataset and, independently, a noise sample \(x_0\) from the base distribution. Based on the probability path, we can create a sample \(x_t\). Using this noisy sample, we train the neural network to match the conditional vector field.

In this example, the conditional vector field is a constant vector from noise to data.
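Here is a minimal sketch of one such training step, assuming PyTorch, independent coupling, and the straight-line conditional path described above; the toy two-dimensional setup and the small MLP are purely illustrative.

```python
import torch
import torch.nn as nn

def cfm_training_step(model, optimizer, x1):
    """One conditional flow matching step with independent coupling and a
    straight-line (constant-velocity) conditional path."""
    x0 = torch.randn_like(x1)                    # noise from the base distribution
    t = torch.rand(x1.shape[0], 1)               # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                   # sample on the conditional path
    target = x1 - x0                             # conditional vector field
    pred = model(torch.cat([xt, t], dim=-1))     # v_theta(t, x_t)
    loss = ((pred - target) ** 2).mean()         # simple L2 regression
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy setup: 2-D data and a small MLP as the vector-field network.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x1 = torch.randn(128, 2)                         # stand-in for a data batch
print(cfm_training_step(model, optimizer, x1))
```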

It’s useful to put things in perspective by comparing this with diffusion models.

In training a diffusion model, we also sample a data point from the dataset and sample noise from a Gaussian distribution with zero mean and unit variance. We encode the image using a forward diffusion process to get a noisy image. We can then train the neural network to predict the noise.

By comparing these two, we can see how flow matching simplifies and generalizes diffusion models.

In diffusion models, the conditional probability distributions come from a fixed stochastic process, which cannot reach pure Gaussian noise within a finite number of forward diffusion steps.

The flow matching framework focuses directly on moving samples from a base to a target distribution and learns the flow in between.

Flow matching keeps the essence of diffusion models but removes the unnecessary restrictions of the forward diffusion process.


In summary, we covered:

  • The basics of discrete-time normalizing flows,
  • Continuous normalizing flows, and
  • Flow matching as a scalable method for training continuous normalizing flows.

I expect that we will see a lot more exciting developments and applications of flow matching.

Thanks for listening, and I’ll see you next time.
