WTF is V-pred?

Or: what is v-prediction, v-parameterization, zero terminal SNR, ztSNR

tl;dr answer

Better colors/full color range, and better composition coherency.

Comparison of images generated by a model without V-pred and ztSNR, and by one with.
Note for columns 1 and 3, the lack of V-pred and ztSNR causes the inability to generate a purely red or purely black background, and for columns 2 and 4 the inability to generate purely red or purely black hair against a red or black background respectively (indicating mean color leakage)

Want to know more? Read on.

About V-prediction

V-prediction is an alternative parameterization approach for diffusion models, which traditionally rely on predicting either the original data sample (x0) or the noise ϵ (epsilon, or eps) added at each step. Foundational models such as Stable Diffusion 1.5 and Stable Diffusion XL were both trained with epsilon prediction.

V-prediction parameterization was introduced back in 2022 in the paper Progressive Distillation for Fast Sampling of Diffusion Models, which proposed predicting a new target, "v" (velocity), instead of x0 or epsilon. The paper, published by researchers at Google, was largely foundational to their Imagen Video model released the same year. In the Imagen Video paper, the authors noted that v-prediction provides "numerical stability", "avoids color shifting artifacts", and offers "faster convergence".
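For the curious, here is a minimal PyTorch sketch of what the v target looks like during training (the function name and arguments are illustrative, not from either paper's code):

```python
import torch

def v_target(x0: torch.Tensor, noise: torch.Tensor,
             alpha_t: torch.Tensor, sigma_t: torch.Tensor) -> torch.Tensor:
    # With the forward process x_t = alpha_t * x0 + sigma_t * noise,
    # Progressive Distillation defines the "velocity" target as:
    #     v = alpha_t * noise - sigma_t * x0
    # The network predicts v_hat, and training minimizes MSE(v_hat, v).
    return alpha_t * noise - sigma_t * x0

# Given a prediction v_hat, the clean sample can be recovered directly
# (in a variance-preserving schedule, alpha_t**2 + sigma_t**2 == 1):
#     x0_hat = alpha_t * x_t - sigma_t * v_hat
```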

About zero terminal signal-to-noise ratio

Zero terminal signal-to-noise ratio (ztSNR or zSNR) refers to the condition that, at the last timestep of the forward diffusion process, only pure noise remains. Equivalently, the reverse diffusion process is expected to begin from pure noise.
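Stated in the usual DDPM notation (a restatement rather than a quote from any paper, where ᾱ_t is the cumulative product of the schedule's alphas, so that x_t = √(ᾱ_t)·x0 + √(1−ᾱ_t)·ε):

```latex
\mathrm{SNR}(t) = \frac{\bar{\alpha}_t}{1 - \bar{\alpha}_t},
\qquad
\text{zero terminal SNR} \iff \mathrm{SNR}(T) = 0 \iff \bar{\alpha}_T = 0 .
```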

In the paper Common Diffusion Noise Schedules and Sample Steps are Flawed, published in May 2023, it was shown that previous diffusion models had a flaw in their training: the final forward step did not completely destroy the signal (the last timestep was not pure noise). This left low-level signal information, such as the mean brightness of each channel, present at the last timestep, leading any generated image to maintain a mean brightness value around 0 (assuming a range of -1 to 1). Consequently, models trained with this flaw cannot generate images of purely black or purely white tones. The paper ultimately provides a solution to this issue.

Below is an example of this flawed timestep in practice. Notice that in the final forward step, the image is not entirely destroyed, and a faint black-and-white grid remains discernible.

ztSNR example

However, without getting into the maths, true ztSNR is impossible to achieve with epsilon prediction. Therefore, the authors chose to use v-prediction (see section 3.2 of the aforementioned paper).
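The schedule half of that solution is simple enough to show here. Below is a sketch following the paper's Algorithm 1, which shifts and rescales the cumulative alphas of an existing beta schedule so that the last timestep's SNR is exactly zero:

```python
import torch

def enforce_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    # Following Algorithm 1 of "Common Diffusion Noise Schedules and
    # Sample Steps are Flawed": rescale a beta schedule so the final
    # timestep carries zero signal (SNR(T) = 0).
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)
    alphas_bar_sqrt = alphas_bar.sqrt()

    a_0 = alphas_bar_sqrt[0].clone()    # preserve SNR at the first step
    a_T = alphas_bar_sqrt[-1].clone()   # the (nonzero) terminal value

    # Shift so the terminal value is exactly zero, then rescale so the
    # first value is unchanged.
    alphas_bar_sqrt = (alphas_bar_sqrt - a_T) * a_0 / (a_0 - a_T)

    # Convert back to betas.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[0:1], alphas])
    return 1.0 - alphas
```

The intuition for why this forces the switch away from epsilon: with ᾱ_T = 0, the final noised sample is pure noise, so an eps-prediction objective at that step asks the model to return its own input, a trivial target that carries no information about the image.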

This is an important distinction to make. V-prediction can exist without ztSNR, but ztSNR cannot exist without v-prediction. Models trained with v-pred will in general perform and converge better due to the nature of that parameterization, but they don't truly shine unless they are also trained with ztSNR.

How does "offset noise" differ from this?

Offset noise was a training technique proposed by Crosslabs in January 2023 that attempted to resolve the issue of Stable Diffusion's inability to generate very dark or very bright images.

However, this method is a hacky patch that never addressed the root cause. One core issue is that offset noise yields images that do not represent the true data distribution of the training data. A user-facing issue is that multiple LoRAs trained with offset noise cannot be applied to a model together, as this essentially "stacks" the offset behavior, resulting in images that are far too bright or far too dark.
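For reference, the technique itself is roughly a one-line change to the training noise. This sketch is my paraphrase of the idea from the Crosslabs post; the 0.1 strength is the commonly cited value:

```python
import torch

def offset_noise(latents: torch.Tensor, strength: float = 0.1) -> torch.Tensor:
    # Standard Gaussian noise, plus a per-sample, per-channel constant
    # offset broadcast across the spatial dimensions. This biases the
    # model toward learning shifts in mean brightness.
    noise = torch.randn_like(latents)
    offset = torch.randn(latents.shape[0], latents.shape[1], 1, 1,
                         device=latents.device, dtype=latents.dtype)
    return noise + strength * offset
```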

Where have V-pred and ztSNR been implemented in practice?

Unfortunately, the public has largely lagged behind in adopting this improvement, and as of this writing only a few models make use of it:

What UIs support these models?

Most major UIs (A1111, Forge, reForge, ComfyUI) support these models. However, model authors are also expected to designate these models as using v-pred and ztSNR by including the respective v_pred and ztsnr keys in the model's state_dict. If these keys are not present in the model file, the model will need to be manually configured in the UI.
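If you are publishing a model, stamping these marker keys into a safetensors checkpoint might look like the following minimal sketch. It assumes the convention that the keys' values are ignored and only their presence matters; verify this against your target UI:

```python
import torch
from safetensors.torch import load_file, save_file

# Load the checkpoint, add the marker keys, and save a new copy.
# Empty tensors are used as placeholder values; detection is assumed
# to be based on the keys' presence, not their contents.
state_dict = load_file("model.safetensors")
state_dict["v_pred"] = torch.tensor([])
state_dict["ztsnr"] = torch.tensor([])
save_file(state_dict, "model.vpred.safetensors")
```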

| Software | V-pred autodetection | ztSNR autodetection | V-pred manual configuration | ztSNR manual configuration |
| --- | --- | --- | --- | --- |
| AUTOMATIC1111 (dev branch) | | | | |
| Forge | | | | |
| reForge | | | | |
| ComfyUI | | | | |

Are there any reasons or advantages to using an eps/"normal" model rather than a v-pred model?

This is somewhat subjective, and there are pros and cons to each. While ztSNR is a huge benefit that only v-pred makes possible, v-pred is prone to overfitting and can make certain concepts impossible to generate that are possible in an eps-based model. Ultimately, this needs more research to be better understood.

I'm not convinced of this tech. Show me more examples.

Sure.

Examples

Why do newer models, such as FLUX.1-dev, not have this issue?

Newer models typically use different architectures and training objectives (FLUX.1, for example, is trained with a flow-matching objective rather than eps-prediction diffusion) and are thus not subject to this issue. It primarily affects somewhat older (2021-2023) U-Net diffusion architectures.

Pub: 29 Oct 2024 21:31 UTC
Edit: 11 Nov 2024 15:02 UTC