Vpred Easy Scripts Parameter Guide

Model

Base model: The base model you are training on; for most users it should be either Illustrious 0.1 or NoobAI-Vpred 1.0.

External VAE: Not applicable to most people.

SD2.X Based: Not applicable to Illustrious/Noob.

V Param: Enables v-prediction parameterization; required when baking on a Vpred model. Keep on.

SDXL Based: Keep on.

Scale V pred loss: Scales the loss to be in line with EDM; causes detail deterioration. Not recommended.

No Half Vae: Loads the VAE in fp32. Not necessary for most users. Keep off.

Full FP16: Allows training in full FP16 precision. Broken outputs on my end.

Low RAM: Lowers RAM usage by a couple percent while worsening it/s by about as much. Not recommended.

Full BF16: Allows training in full BF16 precision. Quality improvement while lowering VRAM usage by a few hundred MBs. Keep on.

High VRAM: Unsure of its effect; it lowered it/s on my end.

FP8 Base: Loads the base model in FP8 precision. Reduces VRAM usage by 4GBs, and slightly increases it/s with no quality deterioration. Keep on.

Resolution

Width: Keep at 1024 for Illustrious/Noob. Higher values do not increase bake quality.
Height: Not applicable.

Gradient

Gradient Checkpointing: Worsens speed (~25%), but massively reduces VRAM usage (40%+).

Gradient Accumulation: Used to virtually extend the batch size at a lower VRAM cost. Not recommended.
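
For reference, the effective batch size it simulates is just the product of the two settings; a minimal sketch (the memory saving comes from only holding one small batch of activations at a time):

```python
def effective_batch_size(batch_size: int, accumulation_steps: int) -> int:
    """Gradient accumulation sums gradients from several small batches
    before each optimizer step, mimicking one larger batch."""
    return batch_size * accumulation_steps

print(effective_batch_size(batch_size=1, accumulation_steps=4))  # behaves roughly like batch size 4
```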

Other general args

Seed: The random seed used for all randomization in the training. Use whatever you want.

Clip Skip: Not applicable in SDXL models.

Prior Loss Weight: Supposedly helps the model not erase old concepts while learning new ones. Unsure of its effect. Keep at default.

Xformers: Optimization that raises it/s. Mutually exclusive with SDPA.

SDPA: Optimization that raises it/s slightly more than Xformers. Mutually exclusive with Xformers. Keep on.

Cache Latents: Caches the latent representations of images prior to training. Gets you a 15%+ increase in it/s for a quick caching before training. Keep on.

Cache Latents (To Disk): Allows you to save the cache to disk, useful if you're going to be baking the same dataset multiple times, and want to skip the caching process.

Comment: For adding a comment to the model's metadata.

Batch Size: The maximum number of images in each batch. Larger batches allow for quicker training but can cause symmetry problems in the final bake. Keep at 1.

Max Token Length: The maximum length a training prompt can be. Lower values incur minor quality loss. Keep at 225.

Training Precision: The precision of your training. Not applicable when using full BF16.

Max Training Time: Used for setting the maximum training time.

Keep Tokens Separator: Unsure of a practical use for it.

Network Args

Type: The type of network you are using. I had the best results with LoCon.

Network Dimension: Represents the size of the Linear Dimensions of your model. Lower values might not train all of the concept, higher might overfit on details and fry. Keep as low as possible depending on your use. 16 was a good base value for me.

Network Alpha: Acts as a scaling factor on the Network Dimension; the LoRA's update is scaled by Alpha divided by Dimension, so the lower the Alpha, the more of a "brake" it puts on training. I had the best results with an Alpha of 2x the Network Dimension, so 32 for a Network Dimension of 16.
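
For intuition, here is a minimal sketch of how the alpha/dimension scaling is commonly applied in LoRA layers (the class and names are illustrative, not Easy Scripts internals). With Dimension 16 and Alpha 32 the low-rank update is scaled by 2.0, while Alpha 8 would scale it by 0.5, which is the "brake" effect.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA layer: frozen base weight plus a low-rank update scaled by alpha/dim."""
    def __init__(self, base: nn.Linear, network_dim: int = 16, network_alpha: float = 32.0):
        super().__init__()
        self.base = base                              # frozen pretrained layer
        self.down = nn.Linear(base.in_features, network_dim, bias=False)
        self.up = nn.Linear(network_dim, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                # start with no change to the base model
        self.scale = network_alpha / network_dim      # 32 / 16 = 2.0 with the values above

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.up(self.down(x)) * self.scale
```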

Min/Max Timestep: The minimum and maximum timesteps to train on. Keep as defaults.

Train on: Selects whether to train the model on the UNET, the Text Encoder, or Both. I had the best results with Both.

Conv Dimension/Conv Alpha: The equivalents of Network Dimension and Alpha for the convolutional layers of the network. Lower values will lead to underbaking, higher to frying. Keep at default.

Network Dropout: Used to drop out parts of the model. Causes degradation of quality. Keep off.

Rank Dropout: Used to drop out full ranks of the model. Causes degradation of quality. Keep off.

Module Dropout: Used to drop full modules of the model. Causes degradation of quality. Keep off.

IP Noise Gamma: "Reduces the random noise, allowing what you want to learn to learn faster." Causes significant quality degradation for me. Keep off.

Lora FA: A tweak to supposedly reduce VRAM usage while keeping everything the same. Causes minor quality degradation. Keep off.

Optimizer Args

Optimizer Type: The type of optimizer to be used in training. I had the best luck with AdamW (not AdamW8bit! Despite its popularity, I found it to be of lesser quality than AdamW). AdaFactor is acceptable if you want an adaptive optimizer.

LR Scheduler: The scheduler to be used for the learning rate during training. I had the best results with cosine. (Without restarts, I found those to decrease output quality.)

Loss Type: The way loss is calculated. Keep at L2.

Learning Rate: The learning rate for the training. 0.0001 is a good basic value for AdamW. If using higher batch sizes, I found scaling by the square root of the batch size (LR = base LR × sqrt(batch size)) to work the best.
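
A worked example of that square-root scaling rule, assuming the 0.0001 AdamW base value above:

```python
import math

def scaled_lr(base_lr: float, batch_size: int) -> float:
    """Square-root batch-size scaling: LR = base LR * sqrt(batch size)."""
    return base_lr * math.sqrt(batch_size)

print(scaled_lr(1e-4, 1))  # 0.0001 - batch size 1, the recommended default
print(scaled_lr(1e-4, 4))  # 0.0002 - batch size 4 doubles the base LR
```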

Unet Learning Rate/TE Learning Rate: Lets you set decoupled learning rates for the Unet and Text Encoder. Not going down that rabbit hole.

Scale Weight Norms: Used to prevent any one weight from getting too large relative to the rest. Minor quality degradation. Keep off.

Min SNR Gamma: Reweights the loss per timestep by its signal-to-noise ratio, clamped at gamma, so no timestep dominates training. Lower values are stronger. I had good results at a value of 1.
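
For the curious, a rough sketch of the Min-SNR loss weighting roughly as kohya-style trainers apply it (names here are illustrative; the exact code in your trainer may differ):

```python
import torch

def min_snr_weight(snr: torch.Tensor, gamma: float, v_prediction: bool = True) -> torch.Tensor:
    """Per-timestep loss weight under Min-SNR-gamma.

    snr holds the signal-to-noise ratio of each sampled timestep; a lower gamma
    clamps more timesteps, which is why lower values are 'stronger'.
    """
    clamped = torch.minimum(snr, torch.full_like(snr, gamma))
    # v-prediction targets conventionally use SNR + 1 in the denominator
    return clamped / (snr + 1) if v_prediction else clamped / snr

# the per-sample loss would then be multiplied by min_snr_weight(snr, gamma=1.0)
```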

Warmup Ratio: Used to set a ratio of "warmup" (lower LR) steps relative to the total number of steps. Keep off.

Num Cycles: The number of restarts, for use with the cosine with restarts scheduler.

Max Grad Norm: "The maximum gradient after normalization". Unsure of practical uses. Keep default.
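
In most trainers this maps to standard gradient clipping; a sketch of the underlying PyTorch call (not the exact Easy Scripts call site):

```python
import torch

model = torch.nn.Linear(8, 8)                     # stand-in for the trainable LoRA weights
loss = model(torch.randn(4, 8)).pow(2).mean()
loss.backward()
# Rescale gradients so their combined norm never exceeds max_grad_norm (1.0 by default)
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```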

Zero Terminal SNR: Tweak to the noise schedule to allow for the full range of color. Absolutely needed for Vpred training. Keep on.
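
For reference, a sketch of the commonly used zero-terminal-SNR beta rescaling this option refers to (the trainer's actual implementation may differ in details):

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so the final timestep has exactly zero SNR (pure noise),
    letting the model reach the full brightness range."""
    alphas_bar_sqrt = (1.0 - betas).cumprod(dim=0).sqrt()
    a_first, a_last = alphas_bar_sqrt[0].clone(), alphas_bar_sqrt[-1].clone()
    alphas_bar_sqrt = (alphas_bar_sqrt - a_last) * a_first / (a_first - a_last)  # last value becomes 0
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = torch.cat([alphas_bar[:1], alphas_bar[1:] / alphas_bar[:-1]])
    return 1.0 - alphas
```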

Masked Loss: Computes the loss only inside a mask supplied for each image. Obscure usage.

Weight decay: Supposed to make generalization better. AdamW recommends 0.1 as default, and I had the best results with that value.
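
Tying the optimizer settings above together, the equivalent PyTorch construction would look roughly like this (values are just the ones recommended in this guide; the real trainer builds this for you):

```python
import torch

# stand-in for the trainable LoRA parameters the trainer collects
lora_params = [torch.nn.Parameter(torch.zeros(16, 16))]

optimizer = torch.optim.AdamW(lora_params, lr=1e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=2000)  # cosine decay over ~total steps
```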

Bucket Args

Enable: Separates your images into "buckets" of different resolutions. Keep on.

Don't Upscale Images: Prevents your images from being upscaled and generates unique buckets for them. I had the best results with this on.

Minimum Bucket Resolution: The minimal resolution for any bucket. Keep at default.

Maximum Bucket Resolution: The maximum resolution for any bucket. Keep at default.

Bucket Resolution Steps: The number of pixels between bucket resolutions. Keep at default.
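
To make the bucket idea concrete, here is a rough sketch of how aspect-ratio buckets are typically enumerated from these settings (not the trainer's exact algorithm):

```python
def make_buckets(base_res: int = 1024, min_res: int = 256, max_res: int = 2048, step: int = 64):
    """Enumerate (width, height) pairs whose area stays at or below base_res**2,
    in multiples of `step`; images are later sorted into the closest aspect ratio."""
    max_area = base_res * base_res
    buckets = set()
    w = min_res
    while w <= max_res:
        h = min(max_res, (max_area // w) // step * step)
        if h >= min_res:
            buckets.add((w, h))
            buckets.add((h, w))
        w += step
    return sorted(buckets)

print(len(make_buckets()), "buckets, including", (1024, 1024))
```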

Noise Offset Args

Noise offset: Tweak to model noise.
Pyramid noise: Tweak to model noise, supposedly less destructive than noise offset. Causes quality deterioration. Keep off.

Extra Args

Wavelet loss (wavelet_loss = true): Slightly degrades quality on my end. Keep off.

Subset Args

Input Image Dir: The folder with images you are going to be training on.

Masked Image Dir: For use with masked loss.

Number of Repeats: The number of times each image in the dataset is repeated each epoch. Keep at 1. If you need to extend the training time, increase the epoch or step count.
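
For a sense of how repeats, epochs, and batch size combine into the total step count, a rough sketch (exact numbers can shift slightly with bucketing and gradient accumulation):

```python
import math

def total_steps(num_images: int, repeats: int, epochs: int, batch_size: int = 1) -> int:
    """Approximate optimizer steps for one run: images * repeats per epoch, split into batches."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs

print(total_steps(num_images=100, repeats=1, epochs=10))  # 1000 steps at batch size 1
```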

Keep Tokens: Keeps a specified number of tokens unshuffled when using shuffle captions. Keeping tokens generally results in minor frying. I had the best results at 0.

Caption Extensions: Keep at default.

Shuffle Captions: Shuffles the captions while training. Improves quality. Keep on.
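
A minimal sketch of what caption shuffling (and the Keep Tokens setting above) does to each tag list; the function name is illustrative:

```python
import random

def shuffle_caption(caption: str, keep_tokens: int = 0) -> str:
    """Shuffle comma-separated tags, keeping the first `keep_tokens` tags in place."""
    tags = [t.strip() for t in caption.split(",")]
    head, tail = tags[:keep_tokens], tags[keep_tokens:]
    random.shuffle(tail)
    return ", ".join(head + tail)

print(shuffle_caption("1girl, red hair, smile, outdoors", keep_tokens=1))
```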

Flip Augment: Flips the latents of the image while training. Causes quality degradation. Keep off.

Random Crop: Randomly crops images while training. Incompatible with caching latents.

Regularization Images: For specifying regularization images, a method of supposedly reducing overfitting. Uncommon use.

Face Crop: Creates crops of faces from the dataset for training. Caused quality degradation on my end. Keep off.

Caption Dropout: Cuts out captions during training. Caused quality degradation on my end. Keep off.

Token Warmup: Gradually adds tokens as training continues. Caused quality degradation on my end. Keep off.

Addendum

The wd-eva02-large-tagger-v3 is currently the best image tagger available. I use it at 0.35 confidence, and adjust the min tag fraction depending on the size of the dataset. Hand-tagging datasets is not worth it in my experience.

I also recommend resizing your finished models and then retesting them against each other. It can remove overfitting while keeping most of the model's data.

Pub: 05 Jun 2025 19:11 UTC

Edit: 05 Jun 2025 20:26 UTC
