LoRA is NOT a finetune
The key motivation
cheald wrote: I think that LoRA training is fundamentally flawed and biases heavily towards doing most of its learning in the largest (by element count) layers of the network.
Under the current training mechanism, effective learning rates are inconsistent across layers because of disparity in the sizes of the up and down matrices between layers. This is made worse by the fact that rank selection mutates the effective learning rate of the layer relative to other layers, making it more difficult to evaluate rank selection independently.
For further details, start reading from https://github.com/kohya-ss/sd-scripts/discussions/294#discussioncomment-10081465. This is not the only issue, but it's really the biggest one.
In simple baking terms, this means that during LoRA training some layers are getting deep-fried while others are barely warm. Naturally, a LoRA trained this way will have all sorts of issues, but you can avoid all of them if you just train a finetune instead of a LoRA and then extract a locon from it if you wish. Easy scripts does not support training finetunes, so sticking with it is no longer an option. You may want to try OneTrainer, but this guide will use kohya-ss/sd-scripts.
By the way, none of the LyCORIS algorithms are immune to these problems either. You can safely throw all your doras into the bin, where they belong.
Bare minimum VRAM requirements
- 8gb: You will have to patch sd-scripts to move the text encoders off the GPU (or use `--cache_text_encoder_outputs`). You can run a full unet-only finetune with batch size 1, the Adafactor optimizer, and full_bf16 training. This config will suck, but it will still suck less than LoRA finetuning.
- 12gb: You can switch to PagedAdamW8bit without any text encoder patches, or just raise the batch size for Adafactor to a whopping 4.
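If you'd rather not patch the scripts, here is a minimal sketch of the text-encoder-caching route in `config.toml`. The values are my assumptions, not a tested config; the batch size itself lives in `dataset.toml`:

```toml
# Sketch of a low-VRAM unet-only setup -- illustrative, not a tested config
cache_text_encoder_outputs = true   # drops the text encoders from the training step entirely
gradient_checkpointing = true
mixed_precision = "bf16"
full_bf16 = true
optimizer_type = "Adafactor"        # see the optimizer examples further down for the full args
```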
Migrating from easyscripts to sd-scripts
After following this section, you should have something working on hand which you can then adjust to your needs and hardware. Start by exporting your existing easyscripts configs; they will come in handy shortly.
- If you use an old version of easy scripts, you can export easyscripts' configs into sd-scripts-style .toml configs. To do that, open any config inside easy scripts and find the `Save Runtime Toml` option.
- If you are using the most recent version, well, you don't have that menu option. Instead, start the lora training as you usually do, then go to `easyscripts/backend/runtime_store` and snatch `config.toml` and `dataset.toml` from there.
First of all, install sd-scripts following the instructions at https://github.com/kohya-ss/sd-scripts. I use the dev branch.
Additionally, you may want to install this:
`sdxl_train_network.py` (the script for loras) and `sdxl_train.py` (for finetunes) take slightly different arguments. The script you are interested in is `sdxl_train.py`. Export your configs from easy scripts, as described earlier. You will find two files, one for the datasets and another for the actual training configuration. You can pass them to the script via `--dataset_config` and `--config_file` respectively.
Open up the powershell/terminal, activate the environment and type `python sdxl_train.py --help` to list all supported parameters. Edit both files. Important things to note:
- Rename `unet_lr` to `learning_rate`.
- Remove all `network_*` arguments since they're not needed.
- Disable gradient accumulation if it's enabled (you can configure that later) and enable gradient checkpointing if it isn't already.
- Set the batch size to 1 (it lives in the dataset config for a good reason).
- Other common things you want to add are shown in the sketch below:
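A sketch of the kind of options people commonly add to `config.toml` for an SDXL finetune; the values are assumptions on my part rather than values from any particular config, so check `python sdxl_train.py --help` before copying anything:

```toml
# Common quality-of-life options for sdxl_train.py -- illustrative values, adjust to taste
mixed_precision = "bf16"
full_bf16 = true                  # keep weights and optimizer state in bf16 to save VRAM
gradient_checkpointing = true
sdpa = true                       # or xformers = true, whichever your install supports
cache_latents = true
cache_latents_to_disk = true
save_model_as = "safetensors"
save_every_n_epochs = 1
max_train_epochs = 10
```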
This is pretty much it.
- If you want to use the Adafactor optimizer, see the first sketch after this list.
- If you want to use the optimi.Lion optimizer, see the second sketch. Anon reported that this doesn't fit on a 16gb GPU under desktop Windows, so it's pretty much a 24gb-only config. The same anon reported that it fits (barely!) with `kahan_sum=False`, so you may as well try that first.
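A sketch of an Adafactor setup; the optimizer arguments and learning rate here are my assumptions, not values from the original configs:

```toml
# Adafactor in constant-LR mode -- illustrative values
optimizer_type = "Adafactor"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ]
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 100
learning_rate = 4e-7
# fused_backward_pass = true    # Adafactor-only VRAM saver, if your sd-scripts version supports it
```

And a sketch of an optimi.Lion setup, assuming your sd-scripts build accepts a full import path as `optimizer_type` (recent versions do) and that optimi is installed via `pip install torch-optimi`; the learning rate and weight decay are placeholders:

```toml
# optimi.Lion -- illustrative values; Lion generally wants a much lower LR than AdamW
optimizer_type = "optimi.Lion"
optimizer_args = [ "weight_decay=0.01", "kahan_sum=True" ]   # try kahan_sum=False if VRAM is tight
lr_scheduler = "constant"
learning_rate = 1e-7
```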
If you are on Windows, to run the training itself you can use the runner file anon shared: https://files.catbox.moe/rg2b2i.ps1. However, make sure to add `--dataset_config` and `--config_file` to the arguments, and remove all duplicated params from the .ps1 file so the config files are actually used:
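Roughly, the launch line would end up looking something like this; the paths and the `--num_cpu_threads_per_process` value are placeholders, so keep whatever the original .ps1 uses:

```powershell
# Sketch of the edited launch line -- paths are placeholders
accelerate launch --num_cpu_threads_per_process 8 .\sdxl_train.py `
    --config_file ".\configs\config.toml" `
    --dataset_config ".\configs\dataset.toml"
```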
(You may also put `output_dir`, `output_name` and `pretrained_model_name_or_path` inside `config.toml`.)
If you are on Linux, you may want to try my runner script, which is much more convenient for terminal use: https://files.catbox.moe/1flau4.sh. Pass it a path to the folder containing `config.toml` and `dataset.toml`, so the command would look like `./finetune_run.sh configs/mybestconfig`.
Locon extraction
Install the lycoris library if you haven't already (`pip install lycoris-lora`). Download this script https://github.com/KohakuBlueleaf/LyCORIS/blob/main/tools/extract_locon.py and put it in the kohya directory.
Run it like this:
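A plausible invocation is sketched below; the ratio-mode values are my assumptions and the exact flags may differ between LyCORIS versions, so check the script's `--help` first:

```
python extract_locon.py --safetensors --device=cuda --mode=ratio --linear_ratio=0.9 --conv_ratio=0.9 base_sdxl_model.safetensors your_finetune.safetensors extracted_locon.safetensors
```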
Fixed dim mode also works well: `--mode=fixed --linear_dim=20 --conv_dim=10`
The script is very unoptimized; make sure you have a lot of RAM and a large page file. It may run for up to 10 minutes. It also doesn't check whether the text encoder weights actually differ and includes all of them in the lora, so the file size may be a bit inflated for no real reason.
You may try this version instead: https://files.catbox.moe/j86aw9.py. It extracts only the unet, which reduces the final file size, and it should run significantly faster.
Another thing to try is ComfyUI's `Extract and Save Lora` node (or any other lora-extraction node), but I personally didn't test that.
Honorable mentions
- Stochastic rounding AdamW optimizers: at full bf16 they use about 20% less VRAM than mixed-precision bf16 Adam8bit while maintaining similar precision, but they require some patches to the train script: https://github.com/lodestone-rock/torchastic
- Optimi optimizers: they implement Kahan summation (`kahan_sum`), which also helps with full_bf16 training: https://github.com/warner-benjamin/optimi