LoRA is NOT a finetune
The key motivation
cheald wrote: I think that LoRA training is fundamentally flawed and biases heavily towards doing most of its learning in the largest (by element count) layers of the network.
Under the current training mechanism, effective learning rates are inconsistent across layers because of disparity in the sizes of the up and down matrices between layers. This is made worse by the fact that rank selection mutates the effective learning rate of the layer relative to other layers, making it more difficult to evaluate rank selection independently.
For further details, start reading from https://github.com/kohya-ss/sd-scripts/discussions/294#discussioncomment-10081465. This is not the only issue, but it's really the biggest one.
In simple baking terms, this means that during LoRA training some layers are getting deep-fried while others are barely warm. Naturally, a LoRA trained this way will have all sorts of issues, but you can avoid all of them if you just train a finetune instead of a LoRA and then extract a locon from it if you wish. Easy scripts does not support training finetunes, so sticking with it is no longer an option. You may want to try OneTrainer, but this guide will use kohya-ss/sd-scripts.
By the way, none of the LyCORIS algorithms are immune to these problems either. You can safely throw all your doras into the bin, where they belong.
Bare minimum VRAM requirements
- 8gb: You will have to patch sd-scripts to move the text encoders off the GPU (or use `--cache_text_encoder_outputs`). You can run a full unet-only finetune with batch size 1, the Adafactor optimizer, and full_bf16 training. This config will suck, but it will still suck less than LoRA finetuning.
- 12gb: You can switch to PagedAdamW8bit without any text encoder patches, or just raise the batch size for Adafactor to a whopping 4.
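If you'd rather not patch the scripts, here is a minimal sketch of the text-encoder-caching route in `config.toml`. The values are my assumptions, not a tested config; the batch size itself lives in `dataset.toml`:

```toml
# Sketch of a low-VRAM unet-only setup -- illustrative, not a tested config
cache_text_encoder_outputs = true   # drops the text encoders from the training step entirely
gradient_checkpointing = true
mixed_precision = "bf16"
full_bf16 = true
optimizer_type = "Adafactor"        # see the optimizer examples further down for the full args
```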
Migrating from easyscripts to sd-scripts
After following this section, you should have something working on hand which you can then adjust to your needs and hardware. Start by exporting your existing easyscripts configs; they will come in handy shortly.
- If you use an old version of easy scripts, you can export easyscripts' configs into sd-scripts-style .toml configs. To do that, open any config inside easy scripts and find the `Save Runtime Toml` option.
- If you are using the most recent version, well, you don't have that menu option. Instead, start the lora training as you usually do, then go to `easyscripts/backend/runtime_store` and snatch `config.toml` and `dataset.toml` from there.
First of all, install sd-scripts following the instructions at https://github.com/kohya-ss/sd-scripts. I use the dev branch.
Additionally, you may want to install this:
`sdxl_train_network.py` (the script for loras) and `sdxl_train.py` (for finetunes) take slightly different arguments. The script you are interested in is `sdxl_train.py`. Export your configs from easy scripts, as described earlier. You will find two files, one for the datasets and another for the actual training configuration. You can pass them to the script via `--dataset_config` and `--config_file` respectively.
Open up the powershell/terminal, activate the environment and type `python sdxl_train.py --help` to list all supported parameters. Edit both files. Important things to note:
- Rename `unet_lr` to `learning_rate`.
- Remove all `network_*` arguments since they're not needed.
- Disable gradient accumulation if it's enabled (you can configure that later) and enable gradient checkpointing if it isn't already.
- Set the batch size to 1 (it lives in the dataset config for a good reason).
- Other common things you want to add are shown in the sketch below:
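A sketch of the kind of options people commonly add to `config.toml` for an SDXL finetune; the values are assumptions on my part rather than values from any particular config, so check `python sdxl_train.py --help` before copying anything:

```toml
# Common quality-of-life options for sdxl_train.py -- illustrative values, adjust to taste
mixed_precision = "bf16"
full_bf16 = true                  # keep weights and optimizer state in bf16 to save VRAM
gradient_checkpointing = true
sdpa = true                       # or xformers = true, whichever your install supports
cache_latents = true
cache_latents_to_disk = true
save_model_as = "safetensors"
save_every_n_epochs = 1
max_train_epochs = 10
```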
This is pretty much it.
- If you want to use the Adafactor optimizer, see the first sketch after this list.
- If you want to use the optimi.Lion optimizer, see the second sketch. Anon reported that this doesn't fit on a 16gb GPU under desktop Windows, so it's pretty much a 24gb-only config. The same anon reported that it fits (barely!) with `kahan_sum=False`, so you may as well try that first.
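A sketch of an Adafactor setup; the optimizer arguments and learning rate here are my assumptions, not values from the original configs:

```toml
# Adafactor in constant-LR mode -- illustrative values
optimizer_type = "Adafactor"
optimizer_args = [ "scale_parameter=False", "relative_step=False", "warmup_init=False" ]
lr_scheduler = "constant_with_warmup"
lr_warmup_steps = 100
learning_rate = 4e-7
# fused_backward_pass = true    # Adafactor-only VRAM saver, if your sd-scripts version supports it
```

And a sketch of an optimi.Lion setup, assuming your sd-scripts build accepts a full import path as `optimizer_type` (recent versions do) and that optimi is installed via `pip install torch-optimi`; the learning rate and weight decay are placeholders:

```toml
# optimi.Lion -- illustrative values; Lion generally wants a much lower LR than AdamW
optimizer_type = "optimi.Lion"
optimizer_args = [ "weight_decay=0.01", "kahan_sum=True" ]   # try kahan_sum=False if VRAM is tight
lr_scheduler = "constant"
learning_rate = 1e-7
```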
If you are on Windows, to run the training itself you can use the runner file anon shared: https://files.catbox.moe/rg2b2i.ps1. However, make sure to add `--dataset_config` and `--config_file` to the arguments, and remove all duplicated params from the .ps1 file so the config files are actually used:
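Roughly, the launch line would end up looking something like this; the paths and the `--num_cpu_threads_per_process` value are placeholders, so keep whatever the original .ps1 uses:

```powershell
# Sketch of the edited launch line -- paths are placeholders
accelerate launch --num_cpu_threads_per_process 8 .\sdxl_train.py `
    --config_file ".\configs\config.toml" `
    --dataset_config ".\configs\dataset.toml"
```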
(You may also put `output_dir`, `output_name` and `pretrained_model_name_or_path` inside `config.toml`.)
If you are on Linux, you may want to try my runner script, which is much more convenient for terminal use: https://files.catbox.moe/1flau4.sh. Pass it a path to the folder containing `config.toml` and `dataset.toml`, so the command would look like `./finetune_run.sh configs/mybestconfig`.
Locon extraction
Install the lycoris library if you haven't already (`pip install lycoris-lora`). Download this script https://github.com/KohakuBlueleaf/LyCORIS/blob/main/tools/extract_locon.py and put it in the kohya directory.
Run it like this:
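A plausible invocation is sketched below; the ratio-mode values are my assumptions and the exact flags may differ between LyCORIS versions, so check the script's `--help` first:

```
python extract_locon.py --safetensors --device=cuda --mode=ratio --linear_ratio=0.9 --conv_ratio=0.9 base_sdxl_model.safetensors your_finetune.safetensors extracted_locon.safetensors
```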
Fixed dim mode also works well: `--mode=fixed --linear_dim=20 --conv_dim=10`
The script is very unoptimized; make sure you have a lot of RAM and a large page file. It may run for up to 10 minutes. It also doesn't check whether the text encoder weights actually differ and includes all of them in the lora, so the file size may be a bit inflated for no real reason.
You may try this version instead: https://files.catbox.moe/j86aw9.py. It extracts only the unet, which reduces the final file size, and it should run significantly faster.
Another thing to try is ComfyUI's `Extract and Save Lora` node (or any other lora-extraction node), but I personally didn't test that.
Honorable mentions
- Stochastic rounding AdamW optimizers: at full bf16 they use about 20% less VRAM than mixed-precision bf16 Adam8bit while maintaining similar precision, but they require some patches to the train script: https://github.com/lodestone-rock/torchastic
- Optimi optimizers: they implement Kahan summation (`kahan_sum`), which also helps with full_bf16 training: https://github.com/warner-benjamin/optimi