THE OTHER LoRA TRAINING RENTRY
By yours truly, The Other LoRA Rentry Guy.
This is not a how-to-install guide. It is a guide about how to improve your results, what the options do, and how to train characters using bad or few images.
Do not use the dreambooth extension for sd-webui. Any benefit that the interface gives is negated by using a lot more VRAM (whatever the training uses + 4GB at least), being much slower and requiring a specific version to work. You'd be entering more text commands to get it to work than trying everything in this guide. It also doesn't support safetensors. Please avoid. Using it as-is will just give you burned outputs, I've had many cases of people asking why nothing works and the solution was "don't use the extension".
- Minor update. Rewrote a few sections, particularly DAdaptation's, and removed some uncertain language as I've tested things enough.
- I'll be occupied until the end of May, so updates will be infrequent. I'm still slowly rewriting parts of the guide as I go.
- I'll have a lot of free time around early June so I'll make sure to test and explain a bunch of newer things (like training individual layers).
- Fortunately there haven't been any ground-breaking developments lately.
- After using min_snr_gamma=5 for a while, I can positively say it's good to have at best, and doesn't do anything bad at worst.
- CURRENTLY TESTING
- INSTALLING Kohya'S TRAINING SCRIPT
- ANCIENT NINJA WISDOM
- Why so many scripts, it's confusing!
- Why aren't there more precise numbers?
- I want to make a perfect LORA, I'm carefully arranging all the elements until it's perfect and...
- Is dreambooth useless now?
- Stable Diffusion 2.x
- Making LORAs from the difference between two models ("distill")
- Getting started
- LEARNING RATES
- TYPES OF TRAINING
- PREPARING IMAGES
- Selecting images
- The short version
- The long version
- Image size/aspect ratio
- REGULARIZATION IMAGES
- TAGGING YOUR IMAGES (FINETUNING ONLY, IMPORTANT!!!!)
- OTHER TRAINING PARAMETERS
- TESTING AND DEBUGGING MODELS
- Rare character HOWTO
- AFTER FINISHING!
- SAMPLE POWERSHELL SCRIPT (WINDOWS)
- SAMPLE BASH SCRIPT (LINUX)
This guide is meant to co-exist with other guides about LORA training.
If you have at least 6GB of VRAM you can use the Kohya scripts for training. 8GB+ is recommended.
This guide includes my findings about specific options, which should work regardless of whether you are training styles or characters. I'm mostly focused on training characters, but if something is obviously useful for styles I'll mention it as well.
Trying to figure out how every option dampens learning.
INSTALLING Kohya'S TRAINING SCRIPT
Clone https://github.com/kohya-ss/sd-scripts and follow the install instructions.
DO NOT INSTALL IT IN THE SAME PLACE AS YOUR WEBUI INSTALL! MAKE A NEW PYTHON VIRTUAL ENV!
The script uses different library versions and WILL break your webui.
This script seems capable of training dreambooths as well, which might be more efficient than using the webui's extension. That is beyond the scope of this guide, but if you are interested, know it's an option. Dreambooths still have uses even with LORA training.
Once you've followed the install instructions, grab the bash (linux) or powershell (windows) scripts at the bottom of this rentry and edit paths as required to launch Kohya's script. I'll add command line arguments as I keep testing stuff.
ANCIENT NINJA WISDOM
|Model||Also known as "checkpoint", it's the results of training, usually distributed as a single file containing "weights"|
|Baking||Household name for training a model|
|Ninja Scrolls||Funny nickname for the full documentation of Kohya's scripts (In Japanese)|
|Kohya||Developer of the training scripts and other Stable Diffusion-related technologies|
|WebUI||The most common Stable Diffusion generation tool|
|Extension||An extension for WebUI, like a plugin. Can be added from WebUI's extensions tab.|
|Voldy||AUTOMATIC1111, author of webui|
|Unet||The system that controls how the machine learns images and some unknown decision/association properties|
|Text Encoder(TE)||The system that translates your prompt's words or tokens into data the AI understands|
|CLIP||A text encoder, typically the one we will be training. Stable Diffusion v2 models use OpenCLIP instead|
|Net Dim||Also known as "rank", it's the total capacity of the model, usually reflected as a bigger file|
|AI||Actually, this is less of an AI and more "machine learning", but it's easier to call it "AI" informally|
|Dreambooth||A different type of training, resulting in bigger files (2-4GB)|
|LORA||The type of training covered in this guide. The formal spelling is "LoRA", from "Low Rank Adaptation"|
|Embed||Also known as "textual inversion". It's an older style that only trains the text encoder|
|Hypernetwork||Similar to an embed, but acting on the Unet instead|
|Subject||Training a character, object, vehicle, background...into a model|
|Style||Training a model to reproduce a specific aesthetic|
|Concept||Training a model to reproduce something like a pose or composition|
|Training set||A combination of your training images and tags|
|Distilling||Household term for extracting a LORA from a bigger model|
|Overfitting||A model tries to reproduce the training set too aggressively, usually a result of a burned Unet|
|Deep-frying||An effect where the generated images have very saturated colors, usually a result of high CFG scale|
|Interrogator||A smaller AI that gives you the tags of the things it finds in an image|
|Laetitia||A character from Lobotomy Corporation/Library of Ruina from South Korean developers Project Moon. I use a model trained on her to try out options. Consider her this guide's mascot.|
Why so many scripts, it's confusing!
The Powershell/Bash script is just to launch the Kohya script with a bunch of long and tedious arguments that are a pain to write manually.
If you know what you are doing you don't need either. The Powershell/Bash scripts are a convenience feature, even if they can get confusing. Imagine typing and changing all that stuff by hand every time!
It's also a base for craftier users to automate training batches and the like.
Why aren't there more precise numbers?
Everything is highly approximated and abstract because we are dealing with something subjective like art quality and expectations, so it's difficult to get precise measurements for anything that isn't obvious.
I'll try to narrow numbers as I go, but it's going to be a slow process, so meanwhile numbers will have to come in effective ranges.
I want to make a perfect LORA, I'm carefully arranging all the elements until it's perfect and...
Don't. Stop. You are spending too much time planning and too little time baking.
It's impossible to entirely predict what the AI is going to do, what elements it's going to struggle with, how it will accept the given images and so on.
So what do I do first?
Bake a test model first and troubleshoot it later. It's the only way to know how the AI has understood the training set. Use some default numbers and then come here when troubles arise so you can figure out how to solve or improve on it.
It doesn't take long to bake and you might just get lucky and get it first try, even with a sloppy training set.
Is dreambooth useless now?
No. While it's a lot more economical and usually faster to train a LORA, dreambooths are still useful for:
- Making full models to mix with.
- Making models to train from (like, a dreambooth for the style of a series, then train the characters from that dreambooth).
- Styles in general.
But I heard LORA sucks compared to dreambooth.
Not really. You are probably remembering the good dreambooths. A poorly trained dreambooth is as sad as a poorly trained LORA. LORA training just made it easier (faster/using regular hardware) to notice a lot of bad practices and advice that was floating around before it became a thing.
A well trained LORA is comparable to a dreambooth in results at a similar scope (one/few characters, a style, etc).
We still don't know how far we can push LORA, though, and as said, dreambooths still have uses.
Are hypernetworks useless now?
Not really, either. Hypernetworks still work fine for styles.
Are embeds (textual inversion) useless now?
Also no. Embeds came out useful for negatives (bad-prompt-v2, etc) and they are still an aid to help laser-focus on something by messing with the text encoder at generation time, which can give an extra boost to other training methods. You can also use some extensions and/or additional training to merge specific tags into one, for convenience. Like an alias and an alias with training sprinkles.
Admittedly, their limited scope makes them the most outdated training technology.
Stable Diffusion 2.x
Kohya's script does support SD2.x, it has a --v2 command line argument.
Unfortunately I haven't experimented with this, but know it's possible.
Making LORAs from the difference between two models ("distill")
The sacred ninja scrolls mention a very useful tool included with the scripts.
This individual script allows you to create a LORA from a first model and a second model trained from that first model. For example, you could create a model out of the difference between NAI and HLL, or NAI and Anything v3.
The obvious advantage is that you can then plug that LORA when generating with any other model and obtain benefits similar to a mix, but adjustable on the fly. Models that have only one gimmick or multiple versions with subtle differences can probably be safely converted to LORAs and give your hard disk a break. Just extract the juice.
From my testing it's about 95% as good as the finetuned model, but takes a lot less time to swap and is friendlier to experiment with when generating images.
To set it up, first you need to go into the sd-scripts directory and open a terminal.
Enter your python virtual env first.
And then run the networks/extract_lora_from_models.py script as follows:
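As a rough sketch of what such a call looks like: the flag names below (--model_org, --model_tuned, --save_to, --dim) are recalled from the sd-scripts documentation and should be treated as assumptions, so check them with --help first. The file names are placeholders, and build_extract_command is just a made-up helper to assemble the line.

```python
import shlex

def build_extract_command(base_model, tuned_model, output, dim):
    """Assemble the command line for the extraction script.

    The flag names are assumptions; verify them with
    `python networks/extract_lora_from_models.py --help`.
    """
    return [
        "python", "networks/extract_lora_from_models.py",
        "--model_org", base_model,     # the first model (the one trained from)
        "--model_tuned", tuned_model,  # the second model (trained from the first)
        "--save_to", output,           # where the resulting LORA is written
        "--dim", str(dim),             # net dim/rank of the extracted LORA
    ]

# Placeholder file names, purely illustrative.
command = build_extract_command("base.ckpt", "tuned.ckpt", "extracted.safetensors", 192)
print(shlex.join(command))
```

Run the printed line from inside the sd-scripts directory, with the virtual env active.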
You can then load the resulting LORA as any other.
This LORA is a sample distilled version of the HLL2 model. It was created with dim 192. Works fine for me.
What do I need to get started?
- You'll need the Kohya scripts installed, plus 10 to 50 images for a character, 100-4000 for styles or 50-2000 for concepts.
- You will need a video card with at least 6GB of VRAM, preferably nVidia for CUDA, or use an online compute service like Google Colab. (I'm not the author of this Colab, it was just recommended.)
- You'll also need a base model to train from. As of right now, the best ones to train from are the NovelAI leaked model for anything drawn (anime, cartoon, etc) and Stable Diffusion 1.5 for realistic subjects. Anythingv3, Elysium and other mixes are also suitable, but the more "base" a model, the more compatible with mixes it'll be. Refer to the base models section for more information.
- You'll need a text editor. Notepad can work, but I recommend something a bit more programming oriented like Notepad++, VSCode, Sublime Text, Vim, Emacs, or whatever you have.
- Optionally, have an image editor like Photoshop, Krita (recommended), GIMP, Paint.NET or whatever you have. You may not need it, but it can be useful.
- Grab the template scripts at the bottom of the guide to launch the whole thing. Any other alternate launcher can do, settings should be the same.
- An interrogator to generate captions if using finetuning. I recommend using this extension with your webui. It'll allow you to batch-caption images with various settings.
- Patience. If it doesn't work the first time don't ragequit, it's likely possible to fix it. And quality takes time.
This is a work in progress, ranges and parameters may change as I train more models.
These are the standard ranges for general training. Every training set has different requisites, but these seem safe enough.
|Category||Images||Net Dim/Rank||Alpha||Unet LR||TE LR||Regularization||Total Steps||Resolution|
|Character (good inputs)||35-60||96-148||64-128||0.0001||0.00005||No||1000+||512-768|
|Character (bad inputs)||15-30||96-128||64-128||0.0001||0.000045||Yes||1600+||512-768|
|⭐My current settings (characters)||15-55||128||64||0.0001||0.00005||Either||~1800||576|
Total steps depend on options, number and quality of images, and so on, but you usually want something above 1000 for things to stick.
Run the script once and take note of the total steps it'll perform to have an exact number to work with.
It's a good idea to divide them into at least a few epochs so the script can save snapshots, which you can use to track progress or debug.
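For a rough estimate before launching, here is a minimal sketch. It assumes the usual relation (every image is seen "repeats" times per epoch, divided by batch size); the function and parameter names are made up for illustration, and the number the script prints is still the one to trust.

```python
import math

def estimate_total_steps(num_images, repeats, epochs, batch_size=1):
    """Rough estimate: each epoch sees every image `repeats` times."""
    steps_per_epoch = math.ceil(num_images * repeats / batch_size)
    return steps_per_epoch * epochs

# 40 images, 10 repeats, 5 epochs, batch size 2 -> 200 steps per epoch
print(estimate_total_steps(40, 10, 5, batch_size=2))  # 1000
```

Bucketing and regularization images can shift the real count, so treat this as a ballpark.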
There are no optimal numbers you can just punch into the training scripts to get good results.
You can get good results with the defaults, or get disastrous results with the defaults. You won't really know until you try.
A perfect model will require a bunch of tries, troubleshooting and patience. All is based on the complexity of the subject, how clear it is to the AI, how properly tagged it is, and so on. It also depends on your personal standards, of course.
This paragraph is vague because what's "good" is entirely subjective. If it's not obviously broken, of course.
What model to train from?
It's called "training from" because you are resuming training from a model, also known as a "checkpoint".
Short answer: NAI for anything 2D (anime, cartoon, sketches, etc), and SD1.5 for realism/misc.
Usually, you want to train from a model with a lot of "shared ancestry". For example most known mixes contain or are derived from NAI, so training a model from NAI makes it compatible with all of them.
But if you go too far, it could be affected by "mixing dementia", so if you train 2D in SD1.5, it might come out poor in mixes.
- You can train from mixes, but the effects are hard to predict when generating with different models.
- The tagging has to be compatible (don't use NAI tags with models trained on e6 or Waifu Diffusion tags etc, they'll "point to the wrong places" and cause who knows what).
[Images: grid of Unet/TE strengths 0.1, 0.25, 0.5, 1 and 2; grid of Unet/TE strengths 1.0, 1.2, 1.4, 1.5, 1.6, 1.8]
Display of the effects of learning rate. It shows this model has just enough Unet training but could use a bit more TE training: Unet 1.0 - TE 1.5 looks accurate but not chibi. That means the next training run should go better with 1.5x the TE learning rate.
If you can't stand the scientific "e-notation" numbers, Kohya's script accepts plain decimal numbers as well. Therefore "1e-4" becomes "0.0001" and "5e-5" becomes "0.00005". It's up to taste.
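If you want to double-check a conversion, a quick sketch in Python (to_plain is a made-up helper name):

```python
# Scientific notation and plain decimals parse to the same value.
pairs = [("1e-4", "0.0001"), ("5e-5", "0.00005")]
for sci, plain in pairs:
    assert float(sci) == float(plain)
    print(f"{sci} == {plain}")

def to_plain(value):
    """Render a learning rate without e-notation, trimming trailing zeros."""
    return f"{value:.10f}".rstrip("0").rstrip(".")

print(to_plain(5e-5))  # 0.00005
```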
There are two ways to control learning rate.
|--learning_rate||0.005-0.0001||Master knob for LR. Sets the values for the other two.|
|--unet_lr||0.0001-0.005||Sets the Unet's LR. Most sensitive part of the model, don't set it too high.|
|--text_encoder_lr||0.00001-0.00005||Sets the text encoder's LR. It's the language processing of the model. Better set much lower than Unet's.|
What does this mean?
If you don't care, just set --learning_rate to set the other two.
Otherwise, set them individually, since it's redundant to specify --learning_rate if you set the other two. I just set it the same as the Unet LR.
Read below to see what each training component does.
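The fallback described above can be sketched like this. It only illustrates the behavior as explained in this guide, not the script's actual code; resolve_learning_rates is a made-up name.

```python
def resolve_learning_rates(learning_rate, unet_lr=None, text_encoder_lr=None):
    """Return the (Unet, text encoder) learning rates actually in effect.

    Any value left unset falls back to the master learning rate.
    """
    unet = learning_rate if unet_lr is None else unet_lr
    te = learning_rate if text_encoder_lr is None else text_encoder_lr
    return unet, te

# Only the master knob: both parts get the same value.
print(resolve_learning_rates(1e-4))
# Master knob plus a lower text encoder LR.
print(resolve_learning_rates(1e-4, text_encoder_lr=5e-5))
```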
Text encoder learning rate
The text encoder controls how the AI interprets text prompts when generating, and associates things to "neurons" when training.
The documentation for the Kohya scripts suggests using 5e-5 for it. If none is specified, it'll use the value of --learning_rate.
Testing models with the exact same training set and seed, and only changing this option, a lower value seems to separate details better.
- Lowering the TE learning rate seems to have benefits in separating objects. If you get unwanted objects in your generations, you may want to lower it.
- If you have difficulty causing things to appear without weighting the prompt a lot, you lowered it too much.
Unet learning rate
The Unet serves as a rough equivalent of visual memory. It also seems to have some information about how elements it learns relates to each other and their position in a structure.
The Unet is very easy to overcook, so if things look wrong, it's likely to be over or undercooked. The margin for it coming right is narrow, but the "good zone" varies from set to set, it's hard to determine.
Check out the troubleshooting section for advice if your generations look funky.
The standard value is 1e-4. Normally you don't want to touch Unet values unless you know what you are doing or:
- If your model seems overfit, it might have trained the Unet too aggressively. You can solve this with a lower learning rate or fewer steps, or by decreasing alpha or using other dampeners.
- If your model outputs pure blobs of visual noise (not "slightly blobby", I mean literally incomprehensible masses of nothing in particular) you set it way too high. Divide it by at least 8, you probably missed a zero or something.
- If your model seems too weak or cannot replicate fine details, the learning rate might be too low or it might require more steps.
Learning rates and batch size
I've heard recommendations about multiplying the learning rates by the batch size. In my experiments it seems to work fine, so I've set it as default in the powershell scripts.
UPDATE: Seems this was causing the TE to get a bit overcooked. I've removed the multiplication for TE LR. Seems it's better to leave it static.
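In code form, the rule I ended up with looks roughly like this (a sketch with made-up names, using the standard values from this guide as the example):

```python
def scale_for_batch(unet_lr, text_encoder_lr, batch_size):
    """Multiply the Unet LR by the batch size; leave the TE LR static."""
    return unet_lr * batch_size, text_encoder_lr

unet, te = scale_for_batch(1e-4, 5e-5, batch_size=4)
print(unet, te)
```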
Unless you cannot use AdamW8bit for some reason, it doesn't seem useful to train with the other optimizers. Some people show okay results, but in my testing the better results were still worse than the same set trained with AdamW8bit, or similar but requiring more real time or resources, thus making them not much of an optimization.
There might be cases where they help, so it's probably useful to have some records in English about this.
|AdamW8bit||None||Default. So far, the most well tested. WHEN IN DOUBT USE THIS||LOW|
|AdamW||None||Default, but 32 bits. Will use double the VRAM but it's more precise on paper, just slightly better in practice.||MID|
|DAdaptation||Adjust learning rate on the fly (adaptive). Requires arguments to work. Really high VRAM usage. (8GB Minimum)||HIGH!|
|SGDNesterov8bit||Works, low VRAM usage (equivalent to AdamW8bit). Extremely slow learning, 2000 steps nowhere near enough. Not really more "optimized" since it'll require much more time. Needs more testing.||LOW|
|SGDNesterov||Same issues as SGDNesterov8bit, but at higher precision like AdamW.||MID|
|AdaFactor||Overrides scheduler. Results similar to Nesterov, but it's adaptive and its VRAM usage is very low, could be useful to experiment further. Seems much better than DAdaptation.||LOW!|
|Lion||None||Not bad but can give out very strange results. Bit heavier due to being higher precision.||MID-HIGH|
|Lion8bit||None?||New 8-bit variant of Lion. Might make it more viable for low-VRAM setups.||TBD|
More experimentation is needed. It's very time-consuming to test these optimizers, so it's very possible I'm just not using the right parameters, but for the time being, just use AdamW8bit unless you have a good reason not to. If you can afford it, use DAdaptation as it removes any guessing in terms of what values to use.
Optimizers give the training their own set of rules and can simplify guesswork or use less resources, or more resources but for a better result.
While they cannot possibly turn a bad training set into something usable, DAdaptation and AdaFactor are adaptive, meaning they will automatically adjust learning rates. If properly configured, they'll eliminate the need to change or guess training rate values.
Lion gave me very strange results, like a model with a white-haired character giving a rainbow-colored mess for its hair, even when every other optimizer gets that right. But some people swear by it, so maybe there's more to it. It does add its own unique flavor to models, but I cannot quite explain why yet.
An 8-bit variant of Lion was recently introduced to the Kohya scripts, but it requires an optional update to the bitsandbytes library. I'll be covering it in June. Keep in mind that updating may damage your Kohya scripts installation, so give it a week or two before you proceed.
AdaFactor on the other hand did output results that weren't broken, but the training seemed really weak and might require a bit more time.
Makes me wonder if it might be more suitable for style or concept learning due to their smaller impact on the output, but I haven't tested it yet.
Update: Seeing you need very specific parameters to get good results, don't count it out yet.
I'll keep experimenting during June.
DAdaptation needs specific arguments to work. Scheduler must be set to constant and it requires --optimizer_args "decouple=True" "weight_decay=0.01" "betas=0.9,0.99".
DAdaptation seems to have difficulty taking the Unet/TE learning rates separately. Seems it's best to leave both at 1.0, despite talk of having them at unet=1.0 and te=0.5, and let it sort things out.
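Put together, the DAdaptation-related part of a launch line could be assembled like this. This is a sketch: --optimizer_type and --lr_scheduler are the flag names as I recall them from the scripts, so treat them as assumptions and check --help. The values are the ones given above.

```python
import shlex

def dadaptation_args():
    """DAdaptation-related arguments, ready to append to the launch command."""
    return [
        "--optimizer_type", "DAdaptation",  # flag name assumed, check --help
        "--lr_scheduler", "constant",       # flag name assumed, check --help
        "--unet_lr", "1.0",                 # leave both at 1.0 and let it sort things out
        "--text_encoder_lr", "1.0",
        "--optimizer_args", "decouple=True", "weight_decay=0.01", "betas=0.9,0.99",
    ]

print(shlex.join(dadaptation_args()))
```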
DAdaptation is an adaptive optimizer, it will adapt training values on the fly, saving you the need to control it yourself. It will generally give very good results for minimum effort, as long as the training set is not faulty (but that's true for everything in training). At this moment in time, I consider it the best available optimizer.
Big caveat, though. DAdaptation is pretty heavy. It uses 7.1GB of VRAM at batch size 1 (res 576x576, 6.9GB at 512x512), so 6GB VRAM users cannot use it, unfortunately. AdaFactor could be an alternative in that case.
Note that DAdaptation is not deterministic (it won't reliably repeat the same results with same seed and settings). Seems it will never be able to give the same outputs twice, making testing very difficult.
Due to the non-determinism of the optimizer, it's difficult to reliably gauge Alpha effects, take this with a grain of salt.
I tried training DAdaptation with various Alpha settings. Seems Alpha 1 gives the best results. Alpha 64 (with rank 128) was fine too.
Alpha 0 (= Net Dim) gave pretty bad results in comparison. Until I have more accurate numbers, I would recommend keeping Alpha between 1 and half of net dim/rank. (So if net dim is 128, from 1-64. If net dim is 32, from 1-16).
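The recommended range is easy to compute. The sketch below also prints alpha divided by rank, which is the scaling factor in the usual LoRA formulation (an assumption about what the scripts do internally); function names are made up.

```python
def alpha_range(net_dim):
    """Suggested alpha range from the text: 1 up to half of net dim/rank."""
    return 1, max(1, net_dim // 2)

def lora_scale(alpha, net_dim):
    """Scaling factor in the usual LoRA formulation: alpha / rank."""
    return alpha / net_dim

for dim in (128, 32):
    low, high = alpha_range(dim)
    print(f"net dim {dim}: alpha {low}-{high}, "
          f"scale {lora_scale(low, dim):.4f}-{lora_scale(high, dim):.4f}")
```

A lower alpha relative to net dim means a smaller scale, which is one way it acts as a dampener.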
Noise offset up to ~0.1 seems to not disturb things.
20230428: After training a good amount of models using DAdaptation, using the very same settings, I'd consider it the most solid way to train if you got enough VRAM. You don't need to worry about tweaking values, only about making sure the training set is good. This also confirms the training set is really the most important factor in training, as any defects encountered during my use of DAdaptation boiled down to bad images or tags entirely.
min_snr_gamma seems to actually work fine after all. The first few times giving worse results seemed to be a "fluke" of sorts.
Earlier note: the min_snr_gamma argument with the Kohya scripts seemed to decrease quality, but needed more testing.
Adding "betas=0.9,0.99" to the other optimizer_args seems to be beneficial, results to be determined.
To be completed.
As of right now, cosine_with_restarts seems to be the most effective of all the schedulers.
The scheduler alters the learning rate following a given pattern. For example, cosine will make the learning rates decay from their full value down to zero along a cosine curve.
However, there might be edge cases where an alternate scheduler is preferred. For example, if using DAdaptation as optimizer, you want to set it to constant.
|constant||Learning rates do not change.|
|constant_with_warmup||Like constant, but starts at zero and increases linearly during warmup_steps until reaching the given values.|
|linear||Drops constantly until it's zero at the end.|
|cosine||Learning rates drop from full to zero following a cosine curve (slow at the start and end, faster in the middle).|
|cosine_with_restarts||Like cosine, but starts from full LR at given intervals.|
|polynomial||Like linear, but with a fancier curve.|
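Until there is a proper graph, here is a sketch of the shapes as a multiplier on the base learning rate (1.0 = full LR). It follows the usual definitions of these schedules as found in the Hugging Face implementations, which is an assumption about what the scripts use underneath. The warmup length and number of restarts are example values, and polynomial is left out.

```python
import math

def constant(step, total):
    return 1.0

def constant_with_warmup(step, total, warmup_steps=100):
    # Ramps up linearly from zero, then stays at full LR.
    return min(1.0, step / max(1, warmup_steps))

def linear(step, total):
    # Drops steadily until it reaches zero at the end.
    return max(0.0, (total - step) / total)

def cosine(step, total):
    # Half a cosine wave: full LR at the start, zero at the end.
    progress = min(1.0, step / total)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def cosine_with_restarts(step, total, cycles=3):
    # Same decay, but it jumps back to full LR at the start of each cycle.
    progress = step / total
    if progress >= 1.0:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * ((cycles * progress) % 1.0)))

total = 1800
schedulers = [constant, constant_with_warmup, linear, cosine, cosine_with_restarts]
for fn in schedulers:
    samples = [round(fn(step, total), 2) for step in range(0, total + 1, 300)]
    print(f"{fn.__name__:22s}", samples)
```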
TODO: Make a graph!!
TYPES OF TRAINING
SELECT YOUR STYLE.
We got Finetunes, dreambooth and keyword dreambooth. Trickster and Swordmaster may be coming someday.
"Finetune style" (Default)
Like finetuning a model, it uses caption files, small .txt files matching the name of each of your images. Those captions instruct the training to look for something, and trains both the Unet and the text encoder.
Captions are composed of either a bunch of prose describing the image ("Laetitia walking on a flower field") or a collection of tags ("Laetitia, 1girl, solo, flower field").
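Since every image needs its matching caption file, a small check before baking can save a wasted run. A minimal sketch: the extension list is an assumption, and the file names in the demo are made up.

```python
import tempfile
from pathlib import Path

# Assumed list of image extensions; adjust to whatever your script version accepts.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}

def images_missing_captions(folder, caption_ext=".txt"):
    """Return the names of images in `folder` with no caption file next to them."""
    missing = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix in IMAGE_EXTENSIONS and not path.with_suffix(caption_ext).exists():
            missing.append(path.name)
    return missing

# Small demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "laetitia_01.png").touch()
    (folder / "laetitia_01.txt").write_text("Laetitia, 1girl, solo, flower field")
    (folder / "laetitia_02.png").touch()
    print(images_missing_captions(folder))  # ['laetitia_02.png']
```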
If you find the captioning process a hassle, or you want to train an artist style without bothering, scroll down to the "Keyword Dreambooth" section.
Multiple concept support
Training multiple concepts in a single LORA, despite their more focused usage, is still possible. The image example uses the character of Laetitia in Hitachi Seaside Park, both baked into the same model.
You can train backgrounds, more than one character, items related to that character...anything goes, but it needs tagging and some logical balance, some common sense applies:
- Don't put more backgrounds than characters.
- If training more than one character, make sure the total number of images for each character is roughly the same or one will take over.
I was able to replicate the "feel" of some good backgrounds with just one or two images (x10 repeats/epoch). I fully recommend training a few "thematic" elements with a character model. It gives flavor.
The name strikes me as odd, I always thought the basic form of dreambooth was "keyword activation AND regularization" with the finetuned hybrids coming out later.
Do NOT train styles dreambooth style. It seems to completely discard style, use it for characters.
Training your LORA with regularization images is described in the documentation as "dreambooth style" setup.
This method changes the rules and resembles dreambooth training results a lot more. By using it, you'll forfeit most of the style of your character but the AI will still somehow figure out the details. You will need to double the number of steps for it to work.
In exchange, there are a number of benefits:
- Seems to separate details from style a lot better.
- Seems to soak up "active" style much better (either from the model in use when generating or style addons).
- Can consistently turn stuff like chibi characters into something more normal.
Read the REGULARIZATION IMAGES section for more information on how to set them up.
When to use "finetune" or "dreambooth style" training
From my experiments, it seems you want dreambooth style when your training image set is bad or has a style or quirks you don't exactly want passed into image gen (like training a character on scribbles or chibis or such), and regular LORA finetuning when you do want the style and quirks of the training images to have more presence.
If you are very constrained with your training images, dreambooth style seems to always be worth it.
"Keyword Dreambooth" style
Okay, I finally settled on a descriptive enough name. This otherwise unnamed variation doesn't require tagging and uses a "sks" type of keyword instead.
Not recommended for anything serious but can be useful for testing or just for fun. It's the lazy boy approach to model training.
This model basically works like those basic dreambooths that define an activation word instead of tagging.
This example was trained from NAI, res 576, 160 net dim, about 2000 steps, keyword dreambooth style with 50 images (+50 random reg images). Both samples come from the same model, flexibility at work.
To set it up, name your folders with the format <repetitions>_<activation word> <class> (mind the space). You'd end up with something like 2_sks robot, for example. Use this format for regularization images as well. (You need them for this method).
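To sanity-check folder names against that format, a small sketch (the regular expression and function name are mine, not something the scripts provide):

```python
import re

# <repetitions>_<activation word> <class>, e.g. "2_sks robot"
FOLDER_PATTERN = re.compile(r"^(\d+)_(\S+) (.+)$")

def parse_folder_name(name):
    """Split a training folder name into (repetitions, activation word, class)."""
    match = FOLDER_PATTERN.match(name)
    if match is None:
        raise ValueError(f"folder name does not follow the format: {name!r}")
    repetitions, activation_word, class_name = match.groups()
    return int(repetitions), activation_word, class_name

print(parse_folder_name("2_sks robot"))  # (2, 'sks', 'robot')
```

A name without the space (like 2_sksrobot) raises an error, which is the mistake this is meant to catch.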
To activate, use the activation word in a prompt. Hardly used words made up of 3 characters (like the classic "sks", but there are more) work best. You can use any word but it's generally less efficient.
Make a compromise between "easy to remember/sensible" and "efficient" at your convenience.
Without the proper tag guidance the results are fairly bad, but it works. You need better prompts using models trained this way.
Also works if you want to train something not entirely humanoid, which is annoying to tag in bulk, but it's very random whether it'll stay on model or become a humanoid version of it.
Since the standard Kohya scripts cannot train LoHA, I'm leaving a proper testing of it for June.
LoHA is described as a good method for character training, it seems to give good results while keeping tiny sizes due to the small ranks involved.
PREPARING IMAGES
Valid formats and extensions are png, jpg/jpeg, webp and bmp. They must be lowercase in Linux or they will be ignored.
If you got incorrect images there (GIF, etc), the script will ignore them and continue. This can cause your training to go for more steps or whole epochs than intended and may cause the model to come out weird. When the script starts make sure the number of images and steps matches what you want.
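To catch ignored files before they throw off your step count, you can list what would be used and what would be skipped. A minimal sketch, assuming the lowercase extension list below (an assumption, adjust it to your version of the scripts) and .txt captions; the names are made up.

```python
import tempfile
from pathlib import Path

# Assumed set of accepted extensions, lowercase only.
VALID_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}

def split_training_files(folder, caption_ext=".txt"):
    """Return (used, ignored) file names, skipping caption files."""
    used, ignored = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file() or path.suffix == caption_ext:
            continue
        if path.suffix in VALID_EXTENSIONS:
            used.append(path.name)
        else:
            ignored.append(path.name)
    return used, ignored

# Demo on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    for name in ("a.png", "a.txt", "b.PNG", "c.gif"):
        (folder / name).touch()
    used, ignored = split_training_files(folder)
    print(len(used), "used:", used)
    print(len(ignored), "ignored:", ignored)
```

Compare the "used" count with the number of images the script reports when it starts.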
Organize your images as follows:
You want images where your subject is clearly represented.
Avoid images where:
- The character is not represented accurately. (wrong hair color/style, wrong details...)
- The character is being partially covered by something else.
- The style or pose are strange or low quality.
You do want images where:
- The art is accurate to the character's design.
- There are close-ups of the head. If you have large images it's never a bad idea to pick one with a good face and try to fit it in the training resolution (512x512 by default).
You may also want closeups of certain details. If you notice the AI is not giving a specific part enough definition, then a full-resolution image of that part might help. This could be useful for intricate decorations, accessories and such.
You also want some variety, but try to preserve character accuracy.
Most problems with a model's results are caused by improper tagging or too much/little baking. Images don't tend to cause as many issues by themselves, and when they do, they are easier to spot: it's obvious when a certain image is causing issues because something similar to it pops up in generations, basically poisoning the Unet with suckiness.
You will only be able to notice these images after you have trained; it's impossible to predict how the AI processes them otherwise.
The short version
If you aren't super picky about details, just grab 30-60 images of decent quality and go wild. That'll usually work.
If you don't have that luxury, refer to the "Rare character HOWTO" below, and good luck.
If you just want to generate a character without changing standard clothes, hair or accessories, that's all you need. If you want more out of the model, read the long version.
The long version
If you want the model to be flexible with clothes, backgrounds and whatever details, you will need to strategize and pick images more carefully.
Like before, the advice is to train first and fix issues later, as you might get lucky with a lazy set.
Consistency of individual elements
Depending on your standards, some images have more value than others when teaching the AI.
For example, imagine you train Superman, with his iconic logo over his chest. If you train with just images of Superman in his costume, the AI will attempt to draw the logo on pretty much everything you prompt for, even a business suit. You also need to tag accordingly so the AI can tell the logo is part of the costume and not part of Superman (but that's the most complex part, refer to the tagging section for details).
For teaching the AI an individual element, you want images where that element is clear, and images where the element is not present, so you can use prompts to control it, or to avoid it appearing all the time.
The Unet seems to learn by "difference", so simply editing an accessory or other element out of an image, and then feeding both "on" and "off" images, will help separate it. You can afford some sloppiness.
Image size/aspect ratio
By default the scripts will "bucket" images. That means putting them in different containers (the buckets) based on their resolution and aspect ratio. This is highly convenient. However, it seems there still needs to be some care involved.
If your images are bucketed too randomly, that seems to cause issues and lower quality of the model.
It seems to do better when most images fall in a single bucket or they are in a balanced spread (buckets have around the same number of images).
Regardless of bucketing, you usually want images that are the exact same resolution you are training with, those will always work better.
I seem to have somewhat better results when I increase the minimum a bit (320) and decrease the maximum a bit (768). Maybe because that means a smaller amount of buckets to spread images into.
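To get a feel for why a narrower min/max range means fewer buckets, here is a minimal sketch of aspect ratio bucketing. It is a simplification and not the scripts' actual code: the function names, the 64-pixel step and the 256/1024 "default" limits are assumptions made for illustration.

```python
def make_buckets(resolution=512, min_size=256, max_size=1024, step=64):
    """Enumerate (width, height) buckets whose area fits the training resolution."""
    max_area = resolution * resolution
    buckets = {(resolution, resolution)}
    for width in range(min_size, max_size + 1, step):
        # Tallest height (a multiple of `step`) that keeps the area within budget.
        height = min(max_size, (max_area // width) // step * step)
        if height >= min_size:
            buckets.add((width, height))
            buckets.add((height, width))
    return sorted(buckets)


def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


default_buckets = make_buckets(512, 256, 1024)  # assumed default limits
narrow_buckets = make_buckets(512, 320, 768)    # the 320/768 limits mentioned above
print(len(default_buckets), len(narrow_buckets))
print(nearest_bucket(1200, 1600, narrow_buckets))
```

Under these assumptions the narrower range roughly halves the number of buckets, so the same image set gets spread over fewer containers.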
Using regularization images is what achieves "dreambooth style" training, read above for an explanation.
Setting them up is similar to how you set up your training images, just place your regularization images with the same format as the training images.
You don't need to match the number of repetitions in the regularization image folder, but the names need to match. Use the multiplier as necessary to boost the number of reg images. Usually, you want as many regularization images as regular training images, repeats included.
The reg images do not need captions, but I hear it can help. Untested at the moment.
Use stuff of the same "class" as your character. Since you'll be training only characters with reg images, pick something that resembles your character but not quite. I find that AI outputs of a prompt equivalent to the one you would use with the LORA enabled work well (if your character has pink hair and is a girl, prompt for girls with pink hair, even if they don't resemble the character much; you want similar but not "close" results, apparently).
I've tried using images of the character being trained as well and it still worked.
This is highly abstracted and not exactly how it works, but if you are going this route, it'll work.
Think of the Unet as an old TV or CRT screen.
When you train the Unet you are putting an image on the screen and "leaving it for a while". Eventually, some "pixels" on the screen will get burned.
By applying regularization, you are kind of flushing the image so it doesn't burn up too much, like a screensaver.
A small amount of attributes from the reg images can go into the Unet, which might be enough to boost diversity of poses and other minor details.
Same setup as before but with some significant differences.
Turns out the regularization can be optimized with some extra steps. After someone tipped me about the actual effects of regularization, and I read the documents myself, turns out there's a method to it.
You will want to generate an AI reg image for every training image you have. The names will have to match. So every training image will have a matching regularization image.
To generate the images, you need to use:
- The same model you are going to train (animefinal-full-pruned.ckpt if NAI, etc).
- Same prompt as the caption for the training image.
- DDIM sampler, resolution equal to your training resolution (not the same as the training image!), seed equal to your training seed (420 if you didn't touch it in the scripts below).
Then rename the image so it matches the filename of the matching training image.
While I notice a decent quality level, and somehow a bit more retaining of style, I can't say it blows the easy method out of the water.
Apparently, and roughly, regularization images are reduced to latents and then trained on how to produce them back, using DDIM as sampler.
The theory seems to be that it then tries to match the tags, seed and latents so it has a more direct clue of what it is training, as opposed to trying to "look" into the whole thing. This does not mean the reg is being used as a mask or anything. Latents aren't exactly images, they are just represented as such so the eldritch knowledge doesn't fry our meaty brain.
Therefore the reg images need to match the tags and seed, as it seems to have an easier time correlating both that way.
Unfortunately, while the results were very good from testing it "by the books", it's not much better than doing it with random images.
The reason for why it works regardless is unknown.
TAGGING YOUR IMAGES (FINETUNING ONLY, IMPORTANT!!!!)
The AI is more tolerant of a few low quality images than of a few bad tags.
Install this webui extension: https://github.com/toshiaki1729/stable-diffusion-webui-dataset-tag-editor
It's a tool to batch edit/add/remove/interrogate tags for image sets, it'll help a lot with interrogating and simple batch (multiple) addition or deletion of tags.
The first step is using an interrogator like Deepdanbooru or WD1.4 Tagger. That will give you a bunch of tags to work with.
If using the above extension, do something like this.
The first time before clicking Load make sure to set an interrogator (wd-v1-4-swinv2-tagger-v2 works fine for me), and "Use Interrogator Caption" to "If Empty" to get your starting tag base.
SET IT TO "NO" AFTERWARDS. Consider it the gun safety trigger. If you leave it in anything but "No" or "if Empty" (once .txt files exist) it'll do it over and overwrite your changes!
Tagging is surprisingly important and too much or too little (or too randomly spread) can ruin your model or make it too rigid.
Training works by instructing the AI to "find" the tagged elements in the given image even if it doesn't exactly know what is what.
Your job is precisely to help it figure it out (unless it already knows from existing training).
When to tag characters/styles
Characters and styles don't need to be strictly tagged, but you are adding extra glue to correlate the character with the associated tags.
You WILL need to tag the character if you are training more than one. I generally recommend doing so.
Make sure you can provide examples with certain tags "on" and "off", like if your subject character has a hat, provide a pic with hat (and tag it as having a hat) and one without (without the tag). If you are having a severe lack of images, head to the "Rare character HOWTO" section.
Tagging works like trying to explain a thing to a person who is very gullible. Tag a bunch of apples as oranges and it'll start generating oranges when asking for apples. You also might want to specify certain details. If you have a character in a cape, it'll always consider the cape part of the character unless you tell it the character is wearing a cape.
Let's put it this way. The AI learns the whole character as a whole collection of things unless you specify its components.
What to tag
- Tag things you want the AI to be aware of. It can infer some things on its own, but don't count on it if it's unusual. The AI can't quite separate certain elements from a character because it has no idea what it is. You need to teach it with on-off samples so it can get it. So tag those.
- You'll also want to tag obvious stuff, the color and type of hair, the color and type of clothes, and whatever elements the AI struggles with when you test the model.
- What you tag also seems to have some influence at generation time. For example if a character has visible eyelashes in the training images, and you tag "eyelashes", you'll obtain visible eyelashes much more consistently without the need of explicitly prompting for it. Not doing it will make it random and might require prompting explicitly.
- Poses are to be tagged as well. From what other people have shown me, not tagging any pose seems to lead to samey poses when generating.
- Anything you feel important to note or you want to be more likely to appear in generations.
- Remember awareness of a tag means it can also be used for negatives. So tagging things you want to remove that way works, too.
What not to tag
- You don't usually need to indicate stuff like a human having a face or hands unless they don't. The Unet seems to be able to figure out the relationship between some elements.
- It's also not strictly necessary to tag a style/artist.
- Tagging things like "simple background" can influence the model into generating images with simple backgrounds. This doesn't disable the ability to generate backgrounds (simple background in prompt negatives can fix that) but might be undesirable. Similar to the eyelashes example above, existing tags might have some influence when generating.
- You may want to avoid tagging certain things, so the AI doesn't fixate on them, at your convenience.
EL DIABLO. This is bad and you want them gone. False positives seem to confuse the AI and consistency drops considerably.
Automated tagging (Deepdanbooru, WD1.4 tagger...) makes a lot of false positives, fails to see details, or gets hair colors wrong due to lighting or palette, so make sure to remove incorrect tags and add the ones that are missing. A single "brown hair" in a blonde character, due to a darker palette, will make the generations randomly show brown hair in broad daylight.
Make sure it doesn't goof like confusing a sparkle for a star, a lamp for a moon, and obvious goofs like that.
Let's put it in a simple, alarmist sentence: Bad tags contaminate the model.
Interrogators will usually return a group of tags when an element is detected. Like "boots, red footwear, red boots" when given an image of Laetitia.
You usually want to consolidate those into a single one. "Red boots" is the most directly descriptive there, so I choose that one.
When testing your model, if you feel an element is too insistent, too hard to remove, or appears everywhere, that's because you tagged wrong. Clean up and try again focusing on the tags related to that annoying element, that's the only way I found to get them out.
OTHER TRAINING PARAMETERS
The value for --clip_skip must be the same as the one you use to generate images with the model you train from.
So if you train on NAI, which uses CLIP skip 2, set it to 2.
If you use a realistic model, or one of the furry models that use CLIP skip 1, set it to 1.
The LORA won't exactly break but won't come out as good.
If you are analyzing a model too hard you may find the 11th layer is zeroed with CLIP skip 2. This is normal, as you skip training it, it's zeroed out.
Number of steps/epochs
The number of training steps shown is divided by your batch size. You are doing X steps at once, so if you have batch 2, it'll do two steps at once. Therefore, if you have 1000 steps and batch size 2, it'll show as 500 steps. Keep this in mind!
While more repeats are somehow better than more epochs, it's useful to have a few actual epochs in order to generate progress snapshots. If you burned a model, the snapshots might be fine at a previous epoch.
Total number of steps is your (number of images x number of repeats) x number of epochs, unless you are using regularization images, in which case it is ((number of images x number of repeats) + (number of reg images x number of repeats)) x epochs, or basically double the steps.
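The arithmetic above can be sketched in a few lines. The function names and the example numbers (25 images, 10 repeats, 4 epochs) are made up for illustration, and rounding the displayed count up is an assumption.

```python
import math


def total_steps(images, repeats, epochs, reg_images=0, reg_repeats=0):
    """(images x repeats + reg images x reg repeats) x epochs."""
    per_epoch = images * repeats + reg_images * reg_repeats
    return per_epoch * epochs


def displayed_steps(steps, batch_size):
    """Steps as shown by the progress bar once batch size is factored in."""
    return math.ceil(steps / batch_size)


plain = total_steps(images=25, repeats=10, epochs=4)
with_reg = total_steps(images=25, repeats=10, epochs=4, reg_images=25, reg_repeats=10)

print(plain, displayed_steps(plain, batch_size=2))        # 1000 500
print(with_reg, displayed_steps(with_reg, batch_size=2))  # 2000 1000
```

With as many reg images as training images (repeats included), the total doubles, which matches the "basically double the steps" rule of thumb.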
You want at least ~1000 total steps for training to stick. ~800 at the bare minimum (depends on whether the concept has prior training or not). For training from absolute scratch (a non-humanoid or obscure character, or a new character released after the main models) you'll want at least ~1500.
The perfect number is hard to say, as it depends on training set size.
- You want more steps if you lower the learning rate.
- If, during use of the model, you can raise LORA strength above 1.0 and it still works fine or better, you can train it more without ill effects.
You want to cram as many steps as possible into the training without burning the components, especially the Unet. Read below for more details.
Certain options act as learning dampeners, meaning they reduce the learning rate or its impact.
|Option||MIN||MAX||Base Default||Learn dampening||VRAM use||Generation effect|
|--resolution||512||768||512||↑↑↑ every +64||↑↑||Increases quality of finer details.|
|--noise_offset||0.0||1.0||OFF||↑↑↑ every +0.1||-||Increases dynamic range (brighter brights, darker darks). May "deep fry" if set too high.|
|--network_dim||1||768||128||↑ every +32||↑||Increases network size (can cram more into it).|
|--network_alpha||1||net dim||net dim||↑↑ every -16||-||Dampens learning to prevent precision errors.|
|--min_snr_gamma||1||20||5||More the closer to 0||-||Smooths average loss, making learning more stable.|
Numbers aren't absolute because there are no hard numbers to math. Every arrow represents a noticeable small decrease.
This is interesting for a good reason. The more steps you can fit into the training without burning up the Unet or TE, the better. So at times you may want to dampen learning so you can train more. This, of course, means more training time, but it seems that the results make much better use of the images, in addition to the options' own effects.
You can also compensate the learn dampening with more learning rate, but that might not work as well.
The simplest one to use is Alpha, as it has no other effects.
Net dim (network dimensions)/Rank
Network Dimensions can also be referred to as "rank". If you read about rank 128, that'd be net dim 128.
The current maximum value seems to be 768, after that it's known to cause issues.
The network dimensions, or rank, indicates how many parameters of the Unet/TE to train. The default is 4, but 128 is a very good value for most characters or styles.
The ninja scrolls indicate the higher the value the higher the "expressive power" of the model, but results in bigger files and more training time.
The only other drawback is file size. It seems to be about value x 1.3MB (x 2 if full precision). A hypothetical model with the current safe maximum size, 768, will be roughly 1GB. Double that if saving the model at float precision.
If you don't know what to set it at, leave it at 128 for characters. It's a lot more than it needs but it's a decent de-facto standard. 96 is also a good value for characters, and you can even go a bit lower (64) if your training set is known as good, to save on file size.
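For a quick estimate of the file size rule of thumb above (value x 1.3MB, doubled at full precision), something like this works. The helper name is made up and the 1.3 factor is only the approximation quoted in this guide.

```python
def estimated_size_mb(net_dim, full_precision=False):
    """Rough LoRA file size: net dim x 1.3 MB, doubled at full precision."""
    size = net_dim * 1.3
    return size * 2 if full_precision else size


for dim in (64, 96, 128, 768):
    print(dim, round(estimated_size_mb(dim), 1), round(estimated_size_mb(dim, True), 1))
```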
Alpha needs to be equal or lower than rank.
Network alpha is basically a learning brake, or dampener. It is always relative to the value of net dim.
This is used to scale weights (the model's actual data) when saving them by multiplying them by (alpha/net dim) and was introduced as a way to prevent rounding errors from zeroing some of the weights.
- Setting it to 0, or the same value as net dim, prevents the dampening from happening.
- The default value is 1, which dampens learning considerably, so more steps or higher learning rates are necessary to compensate.
- The maximum value is the same value as net dim. Higher values are allowed but when I tried it to see what happened, the output was pretty burned up.
Observe what the numbers become when given a net dim of 128:
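As a rough illustration of the (alpha/net dim) multiplier, this small sketch prints the scale for a few alpha values at net dim 128. The function name is made up, and treating 0 as "no dampening" simply follows the note above.

```python
def alpha_scale(alpha, net_dim):
    """Multiplier applied to the weights: alpha / net dim."""
    if alpha == 0:
        # Per the guide, 0 behaves like alpha == net dim (no dampening).
        return 1.0
    return alpha / net_dim


for alpha in (1, 16, 32, 64, 128):
    print(alpha, alpha_scale(alpha, 128))
```

At net dim 128, alpha 1 scales weights by 0.0078125, alpha 64 by 0.5, and alpha 128 by 1.0 (no dampening).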
Setting it too high might cause "deep-frying" effects similar to high (11+) CFG scale.
Noise offset works by adding a random value to the latents at learning time. This increases the dynamic range of images at the cost of slower learning.
This feature is entirely optional and should only be used if you want the effect or want to use it as a learn dampener.
The effects are similar to increasing the CFG scale at generation time, but a bit more subtle. Raising it too high might "deep fry" the outputs, so be careful. Just a bit doesn't hurt if you want slightly bolder colors.
A value of 0.1 is recommended. Just do 1-2 more epochs to counter the learn dampening.
Effects vary from training set to training set, some seem more prone to deep-frying when the value is high than others. Regularized models usually come out better.
After trying it for a while I can't really recommend it for general use, but it can be useful if your inputs have dull colors, or in some situations where the outputs looks better with higher CFG scale, so you can bake it in instead of raising it in webui.
- I've been using a value of 0.06 lately, just to add a little bit.
- If you know what you are doing, your training set is solid and you want more contrast, you can raise it safely.
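To make the "random value" idea concrete, here is a toy sketch using plain Python lists instead of tensors. It assumes the common "offset noise" formulation, where one random value per channel, scaled by the noise offset, is added to the training noise. The real scripts work on tensors and may differ in detail, and the function name is made up.

```python
import random


def apply_noise_offset(noise, noise_offset, rng):
    """noise: list of channels, each a flat list of per-pixel noise values."""
    shifted = []
    for channel in noise:
        # One random shift per channel, scaled by the noise offset.
        shift = noise_offset * rng.gauss(0.0, 1.0)
        shifted.append([value + shift for value in channel])
    return shifted


rng = random.Random(420)
noise = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(4)]
offset_noise = apply_noise_offset(noise, 0.1, rng)

# Every value in a channel moves by the same amount, so the channel's
# overall brightness shifts instead of just its fine detail.
print([round(o[0] - n[0], 4) for o, n in zip(offset_noise, noise)])
```

A higher offset means bigger whole-channel shifts, which fits the "more dynamic range, deep-fried if overdone" behavior described above.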
Exact effects to determine.
You can specify resolution as a single number (eg "512") or as a width x height value (eg "512x768").
Seems to increase detail quality and composition at the cost of more training time/VRAM, but it also seemed to cause secondary effects.
Training with it too high might decrease quality of lower resolution images, but small increments seem fine.
- 512 is a fine default.
- So far, 576 (576x576) has been consistently improving my bakes at the cost of training speed and VRAM usage.
The augmentations are basically simple image effects applied during training.
The effects can range from subtle to "enough of a boost to save a bad model", but they make training slower.
The available options are as follows:
--cache_latents will not work when augmentations other than flip_aug are enabled.
|Augmentation||Values||Effects||Disables cache latents?|
|flip_aug||None||Randomly flips images horizontally. Useful anytime, unless your character is heavily asymmetrical. Like that guy from Street Fighter III.||NO|
|color_aug||None||Does random hue shifts. Can enhance color ranges and separate similarly-colored elements a bit better.||YES|
|crop_aug||None||Slices large images into parts instead of scaling them. Needs testing.||YES|
|face_crop_aug_range||"min,max" (e.g."2.0,4.0")||Tries to zoom into faces. Effects hard to notice?||YES|
This is a new feature to smooth the random peaks in training loss, leading to "smoother" learning. Can help reduce Unet/TE issues.
Recommended value is --min_snr_gamma=5. 1 will have a stronger effect and 20 will barely have any effect. 20 is considered the maximum value.
It does not involve any speed loss or VRAM increase, so it doesn't hurt to use it. It's a pretty simple and straightforward optimization.
Using SNR Gamma can cause some issues with DAdaptation and possibly AdaFactor, but I only encountered one noticeable quality drop, so it should be fine to use all the time.
SNR Gamma will introduce a tiny bit of dampening as well.
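For the curious, the idea behind min SNR gamma can be sketched as a per-timestep loss weight of min(SNR, gamma) / SNR. This is a minimal sketch of the published weighting for noise prediction, not necessarily the scripts' exact code. The function name and the sample SNR values are made up.

```python
def min_snr_weight(snr, gamma=5.0):
    """Loss weight for a timestep with the given signal-to-noise ratio."""
    return min(snr, gamma) / snr


sample_snrs = (0.5, 2.0, 10.0, 50.0)  # hypothetical noisy-to-clean timesteps
for gamma in (1.0, 5.0, 20.0):
    weights = [round(min_snr_weight(snr, gamma), 3) for snr in sample_snrs]
    print(gamma, weights)
```

The weight never goes above 1, which is where the slight dampening comes from. A gamma of 1 cuts down most timesteps, while 20 only touches the cleanest ones, so it barely changes anything.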
TESTING AND DEBUGGING MODELS
Due to the random nature of AI imaging, first impressions might be misleading. Your model might output 3 great images by sheer chance, you call it good, and then you find out it has issues. Same if the first few images are bad.
So make sure to test your models and try a lot of different prompts, mixes and other addons to see how the model actually performs. XY plots or a few batches of like 5 images can help expedite the process.
Try a multitude of scenarios, poses, clothes and expressions. I just keep a list of prompts I copypaste when testing.
Furthermore, the AI shows amusing or revolting quirks over time that aren't obvious at first glance, usually due to some stray or missing tag. Signs of overfitting may not be obvious at first too.
In other words you want some solid numbers. Don't take first impressions literally, since this is highly random. Change some tags around, change negatives. I've gotten models that seemed bad at first glance but turned out to be pretty good once I adjusted my prompts a bit.
Strength (at generation time)
Your LORA should work fine at 1.0 strength when using it to generate images.
- If you have to lower it to not ruin images, it's overfit or poorly trained.
- If you have to raise it, that means you still have room available for more training or higher training rate, whether you want or not depends on how good it is now.
- If you have to raise it or otherwise it's hardly noticeable, you need more training or higher training rate.
A model becomes "overfit" when it tries to reproduce the training images too aggressively or the results are plain weird.
This is usually caused by the Unet "burning up".
Overfitting usually happens when you train for too long or with too high of a learning rate. So here's what to do to get a view of the various epochs:
- First, make sure to save a snapshot every epoch.
- Generate an XY plot with a search and replace. Set your prompt to have a keyword such as "LORA", then have the XY plot search and replace that term with your LORA's activation token at strength 1.0 (so step 1 finds "LORA" and replaces it with "<lora:last:1.0>" or "<lora:last-0000001:1.0>", tailor it to your model's file names). Then run that plot and see how the model progresses every step. The natural progression should be from "no effect" (first one, using "LORA" which means nothing to the AI) to an increasing likeness of a character or style.
- At this point you might find that epoch 7/10, for example, is perfectly fine and ready to go. Nothing else to do, just keep the winner.
- If it's always overfit, even at the start, review your tags.
- If even with pristine clean tags it still overfits, you may have net dim too high or alpha higher than net dim.
- If it overfits even then, and you are training on something that isn't NAI or base SD, try training from those instead.
- If nothing else works, you might have too many images or images that are dragging the thing down.
If you have the additional-networks extension
The webui extension allows modifying Unet and text encoder strength separately. This allows you to play with the values and try to see which component is over/underbaked. The changes usually reflect the adjustment necessary to training.
If, for example, your model works fine at 0.5 strength for Unet and 1.0 strength for TE, that means you just need to reduce the Unet learning rate by half.
If it works fine at 1.0 Unet strength and 2.0 TE strength, then leave Unet as it is, and double the TE learning rate.
It saves a lot of guesswork.
Nowadays, you can also check tags and metadata from the standard sd-webui LoRA selector.
In this case I didn't have enough examples of the character (One possible generic skin for the Medic class, or "megane Medi-ko" from Etrian Odyssey IV) with different clothes so it's biased to filling space for a bag that might or might not be asked for, and a specific design with a sailor neck and visible straps around the waist.
The bag can be solved by properly tagging all images with a visible bag, something I failed to do in this case. I went with "messenger bag" since it was the type that came up the most often when autotagging, but only applied it to a few images.
The coat elements issue is harder to get rid of and usually requires more images to teach it how to separate element from character and a tag to solidify that knowledge (preferably an existing one).
The differences are subtle compared to TE issues, but the more you train the more sense it makes. You'll also notice that kind of thing because it's really persistent, unlike TE errors that come and go. It usually requires a full change of clothes (like putting a character only wearing dresses in a hoodie or armor, or direct nudity) to remove via prompting, or very strong weights for alternate clothes or accessories.
Rare character HOWTO
EFFORT ALERT. DON'T GIVE UP
WORK IN PROGRESS (sciencing needs time and a lot of trial and error, please wait patiently)
I'm particularly specializing in this, I'll update this section with advice on how to get by if you only have a few crappy images of your favorite character.
- You will need at least 10 to 25 images depending on their quality and variety. There have been successful attempts with fewer images, but it's up to luck.
- Have a bit of critical eye, if you can choose, choose images that look better or at least look better to you.
- If you can't choose, use regularization to reduce the impact of the image's style.
- Human artists aren't always accurate and goof like the AI, and those goofs carry to the AI. You want many views and styles of the same thing, not variations of it. In other words, artist mistakes like wrong hairstyle, eye colors, mistaken clothing elements or so are better avoided. If you got nothing else, edit them for accuracy.
- At times this happens even within canon art. In those cases, pick your favorite variation and try to reinforce it the same way.
- img2img until it works is an option, but it can take from one to infinite tries.
- You want at least a couple views and different costumes, or if the costume has multiple layers, pictures with some layers removed will allow you to toggle, as long as you tag them accordingly.
- Again, this is not always possible. It's not uncommon for a character to be always drawn in the same clothes or pose, especially if really obscure.
- 3D models are viable, but unless you want to generate renders, use Dreambooth style to reduce impact.
- BE PRAYING
Using AI outputs as inputs
Stable Diffusion does not add an invisible watermark to prevent loopback training. Ignore hearsay stating the contrary.
AI outputs are perfectly valid as AI inputs. However, you must be really picky about them. They can influence the results heavily to look like a specific model, like the OrangeMixes.
Ideally, you want the AI images to be the same size as your training resolution.
Standard SD goofs can easily carry over your outputs unless regularized or balanced by many quality pictures (if you are reading this, chances are you don't have that luxury, though), so make sure it's as proper as possible, ask for a second opinion from a friend or something if you doubt (seeing many generations of the same thing can dull your senses enough to miss obvious errors).
A couple quick edits can do. If you got an image that looks fine but the eyes look dull, add some reflections in Photoshop or Krita. You don't need real art skills for this, just get by.
Hands are generally a lost cause but try to at the very least have the correct amount of fingers. Inpaint until it works if necessary.
Once you have managed to get a couple good outputs, add them to the training set, tag accordingly and train again. Rinse and repeat until the AI outputs come out good enough.
You can also use AI outputs as regularization images. I've done this multiple times without issue at this point, seems to work as well or better than manually curated artwork.
Elbow grease is necessary to get a really good model out of an obscure character. Depending on how complex the character is, you might need at least 5 tries to get good results (by my standards of accuracy and flexibility, of course. If your second try or even the first is good enough for you, then just wrap up and call it done, you already won).
For readers with art skills
If you have some art skills, you can train your own characters and style.
Similarly, if you train an existing character, you can customize a model further than what it's possible with public images and fanarts. You can supply extra angles, poses, clothes, expressions and proportions to increase the range of the model. While every trained model is different because of curation and standards, supplying your own artwork can make it much more unique thanks to that.
I've had training respond pretty positively to my own images and even lazy sketches, and I'm hardly the best at art.
When I have more free time to experiment with this, I'll expand this section further.
Use this webui extension to use your LORA: https://github.com/kohya-ss/sd-webui-additional-networks
Adjust strength as necessary. A good LORA will work fine at x1.0 strength.
Native SD-webui LORA support
As of recently, SD-webui supports LORA natively.
Read more about it here: Features · AUTOMATIC1111/stable-diffusion-webui Wiki
However, it seems it doesn't work with SDv2 LORAs, from what people tell me.
Is the extension still worth using?
Yes and no. The extension allows reading and editing metadata, and setting strength of the Unet and TE separately, which is useful for debugging a model's training.
You can use both the native support and the extension at once, they won't conflict.
SAMPLE POWERSHELL SCRIPT (WINDOWS)
START COPYING BELOW HERE, SAVE TO A TEXT FILE, EDIT PATHS AND OPTIONS AS NEEDED
SAMPLE BASH SCRIPT (LINUX)
START COPYING BELOW HERE, SAVE TO A TEXT FILE, GIVE IT EXECUTABLE PERMISSIONS, ETC
Quick LoHA explanation and settings.
- It seems LyCORIS is not quite compatible with Kohya's normal script. Which kinda sucks because it's the most up-to-date in terms of fixes and new methods. At least, I can't get it to create a proper file, as it has zero effect on generation once training is complete.
- This means it requires using alternative scripts. Since that'll be confusing to readers, I'll let it be until June.
- Due to LyCORIS presence, the standard name of LoRA has been renamed to "LoRA-LierLa" (LoRA for Linear Layers). LoRA-C3Lier (Conv2D 3x3 kernel) is something I still need to investigate further.
- Moved changelog to the bottom.
LyCORIS (LoCon, LoHa, (IA)^3, LoKR) are still considered moving targets, although there are already some extensions supporting it. I'll start doing research on those soon.
- Added note that when using DAdaptation, the first value specified in arguments (Unet or TE) is the one that counts. It's better to specify, in order, Unet and then TE learning rates. Corrected scripts to reflect this.
- Small writeup on min_snr_gamma, it's not a very meaty concept, it just generally helps for no (that I can notice) drawbacks.
- Kohya scripts now allow training individual LoRA layers. This means styles and especially poses/concepts can benefit greatly from this.
- I just need to figure out what layers do what...
- If someone knows please let me know somehow.
- This also means the LoRA Block Weight extension is a lot more useful for figuring them out.
- Kohya recommends not updating immediately but I made a few bakes without strange effects so far.
- --min_snr_gamma=5 does work with non-adaptive optimizers such as Adam8, Lion and such. Writeup will be coming soon.
- --min_snr_gamma=5 didn't work great, but maybe it's because it doesn't play well with adaptive optimizers like DAdaptation. I'll keep testing and make a full writeup as soon as possible.
- A new option, --min_snr_gamma, was added to the Kohya scripts. I read a value of 5 is a noticeable improvement in overall training, I will try to confirm by tomorrow.
At this time LoCon/LoHa are moving targets and not supported by the guide until development stabilizes.
- Some info on schedulers. Normally you want to stick to cosine_with_restarts, but just for completeness.
- Seriously need to catch up with LoCon. Give me until next Wednesday, I will have more free time to experiment then.
- Updated and cleaned up the tagging sections.
- Better explanation for Alpha. Hopefully this one works.
- Some fact-checking here and there.
- Added a more definite warning about the dreambooth extension for sd-webui at the top of the page. TL;DR: Don't use it.
- Corrected Kohya's name. My apologies.
Someone is going to use this guide to train a Kromer model. And that's terrifying. - The Other Lora Rentry Guy, 2023.