Loss Masks
Stable diffusion training works by showing the AI a copy of a training image with noise artificially added to it. The AI then attempts to denoise the training image. The resulting image is then compared to the original on a per-pixel basis and the difference between the two is reported as loss. The value of loss during each step of training determines what the AI learns.
Loss masks scale the amount of loss reported for specific pixels of your training images. This allows the user to effectively remove parts of training images from the AI's consideration, as denoising decisions made in these areas will not influence its learning. This provides an alternative to introducing blank, erased spaces or introducing areas artificially filled via inpainting to the dataset. Loss masks may also be used to allow the AI to retain a small degree of information, rather than discarding the loss entirely.
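Conceptually, the mask is just a per-pixel weight applied to the loss before it is averaged. A minimal sketch of the idea in PyTorch, as an illustration of the concept rather than SD-Scripts' actual code:

```python
# Illustration of the concept only, not SD-Scripts' actual implementation.
# The mask scales each pixel's loss before averaging, so regions where the
# mask is 0 contribute nothing to learning, and values between 0 and 1 let
# a region contribute only partially.
import torch.nn.functional as F
from torch import Tensor

def masked_loss(pred: Tensor, target: Tensor, mask: Tensor) -> Tensor:
    # pred, target: (batch, channels, height, width)
    # mask: (batch, 1, height, width) with values in [0, 1]
    per_pixel = F.mse_loss(pred, target, reduction="none")
    return (per_pixel * mask).mean()
```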
There are two forms of loss masks available in SD-Scripts: image based and .npz based. Image based loss masks are separate image files paired with each training image, while .npz based loss masks are created automatically by SD-Scripts during the latent caching step and are based on a training image's transparency.
When using image based loss masks, each training folder must be matched with one mask folder. This mask folder must then contain one mask image for each training image. Multiple training folders cannot point towards the same mask folder; their contents must be 1:1.
Image mask folders must be specified in the dataset config .toml file.
An example of a dataset config is as follows:
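A minimal sketch, assuming the standard SD-Scripts dataset config layout (the paths, resolution, batch size and repeat count below are placeholders to adjust for your own dataset):

```toml
[general]
resolution = 1024
enable_bucket = true

[[datasets]]
batch_size = 2

  [[datasets.subsets]]
  image_dir = "C:/training/my_dataset/images"
  conditioning_data_dir = "C:/training/my_dataset/masks"
  caption_extension = ".txt"
  num_repeats = 10
```

As far as I can tell, the training script also needs the --masked_loss argument passed to it for the masks in "conditioning_data_dir" to actually be applied.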
In the above example "image_dir" refers to the folder containing the training images. "conditioning_data_dir" refers to the folder containing the loss masks.
If you are using the bmaltais GUI then epoch count, max step count and lr warmup steps must, for some reason, be specified under additional parameters when using a .toml file. Those arguments are:
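Assuming the standard SD-Scripts argument names (the numbers below are placeholders for your own values):

```
--max_train_epochs 10 --max_train_steps 2000 --lr_warmup_steps 200
```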
Alternatively, SD-Scripts can generate its own loss masks based on a training image's alpha channel. This approach requires no changes to the dataset config .toml, but it does require erasing the masked parts of the training image itself. Creating the .npz files for alpha masks is also very slow, and the resulting files are several times larger than image masks.
Alpha based masks are enabled with a command line argument rather than through the dataset config.
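In recent versions of SD-Scripts this appears to be the following flag; I'm inferring the exact name, so verify it against your install's documentation before relying on it:

```
--alpha_mask
```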
Composition issues with loss masks and a solution:
Loss masks seem to exacerbate issues with image composition that cause character features to be stretched and deformed. I have not identified the actual reason, but the following is an example of a problematic dataset:
The issue is seemingly due to the fact that the character's features are consistently in the same areas of the image, and even touching the borders. This seems to cause the AI to learn that those features belong in those specific areas, regardless of the overall composition of the image.
In order to mitigate stretching the dataset can be modified to look like this:
The character and their features are now dispersed much more randomly throughout the canvas, and this is reflected in learning.
Tools
In order to streamline dataset modification and mask generation I use two scripts that are found in this mega: https://mega.nz/folder/9oxATQTK#mAe-SdELv6A03F7Dazxy3w
pad_buckets.py will take images from an input directory and its sub-directories, automatically crop/scale/translate/rotate them and fit them into buckets semi-randomly, and then save them into an output directory while preserving the original file structure. This is intended to prevent composition issues when using many images with similar subject framing, but it can also simply be used for dataset augmentation.
mask.py will take images from an input directory and its sub-directories, automatically generate loss masks for them based on transparency, and then save them into an output directory while preserving the original file structure. mask.py does not handle partial transparency; all pixels with color information are marked for full loss calculation.
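For reference, the core idea can be sketched in a few lines of Python. This is not the actual mask.py, just an illustration of the transparency-to-mask behaviour described above, and the paths in the example are hypothetical:

```python
# Illustrative sketch of transparency-based mask generation (not the actual mask.py).
# Any pixel with alpha > 0 gets full loss (white); fully transparent pixels get none (black).
from pathlib import Path
from PIL import Image

def generate_masks(input_dir: str, output_dir: str) -> None:
    in_root, out_root = Path(input_dir), Path(output_dir)
    for src in in_root.rglob("*.png"):  # transparency requires PNG (or similar) sources
        alpha = Image.open(src).convert("RGBA").split()[3]
        # Binary mask: partial transparency is not handled, matching the behaviour described above.
        mask = alpha.point(lambda a: 255 if a > 0 else 0).convert("L")
        dst = out_root / src.relative_to(in_root)
        dst.parent.mkdir(parents=True, exist_ok=True)
        mask.save(dst)

generate_masks("dataset/images", "dataset/masks")  # hypothetical paths
```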
Difference Training
LORAs learn by adjusting the model's innate knowledge in order to better reproduce training images during denoising. This means that we can take an unwanted concept, add it to the base model, and then train a second LORA that includes that unwanted concept in its dataset. If the LORA cannot improve that model's ability to reproduce the unwanted concept then it will only learn the difference between its innate knowledge and the new concepts within the dataset. If the difference LORA is then used during inference the unwanted concept should be absent or mitigated.
Difference training has two steps:
- Train a Base LORA
- Train a Difference LORA
Training the Base LORA
Creating a Base LORA containing the unwanted concept is probably the most important part of difference training. It's also effectively just training a style LORA. These are some general observations I've made about Base LORA training over the last year or so, which may or may not all be true:
- Keeping the base model relatively intact hypothetically improves the Difference LORA's accuracy and its ability to be applied to various models.
- Activation tokens and regularization images may help preserve the original model.
- Activation tokens are not necessary to prevent style learning and may weaken the effect.
- The activation token used for styles influences the effectiveness of difference training. "3D" can work better than "screenshot" for example.
- Style accuracy plays a significant role in the quality of the Difference LORA.
- Using multiple styles greatly reduces style learning in the Difference LORA.
The best way to acquire a set of regularization images is probably to generate them with the base model, but remember that these are treated like any other image during training. A LORA can and will still learn style, concepts and composition from them, which means they need to be captioned like any other image. Because of this, spending a relatively small number of steps on regularization images is likely a good idea; 15-20% of the total step count might be a good starting point.
LORA training software like SD-Scripts contains an option for a regularization dataset to be used alongside the training data, but this simply loads enough regularization images that 50% of all steps are spent on them, and it scales their loss during training. I would advise against using this feature.
Below are examples of captioned images from a Base LORA's training dataset which was trained to reproduce Koikatsu and MMD screenshots:
Image | Caption |
---|---|
![]() | koikatsu, 1girl, solo, thighhighs, long hair, skirt, blue eyes, full body, shoes, looking at viewer, smile, standing, brown hair, sneakers, black footwear, shirt, short sleeves, very long hair, pleated skirt, blush, white shirt, closed mouth |
![]() | 3d, 1girl, solo, tube top, pink hair, black ribbon, strapless, thighhighs, long hair, breasts, standing, ribbon, black shorts, midriff, crop top, looking at viewer, shorts, single thighhigh, two side up, full body |
![]() | 1girl, solo, green hair, green eyes, shorts, looking at viewer, navel, short shorts, shirt, closed mouth, black shorts, white shirt, bare shoulders, cowboy shot, breasts, midriff, simple background, medium breasts, medium hair |
And a link to the dataset which these images are from: https://mega.nz/folder/klp0WB5A#2lH3NJOYbLbJIuGlE8-Mug
If the Base LORA's training works then the results should look something like this, with the style noticeably changing while testing:
Training the Difference LORA
The main difference between training a Difference LORA and training any other concept or character LORA is the use of the Base LORA:
Using the Base LORA can be done in two ways: merging it into the training model, or passing it as base weights. If using SD-Scripts you can simply use the argument --base_weights "C:\path\to\base_lora.safetensors".
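For example, appended to a normal training command (sd-scripts also appears to accept a --base_weights_multiplier argument to scale the merged weight; treat both values below as placeholders and confirm the flag names against your version):

```
python sdxl_train_network.py [your usual arguments] --base_weights "C:\path\to\base_lora.safetensors" --base_weights_multiplier 1.0
```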
However, this functionality is broken with at least some iterations of the bmaltais GUI. A less space efficient alternative is simply merging the Base LORA with whatever checkpoint you are using to train, which can be done with various GUIs or with the following terminal command:
python C:/path/to/sd-scripts/networks/sdxl_merge_lora.py --sd_model C:\path\to\model.safetensors --save_precision fp16 --precision float --save_to C:\path\to\base_lora_model_merge.safetensors --models C:\path\to\base_lora.safetensors --ratios 1
Once that is done you simply need to use the resulting model in place of whatever model you would normally use.
The only other difference with training a Difference LORA is the potential use of corresponding style tokens in captions. These are some general observations I've made about Difference LORA training over the last year or so:
- If multiple styles are used then including activation tokens in the captions of your training images reduces style learning. This may not be noticeable with some training settings.
- The position of style tokens in a training image's caption does not appear to matter greatly, but this may vary depending on the performance of the Base LORA.
Below are examples of captioned images from a Difference LORA's training dataset which was trained using Koikatsu and MMD screenshots:
Image | Caption |
---|---|
![]() | maid, ellen joe, 1girl, solo, red eyes, full body, short hair, standing, 3d, argyle thighband pantyhose, high heels, black dress, headdress, maid apron, wrist cuffs, metal collar, tail, black background, puffy sleeves |
![]() | school uniform, 1girl, solo, full body, short hair, closed mouth, standing, red nails, hand up, medium breasts, ellen joe, koikatsu, grey shirt, brown pantyhose, pleated skirt, school shoes, belt, choker, pendant choker, black background |
Since caption shuffling is enabled for this dataset, the position of the style tokens does not matter.