ControlNet - redraw your images


In this guide I want to show you how to use ControlNet to improve your images.
While this guide focuses on Koikatsu images, you can use the method for any other image as well.
Please note that ControlNet is actively being improved and changed every day, so I might not always be quick to update this guide.


What is ControlNet?

ControlNet is a new way to influence diffusion models with additional conditions. But unlike the text prompt, which only gives rough concepts to the AI, ControlNet uses an image (map) as input. These maps contain simplified spatial data, for example the edges resulting from edge detection or a depth map.
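
To make the idea of a map a bit more concrete, here is a minimal sketch in plain Python that turns a tiny grayscale grid into a black-and-white edge map by thresholding the brightness jump between neighbouring pixels. This is only a toy stand-in for edge detection, not the Canny preprocessor that ControlNet uses; the function name `edge_map` and the threshold value are made up for illustration.

```python
def edge_map(gray, threshold=50):
    """Return a 0/255 edge map for a 2D list of grayscale values (0-255).

    A pixel is marked as an edge when the brightness jump to its right or
    lower neighbour exceeds `threshold`. Toy illustration only, not Canny.
    """
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            right = abs(gray[y][x] - gray[y][x + 1]) if x + 1 < width else 0
            down = abs(gray[y][x] - gray[y + 1][x]) if y + 1 < height else 0
            if max(right, down) > threshold:
                edges[y][x] = 255
    return edges


# A dark square (value 20) on a bright background (value 200).
image = [
    [200, 200, 200, 200],
    [200, 20, 20, 200],
    [200, 20, 20, 200],
    [200, 200, 200, 200],
]

# Print the map as text: "#" marks an edge pixel, "." marks empty space.
for row in edge_map(image):
    print("".join("#" if value else "." for value in row))
```

Note that the resulting map only says where the outlines are. Everything else about the original image, including its colors, is gone.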

Examples for maps usable with ControlNet
This shows three example maps which could be used with ControlNet. The leftmost image is the original from which the maps were generated.

Benefits of ControlNet

Unlike img2img, which "just" provides the AI with a starting point from which it should generate the image, ControlNet actually guides the generation process. This way a lot more detail can be preserved while the AI can still come up with entirely new content (for example a background).
ControlNet does not provide any information about color, so you have full control* over the colors through your text prompt.

* This can actually be a downside, as it's much harder to describe a certain color, and sometimes color just does not want to go where you want it to. This can be worked around to an extent by using ControlNet together with img2img.

Setup


Prerequisites

  • A working and fairly recent version of Automatic1111's WebUI
  • Basic knowledge of using the WebUI and prompt engineering. You can refer to this guide.

Installation

  • The ControlNet extension:
    1. Start the WebUI
    2. Go to Extensions -> Available and click the Load from button
    3. If the "extension Index URL" textbox is empty, type in: https://raw.githubusercontent.com/AUTOMATIC1111/stable-diffusion-webui-extensions/master/index.json
    4. Look for sd-webui-controlnet and click Install
    5. Restart or reload the WebUI
  • ControlNet Models:
    1. Have a look here to decide which models you want (for this guide we will be using Canny Edge and HED)
    2. Download the corresponding difference models from Huggingface or CivitAI
    3. Put the models in this folder: {stable-diffusion-webui}\models\ControlNet
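
If you want to double check that the models ended up in the right place, a small script like the following can list them. It is only a sketch: the function name `list_controlnet_models` is made up, the set of file extensions is my assumption about how the models are usually distributed, and you have to pass in the path of your own WebUI folder.

```python
from pathlib import Path

# Assumption: ControlNet models are usually shipped as .safetensors or .pth files.
MODEL_SUFFIXES = {".safetensors", ".pth"}


def list_controlnet_models(webui_root):
    """Return the sorted file names of model files in {webui_root}/models/ControlNet."""
    model_dir = Path(webui_root) / "models" / "ControlNet"
    if not model_dir.is_dir():
        return []  # folder missing: extension not installed or wrong root path
    return sorted(
        path.name
        for path in model_dir.iterdir()
        if path.is_file() and path.suffix.lower() in MODEL_SUFFIXES
    )


if __name__ == "__main__":
    # Replace this with the folder your WebUI is installed in.
    for name in list_controlnet_models("stable-diffusion-webui"):
        print(name)
```

An empty result means the folder is missing or contains no model files, in which case the Model dropdown in the ControlNet UI will most likely be empty too.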

Usage


ControlNet section in the WebUI
ControlNet integrates straight into the txt2img and img2img tabs and does not create its own tab

If the section in the image above does not show up for you, you might have an outdated version of the WebUI.

The ControlNet UI explained

ControlNet UI sections

General:

  • Input Image: This image is fed to the preprocessor or used directly as the map
  • Camera Button: Lets you quickly take an image using your webcam (if you have one)
  • Enable: Toggles ControlNet on and off
  • Invert Input Color: You mostly use this if you are doing scribble
  • RGB to BGR: I'm not sure when you would use this. If anyone knows, tell me
  • Low VRAM: Use this if you have 6 or 8 GB of VRAM

Model Settings:

  • Preprocessor: Choose your desired preprocessor. This converts the input image to the corresponding map. If you want to use the input image directly as the map, set this to none.
  • Model: Choose the model you want to use. The model must match the input map. More about models in the section below.
  • Weight: This is similar to the emphasis value of tags (like in "(blue eyes:1.2)"). If you want the AI to invent more details, lower this value. If you want the AI to stick more closely to your input map, increase it.
  • Guidance Start/End (T): These set which part of the diffusion process ControlNet is applied to. Start is the percentage of steps at which ControlNet starts controlling, and End is the percentage at which it stops. In a process with 20 steps, start = 0.2 and end = 0.8, ControlNet would NOT have any effect on the first and last 4 steps. You can use this if you only want to provide guidance in the beginning and leave the rest to the AI.
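
The Guidance Start/End example above can be worked out with a few lines of Python. This is a sketch of the arithmetic only: the function name `controlled_steps` is made up, and how the extension rounds fractional step positions internally is an assumption on my part.

```python
def controlled_steps(total_steps, guidance_start, guidance_end):
    """Return the 0-based indices of the steps on which ControlNet is active.

    Assumption: start and end are simply scaled by the step count and rounded.
    """
    first = round(guidance_start * total_steps)
    last = round(guidance_end * total_steps)
    return list(range(first, last))


steps = controlled_steps(20, 0.2, 0.8)
# Convert to 1-based step numbers for a human-readable message.
print(f"ControlNet is active on steps {steps[0] + 1} to {steps[-1] + 1} of 20")
print(f"Uncontrolled steps: {20 - len(steps)}")
```

With 20 steps, start = 0.2 and end = 0.8 this leaves the first 4 and the last 4 steps uncontrolled, which matches the description above.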

Preprocessor Settings:
This section is a little different for every preprocessor and only shows up when you have one selected. But it's generally made up of two parts:

  • Resolution: This specifies the number of pixels on the shorter side of the input map. It's usually good to set this to the same number of pixels as the shorter side of the image you are generating. You can try higher values to get a more detailed map, but please note that more detailed maps may actually lead to worse results. For more information read the section below.
  • Thresholds: This specifies threshold values for the preprocessor. These vary from preprocessor to preprocessor. It's worth playing with them a little, but you can generally leave them at their defaults.
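
As a quick illustration of the Resolution setting, the sketch below picks the preprocessor resolution from the generation size and estimates how large the resulting map would be. Both function names are made up, and the exact rounding the preprocessors apply (some may snap to multiples of 64) is an assumption.

```python
def suggested_resolution(gen_width, gen_height):
    """Preprocessor resolution = shorter side of the image you are generating."""
    return min(gen_width, gen_height)


def map_size(input_width, input_height, resolution):
    """Estimate the map size when the shorter side is scaled to `resolution`.

    Assumption: the aspect ratio is kept and the result is rounded to whole pixels.
    """
    scale = resolution / min(input_width, input_height)
    return round(input_width * scale), round(input_height * scale)


# Example: a 1920x1080 render, generating at 768x512.
resolution = suggested_resolution(768, 512)
print(resolution)
print(map_size(1920, 1080, resolution))
```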

Canvas Section:
This section is only useful when drawing your own map in the UI. Otherwise you can ignore it.

Resize mode:
This specifies how the input map should be transformed if the aspect ratios of the map and the generated image do not match. You usually want to leave this as Scale to Fit.

The buttons below can be used to preview the preprocessor result. This is especially useful for preprocessors which have settings that you can edit, and for openpose. I recommend checking openpose before starting a generation, because it sometimes has trouble detecting poses from anime pictures. You can use the Openpose-Editor extension to adjust the skeleton if openpose does not detect it correctly!

Why more detailed maps can cause less detailed results:
ControlNet uses pretrained models to tell the AI how to interpret a map. These models were trained on image pairs in which the map was converted at the same resolution as the training image. That means when generating a 512x512 image, the model expects the map to have the same level of detail as a map that was converted from a 512x512 image. If the map suddenly has a higher level of detail, the AI gets somewhat confused and does not stick to the map as closely. That way you get a less detailed output in the end.
If you are losing detail in the diffusion process, you should increase the resolution at which you are generating rather than the resolution of the preprocessor.

The generation process

Basic Usage

  1. Decide on a Stable Diffusion model to use.
  2. Put your image (e.g. a Koikatsu render) in the image field in the ControlNet UI.
  3. Select a model and preprocessor. There is no generally best model to use, but here are some personal opinions that might help you choose:
    • Canny Edge is generally good for preserving the outline of your characters. The sections that do not have lines in the map hold no valuable information, which means you can control them with the text prompt. This makes Canny Edge usually the best choice if your image does not have a background and you want the AI to invent one.
    • HED maps have more gradations than just black and white, which makes HED better at preserving the detail of your character. But this decreases Stable Diffusion's ability to invent new stuff such as a background. It also means that if your map has "errors" (for example because of clipping in your input image), it's easy to confuse the AI.
    • Depth maps are really good when it comes to scenery. If your image already has a lot of background and scenery and you don't care as much about the character, try using depth.
    • More findings and tips to come, there is a lot to experiment with here!
  4. Set a generation resolution and change the resolution of the preprocessor accordingly. This really depends on the level of detail of your image and your GPU power. Just remember: You can always upscale later!
  5. Write your prompt.
    • While ControlNet on its own already suffices to generate an image, writing a good prompt is vital for generating high quality images!
    • I usually go about writing my prompts in three steps:
      1. A default prompt. This can be whatever you want but I personally use this:
        • Positive: Masterpiece, best quality,
        • Negative: lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry
          Tip: you can use the "styles" function in the top right to save default tags and quickly apply them!
      2. Write down the tags that describe your image. Tip: you can use the Tagger extension to automate this process to some extent.
      3. Refine the prompt to make the AI generate what you want. This includes but is not limited to:
        • Emphasis: Lower tags that dominate the image, increase tags that get omitted.
        • Expand negative: Add things to the negative prompt that the AI gets wrong, or the opposites of things you want. For ControlNet you often want to put pieces of clothing in the negative prompt that it would otherwise add by mistake (for example: gloves).
        • Add or change colors: As mentioned before, ControlNet does not influence the colors of a generated image. For the output to match your input image you have to carefully describe the colors. You will probably use a lot of emphasis here. Unfortunately ControlNet seems to increase the chance of colors spilling from their tag into other parts of the image.
        • Add a background: This can actually help to reduce errors with the character in an image. It seems that Stable Diffusion is more aware of anatomy and lighting if you give the character a context.
  6. Generate and change settings according to the output.
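
The three prompt-writing steps can also be sketched in code, which is handy if you keep your tags in a list and want to tweak the emphasis quickly. The helper names `emphasize` and `build_prompt` and the example tags and weights are made up for illustration. Only the "(tag:weight)" emphasis syntax is the one the WebUI uses.

```python
def emphasize(tag, weight=1.0):
    """Wrap a tag in the WebUI emphasis syntax, e.g. "(blue eyes:1.2)"."""
    if weight == 1.0:
        return tag  # default weight needs no brackets
    return f"({tag}:{weight:g})"


def build_prompt(default_tags, described_tags):
    """Join default tags and (tag, weight) pairs into one prompt string."""
    parts = list(default_tags)
    for tag, weight in described_tags:
        parts.append(emphasize(tag, weight))
    return ", ".join(parts)


positive = build_prompt(
    ["masterpiece", "best quality"],  # step 1: default prompt
    [
        ("1girl", 1.0),               # step 2: tags describing the image
        ("blue eyes", 1.2),           # step 3: raise a tag that gets omitted
        ("black dress", 1.3),
        ("long hair", 0.9),           # step 3: lower a tag that dominates
    ],
)
print(positive)
```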

Generation showcase
Here you can see that a lot of prompt engineering goes into an image. Notice that there is still quite a difference in color between the original image and the generated one.

Improved workflow for Koi images:

This improvement uses the Multi-ControlNet capability of the ControlNet extension. To be able to use multiple ControlNets at once you need to do the following:

  • Make sure you are on a fairly recent version of the extension (Multi-ControlNet is a new feature)
  • Enable Multi-ControlNet:
    1. Go to the settings tab and choose ControlNet on the left
    2. Increase the Multi ControlNet: Max models amount slider to at least 2 (or higher if you want to use more ControlNets at once)
    3. Increase the Model cache size. This is not necessary, but if you do not increase the cache, the models have to be reloaded every time you press generate, which slows everything down a bit.
    4. Restart the WebUI.
  • Check that there are multiple tabs in the ControlNet UI now.

The main problem with using just Canny Edge or HED is that the AI sometimes gets confused about what a certain shape belongs to, which leads to weird stuff like hair becoming arms and legs not being where they are supposed to be. This can be prevented by using openpose alongside our main ControlNet to give the AI an idea of how a character is posed.
To do so, configure your first ControlNet as described above, then go to the second ControlNet tab and put in your image as well. Choose openpose for the preprocessor and model. Make sure you also adjust the resolution like with Canny or HED.
As mentioned earlier, openpose sometimes has trouble detecting the pose from anime style images, so make sure you check the map. If openpose failed, I recommend using the Openpose Editor extension to fix the skeleton:

  • Adjust the width and height to match your generated image
  • Import your image with the Add Background image or Detect from Image button (the latter will take a moment, don't worry if it looks like it doesn't load)
  • Adjust the skeleton to fit the pose. For the head, put the first node in the middle of the head, the second node roughly above the middle of the eyes and the outer node where human ears would be.

Click the send to txt2img (or img2img if you're doing that) button. Make sure to select which ControlNet to send it to with the dropdown on the right! In our case that is the second ControlNet (which is number 1, since the numbering starts at 0).
Then continue as usual, writing your prompt and adjusting the primary ControlNet to get something you are happy with.
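
The zero-based numbering of the ControlNet tabs is easy to trip over, so here is the two-ControlNet setup from this section written down as plain data. This is purely illustrative: the `ControlUnit` class, its field names and the placeholder model names are made up and are not part of the extension.

```python
from dataclasses import dataclass


@dataclass
class ControlUnit:
    """One ControlNet tab in the UI (hypothetical structure for illustration)."""
    preprocessor: str
    model: str  # placeholder name, use whatever your model file is called
    weight: float = 1.0


# First tab (number 0) keeps the outline, second tab (number 1) fixes the pose.
units = [
    ControlUnit(preprocessor="canny", model="canny"),
    ControlUnit(preprocessor="openpose", model="openpose"),
]


def unit_index(units, preprocessor):
    """Return the 0-based tab number of the unit using `preprocessor`."""
    for index, unit in enumerate(units):
        if unit.preprocessor == preprocessor:
            return index
    raise ValueError(f"no unit uses preprocessor {preprocessor!r}")


# This is the number to pick in the Openpose Editor dropdown.
print(unit_index(units, "openpose"))
```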

Img2Img and other methods not explained here

Img2Img

It's often hard to get colors similar to the colors in your original image using txt2img. In this case it is useful to use img2img with a rather high denoising strength (0.6 to 0.8). This way, the generation process will start off with the right colors and take it from there. But there is a catch: doing that means you are confined to the background you have in the original image.
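
As a rough mental model for the denoising strength: with the WebUI's default settings, img2img does not run all sampling steps but only about the fraction given by the denoising strength, starting from a noised version of your image. The sketch below shows that arithmetic. Treat it as an assumption about the default behaviour, which may differ between versions and settings. The function name `approx_img2img_steps` is made up.

```python
def approx_img2img_steps(steps, denoising_strength):
    """Estimate how many sampling steps img2img actually runs.

    Assumption: roughly `denoising_strength * steps`, with the strength
    capped just below 1.0.
    """
    return int(min(denoising_strength, 0.999) * steps)


for strength in (0.6, 0.8):
    ran = approx_img2img_steps(20, strength)
    print(f"denoising {strength}: about {ran} of 20 steps are run")
```

The lower the denoising strength, the more of the original image (and its colors) survives, but the less room the AI has to change anything.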

ComfyUI

ComfyUI is an alternative to Automatic1111's WebUI which lets you build Stable Diffusion workflows in a node-based fashion. It's a lot more complicated but also gives you more freedom. Two core advantages are the ability to use area-confined prompts (they only affect a set area of the image) and the use of multiple ControlNets. The drawback is setup: while installing ComfyUI isn't hard, creating a workflow takes some time, and there are currently no preprocessors (unless you locally merge a pull request).
In summary: ComfyUI is a lot less user friendly, but can yield better results if you know how to use it.

Examples/Showcases

The method explained in this guide was used here in addition to a pass of SD upscale without ControlNet.

Black dress image redrawing
Left: Original | Middle: generated Image | Right: upscaled Image


School uniform image redrawing
Left: Original | Middle: generated Image | Right: upscaled Image


Since the image is generated in txt2img, you can specify which colors you want:

Color variation on dress
Same seed was used for all but the red dress


Color variation on hair
Same seed on all images, but I additionally saturated some of the images in post

Credits

  • Njaecha - The guide
Pub: 19 Feb 2023 15:32 UTC
Edit: 29 Mar 2023 10:37 UTC