An "ideal image generation model" manifesto.

What's this?

I have noticed that many people, myself included, don't fully understand how image generation models are supposed to behave. So I decided to collect my thoughts and write a comprehensive list (in no particular order) of features that an "ideal" model MUST have, expanding on the term "text to image model" at least as I imagine it, completely ignoring implementation details while partially addressing the limitations of the current generation of SOTA models. Anything that helps, or will help, models reach this "ideal" level I would consider a good thing: new papers, new architectures, new training methods, new dataset management approaches, etc.

This rentry should outline "the best" image generation model ever, an ideal to strive for. It may seem minor and obvious at first glance, but I think it will greatly help align people's views on what they actually want and need from image generation models. I hope this manifesto helps set the priorities right, helps anyone who reads it navigate the difficult topics around image generation models and how they are designed, splits the big and complex problems into smaller and simpler ones, and helps identify existing issues in current models and find new areas for improvement.

For each paragraph I will include my thoughts and suggestions as well as additional explanations and examples.

The manifesto

The model MUST:

  1. be EASY TO INFER
    • Most people who have a GPU with 8GB+ of VRAM should be able to run it at high speed.
    • For corporations, this means reduced operating costs and simpler infrastructure.
  2. be CHEAP TO PRETRAIN.
    • At the very most, it should cost about as much to pretrain as Stable Diffusion 1.5 did.
  3. be OBEDIENT.
    • The model should generate images that accurately reflect the details and concepts specified in the prompt (if it is possible to do so); it should not substitute random things it came up with instead. It must include every detail specified. If the user wants a bad-looking image, the model should generate a bad-looking image, and vice versa.
    • If you prompt a man with his right arm sticking out of his chest, the model should depict a man with his right arm sticking out of his chest, not a normal man reaching his left arm out toward the viewer, and not a man performing a Nazi salute.
  4. be COMPLETELY UNCENSORED.
    • The model MUST include NSFW and shock content, such as porn images, guro or violence. It MUST include celebrities and political figures, as well as copyrighted material. It MUST include the various styles of various artists.
    • If you have to censor the model's output for whatever reason (safety, deepfakes, specific artist or author preferences, copyright or other censorship-related concerns), it should be handled in a way that prevents the end user from generating such content, yet does not hurt the model's performance. This aspect should be adjustable on the fly, for example through external filters, which I consider separate from the main model (see the filter sketch after this list).
  5. be DIVERSE, encompassing ALL of the possible subjects and their possible relationships.
    • The model should not focus on a specific style, group of concepts, characters or source of concepts; it should understand it ALL.
    • Examples of publicly accessible sources (more = better):
      • large image datasets (LAION, etc)
      • social networks (Instagram, X/Twitter, Discord, etc)
      • booru websites (danbooru, e621, etc)
      • artist websites (Pixiv, DeviantArt, ArtStation, etc)
      • various video sources (films, YouTube, cartoons, anime, etc)
      • gallery and image hosting websites (imgur, etc)
      • UI screenshots
      • you name it, the list continues...
  6. be internally COHERENT.
    • The model must CLEARLY and WITHOUT ARTIFACTS depict the intention behind the prompt, even if the prompt is vague and incomplete, while staying grounded in reality when imagining the details, unless instructed otherwise.
    • For example, for the prompt "five cars" the model should imagine 5 distinct cars stuck in traffic on a narrow city road, each car having a distinct color and model, with bored meat bags inside, alongside other details such as red traffic lights, pedestrians, buildings and the road itself.
    • NOT five cars in a vacuum, NOT five cars with faceless drivers, NOT five cars and a green traffic light, NOT five cars and incomprehensible road markings.
    • Although, if you wish to get something simple, you can prompt "just five cars" to get five fairly detailed cars freely placed on a simple background, each with headlights, windows and even four wheels.
  7. be able to CREATE TEXT on the image.
  8. be CHEAP TO ADAPT to new data.
    • The existing model must be able to adapt to new concepts quickly, with minimal computational resources, while maintaining its robustness and not crumbling.
    • Common ways to achieve something like this are LoRAs or Hypernetworks (see the sketch after this list).
  9. be FLEXIBLE.
    • The model MUST be able to generalize concepts well, swiftly handling different topics or styles in one prompt.
    • "A photo of anime girl hugging crying Abraham Lincoln depicted as a greyscale photo at his deathbed."

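Similarly, for point 8, here is a minimal sketch of the LoRA idea in PyTorch: the pretrained weight is frozen and only a tiny low-rank correction is trained, which is what makes adaptation cheap and keeps the base model from crumbling. This shows the general technique only; `LoRALinear`, the rank and the scaling are illustrative choices, not any specific trainer's recipe.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the original weights
        # small trainable matrices: out_features x rank and rank x in_features
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus a tiny trainable low-rank correction
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

# Hypothetical usage:
# layer = LoRALinear(nn.Linear(768, 768), rank=8)
# only layer.lora_a and layer.lora_b receive gradients during fine-tuning

```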
Closing thoughts

This is not a strict list by any means; the point is to establish goals and capture the key aspects that separate good image generation models from bad ones, assembling them all in one place.

Pub: 14 Jan 2024 04:52 UTC
Edit: 17 Jan 2024 02:59 UTC