This is a script that creates a visualized narration from a text (slideshow or picture book-like video) by chaining a bunch of AI tools together. GET IT HERE
Here are some output examples, none very good: Amara's Awkward Adventure, The Last Hope of the Universe, A Grey Lump in the Dirt (ongoing story from /qst/).

It roughly works by:

  1. creating an audio file from your text input with XTTS-v2
  2. creating subtitles to it with OpenAI's whisper
  3. generating timestamped image prompts from the subtitles with a language model
  4. generating images from the image prompts with Stable Diffusion or another image model that can use the same workflow in ComfyUI (flux is fun!)
  5. merging the audio and the images into a video
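
To make the hand-off between steps 3 and 5 a bit more concrete, here is a minimal sketch of how timestamped prompts could be turned into per-image display times. It assumes each image stays on screen until the next prompt's timestamp (and the last one until the audio ends); the `TimedPrompt` class and `image_durations` function are hypothetical names for illustration, not taken from the script.

```python
from dataclasses import dataclass


@dataclass
class TimedPrompt:
    start: float  # seconds from the beginning of the audio (assumed unit)
    prompt: str   # image prompt produced by the language model


def image_durations(prompts, audio_length):
    """Return (prompt, start, duration) tuples.

    Assumption: an image is shown from its own timestamp until the next
    prompt starts; the final image runs until the end of the audio.
    """
    ordered = sorted(prompts, key=lambda p: p.start)
    spans = []
    for i, item in enumerate(ordered):
        end = ordered[i + 1].start if i + 1 < len(ordered) else audio_length
        spans.append((item.prompt, item.start, max(0.0, end - item.start)))
    return spans
```

The resulting durations are what a video library would need in order to lay the images over the audio track.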

You can also provide an existing audio file as your input and skip the first step, or provide an audio file plus image prompts and skip steps one to three. The latter is especially useful if something goes wrong during image generation. All generations are saved into "./outputs/" (except the images, which are saved in your ComfyUI outputs folder), so you can just reuse them and fix them up if necessary.
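
As a rough illustration of which steps each kind of input skips, the selection could look something like this. The function name and the step labels are made up for this sketch; the real script may organize this differently.

```python
def steps_to_run(text=None, audio=None, prompts=None):
    """Pick the pipeline steps that still need to run for the given inputs.

    Step labels are hypothetical shorthand for the five steps listed above.
    """
    if audio and prompts:
        # Audio and image prompts already exist: skip steps one to three.
        return ["images", "video"]
    if audio:
        # Audio already exists: skip only the text-to-speech step.
        return ["subtitles", "prompts", "images", "video"]
    if text:
        return ["tts", "subtitles", "prompts", "images", "video"]
    raise ValueError("one kind of input must be given")
```

For example, `steps_to_run(audio="story.wav", prompts="prompts.txt")` only leaves image generation and the final merge, which is the case you want when rerunning after a failed image generation.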

Sending the whole SRT file to the LLM gave me poor results on long inputs, so I implemented batch processing. If used, the SRT is split after a set number of sections (a hundred seems to be fine), and each batch is then sent to the LLM for image prompt generation. The --gen_char_info flag makes the LLM generate a description of the characters in the story, which is subsequently used as part of the prompt for image prompt generation. It should probably be used whenever an SRT split can be expected to occur, to help with visual consistency. Claude 3.5 Sonnet is strongly recommended for consistency and quality. Regular Sonnet likes to shit the bed sometimes.
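
The splitting itself can be sketched roughly like this, assuming SRT sections are separated by blank lines (as in standard SRT files). `split_srt` and `sections_per_batch` are illustrative names, not necessarily what the script uses.

```python
import re


def split_srt(srt_text, sections_per_batch=100):
    """Split SRT text into batches of at most `sections_per_batch` sections.

    Assumption: sections (index, timestamp line, text) are separated by
    one or more blank lines.
    """
    if sections_per_batch < 1:
        raise ValueError("sections_per_batch must be at least 1")
    normalized = srt_text.replace("\r\n", "\n").strip()
    sections = [s.strip() for s in re.split(r"\n\s*\n", normalized) if s.strip()]
    return [
        "\n\n".join(sections[i:i + sections_per_batch])
        for i in range(0, len(sections), sections_per_batch)
    ]
```

Each returned batch is itself valid SRT text, so it can be sent to the LLM on its own, with the character description prepended if one was generated.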

Everything except the LLM functions is designed to work locally on your machine, so it probably won't work on your X220 and might take a while to finish. (Technically you could run the LLM locally as well if you can set up an OpenAI API-compatible endpoint, but the included prompts are made with Claude in mind.) It depends on torch, transformers, safetensors, accelerate, argparse, coqui-tts, openai, openai-whisper and moviepy. pip install these (argparse already ships with Python's standard library, so it needs no install). You also need an existing ComfyUI installation. If image generation runs much slower than in ComfyUI, try to match the version numbers from its requirements list or just copy the venv over and use it or something.
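
If you want to check what is still missing before a long run, something like the following works as a quick sanity check. The mapping from pip names to import names reflects how these packages are usually imported (coqui-tts as `TTS`, openai-whisper as `whisper`); the helper name `missing_packages` is made up for this sketch.

```python
from importlib.util import find_spec

# pip package name -> top-level module name it is imported as
REQUIRED = {
    "torch": "torch",
    "transformers": "transformers",
    "safetensors": "safetensors",
    "accelerate": "accelerate",
    "coqui-tts": "TTS",
    "openai": "openai",
    "openai-whisper": "whisper",
    "moviepy": "moviepy",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose modules cannot be found in this environment."""
    return [name for name, module in required.items() if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing, try: pip install " + " ".join(missing))
```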

There are a lot of options you can set from the command line. Use --help to see them all or just read the argparse section of the script. Essential imo are the input options, one of which must always be given:

| command-line flag | input | comment |
|---|---|---|
| -i, --input_text | PATH | Path to a text file |
| -I, --input_raw_text | TEXT | Quoted text |
| -a, --input_audio | PATH | Path to an audio file |
| --input_audio_and_prompts | PATH PATH | Path to an audio file and path to a prompts file |

and

| command-line flag | input | comment |
|---|---|---|
| -s, --speaker | PATH | Path to the voice sample you want to use |
| --sd_model | MODEL_NAME | The model name from your ComfyUI checkpoints folder you want to use. Can also be a relative path such as 1.5/bstaber.safetensors if you use subfolders |
| --lora | LORA_NAME FLOAT FLOAT | Can be a relative path from your ComfyUI loras folder as well. The floats are, respectively, the model strength and the clip strength |
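
For orientation, the flags above could be declared with argparse roughly as follows. This is a sketch, not the script's actual argparse section: making the input options mutually exclusive and passing multiple LoRAs by repeating --lora are assumptions, and `build_parser` and `parse_loras` are hypothetical helper names.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Text to narrated slideshow video")

    # Exactly one input option must be given (exclusivity is an assumption).
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("-i", "--input_text", metavar="PATH")
    source.add_argument("-I", "--input_raw_text", metavar="TEXT")
    source.add_argument("-a", "--input_audio", metavar="PATH")
    source.add_argument("--input_audio_and_prompts", nargs=2, metavar="PATH")

    parser.add_argument("-s", "--speaker", metavar="PATH")
    parser.add_argument("--sd_model", metavar="MODEL_NAME")
    # Assumed: repeat --lora once per LoRA; each takes a name and two strengths.
    parser.add_argument("--lora", nargs=3, action="append", default=[],
                        metavar=("LORA_NAME", "MODEL_STRENGTH", "CLIP_STRENGTH"))
    return parser


def parse_loras(raw):
    """Convert [[name, model, clip], ...] strings into (name, float, float) tuples."""
    return [(name, float(model), float(clip)) for name, model, clip in raw]
```

With this sketch, `--lora style.safetensors 0.8 1.0 --lora sub/char.safetensors 0.6 0.6` ends up as two `(name, model_strength, clip_strength)` tuples after `parse_loras`.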

You can use as many LoRAs as you want. If your input is in a language other than English, you can set it to e.g. Russian with -l ru. The script can deal with rate-limited proxies (only really relevant when SRTs are split). There's no time measurement currently; it will just wait for a minute whenever the number of sent prompts hits a multiple of the set rate limit.
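
The rate-limit handling described above amounts to a simple counter. Here is a minimal sketch of that idea; the function and parameter names are hypothetical, and skipping the pause after the very last batch is an addition of this sketch, not something the script is known to do.

```python
import time


def send_with_rate_limit(batches, send, rate_limit, pause_seconds=60.0, sleep=time.sleep):
    """Send each batch via `send`, pausing after every `rate_limit` requests.

    No timing is measured: the pause is triggered purely by the request
    count hitting a multiple of `rate_limit`. A `rate_limit` of 0 disables it.
    """
    batches = list(batches)
    results = []
    for sent, batch in enumerate(batches, start=1):
        results.append(send(batch))
        # Sketch-only choice: do not wait once everything has been sent.
        if rate_limit > 0 and sent % rate_limit == 0 and sent < len(batches):
            sleep(pause_seconds)
    return results
```

Passing `sleep` as a parameter is only there to make the behavior easy to test without actually waiting.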

The simplest usage is just running ./aikino.py -i /path/to/your/smut/here. Fill out the Defaults and Constants sections and change them to your liking before running the script. Send vlogs of your roleplays!

Pub: 12 Sep 2024 14:11 UTC
Edit: 12 Sep 2024 14:32 UTC