Zero to Nemo

Quick guide on how to set up a frontend and a chat model, written for /aicg/ newfags.

Quick LLM recap

llm_recap_image

Just stare at the image until you get it~

The process loops for every next token, iteratively.

powered_by_ran

Note: although samplers are usually explained in terms of probabilities, they're often implemented at the logit level: https://github.com/LostRuins/koboldcpp/blob/65c5c77a166e65f3eb80072ca4c2974fec4bacc1/src/llama-sampling.cpp#L767 . But you could, obviously, filter out the tokens after the probabilities have been calc'ed. In general you probably shouldn't pour too much brainpower into the order in which these things happen.

You can play around with OpenAI's tokenizer here: https://platform.openai.com/tokenizer
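
If you'd rather poke at tokenization from code, here's a minimal sketch using the tiktoken library (my own example, not something the rest of this guide needs; assumes pip install tiktoken, and cl100k_base is just one of OpenAI's encodings):

# Minimal tokenization sketch with tiktoken (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("How are you, mister computer?")
print(tokens)              # a list of integer token IDs
print(enc.decode(tokens))  # round-trips back to the original string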

A base model blindly text completes sentences.

An instruct model was tuned to recognize special tokens that help it interpret inputs as message exchanges between a user and the model itself (usually termed assistant or model).

Because of this, an instruct model has an instruct format. Here's Mistral's v3 Tekken instruct format:

<s>[INST]This is a user message![/INST]And this is the assistant responding.</s>[INST]A second user message.[/INST]One more assistant response.</s>

Source: https://github.com/mistralai/cookbook/blob/ac4c4cebb0fc2461467115dda82f03229c543381/concept-deep-dive/tokenization/templates.md#templates

And here's OpenAI's Chatml:

<|im_start|>user
How are you, mister computer?<|im_end|>
<|im_start|>assistant
I am doing well!<|im_end|>

You will later see {{char}} in ST. This is not a special token. This is a macro that simply gets text-replaced with the character's name.
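
To drive home how unmagical that is, here's roughly what the substitution amounts to (an illustrative sketch, not ST's actual code; the card text and names are made up):

# Macros are just text substitution. Illustrative only.
card_greeting = "{{char}} waves at {{user}} from across the room."
prompt = card_greeting.replace("{{char}}", "Nemo-tan").replace("{{user}}", "Anon")
print(prompt)  # Nemo-tan waves at Anon from across the room.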

Quants are versions of a model with reduced precision. This enables fuck-off huge models to actually fucking run on your gaming rig.

KoboldCPP

First, determine how much VRAM your PC has (that's the GPU's memory), then choose a Nemo quant size from here. The entire model file should fit in your VRAM with some room left over; if it spills beyond your VRAM, or even beyond your RAM into swap, the model will still work, it'll just run much slower:

https://huggingface.co/bartowski/Mistral-Nemo-Instruct-2407-GGUF/tree/main

If the UI confuses you, here's the download button:

download_button_nemo_q8

For other models, you can kind of just get most quants from Bartowski's HF. Dude pumps them out like a factory. Very based.
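
If you want to eyeball which quant fits before downloading, the back-of-envelope math is just parameter count times bits per weight. The numbers below are ballpark figures for Nemo's ~12B parameters and ignore the extra headroom you need for context/KV cache:

# Rough GGUF size estimate: parameters * bits-per-weight / 8. Ballpark only;
# real files differ a bit, and you still need headroom for the KV cache.
params = 12.2e9  # Nemo is ~12B parameters
for name, bpw in [("Q8_0", 8.5), ("Q6_K", 6.6), ("Q4_K_M", 4.8)]:
    print(f"{name}: ~{params * bpw / 8 / 1e9:.1f} GB")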

Then, installing koboldcpp is as easy as downloading the appropriate binary from https://github.com/LostRuins/koboldcpp/releases/latest (if you have an NVIDIA GPU from the last 8 years, grab the cu12 binary; otherwise download the base one. If neither works, man, I'm sorry).

Run the binary, and you'll see a menu like:

koboldcpp_first_page

If Kobold fails to recognize your GPU, you'll probably have to look through the project github for any issues. Otherwise, just click on Browse, pick out the model GGUF you dled a minute ago, and click launch. Everything should just fucking work.

KoboldCPP might take a minute to load things into VRAM and RAM, but it's basically running now, and we can focus on SillyTavern. Kobold will open its own UI page, but we'll just use Silly instead for the same thing.
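
If you want to double-check the backend is actually up before touching ST, you can poke the API directly. A quick sketch, assuming koboldcpp's default port 5001, the standard Kobold API model route, and pip install requests (adjust if you changed anything):

# Quick sanity check that koboldcpp is listening and has a model loaded.
import requests

r = requests.get("http://localhost:5001/api/v1/model")
print(r.json())  # should name the GGUF you just loaded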

SillyTavern

Installation

SillyTavern has two main installation prereqs:

NodeJS: https://nodejs.org/en

Git: https://git-scm.com/

If you're using Linux, just apt install -y git nodejs or however it works in your distro. If you're using Windows, follow along with the installation wizards and agree to add everything to the PATH when it asks.

Now install ST by:

  • opening Powershell/Console
  • cding into your preferred installation folder
  • running the command git clone https://github.com/SillyTavern/SillyTavern

and that's it. You can now boot up SillyTavern by entering the folder ST was cloned into and double-clicking Start.bat on Windows, or running start.sh on Linux.

Usage

First things first:

st_name_prompt

Don't overthink this, just put in a cool name and click save.

THE NAV:

the_nav

What a wonderful nav.

Let's start off with your connection settings. This is what the first select should look like:

funny_api_options_thing

And this gives me the opportunity to make an important distinction:

A text completion API involves your client sending over a text string and getting a completion, that is, a continuation of that text. Your string may or may not adhere to an instruct format, and the model on the other end may be a base model or an instruct model. For instruct text completions, you keep the model from writing past its own turn with stop sequences: strings that, once the model outputs them, make the backend cut the generation short. KoboldCPP is right there under text completion.
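
For the curious, here's a minimal sketch of a text completion request against the koboldcpp instance from earlier, through its OpenAI-style completions route (path and field names are the usual OpenAI ones, and it assumes the default port and pip install requests). Note the Mistral-format prompt and the stop sequences:

# Minimal text completion request against local koboldcpp.
import requests

prompt = "<s>[INST]How are you, mister computer?[/INST]"  # Mistral v3 Tekken format
r = requests.post("http://localhost:5001/v1/completions", json={
    "prompt": prompt,
    "max_tokens": 128,
    "temperature": 0.8,
    "stop": ["[INST]", "</s>"],  # cut it off before it writes the next user turn
})
print(r.json()["choices"][0]["text"])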

A chat completion API involves your client sending over a message exchange, typically in a JSON format like this:

[
  {"role": "user", "content": "Initial user message."},
  {"role": "assistant", "content": "Some sort of reply."}
]

and then receiving an entire new message as a reply.
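
Here's the same idea as a sketch, again aimed at the local koboldcpp (it also exposes an OpenAI-style chat route on the default port; pip install requests assumed). The backend applies the instruct format for you instead of you building the prompt string:

# Minimal chat completion request against local koboldcpp.
import requests

r = requests.post("http://localhost:5001/v1/chat/completions", json={
    "messages": [
        {"role": "user", "content": "Initial user message."},
        {"role": "assistant", "content": "Some sort of reply."},
        {"role": "user", "content": "Say something nice."},
    ],
    "max_tokens": 128,
    "temperature": 0.8,
})
print(r.json()["choices"][0]["message"]["content"])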

This is what your complete connection config should look like:

koboldcpp_config_example

Now, to the actual text completion configuration:

funny_configuration_image

Very extensive options, yes. Here are the parts that actually involve some sort of brainpower on your part:

brainpower

Okay, now to explain the bits:

context

  • Response (tokens): how many tokens the response is capped at. Does not instruct the model to write more or less, just truncates. You can leave it at whatever value as long as it's not cutting your responses midway.
  • Context (tokens): the "input size" ST should aim for when building prompts. Affects how it truncates the chat history. Model/wallet dependent. Some models claim to have huge contexts but just melt down if you start sending them 30k-token inputs.
  • Streaming: API sends ST the new tokens as they're predicted. Very convenient but also scuffed with some models (like Gemini).

The select is just for saving settings you like.

samplers

ESSENTIALLY, TO GIVE YOU THE REALLY IMPORTANT KNOWLEDGE:

  • raise temp to taste (more temp makes the model more random, literally)
  • raise min_p to filter out bad tokens (bad tokens as in "random unicode gibberish in the middle of the text gen") until the model is giving coherent outputs for the temp you chose
  • freqpen/prespen/reppen are best used to prevent looping, not to prevent the model from reusing words

Play with https://artefact2.github.io/llm-sampling/index.xhtml to get a better feel for it.

Also, consider reading https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e by kalo.
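
If you want the mechanics spelled out, here's a toy sketch of how temp and min_p interact. It uses made-up numbers and works on probabilities for clarity (real backends work at the logit level, per the note near the top), so treat it as an illustration, not koboldcpp's actual code:

# Toy sketch of temperature + min_p sampling over a fake distribution.
import math, random

def sample(logits, temp=1.0, min_p=0.05):
    # temperature: scale the logits, then softmax them into probabilities
    scaled = {tok: l / temp for tok, l in logits.items()}
    mx = max(scaled.values())
    exps = {tok: math.exp(s - mx) for tok, s in scaled.items()}
    total = sum(exps.values())
    probs = {tok: e / total for tok, e in exps.items()}
    # min_p: drop every token whose probability is below min_p * (top token's probability)
    cutoff = min_p * max(probs.values())
    kept = {tok: p for tok, p in probs.items() if p >= cutoff}
    # renormalize the survivors and pick one at random
    norm = sum(kept.values())
    return random.choices(list(kept), weights=[p / norm for p in kept.values()])[0]

fake_logits = {" the": 5.0, " a": 4.2, " penguin": 1.0, " zxqj": -2.0}
print(sample(fake_logits, temp=1.2, min_p=0.1))  # with these numbers: " the" or " a"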

story string

The story string is a Jinja2-like template string that prefaces the text completion. You can put your desired writing-style instructions and stuff there (written in plain English, or your preferred language), and it should include the card fields (description, etc.) so they make it into the prompt.
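
For illustration, a story string might look roughly like this (a simplified sketch, not ST's exact default; {{description}}, {{personality}} and {{scenario}} are the card fields, and the {{#if}} blocks just skip fields the card leaves empty):

You're {{char}} in a roleplay with {{user}}. Write {{char}}'s next reply. Keep the prose casual.
{{#if description}}{{description}}{{/if}}
{{#if personality}}{{char}}'s personality: {{personality}}{{/if}}
{{#if scenario}}Scenario: {{scenario}}{{/if}}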

Even for a retarded dummy model like nemo, you do not need to explain what "romance" is.

instruct sequences

The instruct sequences tell ST how to format the text completion. They're not up to taste: they're model dependent. This is the instruct format I brought up earlier. ST comes with most of these pre-baked in a select; you just have to look at the model page to figure out which one you want.

According to my executive sources most modern models are smart enough to sort themselves out even with incorrect sequences.
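
To make it concrete, here's an illustrative sketch (my own toy code, not what ST actually runs) of what "formatting the text completion" means, hardcoded for the Mistral format from earlier:

# Toy illustration of what a frontend does with instruct sequences:
# flatten the chat history into one prompt string, here in Mistral's format.
def build_prompt(messages):
    out = "<s>"
    for m in messages:
        if m["role"] == "user":
            out += "[INST]" + m["content"] + "[/INST]"
        else:  # assistant
            out += m["content"] + "</s>"
    return out

history = [
    {"role": "user", "content": "How are you, mister computer?"},
    {"role": "assistant", "content": "I am doing well!"},
    {"role": "user", "content": "Glad to hear it."},
]
print(build_prompt(history))
# <s>[INST]How are you, mister computer?[/INST]I am doing well!</s>[INST]Glad to hear it.[/INST]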

And we're done. You can now just:

  • create a persona (the character you play) (refer to the nav menu earlier)
  • load up a card (the character the model plays) (you can find examples of these here)
  • have fun!

See also

Mistral from the tap

Rather than run Mistral models locally, you can create an account over at Mistral and either pay or use the free-tier of their services.

Mistral explicitly admits they'll review your outputs and might use them to train new models, though.

  • Go to the console: https://console.mistral.ai/
  • Open billing
  • Enable the experiment tier
  • Switch to the API Keys menu
  • Generate an API key

Then load it up in ST:

chat_comp_with_mistra

The free tier gives you a pretty reasonable amount of usage to fuck around with.
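
If you want to sanity-check the key outside ST first, Mistral's API is OpenAI-style. A quick sketch (the endpoint and the "open-mistral-nemo" model ID are what Mistral documents at the time of writing; double-check their docs, and pip install requests):

# Quick test of a Mistral API key. Replace YOUR_API_KEY_HERE with your own.
import requests

r = requests.post(
    "https://api.mistral.ai/v1/chat/completions",
    headers={"Authorization": "Bearer YOUR_API_KEY_HERE"},
    json={
        "model": "open-mistral-nemo",
        "messages": [{"role": "user", "content": "How are you, mister computer?"}],
    },
)
print(r.json()["choices"][0]["message"]["content"])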

OpenRouter

OpenRouter stands out as a pretty reliable and cheap way to pay to run a variety of models that you either don't want to or cannot run locally with good precision.

Some models in there may require chat completion, and also not support certain samplers. Have fun sorting that out.

TabbyAPI

I recommended koboldcpp earlier on the basis of "you download the binaries and it just fucking works honk shoo honk shoo", but it might be worth it to look at tabbyapi or other alternatives (aphrodite, vllm, etc), depending on your use cases and the kind of quants you want to use (exl2, for example).

Other models

Other model families besides Mistral's to look into include, but aren't limited to

And there's a host of local finetuners, like Anthracite or TheDrummer, making models specifically for RP.

Autism

The majority of this guide has been written assuming you have a regular gaming rig with like, 16gb of vram or whatever, and also assuming that you will eventually move from local to cloud models.

Go to /g/lmg and look at stuff like this if you want to go off the local deep end and figure out how to run 123bs and shit at home.

Pub: 19 Jan 2025 11:19 UTC
Edit: 19 Jan 2025 19:17 UTC