StatuoTW's Guide to Using Local LLMs
General Warning
This guide is merely meant to inform. I am not responsible for any damage you do to your computer. I haven't had any issues like that while doing it, but if your comp blue screens and your motherboard fries, that's on you and I am not responsible for it in any way. By reading this guide, you agree that you are fully responsible for any actions you take.
Getting Started
Okay, let's get this out of the way to start with:
Running Locally isn't for everyone.
If you're not technically savvy, can't figure out how to edit your network router settings, or get confused when someone asks you to set up a batch file, you're probably not going to survive using a Local LLM. Things break. You have to be able to fix them or know where to go to get help to fix them. If I tell you to go to a Github page and follow the instructions you should be able to do so without additional handholding. Some level of Computer expertise is required here. If I tell you that you need a certain amount of VRAM, you have to be able to at least Googlefu your way to figuring out what graphics card you have and how much VRAM it has.
If you're the kind of person who just can't fathom doing any troubleshooting, you're better off using a website that provides this stuff as a service and will do that stuff for you for a fee. What alternatives are there? Are there free options? I wrote up a guide here you can read instead.
On top of that, you're using your own equipment, which means your hardware will degrade over time. People smarter than me have said that running a graphics card/CPU for LLMs is comparable to running a medium/heavy game on your computer. So be aware of this risk; it is a factor in your decision.
Alright, not scared off? Let's break this shit open.
Key Terms used in this Guide
What is an LLM?
Large Language Models (LLMs) are essentially text-processing machines that use algorithms to determine what word is supposed to come next in a sequence. People call it "AI" but it's not really AI in the sci-fi sense of the term. It's still just a rock we're bashing to do math and output words.
What is Context Size?
Tokens, in the simplest terms, are just pieces of words that the model works with. Context Size refers to the maximum number of tokens a model can process at any given time. Trying to go past the context size results in errors, but if you're in a particularly long-running chat, the model will just dump out earlier posts in the chat to make room for new ones. For example, Mars/Mercury both have an 8k context size, meaning they can handle 8192 tokens at once. Long story short: the more context size your model has, the more it can remember.
For the purposes of this guide, 4k is the base standard. 6k is pretty damn good and usable for most chats. 8k is great. Anything above that I find to be a bit excessive. Some people swear by it, but I've just not been able to find a use case for needing 16k context, let alone 32k.
For your reference: 8k context size can remember back approximately 40 messages, and 4k context size approximately 20 messages. These of course presume that you're not using bots that take up a massive amount of context. The bot I used to test this took up about 1k of context combined with my System Prompt.
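If it helps to picture what "dumping out earlier posts" actually looks like, here's a tiny Python sketch of the idea. It uses a rough 4-characters-per-token estimate (the real count depends on the model's tokenizer), and the message list and numbers are made up purely for illustration.

```python
# Rough sketch of how a frontend trims chat history to fit the context window.
# The 4-characters-per-token figure is a crude approximation, not a real tokenizer.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def trim_to_context(messages: list[str], context_size: int, reserve_for_reply: int = 512) -> list[str]:
    """Keep only the most recent messages that fit in the token budget."""
    budget = context_size - reserve_for_reply
    kept, used = [], 0
    for msg in reversed(messages):       # walk from newest to oldest
        cost = estimate_tokens(msg)
        if used + cost > budget:
            break                        # everything older gets dumped
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # back to chronological order

history = [f"Message {i}: " + "blah " * 50 for i in range(100)]
print(len(trim_to_context(history, 4096)), "messages survive at 4k context")
print(len(trim_to_context(history, 8192)), "messages survive at 8k context")
```

The exact numbers don't matter; the point is that once the budget is full, the oldest messages silently fall off, which is why long chats "forget" their beginnings.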
What is a System Prompt? What's A Jailbreak?
A System Prompt/Pre-History Prompt is a set of instructions you provide to the LLM that informs its response. A Jailbreak/Post-History Prompt is likewise a set of instructions used to break through a censored model's filters. If you're running locally, you'll never need a Jailbreak.
LORA
LORA, or Low-Rank Adaptation of Large Language Models, is essentially a way to add extra training on top of models you already have. LORAs add a layer of customizability to the models you download, such as making it easier to do NSFW stuff or making models more coherent. Many models come pre-merged with some LORAs.
Other terms
Other terms are introduced as relevant and explained to try and not overwhelm you right from the start of this guide.
Why Run Locally at all?
Determining if you can even run locally.
So let's dive into the big question first: Can you even run locally?
Graphics Cards
There's a lot of competing cards out there that people use for gaming and whatnot. I'm not going to go into all of them, but I will speak briefly about the top three I've heard about in the AI Space.
Nvidia Graphics Cards
Nvidia Graphics Cards are the current cream of the crop when it comes to running LLMs. Their cards support CUDA, which, long story short, lets them process LLMs at a much better rate than the competitors currently.
For Nvidia Graphics Cards you're going to want something at least in the latter half of the 2000 series, or ideally something from the 3000 series.
AMD Graphics Cards
AMD lags behind Nvidia since official support for LLMs is lacking and most of the dev work happens through CUDA currently. That being said, the community has banded together to make it possible to use AMD Graphics Cards, but they won't perform as well as Nvidia currently.
For AMD Graphics Cards you'll want stuff from the latter half of the 6000 series at least.
Intel Graphics Cards
Oof.
You should just look into a subscription service like Chub Venus or NovelAI, on god. I've heard of people trying to run locally on these, but support is few and far between and it's more likely to break.
Graphics Card I didn't mention
Chances are if you have something weird and off the wall I didn't mention, you should just use a subscription service.
Operating System
Okay, real quick here. You probably want to be using Windows. Mac support is spotty at best and you're not going to have a good time. Honestly, if you use Mac you probably just want to set up with a subscription service currently. Linux users probably don't need this guide to begin with, honestly. But Linux users can find guides for their stuff in the Github pages of Ooba or KoboldCPP.
VRAM
Okay, now that we've discussed graphics cards a bit, let's talk about VRAM. There's a lot to VRAM that I'm not going to go into because, frankly, what it does isn't important for the purposes of this guide. What is important is how much of it you have, because that ultimately determines what kind of model you can hold. So let's cover Model Size and Quantization before I start giving you recommendations, so you know what's happening.
7b, 13b, 34b: Model Size
If you've spent any time in a Bot discord, you might have heard people throw around things like "13b", "33b", or "70b" and wondered what that means. Those numbers are model sizes. To put it as ELI5 as I can: the number directly before the "b" is the model's size in billions of parameters, and larger is generally better. The larger the number, the better the model usually is at making logical leaps. This isn't to say that a 13b model is bad, far from it, most people running locally use a 13b model and it's more than serviceable. It's just that the bigger the model, the less work you generally have to do when it comes to prompting.
Currently, most local users exist in the 7b-13b space because that is what is most easily accessible to everyone. So you have more options there and fewer going upwards, since those require high-end rigs. You probably don't want to be using anything lower than a 7b; below that you're getting models that frankly aren't up to par with a 7b and can't make many logical leaps. I would consider 7b to be the absolute minimum you would want to run.
Quantization
3.5bpw, 5bpw, Q4_K_M, etc. These are all Quantizations, or Quants as we'll be referring to them from here on out. A quant is essentially just a compressed version of a full model. Whenever someone quantizes a model, they're making it smaller so that the model can run on rigs with smaller VRAM capacity. This results in perplexity loss or, in as ELI5 as I can make it, the model's quality gets worse, whether that's weaker logic or the model just spitting out gibberish.
The current rule of thumb is that anything below Q4 or 3.5bpw is not worth it. You can certainly go lower, but the quality of a model seriously degrades the lower you go. If you're at that point, you're better off just dropping down a model tier and having a much better experience.
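If you want to see where the VRAM numbers later in this guide come from, here's a quick back-of-the-envelope calculation: billions of parameters times bits-per-weight, divided by 8, gives you a rough weight size in GB. The overhead figure is just my rough guess for context cache and the like, not a measured number.

```python
# Back-of-the-envelope VRAM estimate: params (billions) * bits-per-weight / 8
# is roughly the size of the weights in GB; context cache etc. sits on top.

def estimate_vram_gb(params_billions: float, bpw: float, overhead_gb: float = 1.5) -> float:
    weights_gb = params_billions * bpw / 8   # e.g. 13 * 4.0 / 8 = 6.5 GB of weights
    return weights_gb + overhead_gb          # overhead_gb is a rough guess, not a measurement

for params, bpw in [(13, 16.0), (7, 4.0), (13, 4.0), (20, 4.0), (34, 4.0), (70, 2.4)]:
    label = "unquantized fp16" if bpw == 16.0 else f"{bpw}bpw"
    print(f"{params:>2}b @ {label}: roughly {estimate_vram_gb(params, bpw):.1f} GB of VRAM")

# And remember: Windows itself eats another 1-3 GB of VRAM on top of whatever this says.
```

Run that and you'll see why an unquantized 13b is hopeless on consumer cards, and why the quants in the table further down land where they do.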
There are currently two types of quants I recommend using: GGUF quants and ExLlamav2 quants. GPTQ and AWQ are options, but both are inferior compared to GGUF and ExLlamav2 quants. For GGUF quants you'll be using KoboldCPP; for ExLlamav2 quants you'll be using Oobabooga.
IQ Quants
Recent advancements for GGUF files mean that IQ quants (IQ3_S, IQ3_XXS, etc.) can perform as well as or better than Q4_K_M quants. Whether this holds up to scrutiny always depends on the model, but you're encouraged to try it on your own to see if you can run higher tier models on systems with lower VRAM.
VRAM and what you can Use
Alright, so let's finally go over this. First off, you should figure out how much VRAM you have. Most newer graphics cards have at least 8gb of VRAM, which is what I would consider the bare minimum for an enjoyable chatting experience. If you have less than that, it's time to start looking into subscription models. If you're using AMD/Intel graphics cards and only have 8gb, you can try, but you may need to consider subscription models as well.
There is a way to use KoboldCPP to buffer your VRAM with your regular RAM and CPU for processing. But this drastically lowers your generation speeds and comes at the cost of using your CPU to help process. Since your CPU is essentially the brain of your computer, you might want to think twice about putting it under a heavy load to process huge models. I also just don't recommend it. We're not talking about the difference of a generation that takes 10 seconds suddenly becoming a minute. We're talking generations that can slow to once every 10-30 minutes or even longer depending on how high you go. It's up to you if you want to pursue this option.
Keep in mind, just running your Windows desktop takes anywhere from 1-3gb of VRAM automatically. This is why your 8gb of VRAM cannot actually hold a 13b model at 4.0bpw/Q4 quants despite it being smaller in size than your VRAM capacity. A headless server may be able to hold more, but I've never tried it.
The VRAM Table
Alright, this table will go over basically what you can use per what VRAM you have. This may become outdated over time, but is somewhat accurate as of 2/24/2024. Keep in mind, whenever I say 4.0bpw or Q4 I'm referring to the quant sizes I mentioned before. If I mention a model size, you should automatically presume that with the amount of VRAM you have, you can hold smaller models.
For simplicity's sake, I'm only referring to 4.0bpw or below here. So just because you have 12gb of VRAM it doesn't mean you can run a 7b at 8.0bpw.
VRAM | Models/Quant Recommendations |
---|---|
8gb | 7b models at 4.0bpw/Q4. Usually can hold 4k context size; 8k will slow down massively. |
12gb | 13b models at 4.0bpw/Q4. Can hold 4k context size; 8k will slow down massively or just error out; 6k is a good medium. |
 | 20b models at 3.0bpw. 4k context size will be slow. |
16gb | 13b models at 4.0bpw/Q4 with good stretching up to 8k context size. |
 | 20b models at 4.0bpw/Q4. |
24gb | 34b models at 4.0bpw/Q4. Some of the ones with longer context size can fit up to about 16k context comfortably, theoretically up to 30k. |
 | 8x7b Mixtral models at 3.5bpw with 8k context. |
 | 70b models at 2.4bpw with 8k context. |
It's possible your rig goes higher than what I've listed here. So let's go into how you can determine if the model you're looking at can fit.
Can it fit?
First off, you're going to be getting your models from Huggingface. This is where 90% of all models go and you should familiarize yourself with what a standard model looks like. Try not to download every model you see and load them. Run an antivirus/anti-malware scanner occasionally. You never know what weird people online will do.
For ExLlamav2 Files
For the purposes of this demo we are going to look at Echidna-Tiefighter-13b Exl2 and Darkforest-20b Exl2 so you can get a sense of what to look for and how to figure out if it fits.
Using the above image as a reference, I've outlined things. First off is the Output.safetensors file. This is going to be the bulk of the file size for your models. The file size next to it in gigabytes is approximately how much VRAM you can expect the model to take. So allowing for the 1-2gb window for using Windows at all, you can expect this model to require at least 12gb of VRAM to run functionally well.
With DarkForest, you can see that it has two output.safetensors files. This is because of the bigger file size required to hold a 20b model. When you see this, you need to combine the sizes of the tensor files to get the total. At a combined size of about 10.5 GB, you can expect to need at least 16gb of VRAM to load this efficiently.
For GGUF Files
For the purposes of this demo we are going to look at Echidna-Tiefighter-13b GGUF so you can get a sense of what to look for and how to figure out if it fits.
You can see from the above image there are a lot more options here than there are with the ExL2 quant above. This is because GGUF quants are single files rather than multiple files. The quant size is on the left and the file size is on the right. The one I've underlined is the Q2 quant, which comes in at 5.43 GB. That means, theoretically, you could fit a Q2 quant of this 13b model on 8gb of VRAM. It probably wouldn't run well, but it's possible.
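If you'd rather not eyeball the file listing, you can have a few lines of Python add the weight files up for you after you've downloaded a model. The folder path here is a made-up example; point it at wherever you actually put the model.

```python
# Sum a downloaded model's weight files to get a rough floor on the VRAM needed
# to load it fully on the GPU (then add 1-3gb for Windows and overhead).
from pathlib import Path

def model_size_gb(folder: str) -> float:
    files = []
    for pattern in ("*.safetensors", "*.gguf", "*.bin"):
        files.extend(Path(folder).glob(pattern))
    return sum(f.stat().st_size for f in files) / (1024 ** 3)

# Hypothetical folder name; use the folder you actually downloaded the quant into.
size = model_size_gb("models/Echidna-Tiefighter-13b-exl2")
print(f"Weight files total {size:.2f} GB - budget at least that much VRAM, plus overhead.")
```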
What You Need
In order to run LLMs, you need two things:
A Frontend and a Backend. The Backend is the program that actually loads the models and allows you to plug them in somewhere else. Most backends also work as a frontend at the same time, but for the purpose of using character cards like on Chub Venus, you'll generally want a separate frontend. The Frontend, meanwhile, is the chatting interface you use to actually chat with the bots.
Backends
We're only going to cover two backends for this guide: KoboldCPP and Oobabooga. We won't be going into how to install them since their Github pages handle that well enough. But I will cover the pros and cons of each.
KoboldCPP
KoboldCPP is a simple program with a "One-click" mentality to help you run it. You click it, it launches a window for you to customize and use it with. You select your GGUF model to load with it, and it launches after you configure some settings.
Pros:
- Probably the simplest interface/setup of the two. Launch it, set your Context Size and SmartContext, put 0 for GPU layers so it offloads as many as it can before it stops, then let it run.
- GGUF files are far more plentiful than Exl2 files.
- Less moving parts. So long as KoboldCPP runs and you have the GGUF file you can run it fairly easily without having to worry about extensions.
- Mirostat mode is fantastic for keeping chats fresh.
- Kobold team is constantly updating and working on it.
- KoboldCPP lets you load models that are outside of your VRAM range by using your CPU and physical RAM as a buffer.
- Advances in GGUF models mean you can even begin to load higher tier models that you may not have had access to before.
- The creators created Kobold for usage mainly as a story-writer/generator/roleplay backend. So its presets are geared towards making those things easier.
Cons:
- You can save presets for loading your models, but it's an extra step that takes some time to launch. It's not a huge time tax, but it adds up.
- KoboldCPP is not technically recommended for use with sites like Chub Venus, instead you need the Kobold United version which has a higher VRAM cost. This can cause errors if you try to plug your local in to Chub Venus.
- Using your CPU/Physical RAM can burn those out as well, leading to other issues that you wouldn't have if you used your graphics card only.
- Slower processing times in general compared to Ooba.
- If you want to make changes, such as applying a LORA to the model, you have to close the entire program and load it again with the new LORA.
- No built-in downloader: you have to download the models you want to use and manually move them into an appropriate folder yourself.
Oobabooga
Oobabooga is a backend that has more options than KoboldCPP does, raising its complexity but also the things that you can do with it. Because of its extra complexity other things can go wrong with it by nature. But Ooba remains your best option for having the most options available and its dev team is very active.
Pros:
- Ability to use multiple different filetypes, including GGUF and ExLlamav2, as well as GPTQ and AWQ models.
- Extra extensions that allow for easier connectivity with external services.
- You can use multiple LORAs at once with ExLlamav2 quants, allowing greater customizability.
- Can work almost entirely from the interface provided when you launch, including downloading models from HuggingFace.
- Wide range of applications outside of Chatbots.
- Ooba tends to generate faster than Kobold, which means it doesn't need features like SmartContext.
Cons:
- Constant updates means features you like may no longer be available or supported as time goes on.
- I have installed Ooba a few times only to find out that the extensions required for certain model loaders were tied to broken dependencies, which broke the entire install; I wasted a few hours on it only to realize there was nothing I could do.
- Error messages are extremely hard to parse. If you're not tech-savvy and something breaks you may be in for a wild ride that ends with you dropping this entirely.
Okay, what do you use?
Oobabooga. But I also keep a backup install of KoboldCPP. There's not really a wrong way to go, per se. But if you're looking for my recommendation on what you should use:
- If you want less complexity, grab KoboldCPP
- If you want more control, grab Oobabooga.
If you're just getting started and aren't confident, just roll with KoboldCPP. It is honestly far easier to use and set up. You can always swap to Oobabooga later.
Installation
Alright, so I'm not going to provide screenshots because this is all fairly self-explanatory stuff and the Github pages are going to go into more detail on things you really do need to know.
That being said, let's dive into a very basic overview. Do not bug their Discord Servers about anything here in this guide. They didn't write this. You can and should read their guides on installation. Things may change and not be relevant here anymore after some time. I am only providing basics to help you launch this for your first time.
KoboldCPP
You can find their Github page here.
On the releases section you should find their most recent release and download:
- The Linux version if you use Linux
- The Koboldcpp.exe version if you use Windows/Nvidia.
- The Koboldcpp_nocuda.exe version if you're on Windows without an Nvidia card
- The Fork for AMD users is here
After that, create a folder somewhere. You should prefer an HDD for this if you can. It will take longer to load initially but will be easier than wearing out your SSD.
With that done, just double click the KoboldCPP.exe file and it will launch the interface.
Now I'm not going to go into super detail on these. That's what their wikis and Github pages are for.
As a baseline, use the Browse button to find your model. It should automatically detect your GPU for this.
- Context Shift: Leave this on. It speeds up processing times.
- Context Size: Dependent on your model. Most 7b's, 13b's, and 20b's are 4k context (4096). 34b's tend to be anywhere from 8k (8192) or higher. You can usually find this info on the model's page on HuggingFace.
- GPU Layers: Set to 0. This tells it to offload as many layers as it can to hold the model. You can set this lower, however, to avoid a heavy load on your system. My experience with about 8gb of VRAM is that something like 30 GPU layers is enough to let you do some other things while prompts generate. But the number of layers you can offload changes the higher up in model tier (7b, 13b, 34b, etc.) you go, so you'll have to experiment with this later on.
- Presets: Use CuBLAS if you're using an Nvidia card, CLBlast if you're not.
- Remote Tunnel: This allows you to use KoboldCPP using Cloudflare, which generates a link in the KoboldCPP console you can then use to connect your local to another device in your home/on the internet.
- Tokens: Select "Use SmartContext" in this tab.
- Custom RoPE Config: In the Tokens tab, the Custom RoPE Config is how you scale models to use a higher context size. You can experiment with this, but the general rule of thumb is 1.75 for 1.5x context (4096 to 6144, for example) or 2.5 for 2x context size (4096 to 8192). Not recommended unless you know what you're doing since it can massively degrade a model's quality (there's a tiny scale-guessing helper sketched just after this list).
- Model Tab: Here you can find the model information and apply a LORA if you want to.
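Here's that helper. It just interpolates from the two data points in the rule of thumb above (1.75 for 1.5x, 2.5 for 2x); it is not an official KoboldCPP formula, so treat anything it spits out beyond 2x as a guess and check the output quality yourself.

```python
# Rough RoPE-scale guesser extrapolated from the rule of thumb above:
# 1.5x context -> 1.75, 2x context -> 2.5, i.e. scale = 1.5 * ratio - 0.5.
# NOT an official formula - just a straight line through those two points.

def guess_rope_scale(native_context: int, target_context: int) -> float:
    ratio = target_context / native_context
    return 1.5 * ratio - 0.5

print(guess_rope_scale(4096, 6144))   # ~1.75 (the 1.5x rule of thumb)
print(guess_rope_scale(4096, 8192))   # ~2.5  (the 2x rule of thumb)
```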
Then you hit Launch and KoboldCPP will load in your browser window.
If you're just using it locally on your own computer, your API address for connecting it to something outside of Kobold is the one shown in the console window: you should see a line like "Starting OpenAI Compatible API on port xxxxx at xxxxxxx".
Make note of that API for later when we're connecting to the Frontend.
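If you want to make sure that address actually works before you bother wiring up a frontend, a couple of lines of Python will tell you. The port below (5001) is KoboldCPP's usual default, but that's an assumption on my part; use whatever your console actually printed.

```python
# Quick sanity check that KoboldCPP's OpenAI-compatible API is up and answering.
# Assumes the default local port 5001; swap in the address from your console.
import requests

BASE_URL = "http://localhost:5001/v1"   # assumption: default KoboldCPP port

resp = requests.post(
    f"{BASE_URL}/completions",
    json={"prompt": "Say hello in five words or fewer.", "max_tokens": 32},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```

If that prints some text back, your API is live and you can paste the same address into your frontend later.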
Oobabooga
You can find their Github page here.
On the releases section you'll find their most recent release. Go ahead and grab it. For most users you'll want to just grab the ZIP file, then extract the Zip file into an appropriate folder. Again, I recommend using an HDD for this.
- For Windows users, find start_windows.bat and double click on that. Follow the instructions that show up in the Command Prompt and let it do its thing.
- For Linux and macOS, same thing, but the files are "start_linux.sh" and "start_macos.sh".
- When it's finished, it will say "Running on local URL:" in the command prompt. Close out of the Command Prompt.
- Find "CMD_FLAGS.txt" and open it
- At the bottom, add in the following and save:
--extensions openai --api
- If you intend to use Ooba on a different device on your local network, use this instead:
--extensions openai --api --listen
- Launch the start file appropriate to your OS as mentioned above to launch Ooba once more.
- Again, make note of your OpenAI-compatible URL (there's a quick way to check that it's live sketched just after this list).
- If Ooba doesn't launch in a separate window, you can open it by using the "Local URL" mentioned in the command prompt.
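Here's that quick check. Port 5000 is the usual default for Ooba's OpenAI-compatible API, but again that's an assumption; use the URL your own console reports.

```python
# Quick check that Ooba's OpenAI-compatible API is reachable, and what's loaded.
# Assumes the default port 5000; use the URL your console actually reports.
import requests

BASE_URL = "http://127.0.0.1:5000/v1"   # assumption: default Ooba OpenAI API port

resp = requests.get(f"{BASE_URL}/models", timeout=30)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print("Available:", model["id"])
```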
Real quick, let's talk about loading a model in Ooba. I'm going to cover ExLlamav2 models here but you can also load GGUF Models by using the llama.cpp loaders.
On the right-hand side of the screen is the download option. In the first screenshot I have a picture of the HuggingFace page. Next to the model name there's a little copy icon (two overlapping squares). Click that to copy the link to the HuggingFace model.
You then paste it into the download field like I did in the screenshot and hit Download. It will automatically download the model for you.
On the left hand side are loaders. Just above the loaders is the dropdown where you select your models. If you see none in the dropdown after downloading, select the "Refresh" button next to the model dropdown.
Ooba will automatically try to select the best loader based on the model metadata. I personally prefer the Exllamav2 loader for my EXL2 models. Some people don't. Ooba recommends the Exllamav2_HF loaders. It's up to you which one you want to use, but I have noticed that some models don't load in ExLlamav2 but will load in Exllamav2_HF. So if for some reason your model won't load in the Exllamav2 loader, try the HF version instead.
Set the Context Size as per the model recommendation on their Huggingface page. If they don't have one, presume the following:
Model Size | Context Size |
---|---|
7b | 4096 (4k) |
13b | 4096 (4k) |
20b | 4096 (4k) |
34b | 8192 (8k) |
8x7b | 8192 (8k) |
70b | 8192 (8k) |
Hit "Load"
Hopefully it should load without errors. It will post that it successfully loaded or error out. God be with you because I can't help you if it doesn't load.
Annnnnd you're done.
Frontends
Alright, we're only going to discuss one Frontend here: SillyTavern. Why SillyTavern? Well because it's the most feature complete local Frontend you can ask for right now. Most people use it. You can technically use websites like Chub Venus as a Frontend, but part of the vibe of running locally is not depending on these websites at all.
SillyTavern
Pros:
- The most feature complete frontend out there.
- Tools to use Cards you download from websites with ease.
- Locally stored. You're immune to site outages.
- Plethora of options to customize your experience.
- Wide range of connectivity with most available APIs.
- Looks nice. Designed for chatting.
- Can technically be run on your phone - which we will not cover here. Go to their Discord for that.
Cons:
- Don't ask me how, but their updates continually break things, even in past versions. I keep hearing it's poor version control due to a wide range of contributors. They once broke Lorebooks for all versions with their 1.11 release and you had to update.
- Their Discord is known to be hostile to anyone coming in to ask questions. To the point people come in to other Discords after being chased out of the ST Discord.
- Features in SillyTavern can be poorly explained or not even explained at all.
- People note that updating often breaks things or that changes are simply nonsensical in nature.
Installing ST
Much like the Backend section on installing, these instructions may be outdated or otherwise incorrect by the time someone links you this rentry. This is done at your own risk, as with anything else. Don't bug the people in the Discord about this guide. They didn't write it.
Once more: THE SILLYTAVERN PEOPLE DID NOT WRITE THIS GUIDE. DO NOT BUG THEM ABOUT IT.
Anyway, simple steps:
- Download Node.js, it's required for SillyTavern to run. You can get it from here and the latest LTS version is the recommended one.
- Get the latest SillyTavern release (Avoid testing versions). The Source Code.zip file is generally the one you want to grab from their latest stable release.
- Create a SillyTavern folder, unzip the contents of SillyTavern to that folder.
- Find "Start.bat" and run it. It should install everything it needs and then open in a regular window.
Cool, we're done.
After you're done getting SillyTavern to launch, there's a little "Plug" icon near the top that allows you to hook up your API connection to it. Select the appropriate options from the dropdown (Text Completion for Ooba/KoboldCPP; there's a second dropdown you use to select which one), paste the API link from before into the appropriate fields, then hit Connect and it should be fine.
Recommended Models. Sites to get Cards. Etc.
Okay, now that we've finished installing our required stuff, we can start actually acquiring models and cards. But where to get this stuff?
Models - HuggingFace
You can find models on HuggingFace. The creators I recommend following are TheBloke, Kooten, and Lonestriker. TheBloke will mainly deal in your GGUF files. Kooten and Lonestriker mainly deal in ExLlamav2 files.
You can download files through Ooba using the Model tab, follow their documentation for instructions on that.
For Kobold, just download the GGUF files and put them in the same folder as your KoboldCPP executable.
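If you'd rather script the download than click through HuggingFace, the huggingface_hub package can fetch a single GGUF file for you. The repo and file names below are placeholders, not real recommendations; copy the exact names from the Files tab of the model page you actually want.

```python
# Fetch one GGUF quant from HuggingFace and drop it next to KoboldCPP.
# pip install huggingface_hub
# The repo_id and filename below are hypothetical placeholders; copy the real
# ones from the "Files" tab of the model page you want.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="SomeUser/Some-Model-13B-GGUF",   # placeholder
    filename="some-model-13b.Q4_K_M.gguf",    # placeholder
    local_dir="C:/LLM/koboldcpp",             # wherever your KoboldCPP executable lives
)
print("Saved to:", path)
```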
Statuo's Recommended Models.
I'm probably not going to update this list. But hey, here are some recommended models. They're also all uncensored:
7b
Kunoichi-DPO-V2-7b is my top recommended pick for 7b models. Insanely good for its quant size and geared towards roleplaying. Use Rope Scaling at 2.6 and set the Context Size to 8k if your rig can handle it. Otherwise, 4k is fine. This is my current favorite model, even above 34bs.
Kunocchini is a solid alternative. Same model as Kunoichi but stretched to 200k context. You'll probably only want to use it at 16k at most though.
13b
Echidna-Tiefighter is my top recommendation for 13b. At 4k context size it's a great roleplaying model.
Psyfighter 2 is also great.
MythoMax 13b is an oldie but a goodie.
20b
I don't actually have many good 20b's, so these are just testimonials. Shoutout to Wilson in the Chub Discord.
DarkForest 20b 4k context and built for roleplaying.
Psyonic Cetacean 20b Same as above.
34b
Nous-Capybara-limarpv3-34b I believe this one stretches to 200k natively. It worked to 16k when I was using it and that's about as high as I'd go.
Yi-34b-200k-RPMerge I've heard great things about this one but rebounded off of it almost immediately. Your mileage may vary.
8x7b Mixtrals
zaq-hack's Noromaid-v0.4-Mixtral-Instruct-8x7b-Zloss with RPcal Probably the best mixtral model out of the bunch I tried.
Noromaid-v0.1-mixtral-8x7b-Instruct-v3 A good alternative if you'd prefer something from LoneStrike or Kooten.
70b
On god, haven't found a 70b that was worth it yet. People are swearing by it and maybe it's just better at higher quants, but I really can't picture a use case for it right now. It's slower, usually caps out at 8k, and requires just as much regenning as other models. I'll probably update this if I find something worth it.
That being said:
Midnight-Rose 70b was recommended by Wilson in the Chub Server. I haven't gotten around to trying it, but hey.
Getting Cards for Chatting/Roleplaying
You can get cards from multiple places. But my favored place is Base Chub or Chub Venus. Keep in mind this is the same site, just different aesthetics. Most other sites prevent you from downloading cards altogether or you have to do a crazy workaround to download them. There is a Booru out there with cards on it but I found them painful to navigate. The only downside is that Chub is completely uncensored in the way Archive of Our Own (AO3) is. So if that's a dealbreaker for you, you'll have to find alternatives.
My recommendation is you just create an account on Chub Venus since it automatically blocks most tags for you, then you can find a bunch of content based on tags you like while avoiding most of the tags you don't like. You can then import them into SillyTavern by following the guide on their site.