Vector Storage Quick Rundown
What's an embedding model
An ML model that embeds texts into a vector space (that is, it encodes a text into a sequence of numbers).
The idea is that you can then compute the geometric distance between two texts (texts with similar meanings end up close together in the space), which is useful for things like text search, or for activating lorebook entries.
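To make that concrete, here's a toy sketch of the usual distance measure, cosine similarity. The vectors here are made up for illustration; real embeddings have hundreds or thousands of dimensions:

```python
import math

# Cosine similarity: values near 1.0 mean "same direction" (similar meaning),
# values near 0 mean unrelated.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

vec_dragon = [0.8, 0.1, 0.3, 0.2]  # pretend embedding of "the dragon attacks"
vec_wyvern = [0.7, 0.2, 0.4, 0.1]  # pretend embedding of "a wyvern swoops down"
vec_taxes  = [0.1, 0.9, 0.0, 0.6]  # pretend embedding of "filing income taxes"

print(cosine_similarity(vec_dragon, vec_wyvern))  # ~0.97: similar topics
print(cosine_similarity(vec_dragon, vec_taxes))   # ~0.30: unrelated topics
```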
What's vector storage in ST
Instead of wrangling keys, ST can use an embedding model to determine which lorebook entries to activate, by comparing the last few messages in context and the contents of the LB entries.
Note: if a vectorized entry has keys, and those keys show up in the chat, it still gets activated like a normal entry. The vectorization just gives you a second way of enabling the entry.
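Under the hood, the retrieval step is conceptually something like the sketch below. This is an illustration of the idea, not ST's actual code; the threshold and entry-count knobs here just stand in for the similar settings ST exposes:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Illustration only: score every vectorized entry against the recent chat
# and activate the best matches above some similarity threshold.
def activate_entries(recent_chat_vec, entries, threshold=0.5, max_entries=3):
    """entries: list of (entry_text, entry_vec) pairs, embedded ahead of time."""
    scored = sorted(
        ((cosine_similarity(recent_chat_vec, vec), text) for text, vec in entries),
        reverse=True,  # best matches first
    )
    return [text for score, text in scored[:max_entries] if score >= threshold]
```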
How do I enable vector storage on LB entries
When presented with the "Strategy" options, Blue (constant), Green (normal keyed activation), and the chain link (vectorized), select the chain link.
How do I configure vector storage
Go to the extensions menu (the three boxes), select Vector Storage, and play around. Pay attention to the World Info options!!! Vectorization ignores World Info entries by default, so you have to enable it there!!!
How do I run embedding models locally
You can download embedding models from Hugging Face, like https://huggingface.co/Qwen/Qwen3-Embedding-0.6B-GGUF (~600 MB at Q8) or https://huggingface.co/Qwen/Qwen3-Embedding-8B-GGUF (~8 GB at Q8).
Download llamacpp (if you know which build you want, great; otherwise the plain CPU version is fine), then run the llama-server binary from PowerShell/a terminal/etc.:
./llama-server --embedding -m /path/to/your/embedding.gguf --pooling cls
If you grabbed a build with GPU support, append -ngl 99999 to the command to offload all layers to the GPU (lower the number to offload fewer layers).
You CAN run embedding models on CPU though. It's not as annoying as running LLMs.
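Once the server is up, you can sanity-check it before touching ST. A minimal sketch, assuming llama-server's defaults (it listens on http://127.0.0.1:8080 and exposes an OpenAI-compatible /v1/embeddings endpoint; adjust if you changed --host/--port):

```python
import json
import urllib.request

# Ask the local llama-server for an embedding of a test sentence.
req = urllib.request.Request(
    "http://127.0.0.1:8080/v1/embeddings",
    data=json.dumps({"input": "a quick test sentence"}).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.load(resp)

vec = result["data"][0]["embedding"]
print(f"got a {len(vec)}-dimensional embedding")  # e.g. 1024 dims for Qwen3-0.6B
```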
Then go back to the extensions menu and point ST's vectorization source at llamacpp instead of "Local (Transformers)" (which runs a ~130 MB embedding model from 2023 via transformers.js).
It should also be possible to run embedding models on koboldcpp but I couldn't figure that out lol.
ST caches your vectorizations WITH COMPLETE DISREGARD for which model llamacpp is actually running. Flush the cache whenever you switch embedding models by going to data/default-user and nuking the vectors folder.
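If you'd rather script the flush, something like this works (the path assumes a default install with the default-user profile; ST should just re-vectorize as needed afterwards):

```python
import shutil

# Delete the vector cache; run from the SillyTavern root directory.
# Path assumes the default user profile.
shutil.rmtree("data/default-user/vectors", ignore_errors=True)
```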
What embedding model does ST use by default
https://huggingface.co/Cohee/jina-embeddings-v2-base-en
You can double-check by going to https://github.com/SillyTavern/SillyTavern/blob/release/default/config.yaml and checking the value of extensions.models.embedding
(just CTRL+F "embedding" and it should show up).
Do bigger models actually perform better
Yeah, IMO. And since the caching is so aggressive, you don't really have to worry much about the extra compute.