!!! danger This isn't an enable-and-forget magical miracle cost-saving feature. From now on you need to be somewhat aware of what the API calls you're actually sending look like.
-> **So true!** <-
https://rentry.org/pay-the-piper
https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
https://rentry.org/prompt-caching-for-st
You pay more to store part of your prompt on Anthropic's servers so you can pay less on the next prompt, provided it can reuse what was cached (i.e., both prompts begin with the same text that you told Anthropic to cache).
Make sure you're using at least version [1.12.8](https://github.com/SillyTavern/SillyTavern/releases/tag/1.12.8) of ST. You can also just check if `cachingAtDepth` shows up in your `config.yaml`.
Edit your `config.yaml` to ensure the Claude section reads somewhat like the sketch below (key names as in ST 1.12.8's default config; double-check yours):
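```yaml
claude:
  enableSystemPromptCache: false
  cachingAtDepth: 2
```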
where the `cachingAtDepth` value is SOME non-negative number that MAY OR MAY NOT be 0. Depends on where you like to make your injections at depth, your PHIs, etc.
Anthropic allows you to cache prompt prefixes. Prefix, as in, "the prefix (beginning substring) of the text completion (we won't explain what the text completion actually looks like) (all you get to know is that the order is tools -> sysprompt -> messages)".
Which is enough. You cache the beginning of the prompt, and everything after this marked point is mutable aka regular input tokens.
`cachingAtDepth` marks the point in the message history where the immutable beginning of the prompt ends (inclusive).
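For reference, a cache breakpoint in a raw Anthropic request looks roughly like this (sketched with the official Python SDK; the `cache_control` block is straight from their docs, everything else is a made-up example):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="Your sysprompt...",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "An older chat message...",
                    # the breakpoint: everything up to and including this
                    # block (tools -> sysprompt -> messages) is the cached
                    # prefix; everything after it is regular input tokens
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "assistant", "content": "A newer reply..."},
        {"role": "user", "content": "The latest prompt..."},
    ],
)
```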
So it should be BEHIND:
- your prefill (and it is, automatically), because the prefill doesn't stay in the chat history **(so depth 0 is the last user prompt, depth 1 is the assistant prompt immediately before that, etc. Depth increments on ROLE switches and not just per message; see the sketch after this list)** **(it STILL works that way for OpenRouter, so pay attention to whatever the fuck your system messages are doing)**
- relevant consequence of the prefill thing: evens for caching at user messages, odds for caching at assistant messages (in general you'll want evens)
- your PHI, if you have one (because it moves along the chat history)
- your injections at depth (see above)
- your {{random}}s
- your group nudges
- the mutable parts of your prompt in general
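If the role-switch counting sounds ambiguous, here's a minimal sketch of the rule as described above (my own illustration, not ST's actual code):

```python
def message_index_at_depth(messages, depth):
    # messages: the array actually sent to Claude, prefill already stripped.
    # Depth 0 is the last message (a user prompt); depth only increments
    # when the role flips, so a run of same-role messages shares one depth.
    d = 0
    role = messages[-1]["role"]
    for i in range(len(messages) - 1, -1, -1):
        if messages[i]["role"] != role:
            d += 1
            role = messages[i]["role"]
        if d == depth:
            return i  # the cache breakpoint lands on this message
    return 0  # history shallower than `depth`; clamp to the very top
```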
The caching, as it was implemented, has fuckall idea what messages come from where. It just crawls up the API request and slaps on caching markers.
In particular, if you have a lorebook with deterministic keys, you're likely to hit no caches between messages, but you can still reduce costs between swipes (or not; I don't know your usage patterns). Non-deterministic lorebooks are worse.
You can always just move your lb entries into depth and then place `cachingAtDepth` behind that tho.
The simplest scenario is **no depth injection + nothing between Chat History and Prefill**; cachingAtDepth **0**. The second simplest scenario is **no depth injection + any number of relative position user prompts, as long as there is no assistant stuff between Chat History and Prefill**; cachingAtDepth **2**.
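To make that concrete with the sketch from the list above (a toy, roles-only tail; my example, not ST output):

```python
tail = [
    {"role": "user", "content": "..."},       # depth 2
    {"role": "assistant", "content": "..."},  # depth 1
    {"role": "user", "content": "..."},       # depth 0 (relative user prompts merge here)
]
assert message_index_at_depth(tail, 0) == 2  # breakpoint on the last user prompt
assert message_index_at_depth(tail, 2) == 0  # skips the whole mutable tail
```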
TTL of **5 minutes** (the clock resets every time the cache is hit).
You can pay up to **25% more than usual** in the mathematical worst case scenario with 0 cache hits.
Up to a **90% discount** (ignoring output tokens etc etc).
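Back-of-envelope numbers, using Anthropic's published multipliers for the 5-minute cache (writes billed at 1.25× base input, hits at 0.1×; the $3/MTok figure is 3.5 Sonnet's input price, swap in your model's):

```python
base = 3.00 / 1_000_000  # $ per input token
ctx = 20_000             # tokens in the cached prefix

uncached = ctx * base         # what those tokens normally cost
write    = ctx * base * 1.25  # first prompt: pay extra to store the prefix
hit      = ctx * base * 0.10  # follow-up within the TTL: 90% off

print(f"uncached ${uncached:.4f}  write ${write:.4f}  hit ${hit:.4f}")
# uncached $0.0600  write $0.0750  hit $0.0060
```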
Substantially faster gens according to Dario.
Not in the defs or anywhere in the sysprompt, but you can move them into the PHI or the prefill for likely analogous effects to what you were already doing.
Depends! Not if you want cache hits between prompts, yes if you just care about making your swipes cheaper.
If God hates you, it's possible to get zero between-prompt cache hits because your messages array is a mess and conspires against behaving sanely.
Which might happen if you use group chats or something.
Your swipes should _reliably_ cost much less as long as you follow the suggestions above tho.
Group chat should be fine under direct Claude with "Join character cards (include muted)". OpenRouter is the issue, since it sweeps all system messages into the Claude API's `system` parameter, which breaks things like group chat and impersonate. The group nudge can be blanked out in Utility Prompts and copied to a custom prompt set to user.
Oh and group chat (no matter which API you use) has an [obscure bug](https://github.com/SillyTavern/SillyTavern/issues/2997#issuecomment-2440110098) where any chat examples with only {{char}} will disappear if the char is not the active char, and the cache will be invalidated when the next char speaks.
If using the group nudge and/or a PHI without a depth 1+ injection, set `cachingAtDepth` to 2, as explained in "What value do I use for cachingAtDepth".
NGL? Like, the beginning of the chat history is typically more immutable than the sysprompt if you summarize (**as you should**), so I consider caching at the correct depth to be strictly better than just caching the sysprompt.
And just caching the sysprompt is unlikely to actually save you much money, because it's like 1k tokens of your 20k+ tokens context. Caching the sysprompt mostly just wastes money on cards with lorebooks and {{random}} in the defs.
But again it really depends a lot on how you use the model.
OpenRouter gives you literal savings numbers on their website.
Anthropic reports how many tokens were written to/read from the cache (you can see this stuff in the console if you disable streaming).
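Concretely, those numbers live in the response's `usage` block (field names per the Anthropic API docs; `response` as in the SDK sketch further up):

```python
usage = response.usage
print(usage.input_tokens)                 # regular, uncached input tokens
print(usage.cache_creation_input_tokens)  # tokens written to the cache this call
print(usage.cache_read_input_tokens)      # tokens served from the cache
```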
-> *(screenshot)* <-
The latest prompt and the one two messages back are cached (hence the outline and the colored marking).
No. I'd have made caching the prefill a specific option, but we only get 4 breakpoints (Anthropic API limit), it'd then be possible to have invalid config.yamls without an intuitive, unambiguous resolution, and I'd probs have had to argue with Cohee for longer to get him to merge my PR.
Set `cachingAtDepth` to 8; it should work reasonably fine with even the goofiest JB as long as you're not doing group chats or abusing impersonate.