How to get That Prime Rib Claude Opussy Norbussy, by Simply Paying the Piper (Less!)

This isn't an enable-and-forget magical miracle cost-saving feature. From now on you need to be somewhat aware of what the API calls you're actually sending look like.

me when things are functional in the ideal circumstances

So true!

Original rentry (by someone else) (based)

https://rentry.org/pay-the-piper

Original docs (by Anthropic) (cringe)

https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching

What the fuck is caching

You pay more to store part of your prompt on Anthropic's servers so you can pay less on the next prompt, provided it can reuse what was cached (that is, if both prompts begin with the same text that you told Anthropic to cache).
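
A minimal sketch of what this looks like against the raw Anthropic API (Python SDK; the model name and prompt text are placeholders, and older SDK versions need the prompt-caching beta namespace instead of plain messages.create):

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder; any cache-capable model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "A long, stable system prompt goes here...",
            # everything up to and including this block gets cached server-side
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Hi."}],
)

# The first call pays the (pricier) cache write; calls within the TTL that
# share the exact same prefix pay the cheap cache read instead.
print(response.usage)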

How DO I cache

Make sure this commit is merged into your ST: https://github.com/SillyTavern/SillyTavern/commit/54db4983f4663d77db79ec1246888a5791bdb619 (that is, make sure you've git pulled staging after the date on it). You can also just check if cachingAtDepth shows up in your default/config.yaml.

Edit your config.yaml to ensure the Claude section reads something like

claude:
  enableSystemPromptCache: false
  cachingAtDepth: 0

where cachingAtDepth is SOME non-negative number that MAY OR MAY NOT be 0. It depends on where you like to make your injections at depth, your PHIs, etc.

How does this work

Anthropic allows you to cache prompt prefixes. Prefix, as in, "the prefix (beginning substring) of the text completion (we won't explain what the text completion actually looks like) (all you get to know is that the order is tools -> sysprompt -> messages)".

Which is enough. You cache the beginning of the prompt, and everything after this marked point is mutable aka regular input tokens.
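
For illustration, here's roughly the request body shape with hypothetical contents, so you can see where a breakpoint can land (the ordering is the only part confirmed by the docs):

# Hypothetical request body. The prefix Anthropic matches for caching is
# assembled in this order: tools -> system -> messages.
request = {
    "model": "claude-3-5-sonnet-20241022",
    "tools": [],      # cached first, if present
    "system": "...",  # then the sysprompt
    "messages": [     # then the chat history, oldest first
        {"role": "user", "content": "..."},
        {
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": "...",
                    # a breakpoint here caches everything above it, itself included
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "..."},  # mutable: regular input tokens
    ],
}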

What value do I use for cachingAtDepth

cachingAtDepth marks the point in the message history where the immutable beginning of the prompt ends (inclusive).

So it should be BEHIND:

  • your prefill (and it is, automatically), because the prefill doesn't stay in the chat history (so depth 0 is the last user prompt, depth 1 is the assistant prompt immediately before that, etc. Depth increments on ROLE switches, not just per message; see the sketch after this list) (it STILL works that way for OpenRouter, so pay attention to whatever the fuck your system messages are doing)
  • relevant consequence of the prefill thing: evens for caching at user messages, odds for caching at assistant messages (in general you'll want evens)
  • your PHI, if you have one (because it moves along the chat history)
  • your injections at depth (see above)
  • your {{random}}s
  • your group nudges
  • the mutable parts of your prompt in general
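
Here's the promised sketch of how depth is counted over a hypothetical messages array (the prefill is stripped before counting, so it never appears here):

# Hypothetical chat history as sent to the API, oldest first.
messages = [
    {"role": "user", "content": "First message."},        # depth 4
    {"role": "assistant", "content": "A reply."},         # depth 3
    {"role": "user", "content": "An injection."},         # depth 2
    {"role": "user", "content": "Another user message."}, # still depth 2 (no role switch)
    {"role": "assistant", "content": "A reply."},         # depth 1
    {"role": "user", "content": "Latest user prompt."},   # depth 0
]

# cachingAtDepth: 2 puts the breakpoint on the depth-2 block, so everything
# from the top of the prompt down to it (inclusive) is cached; depths 1 and 0
# stay mutable and are billed as regular input tokens.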

The caching, as it was implemented, has fuckall idea what messages come from where. It just crawls up the API request and slaps on caching markers.

In particular, if you have a lorebook with deterministic keys, you're likely to get no cache hits between messages, but you can still reduce costs between swipes (or not; I don't know your usage patterns). Non-deterministic lorebooks are worse.

You can always just move your lb entries into depth and then place cachingAtDepth behind that tho.

The simplest scenario is no depth injection + nothing between Chat History and Prefill; cachingAtDepth 0. The second simplest scenario is no depth injection + any number of relative position user prompts, as long as there is no assistant stuff between Chat History and Prefill; cachingAtDepth 2.
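
Annotated versions of both scenarios, with hypothetical contents (the prefill is shown for completeness; it's skipped when counting depth):

# Scenario 1: nothing between Chat History and Prefill. cachingAtDepth: 0
# puts the breakpoint on the newest user message; only the prefill is uncached.
messages = [
    {"role": "user", "content": "..."},       # depth 2
    {"role": "assistant", "content": "..."},  # depth 1
    {"role": "user", "content": "..."},       # depth 0 <- breakpoint
    {"role": "assistant", "content": "..."},  # prefill (not counted)
]

# Scenario 2: relative user prompts ride along after the newest chat message,
# so the depth-0 block changes position every prompt. cachingAtDepth: 2 keeps
# the breakpoint behind them, on pure chat history that won't move.
messages = [
    {"role": "user", "content": "..."},                          # depth 2 <- breakpoint
    {"role": "assistant", "content": "..."},                     # depth 1
    {"role": "user", "content": "chat msg + relative prompts"},  # depth 0, mutable
    {"role": "assistant", "content": "..."},                     # prefill (not counted)
]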

How long does the cache last

A TTL of 5 minutes, refreshed each time the cached content gets reused.

How much money can I lose from this

You can pay up to 25% more than usual in the mathematical worst-case scenario with 0 cache hits.

What kind of savings can I gain

Up to a 90% discount (ignoring output tokens etc etc).
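
Back-of-the-envelope math, using the multipliers from Anthropic's docs (cache writes bill at 1.25x the base input price, cache reads at 0.1x); the 20k context size is just an example:

write, read = 1.25, 0.10  # price multipliers relative to a regular input token

# Worst case: cache 20k tokens, never get a hit before the TTL expires.
print(20_000 * write)  # 25000.0 token-equivalents, i.e. the +25% above

# Decent case: one write, then one full hit on the very next prompt.
# 27000 vs 40000 uncached token-equivalents: ~32% saved over two prompts.
print(20_000 * (write + read))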

Any benefits other than savings?

Substantially faster gens according to Dario.

Wait can I no longer use {{random}} or dice rolls

Not in the defs or anywhere in the sysprompt, but you can move them into the PHI or the prefill for effects likely analogous to what you were already doing.

Wait can I no longer just let messages cycle instead of summarizing or truncating manually

Depends! Not if you want cache hits between prompts; yes if you just care about making your swipes cheaper.

What if God hates me

If God hates you, it's possible to get 0 between-prompt cache hits because your messages array is a mess and conspires against behaving sanely.

Which might happen if you use group chats or something.

Your swipes should reliably cost much less as long as you follow the suggestions above tho.

Group chat (by Anon)

Group chat should be fine under direct Claude with "Join character cards (include muted)". OpenRouter is the issue since they sweep all system messages into Claude API's system parameter, which breaks things like group chat and impersonate. Group nudge can be blanked out in Utility Prompts and copied to a custom prompt set to user.

Oh and group chat (no matter which API you use) has an obscure bug where any chat examples with only {{char}} will disappear if the char is not the active char, so you see this shit:

[Example Chat]
[Example Chat]
User: Hi.
Assistant: Hi.
[Example Chat]
User: This one is user only.

and the cache will be invalidated when the next char speaks.

If using group nudge and/or PHI without depth 1+ injection, then set cachingAtDepth to 2 as explained in "What value do I use for cachingAtDepth".

Should I also use the sysprompt caching thingy

NGL? Like, the beginning of the chat history is typically more immutable than the sysprompt if you summarize (as you should), so I consider caching at the correct depth to be strictly better than just caching the sysprompt.

And just caching the sysprompt is unlikely to actually save you much money, because it's like 1k tokens of your 20k+ token context. Caching the sysprompt mostly just wastes money on cards with lorebooks and {{random}} in the defs.

But again it really depends a lot on how you use the model.

How do I see how much money I'm saving

OpenRouter gives you literal savings numbers on their website.

Anthropic tells you how many tokens were written to and read from the cache (you can see this stuff in the console if you disable streaming).
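
If you're curious what to look for, the relevant fields in a non-streaming Anthropic response's usage object look like this (the field names are real; the numbers are made up):

# input_tokens: regular, uncached input tokens
# cache_creation_input_tokens: tokens written to the cache this call (the 1.25x ones)
# cache_read_input_tokens: tokens served from the cache (the 0.1x ones)
usage = {
    "input_tokens": 312,
    "cache_creation_input_tokens": 0,
    "cache_read_input_tokens": 19488,
    "output_tokens": 256,
}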

Fuck it I've disabled streaming and opened the console, what are the cache hits I'm supposed to be getting

For cachingAtDepth: 0:

[Graph: which messages get cache markers across two consecutive prompts with cachingAtDepth: 0]

The prompt itself and the one from two messages back are cached (hence the outline and the colored marking).

Can I use this with noass

Nyo. I'd have made caching the prefill a specific option, but we only get to use 4 breakpoints (Anthropic API), and then it'd be possible to have invalid config.yamls without an intuitive, unambiguous resolution, and I'd probs have had to argue with Cohee for longer to get him to merge my PR.

I'm stupid I don't know how {{random}} and stuff work I just use janitorai Miguel O'Hara cards with discord jbs

drool

Set cachingAtDepth to 8; it should work reasonably fine with even the goofiest jb as long as you're not doing group chats or abusing impersonate.

Pub: 17 Nov 2024 19:50 UTC
Edit: 19 Nov 2024 10:24 UTC