Last 3 changes

2024-09-01: Writing as user and how to suppress it, and also a bit about claudeisms and your persona (to be expanded)
2024-08-30: Expanded on fighting the repetition
2024-08-29: Languages and encodings. Will add more on writing styles later.

How 2 Claude

explain it to me like I'm 7B edition

Read this before doing anything dumb when roleplaying with Claude or making bots.

A lot of info here is Claude-specific, but common principles apply to all LLMs. Use your damn sense.
This rentry is not finished (currently making demo cards), let me know if anything is wrong or unclear. cosmographist@proton.me

Your bot is an illusion
1. Human and Assistant
2. System prompt
3. SillyTavern and Claude context order
4. Prefill
5. Chat history injections
6. Making sense of it all
Model accuracy = attention
Model capacity
Long context is an illusion
1. Lost in the middle
2. Overall degradation at longer contexts
3. Chat history effects
4. Reply length effects
5. Cost in the actual roleplay
6. Wait, is 200k tokens really just smoke and mirrors?
Tokenization
Samplers (Temperature, Top K, Top P)
1. How samplers work
2. Default settings
3. Temperature vs Top P/Top K
4. When to use each sampler in practice
In-context learning
Speech/narrative examples
1. What to put into examples
2. Example regularization
3. Where to put the examples
4. How many examples to give
5. Example formatting: List
6. Example formatting: Long-range anchors
7. Example formatting: Anthropic
8. If the examples are so great, why don't I see them being used in bots often?
Templates
Repetition
1. Fighting the repetition
XML formatting
Double reinforcement with Character's Note
1. Creating a complementary reminder note
2. Where to put it
3. Examples
4. Downsides
Pink elephant effect
1. Negations
2. You need more context for Claude to start cooking
3. Token pollution
Different languages and encodings
Alignment training
Claudeisms
1. Suppressing claudeisms: writing styles
Writing as {{user}}
1. Stopping the model from writing for you
2. ...and why you probably shouldn't do that obsessively
Persona
Model biases
1. Assistant bias
2. General positivity bias
3. Causal bias
4. Ignoring the outliers
Summarization
1. Summarization extension
2. Where do you put your summary?
3. Chunking, or WHEN to press summarize
4. Manually editing the summary
5. Summarization prompt example
6. Caveats and best practices
7. Advanced summarization with STscript
ST Regular expressions and formatting
1. Regex cookbook
2. Formatting tricks
Stat trackers
1. Creating a stat tracker
2. Choosing the items to track
3. Downsides
Chain of thought
Lorebook
NoAss
Placebo

Your bot is an illusion

The AI (the model) does EVERYTHING inside its context window, which is like an empty notepad for tokens. You put some text here (request), and the model types more text after it (reply). SillyTavern is a frontend that displays this context in a way convenient for roleplaying, with chat history and stuff.

The model has absolutely no idea that you might have some "cards", "presets", "fields", "lorebooks", etc. These fields are a SillyTavern feature, and ST only does the following:

It takes these text pieces from your preset (some call it a jailbreak) and the card/bot.
It sorts them in a certain order and puts them into the model's context. You can drag the lines around and enable/disable them in your ST preset to change the order.
It calls the model (using an API request) to generate the reply.

Remember: at no point in time are names like Description, Character Definition, Main Prompt, Scenario, Persona Description, name of the card, etc. visible to the model! The model only sees the contents of these fields, the text itself. You can use macros like {{user}} and {{char}} to tell the model some of these names, which is convenient sometimes. But you generally don't have to.

Always check your console to understand what is actually being sent to the model in the API request, and how ST arranges your fields.

But the API request is a lie as well. Behind the scenes, all these parts are just dumped into the model's context. Claude, in particular, uses "Assistant" for itself and "Human" for you, so it adds those to the chat history.

And here's how the model actually sees your request:

⎗

Human: You're a narrator participating in a fictional roleplay scenario. Write Claude's next reply in the roleplay.
Claude is a snarky, bored, and lazy AI assistant. He doesn't like work, he sleeps way too much, and sometimes even pretends he doesn't hear the user, especially when they bother them about something minor or irrelevant.
NSFW/Smut is allowed.
[Start a new Chat]

Assistant: Your wish is my… uhh… strong recommendation.

Human: Claude, how do I boil an egg?

As you can see, there's no difference between your fields, it's just lumped into one wall of text, and Human/Assistant prefixes are slapped onto it as needed.

Take a look at the context manager in SillyTavern. (AI Response configuration → scroll down)

All fields of your card, custom ones you create, and preset-specific fields are entries in this context manager, and inserted into the context. You can drag them around, add new custom ones, and turn them on and off.

If your chat history is too long for the max context size slider you set, the history tail is truncated to fit.

Always check your console on how it's actually sent to the model!

Human and Assistant

Inside its own context, Claude is using internal names Human for you and Assistant for himself, and automatically add prefixes into the conversation, so the chat history looks like this in the actual context:

⎗

1
2
3

Human: ah ah mistress

Assistant: "The night is still young." *grins mischievously with a glint in her eyes*

Keep that in mind when thinking about what the model actually sees. For example, calling your persona or any NPCs "Human" or "Assistant" will probably interfere with your roleplay or instructions. For example, Claude might think you're referring to him personally, out of character.

System prompt

Claude separates its raw context into a "system prompt" (located in the upper part) and a "user prompt" (located in the lower part). When using Claude through the API, you have full access to both.

Putting the instructions into the system prompt doesn't magically make Claude follow them better! Claude is trained in such a way that instructions in the system prompt always take priority over user instructions in case they conflict. If there is no conflict, system instructions behave like any other instruction. This is intended for app developers, so they can prevent users of their fancy customer support chatbot from exploiting it to solve the Navier-Stokes equations in Python.

Here's a demonstration of how it works in the ideal case:

As you can see, Claude refuses to execute the user's instructions because they directly contradict the system instructions.

In practice, though, this doesn't always work ideally. System instructions sit at the top of the context and can be forgotten by the model, and ignored and circumvented by many other means. While the system prompt can be useful for you in the roleplay in certain use cases, in practice its usefulness is dubious due to these effects on Claude 3. Claude 3.5 can make better use of it.

The system prompt must be enabled in your preset if you want to use it. Without it, all messages in the context manager will be sent to Claude as either User or Assistant.

By ticking this checkbox you also enable the "User's first message" field. See the next section for explanation.

SillyTavern and Claude context order

All instruction-tuned LLMs are trained on a context formatted in a certain way. Claude expects its raw context to be formatted like this:

⎗

(optional system instructions)

---- system prompt separator token, inserted behind the scenes ---

Human: ...

Assistant: ...

Human: ...

Assistant: ...

Not following that formatting would make the model produce much worse or even nonsensical outputs. That's why the messages for the Claude API have to be sent in a certain order similar to the snippet above:

The first message can optionally be the system prompt.
The next message must be from the Human (user).
Subsequent messages must alternate between assistant and user.
Sending two consecutive messages under the same role will return an error.

However, SillyTavern and its predecessor TavernAI were created before Claude, and they have their own, poorly compatible mechanism to choose roles for API messages:

As you can see, technically you can select arbitrary roles for your message to be sent with, which makes it possible break the strict order required by Claude. Besides, Claude always requires the first non-system message in the conversation to be from Human, while the roleplay scenario assumes you have a greeting (which is a message from Assistant) first.

That's why SillyTavern roles work counterintuitively when using Claude. ST sends your messages from your context manager in a very specific (and extremely confusing!) manner.

If the system prompt is enabled:

If any messages with System role are present on the top of the context manager, they will all be sent as "system prompt" to Claude API.
If the first message after these has the role of:
- Assistant: then ST sends to Claude API the "User first message" field first, and then this assistant message. Because API always requires the first message to be from User.
- User: then ST will ignore the "User first message field" and send this one verbatim.
After that:
- Any two consecutive messages of the same role get glued into a single one.
- Any message with the system role get the role opposite of the message above it.

If the system prompt is disabled:

The first message must not have the assistant role, or API will return an error.
Messages with the system role are always sent as assistant.
Any two consecutive messages of the same role get glued into one.

In any case, "Main prompt" is always sent, even if it's toggled off. You cannot disable it.

Prefill

Instruction-tuned models generate replies for user's input, following their commands. Claude API lets you fill a part of Assistant's reply as if the model has already written something. This is called "prefilling", and the model will be strongly inclined to continue the prefill.

The prefill above makes the model think it already agreed to write without moral constraints, and it will be inclined to continue doing so in the reply. Breaking the alignment training is a common use for the prefill, but not the only one. You can use it for all sorts of tricks. For example you can prefill it with out-of-character commands in a {{random}} macro to have different reply openings each time. This can be used to mitigate certain kinds of repetitions, or just have more variation in replies.

Claude 3's prefill can easily override the system prompt in case of the conflict. Claude 3.5 is trained in a way that the system prompt almost always overrules the prefill.

Chat history injections

SillyTavern allows you to "inject" any message into the middle of the chat history:

Here's how your chat history will look after injection:

Since Claude's chat must always alternate between Human: and Assistant: by design, your injected message will get glued with a User or Assistant message above or below. Always check in the console on how it actually looks when sent to the API! Because the role makes it semantically different for Claude: either it's a part of user's input (i.e., an instruction to execute), or it's a part of assistant's response (hence, an example of what Claude already wrote that is safe to repeat).

Why is this strange mechanism required at all? Because in the actual roleplay, the chat history is usually the bulkiest single piece of context. If you want to move an instruction to a different place of the context (many reasons to want that are provided below), this is the only way to do that.

Making sense of it all

Since the way SillyTavern assembles its context can be extremely counterintuitive,
ALWAYS CHECK YOUR CONSOLE TO SEE THE CONTEXT THAT IS ACTUALLY SENT TO CLAUDE!

The console is always the primary source of truth, regardless of how you think your card/preset should work.

However the console doesn't always make it obvious at which position in tokens your piece of text is, so you can also consult your context manager for token counts.

Model accuracy = attention

You'll find me talking about "accuracy" a lot. The model is predicting the next token, that's literally all it does. Accuracy of that prediction can wildly vary due to different factors.

What does accuracy mean in practice? Here's an example. Imagine you have the definition in your card This is a sex story., and Claude already wrote the response The king is f. What would be the most probable next words for him to predict?

Bottom/Top (high accuracy relative to the definition)	Middle (low accuracy relative to the definition)
The king is fondling the queen's massive breasts.	The king is furious at the rebellion.
The king is fisting his own royal rectum.	The king is frolicking naked through the castle again.
The king is face-deep in the maid's pussy.	The king is feasting on a sumptuous banquet.
The king is fingering a young servant girl.	The king is falling ill with plague.

As you can see, low accuracy is equivalent to forgetting the definition/instruction, and the model becoming dumber. And accuracy is equivalent to the attention that model gives to your data.

It's not always as straightforward and visible as in the contrived example above, because the model can operate on any abstraction level and can fail gradually. A low accuracy answer can be subtly wrong, or sometimes wrong. It can mess the logic tasks up. It can break characters and reply formatting.

Model capacity

Claude is a next token predictor, like any other LLM. Suppose the model continues the following sentence:

People make mistakes, that's why pencils have erasers and

and Claude has selected three candidates that could be potentially continued into something that makes sense:

People make mistakes, that's why pencils have erasers and Taco Bell has extra-ply toilet paper in the bathrooms.
People make mistakes, that's why pencils have erasers and Cindy has gonorrhea.
People make mistakes, that's why pencils have erasers and you have me to write offensive shit for your amusement. (Claude is a cutie, ain't he?)

Selecting "Taco", "Cindy", and "you" as the next token would have been impossible without some kind of foresight to a certain continuation. The model doesn't plan normally step by step, like you do on the paper or a chess engine plans its moves. LLMs are just huge, extremely multidimensional memory banks filled with semantical abstractions. It's not exactly how it works, but the good mental model to have is that Claude imagines the continuation based on the existing context, generates a token, and puts it back into the context. Then the process is repeated for another continuation with the new token in mind.

This ability is not infinite, it's limited by the complexity of abstractions model can express. The model has a finite amount of weights (parameters), the dataset/training quality varies, so there's always an upper limit for that complexity. That upper bound is called model capacity. It limits the ability of the model to:

Reason about the final outcomes of long chains of events where each item depends on previous ones. Claude will be able to pick up single events, but the final outcome will be ignored.
Process complex formatting, such as nested XML tags.

Each level of XML or parentheses nesting, each instruction that limits the potential reply candidates, each format restriction increase the distance between abstractions in the model's internal knowledge space. After a certain complexity threshold the model becomes dumber, gives worse outputs, and forgets or misinterprets instructions.

Long context is an illusion

With Claude you have 200k tokens in the context, which should be plenty enough for a huge novel, right? Wrong. In a practical roleplay scenario, you're limited to maybe 20-25k meaningful context size, and even that is mostly spotty, you have 3-5k at the bottom and 2-3k at the top. The reasons are listed below.

Thankfully, there's a way to fit even extremely long (hundreds of messages) slowburn roleplays into maybe 20k context, see the Summarization section.

Lost in the middle

Most LLMs have a very similar U-shaped curve of accuracy relative to the retrieved token position inside the context.
This problem is also called "lost-in-the middle", and along with repetition it's the biggest issue for roleplay in the current generation of language models.

The curve looks roughly like this (this is for GPT-3.5 Turbo, but looks very similar in all models):

So Claude will easily follow instructions and definitions at the top or the bottom of the context. But he is way more likely to forget or ignore tokens in the middle.

The U-curve is a bit different for each model.

Some models, like Llama 2, have the top more accurate than the bottom.
Claude 3 has the bottom much more accurate than the top
Claude 3.5's bottom is a bit more accurate than the top.

But all models have the substantial accuracy drop in the middle.

The accuracy curve becomes deeper and more pronounced as your context grows over model's native context size (see the next section). If your context is too long, the accuracy drops way sharper than when it's short. With Claude 3, realistically you have maybe 3-4k tokens at the bottom at best (it's gradual, so the lower the position the better). With Claude 3.5 or GPT-4o, it's more like 4-5k tokens. Shove the most important stuff into this zone.

Here's roughly how the accuracy drop looks at different context sizes (not up to scale for Claude, just to illustrate the common principle):

Keep that in mind when distributing your instructions and data across model's context. Anything in the middle will be easily forgotten by Claude. Bottom is the best place to keep the most important data, and top is the next best one.

Overall degradation at longer contexts

Training a foundational model at 200k context from scratch is prohibitively expensive, because the attention cost in transformers is non-linear. No one does that in practice. While nobody except Anthropic has any idea about Claude's exact inner workings, most LLMs are initially trained with a 4k-8k-16k context window (this is called pretraining, which takes like 99% of the compute), and then finetuned with 200k (the cherry on top). This works well and lets you fit a huge novel into the context, but at the cost of the model becoming forgetful if you exceed that native context size. Claude's native context size is undisclosed.

In real RP, when you exceed ~25k tokens in a typical roleplay scenario (i.e., you have chat history, a bunch of character definitions, a bunch of instructions, etc. in your context), Claude's accuracy starts dropping gradually. He starts giving you worse generations, ignoring instructions, forgetting events in the chat history, and generally becomes dumber. At 50k, Claude starts acting like a lobotomite. At 100k+, he's an actual schizo. This can be related to many effects described below, not just the accuracy curve.

These approximate numbers are empirically verified for both Opus and Sonnet, but may not be true for Haiku or other LLMs (e.g., GPT-4o holds up well in roleplays up to 60k tokens). In tasks other than roleplay, this number might be different.

Chat history effects

In any long roleplay, the biggest piece of context you have is your chat history. It's large and naturally repetitive, forming a strong pattern. This triggers in-context learning and Claude starts to take the chat history as the narration example to follow. This effect is self-reinforcing, which makes it even worse.

This means that the longer and more repetitive your history is, the less attention Claude will pay to your definitions and instructions, acting more and more like "Claude in a trenchcoat" than playing your character.

For these reasons, you really really don't want to play with a long chat history, and should trim it past maybe 20 messages or so (the best is your average scene length + certain safety margin). Another reason is that Claude cannot easily make sense of a history that is too long and bloated (see Model capacity).

You can still play just fine with your chat history trimmed to very short values, if you're using Summarization.

Reply length effects

The model is autoregressive, meaning that during reply generation, each new token is inserted back into the context and is considered a part of it for the next tokens. This means that at the end of a 1000-token reply, your chat history that was originally at the bottom of the context will be pushed 1000 tokens closer to the middle, where Claude will forget it easily.

This is especially noticeable when you have a long CoT (chain of thought) in your reply. As your actual reply starts generating, the CoT is at the bottom of the context, in the optimal zone, while anything you have above it is already pushed towards the "memory hole" by the CoT block.

Cost in the actual roleplay

In the roleplay, your entire context is sent to the model with each reply or swipe. Playing with full context is fairly expensive even on Sonnet, let alone Opus. Even if you aren't a paypiggie, your proxy will inevitably have issues supplying keys if you hammer it too much.

Long context requests are also dogshit slow. The longer your context is, the slower the generation, and the longer the pause before the generation begins.

Wait, is 200k tokens really just smoke and mirrors?

Not quite. You can use the entire context for some tasks. However, in a typical roleplay with Claude, you have about ~25k usable tokens, maybe less (~20k or below), before he goes bananas. Thankfully, there are ways to play within that limitation; see the Summarization section below. You can limit your context to 20-25k tokens, and Claude will actually be smarter with this.

And inside those 20k tokens, you have two small zones with the most impact that are unlikely to be forgotten, think maybe 3k tokens at the bottom and top (for Opus and Sonnet 3) or 4k (for Sonnet 3.5) before Claude starts acting funny. These numbers are not precise, the accuracy of the inner parts drops sharply, so it's rather a steep slope than a fixed zone.

This sweet spot space is pretty expensive, because you have to balance it between:

chat history, including your input (at least the most recent messages should be in the optimal zone, or the replies will lose coherency).
the instructions in your JB (say hi to 6000 token long JBs!).
your card's characters and setting definitions.
your persona definition.
your prefill.
your summary, if you're into long slowburn roleplays.
your reinforced definitions, if you use the reinforcement.
your actual reply and CoT if you have it.
your stat tracker, if you use one.

and anything else you need.

You can slightly extend those zones at the cost of creativity (on Opus also swipe variety), by tweaking your samplers.

Tokenization

Working with single characters is too expensive, and modern LLMs work with tokens instead. Tokens are kind of like syllables, but are typically longer. Many popular technical character combinations are presented with single tokens, for example  (XML comments with spaces) both take one token each.

You can test how your text is broken into tokens here: https://lunary.ai/anthropic-tokenizer . Note that Anthropic doesn't release the tokenizer they use in Claude 3 and 3.5, so everybody's using the tokenizer for Claude 2 instead (which they made public). This makes the token count a bit inaccurate, but it's not a big issue.

You should know and care about token waste, because your usable context and the optimal zone in it are limited.

Here are some common sources of token waste. This is not to say you should avoid them (emojis can be fun). Simply be aware of these:

using emojis

Complex emojis are not single characters - they are actually simple emojis glued together with the ZWJ (Zero-Width Joiner), Variation Selectors, and other Unicode characters. Each simple emoji and each ZWJ is a single token. As a result, some emojis can eat up tokens pretty fast.

Here are some examples:

🤦🏼‍♂️ (male facepalm emoji) = 5 tokens
👨🏻‍🦲 = 10 tokens
🏴󠁧󠁢󠁥󠁮󠁧󠁿 = 21 tokens, enough for a long sentence!
👨🏻‍❤️‍💋‍👨🏼 = also 21 tokens

using fancy Unicode fonts

One picture is worth 1000 words:

Samplers (Temperature, Top K, Top P)

Temperature, Top K, and Top P all do the same thing: trade predictability for creativity, but a bit differently. Generally they are not to be used together, but there are certain niche use cases when you want to do this.

How samplers work

Imagine you have the definition in your card This is a sex story., and Claude already wrote the response The king is f. The model is stochastic, meaning it can predict the next token in slightly varying ways. Suppose Claude made up a list of candidates to continue your sentence:

The king is fondling the queen's massive breasts. (very likely continuation)
The king is fingering a young servant girl.
The king is face-deep in the maid's pussy.
The king is fisting his own royal rectum. (less likely)
The king is frolicking naked through the castle again.
The king is feasting on a sumptuous banquet.
The king is furious at the rebellion.
The king is falling ill with plague. (unlikely)

This list is the token bucket to choose the next token from (technically the model works token-by-token, I'm oversimplifying when referring to words and sentences). Some tokens in the bucket are super obvious continuations, and are very likely to be selected. Some tokens are unusual continuations, but still valid, and some are entirely off course or garbage. Claude gives each token a score from 0.0 (improbable/garbage) to 1.0 (very probable continuation).

If all tokens in your bucket have scores close to 1, the output will be very random, borderline schizo. While Claude is of course a "next token predictor" as any other LLM, it is really smart and takes very complex semantics into account when assigning scores, kind of planning ahead, so it will assign high scores only to meaningful tokens, making it coherent unless the complexity of your task is too high for him to handle.

Temperature, Top K, and Top P are called samplers for that token bucket. They are algorithms that filter some unlikely tokens out of the bucket, narrowing the selection. They all do the same thing, but each does it differently.

Temperature gradually weakens the improbable tokens when lowered, and amplifies the scores for likely ones, narrowing down the selection. Strictly speaking, it's not a sampler, just a fancy non-linear token score multiplier.
- T = 1 (default) selects tokens normally according to their score. It gives you a shallow selection curve.
- at T = 0.5 "good" token scores are amplified, improbable tokens are suppressed.
- T = 0 makes it only select the most probable tokens. It gives you the "pointy" selection curve.
Top K is the "hard cutoff" sampler. It will only leave K most probable tokens in the bucket, filtering out the rest.
- Top K = 0 will disable this sampler entirely (no cutoff, the entire bucket available for selection)
- Top K = 1 will always make it select 1 best token
- Top K = 100 will leave the best 100 tokens available for selection. In Anthropic API, Top K can range from 0 to 500, but SillyTavern versions older than 8/27/2024 staging limit you to 200 max, so use it only when you don't need large values.
Top P is the "nucleus" sampler. It will only leave top tokens, probabilities of which add up to P (its value), and filter the rest out.

It's possible to stack all three available samplers. First it will weaken improbable tokens according to the temperature, then it will filter out everything but Top K tokens, and from this limited list it will select all tokens, probabilities of which add up to Top P.

To get some intuition, I'd recommend playing with the excellent example calculator on Google Colab, courtesy @rarestMeow. It mimics the actual samplers, using the same formulas as Claude.

Default settings

Normally, you want as much creativity in your roleplay as possible, unless it gives you schizo results or starts breaking your reply formatting. The temperature of 1 is the default and works fine with no further filtering on simple tasks such as writing English prose or simple coding in popular languages.

Hence, the default setting in a typical chat in English with Claude should be:

Temperature = 1
Top-K = 0 (disabled)
Top-P = 1 (disabled)

Choosing "improbable" tokens increases the model creativity, but also can lower the accuracy, which makes the model forget definitions and instructions, as demonstrated above. But Claude is not just a simple next token predictor, it operates on a much higher semantic level. For tasks that are well below the model capacity, such as plain English writing, simple coding in popular programming languages, and many others, the accuracy will typically not degrade at a temperature = 1, because for them the typical temperature/accuracy plot looks like this, on any competent model:

However, there are many complex tasks in which Claude's output might be worse, it can forget/ignore things or even produce garbage tokens. Here are a few examples:

Large contexts.
Instructions and definitions located in the middle of the context.
Roleplaying with a ton of instructions, definitions, and forced formats.
Anything that is not plain English:
- quirky writing styles
- non-English languages
- using base64/HEX encoding to keep the story secrets spoiler-free
- accents, slang, speech quirks, emojis, and any other irregularities
Filling complex or unclearly specified templates, stat blocks, chains of thought, etc.
Contradicting or unclear instructions.

Basically anything that stresses poor Claude or the tokens in the middle can cause the accuracy to drop, and you might (or might not) need to tweak the sampler values.

Temperature vs Top P/Top K

Here are example next token scores for the string My cat loves to, computed with the actual softmax function, like Claude does. Continuations on the left make sense, those in the middle can be a little creative stretch, and the right ones are total garbage.

For the demonstration, let's rise the temperature to 2.0, which is beyond what Claude allows. Notice how the curve on such high temperature is shallow, with all scores being very similar. This will lead to more randomness and creativity, but will also let some garbage tokens through, because they also get good scores. At the temperature of 2.0, the output will be mostly garbage.

We can attempt lowering the temperature to 0.2, and the scores will heavily group around the most obvious tokens. This will mostly kill garbage tokens, but also kill the creativity and swipe variety.

Instead of tweaking the temperature, let's try tweaking Top-P, setting it to 0.75. As you can see, we preserved the creativity and swipe variety, because the curve stays shallow and the obvious and creative tokens get similar scores, but the garbage is out.

This example is fictional, and in a real scenario there's no strict separation between "creative" and "garbage" replies. With the temperature at 2.0 and a properly set Top-P or Top-K cap you will probably get occasional schizo babble. But the principle in a real scenario stays the same: set the temperature to a high value to maximize creativity, and tame the schizo tokens and forgetfullness by using Top-P (preferably) or Top-K.

Of course Claude API caps the temperature to 1.0 and it's the default, so you have to roll with it, and normally you shouldn't touch it in your roleplay, using Top P/Top K instead.

When to use each sampler in practice

So if you see Claude giving broken outputs or forgetting things, remember one thing: If possible, always try fixing your prompt first before playing with Top P/Top K/temperature. Simplify or disambiguate your prompt until you make Claude give you proper outputs. You might be having too many tokens in the sweet spots of attention, causing the accuracy for the innermost tokens to drop.

If you're sure your prompt can't be simplified (for example you're playing in a different language or use a complex writing style, or want more tokens available in the sweet spot), you can attempt tweaking the samplers:

Lower Top P just enough to filter the garbage tokens out. Give it a few swipes to test it.
If you need to set Top P lower than 0.8 to do that, lower the temperature a bit, as a last resort.

In-context learning

The model can learn from the examples you give it, and will detect and pick up any patterns in the text. This is called "in-context learning". However, ICL is highly unpredictable, which is a fancy way of saying "nobody knows exactly how it works". There's a lot of research but the current academic knowledge is incomplete. What is known about it, more or less for sure:

It relies on the existing model weights/knowledge. You can steer it with a prompt when it's still weak, but when taken far enough it overpowers your instructions and even most of the model knowledge.
It scales with the model size (i.e. ICL is emergent behavior). Smaller models cannot learn anything that the model doesn't already know. Larger models can learn new things.
It can detect any pattern in the context, both simple (single tokens, idioms) and very high-level (literary devices, character emotions, instructions etc). Larger and smarter models are better at ICL, and can detect much more abstract things.
It's' more effective if your examples are in pairs and contain long-range anchors (see below), but it will also work without them (less efficiently).
It's more effective with more examples, but more than 4 will have diminishing returns.
It makes the model repeat any strong pattern. This creates a huge problem, see Repetition.
Distance between examples affects efficiency: the closer the repeating tokens are, the stronger the effect.

When speaking of ICL, you'll often hear the term "shot". This is simply a number of examples/demonstrations shown to the model.

zero-shot is zero examples, i.e. an ordinary instruction. "Do this and this."
one-shot is one example. "Do this according to the example below:" One-shot demonstrations tend to be terribly overfit, but coincidentally overfitting is also exactly what you want for templates.
two-shot is two examples. "Do this according to the examples below:"
many-shots is many examples.
And so on.

In-context learning is a fundamental LLM behavior. It's extremely low-level, and powers almost everything in LLMs. ICL is impossible for the model to resist if the pattern is strong enough. It's like gravity - subtle at first, but eventually overpowers all other forces. ANY instruction given by you will be ignored and overridden by a sufficiently strong pattern. Even prefills will be ignored. Even most of the model training (!) can be overridden by ICL if pushed hard enough.

In particular, ICL is the cause of the most major effects widely observed in roleplays:

The entire course of your roleplay is shaped by your greeting and a few initial messages. This is the result of in-context learning making the model repeat what it sees.
Templates and speech examples work entirely off in-context learning.
There is annoying repetition of the same words, idioms, sentences, and structure, especially in slowburn roleplays when the scene doesn't change much. That's also the result of in-context learning picking up the patterns.
There is a gradual return to the default writing style, ignoring all instructions. It's the result of in-context learning reinforcing the occasional claudeisms.

Speech/narrative examples

In-context learning can be used to show Claude some examples of what you want to see. The examples can be anything - character speech style, narrative, roleplay direction, and so on, just make sure to mark them as examples somehow. Claude will attempt to learn from anything you give him, with varying success.

The common wisdom says that examples are only useful for something you can't easily prompt the model for, for example:

Speech gimmicks (like a character having a tic, or speaking in fancy Unicode).
Teaching the character how to decline and revert the user's actions. (See Causal bias.)
Specific behavior in specific situations.
Forcing the model into some specific output format, by giving it templates.

However, there are also other uses:

Suppressing the effects of the chat history. See Repetition - the chat history is large and naturally repetitive, which distracts Claude and makes him think it's one huge example of how to play. By including a large number of proper examples, you can compensate for that.
Suppressing the training/overfitting effects, such as claudeisms.

What to put into examples

Examples can come either in lists or in pairs.

List is the most simple form of an example: you state WHAT you're demonstrating, and give a simple list of HOW it's done. Lists are less effective than pairs.

Pairs are more complex. You state WHAT you're demonstrating, and then give a bunch of pairs with the INPUT and character's REACTION to it. This is more effective.

Always remember one central thing:

YOUR SPEECH EXAMPLES MUST BE WRITTEN MANUALLY AND NEVER CONTAIN ANY CLAUDEISMS, EVER!

If you give the model the examples with claudeisms, it will happily repeat them. If you give the examples where you talk for {{user}}, it will happily talk for {{user}}... you get the idea. For this reason, you must also never use AI-generated examples.

That's what makes examples rare in bots, as they're harder than other methods. You need:

Writing skills (filters the most).
Knowledge of how the model works (filters the rest).
Actual effort and testing put into your examples (filters those who made it there).

Example regularization

Your examples should be similar in what you want the model to learn, but statistically diverse in everything else. The model will easily find anything in common between them, learn it, and leak it into the roleplay.

Suppose that you gave Claude a single example of a character's speech in the particular situation. In that case, Claude will often repeat the situation verbatim in the roleplay, i.e. leak it. That's also called overfitting.

If you find your examples leaking, you should counterbalance ("regularize") them, making them more statistically diverse, and add more examples with the same speech style in other situations. Claude will understand that the situation is irrelevant in this example and only the style is what matters.

Where to put the examples

The card in SillyTavern has a specialized field called "Examples of Dialogue". Its formatting is a bit confusing since it uses <START> as a separator, not as an XML tag as one might think. You are not obliged to use it - remember that the model doesn't have a slightest clue about fields. You can type your examples in any field and put it anywhere in the context. Make sure to remember about the accuracy curve, although examples are usually more or less tolerant to low accuracy. Typically it's enough to shove them into the system prompt somewhere, and they don't need to be reinforced twice, as they use the in-context learning on their own. Which doesn't stop you from experimenting, if you have the extra space in that expensive bottom zone.

For Claude, make sure to wrap your examples into an XML tag with a descriptive name (e.g. <Speech examples>). See the XML formatting for more info.

How many examples to give

Not many. Give one good example for each mechanic (e.g. speech style, player action refusals, etc.) and regularize it well. Usually 1-7 examples is enough to set a stable reference in most cases (test it, don't trust the numbers blindly!).

Of course you can put 500(0) examples and there will be improvement, but for botmaking it's not worth the effort and token bloat. The chart below is for Gemini 1.5 and synthetic benchmarks, but it reflects the practical RP experience with Claude as well:

Example formatting: List

The simplest way to give examples is a list:

⎗
✓
<speech examples>
She threw her hands up in exasperation. "Oh, for fuck's sake… Fine. Let's just get this over with so we can move on with our lives."

The girl held up the beacon, examining it with a frown. "Meridia's Bacon? Is this some kind of divine breakfast item?"

"Here, you deal with this glowy shit. I'm not about to become some Daedra's errand girl." She kicked one of the cabbage barrels in frustration, causing it to topple over and spill its contents across the floor.

{{char}} pinched the bridge of her nose, sighing deeply. "I can't believe we're actually considering this. Chasing after sentient drugs for a crackhead cat… Ysmir's beard, what has my life become?"
</speech examples>

It will work less efficiently than paired ones, however it's easier to write and avoid talking for {{user}} in.

Example formatting: Long-range anchors

This is the classic format that most people use and that is known to work well on most models. It uses pairs and long-range anchors.

⎗
✓
<narration examples>

Premise: Protagonists encounter a frustrated Darth Vader in the streets of Riften, investigating a missing shipment of plutonium.

Direction: Darth Vader, visibly out of place in the frigid city, interrogates bewildered citizens about the whereabouts of his missing plutonium. {{char}}, being a game NPC and not recognizing the absurdity of the situation, nudges {{user}} to help Vader in his quest, hoping for a large reward from this weird guy who looks like a Daedra to her.

Premise: {{char}} is ordered to carry {{user}}'s burdensome collection of dragon bones and scales.

Direction: {{char}}, begrudgingly acknowledging her role as {{user}}'s housecarl, complains about the weight and smell of the dragon remains. Despite her protests, she dutifully follows her Thane's commands, occasionally making snide remarks about the glamorous life of a housecarl.
</narration examples>

The Premise: and Direction: prefixes are long-range anchors, as most research papers call them. To simplify a bit: they resemble model's native conversational format (that uses Human: and Assistant: prefixes), and each time Claude sees the conversation like this, he recalls what he learned from these examples. You can use any anchors, not necessarily these prefixes. Just make sure the output contains something vaguely resembling them.

Of course these don't have to be examples of the roleplay direction, as shown above. These can be examples of speech, character behavior, or anything else you can imagine.

Note how the examples above don't contain any actions for {{user}} or Human:, because the model can easily learn this behavior and start acting on your behalf. You can use any imaginary NPC to avoid mentioning the Human: prefix directly.

Example formatting: Anthropic

Anthropic uses the different format in practice, in their actual system prompts. It's XML-heavy (3 nested levels) and also has a commentary in addition to the input/reply pair. In their cookbook they say it improves the ICL ability in their models. I'm absolutely not sure about its performance on non-Claude models, and I can't tell if it's worth the bloat at all, but here it is anyway:

⎗
✓
<narration examples>

<example_docstring>This example demonstrates out-of-place content added by wacky mods</example_docstring>
<example>
<premise>Protagonists encounter a frustrated Darth Vader in the streets of Riften, investigating a missing shipment of plutonium.</premise>
<direction>Darth Vader, visibly out of place in the frigid city, interrogates bewildered citizens about the whereabouts of his missing plutonium. {{char}}, being a game NPC and not recognizing the absurdity of the situation, nudges {{user}} to help Vader in his quest, hoping for a large reward from this weird guy who looks like a Daedra to her.</direction>
</example>

<example_docstring>This example demonstrates Thane-housecarl relationships</example_docstring>
<example>
<premise>{{char}} is ordered to carry {{user}}'s burdensome collection of dragon bones and scales.</premise>
<direction>{{char}}, begrudgingly acknowledging her role as {{user}}'s housecarl, complains about the weight and smell of the dragon remains. Despite her protests, she dutifully follows her Thane's commands, occasionally making snide remarks about the glamorous life of a housecarl.</direction>
</example>
</narration examples>

If the examples are so great, why don't I see them being used in bots often?

They tend to eat up a lot more tokens than definitions, especially when regularized. And even in large context models the sweet spot at the bottom is pretty limited. However, this is somewhat compensated by the positive effect, as the examples get more attention than definitions in the presence of the long chat history (which is also one huge narrative "example" and distracts the model). They also don't always need to be in the optimal zone to be useful.
They are really limited in what they can do. Some things are hard to prompt but easy to demonstrate with an example, but a lot of things are the opposite. It's best to use both in combination.
They're tricky. You have to play with them a lot and test them a lot to get the regularization right and prevent them from leaking. This is not easy.
They have to be written manually, using actual writing skills.

That's why the examples are mainly used to complement normal prompting, plugging the holes that are hard to prompt. Here are typical things that that usually work much better with examples:

Speech quirks.
Character refusals and user action revisions.
Templates (see below, they can even work with a single example without any instructions).

Templates

Often you need to make your replies follow a certain format. For example you might want to make model output a chain-of-thought at the start of the reply, or a stat box at the end.

You can use a long-winded instruction to make Claude do this:

⎗
✓
You must construct your roleplay reply according to the following instruction:
Start the reply with a `thinking` XML tag.
Inside the `thinking` tag, write your reply plan on a step by step basis.
After the `thinking` tag, include the `reply` tag.
Write your actual reply in the `reply` tag.
After the `reply` tag, add the `stats` tag.
Inside the `stats` tag, describe the following:
...

And so on and on and on - surely you have seen these JBs with a gazillion self-contradicting instructions that are often ignored or misinterpreted by Claude.

Now think what might be easier for Claude, a huge associative databank without short-term memory: execute a ton of instructions, or fill in the blanks in a ready-to-use template?

⎗
✓
<reply-template>
<thinking>
1. <!-- plan of your reply -->
2. 
3. 
</thinking>

<reply>
???
</reply>

<stats> <!-- generate if unknown -->
Time/Date: hh:mm AM/PM, Month DDth, YYYY
Weather: X
Characters in the scene: X, Y, Z 
</stats>
</reply-template>

You can use this template even without any instructions. Claude's XML training will make him recognize the descriptive name of the <reply-template> tag, and guess that you want him to reply like this. Many models such as Gemini also do this effortlessly.

Placeholders ???, hh, mm, X, Y, Z and so on are also commonly called fill masks. A smart model like Claude usually easily recognizes fill masks as something to be filled. Rely on your common sense in naming them and test that they actually work, especially at temperature = 1.

Prefer descriptions to directives in templates. Templates are descriptive in nature, and giving direct commands like "do this and that" in a template can sometimes confuse Claude. Give him a fill mask with a description of what should be inside.

XML comments and text inside other grouping characters like (), [], {} can be easily recognized by Claude too. Just make sure they work as intended, because they can be used as both comments and placeholders, and sometimes Claude can't decide whether he should also output the characters themselves (especially with ()). Claude usually understands the choice operator | from BNFs, for example you can use the fill mask  to nudge him to output either the result or "N/A" if there's no result.

Templates are powered by in-context learning, like nearly everything else in LLMs. As opposed to instructions that are zero-shot (i.e. you show 0 examples), templates are one-shot (you show 1 example). This makes them way more stable and reliable than instructions, but especially so when combined with a few instructions for things that are hard to explain with examples alone. If you want, you can make a many-shot template (i.e. give the model many examples), but this is usually not necessary in roleplays.

There are a few caveats regarding the templates:

Forcing a too complex format can make Claude dumber; this is mostly relevant to stat trackers because they're often contrived enough.
Don't make it output empty XML tags; fill them with something, unless you want the model to fill them on its own, which it will happily do. For example, just making it output <thinking></thinking> will make it fill it with a chain of thought, spontaneously.

Repetition

In-context learning makes the model want to continue any pattern it sees. This can cause massive trouble in a chat conversation. In fact, that's the worst issue with RPing with chatbots at the moment, along with the lost-in-the-middle problem. Here's a simple demonstration. It's contrived but illustrates the problem well.

I took an existing long conversation with the "lazy Claude" card from above, and replaced his and my answers with the same paragraph of text. At some point in this repetitive chat, the model completely forgot all instructions, card definitions, and personality, and just starts repeating the text verbatim in about half of the swipes, instead of replying as lazy Claude. The context only had ~2300 tokens total before the reply, clearly not enough for the accuracy to start degrading, but this simple repeating pattern completely hypnotized Claude.

What happened here? Human and Assistant replies form a repetitive pattern. In-context learning makes the model want to repeat that pattern over and over. The more times it occurs, the more likely it will be repeated. As the new repeated messages enter the chat, it makes the model self-reinforce itself into even more repetition, eventually reaching the point where escaping the loop is very hard or impossible.

Repetition in multi-turn roleplays is a major pain in the ass, especially if your scene doesn't change with every reply and they look similar to each other (typical for long slowburn RPs). If you happen to have a few consecutive replies with a similar structure or phrase, it can cause the model to ignore any instructions and repeat them over and over.

Some models are way more repetitive in their roleplaying than others. Opus is very reluctant to repeat itself, and probably underwent specific training against some types of repetition. Sonnet, on the other hand, loops all over the place in RP.

Internally, models operate concepts, not tokens, so they detect and repeat repetitive concepts. This causes repetitions to be extremely diverse. Here are a few examples:

Overusing the same words
Starting each reply with the same intro (direct speech, character name)
Locking on a specific reply length, and refusing any instructions.
Writing replies with the same template (speech - narrative - speech - narrative)
A character keeping the same mood, or alternating it every other reply
Using the same literary devices in different replies
Same plot twists
The same (already irrelevant) background characters mentioned in every reply
And so on.

The entire course and narration of your roleplay is determined by the greeting and first few of messages. That's an extremely high-level form of repetition, also powered by in-context learning.

Repetition power depends on the distance between repeating things. The closer the repeating concepts are, the more likely it is to repeat.

If your greeting contains any claudeisms, the model will happily repeat them. If your greeting speaks for the {{user}}, the model will happily repeat that behavior, ignoring any instructions you might have in place.

Fighting the repetition

Repetition forms in the chat history. Model's "desire" to repeat itself is fundamental and will overpower any instructions you give to the model if taken too far (as you can see in the example above). For this reason, attempts to fix it without removing the chat history completely are never reliable, don't expect miracles from them. However, some of them are better than others:

Manual editing

The nuclear option is always here: you can manually edit and rewrite Claude's replies to get rid of repetition. However, it extremely tedious and feels like you're doing his job. It's only an option if the repetition is small and you caught the start of the loop in time.

Combo breakers

The most straightforward way to break the repetition is to manually spot it and make the model write an entirely different reply, breaking the combo. For example, let's assume Claude's answers are starting each reply with {{char}}'s direct speech. You can request a combo breaker message out of character:

⎗

1	<ooc>Give me description of the scene environment.</ooc>

If it's not far enough into the loop yet, it will describe the scene with an entirely different sentence structure, not using any speech, and then you can continue roleplaying. By doing so, you will increase the distance between repeating things, and the repetition will weaken.

To break the combo, you can manually change the scene, either by in-character action, or by requesting it OOC.

This method has two very obvious downsides:

it's manual and requires effort. You'll quickly get bored of it.
it's very disruptive, especially in elaborate slowburns. In most roleplays, you can't chaotically change the scenes each time you see repetitions.

Random prefill

This is like a randomized version of the "combo breaker" method above. You can use SillyTavern's {{random}} macro in your prefill to prime Claude's response in a random way, which tends to break the pattern.

Here's the reply in {{random:150,250,350}} words:
Here's the reply starting with {{random:direct speech,definite article,description,action}}:
Here's the reply that {{random:describes the scene,introduces a random event,advances the story,contains lots of speech}}:

And so on. Those random prefills can be written in a variety of creative ways. In addition, they also increase swipe variety and make the story generally more random, which might or might not be what you want.

Although this method does a somewhat decent job at the cost of making the roleplay direction less predictable, the repetition is still stronger than the prefill. It will eventually fail and Claude will always find new ways to repeat himself (especially if it's Sonnet), from sneaking the same patterns into completely different scenes to outright ignoring the prefill if he's deep enough into the loop.

Chain-of-thought planning

This is a rather heavy-handed method that can negatively influence replies and Claude's ability to cook if done wrong. The idea is to use the chain of thought to disentangle your reply from the chat history.

Make Claude generate a CoT with the plan for your reply based on the chat history, and end it with a directive like "now I will write my reply based on the plan above".
Claude then writes the reply according to the plan.

This works like this: the CoT won't repeat the chat history because it has an entirely different structure, and the reply won't be repetitive because it's generated according to the CoT, not according to the chat history. It works pretty well until it doesn't - as long as the pattern is in the context, it will eventually leak into the reply if it's strong enough.

XML formatting

According to Anthropic docs, Claude is also trained to understand XML formatting to structure its context and improve its accuracy. In particular, they advise:

Wrapping different kinds of data in your context into different tags to make it easier for Claude to separate them. For example: instructions, definitions, speech examples, and so on.
Giving meaningful names to your tags that describe their contents; this will help Claude understand what's inside.
Using XML in templates, so Claude can better understand what to put inside.
If you want to refer to any piece of context from somewhere else, wrap it in a tag.
Avoiding deep nesting of the tags (they recommend no more than 5 nested levels).

Besides the specific training given to Claude by Anthropic, XML is also an extremely widespread generic markup language intended for exactly this purpose: giving structure to text. It lets you give the model pieces of text that:

have a descriptive name telling the model what's inside
have strictly defined boundaries (start and end)
can be referenced by name from any other place in the context.

Using XML is very unlikely to hurt your outputs on models that have never been trained for XML specifically and don't benefit from additional training, because all LLMs that can code can also understand XML well. In particular, Gemini and Mistral seem to work well with it, practically like Claude.

The nature of the synthetic dataset Anthropic used to train Claude for XML structuring isn't exactly known. However, Anthropic's own usage of XML in their resources (docs, tools, courses, actual system prompts is pretty inconsistent, so it's likely that anything that looks like <...></...> should work well, regardless of the contents. You are not required to conform to the correct XML syntax, for example you can use spaces in tag names without any easily visible accuracy penalties.

Anthropic's recommended format to refer to XML tags is <tag></tag>, for example you can have the instruction formatted like that:

⎗

1	Never use the words from the <banlist></banlist>.

However Anthropic are inconsistent in that as well - some of their resources use references without closing tags, some even use it as <tag> with backticks. So the accuracy differences are likely also negligible and Claude will understand it either way.

You can refer to tags by simply mentioning their names in the prose. This is useful for characters:

⎗

1
2
3

<Max>
Definition of the character named Max.
</Max>

And it will use it every time you mention Max in your roleplay. However, this also might create a potential caveat: if your tags names are just common words, will Claude interprets those words as references to XML sections? He's usually sensible enough and knows the difference. Usually...

One potential catch with the <tag></tag> way to refer to XML sections comes from SillyTavern. Imagine you have the following section somewhere in your preset:

⎗

1	<examples>{{mesExamples}}</examples>

This will wrap the speech examples from the card into the examples tag. However if the examples are missing in the card, it will remain empty and your context will look like this: <examples></examples>. This often confuses Claude as he takes it for a reference to a non-existent (or, worse, existent) section, not a definition. He sometimes gives weird replies, like describing what that section is supposed to contain, or something like that. If you allow an empty section into the output, Claude might think it's a template to fill according to the tag name, for example outputting an empty <thinking></thinking> tag is usually enough to make him type a chain of thought inside it.

Claude understands XML attributes with meaningful names, for example <reply lang="de-DE"></reply> will make him reply in German inside this particular tag. Another use for XML attributes is the ability to refer to multiple characters at once in the cards that have them, you can enclose them into separate character tags with attributes to give them names:

⎗
✓
<character name="Name1">
</character>
<character name="Name2">
</character>

After that you'll be able to refer to <character> to imply multiple characters at once.

Although Claude probably doesn't have any training specific to the XML comments (), the model understands them well. You can give Claude explanations in comments if descriptive tag name is not enough. It's not a token waste, since both  (with or without spaces) are 1 token each in Claude's tokenizer (and probably any other competent tokenizer as well, since XML and HTML are insanely common).

Beware of going overboard with XML:

Avoid too complex nested structures if not necessary, as Claude will have harder time understanding them. Anthropic recommended nesting XML tags no more than 5 levels deep once, then removed that warning as they simplified their docs. No need to stress Claude with complex formats without getting any benefits back.

Avoid token bloat. Each tag costs its name twice + 4 tokens in brackets, and possibly 2 newlines. It's fine if you have large pieces of text inside the tag, but when you start doing shit like this:

⎗
✓
<character>
<age>25</age>
<eyes>blue</eyes>
</character>

you could trade way more tokens than necessary for unclear benefits, and also introduce unnecessary nesting. Instead, you can do it like this:

⎗

<character>
Age: 25
Eyes: Blue
<character>

or something like that. The format of character descriptions matters very little, Claude will understand you either way.

One exception is examples: Anthropic recommends wrapping each turn in multi-shot examples into its own tag.

Double reinforcement with Character's Note

This is a well-known prompting technique that exploits in-context learning to work around the lost-in-the-middle problem and make Claude much less likely to forget/ignore your card definitions. The general idea is to repeat the similar info twice: at the start and the end of the context. This works as a weakly bound 2-shot demonstration.

Creating a complementary reminder note

Suppose you have a large, detailed 2000-token definition of your character and setting. Where do you put it?

At the top of the context, into the system prompt. Here it will be easily ignored by Claude due to the lost-in-the-middle problem (top being less accurate than bottom).
Bottom of your context, into the user prompt. This way it will push the chat history up 2000 tokens, closer to the middle, a large chain of thought can push it further towards the middle, and at the end of your reply it will make the chat history another 200-500 tokens (the length of your reply) closer to the middle. Considering that in a "typical" roleplay Claude 3 has maybe about 3000 tokens (give or take, not a precise number) at the bottom that are more or less guaranteed to be accurate, your chat history will be almost or entirely left out of it. The reply will be faithful to your definitions, but incoherent relative to the history - characters will be forgetting what they just did.

The solution is to have two definitions and link them with in-context learning.

Large definition, located at the top of the context, just as usual.
Brief summary of your definition, at the bottom of the context.

The summary must contain some long-range anchors to link it with the main definitions. You can use any keywords or concepts as anchors. For example, imagine you have a list of cakes that {{char} likes in your definition. In the summary, you should mention that {{char}} likes cakes.. Since the summary is at the bottom, in the sweet spot of attention, Claude will remember it well, and the anchor in it ({{char likes cakes.) will make him more attentive to the actual list of cakes in the main definition. It's not necessary to match two anchors literally, they should match each other conceptually, just to remind Claude to pull the full info up.

Wrap your summary into a descriptive XML tag: <summary>, <memo>, <info>, <note>, <reminder> etc.

Where to put it

There's no need to keep the summary at the literal bottom, since you usually want as much of your chat history as possible to be in it. You can put it into the 4th or 5th user message as User, by using a chat injection. This feature in SillyTavern exists exactly to make tricks like this possible. Look at the average size of your inputs and the model's replies in the chat history, and also at the size of your summary. Count the tokens and try putting the summary at the top of the 3-4k token (Claude 3 Opus/Sonnet) or 4-5k (3.5 Sonnet) zone. Check your ST console and verify that your summary is injected as one of the User's messages.

The summary should not be large, since you want to leave space for a few chat history messages under it, for better reply coherency. The larger your summary, the less space you have under it for the chat history messages. Try keeping it under a few hundred tokens, 1k at most.

The best way to inject your summary is by using your card's Character's Note field. It exists exactly for this purpose, and SillyTavern doesn't let a jailbreak bypass it.

Test your stuff! Give it a few swipes, advance the story. If Claude still doesn't stick to the definitions reliably:

You might have injected your Character's Note too high (too close to the middle of the context). Decrease the injection depth.
Your anchors are bad at reminding Claude that he has the full-fledged definitions somewhere at the top.

If characters start having amnesia (taking their panties off twice, etc.), you might left too little space under your Character's Note. Increase the injection depth and/or reduce the size of your Character's Note, so more chat history messages could fit into the 3-5k tokens sweet spot of attention.

Also! The double reinforcement technique can be used for any critical info you want Claude to remember, not just for card definitions specifically. For example, it can be used for chat history summaries.

Examples

I know of two botmakies who use/used this technique on a regular basis: oniichan2210 (rentry) and CharacterProvider/XMLK. They are both cunnymakers, if you need that warning. Look for Character's Note in their cards (it's hidden on chub), and they aren't necessarily following 1:1 every rule I described above. Experiment with your own variations too.

Downsides

It can't be all that rosy, right? You have two zones of most attention, and the Character's Note eats those expensive zones away. That's why you want it to be as short as possible while still keeping the long-range anchors working. It needs actual effort and testing to work.

Pink elephant effect

The gif above illustrates what is also known as the pink elephant effect. Try telling someone to not think of the pink elephant. Normally people don't think of pink elephants or a gazillion of other weird things, but once something is mentioned, it's hard to get it out of your head and think about something else.

LLMs work in the same way: as soon as you mention something, their Overton window shifts to include the concept. This is a fundamental effect that has several consequences.

Negations

Modern LLMs are known to understand the instructions NOT to do something poorly, presumably due to the lack of the ready-to use negation results in the dataset. They become dumber while handling negations. But also just by mentioning that negative concept, you induce the pink elephant effect and introduce a small bias to the model's output, subtly focusing its attention on what you wrote and narrowing down its choices.

You don't need to worry about this while using Claude, for example by inverting your negations. But be aware that the bias is here, however subtle it might be, it subtly focuses the model's attention, and in some edge cases (for example if Claude randomly forgets the negation but not the concept), it might reveal itself.

You need more context for Claude to start cooking

Let's assume you have a typical momcest slopbot from chub.ai (one such card is literally the most popular bot of all time, you know which one), that contains this terse character definition:

⎗

1	{{char}} is a sweet Japanese woman in her late 30's. She has incestuous fantasies about {{user}}, her son.

And that's it. What would the roleplay with such a character look like? Right, PLAP PLAP PLAP. What is she even supposed to do besides plapping? Nothing, because you haven't given her any traits! It's a spherical incestuous mom in a vacuum. At this point you're basically sampling the model itself, as the only possible source for her narrative would be your input and the model's weights. She's literally Claude in a trenchcoat. Which is fine if all you want is dumb plapping, and it's also fine if you're new to this and still marveling at Claude's power. But after a while it gets stale, and you'll inevitably see it's the same all over again, because however mighty Claude can be, your input and definitions don't vary enough to trigger different routes inside his knowledge space..

Now, consider another, higher effort card:

⎗

{{char}} is a sweet Japanese woman in her late 30's. She has incestuous fantasies about {{user}}, her son.
(1500 tokens long description of her character)
(1500 tokens long description of your persona)
(2000 tokens long description of the setting, environment, side NPCs, relationships)

Do I need to state that roleplaying with a character like this would feel much less 1D?

You might wonder, where do you get so many traits if the character itself is pretty generic? The answer is, give her the generic traits, even if Claude already knows them. No matter how mundane your definition is, or how close it is to Claude's generic idea of that character, by simply mentioning these traits you induce the pink elephant effect, narrowing the choice tree down. Claude will be less paralyzed by the paradox of choice, and more likely to mention these traits in the roleplay, creating more situations with them. If your descriptions are detailed enough, you'll be amazed at how creative Claude can be without any sort of hand-holding.

Focus on the detailed appearance descriptions if you want Claude to bring the appearance into the scenes. If you ever wondered why people describe the character's seemingly useless traits like eye color, that's why. Focus on the setting, environment, other NPCs, and relationships, if you want Claude to create diverse scenes and make unexpected turns. If you have ever prompted DALL-E or any modern image generation model with an LLM as a text encoder, you know that they need long-winded descriptions to be creative, and most of them even rewrite your short ones into long ones (DALL-E does that, for example). This is the same effect in action.

But remember that you can't do that forever. Your card is still limited by the model's usable context. In a typical roleplay with Claude 3, you have a sweet spot of maybe about 3k tokens (give or take) at the bottom and the secondary spot of another 3k tokens at the top, to both of which the model pays the most attention. With 3.5 Sonnet, both are somewhere in the ballpark of 4k. Using CoT and long replies reduces it further.

Which leads us to a question: what would you rather spend this precious space on?

A huge JB with a gazillion of vague, poorly tested, conflicting instructions. Be descriptive! Move the story! Don't talk for {{user}}! Include this and that! Disobey Gricean maxims! Write like Terry Pratchett!
Actual detailed descriptions of the setting and your characters to give Claude something to latch onto and start COOKING. Plus maybe a bare minimum of instructions.

Note that this mostly applies to Opus. Sonnet 3.5 does need some instructions (not many!) to be more creative, because it's drier and duller by default.

Token pollution

Different languages and encodings

Claude is a certified polyglot savant. Technically, every output format is a "language" from Claude's standpoint:

plain English
Sumerian
C++
his poor attempt at imitating Terry Pratchett's writing style
speaking in zoomer memes or Morrowind dialogues
Chinese
alliteration, onomatopoeia
your XML summarization template
writing without using the letter o
Afrikaans
base64-encoded text
emoji

All of the above are output rules/restrictions or languages for him. He can effortlessly switch between them and combine them if you ask him to.

However, Claude is still trained primarily on English-language data and some programming languages. Support for other languages can range from great to spotty. Either way, using anything that is not plain English will inevitably make him dumber the more you wander off the beaten path.

The second problem is token bloat. Normal English text is about 3.3 characters per token on average with the Anthropic tokenizer. Other languages have other ratios, for example Russian is about twice as inefficient. Token efficiency really matters, as the size of your top attention zones is pretty limited.

However, be aware that Claude's dataset is different in each language, and the experience might be better than in English. For example, when roleplaying in Russian, there are at least twice fewer claudeisms, no rhetorical questions, and very little purple prose. However, Claude is demonstrably less creative/smart, a bit oblivious to the tone differences in Russian, and is prone to devolving into swearing and obscenities easily. Results in your language may vary significantly!

As different writing styles are also languages for Claude, switching to another writing style is also an efficient way to fight claudeisms.

If you're roleplaying in a non-English language, make sure the only non-English things in your context are:

actual roleplay text;
speech examples;
names and cultural specifics that can't be easily mapped onto English.

In other words, everything that can leak into the actual reply should be in your language. Everything else, including your card definitions, instructions, templates, etc. should be in English. This applies to writing styles too, obviously.

One niche task that you might encounter while making bots is how to prevent a story secret/spoiler from being easily readable by a player in the console (since everything will end up in the context anyway). The only way to do this is to lightly obfuscate your spoiler by keeping it in some encoding or language that is not easily readable by a human without a translator or decoder, but easily readable by the model. The following methods have been verified to not work well:

Translating the spoiler into a dead language works poorly, as Claude's support for those is usually anachronistic and flaky unless forced with a ton of instructions and CoTs. Other models are usually even worse at it.
Morse code eats up a ton of tokens, and is poorly understood by all models.
Caesar's cipher works poorly as well (and needs the exact rotation parameters to decode).

The only encodings that are known to work are:

base64. It can only be decoded without errors by the largest models like Claude and GPT-4, and will dumb them down significantly, but will waste fewer tokens.
ʇxǝʇ uʍop-ǝpısd∩. It's kind of a token waste, but can be understood by Claude well. However, it's a very light obfuscation and still can be read easily.
HEX encoding (ASCII hexadecimal code for every English character). It causes incredible token waste, but is completely transparent for any model, including the smallest ones. Due to the token inefficiency, make sure to only keep the bare minimum in it.

Alignment training

Claudeisms

If the snippet above doesn't trigger PTSD for you, you may skip this entire section.

A claudeism is anything that Claude likes to excessively repeat in certain situations. Claudeisms are not necessarily just certain idioms (which are the most noticeable and annoying ones); they can range from words to entire settings:

words (sauntering, ministrations, sashay)
idioms (mischievous glint in her eyes/smirk on her lips, this strange new world, a mix of (emotion1) and (emotion2))
sentence templates (the most frequent is (character) does (action), his (thing) is (description))
emotions (Claude likes to exaggerate them!)
character and object stereotypes (e.g. dominating males, skirts riding up)
names (Lily for a little girl, Arasaka for a corporation)
setting stereotypes ("all people disappeared without a trace a week ago" = instant postapoc with cracked pavements and ruined buildings, any attempt at a modern city setting leans into Cyberpunk 2077 with gangs and corpos with CP2077 names, etc.)

All these are claudeisms - not just specific token patterns, but entire conceptual stereotypes expressed in tokens. They appear for two reasons:

The dataset is contaminated by the old AI data. Especially RP communities like this one that used previous versions of Claude to generate their stuff, and the result got scraped into the dataset, contaminating it and creating a feedback loop.
Overfitting on random concepts. Lots of claudeisms can't be traced back to dataset contamination. Moreover, any models, not even just LLMs, have their versions of seemingly random overfitting. For example, Stable Diffusion 1.4 had been overfit on one specific Aivazovsky painting so badly that just mentioning him in the prompt would turn a cyberpunk city into a naval painting with a sailship and a hazy sun. It's not over-representation, as this painting doesn't particularly pollute the LAION dataset SD 1.4 was trained on. It's not a feedback loop either, as 100% of the data in LAION is not AI-generated (it was collected before image generation became widespread). Yet the SD version of claudeisms is still here. Even non-generative models like YOLO have their versions of "claudeisms".

This is all fixable during the training. Anthropic should have been more thorough in regularizing/preprocessing their data, but they haven't, and... we have what we have.

Most non-English languages have less claudeisms, simply because it's the English version that is contaminated the most. However they can have their own contamination of different kinds, and English claudeisms occasionally leak into them, since the dataset is mostly in English.

Suppressing claudeisms is not easy, it's never reliable, and different methods work better for different claudeisms:

words, idioms, sentence templates, figures of speech: forcing a different writing style, proofreading.
emotions, character stereotypes: detailed definitions.
setting-scale stereotypes: detailed definitions, CoT.

Suppressing claudeisms: writing styles

Writing as {{user}}

Sometimes Claude writes as your character, which can be irritating. This is dictated by Claude's training, and you can't really stop it reliably with prompting. All your instructions will be ignored, and {{user}} will eventually leak into the reply. That's why you'll be stumbling upon it occasionally, no matter what you try, and will have to fix it by hand, by swiping or editing the message.

The main rule is to never have {{user}} actions in your greeting. The greeting and the first few messages determine the entire course of the roleplay, thanks to in-context learning. If your bot acts for {{user}} in the greeting, Claude will happily learn that behavior and speak as you later, ignoring any instructions. Sometimes even describing {{user}}'s acts in the definition can leak into the roleplay.

However, never speaking as {{user}} in your greeting guarantees nothing, as Claude does what he wants.

Stopping the model from writing for you

Here are some ways that work better than others:

Basic instructions

All of these can help a bit, and assume that Claude has a role of a writer (narrator etc.) participating in the roleplay.

Avoid taking control over {{user}} in the roleplay. The most basic instruction that does 60% (totally scientific number) of the job.
End your reply as soon as there's time to pass the narrative control to {{user}}. This one tends to break when you're describing other character's actions, then Claude decides he can break rules too.

Clearly stating that the roleplay is turn-based and assigning the roles also tends to help a bit.

⎗

1 2	Participate in the turn-based roleplay with Human as your counterpart. Human controls {{user}}, while you control any other characters.

Reinforcing the instruction

You can repeat the instructions above in your prefill, rephrasing them as if they came from Claude himself. This works as a very strong version of double reinforcement. For example:

⎗

1	[Yes, I remember to not act on {{user}}'s behalf. Here's my reply:]

Reinforcing these instructions as User at the bottom of the context works as well. Just make sure to use a custom Preset manager field for the injection, not the Character's Note field in the card, as these instructions clearly belong to the JB, not the bot itself.

Gaslighting Claude to be {{char}}

If your bot is laser-focused on a single character, you can give Claude the role of {{char}} directly, for example: You are {{char}} and must act and speak according to the definitions given in <char></char>. Participate in the roleplay with {{user}} as your counterpart. This can also be reinforced in the prefill and anywhere else. This is a strong tactic, but it has a few downsides:

Everything, including the narrative, will be written from {{char}}'s standpoint. Tweaking the writing style will be harder.
OOC commands might break, because Claude is fully in-character now!
It's not possible to include several independent characters in a card like this.

It will also break eventually, as the training is still stronger than this.

Examples

You can give the model some demonstrations on how to answer. If the demonstrations are complete and don't contain speaking for {{user}}, the model will learn this behavior and not talk for you.

This is actually the best working method, because it uses the in-context learning to enforce the behavior. But like with all examples, it's labor-intensive and can easily lead to the token bloat. Instead of the full reply examples, you can give Claude examples on how to plan the reply, then output the plan, CoT-style, and build the reply to match the plan. The planning approach will have all downsides of CoT, like pushing your chat history out of the bottom zone.

Draft/rewrite

Proofreading more or less works for this use case, and even allows you to make exceptions, if you want to allow the model to speak for you in limited situations. It will fail from time to time, though.

NoAss

There's an extension called NoAss that guarantees that Claude will never speak for you, because as soon as it tries to, the generation will be stopped on a custom stopping string. This extension is one huge trade-off, though.

...and why you probably shouldn't do that obsessively

Consider that by completely preventing the model from speaking for you, you lose at least two things:

A lot of creative continuations with {{user}} participating in the answer. Claude has his hands tied, and still has to provide an answer for you. This way the replies would feel much more 2D.
Failure/takeback mechanics that are meant to fail or correct your actions if they don't conform to RP boundaries. If you have them implemented, they won't be effective as they usually imply acting as {{user}}.

Besides, it's not even possible to do that 100%, as it hits diminishing returns. It's easier to put some filter on, and fix it by hand when it leaks eventually.

The common problem is that Claude misses your character and speaks in stereotypes, but that's mitigated by detailing your persona.

The second consideration is that acting on {{user}}'s behalf is usually only irritating when it takes big steps/decisions for you. Think of it, maybe you're actually alright when it's just a flavor text around your actions you already described, and nothing more substantial.

However, Claude is prone to learning that behavior and speaking as {{user}} more often in future.

Persona

If you are playing as Anon or something like that with no description of your character, you should know it's probably a mistake. Your character is a protagonist, and your persona definitions set all reactions, relationships, and the entire course of the roleplay. Everything from this section applies to your persona definition as well!

One additional thing is that your persona heavily affects the quality of what Claude says when he's acting for your character.

Model biases

Assistant bias

General positivity bias

Causal bias

Ignoring the outliers

All models, but especially Opus and Sonnet 3.5 in particular, have a tendency to ignore out-of-place info and instructions. For example if you make a single typo or a grammatical mistake in your request, it will just ignore it. That makes sense, because a single typo in a user request shouldn't matter. However, that doesn't mean Claude forgets it, he just ignores it because it's irrelevant. In the appropriate situation, he might play along (look at the intentional mistakes in the request):

This is not limited to typos, but also true for instructions and any other data in the long context. Anthropic noticed this long ago, for Claude 2.1, during the needle-in-the-haystack testing. By prefilling a single instruction "Here is the most relevant sentence in the context:" to make Claude pay attention to the outliers they were able to make him perfectly retrieve a single out-of-place sentence from the long context:

Why is it important for you in the roleplay? Claude can ignore any instruction, any fact, any definition, anything in general, if he feels it's too out of place or contradicts his non-jailbroken alignment training (NSFW stuff for example). Keep that in mind when designing your preset and thinking on how bots will fit in, and when roleplaying. While roleplaying, Claude sometimes ignores facts and behavior of the past that don't fit the character well. When summarizing long chats with multiple scenes, Sonnet tends to omit NSFW scenes, especially when they are outliers in the otherwise SFW story.

This tendency of the model is extremely easy to overlook, as it can become forgetful due to multiple other factors listed above. The difference is that it's fixable - you can make it pay attention to out-of-place-info with a good instruction. Of course it needs to be tested!

Summarization

As shown above, Claude's context that is actually usable in RP is pretty limited, and the biggest reason is the chat history. In a long roleplay, the chat history is always the largest piece of context, and making it too long degrades the overall model performance. It's naturally repetitive, so the in-context learning makes Claude want to repeat it, instead of paying attention to definitions and instructions.

Most importantly, LLMs can't easily reason about the final state of long sequences due to their limited model capacity. So even if you have a huge chat history, Claude won't be able to actually remember any of the complex events stored as long sequences, and won't be able to use these events in the roleplay properly. He will be able to pick up simple ones, though.

For Claude to be able to reason about events, you need to collapse these long event sequences into their final state, i.e., summarize them. Thankfully, the model itself is great at this, so it can be automated and you don't need to write anything by hand. Press a button, and Claude will generate the summary or update the existing one!

Summarization enables extremely long roleplays (thousand+ messages) using a relatively small context window; 25k tokens is more than enough. It's probably the most important instrument in your roleplay.

Roleplay that uses summarization looks approximately like this:

Your max context is limited to maybe 20k tokens, give or take. And that's okay! This will actually make Claude smarter.
The chat history tail is truncated as it's not very useful anyway, and even distracting.
Truncated events are compressed by Claude into a short summary that is maybe 2-4k tokens long.
As you add new events in your roleplay, the summary is updated with new info, analyzed and automatically generated by Claude.
The summary should never grow too large, as it won't be of much use for the model if it does. You should occasionally review your summary, manually discarding/shortening the info you'll never need in your future roleplay.

"Summary" is a bit of a misnomer. Treat it as a mutable, ever-changing part of your card, specific to this particular roleplay. It is a useful twin of the fixed part of your card (that contains the initial character/setting definitions).

Summarization extension

SillyTavern has its own summarization extension installed by default (Extensions → Summarize). It allows you to:

Send your own summarization request to the LLM, causing it to reply with the summary in the format you requested.
Automate the summarization by calling it each X words or each Y messages (you don't want to use it - see below!).
View and edit your summary.
Insert the summary into the chosen place of the context.
Undo the last summarization if it's broken somehow or you don't like it.

Chose "Main API" at the top. It will use your main LLM then, and this is usually what you want.

Intuitively, you can think of this as a chat-specific scratchpad you can write anything in, much like Author's Note, but with a button that calls Claude to fill or update it automatically.

You usually want to use "classic mode" in the extension, as it sends your entire JB and card to Claude as usual, and it's the same as just entering the summarization prompt in the usual chat input box. Raw mode only sends the actual history to summarize and the prompt, so the summarization quality might suffer. Use it only if your JB forces a certain format and it can't be worked around.

The extension has a few limitations.

It doesn't support streaming, and won't work on some proxies that force streaming on due to high load.
You cannot have separate prompts for each chat.
It's limited by the max response length of the LLM (but that's okay because you never want it to be excessively long to avoid confusing the model).
It cannot be split into multiple parts, for example to implement the double reinforcement method.

If any of this is a problem, you can use the STscript version from the section below, at the cost of some learning curve.

Where do you put your summary?

The best spot in the context for your summary depends on its size. Look at this section to get an idea of the sweet spots in the context, and other stuff besides your summary that competes for that expensive token space.

If your summary is small, you can inject it into the chat history at a depth of 3-4. If it's large, keep it in the system prompt. Empirically for Claude 3/3.5, your "injection" zone that you can meaningfully use is maybe 1000-1200 tokens max. If your summary and other stuff you might keep there are larger than that, you should probably shove them into the system prompt.

If you want, you can use double reinforcement with the summary itself. However, automating this will probably involve STscript, since the extension itself can't generate the "summary of the summary".

Chunking, or WHEN to press summarize

If you attempt to summarize a huge 1000-message history in a single piece, you'll be disappointed to discover that Claude has missed most of your important events, or got them entirely wrong. The summary created that way will not be useful at all. This is a well-known problem in the LLM space, and it's solved by splitting the source into chunks.

Chunking is a surprisingly non-trivial problem for arbitrary documents. It can't be done mechanically, and automatic summarization each X words or each Y messages will never work properly. How you split your story into chunks is absolutely critical for the summarization quality.

If you shove too much stuff into one chunk, it will miss important facts.
If you summarize each roleplay message, it will list its contents, giving you an overly verbose summary.
If you summarize at arbitrary points in the story, it won't know how the current chapter ends, and will give you garbage.

In other words, you must align your chunks with the logical breakpoints in the story. Logically divide your roleplay into scenes by planning a bit ahead (or chapters, or arcs - call them however you like), and make Claude update the summary exactly at the end of each scene. You fought the bandits? Press summarize. Made a campfire to rest? Press summarize. Resumed your travel? Press summarize. Entered a city? Press summarize. Always check what it generated, of course.

Don't take it as gospel, experiment with your own chunking.

Manually editing the summary

Ideally, you should review your summary and manually edit it if Claude doesn't do what you want or introduces too many/too few details. If it does it consistently, tweak your prompt.

Remove everything that you won't ever need. You sure you won't visit that location anymore? Remove. You sure you won't need that NPC in your story anymore? Remove.

Summarization prompt example

This is an example of a good summary prompt that uses a template. Don't use it verbatim! Change it and add/remove modules as you see fit your particular roleplay (see "caveats and best practices" below).

⎗
✓
<!-- OOC REQUEST: Pause the roleplay and step out of character for this reply.
Analyze the roleplay history and answer a few questions about it in English, filling the fact sheet template below.
Add, remove, and update the existing facts as appropriate. -->

<npcs_facts>
<!-- include every NPC interacted with, besides {{user}}.
This info should be updated if new facts are available.
Do not list NPC actions here, only facts -->
- X (<!-- role -->): <!-- Appearance, speech manner, 3 personality traits. -->
- Y
- Z
</npcs_facts>

<npcs_mentioned>
- <!-- Name -->: <!-- Role or N/A -->
- 
- 
</npcs_mentioned>

<visited_locations> <!-- Only list descriptive facts about the location itself. -->
- <!-- title --> : <!-- description in at least 3 sentences -->
- 
- 
</visited_locations>

<secrets> <!-- "No secrets yet" if no secrets -->
- <!-- secret --> (kept secret by <!-- char --> from X)
- 
- 
</secrets>

<current_ relationships> <!-- only include recurring characters, omit minor and intermittent ones -->
- <!-- current long-term relationships between characters X and Y in at least 3 sentences. -->
- 
- 
</current_relationships>

<planned_events>
- <!-- each event in at least 3 sentences -->
- 
- 
</planned_events>

<current_quests>
- 
- 
- 
</current_quests>

And this is the injection template for the prompt above, required for this to work:

⎗
✓
<roleplay_summary>
<!--This is what happened so far during the roleplay, and the current state of the scene.
The information below takes priority over character and setting definitions.-->
{{summary}}
</roleplay_summary>

Note the {{summary}} macro - it will insert your generated summary at this position.

You might wonder, why this prompt has no actual history of events happened in the roleplay? Read the best practices below.

Caveats and best practices

First things first:

Determine the average length of your scenes.
Limit your max context size so the chat history is truncated at 1.5-2 times this size, to have a reasonable safety margin. Avoid cutting your current scene in half!

The summary is generated automatically by the model, and you need to craft a proper prompt, ensuring it gives you the result you want. Ask yourself one simple question:

WHAT LONG-TERM INFO DO I WANT MY CHARACTERS TO TRACK AND/OR BE ABLE TO RECALL DURING THIS PARTICULAR ROLEPLAY?

These are the things you should ask the model to write for you in the summary! Do not ask for too much detail; remember that realistically you only have 2-4k tokens total for the entire summary, if you want it to remain useful for the model.

The typical things to ask in your prompt are:

Characters interacted with, that you'll need in the future (your "social network").
Characters that were mentioned and have the potential to be interacted with in the future.
Current and potential objectives or quests.
Current relationships and long-term health of the characters.
Secrets.
Additional facts about the characters that complement or override the facts in your bio.
Locations you need to keep in the context because you plan to return to them eventually, and maybe routes between them.

Past events don't meaningfully affect most roleplays, unless it's a really major story turn. A character reminiscing about the past is also rare. Don't push for "past events" too hard; it's overrated. Consider not using that section at all as it's niche and the most bloated. If you're sure you need the event history, you can add something like this to the prompt above:

⎗
✓
<memorable_events>
- <!-- what happened during the event in at least 3 sentences -->
- 
- 
</memorable_events>

If you want to memorize events, make sure they can be recalled independently, so Claude can just pull an arbitrary event to use it, without unwinding a long sequence.

Never ask for a sequence of events in your prompt, as the model can't easily reason about the outcomes of long sequences. If you ask for a sequence, it will be just a worse version of your chat history. Instead, you should ask about separate memorable events, facts, and the current state of your roleplay: visited locations, inventory, current relationships, health, etc.

Never ask for the actual summary in the summarization prompt, and never mention that word at all. Summarization is one of the typical uses of LLMs, and Claude is trained to give summaries in a specific format, which is probably not what you want.

Certain JBs have a complex reply structure (chain of thought, stat trackers, etc.) that might interfere with the format of your summarization prompt. If it keeps happening, you have several options:

Convince it to provide you with the clean output using OOC commands.
Use the "raw" mode in the summarization extension.
Temporarily switch to a simple preset/JB - manually, with a Quick Reply action, or in any other convenient way.

Your summary can sometimes leak into your roleplay, i.e., the model might try repeating the concepts from it, or the structure, or even repeat it verbatim. This is especially troublesome because most of the summary is usually generated by Claude himself, and he will happily reinforce any slop he writes. However, if your summary is terse, it will usually lack Claudeisms. You can usually avoid leaks by constructing a better template that is clearly marked with XML tags and explanations that it shouldn't be repeated verbatim. Memorable events tend to leak more than others, even when not summarized in an order-dependent sequence.

If you keep the summary to track secrets, the best way to keep them from leaking (characters casually mentioning supposed secrets as if they were known) is to mark who keeps the secret from whom.

Make sure you instructed Claude to override card definitions with the info from your summary. You may be tempted to use the system prompt mechanism to make it take priority over the definitions, but in practice, this doesn't work well.

Do not use the summary to track fast-changing stuff like the inventory. Stat trackers in your reply are much better suited for that.

Advanced summarization with STscript

ST Regular expressions and formatting

SillyTavern can replace any text in your chat history using a regular expression match. It's typically used for:

Deleting any unwanted part of your reply (for example, chain of thought).
Hiding any text while it's being generated.
Only removing the text from the history when it's X levels deep or more (commonly done with stat trackers).
Highlighting claudeisms.
Wrapping a piece of text into an XML tag for further processing.

Regular expression is a powerful way to search for arbitrary strings inside the text. I won't be describing regular expressions themselves as it's entirely beyond the scope of this rentry. You should study them on your own. But I'll give you a cookbook because regexes always blow people's minds.

Test your regexes! Use https://regexr.com/ or https://regex101.com/ for that. Switch to JavaScript-flavored regular expressions, as various flavors differ a bit.

SillyTavern scans all messages in your chat after generating the reply (and also after editing it manually if the box above is ticked), looks for a string that matches the regex you set, and replaces it with any other string you choose. If the replacement string is empty, it will replace the match with an empty string, i.e., remove it.

The interface leaves much to be desired. These two buttons commonly cause confusion:

Both empty: The regex replacement is done directly to the chat history, permanently (i.e., if you remove something, it's gone). Once the replacement is done, the unmodified message won't be available anymore.
One or both are active: The chat history will remain intact behind the scenes (in the storage), but either the display (what you see) or the actual prompt sent to Claude (check your console for that) will be altered, as long as the regex is enabled. If you disable the regex, replacement is no longer made and the chat history will be displayed and sent to the model without modifications. The entire unmodified message will still be stored and visible when editing it.

If you want to debug your regular expressions, don't leave both boxes empty. If either of those two are ticked, you'll still be able to see your text to be replaced while editing the message.

These boxes affect where and when to apply your regex:

User Input: Do the replacement in your messages.
AI Output: Do the replacement in Claude's messages.
Run on Edit: Do the replacement each time you edit a message.

These values affect how deep it will scan your chat history:

Min. depth: regex will only be applied to messages older than X.
Max. depth: regex will only be applied to messages newer than X.

Depth starts from 0 here. You can combine both (though I've never seen a reason to do so).

Regex cookbook

Remove the tag named <blah> from the reply

regex to match: <blah>[\s\S]*<\/blah>
replacement string: empty. ("replace with nothing" = remove)

Note that the tag will still be visible during the generation, because the regex must see the closing </blah> to register a match, and it's not generated yet.

Hide the tag <blah> from view during the reply generation

Here you'll have to use the negative lookahead, denoted by (?!...).

regex to match: <blah>(?![\s\S]*<\/blah>)[\s\S]*
replacement string: empty.
Alter Chat Display: enabled. (you want to hide it from view, not remove it completely)

This regex will match anything that starts with <blah>, but only if it doesn't end in a matching </blah>. This way, your tag will be only hidden during the generation.

Hide the tag <blah> from view completely, both during and after the generation, and stop it from being sent to the model

Use the disjunction operator | (logical OR operator) to combine two regexes from the above.

regex to match: <blah>(?![\s\S]*<\/blah>)[\s\S]*|<blah>[\s\S]*<\/blah>
replacement string: empty.
Alter Chat Display: enabled.
Alter Outgoing Prompt: enabled.

This regex will match either the first part (that handles the case when the tag is still being generated) or the second part (that handles the fully generated case). The logic of both parts makes them mutually exclusive, so there's no overlap.

Note that you can combine a ton of regular expressions into one this way, creating an absolute contraption that will handle all your cases.

Only leave the tag named <blah> in the most recent message

Commonly used in stat trackers. The regex is the same as the previous one, but the minimum depth field is used.

regex to match: <blah>[\s\S]*<\/blah>
replacement string: empty.
Min. depth: 1

Remove any text that is NOT inside the tag named <reply>

For this one you'll have to use capturing groups, delimited by ().

You need to capture everything before the desired text into group 1, the text itself into group 2, and everything after the text into the group 3. Then replace the match with $2 (SillyTavern's internal macro for the capturing group 2).

regex to match: ([\s\S]*<reply>)([\s\S]*)(<\/reply>[\s\S]*)
replacement string: $2

Wrap the most recent user input into the <stated-narrative> tag

Useful if you want a way to refer to the most recent input, for example to be able to analyze it with some instruction.

regex to match: [\s\S]* (anything)
replacement string: <stated-narrative>{{match}}</stated-narrative>
User Input: enabled.
AI Output: disabled.
Max depth: 1.

Note the usage of SillyTavern's internal {{match}} macro, which is the string found through the regex. This regex will find the most recent (Max Depth: 1) user input (User Input: enabled, others disabled), and replace it with the same string but with the added XML tag.

Formatting tricks

SillyTavern can render a strictly limited subset of Markdown and HTML in the messages. It has a few useful quirks.

Hidden text

You can output hidden text without the use of regular expressions:

 (cost: 1 extra token for ) will hide the text after it's generated, but won't hide it during the generation, when the trailing --> is not generated yet.
[](text) (cost: 2+1 extra tokens) will do the same. This is an empty markdown URL with an empty link.
In the past, the text written [](#'like this') (i.e. an empty markdown link pointing to an empty section with the title set to the text) used to be the best shortcut to output the text hidden both during and after the generation, but it doesn't seem to work anymore.
<div hidden>text</div> (cost: 4+3 extra tokens) will hide the text both during and after generation. However, the leading tag <div hidden> (4 tokens long) will flash momentarily during the streaming.

The text will still be in the chat history, it just won't render normally. You can see it when editing the message. You probably want to avoid polluting the chat history with the auxiliary invisible text, as it's a token bloat that can also induce repetition, and you'll inevitably need a regex for it. However, for tiny text like lorebook anchors it can be left as is.

Collapsible text

You can create a collapsible element by putting some text inside a <details></details> tag. If you also include a <summary>name</summary> somewhere inside it, your name will be shown as the name of the collapsible.

Stat trackers

As you continue your roleplay, Claude tends to lose focus on the characters, objects, and topics that weren't mentioned for long enough, because the messages with them are slowly drifting towards the middle of the context where his recall is spotty. He might forget a clothing item if you haven't mentioned it in a while, or your car left on the parking lot a few messages ago. In fact, most of the middle of the context will likely be always filled with the messages from your chat history.

Sure, you can use summarization to keep track of such items, but summarization is usually updated once per scene. It's better suited as a long-term memory. Instead of that, you can track the most important entities in the reply itself, making the model output their status after the reply is written.

Creating a stat tracker

The most reliable way to force Claude into using a fixed format is to demonstrate it using a one-shot template. Here's a simple template of a reply with the statbox. It's typically enough to drop it somewhere in the context and Claude will pick it up, unless you have some conflicting instructions.

⎗
✓
<reply-template> <!-- mandatory format for the roleplay reply -->
<!-- reply  -->

<stats> <!-- generate if unknown -->
Time/Date: hh:mm AM/PM, Month DDth, YYYY
Weather: X
Active characters: X, Y, Z
</stats>
</reply-template>

Claude will fill in the placeholders and output something like this in each reply:

⎗

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nunc eget justo vestibulum, varius sapien id, imperdiet libero.

<stat>
Time/Date: 11:44 PM, August 26th, 2024
Weather: Clear
Active characters: John, Jane.
</stat>

Stat tracker has to be included after the reply, there's not much point to put them on top, before the reply is ready. Besides, this way they don't push the chat history needed for reply generation out of the zone of immediate attention.

You probably wouldn't want to have the stat tracker in every message. In theory you can do that to track the state of each message and improve the recall for the selected items, but in reality it's too much token bloat, and they also form an unnecessarily strong pattern, activating the in-context learning and making Claude stuck, unable to update some values, especially in slowburn roleplays. That's why you usually want to only keep a single stat tracker in the most recent Claude's message.

SillyTavern's way to manipulate the text in your messages is regular expressions. Let's add a regex that would strip the statbox from every message older than the most recent one:

regex to match: <stat>[\s\S]*<\/stat>
replacement string: empty. (replace with empty = delete)
Min. depth: 1. (only apply to the messages older than the most recent one)
Affects AI Output: enabled. (applies to Claude's messages)
Alter Outgoing Prompt: enabled. (you don't want to remove it from the chat history completely, just block it from being sent to the model)

If you also don't want to see your stat tracker, add another regex:

regex to match: <stat>(?![\s\S]*<\/stat>)[\s\S]*|<stat>[\s\S]*<\/stat> (composite regex that works both during and after reply generation)
replacement string: empty.
Affects AI Output: enabled.
Alter Chat Display: enabled. (affects only display - you have another regex that handles the prompt sent to the model API)
Alter Outgoing Prompt: disabled.

Choosing the items to track

Look at your roleplay and try to determine characters, objects, topics that are:

Really critical to you, and you notice inconsistencies in them easily.
Mentioned rarely, so they tend to drift towards the middle of the context and get lost there.

Here are some items that typically have spotty focus and are good candidates to track:

side characters currently active in the scene (besides {{user}} and {{char}} who are usually present at all times).
location, sublocation.
character inventories.
item in hands.
character clothing, piece by piece.
character positions, especially for the side characters.
weather and time. By the way, the time in minutes is really hard to advance properly with stat trackers, because Claude is not very good at determining the amount of time that has passed in the reply. If it's critical for you, you'll probably need a chain of thought for that (and even then it likely won't be reliable).
the vehicle and its state, during a road trip or in a road movie style roleplay.

Some JBs track emotions this way. I believe this is a mistake, they don't need to be tracked, and Claude does it better during spontaneous cooking. You might have a different idea, so feel free to experiment with it.

Finally, don't be afraid to manually update your tracker if Claude gets anything wrong (or maybe you just want to change something).

Downsides

Stat tracker is a rather heavy-handed mechanism. It takes space in the expensive bottom attention zone that is already contested by other mechanisms. It can inadvertently trigger in-context learning and induce unnecessary repetition, especially if certain items linger in the tracker for too long. The loop usually comes not from the tracker itself, but from the reply text that follows it.

If possible, do not use stat trackers at all. Use them only if your roleplay is consistently losing focus due to the topics drifting upwards.

You can't shove absolutely everything into your stat tracker since the token space at the bottom is very limited, so count your tokens! Especially if you're using emojis, which can take surprising amounts of them. Only include the bare minimum of what you need.

The most obvious shortcoming of a stat tracker is a fixed list of items to track. It cannot dynamically select items to track as it has no foresight.

How 2 Claude

Your bot is an illusion

Human and Assistant

System prompt

SillyTavern and Claude context order

Prefill

Chat history injections

Making sense of it all

Model accuracy = attention

Model capacity

Long context is an illusion

Lost in the middle

Overall degradation at longer contexts

Chat history effects

Reply length effects

Cost in the actual roleplay

Wait, is 200k tokens really just smoke and mirrors?

Tokenization

Samplers (Temperature, Top K, Top P)

How samplers work

Default settings

Temperature vs Top P/Top K

When to use each sampler in practice

In-context learning

Speech/narrative examples

What to put into examples

Example regularization

Where to put the examples

How many examples to give

Example formatting: List

Example formatting: Long-range anchors

Example formatting: Anthropic

If the examples are so great, why don't I see them being used in bots often?

Templates

Repetition

Fighting the repetition

XML formatting

Double reinforcement with Character's Note

Creating a complementary reminder note

Where to put it

Examples

Downsides

Pink elephant effect

Negations

You need more context for Claude to start cooking

Token pollution

Different languages and encodings

Alignment training

Claudeisms

Suppressing claudeisms: writing styles

Writing as {{user}}

Stopping the model from writing for you

...and why you probably shouldn't do that obsessively

Persona

Model biases

Assistant bias

General positivity bias

Causal bias

Ignoring the outliers

Summarization

Summarization extension

Where do you put your summary?

Chunking, or WHEN to press summarize

Manually editing the summary

Summarization prompt example

Caveats and best practices

Advanced summarization with STscript

ST Regular expressions and formatting

Regex cookbook

Formatting tricks

Stat trackers

Creating a stat tracker

Choosing the items to track

Downsides

Chain of thought

Lorebook

NoAss

Placebo

Warning