Building LLM Gameplay mechanics with Guidance

Original: https://medium.com/@mikudev/building-llm-gameplay-mechanics-with-guidance-0bc3d52e52e9
Author: mikudev

Large Language Models (LLMs) have proven to be powerful tools and have been reshaping business processes in recent years.
A big problem I see is that many products that integrate LLMs over-rely on chat completions as their default prompt architecture, completely ignoring the benefits LLMs offer when guided through plain text completion, even with instruct-tuned models.

Guidance

There's an old Microsoft library called guidance. It allows you to enforce a format for the token generation of a given prompt.
For example, you can define a prompt as:

These are three possible titles for a story about llamas:
Funny title: {{GEN funny_title max_tokens=20}}
Mysterious title: {{GEN myst_title max_tokens=20}}
Dramatic title: {{GEN drama_title max_tokens=20}}

Then, you can run inference using the guidance library on top of the LLM, and it will return the value for each title.
First, it will generate funny_title; then it will replace the placeholder with the generated value, generate myst_title, and do the same for drama_title.
This is also useful for enforcing a JSON format like:

{
  "name": "{{GEN character_name max_tokens=4}}",
  "personality": "{{GEN character_personality max_tokens=100}}",
  "outfit": "{{GEN character_outfit max_tokens=100}}",
  "weapon": "{{GEN character_weapon max_tokens=3}}"
}

This is supported by most guidance-like library implementations, even if you're calling the model through an API rather than loading it with the guidance library itself.
But what happens if you want to restrict the list of options that can be generated? For example, for weapon we might want to only generate valid weapons like sword or axe.
This is where Select and logit_bias come into play.

Select and logit_bias

In a nutshell, logit_bias is a map of token IDs to bias values that raise or lower the likelihood of those tokens being generated. It is sent to the LLM along with the text-generation request, and some OpenAI-like endpoints support it.
With the Select feature from guidance, you define a list of phrases or words that will be the only ones the LLM can generate at that position of the prompt. It leverages logit_bias, generating token by token and applying the appropriate biases.
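To make this concrete, here is a rough sketch of what a single constrained step could look like if you did it by hand against an OpenAI-compatible completions endpoint: the allowed tokens receive a large positive bias, so the model can only pick among them. The token IDs, endpoint, and model below are illustrative placeholders, not values computed by any particular library.

import OpenAI from "openai";

// OpenAI-compatible endpoint with logit_bias support (e.g. a local vllm server).
const client = new OpenAI({ apiKey: "sk-EMPTY", baseURL: "http://localhost:2242/v1" });

// Hypothetical token IDs for the first tokens of "sword" and "axe"; in practice
// they come from tokenizing each allowed option with the model's tokenizer.
const allowedTokenIds = [12345, 23456];

const completion = await client.completions.create({
  model: "mistralai/Mistral-7B-v0.1",
  prompt: 'RPG Game Character specification\n"weapon": "',
  max_tokens: 1,
  // Push the allowed tokens to (near) certainty; everything else is left unbiased.
  logit_bias: Object.fromEntries(allowedTokenIds.map((id) => [id, 100])),
});

console.log(completion.choices[0].text); // the first token of one of the allowed options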

What about function calling?

In the current state of the art, there's a tendency to go for "function calling" finetuning, which is a good replacement for guidance; but for most small models, this is not an option.
In function calling, you define functions that the LLM can use, and it returns call signatures for them.
Also, it's worth noting that "tool use" or "function calling" is more focused on building agentic behaviour.
For more information about function calling, check OpenAI's docs on it.
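For comparison, a minimal function-calling request with the OpenAI Node SDK might look like the sketch below; the give_item tool and its schema are hypothetical examples made up for illustration.

import OpenAI from "openai";

const client = new OpenAI();

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: "The monster is dead, reward the player." }],
  tools: [
    {
      type: "function",
      function: {
        name: "give_item", // hypothetical game action exposed as a tool
        description: "Give an item to the player",
        parameters: {
          type: "object",
          properties: { item: { type: "string", enum: ["sword", "axe", "potion"] } },
          required: ["item"],
        },
      },
    },
  ],
});

// The model answers with a structured tool call instead of free text.
console.log(response.choices[0].message.tool_calls?.[0]);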

MikuGG's Guidance package

We use select, and thus logit_bias, a lot in miku.gg to power features like emotion inference and narration conditions.
We wanted to be able to plug guidance into an API and, while there are several alternatives on npm like guidescript or salutejs, none of the existing solutions supports select. That's why we created our own public package, @mikugg/guidance. To make it work, you need:

  • An openai-like endpoint with logit_bias support
  • A JavaScript or TypeScript tokenizer for the LLM you are using.

For example, you can generate JSON data with a template like the following:

import * as Guidance from "@mikugg/guidance";

// Tokenizer matching the model served by the endpoint below.
const tokenizer = new Guidance.Tokenizer.MistralTokenizer();
// OpenAI-like endpoint with logit_bias support (e.g. a local vllm or aphrodite-engine server).
const generator = new Guidance.TokenGenerator.OpenAITokenGenerator({
  apiKey: "sk-EMPTY",
  baseURL: "http://localhost:2242/v1",
  model: "mistralai/Mistral-7B-v0.1",
});
const templateProcessor = new Guidance.Template.TemplateProcessor(
  tokenizer,
  generator
);
const result = await templateProcessor.processTemplate(
  `RPG Game Character specification
  {
    "name": "{{name}}",
    "job": "{{GEN job stop=",}}",
    "armor": "{{SEL armor options=valid_armors}}",
    "weapon": "{{SEL weapon options=valid_weapons}}",
    "pants": "{{SEL pants options=valid_pants}}"
  }`,
  new Map<string, string[] | string>([
    ["name", "Rudeus"],
    ["valid_armors", ["plate", "leather"]],
    ["valid_weapons", ["axe", "mace", "spear", "sword", "bow", "crossbow"]],
    ["valid_pants", ["leather_jacket", "leather_shorts", "hat"]],
  ])
);
console.log(result.entries());

Use cases for guidance in MikuGG

MikuGG is a platform for generative visual novels. We use LLMs both for dialogue generation and to power the gameplay mechanics of the system.
Specifically, we use guidance for:

  • Inferring character reactions
  • Knowing if a condition is met in the narration
  • Suggesting possible new scenarios

Emotion system

Most regular chatbot platforms that display character sprites tied to emotions rely only on classifier models like nateraw/bert-base-uncased-emotion or joeddav/distilbert-base-uncased-go-emotions-student.
The problem is that you first need to generate the character response and then deduce the emotion/reaction from it.
We went for a different approach: we first infer the reaction and then generate the text response.
For example, a prompt would be:

Anna's Reaction: angry
Anna: {{GEN anna_response max_tokens=100}}

This allows us to influence the tone of the character response with the reaction from the prompt.
We use select to restrict the reaction to a fixed list of possible values.
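A rough sketch of how such a prompt could be expressed with the template processor from the earlier example; the reaction list, attribute names, and result accessor are illustrative assumptions rather than the exact MikuGG implementation.

const emotion = await templateProcessor.processTemplate(
  `Anna is talking with the user.
  User: You broke my favorite mug!
  Anna's Reaction: {{SEL reaction options=reactions}}
  Anna: {{GEN anna_response max_tokens=100}}`,
  new Map<string, string[] | string>([
    ["reactions", ["angry", "sad", "happy", "surprised", "embarrassed"]],
  ])
);
// The selected reaction conditions the response generated right after it.
console.log(emotion.get("reaction"), emotion.get("anna_response"));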

MikuGG's response re-rolls

A "regenerate response" for MikuGG is not just resending the prompt to the LLM, but instead, we randomly set a different reaction and then generate the character's response. This give us more variability on the regenerated responses.

Narration conditions

Another useful feature we have in MikuGG is the ability to set conditions that, when met, trigger an action of some sort.
Examples:

  • When someone suggests going to the park, suggest changing the scene to the park.
  • When the monster is defeated, give the user a sword item.
  • When the mage agrees to join the party, add the mage to the user's party.
  • If Roxy breaks something in the house, give the user the "homewrecker" achievement.

To evaluate whether a condition has been met, we ask the LLM directly "Has X condition been met?" and use select over the set ['Yes', 'No'] of possible answers.
If the answer is "Yes", we trigger the action.
This is very useful for making the gameplay more immersive.
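A sketch of such a condition check with the same template processor; the narration variable, result accessor, and triggered action are hypothetical application code.

const check = await templateProcessor.processTemplate(
  `Narration so far:
  {{narration}}
  Question: Has the monster been defeated?
  Answer: {{SEL answer options=yes_no}}`,
  new Map<string, string[] | string>([
    ["narration", latestNarrationText], // hypothetical variable holding the story so far
    ["yes_no", ["Yes", "No"]],
  ])
);
if (check.get("answer") === "Yes") {
  giveItemToUser("sword"); // hypothetical game action
}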

Scene Generation

This is a simpler one. We receive a small prompt from the user and ask the LLM to generate a structured description consisting of a scene prompt, a background prompt, and a music prompt.
We then use embedding vector similarity to retrieve the closest matching background and music for the new scene.
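A rough sketch of the structured scene description, again using the template processor from the earlier example; the field names, token limits, and userInput variable are illustrative assumptions.

const scene = await templateProcessor.processTemplate(
  `User request: {{user_prompt}}
  Scene prompt: {{GEN scene_prompt max_tokens=60}}
  Background prompt: {{GEN background_prompt max_tokens=40}}
  Music prompt: {{GEN music_prompt max_tokens=20}}`,
  new Map<string, string[] | string>([["user_prompt", userInput]]) // userInput comes from the player
);
// background_prompt and music_prompt are then embedded and matched against
// the asset library with vector similarity (not shown here).
console.log(scene.entries());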

Issues with logit_bias and guidance

While guidance and logit_bias offer powerful capabilities for controlling LLM outputs, there are several challenges and limitations to consider when implementing these techniques. The following subsections explore some of the key issues that developers may encounter.

logit_bias support

One of the primary challenges when working with logit_bias is its limited support among AI endpoint providers. This feature is not widely available, which can restrict the options for developers looking to implement fine-grained control over token generation.
Additionally, using logit_bias typically requires working with prompt completions rather than chat completions. This distinction is crucial, as many modern AI interactions are built around chat-based interfaces.
Currently, the most popular engines that support logit_bias are vllm and aphrodite-engine (the one I use).

Tokenization functions

Guidance's select function requires us to tokenize the prompt for every candidate option, which gives us a few challenges:

  1. Time-consuming tokenization: The guidance library requires tokenizing every completion possibility. This process can be extremely time-consuming, especially if the tokenization function is computationally expensive.
  2. Full string tokenization: Tokenizing only the completion option is insufficient to determine how the prompt tokens should continue. Instead, we need to tokenize the entire completed string, which adds to the computational overhead (see the sketch after this list).
  3. Limited tokenization support: When using a JavaScript library, we face limited support for tokenization functions. While some implementations exist for models like LLaMA3 and Mistral (even for frontend code), the options are not as extensive as in other languages.
  4. API limitations: Using an API for tokenization is often not feasible due to the high number of queries required, which can lead to rate limiting or excessive costs.
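To illustrate point 2, here is a toy sketch with a made-up vocabulary; real BPE tokenizers behave similarly at a much larger scale, merging characters across the boundary between the prompt and the option.

// Toy greedy tokenizer over a tiny, made-up vocabulary (illustration only).
const vocab = ['"', "weapon", '": "', '": "s', "sword", "w", "o", "r", "d"];
function toyTokenize(text: string): string[] {
  const tokens: string[] = [];
  let rest = text;
  while (rest.length > 0) {
    // Pick the longest vocabulary entry that prefixes the remaining text.
    const match =
      vocab.filter((t) => rest.startsWith(t)).sort((a, b) => b.length - a.length)[0] ?? rest[0];
    tokens.push(match);
    rest = rest.slice(match.length);
  }
  return tokens;
}

const prompt = '"weapon": "';
const option = "sword";
// Concatenating per-piece tokenizations does not match tokenizing the full string,
// so every candidate option requires tokenizing the whole completed text.
console.log([...toyTokenize(prompt), ...toyTokenize(option)]); // ['"', 'weapon', '": "', 'sword']
console.log(toyTokenize(prompt + option));                     // ['"', 'weapon', '": "s', 'w', 'o', 'r', 'd']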

Token-based API pricing

The guidance prompt method can get very expensive with token-based API pricing models. This approach requires querying different parts of the prompt separately, which can lead to increased costs as you're essentially paying for the same prefix multiple times. While this can be optimized when running your own inference server by caching the calculated prefix, it becomes problematic and inefficient when using external APIs with per-token pricing.
To run guidance efficiently, the most viable option is to operate your own inference server. This allows for caching and optimization of prefix calculations, making subsequent queries faster and more cost-effective. Without access to your own inference server, there are few options to optimize costs when using guidance with token-based API pricing models. This economic reality pushes developers towards local inference solutions for cost-effective implementation of guidance-based systems.

Probability traps

When using guidance and logit_bias for token selection, we can encounter "probability traps". These occur due to the way probabilities are calculated and compared during the token generation process. The primary issue is that the prefix of one option might have a higher probability than the prefix of another option, leading to a premature selection. However, the model might have intended another word with the same prefix as the first option, but with a lower initial probability. This alternative word could have a higher probability as a whole, resulting in a suboptimal selection if only prefixes are considered.
To address these issues, several solutions can be implemented. One approach is to modify the code to calculate the computed probability of all complete words or phrases, rather than just their prefixes. Another solution is to implement look-ahead mechanisms that consider potential completions of prefixes before making a final selection. Additionally, setting minimum probability thresholds that must be met before a selection is made can allow for more comprehensive comparisons. These strategies aim to ensure that the selected option truly reflects the model's intended output.
Example of the problem:
Consider a scenario where we're using guidance to select between two options: hamburger and knife in response to the question "What item do I need for the fight?" The token probabilities, based on syllable-like tokenization, might look like this:

"ham": 0.6
"bur": 0.3
"ger": 0.1
"hamburger": 0.018 (0.6 * 0.3 * 0.1)
"knife": 0.4

In this case, the AI model might initially think of hammer as a potential weapon, leading to a high probability for the ham token. However, hammer is not in our list of allowed options. If we only consider the first token, we would select hamburger because it's the only option that matches the high-probability ham prefix.
However, when we look at the full word probabilities, we see that knife actually has a much higher overall probability (0.4) compared to hamburger (0.018). This makes sense in the context of the question about needing an item for a fight.
This example demonstrates how prefix-based selection can lead to absurd choices in context-sensitive scenarios. By implementing full-word probability calculations or look-ahead mechanisms, we can make more accurate selections that better reflect the model's true preferences and the context of the query. In this case, such mechanisms would allow us to correctly select knife as the more appropriate answer, avoiding the trap of being misled by high-probability prefixes that lead to contextually inappropriate selections. However, this approach is a bit more expensive to run.
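One way to avoid the trap is to score each allowed option by the probability of its full token sequence instead of only its first token. Below is a minimal sketch of that idea, assuming a hypothetical logProbOfToken helper that returns log P(token | prefix) from the backend (for example, via a completions call with logprobs enabled); it is not the exact algorithm used by any particular guidance library.

// Score an option by summing the log-probabilities of all of its tokens.
async function scoreOption(
  prompt: string,
  option: string,
  tokenize: (text: string) => string[],
  logProbOfToken: (prefix: string, token: string) => Promise<number>
): Promise<number> {
  let prefix = prompt;
  let total = 0;
  for (const token of tokenize(option)) {
    total += await logProbOfToken(prefix, token); // log P(token | prefix)
    prefix += token;
  }
  return total; // log-probability of the whole option, not just its first token
}

// Pick the option with the highest full-sequence probability.
async function selectOption(
  prompt: string,
  options: string[],
  tokenize: (text: string) => string[],
  logProbOfToken: (prefix: string, token: string) => Promise<number>
): Promise<string> {
  const scores = await Promise.all(
    options.map((o) => scoreOption(prompt, o, tokenize, logProbOfToken))
  );
  return options[scores.indexOf(Math.max(...scores))];
}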

Conclusion

While still a flawed system, guidance and logit_bias prompting can help us a lot in achieving the structured outputs that let us integrate LLMs into more workflows.
Particularly for gaming, I've seen several amazing Sillytavern and RisuAI cards that try to implement similar mechanics, but they rely too much on the model giving structured output on its own, meaning they can only use GPT-4-level models. This guidance tool allows us to explore similar implementations with open-source and less expensive models, as I'm already doing for some of miku.gg's mechanics.
