Introduction to LLMs
Whether we are talking about Mythomax, Mistral, GPT-4, Claude, or any other model, they are all LLMs. It may look a bit complicated, but it's not rocket science. If you think it's too hard, that you are too "dumb", or that it's only for experienced users... No, it's not. For reference, I've been using LLMs for a little more than a year, but I only really got into it (as in, learning how things work behind the scenes rather than just the basics) about six months ago, with no prior experience beyond that. Everybody can learn, as long as they want to.
Still, it will be technical, even if I tried to make it as simple as possible. If you only want some answers about roleplaying, the fourth part will probably be enough, although I recommend reading everything.
What is an LLM?
Let's start with the basics. LLM stands for Large Language Model. Despite all the buzz about AI, we're not quite at the point where machines think exactly like humans. What we commonly use for chatting or role-playing is actually an LLM, a sort of supercharged auto-complete tool. It might seem like a human wrote the responses you get, or at least something as intelligent as a human, but it's really just a clever mix of statistics, predictions, data, and context.
How does an autocomplete work?
Before we dive in, let's understand what an autocomplete is. It's a tool that predicts the next word in a text. Think of your phone's keyboard suggesting words as you type: that's autocomplete in action. I'll start by explaining n-gram models and how statistics come into play with autocomplete. While more complex models power modern autocompletes, understanding these basics will help you grasp how LLMs work later on.
N-gram
In a way, n-gram models are the foundation, and the ancestors, of autocomplete. The number n represents the number of words considered to predict the next word, plus one: an n-gram model looks at the previous n-1 words, and only those words, in that exact sequence. In theory, n could be as large as you want, but large values quickly become ineffective (I will address this later). Below are two examples, with bigram and 4-gram models, but there are more variations.
Bigram
Bigram models only consider the last word used to predict the next word, and no other word, nor the context. For example:
The cake is a [predicted word]
Here, only "a" will be taken into consideration. The model will not understand we are talking about a cake.
You will get weird sentences such as:
The cake is a dress
The cake is a table
The cake is a pillow
4-gram
4-gram models only consider the last three words used to predict the next word, and no other word, nor the context. For example:
The cake is a [predicted word]
Here, only "cake is a" will be taken into consideration.
You will get sentences such as:
The cake is a food
The cake is a dessert
The cake is a lie
The cake is a gift
Statistics with n-gram
You've probably come across the idea of word probability in text and wondered, "How does one word become more likely than another?" It's a valid question, so let's dive into how the probability of a word is determined in a text.
Firstly, let's explore the concept with a bigram. The algorithm counts how often pairs of words, or bigrams, appear in the text. Then, it ranks them from most to least common. When you start typing the first word, the model looks at all the bigrams that begin with that word and selects the most frequent one.
For example, let's say we counted all the bigrams of a text, and these are the four most prevalent bigrams starting with the word "the":
"the"+"car"=21%
"the"+"window"=12%
"the"+"pillow"=7%
"the"+"door"=3%
If you write "the" and ask the model the generate the next word, it has a 21% of chance generating the word "car", a 12% of chance generating the word "window", etc.
Now, let's scale up to a 4-gram. Similar to the bigram, the algorithm counts the occurrences of sequences of four words, or 4-grams, in the text. Again, it sorts them by frequency. When you input the first three words, the model looks at all the 4-grams that begin with that sequence and chooses the most common one.
For example, let's say we counted all the 4-grams of a text, and these are the four most prevalent 4-grams starting with the words "cake is a":
"cake is a"+"dessert"=20%
"cake is a"+"lie"=15%
"cake is a"+"food"=13%
"cake is a"+"gift"=7%
If you write "cake is a" and ask the model to generate the next word, it has a 20% of chance generating the word "dessert", a 15% of chance generating the word "lie", etc.
As you have probably understood by now, n-gram models can only generate something for an exact sequence of words they have already seen, without any variation. The longer the sequence, the fewer possibilities are available, until, ultimately, nothing can be generated.
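As a small sketch of that limitation, here is what a 4-gram lookup could look like: the model can only continue a sequence it has literally counted before, so any unseen sequence is a dead end. The table below is invented for the example.

```python
# Invented 4-gram table: three preceding words map to candidate next words and their probabilities.
fourgram_table = {
    ("cake", "is", "a"): {"dessert": 0.20, "lie": 0.15, "food": 0.13, "gift": 0.07},
}

def predict_4gram(last_three_words):
    candidates = fourgram_table.get(tuple(last_three_words))
    if candidates is None:
        return None  # the exact sequence was never seen in the data: nothing can be generated
    return max(candidates, key=candidates.get)

print(predict_4gram(["cake", "is", "a"]))  # "dessert"
print(predict_4gram(["pie", "is", "a"]))   # None, a dead end: this exact sequence was never counted
```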
From n-gram to LLM
Boosted autocomplete
Let's clarify once again: n-gram models are not LLMs. I explained n-grams earlier to help you understand how autocompletes work and how words are connected to predict the next word.
However, as mentioned, LLMs are essentially advanced versions of autocompletes, still using word associations to generate new text but in a more intricate manner. This advancement is evident in two key aspects:
- Sequential Interpretation: Unlike n-gram models, LLMs do not strictly interpret text in fixed sequences. This means they don't require perfectly structured input to generate text. They can understand and predict words even when the input isn't conventional, textbook-perfect prose.
- Expanded Context: LLMs aren't confined to rigid sequences, allowing them to process and incorporate a much broader context. This versatility enables them to not only generate individual words but also produce entire texts, poetry, and even code snippets.
The training
Now, to understand how an LLM works, let's see how it is trained.
The data
First of all, what exactly is "data"? In the realm of Large Language Model (LLM) training, data refers to the information on which the LLM will be trained; it's what the LLM will "know." This data can encompass a wide range of sources, such as books, novels, news articles, and even images in the case of AI designed for image generation.
It's crucial to grasp that the LLM will only possess knowledge that is present within the data; it won't have any additional insights beyond that. Furthermore, the more extensive and diverse the data, the better the LLM's understanding of the subject matter. Keep in mind that the data available during the training period is what shapes the LLM's knowledge base.
For instance, consider the case of "Harry Potter," a widely known and extensively discussed series. With decades since its inception, there's a wealth of data available, including books, movies, and fanfiction. Consequently, an LLM trained on "Harry Potter" would likely have a comprehensive understanding of the series. Conversely, take the example of "Baldur's Gate 3," a popular game released in 2023. While it's a well-known title, an LLM might not be familiar with its storyline unless the training data includes information about the game. Therefore, for an LLM to comprehensively understand a topic like "Baldur's Gate 3," it must be trained on data that includes relevant information, which means data gathered after the game's release.
This variability in training data is what makes every LLM unique. Each LLM is trained on different datasets, resulting in variations in their knowledge and capabilities.
Once the data is obtained, it undergoes processing, where it is segmented into tokens and analyzed by a neural network.
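To give a rough idea of what "segmented into tokens" means, here is an oversimplified sketch. Real LLMs use subword tokenizers (byte-pair encoding and similar) rather than a simple split on spaces, and the IDs below are invented, but the principle is the same: the text becomes a sequence of numbers that the neural network can work with.

```python
# Oversimplified: real tokenizers split text into subwords and have vocabularies
# of tens of thousands of entries; this toy vocabulary is invented.
vocabulary = {"the": 0, "cake": 1, "is": 2, "a": 3, "lie": 4}

def tokenize(text):
    """Turn a sentence into a list of token IDs (whitespace split, for illustration only)."""
    return [vocabulary[word] for word in text.lower().split()]

print(tokenize("The cake is a lie"))  # [0, 1, 2, 3, 4]
```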
A quick explanation of neural networks in LLMs
Neural networks, the backbone of Large Language Models (LLMs), can be thought of as digital brains. Just as our brains process information through interconnected neurons, neural networks process data by linking pieces of information together. In the context of LLMs, these networks have been trained to recognize patterns in text and simulate human-like responses.
But how are these networks trained?
During training, the LLM repeatedly reads through the dataset, analyzing patterns of word sequences. Each pass through the data, known as an epoch, allows the model to refine its understanding of language. However, it's important to understand that while the LLM can make predictions based on these patterns, it lacks the true comprehension of language that humans have; it's ultimately a mathematical construct.
The training process involves adjusting the connections between "neurons", known as parameters, to minimize the difference between the model's predictions and the actual data. These connections determine the weight of each word in the context of surrounding words, allowing the model to generate coherent text.
Determining the optimal number of epochs is essential. Too few epochs may result in nonsensical output, with words or syllables jumbled together. Conversely, too many epochs can lead to overfitting, where the model merely replicates the training data without offering meaningful insights or generating novel text.
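To picture what "epochs" and "adjusting the connections" mean in practice, here is a heavily simplified training-loop sketch using PyTorch. The tiny model, the random data and the epoch count are all invented for illustration; real LLMs have billions of parameters and train on enormous datasets.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

# A toy "language model": embed the current token, then score every possible next token.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Linear(embed_dim, vocab_size),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Invented training data: each input token paired with the token that actually followed it.
inputs = torch.randint(0, vocab_size, (64,))
targets = torch.randint(0, vocab_size, (64,))

for epoch in range(10):               # each full pass over the data is one epoch
    logits = model(inputs)            # the model's predictions for the next token
    loss = loss_fn(logits, targets)   # how far the predictions are from the actual data
    optimizer.zero_grad()
    loss.backward()                   # work out how each parameter should change
    optimizer.step()                  # adjust the parameters (the "connections")
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```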
In summary, neural networks in LLMs learn from data to predict and generate text, but they do so through mathematical operations, lacking the true understanding of language that humans possess.
Fine-tuning
Now, we've got our base LLM, a freshly trained model capable of generating text. But the job isn't finished. While you could technically use the base LLM as-is, it's unlikely to produce the best results. This is where fine-tuning comes into play: it's the process of customizing the LLM to suit your specific needs.
To fine-tune an LLM, you'll need to provide it with additional data, distinct from the initial training data and tailored to its intended application. For instance, if you're aiming to use the LLM for role-playing purposes, you'll want to feed it data related to role-playing, such as character profiles, game mechanics, and storytelling conventions. The beauty of fine-tuning is its versatility: you can adapt the LLM to virtually any purpose by providing it with relevant data.
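Mechanically, fine-tuning looks a lot like the training loop sketched earlier: you start from the already-trained model and keep training it on your smaller, specialized dataset, usually with a lower learning rate so it doesn't forget what it already knows. Another hedged sketch, with invented stand-ins for the roleplay data:

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 100, 32

# Stand-in for the already-trained base model; in practice you would load a pretrained checkpoint.
base_model = nn.Sequential(nn.Embedding(vocab_size, embed_dim), nn.Linear(embed_dim, vocab_size))

# Stand-in for tokenized roleplay data (character profiles, transcripts, etc.), invented here.
roleplay_inputs = torch.randint(0, vocab_size, (64,))
roleplay_targets = torch.randint(0, vocab_size, (64,))

loss_fn = nn.CrossEntropyLoss()
# A lower learning rate is commonly used so the base knowledge isn't overwritten too aggressively.
optimizer = torch.optim.Adam(base_model.parameters(), lr=1e-5)

for epoch in range(3):  # fine-tuning usually needs far fewer passes than the original training
    loss = loss_fn(base_model(roleplay_inputs), roleplay_targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```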
In roleplay
TLDR: The more something appears in the chat history so far, the more it will keep appearing.
The context
As we know, every LLM has a maximum context size. Anything beyond this context is forgotten.
Even with a large dataset, the LLM always relies on this context, or prompt (the input provided to the LLM for generating output), as a guide. It instructs the LLM on how to write, including the style, characters, and current situation.
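Here is a minimal sketch of what "anything beyond the context is forgotten" looks like in practice: frontends keep only as many recent tokens as the model's context size allows, so the oldest messages simply fall out of the prompt. The numbers and names below are made up for the example.

```python
MAX_CONTEXT_TOKENS = 8  # made-up; real models range from a few thousand to hundreds of thousands

def build_prompt(message_history):
    """Keep only the most recent tokens that fit in the context window (illustration only)."""
    tokens = " ".join(message_history).split()  # pretend one word = one token
    kept = tokens[-MAX_CONTEXT_TOKENS:]         # everything earlier is forgotten
    return " ".join(kept)

history = ["Hello there, traveler.", "The tavern is warm and loud.", "A hooded figure waves you over."]
print(build_prompt(history))  # "and loud. A hooded figure waves you over."
```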
Let's consider the writing style for a moment. Since the LLM uses the context, anything that appears to be a pattern within this context is more likely to reappear (remember the part about neural networks: using certain words will add weight to those words), as it currently holds more significance. And naturally, more patterns lead to more patterns, creating a cascade effect until we encounter the dreaded repetitions. That's why it's important to always keep an eye on the LLM's outputs. Something seemingly harmless at first, like a sentence repeated in two consecutive messages, can spiral into a loop.
It's the same principle for basically everything that gets generated. More dialogue will lead to more dialogue, NSFW will lead to more NSFW, flowery language will lead to more flowery language, etc.
Remember, both your inputs and the LLM's outputs will be in the context, meaning that both will impact how the LLM answers. It's important to understand that what you want to receive is what you need to give.
The importance of good settings
I will quickly explain the importance of settings without going into detail about what they are and what they do. Essentially, settings impact how an LLM interprets the context to generate new text. They determine how close to the context future outputs should be, each setting with its own characteristics.
Good settings can enhance the LLM's creativity while also regulating the recognition and generation of new patterns. Conversely, poor settings can lead to repetitive or bland outputs, essentially "breaking" the LLM.
Let's take the example of penalties. If you don't know much about how they function, you may think setting them to the max is best, right? Since they control how much something will appear, it means it will appear less? Yes, it will. Too much. You will get fewer repetitions, since the repeated words won't be generated, but remember, every word is impacted. You won't get words such as "the", "a", "is", and more. And since more and more words become forbidden, the LLM will need to use other words, more purple prose, until your chat looks like a jumbled Shakespearian mess.
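As a rough sketch of why maxing out a penalty backfires: one common formulation of the repetition penalty (not necessarily the exact one your backend uses) divides the score of every token that already appeared in the context by the penalty value before the next word is picked. The higher the penalty, the more every already-used word is suppressed, including essential ones like "the" or "is". The scores below are invented.

```python
def apply_repetition_penalty(scores, seen_tokens, penalty):
    """Shrink the score of every token already present in the context (one common formulation)."""
    adjusted = dict(scores)
    for token in seen_tokens:
        if token in adjusted:
            # positive scores are divided, negative ones multiplied, so both become less likely
            adjusted[token] = adjusted[token] / penalty if adjusted[token] > 0 else adjusted[token] * penalty
    return adjusted

# Invented scores for the next token; "the" and "is" already appeared in the context.
scores = {"the": 5.0, "is": 4.0, "forsooth": 1.0}
print(apply_repetition_penalty(scores, {"the", "is"}, penalty=1.1))   # mild: common words still win
print(apply_repetition_penalty(scores, {"the", "is"}, penalty=10.0))  # extreme: "forsooth" takes over
```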
I will just add a quick note about formatting, in particular asterisks for actions. LLMs often don't follow this formatting, and it's completely normal: they most likely haven't been trained on this kind of formatting, but rather on novel style.
About the prompt
Positioning
We need to consider the positioning of elements within the prompt. You may have heard that the beginning and end of the prompt carry particular weight, and this is partially true. The beginning sets the tone for the context, like the writing style, topic, and more. Meanwhile, the end provides more specific instructions, guiding the LLM for the next output. However, it's important to note that while the beginning and end are significant, the entirety of the prompt influences the LLM's output generation.
Since what is at the beginning of the prompt sets the tone for everything that happens next, it has a dedicated field in most frontends, called pre-history instructions (or system prompt). Here, you can detail how you want the bot to write (remember, the context will always be more important). If you want the LLM to describe things, this is the place.
It's better to be precise about what you want. If you say you want the LLM to "use vivid descriptions", the LLM may do it, or not. Or rather, it may give you a very detailed description, or just a bland one. This is, again, because of how an LLM works: it doesn't know what "vivid" is. However, if you say you want the LLM to "use vivid descriptions, such as smells, touch, feelings and sounds", it will be able to link those words to more data and produce what you want.
In the same way, you have the post-history instructions field, always at the very end of the prompt. On uncensored LLMs, this field doesn't need to be used unless you want to send specific instructions about the next generated message in particular, which isn't often. On censored models, this field welcomes the jailbreak, which is used to "uncensor" censored models by making the LLM think it is allowed to talk about certain subjects.
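Putting the positioning together, here is a rough sketch of how a frontend might assemble the final prompt. The field names and ordering are simplified and invented for the example, but the idea is the same: pre-history instructions at the beginning, the chat history in the middle, and post-history instructions right before the model's next reply.

```python
def assemble_prompt(pre_history, character_card, chat_history, post_history):
    """Simplified illustration of how a roleplay frontend might order the prompt."""
    parts = [pre_history, character_card]  # beginning: sets the tone and introduces the character
    parts.extend(chat_history)             # middle: the conversation so far
    if post_history:
        parts.append(post_history)         # end: last-minute instructions (or a jailbreak)
    return "\n".join(parts)

prompt = assemble_prompt(
    pre_history="Write in third person. Use vivid descriptions, such as smells, touch, feelings and sounds.",
    character_card="{{char}} is a grumpy blacksmith.",
    chat_history=["{{user}}: Can you repair my sword?", "{{char}}: *He eyes the blade.* Leave it on the counter."],
    post_history="",
)
print(prompt)
```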
Repetitions and negative
"If what is at the beginning or at the end has more influence on the output, I can write something multiple times so it will be stronger !"
Yes and no. If you are trying to give a positive instruction, to tell the LLM what you want it to write, it will probably do it, but too much, maybe even focusing only on that.
If you are trying to give a negative instruction, it will have the complete opposite effect of what you are aiming for. Remember! The LLM uses the context as a guide, as an example of how to write. It will only see that you wrote about that thing a lot, so you gave it more weight, so you must want more of it. Often, when trying to give a negative instruction, the best thing is to not talk about it at all.
Speaking of negatives, let's take a brief moment to talk about "don't", often used in "don't talk for {{user}}". It doesn't work, and will more often make the LLM talk for {{user}} than the opposite. Every LLM has a positive bias, meaning it responds better to positive wording than negative wording. Instead of writing "don't do", it's better to write "avoid" or "omit".
A bit about numbers
The use of numbers in LLMs deserves its own part in this introduction. If you followed the whole document, you may have already understood the particularity of numbers in how LLMs interpret them... or rather the lack of particularity. Numbers, like words, aren't understood by the Large Language Model; they only have more or less weight depending on what is present in the context. They are generated randomly, just like words.
Higher-end LLMs such as OpenAI's can generate numbers with more coherence, and even solve some math problems, but they still don't have any comprehension of the numbers themselves.
I am mentioning this specifically to give some answers about stats tracking. With the vast majority of LLMs, stats tracking won't work. They may produce numbers that look like correct stats tracking, but in reality they are just random numbers. The LLM won't understand that Life at 0% means death, or that Food at 0% means hunger. Yes, higher-end models will give better statistics, but they may break at any moment, or give you incoherent stats.
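One way to picture why stats drift: the model never manipulates a number as a quantity, it just predicts the next token, and a value like "87%" is usually several tokens predicted one after another. A hedged illustration of that kind of sampling, with invented probabilities:

```python
import random

# Invented next-token probabilities after the text "Health: " somewhere in a chat.
# The model picks digits the same way it picks words: by weighted chance, not by arithmetic.
next_token_probs = {"8": 0.30, "9": 0.25, "7": 0.20, "1": 0.15, "0": 0.10}

digit = random.choices(list(next_token_probs), weights=list(next_token_probs.values()))[0]
print(f"Health: {digit}...")  # the "stat" is sampled, not computed from what happened in the story
```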
In conclusion
Large Language Models are neural networks that use the context as a guide. The more you talk about something, the more of that something will be generated. Same thing with formatting: you send good messages, you get good messages. You send long messages, you get long messages.
I hope it hasn't been too complicated and that you understand a bit better how LLMs work. I didn't go too in-depth, I'm not a programmer, so I don't want to say things I'm not sure about.