LLM Reference Glossary
(v 0.2, Last Updated 2023-06-15)
- LLM Reference Glossary
- Terms
- General
- Card, Character Card
- Context Size, Max Context Size
- Fine-tuning
- GGML
- GLoRA
- GPTQ
- Infinite Context, Infinite Memory
- k-quantization, k-quants
- Landmark Attention
- LLM
- LoRA
- Model Size, Parameter Count, Parameters
- Perplexity
- Prompt
- QLoRA
- Quantization
- RPTQ
- Soft Prompt
- SpQR
- SqueezeLLM, Dense-and-Sparse Quantization
- SuperCOT
- TART
- Token, Tokenization
- xPos
- Models
- 4chan-specific
- Sites
- People
- General
- See Also
Terms
General
Card, Character Card
A pseudo-standardized way of sharing LLM prompt material for the purposes of roleplay and chat. Can come in the form of a raw .json
file, or PNG/WebP images with the JSON data embedded. Supported by some generator front-ends and websites like chub.ai.
More Info: https://github.com/malfoyslastname/character-card-spec-v2
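For illustration, a minimal Python sketch of pulling the embedded JSON out of a PNG card, assuming the common TavernAI-style convention of a base64-encoded JSON payload stored in a tEXt chunk keyed "chara" (the filename is made up):
```python
import base64
import json

from PIL import Image  # Pillow

def read_card(path: str) -> dict:
    """Extract the character JSON embedded in a PNG card."""
    img = Image.open(path)
    # PNG cards conventionally store base64-encoded JSON in a tEXt chunk
    # under the keyword "chara"; Pillow exposes text chunks via .text.
    raw = img.text["chara"]
    return json.loads(base64.b64decode(raw))

card = read_card("character.png")  # hypothetical file
print(sorted(card.keys()))         # e.g. name, description, first_mes, ...
```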
Context Size, Max Context Size
The number of tokens a generator can work with at once. Common values are 1024 and 2048, though some models support much larger contexts. If the input exceeds the maximum context size, it is truncated to fit.
It can help to think of the input to a model as a sheet of paper. The prompt is all of the words that you've written, and the output is the model's attempt to fill out the rest of the sheet. In that analogy, the Context Size is the size of the sheet of paper.
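A rough sketch of how a front-end might handle this, trimming old chat history so everything fits; the numbers and the token-counting stand-in are illustrative only:
```python
# Illustrative only: real front-ends count tokens with the model's tokenizer.
MAX_CONTEXT = 2048           # the size of the "sheet of paper"
RESERVED_FOR_OUTPUT = 256    # room left for the model's reply

def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)   # crude stand-in for a real tokenizer

def build_prompt(system: str, history: list[str]) -> str:
    budget = MAX_CONTEXT - RESERVED_FOR_OUTPUT - count_tokens(system)
    kept = []
    for message in reversed(history):       # walk newest-to-oldest
        cost = count_tokens(message)
        if cost > budget:
            break                            # oldest messages fall off first
        kept.append(message)
        budget -= cost
    return "\n".join([system] + list(reversed(kept)))
```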
Fine-tuning
Taking an existing LLM and specializing the model's capabilities, optimizing its performance on a narrower, task-specific dataset. Most models in actual use are fine-tunes of other foundational models (or other fine-tuned models).
More Info: https://rentry.org/llm-training#the-basics
GGML
Georgi Gerganov (ggerganov) Machine Learning
A C library and binary storage format for LLMs, supporting quantization. Used by CPU-based generators like llama.cpp, koboldcpp, etc.
More Info: https://github.com/rustformers/llm/blob/main/crates/ggml/README.md
GLoRA
Generalized Low-Rank Adaptation
More Info: https://github.com/Arnav0400/ViT-Slim/tree/master/GLoRA
GPTQ
Generative Pre-trained Transformer (GPT) Quantization
A post-training quantization method and weight storage format used by GPU-based generators.
Infinite Context, Infinite Memory
A shorthand term for the strategy of using a vector database like ChromaDB to query for snippets from past conversations with an LLM and inject context-relevant data into the current prompt. This gives conversations a sense of long-term memory while still using a limited context size.
More Info:
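A rough sketch of the idea using ChromaDB's Python client (the memories, names, and exact API details here are assumptions, not a reference implementation):
```python
import chromadb

client = chromadb.Client()                     # in-memory instance
memory = client.create_collection("chat_memory")

# Store past exchanges; Chroma embeds the documents automatically.
memory.add(
    documents=["The user's cat is named Miso.", "The user prefers short replies."],
    ids=["mem-1", "mem-2"],
)

# Before building the next prompt, pull the snippets most relevant to the new message.
new_message = "What was my cat called again?"
hits = memory.query(query_texts=[new_message], n_results=1)
retrieved = hits["documents"][0]               # e.g. ["The user's cat is named Miso."]

prompt = "Relevant memories:\n" + "\n".join(retrieved) + f"\n\nUser: {new_message}"
```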
k-quantization, k-quants
A method of quantization using mixtures of different weight types, going from 8-bit Q8_K all the way down to 2-bit Q2_K.
More Info:
Landmark Attention
A QLoRA which compresses LLM context into "landmarks" to, quote: "[make] the process of selecting relevant tokens for answers more efficient, and allowing 2-16x longer context use without memory constraints." Claims to achieve context lengths of up to 32k with some models.
More Info: https://github.com/eugenepentland/landmark-attention-qlora
LLM
Large Language Model
LoRA
Low-Rank Adaptation
A method to expedite the training of LLMs by freezing the pretrained weights and training small low-rank update matrices instead. For GPT-3 175B, this reduces the number of trainable parameters by roughly 10,000x and GPU memory requirements by about 3x. A conceptual sketch follows the links below.
More Info:
- https://github.com/microsoft/LoRA
- https://rentry.org/llm-training#low-rank-adaptation-lora_1
- https://rentry.org/lora_train
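A conceptual sketch of the idea (not Microsoft's implementation; shapes and hyperparameters are just examples): the pretrained weight matrix stays frozen, and only two much smaller matrices are trained.
```python
import numpy as np

d_out, d_in, r = 4096, 4096, 8        # full dimensions vs. low rank r
alpha = 16                            # LoRA scaling hyperparameter

W = np.random.randn(d_out, d_in)      # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01   # trainable, r x d_in
B = np.zeros((d_out, r))              # trainable, d_out x r (initialized to zero)

def forward(x):
    # Effective weight is W + (alpha / r) * B @ A; only A and B get gradients.
    return W @ x + (alpha / r) * (B @ (A @ x))

print("trainable:", A.size + B.size, "vs. full fine-tune:", W.size)
```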
Model Size, Parameter Count, Parameters
The number of weights a specific model contains, usually measured in billions (e.g., 7B means "7 billion parameters").
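As a back-of-the-envelope example of what parameter count means for memory (weights only, ignoring activations, KV cache, and runtime overhead):
```python
params = 7e9  # a "7B" model
for name, bytes_per_weight in [("f16", 2), ("8-bit", 1), ("4-bit", 0.5)]:
    print(f"{name}: ~{params * bytes_per_weight / 1024**3:.1f} GiB")
# f16: ~13.0 GiB   8-bit: ~6.5 GiB   4-bit: ~3.3 GiB
```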
Perplexity
A common metric for evaluating a model: a measurement of how well the model predicts the contents of a dataset. Higher perplexity is worse; lower perplexity means the model predicts that dataset better, but doesn't always mean better output quality.
More Info: https://huggingface.co/docs/transformers/perplexity
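In essence, perplexity is the exponential of the average negative log-likelihood the model assigns to each token of the dataset; a minimal sketch with made-up probabilities:
```python
import math

# Hypothetical probabilities the model assigned to each correct next token.
token_probs = [0.31, 0.05, 0.62, 0.10, 0.45]

avg_nll = sum(-math.log(p) for p in token_probs) / len(token_probs)
perplexity = math.exp(avg_nll)
print(f"perplexity = {perplexity:.2f}")  # lower means the dataset surprised the model less
```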
Prompt
The input fed to a model. Contains the instructions for the model, or the personality, scenario, and history when used in a roleplay or chat context. The prompt takes up a portion of the max context size.
QLoRA
Quantized Low-Rank Adaptation
A fine-tuning method that trains LoRA adapters on top of a frozen, quantized (4-bit) base model, allowing near-lossless results with massive reductions in memory requirements.
More Info:
- https://github.com/artidoro/qlora
- https://rentry.org/llm-training#qlora
- https://arxiv.org/abs/2305.14314
Quantization
A way of "compressing" models by changing the way the weights are stored. Can take an f16 or f32 full-precision model and bring each weight down to as little as 2 bits, allowing for less storage space and faster inference. There are several different methods of quantization, each with its own upsides and downsides; a toy example follows the links below.
More Info:
- https://huggingface.co/blog/4bit-transformers-bitsandbytes
- https://www.youtube.com/watch?v=mii-xFaPCrA
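A toy example of the simplest case, symmetric 8-bit quantization of a single tensor; real schemes (GPTQ, k-quants, SpQR, etc.) are far more sophisticated, quantizing in groups and compensating for error:
```python
import numpy as np

weights = np.random.randn(4096).astype(np.float32)    # full-precision weights

scale = np.abs(weights).max() / 127.0                  # map the largest |w| to the int8 range
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)

dequantized = q.astype(np.float32) * scale             # what inference actually multiplies with
print("bytes per weight: 4 -> 1, mean abs error:",
      float(np.abs(weights - dequantized).mean()))
```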
RPTQ
Reorder-Based Post-Training Quantization
An experimental method of quantizing models to 3-bit and 4-bit weights using a reorder-based approach, originally laid out in the RPTQ paper. Not currently in common use.
More Info: https://github.com/AlpinDale/RPTQ-for-LLaMA
Soft Prompt
A set of learned embedding vectors prepended to the model's input, trained on a dataset to subtly steer the model toward a particular style or subject without hand-written prompt text.
More Info: https://github.com/KoboldAI/KoboldAI-Client/wiki/Soft-Prompts
SpQR
Sparse-Quantized Representation
A near-lossless quantization scheme for 3-bit/4-bit weights that stores outlier weights separately at higher precision.
More Info:
SqueezeLLM, Dense-and-Sparse Quantization
A method of model quantization using 3-bit and 4-bit weights. It splits weight matrices into two components, "dense" which can be heavily quantized without affecting performance, and "sparse" which preserves sensitive and outlier parts of the weight matrices.
More Info: https://github.com/SqueezeAILab/SqueezeLLM
SuperCOT
A LoRA intended for use with LangChain, trained on Chain-of-Thought (CoT) datasets.
More Info: https://huggingface.co/kaiokendev/SuperCOT-LoRA
TART
Task-Agnostic Reasoning Transformer
A task-, model-, and domain-agnostic method for improving in-context learning performance on classification tasks. Can sit on top of any model, and claims to outperform 65B models.
More Info: https://github.com/HazyResearch/TART
Token, Tokenization
Words and sentences are fed into an LLM by first being tokenized. A token may be a single character, part of a word, or a full word. Different models can tokenize the same sentence in different ways, so tokenization is model-dependent. Models usually also have special tokens for things like marking the end of a generation, signifying the start of a response, etc.
More info: https://vaclavkosar.com/ml/Tokenization-in-Machine-Learning-Explained
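A quick sketch using the Hugging Face transformers tokenizer for GPT-2 (any model's tokenizer is used the same way; the exact IDs and pieces differ per model):
```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization is model-dependent."

ids = tok.encode(text)
print(ids)                             # a list of integer token IDs
print(tok.convert_ids_to_tokens(ids))  # sub-word pieces, not whole words
print(tok.decode(ids))                 # round-trips back to the original text
```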
xPos
Extrapolatable Position Embedding
More Info:
Models
This is not meant to be a comprehensive list. For more in-depth info, see: https://rentry.org/lmg-resources#general-resources-guides-for-large-language-models-llm-datasets
Alpaca
Type | Sizes | Context | License |
---|---|---|---|
Fine-Tune (LLaMA) | 7B | 2048 Tokens | Code: Apache License 2.0; dataset and weight diffs: CC BY-NC 4.0 |
Released by Stanford, now the base of many other fine-tunes.
More Info:
- https://crfm.stanford.edu/2023/03/13/alpaca.html
- https://github.com/tatsu-lab/stanford_alpaca
- https://huggingface.co/tatsu-lab/alpaca-7b-wdiff
ChatGLM
Type | Sizes | License |
---|---|---|
Foundational | 6B | Open (Apache License 2.0) |
Chinese-focused model.
More Info: https://huggingface.co/THUDM/chatglm-6b
LLaMA
Large Language Model Meta AI
Type | Sizes | Context | License |
---|---|---|---|
Foundational | 7B, 13B, 33B, 65B | 2048 Tokens | Restrictive |
Meta's foundational model, from which many other models are fine-tuned. Comes in four sizes: 7B, 13B, 33B, and 65B.
More Info: https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
OpenLLaMA
Type | Sizes | Context | License |
---|---|---|---|
Foundational | 3B, 7B, 13B (600BT Preview) | 2048 Tokens | Open (Apache License 2.0) |
An open-source reproduction of Meta's more restrictively licensed LLaMA.
More Info: https://github.com/openlm-research/open_llama
StarCoderBase, StarCoder
Type | Sizes | Context | License |
---|---|---|---|
Foundational | 15B | 8192 Tokens | Code: Open (Apache License 2.0), Model: Semi-Open (bigcode-openrail-m) |
A model trained mostly on GitHub code plus some natural language text, geared toward being a coding assistant. Technically, StarCoderBase is the foundational model and StarCoder is a fine-tune focused on Python. The weights are public, but you need a Hugging Face account and must accept the license terms to access them.
More Info:
4chan-specific
/lmg/
Local Models General
A recurring general thread on 4chan's /g/ board for the discussion of local large language models.
More Info: https://4chan.org/g/lmg
/aicg/
AI Chatbot General
A recurring general thread on 4chan's /g/ board for the discussion of AI chatbots and character cards.
More Info: https://4chan.org/g/aicg
Sites
Hugging Face, HF
The primary site where LLMs and datasets are uploaded and distributed.
More Info: https://huggingface.co/
Character Hub, chub.ai
A site for uploading and sharing Character Cards, as well as lore books. Uses GitLab on the backend and supports forking existing characters.
More Info: https://chub.ai/
People
Georgi Gerganov (ggerganov)
Author of open-source LLM-related projects like GGML, llama.cpp, and whisper.cpp.
More Info: https://github.com/ggerganov
LostRuins
Author of koboldcpp.
More Info: https://github.com/LostRuins