On llama.cpp quant names
or "Fine, I'll Do It Myself, Then, Part 2"
**I am likely a degree of wrong about much of this. People are welcome to correct me.**
This is not a primer on what quantization is. I'm hoping you know already.
The quant names for llama.cpp were coined by ikawrakow, who wrote the implementations for most of them. (I think all of them, actually.) They're pretty terse, packing a lot of information into few characters, and they're subject to change as new schemes are devised and implemented.
We'll break it up like so:
(I)Qx_(K)_V

Where:

- Qx - Indicates a quantization of x bit depth. The easy part!
- I/K - When neither of these is present, it's a "legacy quant", basically the original GGUF/GGML quantization spec. Legacy quants have every weight take up exactly x bits. Usually only 4-bit and 8-bit legacy quants are made anymore, because they're pretty fast, owing to needing less math to get the final weights back. Only I or K will be present at a time, and either means that the average bit depth* will be higher than x, but not by more than ~0.5. They basically take different approaches to the quantization process: "I" means it uses what's called an "importance matrix" to boil down the weights, while "K" does so by grouping "blocks" of weights into "super-blocks". You don't need to worry too hard about what all that means; just know that for the same value of x, IQ quants are generally preferred over K.
- V - A discriminator that usually communicates how far from x the average bit depth is.
  - For IQ quants, the smallest is XXS (x+0.06), and the largest is M.
  - For K quants, the smallest is sometimes simply K (with no final letter), and then S, M and L exist. L is at most x+0.56 and usually ~x+0.5. Note that for both IQ and K quants, not every size variant exists at every available bit depth, as some combinations were found to be mostly redundant by the llama.cpp developers.
  - For legacy quants, V will be a number; in that case it simply indicates a variant of the old quantization technique. The variants are comparable to each other, so you should basically just grab whichever's smaller, if you use these at all.
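If you'd rather see the scheme in code, here's a little sketch of my own (not anything from llama.cpp, and the field names are made up) that pulls a quant name apart along the lines described above:

```python
import re

# Rough parser for the naming scheme described above. Group names are my own.
QUANT_RE = re.compile(
    r"^(?P<imatrix>I)?"                          # optional "I": importance-matrix quant
    r"Q(?P<bits>\d+)"                            # nominal bit depth x, e.g. Q4
    r"(?:_(?P<k>K))?"                            # optional "K": super-block quant
    r"(?:_(?P<variant>XXS|XS|S|M|L|NL|\d+))?$"   # size variant, NL, or legacy number
)

def parse_quant(name: str) -> dict:
    m = QUANT_RE.match(name.upper())
    if not m:
        raise ValueError(f"doesn't look like a llama.cpp quant name: {name}")
    parts = m.groupdict()
    # Neither I nor K present -> legacy quant
    parts["legacy"] = parts["imatrix"] is None and parts["k"] is None
    return parts

for name in ("Q4_0", "Q4_K_M", "IQ2_XXS", "Q6_K", "IQ4_NL"):
    print(name, "->", parse_quant(name))
```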
A couple others you probably want to know about:
- NL (as V) - "Non-Linear", a way to provide a 4-bit quantization for certain types of models with weird layouts. If a workable 4-bit quant is available, you probably want to use that instead.
- i1- (at the start) - An affectation by HuggingFace user mradermacher, this means that the importance matrix is an original one he created.
So what do I use, then?
┐(~ー~;)┌
There's not really a right answer, even if we know your hardware.
The general guidance is to use the largest model and quant that you can fit in the RAM/VRAM you want to allocate to it.
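The back-of-the-envelope math for "what fits" is just parameter count × average bits per weight ÷ 8. A quick sketch, with the caveat that the bpw figures below are my rough approximations for illustration, not authoritative numbers from llama.cpp:

```python
# File size ≈ parameter count × average bits per weight / 8 bytes.
# The bpw values here are rough approximations, for illustration only.
APPROX_BPW = {
    "IQ2_XXS": 2.06,
    "Q4_0":    4.5,   # legacy 4-bit
    "Q4_K_M":  4.8,
    "Q8_0":    8.5,   # legacy 8-bit
}

def approx_size_gb(n_params: float, bpw: float) -> float:
    return n_params * bpw / 8 / 1e9

# A 7B model, for example. (Real files add a bit of overhead, and the
# KV cache / context costs extra memory on top of this.)
for quant, bpw in APPROX_BPW.items():
    print(f"7B @ {quant}: ~{approx_size_gb(7e9, bpw):.1f} GB")
```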
However!
- In artificial benchmarks, IQ quants tend to be superior to K quants (smaller files with less perplexity variance) from 3 bits onwards
- A lower-bit quant of a larger model will always have better perplexity than even an unquantized smaller model, except that 2-bit quants approach the perplexity of a model with half the parameter count. (1-bit quants weren't tested here as they didn't exist yet, but they probably end up too lossy** to be worth using. This is probably not a hard-and-fast rule.)
- You may prefer a smaller but less quantized model, however!! It really does just depend on what you're doing and what your preferences are.
- In short, remember always the golden rule: run the biggest model and quant you can fit.
* - "Average bit depth" is an oversimplification. The weight is calculated per "block", but I didn't want to try and explain what that actually means because to be frank, I don't fully understand it myself.
** - "lossy" isn't the technically correct term, but ikawrakow does use the analogy of compression with quants even though it's not technically correct either, so bite me about it.