Model quantization is a technique for shrinking a trained model by reducing the precision of the weights and activations it uses, without significantly affecting overall accuracy.
For example, a model that was trained in 16-bit precision can be quantized to 8, 4, or even 3 bits, usually with acceptable quality loss. This is pretty cool because a model's VRAM requirement scales roughly linearly with the bit width of its weights, so quantization lets you do things like run a 30B model on a single 3090 (24 GB VRAM) instead of needing 4x 3090s.
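As a rough sanity check of those numbers (weights only; real usage is higher once activations, context, and quantization metadata are loaded):

```bash
# Back-of-the-envelope weight memory for a 30B-parameter model.
# Weights only -- activations, KV cache, and groupsize scale/zero data add more on top.
python3 -c "p = 30e9; print(f'fp16: ~{p*2/2**30:.0f} GiB   int4: ~{p*0.5/2**30:.0f} GiB')"
```

That ~4x reduction is what makes a 30B fit into 24 GB.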
Quantizing takes a huge amount of RAM (or swap). I needed 64 GB of RAM to quantize a 13B LLaMA model, and about 100 GB to quantize a 30B model. You also need a GPU; a single 3090 was enough for me to quantize a 30B.
If you don't have this sort of hardware, you can always rent it. vast.ai is pretty gud
- NVIDIA drivers
- python3 with pip
- An unquantized model. It should have a bunch of files in it named "pytorch_model-0000(some number)-of-00007.bin" along with some other config files.
- Linux or WSL
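A quick way to sanity-check all of that before you start (the model path is just an example):

```bash
nvidia-smi                       # driver and GPU visible?
python3 --version && pip --version
free -h                          # enough RAM/swap for the model size you're quantizing?
ls ~/models/llama-30b-hf         # example path: should show the pytorch_model-*.bin shards plus config files
```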
These commands worked for me. I'm using oobabooga's fork of GPTQ-for-LLaMa; you could also use the original at https://github.com/qwopqwop200/GPTQ-for-LLaMa, but for me it was broken (generated garbage) when used with text-generation-webui.
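Roughly, the quantization step follows the pattern from the GPTQ-for-LLaMa README. Treat this as a sketch rather than my exact shell history: the model path and output filename are placeholders, and setup details (and which flags exist) vary a bit between forks and commits.

```bash
git clone https://github.com/oobabooga/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt    # some older commits also need: python setup_cuda.py install

# Quantize to int4 with groupsize 128 and true sequential, saving as safetensors.
# "c4" is the calibration dataset; the model path and output name are placeholders.
python llama.py /path/to/llama-30b-hf c4 \
    --wbits 4 \
    --groupsize 128 \
    --true-sequential \
    --save_safetensors llama-30b-4bit-128g.safetensors
```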
This generates a model in safetensors format with int4, groupsize 128, and true sequential.