/lmg/ recommended models
The number of gigabytes in parentheses is the minimum amount of memory (VRAM + RAM) required to run the model at a reasonable quant. For smaller models this is widely considered to be at least Q4_K_M. Large models can be quanted down to 1 bit and remain coherent, but the exact impact on their performance, especially on long-context tasks, is still unexplored. With more memory you'll be able to fit more context. Using a smaller quant will make the model dumber.
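You can sanity-check these numbers yourself: weight size is roughly parameter count times bits per weight divided by 8, plus headroom for context cache and overhead. A minimal sketch (the ~4.85 bpw figure for Q4_K_M is a ballpark, the exact value varies per model):

```python
def quant_size_gb(params_b: float, bpw: float = 4.85) -> float:
    """Rough weight size in GB for a model with params_b billion
    parameters at bpw bits per weight (Q4_K_M is ~4.85 bpw)."""
    # billions of params * bits per weight / 8 bits per byte = GB
    return params_b * bpw / 8

# Nemo 12B at Q4_K_M: ~7.3 GB of weights, which leaves room for
# context cache and overhead inside the listed 12GB budget.
print(round(quant_size_gb(12), 1))  # → 7.3
```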
Ideally the entire model fits in your VRAM. You can cope by loading parts of the model into RAM instead but it will be much slower.
MoE (mixture of experts) models don't use all of their parameters for each token, so they are much faster than a dense model of the same size. Because of this, offloading part of the model to RAM instead of VRAM is viable. All of the large models listed here are MoE. If you're using llama-server from llama.cpp it will automatically load the model in the way that best fits your hardware. The only thing you should set yourself is your preferred context size, since it otherwise defaults to 4096.
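For example (the model path is a placeholder; -c / --ctx-size is llama-server's context length flag):

```shell
# Let llama-server pick the GPU/CPU split automatically and only
# raise the context window above the 4096 default:
llama-server -m ./model.gguf -c 16384
```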
ERP
- Nemo (12GB) - The model every vramlet started with, now showing its age. Uncensored with a system prompt.
- Gemma 4 31B (24GB) - A proper successor to Nemo with a different writing style. Worth trying even if you can run bigger models. Supports vision so it can comment on your dick pics. Uncensored with a system prompt. Anons often say that for this use case it's as good as much larger models. You can also try the MoE and smaller versions listed below.
- GLM-4.5 Air (80GB) - The middle point between Nemo and DeepSeek. Like Nemo and Gemma its pretraining doesn't seem to have been filtered at all. Needs a prefill to get around refusals. MoE model.
- GLM-4.6 (200GB) / 4.7 - Same as the above but even more parameters and thus smarter. 4.7 has better benchmark scores but some Anons think that it's more safetyslopped.
- DeepSeek V3 (200GB) / R1 0528 / V3.1 Terminus - R1 is a thinking model and Terminus is a hybrid thinking model. V3 has repetition issues in long chats. R1 is more resistant but still requires sampler trickery. Terminus has almost none but it has less variety. Even the smallest quants of DeepSeek like the UD-IQ1_S are very good.
- Kimi K2.6 (400GB) - DeepSeek architecture but bigger. Similar unfiltered dataset. Some Anons prefer it over DeepSeek. Supports vision.
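The prefill trick mentioned for GLM-4.5 Air above just means starting the assistant turn yourself so the model continues your words instead of getting a chance to refuse. A minimal sketch (the turn markers are illustrative placeholders, not GLM's actual chat template — use your frontend's template):

```python
def with_prefill(history: str, prefill: str = "Sure, ") -> str:
    """Append an opened assistant turn ending in a prefill so the
    model completes from it rather than writing a refusal from
    scratch. Turn markers here are placeholders, not a real
    chat template."""
    return history + "<assistant>" + prefill

prompt = with_prefill("<user>continue the scene</user>\n")
print(prompt.endswith("Sure, "))  # → True
```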
Programming & General
Like most benchmarks, public programming benchmarks have found their way into the training datasets of most models. In my experience, if you work on anything other than webshit, bigger model = better regardless of the benchmark scores. Test them yourself on your own codebases.
For general assistant and "claw" type shit that searches the web and makes tool calls small models are good enough.
- Kimi K2.6 (400GB) - Supports vision.
- GLM 5.1 (300GB)
- GLM 4.7 (250GB)
- Qwen Series - Benchmaxxed models with an impressive lack of world knowledge compared to similarly sized models from other labs, but they are often better at programming. Supports vision.
- Qwen3.5 397B A17B (250GB)
- Qwen3.5 122B A10B (100GB)
- Qwen3.6 27B (24GB)
- Qwen3.6 35B A3B (24GB) - Faster but dumber MoE version of the above.
- Qwen3.5 9B (12GB) / Qwen3.5 4B (8GB) / Qwen3.5 2B (4GB) / Qwen3.5 0.8B (2GB) - Performance drops sharply here. If you do use them, run bigger quants like Q8 for the smallest models.
- Gemma 4 Series - If you want to discuss obscure animu that Qwen doesn't know about Gemma probably has you covered. Supports vision.
- Gemma 4 31B (24GB)
- Gemma 4 26B A4B (24GB) - Faster but dumber MoE version of the above.
- Gemma 4 E4B (12GB) / Gemma 4 E2B (8GB) - Similar to the Qwens above. I wouldn't use these unless you're desperate.