Qwen2.5-Coder & Speculative Decoding
Models
- https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF
- https://huggingface.co/bartowski/Qwen2.5-Coder-14B-Instruct-GGUF
- https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
- https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF
- https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF
Use 32B, 14B, or 7B as the main model and 1.5B or 3B as the draft model. The draft model should be at least Q5_K_M.
start.bat:
Adjust --n-gpu-layers for the main model to whatever fits your VRAM, but make sure the draft model set via --gpu-layers-draft is fully offloaded to VRAM.
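A minimal start.bat sketch, assuming llama-server.exe from a llama.cpp release sits in the same folder and the GGUF files live in a models\ subfolder; the exact filenames, layer count, context size, and port are placeholders to adjust for your setup:

```bat
@echo off
REM Main model: 32B instruct quant; draft model: 1.5B at Q8_0 (>= Q5_K_M).
REM --n-gpu-layers 45 is a guess -- raise or lower it to fit your VRAM.
REM --gpu-layers-draft 99 keeps the whole draft model in VRAM.
llama-server.exe ^
  --model models\Qwen2.5-Coder-32B-Instruct-Q4_K_M.gguf ^
  --model-draft models\Qwen2.5-Coder-1.5B-Instruct-Q8_0.gguf ^
  --n-gpu-layers 45 ^
  --gpu-layers-draft 99 ^
  --ctx-size 16384 ^
  --host 127.0.0.1 ^
  --port 8080
```

Once running, the server exposes an OpenAI-compatible API on the chosen port. The server-side speculative decoding PR linked below also added tuning knobs such as --draft-max and --draft-min, which can be left at their defaults to start.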
continue.dev config.json
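A minimal sketch of the matching entry in Continue's config.json, assuming the llama-server instance above is reachable at http://localhost:8080; the title and model strings are placeholders, while "provider": "llama.cpp" and "apiBase" follow Continue's documented llama.cpp setup:

```json
{
  "models": [
    {
      "title": "Qwen2.5-Coder 32B (llama.cpp)",
      "provider": "llama.cpp",
      "model": "qwen2.5-coder-32b-instruct",
      "apiBase": "http://localhost:8080"
    }
  ]
}
```

Speculative decoding happens entirely inside llama-server, so Continue only ever talks to the main endpoint; no draft-model settings are needed on the Continue side.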
Interesting things to read:
- Speculative decoding just landed in llama.cpp's server with 25% to 60% speed improvements (https://archive.is/HycFA)
- Speculative decoding can identify broken quants? (https://archive.is/zm9QL)
- How to correctly do speculative decoding on the CPU using small models (1B and 7B)? (https://archive.is/pEIc4)
- LM Studio 0.3.10 with Speculative Decoding released (https://archive.is/Im7db)