Qwen2.5-Coder & Speculative Decoding

Models

  1. https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF
  2. https://huggingface.co/bartowski/Qwen2.5-Coder-14B-Instruct-GGUF
  3. https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
  4. https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF
  5. https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF

Use the 32B, 14B, or 7B model as the main model, and the 1.5B or 3B model as the draft model. The draft model should be at least Q5_K_M.
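
To fetch the files, huggingface-cli works. The .gguf filenames below assume bartowski's usual naming convention (they match the paths used in start.bat further down); double-check them on the model page:

huggingface-cli download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf --local-dir .
huggingface-cli download bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf --local-dir .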

start.bat:

.\llama.cpp\llama-server.exe ^
  --model .\Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf ^
  --alias Qwen2.5-Coder-32B-Instruct ^
  --ctx-size 8192 ^
  --mlock ^
  --n-gpu-layers 40 ^
  --temp 0 --top-k 1 ^
  --host 127.0.0.1 --port 8991 --no-webui ^
  --model-draft .\Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf ^
  --ctx-size-draft 8192 ^
  --gpu-layers-draft 99 ^
  --verbose-prompt

Adjust --n-gpu-layers to fit your VRAM, but make sure the draft model (--gpu-layers-draft) is fully offloaded to VRAM. Greedy sampling (--temp 0 --top-k 1) keeps generation deterministic, which tends to maximize the rate at which draft tokens are accepted.
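
Once the server is up, a quick sanity check against llama-server's built-in endpoints (both are standard llama.cpp routes; the escaped quotes are for cmd.exe):

curl http://127.0.0.1:8991/health

curl http://127.0.0.1:8991/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen2.5-Coder-32B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python hello world.\"}]}"

If speculation is working, generation speed should noticeably exceed what the 32B model manages running alone.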


continue.dev config.json (typically at ~/.continue/config.json):

{
  "models": [
    {
      "title": "llama.cpp",
      "provider": "llama.cpp",
      "model": "Qwen2.5-Coder-32B-Instruct",
      "apiBase": "http://localhost:8991"
    }
  ]
}
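
Continue can also point tab autocomplete at the same server via a tabAutocompleteModel entry in the same config.json. A sketch; whether the llama.cpp provider is the best fit for FIM-style completion here is worth verifying against Continue's docs:

{
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder",
    "provider": "llama.cpp",
    "model": "Qwen2.5-Coder-32B-Instruct",
    "apiBase": "http://localhost:8991"
  }
}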

