Qwen2.5-Coder & Speculative Decoding

Models

  1. https://huggingface.co/bartowski/Qwen2.5-Coder-32B-Instruct-GGUF
  2. https://huggingface.co/bartowski/Qwen2.5-Coder-14B-Instruct-GGUF
  3. https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF
  4. https://huggingface.co/bartowski/Qwen2.5-Coder-3B-Instruct-GGUF
  5. https://huggingface.co/bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF

Use the 32B, 14B, or 7B model as the main model, and the 1.5B or 3B model as the draft model. The draft model should be at least Q5_K_M.
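
To fetch the files, huggingface-cli works. The .gguf filenames below assume bartowski's usual naming convention (they match the paths used in start.bat further down); double-check them on the model page:

huggingface-cli download bartowski/Qwen2.5-Coder-32B-Instruct-GGUF Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf --local-dir .
huggingface-cli download bartowski/Qwen2.5-Coder-1.5B-Instruct-GGUF Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf --local-dir .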

start.bat:

.\llama.cpp\llama-server.exe ^
  --model .\Qwen2.5-Coder-32B-Instruct-Q5_K_M.gguf ^
  --alias Qwen2.5-Coder-32B-Instruct ^
  --ctx-size 8192 ^
  --mlock ^
  --n-gpu-layers 40 ^
  --temp 0 --top-k 1 ^
  --host 127.0.0.1 --port 8991 --no-webui ^
  --model-draft .\Qwen2.5-Coder-1.5B-Instruct-Q5_K_M.gguf ^
  --ctx-size-draft 8192 ^
  --gpu-layers-draft 99 ^
  --verbose-prompt

Adjust --n-gpu-layers to fit your VRAM, but make sure the draft model (--gpu-layers-draft) is fully offloaded to VRAM. Greedy sampling (--temp 0 --top-k 1) keeps generation deterministic, which tends to maximize the rate at which draft tokens are accepted.
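
Once the server is up, a quick sanity check against llama-server's built-in endpoints (both are standard llama.cpp routes; the escaped quotes are for cmd.exe):

curl http://127.0.0.1:8991/health

curl http://127.0.0.1:8991/v1/chat/completions -H "Content-Type: application/json" -d "{\"model\": \"Qwen2.5-Coder-32B-Instruct\", \"messages\": [{\"role\": \"user\", \"content\": \"Write a Python hello world.\"}]}"

If speculation is working, generation speed should noticeably exceed what the 32B model manages running alone.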


continue.dev config.json (typically at ~/.continue/config.json):

{
  "models": [
    {
      "title": "llama.cpp",
      "provider": "llama.cpp",
      "model": "Qwen2.5-Coder-32B-Instruct",
      "apiBase": "http://localhost:8991"
    }
  ]
}
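
Continue can also point tab autocomplete at the same server via a tabAutocompleteModel entry in the same config.json. A sketch; whether the llama.cpp provider is the best fit for FIM-style completion here is worth verifying against Continue's docs:

{
  "tabAutocompleteModel": {
    "title": "Qwen2.5-Coder",
    "provider": "llama.cpp",
    "model": "Qwen2.5-Coder-32B-Instruct",
    "apiBase": "http://localhost:8991"
  }
}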

