Llama 2 Quickstart Guide

Get started with Llama 2 using Koboldcpp and GGML-formatted models.

Download a Llama 2 Model

Navigate to the Files tab on the Hugging Face model page and download your desired model. The Hugging Face model pages explain the filenames and their memory requirements. You only need to download one file. For your first time, just pick the smallest model file.

Chat-style Llama 2 Models

7B: https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML
13B: https://huggingface.co/TheBloke/Llama-2-13B-chat-GGML
70B: https://huggingface.co/TheBloke/Llama-2-70B-Chat-GGML

Regular Llama 2 Models

7B: https://huggingface.co/TheBloke/Llama-2-7B-GGML
13B: https://huggingface.co/TheBloke/Llama-2-13B-GGML
70B: https://huggingface.co/TheBloke/Llama-2-70B-GGML
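
If you prefer the command line, you can also fetch a file directly with wget using Hugging Face's resolve URLs. The filename below (a q4_0 quantization of the 7B chat model) is only an illustration; check the Files tab of the repository for the exact filename you want:

  wget https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGML/resolve/main/llama-2-7b-chat.ggmlv3.q4_0.bin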

Download Koboldcpp

Get the latest koboldcpp.exe from the KoboldCpp GitHub releases page (ignore security complaints from Windows). Download the larger regular file, not the no_cuda version.
Linux and Mac users should follow the build instructions on the GitHub page.
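
As a rough sketch of a Linux/Mac build (the exact make options vary between releases, so treat this as an illustration and defer to the repository's README):

  git clone https://github.com/LostRuins/koboldcpp
  cd koboldcpp
  make LLAMA_CLBLAST=1      # build with CLBlast GPU support; plain "make" builds CPU-only
  python koboldcpp.py --help  # verify the build and list the available launch flags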

Launch Koboldcpp in GPU mode

  • When launching the exe, a GUI appears. Select the Use CLBlast Preset in the Quick Launch tab.
  • Check the Streaming Mode toggle.
  • Select your model file in the Model section of the Quick Launch tab.
  • Click Launch.

You can use launch flags for additional options that are not present in the GUI.
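
For example, a GPU launch roughly equivalent to the GUI settings above might look like the following (flag names as of mid-2023 KoboldCpp releases, and the model filename is just the illustrative one from earlier; run the program with --help to confirm the flags your version supports):

  koboldcpp.exe --model llama-2-7b-chat.ggmlv3.q4_0.bin --useclblast 0 0 --stream --contextsize 2048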

Koboldcpp Webui settings

  • Change generation parameters in "Settings" at the top of the WebUI.
    • For Chat Models, the Format should be set to Chat Mode. I recommend unchecking Multiline Replies and Continue Bot Replies.
    • For Regular Models, use Story Mode or Adventure Mode.
  • To load TavernAI-style "Character cards", click Load at the top of the WebUI and select your card.

Notes

GGML is a model format used by llama.cpp and many other clients like Koboldcpp.

Breaking changes, especially with formats like GGML, happen quite frequently. Be sure to keep your client and your knowledge up to date.

30B models of Llama 2 are not yet released at the time of writing.

Speed Up by loading layers into VRAM

  • In the Quick Launch tab, with the Use CLBlast Preset selected, you can set how many layers of the model are loaded into the GPU's VRAM (see the example command after this list).
  • To find out how many layers the model has, click Launch and check the console window.
    • You will see something like llama_model_load_internal: offloaded 41/41 layers to GPU
  • The more layers are in VRAM, the faster generation will be!
    • See the explanations on the Hugging Face model pages for how much memory a model uses.
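
For example, to offload 41 layers from the command line (again using the illustrative filename from earlier; adjust the layer count to what the console reports and to how much VRAM you have):

  koboldcpp.exe --model llama-2-7b-chat.ggmlv3.q4_0.bin --useclblast 0 0 --stream --gpulayers 41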