llama.cpp Inference
Tips
llama.cpp is a lightweight framework for running large language models, specifically optimized for CPU performance.
Thanks to the efforts of RWKV community member @MollySophia, llama.cpp now supports RWKV-6/7 models.
This guide explains how to perform inference with RWKV models using llama.cpp.
Inference with RWKV Models in llama.cpp
Download or Build llama.cpp
You can download precompiled binaries from the llama.cpp releases page.
llama.cpp offers multiple precompiled versions. Choose the appropriate package based on your GPU type:
| System | GPU Type | Package Name |
| --- | --- | --- |
| macOS | Apple Silicon | `macos-arm64.zip` |
| Windows | Intel GPU (including Arc dGPU/Xe iGPU) | `win-sycl-x64.zip` |
| Windows | NVIDIA GPU (CUDA 11.7-12.3) | `win-cuda-cu11.7-x64.zip` |
| Windows | NVIDIA GPU (CUDA 12.4+) | `win-cuda-cu12.4-x64.zip` |
| Windows | AMD and other GPUs (including AMD iGPUs) | `win-vulkan-x64.zip` |
| Windows | No GPU | `win-openblas-x64.zip` |
Alternatively, follow the official llama.cpp build instructions to compile from source.
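If you compile from source, a minimal sketch of the standard CMake workflow looks like the block below; the `-DGGML_CUDA=ON` flag is only an example for NVIDIA builds, so check the official build instructions for the backend flag that matches your hardware.

```bash
# Clone the repository and build the CLI, server, and quantize binaries (CPU backend by default)
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build                          # add e.g. -DGGML_CUDA=ON for NVIDIA GPUs (assumes the CUDA toolkit is installed)
cmake --build build --config Release -j
# The resulting binaries (llama-cli, llama-server, llama-quantize) are placed in build/bin/
```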
Obtain GGUF Format Models
llama.cpp uses models in the `.gguf` format. Use one of the methods below to obtain a `.gguf` RWKV model.
Download a pre-quantized GGUF model from the RWKV GGUF repository. Create a `models` folder in your llama.cpp directory and place the downloaded GGUF model there.
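As a sketch, one way to fetch a GGUF file from the command line is with `huggingface-cli`; the repository id below is a placeholder, and the file name simply mirrors the one used later in this guide, so substitute the actual entries from the RWKV GGUF repository.

```bash
mkdir -p models
# Placeholder repo id; the file name follows the example used later in this guide
huggingface-cli download <RWKV-GGUF-repo-id> rwkv-7-world-2.9b-Q8_0.gguf --local-dir ./models
```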
Warning
RWKV GGUF models come in various quantization levels. Higher precision (e.g., `FP16`) yields better responses but requires more resources.
Recommended priority: `FP16` > `Q8_0` > `Q5_K_M` > `Q4_K_M`. Lower quantizations (e.g., `Q3_0`, `Q2_0`) may severely degrade performance.
Alternatively, convert an RWKV Hugging Face model to GGUF yourself:
- Download an RWKV Hugging Face model from the RWKV HF Hub, such as `RWKV/RWKV7-Goose-World3-2.9B-HF`.
- Use the following command to download the source code of `llama.cpp`:
git clone https://github.com/ggml-org/llama.cpp.git
- Run the following command in the llama.cpp directory to convert the HF model to GGUF (a variant with explicit output options is sketched after the warning below):
python llama.cpp/convert_hf_to_gguf.py ./RWKV7-Goose-World3-2.9B-HF
Warning
Replace `./RWKV7-Goose-World3-2.9B-HF` with your actual RWKV HF model directory.
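The converter also accepts output options. The sketch below uses the script's standard `--outtype` and `--outfile` flags to write an `FP16` GGUF directly into the `models` folder; the output file name is only an example.

```bash
# Convert the HF checkpoint to an FP16 GGUF and write it into the models folder (example output name)
python llama.cpp/convert_hf_to_gguf.py ./RWKV7-Goose-World3-2.9B-HF \
  --outtype f16 \
  --outfile models/rwkv7-goose-world3-2.9b-f16.gguf
```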
Run RWKV Model for Chat
Execute the following command in the llama.cpp directory to start a chat session:
./llama-cli -m models/rwkv-7-world-2.9b-Q8_0.gguf -p "You are a helpful assistant" -cnv -t 8 -ngl 99 -n 500
This command runs the `models/rwkv-7-world-2.9b-Q8_0.gguf` model with 8 threads and the initial prompt `You are a helpful assistant`, generating up to 500 tokens per response.
Parameter Explanation:
- `./llama-cli`: Launches the compiled `llama-cli` executable.
- `-m models/rwkv-7-world-2.9b-Q8_0.gguf`: Path to the model.
- `-p "You are a helpful assistant"`: Initial prompt to start the chat.
- `-cnv`: Enables chat mode (the default, so it can be omitted).
- `-t 8`: Thread count (adjust based on available CPU cores).
- `-ngl 99`: Loads all model layers onto the GPU.
- `-n 500`: Maximum number of tokens to generate.
Tips
For a full list of parameters, see the llama.cpp documentation.
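If you want to tune generation further, the variant below layers a few common llama.cpp options onto the same chat command; `-c` (context size), `--temp`, and `--top-p` are standard llama-cli flags, and the values shown are only illustrative.

```bash
# Same chat session, with an explicit context window and sampling settings (illustrative values)
./llama-cli -m models/rwkv-7-world-2.9b-Q8_0.gguf \
  -p "You are a helpful assistant" \
  -t 8 -ngl 99 -n 500 \
  -c 4096 --temp 0.8 --top-p 0.9
```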
Additional Features (Optional)
Enable Completion Mode
Tips
By default, `./llama-cli` uses chat mode. Add `-no-cnv` to switch to completion mode, where the model extends the given prompt.
./llama-cli -m models/rwkv-7-world-2.9b-Q8_0.gguf -p "User: What's Spring Festival.\n\nAssistant:" -no-cnv -t 8 -ngl 99 -n 500
-p "User: What's Spring Festival.\n\nAssistant:"
: The "prompt" parameter enables the model to continue writing based on this prompt word.-no-cnv
: Disables chat mode for completion.
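For longer prompts it can be more convenient to keep the text in a file; the sketch below relies on llama-cli's standard `-f` (prompt file) option, and `prompt.txt` is just a hypothetical file name.

```bash
# Store the completion prompt in a file (hypothetical name) and pass it with -f instead of -p
printf 'User: What is the Spring Festival?\n\nAssistant:' > prompt.txt
./llama-cli -m models/rwkv-7-world-2.9b-Q8_0.gguf -f prompt.txt -no-cnv -t 8 -ngl 99 -n 500
```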
Launch Web Service (Recommended)
Start a web server with:
./llama-server -m models/rwkv-7-world-2.9b-Q8_0.gguf -ngl 99
Access the Web UI at http://127.0.0.1:8080.
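llama-server also exposes an OpenAI-compatible HTTP API, so the running model can be queried programmatically. The request below is a minimal sketch against the default address and the `/v1/chat/completions` endpoint, with illustrative message content.

```bash
# Minimal OpenAI-compatible chat request against the local llama-server (default port 8080)
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {"role": "system", "content": "You are a helpful assistant"},
          {"role": "user", "content": "What is the Spring Festival?"}
        ],
        "max_tokens": 500
      }'
```

If you need to serve on a different address, llama-server accepts `--host` and `--port` options.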
Quantize GGUF Models
Use `./llama-quantize` to quantize `fp16` or `fp32` GGUF models. Example:
./llama-quantize models/rwkv-7-world-2.9b-F16.gguf models/rwkv-7-world-2.9b-Q8_0.gguf Q8_0
Warning
Input models must be `fp32` or `fp16`. Recommended quantizations: `Q5_1`, `Q8_0`.
Run `./llama-quantize --help` to view all quantization options.
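As a further sketch, the same tool can produce the other recommended level, `Q5_1`, from the `FP16` model; the file names follow the pattern used in the examples above and should be adjusted to your actual paths.

```bash
# Quantize the FP16 GGUF to Q5_1 (file names follow the examples above; adjust to your paths)
./llama-quantize models/rwkv-7-world-2.9b-F16.gguf models/rwkv-7-world-2.9b-Q5_1.gguf Q5_1
```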