GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with flexible quantization options.
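To make concrete what a GGUF file carries, the sketch below reads the fixed header fields (magic, version, tensor count, metadata key/value count) that the GGUF specification in the llama.cpp repository defines for v2/v3 files. It is a minimal illustration; the file path is a placeholder.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF (v2/v3) header: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)                               # b"GGUF" for valid files
        if magic != b"GGUF":
            raise ValueError(f"Not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))       # uint32, little-endian
        n_tensors, = struct.unpack("<Q", f.read(8))     # uint64 tensor count
        n_kv, = struct.unpack("<Q", f.read(8))          # uint64 metadata key/value count
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Example (path is a placeholder):
# print(read_gguf_header("model-q4_k_m.gguf"))
```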
When to use GGUF
Use GGUF when:
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference with no GPU required
- Choosing among flexible quantization levels (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality
Use alternatives instead:
- AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
- HQQ: Fast calibration-free quantization for HuggingFace
- bitsandbytes: Simple integration with transformers library
- TensorRT-LLM: Production NVIDIA deployment with maximum speed
Quick start
Installation
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```
Convert model to GGUF
```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```
Quantize model
```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
Run inference
```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
Quantization types
K-quant methods (recommended)
| Type | Bits/weight | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
Legacy methods
| Type | Description |
|---|---|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
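To make "basic" vs. "with delta" concrete: both schemes quantize weights in small blocks, but the basic variant stores one scale per block while the delta variant stores a scale plus a minimum offset. The sketch below is a simplified illustration of that idea only, not the exact ggml block layout or rounding.

```python
import numpy as np

def quantize_scale_only(block: np.ndarray) -> np.ndarray:
    """Q4_0-style idea: one scale per block, symmetric 4-bit values in [-8, 7]."""
    amax = float(np.abs(block).max())
    scale = amax / 7 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7)
    return q * scale  # dequantized approximation

def quantize_scale_min(block: np.ndarray) -> np.ndarray:
    """Q4_1-style idea: scale plus a minimum (the 'delta'), unsigned 4-bit values in [0, 15]."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15 if hi > lo else 1.0
    q = np.clip(np.round((block - lo) / scale), 0, 15)
    return q * scale + lo

# Compare reconstruction error on one 32-weight block
block = np.random.randn(32).astype(np.float32)
print("scale-only error:", np.abs(block - quantize_scale_only(block)).mean())
print("scale+min error: ", np.abs(block - quantize_scale_min(block)).mean())
```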
Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
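To sanity-check the sizes in the tables above, quantized file size can be approximated from parameter count and effective bits per weight, plus a small overhead for higher-precision layers and metadata. The helper below is a rough estimate only; the bit widths mirror the K-quant table.

```python
# Rough GGUF size estimate: parameters x effective bits per weight / 8,
# plus ~5% overhead for higher-precision layers and metadata.
BITS_PER_WEIGHT = {
    "Q2_K": 2.5, "Q3_K_S": 3.0, "Q3_K_M": 3.3,
    "Q4_K_S": 4.0, "Q4_K_M": 4.5, "Q5_K_S": 5.0,
    "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 8.0,
}

def estimate_size_gb(n_params: float, quant: str, overhead: float = 1.05) -> float:
    """Approximate quantized model size in GB for n_params weights."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9 * overhead

# Example: a 7B model at Q4_K_M -> roughly 4.1 GB, matching the table above
print(f"{estimate_size_gb(7e9, 'Q4_K_M'):.1f} GB")
```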
Conversion workflows
Workflow 1: HuggingFace to GGUF
```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
Workflow 2: With importance matrix (better quality)
```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```
Workflow 3: Multiple quantizations
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
Python usage
llama-cpp-python
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
Chat completion
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```
Streaming
```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
Server mode
Start OpenAI-compatible server
```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```
Use with OpenAI client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```
Hardware optimization
Apple Silicon (Metal)
```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
from llama_cpp import Llama

# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```
NVIDIA CUDA
```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
CPU optimization
```bash
# Build with AVX2/AVX512
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
from llama_cpp import Llama

# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
Integration with tools
Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```
LM Studio
- Place the GGUF file in ~/.cache/lm-studio/models/
- Open LM Studio and select the model
- Configure context length and GPU offload
- Start inference
text-generation-webui
```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
Best practices
- Use K-quants: Q4_K_M offers best quality/size balance
- Use imatrix: Always use importance matrix for Q4 and below
- GPU offload: Offload as many layers as VRAM allows
- Context length: Start with 4096, increase if needed
- Thread count: Match physical CPU cores, not logical
- Batch size: Increase n_batch for faster prompt processing (a combined configuration sketch follows this list)
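As a rough illustration, these practices can be combined into a single llama-cpp-python configuration. The values below (35 GPU layers, 8 threads, 4096 context, Q4_K_M file name) are placeholder settings to adapt to your hardware, not measured optima.

```python
from llama_cpp import Llama

# Example configuration applying the practices above; tune values to your machine.
llm = Llama(
    model_path="./model-q4_k_m.gguf",  # K-quant (Q4_K_M), ideally built with an imatrix
    n_ctx=4096,                        # Start at 4096, raise only if prompts need it
    n_gpu_layers=35,                   # Offload as many layers as VRAM allows (99 = all)
    n_threads=8,                       # Match physical CPU cores, not logical
    n_batch=512                        # Larger batch speeds up prompt processing
)

output = llm("Summarize what GGUF is in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```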
Common issues
Model loads slowly:
```bash
# Memory-mapped loading (mmap) is on by default; make sure --no-mmap is not set
# Optionally lock the model in RAM so it is not swapped out
./llama-cli -m model.gguf --mlock
```
Out of memory:
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```
Poor quality at low bits:
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
References
- Advanced Usage - Batching, speculative decoding, custom builds
- Troubleshooting - Common issues, debugging, benchmarks
Resources
- Repository: https://github.com/ggml-org/llama.cpp
- Python Bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized Models: https://huggingface.co/TheBloke
- GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT