GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard model file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with flexible quantization options.
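To make concrete what a GGUF file carries, the sketch below reads the fixed header fields (magic, version, tensor count, metadata key/value count) that the GGUF specification in the llama.cpp repository defines for v2/v3 files. It is a minimal illustration; the file path is a placeholder.

```python
import struct

def read_gguf_header(path: str) -> dict:
    """Read the fixed-size GGUF (v2/v3) header: magic, version, tensor count, KV count."""
    with open(path, "rb") as f:
        magic = f.read(4)                               # b"GGUF" for valid files
        if magic != b"GGUF":
            raise ValueError(f"Not a GGUF file: magic={magic!r}")
        version, = struct.unpack("<I", f.read(4))       # uint32, little-endian
        n_tensors, = struct.unpack("<Q", f.read(8))     # uint64 tensor count
        n_kv, = struct.unpack("<Q", f.read(8))          # uint64 metadata key/value count
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Example (path is a placeholder):
# print(read_gguf_header("model-q4_k_m.gguf"))
```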
When to use GGUF
Use GGUF when:
- Deploying on consumer hardware (laptops, desktops)
- Running on Apple Silicon (M1/M2/M3) with Metal acceleration
- Running CPU-only inference with no GPU required
- Choosing among flexible quantization levels (Q2_K to Q8_0)
- Using local AI tools (LM Studio, Ollama, text-generation-webui)
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality
Use alternatives instead:
- AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
- HQQ: Fast calibration-free quantization for HuggingFace
- bitsandbytes: Simple integration with transformers library
- TensorRT-LLM: Production NVIDIA deployment with maximum speed
Quick start
Installation
```bash
# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU)
make

# Build with CUDA (NVIDIA)
make GGML_CUDA=1

# Build with Metal (Apple Silicon)
make GGML_METAL=1

# Install Python bindings (optional)
pip install llama-cpp-python
```
Convert model to GGUF
```bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16
```
Quantize model
```bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
Run inference
```bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"
```
Quantization types
K-quant methods (recommended)
| Type | Bits/weight | Size (7B) | Quality | Use Case |
|---|---|---|---|---|
| Q2_K | 2.5 | ~2.8 GB | Low | Extreme compression |
| Q3_K_S | 3.0 | ~3.0 GB | Low-Med | Memory constrained |
| Q3_K_M | 3.3 | ~3.3 GB | Medium | Balance |
| Q4_K_S | 4.0 | ~3.8 GB | Med-High | Good balance |
| Q4_K_M | 4.5 | ~4.1 GB | High | Recommended default |
| Q5_K_S | 5.0 | ~4.6 GB | High | Quality focused |
| Q5_K_M | 5.5 | ~4.8 GB | Very High | High quality |
| Q6_K | 6.0 | ~5.5 GB | Excellent | Near-original |
| Q8_0 | 8.0 | ~7.2 GB | Best | Maximum quality |
Legacy methods
| Type | Description |
|---|---|
| Q4_0 | 4-bit, basic |
| Q4_1 | 4-bit with delta |
| Q5_0 | 5-bit, basic |
| Q5_1 | 5-bit with delta |
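To make "basic" vs. "with delta" concrete: both schemes quantize weights in small blocks, but the basic variant stores one scale per block while the delta variant stores a scale plus a minimum offset. The sketch below is a simplified illustration of that idea only, not the exact ggml block layout or rounding.

```python
import numpy as np

def quantize_scale_only(block: np.ndarray) -> np.ndarray:
    """Q4_0-style idea: one scale per block, symmetric 4-bit values in [-8, 7]."""
    amax = float(np.abs(block).max())
    scale = amax / 7 if amax > 0 else 1.0
    q = np.clip(np.round(block / scale), -8, 7)
    return q * scale  # dequantized approximation

def quantize_scale_min(block: np.ndarray) -> np.ndarray:
    """Q4_1-style idea: scale plus a minimum (the 'delta'), unsigned 4-bit values in [0, 15]."""
    lo, hi = float(block.min()), float(block.max())
    scale = (hi - lo) / 15 if hi > lo else 1.0
    q = np.clip(np.round((block - lo) / scale), 0, 15)
    return q * scale + lo

# Compare reconstruction error on one 32-weight block
block = np.random.randn(32).astype(np.float32)
print("scale-only error:", np.abs(block - quantize_scale_only(block)).mean())
print("scale+min error: ", np.abs(block - quantize_scale_min(block)).mean())
```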
Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.
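To sanity-check the sizes in the tables above, quantized file size can be approximated from parameter count and effective bits per weight, plus a small overhead for higher-precision layers and metadata. The helper below is a rough estimate only; the bit widths mirror the K-quant table.

```python
# Rough GGUF size estimate: parameters x effective bits per weight / 8,
# plus ~5% overhead for higher-precision layers and metadata.
BITS_PER_WEIGHT = {
    "Q2_K": 2.5, "Q3_K_S": 3.0, "Q3_K_M": 3.3,
    "Q4_K_S": 4.0, "Q4_K_M": 4.5, "Q5_K_S": 5.0,
    "Q5_K_M": 5.5, "Q6_K": 6.0, "Q8_0": 8.0,
}

def estimate_size_gb(n_params: float, quant: str, overhead: float = 1.05) -> float:
    """Approximate quantized model size in GB for n_params weights."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params * bits / 8 / 1e9 * overhead

# Example: a 7B model at Q4_K_M -> roughly 4.1 GB, matching the table above
print(f"{estimate_size_gb(7e9, 'Q4_K_M'):.1f} GB")
```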
Conversion workflows
Workflow 1: HuggingFace to GGUF
```bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50
```
Workflow 2: With importance matrix (better quality)
```bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples)
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
# Add more diverse text samples...
EOF

# 3. Generate importance matrix
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    --chunk 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
```
Workflow 3: Multiple quantizations
```bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m $MODEL -f wiki.txt -o $IMATRIX -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    OUTPUT="llama-3.1-8b-${QUANT,,}.gguf"
    ./llama-quantize --imatrix $IMATRIX $MODEL $OUTPUT $QUANT
    echo "Created: $OUTPUT ($(du -h $OUTPUT | cut -f1))"
done
```
Python usage
llama-cpp-python
```python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,       # Context window
    n_gpu_layers=35,  # GPU offload (0 for CPU only)
    n_threads=8       # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])
```
Chat completion
```python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])
```
Streaming
```python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
```
Server mode
Start OpenAI-compatible server
```bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080
```
Use with OpenAI client
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
```
Hardware optimization
Apple Silicon (Metal)
```bash
# Build with Metal
make clean && make GGML_METAL=1

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"
```

```python
from llama_cpp import Llama

# Python with Metal
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)
```
NVIDIA CUDA
```bash
# Build with CUDA
make clean && make GGML_CUDA=1

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35
```
CPU optimization
```bash
# Build with AVX2/AVX512
make clean && make

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"
```

```python
from llama_cpp import Llama

# Python CPU config
llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)
```
Integration with tools
Ollama
```bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
```
LM Studio
- Place the GGUF file in ~/.cache/lm-studio/models/
- Open LM Studio and select the model
- Configure context length and GPU offload
- Start inference
text-generation-webui
```bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35
```
Best practices
- Use K-quants: Q4_K_M offers best quality/size balance
- Use imatrix: Always use importance matrix for Q4 and below
- GPU offload: Offload as many layers as VRAM allows
- Context length: Start with 4096, increase if needed
- Thread count: Match physical CPU cores, not logical
- Batch size: Increase n_batch for faster prompt processing (a combined configuration sketch follows this list)
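As a rough illustration, these practices can be combined into a single llama-cpp-python configuration. The values below (35 GPU layers, 8 threads, 4096 context, Q4_K_M file name) are placeholder settings to adapt to your hardware, not measured optima.

```python
from llama_cpp import Llama

# Example configuration applying the practices above; tune values to your machine.
llm = Llama(
    model_path="./model-q4_k_m.gguf",  # K-quant (Q4_K_M), ideally built with an imatrix
    n_ctx=4096,                        # Start at 4096, raise only if prompts need it
    n_gpu_layers=35,                   # Offload as many layers as VRAM allows (99 = all)
    n_threads=8,                       # Match physical CPU cores, not logical
    n_batch=512                        # Larger batch speeds up prompt processing
)

output = llm("Summarize what GGUF is in one sentence.", max_tokens=64)
print(output["choices"][0]["text"])
```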
Common issues
Model loads slowly:
```bash
# Memory-mapped loading (mmap) is on by default; make sure --no-mmap is not set
# Optionally lock the model in RAM so it is not swapped out
./llama-cli -m model.gguf --mlock
```
Out of memory:
```bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
```
Poor quality at low bits:
```bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
```
References
- Advanced Usage - Batching, speculative decoding, custom builds
- Troubleshooting - Common issues, debugging, benchmarks
Resources
- Repository: https://github.com/ggml-org/llama.cpp
- Python Bindings: https://github.com/abetlen/llama-cpp-python
- Pre-quantized Models: https://huggingface.co/TheBloke
- GGUF Converter: https://huggingface.co/spaces/ggml-org/gguf-my-repo
- License: MIT