gguf-quantization — Categories.community

v1.0.0

About this Skill

Ideal for AI Agents like Claude Code, AutoGPT, and LangChain requiring efficient inference and flexible quantization on various hardware platforms.

majiayu000
Updated: 2/20/2026

Installation
> npx killer-skills add majiayu000/claude-skill-registry/gguf-quantization

Agent Capability Analysis

The gguf-quantization skill by majiayu000 is an open-source community skill for Claude and other AI agents. It covers converting, quantizing, and running GGUF models with llama.cpp.

Core Value

Empowers agents to utilize the GGUF format for efficient CPU and GPU inference, enabling flexible quantization options from Q2_K to Q8_0, and leveraging Metal acceleration on Apple Silicon devices.

Capabilities Granted by the gguf-quantization Skill

Deploying AI models on consumer hardware with optimized performance
Running AI inference on Apple Silicon devices with Metal acceleration
Enabling CPU-based inference without GPU requirements
Implementing flexible quantization strategies for improved model efficiency

Prerequisites & Limits

  • Requires compatibility with llama.cpp
  • Limited to specific hardware platforms (CPUs, Apple Silicon, GPUs)
GGUF - Quantization Format for llama.cpp

GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with flexible quantization options.
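
To make the file format concrete, here is a minimal sketch of reading the fixed-size GGUF header. It assumes a little-endian file of format version 2 or later, where the 4-byte magic `GGUF` is followed by a 32-bit version and two 64-bit counts. The function name `read_gguf_header` and the sample values are illustrative only, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size header found at the start of a GGUF file."""
    if len(data) < HEADER_SIZE:
        raise ValueError("need at least 24 bytes to read the header")
    magic = data[:4]
    if magic != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file: magic={magic!r}")
    # Little-endian: uint32 version, uint64 tensor count, uint64 metadata KV count
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": kv_count,
    }


# Synthetic header for illustration (not taken from a real model file)
sample = GGUF_MAGIC + struct.pack("<IQQ", 3, 291, 24)
header = read_gguf_header(sample)
print(header)
```

For a real file, pass the first 24 bytes, for example `open(path, "rb").read(24)`.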

When to use GGUF

Use GGUF when:

  • Deploying on consumer hardware (laptops, desktops)
  • Running on Apple Silicon (M1/M2/M3) with Metal acceleration
  • Needing CPU inference without a GPU
  • Wanting flexible quantization (Q2_K to Q8_0)
  • Using local AI tools (LM Studio, Ollama, text-generation-webui)

Key advantages:

  • Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
  • No Python runtime: Pure C/C++ inference
  • Flexible quantization: 2-8 bit with various methods (K-quants)
  • Ecosystem support: LM Studio, Ollama, koboldcpp, and more
  • imatrix: Importance matrix for better low-bit quality

Use alternatives instead:

  • AWQ/GPTQ: Maximum accuracy with calibration on NVIDIA GPUs
  • HQQ: Fast calibration-free quantization for HuggingFace
  • bitsandbytes: Simple integration with transformers library
  • TensorRT-LLM: Production NVIDIA deployment with maximum speed

Quick start

Installation

bash
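# Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

# Build (CPU). Current llama.cpp builds with CMake; binaries land in build/bin/,
# so run the later commands from there or adjust the paths.
cmake -B build
cmake --build build --config Release

# Build with CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Build with Metal (Apple Silicon; enabled by default on macOS)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# Install Python bindings (optional)
pip install llama-cpp-python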

Convert model to GGUF

bash
# Install requirements
pip install -r requirements.txt

# Convert HuggingFace model to GGUF (FP16)
python convert_hf_to_gguf.py ./path/to/model --outfile model-f16.gguf

# Or specify output type
python convert_hf_to_gguf.py ./path/to/model \
    --outfile model-f16.gguf \
    --outtype f16

Quantize model

bash
# Basic quantization to Q4_K_M
./llama-quantize model-f16.gguf model-q4_k_m.gguf Q4_K_M

# Quantize with importance matrix (better quality)
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M

Run inference

bash
# CLI inference
./llama-cli -m model-q4_k_m.gguf -p "Hello, how are you?"

# Interactive mode
./llama-cli -m model-q4_k_m.gguf --interactive

# With GPU offload
./llama-cli -m model-q4_k_m.gguf -ngl 35 -p "Hello!"

Quantization types

K-quant methods (recommended)

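| Type   | Bits (approx.) | Size (7B) | Quality   | Use Case                         |
|--------|----------------|-----------|-----------|----------------------------------|
| Q2_K   | 2.5            | ~2.8 GB   | Low       | Extreme compression              |
| Q3_K_S | 3.0            | ~3.0 GB   | Low-Med   | Memory constrained               |
| Q3_K_M | 3.3            | ~3.3 GB   | Medium    | Balance                          |
| Q4_K_S | 4.0            | ~3.8 GB   | Med-High  | Good balance                     |
| Q4_K_M | 4.5            | ~4.1 GB   | High      | Recommended default              |
| Q5_K_S | 5.0            | ~4.6 GB   | High      | Quality focused                  |
| Q5_K_M | 5.5            | ~4.8 GB   | Very High | High quality                     |
| Q6_K   | 6.0            | ~5.5 GB   | Excellent | Near-original                    |
| Q8_0   | 8.0            | ~7.2 GB   | Best      | Maximum quality (not a K-quant)  |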
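
The sizes in the table can be roughly reproduced from the bit widths: file size is about parameters × bits per weight ÷ 8. The sketch below shows that arithmetic. The helper name `estimate_gguf_size_gb` is hypothetical, and the estimate ignores metadata and the fact that real files mix tensor types, so expect it to differ from the table by some margin.

```python
def estimate_gguf_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Rough file size in GB: parameters * bits per weight / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9


# Nominal 7B-parameter model with bit widths taken from the table above
for name, bpw in [("Q4_K_M", 4.5), ("Q5_K_M", 5.5), ("Q8_0", 8.0)]:
    print(f"{name}: ~{estimate_gguf_size_gb(7e9, bpw):.1f} GB")
```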

Legacy methods

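| Type | Description                          |
|------|--------------------------------------|
| Q4_0 | 4-bit, scale only                    |
| Q4_1 | 4-bit, scale plus minimum (offset)   |
| Q5_0 | 5-bit, scale only                    |
| Q5_1 | 5-bit, scale plus minimum (offset)   |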
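
The difference between the `_0` and `_1` legacy types is easier to see in code. The following is a simplified pure-Python sketch of block-wise 4-bit quantization: one variant stores only a scale per block, the other stores a scale and a minimum. It loosely follows the scheme used by ggml's reference quantizers but is not bit-exact, and all function names are illustrative.

```python
import math

BLOCK_SIZE = 32  # the legacy 4-bit formats quantize weights in blocks of 32


def quantize_q4_0_like(block):
    """Scale-only 4-bit quantization: x is approximated by (q - 8) * d."""
    extreme = max(block, key=abs)  # largest magnitude, sign preserved
    d = extreme / -8.0
    if d == 0.0:
        return 0.0, [8] * len(block)
    return d, [min(15, int(x / d + 8.5)) for x in block]


def dequantize_q4_0_like(d, q):
    return [(v - 8) * d for v in q]


def quantize_q4_1_like(block):
    """Scale-plus-minimum 4-bit quantization: x is approximated by q * d + m."""
    lo, hi = min(block), max(block)
    d = (hi - lo) / 15.0
    if d == 0.0:
        return 0.0, lo, [0] * len(block)
    return d, lo, [min(15, int((x - lo) / d + 0.5)) for x in block]


def dequantize_q4_1_like(d, m, q):
    return [v * d + m for v in q]


def max_abs_error(original, restored):
    return max(abs(a - b) for a, b in zip(original, restored))


# Deterministic sample block with a non-zero mean
block = [math.sin(i) + 0.5 for i in range(BLOCK_SIZE)]

d0, q0 = quantize_q4_0_like(block)
d1, m1, q1 = quantize_q4_1_like(block)
err0 = max_abs_error(block, dequantize_q4_0_like(d0, q0))
err1 = max_abs_error(block, dequantize_q4_1_like(d1, m1, q1))
print(f"scale only:       max abs error {err0:.4f}")
print(f"scale + minimum:  max abs error {err1:.4f}")
```

Storing a minimum costs extra bytes per block but lets the 16 levels cover only the range the block actually uses, which helps when values are not centered on zero.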

Recommendation: Use K-quant methods (Q4_K_M, Q5_K_M) for best quality/size ratio.

Conversion workflows

Workflow 1: HuggingFace to GGUF

bash
# 1. Download model
huggingface-cli download meta-llama/Llama-3.1-8B --local-dir ./llama-3.1-8b

# 2. Convert to GGUF (FP16)
python convert_hf_to_gguf.py ./llama-3.1-8b \
    --outfile llama-3.1-8b-f16.gguf \
    --outtype f16

# 3. Quantize
./llama-quantize llama-3.1-8b-f16.gguf llama-3.1-8b-q4_k_m.gguf Q4_K_M

# 4. Test
./llama-cli -m llama-3.1-8b-q4_k_m.gguf -p "Hello!" -n 50

Workflow 2: With importance matrix (better quality)

bash
# 1. Convert to GGUF
python convert_hf_to_gguf.py ./model --outfile model-f16.gguf

# 2. Create calibration text (diverse samples).
#    Add many more samples in practice: the file needs enough text
#    to fill at least a few 512-token chunks.
cat > calibration.txt << 'EOF'
The quick brown fox jumps over the lazy dog.
Machine learning is a subset of artificial intelligence.
Python is a popular programming language.
EOF

# 3. Generate importance matrix
#    -c sets the chunk size in tokens (--chunk would instead skip initial chunks)
./llama-imatrix -m model-f16.gguf \
    -f calibration.txt \
    -c 512 \
    -o model.imatrix \
    -ngl 35  # GPU layers if available

# 4. Quantize with imatrix
./llama-quantize --imatrix model.imatrix \
    model-f16.gguf \
    model-q4_k_m.gguf \
    Q4_K_M
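
The three-sentence sample above is far too short to fill even one chunk, so it helps to check roughly how many chunks a calibration file will yield before running the imatrix step. This sketch assumes chunks of 512 tokens and a rule of thumb of about 4 characters per token for English text; both numbers are assumptions, the real count depends on the model's tokenizer, and `estimate_chunks` is a hypothetical helper.

```python
def estimate_chunks(text: str, chunk_tokens: int = 512,
                    chars_per_token: float = 4.0) -> int:
    """Very rough count of full chunks a calibration text would produce."""
    approx_tokens = len(text) / chars_per_token
    return int(approx_tokens // chunk_tokens)


sample = (
    "The quick brown fox jumps over the lazy dog.\n"
    "Machine learning is a subset of artificial intelligence.\n"
    "Python is a popular programming language.\n"
)
print("chunks from the sample text:", estimate_chunks(sample))
print("chunks from 1 MB of text:   ", estimate_chunks("x" * 1_000_000))
```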

Workflow 3: Multiple quantizations

bash
#!/bin/bash
MODEL="llama-3.1-8b-f16.gguf"
IMATRIX="llama-3.1-8b.imatrix"

# Generate imatrix once
./llama-imatrix -m "$MODEL" -f wiki.txt -o "$IMATRIX" -ngl 35

# Create multiple quantizations
for QUANT in Q4_K_M Q5_K_M Q6_K Q8_0; do
    # tr also works on macOS's stock bash 3.2, where ${QUANT,,} is unavailable
    SUFFIX=$(echo "$QUANT" | tr '[:upper:]' '[:lower:]')
    OUTPUT="llama-3.1-8b-${SUFFIX}.gguf"
    ./llama-quantize --imatrix "$IMATRIX" "$MODEL" "$OUTPUT" "$QUANT"
    echo "Created: $OUTPUT ($(du -h "$OUTPUT" | cut -f1))"
done
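
The same loop can be sketched in Python, which may be convenient when the quantization step is driven from a larger script. The helper below only builds the argument lists and does not run anything. The name `build_quantize_commands` and the assumption that the input file ends in `-f16.gguf` are illustrative.

```python
from pathlib import Path


def build_quantize_commands(model: str, imatrix: str, quants: list[str],
                            binary: str = "./llama-quantize") -> list[list[str]]:
    """Return one llama-quantize argument list per quantization type."""
    # Assumes the FP16 input is named "<stem>-f16.gguf"
    stem = Path(model).name.removesuffix("-f16.gguf")
    commands = []
    for quant in quants:
        output = f"{stem}-{quant.lower()}.gguf"
        commands.append([binary, "--imatrix", imatrix, model, output, quant])
    return commands


commands = build_quantize_commands(
    "llama-3.1-8b-f16.gguf",
    "llama-3.1-8b.imatrix",
    ["Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0"],
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` once the binary and input files exist.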

Python usage

llama-cpp-python

python
from llama_cpp import Llama

# Load model
llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,        # Context window
    n_gpu_layers=35,   # GPU offload (0 for CPU only)
    n_threads=8        # CPU threads
)

# Generate
output = llm(
    "What is machine learning?",
    max_tokens=256,
    temperature=0.7,
    stop=["</s>", "\n\n"]
)
print(output["choices"][0]["text"])

Chat completion

python
from llama_cpp import Llama

llm = Llama(
    model_path="./model-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=35,
    chat_format="llama-3"  # Or "chatml", "mistral", etc.
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"}
]

response = llm.create_chat_completion(
    messages=messages,
    max_tokens=256,
    temperature=0.7
)
print(response["choices"][0]["message"]["content"])

Streaming

python
from llama_cpp import Llama

llm = Llama(model_path="./model-q4_k_m.gguf", n_gpu_layers=35)

# Stream tokens
for chunk in llm(
    "Explain quantum computing:",
    max_tokens=256,
    stream=True
):
    print(chunk["choices"][0]["text"], end="", flush=True)
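
When the full response is needed after streaming, the chunks can be accumulated as they arrive. The sketch below assumes each chunk has the `chunk["choices"][0]["text"]` shape used above; `collect_stream` and `fake_stream` are hypothetical helpers, with the fake generator standing in for a real model so the example runs without one.

```python
from typing import Iterable, Iterator


def collect_stream(chunks: Iterable[dict]) -> str:
    """Print each streamed piece as it arrives and return the joined text."""
    parts = []
    for chunk in chunks:
        piece = chunk["choices"][0]["text"]
        print(piece, end="", flush=True)
        parts.append(piece)
    print()
    return "".join(parts)


def fake_stream() -> Iterator[dict]:
    """Stand-in for llm(prompt, stream=True); yields chunks of the same shape."""
    for piece in ["Quantum ", "computing ", "uses qubits."]:
        yield {"choices": [{"text": piece}]}


full_text = collect_stream(fake_stream())
```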

Server mode

Start OpenAI-compatible server

bash
# Start server
./llama-server -m model-q4_k_m.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 35 \
    -c 4096

# Or with Python bindings
python -m llama_cpp.server \
    --model model-q4_k_m.gguf \
    --n_gpu_layers 35 \
    --host 0.0.0.0 \
    --port 8080

Use with OpenAI client

python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="local-model",
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=256
)
print(response.choices[0].message.content)
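
If installing the `openai` package is not an option, the same request can be built with the standard library. This sketch assumes the server exposes the usual OpenAI-style `/chat/completions` route under the base URL shown above. `build_chat_request` is a hypothetical helper that only constructs the request; sending it requires a running server.

```python
import json
import urllib.request


def build_chat_request(base_url: str, messages: list, max_tokens: int = 256,
                       model: str = "local-model") -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completion endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://localhost:8080/v1",
    [{"role": "user", "content": "Hello!"}],
)
print(request.get_method(), request.full_url)

# To send it (requires the server from the previous step to be running):
# with urllib.request.urlopen(request) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```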

Hardware optimization

Apple Silicon (Metal)

bash
# Build with Metal (enabled by default on macOS)
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# Run with Metal acceleration
./llama-cli -m model.gguf -ngl 99 -p "Hello"

python
# Python with Metal
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=99,  # Offload all layers
    n_threads=1       # Metal handles parallelism
)

NVIDIA CUDA

bash
# Build with CUDA
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

# Run with CUDA
./llama-cli -m model.gguf -ngl 35 -p "Hello"

# Specify GPU
CUDA_VISIBLE_DEVICES=0 ./llama-cli -m model.gguf -ngl 35

CPU optimization

bash
# Build for CPU (native SIMD support such as AVX2/AVX-512 is detected at build time)
cmake -B build
cmake --build build --config Release

# Run with optimal threads
./llama-cli -m model.gguf -t 8 -p "Hello"

python
# Python CPU config
from llama_cpp import Llama

llm = Llama(
    model_path="model.gguf",
    n_gpu_layers=0,  # CPU only
    n_threads=8,     # Match physical cores
    n_batch=512      # Batch size for prompt processing
)

Integration with tools

Ollama

bash
# Create Modelfile
cat > Modelfile << 'EOF'
FROM ./model-q4_k_m.gguf
TEMPLATE """{{ .System }}
{{ .Prompt }}"""
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF

# Create Ollama model
ollama create mymodel -f Modelfile

# Run
ollama run mymodel "Hello!"
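
When several quantizations need their own Modelfile, generating the text from a script avoids copy-paste errors. The sketch below renders the same fields as the heredoc above. `render_modelfile` is a hypothetical helper and the template is kept identical to the example rather than tailored to any specific model.

```python
def render_modelfile(gguf_path: str, temperature: float = 0.7,
                     num_ctx: int = 4096) -> str:
    """Render an Ollama Modelfile with the same fields as the example above."""
    lines = [
        f"FROM {gguf_path}",
        'TEMPLATE """{{ .System }}',
        '{{ .Prompt }}"""',
        f"PARAMETER temperature {temperature}",
        f"PARAMETER num_ctx {num_ctx}",
    ]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile("./model-q4_k_m.gguf")
print(modelfile_text, end="")
```

Writing `modelfile_text` to a file named `Modelfile` gives the input expected by `ollama create mymodel -f Modelfile`.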

LM Studio

  1. Place GGUF file in ~/.cache/lm-studio/models/
  2. Open LM Studio and select the model
  3. Configure context length and GPU offload
  4. Start inference

text-generation-webui

bash
# Place in models folder
cp model-q4_k_m.gguf text-generation-webui/models/

# Start with llama.cpp loader
python server.py --model model-q4_k_m.gguf --loader llama.cpp --n-gpu-layers 35

Best practices

  1. Use K-quants: Q4_K_M offers best quality/size balance
  2. Use imatrix: Always use importance matrix for Q4 and below
  3. GPU offload: Offload as many layers as VRAM allows
  4. Context length: Start with 4096, increase if needed
  5. Thread count: Match physical CPU cores, not logical
  6. Batch size: Increase n_batch for faster prompt processing
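
As a starting point for the GPU offload setting, the number of layers that fit can be estimated from the model file size and available VRAM. This is a rough sketch only: it assumes all layers are the same size, ignores the embedding and output tensors, and uses a made-up default reserve of 1.5 GB for the KV cache and scratch buffers. `estimate_gpu_layers` is a hypothetical helper, not a llama.cpp function.

```python
def estimate_gpu_layers(model_size_gb: float, n_layers: int, vram_gb: float,
                        reserve_gb: float = 1.5) -> int:
    """Estimate how many layers fit in VRAM, leaving reserve_gb free."""
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(0.0, vram_gb - reserve_gb)
    return min(n_layers, int(usable_gb / per_layer_gb))


# A ~4.1 GB Q4_K_M file with 32 layers on GPUs of different sizes
for vram in (4.0, 8.0):
    layers = estimate_gpu_layers(4.1, 32, vram)
    print(f"{vram:.0f} GB VRAM: try -ngl {layers}")
```

Treat the result as a first guess and lower it if loading still fails.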

Common issues

Model loads slowly:

bash
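# mmap is enabled by default; make sure --no-mmap is not set
./llama-cli -m model.gguf -p "Hello"
# Optionally lock the model in RAM so it is not swapped out
./llama-cli -m model.gguf --mlock -p "Hello"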

Out of memory:

bash
# Reduce GPU layers
./llama-cli -m model.gguf -ngl 20  # Reduce from 35

# Or use smaller quantization
./llama-quantize model-f16.gguf model-q3_k_m.gguf Q3_K_M
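
Context length also contributes to memory use through the KV cache, which grows linearly with `n_ctx`. The sketch below estimates its size for an unquantized FP16 cache. The shape values (32 layers, 8 KV heads, head dimension 128) are assumptions matching a Llama-3.1-8B-style model, and `kv_cache_bytes` is a hypothetical helper.

```python
def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int, head_dim: int,
                   bytes_per_element: int = 2) -> int:
    """Size of the key and value caches: 2 tensors per layer, one slot per token."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_element


# Assumed Llama-3.1-8B-style shape: 32 layers, 8 KV heads, head_dim 128, FP16 cache
for n_ctx in (4096, 8192, 32768):
    size_gib = kv_cache_bytes(32, n_ctx, 8, 128) / 2**30
    print(f"n_ctx={n_ctx}: ~{size_gib:.1f} GiB")
```

If memory is tight, lowering the context length is therefore an alternative to offloading fewer layers.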

Poor quality at low bits:

bash
# Always use imatrix for Q4 and below
./llama-imatrix -m model-f16.gguf -f calibration.txt -o model.imatrix
./llama-quantize --imatrix model.imatrix model-f16.gguf model-q4_k_m.gguf Q4_K_M
