llama.cpp
Run It Yourself

The honest comparison tool. Explore quantization trade-offs, check if your hardware can run a model, and see real-world benchmarks — all in one place.

🦙 Local Inference · 📦 GGUF Quantization · 🖥️ CPU + GPU · Pre-baked Data

Select Model

Quantization Comparison

Quant · File Size · RAM Needed · Quality · CPU tok/s · GPU tok/s

Your Hardware

What You Can Run

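The compatibility check boils down to comparing a quant's memory footprint against the machine's VRAM and system RAM. A minimal sketch of that logic (function name and return strings are illustrative, not part of llama.cpp):

```python
def check_compatibility(model_ram_gb: float,
                        system_ram_gb: float,
                        vram_gb: float = 0.0) -> str:
    """Illustrative check: does the quantized model fit in VRAM, RAM, or neither?"""
    if model_ram_gb <= vram_gb:
        return "fits in VRAM (full GPU offload)"
    if model_ram_gb <= system_ram_gb:
        return "fits in system RAM (CPU, or partial GPU offload)"
    return "too large for this machine"
```

In practice llama.cpp lets you split the difference: layers that fit go to the GPU, the rest stay on the CPU.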

Hardware Config

Performance (tokens/sec)

Model · Q4_K_M · Q5_K_M · Q8_0 · F16
>30 tok/s: Conversational · 10–30 tok/s: Usable · <10 tok/s: Slow
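The legend's speed bands are easy to state precisely. A small helper that maps a measured generation speed to the label used above (the boundary handling at exactly 10 and 30 tok/s is an assumption):

```python
def speed_label(tok_per_s: float) -> str:
    """Map a generation speed to the legend's readability bands."""
    if tok_per_s > 30:
        return "Conversational"  # keeps pace with reading speed
    if tok_per_s >= 10:
        return "Usable"          # noticeable lag, still interactive
    return "Slow"                # batch-style use only
```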

What Makes llama.cpp Different

llama.cpp takes large language models that normally need expensive cloud GPUs and runs them on your own hardware — laptop, desktop, even a Raspberry Pi. It does this through quantization: compressing model weights from 16-bit to 4-bit with minimal quality loss. The result is private, offline, zero-cost inference.
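The size savings follow directly from bits per weight. A rough back-of-the-envelope calculation, using approximate bits-per-weight figures for common GGUF quant types (ballpark numbers; real file sizes vary with the per-tensor quant mix):

```python
# Approximate bits per weight for common GGUF quant types (illustrative).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate model weight size in GB for a given parameter count and quant."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1e9

# A 7B model drops from ~14 GB at F16 to ~4 GB at Q4_K_M.
for quant in BITS_PER_WEIGHT:
    print(f"{quant:8s} ~{approx_size_gb(7, quant):.1f} GB")
```

Actual RAM use is a bit higher than the file size, since the KV cache and compute buffers also need memory.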

Pure C/C++

No Python, no PyTorch, no CUDA toolkit. A single compiled binary that runs on macOS, Linux, and Windows. Metal on Apple Silicon, CUDA on NVIDIA, Vulkan everywhere else.

GGUF Format

The standard format for quantized models. A single file contains weights, tokenizer, and metadata. Download one file, run it. Over 100,000 GGUF models on Hugging Face.
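Because everything lives in one file, inspecting a GGUF model is straightforward. A minimal sketch that reads the fixed-size header (per the GGUF spec: little-endian `GGUF` magic, a version, then tensor and metadata-entry counts; the function name is my own):

```python
import struct

def read_gguf_header(path: str):
    """Read the GGUF fixed header: magic, version, tensor count, metadata count."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("<4sI", f.read(8))
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return version, n_tensors, n_kv
```

The metadata key-value section that follows the header is where the tokenizer and architecture details live, which is why one download is all you need.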

Full Privacy

Every token is generated on your machine. No API calls, no data leaving your network. Critical for regulated industries: banking, healthcare, legal.
