llama.cpp
Run It Yourself

The honest comparison tool. Explore quantization trade-offs, check if your hardware can run a model, and see real-world benchmarks — all in one place.

🦙 Local Inference · 📦 GGUF Quantization · 🖥️ CPU + GPU · Pre-baked Data

Select Model

Quantization Comparison

Quant · File Size · RAM Needed · Quality · CPU tok/s · GPU tok/s

Your Hardware

What You Can Run

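The compatibility check boils down to comparing a quant's memory footprint against the machine's VRAM and system RAM. A minimal sketch of that logic (function name and return strings are illustrative, not part of llama.cpp):

```python
def check_compatibility(model_ram_gb: float,
                        system_ram_gb: float,
                        vram_gb: float = 0.0) -> str:
    """Illustrative check: does the quantized model fit in VRAM, RAM, or neither?"""
    if model_ram_gb <= vram_gb:
        return "fits in VRAM (full GPU offload)"
    if model_ram_gb <= system_ram_gb:
        return "fits in system RAM (CPU, or partial GPU offload)"
    return "too large for this machine"
```

In practice llama.cpp lets you split the difference: layers that fit go to the GPU, the rest stay on the CPU.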

Hardware Config

Performance (tokens/sec)

Model · Q4_K_M · Q5_K_M · Q8_0 · F16
>30 tok/s: Conversational · 10–30 tok/s: Usable · <10 tok/s: Slow
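The legend's speed bands are easy to state precisely. A small helper that maps a measured generation speed to the label used above (the boundary handling at exactly 10 and 30 tok/s is an assumption):

```python
def speed_label(tok_per_s: float) -> str:
    """Map a generation speed to the legend's readability bands."""
    if tok_per_s > 30:
        return "Conversational"  # keeps pace with reading speed
    if tok_per_s >= 10:
        return "Usable"          # noticeable lag, still interactive
    return "Slow"                # batch-style use only
```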

What Makes llama.cpp Different

llama.cpp takes large language models that normally need expensive cloud GPUs and runs them on your own hardware — laptop, desktop, even a Raspberry Pi. It does this through quantization: compressing model weights from 16-bit to 4-bit with minimal quality loss. The result is private, offline, zero-cost inference.
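The size savings follow directly from bits per weight. A rough back-of-the-envelope calculation, using approximate bits-per-weight figures for common GGUF quant types (ballpark numbers; real file sizes vary with the per-tensor quant mix):

```python
# Approximate bits per weight for common GGUF quant types (illustrative).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.8,
}

def approx_size_gb(n_params_billion: float, quant: str) -> float:
    """Estimate model weight size in GB for a given parameter count and quant."""
    bits = BITS_PER_WEIGHT[quant]
    return n_params_billion * 1e9 * bits / 8 / 1e9

# A 7B model drops from ~14 GB at F16 to ~4 GB at Q4_K_M.
for quant in BITS_PER_WEIGHT:
    print(f"{quant:8s} ~{approx_size_gb(7, quant):.1f} GB")
```

Actual RAM use is a bit higher than the file size, since the KV cache and compute buffers also need memory.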

Pure C/C++

No Python, no PyTorch, no CUDA toolkit. A single compiled binary that runs on macOS, Linux, and Windows. Metal on Apple Silicon, CUDA on NVIDIA, Vulkan everywhere else.

GGUF Format

The standard format for quantized models. A single file contains weights, tokenizer, and metadata. Download one file, run it. Over 100,000 GGUF models on Hugging Face.
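Because everything lives in one file, inspecting a GGUF model is straightforward. A minimal sketch that reads the fixed-size header (per the GGUF spec: little-endian `GGUF` magic, a version, then tensor and metadata-entry counts; the function name is my own):

```python
import struct

def read_gguf_header(path: str):
    """Read the GGUF fixed header: magic, version, tensor count, metadata count."""
    with open(path, "rb") as f:
        magic, version = struct.unpack("<4sI", f.read(8))
        if magic != b"GGUF":
            raise ValueError("not a GGUF file")
        n_tensors, n_kv = struct.unpack("<QQ", f.read(16))
    return version, n_tensors, n_kv
```

The metadata key-value section that follows the header is where the tokenizer and architecture details live, which is why one download is all you need.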

Full Privacy

Every token is generated on your machine. No API calls, no data leaving your network. Critical for regulated industries: banking, healthcare, legal.
