llama.cpp and GGUF Quantization: Local LLM Deployment

Introduction

Running large language models (LLMs) locally, without depending on cloud APIs, is a goal many developers and organizations share. llama.cpp makes this practical: it is an LLM inference framework written in C/C++ and designed to run large models efficiently on consumer hardware, including ordinary laptops.

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp. It supports a range of quantization levels, which let you trade model quality against memory usage according to your hardware.

In this tutorial, we will learn how to install llama.cpp, download GGUF models from Hugging Face, understand quantization levels, and build a fully functional local chatbot.

Prerequisites

Computer with at least 8GB RAM (16GB+ recommended)
At least 10GB free storage space
Python 3.8 or later
Git and CMake (for building from source)
Optional: NVIDIA GPU with CUDA or Apple Silicon for acceleration
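
The checklist above can be verified with a few lines of Python. This is a minimal sketch using only the standard library; the function name `check_prerequisites` and the result keys are illustrative choices, and the script does not check RAM because the standard library has no portable call for it.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)             # minimum Python version from the list above
MIN_FREE_GB = 10                # minimum free storage from the list above
BUILD_TOOLS = ("git", "cmake")  # only needed when building from source


def check_prerequisites(path="."):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    results = {
        "python": sys.version_info[:2] >= MIN_PYTHON,
        "disk": free_gb >= MIN_FREE_GB,
    }
    for tool in BUILD_TOOLS:
        # shutil.which returns None when the executable is not on PATH
        results[tool] = shutil.which(tool) is not None
    return results


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:8s} {'ok' if ok else 'MISSING'}")
```

A failed `git` or `cmake` check only matters for the source build; the pip route does not need you to invoke them directly.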

Installing llama.cpp

Method 1: Build from Source (Recommended)

# Clone the repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Basic build (CPU only)
make

# Or using CMake
mkdir build && cd build
cmake ..
cmake --build . --config Release

Build with GPU Acceleration

# CUDA (NVIDIA GPU)
make GGML_CUDA=1

# Or via CMake
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
cmake --build . --config Release

# Metal (Apple Silicon / macOS)
make GGML_METAL=1

# Or via CMake
mkdir build && cd build
cmake .. -DGGML_METAL=ON
cmake --build . --config Release

# Vulkan (cross-platform GPU)
make GGML_VULKAN=1

Method 2: Install via pip (Python Bindings)

# Basic installation (CPU only)
pip install llama-cpp-python

# With CUDA support
CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# With Metal support (macOS)
CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# With Vulkan support
CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python

# Force reinstall if upgrading
pip install llama-cpp-python --force-reinstall --no-cache-dir

Verify Installation

# For a source build (a CMake build places the binary under build/bin/)
./llama-cli --version

# For Python
python -c "from llama_cpp import Llama; print('llama-cpp-python installed successfully')"
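
The same checks can be scripted, which is convenient in setup scripts. The sketch below is one possible approach using only the standard library; the function name `installation_status` is hypothetical. It assumes the CLI binary is called `llama-cli` and the Python package imports as `llama_cpp`. A binary built from source is usually not on `PATH` unless you install or copy it there, so a `False` result for the CLI does not necessarily mean the build failed.

```python
import importlib.util
import shutil


def installation_status(cli_name="llama-cli", module_name="llama_cpp"):
    """Report whether the CLI is on PATH and whether the bindings are importable."""
    return {
        # shutil.which searches PATH only, not the build directory
        "cli_on_path": shutil.which(cli_name) is not None,
        # find_spec locates the package without importing it
        "python_bindings": importlib.util.find_spec(module_name) is not None,
    }


if __name__ == "__main__":
    for name, ok in installation_status().items():
        print(f"{name}: {'found' if ok else 'not found'}")
```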

Understanding GGUF Format and Quantization Levels

What is GGUF?

GGUF is a file format designed for storing LLM weights, quantized or not, together with the metadata needed to run them. It replaces the older GGML format and offers:

Better backward and forward compatibility
More complete metadata (tokenizer info, model parameters, etc.)
Faster loading times
Support for multiple model architectures
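
To make the metadata point concrete, here is a minimal sketch that reads the fixed-size header at the start of a GGUF file. It assumes the little-endian layout used by GGUF version 2 and later: a 4-byte magic `GGUF`, a 32-bit version, then 64-bit tensor and metadata key/value counts. The function name `read_gguf_header` is illustrative, and the sketch does not parse the metadata entries themselves.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the first 24 bytes of a GGUF file (version 2+ layout assumed)."""
    if len(data) < HEADER_SIZE:
        raise ValueError(f"need at least {HEADER_SIZE} bytes, got {len(data)}")
    if data[:4] != GGUF_MAGIC:
        raise ValueError(f"not a GGUF file, magic was {data[:4]!r}")
    # "<IQQ": little-endian uint32 followed by two uint64 values
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    if version < 2:
        raise ValueError("version 1 files use a different header layout")
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

With a real file you would pass the first 24 bytes, for example `read_gguf_header(open(path, "rb").read(24))`, where `path` points at any `.gguf` file you have downloaded.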

Quantization Levels

Quantization is the process of reducing the numerical precision of model parameters to save memory and increase speed, with a trade-off in output quality.
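
The idea can be illustrated with a simplified block-wise 8-bit scheme: split the weights into small blocks, store one floating-point scale per block, and store each weight as a signed 8-bit integer. This sketch is loosely modeled on how simple block formats such as Q8_0 work, but it is not the llama.cpp implementation; the block size, function names, and use of Python floats for the scale are assumptions made for clarity.

```python
BLOCK_SIZE = 32  # weights per block; each block shares one scale (assumed value)


def quantize_block(values):
    """Map a block of floats to (scale, list of ints in [-127, 127])."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    # Clamp to guard against floating-point rounding at the extremes
    quants = [max(-127, min(127, round(v / scale))) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from a quantized block."""
    return [q * scale for q in quants]


def quantize(weights):
    """Quantize a flat list of weights block by block."""
    return [
        quantize_block(weights[i:i + BLOCK_SIZE])
        for i in range(0, len(weights), BLOCK_SIZE)
    ]


def dequantize(blocks):
    """Flatten quantized blocks back into a list of approximate weights."""
    restored = []
    for scale, quants in blocks:
        restored.extend(dequantize_block(scale, quants))
    return restored
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the quality cost that the lower-bit formats increase in exchange for smaller files.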

| Quantization | Bits | Size (7B model) | RAM Needed | Quality | Use Case |
|-------------|------|-----------------|------------|---------|----------|
| F16 | 16 | ~14 GB | ~16 GB | Best | Reference, evaluation |
| Q8_0 | 8 | ~7.5 GB | ~10 GB | Excellent | Production (if RAM allows) |
| Q6_K | 6 | ~5.5 GB | ~8 GB | Very Good | Balance quality/performance |
| Q5_K_M | 5 | ~5.0 GB | ~7.5 GB | Good | General recommendation |
| Q5_K_S | 5 | ~4.8 GB | ~7 GB | Good | Slightly smaller |
| Q4_K_M | 4 | ~4.0 GB | ~6.5 GB | Decent | Most popular |
| Q4_K_S | 4 | ~3.8 GB | ~6 GB | Fair | Limited RAM |
| Q3_K_M | 3 | ~3.3 GB | ~5.5 GB | Degraded | Only if RAM is very limited |
| Q2_K | 2 | ~2.7 GB | ~5 GB | Low | Experimental only |

Recommendations:

Q4_K_M: Best balance between size and quality (most popular)
Q5_K_M: Better quality for slightly more RAM
Q8_0: If RAM is not a concern; quality approaches F16
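
The sizes in the table follow roughly from parameter count times bits per weight. The sketch below shows that arithmetic and a small helper that picks the highest-quality level fitting a given amount of RAM. The RAM figures are copied from the table above and apply to a 7B model only; the function names are illustrative. Real files come out somewhat larger than the nominal estimate because block formats also store scales (the table lists ~7.5 GB for Q8_0, against a nominal 7 GB).

```python
# Approximate RAM needed for a 7B model, ordered from highest to lowest quality
# (values taken from the table above)
RAM_NEEDED_GB = {
    "F16": 16, "Q8_0": 10, "Q6_K": 8, "Q5_K_M": 7.5, "Q5_K_S": 7,
    "Q4_K_M": 6.5, "Q4_K_S": 6, "Q3_K_M": 5.5, "Q2_K": 5,
}


def estimate_size_gb(n_params, bits_per_weight):
    """Nominal weight size in GB: parameters * bits / 8 bits per byte."""
    return n_params * bits_per_weight / 8 / 1e9


def pick_quantization(available_ram_gb):
    """Return the highest-quality level that fits, or None if none does."""
    for name, needed in RAM_NEEDED_GB.items():
        if needed <= available_ram_gb:
            return name
    return None


if __name__ == "__main__":
    print(estimate_size_gb(7e9, 16))  # nominal F16 size of a 7B model
    print(pick_quantization(8))       # best fit for a machine with 8 GB free
```

Treat the result as a starting point: the context length and any other running applications also consume memory.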
