llama.cpp and GGUF Quantization: Local LLM Deployment


By Ruby Abdullah · tutorial
llama.cpp · GGUF · Quantization · Local LLM · Python


Introduction

Running Large Language Models (LLMs) locally without depending on cloud APIs is a goal many developers and organizations share. With llama.cpp, this becomes a reality. llama.cpp is an LLM inference framework written in pure C/C++, designed to run large models efficiently on consumer hardware, including ordinary laptops.

GGUF (GPT-Generated Unified Format) is the model file format used by llama.cpp. It supports a range of quantization levels, letting you trade output quality against memory usage to match your hardware.

In this tutorial, we will learn how to install llama.cpp, download GGUF models from HuggingFace, understand quantization levels, and build a fully functional local chatbot.

Prerequisites

  • Computer with at least 8GB RAM (16GB+ recommended)
  • At least 10GB free storage space
  • Python 3.8 or later
  • Git and CMake (for building from source)
  • Optional: NVIDIA GPU with CUDA or Apple Silicon for acceleration
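
Most of the checklist above can be verified from Python before starting. The following is a minimal sketch using only the standard library; check_prerequisites is a hypothetical helper name, the thresholds are copied from the list, and the RAM check is left out because the standard library has no portable way to query it.

```python
import shutil
import sys


def check_prerequisites(path=".", min_free_gb=10, min_python=(3, 8)):
    """Return a dict mapping each prerequisite to True (met) or False."""
    # Free space on the drive that holds `path`, in GB (1 GB = 1e9 bytes).
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python": sys.version_info >= min_python,
        "git": shutil.which("git") is not None,
        "cmake": shutil.which("cmake") is not None,
        "disk": free_gb >= min_free_gb,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name:<8}{'ok' if ok else 'missing'}")
```

Git and CMake only matter when building from source, so a "missing" result for them is not a problem for the pip route.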

Installing llama.cpp

Method 1: Build from Source

# Clone repository

git clone https://github.com/ggerganov/llama.cpp.git

cd llama.cpp

# Basic build (CPU only)

make

# Or using CMake

mkdir build && cd build

cmake ..

cmake --build . --config Release

Build with GPU Acceleration

# CUDA (NVIDIA GPU)

make GGML_CUDA=1

# Or via CMake

mkdir build && cd build

cmake .. -DGGML_CUDA=ON

cmake --build . --config Release

# Metal (Apple Silicon / macOS)

make GGML_METAL=1

# Or via CMake

mkdir build && cd build

cmake .. -DGGML_METAL=ON

cmake --build . --config Release

# Vulkan (cross-platform GPU)

make GGML_VULKAN=1
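
Which backend flag applies depends on the machine. As a rough sketch, the Python helper below guesses a CMake flag from the platform. The name guess_cmake_flag is hypothetical, and treating nvidia-smi on the PATH as a sign of a usable CUDA setup is an assumption: it does not confirm that the CUDA toolkit itself is installed. Vulkan is not detected here.

```python
import platform
import shutil


def guess_cmake_flag(system=None, machine=None, has_nvidia=None):
    """Guess a llama.cpp CMake backend flag; "" means a CPU-only build."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if has_nvidia is None:
        # Heuristic only: the NVIDIA driver tools are on the PATH.
        has_nvidia = shutil.which("nvidia-smi") is not None
    if system == "Darwin" and machine == "arm64":
        return "-DGGML_METAL=ON"  # Apple Silicon
    if has_nvidia:
        return "-DGGML_CUDA=ON"
    return ""


if __name__ == "__main__":
    flag = guess_cmake_flag()
    print(f"cmake .. {flag}".rstrip())
```

The arguments exist only so the logic can be exercised without the matching hardware; called with no arguments, the function inspects the current machine.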

Method 2: Install via pip (Python Bindings)

# Basic installation (CPU only)

pip install llama-cpp-python

# With CUDA support

CMAKE_ARGS="-DGGML_CUDA=on" pip install llama-cpp-python

# With Metal support (macOS)

CMAKE_ARGS="-DGGML_METAL=on" pip install llama-cpp-python

# With Vulkan support

CMAKE_ARGS="-DGGML_VULKAN=on" pip install llama-cpp-python

# Force reinstall if upgrading

pip install llama-cpp-python --force-reinstall --no-cache-dir

Verify Installation

# For source build

./llama-cli --version

# For Python

python -c "from llama_cpp import Llama; print('llama-cpp-python installed successfully')"
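
The Python check can also be done without importing the package, which is useful inside scripts that should degrade gracefully. This is a small sketch using only the standard library; llama_cpp_python_version is a hypothetical helper name.

```python
import importlib.util
from importlib import metadata


def llama_cpp_python_version():
    """Return the installed llama-cpp-python version, or None if absent."""
    # find_spec locates the module without executing it.
    if importlib.util.find_spec("llama_cpp") is None:
        return None
    try:
        return metadata.version("llama-cpp-python")
    except metadata.PackageNotFoundError:
        # Importable but without package metadata (e.g. a local checkout).
        return "unknown"


if __name__ == "__main__":
    version = llama_cpp_python_version()
    print(version if version else "llama-cpp-python is not installed")
```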

Understanding GGUF Format and Quantization Levels

What is GGUF?

GGUF is a file format designed for storing LLM weights, typically quantized, together with their metadata. It replaces the older GGML format and offers:

  • Better backward and forward compatibility
  • More complete metadata (tokenizer info, model parameters, etc.)
  • Faster loading times
  • Support for multiple model architectures
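
To make the format a little more concrete, the sketch below parses the fixed-size header at the start of a GGUF file. It assumes a little-endian file in GGUF version 2 or later, where the header is the 4-byte magic "GGUF", a 32-bit version, a 64-bit tensor count, and a 64-bit metadata key/value count. The function name read_gguf_header is hypothetical, and the metadata entries that follow the header are not parsed.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header from the first 24 bytes of a file."""
    if len(data) < HEADER_SIZE or data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # "<IQQ": little-endian uint32, uint64, uint64, read after the magic.
    version, tensor_count, kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensors": tensor_count,
        "metadata_keys": kv_count,
    }
```

For a real file, pass the first 24 bytes, for example the result of f.read(24) on a file opened in binary mode.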

Quantization Levels

Quantization is the process of reducing the numerical precision of model parameters to save memory and increase speed, with a trade-off in output quality.
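
The idea can be illustrated with a toy example. The sketch below quantizes one block of weights to 8-bit integers that share a single scale, then restores them. It is a simplified illustration in the spirit of 8-bit block quantization, not llama.cpp's actual implementation: the real formats use fixed block sizes and packed storage, and the K-quant variants are more elaborate. The function names and sample values are made up for the example.

```python
def quantize_block(weights):
    """Quantize a block of floats to int8-range values plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Map the largest magnitude in the block to 127; avoid dividing by zero.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quants = [round(w / scale) for w in weights]
    return scale, quants


def dequantize_block(scale, quants):
    """Recover approximate floats from the shared scale and the integers."""
    return [scale * q for q in quants]


block = [0.12, -0.5, 0.33, 0.0, 0.91, -0.07, 0.25, -0.64]
scale, quants = quantize_block(block)
restored = dequantize_block(scale, quants)
max_error = max(abs(a - b) for a, b in zip(block, restored))

print(quants)
print(f"max error: {max_error:.5f}")
```

Each restored weight differs from the original by at most half the scale, which is why fewer bits (a coarser grid) mean larger errors and lower output quality.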

| Quantization | Bits | Model Size (7B) | RAM Needed | Quality | Use Case |
|--------------|------|-----------------|------------|---------|----------|
| F16 | 16 | ~14 GB | ~16 GB | Best | Reference, evaluation |
| Q8_0 | 8 | ~7.5 GB | ~10 GB | Excellent | Production (if RAM allows) |
| Q6_K | 6 | ~5.5 GB | ~8 GB | Very Good | Balance quality/performance |
| Q5_K_M | 5 | ~5.0 GB | ~7.5 GB | Good | General recommendation |
| Q5_K_S | 5 | ~4.8 GB | ~7 GB | Good | Slightly smaller |
| Q4_K_M | 4 | ~4.0 GB | ~6.5 GB | Decent | Most popular |
| Q4_K_S | 4 | ~3.8 GB | ~6 GB | Fair | Limited RAM |
| Q3_K_M | 3 | ~3.3 GB | ~5.5 GB | Degraded | Only if RAM is very limited |
| Q2_K | 2 | ~2.7 GB | ~5 GB | Low | Experimental only |
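
The F16 row follows directly from the arithmetic: parameters times bits per weight, divided by 8 bits per byte. The sketch below computes this nominal size; nominal_size_gb is a hypothetical helper, and 1 GB is taken as 10^9 bytes.

```python
def nominal_size_gb(n_params, bits_per_weight):
    """Nominal weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: {nominal_size_gb(7e9, bits):.1f} GB")
```

The quantized rows in the table are somewhat larger than this nominal figure because quantized formats also store per-block scale values, so the effective bits per weight are higher than the nominal count. The RAM column is larger again, since running a model needs memory beyond the weights themselves.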

Recommendations:
  • Q4_K_M: Best balance between size and quality (most popular)
  • Q5_K_M: For better quality with slightly more RAM
  • Q8_0: If RAM is not a concern, quality approaches F16
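
As a rough illustration of how the table can drive the choice, the sketch below picks the highest-quality level whose approximate RAM requirement fits a given budget. The figures are the 7B values from the table above; pick_quant is a hypothetical helper, and the budget should be the memory actually free for the model, not the machine's total RAM.

```python
# (quantization name, approx. RAM needed in GB) for a 7B model, taken from
# the table above and ordered from highest to lowest quality.
QUANT_RAM_GB = [
    ("F16", 16.0),
    ("Q8_0", 10.0),
    ("Q6_K", 8.0),
    ("Q5_K_M", 7.5),
    ("Q5_K_S", 7.0),
    ("Q4_K_M", 6.5),
    ("Q4_K_S", 6.0),
    ("Q3_K_M", 5.5),
    ("Q2_K", 5.0),
]


def pick_quant(available_ram_gb):
    """Return the highest-quality level that fits, or None if none does."""
    for name, ram_gb in QUANT_RAM_GB:
        if ram_gb <= available_ram_gb:
            return name
    return None


if __name__ == "__main__":
    for budget in (6.5, 8, 16):
        print(f"{budget} GB -> {pick_quant(budget)}")
```

This always prefers quality over speed. Someone who values faster generation might still choose Q4_K_M even when a larger level fits.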
