Serving LLMs in Production with Text Generation Inference (TGI)

Text Generation Inference (TGI) is Hugging Face's purpose-built toolkit for serving large language models behind a production HTTP API. It pairs a Rust web layer with a Python inference backend, ships as a ready-to-run Docker image, and exposes an OpenAI-compatible Messages API so existing clients work with minimal changes. This tutorial walks through deploying TGI, tuning its launcher, and querying it from curl, the OpenAI client, and the Hugging Face Hub client.

What TGI Is

TGI is an inference server maintained by Hugging Face and used to power their hosted Inference Endpoints. Architecturally it splits responsibilities:

A Rust router/webserver handles HTTP, request validation, batching decisions, and token streaming. Rust keeps the request-handling path fast and memory-safe under concurrency.
A Python model server (one process per shard) runs the actual model forward passes on the GPU, using optimized attention and batching kernels.

This separation lets the busy networking layer stay lean while the heavy numerical work sits in well-tested Python/CUDA code. Because TGI is built by the same team behind the Transformers and Hub libraries, model loading, tokenizer handling, and gated-model authentication line up naturally with the rest of the Hugging Face ecosystem.

Key Features

Continuous batching. Incoming requests join an in-flight batch as soon as slots free up, rather than waiting for a fixed batch window. This keeps the GPU busy and improves throughput under mixed load.
Tensor parallelism (sharding). A model can be split across multiple GPUs with --num-shard, enabling models that do not fit on a single card.
Optimized attention. TGI uses Flash Attention and Paged Attention kernels to reduce memory use and speed up both prefill and decode.
Token streaming. Server-Sent Events (SSE) stream tokens as they are produced, which lowers perceived latency for chat UIs.
Quantization. Built-in support for bitsandbytes, GPTQ, AWQ, and other schemes to fit larger models into less VRAM.
OpenAI-compatible Messages API. A /v1/chat/completions endpoint that mirrors the OpenAI request/response shape.

If you have read our earlier vLLM tutorial: both servers do continuous batching and paged attention, and both expose an OpenAI-style API. The practical difference is packaging and ecosystem. TGI is Docker-first, tightly integrated with the Hub and Inference Endpoints, and driven by a single launcher binary with explicit flags. We compare them more concretely at the end.

Prerequisites

A machine with one or more NVIDIA GPUs and recent drivers.
Docker with the NVIDIA Container Toolkit installed, so --gpus all works.
A Hugging Face account and an access token if you plan to serve gated models such as the Llama family. Create one under Settings → Access Tokens.

Verify GPU access inside Docker before going further:

docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi

If the GPU table prints, the container runtime can see your hardware.

The Fastest Path: Running TGI with Docker

TGI is distributed as a container image at ghcr.io/huggingface/text-generation-inference. The minimum you need is a model ID and GPU access. We will use a Mistral instruct model, which is open and small enough to run on a single mid-range GPU.

# Reuse a host directory for the model cache so re-runs do not re-download mkdir -p $HOME/.cache/huggingface docker run --gpus all --shm-size 1g -p 8080:80 \ -v $HOME/.cache/huggingface:/data \ ghcr.io/huggingface/text-generation-inference:latest \ --model-id mistralai/Mistral-7B-Instruct-v0.3

A few details matter here:

Text Generation Inference (TGI) Tutorial: Production LLM Serving

Serving LLMs in Production with Text Generation Inference (TGI)

What TGI Is

Key Features

Prerequisites

The Fastest Path: Running TGI with Docker

Related Articles

SGLang Tutorial: Fast LLM Serving and Structured Generation

KServe Tutorial: Serverless Model Serving on Kubernetes

Complete BentoML Tutorial: Packaging and Serving ML Models to Production

Kedro Tutorial: Reproducible and Maintainable Data Science Pipelines

Related Articles

SGLang Tutorial: Fast LLM Serving and Structured Generation

SGLang: Serving LLM yang Cepat dan Model Pemrograman untuk Generasi Terstruktur SGLang adalah dua hal dalam satu paket: ...

KServe Tutorial: Serverless Model Serving on Kubernetes

Serverless Model Serving di Kubernetes dengan KServe KServe adalah platform native Kubernetes untuk menyajikan model mac...

Complete BentoML Tutorial: Packaging and Serving ML Models to Production

Tutorial Lengkap BentoML: Packaging dan Serving ML Models ke Production BentoML adalah framework open-source untuk build...

Kedro Tutorial: Reproducible and Maintainable Data Science Pipelines

Kedro: Pipeline Data Science yang Reproducible dan Mudah Dirawat Sebagian besar proyek data science dimulai dari satu no...