Serving LLMs in Production with Text Generation Inference (TGI)
Text Generation Inference (TGI) is Hugging Face's purpose-built toolkit for serving large language models behind a production HTTP API. It pairs a Rust web layer with a Python inference backend, ships as a ready-to-run Docker image, and exposes an OpenAI-compatible Messages API so existing clients work with minimal changes. This tutorial walks through deploying TGI, tuning its launcher, and querying it from curl, the OpenAI client, and the Hugging Face Hub client.
What TGI Is
TGI is an inference server maintained by Hugging Face and used to power their hosted Inference Endpoints. Architecturally it splits responsibilities:
- A Rust router/webserver handles HTTP, request validation, batching decisions, and token streaming. Rust keeps the request-handling path fast and memory-safe under concurrency.
- A Python model server (one process per shard) runs the actual model forward passes on the GPU, using optimized attention and batching kernels.
This separation lets the busy networking layer stay lean while the heavy numerical work sits in well-tested Python/CUDA code. Because TGI is built by the same team behind the Transformers and Hub libraries, model loading, tokenizer handling, and gated-model authentication line up naturally with the rest of the Hugging Face ecosystem.
Key Features
- Continuous batching. Incoming requests join an in-flight batch as soon as slots free up, rather than waiting for a fixed batch window. This keeps the GPU busy and improves throughput under mixed load.
- Tensor parallelism (sharding). A model can be split across multiple GPUs with
--num-shard, enabling models that do not fit on a single card. - Optimized attention. TGI uses Flash Attention and Paged Attention kernels to reduce memory use and speed up both prefill and decode.
- Token streaming. Server-Sent Events (SSE) stream tokens as they are produced, which lowers perceived latency for chat UIs.
- Quantization. Built-in support for
bitsandbytes, GPTQ, AWQ, and other schemes to fit larger models into less VRAM. - OpenAI-compatible Messages API. A
/v1/chat/completionsendpoint that mirrors the OpenAI request/response shape.
If you have read our earlier vLLM tutorial: both servers do continuous batching and paged attention, and both expose an OpenAI-style API. The practical difference is packaging and ecosystem. TGI is Docker-first, tightly integrated with the Hub and Inference Endpoints, and driven by a single launcher binary with explicit flags. We compare them more concretely at the end.
Prerequisites
- A machine with one or more NVIDIA GPUs and recent drivers.
- Docker with the NVIDIA Container Toolkit installed, so
--gpus allworks. - A Hugging Face account and an access token if you plan to serve gated models such as the Llama family. Create one under Settings → Access Tokens.
Verify GPU access inside Docker before going further:
docker run --rm --gpus all nvidia/cuda:12.4.0-base-ubuntu22.04 nvidia-smi
If the GPU table prints, the container runtime can see your hardware.
The Fastest Path: Running TGI with Docker
TGI is distributed as a container image at ghcr.io/huggingface/text-generation-inference. The minimum you need is a model ID and GPU access. We will use a Mistral instruct model, which is open and small enough to run on a single mid-range GPU.
# Reuse a host directory for the model cache so re-runs do not re-download
mkdir -p $HOME/.cache/huggingface
docker run --gpus all --shm-size 1g -p 8080:80 \
-v $HOME/.cache/huggingface:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id mistralai/Mistral-7B-Instruct-v0.3
A few details matter here: