Complete vLLM Tutorial: High-Performance LLM Serving
vLLM is a Python library for high-performance LLM inference and serving. Built on the PagedAttention technique, it can reach 10-24x higher throughput than a standard implementation, which makes it a strong choice for production deployment.
Why vLLM?
vLLM advantages:
- High throughput: PagedAttention for memory efficiency
- Continuous batching: Maximize GPU utilization
- OpenAI-compatible API: Easy integration
- Tensor parallelism: Multi-GPU support
- Quantization support: AWQ, GPTQ, FP8
Typical use cases:
- Production LLM serving
- High-traffic API endpoints
- Batch inference
- Multi-tenant deployments
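To make the memory-efficiency point behind PagedAttention concrete, the sketch below compares two ways of reserving KV-cache slots: one contiguous region per request sized for the maximum length, versus fixed-size blocks handed out on demand. This is a toy model, not vLLM's implementation; the function names, the 16-token block size, and the sample sequence lengths are assumptions chosen for illustration.

```python
import math


def contiguous_slots(seq_lens, max_len):
    """Slots reserved when every request pre-allocates max_len tokens."""
    return len(seq_lens) * max_len


def paged_slots(seq_lens, block_size=16):
    """Slots reserved when each request holds only the blocks it needs."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


def utilization(seq_lens, reserved):
    """Fraction of reserved slots that actually hold tokens."""
    return sum(seq_lens) / reserved


# Hypothetical current lengths (in tokens) of four in-flight requests.
lens = [37, 120, 512, 9]

for name, reserved in [
    ("contiguous", contiguous_slots(lens, max_len=2048)),
    ("paged", paged_slots(lens)),
]:
    print(f"{name}: {reserved} slots reserved, "
          f"{utilization(lens, reserved):.1%} in use")
```

In the paged scheme the waste is bounded by less than one block per request, which is why more requests can share the same GPU memory.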
Installation
# Install vLLM
pip install vllm
# With CUDA 11.8 support
pip install vllm --extra-index-url https://download.pytorch.org/whl/cu118
# Verify installation
python -c "import vllm; print(vllm.__version__)"
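Importing `vllm` pulls in heavy dependencies, so a script may prefer to check for the package without importing it. The sketch below reads only the installed package metadata through the standard library; `installed_version` is a hypothetical helper name, not part of vLLM.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("vllm")
print(f"vllm {version}" if version else "vllm is not installed")
```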
Quick Start
1. Offline Inference
from vllm import LLM, SamplingParams

# Load model
llm = LLM(model="meta-llama/Llama-2-7b-hf")

# Sampling parameters
sampling_params = SamplingParams(
    temperature=0.8,
    top_p=0.95,
    max_tokens=256,
)

# Generate
prompts = [
    "Explain machine learning in simple terms:",
    "Write a Python function to calculate factorial:",
    "What is the capital of Indonesia?",
]
outputs = llm.generate(prompts, sampling_params)

# Each result carries the original prompt and a list of generated candidates
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt[:50]}...")
    print(f"Response: {generated_text}\n")
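The loop above reads only `outputs[0]`, which is enough when one completion is generated per prompt. When several candidates per prompt are requested, a small helper can flatten everything into rows. The sketch below is a hypothetical helper that relies only on the `.prompt`, `.outputs`, and `.text` attributes used in the example; stand-in objects are used so that it can be tried without loading a model.

```python
from types import SimpleNamespace


def collect_outputs(request_outputs):
    """Flatten generation results into one dict per generated candidate."""
    rows = []
    for request in request_outputs:
        for index, candidate in enumerate(request.outputs):
            rows.append({
                "prompt": request.prompt,
                "candidate": index,
                "text": candidate.text.strip(),
            })
    return rows


# Stand-in objects that mimic the attributes read above (not real vLLM types).
fake_outputs = [
    SimpleNamespace(
        prompt="What is the capital of Indonesia?",
        outputs=[
            SimpleNamespace(text=" Jakarta."),
            SimpleNamespace(text=" It is Jakarta."),
        ],
    )
]

for row in collect_outputs(fake_outputs):
    print(row)
```

With a real model, the same call would be `collect_outputs(llm.generate(prompts, sampling_params))`.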
2. OpenAI-Compatible Server
# Start vLLM server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000
# Python client (requires: pip install openai)
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",  # Required by the client; not validated unless the server sets --api-key
)

# Chat completion
# The chat endpoint needs a chat template; a base model such as Llama-2-7b-hf
# may require a chat-tuned variant or the server's --chat-template option.
response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is Python?"},
    ],
    temperature=0.7,
    max_tokens=256,
)
print(response.choices[0].message.content)

# Streaming: print tokens as they arrive
stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "Tell me a story"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
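Because the server speaks the OpenAI HTTP protocol, the `openai` package is a convenience rather than a requirement. The sketch below builds the equivalent raw request with the standard library only. `build_chat_request` is a hypothetical helper, and the URL, token, and model name are simply the ones used in the example above; the request is built but not sent, since sending needs a running server.

```python
import json
import urllib.request


def build_chat_request(base_url, api_key, model, messages, **options):
    """Build (but do not send) a POST request for the chat completions route."""
    payload = {"model": model, "messages": messages, **options}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


request = build_chat_request(
    base_url="http://localhost:8000/v1",
    api_key="token-abc123",
    model="meta-llama/Llama-2-7b-hf",
    messages=[{"role": "user", "content": "What is Python?"}],
    max_tokens=256,
)
print(request.get_method(), request.full_url)

# Sending requires a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     body = json.load(response)
#     print(body["choices"][0]["message"]["content"])
```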
Sampling Parameters
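from vllm import SamplingParams

# Basic parameters
params = SamplingParams(
    temperature=0.8,   # Randomness (0.0 = deterministic, greedy decoding)
    top_p=0.95,        # Nucleus sampling: keep the smallest token set with this cumulative probability
    top_k=50,          # Top-k sampling: keep only the 50 most likely tokens
    max_tokens=512,    # Max output tokens
    stop=["\n\n"],     # Stop sequences
)

# Advanced parameters
params = SamplingParams(
    n=3,                       # Number of outputs returned per prompt
    best_of=5,                 # Generate 5 candidates, return the best 3
    presence_penalty=0.5,      # Penalize tokens that have already appeared
    frequency_penalty=0.5,     # Penalize tokens in proportion to how often they appeared
    repetition_penalty=1.1,    # Values above 1.0 discourage repetition
    skip_special_tokens=True,  # Drop special tokens from the decoded text
    ignore_eos=False,          # Stop when the end-of-sequence token is generated
    # Beam-search options: accepted by older vLLM releases only; newer
    # releases moved beam search out of SamplingParams.
    use_beam_search=False,     # Enable beam search
    length_penalty=1.0,        # Length penalty for beam search; must stay 1.0 when beam search is off
    early_stopping=False,      # Must stay False when beam search is off
)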
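To show what `temperature`, `top_k`, and `top_p` do to a next-token distribution, here is a small pure-Python sketch. It is a toy illustration under one common ordering (temperature, then top-k, then top-p on the renormalized survivors), not vLLM's sampler; the function names and example numbers are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the peak.

    This sketch needs temperature > 0; real samplers usually treat 0 as greedy.
    """
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized cumulative probability reaches top_p."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    ranked = ranked[:top_k]
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for token_id, p in ranked:
        kept.append((token_id, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    total = sum(p for _, p in kept)
    # Renormalize so the surviving probabilities sum to 1.
    return {token_id: p / total for token_id, p in kept}


logits = [2.0, 1.0, 0.0, -1.0]  # hypothetical scores for a 4-token vocabulary
for t in (0.5, 1.0, 2.0):
    rounded = [round(p, 3) for p in softmax(logits, temperature=t)]
    print(f"temperature={t}: {rounded}")

print(filter_top_k_top_p([0.5, 0.3, 0.15, 0.05], top_k=3, top_p=0.8))
```

Lower temperatures concentrate probability on the top token, while the two filters remove the unlikely tail before a token is drawn.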
Server Configuration
1. Basic Server Options
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-hf \
    --host 0.0.0.0 \
    --port 8000 \
    --dtype auto \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9
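The `--max-model-len` and `--gpu-memory-utilization` options interact: the second caps how much GPU memory the server may claim, and whatever remains after the weights goes mostly to the KV cache, which in turn limits how many tokens can be held at once. The sketch below is a back-of-the-envelope estimate only. It ignores activations and other overheads, and vLLM measures the real budget itself at startup. The Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) comes from the model's published config; the 24 GiB GPU and the 13.5 GB fp16 weight footprint are assumptions.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache for one token: one key and one value per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(gpu_bytes, utilization, weight_bytes, per_token_bytes):
    """Tokens that fit in the budget left over after the model weights."""
    budget = gpu_bytes * utilization - weight_bytes
    return max(0, int(budget // per_token_bytes))


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
tokens = max_cached_tokens(
    gpu_bytes=24 * 1024**3,   # assumed 24 GiB GPU
    utilization=0.9,          # matches --gpu-memory-utilization 0.9
    weight_bytes=13.5e9,      # assumed fp16 weight footprint
    per_token_bytes=per_token,
)
full_sequences = tokens // 4096  # matches --max-model-len 4096

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Approximate cacheable tokens: {tokens}")
print(f"Full-length sequences that fit at once: {full_sequences}")
```

Lowering `--max-model-len` does not shrink the cache itself, but it reduces the worst-case size of a single request, so more requests can be in flight together.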