Hugging Face TGI
Text Generation Inference (TGI) is Hugging Face's production-ready LLM serving solution, launched in 2023. TGI provides optimized inference for popular models (Llama, Mistral, Falcon, StarCoder) with continuous batching, tensor parallelism, quantization, and streaming support. As of October 2025, TGI powers Hugging Face's Inference Endpoints and is used by thousands of companies for production LLM deployment. Key features: 2-3x faster than naive PyTorch serving, automatic continuous batching, multi-GPU support, Docker deployment, OpenAI-compatible API. Open source (Apache 2.0) with commercial support available. Competes with vLLM and TensorRT-LLM for high-throughput LLM serving.
Overview
TGI optimizes LLM inference through: (1) Continuous batching - dynamically add/remove requests for maximum GPU utilization, (2) Tensor parallelism - split models across multiple GPUs, (3) Flash Attention - 2-4x faster attention computation, (4) Paged Attention - efficient KV cache management, (5) Quantization - bitsandbytes, GPTQ, AWQ for smaller memory footprint. Performance: Llama 3 70B serves 45 tokens/sec/user on 4× A100 vs 15 tokens/sec with vanilla transformers = 3x improvement. Supports: Llama, Mistral, Mixtral, Falcon, StarCoder, Bloom, GPT-NeoX. Deployment: Docker container, Kubernetes, AWS/GCP/Azure. API: OpenAI-compatible plus streaming, tool calling, guided generation.
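To see these pieces end to end, you can talk to a running server over plain HTTP: TGI exposes /health, /info, and its native /generate endpoint alongside the OpenAI-compatible routes. A minimal sketch, assuming a container already listening on localhost:8080 (as in the Docker command shown later in this article):
# Quick sanity check against a running TGI server (assumes localhost:8080)
import requests

base = "http://localhost:8080"

# /health returns 200 once the model is loaded and ready to serve
print(requests.get(f"{base}/health").status_code)

# /info reports the loaded model id, sharding, and quantization settings
print(requests.get(f"{base}/info").json())

# /generate is TGI's native endpoint; continuous batching, Flash/Paged
# Attention, and quantization all apply transparently to these requests
resp = requests.post(
    f"{base}/generate",
    json={
        "inputs": "Explain continuous batching in one sentence:",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
)
print(resp.json()["generated_text"])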
Key Features
- Continuous batching: 2-5x throughput improvement via dynamic batching
- Tensor parallelism: Distribute models across 2-8 GPUs seamlessly
- Flash Attention 2: Integrated for 2-4x faster attention
- Quantization: GPTQ, AWQ, bitsandbytes support for 4-8 bit inference
- Streaming: Token-by-token streaming for real-time responses
- Tool calling: Function calling and JSON structured outputs (see the guided-generation sketch after this list)
- OpenAI compatible: Drop-in replacement for OpenAI API
- Docker deployment: Single container with all dependencies
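TGI's structured-output support (called Guidance) constrains decoding to a JSON schema via a grammar parameter on the native /generate endpoint. A minimal sketch, assuming a server on localhost:8080; the prompt and schema are made-up examples, and the exact grammar payload shape should be verified against the docs of the TGI version you deploy:
# Guided JSON generation via TGI's grammar/Guidance feature.
# Schema and prompt are illustrative; check the grammar payload shape
# against your TGI version's documentation.
import json
import requests

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "language": {"type": "string"},
    },
    "required": ["name", "language"],
}

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Extract the person and the programming language: 'Ada wrote early programs for the Analytical Engine.'",
        "parameters": {
            "max_new_tokens": 100,
            "grammar": {"type": "json", "value": schema},
        },
    },
)
# The model is forced to emit JSON matching the schema
print(json.loads(resp.json()["generated_text"]))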
Performance Benchmarks
Llama 3 70B on 4× A100 80GB: TGI sustains roughly 45 tokens/sec per user with 32 concurrent users vs 15 tokens/sec with vanilla transformers, about 3x faster. Mistral 7B on a single A100: 120 tokens/sec with TGI vs 40 tokens/sec naive, again about 3x. Memory: Llama 3 70B weights take ~140GB in FP16, ~70GB at 8-bit, and roughly 35-40GB with GPTQ 4-bit (KV cache comes on top). Latency: time-to-first-token is typically ~50-100ms; subsequent tokens stream at the model's generation speed. TGI vs vLLM: similar performance (both use continuous batching + Flash Attention); TGI integrates more easily with the Hugging Face ecosystem, while vLLM is slightly faster for some workloads.
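Figures like these vary with hardware, concurrency, and prompt/output lengths, so it is worth measuring on your own deployment. A rough single-stream sketch using the huggingface_hub client against localhost:8080 (a real benchmark should drive many concurrent clients, since continuous batching only pays off under load):
# Rough single-stream throughput check against a local TGI server.
import time
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

prompt = "Summarize the benefits of continuous batching:"
max_new_tokens = 200

start = time.perf_counter()
n_tokens = 0
# With stream=True the client yields one token string at a time
for _ in client.text_generation(prompt, max_new_tokens=max_new_tokens, stream=True):
    n_tokens += 1
elapsed = time.perf_counter() - start

print(f"{n_tokens} tokens in {elapsed:.2f}s -> {n_tokens / elapsed:.1f} tokens/sec")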
Code Example
# Deploy TGI with Docker (gated models such as Llama 3 also need an access
# token passed via -e HF_TOKEN=<your token>; --quantize gptq expects a
# GPTQ-quantized checkpoint)
# docker run --gpus all --shm-size 1g -p 8080:80 \
#   -v $PWD/data:/data \
#   ghcr.io/huggingface/text-generation-inference:latest \
#   --model-id meta-llama/Meta-Llama-3-70B \
#   --quantize gptq \
#   --num-shard 4
# Client usage (Hugging Face InferenceClient)
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Streaming generation: tokens are printed as they arrive
for token in client.text_generation(
    "Write a Python function to calculate factorial:",
    max_new_tokens=200,
    stream=True,
):
    print(token, end="", flush=True)
# Batch inference: text_generation takes a single prompt, so loop over
# prompts (the server still batches concurrent requests internally)
prompts = ["Explain AI:", "What is Python?"]
responses = [client.text_generation(p, max_new_tokens=100) for p in prompts]
# Using OpenAI Python client (compatible API)
from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="dummy",  # TGI doesn't require auth by default
)
response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in response:
    print(chunk.choices[0].delta.content or "", end="")
# Tool calling / function calling
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string"}
            }
        }
    }
}]
response = client.chat.completions.create(
    model="tgi",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
if response.choices[0].message.tool_calls:
    print(response.choices[0].message.tool_calls[0].function)
TGI vs vLLM vs TensorRT-LLM
TGI: Best Hugging Face integration, OpenAI-compatible API, Docker-first, Apache 2.0. vLLM: Slightly faster (5-10%), pioneered PagedAttention, more quantization options. TensorRT-LLM: Fastest (10-20% over vLLM), NVIDIA-only, complex setup. Choose TGI for: Hugging Face ecosystem users, easy deployment, commercial support needs. Choose vLLM for: maximum performance, custom model formats. Choose TensorRT-LLM for: NVIDIA infrastructure, absolute peak performance. Most organizations start with TGI for ease of use and switch to vLLM or TensorRT-LLM only if serving performance becomes a bottleneck.
Professional Integration Services by 21medien
21medien offers TGI deployment services including Docker/Kubernetes setup, multi-GPU configuration, quantization optimization, API integration, and production monitoring. Our team specializes in LLM serving infrastructure, cost optimization through quantization and batching, and building reliable serving systems. Contact us for custom TGI deployment solutions.
Resources
GitHub: https://github.com/huggingface/text-generation-inference | Documentation: https://huggingface.co/docs/text-generation-inference | Container images: ghcr.io/huggingface/text-generation-inference (GitHub Container Registry)