
Model Serving

Model serving is the engineering discipline of deploying AI models to production where they handle real user requests. Unlike training (batch processing, can take hours), serving requires low latency (<100ms), high throughput (1000+ requests/second), and 99.9% uptime. Key challenges: loading multi-GB models into memory, batching requests efficiently, managing GPU resources, handling traffic spikes. Solutions include specialized frameworks (TensorFlow Serving, TorchServe, vLLM), optimization techniques (quantization, batching), and infrastructure (load balancers, autoscaling). Modern model serving enables GPT-4 to handle millions of users, Stable Diffusion to generate images in seconds, and recommendation engines to personalize experiences in real-time.

ai-concepts model-serving mlops inference deployment production-ai

Overview

Model serving bridges the gap between trained models and production applications. Training produces a 7GB model file; serving makes that model answer 10,000 queries per second at 50ms latency. The challenge: AI models are computationally expensive (GPT-4 is estimated to need ~100 billion operations per token), memory-intensive (the weights alone occupy many GBs of VRAM), and costly to run (GPU instances cost around $3/hour). Effective serving requires batching multiple requests together, caching frequent queries, quantizing weights to reduce memory, and scaling horizontally across multiple GPUs.
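
Batching is the biggest single lever here, and a toy benchmark makes the point. The sketch below uses a small stand-in MLP with made-up sizes (not a real production model) to time 64 one-at-a-time forward passes against a single batched pass; the batched path is typically markedly faster per request, though the exact ratio depends on the model and hardware.

```python
# Toy benchmark: 64 one-at-a-time forward passes vs. a single batched pass.
# The MLP, sizes, and request count are made up purely for illustration.
import time

import torch

model = torch.nn.Sequential(
    torch.nn.Linear(512, 2048),
    torch.nn.ReLU(),
    torch.nn.Linear(2048, 512),
).eval()

requests = [torch.randn(1, 512) for _ in range(64)]  # 64 independent "requests"

with torch.no_grad():
    # Naive serving: one forward pass per incoming request.
    start = time.perf_counter()
    for x in requests:
        model(x)
    sequential_ms = (time.perf_counter() - start) * 1000

    # Batched serving: stack the same requests into one forward pass.
    batch = torch.cat(requests, dim=0)  # shape (64, 512)
    start = time.perf_counter()
    model(batch)
    batched_ms = (time.perf_counter() - start) * 1000

print(f"sequential: {sequential_ms:.1f} ms, batched: {batched_ms:.1f} ms")
```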

Model Serving vs Training

  • **Training**: Batch processing, takes hours/days, uses multiple GPUs, can retry failures, offline
  • **Serving**: Real-time, requires <100ms, single GPU per request, must handle failures gracefully, user-facing
  • **Training Optimizations**: Large batches, gradient checkpointing, mixed precision, distributed across GPUs
  • **Serving Optimizations**: Small batches, dynamic batching, quantization, KV-cache, speculative decoding (a dynamic batching sketch follows this list)
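
Dynamic batching is worth seeing concretely. Below is a minimal sketch of the idea: a worker collects requests for a few milliseconds and runs them through the model as one batch. The queue, the 5ms window, and the stand-in model are all assumptions for illustration; production frameworks such as TorchServe, Triton, and vLLM implement far more sophisticated versions of this loop.

```python
# Minimal dynamic-batching loop: wait up to MAX_WAIT_S for more requests,
# then run the accumulated batch through the model in one forward pass.
import asyncio

import torch

MAX_BATCH = 32       # assumed maximum batch size
MAX_WAIT_S = 0.005   # assumed 5 ms batching window

model = torch.nn.Linear(512, 512).eval()   # stand-in for a real model

async def batching_worker(queue: asyncio.Queue) -> None:
    loop = asyncio.get_running_loop()
    while True:
        x, fut = await queue.get()                     # block until a request arrives
        inputs, futures = [x], [fut]
        deadline = loop.time() + MAX_WAIT_S
        while len(inputs) < MAX_BATCH:                 # top up the batch until the window closes
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                x, fut = await asyncio.wait_for(queue.get(), remaining)
            except asyncio.TimeoutError:
                break
            inputs.append(x)
            futures.append(fut)
        with torch.no_grad():
            outputs = model(torch.stack(inputs))       # one forward pass for the whole batch
        for fut, out in zip(futures, outputs):
            fut.set_result(out)

async def infer(queue: asyncio.Queue, x: torch.Tensor) -> torch.Tensor:
    # Per-request entry point; resolves once its batch has been processed.
    fut = asyncio.get_running_loop().create_future()
    await queue.put((x, fut))
    return await fut

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue()
    asyncio.create_task(batching_worker(queue))
    results = await asyncio.gather(*(infer(queue, torch.randn(512)) for _ in range(100)))
    print(f"served {len(results)} requests")

asyncio.run(main())
```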

Key Serving Challenges

  • **Latency**: Users expect <100ms responses, but large models take 500ms+ per forward pass
  • **Throughput**: Handle 1000+ requests/second with limited GPU resources
  • **Memory**: A 70B-parameter model needs ~140GB of VRAM in FP16, but an H100 has only 80GB (see the back-of-envelope calculation after this list)
  • **Cost**: GPU instances cost $2-$8/hour, must maximize utilization to justify cost
  • **Reliability**: 99.9% uptime required, models must handle edge cases gracefully
  • **Scaling**: Traffic varies 10× between peak and off-peak hours
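
The memory numbers above follow directly from parameter count times bytes per weight. The calculation below counts only the weights; the KV cache, activations, and framework overhead come on top.

```python
# Weight-only VRAM estimate for a dense model at different precisions.
# Real deployments also need room for the KV cache and activations.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}
H100_VRAM_GB = 80

def weight_vram_gb(params_billions: float, dtype: str) -> float:
    # 1e9 params × bytes-per-param / 1e9 bytes-per-GB
    return params_billions * BYTES_PER_PARAM[dtype]

for dtype in ("fp16", "int8", "int4"):
    gb = weight_vram_gb(70, dtype)
    fits = "fits on" if gb <= H100_VRAM_GB else "exceeds"
    print(f"70B weights in {dtype}: {gb:.0f} GB ({fits} a single {H100_VRAM_GB} GB H100)")
```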

Business Integration

Model serving determines AI product viability. A chatbot with 5-second response times fails—users abandon. An image generator that costs $2/image isn't profitable—serving optimizations reduce cost to $0.05. A recommendation engine that crashes under Black Friday traffic loses millions—autoscaling prevents outages. Effective serving makes AI products fast, cheap, and reliable enough for production. Key business metrics: P95 latency (95% of requests <100ms), throughput (requests/second/GPU), cost per inference ($0.001-$0.10), and uptime (99.9%+).
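
A quick way to keep those metrics honest is to compute them from the raw request log rather than from averages. The snippet below uses synthetic latencies and assumed GPU pricing and throughput figures, purely to show the arithmetic.

```python
# Computing P50/P95/P99 latency and cost per inference from a request log.
# The latency distribution, GPU price, and throughput are made-up examples.
import numpy as np

rng = np.random.default_rng(0)
latencies_ms = rng.lognormal(mean=3.5, sigma=0.5, size=10_000)  # synthetic request log

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"P50={p50:.0f} ms  P95={p95:.0f} ms  P99={p99:.0f} ms")

gpu_cost_per_hour = 8.00      # assumed price for a large-GPU instance
requests_per_second = 2.0     # assumed sustained throughput for a large LLM
cost_per_inference = gpu_cost_per_hour / (requests_per_second * 3600)
print(f"cost per inference: ${cost_per_inference:.4f}")
```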

Real-World Example: E-Commerce Search

An e-commerce site uses a 7B embedding model for semantic product search. Initial deployment: a naive Flask API that loads the model on each request, giving 3000ms latency and 1 request/second of throughput. After serving optimizations (TorchServe with batching, a quantized INT8 model, and query caching): 80ms latency, 200 requests/second, and a $0.002/request cost. This enables real-time search for 10M users with 4 GPU instances ($288/day) instead of 800 instances ($57,600/day). A sketch of the quantization and caching pieces follows.
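
Two of those optimizations are easy to sketch in isolation: dynamic INT8 quantization of the encoder's linear layers and an in-process cache for repeated queries. The encoder below is a tiny stand-in (not the site's actual 7B model) and the tokenization is faked with a hash, so treat this as the shape of the solution only.

```python
# Sketch: INT8 dynamic quantization of an encoder plus a query cache.
# The encoder and "tokenization" are stand-ins; only the pattern matters.
from functools import lru_cache

import torch

encoder = torch.nn.Sequential(          # hypothetical stand-in for the embedding model
    torch.nn.Linear(768, 768),
    torch.nn.ReLU(),
    torch.nn.Linear(768, 384),
).eval()

# Dynamic quantization stores Linear weights in INT8 (CPU inference path).
quantized = torch.quantization.quantize_dynamic(
    encoder, {torch.nn.Linear}, dtype=torch.qint8
)

def embed_text(text: str) -> torch.Tensor:
    # A real system would tokenize `text`; here we hash it into fake features.
    torch.manual_seed(hash(text) % (2**31))
    features = torch.randn(1, 768)
    with torch.no_grad():
        return quantized(features)

@lru_cache(maxsize=100_000)
def cached_embedding(text: str) -> tuple:
    # Repeated queries hit the cache and never touch the model.
    return tuple(embed_text(text).squeeze(0).tolist())

vec = cached_embedding("wireless headphones")   # first call runs the model
vec = cached_embedding("wireless headphones")   # second call is a cache hit
print(len(vec))                                  # 384-dimensional embedding
```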

Implementation Example

Technical Specifications

  • **Target Latency**: P50 <50ms, P95 <100ms, P99 <200ms for production systems
  • **Throughput**: 100-1000 requests/second/GPU depending on model size
  • **Batch Size**: 8-32 optimal for transformers, 64-128 for CNNs
  • **Memory**: INT8 quantization cuts VRAM roughly 4× versus FP32 (2× versus FP16), INT4 roughly 8×, typically with <2% accuracy loss
  • **Cost**: $0.001-$0.10 per inference depending on model size and optimization
  • **Frameworks**: TorchServe, TensorFlow Serving, vLLM, Ray Serve, Triton Inference Server (a vLLM sketch follows below)
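
As the implementation example, here is a minimal sketch using vLLM's offline API; the model name, prompts, and sampling settings are placeholders, so consult the vLLM docs for the current interface. The engine handles continuous batching and KV-cache paging internally, and for online serving vLLM also ships an OpenAI-compatible HTTP server.

```python
# Minimal vLLM sketch: the engine batches prompts and manages the KV cache
# (PagedAttention) internally. The model and settings are illustrative only.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")               # small model for illustration
params = SamplingParams(temperature=0.7, max_tokens=64)

prompts = [
    "Explain model serving in one sentence:",
    "List two ways to reduce inference latency:",
]
for output in llm.generate(prompts, params):
    print(output.prompt, "->", output.outputs[0].text.strip())
```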

Best Practices

  • Use specialized serving frameworks (vLLM for LLMs, TorchServe for general models) not Flask
  • Enable batching—handle 8-32 requests together for 5-10× throughput improvement
  • Quantize models to INT8 or INT4—reduces memory 4-8× with minimal accuracy loss
  • Cache frequent queries—80% of queries are repeats, caching saves 80% of compute
  • Monitor P95/P99 latency not just average—tail latency matters for user experience
  • Autoscale based on queue depth not CPU—GPU utilization is what matters
  • Use health checks and graceful shutdown—avoid serving stale or crashed models (a minimal FastAPI sketch follows this list)
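
The last practice can be sketched with a small FastAPI wrapper: a lifespan hook that loads and releases the model cleanly, and a readiness endpoint that refuses traffic until a smoke inference succeeds. The /healthz path and the stand-in model below are assumptions, not a fixed convention, and a recent FastAPI version is assumed.

```python
# Sketch: readiness check plus clean startup/shutdown for a model server.
# The /healthz path and the stand-in model are illustrative assumptions.
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI, HTTPException

state = {"model": None}

@asynccontextmanager
async def lifespan(app: FastAPI):
    state["model"] = torch.nn.Linear(16, 4).eval()   # load the real model here
    yield                                            # serve requests
    state["model"] = None                            # graceful shutdown: release resources

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
def healthz():
    # Fail the check if the model is missing or a smoke inference errors out,
    # so the load balancer stops routing traffic to this replica.
    if state["model"] is None:
        raise HTTPException(status_code=503, detail="model not loaded")
    with torch.no_grad():
        state["model"](torch.zeros(1, 16))
    return {"status": "ok"}

# Run with, e.g.: uvicorn serving_sketch:app --port 8000
```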