
Quantization (Model Compression)

Quantization is the enabling technology that democratized large language model deployment, transforming models that required $100K+ server clusters into applications running on $500 consumer GPUs or even smartphones. The core technique reduces the numerical precision of model weights and activations from 32-bit floating point (FP32, consuming 4 bytes per parameter) to 8-bit integers (INT8, 1 byte) or even 4-bit representations (0.5 bytes), achieving 4-8x memory reduction with typically <2% accuracy degradation. A 70B parameter model requiring 280GB in FP32 shrinks to 35GB in 4-bit quantization, fitting comfortably on a single NVIDIA A100 (80GB), though it still exceeds a consumer RTX 4090's 24GB without more aggressive compression or partial CPU offloading. The quantization landscape has evolved rapidly: early methods simply rounded weights to the nearest level after training (naive post-training quantization, PTQ), while modern PTQ approaches like GPTQ and AWQ use calibration datasets to find optimal quantization scales, preserving 99%+ of original model quality. As of October 2025, quantization formats include GGUF (CPU-optimized, llama.cpp ecosystem), GPTQ (GPU-optimized, AutoGPTQ), AWQ (activation-aware, faster inference), and NF4 (4-bit NormalFloat used in QLoRA). The technique has become essential for edge deployment, with Qualcomm's Snapdragon and Apple's M4 chips featuring dedicated neural engines for efficient INT8/INT4 operations.
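To make the memory arithmetic concrete, the following minimal Python sketch reproduces the figures above (bytes = parameters × bits / 8, with 1 GB taken as 10^9 bytes); it counts weights only and ignores activations, KV cache, and per-format overhead such as quantization scales, which add to the totals in practice.

```python
# Rough weight-only memory estimate at different precisions.
# Ignores activations, KV cache, and per-format overhead such as
# quantization scales and zero-points.

def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Weight memory in GB: params * bits / 8 bytes, with 1 GB = 1e9 bytes."""
    return num_params * bits_per_weight / 8 / 1e9

for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B parameters @ {label:>4}: {weight_memory_gb(70e9, bits):6.1f} GB")
# 280.0, 140.0, 70.0, and 35.0 GB respectively
```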

ai-concepts model-compression optimization edge-deployment inference efficiency

Overview

Quantization addresses the fundamental tension in modern AI between model capability and computational practicality. Transformer models achieve their impressive performance through billions of parameters, each stored as a 32-bit floating-point number. A 7B parameter model consumes 28GB of memory (7B × 4 bytes) just for weights, excluding the activations, gradients, and optimizer states needed during training. Quantization compresses these representations by mapping continuous float values to discrete integer buckets. The simplest approach, symmetric quantization, maps FP32 values in the range [-α, α] to INT8 values in [-127, 127], where α is the maximum absolute weight value. Each weight is quantized as W_q = round(W / α × 127), and during inference, W ≈ W_q × α / 127. This 4x compression comes with quantization error, but transformer models exhibit remarkable robustness: redundant parameters and attention mechanisms provide natural error resilience.
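The symmetric scheme above is compact enough to write out directly. Here is a small NumPy sketch (illustrative, per-tensor only) that quantizes a toy weight matrix to INT8 and checks the round-trip error and memory ratio.

```python
import numpy as np

def symmetric_quantize_int8(w: np.ndarray):
    """Map FP32 values in [-alpha, alpha] to INT8 values in [-127, 127]."""
    alpha = np.max(np.abs(w))                        # maximum absolute weight value
    w_q = np.round(w / alpha * 127).astype(np.int8)  # W_q = round(W / alpha * 127)
    return w_q, alpha

def dequantize(w_q: np.ndarray, alpha: float) -> np.ndarray:
    """Approximate reconstruction: W ≈ W_q * alpha / 127."""
    return w_q.astype(np.float32) * alpha / 127

w = np.random.randn(4096, 4096).astype(np.float32) * 0.02   # toy weight matrix
w_q, alpha = symmetric_quantize_int8(w)
w_hat = dequantize(w_q, alpha)
print("max abs error:", np.max(np.abs(w - w_hat)))   # bounded by ~alpha / 254
print("memory ratio:", w.nbytes / w_q.nbytes)        # 4.0
```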

The quantization ecosystem has matured into distinct formats optimized for different deployment scenarios. GGUF (GPT-Generated Unified Format) targets CPU and Apple Silicon deployment with mixed-precision quantization (Q4_K_M, Q5_K_M variants) and an optimized memory layout, powering llama.cpp's ability to run 70B models on MacBooks. GPTQ (post-training quantization for generative pre-trained transformers) uses calibration with Hessian-based importance weighting to achieve 4-bit compression with <1% perplexity increase, ideal for GPU inference with the AutoGPTQ and ExLlamaV2 backends. AWQ (Activation-aware Weight Quantization) protects the ~1% most important weights identified through activation distributions, delivering 20-30% faster inference than GPTQ with comparable quality. NF4 (4-bit NormalFloat), developed for QLoRA, assumes weights follow a normal distribution and quantizes with non-uniform buckets aligned to that distribution. The choice between formats involves tradeoffs: GGUF maximizes CPU compatibility, GPTQ prioritizes GPU memory efficiency, AWQ optimizes inference speed, and NF4 enables fine-tuning on top of quantized base models.
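One way to see what makes NF4 different from uniform INT4 is to place the 16 representable levels at quantiles of a standard normal distribution. The sketch below (standard library only) illustrates that idea; it is not the exact NF4 codebook from the QLoRA paper, just a demonstration of quantile-based, non-uniform level placement.

```python
from statistics import NormalDist

# Illustration of the idea behind NF4: place the 16 representable 4-bit levels
# at quantiles of a standard normal, so levels are dense where normalized
# weights are dense. NOT the exact NF4 codebook, just the general recipe.
normal = NormalDist()
num_levels = 16
# Evenly spaced probabilities strictly inside (0, 1), mapped through the inverse CDF.
probs = [(i + 0.5) / num_levels for i in range(num_levels)]
levels = [normal.inv_cdf(p) for p in probs]
# Normalize to [-1, 1], as NF4 does for absmax-normalized weight blocks.
max_abs = max(abs(level) for level in levels)
levels = [level / max_abs for level in levels]
print([round(level, 3) for level in levels])   # dense near 0, sparse in the tails
```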

Key Concepts

  • Precision reduction: Converting float32 (4 bytes) → int8 (1 byte) → int4 (0.5 bytes) for 4-8x compression
  • Quantization granularity: Per-tensor (one scale for an entire layer), per-channel (one per output dimension), or per-group (e.g., blocks of 128 weights)
  • Symmetric vs asymmetric: Symmetric quantization maps to [-127, 127] around zero; asymmetric adds a zero-point offset to cover ranges not centered on zero (both illustrated in the sketch after this list)
  • Calibration: Using representative data to determine optimal quantization scales and zero-points
  • Mixed precision: Keeping sensitive layers (embeddings, first/last layers) in higher precision
  • Quantization-aware training (QAT): Training with quantization in the loop to adapt weights for lower precision
  • Post-training quantization (PTQ): Quantizing after training without weight updates, faster but potentially lower quality
  • Dynamic vs static quantization: Computing scales at inference time (dynamic) versus pre-computing (static)
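A short NumPy sketch tying several of these concepts together: asymmetric quantization (scale plus zero-point) applied per group of 128 weights, using the same W_q = round(W/S + Z) and W ≈ (W_q - Z) × S arithmetic described in the next section. The shapes, group size, and epsilon guard are illustrative choices.

```python
import numpy as np

def quantize_per_group_asymmetric(w: np.ndarray, group_size: int = 128, bits: int = 8):
    """Asymmetric per-group quantization: each block of `group_size` weights gets
    its own scale S and zero-point Z, mapping [min, max] onto [0, 2**bits - 1]."""
    qmax = 2**bits - 1
    groups = w.reshape(-1, group_size)                 # assumes size divisible by group_size
    w_min = groups.min(axis=1, keepdims=True)
    w_max = groups.max(axis=1, keepdims=True)
    scale = np.maximum((w_max - w_min) / qmax, 1e-8)   # S (guard against constant groups)
    zero_point = np.round(-w_min / scale)              # Z, so that w_min maps to 0
    w_q = np.clip(np.round(groups / scale + zero_point), 0, qmax).astype(np.uint8)
    return w_q, scale, zero_point

def dequantize_per_group(w_q, scale, zero_point, shape):
    """Reconstruction: W ≈ (W_q - Z) * S, reshaped back to the original layout."""
    return ((w_q.astype(np.float32) - zero_point) * scale).reshape(shape)

w = np.random.randn(256, 1024).astype(np.float32) * 0.05    # toy weight matrix
w_q, s, z = quantize_per_group_asymmetric(w)
w_hat = dequantize_per_group(w_q, s, z, w.shape)
print("mean abs error:", float(np.mean(np.abs(w - w_hat))))
```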

How It Works

Post-training quantization, the most common approach, proceeds in three steps: (1) Calibration—running representative data through the model to collect activation statistics and weight distributions; (2) Scale computation—determining quantization parameters (scale factor S and zero-point Z) that minimize quantization error, often using min-max ranges or percentile-based clipping; (3) Weight transformation—converting floating-point weights to integers: W_q = round(W/S + Z). During inference, dequantization happens just-in-time before matrix multiplications: W ≈ (W_q - Z) × S. Advanced methods like GPTQ improve on this by using approximate second-order information (the Hessian of the loss) to quantize weights a few columns at a time and update the remaining, not-yet-quantized weights to compensate for the error introduced. AWQ takes a different approach: it analyzes activation magnitudes across calibration data to find the ~1% of weight channels with the highest activation impact, then rescales those salient channels before quantization so they incur less error, rather than storing them in a separate higher-precision format. Quantization-aware training goes further by injecting fake quantization operations during training, allowing gradients to flow through the quantization step (via a straight-through estimator) so weights adapt to perform well at low precision.
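Here is a sketch of steps (1) and (2) for static activation quantization: collect activation samples from calibration batches, clip outliers with percentiles, and derive S and Z. The percentile values and the random stand-in data are illustrative assumptions; real toolchains gather these statistics from hooks inside the model's forward pass.

```python
import numpy as np

# Calibration sketch for static activation quantization: record activation
# statistics over representative inputs, then derive a clipped range and the
# corresponding scale/zero-point. Percentile clipping (here 0.1%/99.9%)
# discards outliers that would otherwise stretch the quantization range.

def calibrate_activation_range(batches, lower_pct=0.1, upper_pct=99.9):
    samples = np.concatenate([b.ravel() for b in batches])
    return np.percentile(samples, lower_pct), np.percentile(samples, upper_pct)

def scale_and_zero_point(x_min, x_max, bits=8):
    qmax = 2**bits - 1
    scale = (x_max - x_min) / qmax                 # S
    zero_point = int(round(-x_min / scale))        # Z
    return scale, zero_point

# Random arrays stand in for activations collected from calibration samples.
calibration_batches = [np.random.randn(32, 4096).astype(np.float32) for _ in range(16)]
x_min, x_max = calibrate_activation_range(calibration_batches)
S, Z = scale_and_zero_point(x_min, x_max)
print(f"clipped range [{x_min:.3f}, {x_max:.3f}] -> scale {S:.5f}, zero-point {Z}")
```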

Use Cases

  • Edge deployment: Running LLMs on smartphones, Raspberry Pi, and embedded devices
  • Consumer hardware inference: Serving models on a 24GB RTX 4090 that would require an A100 (80GB) at FP16 (e.g., 30B-class models at 4-bit)
  • Cost optimization: Reducing cloud inference costs by fitting more requests per GPU
  • Latency reduction: Faster matrix multiplication with INT8 operations on tensor cores
  • On-device AI: Privacy-preserving inference without cloud API calls (Apple Intelligence, Android AI)
  • Multi-model serving: Loading multiple quantized models in the same GPU memory budget
  • Fine-tuning on consumer hardware: QLoRA enables fine-tuning a 65B model on a single 48GB GPU, and 33B-class models on 24GB cards
  • Real-time applications: Lower latency for chatbots, code completion, and content generation
  • IoT and robotics: AI capabilities on constrained hardware platforms
  • Offline applications: Complete AI systems without internet connectivity requirements

Technical Implementation

Implementing quantization requires selecting format, precision level, and tooling based on the deployment target. For CPU/Apple Silicon deployment, GGUF via llama.cpp provides the best experience: Q4_K_M balances quality and size (roughly 4.5 bits per weight on average), Q5_K_M adds quality for about 15% more memory (~5.5 bits), and Q8_0 provides near-lossless compression (8 bits). GPU deployment typically uses GPTQ (via AutoGPTQ or ExLlamaV2) or AWQ (via AutoAWQ) for 4-bit quantization, with AWQ offering 1.2-1.3x faster inference but slightly higher memory usage. The quantization process requires calibration data: 512-2048 samples from domains similar to production use cases, often drawn from the Wikitext or C4 datasets. Calibration takes 5-30 minutes for 7B models and 1-3 hours for 70B models on modern GPUs. Quality assessment compares perplexity on held-out data: a <1% increase is considered excellent, 1-3% acceptable, and >5% suggests problems. For production, hardware acceleration matters: NVIDIA GPUs with INT8 tensor cores (A100, H100, RTX 40-series) achieve 2-3x speedup over FP16, while newer hardware adds FP8 (H100) and FP4 (B200) support for further gains. Apple Silicon (M4 Max) delivers around 50 tokens/sec for Llama 3 8B in Q4 quantization via llama.cpp's Metal backend, approaching dedicated GPU performance.
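Below is a simplified version of the perplexity check described above, using Hugging Face Transformers. The model id, the bitsandbytes NF4 loading path (standing in for a GPTQ/AWQ checkpoint), the small WikiText slice, and the unweighted chunk averaging are all simplifying assumptions for illustration.

```python
import math, torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "meta-llama/Llama-3.1-8B"   # illustrative; any causal LM works

def perplexity(model, tokenizer, text, max_length=2048):
    """Chunked perplexity: mean cross-entropy over fixed-length chunks, exponentiated."""
    enc = tokenizer(text, return_tensors="pt").input_ids[0]
    losses = []
    for start in range(0, enc.numel() - 1, max_length):
        ids = enc[start:start + max_length].unsqueeze(0).to(model.device)
        with torch.no_grad():
            out = model(ids, labels=ids)
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:500])
tok = AutoTokenizer.from_pretrained(MODEL_ID)

fp16 = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float16, device_map="auto")
ppl_fp16 = perplexity(fp16, tok, text)
del fp16; torch.cuda.empty_cache()

quant = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4"),
)
ppl_4bit = perplexity(quant, tok, text)
print(f"FP16 ppl {ppl_fp16:.2f} | 4-bit ppl {ppl_4bit:.2f} | delta {100*(ppl_4bit/ppl_fp16 - 1):.2f}%")
```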

Best Practices

  • Start with 8-bit quantization for near-lossless compression, move to 4-bit only if memory constrained
  • Use GPTQ or AWQ for GPU deployment (4-bit), GGUF for CPU/Mac (4-8 bit mixed)
  • Calibrate with domain-relevant data—generic web text may not represent your use case
  • Test perplexity and task-specific metrics before deploying quantized models
  • Keep embeddings and output layers at higher precision for better quality (mixed precision)
  • For QLoRA training, use NF4 quantization with compute_dtype=bfloat16 (see the configuration sketch after this list)
  • Monitor inference latency—more aggressive quantization does not always speed things up, since dequantization overhead and memory bandwidth can dominate
  • Combine quantization with other optimizations: flash attention, paged attention, continuous batching
  • Document quantization method and calibration data for reproducibility and debugging
  • Regularly re-quantize when models are updated or fine-tuned
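As referenced in the QLoRA bullet above, the NF4 configuration typically looks like the following; the model id is illustrative, and enabling double quantization is a common but optional choice.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 base-model loading as typically used for QLoRA fine-tuning:
# 4-bit NormalFloat storage, bfloat16 compute, and double quantization
# (which also quantizes the quantization constants to save a bit more memory).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # illustrative model id
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters (e.g., via the peft library's LoraConfig/get_peft_model)
# are then attached on top of this frozen, quantized base model.
```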

Tools and Frameworks

The quantization ecosystem centers on several specialized tools. llama.cpp provides GGUF quantization via its llama-quantize tool (e.g., ./llama-quantize model-f16.gguf model-Q4_K_M.gguf Q4_K_M) and CPU-optimized inference reaching 50+ tokens/sec on an M4 Max for 8B models. AutoGPTQ handles GPU quantization with a Python API (from auto_gptq import AutoGPTQForCausalLM), offering 4-bit GPTQ with an ExLlamaV2 backend for roughly 2x faster inference than bitsandbytes. AutoAWQ provides activation-aware quantization with a similar API and a 20-30% speed advantage over GPTQ at comparable quality. bitsandbytes (by Tim Dettmers) pioneered accessible quantization with 8-bit and 4-bit support integrated into Hugging Face Transformers, enabling quantized inference with a single parameter (load_in_8bit=True). Optimum (Hugging Face) abstracts quantization across backends (ONNX Runtime, OpenVINO, TensorRT) for hardware-specific optimization. The Hugging Face Hub hosts 30,000+ pre-quantized models in GGUF, GPTQ, and AWQ formats, eliminating the need to quantize yourself. Inference servers such as vLLM (quantized KV cache support), Text Generation Inference (GPTQ/AWQ support), and Ollama (GGUF-focused) provide production-ready quantized model serving. Apple's MLX framework enables quantization for Apple Silicon through its quantization utilities (e.g., mlx.nn.quantize), leveraging the unified memory architecture.
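The AutoGPTQ flow mentioned above looks roughly like the sketch below; the model id, the tiny calibration set, and the quantization settings are illustrative stand-ins for the 512-2048 domain-relevant samples recommended earlier.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-1.3b"                       # illustrative small model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with per-group scales over blocks of 128 weights.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# Calibration examples: a handful of tokenized samples stands in for the
# hundreds of domain-relevant samples used in practice.
calibration_texts = ["Quantization reduces the precision of model weights."] * 8
examples = [tokenizer(t, return_tensors="pt") for t in calibration_texts]

model.quantize(examples)                             # runs GPTQ layer by layer
model.save_quantized("opt-1.3b-gptq-4bit")           # writes quantized weights + config
```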

Related Techniques

Quantization is one of several model optimization techniques. Pruning removes unnecessary weights (structured or unstructured) and is often combined with quantization for 10-20x total compression. Knowledge distillation trains smaller models to mimic larger ones and is complementary to quantization (the distilled model can itself be quantized). LoRA and QLoRA use quantization differently: QLoRA quantizes the base model to 4-bit (with NF4) while keeping adapter weights at full precision, enabling training on consumer GPUs. Flash Attention optimizes attention computation memory and speed without precision reduction and stacks with quantization for maximum performance. Compilation (TensorRT, torch.compile in PyTorch 2.0) fuses operations and generates optimized kernels, improving quantized model performance further. Mixture-of-Experts (MoE) models like Mixtral use sparse activation, routing each token to only 2 of 8 experts; quantizing each expert to 4-bit brings an 8x7B MoE down to roughly 24-28GB. The emerging frontier is extreme quantization: 3-bit (reducing 70B to ~26GB), 2-bit (~17.5GB), and 1-bit (~8.75GB, but typically >10% quality loss), using techniques like QuIP and OmniQuant. FP8 (8-bit floating point) on H100/B200 GPUs offers better quality than INT8 for similar memory savings.