Milvus
Milvus emerged in 2019 from Zilliz as the first purpose-built, cloud-native vector database designed for massive scale from the ground up. While other solutions evolved from existing databases, Milvus was architected specifically for vector workloads, supporting 10+ index types (HNSW, IVF, DiskANN), trillion-scale datasets, hybrid search combining dense and sparse vectors, and true cloud-native deployment on Kubernetes. Donated to the LF AI & Data Foundation in 2020, Milvus reached production maturity by 2022 and now powers AI applications at eBay (product recommendations on 1.3B+ items), Walmart (visual search), NVIDIA (GPU optimization research), and thousands of enterprises. The architecture separates storage and compute: MinIO/S3 for object storage, Pulsar/Kafka for the message queue, etcd for metadata, with each component scaling independently. Support for multiple index types allows the optimal tradeoff per workload: HNSW for speed (sub-10ms queries), IVF_FLAT for accuracy, DiskANN for cost (10x larger datasets in the same memory via disk offloading), and SCANN, Google's quantization-based algorithm. Key innovations: tunable consistency levels from eventual to strong, time travel (query historical states), multi-tenancy with resource isolation, and GPU acceleration for index building and queries. Performance: 10,000+ queries/second on billion-vector datasets, sub-20ms p99 latency, horizontal scaling to 100+ nodes, trillion-vector capacity across clusters. Milvus 2.4 added sparse vector support (BM25, SPLADE), enhanced filtering performance (5-10x faster), and improved resource efficiency (30% less memory). Deployment options: standalone mode (single node), cluster mode (distributed), Milvus Lite (embedded database), and Zilliz Cloud (fully managed). 21medien deploys Milvus for clients requiring enterprise-grade vector search with complex requirements: we handle cluster architecture, index optimization, monitoring, and operational excellence, enabling organizations to build production AI systems that scale from millions to billions of vectors with predictable performance and costs.
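A minimal sketch of how the same client code targets each deployment option, assuming pymilvus 2.4+ (which bundles Milvus Lite); the local file path, host, and the Zilliz Cloud URI/token are placeholders:

```python
# Minimal sketch: one client API across deployment options.
# Assumes pymilvus >= 2.4 (Milvus Lite bundled); host, URI, and token are placeholders.
from pymilvus import MilvusClient

# Milvus Lite: embedded, data stored in a local file, no server required
lite = MilvusClient("./milvus_demo.db")

# Standalone or cluster deployment: connect to the gRPC endpoint
server = MilvusClient(uri="http://localhost:19530")

# Zilliz Cloud (managed): same API, authenticated with an API key
cloud = MilvusClient(uri="https://YOUR-CLUSTER.zillizcloud.com", token="YOUR_API_KEY")

# Quick-start collection with an auto-generated schema (id + 768-dim vector)
lite.create_collection(collection_name="demo", dimension=768)
print(lite.list_collections())
```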

Overview
Milvus solves enterprise vector search challenges that simpler solutions cannot address: trillion-scale datasets, complex filtering requirements, multi-tenancy with resource isolation, regulatory compliance, and high availability with disaster recovery. The cloud-native architecture separates concerns: compute nodes (query nodes, data nodes, index nodes) scale independently from storage (MinIO, S3, Azure Blob). Message queue (Pulsar, Kafka) provides reliable communication and enables time travel—query database state at any historical timestamp for debugging or compliance. Metadata service (etcd) manages cluster state with automatic failover. This architecture enables: (1) Independent scaling—add query nodes for read throughput without touching storage, (2) Cost optimization—use cheap object storage for cold data, expensive SSD for hot data, (3) Fault tolerance—any component can fail without data loss, (4) Multi-cloud—deploy on AWS, GCP, Azure, or hybrid with consistent experience. Index diversity is Milvus's strength: HNSW (Hierarchical Navigable Small World) for low-latency queries under 10ms, IVF (Inverted File) variants for accuracy-cost tradeoffs, DiskANN for memory-constrained scenarios (stores vectors on disk, loads relevant portions during query), GPU-accelerated indexes (CAGRA, RAFT) for maximum throughput. Users select index per collection based on requirements. Consistency model offers tunable guarantees: Strong (linearizable reads, highest latency), Bounded (staleness within time threshold), Session (read-your-writes), Eventual (lowest latency). This flexibility enables appropriate tradeoff per use case—strong for financial transactions, eventual for product recommendations.
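The consistency tradeoff is exposed directly in the client API. A minimal sketch of per-request consistency levels, assuming the 'products' collection and index defined in the Code Examples section and a local Milvus instance:

```python
# Per-request consistency levels (Strong, Bounded, Session, Eventually).
# Assumes the 'products' collection and index from the Code Examples section.
from pymilvus import connections, Collection
import numpy as np

connections.connect(host="localhost", port="19530")
products = Collection("products")   # collection-wide default was set at creation
products.load()

query = [np.random.rand(768).tolist()]
params = {"metric_type": "COSINE", "params": {"ef": 64}}

# Latency-sensitive traffic (e.g. recommendations): eventual consistency
fast = products.search(data=query, anns_field="embedding", param=params,
                       limit=10, consistency_level="Eventually")

# Correctness-critical reads: strong consistency sees all prior writes
exact = products.search(data=query, anns_field="embedding", param=params,
                        limit=10, consistency_level="Strong")
```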
Enterprise features distinguish Milvus from lighter alternatives. Multi-tenancy: Create isolated collections with dedicated resources (CPU, memory, disk quotas), preventing noisy neighbors. Role-based access control (RBAC) integrates with enterprise auth systems. Hybrid search combines multiple vector fields and scalar filters in a single query: 'Find products similar to image embedding AND similar to text embedding WHERE category=electronics AND price<500 AND in_stock=true'. Attribute indexing accelerates filters: bitmap indexes for low-cardinality fields, inverted indexes for text, range indexes for numerics. Time travel enables auditing and debugging: query how recommendations looked last week, track embedding drift over time, roll back to a known-good state. GPU acceleration speeds index building 10-100x (IVF index on 100M vectors: 30 minutes on GPU vs 20+ hours on CPU) and query processing 5-10x for certain workloads. Dynamic schema allows adding fields without downtime, essential for evolving ML systems. Backup and disaster recovery: point-in-time snapshots, cross-region replication, automated failover. Real-world scale: eBay runs a Milvus cluster managing 1.3B+ product vectors across 50+ nodes, serving 50K+ queries/second with p99 latency under 30ms for its visual search and 'similar products' features. Shopify uses Milvus for merchant product recommendations, processing 100M+ vectors with complex filtering (geography, pricing tiers, inventory). 21medien implements Milvus for enterprise clients with demanding requirements: we've architected multi-region clusters serving 5B+ vectors, achieving 99.99% uptime and p95 latency under 15ms with complex filtering, supporting 10K+ concurrent users, all while maintaining compliance with data residency requirements (GDPR, CCPA) and providing detailed audit logs for regulatory reporting.
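One common isolation pattern combines partitions with filter pushdown: writes and reads are scoped to a tenant's partition, and attribute filters narrow results further. A sketch assuming the 'products' collection (with its HNSW index) from the Code Examples section; the tenant name is invented for illustration:

```python
# Tenant isolation via partitions plus attribute filtering (a sketch; the
# tenant name is invented, the 'products' schema comes from the Code Examples section).
from pymilvus import connections, Collection
import numpy as np

connections.connect(host="localhost", port="19530")
products = Collection("products")

if not products.has_partition("tenant_acme"):
    products.create_partition("tenant_acme")

# Route writes to the tenant's partition
vectors = np.random.rand(100, 768).tolist()
titles = [f"ACME product {i}" for i in range(100)]
prices = np.random.uniform(10, 1000, 100).tolist()
products.insert([vectors, titles, prices], partition_name="tenant_acme")

# Scope reads to the same partition; scalar filters are pushed down via expr
products.load()
hits = products.search(
    data=[np.random.rand(768).tolist()],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 100}},
    limit=5,
    expr="price < 500",
    partition_names=["tenant_acme"],
    output_fields=["title", "price"],
)
```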
Key Features
- Cloud-native architecture: Kubernetes-native deployment with independent scaling of compute and storage, supports AWS, GCP, Azure, hybrid
- Trillion-scale capacity: Proven deployments managing billions of vectors, horizontal scaling to 100+ nodes, DiskANN enables 10x larger datasets per node
- Multiple index types: HNSW, IVF variants, DiskANN, GPU-accelerated indexes (CAGRA, RAFT), optimal tradeoff for each use case
- Hybrid search: Combine multiple vector fields (image + text embeddings), sparse vectors (BM25, SPLADE), and attribute filters in single query
- Tunable consistency: Strong, Bounded, Session, Eventual consistency levels for appropriate latency-correctness tradeoff
- Time travel: Query historical database states at any timestamp, track embedding evolution, debugging, compliance auditing
- Multi-tenancy: Resource isolation with CPU/memory quotas per collection, RBAC, separate namespaces for departments or customers
- High availability: Automatic failover, cross-region replication, backup and restore, zero-downtime upgrades
- Advanced filtering: Bitmap, inverted, and range indexes on attributes for 10-100x faster filtered queries, with support for complex boolean expressions
- GPU acceleration: 10-100x faster index building, 5-10x faster queries for certain workloads, automatic GPU scheduling across cluster
Technical Architecture
Milvus architecture consists of four layers. Access Layer: SDK clients (Python, Java, Go, Node.js) connect via gRPC, load balancer distributes requests across proxy nodes, proxies validate requests and route to appropriate components. Coordinator Layer: Root coordinator manages cluster state, data coordinator handles data persistence, query coordinator orchestrates query execution, index coordinator manages index building. Worker Layer: Query nodes execute queries on indexed data, data nodes handle data ingestion and persistence, index nodes build indexes offline. Storage Layer: Meta store (etcd) holds collection schemas and cluster state, message queue (Pulsar/Kafka) provides reliable communication and enables time travel, object storage (MinIO, S3, Azure Blob) stores vectors and indexes, local cache (in-memory) speeds hot data access. Data flow: (1) Insert—vectors written to log broker, data nodes batch and persist to object storage, (2) Index—index nodes read from storage, build indexes (HNSW/IVF/DiskANN), write back to storage, (3) Query—query nodes load relevant segments and indexes from storage/cache, execute search, return results. Consistency implementation: write operations tagged with timestamps, read operations specify consistency level, system ensures reads see all writes up to specified timestamp. Scaling: horizontal scaling by adding nodes to each layer, partitioning distributes collections across nodes based on partition key, replicas provide read scalability and fault tolerance. Index selection example: HNSW for latency-critical applications (recommendation systems, real-time search), IVF_FLAT for accuracy-critical (research, forensics), DiskANN for cost-sensitive large-scale (archival search, cold data queries), GPU indexes for throughput-critical batch processing. 21medien architects Milvus clusters for optimal cost-performance: selecting appropriate index types per collection, configuring partition strategies (time-based for logs, geography-based for multi-region), tuning resource allocation (query nodes get more CPU, index nodes get more memory), implementing caching strategies (hot collections in memory, warm on SSD, cold in object storage), monitoring and auto-scaling based on query patterns.
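A sketch of per-collection index selection along these lines; the collection names are hypothetical and the parameter values are starting points rather than tuned recommendations:

```python
# Per-collection index selection (collection names are hypothetical; each is
# assumed to have a 768-dim FLOAT_VECTOR field named 'embedding' and to not
# be loaded yet). Parameter values are starting points, not tuned settings.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")

# Latency-critical serving: HNSW graph kept in memory
Collection("realtime_search").create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "COSINE",
                  "params": {"M": 16, "efConstruction": 200}},
)

# Accuracy-oriented workloads: IVF_FLAT (vectors stored uncompressed)
Collection("forensics").create_index(
    field_name="embedding",
    index_params={"index_type": "IVF_FLAT", "metric_type": "L2",
                  "params": {"nlist": 4096}},
)

# Cost-sensitive, larger-than-memory data: DiskANN stores vectors on disk
Collection("archive").create_index(
    field_name="embedding",
    index_params={"index_type": "DISKANN", "metric_type": "COSINE",
                  "params": {}},
)
```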
Common Use Cases
- E-commerce search: Visual product search, 'similar items' recommendations, multi-modal search (image + text) at billion-product scale
- Content recommendation: Personalized content suggestions for streaming platforms, news sites, social media based on user embeddings
- Fraud detection: Detect fraudulent transactions, accounts, behaviors by similarity to known fraud patterns in real-time
- Drug discovery: Search molecular databases for compounds similar to target structure, accelerate pharmaceutical research
- Security & surveillance: Face recognition, object detection, anomaly detection in video feeds processing millions of frames
- Customer support: Semantic search across support tickets, knowledge bases, chat logs to find relevant solutions instantly
- Document management: Enterprise document search, contract analysis, compliance checking across millions of documents
- Autonomous vehicles: Real-time similarity search for perception systems identifying objects, scenarios based on sensor embeddings
- Recommendation systems: Collaborative filtering at massive scale, hybrid recommendations combining multiple signals (user, item, context)
- Research & analytics: Academic paper search, patent prior art, scientific literature analysis with citation and content embeddings
Integration with 21medien Services
21medien provides comprehensive Milvus deployment and management services. Phase 1 (Architecture Design): We assess your requirements (scale, latency, throughput, compliance), design cluster topology (node counts, instance types, storage strategy), select index types per collection, plan capacity and growth, and estimate costs. Architecture decisions: standalone for <10M vectors, cluster for >10M, multi-region for global applications, hybrid cloud for data sovereignty. Phase 2 (Deployment): We deploy Milvus on Kubernetes using Helm charts or operators, configure infrastructure (node pools, storage classes, networking), integrate with monitoring (Prometheus, Grafana, alerts), set up authentication and RBAC, and implement backup/restore procedures. Production hardening: resource limits, pod disruption budgets, network policies, encryption at rest and in transit. Phase 3 (Data Migration): We build ETL pipelines to import vectors from existing systems (Elasticsearch, PostgreSQL, other vector databases), implement parallel ingestion for fast migration (millions of vectors per hour), validate data integrity, and create indexes optimized for query patterns. Zero-downtime migration for production systems using blue-green deployment. Phase 4 (Optimization): We tune index parameters (HNSW M/efConstruction, IVF nlist/nprobe), optimize partition strategies, configure caching layers, implement query optimization (attribute indexing, filter pushdown), and benchmark performance against SLAs. Load testing validates that the cluster handles peak traffic. Phase 5 (Operations): Ongoing support includes performance monitoring, capacity planning, scaling operations (vertical and horizontal), index optimization as data evolves, incident response, and cost optimization. Automated scaling based on metrics. Example: for an online retailer, we deployed Milvus managing 500M product vectors across a 20-node cluster, serving visual search and recommendations: 12,000 queries/second, p95 latency of 18ms with complex filters (category, price, availability, brand), 99.99% uptime, and an infrastructure cost of $8,000/month (versus $50,000+/month on a managed alternative), enabling a visual search feature that increased conversion rate 18% and average order value 12%, generating $2M+ monthly incremental revenue.
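A simplified sketch of the Phase 3 ingestion pattern: batched inserts followed by a flush. The fetch_batches() helper is a placeholder for whatever export the source system provides, and the batch size is illustrative:

```python
# Batched ingestion sketch for a migration (Phase 3). fetch_batches() is a
# placeholder for the source-system export; batch size is illustrative.
from pymilvus import connections, Collection
import numpy as np

def fetch_batches(batch_size=10_000, num_batches=3):
    """Placeholder: yield (vectors, titles, prices) exported from the source system."""
    for _ in range(num_batches):
        yield (
            np.random.rand(batch_size, 768).tolist(),
            [f"Migrated product {i}" for i in range(batch_size)],
            np.random.uniform(10, 1000, batch_size).tolist(),
        )

connections.connect(host="localhost", port="19530")
products = Collection("products")

inserted = 0
for vectors, titles, prices in fetch_batches():
    products.insert([vectors, titles, prices])
    inserted += len(vectors)
    print(f"Inserted {inserted} vectors so far")

products.flush()  # persist pending segments before index building
print(f"Total entities: {products.num_entities}")
```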
Code Examples
Install the Milvus client: pip install pymilvus

```python
# Basic setup: connect, define a schema, and create a collection
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType
import numpy as np

connections.connect(host='localhost', port='19530')

# Define schema
fields = [
    FieldSchema(name='id', dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name='embedding', dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name='title', dtype=DataType.VARCHAR, max_length=200),
    FieldSchema(name='price', dtype=DataType.FLOAT),
]
schema = CollectionSchema(fields, description='Product embeddings')
collection = Collection(name='products', schema=schema)

# Insert vectors
vectors = np.random.rand(1000, 768).tolist()
titles = [f'Product {i}' for i in range(1000)]
prices = np.random.uniform(10, 1000, 1000).tolist()
collection.insert([vectors, titles, prices])

# Create index and load the collection for search
index_params = {'metric_type': 'COSINE', 'index_type': 'HNSW', 'params': {'M': 16, 'efConstruction': 200}}
collection.create_index(field_name='embedding', index_params=index_params)
collection.load()

# Search with filtering
query_vector = np.random.rand(768).tolist()
search_params = {'metric_type': 'COSINE', 'params': {'ef': 100}}
results = collection.search(
    data=[query_vector],
    anns_field='embedding',
    param=search_params,
    limit=10,
    expr='price < 500',
    output_fields=['title', 'price'],
)
for result in results[0]:
    print(f'Title: {result.entity.get("title")}, Price: {result.entity.get("price")}, Distance: {result.distance}')
```

Hybrid search with multiple vectors requires Milvus 2.4+ and a collection with more than one vector field; the 'image_embedding' and 'text_embedding' field names below are illustrative:

```python
# Hybrid search (Milvus 2.4+): one AnnSearchRequest per vector field,
# results fused with a weighted reranker
from pymilvus import AnnSearchRequest, WeightedRanker

search_request_1 = AnnSearchRequest(
    data=[np.random.rand(768).tolist()],
    anns_field='image_embedding',  # illustrative field name
    param={'metric_type': 'COSINE', 'params': {'ef': 100}},
    limit=10,
)
search_request_2 = AnnSearchRequest(
    data=[np.random.rand(768).tolist()],
    anns_field='text_embedding',   # illustrative field name
    param={'metric_type': 'COSINE', 'params': {'ef': 100}},
    limit=10,
)
results = collection.hybrid_search(reqs=[search_request_1, search_request_2], rerank=WeightedRanker(0.6, 0.4), limit=10)
```

21medien provides production deployment templates, monitoring dashboards, and operational runbooks for Milvus.
Best Practices
- Choose appropriate index type: HNSW for low latency (<10ms), IVF variants for a balanced speed-accuracy tradeoff, DiskANN for large datasets with memory constraints
- Implement partitioning strategy: Partition by time, geography, or tenant for faster queries and easier data management
- Use attribute indexing for filters: Create indexes on frequently filtered fields (category, date, status) for 10-100x speedup (see the sketch after this list)
- Tune consistency level: Use eventual for recommendations/search, bounded for analytics, strong only for financial/critical applications
- Configure resource quotas: Set CPU/memory limits per collection in multi-tenant deployments to prevent resource contention
- Enable caching strategically: Cache hot collections in memory, configure cache size based on query patterns and working set size
- Monitor segment health: Track segment sizes, compaction status, fragmentation—run compaction during low-traffic periods
- Implement backup procedures: Regular snapshots to S3/GCS, test restore procedures, maintain cross-region replicas for DR
- Use bulk insert for large data: Batch insertions in groups of 10K-100K vectors for optimal throughput, avoid single-row inserts
- Load test before production: Validate performance with realistic query patterns, measure p95/p99 latency under load, test failure scenarios
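The attribute-indexing guideline referenced above can look like this in pymilvus; a sketch assuming the 'products' collection from the Code Examples section, with INVERTED as one of several possible scalar index types:

```python
# Scalar (attribute) indexes to accelerate filtered queries, assuming the
# 'products' collection from the Code Examples section. INVERTED is one of
# several scalar index types; the choice here is illustrative.
from pymilvus import connections, Collection

connections.connect(host="localhost", port="19530")
products = Collection("products")
products.release()  # index creation requires the collection to be released

# Index the VARCHAR field used in filter expressions
products.create_index(field_name="title", index_params={"index_type": "INVERTED"})

# Index the numeric field used in range filters such as price < 500
products.create_index(field_name="price", index_params={"index_type": "INVERTED"})

products.load()  # reload so searches use the new indexes
```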
Performance Comparison
Milvus performance scales to enterprise requirements. Query latency: 10-20ms p95 for billion-vector datasets with an HNSW index, 30-50ms with complex filtering. Versus Qdrant: comparable latency for similar index types; Milvus is better proven at extreme scale (eBay's 1.3B+ vectors), while Qdrant is simpler for smaller deployments. Versus Weaviate: Milvus is 2-3x faster for large-scale deployments, with better multi-tenancy and more index options. Versus Pinecone: Milvus matches latency while providing full control (data sovereignty, compliance) at 5-10x lower cost when self-hosted. Throughput: a single Milvus cluster handles 10,000-50,000 queries/second depending on hardware and index type, and scales horizontally by adding query nodes. Index building: GPU acceleration provides 10-100x speedup versus CPU; an IVF index on 100M vectors builds in 30 minutes on GPU (vs 20+ hours on CPU). Memory efficiency: DiskANN enables 10x larger datasets per node by storing vectors on disk and loading only relevant portions during query. Cost: operating 1B vectors on a Milvus cluster costs $15,000-25,000/month (20-30 node cluster on cloud); an equivalent Pinecone deployment costs $100,000+/month (80% savings). Multi-tenancy: Milvus supports thousands of collections per cluster with resource isolation, while alternatives often require separate clusters per tenant (10x cost increase). Consistency: tunable consistency levels distinguish Milvus among vector databases and enable appropriate latency tradeoffs; eventual consistency achieves 30-40% lower latency than strong. GPU utilization: Milvus uses GPUs efficiently for both indexing and queries, while many alternatives are CPU-only or offer limited GPU support. 21medien helps clients migrate from managed services to Milvus for cost savings and control: a typical migration from Pinecone to self-hosted Milvus saves 70-85% while improving performance 20-30% through hardware optimization and index tuning, achieving ROI within 3-6 months for deployments over 100M vectors.
Official Resources
https://milvus.io
Related Technologies
Qdrant
Rust-based vector database with simpler deployment, better for <100M vector use cases
Weaviate
Open-source vector database with GraphQL API, comparable features but less scalable
Pinecone
Managed vector database with higher cost, alternative for teams wanting zero-ops solution
LlamaIndex
Data framework for LLM applications with native Milvus integration for vector storage