Vector Databases for Retrieval-Augmented Generation (RAG): Implementation Guide

AI Engineering

Technical guide to implementing RAG systems with vector databases. Compare Pinecone, Weaviate, Milvus, and pgvector. Learn about embeddings, similarity search, and production architecture.

Vector Databases for Retrieval-Augmented Generation (RAG): Implementation Guide

Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from a knowledge base at query time. Vector databases provide the fast similarity search that makes this retrieval practical at scale.

RAG Architecture Overview

Components

  1. Document ingestion: Process and chunk source documents
  2. Embedding generation: Convert chunks to vector embeddings
  3. Vector storage: Store embeddings in vector database
  4. Query processing: Convert user queries to embeddings
  5. Similarity search: Find relevant chunks
  6. Context assembly: Combine retrieved chunks
  7. LLM generation: Generate response with context

Benefits

  • Up-to-date information without retraining
  • Source attribution for responses
  • Reduced hallucinations
  • Domain-specific knowledge
  • Cost-effective compared to fine-tuning
  • Easy to update knowledge base

Vector Database Comparison

Pinecone

Fully managed vector database as a service.

Advantages:

  • Zero infrastructure management
  • High performance for large-scale deployments
  • Automatic scaling
  • Real-time updates
  • 99.9% availability SLA

Considerations:

  • SaaS only (no self-hosting)
  • Per-vector pricing can be expensive at scale
  • Data stored externally (GDPR considerations)

Weaviate

Open-source vector database with cloud and self-hosted options.

Advantages:

  • Flexible deployment options
  • Built-in ML models for vectorization
  • GraphQL API
  • Hybrid search (vector + keyword)
  • Strong community support

Considerations:

  • Requires infrastructure management if self-hosted
  • Learning curve for GraphQL
  • Performance tuning needed for scale
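
Weaviate's hybrid search is easiest to see in a short query sketch. The snippet below uses the v4 Python client and assumes a local instance with an existing "Document" collection that has a "text" property and a configured vectorizer module; those names are illustrative assumptions, not part of the comparison above.

python
# Hybrid (vector + keyword) query with the Weaviate v4 client (pip install weaviate-client)
import weaviate
from weaviate.classes.query import MetadataQuery

client = weaviate.connect_to_local()
try:
    docs = client.collections.get("Document")  # assumed collection name
    response = docs.query.hybrid(
        query="how do we rotate API keys?",
        alpha=0.5,  # 0 = pure keyword (BM25), 1 = pure vector search
        limit=5,
        return_metadata=MetadataQuery(score=True),
    )
    for obj in response.objects:
        print(round(obj.metadata.score, 3), obj.properties["text"][:80])
finally:
    client.close()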

Milvus

Open-source vector database optimized for billion-scale deployments.

Advantages:

  • Excellent performance at massive scale
  • Multiple index types for optimization
  • Cloud-native architecture (Kubernetes)
  • Active development and community
  • GPU acceleration support

Considerations:

  • Complex setup and configuration
  • Requires significant ops expertise
  • Resource intensive
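
For Milvus, the lightweight MilvusClient wrapper in pymilvus hides most of the setup complexity. A minimal sketch, assuming a Milvus instance on localhost:19530; the collection name, field names, and 1024-dimension vectors are illustrative assumptions.

python
# Quick-start Milvus usage via pymilvus (pip install pymilvus)
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")

# Quick create: an "id" primary key and a "vector" field are generated automatically
client.create_collection(
    collection_name="rag_chunks",
    dimension=1024,          # must match your embedding model
    metric_type="COSINE",
)

client.insert(
    collection_name="rag_chunks",
    data=[{"id": 1, "vector": [0.1] * 1024, "text": "example chunk"}],
)

results = client.search(
    collection_name="rag_chunks",
    data=[[0.1] * 1024],     # query embedding(s)
    limit=5,
    output_fields=["text"],
)
print(results[0])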

pgvector (PostgreSQL Extension)

Vector similarity search as a PostgreSQL extension.

Advantages:

  • Leverage existing PostgreSQL infrastructure
  • Combine vector search with relational data
  • Familiar PostgreSQL tooling and expertise
  • ACID transactions
  • Lower operational complexity

Considerations:

  • Performance limited compared to specialized databases
  • Best for small to medium datasets (<1M vectors)
  • Less optimized for pure vector operations

Embedding Models

OpenAI text-embedding-3-large

  • 3072 dimensions
  • Excellent general-purpose performance
  • Cost: $0.13 per 1M tokens (October 2025)
  • Easy API integration

Cohere Embed v3

  • 1024 dimensions
  • Multilingual support
  • Competitive pricing
  • Good for semantic search

Open-Source: BGE-large

  • 1024 dimensions
  • Self-hostable (no API costs)
  • Strong performance on benchmarks
  • Requires compute resources
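
A self-hosted model like BGE-large removes per-token API costs entirely. A minimal sketch with the sentence-transformers library; the model ID BAAI/bge-large-en-v1.5 is the English 1024-dimension variant.

python
# Self-hosted embeddings (pip install sentence-transformers); runs on CPU or GPU
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

chunks = [
    "Vector databases store embeddings for similarity search.",
    "RAG retrieves relevant context before generation.",
]
# Normalizing makes cosine similarity equivalent to a plain dot product
embeddings = model.encode(chunks, normalize_embeddings=True)
print(embeddings.shape)  # (2, 1024)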

Selection Criteria

  • Task-specific performance: Test on your data
  • Cost: API vs self-hosting
  • Dimensionality: Storage vs performance trade-off
  • Language support: Multilingual requirements
  • Latency: Embedding generation speed
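
The dimensionality trade-off is directly tunable with the text-embedding-3 models: the API accepts a dimensions parameter that shortens the returned vector, trading a small amount of accuracy for less storage and faster search. A sketch (assumes OPENAI_API_KEY is set in the environment):

python
# Requesting reduced-dimension embeddings from text-embedding-3-large
from openai import OpenAI

client = OpenAI()

response = client.embeddings.create(
    model="text-embedding-3-large",
    input="How do vector indexes trade speed for accuracy?",
    dimensions=1024,  # down from the default 3072
)
print(len(response.data[0].embedding))  # 1024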

Document Processing

Chunking Strategies

Fixed-size chunking:

  • Split documents into fixed token counts (e.g., 512 tokens)
  • Simple to implement
  • May split mid-sentence or mid-concept

Semantic chunking:

  • Split at natural boundaries (paragraphs, sections)
  • Preserves semantic coherence
  • Variable chunk sizes

Overlapping chunks:

  • Include overlap between chunks (e.g., 50 tokens)
  • Prevents information loss at boundaries
  • Increases storage requirements
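
A minimal token-based chunker with overlap looks like the sketch below; it uses tiktoken for counting and mirrors the 512-token chunks and 50-token overlap mentioned above. Semantic chunking would instead split on headings or paragraphs before applying a size cap.

python
# Fixed-size, overlapping chunking by token count (pip install tiktoken)
import tiktoken

def chunk_by_tokens(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break
    return chunks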

Metadata

Store metadata with vectors:

  • Source document ID and title
  • Chunk position in document
  • Last updated timestamp
  • Author, category, tags
  • Access control metadata

Similarity Search

Distance Metrics

Cosine similarity:

  • Most common for text embeddings
  • Measures angle between vectors
  • Range: -1 to 1 (1 = identical direction)

Euclidean distance:

  • Geometric distance between points
  • Sensitive to vector magnitude
  • Use when magnitude matters

Dot product:

  • Fast computation
  • Requires normalized vectors for similarity
  • Common in production systems
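
Each metric is a few lines of NumPy. Note that for L2-normalized vectors, cosine similarity and the dot product produce identical rankings, which is why many systems normalize embeddings once at ingestion time.

python
# Cosine similarity, Euclidean distance, and dot product side by side
import numpy as np

a = np.array([0.1, 0.3, 0.5])
b = np.array([0.2, 0.1, 0.4])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
dot = np.dot(a, b)

print(f"cosine={cosine:.3f}  euclidean={euclidean:.3f}  dot={dot:.3f}")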

Retrieval Parameters

  • Top-k: Number of results to retrieve (typical: 3-10)
  • Similarity threshold: Minimum similarity score
  • Filters: Metadata-based filtering before or after search
  • Re-ranking: Post-process results for relevance
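
Re-ranking is typically done with a cross-encoder that scores each (query, chunk) pair directly. A sketch with sentence-transformers; the ms-marco cross-encoder shown is one common choice, not a requirement.

python
# Cross-encoder re-ranking of retrieved chunks (pip install sentence-transformers)
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, chunks: list[str], keep: int = 5) -> list[str]:
    # The cross-encoder reads query and chunk together: slower than vector search, but more accurate
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:keep]]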

Code Example: Complete RAG Pipeline with Pinecone

python
from pinecone import Pinecone
from openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import Any, Dict, List

class RAGPipeline:
    """Production RAG system with Pinecone and OpenAI"""
    
    def __init__(self, pinecone_api_key: str, openai_api_key: str, index_name: str = "knowledge-base"):
        # Initialize Pinecone (current SDK: instantiate the Pinecone class; the index must already exist)
        pc = Pinecone(api_key=pinecone_api_key)
        self.index = pc.Index(index_name)
        
        # Initialize OpenAI
        self.openai_client = OpenAI(api_key=openai_api_key)
        
        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=512,
            chunk_overlap=50,
            length_function=len
        )
    
    def embed_text(self, text: str) -> List[float]:
        """Generate embeddings using OpenAI"""
        response = self.openai_client.embeddings.create(
            model="text-embedding-3-large",
            input=text
        )
        return response.data[0].embedding
    
    def ingest_documents(self, documents: List[Dict[str, str]]):
        """Chunk and index documents into Pinecone"""
        vectors_to_upsert = []
        
        for doc_id, doc in enumerate(documents):
            # Split document into chunks
            chunks = self.text_splitter.split_text(doc["content"])
            
            for chunk_id, chunk in enumerate(chunks):
                # Generate embedding
                embedding = self.embed_text(chunk)
                
                # Prepare metadata
                metadata = {
                    "text": chunk,
                    "source": doc.get("source", "unknown"),
                    "doc_id": doc_id,
                    "chunk_id": chunk_id,
                    "title": doc.get("title", "")
                }
                
                # Create vector ID
                vector_id = f"doc{doc_id}_chunk{chunk_id}"
                
                vectors_to_upsert.append((
                    vector_id,
                    embedding,
                    metadata
                ))
            
            # Batch upsert every 100 vectors
            if len(vectors_to_upsert) >= 100:
                self.index.upsert(vectors=vectors_to_upsert)
                vectors_to_upsert = []
        
        # Upsert remaining vectors
        if vectors_to_upsert:
            self.index.upsert(vectors=vectors_to_upsert)
    
    def retrieve_context(self, query: str, top_k: int = 5, min_score: float = 0.7) -> List[Dict]:
        """Retrieve relevant context for query"""
        # Generate query embedding
        query_embedding = self.embed_text(query)
        
        # Search Pinecone
        results = self.index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )
        
        # Filter by minimum score and extract context
        context_chunks = []
        for match in results.matches:
            if match.score >= min_score:
                context_chunks.append({
                    "text": match.metadata["text"],
                    "source": match.metadata["source"],
                    "score": match.score
                })
        
        return context_chunks
    
    def generate_response(self, query: str, context_chunks: List[Dict]) -> str:
        """Generate response using retrieved context"""
        # Assemble context
        context_text = "\n\n".join([
            f"[Source: {chunk['source']}]\n{chunk['text']}"
            for chunk in context_chunks
        ])
        
        # Create prompt
        prompt = f"""Answer the question based on the provided context. If the context doesn't contain enough information, say so.

Context:
{context_text}

Question: {query}

Answer:"""
        
        # Generate response
        response = self.openai_client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=500
        )
        
        return response.choices[0].message.content
    
    def query(self, question: str, top_k: int = 5) -> Dict[str, Any]:
        """Complete RAG pipeline: retrieve and generate"""
        # Retrieve relevant context
        context_chunks = self.retrieve_context(question, top_k=top_k)
        
        if not context_chunks:
            return {
                "answer": "I couldn't find relevant information to answer your question.",
                "sources": [],
                "confidence": "low"
            }
        
        # Generate response
        answer = self.generate_response(question, context_chunks)
        
        # Extract unique sources
        sources = list(set([chunk["source"] for chunk in context_chunks]))
        
        return {
            "answer": answer,
            "sources": sources,
            "context_chunks": context_chunks,
            "confidence": "high" if context_chunks[0]["score"] > 0.85 else "medium"
        }

# Usage example
rag = RAGPipeline(
    pinecone_api_key="your-pinecone-key",
    openai_api_key="your-openai-key"
)

# Ingest documents
documents = [
    {
        "title": "AI Safety Guidelines",
        "content": "AI systems should be designed with safety as a priority...",
        "source": "safety_doc.pdf"
    },
    {
        "title": "Production Deployment",
        "content": "When deploying AI systems to production...",
        "source": "deployment_guide.pdf"
    }
]

rag.ingest_documents(documents)

# Query the system
result = rag.query("How should I deploy AI systems safely?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Confidence: {result['confidence']}")

Code Example: RAG with pgvector (PostgreSQL)

python
import psycopg2
from psycopg2.extras import Json
from pgvector.psycopg2 import register_vector  # pip install pgvector
from openai import OpenAI
from typing import Dict, List
import numpy as np

class PgVectorRAG:
    """RAG system using PostgreSQL with pgvector extension"""
    
    def __init__(self, db_config: Dict[str, str], openai_api_key: str):
        # Connect to PostgreSQL
        self.conn = psycopg2.connect(**db_config)
        self.cursor = self.conn.cursor()
        
        # Initialize OpenAI
        self.openai_client = OpenAI(api_key=openai_api_key)
        
        # Create table with vector extension
        self._init_database()
    
    def _init_database(self):
        """Initialize database schema with pgvector"""
        self.cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                content TEXT NOT NULL,
                embedding vector(1024),  -- reduced dimensions: ivfflat/hnsw indexes support at most 2000
                metadata JSONB,
                created_at TIMESTAMP DEFAULT NOW()
            );
        """)
        
        # Create vector index for fast similarity search
        # (ivfflat performs best when built after bulk loading; lists roughly sqrt(row count))
        self.cursor.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx 
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100);
        """)
        
        self.conn.commit()
        
        # Send/receive numpy arrays as pgvector values through psycopg2
        register_vector(self.conn)
    
    def embed_text(self, text: str) -> np.ndarray:
        """Generate embeddings, reduced to 1024 dimensions to match the table schema"""
        response = self.openai_client.embeddings.create(
            model="text-embedding-3-large",
            input=text,
            dimensions=1024
        )
        return np.array(response.data[0].embedding)
    
    def add_document(self, content: str, metadata: Dict = None):
        """Add document to vector database"""
        embedding = self.embed_text(content)
        
        self.cursor.execute(
            """
            INSERT INTO documents (content, embedding, metadata)
            VALUES (%s, %s, %s)
            """,
            (content, embedding, Json(metadata or {}))
        )
        self.conn.commit()
    
    def semantic_search(self, query: str, top_k: int = 5, threshold: float = 0.7) -> List[Dict]:
        """Perform semantic search using cosine similarity"""
        query_embedding = self.embed_text(query)
        
        self.cursor.execute(
            """
            SELECT 
                id,
                content,
                metadata,
                1 - (embedding <=> %s::vector) as similarity
            FROM documents
            WHERE 1 - (embedding <=> %s::vector) > %s
            ORDER BY embedding <=> %s::vector
            LIMIT %s;
            """,
            (query_embedding, query_embedding, threshold, query_embedding, top_k)
        )
        
        results = []
        for row in self.cursor.fetchall():
            results.append({
                "id": row[0],
                "content": row[1],
                "metadata": row[2],
                "similarity": float(row[3])
            })
        
        return results
    
    def generate_answer(self, query: str, top_k: int = 5) -> Dict:
        """Complete RAG: search and generate"""
        # Retrieve relevant documents
        results = self.semantic_search(query, top_k=top_k)
        
        if not results:
            return {"answer": "No relevant information found.", "sources": []}
        
        # Build context
        context = "\n\n".join([doc["content"] for doc in results])
        
        # Generate answer
        prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""
        
        response = self.openai_client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        
        return {
            "answer": response.choices[0].message.content,
            "sources": results,
            "num_sources": len(results)
        }
    
    def close(self):
        """Close database connection"""
        self.cursor.close()
        self.conn.close()

# Usage
db_config = {
    "host": "localhost",
    "database": "rag_db",
    "user": "postgres",
    "password": "your-password"
}

rag = PgVectorRAG(db_config, openai_api_key="your-openai-key")

# Add documents
rag.add_document(
    "Vector databases enable efficient similarity search for RAG systems.",
    metadata={"source": "rag_guide.pdf", "page": 1}
)

# Query
result = rag.generate_answer("What are vector databases used for?")
print(result["answer"])

rag.close()

Production Architecture

Ingestion Pipeline

  1. Document queue (e.g., SQS, RabbitMQ)
  2. Processing workers: Parse, chunk, clean
  3. Embedding service: Generate vectors in batches for efficiency (see the sketch after this list)
  4. Vector DB write: Store with metadata
  5. Monitoring: Track ingestion metrics
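
The embedding step in item 3 benefits heavily from batching: the OpenAI embeddings endpoint accepts a list of inputs, so one request can embed many chunks at once. A sketch (assumes OPENAI_API_KEY is set; the batch size of 100 is an arbitrary illustration):

python
# Batch embedding generation to cut per-call overhead during ingestion
from openai import OpenAI

client = OpenAI()

def embed_batch(chunks: list[str], batch_size: int = 100) -> list[list[float]]:
    embeddings = []
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        response = client.embeddings.create(model="text-embedding-3-large", input=batch)
        # The response preserves input order
        embeddings.extend(item.embedding for item in response.data)
    return embeddings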

Query Pipeline

  1. User query received
  2. Query preprocessing (normalization, expansion)
  3. Generate query embedding
  4. Vector similarity search with filters
  5. Re-rank results (optional)
  6. Assemble context for LLM
  7. Generate response
  8. Post-process and return

Caching Layers

  • Query cache: Store popular query results
  • Embedding cache: Reuse embeddings for common queries
  • LLM response cache: Cache complete responses
  • TTL based on content update frequency
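
The shape of an embedding cache is simple: key on a hash of the normalized query and expire entries after a TTL. The in-process sketch below is illustrative only; production systems usually back this with Redis or memcached.

python
# Minimal TTL-based embedding cache keyed on a query hash
import hashlib
import time

class EmbeddingCache:
    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store: dict[str, tuple[float, list[float]]] = {}

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str) -> list[float] | None:
        entry = self.store.get(self._key(query))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def put(self, query: str, embedding: list[float]) -> None:
        self.store[self._key(query)] = (time.time(), embedding)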

Optimization Strategies

Indexing

  • HNSW (Hierarchical Navigable Small World): Fast approximate search
  • IVF (Inverted File Index): Partition space for efficiency
  • Trade-off: Speed vs accuracy
  • Tune index parameters to the dataset size (see the pgvector sketch below)
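
As one concrete example of index tuning, the sketch below adds an HNSW index to the documents table from the pgvector example (assumes pgvector 0.5 or newer). Higher m and ef_construction raise recall at the cost of build time and memory; hnsw.ef_search trades query speed for recall at search time.

python
# Creating and tuning an HNSW index in pgvector
import psycopg2

conn = psycopg2.connect("dbname=rag_db user=postgres")
cur = conn.cursor()

cur.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_hnsw
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")
cur.execute("SET hnsw.ef_search = 40;")  # higher = better recall, slower queries

conn.commit()
cur.close()
conn.close()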

Hybrid Search

Combine vector and keyword search:

  • Vector search for semantic similarity
  • Keyword search for exact matches
  • Combine results with weighted scoring
  • Improves recall and precision
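
One widely used alternative to hand-tuned weighted scoring is Reciprocal Rank Fusion (RRF), which merges the two ranked lists using only ranks, so the incompatible score scales of BM25 and cosine similarity never need to be reconciled. A sketch:

python
# Reciprocal Rank Fusion over vector and keyword result lists (document ids only)
def rrf_merge(vector_ids: list[str], keyword_ids: list[str], k: int = 60, top_n: int = 10) -> list[str]:
    scores: dict[str, float] = {}
    for ranked in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]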

Query Expansion

  • Generate multiple query variations (e.g., LLM paraphrases; see the sketch after this list)
  • Retrieve for each variation
  • Deduplicate and re-rank results
  • Improves recall for ambiguous queries
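
A minimal expansion loop asks the LLM for paraphrases, retrieves for each variation, and deduplicates by chunk id before re-ranking. The prompt wording and helper names below are illustrative assumptions.

python
# Query expansion: paraphrase with the LLM, then merge retrieval results by id
from openai import OpenAI

client = OpenAI()

def expand_query(query: str, n: int = 3) -> list[str]:
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query {n} different ways, one per line:\n{query}",
        }],
    )
    lines = response.choices[0].message.content.splitlines()
    variations = [line.strip() for line in lines if line.strip()]
    return [query] + variations[:n]

def merge_results(result_lists: list[list[dict]]) -> list[dict]:
    seen, merged = set(), []
    for results in result_lists:
        for item in results:
            if item["id"] not in seen:  # assumes each retrieved chunk carries an "id"
                seen.add(item["id"])
                merged.append(item)
    return merged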

Evaluation Metrics

Retrieval Metrics

  • Recall@k: Percentage of relevant docs in top-k
  • Precision@k: Percentage of retrieved docs that are relevant
  • MRR (Mean Reciprocal Rank): Average of 1 / rank of the first relevant result
  • NDCG: Normalized discounted cumulative gain
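
These metrics only need a small labeled set mapping queries to the chunk ids judged relevant. Two of them in plain Python:

python
# Per-query Recall@k and reciprocal rank; average over the eval set for Recall@k and MRR
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0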

End-to-End Metrics

  • Answer relevance: Does LLM response address query?
  • Faithfulness: Is response grounded in retrieved context?
  • Context relevance: Is retrieved context useful?
  • Latency: Total time from query to response

Common Pitfalls

  • Chunk size too large: Dilutes relevance
  • Chunk size too small: Loses context
  • Insufficient top-k: Misses relevant information
  • Excessive top-k: Noise in context
  • No metadata filtering: Retrieves irrelevant but similar content
  • Ignoring context window limits: Truncated context
  • Not monitoring retrieval quality: Degrading performance

Cost Management

  • Batch embedding generation to reduce API calls
  • Cache embeddings for frequently accessed documents
  • Use smaller embedding models if acceptable performance
  • Implement query deduplication
  • Monitor per-query costs
  • Right-size vector database infrastructure

Security Considerations

  • Access control: Filter results based on user permissions
  • Data isolation: Multi-tenant vector separation
  • Encryption: Protect vectors at rest and in transit
  • Audit logging: Track who retrieved what
  • PII handling: Careful with personal data in documents
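
Access control belongs in the retrieval step itself, not only in the application layer: filter on permission metadata so restricted chunks never reach the LLM context. A sketch using Pinecone-style metadata filters; the allowed_groups field is an assumption about how permissions were stored at ingestion time.

python
# Permission-aware retrieval via metadata filtering (Pinecone filter syntax)
from typing import Dict, List

def retrieve_for_user(index, query_embedding: List[float], user_groups: List[str], top_k: int = 5) -> List[Dict]:
    results = index.query(
        vector=query_embedding,
        top_k=top_k,
        include_metadata=True,
        filter={"allowed_groups": {"$in": user_groups}},  # only chunks the user is allowed to see
    )
    return [match.metadata for match in results.matches if match.metadata]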

RAG systems provide a practical way to enhance LLMs with domain knowledge. Proper implementation of vector databases, embeddings, and retrieval strategies is crucial for production quality.

Author

21medien
