Retrieval-Augmented Generation (RAG) enhances LLM responses by retrieving relevant context from knowledge bases. Vector databases provide the efficient similarity search that RAG systems depend on.
RAG Architecture Overview
Components
1. Document ingestion: Process and chunk source documents
2. Embedding generation: Convert chunks to vector embeddings
3. Vector storage: Store embeddings in vector database
4. Query processing: Convert user queries to embeddings
5. Similarity search: Find relevant chunks
6. Context assembly: Combine retrieved chunks
7. LLM generation: Generate response with context
Benefits
- Up-to-date information without retraining
- Source attribution for responses
- Reduced hallucinations
- Domain-specific knowledge
- Cost-effective compared to fine-tuning
- Easy to update knowledge base
Vector Database Comparison
Pinecone
Fully managed vector database as a service.
Advantages:
- Zero infrastructure management
- High performance for large-scale deployments
- Automatic scaling
- Real-time updates
- 99.9% availability SLA
Considerations:
- SaaS only (no self-hosting)
- Per-vector pricing can be expensive at scale
- Data stored externally (GDPR considerations)
Weaviate
Open-source vector database with cloud and self-hosted options.
Advantages:
- Flexible deployment options
- Built-in ML models for vectorization
- GraphQL API
- Hybrid search (vector + keyword)
- Strong community support
Considerations:
- Requires infrastructure management if self-hosted
- Learning curve for GraphQL
- Performance tuning needed for scale
Milvus
Open-source vector database optimized for billion-scale deployments.
Advantages:
- Excellent performance at massive scale
- Multiple index types for optimization
- Cloud-native architecture (Kubernetes)
- Active development and community
- GPU acceleration support
Considerations:
- Complex setup and configuration
- Requires significant ops expertise
- Resource intensive
pgvector (PostgreSQL Extension)
Vector similarity search as a PostgreSQL extension.
Advantages:
- Leverage existing PostgreSQL infrastructure
- Combine vector search with relational data
- Familiar PostgreSQL tooling and expertise
- ACID transactions
- Lower operational complexity
Considerations:
- Performance limited compared to specialized databases
- Best for small to medium datasets (<1M vectors)
- Less optimized for pure vector operations
Embedding Models
OpenAI text-embedding-3-large
- 3072 dimensions
- Excellent general-purpose performance
- Cost: $0.13 per 1M tokens (October 2025)
- Easy API integration
Cohere Embed v3
- 1024 dimensions
- Multilingual support
- Competitive pricing
- Good for semantic search
Open-Source: BGE-large
- 1024 dimensions
- Self-hostable (no API costs)
- Strong performance on benchmarks
- Requires compute resources
Selection Criteria
- Task-specific performance: Test on your data
- Cost: API vs self-hosting
- Dimensionality: Storage vs performance trade-off
- Language support: Multilingual requirements
- Latency: Embedding generation speed
Document Processing
Chunking Strategies
Fixed-size chunking:
- Split documents into fixed token counts (e.g., 512 tokens)
- Simple to implement
- May split mid-sentence or mid-concept
Semantic chunking:
- Split at natural boundaries (paragraphs, sections)
- Preserves semantic coherence
- Variable chunk sizes
Overlapping chunks:
- Include overlap between chunks (e.g., 50 tokens)
- Prevents information loss at boundaries
- Increases storage requirements
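A minimal sketch of fixed-size chunking with overlap, assuming the tiktoken tokenizer is available; the 512/50 values mirror the parameters above, and the pipeline examples later in this section use LangChain's RecursiveCharacterTextSplitter for the same job.

import tiktoken
from typing import List

def chunk_fixed_size(text: str, chunk_size: int = 512, overlap: int = 50) -> List[str]:
    """Split text into fixed-size token windows with overlap between neighbors."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        chunks.append(enc.decode(window))
        if start + chunk_size >= len(tokens):
            break  # last window already reaches the end of the document
    return chunks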
Metadata
Store metadata with vectors:
- Source document ID and title
- Chunk position in document
- Last updated timestamp
- Author, category, tags
- Access control metadata
Similarity Search
Distance Metrics
Cosine similarity:
- Most common for text embeddings
- Measures angle between vectors
- Range: -1 to 1 (1 = same direction)
Euclidean distance:
- Geometric distance between points
- Sensitive to vector magnitude
- Use when magnitude matters
Dot product:
- Fast computation
- Equivalent to cosine similarity when vectors are normalized
- Common in production systems
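The three metrics can be compared directly with NumPy; a small illustrative sketch using toy 3-dimensional vectors rather than real embeddings.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.linalg.norm(a - b))

a = np.array([0.1, 0.3, 0.6])
b = np.array([0.2, 0.3, 0.5])
print(cosine_similarity(a, b))    # angle only, ignores magnitude
print(euclidean_distance(a, b))   # sensitive to magnitude
# Dot product equals cosine similarity once both vectors are normalized
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(float(np.dot(a_n, b_n)))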
Retrieval Parameters
- Top-k: Number of results to retrieve (typical: 3-10)
- Similarity threshold: Minimum similarity score
- Filters: Metadata-based filtering before or after search
- Re-ranking: Post-process results for relevance
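In the Pinecone pipeline that follows, these parameters surface as the top_k argument, a post-hoc score threshold, and the query-time filter; a small sketch, where index and query_embedding are assumed from that example and category is a hypothetical metadata field.

# Retrieve a larger candidate set, restricted by metadata, then threshold and trim
results = index.query(
    vector=query_embedding,
    top_k=10,                                    # over-retrieve, then filter or re-rank down
    include_metadata=True,
    filter={"category": {"$eq": "deployment"}}   # hypothetical metadata field
)
matches = [m for m in results.matches if m.score >= 0.7][:5]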
Code Example: Complete RAG Pipeline with Pinecone
from pinecone import Pinecone
from openai import OpenAI
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import Any, Dict, List

class RAGPipeline:
    """Production RAG system with Pinecone and OpenAI"""

    def __init__(self, pinecone_api_key: str, openai_api_key: str, index_name: str = "knowledge-base"):
        # Initialize Pinecone (v3+ client)
        pc = Pinecone(api_key=pinecone_api_key)
        self.index = pc.Index(index_name)

        # Initialize OpenAI
        self.openai_client = OpenAI(api_key=openai_api_key)

        # Initialize text splitter
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=512,
            chunk_overlap=50,
            length_function=len
        )
    def embed_text(self, text: str) -> List[float]:
        """Generate embeddings using OpenAI"""
        response = self.openai_client.embeddings.create(
            model="text-embedding-3-large",
            input=text
        )
        return response.data[0].embedding

    def ingest_documents(self, documents: List[Dict[str, str]]):
        """Chunk and index documents into Pinecone"""
        vectors_to_upsert = []
        for doc_id, doc in enumerate(documents):
            # Split document into chunks
            chunks = self.text_splitter.split_text(doc["content"])
            for chunk_id, chunk in enumerate(chunks):
                # Generate embedding
                embedding = self.embed_text(chunk)

                # Prepare metadata
                metadata = {
                    "text": chunk,
                    "source": doc.get("source", "unknown"),
                    "doc_id": doc_id,
                    "chunk_id": chunk_id,
                    "title": doc.get("title", "")
                }

                # Create vector ID
                vector_id = f"doc{doc_id}_chunk{chunk_id}"
                vectors_to_upsert.append((vector_id, embedding, metadata))

                # Batch upsert every 100 vectors
                if len(vectors_to_upsert) >= 100:
                    self.index.upsert(vectors=vectors_to_upsert)
                    vectors_to_upsert = []

        # Upsert remaining vectors
        if vectors_to_upsert:
            self.index.upsert(vectors=vectors_to_upsert)
    def retrieve_context(self, query: str, top_k: int = 5, min_score: float = 0.7) -> List[Dict]:
        """Retrieve relevant context for query"""
        # Generate query embedding
        query_embedding = self.embed_text(query)

        # Search Pinecone
        results = self.index.query(
            vector=query_embedding,
            top_k=top_k,
            include_metadata=True
        )

        # Filter by minimum score and extract context
        context_chunks = []
        for match in results.matches:
            if match.score >= min_score:
                context_chunks.append({
                    "text": match.metadata["text"],
                    "source": match.metadata["source"],
                    "score": match.score
                })
        return context_chunks

    def generate_response(self, query: str, context_chunks: List[Dict]) -> str:
        """Generate response using retrieved context"""
        # Assemble context
        context_text = "\n\n".join([
            f"[Source: {chunk['source']}]\n{chunk['text']}"
            for chunk in context_chunks
        ])

        # Create prompt
        prompt = f"""Answer the question based on the provided context. If the context doesn't contain enough information, say so.

Context:
{context_text}

Question: {query}

Answer:"""

        # Generate response
        response = self.openai_client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3,
            max_tokens=500
        )
        return response.choices[0].message.content
    def query(self, question: str, top_k: int = 5) -> Dict[str, Any]:
        """Complete RAG pipeline: retrieve and generate"""
        # Retrieve relevant context
        context_chunks = self.retrieve_context(question, top_k=top_k)
        if not context_chunks:
            return {
                "answer": "I couldn't find relevant information to answer your question.",
                "sources": [],
                "confidence": "low"
            }

        # Generate response
        answer = self.generate_response(question, context_chunks)

        # Extract unique sources
        sources = list(set(chunk["source"] for chunk in context_chunks))

        return {
            "answer": answer,
            "sources": sources,
            "context_chunks": context_chunks,
            "confidence": "high" if context_chunks[0]["score"] > 0.85 else "medium"
        }
# Usage example
rag = RAGPipeline(
    pinecone_api_key="your-pinecone-key",
    openai_api_key="your-openai-key"
)

# Ingest documents
documents = [
    {
        "title": "AI Safety Guidelines",
        "content": "AI systems should be designed with safety as a priority...",
        "source": "safety_doc.pdf"
    },
    {
        "title": "Production Deployment",
        "content": "When deploying AI systems to production...",
        "source": "deployment_guide.pdf"
    }
]
rag.ingest_documents(documents)

# Query the system
result = rag.query("How should I deploy AI systems safely?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")
print(f"Confidence: {result['confidence']}")
Code Example: RAG with pgvector (PostgreSQL)
import psycopg2
from psycopg2.extras import Json
from pgvector.psycopg2 import register_vector  # requires the pgvector Python package
from openai import OpenAI
from typing import Dict, List
import numpy as np

class PgVectorRAG:
    """RAG system using PostgreSQL with pgvector extension"""

    def __init__(self, db_config: Dict[str, str], openai_api_key: str):
        # Connect to PostgreSQL
        self.conn = psycopg2.connect(**db_config)
        self.cursor = self.conn.cursor()

        # Initialize OpenAI
        self.openai_client = OpenAI(api_key=openai_api_key)

        # Create table with vector extension
        self._init_database()

    def _init_database(self):
        """Initialize database schema with pgvector"""
        self.cursor.execute("CREATE EXTENSION IF NOT EXISTS vector;")
        self.conn.commit()

        # Teach psycopg2 to adapt numpy arrays to the pgvector type
        register_vector(self.conn)

        # Embeddings are reduced to 1536 dimensions below because pgvector's
        # ivfflat index supports at most 2000 dimensions
        self.cursor.execute("""
            CREATE TABLE IF NOT EXISTS documents (
                id SERIAL PRIMARY KEY,
                content TEXT NOT NULL,
                embedding vector(1536),
                metadata JSONB,
                created_at TIMESTAMP DEFAULT NOW()
            );
        """)

        # Create vector index for fast similarity search
        self.cursor.execute("""
            CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents USING ivfflat (embedding vector_cosine_ops)
            WITH (lists = 100);
        """)
        self.conn.commit()
    def embed_text(self, text: str) -> List[float]:
        """Generate embeddings (reduced to 1536 dims to fit the ivfflat index limit)"""
        response = self.openai_client.embeddings.create(
            model="text-embedding-3-large",
            input=text,
            dimensions=1536
        )
        return response.data[0].embedding

    def add_document(self, content: str, metadata: Dict = None):
        """Add document to vector database"""
        embedding = self.embed_text(content)
        self.cursor.execute(
            """
            INSERT INTO documents (content, embedding, metadata)
            VALUES (%s, %s, %s)
            """,
            (content, np.array(embedding), Json(metadata or {}))
        )
        self.conn.commit()
    def semantic_search(self, query: str, top_k: int = 5, threshold: float = 0.7) -> List[Dict]:
        """Perform semantic search using cosine similarity"""
        query_embedding = np.array(self.embed_text(query))
        self.cursor.execute(
            """
            SELECT
                id,
                content,
                metadata,
                1 - (embedding <=> %s) AS similarity
            FROM documents
            WHERE 1 - (embedding <=> %s) > %s
            ORDER BY embedding <=> %s
            LIMIT %s;
            """,
            (query_embedding, query_embedding, threshold, query_embedding, top_k)
        )

        results = []
        for row in self.cursor.fetchall():
            results.append({
                "id": row[0],
                "content": row[1],
                "metadata": row[2],
                "similarity": float(row[3])
            })
        return results
    def generate_answer(self, query: str, top_k: int = 5) -> Dict:
        """Complete RAG: search and generate"""
        # Retrieve relevant documents
        results = self.semantic_search(query, top_k=top_k)
        if not results:
            return {"answer": "No relevant information found.", "sources": []}

        # Build context
        context = "\n\n".join([doc["content"] for doc in results])

        # Generate answer
        prompt = f"""Based on the following context, answer the question.

Context:
{context}

Question: {query}

Answer:"""

        response = self.openai_client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )

        return {
            "answer": response.choices[0].message.content,
            "sources": results,
            "num_sources": len(results)
        }

    def close(self):
        """Close database connection"""
        self.cursor.close()
        self.conn.close()
# Usage
db_config = {
    "host": "localhost",
    "database": "rag_db",
    "user": "postgres",
    "password": "your-password"
}
rag = PgVectorRAG(db_config, openai_api_key="your-openai-key")

# Add documents
rag.add_document(
    "Vector databases enable efficient similarity search for RAG systems.",
    metadata={"source": "rag_guide.pdf", "page": 1}
)

# Query
result = rag.generate_answer("What are vector databases used for?")
print(result["answer"])

rag.close()
Production Architecture
Ingestion Pipeline
1. Document queue (e.g., SQS, RabbitMQ)
2. Processing workers: Parse, chunk, clean
3. Embedding service: Generate vectors (batch for efficiency)
4. Vector DB write: Store with metadata
5. Monitoring: Track ingestion metrics
Query Pipeline
1. User query received
2. Query preprocessing (normalization, expansion)
3. Generate query embedding
4. Vector similarity search with filters
5. Re-rank results (optional)
6. Assemble context for LLM
7. Generate response
8. Post-process and return
Caching Layers
- Query cache: Store popular query results
- Embedding cache: Reuse embeddings for common queries
- LLM response cache: Cache complete responses
- TTL based on content update frequency
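A minimal sketch of a query-level cache with TTL, built on the RAGPipeline instance from the Pinecone example; the in-memory dict, key normalization, and TTL value are illustrative (a shared store such as Redis is more typical in production).

import hashlib
import time

class QueryCache:
    """In-memory TTL cache keyed by normalized query text."""

    def __init__(self, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (timestamp, cached result)

    def _key(self, query: str) -> str:
        return hashlib.sha256(query.strip().lower().encode()).hexdigest()

    def get(self, query: str):
        entry = self.store.get(self._key(query))
        if entry and time.time() - entry[0] < self.ttl:
            return entry[1]
        return None

    def set(self, query: str, value):
        self.store[self._key(query)] = (time.time(), value)

cache = QueryCache(ttl_seconds=600)
question = "How should I deploy AI systems safely?"
result = cache.get(question)
if result is None:
    result = rag.query(question)  # rag: RAGPipeline instance from the earlier example
    cache.set(question, result)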
Optimization Strategies
Indexing
- HNSW (Hierarchical Navigable Small World): Fast approximate search
- IVF (Inverted File Index): Partition space for efficiency
- Trade-off: Speed vs accuracy
- Tune index parameters based on dataset size
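With pgvector (as in the PostgreSQL example above), these trade-offs appear as index build options and query-time settings; a sketch assuming the documents table and a psycopg2 cursor from that example, with starting parameter values that would need tuning for a real dataset.

# HNSW: higher m and ef_construction improve recall at the cost of build time and memory
cursor.execute("""
    CREATE INDEX IF NOT EXISTS documents_embedding_hnsw_idx
    ON documents USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
""")

# Query-time accuracy/speed knobs (use whichever matches the index type in place)
cursor.execute("SET hnsw.ef_search = 100;")   # HNSW: candidates examined per query
cursor.execute("SET ivfflat.probes = 10;")    # IVF: partitions probed per query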
Hybrid Search
Combine vector and keyword search:
- Vector search for semantic similarity
- Keyword search for exact matches
- Combine results with weighted scoring
- Improves recall and precision
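A sketch of the weighted-scoring step, assuming each retriever returns a mapping of document ID to score; the 0.7/0.3 split and min-max normalization are illustrative choices, not recommendations.

from typing import Dict, List

def hybrid_merge(vector_hits: Dict[str, float], keyword_hits: Dict[str, float],
                 alpha: float = 0.7) -> List[str]:
    """Combine normalized vector and keyword scores with a weighted sum."""
    def normalize(scores: Dict[str, float]) -> Dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, kw = normalize(vector_hits), normalize(keyword_hits)
    combined = {doc: alpha * v.get(doc, 0.0) + (1 - alpha) * kw.get(doc, 0.0)
                for doc in set(v) | set(kw)}
    return sorted(combined, key=combined.get, reverse=True)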
Query Expansion
- Generate multiple query variations
- Retrieve for each variation
- Deduplicate and re-rank results
- Improves recall for ambiguous queries
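A sketch built on the RAGPipeline class from the Pinecone example; the paraphrasing prompt and the number of variations are illustrative.

def expanded_retrieve(rag: RAGPipeline, query: str, top_k: int = 5):
    """Retrieve with several query variations, deduplicate, and re-rank by score."""
    # Ask the LLM for paraphrases of the query
    response = rag.openai_client.chat.completions.create(
        model="gpt-5",
        messages=[{
            "role": "user",
            "content": f"Rewrite this search query three different ways, one per line:\n{query}"
        }]
    )
    variations = [query] + [
        line.strip() for line in response.choices[0].message.content.splitlines() if line.strip()
    ]

    # Retrieve for each variation and deduplicate by chunk text
    seen, merged = set(), []
    for variant in variations:
        for chunk in rag.retrieve_context(variant, top_k=top_k):
            if chunk["text"] not in seen:
                seen.add(chunk["text"])
                merged.append(chunk)

    return sorted(merged, key=lambda c: c["score"], reverse=True)[:top_k]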
Evaluation Metrics
Retrieval Metrics
- Recall@k: Fraction of all relevant documents that appear in the top-k results
- Precision@k: Fraction of the top-k results that are relevant
- MRR (Mean Reciprocal Rank): Average reciprocal rank of the first relevant result across queries
- NDCG: Normalized discounted cumulative gain
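These metrics are straightforward to compute once each evaluation query has a labeled set of relevant documents; a small sketch with toy document IDs.

from typing import List, Set

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return len(set(retrieved[:k]) & relevant) / k

def mrr(all_retrieved: List[List[str]], all_relevant: List[Set[str]]) -> float:
    """Mean reciprocal rank of the first relevant result across queries."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc_id in enumerate(retrieved, start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)

print(recall_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))     # 0.5
print(precision_at_k(["d3", "d1", "d7"], {"d1", "d2"}, k=3))  # ~0.33
print(mrr([["d3", "d1", "d7"]], [{"d1", "d2"}]))              # 0.5 (first hit at rank 2)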
End-to-End Metrics
- Answer relevance: Does LLM response address query?
- Faithfulness: Is response grounded in retrieved context?
- Context relevance: Is retrieved context useful?
- Latency: Total time from query to response
Common Pitfalls
- Chunk size too large: Dilutes relevance
- Chunk size too small: Loses context
- Insufficient top-k: Misses relevant information
- Excessive top-k: Noise in context
- No metadata filtering: Retrieves irrelevant but similar content
- Ignoring context window limits: Truncated context
- Not monitoring retrieval quality: Degrading performance
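The context-window pitfall in particular can be guarded against by budgeting tokens during context assembly; a sketch assuming tiktoken and chunks already sorted by similarity score, with an illustrative token budget.

import tiktoken
from typing import Dict, List

def fit_context(chunks: List[Dict], max_context_tokens: int = 8000) -> str:
    """Assemble context from the most relevant chunks without exceeding a token budget."""
    enc = tiktoken.get_encoding("cl100k_base")
    parts, used = [], 0
    for chunk in chunks:  # assumed sorted by score, most relevant first
        n_tokens = len(enc.encode(chunk["text"]))
        if used + n_tokens > max_context_tokens:
            break
        parts.append(chunk["text"])
        used += n_tokens
    return "\n\n".join(parts)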
Cost Management
- Batch embedding generation to reduce API calls
- Cache embeddings for frequently accessed documents
- Use smaller embedding models if acceptable performance
- Implement query deduplication
- Monitor per-query costs
- Right-size vector database infrastructure
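Batching is the simplest lever: the OpenAI embeddings endpoint accepts a list of inputs, so ingestion can embed many chunks per request instead of one call per chunk as in the earlier examples; a sketch with an illustrative batch size.

from openai import OpenAI
from typing import List

def embed_batch(client: OpenAI, texts: List[str], batch_size: int = 100) -> List[List[float]]:
    """Embed many chunks per API call; results come back in input order."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-3-large",
            input=texts[i:i + batch_size]
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings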
Security Considerations
- Access control: Filter results based on user permissions
- Data isolation: Multi-tenant vector separation
- Encryption: Protect vectors at rest and in transit
- Audit logging: Track who retrieved what
- PII handling: Careful with personal data in documents
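With Pinecone, access control and tenant isolation can both be enforced at query time; a sketch assuming chunk metadata carries a hypothetical allowed_groups list and that index, query_embedding, tenant_id, and user_groups exist in the calling code.

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True,
    namespace=f"tenant-{tenant_id}",                       # per-tenant data isolation
    filter={"allowed_groups": {"$in": list(user_groups)}}  # drop chunks the user may not read
)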
RAG systems provide a practical way to enhance LLMs with domain knowledge. Proper implementation of vector databases, embeddings, and retrieval strategies is crucial for production quality.