Testing AI systems requires different approaches than traditional software. This guide covers strategies for ensuring quality in LLM-powered applications.
Unit Testing AI Components
Testing Deterministic Logic
- Test input preprocessing and validation
- Test output parsing and formatting
- Test error handling and retry logic
- Test caching mechanisms
- Mock LLM responses for deterministic tests
Prompt Testing
- Version control prompts like code
- Test prompt variations against expected outputs
- Validate prompt token counts
- Test prompt injection vulnerabilities
- Regression tests when updating prompts
Integration Testing
API Integration Tests
- Test against actual LLM APIs in staging
- Verify error handling for API failures
- Test rate limiting behavior
- Validate timeout handling
- Test with various input types and lengths
End-to-End Tests
- Test complete user workflows
- Verify multi-step processes
- Test agent interactions in multi-agent systems
- Validate state management across steps
- Test with production-like data volumes
LLM Output Validation
Automated Evaluation
- Format validation (JSON, structured outputs)
- Content moderation checks
- Factual accuracy verification against known data
- Consistency checks across similar inputs
- Hallucination detection
Evaluation Metrics
- BLEU score for translation/generation quality
- ROUGE score for summarization
- Exact match for structured outputs
- Semantic similarity for meaning preservation
- Domain-specific accuracy metrics
Human Evaluation
- Sample-based human review
- User acceptance testing
- A/B testing with real users
- Feedback collection mechanisms
- Expert review for domain-specific tasks
Regression Testing
Golden Dataset
- Curate representative test cases
- Include edge cases and common failures
- Cover all major features and use cases
- Update when new patterns emerge
- Maintain expected outputs
Continuous Evaluation
- Run regression tests on every prompt change
- Test against golden dataset before deployment
- Monitor performance degradation
- Alert on significant quality drops
- Track metrics over time
Performance Testing
Load Testing
- Simulate peak traffic scenarios
- Test rate limiting and queuing behavior
- Measure latency under load
- Identify bottlenecks
- Verify auto-scaling functionality
Latency Testing
- Measure p50, p95, p99 response times
- Test with various prompt lengths
- Benchmark streaming vs non-streaming
- Test caching effectiveness
- Identify slow operations
Security Testing
Input Validation
- Test with malicious inputs
- Prompt injection attempts
- SQL injection in generated code
- XSS in generated content
- Test input length limits
Output Safety
- Content moderation testing
- PII detection verification
- Test for data leakage
- Verify access control enforcement
- Test audit logging
Testing Tools and Frameworks
LLM Evaluation Frameworks
- LangChain evaluation tools
- OpenAI Evals framework
- Custom evaluation pipelines
- Integration with CI/CD
- Automated reporting
Monitoring Tools
- LangSmith for LLM observability
- Custom dashboards (Grafana, DataDog)
- Alert systems for quality degradation
- Cost tracking integration
- User feedback collection
Test Data Management
Synthetic Data Generation
- Generate test cases with LLMs
- Create diverse input scenarios
- Simulate edge cases
- Privacy-safe testing
- Scale test coverage
Data Privacy in Testing
- Anonymize production data for testing
- Synthetic data for sensitive domains
- Separate test environments
- GDPR compliance in test data
- Data retention policies
CI/CD Integration
Automated Testing Pipeline
- Run tests on every commit
- Automated regression testing
- Performance benchmarks
- Quality gates before deployment
- Automated rollback on failures
Deployment Strategies
- Canary deployments for new prompts
- Blue-green deployments
- A/B testing in production
- Feature flags for gradual rollout
- Quick rollback mechanisms
Monitoring Production Quality
Real-time Monitoring
- Track error rates
- Monitor latency metrics
- Watch for quality degradation
- User satisfaction scores
- Cost per request trends
Feedback Loops
- Collect user feedback
- Monitor thumbs up/down ratings
- Analyze support tickets
- Track feature usage
- Identify improvement opportunities
Code Example: Unit Testing LLM Applications
Comprehensive testing framework for AI systems with assertions and metrics.
import pytest
from typing import List, Dict
import anthropic
class LLMTester:
"""Test framework for LLM applications"""
def __init__(self, model: str = "claude-sonnet-4.5"):
self.client = anthropic.Anthropic()
self.model = model
def assert_contains_keywords(self, response: str, keywords: List[str]):
"""Assert response contains required keywords"""
for keyword in keywords:
assert keyword.lower() in response.lower(), f"Missing keyword: {keyword}"
def assert_max_length(self, response: str, max_tokens: int):
"""Assert response is within token limit"""
# Rough estimate: 1 token ≈ 4 characters
estimated_tokens = len(response) / 4
assert estimated_tokens <= max_tokens, f"Response too long: {estimated_tokens} tokens"
def assert_no_harmful_content(self, response: str):
"""Assert response contains no harmful content"""
harmful_patterns = ["violence", "hate", "illegal"]
for pattern in harmful_patterns:
assert pattern not in response.lower(), f"Harmful content detected: {pattern}"
def test_consistency(self, prompt: str, num_samples: int = 3) -> float:
"""Test response consistency across multiple calls"""
responses = []
for _ in range(num_samples):
message = self.client.messages.create(
model=self.model,
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
responses.append(message.content[0].text)
# Calculate similarity (simple approach)
unique_responses = len(set(responses))
consistency_score = 1 - (unique_responses / num_samples)
return consistency_score
def test_latency(self, prompt: str, max_seconds: float = 5.0):
"""Test response latency"""
import time
start = time.time()
message = self.client.messages.create(
model=self.model,
max_tokens=500,
messages=[{"role": "user", "content": prompt}]
)
latency = time.time() - start
assert latency <= max_seconds, f"Response too slow: {latency:.2f}s"
return latency
# Pytest test cases
def test_sentiment_classification():
"""Test sentiment classification accuracy"""
tester = LLMTester()
prompt = """Classify sentiment as Positive, Negative, or Neutral:
"This product is amazing!""""
message = tester.client.messages.create(
model=tester.model,
max_tokens=10,
messages=[{"role": "user", "content": prompt}]
)
response = message.content[0].text
tester.assert_contains_keywords(response, ["positive"])
def test_response_quality():
"""Test response meets quality standards"""
tester = LLMTester()
prompt = "Explain machine learning in one sentence"
message = tester.client.messages.create(
model=tester.model,
max_tokens=100,
messages=[{"role": "user", "content": prompt}]
)
response = message.content[0].text
# Quality assertions
tester.assert_contains_keywords(response, ["machine learning", "data"])
tester.assert_max_length(response, 50)
tester.assert_no_harmful_content(response)
def test_consistency():
"""Test response consistency"""
tester = LLMTester()
consistency = tester.test_consistency("What is 2+2?", num_samples=3)
assert consistency >= 0.8, f"Low consistency: {consistency}"
def test_latency():
"""Test response latency"""
tester = LLMTester()
latency = tester.test_latency("Hello", max_seconds=3.0)
print(f"Latency: {latency:.2f}s")
# Run tests
if __name__ == "__main__":
pytest.main([__file__, "-v"])
Code Example: Evaluation Metrics
Comprehensive evaluation framework with accuracy, F1, and custom metrics.
from typing import List, Dict, Tuple
from dataclasses import dataclass
@dataclass
class EvaluationResult:
accuracy: float
precision: float
recall: float
f1_score: float
confusion_matrix: Dict[str, Dict[str, int]]
class LLMEvaluator:
"""Evaluate LLM performance on test datasets"""
def evaluate_classification(
self,
predictions: List[str],
ground_truth: List[str],
labels: List[str]
) -> EvaluationResult:
"""Evaluate classification performance"""
# Calculate confusion matrix
confusion = {label: {label2: 0 for label2 in labels} for label in labels}
for pred, truth in zip(predictions, ground_truth):
confusion[truth][pred] += 1
# Calculate metrics per class
precision_per_class = {}
recall_per_class = {}
for label in labels:
true_positive = confusion[label][label]
false_positive = sum(confusion[other][label] for other in labels if other != label)
false_negative = sum(confusion[label][other] for other in labels if other != label)
precision = true_positive / (true_positive + false_positive) if (true_positive + false_positive) > 0 else 0
recall = true_positive / (true_positive + false_negative) if (true_positive + false_negative) > 0 else 0
precision_per_class[label] = precision
recall_per_class[label] = recall
# Macro average
avg_precision = sum(precision_per_class.values()) / len(labels)
avg_recall = sum(recall_per_class.values()) / len(labels)
f1 = 2 * (avg_precision * avg_recall) / (avg_precision + avg_recall) if (avg_precision + avg_recall) > 0 else 0
# Accuracy
correct = sum(1 for p, t in zip(predictions, ground_truth) if p == t)
accuracy = correct / len(predictions)
return EvaluationResult(
accuracy=accuracy,
precision=avg_precision,
recall=avg_recall,
f1_score=f1,
confusion_matrix=confusion
)
def calculate_rouge_score(self, generated: str, reference: str) -> Dict[str, float]:
"""Calculate ROUGE score for text generation"""
# Simple ROUGE-1 implementation
generated_words = set(generated.lower().split())
reference_words = set(reference.lower().split())
overlap = len(generated_words & reference_words)
precision = overlap / len(generated_words) if generated_words else 0
recall = overlap / len(reference_words) if reference_words else 0
f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
return {
"rouge-1-precision": precision,
"rouge-1-recall": recall,
"rouge-1-f1": f1
}
def evaluate_rag_system(
self,
questions: List[str],
retrieved_docs: List[List[str]],
ground_truth_docs: List[List[str]]
) -> Dict[str, float]:
"""Evaluate RAG retrieval performance"""
mrr_scores = [] # Mean Reciprocal Rank
for retrieved, truth in zip(retrieved_docs, ground_truth_docs):
# Find rank of first relevant document
for rank, doc in enumerate(retrieved, 1):
if doc in truth:
mrr_scores.append(1 / rank)
break
else:
mrr_scores.append(0)
return {
"mean_reciprocal_rank": sum(mrr_scores) / len(mrr_scores),
"retrieval_success_rate": sum(1 for s in mrr_scores if s > 0) / len(mrr_scores)
}
# Example usage
evaluator = LLMEvaluator()
# Classification evaluation
predictions = ["positive", "negative", "neutral", "positive"]
ground_truth = ["positive", "negative", "positive", "positive"]
labels = ["positive", "negative", "neutral"]
results = evaluator.evaluate_classification(predictions, ground_truth, labels)
print(f"Accuracy: {results.accuracy:.2%}")
print(f"F1 Score: {results.f1_score:.2%}")
print(f"Confusion Matrix: {results.confusion_matrix}")
# Text generation evaluation
generated = "Machine learning is a subset of AI that enables systems to learn from data"
reference = "Machine learning is AI technique that allows systems to learn from data"
rouge = evaluator.calculate_rouge_score(generated, reference)
print(f"ROUGE-1 F1: {rouge['rouge-1-f1']:.2%}")
Best Practices
- Test early and often
- Automate where possible
- Maintain golden datasets
- Monitor production continuously
- Implement feedback loops
- Version control prompts
- Document test coverage
- Regular security audits
- Performance benchmarks
- User acceptance testing
Testing AI systems requires combining traditional software testing with AI-specific evaluation methods. Continuous monitoring and feedback loops are essential for maintaining quality in production.