Testing and Quality Assurance for AI-Powered Systems

Testing AI systems requires different approaches than traditional software. This guide covers strategies for ensuring quality in LLM-powered applications.

Unit Testing AI Components

Testing Deterministic Logic

Test input preprocessing and validation
Test output parsing and formatting
Test error handling and retry logic
Test caching mechanisms
Mock LLM responses for deterministic tests

Prompt Testing

Version control prompts like code
Test prompt variations against expected outputs
Validate prompt token counts
Test prompt injection vulnerabilities
Regression tests when updating prompts

Integration Testing

API Integration Tests

Test against actual LLM APIs in staging
Verify error handling for API failures
Test rate limiting behavior
Validate timeout handling
Test with various input types and lengths

End-to-End Tests

Test complete user workflows
Verify multi-step processes
Test agent interactions in multi-agent systems
Validate state management across steps
Test with production-like data volumes

LLM Output Validation

Automated Evaluation

Format validation (JSON, structured outputs)
Content moderation checks
Factual accuracy verification against known data
Consistency checks across similar inputs
Hallucination detection

Evaluation Metrics

BLEU score for translation/generation quality
ROUGE score for summarization
Exact match for structured outputs
Semantic similarity for meaning preservation
Domain-specific accuracy metrics

Human Evaluation

Sample-based human review
User acceptance testing
A/B testing with real users
Feedback collection mechanisms
Expert review for domain-specific tasks

Regression Testing

Golden Dataset

Curate representative test cases
Include edge cases and common failures
Cover all major features and use cases
Update when new patterns emerge
Maintain expected outputs

Continuous Evaluation

Run regression tests on every prompt change
Test against golden dataset before deployment
Monitor performance degradation
Alert on significant quality drops
Track metrics over time

Performance Testing

Load Testing

Simulate peak traffic scenarios
Test rate limiting and queuing behavior
Measure latency under load
Identify bottlenecks
Verify auto-scaling functionality

Latency Testing

Measure p50, p95, p99 response times
Test with various prompt lengths
Benchmark streaming vs non-streaming
Test caching effectiveness
Identify slow operations

Security Testing

Input Validation

Test with malicious inputs
Prompt injection attempts
SQL injection in generated code
XSS in generated content
Test input length limits

Output Safety

Content moderation testing
PII detection verification
Test for data leakage
Verify access control enforcement
Test audit logging

Testing Tools and Frameworks

LLM Evaluation Frameworks

LangChain evaluation tools
OpenAI Evals framework
Custom evaluation pipelines
Integration with CI/CD
Automated reporting

Monitoring Tools

LangSmith for LLM observability
Custom dashboards (Grafana, DataDog)
Alert systems for quality degradation
Cost tracking integration
User feedback collection

Test Data Management

Synthetic Data Generation

Generate test cases with LLMs
Create diverse input scenarios
Simulate edge cases
Privacy-safe testing
Scale test coverage

Data Privacy in Testing

Anonymize production data for testing
Synthetic data for sensitive domains
Separate test environments
GDPR compliance in test data
Data retention policies

CI/CD Integration

Automated Testing Pipeline

Run tests on every commit
Automated regression testing
Performance benchmarks
Quality gates before deployment
Automated rollback on failures

Deployment Strategies

Canary deployments for new prompts
Blue-green deployments
A/B testing in production
Feature flags for gradual rollout
Quick rollback mechanisms

Monitoring Production Quality

Real-time Monitoring

Track error rates
Monitor latency metrics
Watch for quality degradation
User satisfaction scores
Cost per request trends

Feedback Loops

Collect user feedback
Monitor thumbs up/down ratings
Analyze support tickets
Track feature usage
Identify improvement opportunities

Code Example: Unit Testing LLM Applications

Comprehensive testing framework for AI systems with assertions and metrics.

python

import pytest
from typing import List, Dict
import anthropic

class LLMTester:
    """Test framework for LLM applications"""

    def __init__(self, model: str = "claude-sonnet-4.5"):
        self.client = anthropic.Anthropic()
        self.model = model

    def assert_contains_keywords(self, response: str, keywords: List[str]):
        """Assert response contains required keywords"""
        for keyword in keywords:
            assert keyword.lower() in response.lower(), f"Missing keyword: {keyword}"

    def assert_max_length(self, response: str, max_tokens: int):
        """Assert response is within token limit"""
        # Rough estimate: 1 token ≈ 4 characters
        estimated_tokens = len(response) / 4
        assert estimated_tokens <= max_tokens, f"Response too long: {estimated_tokens} tokens"

    def assert_no_harmful_content(self, response: str):
        """Assert response contains no harmful content"""
        harmful_patterns = ["violence", "hate", "illegal"]
        for pattern in harmful_patterns:
            assert pattern not in response.lower(), f"Harmful content detected: {pattern}"

    def test_consistency(self, prompt: str, num_samples: int = 3) -> float:
        """Test response consistency across multiple calls"""
        responses = []
        for _ in range(num_samples):
            message = self.client.messages.create(
                model=self.model,
                max_tokens=100,
                messages=[{"role": "user", "content": prompt}]
            )
            responses.append(message.content[0].text)

        # Calculate similarity (simple approach)
        unique_responses = len(set(responses))
        consistency_score = 1 - (unique_responses / num_samples)
        return consistency_score

    def test_latency(self, prompt: str, max_seconds: float = 5.0):
        """Test response latency"""
        import time
        start = time.time()

        message = self.client.messages.create(
            model=self.model,
            max_tokens=500,
            messages=[{"role": "user", "content": prompt}]
        )

        latency = time.time() - start
        assert latency <= max_seconds, f"Response too slow: {latency:.2f}s"
        return latency

# Pytest test cases
def test_sentiment_classification():
    """Test sentiment classification accuracy"""
    tester = LLMTester()

    prompt = """Classify sentiment as Positive, Negative, or Neutral:
    "This product is amazing!""""

    message = tester.client.messages.create(
        model=tester.model,
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )

    response = message.content[0].text
    tester.assert_contains_keywords(response, ["positive"])

def test_response_quality():
    """Test response meets quality standards"""
    tester = LLMTester()

    prompt = "Explain machine learning in one sentence"

    message = tester.client.messages.create(
        model=tester.model,
        max_tokens=100,
        messages=[{"role": "user", "content": prompt}]
    )

    response = message.content[0].text

    # Quality assertions
    tester.assert_contains_keywords(response, ["machine learning", "data"])
    tester.assert_max_length(response, 50)
    tester.assert_no_harmful_content(response)

def test_consistency():
    """Test response consistency"""
    tester = LLMTester()

    consistency = tester.test_consistency("What is 2+2?", num_samples=3)
    assert consistency >= 0.8, f"Low consistency: {consistency}"

def test_latency():
    """Test response latency"""
    tester = LLMTester()

    latency = tester.test_latency("Hello", max_seconds=3.0)
    print(f"Latency: {latency:.2f}s")

# Run tests
if __name__ == "__main__":
    pytest.main([__file__, "-v"])

Code Example: Evaluation Metrics

Comprehensive evaluation framework with accuracy, F1, and custom metrics.