Choosing between hosted AI models (GPT-5, Claude Opus 4.1) and self-hosted open-weight models (Llama 4, Mistral) is one of the most critical infrastructure decisions for AI applications. The choice affects costs, privacy, performance, and operational complexity. This guide provides a comprehensive total cost of ownership (TCO) analysis, break-even calculations, and decision framework based on November 2025 pricing and capabilities.
Cost Structure Comparison
Hosted Models (API-Based)
Pricing per 1 million tokens (November 2025):
- GPT-5: $2.50 input / $10.00 output
- GPT-4o: $2.50 input / $10.00 output
- GPT-4o-mini: $0.15 input / $0.60 output
- Claude Opus 4.1: $15.00 input / $75.00 output
- Claude Sonnet 4.5: $3.00 input / $15.00 output
- Claude Haiku 4.5: $1.00 input / $5.00 output
- Gemini 2.5 Pro: $1.25 input / $5.00 output
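Per-request cost is a weighted sum: (input_tokens / 1M) × input_rate + (output_tokens / 1M) × output_rate. A minimal sketch using the rates above (the 3,000/1,000-token request size is an arbitrary example):
# Cost of a single request: 3,000 input tokens, 1,000 output tokens.
# Rates are per 1M tokens, taken from the table above.
prices = {
    "gpt-5": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-opus-4.1": (15.00, 75.00),
}
input_tokens, output_tokens = 3_000, 1_000
for model, (rate_in, rate_out) in prices.items():
    cost = (input_tokens / 1e6) * rate_in + (output_tokens / 1e6) * rate_out
    print(f"{model}: ${cost:.4f}/request")
# ~$0.0175 (gpt-5), ~$0.0011 (gpt-4o-mini), ~$0.1200 (claude-opus-4.1)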
Self-Hosted Models (Infrastructure Costs)
Llama 4 variants and hardware requirements:
- Llama 4 Scout (17B active params): 1x A100 40GB (~$1.50/hour cloud)
- Llama 4 Maverick (17B active, 128 experts): 4x A100 80GB (~$10/hour cloud)
- Llama 4 8B: 1x RTX 4090 or A5000 (~$0.50-1.00/hour cloud)
- Fixed costs: Engineering ($150k/year), ops overhead ($50k/year)
- On-prem option: 8x H100 cluster = $240k capex + $30k/year power/cooling
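To compare the on-prem option against hourly cloud rates, the capex has to be amortized. A rough sketch, assuming straight-line amortization over three years (the useful-life figure is an assumption, not part of the quote above):
# Amortizing the 8x H100 on-prem quote (assumed 3-year useful life).
capex = 240_000                # cluster purchase price
power_cooling_annual = 30_000
amortization_years = 3         # assumption; adjust to your depreciation policy
hardware_monthly = capex / (amortization_years * 12)   # ~$6,667/month
power_monthly = power_cooling_annual / 12              # $2,500/month
fixed_monthly = (150_000 + 50_000) / 12                # engineering + ops
total = hardware_monthly + power_monthly + fixed_monthly
print(f"All-in on-prem: ~${total:,.0f}/month")         # ~$25,833/month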
Break-Even Analysis
Scenario 1: Moderate Usage (10M tokens/day)
# Break-even calculator for hosted vs self-hosted
from dataclasses import dataclass
from typing import Dict
@dataclass
class HostedPricing:
"""Hosted model pricing per 1M tokens."""
input_cost: float
output_cost: float
model_name: str
@dataclass
class SelfHostedCosts:
"""Self-hosted infrastructure costs."""
gpu_hourly_cost: float
gpu_count: int
engineering_annual: float = 150000 # ML engineer salary
ops_annual: float = 50000 # DevOps overhead
model_name: str = "Llama 4"
def calculate_hosted_monthly_cost(
pricing: HostedPricing,
daily_input_tokens_millions: float,
daily_output_tokens_millions: float
) -> Dict[str, float]:
"""Calculate monthly hosted costs."""
daily_cost = (
(daily_input_tokens_millions * pricing.input_cost) +
(daily_output_tokens_millions * pricing.output_cost)
)
monthly_cost = daily_cost * 30
annual_cost = daily_cost * 365
return {
"daily_cost": daily_cost,
"monthly_cost": monthly_cost,
"annual_cost": annual_cost,
"model": pricing.model_name
}
def calculate_selfhosted_monthly_cost(
costs: SelfHostedCosts,
utilization: float = 0.7 # 70% utilization
) -> Dict[str, float]:
"""Calculate monthly self-hosted costs."""
    # GPU compute cost; the utilization factor assumes you pay for only
    # ~70% of wall-clock hours (e.g., autoscaling down during off-peak)
    gpu_monthly = costs.gpu_hourly_cost * costs.gpu_count * 24 * 30 * utilization
# Amortized fixed costs
engineering_monthly = costs.engineering_annual / 12
ops_monthly = costs.ops_annual / 12
total_monthly = gpu_monthly + engineering_monthly + ops_monthly
total_annual = total_monthly * 12
return {
"gpu_cost_monthly": gpu_monthly,
"engineering_monthly": engineering_monthly,
"ops_monthly": ops_monthly,
"total_monthly": total_monthly,
"total_annual": total_annual,
"model": costs.model_name
}
def find_breakeven_volume(
hosted: HostedPricing,
selfhosted: SelfHostedCosts,
input_output_ratio: float = 0.5 # 50% input, 50% output
) -> Dict[str, float]:
"""Find daily token volume where self-hosted breaks even."""
# Self-hosted monthly cost (fixed)
selfhosted_monthly = calculate_selfhosted_monthly_cost(selfhosted)
fixed_monthly_cost = selfhosted_monthly["total_monthly"]
# Hosted cost per token (weighted average of input/output)
avg_cost_per_million = (
(hosted.input_cost * input_output_ratio) +
(hosted.output_cost * (1 - input_output_ratio))
)
# Break-even: fixed_monthly = daily_tokens * avg_cost * 30
# daily_tokens = fixed_monthly / (avg_cost * 30)
breakeven_daily_millions = fixed_monthly_cost / (avg_cost_per_million * 30)
breakeven_monthly_millions = breakeven_daily_millions * 30
return {
"breakeven_daily_tokens_millions": breakeven_daily_millions,
"breakeven_monthly_tokens_millions": breakeven_monthly_millions,
"hosted_model": hosted.model_name,
"selfhosted_model": selfhosted.model_name
}
# Example calculations
if __name__ == "__main__":
# Define models
gpt5 = HostedPricing(
input_cost=2.50,
output_cost=10.00,
model_name="GPT-5"
)
gpt4o_mini = HostedPricing(
input_cost=0.15,
output_cost=0.60,
model_name="GPT-4o-mini"
)
llama4_scout = SelfHostedCosts(
gpu_hourly_cost=1.50,
gpu_count=1,
model_name="Llama 4 Scout"
)
llama4_maverick = SelfHostedCosts(
gpu_hourly_cost=2.50,
gpu_count=4,
model_name="Llama 4 Maverick"
)
# Scenario: 10M tokens/day (5M input, 5M output)
daily_input = 5.0
daily_output = 5.0
print("=== Scenario: 10M tokens/day (5M in, 5M out) ===")
# GPT-5 hosted
gpt5_cost = calculate_hosted_monthly_cost(gpt5, daily_input, daily_output)
print(f"\nGPT-5 (hosted):")
print(f" Monthly: ${gpt5_cost['monthly_cost']:,.2f}")
print(f" Annual: ${gpt5_cost['annual_cost']:,.2f}")
# GPT-4o-mini hosted
mini_cost = calculate_hosted_monthly_cost(gpt4o_mini, daily_input, daily_output)
print(f"\nGPT-4o-mini (hosted):")
print(f" Monthly: ${mini_cost['monthly_cost']:,.2f}")
print(f" Annual: ${mini_cost['annual_cost']:,.2f}")
# Llama 4 Scout self-hosted
scout_cost = calculate_selfhosted_monthly_cost(llama4_scout)
print(f"\nLlama 4 Scout (self-hosted):")
print(f" GPU cost: ${scout_cost['gpu_cost_monthly']:,.2f}/month")
print(f" Engineering: ${scout_cost['engineering_monthly']:,.2f}/month")
print(f" Total monthly: ${scout_cost['total_monthly']:,.2f}")
print(f" Total annual: ${scout_cost['total_annual']:,.2f}")
# Break-even analysis
print("\n=== Break-Even Analysis ===")
breakeven_gpt5 = find_breakeven_volume(gpt5, llama4_scout)
print(f"\nGPT-5 vs Llama 4 Scout:")
print(f" Break-even: {breakeven_gpt5['breakeven_daily_tokens_millions']:.1f}M tokens/day")
print(f" Current volume: 10M tokens/day")
if 10 > breakeven_gpt5['breakeven_daily_tokens_millions']:
savings = gpt5_cost['annual_cost'] - scout_cost['total_annual']
print(f" ✓ Self-hosting saves ${savings:,.2f}/year")
else:
print(f" ✗ Hosted is cheaper at current volume")
breakeven_mini = find_breakeven_volume(gpt4o_mini, llama4_scout)
print(f"\nGPT-4o-mini vs Llama 4 Scout:")
print(f" Break-even: {breakeven_mini['breakeven_daily_tokens_millions']:.1f}M tokens/day")
print(f" Current volume: 10M tokens/day")
if 10 > breakeven_mini['breakeven_daily_tokens_millions']:
savings = mini_cost['annual_cost'] - scout_cost['total_annual']
print(f" ✓ Self-hosting saves ${savings:,.2f}/year")
else:
print(f" ✗ Hosted is cheaper at current volume")
Calculation Results
For 10M tokens/day (5M input, 5M output):
- GPT-5 hosted: $1,875/month (~$22,800/year)
- GPT-4o-mini hosted: $112.50/month (~$1,370/year)
- Llama 4 Scout self-hosted: ~$17,423/month (~$209,000/year)
- Break-even vs GPT-5: ~93M tokens/day
- Break-even vs GPT-4o-mini: ~1.55B tokens/day
At this volume, hosted wins decisively in both comparisons.
Scenario 2: High Volume (100M tokens/day)
- GPT-5 hosted: $18,750/month (~$228,000/year)
- GPT-4o-mini hosted: $1,125/month (~$13,700/year)
- Llama 4 Scout self-hosted: ~$17,423/month (~$209,000/year)
- Self-hosted narrowly beats GPT-5 (~$19,000/year savings): 100M/day sits just past the ~93M/day break-even, and savings grow roughly linearly with volume from here
- Caveat: this simple model holds GPU count fixed; in practice the fleet (and its cost) scales with throughput
- Self-hosted remains far more expensive than GPT-4o-mini at this volume
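These figures come from re-running the same calculator at the higher volume (assumes the dataclasses, functions, and model definitions above are in scope):
# Scenario 2: 100M tokens/day (50M input, 50M output).
gpt5_highvol = calculate_hosted_monthly_cost(gpt5, 50.0, 50.0)
mini_highvol = calculate_hosted_monthly_cost(gpt4o_mini, 50.0, 50.0)
scout_fixed = calculate_selfhosted_monthly_cost(llama4_scout)
print(f"GPT-5:       ${gpt5_highvol['monthly_cost']:>9,.0f}/month")  # $18,750
print(f"GPT-4o-mini: ${mini_highvol['monthly_cost']:>9,.0f}/month")  # $1,125
print(f"Scout:       ${scout_fixed['total_monthly']:>9,.0f}/month")  # ~$17,423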
Decision Framework
Choose Hosted Models When:
- Volume below the break-even (~93M tokens/day vs GPT-5 pricing; far higher vs budget models like GPT-4o-mini)
- Startup/small team (no ML engineering capacity)
- Need multiple models (GPT, Claude, Gemini for different use cases)
- Rapid iteration required (no infrastructure setup time)
- Variable workload (pay only for usage)
- Compliance allows external APIs
- Latest capabilities needed (GPT-5, Claude Opus 4.1 frontier performance)
Choose Self-Hosted When:
- Volume well past the break-even (~100M+ tokens/day vs GPT-5-class pricing, with savings scaling from there)
- Privacy/compliance requires on-premise (healthcare, finance, government)
- Data cannot leave network (trade secrets, classified info)
- Customization needed (fine-tuning, architectural changes)
- Predictable costs preferred (no surprise API bills)
- Low latency critical (no external API roundtrip)
- Have ML engineering team (can manage infrastructure)
Hybrid Approach
- Self-host for high-volume, simple queries (Llama 4 8B)
- Use hosted for complex reasoning (GPT-5, Claude Opus 4.1)
- Route based on query complexity and sensitivity
- Best of both: cost optimization + capability access (see the routing implementation below)
Hidden Costs to Consider
Self-Hosted Hidden Costs
- Engineering time: Model updates, optimization, debugging (20-40% of 1 FTE)
- Infrastructure complexity: Kubernetes, GPU orchestration, monitoring
- Model staleness: Open-source models lag behind frontier (6-12 months)
- Quality ceiling: Llama 4 < GPT-5 on complex reasoning
- Cold start costs: Scaling up/down takes minutes vs instant API
- Disaster recovery: Backup infrastructure needed
Hosted Hidden Costs
- Rate limiting: May need multiple API keys or client-side throttling (see the backoff sketch after this list)
- Vendor lock-in: Prompts optimized for specific model
- Price increases: Providers can raise prices
- Model deprecation: Old versions sunset (forced migration)
- Unpredictable costs: Token usage spikes
- Network egress: Data transfer costs if high volume
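Rate limiting is usually the first of these costs a team hits in production. A minimal retry-with-exponential-backoff sketch; openai.RateLimitError is what the OpenAI Python SDK raises on HTTP 429, and the wrapper below is illustrative rather than production-grade:
import random
import time

import openai

def chat_with_backoff(client: openai.OpenAI, max_retries: int = 5, **kwargs):
    """Call chat.completions.create, retrying on HTTP 429 with backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except openai.RateLimitError:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, 8s, ...
            time.sleep(2 ** attempt + random.random())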
Privacy and Compliance Considerations
Data Privacy with Hosted Models
- OpenAI API: Data not used for training by default (API/business terms)
- Anthropic API: Data not retained (per DPA)
- Google Vertex AI: Data stays in your GCP project
- Azure OpenAI: Data residency in your Azure region
- Risk: Data transits provider infrastructure
Compliance Requirements Favoring Self-Hosted
- HIPAA: PHI requires a BAA with any vendor handling it; many organizations bar PHI from leaving the network entirely, which forces self-hosting
- GDPR: Data localization requirements (EU hosting)
- Finance: Trade secret protection (on-premise)
- Government: Classified data (air-gapped deployment)
- Zero-trust architectures: No external dependencies
Performance Comparison
Quality Benchmarks (November 2025)
- Coding (SWE-bench): GPT-5 (74.9%) > Llama 4 Maverick (68%) > Llama 4 Scout (61%)
- Math (AIME): GPT-5 (94.6%) > Llama 4 Maverick (87%) > Llama 4 Scout (79%)
- Reasoning: Claude Opus 4.1 > GPT-5 > Llama 4 Maverick > Llama 4 Scout
- Multimodal: Gemini 2.5 Pro > GPT-5 > Llama 4 (native multimodal)
- Context: Llama 4 Scout (10M tokens) >> GPT-5 (128K) > Claude (200K)
Latency Comparison
- Hosted API: 200ms-2s (network + inference)
- Self-hosted (same datacenter): 50-500ms (inference only)
- Self-hosted (edge): 10-200ms (co-located with app)
- Advantage: Self-hosted for latency-critical applications
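Latency is deployment-specific, so measure rather than assume. A small sketch that times an OpenAI-compatible chat endpoint end to end (the internal URL and model name are placeholders):
import statistics
import time

import requests

def measure_latency(endpoint: str, payload: dict, runs: int = 10) -> dict:
    """Time end-to-end POST latency against an OpenAI-compatible endpoint."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        resp = requests.post(f"{endpoint}/v1/chat/completions",
                             json=payload, timeout=30)
        resp.raise_for_status()
        samples.append(time.perf_counter() - start)
    return {"p50_ms": statistics.median(samples) * 1000,
            "max_ms": max(samples) * 1000}

# Example (placeholder endpoint and model name):
# stats = measure_latency("http://llama-inference.internal:8000",
#                         {"model": "llama-4-scout",
#                          "messages": [{"role": "user", "content": "ping"}],
#                          "max_tokens": 1})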
Implementation: Hybrid Routing
from enum import Enum
from typing import Optional
import openai
import requests
class ModelTier(Enum):
"""Model selection tiers."""
FAST_CHEAP = "fast_cheap" # Self-hosted Llama 4 8B
BALANCED = "balanced" # GPT-4o-mini or self-hosted Llama 4 Scout
PREMIUM = "premium" # GPT-5 or Claude Opus 4.1
class HybridLLMRouter:
"""Route queries to optimal model based on complexity and privacy."""
def __init__(self,
selfhosted_endpoint: str,
openai_api_key: str):
self.selfhosted_endpoint = selfhosted_endpoint
self.openai_client = openai.OpenAI(api_key=openai_api_key)
def estimate_complexity(self, query: str) -> ModelTier:
"""Estimate query complexity to select appropriate model tier."""
# Simple heuristics (in production, use classifier)
query_lower = query.lower()
# Premium tier indicators
premium_keywords = [
"analyze", "explain complex", "step by step",
"reasoning", "proof", "design", "architecture"
]
if any(kw in query_lower for kw in premium_keywords):
return ModelTier.PREMIUM
# Fast tier indicators
fast_keywords = [
"summarize", "translate", "classify",
"sentiment", "yes or no", "true or false"
]
if any(kw in query_lower for kw in fast_keywords):
return ModelTier.FAST_CHEAP
# Default: balanced
return ModelTier.BALANCED
def route_query(self,
query: str,
force_tier: Optional[ModelTier] = None,
require_privacy: bool = False) -> dict:
"""Route query to appropriate model."""
# Privacy requirement forces self-hosted
if require_privacy:
return self._call_selfhosted(query, "llama-4-scout")
# Determine tier
tier = force_tier or self.estimate_complexity(query)
if tier == ModelTier.FAST_CHEAP:
# Use self-hosted for cost efficiency
return self._call_selfhosted(query, "llama-4-8b")
elif tier == ModelTier.BALANCED:
# Use GPT-4o-mini (good balance of cost/quality)
return self._call_hosted(query, "gpt-4o-mini")
else: # PREMIUM
# Use GPT-5 for best quality
return self._call_hosted(query, "gpt-5")
def _call_selfhosted(self, query: str, model: str) -> dict:
"""Call self-hosted model."""
response = requests.post(
f"{self.selfhosted_endpoint}/v1/chat/completions",
json={
"model": model,
"messages": [{"role": "user", "content": query}],
"max_tokens": 500
},
timeout=30
)
response.raise_for_status()
result = response.json()
return {
"response": result["choices"][0]["message"]["content"],
"model": model,
"hosting": "self-hosted",
"cost_estimate": 0.0 # Already paid for
}
def _call_hosted(self, query: str, model: str) -> dict:
"""Call hosted API (OpenAI)."""
response = self.openai_client.chat.completions.create(
model=model,
messages=[{"role": "user", "content": query}],
max_tokens=500
)
# Calculate cost
pricing = {
"gpt-5": {"input": 2.50, "output": 10.00},
"gpt-4o-mini": {"input": 0.15, "output": 0.60}
}
input_cost = (response.usage.prompt_tokens / 1_000_000) * pricing[model]["input"]
output_cost = (response.usage.completion_tokens / 1_000_000) * pricing[model]["output"]
total_cost = input_cost + output_cost
return {
"response": response.choices[0].message.content,
"model": model,
"hosting": "hosted",
"cost_estimate": total_cost
}
# Example usage
if __name__ == "__main__":
import os
router = HybridLLMRouter(
selfhosted_endpoint="http://llama-inference.internal:8000",
openai_api_key=os.environ["OPENAI_API_KEY"]
)
# Simple query -> self-hosted (cheap)
result1 = router.route_query("Classify sentiment: This product is great!")
print(f"Query 1: {result1['model']} ({result1['hosting']})")
# Complex query -> hosted premium (quality)
result2 = router.route_query(
"Analyze the architectural trade-offs between microservices and monolithic design"
)
print(f"Query 2: {result2['model']} ({result2['hosting']})")
# Privacy-required -> self-hosted (compliance)
result3 = router.route_query(
"Summarize this patient record",
require_privacy=True
)
print(f"Query 3: {result3['model']} ({result3['hosting']})")
Real-World Case Studies
Case Study 1: SaaS Startup (Hosted Winner)
- Volume: 5M tokens/day
- Team: 5 engineers (no ML specialists)
- Decision: GPT-4o-mini hosted
- Cost: ~$56/month hosted vs ~$17,400/month self-hosted
- Outcome: Avoided ~$208,000/year in infrastructure and staffing, faster time-to-market
Case Study 2: Enterprise at Scale (Self-Hosted Winner)
- Volume: 500M tokens/day
- Team: 50 engineers (including 5 ML specialists)
- Decision: Self-hosted Llama 4 Scout + GPT-5 for 10% complex queries
- Cost: ~$300k/year self-hosted fleet + ~$114k/year GPT-5 ≈ $414k total
- vs hosted-only GPT-5: ~$1.14M/year
- Outcome: Saved ~$727k/year (~64% reduction)
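The split arithmetic behind those figures, using the GPT-5 rates from the pricing table (the $300k fleet cost is the case study's own figure for a multi-GPU deployment, not derived from the single-GPU Scout example):
# Case study 2 arithmetic: 500M tokens/day, ~90% self-hosted / ~10% GPT-5.
gpt5_rate = 0.5 * 2.50 + 0.5 * 10.00       # $6.25 per 1M tokens (50/50 mix)
hosted_only = 500 * gpt5_rate * 365        # ~$1,140,625/year
hybrid = 300_000 + 50 * gpt5_rate * 365    # fleet + 10% premium traffic
print(f"Hosted-only: ${hosted_only:,.0f}/yr  "
      f"Hybrid: ${hybrid:,.0f}/yr  "
      f"Savings: ${hosted_only - hybrid:,.0f}/yr")
# ~$1.14M vs ~$414k -> ~$727k/year saved (~64%)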
Case Study 3: Healthcare (Self-Hosted Required)
- Volume: 20M tokens/day
- Constraint: HIPAA compliance, PHI cannot leave network
- Decision: Self-hosted Llama 4 Scout (on-premise)
- Cost: ~$210k/year
- Alternative: Hosted APIs ruled out by the organization's compliance posture, even with a BAA
- Outcome: Compliance achieved, acceptable costs
Conclusion
The hosted vs self-hosted decision depends on volume, privacy requirements, and team capabilities. Below the break-even (~93M tokens/day against GPT-5-class pricing, and roughly 1.5 billion tokens/day against budget models like GPT-4o-mini), hosted models (GPT-5, Claude, Gemini) offer better economics and faster iteration. Well past the break-even, self-hosted models (Llama 4) pull ahead, with savings reaching high six to seven figures annually at hundreds of millions of tokens per day. Privacy-sensitive industries (healthcare, finance, government) often require self-hosting regardless of volume. The optimal strategy for many enterprises is hybrid: self-host for high-volume simple queries, and use hosted APIs for complex reasoning and the latest capabilities.