You can reduce your LLM API costs by up to 100x by routing each request to the cheapest model that can handle it — because 80% of your AI calls don't need the most expensive model, and the price gap between premium and budget models is 60-250x.
Your AI Bill Is Probably 10x Higher Than It Should Be
Let's start with an uncomfortable truth: if you're sending all your AI API calls to a single premium model, you're overpaying by a massive margin.
Here's a typical scenario. A developer building an AI agent uses Claude Opus 4 for everything — Q&A, code formatting, translations, data extraction, and complex reasoning. Their monthly bill: $3,000.
After implementing smart routing, the same workload costs $300-500/month. Same quality output. Same user experience. Just smarter model selection.
This guide walks through every technique for reducing your LLM API costs, from quick wins to advanced strategies. We'll cover the 2026 pricing landscape, eight proven optimization strategies, and real-world examples with specific numbers.
Understanding the 2026 Token Economics
Before diving into optimization strategies, you need to understand the current pricing landscape. The spread between models is extraordinary:
| Model | Input Cost/1M | Output Cost/1M | Best For |
|-------|---------------|----------------|----------|
| Gemini 3 Flash | $0.075 | $0.30 | Simple Q&A, lookups |
| Mistral Small 3 | $0.10 | $0.30 | Light tasks, formatting |
| GPT-4o-mini | $0.15 | $0.60 | Translation, extraction |
| Llama 3.3 70B | $0.18 | $0.40 | Summarization, chat |
| Claude Haiku 3.5 | $0.25 | $1.25 | Code formatting, classification |
| DeepSeek V3 | $0.27 | $1.10 | General coding, analysis |
| DeepSeek R1 | $0.55 | $2.19 | Reasoning tasks |
| Gemini 3 Pro | $1.25 | $5.00 | Long context, codebase analysis |
| GPT-5.2 | $1.75 | $14.00 | Agentic workflows |
| Mistral Large | $2.00 | $6.00 | Multilingual, mid-tier tasks |
| GPT-4o | $2.50 | $10.00 | Multimodal, general purpose |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex coding, analysis |
| Claude Opus 4 | $15.00 | $75.00 | Architecture, deep reasoning |
Key insight: The output cost gap between Opus ($75/M) and Gemini Flash ($0.30/M) is 250x. Between Sonnet ($15/M) and Flash ($0.30/M), it's 50x. These aren't small differences — they're order-of-magnitude savings waiting to happen.
Another key insight: Output tokens are typically 2-5x more expensive than input tokens. This means controlling output length has an outsized impact on your bill.
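The math behind these comparisons is simple: per-request cost is (input tokens × input price + output tokens × output price) / 1M. A minimal helper, using prices from the table above:

```python
def request_cost(input_tokens, output_tokens, input_per_m, output_per_m):
    """Dollar cost of one request, given per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# Same 1K-in / 1K-out request priced on two models from the table:
opus_cost = request_cost(1_000, 1_000, 15.00, 75.00)   # Claude Opus 4
flash_cost = request_cost(1_000, 1_000, 0.075, 0.30)   # Gemini 3 Flash
```

For this request shape, Opus comes to $0.09 per call and Flash to $0.000375 — a blended gap of about 240x.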
Strategy 1: Use the Right Model for Each Task (Biggest Impact)
This is the single most impactful change you can make. Most applications have a predictable distribution of task types:
- 50-60% simple tasks — Q&A, lookups, formatting, classification → $0.30-1.25/M output
- 25-35% standard tasks — code generation, analysis, translation → $1.10-15/M output
- 10-15% complex tasks — architecture, deep reasoning, novel problem-solving → $14-75/M output
If you're sending everything to Claude Sonnet ($15/M output), you're paying 50x more than necessary for half your requests. If you're using Opus ($75/M), it's 250x more for simple tasks.
How to Implement This
Option A: Manual routing (build it yourself)
```python
def pick_model(task_type):
    routing = {
        "simple_qa": "gemini-3-flash",
        "translation": "gpt-4o-mini",
        "code_format": "claude-haiku-3.5",
        "complex_reasoning": "claude-opus-4",
        "coding": "deepseek-v3",
        "long_context": "gemini-3-pro",
    }
    return routing.get(task_type, "claude-sonnet-4")
```
This works but requires you to classify tasks yourself, maintain the routing logic, and update it as new models launch. You also need to handle failover, rate limits, and provider outages.
Option B: Use an LLM router (recommended)
An LLM router like ClawRouters handles this automatically:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://www.clawrouters.com/api/v1",
    api_key="cr_your_key_here"
)

# ClawRouters automatically picks the cheapest capable model
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "What's the capital of France?"}]
)
# → Routed to Gemini Flash ($0.30/M) instead of Opus ($75/M)
```
With ClawRouters, you set model="auto" and the router classifies each request in under 10ms, routing to the optimal model. The free BYOK plan means no additional cost for this intelligence — you only pay the provider's actual model price with zero markup. Compare this to OpenRouter's 5.5% fee, which adds cost rather than saving it.
Estimated Savings from Smart Routing
| Monthly API Spend (before) | With Smart Routing (after) | Savings |
|----------------------------|----------------------------|---------|
| $500 (all Sonnet) | $75-125 | 75-85% |
| $2,000 (all Sonnet) | $300-500 | 75-85% |
| $5,000 (mixed premium) | $750-1,250 | 75-85% |
| $10,000 (all Opus) | $800-1,500 | 85-92% |
Strategy 2: Optimize Your Prompts
Shorter prompts = fewer input tokens = lower costs. But don't sacrifice clarity — the goal is to be concise, not cryptic.
Remove Redundant Instructions
Before (wasteful — 47 tokens):
```
You are a helpful AI assistant. I would like you to please help me with the following task.
Can you please translate the following text from English to Spanish? Please make sure the
translation is accurate and natural-sounding. The text is: "Hello, how are you?"
```
After (efficient — 11 tokens):
```
Translate to Spanish: "Hello, how are you?"
```
Same result, 76% fewer input tokens. Over thousands of requests, this adds up fast.
Use System Messages Efficiently
Put reusable context in system messages. Several providers automatically cache repeated prompt prefixes, so a stable system message is often billed at a discounted cached rate rather than full price on every request.
```python
# Efficient: system message set once, reused across calls
system_msg = {"role": "system", "content": "You are a code reviewer. Be concise."}

# Each user message only contains the new content
response = client.chat.completions.create(
    model="auto",
    messages=[system_msg, {"role": "user", "content": code_snippet}]
)
```
Batch Similar Requests
Instead of 10 separate API calls to translate 10 sentences, send them in one request:
```
Translate these to Spanish (return as JSON array):
1. "Hello"
2. "Goodbye"
3. "Thank you"
...
```
One API call instead of ten. Fewer requests means lower per-call overhead and reduced total token usage from repeated instructions.
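Batching can be wrapped in a small helper. This is a sketch using the OpenAI-style client from earlier; the prompt wording and `build_batch_prompt` helper are illustrative, not a library API:

```python
import json

def build_batch_prompt(sentences, target="Spanish"):
    """One prompt covering every sentence, instead of one request each."""
    numbered = "\n".join(f'{i + 1}. "{s}"' for i, s in enumerate(sentences))
    return f"Translate these to {target}. Return a JSON array of strings:\n{numbered}"

def batch_translate(client, sentences, target="Spanish"):
    """Single API call for a whole batch; parse the JSON array it returns."""
    response = client.chat.completions.create(
        model="auto",
        messages=[{"role": "user", "content": build_batch_prompt(sentences, target)}],
    )
    return json.loads(response.choices[0].message.content)
```

The per-request instruction overhead ("Translate these to Spanish…") is now paid once per batch instead of once per sentence.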
Use Few-Shot Examples Sparingly
Few-shot examples improve output quality but add input tokens. Evaluate whether each example is actually improving results. Often, 1-2 examples work as well as 5-10 — at a fraction of the cost.
Strategy 3: Implement Caching
Many AI applications ask the same (or very similar) questions repeatedly. Caching can eliminate redundant API calls entirely.
Exact Match Caching
For identical requests, return cached responses:
```python
import hashlib
import json

cache = {}

def cached_completion(messages, model="auto"):
    key = hashlib.md5(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    if key in cache:
        return cache[key]  # Free! No API call needed
    response = client.chat.completions.create(model=model, messages=messages)
    cache[key] = response
    return response
```
Even a simple in-memory cache can eliminate 10-30% of API calls for most applications.
Semantic Caching
For similar (not identical) questions, use embedding similarity to find cached responses. If someone asks "What's the capital of France?" and you've already answered "Capital of France?", the response is the same. Bifrost includes built-in semantic caching, and several caching libraries support this pattern.
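A minimal sketch of the idea, using a toy bag-of-words embedding purely for illustration — in production you would swap in a real embedding model and a vector store:

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding'; replace with a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

semantic_cache = []  # list of (embedding, cached response) pairs

def semantic_lookup(query, threshold=0.8):
    """Return a cached response if a semantically similar query was answered."""
    q = embed(query)
    for vec, response in semantic_cache:
        if cosine(q, vec) >= threshold:
            return response
    return None

def semantic_store(query, response):
    semantic_cache.append((embed(query), response))
```

The threshold is the key tuning knob: too low and you serve wrong answers, too high and you miss paraphrases.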
Provider-Level Prompt Caching
Anthropic offers prompt caching that can reduce costs by up to 90% for long system prompts. If you're sending the same 10K-token system prompt with every request, enable prompt caching to pay for it only once. Google's Gemini models also support context caching for long prompts.
| Provider | Caching Feature | Savings |
|----------|-----------------|---------|
| Anthropic | Prompt caching | Up to 90% on cached portions |
| Google | Context caching | Variable, depends on reuse |
| DeepSeek | Input caching | $0.014/M on cached tokens (95% savings) |
| OpenAI | Automatic caching | 50% off cached input tokens |
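With Anthropic's Messages API, you opt in by tagging a system block with `cache_control`. A sketch of the request shape (the model name and long prompt are placeholders):

```python
LONG_SYSTEM_PROMPT = "You are a code reviewer. <imagine ~10K tokens of style guide here>"

def build_cached_request(user_content):
    """Request body for Anthropic's Messages API with prompt caching enabled.

    The system block marked with cache_control is billed at full price once,
    then at the discounted cached rate on later requests reusing the same prefix.
    """
    return {
        "model": "claude-sonnet-4",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": LONG_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": user_content}],
    }

# e.g. client.messages.create(**build_cached_request("Review this diff: ..."))
```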
Time-Based Cache Invalidation
Not all responses should be cached forever. Set TTLs based on content type:
- Factual lookups (capitals, definitions): Cache for days/weeks
- Code generation: Cache for hours (context-dependent)
- Real-time data (prices, weather): Don't cache or cache for minutes
- Personalized responses: Generally don't cache
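These rules can sit on top of the exact-match cache from earlier. A minimal sketch of TTL-based invalidation — the type names and TTL values below are illustrative:

```python
import time

TTL_BY_TYPE = {                  # seconds; mirrors the guidance above
    "factual": 7 * 24 * 3600,    # cache for days/weeks
    "codegen": 4 * 3600,         # cache for hours
    "realtime": 60,              # cache for minutes at most
    # "personalized" deliberately absent: don't cache
}

ttl_cache = {}  # key -> (expires_at, response)

def ttl_get(key):
    entry = ttl_cache.get(key)
    if entry and time.time() < entry[0]:
        return entry[1]
    ttl_cache.pop(key, None)     # drop expired entries lazily
    return None

def ttl_set(key, content_type, response):
    ttl = TTL_BY_TYPE.get(content_type)
    if ttl is None:              # unknown/personalized types are never cached
        return
    ttl_cache[key] = (time.time() + ttl, response)
```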
Strategy 4: Reduce Output Token Usage
Output tokens are typically 2-5x more expensive than input tokens. Controlling output length is one of the highest-leverage optimizations.
Set max_tokens
Always set a reasonable max_tokens limit:
```python
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Is Python dynamically typed? Answer yes or no."}],
    max_tokens=10  # Don't let the model write an essay
)
```
Without max_tokens, a model might generate 500 tokens for a question that needs 5. At Opus output pricing ($75/M), those extra 495 tokens cost $0.037 each time. Across 10,000 requests, that's $370 wasted.
Ask for Concise Responses
Add "Be concise" or "Answer in one sentence" to your prompts when you don't need detailed explanations. This simple instruction can reduce output by 50-80%.
Use Structured Output
Request JSON output for data extraction tasks. It's typically shorter and more predictable than prose:
```python
response = client.chat.completions.create(
    model="auto",
    messages=[{
        "role": "user",
        "content": 'Extract the name and email from this text: "Contact John at john@example.com". Return as JSON.'
    }],
    response_format={"type": "json_object"}
)
```
JSON output is typically 30-50% fewer tokens than equivalent natural language, and it's easier to parse programmatically.
Avoid Chain-of-Thought When Unnecessary
Chain-of-thought (CoT) prompting improves reasoning but massively increases output tokens. Use it only when the task genuinely requires multi-step reasoning. For classification, extraction, and simple Q&A, CoT is pure waste.
Strategy 5: Use Streaming Wisely
Streaming doesn't change the per-token cost, but it enables powerful optimizations:
- Stop early — If the first tokens indicate a bad response, abort and retry with a different prompt or model
- Time-box responses — Cut off responses that are taking too long (and costing too much)
- Progressive validation — Check output quality incrementally and stop generation if it goes off-track
```python
stream = client.chat.completions.create(
    model="auto",
    messages=[...],
    stream=True,
    max_tokens=500
)

output = ""
for chunk in stream:
    content = chunk.choices[0].delta.content or ""
    output += content
    # Stop early if the response is going in the wrong direction
    # (looks_useful() is your own quality heuristic, not a library function)
    if len(output) > 100 and not looks_useful(output):
        break  # Saved tokens by aborting early
```
Strategy 6: Multi-Provider Arbitrage
Different providers sometimes offer the same model at different prices, or comparable models at different rates. Take advantage of this:
Provider Price Comparison (Same or Similar Models)
| Model | Direct Provider | Via OpenRouter | Via ClawRouters (BYOK) |
|-------|-----------------|----------------|------------------------|
| Claude Sonnet 4 | $3/$15 | $3.17/$15.83 (+5.5%) | $3/$15 (0% markup) |
| GPT-4o | $2.50/$10 | $2.64/$10.55 (+5.5%) | $2.50/$10 (0% markup) |
| DeepSeek V3 | $0.27/$1.10 | $0.28/$1.16 (+5.5%) | $0.27/$1.10 (0% markup) |
With ClawRouters' BYOK plan, you pay exactly the provider price with no markup. With OpenRouter, you pay 5.5% on top. On $10,000/month of API calls, that's $550/month in avoidable fees.
Use BYOK to Keep Your Provider Discounts
If you've negotiated volume discounts with providers, make sure your router supports BYOK (Bring Your Own Key). ClawRouters and LiteLLM both support BYOK — OpenRouter does not. This means your enterprise pricing from Anthropic or OpenAI flows through without markup.
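If you route manually, arbitrage reduces to a price-table lookup. A sketch using output prices from the comparison table above (the route names and figures are illustrative):

```python
# Output price per million tokens ($) by route, from the table above
PRICES = {
    "claude-sonnet-4": {"direct": 15.00, "openrouter": 15.83, "clawrouters_byok": 15.00},
    "deepseek-v3":     {"direct": 1.10,  "openrouter": 1.16,  "clawrouters_byok": 1.10},
}

def cheapest_route(model):
    """Pick the lowest-cost route for a model across providers."""
    routes = PRICES[model]
    return min(routes, key=routes.get)
```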
Strategy 7: Monitor and Analyze
You can't optimize what you can't measure. Track these metrics:
- Cost per request type — Which tasks are most expensive?
- Model usage distribution — Are you over-using expensive models?
- Token counts by request — Are some prompts unnecessarily long?
- Quality vs. cost — Are cheaper models delivering acceptable results?
- Cache hit rates — Is your caching effective?
- Wasted tokens — How many output tokens are discarded or unused?
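If you want a rough in-house version of these metrics before adopting a dashboard, a per-task-type cost ledger is a few lines. The price table below is illustrative and should cover your real model list:

```python
from collections import defaultdict

# (input, output) price per million tokens ($); extend for your models
PRICE = {"gemini-3-flash": (0.075, 0.30), "claude-sonnet-4": (3.00, 15.00)}

stats = defaultdict(lambda: {"requests": 0, "cost": 0.0})

def record(task_type, model, input_tokens, output_tokens):
    """Accumulate spend per task type so you can see where the money goes."""
    in_p, out_p = PRICE[model]
    cost = (input_tokens * in_p + output_tokens * out_p) / 1_000_000
    stats[task_type]["requests"] += 1
    stats[task_type]["cost"] += cost
    return cost
```

Call `record(...)` after every completion (token counts come back in the API's usage field), then sort `stats` by cost to find your most expensive request types.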
ClawRouters includes a built-in analytics dashboard that tracks all of this automatically. For self-hosted solutions, tools like Helicone provide deep observability.
Set Up Cost Alerts
Configure alerts when:
- Daily spend exceeds 2x your average
- A single request costs more than $1 (usually means a runaway prompt)
- Cache hit rate drops below your target
- A particular user/feature is consuming disproportionate resources
Strategy 8: Negotiate Provider Pricing
At scale ($1,000+/month), most providers offer volume discounts:
- OpenAI — Tier-based discounts for committed spend, up to 25% off at high volumes
- Anthropic — Enterprise pricing for high-volume users, custom rate agreements
- Google — Free tier + committed use discounts for Gemini, context caching included
- DeepSeek — Already very cheap, but offers enterprise plans for dedicated capacity
Combine negotiated rates with smart routing for maximum savings. Even with a 20% OpenAI discount, routing simple tasks to Flash ($0.30/M) instead of GPT-4o ($8/M after discount) still saves 26x.
Putting It All Together: Three Real Examples
Example 1: AI Coding Agent
Before optimization:
- All requests → Claude Opus 4
- 1,000 requests/day, avg 3K output tokens
- Daily cost: 3M tokens × $75/M = $225/day = $6,750/month
After implementing all strategies:
1. Smart routing (Strategy 1): 80% of requests go to cheaper models
   - 800 simple requests → Gemini Flash/DeepSeek: avg $0.70/M = $1.68/day
   - 200 complex requests → Opus: 600K tokens × $75/M = $45/day
2. Prompt optimization (Strategy 2): 30% token reduction
   - Adjusted: $1.18 + $31.50 = $32.68/day
3. Caching (Strategy 3): 20% cache hit rate
   - Adjusted: $26.14/day
4. Output optimization (Strategy 4): 15% output reduction
   - Adjusted: ~$22/day
Final: ~$22/day = $660/month — 90% reduction from $6,750/month
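The chain of adjustments above can be replayed in a few lines (tokens in millions, prices per million):

```python
baseline = 3.0 * 75.00                   # all-Opus: $225/day

routed = 2.4 * 0.70 + 0.6 * 75.00        # 80% to cheap models, 20% stays on Opus
after_prompts = routed * 0.70            # 30% fewer tokens from tighter prompts
after_cache = after_prompts * 0.80       # 20% cache hit rate
after_output = after_cache * 0.85        # 15% shorter outputs

print(f"${after_output:.2f}/day")
```

This lands at $22.22/day, matching the ~$22/day figure above within rounding.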
Example 2: Customer Support Bot
Before optimization:
- All requests → Claude Sonnet 4
- 5,000 requests/day (customer inquiries), avg 1K output tokens
- Daily cost: 5M tokens × $15/M = $75/day = $2,250/month
After optimization:
1. Smart routing: 70% are simple FAQ answers → Flash/Mini
   - 3,500 simple → avg $0.45/M = $1.58/day
   - 1,500 complex → Sonnet: 1.5M × $15/M = $22.50/day
2. Caching: 40% of FAQ questions are repeated
   - Adjusted: ~$14.40/day
3. Output optimization: "Be concise" in system prompt, 40% shorter responses
   - Adjusted: ~$8.65/day
Final: ~$8.65/day = $260/month — 88% reduction from $2,250/month
Example 3: Content Generation SaaS
Before optimization:
- Mix of GPT-4o and Sonnet
- 50,000 content generations/month, avg 2K output tokens
- Monthly cost: 100M tokens × avg $12.50/M = $1,250/month
After optimization:
1. Smart routing: Blog posts → Sonnet, social media → DeepSeek, titles/hashtags → Flash
   - Blog (20%): 20M × $15/M = $300
   - Social (50%): 50M × $1.10/M = $55
   - Simple (30%): 30M × $0.30/M = $9
   - Total: $364/month
2. Caching: Template-based content, 15% cache hit rate
   - Adjusted: $309/month
3. Prompt optimization: Tighter templates, 20% fewer input tokens
   - Adjusted: ~$280/month
Final: ~$280/month — 78% reduction from $1,250/month
The Fastest Path to Savings
If you want to implement the highest-impact strategy immediately:
- Sign up for ClawRouters (free BYOK plan)
- Add your provider API keys
- Change your base URL to `https://www.clawrouters.com/api/v1`
- Set `model="auto"` for smart routing
That's it. You'll see 60-90% cost reduction from day one, with zero code changes beyond the base URL.
For the full setup walkthrough, see our Setup Guide. To compare ClawRouters with other options, check our OpenRouter vs ClawRouters vs LiteLLM comparison. And for a broader view of the AI router landscape, see our guide to the best LLM routers in 2026.