Figuring out how to do LLM integration is the first real challenge every AI-powered product faces. You have 50+ models across OpenAI, Anthropic, Google, Meta, and others — each with different pricing, strengths, and API quirks. Getting it wrong means either overpaying by 10–100x or shipping a product that hallucinates on tasks a better model would handle perfectly.
TL;DR: To do LLM integration well, you need three things: (1) choose the right model for each task type, not one model for everything, (2) use an OpenAI-compatible gateway so you can swap models without rewriting code, and (3) automate model selection with a router like ClawRouters to cut costs 60–90% while maintaining output quality. This guide walks through each step with code examples and real cost numbers.
Why "How to Do LLM" Is the Wrong Question (and What to Ask Instead)
Most developers start by asking "how to do LLM" — meaning how to call an LLM API, get a response, and plug it into their app. That part is straightforward: send a prompt, get a completion. The real question is how to do LLM integration at scale without burning through your budget.
Here is why this matters:
| Model | Input Price (per 1M tokens) | Output Price (per 1M tokens) | Best For |
|-------|----------------------------:|-----------------------------:|----------|
| GPT-4.1 | $2.00 | $8.00 | General reasoning |
| Claude Opus 4 | $15.00 | $75.00 | Complex analysis |
| Claude Sonnet 4 | $3.00 | $15.00 | Balanced quality/cost |
| Gemini 2.5 Flash | $0.15 | $0.60 | Simple tasks, high volume |
| DeepSeek V3 | $0.27 | $1.10 | Budget coding |
Sending every request to Claude Opus 4 when 70% of your queries are simple Q&A tasks is like taking a helicopter to the grocery store. It works, but you are paying over 100x more than necessary for those trips.
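To put numbers on that analogy, here is a back-of-envelope calculation using the table's prices, assuming an illustrative workload of one million requests averaging 500 input and 200 output tokens:

```python
# Rough monthly cost comparison using the per-1M-token prices above.
# The request volume and token averages are illustrative assumptions.
def monthly_cost(input_price, output_price, requests=1_000_000,
                 in_tokens=500, out_tokens=200):
    """Return total dollar cost for a month of requests."""
    total_in = requests * in_tokens / 1_000_000    # input tokens, in millions
    total_out = requests * out_tokens / 1_000_000  # output tokens, in millions
    return total_in * input_price + total_out * output_price

opus = monthly_cost(15.00, 75.00)   # Claude Opus 4
flash = monthly_cost(0.15, 0.60)    # Gemini 2.5 Flash

print(f"Opus:  ${opus:,.0f}")           # $22,500
print(f"Flash: ${flash:,.0f}")          # $195
print(f"Ratio: {opus / flash:.0f}x")    # 115x
```

For this workload, the gap between the cheapest and most expensive model is over two orders of magnitude.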
The Three Pillars of Production LLM Usage
Getting LLM integration right comes down to three decisions:
- Model selection — Which model handles which task type?
- API architecture — How do you structure calls so switching models is painless?
- Cost optimization — How do you automate the selection process at scale?
The rest of this guide breaks down each pillar with practical steps.
Step 1: Make Your First LLM API Call
If you have never called an LLM API before, here is the simplest starting point. Most providers follow the OpenAI chat completions format:
```python
import openai

client = openai.OpenAI(
    api_key="your-api-key",
    base_url="https://api.clawrouters.com/v1"  # swap to route through ClawRouters
)

response = client.chat.completions.create(
    model="auto",  # let the router pick the best model
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain recursion in three sentences."}
    ]
)

print(response.choices[0].message.content)
```
Notice the `base_url` — by pointing to ClawRouters instead of OpenAI directly, you get automatic routing across 50+ models with zero code changes. If you already have OpenAI SDK calls in your codebase, the migration is a one-line change.
Choosing Your First Model
For getting started, here is a decision tree:
- Building a prototype? → Use autorouting through ClawRouters and let the system pick
- Need maximum quality? → Claude Opus 4 or GPT-4.1 for reasoning, Claude Sonnet 4 for balanced tasks
- Optimizing for cost? → Gemini 2.5 Flash or DeepSeek V3 for simple tasks
- Running an AI agent? → Use a router — agents make 50–200 calls per task with wildly varying complexity
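If you do route by hand, the decision tree above boils down to a small lookup. This is an illustrative sketch — the `TASK_TO_MODEL` mapping and the `pick_model` helper are hypothetical names, not a ClawRouters API; the model IDs mirror the ones used elsewhere in this guide:

```python
# Manual model selection, assuming the task type is known up front.
TASK_TO_MODEL = {
    "simple": "gemini-2.5-flash",            # summarization, extraction, formatting
    "coding": "deepseek-v3",                 # budget coding tasks
    "balanced": "claude-sonnet-4-20250514",  # quality/cost middle ground
    "complex": "claude-opus-4",              # multi-step reasoning
}

def pick_model(task_type: str) -> str:
    # Fall back to autorouting when the task type is unknown.
    return TASK_TO_MODEL.get(task_type, "auto")

print(pick_model("simple"))   # gemini-2.5-flash
print(pick_model("unknown"))  # auto
```

The weakness of this approach is exactly what Step 3 covers: the mapping goes stale as models change, which is why a router is usually the better default.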
Check the full model comparison and pricing calculator to see real-time pricing across all supported providers.
Step 2: Structure Your LLM Integration for Production
A production LLM integration needs more than a single API call. Here are the patterns that teams shipping real products use.
Use an OpenAI-Compatible Gateway
The biggest mistake teams make is hardcoding a specific provider's SDK. When you need to switch models (and you will), you end up rewriting integration code.
Instead, use the OpenAI SDK format as your standard interface. ClawRouters exposes an OpenAI-compatible API endpoint, so every model — Claude, Gemini, Mistral, DeepSeek, Qwen — is accessible through the same SDK:
# Switch between ANY provider by changing the model string
# No SDK changes, no code rewrites
# Route to Claude
response = client.chat.completions.create(model="claude-sonnet-4-20250514", ...)
# Route to Gemini
response = client.chat.completions.create(model="gemini-2.5-flash", ...)
# Let ClawRouters auto-select the best model
response = client.chat.completions.create(model="auto", ...)
Implement Error Handling and Fallbacks
LLM providers have outages. Rate limits hit. A production integration needs fallback chains:
```python
# ClawRouters handles this automatically, but if building manually:
fallback_chain = ["claude-sonnet-4-20250514", "gpt-4.1", "gemini-2.5-pro"]

response = None
for model in fallback_chain:
    try:
        response = client.chat.completions.create(model=model, messages=messages)
        break
    except Exception:
        continue  # try the next model in the chain

if response is None:
    raise RuntimeError("All models in the fallback chain failed")
```
With ClawRouters, fallback chains are built in — if a provider returns an error, the request automatically retries on the next best model with no client-side logic needed.
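Transient errors like rate limits deserve a retry before falling through to the next model. A minimal sketch of exponential backoff with jitter — `with_backoff` is a hypothetical helper and the timing constants are illustrative:

```python
import random
import time

def with_backoff(call, max_retries=4, base_delay=1.0):
    """Retry `call` on failure, sleeping 1s, 2s, 4s, ... plus jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.5))
```

In practice you would wrap each step of the fallback chain in this helper, retrying a rate-limited model briefly before moving on to the next one.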
Set Up Observability
You cannot optimize what you cannot measure. Track these metrics from day one:
- Cost per request — Which endpoints are expensive?
- Latency (TTFB and total) — Where are users waiting?
- Token usage — Are prompts bloated?
- Error rate by provider — Which models are unreliable?
The ClawRouters dashboard provides all of these out of the box, including per-model breakdowns and daily cost trends.
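If you are instrumenting this yourself, the four metrics above need only a thin per-request recorder. A minimal in-memory sketch — `LLMMetrics` is a hypothetical class, and a real setup would export these to a metrics backend rather than aggregate in process:

```python
from collections import defaultdict

class LLMMetrics:
    """Aggregate cost, latency, tokens, and errors per model."""

    def __init__(self):
        self.requests = defaultdict(list)  # model -> list of request records

    def record(self, model, cost, latency_s, tokens, error=False):
        self.requests[model].append(
            {"cost": cost, "latency": latency_s, "tokens": tokens, "error": error}
        )

    def summary(self, model):
        rows = self.requests[model]
        n = len(rows)
        return {
            "avg_cost": sum(r["cost"] for r in rows) / n,
            "avg_latency": sum(r["latency"] for r in rows) / n,
            "total_tokens": sum(r["tokens"] for r in rows),
            "error_rate": sum(r["error"] for r in rows) / n,
        }

m = LLMMetrics()
m.record("gemini-2.5-flash", cost=0.0002, latency_s=0.4, tokens=800)
m.record("gemini-2.5-flash", cost=0.0003, latency_s=0.6, tokens=1200, error=True)
print(m.summary("gemini-2.5-flash")["error_rate"])  # 0.5
```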
Step 3: Optimize Costs With Smart Model Routing
This is where most teams leave 60–90% of their LLM budget on the table. The insight is simple: not every request needs the most expensive model.
According to internal data across ClawRouters users, the typical request distribution looks like this:
- ~40% simple tasks (summarization, extraction, formatting) → Gemini Flash handles these at $0.15/M input tokens
- ~35% medium tasks (coding, translation, analysis) → Claude Sonnet or GPT-4.1 at $2–3/M input tokens
- ~25% complex tasks (multi-step reasoning, research, creative writing) → Claude Opus or GPT-4.1 at $8–15/M input tokens
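Plugging that distribution into a blended rate shows where the savings come from. This sketch takes the midpoint of each price range as an assumption and compares against sending everything to Opus:

```python
# Blended input cost for the task mix above (range midpoints assumed).
mix = [
    (0.40, 0.15),   # simple  -> Gemini Flash at $0.15/M
    (0.35, 2.50),   # medium  -> midpoint of $2-3/M
    (0.25, 11.50),  # complex -> midpoint of $8-15/M
]
blended = sum(share * price for share, price in mix)
all_opus = 15.00  # Claude Opus 4 input pricing

print(f"Blended: ${blended:.2f}/M input tokens")             # $3.81
print(f"Savings vs all-Opus: {1 - blended / all_opus:.0%}")  # 75%
```

Routing alone lands in the middle of the 60–90% range; caching and batching (Step 4) push it higher.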
Manual Routing vs. Automatic Routing
You can route manually by classifying tasks in your application code, but this approach has problems:
- You need to maintain routing logic as models change
- New models require code deploys to integrate
- Edge cases (a "simple" prompt that actually needs deep reasoning) cause quality drops
Automatic routing solves this. ClawRouters classifies each request in under 10ms using a two-layer system: rule-based pattern matching for obvious cases, and a lightweight AI classifier for ambiguous ones. The result is optimal model selection on every call with no manual intervention.
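To make the two-layer idea concrete, here is a toy version: cheap pattern rules catch the obvious cases, and anything ambiguous falls through to a second layer. The patterns, labels, and the fallthrough default are all illustrative — this is not ClawRouters' internal classifier:

```python
import re

SIMPLE_PATTERNS = [
    r"^summari[sz]e\b", r"^extract\b", r"^translate\b", r"^reformat\b",
]
COMPLEX_PATTERNS = [
    r"\bstep[- ]by[- ]step\b", r"\bprove\b", r"\broot.cause\b",
]

def classify(prompt: str) -> str:
    """Layer one: rule-based matching for obvious cases."""
    text = prompt.lower()
    if any(re.search(p, text) for p in SIMPLE_PATTERNS):
        return "simple"
    if any(re.search(p, text) for p in COMPLEX_PATTERNS):
        return "complex"
    # Layer two would invoke a lightweight AI classifier here;
    # this sketch just defaults ambiguous prompts to "medium".
    return "medium"

print(classify("Summarize this meeting transcript"))  # simple
print(classify("Prove this algorithm terminates"))    # complex
```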
Real Cost Example: AI Coding Agent
Consider a coding agent that makes 100 API calls per task:
| Approach | Model Used | Cost per Task | Monthly (500 tasks) |
|----------|-----------|--------------:|--------------------:|
| All Opus | Claude Opus 4 | $4.50 | $2,250 |
| All Sonnet | Claude Sonnet 4 | $0.90 | $450 |
| Manual routing | Mixed | $0.60 | $300 |
| ClawRouters auto | Optimized mix | $0.35 | $175 |
That is a 92% cost reduction compared to using Opus for everything, with negligible quality difference because simple sub-tasks (file reads, formatting, boilerplate) never needed Opus in the first place.
Step 4: Scale Your LLM Integration
Once you have the basics working, here are the patterns for scaling.
Semantic Caching
Many LLM calls are near-duplicates. A user asking "what is a REST API?" and "explain REST APIs" should hit the same cached response. Semantic caching can reduce your total LLM calls by 30–50% on workloads with repetitive queries.
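The mechanics of a semantic cache are a lookup by similarity rather than by exact key. In this self-contained sketch, a bag-of-words cosine similarity stands in for the embedding model a production system would use, and `SemanticCache` and its threshold are illustrative:

```python
import math
from collections import Counter

def vectorize(text):
    return Counter(text.lower().split())

def similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold=0.6):
        self.entries = []  # list of (vector, cached response)
        self.threshold = threshold

    def get(self, prompt):
        vec = vectorize(prompt)
        for cached_vec, response in self.entries:
            if similarity(vec, cached_vec) >= self.threshold:
                return response  # close enough: cache hit
        return None  # cache miss: caller makes a real LLM call

    def put(self, prompt, response):
        self.entries.append((vectorize(prompt), response))

cache = SemanticCache()
cache.put("what is a rest api", "A REST API is ...")
hit = cache.get("what is a rest api?")  # near-duplicate -> cache hit
```

The threshold is the key tuning knob: too low and unrelated prompts share answers; too high and you lose the 30–50% hit rate.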
Streaming Responses
For user-facing applications, always use streaming to reduce perceived latency:
```python
stream = client.chat.completions.create(
    model="auto",
    messages=messages,
    stream=True
)

for chunk in stream:
    # delta.content is None on some chunks (e.g. the final one)
    print(chunk.choices[0].delta.content or "", end="")
```
ClawRouters supports streaming across all providers, normalizing the different streaming formats into a single consistent interface.
Batch Processing
For offline workloads (data enrichment, document processing), batch your requests and use the cheapest available models. A task that does not need real-time responses should not pay real-time prices.
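The batching itself is simple chunking — group prompts into fixed-size batches so each batch can be submitted as one bulk job against a cheap model. The `batch` helper below is an illustrative sketch:

```python
def batch(items, size):
    """Yield successive fixed-size chunks from a list of work items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

documents = [f"doc-{n}" for n in range(10)]
batches = list(batch(documents, size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```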
Step 5: Common Mistakes to Avoid
Based on patterns seen across thousands of ClawRouters integrations:
Over-Engineering Prompts
Long, complex system prompts increase token usage without proportional quality gains. Keep system prompts under 500 tokens for most use cases. Use few-shot examples only when zero-shot performance is measurably worse.
Ignoring Token Limits
Each model has different context windows (8K to 1M+ tokens). Sending 100K tokens to a model with an 8K context window fails silently in some providers. Always check model limits — the ClawRouters models page lists context windows for all supported models.
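A cheap guard is to estimate the request size before sending and trim history to fit. This sketch uses the common ~4-characters-per-token approximation as an assumption; real code should use the provider's tokenizer for an exact count, and `trim_messages` is a hypothetical helper:

```python
def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token."""
    return max(1, len(text) // 4)

def trim_messages(messages, max_tokens):
    """Drop the oldest non-system messages until the estimate fits."""
    kept = list(messages)
    while len(kept) > 1 and sum(approx_tokens(m["content"]) for m in kept) > max_tokens:
        kept.pop(1)  # keep the system prompt (index 0); evict the oldest turn
    return kept
```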
Not Monitoring Costs
LLM costs can spike overnight when a new feature increases call volume. Set up budget alerts and review your dashboard daily during the first month of any new integration.
Getting Started in 5 Minutes
The fastest path from zero to production-ready LLM integration:
- Sign up at ClawRouters — the free tier includes unlimited routing with your own API keys
- Set your base URL to `https://api.clawrouters.com/v1` in your OpenAI SDK config
- Use `model: "auto"` to let the router handle model selection
- Monitor costs on your dashboard and adjust routing strategy as needed
Check the full setup guide for framework-specific instructions (Python, Node.js, cURL, and more).
For detailed pricing plans, ClawRouters offers a free BYOK tier, a Basic plan at $29/month with 20M tokens included, and a Pro plan at $99/month with 100M tokens and access to all premium models.