
How to Build an LLM Router: Architecture, Code, and Lessons Learned

2026-03-22·12 min read·ClawRouters Team
how to build an llm router · llm router architecture · ai model routing · llm routing tutorial

TL;DR — An LLM router classifies each incoming prompt by task type and complexity, then routes it to the cheapest model that can deliver acceptable quality. Building one yourself requires a task classifier, a model registry with cost/capability scores, a routing algorithm, fallback logic, and provider adapters. Most teams spend 2-4 engineering months getting this right. This guide walks through the full architecture so you can decide whether to build or use a managed router like ClawRouters.


Why Build an LLM Router?

If you're sending every API call to the same model, you're overpaying by 60-250x on the majority of requests. Research from Stanford's HELM benchmark and real-world production data consistently show that approximately 80% of typical AI agent calls don't require a premium model. A simple lookup, a JSON reformatting task, or a translation can be handled by a $0.30/M-token model just as well as a $75/M-token one.

An LLM router fixes this by inserting an intelligent decision layer between your application and the model providers. Instead of one-size-fits-all, every request gets matched to the right model for the job.

When Building Your Own Makes Sense

Building a custom LLM router is worth considering if:

  - You have dedicated engineers who can own classification, adapters, and failover long-term
  - Your routing depends on proprietary signals or constraints a generic router can't express
  - Compliance or data-residency rules prevent you from adding a third-party hop

For everyone else — especially teams that want to ship fast and focus on their actual product — a managed solution like ClawRouters handles all of this out of the box. See our comparison of LLM routing platforms for a full breakdown.

Core Architecture of an LLM Router

Every LLM router, whether homegrown or managed, follows the same fundamental pipeline:

Incoming Request
  → Task Classification (what kind of task is this?)
  → Complexity Scoring (how hard is it?)
  → Model Selection (which model fits best?)
  → Provider Routing (call the chosen provider's API)
  → Fallback Handling (retry on failure)
  → Response Normalization (unified output format)

Let's break each stage down.

The Request Pipeline

The pipeline must be fast. Every millisecond you add to routing is latency your users feel on top of the model's own response time. Production routers like ClawRouters achieve sub-10ms classification — your target should be under 50ms for the entire routing decision.

Step 1: Build a Task Classifier

The classifier is the brain of your router. It takes a raw prompt and outputs a task type (e.g., "code_generation", "simple_qa", "translation") and a complexity score.

L1: Rule-Based Classification

Start with a fast, synchronous classifier based on pattern matching:

import re

TASK_PATTERNS = {
    "code_generation": [
        r"write\s+(a\s+)?function", r"implement\s+", r"```",
        r"def\s+\w+", r"class\s+\w+", r"fix\s+this\s+(code|bug)"
    ],
    "translation": [
        r"translate\s+(this\s+)?(to|into)\s+", r"in\s+(spanish|french|chinese|japanese)"
    ],
    "simple_qa": [
        r"^(what|who|when|where|how\s+many)\s+(is|are|was|were)\b",
        r"define\s+", r"explain\s+briefly"
    ],
    "summarization": [
        r"summarize\s+", r"tldr", r"key\s+points\s+of"
    ],
    "complex_reasoning": [
        r"compare\s+and\s+contrast", r"analyze\s+",
        r"design\s+(a\s+)?(system|architecture)", r"trade-?offs?\s+between"
    ],
}

def classify_l1(prompt: str) -> tuple[str, float]:
    prompt_lower = prompt.lower()
    scores = {}
    for task_type, patterns in TASK_PATTERNS.items():
        matches = sum(1 for p in patterns if re.search(p, prompt_lower))
        if matches > 0:
            scores[task_type] = matches / len(patterns)

    if not scores:
        return ("general", 0.3)  # low confidence fallback

    best = max(scores, key=scores.get)
    return (best, min(scores[best] * 2, 1.0))

This runs in under 1ms and catches the obvious cases. According to our internal benchmarks at ClawRouters, a well-tuned L1 classifier correctly identifies the task type for ~70% of requests.

L2: AI-Powered Classification

When L1 confidence is below a threshold (we use 0.7), escalate to a lightweight AI model:

async def classify_l2(prompt: str) -> tuple[str, float]:
    # call_model and parse_json are your own thin wrappers around the
    # provider SDK and a lenient JSON parser (not shown here)
    response = await call_model(
        model="claude-haiku",  # fast and cheap
        messages=[{
            "role": "user",
            "content": f"""Classify this prompt into exactly one category:
code_generation, translation, simple_qa, summarization,
complex_reasoning, data_extraction, creative_writing, general

Also rate complexity 1-5. Reply as JSON:
{{"task": "...", "complexity": N}}

Prompt: {prompt[:500]}"""
        }]
    )
    result = parse_json(response)
    return (result["task"], result["complexity"] / 5)

The two-tier approach keeps costs down — you only call the L2 classifier for ambiguous requests, which is roughly 30% of traffic. At Haiku pricing ($1.25/M output tokens), the classification cost per request is negligible.
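To sanity-check that claim, the blended classification cost works out to a few millionths of a dollar per request. A back-of-envelope sketch (the token counts and Claude 3 Haiku prices here are our assumptions, not measurements):

```python
def classification_cost_per_request(
    escalation_rate: float = 0.30,  # share of traffic that escalates to L2
    input_tokens: int = 125,        # ~500-char truncated prompt (assumed)
    output_tokens: int = 20,        # small JSON reply (assumed)
    input_price: float = 0.25,      # $/M input tokens, Claude 3 Haiku (assumed)
    output_price: float = 1.25,     # $/M output tokens, Claude 3 Haiku
) -> float:
    """Average classification cost per request, spread across all traffic."""
    per_l2_call = (input_tokens * input_price + output_tokens * output_price) / 1e6
    return escalation_rate * per_l2_call

print(classification_cost_per_request())  # roughly $0.000017 per request
```

At that rate, classification adds well under a cent even at millions of requests per month.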

Step 2: Design a Model Registry

Your model registry is the single source of truth for all available models, their capabilities, and their costs.

Registry Schema

MODEL_REGISTRY = {
    "claude-opus": {
        "provider": "anthropic",
        "input_cost": 15.00,   # per 1M tokens
        "output_cost": 75.00,
        "capability_scores": {
            "code_generation": 5,
            "complex_reasoning": 5,
            "creative_writing": 5,
            "simple_qa": 3,      # overkill for simple tasks
        },
        "max_tokens": 200000,
        "complexity_range": (0.6, 1.0),  # only use for hard tasks
    },
    "gemini-flash": {
        "provider": "google",
        "input_cost": 0.075,
        "output_cost": 0.30,
        "capability_scores": {
            "simple_qa": 4,
            "data_extraction": 4,
            "translation": 3,
            "complex_reasoning": 2,
        },
        "max_tokens": 1000000,
        "complexity_range": (0.0, 0.5),
    },
    # ... more models
}
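With prices stored per million tokens, estimating the dollar cost of a single request is a one-liner. A sketch using a trimmed copy of the registry above (the `estimate_cost` helper is our own name, not a provider SDK call):

```python
# Trimmed copy of the registry, keeping only the cost fields
MODEL_REGISTRY = {
    "claude-opus": {"input_cost": 15.00, "output_cost": 75.00},
    "gemini-flash": {"input_cost": 0.075, "output_cost": 0.30},
}

def estimate_cost(model_name: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated dollars for one request, from per-1M-token prices."""
    m = MODEL_REGISTRY[model_name]
    return (input_tokens * m["input_cost"] + output_tokens * m["output_cost"]) / 1e6

# A typical 1,000-in / 500-out request:
opus = estimate_cost("claude-opus", 1000, 500)    # $0.0525
flash = estimate_cost("gemini-flash", 1000, 500)  # $0.000225
print(f"Opus costs {opus / flash:.0f}x more on this request shape")
```

On this request shape the gap is roughly 230x, which is where spreads like the "60-250x" figure in the intro come from.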

Key Design Decisions

  - Keep every price in the same unit (dollars per 1M tokens) so routing math never mixes scales
  - Score capability per task type rather than globally: claude-opus is a 5 on reasoning but deliberately a 3 on simple Q&A, so the router avoids overkill
  - Treat complexity_range as a hard gate: a model only becomes a candidate when the request's complexity score falls inside its range

Step 3: Implement the Routing Algorithm

This is where classification meets model selection. The routing algorithm takes the classifier output and the model registry, then picks the best model.

Three Routing Strategies

def select_model(task: str, complexity: float, strategy: str = "balanced"):
    candidates = []
    for name, model in MODEL_REGISTRY.items():
        score = model["capability_scores"].get(task, 1)
        low, high = model["complexity_range"]
        if low <= complexity <= high and score >= 2:
            candidates.append((name, model, score))

    if not candidates:
        return "claude-sonnet"  # safe fallback

    if strategy == "cheapest":
        candidates.sort(key=lambda c: c[1]["output_cost"])
    elif strategy == "best":
        candidates.sort(key=lambda c: (-c[2], c[1]["output_cost"]))
    elif strategy == "balanced":
        candidates.sort(
            key=lambda c: -c[2] / (c[1]["output_cost"] + 0.01)
        )

    return candidates[0][0]

The "balanced" strategy optimizes for the quality-to-cost ratio — this is what most production systems should default to. In our experience at ClawRouters, balanced routing delivers 85-90% of the quality of always using the best model at 10-20% of the cost.
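To see why "balanced" almost always lands on the cheaper model when capabilities are close, compare capability-per-dollar directly (toy numbers echoing the registry above):

```python
REGISTRY = {
    "flash": {"output_cost": 0.30, "score": 4},   # budget model
    "opus":  {"output_cost": 75.00, "score": 5},  # premium model
}

def balanced_pick(registry: dict) -> str:
    # Highest capability score per dollar of output cost; the +0.01
    # guard keeps near-free models from dividing by ~zero
    return max(
        registry,
        key=lambda n: registry[n]["score"] / (registry[n]["output_cost"] + 0.01),
    )

print(balanced_pick(REGISTRY))  # "flash": 4/0.31 ≈ 12.9 beats 5/75.01 ≈ 0.07
```

One extra capability point rarely justifies a 250x price multiple, so the ratio only tips toward the premium model when the cheap model's score collapses.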

Quality Thresholds

Don't just pick the cheapest model that can technically handle the task. Set minimum capability scores:

QUALITY_THRESHOLDS = {
    "standard": {"min_score": 2, "prefer_score": 3},
    "enhanced": {"min_score": 3, "prefer_score": 4},  # for premium tiers
}

This prevents your router from sending complex architecture questions to a model that scored 2/5 on reasoning just because it's cheaper.
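Wiring the thresholds into candidate filtering is a small change; a sketch, with `meets_threshold` as our own helper name:

```python
QUALITY_THRESHOLDS = {
    "standard": {"min_score": 2, "prefer_score": 3},
    "enhanced": {"min_score": 3, "prefer_score": 4},
}

def meets_threshold(capability_scores: dict, task: str, tier: str = "standard") -> bool:
    """Gate a model out of the candidate pool if it scores below the tier's floor."""
    # Unknown tasks default to a score of 1, matching select_model above
    return capability_scores.get(task, 1) >= QUALITY_THRESHOLDS[tier]["min_score"]

# A model scoring 2/5 on reasoning passes "standard" but not "enhanced"
print(meets_threshold({"complex_reasoning": 2}, "complex_reasoning"))              # True
print(meets_threshold({"complex_reasoning": 2}, "complex_reasoning", "enhanced"))  # False
```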

Step 4: Build Fallback Chains and Provider Adapters

Real-world LLM APIs fail. Rate limits, outages, timeouts — you need graceful degradation.

Fallback Chain Logic

async def execute_with_fallback(request, primary_model):
    chain = build_fallback_chain(primary_model)  # [primary, fallback1, fallback2]

    for i, model in enumerate(chain):
        try:
            return await call_provider(model, request)
        except RetryableError as e:  # 429, 500-504, timeout
            log.warning(f"{model} failed ({e})")
            if i == len(chain) - 1:
                raise AllProvidersFailed("Exhausted fallback chain") from e
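`build_fallback_chain` is left abstract above. One simple, static policy is a hand-maintained map from each primary model to its stand-ins, ending in a safe default (the map entries here are illustrative, not a recommendation):

```python
# Illustrative fallbacks; in practice, pair each model with peers
# of similar capability on different providers
FALLBACK_MAP = {
    "claude-opus": ["claude-sonnet", "gemini-pro"],
    "gemini-flash": ["claude-haiku"],
}

def build_fallback_chain(primary: str, default: str = "claude-sonnet") -> list[str]:
    """Primary first, then its configured stand-ins, then a safe default."""
    chain = [primary] + FALLBACK_MAP.get(primary, [])
    if default not in chain:
        chain.append(default)
    return chain

print(build_fallback_chain("claude-opus"))   # ['claude-opus', 'claude-sonnet', 'gemini-pro']
print(build_fallback_chain("gemini-flash"))  # ['gemini-flash', 'claude-haiku', 'claude-sonnet']
```

A static map is easy to reason about; more sophisticated routers rebuild the chain dynamically from the registry, picking the next-best candidate for the same task.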

Provider Adapter Pattern

Each provider has a different API format. Abstract this behind a common interface:

class ProviderAdapter:
    async def chat(self, request) -> Response: ...
    async def chat_stream(self, request) -> AsyncIterator[Chunk]: ...

class AnthropicAdapter(ProviderAdapter):
    async def chat(self, request):
        # Convert OpenAI-format messages to Anthropic format
        # Handle system messages, tool use, etc.
        ...

class GoogleAdapter(ProviderAdapter):
    async def chat(self, request):
        # Convert to Gemini API format
        ...

Building and maintaining provider adapters is one of the most time-consuming parts of building an LLM router. Each provider has quirks — Anthropic handles system messages differently, Google's streaming format is unique, and Chinese providers like DeepSeek and Moonshot each have their own OpenAI-compatible-but-not-quite APIs. ClawRouters currently supports 50+ models across 8 providers, and keeping adapters up to date is a continuous effort.
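The system-message mismatch is the canonical example: OpenAI-style requests carry system prompts inside the messages array, while Anthropic's Messages API expects a top-level system parameter. A minimal conversion sketch (`to_anthropic_payload` is our own helper name):

```python
def to_anthropic_payload(messages: list[dict]) -> dict:
    """Split OpenAI-style messages into Anthropic's top-level `system` + chat turns."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    payload = {"messages": [m for m in messages if m["role"] != "system"]}
    if system_parts:
        # Anthropic takes one system string, so merge multiple system messages
        payload["system"] = "\n\n".join(system_parts)
    return payload

payload = to_anthropic_payload([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "What is an LLM router?"},
])
# payload["system"] == "Be concise."; payload["messages"] holds only the user turn
```

A real adapter also has to translate tool definitions, streaming chunk formats, and stop-reason vocabularies, which is where most of the maintenance time goes.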

Step 5: Add Observability and Analytics

A router without observability is flying blind. At minimum, log for every request:

  - Which model handled it, plus the classified task type and complexity
  - The estimated cost, so spend can be broken down by model and task
  - Latency, so routing overhead and slow providers stay visible

async def log_request(request_id, model, task, complexity, cost, latency):
    await db.insert("usage_logs", {
        "request_id": request_id,
        "model": model,
        "task_type": task,
        "complexity": complexity,
        "estimated_cost": cost,
        "latency_ms": latency,
        "timestamp": datetime.utcnow(),
    })

Without analytics, you can't iterate on your routing rules. ClawRouters provides a built-in dashboard with real-time cost tracking, model distribution charts, and per-key usage breakdowns — saving you from building yet another internal tool.
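Once logs accumulate, the first question is always where the money goes. In production you'd run the equivalent SQL GROUP BY against usage_logs; a sketch over in-memory rows:

```python
from collections import defaultdict

def cost_by_model(logs: list[dict]) -> dict[str, float]:
    """Total estimated spend per model, from usage_logs-shaped rows."""
    totals: dict[str, float] = defaultdict(float)
    for row in logs:
        totals[row["model"]] += row["estimated_cost"]
    return dict(totals)

logs = [
    {"model": "gemini-flash", "estimated_cost": 0.0002},
    {"model": "gemini-flash", "estimated_cost": 0.0003},
    {"model": "claude-opus", "estimated_cost": 0.0525},
]
print(cost_by_model(logs))
```

The same pattern applied to task_type shows whether your classifier is routing the traffic mix you expect.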

Build vs. Buy: The Real Cost

Here's the honest breakdown from teams who've gone both routes:

| Factor | Build Your Own | Use ClawRouters |
|--------|---------------|-----------------|
| Time to production | 2-4 months | 60 seconds |
| Engineering cost | $50K-150K+ | Free (BYOK plan) |
| Ongoing maintenance | 10-20 hrs/month | Zero |
| Model coverage | Limited by your adapters | 50+ models, auto-updated |
| Classification accuracy | Depends on your ML team | Production-tuned, <10ms |
| Failover | Must build | Built-in |
| Analytics dashboard | Must build | Built-in |

The BYOK (Bring Your Own Keys) plan on ClawRouters is free with no markup — you use your own provider API keys, and ClawRouters handles the routing logic. Unlike OpenRouter which charges 5.5% on every request, there's no per-request fee.

For teams that want managed keys and higher rate limits, paid plans start at $29/month.

Getting Started in 60 Seconds

Whether you build or buy, the fastest way to start saving on LLM costs today:

  1. Sign up for ClawRouters (free, no credit card)
  2. Add your provider API keys in the dashboard
  3. Point your app at the OpenAI-compatible endpoint:
from openai import OpenAI

client = OpenAI(
    base_url="https://www.clawrouters.com/api/v1",
    api_key="cr_your_key_here"
)

# model="auto" lets ClawRouters route intelligently
response = client.chat.completions.create(
    model="auto",
    messages=[{"role": "user", "content": "Your prompt here"}]
)
  4. Check our Setup Guide for framework-specific instructions, including Cursor and Windsurf integration.

Ready to Reduce Your AI API Costs?

ClawRouters routes every API call to the optimal model — automatically. Start saving today.

Get Started Free →
