TL;DR: Most API gateways default to a 30-second timeout limit, but LLM API calls routinely take 45-120+ seconds for complex prompts, causing silent 504 errors that break your AI features. The fix isn't just raising the timeout: it's using an intelligent LLM router like ClawRouters that routes simple requests to fast models (< 2s response) and only sends complex tasks to slower premium models, keeping 80% of your traffic well within default timeout limits while cutting costs by 60-80%.
API gateway timeout limits are one of the most common, and most frustrating, sources of production failures in AI-powered applications. A developer ships a working AI feature in development, deploys behind an API gateway, and suddenly users see intermittent failures. The logs show 504 Gateway Timeout. The root cause: the gateway's default timeout is 30 seconds, and their LLM calls take 60-90 seconds for anything beyond a trivial prompt.
This guide covers everything you need to know about API gateway timeout limits for AI and LLM workloads: what the defaults are, why LLM traffic is uniquely problematic, how to configure timeouts correctly, and why smart routing is a better long-term solution than simply cranking up your timeout value.
Why API Gateway Timeout Limits Matter More for AI Traffic
The Fundamental Mismatch
Traditional API gateways were designed for web application traffic where response times are measured in milliseconds. A typical REST API call returns in 50-200ms. Even a slow database query finishes in 1-3 seconds. API gateway timeout limits of 30 seconds provide generous headroom for these workloads.
LLM API calls are a different animal entirely:
| Request Type | Typical Response Time | Default 30s Timeout? |
|---|---|---|
| Simple Q&A (< 100 output tokens) | 1-3 seconds | ✅ Safe |
| Code generation (500-1000 tokens) | 8-20 seconds | ✅ Usually safe |
| Long-form content (2000+ tokens) | 30-60 seconds | ⚠️ At risk |
| Complex reasoning (chain-of-thought) | 45-120 seconds | ❌ Will timeout |
| Multi-step agent workflows | 60-300 seconds | ❌ Will timeout |
According to benchmarks from major providers, the median time-to-first-token for premium models like Claude Opus 4 is 3-8 seconds, with total generation times exceeding 60 seconds for outputs above 2,000 tokens. GPT-5.2 shows similar patterns, and reasoning models like DeepSeek R1 can "think" for 30+ seconds before generating the first token.
The Hidden Cost of Timeout Failures
When a gateway times out an LLM request, the damage goes beyond a failed API call:
- Wasted tokens: the provider still processes and bills for the full request, even though the client never receives the response
- Retry storms: clients often retry timed-out requests, doubling or tripling your API costs
- User experience degradation: users see errors only after waiting 30 seconds, the worst possible outcome
- Cascading failures: in agent architectures, one timed-out step can fail an entire multi-step workflow
A study from Anthropic's developer relations team found that 23% of production API errors reported by enterprise users were timeout-related, making it the single largest category of integration failures.
Default Timeout Limits by API Gateway
Popular Gateways and Their Defaults
Every API gateway ships with different default timeout settings. Here's what you're working with out of the box:
| Gateway | Default Timeout | Max Configurable | Streaming Support |
|---|---|---|---|
| AWS API Gateway (REST) | 29 seconds | 29 seconds (hard limit) | ❌ |
| AWS API Gateway (HTTP) | 30 seconds | 30 seconds | ❌ |
| AWS ALB | 60 seconds | 4,000 seconds | ✅ |
| Cloudflare API Gateway | 100 seconds | 100 seconds (Workers) | ✅ |
| Kong Gateway | 60 seconds | Unlimited | ✅ |
| NGINX | 60 seconds | Unlimited | ✅ |
| Google Cloud API Gateway | 15 seconds | 60 seconds | ❌ |
| Azure API Management | 240 seconds | 240 seconds | ✅ |
| Vercel | 30 seconds (Hobby) | 300 seconds (Enterprise) | ✅ |
Critical finding: AWS API Gateway (REST API type) has a hard 29-second limit that cannot be increased. If you're routing LLM traffic through it, you will hit timeouts on any moderately complex request. This is the single most common cause of "it works locally but fails in production" for AI applications.
Why You Can't Just Increase the Timeout
The obvious fix, raising the timeout to 300 seconds, creates new problems:
- Resource exhaustion: every pending request holds a connection open, and long timeouts mean more concurrent connections, which can exhaust your gateway's connection pool
- Slow failure detection: if a provider is actually down, you wait 300 seconds to find out instead of 30
- Cost amplification: a stuck request that hangs for 300 seconds wastes connection resources and still fails
- Load balancer conflicts: upstream load balancers may have shorter timeouts, creating a chain of mismatched limits
The real solution isn't a bigger timeout; it's faster responses.
How to Configure Timeout Limits Correctly
Setting Timeouts for LLM Traffic
If you must configure your gateway timeout manually, follow these guidelines:
For non-streaming LLM endpoints:
- Set gateway timeout to 120 seconds minimum for standard models
- Set to 180-300 seconds for reasoning models (DeepSeek R1, Claude with extended thinking)
- Always set the backend timeout higher than the gateway timeout to avoid race conditions
For streaming LLM endpoints:
- Use idle timeout instead of total request timeout: streaming connections should stay open as long as tokens are flowing
- Set idle timeout to 30-60 seconds (time between chunks, not total time)
- Most gateways distinguish between connection timeout, read timeout, and idle timeout; configure each:
```nginx
# NGINX example for LLM streaming
location /api/v1/chat/completions {
    proxy_connect_timeout 10s;   # Time to establish connection
    proxy_send_timeout    30s;   # Time to send the request body
    proxy_read_timeout    300s;  # Time to receive response (non-streaming)
    proxy_buffering       off;   # Required for SSE streaming
    # For streaming, the read_timeout acts as idle timeout
    # between chunks; 300s is safe
}
```
```yaml
# Kong Gateway configuration
services:
  - name: llm-service
    connect_timeout: 10000  # 10 seconds
    write_timeout: 30000    # 30 seconds
    read_timeout: 300000    # 300 seconds (5 minutes)
```
Timeout Chain Architecture
In production, you have multiple timeout layers. They must be configured from outermost to innermost, each layer shorter than the one inside it:
```
Client timeout (90s)
  ↓ CDN/WAF timeout (120s)
    ↓ API Gateway timeout (180s)
      ↓ Load Balancer timeout (240s)
        ↓ Backend/Provider timeout (300s)
```
With this ordering, the shortest timeout sits at the outermost layer, so when a request runs long the failure surfaces at a predictable place instead of a middle hop silently killing a request its caller was still willing to wait for. The trade-off: whenever any layer gives up, the backend keeps processing (and the provider keeps billing) the abandoned request, so every timeout in the chain still needs to be long enough for the requests you actually expect.
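The ordering rule can be checked mechanically before a config change ships. A minimal Python sketch; the layer names and values mirror the chain above and are purely illustrative:

```python
def validate_timeout_chain(layers):
    """Check that each layer's timeout is shorter than the next (inner)
    layer's, matching the outermost-to-innermost rule above.

    `layers` is an ordered list of (name, timeout_seconds) tuples,
    outermost first. Returns a list of human-readable violations.
    """
    violations = []
    for (outer_name, outer_t), (inner_name, inner_t) in zip(layers, layers[1:]):
        if outer_t >= inner_t:
            violations.append(
                f"{outer_name} ({outer_t}s) must be shorter than "
                f"{inner_name} ({inner_t}s)"
            )
    return violations

chain = [
    ("client", 90),
    ("cdn_waf", 120),
    ("api_gateway", 180),
    ("load_balancer", 240),
    ("backend_provider", 300),
]
print(validate_timeout_chain(chain))  # -> []  (chain is consistent)

# A misordered chain is flagged immediately:
print(validate_timeout_chain([("client", 90), ("api_gateway", 60), ("backend", 300)]))
```

Running a check like this in CI catches the classic regression where someone raises a backend timeout without touching the layers in front of it.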
Debugging 504 Gateway Timeout Errors
Step-by-Step Diagnosis
When you encounter 504 errors on LLM endpoints:
1. Identify which timeout is triggering:

```bash
# Check response headers for clues
curl -v -X POST https://your-api.com/v1/chat/completions \
  -H "Authorization: Bearer YOUR_KEY" \
  -d '{"model":"auto","messages":[{"role":"user","content":"Write a detailed analysis..."}]}'
# Look for: X-Request-Id, Server header, timing headers
```
2. Test directly against the provider (bypass gateway):

```bash
# If this works but the gateway version doesn't, it's a timeout issue
curl -X POST https://api.openai.com/v1/chat/completions \
  --max-time 120 \
  -H "Authorization: Bearer sk-..." \
  -d '{"model":"gpt-4o","messages":[{"role":"user","content":"Same prompt..."}]}'
```
3. Check provider response times:

- OpenAI: `x-request-id` header, check status.openai.com
- Anthropic: `request-id` header
- Google: check Gemini API metrics in Cloud Console
4. Enable streaming to avoid timeouts: switching from non-streaming to streaming often resolves timeout issues because the first token arrives in 1-5 seconds, keeping the connection alive:

```json
{
  "model": "auto",
  "stream": true,
  "messages": [{"role": "user", "content": "Your prompt..."}]
}
```
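To see why streaming defuses timeouts, it helps to model idle-timeout semantics directly: total duration is unbounded as long as data keeps flowing. A minimal Python sketch, where a plain chunk iterator stands in for the SSE response (a real gateway aborts the connection during the stall; this version just checks gaps as chunks arrive):

```python
import time

def enforce_idle_timeout(chunks, idle_timeout_s):
    """Yield chunks from a stream, raising if the gap between consecutive
    chunks exceeds idle_timeout_s. Total stream duration is never checked,
    which is exactly how a gateway idle timeout treats an SSE stream."""
    last = time.monotonic()
    for chunk in chunks:
        now = time.monotonic()
        gap = now - last
        if gap > idle_timeout_s:
            raise TimeoutError(f"stream idle for {gap:.2f}s (limit {idle_timeout_s}s)")
        last = now
        yield chunk

def token_stream():
    # Stand-in for an SSE response: steady chunks, well under the idle limit
    for i in range(3):
        time.sleep(0.01)
        yield f"token-{i} "

print("".join(enforce_idle_timeout(token_stream(), idle_timeout_s=1.0)))
```

A 90-second generation passes this check as long as no single gap between chunks exceeds the idle limit, which is why a streaming request survives a gateway that would kill the same request in non-streaming mode.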
Common 504 Patterns and Fixes
| Pattern | Likely Cause | Fix |
|---|---|---|
| All requests timeout | Gateway timeout too low | Increase to 120s+ |
| Only long prompts timeout | Output generation exceeds limit | Enable streaming or route to faster models |
| Timeouts during peak hours | Provider rate limiting + queuing | Use multi-provider routing with failover |
| Intermittent timeouts | Provider cold starts or overload | Implement fallback chains |
| Timeouts after exactly 29 seconds | AWS API Gateway REST hard limit | Switch to HTTP API type or ALB |
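The "implement fallback chains" fix from the table can be sketched in a few lines of Python. The provider callables below are illustrative stand-ins for real SDK calls:

```python
def call_with_fallback(providers, prompt, per_provider_timeout_s=30):
    """Try providers in order, falling through on timeout or connection
    errors. `providers` is a list of (name, callable) pairs; each callable
    stands in for a real SDK call and receives the per-provider timeout."""
    errors = {}
    for name, call in providers:
        try:
            return name, call(prompt, timeout=per_provider_timeout_s)
        except (TimeoutError, ConnectionError) as exc:
            errors[name] = exc  # record the failure, try the next provider
    raise RuntimeError(f"all providers failed: {errors}")

# Demo with stand-in providers: the primary "times out", the backup answers.
def flaky_primary(prompt, timeout):
    raise TimeoutError("primary exceeded deadline")

def steady_backup(prompt, timeout):
    return f"answer to: {prompt!r}"

print(call_with_fallback([("primary", flaky_primary), ("backup", steady_backup)], "hi"))
```

The key design point is the per-provider timeout: it must be short enough that falling through the whole chain still finishes inside whatever gateway timeout sits in front of you.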
The Smart Routing Solution: Eliminate Timeouts at the Source
Why Routing Beats Configuration
Instead of fighting timeout limits, the better approach is to ensure most requests complete fast. This is what intelligent LLM routing does: it analyzes each request and sends it to the fastest model that can handle it.
Here's the impact on response times when using ClawRouters with model="auto":
| Request Type | Without Routing | With Smart Routing | Timeout Risk |
|---|---|---|---|
| "What's the capital of France?" | 3-8s (Opus) | 0.8-1.5s (Flash) | None |
| "Format this JSON" | 5-15s (Opus) | 1-3s (Haiku) | None |
| "Write unit tests for this class" | 20-45s (Opus) | 8-15s (Sonnet) | Low |
| "Design a distributed system" | 60-120s (Opus) | 60-120s (Opus) | Managed |
For the 80% of requests that are simple to moderate, smart routing reduces response times by 3-10x, well within any gateway's default timeout limit. Only the 20% of truly complex requests need the slower premium models, and those can be handled with streaming and appropriate timeout configuration.
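The routing idea can be illustrated with a toy heuristic. This is not ClawRouters' actual algorithm; the tier names, thresholds, and the ~4-characters-per-token estimate are invented for the sketch:

```python
def pick_model(prompt, max_output_tokens):
    """Toy routing heuristic: route by rough input size and requested
    output size. Real routers use richer signals (task type, history,
    provider health); everything here is illustrative."""
    approx_input_tokens = len(prompt) // 4  # ~4 chars/token for English text
    if approx_input_tokens < 200 and max_output_tokens <= 256:
        return "fast-tier"      # simple Q&A, formatting: sub-second models
    if max_output_tokens <= 1024:
        return "mid-tier"       # code generation, summaries
    return "premium-tier"       # long-form reasoning only

print(pick_model("What's the capital of France?", 50))      # fast-tier
print(pick_model("Design a distributed system ...", 4000))  # premium-tier
```

Even a crude classifier like this keeps the short-request majority on models that respond well inside a 30-second gateway default, which is the whole point of the table above.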
How ClawRouters Handles Timeouts Internally
ClawRouters implements several timeout-resilient patterns that you'd otherwise need to build yourself:
- Automatic provider failover: if a provider is slow or timing out, requests are automatically routed to the next provider in the fallback chain
- Streaming by default: streaming responses keep connections alive and eliminate idle timeout issues
- Cost-aware routing: smart model selection means 80% of traffic goes to fast, cheap models with sub-3-second response times
- Built-in rate limit management: per-provider rate limits are tracked and respected, avoiding the queuing delays that cause timeouts
- Dry run mode: send the `X-Dry-Run: true` header to test routing decisions without waiting for model responses
The result: teams using ClawRouters report 90%+ reduction in 504 timeout errors compared to direct provider integration behind a traditional API gateway.
Best Practices for Production AI Traffic
Timeout Configuration Checklist
- Audit your timeout chain: map every hop from client to provider and ensure timeouts increase inward
- Enable streaming for all LLM endpoints: SSE streaming eliminates most timeout issues by keeping connections alive
- Set client-side timeouts with retries: don't rely solely on the gateway; implement exponential backoff in your application
- Monitor time-to-first-token (TTFT): this metric predicts timeout risk better than average response time
- Use a dedicated LLM routing layer: general-purpose API gateways weren't designed for AI workload patterns; purpose-built LLM routers handle them natively
- Separate AI traffic from web traffic: route LLM calls through a different gateway or path with longer timeouts, keeping your web APIs on tight limits
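The "client-side timeouts with retries" item amounts to capped exponential backoff with jitter. A minimal Python sketch; `flaky_call` and the delay values are illustrative stand-ins for a real provider call:

```python
import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay_s=1.0, max_delay_s=30.0):
    """Retry `fn` on timeout with capped exponential backoff and full jitter.
    Bounding attempts keeps one slow provider from turning into the retry
    storms described earlier."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the timeout to the caller
            # Full jitter: uniform delay in [0, min(cap, base * 2^attempt)]
            delay = random.uniform(0.0, min(max_delay_s, base_delay_s * 2 ** attempt))
            time.sleep(delay)

# Demo: a stand-in call that times out twice, then succeeds.
attempts = {"n": 0}
def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("simulated provider timeout")
    return "response"

print(call_with_backoff(flaky_call, base_delay_s=0.01))  # response
```

Jitter matters here: if every client retries on the same fixed schedule after a provider blip, the synchronized retries themselves can re-trigger the timeouts.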
Architecture Recommendation
For production AI applications in 2026, the recommended architecture separates concerns:
```
Web traffic → API Gateway (30s timeout) → Your backend
AI traffic  → ClawRouters (manages timeouts internally) → Multiple AI providers
```
This way, your API gateway keeps its sensible defaults for web traffic, and AI-specific concerns like long response times, provider failover, and cost optimization are handled by a purpose-built layer.
Frequently Asked Questions
What is the default API gateway timeout limit?
Most API gateways default to 30-60 seconds. AWS API Gateway (REST) has a hard 29-second limit. Kong and NGINX default to 60 seconds. Cloudflare Workers has a 100-second limit. For LLM and AI workloads, these defaults are often too low: complex prompts can take 60-120+ seconds to complete.
Why do I get 504 Gateway Timeout errors on my AI API calls?
504 errors on AI API calls are almost always caused by the API gateway timeout being shorter than the LLM provider's response time. Premium models like Claude Opus 4 or GPT-5.2 can take 45-120 seconds for complex prompts. Enable streaming or use smart routing to reduce response times.
Can I increase the AWS API Gateway timeout beyond 29 seconds?
No: the AWS API Gateway REST API type has a hard 29-second limit. Switch to the HTTP API type, use an Application Load Balancer (up to 4,000 seconds), or route AI traffic through a dedicated LLM router like ClawRouters that handles long-running requests internally.
Does streaming help avoid API gateway timeout limits?
Yes. Streaming sends tokens incrementally, so the first data arrives in 1-5 seconds. Most gateways measure timeout from the last received data, not the total request duration. A 90-second streaming request won't time out as long as tokens keep flowing within the idle timeout window.
What timeout should I set for LLM API traffic?
For non-streaming endpoints, set at least 120 seconds for standard models and 180-300 seconds for reasoning models. For streaming endpoints, set an idle timeout of 30-60 seconds between chunks. Always ensure backend timeout > gateway timeout.
How does smart routing reduce API timeout errors?
Smart LLM routing analyzes each request and sends it to the fastest appropriate model. Since 80% of requests are simple enough for fast models (1-3s response), routing eliminates timeout risk for most traffic. ClawRouters reports 90%+ reduction in 504 errors compared to single-model setups.
What is the difference between connection timeout, read timeout, and idle timeout?
Connection timeout = time to establish a TCP connection (5-10s). Read timeout = time waiting for the complete response (set 120-300s for LLM traffic). Idle timeout = time between data chunks, which is the critical setting for streaming, where total request time is long but data flows continuously.
Need to eliminate timeout headaches from your AI pipeline? ClawRouters handles timeouts, failover, and cost optimization automatically โ so you can focus on building, not debugging 504 errors. Get started for free.