From Demo to Production: Multi‑Model Routing

12/1/2025

AI apps behave differently in production than in a demo. The moment your queries per second (QPS) climb and users start depending on the results, variability becomes the enemy:

  • Model prices shift weekly.
  • Latency spikes at peak times.
  • Quality differs by task.
  • Failures happen in bursts.

If you ship with a single model and a single prompt path, you are implicitly betting that all tasks are the same and the model will remain stable. Production traffic proves the opposite.

Multi-model routing architecture

A production request is rarely a single model call. The router chooses a pool, applies policy, and leans on fallbacks and a semantic cache along the way, roughly like the sketch below.
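
Here is a minimal sketch of that request path in Python. Every name in it (classify_intent, pick_pool, cache.lookup, call_model, the pool and model labels) is illustrative, not SkyAIApp's actual API:

  # Hypothetical request path: classify -> pick pool -> cache -> primary -> fallbacks.
  # All names are illustrative, not SkyAIApp's actual API.
  POOLS = {
      "cost":     ["cheap-fast-model", "cheap-fallback-model"],
      "balanced": ["default-model", "default-fallback-model"],
      "quality":  ["strong-model", "strong-fallback-model"],
  }

  def handle_request(request, classify_intent, pick_pool, cache, call_model):
      intent = classify_intent(request["text"])       # e.g. "faq.refund"
      pool = pick_pool(intent)                        # policy chooses the pool

      cached = cache.lookup(request["text"], intent)  # semantic cache first
      if cached is not None:
          return cached

      for model in POOLS[pool]:                       # primary, then fallbacks
          try:
              return call_model(model, request)
          except TimeoutError:
              continue                                # next model in the chain
      raise RuntimeError(f"all models in pool '{pool}' failed")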

What changes when you go live

In real traffic you will see a long tail of intents:

  • short, repetitive asks ("reset password", "pricing", "refund policy")
  • complex reasoning requests ("compare plans for a 200‑seat org")
  • tool‑heavy flows ("generate invoice", "open ticket", "update shipment")
  • safety‑sensitive queries ("medical advice", "financial guidance")

Each of those benefits from a different model/cost/latency tradeoff. A “best overall” model is almost never best for all of them.
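
One way to make those tradeoffs concrete is a plain intent-to-pool lookup, which is all the pick_pool step above needs to be at first. The intent labels here are made up for illustration, not a prescribed taxonomy:

  # Illustrative intent-to-pool map; real routing would use a classifier,
  # but the tradeoff itself reduces to a lookup like this.
  INTENT_POOL = {
      "faq":       "cost",      # short, repetitive asks
      "reasoning": "quality",   # complex multi-step requests
      "tools":     "balanced",  # tool-heavy flows need reliability
      "sensitive": "quality",   # safety-sensitive queries
  }

  def pick_pool(intent_class: str) -> str:
      # Unknown intents fall back to the balanced pool.
      return INTENT_POOL.get(intent_class, "balanced")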

Case study: a support copilot at scale

One SkyAIApp user (we’ll call them AcmeShop) runs a B2C marketplace with ~1.2M monthly active users and a support copilot embedded in web + mobile apps.

Before routing

  • Single model: gpt-5.2
  • Traffic: ~22k requests/day, peak 40 QPS
  • p95 latency: 1.55s
  • Success rate: 97.9%
  • Cost: $4.10 per 1k requests
  • Typical failure mode: burst timeouts during peak, leading to retries and duplicated tool actions

After routing with SkyAIApp

AcmeShop defined three pools:

  1. Cost pool for repetitive FAQs and low‑risk tasks
    gemini-3-flash primary → claude-4.5-haiku fallback
  2. Balanced pool for most conversations
    claude-4.5-sonnet primary → gpt-5.2 fallback
  3. Quality‑first pool for complex multi‑step reasoning and tool plans
    gpt-5.2-high primary → claude-4.5-opus fallback

They also enabled semantic cache for FAQ‑like intents and added a strict fallback chain for tool calls.
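
Expressed as declarative config, the setup might look roughly like the sketch below. The schema and field names are guesses for illustration; only the model names, intent families, and cache/fallback choices come from the case study:

  # Hypothetical config schema; model choices mirror the case study above.
  ROUTING_POLICY = {
      "pools": {
          "cost": {
              "models": ["gemini-3-flash", "claude-4.5-haiku"],
              "intents": ["faq.*", "low_risk.*"],
          },
          "balanced": {
              "models": ["claude-4.5-sonnet", "gpt-5.2"],
              "intents": ["default"],
          },
          "quality_first": {
              "models": ["gpt-5.2-high", "claude-4.5-opus"],
              "intents": ["reasoning.*", "tool_plan.*"],
          },
      },
      "semantic_cache": {"intents": ["faq.*"], "require_low_tool_usage": True},
      "tool_calls": {"fallback_chain": "strict", "max_retries": 1},  # retry cap is an assumption
  }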

Results after 3 weeks

  • Router mix: 54% cost, 39% balanced, 7% quality‑first
  • p95 latency: 1.02s (34% lower)
  • Success rate: 99.4% (retries down, fallbacks smarter)
  • Cost: $2.95 per 1k requests (28% cheaper)
  • Cache hit rate: 31% for FAQ intents

The key insight: routing reduced both spend and latency because cheaper models were also faster for simple asks, and fallbacks avoided expensive full retries.

How SkyAIApp routing works

SkyAIApp provides a policy‑driven router with a clean separation of concerns:

  1. Unified API surface
    Call one endpoint regardless of provider.
  2. Signal collection
    Per‑request features such as intent class, prompt size, tool usage, locale, user tier, and cache similarity.
  3. Decision engine
    A weighted score over cost, p95 latency, success rate, and offline quality evals (see the sketch after this list).
  4. Fallback chains
    Provider‑aware fallback (same family → different family) with configurable retries.
  5. Adaptive learning
    Outcomes update your routing weights over time.
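
The scoring step in the decision engine can be as small as a weighted sum. The weights and signal names below are assumptions for illustration; the only requirement is that cheaper, faster, more reliable, higher-quality candidates all score higher:

  # Illustrative weighted score; weights and signal scales are assumptions.
  # Inputs are normalized to [0, 1] so the terms are comparable.
  WEIGHTS = {"cost": 0.3, "latency": 0.3, "success": 0.2, "quality": 0.2}

  def score(candidate: dict) -> float:
      return (
          WEIGHTS["cost"] * (1.0 - candidate["norm_cost"])        # cheaper is better
          + WEIGHTS["latency"] * (1.0 - candidate["norm_p95"])    # faster is better
          + WEIGHTS["success"] * candidate["success_rate"]
          + WEIGHTS["quality"] * candidate["eval_score"]
      )

  def choose(candidates: list[dict]) -> dict:
      # Pick the highest-scoring model for this request's signals.
      return max(candidates, key=score)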

Policies are versioned, testable, and deployable like code. You can A/B a new policy on 5% of traffic, then promote it when the metrics look good.
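
Because policies deploy like code, the 5% canary can be a deterministic traffic split. A minimal sketch, assuming a hash-based bucketing scheme and made-up policy version names:

  import hashlib

  # Deterministic 5% canary: the same user always sees the same policy
  # version, so before/after metrics stay comparable.
  def policy_version(user_id: str, canary_pct: int = 5) -> str:
      bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
      return "policy-v2" if bucket < canary_pct else "policy-v1"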

Caching as a production lever

Semantic caching is the highest‑ROI knob for many apps, especially those with repeated intents. The trick is to use the cache only when it is safe:

  • Cache only for intents you label as “repeatable” (FAQ, docs, policy text).
  • Require a high similarity threshold for long‑tail asks.
  • Attach safety and locale constraints to cache keys.

In AcmeShop’s case, caching was only enabled for intents tagged faq.* and only when the router predicted low tool usage.
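
A gate like that takes only a few lines. This sketch assumes made-up signal names and an illustrative similarity threshold:

  # Sketch of a cache gate: serve from cache only for repeatable intents,
  # with a strict similarity threshold and locale/safety baked into the key.
  SIM_THRESHOLD = 0.92  # illustrative; tune per intent family

  def cache_key(embedding_id: str, intent: str, locale: str, safety: str) -> str:
      # Locale and safety class are part of the key, per the constraints above.
      return f"{intent}:{locale}:{safety}:{embedding_id}"

  def should_serve_from_cache(intent: str, similarity: float,
                              predicted_tool_usage: float) -> bool:
      return (
          intent.startswith("faq.")        # only repeatable intents
          and similarity >= SIM_THRESHOLD  # high-threshold match
          and predicted_tool_usage < 0.1   # router predicts low tool usage
      )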

Rollout checklist

If you’re moving from demo to production, here’s a tight, repeatable rollout:

  1. Create at least two pools (cost + balanced) before launch.
  2. Backfill evals on a representative traffic sample.
  3. Turn on fallbacks for non‑idempotent tools with strict retry caps.
  4. Enable cache on one intent family first and measure hit rate + safety.
  5. Alert on drift in model latency or cost so policies stay current (a minimal check is sketched below).
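
For the drift alert in step 5, a rolling-window comparison against a baseline is often enough to start. The tolerance band and the numbers below are illustrative:

  # Minimal drift check: alert when a rolling window moves outside a
  # tolerance band around the baseline. Thresholds are illustrative.
  def drifted(baseline: float, rolling: float, tolerance: float = 0.20) -> bool:
      return abs(rolling - baseline) / baseline > tolerance

  # e.g. alert if rolling p95 latency (or per-1k cost) moves more than 20%
  if drifted(baseline=1.02, rolling=1.40):
      print("ALERT: p95 latency drifted; review routing policy weights")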

Production routing is not a “nice‑to‑have”. It is the difference between an AI feature that scales and one that becomes a cost or reliability liability.