From Demo to Production: Multi‑Model Routing
AI apps behave differently in production than in a demo. The moment your QPS rises and users depend on results, variability becomes the enemy:
- Model prices shift weekly.
- Latency spikes at peak times.
- Quality differs by task.
- Failures happen in bursts.
If you ship with a single model and a single prompt path, you are implicitly betting that all tasks are the same and the model will remain stable. Production traffic proves the opposite.
A production request is rarely a single model call. The router chooses a pool, applies policy, and uses fallbacks and cache.
What changes when you go live
In real traffic you will see a long tail of intents:
- short, repetitive asks ("reset password", "pricing", "refund policy")
- complex reasoning requests ("compare plans for a 200‑seat org")
- tool‑heavy flows ("generate invoice", "open ticket", "update shipment")
- safety‑sensitive queries ("medical advice", "financial guidance")
Each of those benefits from a different model/cost/latency tradeoff. A “best overall” model is almost never best for all of them.
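The simplest version of that idea is a static intent‑to‑pool map. A minimal sketch (the intent labels, pool names, and the mapping itself are illustrative, not SkyAIApp's API):

```python
# Illustrative only: map intent classes to pools with different
# cost/latency/quality tradeoffs. All labels here are made up.
POOL_BY_INTENT = {
    "faq": "cost",            # short, repetitive asks
    "reasoning": "quality",   # complex multi-step requests
    "tool_flow": "balanced",  # tool-heavy flows
    "sensitive": "quality",   # safety-sensitive queries
}

def choose_pool(intent: str) -> str:
    """Route unknown intents to the balanced pool by default."""
    return POOL_BY_INTENT.get(intent, "balanced")
```

In practice the mapping is learned or scored rather than hard‑coded, but even this static table captures the core point: different intents deserve different pools.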
Case study: a support copilot at scale
One SkyAIApp user (we’ll call them AcmeShop) runs a B2C marketplace with ~1.2M monthly active users and a support copilot embedded in web + mobile apps.
Before routing
- Single model: gpt-5.2
- Traffic: ~22k requests/day, peak 40 QPS
- p95 latency: 1.55s
- Success rate: 97.9%
- Cost: $4.10 per 1k requests
- Typical failure mode: burst timeouts during peak, leading to retries and duplicated tool actions
After routing with SkyAIApp
AcmeShop defined three pools:
- Cost pool for repetitive FAQs and low‑risk tasks: gemini-3-flash primary → claude-4.5-haiku fallback
- Balanced pool for most conversations: claude-4.5-sonnet primary → gpt-5.2 fallback
- Quality‑first pool for complex multi‑step reasoning and tool plans: gpt-5.2-high primary → claude-4.5-opus fallback
They also enabled semantic cache for FAQ‑like intents and added a strict fallback chain for tool calls.
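The three pools above can be expressed as declarative config. The model names come from the case study; the schema itself is an assumed sketch, not SkyAIApp's actual config format:

```python
# Hypothetical pool config mirroring AcmeShop's setup: each pool has a
# primary model and one fallback. The dict schema is illustrative.
POOLS = {
    "cost":     {"primary": "gemini-3-flash",    "fallback": "claude-4.5-haiku"},
    "balanced": {"primary": "claude-4.5-sonnet", "fallback": "gpt-5.2"},
    "quality":  {"primary": "gpt-5.2-high",      "fallback": "claude-4.5-opus"},
}

def models_for(pool: str) -> list[str]:
    """Return the ordered try-list for a pool: primary first, then fallback."""
    cfg = POOLS[pool]
    return [cfg["primary"], cfg["fallback"]]
```

Keeping pools as data rather than code makes them easy to version, diff, and A/B test.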
Results after 3 weeks
- Router mix: 54% cost, 39% balanced, 7% quality‑first
- p95 latency: 1.02s (34% faster)
- Success rate: 99.4% (retries down, fallbacks smarter)
- Cost: $2.95 per 1k requests (28% cheaper)
- Cache hit rate: 31% for FAQ intents
The key insight: routing reduced both spend and latency because cheaper models were also faster for simple asks, and fallbacks avoided expensive full retries.
How SkyAIApp routing works
SkyAIApp provides a policy‑driven router with a clean separation of concerns:
- Unified API surface: call one endpoint regardless of provider.
- Signal collection: per‑request features such as intent class, prompt size, tool usage, locale, user tier, and cache similarity.
- Decision engine: a weighted score over cost, p95 latency, success rate, and offline quality evals.
- Fallback chains: provider‑aware fallback (same family → different family) with configurable retries.
- Adaptive learning: outcomes update your routing weights over time.
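A weighted score over those signals can be sketched as follows. The weights, metric names, and normalization are assumptions for illustration, not SkyAIApp's actual scoring formula:

```python
# Illustrative weighted-score decision: each candidate model carries
# normalized metrics in [0, 1] where lower is better. Weights are assumed.
WEIGHTS = {"cost": 0.4, "p95_latency": 0.3, "error_rate": 0.2, "quality_gap": 0.1}

def score(candidate: dict) -> float:
    """Combine a candidate's normalized metrics into one scalar score."""
    return sum(WEIGHTS[k] * candidate[k] for k in WEIGHTS)

def pick(candidates: dict) -> str:
    """Pick the model name with the lowest weighted score."""
    return min(candidates, key=lambda name: score(candidates[name]))
```

Adaptive learning then amounts to nudging the weights (or the per‑model metrics) as real outcomes arrive.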
Policies are versioned, testable, and deployable like code. You can A/B a new policy on 5% of traffic, then promote it when the metrics look good.
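A common way to carve out that 5% slice is deterministic hash bucketing, so the same user always sees the same policy. A sketch under assumed names (the salt and function are hypothetical, not part of SkyAIApp):

```python
import hashlib

def in_experiment(user_id: str, percent: int, salt: str = "policy-v2") -> bool:
    """Deterministically assign `percent`% of users to the experiment bucket.

    Hashing (salt, user_id) keeps assignment stable across requests and
    independent between experiments with different salts.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < percent
```

Because assignment is a pure function of the user id, promoting the policy is just raising `percent` to 100.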
Caching as a production lever
Semantic caching is the highest‑ROI knob for many apps, especially those with repeated intents. The trick is to use the cache only when it is safe:
- Cache only for intents you label as “repeatable” (FAQ, docs, policy text).
- Use high‑threshold similarity for long‑tail asks.
- Attach safety and locale constraints to cache keys.
In AcmeShop’s case, caching was only enabled for intents tagged faq.* and only when the router predicted low tool usage.
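The gating logic above can be sketched as a small predicate plus a constrained cache key. The threshold value, key layout, and function names are assumptions; the rules (repeatable intents only, high similarity, low predicted tool use, locale in the key) come from the text:

```python
# Sketch of a "safe" semantic-cache gate. Assumes an upstream component
# supplies the intent tag, a similarity score, and a tool-use prediction.
REPEATABLE_PREFIXES = ("faq.",)   # only intents labeled repeatable
SIMILARITY_THRESHOLD = 0.92       # high threshold for long-tail asks (assumed)

def cache_key(intent: str, locale: str, normalized_query: str) -> str:
    """Bake constraints into the key so answers never cross locales/intents."""
    return f"{intent}|{locale}|{normalized_query}"

def may_serve_from_cache(intent: str, similarity: float,
                         predicted_tool_use: bool) -> bool:
    return (
        intent.startswith(REPEATABLE_PREFIXES)
        and similarity >= SIMILARITY_THRESHOLD
        and not predicted_tool_use
    )
```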
Rollout checklist
If you’re moving from demo to production, here’s a tight, repeatable rollout:
- Create at least two pools (cost + balanced) before launch.
- Backfill evals on a representative traffic sample.
- Turn on fallbacks for non‑idempotent tools with strict retry caps.
- Enable cache on one intent family first and measure hit rate + safety.
- Alert on drift in model latency or cost so policies stay current.
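The last checklist item — drift alerting — can be as simple as comparing rolling metrics against the baseline the policy was tuned on. A minimal sketch with an assumed 25% tolerance:

```python
def drifted(baseline: float, current: float, tolerance: float = 0.25) -> bool:
    """True if `current` deviates from `baseline` by more than `tolerance`."""
    return abs(current - baseline) / baseline > tolerance

def check_model(name: str, metrics: dict, baselines: dict) -> list[str]:
    """Return alert messages for any drifting metric (e.g. p95 latency, cost)."""
    return [
        f"{name}: {metric} drifted ({baselines[metric]:.2f} -> {value:.2f})"
        for metric, value in metrics.items()
        if drifted(baselines[metric], value)
    ]
```

When an alert fires, the fix is usually a policy update (reweight or swap a pool member), not a code change.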
Production routing is not a “nice‑to‑have”. It is the difference between an AI feature that scales and one that becomes a cost or reliability liability.