How the router actually decides
After reading this page you should be able to answer: how does SkyAIApp represent candidate models? How is scoring computed? What is the cache lookup algorithm? When do fallbacks fire? How do traces form a tree? And, for each, why we picked these designs.
Request lifecycle
How one request is governed, routed, executed, and audited
Edge ingress
TLS, WAF, API key, tenant identity, and rate limits run before the request enters the control plane.
Normalize request
OpenAI-compatible calls, native route requests, and metadata are normalized into one auditable envelope.
Input guardrails
Injection, PII, tenant boundary, and residency checks run before spending on a model call.
Semantic cache lookup
Namespace, embedding similarity, and TTL determine cache hits; a hit returns immediately and skips the model call.
Candidate filtering
Model pools are filtered by context window, modality, region, RBAC, budget, and provider health.
Score and rank
Strategy weights normalize cost, latency, quality, and reliability to produce the primary and fallback chain.
Execute primary or fallback
Timeouts, 5xx responses, safety hits, or budget changes trigger a handoff to the fallback chain, with candidates rechecked before execution.
Assemble response
Model output, routing metadata, guardrail results, and trace ID are assembled before returning to the app.
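The lifecycle above can be sketched as one normalized envelope passing through a fixed, ordered list of stages, where a cache hit short-circuits everything after the lookup. A minimal sketch; the type and field names here are illustrative, not SkyAIApp's actual API:

```rust
use std::collections::BTreeMap;

/// Illustrative envelope: every ingress format (OpenAI-compatible, native
/// route, metadata) is normalized into one auditable shape before any
/// policy or routing decision runs.
#[derive(Debug, Clone)]
struct RequestEnvelope {
    trace_id: String,
    tenant: String,
    region: String,
    prompt: String,
    metadata: BTreeMap<String, String>,
}

/// The ordered stages a request passes through.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Ingress,
    Normalize,
    InputGuardrails,
    CacheLookup,
    CandidateFilter,
    ScoreAndRank,
    Execute,
    AssembleResponse,
}

/// Stages executed for a given request: all eight on a miss, only the
/// first four on a cache hit (the hit returns immediately after lookup).
fn stages(cache_hit: bool) -> Vec<Stage> {
    use Stage::*;
    let all = vec![
        Ingress, Normalize, InputGuardrails, CacheLookup,
        CandidateFilter, ScoreAndRank, Execute, AssembleResponse,
    ];
    if cache_hit {
        all.into_iter().take_while(|s| *s != CandidateFilter).collect()
    } else {
        all
    }
}
```

Representing the stage order as data (rather than scattered control flow) is what makes the "hit skips the expensive steps" guarantee easy to audit.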
Why this shape
Why is cache lookup before candidate filter?
A cache hit is the free path: skipping the model call skips the most expensive step. By looking up before candidate filtering, a hit executes only the first four steps (through the lookup itself) and returns immediately.
Why is budget checked both in filter and fallback?
The filter excludes 'obviously over budget' candidates. After a fallback fires, we re-account for already-spent cost — what was a borderline candidate may now exceed the budget.
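The re-accounting can be sketched as follows; function and parameter names are hypothetical, not the router's actual internals. The key point is that fallback candidates are filtered against what remains of the budget after the failed attempt, not against the original budget:

```rust
/// Illustrative budget recheck. Each candidate is (model name, estimated
/// cost in USD). A failed primary attempt still costs money, so on
/// fallback we filter against `budget - spent`, not `budget`.
fn affordable<'a>(candidates: &[(&'a str, f64)], budget: f64, spent: f64) -> Vec<&'a str> {
    let remaining = budget - spent;
    candidates
        .iter()
        .filter(|(_, est_cost)| *est_cost <= remaining)
        .map(|(name, _)| *name)
        .collect()
}
```

A candidate estimated at $0.04 passes a $0.05 budget initially, but after a failed attempt has already spent $0.02, the same candidate exceeds the $0.03 that remains.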
Why score with 1/normalized rather than w * cost?
Cost spans 3 orders of magnitude ($0.0001–$0.05). Direct multiplication is dominated by extremes. Normalizing to [0,1] then inverting makes 'cheap+slow' and 'expensive+fast' actually comparable under balanced.
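One plausible reading of this scheme, sketched below: min-max normalize each metric onto [0, 1], invert the "lower is better" axes, then take a weighted sum. The exact inversion and weights are assumptions for illustration, not the documented formula:

```rust
/// Min-max normalize onto [0, 1], so a $0.0001 model and a $0.05 model
/// land on the same scale before weighting.
fn normalize(xs: &[f64]) -> Vec<f64> {
    let (min, max) = xs
        .iter()
        .fold((f64::MAX, f64::MIN), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    let span = (max - min).max(f64::EPSILON); // avoid divide-by-zero
    xs.iter().map(|&x| (x - min) / span).collect()
}

/// Hypothetical balanced score: cost and latency are "lower is better",
/// so those normalized axes are inverted before the weighted sum.
fn score(cost_n: f64, latency_n: f64, quality_n: f64, w: (f64, f64, f64)) -> f64 {
    w.0 * (1.0 - cost_n) + w.1 * (1.0 - latency_n) + w.2 * quality_n
}
```

Without the normalization step, `w * cost` over a $0.0001–$0.05 range lets the cost term drown out every other axis.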
Why tie-break by provider diversity?
If primary + secondary are both from OpenAI, an OpenAI outage takes down both. Diversity tie-break ensures the fallback chain provides genuine resilience.
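A sketch of the tie-break, with hypothetical model and provider names: among equally ranked fallbacks, prefer the first whose provider differs from the primary's, falling back to the best remaining candidate if no diverse option exists:

```rust
/// Illustrative provider-diversity tie-break. `primary` and each ranked
/// entry are (model, provider) pairs; names here are made up.
fn pick_fallback<'a>(
    primary: (&'a str, &'a str),
    ranked: &[(&'a str, &'a str)],
) -> Option<&'a str> {
    ranked
        .iter()
        .find(|(_, provider)| *provider != primary.1) // first diverse candidate
        .or_else(|| ranked.first())                    // else best remaining
        .map(|(model, _)| *model)
}
```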
Why is trace writing async?
Zero impact on response latency. Traces go through a fire-and-forget queue with eventual consistency. Cost: P99 write latency ~2s (a 'just now' request may briefly not show in dashboards).
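The fire-and-forget shape can be sketched with a plain channel and a background drain thread; a real deployment would use a durable queue, and the names here are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

/// Illustrative async trace writer: the response path pushes a span onto
/// a channel and returns immediately; a background thread drains the
/// channel and performs the slow, eventually-consistent write.
fn spawn_trace_writer() -> mpsc::Sender<String> {
    let (tx, rx) = mpsc::channel::<String>();
    thread::spawn(move || {
        for span in rx {
            // Stand-in for the real write (~2s p99 per the numbers above),
            // e.g. trace_store.append(span).
            let _ = span;
        }
    });
    tx
}
```

`send` on the returned handle is the only cost paid on the hot path, which is why a "just now" request can briefly be missing from dashboards.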
Semantic cache internals
Cache is a namespaced vector index (HNSW) + content store (S3-compatible). Lookup flow:
use std::time::{Duration, SystemTime};

fn cache_lookup(ns: &str, prompt: &str, similarity: f32, ttl: Duration)
    -> Option<CachedResponse>
{
    // 1. Embed the prompt with a small embedder (~5ms p50)
    let q_vec: [f32; 384] = embed_small(prompt);
    // 2. ANN search in the namespaced HNSW index;
    //    k=8 candidates returned with cosine similarities
    let candidates = hnsw.search(ns, &q_vec, 8);
    // 3. Filter by threshold AND TTL, keep the most similar survivor
    let now = SystemTime::now();
    let best = candidates
        .iter()
        .filter(|c| c.similarity >= similarity)
        .filter(|c| now.duration_since(c.created_at).map_or(false, |age| age <= ttl))
        .max_by(|a, b| a.similarity.partial_cmp(&b.similarity).unwrap());
    best.and_then(|hit| {
        // 4. Fetch content (stored separately from the index for cost reasons)
        let body = content_store.get(&hit.content_key)?;
        Some(CachedResponse {
            output: body,
            similarity: hit.similarity,
            stored_at: hit.created_at,
        })
    })
}

Why a small embedder (not OpenAI ada-002): the lookup cost cannot exceed the saved cost. We use a distilled MPNet (384-dim); a cache hit costs ~5ms and ~$0.00001 per lookup.
How traces form a tree
Each trace is a span tree. The router creates a root span; each internal step (cache, filter, score, execute) is a child span. Agent runs nest each step as a child; tool calls inside an agent step are grandchildren.
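A span tree like the example trace below can be represented as a simple recursive node; this is an illustrative structure, not the router's actual trace schema:

```rust
/// Illustrative span node: each trace is a tree rooted at the route span,
/// with internal steps as children and nested agent/tool calls deeper down.
#[derive(Debug)]
struct Span {
    name: String,
    duration_ms: f64,
    children: Vec<Span>,
}

impl Span {
    fn new(name: &str, duration_ms: f64) -> Self {
        Span { name: name.to_string(), duration_ms, children: vec![] }
    }

    /// Builder-style helper for attaching a child span.
    fn child(mut self, c: Span) -> Self {
        self.children.push(c);
        self
    }

    /// Total number of spans in the tree, root included.
    fn count(&self) -> usize {
        1 + self.children.iter().map(Span::count).sum::<usize>()
    }
}
```

For example, a root `route` span with a `cache.lookup` child and an `execute.primary` child that nests `http.openai` yields a four-span tree.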
tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9 (root: route, 1820ms)
├─ ingress (3ms)
├─ normalize (1ms)
├─ cache.lookup (5ms) miss
├─ filter (1ms) 12 candidates
├─ score (2ms) gpt-5.5-pro wins
├─ execute.primary (1810ms)
│ └─ http.openai (1808ms) 200 OK
│ ├─ tcp.connect (45ms)
│ ├─ tls.handshake (60ms)
│ └─ stream.read (1700ms)
├─ guardrail.output (1ms) PII clean
├─ cache.write (4ms) stored
└─ billing.emit (0.4ms)

Multi-region active-active
The router runs active-active in three regions (us-east, eu-west, ap-southeast). GeoDNS routes each request to the nearest region. Policies, cache, and usage data use a write-once-read-many topology: writes go to us-east and are asynchronously replicated globally (eventual consistency ≤ 30s). Result: a single-region outage doesn't affect global availability, and policy changes propagate globally within 30s.
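The write-once-read-many rule reduces to a small routing decision; a minimal sketch with hypothetical names, where us-east is the designated writer region:

```rust
/// Illustrative write-once-read-many routing: reads are served by the
/// local region, all writes are forwarded to the single writer region,
/// and replication back out to the other regions is asynchronous.
fn region_for(is_write: bool, local_region: &str) -> String {
    const WRITER: &str = "us-east";
    if is_write { WRITER.to_string() } else { local_region.to_string() }
}
```

Accepting up to 30s of read staleness in the non-writer regions is what buys write simplicity plus read locality.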
Two-sided guardrails
Guardrails run on both input and output: input checks prompt injection / PII / topic blocks; output checks PII leakage / hallucination suppression / content moderation. Every hit becomes a guardrail span in the trace; DPOs can export with one click.
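The same machinery runs on both sides; a sketch with a toy check, where the check names and the detector are made up for illustration and every returned hit would become a guardrail span in the trace:

```rust
/// Toy PII detector used only for illustration.
fn contains_ssn(text: &str) -> bool {
    text.contains("SSN")
}

/// Illustrative two-sided guardrail pass: run every named check over the
/// text (input before the model call, output after) and collect the names
/// of the checks that fired, so each hit can be recorded as a span.
fn run_guardrails(text: &str, checks: &[(&str, fn(&str) -> bool)]) -> Vec<String> {
    checks
        .iter()
        .filter(|(_, fired)| fired(text))
        .map(|(name, _)| name.to_string())
        .collect()
}
```

Running an identical pass over input and output is what makes the one-click export symmetric: every finding, on either side, has the same span shape.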