How the router actually decides
After reading this page you should be able to answer: how does SkyAIApp represent candidate models? How is scoring computed? What is the cache lookup algorithm? When do fallbacks fire? How do traces form a tree? And, for each, why we picked these designs.
Request lifecycle
How one request is governed, routed, executed, and audited
Edge ingress
TLS, WAF, API key, tenant identity, and rate limits run before the request enters the control plane.
Normalize request
OpenAI-compatible calls, native route requests, and metadata are normalized into one auditable envelope.
Input guardrails
Injection, PII, tenant boundary, and residency checks run before spending on a model call.
Semantic cache lookup
Namespace, embedding similarity, and TTL determine cache hits; a hit returns immediately and skips the model call.
Candidate filtering
Model pools are filtered by context window, modality, region, RBAC, budget, and provider health.
Score and rank
Strategy weights normalize cost, latency, quality, and reliability to produce the primary and fallback chain.
Execute primary or fallback
Timeouts, 5xx responses, safety hits, or budget changes trigger a handoff to the fallback chain, with candidates rechecked before execution.
Assemble response
Model output, routing metadata, guardrail results, and trace ID are assembled before returning to the app.
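The lifecycle above can be sketched as one normalized envelope passing through a fixed, ordered list of stages, where a cache hit short-circuits everything after the lookup. A minimal sketch; the type and field names here are illustrative, not SkyAIApp's actual API:

```rust
use std::collections::BTreeMap;

/// Illustrative envelope: every ingress format (OpenAI-compatible, native
/// route, metadata) is normalized into one auditable shape before any
/// policy or routing decision runs.
#[derive(Debug, Clone)]
struct RequestEnvelope {
    trace_id: String,
    tenant: String,
    region: String,
    prompt: String,
    metadata: BTreeMap<String, String>,
}

/// The ordered stages a request passes through.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    Ingress,
    Normalize,
    InputGuardrails,
    CacheLookup,
    CandidateFilter,
    ScoreAndRank,
    Execute,
    AssembleResponse,
}

/// Stages executed for a given request: all eight on a miss, only the
/// first four on a cache hit (the hit returns immediately after lookup).
fn stages(cache_hit: bool) -> Vec<Stage> {
    use Stage::*;
    let all = vec![
        Ingress, Normalize, InputGuardrails, CacheLookup,
        CandidateFilter, ScoreAndRank, Execute, AssembleResponse,
    ];
    if cache_hit {
        all.into_iter().take_while(|s| *s != CandidateFilter).collect()
    } else {
        all
    }
}
```

Representing the stage order as data (rather than scattered control flow) is what makes the "hit skips the expensive steps" guarantee easy to audit.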
Why this shape
Why is cache lookup before candidate filter?
A cache hit is the free path: skipping the model call skips the most expensive step. By looking up before candidate filtering, a hit executes only the first four steps (through the lookup itself) and returns immediately.
Why is budget checked both in filter and fallback?
The filter excludes 'obviously over budget' candidates. After a fallback fires, we re-account for already-spent cost — what was a borderline candidate may now exceed the budget.
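The re-accounting can be sketched as follows; function and parameter names are hypothetical, not the router's actual internals. The key point is that fallback candidates are filtered against what remains of the budget after the failed attempt, not against the original budget:

```rust
/// Illustrative budget recheck. Each candidate is (model name, estimated
/// cost in USD). A failed primary attempt still costs money, so on
/// fallback we filter against `budget - spent`, not `budget`.
fn affordable<'a>(candidates: &[(&'a str, f64)], budget: f64, spent: f64) -> Vec<&'a str> {
    let remaining = budget - spent;
    candidates
        .iter()
        .filter(|(_, est_cost)| *est_cost <= remaining)
        .map(|(name, _)| *name)
        .collect()
}
```

A candidate estimated at $0.04 passes a $0.05 budget initially, but after a failed attempt has already spent $0.02, the same candidate exceeds the $0.03 that remains.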
Why score with 1/normalized rather than w * cost?
Cost spans 3 orders of magnitude ($0.0001–$0.05). Direct multiplication is dominated by extremes. Normalizing to [0,1] then inverting makes 'cheap+slow' and 'expensive+fast' actually comparable under balanced.
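One plausible reading of this scheme, sketched below: min-max normalize each metric onto [0, 1], invert the "lower is better" axes, then take a weighted sum. The exact inversion and weights are assumptions for illustration, not the documented formula:

```rust
/// Min-max normalize onto [0, 1], so a $0.0001 model and a $0.05 model
/// land on the same scale before weighting.
fn normalize(xs: &[f64]) -> Vec<f64> {
    let (min, max) = xs
        .iter()
        .fold((f64::MAX, f64::MIN), |(lo, hi), &x| (lo.min(x), hi.max(x)));
    let span = (max - min).max(f64::EPSILON); // avoid divide-by-zero
    xs.iter().map(|&x| (x - min) / span).collect()
}

/// Hypothetical balanced score: cost and latency are "lower is better",
/// so those normalized axes are inverted before the weighted sum.
fn score(cost_n: f64, latency_n: f64, quality_n: f64, w: (f64, f64, f64)) -> f64 {
    w.0 * (1.0 - cost_n) + w.1 * (1.0 - latency_n) + w.2 * quality_n
}
```

Without the normalization step, `w * cost` over a $0.0001–$0.05 range lets the cost term drown out every other axis.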
Why tie-break by provider diversity?
If primary + secondary are both from OpenAI, an OpenAI outage takes down both. Diversity tie-break ensures the fallback chain provides genuine resilience.
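A sketch of the tie-break, with hypothetical model and provider names: among equally ranked fallbacks, prefer the first whose provider differs from the primary's, falling back to the best remaining candidate if no diverse option exists:

```rust
/// Illustrative provider-diversity tie-break. `primary` and each ranked
/// entry are (model, provider) pairs; names here are made up.
fn pick_fallback<'a>(
    primary: (&'a str, &'a str),
    ranked: &[(&'a str, &'a str)],
) -> Option<&'a str> {
    ranked
        .iter()
        .find(|(_, provider)| *provider != primary.1) // first diverse candidate
        .or_else(|| ranked.first())                    // else best remaining
        .map(|(model, _)| *model)
}
```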
Why is trace writing async?
Zero impact on response latency. Traces go through a fire-and-forget queue with eventual consistency. Cost: P99 write latency ~2s (a 'just now' request may briefly not show in dashboards).
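The fire-and-forget shape can be sketched with a plain channel and a background drain thread; a real deployment would use a durable queue, and the names here are illustrative:

```rust
use std::sync::mpsc;
use std::thread;

/// Illustrative async trace writer: the response path pushes a span onto
/// a channel and returns immediately; a background thread drains the
/// channel and performs the slow, eventually-consistent write.
fn spawn_trace_writer() -> mpsc::Sender<String> {
    let (tx, rx) = mpsc::channel::<String>();
    thread::spawn(move || {
        for span in rx {
            // Stand-in for the real write (~2s p99 per the numbers above),
            // e.g. trace_store.append(span).
            let _ = span;
        }
    });
    tx
}
```

`send` on the returned handle is the only cost paid on the hot path, which is why a "just now" request can briefly be missing from dashboards.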
Semantic cache internals
Cache is a namespaced vector index (HNSW) + content store (S3-compatible). Lookup flow:
use std::time::{Duration, SystemTime};

fn cache_lookup(ns: &str, prompt: &str, similarity: f32, ttl: Duration)
    -> Option<CachedResponse>
{
    // 1. Embed the prompt with a small embedder (~5ms p50)
    let q_vec: [f32; 384] = embed_small(prompt);
    // 2. ANN search in the namespaced HNSW index;
    //    k=8 candidates returned with cosine similarities
    let candidates = hnsw.search(ns, &q_vec, 8);
    // 3. Filter by threshold AND TTL, keep the most similar survivor
    let now = SystemTime::now();
    let best = candidates
        .iter()
        .filter(|c| c.similarity >= similarity)
        .filter(|c| now.duration_since(c.created_at).map_or(false, |age| age <= ttl))
        .max_by(|a, b| a.similarity.partial_cmp(&b.similarity).unwrap());
    best.and_then(|hit| {
        // 4. Fetch content (stored separately from the index for cost reasons)
        let body = content_store.get(&hit.content_key)?;
        Some(CachedResponse {
            output: body,
            similarity: hit.similarity,
            stored_at: hit.created_at,
        })
    })
}

Why a small embedder (not OpenAI ada-002): the lookup cost cannot exceed the saved cost. We use a distilled MPNet (384-dim); a cache hit costs ~5ms and ~$0.00001 per lookup.
How traces form a tree
Each trace is a span tree. The router creates a root span; each internal step (cache, filter, score, execute) is a child span. Agent runs nest each step as a child; tool calls inside an agent step are grandchildren.
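A span tree like the example trace below can be represented as a simple recursive node; this is an illustrative structure, not the router's actual trace schema:

```rust
/// Illustrative span node: each trace is a tree rooted at the route span,
/// with internal steps as children and nested agent/tool calls deeper down.
#[derive(Debug)]
struct Span {
    name: String,
    duration_ms: f64,
    children: Vec<Span>,
}

impl Span {
    fn new(name: &str, duration_ms: f64) -> Self {
        Span { name: name.to_string(), duration_ms, children: vec![] }
    }

    /// Builder-style helper for attaching a child span.
    fn child(mut self, c: Span) -> Self {
        self.children.push(c);
        self
    }

    /// Total number of spans in the tree, root included.
    fn count(&self) -> usize {
        1 + self.children.iter().map(Span::count).sum::<usize>()
    }
}
```

For example, a root `route` span with a `cache.lookup` child and an `execute.primary` child that nests `http.openai` yields a four-span tree.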
tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9 (root: route, 1820ms)
├─ ingress (3ms)
├─ normalize (1ms)
├─ cache.lookup (5ms) miss
├─ filter (1ms) 12 candidates
├─ score (2ms) gpt-5.5-pro wins
├─ execute.primary (1810ms)
│ └─ http.openai (1808ms) 200 OK
│ ├─ tcp.connect (45ms)
│ ├─ tls.handshake (60ms)
│ └─ stream.read (1700ms)
├─ guardrail.output (1ms) PII clean
├─ cache.write (4ms) stored
└─ billing.emit (0.4ms)

Multi-region active-active
The router runs active-active in three regions (us-east, eu-west, ap-southeast). GeoDNS routes each request to the nearest region. Policies, cache, and usage data use a write-once-read-many topology: writes go to us-east and are asynchronously replicated globally (eventual consistency ≤ 30s). Result: a single-region outage doesn't affect global availability, and policy changes propagate globally within 30s.
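The write-once-read-many rule reduces to a small routing decision; a minimal sketch with hypothetical names, where us-east is the designated writer region:

```rust
/// Illustrative write-once-read-many routing: reads are served by the
/// local region, all writes are forwarded to the single writer region,
/// and replication back out to the other regions is asynchronous.
fn region_for(is_write: bool, local_region: &str) -> String {
    const WRITER: &str = "us-east";
    if is_write { WRITER.to_string() } else { local_region.to_string() }
}
```

Accepting up to 30s of read staleness in the non-writer regions is what buys write simplicity plus read locality.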
Two-sided guardrails
Guardrails run on both input and output: input checks prompt injection / PII / topic blocks; output checks PII leakage / hallucination suppression / content moderation. Every hit becomes a guardrail span in the trace; DPOs can export with one click.
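The same machinery runs on both sides; a sketch with a toy check, where the check names and the detector are made up for illustration and every returned hit would become a guardrail span in the trace:

```rust
/// Toy PII detector used only for illustration.
fn contains_ssn(text: &str) -> bool {
    text.contains("SSN")
}

/// Illustrative two-sided guardrail pass: run every named check over the
/// text (input before the model call, output after) and collect the names
/// of the checks that fired, so each hit can be recorded as a span.
fn run_guardrails(text: &str, checks: &[(&str, fn(&str) -> bool)]) -> Vec<String> {
    checks
        .iter()
        .filter(|(_, fired)| fired(text))
        .map(|(name, _)| name.to_string())
        .collect()
}
```

Running an identical pass over input and output is what makes the one-click export symmetric: every finding, on either side, has the same span shape.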