Symptom-first triage
Find the root cause and the fix, starting from the symptom you see. Each entry gives a concrete check you can run in the console. Still stuck? Send us the trace_id.
Universal diagnostic playbook
- Get the trace_id. Always start here. Traces are ground truth; error messages can mislead.
- Search for the trace in the console. Inspect the decision tree: which candidate won, and why? Did fallback fire? Did guardrails hit?
- Compare the trace with what you expected. The discrepancy usually pinpoints the bug.
- Reproduce in the sandbox with an sk_test_ key. The sandbox mirrors production semantics, costs nothing, and lets you inject any error.
- Email us if stuck. Attach the trace_id; 1-business-day response.
Common issues
Symptom: All requests return 401 unauthorized
Happens: After rotating keys or deploying to a new environment
Likely causes:
- Key not propagated to prod (env var typo, secret manager not refreshed)
- Key revoked in console (check Settings → API Keys)
- Wrong Authorization header format (missing 'Bearer ' prefix)
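Fix: 1) on the server, confirm process.env.SKYAIAPP_API_KEY is non-empty, e.g. console.log(Boolean(process.env.SKYAIAPP_API_KEY)), so the key itself never lands in logs; 2) check the key's status in the console; 3) send a manual curl request to confirm.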
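The first and third causes above (a key that did not propagate cleanly, a malformed Authorization header) can be checked locally before any request is sent. The following is a minimal sketch; `checkApiKey` and `authHeader` are hypothetical helpers written for this page, not part of any SDK, and the specific checks (whitespace, stray quotes, a duplicated prefix) are assumptions about typical env-var mistakes.

```typescript
// Hypothetical helpers for illustration; not part of any SkyAIApp SDK.
type KeyCheck = { ok: boolean; problems: string[] };

// Inspect the raw env-var value without ever printing it.
function checkApiKey(raw: string | undefined): KeyCheck {
  if (raw === undefined || raw.length === 0) {
    return { ok: false, problems: ["key is missing or empty"] };
  }
  const problems: string[] = [];
  const trimmed = raw.trim();
  if (raw !== trimmed) {
    problems.push("key has leading or trailing whitespace");
  }
  if (/^["'].*["']$/.test(trimmed)) {
    problems.push("key is wrapped in quotes");
  }
  if (/^bearer\s/i.test(trimmed)) {
    problems.push("key already contains a 'Bearer ' prefix");
  }
  return { ok: problems.length === 0, problems };
}

// Build the header in the format the causes list describes: 'Bearer ' + key.
function authHeader(key: string): Record<string, string> {
  return { Authorization: `Bearer ${key.trim()}` };
}
```

On a server you would pass `process.env.SKYAIAPP_API_KEY` to `checkApiKey` and log only the returned `problems` array.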
Symptom: Occasional 504 router.timeout
Happens: P99 latency spikes or upstream provider hiccups
Likely causes:
- timeout_ms too short (default 60s; quality-first routing with 1M-token prompts often exceeds it)
- Fallback chain is all slow models
- Upstream provider regional incident (cross-check status page)
Fix: Increase timeout_ms to 90000 (90s); add a fast model (claude-haiku-4.5 / gpt-5.5-mini) at the end of the fallback chain as a last resort.
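As a rough illustration of that change, the sketch below builds request options with a longer timeout and a fast model pinned to the end of the chain. The `RouteOptions` shape is an assumption pieced together from the field names on this page (`timeout_ms`, `fallback.models`); `model-a` and `model-b` in the usage note are placeholders, not real model names.

```typescript
// Hypothetical options shape; only the field names come from this page.
interface RouteOptions {
  timeout_ms: number;
  fallback: { models: string[] };
}

// Put the fast model at the end of the chain exactly once.
function withLastResort(chain: string[], fastModel: string): string[] {
  return [...chain.filter((m) => m !== fastModel), fastModel];
}

// Convert seconds to milliseconds so the unit matches the field name.
function buildOptions(
  chain: string[],
  fastModel: string,
  timeoutSeconds = 90,
): RouteOptions {
  return {
    timeout_ms: timeoutSeconds * 1000,
    fallback: { models: withLastResort(chain, fastModel) },
  };
}
```

For example, `buildOptions(["model-a", "model-b"], "claude-haiku-4.5")` yields a 90000 ms timeout with `claude-haiku-4.5` as the final entry of the chain.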
Symptom: Cost is 30%+ higher than expected
Happens: First weekly billing review post-launch
Likely causes:
- Cache not enabled (cache: true never set)
- Inconsistent metadata → cache namespace fragmentation
- Default fallback chain is all premium (-pro) models
- Users repeat the same prompt — but you didn't use idempotency keys
Fix: 1) open console → Analytics → Cache to see the hit rate; 2) verify metadata is consistent per tenant; 3) declare fallback.models explicitly so cascading degradation works.
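One way to keep metadata consistent per tenant is to canonicalize it in a single place before every request. The sketch below is an assumption about what "consistent" needs to mean: how the cache derives its namespace from metadata is not described here, so the helper only normalizes key casing, surrounding whitespace, and key order, and leaves value casing alone. `canonicalMetadata` is a hypothetical name.

```typescript
// Hypothetical helper; the cache's real namespace rules are not documented here.
type Metadata = Record<string, string>;

function canonicalMetadata(meta: Metadata): Metadata {
  // Normalize keys (trim + lowercase) and trim values.
  const pairs: Array<[string, string]> = Object.keys(meta).map(
    (k): [string, string] => [k.trim().toLowerCase(), meta[k].trim()],
  );
  // Sort by key so serialization order is stable across call sites.
  pairs.sort((a, b) => (a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0));
  const out: Metadata = {};
  for (const [k, v] of pairs) {
    out[k] = v;
  }
  return out;
}
```

With this in place, `{ Tenant: " acme ", env: "prod" }` and `{ env: "prod", tenant: "acme" }` serialize to the same JSON string, so two call sites that differ only in formatting no longer look like two tenants.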
Symptom: Can't find my just-sent request in traces
Happens: Right after sending a call
Likely causes:
- Trace writes are async — P99 ~2s delay
- Used sk_test_ key — sandbox traces show in the sandbox dashboard, not prod
- Sampling rate < 100% (only adjustable on Enterprise)
Fix: Wait 5 seconds and refresh; verify you're on the dashboard that matches your key (production vs. sandbox); confirm res.traceId matches your search (copy-paste, no trailing spaces).
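Two of these checks are easy to automate: stripping stray whitespace from a copied trace ID, and working out which dashboard a key's traces land in. The sketch below assumes that any key not starting with `sk_test_` is a production key, which this page does not state explicitly; both function names are hypothetical.

```typescript
// Hypothetical helpers for illustration only.

// Remove whitespace picked up during copy-paste (spaces, tabs, newlines).
function normalizeTraceId(input: string): string {
  return input.trim();
}

// Sandbox keys start with sk_test_; everything else is assumed to be production.
function dashboardFor(apiKey: string): "sandbox" | "production" {
  return apiKey.trim().startsWith("sk_test_") ? "sandbox" : "production";
}

// Compare the ID returned by the call with the ID typed into the search box.
function sameTrace(fromResponse: string, fromSearchBox: string): boolean {
  return normalizeTraceId(fromResponse) === normalizeTraceId(fromSearchBox);
}
```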
Symptom: Webhook endpoint isn't receiving events
Happens: New deployment or just-added event subscription
Likely causes:
- Endpoint is not HTTPS
- Your endpoint returns ≥ 400 → marked failed → retry queue (max 5 attempts)
- Signature verification fails with 401 → your rejection is final
- Event type not subscribed in console
Fix: console → Webhooks → Deliveries shows every attempt with the full request and response. Replay one delivery manually to confirm your endpoint works.
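Because a 401 from signature verification is treated as a final rejection, it is worth testing the verification logic in isolation. The signing scheme is not described on this page, so the sketch below assumes an HMAC-SHA256 hex digest over the raw request body with a shared secret; check the actual scheme before relying on it. `signBody` and `verifySignature` are hypothetical names; the `node:crypto` calls are standard Node.js.

```typescript
import { Buffer } from "node:buffer";
import { createHmac, timingSafeEqual } from "node:crypto";

// ASSUMPTION: signature = hex(HMAC-SHA256(secret, rawBody)).
function signBody(secret: string, rawBody: string): string {
  return createHmac("sha256", secret).update(rawBody, "utf8").digest("hex");
}

// Verify against the raw body exactly as received; re-serialized JSON may differ.
function verifySignature(
  secret: string,
  rawBody: string,
  receivedHex: string,
): boolean {
  const expected = Buffer.from(signBody(secret, rawBody), "hex");
  const received = Buffer.from(receivedHex.trim(), "hex");
  // timingSafeEqual throws on length mismatch, so check lengths first.
  if (received.length !== expected.length) {
    return false;
  }
  return timingSafeEqual(expected, received);
}
```

A common source of false rejections is verifying a parsed-and-restringified body rather than the raw bytes, which is why the helper takes `rawBody` as a string.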
Symptom: Fallback never triggers
Happens: Primary occasionally fails but fallback_chain is empty in traces
Likely causes:
- You didn't pass fallback → router uses default (which may not include your expected models)
- The stability goal combined with the balanced strategy is conservative and biases toward not switching
- Upstream error was 4xx (non-retryable) → fallback won't fire
Fix: Pass fallback: { models: [...], maxRetries: 2 } explicitly; on a 4xx, fix the original request (fallback can't rescue you).
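The sketch below puts the two halves of that advice into code: building an explicit fallback object, and deciding from an upstream status whether fallback could have fired at all. The `{ models, maxRetries }` shape comes from this page; treating every 5xx as fallback-eligible is an assumption, and whether particular 4xx codes are handled differently is not covered here. Function names are hypothetical.

```typescript
// Shape taken from the text above: fallback: { models: [...], maxRetries: 2 }.
interface FallbackOptions {
  models: string[];
  maxRetries: number;
}

// Refuse an empty list so the router never silently uses its default chain.
function explicitFallback(models: string[], maxRetries = 2): FallbackOptions {
  if (models.length === 0) {
    throw new Error("fallback.models must list at least one model");
  }
  return { models: [...models], maxRetries };
}

// 4xx upstream errors are non-retryable per the causes above.
// ASSUMPTION: any 5xx upstream error is eligible for fallback.
function isFallbackEligible(upstreamStatus: number): boolean {
  return upstreamStatus >= 500;
}
```

When triaging a failed trace, a `false` from `isFallbackEligible` means the request itself needs fixing; no fallback configuration would have changed the outcome.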
Symptom: Agent hits max_steps without finishing
Happens: Complex multi-tool tasks
Likely causes:
- max_steps too low (default 8)
- Agent stuck in a tool-call loop — same tool repeatedly
- Tool descriptions vague, agent doesn't know when to stop
Fix: 1) raise max_steps to 20; 2) inspect the trace's step sequence for the loop; 3) tighten tool descriptions (state explicit success criteria).
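Spotting a tool-call loop by eye gets tedious on long traces. The following sketch scans a step sequence for the same tool called several times in a row. The `TraceStep` shape (a single `tool` field) is a simplifying assumption about what a trace export contains, and the threshold of three consecutive calls is arbitrary.

```typescript
// ASSUMPTION: each trace step exposes the name of the tool it called.
interface TraceStep {
  tool: string;
}

interface ToolLoop {
  tool: string;
  startIndex: number;
  count: number;
}

// Return the first run of `minRepeats` or more consecutive calls to one tool.
function findToolLoop(steps: TraceStep[], minRepeats = 3): ToolLoop | null {
  let runStart = 0;
  for (let i = 1; i <= steps.length; i++) {
    // A run ends at the end of the list or when the tool changes.
    if (i === steps.length || steps[i].tool !== steps[runStart].tool) {
      const count = i - runStart;
      if (count >= minRepeats) {
        return { tool: steps[runStart].tool, startIndex: runStart, count };
      }
      runStart = i;
    }
  }
  return null;
}
```

This only catches back-to-back repeats; an agent alternating between two tools would need a different check, such as counting calls per tool.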
Symptom: PII false positives
Happens: Clean prompts being flagged
Likely causes:
- Default PII set is broad (includes IP / timestamps / ID patterns)
- Business IDs (e.g. order_12345) caught by ID pattern
Fix: Disable specific types in console → Guardrails, or configure an allowlist for that workflow.
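Before entering allowlist patterns in the console, it can help to test them locally against the matches that were flagged. The sketch below is purely local logic: the `PiiFinding` shape and the `filterFindings` helper are hypothetical, and how the console expresses an allowlist is not shown on this page. The `order_` pattern mirrors the business-ID example above.

```typescript
// ASSUMPTION: a flagged finding has a detector type and the matched text.
interface PiiFinding {
  type: string;
  match: string;
}

// Anchored patterns so only whole-token business IDs are allowed through.
const businessIdAllowlist: RegExp[] = [/^order_\d+$/];

// Keep only the findings that no allowlist pattern covers.
function filterFindings(findings: PiiFinding[], allow: RegExp[]): PiiFinding[] {
  return findings.filter((f) => !allow.some((re) => re.test(f.match)));
}
```

Anchoring each pattern with `^` and `$` matters here: an unanchored `order_\d+` would also excuse a longer string that merely contains an order ID.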
Quick diagnostic commands
# Confirm key works
curl -s -o /dev/null -w "%{http_code}\n" \
-H "Authorization: Bearer $SKYAIAPP_API_KEY" \
https://api.skyaiapp.com/v1/models
# Expect: 200
# Force a routing decision and inspect
curl https://api.skyaiapp.com/v1/route \
-H "Authorization: Bearer $SKYAIAPP_API_KEY" \
-H "Content-Type: application/json" \
-d '{"goal":"cost","messages":[{"role":"user","content":"ping"}]}' \
| jq '.routing'
# Pull a trace by ID
curl -s https://api.skyaiapp.com/v1/traces/tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9 \
-H "Authorization: Bearer $SKYAIAPP_API_KEY" \
| jq '.routing.decision_reason'
# Tail recent failed traces (CI / oncall)
curl -G https://api.skyaiapp.com/v1/traces \
-H "Authorization: Bearer $SKYAIAPP_API_KEY" \
--data-urlencode "filter=status=failed" \
--data-urlencode "limit=20" | jq '.data[] | {id,trace_id,error}'
See also
- Error codes & retries: code lookup + retry matrix
- Architecture: understand the decision logic
- Testing & mocking: reproduce bugs in CI