Troubleshooting

Symptom-first triage

Find root cause and fix by the symptom you see. Each entry gives a concrete check you can run in the console. Still stuck? Send us the trace_id.

Universal diagnostic playbook

Get the trace_id. Always start here. Traces are ground truth; error messages can mislead.
Search the trace in console. Inspect the decision tree: which candidate won? why? did fallback fire? did guardrails hit?
Compare trace vs expectation. The discrepancy usually pinpoints the bug.
Reproduce in sandbox with sk_test_. Sandbox mirrors production semantics, costs nothing, lets you inject any error.
Email us if stuck. Attach the trace_id; 1-business-day response.

Common issues

Symptom

All requests return 401 unauthorized

Happens: After rotating keys or deploying to new environment

Likely root causes

Key not propagated to prod (env var typo, secret manager not refreshed)
Key revoked in console (check Settings → API Keys)
Wrong Authorization header format (missing 'Bearer ' prefix)

Fix

1) console.log(process.env.SKYAIAPP_API_KEY) on the server to confirm non-empty; 2) check key status in console; 3) curl manually to confirm.

Auth docs

Symptom

Occasional 504 router.timeout

Happens: P99 latency spikes or upstream provider hiccups

Likely root causes

timeout_ms too short (default 60s, but quality-first + 1M token prompts often exceed)
Fallback chain is all slow models
Upstream provider regional incident (cross-check status page)

Fix

Increase timeout_ms to 90s; add a fast model (claude-haiku-4.5 / gpt-5.5-mini) at the end of the fallback chain as a last-resort.

Error handling + retry matrix

Symptom

Cost is 30%+ higher than expected

Happens: First weekly billing review post-launch

Likely root causes

Cache not enabled (cache: true never set)
Inconsistent metadata → cache namespace fragmentation
Default fallback chain is all premium (-pro) models
Users repeat the same prompt — but you didn't use idempotency keys

Fix

1) console → Analytics → Cache to see hit rate; 2) verify metadata is consistent per tenant; 3) declare fallback.models explicitly so cascading degradation works.

Cost optimization guide

Symptom

Can't find my just-sent request in traces

Happens: Right after sending a call

Likely root causes

Trace writes are async — P99 ~2s delay
Used sk_test_ key — sandbox traces show in the sandbox dashboard, not prod
Sampling rate < 100% (only adjustable on Enterprise)

Fix

Wait 5 seconds and refresh; verify you're on the production dashboard; confirm res.traceId matches your search (copy-paste, no trailing spaces).

Symptom

Webhook endpoint isn't receiving events

Happens: New deployment or just-added event subscription

Likely root causes

Endpoint is not HTTPS
Your endpoint returns ≥ 400 → marked failed → retry queue (max 5 attempts)
Signature verification fails with 401 → your rejection is final
Event type not subscribed in console

Fix

console → Webhooks → Deliveries shows every attempt with full request/response. Manually replay once to confirm your endpoint works.

Webhooks docs

Symptom

Fallback never triggers

Happens: Primary occasionally fails but fallback_chain is empty in traces

Likely root causes

You didn't pass fallback → router uses default (which may not include your expected models)
stability goal + balanced strategy biases toward not switching (conservative)
Upstream error was 4xx (non-retryable) → fallback won't fire

Fix

Pass fallback: { models: [...], maxRetries: 2 } explicitly; on 4xx, fix the original request (fallback can't rescue you).

Strategy guide

Symptom

Agent hits max_steps without finishing

Happens: Complex multi-tool tasks

Likely root causes

max_steps too low (default 8)
Agent stuck in a tool-call loop — same tool repeatedly
Tool descriptions vague, agent doesn't know when to stop

Fix

1) Raise max_steps to 20; 2) inspect trace step sequence for the loop; 3) tighten tool descriptions (explicit success criteria).

Symptom

PII false positives

Happens: Clean prompts being flagged

Likely root causes

Default PII set is broad (includes IP / timestamps / ID patterns)
Business IDs (e.g. order_12345) caught by ID pattern

Fix

Disable specific types in console → Guardrails; or configure an allowlist for that workflow.

Security guide

Quick diagnostic commands

# Confirm key works
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  https://api.skyaiapp.com/v1/models
# Expect: 200

# Force a routing decision and inspect
curl https://api.skyaiapp.com/v1/route \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"goal":"cost","messages":[{"role":"user","content":"ping"}]}' \
  | jq '.routing'

# Pull a trace by ID
curl -s https://api.skyaiapp.com/v1/traces/tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9 \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  | jq '.routing.decision_reason'

# Tail recent failed traces (CI / oncall)
curl -G https://api.skyaiapp.com/v1/traces \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  --data-urlencode "filter=status=failed" \
  --data-urlencode "limit=20" | jq '.data[] | {id,trace_id,error}'

Was this page helpful?

Let us know how we can improve

Symptom-first triage

Universal diagnostic playbook

Common issues

Quick diagnostic commands

See also

Was this page helpful?