Troubleshooting

Symptom-first triage

Find root cause and fix by the symptom you see. Each entry gives a concrete check you can run in the console. Still stuck? Send us the trace_id.

Universal diagnostic playbook

  1. Get the trace_id. Always start here. Traces are ground truth; error messages can mislead.
  2. Search the trace in console. Inspect the decision tree: which candidate won? why? did fallback fire? did guardrails hit?
  3. Compare trace vs expectation. The discrepancy usually pinpoints the bug.
  4. Reproduce in sandbox with sk_test_. Sandbox mirrors production semantics, costs nothing, lets you inject any error.
  5. Email us if stuck. Attach the trace_id; 1-business-day response.

Common issues

Symptom

All requests return 401 unauthorized

Happens: After rotating keys or deploying to new environment

Likely root causes
  • Key not propagated to prod (env var typo, secret manager not refreshed)
  • Key revoked in console (check Settings → API Keys)
  • Wrong Authorization header format (missing 'Bearer ' prefix)
Fix

1) console.log(process.env.SKYAIAPP_API_KEY) on the server to confirm non-empty; 2) check key status in console; 3) curl manually to confirm.

Auth docs
Symptom

Occasional 504 router.timeout

Happens: P99 latency spikes or upstream provider hiccups

Likely root causes
  • timeout_ms too short (default 60s, but quality-first + 1M token prompts often exceed)
  • Fallback chain is all slow models
  • Upstream provider regional incident (cross-check status page)
Fix

Increase timeout_ms to 90s; add a fast model (claude-haiku-4.5 / gpt-5.5-mini) at the end of the fallback chain as a last-resort.

Error handling + retry matrix
Symptom

Cost is 30%+ higher than expected

Happens: First weekly billing review post-launch

Likely root causes
  • Cache not enabled (cache: true never set)
  • Inconsistent metadata → cache namespace fragmentation
  • Default fallback chain is all premium (-pro) models
  • Users repeat the same prompt — but you didn't use idempotency keys
Fix

1) console → Analytics → Cache to see hit rate; 2) verify metadata is consistent per tenant; 3) declare fallback.models explicitly so cascading degradation works.

Cost optimization guide
Symptom

Can't find my just-sent request in traces

Happens: Right after sending a call

Likely root causes
  • Trace writes are async — P99 ~2s delay
  • Used sk_test_ key — sandbox traces show in the sandbox dashboard, not prod
  • Sampling rate < 100% (only adjustable on Enterprise)
Fix

Wait 5 seconds and refresh; verify you're on the production dashboard; confirm res.traceId matches your search (copy-paste, no trailing spaces).

Symptom

Webhook endpoint isn't receiving events

Happens: New deployment or just-added event subscription

Likely root causes
  • Endpoint is not HTTPS
  • Your endpoint returns ≥ 400 → marked failed → retry queue (max 5 attempts)
  • Signature verification fails with 401 → your rejection is final
  • Event type not subscribed in console
Fix

console → Webhooks → Deliveries shows every attempt with full request/response. Manually replay once to confirm your endpoint works.

Webhooks docs
Symptom

Fallback never triggers

Happens: Primary occasionally fails but fallback_chain is empty in traces

Likely root causes
  • You didn't pass fallback → router uses default (which may not include your expected models)
  • stability goal + balanced strategy biases toward not switching (conservative)
  • Upstream error was 4xx (non-retryable) → fallback won't fire
Fix

Pass fallback: { models: [...], maxRetries: 2 } explicitly; on 4xx, fix the original request (fallback can't rescue you).

Strategy guide
Symptom

Agent hits max_steps without finishing

Happens: Complex multi-tool tasks

Likely root causes
  • max_steps too low (default 8)
  • Agent stuck in a tool-call loop — same tool repeatedly
  • Tool descriptions vague, agent doesn't know when to stop
Fix

1) Raise max_steps to 20; 2) inspect trace step sequence for the loop; 3) tighten tool descriptions (explicit success criteria).

Symptom

PII false positives

Happens: Clean prompts being flagged

Likely root causes
  • Default PII set is broad (includes IP / timestamps / ID patterns)
  • Business IDs (e.g. order_12345) caught by ID pattern
Fix

Disable specific types in console → Guardrails; or configure an allowlist for that workflow.

Security guide

Quick diagnostic commands

# Confirm key works
curl -s -o /dev/null -w "%{http_code}\n" \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  https://api.skyaiapp.com/v1/models
# Expect: 200

# Force a routing decision and inspect
curl https://api.skyaiapp.com/v1/route \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"goal":"cost","messages":[{"role":"user","content":"ping"}]}' \
  | jq '.routing'

# Pull a trace by ID
curl -s https://api.skyaiapp.com/v1/traces/tr_01JFGYZ7K8M2N3P4Q5R6S7T8U9 \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  | jq '.routing.decision_reason'

# Tail recent failed traces (CI / oncall)
curl -G https://api.skyaiapp.com/v1/traces \
  -H "Authorization: Bearer $SKYAIAPP_API_KEY" \
  --data-urlencode "filter=status=failed" \
  --data-urlencode "limit=20" | jq '.data[] | {id,trace_id,error}'

See also

Need human help? Bring your trace_id: Email the founders

Was this page helpful?

Let us know how we can improve

Troubleshooting | SkyAIApp Docs — SkyAIApp