Testing & mocking

Safely test AI apps in CI

Three-layer test strategy: unit (mock SDK) / integration (sk_test_ + sandbox) / E2E (record + replay). Concrete code for every layer.

The three layers

| Layer | What it tests | Network | Cost | Speed |
| --- | --- | --- | --- | --- |
| Unit (90%) | Your business logic, error handling, param construction | None (mocked) | $0 | ms |
| Integration (9%) | SDK ↔ Router interface, serialization, signatures | sandbox | $0 | 100s of ms |
| E2E (1%) | Full business flow, real model quality | prod / staging | per call | seconds |

Rule of thumb: 90% unit + 9% integration + 1% E2E. Run E2E nightly, not on every PR — keep PR feedback under 5 minutes.

Layer 1 — Unit tests (mock SDK)

The SDK ships with a mock mode: mock: true disables all network calls, and the mock API (sky.__mock in TypeScript, sky.__mock__ in Python) lets you declare what each request should return.

// vitest — TypeScript
import { SkyAI, RateLimitError, RouterTimeoutError } from "@skyaiapp/sdk";
import { describe, test, expect, beforeEach } from "vitest";

let sky: SkyAI;
beforeEach(() => {
  sky = new SkyAI({ apiKey: "sk_test_x", mock: true });
});

describe("summarize()", () => {
  test("routes to a cheap model on cost goal", async () => {
    sky.__mock.mockRoute({
      goal: "cost",
      response: {
        output: "ok.",
        routing: { selectedModel: "claude-haiku-4.5", costUsd: 0.0001, latencyMs: 150, cacheHit: false },
        traceId: "tr_test_001",
      },
    });

    const res = await summarize(sky, "doc...");
    expect(res.model).toBe("claude-haiku-4.5");
    expect(res.cost).toBeLessThan(0.001);
  });

  test("retries once on rate limit", async () => {
    sky.__mock.mockRoute({
      goal: "cost",
      throw: new RateLimitError({ retryAfterMs: 50, code: "rate_limit.account" }),
    });
    sky.__mock.mockRoute({
      goal: "cost",
      response: { output: "ok.", routing: { selectedModel: "x", costUsd: 0, latencyMs: 0, cacheHit: false }, traceId: "t" },
    });

    const res = await summarizeWithRetry(sky, "doc...");
    expect(res).toBeTruthy();
    expect(sky.__mock.callCount("route")).toBe(2);
  });

  test("returns null on timeout (graceful degradation)", async () => {
    sky.__mock.mockRoute({
      goal: "cost",
      throw: new RouterTimeoutError({ code: "router.timeout" }),
    });
    expect(await summarize(sky, "doc...")).toBeNull();
  });
});

# pytest — Python equivalent
import pytest
from skyaiapp import SkyAI, RouterTimeoutError, RateLimitError
from skyaiapp.testing import mock_route

@pytest.fixture
def sky():
    return SkyAI(api_key="sk_test_x", mock=True)

@pytest.mark.asyncio
async def test_routes_cheap_on_cost_goal(sky):
    sky.__mock__.add(mock_route(
        goal="cost",
        response={
            "output": "ok.",
            "routing": {"selected_model": "claude-haiku-4.5", "cost_usd": 0.0001, "latency_ms": 150, "cache_hit": False},
            "trace_id": "tr_test_001",
        },
    ))
    res = await summarize(sky, "doc...")
    assert res.model == "claude-haiku-4.5"
    assert res.cost < 0.001
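The tests above exercise summarize() and summarizeWithRetry() without defining them. A minimal sketch of what such helpers could look like, assuming the client shape implied by the mock responses above (the types and error fields here are illustrative assumptions, not official SDK exports):

```typescript
// Hypothetical shapes mirroring the mock responses above — not official SDK types.
interface RouteResult {
  output: string;
  routing: { selectedModel: string; costUsd: number; latencyMs: number; cacheHit: boolean };
  traceId: string;
}
interface RouterClient {
  route(req: { goal: string; messages: { role: string; content: string }[] }): Promise<RouteResult>;
}

// summarize(): build the request, map routing metadata onto a small result object,
// and degrade gracefully (return null) when the router times out.
async function summarize(sky: RouterClient, doc: string) {
  try {
    const res = await sky.route({
      goal: "cost",
      messages: [{ role: "user", content: `Summarize in 2 sentences:\n${doc}` }],
    });
    return { output: res.output, model: res.routing.selectedModel, cost: res.routing.costUsd };
  } catch (err) {
    if ((err as { code?: string }).code === "router.timeout") return null; // graceful degradation
    throw err;
  }
}

// summarizeWithRetry(): retry exactly once on a rate-limit error, honoring retryAfterMs.
async function summarizeWithRetry(sky: RouterClient, doc: string) {
  try {
    return await summarize(sky, doc);
  } catch (err) {
    const e = err as { code?: string; retryAfterMs?: number };
    if (e.code?.startsWith("rate_limit")) {
      await new Promise((r) => setTimeout(r, e.retryAfterMs ?? 100));
      return summarize(sky, doc);
    }
    throw err;
  }
}
```

Structuring the helpers this way keeps the retry decision (rate limit) separate from the degradation decision (timeout), which is what lets each unit test above assert one behavior in isolation.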

Layer 2 — Integration (sk_test_ + sandbox)

Sandbox mirrors production semantics, never bills, returns canned model output. CI uses sk_test_ keys against the sandbox URL.

# .env.ci
SKYAIAPP_API_KEY=sk_test_01JEXAMPLE...
SKYAIAPP_BASE_URL=https://api-sandbox.skyaiapp.com

// integration.test.ts
import { SkyAI } from "@skyaiapp/sdk";

const sky = new SkyAI({
  apiKey:  process.env.SKYAIAPP_API_KEY!,
  baseURL: process.env.SKYAIAPP_BASE_URL,
});

test("real round-trip — verify SDK ↔ Router contract", async () => {
  // Sandbox returns deterministic canned outputs you can assert on.
  const res = await sky.route({
    goal: "cost",
    messages: [{ role: "user", content: "__sandbox_echo__: hello" }],
  });
  expect(res.output).toContain("hello");
  expect(res.traceId).toMatch(/^tr_/);
});

Sandbox also supports error injection: include __sandbox_error__:rate_limit in the prompt to trigger a 429, or __sandbox_error__:upstream_5xx for a 502. This lets you exercise error-handling paths deterministically, without waiting for real failures.
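To reason about how injected errors surface before pointing tests at the real sandbox, it can help to model the directive convention locally. The stand-in below is purely illustrative (it is not part of the SDK); it mimics the __sandbox_error__ and __sandbox_echo__ behaviors described on this page:

```typescript
// Illustrative local stand-in for the sandbox's directive convention — not the SDK.
type Msg = { role: string; content: string };

function fakeSandboxRoute(messages: Msg[]): { output: string } {
  const prompt = messages.map((m) => m.content).join("\n");
  if (prompt.includes("__sandbox_error__:rate_limit")) {
    // Sandbox would respond 429; model that as a thrown error with a status.
    throw Object.assign(new Error("429 Too Many Requests"), { status: 429, code: "rate_limit.account" });
  }
  if (prompt.includes("__sandbox_error__:upstream_5xx")) {
    throw Object.assign(new Error("502 Bad Gateway"), { status: 502, code: "upstream.5xx" });
  }
  // __sandbox_echo__ returns the text after the directive, per the integration test above.
  const echo = prompt.match(/__sandbox_echo__:\s*(.*)/);
  return { output: echo ? echo[1] : "canned output" };
}
```

The value of the directive convention is that both the happy path and each error path are driven from the prompt alone, so integration tests need no out-of-band fault-injection setup.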

Layer 3 — E2E (real models + recorded fixtures)

Run E2E sparingly (nightly) against real models to verify quality. Use record/replay: first run hits real models and records, subsequent runs replay locally — no recurring spend.

// e2e.test.ts
import { SkyAI, recordReplay } from "@skyaiapp/sdk";

const sky = new SkyAI({
  apiKey: process.env.SKYAIAPP_API_KEY!,
  // record-replay middleware — first run records to disk, later runs replay.
  middleware: [recordReplay({
    cassetteDir: "./test/cassettes",
    mode: process.env.UPDATE_CASSETTES === "1" ? "record" : "replay",
  })],
});

test("E2E — summarizer produces a 2-sentence output", async () => {
  const res = await summarize(sky, longDoc);
  expect(res.output.split(".").filter(s => s.trim()).length).toBe(2);
  expect(res.output.length).toBeLessThan(400);
});

// To re-record cassettes (after a prompt change):
//   UPDATE_CASSETTES=1 npm run test:e2e
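Under the hood, record/replay is keyed memoization of responses. A minimal sketch of the idea (in-memory Map for brevity where the real middleware persists to ./test/cassettes; the callModel parameter is a hypothetical stand-in for the real request):

```typescript
// Minimal record/replay sketch: key each request, store the response on the first
// (record) run, serve it from the cassette on later (replay) runs.
type Cassette = Map<string, string>;

function withRecordReplay(
  callModel: (prompt: string) => Promise<string>,
  cassette: Cassette,
  mode: "record" | "replay",
) {
  return async (prompt: string): Promise<string> => {
    const key = JSON.stringify(prompt); // a real implementation would hash the full request
    if (mode === "replay") {
      const hit = cassette.get(key);
      if (hit === undefined) throw new Error(`no cassette entry for: ${prompt}`);
      return hit; // no network call, no spend
    }
    const out = await callModel(prompt); // record mode: hit the real model once
    cassette.set(key, out);
    return out;
  };
}
```

Failing loudly on a cassette miss (rather than silently falling through to the network) is what guarantees replay runs never spend money; the fix for a miss is an explicit re-record run, as shown above.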

CI configuration

# .github/workflows/test.yml
name: test
on:
  push:
  pull_request:
  schedule:
    - cron: "0 3 * * *"   # nightly trigger — required for the e2e job's `schedule` gate below
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npm run test:unit          # mock SDK; no secrets needed
  integration:
    runs-on: ubuntu-latest
    needs: unit
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npm run test:integration
        env:
          SKYAIAPP_API_KEY: ${{ secrets.SKYAIAPP_TEST_KEY }}      # sk_test_…
          SKYAIAPP_BASE_URL: https://api-sandbox.skyaiapp.com
  e2e:
    if: github.event_name == 'schedule'   # nightly only
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: 20 }
      - run: npm ci
      - run: npm run test:e2e
        env:
          SKYAIAPP_API_KEY: ${{ secrets.SKYAIAPP_LIVE_KEY }}      # sk_live_…

Common anti-patterns

  • Using sk_live_ keys in CI

    Why it's bad: Bills on every PR + pollutes prod dashboard with test traces. Always sk_test_.

  • Mocking fetch instead of the SDK interface

    Why it's bad: Breaks every SDK upgrade. Always mock at the SDK boundary.

  • Asserting exact model output

    Why it's bad: LLMs are non-deterministic. Assert structure (length, fields, format regex), not exact strings.

  • Running E2E on every PR

    Why it's bad: PR cycle stretches → review momentum dies. Move E2E to nightly.
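The "assert structure, not strings" advice can be made concrete with a couple of small helpers. These are illustrative names, not SDK utilities, and the sentence counting is a naive split, which is an acceptable trade-off for smoke tests:

```typescript
// Structural assertions that stay green across non-deterministic model outputs.
function countSentences(text: string): number {
  // Naive: split on sentence-ending punctuation, drop empty fragments.
  return text.split(/[.!?]+/).map((s) => s.trim()).filter(Boolean).length;
}

// Mirrors the E2E assertions above: two sentences, bounded length.
function looksLikeSummary(text: string): boolean {
  return countSentences(text) === 2 && text.length < 400;
}
```

A check like looksLikeSummary("First point. Second point.") passes for any two-sentence phrasing the model produces, whereas an exact-string assertion would break on the next run.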
