Stress-test your AI agents before production does

Point Fabrik at an agent. It builds a simulation environment around it, generates scenarios from your code and real production traces, runs everything in parallel, and tells you exactly what broke.

Catch failures before your users do.
app.fabriklabs.ai/runs/run_8f3a
run_8f3a · refund-agent
240 rollouts · OpenAI Agents SDK · seed 42
216 / 240 complete · score 90%
Failure clusters:
Wrong refund amount · 9
Unhandled null customer · 7
Tool-call shape drift · 4
Double-refund on retry · 2

Runs your agent on the framework you already use

OpenAI Agents SDK · Native
Vercel AI SDK · Native
LangGraph · Native
Google ADK · Native
Generic HTTP / JSON · Generic
Generic CLI · Generic
Generic WebSocket · Generic
Node function · Generic

01/Why simulation, not evals

Evals score outputs against rubrics. They miss the failures that actually break agents.

These aren't output-quality failures — they're integration failures. They only show up when the agent runs against a realistic environment with realistic state.

stripe.refunds.create

Wrong-amount tool calls

The agent calls a tool with the wrong arguments because the user's request was ambiguous and the agent guessed.

langgraph.node

Loops on changed shapes

A LangGraph node loops three times because the OpenAI tool-call response shape changed underneath it.

auth.proxy

Unhandled null state

The auth proxy returns null because the customer is a returning user with a different ID format than the test fixtures.

02/How Fabrik works

From a connected agent to a failure report — automatically

Your agent

Point Fabrik at a connected sandbox.

Discover

Fabrik learns how the agent works.

Build environment

Mocked services, seeded world state, personas.

Generate scenarios

From your code and real production traces.

Run in parallel

Hundreds of rollouts at once.

Report

Failure clusters and exactly what broke.

03/The product

Five things you can do in sixty seconds

Watch parallel rollouts

Pick a scenario set, hit run, and watch N scenarios execute simultaneously, each with its persona, current turn, and assertion status updating live.

run_8f3a · refund-agent
240 rollouts · OpenAI Agents SDK · seed 42
216 / 240 complete · score 90%
Rollouts:
Frustrated enterprise user · 5 / 5
Ambiguous refund intent · 3 / 5
Returning customer, alt ID · 5 / 5
Schema-drift on tool call · 2 / 4
Happy path, single order · 4 / 4
Null from auth proxy · 3 / 6
Partial refund, multi-item · 6 / 6
Failure clusters:
Wrong refund amount · 9
Unhandled null customer · 7
Tool-call shape drift · 4
Double-refund on retry · 2

Inspect a rollout's trace

Actor messages, agent messages, tool calls, mock hits/misses, DB reads/writes, assertion results, and grader output — one timeline, color-coded by lane.

rollout_8f3a_002 · Ambiguous refund intent · adversarial
failed · step 7
actor · +0.0s · "I need to refund order_1001"
agent · +0.4s · "Looking up order_1001…"
tool · mock hit · fetch_order(order_1001)
db · read · orders → { total: 149.99, refundable: 42.00 }
agent · +1.1s · "Processing refund for $149.99…"
tool · mock hit · stripe.refunds.create({ amount: 149.99 })
assert · failed · refund_amount == order.refundable · expected 42.00, got 149.99
grader · score 0.40 · Refunded full order total, not the refundable amount
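A lane-coded timeline like the one above can be modeled as a tagged union of events, one variant per lane. A minimal TypeScript sketch with assumed names (`TraceEvent` and its fields are illustrative, not Fabrik's actual trace schema):

```typescript
// Hypothetical event model for a rollout timeline. Each lane gets a variant;
// the discriminant is `lane`. Names here are illustrative only.
type TraceEvent =
  | { lane: 'actor'; atMs: number; text: string }
  | { lane: 'agent'; atMs: number; text: string }
  | { lane: 'tool'; mock: 'hit' | 'miss'; call: string }
  | { lane: 'db'; op: 'read' | 'write'; table: string; payload: unknown }
  | { lane: 'assert'; expr: string; passed: boolean; expected?: unknown; got?: unknown }
  | { lane: 'grader'; score: number; note: string };

// The failing steps from rollout_8f3a_002, expressed as events:
const timeline: TraceEvent[] = [
  { lane: 'actor', atMs: 0, text: 'I need to refund order_1001' },
  { lane: 'tool', mock: 'hit', call: 'stripe.refunds.create({ amount: 149.99 })' },
  {
    lane: 'assert',
    expr: 'refund_amount == order.refundable',
    passed: false,
    expected: 42.0,
    got: 149.99,
  },
  { lane: 'grader', score: 0.4, note: 'Refunded full order total, not the refundable amount' },
];

// Count failed assertions in the timeline.
const failed = timeline.filter((e) => e.lane === 'assert' && !e.passed).length;
console.log(failed); // 1
```

The discriminated union makes color-coding by lane a simple switch on `lane`, and assertion failures are recoverable with a one-line filter.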

Compare two runs

Pick a baseline. Every scenario is grouped by delta: fixed, regressed, or unchanged. Failure clusters sit side-by-side with deltas.

app.fabriklabs.ai/compare/8f3a..7c1b
run_8f3a vs run_7c1b (baseline) · agent v1 → agent v2
47 fixed · 3 regressed · 188 unchanged · 12 new
Scenario · v1 → v2:
refund · ambiguous intent → Fixed
refund · partial multi-item → Fixed
router · null customer → Regressed
refund · happy path → Unchanged
router · enterprise tier → Unchanged
refund · double-refund retry → Fixed
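The fixed / regressed / unchanged / new grouping is a join on scenario id between the baseline run and the new run. A sketch under assumed shapes (`RunResults` and `classifyDeltas` are illustrative names, not Fabrik's API):

```typescript
// Classify each scenario's pass/fail delta between a baseline and a new run.
// A scenario present only in the new run is 'new'.
type RunResults = Record<string, boolean>; // scenario id -> passed

type Delta = 'fixed' | 'regressed' | 'unchanged' | 'new';

function classifyDeltas(baseline: RunResults, current: RunResults): Record<string, Delta> {
  const out: Record<string, Delta> = {};
  for (const [id, passed] of Object.entries(current)) {
    if (!(id in baseline)) out[id] = 'new';
    else if (passed && !baseline[id]) out[id] = 'fixed';
    else if (!passed && baseline[id]) out[id] = 'regressed';
    else out[id] = 'unchanged';
  }
  return out;
}

// Toy data shaped like the comparison above:
const v1: RunResults = {
  'refund · ambiguous intent': false,
  'router · null customer': true,
  'refund · happy path': true,
};
const v2: RunResults = {
  'refund · ambiguous intent': true,
  'router · null customer': false,
  'refund · happy path': true,
  'refund · new edge case': true,
};

const deltas = classifyDeltas(v1, v2);
console.log(deltas['refund · ambiguous intent']); // fixed
console.log(deltas['router · null customer']); // regressed
```

Scenarios dropped between versions would need a symmetric pass over the baseline keys; this sketch only walks the new run.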

Production traces → scenarios

Drop a JSONL / Langfuse / OpenTelemetry export. Fabrik normalizes it, redacts PII, and seeds scenario generation grounded in real user phrasing.

production_traces.jsonl · 1,204 turns
normalized · PII redacted · deduped
+ 240 trace-seeded scenarios
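Conceptually, each raw trace turn becomes a redacted scenario seed. A toy sketch, assuming a simple email regex and hypothetical helper names (`redactPII`, `toScenarioSeed`); production PII redaction covers far more than emails:

```typescript
// Hypothetical normalization of one raw trace turn into a scenario seed:
// redact emails, then keep only user turns as seed utterances.
// This is an illustration, not Fabrik's real pipeline.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;

function redactPII(text: string): string {
  return text.replace(EMAIL, '<email>');
}

function toScenarioSeed(rawTurn: { role: string; content: string }): { seedUtterance: string } | null {
  // Only user turns carry the real-world phrasing worth seeding from.
  return rawTurn.role === 'user' ? { seedUtterance: redactPII(rawTurn.content) } : null;
}

console.log(toScenarioSeed({ role: 'user', content: 'refund order_1001, receipt to jane@acme.com' }));
// seedUtterance: 'refund order_1001, receipt to <email>'
```

Grounding generation in redacted user phrasing keeps scenarios realistic without leaking customer data into the test corpus.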

Export training data

Filter to passing rollouts with high behavioral scores. Download as JSONL in OpenAI fine-tuning shape, aggregated across runs into one re-fetchable snapshot.

{"messages":[
  {"role":"user","content":"refund order_1001"},
  {"role":"assistant","content":"Refunded $42.00 …"}
],"metadata":{"score":0.91,"run":"8f3a"}}
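Producing that JSONL from run results amounts to a filter and a map. A sketch with an assumed `Rollout` shape and a hypothetical `toTrainingJSONL` helper (not Fabrik's export API):

```typescript
// Filter passing rollouts above a score threshold and emit one JSONL line
// each in OpenAI fine-tuning chat shape ({"messages": [...], ...}).
type Rollout = {
  run: string;
  passed: boolean;
  score: number;
  messages: { role: 'user' | 'assistant'; content: string }[];
};

function toTrainingJSONL(rollouts: Rollout[], minScore = 0.9): string {
  return rollouts
    .filter((r) => r.passed && r.score >= minScore)
    .map((r) => JSON.stringify({ messages: r.messages, metadata: { score: r.score, run: r.run } }))
    .join('\n');
}

// One passing, one failing rollout; only the first survives the filter.
const rollouts: Rollout[] = [
  {
    run: '8f3a',
    passed: true,
    score: 0.91,
    messages: [
      { role: 'user', content: 'refund order_1001' },
      { role: 'assistant', content: 'Refunded $42.00' },
    ],
  },
  { run: '8f3a', passed: false, score: 0.4, messages: [{ role: 'user', content: 'refund order_1002' }] },
];

console.log(toTrainingJSONL(rollouts).split('\n').length); // 1
```

Thresholding on the behavioral score rather than pass/fail alone keeps marginal wins out of the fine-tuning corpus.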

04/The output

Three things at once, every run

Parallel
rollouts across realistic scenarios
Stateful
personas, tools, databases, auth, and memory simulated together
Reproducible
traces, assertions, and failure clusters for every rollout

Bugs you would have shipped

Specific scenarios where the agent regressed, fabricated answers, called the wrong tool, or violated your policies — with reproducible inputs.

Comparison data across versions

"v2 fixed 47% of v1's failures, here are 3 new regressions" — with one click.

Training data

Every passing rollout becomes a JSONL line in OpenAI fine-tuning shape. The longer Fabrik runs, the more high-signal corpus you accumulate.

Most teams come for the bug-finding. The training data quietly compounds in the background.

05/Quick start

Three ways in — pick how much access you want to give

Full SDK injection

Zero code — Fabrik writes the wrapping

Best when you control the agent's repo. Fabrik creates a fabrik-prep branch, analyzes your code, and proposes wraps for your DB / auth / API / notification / payment calls one group at a time. You approve each plan; Fabrik commits it.

Environment-only

~3 lines in your handler

Best when you can't or don't want Fabrik to edit your code. Fabrik runs discovery, builds the mock catalog, detects personas + framework, and publishes an environment version. You add a few lines to your request handler.

Bring your own framework

Native trace enrichment

Best when you're already running OpenAI Agents SDK / Vercel AI SDK / LangGraph / Google ADK. Fabrik's framework-detection skill identifies the framework and sets up the right adapter automatically.

Environment-only — that's the whole integration:

import {
  refreshFabrikRuntimeFromRequest,
  withFabrikRuntimeResponse,
} from '@fabrik-evals/core';

// Handler shape assumed: a fetch-style POST route (e.g. a Next.js route handler).
export async function POST(req: Request) {
  const body = await req.json();
  refreshFabrikRuntimeFromRequest(body); // pulls the runtime envelope
  const reply = await myAgent(body.messages);
  return Response.json(withFabrikRuntimeResponse({ text: reply }));
}

06/See it run

One happy-path test vs. hundreds of parallel rollouts

Traditional evals run the workflow once and call it green. Fabrik runs it against every persona, edge case, and service failure — and shows you exactly what broke.


Find the bugs before your users do

Get early access and updates on AI agent simulation and reliability.

Or book a demo.

No spam. Only valuable updates.