Stress-test your AI agents before production does

Point Fabrik at an agent. It builds a simulation environment around it, generates scenarios from your code and real production traces, runs everything in parallel, and tells you exactly what broke.

Catch failures before your users do.
app.fabriklabs.ai/runs/run_8f3a
run_8f3a · refund-agent
240 rollouts · OpenAI Agents SDK · seed 42
216 / 240 complete · score 90%
Failure clusters:
Wrong refund amount · 9
Unhandled null customer · 7
Tool-call shape drift · 4
Double-refund on retry · 2

Runs your agent on the framework you already use

OpenAI Agents SDK · Native
Vercel AI SDK · Native
LangGraph · Native
Google ADK · Native
Generic HTTP / JSON · Generic
Generic CLI · Generic
Generic WebSocket · Generic
Node function · Generic

01/Why simulation, not evals

Evals score outputs against rubrics. They miss the failures that actually break agents.

These aren't output-quality failures — they're integration failures. They only show up when the agent runs against a realistic environment with realistic state.

stripe.refunds.create

Wrong-amount tool calls

The agent calls a tool with the wrong arguments because the user's request was ambiguous and the agent guessed.

langgraph.node

Loops on changed shapes

A LangGraph node loops three times because the OpenAI tool-call response shape changed underneath it.

auth.proxy

Unhandled null state

The auth proxy returns null because the customer is a returning user with a different ID format than the test fixtures.

02/How Fabrik works

From a connected agent to a failure report — automatically

Your agent

Point Fabrik at a connected sandbox.

Discover

Fabrik learns how the agent works.

Build environment

Mocked services, seeded world state, personas.

Generate scenarios

From your code and real production traces.

Run in parallel

Hundreds of rollouts at once.

Report

Failure clusters and exactly what broke.

03/The product

Five things you can do in sixty seconds

Watch parallel rollouts

Pick a scenario set, hit run, and watch N scenarios execute simultaneously, each with its persona, current turn, and assertion status updating live.

run_8f3a · refund-agent
240 rollouts · OpenAI Agents SDK · seed 42
216 / 240 complete · score 90%
Rollouts:
Frustrated enterprise user · 5 / 5
Ambiguous refund intent · 3 / 5
Returning customer, alt ID · 5 / 5
Schema-drift on tool call · 2 / 4
Happy path, single order · 4 / 4
Null from auth proxy · 3 / 6
Partial refund, multi-item · 6 / 6
Failure clusters:
Wrong refund amount · 9
Unhandled null customer · 7
Tool-call shape drift · 4
Double-refund on retry · 2

Inspect a rollout's trace

Actor messages, agent messages, tool calls, mock hits/misses, DB reads/writes, assertion results, and grader output — one timeline, color-coded by lane.

rollout_8f3a_002 · Ambiguous refund intent · adversarial
failed · step 7
actor · +0.0s · "I need to refund order_1001"
agent · +0.4s · "Looking up order_1001…"
tool · mock hit · fetch_order(order_1001)
db · read · orders → { total: 149.99, refundable: 42.00 }
agent · +1.1s · "Processing refund for $149.99…"
tool · mock hit · stripe.refunds.create({ amount: 149.99 })
assert · failed · refund_amount == order.refundable · expected 42.00, got 149.99
grader · score 0.40 · Refunded full order total, not the refundable amount
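A lane-coded timeline like the one above can be modeled as a tagged union of events, one variant per lane. A minimal TypeScript sketch with assumed names (`TraceEvent` and its fields are illustrative, not Fabrik's actual trace schema):

```typescript
// Hypothetical event model for a rollout timeline. Each lane gets a variant;
// the discriminant is `lane`. Names here are illustrative only.
type TraceEvent =
  | { lane: 'actor'; atMs: number; text: string }
  | { lane: 'agent'; atMs: number; text: string }
  | { lane: 'tool'; mock: 'hit' | 'miss'; call: string }
  | { lane: 'db'; op: 'read' | 'write'; table: string; payload: unknown }
  | { lane: 'assert'; expr: string; passed: boolean; expected?: unknown; got?: unknown }
  | { lane: 'grader'; score: number; note: string };

// The failing steps from rollout_8f3a_002, expressed as events:
const timeline: TraceEvent[] = [
  { lane: 'actor', atMs: 0, text: 'I need to refund order_1001' },
  { lane: 'tool', mock: 'hit', call: 'stripe.refunds.create({ amount: 149.99 })' },
  {
    lane: 'assert',
    expr: 'refund_amount == order.refundable',
    passed: false,
    expected: 42.0,
    got: 149.99,
  },
  { lane: 'grader', score: 0.4, note: 'Refunded full order total, not the refundable amount' },
];

// Count failed assertions in the timeline.
const failed = timeline.filter((e) => e.lane === 'assert' && !e.passed).length;
console.log(failed); // 1
```

The discriminated union makes color-coding by lane a simple switch on `lane`, and assertion failures are recoverable with a one-line filter.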

Compare two runs

Pick a baseline. Every scenario is grouped by delta: fixed, regressed, or unchanged. Failure clusters sit side-by-side with deltas.

app.fabriklabs.ai/compare/8f3a..7c1b
run_8f3a vs run_7c1b (baseline) · agent v1 → agent v2
47 fixed · 3 regressed · 188 unchanged · 12 new
Scenario · v1 → v2:
refund · ambiguous intent → Fixed
refund · partial multi-item → Fixed
router · null customer → Regressed
refund · happy path → Unchanged
router · enterprise tier → Unchanged
refund · double-refund retry → Fixed
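The fixed / regressed / unchanged / new grouping is a join on scenario id between the baseline run and the new run. A sketch under assumed shapes (`RunResults` and `classifyDeltas` are illustrative names, not Fabrik's API):

```typescript
// Classify each scenario's pass/fail delta between a baseline and a new run.
// A scenario present only in the new run is 'new'.
type RunResults = Record<string, boolean>; // scenario id -> passed

type Delta = 'fixed' | 'regressed' | 'unchanged' | 'new';

function classifyDeltas(baseline: RunResults, current: RunResults): Record<string, Delta> {
  const out: Record<string, Delta> = {};
  for (const [id, passed] of Object.entries(current)) {
    if (!(id in baseline)) out[id] = 'new';
    else if (passed && !baseline[id]) out[id] = 'fixed';
    else if (!passed && baseline[id]) out[id] = 'regressed';
    else out[id] = 'unchanged';
  }
  return out;
}

// Toy data shaped like the comparison above:
const v1: RunResults = {
  'refund · ambiguous intent': false,
  'router · null customer': true,
  'refund · happy path': true,
};
const v2: RunResults = {
  'refund · ambiguous intent': true,
  'router · null customer': false,
  'refund · happy path': true,
  'refund · new edge case': true,
};

const deltas = classifyDeltas(v1, v2);
console.log(deltas['refund · ambiguous intent']); // fixed
console.log(deltas['router · null customer']); // regressed
```

Scenarios dropped between versions would need a symmetric pass over the baseline keys; this sketch only walks the new run.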

Production traces → scenarios

Drop a JSONL / Langfuse / OpenTelemetry export. Fabrik normalizes it, redacts PII, and seeds scenario generation grounded in real user phrasing.

production_traces.jsonl · 1,204 turns
normalized · PII redacted · deduped
+ 240 trace-seeded scenarios
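Conceptually, each raw trace turn becomes a redacted scenario seed. A toy sketch, assuming a simple email regex and hypothetical helper names (`redactPII`, `toScenarioSeed`); production PII redaction covers far more than emails:

```typescript
// Hypothetical normalization of one raw trace turn into a scenario seed:
// redact emails, then keep only user turns as seed utterances.
// This is an illustration, not Fabrik's real pipeline.
const EMAIL = /[\w.+-]+@[\w-]+\.[\w.]+/g;

function redactPII(text: string): string {
  return text.replace(EMAIL, '<email>');
}

function toScenarioSeed(rawTurn: { role: string; content: string }): { seedUtterance: string } | null {
  // Only user turns carry the real-world phrasing worth seeding from.
  return rawTurn.role === 'user' ? { seedUtterance: redactPII(rawTurn.content) } : null;
}

console.log(toScenarioSeed({ role: 'user', content: 'refund order_1001, receipt to jane@acme.com' }));
// seedUtterance: 'refund order_1001, receipt to <email>'
```

Grounding generation in redacted user phrasing keeps scenarios realistic without leaking customer data into the test corpus.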

Export training data

Filter to passing rollouts with high behavioral scores. Download as JSONL in OpenAI fine-tuning shape, aggregated across runs into one re-fetchable snapshot.

{"messages":[
  {"role":"user","content":"refund order_1001"},
  {"role":"assistant","content":"Refunded $42.00 …"}
],"metadata":{"score":0.91,"run":"8f3a"}}
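Producing that JSONL from run results amounts to a filter and a map. A sketch with an assumed `Rollout` shape and a hypothetical `toTrainingJSONL` helper (not Fabrik's export API):

```typescript
// Filter passing rollouts above a score threshold and emit one JSONL line
// each in OpenAI fine-tuning chat shape ({"messages": [...], ...}).
type Rollout = {
  run: string;
  passed: boolean;
  score: number;
  messages: { role: 'user' | 'assistant'; content: string }[];
};

function toTrainingJSONL(rollouts: Rollout[], minScore = 0.9): string {
  return rollouts
    .filter((r) => r.passed && r.score >= minScore)
    .map((r) => JSON.stringify({ messages: r.messages, metadata: { score: r.score, run: r.run } }))
    .join('\n');
}

// One passing, one failing rollout; only the first survives the filter.
const rollouts: Rollout[] = [
  {
    run: '8f3a',
    passed: true,
    score: 0.91,
    messages: [
      { role: 'user', content: 'refund order_1001' },
      { role: 'assistant', content: 'Refunded $42.00' },
    ],
  },
  { run: '8f3a', passed: false, score: 0.4, messages: [{ role: 'user', content: 'refund order_1002' }] },
];

console.log(toTrainingJSONL(rollouts).split('\n').length); // 1
```

Thresholding on the behavioral score rather than pass/fail alone keeps marginal wins out of the fine-tuning corpus.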

04/The output

Three things at once, every run

Parallel
rollouts across realistic scenarios
Stateful
personas, tools, databases, auth, and memory simulated together
Reproducible
traces, assertions, and failure clusters for every rollout

Bugs you would have shipped

Specific scenarios where the agent regressed, fabricated answers, called the wrong tool, or violated your policies — with reproducible inputs.

Comparison data across versions

"v2 fixed 47% of v1's failures, here are 3 new regressions" — with one click.

Training data

Every passing rollout becomes a JSONL line in OpenAI fine-tuning shape. The longer Fabrik runs, the more high-signal corpus you accumulate.

Most teams come for the bug-finding. The training data quietly compounds in the background.

05/Quick start

Three ways in — pick how much access you want to give

Full SDK injection

Zero code — Fabrik writes the wrapping

Best when you control the agent's repo. Fabrik creates a fabrik-prep branch, analyzes your code, and proposes wraps for your DB / auth / API / notification / payment calls one group at a time. You approve each plan; Fabrik commits it.

Environment-only

~3 lines in your handler

Best when you can't or don't want Fabrik to edit your code. Fabrik runs discovery, builds the mock catalog, detects personas + framework, and publishes an environment version. You add a few lines to your request handler.

Bring your own framework

Native trace enrichment

Best when you're already running OpenAI Agents SDK / Vercel AI SDK / LangGraph / Google ADK. Fabrik's framework-detection skill identifies the framework and sets up the right adapter automatically.

Environment-only — that's the whole integration:

import {
  refreshFabrikRuntimeFromRequest,
  withFabrikRuntimeResponse,
} from '@fabrik-evals/core';

// Handler shape assumed: a fetch-style POST route (e.g. a Next.js route handler).
export async function POST(req: Request) {
  const body = await req.json();
  refreshFabrikRuntimeFromRequest(body); // pulls the runtime envelope
  const reply = await myAgent(body.messages);
  return Response.json(withFabrikRuntimeResponse({ text: reply }));
}

06/See it run

One happy-path test vs. hundreds of parallel rollouts

Traditional evals run the workflow once and call it green. Fabrik runs it against every persona, edge case, and service failure — and shows you exactly what broke.


Find the bugs before your users do

Get early access and updates on AI agent simulation and reliability.

Or book a demo.

No spam. Only valuable updates.