02/RELIABILITY

Evals Aren't Enough: Reliable Agents Need Simulation

Fabrik Labs · February 2026 · 12 min

Abstract

AI agents don't fail because they can't answer a question. They fail because they're systems: stochastic, stateful, tool-using pipelines that unfold over time. An agent can score well on an evaluation dataset and still break when it's dropped into a real workflow. This isn't a knock on evals — it's a statement about what evals are structurally able to measure, and what they can't.

I/Section

The Evaluation Dataset Is Not the Workflow

Classic eval thinking assumes a bounded world: a dataset of tasks, a spec, a grader, a pass/fail. You run the suite, watch the number go up, and assume reliability improved. But agent failures are rarely “wrong answer” failures. They are trajectory failures — errors that emerge only after multiple steps.

These failures follow a few recurring patterns. The agent makes an early assumption that later becomes inconsistent with what it learns. Tool outputs vary because of timeouts, schema changes, or partial results. The agent's plan drifts as context grows and gets summarized. Small deviations compound into the wrong end state. A dataset-style eval typically captures snapshots of behavior, while real workflows expose dynamics.
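
To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes every step succeeds independently with the same probability, which is a simplification (real steps are correlated), and the 95% figure is illustrative rather than a measurement. The function name `trajectory_success` is ours, not part of any library.

```python
def trajectory_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# Illustrative only: a step that is right 95% of the time.
for steps in (1, 5, 10, 20):
    print(steps, round(trajectory_success(0.95, steps), 3))
```

Under these assumptions, a step that is right 95% of the time yields a 20-step trajectory that finishes cleanly only about 36% of the time. A single-step eval would report the 95%; the workflow experiences the 36%.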

That mismatch is why “passing one step” does not equal “validating the system.” Anthropic's engineering guidance on agent evals reflects this systems reality: it emphasizes agent-specific design, multiple trials per task, and operational framing, because the “same” task can play out differently across runs.
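
A minimal sketch of what "multiple trials" can look like in practice follows. Everything here is hypothetical: `flaky_agent` is a stand-in that succeeds roughly 80% of the time, and `run_trials` and `summarize` are illustrative helpers, not functions from any eval framework.

```python
import random
from typing import Callable

Agent = Callable[[str, random.Random], bool]


def run_trials(agent: Agent, task: str, trials: int, seed: int = 0) -> list[bool]:
    """Run the same task several times, each with its own seeded RNG."""
    return [agent(task, random.Random(seed + i)) for i in range(trials)]


def summarize(results: list[bool]) -> dict[str, float]:
    """Report how often the task passed, and whether it passed at all or always."""
    passes = sum(results)
    return {
        "pass_rate": passes / len(results),
        "any_pass": float(passes > 0),
        "all_pass": float(passes == len(results)),
    }


def flaky_agent(task: str, rng: random.Random) -> bool:
    # Hypothetical stand-in for a real agent run: succeeds about 80% of the time.
    return rng.random() < 0.8


results = run_trials(flaky_agent, "refund-order", trials=20)
print(summarize(results))
```

The gap between `any_pass` and `all_pass` is the point: a task that passes at least once can look solved in a single-run suite while still failing a meaningful fraction of the time.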

Figure 1. The Reliability Stack: three layers of agent reliability (evals, observability, and simulation) and the role each plays, from pre-deployment to post-deployment.

II/Section

Why Evals Fail to Predict Reliability

The core problem is that evals often test outcomes without faithfully modeling the process that produces them. Agents are non-deterministic decision pipelines. If you measure success on a finite dataset, you can miss the long-tail failure modes that only appear when the agent explores different paths, tools return unexpected shapes, or the environment pushes back.

Real environments are richer than task definitions. In production, “the environment” includes API behaviors, tool contracts, latency, partial failures, auth edge cases, and messy user inputs. Even if your dataset covers the “right” tasks, it is hard to cover the right interactions at the right moments.

This is why recent research is shifting attention from “better graders” toward environment modeling — because robustness comes from exposure to diverse trajectories and feedback, not only scoring more examples.

III/Section

Simulation Is the Missing Layer

Simulation flips the question from “Did the agent pass this dataset?” to “Does the agent behave correctly across the space of realistic workflows?” A simulation environment can produce multi-step, stateful scenarios rather than isolated examples, generate counterfactuals and perturbations, support deterministic replay for debugging and regression, and stress-test long-horizon behavior where drift and compounding errors actually appear.
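
The properties listed above (stateful scenarios, perturbations, deterministic replay) can be sketched in a few dozen lines. This is a toy under stated assumptions, not Fabrik's implementation: `SimulatedOrderAPI`, `naive_agent`, the fault rates, and the refund scenario are all invented for illustration.

```python
import random
from dataclasses import dataclass, field


class ToolTimeout(Exception):
    """Raised when the simulated tool injects a timeout."""


@dataclass
class SimulatedOrderAPI:
    """Hypothetical stateful tool with seeded, injectable faults."""
    rng: random.Random
    timeout_rate: float = 0.2
    partial_rate: float = 0.2
    orders: dict = field(
        default_factory=lambda: {"A1": {"status": "paid", "amount": 40}}
    )
    log: list = field(default_factory=list)

    def get_order(self, order_id: str) -> dict:
        roll = self.rng.random()
        if roll < self.timeout_rate:
            self.log.append("get_order:timeout")
            raise ToolTimeout(order_id)
        order = dict(self.orders[order_id])  # copy, so faults never corrupt state
        if roll < self.timeout_rate + self.partial_rate:
            self.log.append("get_order:partial")
            order.pop("amount")  # perturbation: a field is missing from the response
        else:
            self.log.append("get_order:ok")
        return order

    def refund(self, order_id: str, amount: int) -> None:
        self.log.append(f"refund:{amount}")
        self.orders[order_id]["status"] = "refunded"
        self.orders[order_id]["refunded"] = amount


def naive_agent(api: SimulatedOrderAPI, order_id: str, max_attempts: int = 3) -> None:
    """Hypothetical policy: retries timeouts, but trusts partial data."""
    for _ in range(max_attempts):
        try:
            order = api.get_order(order_id)
        except ToolTimeout:
            continue
        api.refund(order_id, order.get("amount", 0))  # bug: refunds 0 on partial data
        return


def run_scenario(seed: int) -> tuple:
    """Run one seeded scenario; return (end state correct?, tool call log)."""
    api = SimulatedOrderAPI(rng=random.Random(seed))
    naive_agent(api, "A1")
    return api.orders["A1"].get("refunded") == 40, api.log


failures = [seed for seed in range(200) if not run_scenario(seed)[0]]
print(f"{len(failures)} of 200 seeded scenarios ended in the wrong state")
```

Because every scenario is driven by a seed, any failing seed in `failures` can be rerun to reproduce the exact same sequence of tool responses, and the log shows which perturbation the agent mishandled.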

Crucially, simulation creates a place to optimize, not just measure. One reason the eval-only approach stalls is that it tells you that something failed but not reliably why, and it doesn't give you a controllable environment to iterate. Simulation gives you an engine to diagnose, fix, and replay.
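
One way to picture the diagnose, fix, and replay loop is a tape of recorded tool responses that is served back to a patched agent. The sketch below assumes a single hypothetical tool and two invented agent versions; `Recorder`, `Replayer`, `partial_tool`, `agent_v1`, and `agent_v2` are illustrative names, not an existing API.

```python
import json
from typing import Callable

ToolFn = Callable[[str], dict]


class Recorder:
    """Wraps a tool and records every call and response in order."""

    def __init__(self, tool: ToolFn) -> None:
        self._tool = tool
        self.tape: list = []

    def __call__(self, arg: str) -> dict:
        response = self._tool(arg)
        self.tape.append({"arg": arg, "response": response})
        return response


class Replayer:
    """Serves recorded responses back in order and fails loudly on divergence."""

    def __init__(self, tape: list) -> None:
        self._tape = list(tape)
        self._pos = 0

    def __call__(self, arg: str) -> dict:
        if self._pos >= len(self._tape):
            raise AssertionError("agent made more tool calls than were recorded")
        entry = self._tape[self._pos]
        if entry["arg"] != arg:
            raise AssertionError(
                f"call {self._pos}: expected {entry['arg']!r}, got {arg!r}"
            )
        self._pos += 1
        return entry["response"]


def partial_tool(order_id: str) -> dict:
    # Hypothetical tool response from the failing run: "amount" is missing.
    return {"id": order_id, "status": "paid"}


def agent_v1(tool: ToolFn, order_id: str) -> str:
    order = tool(order_id)
    return f"refund {order.get('amount', 0)}"  # silently refunds 0


def agent_v2(tool: ToolFn, order_id: str) -> str:
    order = tool(order_id)
    if "amount" not in order:
        return "escalate"  # the fix: refuse to act on partial data
    return f"refund {order['amount']}"


recorder = Recorder(partial_tool)
print(agent_v1(recorder, "A1"))  # the recorded failing run

tape = json.loads(json.dumps(recorder.tape))  # the tape survives serialization
print(agent_v2(Replayer(tape), "A1"))  # the patched agent against the same conditions
```

The first call prints `refund 0` and the replay prints `escalate`, so the fix is verified against the exact conditions that caused the failure. The strict ordering check is a design choice: if a fix changes which tools the agent calls, this replayer raises rather than guessing, and a looser matching strategy would be needed.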

Research has started to formalize this: instead of hand-building brittle testbeds, you can use models to simulate environment feedback and generate diverse trajectories for training and improvement — explicitly treating “the environment” as a first-class object.

IV/Section

Evals Are Becoming Infinite Observability

A visible trend in the agent tooling ecosystem is that many eval platforms are converging toward always-on tracing: capture every prompt, tool call, intermediate step, cost, latency, and outcome — then slice it endlessly. The motivation is real: if agents fail across trajectories, you want full-fidelity traces to debug them.

But most of this “new eval stack” is essentially observability infrastructure — and increasingly it is built on or compatible with OpenTelemetry. LangSmith supports exporting traces via OpenTelemetry. Arize Phoenix ships OTEL wrappers for LLM and agent tracing. Major observability vendors now ingest GenAI traces via OpenTelemetry semantic conventions, reinforcing OTEL as the lingua franca.

This is good news for interoperability: traces can move between tools and vendors, and teams don't have to re-instrument everything. But it does not solve the core reliability gap — because observability is fundamentally post-hoc. It helps you understand failures after the agent experiences real conditions, often after users do. If you want trustworthy agents, you need a layer that catches failures before deployment, in a controlled environment that can be replayed and optimized.

V/Section

Where Fabrik Fits

Fabrik is built around a simple premise: you can't trust agents without simulating how they operate. Evals remain useful as a starting point and as guardrails, and observability is essential once you ship. But reliability — the kind that prevents silent churn and compounding workflow failure — requires a simulation layer that tests agents as systems, not as single outcomes on a dataset.

That is the gap Fabrik exists to fill: simulate real workflows pre-production, surface hidden failure modes, and provide deterministic replay so teams can ship agents with evidence — not hope.

The three layers work together. Evals give you a baseline. Observability gives you production insight. Simulation gives you pre-deployment confidence. Without all three, you are flying blind at one stage or another. Fabrik ensures the simulation layer is no longer the missing piece.

Key findings

What this paper concludes

01 Agent failures are trajectory failures: errors that emerge across multiple steps, not wrong answers to single questions.

02 Evaluation datasets test outcomes without modeling the process, missing long-tail failures that appear only under diverse execution paths.

03 Simulation flips the question from “did the agent pass?” to “does it behave correctly across realistic workflows?”, which enables optimization, not just measurement.

04 The eval tooling ecosystem is converging on OpenTelemetry-based observability, which is valuable but fundamentally post-hoc.

05 Reliability requires three layers working together: evals for baseline, observability for production insight, and simulation for pre-deployment confidence.

Methodology

References:

Anthropic, “Demystifying evals for AI agents” (Jan 2026).

“Simulating Environments with Reasoning Models for Agent Training,” arXiv:2511.01824.

LangSmith, OpenTelemetry tracing documentation.

Arize Phoenix, OTEL tracing setup.

OpenTelemetry, semantic conventions for GenAI.