I/Section
From Prompt Pipelines to Stateful Systems
For most of the past two years, AI engineering looked like prompt engineering. A request came in, a prompt was assembled, a model produced a response, and the request ended. The system was stateless. The unit of work was a single turn. Reliability, in that world, was largely a property of the model and the prompt: if both were good enough, the response was good enough.
That world is gone. Production agents now maintain memory across turns, call external tools, mutate state in databases, coordinate sub-workflows, and operate against systems they do not control — payment providers, CRMs, identity systems, notification services. The unit of work is no longer a turn. It is a trajectory through an evolving environment, often spanning minutes, sometimes hours.
Once you cross that threshold, model quality stops being the bottleneck. The bottleneck becomes the coordination of stateful components over time. That is a systems-engineering problem, and the techniques that govern it — state management, retries, idempotency, observability, fault isolation — are the techniques that have governed distributed systems for decades. The model is one component in a much larger machine.
Reliability is the coordination of stateful systems over time
II/Section
Local Correctness vs System Failure
A useful way to see the shift: response quality is local; reliability is global. Every individual step in an agent workflow can be locally correct — the right database row is retrieved, the right policy is applied, the right API is called — and the workflow as a whole can still produce an incorrect outcome. The failure does not live in any single step. It lives in the trajectory.
Consider a refund workflow: retrieve customer, check refund policy, call Stripe, update CRM, send confirmation email, write memory. Each step succeeds. Each step's output is well-formed. And yet, because a transient network error swallowed Stripe's response and the orchestrator retried the call, the refund is issued twice. There is no log line where something went wrong. Every component reports green. The system is broken anyway.
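The standard mitigation for this particular failure is idempotency: key each side-effecting call on the business operation rather than on the attempt, so a retry replays the same logical request instead of creating a second one. Real payment providers, including Stripe, support idempotency keys; the sketch below uses an invented in-memory stand-in rather than any actual SDK.

```python
from dataclasses import dataclass, field


@dataclass
class FakePaymentsAPI:
    """In-memory stand-in for a payment provider that honors idempotency keys."""
    _seen: dict = field(default_factory=dict)

    def create_refund(self, order_id: str, amount_cents: int,
                      idempotency_key: str) -> dict:
        # A second request with the same key returns the original result
        # instead of issuing a new refund.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        refund = {"refund_id": len(self._seen) + 1,
                  "order_id": order_id, "amount_cents": amount_cents}
        self._seen[idempotency_key] = refund
        return refund


def issue_refund(payments: FakePaymentsAPI, order_id: str,
                 amount_cents: int) -> dict:
    # Key the request on the business operation, not on the attempt, so a
    # retry after a swallowed response replays the same logical request.
    return payments.create_refund(order_id, amount_cents,
                                  idempotency_key=f"refund:{order_id}")


api = FakePaymentsAPI()
first = issue_refund(api, "order-42", 1999)
retried = issue_refund(api, "order-42", 1999)  # the retry after the timeout
assert first == retried                        # one refund issued, not two
```

In the scenario above, that missing key is exactly the difference between one refund and two.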
This is the failure class that local evaluation cannot see. You can grade every step against a reference and get perfect scores. You can replay every tool call and confirm each one returned the expected value. The wrongness is not in the parts; it is in how the parts compose under conditions the test harness never simulated: retries, partial failures, races, stale state, out-of-order delivery.
All steps green — outcome failed. Reliability is a property of the trajectory.
III/Section
State Evolution
The deeper reason this is hard is that agent workflows are not stationary. Every action changes the conditions under which the next action runs. A tool call mutates memory. The new memory changes what the retriever surfaces. The new retrieval changes what the planner proposes. The new plan changes which tools get called next. By turn five, the agent is operating in a state that turn one's test cases never anticipated.
This compounds with practical realities. Context windows force summarization, which lossily compresses history. Retries replay parts of trajectories under slightly different conditions. Partial tool failures leave the system in states that no single tool intended. Auth tokens expire mid-session. Memory writes succeed but reads return stale values for some bounded window. None of these are pathological — they are the normal operating regime of distributed systems.
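To make the stale-read case concrete, here is a minimal sketch, with invented names, of a key-value memory whose reads lag writes by a bounded window, which is the behavior an eventually consistent store exhibits under normal operation.

```python
import time


class StaleReadMemory:
    """Key-value memory whose reads lag writes by a fixed window.

    A deliberately simplified model of an eventually consistent store:
    writes succeed immediately, but readers keep seeing the old value
    until the staleness window has elapsed.
    """

    def __init__(self, staleness_s: float = 1.5):
        self._visible = {}   # values readers can currently see
        self._pending = {}   # key -> (value, time at which it becomes visible)
        self._staleness_s = staleness_s

    def write(self, key: str, value) -> None:
        self._pending[key] = (value, time.monotonic() + self._staleness_s)

    def read(self, key: str):
        pending = self._pending.get(key)
        if pending is not None and time.monotonic() >= pending[1]:
            self._visible[key] = pending[0]
            del self._pending[key]
        return self._visible.get(key)  # stale (or None) inside the window


memory = StaleReadMemory(staleness_s=0.2)
memory.write("customer:42:tier", "gold")
print(memory.read("customer:42:tier"))  # None: the write has not propagated yet
time.sleep(0.25)
print(memory.read("customer:42:tier"))  # "gold": the window has elapsed
```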
Static evaluation datasets cannot cover this space because the space is not a set of inputs; it is a set of trajectories through a state machine that the agent itself helps construct. The reliability question is not “does the agent produce the right answer for input X?” It is “does the agent's behavior remain correct as state evolves under realistic perturbations?” Those are different questions, and they need different tools.
IV/Section
Simulation Environments
The operational answer is to construct a controlled runtime environment around the agent — one that the agent cannot distinguish from production, but that the engineering team fully controls. Fabrik builds these environments: simulated users with persistent personas, mocked external APIs that respond with realistic schemas, seeded databases with known state, working auth and identity flows, configurable latency and failure injection.
The point of the environment is not to test whether a single answer is correct. It is to stress-test whether the overall orchestration remains reliable as conditions evolve. What happens when the CRM returns stale data for ninety seconds? When Stripe times out on the second call but not the first? When the user changes their mind three turns in? When a tool returns a structurally valid response with semantically wrong content? Each of these is a normal production event. Each can be reproduced deterministically inside a simulation.
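Each of those events can be scripted. The sketch below is illustrative only, not Fabrik's actual interface: a seeded mock of a payment endpoint that reproduces "times out on the second call but not the first" deterministically, so the exact trajectory replays the same way run after run.

```python
import random


class FlakyPaymentsMock:
    """Seeded mock of a payment endpoint with deterministic failure injection.

    Illustrative sketch, not Fabrik's actual API. The seed makes every
    injected fault reproducible across runs.
    """

    def __init__(self, seed: int, timeout_on_call: int | None = None):
        self._rng = random.Random(seed)
        self._timeout_on_call = timeout_on_call
        self._calls = 0

    def create_refund(self, order_id: str, amount_cents: int) -> dict:
        self._calls += 1
        if self._calls == self._timeout_on_call:
            # Reproduces "times out on the second call but not the first".
            raise TimeoutError(f"injected timeout on call #{self._calls}")
        return {"refund_id": f"re_{self._rng.randrange(10**8):08d}",
                "order_id": order_id, "amount_cents": amount_cents,
                "status": "succeeded"}


mock = FlakyPaymentsMock(seed=7, timeout_on_call=2)
mock.create_refund("order-42", 1999)      # first call succeeds
try:
    mock.create_refund("order-42", 1999)  # second call fails, on every run
except TimeoutError as exc:
    print(exc)
```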
Crucially, the agent operates against the simulation exactly as it would against production. The same tool clients, the same memory layer, the same retrieval pipeline, the same prompts. Nothing about the agent code changes between environments. That fidelity is what makes the results transferable. A failure surfaced in simulation is a failure that would surface in production; a workflow that holds together in simulation is one that has been exposed to the actual conditions of failure rather than to a curated dataset of inputs.
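In practice that fidelity usually comes down to configuration rather than code. A hypothetical sketch of the pattern: the agent is constructed from a wiring config, and the only difference between production and simulation is where the tool clients point. The field names here are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Environment:
    """Hypothetical wiring config; the fields and URLs are illustrative."""
    crm_base_url: str
    payments_base_url: str
    memory_dsn: str


PRODUCTION = Environment(
    crm_base_url="https://crm.example.internal",
    payments_base_url="https://api.stripe.com",
    memory_dsn="postgres://agent-memory/prod",
)

SIMULATION = Environment(
    crm_base_url="http://localhost:8091",       # mocked CRM
    payments_base_url="http://localhost:8092",  # mocked payments
    memory_dsn="postgres://agent-memory/sim",   # seeded snapshot
)


def build_agent(env: Environment):
    # Same tool clients, same memory layer, same retrieval pipeline,
    # same prompts; only the endpoints differ between environments.
    ...
```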
V/Section
Observability vs Simulation
These two categories are often confused, and the confusion costs teams real reliability. Observability explains what happened. Simulation explores what could happen. Observability analyzes executions; simulation stress-tests the execution space.
Both matter. Production traces, structured logs, and OpenTelemetry-based GenAI dashboards are how teams diagnose incidents after they occur. They are how on-call engineers reconstruct trajectories and how data teams measure cost, latency, and outcome distributions. None of that is going away, and none of it should. But observability is, by construction, post-hoc. By the time a trace reveals a failure mode, a user has already experienced it.
Simulation is the pre-deployment complement: a place to discover failure modes before they ship, in conditions that production may not exhibit for weeks or months. The two together form a complete reliability practice — simulate before deployment, observe after, and feed production traces back into the simulation environment so the next generation of scenarios reflects what is actually happening in the wild.
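The feedback half of that loop can be mechanical. As a sketch, assuming a simplified span schema that is invented here rather than taken from any OpenTelemetry convention, a production trace can be distilled into a replayable scenario: observed tool responses become the mocks' scripted responses, and observed timings become the injected latency profile.

```python
def scenario_from_trace(trace: dict) -> dict:
    """Distill a production trace into a simulation scenario.

    Sketch only: the trace and scenario schemas are invented for
    illustration, not an OpenTelemetry or Fabrik format.
    """
    tool_spans = [s for s in trace["spans"] if s.get("kind") == "tool_call"]
    return {
        # Seed the simulated databases from the state the trace started in.
        "seed_state": trace["initial_state_snapshot"],
        # Replay each observed tool response as the mock's scripted answer.
        "scripted_responses": [
            {"tool": s["tool"], "response": s["response"]} for s in tool_spans
        ],
        # Reproduce the observed timing as the injected latency profile.
        "latency_profile_ms": [s["duration_ms"] for s in tool_spans],
    }
```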
VI/Section
Harness Optimization
Once you accept that reliability is a property of the harness around the model — the orchestration, the retries, the memory, the routing, the state handling — the work changes. You stop trying to make the model better at a benchmark. You start tuning the harness against realistic trajectories.
That tuning surface is large. Prompt structure interacts with retrieval, which interacts with tool selection, which interacts with memory writes, which interact with subsequent retrieval. Retry policies trade off against idempotency guarantees. Context compression strategies trade off against long-horizon coherence. Routing decisions trade off latency against accuracy. Each of these is a knob, and each can be measured — but only inside an environment where trajectories can be replayed under controlled variation.
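Measured, in practice, means swept. A minimal sketch with invented knob names: hold the model and the scenario seeds fixed, vary the harness dimensions over a small grid, and replay the same trajectories under each configuration so that differences are attributable to the harness alone.

```python
import itertools
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessConfig:
    """Illustrative harness knobs; the names are assumptions, not a real schema."""
    max_retries: int
    retry_backoff_s: float
    context_budget_tokens: int
    routing_policy: str


def harness_sweep():
    # Cartesian sweep over a small grid; each config is replayed against
    # the same seeded scenarios, so outcome differences come from the
    # harness, not from the model or the environment.
    for retries, budget, policy in itertools.product(
        (0, 1, 3),
        (4_000, 16_000),
        ("fast-model-first", "accurate-model-first"),
    ):
        yield HarnessConfig(
            max_retries=retries,
            retry_backoff_s=0.5,
            context_budget_tokens=budget,
            routing_policy=policy,
        )


for config in harness_sweep():
    print(config)
```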
This is what we mean when we say the future of AI engineering is orchestration engineering. The model is upstream of everything, but it is increasingly a fixed input. The variable inputs — the ones engineering teams actually control — are how the model is wrapped, prompted, retried, and routed, how its memory is managed, and how its failures are recovered. Reliability lives in those choices. Simulation is the loop that lets you iterate on them with evidence instead of guesswork.
Key findings
What this paper concludes
- 01. Agent reliability is an orchestration problem, not a model-quality problem — once agents become stateful, the bottleneck moves out of the model.
- 02. Local step correctness does not imply global workflow correctness; the failure modes that matter live in the trajectory, not the steps.
- 03. State mutates with every action, so trajectories are non-stationary — static evaluation datasets cannot cover the space the agent actually explores.
- 04. Simulation environments provide a controlled runtime to stress-test orchestration under realistic perturbations: retries, partial failures, stale state, latency, adversarial inputs.
- 05. Observability and simulation are complementary, not competitive: observability explains what happened in production; simulation explores what could happen before deployment.
- 06. The reliability surface lives in the harness — prompts, retries, memory, routing, state handling — and that surface can only be tuned with evidence inside a controlled environment.