Engineering is shifting. The job is no longer writing software - it's maintaining systems that can observe their own failures, evolve their own quality layer, and improve their own operating harness over time.
Today, code generation is cheap. Modern coding systems can produce thousands of lines of working code in minutes, faster than any team can review, test, or even fully understand.
The bottleneck has moved. It is no longer writing code. It is everything that comes after: validating behavior, catching regressions, debugging failures, and maintaining evaluations and reliability as systems evolve and user behaviors drift.
The new era of writing and maintaining software
Unlike traditional software, where failures are deterministic and localized, agent systems fail in ways that are stochastic, distribution-dependent, and often difficult to reproduce. Small changes in prompts, tool schemas, or context construction can lead to qualitatively different behaviors and compounding downstream consequences. Improvements happen reactively, incident by incident, while complexity keeps increasing; over time, the system becomes harder to maintain.
The new era of engineering will be designing systems that can sustain and improve themselves over time. This includes building robust harnesses that define how agents operate, evaluation layers that continuously measure behavior, constraints that bound system outputs, and feedback loops that convert failures into actionable signals.
This shift is already visible. Recently, Andrej Karpathy demonstrated autonomous research loops where systems generate ideas, run experiments, and iteratively improve their outputs. These systems are not just producing results, they are iterating and evolving faster than humans can reliably audit.
NeoSigma: Closing the Feedback Loop
In production environments, the most valuable signals come from real-world failures. These signals appear as customer feedback, support tickets, incident traces, and production failures, yet they remain fragmented across apps and teams. The key challenge isn't just detecting failures; it's systematically capturing and transforming them into reusable artifacts that improve the system over time, evolving with user behaviors and adapting to new constraints.
NeoSigma addresses this by transforming raw failure signals into a structured improvement pipeline. We are building the infrastructure for self-maintaining agent systems by closing the feedback loop between production failures and system improvements. Each failure is analyzed to produce a representation of what went wrong, along with a hypothesis about its root cause. These failures are then converted into reusable evaluation cases that encode them in reproducible form. The system proposes targeted changes to the agent harness, applies them, and validates the outcome against the evolving evaluation set. Every failure thus contributes to a persistent improvement rather than being resolved as a one-off fix. Because each failure is captured, formalized, and incorporated into future evaluations, improvements become both more targeted and more durable, reducing the likelihood of recurring failure modes.
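As a rough sketch of the pipeline described above, each failure becomes a structured record with a root-cause hypothesis, and then a reusable eval case. All names here are illustrative, not NeoSigma's actual API:

```python
from dataclasses import dataclass

@dataclass
class FailureRecord:
    """Structured representation of one production failure."""
    trace_id: str
    symptom: str              # what went wrong, as observed in the trace
    root_cause_hypothesis: str

@dataclass
class EvalCase:
    """A failure encoded in reproducible form for future runs."""
    source: FailureRecord
    expected_behavior: str

def to_eval_case(record: FailureRecord, expected: str) -> EvalCase:
    # Each analyzed failure becomes a reusable evaluation case,
    # so the fix is re-checked on every future harness change.
    return EvalCase(source=record, expected_behavior=expected)

record = FailureRecord("t-001", "agent cancelled the wrong order",
                       "ambiguous order identification in prompt")
case = to_eval_case(record, "confirm order id with customer before cancelling")
```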

Chirag Mahapatra
Director of Engineering, Mercor
Using NeoSigma felt like moving from reactive debugging to a system that actively improves itself. Failures get fixed autonomously by getting encoded into the system, so you see reliability compound over time instead of fighting the same issues repeatedly.
Experiment on Tau3 bench
We demonstrate our framework in a controlled setting on Tau3 bench, a public benchmark for the reliability of multi-turn tool-calling agents in customer service simulations, across the retail, telecom, and airline domains. Tau3 bench tasks require an agent to complete realistic customer requests - managing plans, processing changes, handling eligibility constraints - across a sequence of turns using a fixed set of domain tools. Each task is a multi-turn conversation between a simulated customer and the agent and is scored as pass/fail: the agent either fully resolves the customer's request according to policy or it doesn't.
We start with a baseline agent and run our self-improving system on top of it. The system observes failures from agent executions, clusters them into underlying failure modes, converts them into reusable evaluation cases, and iteratively proposes improvements to the agent harness. Each iteration is evaluated to measure reliability gains. We use val_score as the primary metric, defined as the fraction of tasks the agent completes successfully on the validation set, which is treated as a strict black box throughout, capturing performance on unseen failures and distributions.
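Since each task is scored strictly pass/fail, val_score is simply the pass rate over the validation set; a minimal sketch:

```python
def val_score(results: list[bool]) -> float:
    """Fraction of validation tasks the agent fully resolves (strict pass/fail)."""
    return sum(results) / len(results)

# Five tasks, three resolved according to policy:
print(val_score([True, False, True, True, False]))  # 0.6
```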
Note: The underlying model is fixed to GPT-5.4 (no reasoning) throughout all experiments. We intentionally do not switch models or rely on models that may already be trained on this data, in order to isolate gains from the system itself. Improvements to the agent harness - across prompts, few-shot examples, context construction, tool interfaces, and workflow design - are often significantly cheaper, more controllable, and more model-agnostic than training or switching models. All gains come from harness improvements and maintained evals, demonstrating that meaningful reliability gains can be achieved purely through better system design and feedback loops.

Victor Barres
Tau bench co-creator, Researcher at Sierra
Intelligence in an agent is as much the ability to solve problems as it is the ability to learn from experience and adapt to an ever-changing environment. But without disciplined maintenance and control, agents don't gracefully adapt - they silently degrade. Any serious effort to build production-grade agents must treat these as first-class engineering concerns, not afterthoughts. NeoSigma is paving the way towards making this an operational reality.
The self-improvement loop setup
The loop: simulate a batch of production traffic from the Tau3 real-world request distribution → scan traces, classify root causes, and surface dominant failure patterns → track failures as evaluation cases, cluster by shared root cause, and rerank by recurrence and severity → promote resolved failures to the regression suite, record the outcome, and advance to the next batch.
- 1.
Simulating production traffic:
We simulate production traffic from the training task scenarios with a fixed batch size of 10 tasks. The full setup mirrors how a real production system receives a rolling stream of requests and incidents: not all at once, but in windows that each reveal something new about where the agent is failing.
- 2.
Phase A: Failure Mining:
After each batch of requests, the system scans the trace files for failed tasks and extracts structured failure records. It answers the central questions: What is the root cause of failure for each case? What is the dominant failure pattern? What failure cases are recurring and still not fixed? What should the agent have done differently?
- 3.
Phase B: Eval-set candidate generation with clustering:
Converts the enriched failure records into a structured pool of eval candidates. Every failed task from the batch is pushed into a candidates register pool and grouped by shared root-cause mechanism into clusters. High total_failures and low resolution_rate identify the most systemic and unaddressed failure modes. These clusters form the basis of the evolving eval-set. Rather than treating failures independently, the system tracks and prioritizes them at the level of underlying failure patterns, enabling more efficient coverage of the error space. The evaluation loop is fully autonomous: a model-driven decision layer acts on behalf of the user, making clustering and eval-set acceptance decisions while preserving the option to introduce human review where needed.
- 4.
Phase C: Optimization loop:
The optimizer processes one batch of production data at a time, looks at mined failures, targets the highest-priority unresolved clusters, and uses them to drive harness improvements. It optimizes the full agent harness design within a fixed budget of iterations: each cycle proposes a harness change, runs it against the regression set and the held-out validation set, and decides whether to keep or revert the change. The loop exits when either a gate-passing change is found or the budget is exhausted. Optimization is performed at the level of failure clusters rather than individual tasks, and proposed changes are designed to address the root cause across the cluster rather than patching isolated instances. A proposed change is accepted only if it satisfies two conditions: (1) it must not break any previously resolved failures in the regression set, and (2) it must not degrade overall performance on the held-out validation set. This gating ensures that improvements generalize beyond known failures and reflect system-level gains, rather than overfitting to the regression set.
- 5.
Phase D: Regression Suite Maintenance:
The regression set is not a static benchmark. It's a living collection of cases that evolves with the agent. At the end of each batch cycle, resolved failures are added to the regression set, which then acts as a gate guarding against any new change that re-introduces a previously fixed failure. Each improvement cycle makes it harder to accidentally regress, which forces each subsequent improvement to be genuinely additive.
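Putting Phases A-D together, the batch loop can be sketched in miniature. This is a toy model under strong assumptions: the harness is just a set of handled request types, failure mining is exact, and each failure forms its own cluster.

```python
def run_batch(harness, batch):
    """Toy executor: a task fails when the harness has no rule for it."""
    return [(task, task in harness) for task in batch]

def self_improvement_loop(batches, harness, budget=3):
    regression = []                                   # Phase D: living regression suite
    for batch in batches:
        failures = [t for t, ok in run_batch(harness, batch) if not ok]  # Phase A
        for failure in failures:                      # Phase B: one cluster each here
            for _ in range(budget):                   # Phase C: bounded search
                candidate = harness | {failure}       # propose a harness change
                if all(t in candidate for t in regression):  # regression gate
                    harness = candidate
                    regression.append(failure)        # promote resolved failure
                    break
    return harness, regression

# Start with a harness that only handles "b", then stream two batches of traffic:
harness, regression = self_improvement_loop([["a", "b"], ["b", "c"]], {"b"})
```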

Reah Miyara
Senior Director at Google · ex-OpenAI Post-Training Lead
Transforming performance in production environments requires much more than better models. It requires systems that learn from their own mistakes at scale. Your agents' captured failures are your most valuable dataset.
Results
We ran the self-improvement loop completely autonomously for 18 batches, executing 96 harness experiments under GPT-5.4. The baseline agent started at a val score of 0.560. After 18 batches of automated failure mining, clustering, and harness optimization, the agent reached 0.780, a 39.3% relative improvement with no model upgrade.
Agent harness improvements
Starting from a baseline agent, we observe consistent improvements in the agent harness (val_score) on a held-out validation set. The results reflect genuine improvements in generalization rather than overfitting to known cases. Each iteration consists of multiple candidate harness updates proposed by the system, which explores multiple trajectories. Any candidate harness that degrades performance or regresses on previously fixed failures is rejected, and the system continues to search for alternatives. Over time, this amounts to a form of constrained optimization in which only globally consistent improvements are accepted.
As the regression set grows, the optimization problem becomes progressively harder: each new change must satisfy an expanding set of constraints, ensuring that improvements are cumulative and do not undo prior fixes. All gains are achieved with a fixed underlying model (GPT-5.4), isolating the impact of harness-level improvements. As the regression set expands, it serves as a stricter gate: only iterations that exceed the 80% threshold are run on the validation set, ensuring no regressions. Over time, as the regression set grows sufficiently large, it begins to act as a proxy validation set, creating a tight feedback loop where improvements are continuously validated against a representative distribution of past failures.
Agent performance on the validation set improves from 0.56 → 0.78 over 96 iterations of harness optimization. At each iteration, the system explores multiple candidate updates, retaining only those that both improve validation performance and satisfy the regression gate (≥80%). Updates that fail to meet these criteria are automatically discarded. In later stages (iterations 60-90), most candidate changes are rejected or reverted, as the regression gate prevents any update that reintroduces previously fixed failure modes or degrades performance. As the experiments progress, the optimization problem becomes harder, forcing each improvement to be additive and resulting in steady, compounding gains. This shifts reliability from a manual debugging loop to an automated improvement process, saving substantial engineering time in maintaining complex, real-world agent systems.
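The acceptance rule reduces to a dual gate; a minimal sketch, with the function name illustrative and the thresholds mirroring the ≥80% regression gate and the no-degradation rule described above:

```python
def accept_change(candidate_val, best_val, regression_score, gate=0.80):
    """Keep a candidate harness change only if it passes the regression
    gate AND does not degrade held-out validation performance."""
    return regression_score >= gate and candidate_val >= best_val

print(accept_change(0.62, 0.60, 0.95))  # True: improves val, gate passed
print(accept_change(0.65, 0.60, 0.70))  # False: regresses on known failures
print(accept_change(0.58, 0.60, 1.00))  # False: degrades validation score
```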
Failure mining
The system builds and maintains a live evaluation dataset derived from failures. Each failed trajectory is first triaged into a structured representation that captures the task, context, and failure behavior. As failures accumulate, the system identifies recurring patterns and groups them into clusters. These clusters represent underlying failure modes rather than individual instances. This allows the evaluation set to remain compact while still covering a broad space of errors.
Our system automatically discovered 29+ distinct failure clusters across domains from production traces, without any manual labeling (e.g., wrong insurance cancellation reason, missing refuel after device fix, wrong order identification). Failures are treated as recurring patterns rather than isolated incidents. As clusters are resolved, they are incorporated into the regression set, preventing recurrence. High-impact failure modes are systematically identified, prioritized, and driven toward resolution, enabling continuous, measurable improvements in system behavior.
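Clustering and prioritization of this kind can be sketched as follows. The record fields and the ranking by high total_failures / low resolution_rate mirror the description above, but the field names and scoring are illustrative:

```python
from collections import defaultdict

def cluster_failures(failures):
    """Group failure records by shared root-cause label."""
    clusters = defaultdict(list)
    for f in failures:
        clusters[f["root_cause"]].append(f)
    return clusters

def prioritize(clusters):
    """Rank clusters: many failures and a low resolution rate rank first."""
    def key(items):
        total = len(items)
        resolution_rate = sum(f["resolved"] for f in items) / total
        return (total, -resolution_rate)   # most systemic, least addressed
    return sorted(clusters.items(), key=lambda kv: key(kv[1]), reverse=True)

failures = [
    {"root_cause": "wrong order identification", "resolved": False},
    {"root_cause": "wrong order identification", "resolved": False},
    {"root_cause": "missing refuel after device fix", "resolved": True},
]
ranked = prioritize(cluster_failures(failures))
print(ranked[0][0])  # wrong order identification
```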
Maintaining failures with a Regression set
Every production failure enters the pipeline as a candidate, gets clustered by root cause, and is fixed in the inner loop. Once promoted, it runs against every future change, blocking any regression from reaching production. In practice, this acts as a guardrail, ensuring that improvements never regress with respect to known failure modes.
This creates a tight coupling between observed failures and future system behavior. Failures are not just fixed; they are encoded into the system's evaluation layer, ensuring that similar issues are unlikely to recur.
The regression suite grows from 0 to 17 test cases across 18 batches, with each resolved failure cluster contributing new cases. The ≥ 80% gate is enforced throughout, rejecting any iteration that regresses on known failures. The evaluation set is not static; it evolves with the system. Each fix becomes a permanent constraint, making future improvements harder but more reliable, and ensuring progress compounds without backsliding.
Agent Harness Evolution
Throughout the experiment, the system explores multiple candidate updates to the agent harness; only those that satisfy strict constraints (val_score >= best_seen and regression_score >= 80%) are kept, and the ones that don't pass are reverted. This ensures that improvements generalize to the unseen validation set and preserve performance on previously resolved failures. Over time, the harness evolves to handle a broader range of failure modes more reliably. Agent harness updates span the full stack, including prompt design, harness state tracking, tool interfaces, and overall system architecture, enabling the agent to both fix existing failures and generalize to unseen scenarios.
The optimizer iteratively evolves the agent harness (agent.py) through a sequence of targeted code edits, with each commit representing a concrete change to prompts, control logic, tool usage, state management, and so on. Agent improvement is not a single trajectory but a search process with many rejected paths. The explorer view shows accepted commits with diffs; file diffs for reverted commits are not shown, to keep the interface tractable. The system tracks both accepted commits (what worked) and discarded alternatives (what did not) to reach robust, non-regressive improvements.
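The accept/revert search over harness edits can be sketched as below. The toy evaluate function stands in for the real regression-plus-validation gate, and all names are hypothetical:

```python
def evolve_harness(harness, proposals, evaluate):
    """Try candidate edits in sequence, keeping an audit trail of both
    accepted commits (what worked) and reverted ones (what did not)."""
    accepted, reverted = [], []
    best = evaluate(harness)
    for edit in proposals:
        candidate = harness + [edit]      # apply the proposed edit
        score = evaluate(candidate)
        if score >= best:                 # keep only non-regressive changes
            harness, best = candidate, score
            accepted.append(edit)
        else:
            reverted.append(edit)         # revert, but record the dead end
    return harness, accepted, reverted

# Toy scoring: the harness "score" is just the sum of its edits' values.
harness, accepted, reverted = evolve_harness([], [2, -1, 3], evaluate=sum)
```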

Shyamal Anadkat
ex-OpenAI, Applied Evals
Evals grounded in real usage are the foundation of systems that compound in quality over time. Companies that close the loop between production signals and evaluation will win. NeoSigma is building that infrastructure for AI systems to automate this loop.
Takeaways
- 1.
Failure discovery from realistic traces
Starting from Tau3 bench agent traces, the system automatically identifies and clusters failure modes - tool-call errors, context problems, workflow regressions - without any manual labeling.
- 2.
Automated eval creation and maintenance
Each failure cluster becomes a reusable eval case. The eval set is a living distribution, not a static artifact - it grows as the system encounters new failure modes.
- 3.
Harness improvements across the full stack
Improvements span prompts, few-shot examples, tool definitions, and context construction, and are validated against the evolving eval set.
- 4.
Measurable, tracked reliability gains
We track performance across iterations on the full Tau3 bench suite. Improvements accumulate over iterations, driven by the feedback loop between failures and evals.
Self-maintaining agent systems represent a shift in how we build and operate software. These are systems that can observe their own behavior in deployment, identify and categorize their failure modes, maintain an evolving set of evaluation cases that reflect real-world conditions, and apply targeted improvements to their own operating harness. Rather than relying on static testing or manual iteration, they continuously refine themselves through interaction with their environment.
At NeoSigma, we are shaping this future. We are building the infrastructure to support this feedback loop in real-world systems, helping teams capture failures, convert them into structured evaluation signals, and use them to drive continuous improvements in agent behavior.
If you are deploying agent systems and want to close the feedback loop in real production systems faster, we would love to talk.
Acknowledgements
Special thanks to Shyamal Anadkat (ex-OpenAI), Tim Weingarten (ex-Anthropic, Claude Cowork), Victor Barres (Sierra), Reah Miyara (Google DeepMind), Chirag Mahapatra (Mercor), and Karthik Narasimhan (GPT co-creator, ex-OpenAI) for reviewing and providing valuable feedback on this blog post.
Building production AI?
We would love to chat
We're currently running a closed beta with teams operating at the frontier of agentic capabilities. If you're running AI agents in production and want to build self-improving loops, we'd love to hear from you.