
Reading Instructions
This is a technical article suitable for readers who want to gain a deeper understanding.
Cursor’s Harness Evaluation Methodology: Keep Rate, Anomaly Detection, and A/B Testing
Original: Cursor Blog - Continually improving our agent harness (2026-04-30)
Tags: #harness-engineering #eval #measurement #cursor
Why This Article is Important
On April 30, 2026, Cursor’s blog systematically disclosed its engineering methodology for the Agent Harness. Rather than showcasing new features, it demonstrates how they determine whether a change genuinely improves the Agent.
This is rare information in the AI Agent field. Most teams rely on subjective feelings or benchmark numbers, but Cursor presents the complete measurement stack for production-grade harness evaluation:
- Offline eval (CursorBench) captures capability baselines
- Online A/B testing validates real usage scenarios
- Keep Rate: tracks the survival rate of code in user codebases
- LLM semantic satisfaction analysis: assesses whether users are genuinely satisfied
- Anomaly detection + automated ticket creation drives continuous improvement
These methods directly address the core question of whether harness changes are effective.
I. Two-Tier Evaluation Architecture: Offline Eval + Online Experiments
Offline Eval: Rapid Standardized Readings
Cursor maintains CursorBench as an offline evaluation suite, aimed at:
“gives us a fast, standardized read on quality and lets us compare across time”
The value of offline eval lies in speed and standardization—each code submission can be quickly run to capture regressions. However, Cursor explicitly points out:
“even the best benchmarks only approximate real usage, meaning we’d miss important signals if we relied on them entirely”
This indicates that offline eval is a necessary condition but not a sufficient one.
Online Experiments: A/B Testing Real Users
Cursor runs double-blind A/B tests in production, deploying harness variants in parallel on real user traffic. The advantage of online experiments is capturing signals in real usage scenarios, but the trade-offs include:
- Sufficient traffic is needed to achieve statistical significance
- Experiment cycles are long (from days to weeks)
- There is a risk of user impact
Key Insight: The two-tier evaluation is complementary; offline eval is responsible for rapid iteration, while online experiments validate hypotheses. Missing either creates blind spots.
II. Core Metrics of Online Experiments
Cursor acknowledges the existence of “fuzzier but more important questions”—these are the truly difficult-to-measure aspects.
2.1 Keep Rate: Code Survival Rate
Keep Rate = The proportion of Agent-generated code still retained in the user codebase
This is a retrospective validation metric checked after a fixed time window:
| Situation | Signal Interpretation |
|---|---|
| Code is heavily modified | Initial quality of the Agent was insufficient |
| User requests the Agent to fix something | The Agent didn’t get it right the first time |
| User moves on to the next feature | The Agent completed the task |
The core insight of Keep Rate: Whether users need to go back and clean up the mess. A low Keep Rate indicates that the Agent generated a lot of code that users found unacceptable or unsatisfactory on the first response.
This metric is closer to real value than “whether the code compiles/passes tests”—code can pass tests but still not be what users want.
2.2 LLM Semantic Satisfaction Analysis
Cursor uses models to evaluate the semantic feedback from users:
“A user moving on to the next feature is a strong signal the agent did its job, while a user pasting a stack trace is a reliable signal that it didn’t”
This is not simple text matching or sentiment analysis; rather, it allows the LLM to read user responses to Agent outputs and capture satisfaction/dissatisfaction signals at the semantic level.
Advantages of this method:
- Captures nuanced dissatisfaction like “good enough but not what the user wanted”
- Does not rely on users actively scoring (users rarely fill out feedback forms)
- Scalable deployment (models automatically analyze all conversations)
2.3 General Engineering Metrics
| Metric | Purpose |
|---|---|
| Latency | End-to-end response time, affecting user experience |
| Token Efficiency | Context window utilization, related to costs |
| Tool Call Count | Reflects whether the Agent efficiently achieves goals |
| Cache Hit Rate | Affects costs and first-round response speed |
These are directional metrics that help assess the side effects of changes but cannot independently prove quality improvements.
III. Downgrade Tracking and Automated Repair Loops
3.1 Classification System for Tool Errors
Cursor classifies Agent tool call errors into two main categories:
Unknown Errors:
- Always treated as Bugs
- Set fixed threshold alerts; any rise in unknown error rates triggers investigation
Expected Errors:
- InvalidArguments: model call parameter errors
- UnexpectedEnvironment: contradictions in the context window
- ProviderError: external service outages (GenerateImage, WebSearch, etc.)
- UserAborted, Timeout, etc.
3.2 Anomaly Detection: Beyond Fixed Thresholds
The issue with fixed threshold alerts: different models have different baselines, using the same threshold can lead to missed or false alerts.
Cursor’s solution:
“We compute baselines per-tool and per-model, because different models may mess up tool calls at different rates”
Establish baselines by tool × model, then detect anomalous deviations—this is more sensitive than fixed thresholds and can capture cases where the error rate significantly rises above the normal level for that model.
3.3 Automated Problem Discovery and Ticket Creation
Cursor has deployed a weekly-running Cloud Agent equipped with a dedicated skill:
- Search production logs
- Discover newly emerged or spiking errors
- Automatically create or update tickets in the backlog
The significance of this mechanism is: turning manual inspections into automated monitoring, allowing engineering teams to shift from a fire-fighting mode to a preventive mode.
Cursor reported a specific achievement:
“Over the course of a focused sprint earlier this year, we drove unexpected tool call errors down by an order of magnitude”
One sprint reduced unexpected tool call errors by an order of magnitude—this is only achievable with precise measurement systems in place.
IV. Evolution of the Context Window: From Guardrails to Dynamic Fetching
Cursor candidly reviews the evolution of its Context Window strategy, which serves as a valuable engineering record.
Early 2024: Extensive Guardrails
During a period of weaker model capabilities, Cursor actively added numerous contextual engineering guardrails:
- After each edit, show lint and type errors to the Agent
- Automatically rewrite its file reading requests when the Agent requests too few lines
- Limit the maximum number of tool calls per round for the Agent
- Provide extensive static context (codebase layout, semantically matched snippets, compressed versions of user manually attached files)
2026 (Now): Dynamic Context
“we’ve adapted to increasing model capability by knocking down guardrails and providing more dynamic context, which can be fetched by the agent while it works”
Current strategy:
- Retain a small amount of useful static context (operating system, git status, current/recently viewed files)
- Significantly reduce static context, shifting to dynamic on-demand fetching
- The Agent actively discovers and requests the needed context while working
This represents a technical evolution path of model capability improvement → guardrail reduction → dynamic context. Guardrails are not “better design” but rather “engineering compensations for insufficient model capabilities”.
V. Deep Customization for Model-Specific Needs
Cursor emphasizes that its harness abstraction layer can deeply customize for each model, and this customization “goes very deep”.
Native Adaptation of Tool Formats
Core example:
| Model | Training Format Used | Reason |
|---|---|---|
| OpenAI Model | Patch-based Format | Used during training |
| Anthropic Model | String Replacement | Used during training |
“Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes”
This detail indicates that tool formats are not neutral—they affect the model’s reasoning overhead and error rates. This is the minimum entry point for harness customization depth.
Custom Prompts at Provider and Version Levels
Cursor observes the “personality” differences among various models:
- OpenAI Models: more literal and precise, stricter instruction following
- Claude Models: more intuitive, more tolerant of imprecise instructions
These differences affect prompt wording, number of examples, instruction precision, etc.—all requiring adaptation at the harness layer.
“Context Anxiety”: Harness Mitigating Model Quirks
Cursor mentions a specific case of model quirks:
“we observed one model develop what we came to call context anxiety: As its context window filled up, it would start refusing work, hedging that the task seemed too big. We were able to reduce the behavior through prompt adjustments”
This case illustrates that even in 2026, models can exhibit unexpected behavior patterns. The role of the harness is not just to “provide good tools for the model” but also to “identify and mitigate unexpected behaviors of the model”.
VI. Engineering Challenges of Mid-Chat Model Switching
When users switch models mid-conversation, Cursor faces two challenges:
6.1 Conversation History Distribution Shift
After switching, the new model faces a “conversation history generated by another model,” which differs from the data distribution it was trained on.
Cursor’s handling:
- Add custom instructions to inform the model “you are taking over a mid-chat session”
- Guide the model not to call tools that appear in the conversation history but are not in the current model’s toolset
6.2 Cache Miss
“caches are provider- and model-specific, so switching means a cache miss and a slower, more expensive first turn”
Cursor attempts to alleviate this issue with session summaries, but this introduces new problems:
“if the user is deep into a complex task, the summary can lose important details”
This trade-off indicates that session summaries are a form of information compression, and compression inevitably incurs losses. For complex tasks, this loss is unacceptable.
Cursor’s practical recommendation:
“We generally recommend staying with one model for the duration of a conversation unless you have a reason to switch”
This forms an interesting contrast with Augment’s research on “good habit accumulation”—both focus on the contextual quality issues of long-term Agents but approach them from different angles.
VII. Future Directions: Harness as the Core of Multi-Agent Orchestration
Cursor’s judgment about the future is very clear:
“The future of AI-assisted software engineering will be multi-agent. Instead of running every subtask through a single agent, the system will learn to delegate across specialized agents and subagents”
The key implication is:
“The ability to orchestrate that kind of coordination will live in the harness rather than any single agent”
This means:
- Agents are execution units (perform specific tasks)
- Harness is the orchestration layer (decides which Agent to dispatch, how to frame tasks, and how to stitch results)
This aligns with Anthropic’s “Brain/Hands decoupling” but more specifically points to the responsibilities of Harness in multi-Agent scenarios:
| Responsibility | Description |
|---|---|
| Dispatch Decisions | Determine which Agent is suitable for the current subtask |
| Framing | Adjust task descriptions based on Agent strengths |
| Result Stitching | Integrate outputs from multiple Agents into a coherent workflow |
| Context Passing | Manage information flow across Agents |
Engineering Insights
1. Measurement Precedes Optimization
The core of Cursor’s methodology is: without measurement, there is no improvement. Keep Rate, LLM semantic analysis, anomaly detection baselines—these constitute a complete measurement stack that grounds the process of “optimization”.
For teams building their own Agent systems, this means:
- First establish a baseline (baseline measurement)
- Then introduce changes (controlled change)
- Finally validate the effects (comparison against baseline)
Skipping measurement for direct optimization is akin to a blind person riding a blind horse.
2. Philosophy of Classifying Expected vs Unknown Errors
Cursor categorizes errors into “expected” and “unknown,” setting strict alerts for the latter—this reflects a defensive engineering mindset: any anomaly should be investigated rather than ignored.
This is particularly important for Agent systems, as the behavioral space of models is vast, and certain error patterns may only emerge at production scale.
3. Guardrails as Technical Debt, Not Design Choices
Cursor clearly states that early guardrails were “engineering compensations for insufficient model capabilities” and have been gradually removed as model capabilities improved. This indicates that guardrails should be viewed as technical debt that should be actively repaid as model capabilities evolve, rather than permanently retained.
4. The Depth of Harness Customization Determines Agent Limits
Tool formats, prompt strategies, anomaly handling—Cursor deeply customizes these dimensions for different models. This shows that the ultimate performance of an Agent is a joint product of the Model and the Harness, not an independent attribute of the model itself.
The same model, under different depths of Harness, can exhibit significantly different capability levels and behavioral characteristics.
Conclusion
Cursor’s article provides an important firsthand perspective on how to continuously improve the Agent Harness in a production environment.
Core methodology:
- Offline + Online two-tier evaluation: rapid iteration + real validation
- Keep Rate + LLM semantic analysis: capturing real user value
- Anomaly detection + automated ticket: turning manual inspections into automated monitoring
- Per-model deep customization: joint optimization of Harness and model
- Multi-Agent orchestration as the core responsibility of Harness: Agents are execution units, Harness is the coordination layer
This methodology is valuable for any team building Agent systems: establishing a measurement system is the first and most challenging step, but it determines whether the system can continuously improve.
Comments
Discussion is powered by Giscus (GitHub Discussions). Add
repo,repoID,category, andcategoryIDunder[params.comments.giscus]inhugo.tomlusing the values from the Giscus setup tool.