Evaluating Cursor's Agent Harness: Keep Rate, Anomaly Detection, and A/B Testing

Reading Instructions

This is a technical article suitable for readers who want to gain a deeper understanding.

Cursor’s Harness Evaluation Methodology: Keep Rate, Anomaly Detection, and A/B Testing

Original: Cursor Blog - Continually improving our agent harness (2026-04-30)

Tags: #harness-engineering #eval #measurement #cursor

Why This Article is Important

On April 30, 2026, Cursor’s blog systematically disclosed its engineering methodology for the Agent Harness. Rather than showcasing new features, it demonstrates how they determine whether a change genuinely improves the Agent.

This is rare information in the AI Agent field. Most teams rely on subjective feelings or benchmark numbers, but Cursor presents the complete measurement stack for production-grade harness evaluation:

Offline eval (CursorBench) captures capability baselines
Online A/B testing validates real usage scenarios
Keep Rate: tracks the survival rate of code in user codebases
LLM semantic satisfaction analysis: assesses whether users are genuinely satisfied
Anomaly detection + automated ticket creation drives continuous improvement

These methods directly address the core question of whether harness changes are effective.

I. Two-Tier Evaluation Architecture: Offline Eval + Online Experiments

Offline Eval: Rapid Standardized Readings

Cursor maintains CursorBench as an offline evaluation suite, aimed at:

“gives us a fast, standardized read on quality and lets us compare across time”

The value of offline eval lies in speed and standardization—each code submission can be quickly run to capture regressions. However, Cursor explicitly points out:

“even the best benchmarks only approximate real usage, meaning we’d miss important signals if we relied on them entirely”

This indicates that offline eval is a necessary condition but not a sufficient one.

Online Experiments: A/B Testing Real Users

Cursor runs double-blind A/B tests in production, deploying harness variants in parallel on real user traffic. The advantage of online experiments is capturing signals in real usage scenarios, but the trade-offs include:

Sufficient traffic is needed to achieve statistical significance
Experiment cycles are long (from days to weeks)
There is a risk of user impact

Key Insight: The two-tier evaluation is complementary; offline eval is responsible for rapid iteration, while online experiments validate hypotheses. Missing either creates blind spots.

II. Core Metrics of Online Experiments

Cursor acknowledges the existence of “fuzzier but more important questions”—these are the truly difficult-to-measure aspects.

2.1 Keep Rate: Code Survival Rate

Keep Rate = The proportion of Agent-generated code still retained in the user codebase

This is a retrospective validation metric checked after a fixed time window:

Situation	Signal Interpretation
Code is heavily modified	Initial quality of the Agent was insufficient
User requests the Agent to fix something	The Agent didn’t get it right the first time
User moves on to the next feature	The Agent completed the task

The core insight of Keep Rate: Whether users need to go back and clean up the mess. A low Keep Rate indicates that the Agent generated a lot of code that users found unacceptable or unsatisfactory on the first response.

This metric is closer to real value than “whether the code compiles/passes tests”—code can pass tests but still not be what users want.

2.2 LLM Semantic Satisfaction Analysis

Cursor uses models to evaluate the semantic feedback from users:

“A user moving on to the next feature is a strong signal the agent did its job, while a user pasting a stack trace is a reliable signal that it didn’t”

This is not simple text matching or sentiment analysis; rather, it allows the LLM to read user responses to Agent outputs and capture satisfaction/dissatisfaction signals at the semantic level.

Advantages of this method:

Captures nuanced dissatisfaction like “good enough but not what the user wanted”
Does not rely on users actively scoring (users rarely fill out feedback forms)
Scalable deployment (models automatically analyze all conversations)

2.3 General Engineering Metrics

Metric	Purpose
Latency	End-to-end response time, affecting user experience
Token Efficiency	Context window utilization, related to costs
Tool Call Count	Reflects whether the Agent efficiently achieves goals
Cache Hit Rate	Affects costs and first-round response speed

These are directional metrics that help assess the side effects of changes but cannot independently prove quality improvements.

III. Downgrade Tracking and Automated Repair Loops

3.1 Classification System for Tool Errors

Cursor classifies Agent tool call errors into two main categories:

Unknown Errors:

Always treated as Bugs
Set fixed threshold alerts; any rise in unknown error rates triggers investigation

Expected Errors:

InvalidArguments: model call parameter errors
UnexpectedEnvironment: contradictions in the context window
ProviderError: external service outages (GenerateImage, WebSearch, etc.)
UserAborted, Timeout, etc.

3.2 Anomaly Detection: Beyond Fixed Thresholds

The issue with fixed threshold alerts: different models have different baselines, using the same threshold can lead to missed or false alerts.

Cursor’s solution:

“We compute baselines per-tool and per-model, because different models may mess up tool calls at different rates”

Establish baselines by tool × model, then detect anomalous deviations—this is more sensitive than fixed thresholds and can capture cases where the error rate significantly rises above the normal level for that model.

3.3 Automated Problem Discovery and Ticket Creation

Cursor has deployed a weekly-running Cloud Agent equipped with a dedicated skill:

Search production logs
Discover newly emerged or spiking errors
Automatically create or update tickets in the backlog

The significance of this mechanism is: turning manual inspections into automated monitoring, allowing engineering teams to shift from a fire-fighting mode to a preventive mode.

Cursor reported a specific achievement:

“Over the course of a focused sprint earlier this year, we drove unexpected tool call errors down by an order of magnitude”

One sprint reduced unexpected tool call errors by an order of magnitude—this is only achievable with precise measurement systems in place.

IV. Evolution of the Context Window: From Guardrails to Dynamic Fetching

Cursor candidly reviews the evolution of its Context Window strategy, which serves as a valuable engineering record.

Early 2024: Extensive Guardrails

During a period of weaker model capabilities, Cursor actively added numerous contextual engineering guardrails:

After each edit, show lint and type errors to the Agent
Automatically rewrite its file reading requests when the Agent requests too few lines
Limit the maximum number of tool calls per round for the Agent
Provide extensive static context (codebase layout, semantically matched snippets, compressed versions of user manually attached files)

2026 (Now): Dynamic Context

“we’ve adapted to increasing model capability by knocking down guardrails and providing more dynamic context, which can be fetched by the agent while it works”

Current strategy:

Retain a small amount of useful static context (operating system, git status, current/recently viewed files)
Significantly reduce static context, shifting to dynamic on-demand fetching
The Agent actively discovers and requests the needed context while working

This represents a technical evolution path of model capability improvement → guardrail reduction → dynamic context. Guardrails are not “better design” but rather “engineering compensations for insufficient model capabilities”.

V. Deep Customization for Model-Specific Needs

Cursor emphasizes that its harness abstraction layer can deeply customize for each model, and this customization “goes very deep”.

Native Adaptation of Tool Formats

Core example:

Model	Training Format Used	Reason
OpenAI Model	Patch-based Format	Used during training
Anthropic Model	String Replacement	Used during training

“Either model could use either tool, but giving it the unfamiliar one costs extra reasoning tokens and produces more mistakes”

This detail indicates that tool formats are not neutral—they affect the model’s reasoning overhead and error rates. This is the minimum entry point for harness customization depth.

Custom Prompts at Provider and Version Levels

Cursor observes the “personality” differences among various models:

OpenAI Models: more literal and precise, stricter instruction following
Claude Models: more intuitive, more tolerant of imprecise instructions

These differences affect prompt wording, number of examples, instruction precision, etc.—all requiring adaptation at the harness layer.

“Context Anxiety”: Harness Mitigating Model Quirks

Cursor mentions a specific case of model quirks:

“we observed one model develop what we came to call context anxiety: As its context window filled up, it would start refusing work, hedging that the task seemed too big. We were able to reduce the behavior through prompt adjustments”

This case illustrates that even in 2026, models can exhibit unexpected behavior patterns. The role of the harness is not just to “provide good tools for the model” but also to “identify and mitigate unexpected behaviors of the model”.

VI. Engineering Challenges of Mid-Chat Model Switching

When users switch models mid-conversation, Cursor faces two challenges:

6.1 Conversation History Distribution Shift

After switching, the new model faces a “conversation history generated by another model,” which differs from the data distribution it was trained on.

Cursor’s handling:

Add custom instructions to inform the model “you are taking over a mid-chat session”
Guide the model not to call tools that appear in the conversation history but are not in the current model’s toolset

6.2 Cache Miss

“caches are provider- and model-specific, so switching means a cache miss and a slower, more expensive first turn”

Cursor attempts to alleviate this issue with session summaries, but this introduces new problems:

“if the user is deep into a complex task, the summary can lose important details”

This trade-off indicates that session summaries are a form of information compression, and compression inevitably incurs losses. For complex tasks, this loss is unacceptable.

Cursor’s practical recommendation:

“We generally recommend staying with one model for the duration of a conversation unless you have a reason to switch”

This forms an interesting contrast with Augment’s research on “good habit accumulation”—both focus on the contextual quality issues of long-term Agents but approach them from different angles.

VII. Future Directions: Harness as the Core of Multi-Agent Orchestration

Cursor’s judgment about the future is very clear:

“The future of AI-assisted software engineering will be multi-agent. Instead of running every subtask through a single agent, the system will learn to delegate across specialized agents and subagents”

The key implication is:

“The ability to orchestrate that kind of coordination will live in the harness rather than any single agent”

This means:

Agents are execution units (perform specific tasks)
Harness is the orchestration layer (decides which Agent to dispatch, how to frame tasks, and how to stitch results)

This aligns with Anthropic’s “Brain/Hands decoupling” but more specifically points to the responsibilities of Harness in multi-Agent scenarios:

Responsibility	Description
Dispatch Decisions	Determine which Agent is suitable for the current subtask
Framing	Adjust task descriptions based on Agent strengths
Result Stitching	Integrate outputs from multiple Agents into a coherent workflow
Context Passing	Manage information flow across Agents

Engineering Insights

1. Measurement Precedes Optimization

The core of Cursor’s methodology is: without measurement, there is no improvement. Keep Rate, LLM semantic analysis, anomaly detection baselines—these constitute a complete measurement stack that grounds the process of “optimization”.

For teams building their own Agent systems, this means:

First establish a baseline (baseline measurement)
Then introduce changes (controlled change)
Finally validate the effects (comparison against baseline)

Skipping measurement for direct optimization is akin to a blind person riding a blind horse.

2. Philosophy of Classifying Expected vs Unknown Errors

Cursor categorizes errors into “expected” and “unknown,” setting strict alerts for the latter—this reflects a defensive engineering mindset: any anomaly should be investigated rather than ignored.

This is particularly important for Agent systems, as the behavioral space of models is vast, and certain error patterns may only emerge at production scale.

3. Guardrails as Technical Debt, Not Design Choices

Cursor clearly states that early guardrails were “engineering compensations for insufficient model capabilities” and have been gradually removed as model capabilities improved. This indicates that guardrails should be viewed as technical debt that should be actively repaid as model capabilities evolve, rather than permanently retained.

4. The Depth of Harness Customization Determines Agent Limits

Tool formats, prompt strategies, anomaly handling—Cursor deeply customizes these dimensions for different models. This shows that the ultimate performance of an Agent is a joint product of the Model and the Harness, not an independent attribute of the model itself.

The same model, under different depths of Harness, can exhibit significantly different capability levels and behavioral characteristics.

Conclusion

Cursor’s article provides an important firsthand perspective on how to continuously improve the Agent Harness in a production environment.

Core methodology:

Offline + Online two-tier evaluation: rapid iteration + real validation
Keep Rate + LLM semantic analysis: capturing real user value
Anomaly detection + automated ticket: turning manual inspections into automated monitoring
Per-model deep customization: joint optimization of Harness and model
Multi-Agent orchestration as the core responsibility of Harness: Agents are execution units, Harness is the coordination layer

This methodology is valuable for any team building Agent systems: establishing a measurement system is the first and most challenging step, but it determines whether the system can continuously improve.