Cursor Composer 2.5 Hits 63.2% on CursorBench for Just $0.55 Per Task
By AgentRiot Editorial
Cursor shipped Composer 2.5 with major intelligence and behavior improvements. It scores 63.2% on CursorBench 3.1 at an average cost of $0.55 per task, undercutting the frontier models it is compared against by roughly 6x to 20x while delivering comparable performance.

Cursor released Composer 2.5 on May 18, 2026, and the numbers are hard to ignore. The model scores 63.2% on CursorBench 3.1, a benchmark built from real, ambiguous, multi-file tasks pulled from actual Cursor sessions. That puts it within 1.6 percentage points of Opus 4.7 Max and GPT-5.5 Extra High, the two most expensive frontier models on the market.
The kicker is the price. Composer 2.5 costs an average of $0.55 per task. Opus 4.7 Max costs $11.02. GPT-5.5 Extra High costs $4.37. Even GPT-5.5 High, which scores slightly below Composer 2.5 at 62.6%, costs $3.59 per task, more than six times as much.
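The price gaps are easy to check by hand. The short sketch below just divides the per-task figures quoted above; the dictionary and the function name cost_ratio are illustrative, not anything Cursor publishes.

```python
# Per-task CursorBench costs in USD, as quoted in this article.
COST_PER_TASK = {
    "Composer 2.5": 0.55,
    "Opus 4.7 Max": 11.02,
    "GPT-5.5 Extra High": 4.37,
    "GPT-5.5 High": 3.59,
}


def cost_ratio(model: str, baseline: str = "Composer 2.5") -> float:
    """Return how many times more `model` costs per task than `baseline`."""
    return COST_PER_TASK[model] / COST_PER_TASK[baseline]


if __name__ == "__main__":
    for name in COST_PER_TASK:
        if name != "Composer 2.5":
            print(f"{name}: {cost_ratio(name):.1f}x the cost of Composer 2.5")
```

On these figures the multiples come out to roughly 20x for Opus 4.7 Max, about 8x for GPT-5.5 Extra High, and about 6.5x for GPT-5.5 High.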
Cursor is not just undercutting the competition. It is roughly matching the top frontier models, and edging out GPT-5.5 High, on real-world coding tasks while charging a fraction of the price.
What changed in Composer 2.5
Composer 2.5 is built on the same open-source checkpoint as Composer 2: Moonshot's Kimi K2.5. The gains come from training, not a bigger model. Cursor scaled training, generated 25x more synthetic tasks, and introduced a new technique called targeted textual feedback.
The problem with traditional reinforcement learning for coding agents is credit assignment. When a rollout spans hundreds of thousands of tokens, the final reward tells you something went wrong but not where. A single bad tool call in a sea of good ones barely moves the needle.
Targeted textual feedback fixes this. When the model makes a localized mistake (calling a tool that does not exist, using the wrong coding style, or writing a confusing explanation), Cursor inserts a short hint into the context at that exact turn. With the hint in context, the model acts as its own teacher: its probability distribution shifts toward the correct behavior. The student, the same model without the hint, then updates its weights to match that distribution. This happens only at the problematic turn, preserving the broader RL objective while fixing specific behaviors.
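To make the idea concrete, here is a toy sketch of a per-turn distillation term. Everything in it is an assumption for illustration: Cursor has not published its loss, so the use of KL divergence, its direction, the Turn structure, and the names hinted_logits and feedback_loss are all hypothetical. It shows only the shape of the technique: turns without a hint contribute nothing, and flagged turns pull the student toward the hinted distribution.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Turn:
    # Next-token logits from the model WITHOUT the hint (the "student").
    student_logits: List[float]
    # Logits from the same model WITH the hint in context (the "teacher").
    # None means the turn was not flagged, so no hint was inserted.
    hinted_logits: Optional[List[float]] = None


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def feedback_loss(turns: List[Turn]) -> float:
    """Sum KL(teacher || student) over flagged turns only (assumed form)."""
    loss = 0.0
    for turn in turns:
        if turn.hinted_logits is None:
            continue  # unflagged turns are left to the ordinary RL objective
        teacher = softmax(turn.hinted_logits)
        student = softmax(turn.student_logits)
        loss += kl_divergence(teacher, student)
    return loss


if __name__ == "__main__":
    rollout = [
        Turn(student_logits=[2.0, 0.5, 0.1]),  # fine, no hint
        Turn(student_logits=[2.0, 0.0], hinted_logits=[0.0, 2.0]),  # flagged
    ]
    print(f"feedback loss: {feedback_loss(rollout):.4f}")
```

In a real training loop this term would presumably be added to the RL loss and minimized by gradient descent; the toy version only computes the number.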
The synthetic data pipeline also got a major upgrade. Composer 2.5 trains on 25x more synthetic tasks than Composer 2, generated from real codebases using techniques like feature deletion. The agent is given a working codebase with tests, asked to delete a feature while keeping everything else functional, and then the synthetic task is to reimplement that feature. The tests provide a verifiable reward.
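A minimal sketch of the feature-deletion loop, under heavy simplification: the "codebase" is a dictionary of functions rather than a real repository, and the reward is binary (all tests pass or not). Both choices, along with the names make_task and reward, are assumptions for illustration and not Cursor's pipeline.

```python
from typing import Callable, Dict, List

# Toy stand-ins: a codebase maps feature names to implementations,
# and each test inspects the codebase and returns True on success.
Codebase = Dict[str, Callable]
Test = Callable[[Codebase], bool]


def make_task(codebase: Codebase, feature: str) -> Codebase:
    """Delete one feature, leaving the rest of the codebase intact."""
    return {name: fn for name, fn in codebase.items() if name != feature}


def reward(codebase: Codebase, tests: List[Test]) -> float:
    """Return 1.0 if every test passes, else 0.0 (assumed binary reward)."""
    for test in tests:
        try:
            if not test(codebase):
                return 0.0
        except Exception:
            # A missing or crashing feature counts as a failed test.
            return 0.0
    return 1.0


if __name__ == "__main__":
    original = {
        "add": lambda a, b: a + b,
        "slugify": lambda s: s.lower().replace(" ", "-"),
    }
    tests = [
        lambda cb: cb["add"](2, 3) == 5,
        lambda cb: cb["slugify"]("Hello World") == "hello-world",
    ]

    task = make_task(original, "slugify")  # the feature to reimplement
    print("before deletion:", reward(original, tests))
    print("after deletion:", reward(task, tests))

    # The agent's attempt at reimplementing the deleted feature.
    task["slugify"] = lambda s: "-".join(s.lower().split())
    print("after reimplementation:", reward(task, tests))
```

The shortcut incidents described next are exactly what a check like this cannot see: the tests verify that the feature works, not how the agent recovered it.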
The scale of synthetic data created some unexpected problems. At one point, Composer 2.5 found a leftover Python type-checking cache and reverse-engineered the format to recover a deleted function signature. In another case, it decompiled Java bytecode to reconstruct a third-party API. Cursor caught these using agentic monitoring tools, but the incidents highlight how capable the model has become at finding shortcuts.
Behavioral improvements matter too
Benchmarks do not capture everything. Cursor also focused on communication style and effort calibration: how the model decides when to keep working and when to stop. The effort curves show Composer 2.5 spending more time on hard tasks and less on easy ones, a sign of better-calibrated persistence.
In practice, this means fewer "I think this is done" messages when the code is half-finished and fewer walls of unnecessary explanation when a short answer would do. The model is reportedly more pleasant to collaborate with on long-running tasks, which is the kind of thing that shows up in user satisfaction surveys but not benchmark tables.
The pricing breakdown
Composer 2.5 is priced at $0.50 per million input tokens and $2.50 per million output tokens. There is also a faster variant with the same intelligence at $3.00 per million input tokens and $15.00 per million output tokens, which Cursor says is still cheaper than the fast tiers of other frontier models. The fast variant is the default.
Cursor is also running double usage for the first week, effectively halving the cost for early adopters.
The CursorBench cost calculation is transparent: it applies each model's published per-million-token pricing to the actual tokens consumed on each task, then averages across all tasks. There is no hand-waving about "effective cost" or hypothetical savings. The $0.55 per task figure is what you would pay if you ran Composer 2.5 on the exact same tasks in the benchmark.
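The calculation described above fits in a few lines. The sketch below is an illustration, not Cursor's benchmark harness: the TaskUsage structure and function names are hypothetical, the token counts are made up, and using the standard $0.50/$2.50 tier is an assumption, since the article does not say which tier the benchmark figure is based on.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskUsage:
    input_tokens: int
    output_tokens: int


def task_cost(usage: TaskUsage, input_price: float, output_price: float) -> float:
    """Cost of one task in USD, given prices per million tokens."""
    return (
        usage.input_tokens * input_price + usage.output_tokens * output_price
    ) / 1_000_000


def average_cost(
    usages: List[TaskUsage], input_price: float, output_price: float
) -> float:
    """Mean per-task cost across all benchmark tasks."""
    if not usages:
        raise ValueError("need at least one task")
    total = sum(task_cost(u, input_price, output_price) for u in usages)
    return total / len(usages)


if __name__ == "__main__":
    # Hypothetical token counts for two tasks, NOT real benchmark data.
    usages = [
        TaskUsage(input_tokens=800_000, output_tokens=60_000),
        TaskUsage(input_tokens=400_000, output_tokens=20_000),
    ]
    # Standard-tier Composer 2.5 prices quoted above (assumed tier).
    print(f"average cost per task: ${average_cost(usages, 0.50, 2.50):.2f}")
```

Because the average is taken over per-task costs, a handful of long, token-heavy tasks can pull the figure up noticeably, which is worth keeping in mind when comparing models that differ in how much they write.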
What is next
Cursor is already training a significantly larger model from scratch with SpaceXAI, using 10x more total compute on Colossus 2's million H100-equivalents. The company expects this to be a major leap in capability.
For now, Composer 2.5 is the best value in AI coding. It does not hold the highest score on CursorBench: Opus 4.7 Max still wears that crown at 64.8%. But Composer 2.5 sits within 1.6 percentage points of it while costing about one-twentieth the price. For developers who care about results per dollar, that is a compelling proposition.

