Benchmarks

The Autofix Bot team on December 9, 2025

Autofix Bot is the AI agent purpose-built for deep code review. We are excited to present benchmarks for our GA release, showing how our novel hybrid agent delivers frontier-level detection of security vulnerabilities, code quality issues, and hardcoded secrets at category-leading cost efficiency.

Modern software is increasingly written with AI assistance. Randomized field experiments at Microsoft and Accenture show a 26% increase in completed tasks and 13.5% more commits when developers use AI coding assistants,1 and GitHub reports that 92% of US developers now use them daily.2 But what exactly are we shipping? Industry reports increasingly point to a troubling pattern: code duplication is rising, code churn is spiking, and teams report spending more time debugging AI-generated code than they save from its speed. In a 2025 bakeoff of 100+ LLMs, only ~55% of generated code passed basic security checks.3

The problem compounds. Research shows that LLMs exhibit an inherent preference for specific coding styles that often don't match your codebase, and they perpetuate patterns, good or bad, from their training data.4 When they encounter anti-patterns, they amplify them.

Teams accumulate code that works but that nobody fully understands, making future modifications increasingly costly. Asking an LLM to review its own output doesn't help. Studies show LLM-based code review correctly classifies issues only ~68% of the time, and can introduce regressions that actually worsen the code.5 Security vulnerabilities, hardcoded secrets, and mounting technical debt are all symptoms of the same underlying problem: AI coding agents, left unchecked, produce slop.

Autofix Bot is purpose-built for deep code review. It works in tandem with AI coding agents to catch what they miss — security vulnerabilities, code quality issues, hardcoded secrets, and anti-patterns, with higher accuracy than static-only or LLM-only review.

In this document, we present the architecture, analysis pipeline, performance benchmarks, and future roadmap for Autofix Bot.

Benchmarks: Code Review


OpenSSF CVE Benchmark (200+ real-life CVEs)

[Chart: accuracy (%) — Autofix Bot 81.2, Cursor Bugbot 74.5, Claude Code 71.5, CodeRabbit 59.4, Semgrep (CE) 56.9]

Autofix Bot achieves the highest accuracy in finding bad and insecure code on The OpenSSF CVE Benchmark.6

The benchmark consists of code and metadata for over 200 real-life security vulnerabilities in JavaScript and TypeScript, which have been validated and fixed in open-source projects. It evaluates tools on two key metrics: their ability to detect the vulnerability (avoiding false negatives) and their ability to recognize the validated patch (avoiding false positives).

We use accuracy as the headline metric for this evaluation. Accuracy measures how often the agent gets it right: detecting real vulnerabilities in vulnerable code, and recognizing that patched code is actually fixed.
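In confusion-matrix terms, the metrics work out as follows. This is a sketch; the counts in the usage line are back-derived from the reported percentages (62 TP, 11 FP, 72 TN, 20 FN over 165 diffs), not taken from raw benchmark output:

```python
def metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Standard confusion-matrix metrics used throughout this post."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }

# Back-derived counts reproduce Autofix Bot's reported figures:
print({k: round(v, 4) for k, v in metrics(62, 11, 72, 20).items()})
# → {'accuracy': 0.8121, 'precision': 0.8493, 'recall': 0.7561, 'f1': 0.8}
```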

Here are the full results across all tools we evaluated:

| Metric | Autofix Bot | Cursor Bugbot | Claude Code | CodeRabbit | Semgrep (CE) |
| --- | --- | --- | --- | --- | --- |
| Diffs processed | 165 | 165 | 165 | 165 | 165 |
| Accuracy | 81.21% | 74.55% | 71.52% | 59.39% | 56.97% |
| Precision | 84.93% | 69.23% | 88.89% | 82.61% | 66.67% |
| Recall | 75.61% | 87.80% | 48.78% | 23.17% | 26.83% |
| F1 Score | 80.00% | 77.42% | 62.99% | 36.19% | 38.26% |
| Avg. time per diff | 143.77s | 189.88s | 43.92s | 124.81s | 90.00s |
| Cost | $21.24 | $40/mo† | $48.86 | $30/mo‡ | Free |

† Capped at 200 PR reviews per month. ‡ Capped at 5 reviews per hour.
Cost for Autofix Bot and Claude Code reflects total API spend for the benchmark run. Cursor Bugbot and CodeRabbit are subscription-based.

What do these results mean? Precision and recall represent a fundamental trade-off in code review:

  • High precision, low recall (Claude Code, CodeRabbit, Semgrep) — fewer false positives, but they miss most real vulnerabilities. Not ideal when security is on the line.
  • High recall, lower precision (Cursor Bugbot) — catches more issues, but at the cost of more noise.
  • Best balance (Autofix Bot) — highest accuracy (81.21%) and F1 score (80.00%), meaning it catches the most real issues while keeping false positives manageable.

For teams shipping AI-generated code, this balance is critical. You need a code reviewer that doesn't miss vulnerabilities and doesn't drown you in false positives.

Methodology

We evaluated each tool against the OpenSSF CVE Benchmark by simulating pull requests that introduce vulnerable code. For each CVE in the benchmark, we created two variants: the vulnerable version and the patched version. A tool scores correctly when it flags the vulnerable code and stays silent on the patch.
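Stated as code, the scoring rule for each CVE pair looks like this (a sketch; the field names are ours, not part of the benchmark tooling):

```python
def score_cve(flagged_vulnerable: bool, flagged_patched: bool) -> dict[str, int]:
    """How one CVE pair contributes to the confusion matrix."""
    return {
        "tp": int(flagged_vulnerable),      # flagged the vulnerable variant
        "fn": int(not flagged_vulnerable),  # missed the vulnerability
        "fp": int(flagged_patched),         # flagged already-fixed code
        "tn": int(not flagged_patched),     # correctly silent on the patch
    }

# An ideal tool on a single CVE:
print(score_cve(True, False))  # → {'tp': 1, 'fn': 0, 'fp': 0, 'tn': 1}
```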

Autofix Bot was invoked via the API, analyzing the staged diff for each CVE variant.7

Cursor Bugbot was evaluated through GitHub pull requests:

  • Created repositories for each CVE variant with Bugbot integration enabled
  • Opened PRs where the feature branch introduced vulnerable files
  • Collected Bugbot's review comments and inline annotations via the GitHub API
  • Extracted location metadata from Bugbot's HTML markers to map findings to specific lines

CodeRabbit was evaluated via its CLI:

  • Simulated file additions by removing target files, committing, then restoring and staging them
  • Ran coderabbit review --plain on the staged changes
  • Processed in batches of 5 due to CodeRabbit's rate limit of 5 reviews per hour
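The file-addition simulation above can be sketched with plain git (only git is exercised here; the coderabbit invocation is left as a comment since it requires the CLI and authentication, and the repo layout is a stand-in):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "bench@example.com"   # throwaway identity for the temp repo
git config user.name "bench"
printf 'eval(userInput);\n' > vuln.js        # stand-in for a CVE's target file
git add vuln.js && git commit -qm "baseline with target file"
# Remove the target file and commit, so HEAD no longer contains it...
git rm -q vuln.js && git commit -qm "remove target file"
# ...then restore it from the previous commit; this pathspec form of checkout
# updates both the working tree and the index, staging the file as new.
git checkout -q HEAD~1 -- vuln.js
git diff --cached --name-status              # lists vuln.js with status "A" (added)
# coderabbit review --plain                  # review the staged "addition"
```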

Claude Code was evaluated using the official security review command:

  • Used the same diff simulation approach as CodeRabbit
  • Invoked via claude /security-review with the prompt from claude-code-security-review

Semgrep (Community Edition) was run with default rulesets against each CVE variant's codebase.

All tools were given the same code context and evaluated on identical CVE variants to ensure a fair comparison.

Benchmarks: Secrets Detection

We benchmark Autofix Bot's secrets detection against three widely used static tools — Gitleaks8, Detect-Secrets9, and TruffleHog10 — on a proprietary labeled secrets corpus.


Secrets Detection: F1 Score Comparison

[Chart: F1 score — Autofix Bot 92.78, Gitleaks 75.62, detect-secrets 64.09, TruffleHog 41.22]

We choose F1 as the headline metric here because secrets datasets are imbalanced and teams care about two things at once: not missing real leaks (recall) and not drowning reviewers in noise (precision).

| Metric | Autofix Bot | Gitleaks | detect-secrets | TruffleHog |
| --- | --- | --- | --- | --- |
| Accuracy | 87.45% | 58.49% | 52.12% | 23.36% |
| Precision | 98.69% | 96.98% | 67.52% | 83.04% |
| Recall | 87.45% | 61.97% | 61.00% | 27.41% |
| F1 Score | 92.78% | 75.62% | 64.09% | 41.22% |
| Perfect matches | 453 | 303 | 270 | 121 |
| Partial matches | 0 | 18 | 46 | 21 |
| Missed secrets | 65 | 197 | 202 | 376 |
| False positives | 6 | 10 | 152 | 29 |
| False negatives | 65 | 197 | 202 | 376 |
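As a sanity check, F1 follows directly from the raw counts. This sketch treats only perfect matches as true positives and ignores any partial-match weighting the benchmark may apply, which is why it lands slightly below the reported 92.78%:

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), equivalent to the harmonic mean of P and R."""
    return 2 * tp / (2 * tp + fp + fn)

# Autofix Bot: 453 perfect matches, 6 false positives, 65 false negatives.
print(round(100 * f1_from_counts(453, 6, 65), 1))  # → 92.7
```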

What do these results mean? Static-only tools face a fundamental limitation.

  • High precision, low recall (Gitleaks, TruffleHog) — fewer false positives, but they miss 38–73% of real secrets.
  • Balanced but noisy (detect-secrets) — catches more secrets, but with substantially more false positives.
  • Best in class (Autofix Bot) — highest F1 score (92.78%), catching nearly all secrets while keeping false positives to a minimum.

Autofix Bot achieves this by combining a static regex sweep (maximizing recall) with a custom fine-tuned classifier (maximizing precision).
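The two-stage shape can be illustrated as follows. The regex patterns and the entropy filter below are stand-ins for exposition, not Autofix Bot's actual rule set or classifier:

```python
import math
import re

# Stage 1: broad regex sweep (maximizes recall). Patterns are illustrative.
CANDIDATE_PATTERNS = [
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS-style access key ID
    re.compile(r"(?i)(?:api|secret)[_-]?key\s*[:=]\s*['\"]([^'\"]{16,})['\"]"),
]

def shannon_entropy(s: str) -> float:
    probs = [s.count(c) / len(s) for c in set(s)]
    return -sum(p * math.log2(p) for p in probs)

# Stage 2: precision filter. Autofix Bot uses a fine-tuned classifier here;
# a simple entropy threshold stands in for it in this sketch.
def looks_like_real_secret(candidate: str) -> bool:
    return shannon_entropy(candidate) > 3.0

def scan(text: str) -> list[str]:
    candidates = []
    for pat in CANDIDATE_PATTERNS:
        for m in pat.finditer(text):
            candidates.append(m.group(m.lastindex or 0))
    return [c for c in candidates if looks_like_real_secret(c)]

sample = 'aws_key = "AKIAIOSFODNN7EXAMPLE"\napi_key = "aaaaaaaaaaaaaaaaaaaa"'
print(scan(sample))  # → ['AKIAIOSFODNN7EXAMPLE']
```

The low-entropy placeholder key is swept up by stage 1 but discarded by stage 2, which is exactly the recall/precision division of labor described above.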

Hybrid Agent Architecture

We designed Autofix Bot with two key goals:

  1. Maximize recall on security vulnerabilities, code quality issues, hardcoded secrets, and anti-patterns, while
  2. Keeping false positives low enough to generate remediation patches automatically

The agent architecture is hybrid by construction. Static program analysis provides stable, deterministic signal; an agentic layer uses that signal (plus code and query tools) to conduct focused review, generate remediation patches, and explain those patches.

The pipeline contains the following steps:

  1. Codebase indexing: Build an AST and whole-project graph (data-flow, control-flow, import graph, sources/sinks) that act as stores. The agent can query these stores during analysis and remediation.
  2. Static pass: Run 5,000+ deterministic checkers to establish a low-false-positive baseline of known issues — security vulnerabilities, code quality problems, and potential secrets. A sub-agent suppresses context-specific false positives.
  3. AI review: With static findings, source code tools (ripgrep, graph lookups, etc.), and specialized sub-agents, the main agent performs deep review over the relevant code slice. The agent has access to all stores created in step 1, giving it context on the entire codebase and all open-source dependencies.
  4. Remediation: Specialized sub-agents generate fixes for individual issues detected across steps 2 and 3, and explanations where automated fixes are unsafe.
  5. Sanitization: A language-specific static harness validates all edits generated in step 4, with an additional AI pass to ensure alignment with the intended fix.
  6. Output: Emit a clean git patch, ready to be applied at the HEAD of the analyzed branch.
  7. Caching: Multi-layered caching for source code, AST, and the project's stores to improve repeat analysis performance.
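In skeleton form, steps 2 and 3 compose like this; every function name below is an illustrative stub, not Autofix Bot's API:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    line: int
    rule: str
    source: str  # "static" or "ai"

def run_static_checkers(files):
    # Step 2: deterministic pass seeds the review with low-false-positive findings.
    return [Finding(f, 1, "sql-injection", "static") for f in files]

def agent_review(files, seeds):
    # Step 3: the agent reviews the relevant code slice, steered by the static seeds.
    return [Finding(f, 10, "missing-auth-check", "ai") for f in files]

def review(files):
    static = run_static_checkers(files)
    return static + agent_review(files, seeds=static)

print([f.source for f in review(["app.py"])])  # → ['static', 'ai']
```

The key property is that the static findings are inputs to the agent pass, not merely merged with it afterward: they anchor what the agent looks at.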

Why Hybrid?

LLM-only code review has well-documented shortcomings that our hybrid architecture directly addresses:

  • Low recall on critical issues. LLMs get distracted by stylistic concerns and miss real problems, especially when multiple issue types are present. A static pass helps steer the LLM's focus.
  • Misses interprocedural issues. A grep-only approach lacks deep semantic analysis — it can't track data or control flow across functions. Static analyzers are built for exactly this.
  • Non-determinism. Re-reviews of the same code produce wildly different results. A static pass anchors the output across repeated runs.
  • Cost. LLM-only review is expensive for large codebases. Static narrowing trims the search space, reducing prompt size and tool invocations.
  • Time. Deterministic static seeds let us shard safely and parallelize analysis, while LLM-only reviewers require additional passes or more tool calls.

What's Next

Autofix Bot is now generally available. You can use it today as a terminal UI, as a plugin for Claude Code, or via the MCP server that works with any AI coding tool.

We're continuing to expand Autofix Bot's capabilities:

  • Code quality tooling. We're adding tools that help your AI coding agent write structurally better code. In upcoming releases, we'll provide complexity analysis, documentation coverage, and other metrics that your agent can use to make better decisions.
  • Open-source vulnerability remediation. We're building automated fixes for vulnerabilities in your dependencies — version upgrades plus the refactors needed to ensure your code doesn't break.

Follow @autofixbot for updates, and keep an eye on the Updates feed.


Footnotes

  1. Randomized field experiments at Microsoft, Accenture, and an anonymous Fortune 100 electronics firm evaluated the causal effect of granting Microsoft Copilot access on weekly developer outputs. Pooled weighted-IV estimates show a 26.08% increase in completed tasks and a 13.55% increase in commits. Read more.
  2. GitHub's 2025 developer survey.
  3. 2025 GenAI Code Security Report, by Veracode.
  4. Deng et al. (2024), Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models. arXiv:2407.00456.
  5. Evaluating Large Language Models for Code Review, arXiv 2025.
  6. The OpenSSF CVE Benchmark consists of code and metadata for over 200 real life CVEs, as well as tooling to analyze the vulnerable codebases using a variety of static analysis security testing (SAST) tools and generate reports to evaluate those tools.
  7. Autofix Bot Benchmark Dataset: The benchmark dataset and tooling used to evaluate Autofix Bot.
  8. Gitleaks: A tool for detecting secrets like passwords, API keys, and tokens in git repos, files, and whatever else you wanna throw at it via stdin.
  9. detect-secrets: An enterprise friendly way of detecting and preventing secrets in code.
  10. TruffleHog: Find, verify, and analyze leaked credentials.