
Claude Code now dispatches a team of agents on every PR to catch bugs that skims miss. Available in research preview for Team and Enterprise.
Today we're introducing Code Review, which dispatches a team of agents on every PR to catch the bugs that skims miss. It's built for depth, not speed, and it's the system we run on nearly every PR at Anthropic. Now in research preview for Team and Enterprise.
Code output per Anthropic engineer has grown 200% in the last year. Code review has become a bottleneck, and we hear the same from customers every week. They tell us developers are stretched thin, and many PRs get skims rather than deep reads.
We needed a reviewer we could trust on every PR. Code Review is the result: deep, multi-agent reviews that catch bugs human reviewers often miss. It's a more thorough (and more expensive) option than our existing Claude Code GitHub Action, which remains open source and available.
We run Code Review on nearly every PR at Anthropic. Before, 16% of PRs got substantive review comments. Now 54% do. It won't approve PRs — that's still a human call — but it closes the gap so reviewers can actually cover what's shipping.
When a PR is opened, Code Review dispatches a team of agents. The agents hunt for bugs in parallel, verify candidate findings to filter out false positives, and rank what remains by severity. The result lands on the PR as a single high-signal overview comment, plus inline comments for specific bugs.
Reviews scale with the PR. Large or complex changes get more agents and a deeper read; trivial ones get a lightweight pass. Based on our testing, the average review takes around 20 minutes.
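To make the mechanics concrete, here is a minimal sketch of that kind of pipeline: effort scales with the size of the diff, agents hunt in parallel, a verification pass filters false positives, and the survivors are ranked into one overview comment. Everything here is illustrative; the `run_agent` helper, the focus areas, and the scaling thresholds are hypothetical stand-ins, not Anthropic's implementation.

```python
# Hypothetical sketch of a scale-with-the-PR, find/verify/rank review pipeline.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Finding:
    file: str
    line: int
    severity: int      # 1 (nit) .. 5 (critical)
    summary: str

def run_agent(prompt: str, diff: str) -> list[Finding]:
    """Placeholder for a model call that returns candidate findings."""
    raise NotImplementedError("wire this to your model/provider of choice")

def review(diff: str) -> tuple[str, list[Finding]]:
    # Scale effort with the size of the change (arbitrary thresholds).
    lines_changed = sum(1 for l in diff.splitlines() if l.startswith(("+", "-")))
    n_agents = 1 if lines_changed < 50 else 4 if lines_changed < 1000 else 8

    # 1. Hunt for bugs in parallel, each agent with a different focus.
    focuses = ["correctness", "security", "concurrency", "error handling"]
    with ThreadPoolExecutor(max_workers=n_agents) as pool:
        batches = list(pool.map(
            lambda i: run_agent(f"Find {focuses[i % len(focuses)]} bugs", diff),
            range(n_agents),
        ))
    candidates = [f for batch in batches for f in batch]

    # 2. Verify each candidate with a second pass; drop anything that
    #    doesn't hold up (the pass returns an empty list for non-bugs).
    verified = [f for f in candidates
                if run_agent(f"Is this a real bug? {f.summary}", diff)]

    # 3. Rank by severity and emit one overview plus per-finding details.
    verified.sort(key=lambda f: f.severity, reverse=True)
    overview = "\n".join(f"- [{f.severity}] {f.file}:{f.line} {f.summary}"
                         for f in verified) or "No significant issues found."
    return overview, verified
```

The interesting design decision is the verification pass: a second, independent look at each candidate is what keeps the overview comment high-signal rather than a list of nits.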
We've been running Code Review internally for months: on large PRs (over 1,000 lines changed), 84% get findings, averaging 7.5 issues. On small PRs under 50 lines, that drops to 31%, averaging 0.5 issues. Engineers largely agree with what it surfaces: less than 1% of findings are marked incorrect.
In one case, a one-line change to a production service looked routine and was the kind of diff that normally gets a quick approval. But Code Review flagged it as critical. The change would have broken authentication for the service, a failure mode that’s easy to read past in the diff but obvious once pointed out. It was fixed before merge, and the engineer shared afterwards that they wouldn't have caught it on their own.
Early access customers have seen similar patterns. On a ZFS encryption refactor in TrueNAS's open-source middleware, Code Review surfaced a pre-existing bug in adjacent code: a type mismatch that was silently wiping the encryption key cache on every sync. It was a latent issue in code the PR happened to touch, the kind of thing a human reviewer scanning the changeset wouldn't immediately go looking for.
Code Review optimizes for depth and is more expensive than lighter-weight solutions like the Claude Code GitHub Action. Reviews are billed on token usage and generally average $15–25, scaling with PR size and complexity.
Admins have several ways to control spend and usage.
Code Review is available now in research preview for Team and Enterprise plans.
Explore the docs for more information.
When a tool flags 8 issues on clean code and 8 issues on broken code, it's not a reviewer, it's a random number generator with a UI. The approach we've found more tractable is to separate concerns: let deterministic tools (linters, SAST, SCA) handle what they're definitively good at - style, known vuln patterns, dependency CVEs, secrets - and reserve the AI layer for things humans actually need help reasoning about. Running this locally as a pre-push or CI step means you catch the boring 80% before it ever reaches a $25 AI review. You're not paying Claude to tell you your import is unused - you're paying it to reason about whether your auth flow has a TOCTOU issue. That's a very different and much more valuable question.
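To make that split concrete, a pre-push gate along these lines keeps the deterministic layer in front of the model. This is only a sketch: the tool choices (ruff, bandit, pip-audit) and paths are examples, not a recommendation; swap in whatever linters, SAST, and SCA you already run.

```python
#!/usr/bin/env python3
# Example pre-push gate: run cheap deterministic checks first; only if they
# pass is the (expensive) AI review step worth invoking. Tool choices are
# illustrative, not prescriptive.
import subprocess
import sys

DETERMINISTIC_CHECKS = [
    ["ruff", "check", "."],          # style / lint
    ["bandit", "-q", "-r", "src"],   # known vulnerability patterns
    ["pip-audit"],                   # dependency CVEs
]

def main() -> int:
    for cmd in DETERMINISTIC_CHECKS:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"blocked by {' '.join(cmd)}; fix before requesting AI review")
            return result.returncode
    print("deterministic checks passed; AI review is worth running now")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```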
Or just spin up your own review workflow. I've been doing this for the past couple of months after experimenting with Greptile and it works pretty well; example setup below:
https://gist.github.com/rlueder/a3e7b1eb40d90c29f587a4a8cb7c...
It averages about $0.04/review (200+ PRs with roughly two rounds each), for a total of $19.50 using Opus 4.6 over February.
It fills the gap of working on a solo project without another set of eyes to look at changes.
> We've been running Code Review internally for months: on large PRs (over 1,000 lines changed), 84% get findings, averaging 7.5 issues. On small PRs under 50 lines, that drops to 31%, averaging 0.5 issues. Engineers largely agree with what it surfaces: less than 1% of findings are marked incorrect.
So the takeaway would be that 84% of heavily Claude-driven large PRs are riddled with ~7.5 bug-worthy issues each.
Not a great ad for agent-based development quality.
I ask Claude or codex to review staged work regularly, as part of my workflow. This is often after I’ve reviewed myself, so I’m asking it to catch issues I missed.
It will _always_ find about 8 issues. The number doesn’t change, but it gets a bit … weird if it can’t really find a defect. Part of the art of using the tool is recognizing this is happening, and understanding it’s scraping the bottom of its barrel.
However, if there _are_ defects, it’s quite good at finding and surfacing them prominently.
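For reference, the staged-review loop described above can be a couple of lines; this sketch assumes the Claude Code CLI accepts a piped diff with its non-interactive `-p` flag, and the prompt wording is just one way to nudge it toward admitting when there is nothing to find.

```python
# Sketch: send the staged diff to a reviewer model. Assumes the `claude` CLI
# is installed and supports non-interactive print mode (-p) with piped stdin.
import subprocess

diff = subprocess.run(["git", "diff", "--staged"],
                      capture_output=True, text=True).stdout
if diff.strip():
    subprocess.run(
        ["claude", "-p",
         "Review this diff. Report only real defects; say so explicitly if there are none."],
        input=diff, text=True,
    )
```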
Zero