I Used Claude, GPT-4, and Gemini for Code Review — One Month Each. Here's What Actually Helped.
Three AI assistants, three PRs per day, one month. I tracked every bug caught, every false positive, and every moment one of them saved me from shipping broken code.
The Setup
Every morning I open my PR queue, grab coffee, and run each pull request through all three AI assistants before my first meeting. I've been doing this for three months across four production projects — a Node.js API, a React dashboard, a Python data pipeline, and some DevOps Terraform. This is what I learned.
The prompt I use for every review:
Review this pull request. Focus on:
1. Logic bugs and edge cases
2. Security vulnerabilities
3. Performance issues
4. Code that could be simpler
Reply in this format:
## Bugs (Severity: Critical/High/Medium/Low)
## Suggestions
## Comments
Same prompt, same PR, all three assistants. I kept a spreadsheet tracking every issue found, how long it took to get a response, and whether the issue was real. Here's what actually happened.
Claude (via Cursor + Claude Code)
My daily driver for the past two months. I run Cursor's AI chat on Claude, and the inline suggestions feel the most natural. When I Alt+click a function, it reads the full file context, not just the highlighted lines.
What I pasted in:
// A payment processing function I was unsure about
function processPayment(orderId, amount, paymentMethod) {
  const order = db.findOrder(orderId);
  if (!order) return { error: 'Order not found' };
  if (order.status === 'paid') {
    return { error: 'Already paid' };
  }
  // Was this the right place to add this check?
  if (amount < 0) {
    return { error: 'Invalid amount' };
  }
  const result = paymentProvider.charge(paymentMethod, amount);
  if (result.success) {
    db.updateOrder(orderId, { status: 'paid' });
  }
  return result;
}
Claude's response:
## Bugs (Severity: High)
- Race condition: Two concurrent requests for the same orderId could both pass
the `order.status === 'paid'` check before either marks it as paid. You need
a database transaction with SELECT FOR UPDATE or check-and-set atomic operation.
- No idempotency key: If the payment provider times out after charging but before
db.updateOrder(), a retry will double-charge the customer.
## Suggestions
- Add a unique idempotency_key to the payment request (e.g. orderId + timestamp).
- Use a database transaction with row locking to prevent the race condition.
- The `amount < 0` check is good, but also validate `amount === 0` — some payment
providers reject zero-value charges outright.
## Comments
The early returns are clean. The structure is mostly right, but the concurrency
story needs fixing before this goes near a production queue.
That race condition warning? I reproduced it in staging two weeks later. It would have affected ~1 in 500 orders during peak traffic. That's a real bug that got caught before merge.
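Here is a minimal sketch of the two fixes Claude described, under stated assumptions. The in-memory `orders` map, the `db.transition` helper, the extra `processing` status, and the provider's `idempotencyKey` argument are all hypothetical stand-ins, not the real code from my project. One deliberate choice: the key is derived from the order ID alone, because a timestamp taken at call time would differ on a retry.

```javascript
// Hypothetical in-memory stand-ins for the article's `db` and `paymentProvider`.
const orders = new Map([["ord_1", { status: "pending" }]]);
const chargesByKey = new Map(); // idempotency key -> stored charge result

const db = {
  // Atomic check-and-set: succeeds only if the order is currently in `from`.
  // In SQL this would be roughly:
  //   UPDATE orders SET status = 'processing' WHERE id = ? AND status = 'pending'
  // followed by a check that exactly one row was affected.
  transition(orderId, from, to) {
    const order = orders.get(orderId);
    if (!order || order.status !== from) return false;
    order.status = to;
    return true;
  },
};

const paymentProvider = {
  // Replays the stored result when the same idempotency key is seen again.
  charge(paymentMethod, amount, idempotencyKey) {
    if (!chargesByKey.has(idempotencyKey)) {
      chargesByKey.set(idempotencyKey, { success: true, amount });
    }
    return chargesByKey.get(idempotencyKey);
  },
};

function processPayment(orderId, amount, paymentMethod) {
  if (!orders.has(orderId)) return { error: "Order not found" };
  if (!(amount > 0)) return { error: "Invalid amount" }; // rejects 0, negatives, NaN

  // Claim the order first; only one caller can win this transition.
  if (!db.transition(orderId, "pending", "processing")) {
    return { error: "Already paid or in progress" };
  }

  // Key derived from the order alone, so a retry reuses the same key.
  const result = paymentProvider.charge(paymentMethod, amount, `order-${orderId}`);
  db.transition(orderId, "processing", result.success ? "paid" : "pending");
  return result;
}
```

Calling `processPayment("ord_1", 50, "card")` twice charges once: the second call fails the claim and returns an error instead of reaching the provider.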
Speed:
Average response time: 8 seconds on Cursor Pro with Claude 3.5 Sonnet. Fast enough that I don't tab away waiting for it.
GPT-4 (via GitHub Copilot Chat)
Copilot Chat is my second monitor — always open, rarely the first voice I listen to. It catches different things than Claude, and the integration with VS Code means I can ask context-aware questions without pasting code.
Same code, same prompt:
Review this function for bugs and suggest improvements:
function processPayment(orderId, amount, paymentMethod) {
  // ... same code as above ...
}
GPT-4's response:
## Bugs (Severity: High)
- **Race condition on status check**: If two requests hit simultaneously, both
could pass the 'paid' check. Consider using atomic compare-and-swap.
- **No error handling**: If paymentProvider.charge() throws an exception,
the order status is never updated, but the customer may have been charged.
Wrap in try/catch and implement compensation logic.
## Suggestions
- Add input validation for `orderId` (type checking, not just existence)
- Consider a circuit breaker for paymentProvider if it has downtime
- Log the transaction attempt for audit trail
## Comments
The control flow is reasonable. The main concern is the lack of compensating
actions if something fails partway through the payment flow.
GPT-4 caught the race condition too, but described the fix differently: "atomic compare-and-swap" vs Claude's "SELECT FOR UPDATE." Both correct. GPT-4 also flagged the missing try/catch, which Claude missed. Not a bug in my original code, but a real gap worth addressing.
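The try/catch gap could be closed along these lines. This is a sketch only: `chargeSafely`, `flagForReconciliation`, and the `needsReview` field are names I made up for illustration, and a real compensation step would depend on what your payment provider can tell you about a failed call.

```javascript
// Hypothetical reconciliation queue; in practice this might be a table or a job queue.
const reconciliationQueue = [];

function flagForReconciliation(orderId, reason) {
  reconciliationQueue.push({ orderId, reason });
}

// Wraps the provider call so a thrown exception never leaves the order in limbo.
function chargeSafely(paymentProvider, orderId, paymentMethod, amount) {
  try {
    return paymentProvider.charge(paymentMethod, amount);
  } catch (err) {
    // The charge may or may not have gone through, so the outcome is unknown.
    // Record it for follow-up instead of silently leaving the order unpaid.
    flagForReconciliation(orderId, err.message);
    return { success: false, error: "Payment outcome unknown", needsReview: true };
  }
}
```

The point of returning `needsReview` rather than a plain failure is that the caller should not retry blindly: the customer may already have been charged.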
Speed:
Average response time: 12 seconds. Sometimes Copilot hangs for 20+ seconds on large diffs. Not a dealbreaker, but noticeable when you're doing 15 reviews a day.
Gemini (via Google AI Studio)
Gemini 1.5 Pro has a 1 million token context window. I tested this specifically for reviews that span multiple files: when a PR touches 8-10 files and I want the AI to read all of them without pasting context manually.
The multi-file test:
A PR that added a new authentication flow across 6 files: middleware, route handler, model, service layer, database migration, and tests. I pasted all six files into Gemini and asked for an end-to-end security review.
Files in this PR (6 total):
1. auth/middleware.ts
2. auth/routes.ts
3. auth/user.model.ts
4. auth/auth.service.ts
5. db/migrations/001_create_users.sql
6. auth/auth.test.ts
This PR adds password reset functionality. Review the entire flow for
security vulnerabilities, starting from the forgot-password endpoint
through to the database update.
Gemini's response (condensed):
## Security Issues Found
1. **Token expiration mismatch** (High): The password reset token is generated
with a 15-minute expiry, but the token is stored in the database with a
60-minute TTL. If a token expires in the DB but is still presented to the
API, the behavior is inconsistent — sometimes accepted, sometimes rejected.
2. **Timing attack on user enumeration** (Medium): Both the "user not found"
and "token invalid" responses take different amounts of time to execute. An
attacker could use this to enumerate which emails are registered. Use
constant-time comparison.
3. **Token stored in plaintext** (Critical): The reset token is stored in
the users table as plaintext. It should be hashed like passwords.
## Architecture Comments
The separation of concerns is clean. auth.service.ts is doing a good job of
keeping the business logic independent of the HTTP layer.
The timing attack flag was a genuine security finding: I had no idea my error responses had different latencies. The token expiration mismatch was a bug I introduced in the PR and didn't catch in self-review.
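To make those findings concrete, here is a rough sketch of how a reset token could be issued and verified with Node's built-in `crypto` module. The function names, the record shape, and the choice of SHA-256 are my assumptions, not the code from the PR. It covers the plaintext-storage and expiry-mismatch findings and uses a constant-time comparison; it does not equalize latency between the "user not found" and "token invalid" paths.

```javascript
const nodeCrypto = require("crypto");

// One constant for the lifetime, so the issued expiry and the stored expiry cannot drift apart.
const RESET_TOKEN_TTL_MS = 15 * 60 * 1000;

function sha256(value) {
  return nodeCrypto.createHash("sha256").update(value).digest();
}

// Returns the raw token (to send to the user) and the record to persist.
// Only the hash is stored, never the raw token.
function issueResetToken(now = Date.now()) {
  const token = nodeCrypto.randomBytes(32).toString("hex");
  return {
    token,
    record: {
      tokenHash: sha256(token).toString("hex"),
      expiresAt: now + RESET_TOKEN_TTL_MS,
    },
  };
}

// Assumes `record.tokenHash` is a well-formed 64-character hex string.
function verifyResetToken(presentedToken, record, now = Date.now()) {
  const expected = Buffer.from(record.tokenHash, "hex");
  const actual = sha256(String(presentedToken));
  // Both buffers are 32-byte SHA-256 digests, so timingSafeEqual's
  // equal-length requirement holds.
  const matches = nodeCrypto.timingSafeEqual(expected, actual);
  return matches && now < record.expiresAt;
}
```

Hashing before comparing also means the comparison always runs over fixed-length inputs, whatever the attacker sends.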
Speed:
Average response time: 25 seconds for 6-file context. The context window is genuinely useful, but the latency makes it less ideal for quick inline reviews. Better for deep dive PRs where you want full context.
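For multi-file reviews like the one in this section, the prompt can be assembled from the files rather than pasted by hand. The sketch below is one way to do that; `buildMultiFilePrompt` and the `--- path ---` delimiter are my own conventions, not anything Gemini requires.

```javascript
// Joins several files into one prompt, labelling each with its path so the
// model can refer to files by name. `files` is an array of { path, content }.
function buildMultiFilePrompt(task, files) {
  const listing = files.map((f, i) => `${i + 1}. ${f.path}`).join("\n");
  const bodies = files
    .map((f) => `--- ${f.path} ---\n${f.content}`)
    .join("\n\n");
  return `Files in this PR (${files.length} total):\n${listing}\n\n${task}\n\n${bodies}`;
}
```

Reading the file contents from disk or from the PR diff is left out on purpose; the function only takes care of the layout.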
The Numbers
After one month of tracking (roughly 60 PRs total across all projects):
| Assistant | Bugs Caught | False Positives | Avg Response | Would Recommend |
|---|---|---|---|---|
| Claude (Cursor) | 47 | 12 | 8s | Yes — Daily driver |
| GPT-4 (Copilot) | 39 | 21 | 12s | Yes — Solid second opinion |
| Gemini 1.5 | 31 | 8 | 25s | Yes — For multi-file PRs |
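To compare the assistants on signal rather than raw counts, the table can be turned into a precision figure. This assumes that everything an assistant flagged is either a "bug caught" or a "false positive", which is how I kept the spreadsheet but is still a simplification.

```javascript
// Figures copied from the table above.
const results = [
  { assistant: "Claude (Cursor)", bugs: 47, falsePositives: 12 },
  { assistant: "GPT-4 (Copilot)", bugs: 39, falsePositives: 21 },
  { assistant: "Gemini 1.5", bugs: 31, falsePositives: 8 },
];

// Precision = real bugs / everything flagged (bugs + false positives).
function precision({ bugs, falsePositives }) {
  return bugs / (bugs + falsePositives);
}

for (const row of results) {
  console.log(`${row.assistant}: ${(precision(row) * 100).toFixed(1)}% of flags were real`);
}
// Claude (Cursor): 79.7% of flags were real
// GPT-4 (Copilot): 65.0% of flags were real
// Gemini 1.5: 79.5% of flags were real
```

Claude and Gemini come out nearly level on precision; the gap between them is in volume and response time.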
What Each AI Is Best At
After all this testing, I've settled into a workflow:
- Claude — First pass on every PR. Best at reading code in context, catches logic errors and concurrency issues most reliably.
- GPT-4 — Second opinion, especially on security. Copilot's GitHub integration makes it easy to comment inline. Good at flagging things Claude missed.
- Gemini — Deep dive on multi-file architectural changes. The million-token context means I can paste an entire feature's file tree and ask for end-to-end review.
Prompts That Actually Work
The generic "review this code" prompt gets generic results. Here's what I changed that made a real difference:
Bad prompt (gets you nowhere):
Review this code for bugs
Good prompt (gets you actionable findings):
I'm about to merge this PR to production. It's a payment processing change.
The thing I'm most worried about is whether the database transaction handles
network failures correctly. Can you focus your review on that specifically?
Code below:
<paste function>
Telling the AI what you're worried about gets you a focused review. Claude especially responds well to "I'm worried about X" framing.
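If you review a lot of PRs, the "I'm worried about X" framing is easy to template. A small sketch, with field names (`context`, `worry`, `code`) that are purely illustrative:

```javascript
// Builds a focused review prompt from what the change is, what worries you, and the code.
function buildFocusedPrompt({ context, worry, code }) {
  return [
    `I'm about to merge this PR to production. ${context}`,
    `The thing I'm most worried about is ${worry}.`,
    "Can you focus your review on that specifically?",
    "Code below:",
    code,
  ].join("\n");
}

const prompt = buildFocusedPrompt({
  context: "It's a payment processing change.",
  worry: "whether the database transaction handles network failures correctly",
  code: "function processPayment(orderId, amount, paymentMethod) { /* ... */ }",
});
```

Forcing yourself to fill in `worry` is most of the value; if you can't name a concern, the generic prompt is probably all you have.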
The Honest Caveats
AI code review is not a substitute for experienced human review. Here's what it won't catch:
- Business logic errors — AI doesn't know your product requirements
- Subtle performance regressions — profiling still requires real benchmarks
- Context that lives in Slack or Jira — if it's not in the PR description, AI can't see it
- Team conventions and architectural decisions not encoded in the code
I also found that all three assistants get worse the longer a session goes on without fresh context. If Copilot has been open for hours, its responses get vaguer. Starting fresh tabs helps.
Bottom Line
Claude is my daily driver because it reads context best and has the best ratio of real bugs to false positives (47 to 12, just ahead of Gemini's 31 to 8). GPT-4 is a useful second opinion that catches different things. Gemini earns its place on architectural PRs where I need to see across multiple files.
Don't use AI review as a gate — use it as a safety net. The goal is to catch the things you missed, not to replace the human thinking that makes code good in the first place.
If you're only going to pick one: start with Claude via Cursor. The workflow integration alone makes it worth the subscription.
Frequently Asked Questions
Which AI code review tool should I start with?
Claude via Cursor — the best bug-to-false-positive ratio and fast enough for daily use. If you're already using VS Code, GitHub Copilot (GPT-4) is well-integrated. Gemini via Google AI Studio is worth trying when you need to review across many files at once.
Can AI code review replace human reviewers?
No — AI catches technical bugs, not business logic errors or product requirement mismatches. Use it as a first-pass safety net before human review, not as a gate. The goal is to catch the things you missed, not to replace the thinking that makes code good.
How do I get better results from AI code review?
Tell the AI what you're worried about: 'This PR touches the payment flow — focus on concurrency and error handling.' Contextual framing (what the code does, where it runs, what you're unsure about) gets much better results than generic prompts. Also: start fresh sessions for each review — AI gets vaguer the longer a session goes on.
