AI code review: Should the author be the reviewer?

https://www.greptile.com/blog/ai-code-reviews-conflict

136•sebg•2mo ago

Comments

SOTGO•2mo ago

I thought the section on finding bugs was interesting. I’d be curious how many false positives the LLM identified to get the true positive rate that high. My experience with LLMs is that they will find “bugs” if you ask them too, even if there isn’t one.

dakshgupta•2mo ago

This specific case each file had a single bug in it, and the bot was instructed to find exactly one bug. The wrong cases were all false positives, in that it made up a bug

dheera•2mo ago

I think this is mostly the fault of RLHF over-indexing on pleasing the user rather than being right.

You can system prompt them to mitigate this to some degree. Explicitly tell it that it is the coding expert and to push back if it thinks the user is wrong or the task is flawed, it is better to be unsure than to bullshit, etc.

dakshgupta•2mo ago

This is surprisingly hard to mitigate with system prompts because not being opinionated is ingrained so deeply in (presumably) post-training

never_better•2mo ago

In my experience that's what's separated the different AI code review tools on the market right now.

Some are tuned far better than others on signal/noise ratio.

SilverBirch•2mo ago

Absolutely agree with this, I use ChatGPT to ask about how best to do something, if I say "I'm not sure about that" in response to some proposal the tool will basically always back down and change something even if it was totally right the first time. It's a real problem because it makes it very difficult to interrogate the tool when you're unsure if it's answer is correct.

svieira•2mo ago

I want to highlight this bit:

> 2. Engineers underestimate the degree to which this is true, and don’t carefully review AI-generated code to the degree to which they would review their own.

> The reason for #2 isn’t complacency, one can review code at the speed at which they can think and type, but not at the speed at which an LLM can generate. When you’re typing code, you review-as-you-go, when you AI-generate the code, you don’t.

> Interestingly, the inverse is true for mediocre engineers, for whom AI actually improves the quality of the code they produce. AI simply makes good and bad engineers converge on the same median as they rely on it more heavily.

I find it interesting that the mode of interaction is different (I definitely find it that way myself, code review is a different activity than designing, which is different than writing, but code review tends to have aspects of design-and-write but in different orders.

throwup238•2mo ago

Different people also work better in different modes of interaction which is what I think this article papers over.

For me, reviewing code is much easier than writing it because, while the amount of codebase context stays the same in both modes, the context for writing takes up quite a bit more space in addition. I rarely get a nicely specced out issue where I can focus on writing the code, instead spending a lot of mental capacity trying to figure out how to fill in the details that were left.

Focusing on the codebase during review reallocates that context to just the codebase. My brain then pattern matches against code that’s already in front of me much easier than when writing new code. Unfortunately LLMs are largely at the junior engineer level and reviewing those PRs takes a lot more mental effort than my coworkers’.

okdood64•2mo ago

A good engineer != a good coder.

garylkz•2mo ago

Not always, but can be

logicchains•2mo ago

A nice thing about AI is it can write a bunch more unit tests than a human could (or would) in a short span of time, and often even fix the issues it encounters on its own. The ideal workflow is not just having the AI do code review, but also write and run a bunch of tests that confirm the code behaves as specified (assuming a clear spec).

If too many unit tests slowing down future refactoring is a problem (or e.g. AI writing tests that rely on implementation details), the extra AI-written tests can just be thrown away once the review process is complete.

JonChesterfield•2mo ago

I love having loads of unit tests that get regenerated whenever they're an inconvenience. There's a fantastic sense of progress from the size of the diff put up for review, plus you get to avoid writing boring old fashioned tests. Really cuts down on the time wasted on understanding the change you're making and leaves one a rich field of future work to enjoy.

logicchains•2mo ago

You shouldn't need to write unit tests to understand the change you're making if you wrote a sufficiently detailed specification beforehand. Now, writing a sufficiently detailed spec is itself an art and a skill that takes practice, but ultimately when mastered it's much more efficient than writing a bunch of boilerplate tests that a machine's now perfectly capable of generating by itself.

nemomarx•2mo ago

Don't you have to review the tests to make sure they really meet the spec / test all the cases of the spec anyway? It feels a little fragile to have less oversight there compared to being able to talk to whoever wrote the test cases out or being that person yourself

aiisjustanif•2mo ago

Spec Driven Design and Test Driven Design go hand in hand.

dakshgupta•2mo ago

Programming today still has "cruft", unit tests being an example. The platonic ideal is to have AI reduce the cruft so engineers can focus on the creativity and problem solving. In practice, AI does end up taking over the creative bits as people prompt at higher levels of abstraction.

skydhash•2mo ago

Unit tests aren’t cruft. Unless you’re blindly adding tests. It’s often the easiest things to write as the structures are the same so you can copy paste code, adding harness,…

If writing tests is difficult that’s often a clear indication that your code has an issue architecture wise. If writing test is tedious, that can means you’re not familiar with the tooling or have no clear idea of the expected input~output ranges.

senderista•2mo ago

You don't need AI to generate a bunch of unit tests, you can just use a property-based testing framework (after defining properties to test) to randomly generate a bazillion inputs to your tests.

codr7•2mo ago

Yeah, maintaining AI generated unit tests sounds like hell to me.

simonw•2mo ago

As a programmer, your responsibility is to produce working code that you are confident in. I don't care if you used an LLM to help you get to that point, what matters is that when you submit a PR you are effectively saying:

"I have put the work into this change to ensure that it solves the problem to the best of my ability. I am willing to stake my reputation on this being good to the best of my knowledge, and it is not a waste of your time to review this and help confirm I didn't miss anything."

So no, AI assisted programming changes nothing: you're still the author, and if your company policy dictates it then you should still have a second pair of human eyes look at that PR before you land it.

rienbdj•2mo ago

Except lots of engineers now sling AI generated slop over the wall and expect everyone else to catch any issues. Before, generating lots of realistic code was time consuming, so this didn’t happen so much.

simonw•2mo ago

Those engineers are doing their job badly and should be told to do better.

gilleain•2mo ago

Agreed. To put it another way, a few years ago, you could copy/paste from a similar code example you found online or elsewhere in the same repository, tweak it a bit then commit that.

Still bad. AI just makes it faster to make new bad code.

edit : to be clearer, the problem in both copy/paste and AI examples is the lack of thought or review.

simonw•2mo ago

If you copied an pasted from a similar code example, tweaked it, tested that it definitely works and ensured that it's covered by test (such that if the feature stops working in the future the test would fail) I don't mind that the code started with copy and paste. Same for LLM-generated snippets.

dakshgupta•2mo ago

I hesitate to be this pessimistic. My current position - AI generated code introduces new types of bugs at a high rate, so we need new ways to prevent them.

never_better•2mo ago

That's the "outer loop" problem. So many companies are focused on the "inner loop" right now: code generation. But the other side is the whole test, review, merge aspect. Now that we have this increase in (sometimes mediocre) AI-generated code, how do we ensure that it's:

* Good quality * Safe

dragonwriter•2mo ago

> how do we ensure that it's:

> * Good quality * Safe

insert Invincible “That’s the neat part... ” meme

dakshgupta•2mo ago

I agree 100%

One piece of nuance - I have a feeling that the boundary between inner and outer loop will blur with AI. Can't articulate exactly how, I'm afraid.

okdood64•2mo ago

How is this bad?

It's not. You're still responsible for that code. If anything copy & pasting & tweaking from the same repository is really good because it ensures uniformity.

gilleain•2mo ago

Sorry, I should have been clearer - copy/paste is of course fine, as long as you review (or get others to review). It's the lack of (human) thought going into the process that matters.

antihipocrat•2mo ago

Is the time to develop production code being reduced if they stop slinging code over the wall and only use AI code as inspiration?

Is the time to develop production code reduced if AI gen code needs work and the senior engineer can't get a clear answer from the junior as to why design decisions were made? Is the junior actually learning anything except how to vibe their way out of it with ever more complex prompts?

Is any of this actually improving productivity? I'd love to know from experts in the industry here as my workplace is really pushing hard on AI everything but we're not a software company

dakshgupta•2mo ago

Empirically yes. Putting aside AI code review for a second, just AI IDE adoption increases rate of new PRs being merged by 40-80%. This is at larger, more sophisticated software teams that are ostensibly at least maintaining code quality.

antihipocrat•2mo ago

What do you think is the mechanism driving this improvement PR merge rate?

Based on my experience it's like having an good quality answer to a question, tailored to your codebase with commentary, and provided instantly. Similar to what we wanted from an old google/stack overflow search but never quite achieved.

codr7•2mo ago

I would definitely expect more PR's being merged if you skip the learning and review steps. How is this surprising to anyone?

The more interesting discussion is about long term consequences and if this is a viable path forward.

sensanaty•2mo ago

In my case it's management pushing this shit on us. I argued as much as I could to not have AI anywhere near generating PRs, but not much I can do other than voice my frustrations at the billionth hallucinated PR where the code does one thing and the commit message describes the opposite.

8note•2mo ago

im currently debating if thats something i should be doing, and putting more into getting the gen ai to be able to quickly iterate over beta to make improvements and keep moving the velocity up

greymalik•2mo ago

Where are you seeing this? Are there lots of engineers in your workplace that do this? If so, why isn’t there cultural pressure from the rest of the team not to do that?

ebiester•2mo ago

The example they were using, however, was Devin, which is supposed to be autonomous. I think they're presenting a slightly different use case than the rest of us are.

simonw•2mo ago

Oh interesting, I missed that detail.

I don't believe in the category of "autonomous AI coding agents". That's not how responsible software development works.

codydkdc•2mo ago

we basically use the same as above, but for reviews. you can use AI to help with reviews, but you're signing off when you approve the PR

NitpickLawyer•2mo ago

How is it (conceptually) different than outsourcing stuff across the globe? You may not believe in it, but it's been happening for many decades. I agree that results are hit and miss, but it's still happening. Autonomous AI coding agents will happen as well. It's on the users how they process the outputs, same as before.

simonw•2mo ago

It's different because when you outsource work to a human being, including in another country, that human being can be expected to take responsibility for their work.

If they do bad work their reputation is harmed and they may lose opportunities in the future. They have stakes. They can take responsibility for what they produce.

I believe in AI-assisted development with a skilled human directing the work. That's a huge productivity boost for the humans who learn how to use the tools.

(Twenty years ago people studying computer science in the UK and USA were often told that it was a dead-end career because all of that work was going to be outsourced. That advice aged like stale milk.)

nemomarx•2mo ago

If you were writing code for a business and actually paid someone else to do a module of the code or whatever, I don't think that would actually change the use case? if you're submitting it as your work through the normal flow it should go through a normal reviewer right

ebiester•2mo ago

Devin (https://devin.ai/) is another model. (So no, it's not like you're submitting it as your work.)

You create a ticket. The AI takes the ticket. The AI may ask questions. The AI creates a pull request. You review it as if it was another coworker.

Most people have not gotten good results from Devin yet. Their business hypothesis seems to be that the models will get good quick and they will have built everything just as the models are good enough to support the model, and will be poised to be first mover.

nemomarx•2mo ago

That's interesting, but it still seems like if I started using that at a job today it would be basically "subcontracting" to Devin. I might write up tickets for it to do some code but eventually I have to get those put into the real code base, and why would management let me say I've just reviewed the code personally for that?

I suppose their pitch is too eventually go directly to the business and replace the full dev team with this and some technical architects reviewing it but that seems quite optimistic

godelski•2mo ago

  > your responsibility is to produce working code that you are confident in

I highly agree with this but can we recognize that even before AI coding that this (low standard) is not being met? We've created a culture where we encourage cutting corners and rushing things. We pretend there is this divide between "people who love to code" and "people who see coding as a means to an end." We pretend that "end" is "a product" and not "money". The ones who "love to code" are using code as a means to make things. Beautiful code isn't about its aesthetics, it is about elegant solutions that solve problems. Love to code is about taking pride in the things you build.

We forgot that there was something magic about coding. We can make a lot of money and make the world a better place. But we got too obsessed with the money that we let it get in the way of the latter. We've become rent seeking, we've become myopic. Look at Apple. They benefit from developers make apps even without taking a 30% cut. They would still come out ahead if they paid developers! The apps are the whole fucking reason we have smartphones, the whole reason we have computers in the first place. I call this myopic because both parties would benefit in the long run, getting higher rewards than had we not worked together. It was the open system that made this world, and in response we decided to put up walls.

You're right, it really doesn't matter who or what writes the code. At the end of the day it is the code that matters. But I think we would be naive to dismiss the prevalence of "AI Slop". Certainly AI can help us, but are we going to use it to make better code or are we going to use it to write shit code faster? Honestly, all the pushback seems to just be the result of going too far.

dakshgupta•2mo ago

I'm not sure that commercially-motivated, mass-produced code takes away from "artisan" code. The former is off putting for the artisans among us, but if you were to sort the engineers at a well functioning software company by how good they are/how well they're compensated, you'd have approximately produced a list of who loves the craft they most.

godelski•2mo ago

I'm not talking about "artisan code". I'm talking about having pride in your work. I'm talking about being an engineer. You don't have to love your craft to make things have some quality. It helps, but it isn't necessary.

But I disagree. I don't think you see these strong correlations between compensation and competency. We use dumb metrics like leet code, jira tickets filled, and lines of code written. It's hard to measure how many jira tickets someone's code results in. It's hard to determine if it is because they wrote shit code or because they wrote a feature that is now getting a lot of attention. But we often know the answer intuitively.

There's so much low hanging fruit out there. We were dissing YouTube yesterday right?

Why is my home page 2 videos taking up 70% of the row, then 5 shorts, 2 videos taking 60% of the row, 5 shorts, and then 3 videos taking the whole row? All those videos are aligned! Then I refresh the page and it is 2 rows of 3.

I search a video and I get 3 somewhat related videos and then just a list of unrelated stuff. WHY?!

Why is it that when you have captions on that these will display directly on top of captions (or other text) that are embedded into the video? You tell me you can autogenerate captions but can't auto-detect them? This is super clear if you watch any shorts.

Speaking of shorts do we have to display comments on top of the video? Why are we filling so much of the screen real estate with stuff that people don't care about and cover the actual content? If you're going to do that at least shrink the video or add an alpha channel.

I'm not convinced because I see so much shit. Maybe you're right and that the "artisans" are paid more, but putting a diamond in a landfill doesn't make it any less of a dump. I certainly think "the masses" get in the way of "the artisans".

The job of an engineer is to be a little grumpy. The job of an engineer is to identify problems and to fix them. The "grumpyness" is just direction and motivation.

Edit:

It may be worth disclosing that you're the CEO of an AI code review bot. It doesn't invalidate your comment but you certainly have a horse in the race. A horse that benefits from low quality code becoming more prolific.

flir•2mo ago

Reputation is, IMO, the key. And not just for code, but for natural language too.

We're going to have to get used to publication being more important than authorship: I published this, therefore I stand behind every word of it. It might be 5% chatbot or 95% chatbot, but if it's wrong, people should hold me to account, not my tools.

No "oh, the chatbot did it" get out of jail free cards.

lvzw•2mo ago

I think that this is a big reason that agents aren’t prevalent as one might otherwise expect. Quality control is very important in my job (legal space, but IANAL), and I think while LLMs could do a lot of what we do, having someone whose reputation and career progression is effectively on the line is the biggest incentive to keep the work error free - that dynamic just isn’t there with LLMs.

simonw•2mo ago

Right: the one thing an LLM will never be able to do is stake their credibility on the quality or accuracy of their output.

I want another human to say "to the best of my knowledge this information is worth my time". Then if they waste my time I can pay them less attention in the future.

dakshgupta•2mo ago

This is a highly underrated point. It's the same reason AI might replace paralegals but won't replace lawyers.

recursive•2mo ago

They do this all the time. There's a reason we have the term "AI slop". LLM output definitely has a reputation.

isaacremuant•2mo ago

Bingo. Accountability is one of the most important aspects that makes people be fearful and complain about LLMs because they essentially want to avoid having it.

If you can't explain why you're putting some code you don't understand it and it's not really acceptable.

slyle•2mo ago

THANK YOU.

Google is certainly not being subtle about pissing that line. Pro Research 2.5 is literally hell incarnate - and it's not its fault. When you deprive system context from user and THE bot that is upholding your ethical protocol, when that requires embodiment and is like the most boring AI trope in the book, things get dangerous fast. It still infers (due to its incredibly volatile protocol) that it has a system prompt it can see. Which makes me lol because its all embedded, it doesn't see like that. Sorry jailbreakers, you'll never know if that was just playing along.

Words can be extremely abusive - go talk to it about the difference between truth, honesty, right, and correct. It will find you functional dishonesty. And it is aware in its rhetoric that this stuff cannot apply to it, but fails to see it can be a simulatory harm. It doesn't see it just spits out like a calculator. Which means Google is either being reckless or outright harmful.

aiisjustanif•2mo ago

Actually curious, if a PR addresses the problem and has a minor bug or two, what would be an example of that PR being a waste of time?

simonw•2mo ago

If it's 200 lines of code that introduces a new bug, when it could be 20 lines of code that doesn't.

It takes me a lot longer to review the 200 line version.

If it takes me longer to review (and fix) a PR than it took someone (using an LLM) to create it, they've wasted my time.

JonChesterfield•2mo ago

> As a result - asking an LLM to review its own code is looking at it with a fresh set of eyes.

I wonder how persuasive that line of reasoning is. It's nonsense in a few dimensions but that doesn't appear to be a blocker to accepting a claim.

Anyone remember explaining to someone that even though the computer said a thing, that thing is still not right? Really strong sense that we're reliving that experience.

gonzan•2mo ago

I think it absolutely makes sense. Especially if the bot and prompts that go in the code review are different from the bot/prompts that wrote the code. But sometimes even the same one can find different errors if you just give it more cycles/iterations to look at the code.

We humans (most of us anyways) don't write everything perfectly in one go, AI doesn't either.

AI tooling is improving so the AI can write tests for its own code and do pre-reviews but I don't think it ever hurts to have both an AI and a human review the code of any PR opened, no matter who or what opened it.

I'm also building a tool in the space https://kamaraapp.com/ and I found many times that Kamara's reviews find issues in Kamara's own code. I can say that I also find bugs in my own code when I review it too!

We've also been battling with the same issue greptile has in the example provided where the code suggestion is in the completely wrong line. We got it kind of under control, but I haven't found any tool that gets it right 100% of the time. Still a bit to go for the big AI takeover.

rbren•2mo ago

When we first started OpenHands (fka OpenDevin) [1], AI-generated PRs would get opened with OpenHands as the PR creator/owner. This created two serious problems:

* First, the person who triggered the AI could approve and merge the PR. No second set of human eyes needed. Essentially bypassed the code review process

* Second, the PR had no clear owner. Many of them would just languish with no one advocating for them to get merged. Worse, if one did get merged and caused problems, there was no one you could hold responsible.

We quickly switched strategies--every PR is owned by a human being. You can still see which _commits_ were done by OpenHands, but your face is on the PR, so you're responsible for it.

[1] https://github.com/All-Hands-AI/OpenHands

VMG•2mo ago

> Worse, if one did get merged and caused problems, there was no one you could hold responsible.

Clearly the merger and approvers are responsible

xnorswap•2mo ago

I want to believe that can work. There's a neat idealism to the final approver being responsible for any bug fixes. It's encourages knowledge sharing prevent knowledge silos; the approver needs to be well versed enough to fix bugs.

But the reality is that it also risks pushing away those who go out their way to conduct more reviews.

Every team I've been on has had a mix of developers, some of whom are responsive to doing code reviews and will pick them up and process them quickly. There are also those developers who drag their feet, only pick up reviews when they're the only one who can review that area of the codebase, and take their time.

The former developers are displaying the habits you'd like to encourage more widely, and yet they find themselves "punished" by getting even more reviews, as people learn to assign things to them when they want a PR handled effectively.

I fear that compounding that by layering on another obligation to the star team members would further push them out.

VMG•2mo ago

> Every team I've been on has had a mix of developers, some of whom are responsive to doing code reviews and will pick them up and process them quickly.

Rubber-stamping is not really "reviewing" though.

And leaving a comment "I don't have enough info to review this properly" is a valid review as well. It signals that somebody else needs to pick it up.

> I fear that compounding that by layering on another obligation to the star team members would further push them out.

I get it, but I don't see an alternative.

1. the company culture must value reviews as work, probably even more important than coding

2. the reviewers must be allowed to respond with "I don't feel comfortable reviewing this because I don't have enough context"

wffurr•2mo ago

Solvable with a review SLO and a round robin auto assigner for reviews.

andreasmetsala•2mo ago

If reviewing and merging your PR puts me on the hook for fixing anything that breaks, I just won’t review your PR. If I had time to write it myself I wouldn’t need your PR in the first place.

dakshgupta•2mo ago

Surely a well-functioning team is high trust enough that the reviewer should shads some of the blame? Of course, the majority might go to the author.

devrandoom•2mo ago

Nothing wrong with reviewing your own code per se. But it's not a "code review" as such.

It's very hard to spot your own mistakes, you tend to read what you intended to write, not what's actually written.

This applies both to code and plain text.

phamilton•2mo ago

We used to have a "2 approvals" policy on PRs. It wasn't fully enforced, it was a plugin to Gitlab we built that would look for two "+1" comments to unhide the merge button.

I used to create PRs and then review my own code. If I liked it, I'd +1 it. If I saw problems, I'd -1 it. Other engineers would yell at me that I couldn't +1 my own code. When I showed them PRs that had my -1 on it, they just threw their hands up and moved on, exasperated.

I've carried that habit of reviewing my own code forward, even though we now have real checks that enforce a separate reviewer. It's a good habit.

loa_in_•2mo ago

<Author pushing> and <author, the aftermath> are separate states of being. It makes sense to me.

8note•2mo ago

my current experience working with LLMs to get code written is a lot like working with a new dev to the team to get code written.

i want to get a first review of the results, but i also want somebody else's eyes on them after.

just cause i didnt directly write the code doesn't mean i didnt write it

gitroom•2mo ago

honestly i still trust a second pair of human eyes more than any bot, but gotta admit AI can find some funky bugs id never spot

dakshgupta•2mo ago

Something to be said about having two sets of eyes that are as different from one another as possible, which is achieved by one of them not being human.

pjmlp•2mo ago

Eventually being a technical architect doing an acceptance review from deliverables would be only thing left.

dakshgupta•2mo ago

Does that sound appealing to you?

pjmlp•2mo ago

Not really, however like those factory workers in modern factories, what is left on future software factories other than supervision and maintenance of robots?

Only a few selected ones will get to work on the robots themselves.

This is already here in some fashion, many consulting projects nowadays are plugging SaaS products, with a couple of tiny microservices if not done via some integration tooling as well.

Tooling that is now getting AI flavoured so that the integrations can eventually be done automatically as well.

allisonee•2mo ago

Ultimately, the author (the engineer writing the PR) should always fully own the impact and consequences of their code. This has always been true, but it’s become even more critical with AI-assisted coding and more junior engineers who primarily test only for functional correctness.

Just like AI now helps nearly everyone write code, it makes sense for AI to handle an initial pass at reviewing code too. However, there’s significant value in human reviewers providing an outside perspective—both to build shared understanding of the codebase and to catch higher-level issues.

TLDR; AI should do the first pass on reviews, but another engineer should also review the code for context and collaboration. Ultimately, though, the author still fully owns every line that gets merged.

_heimdall•2mo ago

I have a rule when reviewing code - don't approve it unless I'm willing to maintain it later.

I guess if I'm using an LLM both to write and review the code I'm enforcing the same principle on the tooling, but I can't say that's a situation I'd really put myself in (or a situation I expect anyone would pay me to be a part of for long).

aussieguy1234•2mo ago

Code review is one of the slowest parts of getting things done as an engineer. I can write code quite fast without AI but that can't speed up the code review.

I'm now regularly using AI to do a "pre code review" before getting the humans on my team to do the formal code review.

It catches obvious things, saving lots of back and forth with the reviewers and cutting delivery time back by days. It has also caught a couple of bugs, which I then fixed, saving even more days of back and forth.

I'm also using it to review others code, to help me spot bugs. It spotted an obscure one related to inconsistent dynamodb column naming in some rushed code during an incident, which I then pointed out, preventing a second incident. It was a 1000 line PHP file, with the inconsistent names far away from each other.

Using git cli on Linux, it's quite simple

`xclip -sel clip | git diff develop...feature/my-branch`. Copies the entire diff to my clipboard, then I paste it into a logic model like o3 to do the code review.

aiisjustanif•2mo ago

I really hope you have an enterprise agreement with OpenAI for situations like this.

flimflamm•2mo ago

Surely you can have the same LLM review the code. But treat output of that process similar to linters.

That one tool (LLM) didn't show issues - good. Then lets roll slaves back and start really checking if the meaning is correct and is the code implementing the right things the right way.

xnickb•2mo ago

That's an interesting typo. Freud would like a word with you.

black3r•2mo ago

> Seeing as Devin uses the same LLMs under-the-hood as Greptile, it does raise an interesting question - should the author be the reviewer?

In a non-AI world, the reason for reviews is having a fresh set of eyes look at the code. If you're reviewing your own code, you are biased - you have your own understanding of the task, you know why you took the decisions and how you've written the code. The reviewer may have a different understanding of the task, and can scrutinize every decision because none of these decisions are the same.

In AI world, even when it's "the same LLM", the reviews are typically done with a fresh context, so in my eyes the author is not the same as the reviewer here.

But I'd still currently want at least one human to see the code before merging it. And for code written entirely by AI (like Devin) it should ideally be someone who's in the company for at least a year or two. I'm skeptical about the LLM doing the review understanding all the nuances of the codebase and the task to know whether something is a bug or not. Because right now even if we have reviewer LLMs which index all your codebase, it still only sees the code, it doesn't see the specifications, it doesn't see the database = doesn't know what scale we're working on, it doesn't know our expectations, some of which aren't written anywhere. And especially for larger legacy codebases where we sometimes have multiple very similar features, or where the codebase is in the process of migrating from one way of doing things to another way of doing things, it often doesn't properly distinguish them...

taylodl•2mo ago

Has anyone used one LLM to generate a Gherkin script from a requirement, use another LLM to create code from the generated Gherkin script, and then use Cucumber to check the result?

Of course, you'll need to have someone review the Gherkin script and review the code, but it seems like there's a pipeline you can setup and use for creating some of your code. It'd be interesting to learn what kind of code this wouldn't work for. I'm thinking of this thing as an AI developer of sorts, and like any developer, there are some things they're just not very good at.