Honest question: if you're using multiple agents, it's usually not to produce a dozen lines of code. It's to produce a big enough feature spanning multiple files, modules and entry points, with tests and all. So far so good. But once that feature is written by the agents... wouldn't you review it? Like reading line by line what's going on and detecting if something is off? And wouldn't that part, the manual reviewing, take an enormous amount of time compared to the time it took the agents to produce it? (you know, it's more difficult to read other people's/machine code than to write it yourself)... meaning all the productivity gained is thrown out the door.
Unless you don't review every generated line manually, and instead rely on, let's say, UI e2e testing, or perhaps unit testing (that the agents also wrote). I don't know, perhaps we are past the phase of "double check what agents write" and are now in the phase of "ship it. if it breaks, let agents fix it, no manual debugging needed!" ?
Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
Enforce single responsibility, CQRS, domain segregation, etc. Make the code as easy for you to reason about as possible. Enforce domain naming and function / variable naming conventions to make the code as easy to talk about as possible.
Use code review bots (Sourcery, CodeRabbit, and CodeScene). They catch the small things (violations of contract, antipatterns, etc.) and the large (UX concerns, architectural flaws, etc.).
Go all in on linting. Make the rules as strict as possible, and tell the review bots to call out rule subversions. Write your own lints for the things the review bots are complaining about regularly that aren't caught by lints.
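To make "write your own lints" concrete, here is a minimal sketch of a custom check built on Python's standard `ast` module. The rule itself (flagging bare `except:` clauses) and the names `BareExceptLint` and `lint_source` are assumptions picked for illustration, not something from the setup described above; a real project would more likely write this as a plugin for whatever linter it already runs.

```python
import ast


class BareExceptLint(ast.NodeVisitor):
    """Hypothetical custom lint: flag `except:` clauses that name no exception type."""

    def __init__(self):
        self.findings = []  # (line number, message) pairs

    def visit_ExceptHandler(self, node):
        # A bare `except:` has no exception type attached to the handler.
        if node.type is None:
            self.findings.append((node.lineno, "bare except swallows every error"))
        self.generic_visit(node)


def lint_source(source):
    """Parse Python source text (without running it) and return sorted findings."""
    checker = BareExceptLint()
    checker.visit(ast.parse(source))
    return sorted(checker.findings)


# Illustrative input: the first handler is bare, the second is typed.
SAMPLE = """
try:
    risky()
except:
    pass

try:
    risky()
except ValueError:
    pass
"""

if __name__ == "__main__":
    for lineno, message in lint_source(SAMPLE):
        print(f"line {lineno}: {message}")
```

The point of the design is that once a review bot has complained about the same thing a few times, the complaint becomes a deterministic check that fails the build instead of a comment someone has to read.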
Use BDD alongside unit tests, read the .feature files before the build and give feedback. Use property testing as part of your normal testing strategy. Snapshot testing, e2e testing with mitm proxies, etc. For functions of any non-trivial complexity, consider bounded or unbounded proofs, model checking or undefined behaviour testing.
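As a rough illustration of the property testing idea, here is a hand-rolled sketch that uses only the standard library. The function under test (`dedupe`) is a made-up example, and the random generation is deliberately simple; a dedicated library such as Hypothesis would also shrink failing inputs, which this sketch does not attempt.

```python
import random


def dedupe(items):
    """Function under test (hypothetical): drop repeats, keep first-seen order."""
    seen = set()
    out = []
    for item in items:
        if item not in seen:
            seen.add(item)
            out.append(item)
    return out


def check_properties(items):
    """Assert invariants that must hold for every input, not just chosen examples."""
    result = dedupe(items)
    assert len(result) == len(set(result))   # no duplicates survive
    assert set(result) == set(items)         # nothing lost, nothing invented
    assert dedupe(result) == result          # applying it twice changes nothing
    # Order check: result equals the input filtered down to first occurrences.
    firsts = [x for i, x in enumerate(items) if x not in items[:i]]
    assert result == firsts


def run_property_test(trials=500, seed=0):
    """Generate random inputs from a fixed seed so any failure is reproducible."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(0, 20)
        items = [rng.randint(-5, 5) for _ in range(size)]
        check_properties(items)
    return trials


if __name__ == "__main__":
    print(f"{run_property_test()} random cases passed")
```

The narrow value range (-5 to 5) is intentional: it forces plenty of repeated values, which is where a dedupe function is most likely to go wrong.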
I'm looking into mutation testing and fuzzing too, but I am still learning.
Pause for frequent code audits. Ask an agent to audit for code duplication, redundancy, poor assumptions, architectural or domain violations, TOCTOU violations. Give yourself maintenance sprints where you pay down debt before resuming new features.
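For anyone unsure what a TOCTOU (time of check to time of use) violation looks like in an audit, here is a small sketch. Both function names are hypothetical and the file-creation scenario is just one common instance of the pattern, chosen because it fits in a few lines.

```python
import os
import tempfile


def create_racy(path, data):
    """TOCTOU-prone: the existence check and the write are separate steps."""
    if os.path.exists(path):            # time of check
        return False
    # Time of use: another process may have created the file in between,
    # in which case this silently overwrites it.
    with open(path, "w") as handle:
        handle.write(data)
    return True


def create_atomic(path, data):
    """Let the OS check and create in one step via exclusive-create mode."""
    try:
        with open(path, "x") as handle:   # raises if the file already exists
            handle.write(data)
        return True
    except FileExistsError:
        return False


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "claim.txt")
        print(create_atomic(target, "first"))    # creates the file
        print(create_atomic(target, "second"))   # refuses, file already exists
```

In a single process the two versions behave the same, which is exactly why this class of bug tends to pass tests and is worth asking an agent to hunt for explicitly.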
The beauty of agentic coding is, suddenly you have time for all of this.
> Serious planning. The plans should include constraints, scope, escalation criteria, completion criteria, test and documentation plan.
I feel like I'm a bit stupid for not being able to do this. My process is more iterative. I start working on a feature, then I discover some other function that's slightly related, go refactor it into common code, then proceed with the original task. Sometimes I stop midway to see if this can be done with a library somewhere and go look at examples. I take many detours like these. I am never working on a single task like a robot, and I don't want Claude to work like that either. That seems so opposite to how my brain works.
When I get an idea for something I want to build, I will usually spend time talking to ChatGPT about it. I'll request deep research on existing implementations, relevant technologies and algorithms, and a survey of literature. I find NotebookLM helps a lot at this point, as does Elevenreader (I tend to listen to these reports while walking or doing the dishes or what have you). I feed all of those into ChatGPT Deep Research along with my own thoughts about the direction of the system, and ask it to produce a design document.
At the moment, I use Opus or GPT-5.4 on high to generate those plans, and Sonnet or GPT-5.4 medium to implement.
The roadmap and the design are definitely not set in stone. Each step is a learning opportunity, and I'll often change the direction of the project based on what I learn during the planning and implementation. And of course, this is just what works for me. The fun of the last few months has been everyone finding out what works for them.
You seem to work a lot like how I do. If that is being stupid, then well, count me in too. To be honest, if I had to go through all the work of planning, scope, escalation criteria, etc., then I would probably be better off just writing the damn code myself at that point.
Many of those tools are overpowered unless you have a very complex project that many people depend on.
The AI tools will catch the most obvious issues, but will not help you with the most important aspects (e.g. whether your project is useful, or the UX is good).
In fact, having this complexity from the start may kneecap you (the "code is a liability" cliché).
You may be "shipping a lot of PRs" and "implementing solid engineering practices", but how do you know if that is getting closer to what you value?
How do you know that this is not actually slowing you down?
It depends a lot on what kind of company you are working at. For my work the product concerns are taken care of by other people: I'm responsible for technical feasibility, alignment, and design, but not for deciding what features should be built or validating whether they are useful and add value. Product people take care of that.
If you are solo or in a small company, you apply the complexity you need. You can even do it incrementally: when you see a pattern of issues repeating, address those over time, hardening the process from lessons learnt.
Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.
I don't think there's a hard set of rules that can be applied broadly, the engineering job is to also find technical approaches that balance both needs, and adapt those when circumstances change.
On the one side I reject that product and engineering concerns are separated: Sometimes you want to avoid a feature due to the way it will limit you in the future, even if the AI can churn it in 2 minutes today.
On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.
I suspect that unless we get fully automated engineering / AGI soon, companies that value engineers with good taste will thrive, while those that double down into "ticket factory" mode will stagnate.
> On the one side I reject that product and engineering concerns are separated: Sometimes you want to avoid a feature due to the way it will limit you in the future, even if the AI can churn it in 2 minutes today.
That is exactly not what I meant, I'm sorry if it wasn't clear but your assumption about how my job works is absolutely wrong.
I even mention that the product discussion is separate only on "how to wrangle these tools":
> Ultimately the product discussion is separate from the engineering concerns on how to wrangle these tools, and they should meet in the middle so overbearing engineering practices don't kneecap what it is supposed to do: deliver value to the product.
Delivering value, which means also avoiding a feature that will limit or entrap you in the future.
> On the other side perhaps your company, like most, does not know how to measure overengineering, cognitive complexity, lack of understanding, balancing speed/quality, morale, etc. but they surely suffer the effects of it.
We do measure those and are quite strict about it, most of my design documents are about the trade-offs in all of those dimensions. We are very critical about proposals that don't consider future impacts over time, and mostly reject workarounds unless absolutely necessary (and those require a phase-out timeline for a more robust solution that will be accounted for as part of the initiative, so the cost of the technical debt is embedded from the get-go).
I believe I wasn't clear and/or you misunderstood what I said. I agree with you on all these points, and the company I work for is very much the opposite of a "ticket factory". Work being rejected due to concerns about its overall cross-boundary impact is very much praised, and invited.
My comment was focused on how to wrangle these tools for engineering purposes being a separate discussion to the product/feature delivery, it's about tool usage in the most technical sense, which doesn't happen together with product.
We on the engineering side determine how to best apply these tools for the product we are tasked on delivering, the measuring of value delivered is outside and orthogonal to the technical practices since we already account for the trade-offs during proposal, not development time. This measurement already existed pre-AI and is still what we use to validate if a feature should be built or not, its impact and value delivered afterwards, and the cost of maintaining it vs value delivered. All of that includes the whole technical assessment as we already did before.
Determining if a feature should be built or not is ultimately a pairing of engineering and product, taking into account everything you mentioned.
Determining the pipeline of potential future non-technical features at my job is not part of engineering, except for side-projects/hack ideas that have potential to be further developed as part of the product pipeline.
Sorry, I think you're right that I misinterpreted your comment. I still had in mind OP's example (BDD, mutation testing, all that jazz). I apologize!
Reading your comment, it looks like you work for a pretty nice company that takes those things seriously. I envy you!
My concern was that for companies unlike yours that don't have well established engineering practices, it _feels_ that with AI you can go much faster and in fact it's a great excuse to dismantle any remaining practices. But in reality they are either doing busywork or building the wrong thing. My guess is that those are going to learn that this is a bad idea in the future, when they already have a mess to deal with.
To put what I mean into perspective... if you browse OP's profile you can find absolutely gigantic PRs like https://github.com/leynos/weaver/pull/76. I can not review any PR like that in good faith, period.
Can't upvote you enough. This is the way. You aren't vibe coding slop; you have built an engineering process that works even if the tools aren't always reliable. This is the same way you build out a functioning and highly effective team of humans.
The only obvious bit you didn't cover was extensive documentation including historical records of various investigations, debug sessions and technical decisions.
Building a fancy-looking process doesn't mean the output isn't slop. Vibecoders on reddit have even more insane "engineering" processes.
parent comment has all these
This is the biggest bottleneck for me. What's worse is that LLMs have a bad habit of being very verbose and rewriting things that don't need to be touched, so the surface area for change is much larger.
Not only that, but LLMs do a disservice to themselves by writing verbose code and decorating lines with redundant comments, which wastes their context the next time they work with it.
I have had good luck in asking my agent 'now review this change: is it a good design, does it solve the problem, are there excessive comments, is there anything else a reviewer would point out'. I'm still working on what prompt to use but that is about right.
It's kind of weird; I jumped on the vibe coding opencode bandwagon but using local 395+ w/128; qwen coder. Now, it takes a bit to get the first tokens flowing, and the cache works well enough to get it going, but it's not fast enough to just set it and forget it, and it's clear when it goes in an absurd direction and either deviates from my intention or simply loads some context where it should have followed a pattern, whatever.
I'm sure these larger models are both faster and more cogent, but it's also clear that what matters is managing its side tracks and cutting them short. Then I started seeing the deeper problematic pattern.
Agents aren't there to increase the multifactor of production; their real purpose is to shorten context to manageable levels. In effect, they're basically trying to reduce the odds of longer context poisoning.
So, if we boil down the probability of any given token triggering the wrong subcontext, it's clear that the greater the context, the greater the odds of a poison substitution.
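One way to put numbers on that intuition, under a strong simplifying assumption that is mine and not a claim about how models actually behave: if each token independently has the same small chance p of triggering the wrong subcontext, then the chance of at least one such event in a context of n tokens is 1 - (1 - p)^n. The per-token rate used below is an arbitrary illustrative value, not a measurement.

```python
def poison_probability(per_token_p, context_len):
    """P(at least one poisoned token) = 1 - (1 - p)^n, assuming independent tokens."""
    return 1.0 - (1.0 - per_token_p) ** context_len


if __name__ == "__main__":
    ASSUMED_RATE = 1e-5  # hypothetical per-token rate, chosen only for illustration
    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"context of {n:>9} tokens -> {poison_probability(ASSUMED_RATE, n):.4f}")
```

Under this toy model the probability only ever grows with context length, which is the argument for splitting work into smaller subcontexts: several short contexts each carry a lower risk than one long one.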
Then that's really the problematic issue every model is going to contend with because there's zero reality in which a single model is good enough. So now you're onto agents, breaking a problem into more manageable subcontext and trying to put that back into the larger context gracefully, etc.
Then that fails, because there's zero consistent determinism, so you end up at the harness, trying to herd the cats. This is all before you realize that these businesses can't just keep throwing GPUs at everything, because the problem isn't compute-bound, it's contextual/DAG the same way a brain is limited.
We all got intelligence and use several orders of magnitude less energy, doing mostly the same thing.
It’s a blend. There are plenty of changes in a production system that don’t necessarily need human review. Adding a help link. Fixing a typo. Maybe upgrades with strong CI/CD or simple ui improvements or safe experiments.
There are features you can ship safely behind feature flags or staged releases. As you push in, you find that with the right tooling it can be a lot.
If you break it down, often quite a bit can be deployed safely with minimal human intervention (depends naturally on the domain, but for a lot of systems).
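For readers who haven't used feature flags for staged releases, here is a minimal sketch of a percentage rollout. The function name, the flag name and the hashing scheme are all assumptions for illustration; real flag services add targeting rules, kill switches and audit logs on top of something like this.

```python
import hashlib


def in_rollout(flag_name, user_id, percent):
    """Return True if this user falls inside the first `percent` buckets for the flag."""
    key = f"{flag_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Map the hash to a stable bucket in 0..99; the same user always lands in the same bucket.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    for percent in (0, 10, 50, 100):
        enabled = sum(in_rollout("new-help-link", u, percent) for u in users)
        print(f"{percent:>3}% rollout -> enabled for {enabled} of {len(users)} users")
```

Because the bucket depends only on the flag and the user, raising the percentage only ever adds users, so a change can be widened gradually or rolled back to 0 without a redeploy.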
> you know, it's more difficult to read other people's/machine code than to write it yourself
Not at all, it's just a skill that gets easier with practice. Generally if you're in the position to review a lot of PRs, you get proficient at it pretty quickly. It's even easier when you know the context of what the code is trying to do, which is almost always the case when e.g. reviewing your team-mates' PRs or the code you asked the AI to write.
As I've said before (e.g. https://news.ycombinator.com/item?id=47401494), I find reviewing AI-generated code very lightweight because I tend to decompose tasks to a level where I know what the code should look like, and so the rare issues that crop up quickly stand out. I also rely on comprehensive tests and I review the test cases more closely than the code.
That is still a huge time-saving, especially as the scope of tasks has gone from functions to entire modules.
That said, I'm not slinging multiple agents at a time, so my throughput with AI is way higher than without AI, but not nearly as much as some credible reports I've heard. I'm not sure they personally review the code (e.g. they have agents review it?) but they do have strategies for correctness.
I'll often run 4 or 5 agents in parallel. I review all the code.
Some agents will be developing plans for the next feature, but there can sometimes be up to 4 coding.
These are typically a mix between trivial bug fixes and 2 larger but non-overlapping features. For very deep refactoring I'll only have a single agent run.
Code reviews are generally simple since nothing of any significance is done without a plan. First I run the new code to see if it works. Then I glance at diffs and can quickly ignore the trivial var/class renames, new class attributes, etc leaving me to focus on new significant code.
If I'm reviewing feature A I'll ignore feature B code at this point. Merge what I can of feature A then repeat for feature B, etc.
This is all backed by a test suite I spot check, and linters for e.g. required security classes.
Periodically we'll review the codebase for vulnerabilities (e.g. incorrectly scoped db queries) and redundant/cheating tests.
But the keys to multiple concurrent agents are plans where you're in control ("use the existing mixin", "nonsense, do it like this" etc) and non-overlapping tasks. This makes reviewing PRs feasible.