Yep. The only people I've heard saying that generated code is fine are those who don't read it.
The problem is that the mitigations offered in the article also don't work for long. When designing a system or a component we have ideas that form invariants. Sometimes the invariant is big, like a certain grand architecture, and sometimes it’s small, like the selection of a data structure. You can tell the agent what the constraints are with something like "Views do NOT access other views' state" as the post does.
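To make that concrete, here is roughly what such an invariant looks like when it's baked into the types (a sketch; the names are mine, not the post's):

```typescript
// "Views do NOT access other views' state", encoded structurally: each
// view's state is private, and shared data arrives through a read-only store.
interface ReadonlyStore {
  get(key: string): unknown;
}

class TodoView {
  private state = { selectedId: null as string | null }; // private to this view
  render(shared: ReadonlyStore): string {
    // This view can read shared data but has no handle on any other view's state.
    return `selected=${this.state.selectedId}, todos=${shared.get("todos")}`;
  }
}
```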
Except, eventually, you'll want to add a feature that clashes with that invariant. At that point there are usually three choices:
- Don’t add the feature. The invariant is a useful simplifying principle and it’s more important than the feature; it will pay dividends in other ways.
- Add the feature inelegantly or inefficiently on top of the invariant. Hey, not every feature has to be elegant or efficient.
- Go back and change the invariant. You’ve just learnt something new that you hadn’t considered, something that puts things in a new light, and it turns out there’s a better approach.
Often, only one of these is right. Often, at least one of these is very, very wrong, and with bad consequences.
Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.
Even if you have an architecture in mind, and even if the agent follows it, sooner or later it will need to be reconsidered. What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code that contorts itself to fit them when they need to change. If you don't read what the agent does very carefully - more carefully than human-written code, because the agent doesn't complain about contorted code - you will end up with the same "code that devours itself", only you won't know it until it's too late.
If you know how to write good code you can force AI to write good code with various techniques. It's 100% doable. You just need to figure out the problems AI has and find solutions that make things easier for it, e.g. extremely small contexts.
Split the code into modules with clear boundaries and only allow the AI to work within those boundaries. Keep modules free of IO so they are easily testable. Hide modules behind interfaces, and so on. You can write 100 tests that execute within a second. You can write benchmarks, etc. AI needs boundaries and small contexts to work well. If you fail to give it that, it will perform poorly. You are in charge.
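A minimal sketch of what that looks like (all names illustrative): a module kept pure of IO behind a small interface, so tests need no mocks and run fast:

```typescript
// The module's only contract is this interface; callers never see internals.
export interface PriceCalculator {
  total(items: { price: number; qty: number }[]): number;
}

// Pure implementation: no IO, no hidden state, trivially testable.
export const calculator: PriceCalculator = {
  total: (items) => items.reduce((sum, i) => sum + i.price * i.qty, 0),
};

// Tests like this run in microseconds; a hundred of them finish well
// inside a second.
console.assert(calculator.total([{ price: 2, qty: 3 }]) === 6);
```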
That doesn't quite work, and precisely for the reason I mentioned: You can definitely tell the AI to follow some strategy, but at some point the strategy will need to change, and the AI won't tell you that (even if you tell it to). Unless you read the code every time, you won't know if the AI is following the strategy and producing good results, or following it and producing bad results because the strategy needs to change. This can happen even with small changes: the AI will follow the strategy even when the change proves it wrong, and if you don't pay close attention, these mistakes pile up.
So yes, you might get good results in one round, but not over time. What does work is to carefully review the AI's output, although the review needs to be more careful than review of human-written code because the agents are very good at hiding the time bombs they leave behind.
If I instruct the AI to make small modules where I can verify they work, have tests and no side effects - then it is good enough code for me. It works, is readable and can be extended - and will turn into bad code if this is not done with care.
Sure, if you carefully review the agent's output, including tests, you can get good results. If you don't carefully review the output, you obviously have no idea if it's good enough for you. The only way you'll find out is when, 30 changes down the line, the agent can't change one thing without breaking another, but by then the codebase will be too far gone to fix.
The concept of a small module is an architecture invariant. You’re making that decision, not the LLM. And you’ve made that decision because the machine is not good at certain things. You’re doing that because you can’t trust the LLM to make that decision on its own.
I’m doing it because as a DDD adherent, I’ve been building software that way for 15 years without GenAI and now with GenAI I can do it faster.
You can’t play whack-a-mole with GenAI. You have to start from well-known principles and watch everything it produces. Every module or bounded context has to have its own invariants.
You can’t fully automate software engineering with GenAI. It seems the vast majority of GenAI users think they can and end up in the same place as the OP.
Maybe learn Domain-Driven Design, Event Sourcing, and then try again. The results will be dramatically improved.
Yeah I agree. It's improved quite a bit just in the past few months. The code should always be reviewed, and you need to spend some time tuning your skills and agent configs. If you're still getting bad code out of your LLM tooling, you might not be using or configuring it correctly.
This is actually what I do. I'm extremely picky about the code and force the LLM to rewrite it 1,000 times until it is basically exactly what I want. You might be wondering: what's the point, when it would be faster for me to just write the code myself?
I have ADHD and for whatever reason telling the LLM what to do instead of doing it myself bypasses the task avoidance patterns and/or focus problems I tend to suffer from. I do not find it fun, but I am thankful for it.
I have used LLMs a couple of times to get started on something. I don’t have ADHD, so this is not a regular occurrence for me. But when I have tried this, I have always found the LLM solution so horrible that it instantly inspired me to do it myself. So, in that sense it worked, I got unstuck, but no LLM garbage makes it into the project.
It is significantly easier to micro-manage an AI than a suite of junior developers. The AI doesn't replace a principal engineer; it replaces junior and weaker senior developers, who need stories broken down extremely precisely to be able to get anything done. In the time it takes to break down a story so that a junior or weak senior developer can pick it up and execute it well, the AI would already be done, with tests built around it.
Juniors learn. Some juniors are potential good seniors. Over time they will internalise good architecture and be able to make good judgments on their own.
Micromanaging LLMs is like having Dory from Finding Nemo as your colleague. You find ways to communicate, but there is no learning going on.
LLMs can learn, just not the same way that juniors do. When an LLM does something wrong, you can always update its rules or skills so it doesn't make that mistake again. Or you can utilize a subagent whose sole purpose is to review code to prevent that mistake. There are lots of ways you can improve LLMs over time.
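For instance, with Claude Code that kind of feedback usually lands in a CLAUDE.md-style rules file; an illustrative sketch (these rules are made up, not from any real project):

```
# CLAUDE.md (illustrative excerpt)
- Never add a new dependency without asking first.
- All database access goes through src/db/; views must not import it directly.
- If a stated constraint blocks the task, stop and ask instead of working around it.
```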
Of course if you don't provide that feedback loop, no learning happens. I guess the same could be said of a junior, though.
I think if you tried working with some junior folks, you'd be quite surprised. You know, with at least some of them choosing to use their brains and all.
Sure. That's how I work with AI, and the way I believe AI is meant to be used -- as a companion tool.
But it's a lot of work. It saves me time for certain tasks, but not others. I haven't measured my productivity gains, but they're at most 2x.
But that's not "vibe coding" (which was the point of the article) or the (false) promise of "10x productivity" and "code that writes itself" that companies are being told is going to reduce their engineering headcount tenfold.
I agree with this. I've been writing a new internal framework at work and migrating consumers of the old framework to the new one.
I had strong principles at the outset of the project and migrated a few consumers by hand, which gave me confidence that it would work. The overall migration is large and expensive enough that it has been deferred for nearly a decade. Bringing down the cost of that migration made me turn to AI to accelerate it.
I found that it was OK at the more mechanical and straightforward cases, which are 80% of the use cases, to be fair. The remaining 20% need changes to the framework. Most of them need very small changes, such as an extra field in an API, but one or two require a partial conceptual redesign.
To oversimplify the problem: the backend for one system can generate certain data in 99% of cases. In a few critical cases, it logically cannot, and that data must be reported to it. Some important optimizations were made on the assumption that this would be impossible.
The AI tooling didn't (yet) detect this scenario and happily added migration logic assuming it would work properly.
Now, because of how this is being rolled out, this wasn't a production bug or anything (yet). However, asking the right questions to partner teams revealed it and unearthed that some others were going to need it as well.
Ultimately, it isn't a big problem to solve in a way that will mostly satisfy everyone, but it would have been a big problem without a human deeper in the weeds.
Over time, this may change. Validation tooling I built may make a future migration of this kind easier to vibe code even if AI functionality doesn't continue to improve. Smarter models with more context will eventually learn these problems in more and more cases.
The code it generates still oscillates between beautiful and broken (or both!), so for now my artistic sensibilities make me keep a close eye on it. I think of the depressed robot from The Hitchhiker's Guide to the Galaxy as the intelligence behind it. Maybe one day it'll be trustworthy.
“The only people I've heard saying that generated code is fine are those who don't read it.” Are you sure these people aren’t busy working rather than chatting? (haha)
But in all seriousness, it depends on what you’re doing with it. Writing a quick tool using an LLM is much easier than context-switching to write it yourself. If you need the tool, that’s very valuable.
Also, as a webdev: it writes basic CRUD pretty well. I am tired of having to build forms myself, and the LLMs are usually really good at that.
Been building a new app with lots of policies and whatnot, and instructing an LLM is just much faster than doing the same repetitive shit over and over myself.
If you were tired of writing forms yourself, had you looked at https://jsonforms.io/? Just specify the data you need, or extract it from the API spec, and go. Display the form uniformly every time across your site. No need to burn AI time.
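For reference, basic JSON Forms usage in React looks roughly like this (the schema contents here are illustrative):

```tsx
import { useState } from "react";
import { JsonForms } from "@jsonforms/react";
import { materialRenderers, materialCells } from "@jsonforms/material-renderers";

// Describe the data once; the form is generated from the schema.
const schema = {
  type: "object",
  properties: {
    name: { type: "string" },
    email: { type: "string", format: "email" },
  },
  required: ["name"],
};

export function UserForm() {
  const [data, setData] = useState({});
  return (
    <JsonForms
      schema={schema}
      data={data}
      renderers={materialRenderers}
      cells={materialCells}
      onChange={({ data }) => setData(data)}
    />
  );
}
```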
I typically avoid most abstractions or third-party dependencies. Yea, it could be neat, but I still need a lot of custom logic here and there.
Same reason I avoid stuff like GraphQL.
A little update: upon viewing the page on my phone, the "comitter" field in the demo goes out of bounds for me... Really doesn't speak well for their product.
I don't know or care about that specific tool, or really what you do at all, I was just reacting to how the principle you stated conflicts with the practice you described. How you reconcile those is up to you.
Mate, that's literally what you implied, innit? You probably "can" do it yourself, but you choose not to - I wonder why? Also the point of sarcasm is to communicate it in such way that it is obvious, without using the "/s" signifier. You know like, telling a joke at a party that you don't have to explain.
Yea, because having 200 different abstractions and DSLs makes stuff easier for sure! Why not use all the stuff that was popular 6 years ago, like Prisma, GraphQL and Redux - whoops, suddenly you need a whole team of devs knowing all kinds of unnecessary abstractions.
...which means you depend on the LLMs? Of course, strictly "to save time". It's not like you are slowly forgetting how to start a project in the first place or implement that db integration, right?
LOL, why would I ever forget how to start a project or how to connect to a DB or make migrations and whatnot? Brother, generating a web form for creating and updating models is not that big of a deal. An LLM can do this while providing a11y attributes and proper styling in like 10 minutes. This includes creating a migration (which I take a look at and correct if needed), creating the model, creating the required policies, creating the controller endpoints (which I correct in case it's needed), and creating a template file for the CRUD operations with search and pagination and whatnot, while making it look somewhat good.
I can do all of this myself, but why would I waste 1-2 hours (per model) on doing all that myself if I can just instruct some stupid LLM to do it for me? It's repetitive boilerplate.
This is the core unspoken bone of contention in most AI arguments, I think: most people either aren't writing code with strict quality requirements or don't realize where their use of AI is violating them.
That said, most of the world's most useful code has strict quality requirements. Even before AI, 90% of SLOC would be tossed away without much if any use, 9% was used infrequently, while 1% runs half the world's software.
I think this misses the scale of the problem. Review never fixed tech debt, nor did it fix irrelevant/bloated test suites. It didn't solve complexity, or eliminate footguns. Very few people (I would argue almost no one) had developed theories for what all of these even were, or how to spot them in code.
Reviewers aren't perfect, far from it. And we just gave them ~20x more code to review. Incentives mean that taking 20x longer to review is unacceptable. So where do we go from here?
The generated code is more than fine, it’s good in many cases. And I read it :)
Indeed for the task of “jump into an unfamiliar codebase and make a requested change that aligns with existing styles and patterns, and uses existing functionality” I would say something like opus 4.7 exceeds the capabilities of most developers.
I agree with both statements, but that doesn't change the problem I stated. If an agent produces reasonable code 80-90% of the time, and 10-20% of the time it makes mistakes that could render the codebase irretrievably unevolvable once they accumulate, the only thing you can do is to carefully review the agent's output 100% of the time. That it gets things right 80% of the time as opposed to 40% of the time doesn't change this calculus one iota.
But agents generate code much faster, and to avoid slowing them down, some people skip the only thing that can currently ensure good results, which is carefully reviewing the output. Once that happens, there is simply no way for them to know how good or bad what they're getting is.
Human developers don't produce code at such a rate, and their judgment is, on average, better. So one, the review doesn't make you feel like you're slowing things down much, and two, the problems are less hidden.
And humans produce 100% reasonable code or what? The kind of mess me and everyone I've worked with produces by hand is the inverse of that. Constant shortcuts and lazy slop through and through. Never worked anywhere where the code wasn't an entangled disarray.
As soon as requirements change the abstractions fall apart and everything gets shoehorned.
The invariant, stated informally, would be hard to prove is broken by a human reviewer in the loop. Spoken language isn’t precise enough for the task.
Even if you could state it in a precise formal language the LLM under the agent doesn’t have the capability to understand what the invariant is for and why it’s important. You’ll still get oddly generated code. You might get an LLM that can associate certain tokens with those in the formal language specification which can hold invariants and perhaps even write the proofs… but you’ll still get a whole bunch of other code generated from the informal parts of the prompt.
I agree that simply adding constraints and prompts to your skills and specs isn’t going to prevent these things. Worse, even if you could invent a better mousetrap, the creature will still escape.
The problem is “elongation”: the addition of code for the sake of the prompt/task/etc. Often less is better. This takes a human with the ability to anticipate what other humans would want/expect. When you need a generator, they’re great, but it’s a firehose whose use should be restrained a little more.
> The invariant, stated informally, would be hard to prove is broken by a human reviewer in the loop. Spoken language isn’t precise enough for the task.
That depends on the invariant. Some are behavioural, like "variable x must be even if y is positive", but some are architectural, such as "a new view requires a new class".
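To illustrate the difference (a sketch; the names are mine):

```typescript
// Behavioural invariant: mechanically checkable at runtime or in a test.
function checkInvariant(x: number, y: number): void {
  if (y > 0 && x % 2 !== 0) {
    throw new Error("invariant violated: x must be even when y is positive");
  }
}

// The architectural kind ("a new view requires a new class") has no such
// runtime check; it lives in reviews, lint rules, and the team's heads.
```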
But that's only one side of the problem because maintaining the invariant can be just as bad as breaking it. You ask the agent to add a feature and it may well maintain the invariant - only it shouldn't have, because the feature uncovers the fact that the invariant is architecturally wrong.
The problem is that evolving software requires exercising judgment about when you need to follow the existing strategy and when you need to rethink it. If there is any mechanical rule that could state what the right judgment is, I don't know what it is.
Yes! I was trying to make this part of my point but you definitely made it much more clear and concise.
With a skilled operator, it could be possible to drive an agent to handle these kinds of changes. I would be concerned that spoken language wouldn't be precise enough to handle the refactoring and changes necessary to make to a code base when an invariant changes... regardless of whether it was a property, architectural, or procedural change. It already can take several prompts and burn quite a few tokens doing large-scale rewrites and code changes. Maybe the parameters and weights can be tuned for this kind of work but I remain skeptical that what we have at present is "efficient" at this kind of work.
And the solution is the same as when development was outsourced - the "patch" was to fix it by writing a spec. Thus I conclude my TED talk with the statement: LLMs are the new outsourcing and run into the same problems.
Not quite, because the architecture often needs to evolve as you learn more over the life of the project. People will complain when they feel the constraints drive them to unnatural workarounds; the agents don't.
You can try telling the agent to stop and ask when a constraint proves problematic, except it doesn't have as good a judgment as humans to know when that's the case. I often find myself saying, "why did you write that insane code instead of raising the alarm about a problem?" and the answer is always, "you're absolutely right; I continued when I should have stopped." Of course, you can only tell when that happens if you carefully review the code.
So I run a solo saas that supports my family, and so the stakes feel very high for me. I use AI heavily, and I’ve seen the exact problem you’re describing. I feel like I’m often really riding the edge in terms of trying to use AI to accelerate product development while not letting tech debt accumulate too fast, or let my mental model of the codebase slip too much.
Here’s what’s working for me right now:
1. The basics: use best model available, have skills and rules that specify project guidelines, etc.
2. Always use plan mode. It works much better to iterate on the concept of what we’re going to do, then do the implementation. The models will adhere to the plan at very high rates in my experience.
3. Don’t give chunks of work that are too large in scope. This is just art, and I’m constantly experimenting with how ambitious I can be.
4. I review all code to some extent, but I have a strong mental model of what areas of the app are more critical, where hidden bugs might accumulate, etc, and I review both tests and impl more strenuously in those areas. Whereas like a widget for my admin panel probably gets a 2 second glance.
5. Have the discipline to go through periodically and clean up tech debt, refactor things that you’d do differently now, etc. I find the AI a huge help here, because I can clean up cruft in an hour that would have once taken me days, and thus probably wouldn’t have gotten done.
6. I’m experimenting with shifting my architecture to make it easier to review AI code, make it less likely it’ll make mistakes, etc. Honestly mostly things I should have always been doing, but the level of formalism and abstraction on my solo projects is usually different than on a bigger team.
To each their own, but I’ve grown this from nothing to about $350k in ARR over the last ten months, and I’m very confident I never could have built this product without AI help in triple that time.
This is the rule I have settled on, and I can feel why. Writing the first buggy working version with agents is always fun. Then making the software reliable with the agents, the way you want it, is very painful.
I read all the code I generate with Cursor and some of it smells a bit weird but is easily fixable and most of it is as good as what I would write or better.
> Picking among them isn’t a matter of context. It’s a matter of judgment, and the models - not the harnesses - get this judgment wrong far too often. I would say no better than random chance.
Yeah, I've already been working for several months on a harness that wraps Claude Code and Codex etc. to ensure that these types of invariants are captured and enforced (after the first few harness attempts failed), and - while it's possible - it slows down the workflow significantly and burns a lot more tokens. In addition to requiring more human involvement, of course.
I suspect this is the right direction, though, as the alternatives inevitably lead any software project to devolve into a spaghetti-mess maintenance nightmare.
It's not enough to enforce the invariants because they may need to change. You need to follow the invariants when they're right, and go back and reconsider them when they prove unhelpful. Knowing which is the case requires judgment that today's models are simply incapable of (not consistently, at least).
What's the difference between asking an AI to write you a module you never read and installing a 3rd-party module without auditing all its source code?
If the 3rd party module is popular, its badness will affect other people too and either the module will get improved or well known workarounds/"best practices" will develop. With AI-generated code, more often than not you're the sole user.
I would use Stripe, curl, and ffmpeg without audits, because I trust them to provide good code and to respect their API. I wouldn’t trust AI to write a Fibonacci series implementation.
Write your code by hand, but AI still serves as something of a stack overflow and code completion tool. Also good for writing tedious things like regex or little one-off utility scripts as well as a first crack at unit tests. Using it to actually write big blocks of important code is a no-no in my opinion as it produces what I would characterize as slop, even if it technically works.
> What I've seen is that if you define the architectural constraints, the agent writes complex, unmaintainable code...
To be fair, there are many people like this as well. One of my personal favorite examples was way back in the 80s when I inherited the code for a protocol converter that let ASCII terminals communicate with IBM mainframes via the 3270 protocol.
One of the pieces of code in there, for managing indicator lights, was simply wrong. It was ca. 150 lines of Z80 assembly language that was trying to faithfully follow the copious IBM documentation of how things worked, but it had subtle issues and didn't always work.
My approach was to accept the documentation as accurate (the IBM documentation was always verbose and almost never wrong), but to reason that the original 3270 had these functions implemented in TTL logic gates, and there was no way in heck that they were wasting enough gates on indicator lights to require the logical equivalent of 150 instructions.
So in my mind, it had to be a really simple circuit that had emergent properties that required the reams of documentation. With that mindset, I was able to craft correct code for this in 12 instructions.
Many systems are likewise fractal in nature. You want to figure out the generating equations, rather than all the rules that derive from those. And, in many cases, writing down the generating equations is at least as easy to do in code as it would be to do in English for someone or something else to implement.
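In miniature - and purely illustratively, this is not the actual 3270 logic - the difference is between transcribing every derived rule and writing the generating function:

```typescript
// Instead of transcribing every documented rule about when each indicator
// is lit, write the one small circuit-like function that generates them all.
const READY = 1, INPUT_INHIBITED = 2, INSERT_MODE = 4;

function indicators(connected: boolean, busy: boolean, insertMode: boolean): number {
  return (connected ? READY : 0)
       | (busy ? INPUT_INHIBITED : 0)
       | (insertMode ? INSERT_MODE : 0);
}
```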
Yeah, their statement just isn't true. With enough instruction, I've been able to get great output from models. I think that's the key: with detailed, pointed instructions, the output will match.
Indeed, I'm not using LLM output without thorough review.
After reading a bunch of other comments, it sounds like people are referring to letting agents go wild and code whatever off a limited prompt. I'm not using LLMs like that; I'm generally interacting only via conversations with pretty detailed initial prompts. My interactions with the chat after that are corrections/guiding prompts to keep it on point and edit the prompt output from time to time.
They are nowhere near solved. Agents make serious mistakes in judgment and do it frequently enough to threaten the viability of the codebase unless you slow down and monitor them very, very closely. If you do that, it's all good. If you're not, your codebase is rotting at a superhuman speed underneath you and you have no idea until it collapses.
I agree they make mistakes in judgement, that's the whole point of plan mode. That judgement comes to the surface before lots of tokens are wasted without sight of the overall solution.
It's all very simple. "Use x library, data model should be xyz, do m, not n."
They're obviously not at the point of replacing an experienced programmer as far as knowing the start-to-finish way of accomplishing every detail, that's what the human is for.
Plan mode improves results, but it doesn't solve the underlying problems. Pretty often Claude Opus 4.7 on xhigh will formulate a reasonable enough plan, churn for a while, then come back with a summary that it didn't stick to the plan because it wasn't accurate.
Worse, the disclaimer is buried under a bunch of "did X, did Y on line Z of file a/b/c", as if it's just a minor inconvenience. To the extent the plan was inaccurate, you're left in an undefined state where you might as well undo what it just did.
You have to review the plan and fill in any missing gaps or correct anything that's wrong. Plan mode often isn't one shot, it might take a few iterations, but once the plan is nailed down, the results are usually very good.
You're right. I think having it spawn lots of subagents, read everything, formulate a big and detailed plan, only for it to be subtly wrong while requiring me to carefully review the result and the intermediate plans that produced it is quite tiring. I suppose things slip through.
If you understand these subtle pieces you perceive the AI to get wrong, you should include that in your prompt. Also, unit test and functional test coverage go a long way to ensure correct behavior.
When it was Copilot tab-completing lines, people would say, "yea, but you still have to make sure you're the one writing the whole functions".
Then when it was completing functions, people would say, "yeah, but you still have to make sure you're the one writing the logic around the functions"
Then when it was completing the logic around the functions, people would say, "yeah, but you still have to make sure you're the one writing the features"
Now it's completing features and people say, "yeah, but you still have to make sure you're the one writing the architecture"
I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
I heard a talk from a VP at NVIDIA a couple of months ago and he echoed this. Essentially their policy is "you are still fully responsible for the code you ship, whether AI helps with it or not"
I think being responsible for the code is a better framing. I run a saas and I don’t always review all the code, but this thing supports my family, so I am acutely aware that I’m responsible for what it does. My customers aren’t going to let me blame the agent for fucking up their workflows.
But that still doesn’t mean I review all the code. I tend to review defensively, based on the potential for harm if this piece of code is broken. And I rely a lot on tests, static analysis, canaries, analytics, health checks, etc. to reduce risk for when I’m wrong. So far it’s working.
I'm no longer sure you have to, actually. I mean, we do trust the assembly that compilers produce without having to read it, don't we? We're rapidly getting to that stage with LLMs, IMO.
The assembly is a deterministic transform of the input logic, and if it doesn't match, then it's a bug in the compiler. If an LLM-based code generator doesn't match what you asked for, that's OK, just pull the slot-machine handle again. That's the difference.
The "pull the slot-machine handle again" is the dangerous thing here.
I can feel it sometimes, as my brain shuts down and I gamble instead of thinking. It's a reversion to what I call "monkey mind" where you just keep pressing buttons to "make it work". I took a decade training my mind away from this, and too much AI is bringing it back.
And then getting bugs when they use a new version of the AI, just like people occasionally got bugs when they upgraded to new versions of the compiler...
I know it’s tiring to talk about “hallucination”, but truly, models still do hallucinate.
They constantly say they did a thing they didn’t, say they know how to solve something when they don’t, etc. Regardless of guard rails or tests - AI forces a constant vigilance of a new kind.
Not just “what might have gone wrong” but also “what do I think is working but isn’t actually”.
And we’re not even talking about how it chooses substandard solutions, is happy to muddy code/architectures, add spaghetti on top of spaghetti etc.
Agentic coding often feels like an army of inexperienced developers who are also incredibly eager to please.
"Still" means "it always had hallucinations, and it still does, despite people thinking that it doesn't anymore". People think we've moved past that. We haven't.
> we do trust the assembly that compilers produce without having to read it
Yes, because wrong assembly blows up really loudly - everything from wrong behavior to invalid instruction errors. Moreover, compilers are battle-tested over the years, with extremely detailed test suites and extreme real-world exposure (every day, hundreds of thousands of users test and verify them).
Also, as people said, assembly generation is deterministic. For a given source file and set of flags, you get the same thing out. Byte by byte, bit by bit. This is what we call "reproducible builds".
AI is not like that. It's randomized on purpose, and it pulls from a training set that contains imperfect, non-ideal code. "Yeah, it works, whatever" doesn't cut it when you pull a whole function out of the connections formed by the training data. It can and will make errors, because it's randomized from a non-ideal pool.
Next, sometimes you need tight code. Fitting into caches, running at absolute performance limit of the processor or system you have. AI is not a good fit here. Sometimes you go so far that you optimize for the architecture at hand, and it works slower on newer systems, so you need to re-optimize that thing.
For anyone who reads and murmurs "but AI can optimize", yes, by calling specific optimization routines written by real talented people for some cases; by removing their name, licenses, and context around them. This is called plagiarism in its mildest form and will get you in hot water in academia, for example. Writing closed source software doesn't make you immune from cheating and doing unethical things.
Lastly, this still rings in my ears, and I understood it more and more as I worked with high-performance, correctness-critical code:
I was taking an exam, there's this tracing question. I raise my head and ask my professor: "Why do I need to trace this? Compiler is made to do this for me". The answer was simple yet deep: "If you can't trace that code, the compiler can't trace it either".
As I said, I just said "huh" at the time, but the saying came back and when I understood it fully, it was like being shocked by a Tesla coil.
Get your sleep, eat your veggies and understand your code. Those are the three essential things you need to do.
This is a really, really, really bad comparison. I used to say the same thing. But the semantic distance between a for loop and its equivalent assembly instructions is much smaller than the distance between "I'd like a web application that can store and retrieve todo items" and a working implementation. The space of the latter is practically infinite in what can be "compiled."
I've actually taken to double-checking the assembly in some instances. There are surprising times that the compiler won't make the shortcuts and optimizations you thought it should, and I also used this method to call out an unsuitable compiler since I caught it spitting out some ridiculous 10x-long set of instructions in certain critical instances.
We are not rapidly getting to that stage with LLMs and frankly it's hilarious that you are claiming so.
For anything other than greenfield, new code projects without dependencies, conventions, and connections to other proprietary code, it has to be reviewed. Even in that case, it's not good to skip reviewing the code.
Not at all. Code is not important, intent is. The leader of a product/company does not have to read code. It doesn't matter if it is generated by humans or non-humans. It simply needs to be correct enough to be usable and then steerable towards better outcomes. Understanding of code never existed from the business perspective.
> I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
I think the solution is between the lines of this article. The author states the steps leading to it, but doesn't arrive at it explicitly. It has been obvious (with 20/20 hindsight) to me since LLMs started getting popular, and it holds:
LLMs are fantastic for software dev - if you don't let them write architecture. Create the modules, structs, and enums yourself. Add as many of the struct fields and enum variants as possible. Add doc comments to each struct, enum, field, and module. Point the LLM to the modules and data structures, and have it complete the function bodies etc. as required.
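A sketch of that workflow (all names here are illustrative): the human owns the skeleton, types, and doc comments; the LLM only fills in the marked body:

```typescript
/** A single todo item. Invariant: `done` implies `completedAt` is set. */
export interface Todo {
  id: string;
  title: string;
  done: boolean;
  completedAt?: Date;
}

/** Events a todo can receive. */
export enum TodoEvent { Created, Completed, Reopened }

/** Apply an event to a todo and return the new state. */
export function apply(todo: Todo, event: TodoEvent): Todo {
  // LLM: complete this body; the types and doc comments above are fixed.
  throw new Error("unimplemented");
}
```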
The models can do architecture. However, they typically (at least currently) do a really bad job until you force them. I use AI all the time, and it is getting better, but I still review every single line. Individual lines today are no better than the tab completion of last year - sometimes really good and saving me typing, sometimes really, really bad.
Anyone who understands the motivation, reasoning and goals can do the architecture. The crux is that hardly anyone actually understands those, and even fewer are aligned on them; that's where misalignment creeps in over time, LLMs or not.
Considering how fast we can poop out code now, I think this issue is just more visible than before, but it's been an issue for as long as I've been a developer. Almost no one knows what they actually want, and half the job is trying to coax out what they want to be able to do, so you can properly architect it.
These models understand architecture perfectly well, but they're not trained to care about it when asked to implement X or Y feature. They're trained to implement the feature via the shortest route possible.
So it's not much of a surprise that this is the situation folks find themselves in with the current models.
As somebody with a colleague that is using AI agents to "complete features", let me tell you, it is not. It is taking that dude so much longer to prompt and reprompt and then prompt again until it is anywhere close to something that passes review than it would take any competent mid-level engineer to just build the whole thing with some autocomplete help.
Have people's standards for quality just completely vanished in the pursuit of the shiny new thing? Is that guy doing something wrong?
That has also been my experience with this sort of thing fwiw, which is why I gave up and do more of a class-by-class pairing with an LLM as a workable middle ground.
100% agree. Obviously AI is at a point where the developer has to do the architecture, or at least be in control of what kind of architecture the AI is implementing. You can't one-shot huge features in huge codebases with AI; you are bound to get strange decisions. But that does not mean these tools are not worth using. That's a silly take.
It's completing shit. Even if it does not implement some lazy stuff with empty catch blocks (i.e. the happy path from programming-101 tutorials), it will either expose your secrets in an accessible place or do some other stupidity.
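You know the pattern (an illustrative sketch):

```typescript
// The "happy path" pattern in question: the error is silently swallowed,
// so the code looks finished while failures simply vanish.
async function saveSettings(s: { theme: string }): Promise<void> {
  throw new Error("network down"); // stand-in for a real failure
}

async function main() {
  try {
    await saveSettings({ theme: "dark" });
  } catch {
    // nothing - the caller never learns anything went wrong
  }
}
main();
```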
"it takes too much effort to get the output production ready"
turning into
"maybe long term the maintenance will be more expensive"
I give it three months until people realize that you rarely need to review every single line and fully understand the code, like so many comments are claiming.
If you work on a product that has an existing user base that has an expectation that things will still work then you definitely still need to read the code. LLMs frequently break things or introduce subtle incompatibilities.
Maybe on projects with no users you can yolo things.
There are always people who will disagree, no matter how amazing something is, and they naturally respond with concerns close to the locus of the LLMification. It would be absurd to respond to “AI autocomplete is great now” with “but you still need to architect your code”. What’s people saving seconds writing code minutiae got to do with architecting the code?
This blob of people criticizing AI is just that, a blob. A gaggle of discrete people that your brain makes up a narrative about being some goalpost shifting entity.
Of course there could be individuals who have moved the goalposts. Which would need a pointed critique to address, not an offhand “people are saying” remark.
I've set a few rules for working with coding agents:
1. If I use a coding agent to generate code, it should be something I am absolutely confident I can code correctly myself given the time (gun to my head test).
2. If it isn't, I can't move on until I completely understand what it is that has been generated, such that I would be able to recreate it myself.
3. I can create debt (I believe this is being called Cognitive Debt) by breaking rule 2, but it must be paid in full for me to declare a project complete.
Accumulating debt increases the chances that code I generate afterwards is of lower quality, and it also feels like the debt is compounding.
I'm also not really sure how these rules scale to serious projects. So far I've only been applying these to my personal projects. It's been a real joy to use agents this way though. I've been learning a lot, and I end up with a codebase that I understand to a comfortable level.
While this is a legitimate set of rules to follow for maintaining code sanity and a solid mental model of how a codebase may grow, it’s always challenging to stick to them in a workplace where expectations around delivery speed have changed drastically with the onset of AI. The sweet spot lies in striking a balance between staying connected to the codebase and not becoming a limiting factor for the team at the same time.
That's kind of what I figured, sadly. I haven't experienced it personally yet since I got let go from my last job about 14 months ago, but it makes so much sense given how management is so willing to sacrifice quality for speed.
Another frustrating thing that has emerged from this is where managers “vibe code” half-baked ideas for a couple of hours and then hand it off as if they’ve meaningfully contributed to the implementation. Suddenly you’re expected to reverse engineer incoherent prompts, inconsistent code, and random abstractions that nobody fully understands.
In their mind they’ve already done the “architectural heavy lifting” and accelerated the team.
More often than not it just adds cognitive overhead where you spend more time deciphering and cleaning up garbage than actually building the thing properly from scratch.
Vouching for this comment because my friend confided in me a week ago that her manager also does this and is like “oh yeah, here’s 80% done, you just do the rest so we can ship it” when a large part of it is slop that needs to be rewritten, due to not enough guidance and pushback during generation.
Writing tests against a bad implementation usually doesn't work well. In this scenario I would have an LLM look at the changes in the branch and try to create a markdown document of the changes, why it thinks they were made, etc. and then review that doc with the manager and do a new implementation from scratch after aligning.
Unless the tests are written against logic that is in and of itself subtly wrong, and even the structure of the code and what methods there are is wrong - so even your unit tests would have to be rewritten, because the units are structured badly.
It’s a valid direction to look in, it just doesn’t address the root issue of throwing slop across the wall and also having unrealistic expectations due to not knowing any better.
Yep. It’s very healthy to be suspicious of code. Any code. Whether generated or not. That’s where the bugs are.
If there’s one thing that’s disturbing about AI proponents, it’s how trusting they are of code. One change in the business domain and most of the code may turn from useful to actively harmful - which you then have to rewrite. Good luck doing that well if you’re not really familiar with the code.
I am lucky to have never worked in a team where my manager wouldn't expect strong push back in this scenario. Many of the corporate environments described on here seem dystopian, this included.
To make a bit of a counter argument here - it's really hard to stick to 100% quality at speed 1.0, when your opponent argues for 90% at 2.5. That's the story the AI fast-movers are telling, and from a business perspective, it's hard to counter it (regardless of whether that speed increase actually materialises).
I was trying to follow similar rules, until one day I had to solve a hard mathematical problem. Claude is a PhD-level mathematician; I am not. I, however, know exactly the properties of the desired solution and how to test that it's correct. So I decided to keep Claude's solution over my basic, naive one. I mentioned that in the pull request and everyone agreed that was the right call. Would you make exceptions like that in your rules? What if AI becomes so much better than you at coding, not just at doing advanced mathematics? Would you then stop writing code by hand completely, since that would be the less optimal option, despite losing your ability to judge the code directly at that point (though, as in my example, you can still judge the tests, hopefully)? I think these are the more interesting questions right now.
Unfortunately, it is not, and many of its attempts at mathematical proofs have major flaws. You shouldn't trust its proofs unless you are already able to evaluate them--which I think is pretty much all the OP is saying.
Trust isn’t a binary, and I can trust things I don’t understand enough that I can use them. OP was talking about needing to understand, which is quite a bit above the level of being able to validate enough to use for a task.
I definitely wouldn't put math in my code I didn't understand just because Claude says so. I am not astonished that everyone agreed, that's why shit is going to hit the fan pretty badly pretty soon due to AI coding.
There is one exception to this: If the AI also delivers the proof of why the math is correct, in a machine-checked format, and I understand the correctness theorem (not necessarily its proof). Then I would use it without hesitation.
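For example, in Lean you only have to read and understand the theorem statement; the proof itself is checked mechanically (a toy sketch):

```lean
-- You only need to trust the statement; Lean checks the proof mechanically.
theorem my_add_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```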
I always found it weird when helping people with excel formulas how few people even try to check maths they don't understand, let alone try to understand it.
I struggle to remember even relatively simple maths like working out "what percentage of X is Y" so if I write a formula like that I'll put in some simple values like 12 and 6 or 10,000 and 2,456 just to confirm I haven't got the values backwards or something. I've been shown sheets where someone put a formula in that they don't understand, checked it with numbers they can't easily eyeball and just assumed it was right as it's roughly in their ball park / they had no idea what the end result should be.
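The same sanity check, expressed in code rather than a spreadsheet (a sketch):

```typescript
// Sanity-checking "what percentage of X is Y": percent = (y / x) * 100.
const pct = (x: number, y: number) => (y / x) * 100;
console.log(pct(12, 6));       // 50 - 6 is half of 12; backwards would give 200
console.log(pct(10000, 2456)); // ~24.56 - easy to eyeball against the inputs
```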
Then again I've also seen sheets where a 10% discount column always had a larger number than the standard price so even obviously wrong things aren't always checked.
I don't disagree, but only whoever has never put math they didn't fully understand into their code gets to throw the first stone.
I've reached solutions by trial and error too, and tried to rationalize them later, quite a few times. And it's easier to rationalize a working solution, however adversarial you claim to be in your rationalization.
I don't see using gen AI for the (not so) “brute force” exploration of the solution space as that different from trial and error and post fact rationalization.
How did you test that the solution is correct? Is the set of possible inputs a low-ish finite number?
Normally with mathematical problems you have to prove the solution correct. Testing is not sufficient, unless you can test all possible inputs exhaustively.
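The exception: when the input space is small and finite, an exhaustive check is a proof. A sketch:

```typescript
// Exhaustive check over a small finite input space: if this loop passes,
// the property is proven for every possible input, not merely sampled.
function isEven(n: number): boolean {
  return (n & 1) === 0;
}

for (let n = 0; n <= 255; n++) {
  console.assert(isEven(n) === (n % 2 === 0), `mismatch at ${n}`);
}
```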
You do realize you can ask Claude about the things you don't understand?
"PhD level" just means you finished a bachelor and masters degree and are now doing a bit of original research as an employed research assistant.
Claude isn't "PhD level" anything. This shows a complete lack of understanding here. Claude has read every single text book in existence, so it can surface knowledge locked away in book chapters that people haven't read in years (nobody really reads those dense books on niche topics from start to finish).
Since Claude has infinite patience, you can just keep asking until you get it.
Man, people really overestimate training. Claude did not 'read' any of that either. I wish frontier models behaved like people who had read and remembered everything they've trained on, but they don't.
I had a similar approach, but in the end I don't think it's feasible to actually follow rule 2 sufficiently. It sounds good in theory, but in practice you'll always take some mental shortcuts that you may not even be aware of. Try digging into an unknown codebase to fix some issue, and compare how much will stay in your head a week later if you do it yourself versus if you "completely understand" what an agent did for you. When I do it myself, it contributes to my general knowledge and I mostly retain the important parts in my head even if I lose the details over time; when I try to own what an agent did as if it were mine, it feels like I understand it well at the time, after putting some effort into it, but then I forget it all very fast. Ultimately I decided that having an LLM help me there is actually detrimental to my goals most of the time, and that's without even considering some other concerns raised by sibling comments here, such as time and business pressures.
This is fine if it’s more enjoyable for you, that’s what’s important in personal projects most of the time.
But we don’t apply the same rules to dependencies, the work of colleagues, external services - all the layers down to the silicon.
Why is AI suddenly different?
We just have to do this by risk and reward. What’s the downside if it’s wrong, and how likely is an error to be found in testing and review? What is the benefit gained if it’s all fine? This is the same for libraries and external services.
A complex financial set of rules in a non-updatable crypto contract with no testing?
A viewer for your internal log data to visualise something?
It is and has always been immensely helpful to understand what you are doing in any context.
There are some programmers who treat the job as just plumbing together what is to them completely incomprehensible black boxes, who treat the computer as a mystery machine that just does things "somehow", but these programmers will almost always be hacks that spend their entire career producing mediocre code.
There are things such a programmer can build, but they are very limited by their lack of in depth understanding, and it is only a tiny fraction of what a more competent programmer can put together.
To get beyond being a hack, you need to understand the entire stack, including the code that you didn't write, including both libraries, frameworks and the OS, and including the hardware, the networking layers, and so forth. You don't have to be an expert at these things by any means, but you do need to understand them and be comfortable treating them as transparent boxes that you may have to go in and fiddle with at some point to get where you need to go. Sometimes you need to vendor a dependency and change it. Sometimes you need to drop it entirely and replace it with something more fit for purpose you built yourself.
> To get beyond being a hack, you need to understand the entire stack, including the code that you didn't write, including both libraries, frameworks and the OS, and including the hardware, the networking layers, and so forth.
I think maybe you overestimate your own knowledge here. It's one thing to understand general principles and design, or to understand a contextually-relevant vertical or whatever. It's another to demand comprehensive (even if not expert) familiarity with non-trivial projects, especially those created by many developers over long time spans. It's not just a question of intelligence or dedication or even just time spent working on a project.
The amount of software even your typical piece of code relies on is staggering and shifting, and it's only getting more complicated. A good chunk of software engineering and programming language research has been focused on making it practical to operate in such a complex environment - an environment that nobody fully understands - which is a major part of why modularity exists. Making software by "plumbing together [...] black boxes" is exactly what such research aspired to accomplish, because it allows different developers to focus on different scopes and on the domain they're working on. Software engineering is a practical field, and any system that requires full knowledge to operate, modify, and extend is either relatively small (maybe greenfield and written by a sole developer) or impractical to work with.
So I would say there's a wide gap between "lazy guy who doesn't give a shit" and "guy who thinks he can understand everything". Both lack the humility and wisdom needed to know the limits of their knowledge, to circumscribe what he needs to understand, and to operate within the space these afford. (Both extremes remind me of cocky junior devs. On the one hand, you have the junior dev who carelessly churns out "hot shit" garbage code by plumbing things together with no grasp or appreciation of sound design; on the other you have the dev who makes a big show about "rigor" completely detached from the actual realities and needs of the project. In each case, the dev is failing to engage intelligently with the subject matter.)
Well I mean I've built an internet search engine from scratch[1] and I'm making a living off this successfully enough to have completely left the wagie existence for the foreseeable future, so I think I at least kinda walk the talk.
I'm far from the best at anything and make no claims toward knowing everything, but I do think I have reasonable breadth in my experience and work, and I don't think I could have built something like this otherwise.
[1] ... which is something that does not decompose neatly into black boxes and must to a large degree be built from first principles, as goddamn nothing off the shelf scales well enough to deal with multi-terabyte workloads at even a fraction of the speed a bespoke solution can.
AI is different because it's a tool, and the user of the tool is responsible for the work performed.
An outsourced developer isn't a "tool". They're a human being, and responsible for their actions. They're being paid, and they either act responsibly or they get replaced.
A vibe coder is a human using a tool. The human is responsible for code quality, and if it's not good enough, they need to keep using the tool to make it better. That means understanding the tool's output.
If an artist used Photoshop to create a billboard ad that was ugly, they don't get to blame Photoshop. They have to keep using the tool until their output is good.
I don’t find “but who is to blame, ultimately?” all that useful.
So you figure out that someone you paid is at fault, instead of someone they hired. Your contract is with them so what really changes? What process or anything else is really different between it being a company with a manager who asks a team of devs and a company which asks an AI agent, to you as a customer?
Maybe it changes who gets fired or sued or whether one insurance or another pays out- but broadly I think none of what I said about project work really changes.
Product owners - hell, even customers - have long been able to get software without understanding all its details, or, in the customers' case, without ever seeing the code, purely driven by natural language.
I'd think that depends on the model of responsibility at play.
For example, suppose I hire a building contractor to build a house, and the electrician he subcontracts makes a mistake.
From my perspective, the prime contractor is equally responsible for that mistake regardless of whether he used a subcontractor, or did the work himself but used a broken tool.
This doesn't make the electrician any less of a "person" in the deeply important ways, but it's not a distinction that's relevant to my handling of the problem.
I’ve also heard it being called “comprehension debt,” which I like a little more because I think it’s more precise: the specific debt being accrued is exactly a lack of comprehension of the code.
Comprehension debt just sounds like there are things you don’t (yet) understand.
Cognition debt means your lack of understanding compounds and the cognition “space” required to clear it increases accordingly.
An increasing comprehension debt that can be paid off one bit at a time within reasonable cognition space takes linear time to clear.
Cognition debt takes exponential time to clear the more of it you have. If it reaches a point where you simply don’t have the space for the cognition overhead required to understand the problem, you probably need to start over from your specifications.
I like that too. However, “cognitive debt” points to the possibility of cognitive overload, that the code can become so complex and inscrutable that it may become impossible to comprehend. “Comprehension debt” sounds a bit weaker in that respect, that it’s just a matter of catching up with one’s comprehension.
This is great until the "gun to your head" is your skip-level manager demanding that a feature be implemented by the end of the week, and they know you can just "generate it with AI" so that timeline is actually realistic now whereas two years ago it would have required careful planning, testing, and execution.
Your manager is unknowingly helping you create a form of job security for yourself, with all the technical debt and bugs being accumulated.
He might not understand it, and it might not be the type of work you want to do, but someone is going to have to fix those issues. And the longer they wait, the bigger the task gets.
That isn't new, though. Managers often pushed unrealistic timelines and showed a lack of care about tech debt well before vibe coding; just the timelines were different, and the magnitude will be bigger this time. But we also have LLMs to help clean it up faster, I guess.
The question is: is it a job you actually still want once the poo pile reaches critical mass, you are the only one with a shovel, and the deadline is "yesterday"?
That is absolutely true. Unfortunately, this ship has sailed and we are not closing Pandora's box anymore. We'll have to adapt.
But we still hold good cards in hand.
Do they want their pile of steaming slop fixed, or not? Because no amount of complaints about the deadline being "yesterday" will change the fact that time is needed to fix the accrued technical debt, whether they like it or not. And if AI dug you in that deep to start with, the solution is not to dig deeper.
I suspect some companies are going to find that out the hard (costly) way.
If the manager is unreasonable, you were always going to have a problem with them eventually. Nothing you can do will fix this.
If the manager is reasonable, you can explain to them that there isn't time to check the AI's work, and that it frequently makes obscure mistakes that need to be properly checked, and that takes time.
At this point, if they still insist you just hand over the AI's work, they've made a decision that is their fault. You've done what you can.
And when the shit hits the fan, we're back to whether they're reasonable or not. If they are, you explained what could happen and it did. If they force responsibility on you, they aren't reasonable and were never going to listen to you. That time bomb was always going to go off.
Does that matter that much in practice? I bet lots of customers are okay with software that crashes 10x as much if it costs 10x less. There is already a ton of shitty software that still sells.
I just had a Claude episode. Instead of trying to fix the bug, it edited the data to hide the bug in the sample run. This kind of BS behavior is not rare. Absolutely, if you do not understand every bit of what's going on, you end up with a pile of BS.
I agree with this, though it also depends on the nature of the project.
Had a project idea which I coded with the help of AI, and it became quite large, to the point where I was starting to have uncharted areas in the code. Mostly because I reviewed it too shallowly or moved too fast.
That was fine, as that project never floated, but if I did such a thing on my breadwinning project I would lose the joy.
This is about how I use it. I initially use it to carve out an architecture and iterate through various options. That saves a lot of time for me having to iterate through different language features and approaches. Once I get that, I have it scaffold out, and I go in and tidy things up to my personal liking and standards. From there, I start iterating through implementations. I generally have been implementing stuff myself, but I've gotten better at scaffolding out functions/methods through code instead of text. Then I ask it to finish things off. That falls into your first category of letting it implement stuff that I already know I could do. Not sure if it's faster. But it's lower cognitive load for me, since I can start thinking about the next steps without being concerned about straightforward code.
This all works pretty great. Where it starts going off the rails is if I let it use a library I'm not >=90% comfortable with. That's a good use of these tools, but if I let it plow through feature requests, I end up accumulating debt, as you pointed out.
For my uses, I'm still finding the right balance. I'm not terribly sure it makes me faster. What I do think it helps with is longer focused sections because my cognitive load is being reduced. So I can get more done but not necessarily faster in the traditional sense. It's more that I can keep up momentum easier, which does deliver more over time.
I'm interested in multi agent systems, but I'm still not sure of the right orchestration pattern. These AI tools still can go off the rails real quick.
It's not worth fighting it at work. If the idiots you work for want everything vibe coded and delivered at 5 * 2025 speed then just vibe code and try to leave the company ASAP. That's where I am right now. Of course I might end up somewhere just as ridiculous or maybe not be able to even find another job. Shitty times we live in right now.
The swindle goes like this: AI on a good codebase can build a lot of features. You think it's faster; it even seems safer and more accurate at times, especially in domains you don't know everything about.
This goes on for a while as the codebase gets bigger, exploration takes longer, and the failure rate increases. You don't want it to be true and try harder, so you only stop after it has become practically impossible to make any changes.
You look at the code again, and there is so much of it that "spaghetti" is an understatement: it's the Great Wall of China.
You start working through it... and you realize what was going on.
I deleted 75,000 of 140,000 lines of code, and I honestly feel like the 3 months I went hard into agentic coding were wasted. I failed my users by building useless features, increasing bugs, losing the mental model of my code, and not finding the problems I didn't know about: the kind of hard decisions you only see when you're in the code, the stuff that wanders in your mind for days.
I find it interesting that this outcome is a surprise. I don't want this to sound smug, I'm genuinely curious what the initial expectations are and where they come from.
They seem to be different for LLMs, because would anyone be surprised if they handed summary feature descriptions to some random "developer" they'd only ever met online, and got back an absolute dung pile of half-broken implementation?
For some reason, people seem to expect miracles from some machine that they would not expect of other humans, especially not ones with a proven penchant for rambling hallucinations every once in a while.
I'd like to know, ideally from people who've been there, why they think that is. Where does the trust come from?
I've never relied on an LLM to build a large section of code but I can see why people might think it is worth a try. It is incredible for finding issues in the code that I write, arguably its best use-case. When I let it write a function on its own, it is often perfect and maybe even more concise and idiomatic than I would have been able to produce. It is natural to extrapolate and believe that whatever intelligence drives those results would also be able to handle much more.
It is surprising how bad it is at taking the lead given how effective it is with a much more limited prompt, particularly if you buy in to all the hype that it can take the place of human intelligence. It is capable of applying an incredible amount of knowledge while having virtually no real understanding of the problem.
LLMs do deliver "miracles" in certain cases. If you've experienced it and been blown away by their output (a one-shot functional app from a well-manufactured prompt, a new feature added flawlessly to a complicated existing codebase, etc.), it can be tempting to readjust your expectations and think this will work consistently and at a much larger scale.
They can assimilate hundreds of thousands of tokens of context in a few seconds or minutes and do exceptional pattern matching beyond what any human can do; that's a main factor in why it looks like "miracles" to us. When a model actually solves a long-standing issue that was never addressed due to a lack of funding/time/knowledge, it does feel miraculous, and when you are exposed to this a couple of times it's easy to give it more trust, just like you would trust someone who lent you a helping hand a couple of times more than a total stranger.
I suppose it's difficult to account for the inconsistency of something able to perform up to standard (and fast!) at one time, but then lose the plot in subtle or not-so-subtle ways the next.
We're wired to see and treat this machine as a human and therefore are tempted to trust it as if it were a human who demonstrated proficiency. Then we're surprised when the machine fails to behave like one.
I have to say, I'm still flabbergasted by the willingness to check out completely and not even keep on top of, or keep a mental model of, what gets produced. But the mind is easily tempted into laziness, I presume, especially when the fun part of thinking gets outsourced and only the less fun work of checking is left. At least, that's what makes the difference for me between coding and reviewing: one is considerably more interesting than the other, and the two are much less similar than they should be, given that both should require gaining a similar understanding of the code.
Probably same reason people expected outsourcing to the cheapest firm in India would work: wishful thinking. People wanted it to work and therefore deluded themselves.
Or really the same reason people fall for get rich quick schemes.
> You look at the code again, and there is so much of it that "spaghetti" is an understatement: it's the Great Wall of China.
I don't understand this. A large codebase should be a collection of small codebases, just like a large city is a collection of small cities. There is a map and you zoom into your local area and work within that scope. You don't need to know every detail of NYC to get a cup of coffee.
It's your responsibility to build a sane architecture that is maintainable. AI doesn't prevent you from doing that, and in fact it can help you do so if you hold the tool correctly.
To use your analogy - there's a big difference between the streets of New York and the streets of Boston. In New York if you know you're on 96th and 3rd, then you automatically know how to get to 101st and 5th - it's just a grid. But not every town is like that - many require you to possess knowledge about specific streets and specific landmarks in order to navigate anywhere.
To speak more directly - every codebase has local reasoning and global reasoning. When looking at a single piece of code that's well-isolated, you can fully understand its behavior "locally" without knowing anything about any other part of the code. But when a piece of code is tightly coupled to many other parts of the codebase, you have to reason globally - you have to understand the whole system to even understand what that one piece of code is doing, because it has tendrils touching the whole system. That's typically what we call spaghetti code.
If you leave an AI to its own devices, it will happily punch holes in, and create shortcuts through, your architecture to implement a specific feature, not caring about what that does to the comprehensibility of the system.
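To make the local-vs-global distinction concrete, here is a minimal TypeScript sketch (all names hypothetical): the first function can be verified entirely on its own, while the second can only be judged by knowing how the rest of the system treats the shared state.

    // Local reasoning: everything needed to verify this is on the screen.
    function applyDiscount(price: number, percent: number): number {
      return price * (1 - percent / 100);
    }

    // Global reasoning: correctness depends on who else mutates `cart`,
    // when `taxTable` is refreshed, and what `flags.promoActive` means.
    const taxTable: Record<string, number> = { EU: 0.2, US: 0.07 };
    const flags = { promoActive: true };
    const cart = { items: [{ price: 100 }, { price: 50 }], total: 0 };

    function recalcTotal(region: string): void {
      cart.total = cart.items.reduce((sum, item) => sum + item.price, 0);
      if (flags.promoActive) cart.total = applyDiscount(cart.total, 10);
      cart.total *= 1 + (taxTable[region] ?? 0);
    }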
> A large codebase should be a collection of small codebases, just like a large city is a collection of small cities.
Oh, great analogy there.
Just like there's almost nothing in common between a large city and a collection of small cities, a large codebase is completely different from a collection of small codebases too.
I don't think that's a useful mental model for software in general.
There is software that works like this (e.g. a website's unrelated pages and their logic), but in general composing simple functions can result in vastly disproportionate complexity. (The usual example is having a simple loop and a simple conditional, with which you can easily encode Goldbach or Collatz.)
E.g. you write a runtime with a garbage collector and a JIT compiler. What is your map? You can't really zoom in on the district for the GC, because on every other street there you have a portal opening to another street on the JIT district, which have portals to the ISS where you don't even have gravity.
And if you think this might be a contrived example and not everyone is writing JIT-ted runtimes, something like a banking app with special logging requirements (cross cutting concerns) sits somewhere between these two extremes.
The GC shouldn't care about all the code it is collecting. It collects garbage; it doesn't care if the garbage is an intermediate value from your tax calculations or the previous state image from your UI - either way it is garbage and it is gone. In a few cases the details of garbage collection matter enough that something more invasive is worthwhile, but the vast majority of code shouldn't care about the other areas.
On a tiny project it doesn't matter. However, when you have millions of lines of code, you have to trust that your code works in isolation without knowing the details.
I don't get your comment. The GC is part of the runtime; to it, user code is data. But the JIT compiler and other internal details of the runtime are its code, and there are very real cross-cutting concerns, like the JIT compiler's output having to take into account what memory representation the GC expects, where the barriers are, when each is run, etc. So I'm talking about a project such as the JVM.
> have millions of lines of code you have to trust that your code works in isolation without knowing the details.
More like hope. This is where good design and architecture helps, as well as strong invariants held up by the language. But given that most applications can't really escape global state (not even internally, let alone external state like the file system), you can never really know that your code will work the way you expect it to - that is, it's not trivially composable to any depth.
It's not that you can't "build a sane architecture" as much as it is difficult to justify the time spent to do this when you can "bang out features" in 10 minutes that would take days to do manually. It's about the economics of code generation. When inventing structure and typing it out as code takes time, thinking deeply about architecture first makes sense. There is another factor, as well: "thinking deeply about the architecture" involves experimentation. You might go down a particular path while coding, and then realize some limitations and/or new ideas, etc. You ultimately craft something that will work well and play well with future code, and which may be easily understood. If somebody stops by your desk and says, "you finish that <3 day feature you were just assigned 2 hours ago> yet?", that'll be the last time you think deeply about anything at work.
Rather than arguing about the specifics, it's easier to point to numerous concrete examples, such as a fairly simple system - which should be easy to implement in 8-15k lines of code, depending on certain choices (I've been writing code long enough to estimate this relatively accurately) - being still-incomplete while approaching 150k lines. These kinds of atrocities are usually economically infeasible in hand-written code, for 2 reasons: 1) the cost to produce that much code is very high, and 2) the cost of maintaining that much code is insurmountable.
I guess you could say that AI is great at generating code that only AI can understand and maintain.
I think this is true, but I imagine there's a workflow solution to this which isn't to drop AI.
E.g., treating AI-generated code as immediately legacy, with tight encapsulation boundaries, well-defined interfaces, etc., and integrating it in a more manual workflow.
There's a range from single-shot prompts to inline code generation, that will make more sense depending on the problem and where in the code base it is.
Single-shot stuff is going to make more sense for a prototyping phase with extensive spec iteration. Once that prototype is in place, you then probably want to drop down to per-module/per-file generation and be more systematic - always maintaining a reasonably good mental model at this layer.
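As a rough TypeScript sketch of the "immediately legacy, behind well-defined interfaces" idea (names hypothetical): the narrow contract is hand-written and reviewed, and the generated implementation behind it can be regenerated or replaced wholesale without touching callers.

    interface Invoice {
      id: string;
      amountCents: number;
    }

    // The boundary you own and review by hand: a narrow, stable contract.
    export interface InvoiceStore {
      save(invoice: Invoice): Promise<void>;
      findById(id: string): Promise<Invoice | null>;
    }

    // The AI-generated implementation is treated as legacy: verified
    // against the contract's tests rather than read line by line.
    class GeneratedInvoiceStore implements InvoiceStore {
      private rows = new Map<string, Invoice>();
      async save(invoice: Invoice): Promise<void> {
        this.rows.set(invoice.id, { ...invoice });
      }
      async findById(id: string): Promise<Invoice | null> {
        return this.rows.get(id) ?? null;
      }
    }

    export function makeInvoiceStore(): InvoiceStore {
      return new GeneratedInvoiceStore();
    }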
That workflow just sounds exhausting to me. Would I always need to consider how much of a blast radius my AI-generated code might have? Sounds like there’s so much extra management going into these micro decisions that it ultimately defeats the purpose of generating code altogether.
I could see value in using it during the prototyping phase, but wouldn’t like to work like you described for a serious project for end users.
And you have discovered the job of managers! There has always been a lot of hate for managers. Wonder if the robots hate us just as much? (I often feel a weird guilt when I tell an agent to do something I know I am going to throw away but will serve as an interesting exploration...I know if I did that to a human they would be pissed...)
IMO the hate has always been for clueless managers, especially clueless yet demanding managers. Managing an LLM for coding is different, try being clueless and demanding and see how far you get.
So you think "good" management translates? I actually think it very much does. Clear expectations, providing right context and "the why", quick and clear feedback loops, intervening early when they are going off track, not micromanaging too much so they can actually accomplish more. It's all very similar.
Yes it very much does but managing humans is still very different.
Understanding your domain, setting clear expectations and understanding limitations and how much ambiguity your people/robots can handle are all good management techniques, they translate.
But the nature of working with an always-on flattery machine vs humans who can exceed your expectations while also being sources of infinite drama and frustration is still fundamentally different. The blind spot is being susceptible to the flattery machine and forgetting how much you relied on good people challenging you. The benefit is, of course, not having to deal with humans.
I just don't like to type code anymore. If I can accomplish the same by describing the code, and get the same results as if I typed it myself, I'll opt for not typing so damn much. I've done so much typing in my career, that typing ~80% less to get the same results, makes a pretty big difference in how likely I am to set out to accomplish something.
I care more about code quality now, because typing no longer limits if I feel like it's worth to refactor something or not.
> treating AI code generated as immediately legacy, with tight encapsulation boundaries, well-defined interfaces etc.
This is good advice regardless whether you're using AI or not, yet in real life "let's have well-defined boundaries and interfaces" always loses against "let's keep having meetings for years and then ducttape whatever works once the situation gets urgent".
Were you auto-committing everything without reading the generated code? And if you read it but didn't understand it, why not just ask for detailed comments on each output? Knowing that a larger codebase causes it to struggle means the output needs to be increasingly scrutinized as it becomes more complex.
I don’t think it’s about what the code does. I think it’s more about how the code fits in its whole context. How useful it is in solving the overarching problem (of the whole software). How well does it follow the paradigm of the platform and the codebase.
You can have very good diffs and then find that the whole codebase is a collection of slightly disjointed parts.
I haven't had the chance to work on large codebases, but isn't it possible to somehow adapt the workflow of Working Effectively with Legacy Code, building islands of higher quality code, using the AI to help reconstruct developer intention and business rules and building seams and unit tests for the target modules?
AI doesn't necessarily have to increase your throughput; it can also serve as a flexible exploration and refactoring tool that supports either later hand-crafted code or agentic implementation.
Yep. I’m approaching the same problem from a different angle: writing code fast means you aren’t being thoughtful about the features you’re building. I started realizing that after I had kids and spent more time thinking about code than writing it and it really improved the quality of my work: https://bower.sh/thinking-slow-writing-fast
Obviously this is all my fault, and others might have better judgement when using it. I'm just sharing my experience compared to the promises you might easily believe reading X/HN/Anthropic.
I still have a lot of usage for AI: Exploration, Double-checking me, teaching me. But letting it write code became very tough for me to accept. Mainly next-edit autocompletes now.
> I still have a lot of usage for AI: Exploration, Double-checking me, teaching me.
I'm ready to give up on having it even review my code at this point. It's been so frustrating. It hallucinates bugs, especially in places where "best practices" are at odds with reality.
Recently it informed me of a bug where it suggested the line of code in question couldn't possibly do anything because on Linux the specific stdlib behaved in X ways, but it was obvious from the line of code that it was running on Windows which doesn't have this problem at all. Of course, it doesn't actually mention that this is an issue on Linux, just that there is a bug here. It vomits up a paragraph of $WORDS explaining why this was a high-priority bug that absolutely needed to be fixed because it was failing in subtle ways. Yet the line of code in question has been running in production, producing exactly the results it is expected to, for ~3 years.
And this is just one simple example, of the many dozens+ of times it has failed this task this year. In that same review run, the agent suggested 3 additional "bugs" or other issues that should be addressed that were all flatly wrong or subjective. I'm at a point of absolute exhaustion with this sort of shit. It's worse than a junior half of the time because of how strongly opinionated it is. And the solution to this sort of problem is an endless amount of configuration and customization that will be forgotten about by all of us over time, leading to who knows what sort of knock-on effects (especially as we migrate from one model to the next). We have a guy on our team who has ~17,000 words in his agent and instructions files, yet he sees nothing wrong with this. I guess he just really loves YAML and Markdown.
There comes a realization, to many engineers' horror, that AI won't be able to save them, and they will have to manually comprehend and possibly write a ton of code by hand to fix major issues, all while upper management is breathing down their necks, furious as to why the product has become a piece of shit and customers are leaving for competitors.
The engineers who sink further into denial thrash around with AI, hoping they are a few prompts or orchestrations away from everything being fixed again.
But the solution doesn’t come. They realize there is nothing they can do. It’s over.
I always find these kinds of posts interesting, to compare the velocity that people seem to get with AI vs what I get by just coding by hand.
Coincidentally, I've been working on a project for about 7 months now: it's a 3D MMO. Currently it's playable, and people are having fun with it - it has decent (but needs work) graphics, and you can easily cram a few hundred people into the server. The architecture is pretty nice, and it's easy to extend and add features onto. Overall, I'm very happy with the progress, and it's on track to launch after probably a year's worth of development.
In 7 months of vibe coding, OP failed to produce a basic TUI. Maybe the feature velocity feels high, but this seems unbelievably slow for building a basic piece of UI like this - it's the kind of thing you could knock out in a few weeks by hand. There are tonnes of TUI libraries that are high quality at this point, and all you need to do is populate some tables with whatever data you're looking for. It's surprising that it's taking so long.
There seems to be a strong bias where using AI feels like you're making a lot of progress very quickly, but compared to manual coding it often seems to be significantly slower in practice. This seems to be backed up by the available productivity data, where AI users feel faster but produce less
> There seems to be a strong bias where using AI feels like you're making a lot of progress very quickly, but compared to manual coding it often seems to be significantly slower in practice.
This metric highly depends on who uses the AI to do what, where strong emphasis is on "who" and "what".
In my line of work (software developer) the biggest time sinks are meetings where people need to align proposed solutions with the expectations of stakeholders. From that aspect AI won't help much, or at all, so measuring the difference of man hours spent from solution proposal to when it ends up in the test loops with and without AI would yield... very disappointing results.
But for troubleshooting and fixing bugs, or actually implementing solutions once they have been approved? For me, I'm at least 10x'ing myself compared to before I was using AI. Not only in pure time, but also in my ability to reason around observed behaviors and investigating what those observations mean when troubleshooting.
But I also work with people who simply cannot make the AI produce valuable (correct) results. I think if you know exactly what you want and how you want it, AI is a great help. You just tell it to do what you would have done anyway, and it does it quicker than you could. But if you don't know exactly what you want, AI will be outright harmful to your progress.
the venn diagram of people who say "AI produces useless results I have to 100% throw away" and people who say "I've never successfully delegated parts of large software development to junior programmers" is a circle
You nailed it; I came to the same conclusion recently. When people show me what they have done with LLMs, I am left unimpressed, as mostly they show things that can be done manually in a very short time. I have also failed to observe a rise in the availability of impressive software, which fits with the fact that LLMs are currently being used to solve simple problems instead of important ones.
This struck me as odd too. 7 months? It wouldn’t take that long to write it in a new language.
Another thing I don’t see mentioned is code quality.
Vibe-coded code bases are an excellent example of why LLMs aren't very good at writing code: they will often correct their own mistakes only to make them again immediately after, and their pattern use is inconsistent.
Recently Claude has been making some "interesting" code style choices, not in line with the code base it's currently supposed to be working on.
It seems to be baked into the GPT: produce text, i.e. producing language and code is its life and purpose. So the whole system is inherently biased towards "roll your own everything" unless spoken to in a "senior-dev" language that prevents these repetitions.
It's got a fun Zelda-inspired mechanic (I won't say which one), and you'll have to unlock abilities and parts of the world over several quests and modes to "win".
Unless there is a way to see or check the whole prompt that made this game, it is hard for me personally to say if this is impressive or not as I don't know what percentage of this was vibe coded. I also see there is an incentive to skew the truth here, as there is a substantial prize pool and that usually makes people become very creative.
In the past, I was trying to reproduce vibe coding results when I managed to get all the information from Youtube videos (model version, ide version, same input data and prompt) and never was I able to reproduce something impressive even after multiple runs of the same thing.
Turning didn't work for me. Impressive as hell indeed! The author is probably already busy 100xing other projects, no time to fix such a minor nuisance.
It's more complex than that. I think the reality is that there's a lot of code that's just not that deep, bro. I have some purely personal projects with components that I don't understand anymore; I wrote that shit by hand, they still work, but I haven't touched them in years. There's a lot of code like that which AI can write for me - the stuff I would forget about even if I wrote it by hand. I think you have to have discipline in its use; it's a tool like any other.
AI, and especially agentic AI, can make you lose situational awareness over a codebase, and when you're doing deep work that SUUUUCKS. But it's not useless; you just have to play to its strengths. Though my favorite hill to die on is telling people not to underestimate its value as autocomplete. Turns out 40 gigabytes of autocomplete makes for a fucking amazing autocomplete. Try it with llama.vim + qwen coder 30b; it feels like the editor is reading your mind sometimes, and the latency is so low.
> The other change is simpler: I'm doing the design work myself, by hand, before any code gets written. Not a vague doc. Concrete interfaces, message types, ownership rules.
That's the hard part of coding. If you have an architecture, then writing the code is dead simple. If you aren't writing the code, you aren't going to notice when you architected an API that allows nulls but your database doesn't. Or it does allow that, but you hit some other small issue you never accounted for.
I do not know how you can write this article and not realize the problem is the AI. Not that you let it architect, but that you weren't paying attention to every single thing it does. It's a glorified code generator. You need to be checking everything it does.
The hard part of software engineering was never writing code. Junior devs know how to write code. The hard part is everything else.
Yes, I think there are two kinds of developers: those who think the code is the hard part, and those who don't.
The developers who think coding is hard are the ones who absolutely love AI coding. It's changed their world because things they used to find hard are now easy.
Those who think coding is easy don't have such an easy time, because coding to them is all about the abstractions, the maintainability, and the extensibility. They want to lay sensible foundations to allow the software to scale. This is the hard part. When you discover the right abstractions everything becomes relatively easy, but getting there is the hard part. These people find AI coding a useful tool, but not the crazy, amazing, magical tool that the people who struggle with coding do.
The OP is definitely in the second camp since they could spot and realise the shortcomings of the AI. They spotted the problem, and that problem is that the AI can't do the hard bit.
I'd say there's another camp: the camp of people who know that code isn't the hard part, but that it's still time consuming to write code. AI coding is pretty useful for that, when you can nail the design but you just need a set of hands to implement it.
I'm classing that as the second camp. Because you don't find it hard to do, it's just time consuming. It means you still know what you're doing and you're just using AI as a tool to accelerate your delivery. That's the optimal way to use in my experience if you want to actually deliver well architected software.
But isn’t AI doing the same thing to project management as to coding?
PMs can now cross reference and organize tickets with just a few keystrokes. Organisational knowledge, business knowledge, design systems and patterns, etc all of it is encoded in LLM consumable artefacts. For PMs it is the same switch - instead of having to do it by hand you direct lower level employees to handle the details and inconsistencies and you just do vibe and vision.
When all of the pieces successfully connect and execute reliably, what is left for humans to do? Just direct and consume?
And AI companies with their huge swaths of data are soon gonna be in the situation of being able to do the directing themselves
Such a person is just pushing a giant pile of cleanup work onto their colleagues. Unless they actually checked, the "cross references" are probably wrong in places or just entirely made up. Lower level employees by definition don't have the experience to correct the more subtle inconsistencies, so you've basically just constructed a high pass filter that lets only the worst failures through. Moreover, you're absolutely guaranteed to lose the respect of those lower level employees--forcing someone else to clean up your sloppy work is just cruel, and people resent being treated cruelly.
This pretty much sums up what I feel about AI currently: it made my life significantly easier for most tasks I already breeze through, yet tasks I used to struggle with are still equally difficult.
What problems are they? I can't really think of any problems where writing the code was the hard part.
There's plenty of times where I don't know what code to write because I've never used a library before. But it's just a page of documentation away. It's not hard, it's just slow and tedious.
I agree with what you're saying, but I think we do have a problem right now with definitions where there's a lot of people basically getting supercharged tab completions or running a chatbot or two in a parallel pane, but still clearly reviewing everything; and on the other side of things is freaking Steve Yegge pitching a whole new editor that lets you orchestrate a dozen or more agents all vibing away on code you're apparently never going to read more than a line or two of: https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...
The first group are still thinking fairly deeply about design and interfaces and data structures, and are doing fairly heavy review in those areas. The second group are not, and those are the ones that I find a bit more worrisome.
> The first group are still thinking fairly deeply about design and interfaces and data structures, and are doing fairly heavy review in those areas.
I can't speak for others, but I'd go further and say that LLMs allow me to go deeper on the design side. I can survey alternative data structures, brainstorm conversationally, play design golf, work out a consistent domain taxonomy and from there function, data structure and field names, draft and redraft code, and then rewrite or edit the code myself when the AI cost/benefit trade off breaks down.
That’s a little bit of a No True Scotsman. Yes there are people who do not review anything; but even people who are reviewing every line from an LLM do not have the same understanding as someone who wrote it themselves.
I’m not making a judgement call about which is better, but it was widely accepted in tech before the advent of LLMs that you just fundamentally lack a sense of understanding as a reviewer vs an author. It was a meme that engineers would rather just rewrite a complicated feature than fix a bug, because understanding someone else’s code was too much effort.
That blog post is surreal. It's like cryptocurrencies and the whole web3 nonsense. Cryptocurrencies basically don't work, so there have been a hundred aimless attempts at fixing self inflicted problems caused by deficiencies of cryptocurrencies with no actual goal that has any impact on the real world.
It's the same thing here. AI has dropped the cost of software development, so developers are now fooling themselves into producing low or zero value software. Since the value of the software is zero or near zero, it doesn't really matter whether you get it right or not. This freedom from external constraints lets you crank up development velocity, which makes you feel super productive, while effectively accomplishing less than if you had to actually pay a meaningful cost to develop something.
Like, what is the purpose of Gas Town? It looks to me like the purpose of Gas Town is to build Gas Town.
> and on the other side of things is freaking Steve Yegge pitching a whole new editor that lets you orchestrate a dozen or more agents all vibing away on code you're apparently never going to read more than a line or two of
I find it useful to not listen to people who just talk.
> The first group are still thinking fairly deeply about design and interfaces and data structures, and are doing fairly heavy review in those areas
I worry about the first group too, because interfaces and data structures are the map, not the territory. When you create a glossary, it is to compose a message that transmits a specific idea. I invariably find that people who focus that much on the code often forget the main purpose of the program in favor of small features (the ticket). And that has accelerated with LLM tooling.
I believe most of us who are not so keen on AI tooling are always thinking about the program first, then the various parts, then the code. If you focus on a specific part, you make sure that you have well-defined contracts with the other parts that guarantee the correctness of the whole. If you need to change the contract, you change it with regard to the whole thing, not the specific part.
The issue with most LLM tools is that they're linear. They can follow patterns well, and agents can have feedback loops that correct them. But contracts are multi-dimensional forces that shape a solution, and that solution emerges more like a collapsing wave function than a linear prediction.
I've noticed that agents almost always fail at the planning vs. execution stage.
I follow the plan -> red/green/refactor approach and it is surprisingly good, and the plans it produces all look super well reasoned and grounded, because the agent will slurp all the docs and forums with discussions and the like.
Trouble is, once it starts working there will inevitably be a point where the docs and the implementation actually differ - either some combination of tools that has not been used in that way, some outdated docs, or just plain old bugs.
But if the goals of the project/feature are stated clearly enough, it is quite capable of iterating itself out of an architectural dead end - that is, if it can run and test itself locally.
It goes as deep as inspecting the code of dependencies and libraries and suggesting upstream fixes, etc. - all things that I would personally do in a deep debugging session.
And I'm super happy with that approach, as I'm more directing and supervising rather than doing the drudgery of it.
Trouble is, a lot of my teammates _don't_ actually go this deep when addressing architectural problems; their usual modus operandi is "escalate to the architect".
This will not end well for them in the long run, I feel, but I'm not sure what they can do about it themselves - the window of being able to run and understand everything seems to be rapidly closing.
Maybe that's not super bad - I don't know exactly what the compiler is doing to translate things to machine code, and I definitely don't get how the assembly itself is executed to produce the results I want at scale - that is a level of magic and wizardry I can only admire (the look-ahead branching strategies and caching on modern CPUs are super impressive - like, how is all of this even producing correct responses reliably at such a scale...).
Anyway - maybe all of this is OK. We will build new tools and frameworks to deal with it; human ingenuity and the desire for improvement, measured in likes, references or money, will still be there.
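To make the red/green step mentioned above concrete, here is a tiny sketch using Node's built-in test runner (the slugify function is a hypothetical stand-in): in the red step its body was just a throw and the tests failed; the green step filled in the minimal implementation.

    import test from "node:test";
    import assert from "node:assert/strict";

    // Green step: the minimal body that makes the tests below pass.
    // (Red step: this body was `throw new Error("todo")` and both failed.)
    function slugify(title: string): string {
      return title.trim().toLowerCase().replace(/\s+/g, "-");
    }

    test("slugify collapses whitespace into single dashes", () => {
      assert.equal(slugify("  Hello   World "), "hello-world");
    });

    test("slugify leaves already-clean titles alone", () => {
      assert.equal(slugify("hello-world"), "hello-world");
    });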
This is what seems to be lost on so many. As someone with relatively little code experience, I find myself learning more than ever by checking the results and what went right/wrong.
This is also why I don't see it getting better anytime soon. So many people ask me "how do you get your Claude to have such good output?" and the answer is always "I paid attention, spotted problems, and asked Claude to fix them." It's literally that simple, but I can see their eyes already glazing over.
Just as Google made finding information easier, it didn't fix the human element of separating quality information from poor information.
Reading code looking for errors is hard to do well over a large amount of code. A better approach is to ensure tests cover all the important cases and many edge cases. Looking at the code may still be a good idea, but mostly to check the design. Once you get Claude to test the code it writes well, trying to eyeball the code for errors is a waste of time. I've made the mistake of thinking Claude was wrong many times despite the tests passing, only to be humbled by breaking the tests with my "improvements"!
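One cheap way to apply that, sketched in TypeScript with Node's built-in runner (parsePort is hypothetical): a table of important and edge cases is far quicker to review than the implementation behind it, and it pins down behavior the model isn't allowed to drift from.

    import test from "node:test";
    import assert from "node:assert/strict";

    // Hypothetical function under test.
    function parsePort(s: string): number | null {
      const n = Number(s);
      return Number.isInteger(n) && n >= 1 && n <= 65535 ? n : null;
    }

    // The case table is the part a human actually reviews.
    const cases: Array<[string, number | null]> = [
      ["8080", 8080],   // happy path
      ["1", 1],         // lower bound
      ["65535", 65535], // upper bound
      ["0", null],      // below range
      ["65536", null],  // above range
      ["80.5", null],   // not an integer
      ["abc", null],    // not a number
    ];

    for (const [input, expected] of cases) {
      test(`parsePort(${JSON.stringify(input)})`, () => {
        assert.equal(parsePort(input), expected);
      });
    }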
This is the only way for me to use Agents without completely hating and failing at it. Think about the problem, design structures and APIs and only then let AI implement it.
And once you get familiar with the other parts, you realize that writing code is the most enjoyable one. More often than not, you're either balancing trade-offs or researching what factors you missed in the previous balancing. When you get to writing code, it's with a sigh of relief, as it means you understand the problem well enough to try a possible solution.
You can skip all that and go directly to writing code. But that means replacing a few hours of planning with a few weeks of coding.
Very well said. When I'm working on a hard problem I'll often spend a few weeks sweating details like algorithms, API shapes, wire formats, database schemas, etc. These things are all really easy to change while they're just in a design document. Once you start implementing, big sweeping edits get a lot more difficult. So better to frontload as much of that as possible in the design phase. AI coding agents don't change this dynamic. However all that frontloaded work pays off big when it does come time to implement, because the search space has been narrowed considerably.
> doing the __design work__ myself, by hand, before any code gets written.
So... Claude still is generating the code I guess?
And seriously, I can't understand how they thought their vibe-coded project worked fine and even bought a domain for it, without ever looking at the source code it generated, FOR 7 MONTHS??
> And the goal of the article is to draw attention to their project.
Additionally, they couldn't even bother to write their own blog post, so it's a little hard to take them seriously when they say they're going to write their own code...
> Claude (c) by Anthropic (R) is the best thing since sliced bread and I'm Lovin' It(tm)! Here's a breakdown of how you too can live a code-free life for 10 easy payments of $99.99 a month if you subscribe now!
> Step one in your journey to code free life: code the whole damn project and put it together yourself
It's so much fluff and baloney and every single article is identical. And every single one is just over the top praise of Claude that doesn't come off as remotely authentic. There's always mentions of Claude "one shotting"(tm) something.
I bought domains for projects minutes after the idea.
I don’t think it’s that weird to not look at the code if it’s a side project and you follow along incrementally via diffs. It’s definitely a different way of working but it’s not that crazy.
The article explicitly says that the author looked at the diffs; it distinguishes this from "sitting down and actually reading the code", which they didn't do. So when plastic041 says the author spent 7 months vibe coding "without ever looking at source code", it's not unreasonable for dewey to assume that "looking at source code", in this context, actually means something stronger and excludes just looking at the diffs.
I feel like I’m watching developers speed run project and product management learnings.
We've moved to seeing that specs are useful, and that having someone write lots of wrong code doesn't make the project move faster (devs often get annoyed at meetings and discussions because they hinder the code writing, but often those are there to stop everyone writing more of the wrong thing).
We’ve seen people find out that task management is useful.
Now more I’m seeing talk of fully doing the design work upfront. And we head towards waterfall style dev.
Then we'll see someone start naming the process of prototyping, then I'm sure something about incremental features where you have to manage old vs new requirements. Then talk of how the customer really needs to be involved more.
Genuinely, look at what projects and product managers do. They have been guiding projects where the product is code yet they are not expected to read the code and are required to use only natural language to achieve this.
So right. All these guys have never been managers. Do you think humans don't write things that break? Or that teams sometimes take a wrong path and burn a week of work? Or months? Well now you can experience all of that in 30 minutes of vibecoding. As a former tech product manager, it feels EXACTLY the same.
Yes, lots of the process and problems and solutions to them are the same but we’ve just massively cut the cost of a part of development. That has huge ramifications about when it makes sense to tackle different things and how tradeoffs work out.
Was it StrongDM talking about the dark factories? They were working on some integration software, so they needed to use Google Drive and Slack and lots of other things. They fully reimplemented those to the level they needed for their tests - outside of the biggest firms this would probably have been an enormous time and money sink. Now it's reasonable.
On a personal project, my wife and I wanted a tracker for holiday planning. Five minutes after a barely thought-through request we had a working prototype, fixed bugs in seconds, and then talked through with a model what we needed and how it did or didn't fit (and we needed that first version to figure that out). It helped drive out actual requirements from us, prioritise them, choose a stack, add tickets, and then went ahead and implemented it pretty far. We have a mostly working v2 which has highlighted some details about what we really wanted. Total invested cost was one day of a $20 subscription and maybe half an hour of talking to a bot and checking results.
> Vibe-coding makes you feel like you have infinite implementation budget. You don't. You have infinite LINE budget (the AI will generate as much code as you want). But you have the same finite complexity budget as always.
This is a special case of a general fundamental point I'm struggling with.
Let's assume AI has reduced the marginal cost of code to zero. So our supply of code is now infinite.
Meanwhile, other critical factors continue to be finite: time in a day, attention, interest, goodwill, paying customers, money, energy.
So how do you choose what to build?
Like a genie, the tools give us the power to ask for whatever we want. And like a genie, it turns out we often don't really know what we want.
Right - knowing what to actually build always has been and always will be the limiting factor to actual success. I could spend months and hundreds of dollars generating the absolute BEST todo list that's out there but nobody wants that.
I have vibe coded 3 applications I never had time to code but always wanted.
The difference is that now I don't have time to use those apps.
That's a joke.
But I do believe it answers the question of "what to build?". If you didn't have time for it before LLM-assisted coding, you still don't have time for it. You most likely already know what gets used and what doesn't, by heart or from some measurements.
So much of the problem here is that the author blindly trusted the agent. They're enthusiastic juniors, not jaded seniors.
Prompt for what you want. Get your feature working, then cut: reduce SLOC, refactor to remove duplication, update things to match existing patterns. You might do these instinctively, or maybe as-you-go, but that's just style. Having a dedicated pass works just as well.
The same thing goes for my code now that did when I wrote every line by hand: make it work, then make it good, then make it manageable. Manually that meant breaking things down into small blocks of individual diffs inside a PR (or splitting PRs), checking for repetitive code and refactoring, or even stashing what I got to and doing it again with the knowledge of how things went wrong.
Agents can do the same. It's WAY easier mentally and works out better if you treat them the same way and go working -> better -> done.
Personally, I've taken a serious step back from 'unsupervised' vibe-coding. When the codebase is clean and you want some additional fix or small feature, Claude is quite good at mimicking your style and does a pretty good job.
When you ask for a major new feature, despite hard guidelines and context (which eat half your context window), it quickly ships bloat. The foundations are not very well organized, and this is where you acknowledge it is all about random prediction of the next word.
Overall, I've wasted more time reviewing the PR and trying to steer it properly than I expected. So multi-layer agent vibe coding is no longer the way to go *for me*. Maybe with unlimited tokens and a better prompt; to be investigated...
And it can quickly start spiraling out of control. The bloated implementations keep adding more and more context it needs for the next change. Discovery results start getting worse, implementations get worse, and bloat continues to increase.
This reads too much like it was LLM generated. I can't say for sure if it was but I have an allergic reaction to the short snappy know-it-all LLM writing style.
I'm not very familiar with Go; however, after looking at the repo I can't help but notice there is no infra to ensure code quality. Do others see the same thing? Because if so, that is the real problem.
Yes, I agree that LLMs write terrible code when left to their own devices, but so do most engineers. That is why we have so many tools to help keep up a certain level of quality: duplication checks, tests, linters, other engineers.
I find that whenever you make an LLM repo without these checks, it writes like an enthusiastic junior engineer: wrong and strong. However, a junior engineer would be hard-pressed to get 95% coverage on a codebase; the AI is more than willing and does it in a few minutes. We can use things like this to our advantage. How many people have ever seen a repo with 100% test coverage? With AI this is very possible; with people, not so much.
LLMs write terrible code, we know this, but when dealing with humans who write terrible code we have many techniques. We should be using those same techniques to keep the LLMs honest, and more importantly, verifiable.
It's really very easy to spend a few hours going through a vibe-coded project by hand and having an agent fix the weird parts. If you do this often enough, you can get the best of both worlds.
Then you're right back on track.
In a way it's not that different from a human-made project. Plenty of teams have to crunch, ignoring the architecture and incurring tech debt, and then come back and fix it later.
So what you really mean is you are going to do better and more detailed skills files so you can get an architecture that you've thought through rather than something random?
Partly, but the order matters. The CLAUDE.md constraints only work if you designed the architecture first. They're just how you communicate it to the AI. The mistake I made wasn't writing bad skills files, it was not designing anything at all and expecting the AI to make coherent structural decisions across 30 sessions.
The rewrite is me sitting down with a blank doc and drawing the boxes before any code exists. Then the CLAUDE.md enforces what I already decided. Whether that actually holds up as the project grows, I genuinely don't know yet.
Are you really saving any time using AI then? If you have to write the architecture for it, write all the rules you want it to follow, check everything it's written, and then reprompt it because it's not how you want it?
Yes. I do all of this and I'd estimate 50-100% coding time savings. A lot of that comes from better multitasking over single-workstream throughput, which I suppose might compromise the gains depending on what you're doing. For me it amplifies the speedup by allowing some of my "coding time" to be spent on non-coding tasks too.
I could be wrong in some subtle way I'm not seeing, but I believe the model we're working in avoids the downsides. I actually think my review bar is slightly higher now, because I don't feel as much pressure to compromise my standards when I know Claude is capable of writing the code I want.
Can't you just ask AI to break up large files into smaller ones and also explain how the code works so you can understand it, instead of start over from scratch?
That was actually the first thing I tried. It did a good job of explaining the code base mess and the architecture. Then I ran 3-4 refactor attempts. Each one broke things in ways that were harder to debug than the original mess. The god object had so many implicit dependencies that pulling one thread unraveled something else. And each attempt burned through my daily Claude usage limit before the refactor was stable.
And I'm sure the rewrite is going to teach me a whole different set of lessons...
Not sure why good coverage wouldn't mitigate risk in a refactor...
My mantra whenever I'm working with AI is that I want it to know what "point b" looks like and be able to tell by itself whether it's gotten there...
If you have a working implementation, it sounds like you have a basis for automated tests to be written... once you have that (assuming that the tests are written to test the interface rather than the implementation), then it should be fairly direct to have an agent extract and decompose...
I'm currently working on the discovery phase of a larger refactor and have pretty quickly realized that AI can actually often be pretty useless even if you've encoded the rules in an unambiguous, programmatic way.
For example, consider a lint rule that bans Kysely queries on certain tables from existing outside of a specific folder. You'd write a rule like this in an effort to pull reads and writes on a certain domain into one place, hoping you can just hand the lint violations to your AI agent and it would split your queries into service calls as needed.
And at first, it will appear to have Just Worked™. You are feeling the AGI. Right up until you start to review the output carefully. Because there are now little discrepancies in the new queries written (like not distinguishing between calls to the primary vs. the replica, missing the point of a certain LIMIT or ORDER BY clause, failing to appropriately rewrite a condition or SELECT, etc.) You run a few more reviewer agent passes over it, but realize your efforts are entirely in vain... because even if the reviewer agent fixes 10 or 20 or 30 of the issues, you can still never fully trust the output.
As someone with experience in doing this kind of thing before AI, I went back to doing it the old way: using a codemod to rewrite the code automatically using a series of rules. AI can write the codemod, AI can help me evaluate the results, but actually having it apply all of the few hundred changes automatically led to a lack of my ability to trust the output. And I suspect that will continue to be true for some time.
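For what it's worth, the folder-restriction half of a rule like that can be approximated with stock ESLint flat config (folder names hypothetical, and this only gates imports of kysely, not individual query call sites - a much cruder check than the custom rule described above):

    // eslint.config.mjs
    export default [
      {
        files: ["src/**/*.ts"],
        ignores: ["src/data-access/**"], // the one folder allowed to query
        rules: {
          "no-restricted-imports": [
            "error",
            {
              paths: [
                {
                  name: "kysely",
                  message: "Query the database only via src/data-access services.",
                },
              ],
            },
          ],
        },
      },
    ];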
This industry needs a "verification layer" that, as far as I know, it does not have yet. Some part of me hopes that someone will reply to this comment with a counterexample, because I could sorely use one.
When people talk about codebases being "incomprehensible", it's not always hyperbole. Sometimes the architecture literally cannot be broken up or understood.
When you see some legacy C++ codebase with millions of lines of code, catching cancer and slowly dying from it is more human than trying to unscrew that mess.
A really screwed code base blows out your context window and just starts burning tokens as the AI works out a way to kill -9 itself to escape the hell you're subjecting it to.
While I mostly agree, science is built up on truths. Code has a large amount of creativity and freedom built into its decisions: some codebases will be well documented, built with rigorous training and deliberate design decisions; others will just be an absolute legacy mess of 20 years of odd decisions made by people who may not have known what they were doing. Like an art piece that you don't really "understand".
Most of the issues are around code hygiene rather than LLM code just being bad. You're creating code 10x faster, but you're also writing unit tests 10x faster - and not just that, but integration tests, CI/CD workflows, prod monitoring, product and engineering documentation, etc. That was already the way to get good code quality before; nowadays I think it's just reckless to generate code that isn't backed by 100% test coverage and doesn't pass all the configured lints and static checks.
This is it: people are acting like bad code wasn't written before. My wife and I were full-on laughing in bed the other night about all the absolutely horrible code we've seen written, and how people actually think LLMs are worse than that.
The quality gates are up to you, and if you are smart you will make a lot of them and review them closely.
I like to explain my opposition to vibe coding by replacing the phrase "write code for you" to "fuck your wife for you". You could make all the same arguments that the AI could do a better a job, its never impotent, it frees you from being pressured to do it when you might be tired or not in the mood etc. But thats not the point and most people would still be opposed to sort of, err, "vibe vibrating".
I feel the same way about coding: it's a source of pride for me, and when I hear people say I should resign myself to being an "ideas guy" while ChatGPT actually creates things, I find the very concept distasteful regardless of whether or not it can outperform me.
I'm thoroughly enjoying using AI to write code, but it's paid off because of years of doing things the hard way first. I was already a so-called "10x developer", if I say so myself. I'm doing things even faster now with AI.
I ran into the same issues quite early with my Rust pet projects: single structs with tons of Option<T> and validation methods, enums for type fields combined with optional fields in the same layer, so accessor methods all return Option<T>.
I now add a long list of instructions on how to work with the type system, plus some do's and don'ts. I don't see myself as a vibe coder. I actually read the damn code and instruct the AI up to my level of taste.
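To make the do's and don'ts concrete, here's a minimal before/after sketch of the Option<T> issue (the connection domain is invented purely for illustration):

```rust
// The shape the agent tends to produce: every field optional, validity
// enforced only by runtime checks scattered across accessor methods.
#[allow(dead_code)]
struct ConnectionLoose {
    state: String,                // "connecting" | "open" | "closed"
    handshake_token: Option<u64>, // only meaningful while connecting
    session_id: Option<u64>,      // only meaningful once open
}

// The shape the instructions push for: each state carries exactly the
// data that is valid in it, so accessors no longer return Option<T>.
enum Connection {
    Connecting { handshake_token: u64 },
    Open { session_id: u64 },
    Closed,
}

fn describe(conn: &Connection) -> String {
    // The compiler forces every state to be handled; no unwraps needed.
    match conn {
        Connection::Connecting { handshake_token } => format!("handshaking ({handshake_token})"),
        Connection::Open { session_id } => format!("session {session_id}"),
        Connection::Closed => "closed".to_string(),
    }
}

fn main() {
    let states = [
        Connection::Connecting { handshake_token: 7 },
        Connection::Open { session_id: 42 },
        Connection::Closed,
    ];
    for s in &states {
        println!("{}", describe(s));
    }
}
```

The second shape makes invalid combinations unrepresentable, which is exactly the kind of constraint the model won't reach for on its own.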
Would you be interested in sharing your findings? I'm currently experimenting with LLM-generated rust and honestly think it works quite well, however I'm looking for ways to improve the "taste" of the agent.
We're still in the early ages and must discern hard what AI is good for, what it can maybe do, what it could potentially do and what it just can't do, and move those threshold marks very conservatively. AI is also cheap enough that it's worth shots of experiments. As long as you don't really rely on AI it's easy to test the capabilities of this new conversational autocomplete, and the random gains it offers can be magnificent (except when they aren't, of course).
What has generally worked for me is carrying the old adage "write the data structures and the code will follow" over to AI. Design your data, consider the design immutable, and let the AI try to fill in the necessary code (well, with some guidance). If it finds the data structures aren't enough, have it prompt you instead of making changes on its own. AI can do a lot of the low-hanging fruit, and often the harder ones as well, as long as it's bound to something.
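A minimal sketch of what that looks like in practice (the invoice domain and every name here are invented for illustration): the types are written by hand and frozen, and the function body is the kind of thing the AI fills in against them.

```rust
/// Hand-designed and treated as immutable once settled.
#[derive(Debug, Clone)]
struct Invoice {
    id: u64,
    lines: Vec<LineItem>,
}

#[derive(Debug, Clone)]
struct LineItem {
    description: String,
    unit_cents: i64,
    quantity: u32,
}

/// The kind of body the AI fills in. If the data above turns out to be
/// insufficient for a feature, it should ask rather than alter the structs.
fn total_cents(invoice: &Invoice) -> i64 {
    invoice
        .lines
        .iter()
        .map(|l| l.unit_cents * i64::from(l.quantity))
        .sum()
}

fn main() {
    let inv = Invoice {
        id: 1,
        lines: vec![LineItem {
            description: "widget".into(),
            unit_cents: 250,
            quantity: 4,
        }],
    };
    assert_eq!(inv.id, 1);
    assert_eq!(inv.lines[0].description, "widget");
    assert_eq!(total_cents(&inv), 1000);
}
```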
Yet, for now, AI at best has been something that relieves me from having to write long stretches of boring code: it's not sustainable to keep developing relying on AI alone. It's also great when quality is not an issue; for any serious work, AI has not sped me up noticeably. I still need to think through the hard parts, and whatever I gain in generating code I lose in managing the agents. But I can parallelise code generation, try new approaches, and explore more, because AI is cheap. AI is also pretty good at going through the codebase and reasoning about dependencies, whether in the context of adding a new feature or fixing a bug: I often let AI create a proof-of-concept change, then extract the important bits from it, usually trimming the diffs down to a third or less.
AI further helps with non-work, i.e. tasks you have to do to fulfill external demands and requirements rather than to create anything solid and new. I can imagine AI creating various reports and summaries and documentation, perhaps mostly to be consumed and condensed by another AI at the receiving end. Sadly, this is mostly work not worth doing anyway.
Overall, I cringe under all the hype that's been laid on AI: it's a new tool that's still looking for its box or niche carveout, not a revolution.
I don't think we're in the early ages... LLM technology has essentially stagnated since GPT-3.5; we just have bigger models that can handle more context. We're trying to compensate for the lack of progress in the underlying technology by coming up with contraptions of multiple models stuck together: Mixture-of-Experts, reviewer models, PM models...
I see this in Claude too, but I also see it in junior engineers. With Claude, I simply ask it to refactor immediately after each feature is done. The human is still responsible for what the AI writes, so if the AI writes code that's gross, I would never push it, lest it sully my name and my reputation for code quality.
I don't think the prompts that the author has proposed will actually work. Including final scope and non-scope is good, but it's more of a reaction to what the AI already did. These prompts are suitable for a rewrite, basically, since it's unlikely anyone would have had them ready when starting out.
I have found small iterations to have the best results. I'm not giving AI any chance to one shot it. For example, I won't tell it to "create a fleet view" but something more like "extract key binding to a service" so that I can reuse it in another view before adding another view. Basically, talk to the AI as an engineer talking to another engineer at the nitty gritty level that we need to deal with everyday, not a product person wishing for a business selling point to magically happen.
I don't understand the people who "get the agent to do everything" for them. It just makes a mess if you do that. Yet if I spend a little bit of time setting a project up properly (including telling my minions exactly what to do) I can then get it to do the boring things for me.
The very worst things you can do in a codebase are (a) not deeply understand how it works (have it be magic) and (b) be lazy and mess up the structure.
How do you fix a problem which happens at 2:00am and takes your system down if you don't have an excellent understanding of how it works?
We're already bad at (a) because most developers hate writing documentation, so that knowledge is invariably lost over time.
> I'm rewriting k10s in Rust. Not because Rust is better but, because it's the language I can steer. I've written enough of it to feel when something's wrong before I can articulate why. That instinct is the one thing vibe-coding can't replace. The AI hands you plausible-looking code. You need a nose for when it's garbage.
Isn't Golang relatively easier to read than Rust? I was under the impression that Rust is a more complex language syntactically.
> The other change is simpler: I'm doing the design work myself, by hand, before any code gets written. Not a vague doc. Concrete interfaces, message types, ownership rules. The architecture decisions that the AI kept making wrong are now made in writing before the first prompt.
This post is good for grasping the difference between "vibe-coding" and using the AI to support design and architectural choices made by a competent programmer (I am not saying you are not one). Lately I feel that Opus 4.7 involves the user a lot more, even when given a prompt to one-shot a particular piece of software.
Go reads fine whether the architecture is good or bad, and I couldn't tell the difference until I was in trouble. Rust is harder to read but harder to misuse. The borrow checker would have caught that data race at compile time. I've also just written more Rust. That familiarity matters separately.
+1 on Opus 4.7 involving the user a lot more. Right now I'm trying to get to a state where I can codify my design and decision preferences as agent personas and push myself out of the dev loop.
Buddy, that k10s code was never good. Go vs. Rust is not the issue here; it's the fact that the project was vibe-coded without reading anything. It's hilarious to even think that a god model was caused by anything other than someone who let the bot choose too much.
Good architecture in any language is obvious to someone who is experienced and cares.
Go is actually great for bots to write if you’re actually thinking.
Right, thank you. Personally I think reading all the code that the AI produces is impossible and kind of defeats the purpose of using it. The key is to devise a structured way to interact with it (skills and similar) and use extensive testing along the way to verify the work at all steps.
> Isn't Golang relatively easier to read than Rust? I was under the impression that Rust is a more complex language syntactically
It sounds like the author knows Rust, and might not be as familiar with Go.
A language you are proficient in is always going to be easier to read than one you aren't, even if the unfamiliar one is objectively an easier language to read in general.
In a world where juniors (or seniors in new territories) are incentivized to publish or perish, how will any of us gain proficiency any more? I can see the agent assisted journey accelerating some familiarity, but not proficiency.
I’ve used AI tools to do i18n translations to Spanish and Portuguese (somewhat ashamed to admit this). I’ve grown more familiar with the structure of these languages, and come to recognize some of the common vocabulary for our agtech domain. If anything, I feel more clueless about both languages now than I did before, when it comes to any sort of proficiency.
I don't bother trying to give the LLM a set of dos and don'ts for how to write the code; that becomes a frustrating game of whack-a-mole. I find it a lot more efficient to have it write some code, look it over, and if I'm not happy with some of the decisions, give it specific instructions for how to fix that one part. As a bonus, I end up reinforcing my knowledge of the code base in the process.
I'm currently exploring whether I should split a project into a framework part and the game itself (a 2D idle game).
The framework could be an isolation layer against vibe-coded rot, but I'm not sure it's necessary for my small project, something I've always wanted to do and never gotten anywhere with.
For another tool, I will try a different approach: start with a deep investigation and a spec written together with the AI, then lay out the core architecture, then add features.
So instead of just prompting "write a golang project with a http server serving xy, and these top 3 features", I will prompt "create a basic golang scaffold for build and test" -> "create a basic http server with a basic library doing xy" -> "define api spec" -> "write feature x".
There is a kind of skill and depth to vibe coding, though.
This is Claude's problem.
Compared to GPT-5.5, Claude Code prefers to take shortcuts. I've tested having GPT-5.5 (in the Codex app) and Opus 4.7 (in Claude Code) do the same thing - if made to follow GPT-5.5's requirements, Claude Code's execution time for a task would stretch from 5 minutes to 40 minutes.
To solve macro architecture problems, I use Lisp to write the entire program's framework. Lisp replaces architecture documents, because I believe it has high semantic density, syntax restrictions, and checkers for assistance.
This way, at least, I didn't have to rework anything anymore. I've used this method to refactor my 20+ projects.
A problem often ignored is that while AI is trained on human-written code, the way it writes in practice is different.
Will that improve or get worse? One could argue that LLMs in general are drastically more competent now than they were a couple of years ago, and they're also much better at coding. We're likely just now entering the era where they can code but are still not what you'd fully expect - or at least not something someone with absolutely no coding knowledge could use to code at the same level as someone who does.
Maybe that changes as the models improve, maybe it doesn’t, only time will tell.
Feels like it can be solved with even more AI: adversarial models reviewing and testing work performed by the main model.
Actually, I am curious to try something like that myself. Is there an existing orchestrating engine (or single agent) which can spawn multiple subagents and keep passing their feedback/output between each other until all of them agree that the assignment is complete?
I am aware of layoffs that are really caused by AI.
Translation and asset generation teams for enterprise CMS, whose role has now been taken by AI.
Likewise traditional backend development, which was already reduced by SaaS products, serverless, and iPaaS low-code/no-code tooling, and is now further reduced by agent workflow tooling doing orchestration via tools (serverless endpoints).
> I typed :rs pods to switch back to the pods view. Nothing rendered. The table was empty...
> now something was fundamentally broken and I couldn't just prompt my way out of it.
Hey, I don't want to oversimplify - I'm sure it was complicated - but did the author have functional tests for these broken views? As long as there are functional tests passing on the previous commit, I'd have thought Claude could look at the end state and work out how to deliver the desired feature without breaking the other stuff.
TUIs aren't an exception, it's still essential to have a way to end-to-end test each view.
The problem wasn't the view didn't work. The problem was the view didn't work after something else had been done.
You can't test every permutation of app usage. You actually need good architecture so you can trust your tests and trust changes to be local, with minimal side effects.
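This is also where an Elm-style core pays off: the update function can be exercised message-by-message, with no terminal involved. A minimal sketch (all types invented, not the article's actual code) of the local test that would have caught the empty pods view:

```rust
// Minimal Elm-style model/update pair. With this shape, each view's
// behaviour is testable by feeding it messages; no rendering required.
#[derive(Debug)]
enum Msg {
    RowsLoaded(Vec<String>),
    SwitchView,
}

#[derive(Debug, Default)]
struct PodsView {
    rows: Vec<String>,
}

fn update(view: &mut PodsView, msg: Msg) {
    match msg {
        Msg::RowsLoaded(rows) => view.rows = rows,
        Msg::SwitchView => {} // switching away must not clear this view's state
    }
}

fn main() {
    let mut view = PodsView::default();
    update(&mut view, Msg::RowsLoaded(vec!["pod-a".into()]));
    update(&mut view, Msg::SwitchView);
    println!("{} row(s)", view.rows.len());
}

#[cfg(test)]
mod tests {
    use super::*;

    // Pins down exactly the regression described upthread: view state
    // must survive switching away and back.
    #[test]
    fn rows_survive_a_view_switch() {
        let mut view = PodsView::default();
        update(&mut view, Msg::RowsLoaded(vec!["pod-a".into()]));
        update(&mut view, Msg::SwitchView);
        assert_eq!(view.rows, vec!["pod-a".to_string()]);
    }
}
```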
I'm moving very slowly into AI coding. I'm not comfortable enough to let Claude do anything big. What I do is this: I set out the general architecture, create function stubs, and add comments on how to implement things. Then I let Claude do 10 minutes of work, and I check everything and refactor some of it. It saves me the boring implementation stuff (like: is this an array, move an index here or there, check whether something exists or not, put it in the db).
The ship has sailed on my hand-coding at work. The AI is producing stuff that's more bulletproof than what I can do in the same timeframe, and if my competitors are using it, the pressure to ship is that much higher.
Personally, I've taken the time it's freed up to spend more time on mathacademy and reading more theory-oriented books on data structures and algorithms. AI coding systems are at their best when paired with someone with broad knowledge. Knowing what to ask for, and knowing the vocabulary to be specific about what you want built, is going to be a much more valuable job skill going forward.
One example is a small AI-based learning system I have been developing in my free time to help me learn. The MVP stored an entire knowledge graph and progress in markdown files. Being an engineer, I knew this wouldn't scale, so once I proved the concept viable, I moved everything into SQLite with a graph DB. Then I decided to wrap parts of the functionality in Rust and put everything behind a small Rust layer, with the progress-tracking logic still in Python.
Someone with no knowledge of graph databases or dependency graphs or heuristics would not be able to build this even if they had AI. They simply don't know what they don't know, and AI won't save you there.
That said, I think it's important to also spend time in the dirt. I've recently started picking up Zig as my NO-AI language just to keep those skills sharp.
I've been relying primarily on deepseek-v4-flash for 90% of my work. It sips tokens. That model will run on 128 GB: not a cheap configuration for a consumer, but within the budget of a developer relying on it for work.
I've only been using Kimi 2.5 and DeepSeek Pro for reviewing PRs for security issues. Less than 10% of my workflow requires a full-powered frontier model.
I think the issue is overblown by people who think Claude Code is a good harness and use Opus for everything. opencode is objectively better: it's much more verbose about what it's doing, you have more control when it comes to offloading to subagents with targeted context (crucial for running through larger jobs), and I can swap between Codex and open-weight models.
However everything in this field is cargo culting. We have absolutely no way to quantify productivity in the real world.
We've had advanced programming languages backed by advanced programming language theory for decades, and the most used programming languages in the world are C, PHP and JavaScript: languages held together by duct tape, or in the case of C, programming language theory from the 60s.
We have a super-minimal JavaScript runtime in the browser to avoid a bloated standard library, and then people invent things like left-pad. At the same time, basically every major website in the world serves megabytes of tracking and ad-serving libraries.
We all "know" AI makes coders more productive but nobody can do the equivalent of a clinical trial for a major new drug.
I think the answer here is to not use Claude with Bubble Tea. I tried the same thing and got the same result. But it seems to be limited to that specific framework, because Claude is really good at not doing the same thing with SolidJS.
While I felt this in 2025, I do not feel this in 2026. I use Claude and the rest with BubbleTea all the time.
But I will say... you have to know Golang. You have to have at least tried to make a BubbleTea app yourself and tried to understand the Elm architecture. You have to look at the code and iterate on it.
It makes total sense for OP to switch to Rust and Ratatui if they don't know Golang well. But I don't think it's a better language for it. [Ratatui has brought me great inspiration though!]
Independent of framework, the LLMs get the spatial relationships. I say things like "the upper right panel's content is not wrapping inside and the panel's right edge should extend to the terminal edge" and the LLM will fix it. They can see the resultant text; I'm copy-pasting all the time.
TUI code is finicky; one mis-rendered component mucks everything up. When things aren't right, the LLMs will decide on their own to make little temporary BubbleTea fixtures to help themselves understand.
The only real problem with LLMs and BubbleTea is that on the first prompt, they insist on using BubbleTea v1 versus BubbleTea v2, released in December 2025. But then you just point them to the V2_UPGRADE.md and they get back on track. That will improve as training cutoffs expand.
I vibe-coded this TUI for Mom's last night. I actually started with Grok (which started with v1) and then moved into Claude Code after some iteration.
I'm not sure we'll ever really be free of the GIGO (garbage in / garbage out) principle. Tools will get better and better, but can never be a substitute for a deep understanding of the thing we want to create.
AI writes what you ask it to write; you need to talk to it about architecture. You should have an architecture doc so the AI can shape the code based on it, and you can get the AI to make the architecture doc too. If using Claude, you can use the software architecture mode for this.
A coder typing in code is not solely generating output; the typing is part of an ongoing thinking process. Without that process, we have no material to keep iterating forward.
What has really kept AI coding working as the project got bigger was using speckit. It has been great at keeping the code consistent across features.
Genuinely curious if you've used "plan mode" (with perhaps a plan feedback tool) to get clarity from your coding agent before unleashing it on a feature like "add a pods view with live updates"?
Getting a plan isn't a panacea but is a better way to limit downstream slop than just vibing without one.
I'm just wondering: you know what architecture you want to go to now and you have the tests... can't you just let Claude refactor it to the better architecture?
Also 1600 lines... didn't any agent reviewing the diffs point that out?
You're also adding a lot to claude.md. I dunno how much that file has grown, but with a big claude.md full of instructions, I don't think the AI will be able to remember all those rules.
Vibe coding works great with test-driven development. You can have the AI write the tests as well, but you need to confirm them yourself because it's lying all the time. AI coding is like when you first started out: copying random bits and pieces from the web into your code until it works... Good for one-shots and proofs of concept. But for any long-lived project, I think you are better off rewriting it from scratch yourself. Abstractions let you work faster, especially when you have it all in your head.
I try to write one portable shell script per day; using AI would take all the fun out of it, so I never started using it. I honestly find it ridiculous that anyone uses it to write code; it just doesn't make sense to me.
LLMs assist those of us who were apt to take blocks of code from StackOverflow, or wherever, to solve problems quickly and avoid as much of the aggravating and slow toil of trial and error as possible.
That trial and error process is still happening with a LLM, but much faster, and with instantaneous cross-references to various forms of documentation that I would be looking up myself otherwise. It produces code of a quality that is dependent on the engineer knowing what they want in the first place and prompting for it and refining its output correctly.
It's the exact same process of sculpting code that the majority of the industry was doing "by hand" prior to the release of LLMs, but faster, and the harnesses are only getting better. To "vibe code" is to prompt vaguely and ignore the quality of the output. You're coming to a forum full of professionals and essentially telling us that you're getting really frustrated with your Scratch project.
I don't know if you're trying to lead a charge or whatever, but good luck with that. As a senior SWE, it is clear to me that this is the new paradigm until something better than LLMs comes along. My workflows and efficiency have been vastly improved. I will admit that I have never really been an "I made an SMTP server in 3k of Rust" kind of guy, though.
You don't need to go back to coding by hand if you know how to do it already. There is a middle ground.
If you understand good software architecture, architect it. Create a markdown document just as you would if you had a team of engineers working with you and would hand off to them. Be specific.
Let the AI do the implementation of your architecture.
Greed really comes into play when using LLMs to write code: it's so easy to say YES when a cool feature that two years ago would have taken a week now takes a day, or even one prompt. The "say no" skill that Steve Jobs said was important is going to be needed on a minute-by-minute basis.
It's pretty simple to vibe code for months without producing slop. And it's the same recipe one used before AI:
1. make it work
2. make it pretty
3. make it fast
Omit 2. and 3. long enough -> slop beyond recovery
This. I definitely agree with this statement at this point in AI-assisted development. This gets at the "taste" factor that is still intrinsically human, especially in software engineering. If you can construct and guide the overall architecture of an application or system, AI can conceivably fill in the smaller feature bits, and do so well. But it must have a strong architecture and opinionated field in which to play.
My main takeaway, too. Been using Claude on my side project that I have single-handedly been working on for three years. It works well initially; you catch all of the AI's mistakes and unfavorable approaches because you know the architecture in and out. But as you stop thinking through the new features and start losing touch with all the stuff AI throws at you, you fail to develop the intuitive feeling for when and how to abstract and introduce architecture.
Another note for me was e2e tests: while AI can write them, it never comes up with even the basic organization or abstraction required to manage a large e2e test suite with hundreds of tests. It immediately starts to produce spaghetti code.
The problem with this dev's approach is not AI, it's their use of it. They didn't ensure that the architecture made sense. They didn't look at the code and get a "feel" for it. They didn't do the whole build stuff, step back, refactor, rinse and repeat dance. The need for that hasn't gone away; if anything, it's even more important now. Because you can spit out code 100x faster than you could before, your tech debt compounds 100x faster. The earlier you refactor, the less work it is.
I usually give the agent a solid idea of what I want, often down to the API interfaces. Then every now and then, I'll go through the code and ensure that everything makes sense, and that I'm not just spitting out code that works, but building a codebase that scales.
Let me preface my comment by saying I also still write a lot of code by hand - especially when it's something I know I need to understand in depth, and in some cases defend.
With that said, this caught my eye:
> AI gravitates toward single-struct-holds-everything because it satisfies the immediate prompt with minimal ceremony.
This is too general. "AI" is used here as a catch-all, but in fact it was the specific model under the specific conditions in which you ran your prompt, including harness, markdown files, PRDs, etc. So it's not fair to say "AI does X!" in this case.
It's also very much up to you. It's very common to have a frontier model plan an architecture before you have another model implement code. If you're just one-shotting an LLM to do everything you get mediocre, more brittle code.
This stuff is still being figured out by a lot of people. But I feel the core of the issue is not using AI well. Scoping, task alignment, validation, are crucial.
The title is just flat out wrong. The author isn't going back to writing code by hand, they're plopping some new stuff into their CLAUDE.md to "fix" the issues they see AI is having.
Another behavior I noticed is that even if you plan with an agent, a lot of business logic leaks into the code.
Some states, for example, are meant to be inferred from the data shape rather than stored in explicit state fields, but damn do they like adding a state field.
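A tiny invented example of the difference:

```rust
// What the agent likes to write: a flag that duplicates the data shape
// and has to be kept in sync by every code path that touches `items`.
#[allow(dead_code)]
struct CartWithFlag {
    items: Vec<String>,
    is_empty: bool,
}

// The state inferred from the data shape instead: nothing to keep in sync.
struct Cart {
    items: Vec<String>,
}

impl Cart {
    fn is_empty(&self) -> bool {
        self.items.is_empty()
    }
}

fn main() {
    let cart = Cart { items: vec![] };
    assert!(cart.is_empty());
}
```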
I don't really think OP is writing code themselves, since they admit they still use agents for code gen. I've really scaled back how much I use agents, though, because in the medium to long term I haven't been getting good results with them. And it's not enjoyable. That's enough for me; I'll do whatever for a job, because who cares - if the company wants slop, I will gladly give them that - but for my own shit I've gone back to circa 2024 and am mostly just using them as a chatbot.
Inb4 “you’re gonna be replaced” god damn it I hope so, I do not want to spend the rest of my life behind a computer screen…
But in my main work, reverse engineering, LLMs have been a godsend for years now.
You can basically brute-force binary obfuscation thanks to them. And thanks to eager Chinese LLM providers, basically for free.
But I always use LLMs only for the boring work; the rest is for me to do manually, or with scripts of course, but made by me. Because I want to learn.
Yes, there are a lot of people using LLMs for full RE automation, since they're selling exploits for profit. No problem with me.
I see a funny future for huge corporations like Adobe, etc.
Imagine the prompt: "Hey Claude, re-implement Adobe Photoshop with a clean-room design." One agent opens a decompiler and outputs complete low-level technical details of how everything is implemented.
A second agent implements the new Photoshop based on that.
They will be mad and I like this.
You will own nothing, and you will be happy, corpos.
>"I'm doing the design work myself, by hand, before any code gets written."
This is what I was doing right from the beginning. AI just fills out methods and does other low-intelligence work. Both of us are happy. My architectures and code are really mine: easy to read and to reason about. AI gets paid and does not get a chance to fuck me in the process. At no point have I felt any temptation to leave the "serious" parts to AI.
I feel like this article was circling a point it never actually got to. All the advice in here (except controlling scope creep) is specific to a TUI with an elm like architecture.
But here's the thing: you almost never know what the architecture is up front. If you do, you probably aren't the one writing the actual code anymore. Writing the code, with or without an AI, is part of the design process. For most people, it isn't until they've tried several times, fucked it up a bunch, and refactored or rewrote even more that they actually know what the architecture needs to be.
Does ‘writing code by hand’ mean you’re not going to use compilers to generate assembly?
Now I do feel lucky that I started learning coding about four years before the LLM revolution, but these things are really just natural language compilers, aren’t they? We’re just in that period - the 1980s, the greybeards tell me - where companies charged thousands of dollars per compiler instance, right? And now, I myself have never paid for a compiler.
This whole investor bubble will blow up in the face of the rentier-finance capitalists and I’ll be laughing my head off while it happens.
A model, given exactly the same inputs, will return exactly the same outputs.
But your prompts are not the only inputs. Among other things, there is a random seed injected by the vendor.
That is a primary source of non-determinism.
Then, of course, there is the fact that you don't personally have an old copy of the model, the vendor isn't going to keep the model around forever, and there are no unit tests to make sure that, faced with prompts like the ones you gave before, newer models won't suffer major regressions in the functionality you were using.
And even if there were no non-determinism, the models suffer greatly (much more so than traditional compilers) from the butterfly effect.
It is literally impossible to pin down part of your prompt in such a way that it always will contribute to good outcomes, and such that you can simply vary a tiny bit of the prompt to logically correlate with tiny variations in the output.
7 months ago was early November. Coding assistants were getting very good back then, but they were still significantly poorer at making good architectural decisions in my experience. They tended to just force features into the existing code base without much thought or care.
Today I've noticed assistants tend to spot architectural smells while working and will ask you whether they should try to address it, but even then they're probably never going to suggest a full refactor of the codebase (which probably is generally the correct heuristic).
My guess is that if you built this today with AI, you wouldn't run into so many of these problems. That's not to say you should build blind, but the first thing that stood out to me was that you started building 7 months ago, when coding assistants were only just becoming decent, and undirected they would still generally generate total slop.
If you're coding by hand, then you're that carpenter before IKEA came along. Now the market wants bland machine-built functional furniture that gets replaced every five to ten years. If every tenth piece is broken or slightly off, doesn't matter, mass production has lowered the price that a replacement is available for free and you're still making a profit.
Time to become a "product engineer" and watch the hyper-agile agents putting up digital post-it notes on digital pin-boards, discussing how much each post-it is worth in digital scrum meetings. Meanwhile the agents keep wasting more and more time so that their owners make less and less of a loss, until eventually a profit is made.
Until the costs become prohibitive and humans become cheaper than the agents that replaced them. Once the agents are replaced by the humans, the next hype bubble awaits around the bend.
You can’t play whack-a-mole with GenAI. You have to start from well-known principles and watch everything it produces. Every module or bounded context has to have its own invariants.
You can’t fully automate software engineering with GenAI. It seems the vast majority of GenAI users think they can and end up in the same place as the OP.
Maybe learn Domain-Driven Design, Event Sourcing, and then try again. The results will be dramatically improved.
https://devarch.ai/
I have ADHD and for whatever reason telling the LLM what to do instead of doing it myself bypasses the task avoidance patterns and/or focus problems I tend to suffer from. I do not find it fun, but I am thankful for it.
It is significantly easier to micro-manage an AI than a suite of junior developers. The AI doesn't replace a principal engineer; it replaces the junior and weaker senior developers who need stories broken down extremely concisely to get anything done. In the time it takes to break down a story so that such a developer can pick it up and execute it well, the AI would already be done, with testing built around it.
Micromanaging LLMs is like having Dory from Finding Nemo as your colleague. You find ways to communicate, but there is no learning going on.
Of course if you don't provide that feedback loop, no learning happens. I guess the same could be said of a junior, though.
No, if you have to do all of the stuff you have listed to kind-of-make-it-work...You are not in charge.
Sure. That's how I work with AI, and the way I believe AI is meant to be used - as a companion tool.
But it's a lot of work. It saves me time for certain tasks, but not others. I haven't measured my productivity gains, but they're at most 2x.
But that's not "vibe coding" (which was the point of the article) or the (false) promise of "10x productivity" and "code that writes itself" that companies are being told is going to reduce their engineering headcount tenfold.
I had strong principles at the outset of the project and migrated a few consumers by hand, which gave me confidence that it would work. The overall migration is large and expensive enough that it has been deferred for nearly a decade. Bringing down the cost of that migration made me turn to AI to accelerate it.
I found that it was OK at the more mechanical and straightforward cases, which are 80% of the use cases, to be fair. The remaining 20% need changes to the framework. Most of them need very small changes, such as an extra field in an API, but one or two require a partial conceptual redesign.
To oversimplify the problem: the backend for one system can generate certain data in 99% of cases. In a few critical cases, it logically cannot, and that data must be reported to it. Some important optimizations were made on the assumption that this would be impossible.
The AI tooling didn't (yet) detect this scenario and happily added migration logic assuming it would work properly.
Now, because of how this is being rolled out, this wasn't a production bug or anything (yet). However, asking the right questions to partner teams revealed it and unearthed that some others were going to need it as well.
Ultimately, it isn't a big problem to solve in a way that will mostly satisfy everyone, but it would have been a big problem without a human deeper in the weeds.
Over time, this may change. Validation tooling I built may make a future migration of this kind easier to vibe code even if AI functionality doesn't continue to improve. Smarter models with more context will eventually learn these problems in more and more cases.
The code it generates still oscillates between beautiful and broken (or both!), so for now my artistic sensibilities make me keep a close eye on it. I think of the depressed robot from The Hitchhiker's Guide to the Galaxy as the intelligence behind it. Maybe one day it'll be trustworthy.
But in all seriousness it depends on what you’re doing with it. Writing a quick tool using an LLM is much easier than context changing to write it yourself. If you need the tool, that’s very valuable.
Been building a new app with lots of policies and whatnot, and instructing an LLM is just much faster than doing the same repetitive shit over and over myself.
A little update: viewing the page on my phone, the "comitter" field in the demo goes out of bounds for me... Really not speaking well for their product.
The recommended tool can't even produce a mobile-friendly page, so why would I ever use it?
Right, so depending on an LLM makes perfect sense in that case, thanks for clarifying :)
Because I like to save time?
I can do all of this myself, but why would I waste 1-2 hours (per model) on doing all that myself if I can just instruct some stupid LLM to do it for me? It's repetitive boilerplate.
That said, most of the world's most useful code has strict quality requirements. Even before AI, 90% of SLOC would be tossed away without much if any use, 9% was used infrequently, while 1% runs half the world's software.
Reviewers aren't perfect, far from it. And we just gave them ~20x more code to review. Incentives mean that taking 20x longer to review is unacceptable. So where do we go from here?
Indeed for the task of “jump into an unfamiliar codebase and make a requested change that aligns with existing styles and patterns, and uses existing functionality” I would say something like opus 4.7 exceeds the capabilities of most developers.
But agents generate code much faster, and to avoid slowing them down, some people choose not to do the only thing that can currently ensure good results, which is to carefully review the output. Once that happens, there is simply no way for them to know how good or bad what they're getting is.
As soon as requirements change the abstractions fall apart and everything gets shoehorned.
Humans can be held accountable for their own slop
> The kind of mess me and everyone I've worked with produces by hand is the inverse of that
Yes, it's frustrating to work with isn't it? So why are you so excited to make higher volumes of this low quality slop using AI?
Even if you could state it in a precise formal language the LLM under the agent doesn’t have the capability to understand what the invariant is for and why it’s important. You’ll still get oddly generated code. You might get an LLM that can associate certain tokens with those in the formal language specification which can hold invariants and perhaps even write the proofs… but you’ll still get a whole bunch of other code generated from the informal parts of the prompt.
I agree that simply adding constraints and prompts to your skills and specs isn't going to prevent these things. Worse, even if you could invent a better mousetrap, the creature would still escape.
The problem is… “elongation”: the addition of code for the sake of the prompt/task/etc. Often less is better. This takes a human with the ability to anticipate what other humans would want or expect. When you need a generator, they're great, but it's a firehose whose use should be restrained a little more.
That depends on the invariant. Some are behavioural, like "variable x must be even if y is positive", but some are architectural, such as "a new view requires a new class".
But that's only one side of the problem because maintaining the invariant can be just as bad as breaking it. You ask the agent to add a feature and it may well maintain the invariant - only it shouldn't have, because the feature uncovers the fact that the invariant is architecturally wrong.
The problem is that evolving software requires exercising judgment about when you need to follow the existing strategy and when you need to rethink it. If there is any mechanical rule that could state what the right judgment is, I don't know what it is.
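For the behavioural kind, at least, you can make violations loud instead of silent. A minimal sketch using the toy invariant above (the function and its logic are invented; debug_assert! is just one cheap enforcement point):

```rust
// A behavioural invariant asserted at the boundary: code that quietly
// contorts around it fails fast in debug builds and tests.
fn step(x: i64, y: i64) -> i64 {
    debug_assert!(
        y <= 0 || x % 2 == 0,
        "invariant violated: x must be even when y is positive (x={x}, y={y})"
    );
    x + y // stand-in for the real logic
}

fn main() {
    println!("{}", step(2, 3)); // fine: x is even
    // step(3, 1) would panic in a debug build.
}
```

The architectural kind has no equally cheap check, which is exactly where the judgment problem bites.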
With a skilled operator, it could be possible to drive an agent to handle these kinds of changes. I would be concerned that spoken language wouldn't be precise enough for the refactoring and changes that need to be made to a code base when an invariant changes... regardless of whether it was a property, architectural, or procedural change. It can already take several prompts and burn quite a few tokens doing large-scale rewrites and code changes. Maybe the parameters and weights can be tuned for this kind of work, but I remain skeptical that what we have at present is "efficient" at it.
You can try telling the agent to stop and ask when a constraint proves problematic, except it doesn't have as good a judgment as humans to know when that's the case. I often find myself saying, "why did you write that insane code instead of raising the alarm about a problem?" and the answer is always, "you're absolutely right; I continued when I should have stopped." Of course, you can only tell when that happens if you carefully review the code.
Here’s what’s working for me right now:
1. The basics: use best model available, have skills and rules that specify project guidelines, etc.
2. Always use plan mode. It works much better to iterate on the concept of what we’re going to do, then do the implementation. The models will adhere to the plan at very high rates in my experience.
3. Don’t give chunks of work that are too large in scope. This is just art, and I’m constantly experimenting with how ambitious I can be.
4. I review all code to some extent, but I have a strong mental model of what areas of the app are more critical, where hidden bugs might accumulate, etc, and I review both tests and impl more strenuously in those areas. Whereas like a widget for my admin panel probably gets a 2 second glance.
5. Have the discipline to go through periodically and clean up tech debt, refactor things that you’d do differently now, etc. I find the AI a huge help here, because I can clean up cruft in an hour that would have once taken me days, and thus probably wouldn’t have gotten done.
6. I’m experimenting with shifting my architecture to make it easier to review AI code, make it less likely it’ll make mistakes, etc. Honestly mostly things I should have always been doing, but the level of formalism and abstraction on my solo projects is usually different than on a bigger team.
To each their own, but I’ve grown this from nothing to about $350k in ARR over the last ten months, and I’m very confident I never could have built this product without AI help in triple that time.
Yeah, I've already been working for several months on a harness that wraps Claude Code, Codex, etc. to ensure that these types of invariants are captured and enforced (after the first few harness attempts failed), and - while it's possible - it slows down the workflow significantly and burns a lot more tokens. In addition to requiring more human involvement, of course.
I suspect this is the right direction, though, as the alternatives inevitably lead any software project to devolve into a spaghetti-mess maintenance nightmare.
I would use Stripe, curl, and ffmpeg without audits, because I trust them to provide good code and to respect their API. I wouldn’t trust AI to write a Fibonacci series implementation.
The AI has no reputation to wager for my trust.
To be fair, there are many people like this as well. One of my personal favorite examples was way back in the 80s when I inherited the code for a protocol converter that let ASCII terminals communicate with IBM mainframes via the 3270 protocol.
One of the pieces of code in there, for managing indicator lights, was simply wrong. It was ca. 150 lines of Z80 assembly language that was trying to faithfully follow the copious IBM documentation of how things worked, but it had subtle issues and didn't always work.
My approach was to accept the documentation as accurate (the IBM documentation was always verbose and almost never wrong), but to reason that the original 3270 had these functions implemented in TTL logic gates, and there was no way in heck that they were wasting enough gates on indicator lights to require the logical equivalent of 150 instructions.
So in my mind, it had to be a really simple circuit that had emergent properties that required the reams of documentation. With that mindset, I was able to craft correct code for this in 12 instructions.
Many systems are likewise fractal in nature. You want to figure out the generating equations, rather than all the rules that derive from those. And, in many cases, writing down the generating equations is at least as easy to do in code as it would be to do in English for someone or something else to implement.
Well, that is problematic. I have to either assume you are disinterested or lying and neither is great for any discourse.
After reading a bunch of other comments, it sounds like people are referring to letting agents go wild and code whatever off a limited prompt. I'm not using LLMs like that; I'm generally interacting only via conversations with pretty detailed initial prompts. My interactions with the chat after that are corrections/guiding prompts to keep it on point and edit the prompt output from time to time.
It's all very simple. "Use x library, data model should be xyz, do m, not n."
They're obviously not at the point of replacing an experienced programmer as far as knowing the start-to-finish way of accomplishing every detail, that's what the human is for.
Worse, the disclaimer is buried under a bunch of "did X, did Y on line Z of file a/b/c", as if it's just a minor inconvenience. To the extent the plan was inaccurate, you're left in an undefined state where you might as well undo what it just did.
Then when it was completing functions, people would say, "yeah, but you still have to make sure you're the one writing the logic around the functions"
Then when it was completing the logic around the functions, people would say, "yeah, but you still have to make sure you're the one writing the features"
Now it's completing features and people say, "yeah, but you still have to make sure you're the one writing the architecture"
I don't know if architecture is a solvable problem for these models, but it is interesting watching the expectations moving over time.
When AI can complete lines, you still have to read and understand the code.
When AI can complete whole functions, you still have to read and understand the code.
When AI can complete features and tickets, you still have to read and understand the code.
The developers in this scenario are there to absorb the blame when things go wrong. "Human crumple-zones" to protect the company.
But that still doesn’t mean I review all the code. I tend to review defensively, based on the potential for harm if this piece of code is broken. And I rely a lot on tests, static analysis, canaries, analytics, health checks, etc. to reduce risk for when I’m wrong. So far it’s working.
I can feel it sometimes, as my brain shuts down and I gamble instead of thinking. It's a reversion to what I call "monkey mind" where you just keep pressing buttons to "make it work". I took a decade training my mind away from this, and too much AI is bringing it back.
They constantly say they did a thing they didn’t, say they know how to solve something when they don’t, etc. Regardless of guard rails or tests - AI forces a constant vigilance of a new kind.
Not just “what might have gone wrong” but also “what do I think is working but isn’t actually”.
And we’re not even talking about how it chooses substandard solutions, is happy to muddy code/architectures, add spaghetti on top of spaghetti etc.
Agentic coding often feels like an army of inexperienced developers who are also incredibly eager to please.
Yes, because wrong assembly blows up really loudly: from wrong behavior to invalid-instruction errors and everything in between. Moreover, compilers are battle-tested over the years, with extremely detailed test suites and extreme exposure (every day, hundreds of thousands of users test and verify them).
Also, as people said, assembly generation is deterministic. For a given source file and set of flags, you get the same thing out. Byte by byte, bit by bit. This is what we call "reproducible builds".
AI is not like that. It's randomized on purpose, and it pulls from a training set that contains imperfect, non-ideal code. "Yeah, it works, whatever" doesn't cut it when you pull a whole function out of the web of connections formed by the training data. It can and will make errors, because it's randomized over a non-ideal pool.
Next, sometimes you need tight code. Fitting into caches, running at absolute performance limit of the processor or system you have. AI is not a good fit here. Sometimes you go so far that you optimize for the architecture at hand, and it works slower on newer systems, so you need to re-optimize that thing.
For anyone who reads and murmurs "but AI can optimize", yes, by calling specific optimization routines written by real talented people for some cases; by removing their name, licenses, and context around them. This is called plagiarism in its mildest form and will get you in hot water in academia, for example. Writing closed source software doesn't make you immune from cheating and doing unethical things.
Lastly, this still rings in my ears, and I came to understand it more and more as I worked on high-performance, correctness-critical code:
I was taking an exam with a tracing question on it. I raised my head and asked my professor: "Why do I need to trace this? The compiler is made to do this for me." The answer was simple yet deep: "If you can't trace that code, the compiler can't trace it either."
As I said, I just said "huh" at the time, but the saying came back and when I understood it fully, it was like being shocked by a Tesla coil.
Get your sleep, eat your veggies, and understand your code. Those are the three essential things you need to do.
I am so tired of hearing about this false equivalency. Compilers are deterministic, their outputs are well understood and they’re transparent.
LLMs are not.
For anything other than greenfield new-code projects without dependencies, conventions, and connections to other proprietary code, it has to be reviewed. Even in that case, it's not good to skip reviewing the code.
Yeah, because they believe (sometimes wrongly) their subordinates read it.
> Understanding of code never existed from the business perspective.
It does, it's called organizational wisdom and domain knowledge, because you need those witty names to sell books to aspiring managers.
I think the solution is between the lines of this article. The author states the steps leading to it but doesn't arrive at it explicitly. It has been obvious (with 20/20 hindsight) to me since LLMs started getting popular, and it still holds:
LLMs are fantastic for software dev, as long as you don't let them write the architecture. Create the modules, structs, and enums yourself. Add as many of the struct fields and enum variants as possible. Add doc comments to each struct, enum, field, and module. Point the LLM at the modules and data structures, and have it complete the function bodies etc. as required.
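A sketch of that division of labour (the retry domain and every name here are invented for illustration): everything except the function body is written, doc-commented, and frozen by the human before the LLM is pointed at it.

```rust
/// Retry policy for a failed operation. Variants and fields are fixed by
/// the human; the LLM is only asked to complete the function bodies below.
pub enum Backoff {
    /// Wait the same number of milliseconds between every attempt.
    Fixed { millis: u64 },
    /// Double the wait after each failed attempt, starting from `base_millis`.
    Exponential { base_millis: u64 },
}

/// Milliseconds to wait before the 0-based `attempt` number.
pub fn delay_for(policy: &Backoff, attempt: u32) -> u64 {
    // The kind of body the LLM completes against the doc comments above.
    match policy {
        Backoff::Fixed { millis } => *millis,
        Backoff::Exponential { base_millis } => {
            // Capped shift so large attempt counts can't overflow.
            base_millis.saturating_mul(1u64 << attempt.min(16))
        }
    }
}

fn main() {
    assert_eq!(delay_for(&Backoff::Fixed { millis: 50 }, 9), 50);
    assert_eq!(delay_for(&Backoff::Exponential { base_millis: 100 }, 3), 800); // 100 * 2^3
    println!("ok");
}
```

The doc comments double as the spec the generated bodies are reviewed against.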
Considering how fast we can poop out code now, I think this issue is just more visible than before, but it's been an issue for as long as I've been a developer. Almost no one knows what they actually want, and half the job is trying to coax out what they want to be able to do, so you can properly architect it.
While the salary stays stagnant or even reduced if you adjust for inflation.
So it's not much of a surprise that this is the situation folks find themselves in with the current models.
Have people's standards for quality just completely vanished in the pursuit of the shiny new thing? Is that guy doing something wrong?
That has also been my experience with this sort of thing fwiw, which is why I gave up and do more of a class-by-class pairing with an LLM as a workable middle ground.
It's completing shit. Even if it does not implement some lazy stuff with empty catch blocks (i.e. happy path from programming 101 tutorials), it will either expose your secrets in a sensible place or do some other stupidity.
"it takes too much effort to get the output production ready"
turning into
"maybe long term the maintenance will be more expensive"
I give it three months until people realize that you rarely need to review every single line and fully understand the code, like so many comments are claiming.
Maybe on projects with no users you can yolo things.
This blob of people criticizing AI is just that, a blob. A gaggle of discrete people that your brain makes up a narrative about being some goalpost shifting entity.
Of course there could be individuals who have moved the goalposts. Which would need a pointed critique to address, not an offhand “people are saying” remark.
1. If I use a coding agent to generate code, it should be something I am absolutely confident I can code correctly myself given the time (gun to my head test).
2. If it isn't, I can't move on until I completely understand what it is that has been generated, such that I would be able to recreate it myself.
3. I can create debt (I believe this is being called Cognitive Debt) by breaking rule 2, but it must be paid in full for me to declare a project complete.
Accumulating debt increases the chances that code I generate afterwards is of lower quality, and it also feels like the debt is compounding.
I'm also not really sure how these rules scale to serious projects. So far I've only been applying these to my personal projects. It's been a real joy to use agents this way though. I've been learning a lot, and I end up with a codebase that I understand to a comfortable level.
In their mind they’ve already done the “architectural heavy lifting” and accelerated the team. More often than not it just adds cognitive overhead where you spend more time deciphering and cleaning up garbage than actually building the thing properly from scratch.
It’s a valid direction to look in, it just doesn’t address the root issue of throwing slop across the wall and also having unrealistic expectations due to not knowing any better.
If there's one thing that's disturbing about AI proponents, it's how trusting they are of code. One change in the business domain and most of the code may have turned from useful to actively harmful - code you now have to rewrite. Good luck doing that well if you're not really familiar with it.
Unfortunately, it is not, and many of its attempts at mathematical proofs have major flaws. You shouldn't trust its proofs unless you are already able to evaluate them--which I think is pretty much all the OP is saying.
There is one exception to this: if the AI also delivers a proof of why the math is correct, in a machine-checked format, and I understand the correctness theorem (not necessarily its proof), then I would use it without hesitation.
I struggle to remember even relatively simple maths like working out "what percentage of X is Y" so if I write a formula like that I'll put in some simple values like 12 and 6 or 10,000 and 2,456 just to confirm I haven't got the values backwards or something. I've been shown sheets where someone put a formula in that they don't understand, checked it with numbers they can't easily eyeball and just assumed it was right as it's roughly in their ball park / they had no idea what the end result should be.
Then again I've also seen sheets where a 10% discount column always had a larger number than the standard price so even obviously wrong things aren't always checked.
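For what it's worth, the sanity check is cheap to script as well; a throwaway snippet (Go here, with the values from above) plays the same role as plugging eyeball-able numbers into the sheet:

    package main

    import "fmt"

    // "What percentage of x is y" is y / x * 100.
    func pctOf(x, y float64) float64 { return y / x * 100 }

    func main() {
        fmt.Println(pctOf(12, 6))       // 50: easy to eyeball
        fmt.Println(pctOf(10000, 2456)) // 24.56: roughly a quarter, as expected
    }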
I've reached solutions by trial and error too, and tried to rationalize them later, quite a few times. And it's easier to rationalize a working solution, however adversarial you claim to be in your rationalization.
I don't see using gen AI for the (not so) “brute force” exploration of the solution space as that different from trial and error and post-hoc rationalization.
I’m going to guess that this is Gell-Mann amnesia more than anything, and it’s going to get a lot of organizations into a lot of weird places.
Normally with mathematical problems you have to prove the solution correct. Testing is not sufficient, unless you can test all possible inputs exhaustively.
If it’s beyond our ability to review and we blindly trust it’s correct based on a limited set of tests… we’re asking for trouble.
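Exhaustive checking is occasionally within reach when the input domain is tiny; a hypothetical Go sketch of the difference between a spot check and a proof by exhaustion:

    package main

    import "fmt"

    // midpoint avoids the classic (a+b)/2 overflow bug, assuming a <= b.
    func midpoint(a, b uint8) uint8 { return a + (b-a)/2 }

    func main() {
        // uint8 is small enough to try every valid input pair.
        for a := 0; a <= 255; a++ {
            for b := a; b <= 255; b++ {
                want := uint8((a + b) / 2) // reference computed in int
                if got := midpoint(uint8(a), uint8(b)); got != want {
                    fmt.Printf("counterexample: a=%d b=%d got=%d want=%d\n", a, b, got, want)
                    return
                }
            }
        }
        fmt.Println("verified for all inputs")
    }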
... that can't even count.
"PhD level" just means you finished a bachelor and masters degree and are now doing a bit of original research as an employed research assistant.
Claude isn't "PhD level" anything. This shows a complete lack of understanding here. Claude has read every single text book in existence, so it can surface knowledge locked away in book chapters that people haven't read in years (nobody really reads those dense books on niche topics from start to finish).
Since Claude has infinite patience, you can just keep asking until you get it.
But we don’t apply the same standard to dependencies, the work of colleagues, external services, or all the layers down to the silicon when trying to work.
Why is AI suddenly different?
We just have to do this by risk and reward. What’s the downside if it’s wrong, and how likely is an error to be found in testing and review? What is the benefit gained if it’s all fine? This is the same for libraries and external services.
A complex financial set of rules in a non-updatable crypto contract with no testing?
A viewer for your internal log data to visualise something?
There are some programmers who treat the job as just plumbing together what is to them completely incomprehensible black boxes, who treat the computer as a mystery machine that just does things "somehow", but these programmers will almost always be hacks that spend their entire career producing mediocre code.
There are things such a programmer can build, but they are very limited by their lack of in depth understanding, and it is only a tiny fraction of what a more competent programmer can put together.
To get beyond being a hack, you need to understand the entire stack, including the code that you didn't write, including both libraries, frameworks and the OS, and including the hardware, the networking layers, and so forth. You don't have to be an expert at these things by any means, but you do need to understand them and be comfortable treating them as transparent boxes that you may have to go in and fiddle with at some point to get where you need to go. Sometimes you need to vendor a dependency and change it. Sometimes you need to drop it entirely and replace it with something more fit for purpose you built yourself.
> To get beyond being a hack, you need to understand the entire stack, including the code that you didn't write, including both libraries, frameworks and the OS, and including the hardware, the networking layers, and so forth.
I think maybe you overestimate your own knowledge here. It's one thing to understand general principles and design or to understand a contextually-relevant vertical or whatever. It's another to demand comprehensive (even if not expert) familiarity in non-trivial projects, especially those created by many developers over long time spans. It's not just a question of intelligence or dedication or even just time spent working on a project.
The amount of software even your typical piece of code relies on is staggering and shifting, and it's only getting more complicated. A good chunk of software engineering and programming language research has been focused on making it practical to operate in such a complex environment - an environment that nobody fully understands - which is a major part of why modularity exists. Making software like "plumbing together [...] black boxes" is exactly what such research aspired to accomplish, because it allows different developers to focus on different scopes and focus on the domain they're working on. Software engineering is a practical field, and any system that requires full knowledge to operate, modify, and extend is either relatively small (maybe greenfield and written by a sole developer) or impractical to work with.
So I would say there's a wide gap between "lazy guy who doesn't give a shit" and "guy who thinks he can understand everything". Both lack the humility and wisdom needed to know the limits of their knowledge, to circumscribe what they need to understand, and to operate within the space these afford. (Both extremes remind me of cocky junior devs. On the one hand, you have the junior dev who carelessly churns out "hot shit" garbage code by plumbing things together with no grasp or appreciation of sound design; on the other you have the dev who makes a big show about "rigor" completely detached from the actual realities and needs of the project. In each case, the dev is failing to engage intelligently with the subject matter.)
I'm far from the best at anything and make no claims toward knowing everything, but I do think I have reasonable breadth in my experience and work, and I don't think I could have built something like this otherwise.
[1] ... which is something that does not decompose neatly into black boxes and must to a large degree be built from first principles, as goddamn nothing off the shelf scales well enough to deal with multi-terabyte workloads at even a fraction of the speed a bespoke solution can.
An outsourced developer isn't a "tool". They're a human being, and responsible for their actions. They're being paid, and they either act responsibly or they get replaced.
A vibe coder is a human using a tool. The human is responsible for code quality, and if it's not good enough, they need to keep using the tool to make it better. That means understanding the tool's output.
If an artist used Photoshop to create a billboard ad that was ugly, they don't get to blame Photoshop. They have to keep using the tool until their output is good.
So you figure out that someone you paid is at fault, instead of someone they hired. Your contract is with them so what really changes? What process or anything else is really different between it being a company with a manager who asks a team of devs and a company which asks an AI agent, to you as a customer?
Maybe it changes who gets fired or sued or whether one insurance or another pays out- but broadly I think none of what I said about project work really changes.
Product owners, and hell, even customers, have long been able to get software without understanding all its details (customers without ever even seeing the code), purely driven by natural language.
I'd think that depends on the model of responsibility at play.
For example, suppose I hire a building contractor to build a house, and the electrician he subcontracts makes a mistake.
From my perspective, the prime contractor is equally responsible for that mistake regardless of whether he used a subcontractor, or did the work himself but used a broken tool.
This doesn't make the electrician any less of a "person" in the deeply important ways, but it's not a distinction that's relevant to my handling of the problem.
Comprehension debt just sounds like there are things you don’t (yet) understand.
Cognition debt means your lack of understanding compounds and the cognition “space” required to clear it increases accordingly.
An increasing comprehension debt that can be paid off one bit at a time within reasonable cognition space takes linear time to clear.
Cognition debt takes exponential time to clear the more of it you have. If it reaches a point where you simply don’t have the space for the cognition overhead required to understand the problem, you probably need to start over from your specifications.
Your manager is unknowingly helping you create a form of job security for yourself, with all the technical debt and bugs being accumulated.
He might not understand it, and it might not be the type of work you want to do, but someone is going to have to fix those issues. And the longer they wait, the bigger the task gets.
But we still hold good cards in hand.
Do they want their pile of steaming slop fixed, or not? Because no amount of complaining about the deadline being "yesterday" is going to change the fact that time will be needed to fix the accrued technical debt, whether they like it or not. And if AI dug you in that deep to start with, the solution is not to dig deeper.
I suspect some companies are going to find that out the hard (costly) way.
If management is reasonable, you can explain to them that there isn't time to check the work of the AI, that it frequently makes obscure mistakes that need to be properly checked, and that checking takes time.
At this point, if they still insist you just hand over the AI's work, they've made a decision that is their fault. You've done what you can.
And when the shit hits the fan, we're back to whether they're reasonable or not. If they are, you explained what could happen and it did. If they force responsibility on you, they aren't reasonable and were never going to listen to you. That time bomb was always going to go off.
This will not happen. Nobody wants to give that up. Also, AI does not deliver even remotely that much of a true value multiplier.
Had a project idea which I coded with the help of AI, and it grew to a point where I was starting to have uncharted areas in the code, mostly because I reviewed it too shallowly or moved too fast.
It was fine, since that project never took off, but if I did such a thing on my breadwinning project I would lose the joy.
This all works pretty great. Where it starts going off the rails is if I let it use a library I'm not >=90% comfortable with. That's a good use of these tools, but if I let it plow through feature requests, I end up accumulating debt, as you pointed out.
For my uses, I'm still finding the right balance. I'm not terribly sure it makes me faster. What I do think it helps with is longer focused sections because my cognitive load is being reduced. So I can get more done but not necessarily faster in the traditional sense. It's more that I can keep up momentum easier, which does deliver more over time.
I'm interested in multi agent systems, but I'm still not sure of the right orchestration pattern. These AI tools still can go off the rails real quick.
The swindle goes like this: on a good codebase, AI can build a lot of features. You think it’s faster; it even seems safer and more accurate at times, especially in domains you don’t know everything about.
This goes on for a while as the codebase gets bigger, exploration takes longer, and the failure rate increases. You don’t want it to be true and try harder, so you only stop after it has become practically impossible to make any changes.
You look at the code again, and there is so much of it that spaghetti is an understatement; it’s the Great Wall of China.
You start working… and you realize what was going on.
I deleted 75,000 of 140,000 lines of code, and I honestly feel the 3 months I went hard into agentic coding were wasted. I failed my users by building useless features, increasing bugs, losing the mental model of my code, and not finding the problems I didn’t know about: the kind of hard decisions you only see when you’re in the code, the stuff that wanders in your mind for days.
They seem to be different for LLMs, because would anyone be surprised if you handed summary feature descriptions to some random "developer" you'd only ever met online, and got back an absolute dung pile of half-broken implementation?
For some reason, people seem to expect miracles from some machine that they would not expect of other humans, especially not ones with a proven penchant for rambling hallucinations every once in a while.
I'd like to know, ideally from people who've been there, why they think that is. Where does the trust come from?
It is surprising how bad it is at taking the lead given how effective it is with a much more limited prompt, particularly if you buy in to all the hype that it can take the place of human intelligence. It is capable of applying an incredible amount of knowledge while having virtually no real understanding of the problem.
They can assimilate hundreds of thousands of tokens of context in a few seconds or minutes and do exceptional pattern matching beyond what any human can do; that's a main factor in why it looks like "miracles" to us. When a model actually solves a long-standing issue that was never addressed for lack of funding/time/knowledge, it does feel miraculous, and when you are exposed to this a couple of times it's easy to give it more trust, just like you would trust someone who lent you a helping hand a couple of times more than a total stranger.
I suppose it's difficult to account for the inconsistency of something able to perform up to standard (and fast!) at one time, but then lose the plot in subtle or not-so-subtle ways the next.
We're wired to see and treat this machine as a human and therefore are tempted to trust it as if it were a human who demonstrated proficiency. Then we're surprised when the machine fails to behave like one.
I have to say, I'm still flabbergasted by the willingness to check out completely and not even keep on top of, and a mental model of, what gets produced. But the mind is easily tempted into laziness, I presume, especially when the fun part of thinking gets outsourced, and only the less fun work of checking is left. At least that's what makes the difference for me between coding and reviewing. One is considerably more interesting than the other, much less similar than they should be, given that they both should require gaining a similar understanding of the code.
Or really the same reason people fall for get rich quick schemes.
I don't understand this. A large codebase should be a collection of small codebases, just like a large city is a collection of small cities. There is a map and you zoom into your local area and work within that scope. You don't need to know every detail of NYC to get a cup of coffee.
It's your responsibility to build a sane architecture that is maintainable. AI doesn't prevent you from doing that, and in fact it can help you do so if you hold the tool correctly.
To speak more directly - every codebase has local reasoning and global reasoning. When looking at a single piece of code that's well-isolated, you can fully understand its behavior "locally" without knowing anything about any other part of the code. But when a piece of code is tightly coupled to many other parts of the codebase, you have to reason globally - you have to understand the whole system to even understand what that one piece of code is doing, because it has tendrils touching the whole system. That's typically what we call spaghetti code.
If you leave an AI to its own devices, it will happily "punch holes", and create shortcuts, through your architecture to implement a specific feature, not caring about what that does to the comprehensibility of the system.
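A toy Go illustration of the local/global split (names invented for the example):

    package main

    import "fmt"

    // Local reasoning: everything needed to understand this function
    // is in its signature and body.
    func applyDiscount(price, pct float64) float64 {
        return price * (1 - pct/100)
    }

    // Global reasoning: behavior depends on mutable package state that
    // any other code may change, so no call site can be understood
    // without understanding the whole system.
    var currentPromo float64

    func applyPromo(price float64) float64 {
        return price * (1 - currentPromo/100)
    }

    func main() {
        fmt.Println(applyDiscount(100, 10)) // always 90
        currentPromo = 50
        fmt.Println(applyPromo(100)) // depends on who last touched currentPromo
    }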
Oh, great analogy there.
Just like there's almost nothing in common between a large city and a collection of small cities, a large codebase is completely different from a collection of small codebases too.
Mostly because of the same kinds of effects.
There is software that works like this (e.g. a website's unrelated pages and their logic), but in general composing simple functions can result in vastly non-proportional complexity. (The usual example is a simple loop plus a simple conditional, with which you can easily encode Goldbach or Collatz.)
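That usual example, spelled out in Go: one loop, one conditional, and whether it halts for every input is a famous open problem.

    package main

    import "fmt"

    // collatzSteps counts iterations until n reaches 1.
    // Trivial parts, non-trivial whole: nobody has proven
    // that this terminates for every n.
    func collatzSteps(n uint64) int {
        steps := 0
        for n != 1 {
            if n%2 == 0 {
                n /= 2
            } else {
                n = 3*n + 1
            }
            steps++
        }
        return steps
    }

    func main() {
        fmt.Println(collatzSteps(27)) // 111 steps from a two-digit start
    }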
E.g. you write a runtime with a garbage collector and a JIT compiler. What is your map? You can't really zoom in on the GC district, because on every other street there's a portal opening to another street in the JIT district, which has portals to the ISS, where you don't even have gravity.
And if you think this might be a contrived example and not everyone is writing JIT-ted runtimes, something like a banking app with special logging requirements (cross cutting concerns) sits somewhere between these two extremes.
On a tiny project it doesn't matter. However, when you have millions of lines of code, you have to trust that your code works in isolation, without knowing the details.
> have millions of lines of code you have to trust that your code works in isolation without knowing the details.
More like hope. This is where good design and architecture helps, as well as strong invariants held up by the language. But given that most applications can't really escape global state (not even internally, let alone external state like the file system), you can never really know that your code will work the way you expect it to - that is, it's not trivially composable to any depth.
Rather than arguing about the specifics, it's easier to point to numerous concrete examples, such as a fairly simple system - which should be easy to implement in 8-15k lines of code, depending on certain choices (I've been writing code long enough to estimate this relatively accurately) - being still-incomplete while approaching 150k lines. These kinds of atrocities are usually economically infeasible in hand-written code, for 2 reasons: 1) the cost to produce that much code is very high, and 2) the cost of maintaining that much code is insurmountable.
I guess you could say that AI is great at generating code that only AI can understand and maintain.
E.g., treating AI-generated code as immediately legacy, with tight encapsulation boundaries, well-defined interfaces, etc., and integrating it in a more manual workflow.
There's a range from single-shot prompts to inline code generation, that will make more sense depending on the problem and where in the code base it is.
Single-shot stuff is going to make more sense for a prototyping phase with extensive spec iteration. Once that prototype is in place, you then probably want to drop down into per-module/per-file generation and be more systematic, always maintaining a reasonably good mental model at this layer.
This seems to me like it requires an impossible level of discipline, judgement and foresight.
I could see value in using it during the prototyping phase, but wouldn’t like to work like you described for a serious project for end users.
Understanding your domain, setting clear expectations and understanding limitations and how much ambiguity your people/robots can handle are all good management techniques, they translate.
But the nature of working with an always-on flattery machine vs humans that can exceed your expectations while also being sources of infinite drama and frustration is still fundamentally different. The blind spot is being susceptible to the flattery machine and forgetting how much you relied on good people challenging you. The benefit is, of course, not having to deal with humans.
You call it a hackathon. You tell the human to stay up the whole night. In exchange for the extra hours worked you provide some pizza.
I care more about code quality now, because typing effort no longer decides whether a refactor feels worth doing or not.
This is good advice regardless of whether you're using AI or not, yet in real life "let's have well-defined boundaries and interfaces" always loses against "let's keep having meetings for years and then ducttape whatever works once the situation gets urgent".
You can have very good diffs and then find that the whole codebase is a collection of slightly disjointed parts.
AI doesn't necessarily have to increase your throughput; it can also serve as a flexible exploration and refactoring tool that will support either later hand-crafted code or agentic implementation.
I still have a lot of usage for AI: exploration, double-checking me, teaching me. But letting it write code became very tough for me to accept. Next-edit autocompletes, mainly.
I'm ready to give up on having it even review my code at this point. It's been so frustrating. It hallucinates bugs, especially in places where "best practices" are at odds with reality.
Recently it informed me of a bug where it suggested the line of code in question couldn't possibly do anything because on Linux the specific stdlib behaved in X ways, but it was obvious from the line of code that it was running on Windows which doesn't have this problem at all. Of course, it doesn't actually mention that this is an issue on Linux, just that there is a bug here. It vomits up a paragraph of $WORDS explaining why this was a high-priority bug that absolutely needed to be fixed because it was failing in subtle ways. Yet the line of code in question has been running in production, producing exactly the results it is expected to, for ~3 years.
And this is just one simple example, of the many dozens+ of times it has failed this task this year. In that same review run, the agent suggested 3 additional "bugs" or other issues that should be addressed that were all flatly wrong or subjective. I'm at a point of absolute exhaustion with this sort of shit. It's worse than a junior half of the time because of how strongly opinionated it is. And the solution to this sort of problem is an endless amount of configuration and customization that will be forgotten about by all of us over time, leading to who knows what sort of knock-on effects (especially as we migrate from one model to the next). We have a guy on our team who has ~17,000 words in his agent and instructions files, yet he sees nothing wrong with this. I guess he just really loves YAML and Markdown.
There comes a realization, to many engineer’s horror, that AI won’t be able to save them and they will have to manually comprehend and possibly write a ton of code by hand to fix major issues, all while upper management is breathing down their back furious as to why the product has become a piece of shit and customers are leaving to competitors.
The engineers who sink further into denial thrash around with AI, hoping they are a few prompts or orchestrations away from everything be fixed again.
But the solution doesn’t come. They realize there is nothing they can do. It’s over.
Coincidentally I've been working on a project for about 7 months now: it's a 3D MMO. Currently it's playable, and people are having fun with it - it has decent (but needs work) graphics, and you can cram a few hundred people into the server easily currently. The architecture is pretty nice, and it's easy to extend and add features onto. Overall, I'm very happy with the progress, and it's on track to launch after probably a year's worth of development.
In 7 months of vibe coding, OP failed to produce a basic TUI. Maybe the feature velocity feels high, but this seems unbelievably slow for building a basic piece of UI like this - this is the kind of thing you could knock out in a few weeks by hand. There are tonnes of TUI libraries that are high quality at this point, and all you need to do is populate some tables with whatever data you're looking for. It's surprising that it's taking so long.
There seems to be a strong bias where using AI feels like you're making a lot of progress very quickly, but compared to manual coding it often seems to be significantly slower in practice. This seems to be backed up by the available productivity data, where AI users feel faster but produce less
This metric highly depends on who uses the AI to do what, where strong emphasis is on "who" and "what".
In my line of work (software developer) the biggest time sinks are meetings where people need to align proposed solutions with the expectations of stakeholders. From that aspect AI won't help much, or at all, so measuring the difference of man hours spent from solution proposal to when it ends up in the test loops with and without AI would yield... very disappointing results.
But for troubleshooting and fixing bugs, or actually implementing solutions once they have been approved? For me, I'm at least 10x'ing myself compared to before I was using AI. Not only in pure time, but also in my ability to reason around observed behaviors and investigating what those observations mean when troubleshooting.
But I also work with people who simply cannot make the AI produce valuable (correct) results. I think if you know exactly what you want and how you want it, AI is a great help. You just tell it to do what you would have done anyway, and it does it quicker than you could. But if you don't know exactly what you want, AI will be outright harmful to your progress.
Another thing I don’t see mentioned is code quality.
Vibe-coded code bases are an excellent example of why LLMs aren’t very good at writing code. They will often correct their own mistakes only to make them again immediately after, and their pattern use is inconsistent.
Recently Claude has been making some “interesting” code style choices, not inline with the code base it’s currently supposed to be working on.
https://tinyskies.vercel.app/
It's got a fun Zelda-inspired mechanic (I won't say which one), and you'll have to unlock abilities and parts of the world over several quests and modes to "win".
It's also multiplayer.
In the past, I was trying to reproduce vibe coding results when I managed to get all the information from Youtube videos (model version, ide version, same input data and prompt) and never was I able to reproduce something impressive even after multiple runs of the same thing.
https://x.com/DannyLimanseta/status/2052040017007251946
Here are the direct links:
https://github.com/dannylimanseta/tinyskies
https://github.com/dannylimanseta/tinyskies/blob/cursor/glob...
https://github.com/dannylimanseta/tinyskies/blob/cursor/glob...
AI, and especially agentic AI, can make you lose situational awareness over a codebase, and when you're doing deep work that SUUUUCKS, but it's not useless; you just have to play to its strengths. Though my favorite hill to die on is telling people not to underestimate its value as autocomplete. Turns out 40 gigabytes of autocomplete makes for a fucking amazing autocomplete. Try it with llama.vim + qwen coder 30b; it feels like the editor is reading your mind sometimes, and the latency is so low.
That’s the hard part of coding. If you have an architecture then writing the code is dead simple. If you aren’t writing the code you aren’t going to notice when you architected an API that allows nulls but then your database doesn’t. Or that it does allow that but you realize some other small issue you never accounted for.
I do not know how you can write this article and not realize the problem is the AI. Not that you let it architect, but that you weren’t paying attention to every single thing it does. It’s a glorified code generator. You need to be checking every thing it does.
The hard part of software engineering was never writing code. Junior devs know how to write code. The hard part is everything else.
The developers that think coding is hard are the ones that absolutely love AI coding. It's changed their world because things they used to find hard are now easy.
Those that think coding is easy don't have such an easy time because coding to them is all about the abstractions, the maintainability and extensibility. They want to lay sensible foundations to allow the software to scale. This is the hard part. When you discover the right abstractions everything becomes relatively easy. But getting there is the hard part. These people find AI coding a useful tool but not the crazy amazing magical tool the people who struggle with coding do.
The OP is definitely in the second camp since they could spot and realise the shortcomings of the AI. They spotted the problem, and that problem is that the AI can't do the hard bit.
PMs can now cross reference and organize tickets with just a few keystrokes. Organisational knowledge, business knowledge, design systems and patterns, etc all of it is encoded in LLM consumable artefacts. For PMs it is the same switch - instead of having to do it by hand you direct lower level employees to handle the details and inconsistencies and you just do vibe and vision.
When all of the pieces successfully connect and execute reliably, what is left for humans to do? Just direct and consume?
And AI companies with their huge swaths of data are soon gonna be in the situation of being able to do the directing themselves
There's plenty of times where I don't know what code to write because I've never used a library before. But it's just a page of documentation away. It's not hard, it's just slow and tedious.
The first group are still thinking fairly deeply about design and interfaces and data structures, and are doing fairly heavy review in those areas. The second group are not, and those are the ones that I find a bit more worrisome.
I can't speak for others, but I'd go further and say that LLMs allow me to go deeper on the design side. I can survey alternative data structures, brainstorm conversationally, play design golf, work out a consistent domain taxonomy and from there function, data structure and field names, draft and redraft code, and then rewrite or edit the code myself when the AI cost/benefit trade off breaks down.
I’m not making a judgement call about which is better, but it was widely accepted in tech before the advent of LLMs that you just fundamentally lack a sense of understanding as a reviewer vs an author. It was a meme that engineers would rather just rewrite a complicated feature than fix a bug, because understanding someone else’s code was too much effort.
It's the same thing here. AI has dropped the cost of software development, so developers are now fooling themselves into producing low or zero value software. Since the value of the software is zero or near zero, it doesn't really matter whether you get it right or not. This freedom from external constraints lets you crank up development velocity, which makes you feel super productive, while effectively accomplishing less than if you had to actually pay a meaningful cost to develop something.
Like, what is the purpose of Gas Town? It looks to me like the purpose of Gas Town is to build Gas Town.
I find it useful to not listen to people who just talk.
I worry about the first group too, because interfaces and data structures are the map, not the territory. When you create a glossary, it is to compose a message that transmits a specific idea. I find invariably that people who focus that much on code often forget the main purpose of the program in favor of small features (the ticket). And that has accelerated with LLM tooling.
I believe most of us that are not so keen on AI tooling are always thinking about the program first, then the various parts, then the code. If you focus on a specific part, you make sure that you have well-defined contracts with the other parts that guarantee the correctness of the whole. If you need to change the contract, you change it with regard to the whole thing, not the specific part.
The issue with most LLM tools is that they’re linear. They can follow patterns well, and agents can have feedback loops that correct them. But contracts are multi-dimensional forces that shape a solution. That solution appears more like a collapsing wave function than a linear prediction.
I follow the plan -> red/green/refactor approach and it is surprisingly good, and the plans it produces all look super well reasoned and grounded, because the agent will slurp all the docs and forums with discussions and the like.
Trouble is, once it starts working, there will inevitably be a point where the docs and the implementation actually differ - either some combination of tools that has not been used in that way, some outdated docs, or just plain old bugs.
But if the goals of the project/feature are stated clearly enough it is quite capable of iterating itself out of an architectural dead end, that is if it can run and test itself locally.
It goes as deep as inspecting the code of dependencies and libraries and suggesting upstream fixes etc. all things that I would personally do in a deep debugging session.
And I’m super happy with that approach, as I’m more directing and supervising rather than doing the drudgery of it.
Trouble is, a lot of my teammates _don’t_ actually go this deep when addressing architectural problems; their usual modus operandi is “escalate to the architect”.
This will not end up well for them in the long run, I feel, but I’m not sure what they can do themselves - the window of being able to run and understand everything seems to be rapidly closing.
Maybe that’s not super bad - I don’t know exactly what the compiler is doing to translate things to machine code, and I definitely don’t get how the assembly itself is executed to produce the results I want at scale - that is a level of magic and wizardry I can only admire (look-ahead branching strategies and caching on modern CPUs are super impressive - like, how is all of this even producing correct responses reliably at such a scale…)
Anyway - maybe all of this is ok - we will build new tools and frameworks to deal with all of this, human ingenuity and desire for improvement, measured in likes, references or money will still be there.
This is also why I don't see it getting better anytime soon. So many people ask me "how do you get your claude to have such good output?" and the answer is always "I paid attention and spotted problems and asked claude to fix them." And it's literally that simple but I can see their eyes already glazing over.
Just as google made finding information easier, it didn't fix the human element of deciphering quality information from poor information.
You can skip that and go directly to writing code. But that meant you replaced a few hours of planning with a few weeks of coding.
> back to writing code by hand
But what they are doing is
> doing the __design work__ myself, by hand, before any code gets written.
So... Claude still is generating the code I guess?
And seriously, I can't understand how they thought their vibe-coded project worked fine, and even bought a domain for it, without ever looking at the source code it generated, FOR 7 MONTHS??
And the goal of the article is to draw attention to their project.
Additionally, they couldn't even bother to write their own blog post, so it's a little hard to take them seriously when they say they're going to write their own code...
> Claude (c) by Anthropic (R) is the best thing since sliced bread and I'm Lovin' It(tm)! Here's a breakdown of you too can live a code free life for 10 easy payments of $99.99 a month if you subscribe now!
> Step one in your journey to code free life: code the whole damn project and put it together yourself
It's so much fluff and baloney and every single article is identical. And every single one is just over the top praise of Claude that doesn't come off as remotely authentic. There's always mentions of Claude "one shotting"(tm) something.
I don’t think it’s that weird to not look at the code if it’s a side project and you follow along incrementally via diffs. It’s definitely a different way of working but it’s not that crazy.
Its not weird to not look at the code, as long as you're looking at the code? (diffs?)
Uh, ok
We’ve moved to seeing that specs are useful and that having someone write lots of wrong code doesn’t make the project move faster (lots of times devs get annoyed at meetings and discussions because it hinders the code writing, but often those are there to stop everyone writing more of the wrong thing)
We’ve seen people find out that task management is useful.
Now I’m seeing more talk of fully doing the design work upfront. And we head towards waterfall-style dev.
Then we’ll see someone start naming the process of prototyping, then I’m sure something about incremental features where you have to manage old vs new requirements. Then talk of how really the customer needs to be involved more.
Genuinely, look at what project and product managers do. They have been guiding projects where the product is code, yet they are not expected to read the code and are required to use only natural language to achieve this.
Was it strongdm talking about the dark factories? They were working on some integration software so needed to use google drive and slack and lots of other things. They fully reimplemented those to the level they needed for their tests - outside of the biggest firms this would probably have been an enormous time and money sink. Now it’s reasonable.
On a personal project, with my wife we wanted a tracker for holiday planning. Five minutes given a barely thought-through request and we had a working prototype, fixed bugs in seconds, and then talked through with a model what we needed and how it did or didn’t fit (and we needed that first version to figure that out). It helped drive out actual requirements from us, prioritise them, choose a stack, add tickets, and then went ahead and implemented it pretty far. Have a mostly working v2 which has highlighted some details about what we really wanted. Total invested cost was one day of a $20 subscription and maybe half an hour of talking to a bot and checking results.
This is a special case of a general fundamental point I'm struggling with.
Let's assume AI has reduced the marginal cost of code to zero. So our supply of code is now infinite.
Meanwhile, other critical factors continue to be finite: time in a day, attention, interest, goodwill, paying customers, money, energy.
So how do you choose what to build?
Like a genie, the tools give us the power to ask for whatever we want. And like a genie, it turns out we often don't really know what we want.
Now it is different, in that I don’t have time to use those apps.
That’s a joke.
But I do believe it answers the question of “what to build?”. If you didn’t have time before LLM assisted coding you still don’t have time for it. You most likely know what is used and what not already by heart or by some measurements.
Prompt for what you want. Get your feature working, then cut: reduce SLOC, refactor to remove duplication, update things to match existing patterns. You might do these instinctively, or maybe as-you-go, but that's just style. Having a dedicated pass works just as well.
The same thing goes for my code now that did when I wrote every line by hand: make it work, then make it good, then make it manageable. Manually that meant breaking things down into small blocks of individual diffs inside a PR (or splitting PRs), checking for repetitive code and refactoring, or even stashing what I got to and doing it again with the knowledge of how things went wrong.
Agents can do the same. It's WAY easier mentally and works out better if you treat them the same way and go working -> better -> done.
When asking for a new major feature, despite hard guidelines and context (which eat half your context window), it quickly ships bloat. The foundations are not very well organized, and this is where you acknowledge it is all about random prediction of the next word-thing.
Overall, I've wasted more time reviewing the PR and trying to steer it properly than I expected. So multi-layer agent vibe coding is no longer the way to go *for me*. Maybe with unlimited tokens and a better prompt, to be investigated...
I really do think this whole thing is a wash.
Yes I agree for sure llms write terrible code when left to their own devices, but so do most engineers. Which is why we have so many tools to help keep a certain level of quality. Duplication checks, tests, linters, other engineers.
I find whenever you make an llm repo without these checks, and more, it will write like an enthusiastic junior engineer, wrong and strong. However a junior engineer would be hard pressed to get 95% coverage on a codebase, the ai is more than willing and does it in a few minutes. We can use things like this to our advantage, how many people have ever seen a repo with 100% test coverage? With ai this is very possible, with people not so much.
LLMs write terrible code, we know this, but when dealing with humans who write terrible code we have many techniques. We should be using those same techniques to keep the LLMs honest, but more importantly verifiable.
Then you're right back on track.
In a way it's not that different from a human-made project. Plenty of teams have to crunch, ignoring the architecture and incurring tech debt, and then come back and fix it later.
I have to periodically get it to do a bunch of refactoring
Can someone with more experience with it (or similar tools) chime in and confirm that this isn't just more AI snake oil? :)
[0] https://openspec.dev/
The rewrite is me sitting down with a blank doc and drawing the boxes before any code exists. Then the CLAUDE.md enforces what I already decided. Whether that actually holds up as the project grows, I genuinely don't know yet.
And I'm sure the rewrite is going to teach me a whole different set of lessons...
Not sure why good coverage wouldn't mitigate risk in a refactor...
My mantra whenever I'm working with AI is that I want it to know what "point b" looks like and be able to tell by itself whether it's gotten there...
If you have a working implementation, it sounds like you have a basis for automated tests to be written... once you have that (assuming that the tests are written to test the interface rather than the implementation), then it should be fairly direct to have an agent extract and decompose...
For example, consider a lint rule that bans Kysely queries on certain tables from existing outside of a specific folder. You'd write a rule like this in an effort to pull reads and writes on a certain domain into one place, hoping you can just hand the lint violations to your AI agent and it would split your queries into service calls as needed.
And at first, it will appear to have Just Worked™. You are feeling the AGI. Right up until you start to review the output carefully. Because there are now little discrepancies in the new queries written (like not distinguishing between calls to the primary vs. the replica, missing the point of a certain LIMIT or ORDER BY clause, failing to appropriately rewrite a condition or SELECT, etc.) You run a few more reviewer agent passes over it, but realize your efforts are entirely in vain... because even if the reviewer agent fixes 10 or 20 or 30 of the issues, you can still never fully trust the output.
As someone with experience in doing this kind of thing before AI, I went back to doing it the old way: using a codemod to rewrite the code automatically using a series of rules. AI can write the codemod, AI can help me evaluate the results, but actually having it apply all of the few hundred changes automatically led to a lack of my ability to trust the output. And I suspect that will continue to be true for some time.
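(My case was TypeScript/Kysely, but to make "codemod" concrete, here's the shape of the idea as a minimal Go sketch using go/ast; the rewrite rule and identifiers are made up:)

    package main

    import (
        "go/ast"
        "go/parser"
        "go/printer"
        "go/token"
        "os"
    )

    func main() {
        fset := token.NewFileSet()
        // Parse one target file (the path is illustrative).
        f, err := parser.ParseFile(fset, "orders.go", nil, parser.ParseComments)
        if err != nil {
            panic(err)
        }
        // Rule: every db.Query(...) becomes ordersvc.Query(...).
        // Unlike an agent, this applies the same change every time.
        ast.Inspect(f, func(n ast.Node) bool {
            if sel, ok := n.(*ast.SelectorExpr); ok {
                if id, ok := sel.X.(*ast.Ident); ok && id.Name == "db" && sel.Sel.Name == "Query" {
                    id.Name = "ordersvc"
                }
            }
            return true
        })
        printer.Fprint(os.Stdout, fset, f)
    }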
This industry needs a "verification layer" that, as far as I know, it does not have yet. Some part of me hopes that someone will reply to this comment with a counterexample, because I could sorely use one.
A really screwed code base blows out your context window and just starts burning tokens as the AI works out a way to kill -9 itself to escape the hell you're subjecting it to.
The quality gates are up to you, and if you are smart you will make a lot of them and review them closely
I feel the same way about coding, its a source of pride for me and when I hear people say I should resign myself to being an "ideas guy" while chatgpt actually creates things I find the very concept to be distasteful regardless of whether or not it can outperform me.
I now add a long list of instructions on how to work with the type system and some do’s and don’ts. I don’t see myself as a vibe coder. I actually read the damn code and instruct the AI to get to my level of taste.
What has generally worked for me is paraphrasing the old adage "Write the data structures and the code will follow" over to AI. Design your data, consider the design immutable, and let the AI try to fill in the necessary code (well, with some guidance). If it finds the data structures aren't enough, have it prompt you instead of making changes on its own. AI can do a lot of the low-hanging fruit, and often the harder ones as well, as long as it's bound to something.
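A sketch of what that looks like in Go (names hypothetical): the types are the fixed contract, and the AI fills in implementations behind them.

    // Package tasks: the data design is decided up front and treated
    // as immutable; generated code has to fit these shapes.
    package tasks

    // Task is the core record; its fields are the contract.
    type Task struct {
        ID        int
        Title     string
        Done      bool
        DependsOn []int // IDs of prerequisite tasks
    }

    // Store is the boundary the generated code must respect. If the
    // AI finds these shapes insufficient, it should ask, not change them.
    type Store interface {
        Add(t Task) error
        Complete(id int) error
        Ready() []Task // tasks whose prerequisites are all done
    }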
Yet, for now, AI at best has been something that relieves me from having to write long strings of boring code: it's not sustainable to keep developing stuff relying on AI alone. It's also great when quality is not an issue; for any serious work, AI has not sped me up noticeably. I still need to think through the hard parts, and whatever I gain in generating code I lose in managing the agents. But I can parallelise code generation, try new approaches, and explore more, because AI is cheap. AI is also pretty good at going through the codebase and reasoning about dependencies, whether in the context of adding a new feature or fixing a bug: I often let AI create a proof-of-concept change, then extract the important bits out of it and usually trim the diffs down to 1/3 or less.
AI further helps with non-work, i.e. tasks that you have to do in order to fulfill external demands and requirements, and not strictly create anything solid and new. I can imagine AI creating various reports and summaries and documentation, perhaps mostly to be consumed and condensed by another AI at the receiving end. Sadly, all of this is mostly things not worth doing anyway.
Overall, I cringe under all the hype that's been laid on AI: it's a new tool that's still looking for its box or niche carveout, not a revolution.
I see this in Claude too, but I also see this in junior engineers. In the case of Claude, I simply ask it to refactor immediately after each feature is done. The human is still responsible for what the AI writes, so if the AI writes code that’s gross, I would never push it, lest it sully my name and my reputation for code quality.
I have found small iterations to have the best results. I'm not giving AI any chance to one shot it. For example, I won't tell it to "create a fleet view" but something more like "extract key binding to a service" so that I can reuse it in another view before adding another view. Basically, talk to the AI as an engineer talking to another engineer at the nitty gritty level that we need to deal with everyday, not a product person wishing for a business selling point to magically happen.
The very worst things you can do in a codebase are (a) not deeply understand how it works (have it be magic) and (b) be lazy and mess up the structure.
How do you fix a problem which happens at 2:00am and takes your system down if you don't have an excellent understanding of how it works?
Over time we're already bad at (a) because most developers hate writing documentation so that knowledge is invariably lost over time.
Isn't Golang relatively easier to read than Rust? I was under the impression that Rust is a more complex language syntactically.
> The other change is simpler: I'm doing the design work myself, by hand, before any code gets written. Not a vague doc. Concrete interfaces, message types, ownership rules. The architecture decisions that the AI kept making wrong are now made in writing before the first prompt.
This post is good to grasp the difference between "vibe-coding" and using the AI to help with design and architectural choices done by a competent programmer (I am not saying you are not one). Lately I feel that Opus 4.7 involves the user a lot more, even when given a prompt to one-shot a particular piece of software.
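As a rough illustration (not the article author's actual code), "concrete interfaces and message types" decided before the first prompt can be as small as this Go sketch:

    // Decided by hand before any prompt; all names invented.
    package design

    // Msg is the only way views communicate; views never read
    // each other's state directly.
    type Msg interface{ isMsg() }

    type ItemSelected struct{ ID string }
    type RefreshRequested struct{}

    func (ItemSelected) isMsg()     {}
    func (RefreshRequested) isMsg() {}

    // Each View owns its own model; a router owns dispatch.
    type View interface {
        Update(Msg) (View, []Msg)
        Render(width, height int) string
    }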
+1 on Opus 4.7 involving the user a lot more. Right now I'm trying to get to a state where I can codify my design and decision preferences as agent personas and push myself out of the dev loop.
Good architecture in any language is obvious to someone who is experienced and cares.
Go is actually great for bots to write if you’re actually thinking.
> Go reads fine whether the architecture is good or bad
Were you reading the Golang code all along and got fooled or did you review it after it failed? Sorry I admit I didn't read the whole article.
It sounds like the author knows Rust, and might not be as familiar with Go.
A language that you are proficient in is always going to be easier to read than one you aren’t, even if the latter is objectively an easier language to read in general.
I’ve used AI tools to do i18n translations to Spanish and Portuguese (somewhat ashamed to admit this). I’ve grown more familiar with the structure of these languages, and come to recognize some of the common vocabulary for our agtech domain. If anything, I feel more clueless about both languages now than I did before, when it comes to any sort of proficiency.
The framework could be an isolation layer against vibe-rot, but I'm not sure it's necessary for my small project - the one I always wanted to do and never did anything with.
For another tool, I will try another approach: start with a deep investigation and a spec written together with AI, then start with the core architecture layout and then add features.
So instead of just prompting "write a golang project with a http server serving xy, and these top 3 features", I will prompt "create a basic golang scaffold for build and test" -> "create a basic http server with a basic library doing xy" -> "define api spec" -> "write feature x".
There is kind of a skill and depth to vibe coding though.
Will that improve or get worse? One would argue that LLMs in general are drastically more competent now than they were a couple years ago, they’re also much better at coding. We’re likely just now entering the era where they can code but are still not what you’d fully expect, or at least not what someone with absolutely no coding knowledge could use to code at the same level as someone who does know how to code.
Maybe that changes as the models improve, maybe it doesn’t, only time will tell.
Actually, I am curious to try something like that myself. Is there an existing orchestration engine (or single agent) which can spawn multiple subagents and keep passing their feedback/output between each other until all of them agree the overall assignment is complete?
Eventually, like every hype wave, the dust will settle, and we'll see where we stand.
By now all the AI companies have consumed all human knowledge so they either learn to actually think for themselves, or that is it.
Either way, that won't change the ongoing layoffs made while pursuing the AI dream from management's point of view.
I think most companies doing layoffs are bloated to begin with, AI is just the scapegoat to do the layoffs.
Translation and asset generation teams for enterprise CMS, whose role has now been taken by AI.
Likewise traditional backend development, that was already reduced via SaaS products, serverless, iPaaS low code/no code tooling, that now is further reduced via agents workflow tooling, doing orchestration via tools (serverless endpoints).
Hey I don't want to over simplify, I'm sure it was complicated, but did the author have functional tests for these broken views? As long as there are functional tests passing on the previous commit I'd have thought that claude could look at the end situation and work out how to get the desired feature without breaking the other stuff.
TUIs aren't an exception, it's still essential to have a way to end-to-end test each view.
You can't test every permutation of app usage. You actually need good architecture so you can trust your tests, and so changes stay local with minimal side effects.
In that situation you have two choices:
1. Tell claude to iterate until the tests for the new view and the old views are all passing.
2. git reset --hard back to the previous commit at which all tests are passing and tell claude to try again, making sure not to break any tests.
It's essential to use tests when vibecoding anything non trivial. Almost certainly in a TDD style.
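This gets much easier if views are ELM-style pure update functions; then "end-to-end test a view" is an ordinary Go test with no terminal involved (a sketch; names invented):

    package views

    import "testing"

    // update is pure: state in, key in, state out. No terminal I/O,
    // so every view behavior is testable.
    type fleetView struct{ selected, rows int }

    func update(v fleetView, key string) fleetView {
        switch key {
        case "down":
            if v.selected < v.rows-1 {
                v.selected++
            }
        case "up":
            if v.selected > 0 {
                v.selected--
            }
        }
        return v
    }

    func TestSelectionStaysInBounds(t *testing.T) {
        v := fleetView{rows: 3}
        for i := 0; i < 10; i++ {
            v = update(v, "down") // mash the key past the end
        }
        if v.selected != 2 {
            t.Fatalf("selected = %d, want 2", v.selected)
        }
    }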
Personally, I've taken the time it's freed up to spend more time on mathacademy and reading more theory-oriented books on data structures and algorithms. AI coding systems are at their best when paired with someone with broad knowledge. Knowing what to ask for, and knowing the vocabulary to be specific about what you want built, is going to be a much more valuable job skill going forward.
One example is a small AI-based learning system I have been developing in my free time to help me learn. The MVP stored an entire knowledge graph and progress in markdown files. Being an engineer, I knew this wouldn't scale, so once I proved the concept viable, I moved everything into SQLite with a graph DB. Then I decided to wrap some parts of the functionality in Rust and put everything behind a small Rust layer, with the progress-tracking logic still in Python.
Someone with no knowledge of graph databases or dependency graphs or heuristics would not be able to build this even if they had AI. They simply don't know what they don't know, and AI won't save you there.
That said, I think it's important to also spend time in the dirt. I've recently started picking up Zig as my no-AI language, just to keep those skills sharp.
I'm really curious if we'll seesaw once AI costs go up 10x.
I've only been using kimi 2.5 and deepseek pro for reviewing PRs for security issues. Less than 10% of my workflow requires a full-powered frontier model.
I think the issue is overblown by people who think Claude Code is a good harness and use Opus for everything. opencode is objectively better. It's much more verbose about what it's doing, you have more control when it comes to offloading to subagents with targeted context (crucial for running through larger jobs), and I can swap between Codex and open-weight models.
However everything in this field is cargo culting. We have absolutely no way to quantify productivity in the real world.
We've had advanced programming languages backed by advanced programming language theory for decades, and the most-used programming languages in the world are C, PHP and JavaScript: languages held together by duct tape or, in the case of C, programming language theory from the 60s.
We have a super minimal JavaScript runtime in the browser to avoid a bloated standard library, and then people invent things like leftpad. At the same time, basically every major website in the world serves megabytes of tracking and ad-serving libraries.
We all "know" AI makes coders more productive but nobody can do the equivalent of a clinical trial for a major new drug.
But I will say... you have to know Golang. You have to have at least tried to make a BubbleTea app yourself and try to understand ELM architecture. You have to look at the code and increment with it.
It makes total sense for OP to switch to Rust and Ratatui if they don't know Golang well. But I don't think it's a better language for it. [Ratatui has brought me great inspiration though!]
Independent of framework, the LLMs get the spatial relationships. I say things like "the upper right panel's content is not wrapping inside and the panel's right edge should extend to the terminal edge" and the LLM will fix it. They can see the resultant text, I'm copy-pasting all the time.
TUI code is finicky; one mis-rendered component mucks everything up. The LLMs will decide on their own to make little, temporary BubbleTea fixtures to help themselves understand when things aren't right.
The only real problem with LLMs and BubbleTea is that upon first prompt, they insist on using BubbleTea v1 versus BubbleTea v2, released in December 2025. But then you just point it to the V2_UPGRADE.md and it gets back on track. That will improve as training cutoffs expand.
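For anyone who hasn't tried it, the whole ELM shape in BubbleTea v1 fits in a screenful (a from-memory sketch, so treat the details loosely):

    package main

    import (
        "fmt"
        "log"

        tea "github.com/charmbracelet/bubbletea"
    )

    // model is the single source of truth; Update is the only place
    // it changes; View renders it. That's the whole ELM loop.
    type model struct{ count int }

    func (m model) Init() tea.Cmd { return nil }

    func (m model) Update(msg tea.Msg) (tea.Model, tea.Cmd) {
        if key, ok := msg.(tea.KeyMsg); ok {
            switch key.String() {
            case "q":
                return m, tea.Quit
            case "+":
                m.count++
            }
        }
        return m, nil
    }

    func (m model) View() string {
        return fmt.Sprintf("count: %d (+ to bump, q to quit)\n", m.count)
    }

    func main() {
        if _, err := tea.NewProgram(model{}).Run(); err != nil {
            log.Fatal(err)
        }
    }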
I vibe-coded this TUI for Mom's last night. I actually started with Grok (who started with v1) and then moved into Claude Code after some iteration:
https://gist.github.com/neomantra/1008e7f2ad5119d3dd5716d52e...
I still do, but I used to, too.
https://github.com/github/spec-kit
Getting a plan isn't a panacea but is a better way to limit downstream slop than just vibing without one.
I don't go as fast as with other agents, but this works for me, and I enjoy the process.
Also 1600 lines... didn't any agent reviewing the diffs point that out?
You're also adding a lot to claude.md. I dunno how much that file has grown, but with a big claude.md file with many instructions, I don't think the AI will be able to remember all those rules.
In my experience, no. These tools suck at refactoring, mostly choosing to add more code instead.
Do they write empty functions and let AI fill them in?
Or do they use some kind of specification language?
Are people designing those languages?
That trial-and-error process is still happening with an LLM, but much faster, and with instantaneous cross-references to various forms of documentation that I would otherwise be looking up myself. It produces code of a quality that depends on the engineer knowing what they want in the first place, prompting for it, and refining its output correctly.
It's the exact same process of sculpting code that the majority of the industry was doing "by hand" prior to the release of LLMs, but faster, and the harnesses are only getting better. To "vibe code" is to prompt vaguely and ignore the quality of the output. You're coming to a forum full of professionals and essentially telling us that you're getting really frustrated with your Scratch project.
I don't know if you're trying to lead a charge or whatever but good luck with that. As a senior SWE, it is clear to me that this is the new paradigm until something better than LLMs comes along. My workflows and efficiency have been vastly improved. I will admit that I have never really been a "I made a SMTP server in 3k of Rust" kind of guy, though.
If you understand good software architecture, architect it. Create a markdown document just as you would if you had a team of engineers working with you and you were handing off to them. Be specific.
Let the AI do the implementation of your architecture.
This. I definitely agree with this statement at this point in AI-assisted development. This gets at the "taste" factor that is still intrinsically human, especially in software engineering. If you can construct and guide the overall architecture of an application or system, AI can conceivably fill in the smaller feature bits, and do so well. But it must have a strong architecture and opinionated field in which to play.
Another note for me was e2e tests: while AI can write them, it never comes up with even the basic organization or abstraction required to manage a large e2e test suite with hundreds of tests. It immediately starts to produce spaghetti code.
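Concretely, the missing abstraction is usually something as simple as a shared helper that owns setup and assertions, so each of the hundreds of tests reads as intent rather than plumbing. A rough sketch, with the endpoint and helper names invented:

    // Hypothetical shared e2e helper: tests call App instead of re-inlining
    // HTTP plumbing, which is the spaghetti the AI otherwise produces.
    package e2e

    import (
        "net/http"
        "testing"
    )

    type App struct {
        t      *testing.T
        base   string
        client *http.Client
    }

    func NewApp(t *testing.T) *App {
        t.Helper()
        return &App{t: t, base: "http://localhost:8080", client: http.DefaultClient}
    }

    func (a *App) Get(path string) *http.Response {
        a.t.Helper()
        resp, err := a.client.Get(a.base + path)
        if err != nil {
            a.t.Fatalf("GET %s: %v", path, err)
        }
        a.t.Cleanup(func() { resp.Body.Close() })
        return resp
    }

    func (a *App) ExpectStatus(resp *http.Response, want int) {
        a.t.Helper()
        if resp.StatusCode != want {
            a.t.Fatalf("got status %d, want %d", resp.StatusCode, want)
        }
    }

    // Each test then stays a couple of lines of intent:
    func TestHealthEndpoint(t *testing.T) {
        app := NewApp(t)
        app.ExpectStatus(app.Get("/health"), http.StatusOK)
    }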
The problem with this dev's approach is not AI, it's their use of it. They didn't ensure that the architecture made sense. They didn't look at the code and get a "feel" for it. They didn't do the whole build stuff, step back, refactor, rinse and repeat dance. The need for that hasn't gone away; if anything, it's even more important now. Because you can spit out code 100x faster than you could before, your tech debt compounds 100x faster. The earlier you refactor, the less work it is.
I usually give the agent a solid idea of what I want, often down to the API interfaces. Then every now and then, I'll go through the code and ensure that everything makes sense, and that I'm not just spitting out code that works, but building a codebase that scales.
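"Down to the API interfaces" can be as little as handing the agent a contract like the following and writing tests against it. This is a hypothetical sketch, not from a real codebase; the names are invented:

    package progress

    import "time"

    // Store is the contract the agent implements. Tests target this
    // interface, so the implementation can't quietly drift from the design.
    type Store interface {
        // Record marks an item as reviewed at time t.
        Record(itemID string, t time.Time) error
        // Due returns the IDs of items due for review at or before now.
        Due(now time.Time) ([]string, error)
    }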
The ones who are “AI pilled” and the contagious lepers.
With that said, this caught my eye:
> AI gravitates toward single-struct-holds-everything because it satisfies the immediate prompt with minimal ceremony.
This is too general. "AI" is used here as a catch-all, but in fact, it was the specific model under the specific conditions you ran your prompt, including harness, markdowns, PRDs, etc. So it's not fair to say "AI does X!" in this case.
It's also very much up to you. It's very common to have a frontier model plan an architecture before you have another model implement code. If you're just one-shotting an LLM to do everything you get mediocre, more brittle code.
This stuff is still being figured out by a lot of people. But I feel the core of the issue is not using AI well. Scoping, task alignment, validation, are crucial.
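To make the quoted pattern concrete, here's a hedged sketch of the two shapes: what one-shotting tends to produce versus what a planning pass tends to produce. All names are invented:

    package app

    import "database/sql"

    // The single-struct-holds-everything shape: satisfies the immediate
    // prompt, but couples every concern to every other.
    type AppState struct {
        Users     []string
        CartItems []string
        DB        *sql.DB
        HTTPPort  int
        LastErr   error
    }

    // The planned shape: responsibilities split along boundaries, so each
    // piece can change (and be tested) alone.
    type UserStore struct{ db *sql.DB }
    type Cart struct{ items []string }
    type Server struct {
        port  int
        users *UserStore
    }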
> For 7 months I'd been prompting and shipping without ever sitting down and actually reading the code Claude wrote.
But every time I read something like this, I seriously wonder about the mental state of the person that wrote it.
How do you get to this point?
Some states, for example, are meant to be inferred from the data shape rather than from actual state fields, but damn, they like adding a state field.
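A small sketch of what that means (names invented): the flag below only restates what the data already says, and it's exactly the kind of field they keep reaching for:

    package model

    // The habit: an explicit state field that can drift out of sync.
    type ListModel struct {
        items  []string
        err    error
        loaded bool // redundant; restates what items and err already imply
    }

    // Assuming the state from the data shape instead:
    type List struct {
        items []string
        err   error
    }

    // loading is derived, so it can never disagree with the data.
    func (l List) loading() bool { return l.items == nil && l.err == nil }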
It would have been easy to run a few AI agents to review the code and find these issues as well, and to architect it cleanly.
Inb4 “you’re gonna be replaced”: god damn it, I hope so. I do not want to spend the rest of my life behind a computer screen…
But in my main work, reverse engineering, LLMs have been a godsend for years now.
You can basically brute-force binary obfuscation thanks to them. And thanks to eager Chinese LLM providers, basically for free.
But I always use the LLM only for the boring work; the rest is for me to do manually, or with scripts of course, but ones made by me. Because I want to learn.
Yes, there are a lot of people using LLMs for full RE automation, since they're selling exploits for profit. No problem with me.
I see a funny future for huge corporations like Adobe, etc.
Imagine the prompt: "Hey Claude, re-implement Adobe Photoshop with a clean-room design." One agent opens a decompiler and outputs complete low-level technical details of how everything is implemented.
A second agent implements the new Photoshop based on that.
They will be mad, and I like this.
You will own nothing, and you will be happy, corpos.
Yea, that's why engineers are still very important for now (until models can do these kinds of longer-term designs and stick to them).
This is what I was doing right from the beginning. AI just fills out methods and does other low-intelligence work. Both of us are happy. My architectures and code are really mine, easy to read and reason about. AI gets paid and does not get a chance to fuck me in the process. At no point have I felt any temptation to leave the "serious" parts to AI.
But here's the thing: you almost never know what the architecture is up front. If you do, you probably aren't the one writing the actual code anymore. Writing the code, with or without an AI, is part of the design process. Most people don't actually know what the architecture needs to be until they've tried several times, fucked it up a bunch, and refactored or rewritten even more.
Now I do feel lucky that I started learning coding about four years before the LLM revolution, but these things are really just natural language compilers, aren’t they? We’re just in that period - the 1980s, the greybeards tell me - where companies charged thousands of dollars per compiler instance, right? And now, I myself have never paid for a compiler.
This whole investor bubble will blow up in the face of the rentier-finance capitalists and I’ll be laughing my head off while it happens.
But your prompts are not the only inputs. Among other things, there is a random seed injected by the vendor.
That is a primary source of non-determinism.
Then, of course, there's the fact that you don't personally have an old copy of the model, the vendor isn't going to keep the model around forever, and there are no unit tests to make sure that, faced with the prompts you gave it before, newer models won't suffer major regressions in the functionality you were relying on.
And even if there were no non-determinism, the models suffer greatly (much more so than traditional compilers) from the butterfly effect.
It is literally impossible to pin down part of your prompt in such a way that it will always contribute to good outcomes, and such that you can vary a tiny bit of the prompt and get logically correlated, tiny variations in the output.
7 months ago was early November. Coding assistants were getting very good back then, but they were still significantly poorer at making good architectural decisions in my experience. They tended to just force features into the existing code base without much thought or care.
Today I've noticed assistants tend to spot architectural smells while working and will ask you whether they should try to address it, but even then they're probably never going to suggest a full refactor of the codebase (which probably is generally the correct heuristic).
My guess is that if you built this today with AI, you wouldn't run into so many of these problems. That's not to say you should build blind, but the first thing that stood out to me was that you started building 7 months ago, when coding assistants were only just becoming decent, and undirected they would still generally generate total slop.
Time to become a "product engineer" and watch the hyper-agile agents putting up digital post-it notes on digital pin-boards, discussing how much each post-it is worth in digital scrum meetings. Meanwhile, the agents keep wasting more and more time, so that their owners make less and less of a loss, until eventually a profit is made.
Until the costs become prohibitive and humans become cheaper than the agents that replaced them. Once the agents are replaced by the humans, the next hype bubble awaits around the bend.
/s