Scaling long-running autonomous coding

(cursor.com)

290 points | by samwillis 23 days ago

49 comments

simonw 23 days ago
"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."
I shared my LLM predictions last week, and one of them was that by 2029 "Someone will build a new browser using mainly AI-assisted coding and it won’t even be a surprise" https://simonwillison.net/2026/Jan/8/llm-predictions-for-202... and https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3913s
This project from Cursor is the second attempt I've seen at this now! The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...
[-]
- snowmobile 21 days ago
  Well, it doesn't surprise me that this project is just a non-compiling clone of an existing browser. Says a lot about AI in general, don't you think? https://news.ycombinator.com/item?id=46649046
- mrefish 23 days ago
  Time to raise the bar. By 2029 someone will build a new browser using mainly AI-assisted coding and the surprise is that it was designed to be used by pelicans.
  [-]
  - embedding-shape 22 days ago
    > Time to raise the bar
    Lets make someone pass the one we have, this experiment didn't seem to yield a functioning browser, why would we raise the bar?
    [-]
    - mrefish 22 days ago
      > why would we raise the bar?
      The web needs to be more p5n friendly.
- scott_waddell 10 days ago
  That timeline feels aggressive but not impossible. The tooling has gotten scary good - I've seen so many people (including myself) prototype complex UIs in hours that would've taken weeks before. Browser engines are a real challenge though.
- jcfrei 22 days ago
  Surely a smart implementation would just find the chromium source on github, do some cosmetic rewrites and strip out all none-essential features?
  [-]
  - simonw 22 days ago
    You'd be able to see it doing that by looking at the transcript. You could then tell it not to!
    [-]
    - snowmobile 21 days ago
      I suppose Cursor forgot to tell their AI that, before claiming that it built everything "from-scratch"
- afishhh 22 days ago
  > The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...
  I took a 5-minute look at the layout crate here and... it doesn't look great:
  1. Line height calculation is suspicious, the structure of the implementation also suggests inline spans aren't handled remotely correctly
  2. Uhm... where is the bidi? Directionality has far reaching implications on an inline layout engine's design. This is not it.
  3. It doesn't even consider itself a real engine:
```
        // Estimate text width (rough approximation: 0.6 * font_size * char_count)
        // In a real implementation, this would use font metrics
        let char_count = text.chars().count() as f32;
        let avg_char_width = font_size * 0.5; // Approximate average character width
        let text_width = char_count * avg_char_width;
```
  I won't even begin talking about how this particular aspect that it "approximates" also has far reaching implications on your design...
  I could probably go on in perpetuity about the things wrong with this, even test it myself or something. But that's a waste of time I'm not undertaking.
  Making a "browser" that renders a few particular web pages "correctly" is an order of magnitude easier than a browser that also actually cares about standards.
  If this is how "A Browser for the modern age." looks then I want a time machine.
  [-]
  - PaulHoule 22 days ago
    I saw a "web browser" that was AI generated in maybe 2k lines of python based on tkinter that tried to support CSS and probably was able to render some test cases but didn't at all have the shape of a real web browser.
    It reminds of having AI write me an MUI component the other day that implemented the "sx" prop [1] with some code that handles all the individual properties that were used by the component in that particular application, it might have been correct, the component at all was successful and well coded... but MUI provides a styled() function and a <Box> component, either one of which could have been used to make this component handle all the properties that "sx" is supposed to handle with as little as one line of code. I asked the agent "how would I do this using the tools that MUI provides to support sx" and had a great conversation and got a complete and clear understanding about the right way to do it but on the first try it wrote something crazy overcomplicated to handle the specific case as opposed to a general-purpose solution that was radically simple. That "web browser" was all like that.
    [1] you can write something like sx={width: 4} and MUI multiplies 4 by the application scale and puts on, say, a width: 20px style
    [-]
    - logicallee 21 days ago
      Thank you for the detailed feedback, though we would prefer for you to comment on the announcement threads where you see it. We really appreciate the feedback.
      You're referring to State of Utopia's[1] web browser, currently available here:
      https://taonexus.com/publicfiles/jan2026/172toy-browser.py.t... (turn the volume down if you play the included easter egg mini-game as it's very loud.)
      10-minute livestream demonstration:
      https://www.youtube.com/watch?v=4xdIMmrLMLo&t=45s
      That livestream demonstration is side-by-side with Chrome, rendering very simple pages.
      It compiles, renders simple web pages and is able to post.
      The differences between cursor's browser and our browser:
      - Cursor's long-running autonomously coded browser: over a million lines of code and a trillion tokens, which is computationally intensive and has a high cost. - State of Utopia's browser: under 3000 lines of code. - Cursor's browser: does not compile at present. There's no way to use it. - State of Utopia's browser: compiles in every version. You can use it right away, and it includes a fun easter-egg game. - Cursor's browser: can't make form submissions - State of Utopia's browser: can make form submissions.
      I'm submitting this using that browser. (I don't know if it will really post or not.)
      We are taking feature requests!! Submit your requested feature here:
      https://pollunit.com/polls/ahysed74t8gaktvqno100g
      We are happy to put any feature you want into the web browser.
      [1] will be available at https://stateofutopia.com or https://stofut.com for short (St. of Ut.)
  - bouk 22 days ago
    Sure, but getting this far would be inconceivable just half a year ago. It will only get better as time passes
- cube00 22 days ago
  On Jan 1 2026
  > Given how badly my 2025 predictions aged I'm probably going to sit that one out! [1]
  Seven days later you appear on the same podcast you appeared on in 2025 to share your LLM predictions for 2026.
  What changed?
  [1]: https://news.ycombinator.com/item?id=46450269
  [-]
  - marcellus23 22 days ago
    He changed his mind? The comment you're citing seems partly tongue-in-cheek anyway, but even if it wasn't, how is this some kind of gotcha?
  - simonw 22 days ago
    Bryan got in touch and said "you're being too hard on yourself, those predictions were actually pretty good".
- leptons 23 days ago
  Great, they can call it "artificial Internet Explorer", or aIE for short.
- carlesonielfa 22 days ago
  Its impressive, but how sure are we that the code for the current open source browsers isn't part of the model's training data?
  [-]
  - simonw 22 days ago
    It turns out the Cursor one is stitching together a ton of open source components already.
    That said, I don't really find the critique that models have browser source code in their training data particularly interesting.
    If they spat out a full, working implementation in response to a single prompt then sure, I'd be suspicious they were just regurgitating their training data.
    But if you watch the transcripts for these kinds of projects you'll see them make thousands of independent changes, reacting to test failures and iterating towards an implementation that matches the overall goals of the project.
    The fact that Firefox and Chrome and WebKit are likely buried in the training data somewhere might help them a bit, but it still looks to me more like an independent implementation that's influenced by those and many other sources.
    [-]
    - troupo 21 days ago
      > The fact that Firefox and Chrome and WebKit are likely buried in the training data somewhere might help them a bit, but it still looks to me more like an independent implementation that's influenced by those and many other sources.
      They generate a statistically appropriate token based on a very small context window. And they are slightly nerfed not to reproduce everything verbatim because that would bring all sorts of lawsuits.
      Of course they are not reproducing Webkit or Blink or Firefox verbatim. However, it's not an "independent implementation". That's why it's "stringing together a bunch of open-source components": https://news.ycombinator.com/item?id=46649586
      Edit: also, this "independent implementation" cannot be compiled by their own CI and doesn't work, apparently.
- bob1029 23 days ago
  The goal I am currently using for long horizon coding experiments is implementation of a PDF rasterizer given an ISO32000 specification document.
  [-]
  - llothar68 9 days ago
    I am currently using AI totry to improve pdfium to make it multithreaded and a few more features. Like partial network loading.
  - xenni 23 days ago
    We're almost there, I've been working on something similar using a markdown'd version of the ISO32000 spec
- hahahahhaah 22 days ago
  Web browser should be easy as source exists. Fix all SVG bugs in my browser tho...
  [-]
  - viraptor 22 days ago
    There are 3.5 serious open codebases of web browsers currently. Only two are full featured. It's not nothing, but it's very far from "source exists so it's easy to copy what they do".
    [-]
    - machiaweliczny 22 days ago
      But detailed specs exists for both HTML and JS and tests also exists and unlimited amount of test data. You can just try running webpage or program and also have reference implementations - it's much easier for agents to understand that. Also HTML they know super well from scraping whole internet but still impressive.
    - llothar68 9 days ago
      Ladybird and servo and quite a few older are also considerable
- cheevly 23 days ago
  2029? I have no idea why you would think this is so far off. More like Q2 2026.
  [-]
  - xmprt 23 days ago
    You're either overestimating the capabilities of current AI models or underestimating the complexity of building a web browser. There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.
    [-]
    - tocsa 12 days ago
      Even though several people seconded the complexity of a browser, I must add one more take and bring up one of my all time favorite blog posts, back from 2000 (I am old), when browsers were already complex, Joel Spolsky's Joel On Software episode "Things You Should Never Do, Part I" https://www.joelonsoftware.com/2000/04/06/things-you-should-... His first example was Netscape browser v6.0, and why there wasn't a v5.0 after v4.0, why it took three years: "They did it by making the single worst strategic mistake that any software company can make: They decided to rewrite the code from scratch." I think this blog post is very relevant here.
    - torginus 22 days ago
      Even if AI will not achieve the ability to perform at this level on its own, it clearly is going to be an enormous force multiplier, allowing highly skilled devs to tackle huge projects more or less on their own.
      [-]
      - thesz 22 days ago
        Skilled devs compress, not generate (expand).
        https://www.youtube.com/watch?v=8kUQWuK1L4w
        The "discoverer" of APL tried to express as many problems as he could with his notation. First he found that notation expands and after some more expansion he found that it began shrinking.
        The same goes to Forth, which provides means for a Sequitur-compressed [1] representation of a program.
        [1] https://en.wikipedia.org/wiki/Sequitur_algorithm
        Myself, I always strive to delete some code or replace some code with shorter version. First, to better understand it, second, to return back and read less.
    - rvz 23 days ago
      It's most likely both.
      > There are tons of tiny edge cases and standards to comply with where implementing one standard will break 3 others if not done carefully. AI can't do that right now.
      Firstly the CI is completely broken on every commit, all tests have failed and its and looking closely at the code, it is exactly what you expect for unmaintainable slop.
      Having more lines of code is not a good measure of robust software, especially if it does not work.
    - rlt 22 days ago
      Not only edge cases and standards, but also tons of performance optimizations.
  - gordonhart 23 days ago
    Web browsers are insanely hard to get right, that’s why there are only ~3 decent implementations out there currently.
    [-]
    - qingcharles 22 days ago
      The one nice thing about web browsers is that they have a reasonably formalized specification set and a huge array of tests that can be used. So this makes them a fairly unique proposition ideally suited to AI construction.
      [-]
      - pleurotus 22 days ago
        As far as I read on Ladybird's blog updates, the issue is less the formalised specs, and more that other browsers break the specs, so websites adjust, so you need to take the non-compliance to specs into account with your design
  - johnfn 23 days ago
    You should make your own predictions, and then we can do a retrospective on who was right.
  - mkoubaa 23 days ago
    Yeah if you let them index chromium I'm sure it could do it next week. It just won't be original or interesting.
  - geeunits 23 days ago
    [flagged]
    [-]
    - dang 22 days ago
      Please don't cross into personal attack on HN.
      https://news.ycombinator.com/showhn.html
- keepamovin 22 days ago
  That makes a lot of sense for massive-scale efforts like a browser, using coordinated agents to push toward a huge, well defined target with existing benchmarks and tests.
  My angle has been a bit different: scaling autonomous coding for individual developers, and in a much simpler way. I love CLI agents, but I found myself wasting time babysitting terminals while waiting for turns to finish. At some point it clicked: what if I could just email them?
  Email sounds backward, but that’s the feature. It’s universal, async, already collaborative. The agent sends me a focused update, I reply with guidance, and it keeps working on a server somewhere, or my laptop, while I’m not glued to my desk. There’s still a human in the loop, just without micromanagement.
  It’s been surprisingly joyful and productive, and it feels closer to how real organizations already work. I’ve put together a small, usable tool around this and shared it here if anyone wants to try it or kick the tires: https://news.ycombinator.com/item?id=46629191
embedding-shape 23 days ago
Did anyone manage to run the tests from the repository itself? The code seems filled with errors and warnings, as far as I can tell none of them because of the platform I'm on (Linux). I went and looked at the Action workflow history for some pages, and seems CI been failing for a while, PRs also all been failing CI but merged. How exactly was this verified to be something to be used as an successful example, or am I misunderstanding what point they are trying to make? They mention a screenshot, but they never actually mention if their goal was successfully met, do they?
I'm not sure the approach of "completely autonomous coding" is the right way to go. I feel like maybe we'll be able to use it more effectively if we think of them as something to be used by a human to accomplish some thing instead, lean into letting the human drive the thing instead, because quality spirals so quickly out of control.
[-]
- snek_case 23 days ago
  I found the codebase very hard to navigate. Hundreds (over a thousand?) tiny files with less than 200 lines of code, in deeply nested subdirectories. I wanted to find where the JavaScript engine was, and where the DOM implementation was located, and I couldn't easily find it, even using the GitHub search feature. I'm not exactly sure what this browser implements and how.
  Even their README is kind of crappy. Ideally you want installation instructions right near the top, but it's broken into multiple files. The README link that says "running + architecture" (but the file is actually called browser_ui.md???) is hard to follow. There is no explicit list of dependencies, and again no explanation of how JavaScript execution works, or how rendering works, really.
  It's impressive that they got such a big project to be built by agents and to compile, but this codebase... Feels like AI slop, and you couldn't pay me to maintain it. You could try to get AI agents to maintain it, but my prediction is that past some scale, they would have a hard time figuring out their own mess. You would just be left with permanent bugs you can't easily fix.
  [-]
  - bonesss 22 days ago
    So the chain of events here is: copy existing tutorials and public/available code, train the model to spit it out-ish when asked, a mature-ish specification is used, and now they jitter and jumble towards a facsimile of a junior copy paste outsourcing nightmare they can’t maintain (creating exciting liabilities for all parties involved).
    I can’t shake the feeling that simply being a shameless about copy-paste (ie copyright infringement), would let existing tools do much the same faster and more efficiently. Download Chromium, search-replace ‘Google’ with ‘ME!’, run Make… if I put that in a small app someone would explain that’s actually solvable as a bash one-liner.
    There’s a lot of utility in better search and natural language interactions. The siren call of feedback loops plays with our sense of time and might be clouding or sense of progress and utility.
    [-]
    - kungfuscious 22 days ago
      You raise a good point, which is that autonomous coding needs to be benchmarked on designs/challenges where the exact thing being built isn't part of the model's training set.
      [-]
      - NitpickLawyer 22 days ago
        swe-REbench does this. They gather real issues from github repos on a ~monthly basis, and test the models. On their leaderboard you can use a slider to select issues created after a model was released, and see the stats. It works for open models, a bit uncertain on closed models. Not perfect, but best we have for this idea.
  - datsci_est_2015 22 days ago
    To steelman the vibecoders’ perspective, I think the point is that the code is not meant for you to read.
    Anyone who has looked at AI art, read AI stories, listened to AI music, or really interacted with AI in any meaningfully critical way would recognize that this was the only predictable result given the current state of AI generated “content”. It’s extremely brittle, and collapses at the smallest bit of scrutiny.
    But I guess (to continue steelmanning) the paradigm has shifted entirely. Why do we even need an entire browser for the whole internet? Why can’t we just vibe code a “browser” on demand for each web page we interact with?
    I feel gross after writing this.
    [-]
    - embedding-shape 22 days ago
      If it's not meant to be read, and not meant to be run since it doesn't compile and doesn't seem like it's been able to for quite some time, what is this mean to demonstrate?
      That agents can write a bunch of code by themselves? We already knew that, and what's even the point of that if the code doesn't work?
      I feel like I'm still missing what this entire project and blogpost is about. Is it supposed to be all theoretical or what's the deal?
      [-]
      - datsci_est_2015 22 days ago
        You and me both, bud. I often feel these days that humanity has never had a more fractured reality, and worse, those fractures are very binary and tribal. I cope by trying to find fundamental truths that are supported by overwhelming evidence rather than focus on speculation.
        I guess the fundamental truth that I’m working towards for generative AI is that it appears to have asymptotic performance with respect to recreating whatever it’s trying to recreate. That is, you can throw unlimited computing power and unlimited time at trying to recreate something, but there will still be a missing essence that separates the recreation from the creation. In very small snippets, and for very large compute, there may be reasonable results, but it will never be able to completely replace what can be created in meatspace by meatpeople.
        Whether the economics of the tradeoff between “nearly recreated” and “properly created” is net positive is what I think this constant ongoing debate is about. I don’t think it’s ever going to be “it always makes sense to generate content instead of hire someone for this”, but rather a more dirty, “in this case, we should generate content”.
        [-]
        embedding-shape 22 days ago
        No, but this blogpost is on a whole other level. Usually at least the stuff they showcase at least does something, not shovelware that doesn't compile.
    - snek_case 22 days ago
      I've had AI write some very nice, readable code, but I make it go one function at a time.
      [-]
      - datsci_est_2015 22 days ago
        Writing code one function at a time is not the the 100x speed up being hyped all over HN. I also write my code one function at a time, often assisted by various tools, some of them considered “AI”.
        Writing code one function at a time is the furthest thing than what is being showcased in TFA.
  - embedding-shape 22 days ago
    > It's impressive that they got such a big project to be built by agents and to compile
    But that's the thing, it doesn't compile, has a ton of errors, CI seems broken since long... What exactly is supposed to impressive here, that it managed to generate a bunch of code that doesn't even compile?
    What in the holy hackers is this even about? Am I missing something obvious here? How is this news?
    [-]
    - underdeserver 22 days ago
      Looks like it doesn't compile for at least one other guy (I myself haven't tried): https://github.com/wilsonzlin/fastrender/issues/98
      Yeah, answers need to be given.
      [-]
      - snek_case 22 days ago
        Cursor is in the business of selling you more tokens, so it makes sense that they would exaggerate the capabilities of their models, and even advertise it being used to produce lots of code over weeks. This would probably cost you thousands in API usage fees.
    - askl 22 days ago
      > What in the holy hackers is this even about? Am I missing something obvious here?
      It's about hyping up cursor and writing a blog post. You're not supposed to look at or use the code, obviously.
- idopmstuff 22 days ago
  > I'm not sure the approach of "completely autonomous coding" is the right way to go.
  I suspect the author of the post would agree. This feels much more like a experiment to push the limits of LLMs than anything they're looking to seriously use as a product (or even the basis of a product).
  I think the more interesting question is when the approach of completely autonomous coding will be the right way to go. LLMs are definitely progressing along a spectrum of: Can't do it -> Can do it with help -> Can do it alone but code isn't great -> Can do it alone with good code. Right now I'd say they're only in that final step for very small projects (e.g. simple Python scripts), but it seems like an inevitability that they will get there for increasingly large ones.
- csomar 22 days ago
  You can stop reading the article from here:
  > Today's agents work well for focused tasks, but are slow for complex projects.
  What does slow mean? Slower than humans? Need faster GPUs? What does it even imply? Too slow to produce the next token? Too slow in attempts to be usable? Need human intervention?
  This piece is made and written to keep the bubble inflating further.
- seanc 22 days ago
  Code filled with errors and warnings? PR's merged with failing CI?
  So I guess they've achieved human parity then?
  (I'll see myself out)
trjordan 23 days ago
This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.
The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex projects.
But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase .... but it's not merged yet.
I don't want to say, call me when it's merged. But I'm not worried about agents ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and developers who want to build on top of it.
[-]
- dust42 22 days ago
  > This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR.
  I would go even further, why have they not created at least one less complex project that is working and ready to be checked out? To me it sounds like having a carrot dangle in front of the face of VC investors: 'Look, we are almost there to replace legions of software developers! Imagine the market size and potential cost reductions for companies.'
  LLMs are definitely an exciting new tool and they are going to change a lot. But are they worth $B for everything being stamped 'AI'? The future will tell. Looking back the dotcom boom hype felt exactly the same.
  The difference with the dotcom boom is that at the time there was a lot more optimism to build a better future. The AI gold rush seems to be focused on getting giga-rich while fscking the bigger part of humanity.
- risyachka 23 days ago
  >> why haven't they merged that PR.
  because it is absolutely impossible to review that code and there is gazillion issues there.
  The only way it can get merged is YOLO and then fix issues for months in prod which kinda defeats the purpose and brings gains close to zero.
  [-]
  - mkoubaa 23 days ago
    On the other hand, finding fixing issues for months is still training data
- orlp 22 days ago
  > Long-running projects that converge on high-quality, complex projects
  In my experience agents don't converge on anything. They diverge into low-quality monstrosities which at some point become entirely unusable.
  [-]
  - embedding-shape 22 days ago
    Yeah, I don't think they're built for that either, you need a human to steer the "convergtion", otherwise they indeed end up building monstrosities.
- viraptor 22 days ago
  > Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets.
  There's just a bit over 3 browsers, 1 serious excel-like and small part of windows user side. That's really not enough for training for replicating those specific tasks.
- energy123 22 days ago
  > Long-running projects that converge
  This is how I think about it. I care about asymptotics. What initial conditions (model(s) x workflow/harness x input text artefacts) causes convergence to the best steady state? The number of lines of code doesn't have to grow, it could also shrink. It's about the best output.
- dist-epoch 23 days ago
  Pretty much everything exists in the training sets. All non-research software is just a mishmash of various standard modules and algorithms.
  [-]
  - galaxyLogic 23 days ago
    Not everything, only code-bases of existing (open-source?) applications.
    But what would be the point of re-creating existing applications? It would be useful if you can produce a better version of those applications. But the point in this experiment was to produce something "from scratch" I think. Impressive yes, but is it useful?
    A more practically useful task would be for Mozilla Foundation and others to ask AI to fix all bugs in their application(s). And perhaps they are trying to do that, let's wait and see.
    [-]
    - mkoubaa 23 days ago
      You have to be careful which codebase to try this on. I have a feeling if someone unleashed agents on the Linux kernel to fix bugs it'd lead to a ban on agents there
    - conradev 22 days ago
      Re-creating closed source applications as open source would have a clear benefit because people could use those applications in a bunch of new ways. (implied: same quality bar)
ZitchDog 23 days ago
I used similar techniques to build tjs [1] - the worlds fastest and most accurate json schema validator, with magical TypeScript types. I learned a lot about autonomous programming. I found a similar "planner/delegate" pattern to work really well, with the use of git subtrees to fan out work [2].
I think any large piece of software with well established standards and test suites will be able to be quickly rewritten and optimized by coding agents.
[1] https://github.com/sberan/tjs
[2] /spawn-perf-agents claude command: https://github.com/sberan/tjs/blob/main/.claude/commands/spa...
micimize 23 days ago
> While it might seem like a simple screenshot, building a browser from scratch is extremely difficult.
> Another experiment was doing an in-place migration of Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started to test the changes, we do believe it's possible to merge this change.
In my view, this post does not go into sufficient detail or nuance to warrant any serious discussion, and the sparseness of info mostly implies failure, especially in the browser case.
It _is_ impressive that the browser repo can do _anything at all_, but if there was anything more noteworthy than that, I feel they'd go into more detail than volume metrics like 30K commits, 1M LoC. For instance, the entire capability on display could be constrained to a handful of lines that delegate to other libs.
And, it "is possible" to merge any change that avoids regressions, but the majority of our craft asks the question "Is it possible to merge _the next_ change? And the next, and the 100th?"
If they merge the MR they're walking the walk.
If they present more analysis of the browser it's worth the talk (not that useful a test if they didn't scrutinize it beyond "it renders")
Until then, it's a mountain of inscrutable agent output that manages to compile, and that contains an execution pathway which can screenshot apple.com by some undiscovered mechanism.
[-]
- meander_water 23 days ago
  The lowest bar in agentic coding is the ability to create something which compiles successfully. Then something which runs successfully in the happy path. Then something which handles all the obvious edge cases.
  By far the most useful metric is to have a live system running for a year with widespread usage that produces a lower number of bugs than that of a codebase created by humans.
  Until that happens, my skeptic hat will remain firmly on my head.
- embedding-shape 23 days ago
  > it's a mountain of inscrutable agent output that manages to compile
  But is this actually true? They don't say that as far as I can tell, and it also doesn't compile for me nor their own CI it seems.
  [-]
  - sashank_1509 23 days ago
    Oh it doesn’t compile? that’s very revealing
    [-]
    - rvz 23 days ago
      Some people just believe anything said on X these days. No timeline from start to finish, just "trust me bro".
      If you can't reproduce or compile the experiment then it really doesn't work at all and nothing but a hype piece.
  - micimize 23 days ago
    Hah I don't know actually! I was assuming it must if they were able to get that screenshot video.
    [-]
    - Snuggly73 23 days ago
      error: could not compile `fastrender` (lib) due to 34 previous errors; 94 warnings emitted
      I guess probably at some point, something compiled, but cba to try to find that commit. I guess they should've left it in a better state before doing that blog post.
      [-]
      - jaggederest 23 days ago
        I find it very interesting the degree to which coding agents completely ignore warnings. When I program I generally target warning-free code, and even with significant effort in prompting, I haven't found a model that treats warnings as errors, and they almost all love the "ignore this warning" pragmas or comments over actually fixing them.
        [-]
        ianbutler 23 days ago
        Yeah I've had problems with this recently. "Oh those are just warnings." Yes but leaving them will make this codebase shit in short time.
        I do use AI heavily so I resorted to actually turning on warnings as errors in the rust codebases I work in.
        [-]
        micimize 22 days ago
        Easiest to have different agents or turns that set aside the top-level goal via hooks/skills/manual prompt/etc. Heuristically, a human will likely ignore a lot of warnings until they've wired up the core logic, then go back and re-evaluate, but we still have to apply steering to get that kind of higher-order cognitive pattern.
        Product is still fairly beta, but in Sculptor[^1] we have an MCP that provides agent & human with suggestions along the lines of "the agent didn't actually integrate the new module" or "the agent didn't actually run the tests after writing them." It leads to some interesting observations & challenges - the agents still really like ignoring tool calls compared to human messages b/c they "know better" (and sometimes they do).
        [^]: https://imbue.com/sculptor/
        conception 23 days ago
        You can use hooks to keep them from being able to do this btw
        [-]
        jaggederest 22 days ago
        I generally think of needing hooks as being a model training issue - I've had to use them less as the models have gotten smarter, hopefully we'll reach the point where they're a nice bonus instead of needed to prevent pathological model behavior.
        suriya-ganesh 23 days ago
        unfortunately this is not the most common practice. I've worked on rust codebases with 10K+ warning. and rust was supposed to help you.
        It is also close to impossible run any node ecosystem without getting a wall of warnings.
        You are an extreme outlier for putting in the work to fix all warnings
        [-]
        embedding-shape 22 days ago
        > It is also close to impossible run any node ecosystem without getting a wall of warnings.
        Haven't found that myself, are you talking about TypeScript warnings perhaps? Because I'm mostly using just JavaScript and try to steer clear of TypeScript projects, and AFAIK, JavaScript the language nor runtimes don't really have warnings, except for deprecations, are those the ones you're talking about?
        jaggederest 23 days ago
        `cargo clippy` is also very happy with my code. I agree and I think it's kind of a tragedy, I think for production work warnings are very important. Certainly, even if you have a large number of warnings and `clippy` issues, that number ideally should go down over time, rather than up.
Snuggly73 22 days ago
And there is the thing about the cost. The blog post says that they've spent trillions (plural!) of tokens on that experiment.
Looking at OAI API pricing, 5.2 Codex is $14 per 1 million output tokens. Which makes cool $14m for 1 trillion tokens (multiplied by whatever the plural is). For something that "kind of works".
Its a nice ad for OAI and Anysphere, but maybe next time - just donate the money to a browser team?
tehsauce 23 days ago
I was excited to try it out so I downloaded the repo and ran the build. However there were 100+ compilation errors. So I checked the commit history on github and saw that for at least several pages back all recent commits had failed in the CI. It was not clear which commit I should pick to get the semi-working version advertised.
I started looking in the Cargo.toml to at least get an idea how the project was constructed. I saw there that rather than being built from scratch as the post seemed to imply that almost every core component was simply pulled in from an open source library. quickjs engine, wgpu graphics, winit windowing & input, egui for ui, html parsing, the list goes on. On twitter their CEO explicitly stated that it uses a "custom js vm" which seemed particularly misleading / untrue to me.
Integrating all of these existing components is still super impressive for these models to do autonomously, so I'm just at a loss how to feel when it does something impressive but they then feel the need to misrepresent so much. I guess I just have a lot less respect and trust for the cursor leadership, but maybe a little relief knowing that soon I may just generate my own custom cursor!
[-]
- jkelleyrtp 22 days ago
  WGPU for render, winit for window, servo css engine, taffy for layout sounds eerily similar to our existing open source Rust browser blitz.
  https://github.com/dioxuslabs/blitz
  Maybe we ended up in the training data!
  [-]
  - satvikpendem 22 days ago
    I follow Dioxus and particularly blitz / #native on your Discord and I noticed the exact same thing too. There was a comment in a readme in Cursor's browser repo they linked mentioning taffy and I thought, hang on, it's definitely not from scratch, as they advertise. People really do believe everything they read on Twitter.
    Great work by the way, blitz seems to be coming along nicely, and I even see you guys created a proto browser yourselves which is pretty cool, actually functional unlike Cursor's.
- whatever1 22 days ago
  You are doing it wrong.
  Take a screenshot and take it to your manager / investor and make a presentation “Imagine what is now possible for our business”.
  Get promoted / exit, move to other pastures and let them figure it out.
- eeL3bo1mohn7pee 22 days ago
  Of 63295 workflow runs, apparently only 1426 have been successful.
  It's hard to avoid the impression that this is an unverified pile of slop that may have actually never worked.
  The CI process certainly hasn't succeeded for the vast majority of commits.
  Baffling, really.
  [-]
  - alfalfasprout 21 days ago
    You should see the code. It's true slop. The organization makes no sense.
- wilsonzlin 21 days ago
  Thanks for the feedback. There were some build errors which have now been resolved; the CI test that was failing was not a standard check CI, and it's now been updated. Let me know if you have any further issues.
  > On twitter their CEO explicitly stated that it uses a "custom js vm" which seemed particularly misleading / untrue to me.
  The JS engine used a custom JS VM being developed in vendor/ecma-rs as part of the browser, which is a copy of my personal JS parser project vendored to make it easier to commit to.
  I agree that for some core engine components, it should not be simply pulling in dependencies. I've begun the process of removing many of these and co-developing them within the repo alongside the browser. A reasonable goal for "from scratch" may be "if other major browsers use a dependency, it's fine to do so too". For example: OpenSSL, libpng, HarfBuzz, Skia. The current project can be moved more towards this direction, although I think using libraries for general infra that most software use (e.g. windowing) can be compatible with that goal.
  I'd push back on the idea that all the agents did was wire up dependencies — the JS VM, DOM, paint systems, chrome, text pipeline, are all being developed as part of this project, and there are real complex systems being engineered towards the goal of a browser engine, even if not there yet.
  [-]
  - polyglotfacto 16 days ago
    > there are real complex systems being engineered towards the goal of a browser engine, even if not there yet.
    In various comments in https://news.ycombinator.com/item?id=46624541 I have explained at length why your fleet of autonomous agents failed miserably at building something that could be seen as a valid POC.
    One example: your rendering loop does not follow the web specs and makes no sense.
    https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...
    The above design document is simply nonsense; typical AI hallucinated BS. Detailed critique at https://news.ycombinator.com/item?id=46705625
    The actual code is worse; I can only describe it as a tangle of spaghetti. As a Browser expert I can't make much, if anything, out of it. In comparison, when I look at code in Ladybird, a project I am not involved in, I can instantly find my way around the code because I know the web specs.
    So I agree this isn't just wiring up of dependencies, and neither is it copied from existing implementations: it's a uniquely bad design that could never support anything resembling a real-world web engine.
    Now don't get me wrong, I do think AI could be leveraged to build a web engine, but not by unleashing autonomous agents. You need humans in the loop at all levels of abstractions; the agents should only be used to bang out features re-using patterns established or vetted by human experts.
    If you want to do this the right way, get in touch: https://github.com/gterzian
  - fwip 21 days ago
    When you say "have now been resolved" - did the AI agent resolve it autonomously, did you direct it to, or did a human do it?
    [-]
    - neuronexmachina 21 days ago
      Looks like Cursor Agent was at least somewhat involved: https://github.com/wilsonzlin/fastrender/commit/4cc2cb3cf0bd...
      [-]
      - embedding-shape 21 days ago
        Looks like a bunch of different users (including Google's Jules made one commit) been contributing to the codebase, and the recent "fixes" includes switching between various git users. https://gist.github.com/embedding-shapes/d09225180ea3236f180...
        This to me seems to raise more questions than it answers.
        [-]
        mjmas 21 days ago
        The ones at *.ec2.internal generally mean that the git config was never set up ans it defaults to $(id -un)@$(hostname)
        [-]
        embedding-shape 20 days ago
        Indeed. Extra observant people will notice that the "Ubuntu" username was used only twice though, compared to "root" that was used +3700 times. And observant people who've dealt with infrastructure before, might recognize that username as the default for interactive EC2 instances :)
- handfuloflight 22 days ago
  Let us all generate our own custom cursors.
jphelan 23 days ago
This looks like extremely brittle code to my eyes. Look at https://github.com/wilsonzlin/fastrender/blob/main/crates/fa...
What is `FrameState::render_placeholder`?
``` pub fn render_placeholder(&self, frame_id: FrameId) -> Result<FrameBuffer, String> { let (width, height) = self.viewport_css; let len = (width as usize) .checked_mul(height as usize) .and_then(|px| px.checked_mul(4)) .ok_or_else(|| "viewport size overflow".to_string())?;
```
    if len > MAX_FRAME_BYTES {
      return Err(format!(
        "requested frame buffer too large: {width}x{height} => {len} bytes"
      ));
    }

    // Deterministic per-frame fill color to help catch cross-talk in tests/debugging.
    let id = frame_id.0;
    let url_hash = match self.navigation.as_ref() {
      Some(IframeNavigation::Url(url)) => Self::url_hash(url),
      Some(IframeNavigation::AboutBlank) => Self::url_hash("about:blank"),
      Some(IframeNavigation::Srcdoc { content_hash }) => {
        let folded = (*content_hash as u32) ^ ((*content_hash >> 32) as u32);
        Self::url_hash("about:srcdoc") ^ folded
      }
      None => 0,
    };
    let r = (id as u8) ^ (url_hash as u8);
    let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
    let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
    let a = 0xFF;

    let mut rgba8 = vec![0u8; len];
    for px in rgba8.chunks_exact_mut(4) {
      px[0] = r;
      px[1] = g;
      px[2] = b;
      px[3] = a;
    }

    Ok(FrameBuffer {
      width,
      height,
      rgba8,
    })
  }
```
} ```
What is it doing in these diffs?
https://github.com/wilsonzlin/fastrender/commit/f4a0974594e3...
I'd be really curious to see the amount of work/rework over time, and the token/time cost for each additional actual completed test case.
[-]
- blibble 23 days ago
  this is certainly an interesting way to pull out an attribute from a tag: https://github.com/wilsonzlin/fastrender/blob/main/crates/fa...
- blamestross 23 days ago
  I suppose brittle code is fine if you have cursor to update and fix it. Ideal really, keeps you dependent.
  [-]
  - xmprt 23 days ago
    To be fair, that was always the case when working with external contractors. And if agentic AI companies can capture that market, then that's still a pretty massive opportunity.
    [-]
    - janstice 21 days ago
      At least AI is (and unlike many contract dev shops) keen to write unit tests…
torginus 22 days ago
Personally what I don't like about this now that I think about it, is that they didn't scale up gradually, let's say there there's a ladder of complexity in software, starting at a simple React CRUD app, going on to something more complex, such as a Paint clone, to something even more complex, like a file manager etc, ending up at one of the most complex pieces of software ever made, a web browser.
I'd want to see some system, that 100%s the first task, saturation, does a great job on the next, then does a valiant effort on the third, then finally makes something promising but as yet unusable on the last.
This way we could see that scaling up difficulty results in a gradual decline in quality, and could have a decent measurement of where we are at and where we are going.
mk599 23 days ago
Define "from scratch" in "building a web browser from scratch". This thing has over 100 crates as dependencies... To implement css layouting, it uses Taffy, a crate used by existing browser implementations...
[-]
- rvz 23 days ago
  When I see hundreds of crates being used in a project, I have to just scratch my head and ask: What the f___?
  If one vulnerability exists in those crates well, thats that.
- qingcharles 22 days ago
  And it's not necessarily a bad move to use all those dependencies, but you're right it makes the claim shady.
  I can create a web browser in under a minute in Copilot if I ask it to build a WinForms project that embeds the WebView2 "Edge" component and just adds an address bar and a back button.
Snuggly73 22 days ago
The only thing that I got to actually run on WSL2 was the "Excel" (couldnt get anything actually to compile on Mac or Windows).
It a broken mess that probably implements 0.00001% of Excel. And its 1.2m locs.
With codebases developed in this way - either they need to figure out how agents are going to maintain them (in which case SWE as we know is dead - it will only be limited to those that can spend trillions of tokens, or they are going to remain weird demos.
[-]
- timabdulla 22 days ago
  I'd be curious to see screenshots or a video! I only have a Mac at my disposal, unfortunately.
logicallee 23 days ago
At the same time they were doing this, I also iterated on an AI-built web browser with around 2,000 lines of code. I was heavily in the loop for it, it didn't run autonomously. You can see the current version of the source code here:
https://taonexus.com/publicfiles/jan2026/172toy-browser.py.t... (turn the sound down, it's a bit loud if you interact with the built-in Tetris clone.)
You can run it after installing the packages, "pip install requests pillow urllib3 numpy simpleaudio"
I livestreamed the latest version here 2 weeks ago, it's a ten minute video:
https://www.youtube.com/watch?v=4xdIMmrLMLo&t=45s
I'm posting from that web browser. As an easter egg, mine has a cool Tetris clone (called Pentrix) based on pieces with 5 segments, the button for this is at the upper-right.
If you have any feature suggestions for what you want in a browser, please make them here:
https://pollunit.com/polls/ahysed74t8gaktvqno100g
physicsguy 22 days ago
I have been trying Claude Code a lot this week. Two projects:
* A small statically generated Hugo website but with some clever linking/taxonomy stuff. This was a fairly self-contained project that is now 'finished' but wouldn't hvae taken me more than a few days to code up from scratch. * A scientific simulation package, to try and do a clean refresh of an existing one which i can point at for implementation details but which has some technical problems I would like to reduce/remove.
Claude code absolutely smashed the first one - no issues at all. With the second, no matter what I tried, it just made lots of mistakes, even when I just told it to copy the problematic parts and transpose them into the new structure. It basically got to a point where it wasn't correct and it didn't seem to be able to get out of a bit of a 'doom loop' and required manual intervention, no matter how much prompting and hints I gave it.
[-]
- Bishonen88 22 days ago
  Similar experience here.
  Did sign up for Claude Code myself this week, too, given the $10/month promo. I have experience with AI by using AWS Kiro at work and directly prompting Claude Opus for convos. After just 2 days and ~5-6 vibe coding sessions in total I got a working Life-OS-App created for my needs.
  - Clone of Todoist with the features that I actually use/want. Projects, Tags, due dates, quick adding with a todoist like text-aware input (e.g. !p1, Today etc.)
  - A fantastical like calendar. Again, 80% of the features I used from Fantastical
  - A Habit Tracker
  - A Goal Tracker (Quarterly / Yearly)
  - A dashboard page showing todays summary with single click edit/complete marking
  - User authentication and sharing of various features (e.g. tasks)
  - Docker deployment which will eventually run on my NAS
  I'm going to add a few more things and cancel quite a few subscriptions. It one-shots all tasks within minutes. It's wild. I can code but didn't bother looking at the code myself, because ... why.
  Even though do not earn US Tech money, am tempted to buy the max subscription for a month or two although the price is still hard to swallow.
  Claude and vibe coding is wild. If I can clone todoist within a few vibe coding sessions and then implement any additional/new feature I want within minutes instead proposing, praying and then waiting for months, why would I pay $$$...
  [-]
  - DauntingPear7 21 days ago
    Wth are your usage limits? Are they increased? I’ll hit a usage limit in about 2-3 hours of using sonnet 4.5, and opus is a weekly limit.
- underdeserver 22 days ago
  On Twitter people are saying GPT-5.2 is better. That's also what Cursor used in their testing. Maybe try it?
  [-]
  - physicsguy 22 days ago
    I have Web access for ChatGPT through work, but not API access annoyingly.
    [-]
    - xpil 17 days ago
      Codex plugin (VSCode) allows consuming your "web" (ie non-api) subscription for coding/agentic tasks.
nl 23 days ago
Remember when 3D printers meant the death of factories? Everyone would just print what they wanted at home.
I'm very bullish on LLMs building software, but this doesn't mean the death of software products anymore than 3D printers meant the death of factories.
[-]
- ben_w 22 days ago
  Perhaps, but I don't think that's a good analogy, there's too many important differences to say (3d printing : all manufacturing) : (vibe coding : all software).
  The hype may be similar, if that's your point then I agree, but the weakness of 3D printing is the range of materials and the conditions needed to work with them (titanium is merely extremely difficult, but no sane government will let the general public buy tetrafluoroethylene as a feedstock), while the weakness of machine learning (even more broadly than LLMs) is the number of examples they require in order to learn stuff.
torginus 22 days ago
I'm kinda surprised how negative and skeptical anyone is here.
It kinda blows my mind that this is possible, to build a browser engine that approximates a somewhat working website renderer.
Even if we take the most pessimistic interpretation of events ( heavy human steering, relies on existing libraries, sloppy code quality at places, not all versions compile etc)
[-]
- ben_w 22 days ago
  I'm not too surprised, the way I read a lot of (not all!*) the negative comments is ~"I'm imagining having to work with this code, I'd hate it". Even though I'm fairly impressed with the work LLMs do, this has also been my experience of them… albeit with a vibe-coding** sample size of 1, done over a few days with some spare credit.
  The positive views are mostly from people who point out that what matters in the end is what the code does, not what it looks like, e.g. users don't see the code, nor do they care about the code, and that even for businesses who do care, LLMs may be the ones who have to pay down any technical debt that builds up.
  * Anyone in a field where mistakes are expensive. In one project, I asked the LLM to code-review itself and it found security vulnerabilities in its own solutions. It's probably still got more I don't know about.
  ** In the original sense of just letting the LLM do whatever it wanted in response to the prompt, never reading or code reviewing the result myself until the end.
  [-]
  - polyglotfacto 16 days ago
    > what matters in the end is what the code does, not what it looks like
    That is true in a way, although even for agents readability matters.
    But the code here does not actually do the right thing, and the way it is written also means it never could.
    Web devs do care whether the engine runs their code according to Web standards(otherwise it's early IE all over), and end-users do care that websites work as their devs intended to.
    Current state is throw-away level quality.
    I've critiqued it at length in the other post, see https://news.ycombinator.com/item?id=46705625
  - satvikpendem 22 days ago
    The problem I've had with vibe coding is akin the adage of the first 90% of the code taking 90% of the time, and the last 10% taking the other 90% of the time. The LLM can get you to 90% initially but it hits a wall unless you the user know what it's doing and outputting, but that is very difficult when you're vibe coding by its very definition, meaning that you're not looking at the code at all. And then you have to read thousands of lines of code which you don't understand that it's entirely easier to stop and hand code a new version yourself, which is precisely what I've done with some of my projects.
    [-]
    - alfalfasprout 21 days ago
      The problem is getting there 90% but poorly makes that last 10% much harder.
- polyglotfacto 16 days ago
  It's obvious by now that AI can write a whole bunch of code approximating all kinds of things. So there is no reason anymore for this to impress anyone.
  A well-architected POC built in a week with a clear path to scaling it to a full implementation down the line would be impressive, but that's not what this is.
  The current code output is basically throw-away level quality AI hallucinated BS.
danieloj 22 days ago
I'm not sure "building a web browser" is such a great test for an LLM. It helps confirm that they can handle large codebases. But the actual logic in the browser engine will be based very heavily on Chromium/Firefox etc.
jphoward 23 days ago
The browser it built, obviously the context window of the entire project is huge. They mention loads of parallel agents in the blog post, so I guess each agent is given a module to work on, and some tests? And then a 'manager' agent plugs this in without reading the code? Otherwise I can't see how, even with ChatGPT 5.2/Gemini 3, you could do this otherwise? In retrospect it seems an obvious approach and akin to how humans work in teams, but it's still interesting.
[-]
- simonw 23 days ago
  GPT-5.2-Codex has a 400,000 token window. Claude 4.5 Opus is half of that, 200,000 tokens.
  It turns out to matter a whole lot less than you would expect. Coding Agents are really good at using grep and writing out plans to files, which means they can operate successfully against way more code than fits in their context at a single time.
  [-]
  - jaggederest 23 days ago
    The other issue with "a huge token window" is that if you fill it, it seems like relevance for any specific part of the window is diminished - which makes it hard to override default model behavior.
    Interestingly, recently it seems to me like codex is actually compressing early and often so that it stays in the smarter-feeling reasoning zone of the first 1/3rd of the window, which is a neat solution for this, albeit with the caveat of post-compression behavior differences cropping up more often.
- observationist 23 days ago
  Get a good "project manager" agents.md and it changes the whole approach of vibe coding. For a professional environment, with each person given a little domain, arranged in the usual hierarchy of your coding team, truly amazing things can get done.
  Presumably the security and validation of code still needs work, I haven't read anything that indicates those are solved yet, so people still need to read and understand the code, but we're at the "can do massive projects that work" stage.
  Division of labor and planning and hierarchy are all rapidly advancing, the orchestration and coordination capabilities are going to explode in '26.
  [-]
  - heliumtera 22 days ago
    I tried this approach yesterday and I`m loving our daily standup with the agents. Looking forward to our retro and health-checks rituals
  - azan_ 22 days ago
    Could you perhaps share such agents.md? Sounds interesting
- galaxyLogic 23 days ago
  > so I guess each agent is given a module to work on, and some tests?
  Who created those agents and gives them the tasks to work on. Who created the tests? AI, or the humans?
- nl 23 days ago
  Generally they only load a bit of the project into the context at a time. Grep works really well for working out what.
tired_and_awake 23 days ago
The moment all code is interacted with through agents I cease to care about code quality. The only thing that matters is the quality of the product, cost of maintenance etc. exactly the thing we measure software development orgs against. It could be handy to have these projects deployed to demonstrate their utility and efficacy? Looking at PRs of agents feels a wrong headed, like who cares if agents code is hard to read if agents are managing the code base?
[-]
- qingcharles 22 days ago
  We don't read the binary output of our C compilers because we trust it to be correct almost every time. ("It's a compiler bug" is more of a joke than a real issue)
  If AI could reach the point where we actually trusted the output, then we might stop checking it.
  [-]
  - LiamPowell 22 days ago
    > "It's a compiler bug" is more of a joke than a real issue
    It's a very real issue, people just seem to assume their code is wrong rather than the compiler. I've personally reported 12 GCC bugs over the last 2 years and there's 1239 open wrong-code bugs currently.
    Here's an example of a simple one in the C frontend that has existed since GCC 4.7: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105180
  - ares623 22 days ago
    “If” doing a lot of work here
- flyinglizard 23 days ago
  You could look at agents as meta-compilers, the problem is that unlike real compilers they aren't verified in any way (neither formally or informally), in fact you never know which particular agent you're running against when you're asking for something; and unlike compilers, you don't just throw away everything and start afresh on each run. I don't think you could test a reasonably complex system to a degree where it really wouldn't matter what runs underneath, and as you're going to (probably) use other agents to write THOSE tests, what makes you certain they offer real coverage? It's turtles all the way down.
  [-]
  - tired_and_awake 23 days ago
    Completely agree and great points. The conclusion of "agents are writing the tests" etc is where I'm at as well. More over the code quality itself is also an agentic problem, as is compile time, reliability, portability... Turtles all the way down as you say.
    All code interactions all happen through agents.
    I suppose the question is if the agents only produce Swiss cheese solutions at scale and there's no way to fill in those gaps (at scale). Then yeah fully agentic coding is probably a pipe dream.
    On the other hand if you can stand up a code generation machine where it's watts + Gpus + time => software products. Then well... It's only a matter of time until app stores entirely disappear or get really weird. It's hard to fathom the change that's coming to our profession in this world.
- AlexCoventry 23 days ago
  You should at least read the tests, to make sure they express your intent. Personally, I'm not going to take responsibility for a piece of code unless I've read every line of it and thought hard about whether it does what I think it does.
  AI coding agents are still a huge force-multiplier if you take this approach, though.
- visarga 23 days ago
  > Looking at PRs of agents feels a wrong headed
  It would be walking the motorcycle.
- icedchai 23 days ago
  This is how we wound up with non-technical "engineering managers." Looks good to me.
  [-]
  - tired_and_awake 23 days ago
    I think this misses the point, see the other comments. Fully scaled agentic coding replaces managers too :) cause for celebration all around
    [-]
    - satvikpendem 22 days ago
      No, it becomes only managers, because they are the ones who dictate the business needs (because otherwise, what is the software the agents are making even doing without such goals), and now even worse with non technical ones.
    - icedchai 23 days ago
      I don't believe that. If you go fully agentic and you don't understand the output, you become the manager. You're in no better position than the pointy-haired boss from Dilbert.
      [-]
      - tired_and_awake 22 days ago
        Hey just wanted to thank you for the healthy back and forth! I respect your opinion and don't hold mine strongly. That said I'm eager for this space to mature and for us all to figure out the best way to interact with fault prone code generation tooling... Especially at scale where we all have the hardest time navigating complexity.
        [-]
        icedchai 22 days ago
        Thanks. It's fun chatting about this stuff! I don't hold mine strongly, either, though I am dealing with lots of AI generated slop code from others.
        Interesting times ahead.
        [-]
        tired_and_awake 21 days ago
        I feel for you. Hopefully your colleagues come around and realize that if they submit the code they are responsible for the slop.
navinsylvester 23 days ago
all these focus on long running agents without focussing on core restructure is baffling. the immediate need is to break down complex tasks into smaller ones and single shot them with some amount of parallelism. imo - we need an opinionated system but with human in the middle and then think about dreamy next steps. we need to focus on groundedness first instead of worrying about agent conjuring something from thin air. the decision to leap frog into automated long running agents is quite baffling.
boys are trying to single shot a browser when a moderate complex task can derail a repo. there’s no good amount of info which might be deliberate but from what i can pick, their value add was “distributed computing and organisational design” but that too they simplified. i agree that simplicity is always the first option but flat filesystem structure without standards will not work. period.
[-]
- vivekv 22 days ago
  I would agree with this. There are definite challenges in grounded specifications today and the tendency for an LLM to go in tangents that is still a struggle that we all deal with every day.
luhego 23 days ago
> We initially built an integrator role for quality control and conflict resolution, but found it created more bottlenecks than it solved
Of course it creates bottlenecks, since code quality takes time and people don’t get it right on the first try when the changes are complex. I could also be faster if I pushed directly to prod!
Don’t get me wrong. I use these tools, and I can see the productivity gains. But I also believe the only way to achieve the results they show is to sacrifice quality, because no software engineer can review the changes at the same speed the agent generates code. They may solve that problem, or maybe the industry will change so only output and LOC matter, but until then I will keep cursing the agent until I get the result I want.
matthewfcarlson 23 days ago
It’s fascinating that many of the issues they faced I’ve seen in human software engineering teams.
Things like integration creating bottlenecks or a lack of consistent top down direction leading to small risk adverse changes instead of bold redesigns. All things I’ve seen before.
[-]
- 2001zhaozhao 23 days ago
  At least the AI teams aren't politically competing against each other unlike human teams.
  (Or are they?)
WOTERMEON 23 days ago
Weird twist the hiring call at the end for a company that says
> Our mission is to automate coding
nashadelic 17 days ago
I'm very curious about how much of it is re-using (remixing?) an existing browser implementation it has seen in the while or its trained on. Even in that case, 99% of all code isn't doing anything novel so this copying would still have significant practical use.
gaigalas 22 days ago
There's a clear conflict between SKILLS, tools and multi-tasking.
I think "intra-context" tooling is already dead. It's too narrow.
It's all "extra-context" now: how one instruments for multiple agents, at multiple times, handling things.
Personally, I think the best tool in this realm will come from open source, and be agnostic (many agents from many places interacting), in order to leverage differences between subtle provider qualities (speed, price and so on).
Building a browser is an interesting and expensive experiment. How much did it cost?
measurablefunc 23 days ago
All of these things have readily available analogues on the web which means they are more than likely just laundering open source code & claiming victory.
[-]
- random_mutex 21 days ago
  It doesn't compile so no victory
  [-]
  - measurablefunc 21 days ago
    Just the usual corporate marketing & hype.
- rzmmm 22 days ago
  There are many open-source toy browser implementations available, so this seems quite likely.
foota 22 days ago
Slightly off topic, but they want to move from solid to react? Isn't that the reverse of the newest trend? Would be interesting to know more.
[-]
- random_mutex 21 days ago
  Most likely LLMs are better at writing react
mdswanson 23 days ago
Over the past year or so, I've built my own system of agents that behaves almost exactly like this. I can describe what I'd like built before I go to bed and have a fantastic foundation in place by the next day. For simpler projects, they'll be complete. Because of the reviews, the code continually improves until the agents are satisfied. I'm impressed every time.
[-]
- z_zetetic_z 22 days ago
  Any chance you would care to share more about this?
ora-600 23 days ago
I would love to know the cost of building this browser. I think that multi-agent orchestration systems will probably be the theme for systems this year.
I think the north-star metric for a multi-agent orchestrator system would be how much did it cost to get this done. how much better could we have done? should we have used a cheaper model for doing a trivial task and an expensive one to monitor it?
throwaway63467 22 days ago
I‘m running opus 4.5 which is arguably their best model and while it’s really good for a lot of work it always introduces subtle errors or inconsistencies when left unsupervised as prompts are never good enough to remove all ambiguity for complex asks, so I can’t imagine what it will do to a code base when left alone with it for days or weeks.
thesurlydev 22 days ago
Pretty cool and related to another path of work I'm following from Steve Yegge: https://medium.com/@steve-yegge/welcome-to-gas-town-4f25ee16...
laszlojamf 22 days ago
They mention billions of tokens, but I'm left wondering how much this experiment actually cost them...
mccoyb 23 days ago
Supposing agents and their organization improve, it seems like we’re approaching a point where the cost of a piece of software will be driven down to the cost of running the hardware, and the cost of the tokens required to replicate it.
The tokens were “expensive” from the minds of humans …
[-]
- Daishiman 23 days ago
  It will be driven down to the cost of having a good project and product manager effectively understanding what the customer wants, which has been the main barrier to excellent software for a good long time.
  [-]
  - galaxyLogic 23 days ago
    And not only understanding what the customer wants, but communicating that unambiguously to the AI. And note who is the "customer" here? Is it the end-users, or is it a client-company which contracts the project-manager for this task? But then the issue is still there, who in the client-company decides exactly what is needed and what the (potential) users want?
    I think this situation emphasizes the importance of (something like) Agile. To produce something useful can only happen via experimentation and getting feedback from actual users, and re-iterating relentlessly.
reactordev 23 days ago
The planner worker architecture works well for me. About 3 layers is the sweet spot. From prompt -> plan -> task division -> workers.
Sometimes workers will task other workers and act as a planner if the task is more complex.
It’s a good setup but it’s nothing like Claude Code.
kilroy123 22 days ago
My test for whether we've created an AGI like AI? Build a Linux kernel from scratch that can actually run a full OS on your computer.
But, if I'm being fair, a full working browser from scratch is just as good.
foota 22 days ago
I've always liked the idea of intelligence in the autonomous ships of the Revelation Space universe. Little agents reporting to progressively more intelligent and higher level ones.
[-]
- satvikpendem 22 days ago
  That's essentially all life from the sub-cellular level on up
Havoc 22 days ago
> long running
I really dislike this as a measure. A LLM on CPU is also long running cause it’s slow.
I get what it’s meant to convey but time is such a terrible measure of anything if tk/s isn’t static
sashank_1509 23 days ago
Can a browser expert please go through the code the agent wrote (skim it), and let us know how it is. Is it comparable to ladybird, or Servo, can it ever reach that capability soon?
[-]
- krackers 23 days ago
  I'm interested in this too. I was expecting just a chromium reskin, but it does seem to be at least something more than that. https://news.ycombinator.com/item?id=46625189 claims it uses Taffy for CSS layout but the docs also claim "Taffy for flex/grid, native for tables/block/inline"
- polyglotfacto 16 days ago
  I've done this in the parallel post, see https://news.ycombinator.com/item?id=46705625 (and a couple of other replies in that thread)
  TLDR; the code is not a valid POC but throw-away level quality that could never support a functioning web engine. It's actually very clear hallucinated AI BS, which is what you get when you don't have a human expert in the loop.
  I actually like using AI, but only to save me the typing.
- missingdays 22 days ago
  You can start by trying to compile the project (spoiler: you can't)
dist-epoch 23 days ago
So, who is going to compile the browser and post the binaries so we can check it out? (in a sandbox/VM obviously)
[-]
- missingdays 22 days ago
  It doesn't compile
tgtweak 22 days ago
Is it too much to expect companies to share some of this in the open vs just the results?
isusmelj 20 days ago
I think we are just very close to the peak of a typical Gartner hype cycle around LLMs. They are useful but overhyped. There will be more posts about fuckups that happen because people run things on autopilot and cannot keep up with reviewing AI generated code.
Do not get me wrong. I use AI all day to speed things up. But I believe that there is only a small group, maybe 5 percent or less, that actually knows how to use AI properly (I'd count myself not yet in that 5%), which I see as potentially dangerous. The other issue I see is inexperienced software engineers writing software. Although I see this as a great value add and productivity boost for prototyping, I am afraid of the “I do not know much about coding but can also make PRs to our codebase” mentality.
For those of you that run things on autopilot, how do you keep code quality under control? And how do you handle refactoring? I am really curious, because one option now is also to just YOLO your LLMs to write code based on the maturity of the product. You can refactor an app or parts of it pretty fast again with LLMs. While tech debt accumulates faster, we also have the opportunity to rebuild faster.
darioush 22 days ago
I find it interesting that this line of adventure quickly lead to locking problems.
ramon156 22 days ago
> A long-running agent made video rendering 25x faster with an efficient Rust version.
Which is not an optimization. This is coming from a Rust dev; Rewriting it in Rust is not the optimization.
Also, I do not believe they actually reviewed the SolidJS->React PR. This PR is incredibly unrealistic and should've been done with either stacker PRs or incremental non-breaking changes.
None of this feels organic, can we stop pretending it is?
To continue the pessimistic tone, none of the writing went in-depth. I did not gain any knowledge, just a marketing post.
sidgarimella 22 days ago
Very cool. Seems long running AI Agents are the new monuments.
jamesnorden 22 days ago
Is the code not even compiling a feature or...
george_atom 22 days ago
Reviewing all this code is the issue.
cawksuwcka 21 days ago
would really appreciate some elaboration as they gloss over the most important part in my kind. why can’t one agent just do it. that’s what ai seems to be - an amalgamation of all our knowledge. why split it back up into separate tentacles. i think focus should be on letting it envelop the problem like a fog and swallow it whole, instead of molesting it independently at touch points and reporting back to … the brain? it’s pretty ridiculous actually. just mimicking ourselves yet again.
[-]
- cawksuwcka 21 days ago
  perhaps ai is a human based solution thus it’s limitations? further, it’s a human problem so we can only solve it in a human way? can we not escape our damned humanity?
dinkm 22 days ago
“Arthur looked up. ‘Ford,’ he said, ‘there’s an infinite number of monkeys outside who want to talk to us about this script for Hamlet they’ve worked out.”