Claude Code Found a Linux Vulnerability Hidden for 23 Years

(mtlynch.io)

100 points | by eichin 13 hours ago

18 comments

jason1cho 3 hours ago
This isn't surprising. What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.
- mtlynch 2 hours ago
  > What is not mentioned is that Claude Code also found one thousand false positive bugs, which developers spent three months to rule out.
  Source? I haven't seen this anywhere.
  In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.
  - christophilus 1 hour ago
    Same. Codex and Claude Code on the latest models are really good at finding bugs, and really good at fixing them in my experience. Much better than 50% in the latter case and much faster than I am.
  - j16sdiz 1 hour ago
    In TFA:
```
   I have so many bugs in the Linux kernel that I can’t 
   report because I haven’t validated them yet… I’m not going 
   to send [the Linux kernel maintainers] potential slop, 
   but this means I now have several hundred crashes that they
   haven’t seen because I haven’t had time to check them.
    
    —Nicholas Carlini, speaking at [un]prompted 2026
```
    - mtlynch 1 hour ago
      Those aren't false positives; they're results he hasn't yet inspected.
      I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062
  - r9295 2 hours ago
    In my experience, the issue has been likelihood of exploitation or issue severity. Claude gets it wrong almost all the time.
    A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact
- boplicity 50 minutes ago
  The lesson here shouldn't be that Claude Code is useless, but that it's a powerful tool in the hands of the right people.
  - amelius 31 minutes ago
    Unfortunately, also in the hands of the _wrong_ people.
    Maybe even more so, because who is going to wade through all those false positives? A bad actor is maybe more likely to do that.
    - embedding-shape 10 minutes ago
      > A bad actor is maybe more likely to do that.
      Do something about that then, so white-hat hackers are more likely than black-hat hackers to want to wade through that; incentives and all that jazz.
  - teeray 25 minutes ago
    The same could be said about a roulette wheel set before a seasoned gambler.
  - mavamaarten 32 minutes ago
    I'm growing allergic to the hype train and the slop. I've watched real-life talks by people who sent some prompt to Claude Code and then proudly presented something mediocre that they didn't make themselves to a whole audience as if they'd invented the warm water, and that just makes me weary.
    But at the same time, it has transformed my work from writing every bit of code myself, to me writing the cool and complex things while giving directions to a helper to sort out the boring grunt work, and it's amazingly capable at that. It _is_ a hugely powerful tool.
    But haters only see red, and lovers see everything through pink glasses.
    - sph 9 minutes ago
      > it has transformed my work […] to me writing the cool and complex things
      > it's amazingly capable at that.
      > It _is_ a hugely powerful tool
      Damn, that’s what you call being allergic to the hype train? This type of hypocritical thinly-veiled praise is what is actually unbearable with AI discourse.
  - righthand 44 minutes ago
    The lesson or the hype mantra?
- sva_ 2 hours ago
  Couldn't you just make it write a PoC?
  - weird-eye-issue 1 hour ago
    Still have to validate it.
- addandsubtract 2 hours ago
  On the other hand, some bugs take three months to find. So this still seems like a win.
- khalic 2 hours ago
  You didn't read the article did you?
  - j16sdiz 1 hour ago
    You didn't read the article did you?
    - khalic 1 hour ago
      He explicitly talks about not sending the maintainers slop, learn how to read.
userbinator 3 hours ago
Not "hidden", but probably more like "no one bothered to look".
> declares a 1024-byte owner ID, which is an unusually long but legal value for the owner ID.
When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.
> it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer
This is something a lot of static analysers can easily find. Of course asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but could be a good starting point for further inspection.
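A rough sketch of that kind of length check, using only the numbers quoted from the article. The 32-byte fixed overhead is inferred from 1056 - 1024, and the function names are made up for illustration; this is not the kernel code:

```python
# Numbers taken from the quoted passage; everything else is illustrative.
BUFFER_SIZE = 112        # fixed-size buffer for the denial message
MAX_OWNER_ID_LEN = 1024  # longest legal owner ID
FIXED_OVERHEAD = 1056 - MAX_OWNER_ID_LEN  # 32 bytes, inferred from the quoted total


def message_size(owner_id: bytes) -> int:
    """Total size of the denial message for a given owner ID."""
    return FIXED_OVERHEAD + len(owner_id)


def fits_in_buffer(owner_id: bytes, buffer_size: int = BUFFER_SIZE) -> bool:
    """Hypothetical guard: validate the length before copying anything."""
    if len(owner_id) > MAX_OWNER_ID_LEN:
        raise ValueError("owner ID exceeds the legal maximum")
    return message_size(owner_id) <= buffer_size


worst_case = bytes(MAX_OWNER_ID_LEN)
print(message_size(worst_case), fits_in_buffer(worst_case))  # 1056 False
print(BUFFER_SIZE - FIXED_OVERHEAD)  # 80: longest owner ID that would still fit
```

Under those assumptions, any owner ID longer than 80 bytes is legal input that no longer fits, which is the "valid range of lengths" question in a nutshell.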
- NitpickLawyer 3 hours ago
  > This is something a lot of static analysers can easily find.
  And yet they didn't (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives) for 20+ years...
  I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.
  - userbinator 3 hours ago
    Remember Heartbleed in OpenSSL? That long predated LLMs, but same story: some bozo forgot how long something should/could be, and no one else bothered to check either.
mattbee 1 hour ago
Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?" is a very persuasive on-ramp for developers new to AI. It spots threading & distributed system bugs that would have taken hours to uncover before, and where there isn't any other easy tooling.
I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.
- dvfjsdhgfv 26 minutes ago
  > Pasting a big batch of new code and asking Claude "what have I forgotten? Where are the bugs?"
  It's actually the main way I use CC/codex.
cesaref 1 hour ago
I'm interested in the implications for the open source movement, specifically about security concerns. Does anyone know if there has been a study about how well Claude Code works on closed source (but decompiled) source?
- skeledrew 12 minutes ago
  > Claude Code works on closed source (but decompiled) source
  Very likely not nearly as well, unless there are many open source libraries in use and/or the language+patterns used are extremely popular. The really huge win for something like the Linux kernel and other popular OSS is that the source appears in the training data, a lot. And many versions. So providing the source again and saying "find X" is primarily bringing into focus things it's already seen during training, with little novelty beyond the updates that happened after knowledge cutoff.
  Giving it a closed source project containing a lot of novel code means it only has the language and its "intuition" to work from, which is a far greater ask.
  - kasey_junk 2 minutes ago
    I’m not a security researcher, but I know a few and I think universally they’d disagree with this take.
    The LLMs know about every previously disclosed security vulnerability class and can use that to pattern match. And they can do it against compiled and in some cases obfuscated code as easily as source.
    I think the security engineers out there are terrified that the balance of power has shifted too far to the finding of closed source vulnerabilities because getting patches deployed will still take so long. Not that the llms are in some way hampered by novel code bases.
summarity 2 hours ago
Related work from our security lab:
Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/
Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...
misiek08 44 minutes ago
Do not expect so many more reports. Expect so many more attacks ;)
lnkl 2 hours ago
"Guy working at company making product, says that the newer version of the product is better"
Huh, who would've expected this.
alsanan2 8 minutes ago
Making it public that AI is able to find this kind of vulnerability is a big problem. In this case it's nice that the vulnerability was closed before publication, but if a cracker had found it, the result would have been extremely different. This kind of news only opens the crackers' eyes.
dist-epoch 3 hours ago
> "given enough eyeballs, all bugs are shallow"
Time to update that:
"given 1 million tokens context window, all bugs are shallow"
- summarity 2 hours ago
  Already happened: https://arxiv.org/abs/2407.08708
- bigbugbag 2 hours ago
  more like some bugs are shallow and others are pieced together false-positives from an automated tool reliable in its unreliability.
- riffraff 3 hours ago
  ..and three months to review the false positives
  - 112233 2 hours ago
    this is always overlooked. AI stories sound like "with right attitude, you too can win 10M $ in lottery, like this man just did"
    Running LLM on 1000 functions produces 10000 reports (these numbers are accurate because I just generated them) — of course only the lottery winners who pulled the actually correct report from the bag will write an article in Evening Post
    - red75prime 1 hour ago
      > these numbers are accurate because I just generated them
      Is this sarcasm, or did you really do this? Claude Opus 4.6?
eichin 13 hours ago
An explanation of the Claude Opus 4.6 Linux kernel security findings as presented by Nicholas Carlini at unpromptedcon.
- eichin 13 hours ago
  https://www.youtube.com/watch?v=1sd26pWhfmg is the presentation itself. The prompts are trivial; the bug (and others) looks real and well-explained - I'm still skeptical but this looks a lot more real/useful than anything a year ago even suggested was possible...
jazz9k 13 hours ago
This does sound great, but the cost of tokens will prevent most companies from using agents to secure their code.
- KetoManx64 12 hours ago
  Tokens are insanely cheap at the moment. Through OpenRouter a message to Sonnet costs about $0.001 cents or using Devstral 2512 it's about $0.0001. An extended coding session/feature expansion will cost me about $5 in credits. Split up your codebase so you don't have to feed all of it into the LLM at once and it's very reasonable.
  - lebovic 10 hours ago
    It cost me ~$750 to find a tricky privilege escalation bug in a complex codebase where I knew the rough specs but didn't have the exploit. There are certainly still many other bugs like that in the codebase, and it would cost $100k-$1MM to explore the rest of the system that deeply with models at or above the capability of Opus 4.6.
    It's definitely possible to do a basic pass for much less (I do this with autopen.dev), but it is still very expensive to exhaustively find the harder vulnerabilities.
    - christophilus 1 hour ago
      This is where the Codex and Claude Code Pro/Max plans are excellent. I rarely run into the limits of Codex. If I do, I wait and come back and have it resume once the window has expired.
      - Jcampuzano2 13 minutes ago
        Claude and Codex pro/max subs aren't supposed to be used for commercial/enterprise development so it's not really an option for execs in enterprise. They need to take into account API costs.
        At my F500 company execs are very wary of the costs of most of these tools and it's always top of mind. We have dashboards and gather tons of internal metrics on which tools devs are using and how much they are costing.
  - gmerc 9 hours ago
    You’d have to ignore the massive investor ROI expectations or somehow have no capability to look past “at the moment”.
    - NitpickLawyer 3 hours ago
      That might be a problem for the labs (although I don't think it is) but it's not a problem for end-users. There is enough pressure from top labs competing with each other, and even more pressure from open models, that should keep prices at a reasonable point going forward.
      In order to justify higher prices the SotA needs to have way higher capabilities than the competition (hence justifying the price) and at the same time the competition needs to be way below a certain threshold. Once that threshold becomes "good enough for task x", the higher price doesn't make sense anymore.
      While there is some provider retention today, it will be harder to have once everyone offers kinda sorta the same capabilities. Changing an API provider might even be transparent for most users and they wouldn't care.
      If you want to have an idea about token prices today you can check the median for serving open models on OpenRouter or similar platforms. You'll get a "napkin math" estimate for what it costs to serve a model of a certain size today. As long as models don't grow an order of magnitude larger than today's largest models, API pricing seems in line with a modest profit (so it shouldn't be subsidised, and it should drop with tech progress). Another benefit for open models is that once they're released, that capability remains there. The models can't get "worse".
    - KetoManx64 9 hours ago
      Not really. I'm fully taking advantage of these low prices while they last. Eventually the AI companies will start running out of funny money and start charging what the models actually cost to run, then I just switch over to using the self hosted models more often and utilize the online ones for the projects that need the extra resources. Currently there's no reason for why I shouldn't use Claude Sonnet to write one time bash scripts, once it starts costing me a dollar to do so I'm going to change my behavior.
      - deaux 6 hours ago
        > Currently there's no reason for why I shouldn't use Claude Sonnet to write one time bash scripts, once it starts costing me a dollar to do so I'm going to change my behavior.
        This just isn't going to happen. We have open weights models, whose running costs we can roughly calculate, that are on the level of Sonnet _right now_. The best open weights models used to be 2 generations behind, then they were 1 generation behind, now they're on par with the mid-tier frontier models. You can choose among many different Kimi K2.5 providers. If you believe that every single one of those is running at 50% subsidies, be my guest.
      - twosdai 7 hours ago
        I also have this feeling. But do you ever doubt it? That when the time comes we will be like the boiled frog, where it's "just so convenient" or the reality of setting up a local AI is just a worse experience for a large upfront cost?
        - iririririr 6 hours ago
          worse. he's already boiled. probably paying way more than that one dollar per bash script with all the subscriptions he already has.
          - KetoManx64 6 hours ago
            Yeah, the $20 I paid to OpenRouter about 4 months ago really cost me an arm and a leg, not sure where I'll get my next meal if I'm to be honest.
  - ThePowerOfFuet 4 hours ago
    >$0.001 cents
    $0.001 (1/10 of a cent) or 0.001 cents (1/1000 of a cent, or $0.00001)?
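    To spell out the two readings, here is a small sketch of the unit arithmetic. The helper names are made up for illustration, and neither value is claimed to be the actual OpenRouter price:

```python
from decimal import Decimal

CENTS_PER_DOLLAR = Decimal(100)


def dollars_to_cents(dollars: Decimal) -> Decimal:
    """Convert an amount in dollars to cents."""
    return dollars * CENTS_PER_DOLLAR


def cents_to_dollars(cents: Decimal) -> Decimal:
    """Convert an amount in cents to dollars."""
    return cents / CENTS_PER_DOLLAR


# Reading A: "$0.001" means one thousandth of a dollar.
reading_a_dollars = Decimal("0.001")
# Reading B: "0.001 cents" means one thousandth of a cent.
reading_b_dollars = cents_to_dollars(Decimal("0.001"))

print("A in cents:  ", dollars_to_cents(reading_a_dollars))
print("B in dollars:", reading_b_dollars)
print("A / B:       ", reading_a_dollars / reading_b_dollars)
```

    The two readings differ by a factor of 100, so over a million messages reading A comes to $1,000 and reading B to $10.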
    - Pikamander2 55 minutes ago
      Oh no, here we go again
      https://youtube.com/watch?v=MShv_74FNWU
- NitpickLawyer 3 hours ago
  Tokens aren't more expensive than highly trained meatbags today. There's no way they'll be more expensive "tomorrow"...
  - bigbugbag 2 hours ago
    they are and they will be, then they won't after the market crashes, the bubble bursts and the companies go bankrupt, possibly taking down a major portion of the global economy with them.
- epolanski 4 hours ago
  I don't buy it.
  Inference cost has dropped 300x in 3 years, no reason to think this won't keep happening with improvements on models, agent architecture and hardware.
  Also, too many people are fixated on American models when Chinese ones deliver similar quality, often at a fraction of the cost.
  From my tests, the "personality" of an LLM, its tendency to stick to prompts and not derail, far outweighs the low-single-digit percentage delta in benchmark performance.
  Not to mention, different LLMs perform better at different tasks, and they are all particularly sensitive to prompts and instructions.
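  As back-of-the-envelope arithmetic, the 300x-in-3-years figure can be turned into a yearly rate. This takes the 300x number at face value (it is the commenter's figure, not verified here) and assumes the drop was spread evenly across the three years:

```python
# Both inputs come from the comment; the constant yearly rate is an assumption.
total_drop = 300  # claimed overall reduction factor
years = 3

# If cost falls by the same factor every year, that factor is the cube root.
per_year = total_drop ** (1 / years)

print(f"~{per_year:.1f}x cheaper per year")  # ~6.7x cheaper per year
```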
up2isomorphism 10 hours ago
But on the other hand, Claude might introduce more vulnerabilities than it discovers.
- yunnpp 10 hours ago
  Code review is the real deal for these models. This area seems largely underappreciated to me. Especially for things like C++, where static analysis tools have traditionally generated too many false positives to be useful, the LLMs seem especially good. I'm no black hat but have found similarly old bugs at my own place. Even if shit is hallucinated half the time, it still pays off when it finds that really nasty bug.
  Instead, people seem to be infatuated with vibe coding technical debt at scale.
  - qsera 2 hours ago
    > Code review is the real deal for these models.
    Yea, that is what I have been saying as well...
    > Instead, people seem to be infatuated with vibe coding technical debt at scale.
    Don't blame them. That is what AI marketing pushes. And people are sheep to marketing.
    I understand why AI companies don't want to promote it. Because they understand that the LCD/Majority of their client base won't see code review as a critical part of their business. If LLMs are marketed as best suited for code review, then they probably cannot justify the investments that they are getting...
- khalic 1 hour ago
  Guys please read the article before commenting...
cookiengineer 3 hours ago
> Nicholas has found hundreds more potential bugs in the Linux kernel, but the bottleneck to fixing them is the manual step of humans sorting through all of Claude’s findings
No, the problem is sorting out thousands of false positives from Claude Code's reports. 5 out of 1000+ reports to be valid is statistically worse than running a fuzzer on the codebase.
Just sayin'
- mtlynch 2 hours ago
  > 5 out of 1000+ reports to be valid is statistically worse than running a fuzzer on the codebase.
  Carlini said "hundreds" of crashes, not 1000+.
  It's not that only 5 were true positives and the rest were false positives. 5 were true positives and Carlini doesn't have bandwidth to review the rest. Presumably he's reviewed more than 5 and some were not worth reporting, but we don't know what that number is. It's almost certainly not hundreds.
  Keep in mind that Carlini's not a dedicated security engineer for Linux. He's seeing what's possible with LLMs and his team is simultaneously exploring the Linux kernel, Firefox,[0] GhostScript, OpenSC,[1] and probably lots of others that they can't disclose because they're not yet fixed.
  [0] https://www.anthropic.com/news/mozilla-firefox-security
  [1] https://red.anthropic.com/2026/zero-days/
- dist-epoch 2 hours ago
  > On the kernel security list we've seen a huge bump of reports. We were between 2 and 3 per week maybe two years ago, then reached probably 10 a week over the last year with the only difference being only AI slop, and now since the beginning of the year we're around 5-10 per day depending on the days (fridays and tuesdays seem the worst). Now most of these reports are correct, to the point that we had to bring in more maintainers to help us. ... Also it's interesting to keep thinking that these bugs are within reach from criminals so they deserve to get fixed.
  https://lwn.net/Articles/1065620/
  - cookiengineer 39 minutes ago
    > https://syzbot.org/upstream
    I stand corrected.
_pdp_ 3 hours ago
The title is a little misleading.
It was Opus 4.6 (the model). You could discover this with some other coding agent harness.
The other thing that bugs me, and frankly I don't have the time to try it out myself, is that they did not compare to see if the same bug would have been found with GPT 5.4 or perhaps even an open source model.
Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for Claude Code.
- mtlynch 2 hours ago
  OP here.
  I don't understand this critique. Carlini did use Claude Code directly. Claude Code used the Claude Opus 4.6 model, but I don't know why you'd consider it inaccurate to say Claude Code found it.
  GPT 5.4 might be capable of finding it as well, but the article never made any claims about whether non-Anthropic models could find it.
  If I wrote about achieving 10k QPS with a Go server, is the article misleading unless I enumerate every other technology that could have achieved the same thing?
- mgraczyk 3 hours ago
  No, the title is correct and you are misreading or didn't read. It was found with Claude Code, that's the quote. This isn't a model eval, it's an Anthropic employee talking about Claude Code. So comparing to other models isn't a thing to reasonably expect.
- weird-eye-issue 1 hour ago
  > You could discover this with some other coding agent harness.
  And surely that would be relevant if they were using a different harness.