"My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing."
It's a good reminder for us all that the competition in this space is rough and lots of more or less subtle marketing is involved.
As part of our continued collaboration with Anthropic, we had the opportunity to apply an early version of Claude Mythos Preview to Firefox. This week’s release of Firefox 150 includes fixes for 271 vulnerabilities identified during this initial evaluation.
As these capabilities reach the hands of more defenders, many other teams are now experiencing the same vertigo we did when the findings first came into focus. For a hardened target, just one such bug would have been red-alert in 2025, and so many at once makes you stop to wonder whether it’s even possible to keep up.
I think it's more that the cost to find a vulnerability has been significantly reduced, not that the vulnerability couldn't have been found before. But that cost mattered tremendously, because someone has to fund the effort to find the bugs. These economics apply to attackers too.
This is roughly what I was assuming but of course the big caveat here is that they were already using the existing LLM driven tooling on an extensively audited codebase.
So while Anthropic's marketing may be hype, there just wasn't much left to find, a point he makes in the blog post.
Whether it's a big step forward for other kinds of projects is difficult to tell, but this highlights that everybody should be using AI code review tools to audit their existing code today, and not everybody is.
None of that other LLM tooling came with claims that it was too dangerous to be released and used, though, unlike what Anthropic did with Mythos.
What it highlights is that Mythos doesn't seem so much better than other LLM-driven tooling at finding security issues, which was the strongest claim Anthropic made in the first place.
“Mythos isn’t supposed to be that good at security, because actually Anthropic was referring more about running llms than mythos specifically”
“The opus model is worse because they have no compute because they are training mythos. The degraded performance is justified!”
“All the bugs in Claude code is just because the models are so good they are just looping and are shipping fast”
Constantly see people crawl out of the woodwork to defend a trillion-dollar company overhyping every press release it gives.
Anthropic using marketing to convince people their models are more advanced, better built, or that AI is a threat that needs to be regulated because only they have the answer? I’m shocked.
More seriously, so far I haven’t seen much indication that Mythos is more than Opus with a security focused code analysis harness. That said, the fact it can find these bugs in an automated fashion is the more important takeaway outside of the hype.
I’m curious what the error rate is on the detections, because none of that means much if it is wrong 90% of the time and we are only hearing about the examples that are useful marketing.
>> Anthropic using marketing to convince people their models are more advanced, better built, or that AI is a threat that needs to be regulated because only they have the answer? I’m shocked.
I remember when OpenAI was saying GPT-2 was too dangerous to release.
Funnily enough, that was while Dario Amodei was their research director.
I remember when there was a guy at Google a few years ago who was convinced that they had an internal, sentient creature in their labs (I think maybe 4 years ago?)
If I’m not mistaken, after the media cycle, he lost his job for breaking confidentiality.
That was the opposite of marketing, Google really didn’t get how to turn this into a product until ChatGPT happened.
Google is the leader; they really don't want AI to be a success, since for them it only comes with a risk of disruption. They probably don't even really believe it's going to be that big of a deal. They are only in that game to hedge; sure, they will have wasted a trillion dollars if AI doesn't come through, but they will earn that back in 3-5 years. So why would they need to do deranged marketing stunts and sacrifice their credibility for that?
If OpenAI or Anthropic doesn't turn this into a trillion dollar industry FAST, they are cooked.
The strategy of building up fear around your product is risky, but necessary. There is simply no way to grow the AI business fast enough if they can't talk directly to the CEOs and bypass input from the employees, and baba yaga stories are perfect for that. Every time the CEO hears an employee say that the AI isn't working great for him, he hears an employee that's scared for his job or for his life, dismisses it, and sends out a mandate that everyone needs to prompt an AI every time they as much as need to go to the toilet.
They most likely understood that it wasn't viable for anything. OpenAI just yolo'd it and now we're dealing with the fallout. I'm fairly certain that no management layer at Google is going to say yes to the "invest 5 billion to make 10 million" scheme that OpenAI and Anthropic are currently running.
I always thought he was fired for making crackpot statements to the press in reference to his professional capacity, and thus creating bad PR and embarrassing spectacle for his employer. Seems like legitimate reasons to me.
An interesting question now is whether he had standard mental health issues, or if he was an early example of AI psychosis or whatever we call people who are falling in love with their AI chatbots because they tell them how smart they are.
Optimization on "Human Feedback", early exposure to high-effort experimental systems... I wouldn't be surprised if that turns into a bigger field than is generally recognized today.
Looking at it from the outside, I think it's still pretty hard to see how he came to end up in that position, but with a bit of individual vulnerability, arbitrary time to boil the frog slowly, and a fairly large number of people exposed, maybe it would be stranger for this not to happen to someone.
Considering Richard Dawkins has recently succumbed to the same delusion it is a reminder that no matter how intelligent someone may otherwise be, we are all human and have certain tendencies and blind spots; anthropomorphizing non-entities being one of those.
Richard Dawkins is 85, to be fair, just as Bernie Sanders was 84 when he made similar comments.
The other guy worked on Google's AI safety team where one would expect he'd have a basic grasp of how the technology works before making outlandish claims.
I can't find it right now, but something came up a few years ago (probably on HN) about highly intelligent people being more adept at making up arguments to rationalize beliefs and actions that they had taken for other reasons entirely.
Sort of makes sense that wielding a more complex mind would offer more complex ways to go wrong, doesn't it?
And on balance, it also can mean that they make connections and see truth where others only see the facade. Both statements can be (and are) true, because highly intelligent people are still just people. Some people's "delusions" are absolutely correct, and others' "facts" are nothing more than anecdotes told to convince themselves of what they want to believe.
Sounds more like “intelligence” isn’t the only defining metric for such behavior to occur in people, because that describes a lot of less intelligent people too. Though, I suspect highly intelligent people are at least somewhat more likely to end up on the “correct” side of the facts.
It makes me wonder if there's a wrong turn in the road where I too might fall into the same pit.
As someone who watched one of their heroes fall for some stupid cult-like thing ten years ago and wondered the same thing, and who then, many years later, fell for some dumb stuff myself: the answer is you probably will. Try to stay intellectually flexible, it'll be okay.
I have seen people I consider as much smarter than me fall for some very idiotic things. I certainly don't consider myself immune.
I think that the advice to try being intellectually flexible is a good one. Strive to learn new things, expose yourself earnestly to ideas that challenge your beliefs, exercise empathy, etc
Curl simply isn't a good data point. It's one of the most picked-over codebases in existence with extensive security testing practices. All the researchers using not-quite-Mythos models have had plenty of time to report bugs up to this point. Daniel may be right that Mythos hasn't been a game changer for curl but the preconditions are different for virtually any other codebase. Perhaps the real marketing here is his own modesty about curl's maturity.
Curl uses all sorts of tools, including AI tools, to find bugs. These tools, according to the article, found hundreds of bugs, including a dozen CVEs.
Mythos found one vulnerability. It means Mythos is just another tool, not the revolution it claims to be.
It is common that when a new tool is introduced, a bunch of bugs are found, with diminishing returns. Mythos finding one vulnerability is consistent with what I would expect for a major update to an existing tool, which is what Mythos is relative to existing LLM-based solutions.
Look at the Firefox blog post where they found something like 400 (or more) findings.
I have no doubt Mythos is very good at this, but I also don't think it's something unattainable by other labs within the next few months, with focus.
The question is how many security vulnerabilities are actually left in the code after all the recent AI attention. Either Mythos is a nothingburger, or it's substantially more powerful but there's nothing left to do. Even a large amount of C can be correct eventually. Curl has the _potential_ to become a good data point maybe 6-12 months from now - if researchers and new tools find many more vulnerabilities then Mythos is proved to be hype. If they don't, then maybe Mythos is overkill for today's curl and its capabilities are better deployed elsewhere (like Firefox, apparently).
I have a hard time believing that Mythos found the only remaining Curl vulnerability. It is possible, but highly improbable.
And it is not overkill; the proof is that it found that vulnerability. It is like saying the new version of some static analyzer with some new rules is "overkill" because it found only one more bug than the previous version. Deciding whether it is overkill is more about context. Using a very expensive model like Mythos for some little-used, non-critical software is overkill, but for Curl, it absolutely isn't.
If Mythos found loads of vulnerabilities in Firefox but not in Curl, I wouldn't say that's because Mythos is so good, but rather that with the release of Mythos, they did some testing that could have been done before using the same tools Curl has used.
> Once the end-to-end pipeline is in place, it’s trivial to swap in different models when they become available. Building this pipeline early helped us find a number of serious bugs using publicly-available models, and it also helped us hit the ground running when we had the opportunity to evaluate Claude Mythos Preview. In our experience, model upgrades increase the effectiveness of the entire pipeline: the system gets simultaneously better at finding potential bugs, creating proof-of-concept test cases to demonstrate them, and articulating their pathology and impact.
It's not, really. Curl is an extraordinarily high value target that has already been picked over by well funded security researchers and state-sponsored groups using state of the art tooling for decades. That is not the target for which Mythos is a threat.
The threat isn't high value targets, which already had sophisticated folks picking over the code base using state of the art tools and tests, it's medium to low value targets which can now be picked over by random hackers who barely know anything about security themselves at a cost of a few dollars.
That helps us to understand how much of Mythos is hype and how much is real.
We see this exact hypetrain every time a new model is released. Mythos simply hasn't lived up to the "we're all gunna die from the flood of vulnerabilities" hype even slightly. It's slightly better than previous models by all accounts, cool stuff.
I've seen literally near word-for-word this exact chain of events multiple times previously
Given how much money is on the line, it would be gross negligence if anything came publicly out of the CEO's mouth or is otherwise published by the company that's not marketing.
The other alternative is that Curl is simply secure enough that there was far less to find than in other projects.
Not really, curl has slow anonymous memory leaks because of how the connection session caching was implemented. If you don't periodically restart a program, then people encounter strange, hard-to-diagnose issues sooner or later.
Also, looking at something that trips valgrind warnings already, may obfuscate a lot of problems in both your own code and the curl library itself.
One could report the issue as functioning as described in the API, but the developers do not accept direct community input into the project.
People use it out of convenience, but it is just as janky as most bloated projects. =3
Marketing is not intentional.
Evidence: 10 years ago, when I interviewed at Baidu AI with Andrew Ng and Dario, Dario was the kind of person who is pure-hearted to the point of being ideological. Given Dario's successful career so far, that essence has gradually grown into a conviction, and he is surrounded by a purposely built team which amplifies his ideology.
Humans are very convenient creatures; a rare small fraction of them are no doubt masters of convenience: they morph their mental manifold without a hint of contradiction in their own mental mechanisms.
Marketing is always intentional at this scale. If you think Anthropic didn't put a lot of time and effort into Glasswing as a marketing effort I think you're misunderstanding how these organizations work and how they win.
Mythos put Anthropic back into the White House’s good graces. It also branded Anthropic as badass, something their softer image probably needed to win government contracts.
Maybe it wasn’t marketing. But the product’s configuration, and how Anthropic talked about and released it, sure as hell played beautifully. (The timing, while Musk and Altman are distracted with each other, also couldn’t have been better.)
These things are layered. They are great scientists, smart people, etc.
Things change when you’re running a business like Anthropic, especially as the CEO. You have a responsibility to shareholders, and you just need to play the game.
Anthropic chose a great angle: focus on professionals / enterprise, safety, etc. Those can both be done by a genuine desire to make great technology, and for business purposes require you to position yourself in a bit “better” way than reality.
Just look at what their strategy is with Mythos, it’s almost perfection: the “it’s not ready to be released to the public” angle hits all the marks: they care about responsibility / safety, they have “the best” model, and “LLMs are dangerous, but we, as the guardians, can be trusted”. This also helps the industry as a whole with regulation: if they’re being constrained, China will develop even more dangerous models.
This is a result of how smart people treat business, it’s PR perfection, especially given how much the whole industry is talking about it.
(Yes, they fail in other PR areas, but that’s a different discussion)
I'm not sure that distinction is important, since what you've described is, less charitably, synonymous with the phrase "Dario is delusional, and has surrounded himself with yes-men, so outlandish marketing gets published as a side effect".
Whether the person doing the marketing was sincere about it or not is immaterial, since marketing is experienced almost entirely by the people consuming it, and not the people communicating it. What matters is if the audience is sincerely concerned by the message, and it's transparently the case that they were sincerely concerned by it.
That's an odd definition of "intentional". Evolution has filtered for people with certain views and the marketing has just emerged from their actions. ... So?
A deadly virus (naturally occurring one let's say) wasn't created intentionally. Evolution selected for it. It's still bad and kills people. Doesn't make it nice because of lack of intention.
I think that's a reasonable analysis, but it's very different than the one that's usually implied by "marketing". Most people I see talking about Dario and his "marketing" go on to express confusion or frustration on why he would decide to message this way, ignoring what I (and perhaps you?) consider to be the obvious answer that he believes it's true.
All your evidence can be exactly true, and he can still genuinely believe that Anthropic "winning" the AI race is the best outcome for humanity, even with a little subterfuge, including marketing to the current administration. If I genuinely thought I needed to do something to secure humanity, there's little I wouldn't do to achieve it.
https://www.politico.eu/article/anthropic-hacking-technology...
This is an advertising masterpiece: UK gets first access, the EU is jealous and wants it, too. Thousands of bureaucrats and parasites make money in the process writing (probably using AI) whitepapers and sitting in meetings. The open source authors whose works are being scanned make nothing.
We know how the money flows. Another unrelated example is that ex MI6 director Sir John Sawers is a Palantir consultant and sells out the UK to Palantir.
They claim the huge advance is in exploiting the bugs.
> Over the last few months, we have stopped getting AI slop security reports in the #curl project. They're gone.
> Instead we get an ever-increasing amount of really good security reports, almost all done with the help of AI.
> They're submitted in a never-before seen frequency and put us under serious load.
> I hear similar witness reports from fellow maintainers in many other Open Source projects.
> Lots of these good reports are deemed "just bugs" and things we deem not having security properties.
[1]: https://www.linkedin.com/posts/danielstenberg_hackerone-shar...
I've been running my own security scanning software (disclaimer: now starting a company @ zeroquarry.com) for this, and from what I've seen there's huge value in prompts + adversarial LLM review. Without adversarial review, you get garbage (as this blog points out: 4/5 are basically nonsense), and with a good prompt you can use almost any "near frontier" model, in my experience, as long as the prompt helps with the guardrails or the model isn't overly strict about refusing.
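For what it's worth, here's a minimal sketch of the shape of adversarial review, written against a hypothetical llm(model, prompt) helper; the model names and prompts are placeholders, not any vendor's real API:

    # Sketch only: llm(model, prompt) -> str is a hypothetical helper and the
    # model names are placeholders. The point is the shape: one pass proposes
    # findings, a second adversarial pass tries to knock each one down.
    FINDER = ("You are auditing C code for memory-safety and logic bugs. "
              "For each suspected vulnerability give file, function, a one-line "
              "claim, and the exact code path that triggers it.")

    REVIEWER = ("You are a skeptical security reviewer. Try to DISPROVE the "
                "finding below: name the guard, bounds check, or caller contract "
                "that makes it unexploitable. Reply VALID or INVALID plus a reason.")

    def scan(source: str, llm) -> list[str]:
        raw = llm("finder-model", FINDER + "\n\n" + source)
        findings = [f for f in raw.splitlines() if f.strip()]
        kept = []
        for finding in findings:
            verdict = llm("reviewer-model", REVIEWER + "\n\n" + finding + "\n\n" + source)
            if verdict.strip().upper().startswith("VALID"):
                kept.append(finding)  # the other ~4/5 die here
        return kept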
Mythos marketing really leans into that "too powerful to be legal" vibe, much like how PS2s were allegedly banned from North Korea because their chips were basically missile-grade.
I'd go out and say the marketing is not subtle. The hype and fanboys/girls are so in line with the marketing that any level of skepticism is seen as an act of defection, but if you look at the words, hyperbole, and volume that are used, there is nothing subtle about it.
About as subtle as a personal injury lawyer's billboard
It's almost Trump-esque - "this model will change everything forever; we are doomed; we are saved; we will all be fired; we will all be rich", etc
That's a pretty good encapsulation of the parallels between the political and the technological: one necessarily thrives upon the other, and they are inextricable. This moment is a culmination of all the disenfranchisement the body politic has suffered, looking for any possible means of escape or elevation. AI and Trumpism, for their own respective cohorts, are salvation, on offer by different frontmen but ultimately in service of the same system.
They need the hype to pay off way more than we do. So many of us who still write code directly stand to lose nothing of our capabilities if the marketing claims cannot hold water.
I seem to be totally outside the hype bubble, but I have to suspect there is a lot of imagineering and wild extrapolation in the less technical hype bubbles. I am curious, but not enough to go looking.
I'm surprised you say that, because it is all over Hacker News. Every single post is co-opted into promoting AI. Try finding a submission with fifty points or more that doesn't have AI or LLMs mentioned somewhere in the comments.
That’s not really the point though. I have no doubt AI is useful, I just don’t want to have it shoved in my face every five minutes.
Eh... I think he puts the LLM down for his own ego's sake (as would I!). Curl may, next to the Linux kernel, be one of the most heavily audited codebases in existence. The LLM found something he and thousands of others missed. It's not unimpressive.
I commented this in another post but I'm going to repeat it because I believe it's important for this discussion.
> The worrying part about Mythos isn't the fact that it can find bugs. The worrying part is Mythos being able to find them on its own across entire code base as vast as Firefox then write exploits for what its found with a very basic prompt.
> The skill required to find then create zero days is quickly approaching the floor.
I will never ever understand how people are amazed by this. Have they just not tried it and then just assume that because Anthropic says this is the first it must be true?
The great exaggeration is that this is a new capability.
And then it writes the exploits automatically for you?
This was one of the first things I tried and it works great.
You've pointed Codex at the entire source code of Firefox and simply prompted it to find bugs, and then had it write the exploits for you? Why haven't you published this? That would sink all of the Claude Code hype.
No, I'm not interested in Firefox bugs, but I've done it with my own large projects.
What I think happened here is an Anthropic team with very little security expertise were working on finding bugs for marketing reasons and when they prompted to make POC exploits of those bugs they didn't have much success because they didn't really know what to ask for. They then proceeded to very finely tune their next model to eagerly exploit vulnerabilities making the models much more powerful for the "I don't know what I'm doing" user which they're now trying really hard to convince everyone is a game changer. </speculation>
The reason many of us are skeptical is we've used the current models to do things and they've worked.
An analogy might be if they tuned their model to eagerly instruct somebody how to make improvised weapons; now somebody asks how to deal with a rival at work and their model gives instructions on building a bomb from hardware store parts. Then they go on a marketing spree telling everybody how dangerous it is. This example might highlight how insincere the marketing is. At any point you could have tuned the model to exploit for inexperienced people; the fact that you've now done it does not mark a grand new capability. People who knew what they were doing could already do this with models.
> No, I'm not interested in Firefox bugs, but I've done it with my own large projects.
Can you publish your results and send them to Bruce Schneier, Dave Lewis, & Heather Adkins [1] so they know that this isn't anything new and is just the work of people with little security expertise?
I can't help but think that curl is, by nature, a relatively simple and well-contained tool. Compare to an operating system or web browser or database or billion dollar company codebase.
It makes some sense that Mythos/ChatGPT 5.5 might be that much better with complexities that curl just doesn't have because it's a basic tool.
Like yeah curl is obviously extremely fully featured as an "anything client" but it's orders of magnitude less complex than other software we rely on.
I agree it's rather basic, but as stated in the article, its code is still longer than War and Peace. There are still plenty of opportunities for security vulnerabilities in something of that size.
I feel like, if it were a codebase that hadn't used any security analysis tools, there would have been some more significant findings. Perhaps they can re-run it on an 18-month-old commit and see how many it finds that were subsequently found and fixed?
Anyway, I think the case is that frontier and next-gen models will get increasingly adept at finding vulnerabilities, and that those on the receiving end of those vulnerabilities need to be on top of it.
> An amazingly successful marketing stunt for sure.
This. Well done by Anthropic.
It even reached the CISO of my small semi-government org in the Netherlands, who slightly panicked at the announced 'tsunami' of vulnerabilities that was coming with Mythos.
Got us some more money and priority with the board, though.
I don't agree with the "no tsunami in sight": just look at the 100+ bugs in Firefox and many more OSS projects, a bunch of old, unseen-before OpenBSD/Linux RCEs, and a few LPEs for Linux itself, all in just 2 or 3 weeks...
IMO, this does not sound like a marketing scare; there is a spike of vulnerability disclosures, high quality, low false positives, that can be sensed... It feels like we're speedrunning through a few years' worth of high-quality bug reports in just a few weeks.
Anthropic noticed the trend of AI vulnerability scanning and started advertising Mythos, which is unreleased, as being very good at it.
Then they donated very large token budgets for using Mythos privately to several teams. Those teams used the free token spend for security research (that was the deal) and anything they found got attributed to Mythos, not the token budget.
Mythos looks like a good incremental model, but the PR team has done a great job of associating themselves with the current trend. So much so that comments like yours already associate vulnerabilities found with this model, which isn’t even available yet.
Anthropic is quickly destroying customer goodwill by repeatedly pulling the same stunt. Horrible marketing, imho.
It's an entirely different thing to have the company conduct research on LLMs in general being a cybersecurity threat, instead of going "our new model is just too powerful" and shifting the discussion to revolve around that. It's slimy.
Sure, but isn't it a verdict on Mythos compared to other models?
If so, it would still follow. "Most software" isn't analyzed as much as curl, by either other tooling or other models, that might well find close to the same as Mythos did. As such, Mythos then isn't especially/particularly dangerous.
I don't think I understand what you mean; the "not particularly dangerous" comment was in relation to the vulnerability that was found, right? Surely they would know what constitutes a lower severity level.
My guess is that it is in the category of "you are holding it wrong". Still worth fixing, but it requires very specific user input, for example. Or a very weird scenario. Or some less-used protocol or flag combination.
Curl is currently receiving a record number of high-quality bug/vuln reports (a rather sharp change from the earlier slop inundation), so it’s not like there’s nothing to find. Many or most of these are presumably found by human experts assisted by AI tools, but if Mythos were truly revolutionary, it should be able to find such issues on its own.
I know that the Mythos hype is part marketing by Anthropic, but isn't it possible that with a highly scrutinized codebase, there just aren't any notable security exploits in its current state? The fact that it found nothing isn't necessarily an indictment of it, especially when other tools had identified hundreds of exploits previously. Seems like it's been completely picked over (for now).
> The single confirmed vulnerability is going to end up a severity low CVE planned to get published in sync with our pending next curl release 8.21.0 in late June
My mind still cannot understand the quality and refinement that's gone into cURL. It really is the perfect example of something done so right, that people barely think twice about.
Easy, it shows what is achievable if there is a high bar for quality in every single line of code that gets committed, reviewed, and merged, regardless of the programming language.
However, in the days of the race to the bottom, offshoring for pennies, and now LLM-powered code generation, this is a quality most companies won't care about unless there is liability in place.
Curl and SQLite are my favourite examples of properly engineered, rigorously tested _anything_. It's really philosophical: those projects' contribution requirements demand such rigor, and the maintainers stand by that demand. A non-load-bearing document (not project code) is what makes that possible, very reminiscent of Einstein's thought experiments leading to tangible projects such as GPS, or Descartes's belief that all problems can be solved through rational thinking.
Some people must be working on training models exclusively on high-quality OSS codebases like curl and SQLite, without the noise of low-quality training data.
I would do that with 100% local models from scratch.
> My mind still cannot understand the quality and refinement that's gone into cURL. It really is the perfect example of something done so right, that people barely think twice about.
And all that, only to end with people doing "curl ... | bash" and not seeing anything wrong with it. Then they'll deflect with "threat models" and other nonsense.
I'll leave you your curl-bash; I'll keep my cryptographically signed package installer.
"I signed the contract for getting access, but then nothing happened. Weeks went past and I was told there was a hiccup somewhere and access was delayed.
Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn’t that important."
Really? We're talking about (essentially) a product demo from a trillion dollar industry fueled by debt. Clearly, blog posts like this have an immense influence on the perception of usefulness of the particular model and AI in general. With so much staked on this for the company, wouldn't you want to be sure that you're using the actual product without anyone messing with the results in any way?
There is always marketing involved and people should be able to put marketing into perspective.
Also, curl in this regard is an open source project, relatively small but critical, well known and used everywhere. Besides image libraries, tools like curl or sudo, su, passwd, etc. would also be my first try.
It's still not known at all what Mythos can do. What does it mean from cost and benchmark pov to have a 10 Trillion parameter model?
Nonetheless, LLMs got significantly better at finding these issues, better than humans, starting maybe half a year ago? So at some point we need to address the elephant in the room and state that today you additionally need to do security scanning with LLMs. You need to take this seriously.
In the worst case, use Anthropic's marketing to state that it's a must now and that something has changed.
> What does it mean from cost and benchmark pov to have a 10 Trillion parameter model?
To me it means that we've hit the top end of the S-curve with regards to effects of scaling - if the tool isn't remarkably better despite the scale, then we're firmly in diminishing returns territory.
> Nonetheless, LLMs got significantly better at finding these issues, better than humans, starting maybe half a year ago?
*rolls eyes* regular static analyzers also have been "better than humans" for decades, being better than a human at a specific mechanical task really doesn't mean much. The interesting new thing is the type of potential "fuzzy bugs" described in the article that LLMs are able to identify (a comment not matching the code it describes, uncommon usage of a 3rd party library, mismatch of code and a protocol it implements, or often just generally weird looking code somebody should have a closer look at... this closes a gap in the traditional debugging toolboxes, but shouldn't replace them)
Static analyzers are balls. For every real bug they find you are dealing with piles of false positives and negatives.
Now, I'm not saying you shouldn't use them. They do catch the low hanging fruit. It's that LLMs actually have a much better understanding of things like intent when looking at your code and general architecture configurations that can lead to problems.
As you say, we've had static analyzers forever, hence why they aren't dropping 50 new CVEs a day. LLMs are. There is a massive stack of software out there that is getting analyzed and exploited at a rate faster than it's getting patched. Adding to that things like NPM's exploited package of the day and popular GitHub repository takeovers, this year looks massively different from last year in quantity and quality of exploits alone.
IME LLMs generate at least as many false positives as static analyzers (note though that 99% of false positives are avoided with proper assert hygiene, and that seems to be true both for traditional static analyzers and LLMs; those asserts annotate the code with valuable hints that may go beyond a specific type system's capabilities).
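A toy example of the kind of assert hygiene I mean (Python for brevity; the same idea applies to C asserts):

    # The asserts state invariants (non-empty, pre-sorted) that the type
    # system can't express. A static analyzer or an LLM reviewer reading this
    # no longer has to guess whether the index below can be out of range,
    # which is exactly the class of false positive that otherwise gets filed.
    def median_of_sorted(buf: list[float]) -> float:
        assert buf, "caller guarantees a non-empty buffer"
        assert all(a <= b for a, b in zip(buf, buf[1:])), "caller pre-sorts"
        return buf[len(buf) // 2]  # in range: 0 <= len(buf)//2 < len(buf)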
> These tools and the analyses they have done have triggered somewhere between two and three hundred bugfixes merged in curl through-out the recent 8-10 months or so.
If you've just gone through a lengthy analysis of your code with other AI tools, surely it's reasonable not to expect to see hundreds more from a new tool?
It should be possible, unless more bugs are introduced, to eventually get to a state where there are no more bugs in your code.
Process aside, it sounds like Daniel expected to find dozens/hundreds more bugs.
curl had ~15 CVEs in 2026 so far. You surely don't think those (and the one Mythos found) were the last security bugs still left in the code base? There certainly will be more, in fact Daniel predicts ~50 CVEs for the entire year.
What's going on in this thread? It's weird how prevalent the negativity towards Mythos is, and I'm not sure if it's people throwing the baby out with the bathwater or something more tinfoil-adjacent, like a coordinated campaign. I also noticed this on a thread a few days ago, before the Mozilla post. There were dozens of comments saying basically "Mythos is vaporware".
I get the idea that they're using it for marketing. Of course they are. But to reduce it at "just marketing" feels either ill informed or outright wrong. Unless you have reasons to not believe the dozens of credentialed, well respected people in the field that have already shared their opinions after working with mythos. Plenty of them on all the social media sites.
And then there's the team at mozilla. They wrote a blog about this, and they've worked with anthropic before, using opus 4.6 and found and fixed 22 vulnerabilities. Then they worked with mythos and found and fixed 271 vulnerabilities. Unless you're going to accuse them of being shills, these are unquestionable numbers. The model is quantitatively better at this thing. And it matches what everyone is saying.
I think there are better things to accuse anthropic of, than that they are simply lying for marketing purposes. Of course they'll use this as a marketing campaign, but there's plenty of evidence out there that there is something there, that the model is simply better than previous generations at this. Don't fall for the cheap reductionist stuff, just because you don't like them, or feel that this is marketing fluff. It doesn't feel like a gimmick, even if it gets used to push their agenda. Something, something, propaganda often uses true statements as well.
> Unless you have reasons to not believe the dozens of credentialed, well respected people in the field that have already shared their opinions after working with mythos.
Exactly the same argument was made about o3-preview, lol. But anyway, do they talk about all domains where Mythos did the leap in capabilities (math and other research, ML, SWE) or only about cybersec?
> And then there's the team at mozilla. They wrote a blog about this, and they've worked with anthropic before, using opus 4.6 and found and fixed 22 vulnerabilities. Then they worked with mythos and found and fixed 271 vulnerabilities
Those 22 bugs were found in February, at the time when Mozilla was doing its first small-scale experiments with Opus 4.6 (i.e. no proper integration into the workflow, likely a relatively simple harness, likely only a small part of the codebase covered). You can't compare "22 bugs which were found during very early attempts to apply AI" with "271 bugs which were found during large-scale codebase scanning with properly configured AI". The fact that Mozilla is pretty vague about the "contribution of other AI models" makes it even worse.
> Unless you're going to accuse them of being shills, these are unquestionable numbers. The model is quantitatively better at this thing
They found another ~150 bugs after their first announcement, and only ~35 of those were found by Mythos. That's already a very sharp drop in contribution.
> I think there are better things to accuse anthropic of, than that they are simply lying for marketing purposes.
Anthropic already used a lot of "technically correct but in fact deceiving" statements in the Mythos system card. They are playing both "It's too dangerous" and "We don't have enough compute for that super model" at the moment (usually a big red flag). Opus 4.7 (which was likely supposed to be "Opus 5.0", given various facts) is a disaster from various points of view. Of course people don't really believe Anthropic.
Here and on reddit, AI debugging is viewed as some weird shallow pattern-matching that obviously fails to spot real stuff and overloads the maintainers. Instead of getting to a "spotless record" of zero flaws, people start rationalizing that "X is not a real bug" and inventing justifications for their (obviously bad) code, which is critique they can't accept from AI, only through human debate that they can't close with a WONTFIX.
Once the bug is actually usable, the tune changes completely.
> Here and on reddit, AI debugging is viewed as some weird shallow pattern-matching that obviously fails to spot real stuff and overload the maintainers.
That's because that is what a lot of people did in recent years [1] to pad their resumes or to force developers to backport patches to older (but supported) kernel versions that wouldn't have gone in if they didn't have a CVE attached [2]. Maintainers have been legitimately swamped with low-quality spam for a very long time. Only recently, in the last few months, did AI actually get "good enough"; the problem is that maintainers still have to differentiate between AI slop from wannabes and AI-assisted reports reviewed and refined by actual human professionals.
At the end of the day attackers don't give a fuck. "Waaa waaa, AI was bad 6 months ago so I'm going to throw a little fit" doesn't work when it's currently actively exploiting your shit. No one gives a damn if there are 4000 bullshit security PRs lined up. The one real RCE in there means that everything you hold dear has already been carted off by nation states, and probably rediscovered by 3 or 4 other exploitation groups by this point.
It's time for all the little snowflake software writers to pull up their pantaloons and realize that Linus' vision has become real. With enough AIs, all security bugs become shallow. And that software affects the real world, real money, and real people in it. That they are also under attack by well-financed groups with rather evil motivations. If I'm attacking some group using your software (such as another nation), I'm going to flood the fuck out of your PR system till you give up hope and die. I'm going to make you attack your contributors. I'm going to sow confusion so I have the maximum amount of time to lay waste to my enemies and profit to the max.
The internet is hostile. Software is hostile. There are sharks looking to eat you.
I'm disinclined to be overly generous to Anthropic, but I have to say that regardless of whether the talk of Mythos being uniquely dangerous was mostly cynical: it would be great if this starts a trend of giving security-critical software a few months' head start with any new, significantly improved model.
Putting on my tinfoil-hat: Sooo, the guy who runs the test and delivers the report could just have removed the more interesting bugs and delivered those to any three letter agency?
curl's source is public so what would be the gain in the rigmarole? Now if the prompt was "create a patch that inserts a zero-day while fixing a bug" that would be impressive.
Curl is likely one of the most combed-over pieces of code at this point. It seems to have some special draw for people looking for vulnerabilities. Not that some novel idea can't still turn something up.
> No, based on cURL's history, it really seems like they would love to have found a really novel bug.
You just confirmed that you didn't read the article.
"Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report."
Someone external to the curl team ran the test. If that third party found a severe CVE that they could use across all the global curl attack surface, and did not disclose it back to the curl team, the third party could keep using the exploit until discovered independently.
I don't know about Mythos, but in recent weeks I've noticed Opus is constantly failing to fix things in tsz[0], whereas GPT 5.5 can easily churn out fixes that are solid and pass tests. I've stopped paying for Claude for now and all my money is going to OpenAI at the moment. Either Opus is massively nerfed or GPT 5.5 is really head and shoulders above it on very difficult tasks. The last percent of conformance tests in tsz are really, really difficult and I've seen Opus bailing again and again. So annoying to waste time and tokens to finally get "this is too involved" or "this requires a multi-week sprint to fix".
From a user’s perspective, 4.7 is a downgrade compared to 4.6. It’s intended to give Anthropic more control over their compute resources and profitability.
I am curious, what kind of work do you use Claude for that sometimes requires hours of work? In my case, I have never seen it go off for more than 10 minutes, and even that is very rare.
Debugging code. I had some issue, so I created a plan to root-cause it that would run the code, change some functions or variables, and run again until we got a confirmed answer.
I woke up to that very workflow this morning. It ran last night and finished at around 3am with ~200k tokens spent. It fixed the issue and created a follow-up doc for things that it could not verify.
Their view of open source is clearly exemplified by the disgusting picture of a deceased man in a wheelchair that the alpha-omega bureaucrats wheel around (bottom of the page).
That is how overpaid security "researchers" view open source. They never write anything, get hundreds of thousands in funding from (now AI) companies and feel superior to everyone else if they find one issue in curl.
The site does not even have an email if you want one of the projects that they so wittily depict as deceased to be scanned. What losers.
Voice input works really well for people speaking English with a Swedish accent. I think the accent of most educated Swedes is mostly a case of prosody. For sure there are some sounds we say slightly differently than native English speakers. We often have some trouble with /s/ and /z/, but I don't know, "war and peace", I think that's easily understood.
Source: voice typing this with Swedish vocal cords, and only had to correct "different lives" to "differently", and add /[^\w\s]/.
Android voice input works with kids using both English and native words, here in India. The country runs schools in 25+ primary languages, each with dialects, so a TV/phone with voice input is more marvelous than the nitpicks discussed here.
War and Peace is about 590,000 words. Tiny compared to the full Harry Potter collection (about 1 million words over the 7 books), but long for a single book.
They're referring to the typo in the title, "Piece" vs "Peace".
I also thought they were contending the word count before noticing. Even remarked how I find this a weird metric, given that code is not prose [0], but then I deleted that once I picked up on what's going on.
[0] comparing the output of `wc -w` with the word counts of books I'm reasonably sure will be super off
edit: ran a calc, substituting out symbols (but not underscores), digits, and comments yields a 390K word count compared to the 660K cited. not excluding the comments yields 600K, so more than a third of all words in the sources are comments.
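Roughly the calc, for anyone who wants to reproduce it; the comment-stripping regex is approximate (it doesn't handle comment markers inside string literals), so treat the numbers as ballpark:

    # Ballpark word counts for a C tree, with and without comments.
    # Caveat: naive about strings containing "/* not a comment */".
    import pathlib
    import re
    import sys

    COMMENT = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

    def words(text: str) -> int:
        # Substitute out digits and symbols (but not underscores), then count.
        return len(re.sub(r"\d|[^\w\s]", " ", text).split())

    total = no_comments = 0
    for path in pathlib.Path(sys.argv[1]).rglob("*.[ch]"):
        src = path.read_text(errors="replace")
        total += words(src)
        no_comments += words(COMMENT.sub(" ", src))
    print(f"with comments: {total}, without: {no_comments}")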
I guess it's related to the phenomenon where you can read words relatively easily as long as the first and last letters are correct and the rest of the letters are there.
I routinely used to compile C programs with multiple compilers to find defects that one or another didn't find. Compiling on Windows vs Linux. You could summarize/minimize it down to compiling with warnings as errors, etc., but you'd be missing the point.
The point wasn't actual cross-platform portability even though that was a nice side effect. It was to flush out all the weird edge cases.
Edges like security flaws. Buffer overflows are usually platform specific. There are plenty of other ways to find these issues but simply recompiling for a different platform surfaces all sorts of issues.
I'm looking forward to trying Mythos run against my 5000-line, instant-finality, quantum-resistant blockchain project and decentralized exchange (an additional 5000 lines). I already ran all the models up to Opus 4.6 and they couldn't find anything.
It's a shame he seems to reject the idea of actually diving in and using these tools interactively:
> It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway.
His expertise, I think, would elevate the results quite a bit. Although if he never uses LLMs, which is how it reads, I guess it might backfire just as well. Prompting style (still?) does matter after all, certainly in my experience anyway.
Perhaps I'm misreading something? From my reading of the article, it doesn't sound like Anthropic offered to let him use Mythos in any other way than that.
He explains in the article that he failed to actually secure access in the end, even though it was approved. Someone else prompted the model on his behalf, and just passed on the findings.
He posts about his use of language models a lot on Mastodon[0]. He does lots with language models, but doesn't buy all the way into the hype. I'd say he's one of the most reasonable & balanced voices on the subject of AI use in software today. Happy to use the technology, more than willing to push back on marketing BS.
I checked back two weeks worth of posts, reposts, and replies there, and do not see anything suggesting so, so I'll have to take your word for this.
What I do see is him responding to seemingly rather frequent harassment about AI use @ curl however. The stance he takes in those cases is very reasonable (even if you don't use AI for scanning the codebase and contributions, threat actors will), it's unfortunate this topic is so political that he has to deal with this to such an extent.
Won my bet "voted 10 [vulnerabilities] but in retrospect as you are familiar with Claude and such tooling if you already used any of recent model to done some kind of security review then I'd drop to 1 or even 0." https://mastodon.pirateparty.be/@utopiah/116537456780283420
"My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing."
It's a good reminder for us all that the competition in this space is rough and lots of more or less subtle marketing is involved.
So while anthropic's marketing may be hype there just wasn't much left to find, a point he makes in the blog post.
Whether it's a big step forward for other kinds of projects is difficult to tell, but this highlights that everybody should be using AI code review tools to audit their existing code today, and not everybody is.
What it highlights, is that Mythos doesn't seem so much better than other LLM driven tooling at finding security issues, which was the strongest claim Anthropic made in the first place.
“Mythos isn’t supposed to be that good at security, because actually Anthropic was referring more about running llms than mythos specifically”
“The opus model is worse because they have no compute because they are training mythos. The degraded performance is justified!”
“All the bugs in Claude code is just because the models are so good they are just looping and are shipping fast”
Constantly see people crawl out of the woodwork to defend a trillion dollars company overhyping every press release it gives
Funnily enough that was while Dario Amodei was their research director.
More seriously, so far I haven’t seen much indication that Mythos is more than Opus with a security focused code analysis harness. That said, the fact it can find these bugs in an automated fashion is the more important takeaway outside of the hype.
I’m curious what the error rate is on the detections, because none of that means much if it is wrong 90% of the time and we are only hearing about the examples that are useful marketing.
I remember when OpenAI was saying GPT-2 was too dangerous to release.
If I’m not mistaken, after the media cycle, he lost his job for breaking confidentiality.
That was the opposite of marketing, Google really didn’t get how to turn this into a product until ChatGPT happened.
If OpenAI or Anthropic doesn't turn this into a trillion dollar industry FAST, they are cooked. The strategy of building up fear around your product is risky, but necessary. There is simply no way to grow the AI business fast enough if they can't talk directly to the CEOs and bypass input from the employees, and baba yaga stories are perfect for that. Every time the CEO hears an employee say that the AI isn't working great for him, he hears an employee that's scared for his job or for his life, dismisses it, and sends out a mandate that everyone needs to prompt an AI every time they as much as need to go to the toilet.
Optimization on "Human Feedback", early exposure to high-effort experimental systems... I wouldn't be surprised it that turns into a bigger field than is generally recognized today.
Looking at it from the outside, I think it's still pretty hard to see how he came to end up in that position, but with a bit of individual vulnerability, arbitrary time to boil the frog slowly, and a fairly large number people exposed, maybe it would be stranger not to have the event occur with someone.
The other guy worked on Google's AI safety team where one would expect he'd have a basic grasp of how the technology works before making outlandish claims.
It makes me wonder if there's a wrong turn in the road that I too might fall in the same pit.
I can't find it right now, but something came up a few years ago (probably on HN) about highly intelligent people being more adept at making up arguments to rationalize beliefs and actions that they had taken for other reasons entirely.
Sort of makes sense that wielding a more complex mind would offer more complex ways to go wrong, doesn't it?
Sounds more like “intelligence” isn’t the only defining metric for such behavior to occur in people, because that describes a lot of less intelligent people too. Though, I suspect highly intelligent people are at least somewhat more likely to end up on the “correct” side of the facts.
I have seen people I consider as much smarter than me fall for some very idiotic things. I certainly don't consider myself immune.
I think that the advice to try being intellectually flexible is a good one. Strive to learn new things, expose yourself earnestly to ideas that challenge your beliefs, exercise empathy, etc
Curl uses all sorts of tools, including AI tools to find bugs. These tools, according to the article found hundreds of bugs including a dozen CVE.
Mythos found one vulnerability. It means the Mythos is just another tool, not the revolution it claims to be.
It is common that when a new tool is introduced that a bunch of bugs are found, with diminishing returns. Mythos finding one vulnerability is consistent to what I would expect for a major update to an existing tool, which Mythos is over existing LLM-based solutions.
Look at the Firefox blog post where they found something like 400 (or more) findings.
I have no doubt Mythos is very good at this, but I also don't think it's something unattainable by other labs within the next few months, with focus.
And it is not overkill, the proof is that it found that vulnerability. It is like saying the new version of some static analyzer with some new rules is "overkill" because it only found only one more bug than the previous version. Deciding whether it is overkill or not is more about context. Using a very expensive model like Mythos for some little used non-critical software is overkill, but for Curl, it absolutely isn't.
If Mythos found loads of vulnerabilities in Firefox but not in Curl, I wouldn't say that's because of Mythos is so good, but rather that with the release of Mythos, they did some testing that could have been done before using the same tools Curl have used.
> Once the end-to-end pipeline is in place, it’s trivial to swap in different models when they become available. Building this pipeline early helped us find a number of serious bugs using publicly-available models, and it also helped us hit the ground running when we had the opportunity to evaluate Claude Mythos Preview. In our experience, model upgrades increase the effectiveness of the entire pipeline: the system gets simultaneously better at finding potential bugs, creating proof-of-concept test cases to demonstrate them, and articulating their pathology and impact.
The threat isn't high value targets, which already had sophisticated folks picking over the code base using state of the art tools and tests, it's medium to low value targets which can now be picked over by random hackers who barely know anything about security themselves at a cost of a few dollars.
that helps us to understand how much of Mythos is hype and how much is real
I've seen literally near word-for-word this exact chain of events multiple times previously
The other alternative is that Curl is simply secure enough that there was far less to find than in other projects.
Also, looking at something that trips valgrind warnings already, may obfuscate a lot of problems in both your own code and the curl library itself.
One could report the issue as functioning as described in the API, but the developers do not accept direct community input into the project.
People use it out of convenience, but it is just as janky as most bloated projects. =3
Marketing is not intentional.
Evidences: 10 years ago, when I interviewed Baidu AI with Andrew Ng and Dario, Dario is the kind of person is pure-hearted to the point being ideological. Given Dario's successful career so far, that essence has gradually grown into a conviction, and surrounded by a purposely built team which amplifies his ideology.
Humans are very convenient creature, a rare few small fraction of them are no doubt the master of convenience: they morph their mental manifold without a hint of contradiction in their own mental mechanisms.
Mythos put Anthropic back into the White House’s good graces. It also branded Anthropic as badass, something their softener image probably needed to win government contracts.
Maybe it wasn’t marketing. But the product’s configuration, and how Anthropic talked about and released it, sure as hell played beautifully. (The timing, while Musk and Altman are distracted with each other, also couldn’t have been better.)
Things change when you’re running a business like Anthropic, especially as the CEO. You have a responsibility to shareholders, and you just need to play the game.
Anthropic chose a great angle: focus on professionals / enterprise, safety, etc. Those can both be done by a genuine desire to make great technology, and for business purposes require you to position yourself in a bit “better” way than reality.
Just look at what their strategy is with Mythos, it’s almost perfection: the “it’s not ready to be released to the public” angle hits all the marks: they care about responsibility / safety, they have “the best” model, and “LLMs are dangerous, but we, as the guardians, can be trusted”. This also helps the industry as a whole with regulation: if they’re being constrained, China will develop even more dangerous models.
This is a result of how smart people treat business, it’s PR perfection, especially given how much the whole industry is talking about it.
(Yes, they fail in other PR areas, but that’s a different discussion)
Whether the person doing the marketing was sincere about it or not is immaterial, since marketing is experienced almost entirely by the people consuming it, and not the people communicating it. What matters is if the audience is sincerely concerned by the message, and it's transparently the case that they were sincerely concerned by it.
That's an odd definition of "intentional". Evolution has filtered for people with certain views and the marketing has just emerged from their actions. ... So?
A deadly virus (naturally occurring one let's say) wasn't created intentionally. Evolution selected for it. It's still bad and kills people. Doesn't make it nice because of lack of intention.
https://www.politico.eu/article/anthropic-hacking-technology...
This is an advertising masterpiece: UK gets first access, the EU is jealous and wants it, too. Thousands of bureaucrats and parasites make money in the process writing (probably using AI) whitepapers and sitting in meetings. The open source authors whose works are being scanned make nothing.
We know how the money flows. Another, unrelated example: ex-MI6 director Sir John Sawers is a Palantir consultant and sells out the UK to Palantir.
They claim the huge advance is in exploiting the bugs.
> Over the last few months, we have stopped getting AI slop security reports in the #curl project. They're gone.
> Instead we get an ever-increasing amount of really good security reports, almost all done with the help of AI.
> They're submitted in a never-before seen frequency and put us under serious load.
> I hear similar witness reports from fellow maintainers in many other Open Source projects.
> Lots of these good reports are deemed "just bugs" and things we deem not having security properties.
[1]: https://www.linkedin.com/posts/danielstenberg_hackerone-shar...
I've been running my own security scanning software for this (disclaimer: now starting a company @ zeroquarry.com), and from what I've seen there's huge value in prompts plus adversarial LLM review. Without adversarial review you get garbage (as this blog points out, roughly 4 out of 5 findings are nonsense); with a good prompt you can use almost any "near frontier" model, in my experience, as long as the prompt supplies the guardrails or the model doesn't refuse too strictly.
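To make the "adversarial review" part concrete, here's a minimal sketch of the two-pass idea: one prompt proposes findings, a second, fresh prompt tries to knock each one down, and only survivors get reported. To be clear, `query_model` and both prompts are placeholders of mine, not zeroquarry's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    path: str
    summary: str

def query_model(prompt: str) -> str:
    """Placeholder: swap in whichever LLM API you actually use."""
    raise NotImplementedError

def scan(path: str, source: str) -> list[Finding]:
    """First pass: ask the model for candidate vulnerabilities, one per line."""
    answer = query_model(
        "You are auditing C code for memory-safety and logic bugs.\n"
        "List only concrete, plausibly exploitable issues, one per line.\n\n"
        + source
    )
    return [Finding(path, line.strip())
            for line in answer.splitlines() if line.strip()]

def survives_review(finding: Finding, source: str) -> bool:
    """Second pass: an adversarial prompt tries to refute the finding.

    This is the filter that removes the 4-out-of-5 nonsense: a report
    is kept only if the reviewer fails to knock it down.
    """
    verdict = query_model(
        f"A scanner claims this bug in {finding.path}: {finding.summary}\n"
        "Argue why the report could be WRONG (unreachable path, checked "
        "bounds, misread API). End with a single line: VALID or INVALID.\n\n"
        + source
    )
    return verdict.strip().splitlines()[-1].strip().upper().startswith("VALID")

def review(path: str, source: str) -> list[Finding]:
    """Report only the findings that survive adversarial review."""
    return [f for f in scan(path, source) if survives_review(f, source)]
```

The one design choice that matters: the reviewer sees only the claim and the code, never the scanner's reasoning, so it can't be anchored into agreeing.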
About as subtle as a personal injury lawyer's billboard
It's almost Trump-esque - "this model will change everything forever; we are doomed; we are saved; we will all be fired; we will all be rich", etc
They need the hype to pay off way more than we do. So many of us who still write code directly stand to lose nothing of our capabilities if the marketing claims cannot hold water.
I'm surprised you say that, because it is all over Hacker News. Every single post is co-opted into promoting AI. Try finding a submission with fifty points or more that doesn't have AI or LLMs mentioned somewhere in the comments.
That’s not really the point though. I have no doubt AI is useful, I just don’t want to have it shoved in my face every five minutes.
> The worrying part about Mythos isn't the fact that it can find bugs. The worrying part is Mythos being able to find them on its own across entire code base as vast as Firefox then write exploits for what its found with a very basic prompt.
> The skill required to find then create zero days is quickly approaching the floor.
The great exaggeration is that this is a new capability.
and then it writes the exploits automatically for you?
This was one of the first things I tried and it works great.
<speculation> What I think happened here is that an Anthropic team with very little security expertise was working on finding bugs for marketing reasons, and when they prompted for PoC exploits of those bugs they didn't have much success, because they didn't really know what to ask for. They then proceeded to very finely tune their next model to eagerly exploit vulnerabilities, making the models much more powerful for the "I don't know what I'm doing" user, which they're now trying really hard to convince everyone is a game changer. </speculation>
The reason many of us are skeptical is we've used the current models to do things and they've worked.
An analogy: suppose they tuned their model to eagerly instruct somebody how to make improvised weapons, and now, when somebody asks how to deal with a rival at work, the model gives instructions for building a bomb from hardware store parts. Then they go on a marketing spree telling everybody how dangerous it is. The example highlights how insincere the marketing is: at any point you could have tuned the model to exploit on behalf of inexperienced people, and doing it now does not mark a grand new capability. People who knew what they were doing could already do this with existing models.
https://www.anthropic.com/news/mozilla-firefox-security
Can you publish your results and send them to Bruce Schneier, Dave Lewis, and Heather Adkins [1] so they know that this isn't anything new and is just the work of people with little security expertise?
[1] https://labs.cloudsecurityalliance.org/mythos-ciso/
It makes some sense that Mythos/ChatGPT 5.5 might be that much better with complexities that curl just doesn't have, because it's a basic tool.
Like yeah curl is obviously extremely fully featured as an "anything client" but it's orders of magnitude less complex than other software we rely on.
Anyway, I think it's the case that frontier and next-gen models will get increasingly adept at finding vulnerabilities, and that those on the receiving end of those vulnerabilities need to stay on top of it.
This. Well done by Anthropic.
It even reached the CISO of my small semi-government org in the Netherlands, who slightly panicked at the announced 'tsunami' of vulnerabilities that was coming with Mythos.
Got us some more money and priority with the board, though.
Never waste a good marketing scare.
IMO, this does not sound like a marketing scare; there is a spike of vulnerability disclosures (high quality, low false positives) that can be sensed. It feels like we're speedrunning through a few years' worth of high-quality bug reports in just a few weeks.
Anthropic noticed the trend of AI vulnerability scanning and started advertising Mythos, which is unreleased, as being very good at it.
Then they donated very large token budgets for using Mythos privately to several teams. Those teams used the free token spend for security research (that was the deal) and anything they found got attributed to Mythos, not the token budget.
Mythos looks like a good incremental model, but the PR team has done a great job of associating it with the current trend. So much so that comments like yours already attribute the vulnerabilities found to this model, which isn't even available yet.
In February, Opus discovered a whole bunch of security related bugs, but didn’t exploit them.
Mythos, in turn, was fed these bugs and told to exploit them.
Not saying it’s not impressive, but it was literally told “here are all the places our metal detector says there may be gold, please find gold”.
It's an entirely different thing to have the company conduct research on LLMs in general being a cybersecurity threat, instead of going "our new model is just too powerful" and shifting the discussion to revolve around that. It's slimy.
I'm not sure that follows. As noted, curl was already analyzed to death with every tool available; most software isn't at that level.
If so, the conclusion still follows. "Most software" isn't analyzed as much as curl by other tooling or other models either, and those might well find close to what Mythos did. As such, Mythos isn't especially dangerous.
https://daniel.haxx.se/blog/2026/04/22/high-quality-chaos/, linked from TFA
The author compares it to AISLE, ZeroPath, and OpenAI’s Codex Security. AISLE and ZeroPath are much more expensive. OpenAI’s Codex Security is gated.
Most people don't care about the first two and don't complain about the latter's policy because they are all specialized models and/or harnesses.
Mythos will be available to all.
I still can't get over the quality and refinement that's gone into cURL. It really is the perfect example of something done so right that people barely think twice about it.
However, in these days of racing to the bottom, offshoring for pennies, and now LLM-powered code generation, this is a quality most companies won't care about unless there is liability in place.
I would do that with 100% local models from scratch.
And all that, only to end with people doing "curl ... | bash" and not seeing anything wrong with it. Then they'll deflect with "threat models" and other nonsense.
I'll leave you your curl-bash; I'll keep my cryptographically signed package installer.
"Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn’t that important."
Really? We're talking about (essentially) a product demo from a trillion dollar industry fueled by debt. Clearly, blog posts like this have an immense influence on the perception of usefulness of the particular model and AI in general. With so much staked on this for the company, wouldn't you want to be sure that you're using the actual product without anyone messing with the results in any way?
Also, curl in this regard is an open source project, relatively small but critical, well known and used everywhere. Besides image libraries, tools like curl or sudo, su, passwd, etc. would also be my first try.
It's still not known at all what Mythos can do. What does it mean, from a cost and benchmark point of view, to have a 10-trillion-parameter model?
Nonetheless, LLMs got significantly better at finding these issues, better than humans, starting maybe half a year ago. So at some point we need to address the elephant in the room and state that today you need to do additional security scanning with LLMs. You need to take this seriously.
In the worst case, use Anthropic's marketing to argue that it's a must now and that something has changed.
To me it means that we've hit the top end of the S-curve with regard to the effects of scaling - if the tool isn't remarkably better despite the scale, then we're firmly in diminishing-returns territory.
And this is very much on purpose my friend. Think about what people already believe it can do though.
*rolls eyes* Regular static analyzers have also been "better than humans" for decades; being better than a human at a specific mechanical task really doesn't mean much. The interesting new thing is the kind of "fuzzy bugs" described in the article that LLMs are able to identify: a comment not matching the code it describes, uncommon usage of a third-party library, a mismatch between code and the protocol it implements, or often just generally weird-looking code somebody should take a closer look at. This closes a gap in the traditional debugging toolbox, but shouldn't replace it.
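To make that concrete, here's a made-up illustration (mine, not the article's) of the first case, a comment not matching the code it describes:

```python
# Hypothetical example: the code is well-defined and bounds-safe, so
# valgrind and static analyzers have nothing to say, yet the stated
# intent and the actual behavior disagree -- exactly the kind of
# "fuzzy bug" an LLM reviewer can flag.

def read_chunk(buf: bytes, offset: int) -> bytes:
    """Return at most 16 bytes starting at offset."""
    # The docstring promises a 16-byte bound; the slice allows 32.
    # An LLM comparing comment to code notices the mismatch; a
    # traditional analyzer sees only a perfectly legal slice.
    return buf[offset:offset + 32]
```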
Now, I'm not saying you shouldn't use them. They do catch the low hanging fruit. It's that LLMs actually have a much better understanding of things like intent when looking at your code and general architecture configurations that can lead to problems.
As you say, we've had static analyzers forever, which is why they aren't dropping 50 new CVEs a day. LLMs are. There is a massive stack of software out there that is getting analyzed and exploited at a rate faster than it's getting patched. Add to that things like NPM's exploited-package-of-the-day and popular GitHub repository takeovers, and this year looks massively different from last year in the quantity and quality of exploits alone.
It has been clear for ages that certain types of bugs or issues are better solved by software.
But there were still plenty of things a proper SecOps person, helped by tooling, could find that automatic tooling alone wouldn't: taking a limited amount of resources and focusing them on the critical things.
I do think that edge is gone now. Same with threat modeling, etc.
If you've just gone through a lengthy analysis of your code with other AI tools, surely it's reasonable not to expect to see hundreds more from a new tool?
It should be possible, unless more bugs are introduced, to eventually get to a state where there are no more bugs in your code.
Process aside, it sounds like Daniel expected to find dozens/hundreds more bugs.
But Mythos found 1. After all that hype. 1.
I get that they're using it for marketing. Of course they are. But to reduce it to "just marketing" feels either ill-informed or outright wrong, unless you have reasons not to believe the dozens of credentialed, well-respected people in the field who have already shared their opinions after working with Mythos. There are plenty of them on all the social media sites.
And then there's the team at Mozilla. They wrote a blog post about this: they've worked with Anthropic before, using Opus 4.6, and found and fixed 22 vulnerabilities. Then they worked with Mythos and found and fixed 271 vulnerabilities. Unless you're going to accuse them of being shills, these are unquestionable numbers. The model is quantitatively better at this thing, and it matches what everyone is saying.
I think there are better things to accuse Anthropic of than simply lying for marketing purposes. Of course they'll use this as a marketing campaign, but there's plenty of evidence that there is something there, that the model is simply better than previous generations at this. Don't fall for the cheap reductionist take just because you don't like them or feel this is marketing fluff. It doesn't look like a gimmick, even if it gets used to push their agenda. Something, something, propaganda often uses true statements as well.
Exactly the same argument was made about o3-preview, lol. But anyway, do they talk about all the domains where Mythos made a leap in capabilities (math and other research, ML, SWE), or only about cybersec?
> And then there's the team at Mozilla. They wrote a blog post about this: they've worked with Anthropic before, using Opus 4.6, and found and fixed 22 vulnerabilities. Then they worked with Mythos and found and fixed 271 vulnerabilities
Those 22 bugs were found in February, when Mozilla was doing its first small-scale experiments with Opus 4.6 (i.e., no proper integration into the workflow, likely a relatively simple harness, likely only a small part of the codebase covered). You can't compare "22 bugs found during very early attempts to apply AI" with "271 bugs found during a large-scale codebase scan with a properly configured AI". The fact that Mozilla is pretty vague about the contribution of other AI models makes it even worse.
> Unless you're going to accuse them of being shills, these are unquestionable numbers. The model is quantitatively better at this thing
They found another ~150 bugs after their first announcement, and only ~35 of those were found by Mythos. That's already a very sharp drop in contribution.
> I think there are better things to accuse Anthropic of than simply lying for marketing purposes.
Anthropic already used a lot of "technically correct but in fact deceiving" statements in the Mythos system card. They are playing both "it's too dangerous" and "we don't have enough compute for that super model" at the moment (usually a big red flag). Opus 4.7 (which, given various facts, was likely supposed to be "Opus 5.0") is a disaster from various points of view. Of course people don't really believe Anthropic.
That's because that is what a lot of people did in recent years [1], to pad their resumes or to force developers to backport patches to older (but supported) kernel versions that wouldn't have gone in without a CVE attached [2]. Maintainers have been legitimately swamped with low-quality spam for a very long time. Only recently, in the last few months, did AI actually get "good enough"; the problem is that maintainers still have to differentiate between AI slop from wannabes and AI-assisted reports reviewed and refined by actual human professionals.
[1] https://www.zdnet.com/article/how-fake-security-reports-are-...
[2] https://opensourcewatch.beehiiv.com/p/linux-gets-cve-securit...
It's time for all the little snowflake software writers to pull up their pantaloons and realize that Linus' vision has become real. With enough AIs, all security bugs become shallow. And that software affects the real world, real money, and real people in it. That they are also under attack by well-financed groups with rather evil motivations. If I'm attacking some group using your software (such as another nation) I'm going to flood the fuck out of your PR system till you give up hope and die. I'm going to make you attack your contributors. I'm going to sow confusion so I have the maximum amount of time to lay waste to my enemies and profit to the max.
The internet is hostile. Software is hostile. There are sharks looking to eat you.
Time to face that fact.
I would very much like to know whether they were independent or affiliated with Anthropic.
> My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing.
... because of this.
You just confirmed that you didn't read the article.
"Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report."
https://news.ycombinator.com/item?id=48072916
I just woke up to that very workflow this morning. I ran it last night; it finished around 3am with ~200k tokens spent. I fixed the issue and created a follow-up doc for the things it could not verify.
The "AlphaOmega" foundation appears to be a sinecure org for an inner circle to make money:
https://alpha-omega.dev/
Their view of open source is clearly exemplified by the disgusting picture of a deceased man in a wheelchair that the alpha-omega bureaucrats wheel around (bottom of the page).
That is how overpaid security "researchers" view open source. They never write anything, get hundreds of thousands in funding from (now AI) companies and feel superior to everyone else if they find one issue in curl.
The site does not even have an email if you want one of the projects that they so wittily depict as deceased to be scanned. What losers.
Typo, or is there a spoof I should go read?
Does it say anything else? Just 'Aaaarggghhhh'?
Source: voice-typing this with Swedish vocal cords, and I only had to correct "different lives" to "differently", and add /[^\w\s]/.
I also thought they were disputing the word count before noticing. I even remarked on how I find this a weird metric, given that code is not prose [0], but then I deleted that once I picked up on what's going on.
[0] Comparing the output of `wc -w` with the word counts of books will, I'm reasonably sure, be super off.
Edit: I ran a calculation; substituting out symbols (but not underscores), digits, and comments yields a 390K word count compared to the 660K cited. Not excluding the comments yields 600K, so more than a third of all words in the sources are comments.
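For anyone who wants to reproduce a count along these lines, here's a rough sketch; the filtering rules below are my assumptions, not the exact calc above, so the numbers will differ somewhat:

```python
import re
import sys
from pathlib import Path

def count_words(text: str, strip_comments: bool = True) -> int:
    if strip_comments:
        # Naive C comment removal (ignores comment markers inside strings).
        text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)
        text = re.sub(r"//[^\n]*", " ", text)
    text = re.sub(r"\d", " ", text)       # substitute out digits
    text = re.sub(r"[^\w\s]", " ", text)  # and symbols, keeping underscores
    return len(text.split())

# Usage: python wordcount.py path/to/curl
total = sum(
    count_words(p.read_text(errors="replace"))
    for p in Path(sys.argv[1]).rglob("*.[ch]")
)
print(total)
```

Toggling strip_comments gives the with/without-comments comparison.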
I guess it's related to the phenomenon where you can read words relatively easily as long as the first and last letters are correct and the rest of the letters are there.
https://wire.insiderfinance.io/the-brains-power-to-read-jumb...
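A tiny demo of the effect (my own snippet, not from the linked article):

```python
# Shuffle the interior letters of each word, keeping the first and
# last in place -- the output usually stays surprisingly readable.
import random

def jumble(word: str) -> str:
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    random.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

text = "the first and last letters carry most of the information"
print(" ".join(jumble(w) for w in text.split()))
```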
The point wasn't actual cross-platform portability even though that was a nice side effect. It was to flush out all the weird edge cases.
Edges like security flaws. Buffer overflows are usually platform specific. There are plenty of other ways to find these issues but simply recompiling for a different platform surfaces all sorts of issues.
> It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway.
His expertise, I think, would elevate the results quite a bit. Although if he never uses LLMs, which is how it reads, I guess it might backfire just as well. Prompting style (still?) does matter, after all; certainly in my experience anyway.
> using these tools interactively
I did read the article. It seems to me they're using LLMs in a prepared manner instead, as mere scanners that produce reports.
[0] https://mastodon.social/@bagder
I checked back through two weeks' worth of posts, reposts, and replies there, and don't see anything suggesting so, so I'll have to take your word for this.
What I do see, however, is him responding to seemingly rather frequent harassment about AI use @ curl. The stance he takes in those cases is very reasonable (even if you don't use AI to scan the codebase and contributions, threat actors will); it's unfortunate this topic is so politicized that he has to deal with it to such an extent.