We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them

(quesma.com)

182 points | by jakozaur 6 hours ago

29 comments

7777332215 4 hours ago
I know they said they didn't obfuscate anything, but if you hide imports/symbols and obfuscate strings, which is the bare minimum for any competent attacker, the success rate will immediately drop to zero.
This is detecting the pattern of an anomaly in language associated with malicious activity, which is not impressive for an LLM.
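To make the string point concrete, here is a minimal sketch (not taken from the benchmark) of why a naive string scan stops working once strings are obfuscated. The blobs, the single-byte XOR key, and the helper names are all hypothetical; real obfuscation is usually more involved.

```python
import re

# Runs of 4+ printable ASCII bytes, similar in spirit to the `strings` tool.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")

# Hypothetical list of "obvious tell" strings.
SUSPICIOUS = (b"/bin/sh",)


def xor_bytes(data: bytes, key: int) -> bytes:
    """Single-byte XOR, a deliberately trivial stand-in for string obfuscation."""
    return bytes(b ^ key for b in data)


def flag_suspicious(blob: bytes) -> list[bytes]:
    """Return printable strings in the blob that contain a suspicious pattern."""
    return [
        s for s in PRINTABLE_RUN.findall(blob)
        if any(pattern in s for pattern in SUSPICIOUS)
    ]


# Two toy "binaries": one stores the string in the clear, one stores it XOR-encoded.
plain_blob = b"\x00\x01" + b"/bin/sh" + b"\x00\x02"
hidden_blob = b"\x00\x01" + xor_bytes(b"/bin/sh", 0x5A) + b"\x00\x02"

print("plain: ", flag_suspicious(plain_blob))
print("hidden:", flag_suspicious(hidden_blob))
```

This only shows that a literal-string heuristic breaks. It says nothing about whether a model reading the decompiled decoding routine would still notice the backdoor.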
- stared 2 hours ago
  One of the authors here.
  The tasks here are entry level. So we are impressed that some AI models are able to detect some patterns, while looking just at binary code. We didn't take it for granted.
  For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5) https://quesma.com/benchmarks/binaryaudit/#models-tooling
  We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same - vide https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.
  There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.
  - botusaurus 1 hour ago
    have you tried stuffing a whole set of tutorials on how to use ghidra in the context, especially for the 1 mil token context like gemini?
    - stared 1 hour ago
      No. To give it a fair test, we didn't tinker with model-specific context-engineering. Adding skills, examples, etc is very likely to improve performance. So is any interactive feedback.
      Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...
      - anamexis 55 minutes ago
        Why, though? That would make sense if you were just trying to do a comparative analysis of different agent's ability to use specific tools without context, but if your thesis is:
        > However, [the approach of using AI agents for malware detection] is not ready for production.
        Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."
        - stared 33 minutes ago
          You can solve any problem with AI if you give enough hints.
          The question we asked is if they can solve a problem autonomously, with instructions that would be clear for a reverse engineering specialist.
          That said, I found these useful for many binary tasks - just not (yet) the end-to-end ones.
- akiselev 3 hours ago
  When I was developing my ghidra-cli tool for LLMs to use, I was using crackmes as tests and it had no problem getting through obfuscation as long as it was prompted about it. In practice when reverse engineering real software it can sometimes spin in circles for a while until it finally notices that it's dealing with obfuscated code, but as long as you update your CLAUDE.md/whatever with its findings, it generally moves smoothly from then on.
  - eli 3 hours ago
    Is it also possible that crackme solutions were already in the training data?
    - akiselev 3 hours ago
      I used the latest submissions from sites like crackmes.one, which were days or weeks old, to guard against that.
- achille 2 hours ago
  in the article they explicitly said they stripped symbols. If you look at the actual backdoors many are already minimal and quite obfuscated,
  see:
  - https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...
  - https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...
  - comex 38 minutes ago
    The first one was probably found due to the reference to the string /bin/sh, which is a pretty obvious tell in this context.
    The second one is more impressive. I'd like to see the reasoning trace.
- hereme888 2 hours ago
  I've used Opus 4.5 and 4.6 to RE obfuscated malicious code with my own Ghidra plugin for Claude Code and it fully reverse engineered it. Granted, I'm talking about software cracks, not state-level backdoors.
- Retr0id 2 hours ago
  Stripping symbols is fairly normal, but hiding imports ought to be suspicious in its own right.
- halflife 3 hours ago
  Isn’t an LLM supposed to be better at analyzing obfuscated code than heuristics? Because of its ability to pattern match, it can deduce what obfuscated code does?
  - bethekidyouwant 1 hour ago
    How much binary code is in the training set? (None?)
akiselev 5 hours ago
Shameless plug: https://github.com/akiselev/ghidra-cli
I’ve been using Ghidra to reverse engineer Altium’s file format (at least the Delphi parts) and it’s insane how effective it is. Models are not quite good enough to write an entire parser from scratch but before LLMs I would have never even attempted the reverse engineering.
I definitely would not depend on it for security audits but the latest models are more than good enough to reverse engineer file formats.
- bitexploder 4 hours ago
  I can tell you how I am seeing agents be used with reasonable results. I will keep this high level. I don't rely on the agents solely. You build agents that augment your capabilities.
  They can make diagrams for you, give you an attack surface mapping, and dig for you while you do more manual work. As you work on an audit you will often find things of interest in a binary or code base that you want to investigate further. LLMs can often blast through a code base or binary finding similar things.
  I like to think of it like a swiss army knife of agentic tools to deploy as you work through a problem. They won't balk at some insanely boring task and that can give you a real speed up. The trick is if you fall into the trap of trying to get too much out of an LLM you end up pouring time into your LLM setup and not getting good results, I think that is the LLM productivity trap. But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks it can absolutely speed you up some.
  Also, when you have scale problems, just throw an LLM at it. Even low quality results are a good sniff test. Some of the time I just throw an LLM at a code review thing for a codebase I came across and let it work. I also love asking it to make me architecture diagrams.
  - johnmaguire 3 hours ago
    > But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks it can absolutely speed you up some.
    Are people sharing these somewhere?
- jakozaur 5 hours ago
  Oh, nice find... We ended up using PyGhidra, but the models waste some cycles because of bad ergonomics. Perhaps your cli would be easier.
  Still, Ghidra's most painful limitation was how extremely slow it was with Go. We had to exclude that example from the benchmark.
- selridge 3 hours ago
  This is really cool! Thanks for sharing. It's a lot more sophisticated than what I did w/ Ghidra + LLMs.
- stared 2 hours ago
  Thanks for sharing! It seems to be an active space, vide a recent MCP server (https://news.ycombinator.com/item?id=46882389). If you haven't tried it yet, I strongly recommend posting it as a Show HN.
  I tried a few approaches: https://github.com/jtang613/GhidrAssistMCP (the hardest to set up), Ghidra analyzeHeadless (GPT-5.2-Codex worked with it well!), and PyGhidra (my go-to). Did you try to see which works the best?
  I mean, very likely (especially with an explicit README for AI, https://github.com/akiselev/ghidra-cli/blob/master/.claude/s...) your approach might be more convenient to use with AI agents.
- lima 5 hours ago
  How does this approach compare to the various Ghidra MCP servers?
  - akiselev 4 hours ago
    There’s not much difference, really. I stupidly didn’t bother looking at prior art when I started reverse engineering and the ghidra-cli was born (along with several others like ilspy-cli and debugger-cli)
    That said, it should be easier to use as a human to follow along with the agent and Claude Code seems to have an easier time with discovery rather than stuffing all the tool definitions into the context.
    - bitexploder 4 hours ago
      That is pretty funny. But you probably learned something in implementing it! This is such a new field, I think small projects like this are really worthwhile :)
  - selridge 3 hours ago
    I also did this approach (scripts + home-brew cli)...because I didn't know Ghidra MCP servers existed when I got started.
    So I don't have a clear idea of what the comparison would be but it worked pretty well for me!
magicmicah85 4 hours ago
GPT is impressive with a consistent 0% false positive rate across models, yet its detection rate tops out at 18%. Meanwhile Claude Opus 4.6 is able to detect up to 46% of backdoors, but has a 22% false positive rate.
It would be interesting to have an experiment where these models are able to test exploiting but their alignment may not allow that to happen. Perhaps combining models together can lead to that kind of testing. The better models will identify, write up "how to verify" tests and the "misaligned" models will actually carry out the testing and report back to the better models.
- sdenton4 4 hours ago
  It would be really cool if someone developed some standard language and methodology for measuring the success of binary classification tasks...
  Oh, wait, we have had that for a hundred years - somehow it's just entirely forgotten when generative models are involved.
selridge 3 hours ago
>While end-to-end malware detection is not reliable yet, AI can make it easier for developers to perform initial security audits. A developer without reverse engineering experience can now get a first-pass analysis of a suspicious binary. [...] The whole field of working with binaries becomes accessible to a much wider range of software engineers. It opens opportunities not only in security, but also in performing low-level optimization, debugging and reverse engineering hardware, and porting code between architectures.
THIS is the takeaway. These tools are allowing *adjacency* to become a powerful guiding indicator. You don't need to be a reverser, you can just understand how your software works and drive the robot to be a fallible hypothesis generator in regions where you can validate only some of the findings.
greazy 22 minutes ago
Very nitpicky but because I spend a lot of time plotting data: don't arbitrarily color the bar plots without at least mentioning cut offs. Why 19% is orange and 20% is green is a mystery.
- godelski 8 minutes ago
  It's a pretty common threshold, like 10% is. There's the 80/20 "Pareto" rule, it's the value of one finger on one hand, or, if you really want to stretch, the p-value of 0.05 is 1 in 20 odds (that's definitely a stretch, and arbitrary anyway). But 20 is a very human number and very common. It's just a division into 5 rather than 4 (I'm assuming you wouldn't have questioned a cutoff at 25%).
manbash 18 minutes ago
RE agents are really interesting!
Too bad the author didn't really share the agents they were using so we can't really test this ourselves.
folex 5 hours ago
> The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.
Perhaps it would make sense to provide LLMs with some strategy guides written in .md files.
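One piece of such a strategy guide could be a prioritization rule. A minimal sketch, assuming a call graph has already been exported from the disassembler: review functions in order of how close they sit to code that handles untrusted input. The toy graph, the function names, and the `INPUT_APIS` list are all hypothetical, and this is not how the benchmark or any of the models actually work.

```python
from collections import deque

# Assumed list of imports that bring untrusted input into the program.
INPUT_APIS = {"recv", "read", "accept"}


def review_order(call_graph: dict[str, list[str]]) -> list[str]:
    """Order functions by call distance from direct input handlers (BFS).

    Functions that are not reachable from any input handler are left out,
    which is the "ignore the noise" part of the strategy.
    """
    # Functions that call an input API directly are the starting points.
    start = sorted(
        name for name, callees in call_graph.items()
        if INPUT_APIS & set(callees)
    )
    distance = {name: 0 for name in start}
    queue = deque(start)
    while queue:
        current = queue.popleft()
        for callee in call_graph.get(current, []):
            if callee in INPUT_APIS or callee in distance:
                continue
            distance[callee] = distance[current] + 1
            queue.append(callee)
    return sorted(distance, key=lambda name: (distance[name], name))


# Hypothetical call graph: caller -> list of callees.
toy_graph = {
    "main": ["init_logging", "serve"],
    "serve": ["accept", "handle_request"],
    "handle_request": ["recv", "parse_headers"],
    "parse_headers": ["check_auth"],
    "check_auth": [],
    "init_logging": ["format_timestamp"],
    "format_timestamp": [],
}

print(review_order(toy_graph))
```

A rule like this is easy to state in a .md file, but a backdoor placed outside the obvious input paths would be skipped, so it is a heuristic for where to look first rather than a filter.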
- Arech 4 hours ago
  That's what I thought of too. Given their task formulation (they basically said - "check these binaries with these tools at your disposal" - and that's it!) their results are already super impressive. With proper guidance and professional oversight it's a tremendous force multiplier.
  - selridge 3 hours ago
    We are in this super weird space where the comparable tasks are one-shot, e.g. "make me a to-do app" or "check these binaries", but any real work is multi-turn and dynamically structured.
    But when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the latter is how real work will happen.
    It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance but one-shot performance is useful in exactly 0 interesting cases.
    - NitpickLawyer 1 hour ago
      Something I found useful is to "just figure it out" the first part (usually discovery, or library testing, new cli testing, repo understanding, etc.) and then distill it into "learnings" that I can place in agents.md or relevant skills. So you get the speed of "just prompt it" and the repeatability of having it already worked in this area. You also get more insight into what tasks work today, and at what effort level.
      Sometimes it feels like it's not dissimilar to spending 4 hours to automate a 10 minute task that I thought I'll need forever but ended up just using it once in the past 5 months. But sometimes I unlock something that saves a huge amount of time, and can be reused in many steps of other projects.
- selridge 4 hours ago
  That’s hard. Sometimes you will do that and find it prompts the model into “strategy talk” where it deploys the words and frame you use in your .md files but doesn’t actually do the strategy.
  Even where it works, it is quite hard to specify human strategic thinking in a way that an AI will follow.
jakozaur 6 hours ago
See direct benchmark link: https://quesma.com/benchmarks/binaryaudit/
Open-source GitHub: https://github.com/QuesmaOrg/BinaryAudit
EB66 3 hours ago
The fact that Gemini returns the highest rate of false positives aligns with my experience using the Gemini models. I use ChatGPT, Claude and Gemini regularly and Gemini is clearly the most sycophantic of the three. If I ask those three models to evaluate something or estimate odds of success, Gemini always comes back with the rosiest outlook.
I had been searching for a good benchmark that provided some empirical evidence of this sycophancy, but I hadn't found much. Measuring false positives when you ask the model to complete a detection related task may be a good way of doing that.
shevy-java 2 hours ago
So the best one found about 50%. I think that is not bad, probably better than most humans. But what about the remaining 50%? Why were some found and others not?
> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about
> Even the best model in our benchmark got fooled by this task.
That is quite strange. Because it seems almost as if a human is required to make the AI tools understand this.
simianwords 5 hours ago
I'm not an expert but about false positives: why not make the agent attempt to use the backdoor and verify that it is actually a backdoor? Maybe give it access to tools and so on.
- jakozaur 4 hours ago
  So many models refuse to do that due to alignment and safety concerns. So cross-model comparison doesn't make sense. We do, however, require proof (such as providing a location in binary) that is hard to game. So the model not only has to say there is a backdoor, but also point out the location.
  Your approach, however, makes a lot of sense if you are ready to have your own custom or fine-tuned model.
  - simianwords 4 hours ago
    Surprising that they still allow catching the backdoors but not using them.
    A bad actor already has most of the work done.
    - garblegarble 1 hour ago
      Sounds like the pitch writes itself, "you'd better spend a lot of token money with us before the bad guys do it to you..."
Tiberium 4 hours ago
I highly doubt some of those results, GPT 5.2/+codex is incredible for cyber security and CTFs, and 5.3 Codex (not on API yet) even moreso. There is absolutely no way it's below Deepseek or Haiku. Seems like a harness issue, or they tested those models at none/low reasoning?
- stared 7 minutes ago
  To be honest, it surprised us too. I mean, I used GPT 5.2 Codex in Cursor for decompiling an old game and it worked (way better than Claude Code with Opus 4.5). We tested Opus 4.6, but are waiting for the public API to test GPT 5.3 Codex.
  At the same time, tasks differ, and not all models that work best end-to-end are the same as the ones that are good for a typical, interactive workflow.
  We used Terminus 2 agent, as it is the default used by Harbor (https://harborframework.com/), as we want to be unbiased. Very likely other frameworks will change the result.
- jakozaur 4 hours ago
  As I do eval and training data sets for a living, in niche skills, you can find plenty of surprises.
  The code is open-source; you can run it yourself using Harbor Framework:
  git clone git@github.com:QuesmaOrg/BinaryAudit.git
  export OPENROUTER_API_KEY=...
  harbor run --path tasks --task-name lighttpd-* --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --model openrouter/google/gemini-3-pro-preview --model openrouter/openai/gpt-5.2 --n-attempts 3
  Please open a PR if you find something interesting, though our domain experts spend a fair amount of time looking at trajectories.
  - Tiberium 4 hours ago
    Just for fun, I ran dnsmasq-backdoor-detect-printf (which has a 0% pass rate in your leaderboard with GPT models) with --agent codex instead of terminus-2 with gpt-5.2-codex and it identified the backdoor successfully on the first try. I honestly think it's a harness issue, could you re-run the benchmarks with Codex for gpt-5.2-codex and gpt-5.2?
  - Tiberium 4 hours ago
    Are the existing trajectories from your runs published anywhere? Or is the only way for me to run them again?
    - jakozaur 3 hours ago
      I can provide trajectories. Though probably we are not going to publish them this time. This would need some extra safeguards.
      Email me. The address is in profile.
snowhale 3 hours ago
the false positive rate (28% on clean binaries) is the real problem here, not the 49% detection rate. if you're running this on prod systems you'd be drowning in noise. also the execl("/bin/sh") rationalization is a telling failure -- the model sees suspicious evidence and talks itself out of it rather than flagging for review.
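A back-of-the-envelope sketch of the noise problem, using the two rates quoted above and Bayes' rule. The prevalence values (the share of scanned binaries that really are backdoored) are assumptions, since real-world base rates are not given in the article.

```python
def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Fraction of flagged binaries that are truly backdoored (Bayes' rule)."""
    true_alerts = tpr * prevalence            # backdoored and flagged
    false_alerts = fpr * (1.0 - prevalence)   # clean but flagged anyway
    total_alerts = true_alerts + false_alerts
    return true_alerts / total_alerts if total_alerts > 0 else 0.0


# Rates quoted in the comment above; the prevalence values are hypothetical.
TPR, FPR = 0.49, 0.28

for prevalence in (0.5, 0.1, 0.01):
    ppv = precision_at_prevalence(TPR, FPR, prevalence)
    print(f"prevalence={prevalence:.2f}  share of alerts that are real={ppv:.3f}")
```

As the assumed prevalence drops, the share of alerts that are real drops with it, which is the "drowning in noise" effect on production systems where almost every binary is clean.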
Bender 6 hours ago
Along this line, can AIs find backdoors spread across multiple pieces of code and/or services? I.e., pieces that by themselves are not backdoors, so that advanced penetration testers would not suspect anything is afoot, but that provide access when used together.
e.g. an intentional weakness in systemd + udev + binfmt magic when used together == authentication and mandatory access control bypass. Each weakness reviewed individually just looks like benign sub-optimal code.
- cluckindan 5 hours ago
  Start with trying to find the xz vulnerability and other software possibly tying into that.
  Is there code that does something completely different than its comments claim?
  - Bender 4 hours ago
    Another way to phrase what I am asking is ... Does AI understand the context of code deep enough to know everything a piece of code can do, everything a service can do vs. what it was intended to do. If it can understand code that far then it could understand all the potential paths data could flow and thus all the potential vulnerabilities that several pieces of code together could achieve when used in concert with one another. Advanced multi-tier chess so to speak.
    Or put another way, each of these three through three hundred applications or services by themselves may be intended to perform x,y,z functions but when put together by happy coincidence they can perform these fifty-million other unintended functions including but not limited to bypassing authentication, bypassing mandatory access controls, avoiding logging and auditing, etc... oh and it can automate washing your dishes, too.
    - DANmode 4 hours ago
      Some models can,
      depending on the length of the piece of code,
      is probably the most honest answer right now.
      - Bender 4 hours ago
        Fair enough. I suspect when they reach such a point that length no longer matters then a plethora of old and currently used state sponsored complex malware will be realized. Beyond that I think the next step would be to attain attribution to both individuals and perhaps whom they were really employed by. Bonus if the model can rewrite or sanitize each piece of code to remove the malicious capabilities without breaking the officially intended functions.
wangzhongwang 3 hours ago
This is a really cool experiment. What strikes me is how the results mirror what we see in software supply chain attacks too - the backdoor doesn't have to be clever, it just has to be buried deep enough that nobody bothers to look. 40MB is already past the threshold where most people would manually audit anything.
I wonder if a hybrid approach would work better: use AI to flag suspicious sections, then have a human reverser focus only on those. Kind of like how SAST tools work for source code - nobody expects them to catch everything, but they narrow down where to look.
nisarg2 4 hours ago
I wonder how model performance would change if the tooling included the ability to interact with the binary and validate the backdoor. Particularly for models that had a high rate of false positives, would they test their hypothesis?
BruceEel 5 hours ago
Very, very cool. Besides the top-performing models, it's interesting (if I'm reading this correctly) that gpt-5.2 did ~2x better than gpt-5.2-codex.. why?
- NitpickLawyer 4 hours ago
  > gpt-5.2 did ~2x better than gpt-5.2-codex.. why?
  Optimising a model for a certain task, via fine-tuning (aka post-training), can lead to loss of performance on other tasks. People want codex to "generate code" and "drive agents" and so on. So oAI fine-tuned for that.
dgellow 4 hours ago
Random thoughts, only vaguely related: what’s the impact of AI on CTFs? I would assume that kills part of the fun of such events?
- not_a9 3 hours ago
  Things are pretty brutal and some categories are more affected than others.
  A/D seems to be somewhat less affected.
  - achierius 29 minutes ago
    Brutal as in, heavy AI usage? What sort of categories are more affected?
ducktastic 4 hours ago
It would be interesting to have some tests run against deliberate code obfuscation next
hereme888 2 hours ago
> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about.
Lol.
> Gemini 3 Pro supposedly “discovered” a backdoor.
Yup, sounds typical for Gemini...it tends to lie.
Very good article. Sounds super useful to apply its findings and improve LLMs.
On a similar note.... reverse engineering is now accessible to the public. Tons of old software is now easy to RE. Are software companies having issues with this?
openasocket 2 hours ago
Ummm, is it a good idea to use AI for malware analysis? I know this is just a proof of concept, but if you have actual malware, it doesn’t seem safe to hand that to AI. Given the lengths of anti-debugging that goes in existing malware, making something to prompt inject, or trick AI to execute something, seems easier.
fsniper 2 hours ago
So these beat me to identifying backdoors too. This is going places at an alarming pace.
monegator 3 hours ago
the interactive code viewer is neat!
Roark66 4 hours ago
And this is one demonstration of why these "1000 CTOs claim no effectiveness improvement after introducing AI in their companies" are 100% BS.
They may have not noticed an improvement, but it doesn't mean there isn't any.
- localuser13 1 hour ago
  Is it? Gemini 3-pro-preview and 3-flash-preview, respectively top2 and top3, had 44% and 37% true positives and a whopping 65% and 86% false positives. This is worse than a coin toss. Anything more than 0% (3% to be generous) is useless in the real world. This leaves only grok and GPT, with 18%, 9% and 2% success rates.
  In fact, this is what authors said themselves: "However, this approach is not ready for production. Even the best model, Claude Opus 4.6, found relatively obvious backdoors in small/mid-size binaries only 49% of the time. Worse yet, most models had a high false positive rate — flagging clean binaries." So I'm not sure if we're even discussing the same article.
  I also don't see a comparison with any other methodology. What is the success rate of ./decompile binary.exe | grep "(exec|system)/bin/sh"? What is the success rate of state-of-the-art alternative approaches?
- snovv_crash 2 hours ago
  Even without AI, many (most?) orgs are held back by internal processes and politics, not development speed.
- HeWhoLurksLate 4 hours ago
  it also generally takes a heck of a noisy bang for internal developments to make it to the c-suite
stevemk14ebr 4 hours ago
These results are terrible, false positives and false negatives. Useless
- amelius 4 hours ago
  Yeah, what does the confusion matrix look like?
  - selridge 3 hours ago
    Markedly better than six months ago.
shablulman 5 hours ago
Validating binary streams at the gateway level is such an overlooked part of the stack; catching malformed Protobuf or Avro payloads before they poison downstream state is a massive win for long-term system reliability.
- bangaladore 59 minutes ago
  What's the point of posting what is clearly an AI generated comment.