XBOW, an autonomous penetration tester, has reached the top spot on HackerOne

2025-06-24 15:53 · xbow.com

For the first time in bug bounty history, an autonomous penetration tester has reached the top spot on the US leaderboard.

Our path to reaching the top ranks on HackerOne began with rigorous benchmarking. Since the early days of XBOW, we understood how crucial it was to measure our progress, and we did that in two stages:

  • First, we tested XBOW against existing CTF challenges (from well-known providers like PortSwigger and PentesterLab), then quickly moved on and built our own unique benchmark simulating real-world scenarios—ones never used to train LLMs. The results were encouraging, but these were still artificial exercises.
  • The logical next step, therefore, was to focus on discovering zero-day vulnerabilities in open source projects, which led to many exciting findings. Some of these were reported on this blog before: in every case, we gave the AI access to source code, simulating a white-box pentest. While our paying customers were enthusiastic about XBOW’s capabilities, the community raised a key question: How would XBOW perform in real, black-box production environments? We took up that challenge, choosing to compete in one of the largest hacker arenas, where companies serve as the ultimate judges by verifying and triaging vulnerabilities themselves.

Dogfooding AI in Bug Bounties

XBOW is a fully autonomous, AI-driven penetration tester. It requires no human input and operates much like a human pentester, but it scales rapidly, completing comprehensive penetration tests in just a few hours.

When building AI software, having precise benchmarks to keep pushing the limits of what’s possible is essential. And when some of those benchmarks evolve into real-world environments, it’s a developer’s dream come true.

Discovering bugs in structured benchmarks and open source projects was a fantastic starting point. However, nothing can truly prepare you for the immense diversity of real-world environments, which span from cutting-edge technologies to 30-year-old legacy systems. No number of design partners can offer that breadth of system variety; that level of unpredictability is nearly impossible to simulate.

To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.

HackerOne offers this unique opportunity, and as XBOW discovered and reported vulnerabilities across multiple programs, we soon found ourselves climbing the H1 ranks.

Scaling Discovery and Scoping Capabilities

Our first challenge was scaling. While XBOW can easily scan thousands of web apps simultaneously, HackerOne hosts hundreds of thousands of potential targets. As a startup with limited resources, even when we focused on specific vulnerability classes, we still needed to be strategic. That’s why we built infrastructure on top of XBOW to help us identify the high-value targets and prioritize those that would maximize our return on investment.

We started by consuming bug bounty program scopes and policies, but this information isn’t always machine-readable. With a combination of large language models and some manual curation, we managed to parse through them—with a few hiccups. (At one point, we were officially removed from a program that didn’t allow “automatic scanners.”)
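
As a rough illustration of that parsing step, the sketch below asks an LLM to turn free-form policy text into a machine-readable scope. It assumes the OpenAI Python SDK; the prompt, model choice, and JSON shape are illustrative rather than a description of our production pipeline.

    import json
    from openai import OpenAI  # assumes the OpenAI Python SDK is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    PROMPT = (
        "Extract the in-scope and out-of-scope assets from this bug bounty policy. "
        "Respond with JSON containing the keys in_scope, out_of_scope, and "
        "automated_scanning_allowed.\n\nPolicy:\n"
    )

    def parse_scope(policy_text: str) -> dict:
        """Ask an LLM to normalize a free-form policy into a structured scope (illustrative)."""
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder model choice
            response_format={"type": "json_object"},
            messages=[{"role": "user", "content": PROMPT + policy_text}],
        )
        return json.loads(resp.choices[0].message.content)

Even with a step like this, manual curation remains necessary for ambiguous or unusual policies.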

With the domains ingested into our database, and a bit of “magic” to expand subdomains, we built a scoring system to highlight the most interesting targets. The scoring criteria covered a broad range of signals, including target appearance, presence of WAFs and other protections, HTTP status codes, redirect behavior, authentication forms, number of reachable endpoints, underlying technologies, and more.
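
To make the idea concrete, here is a simplified scoring heuristic in the spirit of what we describe above. The signals, weights, and thresholds are hypothetical; the real criteria are broader and tuned continuously.

    from dataclasses import dataclass

    @dataclass
    class TargetSignals:
        status_code: int            # HTTP status of the landing page
        has_waf: bool               # WAF or similar protection detected
        has_login_form: bool        # authentication surface present
        reachable_endpoints: int    # endpoints discovered while crawling
        legacy_stack: bool          # outdated framework or server fingerprint

    def score_target(s: TargetSignals) -> float:
        """Rank a target for prioritization; weights here are illustrative only."""
        score = 0.0
        if s.status_code == 200:
            score += 2.0                                # live application content
        if s.has_login_form:
            score += 3.0                                # richer attack surface
        if s.has_waf:
            score -= 1.0                                # protections lower expected yield
        score += min(s.reachable_endpoints, 50) * 0.1   # more endpoints, more surface
        if s.legacy_stack:
            score += 1.0                                # legacy tech tends to hide easy bugs
        return score

    # Example: score_target(TargetSignals(200, False, True, 120, True)) -> 11.0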

Domain deduplication quickly became essential: in large programs, it is common to encounter cloned or staging environments (e.g. stage0001-dev.example.com). Once a vulnerability is found in one, similar issues are likely to exist across the others. To stay efficient, we used SimHash to detect content-level similarity and leveraged a headless browser to capture website screenshots, then applied imagehash techniques to assess visual similarity, allowing us to group assets and focus our efforts on unique, high-impact targets.
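
A minimal sketch of those two deduplication signals follows. It assumes the third-party simhash, imagehash, and Pillow packages, and the distance thresholds are illustrative guesses rather than our actual tuning.

    from simhash import Simhash          # content-level similarity
    import imagehash                     # perceptual hashing of screenshots
    from PIL import Image

    def content_similar(html_a: str, html_b: str, max_distance: int = 6) -> bool:
        """Near-duplicate check on response bodies via SimHash Hamming distance."""
        return Simhash(html_a).distance(Simhash(html_b)) <= max_distance

    def visually_similar(shot_a: str, shot_b: str, max_distance: int = 8) -> bool:
        """Near-duplicate check on headless-browser screenshots via perceptual hashes."""
        hash_a = imagehash.phash(Image.open(shot_a))
        hash_b = imagehash.phash(Image.open(shot_b))
        return (hash_a - hash_b) <= max_distance

    # Assets whose pages and screenshots both match can be grouped, so a finding on
    # stage0001-dev.example.com is not re-reported against every clone of the same app.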

Automated Vulnerability Discovery

AI can be remarkably effective at discovering a broad range of vulnerabilities—but the real challenge isn’t always detection, it’s precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.

To ensure accuracy, we developed the concept of validators: automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (Don’t miss Brendan Dolan-Gavitt’s Black Hat presentation on AI agents for Offsec.)
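
As an illustration of a programmatic validator, the sketch below uses a headless browser (Playwright here) to confirm that an injected payload actually executed, by watching for a JavaScript dialog containing a known marker. The marker string and the dialog-based proof are assumptions for this example, not a description of our exact validator.

    from playwright.sync_api import sync_playwright

    def xss_executed(url: str, marker: str = "xbow-poc") -> bool:
        """Return True if loading `url` fires a JS dialog containing `marker` (illustrative check)."""
        fired = {"hit": False}
        with sync_playwright() as p:
            browser = p.chromium.launch()
            page = browser.new_page()

            def on_dialog(dialog):
                # The payload is assumed to call alert("xbow-poc") when it runs.
                if marker in dialog.message:
                    fired["hit"] = True
                dialog.dismiss()

            page.on("dialog", on_dialog)
            page.goto(url, wait_until="networkidle")
            browser.close()
        return fired["hit"]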

XBOW’s Real-World Impact

Running XBOW across a wide range of public and private programs yielded results that exceeded our expectations—not just in volume, but in consistency and quality.

Over time, XBOW reported thousands of validated vulnerabilities, many of them affecting high-profile targets from well-known companies. These findings weren’t just theoretical; every submission was confirmed by the program owners and triaged as real, actionable security issues.

The most public signal of progress came from the HackerOne leaderboard. Competing alongside thousands of human researchers, XBOW climbed to the top position in the US ranking. That wasn’t our original goal, and it was indeed surprising, since we didn’t have a buffer of untriaged reports from previous quarters—but it became a useful benchmark to track real-world performance and collect traces to reinforce our models.

XBOW submitted nearly 1,060 vulnerabilities. All findings were fully automated, though our security team reviewed them pre-submission to comply with HackerOne’s policy on automated tools. It was a unique privilege to wake up each morning and review creative new exploits.

To date, bug bounty programs have resolved 130 vulnerabilities, while 303 were classified as Triaged (mostly by VDP programs that acknowledged the issue but did not proceed to resolution). In addition, 33 reports are currently marked as new, and 125 remain pending review by program owners.

Across all submissions, 208 were marked as duplicates, 209 as informative, and 36 as not applicable (most of them self-closed by our team). Interestingly, many of the informative reports came from programs with specific constraints, such as policies excluding third-party vulnerabilities or disallowing certain classes like Cache Poisoning.

XBOW identified a full spectrum of vulnerabilities, including Remote Code Execution, SQL Injection, XML External Entities (XXE), Path Traversal, Server-Side Request Forgery (SSRF), Cross-Site Scripting, Information Disclosure, Cache Poisoning, Secret Exposure, and more.

Over the past 90 days alone, the vulnerabilities submitted were classified as 54 critical, 242 high, 524 medium, and 65 low severity issues by program owners. Notably, around 45% of XBOW’s findings are still awaiting resolution, highlighting the volume and impact of the submissions across live targets.

XBOW’s path to the top involved uncovering a wide range of interesting and impactful vulnerabilities. Among them was a previously unknown vulnerability in Palo Alto’s GlobalProtect VPN solution, affecting over 2,000 hosts. Throughout this process, XBOW consistently demonstrated its ability to adapt to edge cases and develop creative strategies for complex exploitation scenarios entirely on its own.

In the spirit of transparency, and in accordance with the rules and regulations of POC || GTFO, our security team will be publishing a series of blog posts over the coming weeks, showcasing some of our favorite technical discoveries by XBOW.

XBOW is an enterprise solution. If your company would like a demo, email us at [email protected].



Comments

  • By hinterlands 2025-06-24 21:25 (5 replies)

    Xbow has really smart people working on it, so they're well-aware of the usual 30-second critiques that come up in this thread. For example, they take specific steps to eliminate false positives.

    The #1 spot in the ranking is both more of a deal and less of a deal than it might appear. It's less of a deal in that HackerOne is an economic numbers game. There are countless programs you can sign up for, with varied difficulty levels and payouts. Most of them pay not a whole lot and don't attract top talent in the industry. Instead, they offer supplemental income to infosec-minded school-age kids in the developing world. So I wouldn't read this as "Xbow is the best bug hunter in the US". That's a bit of a marketing gimmick.

    But this is also not a particularly meaningful objective. The problem is that there's a lot of low-hanging bugs that need squashing and it's hard to allocate sufficient resources to that. Top infosec talent doesn't want to do it (and there's not enough of it). Consulting companies can do it, but they inevitably end up stretching themselves too thin, so the coverage ends up being hit-and-miss. There's a huge market for tools that can find easy bugs cheaply and without too many false positives.

    I personally don't doubt that LLMs and related techniques are well-tailored for this task, completely independent of whether they can outperform leading experts. But there are skeptics, so I think this is an important real-world result.

    • By bgwalter 2025-06-24 23:27 (1 reply)

      Maybe that is because the article is chaotic (like any "AI" article) and does not really address the false positive issue in a well-presented manner? Or even at all?

      Below people are reading the tea leaves to get any clue.

      • By moomin 2025-06-25 11:19 (1 reply)

        There’s two whole paragraphs under a dedicated heading. I don’t think the problem is with the article here. Paragraphs reproduced below:

        AI can be remarkably effective at discovering a broad range of vulnerabilities—but the real challenge isn’t always detection, it’s precision. Automation has long struggled with false positives, and nowhere is this more evident than in vulnerability scanning. Tools that flag dozens of irrelevant issues often create more work than they save. When AI enters the equation, the stakes grow even higher: models can generalize well, but verifying technical edge cases is a different game entirely.

        To ensure accuracy, we developed the concept of validators: automated peer reviewers that confirm each vulnerability XBOW uncovers. Sometimes this process leverages a large language model; in other cases, we build custom programmatic checks. For example, to validate Cross-Site Scripting findings, a headless browser visits the target site to verify that the JavaScript payload was truly executed. (Don’t miss Brendan Dolan-Gavitt’s Black Hat presentation on AI agents for Offsec.)

        • By eeeeeeehio 2025-06-25 11:54

          This doesn't say anything about how many false positives they actually have. Yes, you can write other programs (that might even invoke another LLM!) to "check" the findings. That's a very obvious and reasonable thing to do. But all "vulnerability scanners", AI or not, must take steps to avoid FPs -- that doesn't tell us how well they actually work.

          The glaring omission here is a discussion of how many bugs the XBOW team had to manually review in order to make ~1k "valid" submissions. They state:

          > It was a unique privilege to wake up each morning and review creative new exploits.

          How much of every morning was spent reviewing exploits? And what % of them turned out to be real bugs? These are the critical questions that (a) are unanswered by this post, and (b) determine the success of any product in this space, imo.

    • By normie3000 2025-06-24 22:01 (6 replies)

      > Top infosec talent doesn't want to do it (and there's not enough of it).

      What is the top talent spending its time on?

      • By hinterlands 2025-06-24 22:40

        Vulnerability researchers? For public projects, there's a strong preference for prestige stuff: ecosystem-wide vulnerabilities, new attack techniques, attacking cool new tech (e.g., self-driving cars).

        To pay bills: often working for tier A tech companies on intellectually-stimulating projects, such as novel mitigations, proprietary automation, etc. Or doing lucrative consulting / freelance work. Generally not triaging Nessus results 9-to-5.

      • By mr_mitm 2025-06-25 10:32

        Working from 9 to 5 for a guaranteed salary that is not dependent on how many bugs you find before anybody else, and not having to argue your case or negotiate the bounty.

      • By kalium-xyz 2025-06-25 11:53

        From my experience, they work on random personal projects 90% of their time

      • By tptacek 2025-06-24 22:40

        Specialized bug-hunting.

      • By UltraSane 2025-06-25 2:49

        The best paying bug bounties.

      • By atemerev 2025-06-25 4:51

        "A bolt cutter pays for itself starting from the second bike"

    • By Sytten 2025-06-25 0:47 (1 reply)

      100% agree with OP: to make a living in BBH you can't spend all day hunting on VDP programs that don't pay anything. That means there will be a lot of low-hanging fruit left on those programs.

      I don't think LLMs replace humans; they do free up time for nicer tasks.

      • By skeeter2020 2025-06-25 13:35

        ...which is exactly what technology advancements in our field have done since its inception, vs. the "this changes everything for everybody forever" narrative that makes AI cheerleaders so exhausting.

    • By moomin 2025-06-25 11:16

      Honestly I think this is extremely impressive, but it also raises what I call the “junior programmer” problem. Say XBOW gets good enough to hoover up basically all that money and can do it cost-effectively. What then happens to the pipeline of security researchers?

    • By absurdo 2025-06-24 21:31

      > so they're well-aware of the usual 30-second critiques that come up in this thread.

      Succinct description of HN. It’s a damn shame.

  • By tecleandor 2025-06-24 18:41 (2 replies)

    First:

    > To bridge that gap, we started dogfooding XBOW in public and private bug bounty programs hosted on HackerOne. We treated it like any external researcher would: no shortcuts, no internal knowledge—just XBOW, running on its own.

    Is it dogfooding if you're not doing it to yourself? I'd consider it dogfooding only if they were flooding themselves with AI-generated bug reports, not other people. They're not the ones reviewing them.

    Also, honest question: what does "best" mean here? The one that has sent the most reports?

    • By jamessinghal 2025-06-24 18:51 (2 replies)

      Their success rates on HackerOne seem to vary widely.

        22/24 (Valid / Closed) for Walt Disney
      
        3/43 (Valid / Closed) for AT&T

      • By pclmulqdq 2025-06-24 20:27 (1 reply)

        Walt Disney doesn't pay bug bounties. AT&T's bounties go up to $5k, which is decent but still not much. It's possible that the market for bugs is efficient.

        • By monster_truck 2025-06-24 22:36 (1 reply)

          Walt Disney's program covers substantially more surface area; there are 6(?) publicly traded companies listed there. In addition to covering far fewer domains & apps, AT&T's conditions and exclusions disqualify a lot more.

          The market for bounties is a circus, breadcrumbs for free work from people trying to 'make it'. It can safely be analogized to the classic trope of those wanting to work in games getting paid fractional market rates for absurd amounts of QA effort. The number of CVSS vulns with a score above 8 that have floated across the front page of HN in the past year without anyone getting paid tells you that much.

          • By ackbar03 2025-06-25 11:43 (1 reply)

            > The market for bounties is a circus, breadcrumbs for free work from people trying to 'make it'.
            >
            > The number of CVSS vulns with a score above 8 that have floated across the front page of HN in the past year without anyone getting paid tells you that much.

            You make it sound like there's a ton of people going around who can just dig up CVSS vulns above 8, and it's making me all confused. Is that really happening? I have a single bounty on H1 just to show I could do it, and that still took ages and was a shitty bug.

            • By monster_truck 2025-06-25 16:22

              The weighted average is 7.6. Finding them doesn't necessarily take much effort if you know what to look for.

      • By thaumasiotes 2025-06-24 18:53 (1 reply)

        > Their success rate on HackerOne seems to vary widely.

        Some of that is likely down to company policies; Snapchat's policy, for example, is that nothing is ever marked invalid.

        • By jamessinghal 2025-06-24 18:57 (1 reply)

          Yes, I'm sure anyone with more HackerOne experience can give specifics on the companies' policies. For now, those are the most objective measures of quality we have on the reports.

          • By moyix 2025-06-24 19:09

            This is discussed in the post – many came down to individual programs' policies e.g. not accepting the vulnerability if it was in a 3rd party product they used (but still hosted by them), duplicates (another researcher reported the same vuln at the same time; not really any way to avoid this), or not accepting some classes of vuln like cache poisoning.

    • By inhumantsar 2025-06-24 23:38

      I think they mean dogfooding as in putting on the "customer" hat and using the product.

      Seems reasonable to call that dogfooding considering that flooding themselves wouldn't be any more useful than synthetic testing and there's only so much ground they could cover using it on their own software.

      If this were coming out of Microsoft or IBM or whatever then yeah, not really dogfooding.

  • By vmayoral 2025-06-25 6:05 (1 reply)

    It’s humans who:

    - Design the system and prompts

    - Build and integrate the attack tools

    - Guide the decision logic and analysis

    This isn’t just semantics — overstating AI capabilities can confuse the public and mislead buyers, especially in high-stakes security contexts.

    I say this as someone actively working in this space. I participated in the development of PentestGPT, which helped kickstart this wave of research and investment, and more recently, I’ve been working on Cybersecurity AI (CAI) — the leading open-source project for building autonomous agents for security:

    - CAI GitHub: https://github.com/aliasrobotics/cai

    - Tech report: https://arxiv.org/pdf/2504.06017

    I’m all for pushing boundaries, but let’s keep the messaging grounded in reality. The future of AI in security is exciting — and we’re just getting started.

    • By vasco 2025-06-25 6:51 (1 reply)

      > It's humans

      Who would it be, gremlins? Those humans weren't at the top of the leaderboard before they had the AI, so clearly it helps.

      • By vmayoral 2025-06-30 10:17

        Actually, those humans (XBOW's) were already top rankers. Just look it up.

        What's being criticized here is the hype, which can be misleading and confusing. On this topic, I wrote a small essay, “Cybersecurity AI: The Dangerous Gap Between Automation and Autonomy,” to sort fact from fiction -> https://shorturl.at/1ytz7
