AI crawlers, fetchers are blowing up websites; Meta, OpenAI are worst offenders

2025-08-21 11:35 | www.theregister.com

One fetcher bot seen smacking a website with 39,000 requests per minute

Cloud services giant Fastly has released a report claiming AI crawlers are putting a heavy load on the open web: crawlers slurping up sites account for 80 percent of all AI bot traffic, with AI fetchers making up the remaining 20 percent. Either kind can hit a website hard, demanding data from a single site at thousands of requests per minute.

I can only see one thing causing this to stop: the AI bubble popping

According to the report [PDF], Facebook owner Meta's AI division accounts for more than half of those crawlers, while OpenAI accounts for the overwhelming majority of on-demand fetch requests.


"AI bots are reshaping how the internet is accessed and experienced, introducing new complexities for digital platforms," Fastly senior security researcher Arun Kumar opined in a statement on the report's release. "Whether scraping for training data or delivering real-time responses, these bots create new challenges for visibility, control, and cost. You can't secure what you can't see, and without clear verification standards, AI-driven automation risks are becoming a blind spot for digital teams."

The company's report is based on analysis of Fastly's Next-Gen Web Application Firewall (NGWAF) and Bot Management services, which the company says "protect over 130,000 applications and APIs and inspect more than 6.5 trillion requests per month" – giving it plenty of data to play with. The data reveals a growing problem: an increasing website load comes not from human visitors, but from automated crawlers and fetchers working on behalf of chatbot firms.

"Some AI bots, if not carefully engineered, can inadvertently impose an unsustainable load on webservers," Fastly's report warned, "leading to performance degradation, service disruption, and increased operational costs." Kumar separately told The Register: "Clearly this growth isn't sustainable, creating operational challenges while also undermining the business model of content creators. We as an industry need to do more to establish responsible norms and standards for crawling that allow AI companies to get the data they need while respecting websites' content guidelines."

That growing traffic comes from just a select few companies. Meta alone accounted for more than half of all AI crawler traffic, at 52 percent, followed by Google at 23 percent and OpenAI at 20 percent, putting a combined 95 percent of all AI crawler traffic in the hands of this trio. Anthropic, by contrast, accounted for just 3.76 percent of crawler traffic. The Common Crawl Project, which slurps websites into a free public dataset precisely to prevent the duplication of effort and traffic multiplication at the heart of the crawler problem, registered a surprisingly low 0.21 percent.

The story flips when it comes to AI fetchers, which unlike crawlers are fired off on demand when a user asks a model to incorporate information newer than its training cut-off date. Here, OpenAI was by far the dominant traffic source, Fastly found, accounting for almost 98 percent of all requests. That's an indication, perhaps, of just how much of a lead OpenAI's early entry into the consumer-facing AI chatbot market with ChatGPT gave the company, or possibly just a sign that the company's bot infrastructure may be in need of optimization.

While AI fetchers make up a minority of AI bot requests – only about 20 percent, says Kumar – they can be responsible for huge bursts of traffic, with one fetcher generating over 39,000 requests per minute during the testing period. "We expect fetcher traffic to grow as AI tools become more widely adopted and as more agentic tools come into use that mediate the experience between people and websites," Kumar told The Register.

Perplexity AI, which was recently accused of using IP addresses outside its reported crawler ranges and ignoring robots.txt directives from sites looking to opt out of being scraped, accounted for just 1.12 percent of AI crawler bot and 1.53 percent of AI fetcher bot traffic recorded for the report – though the report noted that this is growing.

Kumar decried the practice of ignoring robots.txt, telling El Reg, "At a minimum, any reputable AI company today should be honoring robots.txt. Further, and even more critically, they should publish their IP address ranges and their bots should use unique names. This will empower site operators to better distinguish the bots crawling their sites and allow them to enforce granular rules with bot management solutions."
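
Verifying a bot against published address ranges is simple in principle. A minimal sketch, with RFC 5737 test networks standing in for a vendor's real published ranges:

    import ipaddress

    # Hypothetical published ranges for a crawler (RFC 5737 test networks
    # used as placeholders; a real vendor would publish its actual ranges).
    PUBLISHED_RANGES = [
        ipaddress.ip_network("192.0.2.0/24"),
        ipaddress.ip_network("198.51.100.0/24"),
    ]

    def is_verified_crawler(remote_addr: str) -> bool:
        # A request is treated as the vendor's bot only if its source IP
        # falls inside the vendor's published ranges; a spoofed UA string
        # coming from anywhere else fails this check.
        addr = ipaddress.ip_address(remote_addr)
        return any(addr in net for net in PUBLISHED_RANGES)

    print(is_verified_crawler("192.0.2.10"))   # True: inside a published range
    print(is_verified_crawler("203.0.113.9"))  # False: claims the UA, wrong network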

But he stopped short of calling for mandated standards, saying that industry forums are working on solutions. "We need to let those processes play out. Mandating technical standards in regulatory frameworks often does not produce a good outcome and shouldn't be our first resort."

It's a problem large enough that users have begun fighting back. In the face of bots riding roughshod over polite opt-outs like robots.txt directives, webmasters are increasingly turning to active countermeasures like the proof-of-work Anubis or gibberish-feeding tarpit Nepenthes, while Fastly rival Cloudflare has been testing a pay-per-crawl approach to put a financial burden on the bot operators. "Care must be exercised when employing these techniques," Fastly's report warned, "to avoid accidentally blocking legitimate users or downgrading their experience."
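
The proof-of-work approach taken by tools like Anubis is straightforward in outline: the server issues a random challenge and only serves the page once the client finds a value that hashes under a difficulty target, so each request is cheap to verify but costly to issue in bulk. A minimal sketch of the general technique (not Anubis's actual scheme):

    import hashlib
    import itertools
    import os

    DIFFICULTY = 18  # required leading zero bits; higher = costlier per request

    def issue_challenge() -> str:
        # Server hands each visitor a random nonce.
        return os.urandom(8).hex()

    def clears_bar(challenge: str, counter: int) -> bool:
        digest = hashlib.sha256(f"{challenge}:{counter}".encode()).digest()
        return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0

    def solve(challenge: str) -> int:
        # Client burns CPU searching for a counter that clears the bar;
        # on average this takes 2**DIFFICULTY hash attempts.
        for counter in itertools.count():
            if clears_bar(challenge, counter):
                return counter

    challenge = issue_challenge()
    answer = solve(challenge)              # expensive for the requester
    assert clears_bar(challenge, answer)   # one cheap hash for the server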

Kumar noted that small site operators, especially those serving dynamic content, are likely to feel the effects most severely, and he had some recommendations. "The first and simplest step is to configure robots.txt, which immediately reduces traffic from well-behaved bots. When technical expertise is available, websites can also deploy controls such as Anubis, which can help reduce bot traffic." He warned, however, that bots are always improving and trying to find ways around "tarpits" like Anubis, as code-hosting site Codeberg recently experienced. "This creates a constant cat and mouse game, similar to what we observe with other types of bots today," he said.
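
As an illustration, a robots.txt that opts out of some widely documented AI crawlers while merely slowing everything else might look like this (user-agent tokens change over time, so check each operator's current documentation; Crawl-delay is a non-standard extension that not every crawler honors):

    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: meta-externalagent
    Disallow: /

    User-agent: *
    Crawl-delay: 10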

We spoke to Anubis developer Xe Iaso, CEO of Techaro. When we asked whether they expected the growth in crawler traffic to slow, they said: "I can only see one thing causing this to stop: the AI bubble popping.

"There is simply too much hype to give people worse versions of documents, emails, and websites otherwise. I don't know what this actually gives people, but our industry takes great pride in doing this."

However, they added: "I see no reason why it would not grow. People are using these tools to replace gaining knowledge and skills. There's no reason to assume that this attack against our cultural sense of thrift will not continue. This is the perfect attack against middle-management: unsleeping automatons that never get sick, go on vacation, or need to be paid health insurance, and that can produce output that superficially resembles the output of human employees. I see this continuing to grow until and unless the bubble pops. Even then, a lot of those scrapers will probably stick around until their venture capital runs out."

The Register asked Xe whether they thought broader deployment of Anubis and other active countermeasures would help.


They responded: "This is a regulatory issue. The thing that needs to happen is that governments need to step in and give these AI companies that are destroying the digital common good existentially threatening fines and make them pay reparations to the communities they are harming. Ironically enough, most of these AI companies rely on the communities they are destroying.

"This presents the kind of paradox that I would expect to read in a Neal Stephenson book from the '90s, not CBC's front page. Anubis helps mitigate a lot of the badness by making attacks more computationally expensive. Anubis (even in configurations that omit proof of work) makes attackers have to retool their scraping to use headless browsers instead of blindly scraping HTML."

And who is paying the piper?

"This increases the infrastructure costs of the AI companies propagating this abusive traffic. The hope is that this makes it fiscally unviable for AI companies to scrape by making them have to dedicate much more hardware to the problem. In essence: it makes the scrapers have to spend more money to do the same work."

We approached Anthropic, Google, Meta, OpenAI, and Perplexity but none provided a comment on the report by the time of publication. ®



Comments

  • By pjc50, 2025-08-21 15:42 (5 replies)

    Place alongside https://news.ycombinator.com/item?id=44962529 "Why are anime catgirls blocking my access to the Linux kernel?". This is why.

    AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet, who is then forced to erect boundaries to protect themselves, worsening the experience for the rest of the public. Who also have to pay higher electricity bills, because keeping humans warm is not as profitable as a machine which directly converts electricity into stock price rises.

    • By rnhmjoj, 2025-08-21 15:55 (5 replies)

      I'm as far from being an AI enthusiast as anyone can be, but this issue has nothing to do with AI specifically. It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the established conventions (respecting robots.txt, using a proper UA string, rate limiting, whatever). This situation could easily have happened before the AI boom, for different reasons.

      • By mostlysimilar, 2025-08-21 16:32 (1 reply)

        But it didn't, and it's happening now, because of AI.

        • By kjkjadksj, 2025-08-21 17:35 (1 reply)

          People have been complaining about these crawlers for years, since well before AI.

          • By PaulDavisThe1st, 2025-08-21 17:46 (2 replies)

            The issue is 1 to 4 orders of magnitude worse than it was just a couple of years ago. This is not "crawlers suck". This is "crawlers are overwhelming us and almost impossible to fully block". It really isn't the same thing.

            • By tadfisher, 2025-08-21 19:14 (1 reply)

              Tragedy of the commons. Before, it was cryptominers eating up all free sources of compute [0]. Now it's AI crawlers eating up all available bandwidth and server resources [1]. Reading SourceHut's struggles against the Once-lers of the world makes me want to introduce a new application layer protocol where consumers pay for abusing shared resources. Which sucks, because the Internet should remain free.

              [0]: https://drewdevault.com/2021/04/26/Cryptocurrency-is-a-disas...

              [1]: https://drewdevault.com/2025/03/17/2025-03-17-Stop-externali...

              • By PaulDavisThe1st, 2025-08-21 19:31 (1 reply)

                > Tragedy of the commons.

                No, because there is no such thing, at least not as understood by Garrett Hardin, who put forward the phrase.

                Commons fail when selfish, greedy people subvert or destroy the governance structures that help control them. If those governance structures exist (and they do for all historical commons) and continue to exist, the commons suffers no tragedy.

                This recent slide deck covers Ostrom's ideas on this; even Hardin eventually conceded that she was correct, and that his diagnosis of a "tragedy of the commons" does not actually describe the historical processes by which commons are abused.

                https://dougwebb.site/slides/commons

                That said ... arguably there is a problem here with a "commons" that does in fact lack any real governance structure.

                • By erlend_sh, 2025-08-22 4:38

                  No idea why this is getting downvoted; this is a very important correction since the “tragedy of the commons” meme is based on a flawed premise that needs to be amended.

            • By p3rls, 2025-08-22 2:32

              i am getting almost 500,000 ai scraper requests a day according to cloudflare's ai audit. google requests the same pages 10+ times each an hour. it was never this bad before.

      • By Fomite, 2025-08-21 17:58

        I'd argue it's part of the baked in, fundamental disrespect AI firms have for literally everyone else.

      • By majkinetor, 2025-08-21 17:03 (1 reply)

        Obeying robots.txt cannot be enforced. Even if one country makes laws about it, another will have 0 fucks to give.

        • By spinningslate, 2025-08-21 17:25 (1 reply)

          It was never intended to be "enforced":

          > The standard, developed in 1994, relies on voluntary compliance [0]

          It was conceived in a world with an expectation of collectively respectful behaviour: specifically that search crawlers could swamp "average Joe's" site but shouldn't.

          We're in a different world now but companies still have a choice. Some do still respect it... and then there's Meta, OpenAI and such. Communities only work when people are willing to respect community rules, not have compliance imposed on them.

          It then becomes an arms race: a reasonable response from average Joe is "well, OK, I'll allow anyone but [Meta|OpenAI|...] to access my site." Fine in theory, difficult in practice:

          1. Block IP addresses for the offending bots --> bots run from obfuscated addresses

          2. Block the bot user agent --> bots lie about UA.

          ...and so on.

          [0]: https://en.wikipedia.org/wiki/Robots.txt

          • By majkinetor, 2025-08-21 19:03

            Thanks for the info. However, people seem to think that robots.txt will protect them, while it was created for another world, as you nicely stated. I guess Nepenthes-like tools will be more common in the future, now that the tragedy of the commons has entered the digital domain.

      • By sznio, 2025-08-22 8:41

        I strongly believe that AI companies are running a DDoS attack on the open web. Making websites go down aligns with their interests: it removes training data that competitors could use, and it removes sources for humans to browse, making us even more reliant on chatbots to find anything.

        If it was crap coding, then the bots wouldn't have so many mechanisms to circumvent blocks. Once you block the OpenAI IP ranges, they start using residential proxies. Once you block their UA strings, they start impersonating other crawlers or browsers.

      • By 1vuio0pswjnm7, 2025-08-21 18:21 (1 reply)

        "It's just that some greedy companies are writing incredibly shitty crawlers that don't follow any of the enstablished [sic] conventions (respecting robots.txt, using proper UA string, rate limiting, whatever)."

        How does "proper UA string" solve this "blowing up websites" problem

        The only thing that matters with respect to the "blowing up websites" problem is rate-limiting, i.e., behaviour

        "Shitty crawlers" are a nuisance because of their behaviour, i.e., request rate, not because of whatever UA string they send; the behaviour is what is "shitty" not the UA string. The two are not necessarily correlated and any heuristic that naively assumes so is inviting failure

        "Spoofed" UA strings have been facilitated and expected since the earliest web browsers

        For example,

        https://raw.githubusercontent.com/alandipert/ncsa-mosaic/mas...

        To borrow the parent's phrasing, the "blowing up websites" problem has nothing to do with UA string specifically

        It may have something to do with website operator reluctance to set up rate-limiting though; this despite widespread implementation of "web APIs" that use rate-limiting

        NB. I'm not suggesting rate-limiting is a silver bullet. I'm suggesting that without rate-limiting, UA string as a means of addressing the "blowing up websites" problem is inviting failure
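
        For concreteness, a minimal token-bucket sketch of the kind of rate-limiting behaviour being described, with arbitrary thresholds:

            import time
            from collections import defaultdict

            RATE = 5.0    # sustained requests per second allowed per client
            BURST = 20.0  # short-term burst allowance (bucket capacity)

            _buckets = defaultdict(lambda: {"tokens": BURST, "last": time.monotonic()})

            def allow(client_ip: str) -> bool:
                # Classic token bucket: tokens refill at RATE per second up to
                # BURST; each request spends one token, and an empty bucket means
                # the caller should answer 429 Too Many Requests instead of doing
                # real work.
                bucket = _buckets[client_ip]
                now = time.monotonic()
                bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
                bucket["last"] = now
                if bucket["tokens"] >= 1.0:
                    bucket["tokens"] -= 1.0
                    return True
                return False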

        • By AbortedLaunch, 2025-08-21 18:45 (2 replies)

          Some of these crawlers appear to be designed to avoid rate limiting based on IP. I regularly see millions of unique ips doing strange requests, each just one or at most a few per day. When a response contains a unique redirect I often see a geographically distinct address fetching the destination.

          • By 1vuio0pswjnm7, 2025-08-21 21:34 (1 reply)

            "I regularly see millions of unique ips doing strange requests, each just one or at most a few per day."

            How would UA string help

            For example, a crawler making "strange" requests can send _any_ UA string, and a crawler doing "normal" requests can also send _any_ UA string.

            The "doing requests" is what I refer to as "behaviour"

            A website operator might think "Crawlers making strange requests send UA string X but not Y"

            Let's assume the "strange" requests cause a "website load" problem^1

            Then a crawler, or any www user, makes a "normal" request and sends UA string X; the operator blocks or redirects the request, unnecessarily

            Then a crawler makes a "strange" request and sends UA string Y; the operator allows the request and the website "blows up"

            What matters for the "blowing up websites" problem^1 is behaviour, not UA string

            1. The article's title calls it the "blowing up websites" problem, but the article text calls it a problem with "website load". As always the details are missing. For example, what is the "load" at issue. Is it TCP connections or HTTP requests. What number of simultaneous connections and/or requests per second is acceptable, and what number is not. Again, behaviour is the issue, not UA string

            The acceptable numbers need to be published; for example, see documentation for "web APIs"

            • By AbortedLaunch, 2025-08-22 5:45

              I do not make any point on UA-strings, just on the difficulty of rate limiting.

          • By 1vuio0pswjnm7, 2025-08-22 16:24

            "Some of these crawlers appear to be designed to avoid rate limiting based on IP."

            Unless the rate is exceeded, the limit is not being avoided

            "I regularly see millions of unique ips doing strange requests, each just one or at most a few per day."

            Assuming the rate limit is more than one or a few requests every 24h this would be complying with the limit, not avoiding it

            It could be that sometimes the problem website operators are concerned about is not "website load", i.e., the problem the article is discussing, it is actually something else (NB. I am not speculating about this particular operator, I am making a general observation)

            If a website is able to fulfill all requests from unique IPs without affecting quality of service, then it stands to reason "website load" is not a problem the website operator is having

            For example, the article's title claims Meta is amongst the "worst offenders" of creating excessive website load caused by "AI crawlers, fetchers"

            Meta has been shown to have used third-party proxy services with rotating IP addresses in order to scrape other websites; it also sued one of these services because it was being used to scrape Meta's website, Facebook

            https://brightdata.com/blog/general/meta-dismisses-claim-aga...

            Whether the problem that Meta was having with this "scraping" was "website load" is debatable; if the requests were being fulfilled without affecting QoS, then arguably "website load" was not a problem

            Rate-limiting addresses the problem of website load; it allows website operators to ensure that requests from all IP addresses are adequately served as opposed to preferentially servicing some IP addresses to the detriment of others (degraded QoS)

            Perhaps some website operators become concerned that many unique IP addresses may be under the control of a single entity, and that this entity may be a competitor; this could be a problem for them

            But if their website is able to fulfill all the requests it receives without degrading QoS then arguably "website load" is not a problem they are having

            NB. I am not suggesting that a high volume of requests from a single entity, each complying with a rate-limit is acceptable, nor am I making any comment about the practice of "scraping" for commercial gain. I am only commenting about what rate-limiting is designed to do and whether it works for that purpose

    • By superkuh, 2025-08-21 16:51 (2 replies)

      This isn't AI damaging anything. This is corporations damaging things. Same as it ever was. No need for sci-fi non-human persons when legal corporate persons exist. They latch on to whatever big new thing in tech that people don't understand, brand themselves with it, and cause damage trying to make money, even if they mostly fail at it. And most actual humans only ever see or interact with the scammy corporation versions of $techthing, and so come to believe $techthing = corporate behavior.

      And as for denying service and preventing human people from visiting websites: cloudflare does more of that damage in a single day than all these "AI" associated corporations and their crappy crawlers have in years.

      • By autoexec, 2025-08-21 17:32 (1 reply)

        > This isn't AI damaging anything. This is corporations damaging things.

        This is corporations damaging things because of AI. Corporations will damage things for other reasons too but the only reason they are breaking the internet in this way, at this time, is because of AI.

        I think the "AI doesn't kill websites, corporations kill websites" argument is as flawed as the "Guns don't kill people, people kill people" argument.

        • By superkuh, 2025-08-21 18:52

          Correct. It's a good, legitimate argument in both contexts. I use both local AI and local firearms as a human person and I am not doing, and have not done, damage to anyone. The tools aren't the problem.

          The problem in this case is the near complete protection from legal liability that corporate structures give to the people behaving badly. Like how Coca Cola can get away with killing people (https://prospect.org/features/coca-cola-killings/) but a person can't, if you want to keep the firearms analogy going. But it's a bad analogy because the firearms as tool actually at least are involved in the bad (and good) actions. AI itself isn't even involved in the HTTP requests and probably isn't even running on the same premises.

      • By ujkhsjkdhf234, 2025-08-21 18:33

        Cloudflare exists because people can't be good stewards of the internet.

        > This isn't AI damaging anything. This is corporations damaging things

        This is the guns don't kill people, people kill people argument. The problem with crawlers is about 10x worse than it was previously because of AI and their hunger for data.

    • By msgodel, 2025-08-22 5:10

      This isn't really about AI. This is a couple corporations being bad netizens and abusing infrastructure.

      The same incentives to do this already existed for search engine operators.

    • By renewiltord, 2025-08-21 17:37 (3 replies)

      If you don't want to receive data, don't. If you don't want to send data, don't. No one is asking you to receive traffic from my IPs or send to my IPs. You've just configured your server one way.

      Or to use a common HN aphorism “your business model is not my problem”. Disconnect from me if you don’t want my traffic.

      • By PaulDavisThe1st, 2025-08-21 17:50 (1 reply)

        I don't know if I want your traffic until I see what your traffic is.

        You want to look at one of our git commits? Sure! That's what our web-fronted git repo is for. Go right ahead! Be our guest!

        Oh ... I see. You want to download every commit in our repository. One by one, when you could have used git clone. Hmm, yeah, I don't want your traffic.

        But wait, "your traffic" seems to originate from ... consults fail2ban logs ... more than 900k different IP addresses, so "disconnecting" from you is non-trivial.

        I can't put it more politely than this: fuck off. Do not pass go. Do not collect stock options. Go to hell, and stay there.

        • By renewiltord, 2025-08-21 18:11 (4 replies)

          There's a protocol for that. Just reject the connection. Don't implode, just write some code. Your business model isn't my problem.

          • By PaulDavisThe1st, 2025-08-21 18:33

            Reject the connection based on what?

            IP address (presumably after too many visits)? So now the iptables mechanism has to scale to fit your business model (of hammering my git repository 1 commit at a time from nearly a million IP addresses)? Why does the code I use have to fit your braindead model? We wouldn't care if you just used git clone, but you're too dumb to do that.

            The URL? Legitimate human (or other) users won't be happy about that.

            Our web-fronted git repo is not part of our business model. It's just a free service we like to offer people, unrelated to revenue flow or business operations. So your behavior is not screwing my business model, but it is screwing up people who for whatever reason want to use that service, who can no longer use the web-fronted git repo.

            ps. I've used "you" throughout the above because you used "my". No idea if you personally are involved in any such behavior.

          • By ben_w, 2025-08-21 18:39 (1 reply)

            Reject 900k different connections from different origins, each asking for something that would in isolation be fine, when the only problem is the quantity?

            • By Nextgrid, 2025-08-21 22:15 (2 replies)

              But what's the difference between one user making 900k hits and 900k different users making one hit? In both cases you have made a resource available and people are requesting it, some more than others.

              If serving traffic for free is a problem, don't. If you are only able to serve N requests per second/minute/day/etc, do that. But don't complain if you give out something for free and people take it.

              (also, a lot of the numbers people quote during these AI scraper "attacks" are very tame and the fact they are branded as problematic makes me suspect there's substantial incompetence in the solutions deployed to serve them)

              • By latexr, 2025-08-22 5:57

                > But what's the difference between one user making 900k hits and 900k different users making one hit?

                What’s the difference between giving 900K meals to one person and feeding 900K people? The former is being abusive, wasteful, and depriving almost 900K other people of food. They are also being deceitful by pretending to be 900K different people.

                Resources are finite. Web requests aren’t food, but you still pay for them. A spike in traffic may mean your service being down for the rest of the month, which is more acceptable if you helped a bunch of people who have now learned about and can talk about and share what you provided, versus having wasted all your traffic on a single bad actor who didn’t even care because they were just a robot.

                > makes me suspect there's substantial incompetence in the solutions deployed to serve them

                So you see bots scraping the Wikipedia webpages instead of downloading their organised dump, or scraping every git service webpage instead of cloning a repo, and think the incompetence is with the website instead of the scraper wasting time and resources to do a worse job?

              • By PaulDavisThe1st, 2025-08-22 15:32

                There were never 900k users interested in each commit. Never was, never will be. So that's a false comparison.

                These scrapers have upped both the server load (requests per second) and bandwidth requirements, without me consenting to it. If they were actual human users OR bots that were appropriately designed to minimize their impact on the target sites, that's perfectly OK.

                Maybe if this was truly the only way to get to our god-like LLM to work in a god-like way (*), it would also be acceptable. But it isn't.

                And on top of that, they are incompetently designed and they are causing real issues that a huge number of sites need to address.

                (*) put differently, if all this current scraping activity delivered some notable benefit to humanity

          • By whatevaa, 2025-08-22 5:08

            That's exactly what they are doing: rejecting the connections of people like you, because you don't care. And if you start your own business, you will suddenly encounter the same problem too. Then you will be able to "just write some code".

            Anytime somebody writes "just", you can immediately tell that they have no idea what they are talking about.

          • By msgodel, 2025-08-22 5:13 (1 reply)

            Sure. What that looks like is always using ssh to access git and things like github going away. I think most of us can agree that's probably not good. For the tools non-technical people use it's probably far worse, pretty much the end of the open web outside static personal pages.

            I think the ISPs serving these requests are probably going to have to start going after customers for being abusive in order for this to stop.

            • By renewiltord, 2025-08-22 6:22

              Seems fine to me. Same as ads. If you don’t want to send content with ads which I will render without ads don’t send. That ended some businesses and made others paywall.

              Such is life.

      • By latexr, 2025-08-21 18:05 (1 reply)

        > Disconnect from me if you don’t want my traffic.

        The problem is precisely that that is not possible. It is very well known that these scrapers aren’t respecting the wishes of website owners and even circumvent blocks any way they can. If these companies respected the website owners’ desires for them to disconnect, we wouldn’t be having this conversation.

        • By renewiltord, 2025-08-21 18:14 (1 reply)

          Websites aren't people. They don't have desires. Machines have communication protocols. You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic. Do what you want with it.

          People send me spam. I don't whine about it. I block it.

          • By latexr, 2025-08-21 18:29 (1 reply)

            > Websites aren't people. They don't have desires.

            Obviously I’m talking about the people behind them, and I very much doubt you lack the minimal mental acuity to understand that when I used “website owners” in the preceding sentence. If you don’t want to engage in a good faith discussion you can just say so, no need to waste our time with fake pedantry. But alright, I edited that section.

            > You can set your machine to blackhole the traffic or TCP RST or whatever you want. It's just network traffic.

            And then you spend all your time in a game of cat and mouse, while these scrapers bring your website down and cost you huge amounts of money. Are you incapable of understanding how that is a problem?

            > People send me spam. I don't whine about it. I block it.

            Is the amount of spam you get so overwhelming that it swamps your inbox every day to a level you’re unable to find the real messages? Do those spammers routinely circumvent your rules and filters after you’ve blocked them? Is every spam message you get costing you money? Are they increasing every day? No? Then it’s not the same thing at all.

            • By whatevaa, 2025-08-22 5:10

              I would suggest not arguing with a wall, the person you are replying to thinks there exists some magic sauce of code to solve this problem.

      • By pjc50, 2025-08-21 20:45 (1 reply)

        People are doing exactly that. And then other people who want to use the website are asking why they get blocked by false positives.

        • By renewiltord, 2025-08-22 17:39

          Yeah, it just seems like things are playing out as one would expect them to. You're right

    • By IT4MD, 2025-08-21 18:51

      >AI is going to damage society not in fancy sci-fi ways but by centralizing profit made at the expense of everyone else on the internet

      10/10. No notes.

  • By mcpar-land, 2025-08-21 15:52 (3 replies)

    My worst offender for scraping one of my sites was Anthropic. I deployed an AI tar pit (https://news.ycombinator.com/item?id=42725147) to see what it would do with it, and Anthropic's crawler kept scraping it for weeks. Going by the logs, I think I wasted nearly a year of their time in total, because they were crawling in parallel. Other scrapers weren't so persistent.
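
    The tarpit idea fits in one file: every URL returns gibberish plus links to more generated URLs, served slowly, so a link-following crawler never runs out of pages. A toy sketch of the technique (not Nepenthes itself):

        import random
        import string
        import time
        from http.server import BaseHTTPRequestHandler, HTTPServer

        def word() -> str:
            return "".join(random.choices(string.ascii_lowercase, k=random.randint(3, 9)))

        class Tarpit(BaseHTTPRequestHandler):
            # Every URL resolves; every page is gibberish plus links to more
            # generated URLs, so a crawler following links never escapes.
            def do_GET(self):
                time.sleep(2)  # serve slowly to maximize wasted crawler time
                links = " ".join(f'<a href="/{word()}/{word()}">{word()}</a>' for _ in range(10))
                text = " ".join(word() for _ in range(80))
                body = f"<html><body><p>{text}</p><p>{links}</p></body></html>".encode()
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                self.send_header("Content-Length", str(len(body)))
                self.end_headers()
                self.wfile.write(body)

            def log_message(self, *args):  # suppress per-request console noise
                pass

        if __name__ == "__main__":
            HTTPServer(("127.0.0.1", 8080), Tarpit).serve_forever()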

    • By fleebee, 2025-08-21 17:18

      For me it was OpenAI. GPTBot hammered my honeypot with 0.87 requests per second for about 5 weeks. Other crawlers only made up 2% of the traffic. 1.8 million requests, 4 GiB of traffic. Then it just abruptly stopped for whatever reason.

    • By whatevaa, 2025-08-22 5:12

      Tar pit them and serve fake but legitimate-looking content. Poison it.

    • By Group_B, 2025-08-21 16:49

      That's hilarious. I need to set up one of these myself

  • By bwb, 2025-08-21 15:28 (4 replies)

    My book discovery website shepherd.com is getting hammered every day by AI crawlers (and crashing often)... my security lists in CloudFlare are ridiculous and the bots are getting smarter.

    I wish there were a better way to solve this.

    • By weaksauce, 2025-08-21 16:35 (1 reply)

      put a honeypot link in your site that only robots will hit because it's hidden. either leave it out of robots.txt or explicitly disallow it there, so well-behaved bots never touch it. then set up a rule so that any ip that hits that link gets a 1-day ban in your fail2ban or the like.
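
      a minimal sketch of that last part with fail2ban, assuming nginx logs and a hypothetical hidden /trap/ path:

          # /etc/fail2ban/filter.d/honeypot.conf
          [Definition]
          failregex = ^<HOST> .* "GET /trap/

          # /etc/fail2ban/jail.d/honeypot.local
          [honeypot]
          enabled  = true
          filter   = honeypot
          logpath  = /var/log/nginx/access.log
          maxretry = 1
          bantime  = 86400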

    • By skydhash, 2025-08-21 16:07 (1 reply)

      If you're not updating the publicly accessible part of the database often, see if you can put some cache strategy up and let Cloudflare take the hit.

      • By bwb, 2025-08-21 16:23

        Yep, all but one page type is heavily cached at multiple levels. We are working to get the rest and improve it further... just annoying as they don't even respect limits.

    • By p3rls, 2025-08-21 20:30

      At this point I'd take a thermostat that can read when my dashboard starts getting heated (always the same culprits causing these same server spikes) and flicks attack mode on for cloudflare.... it's so ridiculous trying to run anything that's not a wordpress these days

    • By shepherdjerred, 2025-08-22 1:42 (1 reply)

      ah, you're the one who stopped me from being jerred@shepherd.com!

      • By bwb, 2025-08-22 9:19

        hah eh?

HackerNews