Wikipedia is struggling with voracious AI bot crawlers

2025-04-02 12:28 | www.engadget.com

Wikimedia has seen a 50 percent increase in bandwidth used for downloading multimedia content since January 2024, the foundation said in an update. But it's not because human readers have suddenly developed a voracious appetite for consuming Wikipedia articles and for watching videos or downloading files from Wikimedia Commons. No, the spike in usage came from AI crawlers, or automated programs scraping Wikimedia's openly licensed images, videos, articles and other files to train generative artificial intelligence models.

This sudden increase in traffic from bots could slow down access to Wikimedia's pages and assets, especially during high-interest events. When Jimmy Carter died in December, for instance, people's heightened interest in the video of his presidential debate with Ronald Reagan caused slow page load times for some users. Wikimedia is equipped to sustain traffic spikes from human readers during such events, and users watching Carter's video shouldn't have caused any issues. But "the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," Wikimedia said.

The foundation explained that human readers tend to look up specific and often similar topics. For instance, a number of people look up the same thing when it's trending. Wikimedia creates a cache of a piece of content requested multiple times in the data center closest to the user, enabling it to serve up content faster. But articles and content that haven't been accessed in a while have to be served from the core data center, which consumes more resources and, hence, costs more money for Wikimedia. Since AI crawlers tend to bulk read pages, they access obscure pages that have to be served from the core data center.
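The cache economics described above can be illustrated with a toy simulation (all numbers, class names, and traffic shapes here are invented for illustration; real CDN behavior is far more complex). Human traffic concentrates on a small hot set that an edge cache absorbs, while a bulk crawler touching every page once gets almost no cache hits and pushes every request back to the core:

```python
from collections import OrderedDict
import random

class LRUCache:
    """Tiny LRU cache standing in for an edge data center."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self.store:
            self.store.move_to_end(key)
            self.hits += 1
        else:
            self.misses += 1          # served from the core data center
            self.store[key] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)

random.seed(0)
articles = range(100_000)

# Human readers cluster on trending topics: a small hot set gets most requests.
human = LRUCache(capacity=1_000)
for _ in range(50_000):
    human.get(random.choice(range(500)) if random.random() < 0.9
              else random.choice(articles))

# A bulk crawler walks every page once: almost nothing repeats.
crawler = LRUCache(capacity=1_000)
for page in range(50_000):
    crawler.get(page)

print(f"human hit rate:   {human.hits / 50_000:.0%}")
print(f"crawler hit rate: {crawler.hits / 50_000:.0%}")
```

The human workload caches at nearly 90 percent while the crawl workload caches at zero, which is why the same request volume costs Wikimedia far more when it comes from bots.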

Wikimedia said that upon a closer look, 65 percent of its most resource-consuming traffic comes from bots. That's already causing constant disruption for its Site Reliability team, which has to keep blocking crawlers before they significantly slow down page access for actual readers. The real problem, as Wikimedia states, is that the "expansion happened largely without sufficient attribution, which is key to drive new users to participate in the movement." A foundation that relies on people's donations to keep running needs to attract new users and get them to care about its cause. "Our content is free, our infrastructure is not," the foundation said. Wikimedia is now looking to establish sustainable ways for developers and reusers to access its content in the upcoming fiscal year. It has to, because it sees no sign of AI-related traffic slowing down anytime soon.


Read the original article

Comments

  • By diggan 2025-04-02 12:42 | 15 replies

    This has to be one of the strangest targets to crawl, since they themselves make database dumps available for download (https://en.wikipedia.org/wiki/Wikipedia:Database_download) and if that wasn't enough, there are 3rd party dumps as well (https://library.kiwix.org/#lang=eng&category=wikipedia) that you could use if the official ones aren't good enough for some reason.

    Why would you crawl the web interface when the data is so readily available in an even better format?

    • By roenxi 2025-04-02 12:56 | 3 replies

      I've written some unfathomably bad web crawlers in the past. Indeed, web crawlers might be the most natural magnet for bad coding and eye-twitchingly questionable architectural practices I know of. While it likely isn't the major factor here I can attest that there are coders who see pages-articles-multistream.xml.bz2 and then reach for a wget + HTML parser combo.

      If you don't live and breathe Wikipedia, it is going to soak up a lot of time figuring out Wikipedia's XML format and markup language, not to mention re-learning how to parse XML. HTTP requests and bashing through the HTML are all everyday web skills and familiar scripting that is more reflexive and well understood. The right way would probably be much easier but figuring it out will take too long.

      Although that is all pre-ChatGPT logic. Now I'd start by asking it to solve my problem.
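For what it's worth, the dump route is less forbidding than it looks. A minimal sketch of streaming the `pages-articles` XML without loading it all into memory (the export namespace version varies between dump vintages, and the tiny inline sample below stands in for the multi-gigabyte file):

```python
import bz2
import io
import xml.etree.ElementTree as ET

# The MediaWiki export schema namespace; the exact version string varies by dump.
NS = "{http://www.mediawiki.org/xml/export-0.11/}"

def iter_pages(stream):
    """Stream (title, wikitext) pairs without loading the whole dump."""
    for _, elem in ET.iterparse(stream):
        if elem.tag == NS + "page":
            title = elem.findtext(NS + "title")
            text = elem.findtext(f"{NS}revision/{NS}text") or ""
            yield title, text
            elem.clear()                 # free memory as we go

# Real usage would look something like:
#   with bz2.open("enwiki-latest-pages-articles-multistream.xml.bz2", "rb") as f:
#       for title, text in iter_pages(f): ...
# Here a tiny synthetic export stands in for the real file:
sample = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/">
  <page><title>Example</title>
    <revision><text>Hello, wikitext.</text></revision>
  </page>
</mediawiki>"""
pages = list(iter_pages(io.BytesIO(sample)))
print(pages)
```

Parsing the wikitext markup itself is the genuinely hard part; the XML envelope is the easy bit.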

      • By a2128 2025-04-02 13:07

        You don't even need to deal with any XML formats or anything, they publish a complete dataset on Huggingface that's just a few lines to load in your Python training script

        https://huggingface.co/datasets/wikimedia/wikipedia

      • By jerf 2025-04-02 13:13 | 1 reply

        To be a "good" web crawler, you have to go beyond "not bad coding". If you just write the natural "fetch page, fetch next page, retry if it fails" loop, notably missing any sort of wait between fetches, so that you fetch as quickly as possible, you are already a pest. You don't even need multiple threads or machines to be a pest; a single machine on a home connection fetching pages as quickly as it can will already be a pest to a website with heavy backend computation or DB demands. Do an equally naive "run on a couple dozen threads" upgrade to your code and you expand the blast radius of your pestilence out to even more web sites.

        Being a truly good web crawler takes a lot of work, and being a polite web crawler takes yet more different work.

        And then, of course, you add the bad coding practices on top of it, ignoring robots.txt or using robots.txt as a list of URLs to scrape (which can be either deliberate or accidental), hammering the same pages over and over, preferentially "retrying" the very pages that are timing out because you found the page that locks the DB for 30 seconds in a hard query that even the website owners themselves didn't know was possible until you showed them by taking down the rest of their site in the process... it just goes downhill from there. Being "not bad" is already not good enough and there's plenty of "bad" out there.
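The "not bad" baseline described above can be sketched in a few dozen lines; all names here are illustrative, not any particular crawler's API, and a production crawler needs much more (per-host queues, crawl-delay directives, conditional requests):

```python
import time
import urllib.robotparser
from urllib.parse import urlparse
from urllib.request import urlopen

class PoliteFetcher:
    """Minimal sketch: check robots.txt, wait between fetches,
    and back off exponentially on failures."""

    def __init__(self, delay=2.0, user_agent="example-bot/0.1 (admin@example.org)"):
        self.delay = delay            # minimum seconds between any two fetches
        self.user_agent = user_agent
        self.last_fetch = 0.0
        self.robots = {}              # per-host robots.txt cache

    def allowed(self, url):
        parts = urlparse(url)
        host = f"{parts.scheme}://{parts.netloc}"
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser(host + "/robots.txt")
            try:
                rp.read()
            except (OSError, ValueError):
                rp = None             # no readable robots.txt: no stated restrictions
            self.robots[host] = rp
        rp = self.robots[host]
        return rp is None or rp.can_fetch(self.user_agent, url)

    def fetch(self, url, max_retries=3):
        if not self.allowed(url):
            return None
        for attempt in range(max_retries):
            # The part the naive loop skips: never fetch faster than one
            # request per `delay`, and double the wait on each retry.
            wait = self.delay * (2 ** attempt) - (time.monotonic() - self.last_fetch)
            if wait > 0:
                time.sleep(wait)
            self.last_fetch = time.monotonic()
            try:
                with urlopen(url, timeout=30) as resp:
                    return resp.read()
            except (OSError, ValueError):
                continue              # retry with a longer wait
        return None
```

Note how retries wait *longer*, not sooner; the naive loop's "retry immediately on failure" is exactly what hammers an already struggling server.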

        • By marginalia_nu 2025-04-02 14:40 | 1 reply

          I think most crawlers inevitably tend to turn into spaghetti code because of the number of weird corner cases you need to deal with.

          Crawlers are also incredibly difficult to test in a comprehensive way. No matter what test scenarios you come up with, there's a hundred more weird cases in the wild. (e.g. there's a world's difference between a server taking a long time to respond to a request, and a server sending headers quickly but taking a long time to send the body)
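That headers-fast/body-slow case is a good example of why per-request timeouts alone aren't enough. A hedged sketch (assumes the third-party `requests` library; the function name and limits are invented): a read timeout only bounds the gap between bytes, so a server trickling the body can hold a connection open indefinitely unless you also enforce a total deadline:

```python
import time
import requests

def fetch_with_deadline(url, deadline=30.0, max_bytes=5_000_000):
    """Bound total response time, not just time-to-first-byte.
    timeout=(connect, read) limits connection setup and the gap between
    bytes; the explicit deadline below closes the trickling-body case."""
    body = b""
    start = time.monotonic()
    with requests.get(url, timeout=(5, 10), stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=8192):
            body += chunk
            if time.monotonic() - start > deadline:
                raise TimeoutError(f"{url}: exceeded {deadline}s total deadline")
            if len(body) > max_bytes:
                raise ValueError(f"{url}: body exceeded {max_bytes} bytes")
    return body
```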

          • By joquarky 2025-04-02 20:28

            I thrive on these kinds of moving-target challenges. But nobody will hire me.

      • By soco 2025-04-02 13:05

        You'd probably ask ChatGPT to write you a crawler for Wikipedia, without thinking to ask whether there's a better way to get Wikipedia's data. So the dumps would be missed, because how and what we ask AI remains very important. This isn't new, actually: googling skills were recognized as important before, and even philosophers recognized that asking good questions is crucial.

    • By joquarky 2025-04-02 20:26

      > Why would you crawl the web interface when the data is so readily available in an even better format?

      Have you seen the lack of experience that is getting through the hiring process lately? It feels like 80% of the people onboarding are only able to code to pre-existing patterns without an ability to think outside the box.

      I'm just bitter because I have 25 years of experience and can't even get a damn interview no matter how low I go on salary expectations. I obviously have difficulty in the soft skills department, but companies who need real work to get done reliably used to value technical skills over social skills.

    • By Cthulhu_ 2025-04-02 12:57

      Because the scrapers they use aren't targeted, they just try to index the whole internet. It's easier that way.

    • By johannes1234321 2025-04-02 13:40 | 1 reply

      While the dump may be simpler to consume, building the tooling for it isn't simpler.

      The generic web crawler works (more or less) everywhere. The Wikipedia dump solution works on Wikipedia dumps.

      Also mind: This is tied in with search engines and other places, where the AI bot follows links from search results etc. thus they'd need the extra logic to detect a Wikipedia link, then find the matching article in the dump, and then add the original link back as reference for the source.

      Also, in one article about this I read about traffic spikes around deaths of public figures; in that scenario they want the latest version of the article, not a day-old dump.

      So yeah, I guess they used the simple, straightforward way and didn't care much about the consequences.

      • By diggan 2025-04-02 13:54

        I'm not sure this is what is currently affecting them the most, the article mentions this:

        > Since AI crawlers tend to bulk read pages, they access obscure pages that have to be served from the core data center.

        So it doesn't seem to be driven by "Search the web for keywords, follow links, slurp content" but trying to read a bulk of pages all together, then move on to another set of bulk pages, suggesting mass-ingestion, not just acting as a user-agent for an actual user.

        But maybe I'm reading too much into the specifics of the article; I don't have any particular internal insight into the problem they're facing, I'll confess.

    • By marginalia_nu 2025-04-02 12:44

      I think most of these crawlers just aren't very well implemented. Takes a lot of time and effort to get crawling to work well, very easy to accidentally DoS a website if you don't pay attention.

    • By skywhopper 2025-04-02 13:23

      Because AI and everything about it is about being as lazy as possible and wasting money and compute in service of becoming even lazier. No one who is willing to burn the compute necessary to train the big models we see would think twice about the wasted resources involved in doing the most wasteful, least efficient means of collecting the data possible.

    • By is_true 2025-04-02 12:57 | 1 reply

      This is what you get when an AI generates your code and your prompts are vague.

    • By wslh 2025-04-02 12:44 | 1 reply

      I think those crawlers are just very generic: they basically operate like wget scripts, without much logic for avoiding sites that already offer clean data dumps.

      • By ldng 2025-04-02 12:46 | 1 reply

        That is not an excuse. Wikipedia isn't just any site.

        • By wslh 2025-04-02 12:47 | 1 reply

          Not an excuse, a plausible explanation of what's actually happening.

          • By franktankbank 2025-04-02 13:14

            Also plausibly they are trying to kill the site via soft ddos. Then they can sell a service based on all the data they scraped + unauditable censoring.

    • By iamacyborg 2025-04-02 12:59

      With the way transclusion works in MediaWiki, the dumps and the wiki APIs are often not very useful, unfortunately.

    • By mzajc 2025-04-02 12:56

      There are crawlers that will recursively crawl source repository web interfaces (cgit et al, usually expensive to render) despite having a readily available URL they could clone from. At this point I'm not far from assuming malice over sheer incompetence.

    • By MiscIdeaMaker99 2025-04-02 12:59

      > Why would you crawl the web interface when the data is so readily available in an even better format?

      It's entirely possible they don't know about this. I certainly didn't until just now.

    • By shreyshnaccount 2025-04-02 13:33

      i was thinking the same thing. it might be that the scrapers are a result of the web search feature in LLMs?

    • By latexr 2025-04-02 12:52 | 1 reply

      > Why would you crawl the web interface when the data is so readily available in an even better format?

      Because grifters have no respect or care for other people, nor are they interested in learning how to be efficient. They only care about the least amount of effort for the largest amount of personal profit. Why special-case Wikipedia, when they can just scratch their balls and turn their code loose? It’s not their own money they’re burning anyway; there are more chumps throwing money at them than they know what to do with, so it’s imperative they look competitive and hard at work.

      • By coldpie 2025-04-02 13:04 | 1 reply

        The vast, vast majority of companies using AI are on the same level as the people distributing malware to mine crypto on other peoples' machines. They're exploiting resources that aren't theirs to get rich quick from stupid investors & market hype. We all suffer so they can get a couple bucks. Thanks, AI & braindead investors. This bubble can't pop soon enough, and I hope it takes a whole lot of terrible people down with it.

    • By immibis 2025-04-02 13:42

      The crawler companies just do not give a shit. They're running these crawl jobs because they can, the methodology is worthless, the data will be worthless, but they have so much computing resources relative to developer resources that it costs them more to figure out that the crawl is worthless and figure out what isn't worthless, than it does to just do the crawl and throw away the worthless data at the end and then crawl again. Meanwhile they perform the internet's most widespread DDoS (which is against the CFAA btw so if they caused actual damages to you, try suing them). I don't personally take an issue with web crawling as a concept (how else would search engines work? oh, they don't work any more anyway) but the implementation is obviously a failure.

      ---

      I've noticed one crawling my copy of Gitea for the last few months - fetching every combination of https://server/commithash/filepath. My server isn't overloaded by this. It filled up the disk space by generating every possible snapshot, but I count that as a bug in Gitea, not an attack by the crawler. Still, the situation is very dumb, so I set my reverse proxy to feed it a variant of the Wikipedia home page on every AI crawler request for the last few days. The variation has several sections replaced with nonsense, both AI-generated and not. You can see it here: https://git.immibis.com/gptblock.html

      I just checked, and they're still crawling, and they've gone 3 layers deep into the image tags of the page. Since every URL returns that page if you have the wrong user-agent, the images do too; they happen to be at a relative path, so I can tell how many layers deep they're looking.

      Interestingly, if you ask ChatGPT to evaluate this page (GPT interactive page fetches are not blocked) it says it's a fake Wikipedia. You'd think they could use their own technology to evaluate pages.

      ---

      nginx rules for your convenience; be prepared to adjust the filters according to the actual traffic you see in your logs (the `break` after `rewrite ... last` in the originals was unreachable, so it's dropped here):

      location = /gptblock.html { root /var/www/html; }

      if ($http_user_agent ~* "https://openai.com/gptbot") { rewrite ^.*$ /gptblock.html last; }

      if ($http_user_agent ~* "claudebot@anthropic.com") { rewrite ^.*$ /gptblock.html last; }

      if ($http_user_agent ~* "https://developer.amazon.com/support/amazonbot") { rewrite ^.*$ /gptblock.html last; }

      if ($http_user_agent ~* "GoogleOther") { rewrite ^.*$ /gptblock.html last; }

    • By nonrandomstring 2025-04-02 12:56 | 1 reply

      > Why would you crawl the web interface when the data is so readily available in an even better format?

      To cause deliberate harm, as in a DDoS attack. Perhaps a better question is: why would companies who hope to replace human-curated static online information with their own generative service not use the cloak of "scraping" to take down their competition?

      • By concerndc1tizen 2025-04-02 13:35 | 1 reply

        This is the most reasonable explanation. Wikipedia is openly opposed by the current US administration, and 'denial of service' is key to their strategy (i.e. tariffs, removal of rights/due process, breaking net neutrality, etc.).

        In the worst case, Wikipedia will have to require user login, which achieves the partial goal of making information inaccessible to the general public.

        • By nonrandomstring 2025-04-02 15:15 | 1 reply

          In the worst case Wikipedia will have to relocate to Europe and block the entire ASN of US network estates. But if the United States is determined to commit digital and economic suicide, I don't see how reasonable people can stop that.

          • By concerndc1tizen 2025-04-02 15:40

            It would be trivial to use botnets inside the EU, so I doubt blocking ASNs would make any difference. And as I said, it achieves the goal of disrupting access to information, so that would nevertheless be a win for them. Your proposition does not solve for Wikipedia's goal of providing free access to information.

            > digital and economic suicide

            My view is that it's an economic coup which started decades ago (Bush-Halliburton, the bank bailouts in 2008, etc.). Inflation and economic uncertainty are only for the poor. For the people doing algorithmic stock trading, it's an arbitrage opportunity on the scale of microseconds.

            By the time that the people will be properly motivated to revolt against the government, it will be too late.

  • By delichon 2025-04-02 13:23 | 1 reply

    We're having the same trouble for a few hundred sites that we manage. It is no problem for crawlers that obey robots.txt since we ask for one visit per 10 seconds, which is manageable. The problem seems to be mostly the greedy bots that request as fast as we can reply. So my current plan is to set rate limiting for everyone, bots or not. But doing stats on the logs, it isn't easy to figure out a limit that won't bounce legit human visitors.
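A per-IP sliding-window limiter is one common shape for the "rate limit everyone" approach; this is a toy sketch (all limits and names invented), and the hard part remains exactly what's noted above: picking a threshold from your own logs that greedy bots hit but humans don't:

```python
import time
from collections import defaultdict, deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per client IP."""
    def __init__(self, limit=30, window=10.0):
        self.limit = limit
        self.window = window
        self.history = defaultdict(deque)   # ip -> recent request timestamps

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.history[ip]
        while q and now - q[0] > self.window:
            q.popleft()                     # drop requests outside the window
        if len(q) >= self.limit:
            return False                    # respond 429 Too Many Requests
        q.append(now)
        return True

# A client allowed 3 requests per 10s: the 4th in-window request is refused,
# and after the window slides past, requests are allowed again.
limiter = SlidingWindowLimiter(limit=3, window=10.0)
results = [limiter.allow("203.0.113.7", now=t) for t in (0, 1, 2, 3, 12)]
print(results)
```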

    The bigger problem is that the LLMs are so good that their users no longer feel the need to visit these sites directly. It looks like the business model of most of our clients is becoming obsolete. My paycheck is downstream of that, and I don't see a fix for it.

    • By MadVikingGod 2025-04-02 13:40

      I wonder if there is a WAF that has an exponential backoff and a constant decay for the delay. Something like: start at 10µs and decay at 1µs/s.
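The idea sketched above, grow the imposed delay multiplicatively per request, decay it linearly while the client is quiet, might look like this (a toy model with invented names; the constants are the ones suggested in the comment):

```python
import time

class DecayingDelay:
    """Per-client tarpit: each request multiplies the imposed delay,
    and the delay decays at a constant rate while the client is quiet."""
    def __init__(self, base=10e-6, factor=2.0, decay_per_s=1e-6, cap=10.0):
        self.base = base                  # 10 microseconds starting delay
        self.factor = factor              # growth per request
        self.decay_per_s = decay_per_s    # 1 microsecond shed per quiet second
        self.cap = cap                    # never delay longer than this
        self.delay = 0.0
        self.last = None

    def next_delay(self, now=None):
        now = time.monotonic() if now is None else now
        if self.last is not None:
            # constant decay for the time elapsed since the previous request
            self.delay = max(0.0, self.delay - (now - self.last) * self.decay_per_s)
        self.last = now
        self.delay = min(self.cap, max(self.base, self.delay * self.factor))
        return self.delay

# A client hammering every millisecond sees its delay roughly double each time.
dd = DecayingDelay()
fast = [dd.next_delay(now=t * 0.001) for t in range(5)]
print(fast)
```

Well-behaved clients that pause between requests shed their delay and stay near the 10µs floor, while a hammering bot climbs toward the cap.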

  • By aucisson_masque 2025-04-02 12:59 | 1 reply

    People have got to make bots pay. That's the only way to get rid of this worldwide DDoSing backed by multi-billion-dollar companies.

    There are captchas to block bots, or at least make them pay money to solve them; some people in the Linux community have also made tools to combat this, I think something that uses a little CPU energy.

    And at the same time, you offer an API that is less expensive than the cost of crawling, and everyone wins.

    Multi-billion-dollar companies get their sweet, sweet data, Wikipedia gets money to enhance its infrastructure or whatever, and users benefit from Wikipedia's quality engagement.

      • By guerrilla 2025-04-02 13:05 | 3 replies

      This is an interesting model in general: free for humans, pay for automation. How do you enforce that though? Captchas sounds like a waste.

        • By jerf 2025-04-02 13:15 | 1 reply

        Any plan that starts with "Step one: Apply the tool that almost perfectly distinguishes human traffic from non-human traffic" is doomed to failure. That's whatever the engineering equivalent of "begging the question" is, where the solution to the problem is that we assume that we have the solution to the problem.

          • By zokier 2025-04-02 13:56 | 2 replies

            Identity verification is not that far fetched these days. For Europeans you've got eIDAS and related tech, some other places have similar stuff, and for the rest of the world you can do video-based ID checks. There are plenty of providers that handle this; it's pretty commonplace stuff.

            • By guerrilla 2025-04-02 14:03

            This is a terrible idea. Consider how that will be abused as soon as a government who hates you comes in to power.

            • By jerf 2025-04-02 15:14 | 1 reply

              That does not generically 100% solve the problem of "is this person a human". That ties things to an identity, but the verification that an identity actually belongs to that human is not solved. Stolen identities, forged identities, faked identities are all still problems, and as soon as the full force of the black market in such things is turned on that world, it'll still be a big problem.

            Also video-based ID checks have a shelf-life measured in single-digit years now, if indeed the plural is even appropriate. The tricks for verifying that you're not looking at real-time faked face replacement won't hold up for much longer.

            Don't forget what we're talking about, either. We're talking about accessing Wikimedia over HTTP, not briefing some official on top-secret information. How does "video interviews" solve "a highly distributed network of hacked machines is crawling my website using everyone's local identity"?

              • By johnnyanmac 2025-04-02 23:04 | 1 reply

              >That does not generically 100% solve the problem of "is this person a human".

                you don't need to solve the problem 100% generically. a "good enough" solution will block out 90% of low-effort bad actors, and that's a huge relief by itself. The next 9% will take some more steps to combat, and that last 1% will never truly be held at bay.

                • By jerf 2025-04-03 13:42

                Your model assumes the solutions aren't shared.

                They are.

                  Hacker News readers tend to grotesquely underestimate the organization of the underworld, since they aren't in it and aren't generally affected by it. But the underworld is large and organized and very well funded. I'm sure you're operating on a mental model where everyone who sets out to scrape Wikimedia is some random hacker off on their own, sitting down to write a web scraper from scratch having never done it before and not being good at it, and being just gobsmacked by the first anti-scraping tech they find, frantically searching online for how to bypass it and coming up with nothing.

                That's not how the world works. Look around you, after all; you can already see the evidence of how untrue this is even in the garbage you find for yourself. You can see it in your spams, which never have problems finding hacked systems to put their forged login pages on. That's because the people sending the spam aren't hacking the systems themselves... they use a Hacked System As A Service provider. And I am not saying that sarcastically... that's exactly what they are. APIs and all. Bad actors do not sit down with a fresh college grad and a copy of Python for Dummies to write crawlers in the general case. (Some do, but honestly they're not the worrying ones.) They get Black Web Scraping As A Service, which is a company that can and does pay people full time to figure out how to get around blocks and limits, and when you see people asking questions about how to do that online, you're not seeing the pros. The pros don't ask those questions on Stack Exchange. They just consult their fellow employees, like any other business, because it's a business.

                You could probably mentally model the collection of businesses I'm referring to as at least as large as Microsoft or Google, and generally staffed by people as intelligent.

                It in fact does need to be a nearly 100% solution, because any crack will be found, exploited, and not merely "shared" but bought and sold freely, in a market that incentivizes people with big payouts to find the exploits.

                I really wish people would understand this, the security defense team in the world is grotesquely understaffed and mismanaged collectively because people still think they're going up against some stereotypical sweaty guy in a basement who might get bored and wander away from hacking your site if he discovers women, rather than funded professionals attacking and spamming and infiltrating and getting paid large amounts of money to do it.

      • By karn97 2025-04-02 13:15 | 2 replies

        Why not just rate limit every user to realistic human rates? You just punish anyone behaving like a bot.

        • By mrweasel 2025-04-02 17:06

          Because, as pointed out in another post about the same problem: many of these scrapers make one or two requests from one IP and then move on.

        • By guerrilla 2025-04-02 13:33

          Sold. Pay by page retrieval rate.

      • By scoofy 2025-04-02 18:35

        Honeypots in JS and CSS

        I've been dealing with this over at golfcourse.wiki for the last couple years. It fucking sucks. The good news is that all the idiot scrapers who don't follow robots.txt seem to fall for the honeypots pretty easily.

        Make the honeypot disappear with a big CSS file, make another one disappear with a JS file. Humans aren't aware they are there, bots won't avoid them. Programming a bot to look for visible links instead of invisible links is challenging. The problem is these programmers are ubiquitous, and since they are ubiquitous they're not going to be geniuses.

        Honeypot -> autoban
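The honeypot-to-autoban pipeline described above might be sketched like this (the trap paths, class names, and markup are all invented for illustration; the site's real CSS would ship the rule that hides the links from humans):

```python
# Links no human can see: hidden inline or via a rule in the site's CSS.
HONEYPOT_HTML = """
<a href="/trap/prices.html" style="display:none">price list</a>
<a class="hp" href="/trap/archive.html">archive</a>
<!-- site CSS ships:  .hp { visibility: hidden; }  -->
"""

class HoneypotBanlist:
    def __init__(self, trap_prefix="/trap/"):
        self.trap_prefix = trap_prefix
        self.banned = set()

    def handle(self, ip, path):
        """Return an HTTP status: ban any client that fetches a trap URL."""
        if ip in self.banned:
            return 403
        if path.startswith(self.trap_prefix):
            self.banned.add(ip)        # honeypot -> autoban
            return 403
        return 200

bans = HoneypotBanlist()
print(bans.handle("198.51.100.9", "/about.html"))         # normal page: 200
print(bans.handle("198.51.100.9", "/trap/archive.html"))  # follows hidden link: 403
print(bans.handle("198.51.100.9", "/about.html"))         # now banned everywhere: 403
```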

HackerNews