Advertising, security, and privacy @ Read the Docs / EthicalAds
To contact me about Read the Docs or Ads, use my first name @ our domain names (either readthedocs.org or ethicalads.io).
They absolutely do. Every sponsorship you see on a podcast, a YouTube video, or a stream is a contextual ad. Many open source sponsorships are actually a form of marketing. You could argue that search ads are pretty contextual, although there's more at work there. Every ad in a physical magazine is a contextual ad. Physical billboards take into account a lot of geographical context: the ads you see driving in LA are very different from the ones you see in the Bay Area. Ads on platforms like Amazon, Home Depot, etc. are highly contextual and based on search terms.
Founder of EthicalAds here. In my view, this is only partially true: publishers (sites that show ads) have choices here, but their power is dispersed. Advertisers will run advertising as long as it works, and they will pay an amount commensurate with how well it works. If a publisher chooses to run ads without tracking, whether that's a network like ours or just buyout-the-site-this-month sponsorships, they have options as long as their audience generates value for advertisers.
That said, we 100% don't land some advertisers when they learn they can't run 3rd party tracking or even 3rd party verification.
My employer, Read the Docs, has a blog post (https://about.readthedocs.com/blog/2024/07/ai-crawlers-abuse...) about how we got pounded by these bots to the tune of thousands of dollars. To be fair, though, the AI company that hit us the hardest did end up compensating us for our bandwidth bill.
We've done a few things since then:
- We already had very generous rate limiting rules by IP (~4 hits/second sustained), but some of the crawlers used thousands of IPs. Cloudflare maintains an updated list of AI crawler bots (https://developers.cloudflare.com/bots/additional-configurat...). We're using this list to block these bots and any new bots that get added to the list.
- We have more aggressive rate limiting rules by ASN on common hosting providers (e.g., AWS, GCP, Azure), which also hits a lot of these bots.
- We are considering using the AI crawler list to rate limit by user agent in addition to rate limiting by IP. This will allow well-behaved AI crawlers while blocking the badly behaved ones. We aren't against the crawlers generally.
- We now have alert rules that notify us when we sustain a certain amount of traffic (~50k uncached reqs/min). This is basically always some new bot cranked to the max, and usually an AI crawler. We get this roughly monthly, and we just ban them.
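The layered approach above can be sketched in a few lines. This is a minimal in-memory token-bucket illustration, not Read the Docs' actual configuration: the crawler user-agent names are example entries (the real list comes from Cloudflare), and the burst size is an assumed value.

```python
# Sketch of layered request filtering: block listed AI crawler user agents
# outright, then apply a per-IP token bucket (~4 req/s sustained, as in the
# comment above). Bot names and the burst value are illustrative assumptions.
import time
from collections import defaultdict

AI_CRAWLER_AGENTS = {"GPTBot", "CCBot", "Bytespider"}  # example entries only
PER_IP_RATE = 4.0  # ~4 requests/second sustained
BURST = 40         # assumed: allow short bursts above the sustained rate

class TokenBucket:
    """Refills at `rate` tokens/second up to `burst`; each request costs 1."""

    def __init__(self, rate: float, burst: int):
        self.rate, self.burst = rate, burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

buckets: dict[str, TokenBucket] = defaultdict(lambda: TokenBucket(PER_IP_RATE, BURST))

def should_serve(ip: str, user_agent: str) -> bool:
    """Reject known AI crawlers first, then rate limit the remaining traffic by IP."""
    if any(bot in user_agent for bot in AI_CRAWLER_AGENTS):
        return False
    return buckets[ip].allow()
```

In practice this logic lives at the CDN or load balancer rather than in application code, and per-ASN limits would add one more bucket layer keyed on the client's ASN instead of its IP.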
Auto-scaling made our infra good enough that we don't even notice big traffic spikes. The downside of that, however, is that the AI crawlers were hammering us without causing anything noticeable. Being smart with rate limiting helps a lot.
This project is an enhanced reader for Ycombinator Hacker News: https://news.ycombinator.com/.
The interface also lets you comment, post, and interact with the original HN platform. Credentials are stored locally and are never sent to any server; you can check the source code here: https://github.com/GabrielePicco/hacker-news-rich.
For suggestions and feature requests, you can write to me here: gabrielepicco.github.io