AI's Unpaid Debt: How LLM Scrapers Destroy the Social Contract of Open Source

2025-12-19 19:37 | 63 | 18 | www.quippd.com

TL;DR: Big tech's LLMs have gobbled up all of our data, but the damage they have done to open source and free culture communities is particularly insidious. By taking advantage of those who share freely, they destroy the bargain that made free software spread like wildfire.

I made a video if you want to watch instead.

Last week, I argued that Mozilla’s replacement of volunteer contributions on Support Mozilla with AI translations was a betrayal of Mozilla’s open source principles.

If you disagree, you might not realize that the AI companies have been pirating from the commons, directly hurting open source. This is an important topic, so I’m devoting the rest of this post to it.

Open Source History

This comment helped me realize that some of us are forgetting the role of copyright in open source.

It’s funny seeing the sudden surge of “copyright is awesome!” On the Internet now that it’s become a useful talking point to bludgeon the hated Abominable Intelligence with.

We can think about copyleft as a kind of hack of the copyright system.

What is copyleft?

The FSF defines it as a method to make a work free “and requiring all modified versions to be free as well”.

By default, when you create an original work, you own the copyright on that work, which gives you certain exclusive rights over it. Because those rights are exclusive, others need your permission, through a license or an assignment, before they can exercise them.

Open source subverts the default notion of copyright. By releasing a work under an open license, the copyright owner grants others the same rights that the owner has over the work. Since the copyright owner is effectively giving the work away (with additional conditions, depending on the license), the work gets a new life.

Like royalty free music on YouTube, people will spread the work, and the community of people who find it useful can fix issues and add new features – helping it spread even further.

The results are evident: open source lets hobbyists connected by the internet compete with software built by professionals working at big software companies. According to a Harvard study, open source has an economic value of $8.8 trillion.

It is so powerful that it allowed KDE to develop a browser engine called KHTML, which Apple eventually forked to create WebKit, and which Google then forked again to create Blink.

The two dominant browser engines on the market today are Blink and WebKit. These engines power big tech web browsers like Chrome, Safari and Edge.

Open source is what let Linux, with its entirely open source stack, compete with big tech companies like Microsoft and Apple. Google's adoption of open source in Linux and Android eventually helped force Microsoft out of the mobile operating system market.

The Linux desktop has even gotten to the point that Valve is promoting it as a gaming OS to run Windows games. You don’t even need to run a Microsoft OS to run Windows games anymore. Open source really is powerful stuff.

Open Source Licensing

I’m talking specifically about copyleft licenses. They differ from “permissive” licenses in that they preserve the openness of the original work by requiring that changes to the work be released under the same open license. These licenses are also sometimes called “viral” because by using the “virally licensed” work in another work, you transmit the license (virus) to the new work.

The virality ensures that the work - freely given - continues to be given freely.

Copyleft doesn’t just apply to software - user contributions on Wikipedia are licensed under a Creative Commons Attribution-ShareAlike license, for example. This means that any edits people make to the encyclopedia are also given freely.

The “share alike” in Wikipedia’s license is a covenant that Wikipedia has made with its contributors, and that copyleft developers have made with open source contributors.

Big Tech AI

Since arriving on the scene, AI companies have relied on massive data sets to train their LLMs. The data sets include virtually anything posted online by anyone - blog posts, books, music, illustrations, even things like Reddit comments.

While the AI companies have stolen from us all, I contend that open source and free culture communities have been disproportionately damaged, and will continue to be.

Damage to Open Source and Free Culture

The key covenant that differentiates copyleft licenses from permissive ones is that the work, and anything built on it, must be shared alike. That is the bargain that the big tech content pirates blow a hole through.

By incorporating copyleft data into their models, the LLMs do share the work - but not alike. Instead, the AI strips the work of its provenance and transforms it to be copyright free.

That is because the U.S. Copyright Office has rightly advised that purely AI-generated works are not eligible for copyright.

This means that the LLMs are copyright removal devices - copy open source (or proprietary!) data into them, and you get copyright-free data out the other side that you are free to plagiarize into newly copyrighted works.

While this process robs all copyright owners, it is particularly damaging to people involved in sharing communities, since many are participating not for money but for the love of the game - what librarian Fobazi Ettarh calls “vocational awe”.

Vocational Awe and Sharing Communities

Vocational awe explains why so many are willing to share freely with others - even when being involved can be degrading and thankless. Ms. Ettarh invented the term to describe how people who believe that they are serving the greater good can be more easily taken advantage of.

It explains why people share even when they aren’t getting paid.

When was the last time you heard of a plumber who repaired leaks for free? It doesn’t happen, so when AI comes for the plumbers, they will either need to be coerced into providing training data or be paid for it. Plumbers aren’t going to work for free, and if they do, they know exactly how much they are donating.

The Aftermath of LLM Piracy

LLM piracy of open source works damages the value exchange that contributors made with the community. When LLMs strip provenance from contributions, the contract is broken - the contributions are no longer shared alike. Instead, they are stolen.

As Sean O’Brien, founder of the Yale Privacy Lab at Yale Law School, points out:

“Now those same corporations are using that wealth and compute to train opaque models on the very codebases that made their existence possible, and threatening the legal structures, such as reciprocal or copyleft licenses like GNU GPL, by labeling all the outputs of genAI chatbots public domain.”

While many contributors will continue to give, others will rightly realize that the bargain has shifted, and will stop contributing.

You can see that shift happen on Stack Overflow, a popular question and answer site for programmers.

The data speaks for itself: 50% fewer posts and comments than before ChatGPT became generally available.

Perhaps the people who might have contributed are instead helping to train a chatbot like Copilot or Gemini. They aren’t helping the community anymore, though.

Contributors need to ask themselves: Does it make sense to continue to contribute directly to the big tech LLMs for free?

It isn’t as if they are also getting a (second?) big tech salary in the mail.

If you liked this material, please consider supporting me. You can message me or follow this blog on Mastodon.


Comments

  • By p0w3n3d 2025-12-19 23:29 (2 replies)

    Normally people get punished for downloading illegal books. Allegedly someone at Meta downloaded a hella ton of illegal books and trained the LLM on them, and they said "oh it was for his/her private usage". You won't get justice here

    • By muldvarp 2025-12-19 23:51 (5 replies)

      This to me is the most ridiculous thing about the whole AI situation. Piracy is now apparently just okay as long as you do it on an industrial scale and with the expressed intention of hurting the economic prospects of the authors of the pirated work.

      Seems completely ridiculous when compared to the trouble I was in that one time I pirated a single book that I was unable to purchase.

      • By Llamamoe 2025-12-20 0:04

        We've essentially given up on pretending that corporations are also held accountable for their crimes in the recent years, and I think that's more worrying than anything.

      • By lifestyleguru 2025-12-20 0:11

        Hollywood and media publishers run entire franchises of legal bullies across developed world to harass individuals, and lobby for laws allowing easy prosecution of ISP contract owner. Even Google Books was castrated because of IP rights. Now I have hard time to imagine how this IP+AI cartel operates. Nowadays everyone and their cat throws millions on AI so I imagine IP owners get their share.

      • By p0w3n3d 2025-12-20 8:08

        Recently archive.org got into trouble for lending one copy of a book (or a fixed number of copies) at a time to the whole world, like a library does. Sad men from a law office came and made an example of them, but it seems that if they had used those books to train an AI and served the content in a "remembered" way, they would have gotten away with it.

      • By pcthrowaway 2025-12-21 16:45

        > Seems completely ridiculous when compared to the trouble I was in that one time I pirated a single book that I was unable to purchase.

        How would one manage to get in trouble for pirating a book? Unless you mean with your employer for doing it on their network or something?

      • By Mathnerd314 2025-12-20 2:48

        Well, so what the actual ruling was was that use of the books was okay, but only if they were legally obtained. And so the authors could proceed with a lawsuit for illegally downloading the books. But then presumably compensation for torrenting the books was included as part of the out of court settlement. So the lesson is something like AI is fine, but torrenting books is still not acceptable, m'kay wink wink.

  • By citizenpaul 2025-12-19 20:21 (1 reply)

    I'm not sure how this is much different than Amazon, which has basically monetized the entire Apache Software Foundation and donates a pittance back to them in the single digit millions when they are profiting in the trillions.

    • By y0eswddl 2025-12-19 21:14 (1 reply)

      It's not different.

      There's also a huge problem with for-profit companies building on the work of FOSS without contributing resources or knowledge back.

  • By AndrewKemendo 2025-12-20 2:50 (1 reply)

    This article could just have been a link to the tragedy of the commons Wikipedia page

    Humans destroying common resources until depleted is a feature not a bug

    • By NoraCodes 2025-12-20 12:56

      This is quite literally the opposite of the tragedy of the commons.