Business Reddit will block the Internet Archive - The company says that AI companies have scraped data from the Wayback Machine, so it’s going to limit what the Wayback Machine can access.


Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.
“Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” spokesperson Tim Rathschmidt tells The Verge.

The Internet Archive’s mission is to keep a digital archive of websites on the internet and “other cultural artifacts,” and the Wayback Machine is a tool you can use to look at pages as they appeared on certain dates, but Reddit believes not all of its content should be archived that way. “Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt says.
The limits will start “ramping up” today, and Reddit says it reached out to the Internet Archive “in advance” to “inform them of the limits before they go into effect,” according to Rathschmidt. He says Reddit has also “raised concerns” about the ability of people to scrape content from the Internet Archive in the past.
Reddit has a recent history of cutting off access to scraper tools as AI companies have begun to use (and abuse) them en masse, but it’s willing to provide that data if companies pay. Reddit struck a deal with Google for both Google Search and AI training data early last year, and a few months later, it started blocking major search engines from crawling its data unless they pay. It also said its infamous API changes from 2023, which forced some third-party apps to shut down and led to protests, were made because those APIs were abused to train AI models.

Reddit also struck an AI deal with OpenAI, but it sued Anthropic in June, claiming Anthropic was still scraping from Reddit even after Anthropic said it wasn’t scraping anymore.
“We have a longstanding relationship with Reddit and continue to have ongoing discussions about this matter,” Mark Graham, director of the Wayback Machine, says in a statement to The Verge.

(Link/Archive)
 
It's all about money for a handful of people associated with reddit. They don't give a shit about selling people's data as long as the right people within the cabal get their cut.
 
reddit could solve this by just making it so you need to be logged in to see posts and comments, but they won't.
 
reddit could solve this by just making it so you need to be logged in to see posts and comments, but they won't.
Generally that fucks engagement over the long term though due to not showing up in searches and all the lurkers. Hell I lurked here since 2015 lol, literally 10 years. It means nothing for the farms (currently) but reddit is able to monetize my views even if I'm not logged in. That's why they don't do it.
 
As much as I hate Reddit, they are probably right. Remember that they struck a 60 MILLION DOLLARS exclusivity deal with Google to sell their data to train Google AI bots just last year. As much as we hate them, the data for all the subreddits for fucking 20 years is very VERY valuable to AI corpos.

I 100% believe the pajeets and chinks would "steal" this data by alternate means (from archive sites), its in their blood.

reddit could solve this by just making it so you need to be logged in to see posts and comments, but they won't.
Yeah it would solve the issue and fuck everyone that wants to just look at random information online.

Reddit is sort of the stack overflow of random shit, you face a problem (lets say you want to know why we say a specific idiom), you google it, and hope that some retard 10 years ago asked the same question and got an answer. Most people would not look at the site if it required a login.

Pinterest does this and I will never fucking make an account for that dogshit website.

Correction: 60 million, not billion.
 
As much as I hate Reddit, they are probably right. Remember that they struck a 60 BILLION DOLLARS deal with Google to sell their data to train Google AI bots just last year. As much as we hate them, the data for all the subreddits for fucking 20 years is very VERY valuable to AI corpos.

I 100% believe the pajeets and chinks would "steal" this data by alternate means (from archive sites), its in their blood.
How valuable is it really though? Reddit has been permeated with bots since its inception. For the last 10 years I would feel confident betting $1000 that at least 50% of all conversations on reddit were between two bots with little to no human guidance.
 
How valuable is it really though? Reddit has been permeated with bots since its inception. For the last 10 years I would feel confident betting $1000 that at least 50% of all conversations on reddit were between two bots with little to no human guidance.
According to a quick search (not sure how accurate) it's the 7th most used website in the whole internet. It has been around for 20 years, and it's filled with people talking to other people and (sometimes) answering questions. Sure, there are a lot of bots, but most traffic (especially older) is from real people.

This is exactly what they want from AI chatbots. Answering questions. Reddit is the perfect example. They won't get that from Twitter, where people are just saying random shit, and most of the other main sites are image/video based (instagram, youtube, tiktok, etc) so they don't have useful (text) data.

Sure, sometimes they are going to fuck up. There was that one time where Gemini, Google's ChatGPT, was recommending people to eat rocks and put glue on pizza to make it more consistent. It turns out that it came from "joke/sarcastic" reddit threads. This is the sort of shit they have to deal with until they fine tune the AI, but it's still worth it.

(Useful) Data in 2025 is more valuable than Oil. Advertising Data, AI Training Data, etc. People are investing a LOT of money into this.
 
Generally that fucks engagement over the long term though
Yeah it would solve the issue and fuck everyone that wants to just look at random information online.
Absolutely, but that's sorta my point. They can fix this on their own without bitching about other companies. The IA might be gay & pozzed but blaming them because you refuse to lock down your golden goose is dumb. Next step will be reddit inc complaining that posts & comments are cached in search engine results.

With that said, I shouldn't complain because it's funny watching them squirm and lash out.
 