Reddit escalates its fight against AI bots

With AI eating the public web, Reddit is going on the offensive against data scraping.

by Alex Heath

Jun 25, 2024, 9:49 PM UTC

Illustration by William Joel / The Verge

Alex Heath is a contributing writer and author of the Sources newsletter.

In the coming weeks, Reddit will start blocking most automated bots from accessing its public data. You’ll need to make a licensing deal, like Google and OpenAI have done, to use Reddit content for model training and other commercial purposes.

While this has technically been Reddit’s policy already, the company is now enforcing it by updating its robots.txt file, a core part of the web that dictates how web crawlers are allowed to access a site. “It’s a signal to those who don’t have an agreement with us that they shouldn’t be accessing Reddit data,” the company’s chief legal officer, Ben Lee, tells me. “It’s also a signal to bad actors that the word ‘allow’ in robots.txt doesn’t mean, and has never meant, that they can use the data however they want.”
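To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser`. The rules, the bot names (`ExampleArchiveBot`, `SomeOtherBot`), and the URL are hypothetical placeholders chosen for illustration; they are not Reddit's actual file or any real crawler's identity.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration only; NOT Reddit's actual robots.txt.
# One named crawler is allowed everywhere, every other crawler is disallowed.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


PARSER = build_parser(ROBOTS_TXT)


def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the parsed rules permit this user agent to fetch the URL."""
    return PARSER.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/some_page"  # placeholder URL
    print("ExampleArchiveBot:", may_fetch("ExampleArchiveBot", page))
    print("SomeOtherBot:", may_fetch("SomeOtherBot", page))
```

Note that the check runs entirely on the crawler's side. Nothing in the file stops a bot that skips the check, so robots.txt works as a signal only for crawlers that choose to honor it.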

My colleague David Pierce recently called robots.txt “the text file that runs the internet.” Since it was conceptualized in the early days of the web, the file has primarily governed whether search engines like Google can crawl a website to index it for results. For the last 20 years or so, the give-and-take — Google sending traffic in exchange for the ability to crawl — mostly made sense for everyone involved. Then, AI companies started ingesting all the data they could find online to train their models.

Chatbots aren’t sending traffic back to content sources the way traditional search engines do. In fact, their output can often look like straight-up plagiarism. For companies like Reddit, that means the value exchange that robots.txt facilitates has been broken. “The simplistic, ‘Hey, I can index a bunch of links but provide traffic back,’ syllogism doesn’t carry forward anymore,” says Lee.

Reddit won’t name offenders, but it’s easy to imagine the companies it’s targeting with this change. The AI search engine Perplexity has been caught surreptitiously guzzling content from other sites. TollBit, a startup that brokers AI licensing deals for publishers, recently told its clients that multiple unnamed AI firms are ignoring crawling rules.

Lee knows that simply updating Reddit’s robots.txt won’t end all of the scraping. The file itself is not legally enforceable. It’s more about sending a message and making Reddit’s rules “ridiculously” clear to intruders. “Just because you have a welcome mat on the front of your house doesn’t mean someone can literally break down the door and walk in because you said they were welcome,” he says.

Reddit is making exceptions for a handful of noncommercial entities like the Internet Archive. The companies it has entered into licensing agreements with can of course keep using its data. It’s also working with moderators to ensure that their tools for things like content moderation don’t break.

If Reddit really wanted to protect against being ingested by AI, it would throw up a login page. Given the nature of the platform, that option isn’t in the cards, Lee says. He thinks that the industry “definitely needs something other than robots.txt” to enforce scraping rules. “But I think anybody who has taken on the brain damage of dealing with either W3C [The World Wide Web Consortium] or the Internet Engineering Task Force knows this is hard.”

The uncomfortable truth underlying this is that most AI companies don’t really care about robots.txt, a website’s terms of service, or even copyright law. They see the public data on the internet as ripe for the taking simply because it’s accessible. Lee has seen this story play out before from the other side of the fence; long before he worked at Reddit, he was senior legal counsel at Google in the early 2000s.

Back then, it was Google that was speedrunning the legal system to build up Search and YouTube. Now, the internet is being reshaped again by the rise of generative AI. For companies like Reddit, the risk of AI subsuming everything is too value-destructive not to fight.

Closing the loop

Checking in on stories I’ve covered previously as they develop:

OpenAI is starting to act more like a regular company: It’s going to start treating current and former employees (including those who work at competitors) the same in stock tender offers. It’s continuing to make small, focused acquisitions. And it’s in the process of moving away from being controlled by a nonprofit.
Amazon’s AI plans are coming into focus: First, there was a recent report saying the company is considering a $10 subscription for an AI-enhanced Alexa. Then, we learned that it’s developing a ChatGPT rival to potentially debut this fall. The chatbot will be powered by Olympus, Amazon’s forthcoming foundational model that is in training. (Eagle-eyed subscribers will know I broke the existence of Olympus in late March.)
Feedback: Last week, I scooped Meta’s reorganization that is putting more of an emphasis on its smart glasses with Ray-Ban, which have turned into a surprise hit. One of you wrote in with the following: “The mic on the latest glasses is incredible compared with Airpods for calls and recording with your voice on the move — a good way to break into the Apple ecosystem must-have accessory list. In my experience over the past month, it’s like a podcast booth on your face. I can collect crystal clear recordings in the middle of a coffee shop or when flying my drone over the last few weekends — no disruptions to my audio, despite my best efforts! Unexpected win for Meta hardware that I feel we are all going to be talking a lot more about in the coming 12-18 months.”

People moves

Some notable career moves:

Stability AI has found a new CEO in Prem Akkaraju, the former head of the famed VFX studio Weta Digital. Hopefully, he and business partner Sean Parker (yes, that Sean Parker) can bring some much-needed... stability to the once-high-flying AI lab. I expect to see Stability focus more on its developer ecosystem roots and, given Akkaraju’s background, lean into entertainment partnerships going forward.
Cruise’s new CEO is Marc Whitten, a former Unity and Amazon executive. This is part of a much-needed overhaul of Cruise’s leadership ranks, including the appointment of Nick Mulholland as head of comms and marketing. I’m not sure if Cruise has what it takes to be a meaningful player in self-driving anymore, but it’s good to see fresh blood coming in to hopefully get things turned around.
Two interesting Google hires: Qualcomm’s former metaverse chief, Hugo Swart, is now leading the efforts to license the company’s software in third-party headsets. Meanwhile, Nima Khajehnouri, one of Snap’s veteran engineering leaders, recently joined to set up a new team to manage the data that feeds the company’s AI models.

Interesting links

Netflix co-CEO Greg Peters interviewed on Decoder.
What Game of Thrones did to the media industry.
John Gruber breaks down Apple Intelligence.
Businessweek’s profile of Jeff Yass.
Who has (and has not) been invited to Sun Valley this year.
Two new AI apps I’m trying: ElevenLabs Reader and Dot.

If you aren’t already subscribed to Command Line, don’t forget to sign up and get future issues delivered directly to your inbox. You’ll also get full access to the archive featuring scoops about companies like Meta, Google, OpenAI, and more.

Thanks for subscribing.
