Image: OpenAI on a phone. Photo by Levart_Photographer on Unsplash

OpenAI and Anthropic Disregard Web Scraping Rules for Bots

June 24, 2024

Leading AI startups OpenAI and Anthropic are disregarding a protocol designed to stop them from scraping web content to train their models without compensating publishers.

OpenAI, known for the widely used chatbot ChatGPT, has Microsoft as its primary investor, while Anthropic, creator of the popular chatbot Claude, is mainly backed by Amazon.

An analyst at TollBit, a startup that aims to facilitate paid licensing deals between publishers and AI firms, and another person familiar with the matter told Business Insider that both OpenAI and Anthropic have been bypassing an established web protocol, the robots.txt standard. The standard lets a website tell automated crawlers which pages they should not scrape.


On Friday, TollBit sent a letter to some prominent publishers alerting them to the issue, after it emerged that numerous AI firms were engaging in similar practices. The letter did not name any of the AI companies involved.

However, last week, Perplexity, a firm that describes itself as “a free AI search engine,” faced public scrutiny after Forbes accused it of plagiarizing its content and distributing it without authorization across multiple platforms. Wired also reported that Perplexity has been scraping content from Wired's website and from other publications owned by Condé Nast, disregarding the robots.txt protocol.

OpenAI and Anthropic have publicly stated that they honor robots.txt and blocks aimed at their respective web crawlers, GPTBot and ClaudeBot. TollBit’s findings suggest they have not been true to their word: AI companies, including OpenAI and Anthropic, are reportedly opting to “bypass” robots.txt and scrape all of the content from websites.


Neither OpenAI nor Anthropic has commented on the matter. In May, however, OpenAI wrote in a blog post on its website that it takes web crawler permissions “into account each time we train a new model.”

Since its introduction in the mid-1990s, robots.txt has served as a simple text file that lets websites instruct bot crawlers not to scrape and collect their data. It has been widely adopted and, as a result, has become a foundation of the unwritten rules governing the web.
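To make the mechanism concrete, here is a minimal sketch of how a well-behaved crawler might consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The rules in `ROBOTS_TXT`, the `example.com` URLs, the `SomeOtherBot` name, and the `may_fetch` helper are all hypothetical and chosen for illustration; they do not describe how OpenAI's or Anthropic's crawlers are actually implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve (illustrative rules only).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    article = "https://example.com/articles/story.html"
    for agent in ("GPTBot", "ClaudeBot", "SomeOtherBot"):
        print(agent, may_fetch(agent, article))
```

The check runs entirely on the crawler's side. Nothing in the protocol forces a bot to perform it or to respect the answer, which is why compliance depends on the goodwill of whoever operates the crawler.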

As generative AI grows rapidly, startups and tech firms are racing to build cutting-edge AI models, and high-quality data is a crucial ingredient. The rising demand for such training data has weakened the efficacy of robots.txt.

Last year, several tech firms argued before the U.S. Copyright Office that web content used as AI training data should be exempt from copyright protections. OpenAI, for its part, has secured agreements with publishers to access their content. The U.S. Copyright Office is scheduled to update its guidelines on AI and copyright later this year.
