Photo by Christian Wiediger on Unsplash

Amazon Investigates Perplexity AI Web Scraping Allegations

June 29, 2024

Dennis Limmer

Amazon Web Services (AWS) has opened an investigation into Perplexity AI following allegations that the company’s web crawler bypasses the Robots Exclusion Protocol. The protocol lets website operators tell search engines and other bots which pages they may access by publishing a robots.txt file. Adherence is voluntary, but reputable crawlers have generally respected these directives since the protocol’s inception in the 1990s.
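For readers unfamiliar with how a crawler honors robots.txt, the check can be sketched in a few lines of Python using the standard library's `urllib.robotparser`. This is only an illustration: the rules, the `ExampleBot` user agent, the `example.com` URLs, and the `is_allowed` helper are all hypothetical and are not taken from Wired's or Perplexity AI's systems.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # ExampleBot is blocked from the whole site by its own rule group.
    print(is_allowed(ROBOTS_TXT, "ExampleBot", "https://example.com/article"))
    # Other bots fall under the wildcard group and are blocked only from /private/.
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/article"))
    print(is_allowed(ROBOTS_TXT, "OtherBot", "https://example.com/private/page"))
```

A well-behaved crawler runs a check like this before each request and skips any URL that is disallowed. Nothing in the protocol enforces the result, which is why compliance is a matter of convention rather than a technical barrier.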

To verify the scraping activity, Wired tested Perplexity AI’s chatbot by entering headlines and short descriptions from its own articles. The chatbot’s responses closely paraphrased Wired’s content with minimal attribution, suggesting that the information had been scraped directly from the site.

Aravind Srinivas, CEO of Perplexity AI, denied the claims of bypassing the Robots Exclusion Protocol. However, Srinivas acknowledged that the company uses third-party web crawlers in addition to its own and confirmed that the crawler identified by Wired was one of these third-party tools.

Perplexity AI, through spokesperson Sara Platnick, denied the accusations, asserting that its PerplexityBot respects robots.txt files. Platnick acknowledged that PerplexityBot might ignore these directives only when users specifically include a URL in their chatbot inquiries. She emphasized that the company adheres to the AWS Terms of Service and that AWS’s inquiry was a standard procedure for addressing reports of resource abuse. According to Platnick, Perplexity AI had not received any formal investigation notice from AWS prior to Wired’s report.

The controversy emerged after Wired reported discovering a virtual machine on an AWS server that ignored its website’s robots.txt instructions. This machine, with IP address 44.221.181.252, was allegedly operated by Perplexity AI. Wired found that the crawler had repeatedly accessed various Condé Nast properties over the past three months to scrape content. Similar patterns of access were also observed at other major publications, including The Guardian, Forbes, and The New York Times.

This situation is not unique to Perplexity AI. A recent Reuters report highlighted that multiple AI companies have been bypassing robots.txt files to collect content for training large language models. However, Wired focused its findings on Perplexity AI and provided detailed information to AWS. In response, AWS stated that it prohibits abusive and illegal activities and expects customers to comply with these terms. AWS regularly investigates reports of abuse and has engaged with Perplexity AI regarding the allegations.
