Image: black Samsung Galaxy smartphone displaying the Amazon logo. Photo by Christian Wiediger on Unsplash.

Amazon Investigates Perplexity AI Web Scraping Allegations

June 29, 2024

Amazon Web Services (AWS) has opened an investigation into Perplexity AI following allegations that the company’s web crawler bypasses the Robots Exclusion Protocol. Under this protocol, website operators publish a robots.txt file that tells search engines and other bots which parts of a site they may access. Compliance is voluntary, but reputable crawlers have generally honored these rules since the protocol’s inception in the 1990s.
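
For readers unfamiliar with the mechanism, here is a minimal sketch of how a well-behaved crawler could consult robots.txt before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt rules, the bot names (`ExampleBot`, `OtherBot`), and the `example.com` URLs are invented for illustration and do not come from any site or crawler named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set without touching the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# ExampleBot is barred from the whole site.
print(may_fetch(parser, "ExampleBot", "https://example.com/story"))      # False
# Any other bot falls back to the wildcard rules.
print(may_fetch(parser, "OtherBot", "https://example.com/story"))        # True
print(may_fetch(parser, "OtherBot", "https://example.com/private/page")) # False
```

Nothing in the protocol enforces these answers. The check is something the crawler chooses to run, which is why compliance is described as voluntary.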

To verify the scraping activity, Wired tested Perplexity AI’s chatbot by entering headlines and short descriptions of its own articles. The chatbot’s responses closely paraphrased Wired’s content with minimal attribution, suggesting that the information had been scraped directly from the site.

Aravind Srinivas, CEO of Perplexity AI, has denied the claims that the company bypasses the Robots Exclusion Protocol. Srinivas acknowledged that Perplexity AI uses third-party web crawlers in addition to its own and confirmed that the crawler identified by Wired was one of these third-party tools.


Perplexity AI, through spokesperson Sara Platnick, denied the accusations, asserting that its PerplexityBot respects robots.txt files. Platnick acknowledged that PerplexityBot ignores these directives only when a user includes a specific URL in a chatbot query. She emphasized that the company adheres to the AWS Terms of Service and described AWS’s inquiry as a standard procedure for handling reports of resource abuse. According to Platnick, Perplexity AI had not received any formal notice of an investigation from AWS before Wired’s report.

The controversy emerged after Wired reported discovering a virtual machine on an AWS server that ignored its website’s robots.txt instructions. This machine, with IP address 44.221.181.252, was allegedly operated by Perplexity AI. Wired found that the crawler had repeatedly accessed various Condé Nast properties over the previous three months to scrape content. Similar access patterns were reportedly observed at other major publications, including The Guardian, Forbes, and The New York Times.
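
The article does not describe how the access logs were analyzed. As a purely illustrative sketch, a site operator could cross-check server logs against robots.txt along these lines. The log lines, the rules, the `ExampleBot` name, and the IP addresses (taken from ranges reserved for documentation) are all made up, and the Common Log Format layout is an assumption.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: every crawler is asked to stay out of /story/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /story/
"""

# Made-up access log lines in Common Log Format (placeholder IPs).
SAMPLE_LOG = [
    '203.0.113.7 - - [01/Jun/2024:10:00:00 +0000] "GET /story/example HTTP/1.1" 200 5120',
    '198.51.100.9 - - [01/Jun/2024:10:00:05 +0000] "GET /robots.txt HTTP/1.1" 200 64',
    '203.0.113.7 - - [01/Jun/2024:10:01:00 +0000] "GET /story/another HTTP/1.1" 200 4096',
]


def disallowed_hits(log_lines, ip, robots_txt, user_agent="ExampleBot"):
    """List the paths requested by `ip` that robots.txt disallows for `user_agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    hits = []
    for line in log_lines:
        fields = line.split()
        # In Common Log Format, field 0 is the client IP and field 6 the request path.
        if len(fields) < 7 or fields[0] != ip:
            continue
        path = fields[6]
        if not parser.can_fetch(user_agent, path):
            hits.append(path)
    return hits


print(disallowed_hits(SAMPLE_LOG, "203.0.113.7", ROBOTS_TXT))
# ['/story/example', '/story/another']
```

A check like this shows only that an address requested disallowed paths. Attributing that address to a particular company requires separate evidence.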

This situation is not unique to Perplexity AI. A recent Reuters report highlighted that multiple AI companies have been bypassing robots.txt files to collect content for training large language models. Wired, however, focused its findings on Perplexity AI and provided detailed information to AWS. In response, AWS stated that its terms of service prohibit abusive and illegal activities and that it expects customers to comply with them. AWS said it regularly investigates reports of abuse and has engaged with Perplexity AI regarding the allegations.

