Amazon Investigates Perplexity AI Web Scraping Allegations
June 29, 2024
Amazon Web Services (AWS) has opened an investigation into Perplexity AI following allegations that the company’s web crawler is bypassing the Robots Exclusion Protocol. This protocol lets website owners tell search engines and other bots which parts of a site they may access, using a robots.txt file placed at the root of the site. Adherence to these directives is voluntary, but reputable crawlers have generally respected them since the protocol’s inception in the 1990s.
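For readers unfamiliar with how the protocol works in practice, the following is a minimal sketch of the check a compliant crawler would run before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt contents, the bot names `ExampleBot` and `OtherBot`, and the `example.com` URLs are hypothetical and chosen only for illustration; they do not represent any real site's rules or any company's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one named bot from the whole site and
# blocks every other bot from /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access needed)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit `user_agent` to fetch `url`."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)

# A well-behaved crawler performs this check before every request and
# skips any URL for which the answer is False.
for agent, url in [
    ("ExampleBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/articles/1"),
    ("OtherBot", "https://example.com/private/report"),
]:
    print(agent, url, is_allowed(parser, agent, url))
```

Nothing in the protocol enforces the result of this check. A crawler that skips it, or that identifies itself with a different user agent string, can still request the page, which is why compliance depends on the operator's good faith.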
To verify the scraping activity, Wired tested Perplexity AI’s chatbot by entering headlines and short descriptions from its own articles. The chatbot’s responses closely paraphrased Wired’s content with minimal attribution, suggesting that the information had been scraped directly from the site.
Aravind Srinivas, CEO of Perplexity AI, denied the claims of bypassing the Robots Exclusion Protocol. Srinivas acknowledged that the company uses third-party web crawlers in addition to its own and confirmed that the crawler identified by Wired was one of these third-party tools.
Perplexity AI, through spokesperson Sara Platnick, denied the accusations, asserting that its PerplexityBot respects robots.txt files. Platnick acknowledged that PerplexityBot might ignore these directives only when users specifically include a URL in their chatbot inquiries. She emphasized that the company adheres to the AWS Terms of Service and described AWS’s inquiry as a standard procedure for addressing reports of resource abuse. According to Platnick, Perplexity AI had not received any formal investigation notice from AWS prior to Wired’s report.
The controversy emerged after Wired reported discovering a virtual machine hosted on AWS that ignored its website’s robots.txt instructions. This machine, with IP address 44.221.181.252, was allegedly operated by Perplexity AI. Wired found that the crawler had repeatedly accessed various Condé Nast properties over the past three months to scrape content. Similar access patterns were reportedly observed at other major publications, including The Guardian, Forbes, and The New York Times.
This situation is not unique to Perplexity AI. A recent Reuters report highlighted that multiple AI companies have been bypassing robots.txt files to collect content for training large language models. However, Wired focused its findings on Perplexity AI and provided detailed information to AWS. In response, AWS stated that it prohibits abusive and illegal activities and expects customers to comply with these terms. AWS regularly investigates reports of abuse and has engaged with Perplexity AI regarding the allegations.