
Photo by Solen Feyissa on Unsplash
TikTok’s Parent Company ByteDance Is Scraping Data at Breakneck Pace
October 4, 2024
An aggressive web scraper called Bytespider has reportedly been set loose by ByteDance, the Chinese company that owns TikTok. Used to extract data for training ByteDance’s generative AI models, the bot can supposedly work much faster than the scrapers created by OpenAI, Anthropic, and Google.
According to a Fortune report, research from bot management company Kasada found that Bytespider has been crawling the web since at least April. The scraper is extracting data at a reported rate 25 times faster than GPTBot, OpenAI’s web crawler, and about 3,000 times faster than Anthropic’s ClaudeBot. Kasada also noted that the bot’s scraping activity has increased in the past six weeks.
Many websites publish a file called robots.txt, which tells web crawlers which parts of the site they should not access. However, nothing compels a bot to comply, and the file is not legally binding. Bytespider is reportedly ignoring these instructions and grabbing data nonetheless.
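To make the mechanism concrete, here is a minimal sketch in Python of how a well-behaved crawler might consult robots.txt before fetching a page. The robots.txt contents, the example.com URL, and the helper name is_allowed are illustrative assumptions, not taken from the Fortune or Kasada reports. The check is purely voluntary: a bot that never runs it, or ignores the answer, is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a publisher might serve to block one crawler
# while allowing all others. The contents are an assumption for illustration.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    # A compliant crawler would skip the page when this prints False.
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "Bytespider", page))    # False
    print(is_allowed(EXAMPLE_ROBOTS_TXT, "SomeOtherBot", page))  # True
```

Because the decision is made inside the crawler itself, publishers who want actual enforcement typically have to block unwanted traffic on the server side rather than rely on robots.txt alone.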
Many web publishers have argued that scraper bots infringe on copyrights. Web scraping, a common practice almost as old as the internet itself, gathers massive amounts of data for free, and that data is now being used extensively to train various AI models.
Bytespider Bot and TikTok
No one outside of ByteDance really knows what Bytespider’s extracted information is being used for. However, per Fortune, ByteDance is training a new AI model, which may integrate with TikTok’s search function.
The search function was recently updated to give advertisers a competitive edge. Marketers can find trending keywords in real time and quickly create relevant ads. It’s speculated that a new AI model could utilize data from recent internet trends and topics to improve TikTok’s search function.
“Given the audience and the amount of use, TikTok with a search environment that is a completely biddable space with keywords and topics, that would be very interesting to a lot of people spending a ton of money with Google right now,” said a person purportedly with inside knowledge, as reported by Fortune.
ByteDance was at one point far behind in the development of generative AI models. In 2023, the tech company tried to catch up by covertly using OpenAI’s technology to create a rival large language model (LLM). In doing so, ByteDance violated OpenAI’s terms of service, which state that developers cannot use the technology to create competing models.
While Bytespider is quickly and quietly gathering data, it’s no secret ByteDance is making money. TikTok’s parent company reported a 60% growth in revenue last year.
