On August 8, OpenAI announced GPTBot on its official website: a web crawler designed to gather vast amounts of online data for training AI models.

OpenAI aims to use GPTBot to collect massive datasets to help train and optimize its future models. Many international tech media outlets speculate that the future model in question is GPT-5.

In fact, OpenAI filed a trademark application for GPT-5 on July 18 of this year. The introduction of this new web crawler suggests that GPT-5 may be closer to launch than anticipated.

Introducing GPTBot

GPTBot is OpenAI's web crawler, and it identifies itself through the following user-agent token and full user-agent string:

User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)
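Site operators who want to check whether GPTBot already visits them can look for that token in the User-Agent header of incoming requests or in their access logs. A minimal sketch (the helper function is illustrative, not part of OpenAI's announcement):

def is_gptbot(user_agent: str) -> bool:
    # GPTBot identifies itself with the "GPTBot" token in its User-Agent header.
    return "GPTBot" in (user_agent or "")

# Example: the full user-agent string from the announcement is detected.
ua = "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(is_gptbot(ua))  # True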

OpenAI plans to filter the gathered data, removing paywalled content, personally identifiable information (PII), and data that violates laws or regulations, so that the collected data meets safety standards.
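OpenAI has not published the details of this filtering pipeline; a minimal sketch of the PII-removal idea, assuming simple regular-expression checks and made-up example records, might look like this:

import re

# Drop records that contain obvious PII (here: email addresses and
# phone-number-like patterns) before they enter a training corpus.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def keep_record(text: str) -> bool:
    return not (EMAIL.search(text) or PHONE.search(text))

docs = [
    "A public blog post about web crawlers.",
    "Contact me at jane.doe@example.com or 555-123-4567.",
]
print([d for d in docs if keep_record(d)])  # keeps only the first record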

Website owners who do not want GPTBot to access their site can add the following to their robots.txt file:

User-agent: GPTBot
Disallow: /

Additionally, they can customize GPTBot's access by allowing or disallowing specific paths in their website's robots.txt.
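For example, a site could let GPTBot crawl one section while keeping another off limits; the directory names below are placeholders:

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/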

What is a Web Crawler?

Web crawlers are programs that automatically fetch data from the internet; they are used for tasks such as data mining, web page copying, and website mirroring.

Web crawlers are among the most important tools of the internet and big data era, often dubbed "gold miners" because of their wide range of applications.

For instance, search engines like Google and Baidu use web crawlers to gather and index web pages, enabling users to find relevant pages quickly using keywords.

Moreover, businesses use web crawlers to gather real-time information on competitors, such as product prices, new product releases, and marketing campaigns, to analyze the market and devise marketing strategies.
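At its core the job is simple: download a page, keep its content, and queue the links it contains for later visits. A minimal Python sketch of that loop (the URL is a placeholder; real crawlers add deduplication, politeness, and large-scale storage on top):

from urllib.request import urlopen
from urllib.parse import urljoin
from html.parser import HTMLParser

class LinkParser(HTMLParser):
    # Collect the href attribute of every <a> tag on the page.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl_once(url: str):
    # Fetch one page and return its HTML plus the absolute links found on it.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkParser()
    parser.feed(html)
    return html, [urljoin(url, link) for link in parser.links]

page, links = crawl_once("https://example.com/")
print(len(page), links[:5])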

Limitations of Web Crawlers

Despite the immense capabilities of web crawlers, they do have some drawbacks:

  • Data Quality: Scraped data is uneven in quality and may contain illegal, false, or low-quality information.
  • Copyright Risks: Crawlers may infringe on data privacy and copyright, exposing operators to legal challenges.
  • Difficulty Accessing Specific Content: Content that requires user interaction, such as search results or pages behind logins, can be hard for crawlers to reach.
  • Crawling Frequency: Crawled data is a static snapshot and must be refreshed periodically; crawling too often burdens the target server, while crawling too rarely leaves the data outdated (a sketch of a polite, rate-limited crawl loop follows this list).
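The frequency trade-off is usually handled with "polite" crawling: consult robots.txt before fetching and space out requests. A small sketch, assuming a hypothetical crawler name and placeholder URLs:

import time
from urllib import robotparser
from urllib.request import urlopen

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in ["https://example.com/", "https://example.com/about"]:
    if not rp.can_fetch("MyCrawler/1.0", url):
        continue  # robots.txt disallows this path for our user agent
    body = urlopen(url, timeout=10).read()
    # Honor the site's declared crawl delay, falling back to a fixed pause.
    time.sleep(rp.crawl_delay("MyCrawler/1.0") or 2)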

Nevertheless, as AI technology has advanced, many of these traditional limitations have been mitigated, and data copyright and security now receive greater emphasis.

Web Crawlers: An Essential Data Source for Large Language Models

Currently, the main sources of training data for large language models are proprietary datasets, open-source datasets, and web crawlers. For instance, an AI product dedicated to legal applications might be trained on a proprietary dataset of genuine legal rulings, books, and contracts.

Open-source datasets are often released by major companies (some licensed for commercial use, others for research only), but they can be outdated. As a result, web crawlers have become a significant data source for enterprises training general-purpose large models.

For example, OpenAI's GPT-3 model was trained on 45TB of internet text, including code, novels, encyclopedias, news, and blogs, primarily sourced via web crawlers.

This is one reason users occasionally notice ChatGPT generating inaccurate information: the crawler may have collected incorrect or false data that went unnoticed during cleaning, pre-training, and fine-tuning. Issues in the model and its algorithms can also play a role.

However, OpenAI has established stringent standards for data acquisition and usage to prevent such occurrences.