OpenAI has introduced a web crawling tool named “GPTBot,” aimed at bolstering the capabilities of future GPT models.
The company says the data collected by GPTBot could improve the accuracy, capabilities, and safety of its future models.
Web crawlers – also referred to as web spiders – play a pivotal role in indexing content across the vast expanse of the internet. Renowned search engines such as Google and Bing rely on these bots to populate their search results with relevant web pages.
OpenAI’s GPTBot will have a distinct purpose: to gather publicly available data while carefully sidestepping sources that involve paywalls, personal data collection, or content that contravenes OpenAI’s policies.
Website owners can prevent GPTBot from crawling their sites by adding a “disallow” rule for the GPTBot user agent to the site’s robots.txt file. The same file can be used to allow access to some directories while blocking others, giving owners control over which portions of their content are accessible to the crawler.
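As a rough illustration of how such a rule behaves, the sketch below uses Python's standard `urllib.robotparser` module to evaluate a sample robots.txt. The `example.com` domain, the `/private/` path, and the helper name `is_allowed` are hypothetical and chosen only for this example; the sketch checks how a standards-following parser reads the rules, not how GPTBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block GPTBot from /private/ only,
# and leave every other crawler unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent, url in [
        ("GPTBot", "https://example.com/private/report.html"),
        ("GPTBot", "https://example.com/blog/post.html"),
        ("Googlebot", "https://example.com/private/report.html"),
    ]:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

To block the crawler from an entire site, the rule under `User-agent: GPTBot` would instead be `Disallow: /`. Note that robots.txt is advisory: it works only because well-behaved crawlers choose to honour it.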
OpenAI’s announcement follows closely on the heels of the company’s submission of a trademark application for “GPT-5,” which is anticipated to succeed the current GPT-4 model.
The filing, made with the United States Patent and Trademark Office on July 18, covers the use of “GPT-5” for AI-based human speech and text, audio-to-text conversion, voice recognition, and speech synthesis.
However, while the GPT-5 trademark application has generated excitement among AI enthusiasts, OpenAI’s CEO Sam Altman cautioned against premature expectations. Altman revealed that the company is still far from initiating GPT-5 training, as extensive safety audits need to be conducted prior to embarking on the process.
OpenAI’s recent endeavours have not been without their share of controversy. Concerns have arisen over the company’s data collection practices, particularly surrounding copyright and consent issues.
In June, Japan’s privacy regulator issued a warning to OpenAI concerning unauthorised data collection. Earlier this year, Italy temporarily prohibited the use of ChatGPT due to alleged violations of European Union privacy laws.
OpenAI and Microsoft also currently face a class-action lawsuit filed by 16 plaintiffs who claim that private information from ChatGPT user interactions was accessed without proper consent. The companies have also been hit with a lawsuit over GitHub Copilot, with the claimants alleging the code-generation tool infringed on the rights of developers by scraping their code without providing due attribution.
Should these allegations prove true, both OpenAI and Microsoft could be found in violation of the Computer Fraud and Abuse Act, a US federal statute that has been invoked in web-scraping cases.
As OpenAI continues to push the boundaries of AI technology, it must navigate these challenges to ensure responsible and ethical development in the AI landscape.