OpenAI has unveiled its own web crawler bot, GPTBot, and has provided web admins with the means to block it if they want to.
AI training methods have become a hot topic, with the industry still trying to figure out the legality and ethics of training AI models on data from the internet. OpenAI is addressing those concerns head-on by giving web admins the ability to block GPTBot.
Usage
Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies. Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety. Below, we also share how to disallow GPTBot from accessing your site.
Disallowing GPTBot
To disallow GPTBot from accessing your site, you can add the following GPTBot rule to your site’s robots.txt:
User-agent: GPTBot
Disallow: /
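Web admins who want to confirm that the rule behaves as intended before deploying it can check it locally. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the helper name `is_allowed`, the `example.com` URL, and the `SomeOtherBot` user agent are placeholders chosen for illustration, not anything OpenAI provides. It only shows how a standards-following parser reads the rule, and says nothing about how GPTBot itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# The rule from the article, with each directive on its own line.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url.

    Hypothetical helper for illustration; it parses the text in memory
    instead of downloading a live robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # GPTBot is blocked from the whole site.
    print(is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/page"))
    # Crawlers not named in the file are unaffected by this rule.
    print(is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/page"))
```

Because the rule names only GPTBot, other crawlers remain unaffected unless they are listed separately.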