Now you can block OpenAI’s web crawler

Website operators can block GPTBot and keep their sites out of ChatGPT’s training data.

By Emilia David, a reporter who covers AI. Prior to joining The Verge, she covered the intersection between technology, finance, and the economy.

Aug 7, 2023, 5:36 PM UTC


OpenAI now lets you block its web crawler from scraping your site to help train GPT models.

OpenAI said website operators can specifically disallow its GPTBot crawler in their site’s robots.txt file or block its IP address. “Web pages crawled with the GPTBot user agent may potentially be used to improve future models and are filtered to remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates our policies,” OpenAI said in the blog post. For sources that don’t fit the excluded criteria, “allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety.”
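For readers who want to see what the robots.txt route looks like in practice, here is a minimal sketch in Python. It assumes the standard robots.txt syntax, in which a `User-agent: GPTBot` line followed by `Disallow: /` asks that crawler to stay away from the whole site. The `is_allowed` helper and the `example.com` URL are placeholders made up for illustration, not anything published by OpenAI, and the check only shows how a rule-following crawler would read the file.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks GPTBot to stay off the entire site.
# Other crawlers are unaffected because no rule names them.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Hypothetical helper: would this robots.txt let user_agent fetch url?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # example.com is a placeholder domain, not a real site configuration.
    page = "https://example.com/article"
    print("GPTBot allowed:", is_allowed(ROBOTS_TXT, "GPTBot", page))
    print("Other bot allowed:", is_allowed(ROBOTS_TXT, "SomeOtherBot", page))
```

robots.txt is a request rather than an enforcement mechanism, so it only works for crawlers that choose to honor it. Blocking by IP address, the other option OpenAI mentions, is enforced by the site’s own server or firewall instead.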

Blocking GPTBot may be the first step in OpenAI allowing internet users to opt out of having their data used for training its large language models. The move follows some early attempts at creating a flag that would exclude content from training, like a “NoAI” tag conceived by DeviantArt last year. Blocking the crawler does not retroactively remove content previously scraped from a site from ChatGPT’s training data.

The internet provided much of the training data for large language models such as OpenAI’s GPT models and Google’s Bard. However, OpenAI won’t confirm whether its data came from social media posts or copyrighted works, or which parts of the internet it scraped for information. And sourcing data for AI training has become increasingly contentious. Sites including Reddit and Twitter have pushed to crack down on the free use of their users’ posts by AI companies, while authors and other creatives have sued over alleged unauthorized use of their works. Lawmakers also latched onto data privacy and consent questions in several Senate hearings around AI regulation last month.

As reported by Axios, companies like Adobe have floated the idea of marking data as not for training through an anti-impersonation law. AI companies, including OpenAI, signed an agreement with the White House to develop a watermarking system to let people know if something was generated by AI but made no promises to stop using internet data for training.
