How to prevent ChatGPT from crawling your website

Came across an interesting article and thought I’d share it. ChatGPT is pretty neat and useful for a lot of folks, but there are also those who don’t take too kindly to OpenAI scraping their data to train its LLMs (large language models). For example, those who publish copyrighted works online, or those who have to pay the server costs for the bandwidth that ChatGPT’s crawler consumes when it visits their site. There are even those who just don’t want to contribute to the potential future AI uprising. :robot:

If you fall into the “ChatGPT = Bad” camp, below are two methods you can use to prevent our future AI overlord from assimilating your website’s data.

Block the ChatGPT bot via robots.txt

  1. At your web host, locate your website’s root directory (usually public_html if your web host is using cPanel) and create a new file in that directory called robots.txt. Open that robots.txt file by clicking Edit.

  2. Enter the rules below to block ChatGPT from accessing all areas of your website, then click the “Save Changes” button, then the “Close” button.
User-agent: GPTBot
Disallow: /

  3. If you would like to prevent ChatGPT from accessing only certain parts of your site, you can selectively list which directories it can and cannot crawl by entering the rules below (replacing directory-*/ with the actual path to your directory), then click the “Save Changes” button, then the “Close” button. A quick way to verify both rule sets is sketched after this list.
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
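
Before relying on either rule set, you can confirm that a crawler honoring robots.txt would actually be turned away. Here is a minimal sketch using Python’s standard-library parser; example.com is a hypothetical placeholder for your own domain:

# Quick robots.txt sanity check using only the Python standard library.
# https://example.com is a hypothetical domain; swap in your own site.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # fetch and parse the live robots.txt

# With the "block everything" rules from step 2, this prints False.
print(parser.can_fetch("GPTBot", "https://example.com/any-page.html"))

# With the selective rules from step 3, directory-1 stays crawlable
# (True) while directory-2 does not (False).
print(parser.can_fetch("GPTBot", "https://example.com/directory-1/page.html"))
print(parser.can_fetch("GPTBot", "https://example.com/directory-2/page.html"))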


Block the ChatGPT bot via .htaccess

  1. In RapidWeaver, go to your Publishing settings, then click on the “Edit .htaccess File” button.

  2. On a new line, enter the .htaccess rules below to block ChatGPT from accessing your website by denying the IP ranges OpenAI has published for its crawler, then click the “Save and Upload” button.
# Apache 2.2
<IfModule !authz_core_module>
    Order Allow,Deny
    Allow from all
    Deny from 52.230.152.0/24
    Deny from 52.233.106.0/24
</IfModule>

# Apache 2.4+
<IfModule authz_core_module>
    <RequireAll>
        Require all granted
        Require not ip 52.230.152.0/24
        Require not ip 52.233.106.0/24
    </RequireAll>
</IfModule>

  3. Another way to block the ChatGPT bot via .htaccess is to match on its user-agent string, as shown below. A quick way to test both approaches is sketched after this list.
# Apache 2.2
<IfModule !authz_core_module>
    SetEnvIf User-Agent GPTBot NoChatGPT=1
    Order Allow,Deny
    Allow from all
    Deny from env=NoChatGPT
</IfModule>

# Apache 2.4+
<IfModule authz_core_module>
    <If "%{HTTP_USER_AGENT} == 'GPTBot'">
        Require all denied
    </If>
</IfModule>
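
As mentioned above, here is a rough way to test both .htaccess approaches. This is a sketch, not a definitive test: example.com is a hypothetical domain, the requests library is third-party, and the IP ranges are the ones from the rules above, which will go stale as OpenAI rotates them:

# Sanity checks for the two .htaccess approaches above.
# example.com is a hypothetical domain; the requests library
# (pip install requests) is third-party.
import ipaddress

import requests

# 1. IP-based variant: confirm an address falls inside the blocked CIDRs.
blocked_ranges = [
    ipaddress.ip_network("52.230.152.0/24"),
    ipaddress.ip_network("52.233.106.0/24"),
]
test_ip = ipaddress.ip_address("52.230.152.77")
print(any(test_ip in net for net in blocked_ranges))  # True -> would be denied

# 2. User-agent variant: a request claiming to be GPTBot should get a 403.
# The real GPTBot user-agent string is longer, but the rules above match
# on the "GPTBot" substring, so this shortened value still triggers them.
resp = requests.get(
    "https://example.com/",
    headers={"User-Agent": "GPTBot/1.1"},
    timeout=10,
)
print(resp.status_code)  # expect 403 if the .htaccess rules are active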


That’s it!

You’ve now blocked the ChatGPT bot from crawling your website. More information can be found on OpenAI’s website.

Hope that helps…humanity. :cold_sweat:

4 Likes

Can’t remember where I picked this up, but this robots.txt blocks many of the other AI bots as well.

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Bytespider
Disallow: /
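
Since these lists keep growing, it can be less error-prone to generate the file than to hand-edit it. Here’s a small sketch in plain Python (no dependencies), using the agent names from the list above:

# Generate a robots.txt that disallows a list of AI user agents.
# The agent names are the ones from the list above; extend as needed.
AI_BOTS = [
    "CCBot", "ChatGPT-User", "GPTBot", "Google-Extended", "anthropic-ai",
    "Omgilibot", "Omgili", "FacebookBot", "Bytespider",
]

def render_robots_txt(agents):
    """Return robots.txt text with a Disallow-all stanza per agent."""
    stanzas = [f"User-agent: {agent}\nDisallow: /" for agent in agents]
    return "\n\n".join(stanzas) + "\n"

with open("robots.txt", "w", encoding="utf-8") as fh:
    fh.write(render_robots_txt(AI_BOTS))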

Might be helpful to someone!

6 Likes

Yeah, I saw that over on Adam’s forum yesterday; someone posted it there.

Thanks for adding it here; it’s a great resource for excluding more AI bots from crawling a website. :slightly_smiling_face: :raised_hands:

One thing to note about that robots.txt: OpenAI documents GPTBot (its training-data crawler) and ChatGPT-User (the agent used when ChatGPT browses on a user’s behalf) as separate user agents, so it’s safest to keep both entries if you want to block both.

3 Likes

Awesome post, dan. Thanks so much.

No problem.

If anybody knows of other AI bots out there and ways to block them, please add them in the comments here.

OpenAI has changed their IP egress ranges three times since I originally wrote this article, so it’s a constant battle to stay up to date and make sure these AI bots stay properly blocked from crawling people’s websites. :face_exhaling:
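
On that note, OpenAI publishes the current ranges in machine-readable form, so the IP-based .htaccess lines can at least be regenerated rather than tracked by hand. A sketch follows; the published URL has itself moved over time, so treat the endpoint and JSON shape below as assumptions to verify against OpenAI’s current docs:

# Regenerate Apache 2.4 "Require not ip" lines from OpenAI's published
# GPTBot IP ranges. The endpoint below is an assumption based on where
# OpenAI has published the ranges; confirm it against their docs.
import json
from urllib.request import urlopen

RANGES_URL = "https://openai.com/gptbot.json"  # assumed location; has moved before

with urlopen(RANGES_URL, timeout=10) as resp:
    data = json.load(resp)

# Assumed JSON shape: {"prefixes": [{"ipv4Prefix": "x.x.x.x/nn"}, ...]}
for prefix in data.get("prefixes", []):
    cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
    if cidr:
        print(f"    Require not ip {cidr}")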

1 Like

I was looking up this data yesterday. What I found was hundreds of bots, and the number seems to grow every day.

(April 12th, 2024) - Updated list of AI bots to add to your robots.txt file to direct them not to crawl/scrape your website.


# Amazon Bot - enabling Alexa to answer even more questions for customers.
User-agent: Amazonbot
Disallow: /

# Anthropic AI Bot
User-agent: anthropic-ai
Disallow: /

# Apple Bot - collects website data for its Siri and Spotlight services.
User-agent: Applebot
Disallow: /

# Claude Bot run by Anthropic
User-agent: Claude-Web
Disallow: /

# Cohere AI Bot - unconfirmed bot believed to be associated with Cohere’s chatbot.
User-agent: cohere-ai
Disallow: /

# Common Crawl's bot - Common Crawl is one of the largest public datasets used to train AI, including ChatGPT, Bard, and other large language models.
User-agent: CCBot
Disallow: /

# Diffbot - somewhat dishonest scraping bot used to collect data to train LLMs.
User-agent: Diffbot
Disallow: /

# Google-Extended controls use of your content for Google Bard and Vertex AI. This will not affect Google Search indexing or Googlebot crawling.
User-agent: Google-Extended
Disallow: /

# ImagesiftBot is billed as a reverse image search tool, but it's associated with The Hive, a company that produces models for image generation.
User-agent: ImagesiftBot
Disallow: /

# YouBot - AI crawler operated by You.com
User-agent: YouBot
Disallow: /

# Omgilibot - its operator sells crawled data for training LLMs (large language models)
User-agent: omgilibot
Disallow: /

# Omgili (Oh My God I Love It)
User-agent: omgili
Disallow: /

# OpenAI's GPTBot - the bot OpenAI uses to collect bulk training data from your website for ChatGPT.
User-agent: GPTBot
Disallow: /

# Perplexity AI
User-agent: PerplexityBot
Disallow: /

## Social Media Bots

# Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok
User-agent: Bytespider
Disallow: /

# Meta’s bot that crawls public web pages to improve language models for their speech recognition technology
User-agent: FacebookBot
Disallow: /

# Twitter's bot used to index the content of any given URL
User-agent: Twitterbot
Disallow: /
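
If you want to audit a site against a list like this, the standard-library parser can report which of these agents its robots.txt already blocks. A minimal sketch, again using the hypothetical example.com:

# Report which AI user agents a site's robots.txt currently blocks.
# example.com is a hypothetical target; replace with your own site.
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "Amazonbot", "anthropic-ai", "Applebot", "Claude-Web", "cohere-ai",
    "CCBot", "Diffbot", "Google-Extended", "ImagesiftBot", "YouBot",
    "omgilibot", "omgili", "GPTBot", "PerplexityBot", "Bytespider",
    "FacebookBot", "Twitterbot",
]

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()

for bot in AI_BOTS:
    blocked = not parser.can_fetch(bot, "https://example.com/")
    print(f"{bot:16} {'blocked' if blocked else 'allowed'}")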

2 Likes

There’s a good GitHub project that keeps an updated list of AI crawlers; it’s worth keeping an eye on if you want to keep away those pesky robots!

3 Likes

Ugh, I hope OpenAI is more sincere than Perplexity is! Perplexity (an AI search engine) outright ignores your robots.txt, as researcher Robb Knight and WIRED have found.

Cheers,
Erwin

1 Like