How to prevent ChatGPT from crawling your website

Came across an interesting article and thought I’d share it. ChatGPT is pretty neat and useful for a lot of folks, but there are also those who don’t take too kindly to OpenAI scraping their data to train its LLMs (large language models): for example, those who publish copyrighted works online, or those who have to pay the server costs for the bandwidth OpenAI’s crawler consumes while it crawls their site. There are even those who just don’t want to contribute to the potential future AI uprising. :robot:

If you fall into the “ChatGPT = Bad” camp, below are two methods you can use to prevent our future AI overlord from assimilating your website’s data.



Block the ChatGPT bot via robots.txt

  1. At your web host, locate your website’s root directory (usually public_html if your web host is using cPanel) and create a new file in that directory called robots.txt. Open that robots.txt file by clicking Edit.

  2. Enter the below rules to block ChatGPT from accessing all areas of your website, then click the “Save Changes” button, then the “Close” button.
User-agent: GPTBot
Disallow: /

  3. If you would like to prevent ChatGPT from accessing only certain parts of your site, you can selectively list what directories/folders it can and cannot crawl by entering the below rules (replacing directory-*/ with the actual path to your directory), then click the “Save Changes” button, then the “Close” button.
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
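
Side note: under the robots.txt standard (RFC 9309), the most specific (longest) matching rule wins, so you can also disallow the whole site and carve out a single directory. A minimal sketch, using a hypothetical directory name:

User-agent: GPTBot
Disallow: /
Allow: /public-stuff/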


Block the ChatGPT bot via .htaccess

  1. In RapidWeaver, go to your Publishing settings, then click on the “Edit .htaccess File” button.

  2. On a new line, enter the below .htaccess rules to block ChatGPT from accessing all areas of your website, then click the “Save and Upload” button.
# Apache 2.2
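# NOTE: these were OpenAI's published egress IP ranges at the time of writing; they have
# changed several times since (see later in this thread), so check OpenAI's current list first.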
<IfModule !authz_core_module>
    Order Allow,Deny
    Allow from all
    Deny from 52.230.152.0/24
    Deny from 52.233.106.0/24
</IfModule>

# Apache 2.4+
<IfModule authz_core_module>
    <RequireAll>
        Require all granted
        Require not ip 52.230.152.0/24
        Require not ip 52.233.106.0/24
    </RequireAll>
</IfModule>

  3. Another possible way to block the ChatGPT bot via .htaccess is provided below.
# Apache 2.2
<IfModule !authz_core_module>
    SetEnvIf User-Agent GPTBot NoChatGPT=1
    Order Allow,Deny
    Allow from all
    Deny from env=NoChatGPT
</IfModule>

# Apache 2.4+
<IfModule authz_core_module>
    # Match with a regex: the full GPTBot user-agent string contains more than just "GPTBot"
    <If "%{HTTP_USER_AGENT} =~ /GPTBot/">
        Require all denied
    </If>
</IfModule>
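
If you want to extend this user-agent approach to other AI crawlers, Apache 2.4’s expression syntax lets you match several of them in one rule. A sketch (the user-agent names are taken from the robots.txt lists further down this thread, and the list is by no means exhaustive):

# Apache 2.4+
<IfModule authz_core_module>
    # Deny any request whose user-agent string contains one of these bot names (case-insensitive)
    <If "%{HTTP_USER_AGENT} =~ /GPTBot|ChatGPT-User|CCBot|anthropic-ai|Bytespider/i">
        Require all denied
    </If>
</IfModule>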


That’s it!

You’ve now blocked the ChatGPT bot from crawling your website. More information can be found on OpenAI’s website here.

Hope that helps…humanity. :cold_sweat:


Can’t remember where I picked this up from, but this robots.txt blocks many of the other bots as well.

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Bytespider
Disallow: /

Might be helpful to someone!


Yeah, I saw that over on Adam’s forum yesterday; someone had posted it there.

Thanks for adding it here, it’s a great resource for excluding more AI bots from crawling a website. :slightly_smiling_face: :raised_hands:

One thing about the robots.txt above: GPTBot covers ChatGPT-User as well, so only GPTBot needs to be added to the robots.txt file to block both.


Awesome post, dan. Thanks so much.

No problem.

If anybody knows of other AI bots out there and ways to block them, please add them in the comments here.

OpenAI has changed its IP egress ranges three times since I originally wrote this article, so it’s a constant battle to stay up to date and make sure these AI bots remain properly blocked from crawling people’s websites. :face_exhaling:


I was looking up this data yesterday. What I found was hundreds of bots, and the list seems to grow every day.

(April 12th, 2024) - Updated list of AI bots to add to your robots.txt file to direct them not to crawl/scrape your website.


# Amazon Bot - enabling Alexa to answer even more questions for customers.
User-agent: Amazonbot
Disallow: /

# Anthropic AI Bot
User-agent: anthropic-ai
Disallow: /

# Apple Bot - collects website data for its Siri and Spotlight services.
User-agent: Applebot
Disallow: /

# Claude Bot run by Anthropic
User-agent: Claude-Web
Disallow: /

# Cohere AI Bot - unconfirmed bot believed to be associated with Cohere’s chatbot.
User-agent: cohere-ai
Disallow: /

# Common Crawl's bot - Common Crawl is one of the largest public datasets used for AI training, including for ChatGPT, Bard, and other large language models.
User-agent: CCBot
Disallow: /

# Diffbot - somewhat dishonest scraping bot used to collect data to train LLMs.
User-agent: Diffbot
Disallow: /

# Google Bard and Vertex AI - blocking this does not affect Google Search indexing or GoogleBot crawling.
User-agent: Google-Extended
Disallow: /

# ImagesiftBot is billed as a reverse image search tool, but it's associated with The Hive, a company that produces models for image generation.
User-agent: ImagesiftBot 
Disallow: /

# YouBot - believed to be the crawler for You.com's AI search
User-agent: YouBot
Disallow: /

# Omgilibot - its operator sells data for training LLMs (large language models)
User-agent: omgilibot
Disallow: /

# Omgili (Oh My God I Love It)
User-agent: omgili
Disallow: /

# OpenAI's GPTBot - the bot OpenAI uses to collect bulk training data from your website for ChatGPT.
User-agent: GPTBot
Disallow: /

# Perplexity AI
User-agent: PerplexityBot
Disallow: /

## Social Media Bots

# Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok
User-agent: Bytespider
Disallow: /

# Meta’s bot that crawls public web pages to improve language models for their speech recognition technology
User-agent: FacebookBot
Disallow: /

# Twitter's bot used to index the content of any given URL
User-agent: Twitterbot
Disallow: /

There’s a good GitHub project that keeps an updated list of AI crawlers; worth keeping an eye on if you want to keep away those pesky robots!


Ugh, I hope OpenAI is more sincere than Perplexity! Perplexity (an AI search engine) outright ignores your robots.txt, as researcher Robb Knight and WIRED Magazine have found.

Cheers,
Erwin


Well, apparently OpenAI also ignores robots.txt:


I don’t doubt that there’s some shady data scraping going on among the different AI companies; however, that Business Insider article doesn’t list its sources yet.

OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that prevents automated scraping of websites, according to a person with knowledge of the analytics of TollBit, as well as another person familiar with the matter.

The article does reference an earlier article from Reuters here, in which TollBit states:

According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate “numerous” AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of its site can be crawled.

It doesn’t specifically mention which AI companies they are referring to in the letter.

With that said, it wouldn’t surprise me if it turns out OpenAI is ignoring the robots.txt file, but I’d need a bit more to go on aside from that BI article. Let’s see if OpenAI publicly addresses it in the coming days.

In the meantime, perhaps blocking via the .htaccess method would be more ironclad.
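
For what it’s worth, matching on the user agent only works against bots that identify themselves honestly; a crawler willing to ignore robots.txt could just as easily spoof its user-agent string. Still, for completeness, here’s an equivalent mod_rewrite sketch for servers where the authz_core expression syntax isn’t available (assumes mod_rewrite is enabled):

<IfModule mod_rewrite.c>
    RewriteEngine On
    # Return 403 Forbidden to any request whose user agent contains "GPTBot" (case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule ^ - [F,L]
</IfModule>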


Another blocking alternative: for any website that runs behind Cloudflare (which is recommended), Cloudflare offers AI bot management on all of its plans, including the free one, according to its article below.

Hi, I continue to be surprised by this desire for confidentiality when the posting is, by definition, public. Remember back in the 90s all those programs for sucking down entire websites, under the guise of saving on the prohibitive connection fees of the time? From forum to forum I read posts asking how to protect your photos (which you put online voluntarily), how to protect your documents (which you put online voluntarily), how to protect your music (which you… yes, you get the idea).

It’s like movie stars: as long as they aren’t famous yet, they desperately want to be known and recognized, and once they finally are, they put on dark glasses… That’s my psychologist side talking; I can’t help it :yawning_face:. I believe the real desire behind these posts is: “how can we give only to those we want to give to, while showing everyone?” Put differently: “how can we make sure it benefits only those we accept it benefiting?”

It reminds me of a joke about perverts: if a sadist and a masochist play together, who wins? The masochist says, “go ahead, hurt me.” The sadist replies: “only if I want to.” :grimacing:

I think it has more to do with people wanting to protect their intellectual property and copyrighted works, not concerns about privacy or confidentiality.


Yeah that, exactly!

My first instinct is to agree, but how do you put a lock on a thought, a graphic, or anything else? When one party has marked something “all rights reserved” and the other doesn’t care, all that remains is the courts… and their cost.

Cloudflare has just announced a tool that helps you prevent AI bots from scraping your website.

Cheers,
Erwin
