How to prevent ChatGPT from crawling your website

Came across an interesting article and thought I’d share it. ChatGPT is pretty neat and useful for a lot of folks, but there are also those who don’t take too kindly to OpenAI scraping their data to train its LLMs (large language models): for example, those who publish copyrighted works online, or those who have to pay the server costs for the bandwidth OpenAI’s crawler consumes while it crawls their site. There are even those who just don’t want to contribute to the potential future AI uprising. :robot:

If you fall into the “ChatGPT = Bad” camp, below are two methods you can use to prevent our future AI overlord from assimilating your website’s data.



Block the ChatGPT bot via robots.txt

  1. At your web host, locate your website’s root directory (usually public_html if your web host is using cPanel) and create a new file in that directory called robots.txt. Open that robots.txt file by clicking Edit.

  2. Enter the below rules to block ChatGPT from accessing all areas of your website, then click the “Save Changes” button, then the “Close” button.
User-agent: GPTBot
Disallow: /

  3. If you would like to prevent ChatGPT from accessing only certain parts of your site, you can selectively list what directories/folders it can and cannot crawl by entering the below rules (replacing directory-*/ with the actual path to your directory), then click the “Save Changes” button, then the “Close” button.
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
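
Side note: under the robots.txt standard (RFC 9309), the most specific (longest) matching rule wins, so you can also disallow the whole site and carve out a single directory. A minimal sketch, using a hypothetical directory name:

User-agent: GPTBot
Disallow: /
Allow: /public-stuff/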


Block the ChatGPT bot via .htaccess

  1. In RapidWeaver, go to your Publishing settings, then click on the “Edit .htaccess File” button.

  2. On a new line, enter the below .htaccess rules to block ChatGPT from accessing all areas of your website, then click the “Save and Upload” button.
# Apache 2.2
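# NOTE: these were OpenAI's published egress IP ranges at the time of writing; they have
# changed several times since (see later in this thread), so check OpenAI's current list first.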
<IfModule !authz_core_module>
    Order Allow,Deny
    Allow from all
    Deny from 52.230.152.0/24
    Deny from 52.233.106.0/24
</IfModule>

# Apache 2.4+
<IfModule authz_core_module>
    <RequireAll>
        Require all granted
        Require not ip 52.230.152.0/24
        Require not ip 52.233.106.0/24
    </RequireAll>
</IfModule>

  3. Another possible way to block the ChatGPT bot via .htaccess is provided below.
# Apache 2.2
<IfModule !authz_core_module>
    SetEnvIf User-Agent GPTBot NoChatGPT=1
    Order Allow,Deny
    Allow from all
    Deny from env=NoChatGPT
</IfModule>

# Apache 2.4+
<IfModule authz_core_module>
    # Match with a regex: the full GPTBot user-agent string contains more than just "GPTBot"
    <If "%{HTTP_USER_AGENT} =~ /GPTBot/">
        Require all denied
    </If>
</IfModule>
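
If you want to extend this user-agent approach to other AI crawlers, Apache 2.4’s expression syntax lets you match several of them in one rule. A sketch (the user-agent names are taken from the robots.txt lists further down this thread, and the list is by no means exhaustive):

# Apache 2.4+
<IfModule authz_core_module>
    # Deny any request whose user-agent string contains one of these bot names (case-insensitive)
    <If "%{HTTP_USER_AGENT} =~ /GPTBot|ChatGPT-User|CCBot|anthropic-ai|Bytespider/i">
        Require all denied
    </If>
</IfModule>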


That’s it!

You’ve now blocked the ChatGPT bot from crawling your website. More information can be found on OpenAI’s website here.

Hope that helps…humanity. :cold_sweat:


Can’t remember where I picked this up from, but this robots.txt blocks many of the other bots as well.

User-agent: CCBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Bytespider
Disallow: /

Might be helpful to someone!


Yeah, I saw that over on Adam’s forum yesterday; someone had posted it there.

Thanks for adding it here, it’s a great resource for excluding more AI bots from crawling a website. :slightly_smiling_face: :raised_hands:

One thing about the robots.txt above: GPTBot covers ChatGPT-User as well, so only GPTBot needs to be added to the robots.txt file to block both.


Awesome post, dan. Thanks so much.

No problem.

If anybody knows of other AI bots out there and ways to block them, please add them in the comments here.

OpenAI has changed its IP egress ranges three times since I originally wrote this article, so it’s a constant battle to stay up to date and make sure these AI bots remain properly blocked from crawling people’s websites. :face_exhaling:


I was looking up this data yesterday. What I found was hundreds of bots, and the list seems to grow every day.

(April 12th, 2024) - Updated list of AI bots to add to your robots.txt file to direct them not to crawl/scrape your website.


# Amazon Bot - enabling Alexa to answer even more questions for customers.
User-agent: Amazonbot
Disallow: /

# Anthropic AI Bot
User-agent: anthropic-ai
Disallow: /

# Apple Bot - collects website data for its Siri and Spotlight services.
User-agent: Applebot
Disallow: /

# Claude Bot run by Anthropic
User-agent: Claude-Web
Disallow: /

# Cohere AI Bot - unconfirmed bot believed to be associated with Cohere’s chatbot.
User-agent: cohere-ai
Disallow: /

# Common Crawl's bot - Common Crawl is one of the largest public datasets used for AI training, including for ChatGPT, Bard, and other large language models.
User-agent: CCBot
Disallow: /

# Diffbot - somewhat dishonest scraping bot used to collect data to train LLMs.
User-agent: Diffbot
Disallow: /

# Google Bard and Vertex AI - blocking this does not affect Google Search indexing or GoogleBot crawling.
User-agent: Google-Extended
Disallow: /

# ImagesiftBot is billed as a reverse image search tool, but it's associated with The Hive, a company that produces models for image generation.
User-agent: ImagesiftBot 
Disallow: /

# YouBot - believed to be the crawler for You.com's AI search
User-agent: YouBot
Disallow: /

# Omgilibot - its operator sells data for training LLMs (large language models)
User-agent: omgilibot
Disallow: /

# Omgili (Oh My God I Love It)
User-agent: omgili
Disallow: /

# OpenAI's GPTBot - the bot OpenAI uses to collect bulk training data from your website for ChatGPT.
User-agent: GPTBot
Disallow: /

# Perplexity AI
User-agent: PerplexityBot
Disallow: /

## Social Media Bots

# Bytespider is a web crawler operated by ByteDance, the Chinese owner of TikTok
User-agent: Bytespider
Disallow: /

# Meta’s bot that crawls public web pages to improve language models for their speech recognition technology
User-agent: FacebookBot
Disallow: /

# Twitter's bot used to index the content of any given URL
User-agent: Twitterbot
Disallow: /

There’s a good GitHub project that keeps an updated list of AI crawlers; worth keeping an eye on if you want to keep away those pesky robots!


Ugh, I hope OpenAI is more sincere than Perplexity! Perplexity (an AI search engine) outright ignores your robots.txt, as researcher Robb Knight and WIRED Magazine have found.

Cheers,
Erwin


Well, apparently OpenAI also ignores robots.txt:


I don’t doubt that there’s some shady data scraping going on among the different AI companies; however, that Business Insider article doesn’t list its sources yet.

OpenAI and Anthropic have been found to be either ignoring or circumventing an established web rule, called robots.txt, that prevents automated scraping of websites, according to a person with knowledge of the analytics of TollBit, as well as another person familiar with the matter.

The article does reference an earlier article from Reuters here, in which TollBit states:

According to the TollBit letter, Perplexity is not the only offender that appears to be ignoring robots.txt. TollBit said its analytics indicate “numerous” AI agents are bypassing the protocol, a standard tool used by publishers to indicate which parts of its site can be crawled.

It doesn’t specifically mention which AI companies they are referring to in the letter.

With that said, it wouldn’t surprise me if it turns out OpenAI is ignoring the robots.txt file, but I’d need a bit more to go on aside from that BI article. Let’s see if OpenAI publicly addresses it in the coming days.

In the meantime, perhaps blocking via the .htaccess method would be more ironclad.
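
For what it’s worth, matching on the user agent only works against bots that identify themselves honestly; a crawler willing to ignore robots.txt could just as easily spoof its user-agent string. Still, for completeness, here’s an equivalent mod_rewrite sketch for servers where the authz_core expression syntax isn’t available (assumes mod_rewrite is enabled):

<IfModule mod_rewrite.c>
    RewriteEngine On
    # Return 403 Forbidden to any request whose user agent contains "GPTBot" (case-insensitive)
    RewriteCond %{HTTP_USER_AGENT} GPTBot [NC]
    RewriteRule ^ - [F,L]
</IfModule>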


Another blocking alternative: for any website that runs behind Cloudflare (which is recommended), Cloudflare offers AI bot management on all of its plans, including the free one, according to its article below.

Hi, I continue to be surprised by this desire for confidentiality when the posting is, by definition, public. Remember back in the 90s all those programs for sucking down entire websites, under the guise of saving on the prohibitive connection fees of the time? From forum to forum I read posts asking how to protect your photos (which you put online voluntarily), how to protect your documents (which you put online voluntarily), how to protect your music (which you… yes, you get the idea).

It’s like movie stars: as long as they aren’t famous yet, they desperately want to be known and recognized, and once they finally are, they put on dark glasses… That’s my psychologist side talking; I can’t help it :yawning_face:. I believe the real desire behind these posts is: “how can we give only to those we want to give to, while showing everyone?” Put differently: “how can we make sure it benefits only those we accept it benefiting?”

It reminds me of a joke about perverts: if a sadist and a masochist play together, who wins? The masochist says, “go ahead, hurt me.” The sadist replies: “only if I want to.” :grimacing:

I think it has more to do with people wanting to protect their intellectual property and copyrighted works, not concerns about privacy or confidentiality.


Yeah that, exactly!

My first instinct is to agree, but how do you put a lock on a thought, a graphic, or anything else? When one party has marked something “all rights reserved” and the other doesn’t care, all that remains is the courts… and their cost.

Cloudflare has just announced a tool that helps you prevent AI bots from scraping your website.

Cheers,
Erwin
