Categories: SEO

How to Block ChatGPT From Using Your Website Content

There is concern about the lack of an easy way to opt out of having one's content used to train large language models (LLMs) like ChatGPT. There is a way to do it, but it's neither straightforward nor guaranteed to work.

How AIs Learn From Your Content

Large Language Models (LLMs) are trained on data that originates from multiple sources. Many of these datasets are open source and are freely used for training AIs.

Some of the sources used are:

  • Wikipedia
  • Government court records
  • Books
  • Emails
  • Crawled websites

There are also dataset portals, websites whose purpose is to give away vast amounts of information.

One such portal, hosted by Amazon, offers thousands of datasets at the Registry of Open Data on AWS.

The Amazon portal is just one of many. Wikipedia lists 28 portals for downloading datasets, including Google Dataset Search and Hugging Face, each a gateway to thousands of datasets.

Datasets of Web Content

OpenWebText

A popular dataset of web content is called OpenWebText. OpenWebText consists of content scraped from URLs shared in Reddit posts that received at least three upvotes.

The idea is that these URLs are trustworthy and will contain quality content. I couldn't find information about a user agent for their crawler; it may simply identify itself as a generic Python client, but I'm not sure.

Nevertheless, we do know that if your site is linked from Reddit with at least three upvotes, there's a good chance it's in the OpenWebText dataset.

More information about OpenWebText here.

Common Crawl

One of the most commonly used datasets for Internet content is offered by a non-profit organization called Common Crawl.

Common Crawl data comes from a bot that crawls the entire Internet.

The data is downloaded by organizations that wish to use it and is then cleaned of spammy sites and other low-quality content.

The name of the Common Crawl bot is CCBot.

CCBot obeys the robots.txt protocol, so it is possible to block Common Crawl with robots.txt and prevent your website data from making it into another dataset.

However, if your site has already been crawled then it’s likely already included in multiple datasets.

Nevertheless, by blocking Common Crawl it’s possible to opt-out your website content from being included in new datasets sourced from newer Common Crawl data.

The CCBot User-Agent string is:

CCBot/2.0

Add the following to your robots.txt file to block the Common Crawl bot:

User-agent: CCBot
Disallow: /
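As a sanity check, Python's standard-library robots.txt parser can confirm that these two lines block CCBot while leaving other crawlers unaffected (a minimal sketch; the example URL is hypothetical):

```python
from urllib.robotparser import RobotFileParser

# The same rules the article recommends adding to robots.txt.
rules = """\
User-agent: CCBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# CCBot is denied everywhere; other user agents remain allowed.
print(parser.can_fetch("CCBot", "https://example.com/any-page"))      # False
print(parser.can_fetch("Googlebot", "https://example.com/any-page"))  # True
```

Because no rule names any other user agent, every crawler except CCBot still sees the site as fully crawlable.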

An additional way to confirm that a CCBot user agent is legitimate is that it crawls from Amazon AWS IP addresses.
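Amazon publishes its current IP ranges as JSON at ip-ranges.amazonaws.com, so a server-side check can test whether a claimed CCBot request really originates from an AWS address. Below is a minimal sketch using the standard-library ipaddress module; the two prefixes are illustrative placeholders, not actual CCBot addresses:

```python
import ipaddress

def ip_in_ranges(ip: str, cidr_blocks: list[str]) -> bool:
    """Return True if ip falls inside any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(block) for block in cidr_blocks)

# Illustrative prefixes only -- in practice, load the live list from
# https://ip-ranges.amazonaws.com/ip-ranges.json and use its "ip_prefix" values.
aws_prefixes = ["52.0.0.0/11", "54.144.0.0/12"]

print(ip_in_ranges("52.15.1.10", aws_prefixes))   # True
print(ip_in_ranges("203.0.113.9", aws_prefixes))  # False
```

A request claiming to be CCBot from an address outside the AWS ranges would be a strong hint that the user agent string is spoofed.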

CCBot also obeys the nofollow robots meta tag directive.

Use this in your robots meta tag:

<meta name="robots" content="nofollow">
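To verify that a page actually emits this directive, the tag can be located with Python's built-in HTML parser (a minimal sketch):

```python
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    """Collects the content of any <meta name="robots"> tags in a page."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "meta" and attr_map.get("name", "").lower() == "robots":
            self.directives.append(attr_map.get("content", ""))

html = '<html><head><meta name="robots" content="nofollow"></head></html>'
finder = RobotsMetaFinder()
finder.feed(html)
print(finder.directives)  # ['nofollow']
```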

Blocking AI From Using Your Content

Search engines allow websites to opt out of being crawled. Common Crawl also allows opting out. But there is currently no way to remove one's website content from existing datasets.

Furthermore, research scientists don’t seem to offer website publishers a way to opt-out of being crawled.

The article, Is ChatGPT Use Of Web Content Fair?, explores whether it's even ethical to use website data without permission or a way to opt out.

Many publishers would likely appreciate being given more say in how their content is used, especially by AI products like ChatGPT.

Whether that will happen is unknown at this time.

Featured image by Shutterstock/ViDI Studio

Published by
Chris Barnhart

Recent Posts

Offline For Last Days Of Passover 5784

This is a programming note that I will be completely offline for the last days…

April 29, 2024

Studio By WordPress & Other Free Tools

WordPress announced the rollout of Studio by WordPress, a new local development tool that makes…

April 28, 2024

Big Update To Google’s Ranking Drop Documentation

Google updated their guidance with five changes on how to debug ranking drops. The new…

April 27, 2024

Google March 2024 Core Update Officially Completed A Week Ago

Google has officially completed its March 2024 Core Update, ending over a month of ranking…

April 27, 2024

Daily Search Forum Recap: April 26, 2024

Here is a recap of what happened in the search forums today, through the eyes…

April 27, 2024

Google March 2024 Core Update Finished April 19, 2024

The Google March 2024 core update finished a week ago and Google did not tell…

April 27, 2024