
Search the 15 million websites in Google’s C4 dataset

Was your website or content used to help train AI systems as part of Google’s C4 dataset? A new search tool from the Washington Post lets you find out.

Why we care. The dataset includes the kinds of websites and content creators that generative AI could hurt or even wipe out, such as news and media publishers, blogs and marketing sites.

Search. The search tool appears in the Post’s article, “Inside the secret list of websites that make AI like ChatGPT sound smart.” The Post created the list “based on how many ‘tokens’ appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase,” the story explained.
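To make the idea of ranking sites by token count concrete, here is a minimal sketch. It is not the Post’s method: it assumes a token is simply a whitespace-separated word (real tokenizers usually split text into smaller subword pieces), and the function name `rank_sites_by_tokens` and the sample URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_sites_by_tokens(documents):
    """Rank domains by total token count.

    documents: iterable of (url, text) pairs.
    Returns a list of (domain, token_count), largest first.
    Assumption: a "token" is a whitespace-separated word.
    """
    counts = Counter()
    for url, text in documents:
        domain = urlparse(url).netloc.lower()
        # Treat www.example.com and example.com as the same site.
        if domain.startswith("www."):
            domain = domain[4:]
        counts[domain] += len(text.split())
    return counts.most_common()


# Hypothetical sample pages, for illustration only.
sample = [
    ("https://www.example.com/a", "one two three"),
    ("https://example.org/b", "alpha beta"),
    ("https://example.com/c", "four five"),
]
ranking = rank_sites_by_tokens(sample)
```

In this toy sample, both example.com pages are counted toward the same site, so a site’s rank reflects how much of its text is in the data set, not how many pages it has.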

For example, Search Engine Land was used.

As were Marketing Land (a brand that no longer exists, but did in 2019) and Marketing Land Events, which hosted our SMX and MarTech conference sites.

And Search Engine Land’s parent company site, Third Door Media.

Also, Barry Schwartz’s Search Engine Roundtable was used.

Only part of the data. As a reminder, C4 (which stands for Colossal Clean Crawled Corpus) is only part of the data used to train Google Bard and other large language models. Those models also use Wikipedia, Reddit and other sources.

Speaking of Reddit. Reddit wants to be paid when companies use its data to train AI models, the New York Times reported. Reddit has updated its API terms and will now charge some companies (e.g., Google, OpenAI) for access. Said Reddit CEO and co-founder Steve Huffman:

  • “The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free. Crawling Reddit, generating value and not returning any of that value to our users is something we have a problem with. It’s a good time for us to tighten things up.”

Ironically, Reddit itself didn’t even create any of that value. Its users did.

Published by
Henry White