This seems like the most relevant thread for this, if anyone cares to read it. I can read it, but I'm not sure whether it will be paywalled for some:
https://www.washingtonpost.com/tech..._medium=email&utm_source=alert&location=alert
I'll post a snippet here:
'To look inside this black box, we analyzed Google’s C4 data set, a massive snapshot of the contents of 15 million websites that have been used to instruct some high-profile English-language AIs, called large language models, including Google’s T5 and Facebook’s LLaMA. (OpenAI does not disclose what datasets it uses to train the models backing its popular chatbot, ChatGPT.)
The Post worked with researchers at the Allen Institute for AI on this investigation and categorized the websites using data from Similarweb, a web analytics company. About a third of the websites could not be categorized, mostly because they no longer appear on the internet. Those are not shown.
We then ranked the remaining 10 million websites based on how many “tokens” appeared from each in the data set. Tokens are small bits of text used to process disorganized information — typically a word or phrase.
Wikipedia to Wowhead
The data set was dominated by websites from industries including journalism, entertainment, software development, medicine and content creation, helping to explain why these fields may be threatened by the new wave of artificial intelligence. The three biggest sites were patents.google.com (No. 1), which contains text from patents issued around the world; wikipedia.org (No. 2), the free online encyclopedia; and scribd.com (No. 3), a subscription-only digital library. Also high on the list: b-ok.org (No. 190), a notorious market for pirated e-books that has since been seized by the U.S. Justice Department. At least 27 other sites identified by the U.S. government as markets for piracy and counterfeits were present in the data set.
Some top sites seemed arbitrary, like wowhead.com (No. 181), a World of Warcraft player forum; thriveglobal.com (No. 175), a product for beating burnout founded by Arianna Huffington; and at least 10 sites that sell dumpsters, including dumpsteroid.com (No. 183), that no longer appear accessible.'
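For anyone wondering what "ranking websites by how many tokens appeared from each" could look like in practice, here is a rough sketch in Python. It is not the Post's or the Allen Institute's actual method: the function name `rank_sites_by_tokens`, the `(url, text)` input format, and the sample documents are all made up for illustration, and tokens are approximated by splitting on whitespace because the article does not say which tokenizer was used.

```python
from collections import Counter
from urllib.parse import urlparse


def rank_sites_by_tokens(docs):
    """Rank sites by total token count, highest first.

    docs: iterable of (url, text) pairs (a hypothetical input format).
    Tokens are approximated by whitespace splitting, which is an
    assumption; a real language-model tokenizer would count differently.
    """
    totals = Counter()
    for url, text in docs:
        host = urlparse(url).netloc.lower()
        # Treat "www.example.com" and "example.com" as the same site.
        if host.startswith("www."):
            host = host[4:]
        totals[host] += len(text.split())
    # most_common() returns (site, count) pairs sorted by count, descending.
    return totals.most_common()


# Made-up sample documents, only to show the shape of the input.
sample_docs = [
    ("https://www.wikipedia.org/wiki/A", "the free online encyclopedia"),
    ("https://patents.google.com/patent/X", "text from patents issued around the world"),
    ("https://wikipedia.org/wiki/B", "another page"),
]

for rank, (site, tokens) in enumerate(rank_sites_by_tokens(sample_docs), start=1):
    print(f"No. {rank}: {site} ({tokens} tokens)")
```

The point is just that the "No. 1", "No. 181" style numbers in the snippet are positions in a list like this one, sorted by how much text each site contributed.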