We all know that OpenAI did it.
Nobody even knew what OpenAI was up to when they were gathering training data - they got away with a lot. Now there is precedent and people are paying more attention. Data that was previously free/open now comes with clauses saying it can't be used for AI training. OpenAI didn't have to deal with any of that.
Also, OpenAI used cheap labor in Africa to tag training data, which was also controversial. If someone did it now, they'd be the ones to pay. OpenAI can always say "we stopped", like Nike did with sweatshops.
A lot has changed.
Companies are willing to pay a lot for clean training data, and my bet is there will be a growing pile of training sets for sale on a non-exclusive basis as well.
A lot of this data - what I've seen, anyway - is far cleaner than anything you'll find on the open web, with significant data on human preferences, validation, cited sources, and, in the case of e.g. coding, verification that the code runs and works correctly.
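To make the "verified code" point concrete, here's a minimal sketch of what that kind of check could look like: run each candidate sample together with its paired test in a subprocess and only keep samples whose tests pass. The record fields (solution, test) and the subprocess harness are my own assumptions for illustration, not any vendor's actual pipeline.

    # Sketch only: keep a code sample if it runs and its paired test passes.
    import hashlib  # not needed here; see the preprocessing sketch below
    import os
    import subprocess
    import sys
    import tempfile

    def sample_passes(solution: str, test: str, timeout: float = 10.0) -> bool:
        """Return True if solution + test runs to completion without errors."""
        # Write the solution and its test into one temporary script.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution + "\n\n" + test)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout,
            )
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.unlink(path)

    # Toy usage with a hypothetical record from a curated coding dataset.
    record = {
        "solution": "def add(a, b):\n    return a + b",
        "test": "assert add(2, 3) == 5",
    }
    print("keep" if sample_passes(record["solution"], record["test"]) else "drop")

Real curation pipelines presumably sandbox this much more carefully (containers, resource limits, dependency pinning), but the basic idea - execute, assert, filter - is the same.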
Very interesting, thanks for sharing that detail. As someone who has tinkered with tokenizing/training, I quickly found out this must be the case. Some people on HN don't know this. I've argued here with otherwise smart people who think there is no data preprocessing for LLMs - that they don't need it because "vectors" - failing to realize that the semantic depth and quality of embeddings depend on the quality of the training data.
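For anyone who hasn't looked under the hood: even the crudest preprocessing happens long before tokenization. A minimal sketch, assuming exact deduplication plus a couple of rough quality filters (the thresholds here are illustrative, not what any lab actually uses):

    # Sketch of pre-tokenization cleanup: dedup + crude quality filters.
    import hashlib

    def clean_corpus(docs, min_chars=200, max_non_ascii_ratio=0.3):
        seen = set()
        kept = []
        for text in docs:
            text = text.strip()
            # Drop near-empty or very short documents.
            if len(text) < min_chars:
                continue
            # Drop documents that look like binary junk / encoding debris.
            non_ascii = sum(1 for ch in text if ord(ch) > 127)
            if non_ascii / len(text) > max_non_ascii_ratio:
                continue
            # Exact dedup via content hash.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(text)
        return kept

    docs = ["short", "A" * 300, "A" * 300]   # second and third are duplicates
    print(len(clean_corpus(docs)))           # -> 1

Production pipelines go much further (fuzzy dedup, language ID, toxicity and PII filtering, quality classifiers), but garbage in, garbage embeddings out.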