There are class actions now like https://www.nytimes.com/2024/06/13/business/clearview-ai-fac...

Nobody even knew what OpenAI was up to when they were gathering training data - they got away with a lot. Now there is precedent and people are paying more attention. Data that was previously free/open now has a clause that it can't be used for AI training. OpenAI didn't have to deal with any of that.

Also, OpenAI used cheap labor in Africa to tag training data, which was also controversial. If someone did it now, they'd be the ones to pay. OpenAI can always say "we stopped," like Nike said with sweatshops.

A lot has changed.

There are at least 3 companies with staff in developed countries, paid well above minimum wage, doing tagging and creation of training data. At least one of them, which I have an NDA with, pays at least some of its staff tech contractor rates for data in some niches, and even then some of the data gets processed by 5+ people before it's returned to the client. Since I have ended up talking to 3, and I'm hardly well connected in that space, I can only presume there are many more.

Companies are willing to pay a lot for clean training data, and my bet is there will be a growing pile of training sets for sale on a non-exclusive basis as well.

A lot of this data (what I've seen, anyway) is far cleaner than anything you'll find on the open web, with significant data on human preferences, validation, cited sources, and, in the case of e.g. coding, verification that the code runs and works correctly.
