We all know that OpenAI did it.
Nobody even knew what OpenAI was up to when they were gathering training data - they got away with a lot. Now there is precedent and people are paying more attention. Data that was previously free/open now comes with clauses saying it can't be used for AI training. OpenAI didn't have to deal with any of that.
Also, OpenAI used cheap labor in Africa to tag training data, which was also controversial. If someone did it now, they'd be the ones to pay. OpenAI can always say "we stopped", like Nike did with sweatshops.
A lot has changed.
Companies are willing to pay a lot for clean training data, and my bet is there will be a growing pile of training sets for sale on a non-exclusive basis as well.
A lot of this data - what I've seen, anyway - is far cleaner than anything you'll find on the open web, with significant data on human preferences, validation, cited sources, and, in the case of e.g. coding, verification that the code runs and works correctly.
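To make the "verified code" point concrete, here's a minimal sketch of what that kind of check could look like: run each candidate sample together with its paired test in a subprocess and only keep samples whose tests pass. The record fields (solution, test) and the subprocess harness are my own assumptions for illustration, not any vendor's actual pipeline.

    # Sketch only: keep a code sample if it runs and its paired test passes.
    import hashlib  # not needed here; see the preprocessing sketch below
    import os
    import subprocess
    import sys
    import tempfile

    def sample_passes(solution: str, test: str, timeout: float = 10.0) -> bool:
        """Return True if solution + test runs to completion without errors."""
        # Write the solution and its test into one temporary script.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution + "\n\n" + test)
            path = f.name
        try:
            result = subprocess.run(
                [sys.executable, path],
                capture_output=True,
                timeout=timeout,
            )
            return result.returncode == 0
        except subprocess.TimeoutExpired:
            return False
        finally:
            os.unlink(path)

    # Toy usage with a hypothetical record from a curated coding dataset.
    record = {
        "solution": "def add(a, b):\n    return a + b",
        "test": "assert add(2, 3) == 5",
    }
    print("keep" if sample_passes(record["solution"], record["test"]) else "drop")

Real curation pipelines presumably sandbox this much more carefully (containers, resource limits, dependency pinning), but the basic idea - execute, assert, filter - is the same.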
Very interesting, thanks for sharing that detail. As someone who has tinkered with tokenizing/training, I quickly found out this must be the case. Some people on HN don't know this. I've argued here with otherwise smart people who think there is no data preprocessing for LLMs - that they don't need it because "vectors" - failing to realize that the semantic depth and quality of embeddings depend on the quality of the training data.
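For anyone who hasn't looked under the hood: even the crudest preprocessing happens long before tokenization. A minimal sketch, assuming exact deduplication plus a couple of rough quality filters (the thresholds here are illustrative, not what any lab actually uses):

    # Sketch of pre-tokenization cleanup: dedup + crude quality filters.
    import hashlib

    def clean_corpus(docs, min_chars=200, max_non_ascii_ratio=0.3):
        seen = set()
        kept = []
        for text in docs:
            text = text.strip()
            # Drop near-empty or very short documents.
            if len(text) < min_chars:
                continue
            # Drop documents that look like binary junk / encoding debris.
            non_ascii = sum(1 for ch in text if ord(ch) > 127)
            if non_ascii / len(text) > max_non_ascii_ratio:
                continue
            # Exact dedup via content hash.
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            kept.append(text)
        return kept

    docs = ["short", "A" * 300, "A" * 300]   # second and third are duplicates
    print(len(clean_corpus(docs)))           # -> 1

Production pipelines go much further (fuzzy dedup, language ID, toxicity and PII filtering, quality classifiers), but garbage in, garbage embeddings out.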