> A lot of this data - what I've seen anyway, is far cleaner than anything you'll find on the open web, with significant data on human preferences, validation, cited sources, and in the case of e.g. coding with verification that the code runs and works correctly.

Very interesting, thanks for sharing that detail. As someone who has tinkered with tokenization and training, I quickly found out this must be the case. Some people on HN don't realize it: I've argued here with otherwise smart people who think there is no data preprocessing for LLMs, that they don't need it because "vectors", failing to see that the semantic depth and quality of the embeddings depend on the quality of the training data.
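
To make the point concrete, here is a toy sketch of the kind of cleaning that happens before any tokenization, let alone training. This is purely illustrative (the function name and thresholds are made up); real pipelines for web-scale corpora do far more, e.g. quality scoring, near-duplicate detection, and PII filtering.

```python
import hashlib
import re

def preprocess(docs, min_len=10):
    """Normalize whitespace, drop tiny fragments, remove exact duplicates."""
    seen = set()
    cleaned = []
    for doc in docs:
        text = re.sub(r"\s+", " ", doc).strip()   # collapse runs of whitespace
        if len(text) < min_len:                   # discard near-empty fragments
            continue
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:                        # skip exact duplicates
            continue
        seen.add(digest)
        cleaned.append(text)
    return cleaned

corpus = [
    "Hello   world!",
    "Hello world!",                 # duplicate after normalization
    "ok",                           # too short
    "A longer document that survives filtering.",
]
print(preprocess(corpus))
# → ['Hello world!', 'A longer document that survives filtering.']
```

Garbage in, garbage out applies at every stage: the tokenizer's vocabulary, and ultimately the embeddings, are statistics over whatever survives this kind of filtering.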