> “Pre-ImageNet, people did not believe in data,” Li said in a September interview at the Computer History Museum. “Everyone was working on completely different paradigms in AI with a tiny bit of data.”

That's baloney. The old ML adage "there's no data like more data" is as old as mankind itself.

Not baloney. The culture around data in 2005-2010 -- at least / especially in academia -- was night and day compared to where it is today. It's not that people didn't understand that more data enabled richer and more accurate models, but that they accepted data constraints as part of the problem setup.

Most methods research went into ways of building beliefs about a domain into models as biases, so that they could be more accurate in practice with less data. (This describes a lot of probabilistic graphical model (PGM) work.) This was partly because there was still a tug of war between the CS and traditional statistics communities over ML, and the latter were trained to be obsessive about model specification.
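
To make that idea concrete with a toy example (not any specific PGM from that era; every number below is an invented assumption): encoding a domain belief as a prior keeps an estimate sensible when you only have a handful of observations.

    # Beta-Binomial sketch: estimate a rate from very little data.
    # The Beta(20, 80) prior encodes an assumed domain belief that the
    # rate is around 20% -- the "bias" that substitutes for more data.
    prior_a, prior_b = 20, 80      # invented pseudo-counts (domain bias)
    successes, trials = 3, 5       # tiny observed dataset

    mle = successes / trials                                             # 0.60, noisy with 5 samples
    post_mean = (prior_a + successes) / (prior_a + prior_b + trials)     # ~0.22, anchored by the prior
    print(mle, post_mean)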

One result was that the models that were practical for production inference were often trained to the point of diminishing returns on their specific tasks. Engineers deploying ML weren't wishing for more training instances, but better data at inference time. Models that could perform more general tasks -- like differentiating 90k object classes rather than just a few -- were barely even on most people's radar.

Perhaps folks at Google or FB at the time have a different perspective. One of the reasons I went ABD in my program was that it felt like industry had access to richer data streams than academia. Fei-Fei Li's insistence on building an academic computer science career around giant data sets really was ingenious, and even subversive.

It's not quite so - we couldn't handle it, and we didn't have it, so it was a bit of a non-question.

I started with ML in 1994 in a small, poor lab, so we didn't have state-of-the-art hardware. On the other hand, I think my experience is fairly representative. We worked with data sets on SPARC workstations that were stored in flat files and had thousands or sometimes tens of thousands of instances. We had problems keeping our data sets on the machines and often archived them to tape.

Data came from very deliberate acquisition processes. For example, I remember going to a field exercise with a particular device and directing its use over a period of days in order to collect the data that would be needed for a machine learning project.

Sometime in the 2000's, data started to be generated and collected as "exhaust" from various processes. People and organisations became instrumented in the sense that their daily activities were necessarily captured digitally. For a time this data was latent; people didn't really think about using it in the way that we think about it now. But by about 2010 it was obvious that not only was this data available, but we had the processing and data systems to use it effectively.

Replying to the people arguing against my comment: you don't seem to take into account that the technical circumstances were totally different thirty, twenty, or even ten years ago! People would have liked to train with more data, and there was a big interest in combining heterogeneous datasets to achieve exactly that. But one major problem was the compute! There weren't any pretrained models that you specialized in one way or the other - you always retrained from scratch. I mean, even today, who has the capability to train a multi-billion-parameter GPT from scratch? And I don't mean retraining a tried-and-trusted architecture+dataset once; I mean a research project trying to optimize your setup towards a certain goal.
Pre-ImageNet was like pre-2010. Doing ML with massive data really wasn't in vogue back then.
In 2019, GPT-2 1.5B was trained on ~10B tokens.

Last week Hugging Face released SmolLM v2 1.7B trained on 11T tokens - 3 orders of magnitude more training data for roughly the same parameter count, with almost the same architecture.

So even back in 2019 we can say we were working with a tiny amount of data compared to what is routine now.
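
A quick back-of-the-envelope check of that figure, using the token counts quoted above:

    # ratio of training tokens: SmolLM v2 (~11T) vs GPT-2 (~10B)
    11e12 / 10e9   # = 1,100 -- about three orders of magnitude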

Not really. This is referring back to the 80's. People weren't even doing 'ML'. And back then people were more focused on teasing out 'laws' from as few data points as possible. The focus was more on formulas and symbols, and on finding relationships between individual data points - not the broad patterns we take for granted today.
I would say using backpropagation to train multi-layer neural networks would qualify as ML, and we were definitely doing that in the 80's.
Just with tiny amounts of data.
Compared to today. We thought we used large amounts of data at the time.
"We thought we used large amounts of data at the time."

Really? Did it take at least an entire rack to store?

We didn't measure data size that way. At some point in the future someone will find this dialog and think that we don't have large amounts of data now, because we are not using entire solar systems for storage.
Why can't you use a rack as a unit of storage at the time? Were 19" server racks not in common use yet? The storage capacity of a rack will grow over time.

My storage hierarchy goes:
1) 1 storage drive
2) 1 server maxed out with the biggest storage drives available
3) 1 rack filled with servers from 2
4) 1 data center filled with racks from 3
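
A rough sketch of those tiers with made-up, order-of-magnitude capacities (every number below is an assumption for illustration, not a spec):

    # Hypothetical per-unit capacities; adjust to taste.
    DRIVE_TB = 20            # one large HDD
    DRIVES_PER_SERVER = 24   # a dense storage server
    SERVERS_PER_RACK = 20    # what fits in a 42U rack after switches/PDUs
    RACKS_PER_DC = 1000

    server_tb = DRIVE_TB * DRIVES_PER_SERVER   # ~480 TB
    rack_tb = server_tb * SERVERS_PER_RACK     # ~9.6 PB
    dc_tb = rack_tb * RACKS_PER_DC             # ~9.6 EB

    def tier(dataset_tb):
        """Smallest tier in the hierarchy that holds the dataset."""
        if dataset_tb <= DRIVE_TB:  return "1 drive"
        if dataset_tb <= server_tb: return "1 server"
        if dataset_tb <= rack_tb:   return "1 rack"
        return "many racks / a data center"

    print(tier(10))  # a 10 TB dataset still fits on a single drive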

How big is a rack in VW beetles though?

It's a terrible measurement because it's an irrelevant detail about how the data is stored; if the data lives in a proprietary cloud, nobody outside the team that runs it actually knows how it's racked.

So while someone could say they used a 10 TiB data set, or 10T parameters, how many "racks" of AWS S3 that occupies is not known outside of Amazon.

A 42U 19-inch rack is an industry standard. If you actually work on the physical infrastructure of data centers, it is most CERTAINLY NOT an irrelevant detail.

And whether your data can fit on a single server, single rack, or many racks will drastically affect how you design the infrastructure.

The mid-90s had neural nets, and even a few popular-science books on them. The common hardware was so much less capable then.
mid-60's had neural nets.

mid-90's had LeCun telling everyone that big neural nets were the future.
