If I'm understanding correctly, the model code itself is only a tiny proportion of the challenge. The training compute and training data are far bigger parts.
Google has access to training compute on a scale perhaps nobody else has.
Is that really the case, though? Available compute seems unlikely to be the limiting factor here, compared with data, which is far scarcer than what's used to train LLMs. I suspect Google trained mostly on publicly available data, unless they signed deals beforehand with biotechnology companies that have access to more data. That's possible, of course, but it doesn't feel very Google-y.
Yes, all data Google used was public. We have enough compute from YC (thanks YC!) to do this. The main thing is the technical infrastructure - processing the data, efficient loading at training time, proper benchmarking, etc. We are building these now.
Thanks for the answer! It's much better to have the definitive answer rather than rely on gut feeling (even though it was right in this case).
Keep up the good work!
How much compute does YC give you access to, btw? Is it just things like Azure credits, or does YC have actual hardware?