Story Detail of id 48442554 | Liveview Hacker News

dilyevsky11 hours ago | on: Playing with Vision Embeddings

> Because you somehow need a giant training set which describes images in natural language, no?

That's definitely one way - they train a text encoder together with an image encoder on a labelled set of images. WL & 3b1b made a nice video on it: https://www.youtube.com/watch?v=iv-5mZ_9CPY

jcattle11 hours ago | parent

Thanks I'll check out that video

#visit	13,660,875
#session	74,665
#live-session	0