The model only needs to recognize from the shape that it is a horse, and it can extrapolate from there. It would presumably retain some text encoding as a residue of training, but it doesn't need to be fed text from the text encoder side to know that. Think of the CLIP encoder used in Stable Diffusion.
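
To make that concrete, here is a toy sketch of the idea, not how CLIP or Stable Diffusion is actually implemented. It assumes a shared embedding space in which concept directions were fixed during training; the 3-d vectors, the `CONCEPTS` table, and `nearest_concept` are all made up for illustration. The point is only that an image-side embedding can land near "horse" without any text being passed in at inference time.

```python
import math

# Hypothetical concept directions in a shared image/text embedding space.
# The numbers are invented for illustration; they are not real CLIP embeddings.
CONCEPTS = {
    "horse": [0.9, 0.1, 0.0],
    "car": [0.0, 0.2, 0.9],
    "tree": [0.1, 0.9, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest_concept(image_embedding):
    """Return the concept whose direction is closest to the image embedding."""
    return max(CONCEPTS, key=lambda name: cosine(image_embedding, CONCEPTS[name]))


# Stand-in for what an image encoder might output for a horse-shaped silhouette.
horse_like = [0.8, 0.2, 0.1]
print(nearest_concept(horse_like))
```

No text goes in on the query side here; the only "text" is the label attached to each direction, which stands in for whatever association was left over from training.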