Story Detail of id 43117686 | Liveview Hacker News

ygouzerh1 day ago | on: Helix: A vision-language-action model for generalist humanoid control

So from what I understand it actually means that they were for example never trained on a video of an apple. Maybe only on a video of bread, pineapple, chocolate.

However, as it was trained using generic text data similarly to a normal LLM, it knows how an apple is supposed to look like.

Similar than a kid that never saw a banana, but his parent described it to him.

#visit	12094873
#session	46811
#live-session	0