Hacker News
Ah yeah, thinking about it further, it's probably just using some positional embedding based on each token's index in the sequence, added in the LLM layers. For vision it needs the patch location (row and column) as well.
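
Rough sketch of the difference, purely as an illustration. It assumes fixed sinusoidal embeddings, which is my assumption; the actual model could just as well use learned or rotary ones. The function names are made up for the example and `dim` is assumed to be divisible by 4.

```python
import math


def sinusoidal(pos, dim):
    """Encode one integer position as `dim` floats (dim must be even)."""
    out = []
    for i in range(0, dim, 2):
        # Lower i gives a higher frequency, as in the usual sinusoidal scheme.
        freq = 1.0 / (10000 ** (i / dim))
        out.append(math.sin(pos * freq))
        out.append(math.cos(pos * freq))
    return out


def text_position_embedding(seq_index, dim):
    """Text tokens: a single 1D index into the sequence is enough."""
    return sinusoidal(seq_index, dim)


def patch_position_embedding(row, col, dim):
    """Image patches: encode row and column separately, then concatenate.

    Half the channels carry the row, the other half the column, so two
    patches at different grid locations get different vectors.
    """
    half = dim // 2
    return sinusoidal(row, half) + sinusoidal(col, half)


if __name__ == "__main__":
    print(text_position_embedding(3, 8))
    print(patch_position_embedding(0, 1, 8))
    print(patch_position_embedding(1, 0, 8))
```

The point is just that a text token only needs its sequence index, while a patch needs two coordinates so that (0, 1) and (1, 0) don't end up with the same embedding.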