How do you explain a horse's 2 legs becoming 4 legs when it's rotated, assuming they only drew 2 legs in the side view?
The second L in LLM stands for "language". Nothing you're describing has anything to do with language modeling.
They could be using transformers, sure. But plenty of transformer-based models are not LLMs.
They are probably looking for LGMs (Large Generative Models), a category that covers vision and multimodal models.