There is no such thing as "thing" here.
These models are trained so that the given conditions (the visual input and the text prompt) are followed by a desirable continuation (motor function over time).
The only dimension accuracy can apply to is desirability.