We know exactly why: floating-point operations aren't associative, but GPU kernels are scheduled as if they were, and that scheduling isn't deterministic, so the same sum can be evaluated in a different order from run to run. Forcing a strict, fixed order hurts performance, so they don't do it.
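A quick sketch of the non-associativity part (with arbitrary example values, not anything a real kernel computes): summing the same three numbers in two different orders already gives different results in double precision.

```python
# Floating-point addition is not associative, so a GPU that reduces the
# same terms in a different order across runs can produce different results.
a, b, c = 0.1, 0.2, 0.3  # arbitrary illustrative values

left = (a + b) + c   # one possible reduction order
right = a + (b + c)  # another order a scheduler might pick

print(left, right, left == right)
# The two orders disagree in the last bit: 0.6000000000000001 vs 0.6
```

Scale that up to millions of parallel partial sums and you get run-to-run output variation even at temperature 0.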
Cool, thanks a lot for the explanation. Makes sense.