Your “and then” is doing a lot of work there. The steps in between may or may not include some form of “learn to understand humans”, but you can’t hide them behind “and then” if the claim is that some particular thing is not on the list.
By training on human text, we are implicitly building in the weights a statistical model of what humans might write in response to arbitrary pieces of text. It turns out that such models can be made remarkably accurate.
If building an accurate internal model of something and then using it to predict that thing’s behaviour is different from gaining an understanding of that thing, we will need to pin down exactly what “understanding” means, or we are doomed to talk at cross purposes forever.