Geometrically, I imagine the process of attention like picking up a bunch of vectors and spinning and squishing them in many dimensions until you find a crack where you can see all the way through, then leveraging that crack to separate out what you want.
I doubt that's strictly accurate, but it may be close enough to suggest something: if you were doing that with a bunch of bananas, it would be much easier to find the way through if you could also bend the bunch so they were all straight.
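For anyone who wants the picture in code, here is a rough sketch of plain single-head scaled dot-product attention. It is only an illustration under assumptions: the names (`matvec`, `attention`, `W_q`, `W_k`) are made up for this example, and the matrices are hand-picked rather than learned. The linear maps play the role of the "spinning and squishing", and the softmax is the step that separates what you want from the rest.

```python
import math

def matvec(W, x):
    # Apply a linear map (rotation/scaling) to a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    # Scaled dot-product attention for a single query vector.
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out

# Three toy 2-D "token" vectors (hypothetical data).
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hand-picked maps, standing in for learned weights:
W_q = [[2.0, 0.0], [0.0, 2.0]]   # stretch ("squish")
W_k = [[0.0, -1.0], [1.0, 0.0]]  # rotate by 90 degrees ("spin")

keys = [matvec(W_k, t) for t in tokens]
query = matvec(W_q, tokens[1])

# Values are the raw tokens here, to keep the sketch short.
weights, out = attention(query, keys, tokens)
print(weights, out)
```

After the rotation, the first token's key lines up with the query, so it gets the largest weight. Note that everything here is linear apart from the softmax; "bending the bananas" would mean adding a nonlinear warp before the dot product, which this sketch does not attempt.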
It's always the trade-off between one smart, complex operation and an absolute crapload of dumb ones.
What matters is not how good it is in isolation, but how well it scales to giant datasets and supercomputers. So far, attention scales the best. It's the most "brute force"-able mechanism.
You can't make attention more specialized without making it less general, which makes LLMs worse as universal approximators.