Their 1.2B model was trained on only 10B tokens, which is less than half of the chinchilla compute optimal number. Modern overtrained 1B LLMs are trained on the order of 10T tokens (1000x more).
This is important because, from my own experience, simplifications and alternatives to standard attention can look fine in the under-trained regime but lag after over-training. This happens because attention has very little out-of-the-gate inductive bias, so it takes a lot of training for the expressiveness to really shine through.
I can't fault the authors since longer training runs cost money, but it warrants pointing out.
I'm also disappointed that they didn't report reasoning benchmark results for the Q=K-V case, since that is by far the most theoretically interesting case (in my eyes).
I agree that this isn’t proof that it scales to trillions of tokens, but this does show a scaled up experiment would be worth a shot.
I do agree that it is a datapoint, but GP's point is that this model was undertrained, so it's hard to draw the same conclusions from it that we would from other research.