Story Detail of id 48406477 | Liveview Hacker News

amemi 20 hours ago | on: Do transformers need three projections? Systematic study of QKV variants

> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

In fact, the paper discusses this very question on its second-to-last page. For the Q-K=V model, the gap shrinks as sequences get longer: across the three lengths tested (512, 1024, 2048), the degradation falls from 5.4% to 2.2% as context increases. That is only an n=3 sample, but it suggests that short sequences are unlikely to be the reason K=V performs acceptably.
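For anyone who hasn't seen the variant written out, here is roughly what "K=V" means in code. This is a minimal single-head NumPy sketch under my own assumptions, not the paper's implementation: the names `shared_kv_attention`, `w_q` and `w_kv` are made up for illustration, and there is no causal mask, multi-head split, or output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def shared_kv_attention(x, w_q, w_kv):
    """Single-head attention where keys and values share one projection (K = V).

    Hypothetical helper for illustration; the paper's exact setup may differ.
    """
    q = x @ w_q    # queries, shape (n, d_head)
    kv = x @ w_kv  # one projection reused as both keys and values
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

rng = np.random.default_rng(0)
n, d_model, d_head = 8, 16, 4  # arbitrary toy sizes
x = rng.standard_normal((n, d_model))
w_q = rng.standard_normal((d_model, d_head))
w_kv = rng.standard_normal((d_model, d_head))
out = shared_kv_attention(x, w_q, w_kv)
```

The thing to notice is that `kv` shows up twice: once to score against the queries and once as the content being mixed. That is the "merged key/value space" the parent comment is talking about.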
