ketchup32613
19 hours ago
on: Do transformers need three projections? Systematic study of QKV variants
Do you want to see scaling curves wrt data and param size? I agree that 1.2B and 10B tokens are not representative, but what scale of parameter counts and dataset sizes would be convincing?
zxexz
18 hours ago
Not to sound facetious, but perhaps enough runs at different param/token sizings to define a curve?
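
To make "enough runs to define a curve" concrete: one common reading is to fit a power law to final loss across several model sizes and check whether the fit holds. The sketch below is only an illustration. The function name `fit_power_law`, the parameter counts, and the constants are hypothetical, and the losses are synthetic values generated from a made-up power law, not results from the study under discussion. It also assumes a pure power law with no irreducible-loss term, which real runs may not follow.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of loss = a * size**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Hypothetical parameter counts for four runs (assumption, not from the thread).
param_counts = [1e8, 3e8, 1e9, 3e9]

# Synthetic losses from a made-up power law, used only to exercise the fit.
true_a, true_b = 20.0, 0.1
losses = [true_a * n ** (-true_b) for n in param_counts]

a, b = fit_power_law(param_counts, losses)
print(f"fitted a={a:.3f}, b={b:.3f}")
```

With real runs, the same fit would be repeated per QKV variant and per token budget; the useful signal is whether the fitted curves for the variants stay apart, converge, or cross as size grows.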