At 0.1 GB per full-attention layer, and given that "The model has 60 transformer layers: 45 GatedDeltaNet (linear attention) + 15 standard full attention," only the 15 full-attention layers contribute a growing KV cache, so: 15 × 0.1 GB = 1.5 GB.
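A minimal sketch of that arithmetic, assuming the 0.1 GB figure is per-layer KV cache in bf16; the sequence length, KV-head count, and head dimension below are illustrative assumptions chosen to land near 0.1 GB, not values from the model card:

  # Hypothetical parameters (assumptions, not from the model card)
  bytes_per_elem = 2          # bf16
  seq_len = 100_000           # cached context length
  num_kv_heads = 2            # grouped-query attention KV heads
  head_dim = 128

  # K and V tensors per full-attention layer
  per_layer_bytes = 2 * seq_len * num_kv_heads * head_dim * bytes_per_elem
  full_attn_layers = 15       # from the quoted architecture: 60 = 45 linear + 15 full

  total_gb = full_attn_layers * per_layer_bytes / 1e9
  print(f"per layer: {per_layer_bytes / 1e9:.2f} GB, total: {total_gb:.2f} GB")
  # per layer: 0.10 GB, total: 1.54 GB

The 45 GatedDeltaNet layers keep a fixed-size recurrent state instead of a per-token KV cache, which is why they drop out of the total.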