The model is natively quantized: it was trained that way in the first place, so this is not post-training quantization, which tends to degrade performance.
Is it completely quantized, though? I thought some dense parts were left unquantized, with most of the rest in int4?
But the Hugging Face link mentions BF16, F16, and I32?
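One possible reading of the I32 entry, offered as a guess and not as a statement about this particular checkpoint: int4 weights are often stored packed, eight 4-bit values per 32-bit integer, so a tensor listing can show I32 even when the weights themselves are int4. The sketch below only illustrates that packing idea. The function names `pack_int4` and `unpack_int4` and the low-nibble-first layout are assumptions made for the example, not anything taken from the model's files.

```python
import numpy as np

# Hypothetical helpers: illustrate packing eight signed 4-bit values
# into one int32 word. The nibble order (lowest nibble first) is an
# assumption; real checkpoint formats may use a different layout.

def pack_int4(values):
    """Pack signed 4-bit ints (range -8..7) into int32 words, eight per word."""
    v = np.asarray(values, dtype=np.int64)
    if v.size % 8 != 0:
        raise ValueError("number of values must be a multiple of 8")
    if v.min() < -8 or v.max() > 7:
        raise ValueError("values must fit in a signed 4-bit range")
    # Keep only the low 4 bits (two's complement), one row per output word.
    nibbles = (v & 0xF).astype(np.uint32).reshape(-1, 8)
    shifts = np.arange(8, dtype=np.uint32) * 4
    words = np.bitwise_or.reduce(nibbles << shifts, axis=1)
    # Reinterpret the same bits as int32, which is how a file would list them.
    return words.view(np.int32)

def unpack_int4(words):
    """Recover the signed 4-bit values from packed int32 words."""
    w = np.asarray(words, dtype=np.int32).view(np.uint32)
    shifts = np.arange(8, dtype=np.uint32) * 4
    nibbles = (w[:, None] >> shifts) & 0xF
    signed = nibbles.astype(np.int8)
    # Nibbles 8..15 represent the negative values -8..-1.
    signed[signed >= 8] -= 16
    return signed.reshape(-1)

weights = [-8, -1, 0, 1, 7, 3, -4, 2]
packed = pack_int4(weights)
restored = unpack_int4(packed)
```

Here `packed` is a single int32 element holding all eight weights, and `restored` matches the original list. Under this reading, the BF16 and F16 entries would be the tensors kept at higher precision (for example scales or unquantized layers), but that is also an assumption.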