Full distributions are a fucking pain to save; at that point you might as well just save the hidden states. There are lossy compression tricks there, though.
To the previous poster's point, soft distributions are useful: even saving just the top 10 logits gives significantly more training signal than the final token alone.
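
For anyone who wants to try the top-10 idea, here's a rough sketch of what I'd assume it looks like in NumPy. The function names (`top_k_logits`, `soft_targets`), the float16 storage, and the choice to renormalize over only the kept logits are my own assumptions for illustration, not something from a specific library or paper.

```python
import numpy as np

def top_k_logits(logits, k=10):
    """Keep the k largest logits per position.

    logits: (seq_len, vocab_size) float array.
    Returns (indices, values), each of shape (seq_len, k), sorted descending.
    """
    # argpartition finds the top-k without sorting the whole vocabulary.
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]
    vals = np.take_along_axis(logits, idx, axis=-1)
    # Sort only the k survivors, largest first.
    order = np.argsort(-vals, axis=-1)
    idx = np.take_along_axis(idx, order, axis=-1)
    vals = np.take_along_axis(vals, order, axis=-1)
    # Assumption: int32 ids and float16 values are precise enough for storage.
    return idx.astype(np.int32), vals.astype(np.float16)

def soft_targets(idx, vals, vocab_size):
    """Rebuild an approximate distribution from the saved top-k.

    Assumption: softmax over the kept logits only, zero mass everywhere else.
    """
    v = vals.astype(np.float32)
    v = v - v.max(axis=-1, keepdims=True)  # numerical stability
    p = np.exp(v)
    p /= p.sum(axis=-1, keepdims=True)
    out = np.zeros((idx.shape[0], vocab_size), dtype=np.float32)
    np.put_along_axis(out, idx.astype(np.int64), p, axis=-1)
    return out

# Toy usage with random logits standing in for real model output.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 1000)).astype(np.float32)
idx, vals = top_k_logits(logits, k=10)
probs = soft_targets(idx, vals, vocab_size=1000)
```

With this layout each position costs 10 × (4 + 2) = 60 bytes, versus `vocab_size` × 4 bytes for the full float32 row. The tail mass is thrown away, so this is lossy by construction.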