I've been waiting for them to publish the 4B model for a while, so I'm glad to have something similar to play with. I think I trust the Ranke-4B process a bit more, but that's partly because this report is light on details. And actually releasing a model counts for a whole lot.
One thing I think will be a challenge for these models is establishing any sort of definite temporal setting. Unless the conversation pins down a clear timeframe, the model may pick a more or less arbitrary one, or worse, average over many different time periods. In modern LLMs this problem seems to be mostly handled by post-training (plus the fact that most of their training data comes from a much narrower time range), but that is probably harder to accomplish while trying to avoid introducing bias in the SFT and RL process.