Not affiliated with Sesame, but this is exactly what the realtime models are trying to solve. NVIDIA’s PersonaPlex release [0], for example, uses a duplex architecture. It’s based on Moshi [1], which addresses this problem by letting the model listen and generate audio at the same time.

[0] https://github.com/NVIDIA/personaplex

[1] https://arxiv.org/abs/2410.00037
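To make the duplex idea concrete, here's a minimal sketch (not the actual Moshi/PersonaPlex API; `DuplexModel` and `step` are illustrative names): instead of waiting for the user to finish a turn, the model consumes one incoming frame and emits one outgoing frame on every tick of the same clock, so listening and speaking overlap.

```python
class DuplexModel:
    """Toy stand-in for a full-duplex speech model: the real thing runs
    parallel audio token streams, but the timing pattern is the same."""

    def __init__(self):
        self.state = 0

    def step(self, user_frame):
        # Consume one incoming frame and produce one outgoing frame in
        # the SAME timestep -- no explicit turn-taking boundary.
        self.state += user_frame
        return self.state  # the model's output frame for this tick


def run_duplex(user_stream):
    model = DuplexModel()
    # Input and output advance together, frame by frame.
    return [model.step(f) for f in user_stream]


print(run_duplex([1, 2, 3]))  # one output frame per input frame
```

Contrast this with a half-duplex pipeline (ASR, then LLM, then TTS), where the model only starts generating after the user's turn ends, which is where most of the perceived latency comes from.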