In Short
- Amazon researchers develop BASE TTS, a large text-to-speech model with emergent abilities.
- The largest version has 980 million parameters, and performance on complex test sentences improved as the model was scaled up.
- BASE TTS is designed to be lightweight and streamable, even over low-bandwidth connections.
Summary of Amazon’s New Text-to-Speech Model
Amazon’s research team has introduced a new text-to-speech model, BASE TTS. At 980 million parameters, it is described as the largest model of its kind, and it was trained on 100,000 hours of public domain speech data. The researchers observed that as the model grew, it became notably better at handling complex sentences that typically challenge text-to-speech systems.
The medium-sized version of BASE TTS, with 400 million parameters, already showed a leap in versatility and robustness when tested on sentences with intricate lexical, syntactic, and paralinguistic elements. Although it was not perfect, it outperformed existing models in areas such as stress, intonation, and pronunciation. Scaling up to the 980-million-parameter model, however, did not yield additional emergent abilities beyond those of the 400-million-parameter version.
BASE TTS is notable for its design as well as its capabilities. It is engineered to be lightweight and streamable, with emotional and prosodic data packaged separately. This makes it well suited to transmitting natural-sounding spoken audio over low-bandwidth connections, which could broaden its applicability.
The research team sees this development as a positive indicator for the future of conversational AI and plans to continue exploring the optimal model size for emergent abilities.
For more detailed insights, read the full BASE TTS paper on arXiv.