The Rise of AI Voice: Personalized Interaction Reaching a Critical Mass
This article, published by MiniMax, introduces Speech 02, the company's high-quality Text-to-Speech (TTS) model. Speech 02 is built on an autoregressive (AR) Transformer architecture, and its core innovation is intrinsic zero-shot voice cloning: a learnable speaker encoder lets the model produce highly realistic and stable cloned voices from a single reference audio clip. The model supports 32 languages and, according to MiniMax, can freely combine languages, accents, and voice characteristics. Citing evaluation data from Artificial Analysis and Hugging Face, the article reports that Speech 02 outperforms models from OpenAI and ElevenLabs in perceived audio quality and multilingual performance. It also notes that the model uses Flow-VAE and Flow Matching to improve sound quality, outlines potential applications such as content creation and the dissemination of under-represented languages, and closes with links to the technical report and to a page where readers can try the product.
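For readers unfamiliar with Flow Matching, the following is a minimal sketch of the idea in its common linear-path form, not MiniMax's implementation. The summary above only names the technique, so everything here is an assumption made for illustration: the function names, the three-value stand-in for an audio latent, and the "oracle" velocity that replaces a trained network are all hypothetical. During training, a network is regressed onto the velocity that carries a noise sample to a data sample along a straight line; at inference, that velocity field is integrated from noise to produce a latent.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the velocity network on the linear path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


random.seed(0)

# Hypothetical stand-in for one frame of an audio latent (illustrative values only).
latent = [0.5, -1.0, 2.0]
noise = [random.gauss(0.0, 1.0) for _ in latent]

# A trained network would predict this velocity from (x_t, t) and conditioning;
# here an oracle that returns the exact target is used so the sketch stays self-contained.
velocity = target_velocity(noise, latent)
sample = euler_sample(noise, lambda x, t: velocity, steps=10)
```

Because the oracle velocity is constant along the path, the Euler integration lands on `latent` up to floating-point error. A real model replaces the oracle with a learned network conditioned on text and speaker information, so the integration reaches a plausible latent rather than a known one.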