SF-Speech: Straightened Flow for Zero-Shot Voice Clone
Abstract
Recently, neural ordinary differential equation (ODE) models trained with flow matching have achieved impressive performance on the zero-shot voice cloning task. Nevertheless, postulating standard Gaussian noise as the initial distribution of the ODE gives rise to numerous intersections among the targets fitted by flow matching, which complicates model training and increases the curvature of the learned generation trajectories. These curved trajectories limit the ability of ODE models to generate desirable samples in a few steps. This paper proposes SF-Speech, a novel voice cloning model based on ODEs and in-context learning. Unlike previous works, SF-Speech adopts a lightweight multi-stage module to generate a more deterministic initial distribution for the ODE. Without introducing any additional loss function, we effectively straighten the curved reverse trajectories of the ODE model by training it jointly with the proposed module. Experimental results on datasets of various scales show that SF-Speech outperforms state-of-the-art zero-shot TTS methods while requiring only a quarter of the solver steps, yielding a generation speed approximately 3.7 times that of Voicebox and E2 TTS. Audio samples are available on this demo page.
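SF-Speech's own module is not detailed on this page, so the following is only a generic NumPy sketch (all names hypothetical) of the flow-matching setup the abstract builds on: a linear probability path with a constant target velocity, and an Euler ODE solver whose number of steps is the NFE. It illustrates the abstract's point that trajectory curvature, not step count, is what limits few-step generation: under a perfectly straight (constant) velocity field, even two Euler steps land exactly on the target.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    # Linear interpolation path x_t = (1 - t) * x0 + t * x1,
    # whose target velocity v = x1 - x0 is constant (a straight trajectory).
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return xt, v_target

def euler_sample(v_field, x0, nfe):
    # Integrate dx/dt = v(x, t) from t = 0 to t = 1 with `nfe` Euler steps;
    # `nfe` is the number of function evaluations (solver steps).
    x, dt = x0.copy(), 1.0 / nfe
    for i in range(nfe):
        x = x + dt * v_field(x, i * dt)
    return x

# Toy check: with a straight velocity field, NFE = 2 already reaches x1 exactly.
x0 = np.zeros(4)
x1 = np.array([1.0, -2.0, 0.5, 3.0])
straight_field = lambda x, t: x1 - x0
out = euler_sample(straight_field, x0, nfe=2)
```

A curved (time-dependent) field would instead accumulate Euler discretization error at low NFE, which is why straightening the learned trajectories permits the 4x reduction in solver steps reported above.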
Zero-Shot Voice Clone
Test Case A (ODE Models under Different NFEs)
F5-TTS, E2 TTS, Voicebox, and SF-Speech, all trained on the Emilia dataset.
Model
NFE = 2
NFE = 4
NFE = 32
F5-TTS
E2 TTS
Voicebox
SF-Speech
Test Case B (English Zero-Shot TTS Results with Large-Scale Training Dataset)
Voicebox, E2 TTS, and F5-TTS perform inference with an NFE of 32, while SF-Speech uses an NFE of 8.
Reference
Text
Voicebox
E2 TTS
F5-TTS
SF-Speech
1.
But in such a case, Miss Milner's election of a husband shall not direct mine.
2.
Again he searched his own thoughts; nor ineffectually as before.
3.
Yea, his honourable worship is within. But he hath a godly minister or two with him, and likewise a leech.
4.
A story! cried the children, drawing a little fat man towards the Tree.
5.
You gave me double- five, I want double- nine.... Hallo, is that you, Horatio? Hamlet speaking.
Test Case C (Chinese Zero-Shot TTS Results with Large-Scale Training Dataset)
Voicebox, E2 TTS, and F5-TTS perform inference with an NFE of 32, while SF-Speech uses an NFE of 8.
Reference
Text
Voicebox
E2 TTS
F5-TTS
SF-Speech
1.
谁能告诉我,蔡依林的新歌到底被谁盗了。(Who can tell me who actually stole Jolin Tsai's new song?)
2.
但是我妈他们,住在我家北边的宾馆里。(But my mom and the others are staying at a hotel north of my house.)
3.
内蒙古呼伦贝尔市阿荣旗的天气。(The weather in Arun Banner, Hulunbuir, Inner Mongolia.)
4.
你把发票的名称发到我手机上。(Send the invoice title to my phone.)
5.
除了自命不凡和爱是一种幸福。(Aside from being pretentious, and love is a kind of happiness.)
Test Case D (Chinese Zero-Shot TTS Results with Small-scale Training Dataset)
Evaluation of YourTTS, Voicebox-S, Voicebox, VALL-E, SimpleSpeech2, and SF-Speech-S. SimpleSpeech2 is trained on 4K hours of Chinese data from WenetSpeech, while the others are trained on MagicData. Voicebox-S, Voicebox, and SF-Speech-S use the same duration model.
Reference
Text
YourTTS
Voicebox-S
Voicebox
SF-Speech-S
SimpleSpeech2
VALL-E
1.
谁能告诉我,蔡依林的新歌到底被谁盗了。(Who can tell me who actually stole Jolin Tsai's new song?)
2.
但是我妈他们,住在我家北边的宾馆里。(But my mom and the others are staying at a hotel north of my house.)
3.
内蒙古呼伦贝尔市阿荣旗的天气。(The weather in Arun Banner, Hulunbuir, Inner Mongolia.)
4.
你把发票的名称发到我手机上。(Send the invoice title to my phone.)
5.
除了自命不凡和爱是一种幸福。(Aside from being pretentious, and love is a kind of happiness.)
Test Case E (Speech Reconstruction Results)
The evaluation includes Voicebox-S, Voicebox, and SF-Speech-S. We mask the middle 70% of each audio clip and reconstruct the masked part.