F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
Demo; Paper; Checkpoints.
F5-TTS is a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT). It requires no complex designs such as a duration model, text encoder, or phoneme alignment: the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed for speech generation, a scheme originally shown feasible by E2 TTS.
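As a minimal illustration of that padding scheme (hypothetical tensor names and filler id, not the repository's actual code):

import torch

FILLER_ID = 0                             # assumed id reserved for the filler token
text_ids = torch.tensor([5, 12, 9, 33])   # tokenized input text, length 4
mel_len = 10                              # number of mel frames of the input speech

# pad the text with filler tokens until it matches the mel length,
# so no duration model or phoneme alignment is needed
pad = torch.full((mel_len - len(text_ids),), FILLER_ID)
padded_text = torch.cat([text_ids, pad])  # length now equals mel_len
print(padded_text)  # tensor([ 5, 12,  9, 33,  0,  0,  0,  0,  0,  0])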
Installation
Clone this repository.
git clone git@github.com:SWivid/F5-TTS.git
cd F5-TTS
Install packages.
pip install -r requirements.txt
Prepare Dataset
We provide data processing scripts for the Emilia and WenetSpeech4TTS datasets; you only need to update the data paths in the scripts.
# prepare a custom dataset as needed
# download the corresponding dataset first, then fill in its path in the script
# Prepare the Emilia dataset
python scripts/prepare_emilia.py
# Prepare the Wenetspeech4TTS dataset
python scripts/prepare_wenetspeech4tts.py
Training
Once your datasets are prepared, you can start the training process. Here’s how to set it up:
# set up the accelerate config, e.g. multi-GPU DDP with fp16
# the config will be saved to ~/.cache/huggingface/accelerate/default_config.yaml
accelerate config
accelerate launch test_train.py
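If you prefer to skip the interactive prompt, the same settings can be passed directly as flags to accelerate launch (standard Accelerate CLI options; adjust the process count to your hardware):

# non-interactive alternative, e.g. 2-GPU DDP with fp16
accelerate launch --multi_gpu --num_processes 2 --mixed_precision fp16 test_train.py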
Inference
To perform inference with the pretrained model, download the model checkpoints from F5-TTS Pretrained Model.
Single Inference
You can test single inference with the following command. Before running it, modify the config as needed.
# modify the config as needed, e.g.
# fix_duration (total length of prompt + generated speech; currently supports up to 30s)
# nfe_step (more steps take longer but solve the inference ODE more precisely)
# ode_method (switch to 'midpoint' for better results with a small nfe_step;
#   'midpoint' is a 2nd-order ODE solver, so it is slower per step than the 1st-order 'euler')
python test_infer_single.py
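To see the nfe_step / ode_method trade-off in isolation, here is a small self-contained example using torchdiffeq (the ODE solver this repo relies on); the ODE below is a toy stand-in, not the flow-matching model:

import torch
from torchdiffeq import odeint

# toy ODE dy/dt = -y, exact solution y(1) = exp(-1) ~= 0.3679
func = lambda t, y: -y
y0 = torch.tensor([1.0])

for method, nfe_step in [("euler", 8), ("euler", 32), ("midpoint", 8)]:
    t = torch.linspace(0.0, 1.0, nfe_step + 1)  # fixed-step time grid
    y1 = odeint(func, y0, t, method=method)[-1]
    print(method, nfe_step, float(y1))

# 2nd-order 'midpoint' is more accurate than 'euler' at the same step count,
# but costs two function evaluations per step, hence slower per step.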
Speech Edit
To test speech editing capabilities, use the following command.
python test_infer_single_edit.py
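Conceptually, the edit masks the frames to be changed and lets the model infill them conditioned on the surrounding audio and the full target text. A hypothetical sketch of specifying such an edit region (illustrative names only; see test_infer_single_edit.py for the real configuration):

import torch

mel = torch.randn(1, 500, 100)          # (batch, frames, mel bins), placeholder audio
edit_start, edit_end = 120, 240         # frame span to re-synthesize

mask = torch.ones(1, 500, dtype=torch.bool)
mask[:, edit_start:edit_end] = False    # False = frames the model must infill

cond = mel * mask.unsqueeze(-1)         # surrounding audio kept as condition
# the model then denoises only the masked span, conditioned on cond
# and the full edited text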
Evaluation
Prepare Test Datasets
- Seed-TTS test set: Download from seed-tts-eval.
- LibriSpeech test clean: Download from OpenSLR.
- Unzip the downloaded datasets and place them in the data/ directory.
- Update the test-clean data path in test_infer_batch.py (our LibriSpeech-PC 4-10s subset is already under data/ in this repo).
Download Evaluation Model Checkpoints
- Chinese ASR Model: Paraformer-zh
- English ASR Model: Faster-Whisper
- WavLM Model: Download from Google Drive.
Make sure to update the checkpoint paths in test_infer_batch.py.
Batch inference
To run batch inference for evaluations, execute the following commands:
# batch inference for evaluations
accelerate config # if not set before
bash test_infer_batch.sh
Installation Notes For Faster-Whisper with CUDA 11:
pip install --force-reinstall ctranslate2==3.24.0
pip install faster-whisper==0.10.1 # recommended
This will help avoid ASR failures, such as abnormal repetitions in output.
Evaluation
Run the following commands to evaluate the model's performance:
# Evaluation for Seed-TTS test set
python scripts/eval_seedtts_testset.py
# Evaluation for LibriSpeech-PC test-clean (cross-sentence)
python scripts/eval_librispeech_test_clean.py
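For reference, the WER side of the evaluation amounts to transcribing the generated audio with the ASR model and comparing it against the target text. A minimal sketch using the faster-whisper API (the file path, reference text, and edit-distance helper are illustrative, not taken from the evaluation scripts):

import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")
segments, _ = model.transcribe("generated.wav", language="en")  # illustrative path
hyp = " ".join(seg.text.strip() for seg in segments).lower().split()
ref = "the reference transcript goes here".lower().split()      # illustrative text

# word error rate via standard Levenshtein edit distance
d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
d[:, 0] = np.arange(len(ref) + 1)
d[0, :] = np.arange(len(hyp) + 1)
for i in range(1, len(ref) + 1):
    for j in range(1, len(hyp) + 1):
        cost = ref[i - 1] != hyp[j - 1]
        d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1, d[i - 1, j - 1] + cost)
print("WER:", d[-1, -1] / len(ref))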
Acknowledgements
- E2-TTS brilliant work, simple and effective
- Emilia, WenetSpeech4TTS valuable datasets
- lucidrains initial CFM structure with also bfs18 for discussion
- SD3 & Huggingface diffusers DiT and MMDiT code structure
- FunASR, faster-whisper & UniSpeech for evaluation tools
- torchdiffeq as ODE solver, Vocos as vocoder
- ctc-forced-aligner for speech edit test
Citation
@misc{chen2024f5ttsfairytalerfakesfluent,
title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching},
author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
year={2024},
eprint={2410.06885},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2410.06885},
}
LICENSE
Our code is released under the MIT License.