update README.md for infer & train

2025-12-12 15:50:07 -08:00 · 2025-03-20 10:03:54 +08:00
parent a1e88c2a9e
commit 79302b694a
2 changed files with 11 additions and 8 deletions
--- a/src/f5_tts/infer/README.md
+++ b/src/f5_tts/infer/README.md
@@ -4,16 +4,17 @@ The pretrained model checkpoints can be reached at [🤗 Hugging Face](https://h

 **More checkpoints with whole community efforts can be found in [SHARED.md](SHARED.md), supporting more languages.**

-Currently support **30s for a single** generation, which is the **total length** including both prompt and output audio. However, you can provide `infer_cli` and `infer_gradio` with longer text, will automatically do chunk generation. Long reference audio will be **clip short to ~15s**.
+Currently supports up to **30s for a single** generation, which is the **total length** (the same applies when `fix_duration` is set) including both prompt and output audio. However, `infer_cli` and `infer_gradio` will automatically generate in chunks for longer text input. Long reference audio will be **clipped to ~12s**.

 To avoid possible inference failures, make sure you have read through the following instructions.

- Use reference audio <15s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of word, leading to suboptimal generation.
- Uppercased letters will be uttered letter by letter, so use lowercased letters for normal words. 
- Add some spaces (blank: " ") or punctuations (e.g. "," ".") to explicitly introduce some pauses.
- Preprocess numbers to Chinese letters if you want to have them read in Chinese, otherwise in English.
- If the generation output is blank (pure silence), check for ffmpeg installation (various tutorials online, blogs, videos, etc.).
- Try turn off use_ema if using an early-stage finetuned checkpoint (which goes just few updates).
+- Use reference audio <12s and leave some silence (e.g. 1s) at the end. Otherwise there is a risk of truncating in the middle of a word, leading to suboptimal generation.
+- **Uppercase** letters (best written in a form like K.F.C.) will be uttered letter by letter, so use lowercase letters for common words.
+- Add some spaces (blank: " ") or punctuation (e.g. "," ".") to explicitly introduce **pauses**.
+- If English punctuation marks the end of a sentence, make sure there is a space " " after it. Otherwise it is not regarded as a sentence end when the text is chunked.
+- Preprocess **numbers** to Chinese characters if you want them read in Chinese; otherwise they are read in English.
+- If the generation output is blank (pure silence), check that **ffmpeg** is installed.
+- Try turning off **use_ema** if using an early-stage finetuned checkpoint (one that has gone through only a few updates).

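The punctuation tip above can be applied to input text before it is passed to inference. The following is a minimal sketch, not part of `f5_tts`: the helper name `add_space_after_sentence_end` is hypothetical, and the rule it uses (a lowercase letter, then `.`, `!` or `?`, then an uppercase letter) is an assumed heuristic chosen so that forms like K.F.C. and decimals are left untouched.

```python
import re

# Hypothetical helper (not part of f5_tts). Heuristic assumption: a sentence end
# that is missing its trailing space looks like "<lowercase><.!?><Uppercase>".
_MISSING_SPACE = re.compile(r"(?<=[a-z])([.!?])(?=[A-Z])")


def add_space_after_sentence_end(text: str) -> str:
    """Insert a space after sentence-ending punctuation that lacks one."""
    return _MISSING_SPACE.sub(r"\1 ", text)


if __name__ == "__main__":
    print(add_space_after_sentence_end("It works.Try again!Now"))
    print(add_space_after_sentence_end("K.F.C. is open"))
```

The heuristic is deliberately conservative: it will miss cases such as a sentence that starts with a digit, so text with unusual formatting may still need manual review.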

 ## Gradio App