Coqui Cantonese xTTS v2 training

By admin, 20 May 2025

2025-5-20

Continuing development from this page: https://cto.eguidedog.net/node/1391

Testing the 150K-step checkpoint failed to generate usable audio. It may be necessary to train from scratch, so I have stopped training for now; I'll go through all the parameter details before starting again.

Reviewing the parameters, I found that sample_rate was wrong: it should be changed from 22050 to 16000.

Following AI advice, changed mel_fmax from 7600 to 8500 to better cover Cantonese tonal variation.

Following AI advice, enabled the Noam learning-rate schedule: the learning rate warms up in the initial phase and decays afterwards.
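For reference, a minimal sketch of what these three changes might look like in a Coqui recipe script, assuming the dataclass-style configs of recent Coqui TTS (where the Noam warmup-then-decay schedule is selected via lr_scheduler="NoamLR"); the warmup_steps value is illustrative, not the one used in this run:

    # Sketch only: field names from Coqui's BaseAudioConfig / Tacotron2Config.
    from TTS.config.shared_configs import BaseAudioConfig
    from TTS.tts.configs.tacotron2_config import Tacotron2Config

    audio_config = BaseAudioConfig(
        sample_rate=16000,  # MDCC recordings are 16 kHz, not the 22050 default
        mel_fmax=8500.0,    # raised from 7600 for the Cantonese tonal range
    )

    config = Tacotron2Config(
        audio=audio_config,
        lr_scheduler="NoamLR",                       # warm up, then decay
        lr_scheduler_params={"warmup_steps": 4000},  # illustrative value
    )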

A few open questions to investigate:

Since MDCC is multi-speaker audio, I need to look closely at how Coqui should be configured for that case.

Look into whether style learning, i.e. Global Style Tokens (GST), can be disabled to reduce computation.
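Both questions appear to map onto plain config switches in stock Coqui Tacotron models; a sketch under that assumption, not yet verified on this fork:

    # Sketch only: assumed Tacotron2Config fields for multi-speaker
    # training and for disabling GST.
    from TTS.tts.configs.tacotron2_config import Tacotron2Config

    config = Tacotron2Config(
        use_speaker_embedding=True,  # learn one embedding per MDCC speaker
        use_gst=False,               # skip Global Style Tokens to save compute
    )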

New init steps:

  1. clone code: git clone https://github.com/hgneng/TTS.git
  2. change repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
  3. install deps: cd TTS && pip install -e .[all,dev]
  4. patch 1: cp /gemini/code/TTS/patch/tensorboard_logger.py /root/miniconda3/lib/python3.11/site-packages/trainer/logging/
  5. patch 2: cp /gemini/code/TTS/patch/summary.py /root/miniconda3/lib/python3.11/site-packages/torch/utils/tensorboard/
  6. patch 3: cp /gemini/code/TTS/patch/trainer.py /root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py
  7. run training: cd /gemini/code/TTS && time bash recipes/mdcc/tacotron2-DDC/run.sh 2>&1 | tee out

After training finishes, link the latest checkpoint.pth to best_model.pth; otherwise the next run cannot resume from the previous progress.

Copy the event files to /gemini/output/ to see the training curves in TensorBoard. Do not create symbolic links for them: there seems to be a bug that deletes the directory and aborts training.
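A minimal sketch of both housekeeping steps; the run directory name is a hypothetical placeholder for the timestamped folder under run/training/:

    # Sketch: resume-friendly checkpoint link plus TensorBoard event copy.
    import shutil
    from pathlib import Path

    RUN_DIR = Path("run/training/SOME_RUN_FOLDER")  # hypothetical name
    latest = max(RUN_DIR.glob("checkpoint_*.pth"),
                 key=lambda p: p.stat().st_mtime)
    best = RUN_DIR / "best_model.pth"
    if best.exists() or best.is_symlink():
        best.unlink()
    best.symlink_to(latest.name)  # link, so the next run resumes from here

    for ev in RUN_DIR.glob("events.out.tfevents.*"):
        shutil.copy(ev, "/gemini/output/")  # copy, never symlink (see note)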

Model test command:

~/code/hgneng/TTS$ TTS/bin/synthesize.py --text "ngo5 wui2 syut3 jyut6 jyu5" --config_path recipes/mdcc/tacotron2-DDC/tacotron2-DDC.json --model_path recipes/mdcc/tacotron2-DDC/model_10000_411.pth --out_path ./demo.wav

2025-5-22

The XTTS v2 model seems to be the easier choice for a beginner, and Ghost has already trained a Cantonese model with it successfully. Plan: switch from Tacotron 2 to XTTS v2.

kokoro, a model released on December 25, 2024, was trained on less than 100 hours of audio; training took about 500 hours on an A100 80G GPU, roughly $400 in cost. With XTTS v2, the cost might be only about 10 hours on a 4090 24G GPU. Extrapolating to the 8G GPU I'm renting, training might finish in about 2 days.

With the default model, tts uses hifigan_generator; generating "this is a demo" takes about 13 CPU-seconds. I'm not sure how much of that remains once initialization is excluded (one way to measure it is sketched after the timing output below).

time tts --text "this is a demo"

> Processing time: 0.7871761322021484
> Real-time factor: 0.45792617441582345
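To separate model loading from synthesis time, the Python API can be timed directly; a sketch assuming the CLI's default model name (tts_models/en/ljspeech/tacotron2-DDC, which pairs with a HiFi-GAN vocoder):

    # Sketch: time synthesis alone, excluding model initialization.
    import time
    from TTS.api import TTS

    tts = TTS("tts_models/en/ljspeech/tacotron2-DDC")  # init cost paid here
    start = time.time()
    tts.tts_to_file(text="this is a demo", file_path="demo.wav")
    print(f"synthesis only: {time.time() - start:.2f}s")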

xtts_v2, by contrast, takes 5 minutes of CPU time, so the performance gap is significant. Judged by processing time alone, it is roughly a 4x gap.

time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "this is a demo" --speaker_idx "Claribel Dervla" --language_idx "en"

> Processing time: 2.966963529586792
> Real-time factor: 1.9062221977677378

Mandarin synthesis quality is quite good.

time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "我挥一挥衣袖,不带走一片云彩" --speaker_idx "Claribel Dervla" --language_idx "zh-cn" 

 > Processing time: 7.238363027572632
> Real-time factor: 2.274884617416997

2025-5-23

Tried cloning Andy Lau (刘德华)'s voice. It sounds somewhat similar, but there is a lot of noise, probably related to the quality of my input audio.

time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "大家好,我是AI生成的刘德华,我暂时不会唱歌。" --speaker_wav andy.wav --language_idx "zh-cn"

2025-5-27

Training has started. The process appears to take a batch of 256 samples and train it for 20,000+ steps × 1000 epochs before moving on to the next batch. So far a best model was produced at around step 40,000.

(base) root@gjob-dev-582417481398870016-taskrole1-0:/gemini/code/TTS/recipes/mdcc/xtts_v2# time python train_cantonese.py 2>&1 | tee out
 > Training Environment:
 | > Backend: Torch
 | > Mixed precision: False
 | > Precision: float32
 | > Current device: 0
 | > Num. of GPUs: 1
 | > Num. of CPUs: 96
 | > Num. of Torch Threads: 1
 | > Torch seed: 1
 | > Torch CUDNN: True
 | > Torch CUDNN deterministic: False
 | > Torch CUDNN benchmark: False
 | > Torch TF32 MatMul: False
 > Start Tensorboard: tensorboard --logdir=/gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8

 > Model has 518442047 parameters

 > EPOCH: 0/1000
 --> /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8

 > EVALUATION 

/gemini/code/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py:277: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
  dvae_wav = torchaudio.functional.resample(

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.04096300427506609 (+0)
     | > avg_loss_text_ce: 0.04030846954300636 (+0)
     | > avg_loss_mel_ce: 6.010201570464342 (+0)
     | > avg_loss: 6.050510057588903 (+0)


 > EPOCH: 1/1000
 --> /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8

 > TRAINING (2025-05-27 12:38:21) 
>> DVAE weights restored from: /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/XTTS_v2.0_original_model_files/dvae.pth
mdcc-dataset/text_path does not exist.
 | > Found 65120 files in /gemini/data-2/dataset
 > Filtering invalid eval samples!!
 > Total eval samples after filtering: 247
 | > Synthesizing test sentences.
 > Sampling by language: dict_keys(['zh-yue'])

   --> TIME: 2025-05-27 12:38:28 -- STEP: 0/21622 -- GLOBAL_STEP: 0
     | > loss_text_ce: 0.03881431370973587  (0.03881431370973587)
     | > loss_mel_ce: 6.046611785888672  (6.046611785888672)
     | > loss: 0.07244555652141571  (0.07244555652141571)
     | > current_lr: 5e-06 
     | > step_time: 0.9248  (0.9248268604278564)
     | > loader_time: 5.8306  (5.8306238651275635)

2025-5-28

All the numbers look good and the loss is converging. A new best model appeared around step 150K. Roughly 200K steps were completed in one day, and the full schedule seems to be about 20 million steps, i.e. about 100 days. But training can probably be stopped once it has converged and no new best model has appeared for a long time.

I realized that changing the output sample rate from 24000 to 22050 for training may have been a mistake; once this round finishes, I'll train again at 24000 and compare.
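For reference, in the stock XTTS fine-tuning recipe this knob lives in XttsAudioConfig; a sketch assuming that recipe's defaults (check train_cantonese.py for the values actually used here):

    # Sketch: sample-rate settings as they appear in the stock XTTS
    # fine-tuning recipe.
    from TTS.tts.layers.xtts.trainer.gpt_trainer import XttsAudioConfig

    audio_config = XttsAudioConfig(
        sample_rate=22050,         # input rate fed to the GPT trainer
        dvae_sample_rate=22050,    # rate expected by the pretrained DVAE
        output_sample_rate=24000,  # keep the decoder's native 24 kHz output
    )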

kokoro uses 80 million parameters, while xTTS uses 500 million. Glow should be the fastest model, with only 200 thousand parameters. I plan to train with Glow later and see how fast it is.

2025-5-29

Downloaded a model to test, and it errors out:

Inference...
Traceback (most recent call last):
  File "/home/hgneng/code/hgneng/TTS/recipes/mdcc/xtts_v2/test.py", line 18, in <module>
    out = model.inference(
          ^^^^^^^^^^^^^^^^
  File "/opt/miniconda3/envs/coqui/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/hgneng/code/hgneng/TTS/TTS/tts/models/xtts.py", line 534, in inference
    text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/hgneng/code/hgneng/TTS/TTS/tts/layers/xtts/tokenizer.py", line 653, in encode
    return self.tokenizer.encode(txt).ids
           ^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'encode'

2025-5-30

The tokenizer problem requires adding logic in TTS/tts/layers/xtts/tokenizer.py to support zh-yue; in addition, the code in xtts.py never reads the config's vocab_file, so the tokenizer stays None. Current test result: the output is intelligible but incomplete; a 13-character sentence lost its last 3 characters. The voice is a bit hoarse (effectively noise). I'll try training for 4 more days on top of the current checkpoint and see how it sounds. The next time an important change forces a retrain from scratch, I'll switch to 24000 Hz.
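A sketch of the test-side loading that avoids the None tokenizer, passing vocab_path explicitly to load_checkpoint; all paths are hypothetical placeholders, and the actual script is recipes/mdcc/xtts_v2/test.py:

    # Sketch: load the fine-tuned checkpoint with an explicit vocab_path so
    # the tokenizer is constructed even if the config's vocab_file is unread.
    import torch
    import torchaudio
    from TTS.tts.configs.xtts_config import XttsConfig
    from TTS.tts.models.xtts import Xtts

    config = XttsConfig()
    config.load_json("run/training/config.json")       # hypothetical path
    model = Xtts.init_from_config(config)
    model.load_checkpoint(
        config,
        checkpoint_path="run/training/best_model.pth", # hypothetical path
        vocab_path="run/training/vocab.json",          # hand tokenizer its vocab
        eval=True,
    )

    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=["reference.wav"]  # any clean sample of the target voice
    )
    out = model.inference(
        "ngo5 wui2 syut3 jyut6 jyu5", "zh-yue",
        gpt_cond_latent, speaker_embedding,
    )
    torchaudio.save("demo.wav", torch.tensor(out["wav"]).unsqueeze(0), 24000)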

2025-6-3

Training reached about 1.45 million steps, with a best model produced at 1.2 million steps; it looks converged and unlikely to yield a better model. The result is actually worse than the 330K-step model: the audio quality is about the same, but it truncates the sentence even earlier.

New init steps:

  1. clone code: git clone https://github.com/hgneng/TTS.git
  2. change repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
  3. install deps: cd TTS && pip install -e .[all,dev]
  4. run training: cd /gemini/code/TTS/recipes/mdcc/xtts_v2 && time python train_cantonese.py 2>&1 | tee out

After training finishes, link the latest checkpoint.pth to best_model.pth; otherwise the next run cannot resume from the previous progress.

Copy the event files to /gemini/output/ to see the training curves in TensorBoard. Do not create symbolic links for them: there seems to be a bug that deletes the directory and aborts training.

Model test command:

~/code/hgneng/TTS/recipes/mdcc/xtts_v2$ python test.py
