2025-5-20
Continuing development from this page: https://cto.eguidedog.net/node/1391
The 150K-step checkpoint failed to generate audio in testing. It probably needs retraining from scratch, so I stopped the run; I will go through every parameter in detail before restarting.
While reviewing the parameters I found sample_rate was wrong: it should be changed from 22050 to 16000.
Per AI advice, changed mel_fmax from 7600 to 8500 to better capture the tonal variation of Cantonese.
Per AI advice, set norm_schedule to true: an adaptive learning rate schedule that warms up at the start and decays afterwards.
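The warmup-then-decay behaviour can be sketched as a Noam-style schedule. Whether norm_schedule maps to exactly this formula in Coqui is an assumption; the shape (roughly linear warmup, then ~1/sqrt(step) decay) is the point.

```python
def noam_lr(step: int, base_lr: float = 1e-3, warmup_steps: int = 4000) -> float:
    """Noam-style learning rate: rises during warmup, peaks at
    warmup_steps (returning base_lr), then decays as step**-0.5."""
    step = max(step, 1)
    scale = warmup_steps**0.5 * min(step**-0.5, step * warmup_steps**-1.5)
    return base_lr * scale
```

The peak lands exactly at `warmup_steps`; before it the `step * warmup_steps**-1.5` branch wins, after it the `step**-0.5` branch takes over.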
Open questions to investigate:
Since MDCC is a multi-speaker corpus, look closely at how Coqui should be configured for that case.
Find out whether style learning, i.e. the Global Style Token (GST) module, can be disabled to reduce computation.
New init steps:
- clone code: git clone https://github.com/hgneng/TTS.git
- change repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- install deps: cd TTS && pip install -e .[all,dev]
- patch 1: cp /gemini/code/TTS/patch/tensorboard_logger.py /root/miniconda3/lib/python3.11/site-packages/trainer/logging/
- patch 2: cp /gemini/code/TTS/patch/summary.py /root/miniconda3/lib/python3.11/site-packages/torch/utils/tensorboard/
- patch 3: cp /gemini/code/TTS/patch/trainer.py /root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py
- run training: cd /gemini/code/TTS && time bash recipes/mdcc/tacotron2-DDC/run.sh 2>&1 | tee out
After training, link the latest checkpoint.pth to best_model.pth so the next run resumes from where this one left off.
Copy the event files to /gemini/output/ to view training trends in TensorBoard. Do not create symlinks: there seems to be a bug that deletes the directory and aborts training.
Model test command:
~/code/hgneng/TTS$ TTS/bin/synthesize.py --text "ngo5 wui2 syut3 jyut6 jyu5" --config_path recipes/mdcc/tacotron2-DDC/tacotron2-DDC.json --model_path recipes/mdcc/tacotron2-DDC/model_10000_411.pth --out_path ./demo.wav
2025-5-22
The XTTS v2 model seems to be the easier choice for a beginner; Ghost has already trained a Cantonese model with it successfully. Planning to switch from Tacotron 2 to XTTS v2.
kokoro is a model released on 2024-12-25, trained on under 100 hours of audio; training took about 500 hours on an A100 80G GPU, roughly $400 in cost. XTTS v2 may only need on the order of 10 hours on a 4090 24G GPU. Extrapolating to the 8G GPU I rent, training might finish in about 2 days.
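As back-of-envelope arithmetic, the figures above imply a GPU-hour rate and a slowdown factor for the rented card (both derived, not quoted anywhere):

```python
# Back-of-envelope check of the cost figures above.
kokoro_gpu_hours = 500                # A100 80G training time
kokoro_cost_usd = 400
implied_rate = kokoro_cost_usd / kokoro_gpu_hours   # USD per GPU-hour

xtts_4090_hours = 10                  # rough estimate for XTTS v2
my_gpu_days = 2                       # guess for the rented 8G card
implied_slowdown = my_gpu_days * 24 / xtts_4090_hours
```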
With the default hifigan_generator, tts takes about 13 CPU-seconds to synthesize "this is a demo". Not sure how much time it needs once initialization is excluded.
time tts --text "this is a demo"
> Processing time: 0.7871761322021484
> Real-time factor: 0.45792617441582345
xtts_v2, by contrast, needs 5 minutes of CPU time, so the performance gap is obvious. Judging by processing time alone, the difference is roughly 4x.
time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "this is a demo" --speaker_idx "Claribel Dervla" --language_idx "en"
> Processing time: 2.966963529586792
> Real-time factor: 1.9062221977677378
Mandarin synthesis quality is quite good.
time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "我挥一挥衣袖,不带走一片云彩" --speaker_idx "Claribel Dervla" --language_idx "zh-cn"
> Processing time: 7.238363027572632
> Real-time factor: 2.274884617416997
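The "roughly 4x" claim and the reported real-time factors can be checked directly. RTF here is taken to be processing time divided by the duration of the generated audio, which also lets us back out how much audio each run produced:

```python
# Figures copied from the "this is a demo" runs logged above.
hifigan_time, hifigan_rtf = 0.7871761322021484, 0.45792617441582345
xtts_time, xtts_rtf = 2.966963529586792, 1.9062221977677378

speed_ratio = xtts_time / hifigan_time        # ~3.77, i.e. roughly 4x
# RTF = processing_time / audio_duration, so audio lengths were about:
hifigan_audio_s = hifigan_time / hifigan_rtf  # ~1.72 s
xtts_audio_s = xtts_time / xtts_rtf           # ~1.56 s
```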
2025-5-23
Tried cloning Andy Lau's voice. The result sounds somewhat like him, but with heavy noise, possibly due to the quality of my input audio.
time tts --model_name tts_models/multilingual/multi-dataset/xtts_v2 --text "大家好,我是AI生成的刘德华,我暂时不会唱歌。" --speaker_wav andy.wav --language_idx "zh-cn"
2025-5-27
Training started. The process seems to run a batch of 256 samples through 20000+ steps × 1000 epochs before moving to the next batch. A best model has been produced at around step 40000 so far.
(base) root@gjob-dev-582417481398870016-taskrole1-0:/gemini/code/TTS/recipes/mdcc/xtts_v2# time python train_cantonese.py 2>&1 | tee out
> Training Environment:
| > Backend: Torch
| > Mixed precision: False
| > Precision: float32
| > Current device: 0
| > Num. of GPUs: 1
| > Num. of CPUs: 96
| > Num. of Torch Threads: 1
| > Torch seed: 1
| > Torch CUDNN: True
| > Torch CUDNN deterministic: False
| > Torch CUDNN benchmark: False
| > Torch TF32 MatMul: False
> Start Tensorboard: tensorboard --logdir=/gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8
> Model has 518442047 parameters
> EPOCH: 0/1000
--> /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8
> EVALUATION
/gemini/code/TTS/TTS/tts/layers/xtts/trainer/gpt_trainer.py:277: UserWarning: "kaiser_window" resampling method name is being deprecated and replaced by "sinc_interp_kaiser" in the next release. The default behavior remains unchanged.
dvae_wav = torchaudio.functional.resample(
--> EVAL PERFORMANCE
| > avg_loader_time: 0.04096300427506609 (+0)
| > avg_loss_text_ce: 0.04030846954300636 (+0)
| > avg_loss_mel_ce: 6.010201570464342 (+0)
| > avg_loss: 6.050510057588903 (+0)
> EPOCH: 1/1000
--> /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/GPT_XTTS_v2.0_LJSpeech_FT-May-27-2025_12+37PM-4a6359e8
> TRAINING (2025-05-27 12:38:21)
>> DVAE weights restored from: /gemini/code/TTS/recipes/mdcc/xtts_v2/run/training/XTTS_v2.0_original_model_files/dvae.pth
mdcc-dataset/text_path does not exist.
| > Found 65120 files in /gemini/data-2/dataset
> Filtering invalid eval samples!!
> Total eval samples after filtering: 247
| > Synthesizing test sentences.
> Sampling by language: dict_keys(['zh-yue'])
--> TIME: 2025-05-27 12:38:28 -- STEP: 0/21622 -- GLOBAL_STEP: 0
| > loss_text_ce: 0.03881431370973587 (0.03881431370973587)
| > loss_mel_ce: 6.046611785888672 (6.046611785888672)
| > loss: 0.07244555652141571 (0.07244555652141571)
| > current_lr: 5e-06
| > step_time: 0.9248 (0.9248268604278564)
| > loader_time: 5.8306 (5.8306238651275635)
2025-5-28
All the metrics look good and the loss is converging. A new best model appeared around step 150K. Roughly 200K steps were completed in one day, and the full schedule appears to be about 20M steps, i.e. around 100 days. But the run can be stopped once it has converged and has gone a long time without producing a new best model.
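The "about 100 days" figure follows from the step counts visible in the training log:

```python
steps_per_epoch = 21622      # from the "STEP: 0/21622" line in the log
epochs = 1000
total_steps = steps_per_epoch * epochs    # ~21.6M steps for the full schedule

steps_per_day = 200_000      # observed throughput on the rented GPU
full_run_days = total_steps / steps_per_day   # ~108 days
```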
Realized that changing the output sample rate from 24000 to 22050 for this training run may have been a mistake; once this round finishes, train again at 24000 and compare.
kokoro uses about 80M parameters while XTTS uses about 500M. Glow should be the fastest model, at only about 200K parameters. Plan to train Glow later and see how fast it is.
2025-5-29
Downloaded a model for testing, and it errors out:
Inference...
Traceback (most recent call last):
File "/home/hgneng/code/hgneng/TTS/recipes/mdcc/xtts_v2/test.py", line 18, in <module>
out = model.inference(
^^^^^^^^^^^^^^^^
File "/opt/miniconda3/envs/coqui/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/hgneng/code/hgneng/TTS/TTS/tts/models/xtts.py", line 534, in inference
text_tokens = torch.IntTensor(self.tokenizer.encode(sent, lang=language)).unsqueeze(0).to(self.device)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/hgneng/code/hgneng/TTS/TTS/tts/layers/xtts/tokenizer.py", line 653, in encode
return self.tokenizer.encode(txt).ids
^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'encode'
2025-5-30
The tokenizer problem has two parts: TTS/tts/layers/xtts/tokenizer.py needs extra logic to support zh-yue, and xtts.py never reads vocab_file from the config, so the tokenizer stays unset. Current test result: the output is intelligible but incomplete; a 13-character sentence dropped its last 3 characters. The voice is a bit hoarse (effectively noise). I will try training for another 4 days on top of the current checkpoint and see how it sounds. When an important change forces a retrain from scratch, I will switch to 24000 Hz then.
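The failure mode reduces to a minimal sketch (class and attribute names here are illustrative, not the exact Coqui code): the XTTS tokenizer delegates to an inner tokenizer built from config.vocab_file, so if the loader never passes that path through, the inner object stays None and encode() raises exactly the AttributeError in the traceback above.

```python
class InnerTokenizer:
    """Stand-in for the real BPE tokenizer built from vocab_file."""
    def __init__(self):
        self.vocab = {}

    def encode(self, txt):
        # Toy encoding: assign ids in order of first appearance.
        return [self.vocab.setdefault(ch, len(self.vocab)) for ch in txt]

class TokenizerSketch:
    """Illustrative reduction of the XTTS tokenizer bug described above."""
    def __init__(self, vocab_file=None):
        # The real code builds the tokenizer from vocab_file; when xtts.py
        # never reads vocab_file from the config, this stays None.
        self.tokenizer = InnerTokenizer() if vocab_file else None

    def encode(self, txt, lang="zh-yue"):
        # AttributeError: 'NoneType' object has no attribute 'encode'
        return self.tokenizer.encode(txt)
```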
2025-6-3
Trained to about 1.45M steps; a best model appeared at 1.2M steps. It looks converged and unlikely to produce a better model. The result is actually worse than the 330K-step model: audio quality is about the same, but it cuts off earlier.
New init steps:
- clone code: git clone https://github.com/hgneng/TTS.git
- change repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
- install deps: cd TTS && pip install -e .[all,dev]
- run training: cd /gemini/code/TTS/recipes/mdcc/xtts_v2 && time python train_cantonese.py 2>&1 | tee out
After training, link the latest checkpoint.pth to best_model.pth so the next run resumes from where this one left off.
Copy the event files to /gemini/output/ to view training trends in TensorBoard. Do not create symlinks: there seems to be a bug that deletes the directory and aborts training.
Model test command:
~/code/hgneng/TTS/recipes/mdcc/xtts_v2$ python test.py