Coqui TTS

By admin, 14 October, 2024

🐸 (frog) TTS

https://github.com/coqui-ai/TTS

https://coqui.ai/

 

The first time it runs, TTS needs to download a model. If that download fails, the next run fails as well. We need to remove the empty model folder from the path below to make it retry the download:

/home/hgneng/.local/share/tts/
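
The cleanup can be scripted. The following is a minimal sketch, assuming a failed download leaves a completely empty folder directly under the model directory; the helper name remove_empty_model_dirs is made up for this note and is not part of Coqui TTS.

```python
from pathlib import Path


def remove_empty_model_dirs(root):
    """Delete empty sub-folders of root and return the paths that were removed."""
    removed = []
    root = Path(root).expanduser()
    if not root.is_dir():
        return removed
    for child in sorted(root.iterdir()):
        # Assumption: a failed download leaves a folder with nothing inside it.
        if child.is_dir() and not any(child.iterdir()):
            child.rmdir()
            removed.append(child)
    return removed


if __name__ == "__main__":
    # Path taken from the note above; "~" expands to the current user's home.
    for path in remove_empty_model_dirs("~/.local/share/tts/"):
        print("removed", path)
```

Folders that contain any file are left alone, so models that downloaded correctly are not touched.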

Models are downloaded from here: https://github.com/coqui-ai/TTS/releases/tag/v0.6.1_models

There is a discussion forum. When you are out of ideas you can browse it or even ask a question: https://github.com/coqui-ai/TTS/discussions

Issues

We should use the model tts_models/en/ljspeech/tacotron2-DDC_ph instead of the default model. See https://github.com/coqui-ai/TTS/issues/2719
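
Selecting that model explicitly could look like the sketch below. It reuses the TTS.api.TTS class and the tts_to_file call that appear later in these notes; the wrapper function synthesize and its defaults are only illustrative.

```python
MODEL_NAME = "tts_models/en/ljspeech/tacotron2-DDC_ph"


def synthesize(text, file_path="output.wav", model_name=MODEL_NAME):
    """Synthesize text to a wav file with an explicitly chosen model."""
    # Imported here so this file still loads where Coqui TTS is not installed.
    from TTS.api import TTS

    tts = TTS(model_name=model_name, progress_bar=False, gpu=False)
    tts.tts_to_file(text, file_path=file_path)
    return file_path


# Example call (downloads the model on first use):
# synthesize("hello world")
```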

Learning to train

  1. The beginner tutorial contains a train.py script. Follow its instructions to train (training should run on a GPU with more than 6 GB of memory):

Documentation: https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html

If we get a read timeout in the trainer module (a PyTorch-based training package), we can edit the file named in the backtrace and raise the timeout from 5 s to 15 s:

/root/miniconda3/lib/python3.9/site-packages/trainer/analytics.py
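
If the edit has to be repeated (for example after reinstalling the package), it can be scripted. This is a rough sketch that assumes the timeout appears in the file as a literal timeout=5 keyword argument; these notes only record the values 5 s and 15 s, so check the file first. The function names are made up for illustration.

```python
import re


def raise_timeout(source, old=5, new=15):
    """Return source with every `timeout=<old>` keyword changed to `timeout=<new>`."""
    # \b keeps values such as timeout=50 from being matched when old is 5.
    pattern = re.compile(r"\btimeout\s*=\s*%d\b" % old)
    return pattern.sub("timeout=%d" % new, source)


def patch_file(path, old=5, new=15):
    """Rewrite the file in place, keeping a .bak copy. Returns True if it changed."""
    with open(path, encoding="utf-8") as f:
        original = f.read()
    patched = raise_timeout(original, old, new)
    if patched != original:
        with open(path + ".bak", "w", encoding="utf-8") as f:
            f.write(original)
        with open(path, "w", encoding="utf-8") as f:
            f.write(patched)
    return patched != original


# Example call, using the path from the note above:
# patch_file("/root/miniconda3/lib/python3.9/site-packages/trainer/analytics.py")
```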

  2. Test the model with the following command:

$ tts --text 'hello world' --model_path run-November-17-2023_03+14AM-1e152692/best_model.pth  --out_path output.wav --config_path run-November-17-2023_03+14AM-1e152692/config.json
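
Because every training run gets a new folder name, it may be handy to build this command from the run folder. A small sketch follows, assuming the folder contains best_model.pth and config.json as in the command above; build_tts_command is a hypothetical helper, not part of the tts CLI.

```python
import os
import shlex


def build_tts_command(run_dir, text, out_path="output.wav"):
    """Build the argument list for testing the best checkpoint of one training run."""
    return [
        "tts",
        "--text", text,
        "--model_path", os.path.join(run_dir, "best_model.pth"),
        "--config_path", os.path.join(run_dir, "config.json"),
        "--out_path", out_path,
    ]


if __name__ == "__main__":
    cmd = build_tts_command("run-November-17-2023_03+14AM-1e152692", "hello world")
    # Print a shell-safe command line; cmd can also be passed to subprocess.run.
    print(shlex.join(cmd))
```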

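Problems with Coqui Mandarin

Syllables such as ai, an and ang mostly get an extra g sound added in front.

Syllables such as gai and gan lose their initial g sound. The model may have learned this wrongly.

The pronunciation of e is inaccurate: either it is not pronounced at all, or a k sound is added in front of it.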

Courses

Open courses: https://edu.speechhome.com/

Paid speech synthesis course: https://edu.speechhome.com/p/t_pc/goods_pc_detail/goods_detail/course_29s1VsJQpubTWw2PPPYvbVGyOhE?app_id=appzxw56sw27444 (this course is not easy for a beginner to understand)

Self-attention mechanism: https://www.youtube.com/watch?v=hYdO9CscNes (search for Hung-yi Lee (李宏毅) videos to work through a complete machine learning course)

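Training a Chinese voice

Someone is attempting this. He seems to have synthesized speech successfully and only ran into problems when customizing the voice: https://github.com/coqui-ai/TTS/discussions/2488

A Chinese model already exists, but for some unknown reason a strange stretch of repeated speech is appended to the audio (it seems the output has to be padded to 12.05 seconds):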

from TTS.api import TTS
# Load the Chinese Baker model and synthesize a short phrase.
tts = TTS(model_name="tts_models/zh-CN/baker/tacotron2-DDC-GST")
tts.tts_to_file("你好")

There is a TensorFlow version of this voice (though I could not get it to run): https://huggingface.co/tensorspeech/tts-tacotron2-baker-ch

Custom voices

Raise your voice - training a model on your very own voice clips with Common Voice and Coqui

YourTTS: Zero-Shot Multi-Speaker Text Synthesis and Voice Conversion

https://github.com/Edresson/YourTTS

Create a custom Speech-to-Text model for 💫 Your Voice 💫 with Common Voice

Best Procedure For Voice Cloning

https://tts.readthedocs.io/en/latest/faq.html#how-can-i-train-my-own-tts-model

 

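The following code clones a voice easily and takes 11 seconds. A multilingual model is required. The main problem at the moment is probably performance. If there is really no other way, generate the basic pinyin syllables and let Ekho do the synthesis.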

from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=False)
tts.tts_to_file("This is voice cloning.", speaker_wav="cameron.wav", language="en", file_path="output.wav")

Coqui STT

https://github.com/coqui-ai/STT

Tacotron2

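Tacotron, released in 2017, was one of the first models to apply deep learning to TTS successfully. Tacotron is mainly an encoder-decoder model with attention.

Tacotron2 followed at the end of 2017. This model takes 74 seconds to synthesize Hello World.

In 2020, Eren Gölge of Coqui proposed the Tacotron2 Double Decoder Consistency (DDC) model. This model takes 9 seconds to synthesize Hello World.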
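
Timings like the ones quoted here can be taken with a small wrapper. This is a generic sketch, not how the numbers above were necessarily measured; time_call is a made-up helper name.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Example with a Coqui TTS object created elsewhere:
# _, seconds = time_call(tts.tts_to_file, "Hello World", file_path="output.wav")
# print("synthesis took %.1f s" % seconds)
```

Whether model loading is included changes the result a lot, so it is worth noting which part of the script was timed.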

Reference: https://tts.readthedocs.io/en/latest/models/tacotron1-2.html

Learning the Tacotron2 speech synthesis module in PyTorch (torchaudio): https://pytorch.org/audio/stable/tutorials/tacotron2_pipeline_tutorial.html

We need to fix a network issue. The download below fails:

/home/hgneng/miniconda3/envs/tacotron2/lib/python3.10/site-packages/torch/hub.py calls download_url_to_file to fetch https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt and save it to /home/hgneng/.cache/torch/hub/checkpoints/en_us_cmudict_forward.pt
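
One possible workaround is to fetch the file by hand (for example through a proxy or on another machine) and place it at the cache path shown above. The sketch below assumes the loader skips the download when the file already exists at that path, which should be verified; the function names are illustrative.

```python
import os
import urllib.request

URL = ("https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/"
       "DeepPhonemizer/en_us_cmudict_forward.pt")


def checkpoint_path(url, hub_dir="~/.cache/torch/hub"):
    """Cache location seen in the note above: <hub_dir>/checkpoints/<file name>."""
    filename = url.rsplit("/", 1)[-1]
    return os.path.join(os.path.expanduser(hub_dir), "checkpoints", filename)


def fetch_if_missing(url=URL, hub_dir="~/.cache/torch/hub"):
    """Download the checkpoint by hand unless it is already in the cache."""
    dest = checkpoint_path(url, hub_dir)
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        # Download under a temporary name so an interrupted transfer
        # never leaves a truncated file under the final name.
        tmp = dest + ".part"
        urllib.request.urlretrieve(url, tmp)
        os.replace(tmp, dest)
    return dest
```

Calling fetch_if_missing() once before running the tutorial fills the cache; later runs find the file and do no network access.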

The vocoder, which converts spectrograms to a waveform, seems slow

This model takes 139 seconds to synthesize Hello World, far longer than Coqui's Tacotron2 DDC model (9 seconds).

https://pytorch.org/audio/stable/generated/torchaudio.pipelines.Tacotron2TTSBundle.Vocoder.html?highlight=vocoder#torchaudio.pipelines.Tacotron2TTSBundle.Vocoder
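
To check where the time goes, each stage of the torchaudio pipeline can be timed separately. The sketch below follows the structure of the torchaudio Tacotron2 tutorial (text processor, Tacotron2, vocoder). Which bundle produced the 139 seconds above is not recorded in these notes, so the WaveRNN default here is an assumption, and the Griffin-Lim bundle is listed only for comparison.

```python
import time

# Bundle names from torchaudio.pipelines: the first pairs Tacotron2 with a
# WaveRNN vocoder, the second with Griffin-Lim.
WAVERNN_BUNDLE = "TACOTRON2_WAVERNN_PHONE_LJSPEECH"
GRIFFINLIM_BUNDLE = "TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH"


def stage_times(text, bundle_name=WAVERNN_BUNDLE):
    """Return the seconds spent in the text, spectrogram and vocoder stages."""
    # Imported lazily so this file can be loaded without torchaudio installed.
    import torch
    import torchaudio

    bundle = getattr(torchaudio.pipelines, bundle_name)
    processor = bundle.get_text_processor()
    tacotron2 = bundle.get_tacotron2()
    vocoder = bundle.get_vocoder()

    with torch.inference_mode():
        t0 = time.perf_counter()
        tokens, lengths = processor(text)
        t1 = time.perf_counter()
        spec, spec_lengths, _ = tacotron2.infer(tokens, lengths)
        t2 = time.perf_counter()
        vocoder(spec, spec_lengths)
        t3 = time.perf_counter()

    # Model loading above is deliberately left out of the measurement.
    return {"text": t1 - t0, "spectrogram": t2 - t1, "vocoder": t3 - t2}


# Example call (downloads the models on first use):
# print(stage_times("Hello World"))
# print(stage_times("Hello World", GRIFFINLIM_BUNDLE))
```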

What's the difference between phoneme-based TTS and character-based TTS?

Phoneme-based TTS first converts the text into phonemes (the sounds of a language) and generates speech from those. It is usually more accurate than character-based TTS because pronunciation is resolved before synthesis. Character-based TTS, on the other hand, feeds the written characters directly to the model, which has to learn pronunciation from spelling by itself. It is less accurate because spelling does not capture the nuances of how a language is pronounced.
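
A toy example makes the difference concrete. The two dictionary entries below are hand-written ARPAbet-style pronunciations with stress marks omitted, for illustration only; a real system would use a full lexicon or a grapheme-to-phoneme model.

```python
# Toy pronunciation dictionary, hand-written for this note.
TOY_LEXICON = {
    "though": ["DH", "OW"],
    "tough": ["T", "AH", "F"],
}


def char_input(word):
    """Character-based front end: the model sees the raw letters."""
    return list(word.lower())


def phoneme_input(word, lexicon=TOY_LEXICON):
    """Phoneme-based front end: the model sees sounds looked up beforehand."""
    return lexicon[word.lower()]


def shared_suffix_len(a, b):
    """Length of the common ending of two sequences."""
    n = 0
    while n < min(len(a), len(b)) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n


if __name__ == "__main__":
    # As characters the two words share the ending o-u-g-h ...
    print(shared_suffix_len(char_input("though"), char_input("tough")))
    # ... but as phonemes they have no sound in common at the end.
    print(shared_suffix_len(phoneme_input("though"), phoneme_input("tough")))
```

With character input the model has to learn on its own that the same four letters sound different in the two words; with phoneme input that ambiguity is removed before the model sees the text.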

https://pytorch.org/audio/2.0.1/pipelines.html

Tacotron2 data model

Understand its models and check whether one is available for Chinese. If not, find a way to train one ourselves: https://pytorch.org/audio/stable/pipelines.html

DeepPhonemizer

DeepPhonemizer is a multilingual grapheme-to-phoneme modeling library that leverages recent deep learning technology and is optimized for usage in production systems such as TTS. In particular, the library should be accurate, fast, easy to use. Moreover, you can train a custom model on your own dataset in a few lines of code.

DeepPhonemizer is compatible with Python 3.6+ and is distributed under the MIT license.

Read the documentation at: https://as-ideas.github.io/DeepPhonemizer/
