Open-Source Deep Learning TTS (Coqui)

By admin, 23 September 2022

Cantonese TTS Implementation Notes

1. Implement Cantonese Frontend

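Following the document Implementing a New Language Frontend, implement the Cantonese counterpart of the Mandarin code.

Implement the Cantonese phonemizers under https://github.com/coqui-ai/TTS/tree/dev/TTS/tts/utils/text/phonemizers.

As for phonemes, I'm not sure which standard Coqui's Mandarin frontend follows. It resembles IPA but does not match it exactly. For example, "ao": ["aʌ"]: ʌ is an IPA vowel symbol, but this is not the usual IPA transcription of ao. The Cantonese phoneme mapping could perhaps follow Amazon's documentation, Chinese (Cantonese) (yue-CN), which may be the same system as the Jyutping article on Wikipedia, and then be checked against what eSpeak uses. The goal is to converge on IPA and eSpeak as far as possible (if eSpeak turns out to agree with IPA overall but differs in places, eSpeak may need fixing).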

2024-9-4

Implemented cantonese/jyutpingToPhonemes.py. It uses pycantonese.jyutping_to_ipa (https://github.com/jacksonllee/pycantonese/issues/44) for the conversion. pycantonese is a useful Cantonese linguistics and NLP library. The output of pycantonese.jyutping_to_ipa should match Wikipedia. The IPA mapping in Amazon's documentation does not seem to be fully identical to the IPA on Wikipedia. The situation in eSpeak looks more complicated, and I don't want to touch it. I plan to modify the other phonemizer files tomorrow.

2024-9-5

Finished the Cantonese phonemizer. Code is at https://github.com/hgneng/TTS

Reinstalled the conda coqui environment.

pycantonese's corpus is not complete enough. "这是,样本中文。" is converted to ||si6||||||bun2||zung1|man2|||, so two characters are missing. I need to write my own Cantonese-to-Jyutping Python module (or extend pycantonese).

2024-9-6

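Converting the text to Traditional Chinese with the opencc module first makes pycantonese usable, so there is no need to extend pycantonese.

I'm not sure about the next step: I couldn't find config.phoneme_language, so I don't know whether my implementation is correct.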

https://docs.coqui.ai/en/latest/implementing_a_new_language_frontend.html

After you implement your phonemizer, you need to add it to the TTS/tts/utils/text/phonemizers/__init__.py to be able to map the language code in the model config - config.phoneme_language - to the phonemizer class and initiate the phonemizer automatically.
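The mapping described above can be pictured as a plain dictionary from language code to phonemizer class. The sketch below is illustrative only: YueCnPhonemizerSketch, PHONEMIZERS_SKETCH and get_phonemizer_sketch are hypothetical names, not the actual contents of TTS/tts/utils/text/phonemizers/__init__.py.

```python
class YueCnPhonemizerSketch:
    """Stand-in for a Cantonese phonemizer class (hypothetical)."""

    language = "yue-cn"

    def phonemize(self, text: str) -> str:
        # A real implementation would convert text to Jyutping/IPA here.
        return text


# Hypothetical registry: language code -> phonemizer class.
PHONEMIZERS_SKETCH = {
    YueCnPhonemizerSketch.language: YueCnPhonemizerSketch,
}


def get_phonemizer_sketch(language: str):
    """Instantiate the phonemizer registered for `language`."""
    try:
        return PHONEMIZERS_SKETCH[language]()
    except KeyError:
        raise ValueError(f"No phonemizer found for language {language}.") from None
```

With a registry like this, a language code missing from the dictionary produces the kind of "No phonemizer found" error that shows up later in these notes.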

The following step has not been done yet (TODO1):

You should also add tests to tests/text_tests if you want to make a PR.

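This project has a Cantonese speech corpus (8 GB, for academic research only). The audio quality is very good, at the level of professional broadcaster recordings.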

https://github.com/HLTCHKUST/cantonese-asr

P.S. Orca has many keyboard shortcuts in Firefox, and typing Chinese triggers them, which makes Chinese input very difficult.

2. Training Model

Reference: Training a Model. As with Mandarin, start with the Tacotron2 model and consider the xtts model later.

It looks like training should work once this example runs: https://github.com/coqui-ai/TTS/blob/dev/recipes/kokoro/tacotron2-DDC/run.sh

2024-9-9

The MDCC dataset requires a signed license. I sent it today.

I plan to read its paper later: http://arxiv.org/pdf/2201.02419

The paper says "Common Voice zh-HK" (from Wikipedia in 2019) is "the biggest existing dataset". I should take a look at it later.

2024-9-10

Read 3/8 of the paper.

Copied the kokoro folder to mdcc and modified it: recipes/mdcc/tacotron2-DDC

I don't understand the characters setting in tacotron2-DDC.json. It does not look correct, but it may be unused.

Ready to train tomorrow.

2024-9-11

Finished reading the paper. MDCC uses the "Fairseq S2T Transformer" for training, with CER (character error rate) as the evaluation metric. The result is 10.15%, compared to 8.69% with Common Voice zh-HK. (What's the advantage of MDCC??) This decision is based on the fact that the MDCC data are cleaner, shorter and therefore easier to learn than those in Common Voice zh-HK.

~/code/hgneng/TTS$ bash recipes/mdcc/tacotron2-DDC/run.sh
> Avg mel spec mean: -2.2504470286478218
> Avg mel spec scale: 0.7808214242600025
> Avg linear spec mean: -1.4933440478801683
> Avg linear spec scale: 0.8465782427037042
> stats saved to /home/hgneng/code/hgneng/TTS/recipes/mdcc/tacotron2-DDC/scale_stats.npy
...
AttributeError: module 'TTS.tts.datasets' has no attribute 'mdcc'

I need to implement a function called mdcc in TTS/tts/datasets/formatters.py.
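A minimal sketch of what such a formatter could look like. The item keys (text, audio_file, speaker_name, root_path) follow the convention of Coqui's dataset formatters; the CSV layout with audio_path and text columns and the single placeholder speaker name are assumptions made for illustration, not the real MDCC metadata format.

```python
import csv
import os


def mdcc(root_path, meta_file, **kwargs):
    """Sketch of a dataset formatter (assumed CSV columns: audio_path, text)."""
    items = []
    meta_path = os.path.join(root_path, meta_file)
    with open(meta_path, encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f):
            text = row["text"].strip()
            if not text:
                continue  # skip rows without a transcript
            items.append(
                {
                    "text": text,
                    "audio_file": os.path.join(root_path, row["audio_path"]),
                    "speaker_name": "mdcc",  # placeholder: one speaker for all rows
                    "root_path": root_path,
                }
            )
    return items
```

The function name matters because the trainer looks the formatter up by the name given in the dataset config, which is what the AttributeError above is complaining about.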

2024-9-12

Implemented the mdcc formatter in TTS/tts/datasets/formatters.py.

Got the following error:

  File "/home/hgneng/code/hgneng/TTS/TTS/tts/utils/text/tokenizer.py", line 206, in init_from_config
    raise ValueError(
ValueError: No phonemizer found for language yue-cn.
                            You may need to install a third party library for this language.

And I also need to take a look at multi-speaker training.

2024-9-13

For multi-speaker, we need to implement SpeakerManager.

Added code to TTS/tts/utils/text/tokenizer.py to support yue-cn

Got following error:

  File "/home/hgneng/code/hgneng/TTS/TTS/tts/models/tacotron2.py", line 386, in _create_logs
    pred_spec = postnet_outputs[0].data.cpu().numpy()
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Got unsupported ScalarType BFloat16

Need to upload data and code to https://platform.virtaicloud.com/ to run.

P.S. When the ibus Chinese input method stops working, it can be restarted as follows:

$ ibus-daemon -drx
$ ibus restart // optional

2024-9-14

  1. Set up an environment with Python 3.11, cuda >= 2.1
  2. Set up the mdcc-dataset link
  3. Edit run.sh to restore some one-time logic
  4. Install dependencies: gemini/code/TTS# pip install .[all,dev]
  5. Generating scale_stats.npy is very slow on virtaicloud; generate it locally and upload it.

I need to fix the following error:

 > EPOCH: 0/1000
 --> /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-14-2024_10+24AM-133a5102
[*] Pre-computing phonemes...
  0%|                                                                                                                                            | 0/65120 [00:00<?, ?it/s]jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1
 [!] Character '5' not found in the vocabulary. Discarding it.
jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1
 [!] Character '6' not found in the vocabulary. Discarding it.
jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1
 [!] Character '2' not found in the vocabulary. Discarding it.
jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1
 [!] Character '3' not found in the vocabulary. Discarding it.
jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1
 [!] Character '4' not found in the vocabulary. Discarding it.
jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1
 [!] Character '1' not found in the vocabulary. Discarding it.
  2%|██▉                                                                                                                              | 1491/65120 [00:33<17:08, 61.85it/s]
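The warnings above say that the tone digits are not part of the configured character set. A quick way to see which symbols a phonemized string needs that the vocabulary lacks is sketched below; missing_characters is a hypothetical helper, and the letters-only vocabulary is an assumption standing in for the real characters config.

```python
def missing_characters(phoneme_text: str, vocabulary: str) -> list[str]:
    """Return characters of `phoneme_text` absent from `vocabulary`, in first-seen order."""
    seen = []
    for ch in phoneme_text:
        if ch not in vocabulary and ch not in seen:
            seen.append(ch)
    return seen


sample = "jau5waak6bei2keoi5saau2siu2ge3jan4leoi6ci2zou2aa3dong1tung4maai4haa6waa1"
letters_only = "abcdefghijklmnopqrstuvwxyz"

print(missing_characters(sample, letters_only))             # ['5', '6', '2', '3', '4', '1']
print(missing_characters(sample, letters_only + "123456"))  # []
```

Adding the digits 1-6 to the character set removes the warnings for this sample, which matches the order in which the log reports the discarded characters.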

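After I changed the character set configuration and restarted the environment, the run hit the error below. It is network-related and may succeed after a few retries.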

 > Start Tensorboard: tensorboard --logdir=/gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-14-2024_11+14AM-850dac97
/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py:552: FutureWarning: `torch.cuda.amp.GradScaler(args...)` is deprecated. Please use `torch.amp.GradScaler('cuda', args...)` instead.
  self.scaler = torch.cuda.amp.GradScaler()

 > Model has 47872052 parameters
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 712, in _error_catcher
    yield
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 812, in _raw_read
    data = self._fp_read(amt) if not fp_closed else b""
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 797, in _fp_read
    return self._fp.read(amt) if amt is not None else self._fp.read()
           ^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/http/client.py", line 473, in read
    s = self.fp.read(amt)
        ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/socket.py", line 706, in readinto
    return self._sock.recv_into(b)
           ^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/ssl.py", line 1314, in recv_into
    return self.read(nbytes, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/ssl.py", line 1166, in read
    return self._sslobj.read(len, buffer)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TimeoutError: The read operation timed out

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/requests/models.py", line 816, in generate
    yield from self.raw.stream(chunk_size, decode_content=True)
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 934, in stream
    data = self.read(amt=amt, decode_content=decode_content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 877, in read
    data = self._raw_read(amt)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 811, in _raw_read
    with self._error_catcher():
  File "/root/miniconda3/lib/python3.11/contextlib.py", line 158, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/root/miniconda3/lib/python3.11/site-packages/urllib3/response.py", line 717, in _error_catcher
    raise ReadTimeoutError(self._pool, None, "Read timed out.") from e  # type: ignore[arg-type]
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
urllib3.exceptions.ReadTimeoutError: HTTPSConnectionPool(host='user-images.githubusercontent.com', port=443): Read timed out.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gemini/code/TTS/TTS/bin/train_tts.py", line 71, in <module>
    main()
  File "/gemini/code/TTS/TTS/bin/train_tts.py", line 58, in main
    trainer = Trainer(
              ^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 583, in __init__
    ping_training_run()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/analytics.py", line 12, in ping_training_run
    _ = requests.get(URL, timeout=5)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/api.py", line 73, in get
    return request("get", url, params=params, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/api.py", line 59, in request
    return session.request(method=method, url=url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/sessions.py", line 589, in request
    resp = self.send(prep, **send_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/sessions.py", line 725, in send
    history = [resp for resp in gen]
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/sessions.py", line 725, in <listcomp>
    history = [resp for resp in gen]
              ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/sessions.py", line 266, in resolve_redirects
    resp = self.send(
           ^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/sessions.py", line 747, in send
    r.content
  File "/root/miniconda3/lib/python3.11/site-packages/requests/models.py", line 899, in content
    self._content = b"".join(self.iter_content(CONTENT_CHUNK_SIZE)) or b""
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/requests/models.py", line 822, in generate
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='user-images.githubusercontent.com', port=443): Read timed out.

One epoch takes about 20 minutes, so a full training run is estimated at 300 hours. A more powerful machine is worth trying.
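The estimate can be reproduced with simple arithmetic, assuming the run really goes through all 1000 epochs shown in the log ("EPOCH: 0/1000") at a constant 20 minutes each. The function name is hypothetical.

```python
def estimated_training_hours(minutes_per_epoch: float, epochs: int) -> float:
    """Total wall-clock hours if every epoch takes the same time."""
    return minutes_per_epoch * epochs / 60.0


print(round(estimated_training_hours(20, 1000), 1))   # 333.3
print(round(estimated_training_hours(120, 1000), 1))  # 2000.0
```

The first figure is in the same range as the rough 300-hour estimate; the second shows what a 2-hour epoch would imply.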

After "Pre-computing phonemes..." finished, the run got stuck at this stage:

> DataLoader initialization
| > Tokenizer:
       | > add_blank: False
       | > use_eos_bos: False
       | > use_phonemes: True
       | > phonemizer:
               | > phoneme language: yue-cn
               | > phoneme backend: yue_cn_phonemizer
| > Number of instances : 65120

After terminating with CTRL+C, a checkpoint was saved automatically:

> Keyboard interrupt detected.
 > Saving model before exiting...

 > CHECKPOINT : /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-14-2024_11+52AM-850dac97/checkpoint_0.pth
 ! Run is kept in /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-14-2024_11+52AM-850dac97

2024-9-18

Started a machine and ran pip install ., but pip seems to try downloading every version of every package, so it never finishes. I don't know why. Will try again tomorrow.

2024-9-19

After switching to another machine and changing the pip repository to Tsinghua, it works again. However, the training time for one epoch increased from 20 minutes to 2 hours.

After realizing the GPU was not being used, I changed run.sh and ran again. It is stuck at

> DataLoader initialization
| > Tokenizer:
       | > add_blank: False
       | > use_eos_bos: False
       | > use_phonemes: True
       | > phonemizer:
               | > phoneme language: yue-cn
               | > phoneme backend: yue_cn_phonemizer
| > Number of instances : 65120

2024-9-20

The process stopped after 2 hours because phoneme_cache was incomplete. It's related to scale_stats.npy, which I need to upload from my local machine. The virtual machine's disk is very slow, so large numbers of files should be processed locally.

The network timeout issue persists. Increasing the timeout solves it:

  File "/root/miniconda3/lib/python3.11/site-packages/trainer/analytics.py", line 12, in ping_training_run
    _ = requests.get(URL, timeout=5)
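An alternative to raising the timeout by hand is to make the ping best-effort, so that a slow network never aborts training. The sketch below is an assumption about how one might patch it locally, not the trainer package's own code; safe_ping is a hypothetical name, and fetch stands for a callable such as requests.get.

```python
def safe_ping(fetch, url: str, timeout: float = 15.0) -> bool:
    """Call `fetch(url, timeout=...)` and report success instead of raising."""
    try:
        fetch(url, timeout=timeout)
        return True
    except OSError as exc:
        # TimeoutError and the requests exceptions all derive from OSError.
        # A failed analytics ping should not stop a training run.
        print(f"ping skipped: {exc}")
        return False
```

Used as safe_ping(requests.get, URL), the call returns False on a read timeout and training can continue.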

Got the following error:

 > TRAINING (2024-09-20 10:41:04) 
 ! Run is removed from /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-20-2024_09+46AM-850dac97
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1833, in fit
    self._fit()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1785, in _fit
    self.train_epoch()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1503, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/root/miniconda3/lib/python3.11/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
KeyError: Caught KeyError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 212, in __getitem__
    return self.load_data(idx)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 268, in load_data
    token_ids = self.get_token_ids(idx, item["text"])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 251, in get_token_ids
    token_ids = self.get_phonemes(idx, text)["token_ids"]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 228, in get_phonemes
    out_dict = self.phoneme_dataset[idx]
               ~~~~~~~~~~~~~~~~~~~~^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 614, in __getitem__
    ph_hat = self.tokenizer.ids_to_text(ids)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/utils/text/tokenizer.py", line 120, in ids_to_text
    return self.decode(id_sequence)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/utils/text/tokenizer.py", line 84, in decode
    text += self.characters.id_to_char(token_id)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/utils/text/characters.py", line 305, in id_to_char
    return self._id_to_char[idx]
           ~~~~~~~~~~~~~~~~^^^^^
KeyError: 51

Caused by not rebuilding phoneme_cache after adding the tone numbers 1-6. (scale_stats.npy is identical after adding the tone numbers.)
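One way to avoid reusing a stale cache is to tie the cache directory name to the character set, so that any vocabulary change automatically points to a fresh directory. This is only a sketch of the idea; vocab_fingerprint and cache_dir_name are hypothetical helpers, not part of Coqui TTS.

```python
import hashlib


def vocab_fingerprint(characters: str) -> str:
    """Short, stable hash of the character set used to build the cache."""
    return hashlib.sha256(characters.encode("utf-8")).hexdigest()[:8]


def cache_dir_name(base: str, characters: str) -> str:
    """Derive a cache directory name that changes whenever the vocabulary changes."""
    return f"{base}_{vocab_fingerprint(characters)}"


old_dir = cache_dir_name("phoneme_cache", "abcdefghijklmnopqrstuvwxyz")
new_dir = cache_dir_name("phoneme_cache", "abcdefghijklmnopqrstuvwxyz123456")
print(old_dir != new_dir)  # True: adding the tone digits selects a new cache
```

The simpler manual fix is the one used here: delete phoneme_cache whenever the character set changes.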

Got following error:

 > TRAINING (2024-09-20 12:29:55) 
/root/miniconda3/lib/python3.11/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3526.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]

   --> TIME: 2024-09-20 12:30:12 -- STEP: 0/1018 -- GLOBAL_STEP: 0
     | > decoder_loss: 1.3073298931121826  (1.3073298931121826)
     | > postnet_loss: 3.433417797088623  (3.433417797088623)
     | > stopnet_loss: 2.9247257709503174  (2.9247257709503174)
     | > decoder_coarse_loss: 1.3068547248840332  (1.3068547248840332)
     | > decoder_ddc_loss: 0.018958045169711113  (0.018958045169711113)
     | > ga_loss: 0.06158587336540222  (0.06158587336540222)
     | > decoder_diff_spec_loss: 0.1912860870361328  (0.1912860870361328)
     | > postnet_diff_spec_loss: 4.505105018615723  (4.505105018615723)
     | > decoder_ssim_loss: 0.8266807794570923  (0.8266807794570923)
     | > postnet_ssim_loss: 0.7951085567474365  (0.7951085567474365)
     | > loss: 6.987125873565674  (6.987125873565674)
     | > align_error: 0.8974450752139091  (0.8974450752139091)
     | > amp_scaler: 32768.0  (32768.0)
     | > grad_norm: 0  (0)
     | > current_lr: 2.5000000000000002e-08 
     | > step_time: 5.318  (5.318024396896362)
     | > loader_time: 11.2825  (11.282504320144653)

warning: audio amplitude out of range, auto clipped.
 ! Run is removed from /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-20-2024_11+30AM-850dac97
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1833, in fit
    self._fit()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1785, in _fit
    self.train_epoch()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1503, in train_epoch
    for cur_step, batch in enumerate(self.train_loader):
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
           ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1325, in _next_data
    return self._process_data(data)
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
    data.reraise()
  File "/root/miniconda3/lib/python3.11/site-packages/torch/_utils.py", line 694, in reraise
    raise exception
AssertionError: Caught AssertionError in DataLoader worker process 1.
Original Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
    data = fetcher.fetch(index)
           ^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 212, in __getitem__
    return self.load_data(idx)
           ^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 268, in load_data
    token_ids = self.get_token_ids(idx, item["text"])
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 251, in get_token_ids
    token_ids = self.get_phonemes(idx, text)["token_ids"]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/TTS/tts/datasets/dataset.py", line 230, in get_phonemes
    assert len(out_dict["token_ids"]) > 0
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Next time, upload dataset.tar and untar it to /tmp/, which should work around the slow data disk.

2024-9-24

Started a new machine with PyTorch 2.2:

  1. Change pip repo: pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
  2. clone code: git clone https://github.com/hgneng/TTS.git
  3. install deps: pip install -e .[all,dev]
  4. copy dataset: cp -r /gemini/data-2/mdcc-dataset /tmp/
  5. link dataset: cd /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/ && ln -sf /tmp/mdcc-dataset
  6. run training: cd /gemini/code/TTS && bash recipes/mdcc/tacotron2-DDC/run.sh

For the error "assert len(out_dict["token_ids"]) > 0":

  1. Don't install TTS as a regular package; use an editable install: pip install -e .[all,dev]
  2. Add debug output as in https://github.com/coqui-ai/TTS/issues/1624
  3. Remove this item from the csv: 447_1804052037_72795_743.7_744.36 (keep fit)
  4. TODO: clean out all items in the csv file that contain no Chinese characters.

2024-9-25

Mapped all unknown characters/phonemes to "v" to fix the error "assert len(out_dict["token_ids"]) > 0".
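The workaround can be sketched as a small substitution step applied before tokenization. replace_unknown is a hypothetical helper and the vocabulary string is an assumption; the actual change lives in the phonemizer code of the fork.

```python
def replace_unknown(phonemes: str, vocabulary: str, placeholder: str = "v") -> str:
    """Replace every symbol outside `vocabulary` with `placeholder`."""
    if placeholder not in vocabulary:
        raise ValueError("placeholder must itself be in the vocabulary")
    return "".join(ch if ch in vocabulary else placeholder for ch in phonemes)


vocab = "abcdefghijklmnopqrstuvwxyz123456 "
print(replace_unknown("nei5 hou2 ABC", vocab))  # nei5 hou2 vvv
```

Because every unknown symbol still yields one token, a non-empty input can no longer produce an empty token_ids list, at the cost of teaching the model a meaningless "v" for those positions.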

It starts training. CPU = 400%/400%, MEM=12G/12G, GPU=20%-40%/100%, GPU_MEM=1.5G/6G

TODO: although training runs now, the English text and audio are likely to harm it. I should remove them later when improving quality. The following command detects whether a file contains Chinese:

grep -P "[\x{4E00}-\x{9FFF}]" <file>

Maybe this is not enough. We should also detect whether a file contains characters other than Chinese and filter them out.
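Both checks can be sketched in Python with the same Unicode range as the grep command. The set of punctuation tolerated by is_chinese_only is an assumption, and the range U+4E00 to U+9FFF covers only the basic CJK Unified Ideographs block, so rarer characters from the extension blocks would be rejected.

```python
import re

# Same range as the grep pattern: the basic CJK Unified Ideographs block.
HAN = re.compile(r"[\u4e00-\u9fff]")
# Anything that is neither Han, whitespace, nor common punctuation (assumed set).
NON_HAN = re.compile(r"[^\u4e00-\u9fff\s,。!?、,.!?]")


def contains_chinese(text: str) -> bool:
    """True if at least one Han character is present."""
    return bool(HAN.search(text))


def is_chinese_only(text: str) -> bool:
    """True if the text has Han characters and nothing outside the allowed set."""
    return contains_chinese(text) and NON_HAN.search(text) is None


print(contains_chinese("keep fit"))      # False
print(contains_chinese("我去keep fit"))   # True
print(is_chinese_only("我去keep fit"))    # False
print(is_chinese_only("你好,世界。"))      # True
```

Filtering the csv rows with is_chinese_only would drop both the pure-English item mentioned above and mixed-language lines.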

  --> EVAL PERFORMANCE
     | > avg_loader_time: 0.5965745482836499 (+0)
     | > avg_decoder_loss: 1.0214300359950192 (+0)
     | > avg_postnet_loss: 1.232000221611758 (+0)
     | > avg_stopnet_loss: 1.1567586131879342 (+0)
     | > avg_decoder_coarse_loss: 1.021938735128462 (+0)
     | > avg_decoder_ddc_loss: 0.0019014181905437951 (+0)
     | > avg_ga_loss: 0.00905237895490669 (+0)
     | > avg_decoder_diff_spec_loss: 0.15972242662656735 (+0)
     | > avg_postnet_diff_spec_loss: 0.5937308297616564 (+0)
     | > avg_decoder_ssim_loss: 0.9673015845058323 (+0)
     | > avg_postnet_ssim_loss: 0.9616055363636179 (+0)
     | > avg_loss: 3.203245741112037 (+0)
     | > avg_align_error: 0.9778342093512542 (+0)

 > BEST MODEL : /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-25-2024_11+50AM-9ab3ff82/best_model_1018.pth

 > Number of output frames: 5

 > EPOCH: 1/1000
 --> /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-25-2024_11+50AM-9ab3ff82

 > TRAINING (2024-09-25 12:31:38) 
 ! Run is kept in /gemini/code/TTS/recipes/mdcc/tacotron2-DDC/mdcc-ddc-September-25-2024_11+50AM-9ab3ff82
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1833, in fit
    self._fit()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1785, in _fit
    self.train_epoch()
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1504, in train_epoch
    outputs, _ = self.train_step(batch, batch_num_steps, cur_step, loader_start_time)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1327, in train_step
    batch = self.format_batch(batch)
            ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/trainer/trainer.py", line 1058, in format_batch
    batch = self.model.format_batch(batch)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/gemini/code/TTS/TTS/tts/models/base_tts.py", line 215, in format_batch
    stop_targets = stop_targets.view(text_input.shape[0], stop_targets.size(1) // self.config.r, -1)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: shape '[64, 8, -1]' is invalid for input of size 2688

real    41m35.819s
user    48m57.862s
sys     37m51.650s
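One reading that is consistent with the numbers in this traceback (an inference, not something the log confirms): 2688 = 64 × 42, so the batch has 64 items with 42 stop-target frames, and 42 // r = 8 implies r = 5, matching the "Number of output frames: 5" line. Since 42 is not a multiple of 5, the view cannot succeed. The helpers below are hypothetical and only check the arithmetic.

```python
def view_is_valid(batch_size: int, num_frames: int, r: int) -> bool:
    """Check whether a [batch, frames] tensor can be viewed as [batch, frames // r, -1]."""
    total = batch_size * num_frames
    fixed = batch_size * (num_frames // r)
    return fixed > 0 and total % fixed == 0


def pad_to_multiple(num_frames: int, r: int) -> int:
    """Smallest length >= num_frames that is a multiple of r."""
    return -(-num_frames // r) * r


print(64 * 42)                   # 2688
print(view_is_valid(64, 42, 5))  # False
print(pad_to_multiple(42, 5))    # 45
print(view_is_valid(64, 45, 5))  # True
```

If this reading is right, the frame dimension needs to be padded to a multiple of r before the reshape.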

 

Coqui TTS

🐸 (frog) TTS

https://github.com/coqui-ai/TTS

https://coqui.ai/

 

The first time it runs, tts needs to download a model. If the download fails, it fails again on the next run. We need to remove the empty model folder from the path below to make it retry the download:

/home/hgneng/.local/share/tts/
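The cleanup can be scripted. The sketch below treats a folder as "empty" when it contains no files at any depth, which is an assumption about what a failed download leaves behind; both function names are hypothetical.

```python
from pathlib import Path


def find_empty_model_dirs(root: Path) -> list[Path]:
    """Return sub-directories under `root` that contain no files at all."""
    empty = []
    for path in sorted(p for p in root.rglob("*") if p.is_dir()):
        if not any(child.is_file() for child in path.rglob("*")):
            empty.append(path)
    return empty


def remove_empty_model_dirs(root: Path) -> list[Path]:
    """Delete empty directories (deepest first) and return what was removed."""
    removed = []
    candidates = sorted(find_empty_model_dirs(root), key=lambda p: len(p.parts), reverse=True)
    for path in candidates:
        path.rmdir()  # only succeeds on directories with no entries left
        removed.append(path)
    return removed
```

For example, remove_empty_model_dirs(Path.home() / ".local/share/tts") would clear leftovers for the current user while leaving downloaded models untouched.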

Models are downloaded from here: https://github.com/coqui-ai/TTS/releases/tag/v0.6.1_models

There is a discussion forum to browse, or to ask questions on, when stuck: https://github.com/coqui-ai/TTS/discussions

Issues

We should use the model tts_models/en/ljspeech/tacotron2-DDC_ph instead of the default model. See https://github.com/coqui-ai/TTS/issues/2719

Learning to Train

  1. The beginner tutorial includes a train.py script; follow its instructions to train (training should run on a GPU with more than 6 GB of memory):

Docs: https://tts.readthedocs.io/en/latest/tutorial_for_nervous_beginners.html

If we get a read timeout in the trainer package, we can edit the file shown in the backtrace to change the timeout from 5s to 15s.

/root/miniconda3/lib/python3.9/site-packages/trainer/analytics.py

  2. Test the model with the following command:

$ tts --text 'hello world' --model_path run-November-17-2023_03+14AM-1e152692/best_model.pth  --out_path output.wav --config_path run-November-17-2023_03+14AM-1e152692/config.json

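Problems with Coqui Mandarin

Syllables such as ai, an and ang usually get an extra g sound in front.

Syllables such as gai and gan lose their initial g sound. The model may have learned this wrongly.

The e sound is inaccurate: it is either dropped or preceded by a k sound.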

Courses

Open courses: https://edu.speechhome.com/

Paid speech synthesis course: https://edu.speechhome.com/p/t_pc/goods_pc_detail/goods_detail/course_29s1VsJQpubTWw2PPPYvbVGyOhE?app_id=appzxw56sw27444 (this course is not easy for beginners to understand)

Self-attention: https://www.youtube.com/watch?v=hYdO9CscNes. Searching for videos by Hung-yi Lee (李宏毅) leads to a complete machine learning course.

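Training Chinese Speech

Someone is attempting this. He seems to have synthesized speech successfully and only ran into problems with customization: https://github.com/coqui-ai/TTS/discussions/2488

A Chinese model already exists, but for some reason a strange repeated segment is appended to the audio (it seems the output must be padded to 12.05 seconds):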

tts = TTS(model_name="tts_models/zh-CN/baker/tacotron2-DDC-GST")
tts.tts_to_file("你好")

There is a TensorFlow version of this voice (though I could not get it to run): https://huggingface.co/tensorspeech/tts-tacotron2-baker-ch

Custom Voices

Raise your voice - training a model on your very own voice clips with Common Voice and Coqui

YourTTS: Zero-Shot Multi-Speaker Text Synthesis and Voice Conversion

https://github.com/Edresson/YourTTS

Create a custom Speech-to-Text model for 💫 Your Voice 💫 with Common Voice

Best Procedure For Voice Cloning

https://tts.readthedocs.io/en/latest/faq.html#how-can-i-train-my-own-tts-model

 

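The following code clones a voice easily and takes 11 seconds. A multilingual model is required. The main issue for now is probably performance. If nothing else works, generate basic pinyin and let Ekho synthesize it.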

from TTS.api import TTS
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts", progress_bar=False, gpu=False)
tts.tts_to_file("This is voice cloning.", speaker_wav="cameron.wav", language="en", file_path="output.wav")

Coqui STT

https://github.com/coqui-ai/STT

Tacotron2

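Tacotron, published in 2017, was one of the first successful applications of deep learning to TTS. Tacotron is mainly an encoder-decoder model with attention.

Tacotron2 followed at the end of 2017. This model takes 74 seconds to synthesize "Hello World".

In 2020, Eren Gölge of Coqui proposed the Tacotron2 Double Decoder Consistency model. This model takes 9 seconds to synthesize "Hello World".

Reference: https://tts.readthedocs.io/en/latest/models/tacotron1-2.html

Learning PyTorch's speech synthesis module for Tacotron2: https://pytorch.org/audio/stable/tutorials/tacotron2_pipeline_tutorial.html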

We need to fix a network issue:

/home/hgneng/miniconda3/envs/tacotron2/lib/python3.10/site-packages/torch/hub.py download_url_to_file from https://public-asai-dl-models.s3.eu-central-1.amazonaws.com/DeepPhonemizer/en_us_cmudict_forward.pt  to  /home/hgneng/.cache/torch/hub/checkpoints/en_us_cmudict_forward.pt

The vocoder that converts spectrograms to wav seems slow.

This model takes 139 seconds to synthesize "Hello World", far more than Coqui's Tacotron2 DDC model (9 seconds).

https://pytorch.org/audio/stable/generated/torchaudio.pipelines.Tacotron2TTSBundle.Vocoder.html?highlight=vocoder#torchaudio.pipelines.Tacotron2TTSBundle.Vocoder

What's the difference between Phoneme-based TTS and Character-based TTS?

Phoneme-based TTS is a text-to-speech system that uses the sounds of a language (phonemes) to generate speech. It is more accurate than character-based TTS because it is based on a more detailed analysis of the language. Character-based TTS, on the other hand, is a text-to-speech system that uses characters (or words) to generate speech. It is less accurate than phoneme-based TTS because it does not take into account the nuances of the language.
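A toy illustration of the difference in model input. The two-word lexicon is hand-written in ARPAbet-style symbols purely for this example; a real system would use a full pronunciation dictionary or a grapheme-to-phoneme model.

```python
# Tiny hand-written lexicon (illustrative only).
TOY_LEXICON = {
    "though": ["DH", "OW"],
    "tough": ["T", "AH", "F"],
}


def character_tokens(word: str) -> list[str]:
    """Character-based input: the model sees the spelling."""
    return list(word.lower())


def phoneme_tokens(word: str) -> list[str]:
    """Phoneme-based input: the model sees the pronunciation."""
    return TOY_LEXICON[word.lower()]


for word in ("though", "tough"):
    print(word, character_tokens(word), phoneme_tokens(word))
```

The two spellings share the letters "ough", yet their phoneme sequences have nothing in common, which is the kind of ambiguity a character-based model has to learn from data alone.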

https://pytorch.org/audio/2.0.1/pipelines.html

Tacotron2 Data Model

Understand its models and check whether one is usable for Chinese; if not, find a way to train one myself: https://pytorch.org/audio/stable/pipelines.html

DeepPhonemizer

DeepPhonemizer is a multilingual grapheme-to-phoneme modeling library that leverages recent deep learning technology and is optimized for usage in production systems such as TTS. In particular, the library should be accurate, fast, easy to use. Moreover, you can train a custom model on your own dataset in a few lines of code.

DeepPhonemizer is compatible with Python 3.6+ and is distributed under the MIT license.

Read the documentation at: https://as-ideas.github.io/DeepPhonemizer/

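AISHELL-3 High-Fidelity Chinese Speech Database (希尔贝壳 AISHELL)

The AISHELL-3 Mandarin speech corpus contains 85 hours of speech in 88,035 utterances and can be used for multi-speaker synthesis systems. It was recorded in a quiet indoor environment with high-fidelity microphones (44.1 kHz, 16 bit). 218 speakers from different accent regions of China took part. Professional annotators labelled pinyin and prosody, and after strict quality checks the character-level transcription accuracy of the corpus is above 98%.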

https://www.aishelltech.com/aishell_3

http://www.openslr.org/93/

Other common speech corpora: https://blog.csdn.net/weixin_44649780/article/details/129405327

Common Voice Dataset

We’re building an open source, multi-language dataset of voices that anyone can use to train speech-enabled applications.

Includes both Cantonese and Mandarin Chinese!!

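The sampled Cantonese (Chinese Hong Kong) data is of poor quality: the speakers' voices are not clear enough (not voice-actor quality), background noise is heavy, and the label files contain errors. There is also a separate Cantonese category.

Generating data with an existing TTS would probably give much better quality.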

Librosa

audio and music processing in Python

Conda

We should avoid installing packages in base. If there is a conflict, remove the packages from base.

cheatsheet

How to activate conda env in Visual Studio Code?

1. Open Visual Studio Code.  
2. Go to the Extensions tab (Ctrl+Shift+X) and install the Python extension.  
3. Go to File > Preferences > Settings.  
4. In the left pane, search for “conda”.  
5. In the right pane, search for “python.condaPath” and set the path to your Anaconda installation.  
6. In the left pane, search for “conda env”.  
7. In the right pane, search for “python.condaEnvFile” and set the path to your environment file.  
8. Finally, open the Command Palette (Ctrl+Shift+P) and select the Python: Select Interpreter command.  
9. Select the environment you would like to activate in Visual Studio Code.

Comments (13)

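蓦然回首 (unverified)

2 years ago

I have installed this TTS on Debian. How do I call it? Thanks.

蓦然回首 (unverified)

2 years ago

Would it be possible to build a version based on this TTS that Orca can call?

蓦然回首 (unverified)

2 years ago

Does it still not support Chinese? Would it be possible to make it support Chinese?

孟繁永 (unverified)

1 year 5 months ago

Using a full stop as an explicit terminator, i.e. manually appending one to short text, avoids the unexpected trailing tremolo. For example:
tts.tts_to_file("你好。")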
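The tip in this comment can be wrapped in a small helper that appends a terminator only when the text does not already end with sentence punctuation. ensure_terminal_punctuation is a hypothetical name and the set of accepted end marks is an assumption.

```python
def ensure_terminal_punctuation(text: str, terminator: str = "。") -> str:
    """Append `terminator` unless the text already ends with sentence punctuation."""
    stripped = text.rstrip()
    if stripped and stripped[-1] not in "。!?.!?":
        return stripped + terminator
    return stripped


print(ensure_terminal_punctuation("你好"))    # 你好。
print(ensure_terminal_punctuation("你好。"))  # 你好。
```

The result can then be passed to tts.tts_to_file in place of the raw text.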