
TTS: Giving Digital Humans Realistic Voice Interaction

Edge-TTS

Edge-TTS is a Python library that uses Microsoft's Azure Cognitive Services to perform text-to-speech (TTS) conversion.

The library provides a simple API for converting text to speech and supports many languages and voices. To use Edge-TTS, first install it with pip:

pip install -U edge-tts

For more detailed usage, see https://github.com/rany2/edge-tts

Based on its source code, I wrote an EdgeTTS class that is easier to use and additionally saves a subtitle file, which improves the experience.

```python
import asyncio
from io import TextIOWrapper
from typing import TextIO, Union

from edge_tts import Communicate, SubMaker, list_voices


def list_voices_fn(proxy=None):
    # Synchronous wrapper around the async edge_tts.list_voices
    return asyncio.run(list_voices(proxy=proxy))


class EdgeTTS:
    def __init__(self, list_voices=False, proxy=None) -> None:
        voices = list_voices_fn(proxy=proxy)
        self.SUPPORTED_VOICE = [item['ShortName'] for item in voices]
        self.SUPPORTED_VOICE.sort(reverse=True)
        if list_voices:
            print(", ".join(self.SUPPORTED_VOICE))

    def preprocess(self, rate, volume, pitch):
        # Convert numeric values into the signed string format edge-tts expects
        if rate >= 0:
            rate = f'+{rate}%'
        else:
            rate = f'{rate}%'
        if pitch >= 0:
            pitch = f'+{pitch}Hz'
        else:
            pitch = f'{pitch}Hz'
        volume = 100 - volume
        volume = f'-{volume}%'
        return rate, volume, pitch

    def predict(self, TEXT, VOICE, RATE, VOLUME, PITCH,
                OUTPUT_FILE='result.wav', OUTPUT_SUBS='result.vtt',
                words_in_cue=8):
        async def amain() -> None:
            """Generate the audio file and a word-boundary subtitle file."""
            rate, volume, pitch = self.preprocess(rate=RATE, volume=VOLUME, pitch=PITCH)
            communicate = Communicate(TEXT, VOICE, rate=rate, volume=volume, pitch=pitch)
            subs: SubMaker = SubMaker()
            sub_file: Union[TextIOWrapper, TextIO] = open(OUTPUT_SUBS, "w", encoding="utf-8")
            async for chunk in communicate.stream():
                if chunk["type"] == "audio":
                    pass  # audio is written below by communicate.save()
                elif chunk["type"] == "WordBoundary":
                    subs.create_sub((chunk["offset"], chunk["duration"]), chunk["text"])
            sub_file.write(subs.generate_subs(words_in_cue))
            sub_file.close()
            await communicate.save(OUTPUT_FILE)

        asyncio.run(amain())

        # Strip spaces from subtitle text lines (but leave timestamp lines intact)
        with open(OUTPUT_SUBS, 'r', encoding='utf-8') as file:
            vtt_lines = file.readlines()
        vtt_lines_without_spaces = [
            line.replace(" ", "") if "-->" not in line else line
            for line in vtt_lines
        ]
        with open(OUTPUT_SUBS, 'w', encoding='utf-8') as output_file:
            output_file.writelines(vtt_lines_without_spaces)
        return OUTPUT_FILE, OUTPUT_SUBS
```
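The preprocess step above maps numeric rate/volume/pitch values onto the signed string format that edge-tts expects (e.g. `+10%`, `-5Hz`). A minimal standalone sketch of the same logic (`format_prosody` is an illustrative name, not part of edge-tts):

```python
def format_prosody(rate: int, volume: int, pitch: int):
    """Convert numeric prosody values into edge-tts style strings."""
    rate_s = f'+{rate}%' if rate >= 0 else f'{rate}%'
    pitch_s = f'+{pitch}Hz' if pitch >= 0 else f'{pitch}Hz'
    # Volume is expressed as attenuation from 100%: 100 -> '-0%', 80 -> '-20%'
    volume_s = f'-{100 - volume}%'
    return rate_s, volume_s, pitch_s

print(format_prosody(10, 100, -5))   # ('+10%', '-0%', '-5Hz')
```

Note that volume is given as a reduction from full volume, which is why `volume=100` produces the no-op string `-0%`.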

A simple WebUI is also provided in the src folder; run it with:

python app.py

(Screenshot: TTS WebUI)

PaddleTTS

In practice, you may need to work offline. Since Edge-TTS requires an internet connection to generate speech, we chose the equally open-source PaddleSpeech as an alternative text-to-speech (TTS) solution. The voice quality may differ, but PaddleSpeech supports fully offline operation. For more information, see the PaddleSpeech GitHub page: https://github.com/PaddlePaddle/PaddleSpeech

Vocoder Notes

PaddleSpeech ships with three vocoders: PWGan, WaveRnn, and HifiGan. They differ considerably in audio quality and generation speed, so choose according to your needs. We recommend PWGan or HifiGan, since WaveRNN's generation speed is extremely slow.

| Vocoder | Audio Quality | Generation Speed       |
|---------|---------------|------------------------|
| PWGan   | Medium        | Medium                 |
| WaveRnn |               | Very slow (be patient) |
| HifiGan |               |                        |

TTS Datasets

The examples in PaddleSpeech are organized mainly by dataset; the TTS datasets we use most are:

  • CSMSC (Mandarin, single speaker)
  • AISHELL3 (Mandarin, multiple speakers)
  • LJSpeech (English, single speaker)
  • VCTK (English, multiple speakers)

PaddleSpeech TTS Model Mapping

PaddleSpeech's TTS example names correspond to the following models:

  • tts0 - Tacotron2
  • tts1 - TransformerTTS
  • tts2 - SpeedySpeech
  • tts3 - FastSpeech2
  • voc0 - WaveFlow
  • voc1 - Parallel WaveGAN
  • voc2 - MelGAN
  • voc3 - MultiBand MelGAN
  • voc4 - Style MelGAN
  • voc5 - HiFiGAN
  • vc0 - Tacotron2 Voice Clone with GE2E
  • vc1 - FastSpeech2 Voice Clone with GE2E
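Putting this mapping together with the dataset list: a released model name is the example's model name joined with the dataset it was trained on (e.g. fastspeech2 trained on CSMSC is released as fastspeech2_csmsc). A small sketch of that naming convention (`build_model_name` and the dictionary are illustrative, not part of PaddleSpeech):

```python
# A subset of the example-to-model mapping listed above.
TTS_EXAMPLES = {
    'tts0': 'tacotron2',
    'tts2': 'speedyspeech',
    'tts3': 'fastspeech2',
    'voc1': 'pwgan',
    'voc5': 'hifigan',
}

def build_model_name(example: str, dataset: str) -> str:
    """Join an example's model name with a dataset name,
    mirroring the '<model>_<dataset>' pattern of released models."""
    return f"{TTS_EXAMPLES[example]}_{dataset}"

print(build_model_name('tts3', 'csmsc'))     # fastspeech2_csmsc
print(build_model_name('voc5', 'aishell3'))  # hifigan_aishell3
```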

Pretrained Model List

The following pretrained models provided by PaddleSpeech can be used from the command line or via the Python API:

Acoustic Models

| Model                        | Language |
|------------------------------|----------|
| speedyspeech_csmsc           | zh       |
| fastspeech2_csmsc            | zh       |
| fastspeech2_ljspeech         | en       |
| fastspeech2_aishell3         | zh       |
| fastspeech2_vctk             | en       |
| fastspeech2_cnndecoder_csmsc | zh       |
| fastspeech2_mix              | mix      |
| tacotron2_csmsc              | zh       |
| tacotron2_ljspeech           | en       |
| fastspeech2_male             | zh       |
| fastspeech2_male             | en       |
| fastspeech2_male             | mix      |
| fastspeech2_canton           | canton   |

Vocoders

| Model              | Language |
|--------------------|----------|
| pwgan_csmsc        | zh       |
| pwgan_ljspeech     | en       |
| pwgan_aishell3     | zh       |
| pwgan_vctk         | en       |
| mb_melgan_csmsc    | zh       |
| style_melgan_csmsc | zh       |
| hifigan_csmsc      | zh       |
| hifigan_ljspeech   | en       |
| hifigan_aishell3   | zh       |
| hifigan_vctk       | en       |
| wavernn_csmsc      | zh       |
| pwgan_male         | zh       |
| hifigan_male       | zh       |

Based on PaddleSpeech, I wrote a PaddleTTS class that makes it easier to use and to get results.

```python
import os

from paddlespeech.cli.tts.infer import TTSExecutor


class PaddleTTS:
    def __init__(self) -> None:
        pass

    def predict(self, text, am, voc, spk_id=174, lang='zh', male=False,
                save_path='output.wav'):
        self.tts = TTSExecutor()
        use_onnx = True
        voc = voc.lower()
        am = am.lower()

        # Male voice: only the dedicated male models are available
        if male:
            assert voc in ["pwgan", "hifigan"], "male voc must be 'pwgan' or 'hifigan'"
            wav_file = self.tts(
                text=text,
                output=save_path,
                am='fastspeech2_male',
                voc=voc + '_male',
                lang=lang,
                use_onnx=use_onnx)
            return wav_file

        assert am in ['tacotron2', 'fastspeech2'], "am must be 'tacotron2' or 'fastspeech2'"

        # Mixed Chinese/English synthesis (only fastspeech2 supports mix)
        if lang == 'mix':
            am = 'fastspeech2_mix'
            voc += '_csmsc'
        # English synthesis
        elif lang == 'en':
            am += '_ljspeech'
            voc += '_ljspeech'
        # Chinese synthesis
        elif lang == 'zh':
            assert voc in ['wavernn', 'pwgan', 'hifigan', 'style_melgan', 'mb_melgan'], \
                "voc must be 'wavernn', 'pwgan', 'hifigan', 'style_melgan' or 'mb_melgan'"
            am += '_csmsc'
            voc += '_csmsc'
        # Cantonese synthesis uses fixed models and speaker 10
        elif lang == 'canton':
            am = 'fastspeech2_canton'
            voc = 'pwgan_aishell3'
            spk_id = 10

        print("am:", am, "voc:", voc, "lang:", lang, "male:", male, "spk_id:", spk_id)
        try:
            # Try the command-line interface first
            cmd = (f'paddlespeech tts --am {am} --voc {voc} --input "{text}" '
                   f'--output {save_path} --lang {lang} --spk_id {spk_id} '
                   f'--use_onnx {use_onnx}')
            os.system(cmd)
            wav_file = save_path
        except Exception:
            # Fall back to the Python API
            wav_file = self.tts(
                text=text,
                output=save_path,
                am=am,
                voc=voc,
                lang=lang,
                spk_id=spk_id,
                use_onnx=use_onnx)
        return wav_file
```
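The model-name resolution inside predict can be factored out as a pure function, which makes the naming rules easy to verify without installing PaddleSpeech. A sketch under that assumption (`resolve_models` is a hypothetical helper mirroring the logic above, not part of PaddleSpeech):

```python
def resolve_models(am: str, voc: str, lang: str, spk_id: int = 174):
    """Map generic acoustic-model/vocoder names plus a language code
    onto released PaddleSpeech model names, as PaddleTTS.predict does."""
    am, voc = am.lower(), voc.lower()
    if lang == 'mix':      # only fastspeech2 supports mixed zh/en
        return 'fastspeech2_mix', voc + '_csmsc', spk_id
    if lang == 'en':
        return am + '_ljspeech', voc + '_ljspeech', spk_id
    if lang == 'zh':
        return am + '_csmsc', voc + '_csmsc', spk_id
    if lang == 'canton':   # Cantonese is pinned to fixed models and speaker 10
        return 'fastspeech2_canton', 'pwgan_aishell3', 10
    raise ValueError(f"unsupported lang: {lang}")

print(resolve_models('fastspeech2', 'pwgan', 'zh'))
# ('fastspeech2_csmsc', 'pwgan_csmsc', 174)
```

Keeping this logic side-effect free also makes it straightforward to validate an am/voc/lang combination before spending time loading models.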