PyTorch Deep Learning Primer (8): Text-to-Speech with Tacotron2 in Torchaudio

https://blog.csdn.net/ajunbin859/article/details/134380417

 

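Overview
This tutorial shows how to build a text-to-speech pipeline using the pretrained Tacotron2 in torchaudio.
Tacotron2 is an end-to-end neural network architecture for speech synthesis. It has two parts: a recurrent network with an attention mechanism that autoregressively generates a mel spectrogram sequence, and a modified WaveNet that maps the mel spectrogram sequence to audio. In Tacotron2, the linear spectrogram is first computed with a short-time Fourier transform (STFT) using a 50 ms frame length, a 12.5 ms frame hop, and a Hann window. The STFT magnitudes are then passed through an 80-channel mel filterbank spanning 125 Hz to 7.6 kHz, followed by logarithmic compression, which moves the STFT magnitudes onto the mel scale. Before the log compression, the filterbank output magnitudes are clipped to a minimum of 0.01 to limit their dynamic range in the log domain. Finally, a modified WaveNet vocoder maps the mel spectrogram to audio. Tacotron2 achieves naturalness very close to ground truth, with a mean opinion score (MOS) of 4.53 compared with 4.58 for the ground-truth recordings.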
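The framing and compression numbers above can be made concrete with a little arithmetic. The sketch below converts the 50 ms frame and 12.5 ms hop into sample counts and applies the 0.01 floor before the logarithm. The function names, the 16 kHz sample rate, and the choice of the natural logarithm are illustrative assumptions, not values taken from torchaudio.

```python
import math


def frame_params(sample_rate, frame_ms=50.0, hop_ms=12.5):
    """Convert frame length and hop from milliseconds to whole samples."""
    win_length = round(sample_rate * frame_ms / 1000)
    hop_length = round(sample_rate * hop_ms / 1000)
    return win_length, hop_length


def log_compress(magnitude, floor=0.01):
    """Clip a filterbank magnitude to `floor`, then take the natural log."""
    return math.log(max(magnitude, floor))


# 16 kHz is an assumed sample rate, chosen so the results are whole numbers.
print(frame_params(16000))          # (800, 200)
print(round(log_compress(0.0), 3))  # -4.605, the lowest value the log can take
print(log_compress(1.0))            # 0.0
```

Without the floor, a magnitude of zero would have no logarithm at all, which is why the clipping step comes first.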

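The text-to-speech pipeline works as follows:
Text preprocessing
First, the input text is encoded into a list of symbols. In this tutorial, we use English characters and phonemes as the symbols.

Spectrogram generation
A spectrogram is generated from the encoded text. We use the Tacotron2 model for this.

Time-domain conversion
The last step is converting the spectrogram into a waveform. The component that generates speech from a spectrogram is called a vocoder. This tutorial uses three different vocoders: WaveRNN, GriffinLim, and Nvidia's WaveGlow.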

The following figure illustrates the whole process.

 

All the related components are bundled in torchaudio.pipelines.Tacotron2TTSBundle, but this tutorial also covers what happens under the hood.

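Preparation
First, we install the necessary dependencies. In addition to torchaudio, DeepPhonemizer is required to perform phoneme-based encoding.

Install it from the command line: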

pip3 install deep_phonemizer
import torch
import torchaudio

torch.random.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"

print(torch.__version__)
print(torchaudio.__version__)
print(device)
Output:

2.1.1
2.1.0
cuda
import IPython
import matplotlib.pyplot as plt
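Text processing
Character-based encoding
In this section, we go through how character-based encoding works.

Since the pretrained Tacotron2 model expects a specific set of symbol tables, the same functionality is available in torchaudio. This section is more of an explanation of the basics of encoding.

First, we define the set of symbols. For example, we can use "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz". Then we map each character of the input text to the index of the corresponding symbol in the table.

The following is an example of such processing. In the example, symbols that are not in the table are ignored.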

symbols = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
look_up = {s: i for i, s in enumerate(symbols)}
symbols = set(symbols)


def text_to_sequence(text):
    text = text.lower()
    return [look_up[s] for s in text if s in symbols]


text = "Hello world! Text to speech!"
print(text_to_sequence(text))
Output:

[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2, 11, 31, 16, 35, 31, 11, 31, 26, 11, 30, 27, 16, 16, 14, 19, 2]
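As mentioned above, the symbol table and indices must match what the pretrained Tacotron2 model expects. torchaudio provides the transform along with the pretrained model. For example, you can instantiate and use such a transform as follows.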

processor = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH.get_text_processor()

text = "Hello world! Text to speech!"
processed, lengths = processor(text)

print(processed)
print(lengths)
Output:

tensor([[19, 16, 23, 23, 26, 11, 34, 26, 29, 23, 15, 2, 11, 31, 16, 35, 31, 11,
31, 26, 11, 30, 27, 16, 16, 14, 19, 2]])
tensor([28], dtype=torch.int32)
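The processor object takes either a text or a list of texts as input. When a list of texts is provided, the returned lengths variable holds the valid length of each processed token sequence in the output batch.

The intermediate representation can be retrieved as follows.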

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
Output:

['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd', '!', ' ', 't', 'e', 'x', 't', ' ', 't', 'o', ' ', 's', 'p', 'e', 'e', 'c', 'h', '!']
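To see why a batch needs a lengths value, here is a minimal pure-Python sketch of batch encoding with right-padding. It is not torchaudio's implementation: the names encode_batch and decode are hypothetical, and using index 0 ("_") as the padding value is an assumption.

```python
SYMBOLS = "_-!'(),.:;? abcdefghijklmnopqrstuvwxyz"
LOOK_UP = {s: i for i, s in enumerate(SYMBOLS)}
PAD = LOOK_UP["_"]  # assumption: index 0 doubles as the padding value


def encode_batch(texts):
    """Encode each text, then right-pad every row to the longest one."""
    rows = [[LOOK_UP[c] for c in t.lower() if c in LOOK_UP] for t in texts]
    lengths = [len(r) for r in rows]
    width = max(lengths, default=0)
    padded = [r + [PAD] * (width - len(r)) for r in rows]
    return padded, lengths


def decode(row, length):
    """Map the first `length` indices of a row back to symbols."""
    return "".join(SYMBOLS[i] for i in row[:length])


batch, lengths = encode_batch(["Hello world!", "Hi!"])
print(lengths)                       # [12, 3]
print(batch[1])                      # [19, 20, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(decode(batch[1], lengths[1]))  # hi!
```

The padded rows share one width so they can be stacked into a single tensor, and lengths tells later stages where the real symbols end.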
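Phoneme-based encoding
Phoneme-based encoding is similar to character-based encoding, but it uses a symbol table based on phonemes and a G2P (Grapheme-to-Phoneme) model.

The details of the G2P model are beyond the scope of this tutorial; we will only look at what the conversion produces.

As with character-based encoding, the encoding process is expected to match what the pretrained Tacotron2 model was trained on. torchaudio has an interface for creating the processor.

The following code shows how to create and use it. Behind the scenes, a G2P model is created with the DeepPhonemizer package, and the pretrained weights published by the author of DeepPhonemizer are fetched.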

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()

text = "Hello world! Text to speech!"
with torch.inference_mode():
    processed, lengths = processor(text)

print(processed)
print(lengths)
Output:

0%| | 0.00/63.6M [00:00<?, ?B/s]
0%| | 56.0k/63.6M [00:00<03:30, 317kB/s]
0%| | 240k/63.6M [00:00<01:29, 741kB/s]
1%|1 | 800k/63.6M [00:00<00:35, 1.85MB/s]
4%|3 | 2.37M/63.6M [00:00<00:13, 4.73MB/s]
8%|7 | 4.87M/63.6M [00:00<00:07, 8.23MB/s]
12%|#2 | 7.93M/63.6M [00:01<00:05, 11.5MB/s]
17%|#7 | 11.1M/63.6M [00:01<00:04, 13.6MB/s]
22%|##2 | 14.3M/63.6M [00:01<00:03, 15.2MB/s]
28%|##7 | 17.6M/63.6M [00:01<00:02, 16.3MB/s]
33%|###2 | 20.9M/63.6M [00:01<00:02, 17.2MB/s]
38%|###7 | 23.9M/63.6M [00:01<00:02, 20.2MB/s]
41%|#### | 26.0M/63.6M [00:02<00:02, 17.6MB/s]
46%|####5 | 29.2M/63.6M [00:02<00:02, 18.0MB/s]
51%|##### | 32.4M/63.6M [00:02<00:01, 21.1MB/s]
54%|#####4 | 34.6M/63.6M [00:02<00:01, 18.5MB/s]
60%|#####9 | 38.2M/63.6M [00:02<00:01, 22.5MB/s]
64%|######3 | 40.6M/63.6M [00:02<00:01, 19.7MB/s]
69%|######8 | 43.8M/63.6M [00:03<00:01, 19.4MB/s]
74%|#######3 | 47.1M/63.6M [00:03<00:00, 22.5MB/s]
78%|#######7 | 49.5M/63.6M [00:03<00:00, 19.6MB/s]
81%|########1 | 51.8M/63.6M [00:03<00:00, 17.4MB/s]
87%|########6 | 55.2M/63.6M [00:03<00:00, 18.1MB/s]
92%|#########2| 58.6M/63.6M [00:03<00:00, 18.6MB/s]
98%|#########8| 62.6M/63.6M [00:04<00:00, 19.9MB/s]
100%|##########| 63.6M/63.6M [00:04<00:00, 16.6MB/s]
/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py:282: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
tensor([[54, 20, 65, 69, 11, 92, 44, 65, 38, 2, 11, 81, 40, 64, 79, 81, 11, 81,
20, 11, 79, 77, 59, 37, 2]])
tensor([25], dtype=torch.int32)

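Note that the encoded values differ from those in the character-based encoding example.

The intermediate representation looks like this.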

print([processor.tokens[i] for i in processed[0, : lengths[0]]])
Output:

['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', '!', ' ', 'T', 'EH', 'K', 'S', 'T', ' ', 'T', 'AH', ' ', 'S', 'P', 'IY', 'CH', '!']
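As a rough illustration of what a grapheme-to-phoneme step produces, the sketch below replaces the neural G2P model with a tiny lookup table. This is a deliberate simplification and not how DeepPhonemizer works. The toy_g2p function, the LEXICON entries (copied from the output above), and the KEEP set are all hypothetical.

```python
import re

# Toy lexicon copied from the output above; a real G2P model predicts
# pronunciations for arbitrary words instead of looking them up.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "text": ["T", "EH", "K", "S", "T"],
    "to": ["T", "AH"],
    "speech": ["S", "P", "IY", "CH"],
}
KEEP = set(" !'(),-.:;?")  # assumed set of symbols passed through unchanged


def toy_g2p(text):
    """Split text into words and single characters, then look up phonemes."""
    phonemes = []
    for token in re.findall(r"[a-z]+|[^a-z]", text.lower()):
        if token in LEXICON:
            phonemes.extend(LEXICON[token])
        elif token in KEEP:
            phonemes.append(token)
        # Anything else (unknown words, digits) is dropped in this sketch.
    return phonemes


print(toy_g2p("Hello world!"))
# ['HH', 'AH', 'L', 'OW', ' ', 'W', 'ER', 'L', 'D', '!']
```

A lookup table fails on any word it has not seen, which is the gap a trained G2P model is meant to fill.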
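Spectrogram generation
Tacotron2 is the model we use to generate a spectrogram from the encoded text. For details of the model, refer to the paper.

Instantiating a Tacotron2 model with pretrained weights is easy, but note that the input to the Tacotron2 model must be produced by the matching text processor.

torchaudio.pipelines.Tacotron2TTSBundle bundles the matching model and processor together so that the pipeline is easy to create.

For the available bundles and their usage, refer to Tacotron2TTSBundle.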

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH
processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, _, _ = tacotron2.infer(processed, lengths)


_ = plt.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")

Output:

/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py:282: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_phonemes_1500_epochs_wavernn_ljspeech.pth

0%| | 0.00/107M [00:00<?, ?B/s]
33%|###3 | 35.9M/107M [00:00<00:00, 377MB/s]
72%|#######1 | 76.9M/107M [00:00<00:00, 408MB/s]
100%|##########| 107M/107M [00:00<00:00, 408MB/s]
Note that the Tacotron2.infer method performs multinomial sampling, so the process of generating the spectrogram involves randomness.

def plot():
    fig, ax = plt.subplots(3, 1)
    for i in range(3):
        with torch.inference_mode():
            spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
        print(spec[0].shape)
        ax[i].imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")


plot()

Output:

torch.Size([80, 190])
torch.Size([80, 184])
torch.Size([80, 185])
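The second dimension of each shape is the number of spectrogram frames, so the three runs differ slightly in duration. The sketch below turns frame counts into approximate seconds. The hop length of 275 samples and the 22050 Hz sample rate are assumptions about the vocoder configuration, and frames_to_seconds is a hypothetical helper.

```python
def frames_to_seconds(num_frames, hop_length, sample_rate):
    """Approximate audio duration covered by `num_frames` spectrogram frames."""
    return num_frames * hop_length / sample_rate


# Assumed values: a hop of 275 samples at 22050 Hz. Check the vocoder you use.
for frames in (190, 184, 185):
    print(frames, round(frames_to_seconds(frames, 275, 22050), 2))
# 190 2.37
# 184 2.29
# 185 2.31
```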
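Waveform generation
Once the spectrogram is generated, the last step is to recover the waveform from it.

torchaudio provides vocoders based on GriffinLim and WaveRNN.

WaveRNN
Continuing from the previous section, we can instantiate the matching WaveRNN model from the same bundle.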

bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

text = "Hello world! Text to speech!"

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
    waveforms, lengths = vocoder(spec, spec_lengths)
Output:

/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py:282: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
Downloading: "https://download.pytorch.org/torchaudio/models/wavernn_10k_epochs_8bits_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/wavernn_10k_epochs_8bits_ljspeech.pth

0%| | 0.00/16.7M [00:00<?, ?B/s]
100%|##########| 16.7M/16.7M [00:00<00:00, 342MB/s]
def plot(waveforms, spec, sample_rate):
    waveforms = waveforms.cpu().detach()

    fig, [ax1, ax2] = plt.subplots(2, 1)
    ax1.plot(waveforms[0])
    ax1.set_xlim(0, waveforms.size(-1))
    ax1.grid(True)
    ax2.imshow(spec[0].cpu().detach(), origin="lower", aspect="auto")
    return IPython.display.Audio(waveforms[0:1], rate=sample_rate)


plot(waveforms, spec, vocoder.sample_rate)
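IPython.display.Audio only plays inside a notebook. If you want to keep the result as a file, one option is Python's standard-library wave module. The sketch below is a hypothetical helper, write_wav, that expects plain Python floats in [-1, 1]. With the tensors above you could presumably obtain them with waveforms[0].cpu().tolist(), which assumes the first dimension is the batch. A sine tone stands in for the synthesized speech here.

```python
import io
import math
import struct
import wave


def write_wav(dest, samples, sample_rate):
    """Write mono float samples in [-1, 1] as 16-bit PCM to a path or file object."""
    frames = bytearray()
    for s in samples:
        s = max(-1.0, min(1.0, s))  # clip to the valid range
        frames += struct.pack("<h", int(round(s * 32767)))
    with wave.open(dest, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))


# One second of a 440 Hz tone as stand-in data, written to memory.
rate = 22050
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
buffer = io.BytesIO()
write_wav(buffer, tone, rate)

# Read the header back to confirm what was written.
buffer.seek(0)
with wave.open(buffer, "rb") as wav:
    info = (wav.getnchannels(), wav.getframerate(), wav.getnframes())
print(info)  # (1, 22050, 22050)
```

Passing a file name such as "speech.wav" instead of the BytesIO object writes to disk.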


Griffin-Lim
Using the Griffin-Lim vocoder works the same way as WaveRNN. You can instantiate the vocoder object with the get_vocoder() method and pass it the spectrogram.

bundle = torchaudio.pipelines.TACOTRON2_GRIFFINLIM_PHONE_LJSPEECH

processor = bundle.get_text_processor()
tacotron2 = bundle.get_tacotron2().to(device)
vocoder = bundle.get_vocoder().to(device)

with torch.inference_mode():
    processed, lengths = processor(text)
    processed = processed.to(device)
    lengths = lengths.to(device)
    spec, spec_lengths, _ = tacotron2.infer(processed, lengths)
waveforms, lengths = vocoder(spec, spec_lengths)
Output:

/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/modules/transformer.py:282: UserWarning: enable_nested_tensor is True, but self.use_nested_tensor is False because encoder_layer.self_attn.batch_first was not True(use batch_first for better inference performance)
warnings.warn(f"enable_nested_tensor is True, but self.use_nested_tensor is False because {why_not_sparsity_fast_path}")
Downloading: "https://download.pytorch.org/torchaudio/models/tacotron2_english_phonemes_1500_epochs_ljspeech.pth" to /root/.cache/torch/hub/checkpoints/tacotron2_english_phonemes_1500_epochs_ljspeech.pth

0%| | 0.00/107M [00:00<?, ?B/s]
35%|###5 | 37.8M/107M [00:00<00:00, 397MB/s]
72%|#######2 | 77.7M/107M [00:00<00:00, 409MB/s]
100%|##########| 107M/107M [00:00<00:00, 411MB/s]
plot(waveforms, spec, vocoder.sample_rate)


Waveglow
Waveglow is a vocoder published by Nvidia. The pretrained weights are published on Torch Hub. The model can be instantiated with the torch.hub module.

# Workaround to load model mapped on GPU
# https://stackoverflow.com/a/61840832
waveglow = torch.hub.load(
    "NVIDIA/DeepLearningExamples:torchhub",
    "nvidia_waveglow",
    model_math="fp32",
    pretrained=False,
)
checkpoint = torch.hub.load_state_dict_from_url(
    "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth",  # noqa: E501
    progress=False,
    map_location=device,
)
state_dict = {key.replace("module.", ""): value for key, value in checkpoint["state_dict"].items()}

waveglow.load_state_dict(state_dict)
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to(device)
waveglow.eval()

with torch.no_grad():
    waveforms = waveglow.infer(spec)

Output:

/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/hub.py:294: UserWarning: You are about to download and run code from an untrusted repository. In a future release, this won't be allowed. To add the repository to your trusted list, change the command to {calling_fn}(..., trust_repo=False) and a command prompt will appear asking for an explicit confirmation of trust, or load(..., trust_repo=True), which will assume that the prompt is to be answered with 'yes'. You can also use load(..., trust_repo='check') which will only prompt for confirmation if the repo is not already trusted. This will eventually be the default behaviour
warnings.warn(
Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/common.py:13: UserWarning: pytorch_quantization module not found, quantization will not be available
warnings.warn(
/root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub/PyTorch/Classification/ConvNets/image_classification/models/efficientnet.py:17: UserWarning: pytorch_quantization module not found, quantization will not be available
warnings.warn(
/pytorch/audio/ci_env/lib/python3.10/site-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
Downloading: "https://api.ngc.nvidia.com/v2/models/nvidia/waveglowpyt_fp32/versions/1/files/nvidia_waveglowpyt_fp32_20190306.pth" to /root/.cache/torch/hub/checkpoints/nvidia_waveglowpyt_fp32_20190306.pth
plot(waveforms, spec, 22050)
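The WaveGlow loading code renames checkpoint keys with key.replace("module.", ""). A "module." prefix typically appears when a model was saved while wrapped in torch.nn.DataParallel. Since str.replace removes every occurrence, a slightly stricter variant that only strips a leading prefix can be sketched as below. The strip_prefix helper and the example keys are hypothetical, and the integer values stand in for tensors.

```python
def strip_prefix(state_dict, prefix="module."):
    """Return a copy whose keys have a leading `prefix` removed."""
    return {
        (key[len(prefix):] if key.startswith(prefix) else key): value
        for key, value in state_dict.items()
    }


# Hypothetical checkpoint keys; real values would be tensors.
checkpoint = {"module.upsample.weight": 1, "module.WN.0.bias": 2, "submodule.x": 3}
print(strip_prefix(checkpoint))
# {'upsample.weight': 1, 'WN.0.bias': 2, 'submodule.x': 3}
```

With plain replace, the key "submodule.x" would become "subx". For the WaveGlow checkpoint used here the two approaches are expected to give the same result, so this is a defensive habit rather than a required fix.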

————————————————

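Copyright notice: this is an original article by the blogger, released under the CC 4.0 BY-SA license. When reposting, please include the link to the original source and this notice.

Original link: https://blog.csdn.net/ajunbin859/article/details/134380417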

posted on 2024-02-14 10:35 by 独上兰舟1