CMUSphinx Learn - Basic concepts of speech

Basic concepts of speech

Speech is a complex phenomenon. People rarely understand how it is produced and perceived. The naive perception is often that speech is built of words, and each word consists of phones. The reality is unfortunately very different. Speech is a dynamic process without clearly distinguished parts. It's always useful to open a sound editor, look at a recording of speech, and listen to it. Here, for example, is a speech recording in an audio editor.

 

 


All modern descriptions of speech are to some degree probabilistic. That means that there are no certain boundaries between units or between words. Speech-to-text conversion and other applications of speech are never 100% correct. That idea is rather unusual for software developers, who usually work with deterministic systems, and it creates a lot of issues specific to speech technology.


Structure of speech

In current practice, speech structure is understood as follows:

Speech is a continuous audio stream in which rather stable states mix with dynamically changing states. In this sequence of states, one can define more or less similar classes of sounds, or phones. Words are understood to be built of phones, but this is certainly not true. The acoustic properties of a waveform corresponding to a phone can vary greatly depending on many factors: phone context, speaker, style of speech, and so on. So-called coarticulation makes phones sound very different from their "canonical" representation. Next, since transitions between words are more informative than stable regions, developers often talk about diphones, parts of phones between two consecutive phones. Sometimes developers talk about subphonetic units, the different substates of a phone. Often three or more regions of a different nature can easily be found.


The number three is easily explained: the first part of the phone depends on its preceding phone, the middle part is stable, and the last part depends on the subsequent phone. That's why a phone is often modeled with three states in HMM-based recognition.
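The three-region idea can be sketched as a toy left-to-right topology. This is purely illustrative; the phone name and state labels below are invented, not CMUSphinx data structures.

```python
# Each phone is modeled as a left-to-right HMM with three emitting states,
# matching the begin / middle / end regions described above.

PHONE_STATES = ("begin", "middle", "end")

def phone_hmm(phone):
    """Build a toy left-to-right topology for one phone.

    Returns a dict mapping each state to the states it may move to:
    a self-loop (the region continues) and a forward arc (next region).
    """
    states = [f"{phone}[{s}]" for s in PHONE_STATES]
    topology = {}
    for i, state in enumerate(states):
        successors = [state]                  # self-loop
        if i + 1 < len(states):
            successors.append(states[i + 1])  # forward arc
        topology[state] = successors
    return topology

hmm = phone_hmm("AH")
print(hmm["AH[begin]"])   # ['AH[begin]', 'AH[middle]']
```

The final state has only a self-loop; in a full recognizer it would connect onward to the first state of the next phone.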


Sometimes phones are considered in context. There are triphones or even quinphones. Note that, unlike diphones, they are matched with the same range of the waveform as plain phones; they just differ by name. That's why we prefer to call this object a senone. A senone's dependence on context can be more complex than just the left and right neighbors; it can be a rather complex function defined by a decision tree or in some other way.


Next, phones build subword units, like syllables. Sometimes syllables are defined as "reduction-stable entities": when speech becomes fast, phones often change, but syllables remain the same. Syllables are also related to intonational contour. There are other ways to build subwords: morphologically based in morphology-rich languages, or phonetically based. Subwords are often used in open-vocabulary speech recognition.


Subwords form words. Words are important in speech recognition because they restrict combinations of phones significantly. If there are 40 phones and an average word has 7 phones, there could be up to 40^7 phone sequences. Luckily, even a very educated person rarely uses more than 20k words in practice, which makes recognition far more feasible.
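The arithmetic above is easy to check:

```python
# Back-of-the-envelope numbers from the paragraph above: with 40 phones and
# 7 phones per word, the space of raw phone strings dwarfs a real vocabulary.
possible = 40 ** 7        # 163,840,000,000 seven-phone strings
vocabulary = 20_000       # roughly the words an educated speaker actively uses
fraction = vocabulary / possible
print(f"{fraction:.2e}")  # a vanishingly small share of the possible strings
```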


Words and other non-linguistic sounds, which we call fillers (breath, um, uh, cough), form utterances. These are separate chunks of audio between pauses. They don't necessarily match sentences, which are more semantic concepts.


On top of this, there are dialog acts like turns, but they go beyond the scope of this document.

 

Recognition process

The common way to recognize speech is the following: we take the waveform, split it into utterances at silences, and then try to recognize what's being said in each utterance. To do that, we want to take all possible combinations of words and try to match them with the audio, choosing the best matching combination. There are a few important concepts in this matching.


First of all, there is the concept of features. Since the number of parameters in raw audio is large, we try to optimize it. Numbers are calculated from speech, usually by dividing it into frames. Then, for each frame, typically 10 milliseconds long, we extract 39 numbers that represent the speech. This is called a feature vector. How to generate these numbers is a subject of active investigation, but in the simple case they are derived from the spectrum.
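As a rough sketch of the framing step, the snippet below splits a waveform into 10 ms frames and computes one toy feature (log energy) per frame. A real front end, such as an MFCC pipeline, emits about 39 spectrum-derived numbers per frame; the single feature here is only a stand-in.

```python
import math

def frames(samples, rate=16000, frame_ms=10):
    """Split a waveform into consecutive, non-overlapping 10 ms frames."""
    n = int(rate * frame_ms / 1000)        # samples per frame: 160 at 16 kHz
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def log_energy(frame):
    """One toy feature per frame (real front ends emit ~39 numbers)."""
    return math.log(sum(s * s for s in frame) + 1e-10)

# One second of a 440 Hz tone sampled at 16 kHz, as stand-in audio.
audio = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
feats = [log_energy(f) for f in frames(audio)]
print(len(feats))   # 100 frames for one second of audio
```

Production front ends also overlap frames and apply windowing, which this sketch omits.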


Second, there is the concept of the model. A model describes a mathematical object that gathers common attributes of the spoken word. In practice, the audio model of a senone is a Gaussian mixture over its three states; to put it simply, it represents the most probable feature vectors. The concept of the model raises the following issues: how well does the model fit practice, can the model be improved despite its internal problems, and how adaptive is the model to changed conditions?
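To make the mixture idea concrete, here is a minimal one-dimensional Gaussian mixture density. Real senone models are mixtures over ~39-dimensional feature vectors with trained parameters; the weights, means, and variances below are invented for illustration.

```python
import math

def gauss(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_density(x, weights, means, variances):
    """Likelihood of observation x under a weighted mixture of Gaussians."""
    return sum(w * gauss(x, m, v)
               for w, m, v in zip(weights, means, variances))

# A toy senone state with two mixture components.
p = gmm_density(0.0, weights=[0.6, 0.4], means=[0.0, 2.0], variances=[1.0, 1.0])
```

During decoding, such densities score how well each incoming feature vector matches each senone state.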


Third, there is the matching process itself. Since comparing all feature vectors with all models would take longer than the universe has existed, the search is optimized by many tricks. At every point we maintain the best matching variants and extend them as time goes on, producing the best matching variants for the next frame.
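One of the common tricks is beam pruning: at each frame, keep only hypotheses whose score falls within a fixed beam of the current best, then extend only the survivors. A minimal sketch, with made-up labels and log scores:

```python
def beam_prune(hypotheses, beam=10.0):
    """Keep only hypotheses scoring within `beam` of the best one.

    hypotheses: dict mapping a partial-match label to its log score.
    """
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam}

# Three partial matches after some frame; the worst falls outside the beam.
hyps = {"h e l": -42.0, "h a l": -45.5, "y e l": -60.0}
survivors = beam_prune(hyps, beam=10.0)
print(sorted(survivors))   # ['h a l', 'h e l']
```

A wider beam considers more variants (more precise, slower); a narrower beam is faster but may prune the correct hypothesis, which is exactly the speed/accuracy trade-off described above.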


Models


According to the speech structure, three models are used in speech recognition to do the match:


 

An acoustic model contains acoustic properties for each senone. There are context-independent models, which contain properties (the most probable feature vectors for each phone), and context-dependent ones (built from senones with context).


A phonetic dictionary contains a mapping from words to phones. This mapping is not very effective; for example, only two or three pronunciation variants are noted in it, but it's practical enough most of the time. The dictionary is not the only way to map words to phones; the mapping could also be done with a complex function learned with a machine learning algorithm.
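A toy version of such a dictionary, with CMUdict-style phone names chosen for illustration (the entries here are abbreviated examples, not a real dictionary file):

```python
# Each word maps to a list of pronunciation variants,
# each variant being a sequence of phones.
DICTIONARY = {
    "hello": [["HH", "AH", "L", "OW"], ["HH", "EH", "L", "OW"]],  # two variants
    "world": [["W", "ER", "L", "D"]],
}

def pronunciations(word):
    """Map a word to its phone sequences; empty list if out of vocabulary."""
    return DICTIONARY.get(word.lower(), [])

print(pronunciations("hello"))
```

An out-of-vocabulary word returns no pronunciations, which is exactly why name recognition is hard with a fixed dictionary.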


A language model is used to restrict the word search. It defines which word can follow previously recognized words (remember that matching is a sequential process) and helps to significantly restrict the matching process by stripping improbable words. The most common language models are n-gram language models, which contain statistics of word sequences, and finite state language models, which define word sequences by finite state automata, sometimes with weights. To reach a good accuracy rate, your language model must be very successful at restricting the search space; that is, it should be very good at predicting the next word. A language model usually restricts the vocabulary considered to the words it contains, which is an issue for name recognition. To deal with this, a language model can contain smaller chunks like subwords or even phones. Note that in this case the search space restriction is usually worse, and the corresponding recognition accuracies are lower than with a word-based language model.
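A minimal maximum-likelihood bigram model illustrates the n-gram idea; real models are trained on large corpora and add smoothing for unseen word pairs, which this sketch omits.

```python
from collections import Counter

# A tiny stand-in training corpus.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """P(word | prev), estimated directly from counts (no smoothing)."""
    return bigrams[(prev, word)] / unigrams[prev] if unigrams[prev] else 0.0

# 2 of the 3 occurrences of "the" are followed by "cat".
print(bigram_prob("the", "cat"))
```

During decoding, such probabilities down-weight improbable continuations so the search can discard them early.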


These three entities are combined in an engine to recognize speech. If you are going to apply your engine to some other language, you need to get such structures in place. For many languages there are acoustic models, phonetic dictionaries, and even large-vocabulary language models available for download.


Other concepts used


A lattice is a directed graph that represents variants of the recognition. Often, getting the single best match is not practical; in that case, lattices are good intermediate formats to represent the recognition result.
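A lattice can be sketched as an adjacency list: nodes are word hypotheses, edges carry scores, and every path from start to end is one candidate transcription. The words and scores below are invented.

```python
# Edges: node -> list of (next_word, log_score). <s> and </s> mark the ends.
lattice = {
    "<s>": [("hello", -5.0), ("yellow", -7.5)],
    "hello": [("world", -4.0)],
    "yellow": [("world", -6.0)],
    "world": [("</s>", 0.0)],
}

def paths(node="<s>", score=0.0):
    """Enumerate every start-to-end path with its total score."""
    if node == "</s>":
        return [([node], score)]
    out = []
    for nxt, s in lattice.get(node, []):
        for tail, total in paths(nxt, score + s):
            out.append(([node] + tail, total))
    return out

best = max(paths(), key=lambda p: p[1])
print(best[0])   # ['<s>', 'hello', 'world', '</s>']
```

Enumerating all paths, as done here, is only feasible for toy graphs; the point of the lattice format is that it stores the alternatives compactly without enumerating them.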


N-best lists of variants are like lattices, though their representations are not as dense as the lattice ones.


Word confusion networks (sausages) are lattices where the strict order of nodes is taken from lattice edges.


Speech database - a set of typical recordings from the task domain. If we develop a dialog system, it might be dialogs recorded from users; for a dictation system, it might be recordings of read speech. Speech databases are used to train, tune, and test the decoding systems.


Text databases - sample texts collected for language model training and so on. Usually, text databases are collected in sample-text form. The issue in collection is putting existing documents (PDFs, web pages, scans) into spoken-text form: you need to remove tags and headings, expand numbers into their spoken form, and expand abbreviations.


What is optimized


When speech recognition is being developed, the most complex issues are making the search precise (considering as many variants to match as possible) and making it fast enough not to run for ages. There are also issues with making the model match the speech, since models aren't perfect.


Usually the system is tested on a test database that is meant to represent the target task correctly.

The following characteristics are used:


Word error rate. Suppose we have an original (reference) text N words long and a recognized text. Comparing them, I words were inserted, D words were deleted, and S words were substituted. The word error rate is

WER = (I + D + S) / N

WER is usually measured in percent.
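The I, D, and S counts come from an edit-distance alignment between the reference and the hypothesis. The sketch below computes the combined rate directly from the total edit distance rather than tracking the three counts separately:

```python
def wer(reference, hypothesis):
    """Word error rate: (I + D + S) / N, via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i ref words into first j hyp words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                          # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                          # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# One word ("the") was deleted out of 6 reference words: WER = 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

Scoring tools such as NIST's sclite compute the same alignment but also report I, D, and S individually.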


Accuracy. It is almost the same thing as word error rate, but it doesn't count insertions.

Accuracy = (N - D - S) / N

Accuracy is actually a worse measure for most tasks, since insertions also matter in the final results. But for some tasks, accuracy is a reasonable measure of decoder performance.


 

Speed. Suppose the audio file was 2 hours long and the decoding took 6 hours. Then the speed is counted as 3xRT (three times real time).
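This real-time factor is simply the ratio of decoding time to audio duration:

```python
def real_time_factor(audio_hours, decode_hours):
    """Decoding time divided by audio duration; below 1.0 is faster than real time."""
    return decode_hours / audio_hours

# The example from the text: 2 hours of audio decoded in 6 hours.
print(f"{real_time_factor(2, 6):.0f}xRT")   # 3xRT
```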


ROC curves. When we talk about detection tasks, there are false alarms and hits/misses, and ROC curves are used. Such a curve is a graphic that plots the number of false alarms against the number of hits; we try to find the optimal point where the number of false alarms is small and the number of hits approaches 100%.
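Sweeping a detection threshold over scored examples produces the points of such a curve. The detector scores and ground-truth labels below are made up for illustration:

```python
# Each pair is (detector score, ground-truth label: True = real event).
scores = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]

def roc_point(threshold):
    """Return (false_alarm_rate, hit_rate) at one detection threshold."""
    hits = sum(1 for s, pos in scores if pos and s >= threshold)
    fas = sum(1 for s, pos in scores if not pos and s >= threshold)
    n_pos = sum(1 for _, pos in scores if pos)
    n_neg = len(scores) - n_pos
    return fas / n_neg, hits / n_pos

# At this threshold all 3 real events are detected, at the cost of 1 false alarm.
print(roc_point(0.5))   # (0.5, 1.0)
```

Plotting `roc_point` over a range of thresholds traces the full curve; the operating point is then chosen where hits are high and false alarms acceptable.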


There are other properties that aren't often taken into account but are still important for many practical applications. Your first task should be to build such a measure and apply it systematically during system development. Your second task is to collect a test database and test how your application performs.


posted on 2013-11-14 20:31 新一