Terms and Definitions

This section lists some of the terms and definitions related to CPQD ASR and to speech recognition in general.

ASR

Automatic Speech Recognition (ASR) is a technology that allows a machine to transform what a person is saying into text.

AM

The acoustic model (AM) represents the sounds that form the words of a language. It is generated from a large volume of speech audio and its transcriptions.

LM

The language model (LM) is a set of data that defines how words can be combined to form sentences in a given language. It can be a grammar or a free speech model.

Speech segment

‘Speech segments’ are the portions of the audio that contain the speech signal, with a small margin of silence at each end: around 200 ms at the beginning and 400 ms at the end of the segment. A complete audio file can be 10 s long but contain only 2 s of speech (the speech segment).

RTF

The real time factor (RTF) is the ratio between the recognition time and the duration of the speech segment.

An RTF equal to or less than 1 means the system can perform ‘real time’ recognition (without considering the time it takes to receive the audio through the network, or any processing tasks other than the recognition itself). RTF values greater than 1 indicate that the recognition result will only be produced some time after the audio has been delivered.

For example, an RTF of 0.5 means that a 3 s speech segment would take 1.5 s to be recognized. In this case, if the ASR system receives the audio as soon as it is captured (from the microphone, for example), the result can be delivered as soon as the audio capture has ended. An RTF of 1.5 means that a 3 s speech segment would take 4.5 s to be recognized. In this case, we would have to wait 1.5 s after the speech segment has been completely captured to get the result.
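
The arithmetic in the examples above can be reproduced with a minimal Python sketch. The function names `real_time_factor` and `wait_after_capture` are hypothetical and are not part of any CPQD ASR API. The wait calculation assumes that the audio is streamed to the recognizer as it is captured and ignores network and other processing delays.

```python
def real_time_factor(recognition_time_s: float, segment_duration_s: float) -> float:
    """Return RTF = recognition time / speech segment duration."""
    if segment_duration_s <= 0:
        raise ValueError("segment duration must be positive")
    return recognition_time_s / segment_duration_s


def wait_after_capture(rtf: float, segment_duration_s: float) -> float:
    """Estimate the wait, in seconds, after the segment has been fully captured.

    Assumption: recognition runs while the audio is still being captured,
    so only the portion of recognition time beyond the segment duration
    is perceived as a delay.
    """
    return max(0.0, (rtf - 1.0) * segment_duration_s)


# A 3 s segment recognized in 1.5 s and in 4.5 s, as in the text.
fast = real_time_factor(1.5, 3.0)
slow = real_time_factor(4.5, 3.0)
print(fast, wait_after_capture(fast, 3.0))
print(slow, wait_after_capture(slow, 3.0))
```

With an RTF at or below 1 the estimated wait is zero, which matches the ‘real time’ case described above.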

WER

WER (word error rate) is measured as WER = (I + D + S)/N, where:

I = insertion errors
D = deletion errors
S = substitution errors
N = total number of words in the reference sentence
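
As an illustration of the formula, the following Python sketch computes the WER of a hypothesis against a reference sentence. It assumes that words are separated by whitespace and that the error count I + D + S is the minimum word-level edit distance between the two sentences, which is the usual convention. The function name `word_error_rate` is hypothetical, and this is not necessarily how CPQD ASR measures WER internally.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (I + D + S) / N using the minimum word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits needed to turn the
    # reference words processed so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word present in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the"), with N = 6.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted against the length of the reference, the WER can exceed 1 when the hypothesis contains many extra words.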