Input Audio

Audio to be recognized arrives at the CPQD ASR server in many different ways, depending on the integration used.

With the WebSocket and REST APIs, the audio is usually captured directly by the application (developers generally implement this part) and sent to the ASR server in the ASR API call itself.

When the application uses the MRCP interface, the phone channel audio is directed to the ASR server by the MRCP streaming protocol. Developers often have no access to this part of the code: both sending audio and controlling the recognition are already implemented in the IVR platform used by the developer.

Quality

To obtain the best results, make sure you capture audio of the highest quality possible. Adopt the following practices:

  • Speak close to the microphone, especially when there is a lot of background noise.

  • Use directional microphones.

  • The microphones must have a flat response in the frequency range between 100 Hz and 8000 Hz.

  • Configure the recording level to make sure the captured signal is neither saturated nor too low. Try to maintain the RMS level of the signal between 1/3 and 2/3 of the scale.

  • Avoid recording with people talking around you.
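The recording-level guideline above can be checked on captured audio before it is sent. The following is a minimal sketch, not part of the CPQD ASR API. It assumes little-endian 16-bit PCM and interprets "1/3 to 2/3 of the scale" as a fraction of the linear full-scale amplitude (32768). The names rms_fraction and level_status are hypothetical.

```python
import math
import struct

FULL_SCALE = 32768  # magnitude of the most negative 16-bit sample


def rms_fraction(pcm_bytes: bytes) -> float:
    """Return the RMS level of little-endian 16-bit PCM as a fraction of full scale."""
    count = len(pcm_bytes) // 2
    if count == 0:
        return 0.0
    samples = struct.unpack("<%dh" % count, pcm_bytes[: count * 2])
    mean_square = sum(s * s for s in samples) / count
    return math.sqrt(mean_square) / FULL_SCALE


def level_status(pcm_bytes: bytes, low: float = 1 / 3, high: float = 2 / 3) -> str:
    """Classify the RMS level against the assumed 1/3 to 2/3 linear range."""
    level = rms_fraction(pcm_bytes)
    if level < low:
        return "too low"
    if level > high:
        return "too high"
    return "ok"
```

A capture application could call level_status on a short test recording and ask the user to adjust the input gain when the result is not "ok".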

Encoding

The CPQD ASR accepts audio encoded as linear PCM with 16-bit samples (LINEAR16). In addition, the REST and WebSocket APIs accept audio encoded as MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC and WAV. Keep in mind that lossy encodings may reduce recognition accuracy.

For encoded audio streams, 2-channel (stereo) audio can be used. The system recognizes the mono downmix of the two channels, that is, channel 1 + channel 2. Because the audio is processed as mono, expect recognition errors when both channels contain speech at the same time.
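To see why simultaneous speech on the two channels causes errors, it helps to look at what a mono downmix does to the samples. The sketch below is only an illustration: the server's exact downmix formula is not documented here, so averaging the two channels is an assumption, and downmix_stereo_to_mono is a hypothetical helper. It expects interleaved little-endian 16-bit stereo PCM.

```python
import struct


def downmix_stereo_to_mono(pcm_bytes: bytes) -> bytes:
    """Average interleaved left/right 16-bit samples into one mono stream."""
    frame_count = len(pcm_bytes) // 4  # 2 channels x 2 bytes per sample
    samples = struct.unpack("<%dh" % (frame_count * 2), pcm_bytes[: frame_count * 4])
    # Each output sample mixes both channels, so overlapping speech is merged
    # into a single signal that cannot be separated afterwards.
    mono = [(samples[i] + samples[i + 1]) // 2 for i in range(0, len(samples), 2)]
    return struct.pack("<%dh" % frame_count, *mono)
```

When the two speakers must be recognized separately, one option is to split the channels in the application and send each one as its own mono stream.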

Sample rate

The CPQD ASR accepts audio with sample rates of 8 kHz (the standard used for phone applications) and 16 kHz.

In general, a higher sample rate yields higher speech recognition accuracy. In practice, accuracy also depends on the audio itself and on the models used by the application.

Warning: upsampling from 8 kHz to 16 kHz does not improve the results; on the contrary, it tends to make them worse. If the original recording is 8 kHz, do not upsample it; use the 8 kHz acoustic model.
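An application can read the native sample rate of a recording and keep it as is, instead of resampling. The following is a minimal sketch using the Python standard library wave module, and it assumes the audio is stored as a WAV file. The helper native_sample_rate is hypothetical, and selecting the matching acoustic model is left out because that part of the API is not described in this section.

```python
import wave

SUPPORTED_RATES_HZ = (8000, 16000)  # the two rates named in this section


def native_sample_rate(wav_file) -> int:
    """Return the WAV sample rate, or raise if it is not 8 kHz or 16 kHz.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as reader:
        rate = reader.getframerate()
    if rate not in SUPPORTED_RATES_HZ:
        raise ValueError("unsupported sample rate: %d Hz" % rate)
    return rate
```

The returned value can then be used to choose between the 8 kHz and 16 kHz models, so that an 8 kHz recording is never upsampled.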