Feature Summary

The CPQD ASR provides several features, the most important of which are:

Support for Brazilian Portuguese and Latin American Spanish

The Portuguese acoustic models were trained with the speech of thousands of Brazilians, with accents from all over the country. The Spanish models were produced with the audio of voice actors from different countries of Latin America.

Audio input

Accepts 16-bit linear PCM audio at an 8 kHz sample rate, used for telephony, or at a 16 kHz sample rate, for applications with better audio quality. It also supports other audio formats, with and without compression: MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC and WAV. Keep in mind that heavily compressed audio can reduce recognition accuracy.
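As an illustration of the input requirements above, the sketch below checks whether a WAV file holds 16-bit samples at 8 kHz or 16 kHz before it is sent for recognition. It uses only Python's standard `wave` module; the names `check_pcm16` and `make_silence` are hypothetical helpers written for this example and are not part of any CPQD API.

```python
import io
import wave

# Sample rates named in the text: 8 kHz (telephony) and 16 kHz (higher quality).
SUPPORTED_RATES = (8000, 16000)


def check_pcm16(wav_bytes: bytes) -> list:
    """Return a list of problems found; an empty list means the audio looks fine."""
    problems = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        if wav.getsampwidth() != 2:
            problems.append(
                f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit"
            )
        if wav.getframerate() not in SUPPORTED_RATES:
            problems.append(f"unsupported sample rate: {wav.getframerate()} Hz")
    return problems


def make_silence(rate: int, sample_width: int = 2, seconds: float = 0.1) -> bytes:
    """Build a small in-memory mono WAV file, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width)
        wav.setframerate(rate)
        wav.writeframes(b"\x00" * (int(rate * seconds) * sample_width))
    return buf.getvalue()


if __name__ == "__main__":
    print(check_pcm16(make_silence(8000)))   # no problems expected
    print(check_pcm16(make_silence(44100)))  # sample rate is flagged
```

A check like this only covers the uncompressed PCM case; the compressed formats listed above would need their own decoders.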

Client/server architecture

Applications can use speech recognition through the WebSocket and REST APIs developed by CPQD, or through the MRCP standard for IVR applications.

Real time or online recognition

The audio is processed as it is received, minimizing the time needed to produce the final result.

Continuous recognition mode

By default, the ASR detects and recognizes only the user’s first sentence. In continuous recognition mode, the ASR keeps processing the audio and generates recognition results as the audio arrives.

Recognition with grammars

The recognition can be based on grammars written using the SRGS standard.

Semantic interpretation

Recognition with semantic interpretation for grammars using the SISR standard.
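To make the two standards concrete, the sketch below embeds a small yes/no grammar in SRGS XML form, with SISR `<tag>` elements that assign a semantic value to each phrase. The grammar itself (rule name `answer`, the phrases and the values) is an invented example, and `list_alternatives` is a hypothetical helper that only reads the XML with Python's standard library. It is not a SISR interpreter and says nothing about how the ASR loads grammars.

```python
import xml.etree.ElementTree as ET

# Illustrative SRGS grammar (XML form) with SISR tags; content is an assumption.
GRAMMAR = """\
<?xml version="1.0" encoding="UTF-8"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"
         xml:lang="pt-BR" root="answer" tag-format="semantics/1.0">
  <rule id="answer" scope="public">
    <one-of>
      <item>sim <tag>out = "yes";</tag></item>
      <item>não <tag>out = "no";</tag></item>
    </one-of>
  </rule>
</grammar>
"""

# Namespace defined by the SRGS specification.
NS = {"srgs": "http://www.w3.org/2001/06/grammar"}


def list_alternatives(grammar_xml: str) -> dict:
    """Map each spoken phrase in the grammar to the raw text of its SISR tag."""
    root = ET.fromstring(grammar_xml.encode("utf-8"))
    result = {}
    for item in root.iterfind(".//srgs:item", NS):
        phrase = (item.text or "").strip()
        tag = item.find("srgs:tag", NS)
        result[phrase] = tag.text.strip() if tag is not None and tag.text else ""
    return result


if __name__ == "__main__":
    for phrase, script in list_alternatives(GRAMMAR).items():
        print(phrase, "->", script)
```

With a grammar like this, the recognizer is constrained to the listed phrases, and the semantic interpretation step yields the value assigned in the matching tag rather than the literal words.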

Free speech recognition

Speech recognition without the need to write a grammar, allowing more flexibility when interacting with users.

Intermediate results

Intermediate or partial results are produced while the audio is being received and recognized.

Confidence score

Generated results receive a score, indicating the level of confidence in the recognition of that sentence; the higher the score, the higher the chance of the recognition being correct.

N-best list

A list of the N most probable phrases is returned for each recognition, instead of only the single recognized sentence.
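The sketch below shows one way an application might combine the confidence score with an N-best list: rank the alternatives and accept the top one only if its score reaches a threshold. The `Alternative` class, the `pick_best` function, the sample phrases and the score values are all assumptions made for illustration; the actual result format and score range should be taken from the API reference.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alternative:
    """One entry of a hypothetical N-best list."""
    text: str
    score: float  # higher means more confidence; the scale is an assumption


def pick_best(alternatives: List[Alternative], min_score: float) -> Optional[Alternative]:
    """Return the highest-scoring alternative, or None if it is below min_score."""
    ranked = sorted(alternatives, key=lambda alt: alt.score, reverse=True)
    if ranked and ranked[0].score >= min_score:
        return ranked[0]
    return None


if __name__ == "__main__":
    n_best = [
        Alternative("quero pagar a conta", 62.0),
        Alternative("quero pagar uma conta", 88.0),
        Alternative("quero apagar a conta", 41.0),
    ]
    best = pick_best(n_best, min_score=70.0)
    print(best.text if best else "low confidence, ask the user to repeat")
```

Returning `None` for low-confidence results lets the caller decide what to do next, for example asking the user to repeat or offering the remaining alternatives for confirmation.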

Speech detection

Automatic identification of when the user starts and stops speaking.
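To illustrate what start and stop detection involves, here is a minimal energy-based sketch: the audio is split into short frames, and the first and last frames whose RMS level exceeds a threshold mark the speech boundaries. This is a generic textbook technique, not the method used by the CPQD ASR; the function name `detect_speech`, the 20 ms frame size and the threshold value are assumptions chosen for the example.

```python
import math
from typing import Optional, Sequence, Tuple


def detect_speech(
    samples: Sequence[int],
    rate: int,
    frame_ms: int = 20,
    threshold: float = 500.0,
) -> Optional[Tuple[float, float]]:
    """Return (start_seconds, end_seconds) of the detected speech, or None.

    samples: 16-bit PCM sample values; rate: sample rate in Hz.
    """
    frame_len = rate * frame_ms // 1000
    active = []  # indices of frames whose RMS level reaches the threshold
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        if rms >= threshold:
            active.append(i // frame_len)
    if not active:
        return None
    start = active[0] * frame_len / rate
    end = (active[-1] + 1) * frame_len / rate
    return start, end


if __name__ == "__main__":
    rate = 8000
    silence = [0] * (rate // 2)
    # Half a second of a 440 Hz tone stands in for speech.
    tone = [int(10000 * math.sin(2 * math.pi * 440 * n / rate)) for n in range(rate // 2)]
    print(detect_speech(silence + tone + silence, rate))
```

A fixed energy threshold is fragile in noisy environments, which is one reason production systems rely on more robust detectors.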