Feature Summary

The CPQD ASR provides several features. The most important are:

Support for Brazilian Portuguese and Latin American Spanish: The Portuguese acoustic models were trained with the speech of thousands of Brazilians, with accents from all over the country. The Spanish models were produced with the audio of voice actors from different countries of Latin America.
Audio input: Accepts 16-bit linear PCM audio at an 8 kHz sample rate, used for telephony, or at a 16 kHz sample rate, for applications with better audio quality. It also supports other audio formats, with and without compression: MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC and WAV. Note that heavily compressed audio can reduce recognition accuracy.
Client/server architecture: Applications can use speech recognition through the WebSocket and REST APIs developed by CPQD, or through the MRCP standard for IVR applications.
Real-time or online recognition: The audio is processed as it is received, minimizing the time needed to produce the final result.
Continuous recognition mode: By default, the ASR detects and recognizes only the user’s first sentence. In continuous recognition mode, the ASR keeps processing the audio and generates recognition results as the audio arrives.
Recognition with grammars: The recognition can be based on grammars written using the SRGS standard.
Semantic interpretation: Recognition with semantic interpretation for grammars using the SISR standard.
Free speech recognition: Speech recognition without the need to write a grammar, allowing more flexibility when interacting with users.
Intermediate results: Intermediate or partial results are produced while the audio is being received and recognized.
Confidence score: Generated results receive a score, indicating the level of confidence in the recognition of that sentence; the higher the score, the higher the chance of the recognition being correct.
N-best list: A list of the N most probable sentences is produced for each recognition, instead of only the single best sentence.
Speech detection: Automatic identification of when the user starts and stops speaking.
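
As an illustration of the audio input requirements listed above, the following minimal sketch checks whether a WAV file holds 16-bit samples at 8 kHz or 16 kHz before it is sent for recognition. It uses only Python's standard `wave` module. The function names `check_wav_for_asr` and `make_test_wav` are hypothetical helpers written for this example and are not part of any CPQD API.

```python
import io
import wave

# Values taken from the "Audio input" item above.
SUPPORTED_SAMPLE_RATES_HZ = (8000, 16000)
REQUIRED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def check_wav_for_asr(source):
    """Return a list of problems found in a WAV file (empty list means OK).

    `source` is a file path or a binary file-like object.
    Hypothetical helper: not part of any CPQD API.
    """
    problems = []
    with wave.open(source, "rb") as wav:
        width = wav.getsampwidth()
        rate = wav.getframerate()
        if width != REQUIRED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width is {8 * width} bits, expected 16")
        if rate not in SUPPORTED_SAMPLE_RATES_HZ:
            problems.append(f"sample rate is {rate} Hz, expected 8000 or 16000")
    return problems


def make_test_wav(sample_rate_hz, sample_width_bytes=2, seconds=0.1):
    """Build a small mono WAV of zero-valued samples in memory, for trying out the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        frame_count = int(sample_rate_hz * seconds)
        wav.writeframes(b"\x00" * frame_count * sample_width_bytes)
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    print(check_wav_for_asr(make_test_wav(16000)))
    print(check_wav_for_asr(make_test_wav(44100)))
```

Python's `wave` module reads uncompressed PCM WAV only, so this check does not cover the compressed formats in the list (MP3, OPUS, and so on); those would need a separate decoder.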