Feature Summary
The CPQD ASR provides several features, the most important of which are:
- Support for Brazilian Portuguese and Latin American Spanish
The Portuguese acoustic models were trained with the speech of thousands of Brazilians, with accents from all over the country. The Spanish models were produced with the audio of voice actors from different countries of Latin America.
- Audio input
Accepts 16-bit linear PCM audio at a sample rate of 8 kHz, used for telephony, or 16 kHz, for applications with better audio quality. It also supports other audio formats, with and without compression: MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC and WAV. Note that heavily compressed audio can reduce recognition accuracy.
- Client/server architecture
Applications can access speech recognition through the WebSocket and REST APIs developed by CPQD, or through the MRCP standard for IVR applications.
- Real time or online recognition
The audio is processed as it is received, minimizing the time needed to produce the final result.
- Continuous recognition mode
By default, the ASR detects and recognizes only the user’s first sentence. In continuous recognition mode, it keeps processing the audio and generates recognition results as the audio arrives.
- Recognition with grammars
The recognition can be based on grammars written using the SRGS standard.
- Semantic interpretation
Recognition with semantic interpretation for grammars using the SISR standard.
- Free speech recognition
Speech recognition without the need to write a grammar, allowing more flexibility when interacting with users.
- Intermediate results
Intermediate or partial results are produced while the audio is being received and recognized.
- Confidence score
Generated results receive a score, indicating the level of confidence in the recognition of that sentence; the higher the score, the higher the chance of the recognition being correct.
- N-best list
A list of the N most probable sentences is returned for each recognition, instead of only the top result.
- Speech detection
Automatic identification of when the user starts and stops speaking.
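
The audio input requirements listed above (16-bit linear PCM at 8 kHz or 16 kHz) can be checked on the client side before any audio is sent. The sketch below is only an illustration using Python's standard `wave` module. The names `is_supported_pcm` and `make_silence` are hypothetical helpers, not part of the CPQD ASR API, and the check covers uncompressed WAV input only, not the compressed formats.

```python
import io
import wave

# Sample rates named in the feature list above (Hz).
SUPPORTED_RATES = {8000, 16000}
# 16-bit linear PCM means 2 bytes per sample.
SAMPLE_WIDTH_BYTES = 2


def is_supported_pcm(wav_file):
    """Return True if the WAV data is 16-bit PCM at 8 kHz or 16 kHz.

    `wav_file` may be a path or a binary file-like object.
    Hypothetical helper for illustration; not a CPQD ASR function.
    """
    with wave.open(wav_file, "rb") as wav:
        return (
            wav.getsampwidth() == SAMPLE_WIDTH_BYTES
            and wav.getframerate() in SUPPORTED_RATES
        )


def make_silence(rate, seconds=0.1):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)  # rewind so the buffer can be read back
    return buffer


if __name__ == "__main__":
    for rate in (8000, 16000, 44100):
        print(rate, is_supported_pcm(make_silence(rate)))
```

A check like this would typically run before an upload, so that audio at an unexpected sample rate can be resampled or rejected early rather than silently degrading recognition.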