Messages¶
API messages are transmitted over the WebSocket connection as binary frames and cannot be larger than 2 MB. Every message is formed by lines separated by the sequence CR (0x0D) and LF (0x0A), in which the type of message and its parameters (headers) are declared, optionally followed by the message body. The first lines are interpreted as UTF-8 encoded text, while the format and size of the body are defined by the Content-Type and Content-Length headers. There must be an empty line ending in CRLF between the header section and the message body. The general structure is outlined below, followed by a small builder sketch.
Initial line: ASR <version> <message name> CRLF
Zero or more headers, each followed by CRLF
Empty line (indicating the end of the headers) CRLF
Optional content (body)
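To make the framing concrete, the following sketch (in Python, with an illustrative helper name that is not part of the API) assembles a message from an initial line, a header dictionary and an optional binary body, using CRLF separators and a UTF-8 header section as described above.

```python
# Minimal sketch of the message framing described above.
def build_message(name: str, headers: dict | None = None, body: bytes = b"") -> bytes:
    lines = [f"ASR 2.3 {name}"]
    for key, value in (headers or {}).items():
        lines.append(f"{key}: {value}")
    # The header section is UTF-8 text; an empty line separates it from the body.
    head = ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")
    return head + body

# Example: a CREATE_SESSION message has no headers and no body.
print(build_message("CREATE_SESSION"))
```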
Protocol messages will now be described in detail.
CREATE SESSION¶
Creates a speech recognition session. It must be sent by the client after establishing a WebSocket connection with the server. The server generates a unique ID for the session (handle), which will identify the responses sent to the client. A timeout is configured for the session, defining the maximum time it can remain open without receiving any messages. When this time expires, the session is automatically ended by the server.
Initial line:
ASR 2.3 CREATE_SESSION
Headers:
Header | Description |
---|---|
User-Agent | (Optional) Contains information about the client device and the application being executed. This information can be used as an operation log record. For smartphones, for example, the field can carry the device model, manufacturer, operating system and version, application name and version, and a phone identifier, as in the example below. |
Example:
ASR 2.3 CREATE_SESSION
User-Agent: model=iphone 6s;manufacturer=apple;os=ios;os_version=9.3; app_name=cpqd stt;app_version=1.0;phone_id=1A23BB36740
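As a rough illustration of the exchange, the sketch below opens the WebSocket connection and sends CREATE_SESSION as a binary frame, then waits for the server's RESPONSE. It uses the third-party websockets package; the server URL and the User-Agent contents are placeholders, not values defined by this document.

```python
import asyncio
import websockets  # third-party package: pip install websockets

CREATE_SESSION = (
    b"ASR 2.3 CREATE_SESSION\r\n"
    b"User-Agent: app_name=example;app_version=1.0\r\n"
    b"\r\n"
)

async def main():
    # Placeholder URL; use the address of your ASR server deployment.
    async with websockets.connect("wss://asr.example.com/asr-server/asr") as ws:
        await ws.send(CREATE_SESSION)    # messages travel as binary frames
        response = await ws.recv()       # expect an "ASR 2.3 RESPONSE" message
        print(response.decode("utf-8", errors="replace"))

asyncio.run(main())
```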
DEFINE GRAMMAR¶
Loads and compiles a grammar that can later be used for recognition.
Initial line:
ASR 2.3 DEFINE_GRAMMAR
Headers:
Header | Description |
---|---|
Content-ID | Indicates the name that will be used as the reference to the grammar in the recognition session. |
Content-Length | Indicates the number of bytes in the content. |
Content-Type | Describes the type of grammar. Valid values: text/uri-list (grammar or free speech model URI), application/srgs (SRGS ABNF grammar) or application/srgs+xml (SRGS XML grammar). |
Examples:
ASR 2.3 DEFINE_GRAMMAR
Content-Type: text/uri-list
Content-Length: 19
Content-ID: menu2
builtin:slm/general
ASR 2.3 DEFINE_GRAMMAR
Content-Type: application/srgs
Content-Length: 105
Content-ID: yes_no
#ABNF 1.0 UTF-8;
language pt-BR;
tag-format <semantics/1.0>;
mode voice;
root $root;
$root = sim | não;
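One detail worth showing in code is that Content-Length counts the bytes of the UTF-8 encoded body, not its characters (note the non-ASCII "não" in the grammar above). A sketch, assuming the message is later sent over the already established WebSocket connection:

```python
grammar = (
    "#ABNF 1.0 UTF-8;\n"
    "language pt-BR;\n"
    "tag-format <semantics/1.0>;\n"
    "mode voice;\n"
    "root $root;\n"
    "$root = sim | não;\n"
)
body = grammar.encode("utf-8")          # Content-Length counts these bytes
message = (
    "ASR 2.3 DEFINE_GRAMMAR\r\n"
    "Content-Type: application/srgs\r\n"
    f"Content-Length: {len(body)}\r\n"
    "Content-ID: yes_no\r\n"
    "\r\n"
).encode("utf-8") + body
# ws.send(message)  # send over the established WebSocket connection
```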
SET PARAMETERS¶
Allows users to define the recognition parameters for the session. It can be sent anytime the session status is IDLE. The parameters must be sent as message headers. The complete list of parameters is shown in the section Configuration. When responding to this message, the server will return the current status of each parameter, or an error code, if a problem occurred when defining a parameter.
Initial line:
ASR 2.3 SET_PARAMETERS
Headers: (Configuration)
Example:
ASR 2.3 SET_PARAMETERS
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
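Since the parameters travel as plain headers with no body, building this message is just a matter of joining name/value pairs; a small sketch:

```python
params = {
    "decoder.maxSentences": "3",
    "noInputTimeout.enabled": "true",
    "noInputTimeout.value": "5000",
}
lines = ["ASR 2.3 SET_PARAMETERS"] + [f"{k}: {v}" for k, v in params.items()]
message = ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")
# ws.send(message)  # the RESPONSE reports the status of each parameter
```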
GET PARAMETERS¶
Retrieves the current values of the recognition session parameters. The client must specify, in the headers section, the parameters they wish to retrieve, with empty values. If no parameters are specified, the server returns the list of all existing parameters with their current values. The complete list of parameters is shown in the section Configuration. The server sends the current values of the parameters in the header section of the RESPONSE message.
Initial line:
ASR 2.3 GET_PARAMETERS
Headers: (Configuration)
Example:
ASR 2.3 GET_PARAMETERS
decoder.maxSentences:
noInputTimeout.enabled:
noInputTimeout.value:
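The same approach works here: list the wanted parameter names with empty values, or send no headers at all to retrieve every parameter. A sketch:

```python
wanted = ["decoder.maxSentences", "noInputTimeout.enabled", "noInputTimeout.value"]
lines = ["ASR 2.3 GET_PARAMETERS"] + [f"{name}:" for name in wanted]
message = ("\r\n".join(lines) + "\r\n\r\n").encode("utf-8")
# ws.send(message)  # the RESPONSE carries the current values as headers
```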
START RECOGNITION¶
Starts the recognition. Must be sent when the session status is IDLE. The client must inform the language model to be used in the recognition session, whether a free speech model or a grammar. In the case of a grammar model, it must have been previously installed on the server. The client can also define recognition parameters in the headers section of the message (see the complete list in Configuration). When recognition starts, the speech start timer (noInputTimeout) and the recognition timer (recognitionTimeout) are started, when enabled.
Initial line:
ASR 2.3 START_RECOGNITION
Headers:
Header | Description |
---|---|
Accept | (Optional) Content type of the recognition results. Default value: application/json. |
Content-ID | Indicates the name that will be used as the reference to the grammar in the recognition session. |
Content-Length | Indicates the number of bytes in the content. |
Content-Type | Describes the language model type. Valid values: text/uri-list (grammar or free speech model URI), application/srgs (SRGS ABNF grammar) or application/srgs+xml (SRGS XML grammar). |
When the Content-Type is text/uri-list, the language model URI must contain one of the following prefixes:
builtin – internal model (e.g. builtin:slm/general).
file – models located on the ASR server (e.g. file:///opt/grammar/menu, file:///opt/grammar/menu.gram).
http – models located on the network (e.g. http://acme.com/grammar/menu.gram).
session – reference URI defined by DEFINE_GRAMMAR (e.g. session:menu2).
Examples:
ASR 2.3 START_RECOGNITION
Accept: application/json
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
Content-Type: text/uri-list
Content-Length: 19
builtin:slm/general
ASR 2.3 START_RECOGNITION
Accept: application/json
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
Content-Type: text/uri-list
Content-Length: 13
session:menu2
ASR 2.3 START_RECOGNITION
Accept: application/json
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
Content-Type: application/srgs
Content-ID: yes_no
Content-Length: 105
#ABNF 1.0 UTF-8;
language pt-BR;
tag-format <semantics/1.0>;
mode voice;
root $root;
$root = sim | não;
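As a sketch of the second example above, the message below starts recognition against the grammar previously loaded with DEFINE_GRAMMAR (Content-ID menu2), referenced through the session: prefix:

```python
body = b"session:menu2"                 # grammar loaded earlier by DEFINE_GRAMMAR
message = (
    "ASR 2.3 START_RECOGNITION\r\n"
    "Accept: application/json\r\n"
    "noInputTimeout.enabled: true\r\n"
    "noInputTimeout.value: 5000\r\n"
    "Content-Type: text/uri-list\r\n"
    f"Content-Length: {len(body)}\r\n"
    "\r\n"
).encode("utf-8") + body
# ws.send(message)  # audio can be sent once the session is LISTENING
```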
INTERPRET TEXT¶
Performs semantic interpretation of a text supplied by the client, using the indicated grammar, in the same way as the START RECOGNITION message. Must be sent when the session status is IDLE. The client must inform the grammar to be used in the interpretation.
Initial line:
ASR 2.3 INTERPRET_TEXT
Headers:
Header | Description |
---|---|
Accept | (Optional) Content type of the recognition results. Default value: application/json. |
Content-ID | Indicates the name that will be used as the reference to the grammar in the recognition session. |
Content-Length | Indicates the number of bytes in the content. |
Content-Type | Describes the type of grammar. Valid values: text/uri-list (grammar URI), application/srgs (SRGS ABNF grammar) or application/srgs+xml (SRGS XML grammar). |
Example:
ASR 2.3 INTERPRET_TEXT
Accept: application/json
Content-Type: application/srgs
Content-ID: yes_no
Content-Length: 105
Text: sim eu quero
#ABNF 1.0 UTF-8;
language pt-BR;
tag-format <semantics/1.0>;
mode voice;
root $root;
$root = sim [eu quero] {"yes"} | não [quero] {"no"};
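A sketch of assembling the message above: the text to interpret travels in the Text header, while the grammar (or a grammar URI) goes in the body, just as in START_RECOGNITION.

```python
grammar = (
    "#ABNF 1.0 UTF-8;\n"
    "language pt-BR;\n"
    "tag-format <semantics/1.0>;\n"
    "mode voice;\n"
    "root $root;\n"
    '$root = sim [eu quero] {"yes"} | não [quero] {"no"};\n'
)
body = grammar.encode("utf-8")
message = (
    "ASR 2.3 INTERPRET_TEXT\r\n"
    "Accept: application/json\r\n"
    "Content-Type: application/srgs\r\n"
    "Content-ID: yes_no\r\n"
    f"Content-Length: {len(body)}\r\n"
    "Text: sim eu quero\r\n"
    "\r\n"
).encode("utf-8") + body
# ws.send(message)
```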
START INPUT TIMERS¶
Starts the speech start and recognition timers, when enabled. To start the timers, the recognition session status must be active (LISTENING or RECOGNIZING). The length of each timer is defined respectively by the noInputTimeout and recognitionTimeout parameters.
Initial line:
ASR 2.3 START_INPUT_TIMERS
SEND AUDIO¶
Sends a block of audio samples to the recognition process. The session status must be LISTENING. The audio content must be sent in the body of the message, in binary format, and cannot be larger than 2 MB. The audio can be 16-bit linear PCM, with a sample rate of 8 kHz or 16 kHz according to the Acoustic Model (AM) installed on the server, with no encoding (RAW) and no header, or encoded audio, with or without compression, using the MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC or WAV codecs. The message can also be used to signal that audio capturing has ended on the client application. In this case, the LastPacket header must contain the value "true", and the body of the message can be empty. From that moment on, the server concludes the recognition and finishes processing the received speech segments.
Initial line:
ASR 2.3 SEND_AUDIO
Headers:
Header | Description |
---|---|
LastPacket | Indicates whether the sample being sent is the last one, so that the recognition can begin. Values: true or false. Required. |
Content-Length | Indicates the number of bytes in the content. |
Content-Type | Describes the type of the audio content. |
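To stream a file, the audio is cut into blocks and each block is wrapped in a SEND_AUDIO message, with LastPacket: true only on the final one. A sketch for a raw 8 kHz, 16-bit PCM file (the block size and file name are arbitrary); the Content-Type header is omitted here since its valid values depend on the audio format in use:

```python
def audio_packet(chunk: bytes, last: bool) -> bytes:
    head = (
        "ASR 2.3 SEND_AUDIO\r\n"
        f"LastPacket: {'true' if last else 'false'}\r\n"
        f"Content-Length: {len(chunk)}\r\n"
        "\r\n"
    )
    return head.encode("utf-8") + chunk

with open("audio.raw", "rb") as f:       # raw 16-bit linear PCM, 8 kHz, no header
    data = f.read()

BLOCK = 4000                             # 250 ms of 8 kHz / 16-bit audio
for offset in range(0, len(data), BLOCK):
    chunk = data[offset:offset + BLOCK]
    last = offset + BLOCK >= len(data)
    packet = audio_packet(chunk, last)
    # ws.send(packet)                    # session must be in the LISTENING state
```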
CANCEL RECOGNITION¶
Interrupts an ongoing recognition session. Must be sent to a recognition session with a status of LISTENING or RECOGNIZING. All the recognition data is discarded and the session status returns to IDLE.
Initial line:
ASR 2.3 CANCEL_RECOGNITION
RELEASE SESSION¶
Ends the recognition session, releasing allocated resources on the server. The WebSocket connection is closed.
Initial line:
ASR 2.3 RELEASE_SESSION
RESPONSE¶
Response message generated by the server, indicating success or failure in processing a previously received message. It contains the current recognition session status and additional information.
Initial line:
ASR 2.3 RESPONSE
Headers:
Header | Description |
---|---|
Handle | Recognition session identifier. |
Method | Name of the message to which this response refers. |
Expires | Session timeout, in seconds. The timer is restarted whenever a message is received by the server; after this period of inactivity, the session is automatically ended. |
Result | Result of the action performed on the server. Valid values: SUCCESS, FAILURE, INVALID_ACTION. |
Session-Status | Recognition session status. Valid values: IDLE, LISTENING, RECOGNIZING. |
Error-Code | Error code, in case of failure (see Error Codes). |
Message | Complementary message explaining the reason for the failure. |
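Since every server message shares the same framing, a single parser covers RESPONSE and the other notifications. A sketch that splits the UTF-8 header section from the optional binary body; the header values in the usage example are illustrative:

```python
def parse_message(raw: bytes):
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    initial = lines[0]                           # e.g. "ASR 2.3 RESPONSE"
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return initial, headers, body

initial, headers, _ = parse_message(
    b"ASR 2.3 RESPONSE\r\n"
    b"Handle: 42\r\n"
    b"Method: CREATE_SESSION\r\n"
    b"Result: SUCCESS\r\n"
    b"Session-Status: IDLE\r\n"
    b"\r\n"
)
assert headers["Result"] == "SUCCESS" and headers["Session-Status"] == "IDLE"
```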
START OF SPEECH¶
Message generated by the server when the recognition session detects the beginning of a speech segment in the received audio flow.
Initial line:
ASR 2.3 START_OF_SPEECH
Headers:
Header | Description |
---|---|
Handle | Recognition session identifier. |
Session-Status | Recognition session status. |
END OF SPEECH¶
Message generated by the server when the recognition session detects the end of a speech segment (silence) in the received audio flow.
Initial line:
ASR 2.3 END_OF_SPEECH
Headers:
Header | Description |
---|---|
Handle | Recognition session identifier. |
Session-Status | Recognition session status. |
RECOGNITION RESULT¶
Message with the recognition results. It is sent whenever a partial or final result is available. The final result is generated after the end of speech is detected or after the audio flow is finalized, when the client sends a SEND AUDIO message with LastPacket = true. A partial recognition result contains a single sentence representing the text recognized from the audio received up to that moment. The final result, on the other hand, contains more complete information, such as alternative recognition hypotheses with confidence scores and, in certain cases, interpretation results generated by grammars.
Initial line:
ASR 2.3 RECOGNITION_RESULT
Headers:
Header | Description |
---|---|
Handle | Recognition session identifier. |
Session-Status | Recognition session status. |
Result-Status | Indicates the recognition status (e.g. RECOGNIZED). |
Content-Length | Indicates the number of bytes in the content. |
Content-Type | Describes the type of content (e.g. application/json). |
The recognition content is formed by the following fields:

recognition_result

Field Name | Description | Type |
---|---|---|
alternatives | List of alternatives for the recognition result. | See the alternative element. |
result_status | Recognition status (e.g. RECOGNIZED). | text |

alternative

Field Name | Description | Type |
---|---|---|
text | Recognized text. | text |
score | Confidence score. | numeric |
interpretations | List of grammar interpretation results. | Structure representing the interpretation generated by the grammar that was used. |
Note: the recognition results content can contain additional fields, due to future extensions and new features. The application that will read and analyze the results should not generate errors when encountering additional fields in JSON or XML format.
Example of content:
JSON result:
{ "alternatives": [{ "text": "oito sete quatro três um", "interpretations": ["87431"], "words": [{ "text": "oito", "score": 100, "start_time": 0.3901262, "end_time": 0.95921874 }, { "text": "sete", "score": 100, "start_time": 0.99, "end_time": 1.7068747 }, { "text": "quatro", "score": 100, "start_time": 1.74, "end_time": 2.28 }, { "text": "três", "score": 100, "start_time": 2.2800765, "end_time": 2.8498626 }, { "text": "um", "score": 100, "start_time": 2.9167604, "end_time": 3.2101758 }], "score": 100, "lm": "builtin:grammar/digits", "interpretation_scores": [100] }], "segment_index": 0, "last_segment": true, "final_result": true, "start_time": 0.24, "end_time": 3.52, "result_status": "RECOGNIZED" }