Messages

API messages are transmitted over the WebSocket connection in binary format and cannot be larger than 2 MB in size. Every message is formed by lines separated by the sequence CR (0x0D) and LF (0x0A), in which the message type and its parameters (headers) are declared, optionally followed by the message body. The initial line and the headers are interpreted as UTF-8 encoded text, while the format and size of the body are defined by the Content-Type and Content-Length headers. There must be an empty line ending in CRLF between the header section and the message body.

Initial line: ASR <version> <message name> CRLF
Zero or more headers, each followed by CRLF
Empty line (indicating the end of the headers) CRLF
Optional body
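
As a sketch of this framing (the helper name and header values below are our own, not part of the spec), a client-side serializer in Python:

```python
def build_message(name, headers=None, body=b"", version="2.3"):
    """Serialize one protocol message as bytes, using CRLF framing."""
    CRLF = b"\r\n"
    header_fields = dict(headers or {})
    if body:
        # Content-Length counts bytes of the body, not characters.
        header_fields.setdefault("Content-Length", str(len(body)))
    lines = [f"ASR {version} {name}".encode("utf-8")]
    lines += [f"{k}: {v}".encode("utf-8") for k, v in header_fields.items()]
    lines.append(b"")  # empty line ends the header section
    return CRLF.join(lines) + CRLF + body

msg = build_message("CREATE_SESSION",
                    {"User-Agent": "app_name=cpqd stt;app_version=1.0"})
```

The resulting bytes would then be sent as a single binary WebSocket frame.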

Protocol messages will now be described in detail.

CREATE SESSION

Creates a speech recognition session. To be sent by the client after establishing a WebSocket connection with the server. The server will generate a unique ID for the session (handle), which will identify the responses sent to the client. A session timeout is also configured, setting the maximum time the session can remain open without receiving any messages. When this time expires, the session is automatically ended by the server.

Initial line:

ASR 2.3 CREATE_SESSION

Headers:

User-Agent

(Optional) This field contains information about the client device and the application being executed. This information can be used for operation logging. In the case of smartphones, for example, the field can contain the following information:

  • model (xperia ZQ, iphone 6s)

  • manufacturer (sony, apple)

  • os (android, ios)

  • os_version (6.0, 9.2)

  • app_name (cpqd stt)

  • app_version (1.0)

  • phone_id (string)

Example:

ASR 2.3 CREATE_SESSION
User-Agent: model=iphone 6s;manufacturer=apple;os=ios;os_version=9.3; app_name=cpqd stt;app_version=1.0;phone_id=1A23BB36740

DEFINE GRAMMAR

Loads and compiles a grammar that can later be used for recognition.

Initial line:

ASR 2.3 DEFINE_GRAMMAR

Headers:

Content-ID

Indicates the name that will be defined as the reference for using the grammar in the recognition session.

Content-Length

Indicates the number of bytes in the content.

Content-Type

Describes the type of grammar. Valid values:

Grammar or free speech model URI:

  • text/uri-list

SRGS XML grammar:

  • application/grammar+xml

  • application/srgs+xml

  • application/xml

SRGS ABNF grammar:

  • text/xml

  • application/srgs

  • text/plain

Examples:

ASR 2.3 DEFINE_GRAMMAR
Content-Type: text/uri-list
Content-Length: 19
Content-ID: menu2

builtin:slm/general
ASR 2.3 DEFINE_GRAMMAR
Content-Type: application/srgs
Content-Length: 105
Content-ID: yes_no

#ABNF 1.0 UTF-8;
language pt-BR;
tag-format <semantics/1.0>;
mode voice;

root $root;
$root = sim | não;
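
One detail worth noting in these examples: Content-Length counts bytes, not characters. A quick check in Python, assuming the grammar body is UTF-8 with LF line endings and no trailing newline (one encoding that reproduces the 105 bytes declared above — "não" takes four bytes):

```python
# The ABNF grammar body from the DEFINE_GRAMMAR example above.
grammar = (
    "#ABNF 1.0 UTF-8;\n"
    "language pt-BR;\n"
    "tag-format <semantics/1.0>;\n"
    "mode voice;\n"
    "\n"
    "root $root;\n"
    "$root = sim | não;"
)
body = grammar.encode("utf-8")
print(len(body))  # 105
```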

SET PARAMETERS

Allows users to define the recognition parameters for the session. It can be sent anytime the session status is IDLE. The parameters must be sent as message headers. The complete list of parameters is shown in the section Configuration. When responding to this message, the server will return the current status of each parameter, or an error code, if a problem occurred when defining a parameter.

Initial line:

ASR 2.3 SET_PARAMETERS

Headers: (Configuration)

Example:

ASR 2.3 SET_PARAMETERS
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000

GET PARAMETERS

Retrieves the current values of the recognition session parameters. The client must specify, in the headers section, the parameters they wish to get, with empty values. If no parameters are specified, the server returns the list of all existing parameters with their current values. The complete list of parameters is shown in the section Configuration. The server sends the current values of the parameters in the header section of the RESPONSE message.

Initial line:

ASR 2.3 GET_PARAMETERS

Headers: (Configuration)

Example:

ASR 2.3 GET_PARAMETERS
decoder.maxSentences:
noInputTimeout.enabled:
noInputTimeout.value:

START RECOGNITION

Starts the recognition. Must be sent whenever the session status is IDLE. The client must inform the language model to be used in the recognition session, whether it is a free speech or a grammar model. In the case of a grammar model, it must have been previously installed on the server. The client can also define recognition parameters in the headers section of the message (see the complete list in Configuration). When the recognition session starts, the speech-start timer (noInputTimeout) and the recognition timer (recognitionTimeout) are triggered, when enabled.

Initial line:

ASR 2.3 START_RECOGNITION

Headers:

Accept

(Optional) Content type of the recognition results. Valid values:

  • application/xml

  • application/json

Default value: application/json.

Content-ID

Indicates the name that will be defined as the reference for using the grammar in the recognition session.

Content-Length

Indicates the number of bytes in the content.

Content-Type

Describes the language model type. Valid values:

Grammar or free speech model URI:

  • text/uri-list

SRGS XML grammar:

  • application/grammar+xml

  • application/srgs+xml

  • application/xml

SRGS ABNF grammar:

  • text/xml

  • application/srgs

  • text/plain

In cases where the Content-Type is text/uri-list, the language model URI must start with one of the supported prefixes, as in the examples below: builtin: references a model installed on the server, while session: references a grammar previously defined with DEFINE GRAMMAR.

Examples:

ASR 2.3 START_RECOGNITION
Accept: application/json
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
Content-Type: text/uri-list
Content-Length: 19

builtin:slm/general
ASR 2.3 START_RECOGNITION
Accept: application/json
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
Content-Type: text/uri-list
Content-Length: 13

session:menu2
ASR 2.3 START_RECOGNITION
Accept: application/json
decoder.maxSentences: 3
noInputTimeout.enabled: true
noInputTimeout.value: 5000
Content-Type: application/srgs
Content-ID: yes_no
Content-Length: 105

#ABNF 1.0 UTF-8;
language pt-BR;
tag-format <semantics/1.0>;
mode voice;

root $root;
$root = sim | não;

INTERPRET TEXT

Performs semantic interpretation of a text supplied by the client, using the indicated grammar, which is specified in the same way as in the START RECOGNITION message. Must be sent whenever the session status is IDLE. The client must inform the grammar to be used, and the text to be interpreted is passed in the Text header.

Initial line:

ASR 2.3 INTERPRET_TEXT

Headers:

Accept

(Optional) Content type of the recognition results. Valid values:

  • application/xml

  • application/json

Default value: application/json.

Text

Text to be semantically interpreted, as shown in the example below.

Content-ID

Indicates the name that will be defined as the reference for using the grammar in the recognition session.

Content-Length

Indicates the number of bytes in the content.

Content-Type

Describes the type of grammar. Valid values:

Grammar URI:

  • text/uri-list

SRGS XML grammar:

  • application/grammar+xml

  • application/srgs+xml

  • application/xml

SRGS ABNF grammar:

  • text/xml

  • application/srgs

  • text/plain

Example:

ASR 2.3 INTERPRET_TEXT
Accept: application/json
Content-Type: application/srgs
Content-ID: yes_no
Content-Length: 105
Text: sim eu quero

#ABNF 1.0 UTF-8;
language pt-BR;
tag-format <semantics/1.0>;
mode voice;

root $root;
$root = sim [eu quero] {"yes"} | não [quero] {"no"};

START INPUT TIMERS

Starts the speech start and recognition timers, when enabled. To start the timers, the recognition session must be active (status LISTENING or RECOGNIZING). The duration of each timer is defined by the noInputTimeout and recognitionTimeout parameters, respectively.

Initial line:

ASR 2.3 START_INPUT_TIMERS

SEND AUDIO

Sends a block of audio samples to the recognition process. The session status must be LISTENING. The audio content must be sent in the body of the message, in binary format, and cannot be larger than 2 MB in size. The audio can be 16-bit linear PCM, with a sample rate of 8 kHz or 16 kHz according to the Acoustic Model (AM) installed on the server, with no encoding (RAW) and without a header, or encoded audio, with or without compression, using the MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC or WAV codecs. The message can also be used to signal that audio capturing has ended on the client application. In this case, the LastPacket header must contain the value "true", and the body of the message can be empty. From that point on, the server ends audio capture for the recognition and finishes processing the received speech segments.

Initial line:

ASR 2.3 SEND_AUDIO

Headers:

LastPacket

Indicates whether the sent audio block is the last one, so the final recognition result can be generated. Values: true or false. Required.

Content-Length

Indicates the number of bytes in the content.

Content-Type

Describes the type of content. Valid values:

  • application/octet-stream – Encoded file in a supported format

  • audio/wav – Encoded file in a supported format

  • audio/raw – Linear PCM audio with no header
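
As an illustration of this flow (the chunk size, helper name, and header values are our own assumptions, not mandated by the protocol), the sketch below splits raw PCM audio into SEND_AUDIO messages, marking the last one with LastPacket: true:

```python
CRLF = "\r\n"

def audio_messages(audio: bytes, chunk_size: int = 4096) -> list[bytes]:
    """Split raw audio into SEND_AUDIO messages; the last carries
    LastPacket: true (an empty last packet is also valid)."""
    chunks = [audio[i:i + chunk_size]
              for i in range(0, len(audio), chunk_size)] or [b""]
    messages = []
    for i, chunk in enumerate(chunks):
        last = (i == len(chunks) - 1)
        header = (
            f"ASR 2.3 SEND_AUDIO{CRLF}"
            f"LastPacket: {'true' if last else 'false'}{CRLF}"
            f"Content-Type: audio/raw{CRLF}"
            f"Content-Length: {len(chunk)}{CRLF}"
            f"{CRLF}"
        ).encode("utf-8")
        messages.append(header + chunk)
    return messages

# 10000 bytes of silence -> three messages (4096 + 4096 + 1808 bytes)
msgs = audio_messages(b"\x00" * 10000, chunk_size=4096)
```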

CANCEL RECOGNITION

Interrupts an ongoing recognition session. Must be sent to a recognition session with a status of LISTENING or RECOGNIZING. All the recognition data is discarded and the session status returns to IDLE.

Initial line:

ASR 2.3 CANCEL_RECOGNITION

RELEASE SESSION

Ends the recognition session, releasing allocated resources on the server. The WebSocket connection is closed.

Initial line:

ASR 2.3 RELEASE_SESSION

RESPONSE

Response message generated by the server, indicating success or failure in processing a previously received message. It contains the current recognition session status and additional information.

Initial line:

ASR 2.3 RESPONSE

Headers:

Handle

Recognition session identifier.

Method

Name of the message related to the response.

Expires

Session inactivity timeout, in seconds. The timer is restarted whenever a message is received by the server; when it expires with no activity, the session is ended.

Result

Result of the action performed on the server: success, failure or invalid. Valid values: SUCCESS, FAILURE, INVALID_ACTION.

Session-Status

Recognition session status. Valid values: IDLE, LISTENING, RECOGNIZING.

Error-Code

Error code, in case of failure (Error Codes).

Message

Complementary message explaining the reason for the failure.
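
A minimal sketch of parsing a server message on the client side, assuming the CRLF framing described at the top of this section (the helper name is our own):

```python
def parse_message(data: bytes):
    """Split a protocol message into (name, headers, body)."""
    # The first blank CRLF line separates headers from the body.
    head, _, rest = data.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    _, version, name = lines[0].split(" ", 2)  # e.g. "ASR 2.3 RESPONSE"
    headers = {}
    for line in lines[1:]:
        key, _, value = line.partition(":")
        headers[key.strip()] = value.strip()
    length = int(headers.get("Content-Length", "0"))
    return name, headers, rest[:length]

name, headers, body = parse_message(
    b"ASR 2.3 RESPONSE\r\nHandle: 42\r\nMethod: CREATE_SESSION\r\n"
    b"Result: SUCCESS\r\nSession-Status: IDLE\r\n\r\n"
)
```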

START OF SPEECH

Message generated by the server when the recognition session detects the beginning of a speech segment in the received audio flow.

Initial line:

ASR 2.3 START_OF_SPEECH

Headers:

Handle

Recognition session identifier.

Session-Status

Recognition session status.

END OF SPEECH

Message generated by the server when the recognition session detects the end of a speech segment (silence) in the received audio flow.

Initial line:

ASR 2.3 END_OF_SPEECH

Headers:

Handle

Recognition session identifier.

Session-Status

Recognition session status.

RECOGNITION RESULT

Message with the recognition results. Sent whenever there is an available partial or final result. The final result is generated after the end of speech is detected or the audio flow is finalized, when the client sends a SEND AUDIO message indicating LastPacket = true. Partial recognition results contain a single sentence that represents the recognized text, based on the audio received up to that moment. The final result, on the other hand, contains more complete information, such as alternative recognition hypotheses with confidence scores and, in certain cases, interpretation results generated by grammars.

Initial line:

ASR 2.3 RECOGNITION_RESULT

Headers:

Handle

Recognition session identifier.

Session-Status

Recognition session status.

Result-Status

Indicates the recognition status. Valid values:

  • PROCESSING

  • RECOGNIZED

  • NO_MATCH

  • NO_INPUT_TIMEOUT

  • MAX_SPEECH

  • EARLY_SPEECH

  • RECOGNITION_TIMEOUT

  • NO_SPEECH

  • CANCELED

  • FAILURE

Content-Length

Indicates the number of bytes in the content.

Content-Type

Describes the type of content. Valid values:

  • application/json

  • application/xml

The recognition content is formed by the following fields:

  • recognition_result

    alternatives: list of alternatives for the recognition results (see the alternative element).

    result_status: recognition status. Valid values:

      • PROCESSING

      • RECOGNIZED

      • NO_MATCH

      • NO_INPUT_TIMEOUT

      • MAX_SPEECH

      • EARLY_SPEECH

      • RECOGNITION_TIMEOUT

      • NO_SPEECH

      • CANCELED

      • FAILURE

  • alternative

    text: recognized text (type: text).

    score: confidence score (type: numeric).

    interpretations: list of interpretation results generated by the grammar; a structure that represents the interpretation produced by the grammar that was used.

Note: the recognition results content can contain additional fields, due to future extensions and new features. The application that will read and analyze the results should not generate errors when encountering additional fields in JSON or XML format.

Example of content:

JSON result:

{
  "alternatives": [{
    "text": "oito sete quatro três um",
    "interpretations": ["87431"],
    "words": [{
      "text": "oito",
      "score": 100,
      "start_time": 0.3901262,
      "end_time": 0.95921874
    }, {
      "text": "sete",
      "score": 100,
      "start_time": 0.99,
      "end_time": 1.7068747
    }, {
      "text": "quatro",
      "score": 100,
      "start_time": 1.74,
      "end_time": 2.28
    }, {
      "text": "três",
      "score": 100,
      "start_time": 2.2800765,
      "end_time": 2.8498626
    }, {
      "text": "um",
      "score": 100,
      "start_time": 2.9167604,
      "end_time": 3.2101758
    }],
    "score": 100,
    "lm": "builtin:grammar/digits",
    "interpretation_scores": [100]
  }],
  "segment_index": 0,
  "last_segment": true,
  "final_result": true,
  "start_time": 0.24,
  "end_time": 3.52,
  "result_status": "RECOGNIZED"
}
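
A sketch of how a client might consume a final JSON result (the field names follow the example above; the payload here is abbreviated):

```python
import json

# Abbreviated final RECOGNITION_RESULT body, per the example above.
payload = json.loads("""{
  "alternatives": [
    {"text": "oito sete quatro três um", "score": 100,
     "interpretations": ["87431"]}
  ],
  "last_segment": true,
  "final_result": true,
  "result_status": "RECOGNIZED"
}""")

if payload["result_status"] == "RECOGNIZED" and payload["final_result"]:
    # Pick the highest-scoring alternative.
    best = max(payload["alternatives"], key=lambda a: a["score"])
    print(best["text"])  # oito sete quatro três um
```

Per the note above, the parser should tolerate additional fields introduced by future extensions, which dictionary-style access like this does naturally.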