Methods

This section provides details for the existing ASR REST API methods.

RECOGNIZE

Recognizes speech in the audio sent to it. The audio content must be sent in the body of the HTTP message and can be either RAW (16-bit linear PCM with a sample rate of 8 kHz or 16 kHz, according to the installed AM) or encoded. Supported encoded formats are: MP3, OPUS, VORBIS, PCM aLaw/uLaw, GSM, FLAC and WAV. Recognition is performed synchronously and the result is returned in the HTTP response.

Request

POST /asr-server/rest/recognize

HTTP Headers

Accept

(Optional) Content type of the recognition results. Valid values:

  • application/xml

  • application/json

Default value: application/json.

User-Agent

ID of the device and/or application generating the audio. Useful for application log purposes.

Content-Length

Indicates the number of bytes in the content.

Content-Type

Indicates the format of the streamed audio. Valid formats:

  • application/octet-stream – encoded file in a supported format

  • audio/wav – encoded file in a supported format

  • audio/raw – linear PCM audio with no header

Speech recognition can be configured to suit the specific characteristics of the application. These settings are passed as parameters in the headers of the HTTP request. The complete list of parameters is given in the Configuration section. The following example sets the endpointer.levelThreshold and decoder.confidenceThreshold parameters:

POST /asr-server/rest/recognize?lm=builtin:slm/general HTTP/1.1
Host: 127.0.0.1:8025
User-Agent: curl/7.47.0
Accept: */*
Content-Type: audio/wav
endpointer.levelThreshold: 10
decoder.confidenceThreshold: 30
Content-Length: ...

[binary content]
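The same request could be prepared from Python roughly as follows. This is a minimal sketch, not part of the product: the helper name recognize_request_parts is hypothetical, and the base URL is simply the host and port used in the examples of this document. The function only builds the URL and the header dictionary; nothing is sent.

```python
import urllib.parse

# Host and port taken from the examples in this document; adjust for your server.
BASE_URL = "http://127.0.0.1:8025/asr-server/rest"


def recognize_request_parts(lm, content_type="audio/wav", config=None):
    """Return (url, headers) for a recognize call (hypothetical helper).

    `config` maps configuration parameter names to values. Each entry is
    added as an HTTP header, as described in the text above.
    """
    # Keep ':' and '/' unescaped so the URL matches the form used in the docs.
    query = urllib.parse.urlencode({"lm": lm}, safe=":/")
    url = f"{BASE_URL}/recognize?{query}"
    headers = {"Content-Type": content_type}
    for name, value in (config or {}).items():
        headers[name] = str(value)  # header values must be strings
    return url, headers


url, headers = recognize_request_parts(
    "builtin:slm/general",
    config={"endpointer.levelThreshold": 10, "decoder.confidenceThreshold": 30},
)
```

The returned URL and headers can then be passed, together with the audio bytes as the request body, to whichever HTTP client the application already uses.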

Request parameters

lm

Language model URI. If omitted, an error is returned. The URI must start with one of the following prefixes:

  • builtin - internal model (e.g. builtin:slm/general).

  • file - model located on ASR server (e.g. file:///opt/grammar/menu, file:///opt/grammar/menu.gram).

  • http - model located on the network (e.g. http://acme.com/grammar/menu.gram).
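A client may want to reject an obviously malformed lm value before sending any audio. The sketch below is one possible client-side check, assuming only the three prefixes listed above are valid; the function name check_lm_uri is hypothetical, and the server remains the authority on what it accepts.

```python
# Prefixes listed in this section; anything else is rejected client-side.
ALLOWED_LM_PREFIXES = ("builtin", "file", "http")


def check_lm_uri(lm):
    """Return `lm` unchanged if it uses a documented prefix, else raise ValueError."""
    prefix, separator, rest = lm.partition(":")
    if not separator or not rest or prefix not in ALLOWED_LM_PREFIXES:
        raise ValueError(f"unsupported language model URI: {lm!r}")
    return lm


check_lm_uri("builtin:slm/general")
check_lm_uri("file:///opt/grammar/menu.gram")
check_lm_uri("http://acme.com/grammar/menu.gram")
```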

Result (HTTP status = 200)

If the HTTP request returns a ‘200’ status code, the body of the response will have the following structure. The format of the response can be defined by the HTTP header ‘Accept’, selecting JSON (Accept: application/json) or XML (Accept: application/xml). The default format is JSON.

recognition_result

  • alternatives

  • result_status

alternatives: a list of the likely recognition results

  • index: index of the alternative in the list

  • text: recognition text

  • score: confidence score or rate

  • interpretations: a list of the interpretation results, as defined in the grammar. For free speech models, the list is empty.

result_status: recognition status. It can be one of the following values:

  • RECOGNIZED – recognition completed successfully

  • NO_MATCH – recognition completed successfully, but no match was found in the grammar

  • NO_INPUT_TIMEOUT – the recognizer was unable to detect the start of speech before the timer ran out

  • EARLY_SPEECH – the streamed audio did not have an initial stretch of silence (the speech started before recognition started)

  • MAX_SPEECH – the server received more audio than it was able to process

  • RECOGNITION_TIMEOUT – no final result could be generated before the timer ran out

  • NO_SPEECH – no speech could be detected in the streamed audio

  • CANCELED – recognition canceled

  • FAILURE – unknown server error

Result (HTTP status <> 200)

If the HTTP request returns an error with a status code other than ‘200’, the body of the response will have the following structure.

ErrorResponse

  • code: error code (see Error codes).

  • message: Complementary message explaining the reason for the failure.

Examples

REST call with JSON result:

curl -X POST \
  --header "Content-Type: audio/wav" \
  --header "decoder.maxSentences: 1" \
  --data-binary '@/opt/cpqd/asr/samples/audio/ptbr/87431_8k.wav' \
  http://127.0.0.1:8025/asr-server/rest/recognize?lm=builtin:grammar/digits

Result:

[{
  "alternatives": [{
    "text": "oito sete quatro três um",
    "interpretations": ["87431"],
    "words": [{
      "text": "oito",
      "score": 100,
      "start_time": 0.3901262,
      "end_time": 0.95921874
    }, {
      "text": "sete",
      "score": 100,
      "start_time": 0.99,
      "end_time": 1.7068747
    }, {
      "text": "quatro",
      "score": 100,
      "start_time": 1.74,
      "end_time": 2.28
    }, {
      "text": "três",
      "score": 100,
      "start_time": 2.2800765,
      "end_time": 2.8498626
    }, {
      "text": "um",
      "score": 100,
      "start_time": 2.9167604,
      "end_time": 3.2101758
    }],
    "score": 100,
    "lm": "builtin:grammar/digits",
    "interpretation_scores": [100]
  }],
  "segment_index": 0,
  "last_segment": true,
  "final_result": true,
  "start_time": 0.24,
  "end_time": 3.52,
  "result_status": "RECOGNIZED"
}]
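A client would typically check result_status and then pick an alternative. The sketch below parses a trimmed copy of the JSON result above using only the Python standard library. The function name recognized_texts is hypothetical, and choosing the alternative with the highest score is an illustrative policy, not something the API prescribes.

```python
import json

# Trimmed copy of the JSON result shown above (per-word details omitted).
RESPONSE_BODY = """
[{
  "alternatives": [{
    "text": "oito sete quatro três um",
    "interpretations": ["87431"],
    "score": 100,
    "lm": "builtin:grammar/digits",
    "interpretation_scores": [100]
  }],
  "segment_index": 0,
  "last_segment": true,
  "final_result": true,
  "start_time": 0.24,
  "end_time": 3.52,
  "result_status": "RECOGNIZED"
}]
"""


def recognized_texts(body):
    """Return (text, score, interpretations) for each RECOGNIZED segment."""
    results = []
    for segment in json.loads(body):
        if segment.get("result_status") != "RECOGNIZED":
            continue  # NO_MATCH, NO_SPEECH, etc. carry no usable text
        alternatives = segment.get("alternatives") or []
        if not alternatives:
            continue
        # Illustrative choice: keep the alternative with the highest score.
        best = max(alternatives, key=lambda alt: alt["score"])
        results.append((best["text"], best["score"], best.get("interpretations", [])))
    return results


print(recognized_texts(RESPONSE_BODY))
# prints [('oito sete quatro três um', 100, ['87431'])]
```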

REST call with XML result:

curl -X POST \
  --header "Content-Type: audio/wav" \
  --header "Accept: application/xml" \
  --header "decoder.maxSentences: 1" \
  --data-binary '@/opt/cpqd/asr/samples/audio/ptbr/87431_8k.wav' \
  http://127.0.0.1:8025/asr-server/rest/recognize?lm=builtin:grammar/digits

Result:

<ArrayList>
  <item>
    <segment_index>0</segment_index>
    <last_segment>true</last_segment>
    <final_result>true</final_result>
    <start_time>0.24</start_time>
    <end_time>3.52</end_time>
    <result_status>RECOGNIZED</result_status>
    <alternatives>
      <alternative>
        <text>oito sete quatro três um</text>
        <score>100</score>
        <lm>builtin:grammar/digits</lm>
        <interpretations>
          <interpretation>87431</interpretation>
        </interpretations>
        <interpretation_scores>
          <interpretation_score>100</interpretation_score>
        </interpretation_scores>
        <words>
          <word>
            <text>oito</text>
            <score>100</score>
            <start_time>0.3901258</start_time>
            <end_time>0.95921737</end_time>
          </word>
          <word>
            <text>sete</text>
            <score>100</score>
            <start_time>0.99</start_time>
            <end_time>1.7068772</end_time>
          </word>
          <word>
            <text>quatro</text>
            <score>100</score>
            <start_time>1.74</start_time>
            <end_time>2.28</end_time>
          </word>
          <word>
            <text>três</text>
            <score>100</score>
            <start_time>2.2800772</start_time>
            <end_time>2.8498623</end_time>
          </word>
          <word>
            <text>um</text>
            <score>100</score>
            <start_time>2.9167345</start_time>
            <end_time>3.210177</end_time>
          </word>
        </words>
      </alternative>
    </alternatives>
  </item>
</ArrayList>
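The XML form carries the same information and can be read with the standard xml.etree.ElementTree module. The sketch below extracts the status, text and per-word timings from a trimmed copy of the result above. The function name parse_segments and the shape of the returned dictionaries are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the XML result shown above (only the first two words kept).
RESPONSE_BODY = """
<ArrayList>
  <item>
    <result_status>RECOGNIZED</result_status>
    <alternatives>
      <alternative>
        <text>oito sete quatro três um</text>
        <score>100</score>
        <words>
          <word>
            <text>oito</text>
            <start_time>0.3901258</start_time>
            <end_time>0.95921737</end_time>
          </word>
          <word>
            <text>sete</text>
            <start_time>0.99</start_time>
            <end_time>1.7068772</end_time>
          </word>
        </words>
      </alternative>
    </alternatives>
  </item>
</ArrayList>
"""


def parse_segments(body):
    """Return one dict per <item>, with status, texts and word timings."""
    root = ET.fromstring(body.strip())
    segments = []
    for item in root.iterfind("item"):
        alternatives = []
        for alt in item.iterfind("alternatives/alternative"):
            words = [
                (w.findtext("text"), float(w.findtext("start_time")), float(w.findtext("end_time")))
                for w in alt.iterfind("words/word")
            ]
            alternatives.append({"text": alt.findtext("text"), "words": words})
        segments.append({"status": item.findtext("result_status"), "alternatives": alternatives})
    return segments


segments = parse_segments(RESPONSE_BODY)
```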

Result with error (JSON):

{
  "code":"ERR_LM_NOT_FOUND",
  "message":"Language Model not found: builtin:grammar/booh"
}

Result with error (XML):

<ErrorResponse>
  <code>ERR_LM_NOT_FOUND</code>
  <message>Language Model not found: builtin:grammar/booh</message>
</ErrorResponse>
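Both error formats reduce to a code and a message, so a client can normalize them in one place. The following is a minimal sketch using the two error bodies shown above; the function name parse_error is hypothetical, and selecting the parser by looking for "xml" in the content type is an assumption about how the client tracks the format it requested.

```python
import json
import xml.etree.ElementTree as ET


def parse_error(body, content_type="application/json"):
    """Return (code, message) from an ErrorResponse body in JSON or XML."""
    if "xml" in content_type:
        root = ET.fromstring(body.strip())
        return root.findtext("code"), root.findtext("message")
    data = json.loads(body)
    return data["code"], data["message"]


json_error = '{"code":"ERR_LM_NOT_FOUND","message":"Language Model not found: builtin:grammar/booh"}'
xml_error = """
<ErrorResponse>
  <code>ERR_LM_NOT_FOUND</code>
  <message>Language Model not found: builtin:grammar/booh</message>
</ErrorResponse>
"""

print(parse_error(json_error))
print(parse_error(xml_error, "application/xml"))
```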

INTERPRET

Performs semantic interpretation of a text supplied by the client, using the indicated grammar, in the same way as RECOGNIZE. The text must be sent in the HTTP message body. Interpretation is performed synchronously and the result is returned in the HTTP response.

Request

POST /asr-server/rest/interpret

HTTP Headers

Accept

(Optional) Content type of the recognition results. Valid values:

  • application/xml

  • application/json

Default value: application/json.

User-Agent

ID of the device and/or application generating the request. Useful for application log purposes.

Content-Length

Indicates the number of bytes in the content.

Content-Type

Indicates the format of the content sent in the request body. Valid formats:

  • text/plain – text content

Request parameters

lm

Language model URI. If omitted, an error is returned. The URI must start with one of the following prefixes:

  • builtin - internal model (e.g. builtin:slm/general).

  • file - model located on ASR server (e.g. file:///opt/grammar/menu, file:///opt/grammar/menu.gram).

  • http - model located on the network (e.g. http://acme.com/grammar/menu.gram).

Result

The result is an object with the same structure as the ‘recognize’ result, but only some of the fields are meaningful and should be used:

recognition_result

  • alternatives

  • result_status

alternatives: a list of the likely recognition results

  • text: recognition text

  • score: confidence score or rate

  • interpretations: a list of the interpretation results, as defined in the grammar. For free speech models, the list is empty.

result_status: recognition status. It can be one of the following values:

  • RECOGNIZED – recognition completed successfully

  • NO_MATCH – recognition completed successfully, but no match was found in the grammar

  • CANCELED – recognition canceled

  • FAILURE – unknown server error

Examples

REST call with JSON result:

curl -X POST \
  --header "Content-Type: text/plain" \
  --data 'oito sete quatro três um' \
  http://127.0.0.1:8025/asr-server/rest/interpret?lm=builtin:grammar/digits

Result:

{
  "alternatives": [{
    "text": "oito sete quatro três um",
    "interpretations": ["87431"],
    "score": 100,
    "lm": "builtin:grammar/digits",
    "interpretation_scores": [100]
  }],
  "result_status": "RECOGNIZED"
}
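The equivalent call from Python might look like the sketch below, which uses only the standard library. The helper names build_interpret_request and interpret are hypothetical, the base URL is the one used in the examples of this document, and encoding the text as UTF-8 is an assumption about what the server expects.

```python
import json
import urllib.parse
import urllib.request

# Host and port taken from the examples in this document; adjust for your server.
BASE_URL = "http://127.0.0.1:8025/asr-server/rest"


def build_interpret_request(text, lm, accept="application/json"):
    """Build (but do not send) a POST request for the interpret method."""
    query = urllib.parse.urlencode({"lm": lm}, safe=":/")
    return urllib.request.Request(
        f"{BASE_URL}/interpret?{query}",
        data=text.encode("utf-8"),  # assumption: the server expects UTF-8 text
        headers={"Content-Type": "text/plain", "Accept": accept},
        method="POST",
    )


def interpret(text, lm):
    """Send the request and return the decoded JSON result.

    urlopen raises urllib.error.HTTPError for non-2xx statuses; the
    ErrorResponse body can then be read from the exception with .read().
    """
    with urllib.request.urlopen(build_interpret_request(text, lm)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_interpret_request("oito sete quatro três um", "builtin:grammar/digits")
```

Against a running server, interpret("oito sete quatro três um", "builtin:grammar/digits") would be expected to return a dictionary shaped like the JSON result above.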

REST call with XML result:

curl -X POST \
  --header "Content-Type: text/plain" \
  --header "Accept: application/xml" \
  --data 'oito sete quatro três um' \
  http://127.0.0.1:8025/asr-server/rest/interpret?lm=builtin:grammar/digits

Result:

<recognition_result>
  <result_status>RECOGNIZED</result_status>
  <alternatives>
    <alternative>
      <text>oito sete quatro três um</text>
      <score>100</score>
      <lm>builtin:grammar/digits</lm>
      <interpretations>
        <interpretation>87431</interpretation>
      </interpretations>
      <interpretation_scores>
        <interpretation_score>100</interpretation_score>
      </interpretation_scores>
    </alternative>
  </alternatives>
</recognition_result>