Methods

This section describes the methods of the Speech Server REST API for the speech recognition (ASR) resource that changed when parameters were added to support new features. For full details of the API, see the CPQD Reconhecimento de Fala documentation.

RECOGNIZE

POST /v2/recognize

HTTP Headers (additional)

Infer-age-enabled

(Optional) Enables classification of the user's age. Valid values:

true
false (default)

Infer-gender-enabled

(Optional) Enables classification of the user's gender. Valid values:

true
false (default)

Infer-emotion-enabled

(Optional) Enables classification of the user's emotional tone. Valid values:

true
false (default)

When any of the headers above is set to true, the corresponding service is invoked during speech recognition, and its response is combined with the ASR response.
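As a minimal sketch of how a client might assemble these optional headers, assuming a Python client (the helper name `build_inference_headers` is hypothetical and not part of the API; only the header names and the true/false values come from this section):

```python
def build_inference_headers(age=False, gender=False, emotion=False):
    """Return the optional classification headers as a dict of strings.

    Hypothetical helper: only the header names and their true/false
    values are taken from the documentation.
    """
    flags = {
        "Infer-age-enabled": age,
        "Infer-gender-enabled": gender,
        "Infer-emotion-enabled": emotion,
    }
    # HTTP header values are strings, so booleans become "true" or "false".
    return {name: "true" if enabled else "false" for name, enabled in flags.items()}


headers = build_inference_headers(age=True, emotion=True)
print(headers)
```

Sending "false" explicitly matches the documented default, so a header that is left out should behave the same way.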

Result (HTTP status = 200)

If the HTTP request returns status code 200, the response body contains the recognition result plus additional fields. The additional fields are:

Field	Description	Field type
age_scores	Dictionary containing the event type, the estimated age, the probability for each age bracket, and a confidence level.	{ event: <string>, age: <int>, p: <dict> { <0-10>: <float>, <10-20>: <float>, <20-30>: <float>, <30-40>: <float>, <40-50>: <float>, <50-60>: <float>, <60-70>: <float>, <70-80>: <float>, <80-90>: <float>, <90-100>: <float> }, confidence: <string> }
gender_scores	Dictionary containing the event type, the probabilities, and the gender.	{ event: <string>, p: <array[float]>, gender: <string> }
emotion_scores	Dictionary containing the event type, the detected emotion, and the probabilities per emotion and per emotion group as <K,V> pairs.	{ event: <string>, emotion: <string>, p: <dict> { <enojado>: <float>, <frustrado>: <float>, <triste>: <float>, <ansioso>: <float>, <entusiasmado>: <float>, <feliz>: <float>, <surpreso>: <float>, <amedrontado>: <float>, <neutro>: <float>, <irritado>: <float> }, p_groups: <dict> { <negativo_desativado>: <float>, <positivo>: <float>, <neutro>: <float>, <negativo_ativado>: <float> } }
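Judging from the JSON example in this section, the response is a list in which each additional field arrives as its own single-key object after the recognition segments. Under that assumption, a client could separate the two roughly as follows (the function name `split_response` is hypothetical, and the sample values are rounded for brevity):

```python
EXTRA_FIELDS = ("age_scores", "gender_scores", "emotion_scores")


def split_response(items):
    """Split a parsed JSON response into ASR segments and extra classifier fields."""
    segments, extras = [], {}
    for item in items:
        found = [name for name in EXTRA_FIELDS if name in item]
        if found:
            # Items carrying a classifier field are collected by field name.
            for name in found:
                extras[name] = item[name]
        else:
            # Everything else is treated as a recognition segment.
            segments.append(item)
    return segments, extras


# Abbreviated sample shaped like the documented result (values rounded).
sample = [
    {"alternatives": [{"text": "boa noite", "score": 100}], "result_status": "RECOGNIZED"},
    {"gender_scores": {"event": "GENDER RESULT", "p": [0.03, 0.97], "gender": "F"}},
]
segments, extras = split_response(sample)
print(len(segments), sorted(extras))
```

Looking fields up by name rather than by list position keeps the client working when only some of the three headers are enabled.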

Examples

REST call passing the language model via query parameters:

curl -X POST \
  --header "Content-Type: audio/wav" \
  --header 'accept: application/json' \
  --header 'infer-age-enabled: true' \
  --header 'infer-gender-enabled: true' \
  --header 'infer-emotion-enabled: true' \
  --header "decoder.maxSentences: 1" \
  --data-binary '@/nasceu-8k.wav' \
  'http://localhost:8000/asr/rest/v2/recognize?lm=builtin:slm/general'
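For readers calling the API from code rather than the shell, the same request could be built with Python's standard library roughly as follows. This is a sketch, not an official client: the function name `build_recognize_request` is hypothetical, and the audio bytes are a placeholder. The URL, headers, and `lm` value are copied from the curl call.

```python
import urllib.parse
import urllib.request


def build_recognize_request(audio_bytes, base_url="http://localhost:8000/asr/rest/v2/recognize"):
    """Build (but do not send) a POST request mirroring the curl call."""
    # Keep ':' and '/' unescaped so the query matches the documented form.
    query = urllib.parse.urlencode({"lm": "builtin:slm/general"}, safe=":/")
    headers = {
        "Content-Type": "audio/wav",
        "Accept": "application/json",
        "Infer-age-enabled": "true",
        "Infer-gender-enabled": "true",
        "Infer-emotion-enabled": "true",
        "decoder.maxSentences": "1",
    }
    # urllib normalizes header-name capitalization; HTTP header names are
    # case-insensitive, so this should not change how the server reads them.
    return urllib.request.Request(
        f"{base_url}?{query}", data=audio_bytes, headers=headers, method="POST"
    )


request = build_recognize_request(b"RIFF")  # placeholder bytes, not real audio
print(request.get_method(), request.full_url)
```

Passing the result to `urllib.request.urlopen(request)` would send it. That step is left out here because it requires a running server.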

REST call passing the language model and the audio via multipart:

curl -X POST \
  --header 'accept: application/json' \
  --header 'infer-age-enabled: true' \
  --header 'infer-gender-enabled: true' \
  --header 'infer-emotion-enabled: true' \
  --header "decoder.maxSentences: 2" \
  --header 'decoder.continuousMode: true' \
  --form 'audio=@"/pizza-veg-8k.wav"' \
  --form 'lm=@"/pizza.gram"' \
  http://localhost:8000/asr/rest/v2/recognize

Result:

[
  {
    "alternatives": [
      {
        "text": "boa noite",
        "words": [
          {
            "text": "boa",
            "score": 100,
            "start_time": 1.31,
            "end_time": 1.64
          },
          {
            "text": "noite",
            "score": 100,
            "start_time": 1.64,
            "end_time": 2.1499999
          }
        ],
        "score": 100,
        "lm": "builtin:slm/general"
      }
    ],
    "segment_index": 0,
    "last_segment": true,
    "final_result": true,
    "start_time": 1.16,
    "end_time": 2.23,
    "result_status": "RECOGNIZED"
  },
  {
    "age_scores": {
      "event": "AGE RESULT",
      "age": 46,
      "p": {
        "0-10": 6.418202899121052e-07,
        "10-20": 0.015974009871477613,
        "20-30": 0.04174115835109025,
        "30-40": 0.07696848183742766,
        "40-50": 0.10564251137743308,
        "50-60": 0.7548041620187447,
        "60-70": 0.004869034450940512,
        "70-80": 2.113791160785148e-10,
        "80-90": 5.957488708518843e-11,
        "90-100": 1.6421073156355875e-12
      },
      "confidence": "mid"
    }
  },
  {
    "gender_scores": {
      "event": "GENDER RESULT",
      "p": [
        0.034807813443786806,
        0.9651921865562132
      ],
      "gender": "F"
    }
  },
  {
    "emotion_scores": {
      "event": "EMOTION RESULT",
      "emotion": "irritado",
      "p": {
        "enojado": 0.04954051971435547,
        "frustrado": 0.015936316922307014,
        "triste": 0.011620878241956234,
        "ansioso": 0.05386947840452194,
        "entusiasmado": 0.18894456326961517,
        "feliz": 0.019370414316654205,
        "surpreso": 0.02057667449116707,
        "amedrontado": 0.016010118648409843,
        "neutro": 0.10917621850967407,
        "irritado": 0.5149547457695007
      },
      "p_groups": {
        "negativo_desativado": 0.07709771487861872,
        "positivo": 0.2827611304819584,
        "neutro": 0.12518633715808392,
        "negativo_ativado": 0.5149547457695007
      }
    }
  }
]
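The `p` and `p_groups` dictionaries map labels to probabilities, so a client can pick the most probable label itself. The sketch below uses values copied from the result above, and the helper name `most_probable` is hypothetical. Note that in this example the `age` field (46) lies outside the most probable bracket (50-60). This section does not say how `age` is derived, so the two should not be assumed to agree.

```python
def most_probable(scores):
    """Return the (label, probability) pair with the highest probability."""
    label = max(scores, key=scores.get)
    return label, scores[label]


# Values copied from the example result.
age_p = {
    "0-10": 6.418202899121052e-07,
    "10-20": 0.015974009871477613,
    "20-30": 0.04174115835109025,
    "30-40": 0.07696848183742766,
    "40-50": 0.10564251137743308,
    "50-60": 0.7548041620187447,
    "60-70": 0.004869034450940512,
    "70-80": 2.113791160785148e-10,
    "80-90": 5.957488708518843e-11,
    "90-100": 1.6421073156355875e-12,
}
p_groups = {
    "negativo_desativado": 0.07709771487861872,
    "positivo": 0.2827611304819584,
    "neutro": 0.12518633715808392,
    "negativo_ativado": 0.5149547457695007,
}

print(most_probable(age_p))     # ('50-60', 0.7548041620187447)
print(most_probable(p_groups))  # ('negativo_ativado', 0.5149547457695007)
```

The `p` array in `gender_scores` is not handled here because this section does not document which index corresponds to which gender.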

REST call with an XML result:

curl -X POST \
  --header "Content-Type: audio/wav" \
  --header 'accept: application/xml' \
  --header 'infer-age-enabled: true' \
  --header 'infer-gender-enabled: true' \
  --header 'infer-emotion-enabled: true' \
  --header "decoder.maxSentences: 1" \
  --data-binary '@/nasceu-8k.wav' \
  'http://localhost:8000/asr/rest/v2/recognize?lm=builtin:slm/general'

Result:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
  <ArrayList type="dict">
    <item type="dict">
      <segment_index type="str">0</segment_index>
      <last_segment type="str">true</last_segment>
      <final_result type="str">true</final_result>
      <start_time type="str">1.16</start_time>
      <end_time type="str">2.23</end_time>
      <result_status type="str">RECOGNIZED</result_status>
      <alternatives type="dict">
        <alternative type="dict">
          <text type="str">boa noite</text>
          <score type="str">100</score>
          <lm type="str">builtin:slm/general</lm>
          <words type="dict">
            <word type="list">
              <item type="dict">
                <text type="str">boa</text>
                <score type="str">100</score>
                <start_time type="str">1.31</start_time>
                <end_time type="str">1.64</end_time>
              </item>
              <item type="dict">
                <text type="str">noite</text>
                <score type="str">100</score>
                <start_time type="str">1.64</start_time>
                <end_time type="str">2.1499999</end_time>
              </item>
            </word>
          </words>
        </alternative>
      </alternatives>
    </item>
    <age_scores type="dict">
      <event type="str">AGE RESULT</event>
      <age type="int">46</age>
      <p type="dict">
        <key name="0-10" type="float">6.418202899121052e-07</key>
        <key name="10-20" type="float">0.015974009871477613</key>
        <key name="20-30" type="float">0.04174115835109025</key>
        <key name="30-40" type="float">0.07696848183742766</key>
        <key name="40-50" type="float">0.10564251137743308</key>
        <key name="50-60" type="float">0.7548041620187447</key>
        <key name="60-70" type="float">0.004869034450940512</key>
        <key name="70-80" type="float">2.113791160785148e-10</key>
        <key name="80-90" type="float">5.957488708518843e-11</key>
        <key name="90-100" type="float">1.6421073156355875e-12</key>
      </p>
      <confidence type="str">mid</confidence>
    </age_scores>
    <gender_scores type="dict">
      <event type="str">GENDER RESULT</event>
      <p type="list">
        <item type="float">0.034807813443786806</item>
        <item type="float">0.9651921865562132</item>
      </p>
      <gender type="str">F</gender>
    </gender_scores>
    <emotion_scores type="dict">
      <event type="str">EMOTION RESULT</event>
      <emotion type="str">irritado</emotion>
      <p type="dict">
        <enojado type="float">0.04954051971435547</enojado>
        <frustrado type="float">0.015936316922307014</frustrado>
        <triste type="float">0.011620878241956234</triste>
        <ansioso type="float">0.05386947840452194</ansioso>
        <entusiasmado type="float">0.18894456326961517</entusiasmado>
        <feliz type="float">0.019370414316654205</feliz>
        <surpreso type="float">0.02057667449116707</surpreso>
        <amedrontado type="float">0.016010118648409843</amedrontado>
        <neutro type="float">0.10917621850967407</neutro>
        <irritado type="float">0.5149547457695007</irritado>
      </p>
      <p_groups type="dict">
        <negativo_desativado type="float">0.07709771487861872</negativo_desativado>
        <positivo type="float">0.2827611304819584</positivo>
        <neutro type="float">0.12518633715808392</neutro>
        <negativo_ativado type="float">0.5149547457695007</negativo_ativado>
      </p_groups>
    </emotion_scores>
  </ArrayList>
</root>
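In the XML form, the age brackets appear as `<key name="...">` elements, while the emotion labels are element names. A minimal sketch of reading `age_scores` with Python's standard `xml.etree.ElementTree` follows, assuming the layout shown above. The function name `parse_age_scores` is hypothetical, and the fragment is abbreviated to two brackets with the XML declaration omitted.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment based on the XML result (two brackets only).
XML_FRAGMENT = """<root>
  <ArrayList type="dict">
    <age_scores type="dict">
      <event type="str">AGE RESULT</event>
      <age type="int">46</age>
      <p type="dict">
        <key name="40-50" type="float">0.10564251137743308</key>
        <key name="50-60" type="float">0.7548041620187447</key>
      </p>
      <confidence type="str">mid</confidence>
    </age_scores>
  </ArrayList>
</root>"""


def parse_age_scores(xml_text):
    """Extract age_scores from an XML response, or None if it is absent."""
    root = ET.fromstring(xml_text)
    node = root.find("./ArrayList/age_scores")
    if node is None:
        return None
    return {
        "event": node.findtext("event"),
        "age": int(node.findtext("age")),
        # Bracket labels live in the "name" attribute of each <key>.
        "p": {key.get("name"): float(key.text) for key in node.findall("./p/key")},
        "confidence": node.findtext("confidence"),
    }


print(parse_age_scores(XML_FRAGMENT))
```

Returning `None` when the element is missing covers requests sent without `infer-age-enabled: true`, assuming the server then omits `age_scores` from the response.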