WebSocket interface

The WebSocket interface enables you to add speech recognition capabilities to your client applications. The transcription service relies on a WebSocket connection to asynchronously exchange bidirectional data (audio chunks and JSON text messages) in real time.

API calls require an API key and a language model, specified as URL parameters.

The base URL for API calls is wss://developer.speech-i.com/ws/client/speech?key=XXX&model=YYY, where XXX is your API key and YYY is the language model.
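
The connection URL can be assembled from the two parameters; a minimal sketch, using a placeholder API key (XXX) and one of the models listed below:

```python
from urllib.parse import urlencode

BASE_URL = "wss://developer.speech-i.com/ws/client/speech"

def build_url(api_key: str, model: str) -> str:
    """Append the API key and language model as URL query parameters."""
    return f"{BASE_URL}?{urlencode({'key': api_key, 'model': model})}"

# Placeholder key; substitute your own.
url = build_url("XXX", "en-US_16k")
```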

 Usage

The recognition request and response cycle has the following steps:

  1. The client establishes a WebSocket connection to the server.

  2. Once the connection is open, the client must send the JSON-formatted transcription parameters.

  3. The server responds with a JSON-formatted status message:

     {"status":"ready|error|busy"}

  4. If the server status is ready, the client can start sending the audio chunks in binary format (binary frames); to limit latency, each chunk should contain at most 500 ms of audio.

  5. The audio must be mono, encoded as Speex or raw PCM, with a sampling rate of 8 kHz or 16 kHz. Other formats may be available on request.

  6. The server periodically sends JSON-formatted partial results, according to the transcription parameter settings:

     {"final":false,"segment-start":0,"total-length":1.75,"transcript":"hello"}
     {"final":false,"segment-start":0,"total-length":2.75,"transcript":"hello world"}

  7. When the audio transfer is complete, the client can send a stop-decoding message:

     {"decoding":"stop"}

  8. The server sends the JSON-formatted final transcription:

     {
       "confidence":0.967202366309009,
       "final":true,
       "hypotheses":[
         {
           "likelihood":135.98733520507813,
           "transcript":"hello world",
           "word-alignment":[
             {
               "end":0.3899999912828207,
               "start":0,
               "word":"hello"
             },
             {
               "end":1.409999968484044,
               "start":0.41999999061226847,
               "word":"world"
             }
           ]
         },
         {
           "likelihood":133.66854858398438,
           "transcript":"yellow world"
         }
       ],
       "segment-length":1.409999968484044,
       "segment-start":0
     }

  9. Depending on the transcription parameters (see connection-mode), the cycle can restart from step 3.
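
The request and response cycle above can be sketched as a minimal Python client. This is a sketch under stated assumptions: it uses the third-party websocket-client package, a placeholder API key (XXX), raw 16-bit PCM audio at 16 kHz, and an illustrative file name.

```python
import json

# Placeholder key and an example model from the table below.
URL = "wss://developer.speech-i.com/ws/client/speech?key=XXX&model=en-US_16k"

# At most 500 ms of audio per binary chunk (16 kHz, 16-bit mono raw PCM).
BYTES_PER_SECOND = 16000 * 2
CHUNK_BYTES = BYTES_PER_SECOND // 2

def start_message() -> str:
    """Transcription parameters sent as soon as the connection opens (step 2)."""
    return json.dumps({
        "decoding": "start",
        "audio-type": "s16le;16000",
        "connection-mode": "on-stop-close",
    })

def is_ready(status_message: str) -> bool:
    """True if the server's status message (step 3) reports 'ready'."""
    return json.loads(status_message).get("status") == "ready"

def transcribe(path: str) -> dict:
    """Run one full request/response cycle and return the final transcription."""
    # Third-party dependency: pip install websocket-client
    import websocket

    ws = websocket.create_connection(URL)
    ws.send(start_message())
    if not is_ready(ws.recv()):
        ws.close()
        raise RuntimeError("server not ready")
    with open(path, "rb") as f:          # stream the audio in binary frames
        while chunk := f.read(CHUNK_BYTES):
            ws.send_binary(chunk)
    ws.send(json.dumps({"decoding": "stop"}))
    while True:                           # partial results arrive until "final" is true
        result = json.loads(ws.recv())
        if result.get("final"):
            ws.close()
            return result

# transcribe("speech.raw")  # illustrative file: raw 16-bit PCM, 16 kHz, mono
```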

 Transcription parameters

The following table shows the customizable parameters that the client must send when the WebSocket connection is opened.

PARAMETER TYPE DEFAULT DESCRIPTION
decoding String: 'start','stop' Mandatory parameter Mandatory parameter to start or end the transcription.
audio-type String: 's16le;16000', 'speex;nb;8' Mandatory parameter Mandatory parameter. It specifies the audio encoding and the sampling rate. Audio must be mono; the encoding can be Speex or raw PCM; the sampling rate can be 8 or 16 kHz.
traceback-period Float: 0.2-inf 0.2 Time interval between partial transcription results.
timeout int: 10-100 30 WebSocket dialog timeout.
n-best int: 1-inf 12 Maximum number of N-best hypotheses contained in the output.
do-word-alignment boolean true Specifies whether the output should contain word-level time alignments.
do-endpointing boolean true Activates/deactivates the endpoint detector. When deactivated, the final transcription is generated only after the transcription process has been closed.
min-silence float: 0.1-inf 0.8 Minimum silence duration used to detect an endpoint.
connection-mode String: 'always-open','on-stop-close','on-final-close' 'always-open' 'always-open': the connection stays open and must be closed by the client;
'on-stop-close': the server closes the connection at the end of the transcription;
'on-final-close': the server closes the connection when it detects the first endpoint.
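
A typical opening message combining these parameters might look as follows; the values are illustrative, and only "decoding" and "audio-type" are mandatory, the rest override the defaults from the table:

```python
import json

params = {
    "decoding": "start",
    "audio-type": "s16le;16000",   # raw 16-bit PCM, 16 kHz, mono
    "traceback-period": 0.5,       # partial results every 500 ms
    "n-best": 3,
    "do-word-alignment": True,
    "do-endpointing": True,
    "min-silence": 0.8,
    "connection-mode": "on-stop-close",
}
# Sent as a JSON text message right after the connection opens.
message = json.dumps(params)
```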

 Language models

The following table shows supported language models (other languages are available on request):

MODEL DESCRIPTION
en-GB_16k British English language model 16kHz
en-US_16k American English language model 16kHz
en-IE_16k Irish English language model 16kHz
fr-FR_16k French language model 16kHz
de-DE_16k German language model 16kHz
it-IT_16k Italian language model 16kHz
es-ES_16k Spanish language model 16kHz
pl-PL_16k Polish language model 16kHz
nl-NL_16k Dutch language model 16kHz
pt-PT_16k European Portuguese language model 16kHz
pt-BR_16k Brazilian Portuguese language model 16kHz
el-EL_16k Greek language model 16kHz
ro-RO_16k Romanian language model 16kHz
sl-SL_16k Slovenian language model 16kHz
sk-SK_16k Slovak language model 16kHz
cs-CS_16k Czech language model 16kHz
lt-LT_16k Lithuanian language model 16kHz
bg-BG_16k Bulgarian language model 16kHz
hr-HR_16k Croatian language model 16kHz
hu-HU_16k Hungarian language model 16kHz
fi-FI_16k Finnish language model 16kHz
sv-SV_16k Swedish language model 16kHz
uk-UK_16k Ukrainian language model 16kHz
ru-RU_16k Russian language model 16kHz
zh-ZH_16k Chinese language model 16kHz
ar-AR_16k Arabic language model 16kHz
da-DA_16k Danish language model 16kHz
mt-MT_16k Maltese language model 16kHz
ko-KR_16k Korean language model 16kHz
fa-IR_16k Persian language model 16kHz
et-EE_16k Estonian language model 16kHz
lv-LV_16k Latvian language model 16kHz
ga-IE_16k Irish language model 16kHz
sq-AL_16k Albanian language model 16kHz