WebSocket interface

The WebSocket interface enables you to add speech recognition capabilities to your client applications. The transcription service relies on a WebSocket connection to asynchronously exchange bidirectional data (audio chunks and JSON text messages) in real time.

API calls require an API key and a language model, specified as URL parameters.

The base URL for API calls is wss://developer.speech-i.com/ws/client/speech?key=XXX&model=YYY, where XXX is your API key and YYY is the language model.
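
The connection URL can be assembled from the two parameters; a minimal sketch, using a placeholder API key (XXX) and one of the models listed below:

```python
from urllib.parse import urlencode

BASE_URL = "wss://developer.speech-i.com/ws/client/speech"

def build_url(api_key: str, model: str) -> str:
    """Append the API key and language model as URL query parameters."""
    return f"{BASE_URL}?{urlencode({'key': api_key, 'model': model})}"

# Placeholder key; substitute your own.
url = build_url("XXX", "en-US_16k")
```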

 Usage

The recognition request and response cycle has the following steps:

  1. The client establishes a WebSocket connection to the server.

  2. Once the connection is open, the client must send the JSON-formatted transcription parameters.

  3. The server responds with a JSON-formatted status message:

     {"status":"ready|error|busy"}

  4. If the server status is ready, the client can start sending the audio chunks in binary format (binary frames); to limit latency, each chunk should contain at most 500 ms of audio.

  5. The audio must be mono, encoded as Speex or raw PCM, with a sampling rate of 8 kHz or 16 kHz. Other formats may be available on request.

  6. The server periodically sends JSON-formatted partial results, according to the transcription parameter settings:

     {"final":false,"segment-start":0,"total-length":1.75,"transcript":"hello"}
     {"final":false,"segment-start":0,"total-length":2.75,"transcript":"hello world"}

  7. When the audio transfer is complete, the client can send a stop-decoding message:

     {"decoding":"stop"}

  8. The server sends the JSON-formatted final transcription:

     {
       "confidence":0.967202366309009,
       "final":true,
       "hypotheses":[
         {
           "likelihood":135.98733520507813,
           "transcript":"hello world",
           "word-alignment":[
             {
               "end":0.3899999912828207,
               "start":0,
               "word":"hello"
             },
             {
               "end":1.409999968484044,
               "start":0.41999999061226847,
               "word":"world"
             }
           ]
         },
         {
           "likelihood":133.66854858398438,
           "transcript":"yellow world"
         }
       ],
       "segment-length":1.409999968484044,
       "segment-start":0
     }

  9. Depending on the transcription parameters (see connection-mode), the cycle can restart from step 3.
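
The request and response cycle above can be sketched as a minimal Python client. This is a sketch under stated assumptions: it uses the third-party websocket-client package, a placeholder API key (XXX), raw 16-bit PCM audio at 16 kHz, and an illustrative file name.

```python
import json

# Placeholder key and an example model from the table below.
URL = "wss://developer.speech-i.com/ws/client/speech?key=XXX&model=en-US_16k"

# At most 500 ms of audio per binary chunk (16 kHz, 16-bit mono raw PCM).
BYTES_PER_SECOND = 16000 * 2
CHUNK_BYTES = BYTES_PER_SECOND // 2

def start_message() -> str:
    """Transcription parameters sent as soon as the connection opens (step 2)."""
    return json.dumps({
        "decoding": "start",
        "audio-type": "s16le;16000",
        "connection-mode": "on-stop-close",
    })

def is_ready(status_message: str) -> bool:
    """True if the server's status message (step 3) reports 'ready'."""
    return json.loads(status_message).get("status") == "ready"

def transcribe(path: str) -> dict:
    """Run one full request/response cycle and return the final transcription."""
    # Third-party dependency: pip install websocket-client
    import websocket

    ws = websocket.create_connection(URL)
    ws.send(start_message())
    if not is_ready(ws.recv()):
        ws.close()
        raise RuntimeError("server not ready")
    with open(path, "rb") as f:          # stream the audio in binary frames
        while chunk := f.read(CHUNK_BYTES):
            ws.send_binary(chunk)
    ws.send(json.dumps({"decoding": "stop"}))
    while True:                           # partial results arrive until "final" is true
        result = json.loads(ws.recv())
        if result.get("final"):
            ws.close()
            return result

# transcribe("speech.raw")  # illustrative file: raw 16-bit PCM, 16 kHz, mono
```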

 Transcription parameters

The following table shows the customizable parameters that the client must send when the WebSocket connection is opened.

PARAMETER TYPE DEFAULT DESCRIPTION
decoding String: 'start','stop' Mandatory parameter Mandatory parameter to start or end the transcription.
audio-type String: 's16le;16000', 'speex;nb;8' Mandatory parameter Mandatory parameter. It specifies the audio encoding and the sampling rate. Audio must be mono; the encoding can be Speex or raw PCM; the sampling rate can be 8 or 16 kHz.
traceback-period Float: 0.2-inf 0.2 Time interval between partial transcription results.
timeout int: 10-100 30 WebSocket dialog timeout.
n-best int: 1-inf 12 Maximum number of N-best hypotheses contained in the output.
do-word-alignment boolean true Specifies whether the output should contain word-level time alignments.
do-endpointing boolean true Activates/deactivates the endpoint detector. When deactivated, the final transcription is generated only after the transcription process has been closed.
min-silence float: 0.1-inf 0.8 Minimum silence duration used to detect an endpoint.
connection-mode String: 'always-open','on-stop-close','on-final-close' 'always-open' 'always-open': the connection stays open and must be closed by the client;
'on-stop-close': the server closes the connection at the end of the transcription;
'on-final-close': the server closes the connection when it detects the first endpoint.
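
A typical opening message combining these parameters might look as follows; the values are illustrative, and only "decoding" and "audio-type" are mandatory, the rest override the defaults from the table:

```python
import json

params = {
    "decoding": "start",
    "audio-type": "s16le;16000",   # raw 16-bit PCM, 16 kHz, mono
    "traceback-period": 0.5,       # partial results every 500 ms
    "n-best": 3,
    "do-word-alignment": True,
    "do-endpointing": True,
    "min-silence": 0.8,
    "connection-mode": "on-stop-close",
}
# Sent as a JSON text message right after the connection opens.
message = json.dumps(params)
```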

 Language models

The following table shows supported language models (other languages are available on request):

MODEL DESCRIPTION
en-GB_16k British English language model 16kHz
en-US_16k American English language model 16kHz
en-IE_16k Irish English language model 16kHz
fr-FR_16k French language model 16kHz
de-DE_16k German language model 16kHz
it-IT_16k Italian language model 16kHz
es-ES_16k Spanish language model 16kHz
pl-PL_16k Polish language model 16kHz
nl-NL_16k Dutch language model 16kHz
pt-PT_16k European Portuguese language model 16kHz
pt-BR_16k Brazilian Portuguese language model 16kHz
el-EL_16k Greek language model 16kHz
ro-RO_16k Romanian language model 16kHz
sl-SL_16k Slovenian language model 16kHz
sk-SK_16k Slovak language model 16kHz
cs-CS_16k Czech language model 16kHz
lt-LT_16k Lithuanian language model 16kHz
bg-BG_16k Bulgarian language model 16kHz
hr-HR_16k Croatian language model 16kHz
hu-HU_16k Hungarian language model 16kHz
fi-FI_16k Finnish language model 16kHz
sv-SV_16k Swedish language model 16kHz
uk-UK_16k Ukrainian language model 16kHz
ru-RU_16k Russian language model 16kHz
zh-ZH_16k Chinese language model 16kHz
ar-AR_16k Arabic language model 16kHz
da-DA_16k Danish language model 16kHz
mt-MT_16k Maltese language model 16kHz
ko-KR_16k Korean language model 16kHz
fa-IR_16k Persian language model 16kHz
et-EE_16k Estonian language model 16kHz
lv-LV_16k Latvian language model 16kHz
ga-IE_16k Irish language model 16kHz
sq-AL_16k Albanian language model 16kHz