Websocket interface
The websocket interface lets you add speech recognition capabilities to your client applications. The transcription service relies on a websocket connection to exchange bi-directional data (audio chunks and JSON text messages) asynchronously and in real time.
API calls require an API key and a language model, both specified as URL parameters.
The base URL for API calls is wss://developer.speech-i.com/ws/client/speech?key=XXX&model=YYY
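As an illustration, the connection URL can be assembled from the two required parameters. This is a minimal sketch: the helper name `build_url` is hypothetical, and the key and model values are placeholders to be replaced with your own.

```python
from urllib.parse import urlencode

# Base endpoint taken from the documentation above.
BASE_URL = "wss://developer.speech-i.com/ws/client/speech"


def build_url(api_key: str, model: str) -> str:
    """Return the websocket URL with the key and model as query parameters.

    urlencode escapes characters that are not safe in a query string,
    so keys containing '&' or spaces do not corrupt the URL.
    """
    return f"{BASE_URL}?{urlencode({'key': api_key, 'model': model})}"


if __name__ == "__main__":
    # "XXX" is a placeholder API key; "en-US_16k" is one of the listed models.
    print(build_url("XXX", "en-US_16k"))
```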
Usage
The recognition request and response cycle has the following steps:
- The client establishes a websocket connection to the server.
- Once the connection is open, the client must send the JSON-formatted transcription parameters.
- The server responds with a JSON-formatted status message.
- If the server status is ready, the client can start sending the audio chunks in binary format (binary frames); to limit latency, each chunk should contain at most 500 ms of audio.
- The server periodically sends JSON-formatted partial results, according to the transcription parameter settings.
- Once the audio transfer is complete, the client can send a stop decoding message.
- The server sends the JSON-formatted final transcription.
- Depending on the transcription parameters, the cycle can restart from the third step.
Server status message (the status is one of ready, error or busy):
{"status":"ready|error|busy"}
Partial results sent by the server:
{"final":false,"segment-start":0,"total-length":1.75,"transcript":"hello"}
{"final":false,"segment-start":0,"total-length":2.75,"transcript":"hello world"}
Stop decoding message sent by the client:
{"decoding":"stop"}
Final transcription sent by the server:
{
  "confidence":0.967202366309009,
  "final":true,
  "hypotheses":[
    {
      "likelihood":135.98733520507813,
      "transcript":"hello world",
      "word-alignment":[
        {
          "end":0.3899999912828207,
          "start":0,
          "word":"hello"
        },
        {
          "end":1.409999968484044,
          "start":0.41999999061226847,
          "word":"world"
        }
      ]
    },
    {
      "likelihood":133.66854858398438,
      "transcript":"yellow world"
    }
  ],
  "segment-length":1.409999968484044,
  "segment-start":0
}
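The request and response cycle described above can be sketched as a single blocking function. This is a simplified illustration, not the service's reference client: the `Transport` interface and the `transcribe` function are hypothetical names, and any connection object that offers `send` and `recv` could play that role. The sketch reads server messages only after all audio has been sent and returns the first message marked as final; a real-time client would more likely read partial results concurrently while streaming.

```python
import json
from typing import Callable, Iterable, Optional, Protocol, Union


class Transport(Protocol):
    """Hypothetical minimal connection interface (not part of the service API)."""

    def send(self, data: Union[str, bytes]) -> None: ...
    def recv(self) -> Union[str, bytes]: ...


def transcribe(
    ws: Transport,
    params: dict,
    chunks: Iterable[bytes],
    on_partial: Optional[Callable[[dict], None]] = None,
) -> dict:
    """Run one recognition cycle and return the first final message."""
    # Send the transcription parameters as a JSON text message.
    ws.send(json.dumps(params))

    # The server answers with a status message; continue only if it is ready.
    status = json.loads(ws.recv()).get("status")
    if status != "ready":
        raise RuntimeError(f"server not ready: {status}")

    # Stream the audio as binary frames (bytes objects).
    for chunk in chunks:
        ws.send(chunk)

    # Tell the server that the audio transfer is complete.
    ws.send(json.dumps({"decoding": "stop"}))

    # Read queued messages: partial results first, then the final transcription.
    while True:
        message = json.loads(ws.recv())
        if message.get("final"):
            return message
        if on_partial is not None:
            on_partial(message)
```

With the final message shown in the example above, `result["hypotheses"][0]["transcript"]` would be "hello world"; in that example the first hypothesis is also the one with the highest likelihood.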
Transcription parameters
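The following table lists the transcription parameters that the client sends when the websocket connection is opened. Only decoding and audio-type are mandatory; the other parameters fall back to their default values.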
PARAMETER | TYPE | DEFAULT | DESCRIPTION |
---|---|---|---|
decoding | String: 'start', 'stop' | Mandatory | Starts or ends the transcription. |
audio-type | String: 's16le;16000', 'speex;nb;8' | Mandatory | Specifies the audio coding and the sampling rate. Audio must be mono; the coding can be speex or raw; the sampling rate can be 8 or 16 kHz. |
traceback-period | float: 0.2-inf | 0.2 | Time interval between partial transcription results. |
timeout | int: 10-100 | 30 | Websocket dialog timeout. |
n-best | int: 1-inf | 12 | Maximum number of N-best hypotheses included in the output. |
do-word-alignment | boolean | true | Specifies whether the output should contain the word time-coding. |
do-endpointing | boolean | true | Activates or deactivates the endpointing detector. When the endpointing detector is deactivated, the final transcription is generated only after the transcription process has been closed. |
min-silence | float: 0.1-inf | 0.8 | Minimum silence duration required to detect an endpoint. |
connection-mode | String: 'always-open', 'on-stop-close', 'on-final-close' | 'always-open' | 'always-open': the connection stays open and must be closed by the client; 'on-stop-close': the server closes the connection at the end of the transcription; 'on-final-close': the server closes the connection when it detects the first endpoint. |
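The sketch below shows one way to build the opening parameter message and to size audio chunks so that each holds at most 500 ms of audio. The helper names (`start_message`, `pcm_chunk_bytes`, `split_chunks`) are hypothetical. The range checks mirror the table above. The chunk size calculation assumes that 's16le;16000' means raw mono PCM with 2 bytes per sample at 16000 samples per second, which the table does not spell out; it does not apply to speex-encoded audio.

```python
import json
from typing import Iterator

# Allowed ranges copied from the parameter table; None means no upper bound.
RANGES = {
    "traceback-period": (0.2, None),
    "timeout": (10, 100),
    "n-best": (1, None),
    "min-silence": (0.1, None),
}


def start_message(audio_type: str, **options) -> str:
    """Build the JSON text message sent right after the connection opens."""
    params = {"decoding": "start", "audio-type": audio_type}
    for name, value in options.items():
        key = name.replace("_", "-")  # Python keyword n_best -> wire name n-best
        if key in RANGES:
            low, high = RANGES[key]
            if value < low or (high is not None and value > high):
                raise ValueError(f"{key}={value} is outside the documented range")
        params[key] = value
    return json.dumps(params)


def pcm_chunk_bytes(sample_rate_hz: int, chunk_ms: int = 500) -> int:
    """Bytes in chunk_ms of mono 16-bit PCM (assumed meaning of 's16le')."""
    return sample_rate_hz * 2 * chunk_ms // 1000


def split_chunks(audio: bytes, size: int) -> Iterator[bytes]:
    """Yield consecutive slices of at most `size` bytes."""
    for offset in range(0, len(audio), size):
        yield audio[offset:offset + size]
```

Under these assumptions, `pcm_chunk_bytes(16000)` gives 16000 bytes per 500 ms chunk (16000 samples per second × 2 bytes × 0.5 s), and `start_message("s16le;16000", n_best=3)` produces a message that overrides only the n-best default.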
Language models
The following table shows supported language models (other languages are available on request):
MODEL | DESCRIPTION |
---|---|
en-GB_16k | British English language model 16kHz |
en-US_16k | American English language model 16kHz |
en-IE_16k | Irish English language model 16kHz |
fr-FR_16k | French language model 16kHz |
de-DE_16k | German language model 16kHz |
it-IT_16k | Italian language model 16kHz |
es-ES_16k | Spanish language model 16kHz |
pl-PL_16k | Polish language model 16kHz |
nl-NL_16k | Dutch language model 16kHz |
pt-PT_16k | European Portuguese language model 16kHz |
pt-BR_16k | Brazilian Portuguese language model 16kHz |
el-EL_16k | Greek language model 16kHz |
ro-RO_16k | Romanian language model 16kHz |
sl-SL_16k | Slovenian language model 16kHz |
sk-SK_16k | Slovak language model 16kHz |
cs-CS_16k | Czech language model 16kHz |
lt-LT_16k | Lithuanian language model 16kHz |
bg-BG_16k | Bulgarian language model 16kHz |
hr-HR_16k | Croatian language model 16kHz |
hu-HU_16k | Hungarian language model 16kHz |
fi-FI_16k | Finnish language model 16kHz |
sv-SV_16k | Swedish language model 16kHz |
uk-UK_16k | Ukrainian language model 16kHz |
ru-RU_16k | Russian language model 16kHz |
zh-ZH_16k | Chinese language model 16kHz |
ar-AR_16k | Arabic language model 16kHz |
da-DA_16k | Danish language model 16kHz |
mt-MT_16k | Maltese language model 16kHz |
ko-KR_16k | Korean language model 16kHz |
fa-IR_16k | Persian language model 16kHz |
et-EE_16k | Estonian language model 16kHz |
lv-LV_16k | Latvian language model 16kHz |
ga-IE_16k | Irish language model 16kHz |
sq-AL_16k | Albanian language model 16kHz |