WebSocket - Responses
The WebSocket interface is used for receiving speech recognition responses.
Note that while other protocols may be used to stream media to our service, the WebSocket interface is used in all cases to retrieve responses.
Response Types
Responses are JSON objects with a specific schema (see the General Response Schema section below).
To accommodate different delivery use cases, we provide two types of responses: `transcript` and `captions`.
Transcript
This type of response contains the recognized words since the beginning of a time segment.
A segment of audio is recognized incrementally, processing more of the incoming audio at each step. Each segment starts at a specific start time and extends its end time with each step, yielding the most up-to-date result.
Note that sequential updates for the same utterance will overlap, each response superseding the previous one, until a response signaling the end of the segment is received (marked by `is_final = true`).
The `alternatives` array might contain different hypotheses, ordered by level of confidence.
Here is an example of a `transcript` response:
{
  "response": {
    "id": "e5ff9cc8-d5e6-4da5-aa51-bd1874f7bf49",
    "type": "transcript",
    "service_type": "transcription",
    "language_code": "en-US",
    "is_final": false,
    "is_end_of_stream": false,
    "start": 0.0,
    "end": 1.0,
    "start_pts": 4000.0,
    "start_epoch": 1666011448.5125701,
    "speakers": [
      {
        "id": "5a155a51-b181-4451-84f2-5f9e141aea52",
        "label": null
      }
    ],
    "alternatives": [
      {
        "transcript": "Welcome",
        "start": 0.0,
        "end": 1.0,
        "start_pts": 4000.0,
        "start_epoch": 1666011448.5125701,
        "items": [
          {
            "start": 0.2,
            "end": 0.68,
            "kind": "text",
            "value": "Welcome",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          }
        ]
      }
    ]
  }
}
This example shows the recognized word in a 1-second-long segment, spanning time 0.0 to 1.0.
The next expected result after this one will also start from 0.0, but will have an incremented `end` time and potentially more recognized words.
For more details, please see the examples in our SDK: https://github.com/verbit-ai/verbit-streaming-python-sdk/blob/main/examples/responses/transcript.md
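As a sketch of how these overlapping updates might be consumed (assuming an iterable of already-parsed JSON messages — the names here are illustrative, not part of the SDK), each `transcript` response replaces the current hypothesis for the open segment until `is_final` closes it:

```python
def collect_transcript(responses):
    """Fold a stream of transcript-type responses into final segment texts.

    Each non-final response supersedes the previous one for the same
    segment; a response with is_final=true closes the segment.
    """
    finals = []
    current = ""  # latest hypothesis for the open segment
    for message in responses:
        resp = message["response"]
        if resp["type"] != "transcript":
            continue
        # The top-ranked hypothesis comes first in the alternatives array
        current = resp["alternatives"][0]["transcript"]
        if resp["is_final"]:
            finals.append(current)
            current = ""
    return finals, current

# Example: two updates for one segment, the second marked final
stream = [
    {"response": {"type": "transcript", "is_final": False,
                  "alternatives": [{"transcript": "Welcome"}]}},
    {"response": {"type": "transcript", "is_final": True,
                  "alternatives": [{"transcript": "Welcome friends,"}]}},
]
finals, pending = collect_transcript(stream)
print(finals)  # ['Welcome friends,']
```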
Captions
This type of response contains the recognized words within a specific time window. In contrast to the incremental nature of `transcript`-type responses, `captions`-type responses are non-overlapping and consecutive.
Only one `captions`-type response covering a specific time span in the audio will ever be returned.
Given this behaviour, each response is considered the final version of the recognition for the specified time span.
Therefore, the `is_final` field is always set to `true`, and the `alternatives` array will always contain exactly one item.
Here is an example of a `captions` response:
{
  "response": {
    "id": "9ab9a97c-9a21-090c-6a98-1b68e512ad32",
    "type": "captions",
    "service_type": "transcription",
    "language_code": "en-US",
    "is_final": true,
    "is_end_of_stream": false,
    "start": 0.2,
    "end": 1.25,
    "start_pts": 4000.2,
    "start_epoch": 1666011448.7125702,
    "speakers": [
      {
        "id": "5a155a51-b181-4451-84f2-5f9e141aea52",
        "label": null
      }
    ],
    "alternatives": [
      {
        "transcript": "Welcome friends,",
        "start": 0.2,
        "end": 1.25,
        "start_pts": 4000.2,
        "start_epoch": 1666011448.7125702,
        "items": [
          {
            "start": 0.2,
            "end": 0.71,
            "kind": "text",
            "value": "Welcome",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 0.71,
            "end": 1.25,
            "kind": "text",
            "value": "friends",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 1.25,
            "end": 1.25,
            "kind": "punct",
            "value": ",",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          }
        ]
      }
    ]
  }
}
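Because `captions` responses are final and non-overlapping, a consumer can append them directly without any superseding logic. A minimal sketch (the parsed-message list here is illustrative):

```python
def build_caption_track(responses):
    """Collect captions-type responses into (start, end, text) cues, in order."""
    cues = []
    for message in responses:
        resp = message["response"]
        if resp["type"] != "captions":
            continue
        # captions responses always carry exactly one alternative
        alt = resp["alternatives"][0]
        cues.append((resp["start"], resp["end"], alt["transcript"]))
    return cues

stream = [
    {"response": {"type": "captions", "start": 0.2, "end": 1.25,
                  "alternatives": [{"transcript": "Welcome friends,"}]}},
    {"response": {"type": "captions", "start": 1.25, "end": 2.4,
                  "alternatives": [{"transcript": "and welcome back."}]}},
]
cues = build_caption_track(stream)
print(cues)  # [(0.2, 1.25, 'Welcome friends,'), (1.25, 2.4, 'and welcome back.')]
```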
Silent audio segments
It should be noted that `transcript` and `captions` responses behave differently when no words are recognized:

- `transcript` responses are sent regardless of the audio content, such that the entire audio duration is covered. For a silent audio segment, responses are sent with an empty word list; however, the timestamps still mark the portion of the audio that was transcribed.
- `captions` responses are sent only when there are recognized words. For a silent audio segment, no responses are sent, since an empty caption does not make sense. Therefore, `captions` responses will not necessarily cover the entire audio duration.
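A practical consequence: code that tracks audio coverage should rely on `transcript` timestamps, while empty word lists can simply be skipped for display. A small sketch of such a check (operating on a parsed response dict, names illustrative):

```python
def is_silent(resp):
    """True when a response covers audio but contains no recognized words."""
    return all(not alt.get("items") for alt in resp["alternatives"])

# A transcript response over a silent segment: timestamps present, items empty
silent = {"type": "transcript", "start": 3.0, "end": 4.0,
          "alternatives": [{"transcript": "", "items": []}]}
print(is_silent(silent))  # True
```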
Service Types
Responses may originate in one of the following types of service:
Transcription
These responses are generated by Speech Recognition.
Responses marked with this service type will be in the same language as the input audio.
Translation
These responses are generated by Machine Translation. Responses marked with this type contain translated variants of the original `transcription` response. The language is specified in the `language_code` field. There may be more than one `translation` response for each original `transcription` response (as specified in the Order).
Note: Since the translated words were never actually uttered in the original audio, they do not have "real" timings. Therefore, words in `translation` responses are assigned timings which are heuristically distributed within the time boundaries of the originating `transcription` response. These heuristic timings may be used for synchronization purposes, such as displaying translated content in alignment with the audio. Bear in mind, however, that due to natural differences between languages, translated responses may diverge from the original in word count and word order.
Here is an example of a `translation` response which corresponds to the `captions` example above, translated to Spanish:
{
  "response": {
    "id": "9ab9a97c-9a21-090c-6a98-1b68e512ad32",
    "type": "captions",
    "service_type": "translation",
    "language_code": "es-ES",
    "is_final": true,
    "is_end_of_stream": false,
    "start": 0.2,
    "end": 1.25,
    "start_pts": 4000.2,
    "start_epoch": 1666011448.7125702,
    "speakers": [
      {
        "id": "5a155a51-b181-4451-84f2-5f9e141aea52",
        "label": null
      }
    ],
    "alternatives": [
      {
        "transcript": "Bienvenidos amigos,",
        "start": 0.2,
        "end": 1.25,
        "start_pts": 4000.2,
        "start_epoch": 1666011448.7125702,
        "items": [
          {
            "start": 0.2,
            "end": 0.8,
            "kind": "text",
            "value": "Bienvenidos",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 0.8,
            "end": 1.25,
            "kind": "text",
            "value": "amigos",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 1.25,
            "end": 1.25,
            "kind": "punct",
            "value": ",",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          }
        ]
      }
    ]
  }
}
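Since transcription and translation responses arrive over the same connection, a consumer typically routes them into per-language tracks using `service_type` and `language_code`. An illustrative sketch (not part of the SDK):

```python
from collections import defaultdict

def route_by_language(responses):
    """Group response texts into per-(service_type, language_code) buckets,
    e.g. to render a separate caption track per language."""
    tracks = defaultdict(list)
    for message in responses:
        resp = message["response"]
        key = (resp["service_type"], resp["language_code"])
        tracks[key].append(resp["alternatives"][0]["transcript"])
    return dict(tracks)

stream = [
    {"response": {"service_type": "transcription", "language_code": "en-US",
                  "alternatives": [{"transcript": "Welcome friends,"}]}},
    {"response": {"service_type": "translation", "language_code": "es-ES",
                  "alternatives": [{"transcript": "Bienvenidos amigos,"}]}},
]
tracks = route_by_language(stream)
print(tracks[("translation", "es-ES")])  # ['Bienvenidos amigos,']
```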
General Response Schema
Below is the generic schema to which all responses adhere, followed by a description of each field.
{
  "response": {
    "id": string (UUID),
    "type": "transcript" | "captions",
    "service_type": "transcription" | "translation",
    "language_code": string,
    "start": float,
    "end": float,
    "start_pts": float,
    "start_epoch": float,
    "is_final": boolean,
    "is_end_of_stream": boolean,
    "speakers": [
      {
        "id": string (UUID),
        "label": string | null
      }
    ],
    "alternatives": [
      {
        "transcript": string,
        "start": float,
        "end": float,
        "start_pts": float,
        "start_epoch": float,
        "items": [
          {
            "start": float,
            "end": float,
            "kind": "text" | "punct",
            "value": string,
            "speaker_id": string (UUID)
          }
        ]
      }
    ]
  }
}
Fields

- `response` - The root element in the response JSON.
- `id` - A unique identifier of the response (UUID).
- `type` - The response type. Can be either "transcript" or "captions".
- `service_type` - The type of service which produced the response. Can be either "transcription" (of the input language) or "translation" (of the input language transcription into a target language).
- `language_code` - The language code representing the language of the words in the response. The first two characters denote the language, and the last two characters denote the region. The codes follow the ISO 639 + ISO 3166 standards.
- `start` - The start time of the segment. Measured in seconds from the beginning of the media stream.
- `end` - The (current) end time of the segment. Measured in seconds from the beginning of the media stream.
- `start_pts` - The `pts` value corresponding to the `start` time of this response, as received from the input media stream. Measured in seconds. Note: if the input media stream doesn't provide `pts` values, this field will have the same value as `start`.
- `start_epoch` - The epoch timestamp at which the media corresponding to the `start` of the response was received.
- `is_final` - A boolean denoting whether the response is the final one for this segment.
- `is_end_of_stream` - A boolean denoting whether the response is the last one for the entire media stream.
- `speakers` - A list of objects representing speakers identified in the media stream.
  - `id` - A unique identifier of the speaker (UUID).
  - `label` - A string representing the speaker. Only available in sessions with human transcribers/annotators. This field is set to `null` by default.
- `alternatives` - A list of alternative transcription hypotheses. At least one alternative is always returned.
  - `transcript` - A concatenated textual representation of the alternative in the current response.
  - `start` - Same as `["response"]["start"]`.
  - `end` - Same as `["response"]["end"]`.
  - `start_pts` - Same as `["response"]["start_pts"]`.
  - `start_epoch` - Same as `["response"]["start_epoch"]`.
  - `items` - A list containing textual items (words and punctuation marks) and their timings.
    - `start` - The start time of the item. Measured in seconds from the beginning of the media stream.
    - `end` - The end time of the item. Measured in seconds from the beginning of the media stream.
    - `kind` - The item's kind. Can be either "text" or "punct" (a punctuation mark).
    - `value` - The item's textual value.
    - `speaker_id` - The unique identifier of the speaker that this item is associated with. Corresponds to an `id` of one of the speakers in the `["response"]["speakers"]` list.
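For editor support and static checking, the schema above can be mirrored in typed Python. A sketch using `TypedDict` (the field names match the schema; the class names are our own):

```python
from typing import List, Literal, Optional, TypedDict

class Item(TypedDict):
    start: float
    end: float
    kind: Literal["text", "punct"]
    value: str
    speaker_id: str

class Alternative(TypedDict):
    transcript: str
    start: float
    end: float
    start_pts: float
    start_epoch: float
    items: List[Item]

class Speaker(TypedDict):
    id: str
    label: Optional[str]

class Response(TypedDict):
    id: str
    type: Literal["transcript", "captions"]
    service_type: Literal["transcription", "translation"]
    language_code: str
    start: float
    end: float
    start_pts: float
    start_epoch: float
    is_final: bool
    is_end_of_stream: bool
    speakers: List[Speaker]
    alternatives: List[Alternative]

# TypedDicts are purely static annotations; at runtime they behave like dicts,
# so parsed JSON can be annotated without any conversion step.
```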