WebSocket - Responses
The WebSocket interface is used for receiving speech recognition responses.
Note that while other protocols may be used to stream media to our service, the WebSocket interface is used in all cases to retrieve responses.
Response Types
Responses are JSON objects with a specific schema (see the General Response Schema section below).
To accommodate different delivery use cases, we provide two types of responses: `transcript` and `captions`.
Transcript
This type of response contains the recognized words since the beginning of a time segment.
A segment of audio is recognized incrementally, processing more of the incoming audio at each step. Each segment starts at a specific start time and extends its end time with each step, yielding the most up-to-date result.
Note that sequential updates for the same utterance will overlap, each response superseding the previous one, until a response signaling the end of the segment is received (marked by `is_final = true`).
The `alternatives` array might contain different hypotheses, ordered by level of confidence.
Here is an example of a `transcript` response:
{
  "response": {
    "id": "e5ff9cc8-d5e6-4da5-aa51-bd1874f7bf49",
    "type": "transcript",
    "service_type": "transcription",
    "language_code": "en-US",
    "is_final": false,
    "is_end_of_stream": false,
    "start": 0.0,
    "end": 1.0,
    "start_pts": 4000.0,
    "start_epoch": 1666011448.5125701,
    "speakers": [
      {
        "id": "5a155a51-b181-4451-84f2-5f9e141aea52",
        "label": null
      }
    ],
    "alternatives": [
      {
        "transcript": "Welcome",
        "start": 0.0,
        "end": 1.0,
        "start_pts": 4000.0,
        "start_epoch": 1666011448.5125701,
        "items": [
          {
            "start": 0.2,
            "end": 0.68,
            "kind": "text",
            "value": "Welcome",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          }
        ]
      }
    ]
  }
}
This example shows the recognized word in a 1-second-long segment, spanning time 0.0 to 1.0.
The next expected result after this one will also start from 0.0, but will have an incremented `end` time and potentially more recognized words.
For more details, please see the examples in our SDK: https://github.com/verbit-ai/verbit-streaming-python-sdk/blob/main/examples/responses/transcript.md
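As a sketch of how these overlapping updates might be consumed (assuming an iterable of already-parsed JSON messages — the names here are illustrative, not part of the SDK), each `transcript` response replaces the current hypothesis for the open segment until `is_final` closes it:

```python
def collect_transcript(responses):
    """Fold a stream of transcript-type responses into final segment texts.

    Each non-final response supersedes the previous one for the same
    segment; a response with is_final=true closes the segment.
    """
    finals = []
    current = ""  # latest hypothesis for the open segment
    for message in responses:
        resp = message["response"]
        if resp["type"] != "transcript":
            continue
        # The top-ranked hypothesis comes first in the alternatives array
        current = resp["alternatives"][0]["transcript"]
        if resp["is_final"]:
            finals.append(current)
            current = ""
    return finals, current

# Example: two updates for one segment, the second marked final
stream = [
    {"response": {"type": "transcript", "is_final": False,
                  "alternatives": [{"transcript": "Welcome"}]}},
    {"response": {"type": "transcript", "is_final": True,
                  "alternatives": [{"transcript": "Welcome friends,"}]}},
]
finals, pending = collect_transcript(stream)
print(finals)  # ['Welcome friends,']
```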
Captions
This type of response contains the recognized words within a specific time window. In contrast to the incremental nature of `transcript`-type responses, `captions`-type responses are non-overlapping and consecutive.
Only one `captions`-type response covering a specific time span in the audio will ever be returned.
Given this behaviour, each response is considered the final version of the recognition for the specified time span.
Therefore, the `is_final` field is always set to `true`, and the `alternatives` array will always contain exactly one item.
Here is an example of a `captions` response:
{
  "response": {
    "id": "9ab9a97c-9a21-090c-6a98-1b68e512ad32",
    "type": "captions",
    "service_type": "transcription",
    "language_code": "en-US",
    "is_final": true,
    "is_end_of_stream": false,
    "start": 0.2,
    "end": 1.25,
    "start_pts": 4000.2,
    "start_epoch": 1666011448.7125702,
    "speakers": [
      {
        "id": "5a155a51-b181-4451-84f2-5f9e141aea52",
        "label": null
      }
    ],
    "alternatives": [
      {
        "transcript": "Welcome friends,",
        "start": 0.2,
        "end": 1.25,
        "start_pts": 4000.2,
        "start_epoch": 1666011448.7125702,
        "items": [
          {
            "start": 0.2,
            "end": 0.71,
            "kind": "text",
            "value": "Welcome",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 0.71,
            "end": 1.25,
            "kind": "text",
            "value": "friends",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 1.25,
            "end": 1.25,
            "kind": "punct",
            "value": ",",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          }
        ]
      }
    ]
  }
}
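Because `captions` responses are final and non-overlapping, a consumer can append them directly without any superseding logic. A minimal sketch (the parsed-message list here is illustrative):

```python
def build_caption_track(responses):
    """Collect captions-type responses into (start, end, text) cues, in order."""
    cues = []
    for message in responses:
        resp = message["response"]
        if resp["type"] != "captions":
            continue
        # captions responses always carry exactly one alternative
        alt = resp["alternatives"][0]
        cues.append((resp["start"], resp["end"], alt["transcript"]))
    return cues

stream = [
    {"response": {"type": "captions", "start": 0.2, "end": 1.25,
                  "alternatives": [{"transcript": "Welcome friends,"}]}},
    {"response": {"type": "captions", "start": 1.25, "end": 2.4,
                  "alternatives": [{"transcript": "and welcome back."}]}},
]
cues = build_caption_track(stream)
print(cues)  # [(0.2, 1.25, 'Welcome friends,'), (1.25, 2.4, 'and welcome back.')]
```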
Silent audio segments
It should be noted that `transcript` and `captions` responses behave differently when no words are recognized:

- `transcript` responses are sent regardless of the audio content, such that the entire audio duration is covered. For a silent audio segment, responses are sent with an empty word list; however, the timestamps still mark the portion of the audio that was transcribed.
- `captions` responses are sent only when there are recognized words. For a silent audio segment, no responses are sent, since an empty caption does not make sense. Therefore, `captions` responses will not necessarily cover the entire audio duration.
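A practical consequence: code that tracks audio coverage should rely on `transcript` timestamps, while empty word lists can simply be skipped for display. A small sketch of such a check (operating on a parsed response dict, names illustrative):

```python
def is_silent(resp):
    """True when a response covers audio but contains no recognized words."""
    return all(not alt.get("items") for alt in resp["alternatives"])

# A transcript response over a silent segment: timestamps present, items empty
silent = {"type": "transcript", "start": 3.0, "end": 4.0,
          "alternatives": [{"transcript": "", "items": []}]}
print(is_silent(silent))  # True
```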
Service Types
Responses may originate in one of the following types of service:
Transcription
These responses are generated by Speech Recognition.
Responses marked with this service type will be in the same language as the input audio.
Translation
These responses are generated by Machine Translation. Responses marked with this type contain translated variants of the original `transcription` response. The language is specified in the `language_code` field. There may be more than one `translation` response for each original `transcription` response (as specified in the Order).
Note: Since the translated words were never actually uttered in the original audio, they do not have "real" timings. Therefore, words in `translation` responses are assigned timings which are heuristically distributed within the time boundaries of the originating `transcription` response. These heuristic timings may be used for synchronization purposes, such as displaying translated content in alignment with the audio. Bear in mind, however, that due to natural differences between languages, translated responses may diverge from the original in word count and word order.
Here is an example of a `translation` response which corresponds to the `captions` example above, translated to Spanish:
{
  "response": {
    "id": "9ab9a97c-9a21-090c-6a98-1b68e512ad32",
    "type": "captions",
    "service_type": "translation",
    "language_code": "es-ES",
    "is_final": true,
    "is_end_of_stream": false,
    "start": 0.2,
    "end": 1.25,
    "start_pts": 4000.2,
    "start_epoch": 1666011448.7125702,
    "speakers": [
      {
        "id": "5a155a51-b181-4451-84f2-5f9e141aea52",
        "label": null
      }
    ],
    "alternatives": [
      {
        "transcript": "Bienvenidos amigos,",
        "start": 0.2,
        "end": 1.25,
        "start_pts": 4000.2,
        "start_epoch": 1666011448.7125702,
        "items": [
          {
            "start": 0.2,
            "end": 0.8,
            "kind": "text",
            "value": "Bienvenidos",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 0.8,
            "end": 1.25,
            "kind": "text",
            "value": "amigos",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          },
          {
            "start": 1.25,
            "end": 1.25,
            "kind": "punct",
            "value": ",",
            "speaker_id": "5a155a51-b181-4451-84f2-5f9e141aea52"
          }
        ]
      }
    ]
  }
}
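Since transcription and translation responses arrive over the same connection, a consumer typically routes them into per-language tracks using `service_type` and `language_code`. An illustrative sketch (not part of the SDK):

```python
from collections import defaultdict

def route_by_language(responses):
    """Group response texts into per-(service_type, language_code) buckets,
    e.g. to render a separate caption track per language."""
    tracks = defaultdict(list)
    for message in responses:
        resp = message["response"]
        key = (resp["service_type"], resp["language_code"])
        tracks[key].append(resp["alternatives"][0]["transcript"])
    return dict(tracks)

stream = [
    {"response": {"service_type": "transcription", "language_code": "en-US",
                  "alternatives": [{"transcript": "Welcome friends,"}]}},
    {"response": {"service_type": "translation", "language_code": "es-ES",
                  "alternatives": [{"transcript": "Bienvenidos amigos,"}]}},
]
tracks = route_by_language(stream)
print(tracks[("translation", "es-ES")])  # ['Bienvenidos amigos,']
```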
General Response Schema
Below is the generic schema to which all responses adhere, followed by a description of each field.
{
  "response": {
    "id": string (UUID),
    "type": "transcript" | "captions",
    "service_type": "transcription" | "translation",
    "language_code": string,
    "start": float,
    "end": float,
    "start_pts": float,
    "start_epoch": float,
    "is_final": boolean,
    "is_end_of_stream": boolean,
    "speakers": [
      {
        "id": string (UUID),
        "label": string | null
      }
    ],
    "alternatives": [
      {
        "transcript": string,
        "start": float,
        "end": float,
        "start_pts": float,
        "start_epoch": float,
        "items": [
          {
            "start": float,
            "end": float,
            "kind": "text" | "punct",
            "value": string,
            "speaker_id": string (UUID)
          }
        ]
      }
    ]
  }
}
Fields

- `response` - The root element in the response JSON.
- `id` - A unique identifier of the response (UUID).
- `type` - The response type. Can be either "transcript" or "captions".
- `service_type` - The type of service which produced the response. Can be either "transcription" (of the input language) or "translation" (of the input language transcription into a target language).
- `language_code` - The language code representing the language of the words in the response. The first two characters denote the language, and the last two characters denote the region. The codes follow the ISO 639 + ISO 3166 standards.
- `start` - The start time of the segment. Measured in seconds from the beginning of the media stream.
- `end` - The (current) end time of the segment. Measured in seconds from the beginning of the media stream.
- `start_pts` - The `pts` value corresponding to the `start` time of this response, as received from the input media stream. Measured in seconds. Note: if the input media stream doesn't provide `pts` values, this field will have the same value as `start`.
- `start_epoch` - The epoch timestamp at which the media corresponding to the `start` of the response was received.
- `is_final` - A boolean denoting whether the response is the final one for this segment.
- `is_end_of_stream` - A boolean denoting whether the response is the last one for the entire media stream.
- `speakers` - A list of objects representing speakers identified in the media stream.
  - `id` - A unique identifier of the speaker (UUID).
  - `label` - A string representing the speaker. Only available in sessions with human transcribers/annotators. This field is set to `null` by default.
- `alternatives` - A list of alternative transcription hypotheses. At least one alternative is always returned.
  - `transcript` - A concatenated textual representation of the alternative in the current response.
  - `start` - Same as `["response"]["start"]`.
  - `end` - Same as `["response"]["end"]`.
  - `start_pts` - Same as `["response"]["start_pts"]`.
  - `start_epoch` - Same as `["response"]["start_epoch"]`.
  - `items` - A list containing textual items (words and punctuation marks) and their timings.
    - `start` - The start time of the item. Measured in seconds from the beginning of the media stream.
    - `end` - The end time of the item. Measured in seconds from the beginning of the media stream.
    - `kind` - The item's kind. Can be either "text" or "punct" (a punctuation mark).
    - `value` - The item's textual value.
    - `speaker_id` - The unique identifier of the speaker that this item is associated with. Corresponds to an `id` of one of the speakers in the `["response"]["speakers"]` list.
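For editor support and static checking, the schema above can be mirrored in typed Python. A sketch using `TypedDict` (the field names match the schema; the class names are our own):

```python
from typing import List, Literal, Optional, TypedDict

class Item(TypedDict):
    start: float
    end: float
    kind: Literal["text", "punct"]
    value: str
    speaker_id: str

class Alternative(TypedDict):
    transcript: str
    start: float
    end: float
    start_pts: float
    start_epoch: float
    items: List[Item]

class Speaker(TypedDict):
    id: str
    label: Optional[str]

class Response(TypedDict):
    id: str
    type: Literal["transcript", "captions"]
    service_type: Literal["transcription", "translation"]
    language_code: str
    start: float
    end: float
    start_pts: float
    start_epoch: float
    is_final: bool
    is_end_of_stream: bool
    speakers: List[Speaker]
    alternatives: List[Alternative]

# TypedDicts are purely static annotations; at runtime they behave like dicts,
# so parsed JSON can be annotated without any conversion step.
```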