Verbit Transcript JSON format
Post-production captioning and transcription results are available in multiple formats. One of them is a detailed JSON format with per-word timing, intended for integration into downstream workflows and applications. It can be retrieved via the get_caption API endpoint.
Example Document
{
  "version": 1,
  "kind": "transcript",
  "features": [
    "speakers"
  ],
  "segments": [
    {
      "start_time": 8.57,
      "end_time": 11.55,
      "speaker_id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
      "tokens": [
        {
          "text": "What",
          "start_time": 9.57,
          "end_time": 9.81,
          "kind": "word",
          "confidence": 0.96
        },
        {
          "text": " ",
          "start_time": 9.81,
          "end_time": 9.81,
          "kind": "space"
        },
        {
          "text": "a",
          "start_time": 9.81,
          "end_time": 9.84,
          "kind": "word",
          "confidence": 0.88
        },
        {
          "text": " ",
          "start_time": 9.84,
          "end_time": 9.84,
          "kind": "space"
        },
        {
          "text": "good",
          "start_time": 9.84,
          "end_time": 10.05,
          "kind": "word",
          "confidence": 0.93
        },
        {
          "text": " ",
          "start_time": 10.05,
          "end_time": 10.05,
          "kind": "space"
        },
        {
          "text": "looking",
          "start_time": 10.05,
          "end_time": 10.29,
          "kind": "word",
          "confidence": 0.96
        },
        {
          "text": " ",
          "start_time": 10.29,
          "end_time": 10.29,
          "kind": "space"
        },
        {
          "text": "crowd",
          "start_time": 10.29,
          "end_time": 10.71,
          "kind": "word",
          "confidence": 0.81
        },
        {
          "text": ".",
          "start_time": 10.71,
          "end_time": 10.71,
          "kind": "punctuation"
        },
        {
          "text": "\n",
          "start_time": 10.71,
          "end_time": 10.71,
          "kind": "line_break"
        }
      ]
    }
  ],
  "speakers": [
    {
      "id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
      "text": "Speaker 1"
    }
  ]
}
Format Structure
The core elements of a transcript or caption are tokens, which represent the text split into words, spaces, punctuation marks, and line breaks.
Tokens are grouped into segments, which define paragraph or caption cue segmentation. Segments also contain additional metadata, such as the associated speaker ID.
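Because spaces, punctuation, and line breaks are tokens in their own right, the plain text of a segment can be rebuilt by concatenating token texts in order. The following is a minimal sketch under that assumption; the function names `segment_text` and `transcript_text` and the abbreviated `sample` document are illustrative and not part of the format.

```python
def segment_text(segment: dict) -> str:
    """Concatenate the text of every token in one segment, in order."""
    return "".join(token["text"] for token in segment["tokens"])


def transcript_text(document: dict) -> str:
    """Concatenate the text of all segments in document order."""
    return "".join(segment_text(segment) for segment in document["segments"])


# Abbreviated, hypothetical document that follows the documented structure.
sample = {
    "segments": [
        {
            "tokens": [
                {"text": "Good", "kind": "word"},
                {"text": " ", "kind": "space"},
                {"text": "crowd", "kind": "word"},
                {"text": ".", "kind": "punctuation"},
                {"text": "\n", "kind": "line_break"},
            ]
        }
    ]
}

if __name__ == "__main__":
    print(repr(transcript_text(sample)))
```

No separators are inserted between tokens or segments, since the format already encodes whitespace explicitly.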
Root Attributes
Name | Type | Possible Values | Description |
---|---|---|---|
version | Integer | 1 | JSON format version. |
kind | String | "captions", "transcript" | Specifies whether the document is a transcript or captions. |
features | Array[String] | "speakers" | List of features applied to the job (currently, only "speakers" is supported). |
segments | Array[Object] | See Segment | List of transcript paragraphs or caption cues. |
speakers | Array[Object] | See Speaker | List of speakers. |
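Before processing a document, a consumer may want to confirm that the root attributes hold the values listed above. The sketch below is one possible sanity check, not an official validator; `check_root` and `KNOWN_KINDS` are hypothetical names, and the check assumes that only version 1 and the two documented kinds exist.

```python
import json

# Values taken from the Root Attributes table.
KNOWN_KINDS = {"captions", "transcript"}


def check_root(document: dict) -> list:
    """Return a list of human-readable problems found in the root attributes."""
    problems = []
    if document.get("version") != 1:
        problems.append(f"unexpected version: {document.get('version')!r}")
    if document.get("kind") not in KNOWN_KINDS:
        problems.append(f"unexpected kind: {document.get('kind')!r}")
    if not isinstance(document.get("features"), list):
        problems.append("features is not a list")
    if not isinstance(document.get("segments"), list):
        problems.append("segments is not a list")
    return problems


if __name__ == "__main__":
    raw = '{"version": 1, "kind": "transcript", "features": ["speakers"], "segments": []}'
    print(check_root(json.loads(raw)))
```

An empty list means that no problems were found at the root level; segments and tokens are not inspected here.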
Segment
Name | Type | Value | Description |
---|---|---|---|
start_time | Float | ≥ 0 | Start time of the segment in seconds. |
end_time | Float | ≥ 0 | End time of the segment in seconds. |
speaker_id | String | <UUID> | Speaker ID (present only if the "speakers" feature is enabled). |
tokens | Array[Object] | See Token | List of words, spaces, punctuation, and line breaks in the segment. |
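Segment start and end times are plain seconds, so they map directly onto caption cue timings. As an illustration only, the sketch below turns segments into SubRip-style (SRT) cues; SRT is used here purely as a familiar target format, and the helper names `format_timestamp` and `to_srt` are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm (the SubRip timestamp layout)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(document: dict) -> str:
    """Render each segment as one numbered cue with its timing and text."""
    cues = []
    for index, segment in enumerate(document["segments"], start=1):
        # Rebuild the cue text from tokens and drop surrounding whitespace.
        text = "".join(token["text"] for token in segment["tokens"]).strip()
        start = format_timestamp(segment["start_time"])
        end = format_timestamp(segment["end_time"])
        cues.append(f"{index}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Abbreviated, hypothetical document that follows the documented structure.
sample = {
    "segments": [
        {
            "start_time": 8.57,
            "end_time": 11.55,
            "tokens": [
                {"text": "Hi", "kind": "word"},
                {"text": ".", "kind": "punctuation"},
                {"text": "\n", "kind": "line_break"},
            ],
        }
    ]
}

if __name__ == "__main__":
    print(to_srt(sample))
```

Rounding to whole milliseconds before splitting into hours, minutes, and seconds avoids floating-point artifacts in the formatted output.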
Token
Name | Type | Value | Description |
---|---|---|---|
text | String | <String> | The content of the token (word, punctuation, space, or line break). |
start_time | Float | ≥ 0 | Start time of the token in seconds. |
end_time | Float | ≥ 0 | End time of the token in seconds. |
kind | String | "word", "space", "punctuation", "line_break" | The type of token. |
confidence | Float | 0..1 | Confidence score for word tokens (not applicable to spaces, punctuation, or line breaks). |
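A common use of the per-word confidence score is to flag words that may need human review. The sketch below collects word tokens whose confidence falls below a threshold; the function name `low_confidence_words` and the default threshold of 0.9 are arbitrary choices for illustration, not recommendations from the format.

```python
def low_confidence_words(document: dict, threshold: float = 0.9) -> list:
    """Return (text, start_time, confidence) for word tokens below the threshold."""
    flagged = []
    for segment in document["segments"]:
        for token in segment["tokens"]:
            # Only word tokens carry a confidence score.
            if token["kind"] != "word":
                continue
            confidence = token.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append((token["text"], token["start_time"], confidence))
    return flagged


# Abbreviated, hypothetical document using values from the example above.
sample = {
    "segments": [
        {
            "tokens": [
                {"text": "a", "start_time": 9.81, "end_time": 9.84,
                 "kind": "word", "confidence": 0.88},
                {"text": " ", "start_time": 9.84, "end_time": 9.84,
                 "kind": "space"},
                {"text": "good", "start_time": 9.84, "end_time": 10.05,
                 "kind": "word", "confidence": 0.93},
            ]
        }
    ]
}

if __name__ == "__main__":
    for text, start, confidence in low_confidence_words(sample):
        print(f"{start:.2f}s  {text!r}  confidence={confidence}")
```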
Speaker
Name | Type | Value | Description |
---|---|---|---|
id | String | <UUID> | Unique speaker identifier, referenced in segments. |
text | String | <String> | Speaker name or label. |
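Since segments refer to speakers only by ID, displaying a speaker label requires a lookup against the speakers list. The following sketch shows one way to do that; `speaker_names` and `labeled_segments` are hypothetical helpers, and the "Unknown" fallback for segments without a matching speaker is an assumption of this example rather than behavior defined by the format.

```python
def speaker_names(document: dict) -> dict:
    """Map each speaker ID to its name or label."""
    return {speaker["id"]: speaker["text"] for speaker in document.get("speakers", [])}


def labeled_segments(document: dict) -> list:
    """Return one 'label: text' string per segment."""
    names = speaker_names(document)
    lines = []
    for segment in document["segments"]:
        # speaker_id is present only when the "speakers" feature is enabled.
        label = names.get(segment.get("speaker_id"), "Unknown")
        text = "".join(token["text"] for token in segment["tokens"]).strip()
        lines.append(f"{label}: {text}")
    return lines


# Abbreviated, hypothetical document that follows the documented structure.
sample = {
    "segments": [
        {
            "speaker_id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
            "tokens": [
                {"text": "Hello", "kind": "word"},
                {"text": ".", "kind": "punctuation"},
            ],
        },
        {"tokens": [{"text": "Thanks", "kind": "word"}]},
    ],
    "speakers": [
        {"id": "917f1dd7-f026-4aef-9544-dee627bbcdb0", "text": "Speaker 1"}
    ],
}

if __name__ == "__main__":
    print("\n".join(labeled_segments(sample)))
```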