Verbit Transcript JSON format

Post-production captioning and transcription results are available in multiple formats. One of them is a detailed JSON format with per-word timing, intended for integration into other workflows and applications. It can be retrieved via the get_caption API endpoint.

Example Document

{
  "version": 1,
  "kind": "transcript",
  "features": [ "speakers" ],
  "segments": [
    {
      "start_time": 8.57,
      "end_time": 11.55,
      "speaker_id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
      "tokens": [
        { "text": "What", "start_time": 9.57, "end_time": 9.81, "kind": "word", "confidence": 0.96 },
        { "text": " ", "start_time": 9.81, "end_time": 9.81, "kind": "space" },
        { "text": "a", "start_time": 9.81, "end_time": 9.84, "kind": "word", "confidence": 0.88 },
        { "text": " ", "start_time": 9.84, "end_time": 9.84, "kind": "space" },
        { "text": "good", "start_time": 9.84, "end_time": 10.05, "kind": "word", "confidence": 0.93 },
        { "text": " ", "start_time": 10.05, "end_time": 10.05, "kind": "space" },
        { "text": "looking", "start_time": 10.05, "end_time": 10.29, "kind": "word", "confidence": 0.96 },
        { "text": " ", "start_time": 10.29, "end_time": 10.29, "kind": "space" },
        { "text": "crowd", "start_time": 10.29, "end_time": 10.71, "kind": "word", "confidence": 0.81 },
        { "text": ".", "start_time": 10.71, "end_time": 10.71, "kind": "punctuation" },
        { "text": "\n", "start_time": 10.71, "end_time": 10.71, "kind": "line_break" }
      ]
    }
  ],
  "speakers": [
    { "id": "917f1dd7-f026-4aef-9544-dee627bbcdb0", "text": "Speaker 1" }
  ]
}

Format Structure

The core elements of a transcript or caption are tokens, which represent the text split into words, spaces, punctuation marks, and line breaks.

Tokens are grouped into segments, which define paragraph or caption cue segmentation. Segments also contain additional metadata, such as the associated speaker ID.
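
Because spaces, punctuation marks, and line breaks are tokens in their own right, the text of a segment can be recovered by concatenating the token texts in order. The following is a minimal sketch, assuming the document has already been parsed into Python dictionaries; the helper name segment_text is illustrative and not part of any Verbit API.

```python
def segment_text(segment):
    """Concatenate the text of every token in a segment, in order."""
    return "".join(token["text"] for token in segment["tokens"])


# The segment from the example document, with timing fields omitted from
# the tokens for brevity (only "text" and "kind" are needed here).
segment = {
    "start_time": 8.57,
    "end_time": 11.55,
    "tokens": [
        {"text": "What", "kind": "word"},
        {"text": " ", "kind": "space"},
        {"text": "a", "kind": "word"},
        {"text": " ", "kind": "space"},
        {"text": "good", "kind": "word"},
        {"text": " ", "kind": "space"},
        {"text": "looking", "kind": "word"},
        {"text": " ", "kind": "space"},
        {"text": "crowd", "kind": "word"},
        {"text": ".", "kind": "punctuation"},
        {"text": "\n", "kind": "line_break"},
    ],
}

print(repr(segment_text(segment)))  # 'What a good looking crowd.\n'
```

No separator is inserted between tokens, since the format already carries whitespace explicitly.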

Root Attributes

| Name | Type | Possible Values | Description |
| --- | --- | --- | --- |
| version | Integer | 1 | JSON format version. |
| kind | String | "captions", "transcript" | Specifies whether the document is a transcript or captions. |
| features | Array[String] | "speakers" | List of features applied to the job (currently, only "speakers" is supported). |
| segments | Array[Object] | See Segment | List of transcript paragraphs or caption cues. |
| speakers | Array[Object] | See Speaker | List of speakers. |
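
A consumer may want to check the root attributes before processing a document. The sketch below is one possible approach, not an official validator: the function name check_root and the choice to raise ValueError are assumptions made for illustration, and how strictly to validate is left to the integrator.

```python
SUPPORTED_VERSION = 1
KNOWN_KINDS = {"captions", "transcript"}


def check_root(doc):
    """Validate the root attributes of a parsed document.

    Returns True if the "speakers" feature is listed, False otherwise.
    Raises ValueError when a root attribute has an unexpected value.
    """
    version = doc.get("version")
    if version != SUPPORTED_VERSION:
        raise ValueError(f"unsupported format version: {version!r}")

    kind = doc.get("kind")
    if kind not in KNOWN_KINDS:
        raise ValueError(f"unexpected kind: {kind!r}")

    if not isinstance(doc.get("segments"), list):
        raise ValueError("segments must be a list")

    # "features" is treated as optional here; that is an assumption.
    return "speakers" in doc.get("features", [])


doc = {
    "version": 1,
    "kind": "transcript",
    "features": ["speakers"],
    "segments": [],
    "speakers": [],
}
has_speakers = check_root(doc)
print(has_speakers)  # True
```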

Segment

| Name | Type | Value | Description |
| --- | --- | --- | --- |
| start_time | Float | ≥ 0 | Start time of the segment in seconds. |
| end_time | Float | ≥ 0 | End time of the segment in seconds. |
| speaker_id | String | <UUID> | Speaker ID (present only if the "speakers" feature is enabled). |
| tokens | Array[Object] | See Token | List of words, spaces, punctuation, and line breaks in the segment. |
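
Segment times are plain floating-point seconds, so they usually need converting before display. The sketch below turns a segment's start_time and end_time into an HH:MM:SS.mmm timing line. The helper names and the "-->" separator (borrowed from WebVTT-style cue timings) are illustrative choices, not something the format prescribes.

```python
def format_timestamp(seconds):
    """Convert seconds (float) to an HH:MM:SS.mmm string."""
    # Round to whole milliseconds first to avoid float artifacts.
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def cue_timing(segment):
    """Build a 'start --> end' timing line for one segment."""
    start = format_timestamp(segment["start_time"])
    end = format_timestamp(segment["end_time"])
    return f"{start} --> {end}"


# Timing values taken from the segment in the example document.
segment = {"start_time": 8.57, "end_time": 11.55, "tokens": []}
print(cue_timing(segment))  # 00:00:08.570 --> 00:00:11.550
```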

Token

| Name | Type | Value | Description |
| --- | --- | --- | --- |
| text | String | <String> | The content of the token (word, punctuation, space, or line break). |
| start_time | Float | ≥ 0 | Start time of the token in seconds. |
| end_time | Float | ≥ 0 | End time of the token in seconds. |
| kind | String | "word", "space", "punctuation", "line_break" | The type of token. |
| confidence | Float | 0..1 | Confidence score for word tokens (not applicable to spaces, punctuation, or line breaks). |
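
Since confidence applies only to word tokens, code that reads it should filter on kind first. The following sketch collects words that fall below a confidence threshold, for example to flag them for review. The function name low_confidence_words and the 0.9 default threshold are arbitrary choices made for illustration.

```python
def low_confidence_words(segments, threshold=0.9):
    """Return (text, start_time, confidence) for words below the threshold."""
    flagged = []
    for segment in segments:
        for token in segment["tokens"]:
            # Spaces, punctuation, and line breaks carry no confidence score.
            if token["kind"] != "word":
                continue
            confidence = token.get("confidence")
            if confidence is not None and confidence < threshold:
                flagged.append((token["text"], token["start_time"], confidence))
    return flagged


# A subset of the tokens from the example document.
segments = [
    {
        "start_time": 8.57,
        "end_time": 11.55,
        "tokens": [
            {"text": "What", "start_time": 9.57, "end_time": 9.81, "kind": "word", "confidence": 0.96},
            {"text": " ", "start_time": 9.81, "end_time": 9.81, "kind": "space"},
            {"text": "a", "start_time": 9.81, "end_time": 9.84, "kind": "word", "confidence": 0.88},
            {"text": " ", "start_time": 10.05, "end_time": 10.05, "kind": "space"},
            {"text": "crowd", "start_time": 10.29, "end_time": 10.71, "kind": "word", "confidence": 0.81},
            {"text": ".", "start_time": 10.71, "end_time": 10.71, "kind": "punctuation"},
        ],
    }
]

for text, start, confidence in low_confidence_words(segments):
    print(f"{start:.2f}s  {text!r}  confidence={confidence}")
```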

Speaker

| Name | Type | Value | Description |
| --- | --- | --- | --- |
| id | String | <UUID> | Unique speaker identifier, referenced in segments. |
| text | String | <String> | Speaker name or label. |
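
Segments refer to speakers by ID, so displaying a speaker name requires a lookup in the root speakers list. A minimal sketch, assuming a parsed document; the helper names and the "Unknown speaker" fallback are hypothetical and chosen only for this example.

```python
def speaker_labels(doc):
    """Map each speaker ID to its name or label."""
    return {speaker["id"]: speaker["text"] for speaker in doc.get("speakers", [])}


def label_for(segment, labels, default="Unknown speaker"):
    """Look up the label for a segment's speaker.

    speaker_id is present only when the "speakers" feature is enabled,
    so a missing or unrecognized ID falls back to the default.
    """
    return labels.get(segment.get("speaker_id"), default)


# IDs and labels taken from the example document; tokens omitted.
doc = {
    "version": 1,
    "kind": "transcript",
    "features": ["speakers"],
    "segments": [
        {
            "start_time": 8.57,
            "end_time": 11.55,
            "speaker_id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
            "tokens": [],
        }
    ],
    "speakers": [
        {"id": "917f1dd7-f026-4aef-9544-dee627bbcdb0", "text": "Speaker 1"}
    ],
}

labels = speaker_labels(doc)
for segment in doc["segments"]:
    print(label_for(segment, labels))  # Speaker 1
```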
