Verbit Transcript JSON Format

Post-production captioning and transcription results are available in multiple formats. One of them is a detailed JSON format that carries timing for every token and a confidence score for every word, intended for integration into other workflows and applications. It can be retrieved via the get_caption API endpoint.

Example Document

{
   "version": 1,
   "kind": "transcript",
   "features": [
      "speakers"
   ],
   "segments": [
      {
         "start_time": 8.57,
         "end_time": 11.55,
         "speaker_id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
         "tokens": [
            {
               "text": "What",
               "start_time": 9.57,
               "end_time": 9.81,
               "kind": "word",
               "confidence": 0.96
            },
            {
               "text": " ",
               "start_time": 9.81,
               "end_time": 9.81,
               "kind": "space"
            },
            {
               "text": "a",
               "start_time": 9.81,
               "end_time": 9.84,
               "kind": "word",
               "confidence": 0.88
            },
            {
               "text": " ",
               "start_time": 9.84,
               "end_time": 9.84,
               "kind": "space"
            },
            {
               "text": "good",
               "start_time": 9.84,
               "end_time": 10.05,
               "kind": "word",
               "confidence": 0.93
            },
            {
               "text": " ",
               "start_time": 10.05,
               "end_time": 10.05,
               "kind": "space"
            },
            {
               "text": "looking",
               "start_time": 10.05,
               "end_time": 10.29,
               "kind": "word",
               "confidence": 0.96
            },
            {
               "text": " ",
               "start_time": 10.29,
               "end_time": 10.29,
               "kind": "space"
            },
            {
               "text": "crowd",
               "start_time": 10.29,
               "end_time": 10.71,
               "kind": "word",
               "confidence": 0.81
            },
            {
               "text": ".",
               "start_time": 10.71,
               "end_time": 10.71,
               "kind": "punctuation"
            },
            {
               "text": "\n",
               "start_time": 10.71,
               "end_time": 10.71,
               "kind": "line_break"
            }
         ]
      }
   ],
   "speakers": [
      {
         "id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
         "text": "Speaker 1"
      }
   ]
}

Format Structure

The core elements of a transcript or caption are tokens, which represent the text split into words, spaces, punctuation marks, and line breaks.

Tokens are grouped into segments, which define paragraph or caption cue segmentation. Segments also contain additional metadata, such as the associated speaker ID.
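
Because spaces, punctuation, and line breaks are tokens in their own right, the plain text of a segment can be rebuilt by concatenating token texts in order. The following is a minimal sketch of that idea, assuming the document has already been parsed into a Python dict; the function names `segment_text` and `transcript_text` are illustrative and not part of any Verbit library.

```python
from typing import Any


def segment_text(segment: dict[str, Any]) -> str:
    """Rebuild a segment's text by joining its tokens in order."""
    # No separator is needed: spaces and line breaks are explicit tokens.
    return "".join(token["text"] for token in segment["tokens"])


def transcript_text(document: dict[str, Any]) -> str:
    """Rebuild the full text of a document from all of its segments."""
    return "".join(segment_text(segment) for segment in document["segments"])


# Abbreviated version of the example document (timing fields omitted).
sample_document = {
    "version": 1,
    "kind": "transcript",
    "segments": [
        {
            "tokens": [
                {"text": "What", "kind": "word"},
                {"text": " ", "kind": "space"},
                {"text": "a", "kind": "word"},
                {"text": " ", "kind": "space"},
                {"text": "good", "kind": "word"},
                {"text": " ", "kind": "space"},
                {"text": "looking", "kind": "word"},
                {"text": " ", "kind": "space"},
                {"text": "crowd", "kind": "word"},
                {"text": ".", "kind": "punctuation"},
                {"text": "\n", "kind": "line_break"},
            ]
        }
    ],
}

print(repr(transcript_text(sample_document)))
```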

Root Attributes

Name	Type	Possible Values	Description
`version`	Integer	`1`	JSON format version.
`kind`	String	"captions", "transcript"	Specifies whether the document is a transcript or captions.
`features`	Array[String]	"speakers"	List of features applied to the job (currently, only "speakers" is supported).
`segments`	Array[Object]	See Segment	List of transcript paragraphs or caption cues.
`speakers`	Array[Object]	See Speaker	List of speakers.

Segment

Name	Type	Value	Description
`start_time`	Float	`≥ 0`	Start time of the segment in seconds.
`end_time`	Float	`≥ 0`	End time of the segment in seconds.
`speaker_id`	String	`<UUID>`	Speaker ID (present only if the "speakers" feature is enabled).
`tokens`	Array[Object]	See Token	List of words, spaces, punctuation, and line breaks in the segment.
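
As one possible use of the segment timing fields, a segment can be turned into a caption cue in another format. The sketch below renders a segment as an SRT cue, using SRT's `HH:MM:SS,mmm` timestamp layout. The helper names `srt_timestamp` and `segment_to_srt_cue` are hypothetical, and choosing SRT as the target is an assumption made only for illustration.

```python
from typing import Any


def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Work in whole milliseconds to avoid floating-point drift.
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segment_to_srt_cue(index: int, segment: dict[str, Any]) -> str:
    """Render one segment as an SRT cue with the given 1-based index."""
    # Trailing line-break tokens are stripped; the cue adds its own newline.
    text = "".join(token["text"] for token in segment["tokens"]).strip()
    start = srt_timestamp(segment["start_time"])
    end = srt_timestamp(segment["end_time"])
    return f"{index}\n{start} --> {end}\n{text}\n"


# Segment timing taken from the example document; tokens abbreviated.
sample_segment = {
    "start_time": 8.57,
    "end_time": 11.55,
    "tokens": [
        {"text": "What", "kind": "word"},
        {"text": " ", "kind": "space"},
        {"text": "a", "kind": "word"},
        {"text": " ", "kind": "space"},
        {"text": "crowd", "kind": "word"},
        {"text": ".", "kind": "punctuation"},
        {"text": "\n", "kind": "line_break"},
    ],
}

print(segment_to_srt_cue(1, sample_segment))
```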

Token

Name	Type	Value	Description
`text`	String	`<String>`	The content of the token (word, punctuation, space, or line break).
`start_time`	Float	`≥ 0`	Start time of the token in seconds.
`end_time`	Float	`≥ 0`	End time of the token in seconds.
`kind`	String	"word", "space", "punctuation", "line_break"	The type of token.
`confidence`	Float	`0..1`	Confidence score for the word. Present only on "word" tokens; space, punctuation, and line-break tokens omit it, as in the example above.
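
The `kind` and `confidence` fields together make it possible to flag words that may need review. The sketch below collects word tokens whose confidence falls under a threshold. The function name `low_confidence_words` and the default threshold of 0.9 are assumptions chosen for illustration, and a word token without a `confidence` field is treated here as fully confident, which is also an assumption.

```python
from typing import Any


def low_confidence_words(
    segment: dict[str, Any], threshold: float = 0.9
) -> list[dict[str, Any]]:
    """Return the word tokens in a segment whose confidence is below threshold."""
    return [
        token
        for token in segment["tokens"]
        # Only word tokens carry a confidence score; skip everything else.
        if token["kind"] == "word"
        # Assumption: a word token with no score is not flagged.
        and token.get("confidence", 1.0) < threshold
    ]


# Tokens and scores taken from the example document.
sample_segment = {
    "tokens": [
        {"text": "What", "start_time": 9.57, "end_time": 9.81, "kind": "word", "confidence": 0.96},
        {"text": " ", "start_time": 9.81, "end_time": 9.81, "kind": "space"},
        {"text": "a", "start_time": 9.81, "end_time": 9.84, "kind": "word", "confidence": 0.88},
        {"text": " ", "start_time": 9.84, "end_time": 9.84, "kind": "space"},
        {"text": "crowd", "start_time": 10.29, "end_time": 10.71, "kind": "word", "confidence": 0.81},
        {"text": ".", "start_time": 10.71, "end_time": 10.71, "kind": "punctuation"},
    ]
}

for token in low_confidence_words(sample_segment):
    print(token["text"], token["start_time"], token["confidence"])
```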

Speaker

Name	Type	Value	Description
`id`	String	`<UUID>`	Unique speaker identifier, referenced in segments.
`text`	String	`<String>`	Speaker name or label.
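
Segments refer to speakers only by ID, so displaying a speaker name requires a lookup against the root `speakers` list. The following sketch builds that lookup and pairs each segment's text with its speaker label. The function names are illustrative, and the code assumes that `speaker_id` (and possibly the `speakers` list itself) may be absent when the "speakers" feature is not enabled.

```python
from typing import Any, Iterator, Optional


def speaker_labels(document: dict[str, Any]) -> dict[str, str]:
    """Map each speaker ID to its name or label."""
    return {speaker["id"]: speaker["text"] for speaker in document.get("speakers", [])}


def labelled_segments(
    document: dict[str, Any],
) -> Iterator[tuple[Optional[str], str]]:
    """Yield (speaker label, segment text) pairs; the label is None if unknown."""
    labels = speaker_labels(document)
    for segment in document["segments"]:
        # speaker_id is only present when the "speakers" feature is enabled.
        label = labels.get(segment.get("speaker_id"))
        text = "".join(token["text"] for token in segment["tokens"]).strip()
        yield label, text


# Abbreviated document; the second segment has no speaker_id on purpose.
sample_document = {
    "version": 1,
    "kind": "transcript",
    "features": ["speakers"],
    "segments": [
        {
            "speaker_id": "917f1dd7-f026-4aef-9544-dee627bbcdb0",
            "tokens": [
                {"text": "What", "kind": "word"},
                {"text": " ", "kind": "space"},
                {"text": "a", "kind": "word"},
            ],
        },
        {"tokens": [{"text": "crowd", "kind": "word"}]},
    ],
    "speakers": [
        {"id": "917f1dd7-f026-4aef-9544-dee627bbcdb0", "text": "Speaker 1"}
    ],
}

for label, text in labelled_segments(sample_document):
    print(f"{label or 'Unknown'}: {text}")
```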