Utterance segmentation can be more straightforward in a corpus of spoken data than in a corpus of written data, because much of the segmentation may already have been carried out by the transcriber.

The LCDC has recently been updated to a format in which the transcription file is aligned with the audio, so that utterances are grouped by the transcriber into breath units. In the original LCDC transcription format, which was not audio-aligned (and which makes up the training corpus for these guidelines), the transcriber relies on their own intuitions about where punctuation should go to break up the running speech into utterances. The utterance segmentation in these guidelines therefore relies heavily on the intuitions of the original transcriber.

  1. Any sentence-final period is treated as an utterance delimiter. Even when two full NP-VP sentences are transcribed as separated by a comma, for whatever reason, they should be treated as a single utterance.
  2. Speaker switches are also utterance delimiters. This is non-trivial, as full NP-VP sentences can sometimes be co-constructed by speakers in conversation. Nevertheless, for ease of analysis, co-construction of a single utterance by two or more speakers will not be assumed in this corpus.
  3. False starts and speech errors are included in the full utterance which follows, if the full utterance is produced by the same speaker. If there is an incomplete sentence followed by a speaker switch, the incomplete sentence should form its own utterance.
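
The three rules above can be sketched in code. This is a minimal illustration, not the tooling actually used for the LCDC. It assumes that a transcript is available as a list of (speaker, text) turns, that a period followed by whitespace or the end of a turn is always sentence-final (so abbreviations and ellipses are not handled), and that the function name `segment_utterances` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed input shape: one (speaker, transcribed text) pair per speaker turn.
Turn = Tuple[str, str]


def segment_utterances(turns: List[Turn]) -> List[Turn]:
    """Split speaker turns into utterances following the three rules above."""
    utterances: List[Turn] = []
    for speaker, text in turns:
        # Rule 2: each turn is processed separately, so a speaker switch
        # always closes the current utterance.
        # Rule 1: split only after a period; commas never split, even
        # between two full sentences.
        for chunk in re.split(r"(?<=\.)\s+", text.strip()):
            if chunk:
                # Rule 3: a false start has no period after it, so it stays
                # attached to the full utterance that follows. An incomplete
                # sentence at the end of a turn becomes its own utterance.
                utterances.append((speaker, chunk))
    return utterances


if __name__ == "__main__":
    example = [
        ("A", "I went to the, I went to the shop. It was closed, we came home."),
        ("B", "And then you"),
        ("A", "came back."),
    ]
    for speaker, utterance in segment_utterances(example):
        print(f"{speaker}: {utterance}")
```

In this invented example, speaker A's false start stays inside the first utterance, the comma-separated pair of sentences stays a single utterance, and speaker B's incomplete sentence forms its own utterance because a speaker switch follows it.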
segmentation_into_utterances.1431121178.txt.gz · Last modified: 2021/02/11 17:01 (external edit)