Chat utterance segmentation

In general, an utterance boundary in a chat text can be defined by the use of the return (enter) key. However, this does not hold uniformly: in some cases a single utterance will need to be constructed from two lines; in other cases what appears to be one utterance will need to be split.
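
As a starting point, the default rule (one utterance per press of enter) can be sketched in Python. This is only an illustrative baseline, not a tool used for the corpus; the function name baseline_utterances is hypothetical, and dropping blank lines is an assumption made here.

```python
def baseline_utterances(chat_text: str) -> list[str]:
    """Treat every press of enter (line break) as an utterance boundary."""
    # splitlines() breaks on \n, \r\n and \r (among other line separators).
    # Assumption: blank lines carry no utterance and are dropped.
    return [line.strip() for line in chat_text.splitlines() if line.strip()]


if __name__ == "__main__":
    text = "so i got a text last night\nfrom my ex\n"
    print(baseline_utterances(text))
```

The recombining and splitting described in the sections that follow would then be applied as corrections on top of this default.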

Two utterances to one:

In some cases, the enter key will be hit mid-sentence, and the resulting pieces should be recombined into a single utterance. This is most likely a way of holding the floor:

(not from the present corpus)

so i got a text last night

from my ex

saying that he wants to get back together :/

In other cases this enter may simply be a mistake, as mentioned in the tokenization section:

it [enter] s a good idea

In cases like the two above, efforts should be made to recombine the pieces across the enter boundaries into a single utterance.

However, the enter key may also be used in lieu of punctuation, as in the example below:

propper english

its a family tradition

In this sentence the enter is likely meant to be a colon, a dash, or another significant pause. The guiding principle for recombination should be practicality. Left on its own, a stranded fragment such as s a good idea would have a more difficult syntactic structure than propper english; thus the former should be combined with its neighbor to make a single utterance, while the latter should be left as is.
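
Since the decision to recombine is a judgment call, one way to record it is as an explicit list of annotator decisions applied to the raw lines. The sketch below assumes this setup; the name recombine and the joins mapping are hypothetical, and the glue string used for each merge (a space, or nothing at all) is the annotator's choice rather than something fixed by these guidelines.

```python
def recombine(lines: list[str], joins: dict[int, str]) -> list[str]:
    """Merge line i with line i + 1 wherever i is a key in joins.

    joins[i] is the glue placed between the two lines. Lines whose
    index is not in joins keep their enter boundary untouched.
    """
    utterances: list[str] = []
    glue = None  # glue to apply if the current line continues the previous one
    for i, line in enumerate(lines):
        if glue is not None and utterances:
            utterances[-1] += glue + line
        else:
            utterances.append(line)
        glue = joins.get(i)
    return utterances


if __name__ == "__main__":
    floor = ["so i got a text last night", "from my ex",
             "saying that he wants to get back together :/"]
    print(recombine(floor, {0: " ", 1: " "}))   # floor-holding: merge all three
    print(recombine(["it", "s a good idea"], {0: ""}))  # accidental enter
    print(recombine(["propper english", "its a family tradition"], {}))  # left as is
```

Keeping the decisions in a separate mapping leaves the original use of enter recoverable, which fits the principle of not altering the corpus more than necessary. The empty glue in the second call is only one possible rendering of the accidental-enter case.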

One utterance to two:

Conversely, it may in some cases be preferable to split a single utterance (as defined by the striking of enter) into two.

When two full (but non-identical) sentences (S, of any type) can be formed, a single utterance should be split into two. This is likely to be relatively uncommon: in the current corpus it occurred only twice, both times in almost identical form:

any girls wanna chat pm me

any girls wanna chat PM me

These should be rendered as:

any girls wanna chat (is question)

pm me (is imperative)

However, in the case of:

bob and weave bob and weave

the repetition here suggests that splitting would be dispreferred.
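
The splitting rule and its repetition exception can be illustrated with a small Python sketch. It assumes the annotator has already chosen the token position at which to split; the function name split_utterance and the at_token parameter are hypothetical, and comparing the two halves case-insensitively is an assumption made here, not a rule stated above. Deciding whether each half is a full S remains a manual judgment that the code does not attempt.

```python
def split_utterance(utterance: str, at_token: int) -> list[str]:
    """Split an utterance before token index at_token, unless the halves repeat."""
    tokens = utterance.split()
    first, second = tokens[:at_token], tokens[at_token:]
    # A split that leaves one side empty is not a split at all.
    if not first or not second:
        return [utterance]
    # Identical halves are repetition, so the utterance is left whole.
    # Assumption: the comparison ignores case.
    if [t.lower() for t in first] == [t.lower() for t in second]:
        return [utterance]
    return [" ".join(first), " ".join(second)]


if __name__ == "__main__":
    print(split_utterance("any girls wanna chat pm me", 4))
    print(split_utterance("bob and weave bob and weave", 3))
```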

Problematic cases:

Given the somewhat subjective nature of deciding to split or combine utterances, it is important not to tinker excessively, and to preserve the original use of enter as much as possible. One tempting scenario involves interjections, where the same user follows the interjection (post-enter) with more text. For example:


i dont know

While it may be tempting to combine these into one utterance, interjections should be left as they appear in the corpus, whatever that means for their separation from the following utterances.

chat_utterance_segmentation.txt · Last modified: 2021/02/11 16:44 (external edit)