
Irish English


This is a guide to annotating Irish English.

Annotation Guidelines

The guidelines will deal with several levels of analysis:

  • Irish English Tokenization - segmentation into words
  • Part of speech tagging
  • Utterance Segmentation
  • Constituent Parsing
  • Dependency Parsing


In most cases, tokenization of the Irish English corpus is quite standard.

Part of speech tagging

Partial word

For partial words, use the target hypothesis: tag the partial word as the word the speaker apparently intended.


So uhm <,> then we all <.> dec </.> they all decided they wanted to go to the disco like but I had no money

Token: So uhm then we all dec they all decided

Sometimes, it may be difficult to use target hypothesis. In these cases, see the section UNCLEAR below.
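As a rough aid for finding the tokens that need a target hypothesis, the sketch below flags partial words in a transcript line. It assumes that partial words are wrapped as `<.> … </.>`, as in the example above, and that all other angle-bracket markup (including pauses) can be dropped for this purpose. The function name `tokens_with_partial_flags` is hypothetical and not part of any corpus tool.

```python
import re

# Assumption: a partial word is a single token wrapped as "<.> word </.>".
PARTIAL = re.compile(r"<\.>\s*(\S+)\s*</\.>")
# Any remaining angle-bracket markup, e.g. "<,>" or "<#>".
OTHER_MARKUP = re.compile(r"<[^>]*>")
# Sentinel prefix used to remember which tokens were partial.
SENTINEL = "\x00"


def tokens_with_partial_flags(line):
    """Return (token, is_partial) pairs for one transcript line."""
    # Mark partial words before the rest of the markup is stripped.
    marked = PARTIAL.sub(lambda m: " " + SENTINEL + m.group(1) + " ", line)
    cleaned = OTHER_MARKUP.sub(" ", marked)
    pairs = []
    for tok in cleaned.split():
        if tok.startswith(SENTINEL):
            pairs.append((tok[len(SENTINEL):], True))
        else:
            pairs.append((tok, False))
    return pairs


line = "So uhm <,> then we all <.> dec </.> they all decided"
partials = [tok for tok, is_partial in tokens_with_partial_flags(line) if is_partial]
print(partials)  # ['dec']
```

The annotator still has to decide what the intended word is (here, decided); the sketch only locates the candidates.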

Discourse markers

Null-to-low semantic value

Words with little or no semantic value are tagged as discourse markers (i.e. UH). These words are usually affirmative responses and carry less meaning than in their other uses. For example, well in “oh well” no longer has the sense it has in “the child behaved well”.


Oh right

Token: Oh right

Ah cool

Token: Ah cool

He rang her alright

Token: He rang her alright

Clause-final 'like'

Function: “retroactive focusing power, but more importantly, […] they can be interpreted as countering potential inferences, objections, or doubts.” (Miller & Weinert, 1995)

Clause-final 'like' is extremely common, and it neither (a) appears in the same distribution nor (b) has the same function as other forms of 'like'. It should therefore be tagged as UH.

All the people were out like.

Token: All the people were out like
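The rule above can be sketched as a small post-processing step over an already tagged clause. This is only an illustration: it assumes the input is a single clause given as (token, tag) pairs, that "clause-final" means the last token once trailing pauses or punctuation are skipped, and the function name `tag_clause_final_like` and the input tags in the example are hypothetical.

```python
# Tokens that may follow the last word of a clause (pauses are shown as ",").
PAUSE_OR_PUNCT = {",", ".", "?", "!"}


def tag_clause_final_like(tagged):
    """Return a copy of the clause with a clause-final 'like' retagged as UH."""
    result = list(tagged)
    # Walk backwards to the last token that is not a pause or punctuation mark.
    for i in range(len(result) - 1, -1, -1):
        token, _ = result[i]
        if token in PAUSE_OR_PUNCT:
            continue
        if token.lower() == "like":
            result[i] = (token, "UH")
        break  # only the final word of the clause is considered
    return result


# Hypothetical input tags, e.g. from a tagger that treats 'like' as a preposition.
clause = [("All", "DT"), ("the", "DT"), ("people", "NNS"),
          ("were", "VBD"), ("out", "RB"), ("like", "IN")]
print(tag_clause_final_like(clause)[-1])  # ('like', 'UH')
```

Clause-medial and clause-initial 'like' are deliberately left untouched, since the guideline only covers the clause-final position.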




Did she go out with ye.

Token: Did she go out with ye

UNCLEAR

For unclear words, either use the target hypothesis or the tag XX.

N.B. XX is also used in the Switchboard Corpus for partial words, and unclear parts of speech (Calhoun et al., 2010). Here, we tag partial words using target hypothesis. If the partial word is unclear, then proceed to tag as XX.


Did you go UNCLEAR

Token: Did you go UNCLEAR

Utterance Segmentation


The utterance should always end after a speaker's turn.


Speaker A: <#> Went in shopping for a while

Speaker A's turn ends. End of utterance.

Speaker B: <#> Buy anything

Speaker B's turn ends. End of utterance.

Speaker A: <#> Met Nicole in town <#> No I didn't buy anything <#> I 've hardly no money <#> <{> <[> Broke <,>

Speaker A's turn ends. End of utterance.

Notice that in this example, Speaker B interrupted Speaker A, who was still listing the activities from their previous turn. These two turns should be annotated as distinct utterances even though they are closely related.

False Starts

False starts should be included in the utterance.


<#> But uhm she 's she 's from Galway

Tokens: But uhm she 's she 's from

Exceptions are false starts at the beginning of a sentence in which the lexical item differs significantly from what follows. These should be segmented as distinct utterances. However, there may be cases where the distinction between false starts and topicalization is ambiguous. In these cases, you should use your own judgment.


<#> <.> Sat </.> who else <,>



Pauses at the end of an utterance should be included.


<#> Yeah <,> she was <{> <[> with her sister </[> <,> <#> She was going in shopping

Tokens: Yeah , she was with her sister , She was going in shopping

Sentence Boundaries

In most cases, pre-annotated sentence boundaries should be used as utterance boundaries.


<#> So then uhm <,> what 'd I do Sunday then <#> Sunday I did nothing much

Tokens: So then uhm , what 'd I do Sunday then Sunday I did nothing much
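The segmentation rules in this section can be sketched as follows. The sketch assumes that `<#>` marks a pre-annotated sentence boundary and `<,>` a pause, as in the examples above, that pauses are kept as "," tokens, and that any other angle-bracket tag can simply be dropped. It does not handle spans whose content should be removed as well (for example `<unclear> … </unclear>` or `<&> … </&>`). The function name `segment_turn` is hypothetical.

```python
import re


def segment_turn(turn):
    """Split one speaker turn into utterances, each a list of tokens."""
    utterances = []
    # Pre-annotated sentence boundaries are used as utterance boundaries.
    for chunk in turn.split("<#>"):
        # Keep pauses as "," tokens, including pauses at the end of an utterance.
        chunk = chunk.replace("<,>", " , ")
        # Drop the remaining tags (overlap markers etc.), keeping their content.
        chunk = re.sub(r"<[^>]*>", " ", chunk)
        tokens = chunk.split()
        if tokens:
            utterances.append(tokens)
    return utterances


turn = "<#> Yeah <,> she was <{> <[> with her sister </[> <,> <#> She was going in shopping"
for utterance in segment_turn(turn):
    print(" ".join(utterance))
# Yeah , she was with her sister ,
# She was going in shopping
```

Because a speaker's turn always ends an utterance, the function is meant to be called once per turn; it never merges material across turns.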

Constituent Parsing

Empty Categories

In speech, subject pronouns are frequently dropped. In these cases, null subjects should be marked as an empty category (NONE *).


<#> Met Nicole in town <#>

<#> Went in shopping for a while


You may notice that the previous examples are annotated as fragments. The question is whether these kinds of sentences should be annotated as a fragment or as a regular sentence. For example, if a speaker is providing a narrative in the first person, they may drop subject pronouns but their sentences may be well-formed and complex. We would then expect these sentences to be annotated as sentences, not fragments. However, this is not always clear, as the boundary is often fuzzy. Therefore, this guideline adopts the following definition for fragments (FRAG).

“FRAG marks those portions of text that appear to be clauses, but lack too many essential elements. Essential elements include phonologically overt nominal subjects and verbs.”


Multiple interjections may appear in clusters or “streams”. Phrases containing multiple interjections should be annotated flat.


<#> Oh right yeah
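To make "flat" concrete, the sketch below builds a bracketing in which every interjection is a direct child of one phrase node, with no nested grouping. It assumes that each interjection is tagged UH (as in the discourse marker section) and that the phrase label is INTJ, which is taken from the working notes further down rather than from a stated rule. The function name `flat_intj` is hypothetical.

```python
def flat_intj(tokens):
    """Bracket a stream of interjections as one flat phrase with UH children."""
    # Every token becomes a sibling under the same node: no [Oh right] [yeah] grouping.
    children = " ".join("(UH {})".format(tok) for tok in tokens)
    return "(INTJ {})".format(children)


print(flat_intj(["Oh", "right", "yeah"]))
# (INTJ (UH Oh) (UH right) (UH yeah))
```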

Clause-final LIKE

Clause-final LIKE is very frequent in the ICE Ireland Corpus, more so than either clause-initial or clause-medial LIKE (Schweinberger 2011). Many scholars analyse clause-final LIKE as a focus marker with backward scope, i.e. one that modifies the preceding clause (Harris 1993; Miller & Weinert 1995; Andersen 2000; Columbus 2009). Following their discussions, clause-final LIKE should be attached to the root.


<#> What 's new like </[> <#>

However, it may sometimes be unclear whether LIKE is clause-final.

For example, in the phrase “<[> So then <,> </[> she was asking like if we were going out Saturday night”, LIKE is syntactically ambiguous: it may belong to the following clause or to the preceding one.


So then she was asking [like if we were going out Saturday night]


[So then she was asking like] if we were going out Saturday night

Since the corpus does not include recordings, this may be difficult to determine. Furthermore, the syntactic position of LIKE is linked to its discourse-pragmatic function (Andersen 1998, 2000; Miller & Weinert 1995; Miller 2009).

The functions of LIKE within the linguistic literature include (Schweinberger 2011):

  1. Hedging
  2. Focusing
  3. Buying Processing Time
  4. Indicating the Passage is Hard to Follow
  5. Holding the Floor
  6. Signaling Minor Non-Equivalence Between What’s Said and What’s in Mind
  7. Signaling Loose Talk/Marking Non-Literalness
  8. Signaling Approximation
  9. Introducing Exemplifications
  10. Signaling Similarity

LIKE can therefore be functionally ambiguous as well as syntactically ambiguous. In these cases, the annotator should rely on their own intuition about the form and function of LIKE in the sentence.


Spoken corpora contain many disfluencies such as false starts, interruptions, stutters, etc.

Reparandum and Repair

For several types of these disfluencies, there are usually two parts: (a) the reparandum, and (b) the repair. The reparandum is defined as the phrase that is subjected to repair.


<#> So uhm <,> then we all <.> dec </.> they all decided they wanted to go to the disco like

In this example, the reparandum is we all dec, and the repair is they all decided.

In these cases, the guideline adopts the conventions of the NXT-format Switchboard Corpus (Calhoun et al. 2010), where the reparandum is subsumed within the category EDITED. The token dec, in the example above, is the unfinished token corresponding to decided. Unfinished categories should be annotated with the label UNF. The corresponding parse tree is represented below.


Stuttering or hesitation often results in repetition of a word, phrase, or sentence.

The repeated word or phrase (i.e. the second occurrence) should be included within the category REPEAT.


<#> But uhm she 's she 's from Galway as well though

<#> So she 's <&> laughter </&> she 's in great form like
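Candidate REPEAT spans can be located automatically before the annotator reviews them. The sketch below looks for a word sequence that is immediately followed by an identical copy of itself and reports the index span of the second occurrence, which is the part the guideline places under REPEAT. It is a heuristic only: it assumes markup has already been removed, it compares tokens case-insensitively, and it will not see a repetition that is interrupted by other material (such as the laughter in the second example). The name `find_repeats` and the `max_len` parameter are hypothetical.

```python
def find_repeats(tokens, max_len=3):
    """Return (start, end) spans covering the second occurrence of adjacent repeats."""
    lowered = [tok.lower() for tok in tokens]
    spans = []
    i = 0
    while i < len(lowered):
        matched = False
        # Try the shortest repeat first, up to max_len tokens.
        for n in range(1, max_len + 1):
            first = lowered[i:i + n]
            second = lowered[i + n:i + 2 * n]
            if len(second) == n and first == second:
                spans.append((i + n, i + 2 * n))
                i += n  # continue from the repeated copy to catch triple repeats
                matched = True
                break
        if not matched:
            i += 1
    return spans


tokens = "But uhm she 's she 's from Galway as well though".split()
print(find_repeats(tokens))  # [(4, 6)]
```

The spans use Python slice conventions, so `tokens[4:6]` is the repeated "she 's". Whether a candidate is a true repetition or a repair remains the annotator's decision.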

Unknown, Uncertain or Un-bracketable

Unclear or unfamiliar words may sometimes appear in the transcript. The guideline again adopts the conventions of the NXT-format Switchboard Corpus (Calhoun et al. 2010), where unknown, uncertain or un-bracketable material is subsumed within the category X.


<#> <[> Did you go <unclear> 1 syll </unclear> </[> </{>

<#> Derv


Sentence-initial 'so' - flat? 'then'

Version 1

  • #2. Uhm Friday night I didn't do much.
  • #11. Oh yeah unbelievable - 'Oh yeah' is INTJ together because possible MWE, but in general, each UH is an INTJ
  • #13. Went in shopping for a while - added (NONE *) before 'Went'.
  • #14. Buy anything - made it SQ - target hypothesis.

Version 2

  • #2. Broke - added (NONE *), short for 'I am broke.' therefore ADJP-PRD. Frag or S? Frag because incomplete, missing verb.
  • #3. What's new like - 'like' is phrase-final so append INTJ to phrase before.
  • #6. So uhm what else did I do then - sentence-initial 'so' must all be flat.
  • #11. Did you - FRAG? SQ? where's the verb - 'did'?
  • #14. Did you go XX - target hypothesis.
  • #15. Derv - category X.
  • #17. So yeah - RB is flat.
  • #19. …she was asking like - append INTJ UH like at the end of phrase.
  • #22. Oh right yeah - [Oh right] [yeah]
  • #31. Cushty - not NP-SBJ, missing verb.
  • #36. So uhm then we all dec they all decided… - need a label to state false start/disfluency
  • #39. …I'd say - frag? PRN
  • #44. No yeah
  • #54. Did he - frag, no verb, not S…not NP-SBJ
  • #58. Ah cool - separate intj?
  • #59. So she's she's… - false start what to do…
  • #73. That Cliona's mum
  • #74. That has Cliana - fragment of an SBAR? WHNP?
  • #75. Yeah that's right yeah - attach last 'yeah' to S or to VP?
  • #80. Oh right right - where does the constituency go? I made [Oh right] [right]
  • #82. Yeah I do yeah
  • #88. She…you know…her her… - false starts and stuttering/disfluencies.
  • #90. She's she's… - difference between 88 and 90 is unclear, which gets marked as FRAG? first? second?

When two interjections in a row…which is the head? Phrase-final 'like' is at the end of phrase, inside. 'Sat' - single lexical item…fragment v2 #14 UNCLEAR…target hypothesis. 'so uhm' #6 v2. 'so yeah' #17 v2. #22 v2. 'oh right yeah' tagset not same as ours… #36 false start = frag, label? #39 I'd say #48 interjections at end of phrase… #54 Did he? ← frag? SQ? #59 So she's she's in great form like - FRAG as part of S? or on its own? SQ - Do you…

KISS…while the tags follow the Penn standard tagset, it is not an exhaustive use for the following reasons… Sentence initial 'so'…maybe some test… CHANGE: Keep utterance boundary the same - helps with constituency, just make them fragments.

Dependency Parsing

#4 what was on ← was is the root.

No not really #7 - root = really, not clear, right side.

#24 who else ← root is on who

X is Y ← Y is root

X is with Y ← 'is' is root

#30 did you go UNCLEAR → go→UNCLEAR is dep

#40 so then → then→So mwe

discourse always from root.

Oh right - root where? #46 → used mwe, but use tests to determine!

So i went home, i'd say #55 → I'd say is discourse function or parataxis?

Uhm what s #57 → 's' is root, right most and is verb?

#66 did she not → 'did' is the root…

#70 did he → 'did' is root

#74 ah cool → mwe

#85 with Fred → with is root because if Fred is root, cannot link together

#86 With Fred and Ciaran… → the subject is 'I', therefore with is prep…def interesting…

#89 That Cliona's mum → null copula

And Noirin… #93 → parataxis into Noirin

#107 …so… → advcl and mark


  • #4 so yeah… → what's root? hierarchy? RB, NN > INTJ as root?
  • #6 Oh right yeah → root to the right, [oh right] yeah
  • #9 Oh wow → wow is root, oh is mwe?
  • #10 the all dec we all decided → repair tag
  • #11 no yeah → you can say yeah no, so not mwe…discourse…
  • #14 repeat tag used…what do you gain what do you lose?
  • #20 She…you know… ← used repair for false start, and repeat
  • #21 contains repeat…

Combinations of Interjections

“combinations of interjections may have special pragmatic functions and distributions, representing regional and personal differences”

“the order of interjections in combinations is often fixed”

“Thus combinations of interjections hang together both functionally-semantically and formally-syntactically.”

“combinations of interjections demonstrate solidarity at the level of prosody as well”

“the need for the creation of sub-corpora and qualitative analysis of individual examples”

“interjections bring out both the strengths and weaknesses of corpus investigation…careful qualitative analysis of interjections continues to be necessary to determine particular functions.”

Neal R. Norrick. Corpus Pragmatics.

To Do

REPEAT - discursive features, take the first occurrence? LIKE - ambiguous - which root?


Andersen, Gisle. 1998. “The pragmatic marker like from a relevance-theoretic perspective”. In Jucker, Andreas H. & Yael Ziv (eds.), Discourse markers: descriptions and theory. Amsterdam & New York: John Benjamins, 147-170.

Andersen, Gisle. 2000. The role of the pragmatic marker like in utterance interpretation. Amsterdam & New York: John Benjamins.

Bies, Ann, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style. Ms., Department of Computer and Informational Science, University of Pennsylvania.

Columbus, Georgie. 2009. “Irish like as an invariant tag: evidence from ICE-Ireland”. Paper presented at AACL 2009 (American Association for Corpus Linguistics), October 9th 2009 in Edmonton, Alberta, Canada.

Calhoun, Sasha, Jean Carletta, Jason Brenier, Neil Mayo, Dan Jurafsky, Mark Steedman, and David Beaver. 2010. The NXT-format Switchboard corpus: A rich resource for investigating the syntax, semantics, pragmatics and prosody of dialogue. Language Resources and Evaluation.

Harris, John. 1993. “The grammar of Irish English”. In: Milroy, James & Leslie Milroy (eds.), 139-186.

Miller, J. and Weinert, R. 1995. The function of LIKE in spoken language. Journal of Pragmatics, 23, 365-393.

Miller, Jim. 2009. Like and other discourse markers. In Peters, Pam, Peter Collins & Adam Smith (eds.), Comparative Studies in Australian and New Zealand English: Grammar and beyond. 2009. John Benjamins: Amsterdam, 317-338.

Schweinberger, Martin. A variational approach towards discourse marker LIKE in Irish-English. In Bettina Migge and Maire Ni Chiosain (eds.), New Perspectives on Irish English, 179-201. Amsterdam and New York: John Benjamins.

Schweinberger, Martin. 2011. The discourse marker like: A corpus-based analysis of selected varieties of English. Hamburg: unpublished PhD dissertation.

irish_english.txt · Last modified: 2015/04/23 00:01 by ek684