DisMo is a part-of-speech, disfluency and multi-word unit automatic annotator. It is designed to manage the complexities and phenomena specific to spoken language.
DisMo is a multi-level annotator for spoken language corpora that integrates part-of-speech tagging with basic disfluency detection and annotation, and multi-word unit recognition. It is a hybrid system that uses a combination of lexical resources, rules, and statistical models based on Conditional Random Fields (CRF). DisMo supports a multi-level annotation scheme, in which the tokenisation to minimal word units is complemented with multi-word unit groupings (each having associated POS tags), as well as separate levels for annotating disfluencies and discourse phenomena.