The purpose of this concise glossary is to familiarise you with simplified definitions of a minimal set of terms required to comprehend this textbook. For broader coverage of corpus linguistic terminology, see, for instance:
- Annotation is the process of adding interpretative linguistic or extralinguistic information to a corpus.
- Collocation is a relationship of co-occurrence between a search word or phrase (a node) and the words that appear together with it more often than would be expected by chance.
- Concordance is a machine-generated display of all occurrences of a search unit in a corpus, shown with surrounding context for each instance.
- Concordance line is a text snippet extracted from a corpus that displays a search word or phrase (the node) together with a limited amount of its surrounding context.
- Concordancer is a software program that generates concordances from a machine-readable collection of texts (a corpus) and is commonly equipped with additional analytical functions.
- Corpus Query Language (CQL) is a system of instructions used for searches in a corpus, allowing queries based on words, lemmas, tags, attributes, text types, structures, and conditions. Regular expressions can be used in CQL to match specific values.
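As an illustration, here are a few CQL queries in the style used by tools such as Sketch Engine and the Corpus Workbench; the exact attribute names (`word`, `lemma`, `tag`) and tagset depend on how a given corpus was annotated:

```cql
[word="run"]                   # an exact wordform
[lemma="go"]                   # all forms of the lemma go
[tag="JJ.*"]                   # any token whose tag matches the regex JJ.*
[lemma="make"] [tag="NN.*"]    # a sequence: lemma make followed by a noun
[word="colou?r"]               # a regular expression inside a value
```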
- Frequency list is a list of all items of a given type in a corpus (e.g., words, tags or values of a metadata category in focus) with counts of their occurrences.
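In Python, a minimal word-frequency list can be sketched with the standard library's `collections.Counter`; real concordancers apply the same idea to tokens, tags, or metadata values:

```python
from collections import Counter

# A toy "corpus" of running words.
tokens = "the cat sat on the mat and the dog sat too".split()

# Count occurrences and sort from most to least frequent.
freq_list = Counter(tokens).most_common()

for word, count in freq_list:
    print(word, count)
```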
- Key Word in Context (KWIC) is a concordance format where the search term is centred, with preceding and following context shown in side columns.
- Keyword is a word that occurs significantly more often in the focus corpus than in some reference corpus.
- Lemma is an abstract unit expressed by the base wordform (headword in a dictionary) that encompasses all the inflected and suppletive forms of the same base word. For instance, go, goes, going, went and gone are the wordforms of the lemma go.
- Lemmatisation is the process of assigning each token in the corpus to its lemma with the help of a tool called a lemmatiser.
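A toy illustration of the mapping a lemmatiser performs, using the *go* example above with a hand-made lookup table; real lemmatisers rely on large lexicons and context-aware models rather than a fixed dictionary:

```python
# Hypothetical miniature lexicon mapping wordforms to their lemma.
LEXICON = {"go": "go", "goes": "go", "going": "go", "went": "go", "gone": "go"}

def lemmatise(tokens):
    """Map each token to its lemma; unknown tokens are left as-is."""
    return [LEXICON.get(t.lower(), t.lower()) for t in tokens]

print(lemmatise(["She", "went", "home"]))
```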
- Named entity is a word or phrase that identifies a specific real-world object. The most common categories are persons, organisations, locations and other named objects (such as document names). In corpus linguistics and natural language processing, named entities are identified through a procedure called named entity recognition (NER): the automatic identification and classification of such expressions in text, e.g., recognising “United Nations” as an organisation or “Paris” as a location.
- Normalisation, or frequency normalisation, is the process of adjusting raw counts to make them comparable across subcorpora of different sizes, typically by converting them into relative frequencies. Other types of normalisation also exist, like text normalisation, which refers to ascribing the canonical (standardised) form of a word to its various spoken or written variants in a corpus.
- Regular expression is a pattern that uses a search syntax consisting of special symbols to match strings sharing common structures, allowing users to find items, including tokens or tags, that start with, end with, or contain preset sequences.
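The three kinds of pattern mentioned above (start with, end with, contain) can be sketched with Python's standard `re` module; the token list is an arbitrary illustration:

```python
import re

tokens = ["walk", "walked", "walking", "talker", "chalk"]

# Tokens that start with "walk" (re.match anchors at the beginning).
starts = [t for t in tokens if re.match(r"walk", t)]

# Tokens that end with "ing" ($ anchors at the end).
ends = [t for t in tokens if re.search(r"ing$", t)]

# Tokens that contain "alk" anywhere.
contains = [t for t in tokens if re.search(r"alk", t)]

print(starts, ends, contains)
```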
- Relative density is a measure that compares how frequent an item is in a specific text type relative to its frequency in the whole corpus, adjusted for the size of that text type. Values below 100% indicate that the item is less typical of that text type; values around 100% indicate equal typicality; and values above 100% indicate that the item is more typical or characteristic of that text type.
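A minimal sketch of one common way this percentage is computed, assuming it is the ratio of the item's relative frequency in the text type to its relative frequency in the whole corpus, expressed as a percentage (the function name and figures are illustrative):

```python
def relative_density(freq_in_type, type_size, freq_in_corpus, corpus_size):
    """Relative frequency in the text type as a percentage of the
    relative frequency in the whole corpus (100 = equally typical)."""
    rel_type = freq_in_type / type_size
    rel_corpus = freq_in_corpus / corpus_size
    return 100 * rel_type / rel_corpus

# An item occurring 50 times in a 100,000-token text type,
# and 200 times in the whole 1,000,000-token corpus,
# is two and a half times more typical of that text type.
print(relative_density(50, 100_000, 200, 1_000_000))
```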
- Relative frequency is the number of tokens of a given item divided by the total number of tokens in a corpus or a subcorpus, multiplied by a constant (e.g., 1,000,000) to express results on a common scale.
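The conversion described here reduces to a one-line function; with the constant 1,000,000 the result is a frequency per million tokens (the figures below are illustrative):

```python
def per_million(raw_count, corpus_tokens):
    """Relative frequency expressed per million tokens."""
    return raw_count / corpus_tokens * 1_000_000

# 150 hits in a 3,000,000-token corpus = 50 occurrences per million.
print(per_million(150, 3_000_000))
```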
- Statistical measure is any statistic calculated over corpus frequencies, ranging from relatively simple metrics to complex models. Collocation (association) measures, such as t-score, mutual information (MI), log likelihood (LL), logDice, etc., quantify the strength of association between words in a collocation. Keyword measures, such as log likelihood (LL) or the simple maths parameter (SMP), are used to identify words that occur significantly more often in one corpus than in another. Dispersion measures, such as the standard deviation or the coefficient of variation, describe how evenly words or phrases are distributed throughout the corpus (see, e.g., Brezina, 2018 for more details on statistics in corpus linguistics).
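As one concrete example of a keyword measure, the simple maths parameter is usually given as the ratio of smoothed frequencies per million, (fpm_focus + n) / (fpm_ref + n), where n is a user-set smoothing constant; the sketch below assumes that formulation:

```python
def simple_maths(fpm_focus, fpm_ref, n=1):
    """Simple maths keyword score: ratio of smoothed per-million
    frequencies in the focus and reference corpora."""
    return (fpm_focus + n) / (fpm_ref + n)

# A word at 99 per million in the focus corpus but absent from
# the reference corpus (with n=1) scores 100.
print(simple_maths(99, 0))
print(simple_maths(9, 4))
```

Raising n downweights rare words, so the same parameter lets users tune the keyword list towards rarer or more common vocabulary.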
- Token is a single occurrence of a unit of running text in a corpus, which may be a wordform, a punctuation mark, a digit, or another element between spaces, depending on the tokeniser used in a concordancer.