ICT4LT Module 3.4

Corpus linguistics

Aims
Authors of this module
1. Early corpus linguistics
2. The revival of corpus linguistics
3. Modern corpus linguistics
4. Conclusion
5. Discussion topics
6. Learning tasks
Bibliography and references
Websites
Feedback and blog

Aims

The aim of this module is to introduce the student to corpus linguistics. Corpora are often used by linguists as the raw material from which language description may be fashioned - the role is no less relevant for CALL package designers. Corpora can provide the basis of accurate, empirically justified, linguistic observations on which to base CALL materials. Additionally, the corpora themselves, typically via concordancing, may become the raw material of CALL based teaching itself. The corpus may be viewed, in certain contexts, as an item bank. The use of corpora in CALL are many. A knowledge of the corpus method to CALL package designers is increasingly indispensable.

This module extends and complements the section on corpus linguistics in Module 2.4, which has been written by Marie-Noëlle Lamy & Hans Jørgen Klarskov Mortensen. Corpus linguistics also forms part of Natural Language Processing (NLP), which is dealt with by Mathias Schulze & Piklu Gupta in Module 3.5, Human Language Technologies (HLT).

This Web page is designed to be read from the printed page. Use File / Print in your browser to produce a printed copy. After you have digested the contents of the printed copy, come back to the onscreen version to follow up the hyperlinks.

Authors of this module

Tony McEnery, University of Lancaster, UK.

Andrew Wilson, University of Wales Bangor, UK.

1. Early corpus linguistics

"Early corpus linguistics" is a term we use here to describe linguistics before the advent of Chomsky. Field linguists, for example Boas (1940) who studied American-Indian languages, and later linguists of the structuralist tradition all used a corpus-based methodology. However, that does not mean that the term "corpus linguistics" was used in texts and studies from this era. Below is a brief overview of some interesting corpus-based studies predating 1950.

Contents of Section 1

1.1 Language acquisition
1.2 Spelling conventions
1.3 Language pedagogy
1.4 Chomsky
1.5 The value of introspection
1.6 Other criticisms of corpus linguistics
1.7 Chomsky re-examined
1.8 Criticisms of introspective data
1.9 Benefits of corpus data

1.1 Language acquisition

The studies of child language in the diary studies period of language acquisition research (roughly 1876-1926) were based on carefully composed parental diaries recording the child's locutions. These primitive corpora are still used as sources of normative data in language acquisition research today, e.g. Ingram (1978). Corpus collection continued and diversified after the diary studies period: large sample studies covered the period roughly from 1927 to 1957 - analysis was gathered from a large number of children with the express aim of establishing norms of development. Longitudinal studies have been dominant from 1957 to the present - again based on collections of utterances, but this time with a smaller (approximately three) sample of children who are studied over long periods of time: e.g. Brown (1973) and Bloom (1970).

1.2 Spelling conventions

Kading (1897) used a large corpus of German - 11 million words - to collate frequency distributions of letters and sequences of letters in German. The corpus, by size alone, is impressive for its time, and compares favourably in terms of size with modern corpora.

1.3 Language pedagogy

Fries & Traver (1940) and Bongers (1947) are examples of linguists who used the corpus in research on foreign language pedagogy. Indeed, as noted by Kennedy (1992), the corpus and second language pedagogy had a strong link in the early half of the twentieth century, with vocabulary lists for foreign learners often being derived from corpora. The word counts derived from such studies as Thorndike (1921) and Palmer (1933) were important in defining the goals of the vocabulary control movement in second language pedagogy.

1.4 Chomsky

Chomsky changed the direction of linguistics away from empiricism and towards rationalism in a remarkably short space of time. In doing so he apparently invalidated the corpus as a source of evidence in linguistic enquiry. Chomsky suggested that the corpus could never be a useful tool for the linguist, as the linguist must seek to model language competence rather than performance.

Competence is best described as our tacit, internalised knowledge of a language.
Performance is external evidence of language competence, and is usage on particular occasions when, crucially, factors other than our linguistic competence may affect its form.

Competence both explains and characterises a speaker's knowledge of a language. Performance, however, is a poor mirror of competence. For examples, factors diverse as short-term memory limitations or whether or not we have been drinking can alter how we speak on any particular occasion. This brings us to the nub of Chomsky's initial criticism: a corpus is by its very nature a collection of externalised utterances - it is performance data and is therefore a poor guide to modelling linguistic competence.

Further to that, if we are unable to measure linguistic competence, how do we determine from any given utterance what are linguistically relevant performance phenomena? This is a crucial question, for without an answer to this, we are not sure that what we are discovering is directly relevant to linguistics. We may easily be commenting on the effects of drink on speech production without knowing it.

However, this was not the only criticism that Chomsky had of the early corpus linguistics approach.

The non-finite nature of language

All the work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions:

The sentences of a natural language are finite.
The sentences of a natural language can be collected and enumerated.

The corpus was seen as the sole source of evidence in the formation of linguistic theory - "This was when linguists [...] regarded the corpus as the sole explicandum of linguistics" (Leech, 1991).

To be fair, not all linguists at the time made such bullish statements - Harris (1951) is probably the most enthusiastic exponent of this point, while Hockett (1948) did make weaker claims for the corpus, suggesting that the purpose of the linguist working in the structuralist tradition "is not simply to account for utterances which comprise his corpus" but rather to "account for utterances which are not in his corpus at a given time."

The number of sentences in a natural language is not merely arbitrarily large - it is potentially infinite. This is because of the sheer number of choices, both lexical and syntactic, which are made in the production of a sentence. Also, sentences can be recursive. Consider the sentence "The man that the cat saw that the dog ate that the man knew that the..." This type of construct is referred to as centre embedding and can give rise to infinite sentences. (This topic is discussed in further detail in McEnery & Wilson 1996:7-8).

The only way to account for a grammar of a language is by description of its rules - not by enumeration of its sentences. It is the syntactic rules of a language that Chomsky considers finite. These rules in turn give rise to infinite numbers of sentences.

1.5 The value of introspection

Even if language was a finite construct, would corpus methodology still be the best method of studying language? Why bother waiting for the sentences of a language to enumerate themselves, when by the process of introspection we can delve into our own minds and examine our own linguistic competence? At times intuition can save us time in searching a corpus.

Without recourse to introspective judgements, how can ungrammatical utterances be distinguished from ones that simply haven't occurred yet? If our finite corpus does not contain the sentence:

*He shines Tony books

how do we conclude that it is ungrammatical? Indeed, there may be persuasive evidence in the corpus to suggest that it is grammatical if we see sentences such as:

He gives Tony books
He lends Tony books
He owes Tony books

Introspection seems a useful and good tool for cases such as this. But early corpus linguistics denied its use.

Also, ambiguous structures can only be identified and resolved with some degree of introspective judgement. An observation of physical form only seems inadequate. Consider the sentences:

Tony and Fido sat down - he read a book of recipes.
Tony and Fido sat down - he ate a can of dog food.

It is only with introspection that this pair of ambiguous sentences can be resolved e.g. we know that Fido is the name of a dog and it was therefore Fido who ate the dog food, and Tony who read the book.

1.6 Other criticisms of corpus linguistics

Apart from Chomsky's theoretical criticisms, there were problems of practicality with corpus linguistics. Abercrombie (1963) criticisms of "pseudo-procedures" can, in the context of the pre-mass computing era, easily be applied to corpus linguistics. Can you imagine searching through an 11-million-word corpus such as that of Kading (1897) using nothing more than your eyes? The whole undertaking becomes prohibitively time consuming, not to say error-prone and expensive.

Whatever Chomsky's criticisms were, Abercrombie's observations about the nature of the pseudo-procedure were undoubtedly correct. Early corpus linguistics required data processing abilities that were simply not available at that time. The impact of the criticisms levelled at early corpus linguistics in the 1950s was immediate and profound. Corpus linguistics was largely abandoned during this period, although it never totally died.

1.7 Chomsky re-examined

Although Chomsky's criticisms did discredit corpus linguistics, they did not stop all corpus-based work. For example, in the field of phonetics, naturally observed data remained the dominant source of evidence with introspective judgements never making the impact they did on other areas of linguistic enquiry. Also, in the field of language acquisition the observation of naturally occurring evidence remained dominant. Introspective judgements are not available to the linguist/psychologist who is studying child language acquisition - try asking an eighteen-month-old child whether the word "moo-cow" is a noun or a verb! Introspective judgements are only available to us when our meta-linguistic awareness has developed, and there is no evidence that a child at the one-word stage has meta-linguistic awareness. Even Chomsky (1964) cautioned the rejection of performance data as a source of evidence for language acquisition studies.

1.8 Criticisms of introspective data

Naturally occurring data is observable and verifiable by everyone. Introspective judgements are unobservable, therefore much more difficult to verify.
Introspective data is artificial. Sampson (1992) argues that the type of sentence analysed by the introspective linguist is far away from the type of evidence we tend to see typically accruing in a corpus. By artificially manipulating the informant, we artificially manipulate the data itself.
Human beings have only the vaguest notion of the frequency of a construct or a word. Corpora are sources of quantitative information beyond compare. However, frequency-based data is not available via introspective means.

1.9 Benefits of corpus data

Leech (1992) argues that the corpus is a more powerful methodology from the point of view of the scientific method, as it is open to objective verification of results
Is language production really a poor reflection of language competence as Chomsky really argued? Labov (1969) showed that "the great majority of utterances in all contexts are grammatical". We are not saying that all sentences in a corpus are grammatically acceptable, but it seems probable that the Chomsky's (1968: 88) claim that performance data is 'degenerate' is an exaggeration (see Ingram 1989:223 for further criticisms of this view).
Quantitative data is of use to linguistics. For example, Svartvik's (1966) study of passivisation used quantitative data extracted from a corpora. Elsewhere, all successful approaches to automated part-of-speech analysis reply on quantitative data from corpora. The proof of the pudding is in the eating.
Abercrombie's observations that corpus research is time-consuming, expensive and error-prone are no longer applicable thanks to the development of powerful computers and software which is able to perform complex calculations in seconds, without error.

2. The revival of corpus linguistics

It is a common belief that corpus linguistics was abandoned entirely in the 1950s, and then adopted once more almost as suddenly in the early 1980s. This is simply untrue, and does a disservice to those linguists who continued to pioneer corpus-based work during this interregnum.

For example, Quirk (1960) planned and executed the construction of his ambitious Survey of English Usage (SEU), which he began in 1961. In the same year, Francis & Kucera began work on the now famous Brown corpus, a work which was to take almost two decades to complete. These researchers were in a minority, but they were not universally regarded as peculiar and others followed their lead. In 1975 Jan Svartvik started to build on the work of the SEU and the Brown corpus to construct the London-Lund corpus.

During this period the computer slowly started to become the mainstay of corpus linguistics. Svartvik computerised the SEU, and as a consequence produced what some, including Leech (1991), still believe to be "to this day an unmatched resource for studying spoken English".

The availability of the computerised corpus and the wider availability of institutional and private computing facilities do seem to have provided a spur to the revival of corpus linguistics.

2.1 The machine-readable corpus

The term corpus is almost synonymous with the term machine-readable corpus. Interest in the computer for the corpus linguist comes from the ability of the computer to carry out various processes, which when required of humans, ensured that they could only be described as pseudo-techniques The type of analysis that Kading waited years for can now be achieved in a few moments on a desktop computer.

2.2 Processes

Considering the marriage of machine and corpus, it seems worthwhile to consider in slightly more detail what these processes that allow the machine to aid the linguist are. The computer has the ability to search for a particular word, sequence of words, or perhaps even a part of speech in a text. So if we are interested, say, in the usages of the word however in the text, we can simply ask the machine to search for this word in the text. The computer's ability to retrieve all examples of this word, usually in context, is a further aid to the linguist.

The machine can find the relevant text and display it to the user. It can also calculate the number of occurrences of the word so that information on the frequency of the word may be gathered. We may then be interested in sorting the data in some way - for example, alphabetically on words appearing to the right or left. We may even sort the list by searching for words accruing in the immediate context of the word. We may take our initial list of examples of however presented in context (usually referred to as a Key Word In Context (KWIC) concordance) and extract from this another list, say of all the examples of however followed closely by the word we, or followed by a punctuation mark.

The processes described above are often included in a concordance program. This is the tool most often implemented in corpus linguistics to examine corpora. Whatever philosophical advantages we may eventually see in a corpus, it is the computer which allows us to exploit corpora on a large scale with speed and accuracy. See Module 2.4, Using concordance programs in the Modern Foreign Languages classroom.

3. Modern Corpus Linguistics

Contents of Section 3

3.1 Definition of a corpus
3.2 Sampling and representativeness
3.3 Finite size
3.4 Machine-readable form
3.5 A standard reference
3.6 Multilingual corpora
3.7 Text encoding and annotation
3.8 Types of annotation
3.9 Phonetic transcription
3.10 Problem-oriented tagging
3.11 Corpora in language studies
3.12 Corpora in lexical studies
3.13 Corpora and grammar
3.14 Corpora and sociolinguistics

3.1 Definition of a corpus

The concept of carrying out research on written or spoken texts is not restricted to corpus linguistics. Indeed, individual texts are often used for many kinds of literary and linguistic analysis - the stylistic analysis of a poem, or a conversation analysis of a TV talk show. However, the notion of a corpus as the basis for a form of empirical linguistics is different from the examination of single texts in several fundamental ways.

In principle, any collection of more than one text can be called a corpus, (corpus being Latin for "body", hence a corpus is any body of text). But the term "corpus" when used in the context of modern linguistics tends most frequently to have more specific connotations than this simple definition. The following list describes the four main characteristics of the modern corpus.

3.2 Sampling and representativeness

Often in linguistics we are not merely interested in an individual text or author, but a whole variety of language. In such cases we have two options for data collection:

We could analyse every single utterance in that variety - however, this option is impracticable except in a few cases, for example with a dead language which only has a few texts. Usually, however, analysing every utterance would be an unending and impossible task.
We could construct a smaller sample of that variety. This is a more realistic option.

As discussed in Section 1.4, one of Chomsky's criticisms of the corpus approach was that language is infinite - therefore, any corpus would be skewed. In other words, some utterances would be excluded because they are rare, others which are much more common might be excluded by chance, and alternatively, extremely rare utterances might also be included several times. Although nowadays modern computer technology allows us to collect much larger corpora than those that Chomsky was thinking about, his criticisms still must be taken seriously. This does not mean that we should abandon corpus linguistics, but instead try to establish ways in which a much less biased and representative corpus may be constructed.

We are therefore interested in creating a corpus which is maximally representative of the variety under examination, that is, which provides us with an as accurate a picture as possible of the tendencies of that variety, as well as their proportions. What we are looking for is a broad range of authors and genres which, when taken together, may be considered to "average out" and provide a reasonably accurate picture of the entire language population in which we are interested.

3.3 Finite size

The term "corpus" also implies a body of text of finite size, for example, one million words. This is not universally so. For example John Sinclair's Cobuild team at the University of Birmingham initiated the construction and analysis of a monitor corpus in the 1980s. Such a "collection of texts", as Sinclair's team preferred to call the Cobuild corpus, is an open-ended entity - texts are constantly being added to it, so it gets bigger and bigger. Monitor corpora are of interest to lexicographers who can trawl a stream of new texts looking for the occurrence of new words, or for changing meanings of old words. Their main advantages are:

They are not static - new texts can always be added, unlike the synchronic "snapshot" provided by finite corpora.
Their scope - they provide for a large and broad sample of language.

Their main disadvantage is:

They are not such a reliable source of quantitative data (as opposed to qualitative data) because they are constantly changing in size and are less rigorously sampled than finite corpora.

With the exception of monitor corpora, it should be noted that it is more often the case that a corpus consists of a finite number of words. Usually this figure is determined at the beginning of a corpus-building project. For example, the Brown Corpus contains 1,000,000 running words of text. Unlike the monitor corpus, when a corpus reaches its grand total of words, collection stops and the corpus is not increased in size. (An exception is the London-Lund corpus, which was increased in the mid-1970s to cover a wider variety of genres.)

3.4 Machine-readable form

Nowadays the term "corpus" nearly always implies the additional feature "machine-readable". This was not always the case as in the past the word "corpus" was only used in reference to printed text.

Today few corpora are available in book form - one which does exist in this way is "A Corpus of English Conversation" (Svartvik and Quirk 1980) which represents the "original" London-Lund corpus. Corpus data (not excluding context-free frequency lists) is occasionally available in other forms of media. For example, a complete key-word-in-context concordance of the LOB corpus is available on microfiche, and with spoken corpora copies of the actual recordings are sometimes available - this is the case with the Lancaster/IBM Spoken English Corpus but not with the London-Lund corpus.

Machine-readable corpora possess the following advantages over written or spoken formats:

They can be searched and manipulated at speed. (This is something which we covered at the end of Part One).
They can easily be enriched with extra information. (We will examine this in detail later.)

If you haven't already done so you can now read about other characteristics of the modern corpus.

3.5 A standard reference

There is often a tacit understanding that a corpus constitutes a standard reference for the language variety that it represents. This presupposes that it will be widely available to other researchers, which is indeed the case with many corpora - e.g. the Brown Corpus, the LOB corpus and the London-Lund corpus.

One advantage of a widely available corpus is that it provides a yardstick by which successive studies can be measured. So long as the methodology is made clear, new results on related topics can be directly compared with already published results without the need for re-computation.
Also, a standard corpus also means that a continuous base of data is being used. This implies that any variation between studies is less likely to be attributed to differences in the data and more to the adequacy of the assumptions and methodology contained in the study.

3.6 Multilingual corpora

Not all corpora are monolingual, and an increasing amount of work in being carried out on the building of multilingual corpora, which contain texts of several different languages.

First we must make a distinction between two types of multilingual corpora: the first can really be described as small collections of individual monolingual corpora in the sense that the same procedures and categories are used for each language, but each contains completely different texts in those several languages. For example, the Aarhus corpus of Danish, French and English contract law consists of a set of three monolingual law corpora, which is not comprised of translations of the same texts.

The second type of multilingual corpora (and the one which receives the most attention) is parallel corpora. This refers to corpora which hold the same texts in more than one language. The parallel corpus dates back to mediaeval times when "polyglot bibles" were produced which contained the biblical texts side by side in Hebrew, Latin and Greek etc.

A parallel corpus is not immediately user-friendly. For the corpus to be useful it is necessary to identify which sentences in the sub-corpora are translations of each other, and which words are translations of each other. A corpus which shows these identifications is known as an aligned corpus as it makes an explicit link between the elements which are mutual translations of each other. For example, in a corpus the sentences "Das Buch ist auf dem Tisch" and "The book is on the table" might be aligned to one another. At a further level, specific words might be aligned, e.g. "Das" with "The". This is not always a simple process, however, as often one word in one language might be equal to two words in another language, e.g. the German word "raucht" could be equivalent to "is smoking" in English.

At present there are few cases of annotated parallel corpora, and those which exist tend to be bilingual rather than multilingual. However, two EU-funded projects (CRATER and MULTEXT) are aiming to produce genuinely multilingual parallel corpora. The Canadian Hansard corpus is annotated, and contains parallel texts in French and English, but it only covers a restricted range of text types (proceedings of the Canadian Parliament). However, this is an area of growth, and the situation is likely to change dramatically in the near future.

3.7 Text encoding and annotation

If corpora is said to be unannotated it appears in its existing raw state of plain text, whereas annotated corpora has been enhanced with various types of linguistic information. Unsurprisingly, the utility of the corpus is increased when it has been annotated, making it no longer a body of text where linguistic information is implicitly present, but one which may be considered a repository of linguistic information. The implicit information has been made explicit through the process of concrete annotation.

For example, the form "gives" contains the implicit part-of-speech information "third person singular present tense verb" but it is only retrieved in normal reading by recourse to our pre-existing knowledge of the grammar of English. However, in an annotated corpus the form "gives" might appear as "gives_VVZ", with the code VVZ indicating that it is a third person singular present tense (Z) form of a lexical verb (VV). Such annotation makes it quicker and easier to retrieve and analyse information about the language contained in the corpus.

3.8 Types of annotation

Certain kinds of linguistic annotation, which involve the attachment of special codes to words in order to indicate particular features, are often known as "tagging" rather than annotation, and the codes which are assigned to features are known as "tags". These terms will be used in the sections which follow.

This is the most basic type of linguistic corpus annotation - the aim being to assign to each lexical unit in the text a code indicating its part of speech. Part-of-speech annotation is useful because it increases the specificity of data retrieval from corpora, and also forms an essential foundation for further forms of analysis (such as syntactic parsing and semantic field annotation). Part-of-speech annotation also allows us to distinguish between homographs.

Part-of-speech annotation was one of the first types of annotation to be formed on corpora and is the most common today. One reason for this is because it is a task that can be carried out to a high degree of accuracy by a computer. Greene & Rubin (1971) achieved a 71% accuracy rate of correctly tagged words with their early part-of-speech tagging program (TAGGIT). In the early 1980s the UCREL team at Lancaster University reported a success rate of 95% using their program CLAWS.

3.9 Phonetic transcription

Spoken language corpora can also be transcribed using a form of phonetic transcription. Not many examples of publicly available phonetically transcribed corpora exist at the time of writing. This is possibly because phonetic transcription is a form of annotation which needs to be carried out by humans rather than computers. Such humans have to be well skilled in the perception and transcription of speech sounds. Phonetic transcription is therefore a very time consuming task.

Another problem is that phonetic transcription works on the assumption that the speech signal can be divided into single, clearly demarcated "sounds", while in fact, these "sounds" do not have such clear boundaries, therefore what phonetic transcription takes to be the same sound, might be different according to context.

Nevertheless, phonetically transcribed corpora is extremely useful to the linguist who lacks the technological tools and expertise for the laboratory analysis of recorded speech. One such example is the MARSEC corpus (which is derived from the Lancaster/IBM Spoken English Corpus) and has been manipulated by the Universities of Lancaster and Leeds. The MARSEC corpus will include a phonetic transcription.

3.10 Problem-oriented tagging

Problem-oriented tagging, as described by de Haan (1984), is the phenomenon whereby users will take a corpus, either already annotated, or unannotated, and add to it their own form of annotation, oriented particularly towards their own research goal. This differs in two ways from the other types of annotation we have examined in this section.

It is not exhaustive. Not every word (or sentence) is tagged - only those which are directly relevant to the research. This is something which problem-oriented tagging has in common with anaphoric annotation.
Annotation schemes are selected, not for broad coverage and theory-neutrality, but for the relevance of the distinctions which it makes to the specific questions that the researcher wishes to ask of his/her data.

Although it is difficult to generalise further about this form of corpus annotation, it is an important type to keep in mind in the context of practical research using corpora.

3.11 Corpora in language studies

In this section we will examine a few of the roles which corpora may play in the study of language. The importance of corpora to language study is aligned to the importance of empirical data. Empirical data enable the linguist to make objective statements, rather than those which are subjective, or based upon the individual's own internalised cognitive perception of language. Empirical data also allows us to study language varieties such as dialects or earlier periods in a language for which it is not possible to carry out a rationalist approach.

It is important to note that although many linguists may use the term "corpus" to refer to any collection of texts, when it is used here it refers to a body of text which is carefully sampled to be maximally representative of the language or language variety. Corpus linguistics, proper, should be seen as a subset of the activity within an empirical approach to linguistics. Although corpus linguistics entails an empirical approach, empirical linguistics does not always entail the use of a corpus.

In the following pages we'll consider the roles which corpora use may play in a number of different fields of study related to language. We will focus on the conceptual issues of why corpus data are important to these areas, and how they can contribute to the advancement of knowledge in each, providing real examples of corpus use.

3.12 Corpora in lexical studies

Empirical data has been used in lexicography long before the discipline of corpus linguistics was invented. Samuel Johnson, for example, illustrated his dictionary with examples from literature, and in the 19th Century the Oxford Dictionary used citation slips to study and illustrate word usage. Corpora, however, have changed the way in which linguists can look at language.

A linguist who has access to a corpus, or other (non-representative) collection of machine readable text can call up all the examples of a word or phrase from many millions of words of text in a few seconds. Dictionaries can be produced and revised much more quickly than before, thus providing up-to-date information about language. Also, definitions can be more complete and precise since a larger number of natural examples are examined.

Examples extracted from corpora can be easily organised into more meaningful groups for analysis. For example, by sorting the right-hand context of the word alphabetically so that it is possible to see all instances of a particular collocate together. Furthermore, because corpus data contains a rich amount of textual information - regional variety, author, date, genre, part-of-speech tags etc. it is easier to tie down usages of particular words or phrases as being typical of particular regional varieties, genres and so on.

The open-ended (constantly growing) monitor corpus has its greatest role in dictionary building as it enables lexicographers to keep on top of new words entering the language, or existing words changing their meanings, or the balance of their use according to genre etc. However, finite corpora also have an important role in lexical studies - in the area of quantification. It is possible to rapidly produce reliable frequency counts and to subdivide these areas across various dimensions according to the varieties of language in which a word is used.

Finally, the ability to call up word combinations rather than individual words, and the existence of mutual information tools which establish relationships between co-occurring words mean that we can treat phrases and collocations more systematically than was previously possible. A phraseological unit may constitute a piece of technical terminology or an idiom, and collocations are important clues to specific word senses.

3.13 Corpora and grammar

Grammatical (or syntactic) studies have, along with lexical studies, been the most frequent types of research which have used corpora. Corpora are a useful tool for syntactical research because of:

The potential for the representative quantification of a whole language variety.
Their role as empirical data for the testing of hypotheses derived from grammatical theory.

Many smaller-scale studies of grammar using corpora have included quantitative data analysis (for example, Schmied's 1993 study of relative clauses). There is now a greater interest in the more systematic study of grammatical frequency - for example, Oostdijk & de Haan (1994) are aiming to analyse the frequency of the various English clause types.

Since the 1950s the rational-theory based/empiricist-descriptive division in linguistics (see Section 1) has often meant that these two approaches have been viewed as separate and in competition with each other. However, there is a group of researchers who have used corpora in order to test essentially rationalist grammatical theory, rather than use it for pure description or the inductive generation of theory.

At Nijmegen University, for instance, primarily rationalist formal grammars are tested on real-life language found in computer corpora (Aarts 1991). The formal grammar is first devised by reference to introspective techniques and to existing accounts of the grammar of the language. The grammar is then loaded into a computer parser and is run over a corpus to test how far it accounts for the data in the corpus: see Section 5, Module 3.5, headed Parsing and tagging. The grammar is then modified to take account of those analyses which it missed or got wrong.

3.14 Corpora and sociolinguistics

Although sociolinguistics is an empirical field of research it has hitherto relied primarily upon the collection of research-specific data which is often not intended for quantitative study and is thus not often rigorously sampled. Sometimes the data are also elicited rather than naturalistic data. A corpus can provide what these kinds of data cannot provide - a representative sample of naturalistic data which can be quantified. Although corpora have not as yet been used to a great extent in sociolinguistics, there is evidence that this is a growing field.

The majority of studies in this area have concerned themselves with lexical studies in the area of language and gender. Kjellmer (1986), for example, used the Brown and LOB corpora to examine the masculine bias in American and British English. He looked at the occurrence of masculine and feminine pronouns, and at the occurrence of the items man/men and woman/women. As one would expect, the frequencies of the female items were much lower than the male items in both corpora. Interestingly, however, the female items were more common in British English than in American English. Another hypothesis of Kjellmer's was not supported in the corpora - that woman would be less "active", that is would be more frequently the objects rather than the subjects of verbs. In fact men and women had similar subject/object ratios.

Holmes (1994) makes two important points about the methodology of these kinds of study, which are worth bearing in mind. First, when classifying and counting occurrences the context of the lexical item should be considered. For instance, whilst there is a non-gender marked alternative for policeman/policewoman, namely police officer, there is no such alternative for the -ess form in Duchess of York. The latter form should therefore be excluded from counts of "sexist" suffixes when looking at gender bias in writing. Second, Holmes points out the difficulty of classifying a form when it is actively undergoing semantic change. She argues that the word man can refer both to a single male (such as in the phrase A 35 year old man was killed, or can have a generic meaning which refers to mankind (such as Man has engaged in warfare for centuries. In phrases such as we need the right man for the job it is difficult to decide whether man is gender specific or could be replaced by person. These simple points should incite a more critical approach to data classification in further sociolinguistic work using corpora, both within and without the area of gender studies.

4. Conclusion

In this section we have seen how language study has benefited from exploiting corpus data. To summarise, the main important advantages of corpora are:

Sampling and quantification: Because a corpus is sampled to maximally represent the population, any findings taken from the corpus can be generalised to the larger population. Hence quantification in corpus linguistics is more meaningful than other forms of linguistic quantification because it can tell us about a variety of language, not just that which is being analysed.
Ease of access: As all of the data collection has been dealt with by someone else, the researcher does not have to go through the issues of sampling, collection and encoding. The majority of corpora are readily available, either free or at low-cost price. Once the corpora have been obtained, it is usually easy to access the data within it, e.g. by using a concordance program.
Enriched data: Many corpora have already been enriched with additional linguistic information such as part-of-speech annotation, parsing and prosodic transcription. Hence data retrieval from annotated corpora can be easier and more specific than with unannotated data.
Naturalistic data: Corpus data is not always completely unmonitored in the sense that the people producing the spoken or written texts are unaware until after the fact that they are being asked to participate in the building of a corpus. But for the most part, the data are largely naturalistic, unmonitored and the product of real social contexts. Thus the corpus provides one of the most reliable sources of naturally occurring data that can be examined.

5. Discussion topics

How might intuition-based and corpus-based linguistics combine usefully to inform the creation of a CALL package?
To what extent can we ever view corpus annotation as objectively 'correct'? If we cannot view it in such a way, what does this mean for CALL packages which rely on corpus annotation?

6. Learning tasks

Imagine that you need to design an LSP corpus as part of the implementation of an LSP-oriented CALL program. Chose an area for the task and try to find appropriate texts from the web in order to populate such a corpus.
Having gathered texts, what further processing might you be able to carry out on those texts? Use a web browser to try to find sites on the web where you could carry out automated part-of-speech tagging.
Imagine that you would now like to contrast your LSP corpus with a corpus of general English. Using a Web browser find what corpora of general English are available. What is the relative balance of British English corpora available in comparison to corpora of other varities of English?

Bibliography and references

Aarts J. (1991) "Intuition-based and observation-based grammars". In Aijmer K. & Altenberg B. (eds.) English corpus linguistics. Studies in honour of Jan Svartvik, London: Longman: 44-62.

Aarts J. & Meijs W. (eds.) (1986) Corpus Linguistics II, Amsterdam: Rodopi.

Abercrombie D. (1963) Studies in phonetics and linguistics, London: Oxford University Press.

Baker P., Hardie A. & McEnery T. (2006) A glossary of corpus linguistics, Edinburgh: Edinburgh University Press.

Aijmer K. & Altenberg B. (eds.) (1991) English corpus linguistics. Studies in honour of Jan Svartvik, London: Longman.

Beale A. (1987) "Towards a distributional lexicon". In Garside R., Leech G. & Sampson G. (eds.) The computational analysis of English: a corpus based approach. London: Longman.

Bernardini S. (2000) "Systematising serendipity: proposals for concordancing large corpora with language learners". In Burnard L. & McEnery T. (eds.) Rethinking language pedagogy from a corpus perspective: papers from the Third International Conference on Teaching and Language Corpora, Frankfurt am Main: Peter Lang: 225-34.

Bernardini, S. (2002) "Exploring new directions for discovery learning". In Kettemann B. & Marko G. (eds.) Teaching and learning by doing corpus analysis, Amsterdam and New York: Rodopi: 165-82.

Biber D. (1993) "Representativeness in corpus design", Literary and Linguistic Computing 8, 4: 243-57.

Bloom L. (1970) Language development: form and function in emerging grammars, Cambridge, MA: MIT Press.

Boas F. (1940) Race, language and culture, New York: Macmillan.

Bongers H. (1947) The history and principles of vocabulary control, Worden: Wocopi.

Braun S. (2005) "From pedagogically relevant corpora to authentic language learning contents", ReCALL 17, 1: 47-64.

Braun S. (2007) "Integrating corpus work into secondary education: from data-driven learning to needs-driven corpora", ReCALL 19, 3: 307-328.

Brown R. (1973) A first language: the early stages, Cambridge, MA: Harvard University Press.

Chambers A. (2005) "Integrating corpus consultation in language studies", Language Learning and Technology 9, 2: 111-125.

Chambers A., Farr F. & O'Riordan S. (2011) "Language teachers with corpora in mind: from starting steps to walking tall", Language Learning Journal 39, 1: 85-103.

Chomsky N. (1964) "Formal Discussion". In Bellugi U. & Brown R. (eds.) The acquisition of language. Monographs of the Society for Research in Child Development 29: 37-39.

Chomsky N. (1965) Aspects of the theory of syntax, Cambridge, MA: MIT Press.

Chomsky N. (1968) Language and mind, New York: Harcourt Brace.

Collins Cobuild English Language Dictionary (1987) ed. John Sinclair, London: Collins.

Collins Cobuild English Grammar (1990) ed. John Sinclair, London: Collins.

de Haan P. (1984) "Problem-oriented tagging of English corpus data". In Aarts J. & Meijs W. (eds.) Corpus linguistics, Amsterdam: Rodopi.

Farr F. (2007) "Spoken language as an aid to reflective practice in language teacher education: using a specialised corpus to establish a generic fingerprint". In Campoy M.-C. & Luzon M.-J. (eds.) Spoken corpora in applied linguistics, Bern: Peter Lang: 235-258.

Farr F. (2008) "Evaluating the use of corpus-based instruction in a language teacher education context: perspectives from the users", Language Awareness 17, 1: 25-43.

Farr F., Chambers A. & O'Riordan S. (2010) "Corpora for materials development in language teacher education: principles for development and useful data". In Mishan F. & Chambers A. (eds.) Perspectives on language learning materials development, Oxford: Peter Lang: 33-61.

Farr F. & Murphy B. (2009) "Religious references in contemporary Irish-English: 'for the love of God almighty... I'm a holy terror for turf", Journal of Intercultural Pragmatics 6, 3: 535-560.

Farr F., B. Murphy & A. O'Keeffe. (2004) "The Limerick corpus of Irish English: design, description and application". In Farr F. & O'Keeffe A. (eds.) Corpora, Varieties and the Language Classroom, Special Edition of Teanga 21, Dublin: IRAAL: 5-29.

Fligelstone S. (1993) "Some reflections on the question of teaching, from a corpus linguistics perspective", ICAME Journal 17: 97-109.

Fries C. & Traver A. (1940) English word lists: a study of their adaptability and instruction, Washington, DC: American Council of Education.

Gaskell D. & T. Cobb. (2004) "Can learners use concordance feedback for writing errors?" System 32, 3: 301-319.

Gavioli L. & Aston. G. (2001) "Enriching reality: language corpora in language pedagogy", English Language Teaching Journal 55, 3: 238-246.

Granger S. (2002) "A bird's-eye view of learner corpus research". In Granger S., Hung J. & Petch-Tyson S. (eds.) Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins Publishing Company: 3-33.

Greene B. & Rubin G. (1971) Automatic grammatical tagging of English, Technical Report, Department of Linguistics, Brown University, RI.

Halliday M. & Hasan R. (1976) Cohesion in English, London: Longman.

Harris Z. (1951) Methods in structural linguistics, Chicago: University of Chicago Press.

Hockett C. (1948) "A note on structure", International Journal of American Linguistics 14: 269-71.

Hockey S. (1980 A guide to computer applications in the humanities, London: Duckworth.

Holmes J. (1994) "Inferring language change from computer corpora: some methodological problems", ICAME Journal 18: 27-40.

Hundt M., N. Nesselhauf & C. Biewer (2007) "Corpus linguistics and the Web". In Hundt M., Nesselhauf N. & Biewer C. (eds.) Corpus linguistics and the Web, Amsterdam: Rodopi: 1-6.

Hunston S. (2002) Corpora in applied linguistics, Cambridge: Cambridge University Press.

Ingram D. (1978) "Sensori-motor development and language acquisition". In Lock A (ed.) Action, gesture and symbol: the emergence of language, London: Academic Press.

Ingram D. (1989) First language acquisition, Cambride: Cambridge University Press.

Johansson S. (1991) "Times change and so do corpora". In Aijmer K. & Altenburg B. (eds.) English corpus linguistics: studies in honour of Jan Svartvik, London: Longman.

Kading J. (1879) Häufigkeitswörterbuch der deutschen Sprache, Steglitz: privately published.

Karlsson F., Voutilainen A., Heikkil� J. & Anttila A. (eds.) (1995) Constraint grammar: a language-independent system for parsing unrestricted text, Berlin: Mouton de Gruyter.

Kennedy C. & T. Miceli (2001) "An evaluation of intermediate students' approaches to corpus investigation", Language Learning and Technology 5, 3: 77-90.

Kennedy G. (1992) "Preferred ways of putting things". In Svartvik J. (ed) Directions in corpus linguistics, Berlin: Mouton de Gruyter.

Kjellmer G. (1986) "The lesser man: observations on the role of women in modern English writings". In Aarts J. & Meijs W. (eds.) Corpus Linguistics II: 163-76.

Labov V. (1969) "The logic of non-standard English", Georgetown Monographs on Language and Linguistics 22.

Leech G. (1991) "The state of the art in corpus linguistics". In Aijmer K. & Altenberg B. (eds.) English corpus linguistics: studies in honour of Jan Svartvik, London: Longman.

Leech G. (1992) "Corpora and theories of linguistic performance". In Svartvik J. (ed.) Directions in corpus linguistics, Berlin: Mouton de Gruyter.

Leech G. (1993) "Corpus annotation schemes", Literary and Linguistic Computing 8, 4: 275-81.

McEnery T. & Wilson A. (1996) Corpus linguistics, Edinburgh: Edinburgh University Press.

McEnery T. & Wilson A. (1997) "Teaching and language corpora", ReCALL 9, 1: 5-14.

McEnery T., Xiao R. & Tone Y. (2006) Corpus-based language studies. An advanced resource book, London and New York: Routledge.

Mauranen A. (2003) "The corpus of English as a lingua franca in academic setting", TESOL Quarterly 37, 3: 513-527.

O'Connor J. & Arnold G. (1961) Intonation of colloquial English, London: Longman.

O'Keeffe A. & Farr F. (2003) "Using language corpora in language teacher education: pedagogic, linguistic and cultural insights", TESOL Quarterly 37, 3: 389-418.

O'Keeffe A., McCarthy M. & Carter R. (2007) From corpus to classroom, Cambridge: Cambridge University Press.

Oostdijk N. & de Haan P. (1994) "Clause patterns in Modern British English. A corpus-based (quantitative) study", ICAME Journal 18. 41-80.

O'Sullivan I. & A. Chambers (2006) "Learners' writing skills in French: corpus consultation and learner evaluation", Journal of Second Language Writing 15: 49-68.

Palmer H. (1933) Second interim report on English collocations, Tokyo: Institute for Research in English Teaching.

Quirk R. (1960) "Towards a description of English usage", Transactions of the Philological Society: 40-61.

Renouf A., Kehoe A. & Banerjee J. (2007) "WebCorp: an integrated system for Web text search". In Hundt M., Nesselhauf N. & Biewer C. (eds.) Corpus linguistics and the Web, Amsterdam: Rodopi: 47-68.

R�mer U. (2006) "Pedagogical applications of corpora: some reflections on the current scope and a wish list for future developments", Zeitschrift f�r Anglistik und Amerikanistik 54, 2: 121-134.

Sampson G. (1992) "Probablistic parsing". In Svartvik, J. (ed.) Directions in corpus linguistics, Berlin: Mouton de Gruyter.

Schmidt K. M. (1993) Begriffsglossar und Index zu Ulrichs von Zatzikhoven Lanzelet, T�bingen: Niemeyer.

Schmied J. (1993) "Qualitative and quantitative research approaches to English relative constructions". In Souter C. & Atwell E. (eds.): 85-96.

Sedelow S & Sedelow W. (1969) "Categories and procedures for content analysis in the humanities". In Gerbner G., Holsti O. R., Krippendorff K., Paisley W.J. & Stone P.J. (eds.) The analysis of communication content, New York: John Wiley.

Seidlhofer B. (2002) "Pedagogy and local learner corpora. Working with learning-driven data". In Granger S., Hung J. & Petch-Tyson S. (eds.) Computer learner corpora, second language acquisition and foreign language teaching, Amsterdam: John Benjamins Publishing Company: 231-234.

Seidlhofer B. (2004) "In search of 'European English': or why the corpus can't tell us what to teach", Paper presented at EUROCALL 2004.

Sinclair J. (1991) Corpus, concordance, collocation, London: Longman.

Sinclair J. (1995) "Corpus typology - a framework for classification". In Melchers G. & Warren B. (eds.) Studies in Anglistics, Stockholm: Almqvist and Wiksell International: 17-46.

Sinclair J. (ed.) (2004) How to use corpora in language teaching, Amsterdam: John Benjamins Publishing Company.

Souter C. (1993) "Towards a standard format for parsed corpora". In Aarts J., de Haan P. & Oostdijk N. (eds.) English language corpora: design, analysis and exploitation, Amsterdam: Rodopi.

Souter C. & Atwell E. (eds.) (1993) Corpus-based computational linguistics, Amsterdam: Rodopi.

Sperberg-McQueen C.M. & Burnard L. (1994) Guidelines for electronic text encoding and interchange (P3), Chicago and Oxford: Text Encoding Initiative.

Stenstr�m A-B. (1984) "Discourse tags". In Aarts J. & Meijs W. (eds.) Corpus linguistics, Amsterdam: Rodopi.

Svartvik J. (1966) On voice in the English verb, The Hague: Mouton.

Svartvik J. & Quirk R. (1980) A corpus of English conversation, Lund: C.W.K. Gleerup.

Thorndike E. (1921) A teacher's wordbook, New York: Columbia Teachers College.

Tribble C. (1997) "Improvising corpora for ELT: quick-and-dirty ways of developing corpora for language teaching". In Melia J. & Lewandowska-Tomaszczyk B. (eds.) PALC 97 Proceedings, Lodz: Lodz University Press. Paper presented at the Practical Applications in Language Corpora (PALC 97) conference, University of Lodz, Poland: http://www.ctribble.co.uk/text/Palc.htm

Tribble C. & Barlow M. (eds.) (2001) Language Learning & Technology 5, 3, Special Issue on Using corpora in language teaching and learning: http://llt.msu.edu/vol5num3/default.html

Wichmann A., Fligelstone S., Knowles G. & McEnery T. (eds.) (1997) Teaching and language corpora, Harlow: Pearson Education.

Wynne M. (2005) Developing linguistic corpora: a guide to good practice. Oxford: Arts and Humanities Data Service. Available online at http://www.ahds.ac.uk/creating/guides/linguistic-corpora/

Yoon H. & Hirvela A. (2004) "ESC student attitudes towards corpus use in L2 writing", Journal of Second Language Writing 13: 257-283.

Websites

See also the Websites listed in Module 2.4, Using concordance programs in the Modern Foreign Languages classroom.

American English: See (COCA) Corpus of Contemporary American English, Brigham Young University. See also American English: Google Books, a new interface for Google Books allows you to search more than 155 billion words in more than 1.3 million books of American English from 1810-2009 (including 62 billion words from 1980-2009). Although this "corpus" is based on Google Books data, it is not an official product of Google or Google Books, but rather it was created by Mark Davies, Professor of Linguistics at Brigham Young University (BYU), and it is related to other large corpora at BYU.

Bookmarks for Corpus-based Linguistics: Compiled by David Lee: http://tiny.cc/corpora

British National Corpus (BNC): A large corpus (100 million words) of modern British English. See also the BYU-BNC website (Brigham Young University).

ICAME (International Computer Archive of Modern and Medieval English): English-language archive, produced and maintained by the Norwegian Computing Centre for the Humanities (NCCH) in Bergen: http://icame.uib.no/

COCA (Corpus of Contemporary American English): Maintained by Brigham Young University, USA. See also: Word and Phrase, the top 60,000 words based on COCA.

Collins Cobuild Bank of English: A project initiated by John Sinclair at the University of Birmingham in the 1980s that led to the publication of a series of dictionaries and reference books. Search for Cobuild at the HarperCollins website for the range of publications that are available both in book and in digital format. See also Wordbanks Online.

Corpora4Learning.net: Links and references for the use of corpora, corpus linguistics and corpus analysis in the context of language learning and teaching. Created by Sabine Braun, University of Surrey. Many useful bibliographical references and links to websites but some links are now dead as the site has not been updated since 2006: http://www.corpora4learning.net/

CorpusCALL: A EUROCALL Special Interest Group: http://www.eurocall-languages.org/sigs/corpuscall.html

Google Books: American English: This new interface for Google Books allows you to search more than 155 billion words in more than 1.3 million books of American English from 1810-2009 (including 62 billion words from 1980-2009). Although this "corpus" is based on Google Books data, it is not an official product of Google or Google Books, but rather it was created by Mark Davies, Professor of Linguistics at Brigham Young University (BYU), and it is related to other large corpora that BYU has staff have created.

University of Lancaster: University Centre for Computer Corpus Research on Language (UCREL). Many useful resources and links: http://ucrel.lancs.ac.uk/

Logos Library: A massive database of searchable texts in a wide range of languages, containing multilingual novels, technical literature and translated texts: http://www.logoslibrary.eu/

University of Louvain, Centre for English Corpus Linguistics: A comprehensive list of publications on learner corpora: http://www.uclouvain.be/en-cecl.html

Scholars' Lab, University of Virginia: A large collection of texts in a variety of languages: http://www2.lib.virginia.edu/scholarslab/

University of Southern Denmark VISL website: Links to corpora in different languages: http://visl.sdu.dk/corpus_linguistics.html

Text Corpora and Corpus Linguistics site: Many useful links and sources of information: http://www.athel.com/corpus.html

Virtual Linguistics Campus, University of Marburg: Includes a virtual lecture hall where the student can attend linguistics courses, a linguistics lab, chat rooms, message boards, etc. Some sections are only accessible if you have registered, but there are a lot of materials that are open to all: http://linguistics.online.uni-marburg.de

Wordbanks Online is an online corpus that evolved out of the Collins Cobuild Bank of English corpus that forms the basis of the Collins range of dictionaries and reference books. Wordbanks Online is available by subscription, but there is also a trial version online.

Feedback

If you wish to send us feedback on any aspect of the ICT4LT website, use our online Feedback Form or visit the ICT4LT blog at:
http://ictforlanguageteachers.blogspot.com

Document last updated 27 March 2012. This page is maintained by Graham Davies.

Please cite this Web page as:
McEnery T. & Wilson A. (2012) Corpus linguistics. Module 3.4 in Davies G. (ed.) Information and Communications Technology for Language Teachers (ICT4LT), Slough, Thames Valley University [Online]. Available at: http://www.ict4lt.org/en/en_mod3-4.htm [Accessed DD Month YYYY].

© Sarah Davies in association with MDM creative. This work is licensed under a
Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License.