You can now license our extensive corpus holdings. Collins has one of the largest English-language corpora, at 4.5 billion words. Our monitor corpus is updated daily from a diverse range of newspapers, leading magazines, websites, books, TV and radio. We also hold continuously updated corpora for French, German and Spanish.
We are actively seeking to increase the range of subcorpora that we hold, and are eager to hear from you if you are interested in specific corpora.
All of our corpora are enriched with information such as:
- Region: Britain, US, Australia, India, South Africa, New Zealand, Canada etc.
- Domain: medicine, business, computing, science, religion, news, sport etc.
- Textform: book, magazine, newspaper, transcribed spoken etc.
- Date: the content ranges from the late 1980s to the present day
This rich metadata enables us to carry out complex linguistic analysis to establish, for example, which words are used in US English but not in UK English, and which terms are used only in very informal situations. We also track language change over time to identify and record emerging neologisms and shifts in word patterns or usage.
The raw corpus content is held as XML. It can also be delivered with part-of-speech tagging and lemmatization.
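As a rough illustration of how metadata-enriched XML of this kind might be queried, the sketch below parses a small invented sample with Python's standard library and lists the lemmas that occur in the US documents but not in the British ones. The element and attribute names (`doc`, `w`, `region`, `domain`, `textform`, `date`, `pos`, `lemma`) are assumptions made for this example only and do not describe the actual delivery format.

```python
import xml.etree.ElementTree as ET

# Invented sample: element and attribute names are hypothetical,
# not the actual Collins schema.
SAMPLE = """
<corpus>
  <doc region="US" domain="news" textform="newspaper" date="2015-03-01">
    <w pos="NOUN" lemma="sidewalk">sidewalks</w>
    <w pos="VERB" lemma="be">are</w>
    <w pos="ADJ" lemma="busy">busy</w>
  </doc>
  <doc region="Britain" domain="news" textform="newspaper" date="2015-03-02">
    <w pos="NOUN" lemma="pavement">pavements</w>
    <w pos="VERB" lemma="be">are</w>
    <w pos="ADJ" lemma="busy">busy</w>
  </doc>
</corpus>
"""

def lemmas_by_region(xml_text):
    """Map each region value to the set of lemmas found in its documents."""
    root = ET.fromstring(xml_text)
    regions = {}
    for doc in root.iter("doc"):
        lemmas = regions.setdefault(doc.get("region"), set())
        for word in doc.iter("w"):
            lemmas.add(word.get("lemma"))
    return regions

regions = lemmas_by_region(SAMPLE)

# Set difference: lemmas seen in US documents but never in British ones.
us_only = sorted(regions["US"] - regions["Britain"])
print(us_only)
```

On a real corpus, raw presence or absence would be too crude a test; comparing relative frequencies per region would likely give a more reliable picture.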
Drawing on our vast corpus resources and corpus analysis expertise, we also offer a number of language services, such as:
- New words research
- Bespoke corpus building projects
- Creation of word and n-gram frequency lists for particular corpora and subcorpora
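To give a sense of what a word or n-gram frequency list involves, here is a minimal sketch in Python. It assumes the text has already been split into tokens (simple whitespace splitting is used here purely for illustration), and the function name `ngram_frequencies` is invented for this example.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens."""
    # zip over n staggered copies of the token list yields each n-gram as a tuple.
    ngrams = zip(*(tokens[i:] for i in range(n)))
    return Counter(ngrams)

# Toy input; whitespace tokenization is an assumption for this sketch.
tokens = "the cat sat on the mat and the cat slept".split()

unigrams = ngram_frequencies(tokens, 1)
bigrams = ngram_frequencies(tokens, 2)

# Print the most frequent bigrams with their counts.
for ngram, count in bigrams.most_common(3):
    print(" ".join(ngram), count)
```

The same function could be run separately over the tokens of each subcorpus, for instance one region or one domain, to produce comparable lists.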