You can now license our extensive corpus holdings. Collins has one of the largest English-language corpora, at 4.5 billion words, and our monitor corpus is updated daily with material from a diverse range of newspapers, leading magazines, websites, books, TV and radio. We also hold continuously updated corpora for French, German and Spanish.

We are actively seeking to increase the range of subcorpora we hold and are eager to hear from you if you are interested in specific corpora.

All of our corpora are enriched with information such as:

  • Region: Britain, US, Australia, India, South Africa, New Zealand, Canada etc.
  • Domain: medicine, business, computing, science, religion, news, sport etc.
  • Textform: book, magazine, newspaper, transcribed spoken etc.
  • Date: the content ranges from the late 1980s to the present day

This rich metadata enables us to carry out complex linguistic analysis to establish, for example, which words are used in US English but not in UK English, and which terms are used only in very informal situations. We also track language change over time to identify and record emerging neologisms and shifts in word patterns or usage.
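As a rough illustration of the kind of region-based comparison described above, the sketch below counts words per region and lists those that appear in one region but not the other. The sample sentences, the function names and the `min_count` threshold are all invented for illustration and do not reflect Collins's actual data or tooling.

```python
from collections import Counter

# Hypothetical sample: (region, text) pairs standing in for corpus documents.
docs = [
    ("US", "the truck stopped on the sidewalk near the store"),
    ("UK", "the lorry stopped on the pavement near the shop"),
    ("US", "she parked the truck by the store"),
    ("UK", "he parked the lorry by the shop"),
]

def counts_by_region(documents):
    """Build one word-frequency Counter per region label."""
    counts = {}
    for region, text in documents:
        counts.setdefault(region, Counter()).update(text.lower().split())
    return counts

def only_in(counts, region, other, min_count=2):
    """Words seen at least min_count times in `region` and never in `other`."""
    return sorted(
        word
        for word, n in counts[region].items()
        if n >= min_count and counts[other][word] == 0
    )

region_counts = counts_by_region(docs)
us_only = only_in(region_counts, "US", "UK")
uk_only = only_in(region_counts, "UK", "US")
print(us_only, uk_only)
```

On a real corpus the threshold would need to be far higher, and relative frequencies would be more informative than a strict presence/absence test.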

Data format

The raw corpus content is held as XML. It can also be delivered with part-of-speech tagging and lemmatization.
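The exact XML schema is not described here, so the following is only a sketch of how tagged and lemmatized XML of this general kind might be read. The element names (`doc`, `s`, `w`) and attribute names (`region`, `domain`, `textform`, `date`, `pos`, `lemma`) are assumptions made for the example, not the delivered format.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are assumed, not the real schema.
sample = """<doc region="UK" domain="news" textform="newspaper" date="2015-03-02">
  <s>
    <w pos="DT" lemma="the">The</w>
    <w pos="NNS" lemma="child">children</w>
    <w pos="VBD" lemma="run">ran</w>
  </s>
</doc>"""

root = ET.fromstring(sample)

# Document-level metadata is read from the root element's attributes.
metadata = dict(root.attrib)

# Each token becomes a (word form, part-of-speech tag, lemma) triple.
tokens = [(w.text, w.get("pos"), w.get("lemma")) for w in root.iter("w")]

print(metadata["region"], tokens)
```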

Corpus-based services

With our vast corpus resources and corpus analysis expertise, we also offer a number of language services such as:

  • New words research
  • Bespoke corpus building projects
  • Creation of word and n-gram frequency lists for particular corpora and subcorpora
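To give a sense of what a word or n-gram frequency list involves, here is a minimal sketch using only the Python standard library. The sample sentence and the function name `ngram_frequencies` are illustrative assumptions. Production lists would be built over tokenized corpus text, not a whitespace split.

```python
from collections import Counter

def ngram_frequencies(tokens, n):
    """Count every run of n consecutive tokens in the sequence."""
    # zip over n staggered copies of the token list yields the n-grams as tuples.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams)

# Hypothetical sample text, split on whitespace for simplicity.
tokens = "the cat sat on the mat and the cat slept".split()

unigrams = ngram_frequencies(tokens, 1)
bigrams = ngram_frequencies(tokens, 2)
top_bigrams = bigrams.most_common(1)

print(unigrams[("the",)], top_bigrams)
```

The same function can be applied to any subcorpus by first filtering documents on their metadata.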

You can access a subset of our corpus via our WordBanks service. Contact us to find out more.
