Research

Academic research

As a publishing house committed to innovation and improvement in language technology, we have special terms for academic use. If you would like to license Collins material for academic research within a university research group, please contact us with full details of your affiliation, the nature of the research, what the data will be used for and how many people will require access to it. An annual fee will apply and you will be required to sign an agreement regarding usage.

Commercial research

We work very closely with language professionals and recognise the value of corpora to the language technology community. This knowledge has led us to develop a unique resource: the Collins Corpus for Linguistic Research, dating from 1985 to the present day and containing over 1 billion words of written and spoken English. This ground-breaking initiative opens up exciting new opportunities in language analysis for commercial language research.

Best spread of sources

Other research corpora, such as those available from the Linguistic Data Consortium and Reuters, are often restricted to newswire data or to smaller, dated collections of one text type. The Collins Corpus for Linguistic Research offers an unparalleled range, quality and quantity of varied texts:

  • Top-quality British broadsheet journalism across all subject ranges
  • Best-selling British tabloid journalism
  • 20 million words of British unscripted speech
  • 4.7 million words of British ephemera (leaflets, reports, brochures etc) and 3.6 million words of US ephemera

Unparalleled size and date range

The size and contemporary timespan of the corpus (most material originates after 1990) give statistical analyses greater validity in modern-day research. The Collins Corpus for Linguistic Research provides objective evidence about the English which most people read, write, speak and hear every day of their lives.

Syntactic tagging

All our datasets have been linguistically annotated and part-of-speech-tagged by our in-house lexicographers. This allows powerful grammatical analyses to be undertaken over more than a billion words of modern English.

Dynamic and evolving

It's growing. Every year we add more components to reflect the evolving nature of English. Static corpora such as the BNC, by contrast, date rapidly and capture only a snapshot of English from several years ago. Every addition comes in the same stable format, so you can use it to update your collection.

A Customizable Corpus

We appreciate that researchers have particular needs, and we aim to serve each client by tailoring the corpus accordingly. If you require a particular mix of sources or dates, we will try to accommodate you.

News categorization by subject field

Text-typing of news stories by subject field (such as sport, stock market, politics etc) allows much greater refinement in linguistic analysis.

Standardized Format

All accented characters, source tags and newspaper metadata tags have been standardized throughout the sources.

Collins Corpus for Linguistic Research: Contents

Source                   Words    Dates
British Ephemera         4.7m     1991-1996
US Ephemera              3.6m     1995-1996
British Spoken           20m      1991-1996
Times                    553m     1985-2003
Sunday Times             225m     1985-2003
The Sun                  147m     1996-2003
The News of the World    47m      1996-2003
Today Newspaper          25m      1991-1995
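
As a quick arithmetic check on the figures in the contents table above, the component sizes can be summed to confirm that the corpus exceeds 1 billion words. This is only an illustrative sketch: the dictionary and function names are our own, and the values are the rounded figures copied from the table.

```python
# Sizes in millions of words, copied from the contents table above.
# The names SIZES_M and total_millions are illustrative, not part of any Collins tool.
SIZES_M = {
    "British Ephemera": 4.7,
    "US Ephemera": 3.6,
    "British Spoken": 20,
    "Times": 553,
    "Sunday Times": 225,
    "The Sun": 147,
    "The News of the World": 47,
    "Today Newspaper": 25,
}


def total_millions(sizes):
    """Return the combined size of all components, in millions of words."""
    return sum(sizes.values())


total = total_millions(SIZES_M)
print(f"Total: {total:.1f} million words")  # prints "Total: 1025.3 million words"
```

On these rounded figures the components add up to roughly 1,025 million words, with the two broadsheet titles accounting for about three quarters of the total.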

British and US Ephemera

A unique and rarely collected gathering of short and varied text types such as political handouts, medical pamphlets, educational leaflets, social services leaflets, business communications, local authority brochures, promotional material, junk mail etc.

British Spoken English

20 million words of transcribed spoken British English encompassing many aspects of daily life such as family conversation, conversation between friends and colleagues, work discussions and meetings, telephone conversations, interviews, court proceedings, radio phone-ins etc. All texts have been anonymised.

The Times and The Sunday Times

Award-winning journalism from two of the most highly respected publications in British journalism, covering a vast range of topic areas such as Finance, Sport, Travel, Law, Health, Cookery and Lifestyle, as well as all national and international news and views.

The Sun and The News of the World

The most popular and widely read tabloid newspapers in the UK, offering unmatched coverage of Britain in all its eccentricities.

Today

No longer in print, but a middle-of-the-road brand of journalism that offers a stylistic contrast to both the broadsheets and the tabloids.

Data Presentation and Supply

The data is normally presented in three columns:

  1. First column: assigned part of speech tag number.
  2. Second column: tag name abbreviation.
  3. Third column: word or punctuation.

Sample
15 EX There
31 VBD was
4 AT an
1 JJ awful
18 NN1 inevitability
28 IN about
4 AT the
19 NN2 obituaries
13 DT this
18 NN1 week
27 INOF of
20 NP Lord
20 NP Dacre
27 INOF of
20 NP Glanton
57 . .
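
The three-column layout shown in the sample can be read with a few lines of code. The following is a minimal sketch rather than an official reader: it assumes the columns are separated by whitespace (the actual delimiter in the supplied files is not stated here and may differ), and the names `Token` and `parse_lines` are our own. The embedded rows are taken from the sample above.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple


class Token(NamedTuple):
    tag_id: int  # first column: assigned part-of-speech tag number
    tag: str     # second column: tag name abbreviation
    word: str    # third column: word or punctuation


def parse_lines(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per non-blank line of three-column data.

    Assumes whitespace-separated columns; a line with fewer than
    three columns raises ValueError.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split at most twice so the third column is kept whole.
        tag_id, tag, word = line.split(None, 2)
        yield Token(int(tag_id), tag, word)


# A few rows copied from the sample above.
SAMPLE = """\
15 EX There
31 VBD was
18 NN1 inevitability
20 NP Lord
20 NP Dacre
57 . .
"""

tokens = list(parse_lines(SAMPLE.splitlines()))
tag_counts = Counter(t.tag for t in tokens)  # frequency of each tag abbreviation
words = [t.word for t in tokens]             # the running text, one item per token
```

Once the rows are parsed into tokens, simple grammatical queries become straightforward. For example, `tag_counts["NP"]` gives the number of tokens tagged NP in the rows read.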

The text data is supplied as zipped files on CDs. There is accompanying documentation on the part-of-speech tags and a list of text-type fields used in categorising the news data.

Forthcoming Corpora

We are continually adding to our datasets to better serve the language technology community. To this end, we hope shortly to release 12.5 million words of audio data of spoken British English, corresponding to part of the transcribed spoken corpus mentioned above.

In addition we are currently pursuing datastreams from, for example, Australia, India, the United States and Europe. If you are particularly interested in these or any other areas then please enquire as to their availability.

If you are interested in finding out more, please contact us to discuss your requirements.
