Subject Guides: Data: a Practical Guide: Text and data mining

Mining (with computers)

[Image: a pithead (a metal scaffolding around the wheel that operates the cage-lift that takes miners into the bowels of the earth). In the distance, a mist-filled valley. Barnsley Main Colliery, CC BY-SA Steve Fareham]

Mining is the act of digging something out of something else. Humans have been mining things out of the earth for thousands of years: coal, metals, precious gems... We've also got used to the idea of people being a metaphorical mine of information, or of enterprises being metaphorical goldmines.

Humanity has produced masses of written documentation. Previously you would have to physically read it to 'mine' any information from it, but in our modern digital world there are quicker and more sophisticated approaches we might take. Take a search engine, for instance: it indexes billions of digital texts (webpages), and searches that data to extract matches for the search term you enter (just as a miner might extract a handful of tiny gems from a huge cliffside). A tool like Google can do in fractions of a second something that would take humans centuries.
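To make the search engine idea a little more concrete, here is a minimal sketch of an "inverted index": a lookup table from each word to the documents that contain it. It is an illustration only, not how Google or any real search engine is implemented, and the three tiny "pages" and the function names (`tokenize`, `build_index`, `search`) are invented for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results


# Hypothetical mini-collection standing in for billions of webpages.
documents = {
    "page1": "Coal mining in South Yorkshire",
    "page2": "Text mining for historians",
    "page3": "A history of coal",
}
index = build_index(documents)
print(sorted(search(index, "coal mining")))  # prints ['page1']
```

The index is built once, in advance; each search then only has to look up the query words rather than re-read every document, which is why the lookup stays fast as the collection grows.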

And as digital technology develops, we can do even more in terms of this 'mining' of information: we can use computational methods to identify patterns and relationships we might otherwise have been unable to see.

It's not a new thing. For decades people have been getting computers to look for patterns in Shakespeare plays, in coded messages, and even in chess games in order to find the best moves to play at a given point. But computers are getting faster and more powerful, and more and more data is becoming available for us to work with.

Some definitions of text and data mining

In this world of computational research, there's a lot of terminology used to describe a lot of similar things: knowledge extraction, information harvesting, data wrangling, data munging, data archaeology, text mining, and data mining, to name just seven interestingly named variants. These terms may mean different things to different people; they're sometimes interchangeable and sometimes not, depending on who you're talking to. But they all essentially come down to using a computer to uncover otherwise hidden understanding from a mass of information. On this page we're referring to text and data mining, if only because that's the terminology that's getting used in things like legislation and contracts. We'll consider this as something separate from (and more advanced than) data cleaning or data wrangling (which might be seen as a subset of the text/data-mining process).

The UK government defines text and data mining (collectively) as "the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information." The University of Cambridge Library distinguishes between the two as follows:

Data mining is the computational process of discovering and extracting knowledge from structured [organised and codified] data;
Text mining is the computational process of discovering and extracting knowledge from unstructured [messy] data.
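The difference between the two can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records and the sentence below are invented, and the year-matching pattern is a deliberately crude stand-in for real text-mining techniques.

```python
import csv
import io
import re
from collections import Counter

# Structured data: every value sits in a named column, so we can ask for it directly.
# (Hypothetical records, invented for illustration.)
structured = io.StringIO(
    "title,year\n"
    "Report A,1984\n"
    "Report B,1985\n"
    "Report C,1984\n"
)
reports_per_year = Counter(row["year"] for row in csv.DictReader(structured))

# Unstructured data: the same kind of fact is buried in free text, so we first
# have to find it, here with a simple pattern for four-digit years (1800-2099).
# (Invented sentence, for illustration only.)
unstructured = (
    "The first report appeared in 1984 after a long inquiry. "
    "A second report, written in 1985, revisited the events of 1984."
)
years_mentioned = Counter(re.findall(r"\b(?:1[89]|20)\d{2}\b", unstructured))

print(reports_per_year)
print(years_mentioned)
```

In the structured case the computer is told where the year is; in the unstructured case it has to work that out, which is where most of the effort (and most of the error) in text mining tends to lie.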

Text and data mining of Library databases

The University subscribes to a range of specialist database resources. That's a lot of tasty-looking data just sitting there, waiting to be mined. However, these subscribed resources normally have specific licences and restrictions regarding what text and/or data mining can be carried out upon them (there's a lot of money and copyright involved in these things). That's not to say that they can't be mined; indeed, the databases may have built in specific mechanisms to facilitate data mining. But it's important that you observe the restrictions that are in place, so that neither you nor the University is in breach of any agreements.

Getting access to Library databases for text or data mining

If you want to use our databases for text or data mining, contact your Faculty Librarians in the first instance, via the contact details on your department's Library Subject Guide.

We will typically need to request additional licence permissions from the publishers and vendors of those databases. To help us enable this for you, you will need to let us know:

- which resources you are interested in. We will contact the publisher(s) concerned for you, and discuss licence options and any tools they may have available to facilitate text or data mining;
- whether you have a research budget to cover any costs of including data mining rights in the licence. If you do not have a budget, we can support you in preparing a case to bid for Library funds;
- when you need access. Tell us as early as possible so that we can arrange appropriate text and data mining rights for the resource(s) you need, when you need them. Negotiations can sometimes be complex, so do let us know well in advance!

Subject Guides
Subject-specific help and resources

Copyright considerations

An exception to copyright exists for text and data mining where it is undertaken "for non-commercial research." Further guidance on understanding this exception and wider copyright considerations for research is provided in our Copyright Practical Guide.

Selected text and data mining resources

Here are a few text and data mining resources:

CORE Services
An aggregated collection of open access research papers from repositories and journals. The full dataset can be downloaded and used for text mining or other machine processing.
HathiTrust Digital Library
Allows researchers to bulk download and analyse text data of public domain works for non-commercial research purposes.
JSTOR Text Analyzer
Carries out a search of the JSTOR database based on text provided in an uploaded document.
PubMed Article Datasets
Provides a range of datasets which can be downloaded for text mining.
Text Creation Partnership raw files
Includes EEBO and ECCO.
Provides searchable transcriptions of early printed books. Raw datasets are available to download for text mining.

Gale Digital Scholar Lab
For access you will need to log in at the Gale Digital Scholar Lab site using a Google or Microsoft account. This allows you to save your work in the Lab.
Digital Scholar Lab provides tools for exploring and analysing the University's collections of Gale Primary Sources. Using Digital Scholar Lab you can:
- Search across the University's Gale Primary Sources collections.
- Create custom content sets containing as many as 10,000 documents.
- Analyse content sets with built-in text analysis, mining and visualisation tools. Analysis methods include: Named Entity Recognition, Topic Modelling, Parts of Speech, Ngram, Sentiment Analysis and Clustering.
- Organise and manage your research.
- Export tabular data and visualisations in standard formats.
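To give a flavour of what one of these analysis methods involves, here is a minimal sketch of Ngram counting (counting how often each run of n consecutive words appears). It is a generic illustration, not Digital Scholar Lab's own implementation; the function name `ngrams` and the sample sentence are invented for the example.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return every run of n consecutive words in the text, as strings."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


# Invented sample text, for illustration only.
sample = "the coal board met the coal miners and the coal board left"
counts = Counter(ngrams(sample, n=2))
print(counts.most_common(2))  # prints [('the coal', 3), ('coal board', 2)]
```

Run over thousands of documents rather than one sentence, the same counting can show which phrases dominate a collection, or how their use changes over time.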

Forthcoming training sessions

There are more training events at:

Skills Guide: Training
Take a look at our list of events