University of York Library
Library Subject Guides

Data: a Practical Guide

Text and data mining

Final prototype for the Data practical guide.

Text and data mining (sometimes abbreviated to TDM) is an area of computational research that looks for patterns of meaning in large datasets.

Mining (with computers)

A pithead (a metal scaffolding around the wheel that operates the cage-lift that takes miners into the bowels of the earth). In the distance, a mist-filled valley.
Barnsley Main Colliery - CC BY-SA Steve Fareham

Mining is the act of digging something out of something else. Humans have been mining things out of the earth for thousands of years: coal, metals, precious gems... We've also got used to the idea of people being a metaphorical mine of information, or of enterprises being metaphorical goldmines.

Humanity has produced masses of written documentation. Previously you would have to physically read it to 'mine' any information from it, but in our modern digital world there are quicker and more sophisticated approaches we might take. Take a search engine, for instance: it indexes billions of digital texts (webpages), and searches that data to extract matches for the search term you enter (just as a miner might extract a handful of tiny gems from a huge cliffside). A tool like Google can do in fractions of a second something that would take humans centuries.
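To make the search engine idea a little more concrete, here is a minimal sketch of an index in Python. It is an illustration only: the three "webpages" are made up, the function names are our own, and a real search engine is vastly more sophisticated. The principle is the same, though: do the reading once, up front, so that each search only has to look things up.

```python
import re
from collections import defaultdict

# Three tiny made-up "webpages" standing in for a search engine's corpus.
PAGES = {
    "page1": "Coal mining shaped the towns of South Yorkshire",
    "page2": "Text mining looks for patterns in large collections of text",
    "page3": "Precious gems and metals are mined from the earth",
}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the ids of pages containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(matches)

INDEX = build_index(PAGES)
print(search(INDEX, "mining"))        # ['page1', 'page2']
print(search(INDEX, "text mining"))   # ['page2']
```

Note that page3 is not returned for "mining" because it only contains the word "mined": this toy index matches exact words, whereas real tools usually also match related word forms.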

And as digital technology develops, we can do even more in terms of this 'mining' of information: we can use computational methods to identify patterns and relationships we might otherwise have been unable to see.

It's not a new thing. For decades people have been getting computers to look for patterns in Shakespeare plays, in coded messages, and even in chess games in order to find the best moves to play at a given point. But computers are getting faster and more powerful, and more and more data is becoming available for us to work with.

Some definitions of text and data mining

In this world of computational research, there's a lot of terminology used to describe a lot of similar things: knowledge extraction, information harvesting, data wrangling, data munging, data archaeology, text mining, and data mining, to name just seven interestingly named variants. These terms may mean different things to different people; they're sometimes interchangeable and sometimes not, depending on who you're talking to. But they all essentially come down to using a computer to uncover otherwise hidden understanding from a mass of information. On this page we're referring to text and data mining, if only because that's the terminology used in things like legislation and contracts. We'll consider this as something separate from (and more advanced than) data cleaning or data wrangling (which might be seen as a subset of the text/data-mining process).

The UK government defines text and data mining (collectively) as "the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information." The University of Cambridge Library distinguishes between the two as follows:

  • Data mining is the computational process of discovering and extracting knowledge from structured [organised and codified] data;
  • Text mining is the computational process of discovering and extracting knowledge from unstructured [messy] data.
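As a rough illustration of that distinction, here is a small Python sketch. The loan records and the sentence are invented for the example, and the function names are our own. With structured data we can go straight to a named column; with unstructured text we first have to decide what counts as a word before we can count anything.

```python
import csv
import io
import re
from collections import Counter

# Structured data: made-up loan records with named columns.
LOANS_CSV = """title,subject,loans
Hamlet,Drama,12
Macbeth,Drama,9
Origin of Species,Biology,7
"""

# Unstructured data: a made-up sentence of free text.
NOTES = "The mine closed in winter. The miners left the mine and the town."

def loans_by_subject(csv_text):
    """Data mining in miniature: total the 'loans' column per subject."""
    totals = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["subject"]] += int(row["loans"])
    return dict(totals)

def word_counts(text):
    """Text mining in miniature: find the words first, then count them."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

print(loans_by_subject(LOANS_CSV))        # {'Drama': 21, 'Biology': 7}
print(word_counts(NOTES).most_common(2))  # [('the', 4), ('mine', 2)]
```

Real projects involve far more data and far more careful preparation, but the two functions show where the effort goes in each case.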

Text and data mining of Library databases

The University subscribes to a range of specialist database resources. That's a lot of tasty-looking data just sitting there, waiting to be mined. However, these subscribed resources normally have specific licences and restrictions regarding what text and/or data mining can be carried out upon them (there's a lot of money and copyright involved in these things). That's not to say that they can't be mined; indeed, the databases may have specific built-in mechanisms to facilitate data mining. But it's important that you observe the restrictions that are in place, so that neither you nor the University is in breach of any agreements.

Getting access to Library databases for text or data mining

If you want to use our databases for text or data mining, contact your Academic Liaison Librarian in the first instance, via the contact details on your department's Library Subject Guide.

We will typically need to request additional licence permissions from the publishers and vendors of those databases. To help us arrange this for you, please let us know:

  • which resources you are interested in. We will contact the publisher(s) concerned for you, and discuss licence options and any tools they may have available to facilitate text or data mining;
  • whether you have a research budget to cover any costs of adding data mining rights to the licence. If you do not have a budget, we can support you in preparing a case to bid for Library funds.

Please get in touch as early as possible so that we can arrange appropriate text and data mining rights for the resource(s) you need by the time you need them. Negotiations can sometimes be complex, so do let us know well in advance!

Copyright considerations

An exception to copyright exists for text and data mining where it is undertaken "for the purpose of computational analysis" provided the researcher already has "the right to read the work (that is, they have ‘lawful access’ to the work). This exception only permits the making of copies for the purpose of text and data mining for non-commercial research."

However, many licences will have specific requirements around data storage and deletion policies, and may have specific restrictions regarding how content can be shared.

Permissions to undertake text and/or data mining on licensed resources will normally be restricted to University of York users only. This will likely have implications if you are working cross-institutionally.

Cambridge University Libraries have put together a useful resource on copyright considerations for text and data mining:

Selected open text and data mining resources

Here are a few open text and data mining resources:
