Text and data mining (sometimes abbreviated to TDM) is an area of computational research that looks for patterns of meaning in large datasets.
Mining is the act of digging something out of something else. Humans have been mining things out of the earth for thousands of years: coal, metals, precious gems... We've also got used to the idea of people being a metaphorical mine of information, or of enterprises being metaphorical goldmines.
Humanity has produced masses of written documentation. Previously you would have to physically read it to 'mine' any information from it, but in our modern digital world there are quicker and more sophisticated approaches we might take. Take a search engine, for instance: it indexes billions of digital texts (webpages), and searches that data to extract matches for the search term you enter (just as a miner might extract a handful of tiny gems from a huge cliffside). A tool like Google can do in fractions of a second something that would take humans centuries.
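To make the search-engine idea a little more concrete, here is a minimal sketch in Python of an 'inverted index': a lookup table from each word to the documents that contain it. It is only an illustration, not how any real search engine is built; the three sample "webpages" and the function names (tokenize, build_index, search) are made up for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical sample data standing in for billions of webpages.
documents = {
    "page1": "Coal and metals are mined from the earth.",
    "page2": "Text mining looks for patterns in large collections of text.",
    "page3": "Data mining looks for patterns in large datasets.",
}

index = build_index(documents)
print(sorted(search(index, "patterns large")))  # ['page2', 'page3']
print(sorted(search(index, "text mining")))     # ['page2']
```

The index is built once, up front; each search then only has to look up a few words rather than re-read every document, which is why the lookup can be so quick.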
And as digital technology develops, we can do even more in terms of this 'mining' of information: we can use computational methods to identify patterns and relationships we might otherwise have been unable to see.
It's not a new thing. For years people have been getting computers to look for patterns in Shakespeare plays, in coded messages, and even in chess games in order to find the best moves to play at a given point. But computers are getting faster and more powerful, and more and more data is becoming available for us to work with.
In this world of computational research, there's a lot of terminology used to describe a lot of similar things: knowledge extraction, information harvesting, data wrangling, data munging, data archaeology, text mining, and data mining, to name just seven interestingly named variants. These terms may mean different things to different people; they're sometimes interchangeable and sometimes not, depending on who you're talking to. But they all essentially come down to using a computer to uncover otherwise hidden understanding from a mass of information. On this page we're referring to text and data mining, if only because that's the terminology that's getting used in things like legislation and contracts. We'll consider this as something separate from (and more advanced than) data cleaning or data wrangling (which might be seen as a subset of the text/data-mining process).
The UK government defines text and data mining (collectively) as "the use of automated analytical techniques to analyse text and data for patterns, trends and other useful information." The University of Cambridge Library distinguishes between the two as follows:
The University subscribes to a range of specialist database resources. That's a lot of tasty-looking data just sitting there, waiting to be mined. However, these subscribed resources normally have specific licences and restrictions regarding what text and/or data mining can be carried out upon them (there's a lot of money and copyright involved in these things). That's not to say that they can't be mined; indeed, the databases may have specific mechanisms built in to facilitate data mining. But it's important that you observe the restrictions that are in place, so that neither you nor the University is in breach of any agreements.
If you want to use our databases for text or data mining, contact your Faculty Librarians in the first instance, via the contact details on your department's Library Subject Guide.
We will typically need to request additional licence permissions from the publishers and vendors of those databases. To help us enable this for you, you will need to let us know:
An exception to copyright exists for text and data mining where it is undertaken "for the purpose of computational analysis" provided the researcher already has "the right to read the work (that is, they have ‘lawful access’ to the work). This exception only permits the making of copies for the purpose of text and data mining for non-commercial research."
However, many licences will have specific requirements around data storage and deletion policies, and may have specific restrictions regarding how content can be shared.
Permissions to undertake text and/or data mining on licensed resources will normally be restricted to University of York users only. This will likely have implications if you are working cross-institutionally.
Cambridge University Libraries have put together a useful resource on copyright considerations for text and data mining:
Here are a few text and data mining resources:
Forthcoming sessions on :
There are more training events at: