Subject Guides: Data: a Practical Guide: Cleaning data

Understanding the data you have

It's one thing finding some data, but you probably need to manipulate it in some way before you can interrogate it...

We might think of ‘data’ as values stored without context. Through processing that data we can seek to provide context and determine meaning. But even the simplest interrogation requires us to have some understanding of what's in that dataset, and what constitutes ‘good’ data in the first place.

As an example, let’s ‘deconstruct’ some information:

“The appointment with Dr Watt is on Tuesday at 2:30pm at the Heslington Lane surgery.”

This information contains the following fields of data:

Who the appointment is with
The day (date) of the appointment
The time of the appointment
The location of the appointment

If you wanted to record appointments in a computer-based system you would need to use separate ‘fields’ for these — for example, separate columns in a spreadsheet table.
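As a rough sketch of what "separate fields" might look like in code, the appointment above could be held as a record with one attribute per field. The class and attribute names here are illustrative assumptions, not part of any particular system.

```python
from dataclasses import dataclass


@dataclass
class Appointment:
    # One attribute per field identified in the sentence above.
    # The attribute names are hypothetical, chosen for illustration.
    with_whom: str
    day: str
    time: str
    location: str


appointment = Appointment(
    with_whom="Dr Watt",
    day="Tuesday",
    time="2:30pm",
    location="Heslington Lane surgery",
)

# Each piece of information can now be read on its own
print(appointment.location)
```

In a spreadsheet, each of these attributes would simply be a column, and each appointment a row.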

When faced with an existing dataset, our first challenge might well be to reverse this process and rebuild our understanding of what information these fields convey. If you've got the data from a third-party source, look out for any explanatory notes that might help you with this.

Data types

Most datasets are essentially just glorified text files. But that text will be organised in a certain way that can be processed and interpreted as fields of like data. And while that data may look like text, certain fields might be encoded to be read in a very specific and specialised way.

Data processing systems struggle if you don’t stick to recognised data types, or if you add in values that don’t match others in the same context. Take spreadsheets for example — in addition to 'strings' of text they observe the following special data types:

Data type	Understood by a machine	Unrecognisable to a machine
Number	5; 1.6; -350; 0.105	About 10; >5; 10-15; 25cm
Date/time	01/01/2000; 23-11-1963; 15:30; 17:16:20	01.01.2000; Mon or Fri; Next Tuesday; About 10:30
Boolean	True; False	Maybe; ?

Other systems may have even more data types — for instance some databases and languages make a distinction between whole numbers (integers) and decimal numbers (floating point), or between short numbers and long numbers.

The reason for this variety of number types historically comes down to software trying to be efficient with storage space. Wikipedia has a lengthy list of data type examples should you ever be bored.

For software to be able to analyse a number or a date, it needs a number or a date that it can parse — that it can understand and calculate with. If a value doesn't match the necessary rules to qualify as 'parsable', it will be treated as normal text rather than the special category of information you may have been intending. This may have an effect on how you're able to interrogate that data. If you represent a number or date in a way that does not allow the program to determine its type correctly, you will not be able to sort and filter correctly, you will not be able to add up, find averages, find the interval between two dates, etc... You might be able to understand that 20 + c.10 = c.30, but a computer can't make that leap. You're going to have to clean your data...
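To illustrate what "parsable" means in practice, here is a minimal Python sketch. The helper names parse_number and parse_date are made up for this example, and the date format (day/month/year) is an assumption; other tools will apply their own rules.

```python
from datetime import datetime


def parse_number(value):
    """Return the value as a float if it parses as a number, otherwise None."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_date(value, fmt="%d/%m/%Y"):
    """Return a datetime if the value matches the assumed format, otherwise None."""
    try:
        return datetime.strptime(value, fmt)
    except ValueError:
        return None


values = ["5", "1.6", "-350", "0.105", "About 10", ">5", "10-15", "25cm"]
parsed = [parse_number(v) for v in values]

# Only the values that parsed can be used in calculations
numbers = [n for n in parsed if n is not None]
print(len(numbers), "of", len(values), "values can be added up")

# Parsed dates can be subtracted to find an interval
interval = parse_date("01/01/2000") - parse_date("25/12/1999")
print(interval.days)

# Free text is not a date as far as the parser is concerned
print(parse_date("Next Tuesday"))
```

The values that fail to parse are not lost, but they can only be treated as text until they have been cleaned.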

Cleaning data

The success of any data processing will depend in large part on the quality of the source data you're working with.

Data is often messy: columns or other fields might contain a mix of text and numerical data; some rows of information may have missing data; or perhaps you're trying to mash together two separate datasets and the field names don’t quite match, or people have used a label in slightly different ways.

This is when you need to clean your data (a process also known as data munging or data wrangling). You need your data to be in a useful shape for your needs: if you're analysing or visualising data, what information (and types of data) does that analysis or visualisation require?

It’s all about ensuring that your data is validated and quantifiable. For instance, if you have a field containing 'fuzzy' dates — dates that are contaminated with text annotations (e.g. c.1810 or 1990-1997), you might want to create a new field of 'parsed' dates — dates that are machine-readable (e.g. 1810, 1990). This might mean that you're losing some information and nuance from your data, and you'll need to keep that in mind in your analysis. But you'll at least have quantifiable data that you can analyse effectively.
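One possible way to build such a field of 'parsed' dates is sketched below in Python. It assumes that the first four-digit number in each value is the year you want, which will not suit every dataset; the function name and the last two sample values are invented for illustration.

```python
import re


def parse_year(fuzzy):
    """Return the first four-digit year found in the text, or None."""
    match = re.search(r"\d{4}", fuzzy)
    return int(match.group()) if match else None


fuzzy_dates = ["c.1810", "1990-1997", "1963", "unknown"]

# Keep the original values and add the parsed years alongside them
parsed_years = [parse_year(d) for d in fuzzy_dates]
for original, year in zip(fuzzy_dates, parsed_years):
    print(original, "->", year)
```

Keeping the original field next to the parsed one means the lost nuance (the "c." or the end of a date range) is still there if you need to check it later.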

For small, straightforward datasets, you can do data cleaning in a spreadsheet: ensure that numbers and dates are formatted as their appropriate data type, and use filters to help you standardise any recurring text. Excel also has a Power Query Editor tool that makes a lot of this work easier.

The larger a dataset, the harder it is to work with it in a spreadsheet. Free tools like OpenRefine offer a relatively friendly way to clean up large amounts of data, while programming languages like R and Python have functions and libraries that can help with the tidying process.

Library Carpentry: OpenRefine
Resources from the Carpentries on how to get started working with data in OpenRefine.

data.europa.eu:
Data visualisation guide: Intro to tidy data
What is tidy data?
data.europa.eu:
Data visualisation guide: Cleaning data
Detailed advice on how to clean data.
data.europa.eu:
Data visualisation guide: Pitfalls in data
Some typical problems you might encounter.

Data structures

The way your data is laid out has an impact on how you can analyse it...

'Flat file' and relational data

Data is conventionally displayed as a two-dimensional table (rows and columns). Generally this will be laid out as a relationship between a case (a 'tuple') in each row, and its corresponding attributes (each with their own data type) in columns. Take this example of list structured data from a student fundraiser:

	A	B	C	D	E	F	G	H	I
1	Student ID	Forename	Surname	Year	College	Bean bath	10k run	Parachute jump	Tandem joust
2	1001	David	Jones	2	Derwith	60.00	75.50		55.00
3	1002	Farrokh	Bulsara	1	Alcricke	70.00		85.00	45.50
4	1003	Catherine	Bush	2	Langbrugh		65.50	95.50	35.00

Sometimes a single 'flat file' table of rows and columns is not enough. For instance:

You need to work with information about people and the research projects they are involved in. There will be several fields of data about the people, but also several about the projects.

It would be impractical to design one table that is suitable to hold all the data about people and projects, so in this case we create separate tables – one for people and one for projects – and find ways to express the connections between them.

In this example, one person can be involved in many projects, and one project can involve many people. This is a clear indication that the data is relational, and any attempt to work with it using a simple table will entail compromises.

This approach marks out the fundamental difference between a spreadsheet and a relational database.
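The people-and-projects example could be sketched with Python's built-in sqlite3 module, as below. The table names, column names and sample rows are all invented for illustration. The point is the third 'link' table, which records one row for each pairing of a person with a project.

```python
import sqlite3

# An in-memory database, so nothing is written to disk
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people   (person_id  INTEGER PRIMARY KEY, name  TEXT);
    CREATE TABLE projects (project_id INTEGER PRIMARY KEY, title TEXT);

    -- Link table: one row per person-project pairing
    CREATE TABLE involvement (
        person_id  INTEGER REFERENCES people(person_id),
        project_id INTEGER REFERENCES projects(project_id)
    );

    INSERT INTO people      VALUES (1, 'Person A'), (2, 'Person B');
    INSERT INTO projects    VALUES (10, 'Project X'), (20, 'Project Y');
    INSERT INTO involvement VALUES (1, 10), (1, 20), (2, 10);
""")

# Join the three tables to list who is involved in what
rows = conn.execute("""
    SELECT people.name, projects.title
    FROM involvement
    JOIN people   ON people.person_id   = involvement.person_id
    JOIN projects ON projects.project_id = involvement.project_id
    ORDER BY people.name, projects.title
""").fetchall()

for name, title in rows:
    print(name, "works on", title)
```

Person A appears in two projects and Project X involves two people, yet each person and each project is only described once.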

Reshaping your data

Even the fundraising example in the table above may be better thought of as multiple tables: one table could index the students alongside their forenames, surnames, year, and college; a second table could list all the bean bathers (by Student ID) and the corresponding amount raised; a third could list the 10k runners, etc.

Depending on the analysis you need to do, it may be necessary to restructure your data. One common approach is to reorganise your data into what we might call a 'pivotable' format.

In our student fundraiser example, we have multiple columns all sharing the same attribute: amount raised. We might therefore look to move all these values into a single column:

	A	B	C	D	E	F	G
1	Student ID	Forename	Surname	Year	College	Activity	Amount
2	1001	David	Jones	2	Derwith	Bean bath	60.00
3	1001	David	Jones	2	Derwith	10k run	75.50
4	1001	David	Jones	2	Derwith	Tandem joust	55.00
5	1002	Farrokh	Bulsara	1	Alcricke	Bean bath	70.00
6	1002	Farrokh	Bulsara	1	Alcricke	Parachute jump	85.00
7	1002	Farrokh	Bulsara	1	Alcricke	Tandem joust	45.50
8	1003	Catherine	Bush	2	Langbrugh	10k run	65.50
9	1003	Catherine	Bush	2	Langbrugh	Parachute jump	95.50
10	1003	Catherine	Bush	2	Langbrugh	Tandem joust	35.00

This table looks unusual when we're used to seeing one row per student. Now it's effectively one row per fundraising performance (we might even imagine a unique ID ascribed to each activity a student performs). But it means that all the fundraising amounts are now in the same column (G): we can get a total for that column very easily, and can even filter based on the activity, the student, or any other field. If we're using a spreadsheet, we can use this data in a pivot table, and if we're looking to make a visualisation, this is also the ideal format for a lot of visualisation tools.
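If you are working in code rather than a spreadsheet, the same reshaping can be sketched with Python's standard csv module. This is one possible approach, using the fundraiser table from earlier typed in as CSV text; libraries such as pandas offer ready-made functions for the same job.

```python
import csv
import io

# The original 'wide' fundraiser table, one row per student
wide_csv = """Student ID,Forename,Surname,Year,College,Bean bath,10k run,Parachute jump,Tandem joust
1001,David,Jones,2,Derwith,60.00,75.50,,55.00
1002,Farrokh,Bulsara,1,Alcricke,70.00,,85.00,45.50
1003,Catherine,Bush,2,Langbrugh,,65.50,95.50,35.00
"""

# Columns that describe the student rather than an activity
id_fields = ["Student ID", "Forename", "Surname", "Year", "College"]

long_rows = []
for row in csv.DictReader(io.StringIO(wide_csv)):
    for field, value in row.items():
        # Skip the student details and any activity the student didn't do
        if field in id_fields or value == "":
            continue
        long_row = {name: row[name] for name in id_fields}
        long_row["Activity"] = field
        long_row["Amount"] = float(value)
        long_rows.append(long_row)

# With every amount in one field, totals and filters are straightforward
total_raised = sum(r["Amount"] for r in long_rows)
jousting = [r["Amount"] for r in long_rows if r["Activity"] == "Tandem joust"]
print(len(long_rows), "rows, total raised:", total_raised)
```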

Restructuring data is not always straightforward. But some of the data wrangling tools below may help you. We've also got some guidance on using spreadsheets to unpivot 'pivoted' data.

Essential Spreadsheets
Our practical guide to spreadsheets.
Particularly useful will be the sections on connecting data, datasets, and processing.
data.europa.eu:
Data visualisation guide: Wide versus long data
Further comparison of the above approaches to organising data.
Seven free data wrangling tools
Blogpost on some free tools for cleaning up data. As ever, think critically before downloading strange software from the internet...
7 steps to mastering data preparation with Python
A look at data cleaning with the pandas library in Python.


Data formats

Data can be organised in a number of standardised and interoperable text-based formats. Whether you're importing existing data, or exporting it for use in another tool, it's worth understanding the common formats in use.

Delimited (CSV and TSV)

An archival standard for spreadsheet-formatted data is to use delimited text: each cell is separated by a special character (usually a comma or a tab character), and each row by a different character (usually a line-break character).

The most common delimited formats are CSV (comma-separated values) and TSV (tab-separated values):

CSV

Forename,Surname,Year,College
David,Jones,2,Derwith
Farrokh,Bulsara,1,Alcricke
Catherine,Bush,2,Langbrugh

TSV

Forename	Surname	Year	College
David	Jones	2	Derwith
Farrokh	Bulsara	1	Alcricke
Catherine	Bush	2	Langbrugh

Spreadsheet files can be saved into these formats for use elsewhere, but only the superficial text values are saved, not any formatting or underlying formulae.

Cells containing commas (or tabs) are enclosed in quotation marks, and any quotation marks inside such a cell are usually doubled. If your data contains complicated combinations of commas (or tabs) and quotation marks, you may have problems saving as CSV (or TSV), though you could potentially save with a different delimiter.
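To see how quoting works in practice, here is a small sketch using Python's standard csv module. The sample names and notes are made up, and the pipe character is just one possible alternative delimiter.

```python
import csv
import io

rows = [
    ["Name", "Note"],
    ["Jones, David", 'Known as "Davy"'],
]

# Write the rows as CSV text; the writer adds quotation marks where needed
buffer = io.StringIO()
csv.writer(buffer).writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back recovers the original cells, commas and all
round_trip = list(csv.reader(io.StringIO(csv_text)))

# With a different delimiter, the comma no longer needs special treatment
pipe_buffer = io.StringIO()
csv.writer(pipe_buffer, delimiter="|").writerows(rows)
pipe_text = pipe_buffer.getvalue()
print(pipe_text)
```

Whichever delimiter you choose, whoever reads the file needs to know what it is.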

XML

XML (eXtensible Markup Language) uses nested tags (and corresponding closing tags) to define data and relationships. This webpage is essentially built using the same principles.

XML can incorporate rules (such as a Document Type Definition or an XML Schema) to define which fields are allowed and, in the case of XML Schema, the data type of any given field. But because these rules are optional, and quite difficult to write, they aren't always provided.

While delimited filetypes can organise data in two dimensions (rows and columns), XML can encode more elaborate relational data.

XML files can often be imported into Excel, but the relationships contained may be too complicated to display adequately in a spreadsheet.

XML

<PEOPLE>
  <PERSON>
    <FORENAME>David</FORENAME>
    <SURNAME>Jones</SURNAME>
    <YEAR>2</YEAR>
    <COLLEGE>Derwith</COLLEGE>
  </PERSON>
  <PERSON>
    <FORENAME>Farrokh</FORENAME>
    <SURNAME>Bulsara</SURNAME>
    <YEAR>1</YEAR>
    <COLLEGE>Alcricke</COLLEGE>
  </PERSON>
  <PERSON>
    <FORENAME>Catherine</FORENAME>
    <SURNAME>Bush</SURNAME>
    <YEAR>2</YEAR>
    <COLLEGE>Langbrugh</COLLEGE>
  </PERSON>
</PEOPLE>
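If a spreadsheet can't display the XML adequately, a coding language can read it instead. The sketch below uses Python's standard xml.etree.ElementTree module on the example above; turning each person into a dictionary is just one convenient choice.

```python
import xml.etree.ElementTree as ET

xml_text = """
<PEOPLE>
  <PERSON><FORENAME>David</FORENAME><SURNAME>Jones</SURNAME><YEAR>2</YEAR><COLLEGE>Derwith</COLLEGE></PERSON>
  <PERSON><FORENAME>Farrokh</FORENAME><SURNAME>Bulsara</SURNAME><YEAR>1</YEAR><COLLEGE>Alcricke</COLLEGE></PERSON>
  <PERSON><FORENAME>Catherine</FORENAME><SURNAME>Bush</SURNAME><YEAR>2</YEAR><COLLEGE>Langbrugh</COLLEGE></PERSON>
</PEOPLE>
"""

root = ET.fromstring(xml_text.strip())

# One dictionary per <PERSON>, keyed by the (lower-cased) tag names
people = [
    {child.tag.lower(): child.text for child in person}
    for person in root.findall("PERSON")
]

# Without a schema, every value arrives as text, so convert where needed
years = [int(person["year"]) for person in people]

for person in people:
    print(person["forename"], person["surname"], person["college"])
```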

JSON

Like XML, JSON (JavaScript Object Notation) uses a hierarchical structure, in this case arranged in name/value pairs. The brackets and quote-marks may make JSON a little harder to read than XML, but the lack of closing tags means that files are a lot shorter (something that's quite important when you've got a lot of data). It's also easier to parse for use with a coding language.

JSON

{"people":[
  {
    "forename":"David",
    "surname":"Jones",
    "year":2,
    "college":"Derwith"
  },
  {
    "forename":"Farrokh",
    "surname":"Bulsara",
    "year":1,
    "college":"Alcricke"
  },
  {
    "forename":"Catherine",
    "surname":"Bush",
    "year":2,
    "college":"Langbrugh"
  }
]}

The latest versions of Excel can open JSON files, but not yet in an especially useful way. You'd generally interrogate JSON with a coding language like JavaScript or Python.
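As a minimal sketch of what interrogating JSON with Python might look like, the example below loads the data above with the standard json module and picks out the second-year students. The filter is just an illustration.

```python
import json

json_text = """
{"people":[
  {"forename":"David", "surname":"Jones", "year":2, "college":"Derwith"},
  {"forename":"Farrokh", "surname":"Bulsara", "year":1, "college":"Alcricke"},
  {"forename":"Catherine", "surname":"Bush", "year":2, "college":"Langbrugh"}
]}
"""

data = json.loads(json_text)

# Unquoted numbers in JSON are read as numbers, not text
second_years = [p["forename"] for p in data["people"] if p["year"] == 2]
print(second_years)
```

Because the years are stored without quote-marks, they arrive as numbers and can be compared or calculated with straight away.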

Getting data from web tables and PDFs

Not all data you find online is in a friendly format. You may occasionally come across tables of useful statistics on webpages or in PDFs. Sometimes you can copy and paste them into something like a spreadsheet without any problems, but not always.

Web tables

If data on a webpage has been formatted as a table or a list, and copying and pasting isn't pulling information across as you'd like, you should be able to import the data into a spreadsheet using an import function. Even if the data has been formatted in a non-standard way, you may still be able to extract usable information using an import function like IMPORTXML in Google or WEBSERVICE in Excel, but you might have to dig a bit deeper into the HTML.
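If you'd rather dig into the HTML with code, the sketch below shows one way to pull the cells out of a simple table using Python's standard html.parser module. The HTML snippet and the TableParser class are invented for illustration. Real webpages are usually messier, and this sketch ignores nested tables and merged cells.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a cell
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


# A made-up fragment standing in for a downloaded webpage
page_html = """
<table>
  <tr><th>Forename</th><th>College</th></tr>
  <tr><td>David</td><td>Derwith</td></tr>
  <tr><td>Farrokh</td><td>Alcricke</td></tr>
</table>
"""

parser = TableParser()
parser.feed(page_html)
parser.close()
table_rows = parser.rows

for row in table_rows:
    print(row)
```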

PDFs

So long as the data in the PDF is encoded as text (rather than as an image), it can be extracted into a spreadsheet format. On University computers you can use ABBYY FineReader to convert a PDF to Excel format. If you're on your own machine you could use Google Drive to convert the PDF to a Google Doc, and then copy and paste. There are also free tools like Tabula, though, as ever, you should think critically when using software from the internet.

If your data is just an image (a photograph or photocopy of some data, with no machine-readable element), you'll need to employ some optical character recognition (OCR). If you're on campus, the scanning options on the printers/photocopiers include OCR. Alternatively, you could use Google Drive to convert a PDF to a Google Doc. Either way, the results may not be structured in a very useful way, and you may have to do a lot of repair. It may be easier to simply enter the data yourself.

data.europa.eu visualisation guide: Data file formats
More detail on file formats from the European Commission's data portal.

Data: Analysis
Tools and approaches for manipulating and making sense of data.
Data: Visualisation
A look at data visualisation and how to communicate data in different ways.
