We take an introductory look at some methods of data analysis and communication. We explain the different analyses that spreadsheets, statistical software and qualitative data analysis software can help with, and we investigate some of the principles and practice of data visualisation.
Most data can be reduced to numbers or counts of items. Depending on what you're trying to achieve, you may only need to calculate descriptive statistics such as counts, averages, and percentages.
You're best off using an approach you've been taught, or that your supervisor or others in your department use.
SPSS is used for statistical analysis. The add-on module, SPSS Amos, enables structural equation modelling.
R is a programming language and software environment for statistical computing and graphics. We've got a load of resources for R on our Coding Practical Guide:
Python is a coding language which can be used for statistical analyses. We've got a load of resources for Python on our Coding Practical Guide:
Stata is a general-purpose statistical software package. It's capabilities include data management, statistical analysis, graphics, simulations, and custom programming.
IT Services have a number of software packages for statistical analysis:
Qualitative data analysis is the analysis of things other than numbers — usually text information. It's mostly a case of just reading stuff, but it can also be advantageous to find ways of quantifying content.
NVivo is a qualitative data, text management and organisational tool which enables an analysis of very rich text-based and/or multimedia information, where deep levels of analysis on small or large volumes of data are required. It is often used for qualitative research and literature review.
Data can be organised in a number of standardised and interoperable text-based formats. Whether you're importing existing data, or exporting it for use in another tool, it's worth understanding the common formats in use.
An archival standard for spreadsheet-formatted data is to use delimited text: each cell is separated by a special character (usually a comma or a tab character), and each row by a different character (usually a line-break character).
The most common delimited formats are CSV (comma-separated values) and TSV (tab-separated values):
Spreadsheet files can be saved into these formats for use elsewhere, but only the superficial text values are saved, not any formatting or underlying formulae.
Cells containing commas (or tabs) are encoded in quotation marks. If your data contains complicated combinations of commas (or tabs) and quotation marks, you may have problems saving as csv (or tsv), though you could potentially save with a different delimiter!
It's one thing finding some data, but you probably need to manipulate it in some way before you can interrogate it...
We might think of ‘data’ as values stored without context. Through processing that data we can seek to provide context and determine meaning. But even simple spreadsheet operations require us to have some understanding of what's in that dataset, and what constitutes ‘good’ data in the first place.
As an example, let’s ‘deconstruct’ some information:
“The appointment with Dr Watt is on Tuesday at 2:30pm at the Heslington Lane surgery.”
This information contains the following fields of data:
If you wanted to record appointments in a computer-based system you would need to use separate ‘fields’ for these — which in a spreadsheet might translate to separate columns.
When faced with an existing dataset, our first challenge might well be to reverse this process and rebuild our understanding of what information these fields convey. If you've got the data from a third-party source, look out for any explanatory notes that might help you with this.
Data processing systems struggle if you don’t stick to recognised data types, or if you add in values that don’t match others in the same context, For instance, in addition to text, spreadsheets observe the following special data types:
Mon or Fri
For software to be able to analyse a number or a date, it needs a number or a date that it can parse — that it can understand and calculate with. If a value doesn't match the necessary rules to qualify as 'parsable', it will be treated as text. This may have an affect on how you're able to interrogate that data. If you represent a number or date in a way that does not allow the program to determine its type correctly, you will not be able to sort and filter correctly, you will not be able to add up, find averages, find the interval between two dates, etc... You might be able to understand that 20 + c.10 = c.30, but a computer can't make that leap. You're going to have to clean your data.
The success of any data processing will depend in large part on the quality of the source data you're working with. Data is often messy: columns might contain a mix of text and numerical data; some rows may have missing data; or perhaps you're trying to mash together two separate spreadsheets and the column names don’t quite match, or people have used a label in slightly different ways.
This is when you need to clean your data (a process also known as data munging or data wrangling). You need your data to be in a useful shape for your needs: if you're analysing or visualising data, what information (and types of data) does that analysis or visualisation require?
It’s all about ensuring that your data is validated and quantifiable. For instance, if you have a column of 'fuzzy' dates (e.g. c.1810 or 1990-1997), you might want to create a new column of 'parsed' dates — dates that are machine-readable (e.g. 1810, 1990). This might mean that you're losing some information and nuance from your data, and you'll need to keep that in mind in your analysis. But you'll at least have quantifiable data that you can analyse effectively.
For small, straightforward datasets, you can do data cleaning in a spreadsheet: ensure that numbers and dates are formatted as their appropriate data type, and use filters to help you standardise any recurring text. Excel even has a Query Editor tool that makes a lot of this work even easier.
The larger a dataset, the harder it is to work with it in a spreadsheet. Free tools like OpenRefine offer a relatively friendly way to clean up large amounts of data, while programming languages like R and Python have functions and libraries that can help with the tidying process.
The way your data is laid out has an impact on how you can analyse it.
Data is conventionally displayed as a two-dimensional table (rows and columns). Generally this will be laid out as a relationship between a case (a 'tuple') in each row, and its corresponding attributes (each with their own data type) in columns. Take this example of list structured data from a student fundraiser:
|1||Student ID||Foreame||Surname||Year||College||Bean bath||10k run||Parachute jump||Tandem joust|
Sometimes a single 'flat file' table of rows and columns is not enough. For instance:
You need to work with information about people and the research projects they are involved in. There will be several fields of data about the people, but also several about the projects.
It would be impossible to design one table that is suitable to hold all the data about people and projects, so in this case we create separate tables – one for people and one for projects – and find ways to express the connections between them.
In this example, one person can be involved in many projects, and one project can involve many people. This is a clear indication that the data is relational, and any attempt to work with it using a simple table will entail compromises.
This approach marks out the fundamental difference between a spreadsheet and a relational database.
Even the fundraising example in the table above may be better thought of as multiple tables: one table could index the students alongside their forenames, surnames, year, and college; a second table could list all the bean bathers (by Student ID) and the corresponding amount raised; a third could list the 10k runners, etc.
Depending on the analysis you need to do, it may be necessary to restructure your data. One common approach is to reorganise your data into what we might call a 'pivotable' format.
In our student fundraiser example, we have multiple columns all sharing the same attribute: amount raised. We might therefore look to move all these values into a single column:
This table looks unusual when we're used to seeing one row per student. Now it's effectively one row per fundraising performance (we might even imagine a unique ID ascribed to each activity a student performs). But it means that all the fundraising amounts are now in the same column (G): we can get a total for that column very easily, and can even filter based on the activity, the student, or any other field. If we're using a spreadsheet, we can use this data in a pivot table, and if we're looking to make a visualisation, this is also the ideal format for a lot of visualisation tools.
Restructuring data is not always straightforward. But some of the data wrangling tools below may help you. We've also got some guidance on using spreadsheets to unpivot 'pivoted' data.
Forthcoming sessions on :
There's more training events at: