University of York Library
Library Subject Guides

Data: a Practical Guide

Final prototype for the Data practical guide.

What is data?

No, really; what is it? And is it for me?

What is this data thing?

"In the pursuit of knowledge, data is a collection of discrete values that convey information, describing quantity, quality, fact, statistics, other basic units of meaning, or simply sequences of symbols that may be further interpreted."

- Wikipedia


Data can be pretty much anything, really. Or rather, pretty much anything can be data. Take a look around you. Everything you can perceive can be quantified, qualified, or otherwise interpreted in some way, shape or form. The word data, for instance, is four characters long but one of those characters is repeated. We've already had the word data four times so far in this paragraph, and twice it followed the word word, while the other two times it was in a sentence with the phrase can be. What does all this tell us about the word data and its uses? Maybe that's something we could investigate further...?
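If you fancy trying that sort of counting yourself, here's a minimal sketch in Python. It isn't a prescribed method, just one way of doing it: the sample text is only the first two sentences of the paragraph above, and the variable names are our own.

```python
import re
from collections import Counter

# Sample text: the first two sentences of the paragraph above.
text = ("Data can be pretty much anything, really. "
        "Or rather, pretty much anything can be data.")

# Split into lowercase words, ignoring punctuation.
words = re.findall(r"[a-z']+", text.lower())

# How many times does the word 'data' appear in the sample?
data_count = words.count("data")

# Which characters in the word 'data' appear more than once?
letters = Counter("data")
repeated = [ch for ch, n in letters.items() if n > 1]

print(data_count)    # occurrences of 'data' in the sample
print(len("data"))   # number of characters in the word
print(repeated)      # characters that are repeated
```

Swap in any text you like: the same few lines would turn a speech, a poem or a web page into something countable.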

Data isn't just numbers, though it is often reduced to that. It's discrete packets of evidence that we can potentially aggregate to find patterns and meaning. It's testimonies, field boundaries, the human genome, the avocado genome, a bottle of wine, a packet of crisps, the complete works of Shakespeare...

It's possible to get quite philosophical about the nature of data. We're not going to do that too much here. Often in this guide we'll be making certain assumptions about types of data: humans love looking for patterns, and patterns imply something that is measurable or countable or mappable. We'll be looking for data that's already in this form (neatly packaged in a 'dataset'), or looking to turn more abstract data into something more quantifiable. But that won't universally be the case, and we'll look, too, at examples of working with data in more qualitative ways.

Understanding the data you have

It's one thing finding some data, but you probably need to manipulate it in some way before you can interrogate it...

We might think of ‘data’ as values stored without context. Through processing that data we can seek to provide context and determine meaning. But even the simplest interrogation requires us to have some understanding of what's in that dataset, and what constitutes ‘good’ data in the first place.

As an example, let’s ‘deconstruct’ some information:

“The appointment with Dr Watt is on Tuesday at 2:30pm at the Heslington Lane surgery.”

This information contains the following fields of data:

  • Who the appointment is with
  • The day (date) of the appointment
  • The time of the appointment
  • The location of the appointment

If you wanted to record appointments in a computer-based system you would need to use separate ‘fields’ for these — for example, separate columns in a spreadsheet table.
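As a rough sketch of what those separate fields might look like, here's the appointment written out as a one-row table using Python's built-in csv module. The field names (who, day, time, location) are our own labels rather than any standard, and a real system would probably store the day and time in a proper date format.

```python
import csv
import io

# Field names are our own labels for the four pieces of data listed above.
fieldnames = ["who", "day", "time", "location"]

# One appointment, split into separate fields.
appointment = {
    "who": "Dr Watt",
    "day": "Tuesday",
    "time": "2:30pm",
    "location": "Heslington Lane surgery",
}

# Write it out as a table with one column per field.
# CSV is a plain-text format that any spreadsheet can open.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
writer.writeheader()
writer.writerow(appointment)

table = buffer.getvalue()
print(table)
```

The first line of the output is the header row (the column names) and the second is the appointment itself, one value per column.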

When faced with an existing dataset, our first challenge might well be to reverse this process and rebuild our understanding of what information these fields convey. If you've got the data from a third-party source, look out for any explanatory notes that might help you with this.
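One way to start that detective work is to list each field alongside the values it actually contains. The sketch below assumes the data arrives as CSV text. The column names and rows are invented for illustration (they're not from any real source), so treat it as a pattern rather than a recipe.

```python
import csv
import io

# Hypothetical third-party data with terse column names (invented for illustration).
raw = """appt_with,appt_day,appt_time,site
Dr Watt,Tuesday,2:30pm,Heslington Lane
Dr Watt,Thursday,9:00am,Heslington Lane
"""

reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

# For each field, collect the distinct values it holds.
# Seeing the values side by side helps you work out what the field means.
summary = {
    name: sorted({row[name] for row in rows})
    for name in reader.fieldnames
}

for name, values in summary.items():
    print(f"{name}: {values}")
```

Even without any explanatory notes, seeing that appt_day only ever holds weekday names tells you a fair bit about what that column is for.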

Datasets and statistics

The words 'datasets' and 'statistics' are often used interchangeably to refer to facts and numbers collected together. In academic research, though, the two are different, and the difference matters: it determines exactly what you are looking at and, crucially, what you can do with it.

Datasets

We've already seen that data is a tricky thing to pin down. We might consider it to be a kind of raw form of information. Ideally someone else will have gathered that material as part of some sort of study, and then stored it somewhere, or maybe we'll have to do that ourselves. If the data has already been collated, we might hope that it exists in the form of a digital dataset: a collection of related sets of information kept in a machine-readable format that can be filtered and searched according to your own criteria, and can be analysed using software such as Excel or SPSS. But we may first have to do some work to get it into that state.
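To give a flavour of what 'filtered and searched according to your own criteria' can mean in practice, here's a small sketch in plain Python. The records and the filter_records helper are made up for illustration. Excel, SPSS and similar tools offer their own, richer ways of doing the same thing.

```python
# Hypothetical records, invented for illustration only.
records = [
    {"respondent": 1, "age": 19, "city": "York"},
    {"respondent": 2, "age": 34, "city": "Leeds"},
    {"respondent": 3, "age": 27, "city": "York"},
]


def filter_records(rows, **criteria):
    """Return the rows whose fields equal every given criterion."""
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]


# Keep only the respondents from York.
york = filter_records(records, city="York")
print([row["respondent"] for row in york])
```

Because each record keeps its values in named fields, the same helper can filter on any field, or on several at once, for example filter_records(records, city="York", age=27).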

When are datasets useful?

Some organisations and researchers collect and store enormous quantities of data that can be used by people as a basis for reasoning, discussion or calculation. People can interrogate and analyse this data in order to explain things.

Data series are often kept over long periods of time and updated regularly with the most recent figures. This means they can be repeatedly interrogated, each time potentially showing new developments. You can start to compare a variable over time and be very up-to-date in what you find.

Data is particularly useful when you want to examine what is happening and come to your own conclusions, i.e. when you are asking questions like 'Why?' or 'How?'


Statistics

Statistics are the result of some human analysis of the raw collected data. Data has been interrogated and further processed in some way and decisions have been made on how to present that data to show a particular view of what is going on.

You will usually see statistics in tables, charts, or graphs, and also as numbers and percentages reported in articles.

Once a statistic is published it is static and only ever refers to that point in time.
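The difference between the two can be sketched in a few lines of Python. The raw records below are invented for illustration, and the percent_enrolled function is just one hypothetical way somebody might choose to summarise them. A different choice would produce a different statistic from the same dataset.

```python
# Hypothetical raw data: one record per person (invented for illustration).
dataset = [
    {"person": "A", "enrolled_in_he": True},
    {"person": "B", "enrolled_in_he": False},
    {"person": "C", "enrolled_in_he": True},
    {"person": "D", "enrolled_in_he": False},
]


def percent_enrolled(rows):
    """Summarise the raw records as a single percentage."""
    if not rows:
        raise ValueError("no records to summarise")
    enrolled = sum(1 for row in rows if row["enrolled_in_he"])
    return 100 * enrolled / len(rows)


# The dataset is the raw material; the statistic is one processed view of it.
statistic = percent_enrolled(dataset)
print(f"{statistic:.0f}% enrolled")
```

The single figure is quick to quote, but it can no longer tell you anything about individual records. For that you'd need to go back to the dataset.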

When are stats useful?

  • When you need a quick figure to evidence a point and you don't want to interrogate the data yourself. The figure may already have been pulled out for you by someone else publishing on the topic, or by an organisation that compiles summary statistics (e.g. GDP for a country in a particular year; the percentage of people who enrolled in higher education in a particular country; the number of crimes reported in a particular place and time).
  • When you want to answer questions like 'how many?' or 'how much?'