Skip to Main Content
University of York Library
Subject Guides

Essential Spreadsheets: a Practical Guide

T4: Processing

Spreadsheets were developed as a tool to store, analyse and manipulate numerical data. They are now commonly used for working with sets of data containing both text and numbers, and for generating graphs and charts. They're also brilliant. Let's play with

Cross-referencing and interrogating information is a very database-y activity, but it's something spreadsheets can do too, to some extent. On this page we take a look at lookup functions, pivot to the topic of pivot tables, then query Google Sheets' QUERY function and Excel's Query Editor.

Lookup functions

Lookup functions provide a means to reference data stored as a table elsewhere in a spreadsheet, and so extract values from it for use elsewhere.

There are three main lookup functions:

LOOKUPLocates a supplied value in one column and returns a value from the same row in another column
VLOOKUPLocates a supplied value (or its inferred position in a range of values) in the first column of a range of data and returns the value from the same row in any other specified column
HLOOKUPWorks the same as VLOOKUP but with data that is arranged in rows rather than columns

In addition to these functions, there's also MATCH and INDEX, and we'll take a look at those too.

LOOKUP

=LOOKUP(lookup-value,lookup-range,result-range)

LOOKUP is a weird one. Superficially it looks like the perfect lookup function: a supplied lookup-value (e.g. ‘Muffins’) is looked up in one column, and the value is returned from the same row of another column. Here it is in action:

fx=LOOKUP(F2,B:B,D:D)
ABCDEFG
1 IDItemPortion (g)CaloriesLookupReturn
2 1001Apple pie200180Muffins120
3 1002Chocolate cake150200
4 1003Flapjack160210
5 1004Lemon drizzle170200
6 1005Muffins90120
7 1006Sausage rolls85130

But there's a catch. And it's a big one. LOOKUP only works if the lookup range is sorted in ascending order. And if it can't find what you're after it will make a best guess based on where in your list the lookup value should come. If "Muffins" were to appear anywhere else in our list other than in its proper alphabetical place between the lemon drizzle and the sausage rolls, we'd get the wrong result being returned. And that's not good at all.

For that reason, LOOKUP is only really provided in spreadsheets as a legacy function. You're (slightly) safer using one of the others instead...

VLOOKUP (and HLOOKUP)

=VLOOKUP(lookup‑value, lookup‑array, result‑index, sorted?)

lookup‑valueThe value you're looking for
lookup‑arrayA data table, the first column of which will be matched to the lookup-value
result‑indexAwkwardly, the number of the column in your lookup-array from which you will retrieve your result, where the first column in the array is 1
sorted?Do you want this lookup to work like LOOKUP, whereby the first column of your lookup-array is sorted in ascending order and 'guesses' can be made as to the value to return (in which case put TRUE here), or are you looking for an exact match in a not-necessarily-sorted range (in which case put FALSE)?

The function's fiddlier than it needs to be, so let's see a (familiar) example:

fx=VLOOKUP(F2,B:D,3,FALSE)
ABCDEFG
1 IDItemPortion (g)CaloriesLookupReturn
2 1001Apple pie200180Muffins120
3 1002Chocolate cake150200
4 1003Flapjack160210
5 1004Lemon drizzle170200
6 1005Muffins90120
7 1006Sausage rolls85130

Some things to notice / be aware of:

  • Our list of data is in columns A:D but we're using B:D as our lookup-array because VLOOKUP needs the matching to happen in the first column of that lookup-array.
  • Column D is the fourth column in the sheet, and the fourth column in our data table, but it's the third column in our lookup-array: B:D, hence the 3 in our formula.
  • Our data in B:B happens to be alphabetical, but we're still looking for "Muffins" so we still want an exact match from our lookup. We should therefore use FALSE in the last argument. If we use TRUE (or miss out the last argument entirely) and the "Muffins" entry wasn't in B:B, we'd get back a misleading 200 from the lemon drizzle instead (more on why in a moment).
  • Because we're using FALSE, the values in column B could be in any order: they don't have to be alphabetical. But if there's more than one entry for "Muffins", only the correspondence from the first entry in the list would be returned.

Range lookups

Let's get to the bottom of this TRUE/FALSE business...

In the example above, the fourth argument was set to FALSE. This means that a calorie value is only returned if "Muffins" is found in column B. Otherwise the formula will return a #N/A error.

So why would we ever want anything else?

Suppose we're marking an exam:

fx=VLOOKUP(C2,F:G,2,TRUE)
ABCDEFG
1 IDExamResultGradeMarkGrade
2 1001168C0Fail
3 1001272B40E
4 1002156D50D
5 1002263C60C
6 1003181A70B
7 1003276B80A
8 1004142E

Our lookup array in F:G gives a set of grade boundaries: "Fail" starts at 0, "E" starts at 40, etc. A table such as this allows us to do a range lookup: rather than looking for an exact match, we want the closest range value below the lookup value supplied. So we put TRUE for that fourth arugment in the VLOOKUP. Now, when we look up the mark of 68 from cell C2, our closest match below 68 is the 60 in row 5 which corresponds to the grade "C".

Range lookups are generally used with numerical data where an exact match is less likely to be a thing. Of course, as with the LOOKUP example we saw earlier, our lookup range needs to be sorted in ascending order for the range lookup to work properly.

Tip

When creating a range lookup array, you must include the lowest possible value – in the above example, for instance, a zero is needed, otherwise values below 40 would generate an error.


HLOOKUP

HLOOKUP works in exactly the same way as VLOOKUP but horizontally rather than vertically: it looks across columns for a match and then down rows for a corresponding value.

MATCH and INDEX

MATCH and INDEX can be used together for lookup purposes:

=MATCH(lookup‑value,lookup‑range,type)Identifies the position of a lookup-value in a lookup-range, based on the type of lookup specified:
 0 - exact match
 1 - range match
-1 - inverted range match (matches upwards rather than downwards)
=INDEX(range,row,[column])Returns a value from a range at the row (and/or column) specified

Let's go back to our list of desserts:

fx
ABCDEF
1 IDItemLookupMATCHINDEX
2 1001Apple pieMuffins61005
3 1002Chocolate cake=MATCH(D2,B:B,0)=INDEX(A:A,E2)
4 1003Flapjack
5 1004Lemon drizzle51005
6 1005Muffins=MATCH(D2,B2:B7,0)=INDEX(A2:A7,E2)
7 1006Sausage rolls

In E2 we're using MATCH to look for the value "Muffins" (from D2) in B:B, while requiring an exact match (0). The result is 6. We're then passing that result to the formula in F2 where we're looking in A:A for the sixth row.

In row 5 we've done the same thing but expressing the ranges from rows 2:7 rather than using whole columns. This gives a different value for the MATCH because the returned row number is the row in the range, not the absolute row number on the sheet. This is also true of INDEX which is why we still get the right answer in F5.

Of course, the MATCH and INDEX functions can be combined, with the MATCH nested inside the INDEX. For example:

=INDEX(A:A,MATCH(D2,B:B,0))

Although it uses two functions instead of one, this MATCH and INDEX approach is probably the best method for lookups, not least because it doesn't involve the cumbersome hard-coded column numbers of VLOOKUP.

Pivot tables

Pivot tables allow you to rearrange a data set so as to be able to view it from different perspectives.

In order to do this, the data must be organised in a pivotable way; you cannot create a pivot table from poorly-organised data.

Pivotable data

Let's consider an example:

Students have volunteered to take part in activities to raise funds for charity, and we want to analyse the proceeds to see which activities were most successful, which colleges raised the most, and so on.

We may be tempted to lay our data out like this:

fx
ABCDEFG
1 StudentYearCollegeBean bath10k runParachute jumpTandem joust
2 David Jones2Derwith60.0075.50 55.00
3 Farrokh Bulsara1Alcricke70.0085.0045.50
4 Catherine Bush2Langbrugh 65.5095.5035.00

It's a nice, useable layout: each student with their own row; columns for each activity; values entered for the amounts raised in each case...

The trouble is, that data's already been pivoted!

What does that actually mean? Well, let's say we wanted to tot up the amount raised by each college or yeargroup. It's going to take an awful lot of messing about because our amounts raised are in multiple columns. Really we want to have the amounts raised in a single "Amount" column. And that means we'll also need a corresponding "Activity" column. And that means we need to do something to our data that might seem counter-intuitive — we need to do this:

fx
ABCDE
1 StudentYearCollegeActivityAmount
2 David Jones2DerwithBean bath60.00
3 David Jones2Derwith10k run75.50
4 David Jones2DerwithTandem joust55.00
5 Farrokh Bulsara1AlcrickeBean bath70.00
6 Farrokh Bulsara1AlcrickeParachute jump45.00
7 Farrokh Bulsara1AlcrickeTandem joust85.00
8 Catherine Bush2Langbrugh10k run65.50
9 Catherine Bush2LangbrughParachute jump95.50
9 Catherine Bush2LangbrughTandem joust35.00

This may look odd, particularly the repetition, but it means that each row is a collection of separate data items including all relevant information for each instance of a student doing an activity. And the stuff we want to be able to measure (specifically the amount raised) is in the one easy-to-handle column.


Data structure rules

The rules we talked about regarding lists are doubly true here:

  • Data should be entered down the page, each item occupying a new row, so that each row contains one instance of the value to be used in calculation (e.g. the amount raised in the above example)
  • Each column should contain just one type of information (e.g. text, number, date), and each cell should contain just one value
  • Enter column headings in one row at the top of the list – never use more, and never merge cells for labels
  • Do not repeat attributes across several columns (e.g. month names) no matter how tempting it might be!
  • Do not leave whole rows or columns empty (some blank cells are fine)
  • Stick to one dataset per tab of a spreadsheet file

Planning the ideal data structure

Here’s one approach that may help you plan or check your data structure. We’ll continue to use the student fund-raising example.

First, identify the data that provides the individual values that are most important to your analysis — in this case, it’s the amount raised at a specific activity undertaken by a specific student.

Next, surround this with the collection of related attributes:

Also include ‘attributes of attributes’ if you are likely to make use of them. For example, college and year are technically attributes of the student, but since we are storing all the data in one table, these must also be regarded as attributes of the amount, and included with each value.

These attributes should then be the additional columns used in the data set, so each amount has a corresponding activity, student, college and year.


Lowest common denominator data

Pivot tables work by grouping like values together. So it will look at our data-set, spot three rows of David Joneses and start merging them. Which is all well and good if we've only got one David Jones in our student cohort. But perhaps we've got two of them (like the one from The Lower Third and the one from The Monkees). If we don't want these two entities mingling together into some frightening gestalt, we're going to need to provide some means of disambiguation. Perhaps the year and/or college would be sufficient to do that, or maybe we're going to need a unique student ID.

Unpivoting 'pivoted' data

If you've inherited a dataset that just isn't pivotable, there are ways to get that data into a pivotable structure.

Excel

The easiest way of unpivoting is via Excel's Query Editor (more of which later, but this will serve as a tantalising glimpse ahead):

  1. Assign a named range to your list of data;
  2. Select any cell within your list
  3. Choose Data > Get & Transform Data > From Table/Range —the Query Editor will open;
  4. Select the columns containing the values you need to unpivot (in our example, the columns with the amounts);
  5. Choose Transform > Any Column > Unpivot Columns;
  6. Rename your unpivoted columns (they'll be called Attribute and Value by default) — select a column and go to Transform > Any Column > Rename (or just double-click the header);
  7. Hit Home > Close & Load — your restructured data will be written to a new list on a new sheet;
  8. This list will be in the form of an Excel Table: a special kind of list with its own set of conventions. If you'd prefer it to be a normal range, you can go to Table Design > Tools > Convert to Range. This will sever the link between this table and your original data, so if your original data is likely to change you might not want to do that.
Google Sheets

Unpivoting in Google Sheets is a bit harder. There are a few methods you could use, all of which involve building up a coordinate map of your data and then restructuring it in a new location, including this example using OFFSET, ROUND, MOD, COLUMN and ROW.

However, perhaps the most elegant method is the player0/Ben Collins approach which unpivots a sheet using this beautiful formula:

=ARRAYFORMULA( QUERY( IFERROR( SPLIT( TRIM( TRANSPOSE( SPLIT( TRANSPOSE( QUERY( TRANSPOSE( QUERY( TRANSPOSE( IF( Sheet1!B2:Z<>"", Sheet1!A2:A&"🐒"&Sheet1!B1:1&"🐒"&Sheet1!B2:Z&"🎾", )),, 500000)),, 500000)), "🎾"))), "🐒"), ""), "SELECT * OFFSET 1"))

  1. Make a copy of your list on a new sheet and reduce it to an index column and the columns you need to pivot (in our case we'd have the students column as our index and the activities columns for unpivoting. This effectively gives us a header row, a header column, and our values;
  2. Use the "monkey tennis" formula on a new sheet, making sure that it points to the sheet with your data on (the formula assumes it's called Sheet1);
  3. The formula basically reduces your list to a coordinate set (using some emoji delimiters) and then breaks it apart in its new form. You'll need to add new headers for your resulting unpivoted data;
  4. In our case we need to add back the student attributes of year and college which we could do with a VLOOKUP (or similar formula) back to our original data.

Pivot table anatomy

A pivot table can consist of the following elements:

  • A field that contains the values to be used in calculation
  • A field to be used as labels down the left of your table (row labels)
  • A field to be used as labels across the top of your table (column labels)
  • A field to be used for filtering your data.

The only real essential there is the values, though values without any breakdown is just a sum total. So you'll probably want to attach some of your attributes to the values so that you can compare them. You might therefore want to consider which attributes you want to compare: one of these could become the row labels, another the column labels. A filter then allows you to limit the rows, columns and values by another field.

Taking the example we've been playing with so far, our values are the amount raised. We could compare individual students (as rows) against the activities they were involved in (as columns), thereby essentially recreating the layout of that original table (albeit minus the year and college):

fx
ABCDE
1 StudentBean bath10k runParachute jumpTandem joust
2 David Jones60.0075.50 55.00
3 Farrokh Bulsara70.0085.0045.50
4 Catherine Bush 65.5095.5035.00

The year or college could then, potentially be used as a filter for our data. Or maybe even two filters. We could even add more column labels or row labels. That way we really could rebuild that first table if we wanted to:

fx
ABCDEFG
1 StudentYearCollegeBean bath10k runParachute jumpTandem joust
2 David Jones2Derwith60.0075.50 55.00
3 Farrokh Bulsara1Alcricke70.0085.0045.50
4 Catherine Bush2Langbrugh 65.5095.5035.00

Here's the thing about pivot tables – the thing that makes them brilliant: it doesn't matter what you put where because it's like Lego — you can just keep trying different things in different places and see what works!

For instance:

Filter

Could be year,college, or activity;
Less likely to be student.
 
 
 

Column labels

Could be student, year, college, or activity.

Row labels

Could be student, year, college, or activity.
 

Values

Will pretty-much almost always be amount.

Creating pivot tables

Having first ensured that your data is pivotable, and that you have an idea as to what numerical values you're going to be working with, it's time to do some pivoting!

Excel
  1. Select any cell in your dataset and choose Insert > Tables > PivotTable — the data range is indicated and a dialogue opens with the range already entered.
  2. Tip

    There's also a Recommended Pivot Tables option in the same place that may give you an idea of what’s possible with your data.

  3. You will probably want your pivot table on a new sheet, but you can opt to put it on an existing sheet (in which case you'll need to define the position of the top-left cell of the area to be used).
  4. Tip

    If you're specifying a location, don't try to put your table any higher than row 3 of the sheet, because you need to make room for filter controls.

  5. Choose OK and the empty framework for your pivot will be created.
  6. Using the panel that appears on the right, drag fields from the list into the appropriate areas below. As you do this, the used fields will be ticked. Alternatively you can tick the required fields and have Excel decide where to put them (you can always drag to rearrange if you disagree with it).

The field you choose for the Values will usually (but not always) be numeric. For numeric data you can choose to display a Sum, Average etc., by choosing the drop-down and selecting Value Field Settings…


Google Sheets
  1. Select any cell in your dataset and choose Data > Pivot table — the data range is indicated and a dialogue opens with the range already entered.
  2. You will probably want your pivot table on a new sheet, but you can opt to put it on an existing sheet (in which case you'll need to define the position of the top-left cell of the area to be used).
  3. Choose Create and the empty framework for your pivot will be created.
  4. Using the Pivot table editor pane on the right, add fields in the locations you require using the Add buttons. The panel will also suggest options it thinks you might want to try.

Some example pivots

The following examples just use the simple nine-row list of three students and their fund-raising antics:


Q1: Which activities generated the most money in each year group?

  • Columns: year
  • Rows: activity
  • Values: amount
fx
ABCD
1 Sum of AmountYear
2 Activity12Grand Total
3 10k run141141
4 Bean bath7060130
5 Parachute jump4595.5140.5
6 Tandem joust8590175
7 Grand Total200386.5586.5

Q2: Which activities were most successful in each college?

  • Columns: college
  • Rows: activity
  • Values: amount
fx
ABCDE
1 Sum of AmountCollege
2 ActivityAlcrickeDerwithLangbrughGrand Total
3 10k run75.565.5141
4 Bean bath7060130
5 Parachute jump4595.5140.5
6 Tandem joust855535175
7 Grand Total200190.5196586.5

Q3: How well did students in year 2 do at raising funds? — which activities raised most?

  • Columns: activity
  • Rows: student
  • Values: amount
  • Filter: year 2
fx
ABCDEF
1 Year2
2
3 Sum of AmountActivity
4 Student10k runBean bathParachute jumpTandem joustGrand Total
5 Catherine Bush65.595.535196
6 David Jones75.560 55190.5
7 Grand Total1416095.590386.5

Tip

If we were doing a proper job at this, our student labels would probably take up more than one column as we'd have a unique studentID and we'd probably divide up the forename and surname fields.


Q4: Which activities by year 2 students raised the most?

  • Columns: none
  • Rows: activity
  • Values: amount
  • Filter: year 2
fx
AB
1 Year2
2
3 ActivitySum of Amount
4 10k run141
5 Parachute jump95.5
6 Tandem joust90
7 Bean bath60
8 Grand Total386.5

Tip

We've sorted the values column in descending order. We can do things like that — we're allowed.


Q5: How many activities did each year-group undertake?

  • Columns: year
  • Rows: activity
  • Values: activity summarised by COUNTA
fx
ABCD
1 Sum of AmountYear
2 Activity12Grand Total
3 10k run22
4 Bean bath112
5 Parachute jump112
6 Tandem joust123
7 Grand Total369

Tip

While we'd generally be working with numerical data for our values, we can also do basic counts of textual information.


QUERY function (Google Sheets)

Google Sheets

The QUERY function is the most powerful data function in Google Sheets. It provides a method to interrogate a data table and generate a sub-set that is actively linked to the source data but will not change it. This makes it particularly useful in a collaborative context, or where you want to produce several sub-sets from the same data source.

=QUERY(data-array,"query-expression")

The "query-expression" argument is a string of text which defines the columns to be used, the sort orders to be applied, filtering criteria, and even grouping options and calculated values. The expression is based on Structured Query Language (SQL), used by database systems, but it's not difficult to work with for things like straightforward sorting and filtering.

Here's some simple examples where dataList is a named range:

=QUERY(dataList, “select A,F,B”) Simple selection of three columns
=QUERY(dataList, “select A,F,B order by B desc”)Select three columns and sort by one of them
=QUERY(dataList, "select A,F,B where F=3")Select three columns and filter based on a condition

Query Editor (Excel)

Excel

Excel doesn't have the QUERY function but it does have a powerful Query Editor.

We've already seen how the Query Editor can be used to 'unpivot' pivoted data, but it has other uses too, not least in terms of cleaning up data (it's great at identifying dates for instance, even when they're misformatted) and reorganising it (in similar ways to QUERY).

  1. Assign a named range to your list of data;
  2. Select any cell within your list
  3. Choose Data > Get & Transform Data > From Table/Range —the Query Editor will open:
  4. Columns can be sorted and filtered from the dropdown toggles, reordered by dragging, and deleted from the menu ribbon, where there are loads of other tools you can apply to your data. Each change you make is recorded in the applied steps panel to the right of the editor, where you can delete steps or reorder them.
  5. Tip

    In the example above, an index column has been added to disambiguate between students, students' names have been split on space characters, the activities have been 'unpivoted', and the amounts have been reformatted as currency.

  6. When you're happy with your changes, hit Home > Close & Load — your restructured data will be written to a new list on a new sheet.

The outputted list will be in the form of an Excel Table: a special kind of list with its own set of conventions. A link is maintained between your original list and the new table, which means that changes to your original data can be passed across to your query output. This connection is not live by default: to refresh the table manually, select a cell within it and go to Query > Load > Refresh (or any of the other Refresh buttons about the place). You can automate the refresh in Query > Edit > Properties, where there are options to refresh on opening, and refresh every n minutes.

If you'd prefer your outputted table to be a normal range, you can go to Table Design > Tools > Convert to Range. This will sever the link between this table and your original data, so if your original data is likely to change you might not want to do that.

Exercises
Google Sheets

Google Sheets versions of the files are listed below. In each case you will be prompted to make your own copy of the spreadsheet files, for which you'll need to be signed into a Google account.




Excel

You can download the above Google Sheets for use in Excel by going to File > Download

Excel versions of the exercise files can also be found on university-managed machines at T:\IT Training\Essential Spreadsheets

You can map to that drive at home if you use the University's VPN.