Cross-referencing and interrogating information is a very database-y activity, but it's something spreadsheets can do too, to some extent. On this page we take a look at lookup functions, pivot to the topic of pivot tables, then query Google Sheets' QUERY function and Excel's Query Editor.
Lookup functions provide a means to reference data stored as a table elsewhere in a spreadsheet, and so extract values from it for use elsewhere.
Until recently there were three main lookup functions:
LOOKUP | Locates a supplied value in one column and returns a value from the same row in another column |
---|---|
VLOOKUP | Locates a supplied value (or its inferred position in a range of values) in the first column of a range of data and returns the value from the same row in any other specified column |
HLOOKUP | Works the same as VLOOKUP but with data that is arranged in rows rather than columns |
In addition to these functions, there's now also the more-powerful XLOOKUP, and the old power-user favourites of MATCH and INDEX. We'll take a look at those too.
=LOOKUP(lookup-value,lookup-range,result-range)
LOOKUP is a weird one. Superficially it looks like the perfect lookup function: a supplied lookup-value (e.g. ‘Muffins’) is looked up in one column, and the value is returned from the same row of another column. Here it is in action:
fx | =LOOKUP(F2,B:B,D:D) | ||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | ID | Item | Portion (g) | Calories | Lookup | Return | |
2 | 1001 | Apple pie | 200 | 180 | Muffins | 120 | |
3 | 1002 | Chocolate cake | 150 | 200 | |||
4 | 1003 | Flapjack | 160 | 210 | |||
5 | 1004 | Lemon drizzle | 170 | 200 | |||
6 | 1005 | Muffins | 90 | 120 | |||
7 | 1006 | Sausage rolls | 85 | 130 |
But there's a catch. And it's a big one. LOOKUP only works if the lookup range is sorted in ascending order. And if it can't find what you're after it will make a best guess based on where in your list the lookup value should come. If "Muffins" were to appear anywhere else in our list other than in its proper alphabetical place between the lemon drizzle and the sausage rolls, we'd get the wrong result being returned. And that's not good at all.
For that reason, LOOKUP is only really provided in spreadsheets as a legacy function. You're (slightly) safer using one of the others instead...
=VLOOKUP(lookup‑value, lookup‑array, result‑index, sorted?)
lookup‑value | The value you're looking for |
---|---|
lookup‑array | A data table, the first column of which will be matched to the lookup-value |
result‑index | Awkwardly, the number of the column in your lookup-array from which you will retrieve your result, where the first column in the array is 1 |
sorted? | Do you want this lookup to work like LOOKUP, whereby the first column of your lookup-array is sorted in ascending order and 'guesses' can be made as to the value to return (in which case put TRUE here), or are you looking for an exact match in a not-necessarily-sorted range (in which case put FALSE)? |
The function's fiddlier than it needs to be, so let's see a (familiar) example:
fx | =VLOOKUP(F2,B:D,3,FALSE) | ||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | ID | Item | Portion (g) | Calories | Lookup | Return | |
2 | 1001 | Apple pie | 200 | 180 | Muffins | 120 | |
3 | 1002 | Chocolate cake | 150 | 200 | |||
4 | 1003 | Flapjack | 160 | 210 | |||
5 | 1004 | Lemon drizzle | 170 | 200 | |||
6 | 1005 | Muffins | 90 | 120 | |||
7 | 1006 | Sausage rolls | 85 | 130 |
Some things to notice / be aware of:
Let's get to the bottom of this TRUE/FALSE business...
In the example above, the fourth argument was set to FALSE. This means that a calorie value is only returned if "Muffins" is found in column B. Otherwise the formula will return a #N/A error.
So why would we ever want anything else?
Suppose we're marking an exam:
fx | =VLOOKUP(C2,F:G,2,TRUE) | ||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | ID | Exam | Result | Grade | Mark | Grade | |
2 | 1001 | 1 | 68 | C | 0 | Fail | |
3 | 1001 | 2 | 72 | B | 40 | E | |
4 | 1002 | 1 | 56 | D | 50 | D | |
5 | 1002 | 2 | 63 | C | 60 | C | |
6 | 1003 | 1 | 81 | A | 70 | B | |
7 | 1003 | 2 | 76 | B | 80 | A | |
8 | 1004 | 1 | 42 | E |
Our lookup array in F:G gives a set of grade boundaries: "Fail" starts at 0, "E" starts at 40, etc. A table such as this allows us to do a range lookup: rather than looking for an exact match, we want the closest range value below the lookup value supplied. So we put TRUE for that fourth arugment in the VLOOKUP. Now, when we look up the mark of 68 from cell C2, our closest match below 68 is the 60 in row 5 which corresponds to the grade "C".
Range lookups are generally used with numerical data where an exact match is less likely to be a thing. Of course, as with the LOOKUP example we saw earlier, our lookup range needs to be sorted in ascending order for the range lookup to work properly.
When creating a range lookup array, you must include the lowest possible value – in the above example, for instance, a zero is needed, otherwise values below 40 would generate an error.
HLOOKUP works in exactly the same way as VLOOKUP but horizontally rather than vertically: it looks across columns for a match and then down rows for a corresponding value.
MATCH and INDEX can be used together for lookup purposes:
=MATCH(lookup‑value,lookup‑range,type) | Identifies the position of a lookup-value in a lookup-range, based on the type of lookup specified: 0 - exact match 1 - range match -1 - inverted range match (matches upwards rather than downwards) |
---|---|
=INDEX(range,row,[column]) | Returns a value from a range at the row (and/or column) specified |
Let's go back to our list of desserts:
fx | ||||||
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | ID | Item | Lookup | MATCH | INDEX | |
2 | 1001 | Apple pie | Muffins | 6 | 1005 | |
3 | 1002 | Chocolate cake | =MATCH(D2,B:B,0) | =INDEX(A:A,E2) | ||
4 | 1003 | Flapjack | ||||
5 | 1004 | Lemon drizzle | 5 | 1005 | ||
6 | 1005 | Muffins | =MATCH(D2,B2:B7,0) | =INDEX(A2:A7,E2) | ||
7 | 1006 | Sausage rolls |
In E2 we're using MATCH to look for the value "Muffins" (from D2) in B:B, while requiring an exact match (0). The result is 6. We're then passing that result to the formula in F2 where we're looking in A:A for the sixth row.
In row 5 we've done the same thing but expressing the ranges from rows 2:7 rather than using whole columns. This gives a different value for the MATCH because the returned row number is the row in the range, not the absolute row number on the sheet. This is also true of INDEX which is why we still get the right answer in F5.
Of course, the MATCH and INDEX functions can be combined, with the MATCH nested inside the INDEX. For example:
=INDEX(A:A,MATCH(D2,B:B,0))
Although it uses two functions instead of one, this MATCH and INDEX approach is probably the best method for lookups, not least because it doesn't involve the cumbersome hard-coded column numbers of VLOOKUP.
In Google Sheets and Excel for Microsoft 365 there's now an additional function: XLOOKUP...
=XLOOKUP(lookup-value, lookup-range, result-range, [if-not-found], [match-mode], [search-mode])
Fundamentally, this works like LOOKUP, but defaults to the "exact match" method you'd see when using FALSE in a VLOOKUP. However, the three extra arguments to the function let you do some serious fine-tuning of how the lookup functions:
if‑not‑found | You can set the value that's returned if there's no matching value. By default this is #N/A but you could set it to be something else here without the need for an IFERROR function. |
---|---|
match‑mode | A more advanced version of the "sorted?" option in VLOOKUP:
|
search‑mode | How the search is performed:
|
XLOOKUP can work either vertically or horizontally, depending on the ranges selected.
Pivot tables allow you to rearrange a data set so as to be able to view it from different perspectives.
In order to do this, the data must be organised in a pivotable way; you cannot create a pivot table from poorly-organised data.
Let's consider an example:
Students have volunteered to take part in activities to raise funds for charity, and we want to analyse the proceeds to see which activities were most successful, which colleges raised the most, and so on.
We may be tempted to lay our data out like this:
fx | |||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | Student | Year | College | Bean bath | 10k run | Parachute jump | Tandem joust |
2 | David Jones | 2 | Derwith | 60.00 | 75.50 | 55.00 | |
3 | Farrokh Bulsara | 1 | Alcricke | 70.00 | 85.00 | 45.50 | |
4 | Catherine Bush | 2 | Langbrugh | 65.50 | 95.50 | 35.00 |
It's a nice, useable layout: each student with their own row; columns for each activity; values entered for the amounts raised in each case...
The trouble is, that data's already been pivoted!
What does that actually mean? Well, let's say we wanted to tot up the amount raised by each college or yeargroup. It's going to take an awful lot of messing about because our amounts raised are in multiple columns. Really we want to have the amounts raised in a single "Amount" column. And that means we'll also need a corresponding "Activity" column. And that means we need to do something to our data that might seem counter-intuitive — we need to do this:
fx | |||||
A | B | C | D | E | |
---|---|---|---|---|---|
1 | Student | Year | College | Activity | Amount |
2 | David Jones | 2 | Derwith | Bean bath | 60.00 |
3 | David Jones | 2 | Derwith | 10k run | 75.50 |
4 | David Jones | 2 | Derwith | Tandem joust | 55.00 |
5 | Farrokh Bulsara | 1 | Alcricke | Bean bath | 70.00 |
6 | Farrokh Bulsara | 1 | Alcricke | Parachute jump | 45.00 |
7 | Farrokh Bulsara | 1 | Alcricke | Tandem joust | 85.00 |
8 | Catherine Bush | 2 | Langbrugh | 10k run | 65.50 |
9 | Catherine Bush | 2 | Langbrugh | Parachute jump | 95.50 |
9 | Catherine Bush | 2 | Langbrugh | Tandem joust | 35.00 |
This may look odd, particularly the repetition, but it means that each row is a collection of separate data items including all relevant information for each instance of a student doing an activity. And the stuff we want to be able to measure (specifically the amount raised) is in the one easy-to-handle column.
The rules we talked about regarding lists are doubly true here:
Here’s one approach that may help you plan or check your data structure. We’ll continue to use the student fund-raising example.
First, identify the data that provides the individual values that are most important to your analysis — in this case, it’s the amount raised at a specific activity undertaken by a specific student.
Next, surround this with the collection of related attributes:
Also include ‘attributes of attributes’ if you are likely to make use of them. For example, college and year are technically attributes of the student, but since we are storing all the data in one table, these must also be regarded as attributes of the amount, and included with each value.
These attributes should then be the additional columns used in the data set, so each amount has a corresponding activity, student, college and year.
Pivot tables work by grouping like values together. So it will look at our data-set, spot three rows of David Joneses and start merging them. Which is all well and good if we've only got one David Jones in our student cohort. But perhaps we've got two of them (like the one from The Lower Third and the one from The Monkees). If we don't want these two entities mingling together into some frightening gestalt, we're going to need to provide some means of disambiguation. Perhaps the year and/or college would be sufficient to do that, or maybe we're going to need a unique student ID.
If you've inherited a dataset that just isn't pivotable, there are ways to get that data into a pivotable structure.
The easiest way of unpivoting is via Excel's Query Editor (more of which later, but this will serve as a tantalising glimpse ahead):
Unpivoting in Google Sheets is a bit harder. There are a few methods you could use, all of which essentially involve building up a coordinate map of your data and then restructuring it in a new location, including this example using OFFSET, ROUND, MOD, COLUMN and ROW.
There's also the "Monkey Tennis" approach by player0 and Ben Collins which unpivots a sheet using this beautiful formula:
=ARRAYFORMULA( IFERROR( SPLIT( TRIM( TRANSPOSE( SPLIT( TRANSPOSE( QUERY( TRANSPOSE( QUERY( TRANSPOSE( IF( Sheet1!B2:Z<>"", Sheet1!A2:A & "🐒" & Sheet1!B1:1 & "🐒" & Sheet1!B2:Z & "🎾", )),, 500000)),, 500000)), "🎾"))), "🐒"), ""))
And with the arrival of the FLATTEN function there's now an even more elegant way to unpivot in Google Sheets:
=ARRAYFORMULA( QUERY( SPLIT( FLATTEN( Sheet1!A2:A & "🥞" & Sheet1!B1:1 & "🥞" & Sheet1!B2:Z), "🥞"), "Select * where Col2 is not null"))
A pivot table can consist of the following elements:
The only real essential there is the values, though values without any breakdown is just a sum total. So you'll probably want to attach some of your attributes to the values so that you can compare them. You might therefore want to consider which attributes you want to compare: one of these could become the row labels, another the column labels. A filter then allows you to limit the rows, columns and values by another field.
Taking the example we've been playing with so far, our values are the amount raised. We could compare individual students (as rows) against the activities they were involved in (as columns), thereby essentially recreating the layout of that original table (albeit minus the year and college):
fx | |||||||
A | B | C | D | E | |||
---|---|---|---|---|---|---|---|
1 | Student | Bean bath | 10k run | Parachute jump | Tandem joust | ||
2 | David Jones | 60.00 | 75.50 | 55.00 | |||
3 | Farrokh Bulsara | 70.00 | 85.00 | 45.50 | |||
4 | Catherine Bush | 65.50 | 95.50 | 35.00 |
The year or college could then, potentially be used as a filter for our data. Or maybe even two filters. We could even add more column labels or row labels. That way we really could rebuild that first table if we wanted to:
fx | |||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | Student | Year | College | Bean bath | 10k run | Parachute jump | Tandem joust |
2 | David Jones | 2 | Derwith | 60.00 | 75.50 | 55.00 | |
3 | Farrokh Bulsara | 1 | Alcricke | 70.00 | 85.00 | 45.50 | |
4 | Catherine Bush | 2 | Langbrugh | 65.50 | 95.50 | 35.00 |
Here's the thing about pivot tables – the thing that makes them brilliant: it doesn't matter what you put where because it's like Lego — you can just keep trying different things in different places and see what works!
For instance:
Filter Could be year,college, or activity;Less likely to be student. | ||
Column labels Could be student, year, college, or activity. | ||
Row labels Could be student, year, college, or activity. |
Values Will pretty-much almost always be amount. |
Having first ensured that your data is pivotable, and that you have an idea as to what numerical values you're going to be working with, it's time to do some pivoting!
There's also a Recommended Pivot Tables option in the same place that may give you an idea of what’s possible with your data.
If you're specifying a location, don't try to put your table any higher than row 3 of the sheet, because you need to make room for filter controls.
The field you choose for the Values will usually (but not always) be numeric. For numeric data you can choose to display a Sum, Average etc., by choosing the drop-down and selecting Value Field Settings…
The following examples just use the simple nine-row list of three students and their fund-raising antics:
fx | ||||
A | B | C | D | |
---|---|---|---|---|
1 | Sum of Amount | Year | ||
2 | Activity | 1 | 2 | Grand Total |
3 | 10k run | 141 | 141 | |
4 | Bean bath | 70 | 60 | 130 |
5 | Parachute jump | 45 | 95.5 | 140.5 |
6 | Tandem joust | 85 | 90 | 175 |
7 | Grand Total | 200 | 386.5 | 586.5 |
fx | |||||
A | B | C | D | E | |
---|---|---|---|---|---|
1 | Sum of Amount | College | |||
2 | Activity | Alcricke | Derwith | Langbrugh | Grand Total |
3 | 10k run | 75.5 | 65.5 | 141 | |
4 | Bean bath | 70 | 60 | 130 | |
5 | Parachute jump | 45 | 95.5 | 140.5 | |
6 | Tandem joust | 85 | 55 | 35 | 175 |
7 | Grand Total | 200 | 190.5 | 196 | 586.5 |
fx | ||||||||
A | B | C | D | E | F | |||
---|---|---|---|---|---|---|---|---|
1 | Year | 2 | ||||||
2 | ||||||||
3 | Sum of Amount | Activity | ||||||
4 | Student | 10k run | Bean bath | Parachute jump | Tandem joust | Grand Total | ||
5 | Catherine Bush | 65.5 | 95.5 | 35 | 196 | |||
6 | David Jones | 75.5 | 60 | 55 | 190.5 | |||
7 | Grand Total | 141 | 60 | 95.5 | 90 | 386.5 |
If we were doing a proper job at this, our student labels would probably take up more than one column as we'd have a unique studentID and we'd probably divide up the forename and surname fields.
fx | ||
A | B | |
---|---|---|
1 | Year | 2 |
2 | ||
3 | Activity | Sum of Amount |
4 | 10k run | 141 |
5 | Parachute jump | 95.5 |
6 | Tandem joust | 90 |
7 | Bean bath | 60 |
8 | Grand Total | 386.5 |
We've sorted the values column in descending order. We can do things like that — we're allowed.
fx | ||||
A | B | C | D | |
---|---|---|---|---|
1 | Sum of Amount | Year | ||
2 | Activity | 1 | 2 | Grand Total |
3 | 10k run | 2 | 2 | |
4 | Bean bath | 1 | 1 | 2 |
5 | Parachute jump | 1 | 1 | 2 |
6 | Tandem joust | 1 | 2 | 3 |
7 | Grand Total | 3 | 6 | 9 |
While we'd generally be working with numerical data for our values, we can also do basic counts of textual information.
The QUERY function is the most powerful data function in Google Sheets. It provides a method to interrogate a data table and generate a sub-set that is actively linked to the source data but will not change it. This makes it particularly useful in a collaborative context, or where you want to produce several sub-sets from the same data source.
=QUERY(data-array,"query-expression")
The "query-expression" argument is a string of text which defines the columns to be used, the sort orders to be applied, filtering criteria, and even grouping options and calculated values. The expression is based on Structured Query Language (SQL), used by database systems, but it's not difficult to work with for things like straightforward sorting and filtering.
Here's some simple examples where dataList is a named range:
=QUERY(dataList, “select A,F,B”) | Simple selection of three columns |
---|---|
=QUERY(dataList, “select A,F,B order by B desc”) | Select three columns and sort by one of them |
=QUERY(dataList, "select A,F,B where F=3") | Select three columns and filter based on a condition |
Excel doesn't have the QUERY function but it does have a powerful Query Editor.
We've already seen how the Query Editor can be used to 'unpivot' pivoted data, but it has other uses too, not least in terms of cleaning up data (it's great at identifying dates for instance, even when they're misformatted) and reorganising it (in similar ways to QUERY).
In the example above, an index column has been added to disambiguate between students, students' names have been split on space characters, the activities have been 'unpivoted', and the amounts have been reformatted as currency.
The outputted list will be in the form of an Excel Table: a special kind of list with its own set of conventions. A link is maintained between your original list and the new table, which means that changes to your original data can be passed across to your query output. This connection is not live by default: to refresh the table manually, select a cell within it and go to Query > Load > Refresh (or any of the other Refresh buttons about the place). You can automate the refresh in Query > Edit > Properties, where there are options to refresh on opening, and refresh every n minutes.
If you'd prefer your outputted table to be a normal range, you can go to Table Design > Tools > Convert to Range. This will sever the link between this table and your original data, so if your original data is likely to change you might not want to do that.
Google Sheets versions of the files are listed below. In each case you will be prompted to make your own copy of the spreadsheet files, for which you'll need to be signed into a Google account.
You can download the above Google Sheets for use in Excel by going to File > Download
Excel versions of the exercise files can also be found on university-managed machines at T:\IT Training\Essential Spreadsheets
You can map to that drive at home if you use the University's VPN.