Cross-referencing and interrogating information is a very database-y activity, but it's something spreadsheets can do too, to some extent. On this page we take a look at lookup functions, pivot to the topic of pivot tables, then query Google Sheets' QUERY function and Excel's Query Editor. We also look at some other functions for reorganising data.
Lookup functions provide a means to reference data stored as a table elsewhere in a spreadsheet, and so extract values from it for use elsewhere.
There are four main lookup functions:
LOOKUP | Locates a supplied value in an ordered column and returns a value from the same row in another column |
---|---|
VLOOKUP | Locates a supplied value (or its inferred position in a range of values) in the first column of a range of data and returns the value from the same row in any other specified column — very widely used as the lookup function of choice since 1985 but now superseded by XLOOKUP |
HLOOKUP | Works the same as VLOOKUP but with data that is arranged in rows rather than columns |
XLOOKUP | Introduced in 2019 as an improved version of VLOOKUP and HLOOKUP — it has the immediate simplicity of the LOOKUP function, can work both vertically and horizontally, and can search in various ways without the need for the data to be in a particular order |
In addition to these functions, there's the old power-user favourites of MATCH and INDEX which were often used as an alternative to VLOOKUP before XLOOKUP's arrival on the scene! We'll take a look at those too.
Because of their ubiquity we'll look at all of these methods in turn — there's a very real chance that you'll come across them in other people's spreadsheets. But for future work the XLOOKUP is unquestionably the method to favour.
=LOOKUP(lookup-value,lookup-range,result-range)
LOOKUP is a weird one. Superficially it looks like the perfect lookup function: a supplied lookup-value (e.g. ‘Muffins’) is looked up in one column, and the value is returned from the same row of another column. Here it is in action:
fx | =LOOKUP(F2,B:B,D:D) | ||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | ID | Item | Portion (g) | Calories | Lookup | Return | |
2 | 1001 | Apple pie | 200 | 180 | Muffins | 120 | |
3 | 1002 | Chocolate cake | 150 | 200 | |||
4 | 1003 | Flapjack | 160 | 210 | |||
5 | 1004 | Lemon drizzle | 170 | 200 | |||
6 | 1005 | Muffins | 90 | 120 | |||
7 | 1006 | Sausage rolls | 85 | 130 |
But there's a catch. And it's a big one. LOOKUP only works if the lookup range is sorted in ascending order. And if it can't find what you're after it will make a best guess based on where in your list the lookup value should come. If "Muffins" were to appear anywhere else in our list other than in its proper alphabetical place between the lemon drizzle and the sausage rolls, we'd get the wrong result being returned. And that's not good at all.
For that reason, LOOKUP is only really provided in spreadsheets as a legacy function and pretty much everything about it is replicated in a more straightforward way in the XLOOKUP.
VLOOKUP is one of the most commonly used functions in spreadsheets over the last half a century, and so while it may have been superseded by XLOOKUP we're going to need to take a look at it. Let's see what arguments go into the function:
=VLOOKUP(lookup‑value, lookup‑array, result‑index, sorted?)
lookup‑value | The value you're looking for |
---|---|
lookup‑array | A data table, the first column of which will be matched to the lookup-value |
result‑index | Awkwardly, the number of the column in your lookup-array from which you will retrieve your result, where the first column in the array is 1 |
sorted? | Do you want this lookup to work like LOOKUP, whereby the first column of your lookup-array is sorted in ascending order and 'guesses' can be made as to the value to return (in which case put TRUE here), or are you looking for an exact match in a not-necessarily-sorted range (in which case put FALSE)? |
The function's fiddlier than it needs to be, so let's see a (familiar) example:
fx | =VLOOKUP(F2,B:D,3,FALSE) | ||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | ID | Item | Portion (g) | Calories | Lookup | Return | |
2 | 1001 | Apple pie | 200 | 180 | Muffins | 120 | |
3 | 1002 | Chocolate cake | 150 | 200 | |||
4 | 1003 | Flapjack | 160 | 210 | |||
5 | 1004 | Lemon drizzle | 170 | 200 | |||
6 | 1005 | Muffins | 90 | 120 | |||
7 | 1006 | Sausage rolls | 85 | 130 |
Some things to notice / be aware of:
Let's get to the bottom of this TRUE/FALSE business...
In the example above, the fourth argument was set to FALSE. This means that a calorie value is only returned if "Muffins" is found in column B. Otherwise the formula will return a #N/A error.
So why would we ever want anything else?
Suppose we're marking an exam:
fx | =VLOOKUP(C2,F:G,2,TRUE) | ||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | ID | Exam | Result | Grade | Mark | Grade | |
2 | 1001 | 1 | 68 | C | 0 | Fail | |
3 | 1001 | 2 | 72 | B | 40 | E | |
4 | 1002 | 1 | 56 | D | 50 | D | |
5 | 1002 | 2 | 63 | C | 60 | C | |
6 | 1003 | 1 | 81 | A | 70 | B | |
7 | 1003 | 2 | 76 | B | 80 | A | |
8 | 1004 | 1 | 42 | E |
Our lookup array in F:G gives a set of grade boundaries: "Fail" starts at 0, "E" starts at 40, etc. A table such as this allows us to do a range lookup: rather than looking for an exact match, we want the closest range value below the lookup value supplied. So we put TRUE for that fourth arugment in the VLOOKUP. Now, when we look up the mark of 68 from cell C2, our closest match below 68 is the 60 in row 5 which corresponds to the grade "C".
Range lookups are generally used with numerical data where an exact match is less likely to be a thing. Of course, as with the LOOKUP example we saw earlier, our lookup range needs to be sorted in ascending order for the range lookup to work properly.
When creating a range lookup array, you must include the lowest possible value – in the above example, for instance, a zero is needed, otherwise values below 40 would generate an error.
HLOOKUP works in exactly the same way as VLOOKUP but horizontally rather than vertically: it looks across columns for a match and then down rows for a corresponding value.
Probably the best lookup function in Google Sheets and Excel (2019 or after, and Microsoft 365) is XLOOKUP...
=XLOOKUP(lookup-value, lookup-range, result-range, [if-not-found], [match-mode], [search-mode])
Fundamentally, this works like LOOKUP, but defaults to the "exact match" method you'd see when using FALSE in a VLOOKUP. However, the three extra arguments to the function let you do some serious fine-tuning of how the lookup works:
if‑not‑found | You can set the value that's returned if there's no matching value. By default this is #N/A but you could set it to be something else here without the need for an IFERROR function. |
---|---|
match‑mode | A more advanced version of the "sorted?" option in VLOOKUP — in VLOOKUP wildcards are automatically enabled, and inexact matching relies on the sort order of the dataset being interrogated, whereas XLOOKUP doesn't need the data to be sorted and the matching can be configured more precisely:
|
search‑mode | How the search is performed. Because XLOOKUP doesn't need the data to be sorted it has to perform a 'mental' sort as part of its calculations. That's not usually a problem, but for larger datasets it could slow things down. If you've got a lot of data to search, you can potentially use one of these options to help reduce the calculation time:
|
XLOOKUP can work either vertically or horizontally, depending on the ranges selected. The return range can also be wider than a single column or row to return matches from multiple fields in your dataset.
MATCH and INDEX can be used together for lookup purposes, and were often favoured over VLOOKUP because they didn't rely on hard-coded numbers and column counting. This approach has been superseded by XLOOKUP but you might still come across it so let's take a look:
=MATCH(lookup‑value,lookup‑range,type) | Identifies the position of a lookup-value in a lookup-range, based on the type of lookup specified: 0 - exact match 1 - range match -1 - inverted range match (matches upwards rather than downwards) |
---|---|
=INDEX(range,row,[column]) | Returns a value from a range at the row (and/or column) specified |
Let's go back to our list of desserts:
fx | ||||||
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | ID | Item | Lookup | MATCH | INDEX | |
2 | 1001 | Apple pie | Muffins | 6 | 1005 | |
3 | 1002 | Chocolate cake | =MATCH(D2,B:B,0) | =INDEX(A:A,E2) | ||
4 | 1003 | Flapjack | ||||
5 | 1004 | Lemon drizzle | 5 | 1005 | ||
6 | 1005 | Muffins | =MATCH(D2,B2:B7,0) | =INDEX(A2:A7,E2) | ||
7 | 1006 | Sausage rolls |
In E2 we're using MATCH to look for the value "Muffins" (from D2) in B:B, while requiring an exact match (0). The result is 6. We're then passing that result to the formula in F2 where we're looking in A:A for the sixth row.
In row 5 we've done the same thing but expressing the ranges from rows 2:7 rather than using whole columns. This gives a different value for the MATCH because the returned row number is the row in the range, not the absolute row number on the sheet. This is also true of INDEX which is why we still get the right answer in F5.
Of course, the MATCH and INDEX functions can be combined, with the MATCH nested inside the INDEX. For example:
=INDEX(A:A,MATCH(D2,B:B,0))
Pivot tables allow you to rearrange a dataset so as to be able to view it from different perspectives.
In order to do this, the data must be organised in a pivotable way; you cannot create a pivot table from poorly-organised data.
Let's consider an example:
Students have volunteered to take part in activities to raise funds for charity, and we want to analyse the proceeds to see which activities were most successful, which colleges raised the most, and so on.
We may be tempted to lay our data out like this:
fx | |||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | Student | Year | College | Bean bath | 10k run | Parachute jump | Tandem joust |
2 | David Jones | 2 | Derwith | 60.00 | 75.50 | 55.00 | |
3 | Farrokh Bulsara | 1 | Alcricke | 70.00 | 85.00 | 45.50 | |
4 | Catherine Bush | 2 | Langbrugh | 65.50 | 95.50 | 35.00 |
It's a nice, useable layout: each student with their own row; columns for each activity; values entered for the amounts raised in each case...
The trouble is, that data's already been pivoted!
What does that actually mean? Well, let's say we wanted to tot up the amount raised by each college or yeargroup. It's going to take an awful lot of messing about because our amounts raised are in multiple columns. Really we want to have the amounts raised in a single "Amount" column. And that means we'll also need a corresponding "Activity" column. And that means we need to do something to our data that might seem counter-intuitive — we need to do this:
fx | |||||
A | B | C | D | E | |
---|---|---|---|---|---|
1 | Student | Year | College | Activity | Amount |
2 | David Jones | 2 | Derwith | Bean bath | 60.00 |
3 | David Jones | 2 | Derwith | 10k run | 75.50 |
4 | David Jones | 2 | Derwith | Tandem joust | 55.00 |
5 | Farrokh Bulsara | 1 | Alcricke | Bean bath | 70.00 |
6 | Farrokh Bulsara | 1 | Alcricke | Parachute jump | 45.00 |
7 | Farrokh Bulsara | 1 | Alcricke | Tandem joust | 85.00 |
8 | Catherine Bush | 2 | Langbrugh | 10k run | 65.50 |
9 | Catherine Bush | 2 | Langbrugh | Parachute jump | 95.50 |
9 | Catherine Bush | 2 | Langbrugh | Tandem joust | 35.00 |
This may look odd, particularly the repetition, but it means that each row is a collection of separate data items including all relevant information for each instance of a student doing an activity. And the stuff we want to be able to measure (specifically the amount raised) is in the one easy-to-handle column.
The rules we talked about regarding lists are doubly true here:
Here’s one approach that may help you plan or check your data structure. We’ll continue to use the student fund-raising example.
First, identify the data that provides the individual values that are most important to your analysis — in this case, it’s the amount raised at a specific activity undertaken by a specific student.
Next, surround this with the collection of related attributes:
Also include ‘attributes of attributes’ if you are likely to make use of them. For example, college and year are technically attributes of the student, but since we are storing all the data in one table, these must also be regarded as attributes of the amount, and included with each value.
These attributes should then be the additional columns used in the data set, so each amount has a corresponding activity, student, college and year.
Pivot tables work by grouping like values together. So it will look at our data-set, spot three rows of David Joneses and start merging them. Which is all well and good if we've only got one David Jones in our student cohort. But perhaps we've got two of them (like the one from The Lower Third and the one from The Monkees). If we don't want these two entities mingling together into some frightening chimera, we're going to need to provide some means of disambiguation. Perhaps the year and/or college would be sufficient to do that, or maybe we're going to need a unique student ID.
If you've inherited a dataset that just isn't pivotable, there are ways to get that data into a pivotable structure.
The easiest way of unpivoting is via Excel's Query Editor (more of which later, but this will serve as a tantalising glimpse ahead):
Unpivoting in Google Sheets is a bit harder. There are a few methods you could use, all of which essentially involve building up a coordinate map of your data and then restructuring it in a new location, including this example using OFFSET, ROUND, MOD, COLUMN and ROW.
There's also the "Monkey Tennis" approach by player0 and Ben Collins which unpivots a sheet using this beautiful formula:
=ARRAYFORMULA( IFERROR( SPLIT( TRIM( TRANSPOSE( SPLIT( TRANSPOSE( QUERY( TRANSPOSE( QUERY( TRANSPOSE( IF( Sheet1!B2:Z<>"", Sheet1!A2:A & "🐒" & Sheet1!B1:1 & "🐒" & Sheet1!B2:Z & "🎾", )),, 500000)),, 500000)), "🎾"))), "🐒"), ""))
And with the arrival of the FLATTEN function there's now an even more elegant way to unpivot in Google Sheets:
=ARRAYFORMULA( QUERY( SPLIT( FLATTEN( Sheet1!A2:A & "🥞" & Sheet1!B1:1 & "🥞" & Sheet1!B2:Z), "🥞"), "Select * where Col2 is not null"))
A pivot table can consist of the following elements:
The only real essential there is the values, though values without any breakdown is just a sum total. So you'll probably want to attach some of your attributes to the values so that you can compare them. You might therefore want to consider which attributes you want to compare: one of these could become the row labels, another the column labels. A filter then allows you to limit the rows, columns and values by another field.
Taking the example we've been playing with so far, our values are the amount raised. We could compare individual students (as rows) against the activities they were involved in (as columns), thereby essentially recreating the layout of that original table (albeit minus the year and college):
fx | |||||||
A | B | C | D | E | |||
---|---|---|---|---|---|---|---|
1 | Student | Bean bath | 10k run | Parachute jump | Tandem joust | ||
2 | David Jones | 60.00 | 75.50 | 55.00 | |||
3 | Farrokh Bulsara | 70.00 | 85.00 | 45.50 | |||
4 | Catherine Bush | 65.50 | 95.50 | 35.00 |
The year or college could then, potentially be used as a filter for our data. Or maybe even two filters. We could even add more column labels or row labels. That way we really could rebuild that first table if we wanted to:
fx | |||||||
A | B | C | D | E | F | G | |
---|---|---|---|---|---|---|---|
1 | Student | Year | College | Bean bath | 10k run | Parachute jump | Tandem joust |
2 | David Jones | 2 | Derwith | 60.00 | 75.50 | 55.00 | |
3 | Farrokh Bulsara | 1 | Alcricke | 70.00 | 85.00 | 45.50 | |
4 | Catherine Bush | 2 | Langbrugh | 65.50 | 95.50 | 35.00 |
Here's the thing about pivot tables – the thing that makes them brilliant: it doesn't matter what you put where because it's like Lego — you can just keep trying different things in different places and see what works!
For instance:
Filter Could be year,college, or activity;Less likely to be student. | ||
Column labels Could be student, year, college, or activity. | ||
Row labels Could be student, year, college, or activity. |
Values Will pretty-much almost always be amount. |
Having first ensured that your data is pivotable, and that you have an idea as to what numerical values you're going to be working with, it's time to do some pivoting!
There's also a Recommended Pivot Tables option in the same place that may give you an idea of what’s possible with your data.
If you're specifying a location, don't try to put your table any higher than row 3 of the sheet, because you need to make room for filter controls.
The field you choose for the Values will usually (but not always) be numeric. For numeric data you can choose to display a Sum, Average etc., by choosing the drop-down and selecting Value Field Settings…
The following examples just use the simple nine-row list of three students and their fund-raising antics:
fx | ||||
A | B | C | D | |
---|---|---|---|---|
1 | Sum of Amount | Year | ||
2 | Activity | 1 | 2 | Grand Total |
3 | 10k run | 141 | 141 | |
4 | Bean bath | 70 | 60 | 130 |
5 | Parachute jump | 45 | 95.5 | 140.5 |
6 | Tandem joust | 85 | 90 | 175 |
7 | Grand Total | 200 | 386.5 | 586.5 |
fx | |||||
A | B | C | D | E | |
---|---|---|---|---|---|
1 | Sum of Amount | College | |||
2 | Activity | Alcricke | Derwith | Langbrugh | Grand Total |
3 | 10k run | 75.5 | 65.5 | 141 | |
4 | Bean bath | 70 | 60 | 130 | |
5 | Parachute jump | 45 | 95.5 | 140.5 | |
6 | Tandem joust | 85 | 55 | 35 | 175 |
7 | Grand Total | 200 | 190.5 | 196 | 586.5 |
fx | ||||||||
A | B | C | D | E | F | |||
---|---|---|---|---|---|---|---|---|
1 | Year | 2 | ||||||
2 | ||||||||
3 | Sum of Amount | Activity | ||||||
4 | Student | 10k run | Bean bath | Parachute jump | Tandem joust | Grand Total | ||
5 | Catherine Bush | 65.5 | 95.5 | 35 | 196 | |||
6 | David Jones | 75.5 | 60 | 55 | 190.5 | |||
7 | Grand Total | 141 | 60 | 95.5 | 90 | 386.5 |
If we were doing a proper job at this, our student labels would probably take up more than one column as we'd have a unique studentID and we'd probably divide up the forename and surname fields.
fx | ||
A | B | |
---|---|---|
1 | Year | 2 |
2 | ||
3 | Activity | Sum of Amount |
4 | 10k run | 141 |
5 | Parachute jump | 95.5 |
6 | Tandem joust | 90 |
7 | Bean bath | 60 |
8 | Grand Total | 386.5 |
We've sorted the values column in descending order. We can do things like that — we're allowed.
fx | ||||
A | B | C | D | |
---|---|---|---|---|
1 | Sum of Amount | Year | ||
2 | Activity | 1 | 2 | Grand Total |
3 | 10k run | 2 | 2 | |
4 | Bean bath | 1 | 1 | 2 |
5 | Parachute jump | 1 | 1 | 2 |
6 | Tandem joust | 1 | 2 | 3 |
7 | Grand Total | 3 | 6 | 9 |
While we'd generally be working with numerical data for our values, we can also do basic counts of textual information.
Excel for Microsoft 365 has introduced a PIVOTBY function which allows you to build basic pivot tables via a data function. It can take a number of arguments but only the first four are essential:
=PIVOTBY(row-fields, col-fields, values, function, [field-headers], [row-total-depth], [row-sort-order], [col-total-depth], [col-sort-order], [filter-array], [relative-to])
row‑fields | The range used to create the rows of the pivot. It can be multiple columns. If the columns aren't adjacent you can combine them with HSTACK (for example, HSTACK(A:A,D:D) will bring through column A and column D) as rows in the pivot. It can also be blank. |
---|---|
col‑fields | The range used to create the columns of the pivot. It can work with multiple columns in the same way as row-fields. It can also be blank. |
values | The range containing the values to be aggregated in the pivot. Again, it can be more than one column. |
function | The type of aggregation to be performed on the values (there's a wide range of options). You can apply multiple aggregations using HSTACK (for example, HSTACK(SUM, AVERAGE) |
[field‑headers] | Determines whether headers are brought through into the pivot. |
[row‑total‑depth] | Used to toggle total and subtotal rows for the columns |
[row‑sort‑order] | Used to determine the sort order of the rows. Numbers are assigned to the output columns in your pivot. If your pivot has three columns and you wanted to sort the values in the second column, you would use 2, or -2 to sort in reverse order. |
[col‑total‑depth] | Used to toggle total and subtotal columns for the rows. |
[col‑sort‑order] | Used to determine the sort order of the columns. Works in the same way as row-sort-order |
[filter‑array] | A filter range and logical test, for instance B:B>=2 |
[relative‑to] | Used for aggregate functions in the function argument that require a second argument (for instance, PERCENTOF). |
PIVOTBY could be used to generate all of the example pivot tables in the previous section:
=PIVOTBY(D1:D10,B1:B10,E1:E10,SUM)
You could add field-headers (3) but the repetition of "Activity" might be a bit ugly, so we didn't bother.
=PIVOTBY(D1:D10,C1:C10,E1:E10,SUM)
It's the same formula as before but using a different range for the columns.
=PIVOTBY(A1:A10,D1:D10,E1:E10,SUM,,,,,,B1:B10=2)
There's a lot of empty fields to get past before we can incorporate the filter here. This filter is specifically to 2nd years (2) — you could create a dynamic filter by referencing a cell rather than hard-entering the value.
=PIVOTBY(D1:D10,,E1:E10,SUM,3,,-2,,,B1:B10=2)
This example doesn't bother with any col-fields, but we've turned on field-headers (3to make things a bit clearer. We're also sorting by the second column (the Amount) in descending order (hence the -2.
=PIVOTBY(D1:D10,B1:B10,E1:E10,COUNTA)
A simpler example again, but we've changed the aggregate function to COUNTA which counts anything in the range.
The QUERY function is the most powerful data function in Google Sheets. It provides a method to interrogate a data table and generate a sub-set that is actively linked to the source data but will not change it. This makes it particularly useful in a collaborative context, or where you want to produce several sub-sets from the same data source.
=QUERY(data-array,"query-expression",[headers])
Although you can use it on the sheet that contains the data, you are more likely to use this function on another worksheet in the file. The function is entered in just one cell, which becomes the top left cell of the retrieved data.
data‑array | This is the range of cells containing the data. Since you will most likely be referring to another worksheet, it needs to include the sheet name if you use row-column notation, for example:
data_sheet!A1:G50
You can also refer to entire columns if you don’t know how many rows there will be: data_sheet!A:G But defining this as a named range makes the syntax easier, and allows easy extension of the number of columns used. If you define whole columns in a named range, the query may include blank rows — you can stop them appearing by using an appropriate ‘where’ clause. |
---|---|
"query‑expression" | Enclosed in quotes, this is the query expression which defines the columns to be used, criteria to be applied, sort orders and grouping. The expression can refer to values in other cells in the spreadsheet file. |
[headers] | An optional argument that tells the query how many header rows are at the top of the data-array. |
The "query-expression" argument is a string of text which defines the columns to be used, the sort orders to be applied, filtering criteria, and even grouping options and calculated values. The expression is based on Structured Query Language (SQL), used by database systems, but it's not difficult to work with for things like straightforward sorting and filtering.
Here's some simple examples where dataList is a named range:
=QUERY(dataList, "select A,F,B") | Simple selection (and reordering) of three columns |
---|---|
=QUERY(dataList, "select A,F,B order by B desc") | Select three columns and sort by one of them (in descending order) |
=QUERY(dataList, "select A,F,B order by B desc, C asc") | Select three columns and sort by two of them (in different orders) |
=QUERY(dataList, "select A,F,B where F=3") | Select three columns and filter based on a condition |
Columns referenced in the "where" and "order by" parts of the query do not need to be part of the "select"‑ed columns being pulled through, but do need to exist in the referenced data‑array. So in the third example above, the data can be sorted on column C even though only columns A, F, and B are being presented.
The query expression part of the QUERY function has a number of useful commands and we'll go through them in turn here:
select selects the columns you want to display. These columns can be expressed by their letter names or by their position in the data array (expressed as Col1, Col2, etc.), so if our dataList range starts at column A, then...
=QUERY(dataList, "select A,F,B")
…is the same as…
=QUERY(dataList, "select Col1,Col6,Col2")
The selected columns can be in any order. They just need to exist in the original data, so if the named range dataList is mapped to columns A:F then you couldn't refer to a column G or a Col7 in your query expression.
You don't need to pull through the column as is. You can make certain transformations. For instance, you can do basic maths on values in a queried range using the standard mathematical operators (+, -, *, /). Suppose column F of dataList is a mark out of 100 and we wanted to convert it to a percentage, we could do:
=QUERY(dataList, "select A,F/100,B")
We could then format the column as a percentage in the usual way.
Other transformations you can apply to the selected data include aggregate functions and scalar functions (more of which below).
order by controls the way your data is sorted — for example:
=QUERY(dataList, "select A,F,B order by B desc")
…or…
=QUERY(dataList, "select Col1,Col6,Col2 order by Col2 desc, Col3 asc")
asc orders the data in ascending order by the column specified; desc in descending order. The columns controlling the sort do not need to be the columns select‑ed for display but do need to be available in the data array.
where allows you to filter your data. As with standard functions, numbers and text require slightly different syntax:
Here's a couple of simple examples:
=QUERY(dataList, "select A,F,B where F=3") | Filter to show only the rows where the numerical value 3 is in column F of the dataList array |
---|---|
=QUERY(dataList, "select A,B where F='Sold'") | Filter to show only the rows where the value in column F of dataList is the text Sold |
The usual symbols for equalities and inequalities apply:
=QUERY(dataList, "select A,F,B where F>3") | Filter to show only the rows where a value greater than 3 is in column F |
---|---|
=QUERY(dataList, "select Col1,Col6,Col2 where Col6<=4") | Filter to show only the rows where a value less than or equal to 4 is in the 7th column of dataList |
=QUERY(dataList, "select A,F,B where F<>2") | Filter to show only the rows where the numerical value 2 is not in column F |
Because the query expression is a string of text in its own right (encased in "double-quotes" in the QUERY formula), it's necessary to 'break out' of the string to reference any cell values in your spreadsheet. The concatenation operator & is used to join the cell reference to the rest of the query expression:
=QUERY(dataList, "select A,F,B where D="&F2) | Filter to show only the rows where the value in column D is the numerical value in cell F2 |
---|---|
=QUERY(dataList, "select Col1,Col6,Col2 where Col4>="&F2) | Filter to show only the rows where the value in the 4th column in dataList is greater than or equal to the numerical value in cell F2 |
=QUERY(dataList, "select A,B where F='"&E2&"'") | Filter to show only the rows where the value in column F is the text found in cell E2 |
The latter example needs two &s because it needs to add the final 'single quote' to enclose what is in effect a text string within a text string.
Let's assume that cell E2 contains the word Donkey — then…
=QUERY(dataList, "select A,B where F='Donkey'")
…is fundamentally identical to…
=QUERY(dataList, "select A,B where F='"&E2&"'")
If you want to use only part of a text value in the where clause, use contains instead of = :
=QUERY(dataList, "select A,B where F contains 'br'")
This would filter for all instances of text containing the string br, returning matches for broken, library, and cobra.
Working with dates using the where command can be fiddly. It requires a precise form of syntax:
=QUERY(dataList, "select C,B,E,H where H < date '2005-03-26'")
Notice that:
This creates an interesting challenge if you want to reference a cell value in the expression: if you enter the desired date into your cell in the ‘yyyy-mm-dd’ format, Google Sheets is likely to recognise it as a date and convert it to a ‘proper’ date/time value - which won’t work in the query expression.
To be sure of getting a correct ‘string’ value for the date, enter the date in your usual format and then use functions to generate the string. Suppose the reference date is in cell A1 — so long as it's a recognised date in Google Sheets you could use the TEXT function to convert the date to the appropriate text string:
=QUERY(dataList, "select C,B,E,H where H < date '"&TEXT(A1,"yyyy-mm-dd")&"'")
The file, More Query function examples, includes this and some alternative approaches on the where (dates) tab.
Along with date you can also define formats using number, timeofday, datetime (for combined date and time fields), and boolean (for true/false values).
If your data array includes blank rows (which is likely if you've referred to whole columns), you may find that these blanks appear in the query result (which, depending on the sort order, could be at the top!). To stop these appearing you could add a clause to exclude them. If you had blanks in column A then the following would work:
=QUERY(dataList, "select A,F,B where A<>''" | Filter to exclude rows where column A has a blank text field |
---|---|
=QUERY(dataList, "select A,F,B where A<>0" | Filter to exclude rows where column A has a blank numeric field |
=QUERY(dataList, "select A,F,B where A is not null" | Filter to exclude rows where column A has any kind of blank field |
You can use where to filter on multiple clauses at once using the Boolean operators and and or:
and | When and is used, only rows in which both conditions are true are included |
---|---|
or | When or is used, rows are returned where either condition is true |
Here's an example without cell references:
=QUERY(dataList, "select C,B,E,H where E = 'Derwith' and D=2")
...and here's an equivalent one using cell references:
=QUERY(dataList, "select C,B,E,H where E = '"&G2&"' and D="&H2)
When grouping of records is appropriate, aggregate functions can be included in order to calculate averages, totals etc. In this case you must define the columns to be presented, the function to be used and on which column(s) to group the records:
=QUERY(dataList, "select E, avg(F) group by E")
The above example displays values from column E and the average of grouped values in column F, grouping records together that contain the same value in column E.
The available aggregate functions are:
avg | The mean average of the values in the column specified |
---|---|
count | The number of numerical items in the column specified |
max | The highest of the values in the column specified |
min | The lowest of the values in the column specified |
sum | The sum total of the values in the column specified |
Essentially the aggregate function component — avg(F) in this case — is treated as one of the columns in the select part of the query string, with the group by section telling it what to do. You could just have the aggregate function component on its own to return a grand total, for instance...
=QUERY(dataList, "select avg(F)")
...but as soon as you add any other columns into the mix you would need to add the group by component.
You can use pivot to essentially transpose a grouped aggregate function. It works much like the earlier group by example but we don't need to call the grouping column in the select part:
=QUERY(dataList, "select avg(F) pivot E")
This will create a table with column headers created from the values in column E, and corresponding averages of the values from column F in the next row.
By combining group by and pivot we can build an aggregated cross-reference table:
=QUERY(dataList, "select D, avg(F) group by D pivot E")
The same results can be achieved in Excel for Microsoft 365 using the PIVOTBY function:
=PIVOTBY(D:D,E:E,F:F,AVERAGE,,0,,0,,A:A<>"")
We could use this group and pivot approach to build pivot tables with QUERY. The only thing missing would be the total columns which we would have to add manually. Our five pivot examples from earlier could be rendered with the following formulae:
Q1: Which activities generated the most money in each year group? | =QUERY(A1:E10, "select D, sum(E) group by D pivot B") |
---|---|
Q2: Which activities were most successful in each college? | =QUERY(A1:E10, "select D, sum(E) group by D pivot C") |
Q3: How well did students in year 2 do at raising funds? — which activities raised most? | =QUERY(A1:E10, "select A, sum(E) where B=2 group by A pivot D") |
Q4: Which activities by year 2 students raised the most? | =QUERY(A1:E10, "select D, sum(E) where B=2 group by D order by sum(E) desc") |
Q5: How many activities did each year-group undertake? | =QUERY(A1:E10, "select D, count(E) group by D pivot B") |
Most of the time you'll want to return all of your data in a QUERY, but sometimes you might just be after a certain number of rows, and that's where limit comes in handy. It lets you set a fixed number of rows to return. For instance...
=QUERY(dataList, "select A,F,B limit 10")
...will show just 10 rows of data.
offset removes a given number of rows from the top of your data. This is acting on the returned data, not the original dataset, so if you've used order by to sort your data high to low, offset 1 will remove the highest-ranked item. Because of this, you shouldn't use offset to remove header rows — you can use the headers argument of the QUERY function for that.
offset can be useful for returning a specific record in your data; for instance...
=QUERY(dataList, "select A,F,B order by B desc limit 1 offset 9")
...would return only the 10th record in the sorted range (it's offset 9 because the first record would have an offset of 0).
As well as the aggregate functions mentioned in the group by section, there are scalar functions we can apply to ranges. upper() and lower() change the case of your data (UPPER CASE and lower case), and the rest work with dates and times:
year(), month(), day(), hour(), minute(), second(), millisecond(), quarter(), dayOfWeek(), and toDate()
There's also dateDiff() which you can use in calculations with two values, and now() which allows you to do calculations with the current date and time. These functions all essentially work like their traditional spreadsheet namesakes.
Here's a couple of examples which assume a date-of-birth in column H of dataList:
=QUERY(dataList, "select A,C,B, dateDiff(now(),H)")
…gets you the number of days elapsed since the birthdate, while…
=QUERY(dataList, "select A,C,B, year(now())-year(H)")
…gets you the number of years elapsed.
The optional headers argument in the QUERY function tells the QUERY how many rows of headers your data-array has. If this argument is omitted, the QUERY will guess the appropriate value based on the appearance of the data. Otherwise the options are:
0 | The data has no header rows |
---|---|
1 | The data has one header row which will be applied as the labels in the first row of the output |
2 (etc.) | The data has two header rows which will be merged to form the labels in the first row of the output |
If you've performed transformations to your data such as applying aggregate or scalar functions, these transformations will be used as the resulting header label, even if the header argument has been set as 0.
Within the query expression you can use label to set custom headers. Given the situation just outlined, this is especially useful when performing transformations. Here's an example that creates new labels for each of the outputted columns (though we could always omit labeling anything we were already happy with):
=QUERY(dataList, "select A, C, B, year(now())-year(H) label A 'Student ID', B 'Forename', C 'Surname', year(now())-year(H) 'Age'")
These two Google Sheets files include some examples of using the QUERY function:
In both these examples the dataList worksheet includes module results for a number of (fictitious) students. As most students have taken more than one module, they appear several times. There is also a sheet named otherData which is used to populate drop-down lists etc.
In each of the example sheets, cells shaded in green (usually on the third row) contain the query function. Some comments are included to help you understand the way the functions are used.
Feel free to make copies of these to experiment with (File > Make a copy).
There's more QUERY examples on Google's Query Language Reference:
Excel doesn't have the QUERY function but it does have a powerful Query Editor.
We've already seen how the Query Editor can be used to 'unpivot' pivoted data, but it has other uses too, not least in terms of cleaning up data (it's great at identifying dates for instance, even when they're misformatted) and reorganising it (in similar ways to QUERY).
In the example above, an index column has been added to disambiguate between students, students' names have been split on space characters, the activities have been 'unpivoted', and the amounts have been reformatted as currency.
The outputted list will be in the form of an Excel Table: a special kind of list with its own set of conventions. A link is maintained between your original list and the new table, which means that changes to your original data can be passed across to your query output. This connection is not live by default: to refresh the table manually, select a cell within it and go to Query > Load > Refresh (or any of the other Refresh buttons about the place). You can automate the refresh in Query > Edit > Properties, where there are options to refresh on opening, and refresh every n minutes.
If you'd prefer your outputted table to be a normal range, you can go to Table Design > Tools > Convert to Range. This will sever the link between this table and your original data, so if your original data is likely to change you might not want to do that.
In the last few years, spreadsheets have introduce a whole range of new functions for restructuring and reorganising data. We've already encountered a few of them. But here's some more:
There's now a group of functions that can convert between one-dimensional arrays (like A1:A10 or A1:E1) and two-dimensional arrays (like A1:E10).
A two-dimensional array can be reduced to a longer one-dimensional array. TOROW takes a two-dimensional range (like A1:E10) and turns it into one long row, while TOCOL does the same thing but as one long column. Both have options for setting the reading direction (across rows or down columns) and for determining whether to skip empty cells.
A one-dimensional array such as a column or row can be restructured into a two-dimensional table. WRAPROWS takes a one-dimensional range and splits it across multiple rows, starting a new row after a defined number of columns. WRAPCOLS unsurprisingly does the same thing but slitting across multiple columns after a defined number of rows.
These two sets of functions are effectively opposites. If we had a 5×10 range, A1:E10, we could do =TOCOL(A1:E10) to reduce the data to a single column of fifty rows. Let's assume we did that function in cell G1 — that would give us a collapsed list of values in cells G1:G50 — then we could do =WRAPROWS(G1:G50, 5) to wrap the collapsed data across five columns, thereby reconstituting our original table!
Being able to work with multiple ranges of data has its uses. Here's some more ways we can dynamically combine existing data…
In Google Sheets it's always been possible to merge data by building an array in those curly brace thingummies { }, with commas (,) to divide columns and semicolons (;) to divide rows. For instance…
={"College", "Date founded"; "Derwent", 1965; "Langwith", 1965; "Alcuin", 1967; "Vanbrugh", 1967; "Goodricke", 1968; "Wentworth", 1972}
…would give you a 2×7 table of college founding dates. And any of those values in that constructed array could be cell references, or even ranges so long as the ranges were of a consistent width or height.
But now there are two functions that let you do the same thing. And the good news is that these are also available in Excel for Microsoft 365...
HSTACK takes a list of values or ranges and puts them next to each other across multiple columns. For instance, =HSTACK(A1:A5, A6:A10) would create a new table that moved rows 6:10 of the original table and shoved them to the right of a version of rows 1:5.
VSTACK does the same thing but vertically down rows. So, =VSTACK(A1:E5, G1:K5) would shunt the table at G1:K5 beneath a table made from A1:E5.
You could use these functions to replicate the above college array that uses strings and numbers rather than cell references:
=VSTACK(HSTACK("College", "Date founded"), HSTACK("Derwent", 1965), HSTACK("Langwith", 1965), HSTACK("Alcuin", 1967), HSTACK("Vanbrugh", 1967), HSTACK("Goodricke", 1968), HSTACK("Wentworth", 1972))
VSTACK is also useful for merging tables from multiple sheets:
=VSTACK(Sheet1!A2:Z100, Sheet2!A2:Z100, Sheet3!A2:Z100)
…would bring together the data (but not the headers) from three separate 26×100 ranges. You could wrap each of those referenced ranges in a FILTER function if you needed to leave behind any empty rows.
Google Sheets versions of the files are listed below. In each case you will be prompted to make your own copy of the spreadsheet files, for which you'll need to be signed into a Google account.
You can download the above Google Sheets for use in Excel by going to File > Download
Excel versions of the exercise files can also be found on university-managed machines at T:\IT Training\Essential Spreadsheets
You can map to that drive at home if you use the University's VPN.