It's time to get stuck in with some data. We begin by looking at the idea of data validation, by which we can make sure our data is consistently entered. We then look at what makes a good list structure we can work with in a spreadsheet, before going about sorting and filtering our dataset. We also have a quick look at subtotaling in Excel.
Although originally designed for numeric data, spreadsheets are a powerful tool for working with broader sets of data that include text values. As well as performing calculations, we can manipulate and interrogate data in other ways, such as sorting and filtering.
A quick web search will render several definitions of ‘data’, the common theme being that they are values without context. Processing data in some way provides context and hence meaning - when it may be referred to as ‘information’. Even simple operations with a spreadsheet require some understanding of what constitutes ‘good’ data.
As an example, let’s ‘deconstruct’ some information:
“The appointment with Dr Watt is on Tuesday at 2:30pm at the Heslington Lane surgery.”
This information contains the following fields of data:
If you wanted to record appointments in a computer-based system you would need to use separate ‘fields’ for these — which would translate to separate columns in a spreadsheet.
Data processing systems struggle if you don’t stick to recognised data types, or if you add in values that don’t match others in the same context:
Data type | Good | Unrecognisable |
---|---|---|
Number | 5 1.6 -350 0.105 | About 10 >5 10-15 25cm |
Date/time | 01/01/2000 23-11-1963 15:30 17:16:20 | 01.01.2000 Mon or Fri Next Tuesday About 10:30 |
Boolean | True False | Maybe ? |
Spreadsheets attempt to determine the data type whenever you enter a value. It will be treated as text if it is not recognised as anything else. This may have an affect on how you're able to interrogate that data. If you represent a number or date in a way that does not allow the program to determine its type correctly, you will not be able to sort and filter correctly, you will not be able to add up, find averages, find the interval between two dates...
Sometimes a single table of rows and columns is not enough. For instance:
You need to work with information about people and the research projects they are involved in. There will be several fields of data about the people, but also several about the projects.
It would be impossible to design one table that is suitable to hold all the data about people and projects, so in this case we create separate tables – one for people and one for projects – and find ways to express the connections between them.
In this example, one person can be involved in many projects, and one project can involve many people. This is a clear indication that the data is relational, and any attempt to work with it using a simple table will entail compromises.
Spreadsheets are not very good with relational data; you often end up with something that 'sort of' works.
If you're using a spreadsheet to store and process complex data, you should be aware of the advantages of using a relational database tool such as Microsoft Access. If your data includes several related sets, or a lot of repeated data, you should probably investigate using a database rather than a spreadsheet. The data can be exported to a spreadsheet for further numeric analysis if needed.
Excel's Data Model tools are looking more and more like Access with every update. They allow you to connect multiple data sources and combine them, relationally, to build queries and advanced pivot tables.
Relational databases can seem daunting at first, and simple options are limited: Access is very much a desktop database, and other options generally require a larger degree of familiarity with database systems. If online operation, access control, and collaboration are key requirements – or if you need to integrate with Google Forms, Mail, or Calendar – Google Sheets would likely be the most sensible option to explore. Sometimes you can use Google Apps Script coding to make Sheets behave a bit more like a database.
The success of any data processing will depend in large part on the quality of the source data you're working with. While you can never ensure 100% accuracy of data entered into a spreadsheet, you can take steps to minimise the likelihood of any errors and improve consistency in the data entered.
Validation is about making sure the data entered is reasonable, at the point it is entered by the user. This can be achieved by:
For example, you might:
If you want people to pick a value from a dropdown list you can choose the List option for the allowed data. For this option you'll need to provide a Source which will be used to populate the values in the dropdown control. You can enter a comma-separated list of items but it's better to provide an existing list of values from elsewhere in the spreadsheet (either as a cell-reference range or named range). Here's some points to note:
There are two list options you can choose from in the "Criteria" list:
These options can be customised extensively:
From the "Criteria" menu it is possible to generate tick boxes in Google Sheets (you can also add them from the Insert menu).
Values can be assigned to the ticked and unticked cells. The default values are TRUE and FALSE but you can also assign numbers or text. You can then use these values in another part of your workbook by referencing the tick box cells accordingly.
If you use a column from an existing dataset as the range for dropdown list values, a list is automatically generated containing one of each of the unique values in that column (rather than a full list of the whole column).
It is also possible to generate a list of unique values from a range using the UNIQUE function (in Google Sheets and Excel for Microsoft 365) — but more on that in a moment!
It's a good idea to create the content for your dropdown lists at the top of an otherwise empty column on another sheet: that way, not only does it keep things tidy, but you can also select the whole column for the list range — any new items added to this column will automatically appear in the dropdown list.
You can validate according to a number of criteria including whether or not the validation matches the outcome of a particular formula.
When entering a custom formula, start with "=" as usual. Cell references work relatively from the first cell in your selected range, so if you're applying validation to cells B2:B10 and want to apply a condition based on the corresponding values in column A, refer to cell A2 in your formula.
Remember: validation tests what's allowed in the cell, not what's prohibited. Be sure to test your validation rigorously!
The way you organise data can also have a huge impact on your ability to process it. The most common approach is using 'tabular' form - a two-dimensional table - which is why spreadsheets are frequently used.
To make full use of the available features when working with structured sets of data (and to make your spreadsheet more accessible, too), some simple rules are best observed:
Here's an example of a good (albeit rather short) list:
fx | ||||
A | B | C | D | |
---|---|---|---|---|
1 | Month | Fruit | Number | |
2 | Jan | Apple | 22 | |
3 | Jan | Orange | 23 | |
4 | Feb | Apple | 31 | |
5 | Feb | Orange | 18 |
...and here's some common mistakes to avoid:
If you're working with a list that is longer/wider than your screen, and labels keep disappearing when you scroll, or you wish to compare data from cells that are a way apart, try one of these approaches:
Changing the zoom level can help if the list is only slightly bigger than the current screen:
The zoom controls are at the bottom right of the Excel window, or you can go to View > Zoom.
As Google Sheets is a browser application, use the browser's built-in zoom options: conventionally Ctrl (or Cmnd) and + / -.
This feature 'locks' rows and/or columns in position so they remain on the screen as you scroll:
Select the cell immediately below and to the right of the rows/columns you want to freeze, and choose View > Window > Freeze panes > Freeze Panes. Use the same control to unfreeze.
If you want to freeze more than two rows or columns, first select the cell that is in the last row/column you want frozen, then choose View > Freeze and select as appropriate (rows and columns are treated separately). Use the same menu to unfreeze.
Excel also includes a split view, which creates up to 4 semi-independently scrollable areas on the sheet. First select the cell where you want the split, then choose View > Window > Split. You can drag the position of the split once it’s there. Use the same button to turn off the split.
There's no equivalent view in Google Sheets, but it is possible to open two separate instances of the same file in two separate browser windows/tabs, allowing you to view and work with two different views of the same document.
Some keyboard short-cuts are useful for finding the extreme right/bottom of a long or wide list:
Ctrl + ↓ | Jump to the last occupied cell in a column |
---|---|
Ctrl + → | Jump to the last occupied cell in a row |
Ctrl + End | Jump to the last occupied cell in a range (Excel only) |
Using up, left and Home keys with Ctrl also works as you would expect.
Sorting a list will re-arrange the rows in a specific order — for example, you could sort by price to find the most expensive item in a list.
Broadly speaking, numbers can be sorted in ascending and descending order, and text can be sorted alphabetically (A-Z and Z-A). You can also sort (in a very basic way) by colour. Since dates & times are essentially numbers, they can be sorted in ascending and descending order too.
Excel has some additional sorting capabilities: it can also sort horizontally as well as vertically (reordering columns rather than rows). It can order items by days of the week and months of the year. You can even create your own custom lists by which to sort.
One workaround for sorting by weekday or month in Google Sheets is to enter the values as dates and then format accordingly.
Sorting is a fundamentally 'destructive' act: it restructures your dataset in a new order, and depending on how your list has been built, it may be difficult to get it back to the order you started with. There are, however, ways to sort in a non-destructive way. And we'll look at those too.
Excel is able to detect the limits of a 'table' of data based on the list structure conventions outlined above. Sorting controls will act on the block of contingent cells around the current active cell.
In this example, there are two detectable tables of data: A1:B4, and D3:E6.
fx | Polly | |||||
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
1 | Forename | Surname | ||||
2 | Susan | Foreman | ||||
3 | Barbara | Wright | Forename | Surname | ||
4 | Ian | Chesterton | Ben | Jackson | ||
5 | Polly | |||||
6 | Jamie | McCrimmon | ||||
7 |
With D5 as the currently selected cell, any sorting we do will only act upon the table D3:E6. A1:B4 would remain in the order it is now. Likewise, if our active cell were anywhere within A1:B4 it would only affect A1:B4 and would not act upon D3:E6.
This table detection is mostly down to the empty blocks of space about the place that cause 'islands' of contingent data. It's worth noting that while cell E5 is empty, it is still identified as being part of a table of data because the spreadsheet assumes a rectangular structure.
Because the headers are in bold, they are detected as headers by Excel and will not be sorted into the data — clever innit?
Controls for sorting can be found at Home > Editing Sort & Filter or at Data > Sort & Filter:
The quick sort icons let you sort your data based on the values within a single column.
The whole list will sort according to the order specified for that column. So for an alphabetical sort in our example spreadsheet above, D3:E6 will reorder according to the contents in column D to give:
Forename | Surname |
---|---|
Ben | Jackson |
Jamie | McCrimmon |
Polly |
If your data contains entirely blank rows/columns, this method won't work properly, and you will probably end up corrupting your data by 'shuffling' it.
If you wanted to sort a Forename / Surname list alphabetically by name, you'd probably want to first sort by forename, and then resort by surname. That way, the surnames would be alphabetical (from the latest sort), and, where there were multiple cases of the same surname, the forenames would preserve their alphabetisation from the previous sort. However, there's easier ways to go about complex sorts like this...
A Custom Sort allows you to:
Here's how to go about doing one:
When working collaboratively, the disadvantage of the ‘standard’ sort/filter tools is that the data is changed for all users. Google Sheets therefore has features designed specifically with collaboration in mind.
There are two approaches, both using the same sort and filter tools:
We'll look at filter views in a moment, so for now we'll consider a conventional, 'destructive' sort.
There are sort options in the Data menu, but they're rather cumbersome. If your list has a header row, either ‘freeze’ the header row (View > Freeze…) or, better still, use the following method for both sorting and filtering:
As with Excel, you can do a degree of multi-field sorting using the single field method, by sorting by your 'secondary' field and then by your 'primary' field. But Google Sheets also has some more advanced options.
To access these properly you must first select the whole data range you're wanting to sort — otherwise you risk corrupting your data by shuffling it. The quickest way to do this is to enable the filters, as per the single field approach we just looked at. This automatically selects the whole data range. Alternatively you could select the range using keyboard methods (Ctrl-A, for instance).
Once you've got your data selected:
Unlike Excel, Google Sheets can't sort horizontally. But you can use the TRANSPOSE function to turn your dataset 90°.
An effective, non-destructive way to work with a dataset is to generate a sub-dataset on another worksheet using data functions. As the output is generated by functions, it will automatically reflect any changes made to the underlying data.
One-such data function is SORT, variations of which exist both within Excel (2021 or later, and Microsoft 365) and Google Sheets. Annoyingly they're slightly different in each program:
=SORT(data-array,[index],[order],[sort-direction])
=SORTBY(data-array,sort-array,order,[sort-array2],[order]...)
=SORT(data-array,[index],[order],[index2],[order]...)
In the above formulae:
Excel's SORTBY and Google's SORT both allow for multiple sort fields — you just keep adding arguments to the function until you're all sorted out.
Let's see an example in action:
fx | |||||
A | B | C | D | E | |
---|---|---|---|---|---|
1 | Fruit | Numbers | Fruit | Numbers | |
2 | Orange | 24 | Apple | 17 | |
3 | Banana | 11 | Banana | 11 | |
4 | Plum | 9 | Orange | 24 | |
5 | Apple | 17 | Plum | 9 |
Filtering a list will make it display only those rows containing values which match a specified criteria, temporarily hiding all the other rows — for example, you might filter to show just those students in Year 1.
Data lists are filtered by enabling a series of drop-down controls in the header row.
The filter options available will depend on the type of data (text or number):
If you enter new records immediately below the current data, the filter range will be automatically expanded. You can then choose the Reapply option from the Sort & Filter menu to re-filter the changed data.
To clear filters:
Google Sheets are collaborative documents. While filtering a list isn't 'destructive' in the same way that a sort is, a filter will still change the appearance of a list, both for you and for anybody else who happens to be in the Sheet at the same time. That being the case, Google Sheets has features designed specifically with collaboration in mind.
There are therefore two approaches, both using the same sort and filter tools:
We'll look at filter views in a moment (no, really we will), but for now we'll consider a conventional filter:
When working collaboratively, the disadvantage of the ‘standard’ sort/filter tools is that the data is changed for all users. Google Sheets' filter views allow you to sort and filter a dataset in a way which doesn't disrupt the underlying data. The filter view effectively sits on top of the source data so does not affect the view of that data for other users, who can create their own filter views too!
Filter views can be named and saved for future use by you and other collaborators. These filter views even have their own unique web address.
Furthermore, collaborators can use a filter view to interrogate data without the need for "Editor"-level access: you can give somebody "Viewer" access to a spreadsheet and still have them able to sort and filter that data thanks to filter views.
To create a new filter view:
An effective, non-destructive way to work with a dataset is to generate a sub-dataset on another worksheet using data functions. As the output is generated by functions, it will automatically reflect any changes made to the underlying data.
One-such data function is FILTER, variations of which exist both within Excel (2021 or later, and Microsoft 365) and Google Sheets. The formulae are slightly different in each program (though can to some extent be used interchangeably):
=FILTER(data-array,condition,[if-empty])
=FILTER(data-array,condition,[condition2]...)
In the above formulae:
Despite appearances, both formulae can handle more than one condition. Excel uses an * to serve as a boolean 'AND' and a + to serve as an 'OR', and these can be placed between bracket-wrapped clauses. Google Sheets can also follow this syntax.
Maybe we should see an example:
fx | |||||
A | B | C | D | E | |
---|---|---|---|---|---|
1 | Fruit | Numbers | Fruit | Numbers | |
2 | Orange | 24 | Banana | 11 | |
3 | Banana | 11 | Apple | 17 | |
4 | Plum | 9 | |||
5 | Apple | 17 |
We've already seen how data functions like SORT and FILTER allow you to work non-destructively with datasets by generating a sub-dataset in a second location. Another such function is UNIQUE:
=UNIQUE(range)
When a data set includes repeated data, UNIQUE will display just one instance of each value (junking the repeats). If you select a two-dimensional array (columns and rows) it will effectively filter out duplicate rows.
This function works in both Google Sheets and Excel (2021 or later and Microsoft 365). Excel also has some extra optional arguments for filtering horizontally and for junking all duplicates (including the first instance).
You can find unique values in earlier versions of Excel by using Data > Data Tools > Remove Duplicates.
Slicers are graphical filter controls which you can add to your spreadsheet. Their appearance can even be customised for extra prettiness! They're of limited use for most working spreadsheets, but can be handy for guiding interaction if you're sharing a dataset with a third party.
In Excel, you can only add a slicer to a special formalised dataset such as a Pivot Table or an Excel Table. Excel Tables work differently to standard ranges, but you can always turn a Table back into a range if you don't like it!
Google Sheets slicers will work on any range of data you choose. Like Filter Views, they can be operated by any viewer of your spreadsheet, regardless of whether they have edit access — handy for letting people play with your data in a guided way without the chance of them breaking anything!
Take a look at this simple slicer example to see it in action.
Subtotals (not to be confused with the SUBTOTAL function) are a feature available in Excel, but not Google Sheets as they act directly on the source data and so are not really appropriate in a collaborative context.
Subtotals are used when a list includes values which fall into categories, and subtotalling of those categories is required. The name is slightly misleading since the feature can perform more calculations than just a sum total.
Extra rows are inserted containing the subtotals and a grand total.
Additional "grouping" controls appear on the left to collapse and expand sections of the worksheet, showing or hiding the data.
Group controls like these can also be added manually (from the same Outline menu in Excel, or from Data > Group in Google Sheets) — useful for when you want to tidy away information that you might occasionally want to see again.
One advantage of using the subtotals feature (as opposed to manually entering subtotals) is that they are very easy to change or remove:
Google Sheets versions of the files are listed below. In each case you will be prompted to make your own copy of the spreadsheet files, for which you'll need to be signed into a Google account.
You can download the above Google Sheets for use in Excel by going to File > Download
Excel versions of the exercise files can also be found on university-managed machines at T:\IT Training\Essential Spreadsheets
You can map to that drive at home if you use the University's VPN.