# 10. Grouping Data
---[SIDEBAR]---
"type":"objectives",
"title":"Learning Objectives",
"text":"After completing this lesson you can:
- Use the GROUP BY clause of the SELECT statement to create summary categories for an aggregate calculation.
- Use the HAVING clause of the SELECT statement to create filters on the summary categories you create.
It is easiest to see how GROUP BY works in a couple of examples. Suppose you want to calculate the average area for each style of artwork in the database. Name the calculated field 'average_area_by_style' (without the quotes). The syntax for a query that accomplishes this is:
---[SQL]---
connection:sql_book_connection
schema:canvas
diagram:canvas
owner:atlas-query
result:museum-name-sorted
---[SQL]---
SELECT style, AVG(width * height) AS average_area_by_style
FROM work
GROUP BY style;
---[SQL]---
In this example, the GROUP BY clause is used to create categories in the query result based on the values in the style column of the table. Each unique value for style will automatically be represented in a row of the results. All of the works of art will be first classified by style. The values height and width of each artwork will be used to calculate the area for each work of art. A separate average will be calculate for each style of art and returned in the query result.
---[SIDEBAR]---
"type":"note",
"title":"NULL values in grouping columns",
"text":"If a column in the GROUP BY clause has NULL values, NULL will be one of the categories defined in the query result. All rows with NULL values in the grouping column will be combined for the aggregate calculation. Notice that the style column contains NULL values in the prior example. The rows on the work table with a NULL value for style are grouped together. The average area for these rows will also be calculated and included in the query results."
---[SIDEBAR]---
Notice that the style column is also in the SELECT clause. This provides a label for each of the calculations. Each style is listed and the calculated average for each style is displayed on the same row.
---[SIDEBAR]---
"type":"tip",
"title":"The SELECT clause and aggregate functions",
"text":"When an aggregate function is used in a SELECT clause, any column that is not an argument of an aggregate function must be part of the GROUP BY clause. In the statement SELECT style, AVG(width*height), notice that width and height are part of the argument for the aggregate function AVG. Style is not an argument for the aggregate function. It must be included in the GROUP BY clause. Additionally, it is a good practice to make sure that any fields you use in the GROUP BY clause are also listed in the SELECT clause so that the values in the query have appropriate labels."
---[SIDEBAR]---
You can include more than one aggregate function in a SQL statement. Suppose you want to calculate the average *and* total area for each style of artwork in the database. Name the calculated fields 'average_area_by_style' and 'total_area_by_style' (without the quotes), respectively. The syntax for a query that accomplishes this is:
---[SQL]---
connection:sql_book_connection
schema:canvas
diagram:canvas
owner:atlas-query
result:museum-name-sorted
---[SQL]---
SELECT style, AVG(width * height) AS average_area_by_style, SUM(width * height) AS total_area_by_style
FROM work
GROUP BY style;
---[SQL]---
This example is very similar to the last one. The query calculates both the average and total area for each style of artwork in the database. The result is grouped by style to ensure that separate calculations are made for each unique style. The result is one row for each style each with a calculated average and total area.
Consider that you want to determine the most recent death of an artist for each art style and country. Name the calculated field 'most_recent_death' (without the quotes). The syntax for a query that accomplishes this is:
---[SQL]---
connection:sql_book_connection
schema:canvas
diagram:canvas
owner:atlas-query
result:museum-name-sorted
---[SQL]---
SELECT style, nationality, MAX(death) AS most_recent_death
FROM artist
GROUP BY style, nationality;
---[SQL]---
In order to determine the most recent death for each style and nationality, you need to include both fields in the GROUP BY clause. Notice that the query result lists each style once for each nationality. Each row of the query represents the combination of these groups. The most recent death will be determined for the combination of the groups using the MAX function. For example, the row that contains Art Nouveau as the style and Austrian as the nationality contains the largest death year (the most recent death) for an Austrian artist who works in the Art Nouveau style.
The arrangement of the groups will follow the pattern seen in the query result. The values on the source table will be categorized by the left-most column in the GROUP BY clause. Each subsequent grouping will be done within those groups. In this example, the artists are first grouped by style and then nationality. This is why you see all of the entries for a given style appear together in the result.
You can also sort by calculated fields. Assume you want to determine most recent death for each style of artist and you also want to arrange the results to show the styles with the most recent deaths first in your results. The syntax for a query that accomplishes this is:
---[SQL]---
connection:sql_book_connection
schema:canvas
diagram:canvas
owner:atlas-query
result:museum-name-sorted
---[SQL]---
SELECT style, MAX(death) AS most_recent_death
FROM artist
GROUP BY style
ORDER BY most_recent_death DESC;
---[SQL]---
In this example, the most recent death is calculated for each style of artist using the MAX function. The name of this calculated field is most_recent_death. Notice that the query is sorted by this calculation. The ORDER BY clause arranges the result by most_recent_death in decending order. Because the SELECT clause executes before the ORDER BY clause, you can use the alias for the most recent death in the ORDER BY clause.
---[SIDEBAR]---
"type":"caution",
"title":"Using Aliases in the GROUP BY Clause",
"text":"If you use an alias to rename a column in the SELECT statement, that alias cannot by used in the GROUP BY clause even though it can be used in the ORDER BY clause (as shown in the last example). This is due to the order of execution. GROUP BY executes before SELECT, so any aliases defined in the SELECT clause will not be in effect when the GROUP BY clause executes."
---[SIDEBAR]---
Here are some important things to keep in mind about using GROUP BY:
* You can create as many categories in your query as you need. The GROUP BY clause can have any number of columns.
* If you have more than one column in the GROUP BY clause, the values included in each group will be determined by the combination of the categories defined in the GROUP BY clause. This means you will not get summary data for each GROUP BY level.
* Every column listed in the GROUP BY clause should also be part of the SELECT clause if it is not an argument for an aggregate function.
* Every column listed in the SELECT clause that is not an argument for an aggregate function must be included in the GROUP BY clause.
* If the grouping column contains has NULL values, NULL will used as a category. If more than one row has a NULL value, those rows will be grouped together for the aggregate calculation.
## Filtering Groups
Not only can you create categories of data using the GROUP BY clause, but you can also create a filter to determine which categories to include in a query result. This is done using the HAVING clause.
The filters created in the HAVING clause use logical tests to determine which categories to include in a query result much like the WHERE clause. In fact all of the operators you learned about using in the WHERE clause will also work in the HAVING clause. Those categories that meet the logical test are included in the query result.
This is an important distinction between HAVING and WHERE. A logical test in the WHERE clause applies to the individual rows of the source table. It determines which rows to include in a query result. A logical test in the HAVING clause filter groups based on the aggregate calculations that relate to those groups.
This is also best seen in an example. Assume you want to list all of the nationalities with at least 25 artists. Name the calculated field 'number_of_artists' (without the quotes). The syntax for a query that accomplishes this is:
---[SQL]---
connection:sql_book_connection
schema:canvas
diagram:canvas
owner:atlas-query
result:museum-name-sorted
---[SQL]---
SELECT nationality, COUNT(*) AS number_of_artists
FROM artist
GROUP BY nationality
HAVING COUNT(*) > 25;
SELECT nationality, COUNT(*) AS number_of_artists
FROM artist
GROUP BY nationality
---[SQL]---
For comparison, this example also shows the query without the HAVING clause. Without the HAVING clause each of the different nationalities for artists on the artist table are displayed. The number of artists with that nationality are calculated for each. The HAVING clause creates a filter condition on the nationality groups. Only those nationalities with more than 25 artists will be displayed in the query result.
---[SIDEBAR]---
"type":"caution",
"title":"Using Aliases in the HAVING Clause",
"text":"If you use an alias to rename a column in the SELECT statement that alias cannot by used in the HAVING clause, even though it would be convenient if you could. In the prior example, the HAVING clause is HAVING COUNT(*) > 25. Intuitively, it would be nice if you could restate this as HAVING number_of_artist > 25. This doesn't work because of the order of execution. HAVING executes before SELECT, so any aliases defined in the SELECT clause will not be in effect when the HAVING clause executes."
---[SIDEBAR]---
It is important to note why a WHERE clause could not be used to create the filter for this example. Recall that the WHERE clause creates filters that apply the the individual rows of the source table. Each row on the artist table contains information about a single artist. Each artist only has one nationality, so no single artist could possibly meet the criteria.
More importantly, the WHERE clause would not "know" what a category is. Recall that the WHERE clause executes before the GROUP BY clause. Any groups created by the GROUP BY clause would not be available to the WHERE clause anyway. Both of these situations explain why WHERE cannot be used to create filter conditions on the groups or classifications in a query.
It is possible (and appropriate) to use both the HAVING and WHERE clauses in the same query. Consider that you want to list all of the nationalities with at least 25 artists who were born after the year 1800. Name the calculated field 'number_of_artists' (without the quotes). The syntax for a query that accomplishes this is:
---[SQL]---
connection:sql_book_connection
schema:canvas
diagram:canvas
owner:atlas-query
result:museum-name-sorted
---[SQL]---
SELECT nationality, COUNT(*) AS number_of_artists
FROM artist
WHERE birth > 1800
GROUP BY nationality
HAVING COUNT(*) > 25;
SELECT nationality, COUNT(*) AS number_of_artists
FROM artist
GROUP BY nationality
HAVING COUNT(*) > 25;
---[SQL]---
In this example, only those artists born after the year 1800 will be included in the query result. This is accomplished with the WHERE clause (WHERE birth > 1800). Once this row-level filter is applied, the GROUP BY clause creates catergories of artists based on their nationality. The HAVING clause creates the group filter that only those nationalities with more than 25 artists will be included in the query result. In this case, there are only two nationalities with more than 25 artists born after 1800.
The prior example is included here for comparison. In this example, the WHERE clause is omited. This means that all artists will be included in the query result. When all artists are included (not just those born after 1800), there are five nationalities with more than 25 artists. Notice how the effect of the WHERE clause impacts the query results.
## Select Statement Clause Listing Order and Execution Order
At this point, all of the clauses of the select statement have been discussed. The table below summarizes the clauses and what they do. It also details the order they appear in the select statement and the order they are executed.
#### Select Statement Clauses
| Clause | Description | Required | Where Specified | When Executed |
|:------:|-------------|----------|-----------------|---------------|
| SELECT | Set the columns or calculations displayed in the results | Yes | 1st | 5th |
| FROM | Set the table(s) used in the query | Yes | 2nd | 1st |
| WHERE | Apply filters to rows | No | 3rd | 2nd |
| GROUP BY | Create groups for aggregate calculations | No (unless there are columns in the SELECT clause that are not an argument to an aggregate function) | 4th | 3rd |
| HAVING | Apply filters to groups | No | 5th | 4th |
| ORDER BY | Arrange the results of a query | No | 6th | 6th |
The select statement provides a standard way to retrieve data from a database. The six clauses of the select statement are used in specific and standard ways as well. Recall that only the SELECT and FROM clauses are required when creating a select statement. When a clause is not included in a query, it is just omited and the other clauses appear in the prescribed order.
You may be wondering why there is a different order that the clauses are listed in a select statement than when they are executed by the DBMS. Perhaps the best reason for this is that they are specified in a way that makes the statement more conversational. If we were to use normal English to request data, we might use a similar structure. We would likely say that we want to see specific data from a particular source and then describe any specialized instructions for how we see the data after that.
The order of execution follows a more logical ordering that aids in the automated retrival of the data. While having a different order for executing the statements may seem odd, it is important to understand this ordering. This way you will understand how the different clauses work together and why something you define in one clause of the select statment is "known" in one clause, but not another.
## Summary
This lesson described how to use the GROUP BY clause to create categories in a query result. Aggregate functions are then applied to these categories to summaries the values in each. The HAVING clause was also discussed. The HAVING clause is used to create filters on the groups defined in the GROUP BY clause. At this point, all of the clauses of the SELECT statement have been defined. This lesson reviews each, describing their function, when they are included in a SQL statement, and when each is executed by the DBMS.