SQL-Book

# 10. Grouping Data ---[SIDEBAR]--- "type":"objectives", "title":"Learning Objectives", "text":"After completing this lesson you can:

Use the GROUP BY clause of the SELECT statement to create summary categories for an aggregate calculation.
Use the HAVING clause of the SELECT statement to create filters on the summary categories you create.

" ---[SIDEBAR]--- ## Understanding Data Grouping The last lesson discussed how to use aggregate functions to create summary calculations in a query result. These summary calculations were used to calculate a total (SUM) and an average (AVG) as well as counting rows (COUNT), finding the largest value (MAX), and finding the smallest value (MIN) in a column. When using these functions, the query would return a single row which represents the result of the calculation for all of the values in the column or that met the criteria set in the WHERE clause of the SQL statement. By way of review, the following SQL statement will calculate the average area (in squared inches) of all of the artwork with a style of Renaissance: ---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas|visible owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT AVG(width * height) AS average_renaissance_area FROM work WHERE style = 'Renaissance'; ---[SQL]--- Recall some important elements of this query. The WHERE clause ensures that only the Renaissance artwork will be included in the query result. The values for the width and height columns are multiplied together to calculate the area for each piece of art included in the query result. These calculated areas are summarized by the AVG function which will return the average area. This calculation is aliased with the name average_renaissance_area. The result of the query is a single column (average_renaissance_area) and a single row which has calculated average. This query is helpful if all of you want to know is the average area of the Renaissance artwork. What if you wanted to know the average area of the other styles? One way to obtain these values would be to rerun this query altering the WHERE clause to filter for each of the other styles. While this approach would be effective, it would also be time consuming. Additionally it could lead to error. You might forget to run the query for one or more styles, for example. You could also create groups in your query to calculate the average area for each of the artwork styles in the same query. Grouping involve dividing your query result into categories and performing an aggregate calculation on the values in each category rather than all of the values in the query at once. A query that uses grouping will result in a row for each of the categories. The values for each row will be the result of the aggregate function performed on that group. ---[SIDEBAR]--- "type":"term", "title":"Grouping", "text":"Dividing your query result into categories and performing an aggregate calculation on the values in each category." ---[SIDEBAR]--- ## Creating Groups The GROUP BY clause is used to create these categories. The GROUP BY clause appears in a SQL statement after the WHERE clause (or the FROM clause if there isn't a WHERE clause). The GROUP BY clause specifies one or more columns from the table to be used to create the categories. A separate category is created for each unique value in the column. ---[SIDEBAR]--- "type":"term", "title":"GROUP BY", "text":"The clause in a SELECT statement used to create categories of values within which one or more aggregate function can be applied." ---[SIDEBAR]---

It is easiest to see how GROUP BY works in a couple of examples. Suppose you want to calculate the average area for each style of artwork in the database. Name the calculated field 'average_area_by_style' (without the quotes). You can use a chat with ^ai_name^ to create the SELECT statement that will do this: ---[AI]--- [ { "prompt": "Write a SQL statement to calculate the average area of each style of artwork. Name the calculated field average_area_by_style.", "pretext": "Certainly! Here's the SQL statement to calculate the average area of artworks for each style and name the calculated field as average_area_by_style:", "response": "SELECT style, AVG(width * height) AS average_area_by_style \nFROM work \nGROUP BY style;", "posttext": "This query calculates the average area (width * height) for artworks in each style from the work table. It groups the results by the style column and calculates the average area for each style, naming the calculated field as average_area_by_style." } ] ---[AI]--- **Try the SQL statement.**

---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT style, AVG(width * height) AS average_area_by_style FROM work GROUP BY style; ---[SQL]--- In this example, the GROUP BY clause is used to create categories in the query result based on the values in the style column of the table. Each unique value for style will automatically be represented in a row of the results. All of the works of art will be first classified by style. The values height and width of each artwork will be used to calculate the area for each work of art. A separate average will be calculate for each style of art and returned in the query result. ---[SIDEBAR]--- "type":"note", "title":"NULL values in grouping columns", "text":"If a column in the GROUP BY clause has NULL values, NULL will be one of the categories defined in the query result. All rows with NULL values in the grouping column will be combined for the aggregate calculation. Notice that the style column contains NULL values in the prior example. The rows on the work table with a NULL value for style are grouped together. The average area for these rows will also be calculated and included in the query results." ---[SIDEBAR]--- Notice that the style column is also in the SELECT clause. This provides a label for each of the calculations. Each style is listed and the calculated average for each style is displayed on the same row. ---[SIDEBAR]--- "type":"tip", "title":"The SELECT clause and aggregate functions", "text":"When an aggregate function is used in a SELECT clause, any column that is not an argument of an aggregate function must be part of the GROUP BY clause. In the statement SELECT style, AVG(width*height), notice that width and height are part of the argument for the aggregate function AVG. Style is not an argument for the aggregate function. It must be included in the GROUP BY clause. Additionally, it is a good practice to make sure that any fields you use in the GROUP BY clause are also listed in the SELECT clause so that the values in the query have appropriate labels." ---[SIDEBAR]---

You can include more than one aggregate function in a SQL statement. Suppose you want to calculate the average *and* total area for each style of artwork in the database. Name the calculated fields 'average_area_by_style' and 'total_area_by_style' (without the quotes), respectively. You can use a chat with ^ai_name^ to create the SELECT statement that will do this: ---[AI]--- [ { "prompt": "Write a SQL statement to calculate the average and total area of each style of artwork. Name the calculated fields average_area_by_style and total_style_by_area.", "pretext": "To calculate both the average and total area of each style of artwork, you can use the SQL GROUP BY clause along with the AVG and SUM functions. Here's the SQL statement to achieve that and name the calculated fields as specified:", "response": "SELECT style, AVG(width * height) AS average_area_by_style, SUM(width * height) AS total_area_by_style \nFROM work \nGROUP BY style;", "posttext": "This query calculates the average and total area (width * height) for artworks in each style from the work table. It groups the results by the style column and calculates the average area and total area for each style, naming the calculated fields as average_area_by_style and total_area_by_style, respectively." } ] ---[AI]--- **Try the SQL statement.**

---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT style, AVG(width * height) AS average_area_by_style, SUM(width * height) AS total_area_by_style FROM work GROUP BY style; ---[SQL]--- This example is very similar to the last one. The query calculates both the average and total area for each style of artwork in the database. The result is grouped by style to ensure that separate calculations are made for each unique style. The result is one row for each style each with a calculated average and total area.

Consider that you want to determine the most recent death of an artist for each art style and country. Name the calculated field 'most_recent_death' (without the quotes). You can use a chat with ^ai_name^ to create the SELECT statement that will do this: ---[AI]--- [ { "prompt": "Write a SQL statement to categorize artists by their style and nationality. Determine the most recent death in each group. Name the calculated field most_recent_death.", "pretext": "To categorize artists by their style and nationality and determine the most recent death in each group, you can use the GROUP BY clause along with the MAX function. Here's the SQL statement to achieve that and name the calculated field as most_recent_death:", "response": "SELECT style, nationality, MAX(death) AS most_recent_death \nFROM artist \nGROUP BY style, nationality;", "posttext": "This query groups the artists by their style and nationality, then calculates the most recent death year for each group using the MAX function and names the calculated field as most_recent_death." } ] ---[AI]--- **Try the SQL statement.**

Consider that you want to determine the most recent death of an artist for each art style and country. Name the calculated field 'most_recent_death' (without the quotes). The syntax for a query that accomplishes this is:

---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT style, nationality, MAX(death) AS most_recent_death FROM artist GROUP BY style, nationality; ---[SQL]--- In order to determine the most recent death for each style and nationality, you need to include both fields in the GROUP BY clause. Notice that the query result lists each style once for each nationality. Each row of the query represents the combination of these groups. The most recent death will be determined for the combination of the groups using the MAX function. For example, the row that contains Art Nouveau as the style and Austrian as the nationality contains the largest death year (the most recent death) for an Austrian artist who works in the Art Nouveau style. The arrangement of the groups will follow the pattern seen in the query result. The values on the source table will be categorized by the left-most column in the GROUP BY clause. Each subsequent grouping will be done within those groups. In this example, the artists are first grouped by style and then nationality. This is why you see all of the entries for a given style appear together in the result.

You can also sort by calculated fields. Assume you want to determine most recent death for each style of artist and you also want to arrange the results to show the styles with the most recent deaths first in your results. You can use a chat with ^ai_name^ to create the SELECT statement that will do this: ---[AI]--- [ { "prompt": "Write a SQL statement to determine the most recent death for each artist style. Name the calculated field most_recent_death. Sort the results by most_recent_death in reverse chronological order.", "pretext": "To determine the most recent death for each artist style, sort the results by the most recent death year in reverse chronological order, you can use the GROUP BY clause along with the MAX function and the ORDER BY clause. Here's the SQL statement to achieve that and name the calculated field as most_recent_death:", "response": "SELECT style, MAX(death) AS most_recent_death \nFROM artist \nGROUP BY style \nORDER BY most_recent_death DESC;", "posttext": "This query groups the artists by their style, calculates the most recent death year for each style using the MAX function, and then orders the results in reverse chronological order based on the most_recent_death field." } ] ---[AI]--- **Try the SQL statement.**

---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT style, MAX(death) AS most_recent_death FROM artist GROUP BY style ORDER BY most_recent_death DESC; ---[SQL]--- In this example, the most recent death is calculated for each style of artist using the MAX function. The name of this calculated field is most_recent_death. Notice that the query is sorted by this calculation. The ORDER BY clause arranges the result by most_recent_death in decending order. Because the SELECT clause executes before the ORDER BY clause, you can use the alias for the most recent death in the ORDER BY clause. ---[SIDEBAR]--- "type":"caution", "title":"Using Aliases in the GROUP BY Clause", "text":"If you use an alias to rename a column in the SELECT statement, that alias cannot by used in the GROUP BY clause even though it can be used in the ORDER BY clause (as shown in the last example). This is due to the order of execution. GROUP BY executes before SELECT, so any aliases defined in the SELECT clause will not be in effect when the GROUP BY clause executes." ---[SIDEBAR]--- Here are some important things to keep in mind about using GROUP BY: * You can create as many categories in your query as you need. The GROUP BY clause can have any number of columns. * If you have more than one column in the GROUP BY clause, the values included in each group will be determined by the combination of the categories defined in the GROUP BY clause. This means you will not get summary data for each GROUP BY level. * Every column listed in the GROUP BY clause should also be part of the SELECT clause if it is not an argument for an aggregate function. * Every column listed in the SELECT clause that is not an argument for an aggregate function must be included in the GROUP BY clause. * If the grouping column contains has NULL values, NULL will used as a category. If more than one row has a NULL value, those rows will be grouped together for the aggregate calculation. ## Filtering Groups Not only can you create categories of data using the GROUP BY clause, but you can also create a filter to determine which categories to include in a query result. This is done using the HAVING clause. The filters created in the HAVING clause use logical tests to determine which categories to include in a query result much like the WHERE clause. In fact all of the operators you learned about using in the WHERE clause will also work in the HAVING clause. Those categories that meet the logical test are included in the query result. This is an important distinction between HAVING and WHERE. A logical test in the WHERE clause applies to the individual rows of the source table. It determines which rows to include in a query result. A logical test in the HAVING clause filter groups based on the aggregate calculations that relate to those groups.

This is also best seen in an example. Assume you want to list all of the nationalities with at least 25 artists. Name the calculated field 'number_of_artists' (without the quotes). You can use a chat with ^ai_name^ to create the SELECT statement that will do this: ---[AI]--- [ { "prompt": "Write an SQL statement to list that nationalities with more than 25 artists. Name the calculated field number_of_artists.", "pretext": "To list the nationalities with more than 25 artists and name the calculated field as number_of_artists, you can use the GROUP BY clause along with the COUNT function and a HAVING clause. Here's the SQL statement to achieve that:", "response": "SELECT nationality, COUNT( * ) AS number_of_artists \nFROM artist \nGROUP BY nationality \nHAVING COUNT( * ) > 25;", "posttext": "This query groups the artists by their nationality, calculates the number of artists for each nationality using the COUNT function, and then filters the results using the HAVING clause to only include nationalities with more than 25 artists. The calculated field is named number_of_artists." } ] ---[AI]--- **Try the SQL statement.**

---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT nationality, COUNT(*) AS number_of_artists FROM artist GROUP BY nationality HAVING COUNT(*) > 25; SELECT nationality, COUNT(*) AS number_of_artists FROM artist GROUP BY nationality ---[SQL]--- For comparison, this example also shows the query without the HAVING clause. Without the HAVING clause each of the different nationalities for artists on the artist table are displayed. The number of artists with that nationality are calculated for each. The HAVING clause creates a filter condition on the nationality groups. Only those nationalities with more than 25 artists will be displayed in the query result. ---[SIDEBAR]--- "type":"caution", "title":"Using Aliases in the HAVING Clause", "text":"If you use an alias to rename a column in the SELECT statement that alias cannot by used in the HAVING clause, even though it would be convenient if you could. In the prior example, the HAVING clause is HAVING COUNT(*) > 25. Intuitively, it would be nice if you could restate this as HAVING number_of_artist > 25. This doesn't work because of the order of execution. HAVING executes before SELECT, so any aliases defined in the SELECT clause will not be in effect when the HAVING clause executes." ---[SIDEBAR]--- It is important to note why a WHERE clause could not be used to create the filter for this example. Recall that the WHERE clause creates filters that apply the the individual rows of the source table. Each row on the artist table contains information about a single artist. Each artist only has one nationality, so no single artist could possibly meet the criteria. More importantly, the WHERE clause would not "know" what a category is. Recall that the WHERE clause executes before the GROUP BY clause. Any groups created by the GROUP BY clause would not be available to the WHERE clause anyway. Both of these situations explain why WHERE cannot be used to create filter conditions on the groups or classifications in a query.

It is possible (and appropriate) to use both the HAVING and WHERE clauses in the same query. Consider that you want to list all of the nationalities with at least 25 artists who were born after the year 1800. Name the calculated field 'number_of_artists' (without the quotes). You can use a chat with ^ai_name^ to create the SELECT statement that will do this: ---[AI]--- [ { "prompt": "Write an SQL statement to list that nationalities with more than 25 artists who were born after the year 1800. Name the calculated field number_of_artists.", "pretext": "To list the nationalities with more than 25 artists who were born after the year 1800 and name the calculated field as number_of_artists, you can use the GROUP BY clause along with the COUNT function and a HAVING clause to apply both conditions. Here's the SQL statement to achieve that:", "response": "SELECT nationality, COUNT( * ) AS number_of_artists \nFROM artist \nWHERE birth > 1800 \nGROUP BY nationality \nHAVING COUNT( * ) > 25;", "posttext": "This query filters the artists who were born after the year 1800 using the WHERE clause, groups the remaining artists by their nationality, calculates the number of artists for each nationality using the COUNT function, and then filters the results using the HAVING clause to only include nationalities with more than 25 artists. The calculated field is named number_of_artists." } ] ---[AI]--- **Try the SQL statement.**

---[SQL]--- connection:sql_book_connection schema:canvas diagram:canvas owner:atlas-query result:museum-name-sorted ---[SQL]--- SELECT nationality, COUNT(*) AS number_of_artists FROM artist WHERE birth > 1800 GROUP BY nationality HAVING COUNT(*) > 25; SELECT nationality, COUNT(*) AS number_of_artists FROM artist GROUP BY nationality HAVING COUNT(*) > 25; ---[SQL]--- In this example, only those artists born after the year 1800 will be included in the query result. This is accomplished with the WHERE clause (WHERE birth > 1800). Once this row-level filter is applied, the GROUP BY clause creates catergories of artists based on their nationality. The HAVING clause creates the group filter that only those nationalities with more than 25 artists will be included in the query result. In this case, there are only two nationalities with more than 25 artists born after 1800. The prior example is included here for comparison. In this example, the WHERE clause is omited. This means that all artists will be included in the query result. When all artists are included (not just those born after 1800), there are five nationalities with more than 25 artists. Notice how the effect of the WHERE clause impacts the query results. ## Select Statement Clause Listing Order and Execution Order At this point, all of the clauses of the select statement have been discussed. The table below summarizes the clauses and what they do. It also details the order they appear in the select statement and the order they are executed. #### Select Statement Clauses

| Clause | Description | Required | Where Specified | When Executed | |:------:|-------------|----------|-----------------|---------------| | SELECT | Set the columns or calculations displayed in the results | Yes | 1st | 5th | | FROM | Set the table(s) used in the query | Yes | 2nd | 1st | | WHERE | Apply filters to rows | No | 3rd | 2nd | | GROUP BY | Create groups for aggregate calculations | No (unless there are columns in the SELECT clause that are not an argument to an aggregate function) | 4th | 3rd | | HAVING | Apply filters to groups | No | 5th | 4th | | ORDER BY | Arrange the results of a query | No | 6th | 6th |

The select statement provides a standard way to retrieve data from a database. The six clauses of the select statement are used in specific and standard ways as well. Recall that only the SELECT and FROM clauses are required when creating a select statement. When a clause is not included in a query, it is just omited and the other clauses appear in the prescribed order. You may be wondering why there is a different order that the clauses are listed in a select statement than when they are executed by the DBMS. Perhaps the best reason for this is that they are specified in a way that makes the statement more conversational. If we were to use normal English to request data, we might use a similar structure. We would likely say that we want to see specific data from a particular source and then describe any specialized instructions for how we see the data after that. The order of execution follows a more logical ordering that aids in the automated retrival of the data. While having a different order for executing the statements may seem odd, it is important to understand this ordering. This way you will understand how the different clauses work together and why something you define in one clause of the select statment is "known" in one clause, but not another. ## Summary This lesson described how to use the GROUP BY clause to create categories in a query result. Aggregate functions are then applied to these categories to summaries the values in each. The HAVING clause was also discussed. The HAVING clause is used to create filters on the groups defined in the GROUP BY clause. At this point, all of the clauses of the SELECT statement have been defined. This lesson reviews each, describing their function, when they are included in a SQL statement, and when each is executed by the DBMS.