3.5 Group Data

Learning Objectives

  • perform additional aggregations using GROUP BY and HAVING clause
  • discuss how NULLs are or aren’t affected by the GROUP BY and HAVING clauses
  • use the GROUP BY and ORDER BY clauses together to better sort your data

A lot of times, we’ll be looking at the average price for different types of products, or total perchasing for different customer segments. Then we need to aggregate data by groups. In this example, assume we want to know the number of customers we have by each region.

SELECT Region, COUNT(CustomerID) AS total_customers
FROM Customers
GROUP BY Region;

If we were to just have our SELECT statement with Region, COUNT(CustomerID) AS total_customers, but without specifying the variable to GROUP BY, we’re going to get an error return. Because the computer doesn’t know how to count the customer IDs. So we put the GROUP BY clause in the end. There are three things to pay attention:

  1. GROUP BY clauses can contain multiple columns.
  2. Every column in your SELECT statement must be present in a GROUP BY clause, except for aggregated calculations.
  3. NULLs will be grouped together if your GROUP BY column contains NULLs.

HAVING clause works for groups. But WHERE filters on rows not groups.

SELECT CustomerID, COUNT (*) AS orders
FROM Orders
GROUP BY CustomerID
HAVING COUNT (*) >=2;

To sum up, WHERE filters before the data is grouped and then HAVING filters after the data is grouped. Rows eliminated by the WHERE clause will not be included in the GROUP BY clause. It is important to know when you should use WHERE versus HAVING.

Another thing to note about GROUP BY is that it’s always a good practice to use the ORDER BY clause. The GROUP BY does not sort the data in any fashion. It only groups it together. In our previous examples, we have a list of states, a list of regions. It’s not going to sort those regions in alphabetical order. It’s just going to group them by different regions. It is recommended to use ORDER BY in this situation. It makes the results a little easier to read.

SELECT SupplierID, COUNT(*) AS Num_Prod
FROM Products
WHERE UnitPrice >= 4
GROUP BY SupplierID
HAVING COUNT (*) >= 2;