SQL Grouping
See basics here:
Table of contents
GROUPING SETS
Given two columns column1 and column2, the total combinations of grouping sets that can be formed are:
(column1, column2)Full set(column1)Singletons(column2)()Empty set
The keyword GROUPING SETS is a way to specify multiple grouping sets in a single GROUP BY clause.
It is actually a shorthand for UNION ALL of multiple GROUP BY clauses.
SELECT column1, column2, COUNT(*) as count
FROM tbl
GROUP BY GROUPING SETS (
(column1),
(column2)
);
Which would have been:
SELECT column1, NULL as column2, COUNT(*) as count
FROM tbl
GROUP BY column1
UNION ALL
SELECT NULL as column1, column2, COUNT(*) as count
FROM tbl
GROUP BY column2;
A union of queries with grouping set (column1) and (column2). This can become very annoying, very quickly.
With GROUPING SETS you don’t even have to worry about making the queries union-compatible.
GROUPING
GROUPING(column) is a function that outputs
0if the column is a member of the grouping set (i.e. it is being grouped by this column)1if the column is not a member of the grouping set
I personally thought the booleans were counter-intuitive. But think of it as, is this column a grouping, as in, were they smashed together under a certain group?
Try to differentiate being a group by criterion and being a grouping. This function is about the latter.
Basically creates boolean columns indicating whether this column was used as a grouping criterion to generate this row.
Let’s take a look at a crazy example:
SELECT f.film_id,
fa.actor_id,
GROUPING(f.film_id) AS film_id_grouping,
GROUPING(fa.actor_id) AS actor_id_grouping,
COUNT(*) AS cnt
FROM film f
JOIN film_actor fa ON f.film_id = fa.film_id
GROUP BY GROUPING SETS
( (f.film_id),
(fa.actor_id),
(f.film_id, fa.actor_id)
)
HAVING GROUPING(f.film_id) = 0
AND GROUPING(fa.actor_id) = 0;
What is HAVING doing there?
Without the HAVING above, the resulting top rows will have GROUPING values of 0 for film_id and 1 for actor_id, because the first grouping set is (f.film_id).
What the HAVING does is allows us to select subsets of the GROUPING SETS (in this case, the last one (f.film_id, fa.actor_id)). So technically, it’d be like not using the GROUPING SETS from the beginning and sticking to our usual GROUP BY f.film_id, fa.actor_id.
CUBE
Remember how we said given two columns column1 and column2, the total combinations of grouping sets that can be formed are:
(column1, column2)Full set(column1)Singletons(column2)()Empty set
CUBE generates this power set (2^n) for us without having to write them all out.
GROUP BY CUBE (column1, column2)
-- is equivalent to
GROUP BY GROUPING SETS (
(column1, column2)
(column1),
(column2),
(),
)
Again, you can use HAVING to select subsets of the power set:
HAVING GROUPING(column1) = 0
AND GROUPING(column2) = 0
-- would select the full set
HAVING GROUPING(column1) = 0
AND GROUPING(column2) = 1
-- would select column1 singleton set
-- and so on...
ROLLUP
Unlike CUBE, which generates the entire power set, ROLLUP generates the hierarchical subsets.
GROUP BY ROLLUP (c1, c2, c3)
-- is equivalent to
GROUP BY GROUPING SETS (
(c1, c2, c3),
(c1, c2),
(c1),
()
)
It is called hierarchical because it imposes hierarchy in the given order of the columns.
If you compare it to a markdown header, c1 is the # header, c2 is the ## header, and c3 is the ### header.
Because (c1, c2, c3) comes first and the empty set comes last, our results will return bottom-up aggregation (individualized to general).
ROLLUP is very useful for generating subtotals.
Take a look at this example:
SELECT rating,
CASE WHEN length >= 120 THEN 'long'
WHEN length < 120 THEN 'short'
ELSE 'unknown'
END AS length_category,
AVG(rental_rate)
FROM film
GROUP BY ROLLUP (rating, 2);
The generated result is:
| rating | length_category | avg |
|---|---|---|
| null | null | 2.98 |
| R | short | 3.0094174757281553 |
| PG-13 | long | 3.1375409836065574 |
| NC-17 | long | 3.0518556701030928 |
| PG-13 | short | 2.9107920792079208 |
| NC-17 | short | 2.9015044247787611 |
| G | short | 2.8185714285714286 |
| PG | long | 3.1363414634146341 |
| R | long | 2.8595652173913043 |
| G | long | 2.99 |
| PG | short | 2.99 |
| R | null | 2.9387179487179487 |
| PG | null | 3.0518556701030928 |
| G | null | 2.888876404494382 |
| PG-13 | null | 3.0348430493273543 |
| NC-17 | null | 2.970952380952381 |