SQL Grouping
See basics here:
Table of contents
GROUPING SETS
Given two columns column1
and column2
, the total combinations of grouping sets that can be formed are:
(column1, column2)
Full set(column1)
Singletons(column2)
()
Empty set
The keyword GROUPING SETS
is a way to specify multiple grouping sets in a single GROUP BY
clause.
It is actually a shorthand for UNION ALL
of multiple GROUP BY
clauses.
SELECT column1, column2, COUNT(*) as count
FROM tbl
GROUP BY GROUPING SETS (
(column1),
(column2)
);
Which would have been:
SELECT column1, NULL as column2, COUNT(*) as count
FROM tbl
GROUP BY column1
UNION ALL
SELECT NULL as column1, column2, COUNT(*) as count
FROM tbl
GROUP BY column2;
A union of queries with grouping set (column1)
and (column2)
. This can become very annoying, very quickly.
With GROUPING SETS
you don’t even have to worry about making the queries union-compatible.
GROUPING
GROUPING(column)
is a function that outputs
0
if the column is a member of the grouping set (i.e. it is being grouped by this column)1
if the column is not a member of the grouping set
I personally thought the booleans were counter-intuitive. But think of it as, is this column a grouping, as in, were they smashed together under a certain group?
Try to differentiate being a group by criterion and being a grouping. This function is about the latter.
Basically creates boolean columns indicating whether this column was used as a grouping criterion to generate this row.
Let’s take a look at a crazy example:
SELECT f.film_id,
fa.actor_id,
GROUPING(f.film_id) AS film_id_grouping,
GROUPING(fa.actor_id) AS actor_id_grouping,
COUNT(*) AS cnt
FROM film f
JOIN film_actor fa ON f.film_id = fa.film_id
GROUP BY GROUPING SETS
( (f.film_id),
(fa.actor_id),
(f.film_id, fa.actor_id)
)
HAVING GROUPING(f.film_id) = 0
AND GROUPING(fa.actor_id) = 0;
What is HAVING doing there?
Without the HAVING
above, the resulting top rows will have GROUPING
values of 0
for film_id
and 1
for actor_id
, because the first grouping set is (f.film_id)
.
What the HAVING
does is allows us to select subsets of the GROUPING SETS
(in this case, the last one (f.film_id, fa.actor_id)
). So technically, it’d be like not using the GROUPING SETS
from the beginning and sticking to our usual GROUP BY f.film_id, fa.actor_id
.
CUBE
Remember how we said given two columns column1
and column2
, the total combinations of grouping sets that can be formed are:
(column1, column2)
Full set(column1)
Singletons(column2)
()
Empty set
CUBE
generates this power set (2^n
) for us without having to write them all out.
GROUP BY CUBE (column1, column2)
-- is equivalent to
GROUP BY GROUPING SETS (
(column1, column2)
(column1),
(column2),
(),
)
Again, you can use HAVING
to select subsets of the power set:
HAVING GROUPING(column1) = 0
AND GROUPING(column2) = 0
-- would select the full set
HAVING GROUPING(column1) = 0
AND GROUPING(column2) = 1
-- would select column1 singleton set
-- and so on...
ROLLUP
Unlike CUBE
, which generates the entire power set, ROLLUP
generates the hierarchical subsets.
GROUP BY ROLLUP (c1, c2, c3)
-- is equivalent to
GROUP BY GROUPING SETS (
(c1, c2, c3),
(c1, c2),
(c1),
()
)
It is called hierarchical because it imposes hierarchy in the given order of the columns.
If you compare it to a markdown header, c1
is the #
header, c2
is the ##
header, and c3
is the ###
header.
Because (c1, c2, c3)
comes first and the empty set comes last, our results will return bottom-up aggregation (individualized to general).
ROLLUP
is very useful for generating subtotals.
Take a look at this example:
SELECT rating,
CASE WHEN length >= 120 THEN 'long'
WHEN length < 120 THEN 'short'
ELSE 'unknown'
END AS length_category,
AVG(rental_rate)
FROM film
GROUP BY ROLLUP (rating, 2);
The generated result is:
rating | length_category | avg |
---|---|---|
null | null | 2.98 |
R | short | 3.0094174757281553 |
PG-13 | long | 3.1375409836065574 |
NC-17 | long | 3.0518556701030928 |
PG-13 | short | 2.9107920792079208 |
NC-17 | short | 2.9015044247787611 |
G | short | 2.8185714285714286 |
PG | long | 3.1363414634146341 |
R | long | 2.8595652173913043 |
G | long | 2.99 |
PG | short | 2.99 |
R | null | 2.9387179487179487 |
PG | null | 3.0518556701030928 |
G | null | 2.888876404494382 |
PG-13 | null | 3.0348430493273543 |
NC-17 | null | 2.970952380952381 |