SQL Grouping

See basics here:

Table of contents
  1. GROUPING SETS
    1. GROUPING
  2. CUBE
  3. ROLLUP

GROUPING SETS

Given two columns column1 and column2, the total combinations of grouping sets that can be formed are:

  • (column1, column2) Full set
  • (column1) Singletons
  • (column2)
  • () Empty set

The keyword GROUPING SETS is a way to specify multiple grouping sets in a single GROUP BY clause.

It is actually a shorthand for UNION ALL of multiple GROUP BY clauses.

SELECT column1, column2, COUNT(*) as count
FROM tbl
GROUP BY GROUPING SETS (
  (column1),
  (column2)
);

Which would have been:

SELECT column1, NULL as column2, COUNT(*) as count
FROM tbl
GROUP BY column1
UNION ALL
SELECT NULL as column1, column2, COUNT(*) as count
FROM tbl
GROUP BY column2;

A union of queries with grouping set (column1) and (column2). This can become very annoying, very quickly.

With GROUPING SETS you don’t even have to worry about making the queries union-compatible.

GROUPING

GROUPING(column) is a function that outputs

  • 0 if the column is a member of the grouping set (i.e. it is being grouped by this column)
  • 1 if the column is not a member of the grouping set

I personally thought the booleans were counter-intuitive. But think of it as, is this column a grouping, as in, were they smashed together under a certain group?
Try to differentiate being a group by criterion and being a grouping. This function is about the latter.

Basically creates boolean columns indicating whether this column was used as a grouping criterion to generate this row.

Let’s take a look at a crazy example:

SELECT f.film_id,
       fa.actor_id,
       GROUPING(f.film_id) AS film_id_grouping,
       GROUPING(fa.actor_id) AS actor_id_grouping,
       COUNT(*) AS cnt
FROM film f
     JOIN film_actor fa ON f.film_id = fa.film_id
GROUP BY GROUPING SETS
    ( (f.film_id),
      (fa.actor_id),
      (f.film_id, fa.actor_id)
    )
HAVING GROUPING(f.film_id) = 0
   AND GROUPING(fa.actor_id) = 0;
What is HAVING doing there?

Without the HAVING above, the resulting top rows will have GROUPING values of 0 for film_id and 1 for actor_id, because the first grouping set is (f.film_id).

What the HAVING does is allows us to select subsets of the GROUPING SETS (in this case, the last one (f.film_id, fa.actor_id)). So technically, it’d be like not using the GROUPING SETS from the beginning and sticking to our usual GROUP BY f.film_id, fa.actor_id.


CUBE

Remember how we said given two columns column1 and column2, the total combinations of grouping sets that can be formed are:

  • (column1, column2) Full set
  • (column1) Singletons
  • (column2)
  • () Empty set

CUBE generates this power set (2^n) for us without having to write them all out.

GROUP BY CUBE (column1, column2)

-- is equivalent to

GROUP BY GROUPING SETS (
  (column1, column2)
  (column1),
  (column2),
  (),
)

Again, you can use HAVING to select subsets of the power set:

HAVING GROUPING(column1) = 0
   AND GROUPING(column2) = 0
-- would select the full set

HAVING GROUPING(column1) = 0
   AND GROUPING(column2) = 1
-- would select column1 singleton set
-- and so on...

ROLLUP

Unlike CUBE, which generates the entire power set, ROLLUP generates the hierarchical subsets.

GROUP BY ROLLUP (c1, c2, c3)
-- is equivalent to

GROUP BY GROUPING SETS (
  (c1, c2, c3),
  (c1, c2),
  (c1),
  ()
)

It is called hierarchical because it imposes hierarchy in the given order of the columns.

If you compare it to a markdown header, c1 is the # header, c2 is the ## header, and c3 is the ### header.

Because (c1, c2, c3) comes first and the empty set comes last, our results will return bottom-up aggregation (individualized to general).

ROLLUP is very useful for generating subtotals.

Take a look at this example:

SELECT rating,
    CASE WHEN length >= 120 THEN 'long'
         WHEN length < 120 THEN 'short'
         ELSE 'unknown'
    END AS length_category,
    AVG(rental_rate)
FROM film
GROUP BY ROLLUP (rating, 2);

The generated result is:

ratinglength_categoryavg
nullnull2.98
Rshort3.0094174757281553
PG-13long3.1375409836065574
NC-17long3.0518556701030928
PG-13short2.9107920792079208
NC-17short2.9015044247787611
Gshort2.8185714285714286
PGlong3.1363414634146341
Rlong2.8595652173913043
Glong2.99
PGshort2.99
Rnull2.9387179487179487
PGnull3.0518556701030928
Gnull2.888876404494382
PG-13null3.0348430493273543
NC-17null2.970952380952381