SQL Query Basics
Examples are written with PostgreSQL dialect.
Table of contents
Overall Order of Clauses
The overall clause order is:
SELECT
FROM
WHERE
GROUP BY
HAVING
ORDER BY
LIMIT
SELECT Clause
SELECT target_list
FROM relations
WHERE qualification
SELECT DISTINCT
SELECT DISTINCT release_year, rating
FROM film;
When multiple columns are selected, SELECT DISTINCT
applies across all those columns.
SELECT DISTINCT
does not rule out NULL
value. It includes NULL
value in the result set.
Range Variables / Tuple Variables
Basically, table aliases.
It is optional in unambiguous situations, but considered good practice to use them.
SELECT f.title
FROM film AS f;
Also, AS
is optional, but I prefer to use it for clarity:
SELECT f.title
FROM film f;
Column Aliases
SELECT DISTINCT release_year AS year, rating r
FROM film;
Again, AS
is optional.
Aliasing with =
Not many RDBMS support using assignment operator =
for aliasing, but it is a thing…
SELECT year=release_year
...
To use keywords, special characters, or whitespace in aliases, you need to enclose the alias in double quotes:
SELECT something AS "this is a column"
Using Arithmetic Expressions as Column
SELECT length / 60 AS hours, length % 60 AS minutes
FROM film;
LIMIT Clause
SELECT title
FROM film
LIMIT 5;
ORDER BY
SELECT title, length
FROM film
ORDER BY length DESC, title;
Ordered by length
in descending order, then by title
in ascending order.
In PostgreSQL, default is ASC
.
You can actually use column numbers in ORDER BY
:
SELECT title, length
FROM film
ORDER BY 2 DESC, 1 ASC;
These numbers refer to the order of columns in SELECT
, starting from 1.
When to use column numbers?
It can make the query less readable, but it can be useful in situations where the column names are too verbose or subject to change.
Technically, you should alias columns of complex expressions, but if you left them anonymous, you could use numbers.
Example with CASE
:
SELECT title,
CASE
WHEN length >= 120 THEN 'LONG'
ELSE 'SHORT' END
FROM film
ORDER BY 2;
Without the numbers and alias, you’d have to repeat the CASE
expression.
WHERE Clause
Logic in SQL is actually three-valued:
TRUE
FALSE
UNKNOWN
The UNKNOWN
value is returned when comparing or performing arithmetic operations with NULL
.
WHERE
filters rows that do not evaluate to TRUE
. So NULL
values are also filtered out.
Comparison Operators
=
EqualSELECT title FROM film WHERE rating = 'PG-13';
Use single quotes for text.
<>
or!=
Not equalSELECT title FROM film WHERE rating <> 'R';
>
,>=
,<
,<=
SELECT title FROM film WHERE length > 120;
With non-numeric data, comparison is lexicographical.
Logical Operators
LIKE
String matching with wildcards.
_
matches any single character.%
matches 0 or more arbitrary characters.
SELECT title
FROM film
where title LIKE 'KI_________%';
You can enforce certain number of characters with _
.
LIKE
is case-sensitive in PostgreSQL.
ILIKE
Case-insensitive version of LIKE
.
SELECT title
FROM film
WHERE title ILIKE 'ki_________%';
REGEXP_LIKE
SELECT title
FROM film
WHERE REGEXP_LIKE(title, '[A-Z]')
LIMIT 10;
AND, OR, NOT
SELECT title
FROM film
WHERE rating = 'PG-13' AND length > 120;
NOT
IS NOT NULL
NOT IN
NOT BETWEEN
NOT LIKE
BETWEEN
SELECT title
FROM film
WHERE length BETWEEN 120 AND 150;
IS NULL
SELECT title
FROM film
WHERE original_language_id IS NULL;
Set Operators
By default, set operators UNION
, INTERSECT
, and EXCEPT
remove duplicates.
UNION ALL
, INTERSECT ALL
, and EXCEPT ALL
retain duplicates.
SELECT cols
FROM table1
UNION [UNION ALL / INTERSECT / INTERSECT ALL / EXCEPT / EXCEPT ALL]
SELECT cols
FROM table1
The SELECT
statements must be union-compatible.
Set Comparison Operators
IN / NOT IN
Unlike the other ones, you can use IN
with an expression list:
SELECT title
FROM film
WHERE rating IN ('PG', 'PG-13');
Or with a subquery like the others.
EXISTS / NOT EXISTS
SELECT f.title
FROM film AS f
WHERE EXISTS (
SELECT *
FROM language AS l
WHERE l.language_id = f.language_id AND l.name = 'English'
);
This is an awkward example, but you get the idea. EXISTS
returns TRUE
if the subquery returns any rows.
It is often used with correlated subqueries. By correlated, I mean the subquery refers to the relation in the outer query, which is f
in this case.
ANY
ANY
is used with leading comparison operators: op ANY
(= ANY
, <> ANY
, <= ANY
, etc.).
The following uses = ANY
:
SELECT length
FROM film
WHERE length = ANY (
SELECT length
FROM film
WHERE length = 60 OR length = 120
);
Semantically, = ANY
is equivalent to IN
. However, you cannot use expression lists with = ANY
.
ALL
ALL
is used with leading comparison operators: op ALL
(= ALL
, <> ALL
, <= ALL
, etc.).
The following uses <> ALL
:
SELECT length
FROM film
WHERE length <> ALL (
SELECT length
FROM film
WHERE length = 60 OR length = 120
);
Semantically, <> ALL
is equivalent to NOT IN
. However, you cannot use expression lists with <> ALL
.
Aggregate Functions
You can only use aggregate functions in SELECT
and HAVING
clauses.
Except for COUNT(*)
, aggregate functions ignore NULL
values.
Except for COUNT(*)
, aggregate functions take a single column.
COUNT(*) / COUNT(col) / COUNT(DISTINCT col)
SUM(col) / SUM(DISTINCT col)
AVG(col) / AVG(DISTINCT col)
MAX(col)
MIN(col)
DISTINCT
does not ignore NULL
values, but the aggregate function does.
GROUP BY Clause
To use aggregate functions across only certain groups of rows, we use GROUP BY
clause.
SELECT rating, COUNT(title), AVG(length)
from film
GROUP BY rating;
rating | count | avg |
---|---|---|
R | 195 | 118.6615384615384615 |
PG | 194 | 112.0051546391752577 |
G | 178 | 111.050561797752809 |
PG-13 | 223 | 120.4439461883408072 |
NC-17 | 210 | 113.2285714285714286 |
You can also group by multiple columns:
SELECT rating, category_id, COUNT(*) AS count
FROM film
JOIN film_category USING (film_id)
GROUP BY rating, category_id;
GROUP BY col1, col2
means group rows by unique joints of col1
and col2
.
Which also means ordering of columns in GROUP BY
does not matter, in terms of results (may or may not affect performance).
If you specified two columns in GROUP BY
, and each column has n
and m
unique values, the result will have n * m
rows.
Non-aggregated columns in SELECT
must appear in GROUP BY
. In this example, rating
and category_id
.
Just like ORDER BY
, you can use column numbers in GROUP BY
. See Using Column Numbers.
HAVING Clause
WHERE
filters rows before grouping, HAVING
filters groups after grouping.
SELECT rating, category_id, COUNT(*) AS count
FROM film
JOIN film_category USING (film_id)
GROUP BY category_id, rating
HAVING COUNT(*) >= 15;
Other than aggregate functions, you can also have subqueries in HAVING
clause.
Caveats with LIMIT
LIMIT
is applied after GROUP BY
and HAVING
.
So LIMIT
is applied to the result set, after grouping, filtering, and aggregation.
You should really think about whether you’d actually want to use LIMIT
with GROUP BY
, because it often doesn’t make sense.
CASE Expressions
Basic syntax:
CASE WHEN condition1 THEN true_case_1
WHEN condition2 THEN true_case_2
...
ELSE false_case
END
In SELECT
statement:
SELECT title,
CASE
WHEN length >= 120 THEN 'LONG'
ELSE 'SHORT' END AS is_long
FROM film;
With GROUP BY
You can use CASE
expressions in GROUP BY
to create custom groups; basically preprocess group labels to deal with NULL
values, etc. before aggregation.
SELECT CASE
WHEN last_name LIKE 'A%' THEN 'A'
WHEN last_name LIKE 'B%' THEN 'B'
WHEN last_name LIKE 'C%' THEN 'C'
ELSE 'Other' END AS last_name_starts_with,
COUNT(*) AS actor_count
FROM actor
GROUP BY 1
ORDER BY 2;
In Aggregate Functions
You can use CASE
expressions in aggregate functions to filter or transform values before aggregation:
SELECT COUNT(CASE WHEN length >= 120 THEN 1 ELSE NULL END) AS long_film_cnt
FROM film;
Here we used the fact that aggregate functions except COUNT(*)
ignore NULL
values.