SQL is a really handy language to know if you need to store and manipulate large amounts of data that fit nicely into tabular form. (Fortunately, a large fraction of the data we humans deal with fits this description.)
However, sometimes SQL can be a bit… opaque. In particular, the order that the SQL parser executes the clauses in SELECT statements has caused a bit of head-scratching on my part in the past. Fortunately, once you learn it, it’s pretty easy to remember.
Here is an example of a pseudo-SQL SELECT statement that uses all of the main clauses available:
SELECT DISTINCT col, SUM(another_col), ...
FROM data_table
JOIN yet_another_table
ON data_table.col = yet_another_table.another_column
WHERE constraint_expression
GROUP BY col
HAVING constraint_expression
ORDER BY col ASC/DESC
LIMIT integer_count OFFSET integer_count;
So, in the order they appear, there are the following clauses (some are optional according to the situation, others are not):
The actual order in which the SQL parser executes these is not the same as their order of appearance in the statement text. Watch carefully now…
FROM
and JOIN
These two, working in tandem, along with the results of any subqueries, are the root of the tree, so to speak, where all the data that might be included from the specified tables gets gathered up into the initial large bucket of joined rows. This can get truly massive if you’re not careful about your joins, so optimizing here can save you a lot of time and memory.
WHERE
Now the WHERE
clause kicks in, and filters out all the rows that don’t
match its expression. This makes it the first chance to knock off the
data that’s irrelevant to your search, while it’s still in fairly raw
form.
GROUP BY
The rows left after WHERE
gets done are now ordered into groups according
to shared values in the column(s) specified in the GROUP BY
clause.
This operation, by definition, reduces the data set to the number of rows
that there are distinct values for in the specified columns, so this
only applies where you’re using those columns in an aggregate function
expression in the later SELECT
clause, such as with COUNT
, SUM
,
or AVG
. If you don’t have an aggregate function for this clause to work
with, you’ll get an error.
HAVING
This clause is pretty much the same as the WHERE clause, except with the
critical difference that it executes after the GROUP BY
clause, not
before, so its filtering only applies to the remaining single rows that corresponded to each distinct value in the GROUP BY
columns, as mentioned before.
SELECT
Now, finally, all of those expressions you so painstakingly wrote into the
beginning of your query get executed. One of the implications of this is
that aliases that are defined in the SELECT
expressions are not usable
until after this point. Once again, SELECT aliases are not usable in
any of the preceding clauses. Just wanted to make that very clear,
because it’s caused me maddening errors more times than I want to admit.
DISTINCT
Although it looks like it’s part of the SELECT
expressions, this clause
is actually separate, and executes after them. At this point, the rows
left are grouped by the values in the columns specified in the DISTINCT
clause, and duplicates are simply dropped. I’m not sure if the SQL standard
specifies which one will survive, somehow, but it hasn’t been an issue
that has caused me any problems so far.
ORDER BY
The rows that made it through data row bootcamp (all the preceding clauses)
are now ordered by the columns specified, and in the direction specified
(whether ASCENDING
or DESCENDING
). As noted above, you can use aliases
in this clause because it’s executing after the SELECT expressions have
been evaluated.
LIMIT
and OFFSET
Now the list of rows go through the final selection, and only the rows that
fall within the offset and limit specified get to stay in play. This can
get a bit tricky, as it may appear that the first LIMIT
rows are grabbed,
and then the final set will be everything after the OFFSET
value. In
truth, it’s the other way around. First the OFFSET
clause moves down
the list the specified number of rows, and then the LIMIT
clause grabs
the next n rows, n being whatever the LIMIT
count specified is.
This is especially handy for doing things like breaking a data set
into “pages”, where for example if you want 10 items on a page, and you
have several thousand rows, you can calculate page_number * items_per_page
and use that as your OFFSET
value, and use items_per_page
as your
LIMIT
value.
Hopefully all of this made sense. It makes things a lot easier to work with when you understand the order in which things happen. Happy querying!