SQL is a really handy language to know if you need to store and manipulate large amounts of data that fit nicely into tabular form. (Fortunately, a large fraction of the data we humans deal with fits this description.)
However, sometimes SQL can be a bit… opaque. In particular, the order that the SQL parser executes the clauses in SELECT statements has caused a bit of head-scratching on my part in the past. Fortunately, once you learn it, it’s pretty easy to remember.
Here is an example of a pseudo-SQL SELECT statement that uses all of the main clauses available:
SELECT DISTINCT col, SUM(another_col), ... FROM data_table JOIN yet_another_table ON data_table.col = yet_another_table.another_column WHERE constraint_expression GROUP BY col HAVING constraint_expression ORDER BY col ASC/DESC LIMIT integer_count OFFSET integer_count;
So, in the order they appear, there are the following clauses (some are optional according to the situation, others are not):
The actual order in which the SQL parser executes these is not the same as their order of appearance in the statement text. Watch carefully now…
These two, working in tandem, along with the results of any subqueries, are the root of the tree, so to speak, where all the data that might be included from the specified tables gets gathered up into the initial large bucket of joined rows. This can get truly massive if you’re not careful about your joins, so optimizing here can save you a lot of time and memory.
WHERE clause kicks in, and filters out all the rows that don’t
match its expression. This makes it the first chance to knock off the
data that’s irrelevant to your search, while it’s still in fairly raw
The rows left after
WHERE gets done are now ordered into groups according
to shared values in the column(s) specified in the
GROUP BY clause.
This operation, by definition, reduces the data set to the number of rows
that there are distinct values for in the specified columns, so this
only applies where you’re using those columns in an aggregate function
expression in the later
SELECT clause, such as with
AVG. If you don’t have an aggregate function for this clause to work
with, you’ll get an error.
This clause is pretty much the same as the WHERE clause, except with the
critical difference that it executes after the
GROUP BY clause, not
before, so its filtering only applies to the remaining single rows that corresponded to each distinct value in the
GROUP BY columns, as mentioned before.
Now, finally, all of those expressions you so painstakingly wrote into the
beginning of your query get executed. One of the implications of this is
that aliases that are defined in the
SELECT expressions are not usable
until after this point. Once again, SELECT aliases are not usable in
any of the preceding clauses. Just wanted to make that very clear,
because it’s caused me maddening errors more times than I want to admit.
Although it looks like it’s part of the
SELECT expressions, this clause
is actually separate, and executes after them. At this point, the rows
left are grouped by the values in the columns specified in the
clause, and duplicates are simply dropped. I’m not sure if the SQL standard
specifies which one will survive, somehow, but it hasn’t been an issue
that has caused me any problems so far.
The rows that made it through data row bootcamp (all the preceding clauses)
are now ordered by the columns specified, and in the direction specified
DESCENDING). As noted above, you can use aliases
in this clause because it’s executing after the SELECT expressions have
Now the list of rows go through the final selection, and only the rows that
fall within the offset and limit specified get to stay in play. This can
get a bit tricky, as it may appear that the first
LIMIT rows are grabbed,
and then the final set will be everything after the
OFFSET value. In
truth, it’s the other way around. First the
OFFSET clause moves down
the list the specified number of rows, and then the
LIMIT clause grabs
the next n rows, n being whatever the
LIMIT count specified is.
This is especially handy for doing things like breaking a data set
into “pages”, where for example if you want 10 items on a page, and you
have several thousand rows, you can calculate
page_number * items_per_page
and use that as your
OFFSET value, and use
items_per_page as your
Hopefully all of this made sense. It makes things a lot easier to work with when you understand the order in which things happen. Happy querying!