SQL SELECT statement: order of clause execution

30 May 2018

SQL is a really handy language to know if you need to store and manipulate large amounts of data that fit nicely into tabular form. (Fortunately, a large fraction of the data we humans deal with fits this description.)

However, sometimes SQL can be a bit… opaque. In particular, the order that the SQL parser executes the clauses in SELECT statements has caused a bit of head-scratching on my part in the past. Fortunately, once you learn it, it’s pretty easy to remember.

SELECT clauses in a complete statement

Here is an example of a pseudo-SQL SELECT statement that uses all of the main clauses available:

SELECT DISTINCT col, SUM(another_col), ...
    FROM data_table
        JOIN yet_another_table
            ON data_table.col = yet_another_table.another_column
    WHERE constraint_expression
    GROUP BY col
    HAVING constraint_expression
    ORDER BY col ASC/DESC
    LIMIT integer_count OFFSET integer_count;

So, in the order they appear, there are the following clauses (some are optional according to the situation, others are not):

  1. SELECT column or expression, …
  2. DISTINCT
  3. FROM table
  4. JOIN table
  5. ON matching expression
  6. WHERE constraint expression
  7. GROUP BY column, …
  8. HAVING constraint expression
  9. ORDER BY column, … ascending or descending
  10. LIMIT integer row count
  11. OFFSET integer row count

What actually happens

The actual order in which the SQL parser executes these is not the same as their order of appearance in the statement text. Watch carefully now…

FROM and JOIN

These two, working in tandem, along with the results of any subqueries, are the root of the tree, so to speak, where all the data that might be included from the specified tables gets gathered up into the initial large bucket of joined rows. This can get truly massive if you’re not careful about your joins, so optimizing here can save you a lot of time and memory.

WHERE

Now the WHERE clause kicks in, and filters out all the rows that don’t match its expression. This makes it the first chance to knock off the data that’s irrelevant to your search, while it’s still in fairly raw form.

GROUP BY

The rows left after WHERE gets done are now ordered into groups according to shared values in the column(s) specified in the GROUP BY clause. This operation, by definition, reduces the data set to the number of rows that there are distinct values for in the specified columns, so this only applies where you’re using those columns in an aggregate function expression in the later SELECT clause, such as with COUNT, SUM, or AVG. If you don’t have an aggregate function for this clause to work with, you’ll get an error.

HAVING

This clause is pretty much the same as the WHERE clause, except with the critical difference that it executes after the GROUP BY clause, not before, so its filtering only applies to the remaining single rows that corresponded to each distinct value in the GROUP BY columns, as mentioned before.

SELECT

Now, finally, all of those expressions you so painstakingly wrote into the beginning of your query get executed. One of the implications of this is that aliases that are defined in the SELECT expressions are not usable until after this point. Once again, SELECT aliases are not usable in any of the preceding clauses. Just wanted to make that very clear, because it’s caused me maddening errors more times than I want to admit.

DISTINCT

Although it looks like it’s part of the SELECT expressions, this clause is actually separate, and executes after them. At this point, the rows left are grouped by the values in the columns specified in the DISTINCT clause, and duplicates are simply dropped. I’m not sure if the SQL standard specifies which one will survive, somehow, but it hasn’t been an issue that has caused me any problems so far.

ORDER BY

The rows that made it through data row bootcamp (all the preceding clauses) are now ordered by the columns specified, and in the direction specified (whether ASCENDING or DESCENDING). As noted above, you can use aliases in this clause because it’s executing after the SELECT expressions have been evaluated.

LIMIT and OFFSET

Now the list of rows go through the final selection, and only the rows that fall within the offset and limit specified get to stay in play. This can get a bit tricky, as it may appear that the first LIMIT rows are grabbed, and then the final set will be everything after the OFFSET value. In truth, it’s the other way around. First the OFFSET clause moves down the list the specified number of rows, and then the LIMIT clause grabs the next n rows, n being whatever the LIMIT count specified is. This is especially handy for doing things like breaking a data set into “pages”, where for example if you want 10 items on a page, and you have several thousand rows, you can calculate page_number * items_per_page and use that as your OFFSET value, and use items_per_page as your LIMIT value.

Hopefully all of this made sense. It makes things a lot easier to work with when you understand the order in which things happen. Happy querying!