What is Advanced SQL, anyway?
I think about this sometimes. Is it window functions? CTEs? I don’t know. Some of it is trickier than others. I think it’s relative to how much practice you’ve had and depends a great deal on the data in front of you.
I wanted to come up with a solution before watching the author’s video. It’s only 8 minutes, so probably worth watching.
I was happy to find that my solution was close but not identical.In case some of the techniques are interesting to others, I figured I would write up how I went about solving this problem.
The goalThe goal here was to turn this:
This is the create script:
CREATE TABLE event_status (event_time varchar(10),[status] varchar(10)); INSERT INTO event_status (event_time, [status]) VALUES ('10:01','on'),('10:02','on'),('10:03','on'), ('10:04','off'),('10:07','on'),('10:08','on'), ('10:09','off'),('10:11','on'),('10:12','off');
With a problem like this, it’s useful to formulate a sentence or two that describes what we’re trying to do. So:
Identify groups of rows that begin with 1:n rows with status=’on’ and end with 1 row with status=’off’. Then, identify the minimum event_time, the maximum event_time and the count of ‘on’ rows within each group.
There are many ways to solve this and some of the detail can depend on which variant of SQL you’re using. This time, I used SQL Server. This is what I came up with:
WITH dat AS ( SELECT e.event_time, e.[status], LAG(CASE WHEN e.[status] = 'off' THEN 1 ELSE 0 END,1,0) OVER (ORDER BY e.event_time) AS new_group FROM event_status e ), grp AS ( SELECT event_time, [status], SUM(new_group) OVER (ORDER BY event_time) AS grp_id FROM dat ) SELECT MIN(event_time) AS login, MAX(event_time) AS logout, SUM(CASE WHEN [status] = 'on' THEN 1 ELSE 0 END) AS on_count FROM grp GROUP BY grp_id;
We’ll break it down in a moment, but first let’s get some definitions out there.
- CTE – is a common table expression. It let’s you define a named temporary result that can be used by other queries. Read more here.
- Window function – is a way to apply an aggregation or calculation over a partition of the query result. In this post, I’m using LAG to return a value from a row before the current row (here, one row before). I’m also using SUM to calculate a running total. If you’re not familiar with window functions, you can read more here. You can tell something is a window function by the presence of the OVER clause.
Here I’m using two CTEs to:
- Add an indicator to show where a new group has started, and
- Assign a group id to each group of rows defined in the problem statement above
The first CTE is:
SELECT e.event_time, e.[status], LAG(CASE WHEN e.[status] = 'off' THEN 1 ELSE 0 END,1,0) OVER (ORDER BY e.event_time) AS new_group FROM event_status e
It produces this result:
You can see that whenever the previous row’s status is ‘off’, the new column shows 1. All other rows show 0. The assumption here is that there are never two ‘off’ rows adjacent to each other. This 1 indicates the beginning of a new group.
LAG is a very useful function. The first parameter is the column, or calculated column, from which we want to return a value. The second parameter is the number of rows prior to the current row we want to return the value from and the third parameter is a default value to use in case LAG cannot find a row before the current row. We determine what constitutes “before” by using the required ORDER BY statement in the OVER clause.To convert this into a group ID, we only need to calculate a running total on the new_group column. The running total will start with the value of new_group on row 1, then add each successive value of new_group to the previous result of the sum.
SELECT event_time, [status], SUM(new_group) OVER (ORDER BY event_time) AS grp_id FROM dat
When we use SUM over an ordered field, it has the effect of creating a running total. So, the result is then:
Now that we have a group ID, calculating simple aggregates is easy:
SELECT MIN(event_time) AS login, MAX(event_time) AS logout, SUM(CASE WHEN [status] = 'on' THEN 1 ELSE 0 END) AS on_count FROM grp GROUP BY grp_id
Min and Max should be simple enough. Counting the number of ‘on’ rows in each group is just a matter of nesting a CASE statement inside the SUM function. Because each ‘on’ row has a value of 1 according to this CASE, then, SUM of that CASE is equivalent to COUNT WHERE status=’on’.
Note that we achieve the right segmentation of the rows by using GROUP BY grp_id. It’s not necessary to have the grouping field in the SELECT clause.
This post showed how to use LAG and SUM window functions to group rows.
The LAG window function allows us to return values from rows before the current row.
SUM OVER ORDER BY allows us to create a running total on a numeric field. By using this on a field where a 1 indicates the change of a group from one group to the next, we can increment the ID by one each time we encounter a new group.