SQL: Use LAG and SUM window functions to group rows

What is Advanced SQL, anyway?

I think about this sometimes. Is it window functions? CTEs? I don’t know. Some of it is trickier than others. I think it’s relative to how much practice you’ve had and depends a great deal on the data in front of you.

Anyway, I saw this post on LinkedIn by Ankit Bansal. It struck me as an interesting problem.

I wanted to come up with a solution before watching the author’s video. It’s only 8 minutes, so probably worth watching.

I was happy to find that my solution was close but not identical.

In case some of the techniques are interesting to others, I figured I would write up how I went about solving this problem.

The goal

The goal here was to turn this:

Into this:

This is the create script:

CREATE TABLE event_status (event_time varchar(10),[status] varchar(10));

INSERT INTO event_status (event_time, [status])
VALUES 
	('10:01','on'),('10:02','on'),('10:03','on'),
	('10:04','off'),('10:07','on'),('10:08','on'),
	('10:09','off'),('10:11','on'),('10:12','off');

As you can see, the data are simple. Finding a solution is perhaps not as simple as it might first appear.

A solution

With a problem like this, it’s useful to formulate a sentence or two that describes what we’re trying to do. So:

Identify groups of rows that begin with 1:n rows with status=’on’ and end with 1 row with status=’off’. Then, identify the minimum event_time, the maximum event_time and the count of ‘on’ rows within each group.

There are many ways to solve this and some of the detail can depend on which variant of SQL you’re using. This time, I used SQL Server. This is what I came up with:

WITH dat 
AS 
(
SELECT
	e.event_time,
	e.[status],
	LAG(CASE WHEN e.[status] = 'off' THEN 1 ELSE 0 END,1,0) OVER (ORDER BY e.event_time) AS new_group
FROM event_status e
),
grp 
AS 
(
SELECT event_time, [status], SUM(new_group) OVER (ORDER BY event_time) AS grp_id
FROM dat
)
SELECT MIN(event_time) AS login, MAX(event_time) AS logout, SUM(CASE WHEN [status] = 'on' THEN 1 ELSE 0 END) AS on_count
FROM grp
GROUP BY grp_id;

We’ll break it down in a moment, but first let’s get some definitions out there.

Important concepts

CTE – is a common table expression. It let’s you define a named temporary result that can be used by other queries. Read more here.
Window function – is a way to apply an aggregation or calculation over a partition of the query result. In this post, I’m using LAG to return a value from a row before the current row (here, one row before). I’m also using SUM to calculate a running total. If you’re not familiar with window functions, you can read more here. You can tell something is a window function by the presence of the OVER clause.

The detail

Here I’m using two CTEs to:

Add an indicator to show where a new group has started, and
Assign a group id to each group of rows defined in the problem statement above

The first CTE is:

SELECT
	e.event_time,
	e.[status],
	LAG(CASE WHEN e.[status] = 'off' THEN 1 ELSE 0 END,1,0) OVER (ORDER BY e.event_time) AS new_group
FROM event_status e

It produces this result:

You can see that whenever the previous row’s status is ‘off’, the new column shows 1. All other rows show 0. The assumption here is that there are never two ‘off’ rows adjacent to each other. This 1 indicates the beginning of a new group.

LAG is a very useful function. The first parameter is the column, or calculated column, from which we want to return a value. The second parameter is the number of rows prior to the current row we want to return the value from and the third parameter is a default value to use in case LAG cannot find a row before the current row. We determine what constitutes “before” by using the required ORDER BY statement in the OVER clause.

To convert this into a group ID, we only need to calculate a running total on the new_group column. The running total will start with the value of new_group on row 1, then add each successive value of new_group to the previous result of the sum.

SELECT event_time, [status], SUM(new_group) OVER (ORDER BY event_time) AS grp_id
FROM dat

When we use SUM over an ordered field, it has the effect of creating a running total. So, the result is then:

Now that we have a group ID, calculating simple aggregates is easy:

SELECT MIN(event_time) AS login, MAX(event_time) AS logout, SUM(CASE WHEN [status] = 'on' THEN 1 ELSE 0 END) AS on_count
FROM grp
GROUP BY grp_id

Min and Max should be simple enough. Counting the number of ‘on’ rows in each group is just a matter of nesting a CASE statement inside the SUM function. Because each ‘on’ row has a value of 1 according to this CASE, then, SUM of that CASE is equivalent to COUNT WHERE status=’on’.

Note that we achieve the right segmentation of the rows by using GROUP BY grp_id. It’s not necessary to have the grouping field in the SELECT clause.

In summary

This post showed how to use LAG and SUM window functions to group rows.

The LAG window function allows us to return values from rows before the current row.

SUM OVER ORDER BY allows us to create a running total on a numeric field. By using this on a field where a 1 indicates the change of a group from one group to the next, we can increment the ID by one each time we encounter a new group.

Owen Price - Microsoft MVP