When working with databases, it’s important to know how to compare data effectively. In SQL, two powerful operations that allow you to compare sets of data are the INTERSECT operator and the INNER JOIN clause. Both can identify commonalities between datasets, but they do so in subtly different ways that can affect the output and performance of your queries.

What is the INTERSECT Operator?

The INTERSECT operator returns the distinct intersection of two sets: only the rows that are present in both query result sets. Here are the key features of INTERSECT:

  • Distinct Results: Automatically removes duplicate rows.
  • NULL Handling: Two NULLs in corresponding columns are treated as a match, unlike a comparison with the = operator.
  • Structural Requirements: Requires matching column count, order and data types, though column names can differ.
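
These properties are easy to check in any engine that supports INTERSECT. Below is a minimal sketch using Python’s built-in sqlite3 module; the table and column names are invented for illustration. Note how the duplicated 1 collapses to a single row and how the NULL in each table counts as a match:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE left_t (val INTEGER)")
cur.execute("CREATE TABLE right_t (val INTEGER)")
cur.executemany("INSERT INTO left_t VALUES (?)", [(1,), (1,), (2,), (None,)])
cur.executemany("INSERT INTO right_t VALUES (?)", [(1,), (3,), (None,)])

# INTERSECT keeps only the distinct rows present in BOTH result sets;
# unlike the = operator, it treats two NULLs as a match
rows = cur.execute(
    "SELECT val FROM left_t INTERSECT SELECT val FROM right_t"
).fetchall()
print(rows)  # two rows: one for 1 (deduplicated) and one for NULL
```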

What about INNER JOIN?

INNER JOIN, on the other hand, is commonly used for merging rows from two or more tables based on a related column between them. Here’s what you should know about INNER JOIN:

  • Comprehensive Output: Returns rows with columns from both tables.
  • Handling NULL Values: NULL join keys never match, because NULL = NULL does not evaluate to true; such comparisons need explicit handling.
  • Duplicates: Can return duplicate rows when a row matches more than once, requiring adjustments (such as DISTINCT) in the SELECT clause to manage them.
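
The contrast shows up in a small experiment, again sketched with sqlite3 and invented names. The NULL rows never join, and each matching row on one side multiplies with every match on the other:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (k INTEGER)")
cur.execute("CREATE TABLE b (k INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (1,), (None,)])
cur.executemany("INSERT INTO b VALUES (?)", [(1,), (None,)])

# the NULLs in a and b never satisfy a.k = b.k,
# and the two 1s in a each match the 1 in b, producing duplicate rows
rows = cur.execute(
    "SELECT a.k, b.k FROM a INNER JOIN b ON a.k = b.k"
).fetchall()
print(rows)  # -> [(1, 1), (1, 1)]
```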

Handling NULLs and Duplicates

When using INNER JOIN, you may encounter issues with NULL values and duplicates. Here’s how you can manage them:

  • NULL Values: Use the COALESCE function to replace NULL with a non-null default value, allowing the row to be included in the results.
  • Duplicates: To avoid duplicates, you can explicitly specify the DISTINCT keyword in your SELECT clause.
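
Both fixes can be combined in one query. Here is a minimal sqlite3 sketch with invented tables; -1 is an arbitrary sentinel assumed never to occur in the real data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE a (k INTEGER)")
cur.execute("CREATE TABLE b (k INTEGER)")
cur.executemany("INSERT INTO a VALUES (?)", [(1,), (1,), (None,)])
cur.executemany("INSERT INTO b VALUES (?)", [(1,), (None,)])

# COALESCE maps NULL to the sentinel -1 so the NULL rows can join,
# and DISTINCT collapses the duplicate matches
rows = cur.execute("""
    SELECT DISTINCT a.k
    FROM a INNER JOIN b
      ON COALESCE(a.k, -1) = COALESCE(b.k, -1)
    ORDER BY a.k
""").fetchall()
print(rows)  # -> [(None,), (1,)] -- NULLs sort first in SQLite
```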

Which Should You Use?

Choosing between INTERSECT and INNER JOIN depends on your specific needs:

  • Use INTERSECT for straightforward comparisons that require distinct results.
  • Opt for INNER JOIN when you need more flexibility in selecting and displaying columns or when handling complex relationships.

For a deeper dive into the nuances of these SQL operations, I’ve included a detailed PDF in this post that elaborates further on these differences and how to effectively use each operation.

Owen-Price-SQL-INTERSECT-vs.-INNER-JOIN

The full script for this post can be found here.

This post uses the AdventureWorksDW database in SQL Server 2019 Express.

Depending on your preference, you can either watch this video or read below.

The goal

Suppose we have a request to produce a report:

Create a report showing internet sales amount with these columns:

  • Order year
  • Country
  • Product category
  • Sales amount
Include a sub-total row showing total sales amount for each year. The value in the Country column on the sub-total row should be “All countries in YYYY”. The value in the Product category column on the sub-total row should be “All product categories in YYYY”. Sort the result by ascending year, ascending country and within each country, sort the product categories by descending sum of sales amount. The sub-total row for each year should appear at the bottom of the rows for that year.

The data

We have the following query:

SELECT YEAR(fis.OrderDate) AS "Order year",
       dg.CountryRegionCode AS "Country",
       dpc.EnglishProductCategoryName AS "Product category",
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
  INNER JOIN DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
  INNER JOIN DimGeography dg ON dc.GeographyKey = dg.GeographyKey
  INNER JOIN DimProduct dp ON fis.ProductKey = dp.ProductKey
  INNER JOIN DimProductSubcategory dps ON dp.ProductSubcategoryKey = dps.ProductSubcategoryKey
  INNER JOIN DimProductCategory dpc ON dps.ProductCategoryKey = dpc.ProductCategoryKey
GROUP BY YEAR(fis.OrderDate),
         dg.CountryRegionCode,
         dpc.EnglishProductCategoryName
ORDER BY 1, 2, 4 DESC;

We’ve created a group by query and summed the sales amount by the requested columns. 

Since the requirements request specific column names, we alias each column exactly as requested and enclose the aliases in double quotes so that the column headers can contain spaces.

The ORDER BY clause is using positional references to sort by columns 1 and 2 (Order year and Country) in ascending order, and column 4 (Sales amount) in descending order.

The query above returns results that look like this:

We can see that we have the sum of sales amount by each unique combination of year, country and product category. 

We need to add the sub-totals. 

There are three common ways to do this.

Method 1 – UNION ALL

The first way to add sub-totals to a query is to use UNION ALL to append a second query to the first.

WITH dat
AS
(
SELECT YEAR(fis.OrderDate) AS "Order year",
       dg.CountryRegionCode AS "Country",
       dpc.EnglishProductCategoryName AS "Product category",
       0 AS country_type,
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
  INNER JOIN DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
  INNER JOIN DimGeography dg ON dc.GeographyKey = dg.GeographyKey
  INNER JOIN DimProduct dp ON fis.ProductKey = dp.ProductKey
  INNER JOIN DimProductSubcategory dps ON dp.ProductSubcategoryKey = dps.ProductSubcategoryKey
  INNER JOIN DimProductCategory dpc ON dps.ProductCategoryKey = dpc.ProductCategoryKey
GROUP BY YEAR(fis.OrderDate),
         dg.CountryRegionCode,
         dpc.EnglishProductCategoryName
UNION ALL
SELECT YEAR(fis.OrderDate) AS "Order year",
       'All countries in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4)) AS "Country",
       'All product categories in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4)) AS "Product category",
       1 AS country_type,
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
GROUP BY YEAR(fis.OrderDate)
)
SELECT "Order year", "Country", "Product category", "Sales amount"
FROM dat
ORDER BY "Order year", country_type, "Country", "Sales amount" DESC;

There are a few things to note about this query.

  1. We add a second query joined to the first by using UNION ALL. 
  2. The second query is only grouping by year, as we would expect.
  3. A UNION ALL query must have the same number of columns in each part of the query (above and below the UNION ALL operator), so we need to provide default values for the Country and Product category columns. These are defined per the requirements.
  4. If we tried to put the ORDER BY clause directly under the UNION ALL and didn’t wrap the UNION ALL inside a CTE (common table expression), we would find that the sub-total rows are sorted to the top of each year (because the sub-totals have the word “All…” in their definition). It would look like this:
SELECT YEAR(fis.OrderDate) AS "Order year",
       dg.CountryRegionCode AS "Country",
       dpc.EnglishProductCategoryName AS "Product category",
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
  INNER JOIN DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
  INNER JOIN DimGeography dg ON dc.GeographyKey = dg.GeographyKey
  INNER JOIN DimProduct dp ON fis.ProductKey = dp.ProductKey
  INNER JOIN DimProductSubcategory dps ON dp.ProductSubcategoryKey = dps.ProductSubcategoryKey
  INNER JOIN DimProductCategory dpc ON dps.ProductCategoryKey = dpc.ProductCategoryKey
GROUP BY YEAR(fis.OrderDate),
         dg.CountryRegionCode,
         dpc.EnglishProductCategoryName
UNION ALL
SELECT YEAR(fis.OrderDate) AS "Order year",
       'All countries in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4)) AS "Country",
       'All product categories in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4)) AS "Product category",
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
GROUP BY YEAR(fis.OrderDate)
ORDER BY 1,2,4 DESC;

  5. To sort the sub-totals at the bottom of each year, we would preferably put something like CASE WHEN LEFT("Country",3) = 'All' THEN 1 ELSE 0 END in the second position of the ORDER BY clause, to ensure that the sub-total appears below each of the product categories and countries. However, when using UNION ALL, each column mentioned in ORDER BY must also appear in the SELECT list. So we must add an additional column – country_type – set to 0 for each row from the top of the UNION ALL and 1 for each row from the bottom, then use this new column in the ORDER BY clause.
  6. This creates a further problem: we now have a column in the output that we don’t want – country_type. To get rid of that column, we wrap the entire UNION ALL query in a CTE, then select from and order by that CTE.

The result of the UNION ALL method looks like this. You can see that now we have fulfilled all the requirements (specific column names, specific ordering, sub-totals at the bottom of each year):

This gets us what we want – but it’s unnecessarily complicated: we had to use UNION ALL, a CTE and an extra column to get there.
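
Stripped to its essentials, the pattern ports to any engine. The sketch below reruns the same shape (detail query, UNION ALL sub-total query with a sort-key column, CTE to hide that column) in Python’s sqlite3 with an invented three-row sales table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (yr INTEGER, country TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                [(2019, "US", 10), (2019, "GB", 20), (2020, "US", 5)])

rows = cur.execute("""
    WITH dat AS (
        SELECT yr, country, 0 AS country_type, SUM(amount) AS amount
        FROM sales GROUP BY yr, country
        UNION ALL
        SELECT yr, 'All countries in ' || yr, 1, SUM(amount)
        FROM sales GROUP BY yr
    )
    -- country_type exists only to push sub-totals below the detail rows;
    -- the outer SELECT hides it from the final output
    SELECT yr, country, amount
    FROM dat
    ORDER BY yr, country_type, country
""").fetchall()
for r in rows:
    print(r)
```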

Method 2 – ROLLUP

The second way to add sub-totals to a query is to use ROLLUP in the GROUP BY clause to specify that we want to roll-up the values (i.e. aggregate them) in those columns.

SELECT YEAR(fis.OrderDate) AS "Order year",
       CASE 
        WHEN GROUPING(dg.CountryRegionCode) = 1 THEN 'All countries in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4))
        ELSE dg.CountryRegionCode 
       END AS "Country",
       CASE 
        WHEN GROUPING(dpc.EnglishProductCategoryName) = 1 THEN 'All product categories in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4))
        ELSE dpc.EnglishProductCategoryName 
       END AS "Product category",
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
  INNER JOIN DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
  INNER JOIN DimGeography dg ON dc.GeographyKey = dg.GeographyKey
  INNER JOIN DimProduct dp ON fis.ProductKey = dp.ProductKey
  INNER JOIN DimProductSubcategory dps ON dp.ProductSubcategoryKey = dps.ProductSubcategoryKey
  INNER JOIN DimProductCategory dpc ON dps.ProductCategoryKey = dpc.ProductCategoryKey
GROUP BY YEAR(fis.OrderDate),
         ROLLUP(dg.CountryRegionCode,
         dpc.EnglishProductCategoryName)
HAVING GROUPING(dg.CountryRegionCode) + GROUPING(dpc.EnglishProductCategoryName) <> 1
ORDER BY 1, GROUPING(dg.CountryRegionCode), 2, 4 DESC;

Again, let’s note a few things about this method:

  1. We don’t need to use the UNION ALL operator and a separate query.
  2. We have wrapped the CountryRegionCode and the EnglishProductCategoryName inside the ROLLUP function in the GROUP BY clause. This has the effect of rolling up those columns to create sub-totals. However, if we only did that, we would get additional rows we don’t want – where country is not null and category is null (i.e. rolling up the categories within each country) – as the following query shows:
SELECT YEAR(fis.OrderDate) AS "Order year",
	   dg.CountryRegionCode AS "Country",
       dpc.EnglishProductCategoryName AS "Product category",
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
  INNER JOIN DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
  INNER JOIN DimGeography dg ON dc.GeographyKey = dg.GeographyKey
  INNER JOIN DimProduct dp ON fis.ProductKey = dp.ProductKey
  INNER JOIN DimProductSubcategory dps ON dp.ProductSubcategoryKey = dps.ProductSubcategoryKey
  INNER JOIN DimProductCategory dpc ON dps.ProductCategoryKey = dpc.ProductCategoryKey
GROUP BY YEAR(fis.OrderDate),
         ROLLUP(dg.CountryRegionCode,
         dpc.EnglishProductCategoryName)
ORDER BY 1, 2, 4 DESC;

  3. ROLLUP creates NULLs in the columns being rolled up, so we have to supply default values for those NULLs.
  4. This ROLLUP has created two types of sub-total rows: (1) where both country and product category are NULL – which is the sub-total we need – and (2) where only product category is NULL, which is a sub-total row we don’t need.
       CASE 
        WHEN GROUPING(dg.CountryRegionCode) = 1 THEN 'All countries in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4))
        ELSE dg.CountryRegionCode 
       END AS "Country",
       CASE 
        WHEN GROUPING(dpc.EnglishProductCategoryName) = 1 THEN 'All product categories in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4))
        ELSE dpc.EnglishProductCategoryName 
       END AS "Product category",
  5. The GROUPING function differentiates between a NULL created by ROLLUP and a NULL present in the source data. If a NULL is created by ROLLUP, GROUPING on that column returns 1; otherwise it returns 0. So, we can use GROUPING to specify what text should appear in the sub-total rows.
GROUP BY YEAR(fis.OrderDate),
         ROLLUP(dg.CountryRegionCode,
         dpc.EnglishProductCategoryName)
HAVING GROUPING(dg.CountryRegionCode) + GROUPING(dpc.EnglishProductCategoryName) <> 1
ORDER BY 1, GROUPING(dg.CountryRegionCode), 2, 4 DESC;
  6. We can also use the GROUPING function to filter out the rows where ROLLUP has created a NULL in the category column but not in the country column. We do this by adding a HAVING clause specifying that the sum of GROUPING over both columns must not equal 1: for the unwanted rows, country is not NULL so GROUPING(country) = 0, and category is NULL so GROUPING(category) = 1, which sums to exactly 1. HAVING removes them.
  7. Finally, since we are not using the UNION ALL operator, we can put expressions in the ORDER BY clause which aren’t in the SELECT clause. In this case, instead of creating the column country_type like before, we simply put GROUPING(dg.CountryRegionCode) in the second position of the ORDER BY clause, which has the same effect – putting the sub-totals at the bottom of each year’s rows.

So, ROLLUP is easier than UNION ALL. But there’s another way we can use to avoid the additional rows created by ROLLUP which then need to be removed by the HAVING clause.

Method 3 – GROUPING SETS

The third way to add sub-totals to a query is to use GROUPING SETS in the GROUP BY clause to specify exactly which columns we want to GROUP BY in each type of row.

SELECT YEAR(fis.OrderDate) AS "Order year",
       CASE 
        WHEN GROUPING(dg.CountryRegionCode) = 1 THEN 'All countries in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4))
        ELSE dg.CountryRegionCode 
       END AS "Country",
       CASE 
        WHEN GROUPING(dpc.EnglishProductCategoryName) = 1 THEN 'All product categories in ' + CAST(YEAR(fis.OrderDate) AS nvarchar(4))
        ELSE dpc.EnglishProductCategoryName 
       END AS "Product category",
       SUM(fis.SalesAmount) AS "Sales amount"
FROM FactInternetSales fis
  INNER JOIN DimCustomer dc ON fis.CustomerKey = dc.CustomerKey
  INNER JOIN DimGeography dg ON dc.GeographyKey = dg.GeographyKey
  INNER JOIN DimProduct dp ON fis.ProductKey = dp.ProductKey
  INNER JOIN DimProductSubcategory dps ON dp.ProductSubcategoryKey = dps.ProductSubcategoryKey
  INNER JOIN DimProductCategory dpc ON dps.ProductCategoryKey = dpc.ProductCategoryKey
GROUP BY GROUPING SETS (
  (YEAR(fis.OrderDate),dg.CountryRegionCode,dpc.EnglishProductCategoryName ),
  (YEAR(fis.OrderDate))
) 
ORDER BY 1, GROUPING(dg.CountryRegionCode), 2, 4 DESC;
  1. We add the GROUPING SETS keywords to the GROUP BY clause, followed by parentheses.
  2. Within those parentheses, we have a comma-separated list of groups of columns we want to group by.
  3. The first group is – year, country, category – this is just the original group by clause and creates the non-sub-total rows.
  4. The second group is just year – this creates the sub-total we want and nothing else.
  5. Note that GROUPING SETS is powerful precisely because we can be very specific about what groups we want. In my opinion it’s much easier to understand than ROLLUP for this reason. 
  6. Because we haven’t created any additional sub-totals, we don’t need to use the HAVING clause at all.
  7. The ORDER BY clause is the same as with the ROLLUP example, because GROUPING works in the same way for GROUPING SETS as it does for ROLLUP.

The results are identical:

In summary

We saw the difference between ROLLUP and GROUPING SETS in SQL.

We used each of UNION ALL, ROLLUP and GROUPING SETS to add sub-totals to a query.

We saw how GROUPING SETS is more precise than ROLLUP and so can help us avoid creating sub-totals we don’t want by only specifying the groups we need.

We saw how to use the GROUPING function to differentiate between NULL values in source data and NULL values created by a rollup operation.

The full script for this post can be downloaded here.

The following post was written primarily in Postgres 14 and tested in both Postgres and MySQL.

While the data generation and use of EXCEPT are not supported natively in MySQL, the principles of row comparisons are equivalent.

Setup

To demonstrate how a row constructor comparison can be confusing, let’s create some dummy data with two columns of integers:
--create a table to hold some random integers
drop table if exists random_numbers;
create temp table random_numbers (a int, b int);

--create 10,000 rows of random integers
--between 0 and 50
insert into random_numbers (a, b)
select floor(random()*50), ceiling(random()*50)
from generate_series(1,10000);

 

Comparisons

PostgreSQL allows us to compare two sets of values using what is known as a row constructor comparison. 

A row can be constructed with parentheses, like this:

--with columns in a query
(a_column, b_column)

--or with literals
(10, 40)

We can create complex rows in this way and then compare them with each other.

For example:

(a_column, b_column) = (10, 40)

This can be incredibly powerful and is a useful shortcut for complex logic. 

However, the usual Spiderman caveat applies. 

Comparing two row constructors is not always intuitive. 

= operator

Consider this:

select 'This: "where (a,b) = (10,40)"' as message, count(*) 
from (
    select a, b
    from random_numbers 
    where (a,b) = (10,40)
) x
union all 
select 'Is the same as: "where a = 10 and b = 40"' as message, count(*) 
from (
    select a, b
    from random_numbers 
    where a = 10 and b = 40
) x
union all 
select 'Subtracting one from the other using "except" gives us' as message, count(*)
from (
    select a, b
    from random_numbers 
    where a = 10 and b = 40
    except
    select a, b
    from random_numbers 
    where (a,b) = (10,40)
) x;


Which, for the random data created when I wrote this post, returns this:

This is straightforward enough. Put simply: is the left side equal to the right side?

Intuitively, we expect a comparison between each “column” from the left side with the corresponding “column” of the right side.  For the equals operator, it works how we expect.
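
As a side note, Python tuples compare with exactly the same element-wise semantics for equality, which makes them a convenient way to sanity-check the logic outside the database (an analogy only, not how Postgres implements it):

```python
# row-constructor equality: every element must match its counterpart
for a in range(0, 20):
    for b in range(30, 50):
        assert ((a, b) == (10, 40)) == (a == 10 and b == 40)
print("element-wise equality confirmed")
```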

<> operator

Consider this:

select 'This: "where (a,b) <> (10,40)"' as message, count(*) 
from (
    select a, b
    from random_numbers 
    where (a,b) <> (10,40)
) x
union all 
select 'Is not the same as: "where a <> 10 and b <> 40"' as message, count(*) 
from (
    select a, b
    from random_numbers 
    where a <> 10 and b <> 40
) x
union all 
select 'It''s the same as: "where a <> 10 or (a = 10 and b <> 40)"' as message, count(*) 
from (
    select a, b
    from random_numbers 
    where a <> 10 or (a = 10 and b <> 40)
) x
union all 
select 'Subtracting row 1 from row 3 using "except" gives us' as message, count(*)
from (
    select a, b
    from random_numbers 
    where a <> 10 or (a = 10 and b <> 40)
    except
    select a, b
    from random_numbers 
    where (a,b) <> (10,40)
) x;

Which gives us:

This may not seem as intuitive. But think about it this way: as long as one of the two conditions is met, the entire condition is met.

When comparing two rows with the <> operator, if even one column is different, then the rows are different.

We can either have a <> 10 with b equal to anything, or a = 10 and b <> 40. Together, those two conditions are equivalent to a <> 10 or b <> 40, so the case of b <> 40 with a equal to anything doesn’t need to be spelled out separately.

Pretty straightforward when you think about it for a second.
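
The same equivalence can be brute-force checked with Python tuples, whose != mirrors the row-constructor <> (any differing element makes the rows different):

```python
# (a, b) <> (10, 40) is true when ANY element differs:
# equivalent to a <> 10 OR b <> 40, and to the form used in the query above
for a in range(0, 20):
    for b in range(30, 50):
        row_ne = (a, b) != (10, 40)
        assert row_ne == (a != 10 or b != 40)
        assert row_ne == (a != 10 or (a == 10 and b != 40))
print("all", 20 * 20, "combinations agree")
```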

Things are less intuitive when we move to other operators.

< operator

Consider:

select 'This: "where (a,b) < (10,40)"' as where_clause, count(*) 
from (
    select a, b
    from random_numbers 
    where (a,b) < (10,40)
) x
union all 
select 'Is not the same as: "where a < 10 and b < 40"' as where_clause, count(*) 
from (
    select a, b
    from random_numbers 
    where a < 10 and b < 40
) x
union all 
select 'It''s the same as: "where a < 10 or (a = 10 and b < 40)"' as where_clause, count(*) 
from (
    select a, b
    from random_numbers 
    where a < 10 or (a = 10 and b < 40)
) x
union all 
select 'Subtracting row 1 from row 3 using "except" gives us' as message, count(*)
from (
    select a, b
    from random_numbers 
    where a < 10 or (a = 10 and b < 40)
    except
    select a, b
    from random_numbers 
    where (a,b) < (10,40)
) x;

We get this:

According to the documentation:

For the <, <=, > and >= cases, the row elements are compared left-to-right, stopping as soon as an unequal or null pair of elements is found. If either of this pair of elements is null, the result of the row comparison is unknown (null); otherwise comparison of this pair of elements determines the result.

So, we are comparing one row with another, column-wise from left to right.

  1. Is a less than 10? If yes, include the row in the output. If not:
  2. Is a equal to 10 and b less than 40? If yes, include the row in the output. 

The notable omission here is where b is less than 40 but a is greater than 10. Why? 

Well, read the quote above again. 

Specifically:

stopping as soon as an unequal or null pair of elements is found

The implication here is that each subsequent comparison assumes equality in the prior comparison(s). So, the “2nd” comparison of b with 40 assumes equality between a and 10. 
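
Python tuple comparison is lexicographic in the same way, so the rule can again be verified by brute force (analogy only; NULL handling aside):

```python
# (a, b) < (10, 40) compares a first and only consults b when a equals 10
for a in range(0, 20):
    for b in range(30, 50):
        assert ((a, b) < (10, 40)) == (a < 10 or (a == 10 and b < 40))

# the "notable omission": b below 40 does not help once a exceeds 10
assert not (11, 30) < (10, 40)
# while a below 10 wins even when b is large
assert (9, 49) < (10, 40)
print("lexicographic < confirmed")
```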

And if that wasn’t confusing enough.

<= operator

Consider:

select 'This: "where (a,b) <= (10,40)"' as where_clause, count(*) 
from (
    select a, b
    from random_numbers 
    where (a,b) <= (10,40)
) x
union all 
select 'Is not the same as: "where a <= 10 and b <= 40"' as where_clause, count(*) 
from (
    select a, b
    from random_numbers 
    where a <= 10 and b <= 40
) x
union all 
select 'It''s the same as: "where a < 10 or (a = 10 and b <= 40)"' as where_clause, count(*) 
from (
    select a, b
    from random_numbers 
    where a < 10 or (a = 10 and b <= 40)
) x
union all 
select 'Subtracting row 1 from row 3 using "except" gives us' as message, count(*)
from (
    select a, b
    from random_numbers 
    where a < 10 or (a = 10 and b <= 40)
    except
    select a, b
    from random_numbers 
    where (a,b) <= (10,40)
) x;

That code gives us:

  • If a is less than 10, then include the row in the output, otherwise:
  • If a is equal to 10 and b is less than or equal to 40, then include the row in the output

We might have expected the first condition to be:

  • If a is less than or equal to 10

But again:

stopping as soon as an unequal or null pair of elements is found

So, if the first branch were a <= 10 rather than a < 10, rows where a = 10 would not cause the evaluation to stop: they would be included regardless of b, alongside the rows where a < 10. The second branch (a = 10 and b <= 40) would then be entirely redundant, being a subset of a = 10 with b equal to anything. The strict < in the first branch is what forces the comparison to fall through to b when a = 10.
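
Once more, Python tuples obey the same rule, making the strict first comparison easy to confirm:

```python
# (a, b) <= (10, 40) expands to a < 10 OR (a = 10 AND b <= 40):
# the first branch is strict, so a = 10 falls through to the b comparison
for a in range(0, 20):
    for b in range(30, 50):
        assert ((a, b) <= (10, 40)) == (a < 10 or (a == 10 and b <= 40))

assert (10, 40) <= (10, 40)      # equality on both elements passes
assert not (10, 41) <= (10, 40)  # a = 10 alone is not enough
print("lexicographic <= confirmed")
```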

The > and >= operators behave symmetrically; I won’t elaborate on them here, but they are included in the example script.

In summary

PostgreSQL allows us to use row constructors to compare sets of values.

The operators supported for such comparisons are =, <>, <, <=, > and >=.

= and <> work as we might expect. 

The other operators are evaluated from left-to-right and stop evaluating when the first unequal non-null comparison is encountered. Subsequent comparisons assume equality of the prior comparisons.

What is Advanced SQL, anyway?

I think about this sometimes. Is it window functions? CTEs? I don’t know. Some of it is trickier than others. I think it’s relative to how much practice you’ve had and depends a great deal on the data in front of you.

Anyway, I saw this post on LinkedIn by Ankit Bansal. It struck me as an interesting problem.

I wanted to come up with a solution before watching the author’s video. It’s only 8 minutes, so probably worth watching.

I was happy to find that my solution was close but not identical.

In case some of the techniques are interesting to others, I figured I would write up how I went about solving this problem.

The goal

The goal here was to turn this:

Into this:

This is the create script:

CREATE TABLE event_status (event_time varchar(10),[status] varchar(10));

INSERT INTO event_status (event_time, [status])
VALUES 
	('10:01','on'),('10:02','on'),('10:03','on'),
	('10:04','off'),('10:07','on'),('10:08','on'),
	('10:09','off'),('10:11','on'),('10:12','off');

As you can see, the data are simple. Finding a solution is perhaps not as simple as it might first appear.

A solution

With a problem like this, it’s useful to formulate a sentence or two that describes what we’re trying to do. So:

Identify groups of rows that begin with 1:n rows with status = 'on' and end with one row with status = 'off'. Then, identify the minimum event_time, the maximum event_time and the count of 'on' rows within each group.

There are many ways to solve this and some of the detail can depend on which variant of SQL you’re using. This time, I used SQL Server. This is what I came up with:

WITH dat 
AS 
(
SELECT
	e.event_time,
	e.[status],
	LAG(CASE WHEN e.[status] = 'off' THEN 1 ELSE 0 END,1,0) OVER (ORDER BY e.event_time) AS new_group
FROM event_status e
),
grp 
AS 
(
SELECT event_time, [status], SUM(new_group) OVER (ORDER BY event_time) AS grp_id
FROM dat
)
SELECT MIN(event_time) AS login, MAX(event_time) AS logout, SUM(CASE WHEN [status] = 'on' THEN 1 ELSE 0 END) AS on_count
FROM grp
GROUP BY grp_id;

We’ll break it down in a moment, but first let’s get some definitions out there.

Important concepts

  1. CTE – a common table expression. It lets you define a named temporary result set that can be used by other queries.
  2. Window function – a way to apply an aggregation or calculation over a partition of the query result. In this post, I’m using LAG to return a value from a row before the current row (here, one row before). I’m also using SUM to calculate a running total. You can tell something is a window function by the presence of the OVER clause.

The detail

Here I’m using two CTEs to:

  1. Add an indicator to show where a new group has started, and
  2. Assign a group id to each group of rows defined in the problem statement above

The first CTE is:

SELECT
	e.event_time,
	e.[status],
	LAG(CASE WHEN e.[status] = 'off' THEN 1 ELSE 0 END,1,0) OVER (ORDER BY e.event_time) AS new_group
FROM event_status e

It produces this result:

You can see that whenever the previous row’s status is ‘off’, the new column shows 1. All other rows show 0. The assumption here is that there are never two ‘off’ rows adjacent to each other. This 1 indicates the beginning of a new group.

LAG is a very useful function. The first parameter is the column, or calculated column, from which we want to return a value. The second parameter is the number of rows prior to the current row to look back, and the third parameter is a default value to use when LAG cannot find a row before the current row. The required ORDER BY in the OVER clause determines what constitutes “before”.

To convert this into a group ID, we only need to calculate a running total on the new_group column. The running total will start with the value of new_group on row 1, then add each successive value of new_group to the previous result of the sum.
SELECT event_time, [status], SUM(new_group) OVER (ORDER BY event_time) AS grp_id
FROM dat

When we use SUM over an ordered field, it has the effect of creating a running total. So, the result is then:

Now that we have a group ID, calculating simple aggregates is easy:

SELECT MIN(event_time) AS login, MAX(event_time) AS logout, SUM(CASE WHEN [status] = 'on' THEN 1 ELSE 0 END) AS on_count
FROM grp
GROUP BY grp_id

MIN and MAX should be simple enough. Counting the number of 'on' rows in each group is just a matter of nesting a CASE expression inside the SUM function. Because each 'on' row gets a value of 1 from the CASE, the SUM of that CASE is equivalent to a COUNT of rows WHERE status = 'on'.

Note that we achieve the right segmentation of the rows by using GROUP BY grp_id. It’s not necessary to have the grouping field in the SELECT clause.
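
To make the mechanics concrete, the whole pipeline (LAG, running total, group aggregation) can be traced in plain Python on the sample data. This is a step-by-step emulation of the query, not a replacement for the SQL:

```python
from itertools import accumulate

events = [("10:01", "on"), ("10:02", "on"), ("10:03", "on"),
          ("10:04", "off"), ("10:07", "on"), ("10:08", "on"),
          ("10:09", "off"), ("10:11", "on"), ("10:12", "off")]

# step 1: LAG ordered by event_time -- 1 when the PREVIOUS row's status
# is 'off', with a default of 0 for the first row
new_group = [0] + [1 if status == "off" else 0 for _, status in events[:-1]]

# step 2: a running SUM of new_group assigns a group id to every row
grp_id = list(accumulate(new_group))

# step 3: GROUP BY grp_id with MIN, MAX and the conditional count
groups = {}
for (time, status), g in zip(events, grp_id):
    login, logout, on_count = groups.get(g, (time, time, 0))
    groups[g] = (min(login, time), max(logout, time),
                 on_count + (status == "on"))

result = [groups[g] for g in sorted(groups)]
print(result)
# -> [('10:01', '10:04', 3), ('10:07', '10:09', 2), ('10:11', '10:12', 1)]
```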

In summary

This post showed how to use LAG and SUM window functions to group rows.

The LAG window function allows us to return values from rows before the current row.

SUM OVER ORDER BY allows us to create a running total on a numeric field. By applying it to a field where a 1 marks the start of a new group, we increment the group ID by one each time we encounter a new group.