SQL is changing into something new – are you ready?

On 30 April 2025, Databricks announced their SQL pipe operator. Without giving you a full rundown of what it is and what it does, here’s a quick example that should give you a good idea.

This query:

SELECT c_count, COUNT(*) AS custdist
FROM
  ( SELECT c_custkey, COUNT(o_orderkey) c_count
    FROM customer
    LEFT OUTER JOIN orders ON c_custkey = o_custkey
         AND o_comment NOT LIKE '%unusual%packages%'
    GROUP BY c_custkey
  ) AS c_orders
GROUP BY c_count
ORDER BY custdist DESC, c_count DESC;

Is now equivalent to this:

FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
      AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
   GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
   GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;

There’s more to it of course, but that’s the gist of it. It follows precedent set by other languages. Here’s how it might be written in R using tidyverse syntax.

library(dplyr)

customer %>%
  # Filter orders before the join, matching the SQL ON-clause condition:
  # customers with no qualifying orders are kept and get c_count = 0
  left_join(
    orders %>% filter(!grepl("unusual.*packages", o_comment)),
    by = c("c_custkey" = "o_custkey")
  ) %>%
  group_by(c_custkey) %>%
  summarise(c_count = sum(!is.na(o_orderkey)), .groups = "drop") %>%
  group_by(c_count) %>%
  summarise(custdist = n(), .groups = "drop") %>%
  arrange(desc(custdist), desc(c_count))

Here’s how you might express it in Python using a method chain.

import pandas as pd

result = (
    customer
    # Filter orders before the merge, mirroring the SQL ON-clause condition
    .merge(orders[~orders["o_comment"].str.contains("unusual.*packages", regex=True, na=False)],
           how="left", left_on="c_custkey", right_on="o_custkey")
    .groupby("c_custkey", as_index=False)
    .agg(c_count=("o_orderkey", "count"))  # count() skips NaN, like SQL COUNT(col)
    .groupby("c_count", as_index=False)
    .agg(custdist=("c_custkey", "count"))
    .sort_values(by=["custdist", "c_count"], ascending=[False, False])
)

Ok so you get the picture. It’s not an unusual step for Databricks to have taken. At the time of writing, I’m not sure what my opinion on piped SQL is. From a data programming perspective it does make sense, but part of the beauty of SQL is its constancy. For decades the core way you write a SQL query has remained unchanged. And this is a good thing! That’s not to say that various vendors haven’t added small enhancements over the years – new functions, operators, etc. But by and large, the order in which you write a query has always been SELECT, then FROM, then WHERE, and so on.

Databricks’s new syntax flips that on its head to make it more approachable for people coming from other languages – primarily Python, as you can see in their article.

My opinion is that this degree of change makes this querying syntax not SQL. It may have the same keywords. It may look like SQL. It may be used for a similar purpose to SQL. But it isn’t SQL.

Now don’t get me wrong. This is not to say it doesn’t have value. In fact I would love to get my hands on it and take it for a spin to see what damage can be done! But it isn’t SQL.

Databricks isn’t alone in introducing new ways to write SQL. Let’s take a quick journey through some of the more recent and notable innovations in the SQL space.

GROUP BY ALL

SELECT
    date_trunc(month, event_timestamp) AS event_month,
    event_type, 
    COUNT(DISTINCT user_id) AS distinct_user_count
FROM
    events
GROUP BY ALL;

Instead of writing this:

GROUP BY date_trunc(month, event_timestamp), event_type

Or (god forbid) this:

GROUP BY 1, 2

You can just use GROUP BY ALL.

I like this syntax. Writing SQL is often an exploratory process. Queries change quickly – new columns are added and removed as our understanding of a dataset improves.

And given that in most GROUP BY queries, any non-aggregate in the SELECT clause is going to appear in the GROUP BY clause, having GROUP BY ALL is incredibly convenient.
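For example, here’s a sketch of how that plays out mid-exploration. Adding a column (country is a hypothetical column here) requires no edit to the GROUP BY clause:

SELECT
    date_trunc(month, event_timestamp) AS event_month,
    event_type,
    country,  -- newly added during exploration; GROUP BY ALL picks it up
    COUNT(DISTINCT user_id) AS distinct_user_count
FROM
    events
GROUP BY ALL;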

All of this said, I don’t recommend using GROUP BY ALL in a production pipeline. For a query that’s unlikely to change, explicitly writing out the grouping columns is much safer. Read more about GROUP BY ALL here.

Using column aliases outside of SELECT

Snowflake allows you to use the ‘event_month’ alias from the query above in the WHERE clause.

SELECT
    date_trunc(month, event_timestamp) AS event_month,
    event_type, 
    COUNT(DISTINCT user_id) AS distinct_user_count
FROM
    events
WHERE
    event_month >= '2025-01-01'
GROUP BY ALL;

This is such a simple change, but it’s really hard to overstate how important it is. Not only can you use the column alias in WHERE, but you can use it in GROUP BY, HAVING and ORDER BY as well!
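For instance, building on the query above, the same alias idea works in GROUP BY and HAVING too (a sketch; the threshold here is hypothetical):

SELECT
    date_trunc(month, event_timestamp) AS event_month,
    COUNT(DISTINCT user_id) AS distinct_user_count
FROM
    events
WHERE
    event_month >= '2025-01-01'
GROUP BY
    event_month  -- alias instead of repeating the expression
HAVING
    distinct_user_count > 100
ORDER BY
    event_month;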

If you’ve ever worked in any situation where you’re writing lots of SQL for analytics, you know that expression sprawl is real. And if you need to keep writing those same expressions everywhere you use them, you typically fall back to pre-processing in a CTE or temp table.
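That fallback looks something like this: a sketch of the pre-processing CTE you end up writing on platforms without alias support.

WITH events_prepped AS (
    -- Compute the expression once so the rest of the query can use the name
    SELECT
        date_trunc(month, event_timestamp) AS event_month,
        event_type,
        user_id
    FROM
        events
)
SELECT
    event_month,
    event_type,
    COUNT(DISTINCT user_id) AS distinct_user_count
FROM
    events_prepped
WHERE
    event_month >= '2025-01-01'
GROUP BY
    event_month, event_type;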

This and GROUP BY ALL are what I would call scrappy innovations. They are really useful when you’re exploring, but I wouldn’t recommend keeping them in your code when you push something into a production pipeline. It’s better to be explicit and save yourself some migration hassle if (when) the time comes to jump ship to the Next Big Platform.

SELECT * ILIKE ‘pattern’

Another Snowflake innovation. When I look at this, it’s hard to believe we lived without it for so long.

This lets us dynamically select which columns to return by comparing the column name to a pattern. And if you think this is in some way bad or wrong, then you’ve never written dynamic SQL. And if you tell me you’ve never written dynamic SQL, it’s only because it hasn’t happened yet.

SELECT * ILIKE '%English%'
FROM DimProducts

This query will select those columns in the DimProducts dimension whose names contain ‘English’ (the match is case-insensitive). To be clear: this ‘pattern’ is not a regular expression. It uses an underscore to represent a single character and a percent sign to represent any sequence of characters. You can read more about it here.
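For contrast, here’s roughly what the dynamic SQL fallback looks like: query the metadata views for the matching column names, then splice the list into a query string you execute dynamically (a sketch; the details vary by platform):

-- Build the column list yourself from the metadata views
SELECT LISTAGG(column_name, ', ') AS column_list
FROM information_schema.columns
WHERE table_name = 'DIMPRODUCTS'
  AND column_name ILIKE '%ENGLISH%';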

SELECT * EXCLUDE (columns)

EXCLUDE is Snowflake syntax. BigQuery uses EXCEPT. Because to hell with standardisation, right?

Without getting into the weeds, you can quickly exclude a shortlist of columns from a SELECT * query. You know the ones – where you’ve got a dimension table with 30 columns and you don’t want to have to type out every column just so it won’t return those pesky ID fields.

Deliberately simple example:

SELECT * EXCLUDE (ProductID)
FROM DimProducts
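For comparison, here’s the BigQuery spelling of the same thing, and the Snowflake form with a shortlist of columns (the extra column names are hypothetical):

-- BigQuery
SELECT * EXCEPT (ProductID)
FROM DimProducts

-- Snowflake, excluding a shortlist of columns
SELECT * EXCLUDE (ProductID, ProductSubcategoryID, ProductModelID)
FROM DimProducts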

SELECT * REPLACE

Another modification to SELECT *, available in Snowflake, BigQuery and possibly others, this allows you to assign the value of an expression to a column name that already exists in the result set.

So, if for example you wanted to prefix the department ID with “DEPT-“, you could do this:

SELECT * REPLACE ('DEPT-' || department_id::text AS department_id)
FROM DimEmployees

The result being that the existing integer column department_id is returned as DEPT-1, DEPT-2 and so on.

SELECT * RENAME

RENAME allows you to – you guessed it – rename a column in a SELECT *.

It almost doesn’t need an example at this point but:

SELECT * RENAME (department_id AS dept_id)
FROM DimEmployees

Strictly speaking the parentheses for the above two examples are optional if you’re only replacing or renaming one column, but hopefully you get the picture.

And interestingly, to make things nice and confusing, you can combine ILIKE, EXCLUDE, REPLACE and RENAME. When you start doing that, though, you’re getting into territory where you’re not really saving yourself much time at all and you may as well just do it the old-fashioned way!
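For example, a sketch combining EXCLUDE, REPLACE and RENAME (column names hypothetical; as I understand it, Snowflake expects the keywords in this order):

SELECT * EXCLUDE (hire_date)
         REPLACE ('DEPT-' || department_id::text AS department_id)
         RENAME (last_name AS surname)
FROM DimEmployees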

Read all about these SELECT * modifiers from Snowflake here.

QUALIFY

More convenience! Have you ever wanted to filter on the result of a window function? Of course you have. Until QUALIFY came along, you would have to wrap the offending window function in a CTE or sub-query or some other diversion:

SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rnk
    FROM orders
) sub
WHERE rnk = 1;

QUALIFY lets you filter your query on the result of the window function without doing that:

SELECT *,
       RANK() OVER (PARTITION BY customer_id ORDER BY order_date DESC) AS rnk
FROM orders
QUALIFY rnk = 1;

Not only that, but you can forgo putting the window function in the SELECT clause altogether:

SELECT * 
FROM orders
QUALIFY RANK() OVER (PARTITION BY customer_id ORDER BY order_date DESC) = 1;

This is great. As for production use, I don’t think this qualifies (ahem) as one of those features you need to avoid, since it offers a significant advantage in query readability. Just be aware that if you migrate to a platform that doesn’t support QUALIFY in the future, you’ll have some refactoring to do.

AFAIK, QUALIFY is available in Snowflake, BigQuery, Databricks and Teradata (where it originated).

Conclusion

I’ve covered some of the more notable enhancements to writing SQL in this post. We looked at the brand new piped SQL variant from Databricks (which I think is not SQL – your opinion may vary). We looked at the hugely convenient GROUP BY ALL and QUALIFY clauses and touched on Snowflake’s ability to use column aliases outside of the SELECT clause. Additionally, both Snowflake and BigQuery have interesting ways to control how SELECT * is projected with EXCLUDE/EXCEPT, REPLACE and RENAME.

All in all, it’s an interesting time for SQL. As far as I know, none of the features I’ve covered in this article have made their way into any kind of SQL standard as of now, but who knows? Maybe we’ll see a shift towards more convenience by design in the near future?

What do you think? Have you used these features?

In my next post, I’ll cover some ideas I have about modifications I think would make writing SQL easier and more intuitive.
