In this month’s T-SQL Tuesday Andy Leonard (b|t) asks “How Do You Respond When Technology Changes Under You?” .
My response to this would be that wherever possible embrace it. Despite inevitable teething troubles most software changes are for the better and while you probably can’t prevent change you can make sure it doesn’t leave you behind.
One aspect of T-SQL that everyone should embrace is window functions. They were first added in SQL Server 2005 with simple ranking functions and have built since then with features such as greater control over window size and shape and neat functions like lag, lead .
Despite being around for over a decade many haven’t really embraced window functions as far as they could and probably should.
Here’s an example of how a window function can make your code neater and faster. The example I’ll share is somewhat contrived to fit in the Adventure Works sample databases but the general pattern is one I’ve been fighting against for at least a decade.
Imagine that for some reasonthe Adventure Works Company wants to identify the most fertile customer that has purchased each product sold. Our analyst sits down and thinks “the most fertile customer has the most children” they therefore write a query to find the maximum number of children in the customer dimension of the data warehouse for each product. This doesn’t give us the customer just the number of children they have so we join the result back to the same table to get their name. Unfortunately we get multiple results because more than one person has that number of children.
Out analyst thinks again and decides that as a tie breaker we should pick the customer with the most recent date of birth (they had those children fastest) so picks the maximum date of birth from that list and again joins it to the customer table… only to find that there’s more than one with the same date of birth, finally we pick the lowest customer id and as there are no duplicates in this we get our answer.
The resulting query looks like this, actually I’m being kind here and using common table expressions to make the steps clear, more often than not when I see code like this it would be sub queries nested in sub queries and so on.
USE AdventureWorksDW2017; WITH MaxChildren AS ( SELECT S.ProductKey, MAX(C.TotalChildren) AS MaxChildren FROM dbo.FactInternetSales AS S INNER JOIN dbo.DimCustomer AS C ON S.CustomerKey = C.CustomerKey GROUP BY S.ProductKey ), MaxBirth AS ( SELECT S.ProductKey, MAX(C.TotalChildren) AS MaxChildren, MAX(C.BirthDate) as MaxBirth FROM dbo.FactInternetSales AS S INNER JOIN dbo.DimCustomer AS C ON S.CustomerKey = C.CustomerKey INNER JOIN MaxChildren AS M ON S.ProductKey = M.ProductKey AND C.TotalChildren = M.MaxChildren GROUP BY S.ProductKey ), MinCustomer AS ( SELECT S.ProductKey, MAX(C.TotalChildren) AS MaxChildren, MAX(C.BirthDate) as MaxBirth,MIN(C.CustomerKey) AS MinCustomer FROM dbo.FactInternetSales AS S INNER JOIN dbo.DimCustomer AS C ON S.CustomerKey = C.CustomerKey INNER JOIN MaxBirth AS M ON S.ProductKey = M.ProductKey AND C.TotalChildren = M.MaxChildren AND M.MaxBirth = C.BirthDate GROUP BY S.ProductKey ) SELECT S.ProductKey, C.CustomerKey FROM dbo.FactInternetSales AS S INNER JOIN dbo.DimCustomer AS C ON S.CustomerKey = C.CustomerKey INNER JOIN MinCustomer AS M ON S.ProductKey = M.ProductKey AND C.TotalChildren = M.MaxChildren AND C.BirthDate = M.MaxBirth AND C.CustomerKey = M.MinCustomer ORDER BY ProductKey, CustomerKey;
Apart from the length of the code the big problem is the number of times we go though the involved tables. SQL Server isn’t going to do anything clever here, we code in the two tables four times and so those tables will get read four times.
Enter window functions and the most common of them all ROW_NUMBER(), we use PARTITION BY to give us a number sequence that resets per product and then we order our customers by all our criteria in one go. The result is ranked customers with the one coming first in our sequence being the choice we need. Finally we just filter those results to get those top picks.
USE AdventureWorksDW2017; WITH CustomersOrdered AS ( SELECT S.ProductKey, C.CustomerKey, C.TotalChildren, C.BirthDate , ROW_NUMBER() OVER ( PARTITION BY S.ProductKey ORDER BY C.TotalChildren DESC, C.BirthDate DESC, C.CustomerKey ASC ) AS CustomerSequence FROM dbo.FactInternetSales AS S INNER JOIN dbo.DimCustomer AS C ON S.CustomerKey = C.CustomerKey ) SELECT ProductKey, CustomerKey FROM CustomersOrdered WHERE CustomerSequence = 1 ORDER BY ProductKey, CustomerKey;
All this happens in 17 lines of code rather than the 43 above. The intention of the query is clearer as well.
As an added bonus the query runs faster, we go through the table only once, admittedly with a big sort of the data but the gains outweigh the losses by a long distance.
There are many more examples of where window functions make your queries better and faster, it’s something I’m hoping to cover once I’ve finished my blog series on querying XML until then here are some links to some good reading material.
- Official Documentation of OVER
- Itzik Ben-Gan presenting on the topic from the SQLBits Archive
- Looking to the future Koen Verbeeck writes about Snowflake and some of the features that haven’t reached SQL Server yet but that we can hopefully embrace in the future.