T-SQL Tuesday #168 – FIRST_VALUE, LAST_VALUE and NULL Handling

FIRST_VALUE and LAST_VALUE had limited uses when they were first released to the world back in 2012, but with the introduced of the ability to ignore nulls in SQL Server 2022 they become a lot more useful. Let’s take a look at some examples of why.

TSQL Tuesday Logo

For this month’s T-SQL Tuesday Steve Jones (blog|twitter) asked us to share something neat using window functions. Who could resist?

Not me for certain. As a self confessed window function addict and regular presenter on the topic I couldn’t stop myself from sharing something.

Serendipitously I’ve been writing a query this week, not just with a window function but with window function features newly introduced with SQL Server 2022 and now available in Azure SQL DB and Managed Instances. What follows is a simplified and sanitised version of the code. Dial your server up to compatibility level 160 if you’d like to follow along!

The Problem With FIRST_VALUE and LAST_VALUE

FIRST_VALUE and LAST_VALUE functions were first introduced back in SQL Server 2012 but they didn’t see much use (or certainly not in my work) for a couple of reasons. Firstly because they didn’t do much with the default window window ranges, without dipping in to the ROWS or RANGES feature LAST_VALUE simply returns the values in the current row. More critically, as far as both functions were concerned the definition of “value” included null.

Take the table below, we record a value over time but often no record is available and we hit a NULL in the list. If I’m looking at row 12 I probably want my last value to be green from back on row 9 but depending on my window clause I’m likely to get either the value from row 12 (null), row 11(null) or possibly row 14 (you guessed it, null). Finding the last non null value in a table was possible but it wasn’t exactly straight forwards and would likely involve a couple of hours buried in Itzik Ben-Gan’s blogs. If you felt particularly daring you might end up taking some some dubious liberties with the query engine (hands up who remembers what a quirky update is).

Finding The Last Non Null Value – Enter SQL Server 2022

With SQL Server 2022 came a much requested additional feature added from the SQL standards – IGNORE_NULLS. You can probably guess what it does. Drop in IGNORE_NULLS after your function and you can blur the non null values over those gaps giving us results like this:

Definitely useful in itself where data is sporadic and you want to see whatever the most recent recorded value, in practice you’re likely to do this over multiple columns. The code for the above looks like this. I’m using the new WINDOW clause to define the window to act over. This has little impact on the size of the code for this example but if you need multiple columns with the same window it will save a lot of unneeded, messy repetition.

SELECT No, Colour,
LAST_VALUE(Colour) IGNORE_NULLS
	OVER MyWindow AS LAST_VALUE,
FROM #MyData
WINDOW 
MyWindow AS (
ORDER BY No
ROWS BETWEEN
UNBOUNDED PRECEDING
AND CURRENT ROW
)

Pivoting Data Over Time

But wait there’s more as they say in the adverts. LAST_VALUE is an aggregate function which means you can combine it with another of my favourite SQL techniques the crosstab/pivot query. Let’s see that in action.

I’m not a big fan of T-SQL’s PIVOT function, it’s too limiting and I can never remember the syntax. My preference has always been to embed an IIF statement inside an aggregate in order to control when the aggregate sees values and when it sees a NULL. Let’s look at an example building up in steps.

The below query takes a set of colours recorded over time. Applying an IIF for each colour we filter the results so we have a column that either contains the requested colour or a null value.

SELECT RecordDate, Colour,
IIF(Colour='Red’
	,RecordDate,NULL) 
	AS RedOnly,
IIF(Colour='Green’
	,RecordDate,NULL) 
	AS GreenOnly,
IIF(Colour='Blue’
	,RecordDate,NULL) 
	AS BlueOnly
FROM #MyData

Giving us a result set looking like the one below. You may notice that each column looks a lot like our single column example from above.

So let’s combine the two and wrap those IIFs inside a LAST_VALUE.

SELECT RecordDate, Colour,
LEAD(RecordDate) 
	OVER MyOrder AS NextDate,
LAST_VALUE(
	IIF(Colour='Red',RecordDate,NULL)
	) IGNORE NULLS
	OVER MyWindow AS LastSeenRed,
LAST_VALUE(
	IIF(Colour='Green',RecordDate,NULL)
	) IGNORE NULLS
	OVER MyWindow AS LastSeenGreen,
LAST_VALUE(
	IIF(Colour='Blue',RecordDate,NULL)
	) IGNORE NULLS
	OVER MyWindow AS LastSeenBlue
FROM #MyData
WINDOW 
MyOrder AS (
	ORDER BY RecordDate
),
MyWindow AS (
MyOrder
	ROWS BETWEEN
	UNBOUNDED PRECEDING
	AND CURRENT ROW
)

Giving us a final set of results looking like this.

Now every gap is filled and by sneaking in a lead function on the date field we have a set of data with continuous date ranges showing the last seen date for every colour over time.
There’s plenty of room for variations around this concept, anywhere that there’s gaps in records or intermittent data flows this approach can come to your rescue. I’d love to hear about other use cases people have encountered. Feel free to chip in down in the comments.

SQL Window Functions

Links to resources related to Window Functions and my sessions on them.

This page is here to act as an index to material I’ve produced relating to window functions.

GitHub: https://github.com/BarneyLawrence/Sessions-SQL-Window_Functions
My main repository, this holds PowerPoint presentations, sample queries and workbooks either used in sessions, to back them up or to develop the examples used.

Future Sessions

I’ll be presenting on window functions at:
PASS Data Community Summit – November 15th to 18th. Pre recorded session and live QA on Wednesday 15th 04:00 PM–04:30 PM PST

Further Resources

Itzik Ben-Gan’s Blogs at SQLPerformance.com
T-SQL Beyond the Basics – My 2021 PASS Summit session, includes sections on window functions amongst other intermediate techniques.

Querying XML in SQL Server – A Cheat Sheet

Need to learn as much as you can about XML in SQL Server as quickly as possible or just looking for a helpful reminder of the methods? Here’s the place to start.

I’ve been giving a few lightening talks on the topic of querying XML in SQL Server recently and this has really forced me to squeeze my blog content down into as small a space as possible. Part of that process was gathering examples that cover the maximum number of concepts in the minimum number of examples.

To make sure all my examples actually worked I gathered them together in another SQL Notebook this was for my own benefit to start with but I quickly realised it also to gives others the chance to run or change the examples themselves, it’s missing the full detail of my other blog posts and the notebooks I copied them into but if all you need is a few quick examples of values, nodes, exist or query methods it’s a great thing to have to hand.

Here’s the link:

A Cheat Sheet For Querying XML In SQL Server

It’s viewable directly in GitHub but best used in Azure Data Studio, preferably connected to an instance of SQL Server so you can run the queries.

Have fun with it and do let me know if there are any key examples you feel are missing.

Querying XML In SQL Server Series

Cheat Sheet

Posts in SQL Notebook form

T-SQL Tuesday 138: Managing Technology Changes

How Do You Respond When Technology Changes Under You? Grab it with both hands and cling on tight!

T-SQL Tuesday Logo

In this month’s T-SQL Tuesday Andy Leonard (b|t) asks How Do You Respond When Technology Changes Under You? .
My response to this would be that wherever possible embrace it. Despite inevitable teething troubles most software changes are for the better and while you probably can’t prevent change you can make sure it doesn’t leave you behind.

One aspect of T-SQL that everyone should embrace is window functions. They were first added in SQL Server 2005 with simple ranking functions and have built since then with features such as greater control over window size and shape and neat functions like lag, lead .

Despite being around for over a decade many haven’t really embraced window functions as far as they could and probably should.

Here’s an example of how a window function can make your code neater and faster. The example I’ll share is somewhat contrived to fit in the Adventure Works sample databases but the general pattern is one I’ve been fighting against for at least a decade.

Imagine that for some reasonthe Adventure Works Company wants to identify the most fertile customer that has purchased each product sold. Our analyst sits down and thinks “the most fertile customer has the most children” they therefore write a query to find the maximum number of children in the customer dimension of the data warehouse for each product. This doesn’t give us the customer just the number of children they have so we join the result back to the same table to get their name. Unfortunately we get multiple results because more than one person has that number of children.
Out analyst thinks again and decides that as a tie breaker we should pick the customer with the most recent date of birth (they had those children fastest) so picks the maximum date of birth from that list and again joins it to the customer table… only to find that there’s more than one with the same date of birth, finally we pick the lowest customer id and as there are no duplicates in this we get our answer.

The resulting query looks like this, actually I’m being kind here and using common table expressions to make the steps clear, more often than not when I see code like this it would be sub queries nested in sub queries and so on.

USE AdventureWorksDW2017;

WITH MaxChildren AS
(
SELECT S.ProductKey, MAX(C.TotalChildren) AS MaxChildren
FROM dbo.FactInternetSales AS S
INNER JOIN dbo.DimCustomer AS C
	ON S.CustomerKey = C.CustomerKey
GROUP BY S.ProductKey
	),
MaxBirth AS
(
SELECT S.ProductKey, MAX(C.TotalChildren) AS MaxChildren, MAX(C.BirthDate) as MaxBirth
FROM dbo.FactInternetSales AS S
INNER JOIN dbo.DimCustomer AS C
	ON S.CustomerKey = C.CustomerKey
INNER JOIN MaxChildren AS M
	ON S.ProductKey = M.ProductKey
	AND C.TotalChildren = M.MaxChildren
GROUP BY S.ProductKey
), 
MinCustomer AS
(
SELECT S.ProductKey, MAX(C.TotalChildren) AS MaxChildren, MAX(C.BirthDate) as MaxBirth,MIN(C.CustomerKey) AS MinCustomer
FROM dbo.FactInternetSales AS S
INNER JOIN dbo.DimCustomer AS C
	ON S.CustomerKey = C.CustomerKey
INNER JOIN MaxBirth AS M
	ON S.ProductKey = M.ProductKey
	AND C.TotalChildren = M.MaxChildren
	AND M.MaxBirth = C.BirthDate
GROUP BY S.ProductKey
)
SELECT S.ProductKey, C.CustomerKey
FROM dbo.FactInternetSales AS S
INNER JOIN dbo.DimCustomer AS C
	ON S.CustomerKey = C.CustomerKey
INNER JOIN MinCustomer AS M
	ON S.ProductKey = M.ProductKey
	AND C.TotalChildren = M.MaxChildren
	AND C.BirthDate = M.MaxBirth
	AND C.CustomerKey = M.MinCustomer
ORDER BY ProductKey, CustomerKey;

Apart from the length of the code the big problem is the number of times we go though the involved tables. SQL Server isn’t going to do anything clever here, we code in the two tables four times and so those tables will get read four times.

Enter window functions and the most common of them all ROW_NUMBER(), we use PARTITION BY to give us a number sequence that resets per product and then we order our customers by all our criteria in one go. The result is ranked customers with the one coming first in our sequence being the choice we need. Finally we just filter those results to get those top picks.

USE AdventureWorksDW2017;

WITH CustomersOrdered AS
(
SELECT S.ProductKey, C.CustomerKey, C.TotalChildren, C.BirthDate
, ROW_NUMBER() OVER (
						PARTITION BY S.ProductKey 
						ORDER BY C.TotalChildren DESC, C.BirthDate DESC, C.CustomerKey ASC
					) AS CustomerSequence
FROM dbo.FactInternetSales AS S
INNER JOIN dbo.DimCustomer AS C
	ON S.CustomerKey = C.CustomerKey
	)
SELECT ProductKey, CustomerKey
FROM CustomersOrdered
WHERE CustomerSequence = 1
ORDER BY ProductKey, CustomerKey;

All this happens in 17 lines of code rather than the 43 above. The intention of the query is clearer as well.
As an added bonus the query runs faster, we go through the table only once, admittedly with a big sort of the data but the gains outweigh the losses by a long distance.

There are many more examples of where window functions make your queries better and faster, it’s something I’m hoping to cover once I’ve finished my blog series on querying XML until then here are some links to some good reading material.

Querying XML in SQL Server – Part 7 – Creating XML

A brief break from querying XML this week while I prepare some test data. This naturally turns our eyes to building XML.

Last time I promised to start looking at performance and XML indexes. I’ll get there but I realised I needed a decent test data set and couldn’t easily lay my hands on one that matched my needs so had to start building one myself. Part way through the process I figured I may as well squeeze an extra post out of the process and blog the creation process.

In my experience creating XML is a less common task than reading it with one semi-exception (that being the old trick of using FOR XML PATH(”) to concatenate strings) but it can be an important one. The biggest piece of XML I’ve built was for a government mandated return, it was the only XML I built on that job but a sizeable proportion of the entire department’s work culminated in it. The switch from a set of Excel macros which took over eight hours to run to something that could rebuilt in minutes had a huge impact on our efficiency, particularly when there would be multiple rounds of validation failures against initial attempts at submission.

There are a few variants when it comes to converting queries into XML but in my opinion the only one you really need to know is the XML PATH approach, the others being either fiddly or not providing enough control.

Because all the cool kids are using it and I had a copy available from his recent live steamed courses I’m going to use a copy of the Stack Overflow database that Brent Ozar maintains. You can get your own copy here. I’m using the 2013 version but if you want to follow along any variant should work here. I’ve used TOP to limit results returned so that they’re short enough to place in the Blog.

How Do I Build XML In SQL Server?

At the simplest level we just drop FOR XML PATH after a query, we give a value to the path which becomes the element each row is nested in and give a root value that wraps the whole output.

SELECT TOP 10
	U.Id AS 'UserId',
	U.DisplayName,
	U.UpVotes,
	U.DownVotes
FROM dbo.Users AS U
FOR XML PATH('User'), ROOT('Users');

Which gets something like this (this is going to be a post where the SQL notebook version and its embedded query results is going to be really helpful, you can find these here):

<Users>
  <User>
    <UserId>17</UserId>
    <DisplayName>Nick Berardi</DisplayName>
    <UpVotes>884</UpVotes>
    <DownVotes>215</DownVotes>
  </User>
  <User>
    <UserId>19</UserId>
    <DisplayName>Mads Kristiansen</DisplayName>
    <UpVotes>36</UpVotes>
    <DownVotes>12</DownVotes>
  </User>
</Users>

How do I Create Attributes In XML With SQL Server?

Adding Attributes is easy, we just give the relevant column a name commencing with an @ symbol.

SELECT TOP 2
	U.Id AS '@UserId',
	U.DisplayName,
	U.UpVotes,
	U.DownVotes
FROM dbo.Users AS U
FOR XML PATH('User'), ROOT('Users');

And now the XML looks like this:

<Users>
  <User UserId="17">
    <DisplayName>Nick Berardi</DisplayName>
    <UpVotes>884</UpVotes>
    <DownVotes>215</DownVotes>
  </User>
  <User UserId="19">
    <DisplayName>Mads Kristiansen</DisplayName>
    <UpVotes>36</UpVotes>
    <DownVotes>12</DownVotes>
  </User>
</Users>

How Do I Nest Elements In XML With SQL Server?

If we want to start grouping elements under parent tags then we just name the columns with the needed path. If two consecutive fields have a common element to the path they get grouped together. Here we nest up and down votes in a votes element.

SELECT TOP 2
	U.Id AS '@UserId',
	U.DisplayName,
	U.UpVotes AS 'Votes/UpVotes',
	U.DownVotes AS 'Votes/DownVotes'
FROM dbo.Users AS U
FOR XML PATH('User'), ROOT('Users');

And now we have this:

<Users>
  <User UserId="17">
    <DisplayName>Nick Berardi</DisplayName>
    <Votes>
      <UpVotes>884</UpVotes>
      <DownVotes>215</DownVotes>
    </Votes>
  </User>
  <User UserId="19">
    <DisplayName>Mads Kristiansen</DisplayName>
    <Votes>
      <UpVotes>36</UpVotes>
      <DownVotes>12</DownVotes>
    </Votes>
  </User>
</Users>

How Do I Add Repeating Sub Elements to XML With SQL Server

Things get a little messier when it comes to adding repeating elements but the approach is relatively simple. An XML element can contain element’s itself so all we need to do is push in a sub query that generates that XML. We’ve already seen how to do this above.
I’m going to get some details on badges for each user, limited and grouped in this case to keep the results presentable.

Note in the sub query we specify the TYPE directive. This tells the server to keep the results as XML, if we don’t do this then the server assumes you want the results formatted as text, escapes all the special characters and makes a big mess.

SELECT TOP 2
U.Id AS '@UserId',
U.DisplayName,
U.UpVotes AS 'Votes/UpVotes',
U.DownVotes AS 'Votes/DownVotes',
(	SELECT TOP 2
		B.Name, MIN(B.Date) AS FirstAwarded FROM dbo.Badges AS B
	WHERE B.UserId = U.Id
	GROUP BY B.Name
	FOR XML PATH ('Badge'), TYPE
) AS 'Badges'
FROM dbo.Users AS U
FOR XML PATH('User'), ROOT('Users');

And now we get this:

<Users>
  <User UserId="17">
    <DisplayName>Nick Berardi</DisplayName>
    <Votes>
      <UpVotes>884</UpVotes>
      <DownVotes>215</DownVotes>
    </Votes>
    <Badges>
      <Badge>
        <Name>.net</Name>
        <FirstAwarded>2009-02-10T05:11:54.607</FirstAwarded>
      </Badge>
      <Badge>
        <Name>asp.net</Name>
        <FirstAwarded>2009-04-15T03:45:30.720</FirstAwarded>
      </Badge>
    </Badges>
  </User>
  <User UserId="19">
    <DisplayName>Mads Kristiansen</DisplayName>
    <Votes>
      <UpVotes>36</UpVotes>
      <DownVotes>12</DownVotes>
    </Votes>
    <Badges>
      <Badge>
        <Name>Beta</Name>
        <FirstAwarded>2008-09-16T00:00:00</FirstAwarded>
      </Badge>
      <Badge>
        <Name>Citizen Patrol</Name>
        <FirstAwarded>2008-09-15T08:55:05.157</FirstAwarded>
      </Badge>
    </Badges>
  </User>
</Users>

I’m not a fan of sub queries directly in the SELECT statement of a query and possibly a little too keen on using APPLY so I’d probably re-write the above like the following.
The ‘*’ on the field name is a place holder for nothing so we don’t force in another element level. Equally we could exclude the Root of ‘users’ and use that as the field name and get the same result.

SELECT TOP 2
U.Id AS '@UserId',
U.DisplayName,
U.UpVotes AS 'Votes/UpVotes',
U.DownVotes AS 'Votes/DownVotes',
B.x AS '*'
FROM dbo.Users AS U
OUTER APPLY
(
	SELECT B.Name, MIN(B.Date) AS FirstAwarded FROM dbo.Badges AS B
	WHERE B.UserId = U.Id
	GROUP BY B.Name
	FOR XML PATH ('Badge'), ROOT('Badges'), TYPE
) B(x)
FOR XML PATH('User'), ROOT('Users')
;

And that should cover 99% of needs when it comes to building XML. You might need to take the sub query concept a little further, potentially needing sub queries in your sub queries but with the above you can put together pretty much anything.

Querying XML In SQL Server Series

Cheat Sheet

Posts in SQL Notebook form

Querying XML In SQL Server – Part 6 – Sequencing Elements and Finding Positions

XML Elements aren’t always neatly labelled with a unique identifier which makes joining different levels back together in the database tricky. In this post we look at a number of solutions to this issue.

This post looks at a problem that can come up in XML messages that have perhaps not been thought out as well as we’d like and where we can’t uniquely identify one instance of a repeating element. The example we’ve used so far avoids this problem so we’ll switch to a sample that illustrates this problem.

CREATE TABLE #MyXML (MyXML XML NOT NULL);
 
INSERT INTO #MyXML(MyXML)
VALUES
(
'
<Order name="XML User Group">
	<Pizza size="medium">
		<Topping>Mushrooms</Topping>
		<Topping>Ham</Topping>
		<Topping>Jalapenos</Topping>
	</Pizza>
	<Pizza size="medium">
		<Topping>Pineapple</Topping>
		<Topping>Onion</Topping>
		<Topping>Jalapenos</Topping>
	</Pizza>
	<Pizza size="large">
		<Topping>Pepperoni</Topping>
		<Topping>Extra Cheese</Topping>
	</Pizza>
</Order>
'
);
 
SELECT 
X.MyXML.value('Order[1]/@name', 'varchar(50)') AS Name,
Pizza.PizzaXML.value('./@size', 'varchar(50)') AS Size,
Topping.ToppingXML.value('.', 'varchar(50)') AS Topping
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('Order/Pizza') AS Pizza(PizzaXML)
CROSS APPLY Pizza.PizzaXML.nodes('Topping') AS Topping(ToppingXML);

The XML looks understandable but when we query it we get something that looks like this:

SizeTopping
mediumMushrooms
mediumHam
mediumJalapenos
mediumPineapple
mediumOnion
mediumJalapenos
largePepperoni
largeExtra Cheese

How many Pizzas do we have? Which toppings go on which pizza? Will the vegetarians of the XML User Group go hungry?
The problem here is we have two elements with the same value (medium) and when we convert to rows in the table we lose the structure that would allow us to identify which toppings belong to which.
I’d love to say the solution is to just ask the developers of the source system to change the XML, but the answer to that request in most cases will be a two letter word beginning with N so what can we do instead?

How do I Sequence XML Elements in SQL Server With a Numbers Table?

One common solution has already appeared back in part 2, if we have a list of numbers we can push them in to the query to pick out the pizzas one at a time.
The query looks something like this:

With MyNumbers AS
(
SELECT N
FROM (VALUES(1),(2),(3),(4),(5),(6),(7)) AS N(N)
)
SELECT
MyNumbers.N,
X.MyXML.value('(Order/Pizza[sql:column("MyNumbers.N")]/@size)[1]' ,'varchar(50)')  AS Size
,Topping.ToppingXML.value('.', 'varchar(50)') AS Topping
FROM  #MyXML AS X
CROSS JOIN MyNumbers
OUTER APPLY X.MyXML.nodes('Order/Pizza[sql:column("MyNumbers.N")]/Topping') AS Topping(ToppingXML)
ORDER BY N;

And we have our numbering. This is the solution that you’ll probably find if you hit Google or StackOverflow. It works, it looks neat and it’s nicely self contained. There are possible issues with it unfortunately. The first is that we’ve specified our numbers. What if we have more than seven pizzas? We loose the rest. The observant might also notice I’ve swapped to OUTER APPLY to illustrate another issue, SQL Server doesn’t know how many nodes are in the XML and so just keeps on trying with all our numbers giving us empty results when there’s no match.
My hands on experience suggests that while this approach works for small sized XML messages, if you have thousands of messages each with potentially hundreds of elements then performance nosedives fast particularly if you have a high variance in the numbers (100 every time is less of a problem than a random number between 1 and 1000).
There are ways to mitigate this (use the query method to count the nodes and then limit the numbers pushed in) but they get messy and one begins to question if the work put into the fix outweighs the benefits of it.

How do I Sequence XML Elements in SQL Server With a Window Function?

We could look to using window functions to sequence the XML elements like this:

WITH Pizzas AS
(
SELECT 
X.MyXML.value('Order[1]/@name', 'varchar(50)') AS Name,
ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS PizzaSequence,
Pizza.PizzaXML.value('./@size', 'varchar(50)') AS Size,
Pizza.PizzaXML.query('.') AS PizzaXML
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('Order/Pizza') AS Pizza(PizzaXML)
)
SELECT 
Pizzas.PizzaSequence,
Pizzas.Size,
Topping.ToppingXML.value('.', 'varchar(50)') AS Topping,
PizzaXML
FROM Pizzas
OUTER APPLY Pizzas.PizzaXML.nodes('Pizza/Topping') AS Topping(ToppingXML);

We have to run in two parts because we aren’t allowed to partition on the XML data type and so we can’t drop the results of the first nodes expression in directly, even then what happens if we have two 100% identical pizzas, toppings and all? We need to sequence the pizzas and then use the nodes method to break out the toppings.
Unfortunately that’s enough to break the logical pointer we’ve used in the examples above, we have to effectively materialise each pizza as a separate piece of XML with the query method and then apply the nodes method to that. Here’s a copy of the query plan over at Paste the Plan I need to work out a larger data set to prove it (watch this space) but those estimated vs actual numbers make me nervous.
It also makes our code about twice as long without improving readability.

If you’ve been following along with this series then you’ll know that often there’s a solution inside XPath or XQuery that is either neater, faster or both because we can go into the XML, get a single answer and then come back out again without having to break the data into rows.

How do I Sequence XML Elements in SQL Server With XQuery?

Here’s my preferred solution to allow nodes in XML to be assigned a sequence number. It isn’t quite as elegant as I’d like as there are limits to XQuery’s position function and a few features missing from SQL Server’s implementation of XQuery (handling of siblings).
This approach makes use of the << comparison operator in XQuery which in essence means ‘before’, we assign our current element to a variable, move up to the parent level and then count the elements before the current one.
This leads us to a query that looks like this:

SELECT 
X.MyXML.value('Order[1]/@name', 'varchar(50)') AS Name,
Pizza.PizzaXML.value('let $i := . return count(../Pizza[. << $i])', 'int') + 1  AS PizzaSequence,
Pizza.PizzaXML.value('./@size', 'varchar(50)') AS Size,
Topping.ToppingXML.value('.', 'varchar(50)') AS Topping,
Topping.ToppingXML.value('let $i := . return count(../Topping[. << $i])', 'int') + 1  AS ToppingSequence,
Topping.ToppingXML.value('let $i := . return count(../../Pizza/Topping[. << $i])', 'int') + 1  AS ToppingSequenceAbsolute,
Topping.ToppingXML.value('let $i := .. return count(../../Pizza[. << $i])', 'int') + 1  AS ToppingPizzaSequence
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('Order/Pizza') AS Pizza(PizzaXML)
CROSS APPLY Pizza.PizzaXML.nodes('Topping') AS Topping(ToppingXML);

This gives us what we need in a single line of code and while not perfectly performant I’ve used this approach over tens of thousands of XML messages containing millions of elements in total with better results than the above two approaches.
The last three lines of the above query give variants on this technique to give us a sequence to the toppings within each pizza, an absolute sequence to the toppings across all pizzas and finally our pizza sequence but calculated starting from the topping level nodes rather than the pizza (if we wanted this would allow us to drop to one cross apply going directly down to toppings).

Next I’m going to turn my eye to performance by putting together a larger example data set and taking a look at the impact of XML Indexes.

See this post in SQL Notebook form.

Querying XML In SQL Server Series

Cheat Sheet

Posts in SQL Notebook form

Querying XML In SQL Server – Part 5 – An Introduction to the Query Method and XQuery

Did you know there’s a whole extra query language hidden inside SQL Server. Time to see XQuery in action with our XML.

Last time round we looked at exists, this time round it’s the query method. Similar to exists there’s rarely anything that you can do that you can’t also handle in SQL but there’s a lot of cases where it makes things either easier, neater or faster.
There is probably as much if not more to say about the query method as there the preceding methods put together so this post will be more of a guide on how to use it and some inspiration rather than comprehensive coverage. I’ll add more posts with examples as and when I’m inspired.

The query method does what you’d expect, it allows you to run a query against a piece of XML. Documentation for the method is here, it takes a string as the input and behind that string is XQuery an entire* new language hidden away in your SQL Server instance. Documentation for the XQuery language as implemented in SQL Server is here. It’s worth a browse, particularly the coded examples which often go a long way to providing models you can adapt to your needs.
You can think of XQuery as a superset of the XPath we’ve been using so far to navigate the levels of our XML, it does that and then a whole lot more on top to manipulate and apply functions to those paths. I’m not intending this to be a full tutorial on XQuery, that’s too big a job for a single post. Consider this more a recipe book and inspiration. I’m also going to focus on getting data out of XML. You can use XQuery to build XML as well but I’m not sure there are a lot of use cases for that inside SQL Server (do share in the comments if you have some).
It returns the XML data type, sometimes that’s actual XML other times it’s just text or a number that we’ll want to convert appropriately.

Let’s start as ever with our sample XML.

CREATE TABLE #MyXML (MyXML XML NOT NULL);
 
INSERT INTO #MyXML(MyXML)
VALUES
(
'<?xml version="1.0" encoding="UTF-8"?>
<items termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
   <item type="boardgame" id="91514">
      <thumbnail>https://cf.geekdo-images.com/R_VAhiLCzl5RXKwSluvEbg__thumb/img/zpzD1TJLfuNEjuTjpPbN1y1mpss=/fit-in/200x150/filters:strip_icc()/pic3271388.jpg</thumbnail>
      <image>https://cf.geekdo-images.com/R_VAhiLCzl5RXKwSluvEbg__original/img/hVn3HtD2_5vKNjhEBkL1qFB4FaU=/0x0/filters:format(jpeg)/pic3271388.jpg</image>
      <name type="primary" sortindex="1" value="Rhino Hero" />
      <name type="alternate" sortindex="1" value="Rino Ercolino" />
      <name type="alternate" sortindex="1" value="Super Rhino!" />
      <name type="alternate" sortindex="1" value="Super Rino!" />
      <description>
        Super Rhino! presents players with an incredibly heroic &amp;ndash; and regrettably heavy &amp;ndash; rhinoceros who is eager to climb a tall building and leap other tall buildings in a single bound. 
      </description>
      <yearpublished value="2011" />
      <minplayers value="2" />
      <maxplayers value="5" />
      <poll name="suggested_numplayers" title="User Suggested Number of Players" totalvotes="61">
         <results numplayers="1">
            <result value="Best" numvotes="0" />
            <result value="Recommended" numvotes="7" />
            <result value="Not Recommended" numvotes="29" />
         </results>
         <results numplayers="2">
            <result value="Best" numvotes="10" />
            <result value="Recommended" numvotes="40" />
            <result value="Not Recommended" numvotes="3" />
         </results>
         <results numplayers="3">
            <result value="Best" numvotes="44" />
            <result value="Recommended" numvotes="11" />
            <result value="Not Recommended" numvotes="0" />
         </results>
         <results numplayers="4">
            <result value="Best" numvotes="18" />
            <result value="Recommended" numvotes="30" />
            <result value="Not Recommended" numvotes="1" />
         </results>
         <results numplayers="5">
            <result value="Best" numvotes="5" />
            <result value="Recommended" numvotes="25" />
            <result value="Not Recommended" numvotes="13" />
         </results>
         <results numplayers="5+">
            <result value="Best" numvotes="0" />
            <result value="Recommended" numvotes="3" />
            <result value="Not Recommended" numvotes="24" />
         </results>
      </poll>
      <playingtime value="15" />
      <minplaytime value="5" />
      <maxplaytime value="15" />
      <minage value="5" />
   </item>
</items>'
);
 
SELECT * FROM #MyXML;

We encountered the simplest form of XQuery earlier when looking at nodes.
The below gives a neat way to view the XML in the current context and up at the parent level. I use this often as a help during development. If the results are going in to a view I’ll probably leave them there as the test team can find them helpful too but it’s not something I’d want filling up disk space.

SELECT
R.ResultXML.query('.') AS ResultLevelXML,
R.ResultXML.value('.','varchar(max)') AS ResultLevelXMLValue,
cast(R.ResultXML.query('.') as varchar(max)) AS ResultLevelXMLCast,
R.ResultXML.query('..') AS PollLevelXML
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results') AS R(ResultXML);

Note how in this case the value method gives us nothing but we can cast the query result into text.

Most other uses for the query method fall into similar categories as the exists method these are things you can do in SQL but that would require you to drop down to a level in the structure below the one you want in your final output. Let’s look at a few examples.

How Do I Aggregate XML Values In SQL Server?

We can use Xquery to sum values without the need to break them down at row level and group them up again. This can give us a result in the select statement or could be used in a WHERE clause. The format is simple and looks just like a SQL aggregate with a path inside.
Note the data types, even though maximum and minimum are only handling integers I get a type conversion error if I ask for an integer result.

SELECT
R.ResultXML.value('./@numplayers','varchar(50)') AS NumPlayers,
R.ResultXML.query('sum(./result/@numvotes)') AS SumVotesQuery,
R.ResultXML.value('sum(./result/@numvotes)','int') AS SumVotesValue,
R.ResultXML.value('max(./result/@numvotes)','float') AS MaxVotesValue,
R.ResultXML.value('min(./result/@numvotes)','float') AS MinVotesValue,
R.ResultXML.value('count(./result/@numvotes)','int') AS CountVotesValue,
R.ResultXML.value('avg(./result/@numvotes)','numeric(10,2)') AS SumVotesValue,
R.ResultXML.query('.') AS ResultsXML
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results') AS R(ResultXML)
WHERE
R.ResultXML.value('sum(./result/@numvotes)','int') 	> 50;

If we really want that integer then we need to explicitly convert the @numvotes attribute to an integer.

SELECT
--R.ResultXML.value('min(xs:integer(./result/@numvotes))','int') AS MinVotesValueBroken,
R.ResultXML.value('min(for $v in ./result/@numvotes return xs:integer($v))','int') AS MinVotesValue,
R.ResultXML.query('.') AS ResultsXML
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results') AS R(ResultXML);

That’s not simple, the commented out version looks like a viable approach but that’s not asking to convert the values into an integer it’s trying to convert all the elements as one.
Instead we need to iterate through each item converting it. It’s unlikely to be worth the effort in most cases but there’s a small risk of floating point errors if you leave things to the defaults which use the double data type.
The above is a nice introduction to looping through data in XQuery which follows a fairly common pattern in programming languages:

for $MyOneThing in AllTheThings return DoSomethingOneAtATimeTo($MyOneThing)

This sets us up for the next example.

How Do I Concatenate XML Values Into a Single String?

The ability to loop through sub nodes gives us a route to a concatenated string because sometimes (e.g. for report outputs) that’s all we need. Here’s an example concatenating all the alternative names for our example game into one.

SELECT
X.MyXML.query(
	'for $name in 
		/items/item/name 
	return 
		concat($name/@type,'': '',$name/@value, '' aka '')'
		).value('.','varchar(max)') AS AllTheNames
FROM #MyXML AS X;

The value method doesn’t work directly here but we can stack methods, query gives us the result as XML type and then we apply the value method to get us to text. We could also do this in SQL and might want to as we probably want to remove that stray ‘aka’ at the end. This next example shows STUFF () being used to solve the extra aka issue and also the text conversion running via CAST().

SELECT
STUFF(
	CAST(
		X.MyXML.query(
		'for $name in 
			/items/item/name 
		return 
			concat( '' aka '', $name/@type,'': '',$name/@value)'
			) 
	AS varchar(max))
,1,5,'') AS 	
		AllTheNames
FROM #MyXML AS X;

How to Do Almost Anything Else In SQL Server With XQuery?

The catch all solution applies here… Google it (other search engines are available). You’ll probably find someone has asked how to do the same thing on Stack Overflow (or tech forum of choice) and dropping that solution into the query method has a good chance of solving your problem or getting you most of the way there.
Other good resources are the aforementioned examples in Microsoft’s XQuery documentation and the W3 School’s tutorials both of which I’ve bounced in and out of while writing these blogs and the worktime activities that inspired them.

Observant readers may have noticed there was a somewhat ominous footnote promised earlier on, here it is.

*Why Won’t My Entirely Valid XQuery Work in SQL Server?

Long term users of SQL Server may be wholly unsurprised to discover that XQuery falls into the rather large category of “cool sounding features in SQL Server that haven’t been developed further since they were sufficiently functional to tick all the right boxes on a list of features but no more” (see elsewhere for temporal tables, graph queries and many others). To quote Microsoft on the subject:

This XQuery implementation is aligned with the July 2004 Working Draft of XQuery

Some Person At Microsoft

That’s a draft of version 1.0 and we’re now up to version 3.1, don’t be surprised if features are missing.

One set of features I’ve missed personally are the more advanced abilities to move around nodes in the XML specifying preceding or following siblings in a group to work out relative or absolute positions. Expect to see some alternative solutions to that in a future post.

Querying XML In SQL Server Series

Cheat Sheet

Posts in SQL Notebook form

Using SQL Notebooks For Knowledge Sharing

Jupyter notebooks aren’t just there for the script commands you struggle to remember (although they’re pretty good for that too).

T-SQL Tuesday Logo

For this month’s T-SQL Tuesday #137 Steve Jones (b|t) asked to share how we have used, or would like to use, Jupyter notebooks.

I don’t really think about Jupyter Notebooks often, I’ve seen some nice demos of use cases for us SQL Server people (See this session by Rob Sewell for a few examples) but never given them much thought beyond that.

Which possibly shows how unglamorous they can be given that I use one virtually every day at work.
Just the one though, it’s a PowerShell notebook and as PowerShell can be used to run git commands I use it to hold the scripts that either take me too long to type out (like refreshing all the branches for our shared UAT and live environments in one go) or that I forget the syntax for because I don’t use it often enough and there’s no friendly button in Visual Studio to save me.

I thought I’d focus on something a little more interesting for this blog instead and so went back to one of the use cases that interested me when I first encountered SQL Notebooks. Using them to combine both text and code into a nice neat demo of a T-SQL feature. When I first investigated this most of my ideas fell flat because each code block was considered a different session and so temporary tables couldn’t be carried between them, this prevented you building up a decent example data set and then using it in subsequent steps. The good news is that that issue has now been fixed and so I thought I’d try converting my series of blog posts on querying XML in SQL Server into a set of notebooks.

The process was pleasantly simple, it’s just a matter of copying and pasting over and most of the formatting for headings, bullet points and links carried over without issue. A few times I had to delve into the markdown code to strip out a rogue change of font but each post took me 5-10 minutes to adapt.

You can access the results here they’re even viewable online at GitHub although the formatting of results tables isn’t quite as neat as in Azure Data Studio and there’s no way to run the sample code.

There’s a lot to be said for the approach, the long bits of code that set up the sample data can be hidden and results can be optionally saved with the notebook. As long as you have access to SQL Server you can click to run the sample code or start to experiment by changing it.

There are a few down sides. The most obvious is that while temp tables work between blocks of code, intellisense isn’t smart enough to know this and so throws up a lot of red wiggly lines. I’d also love to be able to collapse results sets without wiping them completely in the same way that you can do for code. Sometimes I wanted the results to be there but they were too long winded to want them to be filling up the page by default.
Limits come from Azure Data Studio as well, it’s not an issue for these posts but the lack of query plans will likely send me running back to good old SSMS at some point in the future.

Overall I like the result. I have something that I can easily point people to that allows them to run my example code but it also allows for more control over the surrounding text than is possible in a classic SQL script. Given that it’s relatively easy to convert a blog post into a notebook I’m going to keep at it for now and where appropriate attach a notebook version to each blog post.

Querying XML In SQL Server Series

Cheat Sheet

Posts in SQL Notebook form

Querying XML In SQL Server – Part 4 – Filtering With The Exist Method

In part four of our series on querying XML in SQL server we look at the exist method and how it can help to filter results or flag for missing elements.

Back in Part 3 of this series we covered what I consider to be the essentials of querying XML. Between the values and nodes methods you can break up XML into rows and extract the values into columns and from there you can do anything you could with data lifted from a standard table in the database.

However there’s a lot that you can do that might perform better or look neater if a bit more of the query work was handled inside the XML and a bit less on the SQL side of the equation.

You can see what Microsoft has to say about the exist method here but you can probably guess a lot from the name and what we’ve seen of the values and nodes methods so far. Exist runs a query and returns a bit field, giving a 1 if the query produces a result and 0 if the result is empty.

Let’s start out as we always do with our sample XML.

CREATE TABLE #MyXML (MyXML XML NOT NULL);

INSERT INTO #MyXML(MyXML)
VALUES
(
'<?xml version="1.0" encoding="UTF-8"?>
<items termsofuse="https://boardgamegeek.com/xmlapi/termsofuse">
   <item type="boardgame" id="91514">
      <thumbnail>https://cf.geekdo-images.com/R_VAhiLCzl5RXKwSluvEbg__thumb/img/zpzD1TJLfuNEjuTjpPbN1y1mpss=/fit-in/200x150/filters:strip_icc()/pic3271388.jpg</thumbnail>
      <image>https://cf.geekdo-images.com/R_VAhiLCzl5RXKwSluvEbg__original/img/hVn3HtD2_5vKNjhEBkL1qFB4FaU=/0x0/filters:format(jpeg)/pic3271388.jpg</image>
      <name type="primary" sortindex="1" value="Rhino Hero" />
      <name type="alternate" sortindex="1" value="Rino Ercolino" />
      <name type="alternate" sortindex="1" value="Super Rhino!" />
      <name type="alternate" sortindex="1" value="Super Rino!" />
      <description>
        Super Rhino! presents players with an incredibly heroic &amp;ndash; and regrettably heavy &amp;ndash; rhinoceros who is eager to climb a tall building and leap other tall buildings in a single bound. 
      </description>
      <yearpublished value="2011" />
      <minplayers value="2" />
      <maxplayers value="5" />
      <poll name="suggested_numplayers" title="User Suggested Number of Players" totalvotes="61">
         <results numplayers="1">
            <result value="Best" numvotes="0" />
            <result value="Recommended" numvotes="7" />
            <result value="Not Recommended" numvotes="29" />
         </results>
         <results numplayers="2">
            <result value="Best" numvotes="10" />
            <result value="Recommended" numvotes="40" />
            <result value="Not Recommended" numvotes="3" />
         </results>
         <results numplayers="3">
            <result value="Best" numvotes="44" />
            <result value="Recommended" numvotes="11" />
            <result value="Not Recommended" numvotes="0" />
         </results>
         <results numplayers="4">
            <result value="Best" numvotes="18" />
            <result value="Recommended" numvotes="30" />
            <result value="Not Recommended" numvotes="1" />
         </results>
         <results numplayers="5">
            <result value="Best" numvotes="5" />
            <result value="Recommended" numvotes="25" />
            <result value="Not Recommended" numvotes="13" />
         </results>
         <results numplayers="5+">
            <result value="Best" numvotes="0" />
            <result value="Recommended" numvotes="3" />
            <result value="Not Recommended" numvotes="24" />
         </results>
      </poll>
      <playingtime value="15" />
      <minplaytime value="5" />
      <maxplaytime value="15" />
      <minage value="5" />
   </item>
</items>'
);

SELECT * FROM #MyXML;

Imagine we want to find all the results elements that had a result with more than 15 votes. We might write it like this.

SELECT
R2.ResultsXML.value('../@numplayers','varchar(50)') AS NumPlayers
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results/result') AS R2(ResultsXML)
WHERE R2.ResultsXML.value('./@numvotes','int') > 15;

But if you run that you’ll notice we get an extra row because Numplayers =4 has two different results that meet this condition. We could of course sweep this issue under the carpet with DISTINCT but that’s not going to perform well with larger data sets. We might also be able to produce something more complicated using EXISTS like this.

SELECT
R.ResultXML.query('.') AS ResultLevelXML,
R.ResultXML.value('./@numplayers','varchar(50)') AS NumPlayers
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results') AS R(ResultXML)
WHERE EXISTS (
SELECT 1
FROM R.ResultXML.nodes('./result') AS R2(ResultsXML)
WHERE R2.ResultsXML.value('./@numvotes','int') > 15
);

That’s a lot more code to type out, it’s harder to read and a quick peek at the query plans for the above two makes me think that second query is going to get ugly if you threw more data at it.

How can I use the exist method to filter results from XML in SQL Server?

We can use exist to test a piece of xml to see if a matching element exists inside it, any time a result is returned we get a 1 and if nothing is found we get a 0. This can then be used to test within the WHERE clause of a TSQL statement like the one below.

SELECT
R2.ResultsXML.value('./@numplayers','varchar(50)') AS NumPlayers
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results') AS R2(ResultsXML)
WHERE ResultsXML.exist('./result[@numvotes > 15]') = 1;

This looks a lot neater to me both in code view and as a query plan. Hopefully I’ll get the chance to do some performance tests in a later post in this series once I’ve built a suitable data set to test against. Anecdotally I can confirm I’ve had no issues using this approach in a work environment.

The exist method doesn’t have to be used in the where clause we can use it to give us useful columns in our results too.

How can I use the exist method to test for existence of XML elements in SQL Server?

The exist method can be added to a column definition to show if a particular element or value exists within the XML without filtering it out. Consider this example that lists all results and gives us a flag to show if any of the result level elements had 30 or more votes.

SELECT
R2.ResultsXML.value('./@numplayers','varchar(50)') AS NumPlayers,
ResultsXML.exist('./result[@numvotes > 30]') AS HasThirtyVotes
FROM #MyXML AS X
CROSS APPLY X.MyXML.nodes('/items/item/poll/results') AS R2(ResultsXML);

This approach can be really useful if there are optional elements to a piece of XML. I’ve used this where the XML represents a process performed in distinct steps and there’s a need to see which of those steps of the process have been started without needing any more detail. Exist allows you to quickly dip in and see what’s there and what isn’t without going any further than that.

Why Am I Getting Errors When Using The Node and Exist Methods Together?

A word of warning, while testing the above code I received syntax errors in SSMS like the one in the image below.
The good news is this appears to be an issue with SSMS rather than the code, it runs just fine but it is a little off putting to be told there’s a problem.

That’s all for this post, next time round we’re going to delve deeper into what more we can do inside XML with the query method.

Querying XML In SQL Server Series

Cheat Sheet

Posts in SQL Notebook form