-

 

ROW_NUMBER(): An Efficient Alternative to Subqueries

By Francis Rodrigues, 2009/05/12

 

Introduction

    SQL Server 2005 offers an array of ranking and windowing functions that can be used to evaluate the data and only return appropriate rows.  For instance, a development cycle of a product may include hundreds of releases, and those releases may have versions associated with them: a "0" version with its many minor versions, a "1" version with its minor versions, etc.  If the history of a particular product's releases is kept, then analysis can be done to see how far a version is developed before it is released.  An ORDER BY clause alone cannot fulfill this need because the query would most likely still return the entire history of the product, not necessarily the last release of every version.  The code is also available for download.  The name of the file is RowCountScenario1-CodeDownload.sql.

Scenario 1 – Versioning

In order to demonstrate the usage of the ROW_NUMBER() windowing function, I started with Microsoft's AdventureWorks database.  In particular, I used the data in the Production.ProductCostHistory table.  The products in this table are identified by the ProductID column; this is a foreign key to the Production.Product table.  Using the Production.ProductCostHistory table, I mocked up some data to create versions for each Product in the table.  I used a random number generation process to create attributes called Version, MinorVersion and ReleaseVersion for each product.  These attributes are meant to show detailed information about the product.  Together 7.0.59 represents that the 7th version of the product is currently being used, a minor version represents the iteration of the version, and the release version of this particular installation is 59.  The next iteration of the product's life cycle could result with 7.2.19.  I also used the existing StandardCost to create different costs for each of the Versions, to create some sense of value for the particular Version.

I created a table called Production.ProductVersion with the ProductID, Version, MinorVersion and ReleaseVersion defined as the primary key and the StandardCost as an attribute.  I inserted the mocked up data generated by the code into this table to model a simple product version/cost history.

Code

I used the following code to populate the table with randomized data.  I created the data using a common table expression (CTE) and inserted the data into the table after it was generated. The data is based on the Production.ProductCostHistory table.

Code
The ABS(CHECKSUM(NEWID())) is utilized as the random number generator; the modulus operator provides the upper bound for the random number that was generated.  The NEWID() function is guaranteed to generate a globally unique identifier for each row.  The GROUP BY clause is used to avoid any Primary Key constraint violations that might be encountered.

The purpose of this exercise is to avoid the complexity of certain code by using the ROW_NUMBER() windowing function.  Suppose you are required to return only the latest version of a Product with its associated MinorVersion, ReleaseVersion and StandardCost.  The following query will not return the correct result set.

Code
In fact, this query violates integrity of the rows of data in the table.  It simply returns the maximum Version, the maximum MinorVersion, the maximum ReleaseVersion and the maximum StandardCost of a particular Product.  This is an easy and tempting trap to fall into.  Compare the results displayed in Figure 2 and Figure 3.  The actual data that is in Production.ProductVersion is in Figure 1.

 

The following sample query captures the actual requirements, and returns the correct result, but it is long and convoluted.

Code
This query utilizes nested subqueries in order to ensure the integrity of each row.  The first subquery (lines 117-25 in the code download) provides the maximum Version for each Product.  The second subquery provides the maximum MinorVersion for the maximum Version of each Product.  The subsubquery in the WHERE clause ensures that the MinorVersions and the Versions match.  The third subquery,
Code
(lines 127-46 in the code download) provides the maximum ReleaseVersion for the maximum MinorVersion of the maximum Version of each Product.  Once again the subqueries ensure that only complete rows of data are retrieved.  If this logic sounds too complicated, that is simply because it is.  The estimated subtree cost for this query turns out to be 0.583005.  This includes several Clustered Index Scan and Seek operations.  The query plan is displayed in Figure 6.  The complexity of the subquery approach can increase if the requirements change.

A simplified approach uses the ROW_NUMBER() function as shown below.

Code
The result set for this query looks like the Figure 3.

 

The PARTITION BY clause allows a set of row numbers to be assigned for all distinct Products.  When a new ProductID is encountered, the row numbering will start over at 1 and continue incrementing for each row with the same ProductID.  The row number will be assigned according to the sort order of the columns that you specify in the OVER clause's ORDER BY clause.  The estimated subtree cost for this improved query is 0.039954.  This query has only one Clustered Index Scan operation.  The query plan is displayed in Figure 7.

With the OVER clause, the ROW_NUMBER() function can efficiently assign a row number to each row in a query.  The fact that I've ordered the partitions by ProductID and then in descending order by Version, MinorVersion, and ReleaseVersion, guarantees the maximum version will be in the first row of each ProductID partition.  This allows me to use a simple WHERE MaxVersion = 1 predicate in place of the convoluted sub-query logic in the previous sample query.

To test the effects of indexing on difference between the two methods, I used the following table.

Code
I used the following query to generate a large set of randomized data to compare the estimated query costs for different record size sets.
Code
The following charts show the estimated query costs for different row sizes.  I chose to start at 1,000 because there are 286 distinct ProductIDs, starting at 100 would have eliminated too many rows.

The following figure shows the estimated query costs for the Production.ProductVersion.  The subquery implementation actually took less than 1 second to complete where as the ROW_NUMBER() implementation took about 2 seconds to complete for 1,000,000 rows.

Row Size

Subquery Implementation Cost

ROW_NUMBER() Implementation Cost

1000

0.0652462

0.0355736

10000

0.238573

0.673282

100000

2.2258

5.97198

1000000

14.3881

83.7228

Figure 4: Indexed estimated query costs

The following figure shows the estimated query costs for the Production.ProductVersion2.  The subquery implementation took 43 seconds to complete where as the ROW_NUMBER() implementation took 5 seconds to complete for 1,000,000 rows.

Row Size

Subquery Implementation Cost

ROW_NUMBER() Implementation Cost

1000

0.0355736

0.225896

10000

1.6397

0.673282

100000

44.1332

5.97202

1000000

448.47

83.7229

Figure 5: Non-indexed estimated query costs

These results may differ according to the hardware used to run the queries.  A quick look at the ROW_NUMBER() implementation column shows that indexing does not significantly impact this implementation's query cost where as it is very important to the subquery implementation's query cost.

Scenario 1 – Change in Requirements

Suppose the requirement changes and you need to grab the maximum MinorVersions for every (ProductID, Version) combination.  Changing the subquery implementation has a large overhead, namely breaking down the logic.  The subquery approach looks like this with the new set of requirements:

Code
The new requirements actually eliminate some of the nested subqueries.  The estimated query cost, however, does not change significantly for Production.ProductVersion.

This new change requires only a small modification to the ROW_NUMBER() implementation. This is what the ROW_NUMBER() implementation looks like for the new set of requirements:

Code
In this example, the row numbers are partitioned according to both ProductID and Version.  The WHERE clause is still valid here because the maximum MinorVersion for each (ProductID, Version) combination is guaranteed to be the first row.  The estimated query cost did not change greatly for the modified code.  The readability, manageability and the efficiency of the function make it a better choice than the subquery approach.

 

Conclusion

These examples show the critical role of indexing in the subquery approach.  The ROW_NUMBER() implementation is far more readable and therefore easier to maintain.  It also remains relatively independent of indexing even for large amounts of data.  Since the function takes advantage of the SQL Server's ability to sort records, most queries that need to uphold a level of sequencing should at the very least explore its implementation.  The sorting itself can greatly reduce or replace all together the extra logic necessary to enforce the integrity of data at the row level.  The readability of the function's implementation also plays a key role in its manageability.  Modifying the code with the ROW_NUMBER() implementation is easy because the logic is performed in easy to spot areas and is performed once, whereas in the subquery the logic appears in several places and could be repeated.

posted on 2009-05-15 11:13  yhb199  阅读(456)  评论(0编辑  收藏  举报