XSLT Archive

不及格的程序员-八神


SQL Server 2005: Say Goodbye to the IDENTITY Auto-Increment Primary Key with NEWSEQUENTIALID() (reposted)

How should a table's primary key be designed in SQL Server 2005?
Three primary key schemes are in common use today:

  • Auto-increment primary keys
  • Manually generated primary keys
  • UNIQUEIDENTIFIER primary keys

1. First, the auto-increment primary key. Its advantage is simplicity, and the type can go up to bigint. But it has a fatal weakness:

When we need to copy data between multiple databases (SQL Server's data publication and subscription mechanism lets us replicate data between databases), auto-increment columns can cause primary key conflicts when the data is merged. Imagine an Order table in one database replicating into an Order table in another database: should OrderID keep auto-incrementing or not?

 

2. Next, the manually generated primary key. Its advantage is that you define the key column yourself: its data type and even the shape of its values are under your control, and you can reliably obtain the target key value with no duplicates. But its maintenance cost is relatively high: you have to write your own stored procedure to generate key values, the network overhead is larger, and at runtime you must also handle concurrency conflicts.

 

3. Finally, the UNIQUEIDENTIFIER primary key. It uses a GUID as the key value, and newid() can be called directly to obtain a globally unique identifier, so even merging tables produces no duplicates. But GUIDs have two weaknesses. First, compared with int, a GUID is four times as long. Second, the GUIDs obtained from newid() follow no pattern at all; since the column is the primary key, it almost certainly carries the clustered index, which makes inserting new rows a very time-consuming operation. Used this way, a UNIQUEIDENTIFIER primary key costs a great deal of efficiency.

So under SQL Server 2000, DBAs often wrote a stored procedure to generate time-related GUIDs, putting the generation time in front of the GUID. This guarantees keys that are globally unique and increase over time. But that brings us back to the second scheme and its maintenance burden.
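A minimal sketch of one well-known variant of that trick (Jimmy Nilsson's "COMB" GUID; not necessarily the exact procedure those DBAs wrote). Note that this variant places the timestamp in the GUID's last six bytes, which is where SQL Server's uniqueidentifier comparison starts:

--Overwrite the last 6 bytes of a random GUID with the current datetime
--so that generated values sort roughly by insert time
SELECT CAST(
    CAST(NEWID() AS BINARY(10)) + CAST(GETDATE() AS BINARY(6))
    AS UNIQUEIDENTIFIER) AS comb_guid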

 

4. SQL Server 2005 solves this problem with NEWSEQUENTIALID().

The GUIDs this function produces are increasing. Let's look at how to use it.

--Create a test table
--1. The id column's type is UNIQUEIDENTIFIER
--2. ROWGUIDCOL is just an alias for the column; a table can have only one
--3. PRIMARY KEY makes id the primary key
--4. A DEFAULT constraint supplies the GUID for the column automatically
create table jobs
(
id UNIQUEIDENTIFIER ROWGUIDCOL PRIMARY KEY  NOT NULL
CONSTRAINT [DF_jobs_id] DEFAULT (NEWSEQUENTIALID()),
account varchar(64) not null,
password varchar(64) not null
)
go
 
select * from jobs
--Insert test data
insert jobs (account,password) values ('tudou','123')
insert jobs (account,password) values ('ntudou','123')
insert jobs (account,password) values ('atudou','123')
insert jobs (account,password) values ('btudou','123')
insert jobs (account,password) values ('ctudou','123')
 
select * from jobs

Result:

 

--With identity we can fetch the newly added id via Select @@IDENTITY
--What do we do with UNIQUEIDENTIFIER?
--Taking the manual approach, select NEWSEQUENTIALID() to fetch an id before inserting?
--No: the syntax does not support it
--The id of the newly added row can be captured with the method below
--In ADO.NET it is used just like Select @@IDENTITY
DECLARE @outputTable TABLE(ID uniqueidentifier)
INSERT INTO jobs(account, password)
OUTPUT INSERTED.ID INTO @outputTable
VALUES('dtudou', '123')
 
SELECT ID FROM @outputTable
 
--对比下数据
select * from jobs

Result:

 

--ROWGUIDCOL is an alias for the key column and can be used directly as a column name
--so the key column's actual name does not need to be known
insert jobs (account,password) values ('etudou','123')
select ROWGUIDCOL from jobs

Result:

 



How to auto-generate a string ID in SQL Server

A few days ago I had a requirement: with the caller not passing any ID, the database layer had to generate a 32-character string ID by itself. All the articles I could find covered auto-incrementing int IDs only. In the end I asked an expert, who solved the problem by setting a default value on the column. I'm sharing it here in the hope that it helps someone!

To auto-generate a string ID in SQL Server, just use Navicat to enter (replace(newid(),'-','')) in the column's default value box. Done!
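A minimal sketch of the same idea in plain T-SQL, without Navicat (the table and column names are invented for illustration):

create table demo_customer
(
id char(32) not null
CONSTRAINT [DF_demo_customer_id] DEFAULT (replace(newid(),'-','')),
name varchar(64) not null
)
go

insert demo_customer (name) values ('tudou')
select id, name from demo_customer

Note that, unlike NEWSEQUENTIALID(), this produces random values, so the fragmentation concerns discussed above still apply if the column is the clustered primary key.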


Asked 12 years, 8 months ago
Viewed 77k times
 
48

I know that If I run this query

select top 100 * from mytable order by newid()

it will get 100 random records from my table.

However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?

 

5 Answers

4

as MSDN says:

NewID() Creates a unique value of type uniqueidentifier.

and your table will be sorted by these random values.

 
  • 1
     
     
    Thanks - I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that [1] the select statement will select EVERYTHING from mytable, [2] for each row selected, tack on a uniqueidentifier generated by NewID(), [3] sort the rows by this uniqueidentifier and [4] pick off the top 100 from the sorted list?   Feb 12, 2011 at 19:50
 
12

In general it works like this:

  • All rows from mytable are "looped" through
  • NEWID() is executed for each row
  • The rows are sorted according to the random number from NEWID()
  • The first 100 rows are selected
 
39
 

I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?

Yes. this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.

SELECT TOP 100 * 
FROM master..spt_values 
ORDER BY NEWID()

The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.

SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)

Plan

 
  •  
    Got it! And yes, you're right - once I've determined the top 100 rows from the entire set, there's no need to sort the rest.   Feb 12, 2011 at 22:08
  •  
    So, is it safe to assume that no data is written? Since this is a SELECT query, NEWID() will calculate a randomized identifier just for the query; it won't be updating anything in the database with this new id, right? 
    – K09P
     Mar 25, 2019 at 11:40
  • 1
     
     
    Yes it won't affect the tables you are selecting from. At least some of the data will be temporarily written to a worktable in tempdb to hold at least the TOP N results but nothing written to the user database   Mar 25, 2019 at 12:36
  •  
    I don't see it mentioned anywhere else, but because of how this works, this is probably a terrible way of selecting random rows from a table. If you have a table with many rows (i.e. half a billion), do not run a query like that. If I ran this on a table with 500GB of data in it, I'd be in trouble. Better off using the built-in feature "TABLESAMPLE" that's meant for selecting random rows from data pages: SELECT * FROM Person.Person TABLESAMPLE (10 PERCENT); 
    – Triynko
     Apr 28 at 23:35 
 
-1

use select top 100 randid = newid(), * from mytable order by randid and it will become clear: the generated value shows up as its own column.

 
1

I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be OK in such cases, where performance is not too bad and it does not have a huge impact. But newId() is bad for large tables.

Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.

From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.

4 ways to get a random row from a large table:

  1. Method 1, Bad: ORDER BY NEWID() > Bad performance!
  2. Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
  3. Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
  4. Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.

More on method 3: Get the top ID field in the table, generate a random number, and look for that ID. For the top N rows, call the code below N times, or generate N random numbers and use them in an IN clause.

/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;

/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
 
  •  
    I find method 3 to be the fastest. But I have a problem when I get the top N. I ran it N times, but there will be several times where the data overlaps, so it's not really N random values. Do you have any ideas?   Apr 27 at 9:37

Asked 12 years, 8 months ago
Modified 7 months ago
Viewed 64k times
 
39

I have a query where I want the resulting records to be ordered randomly. It uses a clustered index, so if I do not include an order by it will likely return records in the order of that index. How can I ensure a random row order?

I understand that it will likely not be "truly" random, pseudo-random is good enough for my needs.

6 Answers

30
 

ORDER BY NEWID() will sort the records randomly. An example from SQLTeam.com

SELECT *
FROM Northwind..Orders 
ORDER BY NEWID()
 
23

This is an old question, but one aspect of the discussion is missing, in my opinion -- PERFORMANCE. ORDER BY NewId() is the general answer. When someone gets fancy they add that you should really wrap NewID() in CheckSum(), you know, for performance!

The problem with this method, is that you're still guaranteed a full index scan and then a complete sort of the data. If you've worked with any serious data volume this can rapidly become expensive. Look at this typical execution plan, and note how the sort takes 96% of your time ...


To give you a sense of how this scales, I'll give you two examples from a database I work with.

  • TableA - has 50,000 rows across 2500 data pages. The random query generates 145 reads in 42ms.
  • Table B - has 1.2 million rows across 114,000 data pages. Running Order By newid() on this table generates 53,700 reads and takes 16 seconds.

The moral of the story is that if you have large tables (think billions of rows) or need to run this query frequently the newid() method breaks down. So what's a boy to do?

Meet TABLESAMPLE()

In SQL 2005 a new capability called TABLESAMPLE was created. I've only seen one article discussing its use...there should be more. MSDN Docs here. First an example:

SELECT Top (20) *
FROM Northwind..Orders TABLESAMPLE(20 PERCENT)
ORDER BY NEWID()

The idea behind table sample is to give you approximately the subset size you ask for. SQL numbers each data page and selects X percent of those pages. The actual number of rows you get back can vary based on what exists in the selected pages.

So how do I use it? Select a subset size that more than covers the number of rows you need, then add a Top(). The idea is you can make your ginormous table appear smaller prior to the expensive sort.

Personally I've been using it to, in effect, limit the size of my table. So on that million row table, doing top(20)...TABLESAMPLE(20 PERCENT) drops the query to 5600 reads in 1600ms. There is also a REPEATABLE() option where you can pass a "Seed" for page selection. This should result in a stable sample selection.
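For example, a hedged sketch of the REPEATABLE() form (the seed value 42 is arbitrary):

--The same pages come back on every run as long as the data has not changed
SELECT Top (20) *
FROM Northwind..Orders TABLESAMPLE(20 PERCENT) REPEATABLE(42)
ORDER BY NEWID()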

Anyway, just thought this should be added to the discussion. Hope it helps someone.

  •  
    It would nice to be able to write a scalable random-ordering query which not only scales up but works with small data sets. It sounds like you have to manually switch between having and not having TABLESAMPLE() based on how much data you have. I don’t think that TABLESAMPLE(x ROWS) would even ensure that at least x rows are returned because the documentation says “The actual number of rows that are returned can vary significantly. If you specify a small number, such as 5, you might not receive results in the sample.”—so the ROWS syntax really still is just a masked PERCENT inside? 
    – binki
     Jun 18, 2014 at 18:49
  •  
    Sure, auto-magic is nice. In practice, I've rarely seen a 5 row table scale to millions of rows without notice. TABLESAMPLE() seems to base selection on the number of pages in a table, so the given row size influences what comes back. The point of table sample, at least as I see it, is to give you a good sub-set from which you can select -- kind of like a derived table. 
    – EBarr
     Jun 19, 2014 at 1:26
  •  
    TABLESAMPLE() should be used with care. For example, only rows from the 'first page' of the table may be returned. Thus it can appear not to be truly random. If true randomness is important, e.g. for an audit, then it shouldn't be used at all. If this is only for some quick usage, it is fine as a performance improvement, however.   Feb 2, 2021 at 22:26
22

Pradeep Adiga's first suggestion, ORDER BY NEWID(), is fine and something I've used in the past for this reason.

Be careful with using RAND() - in many contexts it is only executed once per statement so ORDER BY RAND() will have no effect (as you are getting the same result out of RAND() for each row).

For instance:

SELECT display_name, RAND() FROM tr_person

returns each name from our person table and a "random" number, which is the same for each row. The number does vary each time you run the query, but is the same for each row each time.

To show that the same is the case with RAND() used in an ORDER BY clause, I try:

SELECT display_name FROM tr_person ORDER BY RAND(), display_name

The results are still ordered by the name indicating that the earlier sort field (the one expected to be random) has no effect so presumably always has the same value.
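(A common workaround, sketched here under the same tr_person table, is to seed RAND() per row so that it is re-evaluated:

--The seed changes for every row, so RAND() yields a different value per row
SELECT display_name, RAND(CHECKSUM(NEWID())) AS per_row_rand
FROM tr_person
ORDER BY per_row_rand

though at that point NEWID() is doing the real work anyway.)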

Ordering by NEWID() does work though, because if NEWID() was not always reassessed, the purpose of UUIDs would be broken when inserting many new rows in one statement with unique identifiers as their key, so:

SELECT display_name FROM tr_person ORDER BY NEWID()

does order the names "randomly".

Other DBMS

The above is true for MSSQL (2005 and 2008 at least, and if I remember rightly 2000 as well). A function returning a new UUID should be evaluated once per row in any DBMS, as NEWID() is under MSSQL, but it is worth verifying this in the documentation and/or by your own tests. The behaviour of other arbitrary-result functions, like RAND(), is more likely to vary between DBMSs, so again check the documentation.

Also I've seen ordering by UUID values being ignored in some contexts, as the DB assumes that the type has no meaningful ordering. If you find this to be the case, explicitly cast the UUID to a string type in the ordering clause, or wrap some other function around it like CHECKSUM() in SQL Server (there may be a small performance difference from this too, as the ordering will be done on 32-bit values rather than 128-bit ones, though whether the benefit of that outweighs the cost of running CHECKSUM() per value first I'll leave you to test).
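Hedged sketches of those two workarounds, reusing the tr_person table from above:

SELECT display_name FROM tr_person ORDER BY CAST(NEWID() AS varchar(36)) -- cast the UUID to a string
SELECT display_name FROM tr_person ORDER BY CHECKSUM(NEWID())            -- order on a 32-bit checksum instead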

Side Note

If you want an arbitrary but somewhat repeatable ordering, order by some relatively uncontrolled subset of the data in the rows themselves. For instance, either of these will return the names in an arbitrary but repeatable order:

SELECT display_name FROM tr_person ORDER BY CHECKSUM(display_name), display_name -- order by the checksum of some of the row's data
SELECT display_name FROM tr_person ORDER BY SUBSTRING(display_name, LEN(display_name)/2, 128) -- order by part of the name field, but not in an obviously recognisable order

Arbitrary but repeatable orderings are not often useful in applications, though they can be useful in testing if you want to run some code on results in a variety of orders but want to be able to repeat each run the same way several times (for getting average timing results over several runs, or testing that a fix you have made to the code does remove a problem or inefficiency previously highlighted by a particular input resultset, or just for testing that your code is "stable" in that it returns the same result each time if sent the same data in a given order).

This trick can also be used to get more arbitrary results from functions that do not allow non-deterministic calls like NEWID() within their body. Again, this is not something that is likely to be often useful in the real world but could come in handy if you want a function to return something random and "random-ish" is good enough (but be careful to remember the rules that determine when user-defined functions are evaluated, i.e. usually only once per row, or your results may not be what you expect/require).

Performance

As EBarr points out, there can be performance issues with any of the above. For more than a few rows you are almost guaranteed to see the output spooled out to tempdb before the requested number of rows is read back in the right order, which means that even if you are looking for the top 10 you might find a full index scan (or worse, a table scan) happens along with a huge block of writing to tempdb. Therefore it can be vitally important, as with most things, to benchmark with realistic data before using this in production.

 
5

Many tables have a relatively dense (few missing values) indexed numeric ID column.

This allows us to determine the range of existing values, and choose rows using randomly-generated ID values in that range. This works best when the number of rows to be returned is relatively small, and the range of ID values is densely populated (so the chance of generating a missing value is small enough).

To illustrate, the following code chooses 100 distinct random users from the Stack Overflow table of users, which has 8,123,937 rows.

The first step is to determine the range of ID values, an efficient operation due to the index:

DECLARE 
    @MinID integer,
    @Range integer,
    @Rows bigint = 100;

--- Find the range of values
SELECT
    @MinID = MIN(U.Id),
    @Range = 1 + MAX(U.Id) - MIN(U.Id)
FROM dbo.Users AS U;

Range query

The plan reads one row from each end of the index.

Now we generate 100 distinct random IDs in the range (with matching rows in the users table) and return those rows:

WITH Random (ID) AS
(
    -- Find @Rows distinct random user IDs that exist
    SELECT DISTINCT TOP (@Rows)
        Random.ID
    FROM dbo.Users AS U
    CROSS APPLY
    (
        -- Random ID
        VALUES (@MinID + (CONVERT(integer, CRYPT_GEN_RANDOM(4)) % @Range))
    ) AS Random (ID)
    WHERE EXISTS
    (
        SELECT 1
        FROM dbo.Users AS U2
            -- Ensure the row continues to exist
            WITH (REPEATABLEREAD)
        WHERE U2.Id = Random.ID
    )
)
SELECT
    U3.Id,
    U3.DisplayName,
    U3.CreationDate
FROM Random AS R
JOIN dbo.Users AS U3
    ON U3.Id = R.ID
-- QO model hint required to get a non-blocking flow distinct
OPTION (MAXDOP 1, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

random rows query

The plan shows that in this case 601 random numbers were needed to find 100 matching rows. It is pretty quick:

Table 'Users'. Scan count 1, logical reads 1937, physical reads 2, read-ahead reads 408
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 9 ms.

Try it on the Stack Exchange Data Explorer.

  •  
    I love this solution. I was previously facing problems with both RAND() and NEWID() on a dataset of 1 million records, but this solution seems to perform pretty well so far; I'm impressed. Looking to test it on larger datasets.   Aug 10, 2020 at 20:44
0

As I explained in this article, in order to shuffle the SQL result set, you need to use a database-specific function call.

Note that sorting a large result set using a RANDOM function might turn out to be very slow, so make sure you do that on small result sets.

If you have to shuffle a large result set and limit it afterward, then it's better to use the SQL Server TABLESAMPLE in SQL Server instead of a random function in the ORDER BY clause.

So, assuming we have the following database table:


And the following rows in the song table:

| id | artist                          | title                              |
|----|---------------------------------|------------------------------------|
| 1  | Miyagi & Эндшпиль ft. Рем Дигга | I Got Love                         |
| 2  | HAIM                            | Don't Save Me (Cyril Hahn Remix)   |
| 3  | 2Pac ft. DMX                    | Rise Of A Champion (GalilHD Remix) |
| 4  | Ed Sheeran & Passenger          | No Diggity (Kygo Remix)            |
| 5  | JP Cooper ft. Mali-Koa          | All This Love                      |

On SQL Server, you need to use the NEWID function, as illustrated by the following example:

SELECT
    CONCAT(CONCAT(artist, ' - '), title) AS song
FROM song
ORDER BY NEWID()

When running the aforementioned SQL query on SQL Server, we are going to get the following result set:

| song                                              |
|---------------------------------------------------|
| Miyagi & Эндшпиль ft. Рем Дигга - I Got Love      |
| JP Cooper ft. Mali-Koa - All This Love            |
| HAIM - Don't Save Me (Cyril Hahn Remix)           |
| Ed Sheeran & Passenger - No Diggity (Kygo Remix)  |
| 2Pac ft. DMX - Rise Of A Champion (GalilHD Remix) |

Notice that the songs are being listed in random order, thanks to the NEWID function call used by the ORDER BY clause.

-2

This is an old thread but came across this recently; so updating a method that has worked for me and gives good performance. This assumes your table has an IDENTITY or similar column:

DECLARE @r decimal(8,6) = rand()
SELECT @r

SELECT  TOP 100 *
FROM    TableA
ORDER BY ID % @r

How to Get a Random Row from a Large Table

Last Updated 6 years ago
T-SQL
23 Comments

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:

The plan with the scan

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Method 2, Better but Strange: TABLESAMPLE

This came out in 2005, and has a ton of gotchas. It’s kinda picking a random page, and then returning a bunch of rows from that page. The first row is kinda random, but the rest aren’t.

The plan looks like it’s doing a table scan, but it’s only doing 7 logical reads:

The plan with the fake scan

But here’s the results – you can see that it jumps to a random 8K page and then starts reading out rows in order. They’re not really random rows.

Random like mafia lottery numbers

You can use the ROWS sample size instead, but it has some rather odd results. For example, in the Stack Overflow Users table, when I said TABLESAMPLE (50 ROWS), I actually got 75 rows back. That’s because SQL Server converts your row size over to a percentage instead.
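A sketch of the ROWS form described above:

--Asks for 50 rows, but SQL Server converts this to a percentage internally,
--so the number actually returned can differ noticeably
SELECT * FROM dbo.Users TABLESAMPLE (50 ROWS)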

Method 3, Best but Requires Code: Random Primary Key

Get the top ID field in the table, generate a random number, and look for that ID. Here, we’re sorting by the ID because we wanna find the top record that actually exists (whereas a random number might have been deleted.) Pretty fast, but is only good for a single random row. If you wanted 10 rows, you’d have to call code like this 10 times (or generate 10 random numbers and use an IN clause.)
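The query itself appeared as a screenshot in the original post; a minimal reconstruction in the spirit of the description (assuming the Stack Overflow dbo.Users table with an integer Id) might look like:

/* Pick a random number below the table's top ID */
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
DECLARE @rand BIGINT = ABS(CHECKSUM(NEWID())) % @maxid;

/* Sort by Id so we land on the nearest row that actually exists */
SELECT TOP 1 *
FROM dbo.Users
WHERE Id >= @rand
ORDER BY Id;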

The execution plan shows a clustered index scan, but it’s only grabbing one row – we’re only talking 6 logical reads for everything you see here, and it finishes near instantaneously:

The plan that can

There’s one gotcha: if the Id has negative numbers, it won’t work as expected. (For example, say you start your identity field at -1 and step -1, heading ever downwards, like my morals.)

Method 4, OFFSET-FETCH (2012+)

Daniel Hutmacher added this one in the comments:
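(The query was shown as an image in the original post; Daniel's comment, reproduced in the comment section below, contains it:)

DECLARE @row bigint=(
SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID('dbo.thetable'));

SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;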

And said, “But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.”

Bonus Track #1: Watch Us Discussing This

Ever wonder what it’s like to be in our company’s chat room? This 10-minute Slack discussion will give you a pretty good idea:

Spoiler alert: there was not. I just took screenshots.

Bonus Track #2: Mitch Wheat Digs Deeper

Want an in-depth analysis of the randomness of several different techniques? Mitch Wheat dives really deep, complete with graphs!

 

23 Comments

  • Daniel Hutmacher
    March 5, 2018 8:41 am

    Method 4: using OFFSET-FETCH:

    DECLARE @row bigint=(
    SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
    WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID('dbo.thetable'));

    SELECT *
    FROM dbo.thetable
    ORDER BY (SELECT NULL)
    OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

    But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.

    Reply
  • It seems that chat transcript was heavily censored 🙂

    Reply
  • Looks like someone came across Method 3 a little bit ago. I like it! Thanks for the post.

    https://stackoverflow.com/questions/9463938/checksumnewid-executes-multiple-times-per-row

    Reply
  • This should allow for gaps and always return a row regardless of min and max id values: –

    DECLARE @maxid bigint, @minid bigint, @randid bigint, @rand float = rand()
    SELECT @minid = MIN(Id), @maxid = MAX(Id) FROM dbo.Users
    SELECT @randid = MIN(Id) FROM dbo.Users WHERE Id >= (@minid + ( (@maxid - @minid) * @rand) )
    SELECT * FROM dbo.Users WHERE Id = @randid

    Reply
  • I always wonder why MS doesn’t look at these things and make them built in. I’m certain it wouldn’t take much effort for a built in rand_row(#) function.

    Reply
  • It all depends on how random you want your results to be. With Option 2, assuming you randomly select a row from the returned page(s), and you can get the percent correct, you’ll be oversampling larger rows (and rows in pages with more free space). With Option 3, if you have any gaps in your key, you’ll oversample the rows after a gap (think identity columns where SQL Server decides to bump the identity value by 1000).

    Reply
  • I blogged a similar topic a while back and did some basic stats testing: https://mitchwheat.com/2011/08/07/t-sql-generating-random-numbers-random-sampling-and-random-goodness/

    It includes your Method 3 which I saw in a MSDN article (not sure where though)

    Reply
    • Mitch – wow, you put a ton of work into that post! Nice job. I’ve linked to that in the body of the post because it’s so interesting.

      Reply
  • Seems to do the trick :
    DECLARE @MxRws BIGINT, @row BIGINT;
    AndOneMoreTime: /*Only if the row does not exist*/
    SELECT @MxRws = IDENT_CURRENT('[TableHere]'); --Assumes the PK is an identity column
    SELECT @row = RAND() * @MxRws;
    /*Check that the row exists and has not been deleted*/
    IF NOT EXISTS (SELECT 1 FROM [TableHere] WHERE Id = @row) GOTO AndOneMoreTime ELSE SELECT * FROM [TableHere] WHERE Id = @row;

    Reply
  • What about taking advantage of the pk (or index) histogram statistics?

    Something like:

    drop table if exists #stats

    create table #stats(
    id int identity(1,1) primary key,
    RANGE_HI_KEY int,
    RANGE_ROWS int,
    EQ_ROWS numeric(10,2),
    DISTINCT_RANGE_ROWS int,
    AVG_RANGE_ROWS int
    )
    insert into #stats
    exec('dbcc show_statistics(''Users'',''IX_ID'') WITH HISTOGRAM')

    declare @rows int
    select @rows=count(*) from #stats

    declare @id int =cast(round((@rows-1)* Rand()+1,0) as integer)
    declare @low int
    declare @high int

    select top 1 @low=a.range_hi_key , @high=lead(range_hi_key,1) over (order by id) from #stats a where a.id between @id and @id+1
    order by id

    select top 1 * from Users where ID >= @low+cast(round(((@high-@low)-1)* Rand()+1,0) as integer)

    Reply
  • I have a quick’n’dirty query that I use to get semi-random values from a table. I use time. It depends, of course, on whether one needs a result every execution, how many gaps in the table there are, how representative it has to be and how large the table is.
    declare @gosh int = datepart( hour, getdate() ) * datepart( minute, getdate() ) * datepart( s, getdate() ) * datepart( ms, getdate() );
    select *
    from schemaname.BigTable
    where BigTablePK = @gosh
    ;
    This allows me to address a row in a table with 41 million rows. It is not especially representative but it does give me new rows on each execution and it is quite fast.

    Reply
  • Maybe this is obvious, but I think method 3 is not completely random. What if there are only 2 rows with the id of 1 and 100? “100” will be selected with 99% probability. Can it be improved?

    Reply
  • Dismiss it, but ORDER BY NEWID() appears to be the only way to get a truly random sample set (more than one record) in your list of methods.

    Reply
  • Am I the only person who thinks the 3rd or 4th method is bad if the PK has a large number of big gaps?

    because with big gaps in the PK, the probability of receiving certain rows is very large compared to other rows..

    e.g.

    You have a table with 1,000,000 rows, then you delete the first 700,000 rows,

    so the probability of receiving the first row after the gap (after the deletion)
    is ~70%, where the remaining 30% "probability" is spread across the other rows

    Am I correct?

    Reply
    • Yeah, method 3 is not random at all, just imagine a table with UserId = {1, 2, 99999, 100000}. There, 99999 will be picked close to always.

      To pick a decently random record you need at least a COUNT() of the table, and that requires a table scan. It's the scan that takes time, not the calculation of NewId().

      Reply
  • Giving each row a random value is the only right way to do random selection unless you already have them numbered without gaps.

    If you know you have millions of rows you can filter out maybe 999/1000 of the users and then sort the rest:

    SELECT TOP (1)
    U.*
    FROM dbo.Users U
    CROSS APPLY(SELECT CHECKSUM(NEWID())) R(Rnd)
    WHERE
    R.Rnd%1000=0
    ORDER BY
    NEWID()

    This code should be a lot faster than if you drop the CROSS APPLY and WHERE.

    Also, don’t use ABS(CHECKSUM(NEWID())) in production.

    Reply
  • This post and comments have helped me thus far. My business case is a tad different. I need a random sample of claims that represent 2.5% of the population. So, if my population contains 3,517 distinct records, I need to return, consistently, 88 distinct records. Here’s what I’ve done thus far:

    DECLARE @FirstofLastMonth date = DATEADD(DD,1,EOMONTH(Getdate(),-2)),
    @EndofLastMonth date = EOMONTH(Getdate(), -1)

    DECLARE @Sample table (ID uniqueidentifier, ClaimNO varchar(50), Company_ID varchar(10), FinancialResp varchar(30), ProviderType varchar(50),
    DateOfService date, ClaimType varchar(20), TotalBilled numeric(10,2), TotalPaid numeric(10,2) )
    Insert into @Sample (ID, ClaimNo,Company_ID, FinancialResp, ProviderType, DateOfService, ClaimType, TotalBilled, TotalPaid)
    Select distinct NEWID() as ID, cd.ClaimNo, cd.Company_ID, cd.FinancialResp,
    CASE
    when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '1' then 'Contracted'
    when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '0' then 'Non-Contracted'
    when cd.FinancialResp = 'KDMA' and pu.UDF_Value = 'Y' then 'Contracted'
    when cd.FinancialResp = 'KDMA' and pu.UDF_Value <> 'Y' then 'Non-Contracted'
    else 'Missing'
    end as ProviderType,
    Format( Min(cd.FromDateSvc), 'MM/dd/yyyy') as DateOfService,
    CASE
    when substring(cd.ClaimNo, 9,1) = '8' then 'Institutional'
    when substring(cd.ClaimNO, 9,1) = '9' then 'Professional'
    end as Claim_Type,
    sum(cd.Billed) as TotalBilled,
    SUM(cd.Net) as TotalPaid
    from Claim_Details cd
    inner join Claim_Masters_V cm on cd.ClaimNo = cm.ClaimNo and cd.Company_ID = cm.Company_ID
    inner join Prov_VendInfo pv on cm.Prov_KeyID = pv.Prov_KeyID and cm.Company_ID = pv.Company_ID
    inner join Vend_Masters vm on pv.Vendor = vm.VendorID
    and pv.Company_ID = vm.Environment_ID
    inner join Prov_Master_V pm on cm.Prov_KeyID = pm.Prov_KeyID
    and cm.Company_ID = pm.Company_ID
    and pv.Default_YN = 1
    inner join BM_Prov_HP_LineOfBuss lob on cm.Prov_KeyID = lob.Prov_KeyID
    and cm.Company_ID = lob.Company_ID
    and cm.HPCode = lob.HPCode
    and pv.Vendor = lob.Vendor
    and lob.LOB_Start = CAST(CHECKSUM(NEWID(),ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

    When I run the above query repeatedly, the results vary in number from around 100 to the upper 80’s and lower 90’s. How do I refine the above to ensure that just 2.5% of the qualifying records are returned in a random sample?

    Reply
 

Auto generated SQL Server keys with the uniqueidentifier or IDENTITY

By: Armando Prato   |   Comments (38)

 

Problem

I'm designing a table and I've decided to create an auto-generated primary key value as opposed to creating my own scheme or using natural keys. I see that SQL Server offers globally unique identifiers (GUIDs) as well as identities to create these values. What are the pros and cons of these approaches?

Solution

Yes, there are a number of ways you can auto-generate key values for your tables. The most common ways are via the use of the IDENTITY column property or by specifying a uniqueidentifier (GUID) data type along with a DEFAULT of either the NEWID() or NEWSEQUENTIALID() function. Furthermore, GUIDs are heavily used in SQL Server Replication to uniquely identify rows in Merge Replication or Transactional Replication with updating subscriptions.

The most common, well known way to auto-generate a key value is via the use of the IDENTITY column property on a column that's typically declared as an integer. Once defined, the engine will automatically generate a sequential number based on the way the property has been declared on the column. The IDENTITY property takes an initial seed value as its first parameter and an increment value as its second parameter.

Consider the following example which creates and inserts into identity based tables that define the primary key as a clustered index:

SET NOCOUNT ON
GO
USE MASTER
GO
CREATE DATABASE MSSQLTIPS
GO

USE MSSQLTIPS
GO
-- Start at 1 and increment by 1
CREATE TABLE IDENTITY_TEST1 
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- Start at 10 and increment by 10
CREATE TABLE IDENTITY_TEST2
(
ID INT IDENTITY(10,10) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- Start at 1000 and increment by 5
CREATE TABLE IDENTITY_TEST3
(
ID INT IDENTITY(1000,5) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- INSERT 1000 ROWS INTO EACH TEST TABLE 
DECLARE @COUNTER INT
SET @COUNTER = 1

WHILE (@COUNTER <= 1000)
BEGIN
   INSERT INTO IDENTITY_TEST1 DEFAULT VALUES 
   INSERT INTO IDENTITY_TEST2 DEFAULT VALUES 
   INSERT INTO IDENTITY_TEST3 DEFAULT VALUES 
   SET @COUNTER = @COUNTER + 1
END
GO

SELECT TOP 3 ID FROM IDENTITY_TEST1
SELECT TOP 3 ID FROM IDENTITY_TEST2
SELECT TOP 3 ID FROM IDENTITY_TEST3
GO

 

 

Another way to auto-generate key values is to specify your column as a type of uniqueidentifier and DEFAULT using NEWID() or NEWSEQUENTIALID(). Unlike IDENTITY, a DEFAULT constraint must be used to assign a GUID value to the column.

How do NEWID() and NEWSEQUENTIALID() differ? NEWID() randomly generates a guaranteed unique value based on the identification number of the server's network card plus a unique number from the CPU clock. In contrast, NEWSEQUENTIALID() generates these values in sequential order as opposed to randomly.

Let's create new tables that use a uniqueidentifier along with both NEWID() and NEWSEQUENTIALID()

USE MSSQLTIPS
GO

CREATE TABLE NEWID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO
CREATE TABLE NEWSEQUENTIALID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- INSERT 1000 ROWS INTO EACH TEST TABLE 
DECLARE @COUNTER INT
SET @COUNTER = 1

WHILE (@COUNTER <= 1000)
BEGIN
   INSERT INTO NEWID_TEST DEFAULT VALUES 
   INSERT INTO NEWSEQUENTIALID_TEST DEFAULT VALUES 
   SET @COUNTER = @COUNTER + 1
END
GO

SELECT TOP 3 ID FROM NEWID_TEST
SELECT TOP 3 ID FROM NEWSEQUENTIALID_TEST
GO

 

 

As you can see, the first table which uses NEWID() generates random values while the second table that uses NEWSEQUENTIALID() generates sequential values. As opposed to the integers generated by the IDENTITY approach, the GUID values generated are not as friendly to look at or work with. There is one other item to note. SQL Server keeps the last generated identity value in memory which can be retrieved right after an INSERT using SCOPE_IDENTITY(), @@IDENTITY, or IDENT_CURRENT() (depending on the scope you require). There is nothing similar to capture the last generated GUID value. If you use a GUID, you'll have to create your own mechanism to capture the last inserted value (i.e. retrieve the GUID prior to insertion or use the SQL Server 2005 OUTPUT clause).
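For instance, a minimal sketch of the OUTPUT approach against the NEWID_TEST table above:

DECLARE @NewKeys TABLE (ID UNIQUEIDENTIFIER)

-- Capture the GUID assigned by the DEFAULT constraint as the row is inserted
INSERT INTO NEWID_TEST
OUTPUT INSERTED.ID INTO @NewKeys
DEFAULT VALUES

SELECT ID FROM @NewKeys
GO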

Now that we understand how to auto generate key values and what they look like, let's examine the storage impacts of each approach. As part of the previously created table definitions, I added a column of CHAR(2000) to mimic the storage of additional column data. Let's examine the physical storage of the data:

USE MSSQLTIPS
GO
SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null)
ORDER BY tablename
GO

 

 

Looking at this output, you can see that the NEWID() test table is very fragmented as evidenced by its fragmentation percentage of 98%. Furthermore, you can see that the rows were dispersed among 490 pages. This is due to the page splitting that occurred due to the random nature of the key generation. In contrast, the IDENTITY and NEWSEQUENTIALID() test tables show minimal fragmentation since their auto generated keys occur in sequential order. As a result, they don't suffer from the page splitting condition that plagues the NEWID() approach. Though you can defragment the NEWID() table, the random nature of the key generation will still cause page splitting and fragmentation with all future table INSERTs. However, page splitting can be minimized by specifying an appropriate FILL FACTOR.
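A hedged sketch of that mitigation, using the NEWID_TEST table from this tip:

-- Rebuild with 30% free space per page so future random inserts
-- cause fewer page splits
ALTER INDEX ALL ON NEWID_TEST REBUILD WITH (FILLFACTOR = 70)
GO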

Looking at the NEWSEQUENTIALID() test table, we see it generated fewer pages than the NEWID() approach but it still generated more pages than the IDENTITY approach. Why is this? It's because the uniqueidentifier data type consumes 16 bytes of disk space as opposed to the 4 bytes used by the integer data type that was used for the IDENTITY. Considering that SQL Server pages are generally capped at 8K or roughly 8060 bytes (as of SQL Server 2005, there is a row-overflow mechanism that can kick in but that's for another discussion), this leads to more pages generated for the NEWSEQUENTIALID() approach as opposed to the IDENTITY approach.

Examining the database table space used, we see that the tables using the IDENTITY approach used the least amount of disk space.

exec sp_spaceused IDENTITY_TEST1
GO
exec sp_spaceused IDENTITY_TEST2
GO
exec sp_spaceused IDENTITY_TEST3
GO
exec sp_spaceused NEWID_TEST
GO
exec sp_spaceused NEWSEQUENTIALID_TEST
GO

 

 

Now also consider this: since a uniqueidentifier data type consumes 16 bytes of data, the size of any non-clustered indexes defined on a table that uses a GUID as its clustered index is also affected, because the leaf level of these non-clustered indexes contains the clustered index key as a pointer. As a result, the size of any non-clustered indexes would end up being larger than if an IDENTITY were defined as an integer or bigint.

It's evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:

  • IDENTITY generated values are configurable, easy to read, and easier to work with
  • Fewer database pages are required to satisfy query requests
  • In comparison to NEWID(), page splitting (and its associated overhead) is eliminated
  • Database size is minimized
  • System functions exist to capture the last generated IDENTITY value (i.e. SCOPE_IDENTITY(), etc)
  • Some system functions - such as MIN() and MAX(), for instance - cannot be used on uniqueidentifier columns
Next Steps
  • Read more about NEWSEQUENTIALID() in the SQL Server 2005 Books Online
  • Read Using uniqueidentifier Data in the SQL Server 2005 Books Online
  • If you're not in a situation where you require a globally unique value, consider if an IDENTITY makes sense for auto-generating your key values.
  • Regardless if you decide on a GUID or IDENTITY, consider adding a meaningful UNIQUE key based on the real data in your table.


About the author
Armando Prato has close to 30 years of industry experience and has been working with SQL Server since version 6.5.


Comments For This Article




Tuesday, October 1, 2019 - 10:49:56 PM - Jim Evans

Great article. Nice details comparing identity to guids. Often I have dealt with very large tables heavily fragmented because of guid pkeys, when an int column with identity would have worked just fine.  Also, for those worried about running out of numbers, they can use bigint with identity and hold 9,223,372,036,854,775,807 rows.  Another trick with int or bigint is to start the seed at the largest negative number. Thanks for the info.
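A sketch of that negative-seed trick (the table name is invented for illustration):

-- Start the seed at the smallest bigint value to double the usable range
CREATE TABLE IDENTITY_FULL_RANGE
(
ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
TESTCOLUMN CHAR(10) NOT NULL DEFAULT 'X'
)
GO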


Thursday, February 23, 2017 - 10:51:35 PM - Asdirs

Custom Generated ID with SQL Server

How do I create a function for ID numbers in SQL Server? For example, I want my customers table to combine an area code and a consumer code: area code 12345 and consumer code 00001 combine into 12345-00001, then 12345-00002, 12345-00003 and so on; then area code 12344 gives 12344-00001, 12344-00002, 12344-00003 and so on, so the consumer code increases within each area code. I have a table like this:

table customer (id, name, kode_area, idcustomer) and table area (id, name)
table customer (00001, A, 12345, 12345-00001) and table area (12345, BandarLampung-Indonesia)
table customer (00002, B, 12345, 12345-00002) and table area (12345, BandarLampung-Indonesia)
table customer (00002, B, 12345, 12345-00003) and table area (12345, BandarLampung-Indonesia)
table customer (00001, AA, 12344, 12344-00001) and table area (12344, Tanggamus-Indonesia)
table customer (00002, BB, 12344, 12344-00002) and table area (12344, Tanggamus-Indonesia)
table customer (00002, BC, 12344, 12344-00003) and table area (12344, Tanggamus-Indonesia)

Monday, February 6, 2017 - 12:36:12 PM - Michael Ryan

Wow, what a fantastic article and analysis of UUID generation, thanks!

We have been pondering going to a UUID based primary key scheme for one of systems (we are currently on an INT based PK) and this really helps illustrate the difference.

I ran this test on a local MSSQLExpress 2014 box and the results are very similar but with the Sequential UUID showing less fragmentation than the traditional Identity model.

Interesting stuff...thanks!

 

IDENTITY_TEST1 2.8 38 250

IDENTITY_TEST2 2.8 38 250

IDENTITY_TEST3 2.8 38 250

NEWID_TEST 97.9959919839679 498 499

NEWSEQUENTIALID_TEST 1.49700598802395 47 334



Thursday, August 21, 2014 - 9:38:44 AM - Mike

I find this a very simplified look at generating keys.

 

You don't know about replication?

 

I do like SQL Server 2012's ability to create your own unique pattern also.

 


Friday, April 12, 2013 - 3:12:32 PM - george hardy

so it looks like i can still maintain my readable index numbers that are auto-incrementing integers, just remove it as the primary key, and begin to use these guids as the real row key.  this way, my code doing something like "DELETE FROM EQUIPMENT WHERE EQUIPMENT_ID = x" will still work.

i also want a much cleaner way to utilize the sync framework, and had i known this before, that task would have been MUCH easier!

 

thanks for the write-up.


Monday, January 14, 2013 - 2:26:26 AM - URVISH SUTHAR

Great job, many thanks for sharing :)

 


Wednesday, May 23, 2012 - 11:26:06 PM - Ravi Chauhan

How do I delete a record from a table where the primary key (ID) is auto-generated, and we need to delete that record by its primary key? For example:

DELETE TOP(1) FROM dbo.Test WHERE ID = 1

but we don't know the ID number in the above scenario....


Wednesday, May 23, 2012 - 2:26:09 AM - Ravi Chauhan

We created an example with two tables,

but suppose we have three table IDENTITY_TEST1, IDENTITY_TEST2 and IDENTITY_TEST3

Means A >> B >> C (link sign ">>")

so how do we create the relationship between these three tables?

 


Wednesday, May 23, 2012 - 2:17:38 AM - Ravi Chauhan

create database db

use db

CREATE TABLE IDENTITY_TEST1
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN varchar(20)
)

CREATE TABLE IDENTITY_TEST2
(
EMP_ID INT IDENTITY(1,1) REFERENCES IDENTITY_TEST1(ID),
ADDR VARCHAR(20)
)

insert into IDENTITY_TEST1 values('cde')

select * from IDENTITY_TEST1

insert into IDENTITY_TEST2 values('acd')

select * from IDENTITY_TEST2


Thursday, April 19, 2012 - 6:00:38 PM - Enigmatic

@Javed,

Yes that is correct. IDENTITY is a counter and every time you insert a record the next number in that counter is used. This is done to ensure you ALWAYS get a unique number that could not possibly be in the table. If you want to go back and "fill in" missing numbers then you are going to have to implement your own locking mechanism to first lock the table, then select the first missing number and then unlock it once you have found it. If you do not do this, there is a chance multiple calls made at the same time could potentially return the same number and attempt to both insert the same key which would give you a primary key violation.

If you want to delete all records and restart your numbering at 1 you need to use:

DBCC CHECKIDENT ("<Schema>.<TableName>", RESEED, 0)

This will reset the IDENTITY so that the next insert starts again from the original seed value


Thursday, April 19, 2012 - 9:11:55 AM - javed

Hi. sir,

My question is: I have a column id set as a primary key identity in a table. Whenever I delete any record from the table and insert a new record, the id continues counting from where the last id was. And whenever I delete all records from the table and then insert a new record, it does not start with 1; it starts from where the last id was.

plz help me..do reply me.. 


Monday, April 2, 2012 - 1:37:32 AM - Enigmatic

One thing that wasn't listed were the disadvantages of using IDENTITY columns.

The biggest issue I have is the use of index values in code. Where a function accepts an integer as the index, a person can place any number they like in there, regardless of what that number represents. It opens you up to the possibility of more errors, as every table with an identity will have a row with ID = 1, 2, 3, etc.

So for instance, there would be nothing stopping me from accidentally transposing an ID from one table into an ID of another table, and in fact I could create foreign key indexes between unrelated data simply because the ID values of one table exist in the other.

To me this is a loss of referential integrity, and the signature of C# methods that require IDs all end up something like (int, int, int). To me this isn't IDentifying records at all, it's simply numbers.

For this reason I prefer to use GUIDs as you are guaranteed that the GUID value in one table will not appear in any other table other than the one it belongs in, thus ensuring no mistakes in foreign keys, and no accidental transposing when filling out method signatures. When you consider how many functions, stored procedures, methods, and queries either pass identifiers or compare identifiers, the odds of transposing increase with every single new addition.

This level of security and reliability easily "trumps" a bit of disk space in a world that is now being measured in Terabytes, or the use of MIN/MAX functions on an IDENTITY field (that makes no sense!), or a few less pages being created.

Anyone who has spent days trying to find out why something isn't working correctly only to realise that they "accidentally" inserted an IndustryTypeID into an IndustryID, but it never complained because there just happens to be an IndustryID = 3 the same as there is an IndustryTypeID = 3, or why records occasionally said it violated a primary key but only at what appears to be random intervals, etc.


Thursday, March 22, 2012 - 12:51:58 AM - sandeep kumar

How do I create an auto-generated cust_id where the conditions are:

1- it does not match any previous id

2- it looks like cust_id AB001

Please reply me.

                                 Thank You.

 


Tuesday, March 20, 2012 - 9:53:03 AM - Zeni

Thanks. This article was helpful for me in deciding primery key datatype for my database.

Thanks


Tuesday, February 22, 2011 - 5:41:51 AM - Fayssal El Moufatich

Thanks for the nicely written article! So, unless one is in a situation where he needs a globally unique identifier, one should avoid using GUIDs. And even in this situation, it is recommended to generate the keys using NEWSEQUENTIALID(), as long as it is possible. I agree! :)


Monday, January 19, 2009 - 6:38:33 PM - aprato

You answered your own question.  You work in a distributed environment and they make sense in this scenario.  Replication makes use of GUIDs for this reason.   I personally wouldn't use a GUID for an application that didn't have a valid business reason for it.  They have their place but they do have some downside.   


Monday, January 19, 2009 - 5:50:12 PM - laughingskeptic

But isn't this a bit of a red herring?  The natural fill factor of an index on random data (such as a GUID) that experiences primarily inserts is 75%.  If you compact it, then you just cause a bunch of page splits as soon as the inserts begin again, potentially swinging quickly from a fill of 100% to 50% before settling back in to 75%.  A SQL Server page can hold around 254 GUIDs, so if your table has significantly more rows than this you should expect to see close to 100% fragmentation.  For random indexes, average fill factor is a better metric than percent fragmentation.  If the number is not about 75% then an investigation is warranted.

GUIDs are 4 times bigger than an integer, but if your server has reached memory constraints (most real ones do) a GUID index will be 5 times slower than an integer at lookups.  This is a design consideration.

I work with some highly distributed databases, involving multitiered replication and thousands of clients (some being third parties).  We would never get this system to work without using a lot of GUIDs.  GUIDs are especially handy when multiple clients can work standalone then expect to re-sync when they connect.


Monday, October 27, 2008 - 7:40:26 AM - aprato

Hi

I was just stating that you'll still get rapid index fragmentation (heavy, actually) if you decide to create a non-clustered index on a uniqueidentifier that uses NEWID().  It doesn't take many rows.  I had done a test with just 10,000 insertions and I still had an index with 95% fragmentation.  Due to the random generation, you'll get rapid fragmentation; it won't take many rows.   It's just something to be aware of.


Monday, October 27, 2008 - 6:29:39 AM - mharr

[quote user="aprato"]

You should note that even using NEWID() for a non-clustered index will lead to fragmentation for that index. Recall that index structures are sorted logically.

[/quote]

Yes, but:

  • a fragmented index is less likely (since an index "row" will be smaller than a data "row", and will have more entries and more room for new entries in an index), and will be less
  • rebuilding an index to remove fragmentation is easier and less disruptive than reordering a table.
  • while doing "one of" lookups of a primary key where the primary key is a guid (the most used query using a primary key), using a fragmented index and unfragmented data pages will likely still be more efficient than using fragmented data pages where the guid is the clustered primary key.

Thursday, October 23, 2008 - 12:38:00 PM - aprato

You should note that even using NEWID() for a non-clustered index will lead to fragmentation for that index. Recall that index structures are sorted logically.


Thursday, October 23, 2008 - 12:13:46 PM - Wiseman82

[quote user="DiverKas"]

[quote user="mharr"]

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will be frequent access of a customer table by territory number, it may be best to set up a clustered index by that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps

[/quote]

That is a great point.  It's silly for Microsoft to have left the clustered index on an identity or guid column for that matter.  It makes no sense and certainly lends nothing to the performance.  It is usually the first thing I have to change on a newly created table.

 

[/quote]

Choosing another column for the clustered index might get around the fragmentation issue, but the primary key is often the best candidate for the clustered index.  Sequential GUIDs might be a better option to get around the fragmentation issue.  Also remember that the fragmentation issue would apply (to a lesser extent) to a non-clustered index created on the GUID column.

It's normally a good idea for the clustered index to be unique - if it's not, SQL Server will add a uniqueifier to make the clustered index unique (additional overhead).  Due to the fragmentation issues already mentioned, it's also a good idea to have an auto-incrementing clustered index - Identities and Sequential GUIDs both fit the bill here.  Another thing you need to consider is if the chosen column(s) for the clustered index are likely to be updated - ideally you want to pick a column that is never/rarely updated - the primary key also fits the bill here.

I'd advise people to keep the clustered index as the primary key, unless they have a good reason for choosing a different column (MS made this a default for good reason).  I'm not saying that you always should use the primary key as a clustered index, but it's not a bad choice in the absence of another candidate for the clustered index.

In terms of query performance (rather than write performance), you might want to use a column that is frequently used in ORDER BY, GROUP BY, JOINS (foreign keys) or where the column is often used to select a range of values etc. 

As mentioned, a date column might be a good candidate if it's populated with GetDate()/GetUTCDate() - you would expect a very low number of duplicate values and the column will be auto-incrementing.  It is also unlikely that the column will ever be updated.  If queries frequently sort on this column and filter it by a range of values then it would be a good candidate for the clustered index.

Also note that best practice will vary depending on the type of database.  If you are creating a data warehouse/reporting database rather than an OLTP database, you will have different priorities. 

 

 


Thursday, October 23, 2008 - 9:31:52 AM - Perre

Nice article.  Also good comments telling there are also advantages with guids ... not that you said there aren't.

I personally prefer identity columns ... indeed for passing back the key from stored procedures to the client software, for easiness in SQL statements, joins, etc ... and, as you prove nicely, to limit page splits, index size, ...

Although I once had a problem when a branch office with its own SQL Server, running the same application and database, started using VPN.  Business managers decided the databases should be merged together ... so I hit the problem that we had the same client id's in both databases but referring to different clients ... and of course those id's were spread across a lot of tables as foreign keys.  As a consequence I had a serious job merging the data.  Finally I decided to add an extra field to the auto-incremental primary key, telling me whether it's a client from the branch office or one from the main office.  So nowadays when I have to use a key like that I consider adding an extra column.  Alternatively I could have added 10,000,000 to the id's from the branch office before importing the data ... but I didn't think of that at the moment, and it wouldn't have clarified the difference between clients from both offices inserted after the merge.

An old collegue of me always used GUIDs for his websitedatabases since they seem more complex in the URL ... avoiding the fact if you hit .com?id=127 people would try to enter .com?id=128 which should not be visible to them ... but afterwards I think this isn't a strong argument.  If you should not be allowed to see the page with .com?id=128 ... the website should block it :-)

Cheers

Thursday, October 23, 2008 - 9:22:09 AM - DiverKas Back To Top (2067)

[quote user="mharr"]

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will be frequent access to a customer table by territory number, it may be best to set up a clustered index on that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps

[/quote]

That is a great point.  It's silly for Microsoft to have left the clustered index on an identity or GUID column, for that matter.  It makes no sense and certainly adds nothing to performance.  It is usually the first thing I have to change on a newly created table.

 


Thursday, October 23, 2008 - 9:19:29 AM - mharr Back To Top (2066)

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will be frequent access to a customer table by territory number, it may be best to set up a clustered index on that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps


Thursday, October 23, 2008 - 9:15:52 AM - DiverKas Back To Top (2065)

[quote user="cbasoz"]

I found the conclusions at the end of the article a little bit biased toward identity usage. OK, so far so good: IDENTITY has advantages. Doesn't it have any clear disadvantages? Does it really outperform GUIDs?

I am not a SQL DBA but more of a developer, and I think GUIDs have undeniable advantages over IDENTITY, i.e.:

- I can create the key independently from the data store. I know it is unique, so I don't worry about creating data on my laptop totally disconnected and later inserting it into the main DB without any change to the keys.

- Not needing the DB store to generate the key also means I don't need round trips to the server. Those round trips might add up to a big amount. Performance-wise, maybe it is better to use GUIDs then?

I wonder, if GUIDs are disadvantageous, why would SQL Server itself ever use GUIDs in replication? The SQL Server team could have done that with integer key fix-ups, but they chose GUIDs. Probably they thought using GUIDs is cheaper than doing key fix-ups?

(just thinking out loud)

[/quote]

There is a rather large performance difference between comparing 4 bytes and 16 bytes of data, especially in join conditions.  This isn't a trivial difference.

Also, your assumption that GUIDs created on the client are unique is false.  GUIDs have a reasonable chance of being unique, but it is not guaranteed, and I see cases monthly where it does not hold.  Since the GUID on a client is generated from several pieces of "hardware" plus the time, it is quite possible, through OEM vendor mistakes and luck of the draw, to have the same GUID assigned on different hardware.

GUIDs as a general rule should only be used for replication and distributed data.  In general it is also best to generate the key from a single source, thereby reducing the chance of a collision to about nil.

For performance's sake, if replication and distributed data are not a concern, identity columns are a HUGE winner.
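
The size difference itself is easy to verify. A trivial sketch:

--4 bytes for an int vs. 16 bytes for a uniqueidentifier
declare @i int
declare @g uniqueidentifier
set @i = 1
set @g = NEWID()

select DATALENGTH(@i) as int_bytes,  --returns 4
       DATALENGTH(@g) as guid_bytes --returns 16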

 


Thursday, October 23, 2008 - 8:59:47 AM - aprato Back To Top (2064)

I didn't take it negatively at all.  I understand what you're driving at.  


Thursday, October 23, 2008 - 8:36:40 AM - cbasoz Back To Top (2063)

I was referring to the section starting with:

"It's evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches"

and you say:

"The reason they probably chose GUIDs is due to their uniqueness across space and time."

To me that sounds like a very good reason to choose it as a surrogate primary key, so in the same manner I could say:

It's evident that using GUIDs to auto-generate key values offers a few advantages over the IDENTITY approach.

This statement is a fact too, but I didn't see any mention of it, so I naturally thought of the article as biased. I didn't have a negative intention; I just felt the article wasn't complete.


Thursday, October 23, 2008 - 8:21:42 AM - aprato Back To Top (2062)

 <<Yes - more or less.  You might also consider composite keys - using identity in conjunction with another column. You could also generate your own "unique" identifier based on some custom logic. You might even consider splitting identity ranges between tables. In this situation GUIDs would probably be the most attractive option IMO.>>

I would concur with Wiseman.  GUIDs may be the better option if you're working in a distributed environment. 


Thursday, October 23, 2008 - 7:59:01 AM - Wiseman82 Back To Top (2061)

[quote user="pmh4514"]

Thanks for the follow-ups to my questions!

 [quote user="aprato"]The reason they probably chose GUIDs is due to their uniqueness across space and time.[/quote]

If uniqueness across space and time is a fundamental requirement of a given distributed database architecture, am I correct in saying that this requirement alone makes the decision for us, and requires the use of GUIDs over IDENTITY, despite any inherent performance issues that come with that decision?

[/quote]

Yes - more or less.  You might also consider composite keys - using identity in conjunction with another column. You could also generate your own "unique" identifier based on some custom logic. You might even consider splitting identity ranges between tables. In this situation GUIDs would probably be the most attractive option IMO.
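
A minimal sketch of the identity-range idea (the table names are hypothetical): each site seeds its IDENTITY far enough apart that rows merged later cannot collide.

--main office: IDs run from 1 upwards
create table dbo.Clients_MainOffice
(
ClientID int identity(1, 1) primary key,
Name varchar(64) not null
)
go

--branch office: IDs start at 10,000,000, so the two ranges never overlap
create table dbo.Clients_BranchOffice
(
ClientID int identity(10000000, 1) primary key,
Name varchar(64) not null
)
go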


Thursday, October 23, 2008 - 7:49:52 AM - pmh4514 Back To Top (2060)

Thanks for the follow-ups to my questions!

 [quote user="aprato"]The reason they probably chose GUIDs is due to their uniqueness across space and time.[/quote]

If uniqueness across space and time is a fundamental requirement of a given distributed database architecture, am I correct in saying that this requirement alone makes the decision for us, and requires the use of GUIDs over IDENTITY, despite any inherent performance issues that come with that decision?


Thursday, October 23, 2008 - 7:31:55 AM - aprato Back To Top (2059)

 <<I found the conclusions at the end of the article a little bit biased toward identity usage>>

In reality, it wasn't bias - they were statements of fact.  Using IDENTITY results in less database bloat: fewer data and index pages are generated (less churn for DML statements), as well as less bloat of the non-clustered indexes, which carry the 16-byte GUID as their row locator when the GUID is the clustered index.    

The reason they probably chose GUIDs is due to their uniqueness across space and time.


Thursday, October 23, 2008 - 7:24:53 AM - Wiseman82 Back To Top (2058)

GUIDs do have benefits over identity columns, but they come with quite a large overhead - 16 bytes compared to 4 bytes for an integer.  Some of the problems with uniqueidentifiers can be overcome by using sequential GUIDs, but identity is far better from a performance point of view.

Bottom line: identities should be your first choice. If an identity column doesn’t fit the bill, consider a GUID. 

Also, don’t take people’s word for it.  Construct your own performance test to see how they perform in comparison to each other.  I’ve done so myself and found the difference to be significant.  The difference between sequential GUIDs and non-sequential GUIDs was also VERY significant.

 


Thursday, October 23, 2008 - 7:13:44 AM - aprato Back To Top (2057)

 

<<Question 1 would be "how are tables defragged?">>

 Depends on your version of SQL Server

2000: DBCC DBREINDEX or DBCC INDEXDEFRAG
2005: ALTER INDEX specifying either REBUILD or REORGANIZE

You could also drop and re-create the clustered index

<<2. Describe "fill factor" - how would I implement it? What does it specify?>>

Creating indexes with a FILL FACTOR % tells the engine how full you want the index leaf pages (i.e. a FILL FACTOR of 80 means keep 20% of each leaf page free for future inserts - or 80% full).  It minimizes page splits (an intensive operation where half of the data on a page is moved to a new page) because the free space can accommodate any random row that fits on the page.  It's used in high-volume environments where page splitting becomes a problem.
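
A short sketch of those commands (the table and index names are hypothetical):

--SQL Server 2000
DBCC DBREINDEX ('dbo.Orders', 'PK_Orders', 80)  --rebuild with an 80% fill factor
DBCC INDEXDEFRAG (0, 'dbo.Orders', 'PK_Orders') --defragment the index in place

--SQL Server 2005
ALTER INDEX PK_Orders ON dbo.Orders REBUILD WITH (FILLFACTOR = 80)
ALTER INDEX PK_Orders ON dbo.Orders REORGANIZE  --lighter-weight alternative, always online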

 

 


Thursday, October 23, 2008 - 6:47:48 AM - cbasoz Back To Top (2056)

I found the conclusions at the end of the article a little bit biased toward identity usage. OK, so far so good: IDENTITY has advantages. Doesn't it have any clear disadvantages? Does it really outperform GUIDs?

I am not a SQL DBA but more of a developer, and I think GUIDs have undeniable advantages over IDENTITY, i.e.:

- I can create the key independently from the data store. I know it is unique, so I don't worry about creating data on my laptop totally disconnected and later inserting it into the main DB without any change to the keys.

- Not needing the DB store to generate the key also means I don't need round trips to the server. Those round trips might add up to a big amount. Performance-wise, maybe it is better to use GUIDs then?

I wonder, if GUIDs are disadvantageous, why would SQL Server itself ever use GUIDs in replication? The SQL Server team could have done that with integer key fix-ups, but they chose GUIDs. Probably they thought using GUIDs is cheaper than doing key fix-ups?

(just thinking out loud)


Thursday, October 23, 2008 - 6:45:33 AM - Wiseman82 Back To Top (2055)

You defrag tables by doing a clustered index rebuild/reorganize.  BOL will give you more info on this.
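
Before choosing between a rebuild and a reorganize, it helps to measure the fragmentation first. A hedged sketch (the database and table names are hypothetical):

--SQL Server 2005: check fragmentation of all indexes on a table
select index_id, avg_fragmentation_in_percent, page_count
from sys.dm_db_index_physical_stats(
         DB_ID('MyDb'), OBJECT_ID('dbo.Orders'), null, null, 'LIMITED')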

A lower fill factor will leave room for more items to be inserted, reducing page splits and improving insert performance.  Lower fill factors will have a negative impact on select performance though.  With an auto-incrementing key (identity/sequential GUID) you can use a high fill factor - new items will be inserted at the end of the index. Fill factors are specified when the index is created - BOL will provide you with more info.
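
For instance, a minimal sketch (hypothetical names) of setting the fill factor when the index is created:

--leave 10% of each leaf page free for future inserts
create clustered index IX_Orders_OrderDate
    on dbo.Orders (OrderDate)
    with (FILLFACTOR = 90)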


Thursday, October 23, 2008 - 6:45:00 AM - aprato Back To Top (2054)

<< There is a problem with this image - SQL Server actually sorts GUIDs by byte group from right to left >>

Yes, you're correct.  The image is meant to show the difference in how the engine generates the value (i.e. the randomness of NEWID() vs. NEWSEQUENTIALID()).


Thursday, October 23, 2008 - 6:24:53 AM - pmh4514 Back To Top (2053)

Very interesting... I'm putting together a new distributed DB design using GUIDs as unique identifiers to ensure uniqueness across many computers. We are still in early development, so design and implementation changes happen daily as I work through everything. I found this article interesting because it suggests some things I can do early on to prevent some of the inherent problems with using GUIDs as identities. But there were a few points I was unsure about.

When discussing the fragmentation differences between using an incremental integer vs. a GUID as the identity, the author described the page splitting that occurs.  He wrote:

Though you can defragment the NEWID() table, the random nature of the key generation will still cause page splitting and fragmentation with all future table INSERTs. However, page splitting can be minimized by specifying an appropriate FILL FACTOR.

Question 1 would be "how are tables defragged?" (I am more a programmer than a DBA, for what it's worth). And question 2: describe "fill factor" - how would I implement it? What does it specify?


Thursday, October 23, 2008 - 2:59:58 AM - jnollet Back To Top (2050)

Great article ... I've found this out too, and it's important to think about it up front in the development process.  Trying to change it later can be difficult.

 
 
posted on 2023-10-12 11:34  不及格的程序员-八神