XSLT Archive

不及格的程序员-八神


SQL Server 2005: Say Goodbye to the IDENTITY Auto-Increment Primary Key with NEWSEQUENTIALID() (reposted)

How should a table's primary key be designed in SQL Server 2005?
Three primary key schemes are in common use today:

  • Auto-increment primary keys
  • Manually generated primary keys
  • UNIQUEIDENTIFIER primary keys

1. First, the auto-increment primary key. Its advantage is simplicity, and the type can go up to bigint. But it has a fatal weakness:

When we need to copy data between multiple databases (SQL Server's data publication and subscription mechanism lets us replicate data between databases), auto-increment columns can cause primary key conflicts when the data is merged. Imagine an Order table in one database replicating into an Order table in another database: should OrderID keep auto-incrementing or not?

 

2. Next, the manually generated primary key. Its advantage is that you define the key column yourself: its data type and even the shape of its values are under your control, and you can reliably obtain the target key value with no duplicates. But its maintenance cost is relatively high: you have to write your own stored procedure to generate key values, the network overhead is larger, and at runtime you must also handle concurrency conflicts.

 

3. Finally, the UNIQUEIDENTIFIER primary key. It uses a GUID as the key value, and newid() can be called directly to obtain a globally unique identifier, so even merging tables produces no duplicates. But GUIDs have two weaknesses. First, compared with int, a GUID is four times as long. Second, the GUIDs obtained from newid() follow no pattern at all; since the column is the primary key, it almost certainly carries the clustered index, which makes inserting new rows a very time-consuming operation. Used this way, a UNIQUEIDENTIFIER primary key costs a great deal of efficiency.

So under SQL Server 2000, DBAs often wrote a stored procedure to generate time-related GUIDs, putting the generation time in front of the GUID. This guarantees keys that are globally unique and increase over time. But that brings us back to the second scheme and its maintenance burden.
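A minimal sketch of one well-known variant of that trick (Jimmy Nilsson's "COMB" GUID; not necessarily the exact procedure those DBAs wrote). Note that this variant places the timestamp in the GUID's last six bytes, which is where SQL Server's uniqueidentifier comparison starts:

--Overwrite the last 6 bytes of a random GUID with the current datetime
--so that generated values sort roughly by insert time
SELECT CAST(
    CAST(NEWID() AS BINARY(10)) + CAST(GETDATE() AS BINARY(6))
    AS UNIQUEIDENTIFIER) AS comb_guid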

 

4. SQL Server 2005 solves this problem with NEWSEQUENTIALID().

The GUIDs this function produces are increasing. Let's look at how to use it.

--Create a test table
--1. The id column's type is UNIQUEIDENTIFIER
--2. ROWGUIDCOL is just an alias for the column; a table can have only one
--3. PRIMARY KEY makes id the primary key
--4. A DEFAULT constraint supplies the GUID for the column automatically
create table jobs
(
id UNIQUEIDENTIFIER ROWGUIDCOL PRIMARY KEY  NOT NULL
CONSTRAINT [DF_jobs_id] DEFAULT (NEWSEQUENTIALID()),
account varchar(64) not null,
password varchar(64) not null
)
go
 
select * from jobs
--Insert test data
insert jobs (account,password) values ('tudou','123')
insert jobs (account,password) values ('ntudou','123')
insert jobs (account,password) values ('atudou','123')
insert jobs (account,password) values ('btudou','123')
insert jobs (account,password) values ('ctudou','123')
 
select * from jobs

Result:

 

--With identity we can fetch the newly added id via Select @@IDENTITY
--What do we do with UNIQUEIDENTIFIER?
--Taking the manual approach, select NEWSEQUENTIALID() to fetch an id before inserting?
--No: the syntax does not support it
--The id of the newly added row can be captured with the method below
--In ADO.NET it is used just like Select @@IDENTITY
DECLARE @outputTable TABLE(ID uniqueidentifier)
INSERT INTO jobs(account, password)
OUTPUT INSERTED.ID INTO @outputTable
VALUES('dtudou', '123')
 
SELECT ID FROM @outputTable
 
--对比下数据
select * from jobs

Result:

 

--ROWGUIDCOL is an alias for the key column and can be used directly as a column name
--so the key column's actual name does not need to be known
insert jobs (account,password) values ('etudou','123')
select ROWGUIDCOL from jobs

Result:

 



How to auto-generate a string ID in SQL Server

A few days ago I had a requirement: with the caller not passing any ID, the database layer had to generate a 32-character string ID by itself. All the articles I could find covered auto-incrementing int IDs only. In the end I asked an expert, who solved the problem by setting a default value on the column. I'm sharing it here in the hope that it helps someone!

To auto-generate a string ID in SQL Server, just use Navicat to enter (replace(newid(),'-','')) in the column's default value box. Done!
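A minimal sketch of the same idea in plain T-SQL, without Navicat (the table and column names are invented for illustration):

create table demo_customer
(
id char(32) not null
CONSTRAINT [DF_demo_customer_id] DEFAULT (replace(newid(),'-','')),
name varchar(64) not null
)
go

insert demo_customer (name) values ('tudou')
select id, name from demo_customer

Note that, unlike NEWSEQUENTIALID(), this produces random values, so the fragmentation concerns discussed above still apply if the column is the clustered primary key.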


Asked 12 years, 8 months ago
Viewed 77k times
 
48

I know that If I run this query

select top 100 * from mytable order by newid()

it will get 100 random records from my table.

However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?

 

5 Answers

4

as MSDN says:

NewID() Creates a unique value of type uniqueidentifier.

and your table will be sorted by these random values.

 
  • 1
     
     
    Thanks - I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that [1] the select statement will select EVERYTHING from mytable, [2] for each row selected, tack on a uniqueidentifier generated by NewID(), [3] sort the rows by this uniqueidentifier and [4] pick off the top 100 from the sorted list?   Feb 12, 2011 at 19:50
 
12

In general it works like this:

  • All rows from mytable are "looped" through
  • NEWID() is executed for each row
  • The rows are sorted according to the random number from NEWID()
  • The first 100 rows are selected
 
39
 

I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?

Yes. this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.

SELECT TOP 100 * 
FROM master..spt_values 
ORDER BY NEWID()

The compute scalar operator adds the NEWID() column on for each row (2506 in the table in my example query) then the rows in the table are sorted by this column with the top 100 selected.

SQL Server doesn't actually need to sort the entire set from positions 100 down so it uses a TOP N sort operator which attempts to perform the entire sort operation in memory (for small values of N)

Plan

 
  •  
    Got it! And yes, you're right - once I've determined the top 100 rows from the entire set, there's no need to sort the rest.   Feb 12, 2011 at 22:08
  •  
    So, is it safe to assume that no data is written? Since this is a SELECT query, NEWID() will calculate a randomized identifier just for the query; it won't be updating anything in the database with this new id, right? 
    – K09P
     Mar 25, 2019 at 11:40
  • 1
     
     
    Yes it won't affect the tables you are selecting from. At least some of the data will be temporarily written to a worktable in tempdb to hold at least the TOP N results but nothing written to the user database   Mar 25, 2019 at 12:36
  •  
    I don't see it mentioned anywhere else, but because of how this works, this is probably a terrible way of selecting random rows from a table. If you have a table with many rows (i.e. half a billion), do not run a query like that. If I ran this on a table with 500GB of data in it, I'd be in trouble. Better off using the built-in feature "TABLESAMPLE" that's meant for selecting random rows from data pages: SELECT * FROM Person.Person TABLESAMPLE (10 PERCENT); 
    – Triynko
     Apr 28 at 23:35 
 
-1

use select top 100 randid = newid(), * from mytable order by randid and it will become clear: the generated value shows up as its own column.

 
1

I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be OK in such cases, where performance is not too bad and it does not have a huge impact. But newId() is bad for large tables.

Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.

From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.

4 ways to get a random row from a large table:

  1. Method 1, Bad: ORDER BY NEWID() > Bad performance!
  2. Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
  3. Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
  4. Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.

More on method 3: Get the top ID field in the table, generate a random number, and look for that ID. For the top N rows, call the code below N times, or generate N random numbers and use them in an IN clause.

/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;

/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
 
  •  
    I find method 3 to be the fastest. But I have a problem when I get the top N. I ran it N times, but there will be several times where the data overlaps, so it's not really N random values. Do you have any ideas?   Apr 27 at 9:37

Asked 12 years, 8 months ago
Modified 7 months ago
Viewed 64k times
 
39

I have a query where I want the resulting records to be ordered randomly. It uses a clustered index, so if I do not include an order by it will likely return records in the order of that index. How can I ensure a random row order?

I understand that it will likely not be "truly" random, pseudo-random is good enough for my needs.

6 Answers

30
 

ORDER BY NEWID() will sort the records randomly. An example from SQLTeam.com

SELECT *
FROM Northwind..Orders 
ORDER BY NEWID()
 
23

This is an old question, but one aspect of the discussion is missing, in my opinion -- PERFORMANCE. ORDER BY NewId() is the general answer. When someone gets fancy they add that you should really wrap NewID() in CheckSum(), you know, for performance!

The problem with this method, is that you're still guaranteed a full index scan and then a complete sort of the data. If you've worked with any serious data volume this can rapidly become expensive. Look at this typical execution plan, and note how the sort takes 96% of your time ...


To give you a sense of how this scales, I'll give you two examples from a database I work with.

  • TableA - has 50,000 rows across 2500 data pages. The random query generates 145 reads in 42ms.
  • Table B - has 1.2 million rows across 114,000 data pages. Running Order By newid() on this table generates 53,700 reads and takes 16 seconds.

The moral of the story is that if you have large tables (think billions of rows) or need to run this query frequently the newid() method breaks down. So what's a boy to do?

Meet TABLESAMPLE()

In SQL 2005 a new capability called TABLESAMPLE was created. I've only seen one article discussing its use...there should be more. MSDN Docs here. First an example:

SELECT Top (20) *
FROM Northwind..Orders TABLESAMPLE(20 PERCENT)
ORDER BY NEWID()

The idea behind table sample is to give you approximately the subset size you ask for. SQL numbers each data page and selects X percent of those pages. The actual number of rows you get back can vary based on what exists in the selected pages.

So how do I use it? Select a subset size that more than covers the number of rows you need, then add a Top(). The idea is you can make your ginormous table appear smaller prior to the expensive sort.

Personally I've been using it to, in effect, limit the size of my table. So on that million row table, doing top(20)...TABLESAMPLE(20 PERCENT) drops the query to 5600 reads in 1600ms. There is also a REPEATABLE() option where you can pass a "Seed" for page selection. This should result in a stable sample selection.
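For example, a hedged sketch of the REPEATABLE() form (the seed value 42 is arbitrary):

--The same pages come back on every run as long as the data has not changed
SELECT Top (20) *
FROM Northwind..Orders TABLESAMPLE(20 PERCENT) REPEATABLE(42)
ORDER BY NEWID()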

Anyway, just thought this should be added to the discussion. Hope it helps someone.

  •  
    It would nice to be able to write a scalable random-ordering query which not only scales up but works with small data sets. It sounds like you have to manually switch between having and not having TABLESAMPLE() based on how much data you have. I don’t think that TABLESAMPLE(x ROWS) would even ensure that at least x rows are returned because the documentation says “The actual number of rows that are returned can vary significantly. If you specify a small number, such as 5, you might not receive results in the sample.”—so the ROWS syntax really still is just a masked PERCENT inside? 
    – binki
     Jun 18, 2014 at 18:49
  •  
    Sure, auto-magic is nice. In practice, I've rarely seen a 5 row table scale to millions of rows without notice. TABLESAMPLE() seems to base selection on the number of pages in a table, so the given row size influences what comes back. The point of table sample, at least as I see it, is to give you a good sub-set from which you can select -- kind of like a derived table. 
    – EBarr
     Jun 19, 2014 at 1:26
  •  
    TABLESAMPLE() should be used with care. For example, only rows from the 'first page' of the table may be returned. Thus it can appear not to be truly random. If true randomness is important, e.g. for an audit, then it shouldn't be used at all. If this is only for some quick usage, it is fine as a performance improvement, however.   Feb 2, 2021 at 22:26
22

Pradeep Adiga's first suggestion, ORDER BY NEWID(), is fine and something I've used in the past for this reason.

Be careful with using RAND() - in many contexts it is only executed once per statement so ORDER BY RAND() will have no effect (as you are getting the same result out of RAND() for each row).

For instance:

SELECT display_name, RAND() FROM tr_person

returns each name from our person table and a "random" number, which is the same for each row. The number does vary each time you run the query, but is the same for each row each time.

To show that the same is the case with RAND() used in an ORDER BY clause, I try:

SELECT display_name FROM tr_person ORDER BY RAND(), display_name

The results are still ordered by the name indicating that the earlier sort field (the one expected to be random) has no effect so presumably always has the same value.
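(A common workaround, sketched here under the same tr_person table, is to seed RAND() per row so that it is re-evaluated:

--The seed changes for every row, so RAND() yields a different value per row
SELECT display_name, RAND(CHECKSUM(NEWID())) AS per_row_rand
FROM tr_person
ORDER BY per_row_rand

though at that point NEWID() is doing the real work anyway.)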

Ordering by NEWID() does work though, because if NEWID() was not always reassessed, the purpose of UUIDs would be broken when inserting many new rows in one statement with unique identifiers as their key, so:

SELECT display_name FROM tr_person ORDER BY NEWID()

does order the names "randomly".

Other DBMS

The above is true for MSSQL (2005 and 2008 at least, and if I remember rightly 2000 as well). A function returning a new UUID should be evaluated once per row in any DBMS, as NEWID() is under MSSQL, but it is worth verifying this in the documentation and/or by your own tests. The behaviour of other arbitrary-result functions, like RAND(), is more likely to vary between DBMSs, so again check the documentation.

Also I've seen ordering by UUID values being ignored in some contexts, as the DB assumes that the type has no meaningful ordering. If you find this to be the case, explicitly cast the UUID to a string type in the ordering clause, or wrap some other function around it like CHECKSUM() in SQL Server (there may be a small performance difference from this too, as the ordering will be done on 32-bit values rather than 128-bit ones, though whether the benefit of that outweighs the cost of running CHECKSUM() per value first I'll leave you to test).
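Hedged sketches of those two workarounds, reusing the tr_person table from above:

SELECT display_name FROM tr_person ORDER BY CAST(NEWID() AS varchar(36)) -- cast the UUID to a string
SELECT display_name FROM tr_person ORDER BY CHECKSUM(NEWID())            -- order on a 32-bit checksum instead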

Side Note

If you want an arbitrary but somewhat repeatable ordering, order by some relatively uncontrolled subset of the data in the rows themselves. For instance, either of these will return the names in an arbitrary but repeatable order:

SELECT display_name FROM tr_person ORDER BY CHECKSUM(display_name), display_name -- order by the checksum of some of the row's data
SELECT display_name FROM tr_person ORDER BY SUBSTRING(display_name, LEN(display_name)/2, 128) -- order by part of the name field, but not in an obviously recognisable order

Arbitrary but repeatable orderings are not often useful in applications, though they can be useful in testing if you want to run some code on results in a variety of orders but want to be able to repeat each run the same way several times (for getting average timing results over several runs, or testing that a fix you have made to the code does remove a problem or inefficiency previously highlighted by a particular input resultset, or just for testing that your code is "stable" in that it returns the same result each time if sent the same data in a given order).

This trick can also be used to get more arbitrary results from functions that do not allow non-deterministic calls like NEWID() within their body. Again, this is not something that is likely to be often useful in the real world but could come in handy if you want a function to return something random and "random-ish" is good enough (but be careful to remember the rules that determine when user-defined functions are evaluated, i.e. usually only once per row, or your results may not be what you expect/require).

Performance

As EBarr points out, there can be performance issues with any of the above. For more than a few rows you are almost guaranteed to see the output spooled out to tempdb before the requested number of rows is read back in the right order, which means that even if you are looking for the top 10 you might find a full index scan (or worse, a table scan) happens along with a huge block of writing to tempdb. Therefore it can be vitally important, as with most things, to benchmark with realistic data before using this in production.

 
5

Many tables have a relatively dense (few missing values) indexed numeric ID column.

This allows us to determine the range of existing values, and choose rows using randomly-generated ID values in that range. This works best when the number of rows to be returned is relatively small, and the range of ID values is densely populated (so the chance of generating a missing value is small enough).

To illustrate, the following code chooses 100 distinct random users from the Stack Overflow table of users, which has 8,123,937 rows.

The first step is to determine the range of ID values, an efficient operation due to the index:

DECLARE 
    @MinID integer,
    @Range integer,
    @Rows bigint = 100;

--- Find the range of values
SELECT
    @MinID = MIN(U.Id),
    @Range = 1 + MAX(U.Id) - MIN(U.Id)
FROM dbo.Users AS U;

Range query

The plan reads one row from each end of the index.

Now we generate 100 distinct random IDs in the range (with matching rows in the users table) and return those rows:

WITH Random (ID) AS
(
    -- Find @Rows distinct random user IDs that exist
    SELECT DISTINCT TOP (@Rows)
        Random.ID
    FROM dbo.Users AS U
    CROSS APPLY
    (
        -- Random ID
        VALUES (@MinID + (CONVERT(integer, CRYPT_GEN_RANDOM(4)) % @Range))
    ) AS Random (ID)
    WHERE EXISTS
    (
        SELECT 1
        FROM dbo.Users AS U2
            -- Ensure the row continues to exist
            WITH (REPEATABLEREAD)
        WHERE U2.Id = Random.ID
    )
)
SELECT
    U3.Id,
    U3.DisplayName,
    U3.CreationDate
FROM Random AS R
JOIN dbo.Users AS U3
    ON U3.Id = R.ID
-- QO model hint required to get a non-blocking flow distinct
OPTION (MAXDOP 1, USE HINT ('FORCE_LEGACY_CARDINALITY_ESTIMATION'));

random rows query

The plan shows that in this case 601 random numbers were needed to find 100 matching rows. It is pretty quick:

Table 'Users'. Scan count 1, logical reads 1937, physical reads 2, read-ahead reads 408
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0
Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 9 ms.

Try it on the Stack Exchange Data Explorer.

  •  
    I love this solution. I was previously facing problems with both RAND() and NEWID() on a dataset of 1 million records, but this solution seems to perform pretty well so far; I'm impressed. Looking to test it on larger datasets.   Aug 10, 2020 at 20:44
0

As I explained in this article, in order to shuffle the SQL result set, you need to use a database-specific function call.

Note that sorting a large result set using a RANDOM function might turn out to be very slow, so make sure you do that on small result sets.

If you have to shuffle a large result set and limit it afterward, then it's better to use the SQL Server TABLESAMPLE in SQL Server instead of a random function in the ORDER BY clause.

So, assuming we have the following database table:


And the following rows in the song table:

| id | artist                          | title                              |
|----|---------------------------------|------------------------------------|
| 1  | Miyagi & Эндшпиль ft. Рем Дигга | I Got Love                         |
| 2  | HAIM                            | Don't Save Me (Cyril Hahn Remix)   |
| 3  | 2Pac ft. DMX                    | Rise Of A Champion (GalilHD Remix) |
| 4  | Ed Sheeran & Passenger          | No Diggity (Kygo Remix)            |
| 5  | JP Cooper ft. Mali-Koa          | All This Love                      |

On SQL Server, you need to use the NEWID function, as illustrated by the following example:

SELECT
    CONCAT(CONCAT(artist, ' - '), title) AS song
FROM song
ORDER BY NEWID()

When running the aforementioned SQL query on SQL Server, we are going to get the following result set:

| song                                              |
|---------------------------------------------------|
| Miyagi & Эндшпиль ft. Рем Дигга - I Got Love      |
| JP Cooper ft. Mali-Koa - All This Love            |
| HAIM - Don't Save Me (Cyril Hahn Remix)           |
| Ed Sheeran & Passenger - No Diggity (Kygo Remix)  |
| 2Pac ft. DMX - Rise Of A Champion (GalilHD Remix) |

Notice that the songs are being listed in random order, thanks to the NEWID function call used by the ORDER BY clause.

-2

This is an old thread but came across this recently; so updating a method that has worked for me and gives good performance. This assumes your table has an IDENTITY or similar column:

DECLARE @r decimal(8,6) = rand()
SELECT @r

SELECT  TOP 100 *
FROM    TableA
ORDER BY ID % @r

How to Get a Random Row from a Large Table

Last Updated 6 years ago
T-SQL
23 Comments

Method 1, Bad: ORDER BY NEWID()

Easy to write, but it performs like hot, hot garbage because it scans the entire clustered index, calculating NEWID() on every row:

The plan with the scan

That took 6 seconds on my machine, going parallel across multiple threads, using tens of seconds of CPU for all that computing and sorting. (And the Users table isn’t even 1GB.)

Method 2, Better but Strange: TABLESAMPLE

This came out in 2005, and has a ton of gotchas. It’s kinda picking a random page, and then returning a bunch of rows from that page. The first row is kinda random, but the rest aren’t.

The plan looks like it’s doing a table scan, but it’s only doing 7 logical reads:

The plan with the fake scan

But here’s the results – you can see that it jumps to a random 8K page and then starts reading out rows in order. They’re not really random rows.

Random like mafia lottery numbers

You can use the ROWS sample size instead, but it has some rather odd results. For example, in the Stack Overflow Users table, when I said TABLESAMPLE (50 ROWS), I actually got 75 rows back. That’s because SQL Server converts your row size over to a percentage instead.
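A sketch of the ROWS form described above:

--Asks for 50 rows, but SQL Server converts this to a percentage internally,
--so the number actually returned can differ noticeably
SELECT * FROM dbo.Users TABLESAMPLE (50 ROWS)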

Method 3, Best but Requires Code: Random Primary Key

Get the top ID field in the table, generate a random number, and look for that ID. Here, we’re sorting by the ID because we wanna find the top record that actually exists (whereas a random number might have been deleted.) Pretty fast, but is only good for a single random row. If you wanted 10 rows, you’d have to call code like this 10 times (or generate 10 random numbers and use an IN clause.)
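The query itself appeared as a screenshot in the original post; a minimal reconstruction in the spirit of the description (assuming the Stack Overflow dbo.Users table with an integer Id) might look like:

/* Pick a random number below the table's top ID */
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
DECLARE @rand BIGINT = ABS(CHECKSUM(NEWID())) % @maxid;

/* Sort by Id so we land on the nearest row that actually exists */
SELECT TOP 1 *
FROM dbo.Users
WHERE Id >= @rand
ORDER BY Id;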

The execution plan shows a clustered index scan, but it’s only grabbing one row – we’re only talking 6 logical reads for everything you see here, and it finishes near instantaneously:

The plan that can

There’s one gotcha: if the Id has negative numbers, it won’t work as expected. (For example, say you start your identity field at -1 and step -1, heading ever downwards, like my morals.)

Method 4, OFFSET-FETCH (2012+)

Daniel Hutmacher added this one in the comments:
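(The query was shown as an image in the original post; Daniel's comment, reproduced in the comment section below, contains it:)

DECLARE @row bigint=(
SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID('dbo.thetable'));

SELECT *
FROM dbo.thetable
ORDER BY (SELECT NULL)
OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;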

And said, “But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.”

Bonus Track #1: Watch Us Discussing This

Ever wonder what it’s like to be in our company’s chat room? This 10-minute Slack discussion will give you a pretty good idea:

Spoiler alert: there was not. I just took screenshots.

Bonus Track #2: Mitch Wheat Digs Deeper

Want an in-depth analysis of the randomness of several different techniques? Mitch Wheat dives really deep, complete with graphs!

 

23 Comments

  • Daniel Hutmacher
    March 5, 2018 8:41 am

    Method 4: using OFFSET-FETCH:

    DECLARE @row bigint=(
    SELECT RAND(CHECKSUM(NEWID()))*SUM([rows]) FROM sys.partitions
    WHERE index_id IN (0, 1) AND [object_id]=OBJECT_ID('dbo.thetable'));

    SELECT *
    FROM dbo.thetable
    ORDER BY (SELECT NULL)
    OFFSET @row ROWS FETCH NEXT 1 ROWS ONLY;

    But it only performs properly with a clustered index. I’m guessing that’s because it’ll scan for (@rows) rows in a heap instead of doing an index seek.

    Reply
  • It seems that chat transcript was heavily censored 🙂

    Reply
  • Looks like someone came across Method 3 a little bit ago. I like it! Thanks for the post.

    https://stackoverflow.com/questions/9463938/checksumnewid-executes-multiple-times-per-row

    Reply
  • This should allow for gaps and always return a row regardless of min and max id values: –

    DECLARE @maxid bigint, @minid bigint, @randid bigint, @rand float = rand()
    SELECT @minid = MIN(Id), @maxid = MAX(Id) FROM dbo.Users
    SELECT @randid = MIN(Id) FROM dbo.Users WHERE Id >= (@minid + ( (@maxid - @minid) * @rand) )
    SELECT * FROM dbo.Users WHERE Id = @randid

    Reply
  • I always wonder why MS doesn’t look at these things and make them built in. I’m certain it wouldn’t take much effort for a built in rand_row(#) function.

    Reply
  • It all depends on how random you want your results to be. With Option 2, assuming you randomly select a row from the returned page(s), and you can get the percent correct, you’ll be oversampling larger rows (and rows in pages with more free space). With Option 3, if you have any gaps in your key, you’ll oversample the rows after a gap (think identity columns where SQL Server decides to bump the identity value by 1000).

    Reply
  • I blogged a similar topic a while back and did some basic stats testing: https://mitchwheat.com/2011/08/07/t-sql-generating-random-numbers-random-sampling-and-random-goodness/

    It includes your Method 3 which I saw in a MSDN article (not sure where though)

    Reply
    • Mitch – wow, you put a ton of work into that post! Nice job. I’ve linked to that in the body of the post because it’s so interesting.

      Reply
  • Seems to do the trick :
    DECLARE @MxRws BIGINT, @row BIGINT;
    AndOneMoreTime: /*Only if the row does not exist*/
    SELECT @MxRws = IDENT_CURRENT('[TableHere]'); --Assumes the PK is an identity column
    SELECT @row = RAND() * @MxRws;
    /*Check that the row exists and has not been deleted*/
    IF NOT EXISTS (SELECT 1 FROM [TableHere] WHERE Id = @row) GOTO AndOneMoreTime ELSE SELECT * FROM [TableHere] WHERE Id = @row;

    Reply
  • What about taking advantage of the pk (or index) histogram statistics?

    Something like:

    drop table if exists #stats

    create table #stats(
    id int identity(1,1) primary key,
    RANGE_HI_KEY int,
    RANGE_ROWS int,
    EQ_ROWS numeric(10,2),
    DISTINCT_RANGE_ROWS int,
    AVG_RANGE_ROWS int
    )
    insert into #stats
    exec('dbcc show_statistics(''Users'',''IX_ID'') WITH HISTOGRAM')

    declare @rows int
    select @rows=count(*) from #stats

    declare @id int =cast(round((@rows-1)* Rand()+1,0) as integer)
    declare @low int
    declare @high int

    select top 1 @low=a.range_hi_key , @high=lead(range_hi_key,1) over (order by id) from #stats a where a.id between @id and @id+1
    order by id

    select top 1 * from Users where ID >= @low+cast(round(((@high-@low)-1)* Rand()+1,0) as integer)

    Reply
  • I have a quick’n’dirty query that I use to get semi-random values from a table. I use time. It depends, of course, on whether one needs a result every execution, how many gaps in the table there are, how representative it has to be and how large the table is.
    declare @gosh int = datepart( hour, getdate() ) * datepart( minute, getdate() ) * datepart( s, getdate() ) * datepart( ms, getdate() );
    select *
    from schemaname.BigTable
    where BigTablePK = @gosh
    ;
    This allows me to address a row in a table with 41 million rows. It is not especially representative but it does give me new rows on each execution and it is quite fast.

    Reply
  • Maybe this is obvious, but I think method 3 is not completely random. What if there are only 2 rows with the id of 1 and 100? “100” will be selected with 99% probability. Can it be improved?

    Reply
  • Dismiss it, but ORDER BY NEWID() appears to be the only way to get a truly random sample set (more than one record) in your list of methods.

    Reply
  • Am I the only person who thinks the 3rd or 4th method is bad if the PK has a large number of big gaps?

    because with big gaps in the PK, the probability of receiving certain rows is very large compared to other rows..

    e.g.

    You have a table with 1,000,000 rows, then you delete the first 700,000 rows,

    so the probability of receiving the first row after the gap (after the deletion)
    is ~70%, where the remaining 30% "probability" is spread across the other rows

    Am I correct?

    Reply
    • Yeah, method 3 is not random at all, just imagine a table with UserId = {1, 2, 99999, 100000}. There, 99999 will be picked close to always.

      To pick a decently random record you need at least a COUNT() of the table, and that requires a table scan. It's the scan that takes time, not the calculation of NewId().

      Reply
  • Giving each row a random value is the only right way to do random selection unless you already have them numbered without gaps.

    If you know you have millions of rows you can filter out maybe 999/1000 of the users and then sort the rest:

    SELECT TOP (1)
    U.*
    FROM dbo.Users U
    CROSS APPLY(SELECT CHECKSUM(NEWID())) R(Rnd)
    WHERE
    R.Rnd%1000=0
    ORDER BY
    NEWID()

    This code should be a lot faster than if you drop the CROSS APPLY and WHERE.

    Also, don’t use ABS(CHECKSUM(NEWID())) in production.

    Reply
  • This post and comments have helped me thus far. My business case is a tad different. I need a random sample of claims that represent 2.5% of the population. So, if my population contains 3,517 distinct records, I need to return, consistently, 88 distinct records. Here’s what I’ve done thus far:

    DECLARE @FirstofLastMonth date = DATEADD(DD,1,EOMONTH(Getdate(),-2)),
    @EndofLastMonth date = EOMONTH(Getdate(), -1)

    DECLARE @Sample table (ID uniqueidentifier, ClaimNO varchar(50), Company_ID varchar(10), FinancialResp varchar(30), ProviderType varchar(50),
    DateOfService date, ClaimType varchar(20), TotalBilled numeric(10,2), TotalPaid numeric(10,2) )
    Insert into @Sample (ID, ClaimNo,Company_ID, FinancialResp, ProviderType, DateOfService, ClaimType, TotalBilled, TotalPaid)
    Select distinct NEWID() as ID, cd.ClaimNo, cd.Company_ID, cd.FinancialResp,
    CASE
    when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '1' then 'Contracted'
    when cd.FinancialResp in ('KEYMA', 'SHP', 'SFKD') and lob.LOB_Network = '0' then 'Non-Contracted'
    when cd.FinancialResp = 'KDMA' and pu.UDF_Value = 'Y' then 'Contracted'
    when cd.FinancialResp = 'KDMA' and pu.UDF_Value <> 'Y' then 'Non-Contracted'
    else 'Missing'
    end as ProviderType,
    Format( Min(cd.FromDateSvc), 'MM/dd/yyyy') as DateOfService,
    CASE
    when substring(cd.ClaimNo, 9,1) = '8' then 'Institutional'
    when substring(cd.ClaimNO, 9,1) = '9' then 'Professional'
    end as Claim_Type,
    sum(cd.Billed) as TotalBilled,
    SUM(cd.Net) as TotalPaid
    from Claim_Details cd
    inner join Claim_Masters_V cm on cd.ClaimNo = cm.ClaimNo and cd.Company_ID = cm.Company_ID
    inner join Prov_VendInfo pv on cm.Prov_KeyID = pv.Prov_KeyID and cm.Company_ID = pv.Company_ID
    inner join Vend_Masters vm on pv.Vendor = vm.VendorID
    and pv.Company_ID = vm.Environment_ID
    inner join Prov_Master_V pm on cm.Prov_KeyID = pm.Prov_KeyID
    and cm.Company_ID = pm.Company_ID
    and pv.Default_YN = 1
    inner join BM_Prov_HP_LineOfBuss lob on cm.Prov_KeyID = lob.Prov_KeyID
    and cm.Company_ID = lob.Company_ID
    and cm.HPCode = lob.HPCode
    and pv.Vendor = lob.Vendor
    and lob.LOB_Start = CAST(CHECKSUM(NEWID(),ID) & 0x7fffffff AS float) / CAST (0x7fffffff AS int)

    When I run the above query repeatedly, the results vary in number from around 100 to the upper 80’s and lower 90’s. How do I refine the above to ensure that just 2.5% of the qualifying records are returned in a random sample?

    Reply
 

Auto generated SQL Server keys with the uniqueidentifier or IDENTITY

By: Armando Prato   |   Comments (38)

 

Problem

I'm designing a table and I've decided to create an auto-generated primary key value as opposed to creating my own scheme or using natural keys. I see that SQL Server offers globally unique identifiers (GUIDs) as well as identities to create these values. What are the pros and cons of these approaches?

Solution

Yes, there are a number of ways you can auto-generate key values for your tables. The most common ways are via the use of the IDENTITY column property or by specifying a uniqueidentifier (GUID) data type along with a DEFAULT of either the NEWID() or NEWSEQUENTIALID() function. Furthermore, GUIDs are heavily used in SQL Server Replication to uniquely identify rows in Merge Replication or Transactional Replication with updating subscriptions.

The most common, well known way to auto-generate a key value is via the use of the IDENTITY column property on a column that's typically declared as an integer. Once defined, the engine will automatically generate a sequential number based on the way the property has been declared on the column. The IDENTITY property takes an initial seed value as its first parameter and an increment value as its second parameter.

Consider the following example which creates and inserts into identity based tables that define the primary key as a clustered index:

SET NOCOUNT ON
GO
USE MASTER
GO
CREATE DATABASE MSSQLTIPS
GO

USE MSSQLTIPS
GO
-- Start at 1 and increment by 1
CREATE TABLE IDENTITY_TEST1 
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- Start at 10 and increment by 10
CREATE TABLE IDENTITY_TEST2
(
ID INT IDENTITY(10,10) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- Start at 1000 and increment by 5
CREATE TABLE IDENTITY_TEST3
(
ID INT IDENTITY(1000,5) PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- INSERT 1000 ROWS INTO EACH TEST TABLE 
DECLARE @COUNTER INT
SET @COUNTER = 1

WHILE (@COUNTER <= 1000)
BEGIN
   INSERT INTO IDENTITY_TEST1 DEFAULT VALUES 
   INSERT INTO IDENTITY_TEST2 DEFAULT VALUES 
   INSERT INTO IDENTITY_TEST3 DEFAULT VALUES 
   SET @COUNTER = @COUNTER + 1
END
GO

SELECT TOP 3 ID FROM IDENTITY_TEST1
SELECT TOP 3 ID FROM IDENTITY_TEST2
SELECT TOP 3 ID FROM IDENTITY_TEST3
GO

 

 

Another way to auto-generate key values is to specify your column as a type of uniqueidentifier and DEFAULT using NEWID() or NEWSEQUENTIALID(). Unlike IDENTITY, a DEFAULT constraint must be used to assign a GUID value to the column.

How do NEWID() and NEWSEQUENTIALID() differ? NEWID() randomly generates a guaranteed unique value based on the identification number of the server's network card plus a unique number from the CPU clock. In contrast, NEWSEQUENTIALID() generates these values in sequential order as opposed to randomly.

Let's create new tables that use a uniqueidentifier along with both NEWID() and NEWSEQUENTIALID()

USE MSSQLTIPS
GO

CREATE TABLE NEWID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO
CREATE TABLE NEWSEQUENTIALID_TEST
(
ID UNIQUEIDENTIFIER DEFAULT NEWSEQUENTIALID() PRIMARY KEY,
TESTCOLUMN CHAR(2000) DEFAULT REPLICATE('X',2000)
)
GO

-- INSERT 1000 ROWS INTO EACH TEST TABLE 
DECLARE @COUNTER INT
SET @COUNTER = 1

WHILE (@COUNTER <= 1000)
BEGIN
   INSERT INTO NEWID_TEST DEFAULT VALUES 
   INSERT INTO NEWSEQUENTIALID_TEST DEFAULT VALUES 
   SET @COUNTER = @COUNTER + 1
END
GO

SELECT TOP 3 ID FROM NEWID_TEST
SELECT TOP 3 ID FROM NEWSEQUENTIALID_TEST
GO

 

 

As you can see, the first table which uses NEWID() generates random values while the second table that uses NEWSEQUENTIALID() generates sequential values. As opposed to the integers generated by the IDENTITY approach, the GUID values generated are not as friendly to look at or work with. There is one other item to note. SQL Server keeps the last generated identity value in memory which can be retrieved right after an INSERT using SCOPE_IDENTITY(), @@IDENTITY, or IDENT_CURRENT() (depending on the scope you require). There is nothing similar to capture the last generated GUID value. If you use a GUID, you'll have to create your own mechanism to capture the last inserted value (i.e. retrieve the GUID prior to insertion or use the SQL Server 2005 OUTPUT clause).
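For instance, a minimal sketch of the OUTPUT approach against the NEWID_TEST table above:

DECLARE @NewKeys TABLE (ID UNIQUEIDENTIFIER)

-- Capture the GUID assigned by the DEFAULT constraint as the row is inserted
INSERT INTO NEWID_TEST
OUTPUT INSERTED.ID INTO @NewKeys
DEFAULT VALUES

SELECT ID FROM @NewKeys
GO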

Now that we understand how to auto generate key values and what they look like, let's examine the storage impacts of each approach. As part of the previously created table definitions, I added a column of CHAR(2000) to mimic the storage of additional column data. Let's examine the physical storage of the data:

USE MSSQLTIPS
GO
SELECT OBJECT_NAME([OBJECT_ID]) as tablename, avg_fragmentation_in_percent, fragment_count, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), null, null, null, null)
ORDER BY tablename
GO

 

 

Looking at this output, you can see that the NEWID() test table is very fragmented as evidenced by its fragmentation percentage of 98%. Furthermore, you can see that the rows were dispersed among 490 pages. This is due to the page splitting that occurred due to the random nature of the key generation. In contrast, the IDENTITY and NEWSEQUENTIALID() test tables show minimal fragmentation since their auto generated keys occur in sequential order. As a result, they don't suffer from the page splitting condition that plagues the NEWID() approach. Though you can defragment the NEWID() table, the random nature of the key generation will still cause page splitting and fragmentation with all future table INSERTs. However, page splitting can be minimized by specifying an appropriate FILL FACTOR.
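A hedged sketch of that mitigation, using the NEWID_TEST table from this tip:

-- Rebuild with 30% free space per page so future random inserts
-- cause fewer page splits
ALTER INDEX ALL ON NEWID_TEST REBUILD WITH (FILLFACTOR = 70)
GO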

Looking at the NEWSEQUENTIALID() test table, we see it generated fewer pages than the NEWID() approach but it still generated more pages than the IDENTITY approach. Why is this? It's because the uniqueidentifier data type consumes 16 bytes of disk space as opposed to the 4 bytes used by the integer data type that was used for the IDENTITY. Considering that SQL Server pages are generally capped at 8K or roughly 8060 bytes (as of SQL Server 2005, there is a row-overflow mechanism that can kick in but that's for another discussion), this leads to more pages generated for the NEWSEQUENTIALID() approach as opposed to the IDENTITY approach.

Examining the database table space used, we see that the tables using the IDENTITY approach used the least amount of disk space.

exec sp_spaceused IDENTITY_TEST1
GO
exec sp_spaceused IDENTITY_TEST2
GO
exec sp_spaceused IDENTITY_TEST3
GO
exec sp_spaceused NEWID_TEST
GO
exec sp_spaceused NEWSEQUENTIALID_TEST
GO

 

 

Now also consider this: since a uniqueidentifier data type consumes 16 bytes of data, the size of any non-clustered indexes defined on a table that uses a GUID as its clustered index is also affected, because the leaf level of these non-clustered indexes contains the clustered index key as a pointer. As a result, the size of any non-clustered indexes would end up being larger than if an IDENTITY were defined as an integer or bigint.

It's evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches:

  • IDENTITY generated values are configurable, easy to read, and easier to work with
  • Fewer database pages are required to satisfy query requests
  • In comparison to NEWID(), page splitting (and its associated overhead) is eliminated
  • Database size is minimized
  • System functions exist to capture the last generated IDENTITY value (i.e. SCOPE_IDENTITY(), etc)
  • Some system functions - such as MIN() and MAX(), for instance - cannot be used on uniqueidentifier columns
Next Steps
  • Read more about NEWSEQUENTIALID() in the SQL Server 2005 Books Online
  • Read Using uniqueidentifier Data in the SQL Server 2005 Books Online
  • If you're not in a situation where you require a globally unique value, consider if an IDENTITY makes sense for auto-generating your key values.
  • Regardless if you decide on a GUID or IDENTITY, consider adding a meaningful UNIQUE key based on the real data in your table.


About the author
Armando Prato has close to 30 years of industry experience and has been working with SQL Server since version 6.5.


Comments For This Article




Tuesday, October 1, 2019 - 10:49:56 PM - Jim Evans

Great article. Nice details comparing identity to guids. Often I have dealt with very large tables heavily fragmented because of guid pkeys, when an int column with identity would have worked just fine.  Also, for those worried about running out of numbers, they can use bigint with identity and hold 9,223,372,036,854,775,807 rows.  Another trick with int or bigint is to start the seed at the largest negative number. Thanks for the info.
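A sketch of that negative-seed trick (the table name is invented for illustration):

-- Start the seed at the smallest bigint value to double the usable range
CREATE TABLE IDENTITY_FULL_RANGE
(
ID BIGINT IDENTITY(-9223372036854775808, 1) PRIMARY KEY,
TESTCOLUMN CHAR(10) NOT NULL DEFAULT 'X'
)
GO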


Thursday, February 23, 2017 - 10:51:35 PM - Asdirs

Custom Generated ID with SQL Server

How do I create a function for ID numbers in SQL Server? For example, I want my customers table to combine an area code and a consumer code: area code 12345 and consumer code 00001 combine into 12345-00001, then 12345-00002, 12345-00003 and so on; then area code 12344 gives 12344-00001, 12344-00002, 12344-00003 and so on, so the consumer code increases within each area code. I have a table like this:

table customer (id, name, kode_area, idcustomer) and table area (id, name)
table customer (00001, A, 12345, 12345-00001) and table area (12345, BandarLampung-Indonesia)
table customer (00002, B, 12345, 12345-00002) and table area (12345, BandarLampung-Indonesia)
table customer (00002, B, 12345, 12345-00003) and table area (12345, BandarLampung-Indonesia)
table customer (00001, AA, 12344, 12344-00001) and table area (12344, Tanggamus-Indonesia)
table customer (00002, BB, 12344, 12344-00002) and table area (12344, Tanggamus-Indonesia)
table customer (00002, BC, 12344, 12344-00003) and table area (12344, Tanggamus-Indonesia)

Monday, February 6, 2017 - 12:36:12 PM - Michael Ryan

Wow, what a fantastic article and analysis of UUID generation, thanks!

We have been pondering going to a UUID based primary key scheme for one of systems (we are currently on an INT based PK) and this really helps illustrate the difference.

I ran this test on a local MSSQLExpress 2014 box and the results are very similar but with the Sequential UUID showing less fragmentation than the traditional Identity model.

Interesting stuff...thanks!

 

IDENTITY_TEST1 2.8 38 250

IDENTITY_TEST2 2.8 38 250

IDENTITY_TEST3 2.8 38 250

NEWID_TEST 97.9959919839679 498 499

NEWSEQUENTIALID_TEST 1.49700598802395 47 334



Thursday, August 21, 2014 - 9:38:44 AM - Mike

I find this a very simplified look at generating keys.

 

You don't know about replication?

 

I do like SQL Server 2012's ability to create your own unique pattern also.

 


Friday, April 12, 2013 - 3:12:32 PM - george hardy

so it looks like i can still maintain my readable index numbers that are auto-incrementing integers, just remove it as the primary key, and begin to use these guids as the real row key.  this way, my code doing something like "DELETE FROM EQUIPMENT WHERE EQUIPMENT_ID = x" will still work.

i also want a much cleaner way to utilize the sync framework, and had i known this before, that task would have been MUCH easier!

 

thanks for the write-up.


Monday, January 14, 2013 - 2:26:26 AM - URVISH SUTHAR

Great job, many thanks for sharing :)

 


Wednesday, May 23, 2012 - 11:26:06 PM - Ravi Chauhan

How do I delete a record from a table where the primary key (ID) is auto-generated, and we need to delete that record by its primary key? For example:

DELETE TOP(1) FROM dbo.Test WHERE ID = 1

but we don't know the ID number in the above scenario....


Wednesday, May 23, 2012 - 2:26:09 AM - Ravi Chauhan

We created an example with two tables,

but suppose we have three table IDENTITY_TEST1, IDENTITY_TEST2 and IDENTITY_TEST3

Means A >> B >> C (link sign ">>")

so how do we create the relationship between these three tables?

 


Wednesday, May 23, 2012 - 2:17:38 AM - Ravi Chauhan

create database db

use db

CREATE TABLE IDENTITY_TEST1
(
ID INT IDENTITY(1,1) PRIMARY KEY,
TESTCOLUMN varchar(20)
)

CREATE TABLE IDENTITY_TEST2
(
EMP_ID INT IDENTITY(1,1) REFERENCES IDENTITY_TEST1(ID),
ADDR VARCHAR(20)
)

insert into IDENTITY_TEST1 values('cde')

select * from IDENTITY_TEST1

insert into IDENTITY_TEST2 values('acd')

select * from IDENTITY_TEST2


Thursday, April 19, 2012 - 6:00:38 PM - Enigmatic

@Javed,

Yes that is correct. IDENTITY is a counter and every time you insert a record the next number in that counter is used. This is done to ensure you ALWAYS get a unique number that could not possibly be in the table. If you want to go back and "fill in" missing numbers then you are going to have to implement your own locking mechanism to first lock the table, then select the first missing number and then unlock it once you have found it. If you do not do this, there is a chance multiple calls made at the same time could potentially return the same number and attempt to both insert the same key which would give you a primary key violation.

If you want to delete all records and restart your numbering at 1 you need to use:

DBCC CHECKIDENT ("<Schema>.<TableName>", RESEED, 0)

This will reset the IDENTITY so that the next insert starts again from the original seed value


Thursday, April 19, 2012 - 9:11:55 AM - javed

Hi. sir,

My question is: I have a column id set as a primary key identity in a table. Whenever I delete any record from the table and insert a new record, the id continues counting from where the last id was. And whenever I delete all records from the table and then insert a new record, it does not start with 1; it starts from where the last id was.

plz help me..do reply me.. 


Monday, April 2, 2012 - 1:37:32 AM - Enigmatic

One thing that wasn't listed were the disadvantages of using IDENTITY columns.

The biggest issue I have is the use of index values in code. Where a function accepts an integer as the index, a person can place any number they like in there, regardless of what that number represents. It opens you up to the possibility of more errors, as every table with an identity will have a row with ID = 1, 2, 3, etc.

So for instance, there would be nothing stopping me from accidentally transposing an ID from one table into an ID of another table, and in fact I could create foreign key indexes between unrelated data simply because the ID values of one table exist in the other.

To me this is a loss of referential integrity, and the signature of C# methods that require IDs all end up something like (int, int, int). To me this isn't IDentifying records at all, it's simply numbers.

For this reason I prefer to use GUIDs as you are guaranteed that the GUID value in one table will not appear in any other table other than the one it belongs in, thus ensuring no mistakes in foreign keys, and no accidental transposing when filling out method signatures. When you consider how many functions, stored procedures, methods, and queries either pass identifiers or compare identifiers, the odds of transposing increase with every single new addition.

This level of security and reliability easily "trumps" a bit of disk space in a world that is now being measured in Terabytes, or the use of MIN/MAX functions on an IDENTITY field (that makes no sense!), or a few less pages being created.

Anyone who has spent days trying to find out why something isn't working correctly only to realise that they "accidentally" inserted an IndustryTypeID into an IndustryID, but it never complained because there just happens to be an IndustryID = 3 the same as there is an IndustryTypeID = 3, or why records occasionally said it violated a primary key but only at what appears to be random intervals, etc.


Thursday, March 22, 2012 - 12:51:58 AM - sandeep kumar

How do I create an auto-generated cust_id where the conditions are:

1- it does not match any previous id

2- it looks like cust_id AB001

Please reply me.

                                 Thank You.

 


Tuesday, March 20, 2012 - 9:53:03 AM - Zeni

Thanks. This article was helpful for me in deciding primery key datatype for my database.

Thanks


Tuesday, February 22, 2011 - 5:41:51 AM - Fayssal El Moufatich

Thanks for the nicely written article! So, unless one is in a situation where he needs a globally unique identifier, one should avoid using GUIDs. And even in this situation, it is recommended to generate the keys using NEWSEQUENTIALID(), as long as it is possible. I agree! :)


Monday, January 19, 2009 - 6:38:33 PM - aprato

You answered your own question.  You work in a distributed environment and they make sense in this scenario.  Replication makes use of GUIDs for this reason.   I personally wouldn't use a GUID for an application that didn't have a valid business reason for it.  They have their place but they do have some downside.   


Monday, January 19, 2009 - 5:50:12 PM - laughingskeptic

But isn't this a bit of a red herring?  The natural fill factor of an index on random data (such as a GUID) that experiences primarily inserts is 75%.  If you compact it, then you just cause a bunch of page splits as soon as the inserts begin again, potentially swinging quickly from a fill of 100% to 50% before settling back in to 75%.  A SQL Server page can hold around 254 GUIDs, so if your table has significantly more rows than this you should expect to see close to 100% fragmentation.  For random indexes, average fill factor is a better metric than percent fragmentation.  If the number is not about 75% then an investigation is warranted.

GUIDs are 4 times bigger than an integer, but if your server has reached memory constraints (most real ones do) a GUID index will be 5 times slower than an integer at lookups.  This is a design consideration.

I work with some highly distributed databases, involving multitiered replication and thousands of clients (some being third parties).  We would never get this system to work without using a lot of GUIDs.  GUIDs are especially handy when multiple clients can work standalone then expect to re-sync when they connect.


Monday, October 27, 2008 - 7:40:26 AM - aprato

Hi

I was just stating that you'll still get rapid index fragmentation (heavy, actually) if you decide to create a non-clustered index on a uniqueidentifier that uses NEWID().  It doesn't take many rows.  I had done a test with just 10,000 insertions and I still had an index with 95% fragmentation.  Due to the random generation, you'll get rapid fragmentation; it won't take many rows.   It's just something to be aware of.


Monday, October 27, 2008 - 6:29:39 AM - mharr

[quote user="aprato"]

You should note that even using NEWID() for a non-clustered index will lead to fragmentation for that index. Recall that index structures are sorted logically.

[/quote]

Yes, but:

  • a fragmented index is less likely (since an index "row" will be smaller than a data "row", and will have more entries and more room for new entries in an index), and will be less
  • rebuilding an index to remove fragmentation is easier and less disruptive than reordering a table.
  • while doing "one of" lookups of a primary key where the primary key is a guid (the most used query using a primary key), using a fragmented index and unfragmented data pages will likely still be more efficient than using fragmented data pages where the guid is the clustered primary key.

Thursday, October 23, 2008 - 12:38:00 PM - aprato

You should note that even using NEWID() for a non-clustered index will lead to fragmentation for that index. Recall that index structures are sorted logically.


Thursday, October 23, 2008 - 12:13:46 PM - Wiseman82

[quote user="DiverKas"]

[quote user="mharr"]

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will be frequent access of a customer table by territory number, it may be best to set up a clustered index by that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps

[/quote]

That is a great point.  It's silly for Microsoft to have left the clustered index on an identity or guid column for that matter.  It makes no sense and certainly lends nothing to the performance.  It is usually the first thing I have to change on a newly created table.

 

[/quote]

Choosing another column for the clustered index might get around the fragmentation issue, but the primary key is often the best candidate for the clustered index.  Sequential GUIDs might be a better option to get around the fragmentation issue.  Also remember that the fragmentation issue would apply (to a lesser extent) to a non-clustered index created on the GUID column.

It's normally a good idea for the clustered index to be unique - if it's not, SQL Server will add a uniqueifier to make the clustered index unique (additional overhead).  Due to the fragmentation issues already mentioned, it's also a good idea to have an auto-incrementing clustered index - Identities and Sequential GUIDs both fit the bill here.  Another thing you need to consider is if the chosen column(s) for the clustered index are likely to be updated - ideally you want to pick a column that is never/rarely updated - the primary key also fits the bill here.

I'd advise people to keep the clustered index as the primary key, unless they have a good reason for choosing a different column (MS made this a default for good reason).  I'm not saying that you always should use the primary key as a clustered index, but it's not a bad choice in the absence of another candidate for the clustered index.

In terms of query performance (rather than write performance), you might want to use a column that is frequently used in ORDER BY, GROUP BY, JOINS (foreign keys) or where the column is often used to select a range of values etc. 

As mentioned, a date column might be a good candidate if it's populated with GetDate()/GetUTCDate() - you would expect a very low number of duplicate values and the column will be auto-incrementing.  It is also unlikely that the column will ever be updated.  If queries frequently sort on this column and filter it by a range of values then it would be a good candidate for the clustered index.

Also note that best practice will vary depending on the type of database.  If you are creating a data warehouse/reporting database rather than an OLTP database, you will have different priorities. 

 

 


Thursday, October 23, 2008 - 9:31:52 AM - Perre

Nice article.  Also good comments telling there are also advantages with guids ... not that you said there aren't.

I personally prefer identity columns ... indeed for passing back the key from stored procedures to the client software, for easiness in SQL statements, joins, etc ... and, as you prove nicely, to limit page splits, index size, ...

Although I once had a problem when a branch office with its own SQL Server, running the same application and database, started using VPN.  Business managers decided the databases should be merged together ... so I hit the problem that we had the same client id's in both databases but referring to different clients ... and of course those id's were spread across a lot of tables as foreign keys.  As a consequence I had a serious job merging the data.  Finally I decided to add an extra field to the auto-incremental primary key, telling me whether it's a client from the branch office or one from the main office.  So nowadays when I have to use a key like that I consider adding an extra column.  Alternatively I could have added 10,000,000 to the id's from the branch office before importing the data ... but I didn't think of that at the moment, and it wouldn't have clarified the difference between clients from both offices inserted after the merge.

An old collegue of me always used GUIDs for his websitedatabases since they seem more complex in the URL ... avoiding the fact if you hit .com?id=127 people would try to enter .com?id=128 which should not be visible to them ... but afterwards I think this isn't a strong argument.  If you should not be allowed to see the page with .com?id=128 ... the website should block it :-)

Cheers

Thursday, October 23, 2008 - 9:22:09 AM - DiverKas Back To Top (2067)

[quote user="mharr"]

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will be frequent access to a customer table by territory number, it may be best to set up a clustered index on that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps

[/quote]

That is a great point.  It's silly for Microsoft to have left the clustered index on an identity or GUID column, for that matter.  It makes no sense and certainly adds nothing to performance.  It is usually the first thing I have to change on a newly created table.

 


Thursday, October 23, 2008 - 9:19:29 AM - mharr Back To Top (2066)

Nice article on the advantages/disadvantages of identity and uniqueidentifier keys.  However, there is a relatively simple solution to the problem of fragmentation when using uniqueidentifier keys: don't use the primary key as a clustered index.

By default, when you create a primary key, it is created as a clustered index.  But it does not have to be so.  You can choose any other column as the clustered index, then the data rows will be physically ordered by that column.

How are the queries on the table going to be used?  If it is by and large going to be random access, then perhaps a DateInserted date column would be appropriate.  If there will be frequent access to a customer table by territory number, it may be best to set up a clustered index on that column.  Or, choose not to have any clustered index, and rows will always be added to the end of the last data page, and time-consuming page splits will never occur.

There is no rule, in practice or theory, that the primary key must be a clustered index.  A little more thought on how the table will be used and accessed can resolve the fragmentation issue, and provide better performance.

Hope this helps


Thursday, October 23, 2008 - 9:15:52 AM - DiverKas Back To Top (2065)

[quote user="cbasoz"]

I found the conclusions at the end of the article a little bit biased toward identity usage. OK, so far so good: IDENTITY has advantages. Doesn't it have any clear disadvantages? Does it really outperform GUIDs?

I am not a SQL DBA but more of a developer, and I think GUIDs have undeniable advantages over IDENTITY, i.e.:

- I can create the key independently from the data store. I know it is unique, so I don't worry about creating data on my laptop totally disconnected and later inserting it into the main DB without any change to the keys.

- Not needing the DB store to generate the key also means I don't need round trips to the server. Those round trips might add up to a big amount. Performance-wise, maybe it is better to use GUIDs then?

I wonder, if GUIDs are disadvantageous, why would SQL Server itself ever use GUIDs in replication? The SQL Server team could have done that with integer key fix-ups, but they chose GUIDs. Probably they thought using GUIDs is cheaper than doing key fix-ups?

(just thinking out loud)

[/quote]

There is a rather large performance difference between comparing 4 bytes and 16 bytes of data, especially in join conditions.  This isn't a trivial difference.

Also, your assumption that GUIDs created on the client are unique is false.  GUIDs have a reasonable chance of being unique, but it is not guaranteed, and I see cases monthly where it does not hold.  Since the GUID on a client is generated from several pieces of "hardware" plus the time, it is quite possible, through OEM vendor mistakes and luck of the draw, to have the same GUID assigned on different hardware.

GUIDs as a general rule should only be used for replication and distributed data.  In general it is also best to generate the key from a single source, thereby reducing the chance of a collision to about nil.

For performance's sake, if replication and distributed data are not a concern, identity columns are a HUGE winner.
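
The size difference itself is easy to verify. A trivial sketch:

--4 bytes for an int vs. 16 bytes for a uniqueidentifier
declare @i int
declare @g uniqueidentifier
set @i = 1
set @g = NEWID()

select DATALENGTH(@i) as int_bytes,  --returns 4
       DATALENGTH(@g) as guid_bytes --returns 16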

 


Thursday, October 23, 2008 - 8:59:47 AM - aprato Back To Top (2064)

I didn't take it negatively at all.  I understand what you're driving at.  


Thursday, October 23, 2008 - 8:36:40 AM - cbasoz Back To Top (2063)

I was referring to the section starting with:

"It's evident that using IDENTITY to auto-generate key values offers a few advantages over the GUID approaches"

and you say:

"The reason they probably chose GUIDs is due to their uniqueness across space and time."

To me that sounds like a very good reason to choose it as a surrogate primary key, so in the same manner I could say:

It's evident that using GUIDs to auto-generate key values offers a few advantages over the IDENTITY approach.

This statement is a fact too, but I didn't see any mention of it, so I naturally thought of the article as biased. I didn't have a negative intention; I just felt the article wasn't complete.


Thursday, October 23, 2008 - 8:21:42 AM - aprato Back To Top (2062)

 <<Yes - more or less.  You might also consider composite keys - using identity in conjunction with another column. You could also generate your own "unique" identifier based on some custom logic. You might even consider splitting identity ranges between tables. In this situation GUIDs would probably be the most attractive option IMO.>>

I would concur with Wiseman.  GUIDs may be the better option if you're working in a distributed environment. 


Thursday, October 23, 2008 - 7:59:01 AM - Wiseman82 Back To Top (2061)

[quote user="pmh4514"]

Thanks for the follow-ups to my questions!

 [quote user="aprato"]The reason they probably chose GUIDs is due to their uniqueness across space and time.[/quote]

If uniqueness across space and time is a fundamental requirement of a given distributed database architecture, am I correct in saying that this requirement alone makes the decision for us, and requires the use of GUIDs over IDENTITY, despite any inherent performance issues that come with that decision?

[/quote]

Yes - more or less.  You might also consider composite keys - using identity in conjunction with another column. You could also generate your own "unique" identifier based on some custom logic. You might even consider splitting identity ranges between tables. In this situation GUIDs would probably be the most attractive option IMO.
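
A minimal sketch of the identity-range idea (the table names are hypothetical): each site seeds its IDENTITY far enough apart that rows merged later cannot collide.

--main office: IDs run from 1 upwards
create table dbo.Clients_MainOffice
(
ClientID int identity(1, 1) primary key,
Name varchar(64) not null
)
go

--branch office: IDs start at 10,000,000, so the two ranges never overlap
create table dbo.Clients_BranchOffice
(
ClientID int identity(10000000, 1) primary key,
Name varchar(64) not null
)
go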


Thursday, October 23, 2008 - 7:49:52 AM - pmh4514 Back To Top (2060)

Thanks for the follow-ups to my questions!

 [quote user="aprato"]The reason they probably chose GUIDs is due to their uniqueness across space and time.[/quote]

If uniqueness across space and time is a fundamental requirement of a given distributed database architecture, am I correct in saying that this requirement alone makes the decision for us, and requires the use of GUIDs over IDENTITY, despite any inherent performance issues that come with that decision?


Thursday, October 23, 2008 - 7:31:55 AM - aprato Back To Top (2059)

 <<I found the conclusions at the end of the article a little bit biased toward identity usage>>

In reality, it wasn't bias - they were statements of fact.  Using IDENTITY results in less database bloat: fewer data and index pages are generated (less churn for DML statements), as well as less bloat of the non-clustered indexes, which carry the 16-byte GUID as their row locator when the GUID is the clustered index.    

The reason they probably chose GUIDs is due to their uniqueness across space and time.


Thursday, October 23, 2008 - 7:24:53 AM - Wiseman82 Back To Top (2058)

GUIDs do have benefits over identity columns, but they come with quite a large overhead - 16 bytes compared to 4 bytes for an integer.  Some of the problems with uniqueidentifiers can be overcome by using sequential GUIDs, but identity is far better from a performance point of view.

Bottom line: identities should be your first choice. If an identity column doesn’t fit the bill, consider a GUID. 

Also, don’t take people’s word for it.  Construct your own performance test to see how they perform in comparison to each other.  I’ve done so myself and found the difference to be significant.  The difference between sequential GUIDs and non-sequential GUIDs was also VERY significant.

 


Thursday, October 23, 2008 - 7:13:44 AM - aprato Back To Top (2057)

 

<<Question 1 would be "how are tables defragged?">>

 Depends on your version of SQL Server

2000: DBCC DBREINDEX or DBCC INDEXDEFRAG
2005: ALTER INDEX specifying either REBUILD or REORGANIZE

You could also drop and re-create the clustered index

<<2. Describe "fill factor" - how would I implement it? What does it specify?>>

Creating indexes with a FILL FACTOR % tells the engine how full you want the index leaf pages (i.e. a FILL FACTOR of 80 means keep 20% of each leaf page free for future inserts - or 80% full).  It minimizes page splits (an intensive operation where half of the data on a page is moved to a new page) because the free space can accommodate any random row that fits on the page.  It's used in high-volume environments where page splitting becomes a problem.
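
A short sketch of those commands (the table and index names are hypothetical):

--SQL Server 2000
DBCC DBREINDEX ('dbo.Orders', 'PK_Orders', 80)  --rebuild with an 80% fill factor
DBCC INDEXDEFRAG (0, 'dbo.Orders', 'PK_Orders') --defragment the index in place

--SQL Server 2005
ALTER INDEX PK_Orders ON dbo.Orders REBUILD WITH (FILLFACTOR = 80)
ALTER INDEX PK_Orders ON dbo.Orders REORGANIZE  --lighter-weight alternative, always online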

 

 


Thursday, October 23, 2008 - 6:47:48 AM - cbasoz Back To Top (2056)

I found the conclusions at the end of the article a little bit biased toward identity usage. OK, so far so good: IDENTITY has advantages. Doesn't it have any clear disadvantages? Does it really outperform GUIDs?

I am not a SQL DBA but more of a developer, and I think GUIDs have undeniable advantages over IDENTITY, i.e.:

- I can create the key independently from the data store. I know it is unique, so I don't worry about creating data on my laptop totally disconnected and later inserting it into the main DB without any change to the keys.

- Not needing the DB store to generate the key also means I don't need round trips to the server. Those round trips might add up to a big amount. Performance-wise, maybe it is better to use GUIDs then?

I wonder, if GUIDs are disadvantageous, why would SQL Server itself ever use GUIDs in replication? The SQL Server team could have done that with integer key fix-ups, but they chose GUIDs. Probably they thought using GUIDs is cheaper than doing key fix-ups?

(just thinking out loud)


Thursday, October 23, 2008 - 6:45:33 AM - Wiseman82 Back To Top (2055)

You defrag tables by doing a clustered index rebuild/reorganize.  BOL will give you more info on this.
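
Before choosing between a rebuild and a reorganize, it helps to measure the fragmentation first. A hedged sketch (the database and table names are hypothetical):

--SQL Server 2005: check fragmentation of all indexes on a table
select index_id, avg_fragmentation_in_percent, page_count
from sys.dm_db_index_physical_stats(
         DB_ID('MyDb'), OBJECT_ID('dbo.Orders'), null, null, 'LIMITED')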

A lower fill factor will leave room for more items to be inserted, reducing page splits and improving insert performance.  Lower fill factors will have a negative impact on select performance though.  With an auto-incrementing key (identity/sequential GUID) you can use a high fill factor - new items will be inserted at the end of the index. Fill factors are specified when the index is created - BOL will provide you with more info.
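
For instance, a minimal sketch (hypothetical names) of setting the fill factor when the index is created:

--leave 10% of each leaf page free for future inserts
create clustered index IX_Orders_OrderDate
    on dbo.Orders (OrderDate)
    with (FILLFACTOR = 90)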


Thursday, October 23, 2008 - 6:45:00 AM - aprato Back To Top (2054)

<< There is a problem with this image - SQL Server actually sorts GUIDs by byte group from right to left >>

Yes, you're correct.  The image is meant to show the difference in how the engine generates the value (i.e. the randomness of NEWID() vs. NEWSEQUENTIALID()).


Thursday, October 23, 2008 - 6:24:53 AM - pmh4514 Back To Top (2053)

Very interesting... I'm putting together a new distributed DB design using GUIDs as unique identifiers to ensure uniqueness across many computers. We are still in early development, so design and implementation changes happen daily as I work through everything. I found this article interesting because it suggests some things I can do early on to prevent some of the inherent problems with using GUIDs as identities. But there were a few points I was unsure about.

When discussing the fragmentation differences between using an incremental integer vs. a GUID as the identity, the author described the page splitting that occurs.  He wrote:

Though you can defragment the NEWID() table, the random nature of the key generation will still cause page splitting and fragmentation with all future table INSERTs. However, page splitting can be minimized by specifying an appropriate FILL FACTOR.

Question 1 would be "how are tables defragged?" (I am more a programmer than a DBA, for what it's worth). And question 2: describe "fill factor" - how would I implement it? What does it specify?


Thursday, October 23, 2008 - 2:59:58 AM - jnollet Back To Top (2050)

Great article ... I've found this out too, and it's important to think about it up front in the development process.  Trying to change it later can be difficult.

 
 
posted on 2023-10-12 11:34  不及格的程序员-八神