Select n random rows from SQL Server table - sql

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.
This article suggests using the NEWID() function. That looks promising, but I can't see how I could reliably select a given percentage of rows.
Anybody ever do this before? Any ideas?

select top 10 percent * from [yourtable] order by newid()
In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.
select * from [yourtable] where [yourPk] in
(select top 10 percent [yourPk] from [yourtable] order by newid())
The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

Depending on your needs, TABLESAMPLE will get you nearly as random and better performance.
This is available in MS SQL Server 2005 and later.
TABLESAMPLE returns data from random pages instead of random rows, and therefore does not even retrieve data that it will not return.
On a very large table I tested
select top 1 percent * from [tablename] order by newid()
took more than 20 minutes.
select * from [tablename] tablesample(1 percent)
took 2 minutes.
Performance will also improve on smaller samples in TABLESAMPLE whereas it will not with newid().
Please keep in mind that this is not as random as the newid() method but will give you a decent sampling.
See the MSDN page.
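If you want roughly a fixed number of rows, or the same sample across repeated runs, TABLESAMPLE also accepts a row count and a REPEATABLE clause. A minimal sketch (table name and seed are illustrative):
-- roughly 5,000 rows; the sample repeats for the same seed while the data is unchanged
select * from [yourtable] tablesample (5000 rows) repeatable (42)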

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
              / CAST(0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time = 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be last resort if you have a large result set.

Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.
SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10

This link has an interesting comparison between ORDER BY NEWID() and other methods for tables with 1, 7, and 13 million rows.
Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:
The sorting operation usually has a high cost associated with it.
Sorting can use a lot of disk I/O and can run for a long time.
In the worst-case scenario, tempdb can run out of space. In the
best-case scenario, tempdb can take up a large amount of disk space
that never will be reclaimed without a manual shrink command.
What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:
SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < 10
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.
Please read the full article in MSDN.
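If you need a different percentage, the threshold can simply be parameterized; a minimal sketch of the same idea:
declare @percent int = 10 -- select roughly 10 percent
SELECT * FROM Table1
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) as int)) % 100) < @percent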

Just order the table by a random number and obtain the first 5,000 rows using TOP.
SELECT TOP 5000 * FROM [Table] ORDER BY newid();
UPDATE
Just tried it, and a newid() call is sufficient - no need for all the casts and all the math.

If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:
DECLARE @sampleCount int = 50
SET STATISTICS TIME ON
SELECT TOP (@sampleCount) *
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()
SET STATISTICS TIME OFF
In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. (Note that TABLESAMPLE will not accept a variable)
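If the percentage really does need to come from a variable, one workaround (a sketch, not from the original answer) is to splice it into dynamic SQL:
DECLARE @sampleCount int = 50, @samplePercent int = 10, @sql nvarchar(max)
SET @sql = N'SELECT TOP (@sampleCount) * FROM [yourtable] '
         + N'TABLESAMPLE(' + CAST(@samplePercent AS nvarchar(10)) + N' PERCENT) '
         + N'ORDER BY NEWID()'
EXEC sp_executesql @sql, N'@sampleCount int', @sampleCount = @sampleCount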

This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():
SELECT TOP [number] *
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())

In MySQL you can do this:
SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;

Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId() is insignificantly slower than rand(checksum(*)), so you may not want to use it against large record sets.
Selection with Initial Seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.
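A rough way to check that stability (hypothetical table and key names): run the seeded sample twice and diff the results with EXCEPT; an empty result suggests the selection is stable for that seed.
declare @seed int = Year(getdate()) * month(getdate())
select top 10 percent id into #run1 from table_name order by rand(checksum(*) % @seed)
select top 10 percent id into #run2 from table_name order by rand(checksum(*) % @seed)
select id from #run1 except select id from #run2 -- expect no rows if the ordering is deterministic for the seed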

Try this:
SELECT TOP 10 Field1, ..., FieldN
FROM Table1
ORDER BY NEWID()

Here is an updated and improved form of sampling. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
Reasons to use an implementation similar to this one, as opposed to other answers:
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID(), such as CHECKSUM(NEWID()) % 100, can never be stable/repeatable.
Allows for increased sample precision and reduces introduced statistical errors. The sampling precision can also be tweaked. CHECKSUM only returns an int value.
Does not use ORDER BY NEWID(), as ordering can become a significant bottleneck with large input sets. Avoiding the sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Cons / limitations:
Slightly slower execution times than using CHECKSUM(*). Using HASHBYTES, as shown below, adds about 3/4 of a second of overhead per million rows. This is with my data, on my database instance: YMMV. This overhead can be eliminated if using a persisted computed column of the resulting 'well distributed' bigint value from HASHBYTES.
Unlike the basic SELECT TOP n .. ORDER BY NEWID(), this is not guaranteed to return "exactly N" rows. Instead, it returns a percentage of rows, where that percentage is pre-determined. For very small sample sizes this could result in 0 rows selected. This limitation is shared with the CHECKSUM(*) approaches.
Here is the gist:
-- Allow a sampling precision [0, 100.0000].
declare @sample_percent decimal(7, 4) = 12.3456
select
    t.*
from t
where 1=1
    and t.Name = 'Mr. No Questionable Checksum Usages'
    and ( -- sample
        @sample_percent = 100
        or abs(
            -- Choose appropriate identity column(s) for hashbytes input.
            -- For demonstration it is assumed to be a UNIQUEIDENTIFIER rowguid column.
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
Notes:
While SHA1 is technically deprecated since SQL Server 2016, it is both sufficient for the task and is slightly faster than either MD5 or SHA2_256. Use a different hashing function as relevant. If the table already contains a hashed column (with a good distribution), that could potentially be used as well.
Conversion to bigint is critical, as it allows 2^63 values of 'random space' to which to apply the modulus operator; this is much more than the 2^31 range from the CHECKSUM result. This reduces the modulus error at the limit, especially as the precision is increased.
The sampling precision can be changed as long as the modulus operand and sample percent are multiplied appropriately. In this case, that is 1000 * to account for the 4 digits of precision allowed in @sample_percent.
Can multiply the bigint value by RAND() to return a different row sample each run. This effectively changes the permutation of the fixed hash values.
If @sample_percent is 100, the query planner can eliminate the slower calculation code entirely. Remember 'parameter sniffing' rules. This allows the code to be left in the query regardless of enabling sampling.
Computing @sample_percent with lower/upper limits, and adding a TOP "hint" to the query, might be useful when the sample is used in a derived table context.
-- Approximate max-sample and min-sample ranges.
-- The minimum sample percent should be non-zero within the precision.
declare @max_sample_size int = 3333333
declare @min_sample_percent decimal(7,4) = 0.3333
declare @sample_percent decimal(7,4) -- [0, 100.0000]
declare @sample_size int

-- Get initial count for determining sample percentages.
-- Remember to match the filter conditions with the usage site!
declare @rows int
select @rows = count(1)
from t
where 1=1
    and t.Name = 'Mr. No Questionable Checksum Usages'

-- Calculate sample percent and back-calculate actual sample size.
if @rows <= @max_sample_size begin
    set @sample_percent = 100
end else begin
    set @sample_percent = convert(float, 100) * @max_sample_size / @rows
    if @sample_percent < @min_sample_percent
        set @sample_percent = @min_sample_percent
end
set @sample_size = ceiling(@rows * @sample_percent / 100)

select *
from ..
join (
    -- Not a precise value: if limiting exactly at, can introduce more bias.
    -- Using 'option optimize for' avoids this while requiring dynamic SQL.
    select top (@sample_size + convert(int, @sample_percent + 5))
        t.*
    from t
    where 1=1
        and t.Name = 'Mr. No Questionable Checksum Usages'
        and ( -- sample
            @sample_percent = 100
            or abs(
                convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
            ) % (1000 * 100) < (1000 * @sample_percent)
        )
) sampled
on ..

It appears newid() can't be used in the WHERE clause, so this solution requires an inner query:
SELECT *
FROM (
SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
FROM MyTable
) vw
WHERE Rnd % 100 < 10 --10%

I was using it in a subquery, and it returned the same rows in the subquery:
SELECT ID ,
( SELECT TOP 1
ImageURL
FROM SubTable
ORDER BY NEWID()
) AS ImageURL,
GETUTCDATE() ,
1
FROM Mytable
Then I solved it by including a reference to the parent table in the WHERE clause:
SELECT ID ,
( SELECT TOP 1
ImageURL
FROM SubTable
Where Mytable.ID>0
ORDER BY NEWID()
) AS ImageURL,
GETUTCDATE() ,
1
FROM Mytable
Note the WHERE condition.

The server-side processing language in use (e.g. PHP, .NET, etc.) isn't specified, but if it's PHP, grab the required number of records (or all of them) and, instead of randomizing in the query, use PHP's shuffle function. I don't know whether .NET has an equivalent function, but if it does, use that if you're using .NET.
ORDER BY RAND() can have quite a performance penalty, depending on how many records are involved.

-- to select 55 percent of rows randomly
select * from table
where id in (
    select id from table
    order by random()
    limit ((select count(*) from table) * 55 / 100)
)

If you know you have approximately N rows and you want approximately K random rows, you just need to pull any given row with a chance K/N. Using the RAND() function which gives you a fair distribution between 0 and 1, you could just do the following where PROB = K/N. Worked very quickly for me.
SELECT * FROM some_table WHERE RAND() < PROB

This works for me:
SELECT * FROM table_name
ORDER BY RANDOM()
LIMIT [number]

Related

SQL limit with spread

Is there a way to limit a selection like with LIMIT, but instead of returning the limitation with an offset, limit with a "spread".
So for instance if a select returns 1000 rows and I limit it to 100, then I get every 10th row from start to end.
I know this would require the full SELECT to be performed since the RDBMS would need to go through all rows to do so. But instead of, for instance, returning 100000 rows when I need every 100th row, there would be a lot less transfer and the work could be done at the RDBMS.
I am requiring this on a PostgreSQL DB.
There is no built-in syntax to do that in connection with LIMIT / OFFSET (nor with standard-SQL FETCH { FIRST | NEXT } [ count ] { ROW | ROWS } { ONLY | WITH TIES }).
You can achieve your goal with the modulo operator %:
SELECT *
FROM (
SELECT row_number() OVER () AS rn, ... original SELECT list
FROM ... -- original query
) sub
WHERE rn%10 = 0 -- every 10th row
Since there is no ORDER BY in the window definition, row numbers are assigned according to the ORDER BY of the query.
If there is no ORDER BY at all, you get an arbitrary order of rows. That's still some kind of order, the result is just out of your hands.
You can apply that kind of filter on an individual table with the TABLESAMPLE syntax.
SELECT * FROM tbl TABLESAMPLE SYSTEM (10); -- roughly 10 %
Or:
SELECT * FROM tbl TABLESAMPLE BERNOULLI (10); -- roughly 10 %
SYSTEM is faster, BERNOULLI is more random.
You could even apply the TABLESAMPLE filter on multiple tables in the same query, like:
SELECT *
FROM tbl1 TABLESAMPLE SYSTEM (10)
JOIN tbl2 TABLESAMPLE BERNOULLI (10) USING (big_id);
But the number of resulting rows can vary wildly. To get a given number of rows, consider the additional module tsm_system_rows instead. See:
Fast way to discover the row count of a table in PostgreSQL
Best way to select random rows PostgreSQL
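For example, with the tsm_system_rows module mentioned above (a sketch; the extension ships with PostgreSQL 9.5+ but must be installed in the database):
CREATE EXTENSION IF NOT EXISTS tsm_system_rows;
SELECT * FROM tbl TABLESAMPLE SYSTEM_ROWS (100);  -- exactly 100 rows, but page-clustered like SYSTEM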

Performant stable random sort in SQL Server?

I want to query a rather large table (millions of rows), providing a seed value, in a manner that will guarantee a random order - but one that remains stable across multiple queries as long as the same seed is used.
The best I've come up with so far is
SELECT TOP n *
FROM tbl t
ORDER BY t.int_column % seed, t.int_column
Is this a usable approach, both from a performance standpoint and a somewhat-uniform distribution of result rows over different seeds?
Edit:
For context, the stable sort is needed because of multiple - possibly nested - WHERE NOT IN queries that operate on the same data set; e.g.
SELECT *
FROM tbl t
WHERE t.some_criteria = 'some_value'
AND t.id NOT IN
(
SELECT TOP n t.id
FROM tbl t
WHERE t.some_other_criteria = 'some_other_value'
ORDER BY t.int_column % seed, t.int_column
)
AND t.id NOT IN
(
-- etc.
)
When the order of the subselects is random, but not stable (i.e. NEWID(), TABLESAMPLE()), the result rows fluctuate wildly between executions.
If you want random-looking ordering, you can do this with HASHBYTES and some piece of data from the row you're selecting.
SELECT TOP 100 *
FROM tbl t
ORDER BY HASHBYTES('SHA1', CONCAT(STR(t.int_column), 'seed string'))
Now, the performance of this is a big question. Modern CPUs do SHA1 very quickly, so this might be good enough for your needs.
If you care more about performance and less about "good randomness," you could drop in a simple linear congruential generator as the transformation function:
SET ARITHABORT OFF;
SET ARITHIGNORE ON;
SET ANSI_WARNINGS OFF;
SELECT TOP 100 *
FROM tbl t
ORDER BY ((t.int_column + seed_number) * 1103515245 + 12345)
This will be faster, but less random.
Just a thought... You could add a "RandomSort" column to your table. That way the sort order will be truly random but will remain repeatable until you update the table with new values. Something along these lines...
ALTER TABLE dbo.MyTable ADD RandomSort INT NOT NULL
CONSTRAINT df_MyTable_RandomSort DEFAULT(0);
UPDATE mt SET
mt.RandomSort = ABS(CHECKSUM(NEWID())) % 100000 + 1
FROM
dbo.MyTable mt;
SELECT
    *
FROM
    dbo.MyTable mt
ORDER BY
    mt.RandomSort;
If the situation warrants it, you could even add a covering, nonclustered index to eliminate the sort operation.
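For example, something along these lines (a sketch; the INCLUDE list would be whatever columns the query actually reads):
CREATE NONCLUSTERED INDEX IX_MyTable_RandomSort
ON dbo.MyTable (RandomSort)
INCLUDE (SomeValue);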

Select random row(s) in SQLite

In MySQL, you can select X random rows with the following statement:
SELECT * FROM table ORDER BY RAND() LIMIT X
This does not, however, work in SQLite. Is there an equivalent?
For a much better performance use:
SELECT * FROM table WHERE id IN (SELECT id FROM table ORDER BY RANDOM() LIMIT x)
SQL engines first load the projected fields of rows into memory and then sort them. Here we just do a random sort on the id field of each row, which is in memory because it's indexed, then take X of them, and find the whole rows using those X ids.
So this consumes less RAM and CPU as the table grows!
SELECT * FROM table ORDER BY RANDOM() LIMIT X
SELECT * FROM table ORDER BY RANDOM() LIMIT 1
All answers here are based on ORDER BY. This is very inefficient (i.e. unusable) for large sets because you will evaluate RANDOM() for each record, and then ORDER BY which is a resource expensive operation.
Another approach is to place abs(CAST(random() AS REAL))/9223372036854775808 < 0.5 in the WHERE clause, to get (in this case, for example) a 0.5 hit chance.
SELECT *
FROM table
WHERE abs(CAST(random() AS REAL))/9223372036854775808 < 0.5
The large number is the maximum absolute number that random() can produce. The abs() is because it is signed. Result is a uniformly distributed random variable between 0 and 1.
This has its drawbacks. You can not guarantee a result and if the threshold is large compared to the table, the selected data will be skewed towards the start of the table. But in some carefully designed situations, it can be a feasible option.
This one solves the negative RANDOM integers, and keeps good performance on large datasets:
SELECT * FROM table LIMIT 1 OFFSET abs(random() % (select count(*) from table));
where:
abs(random() % n) gives you a positive integer in the range [0, n).
The accepted answer works, but requires a full table scan per query. This will get slower and slower as your table grows large, making it risky for queries that are triggered by end-users.
The following solution takes advantage of indexes to run in O(log(N)) time.
SELECT * FROM table
WHERE rowid > (
ABS(RANDOM()) % (SELECT max(rowid) FROM table)
)
LIMIT 1;
To break it down
SELECT max(rowid) FROM table - Returns the largest valid rowid for the table. SQLite is able to use the index on rowid to run this efficiently.
ABS(RANDOM()) % ... - Returns a random number between 0 and max(rowid) - 1. SQLite's random function generates a number between -9223372036854775808 and +9223372036854775807. The ABS makes sure it's positive, and the modulo operator keeps it below max(rowid).
rowid > ... - Rather than using =, use > in case the random number generated corresponds to a deleted row. Using strictly greater than ensures that we return a row with a rowid between 1 (greater than 0) and max(rowid) (greater than max(rowid) - 1). SQLite uses the primary key index to efficiently return this result as well.
This also works for queries with WHERE clauses. Apply the WHERE clause to both the output and the SELECT max(rowid) subquery. I'm not sure which conditions this will run efficiently, however.
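For example, a sketch with an illustrative WHERE clause applied in both places (the category filter is hypothetical):
SELECT * FROM table
WHERE category = 'fiction'       -- hypothetical filter
  AND rowid > (
    ABS(RANDOM()) % (SELECT max(rowid) FROM table WHERE category = 'fiction')
  )
LIMIT 1;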
Note: This was derived from an answer in a similar question.

How do I return random numbers as a column in SQL Server 2005?

I'm running a SQL query on SQL Server 2005, and in addition to 2 columns being queried from the database, I'd also like to return 1 column of random numbers along with them. I tried this:
select column1, column2, floor(rand() * 10000) as column3
from table1
Which kinda works, but the problem is that this query returns the same random number on every row. It's a different number each time you run the query, but it doesn't vary from row to row. How can I do this and get a new random number for each row?
I realize this is an older post... but you don't need a view.
select column1, column2,
ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % 10000 as column3
from table1
WARNING
Adam's answer involving the view is very inefficient, and for very large sets it can take out your database for quite a while. I would strongly recommend against using it on a regular basis or in situations where you need to populate large tables in production.
Instead you could use this answer.
Proof:
CREATE VIEW vRandNumber
AS
SELECT RAND() as RandNumber
go
CREATE FUNCTION RandNumber()
RETURNS float
AS
BEGIN
RETURN (SELECT RandNumber FROM vRandNumber)
END
go
create table bigtable(i int)
go
insert into bigtable
select top 100000 1 from sysobjects a
join sysobjects b on 1=1
go
select cast(dbo.RandNumber() * 10000 as integer) as r into #t from bigtable
-- CPU (1607) READS (204639) DURATION (1551)
go
select ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % 10000 as r into #t1
from bigtable
-- Runs 15 times faster - CPU (78) READS (809) DURATION (99)
Profiler trace:
http://img519.imageshack.us/img519/8425/destroydbxu9.png
This is proof that the output is random enough for numbers between 0 and 9999:
-- proof that stuff is random enough
select avg(r) from #t
-- 5004
select STDEV(r) from #t
-- 2895.1999
select avg(r) from #t1
-- 4992
select STDEV(r) from #t1
-- 2881.44
select r,count(r) from #t
group by r
-- 10000 rows returned
select r,count(r) from #t1
group by r
-- 10000 rows returned
Adam's answer works really well, so I marked it as accepted. While I was waiting for an answer though, I also found this blog entry with a few other (slightly less random) methods. Kaboing's method was among them.
http://blog.sqlauthority.com/2007/04/29/sql-server-random-number-generator-script-sql-query/
select RAND(CHECKSUM(NEWID()))
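Applied to the original query, that per-row seeding trick might look like this (a sketch using the question's column names):
select column1, column2, floor(RAND(CHECKSUM(NEWID())) * 10000) as column3
from table1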
You need to use a UDF
first:
CREATE VIEW vRandNumber
AS
SELECT RAND() as RandNumber
second:
CREATE FUNCTION RandNumber()
RETURNS float
AS
BEGIN
RETURN (SELECT RandNumber FROM vRandNumber)
END
test:
SELECT dbo.RandNumber(), *
FROM <table>
Above borrowed from Jeff's SQL Server Blog
For SQL Server, there are a couple of options:
1. A while loop to update an empty column with one random number at a time (a sketch follows below)
2. A .NET assembly that contains a function that returns a random number
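A minimal sketch of option 1, assuming a table with an int primary key ID and a RandomNumber column to fill (both names are illustrative):
DECLARE @id int;
SELECT @id = MIN(ID) FROM dbo.MyTable;
WHILE @id IS NOT NULL
BEGIN
    -- RAND() is re-evaluated on each iteration, so every row gets its own value
    UPDATE dbo.MyTable SET RandomNumber = RAND() WHERE ID = @id;
    SELECT @id = MIN(ID) FROM dbo.MyTable WHERE ID > @id;
END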
You might like to consider generating a UUID instead of a random number, using the NEWID function. These are guaranteed to be unique each time they are generated, whereas there is a significant chance that some duplication will occur with a straightforward random number (and depending on what you're using it for, that could give you a phenomenally hard-to-debug error at a later point).
newid() I believe is very resource intensive. I recall trying that method on a table of a few million records, and the performance wasn't nearly as good as rand().
According to my testing, the answer above never generates a value of 10000. This probably isn't much of a problem when you are generating a random number between 1 and 10000, but the same algorithm between 1 and 5 would be noticeable. Add 1 to your mod.
This snippet seems to provide a reasonable substitute for rand() in that it returns a float between 0.0 and 1.0. It uses only the last 3 bytes provided by newid() so total randomness may be slightly different than the conversion to VARBINARY then INT then modding from the recommended answer. Have not had a chance to test relative performance but seems fast enough (and random enough) for my purposes.
SELECT CAST(SubString(CONVERT(binary(16), newid()), 14, 3) AS INT) / 16777216.0 AS R
I use C# for dealing with random numbers. It's much cleaner. I have a function I use to return a list of random numbers and a unique key, then I just join the unique key on the row number. Because I use C#, I can easily specify a range within which the random numbers must fall.
Here are the steps to making the function:
http://www.sqlwithcindy.com/2013/04/elegant-random-number-list-in-sql-server.html
Here is what my query ends up looking like:
SELECT
rowNumber,
name,
randomNumber
FROM dbo.tvfRandomNumberList(1,10,100)
INNER JOIN (
    SELECT ROW_NUMBER() OVER (ORDER BY int_id) AS rowNumber, name
    FROM client
) AS clients
    ON clients.rowNumber = uniqueKey
Query:
select column1, column2, cast(newid() as varchar(36)) as column3
from table1

How to request a random row in SQL?

How can I request a random row (or as close to truly random as is possible) in pure SQL?
See this post: SQL to Select a random row from a database table. It goes through methods for doing this in MySQL, PostgreSQL, Microsoft SQL Server, IBM DB2 and Oracle (the following is copied from that link):
Select a random row with MySQL:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
Select a random row with PostgreSQL:
SELECT column FROM table
ORDER BY RANDOM()
LIMIT 1
Select a random row with Microsoft SQL Server:
SELECT TOP 1 column FROM table
ORDER BY NEWID()
Select a random row with IBM DB2
SELECT column, RAND() as IDX
FROM table
ORDER BY IDX FETCH FIRST 1 ROWS ONLY
Select a random record with Oracle:
SELECT column FROM
( SELECT column FROM table
ORDER BY dbms_random.value )
WHERE rownum = 1
Solutions like Jeremies:
SELECT * FROM table ORDER BY RAND() LIMIT 1
work, but they need a sequential scan of the whole table (because the random value associated with each row needs to be calculated - so that the smallest one can be determined), which can be quite slow for even medium-sized tables. My recommendation would be to use some kind of indexed numeric column (many tables have these as their primary keys), and then write something like:
SELECT * FROM table WHERE num_value >= RAND() *
( SELECT MAX (num_value ) FROM table )
ORDER BY num_value LIMIT 1
This works in logarithmic time, regardless of the table size, if num_value is indexed. One caveat: this assumes that num_value is equally distributed in the range 0..MAX(num_value). If your dataset strongly deviates from this assumption, you will get skewed results (some rows will appear more often than others).
I don't know how efficient this is, but I've used it before:
SELECT TOP 1 * FROM MyTable ORDER BY newid()
Because GUIDs are pretty random, the ordering means you get a random row.
ORDER BY NEWID()
takes 7.4 milliseconds
WHERE num_value >= RAND() * (SELECT MAX(num_value) FROM table)
takes 0.0065 milliseconds!
I will definitely go with latter method.
You didn't say which server you're using. In older versions of SQL Server, you can use this:
select top 1 * from mytable order by newid()
In SQL Server 2005 and up, you can use TABLESAMPLE to get a random sample that's repeatable:
SELECT FirstName, LastName
FROM Contact
TABLESAMPLE (1 ROWS) ;
If possible, use prepared statements to avoid the inefficiency of both indexes on RAND() and creating a record-number field.
PREPARE RandomRecord FROM "SELECT * FROM table LIMIT ?,1";
SET @n = FLOOR(RAND() * (SELECT COUNT(*) FROM table));
EXECUTE RandomRecord USING @n;
Best way is putting a random value in a new column just for that purpose, and using something like this (pseudocode + SQL):
randomNo = random()
execSql("SELECT TOP 1 * FROM MyTable WHERE MyTable.Randomness > $randomNo")
This is the solution employed by the MediaWiki code. Of course, there is some bias against smaller values, but they found that it was sufficient to wrap the random value around to zero when no rows are fetched.
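A sketch of that wrap-around in T-SQL, assuming a persisted Randomness column holding values between 0 and 1:
DECLARE @randomNo float = RAND();
SELECT TOP 1 * FROM MyTable WHERE Randomness > @randomNo ORDER BY Randomness;
IF @@ROWCOUNT = 0
    -- nothing above the random value: wrap around to the smallest Randomness
    SELECT TOP 1 * FROM MyTable WHERE Randomness >= 0 ORDER BY Randomness;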
The newid() solution may require a full table scan so that each row can be assigned a new guid, which will be much less performant.
The rand() solution may not work at all (e.g. with MSSQL) because the function will be evaluated just once, and every row will be assigned the same "random" number.
For SQL Server 2005 and 2008, if we want a random sample of individual rows (from Books Online):
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
I'm late, but I got here via Google, so for the sake of posterity I'll add an alternative solution.
Another approach is to use TOP twice, with alternating orders. I don't know if it is "pure SQL", because it uses a variable in the TOP, but it works in SQL Server 2008. Here's an example I use against a table of dictionary words, if I want a random word.
SELECT TOP 1
    word
FROM (
    SELECT TOP (@idx)
        word
    FROM
        dbo.DictionaryAbridged WITH(NOLOCK)
    ORDER BY
        word DESC
) AS D
ORDER BY
    word ASC
Of course, @idx is some randomly-generated integer that ranges from 1 to COUNT(*) on the target table, inclusive. If your column is indexed, you'll benefit from it too. Another advantage is that you can use it in a function, since NEWID() is disallowed there.
Lastly, the above query runs in about 1/10 of the execution time of a NEWID()-type query on the same table. YMMV.
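Outside of a function, @idx could be generated like this (a sketch; inside a function you would need to pass it in, since NEWID() is disallowed there):
DECLARE @cnt int = (SELECT COUNT(*) FROM dbo.DictionaryAbridged);
DECLARE @idx int = 1 + ABS(CHECKSUM(NEWID())) % @cnt;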
Instead of using RAND(), as it is not encouraged, you may simply get the max ID (=Max):
SELECT MAX(ID) FROM TABLE;
get a random number between 1..Max (=My_Generated_Random):
My_Generated_Random = rand_in_your_programming_lang_function(1..Max);
and then run this SQL:
SELECT ID FROM TABLE WHERE ID >= My_Generated_Random ORDER BY ID LIMIT 1
Note that it will check for any rows whose IDs are EQUAL to or HIGHER than the chosen value.
It's also possible to hunt for the row downwards in the table, and get an ID equal to or lower than My_Generated_Random, by modifying the query like this:
SELECT ID FROM TABLE WHERE ID <= My_Generated_Random ORDER BY ID DESC LIMIT 1
As pointed out in @BillKarwin's comment on @cnu's answer...
When combining with a LIMIT, I've found that it performs much better (at least with PostgreSQL 9.1) to JOIN with a random ordering rather than to directly order the actual rows: e.g.
SELECT * FROM tbl_post AS t
JOIN ...
JOIN ( SELECT id, CAST(-2147483648 * RANDOM() AS integer) AS rand
FROM tbl_post
WHERE create_time >= 1349928000
) r ON r.id = t.id
WHERE create_time >= 1349928000 AND ...
ORDER BY r.rand
LIMIT 100
Just make sure that the 'r' generates a 'rand' value for every possible key value in the complex query which is joined with it but still limit the number of rows of 'r' where possible.
The CAST as Integer is especially helpful for PostgreSQL 9.2 which has specific sort optimisation for integer and single precision floating types.
For MySQL, to get a random record:
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1
More detail http://jan.kneschke.de/projects/mysql/order-by-rand/
With SQL Server 2012+ you can use the OFFSET FETCH query to do this for a single random row
select * from MyTable ORDER BY id OFFSET n ROW FETCH NEXT 1 ROWS ONLY
where id is an identity column, and n is the row you want - calculated as a random number between 0 and count()-1 of the table (offset 0 is the first row after all)
This works with holes in the table data, as long as you have an index to work with for the ORDER BY clause. It's also very good for randomness, as you work that out yourself to pass in, and the niggles of other methods are not present. In addition, the performance is pretty good; on a smaller dataset it holds up well, though I've not tried serious performance tests against several million rows.
The SQL random function can help here. If you would like to limit the result to just one row, just add that at the end:
SELECT column FROM table
ORDER BY RAND()
LIMIT 1
For SQL Server and needing "a single random row"..
If not needing a true sampling, generate a random value [0, max_rows) and use the ORDER BY..OFFSET..FETCH from SQL Server 2012+.
This is very fast if the COUNT and ORDER BY are over appropriate indexes - such that the data is 'already sorted' along the query lines. If these operations are covered it's a quick request and does not suffer from the horrid scalability of using ORDER BY NEWID() or similar. Obviously, this approach won't scale well on a non-indexed HEAP table.
declare @rows int
select @rows = count(1) from t
-- Other issues if row counts in the bigint range..
-- This is also not 'true random', although such is likely not required.
declare @skip int = convert(int, @rows * rand())
select t.*
from t
order by t.id -- Make sure this is clustered PK or IX/UCL axis!
offset (@skip) rows
fetch first 1 row only
Make sure that the appropriate transaction isolation levels are used and/or account for 0 results.
For SQL Server and needing a "general row sample" approach..
Note: This is an adaptation of the answer as found on a SQL Server specific question about fetching a sample of rows. It has been tailored for context.
While a general sampling approach should be used with caution here, it's still potentially useful information in context of other answers (and the repetitious suggestions of non-scaling and/or questionable implementations). Such a sampling approach is less efficient than the first code shown and is error-prone if the goal is to find a "single random row".
Here is an updated and improved form of sampling a percentage of rows. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID() can never be stable/repeatable.
Does not use ORDER BY NEWID() of the entire input set, as ordering can become a significant bottleneck with large input sets. Avoiding unnecessary sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Here is the gist. See this answer for additional details and notes.
Naïve try:
declare @sample_percent decimal(7, 4)
-- Looking at this value should be an indicator of why a
-- general sampling approach can be error-prone to select 1 row.
select @sample_percent = 100.0 / count(1) from t
-- BAD!
-- When choosing appropriate sample percent of "approximately 1 row"
-- it is very reasonable to expect 0 rows, which definitely fails the ask!
-- If choosing a larger sample size the distribution is heavily skewed forward,
-- and is very much NOT 'true random'.
select top 1
    t.*
from t
where 1=1
    and ( -- sample
        @sample_percent = 100
        or abs(
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
This can be largely remedied by a hybrid query, by mixing sampling and ORDER BY selection from the much smaller sample set. This limits the sorting operation to the sample size, not the size of the original table.
-- Sample "approximately 1000 rows" from the table,
-- dealing with some edge-cases.
declare #rows int
select #rows = count(1) from t
declare #sample_size int = 1000
declare #sample_percent decimal(7, 4) = case
when #rows <= 1000 then 100 -- not enough rows
when (100.0 * #sample_size / #rows) < 0.0001 then 0.0001 -- min sample percent
else 100.0 * #sample_size / #rows -- everything else
end
-- There is a statistical "guarantee" of having sampled a limited-yet-non-zero number of rows.
-- The limited rows are then sorted randomly before the first is selected.
select top 1
t.*
from t
where 1=1
and ( -- sample
#sample_percent = 100
or abs(
convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
) % (1000 * 100) < (1000 * #sample_percent)
)
-- ONLY the sampled rows are ordered, which improves scalability.
order by newid()
SELECT * FROM table ORDER BY RAND() LIMIT 1
Most of the solutions here aim to avoid sorting, but they still need to make a sequential scan over a table.
There is also a way to avoid the sequential scan by switching to index scan. If you know the index value of your random row you can get the result almost instantially. The problem is - how to guess an index value.
The following solution works on PostgreSQL 8.4:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
limit 1;
In the above solution you guess 10 random index values from the range 0 .. [last value of id].
The number 10 is arbitrary - you may use 100 or 1000 as it (amazingly) doesn't have a big impact on the response time.
There is also one problem - if you have sparse ids you might miss. The solution is to have a backup plan :) In this case, a plain old ORDER BY random() query. When combined, it looks like this:
explain analyze select * from cms_refs where rec_id in
(select (random()*(select last_value from cms_refs_rec_id_seq))::bigint
from generate_series(1,10))
union all (select * from cms_refs order by random() limit 1)
limit 1;
Note the UNION ALL clause. In this case, if the first part returns any data, the second one is NEVER executed!
You may also try using the newid() function.
Just write your query and use ORDER BY newid(). It's quite random.
In MSSQL (tested on 11.0.5569) using
SELECT TOP 100 * FROM employee ORDER BY CRYPT_GEN_RANDOM(10)
is significantly faster than
SELECT TOP 100 * FROM employee ORDER BY NEWID()
For Firebird:
Select FIRST 1 column from table ORDER BY RAND()
In SQL Server you can combine TABLESAMPLE with NEWID() to get pretty good randomness and still have speed. This is especially useful if you really only want 1, or a small number, of rows.
SELECT TOP 1 * FROM [table]
TABLESAMPLE (500 ROWS)
ORDER BY NEWID()
I have to agree with CD-MaN: Using "ORDER BY RAND()" will work nicely for small tables or when you do your SELECT only a few times.
I also use the "num_value >= RAND() * ..." technique, and if I really want to have random results I have a special "random" column in the table that I update once a day or so. That single UPDATE run will take some time (especially because you'll have to have an index on that column), but it's much faster than creating random numbers for every row each time the select is run.
Be careful, because TableSample doesn't actually return a random sample of rows. It directs your query to look at a random sample of the 8KB pages that make up your table. Then, your query is executed against the data contained in these pages. Because of how data may be grouped on these pages (insertion order, etc.), this could lead to data that isn't actually a random sample.
See: http://www.mssqltips.com/tip.asp?tip=1308
This MSDN page for TableSample includes an example of how to generate an actually random sample of data.
http://msdn.microsoft.com/en-us/library/ms189108.aspx
It seems that many of the ideas listed still use ordering.
However, if you use a temporary table, you are able to assign a random index (like many of the solutions have suggested), and then grab the first one that is greater than an arbitrary number between 0 and 1.
For example (for DB2):
WITH TEMP AS (
    SELECT COLUMN, RAND() AS IDX FROM TABLE)
SELECT COLUMN FROM TEMP WHERE IDX > .5
FETCH FIRST 1 ROW ONLY
A simple and efficient way from http://akinas.com/pages/en/blog/mysql_random_row/
SET @i = (SELECT FLOOR(RAND() * COUNT(*)) FROM table);
PREPARE get_stmt FROM 'SELECT * FROM table LIMIT ?, 1';
EXECUTE get_stmt USING @i;
There is a better solution for Oracle than using dbms_random.value, since ordering rows by dbms_random.value requires a full scan and is quite slow for large tables.
Use this instead:
SELECT *
FROM employee sample(1)
WHERE rownum=1
For SQL Server 2005 and above, extending @GreyPanther's answer for the cases when num_value does not have continuous values. This also works for datasets that are not evenly distributed and for cases when num_value is not a number but a unique identifier.
WITH CTE_Table (SelRow, num_value)
AS
(
SELECT ROW_NUMBER() OVER(ORDER BY ID) AS SelRow, num_value FROM table
)
SELECT * FROM table Where num_value = (
SELECT TOP 1 num_value FROM CTE_Table WHERE SelRow >= RAND() * (SELECT MAX(SelRow) FROM CTE_Table)
)
select r.id, r.name from table AS r
INNER JOIN(select CEIL(RAND() * (select MAX(id) from table)) as id) as r1
ON r.id >= r1.id ORDER BY r.id ASC LIMIT 1
This requires less computation time.