Select 15 elements from 1 million rows of data in a database [duplicate] - sql

I've got a SQL Server table with about 50,000 rows in it. I want to select about 5,000 of those rows at random. I've thought of a complicated way, creating a temp table with a "random number" column, copying my table into that, looping through the temp table and updating each row with RAND(), and then selecting from that table where the random number column < 0.1. I'm looking for a simpler way to do it, in a single statement if possible.
This article suggests using the NEWID() function. That looks promising, but I can't see how I could reliably select a certain percentage of rows.
Anybody ever do this before? Any ideas?
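For reference, a rough sketch of the temp-table approach described above (table and column names are placeholders). Note that plain RAND() is evaluated once per statement in SQL Server, so a per-row seed such as CHECKSUM(NEWID()) is used here instead of the row-by-row loop:
-- Copy the rows plus a placeholder random column into a temp table.
SELECT *, CAST(0 AS float) AS rnd
INTO #randomized
FROM dbo.MyTable;

-- Assign a per-row random value in a single pass.
UPDATE #randomized
SET rnd = RAND(CHECKSUM(NEWID()));

-- Keep roughly 10 percent of the rows.
SELECT *
FROM #randomized
WHERE rnd < 0.1;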

select top 10 percent * from [yourtable] order by newid()
In response to the "pure trash" comment concerning large tables: you could do it like this to improve performance.
select * from [yourtable] where [yourPk] in
(select top 10 percent [yourPk] from [yourtable] order by newid())
The cost of this will be the key scan of values plus the join cost, which on a large table with a small percentage selection should be reasonable.

Depending on your needs, TABLESAMPLE will get you a nearly as random result with better performance.
It is available in MS SQL Server 2005 and later.
TABLESAMPLE will return data from random pages instead of random rows and therefore does not even retrieve data that it will not return.
On a very large table I tested
select top 1 percent * from [tablename] order by newid()
took more than 20 minutes.
select * from [tablename] tablesample(1 percent)
took 2 minutes.
With TABLESAMPLE, performance also improves as the sample gets smaller, whereas it does not with newid().
Please keep in mind that this is not as random as the newid() method but will give you a decent sampling.
See the MSDN page.
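For reference, TABLESAMPLE also accepts an absolute row target and a REPEATABLE seed; a small sketch (the table name is a placeholder):
-- Roughly 5000 rows taken from random pages; REPEATABLE returns the same
-- pages again for the same seed as long as the data has not changed.
select * from [tablename] tablesample (5000 rows) repeatable (42)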

newid()/order by will work, but will be very expensive for large result sets because it has to generate an id for every row, and then sort them.
TABLESAMPLE() is good from a performance standpoint, but you will get clumping of results (all rows on a page will be returned).
For a better performing true random sample, the best way is to filter out rows randomly. I found the following code sample in the SQL Server Books Online article Limiting Results Sets by Using TABLESAMPLE:
If you really want a random sample of individual rows, modify your query to filter out rows randomly, instead of using TABLESAMPLE. For example, the following query uses the NEWID function to return approximately one percent of the rows of the Sales.SalesOrderDetail table:
SELECT * FROM Sales.SalesOrderDetail
WHERE 0.01 >= CAST(CHECKSUM(NEWID(),SalesOrderID) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
The SalesOrderID column is included in the CHECKSUM expression so that NEWID() evaluates once per row to achieve sampling on a per-row basis. The expression CAST(CHECKSUM(NEWID(), SalesOrderID) & 0x7fffffff AS float) / CAST(0x7fffffff AS int) evaluates to a random float value between 0 and 1.
When run against a table with 1,000,000 rows, here are my results:
SET STATISTICS TIME ON
SET STATISTICS IO ON
/* newid()
rows returned: 10000
logical reads: 3359
CPU time: 3312 ms
elapsed time: 3359 ms
*/
SELECT TOP 1 PERCENT Number
FROM Numbers
ORDER BY newid()
/* TABLESAMPLE
rows returned: 9269 (varies)
logical reads: 32
CPU time: 0 ms
elapsed time: 5 ms
*/
SELECT Number
FROM Numbers
TABLESAMPLE (1 PERCENT)
/* Filter
rows returned: 9994 (varies)
logical reads: 3359
CPU time: 641 ms
elapsed time: 627 ms
*/
SELECT Number
FROM Numbers
WHERE 0.01 >= CAST(CHECKSUM(NEWID(), Number) & 0x7fffffff AS float)
/ CAST (0x7fffffff AS int)
SET STATISTICS IO OFF
SET STATISTICS TIME OFF
If you can get away with using TABLESAMPLE, it will give you the best performance. Otherwise use the newid()/filter method. newid()/order by should be a last resort if you have a large result set.

Selecting Rows Randomly from a Large Table on MSDN has a simple, well-articulated solution that addresses the large-scale performance concerns.
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10

This link has an interesting comparison between ORDER BY NEWID() and other methods for tables with 1, 7, and 13 million rows.
Often, when questions about how to select random rows are asked in discussion groups, the NEWID query is proposed; it is simple and works very well for small tables.
SELECT TOP 10 PERCENT *
FROM Table1
ORDER BY NEWID()
However, the NEWID query has a big drawback when you use it for large tables. The ORDER BY clause causes all of the rows in the table to be copied into the tempdb database, where they are sorted. This causes two problems:
The sorting operation usually has a high cost associated with it: sorting can use a lot of disk I/O and can run for a long time.
In the worst-case scenario, tempdb can run out of space. In the best-case scenario, tempdb can take up a large amount of disk space that never will be reclaimed without a manual shrink command.
What you need is a way to select rows randomly that will not use tempdb and will not get much slower as the table gets larger. Here is a new idea on how to do that:
SELECT * FROM Table1
WHERE (ABS(CAST(
(BINARY_CHECKSUM(*) *
RAND()) as int)) % 100) < 10
The basic idea behind this query is that we want to generate a random number between 0 and 99 for each row in the table, and then choose all of those rows whose random number is less than the value of the specified percent. In this example, we want approximately 10 percent of the rows selected randomly; therefore, we choose all of the rows whose random number is less than 10.
Please read the full article on MSDN.

Just order the table by a random number and obtain the first 5,000 rows using TOP.
SELECT TOP 5000 * FROM [Table] ORDER BY newid();
UPDATE
Just tried it and a newid() call is sufficient - no need for all the casts and all the math.

If you (unlike the OP) need a specific number of records (which makes the CHECKSUM approach difficult) and desire a more random sample than TABLESAMPLE provides by itself, and also want better speed than CHECKSUM, you may make do with a merger of the TABLESAMPLE and NEWID() methods, like this:
DECLARE @sampleCount int = 50
SET STATISTICS TIME ON
SELECT TOP (@sampleCount) *
FROM [yourtable] TABLESAMPLE(10 PERCENT)
ORDER BY NEWID()
SET STATISTICS TIME OFF
In my case this is the most straightforward compromise between randomness (it's not really, I know) and speed. Vary the TABLESAMPLE percentage (or rows) as appropriate - the higher the percentage, the more random the sample, but expect a linear drop off in speed. (Note that TABLESAMPLE will not accept a variable)
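Since TABLESAMPLE will not take a variable, one possible workaround (a sketch, not from the original answer; the percentage variable is an assumption) is to build the statement dynamically:
DECLARE @samplePercent decimal(5,2) = 10;
DECLARE @sql nvarchar(max) =
    N'SELECT TOP (@sampleCount) * FROM [yourtable] '
    + N'TABLESAMPLE (' + CONVERT(nvarchar(10), @samplePercent) + N' PERCENT) '
    + N'ORDER BY NEWID();';
-- The row-count limit can still be passed as a proper parameter.
EXEC sp_executesql @sql, N'@sampleCount int', @sampleCount = @sampleCount;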

This is a combination of the initial seed idea and a checksum, which looks to me to give properly random results without the cost of NEWID():
SELECT TOP [number] *
FROM table_name
ORDER BY RAND(CHECKSUM(*) * RAND())

In MySQL you can do this:
SELECT `PRIMARY_KEY`, rand() FROM table ORDER BY rand() LIMIT 5000;

Didn't quite see this variation in the answers yet. I had an additional constraint where I needed, given an initial seed, to select the same set of rows each time.
For MS SQL:
Minimum example:
select top 10 percent *
from table_name
order by rand(checksum(*))
Normalized execution time: 1.00
NewId() example:
select top 10 percent *
from table_name
order by newid()
Normalized execution time: 1.02
NewId() is only marginally slower than rand(checksum(*)), but even that margin may matter against large record sets.
Selection with Initial Seed:
declare @seed int
set @seed = Year(getdate()) * month(getdate()) /* any other initial seed here */
select top 10 percent *
from table_name
order by rand(checksum(*) % @seed) /* any other math function here */
If you need to select the same set given a seed, this seems to work.

Try this:
SELECT TOP 10 Field1, ..., FieldN
FROM Table1
ORDER BY NEWID()

Here is an updated and improved form of sampling. It is based on the same concept of some other answers that use CHECKSUM / BINARY_CHECKSUM and modulus.
Reasons to use an implementation similar to this one, as opposed to other answers:
It is relatively fast over huge data sets and can be efficiently used in/with derived queries. Millions of pre-filtered rows can be sampled in seconds with no tempdb usage and, if aligned with the rest of the query, the overhead is often minimal.
Does not suffer from CHECKSUM(*) / BINARY_CHECKSUM(*) issues with runs of data. When using the CHECKSUM(*) approach, the rows can be selected in "chunks" and not "random" at all! This is because CHECKSUM prefers speed over distribution.
Results in a stable/repeatable row selection and can be trivially changed to produce different rows on subsequent query executions. Approaches that use NEWID(), such as CHECKSUM(NEWID()) % 100, can never be stable/repeatable.
Allows for increased sampling precision, which reduces the introduced statistical error; the precision can be tweaked as needed. CHECKSUM only returns an int value.
Does not use ORDER BY NEWID(), as ordering can become a significant bottleneck with large input sets. Avoiding the sorting also reduces memory and tempdb usage.
Does not use TABLESAMPLE and thus works with a WHERE pre-filter.
Cons / limitations:
Slightly slower execution times than using CHECKSUM(*). Using hashbytes, as shown below, adds about 3/4 of a second of overhead per million rows. This is with my data, on my database instance: YMMV. This overhead can be eliminated by using a persisted computed column of the resulting 'well distributed' bigint value from HASHBYTES (a sketch of such a column follows the second code block below).
Unlike the basic SELECT TOP n .. ORDER BY NEWID(), this is not guaranteed to return "exactly N" rows. Instead, it returns a percentage of rows, where that percentage is pre-determined. For very small sample sizes this could result in 0 rows selected. This limitation is shared with the CHECKSUM(*) approaches.
Here is the gist:
-- Allow a sampling precision [0, 100.0000].
declare @sample_percent decimal(7, 4) = 12.3456

select
    t.*
from t
where 1=1
    and t.Name = 'Mr. No Questionable Checksum Usages'
    and ( -- sample
        @sample_percent = 100
        or abs(
            -- Choose appropriate identity column(s) for hashbytes input.
            -- For demonstration it is assumed to be a UNIQUEIDENTIFIER rowguid column.
            convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
        ) % (1000 * 100) < (1000 * @sample_percent)
    )
Notes:
While SHA1 is technically deprecated since SQL Server 2016, it is both sufficient for the task and is slightly faster than either MD5 or SHA2_256. Use a different hashing function as relevant. If the table already contains a hashed column (with a good distribution), that could potentially be used as well.
Conversion to bigint is critical as it allows 2^63 values of 'random space' to which to apply the modulus operator; this is much more than the 2^31 range from the CHECKSUM result. This reduces the modulus error at the limit, especially as the precision is increased.
The sampling precision can be changed as long as the modulus operand and sample percent are multiplied appropriately. In this case, that is 1000 * to account for the 4 digits of precision allowed in @sample_percent.
Can multiply the bigint value by RAND() to return a different row sample each run. This effectively changes the permutation of the fixed hash values.
If @sample_percent is 100 the query planner can eliminate the slower calculation code entirely. Remember 'parameter sniffing' rules. This allows the code to be left in the query regardless of enabling sampling.
Computing @sample_percent with lower/upper limits, and adding a TOP "hint" to the query, might be useful when the sample is used in a derived table context.
-- Approximate max-sample and min-sample ranges.
-- The minimum sample percent should be non-zero within the precision.
declare @max_sample_size int = 3333333
declare @min_sample_percent decimal(7,4) = 0.3333
declare @sample_percent decimal(7,4) -- [0, 100.0000]
declare @sample_size int

-- Get initial count for determining sample percentages.
-- Remember to match the filter conditions with the usage site!
declare @rows int
select @rows = count(1)
from t
where 1=1
    and t.Name = 'Mr. No Questionable Checksum Usages'

-- Calculate sample percent and back-calculate actual sample size.
if @rows <= @max_sample_size begin
    set @sample_percent = 100
end else begin
    set @sample_percent = convert(float, 100) * @max_sample_size / @rows
    if @sample_percent < @min_sample_percent
        set @sample_percent = @min_sample_percent
end
set @sample_size = ceiling(@rows * @sample_percent / 100)

select *
from ..
join (
    -- Not a precise value: if limiting exactly at, can introduce more bias.
    -- Using 'option optimize for' avoids this while requiring dynamic SQL.
    select top (@sample_size + convert(int, @sample_percent + 5))
    from t
    where 1=1
        and t.Name = 'Mr. No Questionable Checksum Usages'
        and ( -- sample
            @sample_percent = 100
            or abs(
                convert(bigint, hashbytes('SHA1', convert(varbinary(32), t.rowguid)))
            ) % (1000 * 100) < (1000 * @sample_percent)
        )
) sampled
on ..
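As mentioned in the cons above, a persisted computed column can pre-compute the 'well distributed' bigint key so the per-query HASHBYTES overhead disappears; a sketch (the column and index names are assumptions):
-- Pre-compute the sampling key once per row; HASHBYTES is deterministic,
-- so the column can be persisted and indexed.
alter table t add sample_key as (
    abs(convert(bigint, hashbytes('SHA1', convert(varbinary(32), rowguid)))) % (1000 * 100)
) persisted;
create index ix_t_sample_key on t (sample_key);
-- The sampling predicate then becomes: sample_key < (1000 * @sample_percent)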

It appears newid() can't be used in the WHERE clause, so this solution requires an inner query:
SELECT *
FROM (
SELECT *, ABS(CHECKSUM(NEWID())) AS Rnd
FROM MyTable
) vw
WHERE Rnd % 100 < 10 --10%

I was using this in a subquery, and it returned the same row for every row of the outer query:
SELECT ID,
    ( SELECT TOP 1 ImageURL
      FROM SubTable
      ORDER BY NEWID()
    ) AS ImageURL,
    GETUTCDATE(),
    1
FROM Mytable
Then I solved it by including a reference to the parent table in the subquery's WHERE clause:
SELECT ID,
    ( SELECT TOP 1 ImageURL
      FROM SubTable
      WHERE Mytable.ID > 0
      ORDER BY NEWID()
    ) AS ImageURL,
    GETUTCDATE(),
    1
FROM Mytable
Note the WHERE condition.

The server-side processing language in use (e.g. PHP, .NET, etc.) isn't specified, but if it's PHP, grab the required number of records (or all of them) and, instead of randomising in the query, use PHP's shuffle function. I don't know whether .NET has an equivalent function, but if it does, use that if you're using .NET.
ORDER BY RAND() can have quite a performance penalty, depending on how many records are involved.

select * from table
where id in (
    select id from table
    order by random()
    limit ((select count(*) from table) * 55 / 100)
)
-- to select 55 percent of rows randomly

If you know you have approximately N rows and you want approximately K random rows, you just need to pull any given row with a chance K/N. Using the RAND() function which gives you a fair distribution between 0 and 1, you could just do the following where PROB = K/N. Worked very quickly for me.
SELECT * FROM some_table WHERE RAND() < PROB

This works for me:
SELECT * FROM table_name
ORDER BY RANDOM()
LIMIT [number]

int64 overflow in sampling n number of rows (not %)

The script below randomly samples an approximate number of rows (50k).
SELECT *
FROM table
qualify rand() <= 50000 / count(*) over()
This has worked a handful of times before, so I was shocked to find this error this morning:
int64 overflow: 8475548256593033885 + 6301395400903259047
I have read this post. But as I am not summing, I don't think it is applicable.
The table in question has 267,606,559 rows.
Looking forward to any ideas. Thank you.
I believe the count is actually computed as a sum, which is the way BQ (and other databases) compute counts. You can see this by viewing the Execution Details/Graph (in the BQ UI). This is true even on a simple select count(*) from table query.
For your problem, consider something simpler like:
select *, rand() as my_rand
from table
order by my_rand
limit 50000
Also, if you know the rough size of your data or don't need exactly 50K, consider using the tablesample method:
select * from table
tablesample system (10 percent)
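If an approximate target is acceptable, the two ideas can also be combined (a sketch, not from the answer above; the 1 percent figure is a placeholder to tune against the table size):
-- Sample a cheap superset of pages first, then trim to roughly 50k rows.
select *
from table tablesample system (1 percent)
order by rand()
limit 50000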

How to permutate an SQL table using a seed?

Background
I have a front-end with a list of items with infinite scrolling, and I fetch pages of items by specifying the page limit and offset.
Problem
Apart from simply ordering the result by some of the columns, I would like to add a "random" option. The thing is, I don't want repetitions, so I need to have the entire dataset permutated before doing the limit and offset, and I need to be able to get the same permutation as long as I supply the same seed.
What I tried
A naive approach was to write a table-valued function that takes an int seed and uses it in the ORDER BY clause like so:
SELECT *
FROM dbo.Entities e
ORDER BY HASHBYTES('MD2', e.Title) ^ @seed
OFFSET 0 ROWS
FETCH NEXT (SELECT COUNT(*) FROM dbo.Entities) ROWS ONLY
This seemed to work well at first glance, but it turned out it's not very "volatile", for lack of a better word - the problem becomes more visible with sparse result sets, where most seeds (chosen randomly between 0 and 2147483647) yield the same order.
I thought I would get better results by hashing the seed as well, but SQL Server doesn't allow me to XOR two varbinary variables. Am I even looking in the right direction? Are there any performance considerations that I should be making and I might not be aware of?
The best way is to create a tally table with two columns: first a sequential integer (between 1 and 1,000,000), second a random integer. Then generate a random number to get the first value, and then make a join with a computed ROW_NUMBER().
CREATE TABLE T_NUM (SEQUENTIAL INT, RANDOM INT);
GO

WITH N AS
(
    SELECT 0 AS I
    UNION ALL
    SELECT I + 1
    FROM N
    WHERE I < 9
)
INSERT INTO T_NUM (SEQUENTIAL)
SELECT N1.I + N2.I * 10 + N3.I * 100 + N4.I * 1000 + N5.I * 10000 + N6.I * 100000
FROM N AS N1
CROSS JOIN N AS N2
CROSS JOIN N AS N3
CROSS JOIN N AS N4
CROSS JOIN N AS N5
CROSS JOIN N AS N6;
GO

WITH T AS
(
    SELECT SEQUENTIAL, ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID())) AS ALEA
    FROM T_NUM
)
UPDATE N
SET RANDOM = ALEA
FROM T_NUM AS N
JOIN T ON T.SEQUENTIAL = N.SEQUENTIAL;
GO

DECLARE @SEED INT = FLOOR(1 + RAND() * 1000000);
Now you have a seed to enter into the random ("alea") sequence; then join your table to it in sequential order.
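One possible reading of that final step, as a heavily hedged sketch (the entity table, its Id column, and the paging values are assumptions):
-- Number the entity rows, shift by the seed, wrap around the tally table,
-- and page through the pre-computed RANDOM order.
WITH E AS
(
    SELECT e.*, ROW_NUMBER() OVER (ORDER BY e.Id) AS SEQ
    FROM dbo.Entities AS e
)
SELECT E.*
FROM E
JOIN T_NUM AS N ON N.SEQUENTIAL = (E.SEQ + @SEED) % 1000000
ORDER BY N.RANDOM
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;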
ORDER BY HASHBYTES('MD2', e.Title + convert(nvarchar(max), @seed))
should work, but performance-wise it would be a disaster: you would calculate MD2 for all records every time. I would not do this on the server side at all. You could generate the random sequence on the client and then just pick, say, rows number 158, 7, 1027 and 9 from the server. But this still has two problems:
if an item is deleted, the row numbers of all subsequent records shift. That would break the whole sequence, and you would get duplicates and missing records
computing ROW_NUMBER() over millions of records is not that fast either
I see two options. You can query all ids from the table and use them to generate a random order, but that would be a lot of numbers. Or you can ensure the id space is dense enough; then you can query, say, 20 random ids and hope at least 10 of them exist. If you are unlucky, you have to query again.
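A sketch of that second option (the id list would come from a client-side seeded PRNG; names and values are illustrative):
-- Ask for 20 candidate ids and keep the first 10 that actually exist;
-- the client can re-order them into its own random sequence afterwards.
SELECT TOP (10) e.*
FROM dbo.Entities AS e
WHERE e.Id IN (158, 7, 1027, 9, 3141, 592, 653, 58, 979, 323,
               846, 264, 338, 327, 950, 288, 419, 716, 939, 937)
ORDER BY e.Id;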

SQL WHERE filter on two integers, is it faster by conversion to char or a combined integer?

I have a table with a primary key made of two 32-bit integers. I want to filter by an explicit list of these, and want to know the fastest approach. There are 3 ways I can think of. My question simply is: which is faster, the second method or the third method?
The 1st method I do not want to use because, if I have many pairs to list (only filtering for 2 rows in this example), it gets messy or needs a temp table, so it is not as concise:
select *
from [table]
where
(
([int1] = 123 and [int2] = 456)
OR ([int1] = 654 and [int2] = 321)
--etc
)
2nd method: convert to varchar
select *
from [table]
where convert(varchar(10), [int1]) + ',' + convert(varchar(10), [int2]) IN ('123,456','654,321')
3rd method: combine the two 32-bit integers into a single 64-bit integer
select *
from [table]
where convert(bigint, [int1]) * 4294967296 + [int2] IN (528280977864,2808908611905)
Edit
Thanks to a suggestion from Aron, I have tried using statistics - these are the results on a table with > 1 million rows, averaged over 10 trials each:
Time Statistics                 method 1   method 2   method 3
Client processing time              22.1        2.7        2.9
Total execution time               300.5     1099.8     1317.3
Wait time on server replies        278.4     1097.1     1314.4
So querying on them as-is is the fastest by far, but if I had to pick between the second and third methods, the varchar approach is faster (which surprises me).
Your first method:
select *
from [table]
where ( ([int1] = 123 and [int2] = 456) OR
        ([int1] = 654 and [int2] = 321) OR
        --etc
      )
Should be the fastest because it can take advantage of an index on (int1, int2). Perhaps the fastest method for a large list is to store the pairs in a temporary table with an index (clustered or unclustered) on int1 and int2.
I would shy away from playing around with the values. The bulk of the effort of the query is reading the data pages. Slight variations in comparison logic will have little impact on the query.
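A sketch of that temporary-table variant (pair values from the question; the temp table name is an assumption):
-- Put the pairs in an indexed temp table and join on both columns.
CREATE TABLE #pairs (int1 int, int2 int, PRIMARY KEY (int1, int2));
INSERT INTO #pairs VALUES (123, 456), (654, 321);

SELECT t.*
FROM [table] AS t
JOIN #pairs AS p
    ON p.int1 = t.[int1] AND p.int2 = t.[int2];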
Maybe you need to give a better example?
I tried your example and performance looks fine for all three. A bigger result set might be more predictive; try using the estimated plan.
create table #table (int1 int,int2 int)
insert into #table values(123,456);
insert into #table values(654,321);
select *
from #table
where
(
([int1] = 123 and [int2] = 456)
OR ([int1] = 654 and [int2] = 321)
)
select *
from #table
where convert(varchar(10), [int1]) +'-'+ convert(varchar(10), [int2]) IN ('123-456','654-321')
select *
from #table
where convert(bigint, [int1]) * 4294967296 + [int2] IN (528280977864,2808908611905)
--drop table #table
will give almost the same estimated cost: 33% for each query...
The 1st method, the one you don't want to use because it's messy, seems to be the fastest way; just keep the values in two columns and index them.
The speed of a SQL query doesn't depend so much on the number of fields queried or the complexity of the query as on how well it uses the indexes.
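A sketch of the supporting index for the first method (the index name is an assumption):
-- A composite index lets each (int1, int2) pair be resolved with an index seek.
CREATE INDEX IX_table_int1_int2 ON [table] ([int1], [int2]);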

Biased random in SQL?

I have some entries in my database, in my case Videos with a rating, popularity and other factors. From all these factors I calculate a likelihood factor, or rather a boost factor.
So I essentially have the fields ID and BOOST. The boost is calculated so that it turns out as an integer representing the percentage of how often this entry should be hit in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as many on ID 2 and 7 times as many on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10; I chose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have either only the 2nd element or both elements to choose from randomly.
So 0.5 of the hits go to the second element,
and 0.5 of the hits go to the (second and first) pair, which is then chosen from randomly, so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66.
I need some modification or a new method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after an item whenever I change it, or develop some swapping algorithm... which is really not elegant.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery: "please choose a random ad with a given probability". I need it for another purpose, but that gives you a picture of what it should do.
edit:
Thanks to Ken's answer I thought about the following approach:
Calculate a random value from 0 to sum(distinct boost):
SET @randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
Select the boost factor at which the running total of distinct boost factors surpasses the random value.
In our 1st example we would then have 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
Now select one random entry from all entries having this boost factor.
PROBLEM: the count of entries having a given boost is always different. For example, if there is only one 1-boosted entry I get it in 1 of 10 calls, but if there are 1 million entries with boost 7, each of them is hardly ever returned...
So this doesn't work out :( - trying to refine it.
I have to somehow include the count of entries with this boost factor ... but I am stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE @sample TABLE (id int, boost int)
INSERT @sample VALUES (1, 1), (2, 2), (3, 7)

SELECT
    RAND(CHECKSUM(NEWID())) * boost AS weighted,
    SUM(boost) OVER () AS boostcount,
    id
FROM
    @sample
GROUP BY
    id, boost
ORDER BY
    weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
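A variant worth noting (not from the answer above): the weighted-sampling key POWER(u, 1.0 / boost), with u a per-row uniform random value, picks each id with probability proportional to its boost when you take the top row:
-- Assumes boost > 0; @sample is the table variable declared above.
SELECT TOP (1) id
FROM @sample
ORDER BY POWER(RAND(CHECKSUM(NEWID())), 1.0 / boost) DESC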
I dare to suggest a straightforward solution with two queries, using a cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select the table rows, calculating the cumulative sum of boosts, and find the first row whose cumulative boost is greater than or equal to {random_number}:
SET @cumulative_boost = 0;
SELECT
    id,
    @cumulative_boost := (@cumulative_boost + boost) AS cumulative_boost
FROM
    table
WHERE
    cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: every person had a calculated number of tickets in the final draw. If you had more tickets, you had a higher chance to win "the lottery".
Since I didn't trust any of the solutions I found on the web (rand() * multiplier, or the one with -log(rand())), I wanted to implement my own straightforward solution.
What I did, which in your case would look a little bit like this:
SELECT `values`.id
FROM (SELECT id, boost FROM foo) AS `values`
INNER JOIN (
    SELECT id % 100 + 1 AS counter
    FROM user
    GROUP BY counter
) AS numbers ON numbers.counter <= `values`.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum boost value is less than the maximum number returned by the numbers query.
That the inner query returns ALL numbers between 1..100. It might not, depending on your table!
Since I have all distinct numbers between 1..100, joining on numbers.counter <= values.boost means that if a row has a boost of 2 it ends up duplicated in the final result, and if a row has a boost of 100 it ends up in the final set 100 times. In other words, if the sum of boosts is 4212, which it was in my case, you have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly, make sure to use a large table, or make sure that the ids don't skip any numbers. Better yet, and probably a bit faster, you might create a temporary table which simply has all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost.
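A sketch of that numbers-table variant in MySQL (the numbers table and how it gets filled are assumptions):
-- A helper table of 1..n (n at least the maximum boost).
CREATE TEMPORARY TABLE numbers (id INT PRIMARY KEY);
-- ... fill numbers with 1..n by any convenient means ...

SELECT f.id
FROM foo AS f
INNER JOIN numbers ON numbers.id <= f.boost
ORDER BY RAND()
LIMIT 1;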