What I am trying to achieve is selecting a control group for a process. To do this I am using random(), and for debugging / consistency I would like to be able to set the random number in a repeatable fashion. Meaning, I run the query once it assigns user 123 random number .001. At a different time I delete the previous data, I call the same query, and once again user 123 is assigned random number .001.
I have tried:
SELECT setseed(0);
SELECT 1, random() from generate_series(1,10);
I receive a different random number with every run.
SELECT 1, setseed(0), random() from generate_series(1,10);
Each row receives the same random number, which is useless.
I'm sure there is something I don't understand here. Any help is appreciated.
Do a union all of the setseed() query with the desired query. It is necessary to match the column types from both queries. setseed() returns void.
select setseed(0), null
union all
select null, random()
from generate_series(1, 10)
offset 1
;
The offset 1 clause eliminates the setseed() row from the result set
Related
Background
I have a front-end with a list of items with infinite scrolling, and I fetch pages of items by specifying the page limit and offset.
Problem
Apart from simply ordering the result by some of the columns, I would like to add a "random" option. The thing is, I don't want repetitions, so I need to have the entire dataset permutated before doing the limit and offset, and I need to be able to get the same permutation as long as I supply the same seed.
What I tried
A naive approach was to write a table-valued function that takes an int seed and uses it in the ORDER BY clause like so:
SELECT *
FROM dbo.Entities e
ORDER BY HASHBYTES('MD2', e.Title) ^ #seed
OFFSET 0 ROWS
FETCH NEXT (SELECT COUNT(*) FROM dbo.Entities) ROWS ONLY
This seemed to work well at a first glance, but it turned out it's not very "volatile" for the lack of better word - it becomes more visible with sparse result sets, where most seeds (chosen randomly from between 0 and 2147483647) yield the same order.
I thought I would get better results by hashing the seed as well, but SQL Server doesn't allow me to XOR two varbinary variables. Am I even looking in the right direction? Are there any performance considerations that I should be making and I might not be aware of?
The best way is to create a tally table with two columns: first a sequential integer, (between 1 and 1,000,000), second a random integer number. Then generate a random number to get the first value and then make a join with a computed ROW_NUMBER().
CREATE TABLE T_NUM (SEQUENTIAL INT, RANDOM INT);
GO
WITH
N AS(SELECT 0 AS I
UNION ALL
SELECT I + 1
FROM N
WHERE I < 9)
INSERT INTO T_NUM (SEQUENTIAL)
SELECT N1.I + N2.I * 10 + N3.I * 100 + N4.I * 1000 + N5.I * 10000 + N6.I * 100000
FROM N AS N1
CROSS JOIN N AS N2
CROSS JOIN N AS N3
CROSS JOIN N AS N4
CROSS JOIN N AS N5
CROSS JOIN N AS N6;
GO
WITH T AS
(
SELECT SEQUENTIAL, ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID())) AS ALEA
FROM T_NUM
)
UPDATE N
SET RANDOM = ALEA
FROM T_NUM AS N
JOIN T ON T.SEQUENTIAL = N.SEQUENTIAL;
GO
DECLARE #SEED INT = FLOOR(1 + RAND() * 1000000);
Now you have your seed to enter in the alea sequence then join your table on sequential order
ORDER BY HASHBYTES('MD2', e.Title + convert(nvarchar(max), #seed))
should work, but performance-wise it would be a disaster. You would calculate MD2 for all records every time. I would not do this on server side at all. You can generate random sequence on client and then just pick from server rows with row number 158, 7, 1027 and 9. But it has still two problems
if item is deleted, row number of all consecutive records shifts. It would just break the whole sequence and you would get duplicities and missing records
row number over millions of records is not that fast either
I see two options. You can query all ids from the table and use them for generating of random order. But that would be a lot of numbers. Or you have to ensure the id space is dense enough. Then you can query 20 random ids and hope at least 10 of them exist. If you are unlucky, you would have to query again.
I have an SQL table which has two integers. Let these integers be a and b.
I want to SELECT out a random record, such that the record is selected with probability proportional to C + a/b for some constant C which I will choose.
So for example, if C = 0, and there are two records with a=1,b=2 and a=2,b=3, then we have that for the first record C+a/b = 1/2 and for the second record C+a/b = 2/3, and therefore with probability 0.3 I will choose the first record, and probability 0.7 I will choose the second record from that SELECT query.
I know SQL well (I thought), but I am not even sure where to begin here. I thought of doing a select for the "SUM(a/b)" first, and then doing a select for the first record the sum of C+a/b up to it exceeds a random number between C*number_of_records + SUM(a/b) for the first time. But, I don't really know how to do that.
You could do something like sorting by a random number multiplied by your other stuff, and just select top 1 from that query - something like:
SELECT TOP 1 (your column names)
FROM (your table)
ORDER BY Rand() * (your calculation)
Is it possible to retrieve a specific range of results? I know how to do TOP x but the result I will retrieve is WAY too big and will time out. I was hoping to be able to pick say the first 10,000 results then the next 10,000 and so on. Is this possible?
WITH Q AS (
SELECT ROW_NUMBER() OVER (ORDER BY ...some column) AS N, ...other columns
FROM ...some table
) SELECT * FROM Q WHERE N BETWEEN 1 AND 10000;
Read more about ROW_NUMBER() here: http://msdn.microsoft.com/en-us/library/ms186734.aspx
Practically all SQL DB implementations have a way of specifying the starting row to return, as well as the number of rows.
For example, in both mysql and postgres it looks like:
SELECT ...
ORDER BY something -- not required, but highly recommended
LIMIT 100 -- only get 100 rows
OFFSET 500; -- start at row 500
Note that normally you would include an ORDER BY to make sure your chunks are consistent
MS SQL Server (being a "pretend" DB) don't support OFFSET directly, but it can be coded using ROW_NUMBER() - see this SO post for more detail.
I have some entries in my database, in my case Videos with a rating and popularity and other factors. Of all these factors I calculate a likelihood factor or more to say a boost factor.
So I essentially have the fields ID and BOOST.The boost is calculated in a way that it turns out as an integer that represents the percentage of how often this entry should be hit in in comparison.
ID Boost
1 1
2 2
3 7
So if I run my random function indefinitely I should end up with X hits on ID 1, twice as much on ID 2 and 7 times as much on ID 3.
So every hit should be random but with a probability of (boost / sum of boosts). So the probability for ID 3 in this example should be 0.7 (because the sum is 10. I choose those values for simplicity).
I thought about something like the following query:
SELECT id FROM table WHERE CEIL(RAND() * MAX(boost)) >= boost ORDER BY rand();
Unfortunately that doesn't work, after considering the following entries in the table:
ID Boost
1 1
2 2
It will, with a 50/50 chance, have only the 2nd or both elements to choose from randomly.
So 0.5 hit goes to the second element
And 0.5 hit goes to the (second and first) element which is chosen from randomly so so 0.25 each.
So we end up with a 0.25/0.75 ratio, but it should be 0.33/0.66
I need some modification or new a method to do this with good performance.
I also thought about storing the boost field cumulatively so I just do a range query from (0-sum()), but then I would have to re-index everything coming after one item if I change it or develop some swapping algorithm or something... but that's really not elegant and stuff.
Both inserting/updating and selecting should be fast!
Do you have any solutions to this problem?
The best use case to think of is probably advertisement delivery. "Please choose a random ad with given probability"... however i need it for another purpose but just to give you a last picture what it should do.
edit:
Thanks to kens answer i thought about the following approach:
calculate a random value from 0-sum(distinct boost)
SET #randval = (select ceil(rand() * sum(DISTINCT boost)) from test);
select the boost factor from all distinct boost factors which added up surpasses the random value
then we have in our 1st example 1 with a 0.1, 2 with a 0.2 and 7 with a 0.7 probability.
now select one random entry from all entries having this boost factor
PROBLEM: because the count of entries having one boost is always different. For example if there is only 1-boosted entry i get it in 1 of 10 calls, but if there are 1 million with 7, each of them is hardly ever returned...
so this doesnt work out :( trying to refine it.
I have to somehow include the count of entries with this boost factor ... but i am somehow stuck on that...
You need to generate a random number per row and weight it.
In this case, RAND(CHECKSUM(NEWID())) gets around the "per query" evaluation of RAND. Then simply multiply it by boost and ORDER BY the result DESC. The SUM..OVER gives you the total boost
DECLARE #sample TABLE (id int, boost int)
INSERT #sample VALUES (1, 1), (2, 2), (3, 7)
SELECT
RAND(CHECKSUM(NEWID())) * boost AS weighted,
SUM(boost) OVER () AS boostcount,
id
FROM
#sample
GROUP BY
id, boost
ORDER BY
weighted DESC
If you have wildly different boost values (which I think you mentioned), I'd also consider using LOG (which is base e) to smooth the distribution.
Finally, ORDER BY NEWID() is a randomness that would take no account of boost. It's useful to seed RAND but not by itself.
This sample was put together on SQL Server 2008, BTW
I dare to suggest straightforward solution with two queries, using cumulative boost calculation.
First, select sum of boosts, and generate some number between 0 and boost sum:
select ceil(rand() * sum(boost)) from table;
This value should be stored as a variable, let's call it {random_number}
Then, select table rows, calculating cumulative sum of boosts, and find the first row, which has cumulative boost greater than {random number}:
SET #cumulative_boost=0;
SELECT
id,
#cumulative_boost:=(#cumulative_boost + boost) AS cumulative_boost,
FROM
table
WHERE
cumulative_boost >= {random_number}
ORDER BY id
LIMIT 1;
My problem was similar: Every person had a calculated number of tickets in the final draw. If you had more tickets then you would have an higher chance to win "the lottery".
Since I didn't trust any of the found results rand() * multiplier or the one with -log(rand()) on the web I wanted to implement my own straightforward solution.
What I did and in your case would look a little bit like this:
(SELECT id, boost FROM foo) AS values
INNER JOIN (
SELECT id % 100 + 1 AS counter
FROM user
GROUP BY counter) AS numbers ON numbers.counter <= values.boost
ORDER BY RAND()
Since I don't have to run it often I don't really care about future performance and at the moment it was fast for me.
Before I used this query I checked two things:
The maximum number of boost is less than the maximum returned in the number query
That the inner query returns ALL numbers between 1..100. It might not depending on your table!
Since I have all distinct numbers between 1..100 then joining on numbers.counter <= values.boost would mean that if a row has a boost of 2 it would end up duplicated in the final result. If a row has a boost of 100 it would end up in the final set 100 times. Or in another words. If sum of boosts is 4212 which it was in my case you would have 4212 rows in the final set.
Finally I let MySql sort it randomly.
Edit: For the inner query to work properly make sure to use a large table, or make sure that the id's don't skip any numbers. Better yet and probably a bit faster you might even create a temporary table which would simply have all numbers between 1..n. Then you could simply use INNER JOIN numbers ON numbers.id <= values.boost
This question already has answers here:
Closed 14 years ago.
EDIT: Duplicate of How do I return random numbers as a column in SQL Server 2005?
Hello
How can I generate random numbers in mssql server 2005.
for example: when I select 500rows from table, each row must have one generated random number, numbers must be generated at runtime
UPDATE: I need the fastest method, generate numbers from large tables, datepart, same magic computations is realatively slow with big amount of rows
UPDATE thanks for answers, but o finally used this solution
SELECT ABS(CHECKSUM(NewId())) % 10 - 5 AS Random
for random numbers from -5 to 5 and bonus is approximately the same number of occurances for each number
SELECT TOP 500
CONVERT(INT, CONVERT(VARBINARY(16), NEWID()))
FROM
dbo.MyTable
I love this trick, its awesome ...
-- 500 random numbers from 0 to 499
select top 500 ABS(CAST(CAST(NEWID() AS VARBINARY) AS int)) % 500 from sysobjects
easy google: 1 2 ...
Returning Random Numbers in a SELECT statement
As it is implemented, the RAND() function in SQL Server doesn't let you return a different random number per row in your SELECT statement. For example, if you execute this:
SELECT Rand() as RandomNumber, *
FROM Northwind..Customers
You will see the same number over and over. However, sometimes, you may want to return a randomly generated number per line in your SELECT. Here's one way to do that, in SQL Server 2000, using a UDF.
First, we need to create a View that returns a single random number:
CREATE VIEW vRandNumber
AS
SELECT RAND() as RandNumber
The view is necessary because normally in a UDF we cannot use the rand() function, because that would make the function non-determistic. We can trick the UDF to accepting a random number by using a View.
Once that is set up, we can then create our function:
CREATE FUNCTION RandNumber()
RETURNS float
AS
BEGIN
RETURN (SELECT RandNumber FROM vRandNumber)
END
Finally, you can use this function in any SELECT to now return a random number between 0 and 1 per row:
SELECT dbo.RandNumber(), *
FROM Northwind..Customers
You can get even fancier in your RandNumber function by accepting a seed if you like, or allowing for range parameters like this:
CREATE FUNCTION RandNumber2(#Min int, #Max int)
RETURNS float
AS
BEGIN
RETURN #Min + (select RandNumber from RetRandNumber) * (#Max-#Min)
END
It might be helpful to know why you want the random number. For example if you just want to sort the results randomly you can do ORDER BY NEWID() and not worry about generating the random number at all. If you really need the random number later, use #Tim quoted solution.