Background
I have a front-end with a list of items with infinite scrolling, and I fetch pages of items by specifying the page limit and offset.
Problem
Apart from simply ordering the result by some of the columns, I would like to add a "random" option. The thing is, I don't want repetitions, so I need to have the entire dataset permuted before applying the limit and offset, and I need to be able to get the same permutation as long as I supply the same seed.
What I tried
A naive approach was to write a table-valued function that takes an int seed and uses it in the ORDER BY clause like so:
SELECT *
FROM dbo.Entities e
ORDER BY HASHBYTES('MD2', e.Title) ^ @seed
OFFSET 0 ROWS
FETCH NEXT (SELECT COUNT(*) FROM dbo.Entities) ROWS ONLY
This seemed to work well at first glance, but it turned out it's not very "volatile", for lack of a better word. This becomes more visible with sparse result sets, where most seeds (chosen randomly between 0 and 2147483647) yield the same order.
I thought I would get better results by hashing the seed as well, but SQL Server doesn't allow me to XOR two varbinary variables. Am I even looking in the right direction? Are there any performance considerations that I should be making and I might not be aware of?
The best way is to create a tally table with two columns: first a sequential integer, (between 1 and 1,000,000), second a random integer number. Then generate a random number to get the first value and then make a join with a computed ROW_NUMBER().
CREATE TABLE T_NUM (SEQUENTIAL INT, RANDOM INT);
GO
WITH
N AS(SELECT 0 AS I
UNION ALL
SELECT I + 1
FROM N
WHERE I < 9)
INSERT INTO T_NUM (SEQUENTIAL)
SELECT N1.I + N2.I * 10 + N3.I * 100 + N4.I * 1000 + N5.I * 10000 + N6.I * 100000
FROM N AS N1
CROSS JOIN N AS N2
CROSS JOIN N AS N3
CROSS JOIN N AS N4
CROSS JOIN N AS N5
CROSS JOIN N AS N6;
GO
WITH T AS
(
SELECT SEQUENTIAL, ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID())) AS ALEA
FROM T_NUM
)
UPDATE N
SET RANDOM = ALEA
FROM T_NUM AS N
JOIN T ON T.SEQUENTIAL = N.SEQUENTIAL;
GO
DECLARE @SEED INT = FLOOR(1 + RAND() * 1000000);
Now you have a seed that marks your entry point into the random (ALEA) sequence; from there, join your table to T_NUM in sequential order.
ORDER BY HASHBYTES('MD2', e.Title + convert(nvarchar(max), @seed))
should work, but performance-wise it would be a disaster: you would calculate MD2 for all records every time. I would not do this on the server side at all. You can generate a random sequence on the client and then just pick the rows with row numbers 158, 7, 1027 and 9 from the server. But that still has two problems:
if an item is deleted, the row numbers of all subsequent records shift. That would break the whole sequence, and you would get duplicates and missing records
computing row numbers over millions of records is not that fast either
I see two options. You can query all ids from the table and use them to generate the random order, but that would be a lot of numbers. Or you have to ensure the id space is dense enough. Then you can query 20 random ids and hope at least 10 of them exist. If you are unlucky, you would have to query again.
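The first option (fetch all ids, then order them on the client) can be sketched as follows. This is only an illustration under assumptions: the function names `seeded_order` and `page` are hypothetical, SHA-256 stands in for whatever hash you prefer, and the ids are assumed to fit in memory.

```python
import hashlib


def seeded_order(ids, seed):
    """Sort ids by a hash of (seed, id); the same seed always gives the same order."""
    def key(item_id):
        # Hash the seed together with the id so that a different seed
        # reshuffles everything instead of just shifting the keys.
        return hashlib.sha256(f"{seed}:{item_id}".encode("utf-8")).digest()
    return sorted(ids, key=key)


def page(ids, seed, offset, limit):
    """Return one page of the seeded permutation (hypothetical helper)."""
    return seeded_order(ids, seed)[offset:offset + limit]


ids = list(range(1, 21))          # stand-in for "SELECT id FROM dbo.Entities"
first = page(ids, seed=42, offset=0, limit=10)
second = page(ids, seed=42, offset=10, limit=10)
```

Because each sort key depends on the id and not on a row number, deleting an item leaves the relative order of the remaining ids unchanged, although offsets after the deleted item still move up by one.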
Related
I'm writing a SQL Server procedure to optimize the cutting of bars. I haven't yet found the best method. A CTE query seems to be the way to go, but I'm stuck.
I am trying to write a stored procedure to optimize the cutting of bars. For my test, I have to cut 18 pieces (3 of 1000 mm, 3 of 1500 mm, 3 of 2500 mm, 3 of 3500 mm, 3 of 4500 mm and 3 of 6000 mm), and I have 3 sizes of bars (5500 mm, 7000 mm and 8500 mm).
After that, I generate every possible combination of cuts for each bar.
I tried a while loop with a temporary table; it took an hour and a half. But I think I can do better with a CTE query...
Now I must generate every combination of bars that yields my 18 cuts. I wrote another CTE query, but I haven't found a way to stop the recursion once at least one combination contains all the cuts. So my query finds over 150 million combinations, with 8, 9, 10, 11... bars, and it tries every level up to 18 bars. I want it to stop at 8 bars (I know that is the smallest bar count I need for my cuts). And it takes more than two days!
I have 2 temporary tables. One holds my bar combinations (#COMBI_BARRE) with this structure: ID_ART, the identity of the article; COLOR; CUT_COMBI, a varchar concatenating the cut IDs of the bar combination (1-2-3-4...); NB_CUTS, an integer holding the count of cuts in the bar; FIRST_CUT, the smallest cut ID of the bar.
The other temporary table, #DET_BAR, holds the detail of my cuts in 2 columns: ID_COMBI_BAR, the bar combination ID, and ID_CUT_STR, the cut ID as a varchar (to avoid a cast or convert in the CTE, for better performance).
I store the result in a table called Combi, with ID_ART; COLOR; a varchar column COMBI that concatenates the bar combination IDs (1-2-3-4...); a varchar column COMBI_CUT that concatenates the ID_CUT values (1-2-3-4-5...); NB_BAR, the count of bars in the combination; NB_CUTS, the count of cuts in the combination; MAX_CUTS, the total number of cuts I must make for my article and color.
Since there is one recursion level per bar, I tried to add an EXISTS clause to stop the recursion once a level has at least one combination with all my cuts. I know I must not cut 10 bars if I can do it with 8. But I get the error "the recursive table has multiple reference".
How can I write my query so that it avoids going through every level?
;WITH Combi (ID_ART, COLOR, COMBI, COMBI_CUT, NB_BAR, NB_CUTS, MAX_CUTS)
AS
( SELECT C.ID_ART,
C.COLOR,
'-' + ID_COMBI_BAR_STR + '-',
'-' + C.CUT_COMBI + '-',
1,
C.NB_CUTS,
ISNULL(MAXI.CUT_NUM,0)
FROM #COMBI_BARRE C with(nolock)
outer apply (select top 1 D.CUT_NUM
from #DEBITS D
where D.ID_ART = C.ID_ART
and D.COLOR= C.COLOR
order by D.NUM_OCC_DEB desc) MAXI
WHERE C.FIRST_CUT = 1
UNION ALL
SELECT C.ID_ART,
C.COLOR,
Combi.COMBI + ID_COMBI_BAR_STR + '-',
Combi.COMBI_CUT+ C.CUT_COMBI + '-',
Combi.NB_BAR+ 1,
Combi.NB_CUTS+ C.NB_CUTS,
Combi.MAX_CUTS
FROM #COMBI_BARRE C with(nolock)
INNER JOIN Combi on C.ID_ART = Combi.ID_ART
and C.COLOR= Combi.COLOR
where C.FIRST_CUT > Combi.NB_BAR
and Combi.NB_CUTS+ C.NB_CUTS<= Combi.MAX_CUTS
and NOT EXISTS(select * from #DET_BAR D with(nolock)
where D.ID_COMBI_BAR = C.ID_COMBI_BAR
and PATINDEX(D.ID_CUT_STR, Combi.COMBI_CUT) > 0)
and NOT EXISTS(select top 1 * from Combi Combi2 where Combi2.ID_ART = C.ID_ART and Combi2.COLOR = C.COLOR and Combi2.NB_CUTS = Combi2.MAX_CUTS)
)
select * from Combi
This is a variation of the bin packing problem. That search term might help you in the right direction.
Also, you can go to my Bin Packing page, which gives several approaches to a more simplified version of your problem.
A small warning: the linked article(s) don't use any (recursive) CTE, so they won't answer your specific CTE question.
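To give a feel for what a bin packing heuristic looks like, here is a minimal first-fit-decreasing sketch. It is not taken from the linked page and makes simplifying assumptions: a single bar length (8500 mm), no material lost per cut, and no guarantee that the result is the true minimum. The function name `first_fit_decreasing` is hypothetical.

```python
def first_fit_decreasing(pieces, bar_length):
    """Place each piece, longest first, into the first bar that still has room."""
    bars = []  # each bar is a list of the piece lengths cut from it
    for piece in sorted(pieces, reverse=True):
        if piece > bar_length:
            raise ValueError(f"piece {piece} is longer than the bar")
        for bar in bars:
            if sum(bar) + piece <= bar_length:
                bar.append(piece)
                break
        else:
            # No existing bar has enough length left: start a new one.
            bars.append([piece])
    return bars


# The 18 pieces from the question: 3 of each length, in mm.
pieces = [1000, 1500, 2500, 3500, 4500, 6000] * 3
layout = first_fit_decreasing(pieces, bar_length=8500)
for number, bar in enumerate(layout, start=1):
    print(number, bar, "offcut:", 8500 - sum(bar))
```

A heuristic like this runs in a fraction of a second, which is the main reason to consider it before enumerating every combination.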
The use case is, that I have a table products and user_match_product. For a specific user, I want to select X random products, for which that user has no match.
The naive way to do that, would be to make something like
SELECT * FROM products WHERE id NOT IN (SELECT p_id FROM user_match_product WHERE u_id = 123) ORDER BY random() LIMIT X
but that will become a performance bottleneck when having millions of rows.
I thought of some possible solutions which I will present here now. I would love to hear about your solutions for that problem or suggestions regarding my solutions.
Solution 1: Trust the randomness
Based on the fact that the product ids are monotonically increasing, one could optimistically generate X*C random numbers R_i where i between 1 and X*C, which are in the range [min_id, max_id], and hope that a select like the following will return X elements.
SELECT * FROM products p1 WHERE p1.id IN (R_1, R_2, ..., R_XC) AND NOT EXISTS (SELECT * FROM user_match_product WHERE u_id = 123 AND p_id = p1.id) LIMIT X
Advantages
If the random number generator is good, this will probably work very well within O(1)
Old and newly added products have the same probability of being chosen
Disadvantages
If the number of matches is near to the number of products, the collision probability might be very high.
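Solution 1 can be sketched on the application side as follows. This is a rough illustration, not a tuned implementation: `pick_unmatched` is a hypothetical name, and the two sets stand in for the `products` and `user_match_product` lookups that the SQL query would perform.

```python
import random


def pick_unmatched(min_id, max_id, existing_ids, matched_ids, x, c, rng):
    """Draw up to x*c distinct random ids and keep those that exist and are unmatched."""
    wanted = min(x * c, max_id - min_id + 1)
    candidates = rng.sample(range(min_id, max_id + 1), wanted)
    hits = [i for i in candidates
            if i in existing_ids and i not in matched_ids]
    # May return fewer than x ids; the caller then retries with a larger c.
    return hits[:x]


rng = random.Random(0)
existing = set(range(1, 101))     # stand-in for the ids present in products
matched = set(range(1, 51))       # stand-in for this user's matches
picked = pick_unmatched(1, 100, existing, matched, x=5, c=4, rng=rng)
```

The closer the user's matches get to the full product set, the larger `c` has to be, which is the collision problem listed under the disadvantages.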
Solution 2: Block-wise PRNG
One could create a permutation function permutate(seed, start, end, value) for the domain [START, END] that uses a seed for randomness. At time t0 a user A has 0 matched products and observes that E0 products exist. The first block for the user A at t0 is for the domain [1, E0]. The user remembers a counter C which initially is 0.
To select X products the user A first has to create the permutations P_i like
P_i = permutate(seed, START, END, C + i)
The following has to hold for the function.
permutate(seed, start, end, value) is element of [start, end]
value is element of [start, end]
The following query will return X non-repeating elements.
SELECT * FROM products WHERE id IN (P_1, ..., P_X)
When C reaches END, the next block is allocated by using END + 1 as the new START, the current count of products E1 as new END. The seed and C stay the same.
Advantages
No collisions possible
Guaranteed O(1)
Disadvantages
The current block has to be finished before new products can be selected
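One possible shape for `permutate` is an affine map modulo the block size, sketched below. This is an assumption of mine, not part of the proposal above: an affine map is a valid permutation but a predictable one, and the way the multiplier and offset are derived from the seed is purely illustrative.

```python
import math
import random


def permutate(seed, start, end, value):
    """Map value in [start, end] to a unique element of [start, end]."""
    if not start <= value <= end:
        raise ValueError("value must lie in [start, end]")
    n = end - start + 1
    # Derive the map parameters deterministically from the seed and the block.
    rng = random.Random(f"{seed}:{start}:{end}")
    a = rng.randrange(1, max(n, 2))
    while math.gcd(a, n) != 1:
        # a must be coprime with n, otherwise the map is not a bijection.
        a = rng.randrange(1, max(n, 2))
    b = rng.randrange(n)
    return start + (a * (value - start) + b) % n


# Ids for the first X = 5 picks of a user whose first block is [1, 50].
ids_to_fetch = [permutate(seed=7, start=1, end=50, value=i) for i in range(1, 6)]
```

Since the map is a bijection on the block, walking the counter from START to END visits every id exactly once, which is what rules out collisions.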
I'd go with approach #1.
You can get a first estimate of C by counting the user's rows in user_match_product (supposed unique). If he already possesses half the possible products, selecting twice the number of random products seems a good heuristic.
You can also have a last-ditch correction that verifies that the number of extracted products is actually X. If it was, say, X/3, you'd need to run the same extraction two more times (avoiding already-generated random product IDs), and increase that user's C constant by a factor of three.
Also, knowing what the range of product IDs is, you could select random numbers in that range that do not appear in user_match_product (i.e. your first stage query is only against user_match_product) which is bound to have a (much?) lower cardinality than products. Then, those IDs that pass the test can be safely selected from products.
If you want to choose X products that the user doesn't have, the first thing that comes to mind is to enumerate the products and use order by rand() (or the equivalent, depending on the database). This is your first solution:
SELECT p.*
FROM products p
WHERE NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id AND ump.u_id = 123)
ORDER BY random()
LIMIT X;
A simple way to make this more efficient is to choose an arbitrary subset. You can actually do this using random() as well, but in the where clause:
SELECT p.*
FROM products p
WHERE random() < Y AND
NOT EXISTS (SELECT 1 FROM user_match_product ump WHERE ump.p_id = p.id AND ump.u_id = 123)
ORDER BY random()
LIMIT X;
The question is: what is "Y"? Well, let's say the number of products is P and the user has U. Then, if we choose a random set of (X + U) products, we can definitely get X products the user does not have. This suggests that the expression random() < (X + U) / P would be sufficient. Alas, the vagaries of random numbers say that sometimes we would get enough and sometimes not enough. Let's add a factor such as 3 to be safe. This is actually really, really, really, really safe for most values of X, U, and P.
The idea is a query such as this:
SELECT p.*
FROM Products p CROSS JOIN
(SELECT COUNT(*) as p FROM Products) v1 CROSS JOIN
(SELECT COUNT(*) as u FROM User_Match_Product WHERE u_id = 123) v2
WHERE random() < 3.0 * (v2.u + X) / v1.p AND
NOT EXISTS (SELECT 1 FROM User_Match_Product ump WHERE ump.p_id = p.id AND ump.u_id = 123)
ORDER BY random()
LIMIT X;
Note that these calculations require a small amount of time with appropriate indexes on Products and User_Match_Product.
So, suppose you have 1,000,000 products, a typical user has 20, and you want to recommend 10 more. Then the expression is (20 + 10)*3/1000000 --> 90/1000000. This query will scan the products table, pull out about 90 rows randomly, and then sort them and choose 10 rows. Sorting 90 rows is, essentially, constant time relative to the original operation.
For many purposes, the cost of the table scan is acceptable. It sure beats the cost of sorting all the data, for instance.
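The arithmetic behind the sampling threshold can be checked with a few lines. The helper name `sampling_fraction` and the cap at 1.0 are my own additions for illustration. The float division matters: with integer counts, many SQL dialects would truncate the quotient to 0.

```python
def sampling_fraction(x, u, p, safety=3):
    """Fraction of the products table to keep before the final ORDER BY random()."""
    # Float division on purpose: integer division would give 0 for small samples.
    return min(1.0, safety * (x + u) / p)


products = 1_000_000
fraction = sampling_fraction(x=10, u=20, p=products)
expected_rows = fraction * products
print(fraction, expected_rows)
```

For a user who already has most of the products, the cap means the filter keeps every row, and the query degrades to the plain `ORDER BY random()` version.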
The alternative approach is to load all products for a user into the application. Then pull a random product out and compare to the list:
select p.id
from Products p cross join
(select min(id) as minid, max(id) as maxid from Products) v1
where p.id >= minid + random() * (maxid - minid)
order by p.id
limit 1;
(Note the calculation can be done outside the query so you can just plug in a constant.)
Many query optimizers will resolve this query in constant time by doing an index scan. You can then check in the application whether the user has the product already. This would then run about X times for the user, providing O(1) performance. However, this has rather bad worst-case performance: if there are not X available products, it will run indefinitely. Of course, additional logic can fix this problem.
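The "additional logic" could be as simple as an attempt cap, as in this application-side sketch. The names are hypothetical, and a sorted list with `bisect` stands in for the indexed `id >= ... ORDER BY id LIMIT 1` lookup. Note that an id sitting right after a gap in the id space is picked more often than its neighbours.

```python
import bisect
import random


def pick_random_unowned(sorted_ids, owned, x, rng, max_attempts=1000):
    """Probe random points in the id range until x unowned ids are found or attempts run out."""
    if not sorted_ids:
        return []
    picked = []
    seen = set(owned)
    low, high = sorted_ids[0], sorted_ids[-1]
    for _ in range(max_attempts):
        if len(picked) == x:
            break
        target = low + rng.random() * (high - low)
        # First id >= target, like "WHERE id >= target ORDER BY id LIMIT 1".
        candidate = sorted_ids[bisect.bisect_left(sorted_ids, target)]
        if candidate not in seen:
            seen.add(candidate)
            picked.append(candidate)
    return picked


all_ids = list(range(1, 101))
owned = set(range(1, 91))
result = pick_random_unowned(all_ids, owned, x=3, rng=random.Random(1), max_attempts=10_000)
```

When fewer than X products are available, the function returns what it found after `max_attempts` probes instead of looping forever.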
Basically I have a database of words,
This database contains a rowID(primary key), the word and word length as table columns.
I want to select a random row where length = x and get the word at that row.
This is for an iPhone game project and it is high priority that the queries are as fast as possible (the searches are made in a game).
For instance:
SELECT * FROM WordsDB WHERE rowid >= (abs(random()) %% (SELECT max(rowid) FROM WordsDB)) LIMIT 1;
This query is really fast at selecting a random row, a lot faster than ORDER BY RANDOM() LIMIT 1. However, if I add the word length to the query I get issues:
SELECT * FROM WordsDB WHERE length = 9 AND rowid >= (abs(random()) %% (SELECT max(rowid) FROM WordsDB)) LIMIT 1
Presumably because the random row will not always have a length of 9.
I was just wondering what would be the fastest / most efficient way of doing this.
Thanks for your time
Note: the 2 % symbols are because it is in objective c and the query is set as a string.
This one seems to work ok for me:
select * from WordsDB
where length = 9
limit (abs(random()) % (select count(rowid) from WordsDB
where length = 9)), 1;
note that length = 9 appears in both where clauses.
Add index on length if it appears to be slow.
Add an index to the WordsDB.length
create index if not exists WordsDBLengthIndex on WordsDB (length);
should make selection on this field much faster
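Putting the two answers together, here is a small self-contained sketch using Python's built-in `sqlite3` module with an in-memory database. The sample words and the helper name `random_word` are made up for illustration. The random offset is drawn in the application instead of inside the SQL, which also handles the case where no word has the requested length.

```python
import random
import sqlite3


def random_word(conn, length, rng):
    """Return a random word of the given length, or None if there is none."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM WordsDB WHERE length = ?", (length,)
    ).fetchone()
    if count == 0:
        return None
    offset = rng.randrange(count)
    row = conn.execute(
        "SELECT word FROM WordsDB WHERE length = ? LIMIT 1 OFFSET ?",
        (length, offset),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE WordsDB (word TEXT, length INTEGER)")
words = ["cat", "dog", "algorithm", "beautiful", "chocolate"]
conn.executemany(
    "INSERT INTO WordsDB (word, length) VALUES (?, ?)",
    [(w, len(w)) for w in words],
)
# The index suggested above, so both queries can filter on length cheaply.
conn.execute("CREATE INDEX IF NOT EXISTS WordsDBLengthIndex ON WordsDB (length)")

word = random_word(conn, 9, random.Random(0))
```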
I have an SQL table which has two integers. Let these integers be a and b.
I want to SELECT out a random record, such that the record is selected with probability proportional to C + a/b for some constant C which I will choose.
So for example, if C = 0, and there are two records with a=1,b=2 and a=2,b=3, then for the first record C+a/b = 1/2 and for the second record C+a/b = 2/3. The weights sum to 7/6, so the SELECT query should choose the first record with probability 3/7 (about 0.43) and the second record with probability 4/7 (about 0.57).
I know SQL well (I thought), but I am not even sure where to begin here. I thought of first doing a select for "SUM(a/b)", and then selecting the first record at which the running sum of C+a/b exceeds a random number between 0 and C*number_of_records + SUM(a/b). But I don't really know how to do that.
You could do something like sorting by a random number multiplied by your other stuff, and just select top 1 from that query - something like:
SELECT TOP 1 (your column names)
FROM (your table)
ORDER BY Rand() * (your calculation)
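For comparison, the running-sum idea from the question can be sketched in application code as below. The function name `weighted_pick` is hypothetical, and the uniform number `u` is passed in so the behaviour is easy to check. Two caveats about the ORDER BY approach, as far as I know: SQL Server evaluates an unseeded RAND() once per query and not once per row, and multiplying a uniform number by the weight does not give selection probabilities exactly proportional to the weights.

```python
import random


def weighted_pick(records, c, u):
    """Pick an (a, b) record with probability proportional to c + a/b.

    u is a uniform random number in [0, 1).
    """
    weights = [c + a / b for a, b in records]
    target = u * sum(weights)
    running = 0.0
    for record, weight in zip(records, weights):
        running += weight
        if target < running:
            return record
    return records[-1]  # guard against floating-point rounding at the top end


records = [(1, 2), (2, 3)]        # weights 1/2 and 2/3 when C = 0
choice = weighted_pick(records, c=0, u=random.random())
```

With these two records and C = 0, any `u` below 3/7 selects the first record and anything above selects the second.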
I have a table named A. it has only one record with one field. It is an integer named number.
I want to create a view that has A.number records, each holding one of the non-negative integers less than A.number.
For example:
select A.number -----> 5
the view should show 5 records 0 1 2 3 4
P.S.: This is a real problem that I have simplified a lot. The real problem is like dividing a budget for a fixed period across its days.
This sounds a bit like it might be homework, so I'm wary of providing the code outright.
I can give a pointer for how to solve the question, though. You use a recursive CTE where each iteration adds one to the previous iteration. Just be sure to set the MAXRECURSION option if you need more than the default limit of 100 recursion levels. You can use a scalar subquery to key the view to the original table:
WITH numbers ( n ) AS (
SELECT 0 UNION ALL
SELECT 1 + n FROM numbers WHERE n < (select number from a) -1)
SELECT n FROM numbers
OPTION ( MAXRECURSION 500) --example
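To try the recursive CTE idea without a SQL Server instance, here is a sketch that runs the same pattern in SQLite through Python's built-in `sqlite3` module. This is an adaptation under assumptions: SQLite needs the RECURSIVE keyword, has no MAXRECURSION option, and the view name `numbers_view` is made up. The extra WHERE on the anchor row makes the view empty when the number is 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (number INTEGER)")
conn.execute("INSERT INTO A (number) VALUES (5)")

# The view counts from 0 up to A.number - 1, one row per recursion step.
conn.execute("""
    CREATE VIEW numbers_view AS
    WITH RECURSIVE numbers(n) AS (
        SELECT 0 WHERE (SELECT number FROM A) > 0
        UNION ALL
        SELECT n + 1 FROM numbers WHERE n < (SELECT number FROM A) - 1
    )
    SELECT n FROM numbers
""")

rows = [r[0] for r in conn.execute("SELECT n FROM numbers_view ORDER BY n")]
print(rows)
```

Because the view reads `A.number` each time it is queried, updating the single row in `A` changes the number of rows the view returns.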
If the number in your table will be < 2048 and you are on SQL Server, this will work for you:
CREATE VIEW MyView AS
SELECT number
FROM master..spt_values
WHERE type = 'p'
AND number < (SELECT value FROM yourTable)
Alternatively you could consider creating your own Numbers table with an appropriate size to suit your application if you require a higher limit, or are not on SQL Server that has this provided to you. Here is a link to a blog post on the idea of having a "Numbers table" handy.