I have a DB with two tables.
tblVideos has about 8 million rows and contains: Id (identity 1,1), videoId, Name, Tags, (FK) VideoProviderId.
tblVideoProviders holds about 6 providers at the moment and has 3 columns:
Id (identity 1,1, tinyint), Name, Url (used to build the link from the provider URL + video Id).
Unlike YouTube, the smaller providers don't have an API that returns an array I could pick something random from.
Retrieving a totally random row takes under a second with both approaches I have now:
select top 1 tblVideoProvider.Url + tblVideos.videoId as url,
       tblVideos.Name, tblVideos.tags
from tblVideos
inner join tblVideoProvider
        on tblVideos.VideoProviderId = tblVideoProvider.id
WHERE (ABS(CAST(BINARY_CHECKSUM(tblVideos.id, NEWID()) as int)) % 6800000) < 10
or this, which runs slightly longer:
select top 1 tblVideoProvider.Url + tblVideos.videoId as url,
       tblVideos.Name, tblVideos.tags
from tblVideos
inner join tblVideoProvider
        on tblVideos.VideoProviderId = tblVideoProvider.id
ORDER BY NEWID()
but once I start looking for something more specific:
select top 1 tblVideoProvider.Url + tblVideos.videoId as url,
       tblVideos.Name, tblVideos.tags
from tblVideos
inner join tblVideoProvider
        on tblVideos.VideoProviderId = tblVideoProvider.id
where (tblVideos.tags like '%' + @tag + '%')
   or (tblVideos.Name like '%' + @tag + '%')
ORDER BY NEWID()
The query hits 8 seconds; removing the second LIKE (the one on tblVideos.Name) takes it down to 4~5 seconds, but that's still way too high.
Running the query without the "order by newid()" takes far less time, but then the application pulls about 0.2~2 MB of data per user, and with 200~400 simultaneous requests that adds up to a lot of data.
In general the "like" operator is very expensive, and when the pattern starts with a "%" even an index on the respective column (assuming you have one) cannot be used. I think there is no easy way to increase the performance of your query.
Related
Background
I have a front-end with a list of items with infinite scrolling, and I fetch pages of items by specifying the page limit and offset.
Problem
Apart from simply ordering the result by some of the columns, I would like to add a "random" option. The thing is, I don't want repetitions, so I need the entire dataset permuted before applying the limit and offset, and I need to get the same permutation as long as I supply the same seed.
What I tried
A naive approach was to write a table-valued function that takes an int seed and uses it in the ORDER BY clause like so:
SELECT *
FROM dbo.Entities e
ORDER BY HASHBYTES('MD2', e.Title) ^ @seed
OFFSET 0 ROWS
FETCH NEXT (SELECT COUNT(*) FROM dbo.Entities) ROWS ONLY
This seemed to work well at first glance, but it turned out not to be very "volatile", for lack of a better word - it becomes more visible with sparse result sets, where most seeds (chosen randomly from between 0 and 2147483647) yield the same order.
I thought I would get better results by hashing the seed as well, but SQL Server doesn't allow me to XOR two varbinary values. Am I even looking in the right direction? Are there any performance considerations I should be making that I might not be aware of?
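(As an editorial aside, one hedged workaround for that XOR restriction: cast each hash to BIGINT first - SQL Server keeps the trailing 8 bytes of a longer binary value - after which the XOR is legal. A sketch, not a tested fix:)

DECLARE @seed INT = 123456789;

SELECT *
FROM dbo.Entities e
ORDER BY CAST(HASHBYTES('MD2', e.Title) AS BIGINT)
       ^ CAST(HASHBYTES('MD2', CAST(@seed AS VARCHAR(11))) AS BIGINT)
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;  -- page size is illustrative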
The best way is to create a tally table with two columns: first a sequential integer (between 1 and 1,000,000), second a random integer. Then generate a random number to pick your entry point, and join against a computed ROW_NUMBER().
CREATE TABLE T_NUM (SEQUENTIAL INT, RANDOM INT);
GO

WITH N AS (
    SELECT 0 AS I
    UNION ALL
    SELECT I + 1 FROM N WHERE I < 9
)
INSERT INTO T_NUM (SEQUENTIAL)
SELECT N1.I + N2.I * 10 + N3.I * 100 + N4.I * 1000
     + N5.I * 10000 + N6.I * 100000 + 1   -- +1 so SEQUENTIAL runs 1..1,000,000
FROM N AS N1
CROSS JOIN N AS N2
CROSS JOIN N AS N3
CROSS JOIN N AS N4
CROSS JOIN N AS N5
CROSS JOIN N AS N6;
GO

WITH T AS (
    SELECT SEQUENTIAL,
           ROW_NUMBER() OVER (ORDER BY CHECKSUM(NEWID())) AS ALEA
    FROM T_NUM
)
UPDATE N
SET RANDOM = ALEA
FROM T_NUM AS N
JOIN T ON T.SEQUENTIAL = N.SEQUENTIAL;
GO

DECLARE @SEED INT = FLOOR(1 + RAND() * 1000000);
Now you have a seed that gives you an entry point into the random sequence; join your table to T_NUM on its sequential order.
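A minimal sketch of that join (it assumes a dbo.Entities table with an Id column, neither of which the answer spells out); rotating the random column by the seed gives a deterministic permutation per seed:

DECLARE @SEED INT = FLOOR(1 + RAND() * 1000000);

WITH E AS (
    -- Number the entities so they line up with T_NUM.SEQUENTIAL.
    SELECT e.*, ROW_NUMBER() OVER (ORDER BY e.Id) AS RN
    FROM dbo.Entities e
)
SELECT E.*
FROM E
JOIN T_NUM T ON T.SEQUENTIAL = E.RN
-- Same @SEED => same order; a different @SEED => a different rotation.
ORDER BY (T.RANDOM + @SEED) % 1000000
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;  -- page size is illustrative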
ORDER BY HASHBYTES('MD2', e.Title + convert(nvarchar(max), @seed))
should work, but performance-wise it would be a disaster: you would calculate MD2 for every record, every time. I would not do this on the server side at all. You can generate the random sequence on the client and then just pick rows number 158, 7, 1027 and 9 from the server. But that still has two problems:
if an item is deleted, the row numbers of all subsequent records shift; that breaks the whole sequence, and you get duplicates and missing records
computing row numbers over millions of records is not that fast either
I see two options. You can query all ids from the table and use them to generate the random order, but that is a lot of numbers to ship around. Or you ensure the id space is dense enough, then query 20 random ids and hope at least 10 of them exist. If you are unlucky, you query again.
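A hedged sketch of that second option (table and column names are assumptions; the candidate ids would come from the client's seeded generator):

-- The client generated candidate ids (the four from the answer plus more);
-- keep whichever actually exist and take the first 10 in the client's order.
SELECT TOP (10) e.*
FROM (VALUES (158, 1), (7, 2), (1027, 3), (9, 4) /* ...more pairs... */
     ) AS c(Id, Ord)
JOIN dbo.Entities e ON e.Id = c.Id
ORDER BY c.Ord;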
I'm writing a SQL Server procedure to optimize the cutting of bars. I haven't found the best method yet; a recursive CTE seems to be the way, but I'm stuck.
For my test, I have to cut 18 pieces (3 of 1000 mm, 3 of 1500 mm, 3 of 2500 mm, 3 of 3500 mm, 3 of 4500 mm and 3 of 6000 mm), and I have 3 sizes of bars (5500 mm, 7000 mm and 8500 mm).
From that, I generate every combination of bars with every possible set of cuts.
I tried with a while loop and a temporary table; it took an hour and a half. I think I can do better with a CTE...
Now I must generate every combination of several bars that covers my 18 cuts. I wrote another recursive CTE, but I haven't found a way to stop the recursion once at least one combination has all the cuts. So the query finds over 150 million combinations, with 8, 9, 10, 11... bars, and it keeps looping all the way up to 18 bars. I want it to stop at 8 bars (I know that is the smallest bar count that covers my cuts). It takes more than two days!
I have 2 temporary tables. One holds my bar combinations (#COMBI_BARRE) with this structure: ID_ART (identity for the article), COLOR, CUT_COMBI (a varchar concatenating the cut IDs of the bar's combination: 1-2-3-4...), NB_CUTS (an integer with the count of cuts in the bar), FIRST_CUT (the smallest cut ID in the bar).
The other temporary table, #DET_BAR, has the detail of my cuts, with 2 columns: ID_COMBI_BAR (the bar combination ID) and ID_CUT_STR (the cut ID as varchar, to avoid cast/convert in the CTE for better performance).
I store the result in a table called Combi with: ID_ART, COLOR, a varchar column COMBI concatenating the bar combination IDs (1-2-3-4...), a varchar column COMBI_CUT concatenating the cut IDs (1-2-3-4-5...), NB_BAR (the count of bars in the combination), NB_CUTS (the count of cuts in the combination), and MAX_CUTS (the total number of cuts I must make for my article and color).
Since it does one loop per bar, I tried to add an EXISTS clause to stop the recursion once some combination at the current level already has all my cuts - I must not cut 10 bars if I can do it with 8. But I get the error "Recursive member of a common table expression has multiple recursive references".
How can I write this query and avoid the unnecessary loops?
;WITH Combi (ID_ART, COLOR, COMBI, COMBI_CUT, NB_BAR, NB_CUTS, MAX_CUTS)
AS
(
    SELECT C.ID_ART,
           C.COLOR,
           '-' + ID_COMBI_BAR_STR + '-',
           '-' + C.CUT_COMBI + '-',
           1,
           C.NB_CUTS,
           ISNULL(MAXI.CUT_NUM, 0)
    FROM #COMBI_BARRE C with(nolock)
    outer apply (select top 1 D.CUT_NUM
                 from #DEBITS D
                 where D.ID_ART = C.ID_ART
                   and D.COLOR = C.COLOR
                 order by D.NUM_OCC_DEB desc) MAXI
    WHERE C.FIRST_CUT = 1

    UNION ALL

    SELECT C.ID_ART,
           C.COLOR,
           Combi.COMBI + ID_COMBI_BAR_STR + '-',
           Combi.COMBI_CUT + C.CUT_COMBI + '-',
           Combi.NB_BAR + 1,
           Combi.NB_CUTS + C.NB_CUTS,
           Combi.MAX_CUTS
    FROM #COMBI_BARRE C with(nolock)
    INNER JOIN Combi on C.ID_ART = Combi.ID_ART
                    and C.COLOR = Combi.COLOR
    where C.FIRST_CUT > Combi.NB_BAR
      and Combi.NB_CUTS + C.NB_CUTS <= Combi.MAX_CUTS
      and NOT EXISTS (select * from #DET_BAR D with(nolock)
                      where D.ID_COMBI_BAR = C.ID_COMBI_BAR
                        and PATINDEX(D.ID_CUT_STR, Combi.COMBI_CUT) > 0)
      and NOT EXISTS (select top 1 * from Combi Combi2
                      where Combi2.ID_ART = C.ID_ART
                        and Combi2.COLOR = C.COLOR
                        and Combi2.NB_CUTS = Combi2.MAX_CUTS)
)
select * from Combi
This is a variation of the bin packing problem; that search term might point you in the right direction.
Also, you can go to my Bin Packing page, which describes several approaches to a simplified version of your problem.
A small warning: the linked articles don't use any (recursive) CTE, so they won't answer your specific CTE question.
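Not covered by the answer, but one way to address the asker's "stop at 8 bars" goal without a second recursive reference (the thing that raises the error) is to carry the depth in a column and cap it in the recursive member's WHERE clause. A toy illustration of the pattern, under the assumption that 8 is a safe upper bound:

DECLARE @MaxBars INT = 8;

WITH Combi AS
(
    SELECT 1 AS NB_BAR               -- anchor: a combination of one bar
    UNION ALL
    SELECT Combi.NB_BAR + 1          -- recursive step: add one more bar
    FROM Combi
    WHERE Combi.NB_BAR < @MaxBars    -- depth cap; only one reference to Combi
)
SELECT NB_BAR FROM Combi;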
I have a table with a primary key made of two 32-bit integers. I want to filter by an explicit list of these pairs, and I want to know the fastest approach. There are 3 ways I can think of. My question simply is: which is faster, the second method or the third?
I do not want to use the 1st method because with many pairs to list (I'm only filtering for 2 rows in this example) it gets messy or needs a temp table, so it is not as concise:
select *
from [table]
where
(
    ([int1] = 123 and [int2] = 456)
    OR ([int1] = 654 and [int2] = 321)
    --etc
)
2nd method: convert to varchar
select *
from [table]
where convert(varchar(10), [int1]) + ',' + convert(varchar(10), [int2]) IN ('123,456','654,321')
3rd method: combine the two 32-bit integers into a single 64-bit integer
select *
from [table]
where convert(bigint, [int1]) * 4294967296 + [int2] IN (528280977864,2808908611905)
Edit
Thanks to the suggestion from Aron, I tried using statistics - these are the results on a table with > 1 million rows, averaged over 10 trials each:
Time statistics                Method 1   Method 2   Method 3
Client processing time             22.1        2.7        2.9
Total execution time              300.5     1099.8     1317.3
Wait time on server replies       278.4     1097.1     1314.4
So querying the columns as-is is by far the fastest; but picking between the second and third methods, varchar is faster (which surprises me).
Your first method:
select *
from [table]
where ([int1] = 123 and [int2] = 456) OR
      ([int1] = 654 and [int2] = 321)
      --etc
Should be the fastest, because it can take advantage of an index on (int1, int2). For a large list, perhaps the fastest method is to store the pairs in a temporary table with an index (clustered or unclustered) on (int1, int2).
I would shy away from playing around with the values. The bulk of the effort of the query is reading the data pages; slight variations in comparison logic will have little impact on the query.
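A minimal sketch of that temporary-table variant (the table and key names are assumptions, not from the answer):

-- Pairs to filter by, keyed so the join can seek.
CREATE TABLE #pairs (
    int1 int NOT NULL,
    int2 int NOT NULL,
    PRIMARY KEY (int1, int2)
);

INSERT INTO #pairs (int1, int2)
VALUES (123, 456), (654, 321);  -- etc.

SELECT t.*
FROM [table] t
JOIN #pairs p
  ON p.int1 = t.[int1]
 AND p.int2 = t.[int2];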
Maybe you need to give a better example?
I tried your example and performance looks fine. A bigger result set might be more telling; try looking at the estimated plan.
create table #table (int1 int,int2 int)
insert into #table values(123,456);
insert into #table values(654,321);
select *
from #table
where
(
([int1] = 123 and [int2] = 456)
OR ([int1] = 654 and [int2] = 321)
)
select *
from #table
where convert(varchar(10), [int1]) +'-'+ convert(varchar(10), [int2]) IN ('123-456','654-321')
select *
from #table
where convert(bigint, [int1]) * 4294967296 + [int2] IN (528280977864,2808908611905)
--drop table #table
All three give almost the same estimated cost: 33% each...
The 1st method, the one you don't want to use because it's messy, seems to be the fastest way; just keep the values in two columns and index them.
Query speed in SQL doesn't depend so much on the number of fields queried or the complexity of the query; it mostly depends on whether the query can use an index.
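For completeness, a sketch of the index that answer implies (the index name is a placeholder):

-- A composite index over the pair lets method 1 resolve each OR branch
-- with an index seek.
CREATE INDEX IX_table_int1_int2 ON [table] ([int1], [int2]);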
I'm attempting to do a single query for a report that I need and I'm not sure how to get past a speed issue.
Expected Outcome
I need a single row per patient that lists all of their diagnosis codes in the same column. My code works and gets the job done, but it slows down my runs, which must be repeated 30 times under different criteria, and would turn a 5-minute process into about 30.
Attempted Resolution
I am using the following code as a left outer join:
left outer join
    (Select distinct add2.VisitID,
            substring((Select ', ' + add1.Diagnosis AS [text()]
                       From AbsDrgDiagnoses add1
                       Where add1.VisitID = add2.VisitID
                       ORDER BY add1.VisitID, DiagnosisSeqID
                       For XML PATH ('')), 2, 1000) DiagText
     From [Livendb].[dbo].[AbsDrgDiagnoses] add2) add3
    on diag.VisitID = add3.VisitID
Outcome
This works, but my 9-second query over a month of data, filtered on just 1 of the 30 codes, rises to 1m 12s. The subquery takes 3m 49s if I run it by itself, so the join is an improvement, but I would like to slim this down if possible.
Other Attempted Resolutions
I attempted to create a view from the query and use that, but got the same run time.
I also added SourceID, which is always the same value but is part of the index on my 8 tables; it actually slightly increased my time.
Conclusion
The table I need to merge contains around 30 million rows, which is most likely the issue, and there may be no way around the increased time, but I'm hoping someone has a trick that could help me decrease it.
This is your subquery:
(Select distinct add2.VisitID,
substring((Select ', '+add1.Diagnosis AS [text()]
From AbsDrgDiagnoses add1
Where add1.VisitID = add2.VisitID
order by add1.VisitID,DiagnosisSeqID
For XML PATH ('')
), 2, 1000) DiagText
From [Livendb].[dbo].[AbsDrgDiagnoses] add2
) add3
on diag.VisitID = add3.VisitID
Let me assume that when you remove it, the query is fast.
I think you would be better off with outer apply:
outer apply
    (select stuff((Select ', ' + add1.Diagnosis as [text()]
                   From AbsDrgDiagnoses add1
                   Where diag.VisitID = add1.VisitID
                   order by DiagnosisSeqID
                   For XML PATH ('')
                  ), 1, 2, '') DiagText
    ) add3
I can't imagine that the second level of subqueries actually helps performance.
And, speaking of performance, you can use an index on AbsDrgDiagnoses(VisitID, DiagnosisSeqID, Diagnosis).
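A sketch of one way to shape that index (the INCLUDE variant is my assumption; a plain three-column key works too):

CREATE INDEX IX_AbsDrgDiagnoses_VisitID
    ON AbsDrgDiagnoses (VisitID, DiagnosisSeqID)
    INCLUDE (Diagnosis);  -- covers the subquery without widening the key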
I am working with legal software called Case Aware. You can do limited SQL searches in it, and I have had luck getting Case Aware to pull a specific value from the database. My problem is that I need a SQL search that returns multiple values, but Case Aware will only accept one result as an answer. If my query produces a list, it only recognizes the top value. This is a limitation of the software I cannot get around.
My very basic search is:
select rate
From case_fin_info
where fin_info_id = 7 and rate!=0
This should produce a list of 3-15 rates, which it does when the search is run straight against the database. However, when run through Case Aware, only the first rate in the table is pulled. I need to pull the values through Case Aware because it can automatically insert the results into a template. (Where I work generates hundreds if not thousands of documents a day, so doing it manually is a B$##%!)
I need to find a way to pull all the values from the search into one value. I cannot use XML (Case Aware will give an error) and I cannot create a temporary table (again, a Case Aware limitation). If possible, I also need to insert a manual return between each value so they are separated in the document I am pulling this information into.
Case Aware does not have any user manual and you pay for support (We do have it) but I have my doubts on their abilities. I have been able to easily create queries that they have told me in the past are impossible. I am hoping this is one of those times.
IntegrationGirly
Addtl FYI:
I currently have this kludge: pulling each value individually from the database, even if it is null, and putting each value into a table in the document (30 separate searches). It "works", but takes much longer for the document to generate and leaves a great deal of empty space. Some cases have 3 values, most have 5-10, but we have up to 30 slots for rates because once in a blue moon we need them. This makes the template look horribly junky; that doesn't affect the lawyers who generate the docs, since they don't see it, but every time they generate the table they have to take out all the empty columns. With the number of docs we do each day, 1) this becomes time consuming, and 2) it assumes attorneys and paralegals know how to take rows out of tables in Word.
First, my condolences for having to work with such terrible software.
Second, here's a possible solution (this is assuming SQL Server):
1) Execute SELECT COUNT(*) FROM case_fin_info WHERE fin_info_id = 7 AND rate <> 0 and store the result (the number of rows) in your client application.
2) In your client app, do a for (i = 1; i <= count; i++) loop. During each iteration, perform the query
WITH OrderedRates AS
(
SELECT Rate, ROW_NUMBER() OVER (ORDER BY <table primary key> ASC) AS 'RowNum'
FROM case_fin_info WHERE fin_info_id = 7 AND rate <> 0
)
SELECT Rate FROM OrderedRates WHERE RowNum = <i>
Replacing the stuff in <> as appropriate - <i> is the loop variable. Essentially you get the row count in your client app, then fetch one row at a time. It's inefficient as hell, but if you only have 15 rows it shouldn't be too bad.
I had a similar query to implement in my application. This should work.
DECLARE @Rate VARCHAR(8000);
SELECT @Rate = COALESCE(@Rate + ', ', '') + CAST(rate AS VARCHAR(20))
From case_fin_info where fin_info_id = 7 and rate != 0;
SELECT @Rate;
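Since the question asks for a manual return between the values, a variant of the same aggregation (the CAST assumes rate is numeric) can use CR + LF as the separator:

DECLARE @Rate VARCHAR(8000);

-- Same trick, but separate the values with carriage return + line feed.
SELECT @Rate = COALESCE(@Rate + CHAR(13) + CHAR(10), '')
             + CAST(rate AS VARCHAR(20))
FROM case_fin_info
WHERE fin_info_id = 7 AND rate != 0;

SELECT @Rate;  -- single row, single column, as Case Aware requires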
Here's a single query that returns the one result in a single column. It assumes your manual return is CR + LF and that a RateID column numbers the rates, and you would need to expand it to handle all 15 rates. The ISNULL wrappers keep a missing rate from turning the whole result NULL:
SELECT ISNULL(max(Rate1), '') + CHAR(13) + CHAR(10)
     + ISNULL(max(Rate2), '') + CHAR(13) + CHAR(10)
     + ISNULL(max(Rate3), '') + CHAR(13) + CHAR(10)
     + ISNULL(max(Rate4), '') + CHAR(13) + CHAR(10)
     + ISNULL(max(Rate5), '')
FROM (
SELECT CASE RateID WHEN 1 THEN CAST(rate as varchar) END AS Rate1,
CASE RateID WHEN 2 THEN CAST(rate as varchar) END AS Rate2,
CASE RateID WHEN 3 THEN CAST(rate as varchar) END AS Rate3,
CASE RateID WHEN 4 THEN CAST(rate as varchar) END AS Rate4,
CASE RateID WHEN 5 THEN CAST(rate as varchar) END AS Rate5
FROM
(
select RateID, rate From case_fin_info where fin_info_id = 7 and rate!=0
) as r
) as Rates