Optimize query with many OR statements in WHERE clause

What is the best way to write a query that gives a result equivalent to this:
SELECT X,Y,* FROM TABLE
WHERE (X = 1 AND Y = 2) OR (X = 2235 AND Y = 324) OR...
Table has clustered index (X, Y).
The table is huge (millions of rows) and there can be hundreds of OR conditions.

You can create another table with columns X and Y, insert the value pairs into it, and then join it with the original table:
create table XY_Values(X int, Y int)
Insert into XY_Values values
(1,2),
(2235,324),
...
Then
SELECT X,Y,* FROM TABLE T
join XY_Values V
on T.X=V.X
and T.Y=V.Y
You could create an index on (X, Y) on XY_Values, which will boost performance.
You could also create XY_Values as a table variable.

I think you can fill a temp table with the hundreds of X and Y values and then join against it. Like:
DECLARE @Temp TABLE
(
    X int,
    Y int
)
Prefill this with your search values and then join.
(Or use another physical table that stores the search settings.)
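A runnable sketch of this pair-table join, using SQLite through Python's stdlib sqlite3 (table names and sample rows are invented for illustration; in SQL Server the pair table would be a temp table or table variable as described above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (X INT, Y INT, Payload TEXT)")
conn.executemany("INSERT INTO T VALUES (?, ?, ?)",
                 [(1, 2, "a"), (2235, 324, "b"), (5, 5, "c")])

# The table of search pairs, filled from the application side instead of
# building a long WHERE ... OR ... OR ... clause.
conn.execute("CREATE TEMP TABLE XY_Values (X INT, Y INT)")
conn.executemany("INSERT INTO XY_Values VALUES (?, ?)", [(1, 2), (2235, 324)])

rows = conn.execute("""
    SELECT T.X, T.Y, T.Payload
    FROM T
    JOIN XY_Values V ON T.X = V.X AND T.Y = V.Y
    ORDER BY T.X
""").fetchall()
# only the rows matching a (X, Y) pair come back
```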

This will do better:
select t.*
from table t
join (select 1 as x,2 as y
union
...) t1 on t.x=t1.x and t.y=t1.y

If you use too many OR conditions, the execution plan won't use the indexes.
It is better to create multiple statements and merge the results using UNION ALL.
SELECT X,Y,*
FROM TABLE
WHERE (X = 1 AND Y = 2)
union all
SELECT X,Y,*
FROM TABLE
WHERE (X = 2235 AND Y = 324)
union all...

Why does Snowflake not report ambiguous column references for USING joins

Given this query, Snowflake returns a result of 2, arbitrarily resolving y to table T:
select y
from (select 1 x, 2 y) T
join (select 1 x, 3 y) T1 using (x)
while at the same time returning an ambiguous column error when using a qualified join instead:
select y
from (select 1 x, 2 y) T
join (select 1 x, 3 y) T1 on T.x = T1.x
What's the set of rules that determine whether a column reference is ambiguous in Snowflake SQL? Postgres considers both of these queries ambiguous.
This answer is just an observation. It seems the column is chosen depending on the order of the joins (left to right):
CREATE OR REPLACE TABLE T(x INT, y INT) AS select 1, 2 UNION SELECT 10, 20;
CREATE OR REPLACE TABLE T1(x INT, y INT) AS select 1, 3 UNION SELECT 10, 30;
-- disabling cache
ALTER SESSION SET USE_CACHED_RESULT=FALSE;
Query profile:
explain using tabular
select y
from T
join T1 using (x);
Output: [query profile screenshot omitted]
Swapped join order:
explain using tabular
select y
from T1
join T using (x);
Output: [query profile screenshot omitted]

How to select specific rows or the whole table efficiently using a user-defined type?

I am having some problems using a user-defined type to pass some identifiers and select several rows at the same time.
For example:
User defined type:
CREATE TYPE IntList AS Table (n int UNIQUE)
Usage with stored procedure:
CREATE PROCEDURE spBarsGet
    @lBars IntList READONLY
AS
SELECT *
FROM Bars
WHERE id IN (SELECT n FROM @lBars)
When all the rows from Bars need to be returned, the same procedure can be reused by changing the query and sending an empty list:
SELECT *
FROM Bars
WHERE
(NOT EXISTS (SELECT NULL FROM @lBars) OR
 Id IN (SELECT n FROM @lBars))
or this:
DECLARE @Aux int
SELECT @Aux = COUNT(n) FROM @lBars
SELECT *
FROM Bars
WHERE (@Aux = 0 OR Id IN (SELECT n FROM @lBars))
Both options work; however, when the table has many rows (more than about 1 million) the query becomes very slow, much slower than two separate queries, one for each case:
Select specific Bars:
SELECT *
FROM Bars
WHERE id IN (SELECT n FROM @lBars)
Select all the rows in the table:
SELECT * FROM Bars
I am looking for a better (faster, more efficient) way to achieve the described behavior using only one query, in this case the same stored procedure.
Any suggestion will be appreciated.
You can try this.
SELECT b.*
FROM Bars b
JOIN @lBars lb ON b.Id = lb.n
UNION
SELECT b.*
FROM Bars b
WHERE NOT EXISTS ( SELECT 1
                   FROM @lBars );
You can use UNION ALL:
select *
from ( (select . . .  -- qualify column names
        from bars b
        where exists (select 1 from @lBars b1 where b1.n = b.id)
       ) union all
       (select . . .  -- qualify column names
        from bars b
        where not exists (select 1 from @lBars)
       )
     ) t;
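The "empty list means all rows" pattern can be sketched with SQLite via Python's stdlib sqlite3. A temp table stands in for the table-valued parameter here, and the names and data are illustrative only:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Bars (id INT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO Bars VALUES (?, ?)",
                 [(i, f"bar{i}") for i in range(1, 6)])

def get_bars(ids):
    # A temp table plays the role of the @lBars table-valued parameter.
    conn.execute("DROP TABLE IF EXISTS lBars")
    conn.execute("CREATE TEMP TABLE lBars (n INT)")
    conn.executemany("INSERT INTO lBars VALUES (?)", [(i,) for i in ids])
    return conn.execute("""
        SELECT b.* FROM Bars b JOIN lBars lb ON b.id = lb.n
        UNION ALL
        SELECT b.* FROM Bars b
        WHERE NOT EXISTS (SELECT 1 FROM lBars)
    """).fetchall()

assert len(get_bars([2, 3])) == 2  # only the listed ids
assert len(get_bars([])) == 5      # empty list -> whole table
```

The two UNION ALL branches are mutually exclusive: exactly one of them produces rows depending on whether the list is empty.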

SQL Join on sequence number

I have 2 tables (A, B). They each have a different column that is basically an order or a sequence number. Table A has 'Sequence' and the values range from 0 to 5. Table B has 'Index' and the values are 16740, 16744, 16759, 16828, 16838, and 16990. Unfortunately I do not know the significance of these values. But I do believe they will always match in sequential order. I want to join these tables on these numbers where 0 = 16740, 1 = 16744, etc. Any ideas?
Thanks
You could use a CASE expression to convert table A's values to table B's values (or vice versa) and join on that:
SELECT *
FROM a
JOIN b ON a.[sequence] = CASE b.[index] WHEN 16740 THEN 0
WHEN 16744 THEN 1
WHEN 16759 THEN 2
WHEN 16828 THEN 3
WHEN 16838 THEN 4
WHEN 16990 THEN 5
ELSE NULL
END;
@Mureinik has a great example. If down the road you end up adding more numbers, putting this mapping into a new table would be a good idea.
CREATE TABLE C(
    AInfo INT,
    BInfo INT
)
INSERT INTO C (AInfo, BInfo) VALUES (0, 16740)
INSERT INTO C (AInfo, BInfo) VALUES (1, 16744)
etc
Then you can Join all the tables.
If the values are in ascending order as in your example, you can use the ROW_NUMBER() function to achieve this:
;WITH cte AS (SELECT *, ROW_NUMBER() OVER(ORDER BY [Index]) - 1 AS RN
              FROM B)
SELECT *
FROM A
JOIN cte ON A.[Sequence] = cte.RN
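A runnable sketch of the ROW_NUMBER() mapping idea, using SQLite via Python's stdlib sqlite3 (sample values follow the question; column names beyond Sequence and Index are invented, and this assumes B's Index values really do ascend in step with A's Sequence):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE A (Sequence INT, Info TEXT)")
conn.execute("CREATE TABLE B ([Index] INT, Data TEXT)")
conn.executemany("INSERT INTO A VALUES (?, ?)",
                 [(i, f"a{i}") for i in range(6)])
conn.executemany("INSERT INTO B VALUES (?, ?)",
                 [(v, f"b{v}") for v in (16740, 16744, 16759, 16828, 16838, 16990)])

# Number B's rows in ascending [Index] order, starting at 0, then join
# that derived sequence number against A.Sequence.
rows = conn.execute("""
    WITH cte AS (
        SELECT [Index], Data,
               ROW_NUMBER() OVER (ORDER BY [Index]) - 1 AS RN
        FROM B
    )
    SELECT A.Sequence, cte.[Index]
    FROM A JOIN cte ON A.Sequence = cte.RN
    ORDER BY A.Sequence
""").fetchall()
# pairs 0 with 16740, 1 with 16744, ... 5 with 16990
```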

Subtract Values from Two Different Tables

Consider table X:
A
-
1
2
3
3
6
Consider table Y:
A
-
0
4
2
1
9
How do you write a query that takes the difference between these two tables, to compute the following table (say table Z):
A
-
1
-2
1
2
-3
It's not clear what you want. Could it be this?
SELECT (SELECT SUM(A) FROM X) -
(SELECT SUM(A) FROM Y)
AS MyValue
Marcelo is 100% right: in a true relational database the order of a result set is never guaranteed. That said, some databases do consistently return sets in an order.
So if you are willing to risk it, here is one solution. Make two tables with autoincrement keys like this:
CREATE TABLE Sets (
id integer identity(1,1)
, val decimal
)
CREATE TABLE SetY (
id integer identity(1,1)
, val decimal
)
Then fill them with the X and Y values:
INSERT INTO Sets (val) SELECT A FROM X
INSERT INTO SetY (val) SELECT A FROM Y
Then you can do this to get your answer:
SELECT X.ID, X.Val, Y.Val, X.val-Y.val as Difference
FROM Sets X
LEFT OUTER JOIN SetY Y
ON Y.id = X.ID
I would cross my fingers first though! If there is any way you can get a proper key in your table, please do so.
Cheers,
Daniel
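The same pairing-by-insertion-order idea can be sketched in SQLite via Python's stdlib sqlite3, using rowid in place of the IDENTITY column. The same caveat applies: this only works if insertion order is a meaningful key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (A INT)")
conn.execute("CREATE TABLE Y (A INT)")
conn.executemany("INSERT INTO X VALUES (?)", [(v,) for v in (1, 2, 3, 3, 6)])
conn.executemany("INSERT INTO Y VALUES (?)", [(v,) for v in (0, 4, 2, 1, 9)])

# rowid numbers the rows by insertion order, playing the role of the
# identity(1,1) column in the answer above; join the two numbered sets.
diffs = [r[0] for r in conn.execute("""
    WITH xs AS (SELECT rowid AS id, A FROM X),
         ys AS (SELECT rowid AS id, A FROM Y)
    SELECT xs.A - ys.A
    FROM xs JOIN ys ON xs.id = ys.id
    ORDER BY xs.id
""")]
# diffs -> [1, -2, 1, 2, -3], matching table Z in the question
```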

How to do this data transformation

This is my input data
GroupId Serial Action
1 1 Start
1 2 Run
1 3 Jump
1 8 End
2 9 Shop
2 10 Start
2 11 Run
For each group I want to find pairs of actions whose serial numbers differ by k, and count how many times each (FirstAction, NextAction) pair occurs.
Suppose k = 1, then output will be
FirstAction NextAction Frequency
Start Run 2
Run Jump 1
Shop Start 1
How can I do this in SQL, fast enough given the input table contains millions of entries.
tful, this should produce the result you want, but I don't know if it will be as fast as you'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;
with ActionsDoubled(Serial,Tag,"Action") as (
  select Serial, 'a', "Action"
  from Actions as A
  union all
  select Serial - @k, 'b', "Action"
  from Actions as B
), Pivoted(Serial,a,b) as (
select Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various @k values on stable data, this may work better in the long run:
declare @k int = 1;
select Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select Serial - @k, 'b', "Action"
from Actions as B;
go
create unique clustered index AD_S on ActionsDoubled(Serial,Tag);
create index AD_a on ActionsDoubled(Tag,Serial);
go
with Pivoted(Serial,a,b) as (
select Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;
SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)
GROUP BY a1.Action, a2.Action;
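This self-join can be checked against the question's sample data with SQLite via Python's stdlib sqlite3 (with a2.Serial = a1.Serial + k, a1 is the earlier action and therefore the FirstAction):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activities (GroupId INT, Serial INT, Action TEXT)")
conn.executemany("INSERT INTO Activities VALUES (?, ?, ?)",
                 [(1, 1, "Start"), (1, 2, "Run"), (1, 3, "Jump"),
                  (1, 8, "End"), (2, 9, "Shop"), (2, 10, "Start"),
                  (2, 11, "Run")])

k = 1
rows = conn.execute("""
    SELECT a1.Action, a2.Action, COUNT(*)
    FROM Activities a1
    JOIN Activities a2
      ON a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + ?
    GROUP BY a1.Action, a2.Action
""", (k,)).fetchall()
# yields (Start, Run) twice, (Run, Jump) once, (Shop, Start) once,
# matching the expected output in the question
```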
The problem is this: Your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately as separate queries. Especially if the size of each group is SMALL.
There's a lot going on under the hood and when the query has to do a scan of the entire table, this actually ends up being many times slower than if you did small chunks which effectively cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
    a1.Action AS FirstAction,
    a2.Action AS NextAction,
    COUNT(*) AS Frequency
FROM
    Activities a1 JOIN Activities a2
    ON (a1.GroupId = a2.GroupId
    AND a2.Serial = a1.Serial + @k)
WHERE
    a1.GroupId = 1
GROUP BY
    a1.Action,
    a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
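For what it's worth, an index on (GroupId, Serial) does let the join's inner lookup become an index search rather than a full scan. A small SQLite check via Python's stdlib sqlite3 (the index name idx_gs is invented for the demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activities (GroupId INT, Serial INT, Action TEXT)")
conn.executemany("INSERT INTO Activities VALUES (?, ?, ?)",
                 [(1, 1, "Start"), (1, 2, "Run"), (1, 3, "Jump"),
                  (1, 8, "End"), (2, 9, "Shop"), (2, 10, "Start"),
                  (2, 11, "Run")])
conn.execute("CREATE INDEX idx_gs ON Activities (GroupId, Serial)")

# Ask the planner how it would execute the pairing self-join.
plan = conn.execute("""
    EXPLAIN QUERY PLAN
    SELECT a1.Action, a2.Action
    FROM Activities a1
    JOIN Activities a2
      ON a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + 1
""").fetchall()
# one side of the join should appear as a SEARCH ... USING INDEX idx_gs step
```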