Query optimization: using UNION instead of OR and EXISTS instead of IN - sql

I have a query optimization issue.
For context, this query has always run instantly,
but today it took far longer (3+ hours),
so I tried to fix it.
The query looks like this:
Select someCols from A
inner join B left join C
Where A.date = Today
And (A.col In ( Select Z.colseekedinA from tab Z) -- A.col is the same column
-- as in the subquery below
OR
A.col In ( Select X.colseekedinA from tab X)
)
-- PART 1 ---
Select someCols from A
inner join B left join C -- takes 1 second 150 lines
Where A.date = Today
-- Part 2 ---
Select Z.colseekedinA from tab Z
OR -- Union -- takes 1 second 180 lines
Select X.colseekedinA from tab X
When I now join the two parts with the IN, the query becomes incredibly slow.
So I optimized it using UNION instead of OR and EXISTS instead of IN,
but it still takes 3 minutes.
I want to get it back down to 5 seconds.
Do you see an issue with the query?
Thank you.

Using Union and Exists
Select someCols
from A
inner join B on a.col = b.col
left join C on b.col = c.col
Where A.date = Today
and exists(
Select Z.colseekedinA from tab Z where Z.colseekedinA = A.col
Union
Select X.colseekedinA from tab X where x.colseekedinA = A.col )
Also, if possible change below join to Left join.
inner join B on a.col = b.col

An EXISTS that is not correlated to the outer row may give spurious results: you will get rows that match neither condition as long as at least one row somewhere does match. This can be avoided by using EXISTS with a correlated subquery, but it isn't something I have experimented with enough to recommend.
For speed I'd go for a cross apply and specify the parent table within the cross apply expression (correlated subquery to create a derived table). That way the join condition is specified before the data is returned, if the columns in question have indexes on them (i.e. they are primary keys) then the optimiser can work out an efficient plan for this.
Union all is used within the cross apply expression as this prevents a distinct sort within the derived table which is generally heavier in terms of cost than bringing the data itself back (union has to identify all rows anyway including duplications).
Finally if this is still slow then potentially you might want to add an index to the date column in table a. This overcomes the lack of sargability inherent in a date column and means the optimiser can leverage the index rather than scanning all of the rows in the result set and testing whether or not the date equals today.
Select someCols from A
inner join B left join C
cross apply (Select Z.colseekedinA from tab Z where a.col=z.colseekedinA
union all
Select X.colseekedinA from tab X where a.col=x.colseekedina) d
Where A.date = Today

Your code is confusing, but for the first part
you could try using a UNION for the inner subqueries (the ones combined with OR)
and avoid the IN clause by using an INNER JOIN:
Select someCols from A
inner join B
left join C
INNER JOIN (
Select Z.colseekedinA from tab Z
UNION
Select X.colseekedinA from tab X
) t on A.col = t.colseekedinA
Where A.date = Today
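To compare the rewrites discussed in this thread, here is a minimal sketch that runs them side by side. It uses Python's built-in sqlite3 module purely as a convenient stand-in engine, and the tables A, Z, X and their data are made up for illustration. It checks that the IN/OR form, a correlated EXISTS, and a join against a UNION return the same rows, and that joining against UNION ALL can repeat a row when a value appears in both lookup tables.

```python
import sqlite3

# Hypothetical miniature of the question's tables; names and data are invented.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (id INTEGER, col INTEGER);
    CREATE TABLE Z (colseekedinA INTEGER);
    CREATE TABLE X (colseekedinA INTEGER);
    INSERT INTO A VALUES (1, 10), (2, 20), (3, 30), (4, 40);
    INSERT INTO Z VALUES (10), (20);
    INSERT INTO X VALUES (20), (30);   -- 20 appears in both Z and X
""")

def ids(sql):
    """Run a query and return the matched A.id values in result order."""
    return [row[0] for row in con.execute(sql)]

# Original shape: two IN subqueries combined with OR.
in_or = ids("""
    SELECT A.id FROM A
    WHERE A.col IN (SELECT colseekedinA FROM Z)
       OR A.col IN (SELECT colseekedinA FROM X)
    ORDER BY A.id""")

# Correlated EXISTS over both lookup tables.
exists_union_all = ids("""
    SELECT A.id FROM A
    WHERE EXISTS (SELECT 1 FROM Z WHERE Z.colseekedinA = A.col
                  UNION ALL
                  SELECT 1 FROM X WHERE X.colseekedinA = A.col)
    ORDER BY A.id""")

# Join against a de-duplicated UNION of the lookup values.
join_union = ids("""
    SELECT A.id FROM A
    INNER JOIN (SELECT colseekedinA FROM Z
                UNION
                SELECT colseekedinA FROM X) t ON A.col = t.colseekedinA
    ORDER BY A.id""")

# Same join with UNION ALL: a value present in both tables matches twice.
join_union_all = ids("""
    SELECT A.id FROM A
    INNER JOIN (SELECT colseekedinA FROM Z
                UNION ALL
                SELECT colseekedinA FROM X) t ON A.col = t.colseekedinA
    ORDER BY A.id""")

print(in_or, exists_union_all, join_union, join_union_all)
```

This says nothing about speed on the real tables; it only shows which rewrites preserve the result, which is worth checking before comparing execution plans.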


Can I select several tables in the same WITH query?

I have a long query with a WITH structure. At the end of it, I'd like to output two tables. Is this possible?
(The tables and queries are in snowflake SQL by the way.)
The code looks like this:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
..... many more alias tables and subqueries here .....
)
select * from table_g where z = 3 ;
But for the very last line, I'd like to query table_g twice, once with z = 3 and once with another condition, so I get two tables as the result. Is there a way of doing that (ending with two queries rather than just one) or do I have to re-run the whole code for each table I want as output?
One query = One result set. That's just the way that RDBMS's work.
A CTE (WITH statement) is just syntactic sugar for a subquery.
For instance, a query similar to yours:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
select id,
product_c
from x.z )
select *
from table_a
inner join table_b on table_a.id = table_b.id
inner join table_c on table_b.id = table_c.id;
Is 100% identical to:
select *
from
(select id, product_a from x.x) table_a
inner join (select id, product_b from x.y) table_b
on table_a.id = table_b.id
inner join (select id, product_c from x.z) table_c
on table_b.id = table_c.id
The CTE version doesn't give you any extra features that aren't available in the non-cte version (with the exception of a recursive cte) and the execution path will be 100% the same (EDIT: Please see Simon's answer and comment below where he notes that Snowflake may materialize the derived table defined by the CTE so that it only has to perform that step once should the CTE be referenced multiple times in the main query). As such there is still no way to get a second result set from the single query.
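As a small check of the equivalence claimed above, the sketch below runs a CTE query and its derived-table rewrite against the same data and compares the rows. It uses Python's sqlite3 module as a stand-in engine rather than Snowflake, and the tables x and y with their contents are invented for the example, so it illustrates the result equivalence only, not Snowflake's execution plans.

```python
import sqlite3

# Hypothetical stand-ins for x.x and x.y from the question.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE x (id INTEGER, product_a TEXT);
    CREATE TABLE y (id INTEGER, product_b TEXT);
    INSERT INTO x VALUES (1, 'apple'), (2, 'pear');
    INSERT INTO y VALUES (1, 'bolt'), (3, 'nut');
""")

# CTE form.
cte_rows = con.execute("""
    WITH table_a AS (SELECT id, product_a FROM x),
         table_b AS (SELECT id, product_b FROM y)
    SELECT table_a.id, product_a, product_b
    FROM table_a
    INNER JOIN table_b ON table_a.id = table_b.id
    ORDER BY table_a.id""").fetchall()

# Same query with the CTEs inlined as derived tables.
derived_rows = con.execute("""
    SELECT table_a.id, product_a, product_b
    FROM (SELECT id, product_a FROM x) table_a
    INNER JOIN (SELECT id, product_b FROM y) table_b
            ON table_a.id = table_b.id
    ORDER BY table_a.id""").fetchall()

print(cte_rows == derived_rows, cte_rows)
```

Each `execute` call still yields exactly one result set, which matches the point that a second output needs a second query.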
While the two forms are equivalent in what they return, they don't necessarily get the same execution plan.
The first case is when one of the stages in the CTE is expensive and is reused many times, via other CTEs or joins. Under Snowflake, when such a stage is written as a CTE, I have witnessed it running the "expensive" part only once, which can be good. For example:
WITH expensive_select AS (
SELECT a.a, b.b, c.c
FROM table_a AS a
JOIN table_b AS b
JOIN table_c AS c
WHERE complex_filters
), do_some_thing_with_results AS (
SELECT a, stuff
FROM expensive_select
WHERE filters_1
), do_some_agregation AS (
SELECT a, SUM(b) as sum_b
FROM expensive_select
WHERE filters_2
GROUP BY a
)
SELECT a.a
,a.b
,b.stuff
,c.sum_b
FROM expensive_select AS a
LEFT JOIN do_some_thing_with_results AS b ON a.a = b.a
LEFT JOIN do_some_agregation AS c ON a.a = c.a;
This was originally unrolled, and the expensive part was some views where the date range filter applied at the top level was not getting pushed down (due to window functions), which resulted in full table scans, multiple times. Once the filters were pushed into the CTE, the cost was paid once. (In our case, putting date range filters in the CTE made Snowflake notice the filters and push them down into the view. Things can change, though: a few weeks later the original code ran as well as the modified version, so they "fixed" something.)
In other cases like this, the different paths that used the CTE each used smaller subsets of the results, so using the CTE reduced the remote IO and improved performance, although there were then more stalls in the execution plan.
I also use CTEs like this to make the code easier to read, by giving the CTE a meaningful name and then aliasing it to something short for use. Really love that.

SQL Server double left join counts are different

Code:
Select a.x,
a.y,
b.p,
c.i
from table1 a left join table2 b on a.z=b.z
left join table3 c on a.z=c.z;
When I am using the above code I am not getting the correct counts:
Table1 has 30 records.
After first left join I get 30 records but after 2nd left join I am getting 33 records.
I am having a hard time figuring out why I am getting different counts. As I understand it, I should get 30 rows even after the 2nd left join.
Can anyone help me understand this difference?
I am using SQL Server 2012.
There are multiple rows in table3 with the same z value.
You can find them by doing:
select z, count(*)
from table3
group by z
having count(*) >= 2
order by count(*) desc;
If you want at most one match, then outer apply can be useful:
Select a.x, a.y, b.p, c.i
from table1 a outer apply
(select top 1 b.*
from table2 b
where a.z = b.z
) b outer apply
(select top 1 c.*
from table3 c
where a.z = c.z
) c;
Of course, top 1 should be used with order by, but I don't know which row you want. And, this is probably a stop-gap; you should figure out why there are duplicates.
Your table3 contains more than one row for some rows in table1. Check for a z value that occurs more than once.
You can use GROUP BY with the MAX function to get a one-to-one match.
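The row inflation described in these answers is easy to reproduce. Below is a minimal sketch using Python's sqlite3 module as a stand-in for SQL Server, with made-up data in which one z value occurs twice in table3. It shows the extra row from the plain left join, the duplicate-finding query, and the GROUP BY / MAX variant that brings the join back to one row per table1 row. Taking MAX(i) is an arbitrary choice for the example; which row to keep depends on the data.

```python
import sqlite3

# Hypothetical data: z = 2 appears twice in table3.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE table1 (z INTEGER, x TEXT);
    CREATE TABLE table3 (z INTEGER, i INTEGER);
    INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO table3 VALUES (1, 100), (2, 200), (2, 250);
""")

# Plain left join: 3 rows in table1 become 4 rows in the result.
joined = con.execute("""
    SELECT a.z, a.x, c.i
    FROM table1 a LEFT JOIN table3 c ON a.z = c.z
    ORDER BY a.z, c.i""").fetchall()

# Locate the z values responsible for the extra rows.
duplicates = con.execute("""
    SELECT z, COUNT(*) FROM table3
    GROUP BY z HAVING COUNT(*) >= 2""").fetchall()

# Collapse table3 to one row per z before joining.
one_per_row = con.execute("""
    SELECT a.z, a.x, c.i
    FROM table1 a
    LEFT JOIN (SELECT z, MAX(i) AS i FROM table3 GROUP BY z) c ON a.z = c.z
    ORDER BY a.z""").fetchall()

print(len(joined), duplicates, len(one_per_row))
```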

Microsoft SQL Server performance: OR in ON clause [duplicate]

I need to join 2 tables using a field 'cdi' from the 2nd table with cdi or cd_cliente from the 1st table. I mean that it might match the same field or cd_cliente from the 1st table.
My original query was
select
a.cd_cliente, a.cdi as cdi_cli,b.*
from
clientes a
left join
rightTable b on a.cdi = b.cdi or a.cd_cliente = b.cdi
But since it took too much time, I changed it to:
Select a.cd_cliente, a.cdi, b.*
from clientes a
left join
(select
a.cd_cliente, a.cdi as cdi_cli, b.*
from
clientes a
inner join
rightTable b on a.cdi = b.cdi
union
select
a.cd_cliente, a.cdi as cdi_cli, b.*
from
clientes a
inner join
rightTable b on a.cd_cliente = b.cdi) b
on a.cd_cliente=b.cd_cliente
And it took less time. I'm not sure if the results would be the same. And if so, why the time taken by the 2nd query is considerably less?
"I'm not sure if the results would be the same." Most likely not.
Consider a row in clientes that matched a row in rightTable on cdi but did not match any row on cd_cliente. The first query will return one row for the match. The second query will return two rows. Once for the match, and once for the not match, but with nulls filled in the rightTable columns because of the left outer join.
Also, if the first query returns any legitimate duplicates those will be removed by the union operator in the second query.
SQL Server isn't good with OR and indexes. Not sure why. Your second query is getting around that by (most likely) seeking via indexes twice and then merging them somehow.
There are simpler queries you could try, such as this one:
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
b.*
FROM
dbo.clientes a
OUTER APPLY (
SELECT *
FROM dbo.rightTable b
WHERE a.cdi = b.cdi
UNION
SELECT *
FROM dbo.rightTable b
WHERE a.cd_cliente = b.cdi
) b
;
And here's a weird one that could actually work, though I'm not sure:
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
b.*
FROM
dbo.clientes a
OUTER APPLY (
SELECT *
FROM dbo.rightTable b
WHERE EXISTS (
SELECT 1 WHERE a.cdi = b.cdi
UNION
SELECT 1 WHERE a.cd_cliente = b.cdi
)
) b
;
Told you it was weird! And here's an even weirder (and probably inadvisable) one.
SELECT
a.cd_cliente,
cdi_cli = a.cdi,
BColumn1 = Max(BColumn1),
BColumn2 = Max(BColumn2),
BColumn3 = Max(BColumn3),
BColumn4 = Max(BColumn4)
-- all columns of B
FROM
dbo.clientes a
CROSS APPLY (VALUES
(a.cdi),
(a.cd_cliente)
) c (cdi)
LEFT JOIN dbo.rightTable b
ON c.cdi = b.cdi
GROUP BY
a.cd_cliente,
a.cdi
-- plus all other columns of A
;
Given some time to play with your data and indexes and work with execution plans, I'm sure we could come up with something that would really sizzle.
This is your original query:
select a.cd_cliente, a.cdi as cdi_cli,b.*
from clientes a left join
rightTable b
on a.cdi = b.cdi or a.cd_cliente = b.cdi;
The performance problem is due to the or in the on condition. This generally interferes with using indexes.
If you only cared about one column from b, you could do:
select a.cd_cliente, a.cdi as cdi_cli, coalesce(b1.col, b2.col)
from clientes a left join
rightTable b1
on a.cdi = b1.cdi left join
rightTable b2
on a.cd_cliente = b2.cdi;
This easily generalizes to a small handful of columns, but is cumbersome if b is wide.
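A minimal sketch of the double left join with COALESCE, compared against the original OR join. It runs on Python's sqlite3 module as a stand-in for SQL Server, and the column val and all of the data are invented for the example. The sample data is chosen so that each cliente matches at most one rightTable row. If a cliente matched on both keys, the OR join would return two rows while this form returns one, so the two are not interchangeable in general.

```python
import sqlite3

# Hypothetical data; "val" stands for the single column of interest in b.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE clientes (cd_cliente INTEGER, cdi INTEGER);
    CREATE TABLE rightTable (cdi INTEGER, val TEXT);
    INSERT INTO clientes VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO rightTable VALUES (10, 'x'), (2, 'y'), (99, 'z');
""")

# Original form: OR in the ON condition.
or_join = con.execute("""
    SELECT a.cd_cliente, a.cdi, b.val
    FROM clientes a
    LEFT JOIN rightTable b ON a.cdi = b.cdi OR a.cd_cliente = b.cdi
    ORDER BY a.cd_cliente""").fetchall()

# Rewrite: one left join per key, then pick whichever side matched.
two_joins = con.execute("""
    SELECT a.cd_cliente, a.cdi, COALESCE(b1.val, b2.val) AS val
    FROM clientes a
    LEFT JOIN rightTable b1 ON a.cdi = b1.cdi
    LEFT JOIN rightTable b2 ON a.cd_cliente = b2.cdi
    ORDER BY a.cd_cliente""").fetchall()

print(or_join == two_joins, two_joins)
```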
Another way of writing the query would be much more cumbersome. It would start with the b table, double left join to a and then union in the remaining values from a:
select coalesce(a1.cd_cliente, a2.cd_cliente) as cd_cliente,
coalesce(a1.cdi, a2.cdi) as cdi_cli,
b.*
from rightTable b left join
clientes a1
on a1.cdi = b.cdi left join
clientes a2
on a2.cd_cliente = b.cdi
where a1.cdi is not null or a2.cd_cliente is not null
union all
select a.cd_cliente, a.cdi, b.*
from clientes a left join
righttable b
on 1 = 0
where not exists (select 1 from righttable b where a.cdi = b.cdi) and
not exists (select 1 from righttable b where a.cd_cliente = b.cdi)
The first part of the query gets all the rows that match on one key or the other. The second adds in the unmatched rows. Note the strange use of left join with a condition that always evaluates to FALSE. That makes it easier to bring in the columns from b.
Although this looks complicated, the joins and not exists subqueries can all take advantage of appropriate indexes on the tables. That means that it should have more reasonable performance.

Join a table only if result set > 0

I have a table A joined with a table B which give me a result set.
I want to join a table C to the previous ones in order to restrict the result set. But if this join yields no results, I would like to get the same result set as before (ignoring C).
Can you think of a way to do that in SQL?
SELECT *
FROM TableA
INNER JOIN TableB
ON TableA.ID = TableB.TableAID
LEFT JOIN TableC
ON TableC.ID = TableB.TableCID
This will return all rows from Tables A & B but only the rows from TableC where the ON criteria match.
Otherwise, conditional joins don't really apply in standard SQL. If you are using SQL Server, you can use some stored procedure logic to check the results from TableC and, if there are none, get data only from Tables A & B. But this approach will be provider-specific.
Not possible with regular SQL since it involves logic.
Your best bet is to make a small script, e.g. (in pseudo code)
select * into #tmp from x inner join y inner join z where blabla;
if (exists (select * from #tmp))
BEGIN
select * from #tmp
END
else
BEGIN
select * from x inner join y where blabla;
END
Edit:
But if I were you, I would just always join with C using a LEFT JOIN, so you can see if the result was in one or the other result set...
e.g.
select x.*, y.*, case when z.id is null then 0 else 1 end from x inner join y left join z on blabla where blabla;
But that of course assumes you are able to alter the code path that reads the result.
I see a problem with the LEFT/OUTER JOIN methods. If you use them, you could get some results that are in A and B but not in C. If I understand correctly, the purpose is to join AB with C, meaning the result of joining with C must include all three restrictions. So @Cine's solution is the appropriate one for this case.
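The "restricted query first, fall back if empty" logic from the script answer can be sketched from client code as well. The version below is an illustration only: it uses Python's sqlite3 module, the table and column names follow the first answer, the data is invented, and the helper `restricted_or_fallback` is a hypothetical name. It runs the A-B-C inner join and returns the plain A-B join only when the restricted query returns nothing.

```python
import sqlite3

# Hypothetical data; TableC starts out empty.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE TableA (ID INTEGER);
    CREATE TABLE TableB (TableAID INTEGER, TableCID INTEGER);
    CREATE TABLE TableC (ID INTEGER);
    INSERT INTO TableA VALUES (1), (2);
    INSERT INTO TableB VALUES (1, 7), (2, 8);
""")

AB = """
    SELECT TableA.ID FROM TableA
    INNER JOIN TableB ON TableA.ID = TableB.TableAID"""
ABC = AB + """
    INNER JOIN TableC ON TableC.ID = TableB.TableCID"""

def restricted_or_fallback(con):
    """Return the rows restricted by TableC, or the A-B rows if that is empty."""
    rows = con.execute(ABC + " ORDER BY TableA.ID").fetchall()
    if rows:
        return rows
    return con.execute(AB + " ORDER BY TableA.ID").fetchall()

empty_c = restricted_or_fallback(con)        # no rows in TableC: A-B result
con.execute("INSERT INTO TableC VALUES (8)")
with_c = restricted_or_fallback(con)         # TableC now restricts the result

print(empty_c, with_c)
```

The cost of this approach is a second round trip in the fallback case, which is the same trade-off the temp-table script makes on the server side.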

When or why would you use a right outer join instead of left?

Wikipedia states:
"In practice, explicit right outer joins are rarely used, since they can always be replaced with left outer joins and provide no additional functionality."
Can anyone provide a situation where they have preferred to use the RIGHT notation, and why?
I can't think of a reason to ever use it. To me, it wouldn't ever make things more clear.
Edit:
I'm an Oracle veteran making the New Year's Resolution to wean myself from the (+) syntax. I want to do it right.
The only reason I can think of to use RIGHT OUTER JOIN is to try to make your SQL more self-documenting.
You might possibly want to use left joins for queries that have null rows in the dependent (many) side of one-to-many relationships and right joins on those queries that generate null rows in the independent side.
This can also occur in generated code or if a shop's coding requirements specify the order of declaration of tables in the FROM clause.
B RIGHT JOIN A is the same as A LEFT JOIN B
B RIGHT JOIN A reads: B on the right, then joins A. That means A is on the left side of the data set, just the same as A LEFT JOIN B.
There is no performance to be gained by rearranging LEFT JOINs into RIGHT JOINs.
The only reason I can think of for using RIGHT JOIN is if you are the type of person who likes to think from the inside out (select * from detail right join header). It's like how some like little-endian and others big-endian, some like top-down design and others bottom-up design.
The other one is if you already have a humongous query where you want to add another table, when it's a pain in the neck to rearrange the query, so just plug the table to existing query using RIGHT JOIN.
I've never used a right join before and never thought I could actually need one, and it seems a bit unnatural. But after I thought about it, it could be really useful in a situation where you need to outer join one table with the intersection of many tables, so you have tables like this:
And want to get result like this:
Or, in SQL (MS SQL Server):
declare @temp_a table (id int)
declare @temp_b table (id int)
declare @temp_c table (id int)
declare @temp_d table (id int)
insert into @temp_a
select 1 union all
select 2 union all
select 3 union all
select 4
insert into @temp_b
select 2 union all
select 3 union all
select 5
insert into @temp_c
select 1 union all
select 2 union all
select 4
insert into @temp_d
select id from @temp_a
union
select id from @temp_b
union
select id from @temp_c
select *
from @temp_a as a
inner join @temp_b as b on b.id = a.id
inner join @temp_c as c on c.id = a.id
right outer join @temp_d as d on d.id = a.id
id id id id
----------- ----------- ----------- -----------
NULL NULL NULL 1
2 2 2 2
NULL NULL NULL 3
NULL NULL NULL 4
NULL NULL NULL 5
So if you switch to the left join, results will not be the same.
select *
from @temp_d as d
left outer join @temp_a as a on a.id = d.id
left outer join @temp_b as b on b.id = d.id
left outer join @temp_c as c on c.id = d.id
id id id id
----------- ----------- ----------- -----------
1 1 NULL 1
2 2 2 2
3 3 3 NULL
4 4 NULL 4
5 NULL 5 NULL
The only way to do this without the right join is to use a common table expression or a subquery:
select *
from @temp_d as d
left outer join (
select *
from @temp_a as a
inner join @temp_b as b on b.id = a.id
inner join @temp_c as c on c.id = a.id
) as q on ...
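For reference, the subquery form can be checked against the right join output shown earlier. The sketch below uses Python's sqlite3 module with ordinary tables in place of the SQL Server table variables. The column aliases a_id, b_id, c_id and the join condition `q.a_id = d.id` are assumptions added for the example, since the derived table otherwise exposes three columns all named id. It produces the same five rows as the right outer join.

```python
import sqlite3

# Ordinary tables standing in for the table variables; same data as the answer.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE temp_a (id INTEGER);
    CREATE TABLE temp_b (id INTEGER);
    CREATE TABLE temp_c (id INTEGER);
    CREATE TABLE temp_d (id INTEGER);
    INSERT INTO temp_a VALUES (1), (2), (3), (4);
    INSERT INTO temp_b VALUES (2), (3), (5);
    INSERT INTO temp_c VALUES (1), (2), (4);
    INSERT INTO temp_d
        SELECT id FROM temp_a UNION SELECT id FROM temp_b UNION SELECT id FROM temp_c;
""")

# Left join temp_d to the intersection of a, b and c, computed in a subquery.
# The aliases and the ON condition are assumptions made for this sketch.
rows = con.execute("""
    SELECT q.a_id, q.b_id, q.c_id, d.id
    FROM temp_d AS d
    LEFT OUTER JOIN (
        SELECT a.id AS a_id, b.id AS b_id, c.id AS c_id
        FROM temp_a AS a
        INNER JOIN temp_b AS b ON b.id = a.id
        INNER JOIN temp_c AS c ON c.id = a.id
    ) AS q ON q.a_id = d.id
    ORDER BY d.id""").fetchall()

for row in rows:
    print(row)
```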
The only time I would think of a right outer join is if I were fixing a full join, and it just so happened that I needed the result to contain all records from the table on the right. Even as lazy as I am, though, I would probably get so annoyed that I would rearrange it to use a left join.
This example from Wikipedia shows what I mean:
SELECT *
FROM employee
FULL OUTER JOIN department
ON employee.DepartmentID = department.DepartmentID
If you just replace the word FULL with RIGHT you have a new query, without having to swap the order of the ON clause.
SELECT * FROM table1 [BLANK] OUTER JOIN table2 ON table1.col = table2.col
Replace [BLANK] with:
LEFT - if you want all records from table1 even if they don't have a col that matches table2's (also included are table2 records with matches)
RIGHT - if you want all records from table2 even if they don't have a col that matches table1's (also included are table1 records with matches)
FULL - if you want all records from table1 and from table2
What is everyone talking about? They're the same? I don't think so.
SELECT * FROM table_a
INNER JOIN table_b ON ....
RIGHT JOIN table_c ON ....
How else could you quickly/easily inner join the first 2 tables and join with table_c while ensuring all rows in table_c are always selected?
I've not really had to think much about the right join, but in nearly 20 years of writing SQL queries I have not come across a sound justification for using one. I've certainly seen plenty of them, I'd guess arising where developers have used built-in query builders.
Whenever I've encountered one, I've rewritten the query to eliminate it - I've found they just require too much additional mental energy to learn or re-learn if you haven't visited the query for some time and it hasn't been uncommon for the intent of the query to become lost or return incorrect results - and it's usually this incorrectness that has led to requests for me to review why the queries weren't working.
In thinking about it, once you introduce a right-join, you now have what I'd consider competing branches of logic which need to meet in the middle. If additional requirements/conditions are introduced, both of these branches may be further extended and you now have more complexity you're having to juggle to ensure that one branch isn't giving rise to incorrect results.
Further, once you introduce a right join, other less-experienced developers that work on the query later may simply bolt on additional tables to the right-join portion of the query and in doing so, expanding competing logic flows that still need to meet in the middle; or in some cases I've seen, start nesting views because they don't want to touch the original logic, perhaps in part, this is because they may not understand the query or the business rules that were in place that drove the logic.
SQL statements, in addition to being correct, should be as easy to read and expressively concise as possible (because they represent single atomic actions, and your mind needs to grok them completely to avoid unintended consequences.) Sometimes an expression is more clearly stated with a right outer join.
But one can always be transformed into the other, and the optimizer will do as well with one as the other.
For quite a while, at least one of the major rdbms products only supported LEFT OUTER JOIN. (I believe it was MySQL.)
The only times I've used a right join have been when I want to look at two sets of data and I already have the joins in a specific order for the left or inner join from a previously written query. In this case, say you want to see as one set of data the records not included in table a but in table b and in a another set the records not in table b but in table a. Even then I tend only to do this to save time doing research but would change it if it was code that would be run more than once.
In some SQL databases, there are optimizer hints that tell the optimizer to join the tables in the order in which they appear in the FROM clause - e.g. /*+ORDERED */ in Oracle. In some simple implementations, this might even be the only execution plan available.
In such cases order of tables in the FROM clause matters so RIGHT JOIN could be useful.
I think it's difficult if you don't have a right join in this case. Example with Oracle:
with a as(
select 1 id, 'a' name from dual union all
select 2 id, 'b' name from dual union all
select 3 id, 'c' name from dual union all
select 4 id, 'd' name from dual union all
select 5 id, 'e' name from dual union all
select 6 id, 'f' name from dual
), bx as(
select 1 id, 'fa' f from dual union all
select 3 id, 'fb' f from dual union all
select 6 id, 'f' f from dual union all
select 6 id, 'fc' f from dual
)
select a.*, b.f, x.f
from a left join bx b on a.id = b.id
right join bx x on a.id = x.id
order by a.id