Snowflake inner join bad performance over left join

I have 2 equivalent queries in Snowflake - one with left join and the other with inner join:
SELECT *
FROM A
INNER JOIN B ON a.id=b.id;
SELECT *
FROM A
LEFT JOIN B ON a.id=b.id
WHERE b.id IS NOT NULL;
The inner join does not finish after an hour, while the left join takes only a few seconds. Why would this happen?

There are two possibilities.
One is that there really is a difference in performance between the queries. That would occur because Snowflake chooses very different execution plans for the two queries. Of course the query texts are not the same, but I would expect the execution plans to be similar. You can check the EXPLAIN output for each query to investigate this.
The second is that the queries actually have quite similar performance, but you start seeing results from the LEFT JOIN query quickly. Why? Because a left join preserves every row of the first table, even if there is no match, so rows can be produced as soon as they are read. By contrast, if there is no match at all between the two tables, the INNER JOIN query has to process all the data before it knows there is no match.
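As a sanity check on the premise that the two queries return the same rows, here is a minimal sketch using Python's built-in sqlite3 module rather than Snowflake. The table contents are made up for illustration, and the sketch says nothing about Snowflake's execution plans or timings; it only compares result sets.

```python
import sqlite3

# In-memory SQLite database standing in for Snowflake (assumption:
# only the result sets are compared here, not the execution plans).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE A (id INTEGER);
CREATE TABLE B (id INTEGER);
INSERT INTO A VALUES (1), (2), (3);
INSERT INTO B VALUES (2), (3), (3), (4);
""")

# Query 1 from the question: plain inner join.
inner_rows = conn.execute("""
SELECT * FROM A
INNER JOIN B ON a.id = b.id
ORDER BY a.id, b.id
""").fetchall()

# Query 2 from the question: left join, then discard unmatched rows.
left_rows = conn.execute("""
SELECT * FROM A
LEFT JOIN B ON a.id = b.id
WHERE b.id IS NOT NULL
ORDER BY a.id, b.id
""").fetchall()

print(inner_rows)
print(left_rows == inner_rows)
```

If the two result sets match on real data as well, any large timing gap comes from the plan the optimizer picked, not from the queries asking for different things.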

Related

Optimizing OUTER JOIN queries using filters from WHERE clause (Query Planner)

I am writing a distributed SQL query planner (query engine). Data will be fetched from RDBMS (PostgreSQL) nodes, which involves network I/O.
I want to optimize JOIN queries.
The logical order of execution is:
Do the JOIN (making use of the ON clause).
Apply the WHERE clause to the joined result.
I was thinking about applying the filter (the part of the WHERE clause specific to one table) first, and then doing the join.
In what cases would that produce wrong results?
Example:
SELECT *
FROM tableA
LEFT JOIN tableB ON(tableA.col1 = tableB.col1)
LEFT JOIN tableC ON(tableB.col2 = tableC.col1)
WHERE tableA.colY < 100 AND tableB.colX > 50
Logical Execution:
joinResult = (tableA left join tableB ON() ) left join tableC ON()
Filter joinResult using given WHERE clause.
Proposed Execution:
filteredA = tableA WHERE tableA.colY < 100
filteredB = tableB WHERE tableB.colX > 50
Result = (filteredA left join filteredB ON(..)) left join tableC ON(..)
Can I optimize any query like this, that is, by filtering each table first and then applying the join on top?
Edit:
Some people are confused and are discussing this specific example. I am not asking about this particular query; I am writing a query planner and I want to handle all types of queries.
Please note that each of the tables is sharded and stored on different machines, and the current execution model is to fetch each table and then do the join locally. So it would be better if I could apply the WHERE filter before fetching.
This is actually a complex topic.
We can filter the table first in some cases. We can also reorder outer joins and then push the filter quals inside.
I was going through a research paper on this, but I haven't finished it yet (and may not finish it).
So for now, those looking for answers could go through this research paper, particularly section 2.2: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.43.2531&rep=rep1&type=pdf
For now I'm relying on PostgreSQL's planner, taking its output and reconstructing the query for my requirements.
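To make the "in some cases" concrete, here is a minimal sketch that runs a two-table version of the example query in SQLite through Python's sqlite3 module. The rows are invented for illustration and tableC is left out to keep it short. It shows that pushing the tableA filter (the preserved side of the left join) below the join keeps the result, while pushing the tableB filter (the null-supplying side) changes it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tableA (col1 INTEGER, colY INTEGER);
CREATE TABLE tableB (col1 INTEGER, colX INTEGER);
INSERT INTO tableA VALUES (1, 10), (2, 20), (3, 500);
INSERT INTO tableB VALUES (1, 99), (2, 5);
""")

# Logical execution: join first, then apply the whole WHERE clause.
logical = conn.execute("""
SELECT tableA.col1, tableB.colX
FROM tableA
LEFT JOIN tableB ON tableA.col1 = tableB.col1
WHERE tableA.colY < 100 AND tableB.colX > 50
ORDER BY tableA.col1
""").fetchall()

# Proposed execution: filter both tables first, then join.
pushed_both = conn.execute("""
SELECT fa.col1, fb.colX
FROM (SELECT * FROM tableA WHERE colY < 100) AS fa
LEFT JOIN (SELECT * FROM tableB WHERE colX > 50) AS fb
  ON fa.col1 = fb.col1
ORDER BY fa.col1
""").fetchall()

# Push only the tableA filter; keep the tableB filter above the join.
pushed_a_only = conn.execute("""
SELECT fa.col1, tableB.colX
FROM (SELECT * FROM tableA WHERE colY < 100) AS fa
LEFT JOIN tableB ON fa.col1 = tableB.col1
WHERE tableB.colX > 50
ORDER BY fa.col1
""").fetchall()

print(logical)        # row 2 is removed by the WHERE clause
print(pushed_both)    # row 2 comes back, padded with NULL
print(pushed_a_only)  # same as the logical result
```

In the pushed-both variant, the tableA row with col1 = 2 loses its match before the join, so the left join pads it with NULL instead of the WHERE clause discarding it. A planner can still fetch a filtered tableB in this situation, but it then has to treat the join as an inner join or re-apply the filter afterwards.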

Left join Ignore

I have recently noticed that SQL Server 2016 appears to ignore left joins when no column from the joined table is used in the SELECT or WHERE clause. The joined tables also do not appear in the actual execution plan.
This is good, because an extra join that someone added does not affect performance.
I have a query that takes 9 seconds if I add a column from the left-joined tables to the SELECT clause, but only 1 second without it.
Can anyone please check and confirm whether this is true?
Here is the query with its actual execution plan. You can see that no table from the left joins appears in the execution plan.
I'm not 100% sure what the question is asking, but a SQL optimizer can ignore a left join. Consider this type of query:
select a.*
from a left join
b
on a.b_id = b.id;
If b.id is declared as unique (or equivalently a primary key) then the above query returns exactly the same result set as:
select a.*
from a;
I am not per se aware that SQL Server started putting this optimization in 2016. But the optimization is perfectly valid and (I believe) other optimizers do implement it.
Remember, SQL is a declarative language, not a procedural language. The SQL query describes the result set, not how it is produced.
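A minimal sketch of why the uniqueness condition matters, using SQLite through Python's sqlite3 module instead of SQL Server. The table b_dup and all rows are hypothetical, added only to show the contrasting case where the join key is not unique.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE b (id INTEGER PRIMARY KEY, name TEXT);
-- Hypothetical variant of b with no uniqueness guarantee on id.
CREATE TABLE b_dup (id INTEGER, name TEXT);
CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER);
INSERT INTO b VALUES (1, 'x'), (2, 'y');
INSERT INTO b_dup VALUES (1, 'x'), (1, 'x2'), (2, 'y');
INSERT INTO a VALUES (10, 1), (11, 2), (12, NULL), (13, 99);
""")

plain = conn.execute("SELECT a.* FROM a ORDER BY a.id").fetchall()

# b.id is unique: each row of a matches at most one row of b,
# so the left join neither adds nor removes rows.
joined_unique = conn.execute("""
SELECT a.* FROM a LEFT JOIN b ON a.b_id = b.id ORDER BY a.id
""").fetchall()

# b_dup.id is not unique: a row of a can match several rows,
# so the join cannot be dropped without changing the result.
joined_dup = conn.execute("""
SELECT a.* FROM a LEFT JOIN b_dup ON a.b_id = b_dup.id ORDER BY a.id
""").fetchall()

print(joined_unique == plain)
print(len(plain), len(joined_dup))
```

With the unique key the join is a no-op for the selected columns, which is what lets an optimizer skip it. Without that guarantee the duplicate match multiplies rows of a, so the join has to be executed.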
If you have a left join and your matching condition doesn't return any data from the joined table, it will return the same data as an inner join would when the select statement does not contain columns from the right table. This applies not only to SQL Server 2016 but to most databases.
A left join can reduce query performance if the joined tables contain a large amount of data.

SAS Enterprise: left join and right join difference?

I joined a new company that uses SAS Enterprise Guide.
I have 2 tables: table A has 100 rows, and table B has over 30M rows (50-60 columns).
I tried to do a right join from A (100) to B (30M); it ran for over 2 hours and no result came back. Will it help if I do a left join instead? I used the GUI and created the following query.
30M Record <- 100 Record ?
or
100 Record -> 30M Record ?
PROC SQL;
CREATE TABLE WORK.QUERY_FOR_CASE_NUMBER AS
SELECT t2.EMPGRPCOM,
t2.SEQINVNUM,
t2.SBSID,
t2.SBSLASTNAME,
t2.SBSFIRSTNAME,
t2.PMTDUEDATE,
t2.PREMAMT,
t2.ITEMDESC,
t2.EFFDATE,
t2.PAYAMT,
t2.MCAIDRATECD,
t2.REBILLIND,
t2.BILLTYPE
FROM WORK.'CASE NUMBER'n t1
LEFT JOIN DW.BILLING t2 ON (t1.CaseNumber = t2.SBSID)
WHERE t2.LOB = 'MD' AND t2.PMTDUEDATE BETWEEN '1Jan2015:0:0:0'dt AND '31Dec2017:0:0:0'dt AND t2.SITEID = '0001';
QUIT;
Left join and right join, all other things aside, are equivalent, if you implement them the same way, anyway. That is:
select a.*
from a
left join
b
on a.id=b.id
;
vs
select a.*
from b
right join
a
on b.id=a.id
;
Same exact query, no difference, same time used. SQL is a declarative language, meaning the SQL optimizer looks at what you send it and figures out what the best way to do it is, so it sees both queries and knows in both cases to do the same thing.
You can read about this in all sorts of articles; this one is a good starting point, or if that link ages, just search for "right join vs left join".
Now, what you might want to consider is writing this in a different way, namely not using SQL; this is the kind of query SQL should be good at, but sometimes isn't for some reason. I would write it as a hash table search, where the smaller case_number dataset is loaded into memory, and then a data step iterates over the larger table and checks whether each row's key is found in the smaller dataset. If so, return it.
I'd also think about whether left/right join is what you want, vs. inner join. Seems to me that if you're returning solely t2 values, right/left join isn't correct (when t1 is the "primary"): you'll just get empty rows for the non-matches. Either return a t1 variable, or use inner join.
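The hash table search suggested above would be written with a SAS data step in practice. As a language-neutral illustration of the idea only, here is a minimal Python sketch: the function name filter_by_keys and the sample rows are hypothetical, and a Python set stands in for the in-memory hash table.

```python
from typing import Dict, Iterable, Iterator, List


def filter_by_keys(
    small_keys: Iterable[str],
    big_rows: Iterable[Dict[str, object]],
    key: str,
) -> Iterator[Dict[str, object]]:
    """Yield rows of the big table whose key is in the small key set."""
    # Load the small side into memory once; membership tests on a set
    # take constant time on average.
    lookup = set(small_keys)
    # Stream the big side a single time; it is never sorted or
    # materialized in full.
    for row in big_rows:
        if row[key] in lookup:
            yield row


# Hypothetical stand-ins for WORK.'CASE NUMBER'n and DW.BILLING.
case_numbers: List[str] = ["A1", "B2"]
billing = [
    {"SBSID": "A1", "PREMAMT": 100.0},
    {"SBSID": "Z9", "PREMAMT": 5.0},
    {"SBSID": "B2", "PREMAMT": 42.0},
    {"SBSID": "A1", "PREMAMT": 7.5},
]

matches = list(filter_by_keys(case_numbers, billing, "SBSID"))
print([row["PREMAMT"] for row in matches])
```

This behaves like an inner join that returns only columns from the large table: the 30M-row side is read once, and the 100-row side costs almost nothing to hold in memory.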

difference in with/without "left join" and matching in "where" or "on"?

Is there any performance difference between the two SQL queries below? The first one has no left join and does the matching in the WHERE clause; the other uses a left join and does the matching in the ON clause.
I get exactly the same output from both queries, but I will soon be working with bigger tables (a couple of billion rows), so I don't want to run into performance issues. Thanks in advance.
select a.customer_id
from table a, table b
where a.customer_id = b.customer_id
select a.customer_id
from table a
left join table b
on a.customer_id = b.customer_id
The two do different things and yes, there is a performance impact.
Your first example is a cross join with a filter which reduces it to an inner join (virtually all planners are smart enough to reduce this to an inner join but it is semantically a cross join and filter).
Your second is a left join which means that where the filter is not met, you will still get all records from table a.
This means that the planner has to assume all records from table a are relevant, and that correlating records from table b are relevant in your second example, but in your first example it knows that only correlated records are relevant (and therefore has more freedom in planning).
In a very small data set you will see no difference but you may get different results. In a large data set, your left join can never perform better than your inner join and may perform worse.
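A minimal sketch of the "different results" point, run in SQLite through Python's sqlite3 module. The names table_a and table_b and their rows are made up for illustration, since the question's tables are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table_a (customer_id INTEGER);
CREATE TABLE table_b (customer_id INTEGER);
INSERT INTO table_a VALUES (1), (2), (3);
INSERT INTO table_b VALUES (1), (1), (3);
""")

# Comma join with the match in WHERE: effectively an inner join.
comma_join = conn.execute("""
SELECT a.customer_id
FROM table_a a, table_b b
WHERE a.customer_id = b.customer_id
ORDER BY a.customer_id
""").fetchall()

# Left join with the match in ON: unmatched rows of table_a are kept.
left_join = conn.execute("""
SELECT a.customer_id
FROM table_a a
LEFT JOIN table_b b ON a.customer_id = b.customer_id
ORDER BY a.customer_id
""").fetchall()

print(comma_join)
print(left_join)
```

Customer 2 has no row in table_b, so it appears only in the left join output. The two queries return identical results only when every customer_id in the first table has at least one match in the second.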

Excluding rows from result set, LEFT JOIN and EXCEPT

When you have two tables, and want to exclude rows from the second one, there are a multitude of options including EXISTS, NOT IN, LEFT JOIN and EXCEPT.
I've always used left join:
select N.ProductID from NewProducts N
left join Products P on P.ProductID = N.ProductID
where P.ProductID is null
Now I'm thinking it's cleaner to use EXCEPT:
select ProductID from NewProducts
except
select ProductID from Products
Are there performance issues of using EXCEPT?
You can check the execution plan and SQL Profiler to choose the most suitable query.
For me, though, NOT EXISTS works well.
The answer to your question depends on how large the data is.
You can use any of them (EXISTS, NOT IN, LEFT JOIN or EXCEPT), depending on your requirements.
You said that you always use LEFT JOIN, and that is good, because joining the two tables will minimize the execution time of the query, especially when you are holding a large amount of data.
JOIN is advisable, but the choice is always up to you.
You can compare the execution times using the SQL execution plan.
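Before comparing execution times, it is worth checking that the alternatives return the same rows on your data. The sketch below uses SQLite through Python's sqlite3 module with made-up rows, and the NOT EXISTS query is one possible way to write that form. It shows one difference: EXCEPT is a set operation and removes duplicates, while the LEFT JOIN and NOT EXISTS forms keep them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE NewProducts (ProductID INTEGER);
CREATE TABLE Products (ProductID INTEGER);
-- ProductID 2 appears twice in NewProducts on purpose.
INSERT INTO NewProducts VALUES (1), (2), (2), (3);
INSERT INTO Products VALUES (1);
""")

left_join = conn.execute("""
SELECT N.ProductID FROM NewProducts N
LEFT JOIN Products P ON P.ProductID = N.ProductID
WHERE P.ProductID IS NULL
ORDER BY N.ProductID
""").fetchall()

except_rows = conn.execute("""
SELECT ProductID FROM NewProducts
EXCEPT
SELECT ProductID FROM Products
ORDER BY ProductID
""").fetchall()

not_exists = conn.execute("""
SELECT N.ProductID FROM NewProducts N
WHERE NOT EXISTS (
    SELECT 1 FROM Products P WHERE P.ProductID = N.ProductID
)
ORDER BY N.ProductID
""").fetchall()

print(left_join)
print(except_rows)
print(not_exists)
```

If ProductID is unique in NewProducts, all three forms agree and the choice comes down to readability and the plan your database produces.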