Impala SQL LEFT ANTI JOIN

The goal is to find the empids for a given time range that are present in the LEFT table but not in the RIGHT table.
I ran the following two Impala queries and got different results.
QUERY 1: select count(dbonetable.empid), COUNT(DISTINCT dbtwotable.empid) from
(select distinct dbonetable.empid
from dbonedbtable dbonetable
WHERE (dbonetable.expiration_dt >= '2009-01-01' OR dbonetable.expiration_dt IS NULL) AND dbonetable.effective_dt <= '2019-01-01' AND dbonetable.empid IS NOT NULL) dbonetable
LEFT join dbtwodbtable dbtwotable ON dbonetable.empid = dbtwotable.empid
--43324489 43270569
QUERY 2: select count(*) from (
select distinct dbonetable.empid from dbonedbtable dbonetable
LEFT ANTI join dbtwodbtable dbtwotable ON dbonetable.empid = dbtwotable.empid
AND (dbonetable.expiration_dt >= '2009-01-01' OR dbonetable.expiration_dt IS NULL) AND dbonetable.effective_dt <= '2019-01-01' AND dbonetable.empid IS NOT NULL) tab
--19088973
To explain the context:
Query 2: I am trying to find all the empids that are in dbonetable and not in dbtwotable using LEFT ANTI JOIN, which I learned about here:
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_joins.html
--For LEFT ANTI JOIN, this clause returns those values from the left-hand table that have no matching value in the right-hand table.
And in Query 1:
dbonetable is filtered by the WHERE clause and the result is LEFT OUTER joined with dbtwotable. On top of that result I compute count(dbonetable.empid) and COUNT(DISTINCT dbtwotable.empid), which gave me 43324489 and 43270569, a difference of 53,920.
My question: I expected the two approaches to agree, so either the Query 1 difference (43324489 - 43270569 = 53,920) or the Query 2 result (19088973) must be wrong.
What could be missing here? Is my Query 1 incorrect, or is my use of LEFT ANTI JOIN misleading?
Thank you all in advance.

The results differ because you forgot to specify "where dbtwotable.empid is null" in query 1.
Additionally, your query 2 is logically different from query 1. In query 1 you join only on equality of the two empid columns, while in query 2 the join has several more conditions. With the stricter condition far fewer left rows find a match, and because an anti join returns the unmatched rows, the final count is much larger.
If you make the join condition in query 2 the same as in query 1 and move everything else into the WHERE clause, you will get the same count as the updated query 1, which is 53920. That is the count you need.
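The effect described above can be reproduced on a tiny data set. The following is a minimal sketch, not the asker's actual setup: it uses Python's sqlite3 module, the table names left_t and right_t are made up for illustration, and because SQLite has no LEFT ANTI JOIN the anti join is emulated with NOT EXISTS (an assumption about equivalent semantics, not Impala syntax).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE left_t  (empid INTEGER, effective_dt TEXT);  -- hypothetical tables
    CREATE TABLE right_t (empid INTEGER);
    INSERT INTO left_t VALUES (1, '2018-06-01'), (2, '2018-06-01'),
                              (3, '2020-01-01'), (4, '2020-01-01');
    INSERT INTO right_t VALUES (1), (3);
""")

# Date filter placed inside the anti-join condition (like query 2):
# a left row that fails the filter can never match, so it is returned.
filter_in_join = conn.execute("""
    SELECT DISTINCT l.empid FROM left_t l
    WHERE NOT EXISTS (
        SELECT 1 FROM right_t r
        WHERE r.empid = l.empid AND l.effective_dt <= '2019-01-01')
    ORDER BY l.empid
""").fetchall()

# Date filter placed in WHERE, anti join on empid only (the suggested fix).
filter_in_where = conn.execute("""
    SELECT DISTINCT l.empid FROM left_t l
    WHERE l.effective_dt <= '2019-01-01'
      AND NOT EXISTS (SELECT 1 FROM right_t r WHERE r.empid = l.empid)
    ORDER BY l.empid
""").fetchall()

print(filter_in_join)   # [(2,), (3,), (4,)]
print(filter_in_where)  # [(2,)]
```

With the filter inside the join condition, empids 3 and 4 come back even though they are outside the date range, which is the same kind of inflation seen in the 19088973 count.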

Related

Request Optimization (CTE, multiple LEFT JOIN, WHERE with OR)

Can anyone give me advice on how to optimize this query?
The real query has more LEFT JOINs than this example (20+), mainly to fetch values through foreign keys. What optimization is possible?
The CTE is used to create aggregates, but the CTE's tables are also used in the main query, so is the CTE useful?
The WHERE clause has a simple condition on the main table and a second condition that ORs fields from several tables. Would it be better to add a column holding the max date of the 3 fields and use a simple second condition (without OR)?
SQL Server 2015+
WITH
cte AS
(
SELECT
e_ofcte.id,
SUM(CASE WHEN f_ofcte.lib='G' THEN 1 ELSE 0 END) AS n1,
SUM(CASE WHEN f_ofcte.lib='H' THEN 1 ELSE 0 END) AS n2
FROM e_ofcte
INNER JOIN f_ofcte ON f_ofcte.id=e_ofcte.id
WHERE f_ofcte.lib IN ('G','H')
AND e_ofcte.date>=DATEFROMPARTS(YEAR(CURRENT_TIMESTAMP)-2,1,1)
GROUP BY
e_ofcte.id
)
SELECT
a.id,
b.sid,
c.sid,
cte.n1,
cte.n2
FROM a
LEFT JOIN cte ON a.id=cte.id
LEFT JOIN b ON a.id=b.id
LEFT JOIN c ON a.id=c.id
LEFT JOIN e_ofcte ON a.id=e_ofcte.id
LEFT JOIN i ON a.id=i.id
LEFT JOIN j ON a.id=j.id
LEFT JOIN f_ofcte ON a.id=f_ofcte.id
WHERE a.code='A'
AND
(
a.date>=DATEFROMPARTS(YEAR(CURRENT_TIMESTAMP)-2,1,1)
OR
b.date>=DATEFROMPARTS(YEAR(CURRENT_TIMESTAMP)-2,1,1)
OR
c.date>=DATEFROMPARTS(YEAR(CURRENT_TIMESTAMP)-2,1,1)
)
Moving the OR conditions into the JOIN clauses would return different results, so my answer would be "no" unless you know exactly what you are doing.
There are several approaches you can try to improve performance:
Move the CTE into a temporary table if you can. That makes the main query smaller, which helps the optimizer come up with the best plan, and you can tune the two parts separately.
Possibly build a filtered index on table a (id, date) with "WHERE code='A'"; that works if the number of filtered records is relatively small.
Possibly build a filtered index on table f_ofcte (id) with "WHERE lib IN ('G','H')".
Build indexes on the other tables on (id, date).
I am not sure whether you provided the full query, but the following part looks completely unused:
LEFT JOIN e_ofcte ON a.id=e_ofcte.id
LEFT JOIN i ON a.id=i.id
LEFT JOIN j ON a.id=j.id
LEFT JOIN f_ofcte ON a.id=f_ofcte.id
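The first suggestion (materializing the CTE into a temporary table) can be sketched as follows. This is only an illustration under assumptions: it runs on SQLite through Python's sqlite3 module rather than SQL Server, the tables are cut down to two columns each, and the names f and agg are made up. It checks that the rewrite returns the same rows; it does not measure performance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, code TEXT);
    CREATE TABLE f (id INTEGER, lib TEXT);   -- stand-in for f_ofcte
    INSERT INTO a VALUES (1, 'A'), (2, 'A'), (3, 'B');
    INSERT INTO f VALUES (1, 'G'), (1, 'G'), (1, 'H'), (2, 'H'), (2, 'X');
""")

# The aggregate that the original query computes inside the CTE.
AGGREGATE = """
    SELECT id,
           SUM(CASE WHEN lib = 'G' THEN 1 ELSE 0 END) AS n1,
           SUM(CASE WHEN lib = 'H' THEN 1 ELSE 0 END) AS n2
    FROM f WHERE lib IN ('G', 'H') GROUP BY id
"""

# Version 1: aggregate kept as a CTE inside the main query.
with_cte = conn.execute(f"""
    WITH cte AS ({AGGREGATE})
    SELECT a.id, cte.n1, cte.n2 FROM a
    LEFT JOIN cte ON a.id = cte.id
    WHERE a.code = 'A' ORDER BY a.id
""").fetchall()

# Version 2: aggregate materialized once, then joined like a normal table.
conn.execute(f"CREATE TEMP TABLE agg AS {AGGREGATE}")
with_temp = conn.execute("""
    SELECT a.id, agg.n1, agg.n2 FROM a
    LEFT JOIN agg ON a.id = agg.id
    WHERE a.code = 'A' ORDER BY a.id
""").fetchall()

print(with_cte)   # [(1, 2, 1), (2, 0, 1)]
print(with_temp)  # same rows
```

Splitting the work this way also lets the aggregate and the main query be tuned and indexed separately, as the answer notes.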

How to join two queries without a named inner query in MS Access

I have two tables. One holds floor numbers (tb_FloorNumber.FloorNumber, for example 1 to 15), and the other has a floor number and a User_Id column (tb_Emp_Master.FloorNumber, tb_Emp_Master.User_Id). I want to return all the records from tb_FloorNumber and only those records from tb_Emp_Master that satisfy the condition User_Id = "fat35108".
I know I can do this with two queries like this:
Query 1:
SELECT DISTINCT tb_Emp_Master.FloorNumber
FROM tb_Emp_Master
WHERE (((tb_Emp_Master.User_Id)="fat35108"));
Query2:
SELECT DISTINCT tb_FloorNumber.FloorNumber, Query1.FloorNumber
FROM tb_FloorNumber LEFT JOIN Query1 ON tb_FloorNumber.FloorNumber = Query1.FloorNumber;
But I want to write this as a single query instead of using Query1 inside Query 2.
I have tried this:
SELECT DISTINCT tb_FloorNumber.FloorNumber, tb_Emp_Master.FloorNumber
FROM tb_FloorNumber LEFT JOIN tb_Emp_Master ON tb_FloorNumber.FloorNumber = tb_Emp_Master.FloorNumber
WHERE (((tb_Emp_Master.User_Id)="fat35108"));
But it returns only one record (for instance, 8).
Please help me write this.
If you set the condition:
tb_Emp_Master.User_Id = "fat35108"
in the WHERE clause, then you actually get an INNER JOIN instead of a LEFT JOIN because you filter only the matched rows from tb_Emp_Master.
Use tb_Emp_Master in the LEFT JOIN instead of Query1 and set the condition in the ON clause:
SELECT DISTINCT
tb_FloorNumber.FloorNumber,
tb_Emp_Master.FloorNumber
FROM tb_FloorNumber LEFT JOIN tb_Emp_Master
ON tb_FloorNumber.FloorNumber = tb_Emp_Master.FloorNumber AND tb_Emp_Master.User_Id = "fat35108";
I don't know why you need DISTINCT, so I kept it too.
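To see the difference between the two placements of the User_Id condition, here is a small sketch using Python's sqlite3 module instead of MS Access. The three floors and the second user 'other_user' are made-up sample data, and Access syntax details (such as parenthesizing the ON clause) may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb_FloorNumber (FloorNumber INTEGER);
    CREATE TABLE tb_Emp_Master (FloorNumber INTEGER, User_Id TEXT);
    INSERT INTO tb_FloorNumber VALUES (7), (8), (9);
    -- 'other_user' is a hypothetical second employee
    INSERT INTO tb_Emp_Master VALUES (8, 'fat35108'), (9, 'other_user');
""")

# Condition in WHERE: unmatched floors have NULL User_Id and are filtered out.
in_where = conn.execute("""
    SELECT DISTINCT f.FloorNumber, e.FloorNumber
    FROM tb_FloorNumber f
    LEFT JOIN tb_Emp_Master e ON f.FloorNumber = e.FloorNumber
    WHERE e.User_Id = 'fat35108'
    ORDER BY f.FloorNumber
""").fetchall()

# Condition in ON: every floor is kept, with NULL where the user has no row.
in_on = conn.execute("""
    SELECT DISTINCT f.FloorNumber, e.FloorNumber
    FROM tb_FloorNumber f
    LEFT JOIN tb_Emp_Master e
           ON f.FloorNumber = e.FloorNumber AND e.User_Id = 'fat35108'
    ORDER BY f.FloorNumber
""").fetchall()

print(in_where)  # [(8, 8)]
print(in_on)     # [(7, None), (8, 8), (9, None)]
```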

Problems getting desired output from SQL JOIN Query

I am trying to extract data from multiple SQL tables. I have a main table and a couple of sub-tables. I want to get all the rows from the main table that meet a condition and add some fields from the sub-tables. I figured an OUTER JOIN would work, but I am not getting all of the data.
When I run a COUNT on the main table with the condition I get ~10k rows, which is what I expect to get once I join the other tables. I understand that I will get NULL values in some row entries.
This is the query I came up with, but I am only getting partial results:
SELECT main_table.group_id, main_table.floor, sub_table1.Name, sub_table2.base
FROM main_table
LEFT JOIN sub_table1 ON main_table.group_id = sub_table1.group_id
LEFT JOIN sub_table2 ON main_table.group_id = sub_table2.group_id
WHERE main_table.year = 2000 AND sub_table1.year = 2000
AND sub_table2.year = 2000 AND main_table.group = 'C'
I am expecting to see a collection of about 10k rows since that is the number I get when only querying the main table with where clause.
SELECT COUNT(*) FROM main_table WHERE year = 2000 AND group = 'C';
Your WHERE clause is filtering out the extra rows from the outer joins, effectively turning them into inner joins.
Conditions on all but the first table should be in the ON clauses. I would phrase this as:
SELECT main_table.group_id, main_table.floor, sub_table1.Name, sub_table2.base
FROM main_table LEFT JOIN
sub_table1
ON main_table.group_id = sub_table1.group_id AND
main_table.year = sub_table1.year LEFT JOIN
sub_table2
ON main_table.group_id = sub_table2.group_id AND
main_table.year = sub_table2.year
WHERE main_table.year = 2000 AND main_table.group = 'C';
You want the years to be equal, so that should really be a JOIN condition. Then you only need to specify the year once in the WHERE clause.
Conditions in the ON clause are used for the join, while conditions in the WHERE clause filter the final result.
In addition to Gordon's answer: if your requirement is to use different or identical years in the joins, you can use the following query:
SELECT main_table.group_id, main_table.floor, sub_table1.Name, sub_table2.base
FROM main_table LEFT JOIN
sub_table1
ON (main_table.group_id = sub_table1.group_id AND
sub_table1.year = 2000) LEFT JOIN
sub_table2
ON (main_table.group_id = sub_table2.group_id AND
sub_table2.year = 2000)
WHERE main_table.year = 2000 AND main_table.group = 'C';
Cheers!!

Oracle - MINUS-operator with different results than OUTER JOIN

In Oracle the following MINUS SQL statement returns results, while the allegedly equivalent OUTER JOIN statement doesn't return any.
Results:
SELECT
/*+parallel (8)*/
pd.item_id
FROM MY_TABLE#DB_LINK_PROD_ENV
WHERE pd.valid_to='09.09.9999'
MINUS
SELECT
/*+parallel (8)*/
it.item_id
FROM MY_TABLE#DB_LINK_TEST_ENV
WHERE it.valid_to='09.09.9999' ;
No results:
SELECT
/*+parallel (8)*/
pd.item_id,
it.item_id
FROM MY_TABLE#DB_LINK_PROD_ENV
LEFT OUTER JOIN MY_TABLE#DB_LINK_TEST_ENV
ON pd.item_id = it.item_id
WHERE it.valid_to ='09.09.9999'
AND pd.valid_to ='09.09.9999'
AND it.item_id IS NULL;
Without knowing the data, what could be the reason?
The first query uses MINUS, which means it returns every item_id present in DB_LINK_PROD_ENV with valid_to='09.09.9999' that is not present in DB_LINK_TEST_ENV with valid_to='09.09.9999'.
The second query is a LEFT JOIN whose WHERE clause contains the AND conditions:
it.valid_to ='09.09.9999'
AND pd.valid_to ='09.09.9999'
So it is possible that there are records in DB_LINK_PROD_ENV with valid_to='09.09.9999' but no corresponding record in DB_LINK_TEST_ENV with valid_to='09.09.9999'.
The MINUS in the first query then returns those DB_LINK_PROD_ENV records. In the second query, however, every unmatched row has NULL in all columns of the test table, so it.valid_to='09.09.9999' can never be true together with it.item_id IS NULL, and the WHERE clause returns no records.
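The behavior can be checked on sample data. This is a hedged sketch rather than the Oracle setup from the question: it uses Python's sqlite3 module, where the set-difference operator is spelled EXCEPT instead of MINUS, and the local tables prod and test are hypothetical stand-ins for the two database-link tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prod (item_id INTEGER, valid_to TEXT);  -- hypothetical stand-ins
    CREATE TABLE test (item_id INTEGER, valid_to TEXT);
    INSERT INTO prod VALUES (1, '09.09.9999'), (2, '09.09.9999'), (3, '01.01.2020');
    INSERT INTO test VALUES (1, '09.09.9999'), (2, '01.01.2020');
""")

# Set difference (MINUS in Oracle, EXCEPT in SQLite).
minus_rows = conn.execute("""
    SELECT item_id FROM prod WHERE valid_to = '09.09.9999'
    EXCEPT
    SELECT item_id FROM test WHERE valid_to = '09.09.9999'
    ORDER BY item_id
""").fetchall()

# Right-table filter in WHERE: contradicts "t.item_id IS NULL", so no rows.
join_filter_in_where = conn.execute("""
    SELECT p.item_id FROM prod p
    LEFT JOIN test t ON p.item_id = t.item_id
    WHERE t.valid_to = '09.09.9999'
      AND p.valid_to = '09.09.9999'
      AND t.item_id IS NULL
""").fetchall()

# Right-table filter moved into ON: matches the set difference.
join_filter_in_on = conn.execute("""
    SELECT p.item_id FROM prod p
    LEFT JOIN test t ON p.item_id = t.item_id AND t.valid_to = '09.09.9999'
    WHERE p.valid_to = '09.09.9999'
      AND t.item_id IS NULL
    ORDER BY p.item_id
""").fetchall()

print(minus_rows)            # [(2,)]
print(join_filter_in_where)  # []
print(join_filter_in_on)     # [(2,)]
```

Note that the set difference also removes duplicates, so the two forms agree exactly only when item_id is unique among the filtered rows of the left table.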

What's the difference between filtering in the WHERE clause compared to the ON clause?

I would like to know if there is any difference in using the WHERE clause or using the matching in the ON of the inner join.
The result in this case is the same.
First query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
and p.unitprice = Catmin.mn;
Second query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
where p.unitprice = Catmin.mn; -- this is changed
Both queries return the same result.
My answer may be a bit off-topic, but I would like to highlight a problem that may occur when you turn your INNER JOIN into an OUTER JOIN.
In this case, the most important difference between putting predicates (test conditions) on the ON or WHERE clauses is that you can turn LEFT or RIGHT OUTER JOINS into INNER JOINS without noticing it, if you put fields of the table to be left out in the WHERE clause.
For example, in a LEFT JOIN between tables A and B, if you include a condition that involves fields of B on the WHERE clause, there's a good chance that there will be no null rows returned from B in the result set. Effectively, and implicitly, you turned your LEFT JOIN into an INNER JOIN.
On the other hand, if you include the same test in the ON clause, null rows will continue to be returned.
For example, take the query below:
SELECT * FROM A
LEFT JOIN B
ON A.ID=B.ID
The query will also return rows from A that do not match any of B.
Take this second query:
SELECT * FROM A
LEFT JOIN B
WHERE A.ID=B.ID
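This second query won't return any rows from A that don't match B, even though you might expect it to because you specified a LEFT JOIN. That's because the test A.ID=B.ID in the WHERE clause removes every row in which B.ID is NULL. (Many engines reject a LEFT JOIN without an ON clause, so read this as a sketch of moving the join predicate into WHERE.)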
That's why I favor putting predicates in the ON clause rather than in the WHERE clause.
The results are exactly the same.
Using the ON clause is often suggested because it can improve the performance of the query.
Instead of requesting the data from the tables and then filtering, with the ON clause you filter the first data set first and then join it to the other tables. So there is less data to match and the result comes back faster.
There is no difference between the outputs of the two queries above; both return the same result.
When you use the ON clause, the join operation joins only those rows that match the condition specified in the ON clause.
Whereas with the WHERE clause, the join operation joins all the rows and then filters them based on the WHERE condition.
So the ON clause is more efficient and should be preferred over the WHERE condition.
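For the inner-join case in the question, the equality of the two result sets can be verified directly. The sketch below is an assumption-laden miniature: it uses Python's sqlite3 module, a made-up Products table with four rows, and no schema prefix. It only compares results and says nothing about performance, which depends on the engine's optimizer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Products (productname TEXT, categoryid INTEGER, unitprice INTEGER);
    INSERT INTO Products VALUES ('a', 1, 10), ('b', 1, 12), ('c', 2, 5), ('d', 2, 5);
""")

CATMIN = """
    WITH Catmin AS (
        SELECT categoryid, MIN(unitprice) AS mn
        FROM Products GROUP BY categoryid)
"""

# Price predicate in the ON clause.
pred_in_on = conn.execute(CATMIN + """
    SELECT p.productname, mn FROM Catmin
    INNER JOIN Products p
            ON p.categoryid = Catmin.categoryid AND p.unitprice = Catmin.mn
    ORDER BY p.productname
""").fetchall()

# Price predicate in the WHERE clause.
pred_in_where = conn.execute(CATMIN + """
    SELECT p.productname, mn FROM Catmin
    INNER JOIN Products p ON p.categoryid = Catmin.categoryid
    WHERE p.unitprice = Catmin.mn
    ORDER BY p.productname
""").fetchall()

print(pred_in_on)     # [('a', 10), ('c', 5), ('d', 5)]
print(pred_in_where)  # same rows
```

The two forms stop being interchangeable once the INNER JOIN becomes a LEFT or RIGHT OUTER JOIN, as the first answer explains.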