SQL - why is this 'where' needed to remove row duplicates, when I'm already grouping?

Why, in this query, is the final 'WHERE' clause needed to limit duplicates?
The first LEFT JOIN links programs to entities on a UID.
The first INNER JOIN links programs to a subquery that gets statistics for those programs, again by linking on a UID.
The subquery (which produces the StatsForDistributorClubs subset) groups on the UID columns.
So I would have thought that this is all joining unique records anyway, and we shouldn't get duplicate rows.
Why, then, is the final WHERE needed to limit the rows by ensuring the 'program' is linked to the 'entity'?
(irrelevant parts of query omitted for clarity)
SELECT LmiEntity.[DisplayName]
,StatsForDistributorClubs.*
FROM [Program]
LEFT JOIN
LMIEntityProgram
ON LMIEntityProgram.ProgramUid = Program.ProgramUid
INNER JOIN
(
SELECT e.LmiEntityUid,
sp.ProgramUid,
SUM(attendeecount) [Total attendance]
FROM LMIEntity e,
Timetable t,
TimetableOccurrence [to],
ScheduledProgramOccurrence spo,
ScheduledProgram sp
WHERE
t.LicenseeUid = e.lmientityUid
AND [to].TimetableOccurrenceUid = spo.TimetableOccurrenceUid
AND sp.ScheduledProgramUid = spo.ScheduledProgramUid
GROUP BY e.lmientityUid, sp.ProgramUid
) AS StatsForDistributorClubs
ON Program.ProgramUid = StatsForDistributorClubs.ProgramUid
INNER JOIN LmiEntity
ON LmiEntity.LmiEntityUid = StatsForDistributorClubs.LmiEntityUid
LEFT OUTER JOIN Region
ON Region.RegionId = LMIEntity.RegionId
WHERE (
[Program].LicenseeUid = LmiEntity.LmiEntityUid
OR
[LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
)

If you were grouping in your outer query, the extra criteria probably wouldn't be needed, but only your inner query is grouped. A JOIN to a grouped inner query can still return multiple records; for that matter, any of your JOINs could be the culprit.
Without seeing a sample of the duplicates it's hard to know where they originate, but a GROUP BY on the outer query would definitely remove full duplicates, or revised JOIN criteria could take care of it.

You have in result set:
SELECT LmiEntity.[DisplayName]
,StatsForDistributorClubs.*
I suppose that your duplicates come from LMIEntityProgram.
My conjecture: LMIEntityProgram is a bridge table with both LmiEntityId and ProgramId, but you join only on ProgramId.
If you have several LmiEntityId values for a single ProgramId, you will get duplicates.
And these duplicates are what you're filtering out in the WHERE:
[LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
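You can do it in the JOIN instead (the join to LMIEntityProgram must then come after the join to LmiEntity, so that LmiEntity is in scope for the ON clause):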
LEFT JOIN LMIEntityProgram
ON LMIEntityProgram.ProgramUid = Program.ProgramUid
AND [LMIEntityProgram].LMIEntityUid = LmiEntity.LmiEntityUid
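To make the bridge-table conjecture concrete, here is a minimal sketch using Python's built-in sqlite3 module. The schema is a hypothetical, simplified stand-in (program, entity_program and attendance are assumed names, not the asker's real tables); it only illustrates how joining a bridge table on the program key alone multiplies the already-grouped rows.

```python
import sqlite3

# Hypothetical miniature schema: simplified stand-ins for Program,
# LMIEntityProgram (the bridge table) and the data behind the grouped subquery.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE program (program_uid INTEGER PRIMARY KEY);
CREATE TABLE entity_program (entity_uid INTEGER, program_uid INTEGER);
CREATE TABLE attendance (entity_uid INTEGER, program_uid INTEGER, attendees INTEGER);
INSERT INTO program VALUES (1);
-- Program 1 is linked to two entities in the bridge table.
INSERT INTO entity_program VALUES (10, 1), (20, 1);
INSERT INTO attendance VALUES (10, 1, 5), (10, 1, 7), (20, 1, 3);
""")

# The subquery is grouped, but the bridge table is joined on program_uid only.
base = """
SELECT s.entity_uid, s.program_uid, s.total
FROM program p
LEFT JOIN entity_program ep ON ep.program_uid = p.program_uid
JOIN (SELECT entity_uid, program_uid, SUM(attendees) AS total
      FROM attendance
      GROUP BY entity_uid, program_uid) s
  ON s.program_uid = p.program_uid
"""

# Every grouped row is repeated once per bridge row for the same program.
unfiltered = conn.execute(base + " ORDER BY s.entity_uid").fetchall()

# Matching the bridge row to the entity removes the repeats.
filtered = conn.execute(
    base + " WHERE ep.entity_uid = s.entity_uid ORDER BY s.entity_uid"
).fetchall()

print(len(unfiltered), filtered)
```

With two bridge rows and two grouped rows for the same program, the unfiltered join yields four rows, and the extra condition brings it back to one row per entity.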

Related

SELECT * FROM T1 LEFT JOIN T2 ... LEFT JOIN T3 ... WHERE T3.KEY NOT IN (1,2,3)

My application generates the following SQL-request to get the records matching teamkey:
select cr.callid, t.teamname, u.userfirstname
from callrecord cr
left join agentrecord ar on cr.callid = ar.callid
left join users u on ar.agentkey = u.userkey
left join teams t on u.teamkey = t.teamkey
where t.teamkey in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
This works fine.
When I tried to get the records NOT matching teamkey, the first idea was:
select cr.callid, t.teamname, u.userfirstname
from callrecord cr
left join agentrecord ar on cr.callid = ar.callid
left join users u on ar.agentkey = u.userkey
left join teams t on u.teamkey = t.teamkey
where t.teamkey not in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
This returns no data. It seems this requires a completely different SQL request.
Please help point me in the right direction.
A record from the callrecord table may have no matching record in the agentrecord table, and a record from the users table may have no matching record in the teams table, but I want them in the output.
Your query should work; for example, a row with a team key of 17 should be returned.
The condition is not exactly the negation of the original, because in SQL a comparison with a null value never evaluates to true (look up SQL three-valued logic: such comparisons evaluate to unknown).
Only is null and is distinct from (standard, but not supported by most RDBMS) can be used to compare nulls.
So the only rows you might be missing are those that don't have a team. If teamkey is null (in the table, or because one of the joins did not match), the row is not returned.
You can get those rows back by changing your condition to t.teamkey not in (...) or t.teamkey is null
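The three-valued logic described above can be checked with a small sketch using Python's built-in sqlite3 module. The single calls table is a hypothetical simplification: its nullable teamkey column stands in for the value that comes out of the chain of left joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical flattened result: teamkey is NULL when no team matched.
CREATE TABLE calls (callid INTEGER, teamkey INTEGER);
INSERT INTO calls VALUES (1, 3), (2, 17), (3, NULL);
""")

# NULL NOT IN (...) evaluates to unknown, so call 3 is filtered out.
not_in = conn.execute(
    "SELECT callid FROM calls WHERE teamkey NOT IN (1, 2, 3) ORDER BY callid"
).fetchall()

# Adding an explicit IS NULL test brings the team-less call back.
with_null = conn.execute(
    "SELECT callid FROM calls "
    "WHERE teamkey NOT IN (1, 2, 3) OR teamkey IS NULL ORDER BY callid"
).fetchall()

print(not_in, with_null)
```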

Multiple joins on the same table, Results Not Returned if Join Field is NULL

SELECT organizations_organization.code as organization,
core_user.email as Created_By,
assinees.email as Assigned_To,
from tickets_ticket
JOIN organizations_organization on tickets_ticket.organization_id = organizations_organization.id
JOIN core_user on tickets_ticket.created_by_id = core_user.id
Left JOIN core_user as assinees on assinees.id = tickets_ticket.currently_assigned_to_id
In the above query, if tickets_ticket.currently_assigned_to_id is null, then that row from tickets_ticket is not returned.
> Records In tickets_ticket = 109
> Returned Records = 4 (out of 109, 4 rows have a value for currently_assigned_to_id; the remaining 105 are null)
> Expected Records = 109 (with null set for Assigned_To)
Note: I am trying to achieve multiple joins on the same table.
A LEFT JOIN cannot remove rows from the output.
Your problem is here:
JOIN core_user on tickets_ticket.created_by_id = core_user.id
This inner join drops the non-matching records.
Try:
LEFT JOIN core_user on tickets_ticket.created_by_id = core_user.id
First, this is not the actual code you are running. There is a comma before the from clause that would cause a syntax error. If you have left out a where clause, then that would explain why rows are missing.
When using left joins, conditions on the first table go in the where clause. Conditions on subsequent tables go in the on clause.
That said, a where clause may not be the problem. I would suggest using left joins from the first table onward -- along with table aliases:
select oo.code as organization, cu.email as Created_By, a.email as Assigned_To
from tickets_ticket tt left join
organizations_organization oo
on tt.organization_id = oo.id left join
core_user cu
on tt.created_by_id = cu.id left join
core_user a
on a.id = tt.currently_assigned_to_id ;
I suspect that you have data in your data model that is unexpected -- perhaps bad organizations, perhaps bad created_by_id. Keep all the tickets to see what is missing.
That said, you should probably be including something like tt.id in the result set to identify the specific ticket.
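As a rough illustration of the point both answers make, the sketch below uses Python's built-in sqlite3 module with a hypothetical, cut-down schema (users and tickets are assumed names, and the sample rows are invented). It joins the same user table twice under two aliases and compares an inner join on the creator with a left join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical cut-down tables; ticket 102 points at a creator that does not exist.
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);
CREATE TABLE tickets (id INTEGER PRIMARY KEY, created_by_id INTEGER, assigned_to_id INTEGER);
INSERT INTO users VALUES (1, 'a@example.com'), (2, 'b@example.com');
INSERT INTO tickets VALUES (100, 1, 2), (101, 1, NULL), (102, 99, NULL);
""")

# Inner join on the creator, left join on the assignee.
inner_rows = conn.execute("""
SELECT t.id, c.email, a.email
FROM tickets t
JOIN users c ON t.created_by_id = c.id
LEFT JOIN users a ON a.id = t.assigned_to_id
ORDER BY t.id
""").fetchall()

# Left joins from the first table onward keep every ticket.
left_rows = conn.execute("""
SELECT t.id, c.email, a.email
FROM tickets t
LEFT JOIN users c ON t.created_by_id = c.id
LEFT JOIN users a ON a.id = t.assigned_to_id
ORDER BY t.id
""").fetchall()

print(inner_rows)
print(left_rows)
```

In this sample the unassigned ticket 101 survives both queries, so the null assignee is not what removes rows; only ticket 102, whose creator has no match, disappears under the inner join.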

Problems getting desired output from SQL JOIN Query

Trying to extract data from multiple SQL tables. I have a main table and a couple of sub-tables. I want to get all the rows from the main table given a condition and add some fields from the sub-tables. I figured an OUTER JOIN should have worked but I am not getting the entire data.
When I run a COUNT on the main table with the condition I get ~10k rows which is what I am expecting to get once I join the other tables. I understand that I will get NULL values on some row entries.
This is the query I came up with but I am only getting partial results
SELECT main_table.group_id, main_table.floor, sub_table1.Name, sub_table2.base
FROM main_table
LEFT JOIN sub_table1 ON main_table.group_id = sub_table1.group_id
LEFT JOIN sub_table2 ON main_table.group_id = sub_table2.group_id
WHERE main_table.year = 2000 AND sub_table1.year = 2000
AND sub_table2.year = 2000 AND main_table.group = 'C'
I am expecting to see a collection of about 10k rows since that is the number I get when only querying the main table with where clause.
SELECT COUNT(*) FROM main_table WHERE year = 2000 AND group = 'C';
Your where clause is filtering out the extra rows from the outer joins -- effectively turning them into inner joins.
Conditions on all but the first table should be in the on clauses. But I would phrase this as:
SELECT main_table.group_id, main_table.floor, sub_table1.Name, sub_table2.base
FROM main_table LEFT JOIN
sub_table1
ON main_table.group_id = sub_table1.group_id AND
main_table.year = sub_table1.year LEFT JOIN
sub_table2
ON main_table.group_id = sub_table2.group_id AND
main_table.year = sub_table2.year
WHERE main_table.year = 2000 AND main_table.group = 'C';
You want the years to be equal, so that should really be a JOIN condition. Then you only need to specify the year once in the WHERE clause.
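A small sketch of that difference, using Python's built-in sqlite3 module. Only one sub-table is modeled and the sample rows are invented for illustration; the table and column names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample data: group 2 has a sub-table row, but for a different year.
CREATE TABLE main_table (group_id INTEGER, year INTEGER);
CREATE TABLE sub_table1 (group_id INTEGER, year INTEGER, name TEXT);
INSERT INTO main_table VALUES (1, 2000), (2, 2000);
INSERT INTO sub_table1 VALUES (1, 2000, 'x'), (2, 1999, 'old');
""")

# Year condition on the sub-table in WHERE: group 2 is dropped.
where_rows = conn.execute("""
SELECT m.group_id, s.name
FROM main_table m
LEFT JOIN sub_table1 s ON m.group_id = s.group_id
WHERE m.year = 2000 AND s.year = 2000
ORDER BY m.group_id
""").fetchall()

# Year condition in ON: group 2 is kept, with NULL for the sub-table column.
on_rows = conn.execute("""
SELECT m.group_id, s.name
FROM main_table m
LEFT JOIN sub_table1 s ON m.group_id = s.group_id AND m.year = s.year
WHERE m.year = 2000
ORDER BY m.group_id
""").fetchall()

print(where_rows, on_rows)
```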
Conditions in the ON clause are used to perform the join, and conditions in the WHERE clause are used to filter the final result.
Apart from Gordon's answer, if your requirement is to allow different (or the same) years in the joins, you can use the following query:
SELECT main_table.group_id, main_table.floor, sub_table1.Name, sub_table2.base
FROM main_table LEFT JOIN
sub_table1
ON (main_table.group_id = sub_table1.group_id AND
sub_table1.year = 2000) LEFT JOIN
sub_table2
ON (main_table.group_id = sub_table2.group_id AND
sub_table2.year = 2000)
WHERE main_table.year = 2000 AND main_table.group = 'C';
Cheers!!

Why is LEFT JOIN deleting rows?

I have been using sql for a long time, but I am now working in Databricks and I am getting a very strange result. I have a table called block_durations with a set of ids (called block_ts), and I have another table called mergetable, which I want to left join to that table. Mergetable is indexed by acct_id and block_ts, so it has many different records for each block_ts. I want to keep the rows in block_durations that don't match, and if there are multiple matches in mergetable I want there to be multiple corresponding entries in the resulting join, as you would expect from a left join.
But this is not happening. In order to demonstrate this, I am showing the result of joining mergetable, after filtering for a single acct_id so that there is at most one match per block_ts.
select count(*) from mergetable where acct_id = '0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98'
16579
select count(*) from block_durations
82817
select count(*) from
(
SELECT
mt.*,
bd.block_duration
FROM
block_durations bd
left outer JOIN mergetable mt
ON mt.block_ts = bd.block_ts
where acct_id='0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98'
) countTable
16579
As you can see, even though there are >80000 records in block_durations, most of them are getting lost in the left join. Why is this happening? I thought the whole point of a left join is that the non-matching rows of the left table are kept. This is exactly the behavior I would expect from an inner join -- and indeed when I switch to an inner join nothing changes.
Could someone please help me figure out what's going on?
-Paul
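All rows from the left side of the join are preserved, but the WHERE condition is then applied to the joined result and removes every row that does not match it. For block_durations rows with no match in mergetable, acct_id is NULL, so they fail the condition.
Merge your WHERE condition into the JOIN condition: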
SELECT
mt.*,
bd.block_duration
FROM
block_durations bd
left outer JOIN mergetable mt
ON mt.block_ts = bd.block_ts AND acct_id='0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98'
You can also filter mergetable before you run JOIN on the results:
SELECT
mt.*,
bd.block_duration
FROM
block_durations bd
left outer JOIN (SELECT * FROM mergetable WHERE acct_id='0xfbb1b73c4f0bda4f67dca266ce6ef42f520fbb98') mt
ON mt.block_ts = bd.block_ts
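The three variants can be compared side by side with a tiny sketch using Python's built-in sqlite3 module. The data is invented and the account ids are shortened to hypothetical placeholders ('acct_a', 'acct_b'); only the row counts matter here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample data: three blocks, 'acct_a' appears only in block 1.
CREATE TABLE block_durations (block_ts INTEGER, block_duration INTEGER);
CREATE TABLE mergetable (acct_id TEXT, block_ts INTEGER);
INSERT INTO block_durations VALUES (1, 10), (2, 12), (3, 11);
INSERT INTO mergetable VALUES ('acct_a', 1), ('acct_b', 1), ('acct_b', 2);
""")

def count(sql):
    """Return the number of rows produced by the given SELECT."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]

# Filter in WHERE: unmatched blocks have a NULL acct_id and are removed.
n_where = count("""
SELECT bd.block_ts FROM block_durations bd
LEFT JOIN mergetable mt ON mt.block_ts = bd.block_ts
WHERE mt.acct_id = 'acct_a'
""")

# Filter in ON: every block is kept.
n_on = count("""
SELECT bd.block_ts FROM block_durations bd
LEFT JOIN mergetable mt ON mt.block_ts = bd.block_ts AND mt.acct_id = 'acct_a'
""")

# Filter in a derived table before joining: every block is kept as well.
n_derived = count("""
SELECT bd.block_ts FROM block_durations bd
LEFT JOIN (SELECT * FROM mergetable WHERE acct_id = 'acct_a') mt
  ON mt.block_ts = bd.block_ts
""")

print(n_where, n_on, n_derived)
```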

What's the difference between filtering in the WHERE clause compared to the ON clause?

I would like to know if there is any difference in using the WHERE clause or using the matching in the ON of the inner join.
The result in this case is the same.
First query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
and p.unitprice = Catmin.mn;
Second query:
with Catmin as
(
select categoryid, MIN(unitprice) as mn
from production.Products
group by categoryid
)
select p.productname, mn
from Catmin
inner join Production.Products p
on p.categoryid = Catmin.categoryid
where p.unitprice = Catmin.mn; -- this is changed
My answer may be a bit off-topic, but I would like to highlight a problem that may occur when you turn your INNER JOIN into an OUTER JOIN.
In this case, the most important difference between putting predicates (test conditions) in the ON clause or the WHERE clause is that you can turn LEFT or RIGHT OUTER JOINs into INNER JOINs without noticing it, if you put conditions on columns of the outer-joined table in the WHERE clause.
For example, in a LEFT JOIN between tables A and B, if you include a condition that involves fields of B on the WHERE clause, there's a good chance that there will be no null rows returned from B in the result set. Effectively, and implicitly, you turned your LEFT JOIN into an INNER JOIN.
On the other hand, if you include the same test in the ON clause, null rows will continue to be returned.
For example, take the query below:
SELECT * FROM A
LEFT JOIN B
ON A.ID=B.ID
The query will also return rows from A that do not match any of B.
Take this second query, where the test is moved to the WHERE clause (most databases require an ON clause for a LEFT JOIN, so a trivially true one is used here):
SELECT * FROM A
LEFT JOIN B
ON 1=1
WHERE A.ID=B.ID
This second query won't return any rows from A that don't match B, even though you think it will because you specified a LEFT JOIN. That's because the test A.ID=B.ID leaves out of the result set any rows where B.ID is null.
That's why I favor putting predicates in the ON clause rather than in the WHERE clause.
The results are exactly the same.
Using the ON clause is often suggested for better query performance.
Conceptually, instead of fetching the data from the tables and then filtering, the ON clause filters the first data set before joining it to the other tables, so there is less data to match. For inner joins, though, many optimizers produce the same plan either way.
There is no difference between the outputs of the above two queries; both return the same result.
When you use the ON clause, the join operation joins only those rows that match the condition specified in the ON clause.
Whereas with the WHERE clause, the join operation joins all the rows and then filters them based on the WHERE condition specified.
So, obviously, the ON clause is more effective and should be preferred over the WHERE condition.
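For the inner-join case from the question, the equality of the two result sets can be checked with a short sketch using Python's built-in sqlite3 module. The products table and its rows are invented for illustration, and the sketch compares results only; it says nothing about how any particular engine plans or executes the two forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Invented sample data: category 2 has two products tied for the minimum price.
CREATE TABLE products (productname TEXT, categoryid INTEGER, unitprice REAL);
INSERT INTO products VALUES ('A', 1, 5.0), ('B', 1, 9.0), ('C', 2, 4.0), ('D', 2, 4.0);
""")

cte = """
WITH catmin AS (
    SELECT categoryid, MIN(unitprice) AS mn
    FROM products
    GROUP BY categoryid
)
"""

# Price test in the ON clause.
on_rows = conn.execute(cte + """
SELECT p.productname, mn
FROM catmin
INNER JOIN products p
   ON p.categoryid = catmin.categoryid AND p.unitprice = catmin.mn
ORDER BY p.productname
""").fetchall()

# Price test in the WHERE clause.
where_rows = conn.execute(cte + """
SELECT p.productname, mn
FROM catmin
INNER JOIN products p
   ON p.categoryid = catmin.categoryid
WHERE p.unitprice = catmin.mn
ORDER BY p.productname
""").fetchall()

print(on_rows == where_rows, on_rows)
```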