Left join effectiveness when using IS NULL

I'm using a left join to check if certain types of information have been stored in the database.
I'm wondering whether a lot of resources are wasted if the joined table contains many rows that match the JOIN clause.
For example:
SELECT Applications.*
FROM Applications
LEFT JOIN SomeFeatureRows ON (SomeFeatureRows.ApplicationId = Applications.Id)
WHERE SomeFeatureRows.Id IS NULL;
Does the DB scan through all rows in SomeFeatureRows to see if there is a row where Id is NULL?
I just want to check if there is a row or not in that table (with the specified application id).
Edit, might as well include the real SQL statement:
SELECT organizations.id AS OrganizationId,
organizations.Name,
applications.Id AS ApplicationId,
applications.Name AS ApplicationName,
Account.id AS AccountId,
Account.Email,
Account.Username,
SentEmails.SentAtUtc
FROM organizations
INNER JOIN applications ON (organizations.id = applications.organizationid)
LEFT JOIN Incidents ON (organizations.id = Incidents.organizationid)
LEFT JOIN SentEmails ON (organizations.id = SentEmails.organizationid AND EmailTypeName = 'IncidentsReminder')
CROSS APPLY (SELECT accounts.id,
accounts.email,
accounts.username
FROM accounts,
organizationmembers
WHERE accounts.id = organizationmembers.accountid
AND organizationmembers.organizationid = organizations.id) Account
WHERE Incidents.id IS NULL

There is a very good article explaining the different techniques and their performance: Not Exists vs. Not In vs. Left join / Is null
To summarize:
In SQL Server, LEFT JOIN / IS NULL tends to be less efficient: it makes no attempt to skip already-matched values in the right table, so it returns all joined rows and filters them out afterwards. Use NOT EXISTS for best performance, as it produces a LEFT ANTI SEMI JOIN in the execution plan.
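
As a rough sketch of the two forms side by side, the snippet below runs both against a tiny in-memory SQLite database from Python. SQLite is only a stand-in here: its planner differs from SQL Server's, so this shows that the two queries return the same rows, not how fast either one is. The sample rows are made up.

```python
import sqlite3

# In-memory stand-in database; table and column names follow the question,
# the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Applications (Id INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE SomeFeatureRows (Id INTEGER PRIMARY KEY, ApplicationId INTEGER);
    INSERT INTO Applications VALUES (1, 'with rows'), (2, 'without rows');
    INSERT INTO SomeFeatureRows (ApplicationId) VALUES (1), (1), (1);
""")

# Anti-join written as LEFT JOIN / IS NULL.
left_join_rows = conn.execute("""
    SELECT a.Id
    FROM Applications a
    LEFT JOIN SomeFeatureRows f ON f.ApplicationId = a.Id
    WHERE f.Id IS NULL
    ORDER BY a.Id
""").fetchall()

# The same anti-join written with NOT EXISTS.
not_exists_rows = conn.execute("""
    SELECT a.Id
    FROM Applications a
    WHERE NOT EXISTS (
        SELECT 1 FROM SomeFeatureRows f WHERE f.ApplicationId = a.Id
    )
    ORDER BY a.Id
""").fetchall()

print(left_join_rows, not_exists_rows)
```

Both queries return only the application that has no feature rows. To compare cost on the real database, look at the actual execution plans there.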

Related

Join optimization PostgreSQL

I have 2 tables : Calls (10,000 rows) , CRM (25 million rows)
I want to do Calls left join CRM.
select *
from calls a
left join crm b
on (
(a.customerID = b.customerID)
OR
(a.Number1 in (b.Number_A,b.Number_B))
OR
(a.Number2 in (b.Number_A,b.Number_B))
);
When I join on customerID alone, it runs fine, but the query above times out and crashes.

I would suggest multiple left joins:
select c.*,
coalesce(cc.col1, c1a.col1, c1b.col1, c2a.col1, c2b.col1)
from calls c left join
crm cc
on c.customerID = cc.customerID left join
crm c1a
on c.Number1 = c1a.Number_A left join
crm c1b
on c.Number1 = c1b.Number_B left join
crm c2a
on c.Number2 = c2a.Number_A left join
crm c2b
on c.Number2 = c2b.Number_B;
This can then take advantage of indexes on crm(customerID), crm(Number_A), and crm(Number_B).
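
A minimal sketch of this approach, run from Python against an in-memory SQLite database rather than PostgreSQL. The id columns, the sample rows and the index names are assumptions; col1 is the same placeholder column used in the query above.

```python
import sqlite3

# Stand-in schema: the id columns, sample rows and index names are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calls (id INTEGER PRIMARY KEY, customerID INTEGER,
                        Number1 TEXT, Number2 TEXT);
    CREATE TABLE crm (id INTEGER PRIMARY KEY, customerID INTEGER,
                      Number_A TEXT, Number_B TEXT, col1 TEXT);
    -- One single-column index per crm column used in a join condition.
    CREATE INDEX crm_customer ON crm (customerID);
    CREATE INDEX crm_number_a ON crm (Number_A);
    CREATE INDEX crm_number_b ON crm (Number_B);
    INSERT INTO calls VALUES (1, 10, '111', '222'),
                             (2, 99, '333', '000'),
                             (3, 98, '000', '001');
    INSERT INTO crm VALUES (1, 10, 'x', 'y', 'matched on customerID'),
                           (2, 20, '333', 'z', 'matched on Number_A');
""")

# Each LEFT JOIN has a simple equality condition, so each can use one index.
rows = conn.execute("""
    SELECT c.id,
           COALESCE(cc.col1, c1a.col1, c1b.col1, c2a.col1, c2b.col1) AS col1
    FROM calls c
    LEFT JOIN crm cc  ON c.customerID = cc.customerID
    LEFT JOIN crm c1a ON c.Number1 = c1a.Number_A
    LEFT JOIN crm c1b ON c.Number1 = c1b.Number_B
    LEFT JOIN crm c2a ON c.Number2 = c2a.Number_A
    LEFT JOIN crm c2b ON c.Number2 = c2b.Number_B
    ORDER BY c.id
""").fetchall()

print(rows)
```

Note that this is not strictly equivalent to the OR join: COALESCE keeps one crm value per call in a fixed priority order instead of returning every matching crm row, and a call that matches several crm rows in more than one of the joins is multiplied in the output.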

Sometimes replacing one query that combines two conditions with OR by two queries glued together with UNION results in a better execution plan. I have never understood why DBMS optimizers don't consider this themselves, and I don't know whether it holds for PostgreSQL, but it may be worth a try.
In your case the query contains an outer join, which complicates matters. With separate queries we may get both an outer-joined (all-NULL) row and matching crm rows for the same call, and in that case we must get rid of the former.
select *
from
(
select * from calls left join crm on crm.customerID = calls.customerID
union
select * from calls left join crm on crm.number_a = calls.number1
union
select * from calls left join crm on crm.number_a = calls.number2
union
select * from calls left join crm on crm.number_b = calls.number1
union
select * from calls left join crm on crm.number_b = calls.number2
) data
order by rank() over (partition by calls.id order by case when crm.id is null then 2 else 1 end)
fetch first row with ties;
For this to work fast you should have one index per column in the query, i.e. six single-column indexes.
Whether this is faster than your original query depends on a lot of things. Mainly: the fewer matches the better.

SELECT * FROM T1 LEFT JOIN T2 ... LEFT JOIN T3 ... WHERE T3.KEY NOT IN (1,2,3)

My application generates the following SQL query to get the records matching teamkey:
select cr.callid, t.teamname, u.userfirstname
from callrecord cr
left join agentrecord ar on cr.callid = ar.callid
left join users u on ar.agentkey = u.userkey
left join teams t on u.teamkey = t.teamkey
where t.teamkey in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
This works fine.
When I tried to get the records NOT matching teamkey, the first idea was:
select cr.callid, t.teamname, u.userfirstname
from callrecord cr
left join agentrecord ar on cr.callid = ar.callid
left join users u on ar.agentkey = u.userkey
left join teams t on u.teamkey = t.teamkey
where t.teamkey not in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
This returns no data. It seems this requires a completely different SQL query.
Please help point me in the right direction.
A record from the callrecord table may have no matching record in agentrecord, and a record from users may have no matching record in teams, but I still want those records in the output.

Your query should work: for example, a row with a team key of 17 should be returned.
The condition is not the exact negation of the original, though, because in SQL a comparison with NULL never evaluates to true; it evaluates to unknown (look up SQL three-valued logic).
Only IS NULL and IS DISTINCT FROM (standard, but not supported by every RDBMS) can be used to compare nulls.
So the only rows you might be missing are those that don't have a team. If teamkey is null (in the table, or because one of the joins did not match), the row is not returned.
You can get those rows back by changing your condition to t.teamkey not in (...) or t.teamkey is null
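
To make the three-valued logic concrete, here is a small sketch using Python's sqlite3 module with an in-memory database. The table and column names follow the question; the sample rows and the shortened key list (1, 2, 3) are made up.

```python
import sqlite3

# Stand-in data: team 17 is outside the excluded list, user 3 has no team,
# and call 103 has no agent record at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE callrecord (callid INTEGER PRIMARY KEY);
    CREATE TABLE agentrecord (callid INTEGER, agentkey INTEGER);
    CREATE TABLE users (userkey INTEGER PRIMARY KEY, userfirstname TEXT,
                        teamkey INTEGER);
    CREATE TABLE teams (teamkey INTEGER PRIMARY KEY, teamname TEXT);
    INSERT INTO teams VALUES (1, 'Alpha'), (17, 'Omega');
    INSERT INTO users VALUES (1, 'Ann', 1), (2, 'Bob', 17), (3, 'Cy', NULL);
    INSERT INTO callrecord VALUES (100), (101), (102), (103);
    INSERT INTO agentrecord VALUES (100, 1), (101, 2), (102, 3);
""")

BASE = """
    SELECT cr.callid, t.teamname
    FROM callrecord cr
    LEFT JOIN agentrecord ar ON cr.callid = ar.callid
    LEFT JOIN users u ON ar.agentkey = u.userkey
    LEFT JOIN teams t ON u.teamkey = t.teamkey
    WHERE {condition}
    ORDER BY cr.callid
"""

# NOT IN alone: rows whose teamkey is NULL evaluate to unknown and are dropped.
not_in_only = conn.execute(
    BASE.format(condition="t.teamkey NOT IN (1, 2, 3)")
).fetchall()

# Adding OR ... IS NULL brings the team-less rows back.
with_nulls = conn.execute(
    BASE.format(condition="t.teamkey NOT IN (1, 2, 3) OR t.teamkey IS NULL")
).fetchall()

print(not_in_only)
print(with_nulls)
```

The first query returns only call 101 (team 17). The second also returns calls 102 and 103, which have no team because a join found no match or the user's teamkey is NULL.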

Multiple joins on the same table, Results Not Returned if Join Field is NULL

SELECT organizations_organization.code as organization,
core_user.email as Created_By,
assinees.email as Assigned_To,
from tickets_ticket
JOIN organizations_organization on tickets_ticket.organization_id = organizations_organization.id
JOIN core_user on tickets_ticket.created_by_id = core_user.id
Left JOIN core_user as assinees on assinees.id = tickets_ticket.currently_assigned_to_id
In the above query, if tickets_ticket.currently_assigned_to_id is null, then that row from tickets_ticket is not returned.
> Records in tickets_ticket = 109
> Returned records = 4 (out of 109, 4 rows have a value for currently_assigned_to_id; the other 105 are null)
> Expected records = 109 (with null set for Assigned_To)
Note I am trying to achieve multiple joins on the same table

A LEFT JOIN cannot remove rows from the output.
Your problem is here:
JOIN core_user on tickets_ticket.created_by_id = core_user.id
This inner join drops tickets that have no matching record.
Try:
LEFT JOIN core_user on tickets_ticket.created_by_id = core_user.id

First, this is not the actual code you are running. There is a comma before the from clause that would cause a syntax error. If you have left out a where clause, then that would explain why you are seeing no rows.
When using left joins, conditions on the first table go in the where clause. Conditions on subsequent tables go in the on clause.
That said, a where clause may not be the problem. I would suggest using left joins from the first table onward -- along with table aliases:
select oo.code as organization, cu.email as Created_By, a.email as Assigned_To
from tickets_ticket tt left join
organizations_organization oo
on tt.organization_id = oo.id left join
core_user cu
on tt.created_by_id = cu.id left join
core_user a
on a.id = tt.currently_assigned_to_id ;
I suspect that you have data in your data model that is unexpected -- perhaps bad organizations, perhaps bad created_by_id. Keep all the tickets to see what is missing.
That said, you should probably be including something like tt.id in the result set to identify the specific ticket.
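
The rule about where conditions belong can be illustrated with a small sketch, again using Python's sqlite3 module and an in-memory database. The email filter is a hypothetical condition invented for this example, as are the sample rows; only the table and column names come from the question.

```python
import sqlite3

# Stand-in data: ticket 3 is unassigned, ticket 2's assignee fails the filter.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE core_user (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE tickets_ticket (id INTEGER PRIMARY KEY,
                                 currently_assigned_to_id INTEGER);
    INSERT INTO core_user VALUES (1, 'ann@example.com'), (2, 'bob@other.org');
    INSERT INTO tickets_ticket VALUES (1, 1), (2, 2), (3, NULL);
""")

# Condition on the joined table in WHERE: rows with a NULL assignee fail the
# filter, so the LEFT JOIN effectively behaves like an inner join.
filter_in_where = conn.execute("""
    SELECT tt.id, a.email
    FROM tickets_ticket tt
    LEFT JOIN core_user a ON a.id = tt.currently_assigned_to_id
    WHERE a.email LIKE '%@example.com'
    ORDER BY tt.id
""").fetchall()

# Same condition in ON: every ticket is kept, non-matching assignees are NULL.
filter_in_on = conn.execute("""
    SELECT tt.id, a.email
    FROM tickets_ticket tt
    LEFT JOIN core_user a ON a.id = tt.currently_assigned_to_id
                         AND a.email LIKE '%@example.com'
    ORDER BY tt.id
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

With the filter in WHERE only ticket 1 survives. With the filter in ON all three tickets come back, and Assigned_To is NULL for tickets 2 and 3.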

Either of two SQL tables have a value

I have two tables, task_list_sharee and task_list_assignee. They both have a reference to a task_list table.
There's also a task table that has a reference to the task_list table since a task always exists within a task_list.
Given a task, I want to find out if either task_list_sharee OR task_list_assignee have values. Right now I'm doing it as two SQL statements, like so:
SELECT count(*)
FROM task_list_assignee a
INNER JOIN task_list l ON l.uid = a.task_list_uid
INNER JOIN task t ON t.task_list_uid = l.uid
WHERE t.uid = ?
SELECT count(*)
FROM task_list_sharee s
INNER JOIN task_list l ON l.uid = s.task_list_uid
INNER JOIN task t ON t.task_list_uid = l.uid
WHERE t.uid = ?
and if either is non-zero I punt. I'm thinking this has to be doable as just a single SQL statement but I'm a bit stumped.

The performance of a full count over multiple joins (even more so with LEFT JOIN) can deteriorate quickly. Since all you need is proof that a single row exists, there is no need for a count. Use EXISTS, true to its name, to allow an optimal query plan:
SELECT EXISTS (
SELECT 1
FROM task t
WHERE t.uid = ? -- provide uid here
AND (
EXISTS (
SELECT 1
FROM task_list_assignee
WHERE task_list_uid = t.task_list_uid
)
OR EXISTS (
SELECT 1
FROM task_list_sharee
WHERE task_list_uid = t.task_list_uid
)
)
);
Should be substantially faster than a full count.
I also cut out the middleman. Joining to task_list only establishes that the related row in task_list exists - which is a waste of time given that:
a task always exists within a task_list.
Ideally implemented with FK constraints to enforce referential integrity.
In the absence of actual table definitions my educated guess will have to do.
To make this fast for any table size, you need 3 btree (default) indexes on
task(uid, task_list_uid)
task_list_assignee(task_list_uid)
task_list_sharee(task_list_uid)
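
For what it's worth, here is how that query might be called from application code. This is a sketch using Python's sqlite3 module with an in-memory database as a stand-in. The table definitions are guessed from the question, and the helper name has_assignee_or_sharee, the index names and the sample rows are invented for illustration.

```python
import sqlite3

# Guessed schema plus the three suggested indexes; sample rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE task (uid INTEGER PRIMARY KEY, task_list_uid INTEGER);
    CREATE TABLE task_list_assignee (task_list_uid INTEGER);
    CREATE TABLE task_list_sharee (task_list_uid INTEGER);
    CREATE INDEX task_uid_list ON task (uid, task_list_uid);
    CREATE INDEX assignee_list ON task_list_assignee (task_list_uid);
    CREATE INDEX sharee_list ON task_list_sharee (task_list_uid);
    INSERT INTO task VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO task_list_assignee VALUES (10);
    INSERT INTO task_list_sharee VALUES (20);
""")

QUERY = """
    SELECT EXISTS (
        SELECT 1
        FROM task t
        WHERE t.uid = ?
          AND (
              EXISTS (SELECT 1 FROM task_list_assignee
                      WHERE task_list_uid = t.task_list_uid)
              OR
              EXISTS (SELECT 1 FROM task_list_sharee
                      WHERE task_list_uid = t.task_list_uid)
          )
    )
"""

def has_assignee_or_sharee(task_uid):
    """Return True if the task's list has at least one assignee or sharee."""
    (found,) = conn.execute(QUERY, (task_uid,)).fetchone()
    return bool(found)

# Task 1 has an assignee, task 2 a sharee, task 3 neither, task 99 is missing.
results = {uid: has_assignee_or_sharee(uid) for uid in (1, 2, 3, 99)}
print(results)
```

A single round trip replaces the two count queries, and a task uid that does not exist simply yields False.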

If you start with Task as your base table and do a series of left joins, you should be able to determine which tasks have a value for either assignee or sharee with a coalesce:
select count(*)
from task as t
left join
task_list as l
on t.task_list_uid = l.uid
left join
task_list_assignee as a
on l.uid = a.task_list_uid
left join
task_list_sharee as s
on l.uid = s.task_list_uid
where coalesce( a.task_list_uid, s.task_list_uid ) is not null
and t.uid = ?

Counting Related Records : Query Taking Over 2 Minutes To Run

Considering the diagram above I am trying to select bulletins along with related info.
A bulletin can have only one associated user (the creator)
A bulletin can have only one state (the creator's home state)
A bulletin can have only one bulletin type (E.G. Announcement, for sale, etc)
A bulletin can have 0 or 1 event tied to it
A bulletin can have many likes
A bulletin can have many comments
As far as the states go a region can have many states
Using the query below causes it to run for 2 minutes before I hit the cancel button. I have not tried to run it for more than that.
SELECT TOP 10 Bulletins.Id, LEFT(Bulletins.Body, 350) AS BodySnippet, Bulletins.CreationDateTime
, Bulletins.UserId AS PosterId, Bulletins.StateId, Bulletins.EventId,
Bulletins.BulletinTypeId, Bulletins.[Views], Users.UserName,
Users.Zipcode as ZipCode, Users.StateId as StateId, Users.City,
States.Name, States.UnitedStatesRegionId, RegionsOfTheUnitedStates.Name,
COUNT(BulletinLikes.Id) AS Likes, COUNT(BulletinComments.Id) AS Comments
FROM Bulletins
INNER JOIN Users ON Bulletins.UserId = Users.Id
INNER JOIN States ON Bulletins.StateId = States.Id
INNER JOIN RegionsOfTheUnitedStates ON States.UnitedStatesRegionId = RegionsOfTheUnitedStates.Id
INNER JOIN BulletinTypes ON Bulletins.BulletinTypeId = BulletinTypes.Id
LEFT JOIN [Events] ON Bulletins.EventId = [Events].Id
LEFT JOIN BulletinLikes ON Bulletins.Id = BulletinLikes.BulletinId
LEFT JOIN BulletinComments ON Bulletins.Id = BulletinComments.BulletinId
GROUP BY Bulletins.Id, Bulletins.Body, Bulletins.CreationDateTime
, Bulletins.UserId, Bulletins.StateId, Bulletins.EventId,
Bulletins.BulletinTypeId, Bulletins.[Views], Users.UserName,
Users.Zipcode, Users.StateId, Users.City,
States.Name, States.UnitedStatesRegionId, RegionsOfTheUnitedStates.Name
Deleting the line that counts Likes and Comments makes the query return instantly. My tables hold lots of dummy data: some of these bulletins have hundreds or a couple of thousand likes or comments. That still does not seem like enough to make the query run for more than 2 minutes. I am no expert in T-SQL, so I suspect it boils down to how I'm counting or how I'm grouping.
What would be the proper way to return the counted related records in my specific scenario?
EDIT 1
My ER diagram is completely off in one part. I closed the website I was using to create it and lost it. Here is the correction:
Bulletins is tied to BulletinTypes with a BulletinTypeFK inside the Bulletins table (the reason being that we use BulletinTypes for a drop-down).
EDIT 2
I just found out you can do some profiling on SQL Azure and came up with these two screenshots of information; however, I'm not 100% sure what to take from them.
It looks as if the first sort operation is taking up 54.2% of resources. The first index seek looks pretty high too, at 32.2%.

The first thing I'd do is check the performance of a much simpler query that touches only the tables with the biggest effect (you mentioned that BulletinLikes and BulletinComments are the biggest performance offenders):
SELECT TOP 10 b.id, COUNT(bl.Id) AS likes, COUNT(bc.Id) AS Comments
FROM Bulletins b
LEFT JOIN BulletinLikes bl ON b.Id = bl.BulletinId
LEFT JOIN BulletinComments bc ON b.Id = bc.BulletinId
GROUP BY b.id
If that gives decent performance, I'd make it a subquery or CTE, whichever syntax you prefer, and join the rest to its result.
The general idea is to get rid of the huge GROUP BY.
Side note: TOP without ORDER BY is not guaranteed to give consistent results.

Without the counts, those left joins don't even need to be performed, and the query optimizer probably figures that out.
You don't even use Events in the query, so drop it.
Make sure you have indexes on all those join conditions (BulletinId) and that they are not fragmented.
When these two queries run fast, your query will run fast:
select count(distinct(BulletinId)) from BulletinLikes
select count(distinct(BulletinId)) from BulletinComments
(You may also need an index on regionId.)
SELECT TOP 10 Bulletins.Id, LEFT(Bulletins.Body, 350) AS BodySnippet
, Bulletins.CreationDateTime
, Bulletins.UserId AS PosterId, Bulletins.StateId, Bulletins.EventId
, Bulletins.BulletinTypeId, Bulletins.[Views]
, Users.UserName, Users.Zipcode as ZipCode, Users.StateId as StateId, Users.City
, States.Name, States.UnitedStatesRegionId
, RegionsOfTheUnitedStates.Name
, COUNT(BulletinLikes.Id) AS Likes
, COUNT(BulletinComments.Id) AS Comments
FROM Bulletins
INNER JOIN Users
ON Bulletins.UserId = Users.Id
INNER JOIN States
ON Bulletins.StateId = States.Id
INNER JOIN RegionsOfTheUnitedStates
ON States.UnitedStatesRegionId = RegionsOfTheUnitedStates.Id
INNER JOIN BulletinTypes
ON Bulletins.BulletinTypeId = BulletinTypes.Id
LEFT JOIN [Events]
ON Bulletins.EventId = [Events].Id
LEFT JOIN BulletinLikes
ON Bulletins.Id = BulletinLikes.BulletinId
LEFT JOIN BulletinComments
ON Bulletins.Id = BulletinComments.BulletinId
GROUP BY Bulletins.Id, Bulletins.Body, Bulletins.CreationDateTime
, Bulletins.UserId, Bulletins.StateId, Bulletins.EventId
, Bulletins.BulletinTypeId, Bulletins.[Views]
, Users.UserName, Users.Zipcode, Users.StateId, Users.City
, States.Name, States.UnitedStatesRegionId
, RegionsOfTheUnitedStates.Name

There is nothing wrong with the form of your query (although you may want to consider if you need to select so many columns, but that is beside the point).
You may want to focus on the indexes that exist on all of the columns in your join conditions. Most of the time, we join on columns that are in a foreign key relationship to a primary key, and thus there is likely a (default) clustered index on that column, but you'll want to check to be sure: each of these columns should be the first column in some index on each of the tables in question (at least for the tables with more than a trivial number of rows).

I would try pulling the COUNT fields out into sub-queries, and avoid the whole GROUP BY statement:
SELECT TOP 10 Bulletins.Id, LEFT(Bulletins.Body, 350) AS BodySnippet, Bulletins.CreationDateTime, Bulletins.UserId AS PosterId, Bulletins.StateId, Bulletins.EventId, Bulletins.BulletinTypeId, Bulletins.[Views], Users.UserName, Users.Zipcode as ZipCode, Users.StateId as StateId, Users.City, States.Name, States.UnitedStatesRegionId, RegionsOfTheUnitedStates.Name,
(SELECT COUNT(*) FROM BulletinLikes bl WHERE bl.BulletinId = Bulletins.Id) AS Likes,
(SELECT COUNT(*) FROM BulletinComments bc WHERE bc.BulletinId = Bulletins.Id) AS Comments
FROM Bulletins
INNER JOIN Users ON Bulletins.UserId = Users.Id
INNER JOIN States ON Bulletins.StateId = States.Id
INNER JOIN RegionsOfTheUnitedStates ON States.UnitedStatesRegionId = RegionsOfTheUnitedStates.Id
INNER JOIN BulletinTypes ON Bulletins.BulletinTypeId = BulletinTypes.Id
LEFT JOIN [Events] ON Bulletins.EventId = [Events].Id
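
One thing worth checking when comparing the two shapes: left-joining two one-to-many tables to the same parent pairs every like with every comment before the GROUP BY runs, which inflates both the intermediate row count and the counts themselves. The sketch below shows this on a tiny in-memory SQLite database driven from Python. It is a stand-in for SQL Server with made-up rows, so it illustrates the row multiplication only, not the actual timings.

```python
import sqlite3

# Stand-in data: bulletin 1 has 3 likes and 2 comments,
# bulletin 2 has no likes and 1 comment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Bulletins (Id INTEGER PRIMARY KEY);
    CREATE TABLE BulletinLikes (Id INTEGER PRIMARY KEY, BulletinId INTEGER);
    CREATE TABLE BulletinComments (Id INTEGER PRIMARY KEY, BulletinId INTEGER);
    INSERT INTO Bulletins VALUES (1), (2);
    INSERT INTO BulletinLikes (BulletinId) VALUES (1), (1), (1);
    INSERT INTO BulletinComments (BulletinId) VALUES (1), (1), (2);
""")

# Two LEFT JOINs plus GROUP BY: each like is paired with each comment.
joined_counts = conn.execute("""
    SELECT b.Id, COUNT(bl.Id) AS Likes, COUNT(bc.Id) AS Comments
    FROM Bulletins b
    LEFT JOIN BulletinLikes bl ON b.Id = bl.BulletinId
    LEFT JOIN BulletinComments bc ON b.Id = bc.BulletinId
    GROUP BY b.Id
    ORDER BY b.Id
""").fetchall()

# Correlated sub-queries: each table is counted independently.
subquery_counts = conn.execute("""
    SELECT b.Id,
           (SELECT COUNT(*) FROM BulletinLikes bl
            WHERE bl.BulletinId = b.Id) AS Likes,
           (SELECT COUNT(*) FROM BulletinComments bc
            WHERE bc.BulletinId = b.Id) AS Comments
    FROM Bulletins b
    ORDER BY b.Id
""").fetchall()

print(joined_counts)
print(subquery_counts)
```

For bulletin 1 the joined version reports 6 likes and 6 comments (3 x 2 rows), while the sub-query version reports 3 and 2. With thousands of likes and comments per bulletin, the joined form has to build and sort millions of intermediate rows.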
