Left join not being optimized - sql

On a SQL Server database, consider a classical parent-child relation like the following:
create table Parent(
    p_id uniqueidentifier primary key,
    p_col1 int,
    p_col2 int
);
create table Child(
    c_id uniqueidentifier primary key,
    c_p uniqueidentifier foreign key references Parent(p_id)
);

declare @Id int;
set @Id = 1;
while @Id <= 10000
begin
    insert into Parent(p_id, p_col1, p_col2) values (NEWID(), @Id, @Id);
    set @Id = @Id + 1;
end

insert into Child(c_id, c_p) select NEWID(), p_id from Parent;
insert into Child(c_id, c_p) select NEWID(), p_id from Parent;
insert into Child(c_id, c_p) select NEWID(), p_id from Parent;
Now I have these two equivalent queries, one using inner and the other using left join:
Inner query:
select *
from Child c
inner join Parent p
on p.p_id=c.c_p
where p.p_col1=1 or p.p_col2=2;
Left Join query:
select *
from Child c
left join Parent p
on p.p_id=c.c_p
where p.p_col1=1 or p.p_col2=2;
I thought the SQL optimizer would be smart enough to figure out the same execution plan for these two queries, but that's not the case.
The plan for the inner join query is this: [execution plan screenshot]
The plan for the left join query is this: [execution plan screenshot]
The optimizer works nicely, choosing the same plan, if I have only one condition like:
where p.p_col1=1
But if I add an "or" on a second, different column, then it doesn't choose the best plan anymore:
where p.p_col1=1 or p.p_col2=2;
Am I missing something, or is it just the optimizer that is missing this improvement?

Clearly, it is the optimizer.
When you have one condition in the WHERE clause (and "condition" could be several conditions connected with ANDs, but not ORs), the optimizer can easily peek at it and say "this condition rejects NULL values from the second table, so no unmatched rows can survive the filter; this is really an inner join".
That logic gets harder when the conditions are connected by OR, and as you have observed, the optimizer does not perform this simplification for more complex conditions.
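To see why the two queries are equivalent in the first place: for a Child row with no Parent match, every p column is NULL, so each disjunct evaluates to UNKNOWN and the WHERE clause filters the row out anyway. A minimal demonstration of that NULL semantics:

-- Each comparison against NULL yields UNKNOWN, so the whole OR is
-- UNKNOWN and the WHERE clause rejects the unmatched row
select case
           when cast(null as int) = 1 or cast(null as int) = 2 then 'kept'
           else 'filtered out'
       end as unmatched_row_fate;   -- returns 'filtered out'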

Sometimes, if you change the order of the conditions, the generated plans are different. The optimizer doesn't check every possible implementation scenario (unfortunately). That is why you sometimes have to use hints for optimization.
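For example, one blunt way to nudge the plan (a sketch only; whether it helps depends entirely on your data) is a physical join hint on the query from the question:

-- A join hint forces both the physical join algorithm and the join
-- order for the whole query, so treat it as a last resort
select *
from Child c
inner hash join Parent p
    on p.p_id = c.c_p
where p.p_col1 = 1 or p.p_col2 = 2;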

Related

Performance - Select query with left join and null check

I have two tables, called Processing (30M records for now) and EtlRecord (4.3M records for now).
As the table names suggest, they are used for normalizing data with ETL.
We are trying to process the records in batches of 1000 records each.
SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P (NOLOCK)
LEFT JOIN [core].[EtlRecord] E (NOLOCK) ON E.StreamGuid = P.StreamGuid
WHERE E.StreamGuid IS NULL
AND P.CompleteDate IS NOT NULL
AND P.StreamGuid IS NOT NULL
Execution of this query currently takes around 20 seconds, and we are expecting more and more data, especially in the EtlRecord table. To improve the performance of this query I checked the actual execution plan, which I shared below.
As you can see, the most time-consuming part is the index seek used to find the null records from the EtlRecord table. I have tried several changes but haven't been able to improve it.
Additional notes
All the indexes suggested by the execution plan have already been applied to the tables, so there are no further index suggestions.
There are 8 columns in the Processing table, mostly boolean flags, and 4 columns in the EtlRecord table.
The EtlRecord table is only used by a single procedure, so there is no issue with transaction locks.
Any suggestions to improve this query will be really helpful.
Well, in your query you need to get the records from [staging].[Processing] which have no corresponding record in [core].[EtlRecord].
You can remove the already-processed records first:
DELETE [staging].[Processing]
FROM [staging].[Processing] P
INNER JOIN [core].[EtlRecord] E
ON E.StreamGuid = P.StreamGuid;
You can do the deletion in batches if you need to. Removing these records will simplify the initial query and get rid of the nasty join on a uniqueidentifier. You then simply need to do something like this for each batch:
SELECT TOP 1000 StreamGuid
INTO #buffer
FROM [staging].[Processing]
WHERE CompleteDate IS NOT NULL
AND StreamGuid IS NOT NULL;
-- do whatever you need with these records
DELETE FROM [staging].[Processing]
WHERE StreamGuid IN (SELECT StreamGuid FROM #buffer);
Also, you have said that you have all the indexes created, but the indexes suggested by the execution plan are not always the best ones. This part here:
WHERE CompleteDate IS NOT NULL
AND StreamGuid IS NOT NULL;
seems like a very good candidate for a filtered index, especially if a large share of the rows has a NULL value in one of these columns.
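A minimal sketch of what that filtered index could look like (the index name is invented; adjust the key columns to match how the query actually filters):

-- Only rows that can ever qualify for the batch query are indexed,
-- keeping the index small even as the table grows
CREATE NONCLUSTERED INDEX ix_Processing_Pending
    ON [staging].[Processing] (StreamGuid)
    WHERE CompleteDate IS NOT NULL
      AND StreamGuid IS NOT NULL;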
First, DDL and easily consumable sample data, like below, will help a great deal. You can copy/paste my solutions and run them locally to see what I'm talking about.
IF OBJECT_ID('tempdb..#processing','U') IS NOT NULL DROP TABLE #processing;
IF OBJECT_ID('tempdb..#EtlRecord','U') IS NOT NULL DROP TABLE #EtlRecord;
SELECT TOP (100)
    StreamGuid   = NEWID(),
    CompleteDate = CASE WHEN CHECKSUM(NEWID()) % 3 < 2 THEN GETDATE() END
INTO #processing
FROM sys.all_columns AS a;

SELECT TOP (80) p.StreamGuid
INTO #EtlRecord
FROM #processing AS p;
ALTER TABLE #processing ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
ALTER TABLE #EtlRecord ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
GO
ALTER TABLE #processing ADD CONSTRAINT pk_processing PRIMARY KEY CLUSTERED(StreamGuid);
ALTER TABLE #etlRecord ADD CONSTRAINT pk_etlRecord PRIMARY KEY CLUSTERED(StreamGuid);
GO
Next, understand that without an ORDER BY clause your query is not guaranteed to return the same records each time. For example, if SQL Server picks a parallel execution plan you will very likely get different rows. I have also seen cases where including the ORDER BY actually improves performance.
With that in mind, note that this...
SELECT --TOP 1000
P.StreamGuid
FROM #processing AS p
LEFT JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE e.StreamGuid IS NOT NULL
AND P.CompleteDate IS NOT NULL
... will return the exact same thing as this:
SELECT TOP 1000
P.StreamGuid
FROM #processing AS p
JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE p.CompleteDate IS NOT NULL;
Note that the join predicate e.StreamGuid = p.StreamGuid already implies that both values are NOT NULL, because NULL never compares equal to anything, not even to itself. This query...
DECLARE @X INT;
SELECT AreTheyEqual = IIF(@X=@X,'Yep','Nope');
... returns:
AreTheyEqual
------------
Nope
I agree with the solution @gotqn posted about the filtered index. Using my sample data, you can add something like this:
CREATE NONCLUSTERED INDEX nc_processing ON #processing(CompleteDate,StreamGuid)
WHERE CompleteDate IS NOT NULL;
Then you can add an ORDER BY CompleteDate to the query to coerce the optimizer into choosing that index (on my system it doesn't pick the index unless I add an ORDER BY). The ORDER BY also makes your query deterministic and more predictable.
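For example (continuing with the sample tables above; the ORDER BY is the only change from the earlier query):

-- The ORDER BY both stabilizes which 1000 rows are returned and
-- steers the optimizer toward the filtered index
SELECT TOP (1000) p.StreamGuid
FROM #processing AS p
JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE p.CompleteDate IS NOT NULL
ORDER BY p.CompleteDate;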
I would suggest writing this as:
SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P
WHERE P.CompleteDate IS NOT NULL
  AND P.StreamGuid IS NOT NULL
  AND NOT EXISTS (SELECT 1
                  FROM [core].[EtlRecord] E
                  WHERE E.StreamGuid = P.StreamGuid);
I removed the NOLOCK directive. Only use it if you really know what you are doing -- and are prepared to read invalid data.
Then you definitely want an index on EtlRecord(StreamGuid).
You probably also want an index on Processing(CompleteDate, StreamGuid). This is at least a covering index for the query.
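A sketch of those two indexes (the index names are invented for illustration):

-- Supports the NOT EXISTS probe into EtlRecord
CREATE INDEX ix_EtlRecord_StreamGuid
    ON [core].[EtlRecord] (StreamGuid);

-- Covers the outer query: filter on CompleteDate, return StreamGuid
CREATE INDEX ix_Processing_CompleteDate_StreamGuid
    ON [staging].[Processing] (CompleteDate, StreamGuid);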

Does EXCEPT execute faster than a JOIN when the table columns are the same

To find all the changes between two databases, I am left joining the tables on the pk and using a date_modified field to choose the latest record. Will using EXCEPT improve performance, since the tables have the same schema? I would like to rewrite it with an EXCEPT, but I'm not sure the implementation of EXCEPT would outperform a JOIN in every case. Hopefully someone has a more technical explanation for when to use EXCEPT.
There is no way anyone can tell you that EXCEPT will always or never out-perform an equivalent OUTER JOIN. The optimizer will choose an appropriate execution plan regardless of how you write your intent.
That said, here is my guideline:
Use EXCEPT when at least one of the following is true:
The query is more readable (this will almost always be true).
Performance is improved.
And BOTH of the following are true:
The query produces semantically identical results, and you can demonstrate this through sufficient regression testing, including all edge cases.
Performance is not degraded (again, in all edge cases, as well as environmental changes such as clearing buffer pool, updating statistics, clearing plan cache, and restarting the service).
It is important to note that it can be a challenge to write an equivalent EXCEPT query as the JOIN becomes more complex and/or you are relying on duplicates in some of the columns but not others. Writing a NOT EXISTS equivalent, while slightly less readable than EXCEPT, should be far easier to accomplish, and will often lead to a better plan (but note that I would never say ALWAYS or NEVER, except in the way I just did).
In this blog post I demonstrate at least one case where EXCEPT is outperformed by both a properly constructed LEFT OUTER JOIN and of course by an equivalent NOT EXISTS variation.
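As a generic sketch of the two forms (hypothetical tables A and B, each keyed by id): they are only equivalent when id is unique and not nullable, because EXCEPT removes duplicates and treats two NULLs as equal, while NOT EXISTS does neither.

-- Rows of A whose id has no match in B, written both ways
SELECT a.id FROM A AS a
EXCEPT
SELECT b.id FROM B AS b;

SELECT a.id
FROM A AS a
WHERE NOT EXISTS (SELECT 1 FROM B AS b WHERE b.id = a.id);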
In the following example, the LEFT JOIN is faster than EXCEPT by 70%
(PostgreSQL 9.4.3)
Example:
There are three tables: suppliers, parts, shipments.
We need to get all parts not supplied by any supplier in London.
Database (it has indexes on all involved columns):
CREATE TABLE suppliers (
    id bigint primary key,
    city character varying NOT NULL
);
CREATE TABLE parts (
    id bigint primary key,
    name character varying NOT NULL
);
CREATE TABLE shipments (
    id bigint primary key,
    supplier_id bigint NOT NULL,
    part_id bigint NOT NULL
);
Records count:
db=# SELECT COUNT(*) FROM suppliers;
count
---------
1281280
(1 row)
db=# SELECT COUNT(*) FROM parts;
count
---------
1280000
(1 row)
db=# SELECT COUNT(*) FROM shipments;
count
---------
1760161
(1 row)
Query using EXCEPT.
SELECT parts.*
FROM parts
EXCEPT
SELECT parts.*
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
;
-- Execution time: 3327.728 ms
Query using a LEFT JOIN against the table returned by a subquery.
SELECT parts.*
FROM parts
LEFT JOIN (
SELECT parts.id
FROM parts
LEFT JOIN shipments
ON (parts.id = shipments.part_id)
LEFT JOIN suppliers
ON (shipments.supplier_id = suppliers.id)
WHERE suppliers.city = 'London'
) AS subquery_tbl
ON (parts.id = subquery_tbl.id)
WHERE subquery_tbl.id IS NULL
;
-- Execution time: 1136.393 ms
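For comparison, the NOT EXISTS form that the previous answer recommends would look like this (not timed by the original poster, so treat it as a sketch):

-- Anti-join written as NOT EXISTS: parts with no shipment
-- from any supplier in London
SELECT parts.*
FROM parts
WHERE NOT EXISTS (
    SELECT 1
    FROM shipments
    JOIN suppliers ON suppliers.id = shipments.supplier_id
    WHERE shipments.part_id = parts.id
      AND suppliers.city = 'London'
);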

Most efficient way to identify values not present in a table using SQL

I currently need to search a table of items to validate a list of items entered by the user.
The table of items contains a unique primary key for each item, called ItemId (which corresponds to the values entered by the user).
Given a table with 10000 (ten-thousand) rows, what would be the most efficient way to search the ItemId column and determine if ANY of the items the user entered did not exist in the table?
For example, given the table:
ItemId Color Price
1000 Blue 3.00
1001 Red 4.00
1003 Green 1.25
And the user enters the following:
1000
1001
1002
I'd like to throw up an error to alert the user that one of the items (1002) is invalid. It is not a requirement to specifically identify which item is invalid, only that one or more items do not exist in the table. I've tried using IF NOT EXISTS as well as EXCEPT, but I don't have a feel for a 'best practice' in terms of efficiency. Normally I would examine the execution plan, but I don't really know where to start. I would greatly appreciate any and all suggestions!
I agree with Chris' answer in that using a TVP is a good way of passing in the user supplied values but would make the following changes.
Declare the TVP with a Primary Key
CREATE TYPE ItemTableType AS TABLE
( ItemId INT PRIMARY KEY);
Use NOT EXISTS instead of OUTER JOIN ... NULL. It is implemented as an efficient anti semi join. The OPTION (RECOMPILE) hint and the PRIMARY KEY should give the optimiser sufficient information to choose the best join strategy for the sizes of both tables.
-- returns 0 if any user-supplied ItemId is missing from the table, else 1
SELECT CASE
           WHEN EXISTS (SELECT *
                        FROM @Items i
                        WHERE NOT EXISTS (SELECT *
                                          FROM dbo.MyItemTable it
                                          WHERE i.ItemId = it.ItemId)) THEN 0
           ELSE 1
       END
OPTION (RECOMPILE);
OUTER JOIN ... NULL can end up outer joining all rows and then eliminating the NOT NULL ones with a filter afterwards, which is less efficient (there is an example of that at the bottom of this article; also see Left outer join vs NOT EXISTS).
Personally, I would use a Table-Valued parameter in my stored procedure, then simply LEFT JOIN to the table and look for NULL.
CREATE TYPE ItemTableType AS TABLE
    (ItemId INT);
GO
CREATE PROCEDURE dbo.usp_GetItems
    @Items ItemTableType READONLY
AS
SET XACT_ABORT ON;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SET NOCOUNT ON;

-- error out if any supplied ItemId has no match in the table
IF EXISTS (SELECT *
           FROM @Items i2
           LEFT JOIN dbo.MyItemTable i1 ON i2.ItemId = i1.ItemId
           WHERE i1.ItemId IS NULL)
    RAISERROR('Item does not exist', 16, 1)
ELSE
    SELECT i1.*
    FROM dbo.MyItemTable i1
    JOIN @Items i2 ON i1.ItemId = i2.ItemId
GO
The other possibility I can think of is to compare a count of the entered items against the row count of the result.
To get the count of items in the comma-separated list:
DECLARE @CommaCount INT;
-- number of commas + 1 = number of items in the list
SET @CommaCount = (SELECT LEN(@ItemIds) - LEN(REPLACE(@ItemIds, ',', '')) + 1);
Then when you perform the SELECT (splitting the list into rows, for example with STRING_SPLIT on SQL Server 2016+), just compare the row count to that count:
SELECT *
FROM dbo.MyItemTable
WHERE ItemId IN (SELECT value FROM STRING_SPLIT(@ItemIds, ','));

IF (@@ROWCOUNT != @CommaCount)
    RAISERROR('Item does not exist', 16, 1);

Order of Filter to make the execution of SQL query better

I have two queries serving the same purpose.
SELECT * FROM #user
INNER JOIN #department ON #user.departmentId = #department.departmentId
INNER JOIN #user manager ON #department.departmentHead = manager.userid
WHERE #department.DepartmentName= 'QA'
SELECT * FROM #user
INNER JOIN #department ON #department.DepartmentName= 'QA' AND #user.departmentId = #department.departmentId
INNER JOIN #user manager ON #department.departmentHead = manager.userid
The table structure is as follows.
create table #user
(
userid int identity(1,1),
userName varchar(20),
departmentId int
)
create table #department
(
departmentId int identity(1,1),
departmentName varchar(50),
departmentHead int
)
I am expecting some difference between the two queries from a performance perspective. My assumption/understanding is that the first query gets executed in this order:
Join all the records of the user and department tables.
Join the result of step 1 with all the records of the user table again.
Apply the WHERE condition (DepartmentName = 'QA').
Whereas the second one:
Join all the QA department records (the filtered set) with the user table.
Join the result of step 1 with all the records of the user table.
I am assuming the second query is more efficient than the first one, since it applies the filter at an early stage of execution, so the later steps deal with fewer records.
But the SQL execution plan doesn't show any difference between the two queries. Please explain this, and validate whether my understanding of the order in which filters are applied is correct.
Actually, SQL Server has statistical data about your tables and can rearrange the joins to create a more optimal execution plan.
The query hint FORCE ORDER specifies that the join order indicated by the query syntax is preserved during query optimization. But don't use it unless you have an actual performance problem.
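A sketch of what that would look like on the first query from the question (again, only as a last resort):

-- FORCE ORDER pins the join order exactly as written, overriding
-- the optimizer's own reordering
SELECT * FROM #user
INNER JOIN #department ON #user.departmentId = #department.departmentId
INNER JOIN #user manager ON #department.departmentHead = manager.userid
WHERE #department.departmentName = 'QA'
OPTION (FORCE ORDER);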
The order of joins and predicates in a SQL query largely doesn't matter. The query is compiled as a whole, and the execution plan is then built from what the optimizer calculates to be the likely best way to execute it.

Deleting hierarchical data in SQL table

I have a table with hierarchical data.
A column "ParentId" holds the Id (the "ID" key column) of its parent.
When deleting a row, I want to delete all of its children (across all levels of nesting).
How do I do that?
Thanks
On SQL Server: Use a recursive query. Given CREATE TABLE tmp(Id int, Parent int), use
-- assumes @Id already holds the id of the row being deleted
WITH x(Id) AS (
    SELECT @Id
    UNION ALL
    SELECT tmp.Id
    FROM tmp
    JOIN x ON tmp.Parent = x.Id
)
DELETE tmp
FROM x
JOIN tmp ON tmp.Id = x.Id;
Add a foreign key constraint. The following example works for MySQL (syntax reference):
ALTER TABLE yourTable
ADD CONSTRAINT makeUpAConstraintName
FOREIGN KEY (ParentID) REFERENCES yourTable (ID)
ON DELETE CASCADE;
This operates at the database level: the DBMS ensures that once a row is deleted, all referencing rows are deleted too.
When the number of rows is not too large, erikkallen's recursive approach works.
Here's an alternative that uses a temporary table to collect all children:
-- "[table]" stands for your actual table name
create table #nodes (id int primary key);
insert into #nodes (id) values (@delete_id);

while @@rowcount > 0
    insert into #nodes
    select distinct child.id
    from [table] child
    inner join #nodes parent on child.parentid = parent.id
    where child.id not in (select id from #nodes);

delete
from [table]
where id in (select id from #nodes);
It starts with the row identified by @delete_id and descends from there. The WHERE clause is there to protect against cycles; if you are sure there are none, you can leave it out.
It depends on how you store your hierarchy. If you only have ParentID, the approach you took may not be the most effective. For easy subtree manipulation you could add a column Parents that stores all parent IDs as a path, like:
/1/20/25/40
This way you can get all sub-nodes simply by:
where Parents like @NodeParents + '%'
Second approach
Instead of just ParentID you could also store left and right values (the nested-set model). Inserts done this way are slower, but select operations are extremely fast, especially when dealing with sub-tree nodes. http://en.wikipedia.org/wiki/Tree_traversal
Third approach
check recursive CTEs if you use SQL 2005+
Fourth approach
If you use SQL 2008, check HierarchyID type. It gives enough possibilities for your case.
http://msdn.microsoft.com/en-us/magazine/cc794278.aspx
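As a quick illustration of the fourth approach: the table and variable names below are made up, but IsDescendantOf is the built-in hierarchyid method, and it returns 1 for the node itself as well as for all of its descendants.

-- Hypothetical table dbo.MyTree with a hierarchyid column named Node;
-- deletes a node together with its entire subtree in one statement
DECLARE @doomed hierarchyid =
    (SELECT Node FROM dbo.MyTree WHERE ID = @idToDelete);

DELETE FROM dbo.MyTree
WHERE Node.IsDescendantOf(@doomed) = 1;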
Add a trigger to the table like this:
create trigger TD_MyTable on myTable for delete as
    -- Delete one level of children: the rows whose parent was just deleted
    delete M from deleted D
    inner join myTable M on D.ID = M.ParentId;
Each delete will fire a delete of the next level of children on the same table, repeatedly invoking the trigger. Check Books Online for the additional rules; there is a restriction on the number of times a trigger can nest.
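Note that for the trigger to re-fire on the deletes it issues, direct trigger recursion has to be enabled at the database level (and nesting stops at 32 levels):

-- Direct trigger recursion is OFF by default; without this setting the
-- trigger fires once and only the first level of children is removed
ALTER DATABASE CURRENT SET RECURSIVE_TRIGGERS ON;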
Depends on your database. If you are using Oracle, you could do something like this:
DELETE FROM Table WHERE ID IN (
    SELECT ID FROM Table
    START WITH ID = id_to_delete
    CONNECT BY PRIOR ID = ParentID
)
ETA:
Without CONNECT BY, it gets a bit trickier. As others have suggested, a trigger or cascading delete constraint would probably be easiest.
Triggers can only be used for hierarchies 32 levels deep or less:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/05/11/defensive-database-programming-fun-with-triggers.aspx
What you want is referential integrity between these tables.