Performance - Select query with left join and null check - sql

I have two tables, Processing (currently 30M rows) and EtlRecord (currently 4.3M rows).
As the table names suggest, they are used to normalize data during ETL.
We process records in batches of 1000.
SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P (NOLOCK)
LEFT JOIN [core].[EtlRecord] E (NOLOCK) ON E.StreamGuid = P.StreamGuid
WHERE E.StreamGuid IS NULL
AND P.CompleteDate IS NOT NULL
AND P.StreamGuid IS NOT NULL
Execution of this query currently takes around 20 seconds, and we expect more and more data, especially in the EtlRecord table. To improve the performance of this query I checked the actual execution plan, which I have shared below.
As you can see, the most time-consuming part is the index seek that finds the unmatched records in the EtlRecord table. I have tried several changes but haven't been able to improve it.
Additional notes
All indexes suggested by the execution plan have already been applied, so there are no further index suggestions.
The Processing table has 8 columns, mostly boolean flags; the EtlRecord table has 4 columns.
The EtlRecord table is used by only a single procedure, so there is no issue with transaction locking.
Any suggestions to improve this query would be really helpful.

Well, your query needs to get the records from [staging].[Processing] that have no corresponding record in [core].[EtlRecord].
You can remove the already-processed records first:
DELETE [staging].[Processing]
FROM [staging].[Processing] P
INNER JOIN [core].[EtlRecord] E
ON E.StreamGuid = P.StreamGuid;
You can perform the deletion in batches if you need to. Removing these records simplifies your initial query and gets rid of the nasty join on a uniqueidentifier. You then simply need to do something like this for each batch:
SELECT TOP 1000 StreamGuid
INTO #buffer
FROM [staging].[Processing]
WHERE CompleteDate IS NOT NULL
AND StreamGuid IS NOT NULL;
-- do whatever you need with these records
DELETE FROM [staging].[Processing]
WHERE StreamGuid IN (SELECT StreamGuid FROM #buffer);
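The delete-then-batch pattern above can be sketched end to end with Python's sqlite3 module. The table and column names mirror the question; the guid values, dates, and batch size of 3 are invented for the demo:

```python
import sqlite3

# Hypothetical in-memory stand-ins for [staging].[Processing] and [core].[EtlRecord].
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Processing (StreamGuid TEXT, CompleteDate TEXT);
    CREATE TABLE EtlRecord  (StreamGuid TEXT);
""")
con.executemany("INSERT INTO Processing VALUES (?, ?)",
                [(f"guid-{i}", "2024-01-01") for i in range(10)])
# Guids 0-5 are already processed.
con.executemany("INSERT INTO EtlRecord VALUES (?)",
                [(f"guid-{i}",) for i in range(6)])

# Step 1: drop the rows that already have a match, so later batches
# never need the anti-join at all.
con.execute("""
    DELETE FROM Processing
    WHERE StreamGuid IN (SELECT StreamGuid FROM EtlRecord)
""")

# Step 2: grab a batch of the remaining work.
batch = con.execute("""
    SELECT StreamGuid FROM Processing
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL
    LIMIT 3
""").fetchall()
print(batch)  # three of the four remaining, unprocessed guids
```

The point of the design is that the expensive anti-join is paid once, in the DELETE, instead of once per batch.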
Also, you said that you have all the suggested indexes created, but the indexes suggested by the execution plan are not always the best. This part here:
WHERE CompleteDate IS NOT NULL
AND StreamGuid IS NOT NULL;
looks like a very good candidate for a filtered index, especially if a large share of the rows have a NULL value in one of these columns.
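SQLite calls the same feature a partial index, so the idea can be tried out with Python's sqlite3 module. The table name mirrors the question; the data and the 1-in-4 NULL ratio are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Processing (StreamGuid TEXT, CompleteDate TEXT)")
con.executemany(
    "INSERT INTO Processing VALUES (?, ?)",
    [(f"guid-{i}", "2024-01-01" if i % 4 else None) for i in range(1000)],
)

# A partial index (SQLite's term for a filtered index) stores only the
# qualifying rows, so it stays small when many rows are NULL.
con.execute("""
    CREATE INDEX ix_processing_complete
    ON Processing (CompleteDate, StreamGuid)
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL
""")

# A query whose WHERE clause implies the index predicate can use the index.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT StreamGuid FROM Processing
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL
""").fetchall()
print(plan)

n = con.execute("""
    SELECT count(*) FROM Processing
    WHERE CompleteDate IS NOT NULL AND StreamGuid IS NOT NULL
""").fetchone()[0]
print(n)  # 750
```

The filtered index on SQL Server works the same way: only rows satisfying the index's WHERE clause are stored, and only queries whose predicates imply that clause can use it.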

First, DDL and easily consumable sample data, like below, will help a great deal. You can copy/paste my solutions and run them locally to see what I'm talking about.
IF OBJECT_ID('tempdb..#processing','U') IS NOT NULL DROP TABLE #processing;
IF OBJECT_ID('tempdb..#EtlRecord','U') IS NOT NULL DROP TABLE #EtlRecord;
SELECT TOP (100)
StreamGuid = NEWID(),
CompleteDate = CASE WHEN CHECKSUM(NEWID())%3 < 2 THEN GETDATE() END
INTO #processing
FROM sys.all_columns AS a;
SELECT TOP (80) p.StreamGuid
INTO #EtlRecord
FROM #Processing AS p;
ALTER TABLE #processing ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
ALTER TABLE #EtlRecord ALTER COLUMN StreamGuid UNIQUEIDENTIFIER NOT NULL;
GO
ALTER TABLE #processing ADD CONSTRAINT pk_processing PRIMARY KEY CLUSTERED(StreamGuid);
ALTER TABLE #etlRecord ADD CONSTRAINT pk_etlRecord PRIMARY KEY CLUSTERED(StreamGuid);
GO
Next, understand that without an ORDER BY clause your query is not guaranteed to return the same records each time. For example, if SQL Server picks a parallel execution plan you will definitely get different rows. I have also seen cases where including the ORDER BY actually improves performance.
With that in mind, note that this...
SELECT TOP 1000
P.StreamGuid
FROM #processing AS p
LEFT JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE e.StreamGuid IS NOT NULL
AND p.CompleteDate IS NOT NULL
... will return the exact same thing as this:
SELECT TOP 1000
P.StreamGuid
FROM #processing AS p
JOIN #etlRecord AS e ON e.StreamGuid = p.StreamGuid
WHERE p.CompleteDate IS NOT NULL;
Note that WHERE e.StreamGuid = p.StreamGuid already implies that both values are NOT NULL, because a comparison involving NULL is never true. To see this, note that this query...
DECLARE @X INT;
SELECT AreTheyEqual = IIF(@X=@X,'Yep','Nope');
... returns:
AreTheyEqual
------------
Nope
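The same NULL semantics hold in any SQL engine. Here is the behavior reproduced with Python's sqlite3 module (a CASE expression stands in for IIF, which SQLite lacks; the two-row join tables are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# NULL = NULL evaluates to NULL, not TRUE, so CASE falls through to the ELSE.
result = con.execute(
    "SELECT CASE WHEN NULL = NULL THEN 'Yep' ELSE 'Nope' END"
).fetchone()[0]
print(result)  # Nope

# The same rule is why an equality join predicate silently discards NULL keys:
con.executescript("""
    CREATE TABLE p (k TEXT);
    CREATE TABLE e (k TEXT);
    INSERT INTO p VALUES ('x'), (NULL);
    INSERT INTO e VALUES ('x'), (NULL);
""")
matched = con.execute("SELECT count(*) FROM p JOIN e ON e.k = p.k").fetchone()[0]
print(matched)  # 1: the two NULL keys never match each other
```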
I agree with the filtered-index solution @gotqn posted. Using my sample data, you can add something like this:
CREATE NONCLUSTERED INDEX nc_processing ON #processing(CompleteDate,StreamGuid)
WHERE CompleteDate IS NOT NULL;
Then you can add an ORDER BY CompleteDate to the query to coerce the optimizer into choosing that index (on my system it doesn't pick the index unless I add the ORDER BY). The ORDER BY will also make your query deterministic and more predictable.

I would suggest writing this as:
SELECT TOP 1000 P.StreamGuid
FROM [staging].[Processing] P
WHERE P.CompleteDate IS NOT NULL AND
P.StreamGuid IS NOT NULL AND
NOT EXISTS (SELECT 1
FROM [core].[EtlRecord] E
WHERE E.StreamGuid = P.StreamGuid
);
I removed the NOLOCK directive. Only use it if you really know what you are doing -- and are prepared to read invalid data.
Then you definitely want an index on EtlRecord(StreamGuid).
You probably also want an index on Processing(CompleteDate, StreamGuid). This is at least a covering index for the query.
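For what it's worth, the NOT EXISTS form and the original LEFT JOIN ... IS NULL form are semantically interchangeable anti-joins. A small sqlite3 sketch (invented sample rows, real table names from the question) shows both returning the same result:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Processing (StreamGuid TEXT, CompleteDate TEXT);
    CREATE TABLE EtlRecord  (StreamGuid TEXT);
    CREATE INDEX ix_etl ON EtlRecord (StreamGuid);
    INSERT INTO Processing VALUES ('a','d1'), ('b','d1'), ('c',NULL), (NULL,'d1');
    INSERT INTO EtlRecord VALUES ('a');
""")

# Anti-join via NOT EXISTS.
not_exists = con.execute("""
    SELECT P.StreamGuid FROM Processing P
    WHERE P.CompleteDate IS NOT NULL
      AND P.StreamGuid IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM EtlRecord E WHERE E.StreamGuid = P.StreamGuid)
""").fetchall()

# Anti-join via LEFT JOIN plus IS NULL, as in the original question.
left_join = con.execute("""
    SELECT P.StreamGuid FROM Processing P
    LEFT JOIN EtlRecord E ON E.StreamGuid = P.StreamGuid
    WHERE E.StreamGuid IS NULL
      AND P.CompleteDate IS NOT NULL
      AND P.StreamGuid IS NOT NULL
""").fetchall()

print(not_exists, left_join)  # both return [('b',)]
```

Whether one plan beats the other is engine- and statistics-dependent; the NOT EXISTS form simply states the intent more directly to the optimizer.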

Related

sql server does not take most restrictive condition for execution plan

We have a query with multiple joins where SQL Server 2016 does not take the optimal path, and we cannot convince it to without hints (which we prefer not to use).
Simplified, the problem is as follows:
Table A (12 million rows)
Table B (type table, 5 rows)
Table C (12 million rows)
Query (simplified for clarity):
SELECT
[A].[ID]
,[A].[DATE_CREATED]
,[A].[DATE_LAST_MODIFIED]
,[A].[CODE]
,[B].[CODE]
,[B].[DESCRIPTION]
,[C].[EVENT_ID]
,[C].[SOURCE_REFERENCE]
,[C].[EVTY_ID]
,[C].[BUSINESS_KEY]
,[C].[DATA]
,[C].[EVENT_DATE]
FROM A
JOIN B ON [B].[ID] = [A].[PSTY_ID] AND [B].[ACTIVE] = 1
JOIN C ON [C].[ID] = [B].[EVEN_ID] AND [C].[ACTIVE] = 1
WHERE [B].[CODE] = 'nopr' OR [B].[CODE] = 'inpr'
The selected codes from B correspond to the values 1 and 2.
Table A contains at most 10 rows with PSTY_ID value 1 or 2; the rest have 3, 4 or 5.
There is a foreign key from A.PSTY_ID to B.ID.
There is a filtered index on table A for PSTY_ID values 1 and 2, with all selected columns as included columns.
The optimizer does not seem to recognize that we are selecting the values 1 and 2, and does not use the index or start with table B (trying to force this with subqueries or by changing the table order does not help; only the hint OPTION (FORCE ORDER) can convince the optimizer, but we do not want that).
Only when we hard-code the B.ID or A.PSTY_ID values 1 and 2 in the WHERE clause does the optimizer take the correct path, starting with table B.
If we do not, it joins table A with table C first, and only then with table B, leading to vastly more processing time (approx. 50x).
We also tried declaring the values and using them as variables, but still no luck.
Does anyone know whether this is a known issue, or whether it can be worked around?
Your filtered index will not be used in this case unless you include the values 1 and 2 in the WHERE clause; you cannot change this even by joining with a table that contains ONLY 1 and 2 in its rows.
A filtered index will never be chosen based on "assumptions" about which values some table (physical, or derived like a CTE or subquery) contains, and indeed your subquery did not help.
So if you want the index to be used, you should add a WHERE condition to your query equivalent to the filtered index's predicate.
Since you don't want to add this condition, but still want to change the join order so it starts with table B, you can use a temporary table or table variable like this:
select [ID]
,[CODE]
,[DESCRIPTION]
,[EVEN_ID]
into #tmp
from B
where ([CODE] = 'nopr' OR [CODE] = 'inpr') and [ACTIVE] = 1
Now use this #tmp instead of B in your query.
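A runnable sketch of this temp-table approach, using Python's sqlite3 module with invented rows for A and B (column names follow the question):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE B (ID INTEGER, CODE TEXT, DESCRIPTION TEXT, EVEN_ID INTEGER, ACTIVE INTEGER);
    CREATE TABLE A (ID INTEGER, PSTY_ID INTEGER, CODE TEXT);
    INSERT INTO B VALUES (1,'nopr','not processed',10,1), (2,'inpr','in process',20,1),
                         (3,'done','done',30,1), (4,'fail','failed',40,1), (5,'skip','skipped',50,1);
    INSERT INTO A VALUES (100,1,'a'), (101,2,'b'), (102,3,'c'), (103,4,'d');
""")

# Materialise the tiny filtered slice of B first, as the answer suggests;
# the main query then joins against explicit, already-known rows.
con.execute("""
    CREATE TEMP TABLE tmp AS
    SELECT ID, CODE, DESCRIPTION, EVEN_ID
    FROM B
    WHERE (CODE = 'nopr' OR CODE = 'inpr') AND ACTIVE = 1
""")

rows = con.execute("""
    SELECT A.ID, tmp.CODE
    FROM A JOIN tmp ON tmp.ID = A.PSTY_ID
    ORDER BY A.ID
""").fetchall()
print(rows)  # [(100, 'nopr'), (101, 'inpr')]
```

Materialising the filtered rows removes the cardinality guesswork: the optimizer sees a two-row table instead of having to infer that B's predicate restricts PSTY_ID to 1 and 2.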

INNER LOOP JOIN Failing

I need to update a field called FamName in a table called Episode with randomly generated Germanic names from another table, Surnames, which has a single column called Surname.
To do this, I first added an ID field and a NONCLUSTERED INDEX to my Surnames table:
ALTER TABLE Surnames
ADD ID INT NOT NULL IDENTITY(1, 1);
GO
CREATE UNIQUE NONCLUSTERED INDEX idx ON Surnames(ID);
GO
I then attempt to update my Episode table via
UPDATE E
SET E.FamName = S.Surname
FROM Episode AS E
INNER LOOP JOIN Surnames AS S
ON S.ID = (1 + ABS(CRYPT_GEN_RANDOM(8) % (SELECT COUNT(*) FROM Surnames)));
GO
where I am attempting to force the query to 'loop' using the LOOP join hint. Of course, if I don't force the optimizer to loop, I get the same German name for all rows. However, this query is, strangely, affecting zero rows.
Why is it affecting zero rows, and how can it be amended to work?
Note that I could use a WHILE loop to perform this update, but I want a succinct way of doing it, and to find out what I am doing wrong in this particular case.
You cannot (reliably) affect query results with join hints. They are performance hints, not semantic hints; you are relying on undefined behavior.
Moving the random-number computation out of the join condition and into one of the join sources prevents the expression from being treated as a constant:
UPDATE E
SET E.FamName = S.Surname
FROM (
SELECT *, (1 + ABS(CRYPT_GEN_RANDOM(8) % (SELECT COUNT(*) FROM Surnames))) AS SurnameID
FROM Episode AS E
) E
INNER LOOP JOIN Surnames AS S ON S.ID = E.SurnameID
The derived table E adds the computed SurnameID as a new column.
You don't need join hints any longer. I have tested that this works in my specific test case, although I'm not sure whether it is guaranteed to work.
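Since the engine-side trick above is acknowledged as not guaranteed, the per-row randomness can also be done client-side. Here is a sketch with Python's sqlite3 module (schema mirrors the question; the surnames, the EpisodeID column, and the row counts are invented) that draws one independent random surname per Episode row:

```python
import random
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Surnames (ID INTEGER PRIMARY KEY, Surname TEXT);
    CREATE TABLE Episode  (EpisodeID INTEGER PRIMARY KEY, FamName TEXT);
    INSERT INTO Surnames (Surname) VALUES ('Müller'), ('Schmidt'), ('Fischer'), ('Weber');
    INSERT INTO Episode (FamName) VALUES (NULL), (NULL), (NULL), (NULL), (NULL);
""")

surnames = [s for (s,) in con.execute("SELECT Surname FROM Surnames")]

# Draw the random value once per target row, client-side, so every row gets
# an independent pick -- the per-row evaluation the join hint could not force.
updates = [(random.choice(surnames), eid)
           for (eid,) in con.execute("SELECT EpisodeID FROM Episode").fetchall()]
con.executemany("UPDATE Episode SET FamName = ? WHERE EpisodeID = ?", updates)

filled = con.execute(
    "SELECT count(*) FROM Episode WHERE FamName IS NOT NULL"
).fetchone()[0]
print(filled)  # 5
```

This trades set-based elegance for determinism: the random draw demonstrably happens once per row, rather than depending on how the optimizer folds the expression.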

Horrible Oracle update performance

I am performing an update with a query like this:
UPDATE (SELECT h.m_id,
m.id
FROM h
INNER JOIN m
ON h.foo = m.foo)
SET m_id = id
WHERE m_id IS NULL
Some info:
Table h is roughly ~5 million rows
All rows in table h have NULL values for m_id
Table m is roughly ~500 thousand rows
m_id on table h is an indexed foreign key pointing to id on table m
id on table m is the primary key
There are indexes on m.foo and h.foo
The EXPLAIN PLAN for this query indicated a hash join and full table scans, but I'm no DBA, so I can't really interpret it very well.
The query itself ran for several hours and did not complete. I would have expected it to complete in no more than a few minutes. I've also attempted the following query rewrite:
UPDATE h
SET m_id = (SELECT id
FROM m
WHERE m.foo = h.foo)
WHERE m_id IS NULL
The EXPLAIN PLAN for this mentioned ROWID lookups and index usage, but it also ran for several hours without completing. I've also always been under the impression that queries like this cause the subquery to be executed once for every row matched by the outer query's predicate, so I would expect very poor performance from this rewrite anyway.
Is there anything wrong with my approach, or is my problem related to indexes, tablespace, or some other non-query-related factor?
Edit:
I'm also having abysmal performance from simple count queries like this:
SELECT COUNT(*)
FROM h
WHERE m_id IS NULL
These queries are taking anywhere from ~30 seconds to sometimes ~30 minutes(!).
I am noticing no locks, but the tablespace for these tables is sitting at 99.5% usage (only ~6MB free) right now. I've been told that this shouldn't matter as long as indexes are being used, but I don't know...
Some points:
Oracle does not index NULL values (it will index a NULL that is part of a globally non-null tuple, but that's about it).
Oracle is going for a HASH JOIN because of the size of both h and m. This is likely the best option performance-wise.
The second UPDATE might get Oracle to use indexes, but Oracle is usually smart about merging subqueries, and it would be a worse plan anyway.
Do you have recent, reasonable statistics for your schema? Oracle really needs decent statistics.
In your execution plan, which is the first table in the HASH JOIN? For best performance it should be the smaller table (m in your case). If you don't have good cardinality statistics, Oracle will get it wrong. You can force Oracle to assume fixed cardinalities with the CARDINALITY hint; it may help Oracle get a better plan.
For example, in your first query:
UPDATE (SELECT /*+ cardinality(h 5000000) cardinality(m 500000) */
h.m_id, m.id
FROM h
INNER JOIN m
ON h.foo = m.foo)
SET m_id = id
WHERE m_id IS NULL
In Oracle, FULL SCAN reads not only every record in the table, it basically reads all storage allocated up to the maximum used (the high water mark in Oracle documentation). So if you have had a lot of deleted rows your tables might need some cleaning up. I have seen a SELECT COUNT(*) on an empty table consume 30+ seconds because the table in question had like 250 million deleted rows. If that is the case, I suggest analyzing your specific case with a DBA, so he/she can reclaim space from deleted rows and lower the high water mark.
As far as I remember, WHERE m_id IS NULL performs a full-table scan, since NULL values are not indexed.
A full-table scan means the engine has to read every record in the table to evaluate the WHERE condition, and cannot use an index.
You could try adding a virtual column that is set to a non-NULL value when m_id IS NULL, index that column, and use it in the WHERE condition.
You could also move the WHERE condition from the UPDATE statement into the sub-select, which will probably make the statement faster.
Since JOINs are expensive, rewriting INNER JOIN m ON h.foo = m.foo as
WHERE h.foo IN (SELECT m.foo FROM m WHERE m.foo IS NOT NULL)
may also help.
For large tables, MERGE is often much faster than UPDATE. Try this (untested):
MERGE INTO h USING
(SELECT h.h_id,
m.id as new_m_id
FROM h
INNER JOIN m
ON h.foo = m.foo
WHERE h.m_id IS NULL
) new_data
ON (h.h_id = new_data.h_id)
WHEN MATCHED THEN
UPDATE SET h.m_id = new_data.new_m_id;
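SQLite has no MERGE, but the same net effect, filling only the NULL m_id values from matching m rows, can be sketched with a correlated UPDATE via Python's sqlite3 module (invented sample data; the h_id column is assumed, as in the MERGE above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE m (id INTEGER PRIMARY KEY, foo TEXT);
    CREATE TABLE h (h_id INTEGER PRIMARY KEY, foo TEXT, m_id INTEGER);
    CREATE INDEX ix_m_foo ON m (foo);
    INSERT INTO m VALUES (1,'x'), (2,'y');
    INSERT INTO h VALUES (10,'x',NULL), (11,'y',NULL), (12,'z',NULL);
""")

# The EXISTS clause mirrors MERGE's WHEN MATCHED: rows of h with no
# counterpart in m (foo = 'z' here) are left untouched.
con.execute("""
    UPDATE h
    SET m_id = (SELECT id FROM m WHERE m.foo = h.foo)
    WHERE m_id IS NULL
      AND EXISTS (SELECT 1 FROM m WHERE m.foo = h.foo)
""")

result = con.execute("SELECT h_id, m_id FROM h ORDER BY h_id").fetchall()
print(result)  # [(10, 1), (11, 2), (12, None)]
```

On Oracle, the MERGE form is usually preferable for large tables because it can drive a single hash join instead of one correlated lookup per row; the sketch only illustrates the intended end state.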
Try the undocumented hint /*+ BYPASS_UJVC */. If it works, add a UNIQUE/PK constraint on m.foo.
I would update the table in iterations, for example by adding a condition such as WHERE h.date_created > SYSDATE-30; after that finishes, run the same query with the condition changed to WHERE h.date_created BETWEEN SYSDATE-60 AND SYSDATE-30, and so on. If you don't have a column like date_created, maybe there is another column you can filter by, for example: WHERE m.foo = h.foo AND m.foo BETWEEN 1 AND 10.
Only the execution plan can really explain why the cost of this update is so high, but an educated guess is that both tables are very big, there are many NULL values, and there are a lot of matching rows (m.foo = h.foo)...

Performance issue with select query in Firebird

I have two tables, one small (~ 400 rows), one large (~ 15 million rows), and I am trying to find the records from the small table that don't have an associated entry in the large table.
I am encountering massive performance issues with the query.
The query is:
SELECT * FROM small_table WHERE NOT EXISTS
(SELECT NULL FROM large_table WHERE large_table.small_id = small_table.id)
The column large_table.small_id references small_table's id field, which is its primary key.
The query plan shows that the foreign key index is used for the large_table:
PLAN (large_table (RDB$FOREIGN70))
PLAN (small_table NATURAL)
Statistics have been recalculated for indexes on both tables.
The query takes several hours to run. Is this expected?
If so, can I rewrite the query so that it will be faster?
If not, what could be wrong?
I'm not sure about Firebird, but in other databases a join is often faster:
SELECT *
FROM small_table st
LEFT JOIN large_table lt
ON st.id = lt.small_id
WHERE lt.small_id IS NULL
Maybe give that a try?
Another option, if you're really stuck, and depending on the situation this needs to run in, is to extract the small_id column from large_table, possibly into a temp table, and then do the left join / EXISTS query against that.
If the large table only has relatively few distinct values for small_id, the following might perform better:
select *
from small_table st left outer join
(select distinct small_id
from large_table
) lt
on lt.small_id = st.id
where lt.small_id is null
In this case, performance would be better with a full scan of the large table followed by index lookups into the small table -- the opposite of what it is doing now. The DISTINCT could be satisfied by just an index scan on the large table, which then feeds primary-key lookups on the small table.

Optimising CTE for recursive queries

I have a table with a self-join. You can think of the structure as a standard table representing an organisational hierarchy. E.g. table:
MemberId
MemberName
RelatedMemberId
This table contains 50,000 sample records. I wrote a recursive CTE query and it works absolutely fine. However, it takes about 3 minutes to process just those 50,000 records on my machine (4 GB RAM, 2.4 GHz Core2Duo, 7200 RPM HDD).
How can I improve the performance? 50,000 is not such a huge number, and it will only keep increasing over time. This is the query, exactly as it appears in my stored procedure. Its purpose is to select all the members that come under a specific member. E.g. under the Owner of the company comes every person; for a Manager, every record except the Owner gets returned. I hope that makes the query's purpose clear.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
ALTER PROCEDURE spGetNonVirtualizedData
(
@MemberId int
)
AS
BEGIN
With MembersCTE As
(
Select parent.MemberId As MemberId, 0 as Level
From Members as parent Where IsNull(MemberId,0) = IsNull(@MemberId,0)
Union ALL
Select child.MemberId As MemberId , Level + 1 as Level
From Members as child
Inner Join MembersCTE on MembersCTE.MemberId = child.RelatedMemberId
)
Select Members.*
From MembersCTE
Inner Join Members On MembersCTE.MemberId = Members.MemberId
option(maxrecursion 0)
END
GO
As you can see, to improve performance I even deferred the join to Members until the final SELECT, so that unnecessary records do not get inserted into the temp table. If I do the join in the anchor and recursive steps of the CTE (instead of only in the final SELECT), the query takes 20 minutes to execute!
MemberId is primary key in the table.
Thanks in advance :)
In your anchor condition you have Where IsNull(MemberId,0) = IsNull(@MemberId,0). I assume this is just because, when you pass NULL as the parameter, = doesn't match IS NULL values. Wrapping the column in IsNull will cause a scan rather than a seek.
Use WHERE MemberId = @MemberId OR (@MemberId IS NULL AND MemberId IS NULL) instead, which is sargable.
Also, I'm assuming that you don't have an index on RelatedMemberId. If not, you should add one:
CREATE NONCLUSTERED INDEX ix_name ON Members(RelatedMemberId) INCLUDE (MemberId)
(though you can skip the included column bit if MemberId is the clustered index key as it will be included automatically)
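The whole pattern, a sargable anchor seek plus an index on RelatedMemberId, can be tried out in miniature with a recursive CTE in Python's sqlite3 module (the five-member hierarchy is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE Members (MemberId INTEGER PRIMARY KEY,
                          MemberName TEXT,
                          RelatedMemberId INTEGER);
    CREATE INDEX ix_related ON Members (RelatedMemberId);
    INSERT INTO Members VALUES
        (1,'Owner',NULL),
        (2,'Manager',1),
        (3,'Lead',2),
        (4,'Dev',3),
        (5,'Dev',3);
""")

rows = con.execute("""
    WITH RECURSIVE MembersCTE AS (
        SELECT MemberId, 0 AS Level
        FROM Members
        WHERE MemberId = ?          -- sargable seek on the primary key
        UNION ALL
        SELECT child.MemberId, Level + 1
        FROM Members AS child
        JOIN MembersCTE ON MembersCTE.MemberId = child.RelatedMemberId
    )
    SELECT m.MemberId, m.MemberName, c.Level
    FROM MembersCTE AS c
    JOIN Members AS m ON m.MemberId = c.MemberId
    ORDER BY c.Level, m.MemberId
""", (2,)).fetchall()
print(rows)  # everyone at or below the Manager, with their depth
```

Each recursive step is a lookup by RelatedMemberId, which is exactly what the suggested nonclustered index accelerates; without it every level of the recursion scans the whole Members table.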