SQL - improve NOT EXISTS query performance

Is there a way I can improve this kind of SQL query performance:
INSERT
INTO ...
WHERE NOT EXISTS(Validation...)
The problem is that when I have a lot of data in my table (like millions of rows), the execution of the WHERE NOT EXISTS clause is very slow. I have to do this verification because I can't insert duplicate data.
I use SQL Server 2005.
Thanks.

Make sure you are searching on indexed columns, with no manipulation of the data within those columns (like SUBSTRING, etc.).

Off the top of my head, you could try something like:
TRUNCATE TABLE temptable

INSERT INTO temptable ...
INSERT INTO temptable ...
...

INSERT INTO realtable
SELECT temptable.*
FROM temptable
LEFT JOIN realtable ON realtable.key = temptable.key
WHERE realtable.key IS NULL

Try replacing the NOT EXISTS with a left outer join; it sometimes performs better on large data sets.
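For the INSERT in the question, that rewrite would look something like the sketch below; the table and column names (customers, newcustomers, customerid, name) are hypothetical stand-ins, since the original query is elided:

insert into customers (customerid, name)
select n.customerid, n.name
from newcustomers n
-- anti-join: keep only the rows with no match in the target
left join customers c
    on c.customerid = n.customerid
where c.customerid is null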

Outer Apply tends to work for me...
Instead of:
FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t1.something = t2.something)
I'll use:
FROM t1
OUTER APPLY (
    SELECT TOP 1 1 AS found FROM t2 WHERE t1.something = t2.something
) t2f
WHERE t2f.found IS NULL

Pay attention to the other answer regarding indexing. NOT EXISTS is typically quite fast if you have good indexes.
But I have had performance issues with statements like the one you describe. One method I've used to get around them is to load the candidate values into a temp table, perform a DELETE FROM ... WHERE EXISTS (...) against it, and then blindly INSERT the remainder. Inside a transaction, of course, to avoid race conditions. Splitting up the queries sometimes allows the optimizer to do its job without getting confused.
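A minimal sketch of that pattern, assuming hypothetical customers/newcustomers tables:

begin transaction

-- stage the candidate values
select customerid, name
into #candidates
from newcustomers

-- delete candidates that already exist in the target
delete c
from #candidates c
where exists (select 1 from customers t where t.customerid = c.customerid)

-- blindly insert the remainder
insert into customers (customerid, name)
select customerid, name from #candidates

commit transaction

drop table #candidates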

If you can at all reduce your problem space, then you'll gain heaps of performance. Are you absolutely sure that every one of those rows in that table needs to be checked?
The other thing you might want to try is a DELETE InsertTable FROM InsertTable INNER JOIN ExistingTable ON <Validation criteria> before your insert. However, your mileage may vary.

insert into customers
select *
from newcustomers
where customerid not in (select customerid
from customers)
...may be more efficient, provided customerid is not nullable (NOT IN behaves unexpectedly when the subquery returns NULLs). As others have said, make sure you've got indexes on any lookup fields.

Related

SQL best practice/performance when inserting into a table. To use a temp table or not

I have a select query that came about from me trying to remove WHILE loops from an existing query that was far too slow. As it stands, I first select into a temp table.
Then from that temp table I insert into the final table, using the values from the temp table.
Below is a simplified example of the flow of my query:
select
    b.BookId,
    b.BookDescription,
    a.Name,
    a.BirthDate,
    a.CountryOfOrigin
into #tempTable
from library.Book b
left join authors.Authors a
    on a.AuthorId = b.AuthorId

insert into bookStore.BookStore
    ([BookStoreEntryId],
     [BookId],
     [BookDescription],
     [Author],
     [AuthorBirthdate],
     [AuthorCountryOfOrigin])
select
    NEWID(),
    t.BookId,
    t.BookDescription,
    t.Name,
    t.BirthDate,
    t.CountryOfOrigin
from #tempTable t

drop table #tempTable
Would it be better to move the select statement from the start down into the insert statement, removing the need for the temp table?
There is no advantage at all to having a temporary table in this case. Just use the select query directly.
Sometimes, temporary tables can improve performance. One reason is that a real table has real statistics (notably the number of rows); the optimizer can use that information to build better execution plans.
Temporary tables can also improve performance if they explicitly have an index on them (see the sketch below).
However, they incur the overhead of writing the table.
In this case, you get all the overhead and there should be no benefit.
Actually, I could imagine one benefit under one circumstance. If the query took a long time to run -- say because the join required a nested loops join with no indexes -- then the destination table would be saved from locking and contention until all the rows are available for insert. That would be an unusual case, though.
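If you do keep the temp table, the explicit index mentioned above could be added right after populating it; only the index name here is made up, the rest comes from the question:

select b.BookId, b.BookDescription, a.Name, a.BirthDate, a.CountryOfOrigin
into #tempTable
from library.Book b
left join authors.Authors a on a.AuthorId = b.AuthorId

-- a real index gives the optimizer something to work with on later reads
create index IX_tempTable_BookId on #tempTable (BookId)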
Do it in one step:
insert into bookStore.BookStore
    ( /* [BookStoreEntryId] <-- assuming this is auto id */
     [BookId],
     [BookDescription],
     [Author],
     [AuthorBirthdate],
     [AuthorCountryOfOrigin])
select distinct
    b.BookId,
    b.BookDescription,
    a.Name,
    a.BirthDate,
    a.CountryOfOrigin
from library.Book b
left join authors.Authors a
    on a.AuthorId = b.AuthorId
Your performance will depend on the number of indexes on the target table: more indexes means a slower insert. It may be worth disabling them during the insert and then rebuilding them after the insert is completed.
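In SQL Server that could look like the sketch below; the index name is hypothetical, and note that this applies to nonclustered indexes (disabling a clustered index makes the table unreadable):

-- disable a nonclustered index before the bulk insert
alter index IX_BookStore_Author on bookStore.BookStore disable

-- ... perform the insert ...

-- then rebuild it afterwards
alter index IX_BookStore_Author on bookStore.BookStore rebuild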

Use of temp table in joining for performance issue

Is there any basic difference in terms of performance and output between the two queries below?
select * from table1
left outer join table2 on table1.col=table2.col
and table2.col1='shhjs'
and
select * into #temp from table2 where table2.col1='shhjs'
select * from table1 left outer join #temp on table1.col=#temp.col
Here table2 has a huge number of records, while #temp has far fewer.
Yes, there is. The second method is going to materialize a table in the temp database, which requires additional overhead.
The first method does not require such overhead, and it can be better optimized. For instance, if an index existed on table2(col, col1), the first version might take advantage of it; the second would not.
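For instance, such an index might be created like this (the index name is made up):

-- composite index covering both the join column and the filter column
create index ix_table2_col_col1 on table2 (col, col1)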
However, you can always try the two queries on your system with your data and determine if one noticeably outperforms the other.

Tuning And Performance

INSERT INTO <TABLED>
SELECT A.*
FROM <TABLEA> A
WHERE A.MED_DTL_STATUS = '0'
AND A.TRANS_ID NOT IN
(
    SELECT DISTINCT TRANS_ID_X_REF FROM <TABLEB>
    UNION
    SELECT DISTINCT TRANS_ID FROM <TABLEA> WHERE ADJUSTMENT_TYPE = '3'
);
The table has more than 250 columns.
The SELECT statement will return more than 300,000 records. The above query has been running for a long time. I have never worked on performance tuning. Could someone please help me tune this, or give me some good links on how to tune Oracle queries?
Thanks in advance.
I find that NOT IN clauses are really slow. I would rewrite the query with NOT EXISTS instead.
INSERT INTO <TABLED>
SELECT A.* FROM <TABLEA> A
WHERE A.MED_DTL_STATUS='0'
AND NOT EXISTS (
SELECT B.TRANS_ID_X_REF
FROM <TABLEB> B
WHERE B.TRANS_ID_X_REF = A.TRANS_ID
)
AND NOT EXISTS (
SELECT A2.TRANS_ID
FROM <TABLEA> A2
WHERE A2.TRANS_ID = A.TRANS_ID
AND A2.ADJUSTMENT_TYPE='3'
);
The query above assumes there are indexes on TRANS_ID on TableA and TableB. This may not really solve your problem, but without knowing the data model and indexes it may be worth a shot.
Apart from the good suggestions already given, whenever you are inserting a large number of records into a table it is best practice to drop the indexes on that table, and then recreate them when the INSERT process has finished.
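In recent Oracle versions you can also mark the indexes unusable and rebuild them instead of dropping them outright; a sketch with a made-up index name:

-- mark the index unusable before the bulk insert
ALTER INDEX trans_id_idx UNUSABLE;

-- allow DML to skip unusable indexes in this session
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- run the INSERT, then rebuild
ALTER INDEX trans_id_idx REBUILD;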
How selective is this predicate?
A.MED_DTL_STATUS='0'
If it filters out a large proportion of the rows in the table then creating an index on MED_DTL_STATUS might help.
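If so, it would be something along these lines; the real table name is elided in the question, so TABLEA here is a hypothetical stand-in:

CREATE INDEX med_dtl_status_idx ON tablea (MED_DTL_STATUS);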
Note that Oracle has a limit of ~1000 items for an IN list of literal values; an IN driven by a subquery, as here, is not subject to that limit, but if you ever switch to a literal list and exceed it you will get an error (such an IN can be rewritten using a left outer join if/when that happens).

Which is faster - NOT IN or NOT EXISTS?

I have an insert-select statement that needs to only insert rows where a particular identifier of the row does not exist in either of two other tables. Which of the following would be faster?
INSERT INTO Table1 (...)
SELECT (...) FROM Table2 t2
WHERE ...
AND NOT EXISTS (SELECT 'Y' from Table3 t3 where t2.SomeFK = t3.RefToSameFK)
AND NOT EXISTS (SELECT 'Y' from Table4 t4 where t2.SomeFK = t4.RefToSameFK AND ...)
... or...
INSERT INTO Table1 (...)
SELECT (...) FROM Table2 t2
WHERE ...
AND t2.SomeFK NOT IN (SELECT RefToSameFK from Table3)
AND t2.SomeFK NOT IN (SELECT RefToSameFK from Table4 WHERE ...)
... or do they perform about the same? Additionally, is there any other way to structure this query that would be preferable? I generally dislike subqueries as they add another "dimension" to the query that increases runtime by polynomial factors.
Usually it does not matter whether NOT IN is slower or faster than NOT EXISTS, because they are NOT equivalent in the presence of NULL. Read:
NOT IN vs NOT EXISTS
In these cases you almost always want NOT EXISTS, because it has the behaviour you usually expect.
If they are equivalent, it is likely that your database has already figured that out and will generate the same execution plan for both.
In the few cases where both options are equivalent and your database is not able to figure that out, it is better to analyze both execution plans and choose the best option for your specific case.
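A tiny demonstration of the NULL trap (tables and values are made up): a single NULL in the subquery makes NOT IN return no rows at all, while NOT EXISTS behaves as you would expect.

CREATE TABLE #a (id INT);
CREATE TABLE #b (id INT);
INSERT INTO #a VALUES (1);
INSERT INTO #a VALUES (2);
INSERT INTO #b VALUES (1);
INSERT INTO #b VALUES (NULL);

-- returns no rows: "id NOT IN (1, NULL)" is never true
SELECT id FROM #a WHERE id NOT IN (SELECT id FROM #b);

-- returns 2, as expected
SELECT id FROM #a a
WHERE NOT EXISTS (SELECT 1 FROM #b b WHERE b.id = a.id);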
You could use a LEFT OUTER JOIN and check if the value in the RIGHT table is NULL. If the value is NULL, the row doesn't exist. That is one way to avoid subqueries.
SELECT (...) FROM Table2 t2
LEFT OUTER JOIN Table3 t3 ON (t2.SomeFK = t3.RefToSameFK)
WHERE t3.RefToSameFK IS NULL
It's dependent on the size of the tables, the available indices, and the cardinality of those indices.
If you don't get the same execution plan for both queries, and if neither query plans out to perform a JOIN instead of a sub query, then I would guess that version two is faster. Version one is correlated and therefore would produce many more sub queries, version two can be satisfied with three queries total.
(Also, note that different engines may be biased in one direction or another. Some engines may correctly determine that the queries are the same (if they really are the same) and resolve to the same execution plan.)
For bigger tables, it's recommended to use NOT EXISTS/EXISTS, because the IN clause may run the subquery many times, depending on the architecture of the tables.
Based on the cost optimizer: there is no difference.

SQL: Optimization problem, has rows?

I've got a query with five joins on some rather large tables (the largest table has 10 million records), and I want to know if rows exist. So far I've done this to check whether rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I've got indexes on the fields that I'm joining on and on the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1, etc.
So you could write it like this: IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 ...) and do your stuff.
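Spelled out as a sketch, as it might appear inside a stored procedure (all names, including the @CustomerId parameter, are hypothetical):

IF EXISTS (SELECT *
           FROM dbo.Orders o
           INNER JOIN dbo.OrderLines ol ON ol.OrderId = o.OrderId
           WHERE o.CustomerId = @CustomerId)
BEGIN
    -- ... do your stuff ...
    PRINT 'Matching rows exist.';
END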
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
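For example, a few ways to get at the plan, depending on the engine (the SELECT shown is a hypothetical stand-in for your real query):

-- SQL Server: return the estimated plan as XML instead of executing
SET SHOWPLAN_XML ON;
GO
SELECT TOP 1 tbl.Id FROM dbo.SomeTable tbl WHERE tbl.SomeColumn = 1;
GO
SET SHOWPLAN_XML OFF;
GO

-- MySQL: EXPLAIN SELECT ...;

-- Oracle:
-- EXPLAIN PLAN FOR SELECT ...;
-- SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);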
Four options:
Try COUNT(*) in place of TOP 1 tbl.Id.
An index per column may not be good enough: you may need to use composite indexes.
Are you on SQL Server 2005? If so, you can find missing indexes (see the DMV sketch right after this list), or try the Database Engine Tuning Advisor.
Also, it's possible that you don't need 5 joins, as elaborated below.
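A sketch of querying the missing-index DMVs that SQL Server 2005 exposes (the TOP 10 and the ordering are just illustrative choices):

-- suggestions the optimizer has recorded for indexes it wished existed
SELECT TOP 10
    d.statement AS table_name,
    d.equality_columns,
    d.inequality_columns,
    d.included_columns,
    s.user_seeks,
    s.avg_user_impact
FROM sys.dm_db_missing_index_details d
JOIN sys.dm_db_missing_index_groups g
    ON g.index_handle = d.index_handle
JOIN sys.dm_db_missing_index_group_stats s
    ON s.group_handle = g.index_group_handle
ORDER BY s.avg_user_impact DESC;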
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
    tbl.Id --or count(*)
FROM
    grandchildtable tbl
INNER JOIN
    anothertable ON ... = ...
WHERE
    tbl.xxx = ...
Try EXISTS.
For either the 5 tables or the assumed hierarchy:
SELECT TOP 1 --or count(*)
    tbl.Id
FROM
    grandchildtable tbl
WHERE
    tbl.xxx = ...
    AND EXISTS (SELECT *
                FROM anothertable T2
                WHERE tbl.key = T2.key /* AND T2 condition */)

-- or

SELECT TOP 1 --or count(*)
    tbl.Id
FROM
    mytable tbl
WHERE
    tbl.xxx = ...
    AND EXISTS (SELECT *
                FROM anothertable T2
                WHERE tbl.key = T2.key /* AND T2 condition */)
    AND EXISTS (SELECT *
                FROM yetanothertable T3
                WHERE tbl.key = T3.key /* AND T3 condition */)
Doing a filter early in your first select will help if you can do it; as you filter the data in the first instance, all the joins will join on reduced data.
SELECT TOP 1 tbl1.Id
FROM
(
    SELECT TOP 1 *
    FROM table tbl1
    WHERE Key = Key
) tbl1
INNER JOIN ...
Beyond that, you would likely need to provide more of the query for us to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Use the table with the most rows first in every join. If you use more than one condition in the WHERE clause, the sequence of the conditions is important: use the condition that gives you the maximum rows first.
Use filters very carefully when optimizing a query.