INSERT INTO <TABLED>
SELECT A.* FROM
<TABLEA> A WHERE A.MED_DTL_STATUS='0'
AND A.TRANS_ID
NOT IN
(
SELECT DISTINCT TRANS_ID_X_REF FROM <TABLEB>
UNION
SELECT DISTINCT TRANS_ID FROM <TABLEA> WHERE ADJUSTMENT_TYPE='3'
);
The table has more than 250 columns, and the SELECT statement will return more than 300,000 records. The above query runs for a long time. I have never worked on performance tuning. Could someone please help me tune this, or give me some good links on how to tune Oracle queries?
Thanks in advance.
I find that NOT IN clauses are really slow. I would rewrite the query with NOT EXISTS instead.
INSERT INTO <TABLED>
SELECT A.* FROM <TABLEA> A
WHERE A.MED_DTL_STATUS='0'
AND NOT EXISTS (
SELECT B.TRANS_ID_X_REF
FROM <TABLEB> B
WHERE B.TRANS_ID_X_REF = A.TRANS_ID
)
AND NOT EXISTS (
SELECT A2.TRANS_ID
FROM <TABLEA> A2
WHERE A2.TRANS_ID = A.TRANS_ID
AND A2.ADJUSTMENT_TYPE='3'
);
The query above assumes there are indexes on TRANS_ID on TableA and TableB. This may not really solve your problem, but without knowing the data model and indexes it may be worth a shot.
Apart from the good suggestions already given, whenever you are inserting a large number of records into a table it is best practice to drop the indexes on that table. When the INSERT process has finished, then recreate the indexes.
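As a sketch of that pattern in Oracle (the index name here is hypothetical), you can mark an index unusable before the load and rebuild it afterwards:

```sql
-- Hypothetical index on the target table; mark it unusable before the bulk load
ALTER INDEX idx_tabled_trans_id UNUSABLE;

-- ... run the large INSERT ... SELECT ...

-- Rebuild the index once the load completes
ALTER INDEX idx_tabled_trans_id REBUILD;
```

Depending on your Oracle version and settings, DML may also need `skip_unusable_indexes` enabled for the session so the unusable index is ignored during the insert.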
How selective is this predicate?
A.MED_DTL_STATUS='0'
If it filters out a large proportion of the rows in the table then creating an index on MED_DTL_STATUS might help.
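A quick way to check that selectivity, and the index that would exploit it (table and index names here are stand-ins for the placeholders in the question):

```sql
-- How many rows share each status value?
SELECT med_dtl_status, COUNT(*) FROM tablea GROUP BY med_dtl_status;

-- Worth creating only if '0' accounts for a small fraction of the rows
CREATE INDEX idx_tablea_status ON tablea (med_dtl_status);
```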
Note that Oracle limits an IN list to ~1000 items, but that limit applies only to literal value lists (ORA-01795), not to rows returned by a subquery. If you ever do hit it with a literal list, the IN can be rewritten using a left outer join.
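As a sketch against the placeholder tables from the question, the left-outer-join (anti-join) form of the first NOT IN subquery would look like:

```sql
SELECT a.*
FROM tablea a
LEFT JOIN tableb b
  ON b.trans_id_x_ref = a.trans_id
WHERE a.med_dtl_status = '0'
  AND b.trans_id_x_ref IS NULL  -- keep only rows with no match in TABLEB
```

The second subquery (the self-join on ADJUSTMENT_TYPE) would need the same treatment separately.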
I have a SELECT query that came about from trying to remove WHILE loops from an existing query that was far too slow. As it stands, I first select into a temp table.
Then from that temp table I insert into the final table using the values from the temp table.
Below is a simplified example of the flow of my query
select
b.BookId,
b.BookDescription,
a.Name,
a.BirthDate,
a.CountryOfOrigin
into #tempTable
from library.Book b
left join authors.Authors a
on a.AuthorId = b.AuthorId
insert into bookStore.BookStore
([BookStoreEntryId],
[BookId],
[BookDescription],
[Author],
[AuthorBirthdate],
[AuthorCountryOfOrigin])
select
NEWID(),
t.BookId,
t.BookDescription,
t.Name,
t.Birthdate,
t.CountryOfOrigin
from #tempTable t
drop table #tempTable
Would it be better to move the select statement at the start down, so that it's incorporated into the insert statement, removing the need for the temp table?
There is no advantage at all to having a temporary table in this case. Just use the select query directly.
Sometimes, temporary tables can improve performance. One method is that a real table has real statistics (notably the number of rows). The optimizer can use that information for better execution plans.
Temporary tables can also improve performance if they explicitly have an index on them.
However, they incur overhead of writing the table.
In this case, you just get all the overhead and there should be no benefit.
Actually, I could imagine one benefit under one circumstance. If the query took a long time to run -- say because the join required a nested loops join with no indexes -- then the destination table would be saved from locking and contention until all the rows are available for insert. That would be an unusual case, though.
Do it in one step:
insert into bookStore.BookStore
( /* [BookStoreEntryId] <-- assuming this is auto id*/
[BookId],
[BookDescription],
[Author],
[AuthorBirthdate],
[AuthorCountryOfOrigin])
SELECT distinct
b.BookId,
b.BookDescription,
a.Name,
a.BirthDate,
a.CountryOfOrigin
from library.Book b
left join authors.Authors a
on a.AuthorId = b.AuthorId
Your performance will depend on the number of indexes on the target table: more indexes means a slower insert. It may be worth disabling them during the insert and then rebuilding them after the insert is completed.
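A sketch of that in SQL Server (the index name is hypothetical; note that the clustered index must stay enabled or the table becomes inaccessible, so disable only nonclustered indexes):

```sql
-- Disable a nonclustered index before the bulk insert
ALTER INDEX IX_BookStore_Author ON bookStore.BookStore DISABLE;

-- ... run the INSERT ... SELECT ...

-- Rebuild it once the load completes
ALTER INDEX IX_BookStore_Author ON bookStore.BookStore REBUILD;
```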
I heard that the IN operator is costlier than the JOIN operator.
Is that true?
Example case for IN operator:
SELECT *
FROM table_one
WHERE column_one IN (SELECT column_one FROM table_two)
Example case for JOIN operator:
SELECT *
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
tl;dr; - once the queries are fixed so that they will yield the same results, the performance is the same.
The two queries are not the same, and will yield different results.
The IN query will return all the columns from table_one,
while the JOIN query will return all the columns from both tables.
That can be solved easily by replacing the * in the second query to table_one.*, or better yet, specify only the columns you want to get back from the query (which is best practice).
However, even if that issue is changed, the queries might still yield different results if the values on table_two.column_one are not unique.
The IN query will yield a single record from table_one even if it fits multiple records in table_two, while the JOIN query will simply duplicate the records as many times as the criteria in the ON clause is met.
Having said all that - if the values in table_two.column_one are guaranteed to be unique, and the join query is changed to select table_one.*... - then, and only then, will both queries yield the same results - and that would be a valid question to compare their performance.
So, in the performance front:
The IN operator has a history of poor performance with a large values list - in earlier versions of SQL Server, if you used the IN operator with, say, 10,000 or more values, it would have suffered from a performance issue.
With a small values list (say, up to 5,000, probably even more) there's absolutely no difference in performance.
However, in currently supported versions of SQL Server (that is, 2012 or higher), the query optimizer is smart enough to understand that in the conditions specified above these queries are equivalent and might generate exactly the same execution plan for both queries - so performance will be the same for both queries.
UPDATE: I've done some performance research on the only version of SQL Server I have available, which is 2016.
First, I've made sure that Column_One in Table_Two is unique by setting it as the primary key of the table.
CREATE TABLE Table_One
(
id int,
CONSTRAINT PK_Table_One PRIMARY KEY(Id)
);
CREATE TABLE Table_Two
(
column_one int,
CONSTRAINT PK_Table_Two PRIMARY KEY(column_one)
);
Then, I've populated both tables with 1,000,000 (one million) rows.
SELECT TOP 1000000 ROW_NUMBER() OVER(ORDER BY @@SPID) As N INTO Tally
FROM sys.objects A
CROSS JOIN sys.objects B
CROSS JOIN sys.objects C;
INSERT INTO Table_One (id)
SELECT N
FROM Tally;
INSERT INTO Table_Two (column_one)
SELECT N
FROM Tally;
Next, I ran four different ways of getting all the values of table_one that match values of table_two. The first two are from the original question (with minor changes), the third is a simplified version of the join query, and the fourth is a query that uses the EXISTS operator with a correlated subquery instead of the IN operator:
SELECT *
FROM table_one
WHERE Id IN (SELECT column_one FROM table_two);
SELECT TOne.*
FROM table_one TOne
JOIN (select column_one from table_two) AS TTwo
ON TOne.id = TTwo.column_one;
SELECT TOne.*
FROM table_one TOne
JOIN table_two AS TTwo
ON TOne.id = TTwo.column_one;
SELECT *
FROM table_one
WHERE EXISTS
(
SELECT 1
FROM table_two
WHERE column_one = id
);
All four queries yielded the exact same result with the exact same execution plan - so it's safe to say that performance, under these circumstances, is exactly the same.
You can copy the full script (with comments) from Rextester (result is the same with any number of rows in the tally table).
From a performance point of view, using EXISTS is often a better option than using the IN operator or a JOIN between the tables:
SELECT TOne.*
FROM table_one TOne
WHERE EXISTS ( SELECT 1 FROM table_two TTwo WHERE TOne.column_one = TTwo.column_one )
If you need the columns from both tables, and provided those have indexes on the column column_one used in the join condition, using a JOIN would be better than using an IN operator, since you will be able to benefit from the indexes :
SELECT TOne.*, TTwo.*
FROM table_one TOne
JOIN table_two TTwo
ON TOne.column_one = TTwo.column_one
In the above query, which is recommended to use and why?
The second (JOIN) query cannot be optimal compared to the first query unless you put a where clause within the sub-query, as follows:
Select * from table_one TOne
JOIN (select column_one from table_two where column_two = 'Some Value') AS TTwo
ON TOne.column_one = TTwo.column_one
However, the better decision can be based on execution plan with following points into consideration:
How many tasks the query has to perform to get the result
What is task type and execution time of each task
Variance between the estimated number of rows and the actual number of rows in each task - if the variance is too high, this can be fixed by running UPDATE STATISTICS on the table.
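For instance (a sketch, reusing the table name from the earlier examples), refreshing statistics in SQL Server looks like:

```sql
-- Rebuild statistics for the table with a full scan so row estimates are accurate
UPDATE STATISTICS table_one WITH FULLSCAN;
```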
In general, the Logical Processing Order of the SELECT statement goes as follows. If you arrange your query to read the smallest number of rows/pages at the earliest possible stage (as per the following order), it will incur less logical I/O cost and end up better optimized. In other words, it's better to filter rows in the FROM or WHERE clause than in the GROUP BY or HAVING clause.
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
I have an Oracle query, below, which is working well:
INSERT /*+APPEND*/ INTO historical
SELECT a.* FROM TEMP_DATA a WHERE NOT EXISTS(SELECT 1 FROM historical WHERE KEY=a.KEY)
With this query, when I run EXPLAIN PLAN, I notice that the optimizer chooses a HASH JOIN plan and the cost is fairly low.
However, there's a new requirement limiting how many rows may already exist in the historical table for each key checked against the TEMP_DATA table, and hence the query was changed to:
INSERT /*+APPEND*/ INTO historical
SELECT a.* FROM TEMP_DATA a WHERE (SELECT COUNT(1) FROM historical WHERE KEY=a.KEY) < 2
This means that if one row already exists in the historical table for the given key (not a primary key), the data can still be inserted.
However, with this approach the query slows down a lot, with a cost more than 10 times that of the original. I also noticed that the optimizer now chooses a NESTED LOOP plan.
Note that the historical table is a partitioned table with indexes.
Is there any way I can optimize this?
Thanks.
The following query should do the same thing and should be more performant:
select a.*
from temp_data a
left join (select key, count(*) cnt
           from historical
           group by key) b
  on a.key = b.key
where nvl(b.cnt, 0) < 2;
Hope it helps
An alternative to #DirkNM's answer would be:
select a.*
from temp_data a
where not exists (select null
from historical h
where h.key = a.key
and rownum <= 2
group by h.key
having count(*) > 1);
You would have to test with your data sets to work out which is the best solution for you.
NB: I wouldn't expect the new query (whichever one you choose) to be as performant as your original query.
I've got a query with five joins on some rather large tables (the largest has 10 million records), and I want to know if rows exist. So far I've done this to check whether rows exist:
SELECT TOP 1 tbl.Id
FROM table tbl
INNER JOIN ... ON ... = ... (x5)
WHERE tbl.xxx = ...
Using this query in a stored procedure takes 22 seconds, and I would like it to be close to "instant". Is this even possible? What can I do to speed it up?
I got indexes on the fields that I'm joining on and the fields in the WHERE clause.
Any ideas?
Switch to the EXISTS predicate. In general I have found it to be faster than selecting TOP 1 etc.
So you could write something like IF EXISTS (SELECT * FROM table tbl INNER JOIN table tbl2 ON ...) and then do your stuff.
Depending on your RDBMS you can check what parts of the query are taking a long time and which indexes are being used (so you can know they're being used properly).
In MSSQL, you can use see a diagram of the execution path of any query you submit.
In Oracle and MySQL you can use the EXPLAIN keyword to get details about how the query is working.
But it might just be that 22 seconds is the best you can do with your query. We can't answer that, only the execution details provided by your RDBMS can. If you tell us which RDBMS you're using we can tell you how to find the information you need to see what the bottleneck is.
4 options
Try COUNT(*) in place of TOP 1 tbl.id
An index per column may not be good enough: you may need to use composite indexes
Are you on SQL Server 2005? If so, you can find missing indexes, or try the Database Tuning Advisor.
Also, it's possible that you don't need 5 joins.
Assuming parent-child-grandchild etc, then grandchild rows can't exist without the parent rows (assuming you have foreign keys)
So your query could become
SELECT TOP 1
tbl.Id --or count(*)
FROM
grandchildtable tbl
INNER JOIN
anothertable ON ... = ...
WHERE
tbl.xxx = ...
Try EXISTS.
For either the five tables or the assumed hierarchy:
SELECT TOP 1 --or count(*)
tbl.Id
FROM
grandchildtable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
-- or
SELECT TOP 1 --or count(*)
tbl.Id
FROM
mytable tbl
WHERE
tbl.xxx = ...
AND
EXISTS (SELECT *
FROM
anothertable T2
WHERE
tbl.key = T2.key /* AND T2 condition*/)
AND
EXISTS (SELECT *
FROM
yetanothertable T3
WHERE
tbl.key = T3.key /* AND T3 condition*/)
Filtering early in your first select will help if you can do it; by filtering the data at the first step, all the joins operate on reduced data.
Select top 1 tbl.id
From
(
Select top 1 * from
table tbl1
Where Key = Key
) tbl1
inner join ...
After that you will likely need to provide more of the query to understand how it works.
Maybe you could offload/cache this fact-finding mission. Like if it doesn't need to be done dynamically or at runtime, just cache the result into a much smaller table and then query that. Also, make sure all the tables you're querying to have the appropriate clustered index. Granted you may be using these tables for other types of queries, but for the absolute fastest way to go, you can tune all your clustered indexes for this one query.
Edit: Yes, what other people said. Measure, measure, measure! Your query plan estimate can show you what your bottleneck is.
Put the table with the most rows first in every join. If there is more than one condition in the WHERE clause, the sequence of the conditions matters: lead with the condition that filters out the most rows.
Use filters very carefully when optimizing a query.
Is there a way I can improve this kind of SQL query performance:
INSERT
INTO ...
WHERE NOT EXISTS(Validation...)
The problem is that when I have a lot of data in my table (millions of rows), the execution of the WHERE NOT EXISTS clause is very slow. I have to do this verification because I can't insert duplicated data.
I use SQL Server 2005.
Thanks.
Make sure you are searching on indexed columns, with no manipulation of the data within those columns (like substring etc.)
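For example (illustrative names, not from the question), a function applied to the column usually prevents an index seek, while a rewritten predicate can use the index:

```sql
-- Non-sargable: manipulating the column defeats an index on customerid
SELECT * FROM customers WHERE SUBSTRING(customerid, 1, 3) = 'ABC';

-- Sargable: the column is left untouched, so an index on customerid can be used
SELECT * FROM customers WHERE customerid LIKE 'ABC%';
```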
Off the top of my head, you could try something like:
TRUNCATE temptable
INSERT INTO temptable ...
INSERT INTO temptable ...
...
INSERT INTO realtable
SELECT temptable.* FROM temptable
LEFT JOIN realtable on realtable.key = temptable.key
WHERE realtable.key is null
Try to replace the NOT EXISTS with a left outer join, it sometimes performs better in large data sets.
Outer Apply tends to work for me...
instead of:
from t1
where not exists (select 1 from t2 where t1.something=t2.something)
I'll use:
from t1
outer apply (
select top 1 1 as found from t2 where t1.something=t2.something
) t2f
where t2f.found is null
Pay attention to the other answer regarding indexing. NOT EXISTS is typically quite fast if you have good indexes.
But I have had performance issues with statements like you describe. One method I've used to get around that is to use a temp table for the candidate values, perform a DELETE FROM ... WHERE EXISTS (...), and then blindly INSERT the remainder. Inside a transaction, of course, to avoid race conditions. Splitting up the queries sometimes allows the optimizer to do its job without getting confused.
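A sketch of that split, with hypothetical table and column names:

```sql
BEGIN TRANSACTION;

-- Stage the candidate rows
SELECT * INTO #candidates FROM staging_table;

-- Delete the candidates that already exist in the target
DELETE c
FROM #candidates c
WHERE EXISTS (SELECT 1 FROM realtable r WHERE r.keycol = c.keycol);

-- Blindly insert whatever remains
INSERT INTO realtable SELECT * FROM #candidates;

COMMIT TRANSACTION;

DROP TABLE #candidates;
```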
If you can at all reduce your problem space, then you'll gain heaps of performance. Are you absolutely sure that every one of those rows in that table needs to be checked?
The other thing you might want to try is a DELETE InsertTable FROM InsertTable INNER JOIN ExistingTable ON <Validation criteria> before your insert. However, your mileage may vary
insert into customers
select *
from newcustomers
where customerid not in (select customerid
from customers)
..may be more efficient. As others have said, make sure you've got indexes on any lookup fields.