Performance Tuning for an insert query - sql

Can someone help me in tuning this query as I am new to performance tuning in oracle.
INSERT INTO mdm_id_relation
SELECT
pat_key, hub_pat_id, msa_pat_id, pat_id
FROM
ods_raw_patient_mdm_process p1
WHERE NVL (pat_id, 'NULL') IN (SELECT pat_id
FROM mdm_id_relation)
AND NOT EXISTS (SELECT pat_key
FROM mdm_id_relation p2
WHERE p1.pat_key = p2.pat_key);

To tune an INSERT query, you'll need the follwowing ingredients:
A place to test your query. Ideally a separate database, but a separate schema might do as well. Minimally a copy of the tables and indexes that are involved. Reason: the INSERT will change data, you'll need to run different versions of the query until you are happy with the performance.
The test tables need to have exactly the same structure as the real table and roughly the same amount of data as the real thing. Reason: The performance of the INSERT depends heavily on both structure and amount.
Up to date statistics: Look up DBMS_STATS.GATHER_TABLE_STATS and how to use it. Reason: Give the query optimizer a chance to find a good query plan.
A way to measure performance (wall clock seconds or Oracle costs etc.), and, even better, access to the query plan (SQL Developer: Explain Plan button, or have a look at William's script).
When I need to tune INSERT statements, I normally start with the SELECT part until I am happy with it. Firstly run SELECT ..., when that is fine, I run a CREATE TABLE foo NOLOGGING AS SELECT ... to measure the SELECT of all the rows. When thats fine, I test the whole INSERT ... SELECT ... statement.

Any issue with performance is going to be the select, not the insert. I think this is an equivalent query:
INSERT INTO mdm_id_relation (pat_key, hub_pat_id, msa_pat_id, pat_id) -- always list the columns!
SELECT pat_key, hub_pat_id, msa_pat_id, pat_id
FROM ods_raw_patient_mdm_process p1
WHERE EXISTS (SELECT 1
FROM mdm_id_relation mir
WHERE mir.pat_id = p1.pat_id
) AND
NOT EXISTS (SELECT 1
FROM mdm_id_relation mir
WHERE p1.pat_key = mir.pat_key
);
For this query, you want two indexes: mdm_id_relation(pat_id) and mdm_id_relation(pat_key). These should be a big help on performance.
Note: Test the select first before doing the insert.

Related

Hive Query Efficiency

Could you help me with a Hive Query Efficiency problem? I have two queries working for the same problem. I just cannot figure out why one is much faster than the other. If you know please feel free to provide insight. Any info is welcomed!
Problem: I am trying to check the minimum value of a bunch of variables in a Hive parquet table.
Queries: I tried two two queries as follows:
query 1
drop table if exists tb_1 purge;
create table if not exists tb_1 as
select 'v1' as name, min(v1) as min_value from src_tb union all
select 'v2' as name, min(v2) as min_value from src_tb union all
select 'v3' as name, min(v3) as min_value from src_tb union all
...
select 'v200' as name, min(v200) as min_value from src_tb
;
query 2
drop table if exists tb_2 purge;
create table if not exists tb_2 as
select min(v1) as min_v1
, min(v2) as min_v2
, min(v3) as min_v3
...
, min(v200) as min_v200
from src_tb
;
Result: Query 2 is much faster than query 1. It took probably 5 mins to finish the second query. I don't know how long will query 1 take. But after I submit the first query, it took a long time to even react to the query, by which I mean that usually after I submit a query, the system will start to analyze and provides some compiling information in the terminal. However, for my first query, after my submission, the system won't even react to this. So I just killed it.
What do you think? Thank you in advance.
Query execution time depends on environment that you execute it.
In MSSQL.
Some people like you think query execution is similar to algorithm that they see in some theoretical resources, but in practical situation, it depends on other things.
For example both of your queries have SELECT statement that perform on a table and at first glance, they need to read all rows, but database server must analyze the statement to determine the most efficient way to extract the requested data. This is referred to as optimizing the SELECT statement. The component that does this is called the Query Optimizer. The input to the Query Optimizer consists of the query, the database schema (table and index definitions), and the database statistics. The output of the Query Optimizer is a query execution plan, sometimes referred to as a query plan or just a plan. (Please see this for more information about query-processing architecture)
You can see execution plan in MSSQL by reading this article and I think you will understand better by seeing execution plan for both of your queries.
Edit (Hive)
Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:
EXPLAIN [EXTENDED|DEPENDENCY|AUTHORIZATION] query
A Hive query gets converted into a sequence of stages. The description of the stages itself shows a sequence of operators with the metadata associated with the operators.
Please see LanguageManual Explain for more information.
What is surprising? The first query has to read src_tb a total of 200 times. The second reads the data once and performs 200 aggregations. It is a no brainer that it is faster.

is IN(SELECT ...) bad for performance?

Suppose I have the following code:
SELECT *
FROM [myTable]
WHERE [myColumn] IN (SELECT [otherColumn] FROM [myOtherTable])
Will the subquery be executed again and again for every row?
If so, can I execute it and store its results and use them for every row instead? For example:
SELECT [otherColumn]
INTO #Results
FROM [myOtherTable]
SELECT *
FROM [myTable]
WHERE [myColumn] IN (#Results)
SQL server query optimizer is smart enough to not run the same subquery over and over again. If anything, the temp table is less optimal because of additional steps after getting the results.
You can see this by looking at the SQL query execution plan.
Edit: After looking into this further, it can also be more than once. Apparently query optimizer can also do a lot of interesting things like convert your IN to a JOIN to increase performance. There's lots of information on it here: Number of times a nested query is executed
None the less, view your execution plan to see what your RDMS's query optimizer decided to do.
Have you considered using a join instead? I think that could be best in terms of performance.
SELECT * FROM [myTable] INNER JOIN [myOtherTable]
ON ([myTable][myColumn] = [myOtherTable][otherColumn]);
This however will only work if you don't expect duplicates to be in myOtherTable.

Is there a performance or memory benefit : Select A, B, C into temp from query VS Select into temp from (Select A,B,C from table) as TTable

In Sql server (2008 version),
I have two queries like this :
Select * from sampleTable into #tempTable
OR
Select * into #tempTable from (Select Query) as someTableName
Is there performance or memory space benefit in any of these queries ? or both are equally good.
This is known that they individually are better than
Insert into Temp
<Query>
But how about when compared to each other.
Updated Text :
Two queries are like this:
Select A,B,C into #tempTable from TestTable
OR
Select * into #tempTable from (Select A,B,C from TestTable) as someTableName
All of those result in the same query plan.
SQL Server has a query optimizer, and optimizing away redundant columns is about the easiest and most basic optimization there is.
The best way to answer such questions for yourself is to look at the query plans and compare them. It is generally quite pointless to memorize the performance behavior of specific queries. It is a better approach to understand how queries are optimized and executed in general.
Same performance
if you are going to use very large datasets like 10000, then #temptable slows down the query, if you are using this for paging or something there is groupby and in newer versions fetch newxt rows methods

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE #DT int; SET #DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=#DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE #DT int; SET #DT=20110717;
BEGIN TRY DROP TABLE #LargeData END TRY BEGIN CATCH END CATCH; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=#DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a fields data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table alaises, etc...) the bottom query executes MUCH faster then the top one. Sometimes in the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell the SQL Server to perform the same kind of optimization automatically, or other formats of this that would involve a cleaner aproach (trying to keep the format as much like query 1 as possible) ?
Note that the "Ranking" or secondary queries is just fluff for this example, the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE #DT int; SET #DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=#DT
**OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY**
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table in a previous query. I found the execution times improved in almost the exact same fashion. WHICH IS FAR SIMPLIER then using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static Number style TOP 999999999 )
Explanation:
From what I can tell from the actual execution plan of the query in both formats (original one with normal CTE's and one with each sub query having TOP 99999999)
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria is applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub querys from the main query in order to apply the TOP statements action, thus forcing creation of an in memory "Bitmap" of the sub query that is then joined to the main query. This appears to actually do exactly what I wanted, and in fact it may even be more efficient since servers with large ammounts of RAM will be able to do the query execution entirely in MEMORY without any disk IO. In my case we have 280 GB of RAM so well more then could ever really be used.
Not only can you use indexes on temp tables but they allow the use of statistics and the use of hints. I can find no refernce to being able to use the statistics in the documentation on CTEs and it says specifically you cann't use hints.
Temp tables are often the most performant way to go when you have a large data set when the choice is between temp tables and table variables even when you don't use indexes (possobly because it will use statistics to develop the plan) and I might suspect the implementation of the CTE is more like the table varaible than the temp table.
I think the best thing to do though is see how the excutionplans are different to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query SQL Server query optimizer is able to generate a query plan. In the second query a good query plan isn't able to be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using option(RECOMPILE). There is a cost to this as it recompiles the query every time but if different plans are needed it might be worth it.
You could also try using OPTION(OPTIMIZE FOR #DT=SomeRepresentatvieValue) The problem with this is you pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog

SQL WHERE ID IN (id1, id2, ..., idn)

I need to write a query to retrieve a big list of ids.
We do support many backends (MySQL, Firebird, SQLServer, Oracle, PostgreSQL ...) so I need to write a standard SQL.
The size of the id set could be big, the query would be generated programmatically. So, what is the best approach?
1) Writing a query using IN
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
My question here is. What happens if n is very big? Also, what about performance?
2) Writing a query using OR
SELECT * FROM TABLE WHERE ID = id1 OR ID = id2 OR ... OR ID = idn
I think that this approach does not have n limit, but what about performance if n is very big?
3) Writing a programmatic solution:
foreach (var id in myIdList)
{
var item = GetItemByQuery("SELECT * FROM TABLE WHERE ID = " + id);
myObjectList.Add(item);
}
We experienced some problems with this approach when the database server is queried over the network. Normally is better to do one query that retrieve all results versus making a lot of small queries. Maybe I'm wrong.
What would be a correct solution for this problem?
Option 1 is the only good solution.
Why?
Option 2 does the same but you repeat the column name lots of times; additionally the SQL engine doesn't immediately know that you want to check if the value is one of the values in a fixed list. However, a good SQL engine could optimize it to have equal performance like with IN. There's still the readability issue though...
Option 3 is simply horrible performance-wise. It sends a query every loop and hammers the database with small queries. It also prevents it from using any optimizations for "value is one of those in a given list"
An alternative approach might be to use another table to contain id values. This other table can then be inner joined on your TABLE to constrain returned rows. This will have the major advantage that you won't need dynamic SQL (problematic at the best of times), and you won't have an infinitely long IN clause.
You would truncate this other table, insert your large number of rows, then perhaps create an index to aid the join performance. It would also let you detach the accumulation of these rows from the retrieval of data, perhaps giving you more options to tune performance.
Update: Although you could use a temporary table, I did not mean to imply that you must or even should. A permanent table used for temporary data is a common solution with merits beyond that described here.
What Ed Guiness suggested is really a performance booster , I had a query like this
select * from table where id in (id1,id2.........long list)
what i did :
DECLARE #temp table(
ID int
)
insert into #temp
select * from dbo.fnSplitter('#idlist#')
Then inner joined the temp with main table :
select * from table inner join temp on temp.id = table.id
And performance improved drastically.
First option is definitely the best option.
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
However considering that the list of ids is very huge, say millions, you should consider chunk sizes like below:
Divide you list of Ids into chunks of fixed number, say 100
Chunk size should be decided based upon the memory size of your server
Suppose you have 10000 Ids, you will have 10000/100 = 100 chunks
Process one chunk at a time resulting in 100 database calls for select
Why should you divide into chunks?
You will never get memory overflow exception which is very common in scenarios like yours.
You will have optimized number of database calls resulting in better performance.
It has always worked like charm for me. Hope it would work for my fellow developers as well :)
Doing the SELECT * FROM MyTable where id in () command on an Azure SQL table with 500 million records resulted in a wait time of > 7min!
Doing this instead returned results immediately:
select b.id, a.* from MyTable a
join (values (250000), (2500001), (2600000)) as b(id)
ON a.id = b.id
Use a join.
In most database systems, IN (val1, val2, …) and a series of OR are optimized to the same plan.
The third way would be importing the list of values into a temporary table and join it which is more efficient in most systems, if there are lots of values.
You may want to read this articles:
Passing parameters in MySQL: IN list vs. temporary table
I think you mean SqlServer but on Oracle you have a hard limit how many IN elements you can specify: 1000.
Sample 3 would be the worst performer out of them all because you are hitting up the database countless times for no apparent reason.
Loading the data into a temp table and then joining on that would be by far the fastest. After that the IN should work slightly faster than the group of ORs.
For 1st option
Add IDs into temp table and add inner join with main table.
CREATE TABLE #temp (column int)
INSERT INTO #temp (column)
SELECT t.column1 FROM (VALUES (1),(2),(3),...(10000)) AS t(column1)
Try this
SELECT Position_ID , Position_Name
FROM
position
WHERE Position_ID IN (6 ,7 ,8)
ORDER BY Position_Name