I am inserting data from one table, "Tags", in the "Recovery" database into another table, "Tags", in the "R3" database.
They both live on my laptop in the same SQL Server instance.
I have built the insert query, and because the Recovery..Tags table is around 180M records, I decided to break it into smaller subsets (1 million records at a time).
Here is my query (Let's call Query A)
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between 13000001 and 14000000
it takes around 2 minutes.
That is ok
To make things a bit easier for myself,
I put the iID from the WHERE clause into a variable,
so my query looks like this (Let's call Query B)
declare @i int = 12
insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i + 1) * 1000000
but that caused the insert to become very slow (around 10 min).
So I tried query A again, and it took around 2 min.
Then I tried query B again, and it took around 8 min!
I am attaching the execution plan for each one (at a site that shows an analysis of the query plan) - Query A Plan and Query B Plan.
Any idea why this is happening, and how to fix it?
The big difference in time is due to the very different plans that are being created to join Tags and Reps.
Fundamentally, in version A the optimizer knows how much data is being extracted (a million rows) and can design an efficient plan for that. However, because you are using variables in B to define how much data is being read, it has to build a more generic plan - one that would work for 10 rows, a million rows, or a hundred million rows.
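For illustration, one quick way to get the specific plan back while keeping the variable is to let the optimizer see the variable's current value when it compiles the statement, e.g. with OPTION (RECOMPILE) - a sketch based on your query B, where the hint is the only change:

declare @i int = 12

insert into R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT T.iID,T.DT,T.RepID,T.Tag,T.xmiID,T.iBegin,T.iEnd,T.Confidence,T.Polarity,T.Uncertainty,T.Conditional,T.Generic,T.HistoryOf,T.CodingScheme,T.Code,T.CUI,T.TUI,T.PreferredText,T.ValueBegin,T.ValueEnd,T.Value,T.Deleted,T.sKey,R.RepType
FROM Recovery..tags T inner join Recovery..Reps R on T.RepID = R.RepID
where T.iID between (1000000 * @i) + 1 and (@i + 1) * 1000000
option (recompile) -- compile with @i's actual value, so the estimate matches the 1M-row range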
In the plans, here are the relevant sections of the query joining Tags and Reps...
... in A
... and B
Note that in A it takes just over a minute to do the join; in B it takes 6 and a half minutes.
The key thing that appears to take the time is that it does a table scan of the Tags table, which takes 5:44 to complete. The plan uses a table scan because, the next time you run the query, the variable may ask for many more than 1 million rows.
A secondary issue is that the amount of data it reads (or expects to read) from Reps is also way out of whack. In A it expected to read 2 million rows and read 1421; in B it basically read them all (even though technically it probably only needed the same 1421).
I think you have two main approaches to fix this:
Look at indexing to remove the table scan on Tags - ensure the indexes match what is needed and allow the query to do a range scan on that index (it appears that the index at the top of @MikePetri's answer, or something similar, is what you need). That way, instead of doing a table scan, it can do an index scan that starts 'in the middle' of the data set (a table scan must start at either the beginning or the end of the data set).
Separate this into two processes. The first process gets the relevant million rows from Tags and saves them in a temporary table. The second process uses the data in the temporary table to join to Reps (also try using option (recompile) in the second query, so that it checks the temporary table's size before creating the plan).
You can even put an index or two (and/or Primary Key) on that temporary table to make it better for the next step.
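A minimal sketch of that two-step approach, using the same tables and columns as your query (the temp table name #TagsBatch and its index name are just illustrative):

DECLARE @i INT = 12;

-- Step 1: pull only the 1M-row slice of Tags into a temp table
SELECT T.*
INTO #TagsBatch
FROM Recovery..Tags AS T
WHERE T.iID BETWEEN (1000000 * @i) + 1 AND (@i + 1) * 1000000;

-- Optional: index the temp table on the join column for the next step
CREATE INDEX IX_TagsBatch_RepID ON #TagsBatch (RepID);

-- Step 2: join the small temp table to Reps; OPTION (RECOMPILE) lets the
-- optimizer see the temp table's actual size before it picks a plan
INSERT INTO R3..Tags (iID,DT,RepID,Tag,xmiID,iBegin,iEnd,Confidence,Polarity,Uncertainty,Conditional,Generic,HistoryOf,CodingScheme,Code,CUI,TUI,PreferredText,ValueBegin,ValueEnd,Value,Deleted,sKey,RepType)
SELECT B.iID,B.DT,B.RepID,B.Tag,B.xmiID,B.iBegin,B.iEnd,B.Confidence,B.Polarity,B.Uncertainty,B.Conditional,B.Generic,B.HistoryOf,B.CodingScheme,B.Code,B.CUI,B.TUI,B.PreferredText,B.ValueBegin,B.ValueEnd,B.Value,B.Deleted,B.sKey,R.RepType
FROM #TagsBatch AS B
INNER JOIN Recovery..Reps AS R ON B.RepID = R.RepID
OPTION (RECOMPILE);

DROP TABLE #TagsBatch;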
The reason the first query is so much faster is that it went parallel. This means the cardinality estimator knew enough about the data it had to handle, and the query was large enough to tip the threshold for parallel execution. The engine then passed chunks of data to different processors to handle individually, before reporting back and repartitioning the streams.
With the value as a variable, it effectively becomes a scalar function evaluation, and a query cannot go parallel with a scalar function, because the value has to be determined before the cardinality estimator can figure out what to do with it. Therefore, it runs in a single thread, and is slower.
Some sort of looping mechanism might help. Create the indexes below (with their included columns) to assist the engine in handling this request. You can probably find a better looping mechanism, since you are familiar with the identity ranges you care about, but this should get you going in the right direction. Adjust for your needs.
With a loop like this, it commits the changes with each loop, so you aren't locking the table indefinitely.
USE Recovery;
GO
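-- Covering index on Recovery..Tags: lets the engine seek on iID ranges instead of scanning the whole table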
CREATE INDEX NCI_iID
ON Tags (iID)
INCLUDE (
DT
,RepID
,tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,value
,Deleted
,sKey
);
GO
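-- Narrow index on Reps so the join only needs to read RepID and RepType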
CREATE INDEX NCI_RepID ON Reps (RepID) INCLUDE (RepType);
USE R3;
GO
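-- Index on the target R3..Tags so the NOT EXISTS check in the loop below stays cheap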
CREATE INDEX NCI_iID ON Tags (iID);
GO
DECLARE @RowsToProcess BIGINT
,@StepIncrement INT = 1000000;
SELECT @RowsToProcess = (
SELECT COUNT(1)
FROM Recovery..tags AS T
WHERE NOT EXISTS (
SELECT 1
FROM R3..Tags AS rt
WHERE T.iID = rt.iID
)
);
WHILE @RowsToProcess > 0
BEGIN
INSERT INTO R3..Tags
(
iID
,DT
,RepID
,Tag
,xmiID
,iBegin
,iEnd
,Confidence
,Polarity
,Uncertainty
,Conditional
,Generic
,HistoryOf
,CodingScheme
,Code
,CUI
,TUI
,PreferredText
,ValueBegin
,ValueEnd
,Value
,Deleted
,sKey
,RepType
)
SELECT TOP (@StepIncrement)
T.iID
,T.DT
,T.RepID
,T.Tag
,T.xmiID
,T.iBegin
,T.iEnd
,T.Confidence
,T.Polarity
,T.Uncertainty
,T.Conditional
,T.Generic
,T.HistoryOf
,T.CodingScheme
,T.Code
,T.CUI
,T.TUI
,T.PreferredText
,T.ValueBegin
,T.ValueEnd
,T.Value
,T.Deleted
,T.sKey
,R.RepType
FROM Recovery..tags AS T
INNER JOIN Recovery..Reps AS R ON T.RepID = R.RepID
WHERE NOT EXISTS (
SELECT 1
FROM R3..Tags AS rt
WHERE T.iID = rt.iID
)
ORDER BY
T.iID;
SET @RowsToProcess = @RowsToProcess - @StepIncrement;
END;
We have SAP HANA 1.0 SP11. We have a requirement to calculate the current stock at store and material level on a daily basis. The number of rows expected is around 250 million.
Currently we use a procedure for this. The flow of the procedure is as follows:
begin
    declare v_cnt bigint;
    declare v_loop integer;
    declare v_offset bigint := 0;

    t_rst = select * from <LOGIC of deriving current stock on tables MARD,MARC,MBEW>;
    select count(*) into v_cnt from :t_rst;
    v_loop := :v_cnt / 2500000;

    for x in 0 .. :v_loop do
        insert into CRRENT_STOCK_TABLE
        select * from :t_rst limit 2500000 offset :v_offset;
        commit;
        v_offset := :v_offset + 2500000;
    end for;
end;
The row count of the result set t_rst is around 250 million.
The total execution time of the procedure is around 2.5 hours. A few times the procedure has gone into a long-running state, resulting in an error. We run this procedure during non-peak business hours, so the load on the system is almost nothing.
Is there a way we can load the data into the target table in parallel threads and reduce the loading time? Also, is there a way to bulk insert efficiently in HANA?
The query for t_rst fetches the first 1000 rows in 5 minutes.
As Lars mentioned, the total resource usage will not change much.
But if you have a limited time window (non-peak hours), and if the system configuration can meet the requirements of parallel execution, maybe you can try using
BEGIN PARALLEL EXECUTION
<stmt>
END;
Please refer to the reference documentation.
After you calculate the v_loop value, you know how many times you have to run the following INSERT command:
INSERT INTO CRRENT_STOCK_TABLE
SELECT * FROM :t_rst LIMIT 2500000 OFFSET :v_offset;
I'm not sure how to convert the above code into a dynamic calculation for PARALLEL EXECUTION.
But you can assume, say, 10 parallel processes, and run that many INSERT commands, modifying the OFFSET clause according to the calculated values.
Any statements whose OFFSET goes past the end of the result set will simply insert zero rows, which will not harm the overall process.
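For illustration only, with a hard-coded chunk size of 2,500,000 rows and four streams, the idea might look like the sketch below (note the restrictions quoted further down - statements in a parallel block may not modify the same table, so this exact form may be rejected):

BEGIN PARALLEL EXECUTION
    INSERT INTO CRRENT_STOCK_TABLE SELECT * FROM :t_rst LIMIT 2500000 OFFSET 0;
    INSERT INTO CRRENT_STOCK_TABLE SELECT * FROM :t_rst LIMIT 2500000 OFFSET 2500000;
    INSERT INTO CRRENT_STOCK_TABLE SELECT * FROM :t_rst LIMIT 2500000 OFFSET 5000000;
    INSERT INTO CRRENT_STOCK_TABLE SELECT * FROM :t_rst LIMIT 2500000 OFFSET 7500000;
END;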
In response to @LarsBr.: as he mentioned, there are limitations that will prevent parallel execution.
Restrictions and Limitations
The following restrictions apply:
Modification of tables with a foreign key or triggers are not allowed
Updating the same table in different statements is not allowed
Only concurrent reads on one table are allowed. Implicit SELECT and SELECT INTO scalar variable statements are supported.
Calling procedures containing dynamic SQL (for example, EXEC, EXECUTE IMMEDIATE) is not supported in parallel blocks
Mixing read-only procedure calls and read-write procedure calls in a parallel block is not allowed.
These limitations mean that inserting into the same table from different statements is not possible, and dynamic SQL cannot be used either.
I have more than 10 million records in a table.
SELECT * FROM tbl ORDER BY datecol DESC
LIMIT 10
OFFSET 999990
Output of EXPLAIN ANALYZE on explain.depesz.com.
Executing the above query takes about 10 seconds. How can I make this faster?
Update
The execution time is reduced by half by using a subquery:
SELECT * FROM tbl where id in
(SELECT id FROM tbl ORDER BY datecol DESC LIMIT 10 OFFSET 999990)
Output of EXPLAIN ANALYZE on explain.depesz.com.
You need to create an index on the column used in ORDER BY. Ideally in the same sort order, but PostgreSQL can scan indexes backwards at almost the same speed.
CREATE INDEX tbl_datecol_idx ON tbl (datecol DESC);
More about indexes and CREATE INDEX in the current manual.
Test with EXPLAIN ANALYZE to get actual times in addition to the query plan.
Of course all the usual advice for performance optimization applies, too.
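For example, after creating the index you can re-check the original query:

EXPLAIN ANALYZE
SELECT * FROM tbl
ORDER BY datecol DESC
LIMIT 10 OFFSET 999990;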
I was trying to do something similar myself with a very large table (>100m records) and found that using OFFSET / LIMIT was killing performance.
OFFSET to reach the first 10m records (with LIMIT 1) took about 1.5 minutes, and it kept growing the further in I went.
By record 50m I was up to 3 minutes per select - even when using sub-queries.
I came across a post here which details useful alternatives.
I modified this slightly to suit my needs and came up with a method that gave me pretty quick results.
CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY [VALUE-You-need]), [VALUE-You-need]
FROM [your-table-name];
This was a once-off - it took about 4 minutes, but I then had all the values I wanted.
Next was to create a function that would loop through the "offsets" I needed:
create or replace function GetOffsets()
returns void as $$
declare
    -- For this part of the function I only wanted values after 90 million up to 120 million
    counter bigint := 90000000;
    maxRows bigint := 120000000;
begin
    drop table if exists OffsetValues;
    create temp table OffsetValues
    (
        offset_myValue bigint
    );
    while counter <= maxRows loop
        insert into OffsetValues (offset_myValue)
        select [VALUE-You-need]
        from just_index
        where row_number > counter
        order by row_number  -- make sure we pick the value at this exact offset
        limit 1;
        -- here I'm looping every 500,000 records - this is my 'Offset'
        counter := counter + 500000;
    end loop;
end;
$$ LANGUAGE plpgsql;
Then run the function:
select GetOffsets();
Again, a once-off amount of time (getting one of my offset values went from ~3 minutes to about 3 milliseconds).
Then select from the temp-table:
select * from OffsetValues;
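You can then use two consecutive boundary values as a range filter instead of a large OFFSET. A sketch of my own (not from the original post), assuming [VALUE-You-need] is the unique column you order by; :lower and :upper stand for two consecutive values read from OffsetValues:

-- Illustrative only: fetch one 500,000-row chunk by bounding on two
-- consecutive boundary values instead of using OFFSET
SELECT *
FROM [your-table-name]
WHERE [VALUE-You-need] > :lower
  AND [VALUE-You-need] <= :upper
ORDER BY [VALUE-You-need];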
This worked really well for me in terms of performance - I don't think I'll be using OFFSET going forward if I can help it.
Hope this improves performance for any of your larger tables.