Oracle process getting slower over time - SQL

I am using an Oracle database, and what I do is:
1. Take one record from table A (table A has a column P; let's say its values are x, y, z).
2. Put that record into table B, C, or D according to the value x, y, or z (if P=x then put the record into table B, if P=y then put it into table C, and so on).
3. Delete that record from A which we inserted into B, C, or D.
Note: table A has about 200 million rows, B has about 170 million, C about 20 million, and D about 10 million. Only A shrinks over time; the others stay the same size (if a certain parameter of an A record is negative, the record is not inserted into B, C, or D because it already exists in those tables, so it is just deleted from A). So there is no size change for B, C, or D; only A decreases over time.
The problem is that at the beginning everything works nicely, but over time it becomes extremely slow. It starts at roughly 40 insert+delete operations per second, but eventually drops to one insert+delete every 3 seconds.
All tables have indexes on the corresponding columns.
Parallel runs exist, but there is no locking.
Table sizes are approximately 60 million records.
If there is no locking and no table growth, what else could make it slow down over time?
Note: these are not different processes; within the same process, I click "execute query" and it starts very fast but then becomes extremely slow.

Reading 200 million records from a staging table and inserting them into permanent tables in a single transaction is ambitious. It would be useful if you had a scheme for dividing the records from table A into chunks that could be processed discretely.
Without seeing your code it's hard to tell, but I suspect you are attempting this row by row (RBAR) rather than with a more efficient set-based approach. I think the key here is to de-couple the insertions from clearing down table A. Insert all the records, then zap A at your leisure. Something like this:
insert all
when p = 'X' then into b
when p = 'Y' then into c
when p = 'Z' then into d
select * from a;
truncate table a;
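If truncating A is not an option in your environment, a hedged PL/SQL sketch of the chunked clean-up idea mentioned above (delete in batches, committing between chunks; the batch size is illustrative):
begin
  loop
    delete from a
     where rownum <= 20000;    -- illustrative batch size
    exit when sql%rowcount = 0;
    commit;                    -- keeps undo small between batches
  end loop;
end;
/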

Related

How to improve SQL query in Spark when updating table? ('NOT IN' in subquery)

I have a DataFrame in Spark which is registered as a table called A and has 1 billion records and 10 columns. The first column (ID) is the primary key.
I also have another DataFrame which is registered as a table called B and has 10,000 records and 10 columns (the same columns as table A; the first column (ID) is again the primary key).
Records in table B are 'update records', so I need to update all 10,000 corresponding records in table A with the records from table B.
I tried first with this SQL query:
select * from A where ID not in (select ID from B), and then UNION that result with table B. The approach works, but the first query (select * from A where ID not in (select ID from B)) is extremely slow (hours on a moderate cluster).
Then I tried to speed up first query with LEFT JOIN:
select A.* from A left join B on (A.ID = B.ID ) where B.ID is null
That approach seems fine logically, but it takes WAY too much memory for the Spark containers (YARN kills them for exceeding memory limits: "5.6 GB of 5.5 GB physical memory used. Consider boosting spark.yarn.executor.memory").
What would be a better/faster/less memory consumption approach?
I would go with the left join too, rather than not in.
A couple of pieces of advice to reduce memory requirements and improve performance:
Check that the large table is uniformly distributed by the join key (ID). If it is not, some tasks will be heavily burdened and some lightly busy, which causes serious slowness. Do a groupBy on ID with a count to measure this.
If the join key is naturally skewed, then add more columns to the join condition while keeping the result the same. More columns may increase the chance of shuffling the data uniformly. This is a little hard to achieve.
Memory demand depends on the number of parallel tasks running and the volume of data per task being executed in an executor. Reducing either or both will reduce memory pressure and obviously run slower, but that is better than crashing. I would reduce the volume of data per task by creating more partitions on the data. Say you have 10 partitions for 1B rows; then make it 200 to reduce the volume per task. Use repartition on table A. Don't create too many partitions, because that will cause inefficiency; 10K partitions may be a bad idea.
There are some parameters that can be tweaked, which are explained here.
The small table with 10K rows should be broadcast automatically because it is small. If not, you can increase the broadcast limit and apply a broadcast hint, as in the sketch after this list.
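A minimal Spark SQL sketch of the broadcast suggestion, assuming the tables are registered as A and B; the 100 MB threshold is an illustrative value, not a recommendation:
-- raise the auto-broadcast limit (in bytes); 100 MB here is illustrative
SET spark.sql.autoBroadcastJoinThreshold = 104857600;

-- anti-join that keeps only rows of A with no match in B,
-- with an explicit hint to broadcast the small table
SELECT /*+ BROADCAST(B) */ A.*
FROM A
LEFT JOIN B ON A.ID = B.ID
WHERE B.ID IS NULL;
Repartitioning table A onto more partitions (the advice above) would be done on the DataFrame with repartition before registering it as a table.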

Slow performance with small table after extreme reduction of size

I have a table with approximately 10 million rows, with the id column being the primary key.
Then I delete all rows where id > 10. Only 10 rows remain in the table.
Now, when I run the query SELECT id FROM tablename, execution time is approximately 1.2 - 1.5 seconds.
But SELECT id FROM tablename where id = x only takes 10 - 11 milliseconds.
Why is the first SELECT so slow for just 10 rows?
The main reason is the MVCC model of Postgres: deleted rows are kept until the system can be sure that the transaction will not be rolled back and that the dead rows are no longer visible to any concurrent transaction. Only then can the dead rows be physically removed by VACUUM - or, more radically, VACUUM FULL.
Related:
VACUUM returning disk space to operating system
Your simple query SELECT id FROM tablename - if run immediately after the DELETE and before autovacuum can kick in - still finds 10 million row versions and has to check their visibility, only to rule out most of them.
Your second query SELECT id FROM tablename where id = x can use the primary key index and only needs to read a single data page from the (formerly) big table. That kind of query is largely immune to the total size of the table either way.
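If waiting for autovacuum is not an option and a brief exclusive lock is acceptable, a minimal sketch of reclaiming the space right away (tablename standing in for the real table):
VACUUM FULL tablename;   -- rewrites the table, discarding the ~10 million dead row versions
ANALYZE tablename;       -- refreshes planner statistics for the now-tiny table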
There may be a (much) more efficient way to delete almost all 10 million rows:
Best way to delete millions of rows by ID
Copying timestamp columns within a Postgres table
What causes large INSERT to slow down and disk usage to explode?

Cardinality estimation when the foreign key is limited to a small subset

Recently I have been optimizing the performance of a large-scale ERP package.
One of the performance issues I haven't been able to solve involves bad cardinality estimation for a foreign key which is limited to a very small subset of records from a large table.
Table A holds 3 million records and has a type field.
Table B holds 7 million records and has a foreign key FK to table A.
The foreign key will only ever contain primary keys from table A with a certain type; only 36 of the 3 million records in table A have this type.
B JOIN A ON B.FK = A.PK AND A.TYPE = X AND A.Name = Y
Now, using the correct statistics, SQL Server knows table A will only return 1 row,
but it estimates that only 2 records will be returned from table B (my guess is 7 million / 3 million), while actually 930,000 records are returned.
This results in a slow query plan being used.
The real query is more complex but the cause of the bad query plan is because of this simplified statement.
Our DB does have accurate statistics for the FK (the histogram shows EQ_ROWS for every distinct value of this FK), and filtering on a fixed FK value does result in accurate estimates.
Is there any way to show SQL Server that this FK can only hold a small number of distinct values, or to help it with the estimates in any other way?
If we had a chance we would split up the table and put these 36 records in a separate table but unfortunately this is how the ERP system works.
Extra info:
We are using SQL 2014.
The ERP system is Dynamics AX 2012 R3
Using trace flag 9481 does help (not perfect, but a lot better), but unfortunately we cannot set trace flags for individual queries with Dynamics AX.
I've encountered these kinds of problems before, and have often found that I can dramatically reduce total run time for a stored proc or script by pulling those 'very few relevant rows' from a large table into a small temp table and then joining that temp table into the main query later. Or using CTE queries to isolate the few needed rows. A little experimentation should quickly tell you if this has potential in your case.
Look at the query plan.
Clearly you want it to filter on TYPE early.
It is probably doing a loop join:
FROM B
JOIN A
ON B.FK = A.PK
AND A.TYPE = X
AND A.Name = Y
Try the various join hints.
Next would be to create a #temp table and join to it.
Declare a PK on your temp table.
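A hedged T-SQL sketch of that temp-table idea (assuming @X and @Y hold the filter values; the names are illustrative):
-- isolate the handful of relevant rows from the big table first
SELECT A.PK, A.TYPE, A.Name
INTO #a_subset
FROM A
WHERE A.TYPE = @X
  AND A.Name = @Y;

-- a primary key on the temp table gives the optimizer an accurate, tiny cardinality
ALTER TABLE #a_subset ADD CONSTRAINT PK_a_subset PRIMARY KEY (PK);

-- join the large child table to the temp table instead of to A directly
SELECT B.*
FROM B
JOIN #a_subset AS s
    ON B.FK = s.PK;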

Limit number of rows - TSQL - Merge - SQL Server 2008

Hi all, I have the following MERGE SQL script which works fine for a relatively small number of rows (up to about 20,000, I've found). However, sometimes the data I have in table B can be up to 100,000 rows, and I'm trying to merge this with table A (which currently has 60 million rows). This takes quite a while to process, which is understandable as it has to merge 100,000 rows with 60 million existing records!
I was just wondering if there is a better way to do this. Or is it possible to have some sort of count, so I merge 20,000 rows from table B into table A, then delete those merged rows from table B, then do the next 20,000 rows, and so on, until table B has no rows left?
Script:
MERGE
    [Table A] AS [target]
USING
    [Table B] AS [source]
ON
    ([target].recordID = [source].recordID)
WHEN NOT MATCHED BY TARGET
THEN
    INSERT ([recordID], [Field 1], [Field 2], [Field 3], [Field 4], [Field 5])
    VALUES ([source].[recordID], [source].[Field 1], [source].[Field 2],
            [source].[Field 3], [source].[Field 4], [source].[Field 5]);
MERGE is overkill for this since all you want is to INSERT missing values.
Try:
INSERT INTO Table_A
    ([recordID], [Field 1], [Field 2], [Field 3], [Field 4], [Field 5])
SELECT B.[recordID],
       B.[Field 1], B.[Field 2], B.[Field 3], B.[Field 4], B.[Field 5]
FROM Table_B AS B
WHERE NOT EXISTS (SELECT 1 FROM Table_A A
                  WHERE A.recordID = B.recordID)
In my experience MERGE can perform worse for simple operations like this. I try to reserve it for when you need varying operations depending on conditions, like an UPSERT.
You can definitely do (SELECT TOP 20000 * FROM B ORDER BY [some_column]) AS [source] in the USING clause and then delete those records after the MERGE. So your pseudo-code will look like this (see the sketch after the list):
1. Merge the top 20,000 rows.
2. Delete those 20,000 records from the source table.
3. Check @@ROWCOUNT. If it's 0, exit; otherwise go to step 1.
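A hedged T-SQL sketch of that loop, using the Table_A/Table_B names from the other answer; the batch size and ordering column are illustrative:
DECLARE @batch INT = 20000;

WHILE 1 = 1
BEGIN
    MERGE Table_A AS [target]
    USING (SELECT TOP (@batch) * FROM Table_B ORDER BY recordID) AS [source]
        ON [target].recordID = [source].recordID
    WHEN NOT MATCHED BY TARGET THEN
        INSERT ([recordID], [Field 1], [Field 2], [Field 3], [Field 4], [Field 5])
        VALUES ([source].[recordID], [source].[Field 1], [source].[Field 2],
                [source].[Field 3], [source].[Field 4], [source].[Field 5]);

    -- remove the batch that was just processed from the source table
    DELETE B
    FROM Table_B AS B
    WHERE B.recordID IN (SELECT TOP (@batch) recordID FROM Table_B ORDER BY recordID);

    IF @@ROWCOUNT = 0 BREAK;   -- Table_B is empty, we're done
END;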
I'm not sure if it runs any faster than merging all the records at the same time.
Also, are you sure you need MERGE? From what I see in your code INSERT INTO ... SELECT should also work for you.

Better Import Performance on Large Data Sets

I have a few jobs that insert large data sets from a text file. The data is loaded via .NET's SqlBulkCopy.
Currently, I load all the data into a temp table and then insert it into the production table. This was an improvement over importing straight into production; the resulting T-SQL insert was a lot faster. Data is only loaded via this method; there are no other inserts or deletes.
However, I'm back to timeouts because of locks while the job is executing. The job consists of the following steps:
load data into temp table
start transaction
delete current and future dated rows
insert from temp table
commit
This happens once every hour. This portion takes 70 seconds; I need to get that down to the smallest number possible.
The production table has about 20 million records and each import is about 70K rows. The table is not accessed at night, so I use that time to do all required maintenance (rebuilding stats, indexes, etc.). Of the 70K rows added, ~4K are kept from day to day - that is, the table grows by about 4K rows a day.
I'm thinking a 2 part solution:
The job will turn into a copy/rename job: I insert all current data into the temp table, create stats & indexes, rename the tables, and drop the old table.
Create a history table to break out older data. The "current" table would hold a rolling 6 months of data, about 990K records. This would make the delete/insert table smaller and [hopefully] more performant. I would prefer not to do this; the table is well designed with the right indexes and queries are plenty fast. But eventually it might be required.
Edit: Using Windows 2003, SQL Server 2008
Thoughts? Other suggestions?
Well, one really sneaky way is to rename the current table to TableA and set up a second table, TableB, with the same structure and the same data. Then set up a view with the same name and the exact fields as the original table. Now all your existing code will use the view instead of the table. The view starts out looking at TableA.
In your load process, load into TableB, then refresh the view definition, changing it to look at TableB. Your users are down for less than a second. Then load the same data into TableA, and store somewhere in a database table which table you should start with next time. Next time, load first into TableA, change the view to point to TableA, then reload TableB.
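A hedged sketch of the view swap (object names are illustrative; a real view should list the columns explicitly rather than use *):
-- readers always query the view, never the tables directly
CREATE VIEW dbo.ProductionData AS SELECT * FROM dbo.TableA;

-- after bulk loading the idle table (TableB), repoint the view
ALTER VIEW dbo.ProductionData AS SELECT * FROM dbo.TableB;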
The answer should be that the queries that read from this table use READ UNCOMMITTED, since your data load is the only place that changes data. With READ UNCOMMITTED, the SELECT queries won't take locks.
http://msdn.microsoft.com/en-us/library/ms173763.aspx
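For example (a minimal sketch; the table and column names are illustrative):
-- per session
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT recordID, [Field 1] FROM dbo.ProductionTable;

-- or per query, via a table hint
SELECT recordID, [Field 1] FROM dbo.ProductionTable WITH (NOLOCK);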
You should look into using partitioned tables. The basic idea is that you can load all of your new data into a new table, then switch that table into the original as a new partition. This is orders of magnitude faster than inserting into the existing table.
Later on, you can merge multiple partitions into a single larger partition.
More information here: http://msdn.microsoft.com/en-us/library/ms191160.aspx
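A hedged sketch of the switch itself (it assumes the staging table matches the partitioned table's columns, indexes, constraints, and filegroup, and that the target partition is empty; the names and partition number are illustrative):
-- instantly moves the freshly loaded rows into the partitioned production table
ALTER TABLE dbo.StagingTable
    SWITCH TO dbo.ProductionTable PARTITION 5;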
Get better hardware. Using 3 threads and 35,000-item batches, I import around 90,000 items per second with this approach.
Sorry, but at some point hardware decides insert speed. Important: SSDs for the logs, mirrored ;)
Another trick you could use is to have a delta table for the updates. You'd have two tables with a view over them to merge them. One table, TableBig, will hold the old data; the second table, TableDelta, will hold the deltas that you add to the rows in TableBig.
You create a view over them that adds them up. A simple example:
For instance, your old data in TableBig (20M rows, lots of indexes, etc.):
ID Col1 Col2
123 5 10
124 6 20
And you want to update 1 row and add a new one so that afterwards the table looks like this:
ID Col1 Col2
123 5 10 -- unchanged
124 8 30 -- this row is updated
125 9 60 -- this one added
Then in the TableDelta you insert these two rows:
ID Col1 Col2
124 2 10 -- these will be added to give the right number
125 9 60 -- this row is new
and the view is
select ID,
       sum(col1) as col1, -- the old value and the delta added together give the correct value
       sum(col2) as col2
from (
    select id, col1, col2 from TableBig
    union all
    select id, col1, col2 from TableDelta
) as t                    -- the derived table needs an alias
group by ID
At night you can merge TableDelta into TableBig and index etc.
This way you can leave the big table alone completely during the day, and TableDelta will not have many rows, so overall query performance is very good. Getting the data from TableBig benefits from indexing, getting rows from TableDelta is no issue because it's small, and summing them up is cheap compared to looking for data on disk. Pumping data into TableDelta is very cheap because you can just insert at the end.
Regards Gert-Jan
Update for text columns:
You could try something similar with two tables, but instead of adding, you would substitute. Like this:
Select isnull(o.ID, b.ID) as ID,
       isnull(o.Col1, b.Col1) as Col1,   -- values from the overwrite table take precedence
       isnull(o.Col2, b.Col2) as Col2
From TableBig b
full join TableOverWrite o on b.ID = o.ID
The basic idea is the same: a big table with indexes and a small table for updates that doesn't need them.
Your solution seems sound to me. Another alternative to try would be to create the final data in a temp table, all outside of a transaction, and then inside the transaction truncate the target table and load it from the temp table... something along those lines might be worth trying too.
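A hedged sketch of that variant (assuming the prepared data sits in a temp table called #FinalData; in SQL Server, TRUNCATE can be rolled back inside a transaction):
-- #FinalData is built outside the transaction, so the exclusive lock is held only briefly
BEGIN TRAN;
    TRUNCATE TABLE dbo.ProductionTable;
    INSERT INTO dbo.ProductionTable
    SELECT * FROM #FinalData;
COMMIT;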