Select million+ records while huge insert is running

Select million+ records while huge insert is running - sql

I am trying to extract application log file from a single table. The select query statement is pretty straightforward.
select top 200000 *
from dbo.transactionlog
where rowid>7
and rowid <700000 and
Project='AmWINS'
The query time for above select is above 5 mins. Is it considered long? While the select is running, the bulk insertion is also running.
[EDIT]
Actually, I am having serious problem on my current Production logging database,
Basically, we only have one table (transactionlog). all the application log will be insert into this table. For Project like AmWINS, base on select count result, we have about 800K++ records inserted per day. The insertion of record are running 24 hours daily in Production environment. User would like to extract data from the table if user want to check the transaction logs. Therefore, we need to select the records out from the table if necessary.
I tried to simulate on UAT enviroment to pump in the volumn as per Production which already grow up to 10millions records until today. and while i try to extract records, at the same time, I simulate with a bulk insertion to make it look like as per production environment. It took like 5 mins just to extract 200k records.
During the extraction running, I monitor on the SQL phyiscal server CPU is spike up to 95%.
the tables have 13 fields and a identity turn on(rowid) with bigint. rowid is the PK.
Indexes are create on Date, Project, module and RefNumber.
the tables are created on rowlock and pagelock enabled.
I am using SQL server 2005.
Hope you guys can give me some professional advices to enlighten me. Thanks.

It may be possible for you to use the "Nolock" table hint, as described here:
Table Hints MSDN
Your SQL would become something like this:
select top 200000 * from dbo.transactionlog with (no lock) ...
This would achieve better performance if you aren't concerned about the complete accuracy of the data returned.

What are you doing with the 200,000 rows? Are you running this over a network? Depending on the width of your table, just getting that amount of data across the network could be the bulk of the time spent.

It depends on your hardware. Pulling 200000 rows out while there is data being inserted requires some serious IO, so unless you have a 30+disk system, it will be slow.
Also, is your rowID column indexed? This will help with the select, but could slow down the bulk insert.

I am not sure, but doesn't bulk insert in MS SQL lock the whole table?

As ck already said. Indexing is important. So make sure you have an appropriate index ready. I would not only set an index on rowId but also on Project. Also I would rewrite the where-clause to:
WHERE Project = 'AmWINS' AND rowid BETWEEN 8 AND 699999
Reason: I guess Project is more restrictive than rowid and - correct me, if I'm wrong - BETWEEN is faster than a < and > comparison.

You could also export this as a local dat or sql file.

No amount of indexing will help here because it's a SELECT * query so it's most likely a PK scan or an horrendous bookup lookup
And the TOP is meaningless because there is no ORDER BY.
The simultaneous insert is probably misleading as far as I can tell, unless the table only has 2 columns and the bulk insert is locking the whole table. With a simple int IDENTITY column the insert and select may not interfere with each other too.
Especially if the bulk insert is only a few 1000s of rows (or even 10,000s)
Edit. The TOP and rowid values do not imply a million plus

Related

limit to insert rows into a table (Oracle)

In Oracle pl/sql, I have join few tables and insert into another table, which would result in Thousands/Lakhs or it could be in millions. Can insert as
insert into tableA
select * from tableB;
Will there be any chance of failure because of number of rows ?
Or is there a better way to insert values in case of more no of records.
Thanks in Advance

Well, everything is finite inside the machine, so if that select returns too many rows, it for sure won't work (although there must be maaany rows, the number is dependent on your storage and memory size, your OS, and maybe other things).
If you think your query can surpass the limit, then do the insertion in batches, and commit after each batch. Of course you need to be aware you must do something if at 50% of the inserts you decide you need to cancel the process (as a rollback will not be useful here).

My recommended steps are different because performance typically increases when you load more data in one SQL statement using SQL or PL/SQL:
I would recommend checking the size of your rollback segment (RBS segment) and possibly bring online a larger dedicated one for such transaction.
For inserts, you can say something like 'rollback consumed' = 'amount of data inserted'. You know the typical row width from the database statistics (see user_tables after analyze table tableB compute statistics for table for all columns for all indexes).
Determine how many rows you can insert per iteration.
Insert this amount of data in big insert and commit.
Repeat.
Locking normally is not an issue with insert, since what does not yet exist can't be locked :-)
When running on partitioned tables, you might want to consider different scenarios allowing the (sub)partitions to distribute the work together. When using SQL*Loader by loading from text files, you might use different approach too, such as direct path which adds preformatted data blocks to the database without the SQL engine instead of letting the RDBMS handle the SQL.

To create limited number of rows you can use ROW_NUM which is a pseudo column .
for example to create table with 10,000 rows from another table having 50,000 rows you can use.
insert into new_table_name select * from old_table_name where row_num<10000;

How can I efficiently manipulate 500k records in SQL Server 2005?

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often am running into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from temporary table (to remove last month's data) (eg. DELETE * FROM tempTable)
Rip text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id) and these commands frequently time out. I have tried bumping the timeouts up to as far as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to be able to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong with how I am going about processing this data?
Please help. Any and all suggestions welcome.

subqueries like the one you give us in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
are only good on one row at a time, so you must be looping. Think set based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
inner join myUsers u ON t.external_id=u.external_id
and remove your loops, this will update all rows in one statement and be significantly faster!

Needs more information. I am manipulating 3-4 million rows in a 150 million row table regularly and I am NOT thinking this is a lot of data. I have a "products" table that contains about 8 million entries - includign full text search. No problems either.
Can you just elaborte on your hardware? I assume "normal desktop PC" or "low end server", both with absolutely non-optimal disc layout, and thus tons of IO problems - on updates.

Make sure you have indexes on your tables that you are doing the selects from. In your example UPDATE command, you select the user_id from the myUsers table. Do you have an index with the user_id column on the myUsers table? The downside of indexes is that they increase time for inserts/updates. Make sure you don't have indexes on the tables you are trying to update. If the tables you are trying to update do have indexes, consider dropping them and then rebuilding them after your import.
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.

Look at the KM's answer and don't forget about indexes and primary keys.

Are you indexing your temp table after importing the data?
temp_table.external_id should definitely have an index since it is in the where clause.

There are more efficient ways of importing large blocks of data. Look in SQL Books Online under BCP (Bulk Copy Protocol.)

SQL Server 2005 efficient delete

My issue at hand is that I need to remove about 60M records from a table without causing deadlocks with other processes that use this table. At this point I'm almost done removing the records using a while loop that only processes about 1M records at a time however it's taken all day.
Q1: What is the optimal way to remove large quantities of data from a table, keeping the table online and minimal impact to other resources that need to use this table in MS SQL Server 2005?
Q2: Is there a way to implement the individual row locking (rather than table locking) in SQL Server like they have in Oracle? (Note answering this may answer Q1).
A2: So as #Remus Rusanu informed me there is a way to do row level locking with a delete.

See this thread, the original poster actually did some tests and posted the most efficient method. An MVP did originally chime in with an option to actually insert the data you want to retain into a temp table and then truncate the original table and reinsert.

I just did something like that recently. I simply created a SQL Server job that ran every 10 mins deleting a million rows. Code follows
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DELETE TOP 1000000 FROM BIG_TABLE WHERE CreatedDate <= '20080630'
Said table in question had about 900 mil rows to start with. Didn't notice any significant performance issues.

The most efficient way is to use partition switching, see Transferring Data Efficiently by Using Partition Switching. The drawback is that it requires planning ahead in how the partitions are deployed.
If partition switching is not available then the answer depends on the actual table schema. You better post the actual schema (including all indexes and most importantly the clustered key definiton) and the criteria that qualifies the delete candidates.
As for Q2, SQL Server had row level locking since mid 90s, I don't know what you're actually asking.

How to quickly duplicate rows in SQL

Edit: Im running SQL Server 2008
I have about 400,000 rows in my table. I would like to duplicate these rows until my table has 160 million rows or so. I have been using an statement like this:
INSERT INTO [DB].[dbo].[Sales]
([TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate])
SELECT [TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate]
FROM [DB].[dbo].[Sales]
This process is very slow. and i have to re-issue the query some large number of times Is there a better way to do this?

To do this many inserts you will want to disable all indexes and constraints (including foreign keys) and then run a series of:
INSERT INTO mytable
SELECT fields FROM mytable
If you need to specify ID, pick some number like 80,000,000 and include in the SELECT list ID+80000000. Run as many times as necessary (no more than 10 since it should double each time).
Also, don't run within a transaction. The overhead of doing so over such a huge dataset will be enormous. You'll probably run out of resources (rollback segments or whatever your database uses) anyway.
Then re-enable all the constraints and indexes. This will take a long time but overall it will be quicker than adding to indexes and checking constraints on a per-row basis.

Since each time you run that command it will double the size of your table, you would only need to run it about 9 times (400,000 * 29 = 204,800,000). Yes, it might take a while because copying that much data takes some time.

The speed of the insert will depend on a number of things...the physical disk speed, indexes, etc. I would recommend removing all indexes from the table and adding them back when you're done. If the table is heavily indexed then that should help quite a bit.
You should be able to repeatedly run that query in a loop until the desired number of rows is achieved. Every time you run it you'll double the data, so you'll end up with:
400,000
800,000
1,600,000
3,200,000
6,400,000
12,800,000
25,600,000
51,200,000
102,400,000
204,800,000
After nine executions.

You don't state your SQL database, but most have a bulk loading tool to handle this scenario. Check the docs. If you have to do it with INSERTs, remove all indexes from the table first and reapply them after the data is INSERTed; this will generally be much faster than indexing during insertion.

this may still take a while to run... you might want to turn off logging while you create your data.
INSERT INTO [DB].[dbo].[Sales] (
[TotalCost] ,[SalesAmount] ,[ETLLoadID]
,[LoadDate] ,[UpdateDate]
)
SELECT s.[TotalCost] ,s.[SalesAmount] ,s.[ETLLoadID]
,s.[LoadDate] ,s.[UpdateDate]
FROM [DB].[dbo].[Sales] s (NOLOCK)
CROSS JOIN (SELECT TOP 400 totalcost FROM [DB].[dbo].[Sales] (NOLOCK)) o

Oracle SQL technique to avoid filling trans log

Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero,
Is this the best approach?
Thanks,

The amount of updates to the redo and undo logs will not at all be reduced if you break up the UPDATE in multiple runs of, say 1000 records. On top of it, the total query time will be most likely be higher compared to running a single large SQL.
There's no real way to address the UNDO/REDO log issue in UPDATEs. With INSERTs and CREATE TABLEs you can use a DIRECT aka APPEND option, but I guess this doesn't easily work for you.

Depends on the percent of rows almost as much as the number. And it also depends on if the update makes the row longer than before. i.e. going from null to 200bytes in every row. This could have an effect on your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint contrainst, rebuild triggers, recompile packages, etc.
you can avoid a lot of logging this way.

Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then doing a rebuild of those indexes with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/ validated as a result of the update, getting rid of those temporarily might be helpful.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas