How can I efficiently manipulate 500k records in SQL Server 2005? - sql-server-2005

I am getting a large text file of updated information from a customer that contains updates for 500,000 users. However, as I am processing this file, I often am running into SQL Server timeout errors.
Here's the process I follow in my VB application that processes the data (in general):
Delete all records from the temporary table to remove last month's data (e.g. DELETE FROM tempTable)
Rip text file into the temp table
Fill in extra information into the temp table, such as their organization_id, their user_id, group_code, etc.
Update the data in the real tables based on the data computed in the temp table
The problem is that I often run commands like UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id), and these commands frequently time out. I have tried bumping the timeout up to as much as 10 minutes, but they still fail. Now, I realize that 500k rows is no small number of rows to manipulate, but I would think that a database purported to handle millions and millions of rows should be able to cope with 500k pretty easily. Am I doing something wrong in how I am going about processing this data?
Please help. Any and all suggestions welcome.

Correlated subqueries like the one you give in the question:
UPDATE tempTable SET user_id = (SELECT user_id FROM myUsers WHERE external_id = tempTable.external_id)
run the lookup once for every row, which is slow on large tables. Think set-based:
UPDATE t
SET user_id = u.user_id
FROM tempTable t
inner join myUsers u ON t.external_id=u.external_id
Remove your loops; this updates all rows in one statement and is significantly faster!
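If even the single set-based statement runs long enough to trip the client timeout, it can be broken into batches so each command commits quickly. A sketch using the table and column names from the question (the batch size is arbitrary, and it assumes user_id is NULL before the update):

```sql
-- Update in chunks of 50,000 rows until nothing is left to update.
DECLARE @rows INT
SET @rows = 1
WHILE @rows > 0
BEGIN
    UPDATE TOP (50000) t
    SET user_id = u.user_id
    FROM tempTable t
    INNER JOIN myUsers u ON t.external_id = u.external_id
    WHERE t.user_id IS NULL        -- only touch rows not yet updated
    SET @rows = @@ROWCOUNT
END
```

UPDATE TOP (n) is available from SQL Server 2005 on; each iteration is its own short transaction, so locks are held briefly and the log stays small.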

Needs more information. I am manipulating 3-4 million rows in a 150 million row table regularly and I am NOT thinking this is a lot of data. I have a "products" table that contains about 8 million entries - including full text search. No problems either.
Can you elaborate on your hardware? I assume "normal desktop PC" or "low end server", both with absolutely non-optimal disc layout, and thus tons of IO problems - on updates.

Make sure you have indexes on the tables you are doing the selects from. In your example UPDATE command, you look up the user_id in the myUsers table by external_id. Do you have an index on the external_id column of myUsers (the column in the WHERE clause)? The downside of indexes is that they increase the time for inserts/updates, so make sure you don't have unnecessary indexes on the tables you are trying to update. If the tables you are updating do have indexes, consider dropping them and then rebuilding them after your import.
Finally, run your queries in SQL Server Management Studio and have a look at the execution plan to see how the query is being executed. Look for things like table scans to see where you might be able to optimize.
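A sketch of indexes that would support that lookup (the index names are made up; INCLUDE requires SQL Server 2005 or later):

```sql
-- Index the column searched by the subquery, and cover the column
-- being returned so the lookup never has to touch the base table.
CREATE NONCLUSTERED INDEX IX_myUsers_external_id
    ON myUsers (external_id)
    INCLUDE (user_id)

-- The join column on the staging side helps as well:
CREATE NONCLUSTERED INDEX IX_tempTable_external_id
    ON tempTable (external_id)
```

If the temp table is reloaded each month, create its index after the text file is loaded, not before, so the load itself stays fast.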

Look at KM's answer and don't forget about indexes and primary keys.

Are you indexing your temp table after importing the data?
temp_table.external_id should definitely have an index since it is in the where clause.

There are more efficient ways of importing large blocks of data. Look in SQL Books Online under BCP (the Bulk Copy Program).
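From T-SQL the same bulk-load path is exposed as BULK INSERT; a sketch with a hypothetical file path and delimiters (adjust to the actual file layout):

```sql
-- Load the customer file straight into the staging table.
BULK INSERT tempTable
FROM 'C:\imports\customer_updates.txt'
WITH (
    FIELDTERMINATOR = '\t',   -- assumed tab-delimited
    ROWTERMINATOR   = '\n',
    TABLOCK                   -- table lock enables minimal logging
)
```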

Related

Update time of a big table in PostgreSQL

I have a question about the performance of an update on a big table that is around 8 to 10 GB in size.
I have a task where I'm supposed to detect distinct values from a table of the mentioned size, with about 4.3 million rows, and insert them into some table. That part is not really a problem; it's the update that follows afterwards. I need to update some column based on the id of the rows created in the table I imported into. An example of the query I'm executing is:
UPDATE billinglinesstagingaws as s
SET product_id = p.id
FROM product AS p
WHERE p.key=(s.data->'product'->>'sku')::varchar(75)||'-'||(s.data->'lineitem'->>'productcode')::varchar(75) and cloudplatform_id = 1
So as mentioned, the staging table is around 4.3 million rows and 8-10 GB and, as can be seen from the query, it has a JSONB field; the product table has around 1,500 rows.
This takes about 12 minutes, which I'm not really sure is OK, and I'm wondering what I can do to speed it up. There are no foreign key constraints; there is a unique constraint on two columns together. There are no indexes on the staging table.
I attached the query plan of the query, so any advice would be helpful. Thanks in advance.
This is a bit too long for a comment.
Updating 4.3 million rows in a table is going to take some time. Updates take time because the ACID properties of databases require that something be committed to disk for each update -- typically log records. And that doesn't count the time for reading the records, updating indexes, and other overhead.
So, roughly 6,000 updates per second (4.3 million rows in 12 minutes) isn't so bad.
There might be ways to speed up your query. However, you describe these as new rows. That makes me wonder if you can just insert the correct values when creating the table. Can you lookup the appropriate value during the insert, rather than doing so afterwards in an update?
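The suggestion can be sketched as resolving product_id at insert time instead of in a follow-up UPDATE. The source-table name below is hypothetical; the join expression is taken from the question:

```sql
-- Look up the product while inserting, so no second pass is needed.
INSERT INTO billinglinesstagingaws (data, cloudplatform_id, product_id)
SELECT r.data,
       1,
       p.id
FROM raw_import r              -- hypothetical source of the new rows
LEFT JOIN product p
       ON p.key = (r.data->'product'->>'sku')::varchar(75)
                  || '-' ||
                  (r.data->'lineitem'->>'productcode')::varchar(75);
```

The LEFT JOIN keeps rows with no matching product (product_id becomes NULL); use an INNER JOIN if such rows should be skipped instead.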

How to merge 500 million table with another 500 million table

I have to merge two 500M+ row tables.
What is the best method to merge them?
I just need to display the records from these two SQL-Server tables if somebody searches on my webpage.
These are fixed tables, no one will ever change data in these tables once they are live.
create a view myview as select * from table1 union select * from table2
Is there any harm using the above method?
If I start merging 500M rows it will run for days and if machine reboots it will make the database go into recovery mode, and then I have to start from the beginning again.
Why am I merging these tables?
I have a website which provides a search on the person table.
This table has columns like Name, Address, Age, etc.
We got 500 million similar .txt files which we loaded into some other table.
Now we want the website search page to query both tables to see if a person exists in the table.
We get similar .txt files of 100 million or 20 million rows, which we load into this huge table.
How are we currently doing it?
We import the .txt files into separate tables (some columns are different in the .txt)
Then we arrange the columns and do the data type conversions
Then insert this staging table into the liveCopy huge table (in the test environment)
We have SQL Server 2008 R2
Can we use table partitioning for performance benefits?
Is it OK to create monthly small tables and create a view on top of them?
How can indexing be done in this case?
We only load new data once a month and then do the selects
Does replication help?
The biggest issue I am facing is managing huge tables.
I hope I explained the situation.
Thanks & Regards
1) Usually developers, to get more performance, split large tables into smaller ones and call this partitioning (horizontal, to be more precise, because there is also vertical partitioning). Your view is an example of such partitions combined. Of course, it is mostly used to split a large amount of data into ranges of values (for example, table1 contains records with column [col1] < 0, while table2 has [col1] >= 0). But even for unsorted data it is OK, because you get more room for speed improvements - for example, parallel reads if the tables are placed on different storage. So this is a good choice.
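One point worth adding: the view in the question uses UNION, which sorts and de-duplicates the combined billion rows on every query. UNION ALL simply concatenates, and with CHECK constraints on the partitioning column the optimizer can skip tables that cannot match. A sketch (the id column and its ranges are illustrative):

```sql
-- Plain concatenation: no de-duplication step.
CREATE VIEW myview AS
SELECT * FROM table1
UNION ALL
SELECT * FROM table2

-- If each table holds a distinct key range, CHECK constraints turn this
-- into a partitioned view and enable partition elimination:
-- ALTER TABLE table1 ADD CONSTRAINT ck_t1 CHECK (id <  500000000)
-- ALTER TABLE table2 ADD CONSTRAINT ck_t2 CHECK (id >= 500000000)
```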
2) Another way is to use the MERGE statement, supported in SQL Server 2008 and higher - http://msdn.microsoft.com/en-us/library/bb510625(v=sql.100).aspx.
3) Of course you can copy using INSERT+DELETE, but in that case, or if you use MERGE, do it in small batches. Something like:
SET ROWCOUNT 10000
DECLARE @Count [int] = 1
WHILE @Count > 0 BEGIN
... INSERT+DELETE/MERGE transaction ...
SET @Count = @@ROWCOUNT
END
If your purpose is truly just to move the data from the two tables into one table, you will want to do it in batches - 100K records at a time, or something like that. I'd guess you crashed before because your transaction log got full, although that's just speculation. If you are in the Full recovery model, take a transaction log backup between batches (a CHECKPOINT only truncates the log in the Simple recovery model).
That said, I agree with all the comments that you should provide why you are doing this - it may not be necessary at all.
You may want to have a look at an Indexed View.
In this way, you can set up indexes on your view and get the best performance out of it. Note that an indexed view cannot contain UNION, so this only applies once the data is in a single table. The expensive part of using indexed views is in the CRUD operations - but for read performance it would be your best solution.
http://www.brentozar.com/archive/2013/11/what-you-can-and-cant-do-with-indexed-views/
https://www.simple-talk.com/sql/learn-sql-server/sql-server-indexed-views-the-basics/
If the two tables are linked one-to-one, then you are wasting a lot of CPU time on every read. Especially since you mentioned that the tables never change, you should have only one table in this case.
Try creating a new table including (at least) the relevant columns from the two tables.
You can do this by:
SELECT A.*, B.*
INTO newTable
FROM A LEFT JOIN B ON A.x = B.y
or (if you only want rows that have a match in the text-file data):
SELECT A.*, B.*
INTO newTable
FROM A INNER JOIN B ON A.x = B.y
Note that you need indexes on the join fields at least, to speed up the process.
More details about the fields may help give a more precise answer as well.

Select million+ records while huge insert is running

I am trying to extract application log file from a single table. The select query statement is pretty straightforward.
select top 200000 *
from dbo.transactionlog
where rowid>7
and rowid <700000 and
Project='AmWINS'
The query time for above select is above 5 mins. Is it considered long? While the select is running, the bulk insertion is also running.
[EDIT]
Actually, I am having a serious problem with my current production logging database.
Basically, we only have one table (transactionlog); all the application logs are inserted into this table. For a project like AmWINS, based on a SELECT COUNT result, we have about 800K+ records inserted per day. The inserts run 24 hours a day in the production environment. Users would like to extract data from the table to check the transaction logs, so we need to select records from the table when necessary.
I tried to simulate in a UAT environment by pumping in the same volume as production, which has grown to 10 million records as of today. While extracting records, I ran a simultaneous bulk insertion to make it look like the production environment. It took about 5 minutes just to extract 200K records.
During the extraction, I monitored the SQL physical server and CPU spiked up to 95%.
The table has 13 fields and a bigint identity column (rowid), which is the PK.
Indexes are created on Date, Project, Module and RefNumber.
The table is created with row locks and page locks enabled.
I am using SQL Server 2005.
Hope you guys can give me some professional advices to enlighten me. Thanks.
It may be possible for you to use the "NOLOCK" table hint, as described here:
Table Hints MSDN
Your SQL would become something like this:
select top 200000 * from dbo.transactionlog with (nolock) ...
This would achieve better performance if you aren't concerned about the complete accuracy of the data returned (it allows dirty reads).
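On SQL Server 2005 an alternative to NOLOCK is row versioning, which stops readers from blocking behind the bulk insert while still returning committed data rather than dirty reads. It is a database-level switch (the database name below is illustrative, and it uses tempdb space for the row versions):

```sql
-- Requires no other active connections at the moment it is switched on.
ALTER DATABASE LoggingDb SET READ_COMMITTED_SNAPSHOT ON
```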
What are you doing with the 200,000 rows? Are you running this over a network? Depending on the width of your table, just getting that amount of data across the network could be the bulk of the time spent.
It depends on your hardware. Pulling 200,000 rows out while data is being inserted requires some serious IO, so unless you have a 30+ disk system, it will be slow.
Also, is your rowID column indexed? This will help with the select, but could slow down the bulk insert.
I am not sure, but doesn't bulk insert in MS SQL lock the whole table?
As ck already said, indexing is important, so make sure you have an appropriate index ready. I would not only put an index on rowid but also on Project. Also, I would rewrite the where-clause to:
WHERE Project = 'AmWINS' AND rowid BETWEEN 8 AND 699999
Reason: I guess Project is more restrictive than rowid. (Note that BETWEEN is just shorthand for >= AND <=, so it performs the same as the < and > comparisons; the gain here is readability, not speed.)
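A sketch of such an index, using the names from the question:

```sql
-- Supports: WHERE Project = 'AmWINS' AND rowid BETWEEN 8 AND 699999
-- rowid is the clustered PK, so it rides along in every nonclustered
-- index anyway; listing it just makes the intent explicit.
CREATE NONCLUSTERED INDEX IX_transactionlog_Project_rowid
    ON dbo.transactionlog (Project, rowid)
```

Because the query is SELECT *, the engine still has to fetch the full rows, so the index narrows the range scan but does not make the read free.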
You could also export this as a local dat or sql file.
No amount of indexing will help here because it's a SELECT * query, so it's most likely a clustered PK scan or a horrendous bookmark lookup.
And the TOP is meaningless because there is no ORDER BY.
The simultaneous insert is probably misleading as far as I can tell, unless the table only has 2 columns and the bulk insert is locking the whole table. With a simple int IDENTITY column, the insert and select may not interfere with each other too much.
Especially if the bulk insert is only a few 1000s of rows (or even 10,000s).
Edit: the TOP and rowid values do not imply a million-plus rows returned.

How to quickly duplicate rows in SQL

Edit: I'm running SQL Server 2008
I have about 400,000 rows in my table. I would like to duplicate these rows until my table has 160 million rows or so. I have been using a statement like this:
INSERT INTO [DB].[dbo].[Sales]
([TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate])
SELECT [TotalCost]
,[SalesAmount]
,[ETLLoadID]
,[LoadDate]
,[UpdateDate]
FROM [DB].[dbo].[Sales]
This process is very slow, and I have to re-issue the query a large number of times. Is there a better way to do this?
To do this many inserts you will want to disable all indexes and constraints (including foreign keys) and then run a series of:
INSERT INTO mytable
SELECT fields FROM mytable
If you need to specify ID, pick some number like 80,000,000 and include in the SELECT list ID+80000000. Run as many times as necessary (no more than 10 since it should double each time).
Also, don't run within a transaction. The overhead of doing so over such a huge dataset will be enormous. You'll probably run out of resources (rollback segments or whatever your database uses) anyway.
Then re-enable all the constraints and indexes. This will take a long time but overall it will be quicker than adding to indexes and checking constraints on a per-row basis.
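The disable/rebuild step can be sketched like this; the index name is made up, and only nonclustered indexes should be disabled (disabling the clustered index makes the table unreadable):

```sql
-- Before the mass inserts:
ALTER INDEX IX_Sales_LoadDate ON dbo.Sales DISABLE

-- ... run the INSERT ... SELECT doublings ...

-- Afterwards, rebuilding re-enables the index in one pass:
ALTER INDEX IX_Sales_LoadDate ON dbo.Sales REBUILD
```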
Since each time you run that command it will double the size of your table, you would only need to run it about 9 times (400,000 × 2^9 = 204,800,000). Yes, it might take a while because copying that much data takes some time.
The speed of the insert will depend on a number of things...the physical disk speed, indexes, etc. I would recommend removing all indexes from the table and adding them back when you're done. If the table is heavily indexed then that should help quite a bit.
You should be able to repeatedly run that query in a loop until the desired number of rows is achieved. Every time you run it you'll double the data, so you'll end up with:
400,000
800,000
1,600,000
3,200,000
6,400,000
12,800,000
25,600,000
51,200,000
102,400,000
204,800,000
After nine executions.
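In SSMS or sqlcmd the repetition can be scripted with the batch separator's repeat count instead of re-issuing the statement by hand:

```sql
INSERT INTO [DB].[dbo].[Sales]
       ([TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate])
SELECT [TotalCost], [SalesAmount], [ETLLoadID], [LoadDate], [UpdateDate]
FROM [DB].[dbo].[Sales]
GO 9   -- GO <n> runs the preceding batch n times (SSMS/sqlcmd only)
```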
You don't state your SQL database, but most have a bulk loading tool to handle this scenario. Check the docs. If you have to do it with INSERTs, remove all indexes from the table first and reapply them after the data is INSERTed; this will generally be much faster than indexing during insertion.
This may still take a while to run... you can't turn logging off entirely, but switching the database to the SIMPLE or BULK_LOGGED recovery model while you generate the data will keep the log from growing as fast.
INSERT INTO [DB].[dbo].[Sales] (
[TotalCost] ,[SalesAmount] ,[ETLLoadID]
,[LoadDate] ,[UpdateDate]
)
SELECT s.[TotalCost] ,s.[SalesAmount] ,s.[ETLLoadID]
,s.[LoadDate] ,s.[UpdateDate]
FROM [DB].[dbo].[Sales] s WITH (NOLOCK)
CROSS JOIN (SELECT TOP 400 totalcost FROM [DB].[dbo].[Sales] WITH (NOLOCK)) o

Oracle SQL technique to avoid filling trans log

Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero.
Is this the best approach?
Thanks,
The amount of updates to the redo and undo logs will not be reduced at all if you break up the UPDATE into multiple runs of, say, 1000 records. On top of that, the total query time will most likely be higher compared to running a single large SQL statement.
There's no real way to address the UNDO/REDO log issue in UPDATEs. With INSERTs and CREATE TABLEs you can use a DIRECT aka APPEND option, but I guess this doesn't easily work for you.
It depends on the percentage of rows almost as much as the number. It also depends on whether the update makes the rows longer than before, i.e. going from null to 200 bytes in every row. This could have an effect on your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, re-point constraints, rebuild triggers, recompile packages, etc.
You can avoid a lot of logging this way.
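The CTAS step can be sketched as follows, using the names from the question (some_val stays a placeholder, and the CASE expression is one guess at what the corrected column should look like):

```sql
-- Direct-path create with no redo logging for the data.
CREATE TABLE my_table_new NOLOGGING AS
SELECT my_table_id,
       CASE WHEN some_col < some_val THEN NULL ELSE a_col END AS a_col,
       some_col
FROM my_table;

-- Then swap:
-- DROP TABLE my_table;
-- ALTER TABLE my_table_new RENAME TO my_table;
```

NOLOGGING skips redo only for the direct-path load itself, so take a backup afterwards; the new table cannot be recovered from the archived logs.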
Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then doing a rebuild of those indexes with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/ validated as a result of the update, getting rid of those temporarily might be helpful.
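The index part can be sketched like this (the index name is made up):

```sql
-- Stop maintaining the index during the mass UPDATE:
ALTER INDEX my_table_a_col_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;

-- ... run the single large UPDATE ...

-- Rebuild without generating redo for the index data:
ALTER INDEX my_table_a_col_idx REBUILD NOLOGGING;
```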