How to merge 2 huge SQL Server tables without a lot of disk space? - sql

I have 2 tables. Here is my merge statement:
MERGE INTO Transactions_Raw AS Target
USING Four_Fields AS Source
ON
Target.PI = Source.PI AND
Target.TIME_STAMP = Source.TIME_STAMP AND
Target.STS = Source.STS
WHEN MATCHED THEN
UPDATE SET
Target.FROM_APP_A = Source.FROM_APP_A,
Target.TO_APP_A = Source.TO_APP_A,
Target.FROM_APP_B = Source.FROM_APP_B,
Target.TO_APP_B = Source.TO_APP_B;
Both tables have around 77 million rows to them.
When I run this command, it fails due to tempdb growing and running out of disk space. I cannot get more space.
I tried running it as an update statement and it still failed.
I even tried using SSIS, sort and merge-join transformations from these 2 source tables into a 3rd empty table, left join on transactions_raw... It failed and crashed the server requiring a reboot.
I am thinking I need to run it in batches of 100,000 rows or something? How would I do that? Any other suggestions?

Disclaimer: if you want to use any framework to do this job.
spring-batch is a good choice to perform this task.
Even though it involves some coding, it will be helpful in transferring the data seemless and consistent way.
You can choose the row size(100,000) you want to transfer at a time.
Any failures while inserting the data would be limited to set of
rows (100,000). So you need to deal with less data and insert only
this delta data after fixing it. Correct data can be inserted
successfully. This whole set of control will be provided by using
spring-batch.

I usually prefer BCP for such operations on huge tables.
If that query (make sure it repeats the structure of Target table)
SELECT
Target.ID,
Target.TIME_STAMP,
Target.STS,
ISNULL(Source.FROM_APP_A, Target.FROM_APP_A) AS FROM_APP_A,
ISNULL(Source.TO_APP_A , Target.TO_APP_A) AS TO_APP_A,
ISNULL(Source.FROM_APP_B, Target.FROM_APP_B) AS FROM_APP_B,
ISNULL(Source.TO_APP_B , Target.TO_APP_B) AS TO_APP_B
FROM Database.dbo.Transactions_Raw AS Target
LEFT JOIN Database.dbo.Four_Fields AS Source
ON Target.ID = Source.ID AND
Target.TIME_STAMP = Source.TIME_STAMP AND
Target.STS = Source.STS*
won't overflow your TempDB, you can use it to export data into the file using something like
bcp "query_here" queryout D:\Data.dat -S server -Uuser -k -n
Later import data in a new table using something like
bcp Database.dbo.NewTable in D:\Data.dat -S server -Uuser -n -k -b 100000 -h TABLOCK

-- this is code example doing a loop a bit more explained what I said in comments for an example you just have to plug in your code
SET NOCOUNT ON;
DECLARE #PerBatchCount as int
DECLARE #MAXID as bigint
DECLARE #WorkingOnID as bigint
Set #PerBatchCount = 1000
--Find range of IDs to process
SELECT #WorkingOnID = MIN(ID), #MAXID = MAX(ID)
FROM MasterTableYouAreWorkingOn
WHILE #WorkingOnID <= #MAXID
BEGIN
-- do an update on all the ones that exist in the offer table NOW
--MergeStatementHere with adding the where below for ID
WHERE ID BETWEEN #WorkingOnID AND (#WorkingOnID + #PerBatchCount -1)
-- if the loop runs too much data you may need to add this to delay for your log to have time to ship/empty out
WAITFOR DELAY '00:00:10' -- wait 10 seconds
set #WorkingOnID = #WorkingOnID + #PerBatchCount
END

Related

SQL Server: Merge in iterations

I have to merge millions of rows into a table. The target table has an AFTER UPDATE trigger. The whole process is consuming a lot more memory than I'd like to allocate, and tempdb is eating up disk space.
I'd like to have the MERGE command run in batches of 100,000 records at a time. With SET ROWCOUNT being deprecated and cursors being inefficient, I'm not sure what the best approach for this would be.
A set oriented approach would be best way to run the query efficiently. So what you are doing seems just fine to me. If it consumes temp db etc, would need to know if you are doing any row by row operations which is slowing down. Generally a MERGE is a single statement and therefore is efficent.
The other things to consider is, if you got indexes on the destination table, then you could drop those indexes and then run the MERGE followed by recreating the indexes.
Coming to your question, you can split up into batches by using the mod operator and running the MERGE in a loop
eg:
declare #i int
select #i=count(*)/10 from source
while #i>0
begin
merge
into dest d
using (select *
from source
where id%10000=i --here id is the primary key of the source table
) s
on d.id=s.id
when matched
set ...
when not matched
insert...
...rest of the insert/update logic here
set #i=#i-1
end
Try WHILE loop
DECLARE #I INT = 1
WHILE (#I > 0)
BEGIN
;MERGE INTO Dst USING (
SELECT TOP 1000
FROM Src
WHERE NotUpdated
)
...
SET #I = ##ROWCOUNT
END

SSIS is hanging during Update with 3 millions of rows

I'm implementing a new method for a warehouse. The new method consist on perform incremental loading between source and destination tables (Insert,Update or Delete).
All the table are working really well, except for 1 table which the Source has more than 3 millions of rows, as you will see in the image below it just start running but never finish.
Probable I'm not doing the update in the correct way or there is another way to do it.
Here are some pictures of my SSIS package:
Highlighted object is where it hangs.
This is the stored procedure I call to update the table:
ALTER PROCEDURE [dbo].[UpdateDim_A]
#ID INT,
#FileDataID INT
,#CategoryID SMALLINT
,#FirstName VARCHAR(50)
,#LastName VARCHAR(50)
,#Company VARCHAR(100)
,#Email VARCHAR(250) AS BEGIN
SET NOCOUNT ON;
BEGIN TRAN
UPDATE DIM_A
SET
[FileDataID] = #FileDataID,
[CategoryID] = #CategoryID,
[FirstName] = #FirstName,
[LastName] = #LastName,
[Company] = #Company,
[Email] = #Email
WHERE PartyID=#ID
COMMIT TRAN; END
Note:
I already tried Dropping the constraint and indexes and changing the recovery mode of the database to simple.
Any help will be appreciate.
After Apply the solution provided by #Prabhat G, this is how my package looks like, running in 39 seconds (avg)!!!
Inside Dim_A DataFlow
Follow these 2 performance enhancers and you'll avoid your bottleneck.
Remove sort transformation. In your source, while fetching the data use order by sql. Reason being, sort takes up all the records in memory before sorting. You don't want that, be it incremental or full load.
In the last step of update, introduce another Staging Table instead of update records oledb command, which will be replica of Dim table. Once all the matching records are inserted in this new staging table, exit the Data flow task and create EXECUTE SQL TASK which will simply UPDATE Dim table based on joining ID/conditions.
Reason for this is, oledb command hits row by row. Always prefer update using Execute SQL Task as its a batch process.
Edit:
As per comments, to update only changed rows in Execute SQL Task, add the conditions in where clause:
eg:
UPDATE x
SET
x.attribute_A = y.attribute_A
,x.attribute_B = y.attribute_B
FROM
DimA x
inner join stg_DimA y
ON x.Id = y.Id
WHERE
(x.Attribute_A <> y.Attribute_A
OR x.Attribute_B <> y.Attribute_B)
So your problem is actually very simple the method you are using is executing that stored procedure for every row returned. If you have 9961(as in your picture) rows to update it will run that statement 9961 sepreate time. Chances are if you are to look at active queries running on SQL server you'll see that procedure executing over and over.
What you should do to speed this up is dump that data into a staging table then use the execute SQL task further in your package to run a standard SQL update. This will run much faster.
The problem is that you are trying to execute a stored procedure within the data flow. The correct SqlCommand will be an explicit UPDATE query and then map the columns from SSIS to the columns on the table that you are updating.
UPDATE DIM_A
SET FileDataID = ?
,CategoryID = ?
,FirstName = ?
,LastName = ?
,Company = ?
,Email = ?
WHERE PartyID = ?
Note: The #Id needs to be included as a column in your data flow.
One final thing you should consider, as Zane correctly pointed out: you should only update rows that have changed. So, in your data flow you should add a Conditional Split transformation that checks to see if any of the columns in the new source row are different from the existing table rows. Only rows that are different should be send to the OLE DB Command - the rest can be disregarded.

How to efficiently delete small set of data from a large sql table

I want to delete 10GB (1%) data from 1TB table. I have come across several articles to delete large amounts of data from a huge table but didn't find much on deleting smaller percentage of data from a huge table.
Additional details:
Trying to delete bot data from the visits table. The filter condition is a combination of fields... ip in (list of ips about 20 of them) and useragent like '%SOMETHING%'
useragent size 1024 varchar
The data can be old or new. I can't use date filter
Here is a batch delete in chunks that I use regularly. Perhaps it would give you some ideas on how to approach your need. I create a stored proc and call the proc from a SQL Agent Job. I generally schedule it to allow a transaction log backup between executions so the log does not grow too large. You could always just run it interactively if you wish.
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROC [DBA_Delete_YourTableName] AS
SET NOCOUNT ON;
---------------------------------------------------------
DECLARE #DaysHistoryToKeep INT
SET #DaysHistoryToKeep = 90
IF #DaysHistoryToKeep < 30
SET #DaysHistoryToKeep = 30
---------------------------------------------------------
DECLARE #continue INT
DECLARE #rowcount INT
DECLARE #loopCount INT
DECLARE #MaxLoops INT
DECLARE #TotalRows BIGINT
DECLARE #PurgeThruDate DATETIME
SET #PurgeThruDate = DATEADD(dd,(-1)*(#DaysHistoryToKeep+1), GETDATE())
SET #MaxLoops = 100
SET #continue = 1
SET #loopCount = 0
SELECT #TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < #PurgeThruDate)
PRINT 'Total Rows = ' + CAST(#TotalRows AS VARCHAR(20))
PRINT ''
WHILE #continue = 1
BEGIN
SET #loopCount = #loopCount + 1
PRINT 'Loop # ' + CAST(#loopCount AS VARCHAR(10))
PRINT CONVERT(VARCHAR(20), GETDATE(), 120)
BEGIN TRANSACTION
DELETE TOP (4500) YourTableName WHERE CREATEDDATETIME < #PurgeThruDate
SET #rowcount = ##rowcount
COMMIT
PRINT 'Rows Deleted: ' + CAST(#rowcount AS VARCHAR(10))
PRINT CONVERT(VARCHAR(20), GETDATE(), 120)
PRINT ''
IF #rowcount = 0 OR #loopCount >= #MaxLoops
BEGIN
SET #continue = 0
END
END
SELECT #TotalRows = (SELECT COUNT(*) FROM YourTableName (NOLOCK) WHERE CREATEDDATETIME < #PurgeThruDate)
PRINT 'Total Rows Remaining = ' + CAST(#TotalRows AS VARCHAR(20))
PRINT ''
GO
The filter condition is ... ip in (list of ips about 20 of them) and useragent like '%SOMETHING%'
Regarding table size, it's important to touch as few rows as possible while executing the delete.
I imagine on a table that size you already have an index on the ip column. It might help (or not) to put your list 20 or so ips in a table instead of in an in clause, especially if they're parameters. I'd look at my query plan to see.
I hope useragent like '%SOMETHING%' is usually true; otherwise it's an expensive test because SQL Server has to examine every row for an eligible ip. If not, a redesign to allow the query to avoid like would probably be beneficial.
[D]eleting smaller percentage isn't really relevant. Using selective search criteria is (per above), as is the size of the delete transaction in absolute terms. By definition, the size of the deletion in terms of rows and row size determines the size of the transaction. Very large transactions can push against machine resources. Breaking them up into smaller ones can yield better performance in such cases.
The last server I used had 0.25 TB RAM and was comfortable deleting 1 million rows at a time, but not 10 million. Your milage will vary; you have to try, and observe, to see.
How much you're willing to tax the machine will depend on what else is (or needs to be able to) run at the same time. The way you break up one logical action -- delete all rows where [condition] -- into "chunks" also depends on what you want the database to look like while the delete procedure is in process, when some chunks are deleted and others remain present.
If you do decide to break it into chunks, I recommend not using a fixed number of rows and a TOP(n) syntax, because that's the least logical solution. Unless you use order by, you're leaving to the server to choose arbitrarily which N rows to delete. If you do use order by, you're requiring the server to sort the result before starting the delete, possibly several times over the whole run. Bleh!
Instead, find some logical subset of rows, ideally distinguishable along the clustered index, that fall beneath your machine's threshold of an acceptable number of rows to delete at one time. Loop over that set. In your case, I would be tempted to iterate over the set of ip values in the in clause. Instead of delete ... where ip in(...), you get (roughly) for each ip delete ... where ip = #ip
The advantage of the latter approach is that you always know where the database stands. If you kill the procedure or it gets rolled back partway through its iteration, you can examine the database to see which ips still remain (or whatever criteria you end up using). You avoid any kind of pathological behavior, whereby some query gets a partial result because some part of your selection criteria (determined by the server alone) are present and others deleted. In thinking about the problem you can say, I'm unable to delete ip 192.168.0.1 because, without wondering which portion have already been removed.
In sum, I recommend:
Give the server every chance to touch only the rows you want to affect, and verify that's what it will do.
Construct your delete routine, if you need one, to delete logical chunks, so you can reason about the state of the database at any time.

How to force a running t-sql query (half done) to commit?

I have database on Sql Server 2008 R2.
On that database a delete query on 400 Million records, has been running for 4 days , but I need to reboot the machine. How can I force it to commit whatever is deleted so far? I want to reject that data which is deleted by running query so far.
But problem is that query is still running and will not complete before the server reboot.
Note : I have not set any isolation / begin/end transaction for the query. The query is running in SSMS studio.
If machine reboot or I cancelled the query, then database will go in recovery mode and it will recovering for next 2 days, then I need to re-run the delete and it will cost me another 4 days.
I really appreciate any suggestion / help or guidance in this.
I am novice user of sql server.
Thanks in Advance
Regards
There is no way to stop SQL Server from trying to bring the database into a transactionally consistent state. Every single statement is implicitly a transaction itself (if not part of an outer transaction) and is executing either all or nothing. So if you either cancel the query or disconnect or reboot the server, SQL Server will from transaction log write the original values back to the updated data pages.
Next time when you delete so many rows at once, don't do it at once. Divide the job in smaller chunks (I always use 5.000 as a magic number, meaning I delete 5000 rows at the time in the loop) to minimize transaction log use and locking.
set rowcount 5000
delete table
while ##rowcount = 5000
delete table
set rowcount 0
If you are deleting that many rows you may have a better time with truncate. Truncate deletes all rows from the table very efficiently. However, I'm assuming that you would like to keep some of the records in the table. The stored procedure below backs up the data you would like to keep into a temp table then truncates then re-inserts the records that were saved. This can clean a huge table very quickly.
Note that truncate doesn't play well with Foreign Key constraints so you may need to drop those then recreate them after cleaned.
CREATE PROCEDURE [dbo].[deleteTableFast] (
#TableName VARCHAR(100),
#WhereClause varchar(1000))
AS
BEGIN
-- input:
-- table name: is the table to use
-- where clause: is the where clause of the records to KEEP
declare #tempTableName varchar(100);
set #tempTableName = #tableName+'_temp_to_truncate';
-- error checking
if exists (SELECT [Table_Name] FROM Information_Schema.COLUMNS WHERE [TABLE_NAME] =(#tempTableName)) begin
print 'ERROR: already temp table ... exiting'
return
end
if not exists (SELECT [Table_Name] FROM Information_Schema.COLUMNS WHERE [TABLE_NAME] =(#TableName)) begin
print 'ERROR: table does not exist ... exiting'
return
end
-- save wanted records via a temp table to be able to truncate
exec ('select * into '+#tempTableName+' from '+#TableName+' WHERE '+#WhereClause);
exec ('truncate table '+#TableName);
exec ('insert into '+#TableName+' select * from '+#tempTableName);
exec ('drop table '+#tempTableName);
end
GO
You must know D(Durability) in ACID first before you understand why database goes to Recovery mode.
Generally speaking, you should avoid long running SQL if possible. Long running SQL means more lock time on resource, larger transaction log and huge rollback time when it fails.
Consider divided your task some id or time. For example, you want to insert large volume data from TableSrc to TableTarget, you can write query like
DECLARE #BATCHCOUNT INT = 1000;
DECLARE #Id INT = 0;
DECLARE #Max = ...;
WHILE Id < #Max
BEGIN
INSERT INTO TableTarget
FROM TableSrc
WHERE PrimaryKey >= #Id AND #PrimaryKey < #Id + #BatchCount;
SET #Id = #Id + #BatchCount;
END
It's ugly more code and more error prone. But it's the only way I know to deal with huge data volume.

Copy one column to another for over a billion rows in SQL Server database

Database : SQL Server 2005
Problem : Copy values from one column to another column in the same table with a billion+
rows.
test_table (int id, bigint bigid)
Things tried 1: update query
update test_table set bigid = id
fills up the transaction log and rolls back due to lack of transaction log space.
Tried 2 - a procedure on following lines
set nocount on
set rowcount = 500000
while #rowcount > 0
begin
update test_table set bigid = id where bigid is null
set #rowcount = ##rowcount
set #rowupdated = #rowsupdated + #rowcount
end
print #rowsupdated
The above procedure starts slowing down as it proceeds.
Tried 3 - Creating a cursor for update.
generally discouraged in SQL Server documentation and this approach updates one row at a time which is too time consuming.
Is there an approach that can speed up the copying of values from one column to another. Basically I am looking for some 'magic' keyword or logic that will allow the update query to rip through the billion rows half a million at a time sequentially.
Any hints, pointers will be much appreciated.
I'm going to guess that you are closing in on the 2.1billion limit of an INT datatype on an artificial key for a column. Yes, that's a pain. Much easier to fix before the fact than after you've actually hit that limit and production is shut down while you are trying to fix it :)
Anyway, several of the ideas here will work. Let's talk about speed, efficiency, indexes, and log size, though.
Log Growth
The log blew up originally because it was trying to commit all 2b rows at once. The suggestions in other posts for "chunking it up" will work, but that may not totally resolve the log issue.
If the database is in SIMPLE mode, you'll be fine (the log will re-use itself after each batch). If the database is in FULL or BULK_LOGGED recovery mode, you'll have to run log backups frequently during the running of your operation so that SQL can re-use the log space. This might mean increasing the frequency of the backups during this time, or just monitoring the log usage while running.
Indexes and Speed
ALL of the where bigid is null answers will slow down as the table is populated, because there is (presumably) no index on the new BIGID field. You could, (of course) just add an index on BIGID, but I'm not convinced that is the right answer.
The key (pun intended) is my assumption that the original ID field is probably the primary key, or the clustered index, or both. In that case, lets take advantage of that fact, and do a variation of Jess' idea:
set #counter = 1
while #counter < 2000000000 --or whatever
begin
update test_table set bigid = id
where id between #counter and (#counter + 499999) --BETWEEN is inclusive
set #counter = #counter + 500000
end
This should be extremely fast, because of the existing indexes on ID.
The ISNULL check really wasn't necessary anyway, neither is my (-1) on the interval. If we duplicate some rows between calls, that's not a big deal.
Use TOP in the UPDATE statement:
UPDATE TOP (#row_limit) dbo.test_table
SET bigid = id
WHERE bigid IS NULL
You could try to use something like SET ROWCOUNT and do batch updates:
SET ROWCOUNT 5000;
UPDATE dbo.test_table
SET bigid = id
WHERE bigid IS NULL
GO
and then repeat this as many times as you need to.
This way, you're avoiding the RBAR (row-by-agonizing-row) symptoms of cursors and while loops, and yet, you don't unnecessarily fill up your transaction log.
Of course, in between runs, you'd have to do backups (especially of your log) to keep its size within reasonable limits.
Is this a one time thing? If so, just do it by ranges:
set counter = 500000
while #counter < 2000000000 --or whatever your max id
begin
update test_table set bigid = id where id between (#counter - 500000) and #counter and bigid is null
set counter = #counter + 500000
end
I didn't run this to try it, but if you can get it to update 500k at a time I think you're moving in the right direction.
set rowcount 500000
update test_table tt1
set bigid = (SELECT tt2.id FROM test_table tt2 WHERE tt1.id = tt2.id)
where bigid IS NULL
You can also try changing the recover model so you don't log the transactions
ALTER DATABASE db1
SET RECOVERY SIMPLE
GO
update test_table
set bigid = id
GO
ALTER DATABASE db1
SET RECOVERY FULL
GO
First step, if there are any, would be to drop indexes before the operation. This is probably what is causing the speed degrade with time.
The other option, a little outside the box thinking...can you express the update in such a way that you could materialize the column values in a select? If you can do this then you could create what amounts to a NEW table using SELECT INTO which is a minimally logged operation (assuming in 2005 that you are set to a recovery model of SIMPLE or BULK LOGGED). This would be pretty fast and then you can drop the old table, rename this table to to old table name and recreate any indexes.
select id, CAST(id as bigint) bigid into test_table_temp from test_table
drop table test_table
exec sp_rename 'test_table_temp', 'test_table'
I second the
UPDATE TOP(X) statement
Also to suggest, if you're in a loop, add in some WAITFOR delay or COMMIT between, to allow other processes some time to use the table if needed vs. blocking forever until all the updates are completed