Issue with huge table archive - sql

I have been assigned to move data from huge tables (around 2 million records each) to identical history tables. But when my query runs, the log file grows too large and messes everything up. This is what I tried:
Each table being archived is handled as a separate transaction.
The history tables have no primary key defined (could this be a problem?).
All the transactions are written in a single stored procedure.
Can anyone tell me whether there is an issue with my approach, or whether this is simply not the right way to do it?

You can minimise logging if you use table locks with a bulk import.
Lots of great info can be found here:
http://msdn.microsoft.com/en-us/library/ms190422.aspx
Some pointers from the article:
Change the database recovery model to bulk-logged.
Apply indexes after the import.
Import in batches.
Do a log backup after each batch.
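The batching and log-backup pointers can be sketched in T-SQL roughly as follows. This is a minimal sketch, not the asker's procedure: the names dbo.Orders, dbo.OrdersHistory, OrderDate, MyDb, the backup path and the batch size are all hypothetical. It assumes SQL Server 2008 or later, a history table with exactly the same column layout and no IDENTITY column, triggers, foreign keys or CHECK constraints (OUTPUT INTO rejects such targets), and a database in the FULL or BULK_LOGGED recovery model that already has a full backup.

```sql
-- Hypothetical names throughout; adjust to your schema.
DECLARE @Cutoff    datetime = DATEADD(MONTH, -12, GETDATE());
DECLARE @BatchSize int      = 50000;   -- assumed batch size, tune for your log
DECLARE @Rows      int      = 1;

WHILE @Rows > 0
BEGIN
    -- Each DELETE is its own autocommit transaction, so the log only has to
    -- hold one batch of work at a time. The deleted rows are written to the
    -- history table by the same statement.
    DELETE TOP (@BatchSize)
    FROM dbo.Orders
    OUTPUT deleted.* INTO dbo.OrdersHistory
    WHERE OrderDate < @Cutoff;

    SET @Rows = @@ROWCOUNT;

    -- Back up the log so the space used by the finished batch can be reused.
    -- In the SIMPLE recovery model, replace this with CHECKPOINT.
    BACKUP LOG MyDb TO DISK = N'D:\Backup\MyDb_log.trn';
END;
```

Run this outside any outer transaction: BACKUP LOG is not allowed inside one, and a wrapping transaction would also defeat the point of batching. Backing up the log after every batch follows the pointer literally; every few batches may be enough in practice.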

Related

SSIS : Huge Data Transfer from Source (SQL Server) to Destination (SQL Server)

Requirements:
Transfer millions of records from source (SQL Server) to destination (SQL Server).
The structure of the source tables is different from that of the destination tables.
Refresh the data in the destination server once per week.
Keep the processing time to a minimum.
I am looking for an optimized approach using SSIS.
I was considering these options:
Create a SQL dump from the source server and import that dump into the destination server.
Copy the tables directly from the source server to the destination server.
There are lots of issues to consider here, such as whether the servers are in the same domain, on the same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records but in smaller amounts. An SSIS package handles that batching logic for you, although you can always build it yourself; either way, iterating over the changes in small pieces is easier to manage. Sometimes this is a reason to push changes more often rather than wait an entire week, as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your deltas and to ensure that you have ALL of the changes. For this reason I would generally suggest using a staging table at the destination server. By moving changes to staging and then loading them into the final table, you can more easily ensure that changes are applied correctly. Think of the scenario of an increment being out of order (identity insert), a datetime ordered incorrectly, or one chunk failing. When using a staging table you don't have to rely solely on the id/date and can actually do joins on primary keys to look for changes.
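As a rough illustration of joining staging to the final table on the primary key, something like the following could apply the changes once a staging load has completed. All names here (dbo.Customer, dbo.Customer_Staging, CustomerId, Name, ModifiedAt) are made up for the example, and the comparison assumes the compared columns are NOT NULL.

```sql
-- Hypothetical tables: dbo.Customer (final) and dbo.Customer_Staging (staging),
-- both keyed on CustomerId.
SET XACT_ABORT ON;      -- any error rolls back the whole apply step
BEGIN TRANSACTION;

-- 1. Update rows that already exist and actually differ.
UPDATE tgt
SET    tgt.Name       = stg.Name,
       tgt.ModifiedAt = stg.ModifiedAt
FROM   dbo.Customer         AS tgt
JOIN   dbo.Customer_Staging AS stg
       ON stg.CustomerId = tgt.CustomerId
WHERE  tgt.Name <> stg.Name
   OR  tgt.ModifiedAt <> stg.ModifiedAt;

-- 2. Insert rows that are in staging but not yet in the final table.
INSERT INTO dbo.Customer (CustomerId, Name, ModifiedAt)
SELECT stg.CustomerId, stg.Name, stg.ModifiedAt
FROM   dbo.Customer_Staging AS stg
WHERE  NOT EXISTS (SELECT 1
                   FROM dbo.Customer AS tgt
                   WHERE tgt.CustomerId = stg.CustomerId);

COMMIT TRANSACTION;
```

Because the comparison is done on the key and the column values, a row that arrived out of order or was re-sent after a failed chunk is still applied correctly.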
Linked servers, as proposed by Alex K., can be a great fit, but you will need to pay close attention to a couple of things. Always run the transfer from the destination server so that it is a PULL, not a push: linked servers are fast at querying data but horrible at updating/inserting in bulk. The table cannot contain an XML column at all. You may also need to set some specific properties for distributed transactions.
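A pull over a linked server might look like the sketch below, run on the destination server. The linked server name SRCSERVER, the database and table names, the column list and the seven-day window are all assumptions for illustration; it also assumes SQL Server 2008 or later for the inline DECLARE initialisation.

```sql
-- Run on the DESTINATION server: the INSERT is local, only the SELECT is remote.
DECLARE @Since datetime = DATEADD(DAY, -7, GETDATE());

INSERT INTO dbo.Orders_Staging (OrderId, CustomerId, OrderDate)
SELECT o.OrderId, o.CustomerId, o.OrderDate
FROM   [SRCSERVER].[SourceDb].[dbo].[Orders] AS o   -- four-part linked server name
WHERE  o.OrderDate >= @Since;
```

Listing the columns explicitly, rather than using SELECT *, also makes it easy to leave out a column type that the linked server cannot transfer.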
I have done this task both ways, and I would say that SSIS has a bit of an advantage over a linked server because of its robust error handling, threading logic, and ability to use different adapters (OLE DB, ODBC, etc.; they perform differently, and a search will turn up some comparisons). But the key to your requirement #4 is to do it in smaller chunks and from a staging table, and if you can do it more often, each run is less likely to have an impact. E.g. a daily run would already be ~1/7th of the size of a weekly one, assuming changes are spread evenly across the days.
Take 10,000,000 records changed per week:
Once weekly = 10 million records
Once daily = ~1.4 million records
Once hourly = ~59K records
Once every 5 minutes = less than 5K records
And if it has to be once a week, still think about doing it in small chunks so that each insert has a smaller effect on your transaction logs, the actual lock time on the production table, etc. Be sure that you never allow partially staged/transferred data to be loaded; otherwise identifying deltas could get messed up and you could end up missing changes.
One other thought, if this is a scenario like a reporting instance and you have enough server resources: you could bring over your entire table from production into a staging table (or update a copy of the table at the destination), then simply drop the current table and rename the staging table. This is an extreme scenario and not one I generally like, but it is possible, and the actual impact on the user would be very small.
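The drop-and-rename swap could be sketched like this. The table names dbo.SalesReport and dbo.SalesReport_Staging are hypothetical, and the sketch assumes the staging table has already been fully loaded and indexed.

```sql
-- Hypothetical names. The swap is wrapped in a transaction so readers never
-- see a moment where dbo.SalesReport does not exist.
SET XACT_ABORT ON;
BEGIN TRANSACTION;

DROP TABLE dbo.SalesReport;

-- sp_rename takes the current (schema-qualified) name and the new bare name.
EXEC sp_rename 'dbo.SalesReport_Staging', 'SalesReport';

COMMIT TRANSACTION;
```

Note that indexes, constraints and permissions belong to the table object, so whatever was defined on the staging table is what the renamed table ends up with.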
I think SSIS is good at transferring data. My approach would be:
1. Create a package with one Data Flow Task to transfer the data. If the structure of the two tables is different, that's okay, just map the columns.
2. Create a SQL Server Agent job to run your package every weekend.
Also, the Track Data Changes feature (SQL Server) is worth a look. You can configure when you want to sync data, and it performs well too.
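Track Data Changes covers both Change Data Capture and Change Tracking. As a hedged sketch of the Change Tracking variant only: the database SourceDb, the table dbo.Orders and its columns are hypothetical, the table must have a primary key, and GO is the batch separator used by SSMS and sqlcmd rather than a T-SQL statement.

```sql
-- One-time setup on the source database (hypothetical names).
ALTER DATABASE SourceDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 14 DAYS, AUTO_CLEANUP = ON);
GO
ALTER TABLE dbo.Orders ENABLE CHANGE_TRACKING;
GO

-- On each sync: fetch everything changed since the version saved last time.
DECLARE @LastSyncVersion bigint = 0;   -- assumed to be loaded from your own bookkeeping table
DECLARE @ThisSyncVersion bigint = CHANGE_TRACKING_CURRENT_VERSION();

SELECT ct.OrderId,
       ct.SYS_CHANGE_OPERATION,        -- 'I', 'U' or 'D'
       o.CustomerId,
       o.OrderDate
FROM   CHANGETABLE(CHANGES dbo.Orders, @LastSyncVersion) AS ct
LEFT JOIN dbo.Orders AS o
       ON o.OrderId = ct.OrderId;      -- NULL columns for deleted rows

-- Persist @ThisSyncVersion and pass it in as @LastSyncVersion next time.
SELECT @ThisSyncVersion AS NextSyncVersion;
```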
With SQL Server versions >2005, it has been my experience that a dump to a file with an export is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer.
1. Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
   I typically put these into their own database schema so their purpose is clear.
   I also typically use the SSIS OLE DB Destination package's "New" button to create the tables.
   Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
2. Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
   This can be one package or many, depending on how many tables you're pulling over.
3. Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
   Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
4. Use SSIS Execute SQL tasks to call the stored procedures.
   Use parallel processing via Sequence Containers where possible to save time.
   This can be one package or many, depending on how many tables you're transforming.
5. (Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
6. You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
   If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
7. Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).

Backing up portion of data in SQL

I have a huge schema containing billions of records. I want to purge data older than 13 months from it and keep that data as a backup in such a way that it can be recovered whenever required.
Which is the best way to do this in SQL? Can we create a separate copy of this schema and add a delete trigger on all tables, so that when the trigger fires the purged data gets inserted into this new schema?
If we use triggers, will only one record be inserted per delete statement, or will all deleted records be inserted?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud with your given date criteria without any applications or users being aware of it when querying the database. No backups are required and it is very easy to set up.
There is no need for triggers. You can use a job that runs every day and puts outdated data into archive tables.
The best way, I guess, is to create a copy of the current schema. In the main copy, delete everything that is older than 13 months; in the archive copy, delete everything from the last 13 months.
Then create a stored procedure (or several) that collects outdated data, puts it into the archive and deletes it from the main table. Put this into a daily job.
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table, probably partitioned by month on a date column. Moving the data in a given partition is a metadata operation and is extremely fast (if the partition scheme and its function are set up properly). I have managed 300GB tables using partitioning and it has been very effective. Be careful with the partition function so that dates at each edge are handled correctly.
Some of the other proposed solutions involve deleting millions of rows which could take a long, long time to execute. Model the different solutions using profiler and/or extended events to see which is the most efficient.
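To make the partitioning suggestion concrete, here is a small sketch of a monthly partition function and a partition switch. Everything named here (pfMonthly, psMonthly, dbo.Events, dbo.EventsArchive, the boundary dates) is hypothetical, it assumes SQL Server 2008 or later for the date type, and GO is the SSMS/sqlcmd batch separator.

```sql
-- RANGE RIGHT: each boundary date belongs to the partition on its right, so
-- '2024-02-01' falls into the February partition, not the January one.
CREATE PARTITION FUNCTION pfMonthly (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');
GO
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly ALL TO ([PRIMARY]);
GO
-- Source and archive tables must have identical structure and indexes.
CREATE TABLE dbo.Events (
    EventId   bigint       NOT NULL,
    EventDate date         NOT NULL,
    Payload   varchar(100) NULL,
    CONSTRAINT PK_Events PRIMARY KEY CLUSTERED (EventDate, EventId)
) ON psMonthly (EventDate);

CREATE TABLE dbo.EventsArchive (
    EventId   bigint       NOT NULL,
    EventDate date         NOT NULL,
    Payload   varchar(100) NULL,
    CONSTRAINT PK_EventsArchive PRIMARY KEY CLUSTERED (EventDate, EventId)
) ON psMonthly (EventDate);
GO
-- Move all of January 2024 to the archive as a metadata-only operation.
-- The target partition must be empty.
DECLARE @p int = $PARTITION.pfMonthly('2024-01-15');

ALTER TABLE dbo.Events
SWITCH PARTITION @p TO dbo.EventsArchive PARTITION @p;
```

No rows are copied or deleted by the switch; only the ownership of the partition changes, which is why it stays fast regardless of row count.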
I agree with the above advice not to create a trigger. Triggers fire on every insert/update/delete statement against the table, which adds overhead to all of them.
You may be best served with a data archive stored procedure.
Consider using multiple databases: the current database that holds your current data, plus one or more archive databases. A stored procedure that runs nightly or monthly then moves records out of the current database into the archive.
You can use the exact same schema as your production system.
If the data is already in the database, there is no need for a bulk copy. From there you can back up your archive database so that it is off the SQL Server, and restore the database if needed to make the data available again. This is much faster and more manageable than a bulk copy.
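A bare-bones version of that move-then-backup process might look like this. The database names CurrentDb and ArchiveDb, the table and column names, and the backup path are assumptions; with billions of rows the move itself would need to be done in batches rather than in one transaction as shown.

```sql
SET XACT_ABORT ON;   -- if the INSERT fails, the DELETE must not run

DECLARE @Cutoff datetime = DATEADD(MONTH, -13, GETDATE());

BEGIN TRANSACTION;

-- Copy the old rows into the archive database (same instance)...
INSERT INTO ArchiveDb.dbo.Orders (OrderId, CustomerId, OrderDate)
SELECT OrderId, CustomerId, OrderDate
FROM   CurrentDb.dbo.Orders
WHERE  OrderDate < @Cutoff;

-- ...then remove them from the current database.
DELETE FROM CurrentDb.dbo.Orders
WHERE  OrderDate < @Cutoff;

COMMIT TRANSACTION;

-- Take the archive off the server as a backup file.
BACKUP DATABASE ArchiveDb
TO DISK = N'D:\Backup\ArchiveDb.bak'
WITH INIT, CHECKSUM;

-- Later, to make the data available again:
-- RESTORE DATABASE ArchiveDb FROM DISK = N'D:\Backup\ArchiveDb.bak' WITH RECOVERY;
```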
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archive, the implementation in SQL 2016 does not appear to support archive and purge.

Bulk Insert with Limited Disk Space

I have a bit of a strange situation, and I'm wondering if anyone would have any ideas how to proceed.
I'm trying to bulk load a 48 gig pipe-delimited file into a table in SQL Server 2008, using a pretty simple bulk insert statement.
BULK INSERT ItemMovement
FROM 'E:\SQLexp\itemmove.csv'
WITH (DATAFILETYPE = 'char', FIELDTERMINATOR = '|', ROWTERMINATOR = '\n' )
Originally, I was trying to load directly into the ItemMovement table. But unfortunately, there's a primary key violation somewhere in this giant file. I created a temporary table to load this file to instead, and I'm planning on selecting distinct rows from the temporary table and merging them into the permanent table.
However, I keep running into space issues. The drive I'm working with is a total of 200 gigs, and 89 gigs are already devoted to both my CSV file and other database information. Every time I try to do my insertion, even with my recovery model set to "Simple", I get the following error (after 9.5 hours of course):
Msg 9002, Level 17, State 4, Line 1
The transaction log for database 'MyData' is full due to 'ACTIVE_TRANSACTION'.
Basically, my question boils down to two things.
Is there any way to load this file into a table that won't fill up the drive with logging? Simple Recovery doesn't seem to be enough by itself.
If we do manage to load up the table, is there a way to do a distinct merge that removes the items from the source table while it's doing the query (for space reasons)?
Appreciate your help.
Even with simple recovery, the insert is still a single operation, so the log cannot be truncated until it finishes.
You are getting the error on the PK column.
I assume the PK is only a fraction of the total size, so I would break it up and insert only the PK first.
I'm pretty sure you can limit the columns with FORMATFILE.
If you have to edit a bunch of duplicate PKs, you may need to use a program to parse the file and then load it row by row.
That sounds like a lot of work for a problem that is solved with a $100 drive.
Seriously, I would install a drive and use it for the transaction log.
@tommy_o was right about using TABLOCK in order to get my information loaded. Not only did it run in about an hour and a half instead of nine hours, but it barely increased my log size.
For the second part, I realized I could free up quite a bit of space by deleting my CSV after the load, which gave me enough space to get the tables merged.
Thanks everyone!
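For readers landing here, the TABLOCK variant of the statement from the question would presumably look something like the sketch below. The staging table name dbo.ItemMovementStaging and the BATCHSIZE value are assumptions; the poster did not show the final statement. Minimal logging also generally requires the SIMPLE or BULK_LOGGED recovery model and a target without indexes (a heap).

```sql
-- Same statement as in the question, loading into a hypothetical staging heap.
BULK INSERT dbo.ItemMovementStaging
FROM 'E:\SQLexp\itemmove.csv'
WITH (
    DATAFILETYPE    = 'char',
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    TABLOCK,                 -- table-level lock, allows minimal logging
    BATCHSIZE       = 500000 -- assumed value; commits every 500,000 rows
);
```

With BATCHSIZE set, each batch is committed as its own transaction, so a failure part-way through only rolls back the current batch and the log can be reused between batches.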

LDF file continues to grow very large during transaction phase - SQL Server 2005

We have a 6-step process in which we copy tables from one database to another. Each step executes a stored procedure.
Remove tables from destination database
Create tables in destination database
Shrink database log before copy
Copy tables from source to destination
Shrink the database log
Back up destination database
During step 4, our transaction log (ldf file) grows so large that we now have to keep increasing its max size on the SQL Server, and we believe that eventually (in the far future) it may eat up all the resources on our server. It was suggested that in our script we commit each transaction as we go instead of waiting until the end to commit them all.
Any suggestions?
I'll make the assumption that you are moving large amounts of data. The typical solution to this problem is to break the copy up into batches with a smaller number of rows each. This keeps the hit on the transaction log smaller. I think this will be the preferred answer.
The other answer that I have seen is using Bulk Copy, which writes the data out to a text file and imports it into your target db using Bulk Copy. I've seen a lot of posts that recommend this. I haven't tried it.
If the schema of the target tables isn't changing could you not just truncate the data in the target tables instead of dropping and recreating?
Can you change the database recovery model to Bulk Logged for this process?
Then, instead of creating empty tables at the destination, do a SELECT INTO to create them. Once they are built, alter the tables to add indices and constraints. Doing bulk copies like this will greatly reduce your logging requirements.
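The bulk-logged plus SELECT INTO suggestion could be sketched as follows. DestDb, SourceDb, dbo.Orders, the key column, the index and the backup path are hypothetical names; the sketch assumes both databases are on the same instance, that OrderId is NOT NULL in the source, and that the destination normally runs in FULL recovery.

```sql
-- Switch the destination to BULK_LOGGED for the duration of the copy.
ALTER DATABASE DestDb SET RECOVERY BULK_LOGGED;
GO
USE DestDb;
GO
-- SELECT INTO creates the table and loads it in one minimally logged step.
SELECT *
INTO   dbo.Orders
FROM   SourceDb.dbo.Orders;

-- Add constraints and indexes only after the data is in place.
ALTER TABLE dbo.Orders
ADD CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId);

CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
ON dbo.Orders (OrderDate);
GO
-- Switch back and take a log backup to restart the normal backup chain.
ALTER DATABASE DestDb SET RECOVERY FULL;
BACKUP LOG DestDb TO DISK = N'D:\Backup\DestDb_log.trn';
```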

SQL Server DB size - why is it so large?

I am building a database which contains about 30 tables:
The largest number of columns in a table is about 15.
For datatypes I am mostly using VarChar(50) for text and Int and SmallInt for numbers.
Identity columns are Uniqueidentifiers.
I have been testing a bit, filling in data and deleting it again. I have now deleted all data, so every table is empty.
But if I look inside the properties of the database in Management Studio, the size says 221,38 MB!
How come? Please help, I am getting notifications from my hosting company that I am exceeding my limits.
Best regards,
:-)
I would suggest that you look first at the recovery model for the database. By default, the recovery model is FULL. This fills the log file with all transactions that you perform, never clearing them until you do a log backup.
To change the recovery model, right click on the database and choose Properties. In the list of pages (in the left-hand pane), choose Options. Then change the "Recovery model" to Simple.
You probably also want to shrink your files. To do this, right click on the database and choose Tasks --> Shrink --> Files. You can shrink both the data file and the log file, by changing the "File Type" option in the middle.
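The same steps can be done in T-SQL instead of through the dialogs. In this sketch the database name MyDb, the logical file name MyDb_log and the 10 MB target are placeholders; the real logical file names come from the sys.database_files query.

```sql
-- Switch to the SIMPLE recovery model.
ALTER DATABASE MyDb SET RECOVERY SIMPLE;
GO
USE MyDb;
GO
-- List the files of the current database; size is stored in 8 KB pages.
SELECT name,
       type_desc,
       size * 8 / 1024 AS size_mb
FROM   sys.database_files;

-- Shrink the log file to roughly 10 MB (use the logical name listed above).
DBCC SHRINKFILE (N'MyDb_log', 10);
```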
Martin's comment is quite interesting. Even if the log file is in auto-truncate mode, you still have the issue of deletes being logged. If you created large-ish tables, the log file will still expand, and the space is not recovered until you truncate the file. You can get around this by using TRUNCATE rather than DELETE:
truncate table <table>
does not log every record being deleted (http://msdn.microsoft.com/en-us/library/ms177570.aspx).
delete from <table>
logs every record.
As you do inserts, updates, deletes, and design changes, a log file recording every transaction, along with a whole bunch of other data, is created. This transaction log is a required component of a SQL Server database, and thus cannot be disabled in any available settings.
Below is an article from Microsoft on doing backups to shrink the transaction logs generated by SQL Server.
http://msdn.microsoft.com/en-us/library/ms178037(v=sql.105).aspx
Also, are you indexing your columns? Indexes that consist of several columns on tables with a high row count can become unnecessarily large, especially if you are just doing tests. Try just having a single clustered index on only one column per table.
You may also want to learn about table statistics. They help your indexes out and also help you perform queries like SELECT DISTINCT, or SELECT COUNT(*), etc.
http://msdn.microsoft.com/en-us/library/ms190397.aspx
Finally, you will need to upgrade your storage allocation for the SQL Server database. The more you use it, the faster it will want to grow.