I need to push a large SQL table from my local instance to SQL Azure. The transfer is a simple, 'clean' upload - simply push the data into a new, empty table.
The table is extremely large (~100 million rows) and consists only of GUIDs and other simple types (no timestamp or anything).
I created an SSIS package using the Data Import / Export Wizard in SSMS. The package works great.
The problem is when the package is run over a slow or intermittent connection. If the internet connection goes down halfway through, then there is no way to 'resume' the transfer.
What is the best approach to engineering an SSIS package to upload this data, in a resumable fashion? i.e. in case of connection failure, or to allow the job to be run only between specific time windows.
Normally, in a situation like that, I'd design the package to enumerate through batches of size N (1k rows, 10M rows, whatever) and log to a processing table what the last successfully transmitted batch was. However, with GUIDs you can't quite partition them out into buckets.
In this particular case, I would modify your data flow to look like Source -> Lookup -> Destination. In your lookup transformation, query the Azure side and only retrieve the keys (SELECT myGuid FROM myTable). Here, we're only going to be interested in rows that don't have a match in the lookup recordset as those are the ones pending transmission.
A full cache is going to cost about 1.6 GB of memory (100M * 16 bytes) for the keys alone, assuming the Azure side was fully populated, plus the associated data transfer costs. That cost will be less than truncating and re-transferring all the data, but I want to make sure I call it out.
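As a rough illustration of what the Lookup step does (this is plain Python, not SSIS), the sketch below loads the destination keys into a set, which plays the role of the full cache, and passes through only the source rows with no match. The names `pending_rows` and `full_cache_bytes` and the sample rows are made up for the example.

```python
import uuid
from typing import Iterable, Iterator, Set, Tuple

Row = Tuple[uuid.UUID, str]  # (key, payload); the payload column is hypothetical


def pending_rows(source: Iterable[Row], dest_keys: Set[uuid.UUID]) -> Iterator[Row]:
    """Yield only the source rows whose key has no match on the destination."""
    for row in source:
        if row[0] not in dest_keys:
            yield row


def full_cache_bytes(row_count: int, key_size: int = 16) -> int:
    """Raw size of the cached keys alone (ignores any per-entry overhead)."""
    return row_count * key_size


# Five source rows, two of which already made it to the destination.
source = [(uuid.UUID(int=i), f"row {i}") for i in range(1, 6)]
already_uploaded = {uuid.UUID(int=1), uuid.UUID(int=2)}
todo = list(pending_rows(source, already_uploaded))
```

Here `todo` holds rows 3 to 5, and `full_cache_bytes(100_000_000)` gives the raw key size for a fully populated destination.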
Just order by your GUID when uploading. And make sure you use the max(guid) from Azure as your starting point when recovering from a failure or restart.
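A minimal sketch of that restart logic, using two in-memory SQLite databases as stand-ins for the local instance and Azure. The table `t`, its columns, and the `fail_after` switch (which simulates a dropped connection) are assumptions made for the example. Keys are zero-padded hex strings so that plain text ordering is consistent; on SQL Server the `ORDER BY`, the `>` comparison and `MAX()` all need to use the same uniqueidentifier ordering, which they do when both sides are SQL Server.

```python
import sqlite3


def make_db(rows=()):
    """Create an in-memory database with one table and optional seed rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE t (id TEXT PRIMARY KEY, val INTEGER)")
    db.executemany("INSERT INTO t VALUES (?, ?)", rows)
    db.commit()
    return db


def upload_in_batches(src, dst, batch_size, fail_after=None):
    """Copy rows in key order, committing per batch; resume from MAX(id) on dst."""
    sent = 0
    while True:
        # The destination itself records progress: everything <= MAX(id) arrived.
        last = dst.execute("SELECT MAX(id) FROM t").fetchone()[0] or ""
        rows = src.execute(
            "SELECT id, val FROM t WHERE id > ? ORDER BY id LIMIT ?",
            (last, batch_size),
        ).fetchall()
        if not rows:
            return
        if fail_after is not None and sent >= fail_after:
            raise ConnectionError("simulated connection loss")
        dst.executemany("INSERT INTO t (id, val) VALUES (?, ?)", rows)
        dst.commit()
        sent += 1


src = make_db((f"{i:032x}", i) for i in range(10))
dst = make_db()

try:
    upload_in_batches(src, dst, batch_size=3, fail_after=2)  # dies mid-transfer
except ConnectionError:
    pass
rows_after_failure = dst.execute("SELECT COUNT(*) FROM t").fetchone()[0]

upload_in_batches(src, dst, batch_size=3)  # rerun picks up where it stopped
rows_after_resume = dst.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

This only works if a batch is committed in full or not at all; a half-written batch would make `MAX(id)` skip rows.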
I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL Cloud DB is quite large (tens of millions of rows), so I have the pipeline set to use an UPSERT (a merge, to get a differential merge). I am using a filter on the source table, and the Filter Query has a WHERE condition that looks like this:
[HistoryDate] >= '@{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '@{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a getUTCDate() type approach. New records will always get a higher value and be included in the WHERE condition.
This works well, but here is my question: I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL instance is "offline" during that run. When I move this to production, hibernation should not be an issue, but what happens if the client's connection is temporarily lost (i.e., the client loses internet for a time)? Because my pipeline has a WHERE condition on the source to reduce the upsert to a practical number of rows, any failure would result in the loss of any data created during that 5-minute window.
A failed pipeline can be rerun, but the run time would be different at that moment in time and I would effectively miss the block of records that would have been picked up if the pipeline had been run on time. pipeline().parameters.windowStart and pipeline().parameters.windowEnd will now be different.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it seems there was a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" utility/wizard on the home screen that creates a pipeline. When I used it, it offered a "Delta Load" option for tables, which takes a "Watermark". The watermark is a column with steadily increasing values: an incrementing IDENTITY column, an increasing date or timestamp, etc. At the end of the wizard, it allows you to download a script that builds a table and a corresponding stored procedure that maintain the value of each parameter after each run. For example, if I wanted my delta load to be based on an IDENTITY column, it stores the max value seen by a particular pipeline run. The next time a run happens (trigger), it uses this as the MIN value (minus 1) and the current MAX value of the IDENTITY column to get the records added since the last run.
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)
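For anyone who wants to see the watermark idea outside of ADF, here is a small sketch using in-memory SQLite. It is not the script the wizard generates; the `orders` and `watermark` tables and the `delta_load` function are made up. The point is that the window comes from stored state instead of the trigger time, so a failed or late run just picks up a bigger window next time.

```python
import sqlite3


def delta_load(src, dst, table="orders"):
    """Copy rows above the stored watermark, then advance the watermark."""
    low = dst.execute(
        "SELECT last_value FROM watermark WHERE table_name = ?", (table,)
    ).fetchone()[0]
    high = src.execute("SELECT MAX(id) FROM orders").fetchone()[0]
    if high is None or high <= low:
        return 0  # nothing new since the last successful run
    rows = src.execute(
        "SELECT id, amount FROM orders WHERE id > ? AND id <= ?", (low, high)
    ).fetchall()
    with dst:  # one transaction: rows and watermark commit together or not at all
        dst.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)", rows)
        dst.execute(
            "UPDATE watermark SET last_value = ? WHERE table_name = ?", (high, table)
        )
    return len(rows)


src = sqlite3.connect(":memory:")
src.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")

dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
dst.execute("CREATE TABLE watermark (table_name TEXT PRIMARY KEY, last_value INTEGER)")
dst.execute("INSERT INTO watermark VALUES ('orders', 0)")
dst.commit()

src.executemany("INSERT INTO orders (amount) VALUES (?)", [(10.0,), (20.0,), (30.0,)])
src.commit()
first_run = delta_load(src, dst)

src.executemany("INSERT INTO orders (amount) VALUES (?)", [(40.0,), (50.0,)])
src.commit()
second_run = delta_load(src, dst)
```

If a run fails before the transaction commits, the watermark stays where it was and the next run covers the missed rows as well.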
Requirements:
1. Transfer millions of records from source (SQL Server) to destination (SQL Server).
2. Structure of source tables is different from destination tables.
3. Refresh data once per week in destination server.
4. Minimum amount of time for the processing.
I am looking for an optimized approach using SSIS.
I was thinking of these options:
1. Create a SQL dump from the source server and import that dump on the destination server.
2. Directly copy the tables from the source server to the destination server.
Lots of issues to consider here. Such as are the servers in the same domain, on same network, etc.
Most of the time you will not want to move the data as a single large chunk of millions of records but in smaller amounts. An SSIS package handles that logic for you, but you can also recreate it yourself; either way, iterating over the changes in smaller batches is easier. Sometimes this is a reason to push changes more often rather than wait an entire week, as smaller syncs are easier to manage with less downtime.
Another consideration is to be sure you understand your deltas and to ensure that you have ALL of the changes. For this reason I would generally suggest using a staging table at the destination server. By moving changes to staging and then loading to the final table, you can more easily ensure that changes are applied correctly. Think of the scenario of an increment being out of order (identity insert), a datetime ordered incorrectly, or 1 chunk failing. When using a staging table you don't have to rely solely on the id/date and can actually do joins on primary keys to look for changes.
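A minimal sketch of the staging pattern, with in-memory SQLite standing in for the destination server. The `target` and `staging` tables and the `apply_staging` function are hypothetical. Changes land in staging in whatever order they arrive, and are then applied by joining on the primary key inside a single transaction.

```python
import sqlite3


def apply_staging(db):
    """Apply staged rows to the target table by primary key, then clear staging."""
    with db:  # all-or-nothing: a failure leaves both tables untouched
        # Rows that exist on both sides but differ: take the staged value.
        db.execute(
            "UPDATE target SET val = "
            "(SELECT s.val FROM staging s WHERE s.id = target.id) "
            "WHERE EXISTS (SELECT 1 FROM staging s "
            "WHERE s.id = target.id AND s.val <> target.val)"
        )
        # Rows that exist only in staging: insert them.
        db.execute(
            "INSERT INTO target (id, val) "
            "SELECT s.id, s.val FROM staging s "
            "WHERE NOT EXISTS (SELECT 1 FROM target t WHERE t.id = s.id)"
        )
        db.execute("DELETE FROM staging")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, val TEXT)")
db.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, val TEXT)")
db.executemany("INSERT INTO target VALUES (?, ?)", [(1, "a"), (2, "b")])
# Staged changes arrive out of order: a new row 3, then an update to row 2.
db.executemany("INSERT INTO staging VALUES (?, ?)", [(3, "c"), (2, "B")])
db.commit()

apply_staging(db)
final_rows = db.execute("SELECT id, val FROM target ORDER BY id").fetchall()
```

Because the comparison is on keys and values, the order in which the staged rows arrived does not matter.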
Linked Servers proposed by Alex K. can be a great fit, but you will need to pay close attention to a couple of things. Always do it from the destination server so that it is a PULL, not a push. Linked servers are fast at querying the data but horrible at updating/inserting in bulk. The table cannot contain even one XML column. You may need to set some specific properties for distributed transactions.
I have done this task both ways and I would say that SSIS does give a bit of an advantage over Linked Server just because of its robust error handling, threading logic, and ability to use different adapters (OLE DB, ODBC, etc.; they perform differently, and a search will turn up some comparisons). But the key to your #4 is to do it in smaller chunks and from a staging table, and if you can do it more often, each run is less likely to have an impact. E.g. daily means each run would already be ~1/7th of the weekly size, assuming an even daily distribution of changes.
Take 10,000,000 records changed a week:
Once weekly = 10 million
Once daily = ~1.4 million
Once hourly = ~59K records
Once every 5 minutes = fewer than 5K records
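The arithmetic behind those figures, as a quick Python check. It assumes changes are spread evenly over the week; the variable names are made up.

```python
WEEKLY_CHANGES = 10_000_000

# Number of runs per week for each schedule.
runs_per_week = {
    "weekly": 1,
    "daily": 7,
    "hourly": 7 * 24,
    "every 5 minutes": 7 * 24 * 12,
}

# Average number of changed rows each run has to move.
rows_per_run = {
    name: round(WEEKLY_CHANGES / runs) for name, runs in runs_per_week.items()
}
```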
And if it has to be once a week, think about still doing it in small chunks so that each insert has a smaller effect on your transaction logs, actual lock time on the production table, etc. Be sure that you never allow loading of partially staged/transferred data; otherwise identifying deltas could get messed up and you could end up missing changes.
One other thought, if this is a scenario like a reporting instance and you have enough server resources: you could bring over your entire table from production into a staging table (or update a copy of the table at the destination), then simply drop the current table and rename the staging table. This is an extreme scenario and not one I generally like, but it is possible and the actual impact to the user would be minimal.
I think SSIS is good at transferring data. My approach here:
1. Create a package with one Data Flow Task to transfer data. If the structure of two tables is different then it's okay, just map them.
2. Create a SQL Server Agent job to run your package every weekend
Also, the Track Data Changes (SQL Server) feature is worth a look. You can configure when you want to sync data, and it performs well too.
With SQL Server versions >2005, it has been my experience that a dump to a file with an export is equal to or slower than transferring data directly from table to table with SSIS.
That said, and in addition to the excellent points @Matt makes, this is the usual pattern I follow for this sort of transfer.
1. Create a set of tables in your destination database that have the same table schemas as the tables in your source system.
   I typically put these into their own database schema so their purpose is clear.
   I also typically use the SSIS OLE DB Destination's "New" button to create the tables.
   Mind the square brackets on [Schema].[TableName] when editing the CREATE TABLE statement it provides.
2. Use SSIS Data Flow tasks to pull the data from the source to the replica tables in the destination.
   This can be one package or many, depending on how many tables you're pulling over.
3. Create stored procedures in your destination database to transform the data into the shape it needs to be in the final tables.
   Using SSIS data transformations is, almost without exception, less efficient than using server side SQL processing.
4. Use SSIS Execute SQL tasks to call the stored procedures.
   Use parallel processing via Sequence Containers where possible to save time.
   This can be one package or many, depending on how many tables you're transforming.
5. (Optional) If the transformations are complex, requiring intermediate data sets, you may want to create a separate Staging database schema for this step.
You will have to decide whether you want to use the stored procedures to land the data in your ultimate destination tables, or if you want to have the procedures write to intermediate tables, and then move the transformed data directly into the final tables. Using intermediate tables minimizes down time on the final tables, but if your transformations are simple or very fast, this may not be an issue for you.
If you use intermediate tables, you will need a package or packages to manage the final data load into the destination tables.
Depending on the number of packages all of this takes, you may want to create a Master SSIS package that will call the extraction package(s), then the transformation package(s), and then, if you use intermediate processing tables, the final load package(s).
I have an ETL process that will run periodically. I was using kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with table input and table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know if the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will take them). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction allows me to get the data in a consistent way and doesn't block the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think the "Table Input" step doesn't take into account any modifications that happen while you are reading. Try a simple experiment:
Take a .ktr file with a single table input and table output. Try loading the data into the target table. While in the middle of the data load, insert a few records in the source database. You will find that those records are not read into the target table. (Note: I tried this with a PostgreSQL db, and the number of rows read was 1000000.)
Now for your question, I suggest using PDI since it gives you more control over the data in terms of versioning, sequences, SCDs and all the DWBI related activities. PDI makes it easier to load to the stage env. rather than simply dumping the entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation then at least they all start at same time but whilst likely to be consistent it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk insert or extract it's nearly always better to use the vendor-provided tools.
I have an application that produces approximately 15,000 rows in a table named ExampleLog for each task. Each task has a TaskID that is saved in a table named TaskTable, so it's possible to retrieve data from the ExampleLog table to run some queries.
The problem is that the ExampleLog table is getting very big, since I run at least 1 task every day. At the moment my ExampleLog table is over 60 GB.
I would like to take the 15,000 rows that belong to a TaskID, compress them (or just zip them), and save the compressed data somewhere inside the database as a BLOB or as FILESTREAM. But it is important for me to be able to query the compressed or zipped data easily and run queries against it efficiently. (I don't know if that's possible, or whether I would lose out in terms of performance.)
PS: The compressed data should not be considered as backup data.
Can someone recommend a good approach or technique to resolve this problem? My focus is on the speed of the queries running on ExampleLog and on the space taken on disk.
I'm using SQL Server 2008 on Windows 7
Consider Read-Only Filegroups and Compression.
Using NTFS Compression with Read-Only User-defined Filegroups and Read-Only Databases
SQL Server supports NTFS compression of read-only user-defined filegroups and read-only databases. You should consider compressing read-only data in the following situations: You have a large volume of static or historical data that must be available for limited read-only access. You have limited disk space.
Also, you can try and estimate the gains from page compression applied to the log table using Data Compression Wizard.
Denis's answer could not solve my problem completely; however, I will use it for some optimization inside the DB.
Regarding the problem of storing data in packages/groups, there are 2 solutions to my problem:
The first solution is the use of the Partitioned Table and Index Concepts.
For example, if a current month of data is primarily used for INSERT, UPDATE, DELETE, and MERGE operations while previous months are used primarily for SELECT queries, managing this table may be easier if it is partitioned by month. This benefit can be especially true if regular maintenance operations on the table only have to target a subset of the data. If the table is not partitioned, these operations can consume lots of resources on an entire data set. With partitioning, maintenance operations, such as index rebuilds and defragmentations, can be performed on a single month of write-only data, for example, while the read-only data is still available for online access.
The second solution is to build, in code (C# in my case), a List or Dictionary of the rows from a task, zip it, and save it inside a FILESTREAM (SQL Server) on the DB server. The data will later be retrieved by Id; the zip will be decompressed and the data will be ready to use.
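A rough sketch of that second solution. It is in Python instead of C#, and uses a SQLite BLOB column as a stand-in for FILESTREAM; the `task_archive` table, the row layout and the helper names are assumptions. Note that the compressed payload cannot be filtered with SQL: it has to be fetched by task id and decompressed before any row-level query.

```python
import json
import sqlite3
import zlib


def pack_rows(rows):
    """Serialize a task's rows to JSON and compress them into one blob."""
    return zlib.compress(json.dumps(rows).encode("utf-8"))


def unpack_rows(blob):
    """Decompress a blob back into the original list of rows."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE task_archive (task_id INTEGER PRIMARY KEY, payload BLOB)")

# One task's worth of log rows (15,000, as in the question).
rows = [{"task_id": 1, "seq": i, "message": "step completed"} for i in range(15000)]
blob = pack_rows(rows)
db.execute("INSERT INTO task_archive VALUES (?, ?)", (1, blob))
db.commit()

# Later: fetch by task id, decompress, and work on the rows in memory.
stored = db.execute(
    "SELECT payload FROM task_archive WHERE task_id = ?", (1,)
).fetchone()[0]
restored = unpack_rows(stored)
```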
We have decided to use the second solution.
I am a C# developer and I am not really good with SQL. I have a simple question here. I need to move more than 50 million records from one database to another. I tried to use the import function in MS SQL, but it got stuck because the log was full (I got the error message: The transaction log for database 'mydatabase' is full due to 'LOG_BACKUP'). The database recovery model was set to simple. My friend said that importing millions of records using Tasks -> Import Data will cause the log to become massive, and told me to use a loop to transfer the data instead. Does anyone know how and why? Thanks in advance.
If you are moving the entire database, use backup and restore, it will be the quickest and easiest.
http://technet.microsoft.com/en-us/library/ms187048.aspx
If you are just moving a single table read about and use the BCP command line tools for this many records:
The bcp utility bulk copies data between an instance of Microsoft SQL Server and a data file in a user-specified format. The bcp utility can be used to import large numbers of new rows into SQL Server tables or to export data out of tables into data files. Except when used with the queryout option, the utility requires no knowledge of Transact-SQL. To import data into a table, you must either use a format file created for that table or understand the structure of the table and the types of data that are valid for its columns.
http://technet.microsoft.com/en-us/library/ms162802.aspx
The fastest and probably most reliable way is to bulk copy the data out via SQL Server's bcp.exe utility. If the schema on the destination database is exactly identical to that on the source database, including nullability of columns, export it in "native format":
http://technet.microsoft.com/en-us/library/ms191232.aspx
http://technet.microsoft.com/en-us/library/ms189941.aspx
If the schema differs between source and target, you will encounter...interesting (yes, interesting is a good word for it) problems.
If the schemas differ or you need to perform any transforms on the data, consider using text format. Or another format (BCP lets you create and use a format file to specify the format of the data for export/import).
You might consider exporting data in chunks: if you encounter problems it gives you an easier time of restarting without losing all the work done so far.
You might also consider zipping the exported data files up to minimize time on the wire.
Then FTP the files over to the destination server.
bcp them in. You can use the bcp utility on the destination server or the BULK INSERT statement in SQL Server to do the work. Makes no real difference.
The nice thing about using BCP to load the data is that the load is what is described as a 'non-logged' transaction, though it's really more like a 'minimally logged' transaction.
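To illustrate the "export in chunks and zip them" idea without bcp itself, here is a small Python sketch that writes gzip-compressed CSV chunks and skips any chunk that already exists, so a rerun only redoes the missing work. The file naming scheme and the `export_chunks` function are made up; with bcp you would get a similar effect by exporting key ranges with queryout.

```python
import csv
import gzip
import os
import tempfile


def export_chunks(rows, out_dir, chunk_size):
    """Write rows to gzip-compressed CSV chunks, skipping chunks already on disk."""
    written = []
    for start in range(0, len(rows), chunk_size):
        path = os.path.join(out_dir, f"chunk_{start // chunk_size:05d}.csv.gz")
        if os.path.exists(path):
            continue  # finished in an earlier run
        tmp = path + ".part"
        with gzip.open(tmp, "wt", newline="", encoding="utf-8") as fh:
            csv.writer(fh).writerows(rows[start:start + chunk_size])
        os.replace(tmp, path)  # rename marks the chunk as complete
        written.append(path)
    return written


out_dir = tempfile.mkdtemp()
rows = [(i, f"value {i}") for i in range(10)]

first = export_chunks(rows, out_dir, chunk_size=4)   # writes every chunk
os.remove(first[-1])                                 # pretend the last one never finished
second = export_chunks(rows, out_dir, chunk_size=4)  # rewrites only the missing chunk
```

Writing to a `.part` file first means an interrupted chunk is never mistaken for a finished one.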
If the tables on the destination server have IDENTITY columns, you'll need to use the SET IDENTITY_INSERT statement to allow explicit values in the identity column on the table(s) involved for the nonce (don't forget to turn it back off). After your data is imported, you'll need to run DBCC CHECKIDENT to get things back in sync.
And depending on what you're doing, it can sometimes be helpful to put the database in single-user mode or dbo-only mode for the duration of the surgery: http://msdn.microsoft.com/en-us/library/bb522682.aspx
Another approach I've used to great effect is to use Perl's DBI/DBD modules (which provide access to the bulk copy interface) and write a perl script to suck out the data from the source server, transform it and bulk load it directly into the destination server, without having to save it to disk and move it. Also means you can trap errors and design things for recovery and restart right at the point of failure.
Use BCP to migrate data.
Another approach I have used in the past is to take a backup of the transaction log and shrink the log prior to the migration. Split the migration script into parts and run the log backup, shrink, migrate iteration a few times.