What is the most efficient way to refresh a 15M-record SQL Server table from Oracle on a daily basis? - sql

I am using a linked server in SQL Server 2012 to refresh a table from Oracle 9G every day, using the procedure below. The table currently holds 15M records and grows by 2-3K new records a day; old records are also deleted and updated at random. The job takes 7-8 hours to complete overnight. Assuming the table is already optimized at the index level on the Oracle side, what is the most efficient way to do this?
My current process is below:
TRUNCATE TABLE SQLTable
INSERT INTO SQLTable SELECT * FROM OPENQUERY(LinkedServerName, 'SELECT * FROM OracleTable')

It doesn't make sense to truncate 15M rows for just 3000-8000 changed rows.
I would consider using an ETL tool such as Pentaho (https://sourceforge.net/projects/pentaho/). You can start with the free community edition.
It includes Spoon, a graphical interface for building a workflow, and Pan, a command-line tool that executes the file you create in Spoon. Create a batch file that runs the Pan command with the .ktr file as an argument, then schedule that batch file with Windows Task Scheduler or a Unix cron job.
With this, you can create a workflow that looks for changes and inserts or updates only the changed rows.
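Whichever tool drives it, "only insert or update changes" comes down to comparing a freshly loaded copy against the target. Here is a minimal T-SQL sketch of that comparison, assuming the Oracle rows have already been landed in a staging table. The temp tables and the columns Id, Name and Amount are hypothetical stand-ins, not the asker's schema.

```sql
-- Hypothetical tables: #Target stands in for SQLTable, #Staging for the rows
-- pulled from Oracle (for example via OPENQUERY into a staging table).
CREATE TABLE #Target  (Id int PRIMARY KEY, Name varchar(50), Amount decimal(10,2));
CREATE TABLE #Staging (Id int PRIMARY KEY, Name varchar(50), Amount decimal(10,2));

INSERT INTO #Target  VALUES (1, 'a', 10.00), (2, 'b', 20.00), (3, 'c', 30.00);
INSERT INTO #Staging VALUES (1, 'a', 10.00), (2, 'b', 25.00), (4, 'd', 40.00);

-- 1. Delete rows that no longer exist in the source.
DELETE t
FROM #Target AS t
WHERE NOT EXISTS (SELECT 1 FROM #Staging AS s WHERE s.Id = t.Id);

-- 2. Update rows whose non-key columns differ.
--    EXCEPT treats two NULLs as equal, so no explicit NULL handling is needed.
UPDATE t
SET Name = s.Name,
    Amount = s.Amount
FROM #Target AS t
JOIN #Staging AS s ON s.Id = t.Id
WHERE EXISTS (SELECT s.Name, s.Amount EXCEPT SELECT t.Name, t.Amount);

-- 3. Insert rows that are new in the source.
INSERT INTO #Target (Id, Name, Amount)
SELECT s.Id, s.Name, s.Amount
FROM #Staging AS s
WHERE NOT EXISTS (SELECT 1 FROM #Target AS t WHERE t.Id = s.Id);

-- Row 3 is gone, row 2 is updated, row 4 is new, row 1 is untouched.
SELECT Id, Name, Amount FROM #Target ORDER BY Id;
```

Note that this still pulls every row across the link; it only avoids rewriting the unchanged rows on the SQL Server side. If the Oracle table has a reliable last-modified column, the staging load itself could be filtered, but that is an assumption about the source schema.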

Related

SSIS Incremental Load-15 mins

I have two tables: the source table is on a linked server and the destination table is on another server.
I want my data load to happen in the following manner:
Every night, a scheduled job does a full dump, i.e. truncates the table and loads all the data from the source to the destination.
Every 15 minutes, an incremental load runs, because data is ingested into the source every second and I need to replicate it on the destination too.
For the incremental load I currently have scripts stored in a stored procedure, but in future we would like to implement this in SSIS.
The scripts run in the below manner:
I have an Inserted_Date column. I take the max of that column in the destination, delete all destination rows that are greater than or equal to Max(Inserted_Date), and insert the matching rows from the source into the destination. This job runs every 15 minutes.
How can I implement a similar scenario in SSIS?
I have worked in SSIS with a lookup and conditional split on ID columns, but the tables I am working with have a lot of rows, so the lookup takes a long time and is not the right solution for my scenario.
Is there any way I can get the Max(Inserted_Date) logic into an SSIS solution too? My end goal is to replace the script-based approach with the same approach in SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL Task and so on.
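The Max(Inserted_Date) logic from the question can be sketched in plain T-SQL as follows. The temp tables and the Id column are hypothetical. In a package, the first query would typically run in an Execute SQL Task that stores its single-row result in a package variable, and the source query of the data flow would take that variable as a parameter.

```sql
-- Hypothetical stand-ins for the linked-server source and the destination.
CREATE TABLE #Source (Id int, Inserted_Date datetime2(0));
CREATE TABLE #Dest   (Id int, Inserted_Date datetime2(0));

INSERT INTO #Source VALUES
    (1, '2024-01-01 10:00:00'), (2, '2024-01-01 10:15:00'),
    (3, '2024-01-01 10:15:00'), (4, '2024-01-01 10:20:00');
INSERT INTO #Dest VALUES
    (1, '2024-01-01 10:00:00'), (2, '2024-01-01 10:15:00');

-- Step 1: read the high-water mark from the destination.
DECLARE @MaxInserted datetime2(0);
SELECT @MaxInserted = MAX(Inserted_Date) FROM #Dest;

-- Step 2: remove the rows at the boundary, since more rows with the same
-- timestamp may have arrived after the last run.
DELETE FROM #Dest WHERE Inserted_Date >= @MaxInserted;

-- Step 3: reload everything from the boundary onward.
-- An empty destination (NULL mark) falls back to a full load.
INSERT INTO #Dest (Id, Inserted_Date)
SELECT Id, Inserted_Date
FROM #Source
WHERE @MaxInserted IS NULL OR Inserted_Date >= @MaxInserted;

SELECT Id, Inserted_Date FROM #Dest ORDER BY Id;  -- rows 1 to 4
```

Because only rows at or after the mark are read from the source, no lookup against the full destination table is needed.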

SSIS Data QC methods

I receive a data set every month, mostly with the same columns. I load these files manually using the Import/Export Wizard. Usually, I load the data with a date stamp so that I can compare last month's data with the new data. I keep the new data if the variance is less than 5%; otherwise, I have to go back to the vendor and ask for an explanation of the difference.
I'm trying to automate this in SSIS but am not sure how to do the QC part. Any suggestions?
My recommended workflow, in a single SSIS package:
1. Truncate the staging table (Execute SQL Task).
2. Load the incoming monthly file into the staging table (Data Flow Task). If there is a layout issue, the package fails.
3. Compare the staging table against the records of the last monthly load (Execute SQL Task, Expression Task). If the variance is above the threshold, email the vendor (Send Mail Task). The other notification option, which I prefer, is to insert a record into an error logging table and then use SSRS to send out the error notification. Generally, I prefer not doing non-SQL tasks in SSIS.
4. Insert the staging table records into the final table (Data Flow Task) and insert a record into the import log table (Execute SQL Task).
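The comparison step could look roughly like the query below, which an Execute SQL Task might run to return a decision to the package. The question does not say what the 5% variance is measured on, so this sketch assumes it is the total of a numeric column. The Amount and LoadDate columns and both table variables are hypothetical.

```sql
-- Hypothetical tables: @Final holds earlier loads with a date stamp,
-- @Staging holds the file that was just imported.
DECLARE @Final   TABLE (LoadDate date, Amount decimal(18,2));
DECLARE @Staging TABLE (Amount decimal(18,2));

INSERT INTO @Final   VALUES ('2024-01-31', 100.00), ('2024-01-31', 200.00);
INSERT INTO @Staging VALUES (110.00), (200.00);

DECLARE @Prev decimal(18,2), @Curr decimal(18,2), @VariancePct decimal(9,4);

-- Total of the most recent earlier load.
SELECT @Prev = SUM(Amount)
FROM @Final
WHERE LoadDate = (SELECT MAX(LoadDate) FROM @Final);

-- Total of the new file.
SELECT @Curr = SUM(Amount) FROM @Staging;

-- NULLIF avoids a divide-by-zero; a NULL variance falls through to 'hold'.
SET @VariancePct = ABS(@Curr - @Prev) * 100.0 / NULLIF(@Prev, 0);

SELECT @VariancePct AS VariancePct,
       CASE WHEN @VariancePct < 5.0 THEN 'load' ELSE 'hold' END AS Decision;
```

With the sample rows the totals are 300 and 310, a variance of about 3.33%, so the decision is 'load'. A precedence constraint on the returned value can then route the package to either the final insert or the notification.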

Backing up portion of data in SQL

I have a huge schema containing billions of records. I want to purge data older than 13 months from it and keep that data as a backup, so that it can be recovered whenever required.
What is the best way to do this in SQL? Can we create a separate copy of the schema and add a delete trigger on all tables, so that when the trigger fires the purged data is inserted into the new schema?
If we use triggers, will only one record be inserted per delete statement, or all of the deleted records?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud based on your date criteria, without any applications or users being aware of it when querying the database. No backups are required and it is very easy to set up.
There is no need for triggers; you can use a job that runs every day and moves outdated data into archive tables.
The best way, I guess, is to create a copy of the current schema. In the main part, delete everything older than 13 months; in the archive part, delete everything from the last 13 months.
Then create a stored procedure (or several) that collects the outdated data, puts it into the archive and deletes it from the main table. Put this into a daily job.
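The body of such a procedure might look like the sketch below, which moves rows in batches with DELETE ... OUTPUT INTO so that each batch is removed from the main table and written to the archive in one atomic statement. The table names, columns and batch size are hypothetical.

```sql
-- Hypothetical main and archive tables with the same shape.
CREATE TABLE #Main    (Id int PRIMARY KEY, CreatedAt datetime2(0) NOT NULL, Payload varchar(50));
CREATE TABLE #Archive (Id int PRIMARY KEY, CreatedAt datetime2(0) NOT NULL, Payload varchar(50));

INSERT INTO #Main VALUES
    (1, DATEADD(MONTH, -20, SYSDATETIME()), 'old'),
    (2, DATEADD(MONTH, -14, SYSDATETIME()), 'old'),
    (3, DATEADD(MONTH, -1,  SYSDATETIME()), 'recent');

DECLARE @Cutoff    datetime2(0) = DATEADD(MONTH, -13, SYSDATETIME());
DECLARE @BatchSize int = 10000;  -- assumed value; tune for log and lock pressure
DECLARE @Moved     int = 1;

WHILE @Moved > 0
BEGIN
    -- Each pass deletes up to @BatchSize old rows and copies exactly
    -- those rows into the archive.
    DELETE TOP (@BatchSize) FROM #Main
    OUTPUT deleted.Id, deleted.CreatedAt, deleted.Payload
        INTO #Archive (Id, CreatedAt, Payload)
    WHERE CreatedAt < @Cutoff;

    SET @Moved = @@ROWCOUNT;
END;

SELECT Id FROM #Main;     -- only the recent row remains
SELECT Id FROM #Archive;  -- the two rows older than 13 months
```

Small batches keep each transaction short, which matters when the first run has a large backlog to move.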
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table, probably partitioned by month on a date column. Moving the data in a given partition is a metadata operation and is extremely fast (if the partitions and the partition function are set up properly). I have managed 300GB tables using partitioning and it has been very effective. Be careful with the partition function so that dates at each boundary are handled correctly.
Some of the other proposed solutions involve deleting millions of rows, which could take a long, long time to execute. Model the different solutions using Profiler and/or Extended Events to see which is the most efficient.
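A minimal sketch of a monthly partition setup and a partition switch, to show what the metadata operation looks like. All object names and boundary dates are hypothetical, and everything is placed on the PRIMARY filegroup for simplicity.

```sql
-- RANGE RIGHT puts each boundary date into the partition to its right,
-- so '2024-02-01' belongs to February, not January.
CREATE PARTITION FUNCTION pfMonthly (date)
    AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-02-01', '2024-03-01');

CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

-- Main and archive tables must have identical structure and indexes,
-- and the partitioning column must be part of the unique key.
CREATE TABLE dbo.Events (
    EventId   bigint       NOT NULL,
    EventDate date         NOT NULL,
    Payload   varchar(100) NULL,
    CONSTRAINT PK_Events PRIMARY KEY CLUSTERED (EventDate, EventId)
) ON psMonthly (EventDate);

CREATE TABLE dbo.EventsArchive (
    EventId   bigint       NOT NULL,
    EventDate date         NOT NULL,
    Payload   varchar(100) NULL,
    CONSTRAINT PK_EventsArchive PRIMARY KEY CLUSTERED (EventDate, EventId)
) ON psMonthly (EventDate);

INSERT INTO dbo.Events VALUES
    (1, '2024-01-15', 'january'),
    (2, '2024-02-01', 'february, on the boundary');

-- Partition 1 is everything before 2024-01-01; partition 2 is January 2024.
-- The switch moves January to the archive without copying rows.
-- The target partition must be empty.
ALTER TABLE dbo.Events SWITCH PARTITION 2 TO dbo.EventsArchive PARTITION 2;

SELECT EventId FROM dbo.Events;         -- row 2 stays
SELECT EventId FROM dbo.EventsArchive;  -- row 1 moved
```

The boundary row in the sample data is there to check the edge handling the answer warns about: with RANGE RIGHT, the first day of a month stays with that month.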
I agree with the above: don't create a trigger. Triggers fire with every insert/update/delete, which makes them very slow.
You may be best served by a data archive stored procedure.
Consider using multiple databases: the current database holding your current data, plus one or more archive databases. A nightly or monthly stored procedure process moves records out of the current database into the archive.
You can use the exact same schema as your production system.
If the data is already in the database, there is no need for a bulk copy. From there you can back up your archive database so that it is off the SQL Server, and restore the database when the data needs to be available again. This is much faster and more manageable than bulk copy.
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archive, the implementation in SQL 2016 does not appear to support archive and purge.

Purging an SQL table

I have a SQL table that is used for logging (there are lakhs of records in the table). I need to purge the table: take a backup of the data and then clear the table.
Is there a standard way of doing this that I can automate?
You can do this within SQL Server Management Studio, by:
right-clicking the database > Tasks > Generate Scripts
You can then select the table you wish to script out and also choose to include any associated objects, such as constraints and indexes.
[Image: step-by-step backup procedure]
The Stack Overflow question below gives more insight on this:
Table-level backup
As for your automation requirement:
You can use the bcp utility, which copies data between an instance of Microsoft SQL Server and a data file in a user-specified format.
Sample syntax to export:
bcp "select * from [MyDatabase].dbo.Customer " queryout "Customer.bcp" -N -S localhost -T -E
You can automate this command with any scheduling mechanism (cron on UNIX, etc.).
We can simply create a job that runs once a month and:
--> backs up the data into another table, such as an archive table
--> then deletes the data from the main table
It's primitive partitioning, I guess. This way it is more flexible when you need to select past data that has been deleted, because it is still in the archive table where you backed it up.

Move data from one table to another every night SQL server

I have a staging table that contains all the records imported from an XML file. I want to move this data into another table with a check: if the record already exists in the other table, update it; otherwise, insert the new record. I want to create a job or scheduler in SQL Server that does this for me every night without using any SSIS packages.
Have you tried using the MERGE statement?
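For the update-or-insert rule in the question, a MERGE might look like the sketch below. The temp tables and the Id and Name columns are hypothetical. Name is declared NOT NULL so that the inequality test in the MATCHED branch behaves as expected.

```sql
-- Hypothetical target table and staging table loaded from the XML file.
CREATE TABLE #Customer        (Id int PRIMARY KEY, Name varchar(50) NOT NULL);
CREATE TABLE #CustomerStaging (Id int PRIMARY KEY, Name varchar(50) NOT NULL);

INSERT INTO #Customer        VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO #CustomerStaging VALUES (2, 'Robert'), (3, 'Cy');

MERGE #Customer AS t
USING #CustomerStaging AS s
    ON t.Id = s.Id
-- Record already exists: update it, but only if something changed.
WHEN MATCHED AND t.Name <> s.Name THEN
    UPDATE SET t.Name = s.Name
-- Record is new: insert it.
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Name) VALUES (s.Id, s.Name);

SELECT Id, Name FROM #Customer ORDER BY Id;  -- Ann, Robert, Cy
```

Rows that exist only in the target (Id 1 here) are left alone, because no WHEN NOT MATCHED BY SOURCE clause is given.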
SSIS really is an easy way to go with something like this, but if necessary, you can set up a SQL Server Agent job. Take a look at this MSDN Article. Basically, write your validation code in a stored procedure, then create a job with a T-SQL job step that calls that stored procedure.
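As a rough sketch, a nightly job of that kind can be created with the SQL Server Agent stored procedures in msdb. The job name, schedule name, database name and the procedure dbo.usp_MergeStaging are all hypothetical, and the 01:00 start time is only an example.

```sql
-- Create the job.
EXEC msdb.dbo.sp_add_job
    @job_name = N'Nightly staging merge';

-- Add a T-SQL step that calls the (hypothetical) validation procedure.
EXEC msdb.dbo.sp_add_jobstep
    @job_name      = N'Nightly staging merge',
    @step_name     = N'Run merge procedure',
    @subsystem     = N'TSQL',
    @database_name = N'MyDatabase',
    @command       = N'EXEC dbo.usp_MergeStaging;';

-- Daily schedule: freq_type 4 = daily, freq_interval 1 = every day,
-- active_start_time is HHMMSS, so 010000 means 01:00:00.
EXEC msdb.dbo.sp_add_schedule
    @schedule_name     = N'Nightly at 01:00',
    @freq_type         = 4,
    @freq_interval     = 1,
    @active_start_time = 010000;

EXEC msdb.dbo.sp_attach_schedule
    @job_name      = N'Nightly staging merge',
    @schedule_name = N'Nightly at 01:00';

-- Target the local server so the Agent actually runs the job.
EXEC msdb.dbo.sp_add_jobserver
    @job_name = N'Nightly staging merge';
```

The same job can be built through the SQL Server Agent node in Management Studio; the script form is mainly useful for repeatable deployment.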