SSIS Data QC methods - sql

I get a set of monthly data every month, mostly with the same columns. I'm loading these files manually using Import/Export wizard. Usually, I load this data with a date stamp, so that I can compare old data that was provided last month to the new data. I keep the new data if the variance is less than 5%, otherwise, I have to go back to the vendor and ask for an explanation for the difference.
I'm trying to automate this in SSIS but not sure how to do the QC part. Any suggestions?

My recommended workflow in one single SSIS package.
Truncate the staging table SQL Task.
Load the incoming monthly file into the staging table. If there is a layout issue the package fails DFT.
Compare the staging table against the records of the last-monthly load SQL Task, Expression Task. If the variance is above the threshold, email to the vendor Send Email Task. The other option of notification, which I prefer, is to insert a record into an error logging table and then use SSRS to send out the error notification. Generally, I prefer not doing non-sql tasks in SSIS.
Insert the Staging table records into the final table DFT and insert a record into the import log table SQL Task.

Related

How to set up a staging table in SQL with SSIS dataflow?

I am trying to create a dataflow in SSIS where the source data originates from an excel file and reaches to a temporary staging table in a SQL server where I can add various stored procedures to the data.
The dataflow that I have created stores the data permanently on what is supposed to be the staging area.
I would like to get some ideas on creating the staging table in SQL with the SSIS dataflow.
your question is a bit confusing. I suppose that you are maybe trying to make the data loaded in the table of the staging area temporary without keeping the past loaded data.
If I'm right what you're trying to accomplish is a "full resfresh" data flow.
From your description I assume you alerady have the staging table (so no nedd to CREATE it) but you need to truncate it at every run. You can achive this by using a Execut SQL Task element to the control flow with a TRUNCATE TABLE <YOUR TABLE NAME> in it. The data flow loading the data must be in dependency of this task with the result of truncating your table at every run.
If you need to CREATE a table you can do it in the control flow with the Execute SQL Task (you can execute any kind of query with this task), rember to set correctly the connection manager of the task.

SSIS Incremental Load-15 mins

I have 2 tables. The source table being from a linked server and destination table being from the other server.
I want my data load to happen in the following manner:
Everyday at night I have scheduled a job to do a full dump i.e. truncate the table and load all the data from the source to the destination.
Every 15 minutes to do incremental load as data gets ingested into the source on second basis. I need to replicate the same on the destination too.
For incremental load as of now I have created scripts which are stored in a stored procedure but for future purposes we would like to implement SSIS for this case.
The scripts run in the below manner:
I have an Inserted_Date column, on the basis of this column I take the max of that column and delete all the rows that are greater than or equal to the Max(Inserted_Date) and insert all the similar values from the source to the destination. This job runs evert 15 minutes.
How to implement similar scenario in SSIS?
I have worked on SSIS using the lookup and conditional split using ID columns, but these tables I am working with have a lot of rows so lookup takes up a lot of the time and this is not the right solution to be implemented for my scenario.
Is there any way I can get Max(Inserted_Date) logic into SSIS solution too. My end goal is to remove the approach using scripts and replicate the same approach using SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL and so on.

SSIS Alternatives to one-by-one update from RecordSet

I'm looking for a way to speed up the following process: I have a SSIS package that loads data from Excel files on a weekly basis to SQL Server. There are 3 fields: Brand, Date, Value.
In the dataflow, I check for existing combinations of Brand+Date, and new combinations go to the table directly, the existing ones go to a RecordSet destination for updates:
The next step is to update the Value of the existing combinations:
As you can see, there are thousands of records to update, and it takes too long. The number of records tend to grow week by week. Please suggest.
The fastest way will be do this inside a Stored procedure using ELT (Extract Load Transform) approach.
Push all data from excel as is into a table(called load to a staging table in theory). Since you do not seem to be concerned with data validation steps, this table can be a replica of final destination table columns.
Next step is to call a stored procedure using Execute SQL task. Inside this procedure you can put all your business logic. Since this steps with native data manipulation on SQL server entities, it is the fastest alternative.
As a last part, please delete all entries from the staging table.
You can use indexes on staging table to make the SP part even faster.

Move data from one table to another every night SQL server

I have this scenario i have a staging table that contains all the record imported from a XML file .Now i want to move this data based on verification like if the record is already in the other table update the record other wise insert the new record. i want to create a job or scheduler in SQL Server that do this for me every night without using any SSIS packages.
Have you tried using the MERGE statement?
SSIS really is an easy way to go with something like this, but if necessary, you can set up a a SQL server agent job. Take a look at this MSDN Article. Basically, write your validation code in a stored procedure, then create a job with a TSQL job step which calls that stored procedure.

Table Variables in SSIS

In one SQL Task can I create a table variable
DELCARE #TableVar TABLE (...)
Then in another SQL Task or DataSource destination and select or insert into the table variable?
The other option I have considered is using a Temp Table.
CREATE TABLE #TempTable (...)
I would prefer to use Table Variable so that it remains in memory. But can use temp table if it is not possible to use table variable. Also I cannot use the record set destination as I need to preform straight SQL tasks on it later on.
The use case that this is trying to solve is essentially performing a transformation in the stead of BizTalk. There is a very large flat file to flat file transformation that BizTalk has to transform unfortunately the data volume would produce unacceptable load on the BizTalk server so the idea is to off load it to SSIS. However, it is not a simple row to row transformation, there are different types of rows which have relations to each other. The first task in SSIS is to load the row into appropriate (temp) tables, then in the second data task a select is preformed with the correct format for output.
You could use some of the techniques in this post: http://consultingblogs.emc.com/jamiethomson/archive/2006/11/19/SSIS_3A00_-Using-temporary-tables.aspx
especially the ones about using RetainSameConnection=TRUE on the connection manager.
I would be interested to see more information about what use case you have that requires you to write out data to a temp table or table variable before further SSIS processing. Couldn't you take care of all of the SQL required steps in your source query before you start processing the dataflow with SSIS?
Table variables are not kept solely in memory and can be written to disk under memory pressure. I tend to use table variables for very small lookups. If you Need to push a table into SQL Server due to necessary and complex transformations, then use a 'permanent' temp table that is truncated within the SSIS package prior to insert. Simple and will get what you need done.
The SSIS package would be run in a job. I assume it runs inside a SQL job. In that case, using a temp table won't harm. SQL Jobs are generally run after office hours so it does not matter.