Can I automate an SSIS package that requires user input?

I've been developing a data pipeline in SSIS on an on-premises VM during my internship, and was tasked with gathering data from Marketo (https://www.marketo.com/). This package runs without error: it starts with a truncate-table Execute SQL Task, followed by 5 Data Flow Tasks that gather data from different sources within Marketo and move them to staging tables in SQL Server, and concludes with an Execute SQL Task that loads the processing tables with only the new data.
The problem I'm having: my project lead wants this process automated to run daily. I have found plenty of resources online that show how to automate an SSIS package, but my package requires user input for the Marketo source: it asks for the time frame from which to gather data.
Is it possible to automate this package to run daily even with user input required? I was thinking there may be a way to increment the date value by one for the start and end dates (So start date could be 2018-07-01, and end date could be 2018-07-02, incrementing each day by one), to make the package run by itself. Thank you in advance for any help!

As you are automating your extract, this suggests that you have a predefined schedule on which to pull that data. From this schedule, you should be able to work out your start and end dates based on the date that the package was run.
In SSIS there are numerous ways to achieve this, depending on the data source and your connection methods. If you are using a Script Task, you can simply calculate the required dates in your .NET code. Another alternative is to use expression-based variables that return the result of an expression, such as:
DATEADD("Month", -1, GETDATE())
Assuming you schedule your extract to run on the first day of the month, the expression above would return the first day of the previous month.
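For the daily window described in the question, the same approach works with two expression-based variables that are re-evaluated on every run. A sketch, assuming variables named User::StartDate and User::EndDate (the `name = expression` layout is just for presentation; each expression goes in that variable's Expression property, with EvaluateAsExpression set to True):

```
User::StartDate = (DT_DBDATE)DATEADD("Day", -1, GETDATE())
User::EndDate   = (DT_DBDATE)GETDATE()
```

On each run, StartDate evaluates to yesterday and EndDate to today, so a daily SQL Agent job rolls the window forward by one day automatically; the two variables can then be mapped to the Marketo source's time-frame inputs.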

Related

Azure Data Factory - Rerun Failed Pipeline Against Azure SQL Table With Differential Date Filter

I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL cloud DB is quite large (tens of millions of rows), so I have the pipeline set to use an UPSERT (a merge; I'm trying to create a differential merge). I am using a filter on the source table, and the Filter Query has a WHERE condition that looks like this:
[HistoryDate] >= '#{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '#{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a getUTCDate() type approach. New records will always get a higher value and be included in the WHERE condition.
This works well, but here is my question: I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL instance is "offline" during that run. When I move this to production, hibernation should not be an issue, but what happens if the client's connection is temporarily lost (i.e., the client loses internet for a time)? Because my pipeline has a WHERE condition on the source to keep the upsert to a practical number of rows, any failure would mean losing the data created during that 5-minute window.
A failed pipeline can be rerun, but the run time would be different at that moment in time and I would effectively miss the block of records that would have been picked up if the pipeline had been run on time. pipeline().parameters.windowStart and pipeline().parameters.windowEnd will now be different.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it seems there was a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" utility/wizard on the home screen that creates a pipeline. When I used it, it offered a "Delta Load" option for tables, which takes a "watermark". The watermark is a column such as an incrementing IDENTITY column, an increasing date, or a timestamp. At the end of the wizard, it lets you download a script that builds a control table and a corresponding stored procedure that maintains the value of each parameter after each run. For example, if I want my delta load to be based on an IDENTITY column, it stores the max value seen in a particular pipeline run. The next time a run happens (on a trigger), it uses that value (minus 1) as the MIN and the current MAX of the IDENTITY column to pick up the records added since the last run.
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)
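The control table and stored procedure the wizard generates can be sketched roughly as follows; all names here are illustrative assumptions, not the wizard's actual output:

```sql
-- Control table: one watermark row per source table
CREATE TABLE dbo.WatermarkTable (
    TableName      sysname NOT NULL PRIMARY KEY,
    WatermarkValue bigint  NOT NULL  -- e.g. the max IDENTITY value already copied
);
GO

-- Called by the pipeline after each successful copy
CREATE PROCEDURE dbo.usp_WriteWatermark
    @TableName sysname,
    @NewValue  bigint
AS
BEGIN
    UPDATE dbo.WatermarkTable
    SET    WatermarkValue = @NewValue
    WHERE  TableName = @TableName;
END;
GO
```

The copy activity's source query then filters on something like `WHERE Id > <stored watermark> AND Id <= <current max>`, so a failed run can simply be rerun: the watermark only advances after a successful copy, and no window of records is lost.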

SSIS Incremental Load-15 mins

I have 2 tables: the source table on a linked server, and the destination table on another server.
I want my data load to happen in the following manner:
Every day at night, a scheduled job does a full dump, i.e. truncates the table and loads all the data from the source to the destination.
Every 15 minutes, an incremental load runs, since data gets ingested into the source on a per-second basis and I need to replicate it to the destination.
For the incremental load, I currently have scripts stored in a stored procedure, but going forward we would like to implement this in SSIS.
The scripts work as follows:
I have an Inserted_Date column. I take the max of that column, delete all rows with Inserted_Date greater than or equal to Max(Inserted_Date), and insert all the corresponding rows from the source into the destination. This job runs every 15 minutes.
How to implement similar scenario in SSIS?
I have worked in SSIS with Lookup and Conditional Split on ID columns, but the tables I am working with have a lot of rows, so the lookup takes a long time and is not the right solution for my scenario.
Is there any way to get the Max(Inserted_Date) logic into the SSIS solution as well? My end goal is to drop the script-based approach and replicate the same approach in SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL Task, and so on.
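The Max(Inserted_Date) logic from the question maps onto that control flow as a chain of Execute SQL Tasks, sketched below. Table names are placeholders, and `?` is the OLE DB parameter marker bound to an SSIS variable (e.g. User::MaxInsertedDate) holding the value captured in the first task:

```sql
-- Execute SQL Task 1: capture the destination's high-water mark
-- (ResultSet = Single row, mapped to User::MaxInsertedDate)
SELECT MAX(Inserted_Date) AS MaxInsertedDate
FROM dbo.DestinationTable;

-- Execute SQL Task 2: remove boundary rows that may be incomplete
DELETE FROM dbo.DestinationTable
WHERE Inserted_Date >= ?;

-- Execute SQL Task 3 (or a Data Flow source query): reload from the boundary on
INSERT INTO dbo.DestinationTable
SELECT *
FROM [LinkedServer].[SourceDb].dbo.SourceTable
WHERE Inserted_Date >= ?;
```

This avoids the Lookup entirely; the per-run cost is one MAX() query (cheap if Inserted_Date is indexed) plus only the rows at or after the boundary.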

SSIS Data QC methods

I get a set of monthly data every month, mostly with the same columns. I'm loading these files manually using the Import/Export Wizard. Usually I load this data with a date stamp so that I can compare the old data provided last month to the new data. I keep the new data if the variance is less than 5%; otherwise, I have to go back to the vendor and ask for an explanation of the difference.
I'm trying to automate this in SSIS but not sure how to do the QC part. Any suggestions?
My recommended workflow, in one single SSIS package:
Truncate the staging table (Execute SQL Task).
Load the incoming monthly file into the staging table (Data Flow Task). If there is a layout issue, the package fails here.
Compare the staging table against the records of last month's load (Execute SQL Task or Expression Task). If the variance is above the threshold, email the vendor (Send Mail Task). The other notification option, which I prefer, is to insert a record into an error logging table and then use SSRS to send out the error notification. Generally, I prefer not doing non-SQL tasks in SSIS.
Insert the staging table records into the final table (Data Flow Task) and insert a record into the import log table (Execute SQL Task).
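The comparison in the third step could be a single Execute SQL Task along these lines; the 5% threshold comes from the question, while the table and column names are assumptions:

```sql
-- Compare this month's staging row count against last month's load and
-- log an error row when the variance exceeds 5%
DECLARE @CurrentCount  decimal(18,2) = (SELECT COUNT(*) FROM dbo.Staging);
DECLARE @PreviousCount decimal(18,2) =
    (SELECT COUNT(*) FROM dbo.MonthlyData
     WHERE LoadDate = (SELECT MAX(LoadDate) FROM dbo.MonthlyData));

IF ABS(@CurrentCount - @PreviousCount) / NULLIF(@PreviousCount, 0) > 0.05
    INSERT INTO dbo.ErrorLog (LoggedAt, ErrorMessage)
    VALUES (GETDATE(), 'Monthly file variance exceeds 5%; vendor follow-up required');
```

The same pattern extends to column-level checks (e.g. the SUM of a key measure) if row counts alone are too coarse a QC signal.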

Run Snowflake Task when a data share view is refreshed

Snowflake's documentation illustrates how to have a TASK run on a scheduled basis when there are inserts, updates, deletes, or other DML operations on a table, by creating a STREAM on that specific table.
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
As part of this proposed pipeline, we receive a one-time refresh of a view within a specific time period in a day and the goal would be to start a downstream pipeline that runs at most once during that time period, when the view is refreshed.
For example, for the following TASK schedule
'USING CRON 0,10,20,30,40,50 8-12 * * MON,WED,FRI America/New_York', the downstream pipeline should only run once every Monday, Wednesday, and Friday between 8 and 12.
Yes, I can point you to the documentation if you would like to check whether this works for the tables you already have set up:
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
You could create a stored procedure to monitor the existence of the table; I have not tried that before, though, so I will see if I can ask an expert.
Separately, is there any way to guarantee that the task runs at most once on a specific day or other time period?
Yes, you can use the optional CRON schedule syntax to restrict runs to specific days of the week or times of day. For example:
CREATE TASK delete_old_data
WAREHOUSE = deletion_wh
SCHEDULE = 'USING CRON 0 0 * * * UTC';
Reference: https://docs.snowflake.net/manuals/user-guide/tasks.html more specifically: https://docs.snowflake.net/manuals/sql-reference/sql/create-task.html#optional-parameters
A TASK can only be triggered by a calendar schedule, either directly or indirectly via a predecessor TASK being run by a schedule.
Since the tasks are only run on a schedule, they will not run more often than the schedule says.
A TASK can't be triggered by a data share change, so you have to monitor it on a calendar schedule.
This limitation is bound to be lifted at some point, but is valid as of December 2019.
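Given that constraint, the closest approximation is a predecessor chain: a root TASK on the question's CRON schedule plus a dependent child TASK that performs the downstream work. A sketch, in which all object names and the check_view_refreshed() procedure are hypothetical:

```sql
-- Root task: wakes up on the calendar schedule and runs the refresh check
CREATE TASK root_task
  WAREHOUSE = my_wh
  SCHEDULE = 'USING CRON 0,10,20,30,40,50 8-12 * * MON,WED,FRI America/New_York'
AS
  CALL check_view_refreshed();  -- hypothetical procedure probing the shared view

-- Child task: runs only after the root task completes
CREATE TASK downstream_task
  WAREHOUSE = my_wh
  AFTER root_task
AS
  INSERT INTO target_table SELECT * FROM shared_db.public.shared_view;

-- Tasks are created suspended; resume the child before the root
ALTER TASK downstream_task RESUME;
ALTER TASK root_task RESUME;
```

Enforcing the at-most-once-per-window behaviour would still be up to the procedure, e.g. by recording in a control table that the day's run has already happened.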

PENTAHO 7.1 - Generating a large number of different reports by script

On my Pentaho CE 7.1 installation I often need to generate a large number of reports (*.prpt) with different attributes.
For example, I have a report that shows data for a day, and I need to generate those reports for each day since September 2017.
Is there any way to create a script that would execute those *.prpt files one by one for each day from September 2017 until now?
I have been checking the API in the official Pentaho documentation, but there does not seem to be such an option. Perhaps some kind of hack, like passing parameters in the URL?
Create your *.prpt with the Report Designer and use a parameter to select one day in your data.
Then open PDI, with a first step that generates the dates starting from 2017-09-10, and pass each date to a Pentaho Reporting Output step. Then do what you need with the report output (mail it, save it in the pentaho-solutions folder, ...).
You have a use case very similar in the sample directory which is shipped with the Pentaho Data Integrator. It is named Pentaho Reporting Output Example.ktr.