Azure Data Factory, how to supply pipeline schedule date as value to activities - azure-data-factory-2

I previously had a Pipeline parameter of LookBack with a static default value of -1
Then in the child pipelines/activities that was translated into
@adddays(utcnow(), pipeline().parameters.lookback)
That would take today's date and add the -1 (i.e. subtract a day) to give the date that is then supplied to the U-SQL script.
This creates problems when we're reprocessing/back processing and the processing time crosses the UTC day boundary, jumping ahead one day.
To make the scheduling more robust, I've adjusted the Top level pipeline to take in a DateTime field of type string, which is passed to the child pipelines/activities and onto the scripts for an explicit schedule.
@pipeline().parameters.processDate
The top level pipeline is on a standard schedule which runs each day at around 5am. How do I pass that schedule DateTime to the top level pipeline when pipeline parameters don't offer dynamic values, only a static default value?
Ideally I'd like the default value to be the schedule DateTime.

Yes, as Kyle Bunting says, you can create a parameter whose default value is @trigger().scheduledTime.
Then ADF will pass the scheduled time of the trigger to your parameter in the top level pipeline.

Related

BigQuery events calculation

I have a table of events with a start datetime and an event_name. An event_name can be either a start or an end event.
I need to calculate the sum of the differences between the datetimes of start and end event pairs with the following rule: stop event datetime minus the earliest start event datetime before that stop event datetime, e.g.:
Start 1, Start 2, Stop 1, Start 3, Stop 2
sum((Stop1 - Start1),(Stop2-Start3)).
Do you have any ideas how to do it using SQL with analytic functions? From my perspective a loop would be necessary, so either a stored procedure (which is not possible) or Apache Beam with Dataflow for calculating the data.
You can try the following BigQuery SQL to find the difference between two DATETIME fields (see the DATETIME_DIFF reference documentation).
SELECT DATETIME_DIFF(DATETIME '2018-02-20', DATETIME '2018-01-15', DAY) as days_diff;
Then you can write an Apache Beam pipeline that reads the data from BigQuery using the BQ SQL you created above, applies all your transformation logic, and writes to the target (a BigQuery table, GCS, or Bigtable). There is sample Beam code available for this read-transform-write pattern.
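For the analytic-functions part of the question, here is a rough sketch in BigQuery Standard SQL rather than a definitive solution. It assumes a hypothetical table my_dataset.events(event_ts DATETIME, event_name STRING) and interprets the pairing rule, going by the example, as "each Stop pairs with the earliest Start seen since the previous Stop":
WITH ordered AS (
  SELECT
    event_ts,
    event_name,
    -- number of Stop events strictly before this row; each Stop closes a segment
    COUNTIF(event_name = 'Stop') OVER (
      ORDER BY event_ts
      ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
    ) AS segment_id
  FROM my_dataset.events
),
paired AS (
  SELECT
    segment_id,
    -- earliest Start and the (single) Stop within each segment
    MIN(IF(event_name = 'Start', event_ts, NULL)) AS earliest_start_ts,
    MIN(IF(event_name = 'Stop', event_ts, NULL)) AS stop_ts
  FROM ordered
  GROUP BY segment_id
)
SELECT SUM(DATETIME_DIFF(stop_ts, earliest_start_ts, SECOND)) AS total_seconds
FROM paired
WHERE stop_ts IS NOT NULL
  AND earliest_start_ts IS NOT NULL;
On the example sequence Start 1, Start 2, Stop 1, Start 3, Stop 2 this yields (Stop1 - Start1) + (Stop2 - Start3), matching the expected result.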

How to aggregate the time time between pairs of logs in CloudWatch

Suppose you have logs with some transaction ID and timestamp
12:00: transactionID1 handled by funcX
12:01: transactionID2 handled by funcX
12:03: transactionID2 handled by funcY
12:04: transactionID1 handled by funcY
I want to get the time between 2 logs of the same event and aggregate (e.g. sum, avg) the time difference.
For example, for transactionID1 the time diff would be (12:04 - 12:00) 4 min, and for transactionID2 the time diff would be (12:03 - 12:01) 2 min. Then I'd like to take the average of all these time differences, so (4+2)/2 or 3 min.
Is there a way to do that?
This doesn't seem possible with CloudWatch alone. I don't know where your logs come from (e.g. EC2, a Lambda function), but what you could do is use the AWS SDK to create custom metrics.
Approach 1
If the logs are written by the same process, you can keep a map of transactionID and startTime in memory, create a custom metric with transactionID as a dimension, and calculate the metric value from the startTime. If the logs come from different processes, e.g. separate Lambda function invocations, you can use DynamoDB to store the startTime instead.
Approach 2
If the transactions are independent you could also create custom metrics per transaction and use CloudWatch DIFF_TIME which will create a calculated metric with values for each transaction.
With CloudWatch AVG it should then be possible to calculate the average duration.
Personally, I have used the first approach to calculate a duration across Lambda functions and other services.

Pentaho Data Integration Import large dataset from DB

I'm trying to import a large set of data from one DB to another (MSSQL to MySQL).
The transformation does this: gets a subset of data, checks whether each row is an update or an insert by comparing a hash, maps the data, and inserts it into the MySQL DB with an API call.
The subset part is strictly manual for the moment; is there a way to set up Pentaho to do it for me, as a kind of iteration?
The query I'm using to get the subset is
select t1.*
from (
select *, ROW_NUMBER() over (order by id) as RowNum
from mytable
) t1
where RowNum between #offset and #offset + #limit;
Is there a way that PDI can set the offset and iterate over the whole table?
Thanks
You can (despite the warnings) create a loop in a parent job, incrementing the offset variable each iteration in a Javascript step. I've used such a setup to consume webservices with an unknown number of results, shifting the offset each time after I get a full page and stopping when I get less.
Setting up the variables
In the job properties, define parameters Offset and Limit, so you can (re)start at any offset or even invoke the job from the command line with a specific offset and limit. It can be done with a Set Variables step too, but parameters do all the same things, plus you can set defaults for testing.
Processing in the transformation
The main transformation(s) should have "pass parameter values to subtransformation" enabled, as it is by default.
Inside the transformation, you start with a Table Input step that uses variable substitution, putting ${Offset} and ${Limit} where you have #offset and #limit (see the sketch below).
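As a sketch, using the question's own table and column names, the Table Input query with variable substitution enabled could look like this (the boundary is written as > / <= so consecutive pages don't overlap on the boundary row):
SELECT t1.*
FROM (
  SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
  FROM mytable
) t1
WHERE RowNum > ${Offset}
  AND RowNum <= ${Offset} + ${Limit};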
The stream from Table Input then goes to processing, but also is copied to a Group By step for counting rows. Leave the group field empty and create a field that counts all rows. Check the box to always give back a result row.
Send the stream from Group By to a Set Variables step and set the NumRows variable in the scope of the parent job.
Looping back
In the main job, go from the transformations to a Simple Evaluation step to compare the NumRows variable to the Limit. If NumRows is smaller than ${Limit}, you've reached the last batch, success!
If not, proceed to a Javascript step to increment the Offset like this:
// read the current values from the parent job
var offset = parseInt(parent_job.getVariable("Offset"), 10);
var limit = parseInt(parent_job.getVariable("Limit"), 10);
// advance the window by one page
offset = offset + limit;
parent_job.setVariable("Offset", offset);
true;
The job flow then proceeds to the dummy step and then the transformation again, with the new offset value.
Notes
Unlike a transformation, you can set and use a variable within the same job.
The JS step needs "true;" as the last statement so it reports success to the job.

TWSz update application via Java API

I'm trying to update an application in TWSz via the Java API, but when the application has run cycles defined with the Out of Effect date set to 71/12/31, TWSz returns this error:
EQQX375E THE RUN CYCLE VALIDITY END 720101 IS INVALID OR BEFORE/AT THE START
Before every update I have to check whether the application has run cycles and, if it does, check the Out Of Effect dates. If the OOE is 71/12/31 I have to change it to 31-12-71 using setValidTo, but this is very inconvenient. Is there any other way to update an application without updating its Run Cycles?
It looks like the date gains an additional day on the round trip, wrapping to the TWSz minimal date 720101 (January 1st, 1972).
Do you do any conversion of the Java Date returned by the API before sending it back to the update?
I suggest verifying the date and time of the Java Date returned by the API on the get, and comparing it with the Java Date you are passing to the update.
For the TWSz APIs, a Java Date object that carries a date without a time (like the validTo) should be set to the midnight GMT of the date it represents.

how to call function every time the current time is equal to time in my row?

I have a "date" column in my table. I need to call my function for this table every time the current time is equal to the time in my "date" column. I don't know if it's possible to do this in MS SQL Server.
It seems like you are trying to implement some kind of scheduling.
You could try implementing one using a SQL Server service called SQL Server Agent. It may not be a fit for every kind of response to time events, but it should be able to manage certain tasks.
You would need to set up a SQL Server Agent job for it.
A job would need to consist of at least one job step and have at least one schedule to be runnable. Perhaps, it would be easiest for you at this point to use the Transact-SQL type of job step.
A Transact-SQL job step is just a Transact-SQL script, a multi-statement query. In your case it would probably first check if there are rows matching the current time. Then, either for every matching row separately or for the entire set of them, it would perform whatever kind of operation Transact-SQL allows you to perform.
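To make that concrete, here is a rough, hypothetical sketch of a Transact-SQL job step body. The table, column, and procedure names are placeholders, and it assumes the job's schedule fires once a minute and that "my function" is callable as a stored procedure:
-- pick up rows whose [date] fell inside the minute that just elapsed
DECLARE @now      datetime2(0) = SYSDATETIME();
DECLARE @previous datetime2(0) = DATEADD(MINUTE, -1, @now);

DECLARE @id int;

DECLARE due_rows CURSOR LOCAL FAST_FORWARD FOR
    SELECT id
    FROM dbo.MyTable          -- placeholder table with the "date" column
    WHERE [date] >  @previous
      AND [date] <= @now;

OPEN due_rows;
FETCH NEXT FROM due_rows INTO @id;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- "call my function" for this row; assumed here to be a stored procedure
    EXEC dbo.MyRowHandler @id = @id;
    FETCH NEXT FROM due_rows INTO @id;
END

CLOSE due_rows;
DEALLOCATE due_rows;
A set-based alternative would be to pass the whole matching set to a single procedure call instead of looping with a cursor, which is the "entire set of them" option mentioned above.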