Get the list of scheduled triggers - azure-data-factory-2

We are trying to get the maximum scheduled trigger time from the list of scheduled triggers in ADF.
We have one ADF pipeline which has multiple scheduled triggers. The pipeline runs at 6:10, 6:20, 6:30, 6:40, and so on, every 10 minutes until 10 AM UTC. Is there any possible way to get the max of the scheduled triggers, i.e. 10 AM UTC in my case?
We have tried several system variables, but none worked. We might take an API approach to get the job done, but we want to stay native to the ADF world.

You could refer to the ADF REST API Trigger Runs - Query By Factory.
In the request body, define the lastUpdatedAfter and lastUpdatedBefore properties, as in the example below:
{
    "lastUpdatedAfter": "2018-06-16T00:36:44.3345758Z",
    "lastUpdatedBefore": "2018-06-16T00:49:48.3686473Z",
    "filters": [
        {
            "operand": "TriggerName",
            "operator": "Equals",
            "values": [
                "exampleTrigger"
            ]
        }
    ]
}
Then loop over the trigger runs in the response to get the row with the maximum timestamp.
We might take an API approach to get the job done
You could use an Azure HTTP Trigger Function, or use a Web Activity in ADF to call your specific API.
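For illustration, a minimal Python sketch of that approach could look like the following (the same logic could sit inside an Azure Function). The subscription, resource group, factory name and AAD bearer token are placeholders you would supply yourself, the time window and trigger name are copied from the example body above, and it assumes each run in the response carries a triggerRunTimestamp field:
import requests

# Placeholders: fill in your own subscription, resource group, factory and AAD token.
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
factory_name = "<factory-name>"
token = "<aad-bearer-token>"

url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.DataFactory"
    f"/factories/{factory_name}/queryTriggerRuns?api-version=2018-06-01"
)
body = {
    "lastUpdatedAfter": "2018-06-16T00:36:44.3345758Z",
    "lastUpdatedBefore": "2018-06-16T00:49:48.3686473Z",
    "filters": [
        {"operand": "TriggerName", "operator": "Equals", "values": ["exampleTrigger"]}
    ],
}

resp = requests.post(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
runs = resp.json().get("value", [])

# ISO-8601 timestamps sort chronologically as strings, so max() gives the latest run.
latest = max(runs, key=lambda r: r["triggerRunTimestamp"]) if runs else None
print(latest["triggerRunTimestamp"] if latest else "no trigger runs in the window")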


How to monitor Databricks jobs using CLI or Databricks API to get the information about all jobs

I want to monitor the status of the jobs to see whether they are running over time or have failed. If you have a script or any reference, please help me with this. Thanks.
You can use the databricks runs list command to list all the job runs. This will list all runs and their current status: RUNNING/FAILED/SUCCESS/TERMINATED.
If you want to see whether a job is running over, you can then use the databricks runs get --run-id command to get the metadata for that run. This returns JSON from which you can parse out the start_time and end_time.
# Lists job runs.
databricks runs list
# Gets the metadata about a run in json form
databricks runs get --run-id 1234
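If you want to automate the overtime check, a rough Python sketch along those lines could shell out to the CLI and compare timestamps. It assumes the legacy CLI's --output JSON flag and that start_time/end_time come back as epoch milliseconds; the 90-minute threshold is just an illustrative limit:
import json
import subprocess
import time

THRESHOLD_MS = 90 * 60 * 1000  # assumed limit: flag runs longer than 90 minutes

# List recent runs as JSON, then fetch per-run metadata.
runs = json.loads(
    subprocess.check_output(["databricks", "runs", "list", "--output", "JSON"])
).get("runs", [])

for run in runs:
    meta = json.loads(
        subprocess.check_output(
            ["databricks", "runs", "get", "--run-id", str(run["run_id"])]
        )
    )
    start = meta.get("start_time", 0)                      # epoch milliseconds
    end = meta.get("end_time") or int(time.time() * 1000)  # absent/0 while still running
    if end - start > THRESHOLD_MS:
        print(f"run {run['run_id']} is over time: {(end - start) / 60000:.1f} minutes")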
Hope this helps get you on track!

Run Snowflake Task when a data share view is refreshed

Snowflake's documentation illustrates how to have a TASK run on a scheduled basis when there are inserts/updates/deletes or other DML operations run on a table, by creating a STREAM on that specific table.
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
As part of this proposed pipeline, we receive a one-time refresh of a view within a specific time period each day, and the goal would be to kick off a downstream pipeline that runs at most once during that period, when the view is refreshed.
For example, for the TASK schedule 'USING CRON 0,10,20,30,40,50 8-12 * * MON,WED,FRI America/New_York', the downstream pipeline should only run once every Monday, Wednesday, and Friday between 8 AM and 12 PM.
Yes, I can point you to the documentation if you would like to see whether this works for the tables you might already have set up:
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
You could create a stored procedure to monitor the existence of the table. I have not tried that before, though, so I will see if I can ask an expert.
Separately, is there any way to guarantee that the task runs at most once on a specific day or other time period?
Yes, you can use CRON's optional parameters to schedule the task for specific days of the week or times. An example:
CREATE TASK delete_old_data
  WAREHOUSE = deletion_wh
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
AS
  DELETE FROM old_data WHERE created_at < DATEADD(day, -30, CURRENT_TIMESTAMP()); -- illustrative task body (table and filter are placeholders)
Reference: https://docs.snowflake.net/manuals/user-guide/tasks.html more specifically: https://docs.snowflake.net/manuals/sql-reference/sql/create-task.html#optional-parameters
A TASK can only be triggered by a calendar schedule, either directly or indirectly via a predecessor TASK being run by a schedule.
Since the tasks are only run on a schedule, they will not run more often than the schedule says.
A TASK can't be triggered by a data share change, so you have to monitor it on a calendar schedule.
This limitation is bound to be lifted at some point, but it is valid as of December 2019.

Azure Data Factory v2 intermittent error while calling stored procedure

We are running Azure Data Factory v2 with a ForEach loop with a batch count of 4-8, calling several stored procedures and one Copy activity. The targets are all in the same Azure SQL Database. This setup has been running in production for ~8 months.
Suddenly this week, acceptance started to fail intermittently when calling the stored procedures, and production since last night (2019-09-05). All with the same error:
{
    "errorCode": "2011",
    "message": "An error occurred while sending the request.",
    "failureType": "UserError",
    "target": "USP_End_Batch_Successful"
}
There is no pattern. Rerunning the pipeline results in failures in other parts of the ForEach loop. Setting the batch count lower brings no improvement. Load on the database is not high. Log Analytics on the database shows no blocks, deadlocks, dropped connections, etc. Even the most stripped-down and basic stored procedures fail. No data on the database is changed at all.
The retry option does not help: it is set to 1, but the stored procedure is not re-run.
Any clue how to dig further into this problem, or any solution?
Example activity run id: 033ca5ab-c396-407f-8362-794459e4d0c4
Found the cause a few days later: we had a job running that was scaling the database during our ETL window. Running queries therefore got killed at some point, resulting in the error above.

Executing a Dataprep template with Dataflow API holds the timestamp included in the flow recipe

I have a Cloud Function which uses the Dataflow API to create a new job from a template I created using Dataprep. The recipe basically cleans up some JSON objects, turns them into CSV format, and adds a timestamp column, then loads everything into a BigQuery table. The main idea is to take a snapshot of certain information on our platform.
I managed to run the job from the Dataflow API, and the data is correctly inserted into the BigQuery table. However, the value of the timestamp field is always the same, and it corresponds to the execution time of the job I took the template from (the Dataprep template). When I run the job from the Dataprep interface, this timestamp is inserted correctly, but it does not change when I execute the job from the Cloud Function using the same template.
The snippet of code which calls the dataflow API:
// Launch a Dataflow job from the Dataprep-generated template.
dataflow.projects.templates.launch({
  projectId: projectId,
  location: location,
  gcsPath: jobTemplateUrl,
  resource: {
    parameters: {
      inputLocations: `{"location1": "gs://${file.bucket}/${file.name}"}`,
      outputLocations: `{"location1": "${location2}"}`,
      customGcsTempLocation: `gs://${destination.bucket}/${destination.tempFolder}`
    },
    environment: {
      tempLocation: `gs://${destination.bucket}/${destination.tempFolder}`,
      zone: "us-central1-f"
    },
    jobName: 'user-global-stats-flow'
  }
});
This is a snapshot of the Dataflow execution console; as can be seen, the latest jobs are the ones executed from the Cloud Function, and the one at the bottom was executed from the Dataprep interface:
Dataflow console snapshot
This is the part of the recipe in charge of creating the timestamp:
Dataprep recipe sample
Finally, this is what is inserted into the BigQuery table, where the first insertion with that timestamp (row 4) corresponds to the job executed from Dataprep, and the rest are the executions from the Cloud Function with the Dataflow API:
Big Query Insertions
So the question is whether there is a way to make the timestamp resolve at job execution time for the insertion, because right now it looks like it is fixed in the template's recipe.
Thanks for your help in advance.
If I understand correctly, this is documented behaviour. From the list of known limitations when running a Dataprep template through Dataflow:
All relative functions are computed based on the moment of execution. Functions such as NOW() and TODAY are not recomputed when the Cloud Dataflow template is executed.

Copy failed records to dynamo db

I am copying 50 million records to Amazon DynamoDB using a Hive script. The script failed after running for 2 days with an item-size-exceeded exception.
Now if I restart the script, it will start the insertions again from the first record. Is there a way to say something like "insert only those records which are not already in DynamoDB"?
You can use conditional writes to write an item only if the specified attributes do not match the values you provide. This is done by using a ConditionExpression in a PutItem request. However, a conditional write still uses write capacity even if it fails (emphasis mine), so this may not even be the best option for you:
If a ConditionExpression fails during a conditional write, DynamoDB will still consume one write capacity unit from the table. A failed conditional write will return a ConditionalCheckFailedException instead of the expected response from the write operation. For this reason, you will not receive any information about the write capacity unit that was consumed. However, you can view the ConsumedWriteCapacityUnits metric for the table in Amazon CloudWatch to determine the provisioned write capacity that was consumed from the table.
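As a concrete illustration, here is a minimal Python (boto3) sketch of such a conditional write; the table name, the pk partition-key attribute, and the put_if_absent helper are hypothetical placeholders:
import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("my_table")  # hypothetical table name

def put_if_absent(item):
    """Write the item only if no item with the same partition key already exists."""
    try:
        table.put_item(
            Item=item,
            ConditionExpression="attribute_not_exists(pk)",  # pk = partition key attribute
        )
        return True
    except ClientError as e:
        if e.response["Error"]["Code"] == "ConditionalCheckFailedException":
            return False  # already present; skipped (note: this still consumes write capacity)
        raise

put_if_absent({"pk": "record-123", "payload": "..."})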