Azure ADF - how to proceed with pipeline activities only after new files arrival? - azure-data-factory-2

I have written a generic file-arrival check as a Databricks notebook. It accepts a list of file names plus a time window that defines acceptable freshness of the files. Many pipelines call this notebook with their own file-name tuples, and at the end the notebook returns True or False to indicate whether the next workflow activity may start. So far so good.
My question is: how do I use this in an Azure ADF pipeline so that, if the check fails, the pipeline waits for 30 minutes or so and then runs the notebook again?
The notebook should run first, so that if the new files are already there the pipeline does not wait at all.

Since you are talking about the Notebook activity, you can add a Wait activity on the "On failure" path and set the wait time. After the Wait, add an Execute Pipeline activity. That Execute Pipeline should point to a pipeline containing another Execute Pipeline that calls back into the main pipeline with the Notebook activity. It is basically a cycle, but it only executes when you have a failure.
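For completeness, here is a minimal sketch of the notebook side of that pattern; the widget names and the check_files_arrived helper are placeholders, not the asker's actual code:

# Databricks notebook sketch; check_files_arrived and the widget names are placeholders.
def check_files_arrived(file_names, max_age_minutes):
    # ... existing freshness logic that inspects the landing folder ...
    return True  # False when files are missing or stale

file_names = dbutils.widgets.get("file_names").split(",")
max_age_minutes = int(dbutils.widgets.get("max_age_minutes"))

# The Notebook activity in ADF exposes this exit value as
# @activity('CheckFiles').output.runOutput, which the failure path (or an
# If Condition) can test before letting the downstream activities run.
dbutils.notebook.exit(str(check_files_arrived(file_names, max_age_minutes)))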

Related

Avoid running same job by two EC2 instances

I am using APScheduler in its decorator style to run jobs at certain intervals. The problem is that when the code below is deployed on two EC2 instances, the same job runs twice at essentially the same time, only milliseconds apart.
My question is: how do I avoid the same job being run by both EC2 instances at the same time, or do I need a different design pattern here? I want the job to run only once, on one of the servers.
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

@sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
    print("Hello World")

sched.start()
If you can share any locking mechanism examples, it would be appreciated.
You can use the AWS SDK or AWS CLI to look up the ID of the instance the code is currently running on and compare it against a designated instance ID:
if instance_id == "your designated instance id":
    # write your code here
The scheduler will still fire on every instance you have, but the job body will only execute on that one specific instance.
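As a rough sketch of that guard, assuming the code runs on EC2 where the instance metadata endpoint is reachable (the designated instance ID below is a placeholder):

import urllib.request
from apscheduler.schedulers.blocking import BlockingScheduler

DESIGNATED_INSTANCE_ID = "i-0123456789abcdef0"  # placeholder: the one instance allowed to do the work

def current_instance_id():
    # EC2 exposes the running instance's ID via the instance metadata service.
    with urllib.request.urlopen(
            "http://169.254.169.254/latest/meta-data/instance-id", timeout=2) as resp:
        return resp.read().decode()

sched = BlockingScheduler()

@sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
    # The scheduler fires on every instance, but only the designated one runs the job body.
    if current_instance_id() != DESIGNATED_INSTANCE_ID:
        return
    print("Hello World")

sched.start()

A proper distributed lock (for example a conditional write to a shared table) is more robust if the designated instance can be replaced, but the ID check above is the simplest form of what this answer describes.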

How to speed up the time it takes for UI Server to update and allow dynamic DAG to be triggered?

I have a DAG Generator that takes a JSON input and creates a new dynamic DAG in the dags directory. The time it takes for that newly created DAG to be available to use (through the API) can range from 2 seconds to 5 minutes.
I ran the test 100 times:
1. Create a new DAG (with the same input JSON, so the dynamic DAGs are identical).
2. Once the DAG is saved in the dags directory, start sending API requests to see if the DAG can be triggered.
3. Track the seconds that passed before I was able to successfully trigger the DAG.
Results are as follows:
[14.81, 6.44, 6.38, 6.36, 2.21, 6.42, 18.96, 23.14, 23.11, 14.82, 6.39, 23.10, 18.93, 14.80, 23.20, 31.49, 48.29, 35.83, 27.20, 18.96, 14.80, 44.14, 35.66, 35.77, 39.92, 31.50, 69.15, 48.22, 69.29, 39.87, 10.53, 69.15, 27.37, 48.22, 77.51, 39.90, 27.35, 65.03, 69.16, 31.47, 65.06, 90.00, 2.19, 111.33, 69.19, 98.46, 90.16, 27.28, 60.89, 56.57, 110.96, 18.92, 140.55, 39.95, 94.22, 85.89, 44.29, 94.54, 69.21, 136.20, 35.72, 102.57, 102.63, 81.72, 98.58, 77.55, 148.83, 102.79, 136.38, 115.22, 94.38, 148.68, 119.43, 48.24, 178.09, 81.80, 127.64, 119.59, 44.22, 194.88, 23.17, 170.00, 211.47, 153.18, 249.55, 182.40, 152.98, 86.00, 157.02, 98.54, 270.02, 81.75, 153.04, 69.23, 265.92, 27.30, 278.64, 23.19, 269.98, 81.91]
Average Time: 79.35 seconds
You can see that as the number of files in the dags folder increased, the time it took for the DAG to become triggerable also increased, but it is still somewhat random. Is there any way to keep this consistent (without restarting the Airflow server after each creation), or to speed it up?
Thank you!
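For reference, the polling loop described above could look roughly like this, assuming Airflow 2's stable REST API; the base URL, credentials, and DAG id are placeholders:

import time
import requests

BASE_URL = "http://localhost:8080/api/v1"   # placeholder Airflow webserver URL
AUTH = ("airflow", "airflow")                # placeholder credentials
DAG_ID = "generated_dag_001"                 # placeholder id of the freshly generated DAG

start = time.monotonic()
while True:
    # Try to trigger the DAG; a 404 means the scheduler has not parsed it yet.
    resp = requests.post(f"{BASE_URL}/dags/{DAG_ID}/dagRuns", json={"conf": {}}, auth=AUTH)
    if resp.status_code in (200, 201):
        break
    time.sleep(2)
print(f"DAG became triggerable after {time.monotonic() - start:.2f} seconds")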

LeaseAlreadyPresent Error in Azure Data Factory V2

I am getting the following error in a pipeline that has a Copy activity with a REST API source and Azure Data Lake Storage Gen2 as the sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The pipeline runs in a for loop with Batch size = 5. When I make it sequential, the error goes away, but I need to run it in parallel.
This is a known ADF limitation: pipeline variables are not thread-safe when a ForEach runs its iterations in parallel.
You are probably building the sink file name with a Set Variable activity inside the loop.
One option is to move the per-iteration logic into a child pipeline so that each iteration gets its own variable scope, i.e. Set Variable -> Execute Pipeline.
Or remove the variables altogether and hard-code the expression directly in the activity (for example, build the file path inline with a concat() expression instead of a variable).
Hope this helps.

Change Google Cloud Dataflow BigQuery Priority

I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job, it takes minutes before it starts reading data from the (tiny) table. It turns out the Dataflow job sends off a BigQuery job that runs in BATCH mode rather than INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);
  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH")            // <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");
  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}
If it's really a problem for you that the query runs in BATCH mode, then one workaround could be:
1. Using the BigQuery API directly, roll your own initial request and set the priority to INTERACTIVE.
2. Write the results of step 1 to a temp table.
3. In your Beam pipeline, read the temp table using BigQueryIO.Read.from().
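A rough sketch of step 1 with the google-cloud-bigquery Python client (the project, dataset, and table names are placeholders, not from the question):

from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

job_config = bigquery.QueryJobConfig(
    destination="my-project.tmp_dataset.beam_input",       # hypothetical temp table for step 2
    priority=bigquery.QueryPriority.INTERACTIVE,           # run immediately instead of being queued
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
query_job = client.query("SELECT * FROM `my-project.dataset.tiny_table`", job_config=job_config)
query_job.result()  # wait for the interactive query to finish before launching the Beam job

The Beam pipeline would then read my-project.tmp_dataset.beam_input instead of issuing its own BATCH query.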
You can configure the queries to run with "Interactive" priority by passing a priority parameter. Check this GitHub example for details.
Please note that you might be running into one of the BigQuery limits and quotas: with batch priority, if you hit a rate limit the query is queued and retried later, whereas an interactive query that hits those limits fails immediately. This is because BigQuery assumes that an interactive query is something you need to run right away.

JitterBit Run Only One Instance of an Operation at a Time

I ran into an issue where I had long running JitterBit operations that were scheduled. I had them scheduled close together, since I needed to keep data flowing. But, when they would take longer than expected I would wind up with multiple instances of the operation set running at the same time. This was killing my performance.
I'll put the fix in the answer below.
To resolve this issue I added an additional Script operation at the beginning of my operation set (with the schedule running on this operation). The script simply checks whether one of the operations in the set is already running. If not, it starts the next operation. If anything is running, it exits and waits until the next scheduled instance.
This is a sample of my script. This one assumes that there were originally two operations in this operation set.
<trans>
$isInQueue = GetOperationQueue("<TAG>Operations/OperationToCheck01</TAG>");
$isInQueue2 = GetOperationQueue("<TAG>Operations/OperationToCheck02</TAG>");
$isRunning = $isInQueue[0][1];
$isRunning2 = $isInQueue2[0][1];
if(($isRunning == 1 && $isRunning != Null()) || ($isRunning2 == 1 && $isRunning2 != Null()),
    WriteToOperationLog("Skip for now: " + $isRunning + " / " + $isRunning2);,
    WriteToOperationLog("Nothing is running - starting operation chain.");
    RunOperation("<TAG>Operations/OperationToCheck01</TAG>");
);
</trans>