I am using APScheduler's decorator style to run jobs at fixed intervals. The problem is that when the code below is deployed on two EC2 instances, the same job runs twice at the same time, only milliseconds apart.
My question is: how can I prevent two EC2 instances from running the same job at the same time, or do I need a different design pattern for this case? I want the job to run only once, on either one of the servers.
from datetime import datetime
from apscheduler.schedulers.blocking import BlockingScheduler

sched = BlockingScheduler()

# Register the job first; start() blocks, so it must come last.
@sched.scheduled_job('interval', id='my_job_id', hours=2)
def job_function():
    print("Hello World")

sched.start()
I would appreciate any examples of a locking mechanism.
You can use the AWS SDK or AWS CLI to look up the ID of the instance the code is running on, and guard the job with a check:
if instance_id == "your instance id":
    # write your code here
The cron will still fire on every instance you have, but your code will only be executed on that specific instance.
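A minimal sketch of that check in Python, assuming the instance ID is read from the EC2 instance metadata endpoint without a session token (instances that enforce IMDSv2 need a token first, which is omitted here). The names run_if_designated, fetch_instance_id and DESIGNATED_INSTANCE_ID are hypothetical, and the ID value is a placeholder:

```python
import urllib.request

# Instance metadata endpoint, reachable only from inside an EC2 instance.
METADATA_URL = "http://169.254.169.254/latest/meta-data/instance-id"

# Hypothetical placeholder: the one instance that is allowed to run the job.
DESIGNATED_INSTANCE_ID = "i-0123456789abcdef0"


def fetch_instance_id(timeout=2.0):
    """Return the ID of the instance this code is running on."""
    with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
        return resp.read().decode("utf-8").strip()


def run_if_designated(job, designated_id, get_instance_id=fetch_instance_id):
    """Call job() only on the designated instance; report whether it ran."""
    if get_instance_id() == designated_id:
        job()
        return True
    return False
```

Inside the scheduled function you would then call run_if_designated(real_work, DESIGNATED_INSTANCE_ID). Note that this pins the job to one machine: if that instance goes down, the job does not run anywhere until the ID is changed.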
I have a DAG Generator that takes a JSON input and creates a new dynamic DAG in the dags directory. The time it takes for that newly created DAG to be available to use (through the API) can range from 2 seconds to 5 minutes.
I ran the test 100 times:
1. Create a new DAG (with the same input JSON, so the dynamic DAGs are identical).
2. Once the DAG is saved in the dags directory, start sending API requests to see if the DAG can be triggered.
3. Track the seconds passed before I was able to successfully trigger the DAG.
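The measurement loop in these steps can be sketched roughly as below. This is not the code used for the test: try_trigger is a hypothetical callable that sends one trigger request and returns True once the DAG accepts it, and the polling interval and timeout are assumed values.

```python
import time


def seconds_until_triggerable(try_trigger, poll_interval=1.0, timeout=600.0,
                              clock=time.monotonic, sleep=time.sleep):
    """Poll try_trigger() until it succeeds and return the seconds elapsed.

    try_trigger is a hypothetical callable that performs one API request
    and returns True when the DAG was triggered successfully.
    """
    start = clock()
    while True:
        if try_trigger():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError("DAG was not triggerable within the timeout")
        sleep(poll_interval)
```

The measured value is only as precise as poll_interval, so a short interval matters when the expected delay is a few seconds.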
Results are as follows:
[14.81, 6.44, 6.38, 6.36, 2.21, 6.42, 18.96, 23.14, 23.11, 14.82, 6.39, 23.10, 18.93, 14.80, 23.20, 31.49, 48.29, 35.83, 27.20, 18.96, 14.80, 44.14, 35.66, 35.77, 39.92, 31.50, 69.15, 48.22, 69.29, 39.87, 10.53, 69.15, 27.37, 48.22, 77.51, 39.90, 27.35, 65.03, 69.16, 31.47, 65.06, 90.00, 2.19, 111.33, 69.19, 98.46, 90.16, 27.28, 60.89, 56.57, 110.96, 18.92, 140.55, 39.95, 94.22, 85.89, 44.29, 94.54, 69.21, 136.20, 35.72, 102.57, 102.63, 81.72, 98.58, 77.55, 148.83, 102.79, 136.38, 115.22, 94.38, 148.68, 119.43, 48.24, 178.09, 81.80, 127.64, 119.59, 44.22, 194.88, 23.17, 170.00, 211.47, 153.18, 249.55, 182.40, 152.98, 86.00, 157.02, 98.54, 270.02, 81.75, 153.04, 69.23, 265.92, 27.30, 278.64, 23.19, 269.98, 81.91]
Average Time: 79.35 seconds
You can see that as the number of files in the dags folder increased, the time before the DAG could be triggered also increased, but it's still somewhat random. Is there any way to keep this consistent (without restarting the Airflow server after each creation), or to speed it up?
Thank you!
I am getting the following error in a pipeline that has a Copy activity with a REST API as the source and Azure Data Lake Storage Gen2 as the sink.
"message": "Failure happened on 'Sink' side. ErrorCode=AdlsGen2OperationFailed,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=ADLS Gen2 operation failed for: Operation returned an invalid status code 'Conflict'. Account: '{Storage Account Name}'. FileSystem: '{Container Name}'. Path: 'foodics_v2/Burgerizzr/transactional/_567a2g7a/2018-02-09/raw/inventory-transactions.json'. ErrorCode: 'LeaseAlreadyPresent'. Message: 'There is already a lease present.'. RequestId: 'd27f1a3d-d01f-0003-28fb-400303000000'..,Source=Microsoft.DataTransfer.ClientLibrary,''Type=Microsoft.Azure.Storage.Data.Models.ErrorSchemaException,Message=Operation returned an invalid status code 'Conflict',Source=Microsoft.DataTransfer.ClientLibrary,'",
The pipeline runs in a ForEach loop with Batch size = 5. When I make it sequential, the error goes away, but I need to run it in parallel.
This is a known ADF limitation with variables when loop iterations run in parallel.
You are probably building the file name from a variable.
One option is to move the variable logic into a child pipeline that the loop calls on each iteration,
i.e. variable -> Execute Pipeline
or
remove those variables and hard-code their expressions directly in the Azure activity.
Hope this helps
I have a Beam job running on Google Cloud Dataflow that reads data from BigQuery. When I run the job, it takes minutes before it starts reading data from the (tiny) table. It turns out the Dataflow job sends off a BigQuery job that runs in BATCH mode and not in INTERACTIVE mode. How can I switch this to run immediately in Apache Beam? I couldn't find a method in the API to change the priority.
Maybe a Googler will correct me, but no, you cannot change this from BATCH to INTERACTIVE because it's not exposed by Beam's API.
From org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.java (here):
private void executeQuery(
    String executingProject,
    String jobId,
    TableReference destinationTable,
    JobService jobService) throws IOException, InterruptedException {
  JobReference jobRef = new JobReference()
      .setProjectId(executingProject)
      .setJobId(jobId);
  JobConfigurationQuery queryConfig = createBasicQueryConfig()
      .setAllowLargeResults(true)
      .setCreateDisposition("CREATE_IF_NEEDED")
      .setDestinationTable(destinationTable)
      .setPriority("BATCH")  // <-- NOT EXPOSED
      .setWriteDisposition("WRITE_EMPTY");
  jobService.startQueryJob(jobRef, queryConfig);
  Job job = jobService.pollJob(jobRef, JOB_POLL_MAX_RETRIES);
  if (parseStatus(job) != Status.SUCCEEDED) {
    throw new IOException(String.format(
        "Query job %s failed, status: %s.", jobId, statusToPrettyString(job.getStatus())));
  }
}
If it's really a problem for you that the query is running in BATCH mode, then one workaround could be:
1. Using the BigQuery API directly, roll your own initial request, and set the priority to INTERACTIVE.
2. Write the results of step 1 to a temp table.
3. In your Beam pipeline, read the temp table using BigQueryIO.Read.from().
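As a rough illustration of steps 1 and 2, the sketch below builds the request body for the BigQuery REST jobs.insert method with the priority set to INTERACTIVE and a destination table for the results. Authentication and the HTTP call itself are left out, and the function name and the choice of WRITE_TRUNCATE are assumptions, not part of the answer above.

```python
import json


def build_interactive_query_job(sql, project_id, dataset_id, table_id):
    """Return a jobs.insert request body for an INTERACTIVE query job.

    The query result is written to the given (temporary) table so that a
    Beam pipeline can read it afterwards.
    """
    return {
        "configuration": {
            "query": {
                "query": sql,
                # The field that Beam's executeQuery hard-codes to "BATCH".
                "priority": "INTERACTIVE",
                "destinationTable": {
                    "projectId": project_id,
                    "datasetId": dataset_id,
                    "tableId": table_id,
                },
                "createDisposition": "CREATE_IF_NEEDED",
                # Assumption: overwrite the temp table on every run.
                "writeDisposition": "WRITE_TRUNCATE",
                "useLegacySql": False,
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical project, dataset, and table names.
    body = build_interactive_query_job(
        "SELECT 1 AS x", "my-project", "my_dataset", "tmp_results")
    print(json.dumps(body, indent=2))
```

The body is kept as a plain dictionary so that it can be sent with whichever authenticated HTTP client or Google client library you already use.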
You can configure the queries to run with "Interactive" priority by passing a priority parameter. Check this GitHub example for details.
Please note that you might hit one of the BigQuery limits and quotas. With batch priority, a query that hits a rate limit is queued and retried later. With interactive priority, a query that hits these limits fails immediately, because BigQuery assumes that an interactive query is something you need to run immediately.
I ran into an issue where I had long-running JitterBit operations that were scheduled. I had them scheduled close together, since I needed to keep data flowing. But when they took longer than expected, I would wind up with multiple instances of the operation set running at the same time. This was killing my performance.
I'll put the fix in the answer below.
To resolve this issue I added an additional Script Operation at the beginning of my operation set (with the schedule running on this operation). This script simply checks whether one of the operations in this set is already running. If not, it starts the next operation. If anything is running, it exits and waits until the next scheduled instance.
This is a sample of my script. This one assumes that there were originally two operations in this operation set.
<trans>
$isInQueue=GetOperationQueue("<TAG>Operations/OperationToCheck01</TAG>");
$isInQueue2=GetOperationQueue("<TAG>Operations/OperationToCheck02</TAG>");
$isRunning=$isInQueue[0][1];
$isRunning2=$isInQueue2[0][1];
if(($isRunning==1 && $isRunning!=Null()) || ($isRunning2==1 && $isRunning2!=Null()),
    WriteToOperationLog("Skip for now: "+$isRunning+" / "+$isRunning2);,
    WriteToOperationLog("Nothing is running - starting operation chain.");
    RunOperation("<TAG>Operations/OperationToCheck01</TAG>");
);
</trans>
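The same "skip this run if the previous one is still going" guard can be illustrated outside JitterBit. The Python sketch below is only an analogy for a single process, not a translation of the script above; run_if_idle is a hypothetical name, and a non-blocking lock stands in for the GetOperationQueue check.

```python
import threading

# One shared lock stands in for "is any operation in the set still running?".
_run_lock = threading.Lock()


def run_if_idle(operation, lock=_run_lock):
    """Run operation() unless a previous run still holds the lock.

    Returns True if the operation ran, False if this run was skipped.
    """
    if not lock.acquire(blocking=False):
        # Something is still running: skip and wait for the next schedule.
        return False
    try:
        operation()
        return True
    finally:
        lock.release()
```

A scheduler would call run_if_idle(chain) on every tick; ticks that arrive while the chain is still running return False immediately instead of piling up.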