How to speed up the time it takes for the UI server to update and allow a dynamic DAG to be triggered?

I have a DAG Generator that takes a JSON input and creates a new dynamic DAG in the dags directory. The time it takes for that newly created DAG to be available to use (through the API) can range from 2 seconds to 5 minutes.
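For context, the generator is not shown here; a minimal sketch of the kind of thing it does (the dags path, dag_id field, and template are assumptions) is rendering a Python DAG file from the JSON and dropping it in the dags folder, after which the scheduler has to parse it before the API will accept a trigger:

import json
from pathlib import Path

DAG_TEMPLATE = '''
from datetime import datetime
from airflow import DAG
from airflow.operators.empty import EmptyOperator

with DAG(dag_id="{dag_id}", start_date=datetime(2023, 1, 1), schedule_interval=None) as dag:
    EmptyOperator(task_id="placeholder")
'''

def generate_dag(json_spec, dags_dir="/opt/airflow/dags"):
    spec = json.loads(json_spec)
    # Write a new DAG file into the dags folder; it only becomes triggerable
    # once the scheduler has picked it up and parsed it.
    path = Path(dags_dir) / f"{spec['dag_id']}.py"
    path.write_text(DAG_TEMPLATE.format(dag_id=spec["dag_id"]))
    return path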
I ran the test 100 times:
1. Create a new DAG (with the same input JSON, so the dynamic DAGs are identical).
2. Once the DAG is saved in the dags directory, start sending API requests to see if the DAG can be triggered.
3. Track the seconds passed before I was able to successfully trigger the DAG.
Results are as follows:
[14.81, 6.44, 6.38, 6.36, 2.21, 6.42, 18.96, 23.14, 23.11, 14.82, 6.39, 23.10, 18.93, 14.80, 23.20, 31.49, 48.29, 35.83, 27.20, 18.96, 14.80, 44.14, 35.66, 35.77, 39.92, 31.50, 69.15, 48.22, 69.29, 39.87, 10.53, 69.15, 27.37, 48.22, 77.51, 39.90, 27.35, 65.03, 69.16, 31.47, 65.06, 90.00, 2.19, 111.33, 69.19, 98.46, 90.16, 27.28, 60.89, 56.57, 110.96, 18.92, 140.55, 39.95, 94.22, 85.89, 44.29, 94.54, 69.21, 136.20, 35.72, 102.57, 102.63, 81.72, 98.58, 77.55, 148.83, 102.79, 136.38, 115.22, 94.38, 148.68, 119.43, 48.24, 178.09, 81.80, 127.64, 119.59, 44.22, 194.88, 23.17, 170.00, 211.47, 153.18, 249.55, 182.40, 152.98, 86.00, 157.02, 98.54, 270.02, 81.75, 153.04, 69.23, 265.92, 27.30, 278.64, 23.19, 269.98, 81.91]
Average Time: 79.35 seconds
You can see that as the number of files in the dags folder increased, the time it took for the DAG to become triggerable also increased, but it's still somewhat random. Is there any way to keep this consistent (without restarting the Airflow server after each creation), or to speed it up?
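For reference, the check in step 2 is essentially a polling loop against the trigger endpoint; a minimal sketch, assuming Airflow's stable REST API with basic auth (the base URL and credentials are placeholders):

import time
import requests

AIRFLOW_API = "http://localhost:8080/api/v1"   # placeholder base URL
AUTH = ("admin", "admin")                       # placeholder credentials

def seconds_until_triggerable(dag_id, poll_interval=2, max_wait=600):
    """Poll the dagRuns endpoint until the newly generated DAG can be triggered."""
    start = time.monotonic()
    while time.monotonic() - start < max_wait:
        resp = requests.post(f"{AIRFLOW_API}/dags/{dag_id}/dagRuns",
                             json={"conf": {}}, auth=AUTH)
        if resp.ok:
            # The scheduler has parsed the file and the run was created.
            return time.monotonic() - start
        time.sleep(poll_interval)  # typically 404 until the DAG is parsed; retry
    raise TimeoutError(f"{dag_id} never became triggerable")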
Thank you!

Related

How to run Airflow S3 sensor exactly once?

I want to continue the DAG only if a csv file exists in S3, otherwise it should just end.
The DAG itself is being scheduled hourly.
with DAG(dag_id="my_dag",
start_date=datetime(2023, 1, 1),
schedule_interval='#hourly',
catchup=False
) as dag:
check_for_new_csv = S3KeySensor(
task_id='check_for_new_csv',
bucket_name='bucket-data',
bucket_key='*.csv',
wildcard_match=True,
soft_fail=True,
retries=1
)
start_instance = EC2StartInstanceOperator(
task_id="start_ec2_instance_task",
instance_id=INSTANCE_ID,
region_name=REGION
)
check_for_new_csv >> start_instance
But the sensor seems to run forever - in the log I can see it keeps on running:
[2023-01-10, 15:02:06 UTC] {s3.py:98} INFO - Poking for key : s3://bucket-data/*.csv
[2023-01-10, 15:03:08 UTC] {s3.py:98} INFO - Poking for key : s3://bucket-data/*.csv
Maybe the sensor is not the best choice for such logic?
A sensor is a perfect choice for this use case. I'd try setting poke_interval and timeout to smaller values than their defaults to make sure Airflow is checking at the right intervals (by default, they are very long).
One thing to watch out for is whether your sensor runs on a longer interval than your schedule interval. For example, if your DAG is scheduled to run hourly, but your sensor's timeout is set to 2 hours, your next DAG run may not run as expected (depending on your concurrency and max_active_runs settings), or it may run unexpectedly because the sensor detects an older file. Ideally, you can append a timestamp to the name of your file to avoid this.
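As a rough sketch of that tuning on the sensor above (the exact values are illustrative, not prescriptive):

from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

check_for_new_csv = S3KeySensor(
    task_id='check_for_new_csv',
    bucket_name='bucket-data',
    bucket_key='*.csv',
    wildcard_match=True,
    soft_fail=True,
    poke_interval=30,       # check S3 every 30 seconds
    timeout=60 * 30,        # soft-fail after 30 minutes, well inside the hourly schedule
    mode='reschedule',      # free the worker slot between pokes
)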

Azure ADF - how to proceed with pipeline activities only after new files arrive?

I have written a generic data-file arrival checking routine using Databricks notebooks, which accepts file names and a time that specifies the acceptable freshness of the files. Many pipelines use this notebook and pass tuples of file names, and at the end the notebook returns True or False to indicate whether the next workflow activity can start. So far so good.
Now my question is how to use this in an Azure ADF pipeline such that, if it fails, it waits for 30 minutes or so and checks again by running the above notebook again.
This notebook shall run first, so that if the new files are already there it does not wait.
Since you are talking about the Notebook activity, you can add a Wait activity on the "on failure" path and set the time for the wait. After the Wait, add an Execute Pipeline activity. This Execute Pipeline should point to a pipeline with an Execute Pipeline activity (again) pointing to the main pipeline which has the Notebook activity. Basically this is just a cycle, but it will only execute when you have a failure.
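On the notebook side, one hedged sketch of how the "on failure" path can be driven (files_are_fresh stands in for whatever the existing routine already computes): raise when the check fails so the Wait / Execute Pipeline cycle fires, and exit normally when the files are there.

# Final cell of the freshness-check notebook (illustrative).
# files_are_fresh is the True/False result the existing routine already computes.
if files_are_fresh:
    dbutils.notebook.exit("True")   # success path: downstream ADF activities run
else:
    # Failing the Notebook activity triggers the "on failure" Wait -> Execute Pipeline cycle
    raise Exception("Expected files have not arrived yet")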

AWS Glue ETL"Failed to delete key: target_folder/_temporary" caused by S3 exception "Please reduce your request rate"

A Glue job configured with a maximum capacity of 10 nodes, 1 job in parallel, and no retries on failure is giving the error "Failed to delete key: target_folder/_temporary"; according to the stack trace, the issue is that the S3 service starts blocking the Glue requests due to the volume of requests: "AmazonS3Exception: Please reduce your request rate."
Note: The issue is not with IAM, as the IAM role the Glue job uses has permission to delete objects in S3.
I found a suggestion for this issue on GitHub proposing to reduce the worker count: https://github.com/aws-samples/aws-glue-samples/issues/20
"I've had success reducing the number of workers."
However, I don't think that 10 is too many workers; I would actually like to increase the worker count to 20 to speed up the ETL.
Has anyone who faced this issue had any success with it? How would I go about solving it?
Shortened stacktrace:
py4j.protocol.Py4JJavaError: An error occurred while calling o151.pyWriteDynamicFrame.
: java.io.IOException: Failed to delete key: target_folder/_temporary
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:665)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.delete(EmrFileSystem.java:332)
...
Caused by: java.io.IOException: 1 exceptions thrown from 12 batch deletes
at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.deleteAll(Jets3tNativeFileSystemStore.java:384)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.doSingleThreadedBatchDelete(S3NativeFileSystem.java:1372)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.delete(S3NativeFileSystem.java:663)
...
Caused by: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Please reduce your request rate. (Service: Amazon S3; Status Code: 503; Error Code: SlowDown; Request ID: ...
Part of Glue ETL python script (just in case):
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="database", table_name="table_name", transformation_ctx="datasource0")
# ... relationalizing, renaming, etc.; converting from DynamicFrame to PySpark DataFrame and back
partition_ready = Map.apply(frame=processed_dataframe, f=map_date_partition, transformation_ctx="map_date_partition")
datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")
job.commit()
Solved (kind of), thanks to user ayazabbas.
I accepted the answer that pointed me in the right direction. One of the things I was looking for was how to consolidate many small files into bigger chunks, and repartitioning does exactly that. Instead of repartition(x) I used coalesce(x), where x is 4 * the worker count of the Glue job, so that the Glue service can allocate each data chunk to an available vCPU resource. It might make sense to set x to at least 2 * 4 * worker_count to account for slower and faster transformation parts, if they exist.
Another thing I did was reduce the number of columns by which I partition the data before writing it to S3 from 5 to 4.
The current drawback is that I haven't figured out how to find, from within the Glue script, the worker count that the Glue service actually allocates for the job, so the number is hardcoded according to the job configuration (the Glue service sometimes allocates more nodes than what is configured).
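A rough sketch of that approach against the write snippet above, with the worker count hardcoded as described (the names here are illustrative):

from awsglue.dynamicframe import DynamicFrame

WORKER_COUNT = 10                       # hardcoded from the job configuration, as noted above
target_partitions = 4 * WORKER_COUNT    # roughly one chunk per available vCPU

# Consolidate many small partitions into fewer, larger ones before writing.
# coalesce() avoids the full shuffle that repartition() would trigger.
coalesced_df = partition_ready.toDF().coalesce(target_partitions)
partition_ready = DynamicFrame.fromDF(coalesced_df, glueContext, "partition_ready")

datasink = glueContext.write_dynamic_frame.from_options(
    frame=partition_ready,
    connection_type="s3",
    connection_options={"path": "s3://bucket/target_folder", "partitionKeys": ["year", "month", "day", "hour"]},
    format="parquet",
    transformation_ctx="datasink")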
I had this same issue. I worked around it by running repartition(x) on the dynamic frame before writing to S3. This forces x files per partition, and the max parallelism during the write process will be x, reducing the S3 request rate.
I set x to 1, as I wanted 1 parquet file per partition, so I'm not sure what the safe upper limit of parallelism is before the request rate gets too high.
I couldn't figure out a nicer way to solve this issue; it's annoying because you have so much idle capacity during the write process.
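As a sketch, the workaround amounts to adding something like this just before the write in the script above:

# Force a single output file per partition value; max write parallelism becomes 1,
# which keeps the S3 request rate low at the cost of idle capacity during the write.
partition_ready = partition_ready.repartition(1)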
Hope that helps.

JitterBit Run Only One Instance of an Operation at a Time

I ran into an issue where I had long running JitterBit operations that were scheduled. I had them scheduled close together, since I needed to keep data flowing. But, when they would take longer than expected I would wind up with multiple instances of the operation set running at the same time. This was killing my performance.
I'll put the fix in the answer below.
To resolve this issue I added an additional Script Operation at the beginning of my operation set (with the schedule running on this operation). This script simply checks to see if one of the operations in this set is already running. If not, it starts the next operation. If there is anything running, it exits and waits until the next scheduled instance.
This is a sample of my script. This one assumes that there were originally two operations in this operation set.
<trans>
// Check whether either operation in this set is already queued or running
$isInQueue=GetOperationQueue("<TAG>Operations/OperationToCheck01</TAG>");
$isInQueue2=GetOperationQueue("<TAG>Operations/OperationToCheck02</TAG>");
$isRunning=$isInQueue[0][1];
$isRunning2=$isInQueue2[0][1];
if(($isRunning==1 && $isRunning!=Null()) || ($isRunning2==1 && $isRunning2!=Null()),
    // Something is already running: skip this scheduled instance
    WriteToOperationLog("Skip for now: "+$isRunning+" / "+$isRunning2);,
    // Nothing is running: kick off the operation chain
    WriteToOperationLog("Nothing is Running - Starting Operation Chain.");
    RunOperation("<TAG>Operations/OperationToCheck01</TAG>");
);
</trans>

Need help in Apache Camel multicast/parallel/concurrent processing

I am trying to achieve concurrent/parallel processing for my requirement, but I have not found appropriate help in my multiple attempts in this regard.
I have 5 remote directories (which may be added or removed) containing log files. I want to download them to my local directory every 15 minutes and perform Lucene indexing after the FTP transfer job completes. I also want to add routes dynamically.
Since all those remote machines are different endpoints, and different routes, I don't have any particular endpoint to kick all of these off.
Start
<parallel>
<download remote dir from: sftp1>
<download remote dir from: sftp2>
....
</parallel>
<After above task complete>
<start Lucene indexing>
<end>
Repeat the above every 15 minutes.
I want to download all folders in parallel. Kindly suggest a solution if anybody has worked on a similar requirement.
I would like to know how to start/initiate these multiple routes (for these multiple remote directories) when I don't have a starter endpoint. I would like to run all FTP operations in parallel and, once those complete, start the indexing. Thanks for taking the time to read this post; I really appreciate your help.
I tried like this:
from("bean:foo?method=start").multicast().to("direct:a").to("direct:b")...
from("direct:a").from("sftp:xxx").to("file:localdir")
from("direct:b").from("sftp:xxx").to("file:localdir")
camel-ftp supports periodic polling via the consumer.delay property
add camel-ftp consumer routes dynamically for each server, as shown in this unit test
you can then aggregate your results based on a size or timeout value to initiate the Lucene indexing, etc.
[todo - put together an example]