How to run Airflow S3 sensor exactly once?

I want to continue the DAG only if a csv file exists in S3, otherwise it should just end.
The DAG itself is being scheduled hourly.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ec2 import EC2StartInstanceOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

with DAG(dag_id="my_dag",
         start_date=datetime(2023, 1, 1),
         schedule_interval='@hourly',
         catchup=False
         ) as dag:

    check_for_new_csv = S3KeySensor(
        task_id='check_for_new_csv',
        bucket_name='bucket-data',
        bucket_key='*.csv',
        wildcard_match=True,
        soft_fail=True,
        retries=1
    )

    start_instance = EC2StartInstanceOperator(
        task_id="start_ec2_instance_task",
        instance_id=INSTANCE_ID,
        region_name=REGION
    )

    check_for_new_csv >> start_instance
But the sensor seems to run forever - in the log I can see it keeps on running:
[2023-01-10, 15:02:06 UTC] {s3.py:98} INFO - Poking for key : s3://bucket-data/*.csv
[2023-01-10, 15:03:08 UTC] {s3.py:98} INFO - Poking for key : s3://bucket-data/*.csv
Maybe the sensor is not the best choice for such logic?

A sensor is a perfect choice for this use case. I'd try setting the poke_interval and timeout to smaller values than their defaults to make sure Airflow is checking the sensor at the right intervals (by default, they are very long).
One thing to watch out for is if your sensors run on longer intervals than your schedule interval. For example, if your DAG is scheduled to run hourly, but your sensor's timeout is set to 2 hours, your next DAG run may not start as expected (depending on your concurrency and max_active_runs settings), or it may run unexpectedly because the sensor detects an older file. Ideally, you can append a timestamp to the name of your file to avoid this.
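For example, a minimal sketch of the sensor with a shorter poke_interval and a timeout that stays within the hourly schedule (the exact values here are assumptions, not something the original answer prescribes):

check_for_new_csv = S3KeySensor(
    task_id='check_for_new_csv',
    bucket_name='bucket-data',
    bucket_key='*.csv',
    wildcard_match=True,
    soft_fail=True,       # on timeout, skip instead of failing the DAG run
    poke_interval=30,     # check S3 every 30 seconds
    timeout=30 * 60,      # give up after 30 minutes, well inside the hourly schedule
)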

Related

Spark - Failed to load collect frame - "RetryingBlockFetcher - Exception while beginning fetch"

We have a Scala Spark application that reads something like 70K records from the DB into a data frame; each record has 2 fields.
After reading the data from the DB, we do a minor mapping and load the result as a broadcast for later usage.
Now, in the local environment, there is an exception, a timeout from the RetryingBlockFetcher, while running the following code:
dataframe.select("id", "mapping_id")
.rdd.map(row => row.getString(0) -> row.getLong(1))
.collectAsMap().toMap
The exception is:
2022-06-06 10:08:13.077 task-result-getter-2 ERROR org.apache.spark.network.shuffle.RetryingBlockFetcher Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to /1.1.1.1:62788
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
    at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
In the local environment, I simply create the Spark session with "spark.master" set to local.
When I limit the max number of records to 20K, it works well.
Can you please help? Maybe I need to configure something in my local environment so that the original code works properly?
Update:
I tried changing a lot of Spark-related configuration in my local environment (memory, the number of executors, timeout-related settings, and more), but nothing helped! I just got the timeout after more time...
I realized that the data frame that I'm reading from the DB has 1 partition of 62K records; when repartitioning it into 2 or more partitions, the process worked correctly and I managed to map and collect as needed.
Any idea why this solves the issue? Is there a configuration in Spark that can solve this instead of repartitioning?
Thanks!
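For reference, here is a rough PySpark sketch of the repartition workaround described in the update above (the original application is in Scala; the SparkSession setup and sample data are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("repartition-demo").getOrCreate()

# Stand-in for the DB read in the question: ~62K two-field records in one partition.
dataframe = spark.createDataFrame(
    [(str(i), i * 10) for i in range(62000)], ["id", "mapping_id"]
).coalesce(1)

mapping = (dataframe.select("id", "mapping_id")
           .repartition(4)                                # the workaround: avoid one oversized partition
           .rdd
           .map(lambda row: (row["id"], row["mapping_id"]))
           .collectAsMap())                               # returns a plain Python dict

broadcast_mapping = spark.sparkContext.broadcast(mapping)  # broadcast for later usage
spark.stop()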

How to speed up the time it takes for UI Server to update and allow dynamic DAG to be triggered?

I have a DAG Generator that takes a JSON input and creates a new dynamic DAG in the dags directory. The time it takes for that newly created DAG to be available to use (through the API) can range from 2 seconds to 5 minutes.
I ran the test 100 times:
1. Create a new DAG (with the same input JSON, so the dynamic DAGs are identical).
2. Once the DAG is saved in the dags directory, start sending API requests to see if the DAG can be triggered (see the sketch below).
3. Track the seconds that passed before I was able to successfully trigger the DAG.
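Roughly, the timing loop looked like the sketch below (the webserver URL, credentials, and dag_id are assumptions for illustration; the REST API returns 404 until the scheduler has parsed the new DAG file):

import time
import requests

AIRFLOW_URL = "http://localhost:8080"   # assumed webserver address
AUTH = ("admin", "admin")               # assumed credentials
dag_id = "my_generated_dag"             # hypothetical id of the generated DAG

start = time.monotonic()
while True:
    resp = requests.post(
        f"{AIRFLOW_URL}/api/v1/dags/{dag_id}/dagRuns",
        json={"conf": {}},
        auth=AUTH,
    )
    if resp.ok:  # the DAG is now known to Airflow and was triggered
        print(f"Triggerable after {time.monotonic() - start:.2f} seconds")
        break
    time.sleep(1)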
Results are as follows:
[14.81, 6.44, 6.38, 6.36, 2.21, 6.42, 18.96, 23.14, 23.11, 14.82, 6.39, 23.10, 18.93, 14.80, 23.20, 31.49, 48.29, 35.83, 27.20, 18.96, 14.80, 44.14, 35.66, 35.77, 39.92, 31.50, 69.15, 48.22, 69.29, 39.87, 10.53, 69.15, 27.37, 48.22, 77.51, 39.90, 27.35, 65.03, 69.16, 31.47, 65.06, 90.00, 2.19, 111.33, 69.19, 98.46, 90.16, 27.28, 60.89, 56.57, 110.96, 18.92, 140.55, 39.95, 94.22, 85.89, 44.29, 94.54, 69.21, 136.20, 35.72, 102.57, 102.63, 81.72, 98.58, 77.55, 148.83, 102.79, 136.38, 115.22, 94.38, 148.68, 119.43, 48.24, 178.09, 81.80, 127.64, 119.59, 44.22, 194.88, 23.17, 170.00, 211.47, 153.18, 249.55, 182.40, 152.98, 86.00, 157.02, 98.54, 270.02, 81.75, 153.04, 69.23, 265.92, 27.30, 278.64, 23.19, 269.98, 81.91]
Average Time: 79.35 seconds
You can see that as the number of files in the dags folder increased, the time it took for the DAG to become triggerable also increased, but it's still somewhat random. Is there any way to keep this consistent (without restarting the Airflow server after each creation), or to speed it up?
Thank you!

Spark structured streaming groupBy not working in append mode (works in update)

I'm trying to get a streaming aggregation/groupBy working in append output mode, to be able to use the resulting stream in a stream-to-stream join. I'm working on (Py)Spark 2.3.2, and I'm consuming from Kafka topics.
My pseudo-code is something like below, running in a Zeppelin notebook
orderStream = spark.readStream().format("kafka").option("startingOffsets", "earliest").....

orderGroupDF = (orderStream
    .withWatermark("LAST_MOD", "20 seconds")
    .groupBy("ID", window("LAST_MOD", "10 seconds", "5 seconds"))
    .agg(
        collect_list(struct("attra", "attrb2",...)).alias("orders"),
        count("ID").alias("number_of_orders"),
        sum("PLACED").alias("number_of_placed_orders"),
        min("LAST_MOD").alias("first_order_tsd")
    )
)

debug = (orderGroupDF.writeStream
    .outputMode("append")
    .format("memory")
    .queryName("debug")
    .start()
)
After that, I would have expected data to appear on the debug query so that I can select from it (after the late-arrival window of 20 seconds has expired). But no data ever appears on the debug query (I waited several minutes).
When I change the output mode to update, the query works immediately.
Any hint what I'm doing wrong?
EDIT: after some more experimentation, I can add the following (but I still don't understand it).
When starting the Spark application, there is quite a lot of old data (with event timestamps << current time) on the topic from which I consume. After starting, it seems to read all these messages (MicroBatchExecution in the log reports "numRowsTotal = 6224" for example), but nothing is produced on the output, and the eventTime watermark in the log from MicroBatchExecution stays at epoch (1970-01-01).
After producing a fresh message onto the input topic with eventTimestamp very close to current time, the query immediately outputs all the "queued" records at once, and bumps the eventTime watermark in the query.
What I can also see is that there seems to be an issue with the timezone. My Spark program runs in CET (UTC+2 currently). The timestamps in the incoming Kafka messages are in UTC, e.g. "LAST_MOD": "2019-05-14 12:39:39.955595000". I have set spark_sess.conf.set("spark.sql.session.timeZone", "UTC"). Still, the microbatch report after that "new" message has been produced onto the input topic says
"eventTime" : {
"avg" : "2019-05-14T10:39:39.955Z",
"max" : "2019-05-14T10:39:39.955Z",
"min" : "2019-05-14T10:39:39.955Z",
"watermark" : "2019-05-14T10:35:25.255Z"
},
So the eventTime somehow links up with the time in the input message, but it is 2 hours off; the UTC difference has been subtracted twice. Additionally, I fail to see how the watermark calculation works. Given that I set it to 20 seconds, I would have expected it to be 20 seconds older than the max event time. But apparently it is 4 min 14 sec older. I fail to see the logic behind this.
I'm very confused...
It seems that this was related to the Spark version 2.3.2 that I used, and maybe more concretely to SPARK-24156. I have upgraded to Spark 2.4.3, and here I get the results of the groupBy immediately (well, of course after the watermark lateThreshold has expired, but "in the expected timeframe").
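For anyone reproducing this, the following small self-contained sketch (using the built-in rate source rather than Kafka, so all names in it are illustrative assumptions) shows that in append mode a window's aggregate is only emitted once the watermark has passed the end of that window:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = (SparkSession.builder
         .appName("append-mode-demo")
         .config("spark.sql.session.timeZone", "UTC")   # keep event times in UTC
         .getOrCreate())

events = spark.readStream.format("rate").option("rowsPerSecond", "5").load()

agg = (events
       .withWatermark("timestamp", "20 seconds")
       .groupBy(window("timestamp", "10 seconds"))
       .agg(count("*").alias("n")))

query = (agg.writeStream
         .outputMode("append")      # a window shows up only after the watermark
         .format("console")         # (max event time - 20s) passes its end
         .option("truncate", "false")
         .start())

query.awaitTermination(60)          # watch the console output for about a minute
query.stop()
spark.stop()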

Spark execution occasionally gets stuck at mapPartitions at Exchange.scala:44

I am running a Spark job on a two node standalone cluster (v 1.0.1).
Spark execution often gets stuck at the task mapPartitions at Exchange.scala:44.
This happens at the final stage of my job in a call to saveAsTextFile (as I expect from Spark's lazy execution).
It is hard to diagnose the problem because I never experience it in local mode with local IO paths, and occasionally the job on the cluster does complete as expected with the correct output (same output as with local mode).
This seems possibly related to reading from s3 (of a ~170MB file) immediately prior, as I see the following logging in the console:
DEBUG NativeS3FileSystem - getFileStatus returning 'file' for key '[PATH_REMOVED].avro'
INFO FileInputFormat - Total input paths to process : 1
DEBUG FileInputFormat - Total # of splits: 3
...
INFO DAGScheduler - Submitting 3 missing tasks from Stage 32 (MapPartitionsRDD[96] at mapPartitions at Exchange.scala:44)
DEBUG DAGScheduler - New pending tasks: Set(ShuffleMapTask(32, 0), ShuffleMapTask(32, 1), ShuffleMapTask(32, 2))
The last logging I see before the task apparently hangs/gets stuck is:
INFO NativeS3FileSystem: Opening key '[PATH_REMOVED].avro' for reading at position '67108864'
Has anyone else experienced non-deterministic problems related to reading from S3 in Spark?

Celery task schedule (Celery, Django and RabbitMQ)

I want to have a task that will execute every 5 minutes, but it will wait for the last execution to finish and then start counting those 5 minutes (this way I can also be sure that there is only one task running). The easiest way I found is to run the Django application's manage.py shell and run this:
while True:
    result = task.delay()
    result.wait()
    sleep(5)
but for each task that I want to execute this way I have to run its own shell. Is there an easier way to do it? Maybe some kind of custom Django Celery scheduler?
Wow it's amazing how no one understands this person's question. They are asking not about running tasks periodically, but how to ensure that Celery does not run two instances of the same task simultaneously. I don't think there's a way to do this with Celery directly, but what you can do is have one of the tasks acquire a lock right when it begins, and if it fails, to try again in a few seconds (using retry). The task would release the lock right before it returns; you can make the lock auto-expire after a few minutes if it ever crashes or times out.
For the lock you can probably just use your database or something like Redis.
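A minimal sketch of that locking pattern, using the Django cache as the lock backend and the modern shared_task API (the task name, lock key, and expiry below are illustrative assumptions):

from celery import shared_task
from django.core.cache import cache

LOCK_ID = "my-task-lock"
LOCK_EXPIRE = 60 * 10   # auto-expire after 10 minutes in case the task crashes

@shared_task(bind=True)
def my_task(self):
    # cache.add only succeeds if the key does not exist yet, so it acts as a lock
    # (atomic with backends such as Redis or memcached).
    if not cache.add(LOCK_ID, "locked", LOCK_EXPIRE):
        raise self.retry(countdown=5)   # another instance is running; retry shortly
    try:
        do_the_actual_work()            # hypothetical work function
    finally:
        cache.delete(LOCK_ID)           # release the lock right before returning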
You may be interested in this simpler method that requires no changes to a celery conf.
@celery.decorators.periodic_task(run_every=datetime.timedelta(minutes=5))
def my_task():
    # Insert fun-stuff here
    pass
All you need is to specify in the celery conf which task you want to run periodically and with which interval.
Example: run the tasks.add task every 30 seconds:
from datetime import timedelta

CELERYBEAT_SCHEDULE = {
    "runs-every-30-seconds": {
        "task": "tasks.add",
        "schedule": timedelta(seconds=30),
        "args": (16, 16)
    },
}
Remember that you have to run celery in beat mode with the -B option:
python manage.py celeryd -B
You can also use the crontab style instead of a time interval; check out this:
http://ask.github.com/celery/userguide/periodic-tasks.html
If you are using django-celery, remember that you can also use the Django DB as the scheduler for periodic tasks; this way you can easily add new periodic tasks through the django-celery admin panel.
To do that, you need to set the celerybeat scheduler in settings.py like this:
CELERYBEAT_SCHEDULER = "djcelery.schedulers.DatabaseScheduler"
To expand on @MauroRocco's post, from http://docs.celeryproject.org/en/v2.2.4/userguide/periodic-tasks.html
Using a timedelta for the schedule means the task will be executed 30 seconds after celerybeat starts, and then every 30 seconds after the last run. A crontab like schedule also exists, see the section on Crontab schedules.
So this will indeed achieve the goal you want.
Because celery.decorators is deprecated, you can use the periodic_task decorator like this:
from celery.task.base import periodic_task
from django.utils.timezone import timedelta

@periodic_task(run_every=timedelta(seconds=5))
def my_background_process():
    # insert code
    pass
Add that task to a separate queue, and then use a separate worker for that queue with the concurrency option set to 1.
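A small sketch of that approach with a modern Celery configuration (the project, task, and queue names below are assumptions):

from celery import Celery

app = Celery("myproject", broker="amqp://localhost")

# Route this one task to its own queue; a dedicated worker consuming only that
# queue with a single process then guarantees at most one instance at a time:
#   celery -A myproject worker -Q single_tasks --concurrency=1
app.conf.task_routes = {
    "myproject.tasks.my_task": {"queue": "single_tasks"},
}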