Event-based triggering and running an Airflow task on dropping a file into an S3 bucket

Is it possible to run an Airflow task only when a specific event occurs, such as a file being dropped into a specific S3 bucket? Something similar to AWS Lambda events.
There is S3KeySensor, but I don't know if it does what I want (run a task only when an event occurs).
Here is an example to make the question clearer:
I have a sensor object as follows:
# Import path varies by Airflow version; recent versions use the Amazon provider package,
# older ones expose the sensor as airflow.sensors.s3_key_sensor.S3KeySensor.
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

sensor = S3KeySensor(
    task_id='run_on_every_file_drop',
    bucket_key='file-to-watch-*',
    wildcard_match=True,
    bucket_name='my-sensor-bucket',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag
)
With the above sensor object, Airflow's behavior for the sensor task is as follows:
1. It runs the task if there is already an object matching the wildcard in the S3 bucket my-sensor-bucket, even before the DAG is switched ON in the Airflow admin UI (I don't want to run the task because of pre-existing S3 objects).
2. After running once, the sensor task will not run again when a new S3 object is dropped (I want to run the sensor task, and the subsequent tasks in the DAG, every single time a new S3 object is dropped into the bucket my-sensor-bucket).
3. If you configure the scheduler, the tasks run on a schedule, not on events, so the scheduler does not seem like an option in this case.
I'm trying to understand whether Airflow tasks can only be run on a schedule (like cron jobs) or via sensors (only once, based on the sensing criteria), or whether Airflow can be set up as an event-driven pipeline (something similar to AWS Lambda).

Airflow is fundamentally organized around time-based scheduling.
You can work around that to get what you want, though, in a few ways:
1. Set up an S3 event notification to SQS that triggers an AWS Lambda function, which calls the Airflow API to trigger a DAG run (see the sketch after this list).
2. Make the DAG start with an SQS sensor; when it receives the S3 change event, it just proceeds with the rest of the DAG (see 3.1 and 3.2 for rescheduling).
3. Make the DAG start with a sensor like the one you show. The sensor doesn't choose which task to run; it just passes control to the next dependent tasks, or times out. You would have to delete the key that made the sensor match, and then rerun in one of two ways:
3.1. Have the final task re-trigger the DAG.
3.2. Set the schedule interval to every minute, with no catchup and max active DAG runs set to 1. That way one run is active at a time, and its sensor holds it until it matches or times out; once it completes or times out, the next run starts within a minute.
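A minimal sketch of option 1, assuming Airflow 2.x with the stable REST API enabled and basic-auth credentials; the Airflow URL, DAG id, credential environment variables, and the assumption that S3 invokes the Lambda directly (rather than via SQS/SNS) are all placeholders to adapt:

# Hypothetical Lambda handler: an S3 object-created event triggers a DAG run through
# the Airflow 2.x stable REST API. AIRFLOW_URL, AIRFLOW_CREDS ("user:password"), and
# DAG_ID are assumed environment variables; adapt them to your deployment.
import base64
import json
import os
import urllib.request

AIRFLOW_URL = os.environ["AIRFLOW_URL"]          # e.g. https://airflow.example.com
DAG_ID = os.environ.get("DAG_ID", "process_s3_drop")

def handler(event, context):
    # Assumes a direct S3 trigger; if the event arrives via SQS/SNS, unwrap the body first.
    record = event["Records"][0]
    payload = {
        "conf": {
            "bucket": record["s3"]["bucket"]["name"],
            "key": record["s3"]["object"]["key"],
        }
    }
    auth = base64.b64encode(os.environ["AIRFLOW_CREDS"].encode()).decode()
    req = urllib.request.Request(
        f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Basic auth is only one option; match whatever auth backend your Airflow uses.
            "Authorization": f"Basic {auth}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

The DAG run conf carries the bucket and key, so downstream tasks can process exactly the object that triggered the run.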
If you go with option 3, you'll be deleting the keys that matched the sensor before the next run of the DAG and its sensor. Note that, due to S3's eventual consistency, options 1 and 2 are more reliable.
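And a minimal sketch of an SQS-sensor DAG that re-triggers itself (options 2 and 3.1), assuming the Amazon provider package is installed and an SQS queue already receives the S3 event notifications; the queue URL, connection id, and processing logic are placeholders:

# Hypothetical event-driven DAG: wait on an SQS queue fed by S3 event notifications,
# process the new object, then re-trigger the DAG so the sensor waits for the next event.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.trigger_dagrun import TriggerDagRunOperator
from airflow.providers.amazon.aws.sensors.sqs import SqsSensor

def process_new_object(**context):
    # Recent provider versions push the received SQS messages to XCom under "messages".
    messages = context["ti"].xcom_pull(task_ids="wait_for_s3_event", key="messages")
    print(messages)  # replace with real processing of the S3 event payload

with DAG(
    dag_id="s3_event_driven",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,   # no time-based schedule; runs are (re-)triggered explicitly
    catchup=False,
    max_active_runs=1,
) as dag:
    wait_for_s3_event = SqsSensor(
        task_id="wait_for_s3_event",
        sqs_queue="https://sqs.us-east-1.amazonaws.com/123456789012/my-sensor-queue",
        aws_conn_id="aws_default",
        max_messages=1,
    )
    process = PythonOperator(task_id="process", python_callable=process_new_object)
    # Option 3.1: the final task re-triggers the same DAG.
    retrigger = TriggerDagRunOperator(task_id="retrigger", trigger_dag_id="s3_event_driven")

    wait_for_s3_event >> process >> retrigger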

Related

IICS How Do You Orchestrate Scheduled Task Flows?

I would like to run multiple scheduled Task Flows against the same data source but only run one at a time.
Example:
Schedule "Nightly" runs once a day (expected runtime 30 minutes),
Schedule "Hourly" runs once an hour (expected runtime 10 minutes),
Schedule "Minute" runs once a minute (expected runtime 5 seconds).
I would like:
#1 "Nightly" test status of "Hourly" and "Minute":
If they are not running, start "Nightly",
If either are running, loop around until both have stopped.
#2 "Hourly" test status of "Nightly" and "Minute":
If they are not running, start "Hourly",
If "Nightly" is running, exit,
If "Minute" is running, loop around until both have stopped.
#3 "Minute" test status of "Nightly" and "Hourly":
If they are not running, start "Minute",
If either are running, exit.
So far, I am using handshakes with several JSON files in the cloud.
Meaning, if "Minute" is running, the file minute.json contains information telling a caller "Minute" is running.
When "Minute" ends, it updates its file, minute.json, to reflect the operation has stopped.
As you can imagine, this is very slow.
Also, Informatica will always create a JSON file when JSON is the target. The issue is that if anything goes wrong, Informatica will create a zero-byte JSON file that will fail any operation reading it.
There has got to be a better way.
You could use the Informatica Platform REST API v2 to monitor and control the execution of your jobs programmatically from an external system.
It's a bit involved to set everything up and write the logic (or configure a driving tool), but this setup should give you full control, including error handling, logging, alerting, etc. The basic flow is:
1. Log in (there are a number of options, such as native, SAML, and Salesforce credentials).
2. Check the status and outcome of your jobs via the activity log or activity monitor API resources.
3. Use the job and/or schedule API resources to run your jobs.
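A rough sketch of that flow in Python, assuming username/password login against the v2 API; the POD base URL, endpoint paths, payload fields, and the "Minute"/"MTT" task name and type are assumptions to verify against your org's API documentation (task flows may require a different resource than plain tasks):

# Rough sketch only: log in, check the activity monitor, and start a task if nothing
# is currently running. URLs, payloads, and the task name/type are assumptions.
import requests

LOGIN_URL = "https://dm-us.informaticacloud.com/ma/api/v2/user/login"  # your POD may differ

def login(username, password):
    resp = requests.post(
        LOGIN_URL,
        json={"@type": "login", "username": username, "password": password},
    )
    resp.raise_for_status()
    body = resp.json()
    # serverUrl and icSessionId are reused on every subsequent v2 call.
    return body["serverUrl"], body["icSessionId"]

def running_jobs(server_url, session_id):
    # The activity monitor lists jobs that are currently running.
    resp = requests.get(
        f"{server_url}/api/v2/activity/activityMonitor",
        headers={"icSessionId": session_id, "Accept": "application/json"},
    )
    resp.raise_for_status()
    return resp.json()

def start_task(server_url, session_id, task_name, task_type):
    resp = requests.post(
        f"{server_url}/api/v2/job",
        headers={"icSessionId": session_id, "Content-Type": "application/json"},
        json={"@type": "job", "taskName": task_name, "taskType": task_type},
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    server_url, session_id = login("me@example.com", "secret")
    # Rule #3 above: only start "Minute" if nothing else is currently running.
    if not running_jobs(server_url, session_id):
        start_task(server_url, session_id, "Minute", "MTT")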

Resource Freeing in Redis Jobs

I have an application that sends Redis job output to the front end via a socket. There are some special types of nodes that need to wait for their jobs to finish, so I'm using an infinite wait for those jobs to complete. Something like below:
# busy-wait until the job reports it has finished
while True:
    if job.status() == "finished":
        break
I want to know if there's any way to free the resource and re-run the jobs from their previous state.
I tried a solution that stays awake all the time, but I want a solution where the resources are not bound to one job, so the system can use those resources for other jobs.
Well, what you can do is return immediately if the job is special and save the job's state in Redis.
If you have a back end in your application, you can always check later whether the special jobs have finished running.
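A rough sketch of that non-blocking approach, assuming the jobs are RQ (python-rq) jobs backed by Redis; the key name, the notify callback, and the requeue-on-failure behavior are assumptions:

# Rough sketch: record special job ids in a Redis set and check them periodically from
# the back end, instead of busy-waiting. Assumes python-rq; names here are placeholders.
from redis import Redis
from rq.job import Job

redis_conn = Redis()
PENDING_KEY = "special_jobs:pending"   # Redis set of job ids still being watched

def track(job):
    # Instead of blocking in a while-loop, remember the job id and return immediately.
    redis_conn.sadd(PENDING_KEY, job.id)

def check_pending(notify):
    # Call this periodically (cron, scheduler, or an event-loop tick) from the back end.
    for raw_id in redis_conn.smembers(PENDING_KEY):
        job = Job.fetch(raw_id.decode(), connection=redis_conn)
        status = job.get_status()
        if status == "finished":
            notify(job.id, job.result)           # push the result to the front-end socket
            redis_conn.srem(PENDING_KEY, raw_id)
        elif status == "failed":
            job.requeue()                        # re-run the failed job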

How to push AWS ECS Fargate CloudWatch logs to the UI so users can see real-time logs of their long-running tasks

I am creating an app where long-running tasks are executed in ECS Fargate and their logs are pushed to CloudWatch. Now I am looking for a way to give users the ability to see those live logs in the UI while their tasks are running.
I am thinking of the approach below:
1. Save the logs temporarily in DynamoDB.
2. A DynamoDB stream, with batching, triggers a Lambda.
3. The Lambda triggers an AWS AppSync mutation with a None data source (a sketch of this step follows the list).
4. In the UI, the client subscribes to that mutation to get real-time updates (depending on the batch size, e.g. a batch of 5 means 5 log lines).
https://aws.amazon.com/premiumsupport/knowledge-center/appsync-notify-subscribers-real-time/
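A rough sketch of steps 2-3, assuming an AppSync API with API-key auth and a hypothetical publishLogLines mutation attached to a None data source; the endpoint, key, mutation, and DynamoDB attribute names are all assumptions:

# Hypothetical Lambda triggered by the DynamoDB stream: it forwards newly inserted log
# lines to an AppSync mutation (None data source), so subscribed clients receive them
# in real time. APPSYNC_URL, APPSYNC_API_KEY, and the schema fields are assumptions.
import json
import os
import urllib.request

APPSYNC_URL = os.environ["APPSYNC_URL"]        # e.g. https://xxxx.appsync-api.us-east-1.amazonaws.com/graphql
APPSYNC_API_KEY = os.environ["APPSYNC_API_KEY"]

MUTATION = """
mutation PublishLogLines($taskId: ID!, $lines: [String!]!) {
  publishLogLines(taskId: $taskId, lines: $lines) { taskId lines }
}
"""

def handler(event, context):
    # Collect the newly inserted log lines from the DynamoDB stream batch.
    lines, task_id = [], None
    for record in event["Records"]:
        if record["eventName"] == "INSERT":
            image = record["dynamodb"]["NewImage"]
            task_id = image["taskId"]["S"]
            lines.append(image["line"]["S"])
    if not lines:
        return
    req = urllib.request.Request(
        APPSYNC_URL,
        data=json.dumps({"query": MUTATION,
                         "variables": {"taskId": task_id, "lines": lines}}).encode(),
        headers={"Content-Type": "application/json", "x-api-key": APPSYNC_API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())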
Are there any other techniques or methods I could consider?
Why not use CloudWatch's built-in ability to save logs to an S3 bucket, and add SNS to let clients choose which topic they want to follow for the log trail? That removes the extra DynamoDB layer.

Perform action after all s3 files processed

I have files uploaded to my S3 bucket - always 6, twice per day. I am running a fairly standard setup:
S3 -> SNS -> SQS -> lambda
Each Lambda invocation processes the newly uploaded file and then deletes it. I know the exact number of files in each batch, but I cannot require the client to perform any other action (e.g. publish a message on SNS).
Question: what's the easiest/best way to perform a certain action after processing all files?
My ideas:
Step Functions - not sure how?
Simply check in each Lambda whether the S3 object count is zero (or check the SQS queue size?) - not sure whether there could be a race condition with a delete that happened immediately before (is it always consistent?), or similar issues.
A CloudWatch alarm when the SQS queue depth is zero -> SNS -> Lambda - I guess it should work, but I'm not sure about the correct metric.
I would appreciate info on the best/simplest way to achieve it.
If you are sure that all 6 files will have been processed by a certain time, you can simply create a scheduled CloudWatch Events rule (e.g. at 11:50 PM) and, based on your own validation, delete the files.
You could use the number of files in the S3 bucket location to capture "file count" state. The processing lambda runs on each file-add, but conditionally initiates the delete and post-processing steps only when objectCount === 6.
How can we use S3 to keep track of file count? Lots of possibilities, here are two:
Option 1: defer processing until all 6 files have arrived
When triggered on OBJECT_CREATED, the lambda counts the S3 objects. If objectCount < 6 the lambda exits without further action. If 6 files exist, process all 6, delete the files and perform the post-processing action.
Option 2: use S3 tags to indicate PROCESSED status
When triggered on OBJECT_CREATED, the lambda processes the new file and adds a PROCESSED tag to the S3 Object. The lambda then counts the S3 objects with PROCESSED tags. If 6, delete the files and perform the post-processing action.
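A rough sketch of Option 2, assuming for brevity that the Lambda is triggered directly by OBJECT_CREATED events on a single prefix (in the SNS -> SQS setup above you would unwrap the message body first); the bucket, prefix, and the process_file/post_process stubs are placeholders:

# Rough sketch of Option 2: tag each processed object, and once all 6 objects carry the
# PROCESSED tag, run the post-processing action and delete the batch.
# Bucket, prefix, expected count, and the two stub functions are placeholders.
import boto3

s3 = boto3.client("s3")
BUCKET = "my-batch-bucket"
PREFIX = "incoming/"
EXPECTED_COUNT = 6

def process_file(key):
    pass    # placeholder for the existing per-file processing

def post_process():
    pass    # placeholder for the action to run after all 6 files are processed

def is_processed(key):
    tags = s3.get_object_tagging(Bucket=BUCKET, Key=key)["TagSet"]
    return any(t["Key"] == "PROCESSED" for t in tags)

def handler(event, context):
    key = event["Records"][0]["s3"]["object"]["key"]
    process_file(key)
    s3.put_object_tagging(
        Bucket=BUCKET,
        Key=key,
        Tagging={"TagSet": [{"Key": "PROCESSED", "Value": "true"}]},
    )
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    processed = [o["Key"] for o in objects if is_processed(o["Key"])]
    if len(processed) == EXPECTED_COUNT:
        post_process()
        for k in processed:
            s3.delete_object(Bucket=BUCKET, Key=k)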
In any case, think through race conditions and other error states. This is often where "best" and "simplest" conflict.
N.B. Step Functions could be used to chain together the processing steps, but they don't offer a different way to keep track of file count state.

Deleting events published by AWS S3 buckets which are still in queue to be processed by lambda

My architecture is:
1. Drop multiple files into an AWS S3 bucket.
2. A Lambda picks up the files one by one and starts processing them.
The problem is:
I am not able to stop the Lambda from processing the remaining files partway through. Even if I stop the Lambda and restart it, it picks up from where it left off.
Is there a way to achieve this?
You have no control over the events pushed by S3. You'll be better off if you just cancel the Lambda subscription if you want to stop it for good, but I am afraid that already emitted events will be processed as long as your Lambda is active.
What exactly are you trying to achieve?
If you want to limit the number of files your Lambda function can process at once, you can simply limit the concurrent executions of your function to 1, so it won't auto-scale based on demand.
Simply go to the function's Concurrency settings in the Lambda console, set the reserved concurrency to 1, and save it.
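The same setting can also be applied programmatically; a minimal boto3 sketch, where the function name is a placeholder:

# Reserve a concurrency of 1 for the function so only one invocation runs at a time.
import boto3

lambda_client = boto3.client("lambda")
lambda_client.put_function_concurrency(
    FunctionName="my-file-processor",       # placeholder function name
    ReservedConcurrentExecutions=1,
)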
Detach the Lambda's S3 trigger and then re-add it.
This way only new events will be picked up, not the old ones.