Trigger CloudWatch event when an EMR step takes longer than a threshold - amazon-cloudwatch

I know that we can set up CloudWatch events for EMR step status changes such as RUNNING, FAILED, etc. But I am interested to know: is there any option to trigger an event if a step runs longer than a threshold?

Related

How to push AWS ECS Fargate CloudWatch logs to a UI so users can see their long-running tasks' logs in real time

I am creating an app where long-running tasks are executed in ECS Fargate and the logs are pushed to CloudWatch. Now I am looking for a way to give users the ability, in the UI, to see those live logs in real time while their tasks are running.
I am thinking of the approach below:
Save the logs temporarily in DynamoDB.
A DynamoDB stream with batching triggers a Lambda.
The Lambda triggers an AWS AppSync mutation with a None data source.
The UI client subscribes to that mutation to get real-time updates (depending on the batch size, e.g. a batch of 5 means 5 log lines).
https://aws.amazon.com/premiumsupport/knowledge-center/appsync-notify-subscribers-real-time/
Are there any other techniques or methods I should consider?
Why not use CloudWatch's built-in ability to save logs to an S3 bucket and add SNS so clients can choose which topic they want to follow for the log trail? That removes the extra DynamoDB piece.
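For illustration, step 3 of the plan above (the Lambda, fed by the DynamoDB stream, calling an AppSync mutation backed by a None data source so subscribed UI clients receive the lines) might look roughly like this sketch. The endpoint, API key, the publishLogLines mutation, and the taskId/line attribute names are placeholders for whatever your schema and table actually define:
# Sketch only: posts new log lines from a DynamoDB stream batch to an AppSync
# mutation so that subscribers on that mutation receive them in real time.
import json
import os

import requests

APPSYNC_URL = os.environ["APPSYNC_URL"]        # placeholder: your AppSync GraphQL endpoint
APPSYNC_API_KEY = os.environ["APPSYNC_API_KEY"]  # placeholder: API-key auth assumed

# Placeholder mutation; must match whatever your None-data-source mutation defines.
MUTATION = """
mutation PublishLogLines($taskId: ID!, $lines: [String!]!) {
  publishLogLines(taskId: $taskId, lines: $lines) {
    taskId
    lines
  }
}
"""

def handler(event, context):
    # Collect the new log lines from the DynamoDB stream batch.
    lines = []
    task_id = None
    for record in event.get("Records", []):
        image = record.get("dynamodb", {}).get("NewImage", {})
        # 'taskId' and 'line' are assumed attribute names in the log table.
        task_id = image.get("taskId", {}).get("S", task_id)
        lines.append(image.get("line", {}).get("S", ""))

    if not lines:
        return {"published": 0}

    resp = requests.post(
        APPSYNC_URL,
        headers={"x-api-key": APPSYNC_API_KEY, "Content-Type": "application/json"},
        data=json.dumps({"query": MUTATION,
                         "variables": {"taskId": task_id, "lines": lines}}),
    )
    resp.raise_for_status()
    return {"published": len(lines)}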

How to make Dataproc detect Python-Hive connection as a Yarn Job?

I launch a Dataproc cluster and serve Hive on it. Remotely from any machine I use Pyhive or PyODBC to connect to Hive and do things. It's not just one query. It can be a long session with intermittent queries. (The query itself has issues; will ask separately.)
Even during one single, active query, the operation does not show as a "Job" (I guess it's Yarn) on the dashboard. In contrast, when I "submit" tasks via Pyspark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, without a Job, the cluster may not reliably detect that a Python client is "connected" to it, so the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to accompany my Python session, and cancel/delete it at times of my choosing? For my case, it would be a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let Yarn detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now; you need to submit jobs via the Dataproc Jobs API to make them visible on the Jobs UI page and to have them taken into account by the cluster TTL feature.
If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"
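If your Hive queries can be expressed as a HiveJob, a rough sketch of submitting them through the Jobs API with the google-cloud-dataproc Python client could look like the following; the project id, region, cluster name, and query are placeholders:
# Sketch only: submit a Hive query via the Dataproc Jobs API so it shows up
# on the Jobs page and counts as activity for the max-idle (TTL) timer.
from google.cloud import dataproc_v1

project_id = "my-project"       # placeholder: your GCP project id
region = "us-central1"          # placeholder: your cluster's region
cluster_name = "my-cluster"     # placeholder: your Dataproc cluster name

# A regional endpoint must be used for clusters in a specific region.
client = dataproc_v1.JobControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

job = {
    "placement": {"cluster_name": cluster_name},
    "hive_job": {"query_list": {"queries": ["SELECT COUNT(*) FROM my_table"]}},
}

result = client.submit_job(project_id=project_id, region=region, job=job)
print("Submitted job:", result.reference.job_id)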

Salesforce Monitor Bulk Data Load Jobs: progress is not showing up for processing jobs

We are not able to see the progress while a Salesforce job (Bulk API) is processing. We're exporting 300,000 tasks and the job has been there for 4 days, but we cannot see any progress on it. Is there a way we can see the progress? We need to know when it's going to finish.
A job by itself doesn't do any work. It is the batches queued to a job that actually carry out the data modifications. An open job will stay open until it is closed or times out (after 1 week, if I remember correctly). The open status therefore does not signify progress; it only means you can queue more batches to this job.
As you can see in your second screenshot, no batches were queued to this job. Check the code that actually queues the batches; the API probably returns some kind of error there.
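To surface that error, a rough sketch (Bulk API 1.0, Python with requests) of queuing a CSV batch to the job and printing whatever the API returns; the instance URL, session id, API version, job id, and the example Task rows are placeholders:
# Sketch only: add one CSV batch to an existing Bulk API job and show the
# response, which is where a silently failing loader usually reveals its error.
import requests

instance = "https://yourInstance.salesforce.com"   # placeholder: your instance URL
session_id = "<SESSION_ID>"                        # placeholder: a valid session/OAuth token
job_id = "<JOB_ID>"                                # the job shown in Monitor Bulk Data Load Jobs
api_version = "47.0"                               # placeholder: your API version

headers = {
    "X-SFDC-Session": session_id,
    "Content-Type": "text/csv; charset=UTF-8",
}

csv_payload = "Subject,Status\nFollow up,Completed\n"  # placeholder Task rows

resp = requests.post(
    f"{instance}/services/async/{api_version}/job/{job_id}/batch",
    headers=headers,
    data=csv_payload.encode("utf-8"),
)

# If the batch was not accepted, the body explains why (bad session, closed
# job, malformed CSV, etc.); otherwise it echoes the queued batch info.
print(resp.status_code)
print(resp.text)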

Hangfire queues recurring jobs

I use Hangfire to run recurring jobs.
My job gets the current data from the database, performs an action, and leaves a trace on the records it processed. If a job didn't run this minute, I have no need to run it twice the next minute.
Somehow my recurring jobs (1-minute cycle) got queued up by the thousands and were never executed. When I restarted IIS, it tried to execute them all at once and clogged the DB.
Besides fixing the problem of no execution, is there a way to stop them from queuing up?
If you want to disable retries of a failed job, simply decorate your method with an AutomaticRetryAttribute and set Attempts to 0.
See https://github.com/HangfireIO/Hangfire/blob/master/src/Hangfire.Core/AutomaticRetryAttribute.cs for more details.

Event-based triggering and running of an Airflow task on dropping a file into an S3 bucket

Is it possible to run an Airflow task only when a specific event occurs, like a file being dropped into a specific S3 bucket? Something similar to AWS Lambda events.
There is S3KeySensor, but I don't know if it does what I want (run a task only when an event occurs).
Here is an example to make the question clearer:
I have a sensor object as follows:
sensor = S3KeySensor(
    task_id='run_on_every_file_drop',
    bucket_key='file-to-watch-*',
    wildcard_match=True,
    bucket_name='my-sensor-bucket',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag
)
Using the above sensor object, Airflow's behavior for the sensor task is as follows:
It runs the task if there is already an object name matching the wildcard in the S3 bucket my-sensor-bucket, even before the DAG is switched ON in the Airflow admin UI (I don't want to run the task because of the presence of past S3 objects).
After running once, the sensor task will not run again when a new S3 file object is dropped (I want to run the sensor task and the subsequent tasks in the DAG every single time a new S3 file object is dropped into the bucket my-sensor-bucket).
If you configure the scheduler, the tasks run on a schedule but not based on an event, so the scheduler doesn't seem like an option in this case.
I'm trying to understand whether tasks in Airflow can only be run based on scheduling (like cron jobs) or sensors (only once, based on sensing criteria), or whether it can be set up as an event-based pipeline (something similar to AWS Lambda).
Airflow is fundamentally organized around time-based scheduling.
You can hack around this to get what you want, though, in a few ways:
Say you have an SQS event on S3; it triggers an AWS Lambda that calls the Airflow API to trigger a DAG run.
You can make a DAG start with an SQS sensor; when it gets the S3 change event, it just proceeds with the rest of the DAG (see 3_1 & 3_2 for rescheduling).
You can make a DAG start with the sensor (like the one you show); it doesn't choose the task to run, it just passes to the next dependent tasks OR times out. You'd have to delete the key that made the sensor match.
You rerun by making the final task re-trigger the DAG (see the sketch below).
Or set the schedule interval to every minute, with no catchup and max active DAG runs set to 1. This way one run will be active and the sensor will hold it until it times out; if it completes or times out, the next run will start within a minute.
If you go with route 3, you'll be deleting the keys that passed the sensor before the next run of the DAG and its sensor. Note that due to S3 eventual consistency, routes 1 & 2 are more reliable.
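A minimal sketch of route 3, assuming Airflow 1.10.x import paths (adjust for 2.x): the sensor waits for a matching key, a task processes and deletes the matched keys, and the final task re-triggers the same DAG so the next run waits for the next file. The DAG id, bucket handling, and processing callable are placeholders:
# Sketch only: event-like DAG that waits for a key, processes/deletes it,
# then re-triggers itself. The first run is triggered manually (or via route 1).
from datetime import datetime

from airflow import DAG
from airflow.sensors.s3_key_sensor import S3KeySensor
from airflow.operators.python_operator import PythonOperator
from airflow.operators.dagrun_operator import TriggerDagRunOperator

DAG_ID = 's3_event_like_pipeline'   # placeholder DAG id

dag = DAG(
    dag_id=DAG_ID,
    start_date=datetime(2019, 1, 1),
    schedule_interval=None,          # not scheduler-driven; each run triggers the next
    catchup=False,
    max_active_runs=1,
)

wait_for_file = S3KeySensor(
    task_id='run_on_every_file_drop',
    bucket_key='file-to-watch-*',
    wildcard_match=True,
    bucket_name='my-sensor-bucket',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag,
)

def process_and_delete_keys(**context):
    # Assumes boto3 credentials are available to the workers; delete the
    # matched keys so the next run's sensor waits for a *new* file.
    import boto3
    bucket = boto3.resource('s3').Bucket('my-sensor-bucket')
    for obj in bucket.objects.filter(Prefix='file-to-watch-'):
        # ... do the real work with obj here ...
        obj.delete()

process_files = PythonOperator(
    task_id='process_files',
    python_callable=process_and_delete_keys,
    provide_context=True,
    dag=dag,
)

retrigger = TriggerDagRunOperator(
    task_id='retrigger_self',
    trigger_dag_id=DAG_ID,
    dag=dag,
)

wait_for_file >> process_files >> retrigger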