Can I reuse CloudWatch Log Stream for Batch Array Jobs - amazon-cloudwatch

Hi, I created an AWS Batch array job. I have created a CloudWatch Log Group and a Log Stream under it, and I want that same CloudWatch Log Stream to be shared by all of the Batch array jobs,
so that I can see the same log stream link for the parent job and all of its child jobs in the Batch job's Job Information section.
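For reference, a minimal sketch of how a log configuration is attached to a Batch job definition with the awslogs driver (the names and the boto3 call are only illustrative, and this is not a confirmed way to force every child job into one shared stream - with a stream prefix, each container attempt normally still gets its own stream):

import boto3

batch = boto3.client("batch")

# Illustrative job definition: the awslogs options pin the log group and the
# stream prefix, not a single shared stream.
batch.register_job_definition(
    jobDefinitionName="my-array-job-def",                      # placeholder name
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",  # placeholder
        "vcpus": 1,
        "memory": 2048,
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": "/aws/batch/my-array-job",    # the pre-created log group
                "awslogs-stream-prefix": "array-job",          # child streams share this prefix
            },
        },
    },
)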

Related

The right metrics to monitor AWS Glue Jobs

What are the right Glue metrics for monitoring the status of Glue jobs?
I want to know the status of the last run (OK or KO) and the number of successful/failed executions.
PS: I'm using Datadog to explore the metrics.
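If polling the Glue API directly is acceptable, a minimal boto3 sketch that derives the last-run status and a success/failure count from get_job_runs (the job name is a placeholder, and this assumes the most recent run is returned first):

import boto3

glue = boto3.client("glue")

# "my-glue-job" is a placeholder job name.
runs = glue.get_job_runs(JobName="my-glue-job", MaxResults=50)["JobRuns"]

last_state = runs[0]["JobRunState"] if runs else "UNKNOWN"   # e.g. SUCCEEDED / FAILED
succeeded = sum(1 for r in runs if r["JobRunState"] == "SUCCEEDED")
failed = sum(1 for r in runs if r["JobRunState"] in ("FAILED", "ERROR", "TIMEOUT"))

print(f"last run: {last_state}, succeeded: {succeeded}, failed: {failed}")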

How to push AWS ECS Fargate cloudwatch logs to UI for users to see their long running task real time logs

I am building an app where long-running tasks are executed in ECS Fargate and their logs are pushed to CloudWatch. I am now looking for a way to give users the ability to see those live logs in the UI while their tasks are running.
I am thinking of the approach below (a rough sketch of the Lambda step follows the question):
Save the logs temporarily in DynamoDB.
A DynamoDB stream, with batching, triggers a Lambda.
The Lambda triggers an AWS AppSync mutation with a None data source.
The UI client subscribes to that mutation to get real-time updates (one update per stream batch, e.g. a batch size of 5 means 5 log lines).
https://aws.amazon.com/premiumsupport/knowledge-center/appsync-notify-subscribers-real-time/
Are there any other techniques or methods I should consider?
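A minimal sketch of that Lambda step, assuming an AppSync API secured with an API key and a publishLogLines mutation backed by a None data source (the endpoint, key, attribute names, and mutation are all placeholders for illustration):

import json
import os
import urllib.request

# Placeholders: the AppSync GraphQL endpoint and API key for your API.
APPSYNC_URL = os.environ["APPSYNC_URL"]
APPSYNC_API_KEY = os.environ["APPSYNC_API_KEY"]

# Hypothetical mutation backed by a None data source.
MUTATION = """
mutation PublishLogLines($taskId: ID!, $lines: [String!]!) {
  publishLogLines(taskId: $taskId, lines: $lines) {
    taskId
    lines
  }
}
"""

def handler(event, context):
    # Collect the new log lines from the DynamoDB stream batch.
    records = [r for r in event.get("Records", []) if r.get("eventName") == "INSERT"]
    if not records:
        return
    lines = [r["dynamodb"]["NewImage"]["line"]["S"] for r in records]
    task_id = records[0]["dynamodb"]["NewImage"]["taskId"]["S"]

    payload = {"query": MUTATION, "variables": {"taskId": task_id, "lines": lines}}
    req = urllib.request.Request(
        APPSYNC_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": APPSYNC_API_KEY},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())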
Why not use CloudWatch Logs' built-in ability to export the logs to an S3 bucket and add SNS, letting clients choose which topic they want to follow for their log trail? That removes the extra DynamoDB table.

Perform action after all s3 files processed

I have files uploaded onto my s3 bucket - always 6, twice per day. I am running a fairly standard setup:
S3 -> SNS -> SQS -> lambda
Each lambda processes the newly uploaded file and then deletes it. I know the exact number of files in each batch, but I cannot require the client to perform any other action (e.g. message on SNS).
Question: what's the easiest/best way to perform a certain action after processing all files?
My ideas:
Step Functions - not sure how?
Simply check in each Lambda whether the S3 object count is zero (or check the SQS queue size?) - I'm not sure whether there could be a race condition against a delete that happened immediately before (is it always consistent?) or similar issues.
CloudWatch alarm when SQS queue depth is zero -> SNS -> lambda - I guess it should work, not sure about the correct metric?
I would appreciate info on the best/simplest way to achieve it.
If you are sure that all 6 files will have been processed by a certain time, you can simply create a CloudWatch scheduled rule (say at 11:50 PM) and, based on your own validation, delete the files then.
You could use the number of files in the S3 bucket location to capture "file count" state. The processing lambda runs on each file-add, but conditionally initiates the delete and post-processing steps only when objectCount === 6.
How can we use S3 to keep track of file count? Lots of possibilities, here are two:
Option 1: defer processing until all 6 files have arrived
When triggered on OBJECT_CREATED, the lambda counts the S3 objects. If objectCount < 6 the lambda exits without further action. If 6 files exist, process all 6, delete the files and perform the post-processing action.
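A minimal sketch of Option 1, assuming all 6 files land under a single known prefix (the bucket, prefix, and helper names are placeholders):

import boto3

s3 = boto3.client("s3")

BUCKET = "my-upload-bucket"     # placeholder
PREFIX = "incoming/"            # placeholder
EXPECTED_COUNT = 6

def handler(event, context):
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    keys = [obj["Key"] for obj in resp.get("Contents", [])]

    if len(keys) < EXPECTED_COUNT:
        # Not all files have arrived yet; exit and wait for the next event.
        return

    for key in keys:
        process_file(BUCKET, key)          # your existing per-file logic
        s3.delete_object(Bucket=BUCKET, Key=key)

    run_post_processing()                  # the "after all files" action

def process_file(bucket, key):
    ...                                    # placeholder

def run_post_processing():
    ...                                    # placeholder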
Option 2: use S3 tags to indicate PROCESSED status
When triggered on OBJECT_CREATED, the lambda processes the new file and adds a PROCESSED tag to the S3 Object. The lambda then counts the S3 objects with PROCESSED tags. If 6, delete the files and perform the post-processing action.
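And a sketch of Option 2, tagging each object once processed and acting when 6 tagged objects exist (same placeholder names; extracting the object key from the SNS/SQS envelope is omitted):

import boto3

s3 = boto3.client("s3")

BUCKET = "my-upload-bucket"     # placeholder
PREFIX = "incoming/"            # placeholder
EXPECTED_COUNT = 6

def handle_new_object(key):
    process_file(BUCKET, key)   # your existing per-file logic

    # Mark this object as processed.
    s3.put_object_tagging(
        Bucket=BUCKET,
        Key=key,
        Tagging={"TagSet": [{"Key": "status", "Value": "PROCESSED"}]},
    )

    # Count the objects under the prefix that carry the PROCESSED tag.
    objects = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get("Contents", [])
    processed = 0
    for obj in objects:
        tags = s3.get_object_tagging(Bucket=BUCKET, Key=obj["Key"])["TagSet"]
        if any(t["Key"] == "status" and t["Value"] == "PROCESSED" for t in tags):
            processed += 1

    if processed == EXPECTED_COUNT:
        for obj in objects:
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
        run_post_processing()

def process_file(bucket, key):
    ...                         # placeholder

def run_post_processing():
    ...                         # placeholder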
In any case, think through race conditions and other error states. This is where "best" and "simplest" often conflict.
N.B. Step Functions could be used to chain together the processing steps, but they don't offer a different way to keep track of file count state.

resource management on spark jobs on Yarn and spark shell jobs

Our company has a 9-node cluster on Cloudera.
We have 41 long-running Spark Streaming jobs [YARN + cluster mode] and some regular spark-shell jobs scheduled to run daily at 1 PM.
All jobs are currently submitted under the user A role [with root permission].
The issue I encountered is that while all 41 Spark Streaming jobs are running, my scheduled jobs cannot obtain the resources to run.
I have tried the YARN fair scheduler, but the scheduled jobs still do not run.
We expect the Spark Streaming jobs to always be running, but to give up some of the resources they occupy whenever the other scheduled jobs start.
Please feel free to share your suggestions or possible solutions.
Your Spark Streaming jobs are consuming too many resources for your scheduled jobs to get started. This is either because they are always scaled to the point that there aren't enough resources left for the scheduled jobs, or because they aren't scaling back.
For the case where the streaming jobs aren't scaling back, you could check whether dynamic resource allocation is enabled for them. One way of checking is via the spark shell, using spark.sparkContext.getConf.get("spark.streaming.dynamicAllocation.enabled"). If dynamic allocation is enabled, you could look at reducing the minimum resources for those jobs.
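A minimal PySpark sketch of the same check, plus the settings you would typically tune if it is enabled (the values in the comments are only illustrative, not recommendations):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
conf = spark.sparkContext.getConf()

# Is streaming dynamic allocation on? Falls back to "false" if the key is unset.
print(conf.get("spark.streaming.dynamicAllocation.enabled", "false"))

# If it is enabled, these are the knobs usually set at submit time via --conf:
#   spark.streaming.dynamicAllocation.minExecutors=1
#   spark.streaming.dynamicAllocation.maxExecutors=4
#   spark.streaming.dynamicAllocation.scalingInterval=60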

Event based Triggering and running an airflow task on dropping a file into S3 bucket

Is it possible to run an Airflow task only when a specific event occurs, such as a file being dropped into a specific S3 bucket? Something similar to AWS Lambda events.
There is S3KeySensor, but I don't know whether it does what I want (run the task only when an event occurs).
Here is an example to make the question clearer:
I have a sensor object as follows:
from airflow.sensors.s3_key_sensor import S3KeySensor  # import path varies by Airflow version

sensor = S3KeySensor(
    task_id='run_on_every_file_drop',
    bucket_key='file-to-watch-*',
    wildcard_match=True,
    bucket_name='my-sensor-bucket',
    timeout=18*60*60,
    poke_interval=120,
    dag=dag
)
Using the above sensor object, Airflow's behavior for the sensor task is as follows:
It runs the task if there is already an object name matching the wildcard in the S3 bucket my-sensor-bucket, even before the DAG is switched ON in the Airflow admin UI (I don't want to run the task just because of past S3 objects).
After running once, the sensor task will not run again when a new S3 object is dropped (I want to run the sensor task and the subsequent tasks in the DAG every single time a new S3 object is dropped into the bucket my-sensor-bucket).
If you configure the scheduler, the tasks run based on the schedule, not based on events, so the scheduler does not seem to be an option in this case.
I'm trying to understand whether tasks in Airflow can only be run based on scheduling (like cron jobs) or sensors (once, based on the sensing criteria), or whether it can be set up as an event-based pipeline (something similar to AWS Lambda).
Airflow is fundamentally organized around time-based scheduling.
You can hack around it to get what you want, though, in a few ways:
1. Set up an S3 event notification to SQS that triggers an AWS Lambda, which calls the Airflow API to trigger a DAG run (a rough sketch of such a Lambda follows this answer).
2. Make the DAG start with an SQS sensor; when it receives the S3 change event, it simply proceeds with the rest of the DAG (see 3_1 & 3_2 for rescheduling).
3. Make the DAG start with a sensor (like the one you show); it doesn't choose which task to run, it just passes on to the next dependent tasks OR times out. You'd have to delete the key that made the sensor match.
3_1. Rerun by making the final task re-trigger the DAG.
3_2. Or set the schedule interval to every minute, with no catchup and max active DAG runs set to 1. That way one run is active and the sensor holds it until it times out; whether it completes or times out, the next run starts within a minute.
If you go with route 3, you'll be deleting the keys that passed the sensor before the next run of the DAG and its sensor. Note that due to S3's eventual consistency, routes 1 & 2 are more reliable.
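For route 1, a minimal sketch of such a Lambda, assuming Airflow 1.10's experimental REST API is enabled and reachable (the URL and DAG id are placeholders; newer Airflow versions expose the stable /api/v1/dags/{dag_id}/dagRuns endpoint with authentication instead):

import json
import os
import urllib.request

AIRFLOW_URL = os.environ["AIRFLOW_URL"]   # e.g. http://my-airflow:8080 (placeholder)
DAG_ID = "s3_file_drop_dag"               # placeholder DAG id

def handler(event, context):
    # Pass the S3 bucket/key through to the DAG run conf. This assumes a direct
    # S3 -> Lambda notification; if the event arrives via SQS, unwrap it first.
    record = event["Records"][0]
    conf = {
        "bucket": record["s3"]["bucket"]["name"],
        "key": record["s3"]["object"]["key"],
    }

    req = urllib.request.Request(
        f"{AIRFLOW_URL}/api/experimental/dags/{DAG_ID}/dag_runs",
        data=json.dumps({"conf": conf}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())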