Amazon EMR managing my spark cluster - amazon-emr

I have a Spark setup on Amazon EC2 machines with 2 worker machines running. It reads data from Cassandra, does some processing, and writes to SQL Server. I have heard about Amazon EMR and read about it. I want a managed system where worker machines are automatically added to my cluster if my job is taking longer than expected, and shut down when my job completes.
Can I achieve this through Amazon EMR?

The requirements are:
1. Worker machines are automatically added to the cluster if the job is taking longer than expected.
2. The cluster shuts down when the job completes.
No. 2 is definitely possible if your job is launched as an EMR step. There is an option that auto-terminates the cluster after the last step completes. Alternatively, this can be done programmatically with the SDK.
No. 1 is a little more difficult. EMR has three classes of nodes: master, core, and task. Task nodes can be added after cluster creation. The trigger for that would probably have to be implemented programmatically or with another Amazon service, such as Lambda.
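As a rough illustration of both points, here is a minimal Python sketch using the request shapes of the boto3 EMR client (`run_job_flow` and `add_instance_groups`). The function names, the release label, the instance types, the role names, and the "elapsed time versus budget" trigger are all assumptions made for the example, not something the answer above prescribes. The `emr` argument is expected to be a `boto3.client("emr")` object.

```python
def build_cluster_request(name, spark_args, core_nodes=2):
    """Build keyword arguments for emr.run_job_flow(**request).

    KeepJobFlowAliveWhenNoSteps=False makes the cluster terminate itself
    once the last step has finished (requirement no. 2).
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-6.15.0",          # assumed release label
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # assumed instance types
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": core_nodes + 1,    # core nodes plus one master
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "Steps": [{
            "Name": "spark-job",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(spark_args)},
        }],
        "JobFlowRole": "EMR_EC2_DefaultRole",   # assumed default role names
        "ServiceRole": "EMR_DefaultRole",
    }


def add_task_nodes_if_slow(emr, cluster_id, elapsed_s, budget_s, extra_nodes=2):
    """Add a TASK instance group when the job has exceeded its time budget.

    The trigger (elapsed time versus a fixed budget) is a hypothetical rule;
    it could equally run from a cron job or a Lambda function.
    Returns True if a scale-up request was sent.
    """
    if elapsed_s <= budget_s:
        return False
    emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[{
            "Name": "extra-task-nodes",
            "InstanceRole": "TASK",
            "InstanceType": "m5.xlarge",
            "InstanceCount": extra_nodes,
            "Market": "ON_DEMAND",
        }],
    )
    return True

# Typical use, assuming boto3 is installed and credentials are configured:
#   emr = boto3.client("emr")
#   cluster_id = emr.run_job_flow(**build_cluster_request(
#       "nightly-etl", ["spark-submit", "s3://example-bucket/job.py"]))["JobFlowId"]
```

Keeping the request construction separate from the API call makes the logic easy to check without touching AWS.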

Related

Scheduling a task in a Distributed Environment like Kubernetes

I have a FastAPI-based REST application where I need some scheduled tasks to run every 30 minutes. This application will run on Kubernetes, so the number of instances is not fixed. I want the scheduled jobs to trigger from only one of the available instances, not from all running instances, which would create a race condition. So I need some kind of locking mechanism that prevents the schedulers from firing if one is already running. My app connects to a MySQL-compatible Aurora DB running on AWS. Can I achieve this with APScheduler? If not, are there any alternatives available?

Manage In-memory cache in multiple servers in aws

Once or twice a day, some files are uploaded to an S3 bucket. I want the in-memory data on each server to be refreshed with the uploaded data on every S3 upload.
Note that there are multiple servers running, and I want to store the same data on all of them. Also, the servers scale based on traffic (new servers start up and older ones go down, so the set of server instances will not always be the same).
In short, I want to keep the data in the cache up to date.
I want to build an architecture that supports auto-scaling of the servers. I came across the AWS fan-out architecture, which uses SNS and multiple SQS queues that different servers can poll.
How can we handle scaling the queues along with the servers?
Or is there any other way to handle this scenario?
PS: I am totally new to the AWS environment.
Any reference would be a great help.
To me there are a few things that you need to have to make this work. These are opinions and, as with most architectural designs, there is certainly more than one way to handle this.
I start with the assumption that you've got an application running on an EC2 of some sort (Elastic Beanstalk, Fargate, Raw EC2s with auto scaling, etc.) and that you've solved for having the application installed and configured when a scale-up event occurs.
Conceptually I'd have this diagram:
The setup involves having the S3 bucket publish events, most likely s3:ObjectCreated:* events, to the SNS topic. These events are published when an object in the bucket is created or overwritten.
Next:
During startup, your application pulls the current data from S3.
As part of application startup, create a queue named after the instance ID of the EC2 instance, and subscribe the queue to the SNS topic. If the queue already exists, that is not an error.
Your application has a background thread or process that polls the SQS queue for messages.
If you get a message on the queue, that tells the application to refresh the cache from S3.
When an instance is about to be shut down, at least Elastic Beanstalk and the load balancers emit an event announcing it. Remove the SQS queue tied to the instance at that time.
The only issue might be that a hard crash of an environment would leave orphan queues. It may be advisable to either manually clean these up or have a periodic task clean them up.
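The steps above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: `sqs` and `sns` are expected to be `boto3.client("sqs")` and `boto3.client("sns")` objects, the `cache-refresh-` queue name prefix and all function names are made up for the example, and the SQS access policy that allows the SNS topic to send to the queue is omitted. It also assumes raw message delivery is off, so each SQS body is an SNS envelope whose `Message` field holds the S3 event as a JSON string.

```python
import json
from urllib.parse import unquote_plus


def queue_name_for(instance_id):
    """Hypothetical naming scheme: one queue per EC2 instance."""
    return f"cache-refresh-{instance_id}"


def ensure_queue_subscribed(sqs, sns, topic_arn, instance_id):
    """Create this instance's queue (idempotent) and subscribe it to the topic."""
    queue_url = sqs.create_queue(QueueName=queue_name_for(instance_id))["QueueUrl"]
    attrs = sqs.get_queue_attributes(QueueUrl=queue_url, AttributeNames=["QueueArn"])
    sns.subscribe(TopicArn=topic_arn, Protocol="sqs",
                  Endpoint=attrs["Attributes"]["QueueArn"])
    return queue_url


def updated_keys(message_body):
    """Extract S3 object keys from an SNS-wrapped S3 event notification."""
    envelope = json.loads(message_body)
    event = json.loads(envelope["Message"])
    # Keys arrive URL-encoded in S3 events; test events carry no "Records".
    return [unquote_plus(r["s3"]["object"]["key"]) for r in event.get("Records", [])]


def poll_once(sqs, queue_url, refresh_cache):
    """Long-poll the queue once and call refresh_cache(keys) per message.

    Returns the number of messages handled. Run this in a loop from a
    background thread.
    """
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=10,
                               WaitTimeSeconds=20)
    handled = 0
    for msg in resp.get("Messages", []):
        keys = updated_keys(msg["Body"])
        if keys:
            refresh_cache(keys)
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
        handled += 1
    return handled


def remove_queue(sqs, queue_url):
    """Call from the instance's shutdown hook to avoid orphan queues."""
    sqs.delete_queue(QueueUrl=queue_url)
```

A periodic cleanup task for orphan queues could list queues by the same name prefix and delete those whose instance no longer exists.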

Airflow best practice : s3_to_sftp_operator instead of running aws cli?

What would be the best solution to transfer files between s3 and an EC2 instance using airflow?
After some research I found that there is an s3_to_sftp_operator, but I know it's good practice to execute tasks on the external systems instead of on the Airflow instance.
I'm thinking about using a BashOperator that runs an AWS CLI command on the remote EC2 instance, since that respects the principle above.
Do you have any production best practices to share for this case?
The s3_to_sftp_operator is going to be the better choice unless the files are large. Only if the files are large would I consider a bash operator with an ssh onto a remote machine. As for what large means, I would just test with the s3_to_sftp_operator and if the performance of everything else on airflow isn't meaningfully impacted then stay with it. I'm regularly downloading and opening ~1 GiB files with PythonOperators in airflow on 2 vCPU airflow nodes with 8 GiB RAM. It doesn't make sense to do anything more complex on files that small.
The best solution would be not to transfer the files, and most likely to get rid of the EC2 while you are at it.
If you have a task that needs to run on some data in S3, then just run that task directly in airflow.
If you can't run that task in airflow because it needs vast power or some weird code that airflow won't run, then have the EC2 instance read S3 directly.
If you're using airflow to orchestrate the task because the task is watching the local filesystem on the EC2, then just trigger the task and have the task read S3.
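To illustrate "just read S3 directly", here is a minimal Python sketch of a task body that streams an object from S3 instead of copying it to a local filesystem first. The function name and the line-counting "work" are placeholders made up for the example. The only AWS call is boto3's `get_object`, whose `Body` is a streaming object that supports `iter_lines()`. The same function could be passed as the `python_callable` of an Airflow PythonOperator or run as a script on the EC2 instance.

```python
def summarize_s3_object(bucket, key, s3=None):
    """Stream an S3 object line by line and return simple statistics.

    `s3` is a boto3 S3 client; if omitted, one is created lazily so the
    function can also be exercised with a stub client.
    """
    if s3 is None:
        import boto3  # assumes boto3 is installed where the task runs
        s3 = boto3.client("s3")

    body = s3.get_object(Bucket=bucket, Key=key)["Body"]
    lines = 0
    payload_bytes = 0
    for line in body.iter_lines():   # yields each line without its newline
        lines += 1
        payload_bytes += len(line)   # placeholder for the real processing
    return {"lines": lines, "payload_bytes": payload_bytes}
```

Because the object is streamed, memory use stays small even for files larger than the worker's RAM.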

Distributed job management system

I'm using beeQueue for scheduling and processing video transcoding jobs.
So far everything works fine, but I'm now facing the challenge of working in a distributed environment: auto-scaling Amazon instances to add more workers that process the jobs pending in the queue. We scale well, but we need a system that is fail-safe. If an instance whose workers were processing a job shuts down, we get no job status or events, and the job that was running on that instance disappears into a black hole: it can't be recovered or processed again.
What I did:
I'm looking for a ready-made solution that is fail-safe in a distributed environment.
Thanks

Amazon EMR: how to find out when a job is finished?

I'm using the Amazon Elastic MapReduce Ruby client (http://aws.amazon.com/developertools/2264) to run my Hive job. Is there a way to know when the job is done? Right now all I can think of is to keep running emrclient with "--list --active", but I'm hoping there is a better way to do this.
Thank you
You can also see this in the EMR section of the AWS console.
If your concern is terminating the cluster once your job is done, then do not use the --stay-alive option when launching the cluster. Alternatively, you can have a script that polls the current status of the running cluster and terminates it once it reaches the 'waiting' state.
I do not think there is another way.
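The polling script mentioned above could look roughly like this. Note that it is a Python sketch against the boto3 EMR client (`describe_cluster` and `terminate_job_flows`), not the Ruby command-line client from the question. The function name, the polling interval, and the `max_polls` safety limit are assumptions for the example.

```python
import time

FINISHED_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def wait_and_terminate(emr, cluster_id, poll_seconds=30, sleep=time.sleep,
                       max_polls=None):
    """Poll the cluster state and terminate it once it reaches WAITING.

    `emr` is expected to be a boto3.client("emr") object. Returns the state
    that ended the wait: "WAITING" (all steps done, termination requested)
    or one of FINISHED_STATES (the cluster already shut down by itself).
    Raises TimeoutError after `max_polls` unsuccessful polls, if set.
    """
    polls = 0
    while True:
        state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
        if state == "WAITING":
            # No steps left to run: the job is done, so shut the cluster down.
            emr.terminate_job_flows(JobFlowIds=[cluster_id])
            return state
        if state in FINISHED_STATES:
            return state
        polls += 1
        if max_polls is not None and polls >= max_polls:
            raise TimeoutError(f"cluster {cluster_id} still in state {state}")
        sleep(poll_seconds)
```

The `sleep` parameter exists only so the loop can be exercised without real delays.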