Can we add more instances to an existing Amazon Elastic MapReduce job flow? - amazon-emr

I am new to Amazon services and am facing an issue.
Suppose I am running a job flow on Amazon Elastic MapReduce with a total of 3 instances. While it runs, I find that the job is taking too long to execute. In that case I need to add more instances so that the job executes faster.
My question is: how do I add instances to an existing job flow? Terminating the existing instances and creating a new, larger set of instances is time-consuming.
Is there any way to do it? If so, please suggest how.
I am doing all of this through the CLI, so please share the answers with commands, along with the GUI steps in the AWS Management Console.
Thanks.

Yes, you can do this with the command-line tool.
To add more instances to the core group:
elastic-mapreduce --modify-instance-group CORE --instance-count 40
To create a task group (no datanodes), with 40 instances:
elastic-mapreduce --add-instance-group TASK --instance-count 40 --instance-type c1.medium

It's important to note that the number of instances in the CORE instance group cannot be reduced, since they participate as data nodes. It can only be increased.
TASK instances only do processing, so their number can be both increased and reduced.
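
For scripted setups, the same two resize operations can be sketched with boto3, the AWS SDK for Python. This is a minimal sketch under assumptions, not the method from the answer above: it assumes boto3 is installed and credentials are configured, and the helper names (core_group_id, grow_core_group, add_task_group) are hypothetical. The functions take the EMR client as a parameter so they can be tried against a stub before touching a real cluster.

```python
def core_group_id(emr, cluster_id):
    """Return the ID of the cluster's CORE instance group.

    Only the first page of results is read, which is assumed to be
    enough for a cluster with a handful of instance groups.
    """
    groups = emr.list_instance_groups(ClusterId=cluster_id)["InstanceGroups"]
    for group in groups:
        if group["InstanceGroupType"] == "CORE":
            return group["Id"]
    raise LookupError("no CORE instance group in cluster " + cluster_id)


def grow_core_group(emr, cluster_id, count):
    """Set the CORE group to `count` instances."""
    emr.modify_instance_groups(
        ClusterId=cluster_id,
        InstanceGroups=[
            {"InstanceGroupId": core_group_id(emr, cluster_id),
             "InstanceCount": count}
        ],
    )


def add_task_group(emr, cluster_id, count, instance_type="c1.medium"):
    """Add a TASK group (processing only, no data nodes)."""
    emr.add_instance_groups(
        JobFlowId=cluster_id,
        InstanceGroups=[
            {"Name": "task",  # hypothetical display name
             "InstanceRole": "TASK",
             "InstanceType": instance_type,
             "InstanceCount": count}
        ],
    )
```

With a real client this would be called as grow_core_group(boto3.client("emr"), "j-XXXXXXXXXXXXX", 40), where the job flow ID is a placeholder for your own.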

Related

Hazelcast cluster member crash results in losing all scheduled tasks

We are running 4 instances of our Java application in a Hazelcast cluster. We scheduled around 2000 tasks using the scheduled executor service's schedule method. Hazelcast partitions these 2000 tasks across the 4 instances. If one cluster member crashes for some reason, all the tasks assigned to the partitions owned by the crashed node are lost, while the other 3 cluster members complete their assigned tasks.
How can we avoid losing these tasks?
Try using the Durable Executor.
It is probably also a good idea to find out why the process crashed in the first place.

Get notification if ECS service launches a new task, if autoscaling is triggered

We have used ECS for our production setups. As per my understanding of ECS, while creating a cluster of type EC2, we specify the number of instances to be launched. When we create a service, and if autoscaling is enabled we specify the minimum and the maximum number of tasks that can be created.
While creating these tasks, if there is no space left on the existing instances, ECS launches a new instance to place these tasks.
I would like to know if we can trigger a notification whenever a new EC2 instance gets added in the ECS cluster if autoscaling is triggered?
If yes, please help me with links or steps for the same.
Thanks.
Should be doable, see here: https://docs.aws.amazon.com/autoscaling/ec2/userguide/ASGettingNotifications.html
You can test it simply by increasing the capacity manually, and there are other notifications you can subscribe to as well.
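
The linked guide describes attaching an SNS notification to an Auto Scaling group. A minimal boto3 sketch of that step follows, assuming the cluster's EC2 instances belong to an Auto Scaling group and that an SNS topic already exists. The helper name notify_on_launch and the example names in the usage note are hypothetical.

```python
# Auto Scaling event types that fire when an instance launch is attempted.
LAUNCH_EVENTS = [
    "autoscaling:EC2_INSTANCE_LAUNCH",
    "autoscaling:EC2_INSTANCE_LAUNCH_ERROR",
]


def notify_on_launch(autoscaling, group_name, topic_arn):
    """Send launch events of one Auto Scaling group to an SNS topic."""
    autoscaling.put_notification_configuration(
        AutoScalingGroupName=group_name,
        TopicARN=topic_arn,
        NotificationTypes=LAUNCH_EVENTS,
    )
```

With a real client this would be called as notify_on_launch(boto3.client("autoscaling"), "my-ecs-asg", topic_arn), where the group name and topic ARN are placeholders. Subscribers of the topic (email, for example) then receive a message each time the group launches an instance.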

Is there a way to reuse a single running Databricks cluster in multiple mapping data flows?

Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, and use the same running cluster in all of the data flows, instead of letting every data flow instance spin up its own cluster, which takes around 6 minutes each?
Yes. Set the TTL in the Azure Integration Runtime under "Data Flow Properties" to a duration that covers the gap between data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize the cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.
In ADF, you cannot access the underlying compute engines (Databricks in this case), so you have to kick off a dummy data flow to warm it up.
That cluster start-up will take 5-6 minutes. But now, if you use that same Azure IR in your subsequent activities, as long as they are scheduled to execute within that TTL window, ADF can grab existing VM resources to spin up the Spark clusters and marshal your data flow definition to the Spark job execution.
End to end, that process should now take just 2 minutes.

Is it possible to limit number of oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario and why we need this as succinctly as I can:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some with multiple in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only enough memory and cores allocated to this pool that oozie can start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate for these spikes, but that would be a massive waste (hundreds of cores and GBs that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will run behind at times. BTW, all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into a single coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. You distribute your actions across K queues in the workflow, which ensures that no more than K actions run at the same time and so limits the load on the cluster.
We use a script to generate the fork queues inside the workflow automatically and distribute the actions (of course this only works for actions that can run in parallel, i.e. there are no data dependencies etc.).
Hope this helps.
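
The K-queue idea can be sketched as follows. This is not the script mentioned in the answer, just a minimal illustration under assumptions: actions are spread round-robin over K queues, each queue runs its actions one after another, and all queues end at a single join node. The function names, the node names "fork-queues" and "join-queues", and the action names are all hypothetical.

```python
import xml.etree.ElementTree as ET


def distribute(actions, k):
    """Round-robin the action names into at most k queues."""
    if k < 1:
        raise ValueError("k must be at least 1")
    actions = list(actions)
    queues = [actions[i::k] for i in range(k)]
    return [q for q in queues if q]  # drop empty queues when k > len(actions)


def ok_transitions(queues, join_name="join-queues"):
    """Map each action to the node its <ok to="..."/> should point at.

    Actions in one queue run sequentially; the last one goes to the join.
    """
    targets = {}
    for queue in queues:
        for current, following in zip(queue, queue[1:] + [join_name]):
            targets[current] = following
    return targets


def fork_element(queues, fork_name="fork-queues"):
    """Build the <fork> node with one <path> per queue head."""
    fork = ET.Element("fork", {"name": fork_name})
    for queue in queues:
        ET.SubElement(fork, "path", {"start": queue[0]})
    return fork
```

For example, distribute(["a", "b", "c", "d", "e"], 2) yields the queues ["a", "c", "e"] and ["b", "d"], so at most two of the five actions run at once. Note that this only bounds concurrency inside the one generated workflow, not across other workflows on the cluster.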

Speed of job execution in Amazon Elastic MapReduce

My task is:
1) First, I want to import the data from MS SQL Server into HDFS using Sqoop.
2) I process the data through Hive and generate the result in one table.
3) The Hive table containing the result is then exported back to MS SQL Server.
I want to perform all of this using Amazon Elastic MapReduce.
The data I am importing from MS SQL Server is very large (about 500,000 entries in one table, and I have 30 such tables). To process it I have written a Hive task that contains only queries, and each query uses a lot of joins. Because of this, performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible, which is why we decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances, and I still get the same performance as on my local machine.
To improve the performance, how many instances should I use?
Are the instances we use configured automatically, or do I need to specify them when submitting the JAR for execution? I ask because when I use two machines the time is the same.
Also, is there any other way to improve the performance besides increasing the number of instances? Or am I doing something wrong when executing the JAR?
Please guide me through this, as I don't know much about the Amazon servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel that lets you view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy as per the following: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html).