Can we create dynamic resource pools for Hive using the CLI?
I want each Hive script to run with X memory and X cores each time it runs. Each script has a different memory and core requirement, which is why I don't want to create pre-defined pools.
Related
We want to run Hive with Tez for querying data in HDFS. Since multiple users will query Hive, we need to configure it so that the queries execute in parallel.
As Tez uses YARN to assign resources across nodes, we are trying to limit the number of containers assigned per Hive query in YARN, but we are not able to find the proper configuration for that.
User Limit Factor is a way to control the maximum amount of resources that a single user can consume. I hope the Cloudera blog below will help you.
https://blog.cloudera.com/yarn-capacity-scheduler/#:~:text=User%20Limit%20Factor%20is%20a%20way%20to%20control%20the%20max%20amount%20of%20resources%20that%20a%20single%20user%20can%20consume
Then we have the queue-level AM share as well.
https://community.cloudera.com/t5/Support-Questions/how-to-tune-yarn-scheduler-capacity-maximum-am-resource/td-p/289415
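For illustration only, the relevant Capacity Scheduler properties look roughly like this (the queue name "default" and the values are placeholders to adapt; they live in capacity-scheduler.xml or your cluster manager UI):
yarn.scheduler.capacity.root.default.user-limit-factor = 1
yarn.scheduler.capacity.root.default.maximum-am-resource-percent = 0.2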
When I use Flink to execute a job that reads from Hive, the Hive data consists of about 1000 files, so Flink sets the parallelism to 1000. The job then requests all the resources of my cluster, which causes other jobs to fail when they request slots. Each of the 1000 files is small, so the job probably does not need to occupy all the resources. How can I tune the Flink parameters so the job uses fewer resources?
YARN perspective
I don't recommend relying on YARN's memory management. YARN kills containers instantly when they exceed their limits. Usually you need to disable the memory checks to overcome this kind of problem:
"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"
Flink perspective
You can't limit resource usage per slot. You have to tune your task managers to your needs, either by reducing the number of slots or by running multiple task managers on each node. You can cap a task manager's total memory with taskmanager.memory.process.size.
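As a rough flink-conf.yaml sketch (the values are placeholders to size for your cluster; with fewer slots per task manager and a lower default parallelism, a 1000-file source no longer grabs every slot at once):
taskmanager.memory.process.size: 4g
taskmanager.numberOfTaskSlots: 2
parallelism.default: 8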
Alternatively, you can run Flink on Kubernetes. You can create a Flink cluster per job, which gives you more flexibility: task managers are created for each job and destroyed when the job completes.
There are also Stateful Functions, which let you deploy pipeline operators into separate containers. This allows you to manage each function's resources separately from the task managers and reduces the pressure on them.
Flink also supports Reactive Mode, which can reduce pressure on workers by scaling operators up and down automatically based on metrics such as CPU usage.
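If you want to experiment with it, Reactive Mode is enabled with a single flink-conf.yaml setting (to my knowledge it applies to standalone/application-mode deployments):
scheduler-mode: reactive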
You need to explore these kinds of features and find the best solution for your needs.
I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large numbers of data files every hour and build aggregated data sets from them. These data files are stored on S3, and I can use some of the basic Spark functions (like filter and map) to derive the aggregated data. To save on egress costs, and after performing some cost-benefit analysis, I decided to create an EMR cluster and make PySpark calls against it. The concept works fine using Lambda functions triggered by files created in the S3 bucket, and I am writing the output files back to S3.
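For context, the kind of PySpark job I am running looks roughly like this (the bucket names, paths, and column names are placeholders):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

# Read the hourly drop of files from S3 (placeholder bucket/prefix)
raw = spark.read.json("s3://my-input-bucket/incoming/")

# Scrub and aggregate with basic transformations
cleaned = raw.filter(F.col("status") == "ok")
aggregated = cleaned.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

# Write the aggregated result back to S3
aggregated.write.mode("overwrite").parquet("s3://my-output-bucket/aggregates/")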
But I am not able to understand the need for the 3-node EMR cluster I created and what use it is to me. How can I use the Hadoop file system to my advantage here, along with all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know whether they are used, how often, and so on? I am executing the PySpark code on the master node.
Are there alternatives to EMR that I can use with PySpark?
Is there any good documentation available to get a better understanding?
Thanks
Spark is a framework for distributed computing. It can process larger-than-memory datasets and split the workload into chunks across multiple workers in parallel. By default, EMR creates 1 master node and 2 worker nodes. The disk space on the Spark nodes is typically not used directly; Spark can use it to cache intermediate results.
To use a Hadoop filesystem, you need to run an HDFS service in AWS.
However, S3 is also distributed storage and is supported by the Hadoop libraries. Spark on EMR ships with the Hadoop drivers and supports S3 out of the box. Using Spark with S3 is a perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
There is a Spark manager UI in AWS EMR. You can see each running Spark application session and its current job. By clicking on a job you can see how many executors are used. Whether those executors run on all nodes depends on your Spark memory and CPU configuration. Tuning those is a really big topic, and there are good hints here on SO.
There is also a hardware monitoring tab, showing CPU and memory usage for each node.
The Spark code is always executed on the master node, but it only creates a DAG plan on that node and shifts the actual work to the worker nodes according to the plan. Hence the guides speak of submitting a Spark application rather than executing it.
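To make both the submission step and the executor tuning concrete, here is a hedged spark-submit sketch (the script name and the numbers are placeholders; the flags are standard spark-submit options):
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-cores 2 \
  --executor-memory 4g \
  my_aggregation_job.py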
Yes. You can start your own Spark cluster on normal EC2 instances. There is even a standalone mode, which allows you to run Spark on only one machine. It is quite a large footprint to install, and you still need to tune the memory, CPU, and executor settings, so it is quite complex compared to just implementing some multiprocessing in Python or using Dask. However, there are valid reasons to do so: it lets you use all the cores on one machine, and it gives you a well-known, well-documented API, the same one that can be used to process petabytes of data. The linked article above explains the motivation.
Another possibility is to use AWS Glue. It is serverless Spark: the service submits your jobs to on-demand Spark nodes on AWS that you have no control over, similar to how Lambda functions run on arbitrary AWS EC2 instances. However, Glue has some limitations. With PySpark on Glue you cannot install Python libraries with C extensions, e.g. numpy, pandas, and most ML libraries. Glue also forces you to create a schema mapping of your data in the Athena catalog, whereas standalone Spark can just process the data on the fly.
Databricks also offers a separate serverless Spark solution outside of AWS. It is more sophisticated in my opinion, and it also allows custom C extensions.
A big part of the official documentation focuses on the different data processing APIs rather than on the internals of Apache Spark. There are some good notes on Spark internals on GitHub, and I assume every good book will cover some of Spark's inner workings. AWS EMR is just an automated Spark cluster with a YARN orchestrator. (Unfortunately I have never read a good book on Spark, I got my information here and there, so I cannot recommend one.)
My task is:
1) Import the data from MS SQL Server into HDFS using Sqoop.
2) Process the data through Hive and generate the result in one table.
3) Export that result table from Hive back to MS SQL Server.
I want to perform all of this using Amazon Elastic MapReduce.
The data I am importing from MS SQL Server is very large (about 500,000 entries in one table, and likewise I have 30 tables). For this I have written a Hive task that contains only queries, and each query uses a lot of joins. Because of this, the performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible. For that we have decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances and still get the same performance as on my local machine.
In order to improve the performance, how many instances do I need to use?
Whatever number of instances we use, are they configured automatically, or do I need to specify something while submitting the JAR for execution? I ask because when I use two machines the time is the same.
Also, is there any other way to improve the performance, or is it just a matter of increasing the number of instances? Or am I doing something wrong while executing the JAR?
Please guide me through this, as I don't know much about the Amazon servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right-sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel to allow you to view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy as described in the following: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html)
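For reference, with the old Ruby client the setup was roughly along these lines (the bootstrap-action path and the SOCKS port are from memory, so double-check them against the linked docs):
elastic-mapreduce --create --alive --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia
ssh -N -D 8157 hadoop@<master-public-dns>   # SOCKS tunnel that FoxyProxy points Firefox at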
I am new to Amazon Services and facing some issues.
Suppose I am running a job flow on Amazon Elastic MapReduce with a total of 3 instances. While running my job flow I found that my job is taking a long time to execute. In such a case I need to add more instances so that the job will execute faster.
My question is: how do I add instances to an existing set of instances? Terminating the existing instances and creating a larger number of new ones is time-consuming.
Is there any way to do it? If yes, please suggest how.
I am doing all of this through the CLI, so please share the answers with commands as well as the GUI steps in the AWS Management Console.
Thanks.
Yes, you can do this with the command-line tool.
To add more instances to the core group:
elastic-mapreduce --modify-instance-group CORE --instance-count 40
To create a task group (no data nodes) with 40 instances:
elastic-mapreduce --add-instance-group TASK --instance-count 40 --instance-type c1.medium
It's important to note that the number of CORE instance-group instances cannot be reduced, since they participate as data nodes. It can only be increased.
TASK instances only do processing and can be increased and reduced.
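If you are using the newer unified AWS CLI rather than the old Ruby client, the equivalents are roughly as follows (to the best of my knowledge; check aws emr help for the exact syntax, and the IDs are placeholders):
aws emr modify-instance-groups --instance-groups InstanceGroupId=<core-group-id>,InstanceCount=40
aws emr add-instance-groups --cluster-id <cluster-id> --instance-groups InstanceCount=40,InstanceGroupType=TASK,InstanceType=c1.medium
You can look up the instance-group IDs with aws emr describe-cluster --cluster-id <cluster-id>.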