How can I run multiple queries in parallel on Hive with the Tez execution engine?

We want to run Hive with Tez to query data in HDFS. Since multiple users will query Hive, we need to configure it so that queries execute in parallel.
Because Tez uses YARN to assign resources across nodes, we are trying to limit the number of containers assigned per Hive query in YARN, but we cannot find the right configuration for this.

User Limit Factor is a way to control the maximum amount of resources that a single user can consume. I hope the Cloudera blog below will help you:
https://blog.cloudera.com/yarn-capacity-scheduler/#:~:text=User%20Limit%20Factor%20is%20a%20way%20to%20control%20the%20max%20amount%20of%20resources%20that%20a%20single%20user%20can%20consume
Then there is the queue-level AM share as well:
https://community.cloudera.com/t5/Support-Questions/how-to-tune-yarn-scheduler-capacity-maximum-am-resource/td-p/289415
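Concretely, both knobs live in capacity-scheduler.xml. The snippet below is an illustrative sketch; the queue name "default" and the values are assumptions to adapt to your cluster:

```xml
<!-- capacity-scheduler.xml: illustrative values only -->
<property>
  <!-- A single user may consume at most 50% of the "default" queue's capacity -->
  <name>yarn.scheduler.capacity.root.default.user-limit-factor</name>
  <value>0.5</value>
</property>
<property>
  <!-- Cap ApplicationMasters at 20% of cluster resources, so AMs of many
       concurrent Hive/Tez queries cannot starve the actual work containers -->
  <name>yarn.scheduler.capacity.maximum-am-resource-percent</name>
  <value>0.2</value>
</property>
```

After changing these, refresh the queues (e.g. with `yarn rmadmin -refreshQueues`) for the new limits to take effect.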

Related

Which is a more efficient orchestration mechanism: chaining Databricks notebooks together, or using Apache Airflow?

The data size is in the terabytes.
I have multiple Databricks notebooks for incremental data load into Google BigQuery for each dimension table.
Now, I have to perform this data load every two hours i.e. run these notebooks.
What is a better approach among the following:
Create a master Databricks notebook and use dbutils to chain/parallelize the execution of the aforementioned Databricks notebooks.
Use Google Composer (Apache Airflow's Databricks Operator) to create a master DAG to orchestrate these notebooks remotely.
I want to know which is the better approach, given that I have use cases for both parallel and sequential execution of these notebooks.
I'd be extremely grateful if I could get a suggestion or opinion on this topic, thank you.
Why not try Databricks Jobs? A job is a way of running a notebook either immediately or on a scheduled basis.
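Whichever orchestrator you pick, the sequential-vs-parallel split looks the same. The sketch below shows the pattern in plain Python; `run_notebook` is a hypothetical stand-in for `dbutils.notebook.run` (or a Jobs API call), so the example stays runnable anywhere:

```python
from concurrent.futures import ThreadPoolExecutor

def run_notebook(path: str) -> str:
    # Hypothetical stand-in for dbutils.notebook.run(path, timeout);
    # here it just echoes the path so the pattern can be demonstrated.
    return f"finished {path}"

def run_sequential(paths):
    # Loads that depend on one another run strictly in order.
    return [run_notebook(p) for p in paths]

def run_parallel(paths, max_workers=4):
    # Independent dimension-table loads can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(run_notebook, paths))
```

A master notebook would mix both: `run_parallel` for the independent dimension loads, then `run_sequential` for anything that must follow them. An Airflow DAG expresses the same shape declaratively via task dependencies.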

Flink on YARN using the Table API to read from Hive: many Hive files cause Flink to use all cluster resources (CPU, memory)

When I use Flink to execute a job that reads from Hive, and the Hive table contains about 1,000 files, Flink sets the parallelism to 1,000 and requests so many resources that it consumes the whole cluster. Other jobs then fail because they cannot acquire slots. Each of the 1,000 files is small, so the job should not need all those resources. How can I tune Flink's parameters so the job uses fewer resources?
YARN perspective
I don't recommend relying on YARN's memory enforcement. YARN kills containers instantly when they exceed their limits; usually you need to disable the memory checks to avoid this kind of problem:
"yarn.nodemanager.vmem-check-enabled":"false",
"yarn.nodemanager.pmem-check-enabled":"false"
Flink perspective
You can't limit resource usage per slot. You have to tune your task managers to your needs, either by reducing the number of slots or by running multiple task managers on each node. You can cap a task manager's resource usage with taskmanager.memory.process.size.
Alternatively, you can run Flink on Kubernetes and create a Flink cluster per job, which gives you more flexibility: task managers are created for each job and destroyed when the job completes.
There are also Stateful Functions, which let you deploy a job's pipeline operators into separate containers. This lets you manage each function's resources separately from the task managers and reduces pressure on them.
Flink also supports Reactive Mode, which can reduce pressure on workers by automatically scaling operators up and down based on metrics such as CPU usage.
Explore these features and pick the best fit for your needs.
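In this particular case, the parallelism of 1,000 likely comes from Hive source parallelism inference: the Table API infers one split per file, and the inference cap defaults to 1000. A minimal sketch for the Flink SQL client, assuming Flink 1.13+ with the Hive connector (option names may differ slightly between versions):

```sql
-- Cap the inferred Hive source parallelism instead of letting it reach 1000
SET 'table.exec.hive.infer-source-parallelism.max' = '64';
-- Or disable inference entirely and fall back to the default parallelism
SET 'table.exec.hive.infer-source-parallelism' = 'false';
SET 'parallelism.default' = '16';
```

The same options can be set programmatically on the TableEnvironment's configuration if you are not using the SQL client.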

Apache Impala: YARN-like CPU utilization report for queries (on Cloudera)

We have YARN and Impala co-located on the same Cloudera cluster. The YARN utilization report and the YARN history server provide valuable information such as YARN CPU (vcores) and memory usage.
Does something similar exist for Impala, where I can fetch CPU and memory usage per query and for the cluster as a whole?
Specifically, I want to know how many vcores are utilized out of the CPU allocation.
For example, if an Impala query takes 10 s to execute and used, say, 4 vcores and 50 MB of RAM, how do I find out that 4 vcores were utilized?
Is there a direct way to query this from the cluster, or any other method to compute the CPU utilization?
You can get a lot of information through the Cloudera Manager Charts. You can find an overview of all available metrics on their website or by clicking on the help symbol on the right side when creating a new chart.
There are quite a few metric categories for Impala that might be worth a read. For example, there are the general Impala metrics and the Impala query metrics. The query metrics contain "memory_usage", measured in bytes, and the general metrics contain "impala_query_cm_cpu_milliseconds_rate" and "impala_query_memory_accrual_rate". These seem relevant for your use case, but check them and the linked sites to see which ones fit.
More information is available from the service page of the Impala service in your Cloudera Manager. You can find out more about this page here, but for example the linked page mentions:
The Impala Queries page displays information about Impala queries that are running and have run in your cluster. You can filter the queries by time period and by specifying simple filtering expressions.
It also allows you to display "Threads: CPU Time" and "Work CPU Time" for each query, which again could be relevant for you.
That is all the information available from Impala.
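If you prefer to query these metrics directly rather than build charts by hand, Cloudera Manager's Chart Builder accepts a tsquery expression. A minimal sketch using the metric names mentioned above; the WHERE clause is an assumption and may need adjusting to your cluster's service and entity names:

```
SELECT impala_query_cm_cpu_milliseconds_rate, impala_query_memory_accrual_rate
WHERE serviceName = "impala"
```

The same tsquery can also be issued over the Cloudera Manager REST time-series API, which is handy if you want to collect the numbers programmatically rather than view them in a chart.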

Automating Hive, or Cascading, for ETL in AWS EMR

I have a large dataset residing in AWS S3. The data is typically transactional (like call records). I run a sequence of Hive queries that apply aggregation and filter conditions to produce a couple of final compact files (CSVs with at most a few million rows).
So far with Hive, I have had to run one query after another manually (as queries sometimes fail due to problems in AWS, etc.).
I have processed 2 months of data this way.
For subsequent months, I want to write a workflow that executes the queries one by one and, should a query fail, reruns it. This cannot be done by simply running the Hive queries from a bash script (my current approach, at least):
hive -f s3://mybucket/createAndPopulateTableA.sql
hive -f s3://mybucket/createAndPopulateTableB.sql (this may require Table A to be populated before it executes)
Alternatively, I have been looking at Cascading, wondering whether it might solve my problem; it also has Lingual, which might fit this case. I'm not sure, though, how it fits into the AWS ecosystem.
The best solution would be some kind of Hive query workflow tool. Otherwise, what other options do I have in the Hadoop ecosystem?
Edit:
I am looking at Oozie now, though I'm facing a lot of issues setting it up on EMR. :(
You can use AWS Data Pipeline:
AWS Data Pipeline helps you easily create complex data processing workloads that are fault tolerant, repeatable, and highly available
You can configure it to perform or retry actions when a script fails, and it supports Hive scripts: http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-hiveactivity.html
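As a stopgap before adopting a full workflow tool, the rerun-on-failure behaviour can be approximated with a small driver script. A sketch, where the script paths come from the question and the retry count is an assumption:

```python
import subprocess

def run_with_retry(cmd, attempts=3):
    """Run a command, retrying on non-zero exit; return True on success."""
    for i in range(1, attempts + 1):
        if subprocess.run(cmd).returncode == 0:
            return True
        print(f"attempt {i} of {attempts} failed: {' '.join(cmd)}")
    return False

def run_pipeline(scripts):
    # Run scripts strictly in order and stop if one still fails after all
    # retries, so Table B is never built before Table A is populated.
    for script in scripts:
        if not run_with_retry(["hive", "-f", script]):
            raise RuntimeError(f"giving up on {script}")

# Usage (on the EMR master node):
# run_pipeline([
#     "s3://mybucket/createAndPopulateTableA.sql",
#     "s3://mybucket/createAndPopulateTableB.sql",
# ])
```

This handles transient AWS failures but not scheduling, alerting, or backfills, which is where a tool like Oozie or Data Pipeline earns its keep.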

Speed of job execution in Amazon Elastic MapReduce

My task is:
1) Initially, import the data from MS SQL Server into HDFS using Sqoop.
2) Process the data through Hive and generate the result in one table.
3) Export that result table from Hive back to MS SQL Server.
I want to perform all of this using Amazon Elastic MapReduce.
The data I am importing from MS SQL Server is very large (about 500,000 entries per table, and I have 30 such tables). I have written a Hive task that contains only queries (and each query uses a lot of joins), so performance is very poor on my single local machine (it takes about 3 hours to execute completely).
I want to reduce that time as much as possible, which is why we decided to use Amazon Elastic MapReduce. Currently I am using 3 m1.large instances and still get the same performance as on my local machine.
How many instances should I use to improve performance?
Are the instances configured automatically as I add them, or do I need to specify something when submitting the JAR for execution? I ask because with two machines the execution time is the same.
Also, is there any other way to improve performance besides increasing the number of instances? Or am I doing something wrong when executing the JAR?
Please guide me through this, as I don't know much about Amazon's servers.
Thanks.
You could try Ganglia, which can be installed on your EMR cluster using a bootstrap action. This will give you some metrics on the performance of each node in the cluster and may help you optimise to get the right sized cluster:
http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/UsingEMR_Ganglia.html
If you use the EMR Ruby client on your local machine, you can set up an SSH tunnel that lets you view the Ganglia web interface in Firefox (you'll also need to set up FoxyProxy, as described here: http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/emr-connect-master-node-foxy-proxy.html).
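Beyond sizing the cluster, join-heavy Hive workloads often leave parallelism on the table. The session settings below commonly help; the values are illustrative assumptions to tune for your cluster, not recommendations:

```sql
-- Illustrative Hive session settings for join-heavy queries
SET hive.exec.parallel=true;             -- run independent query stages concurrently
SET hive.exec.parallel.thread.number=8;
SET hive.auto.convert.join=true;         -- use map-side joins when one table is small
SET mapred.reduce.tasks=-1;              -- let Hive choose the reducer count
```

Note that adding instances only helps if the job actually produces enough parallel tasks to occupy them, which is worth checking in Ganglia before scaling further.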