In what mode is Hive installed? - hive

Does hive installation have any specific mode?
Like for example, Hadoop installation has 3 modes: standalone, pseudo-distributed and fully distributed.
Similarly does Hive has any specific type of distribution?
Can Hive be installed in distributed mode?

Hive actually provides you the option to run queries in 2 modes :
1- Map-Reduce mode
2- Local mode
Normally Hive compiler generates map-reduce jobs for most queries under the hood. These jobs are then submitted to the Map-Reduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a map-reduce cluster with multiple nodes, Hadoop also provided you the ability to run map-reduce jobs locally on the your standalone workstation. In order to run Hive queries in local mode you need to do this :
hive> SET mapred.job.tracker=local;
Details can be found here.

Related

Naming hive session cannot work sometimes

I ran sqls on hive tez by hive -f xxx.sql --hiveconf hive.session.id=sessionName
but on the yarn resourcemanager displays like this
HIVE-f4ea6c3f-f4cf-4db3-8801-da6f94e20237
HIVE-d920c434-d2e6-4c1c-a506-d69b580960f7
sometimes it displays correctly..
How can solve this problem
The thing is Tez can reuse containers. AM container reuse = session reuse. controlled by this parameter: tez.am.container.reuse.enabled=true
One yarn AM container can be reused for different Tez sessions. This is the reason why yarn name is different.
BTW there is one more parameter added in this JIRA HIVE-12357, you can set name for each DAG:
hive.query.name

Flink job on EMR runs only on one TaskManager

I am running EMR cluster with 3 m5.xlarge nodes (1 master, 2 core) and Flink 1.8 installed (emr-5.24.1).
On master node I start a Flink session within YARN cluster using the following command:
flink-yarn-session -s 4 -jm 12288m -tm 12288m
That is the maximum memory and slots per TaskManager that YARN let me set up based on selected instance types.
During startup there is a log:
org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=12288, taskManagerMemoryMB=12288, numberTaskManagers=1, slotsPerTaskManager=4}
This shows that there is only one task manager. Also when looking at YARN Node manager I see that there is only one container running on one of the core nodes. YARN Resource manager shows that the application is using only 50% of cluster.
With the current setup I would assume that I can run Flink job with parallelism set to 8 (2 TaskManagers * 4 slots), but in case that submitted job has set parallelism to more than 4, it fails after a while as it could not get desired resources.
In case the job parallelism is set to 4 (or less), the job runs as it should. Looking at CPU and memory utilisation with Ganglia it shows that only one node is utilised, while the other flat.
Why is application run only on one node and how to utilise the other node as well? Did I need to set up something on YARN that it would set up Flink on the other node as well?
In previous version of Flik there was startup option -n which was used to specify number of task managers. The option is now obsolete.
When you're starting a 'Session Cluster', you should see only one container which is used for the Flink Job Manager. This is probably what you see in the YARN Resource Manager. Additional containers will automatically be allocated for Task Managers, once you submit a job.
How many cores do you see available in the Resource Manager UI?
Don't forget that the Job Manager also uses cores out of the available 8.
You need to do a little "Math" here.
For example, if you would have set the number of slots to 2 per TM and less memory per TM, then submitted a job with parallelism of 6 it should have worked with 3 TMs.

Cannot connect Impala-Kudu to Apache Kudu (without Cloudera Manager): Get TTransportException Error

I have successfully installed kudu on Ubuntu (Trusty) as per the official kudu documentations (see http://kudu.apache.org/docs/installation.html ). The setup has one node running master and tablet server and another node running the tablet server only. I am having issues installing impala-kudu without Cloudera Manager on the node running kudu master. I have followed CDH installation instructions on this (see http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_cdh5_install.html ) page until Step 3. I have avoided installing CDH with YARN and MRv1 as I don’t need to run any mapreduce jobs and will not be using hadoop. Impala-kudu and impala-kudu-shell installed without errors. When I launch the impala-shell it returns:
Starting Impala Shell without Kerberos authentication
Error connecting: TTransportException, Could not connect to kudu_test:21000
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.7.0-cdh5-IMPALA_KUDU-cdh5 (48f1ad3) built on Thu Aug 18 12:15:44 PDT 2016)Want to know what version of Impala you're connected to? Run the VERSION command to
find out!
***********************************************************************************
[Not connected] >
I have tried to use the CONNECT option to connect to the kudu-master node without success. Both imapala-kudu and kudu are running on the same machine. Are there additional configuration settings which need to be changed or is hadoop and YARN a strict requirement to make impala-kudu work?
After running ps -ef | grep -i impalad I can confirm the impala daemon is not running. After navigating to the impala logs at ~/var/log/impala I find a few errors and warning files. Here is the output of impalad.ERROR:
Log file created at: 2016/09/13 13:26:24
Running on machine: kudu_test
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0913 13:26:24.084389 3021 logging.cc:118] stderr will be logged to this file.
E0913 13:26:25.406966 3021 impala-server.cc:249] Currently configured default filesystem: LocalFileSystem. fs.defaultFS (file:///) is not supported.ERROR: block location tracking is not properly enabled because
- dfs.datanode.hdfs-blocks-metadata.enabled is not enabled.
- dfs.client.file-block-storage-locations.timeout.millis is too low. It should be at least 10 seconds.
E0913 13:26:25.406990 3021 impala-server.cc:252] Aborting Impala Server startup due to improper configuration. Impalad exiting.
Maybe I need to revisit HDFS and the Hive Metastore to ensure I have these services configured properly?
According to the log, impalad quits because the default filesystem is configured to be LocalFileSystem, which is not supported. You have to set a distributed filesystem, such as HDFS as the default.
Although Kudu is a separate storage system and does not rely on HDFS, Impala still seems to require a non-local default FS even when using with Kudu. The Impala_Kudu documentation explicitly lists the following requirement:
Before installing Impala_Kudu, you must have already installed and configured services for HDFS (though it is not used by Kudu), the Hive Metastore (where Impala stores its metadata), and Kudu.
I can even imagine that HDFS may not really be needed for any other reason than to make Impala happy, but this is just speculation from my side. Update: Found IMPALA-1850 which confirms my suspicion that HDFS should not be needed for Impala any more, but it's not just a single check that has to be removed.

Where to install Hive ( On DataNode or Namenode ) and Why?

On Hadoop Cluster, Where we need to install Hive ,on DataNode or Namenode?
On which factor we need to decide the installation node ( Datanode ot Namenode)
Thanks !!!
Installation of hive is independent of the fact that it resides on the namenode or the datanode. Hive configuration file needs to know where is hadoop installed so that it can access the job tracker.
Once it has the knowledge of where job tracker is running, whenever you execute a query in Hive, it will convert your query into one or more mapreduce program and then it will submit this program to hadoop's jobtracker. Jobtracker then executes this map reduce program and show/store the output.

External script (R) not working

When I try to run the external script (R Script) from kognitio console. I'm getting the below error message.
Error:external script vfork child: No such file or directory
Can someone please help me to understand what it is!
This will be because you have not replicated the script environment to all the DB nodes which are eligible to run the script.
Chapter 10 of the Kognitio Guide (downloadable from http://www.kognitio.com/forums/viewtopic.php?t=3), explains in section 10.2 how the script environment myst be identically installed in the same location on all nodes which will be used in processing, and section 10.6 explains how you can contrain this to a subset of nodes if for some reason you do not want the script environment to be on all nodes (e.g. if it has an expensive per-node licence).
You can use the wxsync tool to synchronise files across all nodes, or a remote deployment tool, such as HP's RDP, to ensure that the script environment is installed identically on all nodes.