On a Hadoop cluster, where do we need to install Hive: on a DataNode or the NameNode?
On which factors should we decide the installation node (DataNode or NameNode)?
Thanks!
Installation of Hive is independent of whether it resides on the NameNode or a DataNode. Hive's configuration just needs to know where Hadoop is installed so that it can reach the JobTracker.
Once it knows where the JobTracker is running, whenever you execute a query, Hive compiles it into one or more MapReduce programs and submits them to Hadoop's JobTracker. The JobTracker then executes these MapReduce programs and the output is shown or stored.
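For illustration, a minimal sketch of that wiring; the paths, JobTracker host/port, and table name below are made-up placeholders, not values from the question:

# Hive only needs to find the Hadoop installation and the JobTracker address
export HADOOP_HOME=/usr/lib/hadoop              # wherever Hadoop lives on this node
export PATH=$HADOOP_HOME/bin:$PATH
# The JobTracker is normally picked up from Hadoop's mapred-site.xml (mapred.job.tracker),
# but it can also be set per session:
hive -e "SET mapred.job.tracker=jobtracker-host:8021; SELECT COUNT(*) FROM my_table;"

Either node type works, as long as the host running Hive can reach the JobTracker over the network.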
When I run the command below,
s3-dist-cp --src s3://test/9.19 --dest hdfs:///user/hadoop/test
I got an error about the auxService:
20/02/03 07:52:13 INFO mapreduce.Job: Task Id : attempt_1580716305878_0001_m_000000_2, Status : FAILED
Container launch failed for container_1580716305878_0001_01_000004 : org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
In many Q&A posts I found a solution like this link, but there is no NodeManager process on my node:
[hadoop@ip-172-31-37-115 ~]$ initctl list | grep yarn
hadoop-yarn-timelineserver start/running, process 8149
hadoop-yarn-resourcemanager start/running, process 17331
hadoop-yarn-proxyserver start/running, process 8147
My EMR cluster was created from the quick-options menu with emr-5.28.0.
Does anyone know about this problem?
Thanks!
I'm sure there's some way to update the configs, but what I did was create a cluster using the 'advanced' setup and choose these software packages:
Ganglia
Hive
Hue
Mahout
Pig
Tez
Spark
Hadoop
(8 in total)
Most of those, except Spark, are installed with the default settings (the first radio button for software packages in the quick setup). One of these software packages, or something related to it, is what causes s3-dist-cp to be installed, and with that setup I was able to use it without problems.
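If you'd rather script it, roughly the same selection can be made with the AWS CLI; the cluster name, instance type, instance count, and key name below are placeholders rather than values from my setup:

aws emr create-cluster \
  --name "s3-dist-cp-test" \
  --release-label emr-5.28.0 \
  --applications Name=Ganglia Name=Hive Name=Hue Name=Mahout Name=Pig Name=Tez Name=Spark Name=Hadoop \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=my-key \
  --use-default-roles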
I have Alluxio 1.8 installed on an EMR 5.19.0 cluster, and can see my S3 tables using /usr/local/alluxio/bin/alluxio fs ls /.
However, when I start up Hive and issue
hive> [[ DDL w/ LOCATION = alluxio://master_host:19998/my_table ]]
I get the following:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
Is there a way of getting past this? I've tried starting Hive with --auxpath pointing both to /usr/local/alluxio/client/alluxio-1.8.1-client.jar and to a copy of the jar on HDFS, without any success.
Any help?
I posted a blog post about the reasons behind the error message java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. Here are some tips; I hope they help:
For Hive, set the environment variable HIVE_AUX_JARS_PATH in conf/hive-env.sh:
export HIVE_AUX_JARS_PATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar:${HIVE_AUX_JARS_PATH}
which I guess is equivalent to what you have done with --auxpath.
Depending on your Hive setup (e.g., Hive on MR, Spark, or Tez), you may also need to make sure the runtime can access the client jar. Taking Hive on MR as an example, you probably also need to append the path to the Alluxio client jar to mapreduce.application.classpath or yarn.application.classpath so that every task of the MR jobs can access it.
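For example, a minimal sketch of the MR case, assuming the client jar sits at /usr/local/alluxio/client/alluxio-1.8.1-client.jar on every node and keeping whatever entries your cluster already has in that property:

<!-- mapred-site.xml: let every MR task load alluxio.hadoop.FileSystem -->
<property>
  <name>mapreduce.application.classpath</name>
  <value>...your existing classpath entries...,/usr/local/alluxio/client/alluxio-1.8.1-client.jar</value>
</property>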
I have successfully installed Kudu on Ubuntu (Trusty) as per the official Kudu documentation (see http://kudu.apache.org/docs/installation.html ). The setup has one node running the master and a tablet server, and another node running a tablet server only. I am having issues installing impala-kudu without Cloudera Manager on the node running the Kudu master. I have followed the CDH installation instructions on this page (see http://www.cloudera.com/documentation/enterprise/latest/topics/cdh_ig_cdh5_install.html ) until Step 3. I have avoided installing CDH with YARN and MRv1, as I don't need to run any MapReduce jobs and will not be using Hadoop. impala-kudu and impala-kudu-shell installed without errors. When I launch the impala-shell it returns:
Starting Impala Shell without Kerberos authentication
Error connecting: TTransportException, Could not connect to kudu_test:21000
***********************************************************************************
Welcome to the Impala shell. Copyright (c) 2015 Cloudera, Inc. All rights reserved.
(Impala Shell v2.7.0-cdh5-IMPALA_KUDU-cdh5 (48f1ad3) built on Thu Aug 18 12:15:44 PDT 2016)
Want to know what version of Impala you're connected to? Run the VERSION command to find out!
***********************************************************************************
[Not connected] >
I have tried to use the CONNECT option to connect to the kudu-master node without success. Both impala-kudu and Kudu are running on the same machine. Are there additional configuration settings that need to be changed, or are Hadoop and YARN a strict requirement to make impala-kudu work?
After running ps -ef | grep -i impalad I can confirm the Impala daemon is not running. After navigating to the Impala logs at ~/var/log/impala I find a few error and warning files. Here is the output of impalad.ERROR:
Log file created at: 2016/09/13 13:26:24
Running on machine: kudu_test
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0913 13:26:24.084389 3021 logging.cc:118] stderr will be logged to this file.
E0913 13:26:25.406966 3021 impala-server.cc:249] Currently configured default filesystem: LocalFileSystem. fs.defaultFS (file:///) is not supported.
ERROR: block location tracking is not properly enabled because
- dfs.datanode.hdfs-blocks-metadata.enabled is not enabled.
- dfs.client.file-block-storage-locations.timeout.millis is too low. It should be at least 10 seconds.
E0913 13:26:25.406990 3021 impala-server.cc:252] Aborting Impala Server startup due to improper configuration. Impalad exiting.
Maybe I need to revisit HDFS and the Hive Metastore to ensure I have these services configured properly?
According to the log, impalad quits because the default filesystem is configured to be LocalFileSystem, which is not supported. You have to set a distributed filesystem, such as HDFS, as the default.
Although Kudu is a separate storage system and does not rely on HDFS, Impala still seems to require a non-local default filesystem even when used with Kudu. The Impala_Kudu documentation explicitly lists the following requirement:
Before installing Impala_Kudu, you must have already installed and configured services for HDFS (though it is not used by Kudu), the Hive Metastore (where Impala stores its metadata), and Kudu.
I can even imagine that HDFS may not really be needed for any reason other than to make Impala happy, but this is just speculation on my side. Update: I found IMPALA-1850, which confirms my suspicion that HDFS should not be needed for Impala anymore, but it's not just a single check that has to be removed.
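Until then, here is a minimal sketch of the settings the impalad.ERROR output complains about; the NameNode host and port are placeholders, so adjust them to your cluster:

<!-- core-site.xml: give Impala a non-local default filesystem -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://namenode-host:8020</value>
</property>

<!-- hdfs-site.xml: the two block-location settings from the log -->
<property>
  <name>dfs.datanode.hdfs-blocks-metadata.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.client.file-block-storage-locations.timeout.millis</name>
  <value>10000</value>
</property>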
I am exploring the Big Data plugin in Pentaho 5.2. I was trying to run the Pig Script Executor. I am unable to understand the usage of Enable Blocking. The PDI documentation says that
If checked, the Pig Script Executor job entry will prevent downstream
entries from executing until the script has finished processing.
I am aware that running a Pig script converts the execution into MapReduce jobs. I am running the job as Start job -> Pig Script. If I disable the Enable blocking option I am unable to execute the script; I get permission denied errors.
What does downstream mean here? I do not pass any hops out of the Pig Script entry. I am unable to understand the Enable blocking option. Any hints would be helpful and appreciated.
Enable blocking checked: the task is deployed to the Hadoop cluster; PDI follows up on its progress and only proceeds with the rest of the job entries AFTER the execution of the Hadoop job finishes.
Enable blocking disabled: PDI deploys the task to the Hadoop cluster and forgets about it. The rest of the job entries proceed immediately after the cluster accepts the task; PDI does not wait for it to complete.
Does the Hive installation have any specific modes?
For example, Hadoop installation has 3 modes: standalone, pseudo-distributed, and fully distributed.
Similarly, does Hive have any specific type of distribution?
Can Hive be installed in distributed mode?
Hive actually gives you the option to run queries in 2 modes:
1- MapReduce mode
2- Local mode
Normally the Hive compiler generates MapReduce jobs for most queries under the hood. These jobs are then submitted to the MapReduce cluster indicated by the variable:
mapred.job.tracker
While this usually points to a MapReduce cluster with multiple nodes, Hadoop also gives you the ability to run MapReduce jobs locally on your standalone workstation. In order to run Hive queries in local mode, you need to do this:
hive> SET mapred.job.tracker=local;
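After that, queries in the same session run through Hadoop's local job runner in a single JVM. For example (my_table is just a hypothetical table name, not one from the question):

hive> SELECT COUNT(*) FROM my_table;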
Details can be found here.