Hadoop hive table not getting repaired with msck - hive

I have done a distcp of data from one partition to a location in the Hive warehouse belonging to a different partitioned table with the same structure, but when I run MSCK it exits with error code 1. Kindly guide.

Try enabling logging on the Hive CLI and rerun the MSCK command; the console output will show why it failed.
hive --hiveconf hive.root.logger=INFO,console
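As a minimal sketch (the database and table names here are placeholders), the logging flag can be combined with the repair in a single invocation so the failure reason is printed to the console:
hive --hiveconf hive.root.logger=INFO,console -e "msck repair table my_db.my_table;"
The console output then usually points to the partition directory or metastore call that failed.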

Related

How to run hive commands from shell?

I have to repair tables in Hive from my shell script after successful completion of my Spark application.
msck repair table <DATABASE_NAME>.<TABLE_NAME>;
Please suggest a suitable approach for this that also works for large tables with many partitions.
I have found a workaround for this using:
hive -S -e "msck repair table <DATABASE_NAME>.<TABLE_NAME>;"
-S : runs Hive in silent mode, suppressing the extra output Hive normally prints.
-e : executes the quoted Hive command.
-f : executes an HQL script file instead of an inline command.
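As a hedged sketch (the jar, class, database, and table names are placeholders), the repair can be chained to the Spark job in a shell script so it only runs after a successful application:
#!/bin/bash
# Run the Spark job; abort if it fails.
spark-submit --class com.example.MyApp my_app.jar "$@" || exit 1
# Spark succeeded: sync the new partition directories into the metastore.
hive -S -e "msck repair table my_db.my_table;"
For tables with a very large number of partitions the repair itself can still take a while, since Hive has to scan the table's directory tree for partitions missing from the metastore.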

Why am I getting an error when importing table data from Oracle DB to Hive using Sqoop?

I am getting the below error while importing data from an Oracle DB to Hive using Sqoop:
ERROR tool.ImportTool: Encountered IOException running import job: java.io.IOException: Cannot run program "hive": error=2, No such file or directory
Below is the command I am executing:
sqoop import --connect jdbc:oracle:thin:@host:port/xe --username sa --password sa --table SA.SHIVAMSAMPLE --hive-import -m 1
The data is getting created inside HDFS, but the Hive table is not getting created; a folder just appears under the default directory (visible with bin/hdfs dfs -ls).
Only when I explicitly give the warehouse path does the data land under a warehouse directory like "user/hive/warehouse", and even then the table is not created and no data is loaded.
I installed Hadoop in "Amit/hadoop-2.6.5", Hive in "Amit/apache-hive-1.2.1-bin", and Sqoop in "Amit/sqoop-1.4.5-cdh5.3.2"; in .bashrc I set the Hadoop path only.
Do the Hive and Sqoop paths need to be set as well?
When I set HIVE_HOME in the sqoop-env.sh file, the above command runs fine, but the table is still not created and the file only ends up in HDFS at /user/hive/warehouse/shivamsample.
Can you please tell me what extra configuration is required to resolve this issue?
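One likely cause of the Cannot run program "hive" error is that the hive launcher is not on the PATH of the shell that runs Sqoop, so the Hive import step cannot start. A minimal sketch of the .bashrc entries (the install locations are assumed from the question and should be adjusted to the real home directory) would be:
export HIVE_HOME=/home/Amit/apache-hive-1.2.1-bin
export SQOOP_HOME=/home/Amit/sqoop-1.4.5-cdh5.3.2
export PATH=$PATH:$HIVE_HOME/bin:$SQOOP_HOME/bin
After sourcing .bashrc (source ~/.bashrc) and rerunning the sqoop import, Sqoop should be able to invoke hive itself and create and load the table.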

Google DataProc Hive and Presto query doesn't work

I have a Google DataProc cluster with Presto installed as an optional component. I created an external table in Hive and its size is ~1 GB. While the table is queryable (for example, group-by statements, distinct, etc. succeed), I have problems performing a simple select * from tableA with both Hive and Presto:
For Hive, if I log in to the master node of the cluster and run the query from the Hive command line, it succeeds. However, when I run the following command from my local machine:
gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;"
I get the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
ERROR: (gcloud.dataproc.jobs.submit.hive) Job [3e165c0edcda4e35ad0d5f62b77725bc] entered state [ERROR] while waiting for [DONE].
This happens even though I've updated the configuration in mapred-site.xml as:
mapreduce.map.memory.mb=9000;
mapreduce.map.java.opts=-Xmx7000m;
mapreduce.reduce.memory.mb=9000;
mapreduce.reduce.java.opts=-Xmx7000m;
For Presto, statements such as group by and distinct similarly work. However, the select * from tableA hangs every time at about RUNNING 60% until it times out, and I get the same issue whether I run it from my local machine or from the master node of the cluster.
I don't understand why such a small external table causes such an issue. Any help is appreciated, thank you!
The Presto CLI binary /usr/bin/presto specifies a jvm -Xmx argument inline (it uses some tricks to bootstrap itself as a java binary); unfortunately, that -Xmx is not normally fetched from /opt/presto-server/etc/jvm.config like the settings for the actual presto-server.
In your case, if you're selecting everything from a 1G parquet table, you're probably actually dealing with something like 6G uncompressed text, and you're trying to stream all of that to the console output. This is likely also not going to work with the Dataproc job-submission because the streamed output is designed to print out human-readable amounts of data, and will slow down considerably if dealing with non-human amounts of data.
If you want to still try doing that with the CLI, you can run:
sudo sed -i "s/Xmx1G/Xmx5G/" /usr/bin/presto
to modify the JVM settings for the CLI on the master before starting it back up. You'd probably then want to pipe the output to a local file instead of streaming it to your console, because you won't be able to read 6 GB of text streaming across your screen.
I think the problem is that the output of gcloud dataproc jobs submit hive --cluster $CLUSTER_NAME --region $REGION --execute "SELECT * FROM tableA;" went through the Dataproc server which OOMed. To avoid that, you can query data from the cluster directly without going through the server.
Try following the Dataproc Presto tutorial's "Presto CLI queries" section and run these commands from your local machine:
gcloud compute ssh <master-node> \
--project=${PROJECT} \
--zone=${ZONE} \
-- -D 1080 -N
./presto-cli \
--server <master-node>:8080 \
--socks-proxy localhost:1080 \
--catalog hive \
--schema default
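Once the SSH tunnel and CLI are up, a full-table dump is better written to a local file than streamed to the terminal; a hedged example (the table name is taken from the question, the output path is a placeholder):
./presto-cli \
--server <master-node>:8080 \
--socks-proxy localhost:1080 \
--catalog hive \
--schema default \
--execute "SELECT * FROM tableA" \
--output-format CSV > /tmp/tableA.csv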

How to run HiveServer2 in the background so it doesn't get terminated on closing the terminal

My HiveServer2 command runs properly in an (AWS) Ubuntu terminal:
hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10002 --hiveconf hive.root.logger=LOG,console
but when I close the terminal my Hive server stops.
I want a command to solve this problem. Thanks.
At last, after searching a lot, I found this command:
hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10002 > /dev/null 2>&1 &
After running this you will get a process ID; save it somewhere, as it will be needed when you want to kill the same process later. I am doing it like this :)
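A slightly more robust sketch (the log and PID file locations are placeholders) uses nohup so the server also survives a closed or dropped SSH session, and saves the PID for a later shutdown:
nohup hive --service hiveserver2 --hiveconf hive.server2.thrift.port=10002 > /tmp/hiveserver2.log 2>&1 &
echo $! > /tmp/hiveserver2.pid
kill "$(cat /tmp/hiveserver2.pid)"   # run this later when you want to stop the server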

How to use Apache Hive with a fully distributed cluster

I am using Hadoop 1.2.1 with 3 datanodes and one namenode. My HBase version is 0.94.14. I have configured Apache Hive 1.0 on the namenode machine.
I have to import HBase table data into Hive. When I run a query, it gives the following error in the log file:
ERROR org.apache.hadoop.hbase.mapreduce.TableInputFormatBase - Cannot resolve the host name for /192.168.3.9 because of javax.naming.NameNotFoundException: DNS name not found [response code 3]; remaining name '9.3.168.192.in-addr.arpa'
What is the problem in my setup? I have followed this tutorial for the Hadoop installation.
In the Hadoop namenode log file, the following warning appears when I run a query in Hive:
WARN org.apache.hadoop.hdfs.server.namenode.FSEditLog: Cannot roll edit log, edits.new files already exists in all healthy directories:
Does Hive need any information about how many datanodes Hadoop has?
Also, my HMaster is running on another machine and I have configured Hive on the namenode machine.
Your Hadoop, ZooKeeper, HBase and Hive should all be running.
1) COPY THESE FILES TO THE HADOOP LIBRARY.
sudo cp /usr/lib/hive/lib/hive-common-0.7.0-cdh3u0.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/hive/lib/hbase-0.90.1-cdh3u0.jar /usr/lib/hadoop/lib/
2) STOP HBASE AND HADOOP USING THE FOLLOWING COMMANDS
/usr/lib/hadoop/bin/stop-all.sh
/usr/lib/hbase/bin/stop-hbase.sh
3) RESTART HBASE AND HADOOP USING THE FOLLOWING COMMANDS
/usr/lib/hadoop/bin/start-all.sh
/usr/lib/hbase/bin/start-hbase.sh
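Separately, the javax.naming.NameNotFoundException in the TableInputFormatBase error suggests that reverse DNS lookups for the datanode IPs are failing. A common workaround (the hostnames below are placeholders) is to give every node a consistent forward and reverse mapping in /etc/hosts on all machines in the cluster, for example:
192.168.3.9   datanode3.cluster.local   datanode3
and to use the same hostnames in the Hadoop and HBase configuration files.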