Hive-HBase: ClassNotFoundException when running a MapReduce job

I have a Hive + HBase integration cluster.
I created a table with:
CREATE TABLE hbase_table_1(key int, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val")
TBLPROPERTIES ("hbase.table.name" = "xyz");
This query works:
select * from hbase_table_1;
But when I run a count, a ClassNotFoundException is thrown:
select count(*) from hbase_table_1;
The error output is:
java.io.IOException:cannot find class
at org.apache.............HiveInputformat.getRecordReader(HiveInputFormat.java:220)
...........
Caused by: java.lang.ClassNotFoundException:
at java.lang.Class.forName0(Native Method)
The error message does not tell me which class is missing.
Sorry for my poor English.
Has anyone encountered this issue?

1) Copy these files to the Hadoop library directory:
sudo cp /usr/lib/hive/lib/hive-common-0.7.0-cdh3u0.jar /usr/lib/hadoop/lib/
sudo cp /usr/lib/hive/lib/hbase-0.90.1-cdh3u0.jar /usr/lib/hadoop/lib/
2) Stop HBase and Hadoop with the following commands:
/usr/lib/hbase/bin/stop-hbase.sh
/usr/lib/hadoop/bin/stop-all.sh
3) Restart Hadoop and HBase with the following commands:
/usr/lib/hadoop/bin/start-all.sh
/usr/lib/hbase/bin/start-hbase.sh
Now create the table in Hive using the HBase storage handler.
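Because the stack trace in the question does not name the missing class, it can help to check which jars on a node actually contain a given class before copying anything. The following is a minimal Python sketch of that check. The function name find_jars_with_class is made up for illustration and is not part of Hive or Hadoop, and the directory you pass in is whatever lib directory applies on your installation.

```python
import zipfile
from pathlib import Path


def find_jars_with_class(lib_dir, class_name):
    """Return the sorted jar paths under lib_dir that contain class_name.

    class_name is a fully qualified Java class name, for example
    org.apache.hadoop.hive.hbase.HBaseStorageHandler.
    """
    # A class a.b.C is stored inside a jar as the entry a/b/C.class.
    entry = class_name.replace(".", "/") + ".class"
    matches = []
    for jar in sorted(Path(lib_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as zf:
                if entry in zf.namelist():
                    matches.append(str(jar))
        except zipfile.BadZipFile:
            # Skip files that are named *.jar but are not valid archives.
            continue
    return matches
```

For example, calling find_jars_with_class("/usr/lib/hive/lib", "org.apache.hadoop.hive.hbase.HBaseStorageHandler") lists the jars that would need to be visible to the MapReduce tasks. An empty list means the class is not in that directory at all.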

Related

Cloud Composer - DAG error: java.lang.ClassNotFoundException: Failed to find data source: bigquery

I'm trying to run a DAG in Cloud Composer that creates a Dataproc cluster, but it fails when the job tries to write to BigQuery. I suspect a jar file is missing (--jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar), but I don't know how to add it to my code.
Code:
submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID,
)
If I submit this job directly on the cluster, it works:
gcloud dataproc jobs submit pyspark --cluster cluster-bc4b --jars gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar --region us-central1 ~/examen/ETL/loadBQ.py
But I don't know how to replicate this in Airflow.
PySpark code:
df.write.format("bigquery").mode("append").option("temporaryGcsBucket", "ds1-dataproc/temp").save("test-opi-330322.test.Base3")
In your example:
submit_job = DataprocSubmitJobOperator(
    task_id="pyspark_task",
    job=PYSPARK_JOB,
    location=REGION,
    project_id=PROJECT_ID,
)
the jars should be part of PYSPARK_JOB, like this:
PYSPARK_JOB = {
    "reference": {"project_id": PROJECT_ID},
    "placement": {"cluster_name": CLUSTER_NAME},
    "pyspark_job": {
        "main_python_file_uri": PYSPARK_URI,
        "jar_file_uris": ["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
    },
}
See this doc.
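If several DAGs submit PySpark jobs with different jars, the job dictionary above can be assembled by a small helper. This is only a sketch: build_pyspark_job is a hypothetical function, and the project ID and main file URI below are placeholders rather than values from the question. The resulting dict has the same shape as the PYSPARK_JOB shown in the answer.

```python
def build_pyspark_job(project_id, cluster_name, main_python_file_uri, jar_file_uris=()):
    """Build the job dict passed as job= to DataprocSubmitJobOperator."""
    pyspark_job = {"main_python_file_uri": main_python_file_uri}
    if jar_file_uris:
        # Plays the role of --jars in `gcloud dataproc jobs submit pyspark`.
        pyspark_job["jar_file_uris"] = list(jar_file_uris)
    return {
        "reference": {"project_id": project_id},
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": pyspark_job,
    }


# Placeholder project and file URI; the jar URI is the one from the question.
PYSPARK_JOB = build_pyspark_job(
    project_id="my-project",
    cluster_name="cluster-bc4b",
    main_python_file_uri="gs://my-bucket/loadBQ.py",
    jar_file_uris=["gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar"],
)
```

The helper omits the jar_file_uris key entirely when no jars are given, so jobs that need no extra jars keep a minimal definition.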

How to locate/export Hive query?

I am new to Hive and am trying to export the result of a Hive query to a local file on my computer so that I can import it into Excel.
When I run this from inside Hive:
hive -e 'select * from TABLE limit 10' > output.txt;
I get "FAILED: ParseException line 1:0 cannot recognize input near 'hive' '-' 'e'".
When I run
hive -S -e "USE DATABASE; select * from TABLE limit 10" > /tmp/test/test.csv;
from the shell, or
insert overwrite local directory '/tmp/hello'
select * from TABLE limit 10;
the output goes to the HDFS file system. How do I get it onto my local machine?
You can export the query result to a file like this (the Hive CLI separates columns with tabs, despite the .csv name):
hive -e 'select * from your_Table' > /home/yourfile.csv
To copy a file from HDFS to your local machine, use hdfs dfs -get:
hdfs dfs -get /tmp/hello /PATHinLocalMachine
Check out this question.
You are seeing the error because you are running the hive -e command inside the Hive REPL, as shown below:
hive (venkat)> hive -e 'select * from a';
NoViableAltException(26@[])
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1084)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:202)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:166)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:437)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:320)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1219)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1260)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1156)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1146)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:216)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:168)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:379)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:739)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:684)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:624)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:233)
at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
FAILED: ParseException line 1:0 cannot recognize input near 'hive' '-' 'e'
You have to run it in the OS shell instead, as shown below:
[venkata_udamala@gw02 ~]$ hive -e 'use database_name;select * from table_name;' > temp.txt
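Since the goal in the question is a file that opens in Excel, the same OS-level call can be wrapped in a short Python script that also converts the output to CSV. This is a sketch under two assumptions: the hive CLI is on the PATH, and it prints one row per line with tab-separated columns (its default output). The names tsv_to_csv and export_query are made up for this example.

```python
import csv
import io
import subprocess


def tsv_to_csv(tsv_text):
    """Convert tab-separated text (one row per line) to CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in tsv_text.splitlines():
        # csv.writer quotes any field that itself contains a comma.
        writer.writerow(line.split("\t"))
    return out.getvalue()


def export_query(query, csv_path):
    """Run `hive -S -e <query>` as an OS process and save the result as CSV."""
    # Passing a list avoids shell quoting problems with the query text.
    result = subprocess.run(
        ["hive", "-S", "-e", query],
        check=True,
        capture_output=True,
        text=True,
    )
    with open(csv_path, "w", newline="") as f:
        f.write(tsv_to_csv(result.stdout))
```

A call such as export_query("use database_name; select * from table_name limit 10", "temp.csv") would then write the file on the machine where the script runs, not in HDFS.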

FileNotFound Error While Fetching data from Hive External Table using HiveContext

I am trying to fetch data from a Hive external table using HiveContext and store it in a text file. The data path for the external table is hdfs:/data/abc/job_log. My code fails intermittently with the error below.
WARN TaskSetManager: Lost task 1524.0 in stage 0.0 (TID 1524, ): java.io.FileNotFoundException: File does not exist: /data/abc/job_log/abc_job_20171027001515.COPYING
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:71)
at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:61)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocationsInt(FSNamesystem.java:1828)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1799)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:1712)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:672)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:373)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
I am using Spark 1.6.1, Scala 2.10.5, and an HDP 2.4.2 cluster. Any help will be appreciated.

HBase/Hive table queried from Squirrel SQL - Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler

I am trying to query an HBase table through Squirrel SQL. I created a Hive external table as follows:
create external table tweets_hbase(key string, value string)
stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
with serdeproperties ("hbase.columns.mapping" = ":key,data:tweet_text")
tblproperties ("hbase.table.name" = "tweets_hbase")
I am able to query it from the Hive command line:
hive> select * from tweets_hbase;
OK
20160725001730109 {"createdat":"25-Jul-2016 12:17:03","tweet_date":"2016-07-25","text":"私のランドールスゴビ:) \n#abyssrium\nhts:t.co/NcKtQi9lzm ht/t.co/WNgQIxLU05","user":"uw_kyaaaan","uniqueid":1469420239464,"searchtag":"Apple"}
20160725001730266 {"createdat":"25-Jul-2016 12:17:03","tweet_date":"2016-07-25","text":"2016年7月24日\n8422 Steps\n移動距離 6.485 km\n消費カロリー 467.6 kcal\n\n#M7POPOPO ht/t.co/eFathZXTHD","user":"matsuwichi","uniqueid":1469420239465,"searchtag":"Apple"}
20160725001730308 {"createdat":"25-Jul-2016 12:17:03","tweet_date":"2016-07-25","text":"RT #JBCrewdotcom: Don't forget to leave a nice review for #Coldwater after purchasing! \niTunes: t.co/p5YKRwPKNw\nGoogle Play: ht\u2026","user":"2016OLLGAndUGRL","uniqueid":1469420239466,"searchtag":"Apple"}
However, when I try to query it through Squirrel SQL, I get an "Error in loading storage handler" exception. The necessary JARs have been added to the Extra Class Path:
hive-hbase-handler-1.1.0.jar
hbase-client-1.1.5.jar
hbase-common-1.1.5.jar
hbase-protocol-1.1.5.jar
hbase-server-1.1.5.jar
hive-jdbc-1.1.1-standalone.jar
Please help.
java.sql.SQLException: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Error in loading storage handler.org.apache.hadoop.hive.hbase.HBaseStorageHandler
at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:296)
at net.sourceforge.squirrel_sql.client.session.StatementWrapper.execute(StatementWrapper.java:165)
at net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.processQuery(SQLExecuterTask.java:369)
at net.sourceforge.squirrel_sql.client.session.SQLExecuterTask.run(SQLExecuterTask.java:212)
at net.sourceforge.squirrel_sql.fw.util.TaskExecuter.run(TaskExecuter.java:82)
at java.lang.Thread.run(Unknown Source)
I solved this myself. The following is what I had to do:
Upgrade HBase to 1.2.2.
Start the Thrift server with the following jars passed through the --jars option:
./start-thriftserver.sh --hiveconf hive.server2.thrift.port=10001 \
--hiveconf hive.server2.thrift.bind.host=xxx.xxx.xxx.xxx \
--hiveconf spark.cores.max=2 \
--master spark://xxx.xxx.xxx.xxx:7077 \
--name ThriftServer \
--jars file:///home/hadoop/software/apache-hive-1.2.1-bin/lib/hive-hbase-handler-1.2.1.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-common-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-protocol-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-client-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/guava-12.0.1.jar,file:///home/hadoop/software/hbase-1.2.2/lib/hbase-server-1.2.2.jar,file:///home/hadoop/software/hbase-1.2.2/lib/htrace-core-3.1.0-incubating.jar,file:///home/hadoop/software/hbase-1.2.2/lib/metrics-core-2.2.0.jar
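The --jars value is a single comma-separated list with no spaces, which is easy to get wrong when typed by hand. A small sketch of how it could be generated from a list of absolute local paths follows. The helper jars_argument is hypothetical, and the two paths in the usage note are simply taken from the command above.

```python
def jars_argument(jar_paths):
    """Join absolute local jar paths into a comma-separated file:// URI list."""
    uris = []
    for path in jar_paths:
        if not path.startswith("/"):
            raise ValueError("expected an absolute path, got %r" % path)
        uris.append("file://" + path)
    # One value, comma-separated, with no spaces between entries.
    return ",".join(uris)
```

For instance, jars_argument(["/home/hadoop/software/hbase-1.2.2/lib/hbase-common-1.2.2.jar", "/home/hadoop/software/hbase-1.2.2/lib/hbase-client-1.2.2.jar"]) yields a string that can be pasted after --jars.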

How to load XML data file into Hive table?

While loading an XML data file into a Hive table, I got the following error message:
FAILED: SemanticException 7:9 Input format must implement InputFormat. Error encountered near token 'StoresXml'.
This is how I am loading the XML file:
Create a table StoresXml:
CREATE EXTERNAL TABLE StoresXml (storexml string)
STORED AS INPUTFORMAT 'org.apache.mahout.classifier.bayes.XmlInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/user/hive/warehouse/stores';
The location /user/hive/warehouse/stores is in HDFS.
load data inpath <local path where the xml file is stored> into table StoresXml;
Now the problem is that when I select any column from table StoresXml, the error above comes up.
Please help. Where am I going wrong?
1) First, create a single-column table:
CREATE TABLE xmlsample(xml string);
2) Then load the data from the local file system or HDFS into the Hive table:
LOAD DATA INPATH '---------' INTO TABLE XMLSAMPLE;
3) Next, query the column with the XPath UDFs (xpath, xpath_string, and so on), as in the sample XML queries.
I have just loaded this transactions.xml file into a Hive table using xpath.
For the XML file:
Bring each record of the XML file onto one line:
terminal> cat /home/cloudera/Desktop/Test/Transactions_xml.xml | tr -d '&' | tr '\n' ' ' | tr '\r' ' ' | sed 's|</record>|</record>\n|g' | grep -v '^\s*$' > /home/cloudera/Desktop/trx_xml
terminal> hadoop fs -put /home/cloudera/Desktop/trx_xml /user/cloudera/DataTest/Transactions_xml
hive> create table Transactions_xml1(xmldata string);
hive> load data inpath '/user/cloudera/DataTest/Transactions_xml' overwrite into table Transactions_xml1;
hive> create table Transactions_xml(trx_id int,account int,amount int);
hive> insert overwrite table Transactions_xml select xpath_int(xmldata,'record/Tid'),
xpath_int(xmldata,'record/AccounID'),
xpath_int(xmldata,'record/Amount') from Transactions_xml1;
I hope this will help you. Let me know the result.
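The one-record-per-line step matters because the table stores each text line as one row, so xpath can only see a record that sits entirely on one line. For readers who prefer not to chain tr and sed, here is a rough Python equivalent of that step. It is a sketch, not a drop-in replacement: records_to_lines is a made-up name, it trims surrounding whitespace on each output line, and it does not delete '&' characters the way the shell pipeline does.

```python
def records_to_lines(xml_text, record_tag="record"):
    """Return xml_text with each <record>...</record> element on its own line."""
    # Flatten every line break to a space, like `tr '\n' ' ' | tr '\r' ' '`.
    flat = xml_text.replace("\r", " ").replace("\n", " ")
    # Break the text after each closing tag, like the sed step.
    closing = "</%s>" % record_tag
    lines = flat.replace(closing, closing + "\n").split("\n")
    # Drop whitespace-only lines, like `grep -v '^\s*$'`.
    return "\n".join(line.strip() for line in lines if line.strip()) + "\n"
```

Write the returned text to a file and upload it with hadoop fs -put as in the answer above.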
I have developed a tool that generates Hive scripts from a CSV file. The following examples show how the files are generated.
Tool: https://sourceforge.net/projects/csvtohive/?source=directory
Select a CSV file using Browse and set the Hadoop root directory, e.g. /user/bigdataproject/
The tool generates a Hadoop script covering all the CSV files. The following is a sample generated script that loads the CSV files into Hadoop:
#!/bin/bash -v
hadoop fs -put ./AllstarFull.csv /user/bigdataproject/AllstarFull.csv
hive -f ./AllstarFull.hive
hadoop fs -put ./Appearances.csv /user/bigdataproject/Appearances.csv
hive -f ./Appearances.hive
hadoop fs -put ./AwardsManagers.csv /user/bigdataproject/AwardsManagers.csv
hive -f ./AwardsManagers.hive
Sample of generated Hive scripts
CREATE DATABASE IF NOT EXISTS lahman;
USE lahman;
CREATE TABLE AllstarFull (playerID string,yearID string,gameNum string,gameID string,teamID string,lgID string,GP string,startingPos string) row format delimited fields terminated by ',' stored as textfile;
LOAD DATA INPATH '/user/bigdataproject/AllstarFull.csv' OVERWRITE INTO TABLE AllstarFull;
SELECT * FROM AllstarFull;
Thanks
Vijay
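The generated CREATE TABLE statement above maps every CSV header to a string column. A rough Python sketch of that idea is shown here. It is not the tool's actual implementation: hive_ddl_from_csv_header is a hypothetical name, and the sketch does no identifier quoting or type inference.

```python
import csv
import io


def hive_ddl_from_csv_header(csv_text, table_name):
    """Build a CREATE TABLE statement with one string column per CSV header."""
    # Only the first row (the header) is read; data rows are ignored.
    header = next(csv.reader(io.StringIO(csv_text)))
    columns = ",".join("%s string" % name.strip() for name in header)
    return (
        "CREATE TABLE %s (%s) "
        "row format delimited fields terminated by ',' "
        "stored as textfile;" % (table_name, columns)
    )
```

Note that a plain LOAD DATA of the same file would also load the header line as a data row unless it is removed or skipped separately.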