Compatibility problem between an external ORC file and Cloudera's Hive - hive

I cannot solve a compatibility problem between an externally created ORC file and Cloudera's Hive.
I have Cloudera Express 6.3.2 with Hive 2.1.1.
It is strange: I downloaded the latest version of Cloudera, and it still ships the old Hive 2.1.1.
Case:
Externally I create an ORC file (I tried creating it both in local Spark and in the same Cloudera cluster through a MapReduce job, with the same result).
I try to read this ORC file in my Cloudera cluster, even just through orcfiledump,
and I get:
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 6
at org.apache.orc.OrcFile$WriterVersion.from(OrcFile.java:145)
I downloaded the orc-tools-1.5.5-uber.jar utility to my local computer,
downloaded the problematic ORC file there as well,
and ran java -jar orc-tools-1.5.5-uber.jar meta msout2o12.orc
The uber jar, with its own Hadoop inside, read this ORC file just fine:
Structure for msout2o12.orc
File Version: 0.12 with ORC_135
Rows: 242
Compression: ZLIB
Compression size: 262144
Even without creating any tables, Hive in Cloudera simply cannot read the ORC file using its own utility.
The problem started when I created an external table and a HiveQL query over the ORC file produced this error.
But here I have just reduced the problem to a minimum: plain hive --orcfiledump cannot read the ORC file.
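For reference, the minimal reproduction is just the dump utility itself (the HDFS path below is a placeholder for wherever the file actually sits):
# placeholder path; this is the same file that orc-tools reads without any problems
hive --orcfiledump /user/myuser/msout2o12.orc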
How can I make Cloudera read these ORC files normally?
What do I need to adjust in my Cloudera setup?

This was a big surprise for me.
I am going back to Parquet.
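For what it's worth, the ArrayIndexOutOfBoundsException: 6 comes from the file footer recording a writer version (ORC_135, id 6) that the ORC reader bundled with Hive 2.1.1 does not know about. A sketch of one possible workaround, assuming the files are produced by Spark 2.3 or later (not verified on CDH 6.3.2): have Spark write through its legacy Hive ORC path, which stamps an older writer version that Hive 2.1.1 still recognizes.
# switch Spark's ORC implementation to the legacy Hive writer before writing the files
spark-shell --conf spark.sql.orc.impl=hive
# then write the DataFrame as usual inside the shell, e.g. df.write.orc("hdfs:///tmp/msout2o12")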
https://community.cloudera.com/t5/Cloudera-Labs/Problem-of-compatibility-of-an-external-orc-and-Claudera-s/m-p/299395/highlight/false#M582

Related

Hive standalone metastore reading Avro data with schema not working

We have a use case of Presto and Hive accessing S3 files stored in Avro format.
When we try to use the standalone hive-metastore and read this Avro data through an external table, we get a SerDeStorageSchemaReader class-not-found error:
MetaException(message:org.apache.hadoop.hive.metastore.SerDeStorageSchemaReader class not found)
at org.apache.hadoop.hive.metastore.utils.JavaUtils.getClass(JavaUtils.java:54)
We understand this error occurs because the SerDeStorageSchemaReader class is not available in the standalone metastore.
I want to understand: can the Hive metastore be run without installing Hive/Hadoop, or is there another option?
The standalone metastore does not support Avro. To fix it, we needed to install the full Hadoop plus Hive distribution and start only the Hive metastore service.
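A sketch of that setup (the install paths, versions, and metastore database type below are assumptions, so adjust them to your environment): unpack a full Apache Hive release next to a Hadoop install, initialize the metastore schema once, and start only the metastore service.
# assumed locations and versions; only the metastore service is started, nothing else from Hive
export HADOOP_HOME=/opt/hadoop-3.1.0
export HIVE_HOME=/opt/apache-hive-3.1.2-bin
$HIVE_HOME/bin/schematool -dbType mysql -initSchema   # one-time schema setup in the backing RDBMS
$HIVE_HOME/bin/hive --service metastore               # serves the Thrift metastore (port 9083 by default)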

Hive with HBase (both Kerberos) java.net.SocketTimeoutException .. on table 'hbase:meta'

Error
Receiving timeout errors when trying to query HBase from Hive using the HBaseStorageHandler.
Caused by: java.net.SocketTimeoutException: callTimeout=60000, callDuration=68199: row 'phoenix_test310,,'
on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=hbase-master.example.com,16020,1583728693297, seqNum=0
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:159)
at org.apache.hadoop.hbase.client.ResultBoundedCompletionService$QueueingFuture.run(ResultBoundedCompletionService.java:64)
... 3 more
I tried to follow what documentation I could find, and added some HBase configuration options to hive-site.xml based on this Cloudera link.
Environment:
Hadoop 2.9.2
HBase 1.5
Hive 2.3.6
Zookeeper 3.5.6
First, the Cloudera link should be ignored: Hive detects the presence of HBase through environment variables and then automatically reads the hbase-site.xml configuration settings.
There is no need to duplicate HBase settings within hive-site.xml.
Configuring Hive for HBase
Modify your hive-env.sh as follows:
# replace <hbase-install> with your installation path /etc/hbase for example
export HBASE_BIN="<hbase-install>/bin/hbase"
export HBASE_CONF_DIR="<hbase-install>/conf"
Separately, you should ensure the HADOOP_* environment variables are set in hive-env.sh as well,
and that the HBase lib directory is added to HADOOP_CLASSPATH.
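For example, something along these lines in hive-env.sh (a sketch, reusing the <hbase-install> placeholder from above):
# make the HBase client jars visible to Hive and to the MR tasks it launches
export HADOOP_CLASSPATH="${HADOOP_CLASSPATH}:<hbase-install>/lib/*"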
We solved this error by adding the property hbase.client.scanner.timeout.period=600000 (HBase 1.2).
https://docs.cloudera.com/documentation/enterprise/5-5-x/topics/admin_hbase_scanner_heartbeat.html#concept_xsl_dz1_jt
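For a quick test, the property can also be applied per Hive session (a sketch: the query and table name are placeholders, and whether the storage handler picks the value up from the session configuration can depend on the Hive version; the permanent fix is to put the property into hbase-site.xml on the Hive client nodes):
# raise the HBase scanner timeout for this session only and re-run a query against the HBase-backed table
hive --hiveconf hbase.client.scanner.timeout.period=600000 \
     -e "SELECT COUNT(*) FROM my_hbase_backed_table"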

Alluxio + Hive on EMR

I have Alluxio 1.8 installed on an EMR 5.19.0 cluster, and can see my S3 tables using /usr/local/alluxio/bin/alluxio fs ls /.
However, when I start up Hive and issue
hive> [[ DDL w/ LOCATION = alluxio://master_host:19998/my_table ]], I get the following:
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. MetaException(message:java.lang.RuntimeException: java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found
Is there a way of getting past this? I've tried starting hive with --auxpath pointing to both /usr/local/alluxio/client/alluxio-1.8.1-client.jar and a copy of the jar on hdfs without any success.
Any help?
I wrote a blog post about the reasons for the error message java.lang.ClassNotFoundException: Class alluxio.hadoop.FileSystem not found. Here are some tips; I hope they help:
For Hive, set environment variable HIVE_AUX_JARS_PATH in conf/hive-env.sh:
export HIVE_AUX_JARS_PATH=/<PATH_TO_ALLUXIO>/client/alluxio-1.8.1-client.jar:${HIVE_AUX_JARS_PATH}
which I guess is equivalent to what you have done to set --auxpath.
Depending on how Hive executes queries (e.g., Hive on MR, Spark, or Tez), you may also need to make sure the runtime can access the client jar. Taking Hive on MR as an example, you probably also need to append the path of the Alluxio client jar to mapreduce.application.classpath or yarn.application.classpath so that each task of the MR jobs can access this jar.
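A sketch of that last point for an EMR-style layout (the destination directory below is an assumption based on EMR defaults, where /usr/lib/hadoop/lib is already on the containers' classpath): copy the client jar on every node into a directory the MR tasks already scan.
# assumed EMR paths; run on every node of the cluster so MR containers can load the Alluxio client
sudo cp /usr/local/alluxio/client/alluxio-1.8.1-client.jar /usr/lib/hadoop/lib/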

Port data from HDFS/S3 to local FS and load in Java

I have a Spark job running on an EMR cluster that writes out a DataFrame to HDFS (which is then s3-dist-cp-ed to S3). The data size isn't big (2 GB when saved as Parquet). The data in S3 is then copied to a local filesystem (an EC2 instance running Linux) and loaded into a Java application.
It seems I cannot keep the data in Parquet format, because Parquet was designed for HDFS and cannot be used on a local FS (if I am wrong, please point me to a resource on how to read Parquet files on a local FS).
What other format can I use to address this? Would Avro be compact enough, without blowing up the data size by packing the schema with each row of the DataFrame?
You can use Parquet on a local filesystem. To see an example in action, download the parquet-mr library from here, build it with the local profile (mvn -P local install should do it, provided that you have thrift and protoc installed), then issue the following to see the contents of your parquet file:
java -jar parquet-tools/target/parquet-tools-1.10.0.jar cat /path/to/your-file.parquet
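As another quick sanity check that Parquet reads fine from a plain file:// path (a sketch assuming a local Spark installation; the file path is a placeholder):
# pipe one line of Scala into a local spark-shell; no HDFS is involved at any point
echo 'spark.read.parquet("file:///path/to/your-file.parquet").show()' | spark-shell --master 'local[*]'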

Loading local data files into a Hive table fails when using Hive

When I tried to load local data files into a Hive table, it reported an error while moving the files. I found the link below, which gives suggestions for fixing this issue. I followed those steps, but it still doesn't work.
http://answers.mapr.com/questions/3565/getting-started-with-hive-load-the-data-from-sample-table txt-into-the-table-fails
After running mkdir /user/hive/tmp and setting hive.exec.scratchdir=/user/hive/tmp, it still reports RuntimeException Cannot make directory: file/user/hive/tmp/hive_2013*. How can I fix this issue? Can anyone familiar with Hive help me? Thanks!
Hive version is 0.10.0
Hadoop version is 1.1.2
I suspect a permission issue here, because you are using the MapR distribution.
Make sure that the user trying to create the directory has permission to create it on CLDB.
An easy way to debug is to run
$ hadoop fs -chmod -R 777 /user/hive
and then try to load the data again, to confirm whether it is a permission issue.
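Once the permissions look right, the retry would be something like this (the table and file names are placeholders):
# LOCAL tells Hive to copy the file from the client filesystem into the table's warehouse directory
hive -e "LOAD DATA LOCAL INPATH '/path/to/sample_table.txt' INTO TABLE sample_table;"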