I am currently using Cloudera 5.6 and trying to create a Parquet-format table in Hive based on another table, but I am running into an error.
create table sfdc_opportunities_sandbox_parquet like
sfdc_opportunities_sandbox STORED AS PARQUET
Error Message
Parquet does not support date. See HIVE-6384
I read that Hive 1.2 has a fix for this issue, but Cloudera 5.6 and 5.7 do not come with Hive 1.2. Has anyone found a way around this issue?
Apart from using another data type like TIMESTAMP, or another storage format like ORC, there may be no way around this if you are tied to that Hive version and the Parquet storage format.
According to Cloudera's CDH 5 Packaging and Tarball Information, the whole 5.x branch ships with Apache Parquet 1.5.0 and Apache Hive 1.1.0.
DATE support was implemented in the Parquet SerDe with HIVE-8119, as of Hive 1.2.
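As a sketch of the TIMESTAMP workaround, you could recreate the table with a CTAS statement that casts the DATE column. The table is from the question, but the column names here are assumptions, so adjust them to your actual schema:

```sql
-- Hypothetical sketch: rebuild the table as Parquet with the DATE column
-- cast to TIMESTAMP, since Parquet in Hive 1.1.0 rejects DATE.
CREATE TABLE sfdc_opportunities_sandbox_parquet
STORED AS PARQUET
AS
SELECT
  opportunity_id,                               -- non-date columns copied as-is (names assumed)
  CAST(close_date AS TIMESTAMP) AS close_date   -- DATE column converted to TIMESTAMP
FROM sfdc_opportunities_sandbox;
```

The cost is that downstream queries see a TIMESTAMP (midnight time component) instead of a true DATE.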
I want to insert JSON into a Hive database.
I am trying to transform JSON to SQL using the NiFi ConvertJSONToSQL processor. How can I include the PARTITION (....) part in my query?
Can I do this, or should I use a ReplaceText processor to build the query?
What version of Hive are you using? There are Hive 1.2 and Hive 3 versions of PutHiveStreaming and PutHive3Streaming (respectively) that let you put the data directly into Hive without having to issue HiveQL statements. For external Hive tables in ORC format, there are also ConvertAvroToORC (for Hive 1.2) and PutORC (for Hive 3) processors.
Assuming those don't work for your use case, you may also consider ConvertRecord with a FreeFormTextRecordSetWriter that generates the HiveQL with the PARTITION statement and such. It gives a lot more flexibility than trying to patch a SQL statement to turn it into HiveQL for a partitioned table.
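For illustration, the HiveQL that a FreeFormTextRecordSetWriter template would emit might look like the following; the table, partition, and column values are made up, and in the template they would be references to record fields:

```sql
-- Hypothetical output of a FreeFormTextRecordSetWriter template:
-- one INSERT per record, with the partition value taken from a record field.
INSERT INTO my_table PARTITION (dt='2020-01-01')
VALUES ('some_id', 'some_value');
```

The generated statements can then be sent to a PutHiveQL processor for execution.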
EDIT: I forgot to mention that the Hive 3 NAR/components are not included with the NiFi release due to space reasons. You can find the Hive 3 NAR for NiFi 1.11.4 here.
We have an EMR server running Hive 0.13.1 (I know how archaic it is, but this cluster has a lot of dependencies, which is why we are not able to do away with it).
Anyway, cutting to the chase: we processed something like 10 TB of TSV data into Parquet using a different EMR cluster running a recent version of Hive.
This was a temporary setup to facilitate the huge data processing job.
Now we are back on the old EMR cluster to do incremental processing of TSV to Parquet.
We are using AWS Redshift Spectrum coupled with Glue to query this data. Glue crawls the S3 path where the data resides, giving us a schema to work with.
Now the data processed by the old EMR cluster gives us "incompatible Parquet schema" issues.
The error we get when we try to read Parquet data that mixes output from the newer and older Hive versions is:
[2018-08-13 09:40:36] error: S3 Query Exception (Fetch)
[2018-08-13 09:40:36] code: 15001
[2018-08-13 09:40:36] context: Task failed due to an internal error. File '<Some s3 path >/parquet/<Some table name>/caldate=2018080900/8e71ebbe-b398-483c-bda0-81db6f848d42-000000 has an incompatible Parquet schema for column
[2018-08-13 09:40:36] query: 11500732
[2018-08-13 09:40:36] location: dory_util.cpp:724
[2018-08-13 09:40:36] process: query1_703_11500732 [pid=5384]
My hunch is that it is because of the different Hive versions, or it could be a Redshift Spectrum bug.
Has anyone faced the same issue?
I think this particular post will help you solve this issue. It talks about problems where the schema is written by one version and read by another.
https://slack.engineering/data-wrangling-at-slack-f2e0ff633b69
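If you want to confirm the mismatch yourself, comparing the footers of a file written by each Hive version with the parquet-tools CLI usually shows it directly. The paths below are placeholders for locally downloaded copies of your S3 files:

```shell
# Compare the schema written by each Hive version (paths are placeholders).
# Differences in physical/logical types (e.g. INT96 vs INT64 timestamps)
# are a common cause of Spectrum's "incompatible Parquet schema" error.
parquet-tools schema new-hive/part-00000.parquet
parquet-tools schema old-hive/part-00000.parquet

# The footer metadata also records which writer produced each file:
parquet-tools meta old-hive/part-00000.parquet | grep creator
```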
We are facing the following issue: we use Hive 1.2.x to write ORC files. It is a known problem that Hive before version 2.x does not write the actual column names into the ORC file (it writes only _col0, _col1, etc.).
We would like to use another application that reads the schema from the ORC file itself and cannot connect to the HCatalog metastore for the correct column names. Unfortunately we have no chance to upgrade to Hive 2.x.
Is there any solution to append or replace the correct column names in these existing ORC files? Thanks in advance for your help.
I am using hive and hbase to do some analysis on data.
I accidentally removed the following file in HDFS:
(hdfs) /user/hive/mydata.db
Although all tables still exist in Hive, when I retrieve data from Hive through the Thrift server to plot it, it shows nothing. How can I rebuild my data? Any guide, etc.?
I'm currently importing from MySQL into HDFS using Sqoop in Avro format, and this works great. However, what's the best way to load these files into Hive?
Since Avro files contain the schema, I could pull the files down to the local file system, use avro-tools, and create the table with the extracted schema, but that seems excessive.
Also, if a column is dropped from a table in MySQL, can I still load the old files into a new Hive table created with the new Avro schema (with the dropped column missing)?
As of version 0.9.1, Hive has come packaged with an Avro SerDe. This allows Hive to read from Avro files directly while Avro still "owns" the schema.
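A sketch of wiring Sqoop's output directly to a Hive table via the Avro SerDe follows; the HDFS paths and table name are assumptions, so substitute your Sqoop target directory and schema file:

```sql
-- Hypothetical example: point an external table at the Sqoop output directory
-- and let the Avro SerDe read the column definitions from a schema file,
-- so no manual column DDL is needed.
CREATE EXTERNAL TABLE sqoop_import
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/sqoop/import/mytable'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/sqoop/schemas/mytable.avsc');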
For your second question: you can define the Avro schema with column defaults. When you add a new column, just make sure to specify a default, and all your old Avro files will work just fine in a new Hive table.
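For example, an evolved schema with a defaulted column might look like the following (the record and field names are made up); when a reader hits an old file that predates the column, Avro fills in the default:

```json
{
  "type": "record",
  "name": "Opportunity",
  "fields": [
    {"name": "id",     "type": "long"},
    {"name": "status", "type": ["null", "string"], "default": null}
  ]
}
```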
To get started, you can find the documentation here, and the book Programming Hive (available on Safari Books Online) has a section on the Avro SerDe, which you might find more readable.