Using Cloudera 8.1. In Hive, I loaded a CSV file into a table stored in ORC format. I get this error when I try to query the loaded table:
Failed with exception java.io.IOException:org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.hive.serde2.io.HiveVarcharWritable cannot be cast to org.apache.hadoop.io.IntWritable
This is a common mistake. An ORC table expects ORC files, so a raw CSV file placed in it cannot be read correctly.
Instead, create a Hive external table over the CSV data (text format) and then run
"INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE", which copies the CSV data into the ORC table.
With this method, Hive converts the CSV data to ORC using its built-in libraries.
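A minimal sketch of this approach follows. Only the table names TEMP_TABLE and FINAL come from the answer; the two-column schema and the HDFS directory /user/hive/staging/csv are hypothetical and should be replaced with your own.

```sql
-- Staging table: plain text, reads the CSV files where they are.
-- Column names, types, and the LOCATION path are assumptions for illustration.
CREATE EXTERNAL TABLE TEMP_TABLE (
  id   INT,
  name VARCHAR(50)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/staging/csv';

-- Target table: same columns, stored as ORC.
CREATE TABLE FINAL (
  id   INT,
  name VARCHAR(50)
)
STORED AS ORC;

-- Hive reads the text rows and rewrites them as ORC files.
INSERT INTO TABLE FINAL SELECT * FROM TEMP_TABLE;
```

Because the INSERT goes through Hive's ORC writer, the files under FINAL are real ORC files and the column types match what the table declares.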
I'm creating a Databricks table in Azure backed by Parquet files in ADLS2.
I don't understand the difference between USING PARQUET and STORED AS PARQUET in the CREATE TABLE statement.
In particular, if my table has a decimal column the CREATE TABLE STORED AS PARQUET location 'abfss://...' will fail with error:
Parquet does not support decimal. See HIVE-6384
... unless I set properties to use a particular non-default version of Hive JARs.
On the other hand, CREATE TABLE USING PARQUET just works.
What's the difference?
I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in PARQUET file format. The data type of one column is STRING. I upload the file behind the Hive table to Google Cloud Storage and create a BigQuery internal table from it with the GUI. The data type of that column in the imported table is converted to BYTES.
But when I imported a CHAR or VARCHAR column, the resulting data type was STRING.
Could someone please explain why this is happening?
This does not answer the original question, since I do not know exactly what happened, but I have seen similar odd behavior.
I faced a similar issue when trying to move a table between Cloudera and BigQuery.
I first created the table as external in Impala, like this:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with the STRING data type.
I then transferred the files to GS and imported them into BigQuery from the console GUI. There are not many options: just select the Parquet format and point to GS.
To my surprise, the columns were now of type BYTES. The column names were preserved, but the content was scrambled.
Trying different codecs, or pre-creating the table and inserting into it, still in Impala, led to no change.
Finally I tried the same thing in Hive, and that helped.
So I ended up creating an external table in Hive like this:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
I then repeated the same steps, copying from S3 to GS and importing into BQ, this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.
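One possible explanation, which the answer itself does not confirm, is that Impala by default writes STRING columns to Parquet without the UTF-8 annotation, so a reader such as BigQuery sees plain binary. Impala 2.6 and later provide the PARQUET_ANNOTATE_STRINGS_UTF8 query option for this. The sketch below reuses the table from the Impala example; the new table name test3 and its location are hypothetical.

```sql
-- Run in impala-shell before writing the Parquet files.
-- Assumption: Impala 2.6 or later, where this query option is available.
SET PARQUET_ANNOTATE_STRINGS_UTF8=true;

-- Same CTAS as before, written to a hypothetical new table and location.
CREATE EXTERNAL TABLE test3
STORED AS PARQUET
LOCATION 's3a://table_migration/test3'
AS SELECT * FROM original_table;
```

If the option has the intended effect, files written this way should import with STRING columns; writing the files from Hive, as described above, remains the workaround that was actually tested.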
I'm trying to load data from Oracle into Hive as Parquet. Every time I load a table with a date/timestamp column into Hive, these columns are automatically converted to BIGINT. Is it possible to load timestamp/date formats into Hive using Sqoop, as a Parquet file?
I already tried creating the table first in Hive and then using Impala to LOAD DATA INPATH the Parquet file.
It still failed with the error:
"file XX has an incompatible Parquet schema for column XX column:
TIMESTAMP"
BTW, I'm using cloudera quickstart vm. Thanks
From the Cloudera documentation:
If you use Sqoop to convert RDBMS data to Parquet, be careful with interpreting any resulting values from DATE, DATETIME, or TIMESTAMP columns. The underlying values are represented as the Parquet INT64 type, which is represented as BIGINT in the Impala table. The Parquet values represent the time in milliseconds, while Impala interprets BIGINT as the time in seconds. Therefore, if you have a BIGINT column in a Parquet table that was imported this way from Sqoop, divide the values by 1000 when interpreting as the TIMESTAMP type.
Alternatively, you can use an expression like this in your Hive query to get the result in the desired TIMESTAMP format:
FROM_UNIXTIME(CAST(SUBSTR(timestamp_column, 1,10) AS INT)) AS timestamp_column;
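The documentation's advice to divide by 1000 can also be written directly in HiveQL. The sketch below assumes a hypothetical table sqoop_import with a BIGINT column created_ms holding epoch milliseconds. Unlike the SUBSTR approach, which relies on the value having exactly 13 digits, the division also works for shorter millisecond values (dates before September 2001).

```sql
-- created_ms holds epoch milliseconds as written by Sqoop (hypothetical column name).
-- Dividing by 1000 yields seconds; the CAST drops the fractional part.
SELECT
  created_ms,
  FROM_UNIXTIME(CAST(created_ms / 1000 AS BIGINT)) AS created_ts
FROM sqoop_import;
```

FROM_UNIXTIME returns a formatted string in the session time zone, so wrap it in CAST(... AS TIMESTAMP) if a true TIMESTAMP column is needed.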
Try the Sqoop option --map-column-hive:
--map-column-hive <cols_name>=TIMESTAMP
hive> create table orc_table (name string,img_loc string) stored as orc tblproperties("orc.compress"="none");
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orc
hive> create table orc_table (name string,img_loc string) stored as orcfile tblproperties("orc.compress"="none");
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orcfile
hive> create table orc_table(name string,img_loc string) stored as orcfile;
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orcfile
hive> create table orc_table(name string,img_loc string) stored as orc;
FAILED: Error in semantic analysis: Unrecognized file format in STORED AS clause: orc
Make sure your Hive version is 0.11 or later. ORC was introduced in Hive 0.11.
ORC -- (Note: Available in Hive 0.11.0 and later)
How to check your Hive version:
$ hive --version
Hive 0.14.0.2.2.4.8-40
HIVE-3874: Create a new Optimized Row Columnar file format for Hive. - this is the implementation ticket.
Hive Create Table syntax - check file_format to know minimum requirement for each storage type.
ORC Files - Information about ORC files.
This error occurs because you are loading a non-ORC file into an ORC table. The best solution is to first create a plain text table, load the data into it, and then insert from that table into the ORC table.
CREATE TABLE data(value1 string, value2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|';
Here the fields are terminated by "|" because I am using a PSV file; set the delimiter to match your file format.
LOAD DATA INPATH '/user/hive/data.psv' INTO TABLE data;
CREATE TABLE data2 (value1 string, value2 string) STORED AS ORC tblproperties ("orc.compress" = "SNAPPY");
INSERT INTO TABLE data2 SELECT * FROM data;
Most datasets on our production Hadoop cluster are currently stored in AVRO + SNAPPY format. I have heard lots of good things about Parquet and want to give it a try.
I followed this web page to change one of our ETL jobs to generate Parquet files, instead of Avro, as the output of our reducer. I used the Parquet + Avro schema to produce the final output data, plus the Snappy codec. Everything works fine, so the final output Parquet files should have the same schema as our original Avro file.
Now I am trying to create a Hive table for these Parquet files. IBM BigInsight 3.0, which we use, currently contains Hive 12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (col1 bigint, col2 string,.................field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,field2 array<struct<..............>>)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
location 'xxxx'
The table was created successfully in Hive 12, and I can "desc table" without any problem. But when I tried to query the table, with "select * from table limit 2", I got the following error:
Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
    at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
    at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
    at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
    at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
    at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
    at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
    ... 14 more
I noticed that the error comes from the first nested array-of-struct column. My questions are the following:
Does Parquet support nested arrays of structs?
Is this only related to Parquet 1.3.2? Is there any solution on Parquet 1.3.2?
If I have to use a later version of Parquet to fix the above problem, and Parquet 1.3.2 is also available at runtime, will that cause any issue?
Can I use all kinds of Hive features, like "explode" of nested structures, on the Parquet data?
What we want to know is whether Parquet can be used the same way we currently use AVRO, while giving us the columnar storage benefits that are missing from AVRO.
It looks like Hive 12 cannot support the nested structure of the Parquet file, as shown in this Jira ticket.
https://issues.apache.org/jira/browse/HIVE-8909
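For the "explode" question, here is a minimal sketch of how the nested array could be flattened, assuming a Hive version that can read this nested Parquet schema. The table and column names (xxx, col1, field1, sub1, sub2, date_value) come from the DDL in the question; the aliases t, exploded, and item are illustrative.

```sql
-- One output row per struct element in field1.
SELECT
  t.col1,
  item.sub1,
  item.sub2,
  item.date_value
FROM xxx t
LATERAL VIEW explode(t.field1) exploded AS item;
```

This is the same LATERAL VIEW pattern used with Avro-backed tables, so queries written against the Avro data should carry over once the table itself is readable.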