I'm creating a Databricks table in Azure backed by Parquet files in ADLS2.
I don't understand the difference between USING PARQUET and STORED AS PARQUET in the CREATE TABLE statement.
In particular, if my table has a decimal column, the CREATE TABLE ... STORED AS PARQUET LOCATION 'abfss://...' statement fails with the error:
Parquet does not support decimal. See HIVE-6384
... unless I set properties to use a particular non-default version of Hive JARs.
On the other hand, CREATE TABLE USING PARQUET just works.
What's the difference?
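For reference, here is a minimal sketch of the two statements I'm comparing; the table name, columns, and decimal precision are placeholders, and the abfss path is elided:

-- Spark SQL data source syntax (Databricks)
CREATE TABLE my_table (id BIGINT, amount DECIMAL(18,2))
USING PARQUET
LOCATION 'abfss://...';

-- Hive format syntax
CREATE TABLE my_table (id BIGINT, amount DECIMAL(18,2))
STORED AS PARQUET
LOCATION 'abfss://...';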
Related
I have ORC files and their schema. I have tried loading these ORC files into local Hive and it works fine. Now I will be generating multiple ORC files and need to load them into a Hive table using the NiFi PutHiveStreaming processor. How can I do that?
PutHiveStreaming expects incoming flow files to be in Avro format. PutHive3Streaming gives you more flexibility, but it does not accept flow files in ORC format either; both of those processors convert their input into ORC and write it into a managed table in Hive.
If your files are already in ORC format, you can use PutHDFS to place them directly into HDFS. If you don't have permission to write directly into the managed table's location, you can write to a temporary location, create an external table on top of it, and then load from there into the managed table using INSERT INTO myTable SELECT * FROM externalTable (or similar).
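A rough sketch of that detour, assuming hypothetical column names and a made-up staging path:

-- external table over the temporary HDFS location where the ORC files were landed with PutHDFS
CREATE EXTERNAL TABLE externalTable (id BIGINT, name STRING)
STORED AS ORC
LOCATION '/tmp/orc_staging';

-- copy the rows into the managed Hive table
INSERT INTO myTable SELECT * FROM externalTable;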
I am facing an issue creating a Hive table on top of a Parquet file. Can someone help me with this? I have read many articles and followed the guidelines, but I am not able to load a Parquet file into a Hive table.
According "Using Parquet Tables in Hive" it is often useful to create the table as an external table pointing to the location where the files will be created, if a table will be populated with data files generated outside of Hive.
hive> CREATE EXTERNAL TABLE parquet_table_name (<yourParquetDataStructure>)
STORED AS PARQUET
LOCATION '/<yourPath>/<yourParquetFile>';
I am new to Parquet. Can you share the pros and cons of Parquet using an Avro schema versus Parquet using its own schema format in Hive?
Currently, I store files in Parquet on HDFS using Spark Streaming and then create a table in Hive using "create table IF NOT EXISTS". Does this update the schema in Hive? If not, what is the ideal way to update to the latest schema in both formats?
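Roughly what my current setup looks like (table name, columns, and path are made up):

-- Spark Streaming writes Parquet files under this directory;
-- afterwards I run this statement over it
CREATE TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
STORED AS PARQUET
LOCATION '/data/events_parquet';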
I am now preparing to store data from .csv files into Hive. Because of the good performance of the Parquet file format, the Hive table should be in Parquet format. So the normal way is to create a temp table in textfile format, load the local CSV file data into this temp table, and finally create a Parquet table with the same structure and run insert into parquet_table select * from textfile_table;.
But I don't think this temp textfile table is necessary. So my question is: is there a way for me to load these local .csv files into a Parquet-format Hive table directly, without resorting to a temp table? Or an easier way to accomplish this task?
As stated in the Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT for the parquet table.
So you'll have 3 steps:
Create a text table defining the schema
Load data into the text table (this moves the file into the new table)
Run CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table; (supported from Hive 0.13; see the sketch below)
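Putting the three steps together, a sketch with made-up column names and a hypothetical local file path:

-- 1. text table matching the CSV layout
CREATE TABLE textfile_table (id BIGINT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- 2. move the local CSV into the table (no schema verification happens here)
LOAD DATA LOCAL INPATH '/path/to/data.csv' INTO TABLE textfile_table;

-- 3. create the Parquet table directly from the text table (Hive 0.13+)
CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table;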
Most datasets on our production Hadoop cluster currently are stored as AVRO + SNAPPY format. I heard lots of good things about Parquet, and want to give it a try.
I followed this web page to change one of our ETL jobs to generate Parquet files, instead of Avro, as the output of our reducer. I used Parquet + the Avro schema to produce the final output data, plus the Snappy codec. Everything works fine, so the final output Parquet files should have the same schema as our original Avro file.
Now, I try to create a Hive table for these Parquet files. Currently, IBM BigInsight 3.0, which we use, contains Hive 12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
CREATE TABLE xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 'xxxx';
The table was created successfully in Hive 12, and I can "desc table" it without any problem. But when I try to query the table, like "select * from table limit 2", I get the following error:
Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
    at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
    at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
    at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
    at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
    at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
    at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
    ... 14 more
I noticed that the error comes from the first nested array-of-struct column. My questions are the following:
Does Parquet support nested arrays of structs?
Is this related only to Parquet 1.3.2? Is there any solution on Parquet 1.3.2?
If I have to use a later version of Parquet to fix the above problem, and Parquet 1.3.2 is available at runtime, will that cause any issues?
Can I use all Hive features, like "explode" of nested structures, on the Parquet data?
What we are looking to find out is whether Parquet can be used the same way we currently use Avro, while giving us the columnar storage benefits that Avro is missing.
It looks like Hive 12 cannot support the nested structure of Parquet files, as shown in this JIRA ticket:
https://issues.apache.org/jira/browse/HIVE-8909