Difference between Parquet schema and Parquet using Avro schema - Hive

I am new to Parquet. Can you share the pros and cons of using Parquet with an Avro schema versus Parquet with its own schema format in Hive?
Currently, I write Parquet files to HDFS using Spark Streaming and then create a table in Hive with "create table IF NOT EXISTS". Does this update the schema in Hive? If not, what is the ideal way to keep the latest schema in sync for both formats?
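For concreteness, a minimal sketch of the Hive side of that workflow, with made-up table, column, and path names that are not from the question:
-- Created once over the Parquet directory written by Spark Streaming
CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, name STRING)
STORED AS PARQUET
LOCATION 'hdfs:///data/events';

-- On later runs this statement is silently skipped because the table already
-- exists, so columns added by the streaming job are not reflected in Hive;
-- they have to be added explicitly, e.g.:
ALTER TABLE events ADD COLUMNS (new_col STRING);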

Related

Databricks CREATE TABLE USING PARQUET vs. STORED AS PARQUET

I'm creating a Databricks table in Azure backed by Parquet files in ADLS2.
I don't understand the difference between USING PARQUET and STORED AS PARQUET in the CREATE TABLE statement.
In particular, if my table has a decimal column, CREATE TABLE ... STORED AS PARQUET LOCATION 'abfss://...' will fail with the error:
Parquet does not support decimal. See HIVE-6384
... unless I set properties to use a particular non-default version of Hive JARs.
On the other hand, CREATE TABLE USING PARQUET just works.
What's the difference?
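For concreteness, a minimal sketch of the two forms being compared, with a hypothetical table name, decimal column, and ADLS path (none of these identifiers are from the question):
-- Spark-native Parquet data source
CREATE TABLE sales_native (amount DECIMAL(18,2))
USING PARQUET
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/sales';

-- Hive-compatible syntax backed by the Hive SerDe; with older built-in
-- Hive JARs this is the path where "Parquet does not support decimal.
-- See HIVE-6384" can surface
CREATE TABLE sales_hive (amount DECIMAL(18,2))
STORED AS PARQUET
LOCATION 'abfss://<container>@<account>.dfs.core.windows.net/sales';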

How can we load ORC data into Hive using the NiFi Hive streaming processor?

I have ORC files and their schema. I have tried loading these ORC files into Hive locally and it works fine. Now I will be generating multiple ORC files and need to load them into a Hive table using the NiFi PutHiveStreaming processor. How can I do that?
PutHiveStreaming expects incoming flow files to be in Avro format. PutHive3Streaming gives you more flexibility, but it does not accept flow files in ORC format either; both of those processors convert the input into ORC themselves and write it into a managed table in Hive.
If your files are already in ORC format, you can use PutHDFS to place them directly into HDFS. If you don't have permission to write directly into the managed table's location, you can write them to a temporary location, create an external table on top of it, and then load from there into the managed table with INSERT INTO myTable SELECT * FROM externalTable.
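A minimal sketch of that staging flow in HiveQL, with placeholder table names, columns, and paths (none are from the answer):
-- External table over the directory where PutHDFS dropped the ORC files
CREATE EXTERNAL TABLE staging_orc (id BIGINT, name STRING)
STORED AS ORC
LOCATION '/tmp/nifi/orc_staging';

-- Copy the rows into the Hive-managed table
INSERT INTO TABLE managed_table SELECT * FROM staging_orc;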

Create Hive table to read parquet files from parquet/avro schema

We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.
In other words, how can we generate a Hive table from a Parquet/Avro schema?
thanks :)
Try the following, using the Avro schema:
CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test LIKE avro_test
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';
The same question is asked in Dynamically create Hive external table with Avro schema on Parquet Data.

Invalid parquet hive schema: repeated group array

Most datasets on our production Hadoop cluster are currently stored in Avro + Snappy format. I have heard lots of good things about Parquet and want to give it a try.
I followed this web page to change one of our ETL jobs to generate Parquet files, instead of Avro, as the output of our reducer. I used the Parquet + Avro schema to produce the final output data, plus the Snappy codec. Everything works fine, so the final output Parquet files should have the same schema as our original Avro files.
Now I am trying to create a Hive table for these Parquet files. IBM BigInsights 3.0, which we use, ships with Hive 12 and Parquet 1.3.2.
Based on our Avro schema file, I came up with the following Hive DDL:
create table xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
location 'xxxx'
The table was created successfully in Hive 12, and I can run "desc table" without any problem. But when I try to query the table, e.g. "select * from table limit 2", I get the following error:
Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array { required binary sub1 (UTF8); optional binary sub2 (UTF8); optional int64 date_value;}
    at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
    at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
    at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
    at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
    at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
    at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
    at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
    at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
    at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
    at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
    at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
    ... 14 more
I noticed that the error comes from the first nested array-of-struct column. My questions are the following:
Does Parquet support a nested array of structs?
Is this issue specific to Parquet 1.3.2? Is there any workaround on Parquet 1.3.2?
If I have to use a later version of Parquet to fix the above problem, and Parquet 1.3.2 is still available at runtime, will that cause any issue?
Can I use all Hive features, like "explode" of the nested structure, on the Parquet data?
What we are looking for is to know whether Parquet can be used in the same way as we currently use Avro, while giving us the columnar-storage benefits that Avro lacks.
It looks like Hive 12 cannot support the nested structure of the Parquet file, as shown in this JIRA ticket:
https://issues.apache.org/jira/browse/HIVE-8909
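For what it's worth, on a Hive version where HIVE-8909 is fixed, the nested column can be queried with explode via a LATERAL VIEW. A rough sketch against the DDL above (the aliases are illustrative):
SELECT t.col1, f.sub1, f.sub2, f.date_value
FROM xxx t
LATERAL VIEW explode(t.field1) ex AS f;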

Sqoop, Avro and Hive

I'm currently importing from MySQL into HDFS using Sqoop in Avro format, and this works great. However, what's the best way to load these files into Hive?
Since Avro files contain the schema, I could pull the files down to the local file system, use avro-tools, and create the table with the extracted schema, but this seems excessive.
Also, if a column is dropped from a table in MySQL, can I still load the old files into a new Hive table created with the new Avro schema (with the dropped column missing)?
Since version 0.9.1, Hive has come packaged with an Avro SerDe. This allows Hive to read from Avro files directly while Avro still "owns" the schema.
For your second question, you can define the Avro schema with column defaults. When you add a new column, just make sure to specify a default, and all your old Avro files will work just fine in a new Hive table.
To get started, you can find the documentation here, and the book Programming Hive (available on Safari Books Online) has a section on the Avro SerDe which you might find more readable.
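A minimal sketch of an Avro-backed Hive table whose schema literal carries a default for a later-added column (the table name, location, and fields are made up for illustration):
CREATE EXTERNAL TABLE sqoop_import
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/user/sqoop/mytable'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"mytable","fields":[{"name":"id","type":"long"},{"name":"new_col","type":["null","string"],"default":null}]}');

-- Old Avro files that lack new_col still load; the reader fills in the default (null).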