Create Hive table to read parquet files from parquet/avro schema - hive

We are looking for a way to create an external Hive table that reads data from Parquet files according to a Parquet/Avro schema.
In other words, how can we generate a Hive table from a Parquet/Avro schema?
Thanks :)

Try the following, using the Avro schema:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
The same question is asked in "Dynamically create Hive external table with Avro schema on Parquet Data".
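To check the result, a minimal sketch (assuming the two statements above ran without errors, and reusing the table name from the answer) is to inspect the storage metadata of the new table and read a few rows:

```sql
-- Shows the SerDe, input/output formats and location recorded for the table
DESCRIBE FORMATTED parquet_test;

-- Reads a handful of rows to confirm the Parquet files are readable
SELECT * FROM parquet_test LIMIT 10;
```

If the table was set up as intended, the "SerDe Library" line in the DESCRIBE FORMATTED output should name a Parquet SerDe rather than the Avro one.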

Related

Databricks CREATE TABLE USING PARQUET vs. STORED AS PARQUET

I'm creating a Databricks table in Azure backed by Parquet files in ADLS2.
I don't understand the difference between USING PARQUET and STORED AS PARQUET in the CREATE TABLE statement.
In particular, if my table has a decimal column, the CREATE TABLE STORED AS PARQUET LOCATION 'abfss://...' statement fails with the error:
Parquet does not support decimal. See HIVE-6384
... unless I set properties to use a particular non-default version of Hive JARs.
On the other hand, CREATE TABLE USING PARQUET just works.
What's the difference?
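For reference, the two forms being compared look roughly like this. This is only a sketch: the table names and the column are hypothetical, and the location is left elided as in the question. In Spark SQL, USING PARQUET defines a data source table, while STORED AS PARQUET defines a Hive-format (Hive SerDe) table, which would be consistent with the Hive-related error appearing only for the second form.

```sql
-- Data source table (hypothetical name and column)
CREATE TABLE sales_using (amount DECIMAL(10,2))
USING PARQUET
LOCATION 'abfss://...';

-- Hive-format table (hypothetical name and column)
CREATE TABLE sales_stored (amount DECIMAL(10,2))
STORED AS PARQUET
LOCATION 'abfss://...';
```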

How Can I create a Hive Table on top of a Parquet File

I am having trouble creating a Hive table on top of a Parquet file. Can someone help me with this? I have read many articles and followed the guidelines, but I am not able to load a Parquet file into a Hive table.
According to "Using Parquet Tables in Hive", if a table will be populated with data files generated outside of Hive, it is often useful to create it as an external table pointing to the location where the files will be created.
hive> create external table parquet_table_name (<yourParquetDataStructure>)
STORED AS PARQUET
LOCATION '/<yourPath>/<yourParquetFile>';
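As a concrete illustration of the template above, here is a sketch with hypothetical column names, types, and path; they must match what is actually in the Parquet files. Note that LOCATION is expected to name the HDFS directory that holds the Parquet files rather than a single file.

```sql
-- Hypothetical columns and path; adjust to the real Parquet schema and directory
CREATE EXTERNAL TABLE parquet_table_name (
  id BIGINT,
  name STRING,
  price DOUBLE
)
STORED AS PARQUET
LOCATION '/data/parquet/parquet_table_name';

-- Quick read test
SELECT * FROM parquet_table_name LIMIT 10;
```

Because the table is external, dropping it removes only the metadata and leaves the Parquet files in place.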

difference between parquet schema and parquet using avro schema

I am new to Parquet. Can you share the pros and cons of Parquet with an Avro schema versus Parquet with its own schema format in Hive?
Currently, I store files as Parquet in HDFS using Spark Streaming and then create a table in Hive using "create table IF NOT EXISTS". Does this update the schema in Hive? If not, what is the ideal way to keep the latest schema up to date in both formats?

Load local CSV file into Hive Parquet table directly, without resorting to a temp textfile table

I am preparing to store data from .csv files in Hive. Because of the good performance of the Parquet file format, the Hive table should be in Parquet format. The normal way is to create a temp table in textfile format, load the local CSV data into this temp table, and finally create a Parquet table with the same structure and run insert into table parquet_table select * from textfile_table;.
But I don't think this temp textfile table is necessary. So my question is: is there a way to load these local .csv files into a Parquet-format Hive table directly, without resorting to a temp table? Or is there an easier way to accomplish this task?
As stated in the Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT (CTAS) for the Parquet table.
So you'll have 3 steps:
1. Create a text table defining the schema
2. Load data into the text table (move the file into the new table)
3. CREATE TABLE parquet_table STORED AS PARQUET AS SELECT * FROM textfile_table; (STORED AS PARQUET is supported from Hive 0.13)
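The three steps can be sketched end to end as follows. The column list, the comma delimiter, and the local path /tmp/data.csv are assumptions for illustration only; a plain delimited table like this does not handle quoted fields or header rows.

```sql
-- Step 1: staging table in text format (hypothetical columns; match the CSV layout)
CREATE TABLE textfile_table (
  id INT,
  name STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Step 2: load the local CSV file (hypothetical path) into the staging table
LOAD DATA LOCAL INPATH '/tmp/data.csv' INTO TABLE textfile_table;

-- Step 3: create the Parquet table from the staging table in one statement.
-- Hive expects the STORED AS clause before AS SELECT.
CREATE TABLE parquet_table STORED AS PARQUET AS
SELECT * FROM textfile_table;

-- Optional cleanup once the Parquet table has been checked
DROP TABLE textfile_table;
```

The staging table is still needed, but CTAS removes the separate "create Parquet table" and "insert" statements.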

Dynamically create Hive external table with Avro schema on Parquet Data

I'm trying to dynamically (without listing column names and types in Hive DDL) create a Hive external table on parquet data files. I have the Avro schema of underlying parquet file.
My attempt uses the DDL below:
CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');
My Hive table is successfully created with the right schema, but when I try to read the data:
SELECT * FROM parquet_test;
I get the following error:
java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Is there a way to successfully create and read a table over Parquet files without listing column names and types in the DDL?
The query below works:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';