Dynamically create Hive external table with Avro schema on Parquet Data - hive

I'm trying to dynamically (without listing column names and types in Hive DDL) create a Hive external table on parquet data files. I have the Avro schema of underlying parquet file.
My try is to use below DDL:
CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');
My Hive table is successfully created with the right schema, but when I try to read the data :
SELECT * FROM parquet_test;
I get the following error :
java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Is there a way to successfully create and read Parquet files, without mentioning columns name and types list in DDL?

Below query works:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';

Related

Databricks CREATE TABLE USING PARQUET vs. STORED AS PARQUET

I'm creating a Databricks table in Azure backed by Parquet files in ADLS2.
I don't understand the difference between USING PARQUET and STORED AS PARQUET in the CREATE TABLE statement.
In particular, if my table has a decimal column the CREATE TABLE STORED AS PARQUET location 'abfss://...' will fail with error:
Parquet does not support decimal. See HIVE-6384
... unless I set properties to use a particular non-default version of Hive JARs.
On the other hand, CREATE TABLE USING PARQUET just works.
What's the difference?

ORC file format

I am new to Hive. Could you please let me know answer for below question?
Why do we need base table while loading the data in ORC?
Can't we directly create table as ORC and load data in it?
1. Why do we need base table while loading the data in ORC?
We need of the base table, because most of the time we get the data file in text file format, i.e. CSV, TXT, DAT or any other delimiter that we can open the file and see the content. But the file Format ORC maintain in a different way by using their algorithm to optimized the Row and Column.
Hence we need of a base table, so, Actually what happened in that case. We create a table with the textFile format and select the data over their and write it into ORC table.
2. Can't we directly create table as ORC and load data in it?
Yes, you can load the data into ORC file directly.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually if you don't define file format , for hive it is textfile by default.
Need of base table arises because when you create a hive table with orc format and then trying to load data using command:
load data in path '' ..
it simply moves data from one location to another.
hive orc table won't understand textfile. that's when serde comes into picture. you define serde while creating table.
so when a operation like :
1. select * (read)
2. insert into (write)
serde will serialize and desiarlize various format to orc and map data to hive columns.

Load local csv file to hive parquet table directly,not resort to a temp textfile table

I am now preparing to store data in .csv files into hive. Of course, because of the good performance of parquet file format, the hive table should is parquet format. So, the normal way, is to create a temp table whose format is textfile, then I load local CSV file data into this temp table, and finally, create a same-structure parquet table and use sql insert into parquet_table values (select * from textfile_table);.
But I don't think this temp textfile table is necessary. So, my question is, is there a way for me to load these local .csv files into hive parquet-format table directly, namely, not to resort the a temp table? Or a easier way to accomplish this task?
As stated in Hive documentation:
NO verification of data against the schema is performed by the load command.
If the file is in hdfs, it is moved into the Hive-controlled file system namespace.
You could skip a step by using CREATE TABLE AS SELECT for the parquet table.
So you'll have 3 steps:
Create text table defining the schema
Load data into text table (move the file into the new table)
CREATE TABLE parquet_table AS SELECT * FROM textfile_table STORED AS PARQUET; supported from hive 0.13

Create Hive table to read parquet files from parquet/avro schema

We are looking for a solution in order to create an external hive table to read data from parquet files according to a parquet/avro schema.
in other way, how to generate a hive table from a parquet/avro schema ?
thanks :)
Try below using avro schema:
CREATE TABLE avro_test ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS AVRO TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');
CREATE EXTERNAL TABLE parquet_test LIKE avro_test STORED AS PARQUET LOCATION 'hdfs://myParquetFilesPath';
Same query is asked in Dynamically create Hive external table with Avro schema on Parquet Data

Create hive table for schema less avro files

I have multiple avro files and each file have a STRING in it. Each avro file is a single row. How can I write hive table to consume all the avro files located in a single directory .
Each file has a big number in it and hence I do not have any json kind of schema that I can relate too. I might be wrong when I say schema less . But I cannot find a way for hive to understand this data. This might be very simple but I am lost since I tried numerous different ways without success. I created tables pointing to json schema as avro uri, but this is not the case here.
For more context files were written using crunch api
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried following query which creates table but does not read data properly
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
If your data set only has one STRING field then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
Use avro utilities jar to see avro schema for any given binary file here!
Then just link the schema file while creating a table.