Create Hive table for schema-less Avro files

I have multiple Avro files, and each file contains a single STRING field; each Avro file is a single row. How can I create a Hive table that consumes all the Avro files located in a single directory?
Each file holds a big number, so I do not have any JSON-style schema that I can relate it to. I might be wrong when I say schema-less, but I cannot find a way for Hive to understand this data. This is probably very simple, but I am lost, since I have tried numerous approaches without success. I have created tables pointing to a JSON schema as the Avro URI before, but that is not the case here.
For more context, the files were written using the Crunch API:
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried the following query, which creates the table but does not read the data properly:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'

If your data set only has one STRING field, then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
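Note that Avro container files embed their writer schema, which is how the AvroSerDe infers the single column above. If you prefer to pin the schema explicitly instead, a variant like the following should also work. This is only a sketch: the record name is illustrative, and the field name must match whatever the Crunch writer actually used.
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
TBLPROPERTIES ('avro.schema.literal'='{"type":"record","name":"Row","fields":[{"name":"data","type":"string"}]}');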

Use the avro-tools utility jar to print the Avro schema embedded in any given binary Avro file (its getschema command). Then just link that schema file while creating the table.
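For example, something along these lines; this is a sketch, and the avro-tools version, part-file name, and schema path are all placeholders:
-- Shell step, run once to dump the embedded schema from one of the files:
--   java -jar avro-tools-1.8.2.jar getschema part-00000.avro > my_schema.avsc
-- Upload my_schema.avsc to HDFS, then reference it from the DDL:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
TBLPROPERTIES ('avro.schema.url'='hdfs:///somePath/schemas/my_schema.avsc');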

Related

How to create a Hive external table with Avro files on Qubole?

Can someone point me to the documentation for creating an external table on Qubole based on Avro files?
CREATE TABLE my_table_name
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_avro_files/'
The following directory contains a bunch of Avro files:
s3://my_avro_files/
s3://my_avro_files/file1.avro
s3://my_avro_files/file2.avro
s3://my_avro_files/file....avro
I believe you need to provide the schema as well; see the earlier answer about extracting it with the avro-tools jar and specifying it in the CREATE TABLE statement.
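A hedged sketch of the asker's DDL with the schema wired in; the .avsc path is a placeholder, and keeping it outside the table's LOCATION avoids Hive trying to read the schema file as data:
CREATE TABLE my_table_name
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_avro_files/'
TBLPROPERTIES ('avro.schema.url'='s3://my_schemas/my_table.avsc');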

ORC file format

I am new to Hive. Could you please answer the questions below?
Why do we need a base table when loading data into ORC?
Can't we directly create the table as ORC and load data into it?
1. Why do we need a base table when loading data into ORC?
We need the base table because most of the time the data file arrives in a text format: CSV, TXT, DAT, or some other delimited layout that we can open and read directly. The ORC format, by contrast, lays the data out differently, using its own algorithms to optimize the rows and columns, so a raw text file cannot simply be dropped into an ORC table.
Hence we need a base table: in practice we create a table in textfile format, select the data out of it, and write it into the ORC table, as sketched below.
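A minimal sketch of that two-step flow, with illustrative table and column names:
-- 1. Base (staging) table matching the delimited text layout
CREATE TABLE staging_table (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA INPATH '/landing/data.csv' INTO TABLE staging_table;

-- 2. ORC table, populated by rewriting the text rows as ORC
CREATE TABLE orc_table (id INT, name STRING)
STORED AS ORC;

INSERT INTO TABLE orc_table SELECT * FROM staging_table;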
2. Can't we directly create the table as ORC and load data into it?
Yes, you can create the table as ORC and load data into it directly, provided the data is written through Hive (for example with INSERT ... SELECT) rather than copied in as raw text files.
To understand more about ORC, you can refer to https://orc.apache.org/docs/
Usually, if you don't define a file format, Hive defaults to textfile.
The need for a base table arises because when you create a Hive table in ORC format and then try to load data using a command like
LOAD DATA INPATH '' ...
it simply moves the data from one location to another.
A Hive ORC table won't understand a text file; that's where the SerDe comes into the picture. You define the SerDe while creating the table, so that for an operation like
1. SELECT * (read)
2. INSERT INTO (write)
the SerDe serializes and deserializes the various formats to and from ORC and maps the data onto the Hive columns.
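To make the distinction concrete, a sketch (paths and table names are illustrative):
-- LOAD DATA only moves the file: a text file dropped into an ORC
-- table's directory is not converted, so later reads will fail.
LOAD DATA INPATH '/landing/data.csv' INTO TABLE orc_table;

-- INSERT ... SELECT goes through the SerDes, so the text rows are
-- actually rewritten as ORC on the way in.
INSERT INTO TABLE orc_table SELECT * FROM text_table;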

Single file output from Hive

I have a Hive table that uses a SerDe to store files on Azure Blob storage:
field1 int,
field2 string,
field3 struct<...>
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
When I insert 5000 records into the table, the output consists of 5000 blobs on Azure Storage. Is there a way to store the output as a single blob, or at least to reduce the number of blobs by packing more records into each one?
It seems to be caused by HiveIgnoreKeyTextOutputFormat, whose ignore-key behaviour is used when writing the HDFS files. Please try specifying another output format, such as HiveBinaryOutputFormat.
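For what it's worth, a different lever for the file-count problem is Hive's small-file merge settings, which ask Hive to coalesce output files after the job finishes. A hedged sketch (the table names are placeholders and the byte thresholds are illustrative):
SET hive.merge.mapfiles=true;        -- merge outputs of map-only jobs
SET hive.merge.mapredfiles=true;     -- merge outputs of map-reduce jobs
SET hive.merge.smallfiles.avgsize=128000000;  -- merge when the average output file is smaller than this
SET hive.merge.size.per.task=256000000;       -- target size of each merged file
INSERT INTO TABLE my_json_table SELECT * FROM source_table;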

How to define an HDInsight Hive external table based on XMLs in a container

I tried creating a Hive external table:
CREATE EXTERNAL TABLE TestXML (storexml string)
STORED AS TEXTFILE
LOCATION 'wasb:///test/';
However, when I try executing a query like the one below, it is not able to extract the fields:
SELECT
xpath_string (storexml, '/trades/trade/USI')
FROM TestXML;
I saw a post that talked about specifying the input format:
add JARS <>
set xmlinput.element=Store;
CREATE EXTERNAL TABLE EventStoreXML (storexml string)
STORED AS INPUTFORMAT 'msdn.hadoop.mapreduce.input.XmlElementStreamingInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 'wasb:///eventstore#tradedata.blob.core.windows.net/';
I could not determine which jars to include in the ADD JARS statement. I am using HDInsight on Linux.
Any pointers would be appreciated.
-Madhu
Realised the issue was that the XML contained carriage returns; as a result, Hive was not able to read the XML.

Dynamically create Hive external table with Avro schema on Parquet Data

I'm trying to dynamically (without listing column names and types in the Hive DDL) create a Hive external table on Parquet data files. I have the Avro schema of the underlying Parquet files.
My attempt uses the DDL below:
CREATE EXTERNAL TABLE parquet_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath'
TBLPROPERTIES ('avro.schema.url'='http://myHost/myAvroSchema.avsc');
My Hive table is successfully created with the right schema, but when I try to read the data:
SELECT * FROM parquet_test;
I get the following error:
java.io.IOException: org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Expecting a AvroGenericRecordWritable
Is there a way to successfully create and read Parquet files, without mentioning columns name and types list in DDL?
The following queries work:
CREATE TABLE avro_test
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS AVRO
TBLPROPERTIES ('avro.schema.url'='myHost/myAvroSchema.avsc');

CREATE EXTERNAL TABLE parquet_test LIKE avro_test
STORED AS PARQUET
LOCATION 'hdfs://myParquetFilesPath';
The trick is that avro_test materializes the column definitions from the Avro schema, and CREATE TABLE ... LIKE copies those columns while STORED AS PARQUET swaps in the Parquet SerDe, so reads no longer go through the AvroSerDe.