Single file output from Hive - hive

I have a Hive table that uses a SerDe to store files on Azure Blob Storage.
field 1 int,
field 2 string,
field 3 struct
ROW FORMAT SERDE
'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
When I insert 5000 records into the table, the output consists of 5000 blobs on Azure Storage. Is there a way to store the output as a single blob, or at least reduce the number of blobs by putting more records in each?

This seems to be caused by HiveIgnoreKeyTextOutputFormat and its ignore-key behavior when writing HDFS files. Try specifying a different output format, such as HiveBinaryOutputFormat.

Related

Query Athena from s3 database - remove metadata/corrupted data

I was following the tutorial for connecting Tableau to Amazon Athena and got stuck when the query did not return the expected result. I downloaded student-db.csv from https://github.com/aws-samples/amazon-athena-tableau-integration and uploaded the CSV to an S3 bucket that I created. I can create the database in Athena. However, when I create a table, either with the bulk add option or directly from the query editor, and preview it with a query, the data comes back corrupted: it includes unexpected characters and unnecessary punctuation, sometimes all the data is collapsed into a single column, and it also contains metadata such as "1 ?20220830_185102_00048_tnqre"0 2 ?hive" 3 Query Plan* 4 Query Plan2?varchar8 #H?P?". With Tableau connected to Athena, I see the same issues when I preview the table that was created with Athena and stored in my bucket.
CREATE EXTERNAL TABLE IF NOT EXISTS student(
`school` string,
`country` string,
`gender` string,
`age` string,
`studytime` int,
`failures` int,
`preschool` string,
`higher` string,
`remotestudy` string,
`health` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
's3://jj2-test-bucket/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1',
'transient_lastDdlTime'='1595149168');
SELECT * FROM "studentdb"."student" limit 10;
The solution is to create a separate S3 bucket to hold the query results. Additionally, when connecting from Tableau, you must set the S3 Staging Directory to the location of the query result bucket rather than to the S3 bucket that contains your raw data/CSV.
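To see which S3 objects Athena is actually reading for the table, you can list the source object behind each row. This is a minimal sketch that assumes the database and table names from the question; `"$path"` is the hidden column Athena exposes for the source file of a row. If query result or metadata files show up next to student-db.csv, they are sitting in the table's LOCATION and are being parsed as data.

```sql
-- List the distinct S3 objects that back the rows of the table.
-- "$path" is a hidden column and must be double-quoted.
SELECT DISTINCT "$path"
FROM "studentdb"."student";
```

Once the results are written to a different bucket and the stray files are removed, only the CSV should appear in this list.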

How to create hive external table with avro file on qubole?

Can someone point me to the documentation for creating an external table on Qubole based on Avro files?
CREATE TABLE my_table_name
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_avro_files/'
The following directory contains a bunch of Avro files:
s3://my_avro_files/
s3://my_avro_files/file1.avro
s3://my_avro_files/file2.avro
s3://my_avro_files/file....avro
I believe you need to provide the schema as well. Please see here for details on how to extract it and specify it in the CREATE TABLE statement.
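As a rough sketch of what specifying the schema could look like, the AvroSerDe accepts an `avro.schema.url` table property that points at an Avro schema (.avsc) file. The schema path below is hypothetical; this assumes the schema was extracted from one of the data files and saved outside the data directory, so Hive does not try to read it as data.

```sql
-- Assumption: 's3://my_avro_schemas/my_schema.avsc' is a hypothetical
-- location for the extracted schema, kept separate from the data files.
CREATE EXTERNAL TABLE my_table_name
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3://my_avro_files/'
TBLPROPERTIES ('avro.schema.url'='s3://my_avro_schemas/my_schema.avsc');
```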

Hive AVRO table creation syntax

What are the differences between these two syntaxes in Hive to create an Avro table?
CREATE TABLE db.mytable (fields...)
STORED AS AVRO
...
CREATE TABLE db.mytable (fields...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
...
There is no difference; one is just more verbose. To check, you can run the command
describe formatted db.yourtable;
and you will see that the SerDe used by Hive for the table created with the short syntax is the same as the one in the verbose version.
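A small side-by-side check might look like the following. The table names and columns are made up for illustration, and the sketch assumes a Hive version that supports STORED AS AVRO (0.14 or later).

```sql
-- Hypothetical table names and columns, for illustration only.
CREATE TABLE avro_short (id INT, name STRING)
STORED AS AVRO;

CREATE TABLE avro_verbose (id INT, name STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

-- Compare the "SerDe Library", "InputFormat" and "OutputFormat" rows
-- in the storage section of each output.
DESCRIBE FORMATTED avro_short;
DESCRIBE FORMATTED avro_verbose;
```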

Hive External Table with Azure Blob Storage

Is there a way to create a Hive external table using a SerDe, with its location pointing to Azure Storage, organized so that the data uses the fewest possible blobs? For example, if I insert 10000 records, I would like it to create just 100 page blobs with 100 records each instead of maybe 10000 with 1 record each. I am deserializing from the blobs, so fewer blobs will take less time. What would be the most suitable format in Hive?
First, there is a way to create a Hive external table using a SerDe with its location pointing to Azure Blob Storage, but not directly. Please see the section "Create Hive database and tables", which uses HiveQL like the below.
create database if not exists <database name>;
CREATE EXTERNAL TABLE if not exists <database name>.<table name>
(
field1 string,
field2 int,
field3 float,
field4 double,
...,
fieldN string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '<field separator>' lines terminated by '<line separator>'
STORED AS TEXTFILE LOCATION '<storage location>' TBLPROPERTIES("skip.header.line.count"="1");
Pay attention to the following explanation of <storage location>.
<storage location>: the Azure storage location to save the data of Hive tables. If you do not specify LOCATION , the database and the tables are stored in hive/warehouse/ directory in the default container of the Hive cluster by default. If you want to specify the storage location, the storage location has to be within the default container for the database and tables. This location has to be referred as location relative to the default container of the cluster in the format of 'wasb:///<directory 1>/' or 'wasb:///<directory 1>/<directory 2>/', etc. After the query is executed, the relative directories are created within the default container.
This means you can access an Azure Blob Storage location from Hive via the wasb protocol, which requires the hadoop-azure library that lets Hadoop use Azure Storage as an HDFS-compatible file system. If your Hive on Hadoop is not deployed on Azure, refer to the official Hadoop document "Hadoop Azure Support: Azure Blob Storage" to configure it.
Which SerDe to use depends on the file format. For example, for the ORC file format, the HQL uses OrcSerde as below.
CREATE EXTERNAL TABLE IF NOT EXISTS <table name> (<column_name column_type>, ...)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS ORC
LOCATION '<orcfile path>'
As for your second question, the most suitable format in Hive is the ORC file format.

Create hive table for schema less avro files

I have multiple Avro files and each file has a STRING in it. Each Avro file is a single row. How can I write a Hive table to consume all the Avro files located in a single directory?
Each file has a big number in it, and hence I do not have any JSON-style schema that I can relate to. I might be wrong when I say schema-less, but I cannot find a way for Hive to understand this data. This might be very simple, but I am lost since I have tried numerous different ways without success. I have created tables pointing to a JSON schema via the Avro schema URL before, but that is not the case here.
For more context, the files were written using the Crunch API:
final Path outcomesVersionPath = ...
pipeline.write(fruit.keys(), To.avroFile(outcomesVersionPath));
I tried the following query, which creates the table but does not read the data properly:
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
If your data set only has one STRING field then you should be able to read it from Hive with a single column called data (or whatever you would like) by changing your DDL to:
CREATE EXTERNAL TABLE test_table
(data STRING)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
And then read the data with:
SELECT data FROM test_table;
Use the Avro utilities jar to view the Avro schema of any given binary file.
Then just link the schema file when creating the table.
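Linking the schema file could look roughly like this. The sketch assumes the schema has already been extracted (for example with the `getschema` command of the avro-tools jar) and saved to the hypothetical path shown in `avro.schema.url`, outside the data directory.

```sql
-- Assumption: the schema was extracted beforehand, e.g. with
--   java -jar avro-tools-<version>.jar getschema <one of the .avro files>
-- and saved to the hypothetical path below, outside the data directory.
CREATE EXTERNAL TABLE test_table
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 'hdfs:///somePath/directory_with_Ids'
TBLPROPERTIES ('avro.schema.url'='hdfs:///somePath/schemas/ids.avsc');
```

With the schema supplied this way, the column list is taken from the schema file rather than declared in the DDL.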