Parquet binary UTF8 as string on hive - hive

There is a parquet file with a binary (UTF8) column named abc in it.
Is it possible to create an external table on hive that will contain the same column abc, but casted to string?
The structure of the parquet file:
$ parquet-tools schema ~/Downloads/dataset.gz.parquet
message spark_schema {
optional binary abc (UTF8);
}

There are three different types involved:
There is a SQL schema for the table in Hive. Each column has a type, like STRING or DECIMAL. Each table (or each partition, for a partitioned table) consists of multiple files that must all be of the same file format, for example PLAINTEXT, AVRO or PARQUET.
Each of the individual files will have type information as well (except PLAINTEXT). In the case of Parquet, this means two further levels:
The physical type describes the storage size, for example INT32, BYTE_ARRAY or FIXED_LEN_BYTE_ARRAY.
The logical type tells applications how to interpret the data, for example UTF8 or DECIMAL.
The STRING column type of Hive is stored as a BYTE_ARRAY physical type (called binary in Parquet schema definitions) with the UTF8 logical type annotation.
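In other words, the bytes on disk are the same either way; the UTF8 annotation only tells readers to decode them as text. A minimal Python sketch of that interpretation step (the byte values here are illustrative):

```python
# A BYTE_ARRAY value is just raw bytes; the UTF8 logical type tells
# readers to interpret those bytes as a UTF-8 encoded string.
raw = bytes([0x61, 0x62, 0x63])  # what Parquet physically stores

# Without the logical annotation, a reader surfaces raw bytes:
as_binary = raw

# With the UTF8 annotation, a reader decodes them to text:
as_string = raw.decode("utf-8")

print(as_binary)  # b'abc'
print(as_string)  # abc
```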

Apparently, you can simply specify string as the type of the column and it will be taken care of.
CREATE EXTERNAL TABLE `dataset`(
`abc` string)
STORED AS parquet
LOCATION
'...';

Related

Why loading parquet files into Bigquery gives me back gibberish values into the table?

When I load parquet files into a BigQuery table, the stored values are weird. It seems to be the encoding of BYTES fields or something similar.
Here's the format of the created fields.
So when I read the table with the fields cast, I get readable values.
I found the solution here.
My question is: why is BigQuery behaving like this?
According to this GCP documentation, some Parquet data types can be converted into multiple BigQuery data types. A workaround is to annotate the Parquet schema with the data type you want BigQuery to use.
For example, to convert the Parquet INT32 data type to the BigQuery DATE data type, specify the following:
optional int32 date_col (DATE);
And another way is to add the schema to the bq load command:
bq load --source_format=PARQUET --noreplace --noautodetect \
  --parquet_enum_as_string=true --decimal_target_types=STRING \
  [project]:[dataset].[tables] gs://[bucket]/[file].parquet \
  Column_name:Data_type

Migrating data from Hive PARQUET table to BigQuery, Hive String data type is getting converted in BQ - BYTES datatype

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in PARQUET file format, and the data type of one column is STRING. I am uploading the files behind the Hive table to Google Cloud Storage and creating a BigQuery internal table from them with the GUI. The data type of the column in the imported table gets converted to BYTES.
But when I imported CHAR or VARCHAR data types, the resulting data type was STRING.
Could someone please explain why this is happening?
This does not fully answer the original question, as I do not know exactly what happened, but I have had experience with similar odd behavior.
I was facing a similar issue when trying to move a table between Cloudera and BigQuery.
First I created the table as external in Impala, like:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table;
original_table has columns with STRING datatype.
Then I transferred that to GCS and imported it into BigQuery from the console GUI; there are not many options, just select the Parquet format and point to GCS.
And to my surprise I could see that the columns were now of type BYTES; the column names were preserved fine, but the content was scrambled.
Trying different codecs, and pre-creating the table and inserting, still in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating external table in Hive like:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
And repeated the same dance with copying from S3 to GS and importing in BQ - this time without any issue. Columns are now recognized in BQ as STRING and data is as it should be.
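One plausible explanation for the "scrambled" content above (an assumption on my part, not confirmed in the question): once the column is imported as BYTES, BigQuery renders its values base64-encoded, so a string column shows up as the base64 form of its UTF-8 bytes. A Python sketch of what that would look like:

```python
import base64

# A value that Hive stored as STRING (UTF-8 bytes in Parquet).
value = "hello world"

# If BigQuery imports the column as BYTES, the console shows the
# value base64-encoded rather than as readable text.
shown = base64.b64encode(value.encode("utf-8")).decode("ascii")
print(shown)  # aGVsbG8gd29ybGQ=
```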

Hive ORC File Format

When we create an ORC table in hive we can see that the data is compressed and not exactly readable in HDFS. So how is Hive able to convert that compressed data into readable format which is shown to us when we fire a simple select * query to that table?
Thanks for suggestions!!
By using the ORC serde while creating the table. You have to provide the fully qualified class name for the serde, e.g. ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'.
What a serde does is deserialize data in a particular format into objects that Hive can process, and serialize those objects to store them back in HDFS.
Hive uses a "SerDe" (Serializer/Deserializer) to do that. When you create a table, you mention the file format; in your case it's ORC ("STORED AS ORC"), right? Hive uses the ORC library (jar file) internally to convert it into a readable format. To know more about Hive internals, search for "Hive SerDe" and you will see how the data is converted to objects and vice versa.
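The round trip a serde performs can be sketched generically: on read, stored bytes are deserialized into objects the engine can process; on write, objects are serialized back into the storage format. A toy Python analogue using JSON as the "file format" (the names here are illustrative, not Hive APIs):

```python
import json

# "Deserialize": turn the stored representation into an object
# the engine can process (what a serde does on read).
stored = '{"col1": 1, "col2": "x"}'
row = json.loads(stored)

# The engine works with the object form.
row["col1"] += 1

# "Serialize": turn the object back into the storage format
# (what a serde does on write).
written = json.dumps(row)
print(written)  # {"col1": 2, "col2": "x"}
```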

Apache hive create table for given structure

My csv file contains data structure like:
99999,{k1:v1,k2:v2,k3:v3},9,1.5,http://www.asd.com
what is the create table query for this structure?
I don't have to do any processing on csv file before it is loaded into table.
You need to use the OpenCSV serde to read/write CSV data to/from a Hive table. Download it here: https://drone.io/github.com/ogrodnek/csv-serde/files/target/csv-serde-1.1.2-0.11.0-all.jar
Add the serde to Hive's library path. This step can be skipped, but do upload the jar to the HDFS cluster your Hive server is running on; we will use it later when querying.
Create Table
CREATE TABLE my_table(a int, b string, c int, d double, url string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'",
"escapeChar" = "\\"
)
STORED AS TEXTFILE;
Notice that if you use the OpenCSV serde, no matter what type you declare, it will be treated as String by Hive. But no worries, as Hive is loosely typed: it will cast strings into int, JSON, etc. at runtime.
Query
To query at the hive prompt first add the library if not added to the library path of hive
add jar hdfs:///user/hive/aux_jars/opencsv.jar;
Now you could query as:
select a, get_json_object(b, '$.k1') from my_table where get_json_object(b, '$.k2') > val;
The above is an example of accessing a JSON field from a Hive table.
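For readers unfamiliar with get_json_object, the same extraction can be sketched in Python with the standard library (the row value below is illustrative, and quoted so it is valid JSON):

```python
import json

# One row's column b: a JSON object stored as a string.
b = '{"k1": "v1", "k2": "v2", "k3": "v3"}'

# get_json_object(b, '$.k1') roughly corresponds to:
obj = json.loads(b)
print(obj["k1"])  # v1

# and the WHERE clause get_json_object(b, '$.k2') > val compares
# the extracted value against a threshold:
print(obj["k2"] > "v1")  # True
```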
References:
http://documentation.altiscale.com/using-csv-serde-with-hive
http://thornydev.blogspot.in/2013/07/querying-json-records-via-hive.html
PS: json_tuple is a faster way to access JSON elements, but I find the syntax of get_json_object more appealing.

Invalid parquet hive schema: repeated group array

Most datasets on our production Hadoop cluster currently are stored as AVRO + SNAPPY format. I heard lots of good things about Parquet, and want to give it a try.
I followed this web page, to change one of our ETL to generate Parquet files, instead of Avro, as the output of our reducer. I used the Parquet + Avro schema, to produce the final output data, plus snappy codec. Everything works fine. So the final output parquet files should have the same schema as our original Avro file.
Now, I try to create a Hive table for these Parquet files. Currently, IBM BigInsight 3.0, which we use, contains Hive 12 and Parquet 1.3.2.
Based on the our Avro schema file, I come out the following Hive DDL:
create table xxx (
  col1 bigint,
  col2 string,
  .................
  field1 array<struct<sub1:string, sub2:string, date_value:bigint>>,
  field2 array<struct<..............>>
)
ROW FORMAT SERDE 'parquet.hive.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'parquet.hive.DeprecatedParquetInputFormat'
  OUTPUTFORMAT 'parquet.hive.DeprecatedParquetOutputFormat'
LOCATION 'xxxx';
The table created successfully in Hive 12, and I can "desc table" without any problem. But when I tried to query the table, like "select * from table limit 2", I got the following error:
Caused by: java.lang.RuntimeException: Invalid parquet hive schema: repeated group array {
  required binary sub1 (UTF8);
  optional binary sub2 (UTF8);
  optional int64 date_value;
}
  at parquet.hive.convert.ArrayWritableGroupConverter.<init>(ArrayWritableGroupConverter.java:56)
  at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:36)
  at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
  at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:46)
  at parquet.hive.convert.HiveGroupConverter.getConverterFromDescription(HiveGroupConverter.java:38)
  at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:61)
  at parquet.hive.convert.DataWritableGroupConverter.<init>(DataWritableGroupConverter.java:40)
  at parquet.hive.convert.DataWritableRecordConverter.<init>(DataWritableRecordConverter.java:32)
  at parquet.hive.read.DataWritableReadSupport.prepareForRead(DataWritableReadSupport.java:109)
  at parquet.hadoop.InternalParquetRecordReader.initialize(InternalParquetRecordReader.java:142)
  at parquet.hadoop.ParquetRecordReader.initializeInternalReader(ParquetRecordReader.java:118)
  at parquet.hadoop.ParquetRecordReader.initialize(ParquetRecordReader.java:107)
  at parquet.hive.MapredParquetInputFormat$RecordReaderWrapper.<init>(MapredParquetInputFormat.java:230)
  at parquet.hive.MapredParquetInputFormat.getRecordReader(MapredParquetInputFormat.java:119)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getRecordReader(FetchOperator.java:439)
  at org.apache.hadoop.hive.ql.exec.FetchOperator.getNextRow(FetchOperator.java:522)
  ... 14 more
I noticed that the error comes from the first nested array-of-struct column. My questions are the following:
Does Parquet support the nested array of struct?
Is this only related to Parquet 1.3.2? Is there any workaround on Parquet 1.3.2?
If I have to use a later version of Parquet to fix the above problem, and Parquet 1.3.2 is available at runtime, will that cause any issues?
Can I use all kinds of Hive features, like "explode" of nested structures, on the Parquet data?
What we are looking for is to know whether Parquet can be used the same way we currently use Avro, while giving us the columnar-storage benefits that are missing from Avro.
It looks like Hive 12 cannot support nested structures in Parquet files, as shown in this Jira ticket:
https://issues.apache.org/jira/browse/HIVE-8909
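For context, the error message suggests the writer produced a legacy repeated group array { ... } layout that this version's converter rejects. The standard three-level LIST layout defined by the Parquet format specification, sketched for the sub1/sub2/date_value struct above, looks like this:

```
optional group field1 (LIST) {
  repeated group list {
    optional group element {
      required binary sub1 (UTF8);
      optional binary sub2 (UTF8);
      optional int64 date_value;
    }
  }
}
```

Writers that emit the standard layout interoperate with newer Hive/Parquet readers, which is consistent with the fix landing in the HIVE-8909 work referenced above.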