Changing column datatype in parquet file - sql

I have an external table pointing to an S3 location (a Parquet file) in which every column is typed as string. I want to correct the datatypes of the columns instead of reading everything as a string. When I drop the external table and recreate it with the new datatypes, the select query always throws an error that looks something like this:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)

Specify the type as BIGINT, which is the equivalent of a long; Hive does not have a LONG datatype.
hive> alter table table change col col bigint;
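As a concrete sketch (the table, column, and location names below are placeholders, not taken from the question), dropping and recreating the external table with bigint for the 64-bit integer columns would look something like this:
DROP TABLE IF EXISTS my_ext_table;
CREATE EXTERNAL TABLE my_ext_table (
  id bigint,     -- bigint, not long: Hive has no long type
  name string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-prefix/';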

Related

AWS Athena: HIVE_BAD_DATA ERROR: Field type DOUBLE in parquet is incompatible with type defined in table schema

I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression.
I have three columns with string values, one column called "key" with int values, and one column called "result" which has both double and int values.
With those columns, I created Schema like:
create external table (
key int,
result double,
location string,
vehicle_name string,
filename string
)
When I queried the table, I would get
HIVE_BAD_DATA: Field results type INT64 in parquet is incompatible with type DOUBLE defined in table schema
So I modified the schema, changing the result column's datatype to INT.
Then I queried the table and got,
HIVE_BAD_DATA: Field results type DOUBLE in parquet is incompatible with type INT defined in table schema
I've looked around to try to understand why this might happen but found no solution.
Any suggestion is much appreciated.
It sounds to me like you have some files where the column is typed as double and some where it is typed as int. When you type the column of the table as double Athena will eventually read a file where the corresponding column is int and throw this error, and vice versa if you type the table column as int.
Athena doesn't do type coercion as far as I can tell, but even if it did, the types are not compatible: a DOUBLE column in Athena can't represent all possible values of a Parquet INT64 column, and an INT column in Athena can't represent a floating point number (and a BIGINT column is required in Athena for a Parquet INT64).
The solution is to make sure your files all have the same schema. You probably need to be explicit in the code that produces the files about what schema to produce (e.g. make it always use DOUBLE).
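Once the producing code writes a consistent schema, the table definition can match it. A hedged sketch of the corrected DDL, keeping result as double (the table name and location are placeholders, and partition columns are omitted as in the question's DDL):
CREATE EXTERNAL TABLE my_table (
  key int,
  result double,
  location string,
  vehicle_name string,
  filename string
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-prefix/';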

Is there a way to specify Date/Timestamp format for the incoming data within the Hive CREATE TABLE statement itself?

I have CSV files which contain date and timestamp values in the formats below. Eg:
Col1|col2
01JAN2019|01JAN2019:17:34:41
But when I define Col1 as Date and Col2 as Timestamp in my create statement, the Hive table simply returns NULL when I query it.
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 date,
Col2 timestamp)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 'my_path';
Instead, if I define the data types as simply string then it works. But that's not how I want my tables to be.
I want the table to be able to read the incoming data in correct type. How can I achieve this? Is it possible to define the expected data format of the incoming data with the CREATE statement itself?
Can someone please help?
As of Hive 1.2.0 it is possible to provide the additional SerDe property "timestamp.formats". See this Jira for more details: HIVE-9298
ALTER TABLE timestamp_formats SET SERDEPROPERTIES ("timestamp.formats"="ddMMMyyyy:HH:mm:ss");
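To set this within the CREATE statement itself, as the question asks, the property can be supplied via WITH SERDEPROPERTIES. Below is a hedged sketch assuming Hive 1.2.0+ and the default LazySimpleSerDe. Note that timestamp.formats applies to TIMESTAMP columns; the ddMMMyyyy value in Col1 will likely still come back NULL as a DATE, so reading it as a string and converting in queries (my assumption, not part of the answer above) is one workaround:
CREATE EXTERNAL TABLE IF NOT EXISTS my_schema.my_table
(Col1 string,      -- ddMMMyyyy is not a format the DATE type parses; convert in queries if needed
 Col2 timestamp)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'field.delim' = '|',
  'timestamp.formats' = 'ddMMMyyyy:HH:mm:ss'
)
STORED AS TEXTFILE
LOCATION 'my_path';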

How to change a column name in hive

I have a Hive table where the column names are orderbook.time, orderbook.price, etc. I want to remove the prefix orderbook from the column names without changing anything else in the table. I'm using the following command:
alter table orderbook change orderbook.time time;
but it gives me the following error message
NoViableAltException(17#[])
at org.apache.hadoop.hive.ql.parse.HiveParser_IdentifiersParser.identifier(HiveParser_IdentifiersParser.java:11568)
at org.apache.hadoop.hive.ql.parse.HiveParser.identifier(HiveParser.java:45214)
at org.apache.hadoop.hive.ql.parse.HiveParser.alterStatementSuffixRenameCol(HiveParser.java:10258)
at org.apache.hadoop.hive.ql.parse.HiveParser.alterTblPartitionStatementSuffix(HiveParser.java:8533)
at org.apache.hadoop.hive.ql.parse.HiveParser.alterTableStatementSuffix(HiveParser.java:8148)
at org.apache.hadoop.hive.ql.parse.HiveParser.alterStatement(HiveParser.java:7192)
at org.apache.hadoop.hive.ql.parse.HiveParser.ddlStatement(HiveParser.java:2604)
at org.apache.hadoop.hive.ql.parse.HiveParser.execStatement(HiveParser.java:1591)
at org.apache.hadoop.hive.ql.parse.HiveParser.statement(HiveParser.java:1067)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:205)
at org.apache.hadoop.hive.ql.parse.ParseDriver.parse(ParseDriver.java:170)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:524)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:1358)
at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1475)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1287)
at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1277)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:226)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:175)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:389)
at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:781)
at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:699)
at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:634)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.util.RunJar.run(RunJar.java:226)
at org.apache.hadoop.util.RunJar.main(RunJar.java:141)
FAILED: ParseException line 1:38 cannot recognize input near '.' 'time' 'time' in rename column name
I tried putting the old column names (e.g. orderbook.time) in quotation marks, but I get the same error message. How can I change the column names?
You are missing the datatype of the time column. Also, escape a column name that contains a period with backticks (`).
Try with below query:
alter table orderbook change `orderbook.time` time <data_type>;
In general, the syntax to change a column name in Hive is:
alter table <db_name>.<table_name> change `<col_name>` `<new_col_name>` <data_type>;
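Applied to the table in the question, that becomes something like the statements below (the datatypes here are only guesses for illustration; use the actual types shown by DESCRIBE orderbook):
alter table orderbook change `orderbook.time` time timestamp;
alter table orderbook change `orderbook.price` price double;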

hive can't change column type Invalid column reference [duplicate]

I have a table which has a partition column of type int that I want to convert to string. However, I can't figure out how to do this.
The table description is:
Col1 timestamp
Col2 string
Col3 string
Col4 string
Part_col int
# Partition information
# col_name data_type comment
Part_col int
The partitions I have created are Part_col=0, Part_col=1, ..., Part_col=23
I want to change them to Part_col='0' etc
I run this command in hive:
set hive.exec.dynamic.partitions = true;
Alter table tbl_name partition (Part_col=0) Part_col Part_col string;
I have also tried using "partition (Part_col)" to change all partitions at once.
I get the error "Invalid column reference Part_col"
I am using the example from https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types for conversion of decimal columns but can't figure out what dec_column_name represents.
Thanks
A bit of digging revealed that there is a Hive JIRA for a command exactly for this, updating a partition column's data type (https://issues.apache.org/jira/browse/HIVE-3672):
alter table {table_name} partition column ({column_name} {column_type});
According to the JIRA the command was implemented, but apparently it was never documented on the Hive Wiki.
I used it on my Hive 0.14 system and it worked as expected.
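For the table in the question, that would be (tbl_name as used in the question; requires a Hive version where HIVE-3672 is available, e.g. 0.14 as mentioned above):
alter table tbl_name partition column (Part_col string);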
I think you should redefine the table's schema so that the partition value is no longer an integer but a string.
What I recommend you do is:
Make your table external (in case you defined it as a non-external table). That way you can drop the table without removing the data in the directories.
Drop the table.
Create the table again with the new schema (partition value as a string).
Physically (the folder structure), the steps above make no difference to the structure you already had. The difference is in the Hive metastore, specifically in the "virtual column" created when you make partitions.
Also, instead of making queries like part_col = 1, you will now be able to make queries like part_col = '1'.
Try this and tell me how it goes.
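For reference, a hedged sketch of that drop-and-recreate approach using the column names from the question (the storage format and location are placeholders; MSCK REPAIR TABLE re-registers the existing Part_col=... directories as partitions):
DROP TABLE IF EXISTS tbl_name;
CREATE EXTERNAL TABLE tbl_name (
  Col1 timestamp,
  Col2 string,
  Col3 string,
  Col4 string
)
PARTITIONED BY (Part_col string)
STORED AS TEXTFILE        -- replace with the table's actual storage format
LOCATION '/path/to/existing/data';
MSCK REPAIR TABLE tbl_name;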

Hive: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException

I have a Parquet file (created by Drill) that I'm trying to read in Hive as an external table. The data types are copied one-to-one (i.e. INTEGER -> INT, BIGINT -> BIGINT, DOUBLE -> DOUBLE, TIMESTAMP -> TIMESTAMP, CHARACTER VARYING -> STRING). There are no complex types.
Drill has no problem querying the file it created, but Hive does not like it:
CREATE EXTERNAL TABLE my_table
(
<col> <data_type>
)
STORED AS PARQUET
LOCATION '<hdfs_location>';
I can execute SELECT COUNT(*) FROM my_table and get the correct number of rows back, but when I ask for the first row it says:
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable (state=,code=0)
I'm not sure why it complains because I use integers and big integers, none of which I assume are stored as longs. Moreover, I would assume that an integer can be cast to a long. Is there a known workaround?
It's just because of your data.
I was facing the same issue.
My data was of type int, but I had created the external table column as string.
Give the appropriate datatypes in the Hive create statement.
Hive does not support certain data types, e.g. long; use bigint instead.
Here is the two-step solution:
First, drop the Table
Drop TABLE if exists <TableName>
Second, recreate the Table, this time with 'bigint' instead of 'long'
Create external TABLE <TableName>
(
<col> bigint
)
Stored as Parquet
Location '<hdfs_location>';
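After recreating the table, a quick check like the one below (a hedged usage example; <TableName> is the same placeholder as above) should return the first row without the ClassCastException:
SELECT * FROM <TableName> LIMIT 1;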