I have a Parquet file (created by Drill) that I'm trying to read in Hive as an external table. The data types are copied one-to-one (i.e. INTEGER -> INT, BIGINT -> BIGINT, DOUBLE -> DOUBLE, TIMESTAMP -> TIMESTAMP, CHARACTER VARYING -> STRING). There are no complex types.
Drill has no problem querying the file it created, but Hive does not like it:
CREATE EXTERNAL TABLE my_table
(
<col> <data_type>
)
STORED AS PARQUET
LOCATION '<hdfs_location>';
I can execute SELECT COUNT(*) FROM my_table and get the correct number of rows back, but when I ask for the first row it says:
Error: java.io.IOException: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.ClassCastException: org.apache.hadoop.io.IntWritable cannot be cast to org.apache.hadoop.io.LongWritable (state=,code=0)
I'm not sure why it complains: I only use integers and big integers, neither of which I would expect to be stored as longs. Moreover, I would assume that an integer can be cast to a long. Is there a known workaround?
It's just because of your data types.
I was facing the same issue: my data was int, but I had created the external table column as string.
Give the appropriate data types in the Hive CREATE statement, matching what is actually stored in the Parquet file.
Also note that Hive does not support certain type names, e.g. there is no LONG type; use BIGINT instead.
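For what it's worth, the IntWritable-to-LongWritable ClassCastException above is typically what Hive reports when a table column is declared BIGINT but the corresponding Parquet column is INT32; Hive's Parquet reader does not widen the value for you, so each declared type has to match the file's physical type. A minimal sketch of a matching definition, using the question's placeholder style (the column names here are made up):
CREATE EXTERNAL TABLE my_table
(
  <int32_col> INT,    -- Parquet INT32 maps to Hive INT
  <int64_col> BIGINT  -- Parquet INT64 maps to Hive BIGINT
)
STORED AS PARQUET
LOCATION '<hdfs_location>';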
Here is the two-step solution:
First, drop the table:
DROP TABLE IF EXISTS <TableName>;
Second, recreate the table, this time with BIGINT instead of LONG:
CREATE EXTERNAL TABLE <TableName>
(
  <col> BIGINT
)
STORED AS PARQUET
LOCATION '<hdfs_location>';
Related
I tried using the following SQL to create a table from JSON files:
create table tmp.pg_tbl_report_group_data using json
location '/rawdata/json/tbl_report_group_data';
It works, but when the data is large it becomes very slow. I think (maybe I'm wrong) that Spark scans all the data to infer the schema, so I then tried:
create table tmp.pg_tbl_report_group_data (
report_key string,
unique_report_key string,
field string,
value string,
type bigint,
ids array<bigint>,
sql string,
create_time timestamp,
update_time timestamp,
unique_key string
) using json
location '/rawdata/json/tbl_report_group_data'
This SQL runs much faster. However, when I query with
select * from tmp.pg_tbl_report_group_data limit 10
all columns in the result rows are null.
My root question is how to create a table from a large JSON dataset quickly. Any approach is welcome, whether it is specifying the columns manually, speeding up the schema-inference phase, or something else.
edit1:
Tried samplingRatio 0.0001, still slow.
I'm the one who posted the question. It turns out the timestamps in our JSON are formatted as floating-point numbers, so changing the create_time and update_time columns to double resolves the problem.
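For completeness, a minimal sketch of the corrected statement, assuming (as described above) that only the two timestamp columns need to become double; the table, column names, and location are the ones from the question. Declaring the schema explicitly also avoids the slow inference scan:
create table tmp.pg_tbl_report_group_data (
  report_key string,
  unique_report_key string,
  field string,
  value string,
  type bigint,
  ids array<bigint>,
  sql string,
  create_time double,  -- stored as a float in the JSON, so declared double here
  update_time double,  -- same as above
  unique_key string
) using json
location '/rawdata/json/tbl_report_group_data';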
I use AWS Athena to query some data stored in S3, namely partitioned parquet files with pyarrow compression.
I have three columns with string values, one column called "key" with int values, and one column called "result" which has both double and int values.
With those columns, I created a schema like:
create external table (
key int,
result double,
location string,
vehicle_name string,
filename string
)
When I queried the table, I would get
HIVE_BAD_DATA: Field results type INT64 in parquet is incompatible with type DOUBLE defined in table schema
So, I modified the schema, setting the result column's data type to INT.
Then I queried the table and got,
HIVE_BAD_DATA: Field results type DOUBLE in parquet is incompatible with type INT defined in table schema
I've looked around to try to understand why this might happen but found no solution.
Any suggestion is much appreciated.
It sounds to me like you have some files where the column is typed as double and some where it is typed as int. When you type the column of the table as double Athena will eventually read a file where the corresponding column is int and throw this error, and vice versa if you type the table column as int.
Athena doesn't do type coercion as far as I can tell, but even if it did, the types are not compatible: a DOUBLE column in Athena can't represent all possible values of a Parquet INT64 column, and an INT column in Athena can't represent a floating point number (and a BIGINT column is required in Athena for a Parquet INT64).
The solution is to make sure your files all have the same schema. You probably need to be explicit in the code that produces the files about what schema to produce (e.g. make it always use DOUBLE).
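If you want to track down which files were written with which type before fixing the producer, Athena exposes a "$path" pseudo-column containing the S3 location of each row's source file. Assuming Athena only scans the columns a query actually references (so the mismatched result column is never read), something like the following may let you list the files; my_table stands in for your table name, which the question doesn't give:
SELECT DISTINCT "$path"
FROM my_table;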
I have an external table pointing to an S3 location (a Parquet file) in which all the data types are string. I want to correct the data types of the columns instead of just reading everything as a string. When I drop the external table and recreate it with the new data types, the select query always throws an error that looks something like this:
java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainBinaryDictionary
at org.apache.parquet.column.Dictionary.decodeToInt(Dictionary.java:48)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getInt(OnHeapColumnVector.java:233)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
Specify the type as BIGINT, which is equivalent to the long type; Hive doesn't have a LONG data type. ALTER TABLE ... CHANGE takes the old column name, the new column name, and the new type, so keep the same name to change only the type:
hive> ALTER TABLE <table_name> CHANGE <col_name> <col_name> bigint;
Duplicate content, from Hortonworks forum
Every time I try to select a DATE-type field in Impala from a table created in Hive, I get AnalysisException: Unsupported type 'DATE'.
Are there any workarounds?
UPDATE: here is an example of a create-table schema from Hive and an Impala query.
Schema:
CREATE TABLE myschema.mytable(day_dt date,
event string)
PARTITIONED BY (day_id int)
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
Impala query
select b.day_dt
from myschema.mytable b;
Impala doesn't have a DATE datatype, whereas Hive does. You will get AnalysisException: Unsupported type 'DATE' when you access it from Impala. A quick fix would be to create a string column of that date value in Hive and access it in whichever way you want from Impala.
If you're storing as strings, it may work to create a new external hive table that points to the same HDFS location as the existing table, but with the schema having day_dt with datatype STRING instead of DATE.
This is truly a workaround; it may only suit some use cases, and you'd at least need to run MSCK REPAIR TABLE on the new external Hive table whenever a partition is added.
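A minimal sketch of that workaround, reusing the schema from the question; the new table name and the location placeholder are made up, and STORED AS TEXTFILE is the shorthand for the TextInputFormat/HiveIgnoreKeyTextOutputFormat pair used above:
CREATE EXTERNAL TABLE myschema.mytable_str(
  day_dt string,   -- STRING instead of DATE so Impala can read it
  event string)
PARTITIONED BY (day_id int)
STORED AS TEXTFILE
LOCATION '<hdfs_location_of_mytable>';

MSCK REPAIR TABLE myschema.mytable_str;
-- In Impala you will likely also need: INVALIDATE METADATA myschema.mytable_str;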
I want to map my HBase table to Hive; this is what I've got:
CREATE EXTERNAL TABLE kutschke.bda01.twitter (
rowkey BIGINT,
userId BIGINT,
text STRING,
creationTime STRING,
isRetweet BOOLEAN,
retweetId BIGINT
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,user:id,text:,time:createdAt,retweet:isRetweet,retweet:retweetId')
TBLPROPERTIES ('hbase.table.name' = 'kutschke.bda01.twitter');
However, the 'text:' column doesn't get properly mapped because it has no qualifier. Instead I get the exception:
Logging initialized using configuration in file:/etc/hive/conf.dist/hive-log4j.properties
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. java.lang.RuntimeException:
MetaException(message:org.apache.hadoop.hive.serde2.SerDeException org.apache.hadoop.hive.hbase.HBaseSerDe:
hbase column family 'text' should be mapped to Map<? extends LazyPrimitive<?, ?>,?>,
that is the Key for the map should be of primitive type, but is mapped to string)
I think I understand the logic behind mapping the whole column family to a MAP, but is there a way to properly map the column with the empty qualifier? If not, how do I go about mapping the column family to a MAP, and how will I retrieve the column I actually want?
This can be done by typing the Hive column as the Hive native map type, like this:
CREATE TABLE hbase_table_1(value map<string,int>, row_key int)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES (
"hbase.columns.mapping" = "cf:,:key"
);
The output from the field mapped to a whole CF will be presented as a JSON string.
More info here: https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration#HBaseIntegration-HiveMAPtoHBaseColumnFamily
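To retrieve a particular column from the family, you index the map by qualifier name. A minimal sketch against the hbase_table_1 example above; 'someQualifier' is a made-up qualifier, and the empty-string lookup is my assumption about how the empty-qualifier column is keyed, so verify it against your data:
SELECT row_key,
       value['someQualifier'],  -- a named qualifier in the cf family (hypothetical)
       value['']                -- the empty-qualifier column, assuming it is keyed by the empty string
FROM hbase_table_1;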