Presto fails to import PARQUET files from S3 - amazon-s3

I have a Presto table that imports PARQUET files based on partitions from S3, created as follows:
create table hive.data.datadump
(
tUnixEpoch varchar,
tDateTime varchar,
temperature varchar,
series varchar,
sno varchar,
date date
)
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['series','sno','date'],
external_location = 's3a://dev/files');
The S3 folder structure where the Parquet files are stored looks like:
s3a://dev/files/series=S5/sno=242/date=2020-1-23
and the partitioning starts from series.
The original PySpark code that produces the Parquet files declares the whole schema as string type, and I am trying to import the columns as strings, but when I run my create script in Presto, it successfully creates the table yet fails to import the data.
On running
select * from hive.data.datadump;
I get the following error:
[Code: 16777224, SQL State: ] Query failed (#20200123_191741_00077_tpmd5): The column tunixepoch is declared as type string, but the Parquet file declares the column as type DOUBLE
Can you guys help resolve this issue?
Thank you in advance!

I ran into the same issue and found that it was caused by one of the records in my source not having a matching datatype for the column it was complaining about. I am fairly sure this is just a data problem; you need to trap the exact record that doesn't have the right type.
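A sketch of one way to probe for it in Presto, reading the suspect column one partition at a time (the partition values below are just the ones visible in the path from the question):
-- a partition whose files disagree with the declared type should reproduce the error
select tUnixEpoch from hive.data.datadump where series = 'S5' and sno = '242' limit 10;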

This might have been solved already, but just for info: this could be due to a column declaration mismatch between Hive and the Parquet file. To map columns by name instead of by position, use the property:
hive.parquet.use-column-names=true
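In Presto this property goes into the Hive connector's catalog configuration rather than into a query; a minimal sketch, assuming the usual catalog layout (the exact path depends on your deployment):
# etc/catalog/hive.properties (location is deployment-specific)
hive.parquet.use-column-names=true
Alternatively, since the error says the file stores tunixepoch as DOUBLE, declaring that column as double in the Presto table (and casting it to varchar in queries where needed) would also remove the mismatch.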

Related

Retrieving JSON raw file data from Hive table

I have a JSON file. I want to move only selected fields to a Hive table, so below is the statement I used to create a new table to import the data from the JSON file into a Hive table. Creating it doesn't give any error, but when I use select * from JsonFile1 or select count(*) from JsonFile1 I get the error: Failed with exception java.io.IOException: java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
I have browsed the internet and have been stuck on this for a few days; I can't find a solution. I checked in HDFS and I see there is a table created and the complete file imported as-is (not just the fields I selected, but all of them). I just provided sample data; the actual data contains 50+ field names, and creating all the column names is cumbersome. Is that what we need to do? Thank you in advance.
CREATE EXTERNAL TABLE JsonFile1(user STRUCT<id:BIGINT,description:STRING, followers_count:INT>)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION 'link/data';
I have data as below
{filter_level":"low",geo":null,"user":{"id":859264394,"description":"I don’t want it. Building #techteam, #LetsTalk!!! def#abc.com",
"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name"krogmi",
"screen_name":"jkrogmi","id_str":"859264394",}}06:20:16 +0000 2012","default_profile_image":false,"followers_count":88,
"profile_sidebar_fill_color":"DDFFCC","screen_name":"abc_abc"}}
Answering my own question:
I deleted the data in HDFS that the LOCATION '...' was pointing to, copied the data again from local to HDFS, recreated the table, and it worked.
I am assuming that the data was the problem.
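For what it is worth, the ClassCastException (java.lang.Long cannot be cast to java.lang.Integer) also matches the followers_count field being declared as INT while the data holds a long value, so declaring the numeric fields as BIGINT is a cheap thing to try. A sketch, reusing the SerDe and location from the question:
CREATE EXTERNAL TABLE JsonFile1(user STRUCT<id:BIGINT, description:STRING, followers_count:BIGINT>)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION 'link/data';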

How to load data to Hive table and make it also accessible in Impala

I have a table in Hive:
CREATE EXTERNAL TABLE sr2015(
creation_date STRING,
status STRING,
first_3_chars_of_postal_code STRING,
intersection_street_1 STRING,
intersection_street_2 STRING,
ward STRING,
service_request_type STRING,
division STRING,
section STRING )
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES (
'colelction.delim'='\u0002',
'field.delim'=',',
'mapkey.delim'='\u0003',
'serialization.format'=',', 'skip.header.line.count'='1',
'quoteChar'= "\"")
The table is loaded this way:
LOAD DATA INPATH "hdfs:///user/rxie/SR2015.csv" INTO TABLE sr2015;
Why is the table only accessible in Hive? When I attempt to access it in the HUE/Impala editor I get the following error:
AnalysisException: Could not resolve table reference: 'sr2015'
which seems to say there is no such table, but the table does show up in the left panel.
In impala-shell, the error is different:
ERROR: AnalysisException: Failed to load metadata for table: 'sr2015'
CAUSED BY: TableLoadingException: Failed to load metadata for table:
sr2015 CAUSED BY: InvalidStorageDescriptorException: Impala does not
support tables of this type. REASON: SerDe library
'org.apache.hadoop.hive.serde2.OpenCSVSerde' is not supported.
I have always thought a Hive table and an Impala table are essentially the same, and that the difference is that Impala is a more efficient query engine.
Can anyone help sort it out? Thank you very much.
Assuming that sr2015 is located in a DB called db, in order to make the table visible in Impala, you need to issue either
invalidate metadata db;
or
invalidate metadata db.sr2015;
in the Impala shell.
However, in your case the reason is probably the version of Impala you're using, since it doesn't support this table format at all.
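If upgrading Impala is not an option, another workaround for the unsupported SerDe is to copy the data from Hive into a format Impala can read and then refresh the metadata. A minimal sketch, where sr2015_parquet is just a name made up for illustration:
-- in Hive: materialize the CSV-backed table as Parquet
CREATE TABLE sr2015_parquet STORED AS PARQUET AS SELECT * FROM sr2015;
-- in impala-shell: pick up the new table and query it
INVALIDATE METADATA sr2015_parquet;
SELECT COUNT(*) FROM sr2015_parquet;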

Migrating data from Hive PARQUET table to BigQuery: Hive STRING data type is getting converted to the BYTES datatype in BQ

I am trying to migrate data from Hive to BigQuery. The data in the Hive table is stored in the PARQUET file format. The data type of one column is STRING. I am uploading the files behind the Hive table to Google Cloud Storage and from them creating a BigQuery internal table with the GUI. The datatype of the column in the imported table gets converted to BYTES.
But when I imported CHAR or VARCHAR datatypes, the resulting datatype was STRING only.
Could someone please help me understand why this is happening?
This does not fully answer the original question, as I do not know exactly what happened, but I have had experience with similar odd behavior.
I was facing a similar issue when trying to move a table between Cloudera and BigQuery.
First I created the table as external in Impala:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with the STRING datatype.
Then I transferred it to GCS and imported it into BigQuery from the console GUI; there are not many options, just select the Parquet format and point to GCS.
To my surprise, the columns were now of type BYTES; the column names were preserved fine, but the content was scrambled.
Trying different codecs, and pre-creating the table and inserting into it, still in Impala, led to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating external table in Hive like:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
Then I repeated the same dance of copying from S3 to GCS and importing in BQ, this time without any issue. The columns are now recognized in BQ as STRING and the data is as it should be.
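In case it helps anyone who wants to stay on the Impala route: my assumption (not verified on this dataset) is that Impala does not write the UTF-8 annotation on Parquet STRING columns by default, so BigQuery falls back to BYTES. If your Impala version has the PARQUET_ANNOTATE_STRINGS_UTF8 query option, setting it before the CTAS might make the Impala-written files import as STRING as well:
-- in impala-shell; the query option may not exist in older Impala versions
SET PARQUET_ANNOTATE_STRINGS_UTF8=true;
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table;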

Presto failed: com.facebook.presto.spi.type.VarcharType

I created a table with three columns: id, name, and position, then stored the data in S3 in ORC format using Spark.
When I query select * from person, it returns everything.
But when I query it from Presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType
I have found the answer to the problem: when I stored the data in S3, the file contained one more column than was defined in the Hive table metastore.
So when Presto tried to query the data, it found a varchar where it expected an integer.
This can also happen if one record has a type different from what is defined in the metastore.
I had to delete my data and import it again without that extra, unneeded column.

Hive create table for json data

I am trying to create a Hive table that can read JSON data, but when I execute the create statement it throws an error.
Create statement:
CREATE TABLE employee_exp_json
( id INT,
fname STRING,
lname STRING,
profession STRING,
experience INT,
exp_service STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serede2.Jsonserede'
STORED AS TEXTFILE;
Error:
FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask. Cannot validate serde:
org.apache.hadoop.hive.contrib.serede2.Jsonserede
I have also added the jar hive-json-serde.jar, but I'm still facing the same issue. I am creating this table on Cloudera, and the Hive version is 1.1.0.
The correct class name is
org.apache.hive.hcatalog.data.JsonSerDe
Refer: Hive SerDes
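A sketch of the corrected statement with that class; the hive-hcatalog-core JAR normally needs to be on the classpath first, and the path below is only illustrative:
ADD JAR /path/to/hive-hcatalog-core.jar;
CREATE TABLE employee_exp_json
( id INT,
fname STRING,
lname STRING,
profession STRING,
experience INT,
exp_service STRING
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
STORED AS TEXTFILE;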
As for the other JAR you added, check its documentation; it uses yet another class:
org.openx.data.jsonserde.JsonSerDe
Try adding json-serde-with-dependencies.jar.
You can download it from Download Hive Serde.
Also try the class
'org.openx.data.jsonserde.JsonSerDe'
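With that SerDe the statement looks the same apart from the class and the JAR (again, the path is only illustrative):
ADD JAR /path/to/json-serde-with-dependencies.jar;
CREATE TABLE employee_exp_json
( id INT,
fname STRING,
lname STRING,
profession STRING,
experience INT,
exp_service STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;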