Presto failed: com.facebook.presto.spi.type.VarcharType - hive

I created a table with three columns - id, name, position,
then I stored the data into s3 using orc format using spark.
When I query select * from person it returns everything.
But when I query from presto, I get this error:
Query 20180919_151814_00019_33f5d failed: com.facebook.presto.spi.type.VarcharType

I have found the answer for the problem, when I stored the data in s3, the data inside the file was with one more column that was not defined in the hive table metastore.
So when Presto tried to query the data, it found that there are varchar instead of integer.
This also might happen if one record has a a type different than what is defined in the metastore.
I had to delete my data and import it again without that extra unneeded column

Related

How to Set default value of empty data of column in copy activity from csv file using azure data factory v2

I've multiple csv files and multiple tables.
The table name is file name and column name is first row of csv file.
Now I want to add default value of empty string to the sink table.
Consider my scenario,
employee:
id int, name varchar, is_active bit NULL
employee.csv:
id|name|is_active
1|raja|
Now I'm trying to copy the csv data to PostgreSQL table its throwing error.
Expected result is default value if its empty value.
You can use NULLIF in PostgreSQL:
NULLIF(argument_1,argument_2);
The NULLIF function returns a null value if argument_1 equals to argument_2, otherwise it returns argument_1.
This way you can replace NULL value with some other value
If your error is related to Type mismatch then consider typecasting the column first
Thanks!
As per the issue, tried to repro the scenario and here is the following outcome which was successfully copied. You have to use
Source Dataset: employee.csv from Azure Blob Storage
Sink Dataset : Here, I have used the sink as Azure SQL DB for some limitations but as you have used PostgreSQL is almost similar.
Copy Activity Settings:
Under the mapping settings there will be type conversion, where you have to import schema else you can dynamically add
Output:
Alternative to use DataFlow - if you have multiple data fields, you need to use the derived column transformation to generate new columns in your data flow or to modify existing fields.
For more details, refer Derived column transformation in mapping data flow.
You can even refer to this Microsoft Q&A post for more insights: Copy Task failure because of conversion failure

Retrieving JSON raw file data from Hive table

I have a JSON File. I want to move only selected fields to Hive table. So below is the statement I used to create a new table to import the data from JSON file to HIVE Table. While creating it doesn't give any error but when i use select * from JsonFile1 or count(*) from JsonFile1 I get error as Failed with exception java.io.IOException:java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
I have browsed over the internet stuck with this since few days. I can't find a solution. I checked in the HDFS. I see there is a table created and complete file imported as-is(not just the fields I selected but all of it). I just provided the sample data, the actual data contains like 50+ field names. creating all the column names is cumbersome. Is that what we need to do? Thank you in advance.
CREATE EXTERNAL TABLE JsonFile1(user STRUCT<id:BIGINT,description:STRING, followers_count:INT>)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION 'link/data';
I have data as below
{filter_level":"low",geo":null,"user":{"id":859264394,"description":"I don’t want it. Building #techteam, #LetsTalk!!! def#abc.com",
"contributors_enabled":false,"profile_sidebar_border_color":"C0DEED","name"krogmi",
"screen_name":"jkrogmi","id_str":"859264394",}}06:20:16 +0000 2012","default_profile_image":false,"followers_count":88,
"profile_sidebar_fill_color":"DDFFCC","screen_name":"abc_abc"}}
Answering my own question.
I have deleted the data in hdfs which I was pointing in the LOCATION '...', copied data again from local to hdfs and recreated the table again and it worked.
I am assuming that data was the problem.

Presto fails to import PARQUET files from S3

I have a presto table that imports PARQUET files based on partitions from s3 as follows:
create table hive.data.datadump
(
tUnixEpoch varchar,
tDateTime varchar,
temperature varchar,
series varchar,
sno varchar,
date date
)
WITH (
format = 'PARQUET',
partitioned_by = ARRAY['series','sno','date'],
external_location = 's3a://dev/files');
The S3 folder structure where the parquet files are stored looks like:
s3a://dev/files/series=S5/sno=242=/date=2020-1-23
and the partition starts from series.
The original code in pyspark that produces the parquet files has all the schema as String type and I am trying to import that as a string but when I run my create script in Presto, it successfully created the table but fails to import the data.
On Running,
select * from hive.data.datadump;
I get the following error:
[Code: 16777224, SQL State: ] Query failed (#20200123_191741_00077_tpmd5): The column tunixepoch is declared as type string, but the Parquet file declares the column as type DOUBLE[Code: 16777224, SQL State: ] Query failed (#20200123_191741_00077_tpmd5): The column tunixepoch is declared as type string, but the Parquet file declares the column as type DOUBLE
Can you guys help to resolve this issue?
Thank You in advance!
I ran into same issues and I found out that this was caused by one of the records in my source doesnt have a matching datatype for the column it was complaining about. I am sure this is just data. You need to trap the exact record which doesnt have the right type.
This might have been solved, just for info, this could be due to column declaration mismatch between hive and parquet file. To use the column names instead of the order, use the property -
hive.parquet.use-column-names=true

Migrating data from Hive PARQUET table to BigQuery, Hive String data type is getting converted in BQ - BYTES datatype

I am trying to migrate the data from Hive to BigQuery. Data in Hive table is stored in PARQUET file format.Data type of one column is STRING, I am uploading the file behind the Hive table on Google cloud storage and from that creating BigQuery internal table with GUI. The datatype of column in imported table is getting converted to BYTES.
But when I imported CHAR of VARCHAR datatype, resultant datatype was STRING only.
Could someone please help me to explain why this is happening.
That does not answer the original question, as I do not know exactly what happened, but had experience with similar odd behavior.
I was facing similar issue when trying to move the table between Cloudera and BigQuery.
First creating the table as external on Impala like:
CREATE EXTERNAL TABLE test1
STORED AS PARQUET
LOCATION 's3a://table_migration/test1'
AS select * from original_table
original_table has columns with STRING datatype
Then transfer that to GS and importing that in BigQuery from console GUI, not many options, just select the Parquet format and point to GS.
And to my surprise I can see that the columns are now Type BYTES, the names of the columns was preserved fine, but the content was scrambled.
Trying different codecs, pre-creating the table and inserting still in Impala lead to no change.
Finally I tried to do the same in Hive, and that helped.
So I ended up creating external table in Hive like:
CREATE EXTERNAL TABLE test2 (col1 STRING, col2 STRING)
STORED AS PARQUET
LOCATION 's3a://table_migration/test2';
insert into table test2 select * from original_table;
And repeated the same dance with copying from S3 to GS and importing in BQ - this time without any issue. Columns are now recognized in BQ as STRING and data is as it should be.

List Impala tables that need invalidate/refresh

How can I programatically find all Impala tables that need INVALIDATE METADATA statement (because they were created in Hive, but not yet known to Impala) or REFRESH (because column added, datafile added, etc.)?
Invalidate Metadata:
As a workaround, create a shell script to do the below steps.
Using beeline, connect to a particular database and run show tables statement and save output data to a file.
Using impala-shell, connect to the same particular database and run show tables statement and save output data to another file.
Now compare both the file to remove the duplicates and get the unique tables list from the first file which is a list of tables which are only in hive but not in impala.
Note:
a. Instead of a particular database each at a time in 1 and 2 steps, you can loop over all databases and save the output to a file. Inside the loop itself, you can redirect and append the output files to another final output file with data in some format like database.table or database_table to get all tables from all databases into a single file. Finally, follow step 3.
b. The unique tables from the second output file after removing duplicates will be tables that are deleted in hive and invalidate metadata needs to be run in impala to remove them from the impala list.
c. Rename of a table in impala can be recognized by hive but vice-versa is not possible and invalidate metadata should be run for both old and new table names to remove and add respectively in impala. This applies to most operations not just rename of table.
Refresh:
Consider a text format table with 2 columns and 1 row data.
Now suppose, a third column is added to that table in the beeline.
select * from table; ---gives 3 columns in beeline and 2 columns in impala since refresh is not run on impala for this table.
If we run compute stats in impala before running refresh in this case, then that newly added column from the beeline will be removed from the table schema in hive as well.
select * from table; ---gives 2 columns in beeline and 2 columns in impala since compute stats from impala deleted the extra column metadata of table although data resides in hdfs for that column. This might cause parsing issues in impala if the column is added somewhere in the middle or front instead of ending.
So it is advised to run REFRESH table name in impala right after adding a new column or doing any modifications in beeline for an existing table to not lose table schema as explained in the above scenario.
refresh table; ---Right after modification in hive run refresh in impala.
select * from table; ---gives 3 columns in beeline and 3 columns in impala since refresh is run before compute stats in impala.