incompatible Parquet schema for column "ex: x is of type String" Column type: STRING, Parquet schema - hive

I have an existing external table, call it YYYYYY, that contains a number of columns and is loaded daily, partitioned on the extract_date column.
The business asked us to add a few more columns to the existing table. To implement this, we did the following:
DROP the existing partitions from Hive
alter table xxxx.yyyyyy add columns (
`c10` string COMMENT '',
`b` string COMMENT '',
`c11` string COMMENT '',
`c12` string COMMENT '',
`c13` string COMMENT '',
`c14` string COMMENT '',
`c15` string COMMENT '') CASCADE;
alter table xxxx.yyyyyyy change `c8` `c8` string COMMENT '' after `c7` CASCADE;
After the above two steps, I went to Hive and ran MSCK REPAIR TABLE xxxx.yyyyyy;
The partitions were added (there are partitions going back to 2018) along with my new fields.
Before the changes I was able to query the data from both Impala and Hive, but after executing the ALTER commands I get the error below.
> select * from xxxx.yyyyyyy where extract_date like '2019%';
Query: select * from XXXXX.YYYYYYY where extract_date like '2019%'
Query submitted at: 2020-05-09 11:57:10 (Coordinator: ...)
ERROR: incompatible Parquet schema for column 'xxxx.yyyyyyy.c9'. Column type: STRING, Parquet schema:
optional fixed_len_byte_array a_auth [i:12 d:1 r:0]
In Hive, however, I am able to browse the data with no issues, so the problem is only in Impala.
Troubleshooting steps:
Created a new table without the additional columns, pointed its external location at a new path, and copied the previously created partition data to that path.
MSCK REPAIR TABLE <table name>;
A SELECT query works in both Impala and Hive.
Added the additional fields to the newly created table with the ALTER commands, then did the following:
MSCK REPAIR TABLE <table name>;
In Impala:
REFRESH <table name>;
INVALIDATE METADATA <table name>;
This time the SELECT query worked in Hive, but Impala returned the error mentioned above.
Can someone explain why this is happening and how to fix it?
Impala Shell v2.12.0-cdh5.16.2
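One possible lead, offered here as an assumption rather than something confirmed in the thread: Impala resolves Parquet columns by position by default, so reordering columns with CHANGE ... AFTER can make the table schema disagree with the already-written files. A minimal check, assuming Impala 2.6+ where the PARQUET_FALLBACK_SCHEMA_RESOLUTION query option is available:
-- In impala-shell: switch this session to name-based Parquet column
-- resolution instead of the default position-based resolution.
SET PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
select * from xxxx.yyyyyyy where extract_date like '2019%';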

Related

Is conversion of int to double of a column valid in presto?

I am trying to change the data type of a column from int to double by using the alter command:
ALTER TABLE schema_name.table_name CHANGE COLUMN col1 col1 double CASCADE;
Now, if I run a select query over the table on presto:
select * from schema_name.table_name where partition_column = '2022-12-01'
I get the error:
schema_name.table_name is declared as type double, but the Parquet file (hdfs://ns-platinum-prod-phx/secure/user/hive/warehouse/db_name.db/table_name/partition_column=2022-12-01/000002_0) declares the column as type INT32
However, if I run the query on Hive, it returns the output.
I tried digging into this by creating a copy of the source table and deleting the partition from HDFS, but I ran into the same problem again. Is there any other way to resolve this, given that the table contains a huge amount of data?
You cannot change the data type of the Hive column this way, because the Parquet files already written to HDFS for the older partitions won't get updated.
The only fix is to create a new table and load the data into it from the old table.
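A minimal sketch of that recreate-and-reload approach in Hive, assuming the single data column and string partition column implied by the question (all names are placeholders):
-- 1. New table with the desired column type.
CREATE TABLE schema_name.table_name_new (
  col1 double
)
PARTITIONED BY (partition_column string)
STORED AS PARQUET;
-- 2. Rewrite the data; the INT values are cast to DOUBLE as they are copied.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE schema_name.table_name_new PARTITION (partition_column)
SELECT CAST(col1 AS double), partition_column FROM schema_name.table_name;
-- 3. Swap the tables once the copy is verified.
ALTER TABLE schema_name.table_name RENAME TO schema_name.table_name_old;
ALTER TABLE schema_name.table_name_new RENAME TO schema_name.table_name;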

Can we add column to an existing table in AWS Athena using SQL query?

I have a table in AWS Athena which contains 2 records. Is there an SQL query that can be used to add a new column to the table?
You can find more information about adding columns to a table in the Athena documentation.
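For example, Athena supports adding columns with an ALTER TABLE statement; a minimal illustration with a placeholder table and column name:
ALTER TABLE your_table ADD COLUMNS (new_col string);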
Or you can use CTAS
For example, you have a table with
CREATE EXTERNAL TABLE sample_test(
id string)
LOCATION
's3://bucket/path'
and you can create another table from sample_test with the query
CREATE TABLE new_test
AS
SELECT *, 'new' AS new_col FROM sample_test
You can use any valid SELECT query after AS.
This is mainly for future readers like me who were struggling to get this working for a Hive table with AVRO data without creating a new table, i.e. by updating the schema of the existing table. ADD COLUMNS works for CSV, but not for Hive + AVRO. For Hive + AVRO, to append columns at the end (before the partition columns), the solution is available at this link. However, there are a couple of things to note: you need to pass the full schema to the literal attribute, not just the changes; and (not sure why, but) we had to alter the Hive table for all three things in this order: 1. add the columns using ADD COLUMNS, 2. set TBLPROPERTIES, and 3. set SERDEPROPERTIES. Hopefully it helps someone.
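A hedged sketch of that three-step sequence for a Hive + Avro table (the table name, the new column, and the Avro schema JSON are illustrative; in practice 'avro.schema.literal' must contain the complete updated schema with every existing column plus the new one):
-- 1. Append the new column before any partition columns.
ALTER TABLE my_avro_table ADD COLUMNS (new_col string COMMENT '');
-- 2. Set the full updated Avro schema on the table properties.
ALTER TABLE my_avro_table SET TBLPROPERTIES (
  'avro.schema.literal'='{"type":"record","name":"my_avro_table","fields":[{"name":"id","type":"string"},{"name":"new_col","type":["null","string"],"default":null}]}'
);
-- 3. Repeat the same schema on the SerDe properties.
ALTER TABLE my_avro_table SET SERDEPROPERTIES (
  'avro.schema.literal'='{"type":"record","name":"my_avro_table","fields":[{"name":"id","type":"string"},{"name":"new_col","type":["null","string"],"default":null}]}'
);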

Creation of a partitioned external table with hive: no data available

I have the following file on HDFS:
I create the structure of the external table in Hive:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06') LOCATION '/flumania/google_analytics';
After that, the table structure is created in Hive, but I cannot see any data.
Since it's an external table, shouldn't the data be picked up automatically?
Your file should be in this sequence:
int,string
Here your file contents are in the sequence below:
string,int
Change your file to the following:
86,"2016-08-20"
78,"2016-08-21"
It should work.
Also, it is not recommended to use keywords as column names (date).
I think the problem was with the alter table command. The code below solved my problem:
CREATE EXTERNAL TABLE google_analytics(
`session` INT)
PARTITIONED BY (date_string string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/flumania/google_analytics/';
ALTER TABLE google_analytics ADD PARTITION (date_string = '2016-09-06');
After these two steps, if you have a date_string=2016-09-06 subfolder containing a csv file that matches the structure of the table, the data will be loaded automatically and you can use SELECT queries to see it.
Solved!
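For illustration, the expected directory layout would look something like this (the csv file name is hypothetical):
/flumania/google_analytics/date_string=2016-09-06/part-00000.csv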

Add partitions on existing hive table

I'm processing a big Hive table (more than 500 billion records).
The processing is too slow and I would like to make it faster.
I think that by adding partitions, the process could be more efficient.
Can anybody tell me how I can do that?
Note that my table already exists.
My table :
create table T(
nom string,
prenom string,
...
date string)
I want to partition it on the date field.
Thx
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE partitioned_table_name PARTITION (`date`)
SELECT nom, prenom, ..., `date` FROM table_name;
Note:
In the INSERT statement for a partitioned table, make sure the partition column comes last in the SELECT clause.
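For completeness, a minimal sketch of the partitioned target table that the INSERT above assumes (the name, the trimmed column list, and the storage defaults are assumptions based on the question's simplified schema):
-- New table partitioned on the date field; `date` is backtick-quoted
-- because it is a reserved word in newer Hive versions.
CREATE TABLE partitioned_table_name (
  nom string,
  prenom string
)
PARTITIONED BY (`date` string);
-- Dynamic-partition load from the old table, partition column last.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE partitioned_table_name PARTITION (`date`)
SELECT nom, prenom, `date` FROM table_name;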
You have to restructure the table. Here are the steps (a sketch follows the list):
Make sure no other process is writing to the table.
Create a new external table that uses partitioning.
Insert into the new table by selecting from the old table.
Drop the new table (it is external, so only the table definition is dropped and the data stays).
Drop the old table.
Create a table with the original name, pointing to the location from step 2.
You can run the repair command to fix all the partition metadata.
Alternative to steps 4, 5, 6 and 7:
Create a table with the original name by running SHOW CREATE TABLE on the new table and replacing the table name with the original one.
Run the LOAD DATA INPATH command to move the files under the partitions to the new partitions of the new table.
Drop the external table you created.
Both approaches achieve the restructuring with a single insert/MapReduce job.
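A hedged sketch of the first approach (table names, the HDFS location, and the trimmed column list are illustrative assumptions based on the question's schema):
-- Step 2: new external, partitioned table at a known location.
CREATE EXTERNAL TABLE T_part (
  nom string,
  prenom string
)
PARTITIONED BY (`date` string)
LOCATION '/user/hive/warehouse/T_part';
-- Step 3: one insert/MapReduce job repartitions the data.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
INSERT OVERWRITE TABLE T_part PARTITION (`date`)
SELECT nom, prenom, `date` FROM T;
-- Steps 4-6: drop the external shell (files stay) and the old table,
-- then recreate the original name on top of the repartitioned data.
DROP TABLE T_part;
DROP TABLE T;
CREATE EXTERNAL TABLE T (
  nom string,
  prenom string
)
PARTITIONED BY (`date` string)
LOCATION '/user/hive/warehouse/T_part';
-- Step 7: rebuild the partition metadata.
MSCK REPAIR TABLE T;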

automatically partition Hive tables based on S3 directory names

I have data stored in S3 like:
/bucket/date=20140701/file1
/bucket/date=20140701/file2
...
/bucket/date=20140701/fileN
/bucket/date=20140702/file1
/bucket/date=20140702/file2
...
/bucket/date=20140702/fileN
...
My understanding is that if I pull in that data via Hive, it will automatically interpret date as a partition. My table creation looks like:
CREATE EXTERNAL TABLE search_input(
col1 STRING,
col2 STRING,
...
)
PARTITIONED BY(date STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/';
However, Hive doesn't recognize any data; any queries I run return 0 results. If I instead just grab one of the dates via:
CREATE EXTERNAL TABLE search_input_20140701(
col1 STRING,
col2 STRING,
...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
LOCATION 's3n://bucket/date=20140701';
I can query data just fine.
Why doesn't Hive recognize the nested directories with the "date=date_str" partition?
Is there a better way to have Hive run a query over multiple sub-directories and slice it based on a datetime string?
In order to get this to work I had to do 2 things:
Enable recursive directory support:
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;
For some reason it would still not recognize my partitions so I had to recover them via:
ALTER TABLE search_input RECOVER PARTITIONS;
You can use:
SHOW PARTITIONS table;
to check and see that they've been recovered.
I had faced the same issue and realized that Hive did not have the partition metadata. So we need to add that metadata using an ALTER TABLE ... ADD PARTITION query, which becomes tedious if you have a few hundred partitions and have to build the same query with different values.
ALTER TABLE <table name> ADD PARTITION(<partitioned column name>=<partition value>);
Once you have run the above query for all available partitions, you should see the results in Hive queries.
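If the partitions follow the date=... layout from the question, several of them can be batched into a single statement; a small illustration using the dates from the question's listing (`date` is backtick-quoted because it is a reserved word in newer Hive versions):
ALTER TABLE search_input ADD IF NOT EXISTS
  PARTITION (`date`='20140701')
  PARTITION (`date`='20140702');
-- On Hive versions that support it, MSCK REPAIR TABLE search_input;
-- discovers all partition directories in one pass instead.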