Alter column datatype in Hive table with cascade not flowing in Parquet partitions - hive

I'm trying to alter the column data type from int to bigint in Hive as below
ALTER TABLE <TABLE_NAME> CHANGE COLUMN <COLUMN_NAME> <COLUMN_NAME> BIGINT CASCADE
The Hive metastore is updated successfully, but the Parquet file schema is not, so querying the data in PySpark throws an error:
parquet column cannot be converted in file expected int32 found int64
Some forums suggest recreating the data, but the data size is in TBs and recreating it takes a huge amount of time, and I need to do this for around 10 tables.
Is there any way to change the column type in the Parquet files as well?
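For reference, the recreation those forums suggest boils down to overwriting each partition so that new Parquet files are written with the new physical type, roughly like the HiveQL sketch below (dt, the partition value and the other_col placeholders are hypothetical stand-ins for the real schema):
-- Sketch only: after the ALTER, Hive already reads the old INT32 data as BIGINT,
-- so overwriting a partition materialises it as INT64 in new Parquet files.
INSERT OVERWRITE TABLE <TABLE_NAME> PARTITION (dt = '2021-01-01')
SELECT
    other_col1,
    CAST(<COLUMN_NAME> AS BIGINT),
    other_col2
FROM <TABLE_NAME>
WHERE dt = '2021-01-01';
This is exactly the rewrite I'm hoping to avoid, given the data volume.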

Related

Is conversion of int to double of a column valid in presto?

I am trying to change the data type of a column from int to double by using the alter command:
ALTER TABLE schema_name.table_name CHANGE COLUMN col1 col1 double CASCADE;
Now, if I run a select query over the table on presto:
select * from schema_name.table_name where partition_column = '2022-12-01'
I get the error:
schema_name.table_name is declared as type double, but the Parquet file (hdfs://ns-platinum-prod-phx/secure/user/hive/warehouse/db_name.db/table_name/partition_column=2022-12-01/000002_0) declares the column as type INT32
However, if I run the query on Hive, it provides me the output.
I tried digging into this by creating a copy of the source table and deleting the partition from HDFS. However, I ran into the same problem again. Is there any other way to resolve this, as this table contains huge data?
You cannot change the data type of a column in the Hive table in place, as the Parquet files already created in HDFS for older partitions won't get updated.
The only fix is to create a new table and load the data into the new table from the older table.
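A rough sketch of that approach, reusing the names from the question (col1 and partition_column come from the question; table_name_new and the omitted remaining columns are placeholders) and assuming the table is stored as Parquet:
CREATE TABLE schema_name.table_name_new (
    col1 double
    -- ... the table's remaining columns, unchanged ...
)
PARTITIONED BY (partition_column string)
STORED AS PARQUET;
-- dynamic partitioning lets one statement copy every partition
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE schema_name.table_name_new PARTITION (partition_column)
SELECT
    CAST(col1 AS double),
    -- ... the remaining columns, in the same order as the new table ...
    partition_column
FROM schema_name.table_name;
Once the copy is verified, you can drop the old table and rename the new one.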

Is defining a delimiter in a hive ORC Table useless?

When you create an ORC table in Hive, you are changing the file type to ORC. This means you can't look at a specific file outside of the ORC table.
Here's an example ORC create table statement:
CREATE TABLE IF NOT EXISTS table_orc_v1
(
col1 int,
col2 int
)
PARTITIONED BY (odate date)
CLUSTERED BY (col1) INTO 10 BUCKETS
STORED AS ORC TBLPROPERTIES('transactional'='true');
If I add the delimiter clause shown below (like you would on a non-ORC table), will it
1) not affect table performance
2) slow down performance as it converts things to a csv file that you can never read
3) give me some benefit that I'm not aware of
4) do something else
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
If you are using any binary format (ORC, Avro, Parquet) to store your data, then ROW FORMAT DELIMITED FIELDS TERMINATED BY is simply ignored. You can include it in your table syntax and it might not give you any error, but it is not used.
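For illustration, the ORC table from the earlier example with the delimiter clause bolted on should parse without error per the above, but the ROW FORMAT part is simply ignored (table_orc_v2 is just a throwaway name):
CREATE TABLE IF NOT EXISTS table_orc_v2
(
col1 int,
col2 int
)
PARTITIONED BY (odate date)
CLUSTERED BY (col1) INTO 10 BUCKETS
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS ORC TBLPROPERTIES('transactional'='true');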

data appears as null on redshift external table while working right on athena

So I'm trying to run the following simple query on redshift spectrum:
select * from company.vehicles where vehicle_id is not null
and it returns 0 rows (all of the rows in the table are null). However, when I run the same query on Athena it works fine and returns results. I tried MSCK REPAIR, but both Athena and Redshift are using the same metastore, so it shouldn't matter.
I also don't see any errors.
The format of the files is orc.
The create table query is:
CREATE EXTERNAL TABLE `vehicles`(
`vehicle_id` bigint,
`parent_id` bigint,
`client_id` bigint,
`assets_group` int,
`drivers_group` int)
PARTITIONED BY (
`dt` string,
`datacenter` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'
LOCATION
's3://company-rt-data/metadata/out/vehicles/'
TBLPROPERTIES (
'CrawlerSchemaDeserializerVersion'='1.0',
'CrawlerSchemaSerializerVersion'='1.0',
'classification'='orc',
'compressionType'='none')
Any idea?
How did you create your external table?
For Spectrum, you have to explicitly set the parameters that control what should be treated as NULL.
Add the parameter 'serialization.null.format'='' in TABLE PROPERTIES so that all columns containing '' will be treated as NULL in your external table in Spectrum:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
row format delimited
fields terminated by ','
stored as textfile
LOCATION [filelocation]
TABLE PROPERTIES('numRows'='100', 'skip.header.line.count'='1','serialization.null.format'='');
Alternatively, you can set up SERDEPROPERTIES while creating the external table, which will automatically recognize NULL values:
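A sketch of that alternative (untested; the column list and location are placeholders as in the example above), moving the null format into WITH SERDEPROPERTIES:
CREATE EXTERNAL TABLE external_schema.your_table_name(
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
'field.delim'=',',
'serialization.null.format'='')
STORED AS TEXTFILE
LOCATION [filelocation];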
Eventually it turned out to be a bug in Redshift. In order to fix it, we needed to run the following command:
ALTER TABLE table_name SET TABLE PROPERTIES ('orc.schema.resolution'='position');
I had a similar problem and found this solution.
In my case I had external tables that were created with Athena, pointing to an S3 bucket that contained heavily nested JSON data. To access them with Redshift I set json_serialization_enable to true before my queries to make the nested JSON columns queryable. This led to some columns being NULL when the JSON exceeded a size limit; see here:
If the serialization overflows the maximum VARCHAR size of 65535, the cell is set to NULL.
To solve this issue I used Amazon Redshift Spectrum instead of serialization: https://docs.aws.amazon.com/redshift/latest/dg/tutorial-query-nested-data.html.
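For completeness, the serialization switch mentioned above is a per-session setting, roughly:
-- run before the queries; nested columns are then returned serialized as JSON text
SET json_serialization_enable TO true;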

Unable to insert data into partitioned table due to precision loss

I have created an external table partitioned on two columns, 'country' and 'state', stored as SEQUENCEFILE.
I am now trying to load the data into the table using the following command in Impala run via Hue editor -
load data inpath '/usr/temp/input.txt'
into table partitioned_user
partition (country = 'US', state = 'CA');
I am getting the following error -
AnalysisException: Partition key value may result in loss of precision. Would need to cast ''US'' to 'VARCHAR(64)' for partition column: country
What am I doing wrong? The table that I am inserting into has the columns first_name, last_name, country, state, all of type VARCHAR(64).
The file input.txt contains the data only for the first two columns. Where am I going wrong?
Impala does not automatically convert from a larger type to a smaller one. You must CAST() the partition values to VARCHAR(64) when inserting to avoid this exception in Impala:
partition (country = cast('US' as VARCHAR(64)), state = cast('CA' as VARCHAR(64)))
Or use STRING datatype in table DDL instead.
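With the CAST approach, the full statement from the question would look like:
load data inpath '/usr/temp/input.txt'
into table partitioned_user
partition (country = cast('US' as VARCHAR(64)), state = cast('CA' as VARCHAR(64)));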

Is there a way to alter column in a hive table that is stored as ORC?

There is already a question on Hive in general (Is there a way to alter column type in hive table?). The answer to that question states that it is possible to change the schema with the ALTER TABLE ... CHANGE command.
However, is this also possible if the file is stored as ORC?
You can load the ORC file into PySpark:
Load the data into a dataframe:
df = spark.read.format("orc").load("<path-of-file-in-hdfs>")
Create a view over the dataframe:
df.createOrReplaceTempView('Table')
Create a new dataframe with the column cast; list the other columns explicitly instead of select * so the result does not end up with two columns named third_column (first_column stands in for them here):
df3 = spark.sql("select first_column, cast(third_column as float) as third_column from Table")
Save the dataframe back to HDFS:
df3.write.format("orc").save("<hdfs-path-where-file-needs-to-be-saved>")
I ran tests on an ORC table. It is possible to convert a string column to a float column.
ALTER TABLE test_orc CHANGE third_column third_column float;
would convert a column called third_column that is marked as a string column to a float column. It is also possible to change the name of a column.
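A rename uses the same CHANGE syntax, for example (old_col and new_col are made-up names):
ALTER TABLE test_orc CHANGE old_col new_col float;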
Sidenote: I was curious if other alterations on ORC might create problems. I ran into an exception when I tried to reorder columns.
ALTER TABLE test_orc CHANGE third_column third_column float AFTER first_column;
The exception is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Reordering columns is not supported for table default.test_orc. SerDe may be incompatible.