Is there a way to alter a column in a Hive table that is stored as ORC? - hive

There is already a question about Hive in general (Is there a way to alter column type in hive table?). The answer to that question states that it is possible to change the schema with the ALTER TABLE ... CHANGE command.
However, is this also possible if the file is stored as ORC?

You can load the ORC file into PySpark:
Load the data into a dataframe:
df = spark.read.format("orc").load("<path-of-file-in-hdfs>")
Create a temporary view over the dataframe:
df.createOrReplaceTempView('Table')
Create a new dataframe with the manipulated column; the cast result is aliased first, then the original string column is dropped and the new column renamed back to third_column:
df3 = spark.sql("select *, cast(third_column as float) as third_column_casted from Table") \
    .drop("third_column") \
    .withColumnRenamed("third_column_casted", "third_column")
Save the dataframe to HDFS:
df3.write.format("orc").save("<hdfs-path-where-file-needs-to-be-saved>")
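If Hive then needs to see the rewritten files, one option (not part of the answer above) is to define a new ORC table over the output location. A minimal HiveQL sketch, reusing the placeholder path from the previous step; test_orc_new, first_column and second_column are illustrative names, and only third_column (now FLOAT) comes from the example:
CREATE EXTERNAL TABLE test_orc_new (
  first_column STRING,
  second_column STRING,
  third_column FLOAT
)
STORED AS ORC
LOCATION '<hdfs-path-where-file-needs-to-be-saved>';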

I ran tests on an ORC table. It is possible to convert a string column to a float column.
ALTER TABLE test_orc CHANGE third_column third_column float;
would convert a column called third_column that is currently typed as string into a float column. It is also possible to change the name of a column, as shown below.
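For example, a hedged sketch of such a rename on the same test table, where new_column_name is just an illustrative name:
ALTER TABLE test_orc CHANGE third_column new_column_name float;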
Side note: I was curious whether other alterations on ORC tables might create problems. I ran into an exception when I tried to reorder columns.
ALTER TABLE test_orc CHANGE third_column third_column float AFTER first_column;
The exception is: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. Reordering columns is not supported for table default.test_orc. SerDe may be incompatible.

Related

Is conversion of a column from int to double valid in Presto?

I am trying to change the data type of a column from int to double by using the alter command:
ALTER TABLE schema_name.table_name CHANGE COLUMN col1 col1 double CASCADE;
Now, if I run a select query over the table in Presto:
select * from schema_name.table_name where partition_column = '2022-12-01';
I get the error:
schema_name.table_name is declared as type double, but the Parquet file (hdfs://ns-platinum-prod-phx/secure/user/hive/warehouse/db_name.db/table_name/partition_column=2022-12-01/000002_0) declares the column as type INT32
However, if I run the query in Hive, it returns the output fine.
I tried digging into this by creating a copy of the source table and deleting the partition from HDFS, but I ran into the same problem again. Is there any other way to resolve this, as this table contains a huge amount of data?
You cannot change the data type of the Hive column this way, because the Parquet files already created in HDFS for the older partitions won't get updated.
The only fix is to create a new table and load the data into it from the old table, as sketched below.
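A hedged sketch of that rebuild in Hive, assuming the table is partitioned by partition_column (stored as a string) and has one other hypothetical column other_col besides col1; adjust the column list to the real schema:
-- new table with the desired type for col1
CREATE TABLE schema_name.table_name_new (
  col1 DOUBLE,
  other_col STRING
)
PARTITIONED BY (partition_column STRING)
STORED AS PARQUET;
-- copy the data, letting Hive create the partitions dynamically
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE schema_name.table_name_new PARTITION (partition_column)
SELECT CAST(col1 AS DOUBLE) AS col1, other_col, partition_column
FROM schema_name.table_name;
Once the copy is verified, the old table can be dropped and the new one renamed in its place.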

Alter column datatype in Hive table with cascade not flowing to Parquet partitions

I'm trying to alter a column's data type from int to bigint in Hive as below:
ALTER TABLE <TABLE_NAME> CHANGE COLUMN <COLUMN_NAME> <COLUMN_NAME> BIGINT CASCADE
The Hive metastore is getting updated successfully, but the Parquet file schema is not, so querying the data in PySpark throws an error:
parquet column cannot be converted in file expected int32 found int64
Some forums suggested recreating the data, but the data size is in TBs and rewriting it takes a huge amount of time, and I need to do this for around 10 tables.
Is there any way to change the column type in the Parquet files as well?

How to change the datatype of a field in a struct column

I have a struct column in BigQuery called meta. Inside that field, I have a field called join_at which is currently in a FLOAT datatype and I'd like to change it to a TIMESTAMP datatype.
I tried running this query:
ALTER TABLE `my-table`
ALTER COLUMN meta.join_at SET DATA TYPE TIMESTAMP
That doesn't work. It throws an error at the "." character. So, apparently I can't just change the struct field like that.
What would be the correct approach in this case?
If you want to alter a field in a struct, you would do something like this:
CREATE OR REPLACE TABLE so_test.alter_struct(s1 STRUCT<a FLOAT64, b STRING>);
ALTER TABLE so_test.alter_struct ALTER COLUMN s1
SET DATA TYPE STRUCT<a TIMESTAMP, b STRING>;
However, a FLOAT is not coercible to a TIMESTAMP, as shown in the conversion chart here:
https://cloud.google.com/bigquery/docs/reference/standard-sql/conversion_rules#comparison_chart
Instead, you can take an approach similar to this one:
How to delete a column in BigQuery that is part of a nested column
and just define the new structure while overwriting the table, as sketched below.
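A hedged sketch of that rewrite, assuming meta contains only the join_at field (any other fields would have to be listed in the new STRUCT as well) and that join_at holds a Unix timestamp in seconds:
CREATE OR REPLACE TABLE `my-table` AS
SELECT
  * REPLACE (
    STRUCT(TIMESTAMP_SECONDS(CAST(meta.join_at AS INT64)) AS join_at) AS meta
  )
FROM `my-table`;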

How to drop a column from a Databricks Delta table?

I have recently started exploring Databricks and ran into a situation where I need to drop a column from a Delta table. When I worked with PostgreSQL, it was as easy as:
ALTER TABLE main.metrics_table
DROP COLUMN metric_1;
I was looking through the Databricks documentation on DELETE, but it only covers deleting rows that match a predicate.
I've also found docs on DROP DATABASE, DROP FUNCTION and DROP TABLE, but absolutely nothing on how to delete a column from a Delta table. What am I missing here? Is there a standard way to drop a column from a Delta table?
There is no drop column option on Databricks tables: https://docs.databricks.com/spark/latest/spark-sql/language-manual/alter-table-or-view.html#delta-schema-constructs
Remember that, unlike in a relational database, there are physical Parquet files in your storage; your "table" is just a schema that has been applied to them.
In the relational world you can update the table metadata to remove a column easily; in the big data world you have to rewrite the underlying files.
Technically, Parquet can handle schema evolution (see Schema evolution in parquet format), but the Databricks implementation of Delta does not. It is probably just too complicated to be worth it.
Therefore the solution in this case is to create a new table and insert only the columns you want to keep from the old table.
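A hedged sketch of that in Databricks SQL, using metric_1 from the question and hypothetical metric_2 and metric_3 as the columns to keep:
-- new table containing only the columns to keep (everything except metric_1)
CREATE TABLE main.metrics_table_new
USING DELTA
AS SELECT metric_2, metric_3
FROM main.metrics_table;
-- once the data is verified, swap the names
ALTER TABLE main.metrics_table RENAME TO main.metrics_table_old;
ALTER TABLE main.metrics_table_new RENAME TO main.metrics_table;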
Use the code below:
df = spark.sql("SELECT * FROM <db_name>.<table_name>")
df1 = df.drop("<column_name>")
spark.sql("DROP TABLE IF EXISTS <db_name>.<table_name>_OLD")
spark.sql("ALTER TABLE <db_name>.<table_name> RENAME TO <db_name>.<table_name>_OLD")
df1.write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("<db_name>.<table_name>")
One way that I figured out to make this work is to first drop the table and then recreate it from the dataframe with the overwriteSchema option set to true. You also need to use mode("overwrite") so that it recreates the physical files using the new schema of the dataframe.
Breakdown of the steps:
Read the table into a dataframe.
Drop the columns that you don't want in your final table.
Drop the actual table from which you read the data.
Save the newly created dataframe (with the columns dropped) under the same table name.
Make sure you use both options when saving the dataframe as a table: .mode("overwrite").option("overwriteSchema", "true")
The steps above recreate the same table with the extra column(s) removed.
Hope it helps someone facing a similar issue.
Databricks Runtime 10.2+ supports dropping columns if you enable Column Mapping mode
ALTER TABLE <table_name> SET TBLPROPERTIES (
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5',
'delta.columnMapping.mode' = 'name'
)
And then drops will work --
ALTER TABLE table_name DROP COLUMN col_name
ALTER TABLE table_name DROP COLUMNS (col_name_1, col_name_2, ...)
You can overwrite the table without the column if the table isn't too large.
df = spark.read.table('table')
df = df.drop('col')
df.write.format('delta')\
.option("overwriteSchema", "true")\
.mode('overwrite')\
.saveAsTable('table')
As of Delta Lake 1.2, you can drop columns; see the latest ALTER TABLE docs.
Here's a fully working example if you're interested in a snippet you can run locally:
# create a Delta table
columns = ["language","speakers"]
data = [("English", "1.5"), ("Mandarin", "1.1"), ("Hindi", "0.6")]
rdd = spark.sparkContext.parallelize(data)
df = rdd.toDF(columns)
df.write.format("delta").saveAsTable("default.my_cool_table")
spark.sql("select * from `my_cool_table`").show()
+--------+--------+
|language|speakers|
+--------+--------+
|Mandarin|     1.1|
| English|     1.5|
|   Hindi|     0.6|
+--------+--------+
Here's how to drop the language column:
spark.sql("""ALTER TABLE `my_cool_table` SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')""")
spark.sql("alter table `my_cool_table` drop column language")
Verify that the language column isn't included in the table anymore:
spark.sql("select * from `my_cool_table`").show()
+--------+
|speakers|
+--------+
|     1.1|
|     1.5|
|     0.6|
+--------+
This works only if the column you want to drop was added after the table was created.
If that is the case, and if it is possible for you to recover the data inserted after altering the table, you may consider using the table history to restore the table to a previous version.
With
DESCRIBE HISTORY <TABLE_NAME>
you can check all the available versions of your table (the 'ADD COLUMN' operation will have created a new table version).
Afterwards, with RESTORE it is possible to bring the table back to any available state.
RESTORE <TABLE_NAME> VERSION AS OF <VERSION_NUMBER>
Here you can find more information about TIME TRAVEL.

Unable to insert data into partitioned table due to precision loss

I have created an external table partitioned on two columns, 'country' and 'state', stored as SEQUENCEFILE.
I am now trying to load the data into the table using the following command in Impala, run via the Hue editor:
load data inpath '/usr/temp/input.txt'
into table partitioned_user
partition (country = 'US', state = 'CA');
I am getting the following error -
AnalysisException: Partition key value may result in loss of precision. Would need to cast ''US'' to 'VARCHAR(64)' for partition column: country
What am I doing wrong? The table that I am inserting into has the columns first_name, last_name, country, and state, all of type VARCHAR(64).
The file input.txt contains data only for the first two columns. Where am I going wrong?
Impala does not automatically convert from a larger type to a smaller one. You must CAST() the partition values to VARCHAR(64) before inserting to avoid this exception in Impala:
partition (country = cast('US' as VARCHAR(64)), state = cast('CA' as VARCHAR(64)))
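Combining the two snippets above, the full statement would look like this (an untested sketch; both pieces come from the question and the answer):
load data inpath '/usr/temp/input.txt'
into table partitioned_user
partition (country = cast('US' as VARCHAR(64)), state = cast('CA' as VARCHAR(64)));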
Or use the STRING data type in the table DDL instead.