How to add a column in the middle of an ORC partitioned Hive table and still be able to query old partitioned files with the new structure - hive

Currently I have a partitioned ORC "managed" (wrongly created as internal first) Hive table in prod with at least 100 days' worth of data, partitioned by year, month, day (~16 GB of data).
This table has roughly 160 columns. Now my requirement is to add a column in the middle of this table and still be able to query the older data (partitioned files). It is fine if the newly added column shows null for the old data.
What I have done so far:
1) First, convert the table to external using the statement below, to preserve the data files before dropping:
alter table <table_name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
2) Drop and recreate the table with the new column in the middle, and then alter the table to add the partitions for the existing files.
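In simplified form (placeholder names and paths, and only a handful of the ~160 columns; not the exact DDL I ran), step 2 looked roughly like this:
DROP TABLE my_table;
CREATE EXTERNAL TABLE my_table (
  col_a string,
  new_col int,   -- the new column, added in the middle
  col_b string,
  col_c bigint
)
PARTITIONED BY (year int, month int, day int)
STORED AS ORC
LOCATION '/apps/hive/warehouse/my_table';
ALTER TABLE my_table ADD PARTITION (year=2018, month=1, day=1);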
However, I am unable to read the table after recreation. I get this error message:
[Simba][HiveJDBCDriver](500312) Error in fetching data rows: org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: ORC does not support type conversion from file type array<string> (87) to reader type int (87):33:32;
Is there any other way to accomplish this?

No need to drop and recreate the table. Simply use the following statement.
ALTER TABLE default.test_table ADD columns (column1 string,column2 string) CASCADE;
ALTER TABLE ... ADD COLUMNS (or CHANGE COLUMN) with the CASCADE clause changes the columns in the table's metadata and cascades the same change to all the partition metadata.
PS - This will add new columns to the end of the existing columns but before the partition columns. Unfortunately, ORC does not support adding columns in the middle as of now.
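For example (a minimal sketch, assuming the year/month/day partitioning from the question and the two columns added above), old partitions simply return NULL for the new columns:
SELECT column1, column2
FROM default.test_table
WHERE year = 2018 AND month = 1 AND day = 1;
-- rows in partitions written before the ALTER come back with NULL in column1 and column2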
Hope that helps!

Related

How to drop columns from a partitioned table in BigQuery

We cannot use a CREATE OR REPLACE TABLE statement for partitioned tables in BigQuery. I can export the table to GCS, but BigQuery then generates multiple JSON files that cannot be imported into a table at once. Is there a safe way to drop a column from a partitioned table? I use BigQuery's web interface.
Renaming a column is not supported by the Cloud Console, the classic BigQuery web UI, the bq command-line tool, or the API. If you attempt to update a table schema using a renamed column, the following error is returned: BigQuery error in update operation: Provided Schema does not match Table project_id:dataset.table.
There are two ways to manually rename a column:
Using a SQL query: choose this option if you are more concerned about simplicity and ease of use, and you are less concerned about costs.
Recreating the table: choose this option if you are more concerned about costs, and you are less concerned about simplicity and ease of use.
If you want to drop a column you can either:
Use a SELECT * EXCEPT query that excludes the column (or columns) you want to remove, and use the query result to overwrite the table or to create a new destination table (as sketched after this list)
You can also remove a column by exporting your table data to Cloud Storage, deleting the data corresponding to the column (or columns) you want to remove, and then loading the data into a new table with a schema definition that does not include the removed column(s). You can also use the load job to overwrite the existing table
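A minimal sketch of the first option, assuming a hypothetical table project.dataset.table and a column to drop named column3:
SELECT * EXCEPT (column3)
FROM `project.dataset.table`
Run this with a destination table set either to a new table or to the existing one with write preference Overwrite table.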
There is a guide published for Manually Changing Table Schemas.
Edit:
In order to change a partitioned table into a non-partitioned table, you can use the Console to query your data and either overwrite your current table or copy the result to a new one. As an example, I have a table in BigQuery partitioned by _PARTITIONTIME. I used the following query to create a non-partitioned table:
SELECT *, _PARTITIONTIME as pt FROM `project.dataset.table`
With the above query, you select the data from all of the table's partitions and create an extra column showing which partition each row came from. Then, before executing it, there are two options: save the result in a new non-partitioned table, or overwrite the current table.
Creating a new table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose your project and dataset and write your new table's name > under Destination table write preference check Write if empty.
Overwriting the current table: go to More (under the query editor) > Query Settings > check the box "Set a destination table for query results" > choose the same project and dataset as your current table > write the same table name as the one you want to overwrite > under Destination table write preference check Overwrite table.
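If DDL statements are enabled in your project, a roughly equivalent one-statement version of the "new table" option (hypothetical table names) would be:
CREATE TABLE `project.dataset.table_nonpartitioned` AS
SELECT *, _PARTITIONTIME AS pt
FROM `project.dataset.table`;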

How to update a Hive table's data after copying ORC files into that table's folder with HDFS

After inserting ORC files into a table's folder with an HDFS copy, how do I update that Hive table so those files are visible when querying with Hive?
Best Regards.
If the table is not partitioned, then once the files are in HDFS in the folder specified in the LOCATION clause, the data should be available for querying.
If the table is partitioned, then you first need to run an ADD PARTITION statement.
As mentioned in the answer above by belostoky, if the table is not partitioned, then you can directly query your table and see the updated data.
But if your table is partitioned, you first need to add the partitions to the Hive table. You can use an ALTER TABLE statement to add partitions, as shown below:
ALTER TABLE table1
ADD PARTITION (dt='<date>')
location '<hdfs file path>'
Alternatively, if the partition directories already exist in HDFS but the Hive metastore is not yet aware of them, you can run
msck repair table table1
to add the missing partitions to the metastore.
Once done, you can query your data.

loading data to hive dynamic partitioned tables

I have created a Hive table with dynamic partitioning on a column. Is there a way to directly load the data from files using a "LOAD DATA" statement? Or do we have to depend on creating a non-partitioned intermediate table, loading the file data into it, and then inserting the data from this intermediate table into the partitioned table, as mentioned in Hive loading in partitioned table?
No, the LOAD DATA command ONLY copies the files to the destination directory. It doesn't read the records of the input file, so it CANNOT do partitioning based on record values.
If your input data is already split into multiple files based on partitions, you can copy the files directly to the table location in HDFS, under partition directories that you create manually (or just point to their current location in the case of an EXTERNAL table), and use the following ALTER command to ADD the partition. This way you can skip the LOAD DATA statement altogether.
ALTER TABLE <table-name>
ADD PARTITION (<...>)
There is no other way: if we need to insert directly, we'll need to specify the partitions manually.
For dynamic partitioning, we need a staging table and then insert from there, roughly as sketched below.
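A minimal sketch of that staging-table approach, assuming a hypothetical target table partitioned_table with columns (id, name) partitioned by dt, and a comma-delimited input file at a made-up HDFS path:
-- staging table matching the raw file layout
CREATE TABLE staging_table (id INT, name STRING, dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- LOAD DATA only moves the file; no partitioning happens here
LOAD DATA INPATH '/tmp/input/data.csv' INTO TABLE staging_table;

-- enable dynamic partitioning for the insert
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- the dynamic partition column (dt) must come last in the SELECT list
INSERT OVERWRITE TABLE partitioned_table PARTITION (dt)
SELECT id, name, dt FROM staging_table;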

hive doesn't change parquet schema

I have a problem with ALTER TABLE: it changes the table schema but not the Parquet schema.
For example, I have a PARQUET table with these columns:
column1 (string), column2 (string), column3 (string), column4 (string), column5 (bigint)
Now, I try to change the table's schema with:
ALTER TABLE name_table DROP COLUMN column3;
With DESCRIBE TABLE I can see that column3 is not there anymore.
Now I try to execute select * from the table, but I receive an error like this:
"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"
The values of the deleted column are still present in the Parquet file, which has 5 columns instead of 4 (as per the table schema).
Is this a bug? How can I change the Parquet file's schema using Hive?
This is not a bug. When you drop the columns, that just updates the definition in the Hive Metastore, which only holds information about the table. The underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files, the files have no idea that the table metadata has changed.
Hence you see this issue.
The solution is described here. If you want to add a column (or columns) to a Parquet table and be compatible with both Impala and Hive, you need to add the column(s) at the end, for instance as sketched below.
If you alter the table and change column names or drop a column, that table will no longer be compatible with Impala.
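For example (a sketch reusing the table name from the question, with a hypothetical new column6), adding at the end keeps the existing files readable, and old files simply return NULL for the new column:
ALTER TABLE name_table ADD COLUMNS (column6 STRING);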
I had the same error after adding a column to a Hive table.
The solution is to set the query option below in each session:
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using a Cloudera distribution, set it permanently in Cloudera Manager => Impala configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve), with the config value PARQUET_FALLBACK_SCHEMA_RESOLUTION=name.

SQL copy column AND change data type

I have a table which is already populated.
I need to change the data type of one column from LONG to CLOB.
However, this database is hosted by a third party and the tablespace is limited.
I know the command:
ALTER TABLE myTable MODIFY my_data CLOB
However, I then receive this error after a long wait:
ORA-01652: unable to extend temp segment by 128 in tablespace
Increasing the tablespace is not an option.
Are there any workarounds?
Could I create a new column with the data type CLOB, then copy and convert the data from my_data (LONG) without draining the tablespace? Could I turn undo off to help?
Many thanks
I would say the best option is to create a new column with the new data type, update it based on the old column, and then drop the old column, but since you are having space issues that may not be an option.
Or you could try to do this in a series of batches. For example, move 10,000 rows of data to the new column and then set the old value to null on those 10,000 rows to free some space; a rough sketch follows.
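A rough PL/SQL sketch of one such batch, assuming a hypothetical new column my_data_new CLOB has already been added and that every LONG value fits within PL/SQL's 32k limit (longer values would need a piecewise approach such as DBMS_SQL):
DECLARE
  v_buf VARCHAR2(32760);
BEGIN
  -- pick up to 10000 rows that have not been migrated yet
  FOR r IN (SELECT ROWID AS rid
              FROM myTable
             WHERE my_data_new IS NULL
               AND ROWNUM <= 10000) LOOP
    -- fetch the LONG value into a buffer (only works while it fits)
    SELECT my_data INTO v_buf FROM myTable WHERE ROWID = r.rid;
    -- copy it into the CLOB column and null out the old value
    UPDATE myTable
       SET my_data_new = v_buf,
           my_data     = NULL
     WHERE ROWID = r.rid;
  END LOOP;
  COMMIT;
END;
/
Note that rows whose my_data was already NULL will be re-selected on every run; a separate "migrated" flag column would avoid that.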