Hive doesn't change the Parquet schema

I have a problem where ALTER TABLE changes the table schema but not the Parquet schema.
For example, I have a PARQUET table with these columns:
column1 (string), column2 (string), column3 (string), column4 (string), column5 (bigint)
Now I try to change the table's schema with:
ALTER TABLE name_table DROP COLUMN column3;
With DESCRIBE TABLE I can see that the dropped column is no longer there.
Now when I execute select * from the table, I receive an error like this:
"data.0.parq' has an incompatible type with the table schema for column column4. Expected type: INT64. Actual type: BYTE_ARRAY"
The values of the deleted column are still present in the Parquet file, which has 5 columns rather than the 4 in the table schema.
Is this a bug? How can I change the Parquet file's schema using Hive?

This is not a bug. Dropping the column only updates the definition in the Hive Metastore, which holds just the information about the table; the underlying files on HDFS remain unchanged. Since the Parquet metadata is embedded in the files themselves, the files have no idea that the metadata has changed.
Hence you see this issue.

The solution is described here. If you want to add a column(s) to a Parquet table and stay compatible with both Impala and Hive, you need to add the column(s) at the end.
If you alter the table and change column names or drop a column, the table will no longer be compatible with Impala.
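For example, a minimal sketch (the added column name is illustrative, reusing name_table from the question):
-- append the new column at the end; existing Parquet files stay position-compatible
-- and old rows simply read as NULL for the new column
ALTER TABLE name_table ADD COLUMNS (column6 STRING);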

I had the same error after adding a column to a Hive table.
The solution is to set the query option below in each session:
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;
If you're using the Cloudera distribution, set it permanently in Cloudera Manager => Impala configuration => Impala Daemon Query Options Advanced Configuration Snippet (Safety Valve).
Set the config value as PARQUET_FALLBACK_SCHEMA_RESOLUTION=name
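For instance, in an impala-shell session (a sketch; name_table is the table name from the question above):
set PARQUET_FALLBACK_SCHEMA_RESOLUTION=name;  -- resolve Parquet columns by name instead of position
select * from name_table;                     -- the position-mismatch error no longer occurs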

Related

Getting a Databricks drop schema error for delta table

I have a delta table schema that needs new columns/changed data types (usually I do this on non-delta tables and those work fine).
I have already dropped the existing delta table, and when I try dropping the schema I get a 'v1 session catalog' error.
I am currently using SQL, a 10.4 LTS cluster, Spark 3.2.1, Scala 2.12 (I can't change these compute settings); the driver and workers are Standard E_v4.
What I already did, and it worked as usual:
drop table if exists dbname.tablename;
What I wanted to do next:
drop schema if exists dbname.tablename;
The error I got instead:
Error in SQL statement: AnalysisException: Nested databases are not supported by v1 session catalog: dbname.tablename
When I try recreating the schema in the same location I get the error:
AnalysisException: The specified schema does not match the existing schema at dbfs:locationOfMy/table
... Differences
- Specified schema has additional fields newColNameIAdded, anotherNewColIAdded
- Specified type for myOldCol is different from existing schema ...
If your intention is to keep the existing schema, you can omit the schema from the create table command. Otherwise please ensure that the schema matches.
How can I drop the schema and re-register it in the same location, with the same name, but with the new definitions?
Answering a month later since I didn't get replies and found the right solution:
Delta tables leave behind partition files and transaction logs that the DROP commands cannot clean up. I had to manually delete them at my table's location.
Try this:
dbutils.fs.rm(path, True)
Use the path of your schema.
Then create your table again.
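A rough sketch of that last step, reusing dbname.tablename and the column names from the error message above (the column types and the placeholder location are assumptions):
-- re-register the table with the new definition once the old files have been removed
CREATE TABLE dbname.tablename (
  myOldCol BIGINT,             -- type changed vs. the old definition (assumed)
  newColNameIAdded STRING,     -- new column (type assumed)
  anotherNewColIAdded STRING   -- new column (type assumed)
)
USING DELTA
LOCATION '<your existing path>';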

How to add a column in the middle of a ORC partitioned hive table and still be able to query old partitioned files with new structure

Currently I have a partitioned ORC "managed" Hive table in Prod (wrongly created as internal at first) with at least 100 days' worth of data partitioned by year, month, day (~16 GB of data).
This table has roughly 160 columns. My requirement is to add a column in the middle of this table and still be able to query the older data (partitioned files). It is fine if the newly added column shows null for the old data.
What have I done so far?
1) First, convert the table to external using the statement below to preserve the data files before dropping:
alter table <table_name> SET TBLPROPERTIES('EXTERNAL'='TRUE');
2) Drop and recreate the table with the new column in the middle, and then alter the table to add the partition files.
However, I am unable to read the table after recreation. I get this error message:
[Simba][HiveJDBCDriver](500312) Error in fetching data rows: *org.apache.hive.service.cli.HiveSQLException:java.io.IOException: java.io.IOException: ORC does not support type conversion from file type array<string> (87) to reader type int (87):33:32;
Is there any other way to accomplish this?
There is no need to drop and recreate the table. Simply use the following statement:
ALTER TABLE default.test_table ADD columns (column1 string,column2 string) CASCADE;
ALTER TABLE ... ADD/CHANGE COLUMN with the CASCADE keyword changes the columns in the table's metadata and cascades the same change to all the partition metadata.
PS - This adds the new columns at the end of the existing columns but before the partition columns. Unfortunately, ORC does not support adding columns in the middle as of now.
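For instance (the new column name is illustrative; default.test_table is the table from the statement above, assumed to be partitioned by year/month/day as in the question):
ALTER TABLE default.test_table ADD COLUMNS (new_col string) CASCADE;
-- existing partitions simply return NULL for the newly added column
SELECT year, month, day, new_col FROM default.test_table LIMIT 10;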
Hope that helps!

How hive create a table from a file present in HDFS?

I am new to HDFS and Hive. I got an introduction to both after reading some books and documentation. I have a question about creating a table in Hive for a file that is already present in HDFS.
I have a file with 300 fields in HDFS. I want to create a table that accesses this file, but I only want to use, say, 30 of its fields.
My questions are:
1. Does Hive create a separate file directory?
2. Do I have to create the Hive table first and import the data from HDFS?
3. Since I want to create a table with 30 columns out of the 300, does Hive create a file with only those 30 columns?
4. Do I have to create a separate file with 30 columns, import it into HDFS, and then create a Hive table pointing to that HDFS directory?
Does Hive create a separate file directory?
YES, if you create a Hive table (managed or external) and load the data using the LOAD command.
NO, if you create an external table and point it to an existing file.
Do I have to create the Hive table first and import the data from HDFS?
Not necessarily; you can create an external Hive table and point it to the existing file.
Since I want to create a table with 30 columns out of the 300, does Hive create a file with only those 30 columns?
You can do this easily using HiveQL. Follow the steps below (note: this is not the only approach); a sketch follows the list.
1. Create an external table with 300 columns and point it to the existing file.
2. Create another Hive table with the desired 30 columns and insert data into it from the 300-column table using "insert into table30col select ... from table300col". Note: Hive will create the files with 30 columns during this insert operation.
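A minimal sketch of step 2, assuming the 300-column external table from step 1 already exists (the column names shown are illustrative):
-- new 30-column table; Hive creates its data files during the INSERT
CREATE TABLE table30col (col1 STRING, col2 STRING);
INSERT INTO TABLE table30col SELECT col1, col2 FROM table300col;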
Do I have to create a separate file with 30 columns, import it into HDFS, and then create a Hive table pointing to that HDFS directory?
Yes, this can be an alternative.
I personally like the approach for question 3, as I don't have to recreate the file and I can do all of it within Hadoop without depending on another system.
You have several options. One is to have Hive simply point to the existing file, i.e. create an external Hive table:
CREATE EXTERNAL TABLE ... LOCATION '<your existing hdfs file>';
This table in Hive will, obviously, match your existing file exactly, and you must declare all 300 columns. There is no data duplication: there is only one file, and Hive simply references the already existing file.
A second option would be to either IMPORT or LOAD the data into a Hive table. This copies the data into a table whose location Hive controls. But it is important to understand that neither IMPORT nor LOAD transforms the data, so the resulting table will have exactly the same structure, layout, and storage as your original file.
Another option, which I would recommend, is to create a specific Hive table and then import the data into it, using a tool like Sqoop or by going through an intermediate staging table created by one of the methods above (preferably an external reference, to avoid an extra copy). Create the desired table, create the external staging table, insert the data into the target using INSERT ... SELECT, then drop the staging table. I recommend this because it lets you control not only the table structure/schema (i.e. have only the 30 needed columns) but also, importantly, the storage. Hive has a highly performant columnar storage format, namely ORC, and you should strive to use this storage format because it will give you a tremendous query performance boost.
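A hedged sketch of that recommended flow (table, column, delimiter, and path names are illustrative; a real staging table would declare all 300 columns):
-- external staging table referencing the existing file (no data is copied)
CREATE EXTERNAL TABLE staging_300col (col1 STRING, col2 STRING, col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/path/to/existing/hdfs/dir';
-- target table with only the needed columns, stored as ORC for query performance
CREATE TABLE target_30col (col1 STRING, col2 STRING) STORED AS ORC;
INSERT INTO TABLE target_30col SELECT col1, col2 FROM staging_300col;
-- the staging table is EXTERNAL, so dropping it leaves the original file untouched
DROP TABLE staging_300col;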

How to update a hive table's data after copied orc files with hdfs into the folder of that table

After copying ORC files into a table's folder with an HDFS copy, how do I update that Hive table so the new data shows up when querying with Hive?
Best Regards.
If the table is not partitioned, then once the files are in the HDFS folder specified in the LOCATION clause, the data should be available for querying.
If the table is partitioned, then you first need to run an ADD PARTITION statement.
As mentioned in the answer above by belostoky, if the table is not partitioned then you can directly query your table and see the updated data.
But if your table is partitioned, you first need to add the partitions to the Hive table. You can do that with an ALTER TABLE statement, like the one shown below:
ALTER TABLE table1
ADD PARTITION (dt='<date>')
location '<hdfs file path>'
Alternatively, if files were copied into many partition directories, you can have the Hive metastore discover and register them all at once by running
msck repair table table1
Once done, you can query your data.

hive 0.13 msck repair table only lists partitions not in metastore

I'm trying to use the Hive (0.13) MSCK REPAIR TABLE command to recover partitions, but it only lists the partitions not in the metastore instead of also adding them to the metastore.
Here's the output of the command:
partitions not in metastore externalexample:CreatedAt=26 04%3A50%3A56 UTC 2014/profileLocation="Chicago"
Here's how I'm creating the external table:
CREATE EXTERNAL TABLE IF NOT EXISTS ExternalExample(
tweetId BIGINT, username STRING,
txt STRING, CreatedAt STRING,
profileLocation STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Am I missing something?
I had a similar issue where MSCK REPAIR TABLE listed the partitions that were not in the metastore but did not actually add them (and gave no error message).
I tried manually adding a partition with the ALTER TABLE ADD PARTITION command, and this gave me an error message that led me to the root cause: the HDFS folder containing the 'missing' partition had been set up with incorrect permissions.
Once the permissions issue was resolved, the MSCK REPAIR TABLE command worked correctly.
If you encounter this issue, it may be worthwhile to try adding the partition manually with ALTER TABLE ADD PARTITION; it may produce a useful error message that helps you determine the root cause of the problem.
Please make sure that the names of the partition columns defined in your table definition match the names of the partition directories on HDFS.
For example, in your table creation statement, I see that you haven't defined any partitions at all.
I think you want to do something like this (note the use of PARTITIONED BY):
CREATE EXTERNAL TABLE ExternalExample(
tweetId BIGINT, username STRING,
txt STRING,
favc BIGINT,retweet STRING,retcount BIGINT,followerscount BIGINT)
PARTITIONED BY (CreatedAt STRING, profileLocation STRING)
COMMENT 'This is the Twitter streaming data'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
location '/user/hue/exttable/';
Then on HDFS you should have the following folder structure:
/user/hue/exttable/CreatedAt=<someString>/profileLocation=<someString>/your-data-file
The partition names for MSCK REPAIR TABLE ExternalTable should be in lowercase; only then will it add them to the Hive metastore. I faced a similar issue in Hive 1.2.1, where there was no support for ALTER TABLE ExternalTable RECOVER PARTITION, and after spending some time debugging I found that the partition names in the paths must be lowercase, i.e. /some_external_path/mypartition=01 is valid and /some_external_path/myPartition=01 is invalid.
Change your profileLocation to profilelocation or profile_location and test it; it should work.
My question is here: Not able to recover partitions through alter table in Hive 1.2
Hive stores a list of partitions for each table in its metastore. If, however, new partitions are added directly to HDFS (manually, e.g. with the hadoop fs -put command), the metastore will not be aware of these partitions.
You need to add each partition explicitly, e.g.
ALTER TABLE ExternalExample ADD PARTITION (CreatedAt='<someString>', profileLocation='<someString>');
for every partition, or in short you can run
MSCK REPAIR TABLE ExternalExample;
It will add any partitions that exist on HDFS but not in the metastore to the metastore.
Ref https://issues.apache.org/jira/browse/HIVE-874
1) You need to specify partitions.
2) Partition names must be all lowercase. See this - https://singhanuvrat.com/hive-partition-column-name-camelcase-bad-idea-b89796d4e741#.16d7uqfot
You might not be running as the hive user:
sudo -u hive hive -e "set hive.msck.path.validation=ignore; msck repair table T1"
set hive.msck.path.validation=ignore; (this is for tables with a large number of partitions.)
You are just missing the PARTITIONED BY (CreatedAt STRING, profileLocation STRING).