Hdfs data corruption issue - hive

we have a data corruption issue at our hadoop cluster. We have a managed table on hive which contains three years of data partitioned by year.
Below two queries run fine without any issue
select count(*) from tkt_hist table where yr=2015
select count(*) from tkt_hist table where yr=2016
select count(*) from tkt_hist table where yr=2017
However, when we try to do group by per year, below error is shown.
Error while compiling statement: FAILED: SemanticException java.io.FileNotFoundException: File hdfs://ASIACELLHDP/apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old/yr=2015/mn=01/dy=01 does not exist. [ERROR_STATUS]
Even select will not work when we specify a year other than 2015.
//this works fine
Select * from tkt_hist where yr=2015 limit 10;
// below throws same error mentioned above.
Select * from tkt_hist where yr=2016;

Try increasing java heap space (increase reducer memory if it doesn't work).
For example:
set mapreduce.map.java.opts = -Xmx15360m

You will have to drop the partitions manually because msck repair table only adds partitions but doesn't remove existing ones.
You will have to iterate through the corrupt partitions list. For internal tables, you'll have to be specific, as dropping a partition deletes the underlying physical files.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr=2015, mn=01, dy=01);
You will need to do this for each partition. You could put it in a bash script and execute it with hive -e or beeline -e commands to work with a quoted query string.
If you are using an external table, then it's much easier to remove all partitions and then repair table.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr<>'', mn<>'', dy<>'');
Make sure to repair the table as the user owning the Hive DB as well as the HDFS path.
MSCK REPAIR TABLE tkt_hist;
This should add partitions folders currently available in the table path without adding the invalid partitions.
Note: If your user isn't the owner of the directory, ensure you have write permissions and do your work in the hive access client as beeline requires absolute ownership rights to work.

Related

Databricks - is not empty but it's not a Delta table

I run a query on Databricks:
DROP TABLE IF EXISTS dublicates_hotels;
CREATE TABLE IF NOT EXISTS dublicates_hotels
...
I'm trying to understand why I receive the following error:
Error in SQL statement: AnalysisException: Cannot create table ('default.dublicates_hotels'). The associated location ('dbfs:/user/hive/warehouse/dublicates_hotels') is not empty but it's not a Delta table
I already found a way how to solve it (by removing it manually):
dbutils.fs.rm('.../dublicates_hotels',recurse=True)
But I can't understand why it's still keeping the table?
Even though that I created a new cluster (terminated the previous one) and I'm running this query with a new cluster attached.
Anyone can help me to understand that?
I also faced a similar problem, then tried the command line CREATE OR REPLACE TABLE and it solved my problem.
DROP TABLE & CREATE TABLE work with entries in the Metastore that is some kind of database that keeps the metadata about databases and tables. There could be the situation when entries in metastore don't exist so DROP TABLE IF EXISTS doesn't do anything. But when CREATE TABLE is executed, then it additionally check for location on DBFS, and fails if directory exists (maybe with data). This directory could be left from some previous experiments, when data were written without using the metastore.
if the table created with LOCATION specified - this means the table is EXTERNAL, so when you drop it - you drop only hive metadata for that table, directory contents remains as it is. You can restore the table by CREATE TABLE if you specify the same LOCATION (Delta keeps table structure along with it's data in the directory).
if LOCATION wasn't specified while table creation - it's a MANAGED table, DROP will destroy metadata and directory contents

Impala query failed for -compute incremental stats databsename.table name

We are scooping data from netezza to hadoop non-partitioned tables and then from non-partition to partitioned with insert overwrite method. After this we are running compute incremental stats for databasename.tablename on partitioned tables but this query failed for some of the partitions with error
Could not execute command: compute incremental stats and No such file or directory for some file in partitioned directory.
You can run a refresh statement before computing stats to refresh the metadata right away. It may be necessary to wait a few seconds before computing stats even if the refresh statement return code is 0 as past experience has shown that metadata is still refreshing even after a return code is given. You won't typically won't see this issue unless a script is executing these commands sequentially.
refresh yourTableName
compute stats yourTableName
As of Impala 2.3 your can also use the alter table recover partitions instead of refresh metadata or repair table.
alter yourTableName recover partitions
compute stats yourTableName

Hive: DROP TABLE IF EXISTS <Table Name> does not free memory

When I am using DROP TABLE IF EXISTS <Table Name> in hive, it is not freeing the memory. The files are created as 0000_n.bz2 and they are still on disk.
I have two questions here:
1) Will these files keep on growing for each and every insert?
2) Is there any DROP equivalent to remove the files as well on the disk?
Couple of things you can do:
Check if the table is an external table and in that case you need to drop the files manually on HDFS as dropping tables won't drop the files:
hadoop fs -rm /HDFS_location/filename
Secondly check if you are in the right database. You need to issue use database command before dropping the tables. The database should be same as the one in which tables were created.
There are two types of tables in hive.
Hive managed table: If you drop a hive managed table the data in HDFS are automatically deleted.
External Table: If you drop an external table, hive doesnt delete the underlying data.
I believe yours is an external table.
Drop table if exists table_name purge;
This command will also remove data files from trash folder and cannot be recovered after table drop

How to update partition metadata in Hive , when partition data is manualy deleted from HDFS

What is the way to automatically update the metadata of Hive partitioned tables?
If new partition data's were added to HDFS (without alter table add partition command execution) . then we can sync up the metadata by executing the command 'msck repair'.
What to be done if a lot of partitioned data were deleted from HDFS (without the execution of alter table drop partition commad execution).
What is the way to syncup the Hive metatdata?
EDIT : Starting with Hive 3.0.0 MSCK can now discover new partitions or remove missing partitions (or both) using the following syntax :
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]
This was implemented in HIVE-17824
As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted, it only adds partitions if new folders are created.
Extract from offical documentation :
In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.
This is what I usually do in the presence of external tables if multiple partitions folders are manually deleted on HDFS and I want to quickly refresh the partitions :
Drop the table (DROP TABLE table_name)
(dropping an external table does not delete the underlying partition files)
Recreate the table (CREATE EXTERNAL TABLE table_name ...)
Repair it (MSCK REPAIR TABLE table_name)
Depending on the number of partitions this can take a long time. The other solution is to use ALTER TABLE DROP PARTITION (...) for each deleted partition folder but this can be tedious if multiple partitions were deleted.
Try using
MSCK REPAIR TABLE <tablename>;
Ensure the table is set to external, drop all partitions then run the table repair:
alter table mytable_name set TBLPROPERTIES('EXTERNAL'='TRUE')
alter table mytable_name drop if exists partition (`mypart_name` <> 'null');
msck repair table mytable_name;
If msck repair throws an error, then run hive from the terminal as:
hive --hiveconf hive.msck.path.validation=ignore
or set hive.msck.path.validation=ignore;

Apache hive create table

I have a problem understand the real meaning behind this Apache Hive code, Can someone please explain to me whether this code is really doing anything?
ALTER TABLE a RENAME TO a_tmp;
DROP TABLE a;
CREATE TABLE a AS SELECT * FROM a_tmp;
ALTER TABLE a RENAME TO a_tmp;
This simply allows you to rename your table a to a_tmp.
Let's say your table a initially points to /user/hive/warehouse/a, then after executing this command your data will be moved to /user/hive/warehouse/a_tmp and the content of /user/hive/warehouse/a will no longer exist. Note that this behavior of moving HDFS directories only exist in more recent versions of Hive. Before that the RENAME command was only updating the metastore and not moving directories in HDFS.
Similarly, if you do a show tables after, you will see that a doesn't exist anymore, but a_tmp exists. You can no longer query a at that point because it is no longer registered in the metastore.
DROP TABLE a;
This does basically nothing, because you already renamed a to a_tmp. So a doesn't exist anymore in the metastore. This will still print "OK" because there's nothing to do.
CREATE TABLE a AS SELECT * FROM a_tmp;
You are asking to create a brand new table called a and register it in the metastore. You are also asking to populate it with the same data that is in a_tmp (which you already copied from a before)
So in short you're moving your initial table to a new one, and then copying the new one back to the original, so the only thing these query do is duplicating your initial data into both a and a_tmp.