Impala external table partitions still show up in the stats with a row count of 0 after deleting the data in HDFS, recovering partitions (ALTER TABLE table RECOVER PARTITIONS), refreshing the table (REFRESH table), and invalidating the metadata.
Trying to drop partitions one by one works, but there are tens of partitions which should be removed and it would be quite tedious.
Dropping and recreating the table would also be an option but that way all the statistics would be dropped together with the table.
Are there any other options in Impala to get this done?
Found a workaround through Hive.
By issuing MSCK REPAIR TABLE tablename SYNC PARTITIONS and then refreshing the table in Impala, the empty partitions disappear.
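For reference, the two statements look roughly like this (mytable is a placeholder; the first runs in Hive, the second in Impala):
-- in Hive
MSCK REPAIR TABLE mytable SYNC PARTITIONS;
-- in Impala
REFRESH mytable;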
Related
I created two external tables in Hive. For the first table I specified the data location in the CREATE statement. For the second table I loaded data after creating it.
I can see the data file created for the second table in the /hive/warehouse/ directory. I then set "external.table.purge"="true" for both tables and dropped both of them, but the data files of both tables remain as they were.
What is the behaviour of 'external.table.purge'='true'? Shouldn't it delete the data files as well when DROP is issued?
If Hive does not take any ownership of an external table's data files, why is there even an option such as 'external.table.purge'='true'?
I read in one of the threads that it is possible to delete the data for external tables as well via ALTER TABLE ... SET TBLPROPERTIES('external.table.purge'='true'), but I am unable to find that post again.
You cannot drop the data of an external table, but you can for internal (managed) tables. So convert the table to internal and then drop it.
First change the external property to false.
hive> ALTER TABLE nyse_external SET TBLPROPERTIES('EXTERNAL'='False');
and then you can easily drop it.
hive> drop table nyse_external;
TBLPROPERTIES ("external.table.purge"="true") should work for hive version 4.x+.
Answer to point 1:
Table property "external.table.purge", which if true (and if the table is an external table), will let Hive know to delete the table data when the table is dropped. This feature is introduced in this apache jira.
https://issues.apache.org/jira/browse/HIVE-19981 .
For reference on how to set the property take a look at this example,
https://docs.cloudera.com/runtime/7.2.7/using-hiveql/topics/hive_drop_external_table_data.html
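As a minimal sketch (the table name is a placeholder), setting the property and then dropping the table would look something like:
ALTER TABLE my_external_table SET TBLPROPERTIES('external.table.purge'='true');
DROP TABLE my_external_table;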
We have recently partitioned most of our tables in BigQuery using the following method:
Run a Dataflow pipeline which reads a table and writes the data to a new partitioned table.
Copy the partitioned table back to the original table using a copy job with write truncate set.
Once complete, the original table is replaced with the data from the newly created partitioned table; however, the original table is still not partitioned. So I tried the copy again, this time deleting the original table first, and it all worked.
The problem is that it takes 20 minutes to copy our partitioned table, which would cause downtime for our production application. So is there any way of doing a write truncate with a partitioned table replacing a non-partitioned table without causing any downtime? Or will we need to delete the table first in order to replace it?
Sorry but you cannot change a non-partitioned table to partitioned, or vice versa. You will have to delete and re-create the table.
Couple of workarounds I can think of:
Keep both tables while you're migrating your queries to the partitioned table. After all queries are migrated you delete the original, non-partitioned table.
If you are using standard SQL, you can replace the original table with a view on top of the partitioned table. Deleting and replacing the original table with a view should be very quick. And partition pruning should still work on top of the view, so you're only charged for the queried partitions. Partition pruning might not work for legacy SQL.
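A rough sketch of the view approach in standard SQL, with placeholder dataset and table names (the original table has to be dropped first, since a view cannot share a name with an existing table):
DROP TABLE mydataset.original_table;
CREATE VIEW mydataset.original_table AS
SELECT * FROM mydataset.partitioned_table;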
We are Sqooping data from Netezza into non-partitioned Hadoop tables and then loading from the non-partitioned tables into partitioned tables with the INSERT OVERWRITE method. After this we run compute incremental stats for databasename.tablename on the partitioned tables, but this query failed for some of the partitions with the error:
Could not execute command: compute incremental stats, along with "No such file or directory" for some file in the partition directory.
You can run a refresh statement before computing stats to refresh the metadata right away. It may be necessary to wait a few seconds before computing stats even if the refresh statement return code is 0, as past experience has shown that metadata is still refreshing even after a return code is given. You typically won't see this issue unless a script is executing these commands sequentially.
refresh yourTableName
compute stats yourTableName
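If these commands are run from a script (per the note above about waiting a few seconds), the sequence could look roughly like this; having impala-shell on the path and the table name are assumptions here:
impala-shell -q "refresh yourTableName"
sleep 5
impala-shell -q "compute stats yourTableName"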
As of Impala 2.3 you can also use alter table ... recover partitions instead of refreshing the metadata or repairing the table.
alter table yourTableName recover partitions
compute stats yourTableName
When I use DROP TABLE IF EXISTS <Table Name> in Hive, it is not freeing up the disk space. The files were created as 0000_n.bz2 and they are still on disk.
I have two questions here:
1) Will these files keep on growing for each and every insert?
2) Is there any DROP equivalent to remove the files as well on the disk?
Couple of things you can do:
Check if the table is an external table; in that case you need to delete the files manually on HDFS, since dropping the table won't remove the files:
hadoop fs -rm /HDFS_location/filename
Secondly, check that you are in the right database. You need to issue a USE database command before dropping the tables; the database should be the same one in which the tables were created.
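A minimal illustration of that, with a placeholder database and table name:
hive> USE sales_db;
hive> DROP TABLE IF EXISTS my_table;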
There are two types of tables in Hive.
Hive managed table: if you drop a Hive managed table, the data in HDFS is automatically deleted.
External table: if you drop an external table, Hive doesn't delete the underlying data.
I believe yours is an external table.
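To check which kind of table you have, DESCRIBE FORMATTED shows the table type (MANAGED_TABLE vs. EXTERNAL_TABLE); the table name here is a placeholder:
hive> DESCRIBE FORMATTED my_table;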
DROP TABLE IF EXISTS table_name PURGE;
With PURGE, the data files skip the trash folder, so the data cannot be recovered after the table is dropped.
What is the way to automatically update the metadata of Hive partitioned tables?
If new partition data was added to HDFS (without executing the ALTER TABLE ADD PARTITION command), then we can sync up the metadata by executing the command 'msck repair'.
What is to be done if a lot of partition data was deleted from HDFS (without executing ALTER TABLE DROP PARTITION)?
What is the way to sync up the Hive metadata?
EDIT: Starting with Hive 3.0.0, MSCK can discover new partitions or remove missing partitions (or both) using the following syntax:
MSCK [REPAIR] TABLE table_name [ADD/DROP/SYNC PARTITIONS]
This was implemented in HIVE-17824
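For example, removing metastore partitions whose folders were deleted from HDFS would look like this (the table name is a placeholder):
MSCK REPAIR TABLE my_table DROP PARTITIONS;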
As correctly stated by HakkiBuyukcengiz, MSCK REPAIR doesn't remove partitions if the corresponding folder on HDFS was manually deleted; it only adds partitions if new folders are created.
Extract from the official documentation:
In other words, it will add any partitions that exist on HDFS but not in metastore to the metastore.
This is what I usually do in the presence of external tables if multiple partition folders are manually deleted on HDFS and I want to quickly refresh the partitions:
Drop the table (DROP TABLE table_name)
(dropping an external table does not delete the underlying partition files)
Recreate the table (CREATE EXTERNAL TABLE table_name ...)
Repair it (MSCK REPAIR TABLE table_name)
Depending on the number of partitions this can take a long time. The other solution is to use ALTER TABLE DROP PARTITION (...) for each deleted partition folder but this can be tedious if multiple partitions were deleted.
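Put together, that drop/recreate/repair sequence would look roughly like this; the table definition is only a placeholder sketch:
DROP TABLE my_events;
CREATE EXTERNAL TABLE my_events (id BIGINT, payload STRING)
PARTITIONED BY (dt STRING)
LOCATION '/data/my_events';
MSCK REPAIR TABLE my_events;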
Try using
MSCK REPAIR TABLE <tablename>;
Ensure the table is set to external, drop all partitions then run the table repair:
alter table mytable_name set TBLPROPERTIES('EXTERNAL'='TRUE');
alter table mytable_name drop if exists partition (`mypart_name` <> 'null');
msck repair table mytable_name;
If msck repair throws an error, then run hive from the terminal as:
hive --hiveconf hive.msck.path.validation=ignore
or set hive.msck.path.validation=ignore;