Impala query failed for COMPUTE INCREMENTAL STATS databasename.tablename

We are sqooping data from Netezza into non-partitioned Hadoop tables, and then loading it from the non-partitioned tables into partitioned ones with INSERT OVERWRITE. After this we run compute incremental stats databasename.tablename on the partitioned tables, but the query fails for some of the partitions with the error
Could not execute command: compute incremental stats, along with No such file or directory for some file in the partition directory.

You can run a refresh statement before computing stats to refresh the metadata right away. It may be necessary to wait a few seconds before computing stats even if the refresh statement returns code 0, as past experience has shown that the metadata can still be refreshing after the return code is given. You typically won't see this issue unless a script is executing these commands sequentially.
refresh yourTableName
compute stats yourTableName
As of Impala 2.3 you can also use alter table ... recover partitions instead of refreshing the metadata or repairing the table.
alter table yourTableName recover partitions
compute stats yourTableName
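
The refresh, wait, compute sequence above could be scripted roughly as follows. This is only a sketch: it assumes impala-shell is on the PATH, the function names, the 10 second default pause and the host argument are illustrative choices rather than anything from the original answer, and it issues COMPUTE INCREMENTAL STATS because that is the statement used in the question.

```python
import subprocess
import time


def build_commands(table, impalad=None):
    """Build the two impala-shell invocations: REFRESH first, then stats."""
    base = ["impala-shell"]
    if impalad:
        # -i selects the impalad host to connect to (hypothetical value)
        base += ["-i", impalad]
    return [
        base + ["-q", f"REFRESH {table}"],
        base + ["-q", f"COMPUTE INCREMENTAL STATS {table}"],
    ]


def refresh_then_compute(table, impalad=None, pause_seconds=10,
                         run=subprocess.run, sleep=time.sleep):
    """Refresh metadata, pause briefly, then compute stats.

    The pause is an assumption based on the observation that metadata
    may still be refreshing after REFRESH has returned.
    """
    refresh_cmd, stats_cmd = build_commands(table, impalad)
    run(refresh_cmd, check=True)   # raises if impala-shell exits non-zero
    sleep(pause_seconds)
    run(stats_cmd, check=True)
```

Calling refresh_then_compute("databasename.tablename") would run both statements in order. The run and sleep parameters exist only so the sequence can be exercised without a cluster.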

Related

Drop empty Impala partitions

Impala external table partitions still show up in the stats with a row count of 0 after deleting the data in HDFS, even after altering the table (ALTER TABLE table RECOVER PARTITIONS), refreshing it (REFRESH table), and invalidating the metadata.
Dropping the partitions one by one works, but there are tens of partitions to remove, which would be quite tedious.
Dropping and recreating the table would also be an option, but then all the statistics would be dropped together with the table.
Are there any other options in Impala to get this done?
Found a workaround through Hive.
Issuing MSCK REPAIR TABLE tablename SYNC PARTITIONS in Hive and then refreshing the table in Impala makes the empty partitions disappear.

How to make a Hive query take advantage of statistics stored in Metastore

I am using Hive version 1.2.1. If I run select count(*) from mytable, I see that it launches a Tez job, so it is evidently not using any table statistics; ideally the number of rows would be fetched from the table statistics stored in the Hive Metastore. I also checked all the tables in the Hive Metastore database and found none whose name suggests that it stores table statistics. The closest match was TAB_COL_STATS, but that table only stores column-level statistics, and it contained only 10 rows. This raises the following questions.
Does this version of Hive (1.2.1) not support table statistics?
If the Metastore table TAB_COL_STATS stores everything, why is there no column like num_rows in its schema? I only see columns such as max, min, avg and num_distinct.
When I query a table for some statistics (e.g. number of rows), do I have to switch on some option so it will take advantage of internally stored statistics instead of running a Tez job?
It supports table level statistics.
The Hive Metastore tables are highly normalized. You can find the number of rows in TABLE_PARAMS or PARTITION_PARAMS.
You should set hive.compute.query.using.stats to true to make use of the metadata for queries like select count(*)....
But beforehand, make sure those statistics actually exist.
If not, run analyze table mytable compute statistics to gather them first.
Or you can set hive.stats.autogather to true to have statistics gathered when data is inserted into tables.
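
To illustrate where the row count lives, here is a rough sketch that mimics the lookup against an in-memory SQLite stand-in for the Metastore database. It assumes the usual layout in which TABLE_PARAMS holds key/value rows keyed by TBL_ID and the row count is stored under the key numRows. The real schema has many more columns, and the sample rows below are invented for the demonstration.

```python
import sqlite3

# Simplified stand-in for two Metastore tables (real ones have more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TBLS (TBL_ID INTEGER PRIMARY KEY, TBL_NAME TEXT);
CREATE TABLE TABLE_PARAMS (TBL_ID INTEGER, PARAM_KEY TEXT, PARAM_VALUE TEXT);
INSERT INTO TBLS VALUES (1, 'mytable');
INSERT INTO TABLE_PARAMS VALUES (1, 'numRows', '1000');
INSERT INTO TABLE_PARAMS VALUES (1, 'totalSize', '52345');
""")

# Table-level statistics are key/value rows, not dedicated columns.
NUM_ROWS_QUERY = """
SELECT t.TBL_NAME, p.PARAM_VALUE
FROM TBLS t
JOIN TABLE_PARAMS p ON p.TBL_ID = t.TBL_ID
WHERE p.PARAM_KEY = 'numRows' AND t.TBL_NAME = ?
"""


def num_rows(connection, table_name):
    """Return the stored row count for a table, or None if no stats exist."""
    row = connection.execute(NUM_ROWS_QUERY, (table_name,)).fetchone()
    return int(row[1]) if row else None
```

The same join should carry over to the actual Metastore database (MySQL, PostgreSQL, etc.), with PARTITION_PARAMS playing the equivalent role for partitioned tables.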

Run DB2 Runstats without activity but still get SQLSTATE=01650

After reading many articles on the internet, I am still not sure what the actual purpose of DB2 Runstats is.
As I understand it, DB2 Runstats "registers" the table index in the DB2 catalog, so that the next time a related query runs, it uses the index to improve performance. (Please correct me if I am wrong.)
Does that mean that if DB2 Runstats is not run for a long period of time, the index will be removed from the DB2 catalog?
I am creating a new index for a table. Originally that table already contained another index.
After creating the new index, I ran DB2 Runstats on the table for the old index, but I hit the following error:
SQL2314W Some statistics are in an inconsistent state. The newly collected
"INDEX" statistics are inconsistent with the existing "TABLE" statistics. SQLSTATE=01650
At first I thought it was caused by the creation of the new index, with the table still in a "processing" stage. I ran the DB2 Runstats command again the next day but still got the same error.
Your understanding of db2 runstats is not correct. This command collects statistics on the given table and its indexes and places them in views in the SYSSTAT schema such as SYSSTAT.TABLES, SYSSTAT.INDEXES, etc. This information is used by the DB2 optimizer to produce better access plans for your queries. It doesn't "register" indexes itself. Indexes are not removed automatically if you don't collect statistics on them.
As for the warning message SQL2314W:
It's just a warning that the table and index statistics are not logically consistent (for example, the number of index keys is greater than the number of rows in the table). This sometimes happens when you collect statistics on a table that is being actively updated, even if you collect statistics on the table and its indexes with a single RUNSTATS command. You can either ignore this message or make the RUNSTATS utility lock the table while it collects statistics on the table and its indexes in a single command (ALLOW READ ACCESS clause).
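
As a rough illustration of the single-command form, the helper below assembles a RUNSTATS statement that covers the table and all of its indexes and optionally appends ALLOW READ ACCESS. The function name and the schema and table names are placeholders, and the exact clause order should be checked against the RUNSTATS reference for your DB2 version.

```python
def runstats_command(schema, table, lock_table=True):
    """Build one RUNSTATS statement for a table and all of its indexes.

    With lock_table=True the statement ends in ALLOW READ ACCESS, so other
    applications can read but not update the table during collection.
    """
    cmd = f"RUNSTATS ON TABLE {schema}.{table} AND INDEXES ALL"
    if lock_table:
        cmd += " ALLOW READ ACCESS"
    return cmd


# Placeholder names, for illustration only.
print(runstats_command("MYSCHEMA", "MYTABLE"))
```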

HDFS data corruption issue

We have a data corruption issue on our Hadoop cluster. We have a managed table in Hive which contains three years of data partitioned by year.
The three queries below run fine without any issue.
select count(*) from tkt_hist where yr=2015
select count(*) from tkt_hist where yr=2016
select count(*) from tkt_hist where yr=2017
However, when we try to group by year, the error below is shown.
Error while compiling statement: FAILED: SemanticException java.io.FileNotFoundException: File hdfs://ASIACELLHDP/apps/hive/warehouse/gprod1t_base.db/toll_tkt_hist_old/yr=2015/mn=01/dy=01 does not exist. [ERROR_STATUS]
Even select will not work when we specify a year other than 2015.
//this works fine
Select * from tkt_hist where yr=2015 limit 10;
// below throws same error mentioned above.
Select * from tkt_hist where yr=2016;
Try increasing java heap space (increase reducer memory if it doesn't work).
For example:
set mapreduce.map.java.opts = -Xmx15360m
You will have to drop the partitions manually because msck repair table only adds partitions but doesn't remove existing ones.
You will have to iterate through the corrupt partitions list. For internal tables, you'll have to be specific, as dropping a partition deletes the underlying physical files.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr=2015, mn=01, dy=01);
You will need to do this for each partition. You could put it in a bash script and execute it with hive -e or beeline -e commands to work with a quoted query string.
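
The per-partition loop could look something like the sketch below, written in Python rather than bash. The function names are made up for illustration, the partition values are inserted exactly as given (so quote them yourself if your partition columns need it), and the list of corrupt partitions is assumed to come from your own inspection of the table.

```python
import subprocess


def drop_partition_statements(table, partitions):
    """Build one ALTER TABLE ... DROP PARTITION statement per partition spec.

    Each spec is a dict such as {"yr": "2015", "mn": "01", "dy": "01"}.
    """
    statements = []
    for spec in partitions:
        clause = ", ".join(f"{key}={value}" for key, value in spec.items())
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION({clause});"
        )
    return statements


def run_with_hive(statements, run=subprocess.run):
    """Pass all statements to the Hive CLI as one quoted query string."""
    run(["hive", "-e", " ".join(statements)], check=True)
```

For example, drop_partition_statements("tkt_hist", [{"yr": "2015", "mn": "01", "dy": "01"}]) yields the single statement shown above. Printing the statements and reviewing them before calling run_with_hive is advisable, since dropping a partition of a managed table deletes its files.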
If you are using an external table, then it's much easier to remove all partitions and then repair table.
ALTER TABLE tkt_hist DROP IF EXISTS PARTITION(yr<>'', mn<>'', dy<>'');
Make sure to repair the table as the user owning the Hive DB as well as the HDFS path.
MSCK REPAIR TABLE tkt_hist;
This should add the partition folders currently present under the table path without adding the invalid partitions.
Note: If your user isn't the owner of the directory, ensure you have write permissions and do your work in the hive access client, as beeline requires absolute ownership rights to work.

Partition swapping in Hive

What is the impact on running queries in Hive if I swap the partition using
ALTER TABLE user_data
PARTITION (name = 'ABC')
SET LOCATION 'db/partitions/new';
Does this command wait until the running queries have finished executing?
Hive translates your query into a temporary Map/Reduce job, and that job is executed on behalf of your Hive query. When you submit a Hive query, it creates a Map/Reduce job based on the query, the job gets executed, and you get the result from that job. If you alter the table and change a partition or anything else while the query is executing, the ALTER command will not wait for the running job to finish. It alters the table right away, and you still get the result from your previous query unless you kill the previous job.
The best way to understand this is to try it. Submit your Hive query and redirect the result into a file, then change the partition, submit the query again and redirect its result into another file. Compare the two outputs.