I'm trying to see statistics on a particular column. I executed the ANALYZE command first, then tried to see the stats with DESCRIBE FORMATTED <table_name> <col_name>.
I can't see any values in the output. Any idea why it's not showing any values?
I tried MSCK, analyzed the table again, and checked for stats. No luck so far.
hive> desc extended testdb.table order_dispatch_diff;
OK
order_dispatch_diff int from deserializer
Time taken: 0.041 seconds, Fetched: 1 row(s)
Try it with the FOR COLUMNS clause:
ANALYZE TABLE testdb.table COMPUTE STATISTICS FOR COLUMNS;
Then use DESCRIBE FORMATTED testdb.table order_dispatch_diff; to display statistics.
See Column Statistics docs for more details.
The statement below finally worked for me:
hive> desc formatted testdb.table col_name partition (data_dt='20180715');
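Putting the pieces together, a sketch of the partition-level workflow (the table, column, and data_dt value are illustrative):

```sql
-- Compute column statistics for a single partition...
ANALYZE TABLE testdb.table PARTITION (data_dt='20180715')
  COMPUTE STATISTICS FOR COLUMNS;

-- ...then inspect them for one column of that partition:
DESCRIBE FORMATTED testdb.table col_name PARTITION (data_dt='20180715');
```

On a partitioned table, column stats computed without a partition spec may not show up when you describe a specific partition, which would explain the empty output above.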
Related
I have a partitioned table, with partitions based on timestamp column, like this:
...
partition by date_trunc(update_date, month)
...
Now, how do I query partitions in this table?
AFAIK, I can't use the _PARTITIONDATE pseudo-column, since this is not an ingestion-time partitioned table. Do I just filter my query using date_trunc(update_date, month)?
select * from my_project.my_dataset.information_schema.partitions
where table_name = 'partitioned_table'
Here I can get partition_ids, but I can't address them in my query either.
I'm not sure, but whenever you want to test whether a partition is working: type out a SELECT * statement without a filter, don't run it, and check the estimated query size. Then try filtering on what you believe is the partition column, again without running it, and see if the estimated query size is reduced.
In this case, try filtering on date_trunc(update_date, month) and see if it reduces the query size.
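For a column-partitioned table, BigQuery can prune partitions when you filter on the partitioning column itself. A sketch (the project, dataset, table name, and dates are placeholders):

```sql
-- Filtering directly on update_date with constant bounds should let
-- BigQuery prune to the matching monthly partitions:
SELECT *
FROM my_project.my_dataset.partitioned_table
WHERE update_date >= TIMESTAMP('2023-01-01')
  AND update_date <  TIMESTAMP('2023-02-01');
```

The dry-run size check described above is a good way to confirm the filter actually prunes.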
I've recently moved to using AvroSerDe for my External tables in Hive.
Select col_name,count(*)
from table
group by col_name;
The above query gives me a count, whereas the below query does not:
Select count(*)
from table;
The reason is that Hive just looks at the table metadata and fetches the value from there. For some reason, the statistics for the table were not updated in Hive, which is why count(*) returns 0.
The statistics are written with zero data rows at table creation time, and for any data appends/changes Hive needs to update these statistics in the metadata.
Running the ANALYZE command gathers statistics and writes them into the Hive metastore:
ANALYZE TABLE table_name COMPUTE STATISTICS;
Visit Apache Hive wiki for more details about ANALYZE command.
Other methods to solve this issue:
- Using a 'limit' or 'group by' clause triggers a map-reduce job to count the rows and gives the correct value.
- Setting fetch task conversion to none forces Hive to run a map-reduce job to count the rows:
hive> set hive.fetch.task.conversion=none;
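Combining the two fixes from this answer into one session (table name is a placeholder):

```sql
-- Option 1: force a real scan instead of a metadata lookup for this session:
SET hive.fetch.task.conversion=none;
SELECT count(*) FROM table_name;

-- Option 2: refresh the stats so the fast metadata-only answer is correct again:
ANALYZE TABLE table_name COMPUTE STATISTICS;
SELECT count(*) FROM table_name;
```

Option 2 is the durable fix; option 1 is useful to confirm that stale statistics are indeed the cause.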
Is there any way to limit the number of Hive partitions while listing the partitions in show command?
I have a Hive table which has around 500 partitions and I wanted the latest partition alone. The show command list all the partitions. I am using this partition to find out the location details. I do not have access to metastore to query the details and the partition location is where the actual data resides.
I tried set hive.limit.query.max.table.partition=1 but this does not affect the metastore query. So, is there any other way to limit the partitions listed?
Thank you,
Revathy.
Are you running from the command line?
If so, you can get your desired result with something like this:
hive -e "set hive.cli.print.header=false;show partitions table_name;" | tail -1
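If you capture the SHOW PARTITIONS output outside Hive anyway, picking the latest partition is trivial to script. A sketch in Python, assuming partition names like dt=20180715 whose lexical order matches their chronological order (that naming scheme is an assumption):

```python
def latest_partition(show_partitions_output: str) -> str:
    """Return the newest partition from `SHOW PARTITIONS` output.

    Assumes one partition per line and names that sort lexically
    in date order (e.g. dt=YYYYMMDD), so max() is sufficient.
    """
    partitions = [line.strip() for line in show_partitions_output.splitlines()
                  if line.strip()]
    return max(partitions)

print(latest_partition("dt=20180713\ndt=20180714\ndt=20180715"))
```

This is equivalent to the `tail -1` trick above, but lets you post-process the partition spec (e.g. to build a location lookup) in the same script.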
There is a "BAD" way to obtain what you want. You can treat the partition columns like ordinary columns and extract them in a SELECT with a LIMIT (DESC so the latest partition comes first):
SELECT DISTINCT partition_column
FROM partitioned_table
ORDER BY partition_column DESC
LIMIT 1;
The only way to filter SHOW PARTITIONS is with a PARTITION clause:
SHOW PARTITIONS partitioned_table PARTITION ( partitioned_column = "somevalue" );
We had an issue with our ingestion process that would result in partitions being added to a table in Hive, but the path in HDFS didn't actually exist. We've fixed that issue, but we still have these bad partitions. When querying these tables using Tez, we get FileNotFound exception, pointing to the location in HDFS that doesn't exist. If we use MR instead of Tez, the query works (which is very confusing to me), but it's too slow.
Is there a way to list all the partitions that have this problem? MSCK REPAIR seems to handle the opposite problem, where the data exists in HDFS but there is no partition in Hive.
EDIT: More info.
Here's the output of the file not found exception:
java.io.FileNotFoundException: File hdfs://<server>/db/tables/2016/03/14/mytable does not exist.
If I run show partitions <db.mytable>, I'll get all the partitions, including one for dt=2016-03-14.
show table extended like '<db.mytable>' partition(dt='2016-03-14') returns the same location:
location:hdfs://server/db/tables/2016/03/14/mytable.
MSCK REPAIR TABLE <tablename> does not provide this facility. I faced the same issue and found a solution for it.
As we know, the msck repair command adds partitions based on the directories present, so first drop all partitions:
hive> ALTER TABLE mytable DROP IF EXISTS PARTITION (p<>'');
The above command removes all partitions. Then run the msck repair command, and it will re-create the partitions from the directories present at the table location:
hive> msck repair table mytable;
It seems MSCK REPAIR TABLE does not drop partitions that point to missing directories, but it does list them (see Partitions missing from filesystem: in the transcript below), so with a little scripting / manual work you can drop them based on the given list.
hive> create table mytable (i int) partitioned by (p int);
OK
Time taken: 0.539 seconds
hive> !mkdir mytable/p=1;
hive> !mkdir mytable/p=2;
hive> !mkdir mytable/p=3;
hive> msck repair table mytable;
OK
Partitions not in metastore: mytable:p=1 mytable:p=2 mytable:p=3
Repair: Added partition to metastore mytable:p=1
Repair: Added partition to metastore mytable:p=2
Repair: Added partition to metastore mytable:p=3
Time taken: 0.918 seconds, Fetched: 4 row(s)
hive> show partitions mytable;
OK
p=1
p=2
p=3
Time taken: 0.331 seconds, Fetched: 3 row(s)
hive> !rmdir mytable/p=1;
hive> !rmdir mytable/p=2;
hive> !rmdir mytable/p=3;
hive> msck repair table mytable;
OK
Partitions missing from filesystem: mytable:p=1 mytable:p=2 mytable:p=3
Time taken: 0.425 seconds, Fetched: 1 row(s)
hive> show partitions mytable;
OK
p=1
p=2
p=3
Time taken: 0.56 seconds, Fetched: 3 row(s)
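The "little scripting" mentioned above can be a few lines. A sketch in Python that turns the Partitions missing from filesystem: line into DROP PARTITION statements; the output format and the single-key partition scheme are assumptions based on the transcript:

```python
def drop_statements(msck_line: str) -> list:
    """Build ALTER TABLE ... DROP PARTITION statements from MSCK output.

    Assumes entries of the form `table:key=value`, as in the
    transcript above (e.g. `mytable:p=1`).
    """
    prefix = "Partitions missing from filesystem:"
    entries = msck_line.replace(prefix, "").split()
    stmts = []
    for entry in entries:
        table, spec = entry.split(":", 1)
        key, value = spec.split("=", 1)
        stmts.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({key}='{value}');"
        )
    return stmts

for stmt in drop_statements(
    "Partitions missing from filesystem: mytable:p=1 mytable:p=2 mytable:p=3"
):
    print(stmt)
```

You could pipe the generated statements back into `hive -e` to clean up the bad partitions in one pass.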
I wanted to understand the weekofyear UDF and how it decides where the first week starts. I had to artificially hit a table and run the query. I'd like to compute the values without hitting the table. Secondly, can I look at the UDF source code?
SELECT weekofyear
('12-31-2013')
from a;
You do not need a table to test a UDF since Hive 0.13.0.
See this Jira: HIVE-178 SELECT without FROM should assume a one-row table with no columns
Test:
hive> SELECT weekofyear('2013-12-31');
Result:
1
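As for understanding how the first week is decided: weekofyear follows ISO-8601 week numbering (weeks start on Monday, and week 1 is the week containing the first Thursday of the year). Python's isocalendar() implements the same rule, so as a sketch you can reproduce the value outside Hive entirely (this is an independent reimplementation, not the UDF itself):

```python
from datetime import date

def week_of_year(yyyy_mm_dd: str) -> int:
    """ISO-8601 week number, matching Hive's weekofyear convention."""
    y, m, d = map(int, yyyy_mm_dd.split("-"))
    return date(y, m, d).isocalendar()[1]

# 2013-12-31 falls in the week containing the first Thursday of 2014,
# hence week 1 of the new ISO year, matching the Hive result above.
print(week_of_year("2013-12-31"))  # 1
```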
The source code (master branch) is here: UDFWeekOfYear.java
If you are a Java developer, you can write JUnit test cases and test the UDFs.
You can search the source code of all Hive built-in functions on grepcode:
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.0.0/org/apache/hadoop/hive/ql/udf/UDFWeekOfYear.java
I don't think executing a UDF without hitting a table is possible in Hive.
Even Hive developers hit the table in UDF testing.
To make query run faster you can:
Create table with only one row and run UDF queries on this table
Run Hive in local mode.
Hive source code is located here.
UDFWeekOfYear source is here.
You should be able to use any table with at least one row to test functions.
Here is an example using a few custom functions that perform work and output a string result.
Replace anytable with an actual table.
SELECT ST_AsText(ST_Intersection(ST_Polygon(2,0, 2,3, 3,0), ST_Polygon(1,1, 4,1, 4,4, 1,4))) FROM anytable LIMIT 1;
Hive results:
OK
POLYGON ((2 1, 2.6666666666666665 1, 2 3, 2 1))
Time taken: 0.191 seconds, Fetched: 1 row(s)
hive>