We had an issue with our ingestion process that would result in partitions being added to a table in Hive, but the path in HDFS didn't actually exist. We've fixed that issue, but we still have these bad partitions. When querying these tables using Tez, we get a FileNotFoundException pointing to the location in HDFS that doesn't exist. If we use MR instead of Tez, the query works (which is very confusing to me), but it's too slow.
Is there a way to list all the partitions that have this problem? MSCK REPAIR seems to handle the opposite problem, where the data exists in HDFS but there is no partition in Hive.
EDIT: More info.
Here's the output of the file not found exception:
java.io.FileNotFoundException: File hdfs://<server>/db/tables/2016/03/14/mytable does not exist.
If I run show partitions <db.mytable>, I'll get all the partitions, including one for dt=2016-03-14.
show table extended like '<db.mytable>' partition(dt='2016-03-14') returns the same location:
location:hdfs://server/db/tables/2016/03/14/mytable.
MSCK REPAIR TABLE <tablename> does not do this. I ran into the same issue and found a workaround.
The msck repair command adds partitions based on the directories that exist, so first drop all partitions:
hive> ALTER TABLE mytable DROP IF EXISTS PARTITION (p<>'');
The above command removes all partitions. This is only safe for an external table; on a managed table, dropping a partition also deletes its data.
Then run msck repair, which recreates the partitions from the directories present at the table location:
hive> msck repair table mytable;
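Before dropping every partition, it may be worth checking which partition locations are actually missing. The following is only a minimal Python sketch. It assumes the partition locations have already been collected by hand (for example from the show table extended output in the question) and that the hdfs command-line client is on the PATH. The helper names hdfs_path_exists and missing_locations are hypothetical.

```python
import subprocess


def hdfs_path_exists(path):
    """Return True if `hdfs dfs -test -e` exits with 0, i.e. the path exists."""
    return subprocess.call(["hdfs", "dfs", "-test", "-e", path]) == 0


def missing_locations(locations, exists=hdfs_path_exists):
    """Return the partition names whose location fails the existence check.

    `locations` maps a partition name (e.g. "dt=2016-03-14") to its location.
    `exists` is injectable so the logic can be tried without a cluster.
    """
    return [name for name, path in locations.items() if not exists(path)]
```

Passing a different exists function (for instance one backed by a local set of known paths) makes it possible to dry-run the check without touching HDFS.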
It seems MSCK REPAIR TABLE does not drop partitions that point to missing directories, but it does list them (see the "Partitions missing from filesystem:" line in the output below), so with a little scripting or manual work you can drop them based on that list.
hive> create table mytable (i int) partitioned by (p int);
OK
Time taken: 0.539 seconds
hive> !mkdir mytable/p=1;
hive> !mkdir mytable/p=2;
hive> !mkdir mytable/p=3;
hive> msck repair table mytable;
OK
Partitions not in metastore: mytable:p=1 mytable:p=2 mytable:p=3
Repair: Added partition to metastore mytable:p=1
Repair: Added partition to metastore mytable:p=2
Repair: Added partition to metastore mytable:p=3
Time taken: 0.918 seconds, Fetched: 4 row(s)
hive> show partitions mytable;
OK
p=1
p=2
p=3
Time taken: 0.331 seconds, Fetched: 3 row(s)
hive> !rmdir mytable/p=1;
hive> !rmdir mytable/p=2;
hive> !rmdir mytable/p=3;
hive> msck repair table mytable;
OK
Partitions missing from filesystem: mytable:p=1 mytable:p=2 mytable:p=3
Time taken: 0.425 seconds, Fetched: 1 row(s)
hive> show partitions mytable;
OK
p=1
p=2
p=3
Time taken: 0.56 seconds, Fetched: 3 row(s)
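The "little scripting" mentioned above could look roughly like the following Python sketch. It assumes the MSCK output has been captured as text and that the missing partitions appear on one line in the table:key=value form shown in the transcript. The function name drop_statements is hypothetical, and partition values are quoted as strings without any unescaping, which may need adjusting for values containing special characters.

```python
MISSING_PREFIX = "Partitions missing from filesystem:"


def drop_statements(msck_output):
    """Build ALTER TABLE ... DROP PARTITION statements from MSCK output."""
    statements = []
    for line in msck_output.splitlines():
        line = line.strip()
        if not line.startswith(MISSING_PREFIX):
            continue
        # Entries look like "mytable:p=1" or "mytable:dt=2016-03-14/hr=01".
        for entry in line[len(MISSING_PREFIX):].split():
            table, _, spec = entry.partition(":")
            pairs = [kv.split("=", 1) for kv in spec.split("/")]
            clause = ", ".join("{}='{}'".format(k, v) for k, v in pairs)
            statements.append(
                "ALTER TABLE {} DROP IF EXISTS PARTITION ({});".format(table, clause)
            )
    return statements
```

The generated statements can be reviewed by hand and then run in the Hive shell.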
I have an external table table1 in HDFS with a single partition column, column1, of type string, and I am using Hive to query it.
The following query finishes in 1 second, as expected, because the data is present in the Hive metastore itself.
SHOW PARTITIONS table1;
The result of the above command also confirms that all partitions are present in the metastore.
I have also run MSCK REPAIR TABLE table1 to make sure all partition info is present in the metastore.
But the query below takes 10 minutes to complete.
SELECT min(column1) from table1;
Why does this query run full MapReduce tasks just to determine the minimum value of the partition column column1 when all the values are already present in the metastore?
There is one more use case where Hive scans the full table data and does not make use of the partition information.
SELECT * FROM (SELECT * FROM table1 WHERE column1='abc') q1 INNER JOIN (SELECT * FROM table1 WHERE column1='xyz') q2 ON q1.column2==q2.column2
In queries like this, Hive also ignores the partition info and scans every partition, including unrelated ones such as column1='jkl'.
Any pointers on this behaviour? I am not sure whether the two scenarios above have the same cause.
It's because of the way the data is stored and accessed.
SHOW PARTITIONS table1; takes 1 second because the data comes straight from the metadata tables.
SELECT min(column1) from table1; takes minutes because the data comes from HDFS and the result is computed only after Hive has gone through all the actual data.
To test this, run explain SELECT min(column1) from table1; and you will see that the query goes through all the partitions (and all the data) and then finds the min value. This is as expensive as checking all the data to find the min value. Please note that a partition isn't an index; it is a separate physical folder that stores data files for quicker access.
If you run explain on the SQL, you will see that the min() query accesses every partition (I created partitions on a random college_marks column):
Path -> Alias:
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0 [tmp]
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0 [tmp]
Path -> Partition:
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0
Partition
base file name: college_marks=10.0
input format: org.apache.hadoop.mapred.TextInputFormat
hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0
Partition
base file name: college_marks=50.0
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
partition values:
college_marks 50.0
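Since the partition values themselves come from the metadata, one possible workaround is to compute the minimum from the SHOW PARTITIONS output instead of scanning the data. The Python sketch below assumes that output has been captured as text with one key=value[/key=value...] spec per line. The function name partition_values is hypothetical. Note that this can differ from SELECT min(column1) if some partitions exist in the metastore but contain no rows.

```python
from urllib.parse import unquote


def partition_values(show_partitions_output, column):
    """Collect the values of one partition column from SHOW PARTITIONS output."""
    values = []
    for line in show_partitions_output.splitlines():
        # A line looks like "column1=abc" or "dt=2016-03-14/hr=01".
        for part in line.strip().split("/"):
            key, sep, value = part.partition("=")
            if sep and key == column:
                # Partition names are path-escaped, so undo %XX escapes.
                values.append(unquote(value))
    return values


sample = "column1=abc\ncolumn1=xyz\ncolumn1=jkl\n"
print(min(partition_values(sample, "column1")))  # prints abc
```

The comparison here is a plain string comparison, which matches the string type of column1 in the question.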
We setup a streaming insert into a BQ table that is partitioned like this (ingestion-time partitioned):
Table Type Partitioned
Partitioned by DAY
Partitioned on field _PARTITIONTIME
Partition expiration
Partition filter Required
We know the table had fresh data today because a preview in the BQ console showed rows being added to the table.
For several hours after the start of the UTC day, the following query returned a 0 B result:
select * from `MQTT_trackers_partitioned`
WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2022-06-01') AND TIMESTAMP('2022-06-02');
At about 13:30 UTC time today, this same query now shows "This query will process 285.99 KB when run." and worked fine.
Why did BQ take so long (13 hours!) to make the partitioned table data available to the query? We are inserting data every minute, 24x7, into this dataset, so I would expect closer to "real-time" query behaviour given how frequent the streaming inserts are. Are we missing some other detail to make this work?
I am trying to see statistics on a particular column. I executed the ANALYZE command first and then tried to view the stats with DESCRIBE FORMATTED <table_name> <col_name>.
I can't see any values in the output. Any idea why it's not showing any values?
I tried MSCK, analyzed the table again, and checked for stats. No luck so far.
hive> desc extended testdb.table order_dispatch_diff;
OK
order_dispatch_diff int from deserializer
Time taken: 0.041 seconds, Fetched: 1 row(s)
Try it with the FOR COLUMNS clause:
ANALYZE TABLE testdb.table COMPUTE STATISTICS FOR COLUMNS;
Then use DESCRIBE FORMATTED testdb.table order_dispatch_diff; to display statistics.
See Column Statistics docs for more details.
The statement below finally worked for me.
hive> desc formatted testdb.table col_name partition (data_dt='20180715');
I wanted to understand the UDF WeekOfYear and how it determines the first week. I had to artificially hit a table to run the query. I would like to compute the values without hitting a table. Secondly, can I look at the UDF source code?
SELECT weekofyear
('12-31-2013')
from a;
Since Hive 0.13.0, you do not need a table to test a UDF.
See this Jira: HIVE-178 SELECT without FROM should assume a one-row table with no columns
Test:
hive> SELECT weekofyear('2013-12-31');
Result:
1
The source code (master branch) is here: UDFWeekOfYear.java
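As far as I can tell from the UDFWeekOfYear source, the calendar is configured with Monday as the first day of the week and a minimum of four days in the first week, which corresponds to ISO 8601 week numbering. Under that assumption, the result above can be cross-checked outside Hive with a small Python sketch. The function name iso_week_of_year is hypothetical.

```python
from datetime import date


def iso_week_of_year(year, month, day):
    """Return the ISO 8601 week number for the given calendar date."""
    return date(year, month, day).isocalendar()[1]


# 2013-12-31 is a Tuesday in the week that contains Thursday 2014-01-02,
# so it already belongs to week 1 of 2014.
print(iso_week_of_year(2013, 12, 31))  # prints 1
```

Under the same rule, a date such as 2016-01-01 (a Friday) falls into week 53 of the previous year rather than week 1.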
If you are a Java developer, you can write JUnit test cases to test the UDFs.
You can search the source code of all Hive built-in functions on grepcode.
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.hive/hive-exec/1.0.0/org/apache/hadoop/hive/ql/udf/UDFWeekOfYear.java
I don't think executing a UDF without hitting a table is possible in Hive.
Even Hive developers hit a table when testing UDFs.
To make query run faster you can:
Create table with only one row and run UDF queries on this table
Run Hive in local mode.
Hive source code is located here.
UDFWeekOfYear source is here.
You should be able to use any table with at least one row to test functions.
Here is an example using a few custom functions that perform work and output a string result.
Replace anytable with an actual table.
SELECT ST_AsText(ST_Intersection(ST_Polygon(2,0, 2,3, 3,0), ST_Polygon(1,1, 4,1, 4,4, 1,4))) FROM anytable LIMIT 1;
Hive results:
OK
POLYGON ((2 1, 2.6666666666666665 1, 2 3, 2 1))
Time taken: 0.191 seconds, Fetched: 1 row(s)
hive>
I have a simple table in my Hive. It has only one partition:
show partitions hive_test;
OK
pt=20130805000000
Time taken: 0.124 seconds
But when I execute a simple SQL query, Hive tries to find the data file under the folder 20130805000000. Why doesn't it just use the file 20130805000000?
SQL:
SELECT buyer_id AS USER_ID from hive_test limit 1;
and this is the exception:
java.io.IOException: /group/myhive/test/hive/hive_test/pt=20130101000000/data doesn't exist!
at org.apache.hadoop.hdfs.DFSClient.listPathWithLocations(DFSClient.java:1045)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:352)
at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listLocatedStatus(ChRootedFileSystem.java:270)
at org.apache.hadoop.fs.viewfs.ViewFileSystem.listLocatedStatus(ViewFileSystem.java:851)
at org.apache.hadoop.hdfs.Yunti3FileSystem.listLocatedStatus(Yunti3FileSystem.java:349)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listLocatedStatus(SequenceFileInputFormat.java:49)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:242)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:261)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1238)
My question is: why does Hive try to find the file "/group/myhive/test/hive/hive_test/pt=20130101000000/data" rather than "/group/myhive/test/hive/hive_test/pt=20130101000000/"?
You are getting this error because the table is partitioned but your SELECT statement does not specify a partition.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried.
Please specify the partition in your SELECT query, like this:
select buyer_id AS USER_ID from hive_test where pt='20130805000000' limit 1;
Please see Link to know more about hive partition.
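To make the directory layout described above concrete, here is a minimal Python sketch of how a partition spec maps to a sub-directory under the table location. It assumes Hive's default key=value layout (a partition can also be given a custom location, which this does not cover), and the function name partition_dir is hypothetical.

```python
import posixpath


def partition_dir(table_location, partition_spec):
    """Join the table location with one key=value directory per partition column.

    `partition_spec` is an ordered list of (column, value) pairs.
    """
    parts = ["{}={}".format(key, value) for key, value in partition_spec]
    return posixpath.join(table_location, *parts)


print(partition_dir("/group/myhive/test/hive/hive_test", [("pt", "20130805000000")]))
# prints /group/myhive/test/hive/hive_test/pt=20130805000000
```

With a filter such as pt='20130805000000', only the directory for that partition needs to be read.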