I have a simple table in my Hive. It has only one partition:
show partitions hive_test;
OK
pt=20130805000000
Time taken: 0.124 seconds
But when I execute a simple SQL query, it fails while trying to locate the data file under the partition folder. Why doesn't it just read the folder 20130805000000?
The SQL:
SELECT buyer_id AS USER_ID from hive_test limit 1;
and this is the exception:
java.io.IOException: /group/myhive/test/hive/hive_test/pt=20130101000000/data
doesn't exist!
at org.apache.hadoop.hdfs.DFSClient.listPathWithLocations(DFSClient.java:1045)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:352)
at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listLocatedStatus(ChRootedFileSystem.java:270)
at org.apache.hadoop.fs.viewfs.ViewFileSystem.listLocatedStatus(ViewFileSystem.java:851)
at org.apache.hadoop.hdfs.Yunti3FileSystem.listLocatedStatus(Yunti3FileSystem.java:349)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listLocatedStatus(SequenceFileInputFormat.java:49)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:242)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:261)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1238)
And my question is why does hive try to find file "/group/myhive/test/hive/hive_test/pt=20130101000000/data", but not "/group/myhive/test/hive/hive_test/pt=20130101000000/"?
You are getting this error because you created a partition on your Hive table but did not specify the partition in your SELECT statement.
In Hive’s implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value(s) of partition column(s) and is stored as a sub-directory within the table’s directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are queried.
Please provide the partition name in your select query, like this:
select buyer_id AS USER_ID from hive_test where pt='20130805000000' limit 1;
Please see the Hive documentation to learn more about Hive partitions.
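As a sketch of how this works (the DDL and the warehouse path here are assumptions, not taken from the question), each partition value becomes its own sub-directory under the table's directory, and a predicate on the partition column tells Hive exactly which directory to read:

```sql
-- Hypothetical DDL: each pt value is stored as a sub-directory,
-- e.g. .../hive_test/pt=20130805000000/
CREATE TABLE hive_test (buyer_id STRING)
PARTITIONED BY (pt STRING);

-- With a predicate on the partition column, Hive lists and reads
-- only the matching sub-directory instead of every partition:
SELECT buyer_id AS USER_ID
FROM hive_test
WHERE pt = '20130805000000'
LIMIT 1;
```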
I've recently moved to using AvroSerDe for my External tables in Hive.
Select col_name,count(*)
from table
group by col_name;
The above query gives me a count, whereas the query below does not:
Select count(*)
from table;
The reason is that Hive just looks at the table metadata and fetches the value from there. For some reason the statistics for the table were not updated, which is why count(*) returns 0.
The statistics are written with zero data rows at table-creation time, and for any data appends/changes Hive needs to update these statistics in the metadata.
Running the ANALYZE command gathers statistics and writes them into the Hive metastore:
ANALYZE TABLE table_name COMPUTE STATISTICS;
Visit Apache Hive wiki for more details about ANALYZE command.
Other methods to solve this issue:
1. Using a 'limit' or 'group by' clause triggers a MapReduce job to count the rows and gives the correct value.
2. Setting fetch task conversion to none forces Hive to run a MapReduce job to count the rows:
hive> set hive.fetch.task.conversion=none;
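Putting the pieces together, a minimal session might look like this (the table name is hypothetical):

```sql
-- Force a real MapReduce count instead of a metadata-only answer
SET hive.fetch.task.conversion=none;
SELECT count(*) FROM my_avro_table;

-- Or refresh the statistics so metadata-based counts are correct again
ANALYZE TABLE my_avro_table COMPUTE STATISTICS;
SELECT count(*) FROM my_avro_table;
```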
Is there any way to limit the number of Hive partitions while listing the partitions in show command?
I have a Hive table with around 500 partitions and I want the latest partition alone, but the show command lists all of them. I am using this partition to find the location details, since the partition location is where the actual data resides, and I do not have access to the metastore to query the details directly.
I tried set hive.limit.query.max.table.partition=1 but this does not affect the metastore query. So, is there any other way to limit the partitions listed?
Thank you,
Revathy.
Are you running from the command line?
If so, you can get your desired result with something like this:
hive -e "set hive.cli.print.header=false;show partitions table_name;" | tail -1
There is a "BAD" way to obtain what you want: you can treat the partition columns like ordinary columns and extract them in a SELECT with a LIMIT. Note the descending sort, so that the latest partition comes first:
SELECT DISTINCT partition_column
FROM partitioned_table
ORDER BY partition_column DESC
LIMIT 1;
The only way to filter SHOW PARTITIONS is with a PARTITION clause:
SHOW PARTITIONS partitioned_table PARTITION ( partitioned_column = "somevalue" );
Is there a way of getting a list of the partitions in a BigQuery date-partitioned table? Right now the best way I have found of do this is using the _PARTITIONTIME meta-column, but this needs to scan all the rows in all the partitions. Is there an equivalent to a show partitions call or maybe something in the bq command-line tool?
To list partitions in a table, query the table's summary partition by using the partition decorator separator ($) followed by PARTITIONS_SUMMARY. For example, the following command retrieves the partition IDs for table1:
SELECT partition_id from [mydataset.table1$__PARTITIONS_SUMMARY__];
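On newer BigQuery versions, standard SQL exposes the same information through the INFORMATION_SCHEMA views (assuming they are available for your dataset), which also avoids scanning table data:

```sql
-- Hypothetical dataset/table names; lists partition IDs and row counts
SELECT partition_id, total_rows
FROM mydataset.INFORMATION_SCHEMA.PARTITIONS
WHERE table_name = 'table1';
```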
I need help with the issue below.
I need to delete rows from a table that has a huge amount of data inserted daily. I have written a procedure that deletes rows based on an indexed column, which to me should be enough, but my colleague suggested that I also use a date column in the delete statement, since the table is partitioned on date and the delete would then use the date partition.
My question is which delete statement would be faster.
E.g.:
1. Column: FILE_NAME (has an index)
delete from table_name where column_name1=file_name
2. Columns: FILE_NAME (has an index) and TXN_DATE (no index; the partition is on this column)
delete from table_name where column_name1=file_name and txn_date=date_value
Please advise.
Thanks
Yes, your colleague is right. The second query will be quicker.
The process is called partition pruning. Filtering on the column on which the partitions were created automatically hits only the partitions where the matching data lives.
You can also reference the partition directly, if you can determine the partition name for the date_value:
DELETE FROM table_name
PARTITION (partition_date_value)
WHERE column_name1=file_name;
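If you want to verify that the date predicate actually prunes, you can inspect the execution plan, a sketch assuming the standard DBMS_XPLAN package is available; look for PARTITION RANGE SINGLE rather than PARTITION RANGE ALL in the output:

```sql
-- Hypothetical bind variables matching the example above
EXPLAIN PLAN FOR
  DELETE FROM table_name
  WHERE column_name1 = :file_name
    AND txn_date = :date_value;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
```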
References:
Examples for DELETE on Oracle Database SQL Language Reference
Partition Pruning
If FILE_NAME has an index that actually helps navigate your table, I think the first statement would be faster.
A typical question is can a Hive partition be made up of multiple files. My question is the inverse. Can multiple Hive partitions point to the same file? I'll start with what I mean, then the use case.
What I mean:
Hive Partition File Name
20120101 /file/location/201201/file1.tsv
20120102 /file/location/201201/file1.tsv
20120103 /file/location/201201/file1.tsv
The Use Case: Over the past many years, we've been loading data into Hive in monthly format. So it looked like this:
Hive Partition File Name
201201 /file/location/201201/file1.tsv
201202 /file/location/201202/file1.tsv
201203 /file/location/201203/file1.tsv
But now the months are too large, so we need to partition by day. So we want the new files starting with 201204 to be daily:
Hive Partition File Name
20120401 /file/location/20120401/file1.tsv
20120402 /file/location/20120402/file1.tsv
20120403 /file/location/20120403/file1.tsv
But we want all the existing partitions to be redone as daily as well, so we would partition them as I propose above. I suspect this would actually work fine, except that Hive might re-read the same data file N times, once for each additional partition defined against it. For example, in the very first "What I Mean" code block above, partitions 20120101..20120103 all point to file 201201/file1.tsv. So if the query has:
and partitionName >= '20120101' and partitionName <= '20120103'
Would it read "201201/file1.tsv" three times to answer the query? Or will Hive be smart enough to know it's only necessary to scan "201201/file1.tsv" once?
It looks like Hive will only scan the file(s) once. I finally decided to just give it a shot and run a query and find out.
First, I set up my data set like this in the filesystem:
tableName/201301/splitFile-201301-xaaaa.tsv.gz
tableName/201301/splitFile-201301-xaaab.tsv.gz
...
tableName/201301/splitFile-201301-xaaaq.tsv.gz
Note that even though I have many files, this is equivalent for Hive to having one giant file for the purposes of this question. If it makes it easier, pretend I just pasted a single file above.
Then I set up my Hive table with partitions like this:
alter table tableName add partition ( dt = '20130101' ) location '/tableName/201301/' ;
alter table tableName add partition ( dt = '20130102' ) location '/tableName/201301/' ;
...
alter table tableName add partition ( dt = '20130112' ) location '/tableName/201301/' ;
The total size of my files in tableName/201301 was about 791,400,000 bytes (I just eyeballed the numbers and did basic math). I ran the job:
hive> select dt,count(*) from tableName where dt >= '20130101' and dt <= '20130112' group by dt ;
The JobTracker reported:
Counter Map Reduce Total
Bytes Read 795,308,244 0 795,308,244
So it only read the data once. HOWEVER... the query output was all jacked:
20130112 392606124
So it thinks there was only one "dt", the final partition, and that it had all the rows. It would appear you have to be very careful about including "dt" in your queries when you do this.
Update: the earlier conclusion was incorrect. Hive does read the file only once, but it generates "duplicate" records: the partition columns are included in each record, so for every record in the file you get multiple records in Hive, each with different partition values.
Do you have any way to recover the actual day from the earlier data? If so, the ideal way to do things would be to totally repartition all the old data. That's painful, but it's a one-time cost and would save you having a really weird Hive table.
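If the original day is recoverable, say from a timestamp column in the data, the repartitioning could be sketched with a dynamic-partition insert; the table and column names here are hypothetical:

```sql
-- Rebuild daily partitions from the monthly table, assuming an
-- event_date column from which the daily dt value can be derived
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE daily_table PARTITION (dt)
SELECT col1, col2, event_date AS dt
FROM monthly_table;
```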
You could also move to having two Hive tables: the "old" one partitioned by month, and the "new" one partitioned by day. Users could then do a union on the two when querying, or you could create a view that does the union automatically.
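The two-table approach could be wired up with a view along these lines (table and column names are hypothetical; the monthly table's partition value is simply exposed in place of a daily one):

```sql
-- Present the old monthly table and the new daily table as one
CREATE VIEW combined_table AS
SELECT col1, col2, dt FROM daily_table
UNION ALL
SELECT col1, col2, month_partition AS dt FROM monthly_table;
```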