Why does a Hive query over partition info (supposedly stored in the metastore) take so long? - hive

I have an external table table1 created in HDFS containing a single partition column column1 of type string, and I am using Hive to get data from it.
The following query finishes in 1 second, as expected, since the data is present in the Hive metastore itself.
SHOW PARTITIONS table1;
The result of the above command also confirms that all partitions are present in the metastore.
I have also run MSCK REPAIR TABLE table1 to make sure all partition info is present in metastore.
But the query below takes 10 minutes to complete.
SELECT min(column1) from table1;
Why is this query running full MapReduce tasks just to determine the minimum value of partition column column1, when all the values are already present in the metastore?
There is one more use case where Hive checks the full table data and does not make use of partition information.
SELECT * FROM (SELECT * FROM table1 WHERE column1='abc') q1 INNER JOIN (SELECT * FROM table1 WHERE column1='xyz') q2 ON q1.column2=q2.column2
In such queries too, Hive does not make use of the partition info and scans all partitions, including ones like column1='jkl'.
Any pointers about this behaviour? I am not sure if the above two scenarios are due to the same reason.

It's because of the way the data is stored and accessed.
SHOW PARTITIONS table1; takes 1 second because that data comes straight from the metastore tables.
SELECT min(column1) from table1; takes minutes because the data comes from HDFS, and the result is calculated only after Hive goes through all the actual data.
To test it out, run EXPLAIN SELECT min(column1) from table1; and you will see that the query goes through all the partitions (and all the data) and then finds the min value. This is as good as checking all the data to find the min value. Please note that a partition is not an index; it is a set of separate physical folders that store the data files for quicker access.
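You can see those physical folders directly; a quick check from the Hive CLI (the warehouse path is taken from the explain output below, and the listing shown is illustrative):
hive> dfs -ls /user/hive/warehouse/tmp/;
drwxr-xr-x   ...   /user/hive/warehouse/tmp/college_marks=10.0
drwxr-xr-x   ...   /user/hive/warehouse/tmp/college_marks=50.0
Each partition value is just a sub-directory; there is no index structure from which min() could be answered.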
If you run EXPLAIN on the SQL, you will see it accessing every partition in the case of the min() query (I created partitions on a college_marks column):
Path -> Alias:
  hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0 [tmp]
  hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0 [tmp]
Path -> Partition:
  hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=10.0
    Partition
      base file name: college_marks=10.0
      input format: org.apache.hadoop.mapred.TextInputFormat
      ...
  hdfs://namenode:8020/user/hive/warehouse/tmp/college_marks=50.0
    Partition
      base file name: college_marks=50.0
      input format: org.apache.hadoop.mapred.TextInputFormat
      output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
      partition values:
        college_marks 50.0
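That said, because column1 is a partition column, some Hive versions can answer such a query from the metastore alone. A hedged sketch (hive.optimize.metadataonly is the relevant flag; its availability and default differ across Hive versions, and it can return wrong results when empty partitions exist, so verify on your build):
-- ask Hive to answer partition-column-only queries from metadata, if supported
SET hive.optimize.metadataonly=true;
SELECT min(column1) FROM table1;
-- or list the partitions (a metastore-only operation) and take the minimum yourself
SHOW PARTITIONS table1;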

Related

BigQuery scheduled query to load data into a particular partition

I am using the BigQuery scheduled query functionality to run a query every 30 minutes.
My destination table is a partitioned table and the partitioning column is 'event_date'.
The scheduled query copies today's data from source_table -> dest_table
(like select * from source_table where event_date = CURRENT_DATE())
every 30 minutes,
but I would like it to WRITE_TRUNCATE the existing partition without truncating the whole table (since I don't want to duplicate today's data every 30 minutes).
Currently, when I schedule this query with partition_field set to event_date and WRITE_TRUNCATE, it truncates the whole table and the previous data is lost. Is there something else that I am missing?
Instead of specifying a destination table, you may use MERGE to truncate only one partition.
It is unfortunately more expensive, since you also pay for deleting the data from dest_table (the insert is still free).
MERGE dest_table t
USING source_table s
ON FALSE  -- never matches, so every row falls into a NOT MATCHED branch
WHEN NOT MATCHED BY SOURCE AND t.event_date = CURRENT_DATE() THEN
  DELETE  -- clears today's partition in dest_table
WHEN NOT MATCHED BY TARGET THEN
  INSERT ROW  -- copies every source row into dest_table
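Note that the WHEN NOT MATCHED BY TARGET branch inserts every row of source_table, so the statement as written assumes the source holds only today's data. If the source also holds history, you would likely filter it first; a hedged variant:
MERGE dest_table t
USING (SELECT * FROM source_table WHERE event_date = CURRENT_DATE()) s
ON FALSE
WHEN NOT MATCHED BY SOURCE AND t.event_date = CURRENT_DATE() THEN DELETE
WHEN NOT MATCHED BY TARGET THEN INSERT ROW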

Redshift Spectrum Query - Request ran out of memory in the S3 query layer

I am trying to execute a query that groups on 26 columns. The data is stored in S3 in Parquet format, partitioned by day. The Redshift Spectrum query returns the error below, and I am not able to find any relevant AWS documentation about it.
Request ran out of memory in the S3 query layer
Total number of rows in table: 770 million
Total size of table in Parquet format: 45 GB
Number of records in each partition: 4.2 million
Redshift configuration: single node dc2.xlarge
Attached is the table DDL.
Try declaring the text columns in this table as VARCHAR rather than STRING. Also make sure to use the minimum possible VARCHAR size for the column to reduce the memory required by the GROUP BY.
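For illustration, a minimal sketch of such a DDL (the table, columns, and sizes here are hypothetical, not the asker's actual DDL):
CREATE EXTERNAL TABLE spectrum.my_events (
  event_id     BIGINT,
  country_code VARCHAR(2),    -- was STRING; sized down to the real data
  event_name   VARCHAR(64)    -- keep each VARCHAR as small as the data allows
)
STORED AS PARQUET
LOCATION 's3://my-bucket/my-events/';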
Also, two further suggestions:
Recommend always using at least 2 nodes of Redshift. This gives you a free leader node and allows your compute nodes to use all their RAM for query processing.
Grouping by so many columns is an unusual query pattern. If you are looking for duplicates in the table, consider hashing the columns into a single value and grouping on that. Here's an example:
-- note: if the *_sk columns are not character types they may need CAST(... AS VARCHAR),
-- and a NULL in any column makes the whole concatenation NULL
SELECT MD5(ws_sold_date_sk
||ws_sold_time_sk
||ws_ship_date_sk
||ws_item_sk
||ws_bill_customer_sk
||ws_bill_cdemo_sk
||ws_bill_hdemo_sk
||ws_bill_addr_sk
||ws_ship_customer_sk
||ws_ship_cdemo_sk
||ws_ship_hdemo_sk
||ws_ship_addr_sk
||ws_web_page_sk
||ws_web_site_sk
||ws_ship_mode_sk)
, COUNT(*)
FROM spectrum.web_sales
GROUP BY 1
ORDER BY 2 DESC
LIMIT 10
;

In time-partitioned BigQuery tables, when is data written to __UNPARTITIONED__? What are the effects?

I ran into some freak undocumented behavior of time-partitioned BigQuery tables:
I created a time-partitioned table in BigQuery and inserted data.
I was able to insert normally: data was written to today's partition (I was also able to explicitly specify a partition and write into it).
After some tests with new data, I deleted today's partition in order to have clean data (CLI):
bq --project_id=my-project rm v1.mytable$20160613
I then checked whether it's empty:
select count(*) from [v1.mytable]
Result 270 instead of 0
I tried deleting again and rerunning the query - same result.
So I queried
select count(*) from [v1.mytable$20160613]
Result 0
I also queried a couple of previous dates on which I may have inserted data, but all were 0.
Finally I ran
SELECT partition_id from [v1.mytable$__PARTITIONS_SUMMARY__];
and the result was
{ __UNPARTITIONED__, 20160609, 20160613 }
and all the data was in fact in __UNPARTITIONED__
My questions:
When is the data written to this special partition instead of the daily partition, and how can I avoid this?
Are there other effects, besides losing the ability to address specific dates (in queries, when deleting data, etc.)? Should I take special care of this case?
While data is in the streaming buffer, it remains in the __UNPARTITIONED__ partition. To address this partition in a query, you can use the value NULL for the _PARTITIONTIME pseudo column.
SELECT ... FROM mydataset.mypartitioned_table WHERE _PARTITIONTIME IS NULL
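In the asker's legacy SQL dialect the same check would look something like this (table name taken from the question):
SELECT COUNT(*) FROM [v1.mytable] WHERE _PARTITIONTIME IS NULL;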
To delete data for a given partition, we suggest doing a write truncate to it with a query that returns an empty result. For example:
bq query --destination_table=mydataset.mypartitionedtable\$20160121 --replace 'SELECT 1 as field1, "one" as field2 FROM (SELECT 1 as field1, "one" as field2) WHERE FALSE'
Note that the partition will still be around (if you do a SELECT * from [table$__PARTITIONS_SUMMARY__]), but it will have 0 rows.
$ bq query 'SELECT COUNT(*) from [mydataset.mypartitionedtable$20160121]'
+-----+
| f0_ |
+-----+
|   0 |
+-----+
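On current BigQuery, standard SQL DML can also delete a single partition directly (this postdates the original answer, and DML cannot touch rows that are still in the streaming buffer):
DELETE FROM mydataset.mypartitionedtable
WHERE _PARTITIONTIME = TIMESTAMP('2016-01-21');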
The __UNPARTITIONED__ state is temporary -- querying an hour later, the records all belonged to today's partition.
The effect is thus similar to a delay in the data write: a query run immediately after the insert may not see the most recent data in the correct partition, but eventually it will.

Can two Hive Partitions Share One Set of Files?

A typical question is whether a Hive partition can be made up of multiple files. My question is the inverse: can multiple Hive partitions point to the same file? I'll start with what I mean, then the use case.
What I mean:
Hive Partition    File Name
20120101          /file/location/201201/file1.tsv
20120102          /file/location/201201/file1.tsv
20120103          /file/location/201201/file1.tsv
The Use Case: Over the past many years, we've been loading data into Hive in monthly format. So it looked like this:
Hive Partition    File Name
201201            /file/location/201201/file1.tsv
201202            /file/location/201202/file1.tsv
201203            /file/location/201203/file1.tsv
But now the months are too large, so we need to partition by day. So we want the new files starting with 201204 to be daily:
Hive Partition    File Name
20120401          /file/location/20120401/file1.tsv
20120402          /file/location/20120402/file1.tsv
20120403          /file/location/20120403/file1.tsv
But we want all the existing partitions to be redone as daily as well, so we would partition them as I propose above. I suspect this would actually work without a problem, except that Hive might re-read the same data file N times, once for each additional partition defined against the file. For example, in the very first "What I Mean" code block above, partitions 20120101..20120103 all point to file 201201/file1.tsv. So if the query has:
and partitionName >= '20120101' and partitionName <= '20120103'
Would it read "201201/file1.tsv" three times to answer the query? Or will Hive be smart enough to know it's only necessary to scan "201201/file1.tsv" once?
It looks like Hive will only scan the file(s) once. I finally decided to just give it a shot, run a query, and find out.
First, I set up my data set like this in the filesystem:
tableName/201301/splitFile-201301-xaaaa.tsv.gz
tableName/201301/splitFile-201301-xaaab.tsv.gz
...
tableName/201301/splitFile-201301-xaaaq.tsv.gz
Note that even though I have many files, this is equivalent for Hive to having one giant file for the purposes of this question. If it makes it easier, pretend I just pasted a single file above.
Then I set up my Hive table with partitions like this:
alter table tableName add partition ( dt = '20130101' ) location '/tableName/201301/' ;
alter table tableName add partition ( dt = '20130102' ) location '/tableName/201301/' ;
...
alter table tableName add partition ( dt = '20130112' ) location '/tableName/201301/' ;
The total size of my files in tableName/201301 was about 791,400,000 bytes (I just eyeballed the numbers and did basic math). I ran the job:
hive> select dt,count(*) from tableName where dt >= '20130101' and dt <= '20130112' group by dt ;
The JobTracker reported:
Counter       Map           Reduce    Total
Bytes Read    795,308,244   0         795,308,244
So it only read the data once. HOWEVER... the query output was all jacked:
20130112 392606124
So it thinks there was only one "dt", and that was the final "partition", and it had all rows. So you have to be very careful including "dt" in your queries when you do this, it would appear.
Hive will effectively process the data multiple times; the earlier answer was incorrect on this point. More precisely, Hive reads the file once but generates "duplicate" records: the partition columns are included in each returned record, so for each record in the file you get multiple records out of Hive, each with different partition values.
Do you have any way to recover the actual day from the earlier data? If so, the ideal way to do things would be to totally repartition all the old data. That's painful, but it's a one-time cost and would save you having a really weird Hive table.
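If the day is recoverable (say, from a date column in the data), the repartitioning itself can be a single dynamic-partition insert; a rough sketch with hypothetical table and column names:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE tableName_daily PARTITION (dt)
SELECT col1, col2, event_date AS dt   -- partition column must come last in the select
FROM tableName_monthly;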
You could also move to having two Hive tables: the "old" one partitioned by month, and the "new" one partitioned by day. Users could then do a union on the two when querying, or you could create a view that does the union automatically.
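Such a view might look like this (a sketch; the table and column names are hypothetical, and both tables are assumed to expose comparable columns):
CREATE VIEW tableName_all AS
SELECT col1, col2, dt FROM tableName_monthly
UNION ALL
SELECT col1, col2, dt FROM tableName_daily;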

Why does Hive use files from other folders under the partitioned table?

I have a simple table in my Hive. It has only one partition:
show partitions hive_test;
OK
pt=20130805000000
Time taken: 0.124 seconds
But when I execute a simple SQL query, it tries to find the data file under a different folder, 20130101000000. Why doesn't it just use the files under 20130805000000?
The SQL:
SELECT buyer_id AS USER_ID from hive_test limit 1;
and this is the exception:
java.io.IOException: /group/myhive/test/hive/hive_test/pt=20130101000000/data
doesn't exist!
at org.apache.hadoop.hdfs.DFSClient.listPathWithLocations(DFSClient.java:1045)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:352)
at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listLocatedStatus(ChRootedFileSystem.java:270)
at org.apache.hadoop.fs.viewfs.ViewFileSystem.listLocatedStatus(ViewFileSystem.java:851)
at org.apache.hadoop.hdfs.Yunti3FileSystem.listLocatedStatus(Yunti3FileSystem.java:349)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listLocatedStatus(SequenceFileInputFormat.java:49)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:242)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:261)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1238)
And my question is: why does Hive try to find the file "/group/myhive/test/hive/hive_test/pt=20130101000000/data", and not "/group/myhive/test/hive/hive_test/pt=20130101000000/"?
You are getting this error because you have created a partition on your Hive table but are not specifying the partition name in your select statement.
In Hive's implementation of partitioning, data within a table is split across multiple partitions. Each partition corresponds to a particular value (or values) of the partition column(s) and is stored as a sub-directory within the table's directory on HDFS. When the table is queried, where applicable, only the required partitions of the table are read.
Please provide the partition name in your select query, like this:
select buyer_id AS USER_ID from hive_test where pt='20130805000000' limit 1;
Please see the link to learn more about Hive partitioning.