As far as I know, Hive partitioning can reduce the number of input files if you use the partition column in the WHERE clause. For example, in my table t I defined a partition column named date_entry (type string, which stores a timestamp in milliseconds).
select count(*) from t
where date_entry >= (unix_timestamp() - 2 * 24 * 3600) * 1000
I tried executing this query expecting it to filter out some files via the WHERE clause, but it does not.
If I don't use the function unix_timestamp(), it works.
Does anybody know why, or can you suggest a workaround?
Related
I have the following UDF available in Hive to convert a bigint time value to a date:
to_date(from_utc_timestamp(from_unixtime(cast(listed_time/1000 AS bigint)),'PST'))
I want to use this UDF to query a table on a specific date. Something like:
SELECT * FROM <table_name>
WHERE date = '2020-03-01'
ORDER BY <something>
LIMIT 10
I would suggest changing the logic: avoid applying the function to the column being filtered, because it is an inefficient approach. The function has to be invoked for every row, which prevents the query from benefiting from an index.
Instead, you can simply convert the input date to a unix timestamp (possibly with a UDF). This should look like:
SELECT * FROM <table_name>
WHERE listed_time = unix_timestamp(to_utc_timestamp('2020-03-01', 'PST')) * 1000
ORDER BY <something>
LIMIT 10
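Note that an exact equality match against a single millisecond value will rarely return rows if listed_time stores full timestamps; a half-open range over the day is usually safer. A sketch, assuming listed_time holds UTC epoch milliseconds and the Hive server timezone is UTC:
SELECT * FROM <table_name>
WHERE listed_time >= unix_timestamp(to_utc_timestamp('2020-03-01', 'PST')) * 1000  -- start of the day, PST
  AND listed_time <  unix_timestamp(to_utc_timestamp('2020-03-02', 'PST')) * 1000  -- start of the next day
ORDER BY <something>
LIMIT 10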
I have a partitioned table and would love to use a MERGE statement, but for some reason it doesn't work out.
MERGE `wr_live.p_email_event` t
using `wr_live.email_event` s
on t.user_id=s.user_id and t.event=s.event and t.timestamp=s.timestamp
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
values (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
I get
Cannot query over table 'wr_live.p_email_event' without a filter that
can be used for partition elimination.
What's the proper syntax? Also, is there a way to express the INSERT part more concisely, without naming all the columns?
What's the proper syntax?
As you can see from the error message, your partitioned wr_live.p_email_event table was created with require partition filter set to true. This means that any query over this table must have a filter on the respective partitioning field.
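For reference, with current BigQuery DDL a table ends up in this state when it is created with require_partition_filter in its options. A hypothetical sketch (the column types are guesses based on your INSERT list, not your actual schema):
CREATE TABLE `wr_live.p_email_event` (
  user_id STRING,
  event STRING,
  engagement_score FLOAT64,
  dest_email_domain STRING,
  timestamp TIMESTAMP,
  tags STRING,
  meta STRING
)
PARTITION BY DATE(timestamp)
-- this option is what causes the "partition elimination" error on unfiltered queries
OPTIONS (require_partition_filter = TRUE);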
Assuming that timestamp IS that partitioning field, you can do something like below:
MERGE `wr_live.p_email_event` t
USING `wr_live.email_event` s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
So you need to write the line below such that, in practice, it does not filter out any rows that need to be involved in the merge:
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
For example, I found that setting it to a timestamp in the future addresses the issue in many cases, like:
AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
Of course, if your wr_live.email_event table is also partitioned with require partition filter set to true, you need to add the same kind of filter on s.timestamp.
Also is there a way I can express shorter the insert stuff? without naming all columns?
BigQuery DML's INSERT requires column names to be specified; there is no way (at least that I am aware of) to avoid this with an INSERT statement.
Meanwhile, you can avoid it by using DDL's CREATE TABLE from the result of a query, which does not require listing the columns.
For example, something like below:
CREATE OR REPLACE TABLE `wr_live.p_email_event`
PARTITION BY DATE(timestamp) AS
SELECT * FROM `wr_live.p_email_event`
WHERE DATE(timestamp) <> DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
UNION ALL
SELECT * FROM `wr_live.email_event` s
WHERE NOT EXISTS (
  SELECT 1 FROM `wr_live.p_email_event` t
  WHERE t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
  AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
)
You might also want to include the table options list via OPTIONS(), but it looks like the require_partition_filter attribute is not supported there yet, so if you do have/need it, the above will "erase" this attribute :o(
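If that happens, in current BigQuery the attribute can be restored afterwards with ALTER TABLE SET OPTIONS:
-- re-enable the partition filter requirement after recreating the table
ALTER TABLE `wr_live.p_email_event`
SET OPTIONS (require_partition_filter = TRUE);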
My external table auto1_tracking_events_ext is partitioned on a column dt.
First I execute:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
When I run this query:
select count(*)
from auto1_tracking_events_ext
where dt = '2016-12-05';
It picks up the partition, creates maybe 3 mappers, and finishes in a couple of seconds.
However, if I run this:
select count(*)
from auto1_tracking_events_ext
where dt = from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');
It does not pick up the partition, starts 413 mappers, and takes quite some time to compute.
At the time of posting this question:
hive> select from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');
OK
2016-12-05
Why does Hive not pick up the partition?
UPDATE:
Passing the date string as a hiveconf parameter (as shown below) does not help either.
hive -hiveconf date_yesterday=$(date --date yesterday "+%Y-%m-%d")
hive> select count(*) from auto1_tracking_events_ext where dt = ${hiveconf:date_yesterday};
Your last query, passing a hiveconf variable, should work if the first query works, because variables are substituted first and only after that is the query executed. There is one possible bug: you did not quote the variable. Try this:
hive -hiveconf date_yesterday=$(date --date yesterday "+%Y-%m-%d")
hive> select count(*) from auto1_tracking_events_ext where dt = '${hiveconf:date_yesterday}'; --single quotes here
Without quotes it resolves to where dt=2020-12-12, which Hive evaluates as an arithmetic expression (2020 minus 12 minus 12, i.e. the integer 1996) rather than a date string; it needs single quotes.
As for using unix_timestamp(): the function is not deterministic, which prevents proper query optimization, including partition pruning.
Use current_date or current_timestamp instead:
select count(*)
from auto1_tracking_events_ext
where dt = date_sub(current_date,1);
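If you want to verify that pruning happens, Hive's EXPLAIN DEPENDENCY shows the input partitions a query will actually read (the output is JSON and its exact shape varies by Hive version):
explain dependency
select count(*)
from auto1_tracking_events_ext
where dt = date_sub(current_date, 1);
With the deterministic filter, only the matching partition should appear under input_partitions.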
I run a daily report that has to query another table which is updated separately. Due to the high volume of records in the source table (8M+ per day), each day is stored in its own partition. The partition name has a standard format: P, then the 4-digit year, 2-digit month, and 2-digit day, so yesterday's partition is P20140907.
At the moment I use this expression, but have to manually change the name of the partition each day:
select * from <source_table> partition (P20140907) where ....
Using SYSDATE, TO_CHAR, and CONCAT, I have created another table called P_NAME2 that automatically generates and updates a string value holding the name of the partition I need to read. Now I need to update my main query so it does this:
select * from <source_table> partition (<string from P_NAME2>) where ....
You are working too hard. Oracle already does all of this for you. If you query the table using the correct date range, Oracle will perform the operation only on the relevant partitions; this is called partition pruning.
I suggest reading the docs on that.
If you're still skeptical, query all_tab_partitions.HIGH_VALUE to get each partition's high value (instead of the table you created ...).
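For illustration, a pruning-friendly version of the report query might look like the sketch below, assuming the table is range-partitioned on a DATE column (entry_date is a hypothetical column name):
-- Oracle prunes to the partition(s) covering this date range
select *
from <source_table>
where entry_date >= trunc(sysdate) - 1  -- start of yesterday
  and entry_date <  trunc(sysdate)      -- start of today
  and ... ;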
I thought I'd pop back to share how I solved this in the end. The source database has a habit of leaking dates across partitions, which is why queries for one day were going outside a single partition. I can't affect this, only work around it ...
begin
  execute immediate
    'create table LL_TEST as
     select *
     from SCHEMA.TABLE partition (P' || TO_CHAR(sysdate, 'YYYYMMDD') || ')
     where COLUMN_A = ''Something''
       and COLUMN_B = ''Something Else''';
end;
Using the PL/SQL script I create the partition name with TO_CHAR(sysdate,'YYYYMMDD') and concatenate the rest of the query around it.
Note that the values you are searching for in the WHERE clause require doubled apostrophes, so to send 'Something' to the query you need ''Something'' in the script.
It may not be pretty, but it works on the database that I have to use.
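As an aside, Oracle's alternative quoting mechanism q'[...]' avoids the doubled apostrophes; the same block could be sketched as:
begin
  execute immediate
    q'[create table LL_TEST as
       select *
       from SCHEMA.TABLE partition (P]' || TO_CHAR(sysdate, 'YYYYMMDD') || q'[)
       where COLUMN_A = 'Something'
         and COLUMN_B = 'Something Else']';
end;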
Let's say I have a large table partitioned by a dt field. I want to query this table for data after a specific date. E.g.
select * from mytab where dt >= 20140701;
The tricky part is that the date is not a constant, but comes from a subquery. So basically I want something like this:
select * from mytab where dt >= (select min(dt) from activedates);
Hive can't do it, however; it gives me a ParseException on the subquery (from the docs I'm guessing it's just not supported yet).
So how do I restrict my query based on a dynamic subquery?
Note that performance is the key point here, so the faster, the better, even if it looks uglier.
Also note that we haven't switched to Hive 0.13 yet, so solutions without IN subqueries are preferred.
Hive decides on partition pruning when building the execution plan, and thus has to have the value of min(dt) prior to execution.
Currently the only way to accomplish something like this is to break the query into two parts: the first will be select min(dt) from activedates, whose result is put into a variable.
The 2nd query will be: select * from mytab where dt >= ${hiveconf:var}.
Now this is a bit tricky.
You could either execute the 1st query into an OS variable, like so:
a=`hive -S -e "select min(dt) from activedates"`
And then run the 2nd query like so:
hive -hiveconf var=$a -e "select * from mytab where dt >=${hiveconf:var}"
or even just:
hive -e "select * from mytab where dt >=$a"
Or, if you are using some other scripting language, you can substitute the variable in the code.
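For completeness, in a bash-like shell the two steps can also be collapsed into a single invocation via nested command substitution (a sketch along the same lines; dt is assumed numeric, as above):
hive -e "select * from mytab where dt >= $(hive -S -e 'select min(dt) from activedates')"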