Hive doesn't pick up partition with the calculated partition key


My external table auto1_tracking_events_ext is partitioned on a column dt.
First I execute:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
When I run this query:
select count(*)
from auto1_tracking_events_ext
where dt = '2016-12-05';
Hive picks up the partition, starts about 3 mappers, and finishes in a couple of seconds.
However, if I run this:
select count(*)
from auto1_tracking_events_ext
where dt = from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');
Hive does not pick up the partition, starts 413 mappers, and takes quite some time to finish.
At the time of posting this question:
hive> select from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');
OK
2016-12-05
Why does Hive not pick up the partition?
UPDATE:
Passing the date string as a hiveconf parameter (as shown below) does not help either.
hive -hiveconf date_yesterday=$(date --date yesterday "+%Y-%m-%d")
hive> select count(*) from auto1_tracking_events_ext where dt = ${hiveconf:date_yesterday};

Your last query, which passes a hiveconf variable, should work just like the first query, because variables are substituted before the query is executed. There is one likely bug: you did not quote the variable. Try this:
hive -hiveconf date_yesterday=$(date --date yesterday "+%Y-%m-%d")
hive> select count(*) from auto1_tracking_events_ext where dt = '${hiveconf:date_yesterday}'; --single quotes here
Without quotes it resolves to where dt=2020-12-12, which is wrong: Hive reads that as integer subtraction rather than a date string, so the value must be in single quotes.
As for unix_timestamp(): the function is non-deterministic, which prevents proper query optimization, including partition pruning.
Use current_date or current_timestamp instead:
select count(*)
from auto1_tracking_events_ext
where dt = date_sub(current_date,1);
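If the query is launched from a script, another option in the same spirit as the quoted hiveconf variable is to compute yesterday's date on the client and embed it as a quoted literal, so Hive only ever sees a constant. The following is a minimal sketch, not a tested Hive integration; the helper names yesterday_literal and build_count_query are made up for illustration, and how the resulting string is passed to Hive is left open.

```python
from datetime import date, timedelta


def yesterday_literal(today: date) -> str:
    """Return the day before `today` formatted like the dt partition values."""
    return (today - timedelta(days=1)).strftime("%Y-%m-%d")


def build_count_query(table: str, dt: str) -> str:
    """Build the count query with dt as a single-quoted string literal."""
    return f"select count(*) from {table} where dt = '{dt}'"


# Hypothetical usage: pretend today is 2016-12-06, as in the question.
query = build_count_query(
    "auto1_tracking_events_ext", yesterday_literal(date(2016, 12, 6))
)
print(query)
```

The date is fixed here only to keep the example reproducible; a real script would pass date.today().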

Related

How to reuse a computed value multiple times?

Basically I just want a simple way of finding the most recent date in a table, saving it as a variable, and reusing that variable in the same query.
Right now this is how I'm doing it:
with recent_date as (
select max(date)
from mytable
)
select *
from mytable
where date = (select * from recent_date)
(For this simple example, a variable is overkill, but in my real-world use-case I reuse the recent date multiple times in the same query.)
But that feels cumbersome. It would be a lot cleaner to save the recent date to a variable rather than to a table that I then have to select from.
In pseudo-code, something like this would be nice:
$recent_date = (select max(date) from mytable)
select *
from mytable
where date = $recent_date
Is there something like that in Postgres?

Better for the simple case
For the scope of a single query, CTEs are a good tool. In my hands the query would look like this:
WITH recent(date) AS (SELECT max(date) FROM mytable)
SELECT m.*
FROM recent r
JOIN mytable m USING (date)
Except that the actual example query would boil down to this in my hands:
SELECT *
FROM mytable
ORDER BY date DESC NULLS LAST
FETCH FIRST 1 ROWS WITH TIES;
NULLS LAST only if there can be NULL values. See:
Sort by column ASC, but NULL values first?
WITH TIES only if date isn't UNIQUE NOT NULL. See:
Get top row(s) with highest value, with ties
In combination with an index on mytable (date) (or more specific), this produces the best possible query plan. Look no further.
No, I need variables!
If you positively need variables scoped for the same command, transaction, session or more, there are various options.
The closest thing to "variables" in SQL in Postgres are "customized options". See:
User defined variables in PostgreSQL
You can only store text, any other type has to be cast (and cast back on retrieval).
To set and retrieve a value from within a query, use the Configuration Settings Functions set_config() and current_setting():
SELECT set_config('foo.recent', max(date)::text, false) FROM mytable;
SELECT *
FROM mytable
WHERE date = current_setting('foo.recent')::date;
Typically, there are more efficient ways.
If you need that "recent date" a lot, consider a simple function as a "global variable", usable by all transactions in all sessions (but each new command sees its own current state):
CREATE FUNCTION f_recent_date()
RETURNS date
LANGUAGE sql STABLE PARALLEL SAFE AS
'SELECT max(date) FROM mytable';
STABLE is a valid volatility setting as the function returns the same result within the same query. Be sure to actually make it STABLE, so Postgres does not evaluate repeatedly. In Postgres 9.6 or later, also make it PARALLEL SAFE. Then your query becomes:
SELECT * FROM mytable WHERE date = f_recent_date();
More options:
Is there a way to define a named constant in a PostgreSQL query?
Passing user id to PostgreSQL triggers
Typically, if I need variables in Postgres, I use a PL/pgSQL code block in a function, a procedure, or a DO statement for ad-hoc use without the need to return rows:
DO
$do$
DECLARE
_recent_date date := (SELECT max(date) FROM mytable);
BEGIN
PERFORM * FROM mytable WHERE date = _recent_date;
-- more queries using _recent_date ...
END
$do$;
PL/pgSQL may be what you should be using to begin with. Further reading:
When to use stored procedure / user-defined function?

Keep in mind that in SQL you cannot directly declare a variable. A CTE effectively creates a variable (or a set of them), and in SQL you use a variable by selecting it. However, if you want to avoid that structure, you can get the value directly from a subquery.
select *
from mytable
where date = (select max(date) from mytable);
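To see the "CTE as a variable" idea with the value reused more than once in a single query, here is a small runnable sketch. It uses Python's built-in sqlite3 module with an in-memory database rather than Postgres, purely so it can run anywhere; the table contents, the id column, and the older_rows column are invented for the illustration.

```python
import sqlite3


def rows_on_latest_date(conn: sqlite3.Connection) -> list:
    """Return rows on the most recent date, plus a count of older rows.

    The CTE `recent` is computed once and referenced twice: in the join
    and in the scalar subquery.
    """
    sql = """
        WITH recent(d) AS (SELECT max(date) FROM mytable)
        SELECT m.id,
               m.date,
               (SELECT count(*) FROM mytable o
                WHERE o.date < (SELECT d FROM recent)) AS older_rows
        FROM mytable m
        JOIN recent r ON m.date = r.d
        ORDER BY m.id
    """
    return conn.execute(sql).fetchall()


# Made-up sample data: two rows share the latest date, one row is older.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "2020-01-01"), (2, "2020-01-03"), (3, "2020-01-03")],
)
result = rows_on_latest_date(conn)
print(result)
```

The same WITH clause should carry over to Postgres largely unchanged, though that is not verified here.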

Using an UDF to query a table in Hive

I have the following UDF available on Hive to convert a time bigint to date,
to_date(from_utc_timestamp(from_unixtime(cast(listed_time/1000 AS bigint)),'PST'))
I want to use this UDF to query a table on a specific date. Something like,
SELECT * FROM <table_name>
WHERE date = '2020-03-01'
ORDER BY <something>
LIMIT 10

I would suggest changing the logic: avoid applying the function to the column being filtered, because that is inefficient. The function has to be invoked for every row, which prevents the query from benefiting from an index.
Instead, you can simply convert the input date to a unix timestamp (possibly with a UDF). This should look like:
SELECT * FROM <table_name>
WHERE date = to_utc_timestamp('2020-03-01', 'PST') * 1000
ORDER BY <something>
LIMIT 10
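One detail worth spelling out when converting the input date instead of the column: a calendar day corresponds to a half-open range of millisecond values, not a single value. The sketch below computes that range on the client side. It assumes 'PST' means a fixed UTC-8 offset with no daylight saving handling, and both the helper day_bounds_millis and the table name my_table are hypothetical; listed_time is the column from the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 'PST' is treated as a fixed UTC-8 offset (no daylight saving).
PST = timezone(timedelta(hours=-8), "PST")


def day_bounds_millis(day: str, tz: timezone = PST) -> tuple[int, int]:
    """Return [start, end) of the given local day as epoch milliseconds."""
    start = datetime.strptime(day, "%Y-%m-%d").replace(tzinfo=tz)
    end = start + timedelta(days=1)
    return int(start.timestamp() * 1000), int(end.timestamp() * 1000)


start_ms, end_ms = day_bounds_millis("2020-03-01")

# my_table is a placeholder name; the column is compared to plain constants.
query = (
    "SELECT * FROM my_table "
    f"WHERE listed_time >= {start_ms} AND listed_time < {end_ms}"
)
print(query)
```

Because the column is compared against plain numeric constants, no function has to run per row.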

Oracle SQL use variable partition name

I run a daily report that has to query another table which is updated separately. Due to the high volume of records in the source table (8M+ per day), each day is stored in its own partition. The partition name has a standard format: P, then the 4-digit year, 2-digit month, and 2-digit day, so yesterday's partition is P20140907.
At the moment I use this expression, but have to manually change the name of the partition each day:
select * from <source_table> partition (P20140907) where ....
Using sysdate, TO_CHAR and CONCAT, I have created another table called P_NAME2 that automatically generates and updates a string value holding the name of the partition I need to read. Now I need to update my main query so it does this:
select * from <source_table> partition (<string from P_NAME2>) where ....

You are working too hard. Oracle already does all these things for you. If you query the table using the correct date range, Oracle will perform the operation only on the relevant partitions. This is called partition pruning.
I suggest reading the docs on that.
If you're still skeptical, query all_tab_partitions.HIGH_VALUE to get each partition's high value (the table you created ... ).

I thought I'd pop back to share how I solved this in the end. The source database has a habit of leaking dates across partitions which is why queries for one day were going outside a single partition. I can't affect this, just work around it ...
begin
execute immediate
'create table LL_TEST as
select *
from SCHEMA.TABLE Partition(P'||TO_CHAR(sysdate,'YYYYMMDD')||')
where COLUMN_A=''Something''
and COLUMN_B=''Something Else''
';
end
;
Using the PL/SQL script I create the partition name with TO_CHAR(sysdate,'YYYYMMDD') and concatenate the rest of the query around it.
Note that the values you are searching for in the where clause require doubled apostrophes, so to send 'Something' to the query you need ''Something'' in the script.
It may not be pretty, but it works on the database that I have to use.
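The same string-building idea can be sketched outside PL/SQL, for example when the report is driven from a script. The helpers below are hypothetical and only construct the SQL text; they do not connect to Oracle. The partition naming (P followed by YYYYMMDD) and the names SCHEMA.TABLE and COLUMN_A are taken from the posts above.

```python
from datetime import date


def partition_name(day: date) -> str:
    """Build the partition name: 'P' followed by YYYYMMDD."""
    return "P" + day.strftime("%Y%m%d")


def quote_literal(value: str) -> str:
    """Wrap a value in single quotes, doubling any embedded apostrophes."""
    return "'" + value.replace("'", "''") + "'"


def build_partition_query(table: str, day: date, column_a: str) -> str:
    """Assemble a query that reads one named partition of the table."""
    return (
        f"select * from {table} partition ({partition_name(day)}) "
        f"where COLUMN_A = {quote_literal(column_a)}"
    )


# Hypothetical usage with the date from the question.
sql = build_partition_query("SCHEMA.TABLE", date(2014, 9, 7), "Something")
print(sql)
```

Only the partition name is concatenated here because it cannot be supplied as a bind variable; ordinary filter values are usually better passed as bind parameters where the driver allows it.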

Subquery in `where` with comparison operator

Let's say I have a large table partitioned by a dt field. I want to query this table for data on or after a specific date. E.g.
select * from mytab where dt >= 20140701;
The tricky part is that the date is not a constant, but comes from a subquery. So basically I want something like this:
select * from mytab where dt >= (select min(dt) from activedates);
Hive can't do it, however, and gives me a ParseException on the subquery (from the docs I'm guessing it's just not supported yet).
So how do I restrict my query based on a dynamic subquery?
Note that performance is the key point here. So the faster, the better, even if it looks uglier.
Also note that we haven't switched to Hive 0.13 yet, so solutions without an in query are preferred.

Hive decides on partition pruning when building the execution plan, and thus needs the value of min(dt) before execution.
Currently the only way to accomplish something like this is to break the query into two parts: the first, select min(dt) from activedates, runs on its own and its result is stored in a variable.
The second query will be: select * from mytab where dt >= ${hiveconf:var}.
Now this is a bit tricky.
You could either capture the result of the first query in an OS variable like so:
a=`hive -S -e "select min(dt) from activedates"`
And then run the second query like so (single quotes keep the shell from trying to expand ${hiveconf:var} itself):
hive -hiveconf var=$a -e 'select * from mytab where dt >= ${hiveconf:var}'
or even just:
hive -e "select * from mytab where dt >= $a"
Or, if you are using some other scripting language, you can substitute the variable in the code.
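As a rough illustration of the "other scripting language" route, the two-step flow might look like this in Python. The run_query callable is a stand-in for however the script actually talks to Hive (that part is an assumption and is not shown), and fake_run_query exists only so the sketch runs without a cluster.

```python
from typing import Callable


def build_pruned_query(run_query: Callable[[str], str]) -> str:
    """Run the first query, then embed its result as a constant in the second."""
    min_dt = run_query("select min(dt) from activedates").strip()
    # Guard against empty or unexpected output before splicing it into SQL.
    if not min_dt.isdigit():
        raise ValueError(f"unexpected min(dt) value: {min_dt!r}")
    return f"select * from mytab where dt >= {min_dt}"


def fake_run_query(sql: str) -> str:
    """Stand-in for a real Hive call; returns canned output for the demo."""
    return "20140701\n"


second_query = build_pruned_query(fake_run_query)
print(second_query)
```

Because the second query contains a literal value, Hive can decide which partitions to read at planning time.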

Why does Hive partitioning not work when I use the function unix_timestamp()?

As far as I know, Hive partitioning can reduce the number of input files if you use the partition column in the where clause. For example, in my table t I define a partition column named date_entry (of type string, storing the timestamp).
select count(*) from t where
date_entry >= (unix_timestamp() - 2 * 24 * 3600) * 1000
When I execute this query, I expect the where clause to filter out some files, but it does not.
If I don't use the function unix_timestamp(), it works.
Does anybody know why, or can anyone suggest a workaround?
