Subquery in `where` with comparison operator - hive

Let's say I have a large table partitioned by a dt field, and I want to query it for data after a specific date. E.g.:
select * from mytab where dt >= 20140701;
The tricky part is that the date is not a constant, but comes from a subquery. So basically I want something like this:
select * from mytab where dt >= (select min(dt) from activedates);
Hive can't do this, however; it gives me a ParseException on the subquery (from the docs I'm guessing it's just not supported yet).
So how do I restrict my query based on a dynamic subquery?
Note that performance is the key point here. So the faster, the better, even if it looks uglier.
Also note that we haven't switched to Hive 0.13 yet, so solutions without IN subqueries are preferred.

Hive decides on partition pruning when building the execution plan, and thus has to know the value of min(dt) prior to execution.
Currently the only way to accomplish something like this is to break the query into two parts: the first is select min(dt) from activedates, whose result will be put into a variable.
The 2nd query will be: select * from mytab where dt >= ${hiveconf:var}.
Now this is a bit tricky.
You could either capture the 1st query's result in an OS variable, like so:
a=`hive -S -e "select min(dt) from activedates"`
And then run the 2nd query like so:
hive -hiveconf var=$a -e 'select * from mytab where dt >= ${hiveconf:var}'
or even just:
hive -e "select * from mytab where dt >=$a"
Or, if you are using some other scripting language you can replace the variable in the code.
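Putting the two steps together in one small shell script - a minimal sketch using the question's table names (the single quotes around the second query keep the shell from touching ${hiveconf:min_dt}, leaving it for Hive to substitute):
#!/bin/bash
# Step 1: capture the subquery result in a shell variable (-S suppresses Hive's log noise)
min_dt=$(hive -S -e "select min(dt) from activedates")
# Step 2: feed it into the main query as a hiveconf variable
hive -hiveconf min_dt="$min_dt" -e 'select * from mytab where dt >= ${hiveconf:min_dt}'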

Related

How to reuse a computed value multiple times?

Basically I just want a simple way of finding the most recent date in a table, saving it as a variable, and reusing that variable in the same query.
Right now this is how I'm doing it:
with recent_date as (
select max(date)
from mytable
)
select *
from mytable
where date = (select * from recent_date)
(For this simple example, a variable is overkill, but in my real-world use-case I reuse the recent date multiple times in the same query.)
But that feels cumbersome. It would be a lot cleaner to save the recent date to a variable rather than to a table that I then have to select from.
In pseudo-code, something like this would be nice:
$recent_date = (select max(date) from mytable)
select *
from mytable
where date = $recent_date
Is there something like that in Postgres?
Better for the simple case
For the scope of a single query, CTEs are a good tool. In my hands the query would look like this:
WITH recent(date) AS (SELECT max(date) FROM mytable)
SELECT m.*
FROM recent r
JOIN mytable m USING (date)
Except that the actual example query would burn down to this in my hands:
SELECT *
FROM mytable
ORDER BY date DESC NULLS LAST
FETCH FIRST 1 ROWS WITH TIES;
NULLS LAST only if there can be NULL values. See:
Sort by column ASC, but NULL values first?
WITH TIES only if date isn't UNIQUE NOT NULL. See:
Get top row(s) with highest value, with ties
In combination with an index on mytable (date) (or more specific), this produces the best possible query plan. Look no further.
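For completeness, a minimal sketch of such an index (the name is an assumption; matching the query's sort order lets Postgres read the index directly):
CREATE INDEX mytable_date_idx ON mytable (date DESC NULLS LAST);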
No, I need variables!
If you positively need variables scoped for the same command, transaction, session or more, there are various options.
The closest thing to "variables" in SQL in Postgres is "customized options". See:
User defined variables in PostgreSQL
You can only store text, any other type has to be cast (and cast back on retrieval).
To set and retrieve a value from within a query, use the Configuration Settings Functions set_config() and current_setting():
SELECT set_config('foo.recent', max(date)::text, false) FROM mytable;
SELECT *
FROM mytable
WHERE date = current_setting('foo.recent')::date;
Typically, there are more efficient ways.
If you need that "recent date" a lot, consider a simple function as "global variable", usable by all transactions in all sessions (but each new command sees its own current state):
CREATE FUNCTION f_recent_date()
RETURNS date
LANGUAGE sql STABLE PARALLEL SAFE AS
'SELECT max(date) FROM mytable';
STABLE is a valid volatility setting as the function returns the same result within the same query. Be sure to actually declare it STABLE, so Postgres does not evaluate it repeatedly. In Postgres 9.6 or later, also make it PARALLEL SAFE. Then your query becomes:
SELECT * FROM mytable WHERE date = f_recent_date();
More options:
Is there a way to define a named constant in a PostgreSQL query?
Passing user id to PostgreSQL triggers
Typically, if I need variables in Postgres, I use a PL/pgSQL code block in a function, a procedure, or a DO statement for ad-hoc use without the need to return rows:
DO
$do$
DECLARE
_recent_date date := (SELECT max(date) FROM mytable);
BEGIN
PERFORM * FROM mytable WHERE date = _recent_date;
-- more queries using _recent_date ...
END
$do$;
PL/pgSQL may be what you should be using to begin with. Further reading:
When to use stored procedure / user-defined function?
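If the code must return rows, the same pattern fits a set-returning PL/pgSQL function - a minimal sketch with assumed names:
CREATE FUNCTION f_rows_of_recent_date()
  RETURNS SETOF mytable
  LANGUAGE plpgsql STABLE AS
$func$
DECLARE
   _recent_date date := (SELECT max(date) FROM mytable);
BEGIN
   -- reuse _recent_date in as many queries as needed
   RETURN QUERY
   SELECT * FROM mytable WHERE date = _recent_date;
END
$func$;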
Keep in mind that in SQL you cannot directly declare a variable. Basically, a CTE creates a variable (or a set of them), and in SQL you use a variable by selecting it. However, if you want to avoid that structure you can just get the value from a subquery directly:
select *
from mytable
where date = (select max(date) from mytable);

Using Update statement with the _PARTITIONDATE Pseudo-column

I'm trying to update a table in BigQuery that is partitioned on _PARTITIONTIME and really struggling.
Source is an extract from destination that I need to backfill destination with. Destination is a large partitioned table.
To move data from source to destination, I tried this:
update t1 AS destination
set destination._PARTITIONTIME = '2022-02-09'
from t2 as source
WHERE source.id <> "1";
Because it said that the WHERE clause was required for UPDATE - but when I run it, I get the message "update/merge must match at most one source row for each target row". I've tried... so many other methods that I can't even remember them all. INSERT INTO seemed like a no-brainer early on, but it wants me to specify column names, and these tables have about 800 columns each, so that's less than ideal.
I would have expected this most recent attempt to work because if I do
select * from source where source.id <> "1";
I do, in fact, get results exactly the way I would expect, so that query clearly functions, but for some reason it can't load the data. This is interesting, because I created the source table by running something along the lines of:
select * from destination where DATE(createddate) = '2022-02-09' and DATE(_PARTITIONTIME) = '2022-02-10'
Is there a way to make Insert Into work for me in this instance? If there is not, does someone have an alternate approach they recommend?
You can use the bq command line tool (it usually comes with the gcloud command line utility) to run a query that overwrites a partition in a target table with your query results:
bq query --allow_large_results --replace --noflatten_results --destination_table 'target_db.target_table$20220209' "select field1, field2, field3 from source_db.source_table where _PARTITIONTIME = '2022-02-09'";
Note the $YYYYMMDD suffix on target_table. It indicates that the partition corresponding to YYYYMMDD is to be overwritten by the query results.
Make sure to explicitly select fields in your query (as a good practice) to avoid unexpected surprises. For instance, select field1, field2, field3 from table is far more explicit and readable than select * from table.
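If several partitions need backfilling, the same command can be wrapped in a small shell loop - a sketch only, with the table names, field list and dates carried over from above as placeholders:
#!/bin/bash
# overwrite one partition per date in the list
for d in 20220209 20220210; do
  dashed="${d:0:4}-${d:4:2}-${d:6:2}"   # 20220209 -> 2022-02-09
  bq query --allow_large_results --replace --noflatten_results \
    --destination_table "target_db.target_table\$${d}" \
    "select field1, field2, field3 from source_db.source_table where _PARTITIONTIME = TIMESTAMP('${dashed}')"
done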

Cannot query over table without a filter that can be used for partition elimination

I have a partitioned table and would love to use a MERGE statement, but for some reason it doesn't work out.
MERGE `wr_live.p_email_event` t
using `wr_live.email_event` s
on t.user_id=s.user_id and t.event=s.event and t.timestamp=s.timestamp
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
values (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
I get
Cannot query over table 'wr_live.p_email_event' without a filter that
can be used for partition elimination.
What's the proper syntax? Also, is there a way to express the INSERT part more concisely, without naming all the columns?
What's the proper syntax?
As you can see from the error message, your partitioned wr_live.p_email_event table was created with require partition filter set to true. This means that any query over this table must have a filter on the respective partitioning field.
Assuming that timestamp IS that partitioning field, you can do something like below:
MERGE `wr_live.p_email_event` t
USING `wr_live.email_event` s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
So you need to make the line below such that in reality it does not filter out anything you need to be involved:
AND DATE(t.timestamp) > CURRENT_DATE() -- this is the filter you should tune
For example, I found that setting it to a timestamp in the future addresses the issue in many cases, like:
AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
Of course, if your wr_live.email_event table is also partitioned with require partition filter set to true, you need to add the same kind of filter for s.timestamp, as in the sketch below.
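A minimal sketch of what that could look like, assuming timestamp is also the partitioning field of the source table (filtering s in a subquery keeps the predicate unambiguous):
MERGE `wr_live.p_email_event` t
USING (
  SELECT * FROM `wr_live.email_event`
  WHERE DATE(timestamp) > CURRENT_DATE()  -- tune just like the t-side filter
) s
ON t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > CURRENT_DATE()
WHEN NOT MATCHED THEN
INSERT (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)
VALUES (user_id,event,engagement_score,dest_email_domain,timestamp,tags,meta)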
Also, is there a way to express the INSERT part more concisely, without naming all the columns?
BigQuery DML's INSERT requires column names to be specified; there is no way (at least that I am aware of) to avoid that with an INSERT statement.
Meanwhile, you can avoid it by using DDL's CREATE TABLE from the result of a query, which does not require listing the columns.
For example, something like below:
CREATE OR REPLACE TABLE `wr_live.p_email_event`
PARTITION BY DATE(timestamp) AS
SELECT * FROM `wr_live.p_email_event`
WHERE DATE(timestamp) <> DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
UNION ALL
SELECT * FROM `wr_live.email_event` s
WHERE NOT EXISTS (
SELECT 1 FROM `wr_live.p_email_event` t
WHERE t.user_id=s.user_id AND t.event=s.event AND t.timestamp=s.timestamp
AND DATE(t.timestamp) > DATE_ADD(CURRENT_DATE(), INTERVAL 1 DAY)
)
You might also want to include a table options list via OPTIONS(), but it looks like the filter attribute is not supported there yet, so if you do have/need it, the statement above will "erase" that attribute :o(
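For reference only: if/when the attribute becomes supported in DDL, the options list would presumably look like this (check the current BigQuery DDL docs before relying on it):
CREATE OR REPLACE TABLE `wr_live.p_email_event`
PARTITION BY DATE(timestamp)
OPTIONS (require_partition_filter = TRUE) AS
SELECT ... -- same SELECT as above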

BigQuery: how to convert this legacy SQL to standardSQL?

I have a data import pipeline into BigQuery tables (hourly tables named transactions_20170616_00, transactions_20170616_01, ..., plus more daily/weekly/... rollups) and want a single view that always points to the latest one. I found it hard to write one static standardSQL view that points to the latest table; my current solution is to update the view's definition to SELECT * FROM project.dataset.transactions_201706... after every successful import.
Then I read httparchive's latest view: it's exactly what I want, but in legacy SQL; my project uses standardSQL only, and I prefer standardSQL because it's the future. Does anyone know how to convert this legacy SQL to standardSQL? Then I wouldn't need to constantly update my view.
https://bigquery.cloud.google.com/table/httparchive:runs.latest_requests?tab=details
SELECT *
FROM TABLE_QUERY(httparchive:runs,
"table_id IN (
SELECT table_id FROM [httparchive:runs.__TABLES__]
WHERE REGEXP_MATCH(table_id, '2.*requests$')
ORDER BY table_id DESC LIMIT 1)")
Following this guide, I'm trying to use:
https://cloud.google.com/bigquery/docs/querying-wildcard-tables#the_table_query_function
#standardSQL
SELECT * FROM `httparchive.runs.*`
WHERE _TABLE_SUFFIX IN
( SELECT table_id
FROM httparchive.runs.__TABLES__
WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
ORDER BY table_id DESC
LIMIT 1)
but the query failed with:
Query Failed
Error: Views cannot be queried through prefix. Matched views are: httparchive:runs.latest_pages, httparchive:runs.latest_pages_mobile, httparchive:runs.latest_requests, httparchive:runs.latest_requests_mobile
Job ID: bidder-1183:bquijob_1400109e_15cb1dc3c0c
I found that the wildcard can only be used as the last part of the table name. In that case, why doesn't SELECT * FROM httparchive.runs.*_requests WHERE ... work?
Is this saying that the wildcard tables feature in standardSQL isn't as flexible as TABLE_QUERY in legacySQL?
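One possible workaround (a sketch, not verified against this dataset): since the wildcard must be the final part of the name, lengthen the prefix so it no longer matches the latest_* views, and strip that prefix from table_id when computing the suffix:
#standardSQL
SELECT *
FROM `httparchive.runs.2*`
WHERE _TABLE_SUFFIX IN
 ( SELECT SUBSTR(table_id, 2)  -- drop the leading '2' covered by the prefix
   FROM httparchive.runs.__TABLES__
   WHERE REGEXP_CONTAINS(table_id, r'2.*requests$')
   ORDER BY table_id DESC
   LIMIT 1)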

Hive doesn't pick up partition with the calculated partition key

My external table auto1_tracking_events_ext is partitioned on a column dt.
First I execute:
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.dynamic.partition=true;
When I run this query:
select count(*)
from auto1_tracking_events_ext
where dt = '2016-12-05';
It picks up the partition, spawns maybe 3 mappers, and finishes in a couple of seconds.
However, if I run this:
select count(*)
from auto1_tracking_events_ext
where dt = from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');
It does not pick up the partition; it starts 413 mappers and takes quite some time to compute.
At the time of posting this question:
hive> select from_unixtime(unix_timestamp()-1*60*60*24, 'yyyy-MM-dd');
OK
2016-12-05
Why does Hive not pick up the partition?
UPDATE:
Passing the date string as a hiveconf parameter (as shown below) does not help either.
hive -hiveconf date_yesterday=$(date --date yesterday "+%Y-%m-%d")
hive> select count(*) from auto1_tracking_events_ext where dt = ${hiveconf:date_yesterday};
Your last query, passing a hiveconf variable, should work as well if the first query works, because variables are substituted first and only then is the query executed. There is one possible bug: you did not quote the variable. Try this:
hive -hiveconf date_yesterday=$(date --date yesterday "+%Y-%m-%d")
hive> select count(*) from auto1_tracking_events_ext where dt = '${hiveconf:date_yesterday}'; --single quotes here
Without quotes it resolves to where dt=2020-12-12, which Hive evaluates as the arithmetic expression 2020 minus 12 minus 12 rather than a date string - it should be in single quotes.
As for using unix_timestamp(): the function is not deterministic, which prevents proper query optimization - the planner cannot evaluate it at compile time, so no partition pruning happens.
Use current_date or current_timestamp instead:
select count(*)
from auto1_tracking_events_ext
where dt = date_sub(current_date,1);
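To check that pruning now happens, Hive's EXPLAIN DEPENDENCY lists the input partitions a query would read, so a pruned query should show just the one partition:
EXPLAIN DEPENDENCY
select count(*)
from auto1_tracking_events_ext
where dt = date_sub(current_date,1);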