Partition pruning for bigquery partitioned table - google-bigquery

I have a query using an analytic function on a day-partitioned table. I would expect it to read only the partitions filtered in the WHERE clause, but it reads all partitions in the table.
WITH query AS (
SELECT
* EXCEPT(rank)
FROM (
SELECT
*,
RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned` ) q_nested
) q
WHERE
rank < 1000
)
SELECT
num_mean_temp_samples ,
count(1) as samples
FROM query
WHERE
day in ( '20100101', '20100103')
GROUP BY 1 ORDER BY 1
I verified that partition pruning works without the analytic function:
WITH query AS (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned`
)
or after adding a UNION ALL nested select:
WITH query AS (
SELECT
* EXCEPT(rank)
FROM (
SELECT
*,
RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned` WHERE _PARTITIONDATE < "1970-01-01" ) q_nested1
UNION ALL SELECT
*,
RANK() OVER (PARTITION BY day order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_DATE("%Y%m%d", _PARTITIONDATE) AS day,
*
FROM
`mydataset.gsod_partitioned` WHERE _PARTITIONDATE >= "1970-01-01" ) q_nested2
) q
WHERE
rank < 1000
)
Table mydataset.gsod_partitioned is based on the public gsod dataset, where the day=20100101 partition is created as follows:
bq query --destination_table 'mydataset.gsod_partitioned$20100101' --time_partitioning_type=DAY --use_legacy_sql=false
'SELECT station_number, mean_temp, num_mean_temp_samples FROM `bigquery-public-data.samples.gsod` where year=2010 and month=01 and day=01'
Could you find a way to enable partition pruning for the analytic function without adding an extra UNION to the query?

Regarding _PARTITIONDATE - it isn't a documented feature and it is recommended to use _PARTITIONTIME instead; you can find a Googler saying so in another question: Use of the _PARTITIONDATE vs. the _PARTITIONTIME pseudo-columns in BigQuery
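For reference, the two pseudo-columns carry the same information; a quick check against the question's table (a sketch, assuming ingestion-time partitioning) shows the relationship:
-- _PARTITIONDATE is just the DATE form of the _PARTITIONTIME timestamp
SELECT DISTINCT _PARTITIONDATE, DATE(_PARTITIONTIME) AS date_from_time
FROM `mydataset.gsod_partitioned`
WHERE _PARTITIONTIME >= TIMESTAMP("2010-01-01")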
Regarding partition pruning with analytic functions: last year Google added support for filter pushdown, but it works only for _PARTITIONTIME, which should be included in the fields covered by the PARTITION BY clause.
It should look like this:
WITH query AS (
SELECT
* EXCEPT(rank)
FROM (
SELECT
*,
RANK() OVER (PARTITION BY _pt order by num_mean_temp_samples) AS rank
FROM (
SELECT
FORMAT_TIMESTAMP("%Y%m%d", _PARTITIONTIME) AS day,
_PARTITIONTIME as _pt,
*
FROM
`mydataset.gsod_partitioned` ) q_nested
) q
WHERE
rank < 1000
)
SELECT
num_mean_temp_samples ,
count(1) as samples
FROM query
WHERE
day in ( '20100101', '20100103')
GROUP BY 1 ORDER BY 1
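If the filter on the formatted day string still doesn't prune, a variant of the outer query that filters on the _pt alias directly (so the predicate can reach the _PARTITIONTIME pseudo-column) is worth trying; this is a sketch, not guaranteed optimizer behavior:
SELECT
num_mean_temp_samples,
COUNT(1) AS samples
FROM query
WHERE
-- filter directly on the aliased _PARTITIONTIME values for the two days
_pt IN (TIMESTAMP("2010-01-01"), TIMESTAMP("2010-01-03"))
GROUP BY 1 ORDER BY 1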

Related

choose latest partition of a BigQuery table where filter over partition column is required

I have been using the following query:
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
It is not ideal, as the partition could be unavailable due to delay. Thus I tried the following queries:
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE DATE(_PARTITIONTIME) IN
(
SELECT
MAX(DATE(_PARTITIONTIME)) AS max_partition
FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
as well as
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE TIMESTAMP(DATE(_PARTITIONTIME)) IN
(
SELECT parse_timestamp("%Y%m%d", MAX(partition_id))
FROM `project.dataset.INFORMATION_SCHEMA.PARTITIONS`
WHERE table_name = 'table'
)
Neither of them works, due to:
Cannot query over table 'project.dataset.table' without a filter over
column(s) '_PARTITION_LOAD_TIME', '_PARTITIONDATE', '_PARTITIONTIME'
that can be used for partition elimination.
In both of your solutions the limiting filter for the partition column is calculated during the query. This leads to a full table scan.
Therefore, you need to add a filter for the partition column which is already known at the start of your query:
SELECT DISTINCT
*
FROM
`project.dataset.table` t
WHERE DATE(_PARTITIONTIME) IN
(
SELECT
MAX(DATE(_PARTITIONTIME)) AS max_partition
FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
)
AND DATE(_PARTITIONTIME) >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
If the last partition date could be months back, this is a better solution:
Declare max_date date;
execute immediate
"""
SELECT max(date(_PARTITIONTIME)) FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) > "2000-12-15"
""" into max_date;
execute immediate
"""
Select * from `project.dataset.table` where date(_PARTITIONTIME) = date('""" || CAST(max_date AS STRING) || "')";
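Alternatively, EXECUTE IMMEDIATE accepts query parameters via USING, which avoids building the SQL string by concatenation; a sketch against the same table:
Declare max_date date;
execute immediate
"""
SELECT max(date(_PARTITIONTIME)) FROM `project.dataset.table`
WHERE DATE(_PARTITIONTIME) > "2000-12-15"
""" into max_date;
-- ? is a positional query parameter bound by USING
execute immediate
"""
Select * from `project.dataset.table` where date(_PARTITIONTIME) = ?
""" using max_date;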

BigQuery partition pruning while using analytical function as part of a view, on a DATETIME field with DAY granularity

I am trying to use analytical functions (e.g. FIRST_VALUE) while still benefiting from partition pruning, on a table partitioned on a DATETIME field with DAY granularity.
Example Data
Let's assume a table with the following columns:
name  | type
------+---------
dt    | DATETIME
value | STRING
The table is partitioned on dt with the DAY granularity.
An example table can be created using the following SQL:
CREATE TABLE `project.dataset.example_partioned_table`
PARTITION BY DATE(dt)
AS
SELECT dt, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
The generated table will be over 1 GB (around 3.4 MB per day).
The Problem
Now I want to get the first value in each partition. (Later I actually want to have a further breakdown)
As I want to use a view, the view itself wouldn't know the final date range. In the example query I am using the CTE t_view in place of the view.
WITH t_view AS (
SELECT
dt,
value,
FIRST_VALUE(value) OVER(
PARTITION BY DATE(dt)
ORDER BY dt
) AS first_val
FROM `project.dataset.example_partioned_table`
)
SELECT *,
FROM t_view
WHERE DATE(dt) = DATE('2020-01-01')
The query result will contain something like some value: 2020-01-01 00:00:00 for first_val (i.e. the first value for the date).
However, as it stands, it is scanning the whole table (over 1 GB), when it should just scan the partition.
Other observations
If I don't include first_val (the analytical function) in the result, then the partition pruning works as intended.
Including first_val causes it to scan everything.
If I don't wrap dt with DATE, then the partition pruning also works, but would of course not provide the correct value.
I also tried DATETIME_TRUNC(dt, DAY), with the same lack of partition pruning as DATE(dt).
Adding the date WHERE clause inside the CTE also works, but I wouldn't know the date range inside the view.
How can I restrict the analytical function to the partition of the row?
Failed workaround using GROUP BY
Related, I also tried a workaround using GROUP BY (the date), with the same result...
WITH t_view_1 AS (
SELECT
dt,
DATE(dt) AS _date,
value,
FROM `project.dataset.example_partioned_table`
),
t_view_2 AS (
SELECT
_date,
MIN(value) AS first_val
FROM t_view_1
GROUP BY _date
),
t_view AS (
SELECT
t_view_1._date,
t_view_1.dt,
t_view_2.first_val
FROM t_view_1
JOIN t_view_2
ON t_view_2._date = t_view_1._date
)
SELECT *,
FROM t_view
WHERE _date = '2020-01-01'
As before, it is scanning the whole table rather than only processing the partition with the selected date.
Potentially working workaround with partition on DATE field
If the table is instead partitioned on a DATE field (_date), e.g.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
Then the partition pruning works with the following adjusted example query:
WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `elife-data-pipeline.de_proto.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')
i.e. the query scans around 4 MB rather than 1 GB
However, now I would need to add and populate that additional _date field. (Inconvenient with an external data source)
Having two fields with redundant information can also be confusing.
Additionally there is now no partition pruning at all on dt (queries need to make sure to use _date instead).
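For example, a query like the following (hypothetical) would again scan the full table, because dt is no longer the partitioning column:
SELECT *
FROM `project.dataset.example_date_field_partioned_table`
WHERE DATE(dt) = '2020-01-01' -- filter on dt, not on the _date partition column: full scan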
BQ functions can sometimes lead the query optimizer to make inefficient choices; however, we're constantly trying to improve the query optimizer.
So, the best possible workaround in your scenario would be adding an extra date column and using it to partition the table, i.e.:
CREATE TABLE `project.dataset.example_date_field_partioned_table`
PARTITION BY _date
AS
SELECT dt, DATE(dt) AS _date, CONCAT('some value: ', STRING(dt)) AS value
FROM (
SELECT
DATETIME_ADD(
DATETIME(_date),
INTERVAL ((_hour * 60 + _minute) * 60 + _second) SECOND
) AS dt
FROM UNNEST(GENERATE_DATE_ARRAY(DATE('2020-01-01'), DATE('2020-12-31'))) AS _date
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 23)) AS _hour
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _minute
CROSS JOIN UNNEST(GENERATE_ARRAY(0, 59)) AS _second
)
Then query it as follows:
WITH t_view AS (
SELECT
dt,
_date,
value,
FIRST_VALUE(value) OVER(
PARTITION BY _date
ORDER BY dt
) AS first_val
FROM `mock.example_date_field_partioned_table`
)
SELECT *,
FROM t_view
WHERE _date = DATE('2020-01-01')

Update BigQuery value based on partition by row number

I have a table in which I have records on the wrong date. I want to update their Snapshot_Date to be the day before. I have written the query to select the values I want to update the date for, but I don't know how to write the UPDATE query to change it to the previous day.
Query to select the problematic records:
Select * FROM(
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY Period, User_Struct) rn
FROM `XXX.YYY.TABLE`
where Snapshot_Date = '2021-10-04'
order by Period, User_Struct, Num_Active_Users asc
) where rn = 1
Using DATE_SUB you may get the previous day, i.e.
SELECT DATE_SUB(CAST('2021-10-04' AS DATE), INTERVAL 1 DAY)
will give 2021-10-03.
You may try the following, using the BigQuery UPDATE statement syntax:
UPDATE
`XXX.YYY.TABLE` t0
SET
Snapshot_Date = DATE_SUB(t2.Snapshot_Date, INTERVAL 1 DAY)
FROM (
SELECT * FROM (
SELECT
*,
ROW_NUMBER() OVER(PARTITION BY Period, User_Struct) rn
FROM
`XXX.YYY.TABLE`
WHERE
Snapshot_Date = '2021-10-04'
ORDER BY -- recommend removing this ORDER BY and using the ROW_NUMBER recommendation below instead
Period, User_Struct, Num_Active_Users ASC
) t1
WHERE rn = 1
) t2
WHERE
t0.Snapshot_Date = t2.Snapshot_Date -- AND: include other columns here to match/join the subquery with the main table
You should also specify how your rows should be ordered when using ROW_NUMBER, e.g.
ROW_NUMBER() OVER (PARTITION BY Period, User_Struct ORDER BY Num_Active_Users asc)
if this generates the same/desired results.
Let me know if this works for you.

greenplum string_agg conversion into hivesql supported

We are migrating a Greenplum SQL query to HiveQL, and it uses string_agg, which Hive does not support. How do we migrate it? The sample Greenplum code that needs migrating is below.
select string_agg(Display_String, ';' order by data_day )
from
(
select data_day,
sum(revenue)/1000000.00 as revenue,
data_day||' '||trim(to_char(sum(revenue),'9,999,999,999')) as Display_String
from(
select case when data_date = current_date then 'D:'
when data_date = current_date - 1 then ' D-01:'
when data_date = current_date - 2 then ' D-02:'
when data_date = current_date - 7 then ' D-07:'
when data_date = current_date - 28 then ' D-28:'
end data_day, revenue/1000000.00 revenue
from test.testable
where data_date between current_date - 28 and current_date
and hour <= (
select hour from (
select row_number() over (order by hour desc) iRowsID, hour
from test.testable
where data_date = current_date and type = 'UVC'
) tbl1
where iRowsID = 2
)
and type in ('UVC')
order by 1 desc) a
group by 1)aa;
There is nothing like this in Hive. However, you can use collect_list with a partition by/order by window to calculate it.
select concat_ws(';', max(concat_str))
from (
select collect_list(Display_String) over (order by data_day) concat_str
from
(your above SQL) s
) concat_qry
Explanation -
collect_list concatenates the strings, and while doing so, the order by in the window orders the data on the day column.
The outermost MAX() will pick up the maximal (complete) concatenated string.
Please note this is a very slow operation. Test performance as well before implementing it.
Here is a sample SQL and result to help you.
select
id, concat_ws(';', max(concat_str))
from
( select
s.id, collect_list(s.c) over (partition by s.id order by s.c ) concat_str
from
( select 1 id,'ax' c union
select 1,'b'
union select 2,'f'
union select 2,'g'
union all select 1,'b'
union all select 1,'b' )s
) gs
group by id

Select latest 30 dates for each unique ID

The data contains unique IDs with different latitudes and longitudes at multiple timestamps. I would like to select the rows from the latest 30 days of coordinates for each unique ID. Please help me with how to write the query. This data is in a Hive table.
According to your example above (where there are no current-year dates for id=2,3), you can number the dates for each id (ordered by date descending) using the window function ROW_NUMBER(). Then just get the latest 30 values:
--get all values for each id where num<=30 (get the last 30 days for each id)
select * from
(
--numbering each date for each id order by descending
select *, row_number() over (partition by ID order by DATE desc) num from Table
)X
where num<=30
If you need to get only unique dates (without considering time) for each id, then you can try this query:
select * from
(
--numbering date for each id
select *, row_number() over (partition by ID order by new_date desc) num
from
(
-- remove duplicates using distinct
select distinct ID,cast(DATE as date)new_date from Table
)X
)Y
where num<=30
In Oracle this will be:
SELECT * FROM TEST_DATE1
WHERE DATEUPDT > SYSDATE - 30;
In SQL Server syntax:
select * from MyTable
where
[Date]>=dateadd(d, -30, getdate());
To group by ID and perform aggregation, something like this:
select ID,
count(*) row_count,
max(Latitude) max_lat,
max(Longitude) max_long
from MyTable
where
[Date]>=dateadd(d, -30, getdate())
group by ID;
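Since the question says the data is in a Hive table, a HiveQL sketch of the same filter and aggregation could look like this (event_date is a hypothetical name for the date column):
select ID,
count(*) as row_count,
max(Latitude) as max_lat,
max(Longitude) as max_long
from MyTable
-- Hive: to_date/date_sub/current_date give the last 30 days from today
where to_date(event_date) >= date_sub(current_date, 30)
group by ID;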