Automatically add partition conditions to WHERE clause

I have a columnar table that is partitioned by day and hour. It is stored on S3 in parquet files to be queried by Athena. Here is the CREATE TABLE:
CREATE EXTERNAL TABLE foo (
  -- other columns here
  dt timestamp
)
PARTITIONED BY (day string, hour string)
STORED AS parquet
LOCATION 's3://foo/foo'
And the layout on S3 is like:
s3://foo/foo/day=2021-10-10/hour=00/*.parquet
s3://foo/foo/day=2021-10-10/hour=01/*.parquet
...etc
s3://foo/foo/day=2021-10-10/hour=23/*.parquet
So a query like the following is fast, because the partition columns are used to filter and only one hour's worth of parquet files is scanned:
-- fast, easy to write
SELECT * FROM foo WHERE day = '2021-10-10' AND hour = '00'
However, the table also includes the full datetime dt. Usually we want to write queries for ranges that don't align to a day/hour boundary, and/or are in a different timezone.
For example, this will scan ALL parquet files and be very slow:
-- slow, easy to write
SELECT * FROM foo WHERE dt > timestamp '2021-10-09 23:05:00' AND dt < timestamp '2021-10-11 01:00:00'
It can be improved by manually calculating the day and hour that minimally enclose the time period:
-- fast, painful to write
SELECT * FROM foo
WHERE
  ((day, hour) IN (('2021-10-09', '23'), ('2021-10-11', '00')) OR day = '2021-10-10')
  AND dt > timestamp '2021-10-09 23:05:00'
  AND dt < timestamp '2021-10-11 01:00:00'
Ideally this extra condition could be added transparently by the database, so as to avoid having to manually write the ((day, hour) IN (...)) part.
Is this possible somehow with Athena?

I've wished for this feature many times, but unfortunately Athena does not support it. You have to include both the predicate for the dt column and the day and hour partition keys.
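Since Athena will not add the predicate for you, one workaround is to generate it in client code before submitting the query. Here is a minimal Python sketch (the helper name and the enumerate-every-hour approach are just illustrative assumptions for the day/hour layout above):
from datetime import datetime, timedelta

def partition_predicate(start: datetime, end: datetime) -> str:
    """Build a (day, hour) IN (...) predicate covering [start, end)."""
    pairs = []
    t = start.replace(minute=0, second=0, microsecond=0)
    while t < end:
        pairs.append("('{}', '{:02d}')".format(t.date().isoformat(), t.hour))
        t += timedelta(hours=1)
    return "(day, hour) IN ({})".format(", ".join(pairs))

start = datetime(2021, 10, 9, 23, 5)
end = datetime(2021, 10, 11, 1, 0)
sql = (
    "SELECT * FROM foo WHERE {} "
    "AND dt > timestamp '{}' AND dt < timestamp '{}'"
).format(partition_predicate(start, end), start, end)
This enumerates every hour in the range instead of collapsing whole days into a single day = '...' condition as in the hand-written query above, which keeps the logic simple at the cost of a longer IN list.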

Related

Athena queries changes on the fly according to the partitions introduced in S3

I have partitioned my DATA_BUCKET in S3 with the following structure:
S3/DATA_BUCKET/table_1/YYYY/MM/DD/files.parquet
Now I have three additional columns in table_1 which are visible in Athena as "partition_0", "partition_1" and "partition_2" (for year, month and day respectively).
Until now my apps have been making time-related queries based on the "time_stamp" column in the table:
select * from table_1 where time_stamp like '2023-01-17%'
Now to leverage the performance because of the partitions, the corresponding new query is:
select * from table_1 where partition_0 = '2023' and partition_1 = '01' and partition_2 = '17'
Problem:
There are many existing queries on time_stamp in my apps that I do not want to change, but I would still somehow like to transform those queries into partition-style queries like the one above.
Is there any way to do this internally in Athena, or with something else?
TIA
You can create a view over the original table with a new "time_stamp" column.
This column calculates a date from the date parts:
CREATE OR REPLACE VIEW my_view AS
SELECT
  mytable.col1,
  mytable.col2,
  CAST(date_add('day', trans_day - 1,
       date_add('month', trans_month - 1,
       date_add('year', trans_year - 1970, from_unixtime(0)))) AS date) AS time_stamp
FROM my_db.my_table mytable
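This is not from the answer above, just a sketch of another angle: if the view cannot be used, the old queries can be rewritten in the application layer before they are sent to Athena. A rough Python sketch, assuming the apps only ever filter with time_stamp like 'YYYY-MM-DD%' and that partition_0/1/2 are zero-padded strings as in the example:
import re

# Matches filters of the form: time_stamp like '2023-01-17%'
PATTERN = re.compile(r"time_stamp\s+like\s+'(\d{4})-(\d{2})-(\d{2})%'", re.IGNORECASE)

def add_partition_filter(query: str) -> str:
    """Append equivalent partition_0/1/2 filters so Athena can prune partitions."""
    def repl(match):
        year, month, day = match.groups()
        return ("partition_0 = '{0}' and partition_1 = '{1}' and partition_2 = '{2}' "
                "and time_stamp like '{0}-{1}-{2}%'".format(year, month, day))
    return PATTERN.sub(repl, query)

print(add_partition_filter("select * from table_1 where time_stamp like '2023-01-17%'"))
Keeping the original time_stamp condition alongside the partition filters preserves the old semantics while still letting Athena prune to a single day's partition.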

BigQuery - scheduled query on an external partitioned table with the @run_date parameter

(Because it is client data, I have replaced the project name and the dataset name with ****** in this post.)
I'm trying to create a new scheduled query in BigQuery on Google Cloud Platform.
The problem is that I get this error in the web query editor:
Cannot query over table '******.raw_bounce_rate' without a filter over column(s) 'dt' that can be used for partition elimination
The thing is I do filter on the column dt.
Here is the schema of my external partitioned table:
Tracking_Code STRING
Pages STRING NULLABLE
Clicks_to_Page INTEGER
Path_Lengths INTEGER
Visit_Number INTEGER
Visitor_ID STRING
Mobile_Device_Type STRING
All_Visits INTEGER
dt DATE
dt is the partitioning field, and I selected the option "Require partition filter".
Here is the simplified SQL of my query:
WITH yesterday_raw_bounce_rate AS (
SELECT *
FROM `******.raw_bounce_rate`
WHERE dt = DATE_SUB(@run_date, INTERVAL 1 DAY)
),
entries_table as (
SELECT dt,
ifnull(Tracking_Code, "sans campagne") as tracking_code,
ifnull(Pages, "page non trackée") as pages,
Visitor_ID,
Path_Lengths,
Clicks_to_Page,
SUM(all_visits) AS somme_visites
FROM
yesterday_raw_bounce_rate
GROUP BY
dt,
Tracking_Code,
Pages,
Visitor_ID,
Path_Lengths,
Clicks_to_Page
HAVING
somme_visites = 1 and Clicks_to_Page = 1
)
select * from entries_table
If I remove the condition
Clicks_to_Page = 1
or if I replace
DATE_SUB(@run_date, INTERVAL 1 DAY)
with a hard-coded date, the query is accepted by BigQuery. That does not make sense to me.
Currently there is an open issue, here, which addresses this error when the @run_date parameter is used in the filter of a scheduled query against a partitioned table with a required partition filter. The engineering team is working on it, although there is no ETA.
In your scheduled query, you can use one of the following two workarounds with @run_date:
First option:
DECLARE runDateVariable DATE DEFAULT @run_date;
-- your code ...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
Second option:
DECLARE runDateVariable DATE DEFAULT CAST(@run_date AS DATE);
-- your code ...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
In addition, you can also use CURRENT_DATE() instead of @run_date, as shown below:
DECLARE runDateVariable DATE DEFAULT CURRENT_DATE();
-- your code ...
WHERE date = DATE_SUB(runDateVariable, INTERVAL 1 DAY)
UPDATE
I have set up another scheduled query to run daily against a table partitioned by DATE on a field called date_formatted, with the partition filter required. Then I set up a backfill, here, so I could see the result of the scheduled query for previous days. Below is the code I used:
DECLARE runDateVariable DATE DEFAULT @run_date;
SELECT @run_date AS run_date, date_formatted, fullvisitorId
FROM `project_id.dataset.table_name`
WHERE date_formatted > DATE_SUB(runDateVariable, INTERVAL 1 DAY)

Can I replace an interval of partitions of a BigQuery partitioned table at once?

I'm working on BigQuery tables with the Python SDK and I want to achieve something that seems doable, but can't find anything in the documentation.
I have a table T partitioned by date, and I have a SELECT query that computes values over the last X days. In T, I would like to replace the partitions of the last X days with these values, without affecting the partitions older than X days.
Here is how we replace a single partition:
job_config = bigquery.QueryJobConfig()
job_config.destination = dataset.table("{}${}".format(table, date.strftime("%Y%m%d")))
job_config.use_legacy_sql = False
job_config.write_disposition = bigquery.job.WriteDisposition.WRITE_TRUNCATE
query_job = client.query(query, job_config=job_config)
query_job.result()
I tried this:
job_config.destination = dataset.table(table)
But it truncates all partitions, even those older than X days.
Is there a way to do this easily? Or do I have to loop over each partition of the interval?
Thanks
I don't think you can achieve this by playing with the destination table.
Cost aside, what you can do with SQL is:
DELETE FROM your_ds.your_table WHERE partition_date > DATE_SUB(CURRENT_DATE(), INTERVAL X DAY);
Then
INSERT INTO your_ds.your_table SELECT (...)
Cost
The first DELETE will cost:
The sum of bytes processed for all the columns referenced in all partitions for the tables scanned by the query
+ the sum of bytes for all columns in the modified or scanned partitions for the table being modified (at the time the DELETE starts).
The second INSERT INTO should cost the same as your current query.
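If the DELETE + INSERT cost is a concern, the other option mentioned in the question (looping over each partition of the interval) can be scripted with the Python SDK, reusing the same $YYYYMMDD destination trick as the snippet above. A rough sketch; dataset, table and query names are placeholders, and the SELECT is assumed to be restricted to one day per run:
from datetime import date, timedelta
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.dataset("my_dataset")  # placeholder dataset
table = "my_table"                      # placeholder table
X = 7                                   # number of trailing days to replace

for i in range(X):
    day = date.today() - timedelta(days=i)
    job_config = bigquery.QueryJobConfig()
    job_config.destination = dataset.table("{}${}".format(table, day.strftime("%Y%m%d")))
    job_config.write_disposition = bigquery.WriteDisposition.WRITE_TRUNCATE
    # The query only recomputes rows belonging to this one day, so
    # WRITE_TRUNCATE overwrites just that partition.
    query = (
        "SELECT * FROM `my_project.my_dataset.source_table` "
        "WHERE event_date = DATE '{}'".format(day.isoformat())
    )
    client.query(query, job_config=job_config).result()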

BigQuery: copy table partitions within a certain range

I have a Production dataset and a Test dataset in BigQuery. Both are partitioned by day (both have a _PARTITIONTIME column).
Is there a way for me to copy partitions of a Production table within a specific range (for example, the last month) to the Test dataset using bq cp?
If I just wanted to copy one partition, I would use the $[yyyymmdd] decorator to select it, but I am trying to avoid listing 30 comma-separated partitions to copy a month's worth.
I know querying is possible with something like _PARTITIONTIME >= "2018-01-01 00:00:00" AND _PARTITIONTIME < "2018-01-30 00:00:00".
Is it also possible to do a similar thing for copy?
Thanks
The answer is no.
You can write a simple script using the bq command that does it one partition at a time.
Or you can generate the bq commands with something like this:
#standardSQL
SELECT
  CONCAT('bq cp <srcproj>:<dataset>.<table>$', partname,
         ' <testproj>:<dataset>.<table>$', partname)
FROM (
  SELECT DISTINCT
    FORMAT_DATETIME('%Y%m%d', CAST(_PARTITIONDATE AS datetime)) AS partname
  FROM `<srcproj>.<dataset>.<table>`
  WHERE _PARTITIONTIME >= "2018-05-10 00:00:00"
    AND _PARTITIONTIME < "2018-05-13 00:00:00")
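Another option, not from the answer above, is a small script that loops over the date range and shells out to bq cp for each partition; the project, dataset and table names here are placeholders:
import subprocess
from datetime import date, timedelta

start = date(2018, 5, 10)
end = date(2018, 5, 13)  # exclusive

day = start
while day < end:
    part = day.strftime("%Y%m%d")
    # Copy one partition at a time using the $YYYYMMDD decorator.
    subprocess.run(
        ["bq", "cp",
         "srcproj:dataset.table${}".format(part),
         "testproj:dataset.table${}".format(part)],
        check=True,
    )
    day += timedelta(days=1)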

Most efficient way to retrieve data by timestamps

I'm using PostgreSQL 9.2.8.
I have a table like:
CREATE TABLE foo
(
foo_date timestamp without time zone NOT NULL,
-- other columns, constraints
)
This table contains about 4,000,000 rows. One day of data is about 50,000 rows.
My goal is to retrieve one day's data as fast as possible.
I have created an index like:
CREATE INDEX foo_foo_date_idx
ON foo
USING btree
(date_trunc('day'::text, foo_date));
And now I'm selecting data like this (now() is just an example; I need data from ANY day):
select *
from foo
where date_trunc('day'::text, now()) = date_trunc('day'::text, foo_date)
This query takes about 20 s.
Is there any possibility of obtaining the same data in a shorter time?
It takes time to retrieve 50,000 rows. 20 seconds seems like a long time, but if the rows are wide, then that might be an issue.
You can directly index foo_date and use inequalities. So, you might try this version:
create index foo_foo_date_idx2 on foo (foo_date);

select *
from foo
where foo_date >= date_trunc('day', now())
  and foo_date < date_trunc('day', now() + interval '1 day');