BigQuery doesn't recognize filter - sql

BigQuery doesn't recognize filter over column timestamp and outputs this:
Cannot query over table 'xxxxxx' without a filter over column(s) 'timestamp' that can be used for partition elimination
Query code that produced this message is:
SELECT project as name,
DATE_TRUNC(timestamp, DAY) as day,
COUNT (timestamp) as cnt
FROM `xxxxxx`
WHERE (DATETIME(timestamp) BETWEEN DATETIME_ADD(DATETIME('2022-02-13 00:00:00 UTC'), INTERVAL 1 SECOND)
AND DATETIME_SUB(DATE_TRUNC(CURRENT_DATETIME(), DAY), INTERVAL 1 SECOND))
GROUP BY 1, 2

Everything works if we replace every DATETIME conversion and every DATETIME operation with the TIMESTAMP type and the corresponding TIMESTAMP operations:
SELECT project as name,
DATE_TRUNC(timestamp, DAY) as day,
COUNT (timestamp) as cnt
FROM `xxxxxx`
WHERE (timestamp BETWEEN TIMESTAMP_ADD(TIMESTAMP('2022-02-13 00:00:00 UTC'), INTERVAL 1 SECOND)
AND TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL 1 SECOND))
GROUP BY 1, 2

The table was created with require_partition_filter set to true, so any query on this table must include a filter on the timestamp column that BigQuery can use for partition elimination. Wrapping the partitioning column in DATETIME() prevents that elimination, which is why the first query fails and the second one works.
Refer: Cannot query over table without a filter that can be used for partition elimination
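For context, a minimal sketch of how such a table is typically declared (the project, dataset, table, and column names here are placeholders, not the asker's actual schema):
-- Partitioned by day on the timestamp column; with require_partition_filter,
-- every query must filter on timestamp (or DATE(timestamp)) so BigQuery can
-- prune partitions.
CREATE TABLE `my_project.my_dataset.events` (
  project STRING,
  timestamp TIMESTAMP
)
PARTITION BY DATE(timestamp)
OPTIONS (require_partition_filter = TRUE);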


Create a Series of Dates between two Dates in a table - SQL

I have a table with the columns "Name", "Total Payment", "Start Date", and "End Date".
I want to list the rows per day between their Start Date and End Date, with Total Payment divided by the number of days (I assume I would need a window function partitioned by name here). But my main concern is how to create that series of dates for each name based on their Start Date and End Date.
Using the table above, I would like the output to have one row per name per day in that range.
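For a concrete frame of reference, here is a minimal stand-in for the table (column names are inferred from the answers below; the rows are invented):
CREATE TABLE test (
  name          text,
  total_payment numeric,
  start_date    date,
  end_date      date
);
INSERT INTO test VALUES
  ('Alice', 310.00, DATE '2022-01-01', DATE '2022-01-31'),
  ('Bob',   150.00, DATE '2022-02-10', DATE '2022-02-14');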
Consider a range join with count window function to spread out total by days:
SELECT t."Name",
t."Total Payment" / COUNT(dates) OVER(PARTITION BY t."Name") AS Payment,
t."Start Date",
t."End Date",
dates AS "Date of"
FROM generate_series(
timestamp without time zone '2022-01-01',
timestamp without time zone '2022-12-31',
'1 day'
) AS dates
INNER JOIN my_table t
ON dates BETWEEN t."Start Date" AND t."End Date"
You can get what you're after in a single query by using generate_series to get each day, and by just subtracting the two dates. (Since you seem to want both dates included in the day count, an additional 1 needs to be added.)
select name,
       (total_payment / ((end_date - start_date) + 1))::numeric(6,2) as payment,
       start_date,
       end_date,
       d::date as date_of
from test t
cross join generate_series(t.start_date,
                           t.end_date,
                           interval '1 day') gs(d)
order by name desc, date_of;
See demo. I leave it to you to decide what to do when the total_payment is not an exact multiple of the number of days; the demo just ignores the remainder.
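One hedged way to handle that remainder, building on the query above (an untested sketch against the same assumed test table): round the per-day share down to cents and put whatever is left over on the last day.
select name,
       -- last day absorbs the rounding remainder; other days get the truncated share
       case when d::date = end_date
            then total_payment - trunc(total_payment / days, 2) * (days - 1)
            else trunc(total_payment / days, 2)
       end as payment,
       start_date, end_date, d::date as date_of
from (select *, (end_date - start_date) + 1 as days from test) t
cross join generate_series(t.start_date, t.end_date, interval '1 day') gs(d)
order by name desc, date_of;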

Impute missing days with copy of last non-missing day in BigQuery

For some reason, I missed the ingestion of three days' worth of data in a BigQuery table. Now, I know that simply copying data from the last non-missing day is not the best way to impute missing data, but for my purposes, this is good enough.
I know that I could export the last non-missing day, transform the date in pandas to DATE + 1, DATE + 2 and so on, and then append that data to the original table in BigQuery. But I would rather avoid having to do this. Is there a good and easy way to do this directly in BigQuery or with Dataform? I am not very comfortable with SQL.
Thanks for any given advice.
You can do the following. The query is largely self-explanatory, but here are some details:
Use DATE_ADD() and DATE_SUB() to shift the returned dates and to filter the day you want to copy from.
Use UNION ALL to return the same table several times, each with a different modification and filter.
Use INSERT, as shown below, to write the retrieved rows into the table.
Before running the INSERT, run only the SELECTs and UNIONs to check that this is the data you want.
I've returned data from 1, 2 and 3 days ago (e.g. date_col = DATE_SUB(CURRENT_DATE(), INTERVAL 2 DAY)) and added 1 day to the date field.
INSERT INTO `<p>.<ds>.<t>` (date_col, data) (
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 1 DAY)
UNION ALL
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 2 DAY)
UNION ALL
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 3 DAY)
)
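If you'd rather not repeat the SELECT once per missing day, a hedged alternative (same `<p>.<ds>.<t>` placeholders; this sketch assumes the last non-missing day was 4 days ago) cross joins that source day with the list of missing dates:
INSERT INTO `<p>.<ds>.<t>` (date_col, data)
SELECT missing_day, t.data
FROM `<p>.<ds>.<t>` t
-- one copy of the source rows per missing date
CROSS JOIN UNNEST(GENERATE_DATE_ARRAY(
    DATE_SUB(CURRENT_DATE(), INTERVAL 3 DAY),
    DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))) AS missing_day
WHERE t.date_col = DATE_SUB(CURRENT_DATE(), INTERVAL 4 DAY)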

Google Big Query to look at data of 2 specific dates

I am new to BigQuery. I am trying to write a WHERE condition that selects only yesterday's data and that of the same day last year (in this case, data for 10/25/2021 and 10/25/2020). I know how to select a range of data, but I couldn't figure out a way to select only those 2 days. Any help is appreciated.
I recommend using BigQuery functions to define dates. You can read about them here.
WHERE DATE(your_date_field) IN (DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
                                DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR))
This is dynamic for whatever day you run the query. It takes the current date and subtracts 1 day. For the other date, it takes the current date, subtracts 1 day, and then subtracts 1 year, giving yesterday's date 1 year prior.
WHERE date_my_field IN (DATE('2021-10-25'), DATE('2020-10-25'))
Use IN, which is a shortcut for the OR operator.
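Spelled out with OR, that same filter is equivalent, just more verbose:
WHERE date_my_field = DATE('2021-10-25')
   OR date_my_field = DATE('2020-10-25')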
Consider the below, less verbose approach (especially if you remove the time zone):
select
current_date('America/Los_Angeles') - 1 as yesterday,
date(current_date('America/Los_Angeles') - 1 - interval 1 year) same_day_last_year
So, now you can use it in your WHERE clause, as in the below example (with dummy data via a CTE):
with data as (
select your_date_field
from unnest(generate_date_array(current_date() - 1000, current_date())) your_date_field
)
select *
from data
where your_date_field in (
current_date('America/Los_Angeles') - 1,
date(current_date('America/Los_Angeles') - 1 - interval 1 year)
)

SQL timestamp filtering based only on time

I want to create a query in Oracle SQL that will grab records from a given time interval, during certain hours of the day, e.g. records between 10am and noon in each of the past 10 days. I tried this, but it does not work:
select * from my_table where timestamp between
to_timestamp('2020-12-30','YYYY-MM-DD')
and
to_timestamp('2021-01-08','YYYY-MM-DD') and
timestamp between
to_timestamp('10:00:00','HH24:MI:SS')
and
to_timestamp('12:00:00','HH24:MI:SS')
where timestamp is of type TIMESTAMP. I have also thought of using a join, but I am struggling to find a way to filter on time of day.
Is there a way to filter using only the time, not the date, or a way to filter on time for every day in the interval?
select *
from my_table
where timestamp between to_timestamp('2020-12-30','YYYY-MM-DD')
and to_timestamp('2021-01-08','YYYY-MM-DD')
and timestamp - trunc(timestamp) between interval '10' hour
and interval '12' hour
If you don't need to include the exact instant of noon (12:00:00 with no fractional seconds), you could also do
select *
from my_table
where timestamp between to_timestamp('2020-12-30','YYYY-MM-DD')
and to_timestamp('2021-01-08','YYYY-MM-DD')
and extract( hour from timestamp ) between 10 and 11
As an aside, I'd hope that your actual column name isn't timestamp. It's legal as a column name but it is a reserved word so you're generally much better off using a different name.
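To make the date window dynamic, matching the "past 10 days" wording in the question, a hedged variant of the second query (keeping the question's timestamp column name, reserved-word caveat and all):
select *
from my_table
where timestamp >= trunc(sysdate) - 10                   -- midnight 10 days ago
  and extract(hour from timestamp) between 10 and 11;    -- 10:00:00 through 11:59:59.999...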

Best way to count rows by arbitrary time intervals

My app has an Events table with time-stamped events.
I need to report the count of events during each of the most recent N time intervals. For different reports, the interval could be "each week" or "each day" or "each hour" or "each 15-minute interval".
For example, a user can display how many orders they received each week, day, or hour, or quarter-hour.
1) My preference is to dynamically do a single SQL query (I'm using Postgres) that groups by an arbitrary time interval. Is there a way to do that?
2) An easy but ugly brute force way is to do a single query for all records within the start/end timeframe sorted by timestamp, then have a method manually build a tally by whatever interval.
3) Another approach would be to add separate fields to the Events table for each interval and statically store the_week, the_day, the_hour, and the_quarter_hour fields, so I take the 'hit' once, at the time the record is created, instead of every time I report on that field.
What's best practice here, given I could modify the model and pre-store interval data if required (although at the modest expense of doubling the table width)?
Luckily, you are using PostgreSQL. The set-returning function generate_series() is your friend.
Test case
Given the following test table (which you should have provided):
CREATE TABLE event(event_id serial, ts timestamp);
INSERT INTO event (ts)
SELECT generate_series(timestamp '2018-05-01'
, timestamp '2018-05-08'
, interval '7 min') + random() * interval '7 min';
One event for every 7 minutes (plus 0 to 7 minutes, randomly).
Basic solution
This query counts events for any arbitrary time interval. 17 minutes in the example:
WITH grid AS (
SELECT start_time
, lead(start_time, 1, 'infinity') OVER (ORDER BY start_time) AS end_time
FROM (
SELECT generate_series(min(ts), max(ts), interval '17 min') AS start_time
FROM event
) sub
)
SELECT start_time, count(e.ts) AS events
FROM grid g
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.end_time
GROUP BY start_time
ORDER BY start_time;
The query retrieves minimum and maximum ts from the base table to cover the complete time range. You can use an arbitrary time range instead.
Provide any time interval as needed.
Produces one row for every time slot. If no event happened during that interval, the count is 0.
Be sure to handle upper and lower bound correctly. See:
Unexpected results from SQL query with BETWEEN timestamps
The window function lead() has an often overlooked feature: it can take a default for when no leading row exists, 'infinity' in this example. Without it, the last interval would be cut off by an upper bound of NULL.
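To see that third argument in isolation, a throwaway sketch against the test table above:
SELECT ts,
       lead(ts, 1, 'infinity') OVER (ORDER BY ts) AS end_time  -- last row gets 'infinity', not NULL
FROM event
ORDER BY ts;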
Minimal equivalent
The above query uses a CTE and lead() and verbose syntax. Elegant and maybe easier to understand, but a bit more expensive. Here is a shorter, faster, minimal version:
SELECT start_time, count(e.ts) AS events
FROM (SELECT generate_series(min(ts), max(ts), interval '17 min') FROM event) g(start_time)
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.start_time + interval '17 min'
GROUP BY 1
ORDER BY 1;
Example for "every 15 minutes in the past week"
Formatted with to_char().
SELECT to_char(start_time, 'YYYY-MM-DD HH24:MI'), count(e.ts) AS events
FROM generate_series(date_trunc('day', localtimestamp - interval '7 days')
, localtimestamp
, interval '15 min') g(start_time)
LEFT JOIN event e ON e.ts >= g.start_time
AND e.ts < g.start_time + interval '15 min'
GROUP BY start_time
ORDER BY start_time;
Still ORDER BY and GROUP BY on the underlying timestamp value, not on the formatted string. That's faster and more reliable.
db<>fiddle here
Related answer producing a running count over the time frame:
PostgreSQL: running count of rows for a query 'by minute'