Impute missing days with copy of last non-missing day in BigQuery - google-bigquery

For some reason, I miss the ingestion of three days worth of data in a bigquery table. Now, I know that simply copying data from the last non-missing day is not the best way to impute missing data, but for my purposes, this is good enough.
I know that I could copy the last missing day, transform the date in pandas to DATE + 1, DATE +2 and so on and then append that data to the original table in bigquery. But, I would rather avoid having to do this. Is there a good and easy way to do this directly in bigquery or with dataform? I am not very comfortable with SQL.
Thanks for any given advice.

You can do the following. The query is self explanatory, but here is some details:
use the DATE_ADD() and DATE_SUB() to modify the data returned and to filter the day you want to copy from.
Use the union to return a single table many times with different modification and filters
Use the insert as described following to insert the retrieved data in the table.
Before run the insert, run only the selects and unions to check if that is the data you want
I've returned data from 1, 2 and 3 days ago (date_col = DATE_SUB(CURRENT_DATE(), interval 2 DAY)) and added 1 day on if date field.
INSERT INTO `<p>.<ds>.<t>` (date_col, data) (
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 1 DAY)
UNION ALL
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 2 DAY)
UNION ALL
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 3 DAY)
)

Related

how to get data for last calender week in redshift

I have a below query that I run to extract material movements from the last 7 days.
Purpose is to get the data for the last calender week for certain reports.
select
*
From
redshift
where
posting_date between CURRENT_DATE - 7 and CURRENT_DATE - 1
That means I need to run the query on every Monday to get the data for the former week.
Sometimes I am too busy on Monday or its vacation/bank holiday. In that case I would need to change the query or pull the data via SAP.
Question:
Is there a function for redshift that pulls out the data for the last calender week regardless when I run the query?
I already found following solution
SELECT id FROM table1
WHERE YEARWEEK(date) = YEARWEEK(NOW() - INTERVAL 1 WEEK)
But this doesnt seem to be working for redshift sql
Thanks a lot for your help.
Redshift offers a DATE_TRUNC('week', datestamp) function. Given any datestamp value, either a date or datetime, it gives back the date of the preceding Sunday.
So this might work for you. It filters rows from the Sunday before last, up until but not including, the last Sunday, and so gets a full week.
SELECT id
FROM table1
WHERE date >= DATE_TRUNC('week', NOW()) - INTERVAL 1 WEEK
AND date < DATE_TRUNC('week', NOW())
Pro tip: Every minute you spend learning your DBMS's date/time functions will save you an hour in programming.

BigQuery doesn't recognize filter

BigQuery doesn't recognize filter over column timestamp and outputs this:
Cannot query over table 'xxxxxx' without a filter over column(s) 'timestamp' that can be used for partition elimination
Query code that produced this message is:
SELECT project as name,
DATE_TRUNC(timestamp, DAY) as day,
COUNT (timestamp) as cnt
FROM `xxxxxx`
WHERE (DATETIME(timestamp) BETWEEN DATETIME_ADD(DATETIME('2022-02-13 00:00:00 UTC'), INTERVAL 1 SECOND)
AND DATETIME_SUB(DATE_TRUNC(CURRENT_DATETIME(), DAY), INTERVAL 1 SECOND))
GROUP BY 1, 2
Everything works if we switch every conversion to DATETIME and all DATETIME operations with TIMESTAMP format and TIMESTAMP type operations.
SELECT project as name,
DATE_TRUNC(timestamp, DAY) as day,
COUNT (timestamp) as cnt
FROM `xxxxxx`
WHERE (timestamp BETWEEN TIMESTAMP_ADD(TIMESTAMP('2022-02-13 00:00:00 UTC'), INTERVAL 1 SECOND)
AND TIMESTAMP_SUB(TIMESTAMP_TRUNC(CURRENT_TIMESTAMP(), DAY), INTERVAL 1 SECOND))
GROUP BY 1, 2
The table when being created was created with require partition filter set to true.
Any query on this table should have a filter on the timestamp.
Refer :- Cannot query over table without a filter that can be used for partition elimination

Google Big Query to look at data of 2 specific dates

I am new to Big Query. I am trying to do a where condition to only select yesterday's data and that of same day last year (in this case, 10/25/2021 data and 10/25/2020 data). I know how to select a range of data, but I couldn't figure out a way to only select those 2 days of data. Any help is appreciated.
I recommend using BigQuery functions to define dates. You can read about them here.
WHERE DATE(your_date_field) IN ((DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR))
This is dynamic to any day that you run the query. It will take the current date, then subtract 1 day. For the other date, it will take the current date and subtract 1 day and then 1 year, making it yesterday's date 1 year prior.
WHERE date_my_field IN (DATE('2021-10-25'), DATE('2020-10-25'))
Use IN which is a short cut for OR operator
Consider below (less verbose approach - especially if you remove time zone)
select
current_date('America/Los_Angeles') - 1 as yesterday,
date(current_date('America/Los_Angeles') - 1 - interval 1 year) same_day_last_year
with output
So, now you can use it in your WHERE clause as in below example (with dummy data via CTE)
with data as (
select your_date_field
from unnest(generate_date_array(current_date() - 1000, current_date())) your_date_field
)
select *
from data
where your_date_field in (
current_date('America/Los_Angeles') - 1,
date(current_date('America/Los_Angeles') - 1 - interval 1 year)
)
with output

Get last 30 days from table in SQL with YYYY-MM-DDtHH-MM-SS (BigQuery)

In BigQuery, I have a table of values with one column containing dates in YYYY-MM-DDtHH-MM-SS format, for example, 2020-07-24T20:13:35.
I want to pull only the rows from the past 30 days and exclude any rows that are more than 30 days old.
I believe I found out how to do it for date formatted as YYYY-MM-DD:
(Column name is "dates")
SELECT DATE_SUB(dates, INTERVAL 30 DAY)
This does not work when it is formatted as YYYY-MM-DDtHH-MM-SS though.
You would simply use:
where col > datetime_add(current_datetime, interval -30 day)
or
where col > timestamp_add(current_timestamp, interval -30 day)
depending on whether the column is a datetime or timestamp.
You can also use
Select * from table where date(date_column) >= date_sub(current_date, interval 30 day)
This way you will get all the records from 30 days before, starting from 12am. Using current_datetime or current_timestamp will only give results later than 30 days ago at time you run your query.

Get timestamp of one month ago in PostgreSQL

I have a PostgreSQL database in which one table rapidly grows very large (several million rows every month or so) so I'd like to periodically archive the contents of that table into a separate table.
I'm intending to use a cron job to execute a .sql file nightly to archive all rows that are older than one month into the other table.
I have the query working fine, but I need to know how to dynamically create a timestamp of one month prior.
The time column is stored in the format 2013-10-27 06:53:12 and I need to know what to use in an SQL query to build a timestamp of exactly one month prior. For example, if today is October 27, 2013, I want the query to match all rows where time < 2013-09-27 00:00:00
Question was answered by a friend in IRC:
'now'::timestamp - '1 month'::interval
Having the timestamp return 00:00:00 wasn't terrible important, so this works for my intentions.
select date_trunc('day', NOW() - interval '1 month')
This query will return date one month ago from now and round time to 00:00:00.
When you need to query for the data of previous month, then you need to query for the respective date column having month values as (current_month-1).
SELECT *
FROM {table_name}
WHERE {column_name} >= date_trunc('month', current_date-interval '1' month)
AND {column_name} < date_trunc('month', current_date)
The first condition of where clause will search the date greater than the first day (00:00:00 Day 1 of Previous Month)of previous month and second clause will search for the date less than the first day of current month(00:00:00 Day 1 of Current Month).
This will includes all the results where date lying in previous month.