Google Big Query to look at data of 2 specific dates - google-bigquery

I am new to Big Query. I am trying to do a where condition to only select yesterday's data and that of same day last year (in this case, 10/25/2021 data and 10/25/2020 data). I know how to select a range of data, but I couldn't figure out a way to only select those 2 days of data. Any help is appreciated.

I recommend using BigQuery date functions to define the dates. You can read about them in the BigQuery documentation.
WHERE DATE(your_date_field) IN (DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR))
This is dynamic for whichever day you run the query: it takes the current date and subtracts 1 day. For the other date, it takes the current date, subtracts 1 day and then 1 year, giving yesterday's date one year prior.
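As a quick sanity check (assuming, as in the question, that the query is run on 2021-10-26), the two expressions evaluate to:
SELECT DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AS yesterday, -- 2021-10-25
  DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR) AS same_day_last_year -- 2020-10-25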

WHERE date_my_field IN (DATE('2021-10-25'), DATE('2020-10-25'))
Use IN, which is a shortcut for the OR operator.
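For reference, the IN filter above is equivalent to writing the OR out explicitly:
WHERE date_my_field = DATE('2021-10-25')
OR date_my_field = DATE('2020-10-25')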

Consider the below, less verbose approach (especially if you remove the time zone):
select
  current_date('America/Los_Angeles') - 1 as yesterday,
  date(current_date('America/Los_Angeles') - 1 - interval 1 year) same_day_last_year
with output showing yesterday's date and the same day one year earlier.
So now you can use it in your WHERE clause, as in the below example (with dummy data via a CTE):
with data as (
  select your_date_field
  from unnest(generate_date_array(current_date() - 1000, current_date())) your_date_field
)
select *
from data
where your_date_field in (
  current_date('America/Los_Angeles') - 1,
  date(current_date('America/Los_Angeles') - 1 - interval 1 year)
)
with output containing only the rows for those two dates.

Related

How to get data for last calendar week in Redshift

I have the below query that I run to extract material movements from the last 7 days.
The purpose is to get the data for the last calendar week for certain reports.
select *
from redshift
where posting_date between CURRENT_DATE - 7 and CURRENT_DATE - 1
That means I need to run the query every Monday to get the data for the previous week.
Sometimes I am too busy on Monday, or it's a vacation/bank holiday. In that case I would need to change the query or pull the data via SAP.
Question:
Is there a function for Redshift that pulls the data for the last calendar week regardless of when I run the query?
I already found the following solution:
SELECT id FROM table1
WHERE YEARWEEK(date) = YEARWEEK(NOW() - INTERVAL 1 WEEK)
But this doesn't seem to work for Redshift SQL.
Thanks a lot for your help.
Redshift offers a DATE_TRUNC('week', datestamp) function. Given any datestamp value, either a date or a timestamp, it gives back the date of the preceding Monday.
So this might work for you. It filters rows from the Monday before last, up until but not including the last Monday, and so gets a full week.
SELECT id
FROM table1
WHERE date >= DATE_TRUNC('week', NOW()) - INTERVAL '1 week'
AND date < DATE_TRUNC('week', NOW())
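As a quick illustration (using a made-up date, not one from the question), this is what the truncation returns:
SELECT DATE_TRUNC('week', DATE '2021-10-27');
-- 2021-10-25 00:00:00, the Monday of that week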
Pro tip: Every minute you spend learning your DBMS's date/time functions will save you an hour in programming.

Date Functions Trunc (SysDate)

I am running the below query to get data recorded in the past 24 hours. I need the same data recorded starting at midnight (DATE > 12:00 AM) and also data recorded starting at the beginning of the month. Not sure if using BETWEEN will work or if there is a better option. Any suggestions?
SELECT COUNT(NUM)
FROM TABLE
WHERE
STATUS = 'CNLD'
AND
TRUNC(TO_DATE('1970-01-01','YYYY-MM-DD') + OPEN_DATE/86400) = trunc(sysdate)
Output (just need the count). OPEN_DATE's data type is NUMBER. The output below displays the count for the last 24 hours. I need the count beginning at midnight and another count starting at the beginning of the month.
The query you've shown will get the count of rows where OPEN_DATE is an 'epoch date' number representing time after midnight this morning*. The condition:
TRUNC(TO_DATE('1970-01-01','YYYY-MM-DD') + OPEN_DATE/86400) = trunc(sysdate)
requires every OPEN_DATE value in your table (or at least all those for CNLD rows) to be converted from a number to an actual date, which is going to be doing a lot more work than necessary, and would stop a standard index against that column being used. It could be rewritten as:
OPEN_DATE >= (trunc(sysdate) - date '1970-01-01') * 86400
which converts midnight this morning to its epoch equivalent, once, and compares all the numbers against that value; using an index if there is one and the optimiser thinks it's appropriate.
To get everything since the start of the month you could just change the default behaviour of trunc(), which is to truncate to the 'DD' element, to truncate to the start of the month instead:
OPEN_DATE >= (trunc(sysdate, 'MM') - date '1970-01-01') * 86400
And for the last 24 hours, subtract a day from the current time instead of truncating it:
OPEN_DATE >= ((sysdate - 1) - date '1970-01-01') * 86400
db<>fiddle with some made-up data to get 72 back for today, more for the last 24 hours, and more still for the whole month.
Based on your current query I'm assuming there won't be any future-dated values, so you don't need to worry about an upper bound for any of these.
*Ignoring leap seconds...
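If you want all three counts in one pass, here is a minimal sketch using conditional aggregation (it assumes the same placeholder TABLE, NUM, STATUS and epoch-seconds OPEN_DATE columns as the question):
SELECT COUNT(CASE WHEN OPEN_DATE >= (trunc(sysdate) - date '1970-01-01') * 86400 THEN NUM END) AS cnt_since_midnight,
  COUNT(CASE WHEN OPEN_DATE >= ((sysdate - 1) - date '1970-01-01') * 86400 THEN NUM END) AS cnt_last_24_hours,
  COUNT(CASE WHEN OPEN_DATE >= (trunc(sysdate, 'MM') - date '1970-01-01') * 86400 THEN NUM END) AS cnt_since_month_start
FROM TABLE
WHERE STATUS = 'CNLD'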
It sounds like you have a column of data type TIMESTAMP and you only want to select rows where that TIMESTAMP falls on today's date? And as a related problem, you want to find those that are in the current month, based on system values like CURRENT TIMESTAMP and CURRENT DATE? If so, let's call your column TRANSACTION_TIMESTAMP instead of the reserved word DATE. Your first query could be:
SELECT COUNT(NUM)
FROM TABLE
WHERE
STATUS = 'CLND'
AND
DATE(TRANSACTION_TIMESTAMP)=CURRENT DATE
The second example of finding all for the current month up to today's date could be:
SELECT COUNT(NUM)
FROM TABLE
WHERE
STATUS = 'CLND'
AND
YEAR(DATE(TRANSACTION_TIMESTAMP)) = YEAR(CURRENT DATE) AND
MONTH(DATE(TRANSACTION_TIMESTAMP)) = MONTH(CURRENT DATE) AND
DAY(DATE(TRANSACTION_TIMESTAMP)) <= DAY(CURRENT DATE)
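If there are no future-dated rows, the month-to-date filter can also be written as a single lower bound on the date. A sketch assuming a Db2-style dialect (the CURRENT DATE syntax above suggests one; adjust if your database differs):
SELECT COUNT(NUM)
FROM TABLE
WHERE STATUS = 'CLND'
AND DATE(TRANSACTION_TIMESTAMP) >= CURRENT DATE - (DAY(CURRENT DATE) - 1) DAYS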

Impute missing days with copy of last non-missing day in BigQuery

For some reason, I missed the ingestion of three days' worth of data in a BigQuery table. Now, I know that simply copying data from the last non-missing day is not the best way to impute missing data, but for my purposes this is good enough.
I know that I could copy the last non-missing day, transform the date in pandas to DATE + 1, DATE + 2 and so on, and then append that data to the original table in BigQuery. But I would rather avoid having to do this. Is there a good and easy way to do this directly in BigQuery or with Dataform? I am not very comfortable with SQL.
Thanks for any advice.
You can do the following. The query is self-explanatory, but here are some details:
Use DATE_ADD() and DATE_SUB() to modify the data returned and to filter the day you want to copy from.
Use UNION ALL to return the same table several times, with different modifications and filters.
Use INSERT, as described below, to insert the retrieved data into the table.
Before running the INSERT, run only the SELECTs and UNIONs to check that this is the data you want.
I've returned data from 1, 2 and 3 days ago (e.g. date_col = DATE_SUB(CURRENT_DATE(), interval 2 DAY)) and added 1 day to the date field.
INSERT INTO `<p>.<ds>.<t>` (date_col, data) (
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 1 DAY)
UNION ALL
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 2 DAY)
UNION ALL
SELECT DATE_ADD(date_col, INTERVAL 1 DAY) as date, data FROM `<p>.<ds>.<t>` where date_col = DATE_SUB(CURRENT_DATE(), interval 3 DAY)
)
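If, instead, you want every missing day to be a straight copy of the single last good day before the gap, here is a hedged sketch along the same lines (the gap dates and the 2021-09-30 source day are made-up placeholders; adjust them to your table):
INSERT INTO `<p>.<ds>.<t>` (date_col, data) (
SELECT missing_day AS date_col, data
FROM `<p>.<ds>.<t>`, UNNEST(GENERATE_DATE_ARRAY(DATE '2021-10-01', DATE '2021-10-03')) AS missing_day
WHERE date_col = DATE '2021-09-30'
)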

I'm trying to calculate variance using sql between two same days of two years

I will try to be as simple as possible to make my question crystal clear. I have a table called 'fb_ads' (it's about different Facebook campaigns for different stores in the USA) on BigQuery, and it contains the following columns:
STORE: name of the store
CLICKS: number of clicks
IMPRESSIONS: number of impressions of the ad
COST: the ad cost
DATE: YYYY-MM-DD
Frequency: number of visitors of a store
So, I'm trying to calculate the variance between two years 2017 and 2018.
Here is the variance I'm trying to calculate:
Variance_Of_Frequency = ((Frequency in 2018 at date X) - (Frequency in 2017 at date X)) / (Frequency in 2017 at date X)
The problem is that I'll have to compare the same day of the week close to date X.
For example, if I have a campaign run on a Monday, 2017-08-13, I'll need to compare it to another Monday in 2018 close to 2018-08-13 (it might be a Monday on 2018-08-15, for example).
This is a daily variance!
I tried to make a weekly variance calculation, and I don't know if it's correct; here is how I did it:
I first started by aggregating my daily table into a weekly table using the following query:
-- creating my weekly_table
SELECT
  year_week,
  STORE,
  min(DATE) as DATE,
  SUM(IMPRESSIONS) AS FB_IMPRESSIONS,
  SUM(CLICKS) AS FB_CLICKS,
  SUM(COST) AS FB_COST,
  SUM(Frequency) AS FREQUENCY
FROM (
  SELECT
    *,
    CONCAT(cast(ANNEE as string), LPAD(cast((extract(WEEK from date)) as string), 2, '0')) AS year_week
  FROM `fb_ads`)
GROUP BY
  year_week,
  STORE
ORDER BY year_week
Then I tried to calculate the variance using this:
SELECT
  base.*, (base.frequency - lw.frequency) / lw.frequency as VAR_FF
FROM `weekly_table` base
JOIN (
  SELECT
    * EXCEPT (date),
    DATE_ADD(DATE(TIMESTAMP(date)), INTERVAL 1 Week) AS date
  FROM `weekly_table` ) lw
ON base.date = lw.date
AND base.store = lw.store
Anyone has any idea how to do the daily thing or if my weekly queries are correct ?
Thanks!
For a given date, you want to know the date of the nearest Monday to the same date in the following year...
SET @dt = '2017-08-17';
SELECT CASE WHEN WEEKDAY(@dt + INTERVAL 1 YEAR) > 3
  THEN ADDDATE(ADDDATE(@dt + INTERVAL 1 YEAR, INTERVAL 1 WEEK), INTERVAL - WEEKDAY(@dt + INTERVAL 1 YEAR) DAY)
  ELSE ADDDATE(@dt + INTERVAL 1 YEAR, INTERVAL - WEEKDAY(@dt + INTERVAL 1 YEAR) DAY)
END x;
Obviously, I could remove all those + INTERVAL 1 YEAR bits by defining @dt that way to begin with.
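Since the question is tagged BigQuery (where SET @dt, WEEKDAY() and ADDDATE() are not available), here is a hedged sketch of the same idea in BigQuery SQL, generalized to find the date roughly one year later that falls on the same day of the week (the sample date is illustrative and not tied to the fb_ads table):
WITH sample AS (
  SELECT DATE '2017-08-17' AS dt
),
shifted AS (
  SELECT
    dt,
    DATE_ADD(dt, INTERVAL 1 YEAR) AS dt_next,
    -- days to add to dt_next to land on the same weekday as dt (0..6)
    MOD(EXTRACT(DAYOFWEEK FROM dt) - EXTRACT(DAYOFWEEK FROM DATE_ADD(dt, INTERVAL 1 YEAR)) + 7, 7) AS diff
  FROM sample
)
SELECT
  dt,
  CASE WHEN diff <= 3 THEN DATE_ADD(dt_next, INTERVAL diff DAY)
       ELSE DATE_SUB(dt_next, INTERVAL (7 - diff) DAY)
  END AS nearest_same_weekday_next_year
FROM shifted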

Simulate query over a range of dates

I have a fairly long query that looks over the past 13 weeks and determines if the current day's performance is an anomaly compared to the last 13 weeks. It just returns a single row that has the date, the performance of the current day and a flag saying if it is an anomaly or not. To make matters a little more complicated: The performance isn't just a single day but rather a running 24 hour window. This query is then run every hour to monitor the KPI over the last 24 hours. i.e. If it is 2pm on Tuesday, it will look from 2pm the previous day (Monday) to now, and compare it to every other 2pm-to-2pm for the last 13 weeks.
To test if this code is working I would like simulate it running over the past month.
The code goes as follows:
WITH performance AS(
SELECT TRUNC(dateColumn - to_number(to_char(sysdate, 'hh24'))/24) as startdate,
KPI_a,
KPI_b,
KPI_c
FROM table
WHERE someConditions
GROUP BY TRUNC(dateColumn - to_number(to_char(sysdate, 'hh24'))/24)),
compare_t AS(
  -- looks at relationships of the KPIs
),
variables AS(
  -- calculates the variables required for the anomaly detection
),
... ok, I don't know how much of the query needs to be given, but basically I need to simulate 'sysdate'. Instead of inputting the current date, input each hour for the last month, so this query will run approximately 720 times and return the result 720 times, once for each hour of each day.
I'm thinking of a FOR loop, but I'm not sure.
You can use a recursive subquery:
with times(time) as
(
select sysdate - interval '1' month as time from dual
union all
select time + interval '1' hour from times
where time < sysdate
)
, performance as ()
, compare_t as ()
, variables as ()
select *
from times
join ...
order by time;
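As a standalone sanity check (a sketch with no dependency on your tables), the hour-generating part on its own should return roughly 720 rows, one per hour over the last month:
with times(time) as
(
  select sysdate - interval '1' month as time from dual
  union all
  select time + interval '1' hour from times
  where time < sysdate
)
select count(*) as hours_generated, min(time) as first_time, max(time) as last_time
from times;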
I don't understand your specific requirements, but I have had to solve similar problems. To give you an idea, here are two proposals:
Calculate the average and standard deviation of the KPI value from the past 13 weeks up to yesterday. If the current value from today is lower than AVG - 10*STDDEV, then select the record, i.e. mark it as an anomaly.
WITH t AS
(SELECT dateColumn, KPI_A,
AVG(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN 13 * INTERVAL '7' DAY PRECEDING AND INTERVAL '1' DAY PRECEDING) AS REF_AVG,
STDDEV(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN 13 * INTERVAL '7' DAY PRECEDING AND INTERVAL '1' DAY PRECEDING) AS REF_STDDEV
FROM TABLE
WHERE someConditions)
SELECT dateColumn, REF_AVG, KPI_A, REF_STDDEV
FROM t
WHERE TRUNC(dateColumn, 'HH') = TRUNC(LOCALTIMESTAMP, 'HH')
AND KPI_A < REF_AVG - 10 * REF_STDDEV;
Take hourly values from last week (i.e. the same weekday as yesterday) and correlate them with hourly values from yesterday. If the correlation is less than a certain value (I use 95%), then consider that day an anomaly.
WITH t AS
(SELECT dateColumn, KPI_A,
FIRST_VALUE(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS KPI_A_LAST_WEEK,
dateColumn - FIRST_VALUE(dateColumn) OVER (ORDER BY dateColumn RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS RANGE_INT
FROM table
WHERE ...)
SELECT 100*ROUND(CORR(KPI_A, KPI_A_LAST_WEEK), 2) AS CORR_VAL
FROM t
WHERE KPI_A_LAST_WEEK IS NOT NULL
AND RANGE_INT = INTERVAL '7' DAY
AND TRUNC(dateColumn) = TRUNC(LOCALTIMESTAMP - INTERVAL '1' DAY)
GROUP BY TRUNC(dateColumn);