I'm trying to calculate variance using SQL between the same days of two years - sql

I will try to be as simple as possible to make my question crystal-clear. I have a table called 'fb_ads' (it's about different Facebook campaigns for different stores in the USA) on BigQuery, and it contains the following columns:
STORE: name of the store
CLICKS: number of clicks
IMPRESSIONS: number of impressions of the ad
COST: the ad cost
DATE: YYYY-MM-DD
Frequency: number of visitors of a store
So, I'm trying to calculate the variance between the two years 2017 and 2018.
Here is the variance I'm trying to calculate:
Variance_Of_Frequency = ((Frequency in 2018 at date X) - (Frequency in 2017 at date X)) / (Frequency in 2017 at date X)
The problem is that I'll have to compare the same day of the week close to date X.
For example, for a campaign that ran on a Monday, 2017-08-13, I'll need to compare it to a Monday in 2018 close to 2018-08-13 (it might be a Monday falling on 2018-08-15, for example).
This is a daily variance!
I tried to calculate a weekly variance instead, and I don't know if it's correct. Here is how I did it:
I first aggregated my daily table into a weekly table using the following query:
Creating my weekly_table:
SELECT
  year_week,
  STORE,
  MIN(DATE) AS DATE,
  SUM(IMPRESSIONS) AS FB_IMPRESSIONS,
  SUM(CLICKS) AS FB_CLICKS,
  SUM(COST) AS FB_COST,
  SUM(Frequency) AS FREQUENCY
FROM (
  SELECT
    *,
    CONCAT(CAST(EXTRACT(YEAR FROM DATE) AS STRING),
           LPAD(CAST(EXTRACT(WEEK FROM DATE) AS STRING), 2, '0')) AS year_week
  FROM `fb_ads`)
GROUP BY
  year_week,
  STORE
ORDER BY year_week
Then I tried to calculate the variance using this:
SELECT
  base.*,
  (base.FREQUENCY - lw.FREQUENCY) / lw.FREQUENCY AS VAR_FF
FROM
  `weekly_table` base
JOIN (
  SELECT
    * EXCEPT (DATE),
    DATE_ADD(DATE(TIMESTAMP(DATE)), INTERVAL 1 WEEK) AS DATE
  FROM
    `weekly_table`) lw
ON
  base.DATE = lw.DATE
  AND base.STORE = lw.STORE
Does anyone have an idea how to do the daily version, or whether my weekly queries are correct?
Thanks!

For a given date, you want to know the date of the nearest Monday to the same date in the following year...
SET @dt = '2017-08-17';
SELECT CASE WHEN WEEKDAY(@dt + INTERVAL 1 YEAR) > 3
            THEN ADDDATE(ADDDATE(@dt + INTERVAL 1 YEAR, INTERVAL 1 WEEK), INTERVAL -WEEKDAY(@dt + INTERVAL 1 YEAR) DAY)
            ELSE ADDDATE(@dt + INTERVAL 1 YEAR, INTERVAL -WEEKDAY(@dt + INTERVAL 1 YEAR) DAY)
       END x;
Obviously, I could remove all those + INTERVAL 1 YEAR bits by defining @dt that way to begin with.
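Since the question is about BigQuery, here is a hedged sketch of the same nearest-Monday logic in BigQuery standard SQL (DATE_TRUNC(d, WEEK(MONDAY)) returns the Monday of the week containing d; the sample date is carried over from the MySQL answer above):
-- More than 3 days past Monday: round up to the next Monday; otherwise round down.
SELECT
  IF(DATE_DIFF(d, DATE_TRUNC(d, WEEK(MONDAY)), DAY) > 3,
     DATE_ADD(DATE_TRUNC(d, WEEK(MONDAY)), INTERVAL 1 WEEK),
     DATE_TRUNC(d, WEEK(MONDAY))) AS nearest_monday
FROM (SELECT DATE_ADD(DATE '2017-08-17', INTERVAL 1 YEAR) AS d);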

Related

SQL week number for the whole table

How can I create a new column that calculates a week number for the whole table, ignoring year?
Appreciate any help :)
You can do this by calculating the first day of the week of the oldest row, then calculating the difference in days between the first day of the week of the current row and that of the oldest row; dividing that by 7 days and adding 1 gives the desired week number across the full table.
Assuming you are using MySQL and the first day of the week is Sunday:
WITH min_week_start AS (
  SELECT
    SUBDATE(MIN(record_date), DAYOFWEEK(MIN(record_date)) - 1) AS week_start_date
  FROM record_table
),
record_week_start AS (
  SELECT
    record_date,
    SUBDATE(record_date, DAYOFWEEK(record_date) - 1) AS week_start_date
  FROM record_table
)
SELECT
  record_week_start.record_date,
  DATEDIFF(record_week_start.week_start_date, min_week_start.week_start_date) / 7 + 1 AS week_num
FROM record_week_start
CROSS JOIN min_week_start
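To sanity-check the logic, you can run the same calculation against an inline table (a hypothetical three-row record_table; the dates are chosen so the result is easy to verify by hand, and MySQL 8.0+ is assumed for the CTE):
WITH record_table (record_date) AS (
  SELECT DATE '2023-01-03' UNION ALL  -- Tuesday, week starting Sun 2023-01-01
  SELECT DATE '2023-01-10' UNION ALL  -- Tuesday, week starting Sun 2023-01-08
  SELECT DATE '2023-01-17'            -- Tuesday, week starting Sun 2023-01-15
)
SELECT
  record_date,
  DATEDIFF(SUBDATE(record_date, DAYOFWEEK(record_date) - 1),
           (SELECT SUBDATE(MIN(record_date), DAYOFWEEK(MIN(record_date)) - 1)
            FROM record_table)) / 7 + 1 AS week_num  -- yields 1, 2, 3
FROM record_table;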

Convert integer years or months into days in Impala SQL

I have two columns, both with integer values: one representing years and the other representing months.
My goal is to perform calculations in days (integer), so I have to convert both to calendar days, taking into consideration that years can have both 365 and 366 days.
Example in pseudo code:
Select Convert(years_int) to days, Convert(months int) to days
from table.
Real example:
if --> Years = 1 and Months = 12
1) Convert both to days to compare them: Years = 365 days; Months = 365 days
After conversion: (Years = Months) returns TRUE.
The problem is when we have Years = 10 (for example): we must take into account the fact that at least two of those years have 366 days. The same goes for months - we have months with 30 and 31 days. So I need to compensate for that to get the most accurate possible value in days.
Thanks in advance
Converting integers to a timestamp can be done in PostgreSQL. I do not have Impala, but hopefully the script below will help you get this done in Impala:
with
  year as (select 2022 as y union select 2023),
  month as (select generate_series(1, 12) as m),
  day as (select generate_series(1, 31) as d)
select y, m, d, dt
from (
  select
    y, m, d,
    to_date(ds, 'YYYYMMDD') + (((d - 1)::char(2)) || ' day')::interval as dt
  from (
    select
      *,
      y::char(4) || right('0' || m::char(2), 2) || '01' as ds
    from year, month, day
  ) x
) t
where extract(year from dt) = y and extract(month from dt) = m
order by dt;
see: DBFIDDLE
Functions used in this query, and a way to convert them to Impala (remember, I do not use that tool/language/dialect):
to_date(a,b) - converts the string a to a date using the format b. In Impala you can use CAST(expression AS type FORMAT pattern).
y::char(4) - casts y to a char(4). In Impala you can use CAST(expression AS type).
right(a,b) - use: right()
|| - use: concat()
generate_series(a,b) - generates a series of numbers from a to (and including) b. A plain SQL alternative is SELECT 1 AS x UNION SELECT 2 UNION SELECT 3, which generates the same series as generate_series(1,3) in PostgreSQL.
extract(year from a) - gets the year from the datetime field a, see YEAR()
One special case is this one: to_date(ds,'YYYYMMDD') + (((d-1)::char(2)) || ' day')::interval
This will convert ds (with datatype CHAR(8)) to a date, and then add (using +) a number of days (like '4 day').
Because I included all days up to 31, this will produce dates that roll over into the next month in February, April, June, September and November, because those months do not have 31 days. This is corrected by the WHERE clause at the end (where extract(year from dt) = y and extract(month from dt) = m).
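For the original question itself, here is a hedged sketch of a calendar-aware conversion directly in Impala (assuming a table t with integer columns years_int and months_int, and anchoring the conversion at today's date; add_months(), datediff() and now() are Impala built-ins):
SELECT
  -- days spanned by years_int calendar years starting today (leap days included)
  datediff(add_months(now(), 12 * years_int), now()) AS year_days,
  -- days spanned by months_int calendar months starting today (28/29/30/31 handled)
  datediff(add_months(now(), months_int), now()) AS month_days
FROM t;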

Google BigQuery to look at data of 2 specific dates

I am new to BigQuery. I am trying to write a WHERE condition to select only yesterday's data and that of the same day last year (in this case, 10/25/2021 data and 10/25/2020 data). I know how to select a range of data, but I couldn't figure out a way to select only those 2 days of data. Any help is appreciated.
I recommend using BigQuery functions to define dates. You can read about them here.
WHERE DATE(your_date_field) IN (DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
                                DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR))
This is dynamic to whichever day you run the query: it will take the current date, then subtract 1 day. For the other date, it will take the current date and subtract 1 day and then 1 year, making it yesterday's date 1 year prior.
WHERE date_my_field IN (DATE('2021-10-25'), DATE('2020-10-25'))
Use IN, which is a shortcut for the OR operator.
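In other words, the IN condition above is just shorthand for:
WHERE date_my_field = DATE('2021-10-25')
   OR date_my_field = DATE('2020-10-25')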
Consider the below (a less verbose approach - especially if you remove the time zone):
select
  current_date('America/Los_Angeles') - 1 as yesterday,
  date(current_date('America/Los_Angeles') - 1 - interval 1 year) as same_day_last_year
So now you can use it in your WHERE clause, as in the example below (with dummy data via a CTE):
with data as (
  select your_date_field
  from unnest(generate_date_array(current_date() - 1000, current_date())) your_date_field
)
select *
from data
where your_date_field in (
  current_date('America/Los_Angeles') - 1,
  date(current_date('America/Los_Angeles') - 1 - interval 1 year)
)

Simulate query over a range of dates

I have a fairly long query that looks over the past 13 weeks and determines if the current day's performance is an anomaly compared to the last 13 weeks. It just returns a single row that has the date, the performance of the current day, and a flag saying whether it is an anomaly or not. To make matters a little more complicated, the performance isn't just a single day but rather a running 24-hour window. This query is then run every hour to monitor the KPI over the last 24 hours. I.e., if it is 2 pm on Tuesday, it will look from 2 pm the previous day (Monday) to now, and compare that window to every other 2pm-to-2pm window over the last 13 weeks.
To test whether this code is working, I would like to simulate it running over the past month.
The code goes as follows:
WITH performance AS (
  SELECT TRUNC(dateColumn - TO_NUMBER(TO_CHAR(SYSDATE, 'HH24')) / 24) AS startdate,
         KPI_a,
         KPI_b,
         KPI_c
  FROM table
  WHERE someConditions
  GROUP BY TRUNC(dateColumn - TO_NUMBER(TO_CHAR(SYSDATE, 'HH24')) / 24)
),
compare_t AS (
  -- looks at relationships of the KPIs
),
variables AS (
  -- calculates the variables required for the anomaly detection
),
... OK, I don't know how much of the query needs to be given, but basically I need to simulate 'sysdate'. Instead of inputting the current date, input each hour of the last month, so this query will run approx. 720 times and return the result 720 times, once for each hour of each day.
I'm thinking a FOR loop, but I'm not sure.
You can use a recursive subquery:
with times(time) as
(
  select sysdate - interval '1' month as time from dual
  union all
  select time + interval '1' hour from times
  where time < sysdate
)
, performance as ()
, compare_t as ()
, variables as ()
select *
from times
join ...
order by time;
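If recursive subqueries are unavailable (they need Oracle 11gR2 or later), a hedged alternative sketch for the hour generator is Oracle's CONNECT BY LEVEL idiom (31 * 24 is an approximation of one month of hours):
select sysdate - interval '1' month + numtodsinterval(level - 1, 'hour') as time
from dual
connect by level <= 31 * 24;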
I don't understand your specific requirements, but I had to solve similar problems. To give you an idea, here are two proposals:
Calculate the average and standard deviation of the KPI value from the past 13 weeks up to yesterday. If the current value from today is lower than "AVG - 10*STDDEV", then select the record, i.e. mark it as an anomaly.
WITH t AS
  (SELECT dateColumn, KPI_A,
          AVG(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN 13 * INTERVAL '7' DAY PRECEDING AND INTERVAL '1' DAY PRECEDING) AS REF_AVG,
          STDDEV(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN 13 * INTERVAL '7' DAY PRECEDING AND INTERVAL '1' DAY PRECEDING) AS REF_STDDEV
   FROM TABLE
   WHERE someConditions)
SELECT dateColumn, REF_AVG, KPI_A, REF_STDDEV
FROM t
WHERE TRUNC(dateColumn, 'HH') = TRUNC(LOCALTIMESTAMP, 'HH')
  AND KPI_A < REF_AVG - 10 * REF_STDDEV;
Take hourly values from last week (i.e. the same weekday as yesterday) and correlate them with hourly values from yesterday. If the correlation is less than a certain value (I use 95%), then consider this day an anomaly.
WITH t AS
  (SELECT dateColumn, KPI_A,
          FIRST_VALUE(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS KPI_A_LAST_WEEK,
          dateColumn - FIRST_VALUE(dateColumn) OVER (ORDER BY dateColumn RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS RANGE_INT
   FROM table
   WHERE ...)
SELECT 100 * ROUND(CORR(KPI_A, KPI_A_LAST_WEEK), 2) AS CORR_VAL
FROM t
WHERE KPI_A_LAST_WEEK IS NOT NULL
  AND RANGE_INT = INTERVAL '7' DAY
  AND TRUNC(dateColumn) = TRUNC(LOCALTIMESTAMP - INTERVAL '1' DAY)
GROUP BY TRUNC(dateColumn);

BigQuery SQL for sliding window aggregate

Hi, I have a table that looks like this:
Date        Customer  Pageviews
2014/03/01  abc       5
2014/03/02  xyz       8
2014/03/03  abc       6
I want to get pageview aggregates grouped by week, but showing aggregates for the past 30 days (sliding window aggregates with a window size of 30 days, for every week).
I am using Google BigQuery.
EDIT: Gordon - re your comment about "Customer": actually, what I need is slightly more complicated, and that's why I included Customer in the table above. I am looking to get the number of customers who had >n pageviews in a 30-day window, every week. Something like this:
Date        Customers with >10 pageviews in a 30-day window
2014/02/01  10
2014/02/08  5
2014/02/15  6
2014/02/22  15
However, to keep it simple, I can work my way up if I could just get a sliding window aggregate of pageviews, ignoring customers altogether. Something like this:
Date        Count of pageviews in a 30-day window
2014/02/01  50
2014/02/08  55
2014/02/15  65
2014/02/22  75
How about this:
SELECT changes + changes1 + changes2 + changes3 changes28days, login, USEC_TO_TIMESTAMP(week)
FROM (
  SELECT changes,
         LAG(changes, 1) OVER (PARTITION BY login ORDER BY week) changes1,
         LAG(changes, 2) OVER (PARTITION BY login ORDER BY week) changes2,
         LAG(changes, 3) OVER (PARTITION BY login ORDER BY week) changes3,
         login,
         week
  FROM (
    SELECT SUM(payload_pull_request_changed_files) changes,
           UTC_USEC_TO_WEEK(created_at, 1) week,
           actor_attributes_login login
    FROM [publicdata:samples.github_timeline]
    WHERE payload_pull_request_changed_files > 0
    GROUP BY week, login
  ))
HAVING changes28days > 0
For each user it counts how many changes they have submitted per week. Then with LAG() we can peek at the preceding rows: how many changes they submitted 1, 2, and 3 weeks earlier. Then we just add those 4 weeks to see how many changes were submitted over the last 28 days.
Now you can wrap everything in a new query to filter users with changes > X, and count them.
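In today's BigQuery standard SQL, the same 30-day sliding window can also be expressed directly with a RANGE window - a hedged sketch, assuming a table pageviews with a DATE column date plus customer and pageviews columns like the ones in the question:
SELECT
  date,
  customer,
  -- total pageviews for this customer over the 30 days ending on this row's date
  SUM(pageviews) OVER (
    PARTITION BY customer
    ORDER BY UNIX_DATE(date)
    RANGE BETWEEN 29 PRECEDING AND CURRENT ROW
  ) AS pageviews_30d
FROM pageviews;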
I have created the following "Times" table:
Table Details: Dim_Periods
Schema:
Date        TIMESTAMP
Year        INTEGER
Month       INTEGER
day         INTEGER
QUARTER     INTEGER
DAYOFWEEK   INTEGER
MonthStart  TIMESTAMP
MonthEnd    TIMESTAMP
WeekStart   TIMESTAMP
WeekEnd     TIMESTAMP
Back30Days  TIMESTAMP  -- the date 30 days before "Date"
Back7Days   TIMESTAMP  -- the date 7 days before "Date"
and I use a query like this to handle "running sums":
SELECT Date, COUNT(*) AS MovingCNT
FROM (
  SELECT Date,
         Back7Days
  FROM DWH.Dim_Periods
  WHERE Date < TIMESTAMP(CURRENT_DATE())
    AND Date >= DATE_ADD(CURRENT_TIMESTAMP(), -5, 'month')
) P
CROSS JOIN EACH (
  SELECT repository_url, repository_created_at
  FROM publicdata:samples.github_timeline
) L
WHERE TIMESTAMP(repository_created_at) >= Back7Days
  AND TIMESTAMP(repository_created_at) <= Date
GROUP EACH BY Date
Note that it can be used for "Month to Date", "Week to Date", "30 days back", etc. aggregations as well.
However, performance is not the best, and the query can take a while on larger data sets due to the Cartesian join.
Hope this helps.