Simulate query over a range of dates - sql

I have a fairly long query that looks over the past 13 weeks and determines if the current day's performance is an anomaly compared to the last 13 weeks. It just returns a single row that has the date, the performance of the current day and a flag saying if it is an anomaly or not. To make matters a little more complicated: The performance isn't just a single day but rather a running 24 hour window. This query is then run every hour to monitor the KPI over the last 24 hours. i.e. If it is 2pm on Tuesday, it will look from 2pm the previous day (Monday) to now, and compare it to every other 2pm-to-2pm for the last 13 weeks.
To test if this code is working I would like simulate it running over the past month.
The code goes as follows:
WITH performance AS(
SELECT TRUNC(dateColumn - to_number(to_char(sysdate, 'hh24')/24) as startdate,
KPI_a,
KPI_b,
KPI_c
FROM table
WHERE someConditions
GROUP BY TRUNC(dateColumn - to_number(to_char(sysdate, 'hh24')/24)),
compare_t AS(
-- looks at relationships of the KPIs),
variables AS(
-- calculates the variables required for the anomaly detection),
... ok I don't know how much of the query needs to be given but it's basically I need to simulate 'sysdate'. Instead of inputting the current date, input each hour for the last month so this query will run approx 720 times and return the result 720 times, for each hour of each day.
I'm thinking a FOR loop, but I'm not sure.

You can use a recursive subquery:
with times(time) as
(
select sysdate - interval '1' month as time from dual
union all
select time + interval '1' hour from times
where time < sysdate
)
, performance as ()
, compare_t as ()
, variables as ()
select *
from times
join ...
order by time;

I don't understand your specific requirements but I had to solve similar problems. To give you an idea here are two proposals:
Calculate average and standard deviation of KPI value from past 13 weeks to yesterday. If current value from today it lower than "AVG - 10*STDDEV" then select record, i.e. mark as anomaly.
WITH t AS
(SELECT dateColumn, KPI_A,
AVG(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN 13 * INTERVAL '7' DAY PRECEDING AND INTERVAL '1' DAY PRECEDING) AS REF_AVG,
STDDEV(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN 13 * INTERVAL '7' DAY PRECEDING AND INTERVAL '1' DAY PRECEDING) AS REF_STDDEV
FROM TABLE
WHERE someConditions)
SELECT dateColumn, REF_AVG, KPI_A, REF_STDDEV
FROM t
WHERE TRUNC(dateColumn, 'HH') = TRUNC(LOCALTIMESTAMP, 'HH')
AND KPI_A < REF_AVG - 10 * REF_STDDEV;
Take hourly values from last week (i.e. the same weekday as yesterday) and make correlation with hourly values from yesterday. If correlation is less than certain value (I use 95%) then consider this day as anomaly.
WITH t AS
(SELECT dateColumn, KPI_A,
FIRST_VALUE(KPI_A) OVER (ORDER BY dateColumn RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS KPI_A_LAST_WEEK,
dateColumn - FIRST_VALUE(dateColumn) OVER (ORDER BY dateColumn RANGE BETWEEN INTERVAL '7' DAY PRECEDING AND CURRENT ROW) AS RANGE_INT
FROM table
WHERE ...)
SELECT 100*ROUND(CORR(KPI_A, KPI_A_LAST_WEEK), 2) AS CORR_VAL
FROM t
WHERE KPI_A_LAST_WEEK IS NOT NULL
AND RANGE_INT = INTERVAL '7' DAY
AND TRUNC(dateColumn) = TRUNC(LOCALTIMESTAMP - INTERVAL '1' DAY)
GROUP BY TRUNC(dateColumn);

Related

Averaging a variable over a period of time

I am currently having difficulty formulating this into an sql query:
I would like to average the data of a column here twa for a duration of 10 minutes starting from the last value of the table i.e. data included here:
last date-10minutes<=date<=last date
I tried to start a first query but it does not show the right answer:
SELECT AVG(twa), horaire FROM OF50 WHERE ((SELECT horaire FROM of50 ORDER BY horaire DESC LIMIT 1)-INTERVAL '1 minutes'>horaire) ORDER BY horaire;
Regards,
Maybe this will do.
with t as (select max(horaire) maxhoraire from of50)
select AVG(of50.twa)
from of50, t
where of50.horaire between t.maxhoraire - interval '1 minute' and t.maxhoraire;
or even this may do, given that the last value can not be 'younger' then now and at least one event happened during the last minute, though it is not exactly the same and says 'the average over the last 1 minute'
select AVG(twa)
from of50
where horaire >= now() - interval '1 minute';

Google Big Query to look at data of 2 specific dates

I am new to Big Query. I am trying to do a where condition to only select yesterday's data and that of same day last year (in this case, 10/25/2021 data and 10/25/2020 data). I know how to select a range of data, but I couldn't figure out a way to only select those 2 days of data. Any help is appreciated.
I recommend using BigQuery functions to define dates. You can read about them here.
WHERE DATE(your_date_field) IN ((DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY),
DATE_SUB(DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), INTERVAL 1 YEAR))
This is dynamic to any day that you run the query. It will take the current date, then subtract 1 day. For the other date, it will take the current date and subtract 1 day and then 1 year, making it yesterday's date 1 year prior.
WHERE date_my_field IN (DATE('2021-10-25'), DATE('2020-10-25'))
Use IN which is a short cut for OR operator
Consider below (less verbose approach - especially if you remove time zone)
select
current_date('America/Los_Angeles') - 1 as yesterday,
date(current_date('America/Los_Angeles') - 1 - interval 1 year) same_day_last_year
with output
So, now you can use it in your WHERE clause as in below example (with dummy data via CTE)
with data as (
select your_date_field
from unnest(generate_date_array(current_date() - 1000, current_date())) your_date_field
)
select *
from data
where your_date_field in (
current_date('America/Los_Angeles') - 1,
date(current_date('America/Los_Angeles') - 1 - interval 1 year)
)
with output

Dynamic scheduled query based on timestamp

I am trying to set up a scheduled query to run on the 1st of each month, and capture one month of data. However it should not be the previous month, but 2 months previous - due to delays in data being loaded in to the source table. The source table is partitioned by day on session_timestamp so refining this as much as possible will be of benefit to reducing query cost.
So far I have this:
WHERE
EXTRACT(YEAR
FROM
session_timestamp) = EXTRACT(YEAR
FROM
DATE_SUB(CURRENT_DATE, INTERVAL 2 MONTH))
AND EXTRACT(MONTH
FROM
session_timestamp) = EXTRACT(MONTH
FROM
DATE_SUB(CURRENT_DATE, INTERVAL 2 MONTH))
This seems a highly inelegant solution but was intended to address cases where a year boundary would be crossed. However I can see from the "This script will process * when run." area that this is going to query everything in 2020 and not just in May 2020.
As you have pointed out, your query doesn't engage partition filter down to the 2 months of data which you want to query.
You don't have to do the year trick because DATE_TRUNC(..., MONTH) has year in it. Please try filter below:
-- Last day of the month
DATE(session_timestamp) <= DATE_SUB(DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH), MONTH), INTERVAL 1 DAY)
AND
-- First day of the month
DATE(session_timestamp) >= DATE_TRUNC(DATE_SUB(CURRENT_DATE(), INTERVAL 2 MONTH), MONTH)

I'm trying to calculate variance using sql between two same days of two years

I will try to be simple as possible to make my question crystal-clear. I have a table that's called 'fb_ads' (it's about different facebook compaigns for different stores in USA) on BigQuery, it contains the following columns:
STORE : name of store
CLICKS: number of clicks.
IMPRESSIONS: number of impressions of the ad
COST: the ad cost
DATE: AAAA-MM-DD
Frequency: number of visitors of a store
So, I'm trying to calculate the variance between two years 2017 and 2018.
Here is the variance I'm trying to calculate:
Variance_Of_Frequency = ((Frequency in 2018 at date X) - ((Frequency in 2017 at date X))/((Frequency in 2017 at date X)
The problem is, that I'll have to compare the same day of the week close to Date X;
For example, if I have a compaign run on a Monday 2017-08-13, I'll need to compare to another monday in 2018 close to 2018-08-13 (it might be a monday on 2018-08-15 for example).
This is a daily variance!
I tried to make a weekly variance calculating and I don't know if it's correct, here is how I did it:
I first started with aggregating my daily table to a weekly tables using the following query:
creating my weekly_table
SELECT
year_week,
STORE,
min(DATE ) as DATE ,
SUM(IMPRESSIONS ) AS FB_IMPRESSIONS ,
SUM(CLICKS ) AS FB_CLICKS ,
SUM(COST) AS FB_COST ,
SUM(Frequency) AS FREQUENCY,
FROM (
SELECT
*,
CONCAT(cast(ANNEE as string), LPAD(cast((extract(WEEK from date)) as string), 2, '0') ) AS year_week
FROM `fb_ads`)
GROUP BY
year_week,
STORE,
ORDER BY year_week
Then I tried to calculate the variance using this:
SELECT
base.*, (base.frequency-lw.frequency) / lw.frequency as VAR_FF
FROM
`weekly_table` base
JOIN (
SELECT
* EXCEPT (date),
DATE_ADD(DATE(TIMESTAMP(date)) , INTERVAL 1 Week)AS date
FROM
`weekly_table` ) lw
ON
base.date = lw.date
AND base.store= lw.store
Anyone has any idea how to do the daily thing or if my weekly queries are correct ?
Thanks!
For a given date, you want to know the date of the nearest Monday to the same date in the following year...
SET #dt = '2017-08-17';
SELECT CASE WHEN WEEKDAY(#dt + INTERVAL 1 YEAR) > 3
THEN ADDDATE(ADDDATE(#dt + INTERVAL 1 YEAR,INTERVAL 1 WEEK),INTERVAL - WEEKDAY(#dt + INTERVAL 1 YEAR) DAY)
ELSE ADDDATE(#dt + INTERVAL 1 YEAR,INTERVAL - WEEKDAY(#dt + INTERVAL 1 YEAR) DAY)
END x;
Obviously, I could remove all those + INTERVAL 1 YEAR bits by defining #dt that way to begin with.

Sql query to return data if AVG(7 days record count) > (Today's record count)

I want to write a SQL query in Oracle database for:
A priceindex(field name) have around 120(say) records each day and I have to display the priceindex name and today's date, if the avg of last 7 days record count is greater than Todays record count for the priceindex(group by priceindex).
Basically, There will be 56 priceindex and each should have around 120 records each day and is dump to database each day from external site. So want to make sure all records are downloaded to the database everyday.
Except for the clarification I requested in a Comment to your question (having to do with "how can we know today's final count, when today is not over yet), the problem can be solved along the following lines. Not tested since you didn't provide sample data.
From your table, select only the rows where the relevant DATE is between "today" - 7 and "today" (so there are really EIGHT days: the seven days preceding today, and today). Then group by PRICEINDEX. Count total rows for each group, and count rows just for "today". The rows for "today" should be less than 1/8 times the total count (this is easy algebra: this is equivalent to being less than 1/7 times the count of OTHER days).
Such conditions, at the group level, must be in the HAVING clause.
select priceindex
from your_table
where datefield >= trunc(sysdate) - 7 and datefield < trunc(sysdate) + 1
group by priceindex
having count(case when datefield >= trunc(sysdate) then 1 end) < 1/8 * count(*)
;
EDIT The OP clarified that the query runs every day at midnight; this means that "today" should actually mean "yesterday" (the day that just ended). In Oracle, and probably in all of computing, midnight belongs to the day that BEGINS at midnight, not the one that ends at midnight. The time-of-day at midnight is 00:00:00 (beginning of the new day), not 24:00:00.
So, the query above will have to be changed slightly:
select priceindex
from your_table
where datefield >= trunc(sysdate) - 8 and datefield < trunc(sysdate)
group by priceindex
having count(case when datefield >= trunc(sysdate) - 1 then 1 end)
< 1/8 * count(*)
;