I am performing a query with a partition window of 1 calendar month. The data I'm working with is collected at regular intervals, e.g. every fifteen minutes.
Here is the code:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW)
This query works well and computes the monthly average. The only problem is that the start and end of the interval are exactly a month apart, and both boundaries of the window are inclusive: e.g., the start would be Nov-01-2019 00:00 and the end would be Dec-01-2019 00:00.
I need to make it so that the starting boundary is not included, because it's not considered part of the data set: e.g., start at Nov-01-2019 00:15 (the next row) while the end stays Dec-01-2019 00:00.
I'm wondering if there's something that Oracle can do that would achieve this.
I imagine the code looking something like this:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH (+ 1 ROW) PRECEDING AND CURRENT ROW)
I've tried several variants of this but Oracle does not like them. Any help would be appreciated.
Work out how many days there were in the previous month using:
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 1 )
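For example, for a time_stamp anywhere in February 2020 this gives 31 (a quick check against DUAL; the date is arbitrary):

SELECT EXTRACT( DAY FROM TRUNC( DATE '2020-02-10', 'MM' ) - 1 ) AS days_in_prev_month
FROM DUAL;
-- 31 (January 2020 has 31 days)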
Then use the NUMTODSINTERVAL function to create an interval of one fewer day (hence the - 2 below, which gives the day number of the penultimate day of the previous month), so you exclude the extra day that would otherwise be counted:
SELECT id,
data_value,
time_stamp,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN NUMTODSINTERVAL(
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 2 ),
'DAY'
) PRECEDING
AND CURRENT ROW
) AS avg_value_month_minus_1_day
FROM table_name;
So, if your data is:
CREATE TABLE table_name ( id, data_value, time_stamp ) AS
SELECT 1,
LEVEL,
DATE '2020-01-01' + LEVEL - 1
FROM DUAL
CONNECT BY LEVEL <= 50;
Then you can compare the above query to the output of your original query with:
SELECT id,
data_value,
time_stamp,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN NUMTODSINTERVAL(
EXTRACT( DAY FROM TRUNC( time_stamp, 'MM' ) - 2 ),
'DAY'
) PRECEDING
AND CURRENT ROW
) AS avg_value_month_minus_1_day,
AVG(data_value)
OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING
AND CURRENT ROW
) AS avg_value_month
FROM table_name;
Outputs (for February, when there is a full month of preceding data):
ID | DATA_VALUE | TIME_STAMP | AVG_VALUE_MONTH_MINUS_1_DAY | AVG_VALUE_MONTH
-: | ---------: | :------------------ | --------------------------: | --------------:
1 | 32 | 2020-02-01 00:00:00 | 17 | 16.5
1 | 33 | 2020-02-02 00:00:00 | 18 | 17.5
1 | 34 | 2020-02-03 00:00:00 | 19 | 18.5
1 | 35 | 2020-02-04 00:00:00 | 20 | 19.5
1 | 36 | 2020-02-05 00:00:00 | 21 | 20.5
1 | 37 | 2020-02-06 00:00:00 | 22 | 21.5
1 | 38 | 2020-02-07 00:00:00 | 23 | 22.5
1 | 39 | 2020-02-08 00:00:00 | 24 | 23.5
1 | 40 | 2020-02-09 00:00:00 | 25 | 24.5
1 | 41 | 2020-02-10 00:00:00 | 26 | 25.5
1 | 42 | 2020-02-11 00:00:00 | 27 | 26.5
1 | 43 | 2020-02-12 00:00:00 | 28 | 27.5
1 | 44 | 2020-02-13 00:00:00 | 29 | 28.5
1 | 45 | 2020-02-14 00:00:00 | 30 | 29.5
1 | 46 | 2020-02-15 00:00:00 | 31 | 30.5
1 | 47 | 2020-02-16 00:00:00 | 32 | 31.5
1 | 48 | 2020-02-17 00:00:00 | 33 | 32.5
1 | 49 | 2020-02-18 00:00:00 | 34 | 33.5
1 | 50 | 2020-02-19 00:00:00 | 35 | 34.5
db<>fiddle here
Alas, Oracle doesn't support intervals with both months and smaller units.
One method is to subtract it out:
select (sum(data_value) over (partition by id
order by time_stamp
range between interval '3' month preceding and current row
) -
sum(data_value) over (partition by id
order by time_stamp
range between interval '3' month preceding and interval '3' month preceding
)
) /
(count(data_value) over (partition by id
order by time_stamp
range between interval '3' month preceding and current row
) -
count(data_value) over (partition by id
order by time_stamp
range between interval '3' month preceding and interval '3' month preceding
)
) as avg_value
from table_name;
Admittedly, this is cumbersome for an average, but it might be just fine for a sum() or count().
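Applied to the one-month window from the original question, the same subtract-out idea might look like the sketch below (reusing the question's table_name, id and time_stamp names; the NVL is my addition, guarding against a NULL sum when no row sits exactly on the boundary):

SELECT id,
       time_stamp,
       ( SUM(data_value) OVER (PARTITION BY id ORDER BY time_stamp
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW)
         -- subtract out any rows sitting exactly on the lower boundary
         - NVL(SUM(data_value) OVER (PARTITION BY id ORDER BY time_stamp
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND INTERVAL '1' MONTH PRECEDING), 0) )
       /
       ( COUNT(data_value) OVER (PARTITION BY id ORDER BY time_stamp
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW)
         - COUNT(data_value) OVER (PARTITION BY id ORDER BY time_stamp
             RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND INTERVAL '1' MONTH PRECEDING) ) AS avg_open_start
FROM table_name;

The denominator can never hit zero, because the current row is always inside the larger window but never exactly one month before itself.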
To shift the window of time that you are looking at you can shift the value you are sorting on by an appropriate interval of time:
SELECT AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) Current_Calc
, AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp - interval '15' minute
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) Shift_Back
, AVG(data_value)
OVER (PARTITION BY id
ORDER BY time_stamp + interval '15' minute
RANGE BETWEEN INTERVAL '1' MONTH PRECEDING AND CURRENT ROW
) shift_forward
FROM Your_Data
Based on the description of your problem I believe you want to shift it back by 15 minutes, but I could be misreading the problem statement, and I don't have appropriate data to test against or expected results to compare with.
These are sliding windows that always contain one month's worth of data relative to the current time_stamp. This means that for each time_stamp you will get anywhere from 29 to 32 days' worth of data, with some of that data being counted in both the current and preceding months' averages.
On the other hand, if what you are interested in is averages for the discrete months, then you should be partitioning by month rather than creating a sliding window. If you want running averages per month you can add the sort, but you won't need the windowing clause:
SELECT TRUNC(time_stamp, 'MM') MON
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp, 'MM')) MON_AVG
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp, 'MM')
ORDER BY time_stamp) RUN_MON_AVG
, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM') MON_2
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM')
) MON_AVG_2
, AVG(data_value)
OVER (PARTITION BY id, TRUNC(time_stamp - INTERVAL '15' MINUTE, 'MM')
ORDER BY time_stamp) RUN_MON_AVG_2
FROM Your_Data
Thanks for the feedback! I was able to assemble the answer I needed based on the answers above. Here is the code that I went with:
SELECT AVG(data_value) OVER (
PARTITION BY id
ORDER BY time_stamp
RANGE BETWEEN ( NUMTODSINTERVAL( EXTRACT( DAY FROM ( TRUNC( time_stamp, 'MM' ) - 1 ) ), 'DAY' )
                - NUMTODSINTERVAL( 1, 'SECOND' ) ) PRECEDING
      AND CURRENT ROW )
Because my interval is exactly one month, and I want to exclude the first entry, I first convert the length of the previous month into a day-to-second interval, as recommended above. Then I subtract one second from the lower bound of the interval. This has the effect of making the lower bound of the interval an "open" bound while the upper bound remains a "closed" bound.
As a side note, I used one second because the periodicity of my dataset is not consistent, but its minimum spacing is three minutes, so subtracting anything less than that will work.
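For illustration, here is what that window size evaluates to for a time_stamp on Dec-01-2019 (a quick check against DUAL; the date is arbitrary):

SELECT NUMTODSINTERVAL( EXTRACT( DAY FROM ( TRUNC( DATE '2019-12-01', 'MM' ) - 1 ) ), 'DAY' )
       - NUMTODSINTERVAL( 1, 'SECOND' ) AS window_size
FROM DUAL;
-- +29 23:59:59 (November 2019 has 30 days)

So the window for Dec-01-2019 00:00 starts at Nov-01-2019 00:00:01, which excludes the row at Nov-01-2019 00:00.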
Related
I have a table called cpu_usages and I'm trying to find spikes of cpu usage. My table stores 4 columns:
id serial
at timestamp
cpu_usage float
cpu_core int
The at column stores a timestamp for every minute of every day.
For each row, I want to look at the next 3 minutes of timestamps; if any of them has a cpu_usage at least 3% higher than the starting value for that row, then return the starting row.
So for example if I have these rows:
id | at               | cpu_usage | cpu_core
1  | 2019-01-01 00:00 | 1         | 0
2  | 2019-01-01 00:01 | 1         | 0
3  | 2019-01-01 00:02 | 4         | 0
4  | 2019-01-01 00:03 | 1         | 0
5  | 2019-01-01 00:04 | 1         | 0
6  | 2019-01-01 00:05 | 1         | 0
7  | 2019-01-01 00:06 | 1         | 0
8  | 2019-01-01 00:07 | 1         | 0
9  | 2019-01-01 00:08 | 6         | 0
10 | 2019-01-01 00:00 | 1         | 1
11 | 2019-01-01 00:01 | 1         | 1
12 | 2019-01-01 00:02 | 4         | 1
13 | 2019-01-01 00:03 | 1         | 1
14 | 2019-01-01 00:04 | 1         | 1
15 | 2019-01-01 00:05 | 1         | 1
16 | 2019-01-01 00:06 | 1         | 1
17 | 2019-01-01 00:07 | 1         | 1
18 | 2019-01-01 00:08 | 6         | 1
It would return rows:
1,2,6,7,8
I am not sure how to do this because it sounds like it needs some sort of nested joins.
Can anyone assist me with this?
This answers the original version of the question.
Just use window functions. Assuming you want to compare each row against earlier values, you want to look back, not forward:
select t.*
from (select t.*,
max(cpu_value) over (order by timestamp
range between interval '3 minute' preceding and interval '1 second' preceding
) as previous_max
from t
) t
where previous_max * 1.03 < cpu_value;
EDIT:
Looking forward instead, this would be:
select t.*
from (select t.*,
max(cpu_value) over (order by timestamp
range between interval '1 second' following and interval '3 minute' following
) as next_max
from t
) t
where next_max >= cpu_value * 1.03;
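Adapted to the cpu_usages schema from the question, with the comparison done per core, this might look like the following sketch (an assumption on my part: a PostgreSQL 11+ database, since RANGE frames with interval offsets require that; column names follow the question's schema):

select u.*
from (select u.*,
             -- highest usage seen on the same core in the following 3 minutes
             max(cpu_usage) over (partition by cpu_core
                                  order by at
                                  range between interval '1 second' following and interval '3 minute' following
                                 ) as next_max
      from cpu_usages u
     ) u
where u.next_max >= u.cpu_usage * 1.03;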
I have a table (in an Oracle DB) that looks something like what is shown below, with about 4000 records. This is just an example of how the table is designed. The timestamps range over several years.
| Time | Action |
| 9/25/2019 4:24:32 PM | Yes |
| 9/25/2019 4:28:56 PM | No |
| 9/28/2019 7:48:16 PM | Yes |
| .... | .... |
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval. I would like this done by looking at each timestamp and getting a count of timestamps that appear within 15 minutes of that timestamp.
My goal would be to have something like:
| Interval | Count |
| 9/25/2019 4:24:00 PM - 9/25/2019 4:39:00 | 2 |
| 9/25/2019 4:25:00 PM - 9/25/2019 4:40:00 | 2 |
| ..... | ..... |
| 9/25/2019 4:39:00 PM - 9/25/2019 4:54:00 | 0 |
I am not sure how I would be able to do this, if at all. Any ideas or advice would be much appreciated.
If you want any 15 minute interval in the data, then you can use:
select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t;
If you want the maximum, then use rank() on this:
select t.*
from (select t.*, rank() over (order by cnt_15 desc) as seqnum
from (select t.*,
count(*) over (order by timestamp
range between interval '15' minute preceding and current row
) as cnt_15
from t
) t
) t
where seqnum = 1;
This doesn't produce exactly the results you specify in the query. But it does answer the question:
I want to be able to get a count of timestamps that occur on a rolling 15 minute interval. My main goal is to identify the maximum number of timestamps that appear for any 15 minute interval.
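If you do want each row labeled with the interval boundaries, as in your desired output, you can flip the frame to look forward from each timestamp (a sketch, keeping the same generic t/timestamp naming as above):

select t.timestamp as interval_start,
       t.timestamp + interval '15' minute as interval_end,
       count(*) over (order by t.timestamp
                      range between current row and interval '15' minute following
                     ) as cnt
from t;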
You could enumerate the minutes with a recursive query, then bring in the table with a left join (note that Oracle uses recursive subquery factoring without the RECURSIVE keyword):
with cte (start_dt, max_dt) as (
select trunc(min(time), 'mi'), max(time) from mytable
union all
select start_dt + interval '1' minute, max_dt from cte where start_dt < max_dt
)
select
c.start_dt,
c.start_dt + interval '15' minute end_dt,
count(t.time) cnt
from cte c
left join mytable t
on t.time >= c.start_dt
and t.time < c.start_dt + interval '15' minute
group by c.start_dt
The table below consists of a count of users on each day. I'm looking to populate the Total_Users column.
Logic: Total_Users contains the user count between Signupdate-14 and Signupdate-7.
For example, 15/01/2020 contains the user count between 01/01/2020 and 08/01/2020.
Signupdate | Users | Total_Users (b/w D-14 & D-7)
1/1/2020   | 20    | 60
2/1/2020   | 30    | 80
3/1/2020   | 10    | 90
---        | ---   | ---
---        | ---   | ---
15/1/2020  | 30    | 120
16/1/2020  | 10    | 40
SELECT Signupdate
, Users
,SUM(CASE
WHEN Signupdate BETWEEN to_date(Signupdate,'DD/MM/YYYY')-14 AND to_date(Signupdate,'DD/MM/YYYY')-7
THEN Users END) AS Total_Users
FROM
This is assuming that the users column is of numeric type
Assuming you have a row for each date, you would use window functions with a windowing clause. I'm not sure if Redshift supports window frames with intervals, but this is the basic logic:
select t.*,
sum(users) over (order by signupdate
range between interval '14' day preceding and interval '7' day preceding
) as total_users
from t;
If not, you can turn the date into a number and use that:
select t.*,
sum(users) over (order by diff
rows between 14 preceding and 7 preceding
) as total_users
from (select t.*,
datediff(day, date '2000-01-01', signupdate) as diff
from t
) t
I am guessing you want a complete week; note, however, that BETWEEN Signupdate-14 AND Signupdate-7 spans 8 days.
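If a full seven-day week is what's intended, the frame can be tightened by one row (a sketch, still under the one-row-per-date assumption):

select t.*,
       sum(users) over (order by signupdate
                        rows between 14 preceding and 8 preceding
                       ) as total_users_7_days
from t;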
Given a table as such:
# SELECT * FROM payments ORDER BY payment_date DESC;
id | payment_type_id | payment_date | amount
----+-----------------+--------------+---------
4 | 1 | 2019-11-18 | 300.00
3 | 1 | 2019-11-17 | 1000.00
2 | 1 | 2019-11-16 | 250.00
1 | 1 | 2019-11-15 | 300.00
14 | 1 | 2019-10-18 | 130.00
13 | 1 | 2019-10-18 | 100.00
15 | 1 | 2019-09-18 | 1300.00
16 | 1 | 2019-09-17 | 1300.00
17 | 1 | 2019-09-01 | 400.00
18 | 1 | 2019-08-25 | 400.00
(10 rows)
How can I SUM the amount column based on an arbitrary date range, not simply a date truncation?
Taking the example of a date range beginning on the 15th of a month, and ending on the 14th of the following month, the output I would expect to see is:
payment_type_id | payment_date | amount
-----------------+--------------+---------
1 | 2019-11-15 | 1850.00
1 | 2019-10-15 | 230.00
1 | 2019-09-15 | 2600.00
1 | 2019-08-15 | 800.00
Can this be done in SQL, or is this something that's better handled in code? I would traditionally do this in code, but I'm looking to extend my knowledge of SQL (which at this stage, isn't much!).
Click demo: db<>fiddle
You can use a combination of the CASE clause and the date_trunc() function:
SELECT
payment_type_id,
CASE
WHEN date_part('day', payment_date) < 15 THEN
date_trunc('month', payment_date) + interval '-1 month 14 days'
ELSE date_trunc('month', payment_date) + interval '14 days'
END AS payment_date,
SUM(amount) AS amount
FROM
payments
GROUP BY 1,2
date_part('day', ...) gives the day of the month
The CASE clause divides the dates into those before the 15th of the month and those on or after it.
The date_trunc('month', ...) converts all dates in a month to the first of that month
So, if a date is before the 15th of the current month, it should be grouped to the 15th of the previous month (this is what + interval '-1 month 14 days' calculates: +14, because date_trunc() truncates to the 1st of the month: 1 + 14 = 15). Otherwise it is grouped to the 15th of the current month.
After calculating these payment_days, you can use them for simple grouping.
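A quick check of both branches with hypothetical dates shows the grouping keys this produces:

SELECT
    date_trunc('month', DATE '2019-11-03') + interval '-1 month 14 days' AS before_the_15th,   -- 2019-10-15 00:00:00
    date_trunc('month', DATE '2019-11-20') + interval '14 days' AS on_or_after_the_15th;       -- 2019-11-15 00:00:00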
I would simply subtract 14 days, truncate the month, and add 14 days back:
select payment_type_id,
date_trunc('month', payment_date - interval '14 day') + interval '14 day' as month_15,
sum(amount)
from payments
group by payment_type_id, month_15
order by payment_type_id, month_15;
No conditional logic is actually needed for this.
Here is a db<>fiddle.
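As a quick check of the arithmetic with hypothetical boundary dates:

select date_trunc('month', DATE '2019-11-14' - interval '14 day') + interval '14 day' as bucket_a,  -- 2019-10-15
       date_trunc('month', DATE '2019-11-15' - interval '14 day') + interval '14 day' as bucket_b;  -- 2019-11-15

Nov 14 falls into the bucket that starts Oct 15, while Nov 15 starts a new bucket.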
You can use the generate_series() function and make an inner join comparing month and year, like this:
SELECT specific_date_on_month, SUM(amount)
FROM (SELECT generate_series('2015-01-15'::date, '2015-12-15'::date, '1 month'::interval) AS specific_date_on_month) AS series
INNER JOIN payments
ON (TO_CHAR(payment_date, 'yyyymm')=TO_CHAR(specific_date_on_month, 'yyyymm'))
GROUP BY specific_date_on_month;
The generate_series(<begin>, <end>, <interval>) function generates a series from begin to end with the specified step interval.
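For instance, a shorter series (dates chosen arbitrarily) produces one row per month:

SELECT generate_series('2015-01-15'::date, '2015-04-15'::date, '1 month'::interval);
-- 2015-01-15 00:00:00
-- 2015-02-15 00:00:00
-- 2015-03-15 00:00:00
-- 2015-04-15 00:00:00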
I want to find the number of unique users active in the last 30 days. I want to calculate this for today, but also for days in the past. The dataset, stored in BigQuery, contains user ids, dates and events triggered by the user. A user is active when opening the mobile app, which triggers the event session_start. Example of the unnested dataset:
| resettable_device_id | date | event |
------------------------------------------------------
| xx | 2017-06-09 | session_start |
| yy | 2017-06-09 | session_start |
| xx | 2017-06-11 | session_start |
| zz | 2017-06-11 | session_start |
I found a solution which suits my problem:
BigQuery: how to group and count rows within rolling timestamp window?
My BigQuery script so far:
#standardSQL
WITH daily_aggregation AS (
SELECT
PARSE_DATE("%Y%m%d", event_dim.date) AS day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) AS unique_resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
WHERE event_dim.name = "session_start"
GROUP BY day
)
SELECT
day,
unique_resettable_device_ids,
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
FROM daily_aggregation
ORDER BY day
This script results in the following table:
| day | unique_resettable_device_ids | unique_ids_rolling_30_days |
------------------------------------------------------------------------
| 2018-06-05 | 1807 | 2614 |
| 2018-06-06 | 711 | 807 |
| 2018-06-07 | 96 | 96 |
The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids. How can I fix the rolling window function in my script?
"The problem is that the column unique_ids_rolling_30_days is just a cumulative sum of the column unique_resettable_device_ids."
Of course, as that's exactly what the code
SUM(unique_resettable_device_ids)
OVER(ORDER BY UNIX_SECONDS(TIMESTAMP(day)) DESC ROWS BETWEEN 2592000 PRECEDING AND CURRENT ROW) AS unique_ids_rolling_30_days
is asking for.
Check out https://stackoverflow.com/a/49866033/132438, where the question asks specifically about counting uniques in a rolling window: it turns out to be a very slow operation, given how much memory it requires.
The solution for this when you want a rolling count of uniques: Go for approximate results.
From the linked answer:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
Working solution for a weekly calculation of the number of active users in the last 30 days.
#standardSQL
WITH days AS (
SELECT day
FROM UNNEST(GENERATE_DATE_ARRAY('2018-01-01', CURRENT_DATE(), INTERVAL 1 WEEK)) AS day
), periods AS (
SELECT
DATE_SUB(days.day, INTERVAL 30 DAY) AS StartDate,
days.day AS EndDate FROM days
)
SELECT
periods.EndDate AS Day,
COUNT(DISTINCT user_dim.device_info.resettable_device_id) as resettable_device_ids
FROM `ANDROID.app_events_*`,
UNNEST(event_dim) AS event_dim
CROSS JOIN periods
WHERE
PARSE_DATE("%Y%m%d", event_dim.date) BETWEEN periods.StartDate AND periods.EndDate
AND event_dim.name = "session_start"
GROUP BY Day
ORDER BY Day DESC