How to do an x-day grouped sum in Redshift? - sql

I have the following table, which shows how many items from different units entered the inventory on different dates.
ID  Date        Unit    Quantity
---------------------------------
1   2017-08-01  A_red   05
2   2017-08-13  A_red   10
3   2017-09-20  A_red   20
4   2017-09-22  A_red   40
5   2017-10-05  A_red   40
6   2017-10-25  A_red   30
7   2017-10-24  A_blue  60
The problem is: entries within a time interval of 30 days of the same unit should be grouped.
So I want the following result:
ID  Date        Unit    Quantity  fst_entry30  Quantity30
-----------------------------------------------------------
1   2017-08-01  A_red   05        T            15
2   2017-08-13  A_red   10        F            15
3   2017-09-20  A_red   20        T            100
4   2017-09-22  A_red   40        F            100
5   2017-10-05  A_red   40        F            100
6   2017-10-25  A_red   30        T            30
7   2017-10-24  A_blue  60        T            60
where fst_entry30 is a flag that indicates whether the entry was the first of its unit in the last 30 days. Note that a different unit (A_blue instead of A_red) is not grouped together with A_red.
And Quantity30 is the grouped sum of Quantity.
For example, between 20 September and 5 October there are fewer than 30 days, so those entries were grouped.
Remember that Redshift does not allow recursive common table expressions.
I have already tried self-joins, but that turned out to be cumbersome.

You would just use lag() to define the groups:
select t.*,
       (case when date < lag(date) over (partition by unit order by date) + interval '30 day'
             then 0 else 1
        end) as grp_start
from t;
Then you can do a cumulative sum of grp_start to assign a number to each group, and finally add the quantities up using a window function:
select t.*,
       sum(quantity) over (partition by unit, grp) as quantity30
from (select t.*,
             sum(grp_start) over (partition by unit order by date
                                  rows unbounded preceding) as grp
      from (select t.*,
                   (case when date < lag(date) over (partition by unit order by date) + interval '30 day'
                         then 0 else 1
                    end) as grp_start
            from t
           ) t
     ) t
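
If you also want the fst_entry30 flag from the desired output, the group-start marker can be reused directly. Below is a sketch under the same assumptions (a table t with columns id, date, unit, quantity, as in the question). One caveat worth flagging: this lag()-based grouping restarts the 30-day window at each consecutive entry, so a chain of entries that are each fewer than 30 days apart stays in one group even when the chain spans more than 30 days from its first entry; if the window should instead be measured from the group's first entry, the groups can come out differently (e.g. the 2017-10-25 row).

-- Sketch: grp_start doubles as the fst_entry30 flag ('T' marks the first entry of a group).
select id, date, unit, quantity,
       case when grp_start = 1 then 'T' else 'F' end as fst_entry30,
       sum(quantity) over (partition by unit, grp) as quantity30
from (select t.*,
             sum(grp_start) over (partition by unit order by date
                                  rows unbounded preceding) as grp
      from (select t.*,
                   case when date < lag(date) over (partition by unit order by date) + interval '30 day'
                        then 0 else 1
                   end as grp_start
            from t
           ) t
     ) t
order by id;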

Related

Group items from the first time + certain time period

I want to group orders from the same customer if they happen within 10 minutes of the first order, then find the next first order and group them and so on.
Ex:
Customer  Group  Orders
------------------------
6         1      3
6         2      4,5
6         3      8
7         1      9,10
7         2      11,12
7         3      13
id  customer  time
--------------------------------------
3   6         2021-05-12 12:14:22.000000
4   6         2021-05-12 12:24:24.000000
5   6         2021-05-12 12:29:16.000000
8   6         2021-05-12 13:01:40.000000
9   7         2021-05-14 12:13:11.000000
10  7         2021-05-14 12:20:01.000000
11  7         2021-05-14 12:45:00.000000
12  7         2021-05-14 12:48:41.000000
13  7         2021-05-14 12:58:16.000000
18  9         2021-05-18 12:22:13.000000
25  15        2021-05-18 13:44:02.000000
26  16        2021-05-17 09:39:02.000000
27  16        2021-05-18 19:38:43.000000
28  17        2021-05-18 15:40:02.000000
29  18        2021-05-19 15:32:53.000000
30  18        2021-05-19 15:45:56.000000
31  18        2021-05-19 16:29:09.000000
34  15        2021-05-24 15:45:14.000000
35  15        2021-05-24 15:45:14.000000
36  19        2021-05-24 17:14:53.000000
Here is what I have currently. I think it is not grouping by customer in the case when d.StartTime > dateadd(minute, 10, c.first_time) step, so it compares the StartTime of all orders across all customers.
with
data as (select Customer,StartTime,Id, row_number() over(partition by Customer order by StartTime) rn from orders t),
cte as (
select d.*, StartTime as first_time
from data d
where rn = 1
union all
select d.*,
case when d.StartTime > dateadd(minute, 10, c.first_time)
then d.StartTime
else c.first_time
end
from cte c
inner join data d on d.rn = c.rn + 1
)
select c.*, dense_rank() over(partition by Customer order by first_time) grp
from cte c;
I have two databases (MySQL & SQL Server) having similar schema so either would work for me.
Try the following on SQL Server:
SELECT customer,
ROW_NUMBER() OVER (PARTITION BY customer ORDER BY grp) AS group_no,
STRING_AGG(id, ',') AS orders
FROM
(
SELECT id,customer, [time],
(DATEDIFF(SECOND, MIN([time]) OVER (PARTITION BY CUSTOMER), [time])/60)/10 grp
FROM orders
) T
GROUP BY customer, grp
ORDER BY customer
According to your posted requirement, you are trying to divide the period between the first order date and the last order date into groups (or, let's say, time frames), each one 10 minutes long.
What I did in this query: for each customer order, find the difference between the order date and the minimum date (the first order date of that customer) in seconds, then divide it by 60 and by 10 to get its time-frame number. E.g., for a difference of 599 s the frame number is 599/60 = 9 min, 9/10 = 0; for a difference of 620 s the frame number is 620/60 = 10 min, 10/10 = 1.
After defining the correct group/time frame for each order, you can simply use the STRING_AGG function to get the desired output. Note that STRING_AGG applies to SQL Server 2017 (14.x) and later.
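
Since you mentioned MySQL would also work: a sketch of an equivalent for MySQL 8+, assuming the table and columns from the sample data (orders with id, customer, time), with GROUP_CONCAT taking the place of STRING_AGG:

-- Same 10-minute bucketing, MySQL 8+ flavor (column names assumed from the question).
SELECT customer,
       ROW_NUMBER() OVER (PARTITION BY customer ORDER BY grp) AS group_no,
       GROUP_CONCAT(id ORDER BY `time` SEPARATOR ',') AS orders
FROM (
    SELECT id, customer, `time`,
           FLOOR(TIMESTAMPDIFF(SECOND,
                               MIN(`time`) OVER (PARTITION BY customer),
                               `time`) / 600) AS grp
    FROM orders
) t
GROUP BY customer, grp
ORDER BY customer, group_no;

Note that, like the SQL Server query, this puts orders into fixed 10-minute frames counted from the customer's first order overall; that matches the posted expected output, but it is not a strict "restart the window at the next first order" sessionization.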

SQL query to get top 24 records, then average the first 12 and bottom 12

I'm attempting to analyze each account's performance (A_Count & B_Count) during its first year versus its second year. This should only return clients who have at least 24 months of totals (records).
Volume Table
Account  ReportDate  A_Count  B_Count
--------------------------------------
1001A    2019-01-01  47       100
1001A    2019-02-01  50       105
1002A    2019-02-01  50       105
I think I'm on the right track by wanting to grab the top 24 records for each account (only if 24 exist) and then averaging the top 12 and bottom 12, but I'm not sure how to get there.
I guess the ideal output would be:
Account  YR1_A_Avg  YR1_B_Avg  YR2_A_Avg  YR2_B_Avg  FirstDate   LastDate
---------------------------------------------------------------------------
1001A    47         100        53         115        2019-01-01  2021-12-31
1002A    50         105        65         130        2019-02-01  2022-01-01
1003A    15         180        38         200        2017-05-01  2019-04-01
I'm not too worried about performance.
Assuming there are no gaps in ReportDate (per Account).
select Account
      ,avg(case when year_index = 1 then A_Count end) as YR1_A_Avg
      ,avg(case when year_index = 1 then B_Count end) as YR1_B_Avg
      ,avg(case when year_index = 2 then A_Count end) as YR2_A_Avg
      ,avg(case when year_index = 2 then B_Count end) as YR2_B_Avg
      ,min(ReportDate) as FirstDate
      ,max(ReportDate) as LastDate
from
(
    select *
          ,count(*) over(partition by Account) as cnt
          ,(row_number() over(partition by Account order by ReportDate)-1)/12 + 1 as year_index
    from Volume
) t
where cnt >= 24 and year_index <= 2
group by Account
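
One detail to note: because of the where clause, the query above takes FirstDate and LastDate from the first 24 months only. If, as the sample output suggests, they should cover the account's entire history, one way is to capture them with window functions before filtering. A sketch under the same assumptions (integer division for year_index, no gaps in ReportDate):

select Account
      ,avg(case when year_index = 1 then A_Count end) as YR1_A_Avg
      ,avg(case when year_index = 1 then B_Count end) as YR1_B_Avg
      ,avg(case when year_index = 2 then A_Count end) as YR2_A_Avg
      ,avg(case when year_index = 2 then B_Count end) as YR2_B_Avg
      ,min(first_date) as FirstDate   -- taken over the full history
      ,max(last_date)  as LastDate
from
(
    select *
          ,count(*) over(partition by Account) as cnt
          ,(row_number() over(partition by Account order by ReportDate)-1)/12 + 1 as year_index
          ,min(ReportDate) over(partition by Account) as first_date   -- computed before the filter
          ,max(ReportDate) over(partition by Account) as last_date
    from Volume
) t
where cnt >= 24 and year_index <= 2
group by Account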

Calculate sales metrics (like past 6 months, past 3 months, sale one year ago etc.) on transaction data in BigQuery

I have to create a view in BigQuery with some details of product sales. The measurements to be included in the view are explained below. These measurements have to be calculated for each product for every day that product is sold. A product is identified by a unique combination of 5-6 attributes (in our demo, the code1 and code2 columns). The date column represents the transaction dates.
sales_today -> the sum of sales for each product (combination of code1 and code2) per day.
TotSales_previous_3_months -> the sum of sales for each product in the previous 3 months (not including any sales from the current month). E.g., if we are calculating TotSales_previous_3_months for a product sale on 5th March 2022, we have to sum up the sales of that product from 1st December 2021 to 28th February 2022.
TotSales_previous_6_months -> the sum of sales for each product in the previous 6 months (not including any sales from the current month). Follow the same logic as for TotSales_previous_3_months.
sale_one_month_ago -> the sum of sales of the product on this day exactly one month ago. E.g., if we are calculating sale_one_month_ago for a product sale on 5th March 2022, it would be the sum of sales of that product on 5th February 2022.
sale_one_year_ago -> the sum of sales of the product on this day exactly one year ago. E.g., if we are calculating sale_one_year_ago for a product sale on 5th March 2022, it would be the sum of sales of that product on 5th March 2021.
Unique_count_flag -> flag = 1 if the number of sales of the product on a day = 1. If the number of sales of the product is more than 1 on a day, flag = 0.
I have created this table (test_sales) with some demo data for understanding.
code1  code2  date        gen          sales
---------------------------------------------
1      A      2021-02-04  jerez        7
1      A      2021-02-04  abc          5
1      A      2022-02-04  wres         10
1      A      2022-03-04  tomz         10
1      A      2022-03-05  everyz       10
1      A      2022-05-01  ben10        30
1      A      2022-06-01  xyx          10
1      A      2022-06-01  xya          5
2      A      2022-05-10  iqoom        20
3      C      2022-01-10  imola        60
3      C      2022-04-01  nurburgring  50
3      C      2022-06-01  jerez        30
The result set after the calculations should look like this:
code1  code2  date        gen          sales  sales_today  TotSales_previous_3_months  TotSales_previous_6_months  sale_one_month_ago  sale_one_year_ago  Unique_count_flag
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1      A      2021-02-04  jerez        7      12           0                           0                           0                   0                  0
1      A      2021-02-04  abc          5      12           0                           0                           0                   0                  0
1      A      2022-02-04  wres         10     10           0                           0                           0                   12                 1
1      A      2022-03-04  tomz         10     10           10                          10                          10                  0                  1
1      A      2022-03-05  everyz       10     10           10                          10                          0                   0                  1
1      A      2022-05-01  ben10        30     30           30                          30                          0                   0                  1
1      A      2022-06-01  xyx          10     15           50                          60                          30                  0                  0
1      A      2022-06-01  xya          5      15           50                          60                          30                  0                  0
2      A      2022-05-10  iqoom        20     20           0                           0                           0                   0                  1
3      C      2022-01-10  imola        60     60           0                           0                           0                   0                  1
3      C      2022-04-01  nurburgring  50     50           60                          60                          0                   0                  1
3      C      2022-06-01  jerez        30     30           50                          110                         0                   0                  1
I was able to write the code below to achieve the result, but the problem is that while it works fine on small datasets, here I am dealing with around 60 GB of data (~50 columns and ~80 million rows). If I adapt the code below to the original sales data (which is itself a join of a few tables), it simply runs too long. Is there an alternative or more efficient way to achieve the results?
with temp as
(SELECT
code1,code2,date,gen,sales,
COUNT(*) OVER(PARTITION BY code1, code2, date) AS cnt,
SUM(sales) OVER(PARTITION BY code1, code2,date) AS sales_today,
array_agg(struct(sales as sales,date as date)) over(partition by code1,code2 order by date) as past_records
FROM
`test_sales`
)
select * except(past_records,cnt),
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date between (date_trunc(temp.date,MONTH) - INTERVAL 3 MONTH) and (date_trunc(temp.date, MONTH) - interval 1 day)) as TotSales_previous_3_months,
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date between (date_trunc(temp.date,MONTH) - INTERVAL 6 MONTH) and (date_trunc(temp.date, MONTH) - interval 1 day)) as TotSales_previous_6_months,
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date = temp.date - INTERVAL 1 MONTH) as sale_one_month_ago,
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date = temp.date - INTERVAL 1 YEAR) as sale_one_year_ago,
if(cnt = 1,1,0) as Unique_count_flag
from temp
Modified code inspired by Mikhail's approach:
select *,
-- extract(year from date) * 12 + extract(month from date) as months,
-- UNIX_DATE(date) AS days,
sum(sales) over(product_date) as sales_today,
sum(sales) over(product range between 3 preceding and 1 preceding) as TotSales_previous_3_months,
sum(sales) over(product range between 6 preceding and 1 preceding) as TotSales_previous_6_months,
case when extract(day from date) = 31 and extract(month from date) in (3,12,10,7,5)
then sum(sales) over(product_by_unix_date range between 31 preceding and 31 preceding)
when extract(day from date) = 30 and extract(month from date) = 3
then sum(sales) over(product_by_unix_date range between 30 preceding and 30 preceding)
when extract(day from date) = 29 and extract(month from date) = 3
then sum(sales) over(product_by_unix_date range between 29 preceding and 29 preceding)
else
sum(sales) over(product_day range between 1 preceding and 1 preceding)
end as sale_one_month_ago,
case when extract(day from date) = 29 and extract(month from date) = 2
then sum(sales) over(product_by_unix_date range between 366 preceding and 366 preceding)
else
sum(sales) over(product_day range between 12 preceding and 12 preceding)
end as sale_one_year_ago
from `river-blade-343102.test.test_sales`
window
product as (partition by code1, code2 order by extract(year from date) * 12 + extract(month from date)),
product_date as (partition by code1, code2, date ),
product_day as (partition by code1, code2, extract(day from date) order by extract(year from date) * 12 + extract(month from date)),
product_by_unix_date as (partition by code1,code2 order by UNIX_DATE(date))
Consider the below version of your query - it is still not perfect - but at least it is easier to handle/read and maintain:
select *,
sum(sales) over(product_date) as sales_today,
sum(sales) over(product range between 3 preceding and 1 preceding) as TotSales_previous_3_months,
sum(sales) over(product range between 6 preceding and 1 preceding) as TotSales_previous_6_months,
sum(sales) over(product_day range between 1 preceding and 1 preceding) as sale_one_month_ago,
sum(sales) over(product_day range between 12 preceding and 12 preceding) as sale_one_year_ago,
from test_sales
window
product as (partition by code1, code2 order by extract(year from date) * 12 + extract(month from date)),
product_date as (partition by code1, code2, date),
product_day as (partition by code1, code2, extract(day from date) order by extract(year from date) * 12 + extract(month from date))
If applied to the sample data in your question, it produces the expected output (note that it does not compute Unique_count_flag, and empty windows yield NULL rather than 0).
Is there an alternative or efficient way to achieve the results?
So, the above is definitely an alternative way, with its own pros and cons.
Whether it is more efficient - I do think so, but I am honestly not 100% sure - it depends on your data. You need to test it against your data and see ...
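
The key idea in the windowed rewrite is ordering by an integer month number, so that range between 3 preceding and 1 preceding means "the previous three calendar months". A standalone illustration of that encoding (the dates here are arbitrary examples):

-- extract(year)*12 + extract(month) maps each date to a month ordinal,
-- so consecutive months differ by exactly 1 even across year boundaries.
SELECT d,
       EXTRACT(YEAR FROM d) * 12 + EXTRACT(MONTH FROM d) AS month_number
FROM UNNEST([DATE '2021-12-15', DATE '2022-01-05', DATE '2022-03-04']) AS d;
-- 2021-12-15 -> 24264, 2022-01-05 -> 24265, 2022-03-04 -> 24267

With this ordinal as the window's order by key, a range frame of 3 preceding to 1 preceding covers exactly the rows whose month ordinal is 1 to 3 less than the current row's, i.e. the previous one to three calendar months.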

Detect Value Changes beyond a threshold in Time Series data in SQL

In PostgreSQL, I am trying to find subjects that have a sequence of values below 60 followed by two consecutive values above 60. I'm also interested in the length of time between the first recorded value below 60 and the second value above 60. This event can occur multiple times for each subject.
I am struggling to work out how to search for an unlimited number of values < 60 followed by 2 values >= 60.
RowID  SubjectID  Value  TimeStamp
------------------------------------
1      1          65     2142-04-29 12:00:00
2      1          58     2142-04-30 03:00:00
3      1          55     2142-04-30 04:00:00
4      1          54     2142-04-30 05:00:00
5      1          55     2142-04-30 06:15:00
6      1          56     2142-04-30 06:45:00
7      1          65     2142-04-30 07:00:00
8      1          65     2142-04-30 08:00:00
9      2          48     2142-05-04 03:30:00
10     2          48     2142-05-04 04:00:00
11     2          50     2142-05-04 05:00:00
12     2          69     2142-05-04 06:00:00
13     2          68     2142-05-04 07:00:00
14     2          69     2142-05-04 08:00:00
15     2          50     2142-05-04 09:00:00
16     2          55     2142-05-04 10:00:00
17     2          50     2142-05-04 10:30:00
18     2          67     2142-05-04 11:00:00
19     2          67     2142-05-04 12:00:00
My current attempt uses the lag and lead functions, but I am unsure how to use them when I do not know how far ahead I need to look. This is an example of looking ahead one value and behind one value. My problem is that I do not know how to partition by subjectID to look "t" time points ahead, where "t" may be different for every subject.
select t.subjectId, t.didEventOccur,
       (next_timestamp - timestamp) as duration
from (select t.*,
             lag(t.value) over (partition by t.subjectid order by t.timestamp) as prev_value,
             lead(t.value) over (partition by t.subjectid order by t.timestamp) as next_value,
             lead(t.timestamp) over (partition by t.subjectid order by t.timestamp) as next_timestamp
      from t
     ) t
where value < 60 and next_value < 60 and
      (prev_value is null or prev_value >= 60);
I hope to get an output such as:
SubjectID  DidEventOccur  Duration
------------------------------------
1          1              05:00:00
2          1              03:30:00
2          1              03:00:00
A pure SQL solution like you have been asking for:
SELECT subjectid, start_at, next_end_at - start_at AS duration
FROM (
SELECT *
, lead(end_at) OVER (PARTITION BY subjectid ORDER BY start_at) AS next_end_at
FROM (
SELECT subjectid, grp, big
, min(ts) AS start_at
, max(ts) FILTER (WHERE big AND big_rn = 2) AS end_at -- 2nd timestamp
FROM (
SELECT subjectid, ts, grp, big
, row_number() OVER (PARTITION BY subjectid, grp, big ORDER BY ts) AS big_rn
FROM (
SELECT subjectid, ts
, row_number() OVER (PARTITION BY subjectid ORDER BY ts)
- row_number() OVER (PARTITION BY subjectid, (value > 60) ORDER BY ts) AS grp
, (value > 60) AS big
FROM tbl
) sub1
) sub2
GROUP BY subjectid, grp, big
) sub3
) sub4
WHERE NOT big -- identifies block of values <= 60 ...
AND next_end_at IS NOT NULL -- ...followed by at least 2 values > 60
ORDER BY subjectid, start_at;
I omitted the useless column DidEventOccur and added start_at instead. Otherwise it is exactly your desired result.
Consider a procedural solution in plpgsql (or any PL) instead; it should be faster. Simpler? I'd say yes, but that depends on who's judging. See (with an explanation of the technique and links to more):
How to number consecutive records per island?
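
A minimal sketch of what such a procedural solution could look like, assuming the question's table is named tbl (as in the query above) with columns subjectid, value and ts: a single scan in timestamp order, keeping a little state per subject.

-- Sketch only: emits one row per detected event (a block of values < 60
-- followed by two consecutive values >= 60), with the duration from the
-- first low value to the second high value.
CREATE OR REPLACE FUNCTION detect_events()
  RETURNS TABLE (subjectid int, start_at timestamp, duration interval)
  LANGUAGE plpgsql AS
$$
DECLARE
   r           record;
   cur_subject int;
   grp_start   timestamp;     -- first value < 60 of the current candidate block
   above_count int := 0;      -- consecutive values >= 60 seen since the block
BEGIN
   FOR r IN
      SELECT t.subjectid, t.value, t.ts
      FROM   tbl t
      ORDER  BY t.subjectid, t.ts
   LOOP
      IF cur_subject IS DISTINCT FROM r.subjectid THEN  -- new subject: reset state
         cur_subject := r.subjectid;
         grp_start   := NULL;
         above_count := 0;
      END IF;

      IF r.value < 60 THEN
         IF grp_start IS NULL THEN
            grp_start := r.ts;          -- a block of low values begins
         END IF;
         above_count := 0;              -- any high streak is broken
      ELSIF grp_start IS NOT NULL THEN
         above_count := above_count + 1;
         IF above_count = 2 THEN        -- second consecutive value >= 60
            subjectid := r.subjectid;
            start_at  := grp_start;
            duration  := r.ts - grp_start;
            RETURN NEXT;
            grp_start   := NULL;        -- event closed; look for the next one
            above_count := 0;
         END IF;
      END IF;
   END LOOP;
END
$$;

Call it with SELECT * FROM detect_events(); on the sample data it returns the three rows from the desired output (durations 05:00, 03:30 and 03:00).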

How do I compare a current partial month vs a previous partial month with postgres?

I'm building some basic reports and I want to see if I'm on track to surpass last month's metrics without waiting for the month to end. Basically I want to compare June 1 (start of current month) through June 23 (current_date) against May 1 (start of previous month) through May 23 (current_date - 1 month).
My goal is to show a count of distinct users that did event1 and event2.
Here's what I have so far:
CREATE VIEW events AS
(SELECT *
FROM public.event
WHERE TYPE in ('event1',
'event2')
AND created_at > now() - interval '1 months' );
CREATE VIEW MAU AS
(SELECT EXTRACT(DOW
FROM created_at) AS month,
DATE_TRUNC('week', created_at) AS week,
COUNT(*) AS total_engagement,
COUNT(DISTINCT user_id) AS total_users
FROM events
GROUP BY 2,
1
ORDER BY week DESC);
SELECT month,
week,
SUM(total_engagement) OVER (PARTITION BY month
ORDER BY week) AS total_engagement
FROM MAU
ORDER BY 1 DESC,
2
Here's an example of what that returns:
Month  Week                 Unique Engagement
-----------------------------------------------
6      2017-05-22 00:00:00  165
6      2017-05-29 00:00:00  355
6      2017-06-05 00:00:00  572
6      2017-06-12 00:00:00  723
5      2017-05-22 00:00:00  757
5      2017-05-29 00:00:00  1549
5      2017-06-05 00:00:00  2394
5      2017-06-12 00:00:00  3261
5      2017-06-19 00:00:00  3592
Expected return
Month  Day  Total Engagement
------------------------------
6      1    50
6      2    100
6      3    180
5      1    89
5      2    213
5      3    284
5      4    341
Can you point out where I've got this wrong or if there's an easier way to do it?
You are confusing days, weeks and months in your question, but from the expected output I assume that you want the month number, the week number within the month, and a count for each such pair.
SELECT
month,
week,
count(*) as total_engagement
FROM (
SELECT
extract(month from created_at) as month,
extract('day' from date_trunc('week', created_at::date) -
date_trunc('week', date_trunc('month', created_at::date))) / 7 + 1 as week
FROM public.event
WHERE type IN ('event1', 'event2')
AND created_at > now() - interval '1 month'
) t
GROUP BY 1,2
The most interesting part could be getting the week number within a month; the worked example below shows how that expression behaves.
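
To see what the week expression does, evaluate it for a single date. For 2017-06-23, date_trunc('week', ...) gives Monday 2017-06-19, while the week containing the first of the month starts on Monday 2017-05-29; the difference is 21 days, so the expression yields 21/7 + 1 = 4, i.e. the fourth Monday-based week of June:

-- Standalone check of the week-within-month arithmetic (PostgreSQL):
SELECT extract('day' from date_trunc('week', d) -
               date_trunc('week', date_trunc('month', d))) / 7 + 1 AS week
FROM (SELECT timestamp '2017-06-23') AS t(d);
-- week = 4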