SQL query to get top 24 records, then average the first 12 and bottom 12 - sql

I'm attempting to analyze each account's performance (A_Count & B_Count) during their first year versus their second year. This should only return clients who have at least 24 months of totals (records).
Volume Table
Account  ReportDate  A_Count  B_Count
1001A    2019-01-01  47       100
1001A    2019-02-01  50       105
1002A    2019-02-01  50       105
I think I'm on the right track by wanting to grab the top 24 records for each account (only if 24 exist) and then grabbing the top 12 and bottom 12, but not sure how to get there.
I guess ideal output would be:
Account  YR1_A_Avg  YR1_B_Avg  YR2_A_Avg  YR2_B_Avg  FirstDate   LastDate
1001A    47         100        53         115        2019-01-01  2021-12-31
1002A    50         105        65         130        2019-02-01  2022-01-01
1003A    15         180        38         200        2017-05-01  2019-04-01
I'm not too worried about performance.

Assuming there are no gaps in ReportDate (per Account), number each account's rows by date, bucket them into blocks of 12, and average per block:
select Account
      ,avg(case when year_index = 1 then A_Count end) as YR1_A_Avg
      ,avg(case when year_index = 1 then B_Count end) as YR1_B_Avg
      ,avg(case when year_index = 2 then A_Count end) as YR2_A_Avg
      ,avg(case when year_index = 2 then B_Count end) as YR2_B_Avg
      ,min(ReportDate) as FirstDate
      ,max(ReportDate) as LastDate
from
(
    select *
          ,count(*) over (partition by Account) as cnt
          ,(row_number() over (partition by Account order by ReportDate) - 1) / 12 + 1 as year_index
    from Volume
) t
where cnt >= 24 and year_index <= 2
group by Account
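To see how the row_number() bucketing behaves, here is a minimal sketch that runs the same query against SQLite from Python. The sample rows and the monthly_rows helper are made up for illustration, and the sketch assumes that `/ 12` is integer division (true in SQLite and SQL Server; engines that return a decimal there would need a FLOOR or an integer cast).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table Volume (Account text, ReportDate text, A_Count int, B_Count int)"
)

def monthly_rows(account, months, yearly_a):
    """Build `months` consecutive monthly rows starting 2019-01-01 (made-up data)."""
    rows = []
    for i in range(months):
        a = yearly_a[i // 12]  # A_Count is constant within each 12-row block
        rows.append((account, f"{2019 + i // 12}-{i % 12 + 1:02d}-01", a, a * 2))
    return rows

data = (
    monthly_rows("1001A", 24, [10, 20])       # exactly 24 months: kept
    + monthly_rows("1002A", 23, [50, 60])     # 23 months: filtered out by cnt >= 24
    + monthly_rows("1003A", 30, [5, 7, 100])  # 30 months: only the first 24 are used
)
conn.executemany("insert into Volume values (?, ?, ?, ?)", data)

query = """
select Account,
       avg(case when year_index = 1 then A_Count end) as YR1_A_Avg,
       avg(case when year_index = 1 then B_Count end) as YR1_B_Avg,
       avg(case when year_index = 2 then A_Count end) as YR2_A_Avg,
       avg(case when year_index = 2 then B_Count end) as YR2_B_Avg,
       min(ReportDate) as FirstDate,
       max(ReportDate) as LastDate
from (
    select *,
           count(*) over (partition by Account) as cnt,
           (row_number() over (partition by Account order by ReportDate) - 1) / 12 + 1
               as year_index
    from Volume
) t
where cnt >= 24 and year_index <= 2
group by Account
order by Account
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

Because the outer filter keeps only year_index 1 and 2, LastDate here is the date of the 24th record, not the account's latest record.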

Related

Calculate sales metrics (like past 6 months, past 3 months, sale one year ago etc.) on transaction data in BigQuery

I have to create a view in BigQuery with some details of product sales. The measurements to be included in the view are explained below. These measurements have to be calculated for each product for every day that product is sold. A product is identified by a unique combination of 5-6 attributes (in our demo, the code1 and code2 columns). The date represents the transaction date.
sales_today -> the sum of sales for each product (combination of code1 and code2) per day.
TotSales_previous_3_months -> the sum of sales for each product in the previous 3 months (without including any sales from the current month). E.g., if we are calculating TotSales_previous_3_months for a product sale on 5th March 2022, we have to sum up the sales of that product from 1st December 2021 to 28th February 2022.
TotSales_previous_6_months -> the sum of sales for each product in the previous 6 months (without including any sales from the current month). Follow the same logic as for TotSales_previous_3_months.
sale_one_month_ago -> the sum of sales of the product on this day exactly one month ago. E.g., if we are calculating sale_one_month_ago for a product sale on 5th March 2022, it would be the sum of sales of that product on 5th February 2022.
sale_one_year_ago -> the sum of sales of the product on this day exactly one year ago. E.g., if we are calculating sale_one_year_ago for a product sale on 5th March 2022, it would be the sum of sales of that product on 5th March 2021.
Unique_count_flag -> flag = 1 if the number of sales of the product on a day = 1. If the number of sales of the product is more than 1 on a day, flag = 0.
I have created this table (test_sales) with some demo data for understanding.
code1  code2  date        gen          sales
1      A      2021-02-04  jerez        7
1      A      2021-02-04  abc          5
1      A      2022-02-04  wres         10
1      A      2022-03-04  tomz         10
1      A      2022-03-05  everyz       10
1      A      2022-05-01  ben10        30
1      A      2022-06-01  xyx          10
1      A      2022-06-01  xya          5
2      A      2022-05-10  iqoom        20
3      C      2022-01-10  imola        60
3      C      2022-04-01  nurburgring  50
3      C      2022-06-01  jerez        30
The result set after calculations should be like -
code1 | code2 | date | gen | sales | sales_today | TotSales_previous_3_months | TotSales_previous_6_months | sale_one_month_ago | sale_one_year_ago | Unique_count_flag
1 | A | 2021-02-04 | jerez | 7 | 12 | 0 | 0 | 0 | | 0
1 | A | 2021-02-04 | abc | 5 | 12 | 0 | 0 | 0 | | 0
1 | A | 2022-02-04 | wres | 10 | 10 | 0 | 0 | 0 | 12 | 1
1 | A | 2022-03-04 | tomz | 10 | 10 | 10 | 10 | 10 | | 1
1 | A | 2022-03-05 | everyz | 10 | 10 | 10 | 10 | 0 | | 1
1 | A | 2022-05-01 | ben10 | 30 | 30 | 30 | 30 | 0 | | 1
1 | A | 2022-06-01 | xyx | 10 | 15 | 50 | 60 | 30 | | 0
1 | A | 2022-06-01 | xya | 5 | 15 | 50 | 60 | 30 | | 0
2 | A | 2022-05-10 | iqoom | 20 | 20 | 0 | 0 | 0 | | 1
3 | C | 2022-01-10 | imola | 60 | 60 | 0 | 0 | 0 | | 1
3 | C | 2022-04-01 | nurburgring | 50 | 50 | 60 | 60 | 0 | | 1
3 | C | 2022-06-01 | jerez | 30 | 30 | 50 | 110 | 0 | | 1
I was able to write the code below to get the result. It works fine for small datasets, but I am dealing with around 60 GB of data (~50 columns and ~80 million rows). When I adapt it to the original sales data (which is itself the result of joining a few tables), it just runs for a very long time. Is there an alternative or more efficient way to achieve the results?
with temp as
(SELECT
code1,code2,date,gen,sales,
COUNT(*) OVER(PARTITION BY code1, code2, date) AS cnt,
SUM(sales) OVER(PARTITION BY code1, code2,date) AS sales_today,
array_agg(struct(sales as sales,date as date)) over(partition by code1,code2 order by date) as past_records
FROM
`test_sales`
)
select * except(past_records,cnt),
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date between (date_trunc(temp.date,MONTH) - INTERVAL 3 MONTH) and (date_trunc(temp.date, MONTH) - interval 1 day)) as TotSales_previous_3_months,
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date between (date_trunc(temp.date,MONTH) - INTERVAL 6 MONTH) and (date_trunc(temp.date, MONTH) - interval 1 day)) as TotSales_previous_6_months,
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date = temp.date - INTERVAL 1 MONTH) as sale_one_month_ago,
(select ifnull(sum(x.sales),0)
from unnest(temp.past_records) as x
where x.date = temp.date - INTERVAL 1 YEAR) as sale_one_year_ago,
if(cnt = 1,1,0) as Unique_count_flag
from temp
Modified code, inspired by Mikhail's approach:
select *,
-- extract(year from date) * 12 + extract(month from date) as months,
-- UNIX_DATE(date) AS days,
sum(sales) over(product_date) as sales_today,
sum(sales) over(product range between 3 preceding and 1 preceding) as TotSales_previous_3_months,
sum(sales) over(product range between 6 preceding and 1 preceding) as TotSales_previous_6_months,
case when extract(day from date) = 31 and extract(month from date) in (3,12,10,7,5)
then sum(sales) over(product_by_unix_date range between 31 preceding and 31 preceding)
when extract(day from date) = 30 and extract(month from date) = 3
then sum(sales) over(product_by_unix_date range between 30 preceding and 30 preceding)
when extract(day from date) = 29 and extract(month from date) = 3
then sum(sales) over(product_by_unix_date range between 29 preceding and 29 preceding)
else
sum(sales) over(product_day range between 1 preceding and 1 preceding)
end as sale_one_month_ago,
case when extract(day from date) = 29 and extract(month from date) = 2
then sum(sales) over(product_by_unix_date range between 366 preceding and 366 preceding)
else
sum(sales) over(product_day range between 12 preceding and 12 preceding)
end as sale_one_year_ago
from `river-blade-343102.test.test_sales`
window
product as (partition by code1, code2 order by extract(year from date) * 12 + extract(month from date)),
product_date as (partition by code1, code2, date ),
product_day as (partition by code1, code2, extract(day from date) order by extract(year from date) * 12 + extract(month from date)),
product_by_unix_date as (partition by code1,code2 order by UNIX_DATE(date))
Consider the version of your query below. It is still not perfect, but at least it is easier to read and maintain:
select *,
  sum(sales) over(product_date) as sales_today,
  sum(sales) over(product range between 3 preceding and 1 preceding) as TotSales_previous_3_months,
  sum(sales) over(product range between 6 preceding and 1 preceding) as TotSales_previous_6_months,
  sum(sales) over(product_day range between 1 preceding and 1 preceding) as sale_one_month_ago,
  sum(sales) over(product_day range between 12 preceding and 12 preceding) as sale_one_year_ago
from test_sales
window
  product as (partition by code1, code2 order by extract(year from date) * 12 + extract(month from date)),
  product_date as (partition by code1, code2, date),
  product_day as (partition by code1, code2, extract(day from date) order by extract(year from date) * 12 + extract(month from date))
if applied to sample data in your question - output is
Is there an alternative or efficient way to achieve the results?
So the above is definitely an alternative, with its own pros and cons.
Is it more efficient? I think so, but I am not 100% sure. It depends on your data, so you need to test it against your data and see.
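The core trick in the window version is ordering by a month index (year * 12 + month) so that a RANGE frame of "3 preceding and 1 preceding" covers the three previous calendar months. As a rough check, here is a sketch that runs the same idea on the question's rows for product (1, A) in SQLite from Python. It assumes SQLite 3.28 or later (needed for RANGE frames with offsets), swaps BigQuery's extract() for strftime(), and adds coalesce() so empty frames show 0. The names indexed, month_idx, prev_3m and prev_6m are mine, not from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table test_sales (code1 int, code2 text, date text, gen text, sales int)"
)
conn.executemany("insert into test_sales values (?, ?, ?, ?, ?)", [
    (1, "A", "2021-02-04", "jerez", 7),
    (1, "A", "2021-02-04", "abc", 5),
    (1, "A", "2022-02-04", "wres", 10),
    (1, "A", "2022-03-04", "tomz", 10),
    (1, "A", "2022-03-05", "everyz", 10),
    (1, "A", "2022-05-01", "ben10", 30),
    (1, "A", "2022-06-01", "xyx", 10),
    (1, "A", "2022-06-01", "xya", 5),
])

query = """
with indexed as (
    -- month_idx grows by exactly 1 per calendar month
    select *,
           cast(strftime('%Y', date) as integer) * 12
             + cast(strftime('%m', date) as integer) as month_idx
    from test_sales
)
select gen,
       sum(sales) over (partition by code1, code2, date) as sales_today,
       coalesce(sum(sales) over (partition by code1, code2 order by month_idx
                                 range between 3 preceding and 1 preceding), 0) as prev_3m,
       coalesce(sum(sales) over (partition by code1, code2 order by month_idx
                                 range between 6 preceding and 1 preceding), 0) as prev_6m
from indexed
"""
# gen is unique within this subset, so it can serve as a lookup key
by_gen = {gen: (today, p3, p6) for gen, today, p3, p6 in conn.execute(query)}
for gen, values in by_gen.items():
    print(gen, values)
```

For these rows the three columns agree with sales_today, TotSales_previous_3_months and TotSales_previous_6_months in the expected result of the question, e.g. (15, 50, 60) for the two sales on 2022-06-01.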

Snowflake SQL - Count Distinct Users within descending time interval

I want to count the distinct users over the last 60 days, then the distinct users over the last 59 days, and so on.
Ideally, the output would look like this (TARGET OUTPUT)
Day Distinct Users
60 200
59 200
58 188
57 185
56 180
[...] [...]
where 60 days is the max total possible distinct users, and then 59 would have a little less and so on and so forth.
my query looks like this.
select
count(distinct (case when datediff(day,DATE,current_date) <= 60 then USER_ID end)) as day_60,
count(distinct (case when datediff(day,DATE,current_date) <= 59 then USER_ID end)) as day_59,
count(distinct (case when datediff(day,DATE,current_date) <= 58 then USER_ID end)) as day_58
FROM Table
The issue with my query is that it outputs the data by column instead of by row (as shown below) and, most importantly, I have to write out this logic 60 times, once for each of the 60 days.
Current Output:
Day_60 Day_59 Day_58
209 207 207
Is it possible to write the SQL in a way that creates the target as shown initially above?
Using below data in CTE format -
with data_cte(dates,userid) as
(select * from values
('2022-05-01'::date,'UID1'),
('2022-05-01'::date,'UID2'),
('2022-05-02'::date,'UID1'),
('2022-05-02'::date,'UID2'),
('2022-05-03'::date,'UID1'),
('2022-05-03'::date,'UID2'),
('2022-05-03'::date,'UID3'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID2'),
('2022-05-04'::date,'UID3'),
('2022-05-04'::date,'UID4'),
('2022-05-05'::date,'UID1'),
('2022-05-06'::date,'UID1'),
('2022-05-07'::date,'UID1'),
('2022-05-07'::date,'UID2'),
('2022-05-08'::date,'UID1')
)
Query to get all dates and count and distinct counts -
select dates,count(userid) cnt, count(distinct userid) cnt_d
from data_cte
group by dates;
DATES       CNT  CNT_D
2022-05-01  2    2
2022-05-02  2    2
2022-05-03  3    3
2022-05-04  5    4
2022-05-05  1    1
2022-05-06  1    1
2022-05-08  1    1
2022-05-07  2    2
Query to get difference of date from current date
select dates,datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates;
DATES       DDIFF  CNT  CNT_D
2022-05-01  45     2    2
2022-05-02  44     2    2
2022-05-03  43     3    3
2022-05-04  42     5    4
2022-05-05  41     1    1
2022-05-06  40     1    1
2022-05-08  38     1    1
2022-05-07  39     2    2
Get only the records whose date difference is within a certain range by adding a HAVING clause -
select datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates
having ddiff<=43;
DDIFF  CNT  CNT_D
43     3    3
42     5    4
41     1    1
39     2    2
38     1    1
40     1    1
If you need to prefix 'day' to each date diff count, you can add an outer query over the previously fetched data set and prepend the prefix to the date diff column, as follows.
I am using CTE syntax, but you can use a sub-query, given that you will select from a table -
,cte_1 as (
select datediff(day,dates,current_date()) ddiff,
count(userid) cnt,
count(distinct userid) cnt_d
from data_cte
group by dates
having ddiff<=43)
select 'day_'||to_char(ddiff) days,
cnt,
cnt_d
from cte_1;
DAYS    CNT  CNT_D
day_43  3    3
day_42  5    4
day_41  1    1
day_39  2    2
day_38  1    1
day_40  1    1
Updated the answer to get the distinct user count over a range of days.
A WHERE clause can be added to the final query to limit it to the number of days needed.
with data_cte(dates,userid) as
(select * from values
('2022-05-01'::date,'UID1'),
('2022-05-01'::date,'UID2'),
('2022-05-02'::date,'UID1'),
('2022-05-02'::date,'UID2'),
('2022-05-03'::date,'UID5'),
('2022-05-03'::date,'UID2'),
('2022-05-03'::date,'UID3'),
('2022-05-04'::date,'UID1'),
('2022-05-04'::date,'UID6'),
('2022-05-04'::date,'UID2'),
('2022-05-04'::date,'UID3'),
('2022-05-04'::date,'UID4'),
('2022-05-05'::date,'UID7'),
('2022-05-06'::date,'UID1'),
('2022-05-07'::date,'UID8'),
('2022-05-07'::date,'UID2'),
('2022-05-08'::date,'UID9')
),cte_1 as
(select datediff(day,dates,current_date()) ddiff,userid
from data_cte), cte_2 as
(select distinct ddiff from cte_1 )
select cte_2.ddiff,
(select count(distinct userid)
from cte_1 where cte_1.ddiff <= cte_2.ddiff) cnt
from cte_2
order by cte_2.ddiff desc
DDIFF  CNT
47     9
46     9
45     9
44     8
43     5
42     4
41     3
40     1
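The query above only returns a row for day differences that actually occur in the data. If one row per day is wanted (60, 59, 58, ...), a generated day series can be joined in. The following is a sketch of that idea in SQLite, driven from Python. It is an assumption-laden stand-in, not Snowflake syntax: the recursive days CTE plays the role of a row generator, julianday() replaces datediff(), and "today" is pinned to 2022-06-17 only because that date reproduces the DDIFF values 47 to 40 shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table data (dates text, userid text)")
conn.executemany("insert into data values (?, ?)", [
    ("2022-05-01", "UID1"), ("2022-05-01", "UID2"),
    ("2022-05-02", "UID1"), ("2022-05-02", "UID2"),
    ("2022-05-03", "UID5"), ("2022-05-03", "UID2"), ("2022-05-03", "UID3"),
    ("2022-05-04", "UID1"), ("2022-05-04", "UID6"), ("2022-05-04", "UID2"),
    ("2022-05-04", "UID3"), ("2022-05-04", "UID4"),
    ("2022-05-05", "UID7"), ("2022-05-06", "UID1"),
    ("2022-05-07", "UID8"), ("2022-05-07", "UID2"),
    ("2022-05-08", "UID9"),
])

query = """
with recursive days(n) as (
    -- one row per day window: 1, 2, ..., :max_days
    select 1
    union all
    select n + 1 from days where n < :max_days
),
diffs as (
    select cast(julianday(:today) - julianday(dates) as integer) as ddiff, userid
    from data
)
select d.n as day, count(distinct x.userid) as distinct_users
from days d
left join diffs x on x.ddiff <= d.n   -- every record inside the last n days
group by d.n
order by d.n desc
"""
rows = conn.execute(query, {"today": "2022-06-17", "max_days": 47}).fetchall()
counts = dict(rows)
for day, users in rows[:10]:
    print(day, users)
```

Windows with no activity (here, anything under 40 days) come back as 0 instead of being absent. The join compares every day with every record, so on a large table it is worth restricting diffs to the last :max_days days first.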
You can unpivot your current output. A sample:
select
*
from (
select
209 Day_60,
207 Day_59,
207 Day_58
)unpivot ( cnt for days in (Day_60,Day_59,Day_58));

DB2/SQL aggregates with preceding weekdays

I have a query that currently gets daily records against a weekly number from a prepopulated table:
SELECT Employee,
sum(case when category = 'Shirts' then daily_total else 0 end) as Shirts_DAILY,
sum(case when category = 'Shirts' then weekly_quota else 0 end) as Shirts_QUOTA, -- this is a static column, this number is the same for every record
sum(case when category = 'Shoes' then daily_total else 0 end) as Shoes_DAILY,
sum(case when category = 'Shoes' then weekly_quota else 0 end) as Shoes_QUOTA, -- this is a static column, this number is the same for every record
CURRENT_DATE as DATE_OF_REPORT
from SalesNumbers
where date_of_report >= current_date
group by Employee;
This runs in a script nightly and returns records like this:
Employee | shirts_DAILY | shirts_QUOTA | Shoes_DAILY | Shoes_QUOTA | DATE_OF_REPORT
--------------------------------------------------------------------------------------------------------
123 15 75 14 85 2019-08-30
That's the record from last Friday Night's report. I'm trying to figure out a way to add a column for each category that would take the sum of daily totals (shirts_DAILY, shoes_DAILY) for each category on preceding weekdays (running sunday through saturday as a week) and divide by that category's quota (shirts_QUOTA, shoes_QUOTA).
For example, here are records from sunday through thursday
Employee | shirts_DAILY | shirts_QUOTA | Shoes_DAILY | Shoes_QUOTA | DATE_OF_REPORT
--------------------------------------------------------------------------------------------------------
123 15 75 16 85 2019-08-25
123 4 75 2 85 2019-08-26
123 8 75 6 85 2019-08-27
123 2 75 8 85 2019-08-28
123 15 75 14 85 2019-08-29
With my new change, I would want Friday night's record to take the sum of sunday through thursday's daily records and divide by the quota (including friday's daily in the sum)
Friday night's record with new column:
Employee | shirts_DAILY | shirts_QUOTA | shirtsPercent | Shoes_DAILY | Shoes_QUOTA | shoesPercent | DATE_OF_REPORT
-----------------------------------------------------------------------------------------------------------------------------------------------
123 2 75 61.3 7 85 62.4 2019-08-30
So friday's run added 15,4,8,2,15,2 for the shirts for 46/75 and 7,14,8,6,2,16 for shoes for 53/85. So the daily sum of each for the preceding week, including present day daily totals, if that makes sense.
What is the best way for me to achieve this?
SELECT Employee,
       -- today's totals only
       sum(case when category = 'Shirts' and date_of_report >= current date
                then daily_total else 0 end) as Shirts_DAILY,
       sum(case when category = 'Shirts' and date_of_report >= current date
                then weekly_quota else 0 end) as Shirts_QUOTA,
       -- week-to-date total as a percentage of the quota;
       -- 100.0 keeps the division from being done in integers
       ( sum(case when category = 'Shirts' then daily_total else 0 end) * 100.0 ) /
       ( sum(case when category = 'Shirts' and date_of_report >= current date
                  then weekly_quota else 0 end) ) as Shirts_PERCENT,
       CURRENT_DATE as DATE_OF_REPORT
from SalesNumbers
-- dayofweek() returns 1 for Sunday, so this is the most recent Sunday
where date_of_report >= ( current date - ( dayofweek(current date) - 1 ) days )
group by Employee
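To check the arithmetic against the numbers in the question, here is a small sketch of the same week-to-date idea in SQLite, run from Python. Several things are assumptions for the sake of a runnable example: the report date is passed in as a parameter instead of using the current date, the Sunday of the current week is computed with SQLite date modifiers rather than DB2's dayofweek(), the quota is read with max() because the question says it is the same on every record, and the rows for Saturday 2019-08-24 are invented to show that the previous week is excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table SalesNumbers (Employee int, category text, daily_total int,"
    " weekly_quota int, date_of_report text)"
)
# (shirts, shoes) daily totals; 2019-08-24 belongs to the previous week (made up)
daily = {
    "2019-08-24": (50, 40),
    "2019-08-25": (15, 16), "2019-08-26": (4, 2), "2019-08-27": (8, 6),
    "2019-08-28": (2, 8), "2019-08-29": (15, 14), "2019-08-30": (2, 7),
}
for day, (shirts, shoes) in daily.items():
    conn.execute("insert into SalesNumbers values (123, 'Shirts', ?, 75, ?)", (shirts, day))
    conn.execute("insert into SalesNumbers values (123, 'Shoes', ?, 85, ?)", (shoes, day))

query = """
select Employee,
       sum(case when category = 'Shirts' and date_of_report = :today
                then daily_total else 0 end) as Shirts_DAILY,
       max(case when category = 'Shirts' then weekly_quota end) as Shirts_QUOTA,
       round(sum(case when category = 'Shirts' then daily_total else 0 end) * 100.0
             / max(case when category = 'Shirts' then weekly_quota end), 1) as Shirts_PERCENT,
       sum(case when category = 'Shoes' and date_of_report = :today
                then daily_total else 0 end) as Shoes_DAILY,
       max(case when category = 'Shoes' then weekly_quota end) as Shoes_QUOTA,
       round(sum(case when category = 'Shoes' then daily_total else 0 end) * 100.0
             / max(case when category = 'Shoes' then weekly_quota end), 1) as Shoes_PERCENT
from SalesNumbers
-- '-6 days' then 'weekday 0' lands on the Sunday that starts the current week
where date_of_report between date(:today, '-6 days', 'weekday 0') and :today
group by Employee
"""
report = conn.execute(query, {"today": "2019-08-30"}).fetchone()
print(report)
```

For Friday 2019-08-30 this gives 46/75 for shirts and 53/85 for shoes, i.e. the 61.3 and 62.4 from the question.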

SQL cumulative sum until a flag value and resetting the sum

I'm still learning SQL and there is a problem I haven't been able to solve. I'm selecting from a table (say, Expense) ordered by date. The table has a column named Charged, and I want a cumulative sum of the charges (this part I figured out). Another column, PayOut, acts as a flag: when PayOut is 1, I want the running sum of Charged (SumValue) to reset to zero. How would I do this? Below is what I have tried, the output I currently get, and the output I want. Note: I saw some posts using CTEs, but they covered a different, more complex scenario.
select ex.date,
ex.Charged,
(case when(ex.PayOut=1) then 0
else sum(ex.Charged) over (order by ex.date)end) as SumValue,
ex.PayOut
from Expense ex
order by ex.date asc
The data looks like this
Date Charged PayOut
01/10/2018 10 0
01/20/2018 5 0
01/30/2018 3 0
02/01/2018 0 1
02/11/2018 12 0
02/21/2018 15 0
Output I get
Date Charged PayOut SumValue
01/10/2018 10 0 10
01/20/2018 5 0 15
01/30/2018 3 0 18
02/01/2018 0 1 0
02/11/2018 12 0 30
02/21/2018 15 0 45
Output Wanted
Date Charged PayOut SumValue
01/10/2018 10 0 10
01/20/2018 5 0 15
01/30/2018 3 0 18
02/01/2018 0 1 0
02/11/2018 12 0 12
02/21/2018 15 0 27
Create a group number from a running sum of the PayOut column and use it as a partition in OVER:
WITH Expense AS (
    SELECT CAST('01/10/2018' AS DATE) AS Date, 10 AS Charged, 0 AS PayOut
    UNION ALL SELECT CAST('01/20/2018' AS DATE), 5, 0
    UNION ALL SELECT CAST('01/30/2018' AS DATE), 3, 0
    UNION ALL SELECT CAST('02/01/2018' AS DATE), 0, 1
    UNION ALL SELECT CAST('02/11/2018' AS DATE), 12, 0
    UNION ALL SELECT CAST('02/21/2018' AS DATE), 15, 0
)
SELECT
    dat.date
    ,dat.Charged
    ,dat.PayOut
    ,dat.PayOutGroup
    ,SUM(dat.Charged) OVER (PARTITION BY dat.PayOutGroup ORDER BY dat.date) as SumValue
FROM (
    SELECT
        e.date
        ,e.Charged
        ,e.PayOut
        ,SUM(e.PayOut) OVER (ORDER BY e.date) AS PayOutGroup
    FROM Expense e
) dat
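Here is a quick way to try the running-sum-as-group-id idea on the question's data, as a sketch using SQLite from Python. Dates are stored as ISO strings so they sort correctly, which is an adaptation for SQLite rather than something from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table Expense (date text, Charged int, PayOut int)")
conn.executemany("insert into Expense values (?, ?, ?)", [
    ("2018-01-10", 10, 0),
    ("2018-01-20", 5, 0),
    ("2018-01-30", 3, 0),
    ("2018-02-01", 0, 1),   # payout row: the running total restarts here
    ("2018-02-11", 12, 0),
    ("2018-02-21", 15, 0),
])

query = """
select dat.date, dat.Charged, dat.PayOut, dat.PayOutGroup,
       sum(dat.Charged) over (partition by dat.PayOutGroup order by dat.date) as SumValue
from (
    -- the running sum of the flag increases by 1 at every payout,
    -- so it labels each stretch between payouts with its own number
    select e.date, e.Charged, e.PayOut,
           sum(e.PayOut) over (order by e.date) as PayOutGroup
    from Expense e
) dat
order by dat.date
"""
rows = conn.execute(query).fetchall()
sum_values = [row[4] for row in rows]
print(sum_values)
```

The payout row shows 0 here only because its own Charged is 0. If a payout row could carry a charge, that charge would be the first value of the new group, so it may need to be zeroed out explicitly.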

How to do a x-days grouped sum in redshift?

I have the following table,
that shows how many items from different units entered the inventory, in different dates.
ID Date Unit Quantity
---------------------------------
1 2017-08-01 A_red 05
2 2017-08-13 A_red 10
3 2017-09-20 A_red 20
4 2017-09-22 A_red 40
5 2017-10-05 A_red 40
6 2017-10-25 A_red 30
7 2017-10-24 A_blue 60
The problem is: entries within a time interval of 30 days of the same unit should be grouped.
So I want the following result:
ID Date Unit Quantity fst_entry30 Quantity30
-----------------------------------------------------
1 2017-08-01 A_red 05 T 15
2 2017-08-13 A_red 10 F 15
3 2017-09-20 A_red 20 T 100
4 2017-09-22 A_red 40 F 100
5 2017-10-05 A_red 40 F 100
6 2017-10-25 A_red 30 T 30
7 2017-10-24 A_blue 60 T 60
where fst_entry30 is a flag that indicates whether the entry was the first of this unit in the last 30 days. Note that an entry of a different unit (A_blue instead of A_red) is not grouped with the others.
And Quantity30 is the grouped sum of Quantity.
For example, there are fewer than 30 days between 20 September and 5 October, so those entries were grouped.
Keep in mind that Redshift does not allow recursive common table expressions.
I already tried self-joins, but that turned out to be cumbersome.
You would just use lag() to define the groups:
select t.*,
       (case when date < lag(date) over (partition by unit order by date) + interval '30 day'
             then 0 else 1
        end) as grp_start
from t;
Then you can do a cumulative sum to assign a number to the group . . . and finally add them up using a window function:
select t.*, sum(quantity) over (partition by unit, grp)
from (select t.*,
             sum(grp_start) over (partition by unit order by date) as grp
      from (select t.*,
                   (case when date < lag(date) over (partition by unit order by date) + interval '30 day'
                         then 0 else 1
                    end) as grp_start
            from t
           ) t
     ) t
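As a sanity check of the lag() approach, here is a sketch that runs it on the question's rows in SQLite from Python. A row is flagged as a group start when its unit has no previous row or the previous row is at least 30 days older. The column name entry_date and the use of julianday() for the day difference are adaptations for SQLite, not Redshift syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (id int, entry_date text, unit text, quantity int)")
conn.executemany("insert into t values (?, ?, ?, ?)", [
    (1, "2017-08-01", "A_red", 5),
    (2, "2017-08-13", "A_red", 10),
    (3, "2017-09-20", "A_red", 20),
    (4, "2017-09-22", "A_red", 40),
    (5, "2017-10-05", "A_red", 40),
    (6, "2017-10-25", "A_red", 30),
    (7, "2017-10-24", "A_blue", 60),
])

query = """
select id, grp_start,
       sum(quantity) over (partition by unit, grp) as quantity30
from (
    -- running sum of the start flags numbers the groups within each unit
    select s.*,
           sum(grp_start) over (partition by unit order by entry_date) as grp
    from (
        -- first row of a unit has no lag(): the comparison is NULL, so ELSE 1 applies
        select t.*,
               case when julianday(entry_date)
                         - julianday(lag(entry_date) over (partition by unit
                                                           order by entry_date)) < 30
                    then 0 else 1
               end as grp_start
        from t
    ) s
) g
order by id
"""
result = {row[0]: (row[1], row[2]) for row in conn.execute(query)}
for row_id, values in result.items():
    print(row_id, values)
```

One thing this makes visible: the groups are built from the gap to the previous row, not from the first entry of the group. ID 6 is only 20 days after ID 5, so it joins IDs 3 to 5 (sum 130), whereas the desired output in the question starts a new group there because it is more than 30 days after ID 3. Matching that exactly needs a different approach, since each group boundary depends on where the previous group started.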