Produce weekly and quarterly stats from a monthly figure - sql

I have a sample of a table as below:
Customer Ref
Bear Rate
Distance
Month
Revenue
ABA-IFNL-001
1000
01/01/2022
-135
ABA-IFNL-001
1000
01/02/2022
-135
ABA-IFNL-001
1000
01/03/2022
-135
ABA-IFNL-001
1000
01/04/2022
-135
ABA-IFNL-001
1000
01/05/2022
-135
ABA-IFNL-001
1000
01/06/2022
-135
I also have a sample of a calendar table as below:
Date
Year
Week
Quarter
WeekDay
Qtr Start
Qtr End
Week Day
04/11/2022
2022
45
4
Fri
30/09/2022
29/12/2022
1
05/11/2022
2022
45
4
Sat
30/09/2022
29/12/2022
2
06/11/2022
2022
45
4
Sun
30/09/2022
29/12/2022
3
07/11/2022
2022
45
4
Mon
30/09/2022
29/12/2022
4
08/11/2022
2022
45
4
Tue
30/09/2022
29/12/2022
5
09/11/2022
2022
45
4
Wed
30/09/2022
29/12/2022
6
10/11/2022
2022
45
4
Thu
30/09/2022
29/12/2022
7
11/11/2022
2022
46
4
Fri
30/09/2022
29/12/2022
1
12/11/2022
2022
46
4
Sat
30/09/2022
29/12/2022
2
13/11/2022
2022
46
4
Sun
30/09/2022
29/12/2022
3
14/11/2022
2022
46
4
Mon
30/09/2022
29/12/2022
4
15/11/2022
2022
46
4
Tue
30/09/2022
29/12/2022
5
16/11/2022
2022
46
4
Wed
30/09/2022
29/12/2022
6
17/11/2022
2022
46
4
Thu
30/09/2022
29/12/2022
7
How can I join/link the tables to report on revenue over weekly and quarterly periods using the calendar table? I can put into two tables if needed as an output eg:
Quarter Starting
31/12/2021
01/04/2022
01/07/2022
30/09/2022
Quarter
1
2
3
4
Revenue
500
400
540
540
Week Date Start
31/12/2021
07/01/2022
14/01/2022
21/01/2022
Week
41
42
43
44
Revenue
33.75
33.75
33.75
33.75
I am using alteryx for this but wouldnt mind explaination of possible logic in sql to apply it into the system
Thanks

Before I get into the answer, you're going to have an issue regarding data integrity. All the revenue data is aggregated at a monthly level, where your quarters start and end on someday within the month.
For example - Q4 starts September 30th (Friday) and ends Dec. 29th (Thursday). You may have a day or two that bleeds from another month into the quarters which might throw off the data a bit (esp. if there's a large amount of revenue during the days that bleed into a quarter.
Additionally, your revenue is aggregated at a monthly level - unless you have more granular data (weekly, daily would be best), it doesn't make sense to do a weekly calculation since you'll probably just be dividing revenue by 4.
That being said - You'll want to use a cross tab feature in alteryx to get the data how you want it. But before you do that, we want to aggregate your data at a quarterly level first.
You can do this with an if statement or some other data cleansing tool (sorry, been a while since I used alteryx). Something like:
# Pseudo code - this won't actually work!
# For determining quarter
if (month) between (30/09/2022,29/12/2022) then 4
where you can derive the logic from your calendar table. Then once you have the quarter, you can join in the Quarter Start date based on your quarter calculation.
Now you have a nice clean table that might look something like this:
Month
Revenue
Quarter
Quarter Start Date
01/01/2022
-135
4
30/09/2022
01/01/2022
-135
4
30/09/2022
Aggregate on your quarter to get a cleaner table
Quarter Start Date
Quarter
revenue
30/09/2022
4
300
Then use cross tab, where you pivot on the Quarter start date.
For SQL, you'd be pivoting the data. Essentially, taking the value from a row of data, and converting it into a column. It will look a bit janky because the data is so customized, but here's a good question that goes over pivioting - Simple way to transpose columns and rows in SQL?

Related

How do you get the last entry for each month in SQL?

I am looking to filter very large tables to the latest entry per user per month. I'm not sure if I found the best way to do this. I know I "should" trust the SQL engine (snowflake) but there is a part of me that does not like the join on three columns.
Note that this is a very common operation on many big tables, and I want to use it in DBT views which means it will get run all the time.
To illustrate, my data is of this form:
mytable
userId
loginDate
year
month
value
1
2021-01-04
2021
1
41.1
1
2021-01-06
2021
1
411.1
1
2021-01-25
2021
1
251.1
2
2021-01-05
2021
1
4369
2
2021-02-06
2021
2
32
2
2021-02-14
2021
2
731
3
2021-01-20
2021
1
258
3
2021-02-19
2021
2
4251
3
2021-03-15
2021
3
171
And I'm trying to use SQL to get the last value (by loginDate) for each month.
I'm currently doing a groupby & a join as follows:
WITH latest_entry_by_month AS (
SELECT "userId", "year", "month", max("loginDate") AS "loginDate"
FROM mytable
)
SELECT * FROM mytable NATURAL JOIN latest_entry_by_month
The above results in my desired output:
userId
loginDate
year
month
value
1
2021-01-25
2021
1
251.1
2
2021-01-05
2021
1
4369
2
2021-02-14
2021
2
731
3
2021-01-20
2021
1
258
3
2021-02-19
2021
2
4251
3
2021-03-15
2021
3
171
But I'm not sure if it's optimal.
Any guidance on how to do this faster? Note that I am not materializing the underlying data, so it is effectively un-clustered (I'm getting it from a vendor via the Snowflake marketplace).
Using QUALIFY and windowed function(ROW_NUMBER):
SELECT *
FROM mytable
QUALIFY ROW_NUMBER() OVER(PARTITION BY userId, year, month
ORDER BY loginDate DESC) = 1

How to compare same period of different date ranges in columns in BigQuery standard SQL

i have a hard time figuring out how to compare the same period (e.g. iso week 48) from different years for a certain metric in different columns. I am new to SQL and haven't fully understand how PARTITION BY works but guess that i'll need it for my desired output.
How can i sum the data from column "metric" and compare same periods of different date ranges (e.g. YEAR) in a table?
current table
date iso_week iso_year metric
2021-12-01 48 2021 1000
2021-11-30 48 2021 850
...
2020-11-28 48 2020 800
2020-11-27 48 2020 950
...
2019-11-27 48 2019 700
2019-11-26 48 2019 820
desired output
iso_week metric_thisYear metric_prevYear metric_prev2Year
48 1850 1750 1520
...
Consider below simple approach
select * from (
select * except(date)
from your_table
)
pivot (sum(metric) as metric for iso_year in (2021, 2020, 2019))
if applied to sample data in your question - output is

pandas groupby and filling in missing frequencies

I have a dataset of events each of which occurred on a specific day. Using Pandas I have been able to aggregate these into a count of events per month using the groupby function, and then plot a graph with Matplotlib. However, in the original dataset some months do not have any events and so there is no count of events in such a month. Such months do not therefore appear on the graph, but I would like to include then somehow with their zero count
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count()
which produces
month_year month
2016-01 January 9
2016-02 February 7
2016-04 April 1
2016-06 June 4
2016-07 July 1
2016-08 August 3
2016-09 September 2
2016-10 October 5
2016-11 November 17
2016-12 December 3
I have been trying to find a way of filling missing months in the dataframe generated by the groupby function with a 'count' value of 0 for, in this example, March and May..
Can anyone offer some advice on how this might be achieved. I have been trying to carry out FFill on the month column but with little success and can't work out how to add in a corresponding zero value for the missing months
First of all, if bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count() is your code, then it is a series. So, let's change it to a dataframe with bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index(). Now, into the problem.
Change to date format and use pd.Grouper and change back to string format. Also add back the month column and change the formatting of the event_no column:
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index()
bpm2['month_year'] = bpm2['month_year'].astype(str)
bpm2['month_year'] = pd.to_datetime(bpm2['month_year'])
bpm2 = bpm2.groupby([pd.Grouper(key='month_year', freq='1M')])['event_no'].first().fillna(0).astype(int).reset_index()
bpm2['month'] = bpm2['month_year'].dt.strftime('%B')
bpm2['month_year'] = bpm2['month_year'].dt.strftime('%Y-%m')
bpm2
output:
month_year event_no month
0 2016-01 9 January
1 2016-02 7 February
2 2016-03 0 March
3 2016-04 1 April
4 2016-05 0 May
5 2016-06 4 June
6 2016-07 1 July
7 2016-08 3 August
8 2016-09 2 September
9 2016-10 5 October
10 2016-11 17 November
11 2016-12 3 December

Compare Cumulative Sales per Year-End

Using this sample dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2020',
freq='M'), 10000)})
I am trying to compare 12 month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period for each year. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, even better would be able to compare 365 days. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date':pd.date_range('2019-01-01', '2019-12-31', freq='M'),
'Values':np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.

Usage compared to last year

I have a table that contains two columns - date an amount (usage). I want to compare my usage with last year's
Data
Date Amount
01-01-2015 23
02-01-2015 24
03-01-2015 19
04-01-2015 11
05-01-2015 5
06-01-2015 23
07-01-2015 21
08-01-2015 6
09-01-2015 26
10-01-2015 9
[...]
01-01-2016 30
02-01-2016 19
03-01-2016 5
04-01-2016 8
05-01-2016 15
06-01-2016 27
07-01-2016 19
08-01-2016 21
09-01-2016 24
10-01-2016 27
[until today's date]
The sql statement should return results up to today's date and the diff should be a running total - the amount on a day in 2016 minus the amount on the same day in 2015, plus the previous difference. Assuming today is Jan 10, the result would be:
Day Diff
01-01-2016 7 (30-23)
02-01-2016 2 (19-24+7)
03-01-2016 -12 (5-19+2)
04-01-2016 -15 (8-11-12)
05-01-2016 -5 (and so on...)
06-01-2016 -1
07-01-2016 -3
08-01-2016 12
09-01-2016 10
10-01-2016 28
I can't figure out to do this when the data is all in one table...
Thanks
Try this
SELECT t1.date Day,SUM(t2.Diff) FROM
amounts t1
INNER JOIN
(SELECT amounts.date Day, amounts.amount-prevamounts.amount Diff
FROM amounts inner join amounts prevamounts
ON amounts.date=date(prevamounts.date,'+1 years')) t2
ON t2.Day<=t1.date
group by t1.date;
demo here :
http://sqlfiddle.com/#!7/9bd4c/41