Compare Cumulative Sales per Year-End - pandas

Using this sample dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A','Group B','Group C','Group D'], 10000),
    'Sub-Category': np.random.choice(['X','Y','Z'], 10000),
    'Sub-Category-2': np.random.choice(['G','F','I'], 10000),
    'Product': np.random.choice(['Product 1','Product 2','Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2020', freq='M'), 10000)})
I am trying to compare 12-month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period in other years. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, and even better would be the ability to compare any 365-day window. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A, B, C, D) with a line for each respective year representing the running total, starting at zero on May 1 and building through April 30 of the next year. The x axis would be the month (starting with May) and the y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!

One way is to convert the dates to fiscal quarters and extract the fiscal year:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2019-12-31', freq='M'),
                   'Values': np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
          Date  Values  fiscal_year
0   2019-01-31       0         2019
1   2019-02-28       1         2019
2   2019-03-31       2         2019
3   2019-04-30       3         2019
4   2019-05-31       4         2020
5   2019-06-30       5         2020
6   2019-07-31       6         2020
7   2019-08-31       7         2020
8   2019-09-30       8         2020
9   2019-10-31       9         2020
10  2019-11-30      10         2020
11  2019-12-31      11         2020
And now you can group by fiscal_year to your heart's content.
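Following the same idea, here is a minimal sketch (with made-up monthly figures) of the running total the question asks for: group by the derived fiscal year and take a cumulative sum, which resets every May 1.

import numpy as np
import pandas as pd

# Twelve month-end dates spanning one Q-APR fiscal boundary
df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2019-12-31', freq='M'),
                   'Units_Sold': np.arange(1, 13)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear

# Running total that restarts at each fiscal-year boundary (May 1 here)
df['cum_units'] = df.groupby('fiscal_year')['Units_Sold'].cumsum()

From there, a seaborn lineplot per Category with hue set to fiscal_year should give one line per year, as envisioned in the question.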

Related

Produce weekly and quarterly stats from a monthly figure

I have a sample of a table as below:

Customer Ref   Bear Rate  Distance  Month       Revenue
ABA-IFNL-001   1000                 01/01/2022  -135
ABA-IFNL-001   1000                 01/02/2022  -135
ABA-IFNL-001   1000                 01/03/2022  -135
ABA-IFNL-001   1000                 01/04/2022  -135
ABA-IFNL-001   1000                 01/05/2022  -135
ABA-IFNL-001   1000                 01/06/2022  -135
I also have a sample of a calendar table as below:

Date        Year  Week  Quarter  WeekDay  Qtr Start   Qtr End     Week Day
04/11/2022  2022  45    4        Fri      30/09/2022  29/12/2022  1
05/11/2022  2022  45    4        Sat      30/09/2022  29/12/2022  2
06/11/2022  2022  45    4        Sun      30/09/2022  29/12/2022  3
07/11/2022  2022  45    4        Mon      30/09/2022  29/12/2022  4
08/11/2022  2022  45    4        Tue      30/09/2022  29/12/2022  5
09/11/2022  2022  45    4        Wed      30/09/2022  29/12/2022  6
10/11/2022  2022  45    4        Thu      30/09/2022  29/12/2022  7
11/11/2022  2022  46    4        Fri      30/09/2022  29/12/2022  1
12/11/2022  2022  46    4        Sat      30/09/2022  29/12/2022  2
13/11/2022  2022  46    4        Sun      30/09/2022  29/12/2022  3
14/11/2022  2022  46    4        Mon      30/09/2022  29/12/2022  4
15/11/2022  2022  46    4        Tue      30/09/2022  29/12/2022  5
16/11/2022  2022  46    4        Wed      30/09/2022  29/12/2022  6
17/11/2022  2022  46    4        Thu      30/09/2022  29/12/2022  7
How can I join/link the tables to report on revenue over weekly and quarterly periods using the calendar table? I can split the output into two tables if needed, e.g.:

Quarter Starting  31/12/2021  01/04/2022  01/07/2022  30/09/2022
Quarter           1           2           3           4
Revenue           500         400         540         540

Week Date Start   31/12/2021  07/01/2022  14/01/2022  21/01/2022
Week              41          42          43          44
Revenue           33.75       33.75       33.75       33.75

I am using Alteryx for this, but I wouldn't mind an explanation of the possible logic in SQL so I can apply it in the system.
Thanks
Before I get into the answer, you're going to have an issue regarding data integrity. All the revenue data is aggregated at a monthly level, while your quarters start and end on some day within a month.
For example, Q4 starts September 30th (Friday) and ends December 29th (Thursday). You may have a day or two that bleeds from another month into the quarter, which might throw off the data a bit (especially if there's a large amount of revenue during the days that bleed into a quarter).
Additionally, your revenue is aggregated at a monthly level: unless you have more granular data (weekly, or ideally daily), it doesn't make much sense to do a weekly calculation, since you'll probably just be dividing monthly revenue by 4.
That being said - You'll want to use a cross tab feature in alteryx to get the data how you want it. But before you do that, we want to aggregate your data at a quarterly level first.
You can do this with an if statement or some other data cleansing tool (sorry, been a while since I used alteryx). Something like:
# Pseudo code - this won't actually work!
# For determining quarter
if (month) between (30/09/2022,29/12/2022) then 4
where you can derive the logic from your calendar table. Then once you have the quarter, you can join in the Quarter Start date based on your quarter calculation.
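Although the question is about Alteryx/SQL, the join logic can be sketched in pandas (the frames and values below are illustrative, with column names taken from the tables above):

import pandas as pd

# One quarter window, as it would come from the calendar table
quarters = pd.DataFrame({
    'Quarter': [4],
    'Qtr Start': pd.to_datetime(['30/09/2022'], dayfirst=True),
    'Qtr End': pd.to_datetime(['29/12/2022'], dayfirst=True)})

# Monthly revenue rows
revenue = pd.DataFrame({
    'Month': pd.to_datetime(['01/10/2022', '01/11/2022'], dayfirst=True),
    'Revenue': [-135, -135]})

# Range join: keep each (month, quarter) pair whose month falls in the window
tagged = revenue.merge(quarters, how='cross')
tagged = tagged[tagged['Month'].between(tagged['Qtr Start'], tagged['Qtr End'])]

# Aggregate to one row per quarter
quarterly = tagged.groupby(['Quarter', 'Qtr Start'], as_index=False)['Revenue'].sum()

In SQL this is the classic "join on a BETWEEN condition, then GROUP BY quarter" pattern.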
Now you have a nice clean table that might look something like this:

Month       Revenue  Quarter  Quarter Start Date
01/01/2022  -135     4        30/09/2022
01/01/2022  -135     4        30/09/2022
Aggregate on your quarter to get a cleaner table:

Quarter Start Date  Quarter  Revenue
30/09/2022          4        300
Then use cross tab, where you pivot on the Quarter Start date.
For SQL, you'd be pivoting the data: essentially taking the value from a row and converting it into a column. It will look a bit janky because the data is so customized, but here's a good question that goes over pivoting - Simple way to transpose columns and rows in SQL?
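The cross-tab step itself can be sketched in pandas on made-up quarterly totals: transpose the long table so each Quarter Start becomes its own column.

import pandas as pd

# Quarterly totals in long form (illustrative values)
quarterly = pd.DataFrame({
    'Quarter Start': ['31/12/2021', '01/04/2022', '01/07/2022', '30/09/2022'],
    'Revenue': [500, 400, 540, 540]})

# One row per metric, one column per quarter start
wide = quarterly.set_index('Quarter Start').T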

Pivot table with Pandas

I have a small issue trying to do a simple pivot with pandas. One column has values that are entered more than once, with a different value in a second column and a year in a third column. What I want to do is get a sum of the second column per year, using the values of the first column as rows.
import pandas as pd

year = 2022
base = pd.read_csv("Database.csv")
raw_monthly = pd.read_csv("Monthly.csv")
raw_types = pd.read_csv("types.csv")

monthly = raw_monthly.assign(Year=year)
ty = raw_types[['cparty', 'sales']]
typ = ty.rename(columns={"Sales": "sales"})
type = typ.assign(Year=year)

fin = pd.concat([base, monthly, type])
fin.drop(fin.tail(1).index, inplace=True)
currentYear = fin.loc[fin['Year'] == 2022]
final = pd.pivot_table(currentYear, index=['cparty', 'sales'], values='sales', aggfunc='sum')
With the above I am getting this result, but what I want is to have the two sales values of '3' for 2022 summed into a single value, so that later I can also break it down by year. Any help appreciated!
Edit: The issue seems to come from the fact that the 3 CSVs are concatenated into a single dataframe. Doing the 3-to-1 CSV conversion manually in Excel and then using the groupby answer works as intended, but it does not work if I try to combine the 3 CSVs into 1 automatically using
fin = pd.concat([base, monthly, type])
The 3 csvs look like this.
Base looks like this:
cparty sales year
0 50969 -146602.14 2016
1 51056 -104626.62 2016
2 51129 -101742.99 2016
3 51036 -81801.84 2016
4 51649 -35992.60 2016
monthly looks like this, missing the year
cparty sales
0 818243 -330,052.47
1 82827 -178,630.85
2 508637 -156,369.87
3 29253 -104,028.30
4 596037 -95,312.07
type is like this.
cparty sales
0 582454 -16,056.46
1 597321 24,336.16
2 567172 20,736.78
3 614070 18,590.45
4 5601295 -3,661.46
What I am attempting to do is add a new column to the last two with the Year set to 2022, so that later I can do the groupby per year. When I try to concat the 3 CSVs, it breaks down.
Suppose cparty is a categorical column:
# create a sales dataframe with year
df = pd.DataFrame({
'year':[2022, 2022, 2018, 2019, 2020, 2021, 2022, 2022, 2022, 2021, 2019, 2018],
'cparty':['cparty1', 'cparty1', 'cparty1', 'cparty2', 'cparty2', 'cparty2', 'cparty2', 'cparty3', 'cparty4', 'cparty4', 'cparty4', 'cparty4'],
'sales':[230, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
})
df
###
year cparty sales
0 2022 cparty1 230
1 2022 cparty1 100
2 2018 cparty1 200
3 2019 cparty2 300
4 2020 cparty2 400
5 2021 cparty2 500
6 2022 cparty2 600
7 2022 cparty3 700
8 2022 cparty4 800
9 2021 cparty4 900
10 2019 cparty4 1000
11 2018 cparty4 1100
output = df.groupby(['year','cparty']).sum()
output
###
sales
year cparty
2018 cparty1 200
cparty4 1100
2019 cparty2 300
cparty4 1000
2020 cparty2 400
2021 cparty2 500
cparty4 900
2022 cparty1 330
cparty2 600
cparty3 700
cparty4 800
Filter by year
final = output.query('year == 2022')
final
###
sales
year cparty
2022 cparty1 330
cparty2 600
cparty3 700
cparty4 800
Have figured out the issue.
result = res.groupby(['Year', 'cparty']).sum()
output = result.query('Year == 2022')
output
##
sales
Year cparty
2022 3 -20409.04
4 12064.34
5 9656.64
8081 51588.55
8099 5625.22
... ...
Baron's groupby method was the way to go. The issue is that it only works if I have all the data in one CSV from the beginning. I was trying to add the year manually for the two new CSVs that I concat to the base, setting Year = 2022. The errors come when I concat the 3 different CSVs. If I don't add Year = 2022 it works, giving this:
cparty sales Year
87174 3 -3.89 2022.0
27 3 -20,405.15 NaN
If I do .fillna(2022) then it won't work as expected.
C:\Users\user\AppData\Local\Temp/ipykernel_14252/1015456002.py:32: FutureWarning: Dropping invalid columns in DataFrameGroupBy.add is deprecated. In a future version, a TypeError will be raised. Before calling .add, select only columns which should be valid for the function.
result = fin.groupby(['Year', 'cparty']).sum()
cparty sales Year
87174 3 -3.89 2022.0
27 3 -20,405.15 2022.0
It adds the year but does not do the sum, so I never get 'cparty' 3, 'sales' -20,409.04, Year 2022.
Any feedback appreciated.
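For what it's worth, the symptoms above (sales values like '-20,405.15' and the FutureWarning about invalid columns) suggest the sales column arrives as strings with thousands separators, which is why the sum misbehaves after the concat. A sketch of cleaning it first, with tiny frames standing in for the real CSVs:

import pandas as pd

# Illustrative stand-ins: base already has a Year, the extra CSV does not
base = pd.DataFrame({'cparty': [3], 'sales': [-3.89], 'Year': [2022]})
extra = pd.DataFrame({'cparty': [3], 'sales': ['-20,405.15']})

fin = pd.concat([base, extra.assign(Year=2022)], ignore_index=True)

# Strip the thousands separators and cast to numeric before aggregating,
# otherwise groupby().sum() drops (or mangles) the string column
fin['sales'] = pd.to_numeric(fin['sales'].astype(str).str.replace(',', ''))
result = fin.groupby(['Year', 'cparty'])['sales'].sum()

With the column numeric, the groupby produces the single summed row per (Year, cparty) that the question asks for.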

How to compare same period of different date ranges in columns in BigQuery standard SQL

I have a hard time figuring out how to compare the same period (e.g. ISO week 48) from different years for a certain metric in different columns. I am new to SQL and haven't fully understood how PARTITION BY works, but I guess I'll need it for my desired output.
How can I sum the data from the "metric" column and compare the same periods of different date ranges (e.g. years) in a table?
current table
date iso_week iso_year metric
2021-12-01 48 2021 1000
2021-11-30 48 2021 850
...
2020-11-28 48 2020 800
2020-11-27 48 2020 950
...
2019-11-27 48 2019 700
2019-11-26 48 2019 820
desired output
iso_week metric_thisYear metric_prevYear metric_prev2Year
48 1850 1750 1520
...
Consider below simple approach
select * from (
select * except(date)
from your_table
)
pivot (sum(metric) as metric for iso_year in (2021, 2020, 2019))
If applied to the sample data in your question, the output is:

iso_week  metric_2021  metric_2020  metric_2019
48        1850         1750         1520

Pandas Convert Year/Month Int Columns to Datetime and Quarter Average

I have data in a df that is separated into a year and month column and I'm trying to find the average of observed data columns. I cannot find online how to convert the 'year' and 'month' columns to datetime and then to find the Q1, Q2, Q3, etc. averages.
year month data
0 2021 1 7.100427005789888
1 2021 2 7.22523237179488
2 2021 3 8.301528122415217
3 2021 4 6.843885683760697
4 2021 5 6.12365177832918
5 2021 6 6.049659188034206
6 2021 7 5.271174524400343
7 2021 8 5.098493589743587
8 2021 9 6.260155982906011
I need the final data to look like -
year Quarter Q data
2021 1 7.542395833
2021 2 6.33906555
2021 3 5.543274699
I've tried variations of this to change the 'year' and 'month' columns to datetime, but it gives a long date starting with year 1970:
df.iloc[:, 1:2] = df.iloc[:, 1:2].apply(pd.to_datetime)
year month wind_speed_ms
0 2021 1970-01-01 00:00:00.000000001 7.100427
1 2021 1970-01-01 00:00:00.000000002 7.225232
2 2021 1970-01-01 00:00:00.000000003 8.301528
3 2021 1970-01-01 00:00:00.000000004 6.843886
4 2021 1970-01-01 00:00:00.000000005 6.123652
5 2021 1970-01-01 00:00:00.000000006 6.049659
6 2021 1970-01-01 00:00:00.000000007 5.271175
7 2021 1970-01-01 00:00:00.000000008 5.098494
8 2021 1970-01-01 00:00:00.000000009 6.260156
Thank you,
I hope this will work for you:
# create a period column combining the year and month columns
df["period"] = df.apply(lambda x: f"{int(x.year)}-{int(x.month)}", axis=1).apply(pd.to_datetime).dt.to_period('Q')
# group by period and take the mean
df = df.groupby("period").mean().reset_index()
df["Quarter"] = df.period.astype(str).str[-2:]
df = df[["year", "Quarter", "data"]]
df.rename(columns={"data": "Q data"})
year Quarter Q data
0 2021.0 Q1 7.542396
1 2021.0 Q2 6.339066
2 2021.0 Q3 5.543275
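An alternative sketch that avoids string formatting: pd.to_datetime accepts a frame of year/month/day columns directly, so the period column can be built without apply.

import pandas as pd

df = pd.DataFrame({
    'year': [2021] * 9,
    'month': list(range(1, 10)),
    'data': [7.100427, 7.225232, 8.301528, 6.843886, 6.123652,
             6.049659, 5.271175, 5.098494, 6.260156]})

# Build a real datetime from the year/month columns, then bucket by quarter
df['period'] = pd.to_datetime(df[['year', 'month']].assign(day=1)).dt.to_period('Q')
out = df.groupby('period', as_index=False)['data'].mean()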

pandas groupby and filling in missing frequencies

I have a dataset of events, each of which occurred on a specific day. Using pandas I have been able to aggregate these into a count of events per month using the groupby function, and then plot a graph with Matplotlib. However, in the original dataset some months do not have any events, so there is no count for those months. Such months therefore do not appear on the graph, but I would like to include them somehow with their zero count.
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count()
which produces
month_year month
2016-01 January 9
2016-02 February 7
2016-04 April 1
2016-06 June 4
2016-07 July 1
2016-08 August 3
2016-09 September 2
2016-10 October 5
2016-11 November 17
2016-12 December 3
I have been trying to find a way of filling the missing months in the dataframe generated by the groupby function with a 'count' value of 0 for, in this example, March and May.
Can anyone offer some advice on how this might be achieved? I have been trying to carry out an ffill on the month column but with little success, and I can't work out how to add a corresponding zero value for the missing months.
First of all, if bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count() is your code, then it is a series. So let's change it to a dataframe with bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index(). Now, on to the problem.
Convert to a date format, use pd.Grouper to fill in the missing months, then convert back to string format. Also add back the month column and cast the event_no column back to integer:
bpm2 = df2_yr1.groupby(['month_year', 'month'])['event_no'].count().reset_index()
bpm2['month_year'] = bpm2['month_year'].astype(str)
bpm2['month_year'] = pd.to_datetime(bpm2['month_year'])
bpm2 = bpm2.groupby([pd.Grouper(key='month_year', freq='1M')])['event_no'].first().fillna(0).astype(int).reset_index()
bpm2['month'] = bpm2['month_year'].dt.strftime('%B')
bpm2['month_year'] = bpm2['month_year'].dt.strftime('%Y-%m')
bpm2
output:
month_year event_no month
0 2016-01 9 January
1 2016-02 7 February
2 2016-03 0 March
3 2016-04 1 April
4 2016-05 0 May
5 2016-06 4 June
6 2016-07 1 July
7 2016-08 3 August
8 2016-09 2 September
9 2016-10 5 October
10 2016-11 17 November
11 2016-12 3 December
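An alternative sketch of the same fix: reindex the counted series against a full pd.period_range so the absent months appear with an explicit zero (a trimmed-down series stands in for the groupby result here).

import pandas as pd

# Counts as produced by the groupby, with March and May missing
counts = pd.Series(
    [9, 7, 1, 4],
    index=pd.PeriodIndex(['2016-01', '2016-02', '2016-04', '2016-06'], freq='M'),
    name='event_no')

# Reindex against every month in the span; missing months get 0
full = pd.period_range('2016-01', '2016-06', freq='M')
counts = counts.reindex(full, fill_value=0)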