Pivot table with Pandas - pandas

I have a small issue trying to do a simple pivot with pandas. I have on one column some values that are entered more than once with a different value in a second column and a year on a third column. What i want to do is get a sum of the second column for the year, using as rows the values on the first column.
import pandas as pd
year = 2022
base = pd.read_csv("Database.csv")
raw_monthly = pd.read_csv("Monthly.csv")
raw_types = pd.read_csv("types.csv")
monthly = raw_monthly.assign(Year= year)
ty= raw_types[['cparty', 'sales']]
typ= sec.rename(columns={"Sales": "sales"})
type= typ.assign(Year=year)
fin = pd.concat([base, monthly, type])
fin.drop(fin.tail(1).index,inplace=True)
currentYear = fin.loc[fin['Year'] == 2022]
final = pd.pivot_table(currentYear, index=['cparty', 'sales'], values='sales', aggfunc='sum')
With the above, I am getting this result, but what i want is to have
the 2 sales values of '3' for 2022 summed in a single value so later i can also break it down by year. Any help appreciated!
Edit: The issue seems to come from the fact that the 3 csvs are concatenated into a single dataframe. Doing the 3->1 CSV conversion manually in excel and then trying to use the Groupby answer works as intended, but it does not work if i try to automatically make the 3 CVS to 1 using the
fin = pd.concat([base, monthly, type])
The 3 csvs look like this.
Base looks like this:
cparty sales year
0 50969 -146602.14 2016
1 51056 -104626.62 2016
2 51129 -101742.99 2016
3 51036 -81801.84 2016
4 51649 -35992.60 2016
monthly looks like this, missing the year
cparty sales
0 818243 -330,052.47
1 82827 -178,630.85
2 508637 -156,369.87
3 29253 -104,028.30
4 596037 -95,312.07
type is like this.
cparty sales
0 582454 -16,056.46
1 597321 24,336.16
2 567172 20,736.78
3 614070 18,590.45
4 5601295 -3,661.46
What i am attempting to do is add a new column for the last 2 to have the Year set as 2022, so that later i can do the groupby per year. When i try to concat the 3 csvs, it breaks down.

Suppose cparty is a categorical metric
# create sales and retail dataframes with year
df = pd.DataFrame({
'year':[2022, 2022, 2018, 2019, 2020, 2021, 2022, 2022, 2022, 2021, 2019, 2018],
'cparty':['cparty1', 'cparty1', 'cparty1', 'cparty2', 'cparty2', 'cparty2', 'cparty2', 'cparty3', 'cparty4', 'cparty4', 'cparty4', 'cparty4'],
'sales':[230, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000, 1100]
})
df
###
year cparty sales
0 2022 cparty1 230
1 2022 cparty1 100
2 2018 cparty1 200
3 2019 cparty2 300
4 2020 cparty2 400
5 2021 cparty2 500
6 2022 cparty2 600
7 2022 cparty3 700
8 2022 cparty4 800
9 2021 cparty4 900
10 2019 cparty4 1000
11 2018 cparty4 1100
output = df.groupby(['year','cparty']).sum()
output
###
sales
year cparty
2018 cparty1 200
cparty4 1100
2019 cparty2 300
cparty4 1000
2020 cparty2 400
2021 cparty2 500
cparty4 900
2022 cparty1 330
cparty2 600
cparty3 700
cparty4 800
Filter by year
final = output.query('year == 2022')
final
###
sales
year cparty
2022 cparty1 330
cparty2 600
cparty3 700
cparty4 800

Have figured out the issue.
result = res.groupby(['Year', 'cparty']).sum()
output = result.query('Year == 2022')
output
##
sales
Year cparty
2022 3 -20409.04
4 12064.34
5 9656.64
8081 51588.55
8099 5625.22
... ...
Baron's groupby method was the way to go. The issue is that it only works if I have all the data in 1 csv from the beginning. I was trying to add the year manually for the 2 new csv that i concat to the base, setting Year = 2022. The errors come when i concat the 3 different CSVs. If i don't add the year = 2022 it works giving this:
cparty sales Year
87174 3 -3.89 2022.0
27 3 -20,405.15 NaN
If i do .fillna(2022) then it won't work as expected.
C:\Users\user\AppData\Local\Temp/ipykernel_14252/1015456002.py:32: FutureWarning: Dropping invalid columns in DataFrameGroupBy.add is deprecated. In a future version, a TypeError will be raised. Before calling .add, select only columns which should be valid for the function.
result = fin.groupby(['Year', 'cparty']).sum()
cparty sales Year
87174 3 -3.89 2022.0
27 3 -20,405.15 2022.0
Adding the year but not doing the sum to have 'cparty' 3, 'sales' -20,409.04, Year 2022.
Any feedback appreciated.

Related

Produce weekly and quarterly stats from a monthly figure

I have a sample of a table as below:
Customer Ref
Bear Rate
Distance
Month
Revenue
ABA-IFNL-001
1000
01/01/2022
-135
ABA-IFNL-001
1000
01/02/2022
-135
ABA-IFNL-001
1000
01/03/2022
-135
ABA-IFNL-001
1000
01/04/2022
-135
ABA-IFNL-001
1000
01/05/2022
-135
ABA-IFNL-001
1000
01/06/2022
-135
I also have a sample of a calendar table as below:
Date
Year
Week
Quarter
WeekDay
Qtr Start
Qtr End
Week Day
04/11/2022
2022
45
4
Fri
30/09/2022
29/12/2022
1
05/11/2022
2022
45
4
Sat
30/09/2022
29/12/2022
2
06/11/2022
2022
45
4
Sun
30/09/2022
29/12/2022
3
07/11/2022
2022
45
4
Mon
30/09/2022
29/12/2022
4
08/11/2022
2022
45
4
Tue
30/09/2022
29/12/2022
5
09/11/2022
2022
45
4
Wed
30/09/2022
29/12/2022
6
10/11/2022
2022
45
4
Thu
30/09/2022
29/12/2022
7
11/11/2022
2022
46
4
Fri
30/09/2022
29/12/2022
1
12/11/2022
2022
46
4
Sat
30/09/2022
29/12/2022
2
13/11/2022
2022
46
4
Sun
30/09/2022
29/12/2022
3
14/11/2022
2022
46
4
Mon
30/09/2022
29/12/2022
4
15/11/2022
2022
46
4
Tue
30/09/2022
29/12/2022
5
16/11/2022
2022
46
4
Wed
30/09/2022
29/12/2022
6
17/11/2022
2022
46
4
Thu
30/09/2022
29/12/2022
7
How can I join/link the tables to report on revenue over weekly and quarterly periods using the calendar table? I can put into two tables if needed as an output eg:
Quarter Starting
31/12/2021
01/04/2022
01/07/2022
30/09/2022
Quarter
1
2
3
4
Revenue
500
400
540
540
Week Date Start
31/12/2021
07/01/2022
14/01/2022
21/01/2022
Week
41
42
43
44
Revenue
33.75
33.75
33.75
33.75
I am using alteryx for this but wouldnt mind explaination of possible logic in sql to apply it into the system
Thanks
Before I get into the answer, you're going to have an issue regarding data integrity. All the revenue data is aggregated at a monthly level, where your quarters start and end on someday within the month.
For example - Q4 starts September 30th (Friday) and ends Dec. 29th (Thursday). You may have a day or two that bleeds from another month into the quarters which might throw off the data a bit (esp. if there's a large amount of revenue during the days that bleed into a quarter.
Additionally, your revenue is aggregated at a monthly level - unless you have more granular data (weekly, daily would be best), it doesn't make sense to do a weekly calculation since you'll probably just be dividing revenue by 4.
That being said - You'll want to use a cross tab feature in alteryx to get the data how you want it. But before you do that, we want to aggregate your data at a quarterly level first.
You can do this with an if statement or some other data cleansing tool (sorry, been a while since I used alteryx). Something like:
# Pseudo code - this won't actually work!
# For determining quarter
if (month) between (30/09/2022,29/12/2022) then 4
where you can derive the logic from your calendar table. Then once you have the quarter, you can join in the Quarter Start date based on your quarter calculation.
Now you have a nice clean table that might look something like this:
Month
Revenue
Quarter
Quarter Start Date
01/01/2022
-135
4
30/09/2022
01/01/2022
-135
4
30/09/2022
Aggregate on your quarter to get a cleaner table
Quarter Start Date
Quarter
revenue
30/09/2022
4
300
Then use cross tab, where you pivot on the Quarter start date.
For SQL, you'd be pivoting the data. Essentially, taking the value from a row of data, and converting it into a column. It will look a bit janky because the data is so customized, but here's a good question that goes over pivioting - Simple way to transpose columns and rows in SQL?

How to compare same period of different date ranges in columns in BigQuery standard SQL

i have a hard time figuring out how to compare the same period (e.g. iso week 48) from different years for a certain metric in different columns. I am new to SQL and haven't fully understand how PARTITION BY works but guess that i'll need it for my desired output.
How can i sum the data from column "metric" and compare same periods of different date ranges (e.g. YEAR) in a table?
current table
date iso_week iso_year metric
2021-12-01 48 2021 1000
2021-11-30 48 2021 850
...
2020-11-28 48 2020 800
2020-11-27 48 2020 950
...
2019-11-27 48 2019 700
2019-11-26 48 2019 820
desired output
iso_week metric_thisYear metric_prevYear metric_prev2Year
48 1850 1750 1520
...
Consider below simple approach
select * from (
select * except(date)
from your_table
)
pivot (sum(metric) as metric for iso_year in (2021, 2020, 2019))
if applied to sample data in your question - output is

Add new column based on groupby map in pandas [duplicate]

This question already has answers here:
Remap values in pandas column with a dict, preserve NaNs
(11 answers)
Closed 2 years ago.
I have a df as shown below
product bought_date number_of_sales
A 2016 15
A 2017 10
A 2018 15
B 2016 20
B 2017 30
B 2018 20
C 2016 20
C 2017 30
C 2018 20
From the above I would like to add one column called cost_per_unit as shown below.
cost of product A is 100, B is 500 and C is 200
d1 = {'A':100, 'B':500, 'C':'200'}
Expected Output:
product bought_date number_of_sales cost_per_unit
A 2016 15 100
A 2017 10 100
A 2018 15 100
B 2016 20 500
B 2017 30 500
B 2018 20 500
C 2016 20 200
C 2017 30 200
C 2018 20 200
No need for any lambda function. Run just:
df['cost_per_unit'] = df['product'].map(d1)
Additional remark: product is a name of a Pandas function. You should avoid
column names "covering" existing functions or attributes.
It is a good habit, that they should differ, at least in char case.
You can try this:
df['cost_per_unit'] = df.apply(lambda x: d1[x['product']], axis=1)
print(df)
product bought_date number_of_sales cost_per_unit
0 A 2016 15 100
1 A 2017 10 100
2 A 2018 15 100
3 B 2016 20 500
4 B 2017 30 500
5 B 2018 20 500
6 C 2016 20 200
7 C 2017 30 200
8 C 2018 20 200

Replacing Pandas Dataframe Header with Date column but in ascending order

I have a Dataframe of 5k+ rows that looks like this. It has Date column which has Month/Year format. The Date column is in string format.
Name Date Friends
A June 2017 100
A April 2017 45
A March 2016 180
B June 2017 43
B April 2017 23
B March 2016 23
C June 2017 64
C April 2017 643
C March 2016 344
I want to format in the following way, which makes unique values from Date Column into headers. But in the ascending order according to Month/Year.
Name March 2016 April 2017 June 2017
A 180 45 100
B 23 23 43
C 344 643 64
I tried using the Pandas function - Pivot.
df=df.pivot(index='Name',columns='Date',values='Friends')
But this doesn't sort the month/year in ascending order but instead it does in alphabetically order. Also Pivot transforms the dataframe in Stacked format.
Any ideas on how to achieve the desired format?
Something like this,
df['Date']=pd.to_datetime(df['Date'])
df=df.sort_values(['Date'], ascending=False)
df.groupby(['Name', 'Date'], sort=False)['Friends'].sum().unstack('Date')

Compare Cumulative Sales per Year-End

Using this sample dataframe:
np.random.seed(1111)
df = pd.DataFrame({
'Category':np.random.choice( ['Group A','Group B','Group C','Group D'], 10000),
'Sub-Category':np.random.choice( ['X','Y','Z'], 10000),
'Sub-Category-2':np.random.choice( ['G','F','I'], 10000),
'Product':np.random.choice( ['Product 1','Product 2','Product 3'], 10000),
'Units_Sold':np.random.randint(1,100, size=(10000)),
'Dollars_Sold':np.random.randint(100,1000, size=10000),
'Customer':np.random.choice(pd.util.testing.rands_array(10,25,dtype='str'),10000),
'Date':np.random.choice( pd.date_range('1/1/2016','12/31/2020',
freq='M'), 10000)})
I am trying to compare 12 month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period for each year. I cannot wrap my head around how to get a running total of data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples - I'd like to be able to compare different year-end data points, even better would be able to compare 365 days. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date':pd.date_range('2019-01-01', '2019-12-31', freq='M'),
'Values':np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.