pandas pivot table how to rearrange columns - pandas

I have a pandas df which I am looking to build a pivot table with.
Here is a sample table
Name Week Category Amount
ABC 1 Clothing 50
ABC 1 Food 10
ABC 1 Food 10
ABC 1 Auto 20
DEF 1 Food 10
DEF 1 Services 20
The pivot table I am looking to create should sum up the amounts per Name, per Week, per Category.
Essentially, I am looking to end up with a table as follows:
Name Week Clothing Food Auto Services Total
ABC 1 50 20 20 0 90
DEF 1 0 10 0 20 30
If a user has no value for a category in a particular week, I take it as 0.
And the total is the row sum.
I tried some of the options mentioned at https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html but couldn't get it to work. Any thoughts on how I can achieve this? I used
df.pivot_table(values=['Amount'], index=['Name','Week','Category'], aggfunc=[np.sum]) followed by df.unstack() but that did not yield the desired result as both Week and Category got unstacked.
Thanks!

df_pvt = pd.pivot_table(df, values='Amount', index=['Name', 'Week'], columns='Category', aggfunc=np.sum, margins=True, margins_name='Total', fill_value=0)
df_pvt.columns.name = None
df_pvt = df_pvt.reset_index()
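Note that pandas sorts the pivoted Category columns alphabetically, so if you want the Clothing/Food/Auto/Services order from the question, reorder them explicitly as a final step. A minimal sketch, assuming the columns come out named exactly as in the sample data:
df_pvt = df_pvt[['Name', 'Week', 'Clothing', 'Food', 'Auto', 'Services', 'Total']]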

Let us try crosstab
out = pd.crosstab(index=[df['Name'], df['Week']],
                  columns=df['Category'],
                  values=df['Amount'],
                  margins=True,
                  aggfunc='sum').fillna(0).iloc[:-1].reset_index()
Category Name Week Auto Clothing Food Services All
0 ABC 1 20.0 50.0 20.0 0.0 90
1 DEF 1 0.0 0.0 10.0 20.0 30
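To match the desired table exactly, the margins column that crosstab names 'All' can be renamed and the filled values cast back to integers; a sketch based on the output above:
out = out.rename(columns={'All': 'Total'})
out[['Auto', 'Clothing', 'Food', 'Services']] = out[['Auto', 'Clothing', 'Food', 'Services']].astype(int)
out.columns.name = None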

Related

How do I count values grouping by month and year (YYYYMM) in a pandas dataframe?

Example:
year_month = ['201801','201801','201801','201801','201801','201802','201802','201802','201802','201802']
Services = ['23','67','23','67','23','23','23','4','4','67']
df = list(zip(year_month, Services))
df = pd.DataFrame(df, columns=['Date', 'Services'])
Help me!
My date column is already in the right format, and I already have the YYYYMM column from it.
I tried something like:
df2 = df.loc[:, ['YYYYMM', 'Services']]
df2 = df.groupby(['YYYYMM']).count().reset_index()
EXPECTED OUTPUT
Quantity of services per month/year.
year_month 4 23 67
201801 0 3 2
201802 2 2 1
out = df.groupby('Date', as_index=False).count()
out
Date Services
0 201801 5
1 201802 5
Update
Now that the desired output is clear:
pd.crosstab(df['Date'], df['Services']).sort_index(axis=1, key=lambda x: x.astype('int'))
Services 4 23 67
Date
201801 0 3 2
201802 2 2 1
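An equivalent without crosstab, as a sketch on the same df: count with groupby/size and pivot the Services values out with unstack:
out = df.groupby(['Date', 'Services']).size().unstack(fill_value=0)
out = out[sorted(out.columns, key=int)]  # order the service codes numerically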

Pandas - Move data in one column to the same row in a different column

I have a df which looks like the below. There are 2 quantity columns, and I want to move the quantities in the "QTY 2" column to the "QTY" column.
Note: there are no instances where there are values in the same row for both columns (so for each row, either QTY is populated or QTY 2 is populated, not both).
DF
Index  Product   QTY  QTY 2
0      Shoes     5
1      Jumpers        10
2      T Shirts       15
3      Shorts    13
Desired Output
Index  Product   QTY
0      Shoes     5
1      Jumpers   10
2      T Shirts  15
3      Shorts    13
Thanks
Try this:
import numpy as np
df['QTY'] = np.where(df['QTY'].isnull(), df['QTY 2'], df['QTY'])
df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
filling the gaps of QTY with QTY 2:
In [254]: df
Out[254]:
Index Product QTY QTY 2
0 0 Shoes 5.0 NaN
1 1 Jumpers NaN 10.0
2 2 T Shirts NaN 15.0
3 3 Shorts 13.0 NaN
In [255]: df["QTY"] = df["QTY"].fillna(df["QTY 2"], downcast="infer")
In [256]: df
Out[256]:
Index Product QTY QTY 2
0 0 Shoes 5 NaN
1 1 Jumpers 10 10.0
2 2 T Shirts 15 15.0
3 3 Shorts 13 NaN
downcast="infer" makes it "these look like integer after NaNs gone, so make the type integer".
you can drop QTY 2 after this with df = df.drop(columns="QTY 2"). If you want one-line is as usual possible:
df = (df.assign(QTY=df["QTY"].fillna(df["QTY 2"], downcast="infer"))
.drop(columns="QTY 2"))
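An alternative sketch that avoids the downcast argument entirely (assuming the blanks are NaN): combine_first keeps QTY where present and falls back to QTY 2, and convert_dtypes re-infers an integer dtype:
df["QTY"] = df["QTY"].combine_first(df["QTY 2"]).convert_dtypes()
df = df.drop(columns="QTY 2")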
You can do (I am assuming your empty values are empty strings):
df = (df.assign(QTY=df[['QTY', 'QTY 2']]
                  .replace('', 0)
                  .sum(axis=1))
        .drop('QTY 2', axis=1))
print(df):
Product QTY
0 Shoes 5
1 Jumpers 10
2 T Shirts 15
3 Shorts 13
If the empty values are actually NaNs, then
df['QTY'] = df['QTY'].fillna(df['QTY 2'])  # or
df['QTY'] = df[['QTY', 'QTY 2']].sum(axis=1)

Pandas groupby and rolling window

I'm trying to calculate the sum of one field for a specific period of time, after a grouping function is applied.
My dataset look like this:
Date Company Country Sold
01.01.2020 A BE 1
02.01.2020 A BE 0
03.01.2020 A BE 1
03.01.2020 A BE 1
04.01.2020 A BE 1
05.01.2020 B DE 1
06.01.2020 B DE 0
I would like to add a new column to each row that calculates the sum of Sold (per group "Company, Country") for the last 7 days, not including the current day:
Date Company Country Sold LastWeek_Count
01.01.2020 A BE 1 0
02.01.2020 A BE 0 1
03.01.2020 A BE 1 1
03.01.2020 A BE 1 1
04.01.2020 A BE 1 3
05.01.2020 B DE 1 0
06.01.2020 B DE 0 1
I tried the following, but it also includes the current date, and it gives different values for the same date, e.g. 03.01.2020:
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(7, on ='Date')['Sold'].sum().reset_index()
Is there a built-in function in pandas that I can use to perform these calculations?
You can use a .rolling window of 8 and then subtract the sum of the Date (for each grouped row) to effectively get the previous 7 days. For this sample data, we should also pass min_periods=1 (otherwise you will get NaN values, but for your actual dataset, you will need to decide what you want to do with windows that are < 8).
Then from the .rolling window of 8, simply do another .groupby of the relevant columns but also include Date this time, and take the max value of the newly created LastWeek_Count column. You need to take the max, because you have multiple records per day, so by taking the max, you are taking the total aggregated amount per Date.
Then, create a series with the grouped sum per Date. In the final step, subtract that sum by date from the rolling-window max, which is a workaround for getting the sum of the previous 7 days, as there is no offset parameter for .rolling here:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['LastWeek_Count'] = df.groupby(['Company', 'Country']).rolling(8, min_periods=1, on='Date')['Sold'].sum().reset_index()['Sold']
df['LastWeek_Count'] = df.groupby(['Company', 'Country', 'Date'])['LastWeek_Count'].transform('max')
s = df.groupby(['Company', 'Country', 'Date'])['Sold'].transform('sum')
df['LastWeek_Count'] = (df['LastWeek_Count']-s).astype(int)
Out[17]:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0
1 2020-01-02 A BE 0 1
2 2020-01-03 A BE 1 1
3 2020-01-03 A BE 1 1
4 2020-01-04 A BE 1 3
5 2020-01-05 B DE 1 0
6 2020-01-06 B DE 0 1
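A sketch of an alternative, assuming a pandas version where offset windows accept closed=: aggregate per day first, then use a '7D' window with closed='left' so the current day is excluded, and merge the result back onto the original df:
daily = df.groupby(['Company', 'Country', 'Date'], as_index=False)['Sold'].sum()
daily['LastWeek_Count'] = (daily.groupby(['Company', 'Country'])
                                .rolling('7D', on='Date', closed='left')['Sold']
                                .sum()
                                .reset_index(drop=True)
                                .fillna(0)   # first row of each group has an empty window
                                .astype(int))
df = df.merge(daily.drop(columns='Sold'), on=['Company', 'Country', 'Date'])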
One way would be to first consolidate the Sold value of each group (['Date', 'Company', 'Country']) on a single line using a temporary DF.
After that, apply your .groupby with .rolling with an interval of 8 rows.
After calculating the sum, subtract the Sold value of each line from it and add the new column to the original DF with .merge:
#convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'], format='%d.%m.%Y')
#create a temporary DataFrame
df2 = df.groupby(['Date', 'Company', 'Country'])['Sold'].sum().reset_index()
#calc the lastweek
df2['LastWeek_Count'] = (df2.groupby(['Company', 'Country'])
                            .rolling(8, min_periods=1, on='Date')['Sold']
                            .sum().reset_index(drop=True)
                         )
#subtract the value of 'lastweek' from the current 'Sold'
df2['LastWeek_Count'] = df2['LastWeek_Count'] - df2['Sold']
#add the new column to the original DF
df.merge(df2.drop(columns=['Sold']), on=['Date', 'Company', 'Country'])
#output:
Date Company Country Sold LastWeek_Count
0 2020-01-01 A BE 1 0.0
1 2020-01-02 A BE 0 1.0
2 2020-01-03 A BE 1 1.0
3 2020-01-03 A BE 1 1.0
4 2020-01-04 A BE 1 3.0
5 2020-01-05 B DE 1 0.0
6 2020-01-06 B DE 0 1.0

How to apply different aggregate functions to different columns in pandas?

I have a dataframe with many columns in it; some of them contain prices and the rest contain volumes, as below:
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
1990-01 2 10 3 30
1990-01 2 20 2 40
1990-02 2 30 3 50
I need to group by year_month and take the mean of the price columns and the sum of the volume columns.
Is there any quick way to do this in one statement, e.g. take the average if the column name contains 'price' and the sum if it contains 'volume'?
df.groupby('year_month').?
Note: this is just sample data with fewer columns, but the format is similar.
output
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
1990-01 2 30 2.5 70
1990-02 2 30 3 50
Create a dictionary from the matched column names and pass it to DataFrameGroupBy.agg; finally, add reindex in case the order of the output columns has changed:
d1 = dict.fromkeys(df.columns[df.columns.str.contains('price')], 'mean')
d2 = dict.fromkeys(df.columns[df.columns.str.contains('volume')], 'sum')
#merge dicts together
d = {**d1, **d2}
print (d)
{'0_fx_price_gy': 'mean', '1_fx_price_yuy': 'mean',
'0_fx_volume_gy': 'sum', '1_fx_volume_yuy': 'sum'}
Another way to build the dictionary:
d = {}
for c in df.columns:
    if 'price' in c:
        d[c] = 'mean'
    if 'volume' in c:
        d[c] = 'sum'
And the solution simplifies to a dict comprehension if, after the first column is filtered out with df.columns[1:], only price and volume columns remain:
d = {x:'mean' if 'price' in x else 'sum' for x in df.columns[1:]}
df1 = df.groupby('year_month', as_index=False).agg(d).reindex(columns=df.columns)
print (df1)
year_month 0_fx_price_gy 0_fx_volume_gy 1_fx_price_yuy 1_fx_volume_yuy
  year_month  0_fx_price_gy  0_fx_volume_gy  1_fx_price_yuy  1_fx_volume_yuy
0    1990-01            2.0              30             2.5               70
1    1990-02            2.0              30             3.0               50
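A sketch of an alternative that avoids building the dictionary by hand: select the price and volume columns from the groupby separately, aggregate each, and concatenate (the 'price'/'volume' substrings are taken from the sample column names above):
g = df.groupby('year_month')
price_cols = [c for c in df.columns if 'price' in c]
volume_cols = [c for c in df.columns if 'volume' in c]
df1 = pd.concat([g[price_cols].mean(), g[volume_cols].sum()], axis=1)
df1 = df1.reindex(columns=df.columns[1:]).reset_index()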

How to rename the column of an intermediate result?

I'm calculating an average by first getting the number of months and then dividing the number of records by that number, like this:
monthly = tables[SUB_ACCT_DOC_ACC_MTHLY_SUM]
num_months = monthly.clndr_yr_month.unique().size
df = (monthly[["sub_acct_id", "clndr_yr_month"]].groupby(["sub_acct_id"]).size() / num_months).reset_index("sub_acct_id")
df.head(5)
What I get is
sub_acct_id 0
0 12716D 242.0
1 12716G 241.5
2 12716K 165.0
3 12716N 92.5
4 12716R 156.5
but how can I rename the new column to e.g. "avg"?
sub_acct_id avg
0 12716D 242.0
1 12716G 241.5
2 12716K 165.0
3 12716N 92.5
4 12716R 156.5
You can access the names with the columns attribute of the dataframe:
df.columns = ['sub_acct_id','avg']
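Alternatively, name the Series before reset_index so the column comes out as "avg" directly; a sketch reusing the names from the question:
df = (monthly[["sub_acct_id", "clndr_yr_month"]]
      .groupby(["sub_acct_id"]).size()
      .div(num_months)
      .rename("avg")
      .reset_index())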