Pandas - styling with multi-index (from pivot)

As a result of a pivot table I get the following table.
Entry table
I would like to apply a style with criteria depending on the product:
when eggs are sold, I would like a red background when sales are lower than 15
when chicken is sold, I would like a red background when sales are lower than 18
Expected result
I am able to do it by converting to records (df.to_records()) and creating a single index, but I lose the nice layout that allows comparing the evolution between months. Note that the number of months is variable (2 in the example, but it can be more).
The entry file is read from Excel:
import pandas as pd
df = pd.read_excel('xxx.xlsx',sheet_name='Sheet')
The result:
product Month sales revenue
0 eggs 2021-05-01 10 8.0
1 chicken 2021-05-01 15 12.0
2 chicken 2021-05-01 17 15.0
3 eggs 2021-05-01 20 15.0
4 eggs 2021-06-01 11 8.5
5 chicken 2021-06-01 14 12.0
6 chicken 2021-06-01 18 15.0
7 eggs 2021-06-01 22 17.0
Pivoting it and then swapping the column levels:
df2 = pd.pivot_table(df, columns=['Month'], index=['product'], values=['sales', 'revenue'], aggfunc='sum')
df3 = df2.swaplevel(0, 1, axis=1).sort_index(axis=1, level='Month')
Month 2021-05-01 2021-06-01
revenue sales revenue sales
product
chicken 27.0 32 27.0 32
eggs 23.0 30 25.5 33
The df3.to_dict() output:
{(Timestamp('2021-05-01 00:00:00'), 'revenue'): {'chicken': 27.0, 'eggs': 23.0}, (Timestamp('2021-05-01 00:00:00'), 'sales'): {'chicken': 32, 'eggs': 30}, (Timestamp('2021-06-01 00:00:00'), 'revenue'): {'chicken': 27.0, 'eggs': 25.5}, (Timestamp('2021-06-01 00:00:00'), 'sales'): {'chicken': 32, 'eggs': 33}}
Thanks in advance for any help you can provide.
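One possible direction (a sketch, not from the original post; the thresholds dict and the function name are assumptions): build per-product limits from the row index and apply a column-wise style function to just the 'sales' columns, whatever the number of months:

import pandas as pd

thresholds = {'eggs': 15, 'chicken': 18}  # assumed per-product limits

def highlight_low_sales(col):
    # col is one (Month, 'sales') column of df3, indexed by product
    limits = col.index.map(thresholds)
    return ['background-color: red' if value < limit else ''
            for value, limit in zip(col, limits)]

# Style only the 'sales' columns of every month via an IndexSlice,
# so this keeps working for any number of months.
styled = df3.style.apply(highlight_low_sales,
                         subset=pd.IndexSlice[:, pd.IndexSlice[:, 'sales']])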

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so there are many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the market yield on US Treasury securities at 10-year constant maturity, measured daily. Each row represents the yield for a specific day, for each day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the previous 6 months. The result should either be in a new dataframe or in an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])

# Calculate the mean market yield of the previous 6 months. Six months is not
# a fixed length of time, so it is replaced with 180 days here.
tmp = df2.rolling("180D", on="date").mean()

# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back into 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan

# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
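One caveat worth noting (an assumption about the data, not something stated above): an ipodate that falls on a weekend or holiday has no exact match in df2, so the left merge yields NaN for it. A nearest-previous-day join with pd.merge_asof is a possible refinement; both frames must be sorted by their keys:

result = pd.merge_asof(df1.sort_values("ipodate"),
                       tmp.sort_values("date"),
                       left_on="ipodate", right_on="date",
                       direction="backward")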

Compare Cumulative Sales per Year-End

Using this sample dataframe:
import numpy as np
import pandas as pd

np.random.seed(1111)
df = pd.DataFrame({
    'Category': np.random.choice(['Group A', 'Group B', 'Group C', 'Group D'], 10000),
    'Sub-Category': np.random.choice(['X', 'Y', 'Z'], 10000),
    'Sub-Category-2': np.random.choice(['G', 'F', 'I'], 10000),
    'Product': np.random.choice(['Product 1', 'Product 2', 'Product 3'], 10000),
    'Units_Sold': np.random.randint(1, 100, size=10000),
    'Dollars_Sold': np.random.randint(100, 1000, size=10000),
    'Customer': np.random.choice(pd.util.testing.rands_array(10, 25, dtype='str'), 10000),
    'Date': np.random.choice(pd.date_range('1/1/2016', '12/31/2020', freq='M'), 10000)})
I am trying to compare 12-month time frames with seaborn plots for a sub-grouping of category. For example, I'd like to compare the cumulative 12 months for each year ending 4-30 vs. the same time period in other years. I cannot wrap my head around how to get a running total of the data for each respective year (5/1/17-4/30/18, 5/1/18-4/30/19, 5/1/19-4/30/20). The dates are just examples; I'd like to be able to compare different year-end data points, and even better would be the ability to compare arbitrary 365-day windows. For instance, I'd love to compare 3/15/19-3/14/20 to 3/15/18-3/14/19, etc.
I envision a graph for each 'Category' (A,B,C,D) with lines for each respective year representing the running total starting with zero on May 1, building through April 30 of the next year. The x axis would be the month (starting with May 1) & y axis would be 'Units_Sold' as it grows.
Any help would be greatly appreciated!
One way is to convert the date to fiscal quarters and extract the fiscal year:
df = pd.DataFrame({'Date': pd.date_range('2019-01-01', '2019-12-31', freq='M'),
                   'Values': np.arange(12)})
df['fiscal_year'] = df.Date.dt.to_period('Q-APR').dt.qyear
Output:
Date Values fiscal_year
0 2019-01-31 0 2019
1 2019-02-28 1 2019
2 2019-03-31 2 2019
3 2019-04-30 3 2019
4 2019-05-31 4 2020
5 2019-06-30 5 2020
6 2019-07-31 6 2020
7 2019-08-31 7 2020
8 2019-09-30 8 2020
9 2019-10-31 9 2020
10 2019-11-30 10 2020
11 2019-12-31 11 2020
And now you can group by fiscal_year to your heart's content.
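Building on that, a sketch of the running total itself (assuming the sample df from the question; the column names running_units and fy_day are made up here): tag each row with its Q-APR fiscal year, sort by date, take a cumulative sum per Category and fiscal year, and compute a day offset from each May 1 start to use as a shared x axis:

df = df.sort_values('Date')
df['fiscal_year'] = df['Date'].dt.to_period('Q-APR').dt.qyear
# Running total of units within each Category and fiscal year
df['running_units'] = df.groupby(['Category', 'fiscal_year'])['Units_Sold'].cumsum()
# Days since the May 1 start of the fiscal year, usable as a common x axis
fy_start = pd.to_datetime((df['fiscal_year'] - 1).astype(str) + '-05-01')
df['fy_day'] = (df['Date'] - fy_start).dt.days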

How to sum certain values using pandas datetime operations

The headline is not clear. Let me explain.
I have a dataframe like this:
Order Quantity Date Accepted Date Delivered
20 01-05-2010 01-02-2011
10 01-11-2010 01-03-2011
300 01-12-2010 01-04-2011
5 01-03-2011 01-03-2012
20 01-04-2012 01-11-2013
10 01-07-2013 01-12-2014
Basically, I want to create another column that contains the total undelivered quantity as of each row.
Expected output:
Order Quantity Date Accepted Date Delivered Pending Order
20 01-05-2010 01-02-2011 20
10 01-11-2010 01-03-2011 30
300 01-12-2010 01-04-2011 330
5 01-03-2011 01-03-2012 305
20 01-04-2012 01-11-2013 20
10 01-07-2013 01-12-2014 30
Here, I have taken part of your dataframe and tried to get the result.
df = pd.DataFrame({'order': [20, 10, 300, 200],
                   'Date_aceepted': ['01-05-2010', '01-11-2010', '01-12-2010', '01-12-2010'],
                   'Date_delever': ['01-02-2011', '01-03-2011', '01-04-2011', '01-12-2010']})
order Date_aceepted Date_delever
0 20 01-05-2010 01-02-2011
1 10 01-11-2010 01-03-2011
2 300 01-12-2010 01-04-2011
3 200 01-12-2010 01-12-2010
Then I convert Date_aceepted and Date_delever to datetime using pandas:
df['date1'] = pd.to_datetime(df['Date_aceepted'])
df['date2'] = pd.to_datetime(df['Date_delever'])
Then I make a new dataframe in which Date_aceepted and Date_delever are not the same. I assume you only need those rows in your final result.
dff = df[df['date1'] != df['date2']]
You can see that the last row, in which the accepted and delivered dates are the same, has been removed in dff.
order Date_aceepted Date_delever date1 date2
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04
Then I use pandas cumsum to compute the pending orders:
dff['pending'] = dff['order'].cumsum()
and it gives
order Date_aceepted Date_delever date1 date2 pending
0 20 01-05-2010 01-02-2011 2010-01-05 2011-01-02 20
1 10 01-11-2010 01-03-2011 2010-01-11 2011-01-03 30
2 300 01-12-2010 01-04-2011 2010-01-12 2011-01-04 330
The final dataframe has two extra columns that can be dropped if you don't want them in your result.
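Note that the question's expected output also stops counting an order once its delivery date has passed (row 5 drops back to 20, for example). A sketch that reproduces the full expected column, assuming the question's exact column names:

import pandas as pd

df = pd.DataFrame({'Order Quantity': [20, 10, 300, 5, 20, 10],
                   'Date Accepted': ['01-05-2010', '01-11-2010', '01-12-2010',
                                     '01-03-2011', '01-04-2012', '01-07-2013'],
                   'Date Delivered': ['01-02-2011', '01-03-2011', '01-04-2011',
                                      '01-03-2012', '01-11-2013', '01-12-2014']})
df['Date Accepted'] = pd.to_datetime(df['Date Accepted'], dayfirst=True)
df['Date Delivered'] = pd.to_datetime(df['Date Delivered'], dayfirst=True)

def pending_at(when):
    # Total quantity accepted on or before `when` and not yet delivered
    open_orders = (df['Date Accepted'] <= when) & (df['Date Delivered'] > when)
    return df.loc[open_orders, 'Order Quantity'].sum()

df['Pending Order'] = df['Date Accepted'].apply(pending_at)
# Gives 20, 30, 330, 305, 20, 30 -- matching the expected output above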

I want to do some aggregations with the help of the groupby function in pandas

My dataset consists of a date column in 'datetime64[ns]' dtype; it also has a price and a no. of sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price ) of the stock.
( VWAP = sum(price*no.of sales)/sum(no. of sales) )
What I have done so far: created new dataframe columns for month and year using pandas functions.
Now I want the monthly VWAP from this modified dataset, and it should be distinct by year.
For example, March 2016 and March 2017 should have their own separate monthly VWAP values.
Start by defining a function to compute the VWAP for the current month (a group of rows):
def vwap(grp):
    return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both
years from the source data.
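If the per-month VWAP is also wanted back on each original row (just a possible follow-up, not something the question asks for), the per-period result can be mapped onto the rows by month:

monthly_vwap = df.groupby(df.dat.dt.to_period('M')).apply(vwap)
# Look up each row's monthly VWAP by its 'M' period
df['vwap'] = df.dat.dt.to_period('M').map(monthly_vwap)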

Pandas Group By With Running Total

My granny has some strange ideas. Every birthday she takes me shopping.
She has some strict rules. If I buy a present for less than $20, she won't contribute anything. If I spend over $20, she will contribute the amount above $20, up to $30 in total.
So if a present costs $27 she would contribute $7.
That leaves $23 of her contribution for extra presents that birthday; the same rules as above apply to any additional presents.
Once the $30 is spent, there are no more contributions from granny and I must pay the rest myself.
Here is an example table of my 11th, 12th and 13th birthday.
DollarsSpent granny_pays
BirthDayAge PresentNum
11 1 25.00 5.00 -- I used up $5
2 100.00 25.00 -- I used up the last $25
3 10.00 0.00
4 50.00 0.00
12 1 39.00 19.00 -- I used up $19, only $11 left
2 7.00 0.00
3 32.00 11.00 -- I used up the last $11, even though $32 is $12 above the $20 starting point
4 19.00 0.00
13 1 21.00 1.00 -- used up $1
2 27.00 7.00 -- used up $7; $8 used in total, the last $22 never spent
So in pandas I have gotten this far.
import numpy as np
import pandas as pd

granny_wont_pay_first = 20.
granny_limit = 30.
df = pd.DataFrame({'BirthDayAge': ['11', '11', '11', '11', '12', '12', '12', '12', '13', '13'],
                   'PresentNum': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2],
                   'DollarsSpent': [25., 100., 10., 50., 39., 7., 32., 19., 21., 27.]})
df.set_index(['BirthDayAge', 'PresentNum'], inplace=True)
df['granny_pays'] = df['DollarsSpent'] - granny_wont_pay_first
df['granny_limit'] = granny_limit
df['zero'] = 0.0
df['granny_pays'] = df[['granny_pays', 'zero', 'granny_limit']].apply(np.median, axis=1)
df.drop(['granny_limit', 'zero'], axis=1, inplace=True)
print(df.head(len(df)))
And this is the output. Using the median on the 3 numbers is a nice way to work out what granny will contribute.
The problem is that each present is treated in isolation, so my $30 is not correctly eroded from present to present within each BirthDayAge.
DollarsSpent granny_pays
BirthDayAge PresentNum
11 1 25.00 5.00
2 100.00 30.00 -- should be 25.0
3 10.00 0.00
4 50.00 30.00 -- should be 0.0
12 1 39.00 19.00
2 7.00 0.00
3 32.00 12.00 -- should be 11.0
4 19.00 0.00
13 1 21.00 1.00
2 27.00 7.00
Trying to think of a nice pandas way to do this erosion.
Hopefully no loops please.
I don't know if there is a more concise way, but this should work and does avoid loops as requested.
df = df.reset_index()  # BirthDayAge back to a column (it was set as the index above)
df['per_gift'] = df.DollarsSpent - 20
df['per_gift'] = np.where(df.per_gift > 0, df.per_gift, 0)    # floor at zero
df['per_bday'] = df.groupby('BirthDayAge').per_gift.cumsum()
df['per_bday'] = np.where(df.per_bday > 30, 30, df.per_bday)  # cap at $30
df['granny_pays'] = df.groupby('BirthDayAge').per_bday.diff()
df['granny_pays'] = df.granny_pays.fillna(df.per_bday)
Note that 'per_gift' ignores the maximum subsidy of $30 and 'per_bday' is the cumulative subsidy (capped at $30) per 'BirthDayAge'.
BirthDayAge DollarsSpent PresentNum per_gift per_bday granny_pays
0 11 25 1 5 5 5
1 11 100 2 80 30 25
2 11 10 3 0 30 0
3 11 50 4 30 30 0
4 12 39 1 19 19 19
5 12 7 2 0 19 0
6 12 32 3 12 30 11
7 12 19 4 0 30 0
8 13 21 1 1 1 1
9 13 27 2 7 8 7
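A hedged variant of the same approach, swapping the np.where calls for clip(); on this data it produces the same granny_pays column:

per_gift = df['DollarsSpent'].sub(20).clip(lower=0)
per_bday = per_gift.groupby(df['BirthDayAge']).cumsum().clip(upper=30)
df['granny_pays'] = per_bday.groupby(df['BirthDayAge']).diff().fillna(per_bday)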