Is there a way in Pandas to fill in the value according to weight ranges when pivoting the dataframe? I see some answers that set bins, but here the weight ranges vary depending on how the data is entered.
Here's my dataset.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A"],
                   'weight_start': [1,61,161,201,1,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
[image of the pivoted dataframe]
Desired Output:
[image of the desired dataframe]
Since location 2 has a 0 percent discount covering the range 1 to 500, I want it to populate 0 across the weight ranges prescribed for tier 1, service A instead of having its own row.
Edit: Mozway's answer works when there is one service. When I added a second service, the pivoted dataframe no longer grouped the rows as expected.
Here's the new dataset with service B.
import pandas as pd
df = pd.DataFrame({'tier': [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1],
                   'services': ["A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B"],
                   'weight_start': [1,61,161,201,1,1,61,161,201,1,1,81,101,1,61,161,201],
                   'weight_end': [60,160,200,500,500,60,160,200,500,500,80,100,200,60,160,200,500],
                   'location': [1,1,1,1,2,3,3,3,3,1,2,2,2,3,3,3,3],
                   'discount': [70,30,10,0,0,60,20,5,0,50,70,50,10,65,55,45,5]})
pivot_df = df.pivot(index=['tier','services','weight_start','weight_end'],
                    columns='location', values='discount')
display(pivot_df)
Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 NaN 60.0
500 NaN 0.0 NaN
61 160 30.0 NaN 20.0
161 200 10.0 NaN 5.0
201 500 0.0 NaN 0.0
B 1 60 NaN NaN 65.0
80 NaN 70.0 NaN
500 50.0 NaN NaN
61 160 NaN NaN 55.0
81 100 NaN 50.0 NaN
101 200 NaN 10.0 NaN
161 200 NaN NaN 45.0
201 500 NaN NaN 5.0
Desired Output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 60 50 70 65.0
80 50 70.0 55
61 160 50 NaN 55.0
81 100 50 50.0 55
101 200 50 10.0 NaN
161 200 50 10 45.0
201 500 50 NaN 5.0
This will work: set the identifying columns as the index, pivot on location only, then flatten the result back into columns:
data = (df.set_index(['tier','services','weight_start','weight_end'])
          .pivot(columns='location')['discount']
          .reset_index()
          .rename_axis(None, axis=1)
       )
IIUC, you can (temporarily) exclude the columns that contain only 0/NaN and check whether all remaining values in a row are NaN. If so, drop those rows:
# keep only columns with at least one non-zero/non-NaN value,
# then flag rows where every remaining value is NaN
mask = ~pivot_df.loc[:, pivot_df.any()].isna().all(axis=1)
out = pivot_df[mask].fillna(0)
output:
location 1 2 3
tier services weight_start weight_end
1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
per group:
def drop(d):
    # same logic as above, applied within each services group
    mask = ~d.loc[:, d.any()].isna().all(axis=1)
    return d[mask].fillna(0)
out = pivot_df.groupby(['services']).apply(drop)
output:
location 1 2 3
services tier services weight_start weight_end
A 1 A 1 60 70.0 0.0 60.0
61 160 30.0 0.0 20.0
161 200 10.0 0.0 5.0
201 500 0.0 0.0 0.0
B 1 B 1 60 0.0 0.0 65.0
80 0.0 70.0 0.0
500 50.0 0.0 0.0
61 160 0.0 0.0 55.0
81 100 0.0 50.0 0.0
101 200 0.0 10.0 0.0
161 200 0.0 0.0 45.0
201 500 0.0 0.0 5.0
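Note that groupby.apply added services as an extra outer index level in the output above. If that duplication is unwanted, it can be dropped afterwards (a small sketch, assuming the out frame from the snippet above):
# drop the extra group-key level added by groupby.apply
out = out.droplevel(0)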
I have a data frame as shown below
Tenancy_ID Unit_ID End_Date Rental_value
1 A 2012-04-26 10
2 A 2012-08-27 20
3 A 2013-04-27 50
4 A 2014-04-27 40
1 B 2011-06-26 10
2 B 2011-09-27 30
3 B 2013-04-27 60
4 B 2015-04-27 80
From the above, I would like to prepare the data frame below.
Expected Output:
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
A NaN 15 50 40 NaN
B 20 NaN 60 NaN 80
Steps:
Avg_2012 = average rental value in 2012.
Unit_ID A has two contracts in 2012 with rental values 10 and 20, hence the average is 15.
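For reproducibility, the sample frame can be built from the table above (a small sketch; the column dtypes are assumptions):
import pandas as pd
df = pd.DataFrame({
    'Tenancy_ID': [1, 2, 3, 4, 1, 2, 3, 4],
    'Unit_ID': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'End_Date': pd.to_datetime(['2012-04-26', '2012-08-27', '2013-04-27', '2014-04-27',
                                '2011-06-26', '2011-09-27', '2013-04-27', '2015-04-27']),
    'Rental_value': [10, 20, 50, 40, 10, 30, 60, 80],
})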
Use pivot_table directly with Series.dt.year as the columns:
# df['End_Date'] = pd.to_datetime(df['End_Date'])  # needed if End_Date is not already datetime
final = (df.pivot_table('Rental_value', 'Unit_ID', df['End_Date'].dt.year)
           .add_prefix('Avg_').reset_index().rename_axis(None, axis=1))
print(final)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
You can aggregate the averages and reshape with Series.unstack, then change the column names with DataFrame.add_prefix, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df1 = (df.groupby(['Unit_ID', df['End_Date'].dt.year])['Rental_value']
         .mean()
         .unstack()
         .add_prefix('Avg_')
         .reset_index()
         .rename_axis(None, axis=1))
print (df1)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
Following on from expand year values to month in pandas:
I have:
pd.DataFrame({'comp':['a','b'], 'period':['20180331','20171231'],'value':[12,24]})
comp period value
0 a 20180331 12
1 b 20171231 24
and would like to expand this to cover 201701 through 201812 inclusive. The value should be spread over the 12 months ending at the period.
comp yyyymm value
a 201701 na
a 201702 na
...
a 201705 12
a 201706 12
...
a 201803 12
a 201804 na
b 201701 24
...
b 201712 24
b 201801 na
...
Use:
# create monthly periods covering the full required range
r = pd.period_range('2017-01', '2018-12', freq='m')
# convert the period column to monthly periods
df['period'] = pd.to_datetime(df['period']).dt.to_period('m')
# build a MultiIndex with every possible comp/period combination
mux = pd.MultiIndex.from_product([df['comp'], r], names=('comp','period'))
# reindex to add the missing combinations
df = df.set_index(['comp','period'])['value'].reindex(mux).reset_index()
# back-fill up to 11 missing values per group (12 months including the period itself)
df['new'] = df.groupby('comp')['value'].bfill(limit=11)
print (df)
comp period value new
0 a 2017-01 NaN NaN
1 a 2017-02 NaN NaN
2 a 2017-03 NaN NaN
3 a 2017-04 NaN 12.0
4 a 2017-05 NaN 12.0
...
...
10 a 2017-11 NaN 12.0
11 a 2017-12 NaN 12.0
12 a 2018-01 NaN 12.0
13 a 2018-02 NaN 12.0
14 a 2018-03 12.0 12.0
15 a 2018-04 NaN NaN
16 a 2018-05 NaN NaN
17 a 2018-06 NaN NaN
18 a 2018-07 NaN NaN
19 a 2018-08 NaN NaN
20 a 2018-09 NaN NaN
21 a 2018-10 NaN NaN
22 a 2018-11 NaN NaN
23 a 2018-12 NaN NaN
24 b 2017-01 NaN 24.0
25 b 2017-02 NaN 24.0
26 b 2017-03 NaN 24.0
...
...
32 b 2017-09 NaN 24.0
33 b 2017-10 NaN 24.0
34 b 2017-11 NaN 24.0
35 b 2017-12 24.0 24.0
36 b 2018-01 NaN NaN
37 b 2018-02 NaN NaN
38 b 2018-03 NaN NaN
...
...
44 b 2018-09 NaN NaN
45 b 2018-10 NaN NaN
46 b 2018-11 NaN NaN
47 b 2018-12 NaN NaN
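If the yyyymm string layout from the question is needed instead of a period dtype, the period column can be formatted afterwards (a small sketch, assuming the df produced above):
# render the monthly periods as yyyymm strings, e.g. 201701
df['yyyymm'] = df['period'].dt.strftime('%Y%m')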
See if this works:
# populate the full date range and format it as year-month strings
dftime = pd.DataFrame(pd.date_range('20170101','20181231'), columns=['dt']).apply(lambda x: x.dt.strftime('%Y-%m'), axis=1)
# drop the duplicate months produced by the daily range above
dftime = dftime.assign(dt=dftime.dt.drop_duplicates().reset_index(drop=True)).dropna()
# add a matching year-month column to df for merging
df['dt'] = pd.to_datetime(df.period).apply(lambda x: x.strftime('%Y-%m'))
# merge the full month range onto each company's data
target = df.groupby('comp').apply(lambda x: dftime.merge(x[['comp','dt','value']], on='dt', how='left').fillna({'comp': x.comp.unique()[0]})).reset_index(drop=True)
This gives the desired output:
print(target)
dt comp value
0 2017-01 a NaN
1 2017-02 a NaN
2 2017-03 a NaN
3 2017-04 a NaN
4 2017-05 a NaN
5 2017-06 a NaN
6 2017-07 a NaN
and so on.
I am trying to make a new column 'ID' which should assign a unique ID to each run of non-NaN values in the 'Data' column. If non-null values sit right next to each other, they share the same ID. I have provided below what my final Id column should look like as a reference. Could anyone guide me on this?
Id Data
0 NaN
0 NaN
0 NaN
1 54
1 55
0 NaN
0 NaN
2 67
0 NaN
0 NaN
3 33
3 44
3 22
0 NaN
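For reproducibility, the Data column can be built from the table above (a small sketch; the Id column shown is the desired output, so only Data is constructed):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Data': [np.nan, np.nan, np.nan, 54, 55, np.nan, np.nan,
                            67, np.nan, np.nan, 33, 44, 22, np.nan]})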
.groupby the cumsum to get consecutive groups, using where to mask out the NaN rows. .ngroup then assigns the consecutive IDs. This is also possible with rank.
# the cumulative NaN count labels each run of non-null values; mask out the NaN rows
s = df.Data.isnull().cumsum().where(df.Data.notnull())
df['ID'] = df.groupby(s).ngroup()+1
# df['ID'] = s.rank(method='dense').fillna(0).astype(int)
Output:
Data ID
0 NaN 0
1 NaN 0
2 NaN 0
3 54.0 1
4 55.0 1
5 NaN 0
6 NaN 0
7 67.0 2
8 NaN 0
9 NaN 0
10 33.0 3
11 44.0 3
12 22.0 3
13 NaN 0
Using factorize:
# factorize the run labels of the non-null rows to get consecutive IDs
v = pd.factorize(df.Data.isnull().cumsum()[df.Data.notnull()])[0] + 1
df.loc[df.Data.notnull(), 'Newid'] = v
df['Newid'] = df['Newid'].fillna(0)
df
Id Data Newid
0 0 NaN 0.0
1 0 NaN 0.0
2 0 NaN 0.0
3 1 54.0 1.0
4 1 55.0 1.0
5 0 NaN 0.0
6 0 NaN 0.0
7 2 67.0 2.0
8 0 NaN 0.0
9 0 NaN 0.0
10 3 33.0 3.0
11 3 44.0 3.0
12 3 22.0 3.0
13 0 NaN 0.0
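Since fillna leaves Newid as float, it can be cast back to integer if whole numbers are preferred (a minimal sketch, assuming the frame above):
# fillna produced floats (0.0, 1.0, ...); cast to plain integers
df['Newid'] = df['Newid'].astype(int)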