How to subtract the value from the same month last year in pandas?

I have the DataFrame below and I need to subtract the value from the same month of the previous year and store the result in output:
date        value  output
01-01-2012  20     null
01-02-2012  10     null
01-03-2012  40     null
01-06-2012  30     null
01-01-2013  20     0
01-02-2013  30     20
01-02-2014  60     30
01-03-2014  50     null

First create a DatetimeIndex, then subtract a shifted copy of the Series with sub; the shift is by 12 months, where MS means month start:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.set_index('date')
df['new'] = df['value'].sub(df['value'].shift(freq='12MS'))
print (df)
value output new
date
2012-01-01 20 NaN NaN
2012-02-01 10 NaN NaN
2012-03-01 40 NaN NaN
2012-06-01 30 NaN NaN
2013-01-01 20 0.0 0.0
2013-02-01 30 20.0 20.0
2014-02-01 60 30.0 30.0
2014-03-01 50 NaN NaN
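A merge-based sketch gives the same result without relying on index alignment. This is an alternative to the answer above, not part of it, and it assumes date is still a regular column (i.e. run before set_index):
import pandas as pd

df['date'] = pd.to_datetime(df['date'], dayfirst=True)

# Last year's values re-keyed onto this year's dates.
prev = df[['date', 'value']].copy()
prev['date'] = prev['date'] + pd.DateOffset(years=1)

out = df.merge(prev, on='date', how='left', suffixes=('', '_last_year'))
out['new'] = out['value'] - out['value_last_year']
print(out)
Rows with no matching month in the previous year get NaN in new, matching the null entries in the expected output.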

Related

Insert multiple dates at the start of every group in pandas

I have a dataframe with millions of groups. For each group, I am trying to add 3 months of dates (month-end dates) at the top of the group. So if the first observation of a group is December 2019, I want to fill 3 rows before that observation with dates from September 2019 to November 2019. I also want to fill the group column with the relevant group ID; the other columns can remain as null values.
I would like to avoid looping if possible, as this is a very large dataset.
This is my before DataFrame:
import pandas as pd
before = pd.DataFrame({'Group':[1,1,1,1,1,2,2,2,2,2],
'Date':['31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[1.1,1.7,1.9,2.3,1.5,2.8,2,2,2,2]})
This is my after DataFrame
import pandas as pd
import numpy as np
after = pd.DataFrame({'Group':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'Date':['31/07/2018','31/08/2018','30/09/2018','31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','31/12/2000','31/01/2001','28/02/2001','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[np.nan,np.nan,np.nan,1.1,1.7,1.9,2.3,1.5,np.nan,np.nan,np.nan,2.8,2,2,2,2]})
Because processing each group separately cannot be fast when there are many groups, the idea is to take the first row of each Group with DataFrame.drop_duplicates, shift its month back with a date offset, concatenate, and add all the missing dates in between:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
#first and last shifted months - back by 3 and by 1 months
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=1))
df = (pd.concat([df11, df12], sort=False, ignore_index=True)
        .set_index('Date')
        .groupby('Group')
        .resample('M')
        .size()
        .reset_index(name='value')
        .assign(value = np.nan))
print (df)
print (df)
Group Date value
0 1 2018-07-31 NaN
1 1 2018-08-31 NaN
2 1 2018-09-30 NaN
3 2 2000-12-31 NaN
4 2 2001-01-31 NaN
5 2 2001-02-28 NaN
Finally, concatenate with the original and sort:
df = pd.concat([before, df], ignore_index=True).sort_values(['Group','Date'])
print (df)
Group Date value
10 1 2018-07-31 NaN
11 1 2018-08-31 NaN
12 1 2018-09-30 NaN
0 1 2018-10-31 1.1
1 1 2018-11-30 1.7
2 1 2018-12-31 1.9
3 1 2019-01-31 2.3
4 1 2019-02-28 1.5
13 2 2000-12-31 NaN
14 2 2001-01-31 NaN
15 2 2001-02-28 NaN
5 2 2001-03-30 2.8
6 2 2001-04-30 2.0
7 2 2001-05-31 2.0
8 2 2001-06-30 2.0
9 2 2001-07-31 2.0
If only a few new months are needed, you can omit the groupby part:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=2))
df13 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=1))
df = (pd.concat([df11, df12, df13, before], ignore_index=True, sort=False)
        .sort_values(['Group','Date']))
print (df)
print (df)
Group Date value
0 1 2018-07-31 NaN
2 1 2018-08-31 NaN
4 1 2018-09-30 NaN
6 1 2018-10-31 1.1
7 1 2018-11-30 1.7
8 1 2018-12-31 1.9
9 1 2019-01-31 2.3
10 1 2019-02-28 1.5
1 2 2000-12-30 NaN
3 2 2001-01-30 NaN
5 2 2001-02-28 NaN
11 2 2001-03-30 2.8
12 2 2001-04-30 2.0
13 2 2001-05-31 2.0
14 2 2001-06-30 2.0
15 2 2001-07-31 2.0
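When the dates are always month ends (as in the expected output), a minimal alternative sketch is to build the three prepended rows per group with pd.offsets.MonthEnd; the firsts and prepend names are my own, not from the answer:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
firsts = before.drop_duplicates('Group')

# Three month-end dates before each group's first observation.
prepend = pd.concat(
    [firsts[['Group']].assign(Date=firsts['Date'] - pd.offsets.MonthEnd(i))
     for i in (3, 2, 1)],
    ignore_index=True,
)

# Missing 'value' entries become NaN automatically on concat.
df = (pd.concat([prepend, before], ignore_index=True, sort=False)
        .sort_values(['Group', 'Date'])
        .reset_index(drop=True))
print (df)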

Groupby year and another column and calculate an average based on a specific condition in pandas

I have a data frame as shown below
Tenancy_ID Unit_ID End_Date Rental_value
1 A 2012-04-26 10
2 A 2012-08-27 20
3 A 2013-04-27 50
4 A 2014-04-27 40
1 B 2011-06-26 10
2 B 2011-09-27 30
3 B 2013-04-27 60
4 B 2015-04-27 80
From the above, I would like to prepare the data frame below.
Expected Output:
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
A NaN 15 50 40 NaN
B 20 NaN 60 NaN 80
Steps:
Unit_ID = A has two contracts in 2012 with rental values 10 and 20, hence the average is 15.
Avg_2012 = Average rental value in 2012.
Use pivot_table directly with the year extracted via Series.dt.year:
#df['End_Date']=pd.to_datetime(df['End_Date']) if dtype of End_Date is not datetime
final = (df.pivot_table('Rental_value','Unit_ID',df['End_Date'].dt.year)
.add_prefix('Avg_').reset_index().rename_axis(None,axis=1))
print(final)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
You can aggregate the means and reshape with Series.unstack, then change the column names with DataFrame.add_prefix, and finally clean up with DataFrame.reset_index and DataFrame.rename_axis:
df1 = (df.groupby(['Unit_ID', df['End_Date'].dt.year])['Rental_value']
.mean()
.unstack()
.add_prefix('Avg_')
.reset_index()
.rename_axis(None, axis=1))
print (df1)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
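pd.crosstab gives the same table in one call. This is a hedged alternative sketch, not from either answer, and it assumes End_Date is already a datetime column:
final = (pd.crosstab(df['Unit_ID'], df['End_Date'].dt.year,
                     values=df['Rental_value'], aggfunc='mean')
           .add_prefix('Avg_')
           .reset_index()
           .rename_axis(None, axis=1))
print(final)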

expand mid year values to month in pandas

Following on from expand year values to month in pandas, I have:
pd.DataFrame({'comp':['a','b'], 'period':['20180331','20171231'],'value':[12,24]})
comp period value
0 a 20180331 12
1 b 20171231 24
and would like to expand it to cover 201701 to 201812 inclusive. The value should be spread out over the 12 months preceding the period.
comp yyyymm value
a 201701 na
a 201702 na
...
a 201705 12
a 201706 12
...
a 201803 12
a 201804 na
b 201701 24
...
b 201712 24
b 201801 na
...
Use:
#create monthly periods covering the full range
r = pd.period_range('2017-01', '2018-12', freq='M')
#convert the column to monthly periods
df['period'] = pd.to_datetime(df['period']).dt.to_period('M')
#create a MultiIndex with all possible comp/period combinations
mux = pd.MultiIndex.from_product([df['comp'], r], names=('comp','period'))
#reindex to add the missing rows
df = df.set_index(['comp','period'])['value'].reindex(mux).reset_index()
#back fill up to 11 missing values per group
df['new'] = df.groupby('comp')['value'].bfill(limit=11)
print (df)
comp period value new
0 a 2017-01 NaN NaN
1 a 2017-02 NaN NaN
2 a 2017-03 NaN NaN
3 a 2017-04 NaN 12.0
4 a 2017-05 NaN 12.0
...
...
10 a 2017-11 NaN 12.0
11 a 2017-12 NaN 12.0
12 a 2018-01 NaN 12.0
13 a 2018-02 NaN 12.0
14 a 2018-03 12.0 12.0
15 a 2018-04 NaN NaN
16 a 2018-05 NaN NaN
17 a 2018-06 NaN NaN
18 a 2018-07 NaN NaN
19 a 2018-08 NaN NaN
20 a 2018-09 NaN NaN
21 a 2018-10 NaN NaN
22 a 2018-11 NaN NaN
23 a 2018-12 NaN NaN
24 b 2017-01 NaN 24.0
25 b 2017-02 NaN 24.0
26 b 2017-03 NaN 24.0
...
...
32 b 2017-09 NaN 24.0
33 b 2017-10 NaN 24.0
34 b 2017-11 NaN 24.0
35 b 2017-12 24.0 24.0
36 b 2018-01 NaN NaN
37 b 2018-02 NaN NaN
38 b 2018-03 NaN NaN
...
...
44 b 2018-09 NaN NaN
45 b 2018-10 NaN NaN
46 b 2018-11 NaN NaN
47 b 2018-12 NaN NaN
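If the yyyymm string format from the question is needed, the period column can be formatted at the end; a small sketch assuming the df produced by the reindex/bfill approach above:
df['yyyymm'] = df['period'].dt.strftime('%Y%m')
print(df[['comp', 'yyyymm', 'new']].head())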
See if this works:
# Populate the full daily range and format it as year-month strings
dftime = (pd.DataFrame(pd.date_range('20170101','20181231'), columns=['dt'])
            .apply(lambda x: x.dt.strftime('%Y-%m'), axis=1))
# Drop the duplicate months from the daily range
dftime = dftime.assign(dt=dftime.dt.drop_duplicates().reset_index(drop=True)).dropna()
# Add a matching year-month column for merging
df['dt'] = pd.to_datetime(df.period).apply(lambda x: x.strftime('%Y-%m'))
# Populate the data for each company
target = (df.groupby('comp')
            .apply(lambda x: dftime.merge(x[['comp','dt','value']], on='dt', how='left')
                                   .fillna({'comp': x.comp.unique()[0]}))
            .reset_index(drop=True))
This gives desired output:
print(target)
dt comp value
0 2017-01 a NaN
1 2017-02 a NaN
2 2017-03 a NaN
3 2017-04 a NaN
4 2017-05 a NaN
5 2017-06 a NaN
6 2017-07 a NaN
and so on.

Sort by datetime, group the status sequence by 'ID' and Date, and then take the mean per sequence

I am new to pandas. I have a DataFrame as shown below, which is repair data for mobile phones.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I tried to find the duration for each ID since the previous status, and then the mean for each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FM = Average F-M Duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DataFrame as shown below:
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create a new difference column with DataFrameGroupBy.diff and join it with the shifted Status values from DataFrameGroupBy.shift. Remove rows with missing values in the S column. Then reshape with DataFrame.unstack, using GroupBy.cumcount for the counter column, compute the means per S pair with DataFrame.pivot_table, and finally combine everything with DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
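To match the column names from the expected output (S1, S1_Dur, Avg_MF, ...), a renaming step can be appended; the mapping below is my own hedged addition, not part of the answer:
# Map the generated names onto the requested ones.
rename_map = {'Avg_M-F': 'Avg_MF', 'Avg_F-M': 'Avg_FM', 'Avg_F-F': 'Avg_FF'}
for i in range(1, 5):
    rename_map[f'S_{i}'] = f'S{i}'
    rename_map[f'D_{i}'] = f'S{i}_Dur'

df3 = df3.rename(columns=rename_map)
print(df3)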

Pandas - Delete rows with two or more NaN values in dataframe

I want to delete values in a column once it contains too many NaN values; specifically, 2 or more.
I have a dataframe with a column which looks like this. The column below has 40 rows. I want to remove the NaN values from the 19th row onwards (after the 17.9 value).
AvgWS
0.12
1
2.04
3.01
3.99
5
6
7
7.99
9
10
10.98
11.99
13
13.93
14.99
15.98
NaN
17.9
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
NaN
Thanks
You can call isnull() on the column; this returns a Series of boolean values. Cast it to int, so that True becomes 1 and False becomes 0, and then call cumsum(). We then filter the df to where the cumulative sum is less than 2, i.e. everything before the NaN count reaches 2:
In [110]:
df[df['AvgWS'].isnull().astype(int).cumsum() < 2]
Out[110]:
AvgWS
0 0.12
1 1.00
2 2.04
3 3.01
4 3.99
5 5.00
6 6.00
7 7.00
8 7.99
9 9.00
10 10.00
11 10.98
12 11.99
13 13.00
14 13.93
15 14.99
16 15.98
17 NaN
18 17.90
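An alternative hedged sketch: if the goal is simply to trim the trailing block of NaN values, Series.last_valid_index can be used to slice up to the last real value:
# Keep rows up to and including the last non-NaN value in AvgWS.
trimmed = df.loc[:df['AvgWS'].last_valid_index()]
print(trimmed)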