I have a DataFrame with millions of groups. For each group, I want to add 3 months of dates (month-end dates) before the group's first observation. So if the first observation of a group is December 2019, I want to add 3 rows before it, dated September 2019 to November 2019. The group column should be filled with the relevant group ID, and the other columns can remain null.
I would like to avoid looping if possible, as this is a very large dataset.
This is my before DataFrame:
import pandas as pd
before = pd.DataFrame({'Group':[1,1,1,1,1,2,2,2,2,2],
'Date':['31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[1.1,1.7,1.9,2.3,1.5,2.8,2,2,2,2]})
This is my after DataFrame:
import numpy as np
after = pd.DataFrame({'Group':[1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2],
'Date':['31/07/2018','31/08/2018','30/09/2018','31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','31/12/2000','31/01/2001','28/02/2001','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[np.nan,np.nan,np.nan,1.1,1.7,1.9,2.3,1.5,np.nan,np.nan,np.nan,2.8,2,2,2,2]})
Processing each group separately cannot be very fast when there are many groups. Instead, the idea is to get the first row of each group with DataFrame.drop_duplicates, shift its date back with a month offset, join the shifted rows together, and add all the missing month-end dates in between:
import numpy as np
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
# first and last added months: the first date shifted back by 3 months and by 1 month
# (pd.DateOffset(months=n) is used because pd.offsets.MonthOffset is not available in recent pandas)
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=1))
df = (pd.concat([df11, df12], sort=False, ignore_index=True)
.set_index('Date')
.groupby('Group')
# month-end frequency; pandas >= 2.2 prefers the alias 'ME'
.resample('M')
.size()
.reset_index(name='value')
.assign(value = np.nan))
print (df)
Group Date value
0 1 2018-07-31 NaN
1 1 2018-08-31 NaN
2 1 2018-09-30 NaN
3 2 2000-12-31 NaN
4 2 2001-01-31 NaN
5 2 2001-02-28 NaN
Finally, append these rows to the original and sort:
df = pd.concat([before, df], ignore_index=True).sort_values(['Group','Date'])
print (df)
Group Date value
10 1 2018-07-31 NaN
11 1 2018-08-31 NaN
12 1 2018-09-30 NaN
0 1 2018-10-31 1.1
1 1 2018-11-30 1.7
2 1 2018-12-31 1.9
3 1 2019-01-31 2.3
4 1 2019-02-28 1.5
13 2 2000-12-31 NaN
14 2 2001-01-31 NaN
15 2 2001-02-28 NaN
5 2 2001-03-30 2.8
6 2 2001-04-30 2.0
7 2 2001-05-31 2.0
8 2 2001-06-30 2.0
9 2 2001-07-31 2.0
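If only a few months are added, you can omit the groupby part. Note that this variant subtracts calendar months, so the added dates are month-end dates only when the group's first date is one (Group 2 below gets 2000-12-30, not 2000-12-31):
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
# pd.DateOffset(months=n) is used because pd.offsets.MonthOffset is not available in recent pandas
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=2))
df13 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.DateOffset(months=1))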
df = (pd.concat([df11, df12, df13, before], ignore_index=True, sort=False)
.sort_values(['Group','Date']))
print (df)
Group Date value
0 1 2018-07-31 NaN
2 1 2018-08-31 NaN
4 1 2018-09-30 NaN
6 1 2018-10-31 1.1
7 1 2018-11-30 1.7
8 1 2018-12-31 1.9
9 1 2019-01-31 2.3
10 1 2019-02-28 1.5
1 2 2000-12-30 NaN
3 2 2001-01-30 NaN
5 2 2001-02-28 NaN
11 2 2001-03-30 2.8
12 2 2001-04-30 2.0
13 2 2001-05-31 2.0
14 2 2001-06-30 2.0
15 2 2001-07-31 2.0
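If the added rows must always fall on month ends, even when a group's first date does not (as with Group 2 here), one possible variant is to subtract pd.offsets.MonthEnd(k) instead. This is only a minimal sketch and not part of the answer above; the names first, pads and out are made up for illustration, and it assumes the rows to add are the three month ends immediately before each group's first date.

```python
import pandas as pd

before = pd.DataFrame({
    'Group': [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'Date': ['31/10/2018', '30/11/2018', '31/12/2018', '31/01/2019', '28/02/2019',
             '30/03/2001', '30/04/2001', '31/05/2001', '30/06/2001', '31/07/2001'],
    'value': [1.1, 1.7, 1.9, 2.3, 1.5, 2.8, 2, 2, 2, 2],
})
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)

# earliest row of each group (sort first so this also holds for unsorted input)
first = before.sort_values(['Group', 'Date']).drop_duplicates('Group')[['Group', 'Date']]

# subtracting MonthEnd(k) moves each date back to the k-th preceding month end
pads = [first.assign(Date=first['Date'] - pd.offsets.MonthEnd(k)) for k in (3, 2, 1)]

# columns missing from the padded rows (here 'value') are filled with NaN by concat
out = (pd.concat(pads + [before], ignore_index=True)
         .sort_values(['Group', 'Date'])
         .reset_index(drop=True))
print(out)
```

Like the second variant above, this avoids groupby and resample entirely, so the cost is one concat and one sort.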
Related
I have a dataframe:
# create example df
df = pd.DataFrame(index=[1,2,3,4,5,6,7])
df['ID'] = [1,1,1,1,2,2,2]
df['election_date'] = pd.date_range("01/01/2010", periods=7, freq="M")
df['stock_price'] = [1,np.nan,np.nan,4,5,np.nan,7]
# sort values
df.sort_values(['election_date'], inplace=True, ascending=False)
df.reset_index(drop=True, inplace=True)
df
ID election_date stock_price
0 2 2010-07-31 7.0
1 2 2010-06-30 NaN
2 2 2010-05-31 5.0
3 1 2010-04-30 4.0
4 1 2010-03-31 NaN
5 1 2010-02-28 NaN
6 1 2010-01-31 1.0
I would like to calculate the cumulative count of all np.nan for column stock_price for every ID.
The expected result is:
df
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
Any ideas how to solve it?
The idea is to reverse the row order by indexing, then in a custom function test for missing values, shift by one row, and take the cumulative sum:
f = lambda x: x.isna().shift(fill_value=0).cumsum()
df['cum_count_nans'] = df.iloc[::-1].groupby('ID')['stock_price'].transform(f)
print (df)
ID election_date stock_price cum_count_nans
0 2 2010-07-31 7.0 1
1 2 2010-06-30 NaN 0
2 2 2010-05-31 5.0 0
3 1 2010-04-30 4.0 2
4 1 2010-03-31 NaN 1
5 1 2010-02-28 NaN 0
6 1 2010-01-31 1.0 0
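The same count can also be written without a custom function, as a per-group cumulative sum that excludes the current row. The sketch below is an alternative, not the method used above; rev and ids are illustrative names, and it assumes the frame is sorted by election_date in descending order, as in the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': [2, 2, 2, 1, 1, 1, 1],
    'election_date': pd.to_datetime(['2010-07-31', '2010-06-30', '2010-05-31',
                                     '2010-04-30', '2010-03-31', '2010-02-28',
                                     '2010-01-31']),
    'stock_price': [7, np.nan, 5, 4, np.nan, np.nan, 1],
})

# 1 where stock_price is missing, reversed so the rows run oldest-first
rev = df['stock_price'].isna().astype(int).iloc[::-1]
ids = df['ID'].iloc[::-1]

# running NaN count per ID minus the current row = NaNs strictly before this row
# (the assignment aligns on the index, so the reversed order does not matter)
df['cum_count_nans'] = rev.groupby(ids).cumsum() - rev
print(df)
```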
I have a data frame as shown below
Tenancy_ID Unit_ID End_Date Rental_value
1 A 2012-04-26 10
2 A 2012-08-27 20
3 A 2013-04-27 50
4 A 2014-04-27 40
1 B 2011-06-26 10
2 B 2011-09-27 30
3 B 2013-04-27 60
4 B 2015-04-27 80
From the above I would like to prepare below data frame
Expected Output:
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
A NaN 15 50 40 NaN
B 20 NaN 60 NaN 80
Steps:
Unit_ID = A, has two contracts in 2012 with rental value 10 and 20, Hence the average is 15.
Avg_2012 = Average rental value in 2012.
Use pivot_table directly, with the year of End_Date (via .dt.year) as the columns:
#df['End_Date']=pd.to_datetime(df['End_Date']) if dtype of End_Date is not datetime
final = (df.pivot_table('Rental_value','Unit_ID',df['End_Date'].dt.year)
.add_prefix('Avg_').reset_index().rename_axis(None,axis=1))
print(final)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
You can aggregate the averages and reshape with Series.unstack, then change the column names with DataFrame.add_prefix, and finally clean up the data with DataFrame.reset_index and DataFrame.rename_axis:
df1 = (df.groupby(['Unit_ID', df['End_Date'].dt.year])['Rental_value']
.mean()
.unstack()
.add_prefix('Avg_')
.reset_index()
.rename_axis(None, axis=1))
print (df1)
Unit_ID Avg_2011 Avg_2012 Avg_2013 Avg_2014 Avg_2015
0 A NaN 15.0 50.0 40.0 NaN
1 B 20.0 NaN 60.0 NaN 80.0
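For completeness, pd.crosstab can produce the same table when it is given values and aggfunc. This is a minimal sketch rather than a replacement for either answer; the frame below simply rebuilds the question's sample data with End_Date already converted to datetime.

```python
import pandas as pd

df = pd.DataFrame({
    'Tenancy_ID': [1, 2, 3, 4, 1, 2, 3, 4],
    'Unit_ID': list('AAAABBBB'),
    'End_Date': pd.to_datetime(['2012-04-26', '2012-08-27', '2013-04-27', '2014-04-27',
                                '2011-06-26', '2011-09-27', '2013-04-27', '2015-04-27']),
    'Rental_value': [10, 20, 50, 40, 10, 30, 60, 80],
})

# rows: Unit_ID, columns: year of End_Date, cells: mean Rental_value
final = (pd.crosstab(df['Unit_ID'], df['End_Date'].dt.year,
                     values=df['Rental_value'], aggfunc='mean')
           .add_prefix('Avg_')
           .reset_index()
           .rename_axis(None, axis=1))
print(final)
```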
I have two dataframes df and df1 which I want to merge or join.
import pandas as pd
df = pd.DataFrame(columns=['lt1', 'lt2','lt3','lt4','lt5','lt6'])
df['date'] = pd.date_range('2016-1-1', periods=5, freq='D')
df
lt1 lt2 lt3 lt4 lt5 lt6 date
0 NaN NaN NaN NaN NaN NaN 2016-01-01
1 NaN NaN NaN NaN NaN NaN 2016-01-02
2 NaN NaN NaN NaN NaN NaN 2016-01-03
3 NaN NaN NaN NaN NaN NaN 2016-01-04
4 NaN NaN NaN NaN NaN NaN 2016-01-05
df1 = pd.DataFrame({'location': ['lt1','lt3', 'lt6', 'lt1','lt2', 'lt3'], \
'date': ['2016-01-1', '2016-01-02','2016-01-1','2016-01-03','2016-01-5','2016-01-4'], \
'counts': ['2', '1','1','1', '3','1']})
df1.date = pd.to_datetime(df1.date)
df1
counts date location
0 2 2016-01-01 lt1
1 1 2016-01-02 lt3
2 1 2016-01-01 lt6
3 1 2016-01-03 lt1
4 3 2016-01-05 lt2
5 1 2016-01-04 lt3
I want to put the counts values from df1 into df according to location. The merge will be based on the date column, but the values to add come from the df1.counts column, and each value should go into the column of df named after its location. The column names in df contain all the names present in the df1.location column.
Merging by date alone is easy, but this is not really a straightforward merge; it is more like a reshape or a join. Any suggestions on how to get the following df as output?
df
date lt1 lt2 lt3 lt4 lt5 lt6
0 2016-01-01 2 0 0 0 0 1
1 2016-01-02 0 0 1 0 0 0
2 2016-01-03 1 0 0 0 0 0
3 2016-01-04 0 0 1 0 0 0
4 2016-01-05 0 3 0 0 0 0
Here is one way using pivot_table and combine_first:
m=df1.pivot_table(index='date',columns='location',values='counts',aggfunc='sum')
final=df.set_index('date').combine_first(m).fillna(0).reset_index()
Or just:
(df.set_index('date').combine_first(df1.pivot(index='date', columns='location', values='counts'))
.fillna(0).reset_index())
date lt1 lt2 lt3 lt4 lt5 lt6
0 2016-01-01 2 0 0 0 0 1
1 2016-01-02 0 0 1 0 0 0
2 2016-01-03 1 0 0 0 0 0
3 2016-01-04 0 0 1 0 0 0
4 2016-01-05 0 3 0 0 0 0
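Since counts holds strings in the question's df1, it may be worth converting it to integers before reshaping. The following sketch is one way to get an integer result using pivot and reindex rather than combine_first; the names locations, dates and wide are illustrative, and it assumes each (date, location) pair occurs at most once, as pivot requires.

```python
import pandas as pd

df1 = pd.DataFrame({
    'location': ['lt1', 'lt3', 'lt6', 'lt1', 'lt2', 'lt3'],
    'date': pd.to_datetime(['2016-01-01', '2016-01-02', '2016-01-01',
                            '2016-01-03', '2016-01-05', '2016-01-04']),
    'counts': ['2', '1', '1', '1', '3', '1'],
})
df1['counts'] = df1['counts'].astype(int)

locations = ['lt1', 'lt2', 'lt3', 'lt4', 'lt5', 'lt6']
dates = pd.date_range('2016-01-01', periods=5, freq='D')

wide = (df1.pivot(index='date', columns='location', values='counts')
           .reindex(index=dates, columns=locations)  # add missing dates and locations
           .fillna(0)
           .astype(int))
# tidy the axis names before turning the date index back into a column
wide.index.name = 'date'
wide.columns.name = None
wide = wide.reset_index()
print(wide)
```

If a (date, location) pair can repeat, pivot_table with aggfunc='sum' on the converted integers would be the safer choice.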
I am new to pandas.
I have a DataFrame as shown below, which contains repair data for mobile phones.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I am trying to find, for each ID, the duration since the previous status,
and then the mean duration of each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FM = Average F-M Duration
I tried the following code:
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DF as shown below
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
The idea is to create a new column of date differences with DataFrameGroupBy.diff, and a column S that joins the shifted values of Status (from DataFrameGroupBy.shift) with the current Status. Remove the rows with missing values in the S column. Then reshape with DataFrame.unstack, using GroupBy.cumcount as the counter column, compute the means per pair in S with DataFrame.pivot_table, and finally combine both with DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
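To see what the diff and shift steps produce before any reshaping, here is a small sketch on a subset of the question's rows (IDs 1 and 2 only). The subset and the name avg are chosen purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [1, 1, 2, 2, 1],
    'Status': ['F', 'M', 'M', 'F', 'F'],
    'Date': ['22-Jun-17', '22-Jul-17', '29-Jun-17', '29-Sep-17', '22-Jun-18'],
})
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date']).reset_index(drop=True)

# days since the previous row of the same ID (NaN for each ID's first row)
df['D'] = df.groupby('ID')['Date'].diff().dt.days
# previous status joined with the current one, e.g. 'F-M'
df['S'] = df.groupby('ID')['Status'].shift() + '-' + df['Status']
print(df)

# mean duration per ID and status pair
avg = (df.dropna(subset=['S'])
         .pivot_table(index='ID', columns='S', values='D', aggfunc='mean')
         .add_prefix('Avg_'))
print(avg)
```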
I would like to apply rolling mean function in dataframe. I have more than one category (A and B in column Category) in dataframe, so I have to calculate rolling mean for each category and this is my issue.
Dataframe looks like below. The Rolling_Mean column is expected outcome.
Date Category Value Rolling_Mean
01.01.2017 A 12,30 NaN
02.01.2017 A 12,50 NaN
03.01.2017 A 12,90 12,57
04.01.2017 A 13,10 12,70
05.01.2017 A 12,90 12,74
06.01.2017 A 13,55 12,88
07.01.2017 A 13,12 12,91
01.01.2017 B 1,14 NaN
02.01.2017 B 1,52 NaN
03.01.2017 B 1,74 1,47
04.01.2017 B 2,12 1,63
05.01.2017 B 1,75 1,65
06.01.2017 B 1,69 1,66
07.01.2017 B 1,35 1,62
To calculate the rolling mean I use pandas rolling:
df['Rolling_Mean'] = df['Value'].rolling(window=3).mean()
but I'm not able to calculate the rolling mean separately for more than one category.
I have tried to use numpy.where (below), but it works for only one category, and I'm looking for a solution that works for 10 categories.
df['Rolling_Mean'] = np.where((df.Category == 'A'), df['Value'].rolling(window=3).mean(), 0)
You need groupby with rolling, but the output has a MultiIndex, so remove its first level with reset_index:
# convert the values to floats, or use the parameter decimal=',' in read_csv
df['Value'] = df['Value'].str.replace(',','.').astype(float)
# the parentheses let the method chain continue on the next line
df['new'] = (df.groupby('Category')['Value'].rolling(window=3, min_periods=3).mean()
.reset_index(level=0, drop=True))
print (df)
Date Category Value Rolling_Mean new
0 01.01.2017 A 12.30 NaN NaN
1 02.01.2017 A 12.50 NaN NaN
2 03.01.2017 A 12.90 12,57 12.566667
3 04.01.2017 A 13.10 12,70 12.833333
4 05.01.2017 A 12.90 12,74 12.966667
5 06.01.2017 A 13.55 12,88 13.183333
6 07.01.2017 A 13.12 12,91 13.190000
7 01.01.2017 B 1.14 NaN NaN
8 02.01.2017 B 1.52 NaN NaN
9 03.01.2017 B 1.74 1,47 1.466667
10 04.01.2017 B 2.12 1,63 1.793333
11 05.01.2017 B 1.75 1,65 1.870000
12 06.01.2017 B 1.69 1,66 1.853333
13 07.01.2017 B 1.35 1,62 1.596667
Use rolling within a groupby context on Category. To return the same index as the current DataFrame, use transform with rolling embedded in a lambda:
df.assign(
Rolling_Mean=df.groupby('Category').Value.transform(
lambda x: x.rolling(3).mean()
)
)
Date Category Value Rolling_Mean
0 01.01.2017 A 12.30 NaN
1 02.01.2017 A 12.50 NaN
2 03.01.2017 A 12.90 12.566667
3 04.01.2017 A 13.10 12.833333
4 05.01.2017 A 12.90 12.966667
5 06.01.2017 A 13.55 13.183333
6 07.01.2017 A 13.12 13.190000
7 01.01.2017 B 1.14 NaN
8 02.01.2017 B 1.52 NaN
9 03.01.2017 B 1.74 1.466667
10 04.01.2017 B 2.12 1.793333
11 05.01.2017 B 1.75 1.870000
12 06.01.2017 B 1.69 1.853333
13 07.01.2017 B 1.35 1.596667
Note:
assign returns a new DataFrame, so if you want this result to persist, assign it back to a variable:
df = df.assign(
Rolling_Mean=df.groupby('Category').Value.transform(
lambda x: x.rolling(3).mean()
)
)
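The two approaches can be checked against each other on a small frame. The sketch below rebuilds part of the question's data with numeric values (an assumption: the decimal commas have already been converted) and computes the rolling mean both ways; via_reset and via_transform are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'Category': ['A'] * 4 + ['B'] * 4,
    'Value': [12.30, 12.50, 12.90, 13.10, 1.14, 1.52, 1.74, 2.12],
})

# approach 1: groupby + rolling, then drop the group level of the MultiIndex
via_reset = (df.groupby('Category')['Value']
               .rolling(window=3)
               .mean()
               .reset_index(level=0, drop=True))

# approach 2: transform keeps the original index directly
via_transform = df.groupby('Category')['Value'].transform(lambda x: x.rolling(3).mean())

df['Rolling_Mean'] = via_transform
print(df)
```

In both cases the first two rows of every category are NaN, because a window of 3 is not yet full there.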