I have a dataframe with millions of groups. I am trying to, for each group, add 3 months of dates (month end dates) at the top of every group. So if the first observation of a group is December 2019, I want to fill 3 rows prior to that observation with dates from September 2019 to November 2019. I also want to fill the group column with the relevant group ID and the other columns can remain as null values.
Would like to avoid looping if possible as this is a very large dataset
This is my before DataFrame:
import pandas as pd
before = pd.DataFrame({'Group':[1,1,1,1,1,2,2,2,2,2],
'Date':['31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[1.1,1.7,1.9,2.3,1.5,2.8,2,2,2,2]})
This is my after DataFrame
import pandas as pd
after = pd.DataFrame({'Group':[1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2],
'Date':['31/07/2018','31/08/2018','30/09/2018','31/10/2018','30/11/2018','31/12/2018','31/01/2019','28/02/2019','31/12/2000','31/01/2001','28/02/2001','30/03/2001','30/04/2001','31/05/2001','30/06/2001','31/07/2001'],
'value':[np.nan,np.nan,np.nan,1.1,1.7,1.9,2.3,1.5,np.nan,np.nan,np.nan,2.8,2,2,2,2]})
Because processing each group separately if many groups solution cannot be very fast - idea is get first rows of Group by DataFrame.drop_duplicates, shift months by offsets.MonthOffset, join together and add all missing datets between:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
#first and last shifted months - by 1 and by 3 months
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12], sort=False, ignore_index=True)
.set_index('Date')
.groupby('Group')
.resample('m')
.size()
.reset_index(name='value')
.assign(value = np.nan))
print (df)
Group Date value
0 1 2018-07-31 NaN
1 1 2018-08-31 NaN
2 1 2018-09-30 NaN
3 2 2000-12-31 NaN
4 2 2001-01-31 NaN
5 2 2001-02-28 NaN
Last add to original and sorting:
df = pd.concat([before, df], ignore_index=True).sort_values(['Group','Date'])
print (df)
Group Date value
10 1 2018-07-31 NaN
11 1 2018-08-31 NaN
12 1 2018-09-30 NaN
0 1 2018-10-31 1.1
1 1 2018-11-30 1.7
2 1 2018-12-31 1.9
3 1 2019-01-31 2.3
4 1 2019-02-28 1.5
13 2 2000-12-31 NaN
14 2 2001-01-31 NaN
15 2 2001-02-28 NaN
5 2 2001-03-30 2.8
6 2 2001-04-30 2.0
7 2 2001-05-31 2.0
8 2 2001-06-30 2.0
9 2 2001-07-31 2.0
If new months is only few you can omit groupby part:
before['Date'] = pd.to_datetime(before['Date'], dayfirst=True)
df1 = before.drop_duplicates('Group')
df11 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(3))
df12 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(2))
df13 = df1[['Group','Date']].assign(Date = lambda x: x['Date'] - pd.offsets.MonthOffset(1))
df = (pd.concat([df11, df12, df13, before], ignore_index=True, sort=False)
.sort_values(['Group','Date']))
print (df)
Group Date value
0 1 2018-07-31 NaN
2 1 2018-08-31 NaN
4 1 2018-09-30 NaN
6 1 2018-10-31 1.1
7 1 2018-11-30 1.7
8 1 2018-12-31 1.9
9 1 2019-01-31 2.3
10 1 2019-02-28 1.5
1 2 2000-12-30 NaN
3 2 2001-01-30 NaN
5 2 2001-02-28 NaN
11 2 2001-03-30 2.8
12 2 2001-04-30 2.0
13 2 2001-05-31 2.0
14 2 2001-06-30 2.0
15 2 2001-07-31 2.0
I am new in pandas functionality.
I have a DF as shown below. which is repair data of mobiles.
ID Status Date Cost
0 1 F 22-Jun-17 500
1 1 M 22-Jul-17 100
2 2 M 29-Jun-17 200
3 3 M 20-Mar-17 300
4 4 M 10-Aug-17 800
5 2 F 29-Sep-17 600
6 2 F 29-Jan-18 500
7 1 F 22-Jun-18 600
8 3 F 20-Jun-18 700
9 1 M 22-Aug-18 150
10 1 F 22-Mar-19 750
11 3 M 20-Oct-18 250
12 4 F 10-Jun-18 100
I tried to find out the duration for each id from previous status.
find the mean for each status sequence for that ID.
My expected output is shown below.
ID S1 S1_Dur S2 S2_dur S3 S3_dur S4 S4_dur Avg_MF Avg_FM
0 1 F-M 30 M-F 335.00 F-M 61.00 M-F 750.00 542.50 45.50
1 2 M-F 92 F-F 122.00 NaN nan NaN nan 92.00 nan
2 3 M-F 457 F-M 122.00 NaN nan NaN nan 457.00 122.00
3 4 M-F 304 NaN nan NaN nan NaN nan 304.00 nan
S1 = first sequence
S1_Dur = S1 Duration
Avg_MF = Average M-F Duration
Avg_FMn = Average F-M Duration
I tried following codes
df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['ID', 'Date', 'Status'])
df = df.reset_index().sort_values(['ID', 'Date', 'Status']).set_index(['ID', 'Status'])
df['Difference'] = df.groupby('ID')['Date'].transform(pd.Series.diff)
df.reset_index(inplace=True)
Then I got a DF as shown below
ID Status index Date Cost Difference
0 1 F 0 2017-06-22 500 NaT
1 1 M 1 2017-07-22 100 30 days
2 1 F 7 2018-06-22 600 335 days
3 1 M 9 2018-08-22 150 61 days
4 1 F 10 2019-03-22 750 212 days
5 2 M 2 2017-06-29 200 NaT
6 2 F 5 2017-09-29 600 92 days
7 2 F 6 2018-01-29 500 122 days
8 3 M 3 2017-03-20 300 NaT
9 3 F 8 2018-06-20 700 457 days
10 3 M 11 2018-10-20 250 122 days
11 4 M 4 2017-08-10 800 NaT
12 4 F 12 2018-06-10 100 304 days
After that I am stuck.
Idea is create new columns for difference by DataFrameGroupBy.diff and join shifted values of Status by DataFrameGroupBy.shift. Remove rows with missing values in S column. Then reshape by DataFrame.unstack with GroupBy.cumcount for counter column, create means per pairs of S by DataFrame.pivot_table and last use DataFrame.join:
df['Date'] = pd.to_datetime(df['Date'], format='%d-%b-%y')
df = df.sort_values(['ID', 'Date', 'Status'])
df['D'] = df.groupby('ID')['Date'].diff().dt.days
df['S'] = df.groupby('ID')['Status'].shift() + '-'+ df['Status']
df = df.dropna(subset=['S'])
df['g'] = df.groupby('ID').cumcount().add(1).astype(str)
df1 = df.pivot_table(index='ID', columns='S', values='D', aggfunc='mean').add_prefix('Avg_')
df2 = df.set_index(['ID', 'g'])[['S','D']].unstack().sort_index(axis=1, level=1)
df2.columns = df2.columns.map('_'.join)
df3 = df2.join(df1).reset_index()
print (df3)
ID D_1 S_1 D_2 S_2 D_3 S_3 D_4 S_4 Avg_F-F Avg_F-M \
0 1 30.0 F-M 335.0 M-F 61.0 F-M 212.0 M-F NaN 45.5
1 2 92.0 M-F 122.0 F-F NaN NaN NaN NaN 122.0 NaN
2 3 457.0 M-F 122.0 F-M NaN NaN NaN NaN NaN 122.0
3 4 304.0 M-F NaN NaN NaN NaN NaN NaN NaN NaN
Avg_M-F
0 273.5
1 92.0
2 457.0
3 304.0
I have the below dataframe and I need to subtract value from same month last year and save it in output:
date value output
01-01-2012 20 null
01-02-2012 10
01-03-2012 40
01-06-2012 30
01-01-2013 20 0
01-02-2013 30 20
01-02-2014 60 30
01-03-2014 50 null
First create DatetimeIndex, then subtract by sub with new Series by shift by 12 months, MS is for start of month:
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df = df.set_index('date')
df['new'] = df['value'].sub(df['value'].shift(freq='12MS'))
print (df)
value output new
date
2012-01-01 20 NaN NaN
2012-02-01 10 NaN NaN
2012-03-01 40 NaN NaN
2012-06-01 30 NaN NaN
2013-01-01 20 0.0 0.0
2013-02-01 30 20.0 20.0
2014-02-01 60 30.0 30.0
2014-03-01 50 NaN NaN
I currently have a dataframe which looks like this:
df:
store item sales
0 1 1 10
1 1 2 20
2 2 1 10
3 3 2 20
4 4 3 10
5 3 4 15
...
I wanted to view the total sales of each items for each store so I used pivot table to create this:
p_table = pd.pivot_table(df, index='store', values='sales', columns='item', aggfunc=np.sum)
which gives something like:
sales
item 1 2 3 4
store
1 20 30 10 8
2 10 14 12 13
3 1 23 29 10
....
What I want to do now is apply some functions so that each total sales of items represents the percentage of the total sales for a particular store. For example, the value for item 1 at store1 would become:
1. 20/(20+30+10+8) * 100
I am struggling to do this for stacked dataframe. Any suggestions would be much appreciated.
Thanks
I think need divide by div with Series created by sum:
print (p_table)
item 1 2 3 4
store
1 10.0 20.0 NaN NaN
2 10.0 NaN NaN NaN
3 NaN 20.0 NaN 15.0
4 NaN NaN 10.0 NaN
print (p_table.sum(axis=1))
store
1 30.0
2 10.0
3 35.0
4 10.0
dtype: float64
out = p_table.div(p_table.sum(axis=1), axis=0)
print (out)
item 1 2 3 4
store
1 0.333333 0.666667 NaN NaN
2 1.000000 NaN NaN NaN
3 NaN 0.571429 NaN 0.428571
4 NaN NaN 1.0 NaN
I created a dataframe df1:
df1 = pd.read_csv('FBK_var_conc_1.csv', names = ['Cycle', 'SQ'])
df1 = df1['SQ'].copy()
df1 = df1.to_frame()
df1.head(n=10)
SQ
0 2430.0
1 2870.0
2 2890.0
3 3270.0
4 3350.0
5 3520.0
6 26900.0
7 26300.0
8 28400.0
9 3230.0
And then created a second dataframe df2, that I want to fill with the row values of df 1:
df2 = pd.DataFrame()
for x in range(12):
y='Experiment %d' % (x+1)
df2[y]= df1.iloc[3*x:3*x+3]
df2
I get the column names from Experiment 1 - Experiment 12 in df2 and the first column i filled with the right values, but all following columns are filled with N/A.
> Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment
> 10 Experiment 11 Experiment 12
> 0 2430.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 1 2870.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
> 2 2890.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
I've been looking at this for the last 2 hours but can't figure out why the columns after column 1 aren't filled with values.
Desired output:
Experiment 1 Experiment 2 Experiment 3 Experiment 4 Experiment 5 Experiment 6 Experiment 7 Experiment 8 Experiment 9 Experiment 10 Experiment 11 Experiment 12
2430 3270 26900 3230 2940 243000 256000 249000 2880 26100 3890 33400
2870 3350 26300 3290 3180 242000 254000 250000 3390 27900 3730 30700
2890 3520 28400 3090 3140 253000 260000 237000 3510 27400 3760 29600
I found the issue.
I had to use .values
So the final line of the loop has to be:
df2[y] = df1.iloc[3*x:3*x+3].values
and I get the right output