How to create a column based on conditions on other rows - Pandas DataFrame?

I have the following problem:
A DataFrame named df1, like this:
Id PVF PM_year Year
0 A6489 75 25 2018
1 A175 56 54 2018
2 A2856 34 65 2018
3 A6489 35 150 2019
4 A175 45 700 2019
5 A2856 55 120 2019
6 A6489 205 100 2020
7 A2856 35 445 2020
I want to create a new column named PM_previous_year which, for each (Id, Year) combination, is equal to the PM_year value of the same Id in the previous year...
Example:
For the line indexed 3, the Id is 'A6489' and the Year is 2019. So the value of the new column "PM_previous_year" should be the value of the line where the Id is the same ('A6489') and the Year is 2018 (2019 - 1). In this simple example that corresponds to the line indexed 0, so the value expected for the new column on line indexed 3 is 25.
Finally, the target DataFrame df2 for this short example looks like this:
Id PVF PM_year Year PM_previous_year
0 A6489 75 25 2018 NaN
1 A175 56 54 2018 NaN
2 A2856 34 65 2018 NaN
3 A6489 35 150 2019 25.0
4 A175 45 700 2019 54.0
5 A2856 55 120 2019 65.0
6 A6489 205 100 2020 150.0
7 A2856 35 445 2020 120.0
I haven't found any obvious solution yet. Maybe there is a way by reshaping the df, but I'm not very familiar with that.
If somebody has any idea, I would be very grateful.
Thanks

If possible, simplify the solution by shifting PM_year per Id:
df['PM_previous_year'] = df.groupby('Id')['PM_year'].shift()
print(df)
Id PVF PM_year Year PM_previous_year
0 A6489 75 25 2018 NaN
1 A175 56 54 2018 NaN
2 A2856 34 65 2018 NaN
3 A6489 35 150 2019 25.0
4 A175 45 700 2019 54.0
5 A2856 55 120 2019 65.0
6 A6489 205 100 2020 150.0
7 A2856 35 445 2020 120.0
Or:
s = df.pivot(index='Year', columns='Id', values='PM_year').shift().unstack().rename('PM_previous_year')
df = df.join(s, on=['Id', 'Year'])
print(df)
Id PVF PM_year Year PM_previous_year
0 A6489 75 25 2018 NaN
1 A175 56 54 2018 NaN
2 A2856 34 65 2018 NaN
3 A6489 35 150 2019 25.0
4 A175 45 700 2019 54.0
5 A2856 55 120 2019 65.0
6 A6489 205 100 2020 150.0
7 A2856 35 445 2020 120.0
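A caveat: groupby('Id').shift() assumes the rows are sorted by Year within each Id and that the years are consecutive. A merge-based sketch that matches on Year - 1 explicitly, and therefore tolerates gaps, assuming the df from the question:
import pandas as pd

# shift each row's Year forward by one, so it lines up with the following year
prev = df[['Id', 'Year', 'PM_year']].copy()
prev['Year'] += 1
prev = prev.rename(columns={'PM_year': 'PM_previous_year'})
# left join: rows without a matching (Id, previous year) get NaN
df2 = df.merge(prev, on=['Id', 'Year'], how='left')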

Related

How to interchange rows and columns in pandas dataframe

I'm reading a CSV file using the pandas library. I want to interchange rows and columns, but the main issue is the Status column: its values repeat every three rows, so a plain transpose turns all the row values into columns. Instead I want only three columns, i.e. Confirmed, Recovered, Deceased, for every date. Please find the attachment where I have shown sample input as well as sample output.
It's a case of using stack() and unstack():
import random
import pandas as pd

# build a sample frame: one row per (Date, Status) combination,
# with random counts for each state column
s = 10
d = pd.date_range("01-Jan-2021", periods=s)
cols = ["TT", "AN", "AP"]
df = pd.DataFrame([{**{"Date": dd, "Status": st}, **{c: random.randint(1, 50) for c in cols}}
                   for dd in d
                   for st in ["Confirmed", "Recovered", "Deceased"]])
# push the state columns into the index, then pivot Status back out as columns
df.set_index(["Date", "Status"]).rename_axis(columns="State").stack().unstack(1)
before
Date Status TT AN AP
0 2021-01-01 Confirmed 5 44 17
1 2021-01-01 Recovered 44 5 48
2 2021-01-01 Deceased 27 3 24
3 2021-01-02 Confirmed 33 14 38
4 2021-01-02 Recovered 21 15 6
5 2021-01-02 Deceased 15 37 8
6 2021-01-03 Confirmed 15 20 36
7 2021-01-03 Recovered 18 19 44
8 2021-01-03 Deceased 37 22 1
9 2021-01-04 Confirmed 16 35 37
10 2021-01-04 Recovered 30 45 49
11 2021-01-04 Deceased 35 7 18
after
Status Confirmed Deceased Recovered
Date State
2021-01-01 TT 5 27 44
AN 44 3 5
AP 17 24 48
2021-01-02 TT 33 15 21
AN 14 37 15
AP 38 8 6
2021-01-03 TT 15 37 18
AN 20 22 19
AP 36 1 44
2021-01-04 TT 16 35 30
AN 35 7 45
AP 37 18 49
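As an alternative, the same reshape can be sketched with melt plus pivot (column names as in the generated df above; list-valued index arguments need pandas >= 1.1, and the row order within each date may differ):
# melt the state columns into long format, then pivot Status back out
out = (df.melt(id_vars=["Date", "Status"], var_name="State", value_name="Count")
         .pivot(index=["Date", "State"], columns="Status", values="Count"))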

How to calculate shift and rolling sum over missing dates without adding them to data frame in Pandas?

I have a data set with dates, customers and income:
Date Customer Income
0 1/1/2018 A 53
1 2/1/2018 A 36
2 3/1/2018 A 53
3 5/1/2018 A 89
4 6/1/2018 A 84
5 8/1/2018 A 84
6 9/1/2018 A 54
7 10/1/2018 A 19
8 11/1/2018 A 44
9 12/1/2018 A 80
10 1/1/2018 B 24
11 2/1/2018 B 100
12 9/1/2018 B 40
13 10/1/2018 B 47
14 12/1/2018 B 10
15 2/1/2019 B 5
For both customers there are missing dates, as they purchased nothing in some months.
I want to add, per customer, the income of the previous month and also the rolling sum of income over the last year.
Meaning, if there's a missing month, I'll see '0' in the shift(1) column of the following month that has income, and I'll see a rolling sum over 12 months even if there weren't 12 observations.
This is the expected result:
Date Customer Income S(1) R(12)
0 1/1/2018 A 53 0 53
1 2/1/2018 A 36 53 89
2 3/1/2018 A 53 36 142
3 5/1/2018 A 89 0 231
4 6/1/2018 A 84 89 315
5 8/1/2018 A 84 0 399
6 9/1/2018 A 54 84 453
7 10/1/2018 A 19 54 472
8 11/1/2018 A 44 19 516
9 12/1/2018 A 80 44 596
10 1/1/2018 B 24 0 24
11 2/1/2018 B 100 24 124
12 9/1/2018 B 40 0 164
13 10/1/2018 B 47 40 211
14 12/1/2018 B 10 0 221
15 2/1/2019 B 5 0 102
So far, I've added the rows with missing dates using stack and unstack, but with many dates and customers this explodes the data to millions of rows, crashing the kernel, with most rows being 0's.
You can use .shift, but add logic so that if the gap is greater than 31 days, S(1) becomes 0.
The rolling 12-month calculation requires figuring out the "Rolling Date" and doing a somewhat complicated list comprehension to decide whether or not to include a value, then taking the sum per row.
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# previous row's income per customer; each customer's first row gets 0
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
# zero the shifted value when the gap to the previous row exceeds 31 days
s = (df['Date'] - df['Date'].shift()) <= pd.Timedelta(days=31)
df['S(1)'] = df['S(1)'].where(s, 0).astype(int)
# start of the one-year lookback window for each row
df['Rolling Date'] = df['Date'] - pd.DateOffset(years=1)
# sum every income of the same customer inside (Rolling Date, Date]
df['R(12)'] = df.apply(lambda d: sum(z for x, y, z in
                                     zip(df['Customer'], df['Date'], df['Income'])
                                     if y > d['Rolling Date']
                                     if y <= d['Date']
                                     if x == d['Customer']), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]:
Date Customer Income S(1) R(12)
0 2018-01-01 A 53 0 53
1 2018-02-01 A 36 53 89
2 2018-03-01 A 53 36 142
3 2018-05-01 A 89 0 231
4 2018-06-01 A 84 89 315
5 2018-08-01 A 84 0 399
6 2018-09-01 A 54 84 453
7 2018-10-01 A 19 54 472
8 2018-11-01 A 44 19 516
9 2018-12-01 A 80 44 596
10 2018-01-01 B 24 0 24
11 2018-02-01 B 100 24 124
12 2018-09-01 B 40 0 164
13 2018-10-01 B 47 40 211
14 2018-12-01 B 10 0 221
15 2019-02-01 B 5 0 102
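For larger frames, a vectorized sketch of the rolling part with a time-based window, assuming Date is already datetime and the frame is sorted by Customer and Date (the '365D' offset approximates the one-year lookback and ignores leap years):
df = df.sort_values(['Customer', 'Date'])
# offset windows cover (Date - 365 days, Date], matching the > / <= logic above
df['R(12)'] = (df.set_index('Date')
                 .groupby('Customer')['Income']
                 .rolling('365D').sum()
                 .to_numpy())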

How to replace last n values of a row with zero

I want to replace the last 2 values of one of the columns with zero. I understand that for NaN values I can use .fillna(0), but I would also like to replace the value in row 6 of the last column.
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 1
7 34 Dimi 65 NaN
I tried df.drop(df.tail(2).index, inplace=True), but that drops the rows instead of zeroing them. The expected output is:
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 0
7 34 Dimi 65 0
Before pandas 0.20.0 (a long time ago) this was a job for ix, but it is now deprecated. Instead, use DataFrame.iloc to select the last rows, together with Index.get_loc for the position of the column d_id_max:
df.iloc[-2:, df.columns.get_loc('d_id_max')] = 0
print(df)
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
Or DataFrame.loc with the last index values:
df.loc[df.index[-2:], 'd_id_max'] = 0
Try .iloc and get_loc:
df.iloc[[-1,-2], df.columns.get_loc('d_id_max')] = 0
Out[232]:
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
You can also use chained indexing, though note this is chained assignment: it can raise SettingWithCopyWarning and will not modify df under pandas copy-on-write, so the .loc/.iloc forms above are safer:
df['d_id_max'].iloc[-2:] = 0
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
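If the number of trailing rows varies, a small generalized sketch (the helper name is made up here):
def zero_last_n(frame, column, n):
    # positionally select the last n rows of the given column, as in the iloc answers above
    frame.iloc[-n:, frame.columns.get_loc(column)] = 0
    return frame

zero_last_n(df, 'd_id_max', 2)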

Pandas Dataframe Merging

I have a bit of a weird pandas question.
I have a master DataFrame:
a b c
0 22 44 55
1 22 45 22
2 44 23 56
3 45 22 33
I then have a DataFrame with different dimensions, which has some overlapping indexes and column names:
index col_name new_value
0 a 111
3 b 234
I'm trying to say: if you find a match on index and col_name in the master DataFrame, replace the value.
So the output would be:
a b c
0 111 44 55
1 22 45 22
2 44 23 56
3 45 234 33
I've found combine_first, but this doesn't work unless I pivot the second DataFrame (which I can't do in this scenario).
This is an update problem:
df.update(updated.pivot(index='index', columns='col_name', values='new_value'))
df
Out[479]:
a b c
0 111.0 44.0 55
1 22.0 45.0 22
2 44.0 23.0 56
3 45.0 234.0 33
Or:
df.values[updated['index'].values, df.columns.get_indexer(updated.col_name)] = updated.new_value.values
df
Out[495]:
a b c
0 111 44 55
1 22 45 22
2 44 23 56
3 45 234 33
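A plain-loop sketch of the same update, slower but explicit, assuming the column names from the question:
# write each (index, col_name, new_value) triple into the master frame
for _, row in updated.iterrows():
    df.loc[row['index'], row['col_name']] = row['new_value']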

Repeat rows depending upon course duration in years

I have a dataframe that needs to repeat itself.
from io import StringIO
import pandas as pd
audit_trail = StringIO('''
course_id AcademicYear_to months TotalFee
260 2017 24 100
260 2018 12 140
274 2016 36 300
274 2017 24 340
274 2018 12 200
285 2017 24 300
285 2018 12 200
''')
df11 = pd.read_csv(audit_trail, sep=" " )
For course id 260 there are 2 entries, one per year: 2017 and 2018. I need to repeat the years for the month groups.
I will get 2 more rows: 2018 for months 24 and 2017 for months 12. The final DataFrame will look like this...
audit_trail = StringIO('''
course_id AcademicYear_to months TotalFee
260 2017 24 100
260 2018 24 100
260 2017 12 140
260 2018 12 140
274 2016 36 300
274 2017 36 300
274 2018 36 300
274 2016 24 340
274 2017 24 340
274 2018 24 340
274 2016 12 200
274 2017 12 200
274 2018 12 200
285 2017 24 300
285 2018 24 300
285 2017 12 200
285 2018 12 200
''')
df12 = pd.read_csv(audit_trail, sep=" " )
I tried to concat the same DataFrame twice, but that does not solve the problem: I need to change the years, and for 36 months the data needs to be repeated 3 times.
pd.concat([df11, df11])
The groupby object returns the years per course; I simply need to join the years in each group back to the original DataFrame.
df11.groupby('course_id')['AcademicYear_to'].apply(list)
260 [2017, 2018]
274 [2016, 2017, 2018]
285 [2017, 2018]
A simple join can work if the number of records matches the number of years, e.g. course id 274 has 48 months and 285 has a duration of 24 months, with 3 and 2 entries respectively. The problem is course id 260, which is a 24-month course but has only 1 entry; the join will not return the second year for that course.
df11=pd.read_csv('https://s3.amazonaws.com/todel162/myso.csv')
df11.course_id.value_counts()
274 3
285 2
260 1
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
Is it possible to write a query that also considers the number of months?
The following query returns the records where a simple join will not give the expected results.
df11=pd.read_csv('https://s3.amazonaws.com/todel162/myso.csv')
df11['m1'] = df11.groupby('course_id').course_id.transform(lambda x: x.count() * 12)
df11.query('m1 != duration_inmonths')
df11.course_id.value_counts()
274 3
285 2
260 1
df=df11.merge(df11[['course_id']], on='course_id')
df.course_id.value_counts()
274 9
285 4
260 1
The expected count in this case is
274 6
285 4
260 2
This is because even though there are 3 years for id 274, the course duration is only 24 months. And even though there is only 1 record for 260, since the duration is 24 months it should return 2 records (one for the current year and one for current_year + 1).
IIUC we can merge df11 to itself:
In [14]: df11.merge(df11[['course_id']], on='course_id')
Out[14]:
course_id AcademicYear_to months TotalFee
0 260 2017 24 100
1 260 2017 24 100
2 260 2018 12 140
3 260 2018 12 140
4 274 2016 36 300
5 274 2016 36 300
6 274 2016 36 300
7 274 2017 24 340
8 274 2017 24 340
9 274 2017 24 340
10 274 2018 12 200
11 274 2018 12 200
12 274 2018 12 200
13 285 2017 24 300
14 285 2017 24 300
15 285 2018 12 200
16 285 2018 12 200
Not Pretty!
def f(x):
    # build the full cartesian product of this course's months x years
    idx = x.index.remove_unused_levels()
    idx = pd.MultiIndex.from_product(idx.levels, names=idx.names)
    return x.reindex(idx)

# the reindexed NaNs are filled with the first fee of each (course, months) group
df11.set_index(['months', 'AcademicYear_to']) \
    .groupby('course_id').TotalFee.apply(f) \
    .groupby(level=[0, 1]).transform('first') \
    .astype(df11.TotalFee.dtype).reset_index()
course_id months AcademicYear_to TotalFee
0 260 24 2017 100
1 260 24 2018 100
2 260 12 2017 140
3 260 12 2018 140
4 274 12 2016 200
5 274 12 2017 200
6 274 12 2018 200
7 274 24 2016 340
8 274 24 2017 340
9 274 24 2018 340
10 274 36 2016 300
11 274 36 2017 300
12 274 36 2018 300
13 285 24 2017 300
14 285 24 2018 300
15 285 12 2017 200
16 285 12 2018 200
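For the first target (df12), a merge-based sketch that pairs every (months, TotalFee) group with every year recorded for its course, assuming df11 as defined above:
# all years seen per course
years = df11[['course_id', 'AcademicYear_to']].drop_duplicates()
# one row per fee group, cross-joined with that course's years
df12 = (df11.drop(columns='AcademicYear_to')
            .drop_duplicates()
            .merge(years, on='course_id')
            [['course_id', 'AcademicYear_to', 'months', 'TotalFee']])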