Pandas Dataframe Merging

I have a bit of a weird pandas question.
I have a master Dataframe:
a b c
0 22 44 55
1 22 45 22
2 44 23 56
3 45 22 33
I then have a dataframe with different dimensions, which has some overlapping index values and column names:
index col_name new_value
0 a 111
3 b 234
I'm trying to say: if you find a match on index and col_name in the master dataframe, replace the value.
So the output would be:
a b c
0 111 44 55
1 22 45 22
2 44 23 56
3 45 234 33
I've found combine_first, but this doesn't work unless I pivot the second dataframe (which I can't do in this scenario).

This is an update problem:
df.update(updated.pivot(*updated.columns))
df
Out[479]:
a b c
0 111.0 44.0 55
1 22.0 45.0 22
2 44.0 23.0 56
3 45.0 234.0 33
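For reference, here is a self-contained version of this approach (a sketch; note that in pandas 2.x pivot takes keyword-only arguments):
import pandas as pd

df = pd.DataFrame({'a': [22, 22, 44, 45],
                   'b': [44, 45, 23, 22],
                   'c': [55, 22, 56, 33]})
updated = pd.DataFrame({'index': [0, 3],
                        'col_name': ['a', 'b'],
                        'new_value': [111, 234]})

# Pivot the long-form updates into the master frame's shape; update()
# aligns on index and columns and overwrites only the matched cells.
patch = updated.pivot(index='index', columns='col_name', values='new_value')
df.update(patch)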
Or, with positional assignment (note this writes through .values, which only works when the frame has a single dtype, so that .values is a view rather than a copy):
df.values[updated['index'].values, df.columns.get_indexer(updated.col_name)] = updated.new_value.values
df
Out[495]:
a b c
0 111 44 55
1 22 45 22
2 44 23 56
3 45 234 33

How can I get an aggregate sum of the average number of products given between two weeks, with output for each week as shown below, in Pandas?

StartWeek  EndWeek  NumberofWeek  NumberofProduct  Avg number of product per week
       39       41             3               99                              33
       40       45             5              150                              30
       40       42             3               60                              20
       39       40             2               40                              20
       39       41             3               99                              33
So that the output looks like:
Week  Sum Average Product per week
  39                            86
  40                            70
  41                            66
  42                            20
  45                            30
First, for each row, we create the list of weeks it applies to and put it in a new column 'weeks':
df['weeks'] = df.apply(lambda r: np.arange(r['StartWeek'], r['EndWeek'] + 1), axis=1)
df now looks like this:
StartWeek EndWeek NumberofWeek NumberofProduct Av weeks
-- ----------- --------- -------------- ----------------- ---- -------------------
0 39 41 3 99 33 [39 40 41]
1 40 45 5 150 30 [40 41 42 43 44 45]
2 40 42 3 60 20 [40 41 42]
3 39 40 2 40 20 [39 40]
4 39 41 3 99 33 [39 40 41]
Then we explode 'weeks', which duplicates each row once per week it applies to, and aggregate the exploded weeks with a sum:
df.explode('weeks').groupby('weeks', as_index=False)['Av'].sum()
Output:
weeks Av
-- ------- ----
0 39 86
1 40 136
2 41 116
3 42 50
4 43 30
5 44 30
6 45 30
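Putting the explode approach together as a runnable sketch (the shortened column names StartWeek, EndWeek and Av follow the display above and are assumptions about the real data):
import numpy as np
import pandas as pd

df = pd.DataFrame({'StartWeek': [39, 40, 40, 39, 39],
                   'EndWeek':   [41, 45, 42, 40, 41],
                   'Av':        [33, 30, 20, 20, 33]})

# Expand each row into the range of weeks it covers...
df['weeks'] = df.apply(lambda r: np.arange(r['StartWeek'], r['EndWeek'] + 1), axis=1)
# ...then one row per week, summing the per-week averages within each week.
print(df.explode('weeks').groupby('weeks', as_index=False)['Av'].sum())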
You can also use the groupby method (note that this credits each row's average only to its StartWeek):
df = df.groupby(["StartWeek"])["Avg number of product per week"].sum()

How to interchange rows and columns in a pandas DataFrame

I'm reading a CSV file using the pandas library. I want to interchange rows and columns, but the main issue is the Status column: values repeat every three rows, so a plain transpose turns all the row values into columns. Instead I want just three columns, i.e. Confirmed, Recovered and Deceased, for every date. Please find attached a sample input and sample output.
It's a case of using stack() and unstack():
import random
import pandas as pd

s = 10
d = pd.date_range("01-Jan-2021", periods=s)
cols = ["TT", "AN", "AP"]
# Sample data: one row per (date, status) with a random count per state.
df = pd.DataFrame([{**{"Date": dd, "Status": st}, **{c: random.randint(1, 50) for c in cols}}
                   for dd in d
                   for st in ["Confirmed", "Recovered", "Deceased"]])
# Move the state columns into the index, then pull Status back out as columns.
df.set_index(["Date", "Status"]).rename_axis(columns="State").stack().unstack(1)
Before:
Date Status TT AN AP
0 2021-01-01 Confirmed 5 44 17
1 2021-01-01 Recovered 44 5 48
2 2021-01-01 Deceased 27 3 24
3 2021-01-02 Confirmed 33 14 38
4 2021-01-02 Recovered 21 15 6
5 2021-01-02 Deceased 15 37 8
6 2021-01-03 Confirmed 15 20 36
7 2021-01-03 Recovered 18 19 44
8 2021-01-03 Deceased 37 22 1
9 2021-01-04 Confirmed 16 35 37
10 2021-01-04 Recovered 30 45 49
11 2021-01-04 Deceased 35 7 18
After:
Status Confirmed Deceased Recovered
Date State
2021-01-01 TT 5 27 44
AN 44 3 5
AP 17 24 48
2021-01-02 TT 33 15 21
AN 14 37 15
AP 38 8 6
2021-01-03 TT 15 37 18
AN 20 22 19
AP 36 1 44
2021-01-04 TT 16 35 30
AN 35 7 45
AP 37 18 49
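An equivalent reshape (a sketch reusing the df built in the snippet above; pivot with a list index needs pandas >= 1.1) melts the state columns into long form and pivots Status back out:
# Melt the per-state columns into rows, then pivot Status out into columns.
long_df = df.melt(id_vars=["Date", "Status"], var_name="State", value_name="Count")
print(long_df.pivot(index=["Date", "State"], columns="Status", values="Count"))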

How to calculate shift and rolling sum over missing dates without adding them to the data frame in pandas?

I have a data set with dates, customers and income:
Date Customer Income
0 1/1/2018 A 53
1 2/1/2018 A 36
2 3/1/2018 A 53
3 5/1/2018 A 89
4 6/1/2018 A 84
5 8/1/2018 A 84
6 9/1/2018 A 54
7 10/1/2018 A 19
8 11/1/2018 A 44
9 12/1/2018 A 80
10 1/1/2018 B 24
11 2/1/2018 B 100
12 9/1/2018 B 40
13 10/1/2018 B 47
14 12/1/2018 B 10
15 2/1/2019 B 5
For both customers there are missing dates, since they purchased nothing in some months.
I want to add, per customer, the previous month's income and the rolling sum of income over the last year.
Meaning: if a month is missing, the shift(1) column of the next month with income shows 0, and the rolling sum covers 12 calendar months even when there weren't 12 observations.
This is the expected result:
Date Customer Income S(1) R(12)
0 1/1/2018 A 53 0 53
1 2/1/2018 A 36 53 89
2 3/1/2018 A 53 36 142
3 5/1/2018 A 89 0 231
4 6/1/2018 A 84 89 315
5 8/1/2018 A 84 0 399
6 9/1/2018 A 54 84 453
7 10/1/2018 A 19 54 472
8 11/1/2018 A 44 19 516
9 12/1/2018 A 80 44 596
10 1/1/2018 B 24 0 24
11 2/1/2018 B 100 24 124
12 9/1/2018 B 40 0 164
13 10/1/2018 B 47 40 211
14 12/1/2018 B 10 0 221
15 2/1/2019 B 5 0 102
So far, I've added the rows for the missing dates with stack and unstack, but with many dates and customers this explodes the data to millions of rows, most of them zeros, and crashes the kernel.
You can use .shift, with logic that sets S(1) to 0 whenever the gap is more than 31 days.
The rolling-12 calculation requires working out a "Rolling Date" and a list comprehension to decide which rows fall inside the window, then taking a sum per row:
df['Date'] = pd.to_datetime(df['Date']).dt.date
# Previous income within each customer; first row per customer -> 0.
df['S(1)'] = df.groupby('Customer')['Income'].transform('shift').fillna(0)
# Zero the shifted value when the gap to the previous row exceeds 31 days.
s = (df['Date'] - df['Date'].shift()) / np.timedelta64(1, '31D') <= 1
df['S(1)'] = df['S(1)'].where(s, 0).astype(int)
# For each row, sum this customer's income over the year ending at its date.
# (Note: the 'Y' unit for Timedelta is removed in pandas 2.x; use
# pd.DateOffset(years=1) there.)
df['Rolling Date'] = df['Date'] - pd.Timedelta('1Y')
df['R(12)'] = df.apply(lambda d: sum([z for x, y, z in
                                      zip(df['Customer'], df['Date'], df['Income'])
                                      if y > d['Rolling Date']
                                      if y <= d['Date']
                                      if x == d['Customer']]), axis=1)
df = df.drop('Rolling Date', axis=1)
df
Out[1]:
Date Customer Income S(1) R(12)
0 2018-01-01 A 53 0 53
1 2018-02-01 A 36 53 89
2 2018-03-01 A 53 36 142
3 2018-05-01 A 89 0 231
4 2018-06-01 A 84 89 315
5 2018-08-01 A 84 0 399
6 2018-09-01 A 54 84 453
7 2018-10-01 A 19 54 472
8 2018-11-01 A 44 19 516
9 2018-12-01 A 80 44 596
10 2018-01-01 B 24 0 24
11 2018-02-01 B 100 24 124
12 2018-09-01 B 40 0 164
13 2018-10-01 B 47 40 211
14 2018-12-01 B 10 0 221
15 2019-02-01 B 5 0 102
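On recent pandas, a date-based rolling window avoids both the row expansion and the per-row apply; this is a sketch assuming the column names above (rolling with a '365D' offset needs a monotonic DatetimeIndex within each group):
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
df = df.sort_values(['Customer', 'Date'])

# Previous row's income per customer, zeroed when the gap exceeds ~a month.
prev = df.groupby('Customer')['Income'].shift()
gap = df.groupby('Customer')['Date'].diff() > pd.Timedelta(days=31)
df['S(1)'] = prev.where(~gap, 0).fillna(0).astype(int)

# Date-based rolling sum: the window looks back 365 days by date, so the
# missing months need no placeholder rows.
df['R(12)'] = (df.set_index('Date')
                 .groupby('Customer')['Income']
                 .rolling('365D').sum()
                 .to_numpy())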

How to replace last n values of a row with zero

I want to replace the last 2 values of one of the columns with zero. I understand that for NaN values I can use .fillna(0), but I would also like to replace the value in row 6 of the last column.
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 1
7 34 Dimi 65 NaN
I tried df.drop(df.tail(2).index, inplace=True), but that drops the last two rows entirely. The expected output is:
Weight Name Age d_id_max
0 45 Sam 14 2
1 88 Andrea 25 1
2 56 Alex 55 1
3 15 Robin 8 3
4 71 Kia 21 3
5 44 Sia 43 2
6 54 Ryan 45 0
7 34 Dimi 65 0
Before pandas 0.20.0 (a long time ago) this was a job for ix, but it is now deprecated. Instead you can use DataFrame.iloc to select the last rows, together with Index.get_loc for the position of column d_id_max:
df.iloc[-2:, df.columns.get_loc('d_id_max')] = 0
print(df)
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
Or DataFrame.loc, indexing the last index values:
df.loc[df.index[-2:], 'd_id_max'] = 0
Try .iloc with get_loc:
df.iloc[[-1,-2], df.columns.get_loc('d_id_max')] = 0
Out[232]:
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0
You can use:
df['d_id_max'].iloc[-2:] = 0
Note this is chained assignment: it can raise SettingWithCopyWarning, and under pandas 2.x copy-on-write it will not modify df at all, so the loc/iloc forms above are safer.
Weight Name Age d_id_max
0 45 Sam 14 2.0
1 88 Andrea 25 1.0
2 56 Alex 55 1.0
3 15 Robin 8 3.0
4 71 Kia 21 3.0
5 44 Sia 43 2.0
6 54 Ryan 45 0.0
7 34 Dimi 65 0.0

Pandas, getting mean and sum with groupby

I have a data frame, df, which looks like this:
index New Old Map Limit count
1 93 35 54 > 18 1
2 163 93 116 > 18 1
3 134 78 96 > 18 1
4 117 81 93 > 18 1
5 194 108 136 > 18 1
6 125 57 79 <= 18 1
7 66 39 48 > 18 1
8 120 83 95 > 18 1
9 150 98 115 > 18 1
10 149 99 115 > 18 1
11 148 85 106 > 18 1
12 92 55 67 <= 18 1
13 64 24 37 > 18 1
14 84 53 63 > 18 1
15 99 70 79 > 18 1
I need to produce a data frame that looks like this:
Limit <=18 >18
total mean total mean
New xx1 yy1 aa1 bb1
Old xx2 yy2 aa2 bb2
MAP xx3 yy3 aa3 bb3
I tried this without success:
df.groupby('Limit')['New', 'Old', 'MAP'].[sum(), mean()].T
How can I achieve this in pandas?
You can use groupby with agg, then transpose with T and unstack:
print(df[['New', 'Old', 'Map', 'Limit']].groupby('Limit').agg([sum, 'mean']).T.unstack())
Limit <= 18 > 18
sum mean sum mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
Edited per the comments; this looks nicer:
print(df.groupby('Limit')[['New', 'Old', 'Map']].agg([sum, 'mean']).T.unstack())
And if you need the columns named total (note: this dict-renaming form of agg was deprecated in pandas 0.20 and removed in 0.25):
print(df.groupby('Limit')[['New', 'Old', 'Map']]
        .agg({'total': sum, 'mean': 'mean'})
        .T
        .unstack(0))
Limit <= 18 > 18
total mean total mean
New 217.0 108.5 1581.0 121.615385
Old 112.0 56.0 946.0 72.769231
Map 146.0 73.0 1153.0 88.692308
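On recent pandas, where the dict-renaming form of agg is gone, the same table can be produced by renaming the aggregated level instead (a sketch reusing the df from the question):
# Aggregate, relabel 'sum' as 'total', then reshape so Limit becomes the
# top column level with total/mean beneath it.
agg = df.groupby('Limit')[['New', 'Old', 'Map']].agg(['sum', 'mean'])
print(agg.rename(columns={'sum': 'total'}, level=1).T.unstack())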