Suppose I have the following situation:
A DataFrame whose first column, 'ID', may contain duplicated values.
import pandas as pd
df = pd.DataFrame({"ID": [1,2,3,4,4,5,5,5,6,6],
"l_1": [10,12,32,45,45,20,20,20,20,20],
"l_2": [11,12,32,11,21,27,38,12,9,6],
"l_3": [5,9,32,12,21,21,18,12,8,1],
"l_4": [6,21,12,77,77,2,2,2,8,8]})
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 45 21 21 77
5 20 27 21 2
5 20 38 18 2
5 20 12 12 2
6 20 9 8 8
6 20 6 1 8
When duplicated IDs occur:
I need to keep only the first value in columns l_1 and l_4 (the values on the other duplicated rows must be set to zero).
Columns l_2 and l_3 must stay the same.
Whenever an ID is duplicated, the values in columns l_1 and l_4 on those rows are duplicated as well.
Expected output:
ID l_1 l_2 l_3 l_4
1 10 11 5 6
2 12 12 9 21
3 32 32 32 12
4 45 11 12 77
4 0 21 21 0
5 20 27 21 2
5 0 38 18 0
5 0 12 12 0
6 20 9 8 8
6 0 6 1 0
Is there a straightforward way to accomplish this using pandas or NumPy?
I could only accomplish it with all of these steps:
x1 = df[df.duplicated(subset=['ID'], keep=False)].copy()
x1.loc[x1.groupby('ID')['l_1'].apply(lambda x: (x.shift(1) == x)), 'l_1'] = 0
x1.loc[x1.groupby('ID')['l_4'].apply(lambda x: (x.shift(1) == x)), 'l_4'] = 0
df = df.drop_duplicates(subset=['ID'], keep=False)
df = pd.concat([df, x1])
Isn't this just:
df.loc[df.duplicated('ID'), ['l_1','l_4']] = 0
Output:
ID l_1 l_2 l_3 l_4
0 1 10 11 5 6
1 2 12 12 9 21
2 3 32 32 32 12
3 4 45 11 12 77
4 4 0 21 21 0
5 5 20 27 21 2
6 5 0 38 18 0
7 5 0 12 12 0
8 6 20 9 8 8
9 6 0 6 1 0
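To see why the one-liner is enough, it can help to look at the boolean mask it relies on. The sketch below rebuilds the DataFrame from the question; the variable names repeat and cols are only illustrative. By default duplicated() uses keep='first', so the first row of each ID is marked False and every later row with the same ID is marked True.

```python
import pandas as pd

df = pd.DataFrame({
    "ID":  [1, 2, 3, 4, 4, 5, 5, 5, 6, 6],
    "l_1": [10, 12, 32, 45, 45, 20, 20, 20, 20, 20],
    "l_2": [11, 12, 32, 11, 21, 27, 38, 12, 9, 6],
    "l_3": [5, 9, 32, 12, 21, 21, 18, 12, 8, 1],
    "l_4": [6, 21, 12, 77, 77, 2, 2, 2, 8, 8],
})

# True for every row whose ID has already appeared further up
repeat = df.duplicated("ID")
print(repeat.tolist())

# Zero out only the selected columns on those rows; l_2 and l_3 are untouched
cols = ["l_1", "l_4"]
df.loc[repeat, cols] = 0
print(df)
```

Because the assignment goes through a single .loc call on the original frame, there is no need to split the data and concatenate it back, and the original row order is preserved.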
I am new to using Pandas and I have a dataframe df as given below
A B
0 4 5
1 5 8
2 6 11
3 7 13
4 8 15
5 9 30
6 10 477
7 11 3643
8 12 33469
9 13 141409
10 14 335338
11 15 365115
I want to get the difference between each row and the previous row for column B.
I used df.set_index('A').diff(), but it gives NaN for the first row. How can I get 5 there?
A B
4 NaN
5 3.0
6 3.0
7 2.0
8 2.0
9 15.0
10 447.0
11 3166.0
12 29826.0
13 107940.0
14 193929.0
15 29777.0
You can take the diff and fill the leading NaN with the original value of B:
df.B.diff().fillna(df.B)
0 5.0
1 3.0
2 3.0
3 2.0
4 2.0
5 15.0
6 447.0
7 3166.0
8 29826.0
9 107940.0
10 193929.0
11 29777.0
Name: B, dtype: float64
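As a self-contained sketch of the same idea, the snippet below rebuilds the DataFrame from the question. fillna with a Series aligns on the index, so only the first row (the one diff() leaves as NaN) takes its value from B. The cast back to integers at the end is an optional extra that is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list(range(4, 16)),
    "B": [5, 8, 11, 13, 15, 30, 477, 3643, 33469, 141409, 335338, 365115],
})

# diff() leaves NaN in the first row; fillna(df["B"]) fills it with B's own value
out = df["B"].diff().fillna(df["B"])
print(out)

# Optional: no NaN remains, so the result can be cast back to integers
out_int = out.astype("int64")
print(out_int)
```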
How can I solve the same problem as in the linked question, "Sum of group but keep the same value for each row in r", using pandas?
I can generate a separate DataFrame that holds the sum for each group and then merge it with the original.
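You can use groupby and transform as below to get your output. Passing the string 'sum' rather than the built-in sum function avoids a deprecation warning in recent pandas versions.
df['sumx'] = df.groupby(['ID', 'Group'], sort=False)['x'].transform('sum')
df['sumy'] = df.groupby(['ID', 'Group'], sort=False)['y'].transform('sum')
df
Output: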
ID Group x y sumx sumy
1 1 1 1 12 3 25
2 1 1 2 13 3 25
3 1 2 3 14 3 14
4 3 1 4 15 15 48
5 3 1 5 16 15 48
6 3 1 6 17 15 48
7 3 2 7 18 15 37
8 3 2 8 19 15 37
9 4 1 9 20 30 63
10 4 1 10 21 30 63
11 4 1 11 22 30 63
12 4 2 12 23 12 23
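For a runnable version, the sketch below reconstructs the input from the ID, Group, x and y columns of the output above and computes both sums in one transform call. Using add_prefix and join to attach the new columns is just one possible variation; the names sums and result are illustrative.

```python
import pandas as pd

# Input rebuilt from the ID, Group, x and y columns of the output table
df = pd.DataFrame(
    {
        "ID":    [1, 1, 1, 3, 3, 3, 3, 3, 4, 4, 4, 4],
        "Group": [1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 1, 2],
        "x":     list(range(1, 13)),
        "y":     list(range(12, 24)),
    },
    index=range(1, 13),
)

# transform returns a result with the same index as df, so each row
# receives the sum of its own (ID, Group) group
sums = df.groupby(["ID", "Group"], sort=False)[["x", "y"]].transform("sum")

# Rename x -> sumx and y -> sumy, then attach them to the original rows
result = df.join(sums.add_prefix("sum"))
print(result)
```

Unlike the aggregate-then-merge approach mentioned in the question, transform keeps the original shape, so no merge step is needed.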
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(30).reshape(6,5), index=[list('aaabbb'), list('XYZXYZ')])
print(df)
df.loc[pd.IndexSlice['a'], 3] /= 10
print(df)
From the above code I expected the table below:
0 1 2 3 4
a X 0 1 2 0.3 4
Y 5 6 7 0.8 9
Z 10 11 12 1.3 14
b X 15 16 17 18 19
Y 20 21 22 23 24
Z 25 26 27 28 29
But the actual result is the table below:
0 1 2 3 4
a X 0 1 2 NaN 4
Y 5 6 7 NaN 9
Z 10 11 12 NaN 14
b X 15 16 17 18.0 19
Y 20 21 22 23.0 24
Z 25 26 27 28.0 29
What went wrong in the code?
You need to specify the second level with : to select all of its values:
df.loc[pd.IndexSlice['a', :], 3] /= 10
print(df)
0 1 2 3 4
a X 0 1 2 0.3 4
Y 5 6 7 0.8 9
Z 10 11 12 1.3 14
b X 15 16 17 18.0 19
Y 20 21 22 23.0 24
Z 25 26 27 28.0 29
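Solution with slice (slice('a', 'a') selects exactly the label 'a', whereas a bare slice('a') means "everything up to and including 'a'"):
df.loc[(slice('a', 'a'), slice(None)), 3] /= 10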
print(df)
0 1 2 3 4
a X 0 1 2 0.3 4
Y 5 6 7 0.8 9
Z 10 11 12 1.3 14
b X 15 16 17 18.0 19
Y 20 21 22 23.0 24
Z 25 26 27 28.0 29
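A likely explanation for the NaN values: selecting with the bare label 'a' drops the first index level, so the divided result is indexed by X, Y, Z only and no longer aligns with the ('a', X) style keys when it is assigned back. The sketch below compares the two selections. It creates the data as floats, which is an assumption made here to avoid an int-to-float upcast on assignment; the question used integers.

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Float data (an assumption for this sketch; the question used ints)
df = pd.DataFrame(
    np.arange(30, dtype=float).reshape(6, 5),
    index=[list("aaabbb"), list("XYZXYZ")],
)

# Bare label: the first level is dropped, leaving index X, Y, Z
dropped = df.loc["a", 3]
print(dropped.index)

# Label plus ':' for the second level: both levels are kept
kept = df.loc[idx["a", :], 3]
print(kept.index)

# With both levels kept, the in-place division aligns row for row
df.loc[idx["a", :], 3] /= 10
print(df)
```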
We know that a DataFrame can be added to another DataFrame by index.
I have two DataFrames with a MultiIndex, so how can I add them by level 'a' of the index?
data3.head()
c d e
a b
0 1 2 3 4
5 6 7 8 9
10 11 12 13 14
15 16 17 18 19
20 21 22 23 24
data4.head()
b d e
a c
0 2 1 3 4
5 7 6 8 9
10 12 11 13 14
15 17 16 18 19
20 22 21 23 24
25 27 26 28 29
data3 + data4
Error: merging with both multi-indexes is not implemented
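One possible reading of "add them by level 'a'" is to move the level that the two frames do not share back into the columns, so that both are indexed by 'a' alone and ordinary alignment applies. The sketch below is only a guess at the intended result. The frames are a hypothetical reconstruction of the head() output above, and it assumes the values of 'a' are unique within each frame.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the two frames shown above
base = pd.DataFrame(np.arange(30).reshape(6, 5), columns=list("abcde"))
data3 = base.iloc[:5].set_index(["a", "b"])  # index (a, b); columns c, d, e
data4 = base.set_index(["a", "c"])           # index (a, c); columns b, d, e

# Move the non-shared level back to a column so both frames are indexed by 'a'
left = data3.reset_index("b")    # columns b, c, d, e
right = data4.reset_index("c")   # columns c, b, d, e

# Addition now aligns on 'a' and on the column names
total = left + right
print(total)
```

Rows whose 'a' value appears in only one frame come out as NaN; left.add(right, fill_value=0) would keep the available values instead.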