pandas: set Result = 1 for all rows of Type X if one Type X row matches

Here is a simple pandas df:
>>> df
   Type  Var1  Result
0     A     1     NaN
1     A     2     NaN
2     A     3     NaN
3     B     4     NaN
4     B     5     NaN
5     B     6     NaN
6     C     1     NaN
7     C     2     NaN
8     C     3     NaN
9     D     4     NaN
10    D     5     NaN
11    D     6     NaN
The object of the exercise is: if Var1 == 3 in any row, set Result = 1 for every row of that row's Type.
This finds the rows with 3 in Var1 and sets Result to 1,
df['Result'] = df['Var1'].apply(lambda x: 1 if x == 3 else 0)
but I can't figure out how to then catch all the same Type and make them 1. In this case it should be all the As and all the Cs. Doesn't have to be a one-liner.
Any tips please?

Create a boolean mask, then map True/False to 1/0 by converting the values to integers:
df['Result'] = df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']).astype(int)
# alternative (requires: import numpy as np)
df['Result'] = np.where(df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']), 1, 0)
print (df)
   Type  Var1  Result
0     A     1       1
1     A     2       1
2     A     3       1
3     B     4       0
4     B     5       0
5     B     6       0
6     C     1       1
7     C     2       1
8     C     3       1
9     D     4       0
10    D     5       0
11    D     6       0
Details:
Get all Type values that match the condition:
print (df.loc[df['Var1'].eq(3), 'Type'])
2    A
8    C
Name: Type, dtype: object
Test the original Type column against the filtered types:
print (df['Type'].isin(df.loc[df['Var1'].eq(3), 'Type']))
0      True
1      True
2      True
3     False
4     False
5     False
6      True
7      True
8      True
9     False
10    False
11    False
Name: Type, dtype: bool
Or use GroupBy.transform with any to test whether at least one value per group matches; this solution is slower on larger DataFrames:
df['Result'] = df['Var1'].eq(3).groupby(df['Type']).transform('any').astype(int)
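Putting the pieces above together, a minimal self-contained sketch of the boolean-mask approach (recreating the question's sample frame):

```python
import pandas as pd

df = pd.DataFrame({
    "Type": list("AAABBBCCCDDD"),
    "Var1": [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6],
})

# Types that have at least one row with Var1 == 3 (here: A and C)
flagged = df.loc[df["Var1"].eq(3), "Type"]

# Mark every row whose Type is among the flagged types
df["Result"] = df["Type"].isin(flagged).astype(int)
```
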

Related

Pandas: new column where value is based on a specific value within subgroup

I have a dataframe where I want to create a new column ("NewValue") that, for each "Group", takes the "Value" from the row with SubGroup == "A".
   Group SubGroup  Value  NewValue
0      1        A      1         1
1      1        B      2         1
2      2        A      3         3
3      2        C      4         3
4      3        B      5       NaN
5      3        C      6       NaN
Can this be achieved using a groupby / transform function?
Use Series.map with a DataFrame filtered by boolean indexing:
df['NewValue'] = df['Group'].map(df[df.SubGroup.eq('A')].set_index('Group')['Value'])
print (df)
   Group SubGroup  Value  NewValue
0      1        A      1       1.0
1      1        B      2       1.0
2      2        A      3       3.0
3      2        C      4       3.0
4      3        B      5       NaN
5      3        C      6       NaN
Alternative: a left join with DataFrame.merge after renaming the column:
df1 = df.loc[df.SubGroup.eq('A'),['Group','Value']].rename(columns={'Value':'NewValue'})
df = df.merge(df1, how='left')
print (df)
   Group SubGroup  Value  NewValue
0      1        A      1       1.0
1      1        B      2       1.0
2      2        A      3       3.0
3      2        C      4       3.0
4      3        B      5       NaN
5      3        C      6       NaN
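To see why Series.map works here, a runnable sketch (recreating the sample data) that exposes the intermediate lookup Series:

```python
import pandas as pd

df = pd.DataFrame({
    "Group": [1, 1, 2, 2, 3, 3],
    "SubGroup": ["A", "B", "A", "C", "B", "C"],
    "Value": [1, 2, 3, 4, 5, 6],
})

# The lookup Series that Series.map consumes: Group -> Value where SubGroup == 'A'
# Group 1 -> 1, Group 2 -> 3; Group 3 has no 'A' row, so it maps to NaN
lookup = df[df.SubGroup.eq("A")].set_index("Group")["Value"]

df["NewValue"] = df["Group"].map(lookup)
```
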

How to update the value in a column based on another value in the same column where both rows have the same value in another column?

The dataframe df is given:
   ID  I  J    K
0  10  1  a    1
1  10  2  b  nan
2  10  3  c  nan
3  11  1  f    0
4  11  2  b  nan
5  11  3  d  nan
6  12  1  b    1
7  12  2  d  nan
8  12  3  c  nan
For each unique ID: in the row where I == 3, if J == 'c', then set K = 1 in that ID's row where I == 1, else set K = 0 there. The other values in K do not matter. In other words, the value of K in rows 0, 3, and 6 is determined by the value of J in rows 2, 5, and 8 respectively.
Try:
IDs = df.loc[df.I.eq(3) & df.J.eq("c"), "ID"]
df["K"] = np.where(df["ID"].isin(IDs) & df.I.eq(1), 1, 0)
df["K"] = np.where(df.I.eq(1), df.K, np.nan) # <-- if you want other values NaNs
print(df)
Prints:
   ID  I  J    K
0  10  1  a  1.0
1  10  2  b  NaN
2  10  3  c  NaN
3  11  1  f  0.0
4  11  2  b  NaN
5  11  3  d  NaN
6  12  1  b  1.0
7  12  2  d  NaN
8  12  3  c  NaN
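The two steps can be traced with a self-contained sketch recreating the sample frame (note that K ends up as float because NaN forces the dtype):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [10, 10, 10, 11, 11, 11, 12, 12, 12],
    "I":  [1, 2, 3, 1, 2, 3, 1, 2, 3],
    "J":  list("abcfbdbdc"),
})

# Step 1: IDs whose I == 3 row has J == 'c' (here: 10 and 12)
IDs = df.loc[df.I.eq(3) & df.J.eq("c"), "ID"]

# Step 2: K = 1 on the I == 1 row of those IDs, 0 on other I == 1 rows
df["K"] = np.where(df["ID"].isin(IDs) & df.I.eq(1), 1, 0)
df["K"] = np.where(df.I.eq(1), df["K"], np.nan)  # all other rows -> NaN
```
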

Average per category per last N round in Pandas and lag it

I have a following problem.
I want to compute the mean of the last 2 observations per name (by round) and lag it by one round. See the following example:
df = pd.DataFrame(data={'name': ["a", "a", "a", "a", "b", "b", "c"],
                        'value': [6, 5, 4, 3, 1, 2, 1],
                        'round': [1, 2, 3, 4, 1, 2, 1]})
Desired output is:
df = pd.DataFrame(data={'name': ["a", "a", "a", "a", "b", "b", "c"],
                        'value': [6, 5, 4, 3, 1, 2, 1],
                        'round': [1, 2, 3, 4, 1, 2, 1],
                        'mean_last_2': ["NaN", "NaN", 5.5, 4.5, "NaN", "NaN", "NaN"]})
I tried this, but got "AttributeError: 'float' object has no attribute 'shift'":
df['mean_last_2'] = df.groupby("name")['value'].apply(lambda x: x.tail(2).mean().shift(1))
How can I fix it please?
You could try something like this:
df['mean_last_2'] = df.groupby('name')['value'].apply(lambda x: x.rolling(2).mean().shift())
Output:
  name  value  round  mean_last_2
0    a      6      1          NaN
1    a      5      2          NaN
2    a      4      3          5.5
3    a      3      4          4.5
4    b      1      1          NaN
5    b      2      2          NaN
6    c      1      1          NaN
You can do something like
df.groupby("name").apply(lambda d: d.assign(mean_last_2 = d['value'].rolling(2).mean().shift()))
to get
       name  value  round  mean_last_2
name
a    0    a      6      1          NaN
     1    a      5      2          NaN
     2    a      4      3          5.5
     3    a      3      4          4.5
b    4    b      1      1          NaN
     5    b      2      2          NaN
c    6    c      1      1          NaN
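As a further alternative (my suggestion, not from the answers above): GroupBy.transform preserves the original index, so the result aligns back to df without the extra name level that apply on the whole frame introduces:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["a", "a", "a", "a", "b", "b", "c"],
    "value": [6, 5, 4, 3, 1, 2, 1],
    "round": [1, 2, 3, 4, 1, 2, 1],
})

# rolling(2).mean() needs 2 observations per group; shift() lags it one round
df["mean_last_2"] = (
    df.groupby("name")["value"]
      .transform(lambda s: s.rolling(2).mean().shift())
)
```
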

Assign column values from another dataframe with repeating key values

Please help me with this in pandas, I can't find a good solution.
I tried map, assign, merge, join, set_index.
Maybe I am just too tired :)
df:
   m_num  A  B
0      1  0  9
1      1  1  8
2      2  2  7
3      2  3  6
4      3  4  5
5      3  5  4
df1:
   m_num   C
0      2  99
1      2  88
df_final:
   m_num  A  B    C
0      1  0  9  NaN
1      1  1  8  NaN
2      2  2  7   99
3      2  3  6   88
4      3  4  5  NaN
5      3  5  4  NaN
Try:
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
df2 = pd.merge(df2,df1,on=[df1.index,'m_num']).drop('key_0',axis=1)
df2 = pd.merge(df,df2,on=['m_num','A','B'],how='left')
print(df2)
Prints:
   m_num  A  B     C
0      1  0  9   NaN
1      1  1  8   NaN
2      2  2  7  99.0
3      2  3  6  88.0
4      3  4  5   NaN
5      3  5  4   NaN
Explanation:
There may be better solutions out there, but this was my thought process. The problem is slightly tricky because 'm_num' is the only common key and it has repeating values.
So first I created a dataframe matching df and df1 so that I can use the index as another key for the subsequent merge.
df2 = df[df['m_num'].isin(df1['m_num'])].reset_index(drop=True)
This prints:
   m_num  A  B
0      2  2  7
1      2  3  6
As you can see above, we now have the index values 0 and 1 in addition to m_num as keys, which we can use to match with df1.
df2 = pd.merge(df2,df1,on=[df1.index,'m_num']).drop('key_0',axis=1)
This prints:
   m_num  A  B   C
0      2  2  7  99
1      2  3  6  88
Then tie the above resultant dataframe to the original df and do a left join to get the output.
df2 = pd.merge(df,df2,on=['m_num','A','B'],how='left')
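A possibly simpler sketch of the same idea (my own variant, not the answer's code): make the repeating key unique with GroupBy.cumcount, then merge on the (key, occurrence) pair:

```python
import pandas as pd

df = pd.DataFrame({
    "m_num": [1, 1, 2, 2, 3, 3],
    "A": [0, 1, 2, 3, 4, 5],
    "B": [9, 8, 7, 6, 5, 4],
})
df1 = pd.DataFrame({"m_num": [2, 2], "C": [99, 88]})

# Number the repeats within each m_num so (m_num, occ) is a unique key
df["occ"] = df.groupby("m_num").cumcount()
df1["occ"] = df1.groupby("m_num").cumcount()

out = df.merge(df1, on=["m_num", "occ"], how="left").drop(columns="occ")
```

This pairs the first repeated 2 in df with the first row of df1, the second with the second, matching the desired df_final.
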

Meaning of mode() in pandas

df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})
df5.mode()
     A   B
0  1.0  -9
1  NaN  10
2  NaN  13
Where does the NaN come from here?
The reason becomes clear if you check DataFrame.mode:
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
So the NaN appears because column A has only one mode value while column B has 3 mode values; the shorter column is padded with missing values to the same number of rows.
With my seeded sample data below it is the other way around: A has 2 modes and B only one, because 2 and 3 each appear 11 times in A:
np.random.seed(20)
df5 = pd.DataFrame({"A": np.random.randint(0, 7, size=50),
                    "B": np.random.randint(-10, 15, size=50)})
print (df5.mode())
   A    B
0  2  8.0
1  3  NaN
print (df5.A.value_counts())
3    11   <- both top1
2    11   <- both top1
6     9
5     8
0     5
1     4
4     2
Name: A, dtype: int64
print (df5.B.value_counts())
 8     6   <- only one top1
 0     4
 4     4
-4     3
10     3
-2     3
 1     3
12     3
 6     3
 7     2
 3     2
 5     2
-9     2
-6     2
14     2
 9     2
-1     1
11     1
-3     1
-7     1
Name: B, dtype: int64
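A small deterministic frame (made up here, no random seed needed) makes the padding visible directly:

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 1, 1, 2, 3],   # single mode: 1
    "B": [5, 5, 6, 6, 7],   # two modes: 5 and 6
})

# B has two rows of modes, so A is padded with NaN (which also makes A float)
print(df.mode())
```
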