Pandas New Variable Based On Multiple Conditions - pandas

I have spent two days searching, any help would be appreciated.
Trying to create c_flg based on values in other columns.
a_flg  b_flg  Count  c_flg (expected output)
False  True   3      False
True   False  2      False
False  False  4      True
a_flg and b_flg are strings; Count is an int.
Approaching from two angles, neither successful.
Method 1:
df['c_flg'] = np.where((df[(df['a_flg'] == 'False') &
                           (df['b_flg'] == 'False') &
                           (df['Count'] <= 6)]), 'True', 'False')
ValueError: Length of values does not match length of index
Method 2:
def test_func(df):
    if (('a_flg' == 'False') &
        ('b_flg' == 'False') &
        ('Count' <= 6)):
        return True
    else:
        return False

df['c_flg'] = df.apply(test_func, axis=1)
TypeError: ('unorderable types: str() <= int()', 'occurred at index 0')
Very new to the Python language, help would be appreciated.

If I understand your problem correctly, then you need this:
df['c_flg'] = (df['a_flg'] == 'False') & (df['b_flg'] == 'False') & (df['Count'] <= 6)
df['c_flg'] = (df['a_flg'] == False) & (df['b_flg'] == False) & (df['Count'] <= 6)  # use this if 'x_flg' is boolean
Output:
a_flg b_flg Count c_flg
0 False True 3 False
1 True False 2 False
2 False False 4 True
Note: for this problem you don't really need numpy; pandas itself can solve it without any problem.

I believe np.where is not necessary here; use ~ to invert a boolean mask and chain conditions with & for bitwise AND:
print (df.dtypes)
a_flg bool
b_flg bool
Count int64
dtype: object
df['c_flg'] = ~df['a_flg'] & ~df['b_flg'] & (df['Count'] <= 6)
print (df)
a_flg b_flg Count c_flg
0 False True 3 False
1 True False 2 False
2 False False 4 True
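If the flag columns actually arrive as strings rather than booleans, one way to normalize them first is to map the literals to real booleans so that ~ and & behave as expected (a sketch; the sample data is reconstructed from the question):

```python
import pandas as pd

df = pd.DataFrame({'a_flg': ['False', 'True', 'False'],
                   'b_flg': ['True', 'False', 'False'],
                   'Count': [3, 2, 4]})

# Map the literal strings to real booleans.
for col in ['a_flg', 'b_flg']:
    df[col] = df[col].map({'True': True, 'False': False})

# Now the mask-based solution applies directly.
df['c_flg'] = ~df['a_flg'] & ~df['b_flg'] & (df['Count'] <= 6)
print(df['c_flg'].tolist())  # [False, False, True]
```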

Related

From a set of columns with true/false values, say which column has a True value

I have a df with several columns which have only True/False values.
I want to create another column whose value will tell me which columns have a True value.
Here's an example:
index  bol_1  bol_2  bol_3  criteria
1      True   False  False  bol_1
2      False  True   False  bol_2
3      True   True   False  [bol_1, bol_2]
My objective is to know which rows have True values (at least 1), and which columns are responsible for those True values. I want to be able to do some basic statistics on this new column, e.g. for how many rows is bol_1 the unique column to have a True value.
Use DataFrame.select_dtypes to pick the boolean columns, convert the column names to an array, and filter the True values in a list comprehension:
df1 = df.select_dtypes(bool)
cols = df1.columns.to_numpy()
df['criteria'] = [list(cols[x]) for x in df1.to_numpy()]
print (df)
bol_1 bol_2 bol_3 criteria
1 True False False [bol_1]
2 False True False [bol_2]
3 True True False [bol_1, bol_2]
If performance is not important use DataFrame.apply:
df['criteria'] = df1.apply(lambda x: cols[x], axis=1)
A possible solution:
df.assign(criteria=df.apply(lambda x: list(df.columns[1:][x[1:] == True]), axis=1))
Output:
index bol_1 bol_2 bol_3 criteria
0 1 True False False [bol_1]
1 2 False True False [bol_2]
2 3 True True False [bol_1, bol_2]
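For the follow-up statistics mentioned in the question (e.g. how often bol_1 is the only True column), one possible sketch on top of the first solution:

```python
import pandas as pd

df = pd.DataFrame({'bol_1': [True, False, True],
                   'bol_2': [False, True, True],
                   'bol_3': [False, False, False]}, index=[1, 2, 3])

df1 = df.select_dtypes(include=bool)
cols = df1.columns.to_numpy()
df['criteria'] = [list(cols[x]) for x in df1.to_numpy()]

# Rows where bol_1 is the *unique* True column:
unique_bol_1 = df['criteria'].apply(lambda c: c == ['bol_1']).sum()
print(unique_bol_1)  # 1

# How often each column appears in criteria at all:
print(df['criteria'].explode().value_counts())
```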

How to check if a row does not exist in another column?

import pandas as pd
import numpy as np
from numpy.random import randint
dict_1 = {'Col1':[1,1,1,1,2,4,5,6,7],'Col2':[3,3,3,3,2,4,5,6,7]}
df = pd.DataFrame(dict_1)
filt = df.apply(lambda x: x['Col2'] not in df['Col1'],axis = 1)
print(filt)
That's what I tried; the expected output is:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
The given result is
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 False
It is only giving false no matter what I do, and I am not sure how to fix that.
IIUC, here's one way:
filt = ~df.Col2.isin(df.Col1.unique())
OUTPUT:
0 True
1 True
2 True
3 True
4 False
5 False
6 False
7 False
8 False
In general, using df.COLUMN attribute notation has the drawback that it is not always obvious how to reference columns; bracket notation is more explicit:
~df["Col2"].isin(df["Col1"].unique())
Remember that when using square brackets instead of dot notation, single square brackets return a Series, while double square brackets return a DataFrame.
isinstance(df["Col2"], pandas.Series)
OUTPUT:
True
Versus
isinstance(df[["Col2"]], pandas.DataFrame)
OUTPUT:
True
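Putting it together on the question's frame (a sketch):

```python
import pandas as pd

df = pd.DataFrame({'Col1': [1, 1, 1, 1, 2, 4, 5, 6, 7],
                   'Col2': [3, 3, 3, 3, 2, 4, 5, 6, 7]})

# Vectorized membership test: True where Col2's value never occurs in Col1.
# Note: `x not in df['Col1']` checks the *index*, not the values,
# which is why the apply attempt in the question returned all False.
filt = ~df['Col2'].isin(df['Col1'].unique())
print(filt.tolist())
# [True, True, True, True, False, False, False, False, False]
```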

How to return a column by checking multiple columns with True and False, without if statements

How can I get this desired output without using if statements or checking row by row?
import pandas as pd
test = pd.DataFrame()
test['column1'] = [True, True, False]
test['column2']= [False,True,False]
index column1 column2
0 True False
1 True True
2 False False
desired output:
index column1 column2 column3
0 True False False
1 True True True
2 False False False
Your help is much appreciated.
Thank you in advance.
Use DataFrame.all to test whether all values in a row are True:
test['column3'] = test.all(axis=1)
If you need to filter the columns, add the subset ['column1','column2']:
test['column3'] = test[['column1','column2']].all(axis=1)
If you want to test only two columns, it is also possible to use & for bitwise AND:
test['column3'] = test['column1'] & test['column2']
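A quick end-to-end check of the DataFrame.all approach on the question's data (sketch):

```python
import pandas as pd

test = pd.DataFrame({'column1': [True, True, False],
                     'column2': [False, True, False]})

# all(axis=1) is True only where every column in the row is True;
# the counterpart any(axis=1) would be True where at least one is.
test['column3'] = test.all(axis=1)
print(test['column3'].tolist())  # [False, True, False]
```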

Find the min/max of rows with overlapping column values, create new column to represent the full range of both

I'm using Pandas DataFrames. I'm looking to identify all rows where both columns A and B == True, then mark in column C all points on either side of that intersection where only A or B is still true but not the other. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however this does not take into account the overlap need.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[df['A'] == True, 'A'].groupby(df['A'].astype('int').diff().ne(0).cumsum())
grp_b = df.loc[df['B'] == True, 'B'].groupby(df['B'].astype('int').diff().ne(0).cumsum())
grp_c = df.loc[df['C'] == True, 'C'].groupby(df['C'].astype('int').diff().ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?
Try this:
#Input df just columns 'A' and 'B'
df = df[['A','B']]
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' by assigning the row-wise minimum; what this does is assign True to C where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
we can find all of the records where A and B are both False. Then we use cumsum to create a running count of those False/False records. This groups the rows: each group starts at a False/False record and runs until the next one, where the counter is incremented.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe with the newly assigned column C by this grouping created with cumsum. Then take the maximum value of column C from that group. So, if the group has a True True record, assign True to all the records in that group. Lastly, use mask to turn the first False False record back to False.
df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
And, assign that series to df['C'] overwriting the temporarily assigned C in the statement.
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
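The steps above can also be written with explicit boolean masks, which may read more clearly (an equivalent sketch, using the sample data from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [False, True, True, True, False, False, True, True],
                   'B': [False, False, True, True, True, False, False, False]})

both = df['A'] & df['B']          # rows where A and B overlap
neither = ~(df['A'] | df['B'])    # rows where both are False
group_id = neither.cumsum()       # new group begins at each all-False row

# A whole group becomes True if it contains any overlapping row,
# except the all-False rows themselves.
df['C'] = both.groupby(group_id).transform('max') & ~neither
print(df['C'].tolist())
# [False, True, True, True, True, False, False, False]
```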

vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

How is this working?
I know the intuition behind it: given the movie dataset (loaded into "md" with pandas), we are finding the rows of 'vote_count' which are not null and converting them to int.
But I am not understanding the syntax.
md[md['vote_count'].notnull()] returns a filtered view of your md dataframe where vote_count is not null; ['vote_count'] then selects that column, .astype('int') converts it, and the result is assigned to the variable vote_counts. This is boolean indexing.
# Assume this dataframe
df = pd.DataFrame(np.random.randn(5,3), columns=list('ABC'))
df.loc[2,'B'] = np.nan
When you do df['B'].notnull(), it will return a boolean vector which can be used to filter your data where the value is True:
df['B'].notnull()
0 True
1 True
2 False
3 True
4 True
Name: B, dtype: bool
df[df['B'].notnull()]
A B C
0 -0.516625 -0.596213 -0.035508
1 0.450260 1.123950 -0.317217
3 0.405783 0.497761 -1.759510
4 0.307594 -0.357566 0.279341
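Putting the pieces together, the one-liner from the question and an equivalent .loc form (a sketch with made-up data):

```python
import numpy as np
import pandas as pd

md = pd.DataFrame({'vote_count': [10.0, np.nan, 25.0]})

# Boolean indexing, as in the question: filter out NULL rows,
# select the column, then convert to int.
vote_counts = md[md['vote_count'].notnull()]['vote_count'].astype('int')

# Equivalent single-step .loc version (avoids chained indexing):
same = md.loc[md['vote_count'].notnull(), 'vote_count'].astype('int')

print(vote_counts.tolist())  # [10, 25]
```

The .loc form does the row filtering and column selection in one step, which sidesteps the chained-indexing warnings pandas can raise on assignment.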