From a set of columns with true/false values, say which column has a True value - pandas

I have a df with several columns that contain only True/False values.
I want to create another column whose value tells me which column(s) have a True value.
Here's an example:
index  bol_1  bol_2  bol_3  criteria
1      True   False  False  bol_1
2      False  True   False  bol_2
3      True   True   False  [bol_1, bol_2]
My objective is to know which rows have at least one True value, and which columns are responsible for those True values. I want to be able to do some basic statistics on this new column, e.g. how many rows have bol_1 as the unique column with a True value.

Use DataFrame.select_dtypes for the boolean columns, convert the column names to an array, and filter the True values in a list comprehension:
df1 = df.select_dtypes(bool)   # only the boolean columns
cols = df1.columns.to_numpy()  # column names as an array
df['criteria'] = [list(cols[x]) for x in df1.to_numpy()]  # names where the row is True
print (df)
bol_1 bol_2 bol_3 criteria
1 True False False [bol_1]
2 False True False [bol_2]
3 True True False [bol_1, bol_2]
If performance is not important, use DataFrame.apply:
df['criteria'] = df1.apply(lambda x: cols[x], axis=1)
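With the criteria column in place, the basic statistics from the question follow directly; a minimal sketch (the str.join formatting is just one option for counting combinations):
# How many rows have bol_1 as the unique column with a True value
print(df['criteria'].apply(lambda lst: lst == ['bol_1']).sum())
# Frequency of each exact combination of True columns
print(df['criteria'].str.join(', ').value_counts())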

A possible solution, using apply row-wise (the [1:] slicing skips the index column):
df.assign(criteria=df.apply(lambda x: list(
    df.columns[1:][x[1:] == True]), axis=1))
Output:
index bol_1 bol_2 bol_3 criteria
0 1 True False False [bol_1]
1 2 False True False [bol_2]
2 3 True True False [bol_1, bol_2]
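To get the first part of the objective (which rows have at least one True value), the length of each criteria list can be checked; a small sketch, assuming criteria was built by either answer above:
# Rows with at least one True value have a non-empty criteria list
has_true = df['criteria'].str.len() > 0
print(df[has_true])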

Related

Consolidating columns by the number before the decimal point in the column name

I have the following dataframe (three example columns below):
import pandas as pd
array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)
25.2 25.4 27.78
0 False False True
1 True False False
2 False True True
I want to create a new dataframe with consolidated column names, i.e. combine 25.2 and 25.4 into a new 25 column. If one of the values in the separate columns is True, then the value in the new column is True.
Expected output:
25 27
0 False True
1 True False
2 True True
Any ideas?
use rename() + groupby() + sum():
df = (df.rename(columns=lambda x: x.split('.')[0])
        .groupby(axis=1, level=0).sum().astype(bool))
OR
In 2 steps:
df.columns = [x.split('.')[0] for x in df]
# OR
# df.columns = df.columns.str.replace(r'\.\d+', '', regex=True)
df = df.groupby(axis=1, level=0).sum().astype(bool)
output:
25 27
0 False True
1 True False
2 True True
Note: if the column names are numeric rather than strings, you can use round() instead of split().
Another way, grouping the transposed frame by the floor of the numeric column names (requires numpy):
>>> import numpy as np
>>> df.T.groupby(np.floor(df.columns.astype(float))).sum().astype(bool).T
25.0 27.0
0 False True
1 True False
2 True True
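As a side note, the axis=1 variant of groupby is deprecated in recent pandas versions; a sketch of an equivalent that transposes instead, using any() (a more direct way to say "True if any value in the group is True" than sum().astype(bool)):
import pandas as pd

array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)

# Transpose, group the (former) columns by their integer prefix, OR within each group
out = df.T.groupby(lambda x: x.split('.')[0]).any().T
print(out)
#       25     27
# 0  False   True
# 1   True  False
# 2   True   True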

How to return a column by checking multiple column with True and False without if statements

How can I get this desired output without using if statements and without checking row by row?
import pandas as pd
test = pd.DataFrame()
test['column1'] = [True, True, False]
test['column2']= [False,True,False]
index column1 column2
0 True False
1 True True
2 False False
desired output:
index column1 column2 column3
0 True False False
1 True True True
2 False False False
Your help is much appreciated.
Thank you in advance.
Use DataFrame.all to test whether all values in each row are True:
test['column3'] = test.all(axis=1)
If you need to restrict which columns are tested, add a subset like ['column1','column2']:
test['column3'] = test[['column1','column2']].all(axis=1)
If you only want to test 2 columns, it is also possible to use & for bitwise AND:
test['column3'] = test['column1'] & test['column2']
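For completeness, a quick check against the sample frame from the question:
import pandas as pd

test = pd.DataFrame({'column1': [True, True, False],
                     'column2': [False, True, False]})
test['column3'] = test.all(axis=1)
print(test)
#    column1  column2  column3
# 0     True    False    False
# 1     True     True     True
# 2    False    False    False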

Find the min/max of rows with overlapping column values, create new column to represent the full range of both

I'm using Pandas DataFrames. I'm looking to identify all rows where both columns A and B are True, then mark in column C all contiguous points on either side of that intersection where only one of A or B is still True. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however this does not take into account the overlap need.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff().ne(0).cumsum())
grp_b = df.loc[(df['B'] == True), 'B'].groupby(df['B'].astype('int').diff().ne(0).cumsum())
grp_c = df.loc[(df['C'] == True), 'C'].groupby(df['C'].astype('int').diff().ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?
Try this:
#Input df just columns 'A' and 'B'
df = df[['A','B']]
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
            .transform('max').mask(df.max(1) == 0, False)
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' by assigning the row-wise minimum; this assigns True to C where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
we can find all of the records where A and B are both False. Then we use cumsum to create a running count of those False/False records, which gives a grouping: each group starts at a False/False record and runs until the next False/False record, where the count increments.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe, with the newly assigned column C, by this grouping created with cumsum, then take the maximum value of column C within each group. So, if a group has a True/True record, all the records in that group are assigned True. Lastly, use mask to turn the False/False boundary records back to False.
df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
And, assign that series to df['C'], overwriting the C temporarily assigned inside the statement.
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
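Putting it together as a runnable sketch on the sample data from the question:
import pandas as pd

df = pd.DataFrame({'A': [False, True, True, True, False, False, True, True],
                   'B': [False, False, True, True, True, False, False, False]})

# Group id increments at every all-False row, so each run of rows between
# all-False boundaries (plus its leading boundary) shares one id
group_id = (df[['A', 'B']].max(axis=1) == 0).cumsum()

# Row-wise min is True only where A and B are both True; the group-wise max
# broadcasts that to the whole run, and mask() resets the all-False rows
df['C'] = (df.assign(C=df.min(axis=1))
             .groupby(group_id)['C']
             .transform('max')
             .mask(df.max(axis=1) == 0, False))
print(df)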

Pandas New Variable Based On Multiple Conditions

I have spent two days searching, any help would be appreciated.
Trying to create c_flg based on values in other columns.
a_flg b_flg Count c_flg (Expected Output)
False True 3 False
True False 2 False
False False 4 True
a_flg & b_flg are strs, Count is an int
Approaching from two angles, neither successful.
Method 1:
df['c_flg'] = np.where((df[(df['a_flg'] == 'False') &
                           (df['b_flg'] == 'False') &
                           (df['Count'] <= 6)]), 'True', 'False')
ValueError: Length of values does not match length of index
Method 2:
def test_func(df):
    if (('a_flg' == 'False') &
            ('b_flg' == 'False') &
            ('Count' <= 6)):
        return True
    else:
        return False

df['c_flg'] = df.apply(test_func, axis=1)
TypeError: ('unorderable types: str() <= int()', 'occurred at index 0')
Very new to the Python language, help would be appreciated.
If I understand your problem properly, then you need this:
df['c_flg'] = (df['a_flg']=='False') & (df['b_flg']=='False') & (df['Count']<=6)
# use this instead if the flag columns are boolean:
df['c_flg'] = (df['a_flg']==False) & (df['b_flg']==False) & (df['Count']<=6)
Output:
a_flg b_flg Count c_flg
0 False True 3 False
1 True False 2 False
2 False False 4 True
Note: For this problem you really don't need numpy, pandas itself can solve this without any problem.
I believe np.where is not necessary; use ~ to invert the boolean mask and chain the conditions with & for bitwise AND:
print (df.dtypes)
a_flg bool
b_flg bool
Count int64
dtype: object
df['c_flg'] = ~df['a_flg'] & ~df['b_flg'] & (df['Count'] <= 6)
print (df)
a_flg b_flg Count c_flg
0 False True 3 False
1 True False 2 False
2 False False 4 True
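Since the question mentions that a_flg and b_flg are strings, here is a small sketch of converting them to real booleans first (assuming the strings are exactly 'True'/'False'), after which the ~-based answer above applies:
import pandas as pd

df = pd.DataFrame({'a_flg': ['False', 'True', 'False'],
                   'b_flg': ['True', 'False', 'False'],
                   'Count': [3, 2, 4]})

# Map the 'True'/'False' strings to real booleans
df[['a_flg', 'b_flg']] = df[['a_flg', 'b_flg']].eq('True')

df['c_flg'] = ~df['a_flg'] & ~df['b_flg'] & (df['Count'] <= 6)
print(df)
#    a_flg  b_flg  Count  c_flg
# 0  False   True      3  False
# 1   True  False      2  False
# 2  False  False      4   True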

Drawing bar charts from boolean fields

I have three boolean fields whose True counts are shown below. I want to draw a bar chart that shows
Offline_RetentionByTime with 37528
Offline_RetentionByCount with 29640
Offline_RetentionByCapacity with 3362
How to achieve that?
I think you can apply value_counts to create a new df1 and then use DataFrame.plot.bar:
df = pd.DataFrame({'Offline_RetentionByTime': [True, False, True, False],
                   'Offline_RetentionByCount': [True, False, False, True],
                   'Offline_RetentionByCapacity': [True, True, True, False]})
print (df)
Offline_RetentionByCapacity Offline_RetentionByCount Offline_RetentionByTime
0 True True True
1 True False False
2 True False True
3 False True False
df1 = df.apply(pd.value_counts)
print (df1)
Offline_RetentionByCapacity Offline_RetentionByCount \
True 3 2
False 1 2
Offline_RetentionByTime
True 2
False 2
df1.plot.bar()
If you need to plot only the True values, select that row with loc:
df1.loc[True].plot.bar()
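Alternatively, since True counts as 1 when summed, summing each boolean column directly yields the True counts the question asks for; a minimal sketch using the sample df above:
# A boolean column sums to its number of True values
true_counts = df.sum()
# Offline_RetentionByTime -> 2, Offline_RetentionByCount -> 2, Offline_RetentionByCapacity -> 3
true_counts.plot.bar()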