Getting boolean columns based on value presence in other columns - pandas

I have this table:
df1 = pd.DataFrame(data={'col1': ['a', 'e', 'a', 'e'],
                         'col2': ['e', 'a', 'c', 'b'],
                         'col3': ['c', 'b', 'b', 'a']},
                   index=pd.Series([1, 2, 3, 4], name='index'))
      col1 col2 col3
index
1        a    e    c
2        e    a    b
3        a    c    b
4        e    b    a
and this list:
all_vals = ['a', 'b', 'c', 'd', 'e', 'f']
How do I make boolean columns from df1 so that the result includes one column for every value in all_vals, even if the value does not appear in df1?
          a      b      c      d      e      f
index
1      True  False   True  False   True  False
2      True   True  False  False   True  False
3      True   True   True  False  False  False
4      True   True  False  False   True  False

You can iterate over all_vals, check whether each value exists in the row, and create a new column:
for val in all_vals:
    df1[val] = (df1 == val).any(axis=1)

Use get_dummies with an aggregate max per column and DataFrame.reindex:
df1 = (pd.get_dummies(df1, dtype=bool, prefix='', prefix_sep='')
         .groupby(axis=1, level=0).max()
         .reindex(all_vals, axis=1, fill_value=False))
print(df1)
a b c d e f
index
1 True False True False True False
2 True True False False True False
3 True True True False False False
4 True True False False True False
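Equivalently, the per-value check can be collected into one DataFrame with a dict comprehension over all_vals (a sketch assuming the df1 and all_vals from the question, with the missing comma in the list fixed):

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'e', 'a', 'e'],
                    'col2': ['e', 'a', 'c', 'b'],
                    'col3': ['c', 'b', 'b', 'a']},
                   index=pd.Series([1, 2, 3, 4], name='index'))
all_vals = ['a', 'b', 'c', 'd', 'e', 'f']

# One boolean column per value: True if the value appears anywhere in the row.
# Values absent from df1 (here 'd' and 'f') simply become all-False columns.
out = pd.DataFrame({v: df1.eq(v).any(axis=1) for v in all_vals})
print(out)
```

This keeps the column order of all_vals without needing a reindex step.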

Related

Pandas loop from leftmost column and change values by dictionary

I have the following dictionary and dataframe:
val_dict = {
    'key1': ['val1', 'val2', 'val3'],
    'key2': ['val4', 'val5']
}
df = pd.DataFrame(data={'val5': [True, False, False],
                        'val2': [False, True, False],
                        'val3': [True, True, False],
                        'val1': [True, False, True],
                        'val4': [True, True, False],
                        'val6': [False, False, True]},
                  index=pd.Series([1, 2, 3], name='index'))
        val5   val2   val3   val1   val4   val6
index
1       True  False   True   True   True  False
2      False   True   True  False   True  False
3      False  False  False   True  False   True
How do I go through the dataframe from the left so that if the column is True, other columns in the val_dict values turn to False?
        val5   val2   val3   val1   val4   val6
index
1       True  False   True  False  False  False
2      False   True  False  False   True  False
3      False  False  False   True  False   True
For example, index 1 has val5 as True, so val4 switches to False because they are both assigned to the same val_dict key. Similarly, val2 is False but val3 is True, so val1 gets turned to False. Note that it should skip over val6.
I tried converting df to a dictionary with df.to_dict('index') to work with two dictionaries. However, dictionaries are unordered and the order of the columns is important, so I thought it might make the code buggy.
One way is with a combination of assign and mask:
# either val2 or val3 can be True:
com = df.filter(['val2', 'val3']).sum(1).ge(1)
# val2 is the leftmost, so start with that
(df.assign(**df.filter(['val1', 'val3']).mask(df.val2, False))
   # next is the combination of val2 and val3
   .assign(val1=lambda df: df.val1.mask(com, False),
           val4=lambda df: df.val4.mask(df.val5, False))
)
Out[84]:
val5 val2 val3 val1 val4 val6
index
1 True False True False False False
2 False True False False True False
3 False False False True False True
Note that val6 is untouched, so the values remain the same.
Here's what I have with trying to convert to a dictionary:
def section_filter(df, section_dict):
    result = {}
    for index, vals in df.to_dict('index').items():
        lst = []
        for val in section_dict.values():
            lst.append({k: v for k, v in vals.items() if k in val})
        for k, v in vals.items():
            if k not in [m for mi in section_dict.values() for m in mi]:
                lst.append({k: v})
        for l in lst:
            for i in l:
                if l[i]:
                    l.update({k: False for k in l.keys()})
                    l[i] = True
                    break
        result[index] = {k: v for d in lst for k, v in d.items()}
    return pd.DataFrame.from_dict(result, orient='index', columns=df.columns)
print(df)
print()
print(section_filter(df, val_dict))
val5 val2 val3 val1 val4 val6
index
1 True False True True True False
2 False True True False True False
3 False False False True False True
val5 val2 val3 val1 val4 val6
1 True False True False False False
2 False True False False True False
3 False False False True False True
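The rule described in the question ("leftmost True wins within each val_dict group") can also be sketched as a plain loop over the columns in their left-to-right order, which avoids the dictionary round-trip entirely (assuming the df and val_dict from the question):

```python
import pandas as pd

val_dict = {'key1': ['val1', 'val2', 'val3'],
            'key2': ['val4', 'val5']}
df = pd.DataFrame({'val5': [True, False, False],
                   'val2': [False, True, False],
                   'val3': [True, True, False],
                   'val1': [True, False, True],
                   'val4': [True, True, False],
                   'val6': [False, False, True]},
                  index=pd.Series([1, 2, 3], name='index'))

out = df.copy()
for cols in val_dict.values():
    # keep the group's columns in the order they appear in df (left to right)
    group = [c for c in df.columns if c in cols]
    seen = pd.Series(False, index=df.index)
    for c in group:
        out[c] = df[c] & ~seen    # a later column survives only if no earlier one was True
        seen = seen | df[c]
print(out)
```

Columns outside val_dict (here val6) are never touched, matching the expected output.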

Pandas True False Matching

For this table:
I would like to generate the 'desired_output' column. One way to achieve this maybe:
All the True values from col_1 are transferred straight across to desired_output (red arrow)
In desired_output, place a True value above any existing True value (green arrow)
Code I have tried:
df['desired_output'] = df.col_1.apply(lambda x: True if x.shift() == True else False)
Thank you
You can chain the original values with shifted values from Series.shift using the bitwise OR operator |:
d = {"col1":[False,True,True,True,False,True,False,False,True,False,False,False]}
df = pd.DataFrame(d)
df['new'] = df.col1 | df.col1.shift(-1)
print(df)
col1 new
0 False True
1 True True
2 True True
3 True True
4 False True
5 True True
6 False False
7 False True
8 True True
9 False False
10 False False
11 False False
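One detail worth noting: shifting a boolean Series introduces a missing value at the end, which upcasts the dtype. Passing fill_value=False keeps everything boolean (a sketch with the same data):

```python
import pandas as pd

s = pd.Series([False, True, True, True, False, True,
               False, False, True, False, False, False])
# fill_value=False keeps the dtype bool and avoids a NaN in the last position
new = s | s.shift(-1, fill_value=False)
print(new)
```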
Try this:
df['desired_output'] = df['col_1']
df.loc[1:, 'desired_output'] = df.col_1[1:].values | df.col_1[:-1].values
print(df)
In case those are saved as strings ('True' / 'False'):
Input:
col_1
0 True
1 True
2 False
3 True
4 True
5 False
6 Flase
7 True
8 False
Code:
df['desired'] = df['col_1']
for i, e in enumerate(df['col_1']):
    # guard i > 0 so the first row does not write to a new label -1
    if i > 0 and e == 'True':
        df.at[i-1, 'desired'] = df.at[i, 'col_1']
df
Output:
col_1 desired
0 True True
1 True True
2 False True
3 True True
4 True True
5 False False
6 Flase True
7 True True
8 False False
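The string case can also be handled without an explicit loop by converting to booleans first and reusing the shift idiom (a sketch assuming the string data above; note the misspelled 'Flase' row compares unequal to 'True', just as in the loop version):

```python
import pandas as pd

df = pd.DataFrame({'col_1': ['True', 'True', 'False', 'True', 'True',
                             'False', 'Flase', 'True', 'False']})
b = df['col_1'].eq('True')                         # strings -> booleans
df['desired'] = b | b.shift(-1, fill_value=False)  # True here or in the next row
print(df)
```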

Find the min/max of rows with overlapping column values, create new column to represent the full range of both

I'm using Pandas DataFrames. I'm looking to identify all rows where both columns A and B are True, then mark True in column C all rows on either side of that intersection where only one of A or B is still True. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however this does not take into account the overlap need.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff().ne(0).cumsum())
grp_b = df.loc[(df['B'] == True), 'B'].groupby(df['B'].astype('int').diff().ne(0).cumsum())
grp_c = df.loc[(df['C'] == True), 'C'].groupby(df['C'].astype('int').diff().ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?
Try this:
# Input df just columns 'A' and 'B'
df = df[['A', 'B']]
df['C'] = (df.assign(C=df.min(1))
             .groupby((df[['A', 'B']].max(1) == 0).cumsum())['C']
             .transform('max')
             .mask(df.max(1) == 0, False))
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' by assigning the row-wise minimum; this assigns True to C only where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
we can find all of the records where A and B are both False. Then we use cumsum to count those False/False records, which lets us group the rows: every row up to the next False/False record shares the same count, and the count increments at each False/False record.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe, with the newly assigned column C, by this cumsum grouping, then take the maximum value of column C within each group. So if a group contains a True/True record, all records in that group get True. Lastly, use mask to turn the False/False rows back to False.
(df.assign(C=df.min(1))
   .groupby((df[['A', 'B']].max(1) == 0).cumsum())['C']
   .transform('max')
   .mask(df.max(1) == 0, False))
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
And, assign that series to df['C'] overwriting the temporarily assigned C in the statement.
df['C'] = (df.assign(C=df.min(1))
             .groupby((df[['A', 'B']].max(1) == 0).cumsum())['C']
             .transform('max')
             .mask(df.max(1) == 0, False))
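The same logic reads a little more explicitly when each intermediate is named (a sketch assuming the example data from the question):

```python
import pandas as pd

df = pd.DataFrame({'A': [False, True, True, True, False, False, True, True],
                   'B': [False, False, True, True, True, False, False, False]})

both = df['A'] & df['B']       # rows where A and B overlap
any_true = df['A'] | df['B']   # rows that belong to some True run
# contiguous runs are separated by all-False rows, so counting those labels each run
run_id = (~any_true).cumsum()
# a run gets True in C only if it contains an A-and-B overlap
df['C'] = both.groupby(run_id).transform('max') & any_true
print(df)
```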

Drawing bar charts from boolean fields

I have three boolean fields, where their count is shown below:
I want to draw a bar chart that have
Offline_RetentionByTime with 37528
Offline_RetentionByCount with 29640
Offline_RetentionByCapacity with 3362
How to achieve that?
I think you can apply value_counts to create a new df1 and then use DataFrame.plot.bar:
df = pd.DataFrame({'Offline_RetentionByTime': [True, False, True, False],
                   'Offline_RetentionByCount': [True, False, False, True],
                   'Offline_RetentionByCapacity': [True, True, True, False]})
print(df)
Offline_RetentionByCapacity Offline_RetentionByCount Offline_RetentionByTime
0 True True True
1 True False False
2 True False True
3 False True False
df1 = df.apply(pd.value_counts)
print(df1)
Offline_RetentionByCapacity Offline_RetentionByCount \
True 3 2
False 1 2
Offline_RetentionByTime
True 2
False 2
df1.plot.bar()
If you need to plot only the True values, select them with loc:
df1.loc[True].plot.bar()
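Since True sums as 1, counting only the True values can also be done with a plain sum, skipping value_counts entirely (a sketch with the same example data; counts.plot.bar() would then draw one bar per field):

```python
import pandas as pd

df = pd.DataFrame({'Offline_RetentionByTime': [True, False, True, False],
                   'Offline_RetentionByCount': [True, False, False, True],
                   'Offline_RetentionByCapacity': [True, True, True, False]})
# booleans sum as 1/0, so a column-wise sum gives the True count per field
counts = df.sum()
print(counts)
# counts.plot.bar() would draw the chart
```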

use series to select rows from df pandas

Continued from this thread: get subsection of df based on multiple conditions
I would like to pull given rows based on multiple conditions which are stored in a Series object.
columns = ['is_net', 'is_pct', 'is_mean', 'is_wgted', 'is_sum']
index = ['a', 'b', 'c', 'd']
data = [['True', 'True', 'False', 'False', 'False'],
        ['True', 'True', 'True', 'False', 'False'],
        ['True', 'True', 'False', 'False', 'True'],
        ['True', 'True', 'False', 'True', 'False']]
df = pd.DataFrame(columns=columns, index=index, data=data)
df
is_net is_pct is_mean is_wgted is_sum
a True True False False False
b True True True False False
c True True False False True
d True True False True False
My conditions:
d = {'is_net': 'True', 'is_sum': 'True'}
s = pd.Series(d)
Expected output:
is_net is_pct is_mean is_wgted is_sum
c True True False False True
My failed attempt:
(df == s).all(axis=1)
a False
b False
c False
d False
dtype: bool
Not sure why 'c' is False when the two conditions were met.
Note, I can achieve the desired results like this but I would rather use the Series method.
df[(df['is_net']=='True') & (df['is_sum']=='True')]
Your attempt failed because df == s also compares the columns that are missing from s, and those comparisons all come back False, so all(axis=1) can never be True. As you only have 2 conditions, we can sum the matches and filter the df:
In [55]: df[(df == s).sum(axis=1) == 2]
Out[55]:
  is_net is_pct is_mean is_wgted is_sum
c   True   True   False    False   True
This works because booleans convert to 1 and 0 for True and False:
In [56]: (df == s).sum(axis=1)
Out[56]:
a    1
b    1
c    2
d    1
dtype: int64
You could modify your solution a little by adding a subset of your columns:
In [219]: df[(df == s)[['is_net', 'is_sum']].all(axis=1)]
Out[219]:
is_net is_pct is_mean is_wgted is_sum
c True True False False True
or:
In [219]: df[(df == s)[s.index].all(axis=1)]
Out[219]:
is_net is_pct is_mean is_wgted is_sum
c True True False False True
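The s.index variant generalizes to any number of conditions, since it compares only the columns named in the Series (a sketch assuming the df and s from the question):

```python
import pandas as pd

columns = ['is_net', 'is_pct', 'is_mean', 'is_wgted', 'is_sum']
index = ['a', 'b', 'c', 'd']
data = [['True', 'True', 'False', 'False', 'False'],
        ['True', 'True', 'True', 'False', 'False'],
        ['True', 'True', 'False', 'False', 'True'],
        ['True', 'True', 'False', 'True', 'False']]
df = pd.DataFrame(columns=columns, index=index, data=data)
s = pd.Series({'is_net': 'True', 'is_sum': 'True'})

# subset to the condition columns, compare against s (aligned by name),
# then require every condition to match
out = df[df[s.index].eq(s).all(axis=1)]
print(out)
```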