Find the min/max of rows with overlapping column values, create new column to represent the full range of both - pandas

I'm using Pandas DataFrames. I'm looking to identify all rows where both columns A and B == True, then represent in Column C the all points on other side of that intersection where only A or B is still true but not the other. For example:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
I can find the direct overlaps quite easily:
df.loc[(df['A'] == True) & (df['B'] == True), 'C'] = True
... however this does not take into account the overlap need.
I considered creating column 'C' in this way, then grouping each column:
grp_a = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff.ne(0).cumsum())
grp_b = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff.ne(0).cumsum())
grp_c = df.loc[(df['A'] == True), 'A'].groupby(df['A'].astype('int').diff.ne(0).cumsum())
From there I thought to iterate over the indexes in grp_c.indices and test the indices in grp_a and grp_b against those, find the min/max index of A and B and update column C. This feels like an inefficient way of getting to the result I want though.
Ideas?

Try this:
#Input df just columns 'A' and 'B'
df = df[['A','B']]
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
print(df)
Output:
A B C
0 False False False
1 True False True
2 True True True
3 True True True
4 False True True
5 False False False
6 True False False
7 True False False
Explanation:
First, create column 'C' with the assignment of minimum value, what this does is to ass True to C where both A and B are True. Next, using
df[['A','B']].max(1) == 0
0 True
1 False
2 False
3 False
4 False
5 True
6 False
7 False
dtype: bool
We can find all of the records were A and B are both False. Then we use cumsum to create a count of those False False records. Allowing us to create grouping of records with the False False recording having a count up until the next False False record which gets incremented.
(df[['A','B']].max(1) == 0).cumsum()
0 1
1 1
2 1
3 1
4 1
5 2
6 2
7 2
dtype: int32
Let's group the dataframe with the newly assigned column C by this grouping created with cumsum. Then take the maximum value of column C from that group. So, if the group has a True True record, assign True to all the records in that group. Lastly, use mask to turn the first False False record back to False.
df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)
0 False
1 True
2 True
3 True
4 True
5 False
6 False
7 False
Name: C, dtype: bool
And, assign that series to df['C'] overwriting the temporarily assigned C in the statement.
df['C'] = df.assign(C=df.min(1)).groupby((df[['A','B']].max(1) == 0).cumsum())['C']\
.transform('max').mask(df.max(1)==0, False)

Related

How to make count the amount between 2 conditions?

When the start column is true, start counting.
When the end column is true, stop counting.
Input:
import pandas as pd
df=pd.DataFrame()
df['start']=[False,True,False,False,False,True,False,False,False]
df['end']= [False,False,False,True,False,False,False,True,False]
Expected Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0
You can use cumsum to compute the groups, groupby.cummax to identify the values after a start (and later mask with where) and groupby.cumcount to increment a counter:
# make groups between start/end
group = (df['start']|df['end']).cumsum()
# identify values after a start and before an end
mask = df['start'].groupby(group).cummax()
# compute a cumcount and mask with the above "mask"
df['expected'] = df.groupby(group).cumcount().add(1).where(mask, 0)
Output:
start end expected
0 False False 0
1 True False 1
2 False False 2
3 False True 0
4 False False 0
5 True False 1
6 False False 2
7 False True 0
8 False False 0

From a set of columns with true/false values, say which column has a True value

I have a df with with several columns which have only True/False values.
I want to create another column whose value will tell me which column has a True value.
HEre's an example:
index
bol_1
bol_2
bol_3
criteria
1
True
False
False
bol_1
2
False
True
False
bol_2
3
True
True
False
[bol_1, bol_2]
My objective is to know which rows have True values(at least 1), and which columns are responsible for those True values. I want to be able to some basic statistics on this new column, e.g. for how many rows is bol_1 the unique column to have a True values.
Use DataFrame.select_dtypes for boolean columns, convert columns names to array and in list comprehension filter Trues values:
df1 = df.select_dtypes(bool)
cols = df1.columns.to_numpy()
df['criteria'] = [list(cols[x]) for x in df1.to_numpy()]
print (df)
bol_1 bol_2 bol_3 criteria
1 True False False [bol_1]
2 False True False [bol_2]
3 True True False [bol_1, bol_2]
If performance is not important use DataFrame.apply:
df['criteria'] = df1.apply(lambda x: cols[x], axis=1)
A possible solution:
df.assign(criteria=df.apply(lambda x: list(
df.columns[1:][x[1:] == True]), axis=1))
Output:
index bol_1 bol_2 bol_3 criteria
0 1 True False False [bol_1]
1 2 False True False [bol_2]
2 3 True True False [bol_1, bol_2]

Pandas True False Matching

For this table:
I would like to generate the 'desired_output' column. One way to achieve this maybe:
All the True values from col_1 are transferred straight across to desired_output (red arrow)
In desired_output, place a True value above any existing True value (green arrow)
Code I have tried:
df['desired_output']=df.col_1.apply(lambda x: True if x.shift()==True else False)
Thankyou
You can chain by | for bitwise OR original with shifted values by Series.shift:
d = {"col1":[False,True,True,True,False,True,False,False,True,False,False,False]}
df = pd.DataFrame(d)
df['new'] = df.col1 | df.col1.shift(-1)
print (df)
col1 new
0 False True
1 True True
2 True True
3 True True
4 False True
5 True True
6 False False
7 False True
8 True True
9 False False
10 False False
11 False False
try this
df['desired_output'] = df['col_1']
df.loc[1:, 'desired_output'] = df.col_1[1:].values | df.col_1[:-1].values
print(df)
In case those are saved as string. all_caps (TRUE / FALSE)
Input:
col_1
0 True
1 True
2 False
3 True
4 True
5 False
6 Flase
7 True
8 False
Code:
df['desired']=df['col_1']
for i, e in enumerate(df['col_1']):
if e=='True':
df.at[i-1,'desired']=df.at[i,'col_1']
df = df[:(len(df)-1)]
df
Output:
col_1 desired
0 True True
1 True True
2 False True
3 True True
4 True True
5 False False
6 Flase True
7 True True
8 False False

Pandas New Variable Based On Multiple Conditions

I have spent two days searching, any help would be appreciated.
Trying to create c_flg based on values in other columns.
a_flg b_flg Count c_flg (Expected Output)
False True 3 False
True False 2 False
False False 4 True
a_flg & b_flg are strs, Count is an int
Approaching from two angles, neither successful.
Method 1:
df['c_flg'] = np.where((df[(df['a_flg'] == 'False') &
(df['b_flg'] == 'False') &
(df['Count'] <= 6 )]), 'True', 'False')
ValueError: Length of values does not match length of index
Method 2:
def test_func(df):
if (('a_flg' == 'False') &
('b_flg' == 'False') &
('Count' <= 6 )):
return True
else:
return False
df['c_flg']=df.apply(test_func, axis=1)
TypeError: ('unorderable types: str() <= int()', 'occurred at index 0')
Very new to the Python language, help would be appreciated.
If I understand your problem properly then you need this,
df['c_flg']=(df['a_flg']=='False')&(df['b_flg']=='False')&(df['Count']<=6)
df['c_flg']=(df['a_flg']==False)&(df['b_flg']==False)&(df['Count']<=6)#use this if 'x_flg' is boolean
Output:
a_flg b_flg Count c_flg
0 False True 3 False
1 True False 2 False
2 False False 4 True
Note: For this problem you really don't need numpy, pandas itself can solve this without any problem.
I believe np.where is not necessary, use ~ for invert boolean mask and chain & for bitwise AND:
print (df.dtypes)
a_flg bool
b_flg bool
Count int64
dtype: object
df['c_flg'] = ~df['a_flg'] & ~df['b_flg'] & (df['Count'] <= 6)
print (df)
a_flg b_flg Count c_flg
0 False True 3 False
1 True False 2 False
2 False False 4 True

Drawing bar charts from boolean fields:

I have three boolean fields, where their count is shown below:
I want to draw a bar chart that have
Offline_RetentionByTime with 37528
Offline_RetentionByCount with 29640
Offline_RetentionByCapacity with 3362
How to achieve that?
I think you can use apply value_counts for creating new df1 and then DataFrame.plot.bar:
df = pd.DataFrame({'Offline_RetentionByTime':[True,False,True, False],
'Offline_RetentionByCount':[True,False,False,True],
'Offline_RetentionByCapacity':[True,True,True, False]})
print (df)
Offline_RetentionByCapacity Offline_RetentionByCount Offline_RetentionByTime
0 True True True
1 True False False
2 True False True
3 False True False
df1 = df.apply(pd.value_counts)
print (df1)
Offline_RetentionByCapacity Offline_RetentionByCount \
True 3 2
False 1 2
Offline_RetentionByTime
True 2
False 2
df1.plot.bar()
If need plot only True values select by loc:
df1.loc[True].plot.bar()