Boolean comparison between a series and a dataframe (element-wise) - pandas

Here are the series and dataframe to be compared element-wise (AND condition):
import pandas as pd
se = pd.Series(data=[False, True])
df = pd.DataFrame(data=[[True, False], [True, True]],
                  columns=['A', 'B'])
Desired result:
df2 = pd.DataFrame(data=[[False, False], [True, True]],
                   columns=['A', 'B'])
I could achieve that using a slow for loop, but I am sure there is a way to vectorise it.
Many thanks!

Convert Series to numpy array and compare with broadcasting:
print(df & se.to_numpy()[:, None])
       A      B
0  False  False
1   True   True

You can use conversion to a numpy array to benefit from broadcasting:
import numpy as np

out = np.logical_and(df, se.to_numpy()[:, None])
Output:
       A      B
0  False  False
1   True   True
Intermediate:
se.to_numpy()[:, None]
array([[False],
       [ True]])
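For contrast, a minimal sketch (same se and df as above) of why the bare operator does not work: df & se aligns the Series index (0 and 1) against the DataFrame columns ('A' and 'B'), so nothing matches and every entry comes out False:
# Why the reshape to a column vector is needed: plain `df & se` aligns on labels, not positions.
import pandas as pd

se = pd.Series(data=[False, True])
df = pd.DataFrame(data=[[True, False], [True, True]], columns=['A', 'B'])
print(df & se)
# Approximate output: the columns become the union of labels, all False
#        A      B      0      1
# 0  False  False  False  False
# 1  False  False  False  False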

Another possible solution:
(df & np.vstack(se))
Output:
       A      B
0  False  False
1   True   True

df.mul(se, axis=0)
Output:
       A      B
0  False  False
1   True   True
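For completeness, a minimal sketch (assuming pandas >= 1.0 and numpy) tying the answers above together and checking that they all match the desired df2:
import numpy as np
import pandas as pd

se = pd.Series(data=[False, True])
df = pd.DataFrame(data=[[True, False], [True, True]], columns=['A', 'B'])
df2 = pd.DataFrame(data=[[False, False], [True, True]], columns=['A', 'B'])

candidates = {
    'broadcast &': df & se.to_numpy()[:, None],
    'logical_and': np.logical_and(df, se.to_numpy()[:, None]),
    'vstack':      df & np.vstack(se),
    'mul axis=0':  df.mul(se, axis=0),
}
for name, result in candidates.items():
    print(name, result.equals(df2))  # each should print True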

Related

Consolidating columns by the number before the decimal point in the column name

I have the following dataframe (three example columns below):
import pandas as pd
array = {'25.2': [False, True, False], '25.4': [False, False, True], '27.78': [True, False, True]}
df = pd.DataFrame(array)
    25.2   25.4  27.78
0  False  False   True
1   True  False  False
2  False   True   True
I want to create a new dataframe with consolidated column names, i.e. combine 25.2 and 25.4 into a new column 25. If one of the values in the separate columns is True, then the value in the new column is True.
Expected output:
      25     27
0  False   True
1   True  False
2   True   True
Any ideas?
Use rename() + groupby() + sum():
df = (df.rename(columns=lambda x: x.split('.')[0])
        .groupby(axis=1, level=0).sum().astype(bool))
OR
In 2 steps:
df.columns = [x.split('.')[0] for x in df]
# OR
# df.columns = df.columns.str.replace(r'\.\d+', '', regex=True)
df = df.groupby(axis=1, level=0).sum().astype(bool)
Output:
      25     27
0  False   True
1   True  False
2   True   True
Note: if the column labels are numeric rather than strings, you can use round() instead of split().
Another way:
>>> df.T.groupby(np.floor(df.columns.astype(float))).sum().astype(bool).T
    25.0   27.0
0  False   True
1   True  False
2   True   True
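On newer pandas versions, where groupby(axis=1) is deprecated, a roughly equivalent sketch is to group the transposed frame by the part of the label before the dot and reduce with any(), which is the boolean OR directly:
import pandas as pd

df = pd.DataFrame({'25.2': [False, True, False],
                   '25.4': [False, False, True],
                   '27.78': [True, False, True]})
# Group rows of the transpose (i.e. the original columns) by their integer part, OR within each group.
out = df.T.groupby(df.columns.str.split('.').str[0]).any().T
print(out)
#       25     27
# 0  False   True
# 1   True  False
# 2   True   True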

Select DataFrame rows based on multiple conditions on columns when the column names are in a list

I need to filter rows on certain conditions on some columns.
Those columns are present in a list. The condition can be the same for all columns or different; for my work, it is the same.
Not working:
labels = ['one', 'two', 'three']
df = df [df [x] == 1 for x in labels]
The code below works:
df_list = []
for x in labels:
    df_list.append(df[df[x] == 1])
df5 = pd.concat(df_list).drop_duplicates()
Creating separate dataframes and concatenating them while dropping duplicates works.
Expected:
It should select the rows where the value of those columns is 1.
Actual:
ValueError: too many values to unpack (expected 1)
I understand the reason for the error. Is there any way I can construct the condition by modifying the non-working code?
I think you can re-write this using the following:
labels = ['one', 'two', 'three']
df5 = df[(df[labels] == 1).any(axis=1)]
Let's test with this MCVE:
# Create test data
df = pd.DataFrame(np.random.randint(1, 5, (10, 5)), columns=[*'ABCDE'])
labels = ['A', 'B', 'E']

# Your code
df_list = []
for x in labels:
    df_list.append(df[df[x] == 1])
df5 = pd.concat(df_list).drop_duplicates()

# Suggested modification
df6 = df[(df[labels] == 1).any(axis=1)]
Are they equal?
df5.eq(df6)
Output:
      A     B     C     D     E
1  True  True  True  True  True
4  True  True  True  True  True
6  True  True  True  True  True
7  True  True  True  True  True
8  True  True  True  True  True
Do you need this?
new_df = df[(df['one'] == 1) & (df['two'] == 1) & (df['three'] == 1)].drop_duplicates()
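For reference, switching between an OR and an AND over the listed columns is just .any versus .all; a minimal sketch with made-up random data (assuming pandas >= 1.0 and numpy):
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 5, size=(10, 5)), columns=list('ABCDE'))
labels = ['A', 'B', 'E']

any_match = df[(df[labels] == 1).any(axis=1)]  # rows where at least one listed column is 1
all_match = df[(df[labels] == 1).all(axis=1)]  # rows where every listed column is 1
print(len(any_match), len(all_match))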

pandas quantile comparison: indexes not aligned [duplicate]

How can I perform comparisons between DataFrames and Series? I'd like to mask elements in a DataFrame/Series that are greater/less than elements in another DataFrame/Series.
For instance, the following doesn't replace elements greater than the mean
with nans although I was expecting it to:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x[x > x.mean(axis=1)] = np.nan
>>> x
   a  b
0  1  3
1  2  4
If we look at the boolean array created by the comparison, it is really weird:
>>> x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
>>> x > x.mean(axis=1)
       a      b      0      1
0  False  False  False  False
1  False  False  False  False
I don't understand the logic by which the resulting boolean array ends up like that. I can work around the problem by transposing:
>>> (x.T > x.mean(axis=1).T).T
       a     b
0  False  True
1  False  True
But I believe there is some "correct" way of doing this that I'm not aware of. And at least I'd like to understand what is going on.
The problem here is that the Series index is being aligned against the DataFrame's columns to perform the comparison. If you use .gt and pass axis=0, then you get the result you desire:
In [203]:
x.gt(x.mean(axis=1), axis=0)
Out[203]:
       a     b
0  False  True
1  False  True
You can see what I mean when you perform the comparison with the np array:
In [205]:
x > x.mean(axis=1).values
Out[205]:
       a      b
0  False  False
1  False   True
Here you can see that the default axis for the comparison is along the columns, which gives a different result.
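To actually perform the masking the question is after, the axis-aware comparison can be combined with .mask(); a minimal sketch (assuming pandas >= 1.0):
import pandas as pd

x = pd.DataFrame(data={'a': [1, 2], 'b': [3, 4]})
# Replace values greater than their row mean with NaN.
masked = x.mask(x.gt(x.mean(axis=1), axis=0))
print(masked)
# Column b exceeds the row means here, so it becomes NaN; column a is kept.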

Use a series to select rows from a df - pandas

Continued from this thread: get subsection of df based on multiple conditions
I would like to pull given rows based on multiple conditions which are stored in a Series object.
columns = ['is_net', 'is_pct', 'is_mean', 'is_wgted', 'is_sum']
index = ['a', 'b', 'c', 'd']
data = [['True', 'True', 'False', 'False', 'False'],
        ['True', 'True', 'True', 'False', 'False'],
        ['True', 'True', 'False', 'False', 'True'],
        ['True', 'True', 'False', 'True', 'False']]
df = pd.DataFrame(columns=columns, index=index, data=data)
df
  is_net  is_pct  is_mean  is_wgted  is_sum
a   True    True    False     False   False
b   True    True     True     False   False
c   True    True    False     False    True
d   True    True    False      True   False
My conditions:
d={'is_net': 'True', 'is_sum': 'True'}
s=pd.Series(d)
Expected output:
  is_net  is_pct  is_mean  is_wgted  is_sum
c   True    True    False     False    True
My failed attempt:
(df == s).all(axis=1)
a    False
b    False
c    False
d    False
dtype: bool
Not sure why 'c' is False when the two conditions were met.
Note, I can achieve the desired results like this but I would rather use the Series method.
df[(df['is_net']=='True') & (df['is_sum']=='True')]
As you only have two conditions, we can sum the matches and filter the df:
In [55]:
df[(df == s).sum(axis=1) == 2]
Out[55]:
  is_net  is_pct  is_mean  is_wgted  is_sum
c   True    True    False     False    True
This works because booleans convert to 1 and 0 for True and False:
In [56]:
(df == s).sum(axis=1)
Out[56]:
a    1
b    1
c    2
d    1
dtype: int64
You could modify your solution a little by adding a subset of your columns:
In [219]: df[(df == s)[['is_net', 'is_sum']].all(axis=1)]
Out[219]:
  is_net  is_pct  is_mean  is_wgted  is_sum
c   True    True    False     False    True
or:
In [219]: df[(df == s)[s.index].all(axis=1)]
Out[219]:
  is_net  is_pct  is_mean  is_wgted  is_sum
c   True    True    False     False    True
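The reason the original attempt returns all False is that df == s aligns s against every column of df, so the three columns missing from s compare as False and .all(axis=1) can never be True. Restricting the comparison to the columns named in the Series avoids that; a minimal sketch with the same data (assuming pandas >= 1.0):
import pandas as pd

df = pd.DataFrame(columns=['is_net', 'is_pct', 'is_mean', 'is_wgted', 'is_sum'],
                  index=['a', 'b', 'c', 'd'],
                  data=[['True', 'True', 'False', 'False', 'False'],
                        ['True', 'True', 'True', 'False', 'False'],
                        ['True', 'True', 'False', 'False', 'True'],
                        ['True', 'True', 'False', 'True', 'False']])
s = pd.Series({'is_net': 'True', 'is_sum': 'True'})

# Compare only the columns named in s, then require all of them to match.
out = df[(df[s.index] == s).all(axis=1)]
print(out)
#   is_net is_pct is_mean is_wgted is_sum
# c   True   True   False    False   True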

pandas dataframe index access not raising an exception when out of bounds?

How can the following MWE script run without complaint? I actually want the assignment (right before the print) to fail. Instead it changes nothing and raises no exception. This is some of the weirdest behaviour.
import pandas as pd
import numpy as np
l = ['a', 'b']
d = np.array([[False] * len(l)] * 3)
df = pd.DataFrame(columns=l, data=d, index=range(1, 4))
df["a"][4] = True
print(df)
When you write df["a"][4] = True, you are modifying the a Series object, not really the df DataFrame, because df's index does not have an entry for 4. I wrote up a snippet of code exhibiting this behavior:
In [90]:
import pandas as pd
import numpy as np
l = ['a', 'b']
d = np.array([[False] * len(l)] * 3)
df = pd.DataFrame(columns=l, data=d, index=range(1, 4))
df['a'][4] = True
print("DataFrame:")
print(df)
DataFrame:
       a      b
1  False  False
2  False  False
3  False  False
In [91]:
df['b'][4] = False
print("DataFrame:")
print(df)
DataFrame:
       a      b
1  False  False
2  False  False
3  False  False
In [92]:
print("DF's Index")
print(df.index)
DF's Index
Int64Index([1, 2, 3], dtype='int64')
In [93]:
print("Series object a:")
print(df['a'])
Series object a:
1    False
2    False
3    False
4     True
Name: a, dtype: bool
In [94]:
print("Series object b:")
print(df['b'])
Series object b:
1    False
2    False
3    False
4    False
Name: b, dtype: bool