How to interpret the result of pandas dataframe boolean index - pandas

The above operation seems a little trivia, however, I am a little lost as to the output of the operation. Below is a piece of code to illustrate my point.
# sample data for understanding concept of boolean indexing:
d_f = pd.DataFrame({'x':[0,1,2,3,4,5,6,7,8,9], 'y':[10,12,13,14,15,16,1,2,3,5]})
# changing index of dataframe:
d_f = d_f.set_index([list('abcdefghig')])
# a list of data:
myL = np.array(range(1,11))
# loop through data to boolean slicing and indexing:
for r in myL:
DF2 = d_f['x'].values == r
The result of the above code is:
array([False,
False,
False,
False,
False,
False,
False,
False,
False,
False],
dtype=bool
But all the values in myL are in d_f['x'].values except 0. It, therefore, appears that the program was doing an 'index for index' matching of the elements in the myL and d_f['x'].values. Is this a typical behavior of pandas library? If so, can some please explain the rationale behind this for me. Thank you in advance.

As #coldspeed states, you are overwriting DF2 with d_f['x'] == 10 which is a boolean series of all False.
What I think you are trying to do is this instead:
d_f['x'].isin(myL)
Output:
a False
b True
c True
d True
e True
f True
g True
h True
i True
g True
Name: x, dtype: bool

Related

pandas - index data that comes after conditional

i have the following time series
[0,1,2,3,2,1,0,1,2,3,2,1,0]
i would like to boolean index all values that:
include & come after 2
are greater than 0
terminates on 0
if the conditions are met, the following vector should be produced
[False,False,True,True,True,True,False,False,True,True,True,True,False]
i have attempted to solve it with a combination of logical queries, but to no avail
frame['boolean'] = False
frame['boolean'].loc[(frame['sequence'].gt(2)) & (frame['boolean'].shift(1).eq(False)] = True
Id use numpy for this (it works well with pandas Series)
import numpy as np
a = np.array([0,1,2,3,2,1,0,1,2,3,2,1,0])
result = a > 0
where_zero = np.where(a==0)[0]
where_two = list(np.where(a==2)[0])
# note if where_two is an empty list, then the result should simply be all False, right ?
for x1 in where_zero:
while 1:
try:
x2 = where_two.pop(0)
except IndexError:
break
if x2 > x1:
break
result[x1:x2] = False
# result
#array([False, False, True, True, True, True, False, False, True,
# True, True, True, False])

Pandas: Fast way to get cols/rows containing na

In Pandas we can drop cols/rows by .dropna(how = ..., axis = ...) but is there a way to get an array-like of True/False indicators for each col/row, which would indicate whether a col/row contains na according to how and axis arguments?
I.e. is there a way to convert .dropna(how = ..., axis = ...) to a method, which would instead of actual removal just tell us, which cols/rows would be removed if we called .dropna(...) with specific how and axis.
Thank you for your time!
You can use isna() to replicate the behaviour of dropna without actually removing data. To mimic the 'how' and 'axis' parameter, you can add any() or all() and set the axis accordingly.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
0 1 2
0 True True False
1 True True True
Eq. to dropna(how='any', axis=0)
df.isna().any(axis=0)
Output:
0 True
1 True
2 True
dtype: bool
Eq. to dropna(how='any', axis=1)
df.isna().any(axis=1)
Output:
0 True
1 True
dtype: bool

Create masked array from list containing ma.masked

If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
[False True]]
A masked array has to attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])
In [343]: arr
Out[343]:
masked_array(
data=[[ True, False],
[False, True]],
mask=False,
fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1]=np.ma.masked
In [345]: arr
Out[345]:
masked_array(
data=[[True, --],
[False, True]],
mask=[[False, True],
[False, False]],
fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
Looks like this change to arr.mask occurs in data.__setitem__ at:
if value is masked:
# The mask wasn't set: create a full version.
if _mask is nomask:
_mask = self._mask = make_mask_none(self.shape, _dtype)
# Now, set the mask to its value.
if _dtype.names is not None:
_mask[indx] = tuple([True] * len(_dtype.names))
else:
_mask[indx] = True
return
It checks if the assignment values is this special constant, np.ma.masked, and it makes the full mask, and assigns True to an element.

Fill forward a DataFrame with matching values

I have a DataFrame of booleans. I would like to replace the 2 False values that are directly positioned after a True value. I thought the .replace() method would do it since the 5th example seems to be what I am looking for.
Here is what I do:
dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
dataOut = dataIn.replace(to_replace=False, method='ffill', limit=2)
>>> TypeError: No matching signature found
Here is the output I am looking for:
dataOut = pd.DataFrame([False, False, False, True, True, True, False, False])
# create a series not a dateframe
# if you have a dataframe then assign to a new variable as a series
# s = df['bool_col']
s = pd.Series([False, True, False, True, False, False, False, False])
# create a mask based on the logic using shift
mask = (s == False) & (((s.shift(1) == True) & (s.shift(-1) == False))\
| ((s.shift(2) == True) & (s.shift(1) == False)))
# numpy.where to create the new output
np.where(mask, True, s)
# array([False, True, False, True, True, True, False, False])
# assign to a new column in the frame (if you want)
# df['new_col'] = np.where(mask, True, s)
Define a function which conditionally replaces 2 first elements with True:
def condRepl(grp):
rv = grp.copy()
if grp.size >= 2 and grp.eq(False).all():
rv.iloc[0:2] = [True] * 2
return rv
The condition triggering this replace is:
group has 2 elements or more,
the group is composed solely of False values.
Then, using this function, transform each group of "new" values
(each change in the value starts a new group):
dataIn[0] = dataIn[0].groupby(s.ne(s.shift()).cumsum()).transform(condRepl)
Thanks for both answers above. But actually, it seems the .replace() can be used, but it does not entirely handle booleans.
By replacing them temporarily by int, it is possible to use it:
dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
dataOut = dataIn.astype(int).replace(to_replace=False, method='ffill', limit=2).astype(bool)

pylint, pandas : Comparison to True should be just 'expr' or 'expr is True' (singleton-comparison)

did anyone solve this pylint issue when using pandas?
C:525,59: Comparison to True should be just 'expr' or 'expr is True' (singleton-comparison)
this happens in the line where i'm using:
df_current_dayparts_raw['is_standard'] == True
I tried these but didn't work:
df_current_dayparts_raw['is_standard'] is True
df_current_dayparts_raw['is_standard'].isin([True])
df_current_dayparts_raw['is_standard'].__eq__(True)
If you have instantiate a dataframe with the following code
test = pd.DataFrame({"bool": [True, False, True], "val":[1,2,3]})
>>> test
bool val
0 True 1
1 False 2
2 True 3
the following should only return the fields where "bool" is True
test[test['bool']]
bool val
0 True 1
2 True 3
You do not need to explicitly state that test['bool'] == True, test['bool'] should be enough. This should be pylint compliant and satisfy singleton-comparison.
A bit late, but maybe this will be useful for someone. This worked for me:
if df_current_dayparts_raw['is_standard']:
print("True")
Instead of <your_expr> == True, try numpy.equal(<your_expr>, True) . Concretely:
import numpy as np
np.equal(df_current_dayparts_raw['is_standard'], True)