Fill forward a DataFrame with matching values - pandas

I have a DataFrame of booleans. I would like to replace the 2 False values that are directly positioned after a True value. I thought the .replace() method would do it, since the 5th example in its documentation seems to be what I am looking for.
Here is what I do:
dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
dataOut = dataIn.replace(to_replace=False, method='ffill', limit=2)
>>> TypeError: No matching signature found
Here is the output I am looking for:
dataOut = pd.DataFrame([False, False, False, True, True, True, False, False])

import numpy as np
import pandas as pd

# create a Series, not a DataFrame
# (if you have a DataFrame, assign the column to a new variable as a Series:
# s = df['bool_col'])
s = pd.Series([False, True, False, True, False, False, False, False])
# create a mask using shift: flag a False directly after a True (and followed
# by another False), or a False two places after a True (with a False between)
mask = (s == False) & (((s.shift(1) == True) & (s.shift(-1) == False))
                       | ((s.shift(2) == True) & (s.shift(1) == False)))
# numpy.where to create the new output
np.where(mask, True, s)
# array([False, True, False, True, True, True, False, False])
# assign to a new column in the frame (if you want)
# df['new_col'] = np.where(mask, True, s)
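Applied to the question's own frame, the same mask reproduces the desired output (a quick sketch):
import numpy as np
import pandas as pd

dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
s = dataIn[0]
mask = (s == False) & (((s.shift(1) == True) & (s.shift(-1) == False))
                       | ((s.shift(2) == True) & (s.shift(1) == False)))
dataIn[0] = np.where(mask, True, s)
# dataIn[0]: [False, False, False, True, True, True, False, False]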

Define a function which conditionally replaces the first 2 elements of a group with True:
def condRepl(grp):
    rv = grp.copy()
    # only fill groups that follow a True run, i.e. not the leading group
    if grp.index[0] > 0 and grp.size >= 2 and grp.eq(False).all():
        rv.iloc[0:2] = [True] * 2
    return rv
The conditions triggering this replacement are:
the group has 2 elements or more,
the group is composed solely of False values,
the group does not start the series (with the default RangeIndex, this means it directly follows a run of True values).
Then, using this function, transform each group of consecutive equal values
(each change in the value starts a new group):
s = dataIn[0]
dataIn[0] = s.groupby(s.ne(s.shift()).cumsum()).transform(condRepl)
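For reference, s.ne(s.shift()).cumsum() is the usual run-labelling trick: it starts a new group id at every change of value, so each run of consecutive equal values becomes its own group. A quick sketch:
import pandas as pd

s = pd.Series([False, False, False, True, False, False, False, False])
print(s.ne(s.shift()).cumsum().tolist())
# [1, 1, 1, 2, 3, 3, 3, 3]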

Thanks for both answers above. But actually, it seems .replace() can be used; it just does not fully handle booleans.
By temporarily converting them to int, it is possible to use it:
dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
dataOut = dataIn.astype(int).replace(to_replace=False, method='ffill', limit=2).astype(bool)
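Note that newer pandas versions deprecate the method and limit arguments of .replace(), so the call above may stop working. A sketch of an equivalent using mask and ffill (assuming the same single-column frame):
s = dataIn[0]
dataOut = s.mask(~s).ffill(limit=2).fillna(False).astype(bool).to_frame()
# [False, False, False, True, True, True, False, False]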

Related

The difference in results of numpy filtering does not make sense?

I have a sample dataframe which I uploaded to my GitHub Gist (it has 98 rows; the original data has millions). It has 4 numerical columns, 1 ID column and 1 column which indicates its cluster ID. I have written a function which I apply to that dataframe in two ways:
Case A: I group by individual and apply the function.
Case B: I group by both individual and cluster and apply the function.
Here is the function in question:
import numpy as np

def vectorized_similarity_filtering2(df, cols=["scaledPrice", "scaledAirlines", "scaledFlights", "scaledTrip"]):
    from sklearn.metrics.pairwise import cosine_similarity
    arr = df[cols].to_numpy()
    b = arr[..., None]
    c = arr.T[None, ...]
    # flag pairs where one row is less than or equal to the other in every
    # dimension and strictly less in at least one, then symmetrize
    mask = ((b <= c).all(axis=1)) & ((b < c).any(axis=1))
    mask |= mask.T
    sims = np.where(mask, np.nan, cosine_similarity(arr))
    return np.sum(sims >= 0.6, axis=1)
What it does, in a few steps (a small sketch of the mask follows the list):
It compares the current row to all the other rows.
It filters out all rows relative to which the current row has less than or equal values in all dimensions and a strictly smaller value in at least one dimension.
For the remaining rows, it calculates the cosine similarity between them and the current row.
It counts the number of elements in the similarity matrix which are at least 0.6 and returns the result.
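To make the filtering step concrete, here is a minimal sketch of the domination mask on a 2-row array with made-up values:
import numpy as np

arr = np.array([[1.0, 2.0],
                [2.0, 3.0]])
b = arr[..., None]    # shape (2, 2, 1)
c = arr.T[None, ...]  # shape (1, 2, 2)
# mask[i, j] is True when row i is <= row j in every column
# and strictly < in at least one
mask = (b <= c).all(axis=1) & (b < c).any(axis=1)
print(mask)
# [[False  True]
#  [False False]]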
By logic, each element of the result of applying the function per individual (case A) must be no less than the corresponding element of the result of applying it per individual and cluster (case B), because each (individual, cluster) group in case B is a subset of the corresponding individual group in case A. However, I see that case B produces larger counts than case A for some rows. It does not make sense to me, because case B has fewer elements to compare against each other. I hope somebody can explain to me what is wrong with the code, or with my understanding.
Here are steps to replicate the results:
# df being the dataframe
g = df.groupby("individual")
gc = df.groupby(["individual", "cluster"])
caseA = np.concatenate(g.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseB = np.concatenate(gc.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseA >= caseB
array([ True, True, True, True, True, True, True, False, False,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, False,
False, True, True, True, True, True, True, True, True,
True, True, True, True, False, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True])
The culprit is the order of the cluster groupby, which currently loops through the clusters in this order: [0, 2, 1, 5, 3, 4, 11, 6, 7, 12, 8, 9, 10]. This means the elements are not aligned in the comparison caseA >= caseB, so you are comparing the similarities of different rows to each other.
One solution is to sort your dataframe first, so that your function on the cluster groupby returns values in the same order as on the individual groupby, like this:
df = df.sort_values(by=['cluster'])
Then it should work!
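Alternatively (a sketch, assuming the gist's column names), sorting on both keys once up front makes both groupbys enumerate the rows in the same order, so the two concatenated results stay aligned:
import numpy as np

df = df.sort_values(["individual", "cluster"]).reset_index(drop=True)
caseA = np.concatenate([vectorized_similarity_filtering2(grp)
                        for _, grp in df.groupby("individual")])
caseB = np.concatenate([vectorized_similarity_filtering2(grp)
                        for _, grp in df.groupby(["individual", "cluster"])])
# caseA[i] and caseB[i] now refer to the same underlying row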

pandas - index data that comes after conditional

I have the following time series:
[0,1,2,3,2,1,0,1,2,3,2,1,0]
I would like to boolean index all values that:
include and come after a 2,
are greater than 0,
and terminate on a 0.
If the conditions are met, the following vector should be produced:
[False,False,True,True,True,True,False,False,True,True,True,True,False]
I have attempted to solve it with a combination of logical queries, but to no avail:
frame['boolean'] = False
frame['boolean'].loc[(frame['sequence'].gt(2)) & (frame['boolean'].shift(1).eq(False))] = True
I'd use numpy for this (it works well with pandas Series):
import numpy as np

a = np.array([0,1,2,3,2,1,0,1,2,3,2,1,0])
result = a > 0
where_zero = np.where(a == 0)[0]
where_two = list(np.where(a == 2)[0])
# note: if where_two is an empty list, then the result should simply be
# all False, right?
for x1 in where_zero:
    # advance to the first index of a 2 that lies beyond this zero
    while 1:
        try:
            x2 = where_two.pop(0)
        except IndexError:
            break
        if x2 > x1:
            break
    # everything from this zero up to (but not including) that 2 is False
    result[x1:x2] = False
# result
# array([False, False,  True,  True,  True,  True, False, False,  True,
#         True,  True,  True, False])
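A vectorized alternative (a sketch of a different technique than the loop above): label the stretches between zeros with cumsum, then flag everything from the first 2 onward within each stretch:
import pandas as pd

s = pd.Series([0,1,2,3,2,1,0,1,2,3,2,1,0])
# each zero starts a new run
run = s.eq(0).cumsum()
# within each run, True from the first 2 onward
seen_two = s.eq(2).astype(int).groupby(run).cummax().astype(bool)
result = seen_two & s.gt(0)
# result.tolist() ==
# [False, False, True, True, True, True, False, False, True, True, True, True, False]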

Pandas: interpreting linked boolean conditions without using a for loop

I would like to achieve the following results in the column stop (based on columns price, limit and strength) without using a stupidly slow for loop.
The difficulty: the direction of the switch (False to True or True to False) of the first condition (price and limit) impacts the interpretation of the remaining one (strength).
Here is a screenshot of the desired result with my comments and explanations:
Here is the code to replicate the above DataFrame:
import pandas as pd

# initialise data of lists
data = {'price': [1, 3, 2, 5, 3, 3, 4, 5, 6, 5, 3],
        'limit': [1.2, 3.3, 2.1, 4.5, 3.5, 3.8, 3, 4.5, 6.3, 4.5, 3.5],
        'strength': [False, False, False, False, False, True, True, True, True, False, False],
        'stop': [True, True, True, True, True, True, False, False, False, False, True]}

# create the DataFrame
df = pd.DataFrame(data)
Many thanks in advance for your help.

Create masked array from list containing ma.masked

If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
[False True]]
A masked array has two attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])

In [343]: arr
Out[343]:
masked_array(
  data=[[ True, False],
        [False,  True]],
  mask=False,
  fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1] = np.ma.masked

In [345]: arr
Out[345]:
masked_array(
  data=[[True, --],
        [False, True]],
  mask=[[False,  True],
        [False, False]],
  fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
Looks like this change to arr.mask occurs in MaskedArray.__setitem__, at:
if value is masked:
    # The mask wasn't set: create a full version.
    if _mask is nomask:
        _mask = self._mask = make_mask_none(self.shape, _dtype)
    # Now, set the mask to its value.
    if _dtype.names is not None:
        _mask[indx] = tuple([True] * len(_dtype.names))
    else:
        _mask[indx] = True
    return
It checks whether the assigned value is the special constant np.ma.masked; if so, it creates the full mask and assigns True to the selected element.
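Putting this together, a somefunc along the lines the question asks for can separate values and mask up front instead of assigning element by element (a sketch; the helper name and the False placeholder under the mask are arbitrary choices):
import numpy as np
import numpy.ma as ma

def somefunc(data):
    arr = np.asarray(data, dtype=object)
    # an element is masked exactly when it is the ma.masked constant
    mask = np.array([v is ma.masked for v in arr.flat]).reshape(arr.shape)
    # put a throwaway False under the masked slots so astype(bool) is safe
    values = np.where(mask, False, arr).astype(bool)
    return ma.masked_array(values, mask=mask)

print(somefunc([[True, ma.masked], [False, True]]))
# [[True --]
#  [False True]]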

How to interpret the result of pandas dataframe boolean index

The above operation seems a little trivial; however, I am a little lost as to its output. Below is a piece of code to illustrate my point.
import numpy as np
import pandas as pd

# sample data for understanding the concept of boolean indexing:
d_f = pd.DataFrame({'x': [0,1,2,3,4,5,6,7,8,9], 'y': [10,12,13,14,15,16,1,2,3,5]})
# changing the index of the dataframe:
d_f = d_f.set_index([list('abcdefghig')])
# an array of data:
myL = np.array(range(1, 11))
# loop through the data, boolean slicing and indexing:
for r in myL:
    DF2 = d_f['x'].values == r
The result of the above code is:
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)
But all the values in myL are in d_f['x'].values except 10. It therefore appears that the program was doing an 'index for index' matching of the elements in myL and d_f['x'].values. Is this typical behavior of the pandas library? If so, can someone please explain the rationale behind it for me? Thank you in advance.
As @coldspeed states, you are overwriting DF2 with d_f['x'] == 10, which is a boolean series of all False.
What I think you are trying to do is this instead:
d_f['x'].isin(myL)
Output:
a False
b True
c True
d True
e True
f True
g True
h True
i True
g True
Name: x, dtype: bool
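If the goal really was to keep one comparison per value instead of overwriting DF2 on each pass, a small sketch of that variant (the masks name is arbitrary):
masks = {r: d_f['x'].values == r for r in myL}
# masks[3] -> array([False, False, False,  True, False, False, False, False, False, False])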