The difference in results of numpy filtering does not make sense? - pandas

I have a sample dataframe which is I uploaded to my Github Gist (because it has 98 rows, but the original data has millions). It has 4 numerical columns, 1 ID column and 1 column which indicates its cluster ID. I have written a function which I apply to that dataframe in two ways:
Case A. I groupby by individual and apply the function
Case B. I groupby by both individual and cluster and apply the function.
Here is the function in question:
def vectorized_similarity_filtering2(df, cols = ["scaledPrice", "scaledAirlines", "scaledFlights", "scaledTrip"]):
from sklearn.metrics.pairwise import cosine_similarity
arr = df[cols].to_numpy()
b = arr[..., None]
c = arr.T[None, ...]
# they must less than equal
mask = (((b <= c).all(axis=1)) & ((b < c).any(axis=1)))
mask |= mask.T
sims = np.where(mask, np.nan, cosine_similarity(arr))
return np.sum(sims >= 0.6, axis = 1)
What it does in few steps:
It compares current row to all the other rows
It filters out all rows which current row has less or equal values in all dimensions and has less value in at least one dimension.
For the remaining rows, it calculates the cosine similarity between them and the current row
It counts the number of elements in similarity matrix which are greater than 0.6 and returns the result.
By logic, each element of the result of applying to all rows for every individual (case A) must be not less than the each element of the result of applying to all rows for every individual and cluster (case B). Because, case B . However, I see that case B has more elements than case A for some rows. It does not make sense to me, because Case B has less elements to compare to each other. I hope somebody can explain my what is wrong with the code, or my understanding?
Here are steps to replicate the results:
# df being the dataframe
g = df.groupby("individual")
gc = df.groupby(["individual", "cluster"])
caseA = np.concatenate(g.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseB = np.concatenate(gc.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseA >= caseB
array([ True, True, True, True, True, True, True, False, False,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, False,
False, True, True, True, True, True, True, True, True,
True, True, True, True, False, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True])
EDIT: formatting

The culprit is the order of the cluster groupby which is currently looping through the clusters in this order [0, 2, 1, 5, 3, 4, 11, 6, 7, 12, 8, 9, 10]. This means that the elements aren't aligned in the comparison caseA >= caseB so you are comparing the similarity of different rows to each other.
One solution is to sort your dataframe first so that your function on the cluster groupby returns values the same order as on the individual groupby like this
df = df.sort_values(by=['cluster'])
Then it should work!

Related

Pandas: interpreting linked boolean conditions without using a for loop

I would like to achieve the following results in the column stop (based on columns price, limit and strength) without using a stupidly slow for loop.
The difficulty: the direction of the switch (False to True or True to False) of the first condition (price and limit) impact the interpretation of the remaining one (strength).
Here is a screenshot of the desired result with my comments and explanations:
Here is the code to replicate the above DataFrame:
import pandas as pd
# initialise data of lists.
data = {'price':[1,3,2,5,3,3,4,5,6,5,3],
'limit':[1.2,3.3,2.1,4.5,3.5,3.8,3,4.5,6.3,4.5,3.5],
'strength': [False, False, False, False, False, True, True, True, True, False, False],
'stop': [True, True, True, True, True, True, False, False, False, False, True]}
# Create DataFrame
df = pd.DataFrame(data)
Many thanks in advance for your help.

How to change p-value of Numpy.random.choice based on position in array?

I have a 3D array of size NxNxN. I would like to fill this array with random booleans, which I can do with:
a = np.random.choice([False,True],size=(N,N,N))
However, I would like the likelihood (or p-value) of choosing either True or False to be based on the element's position in the array. I thought maybe I could do this with the p-value parameter, but that only then works for selecting how often True/False is chosen for the entire array.
Is there any way to set specific p-values for the entire (N,N,N) array? I guess that would amount to an (N,N,N,2) array then, with the extra 2 being for the p-value for False and p-value for True (though p_True = 1 - p_False). I feel like there's a simpler way to do this that I'm not thinking of.
Edit:
So say I want to create a simple array, a, of shape (1,2) (just two elements, but multidimensional on purpose). I want to fill these two elements with True/False. I have another array filled with the likelihood or p-value with which I want those elements to be False, say p_False, where p_False.shape = (1,2). Let's say I want the first element to have a 25% chance of being False, but the second element to have a 50% chance of being false, so then p_False = np.array([0.25,0.5]).
I tried something along the lines of:
a = np.random.choice([[False,True],[False,True]],p=[[.25,.75],[.5,.5]])
but I got a ValueError: a must be 1-dimensional.
To generate an array with different probabilities, you can use the following code:
# define an initial value of N
N = 512
# generate an array of probabilities. You can eventually build your own, since the size is respected
prob_array = np.array((range(0,N*N*N)))
# rescale the probabilities between 0 and 1
prob_array = (prob_array - np.min(prob_array)) / (np.max(prob_array) - np.min(prob_array))
# generate the random based on the probabilities, cast to booleans and reshape
np.reshape(np.array(np.random.binomial(1, p=prob_array, size=N*N*N), dtype=bool), (N,N,N))
This generates an array with lots of Falses in the beginning and lots of Trues in the end:
array([[[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]],
...,
[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]])
Use the binomial method with an array of numbers in [0, 1]. Here is an example, which sets each element to 0 or 1 depending on a randomly chosen probability:
import numpy
gen=numpy.random.Generator(numpy.random.PCG64())
ret=gen.binomial(1, gen.uniform(size=(3, 3, 3)))
If you want each item to be True or False rather than 0 or 1, I'm afraid I don't know how to do so.
Note that numpy.random.Generator was introduced in NumPy 1.7. You are recommended to use the latest version of NumPy; in the meantime, you can use the following:
import numpy
ret=numpy.random.binomial(1, numpy.random.uniform(size=(3, 3, 3)))

Splitting a numpy array / pandas dataframe by boolean delimiters

Assume a numpy array (actually Pandas) of the form:
[value, included,
0.123, False,
0.127, True,
0.140, True,
0.111, False,
0.159, True,
0.321, True,
0.444, True,
0.323, True,
0.432, False]
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
0.140, True],
[0.159, True,
0.321, True,
0.444, True,
0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.
You can create groups by inverse mask by ~ with Series.cumsum and filter only Trues by boolean indexing, then create list of DataFrames by DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
Also is possible convert Dataframes to arrays by DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]

Fill forward a DataFrame with matching values

I have a DataFrame of booleans. I would like to replace the 2 False values that are directly positioned after a True value. I thought the .replace() method would do it since the 5th example seems to be what I am looking for.
Here is what I do:
dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
dataOut = dataIn.replace(to_replace=False, method='ffill', limit=2)
>>> TypeError: No matching signature found
Here is the output I am looking for:
dataOut = pd.DataFrame([False, False, False, True, True, True, False, False])
# create a series not a dateframe
# if you have a dataframe then assign to a new variable as a series
# s = df['bool_col']
s = pd.Series([False, True, False, True, False, False, False, False])
# create a mask based on the logic using shift
mask = (s == False) & (((s.shift(1) == True) & (s.shift(-1) == False))\
| ((s.shift(2) == True) & (s.shift(1) == False)))
# numpy.where to create the new output
np.where(mask, True, s)
# array([False, True, False, True, True, True, False, False])
# assign to a new column in the frame (if you want)
# df['new_col'] = np.where(mask, True, s)
Define a function which conditionally replaces 2 first elements with True:
def condRepl(grp):
rv = grp.copy()
if grp.size >= 2 and grp.eq(False).all():
rv.iloc[0:2] = [True] * 2
return rv
The condition triggering this replace is:
group has 2 elements or more,
the group is composed solely of False values.
Then, using this function, transform each group of "new" values
(each change in the value starts a new group):
dataIn[0] = dataIn[0].groupby(s.ne(s.shift()).cumsum()).transform(condRepl)
Thanks for both answers above. But actually, it seems the .replace() can be used, but it does not entirely handle booleans.
By replacing them temporarily by int, it is possible to use it:
dataIn = pd.DataFrame([False, False, False, True, False, False, False, False])
dataOut = dataIn.astype(int).replace(to_replace=False, method='ffill', limit=2).astype(bool)

Combining logical (boolean) expressions in numpy [duplicate]

This question already has answers here:
Combining logic statements AND in numpy array
(3 answers)
Closed 4 years ago.
I want to combine logical expressions but I get an exception:
array = np.arange(10)
array > 1
array([False, False, True, True, True, True, True, True, True,
True])
array < 4
array([ True, True, True, True, False, False, False, False, False,
False])
(array > 1 & array < 4)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What I would expect instead would be a boolean array of length 10 with True value at the indices 2 and 3 --where both conditions are met-- and False elsewhere.
You need numpy's logical_and function.
import numpy as np
np.logical_and(array>1, array<4). # [False, False, True, True, False, False, False, False, False, False]