How to change p-value of Numpy.random.choice based on position in array? - numpy

I have a 3D array of size NxNxN. I would like to fill this array with random booleans, which I can do with:
a = np.random.choice([False,True],size=(N,N,N))
However, I would like the likelihood (or p-value) of choosing either True or False to be based on the element's position in the array. I thought maybe I could do this with the p-value parameter, but that only then works for selecting how often True/False is chosen for the entire array.
Is there any way to set specific p-values for the entire (N,N,N) array? I guess that would amount to an (N,N,N,2) array then, with the extra 2 being for the p-value for False and p-value for True (though p_True = 1 - p_False). I feel like there's a simpler way to do this that I'm not thinking of.
Edit:
So say I want to create a simple array, a, of shape (1,2) (just two elements, but multidimensional on purpose). I want to fill these two elements with True/False. I have another array filled with the likelihood or p-value with which I want those elements to be False, say p_False, where p_False.shape = (1,2). Let's say I want the first element to have a 25% chance of being False, but the second element to have a 50% chance of being False, so then p_False = np.array([[0.25,0.5]]).
I tried something along the lines of:
a = np.random.choice([[False,True],[False,True]],p=[[.25,.75],[.5,.5]])
but I got a ValueError: a must be 1-dimensional.

To generate an array with different probabilities, you can use the following code:
import numpy as np
# define an initial value of N (N = 512 allocates N**3, about 134 million, elements; use a smaller N to experiment)
N = 512
# generate an array of probabilities; you can build your own instead, as long as it has N*N*N entries
prob_array = np.arange(N*N*N)
# rescale the probabilities to lie between 0 and 1
prob_array = (prob_array - np.min(prob_array)) / (np.max(prob_array) - np.min(prob_array))
# draw one 0/1 sample per probability, cast to booleans and reshape
np.reshape(np.array(np.random.binomial(1, p=prob_array, size=N*N*N), dtype=bool), (N,N,N))
This generates an array with lots of Falses in the beginning and lots of Trues in the end:
array([[[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]],
...,
[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]])

Use the binomial method with an array of numbers in [0, 1]. Here is an example, which sets each element to 0 or 1 depending on a randomly chosen probability:
import numpy
gen=numpy.random.Generator(numpy.random.PCG64())
ret=gen.binomial(1, gen.uniform(size=(3, 3, 3)))
If you want each item to be True or False rather than 0 or 1, I'm afraid I don't know how to do so.
Note that numpy.random.Generator was introduced in NumPy 1.17. Upgrading to a recent version of NumPy is recommended; on older versions, you can use the legacy functions instead:
import numpy
ret=numpy.random.binomial(1, numpy.random.uniform(size=(3, 3, 3)))
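For boolean output, one possible approach (a sketch, not taken from the answers above) is to compare uniform random numbers against the per-element probability, or to cast the binomial draws with astype(bool). The names rng, p_false, a and b, and the seed, are illustrative assumptions.

```python
import numpy as np

# Seeded generator so the sketch is reproducible (the seed is an arbitrary choice)
rng = np.random.default_rng(0)

# Per-element probability of False, shape (1, 2) as in the question's edit
p_false = np.array([[0.25, 0.5]])

# A uniform draw in [0, 1) is >= p_false with probability 1 - p_false,
# so each element comes out False with probability p_false
a = rng.random(p_false.shape) >= p_false

# Same idea with binomial draws cast to bool (here p is the probability of True)
b = rng.binomial(1, 1.0 - p_false).astype(bool)

# Rough empirical check: fraction of False values over many draws
samples = rng.random((100_000,) + p_false.shape) >= p_false
frac_false = 1.0 - samples.mean(axis=0)
print(a, b, frac_false)
```

The same comparison should work for an (N,N,N) probability array, since the random draw simply takes the shape of the probabilities.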

Related

The difference in results of numpy filtering does not make sense?

I have a sample dataframe which I uploaded to my Github Gist (it has 98 rows, while the original data has millions). It has 4 numerical columns, 1 ID column and 1 column which indicates its cluster ID. I have written a function which I apply to that dataframe in two ways:
Case A. I group by individual and apply the function.
Case B. I group by both individual and cluster and apply the function.
Here is the function in question:
import numpy as np
def vectorized_similarity_filtering2(df, cols = ["scaledPrice", "scaledAirlines", "scaledFlights", "scaledTrip"]):
    from sklearn.metrics.pairwise import cosine_similarity
    arr = df[cols].to_numpy()
    b = arr[..., None]
    c = arr.T[None, ...]
    # mask[i, j] is True when row i is <= row j in every column and < in at least one
    mask = (((b <= c).all(axis=1)) & ((b < c).any(axis=1)))
    # make the mask symmetric so the pair is excluded in both directions
    mask |= mask.T
    sims = np.where(mask, np.nan, cosine_similarity(arr))
    return np.sum(sims >= 0.6, axis = 1)
What it does, in a few steps:
It compares the current row to all the other rows.
It filters out every pair of rows where one row has less than or equal values in all dimensions and a strictly smaller value in at least one dimension.
For the remaining rows, it calculates the cosine similarity between them and the current row.
It counts the number of elements in the similarity matrix which are greater than or equal to 0.6 and returns the result.
Logically, each element of the result for case A (grouping by individual) should be no less than the corresponding element for case B (grouping by individual and cluster), because case B has fewer rows to compare with each other. However, I see that case B gives a larger count than case A for some rows, which does not make sense to me. I hope somebody can explain what is wrong with the code or with my understanding.
Here are steps to replicate the results:
# df being the dataframe
g = df.groupby("individual")
gc = df.groupby(["individual", "cluster"])
caseA = np.concatenate(g.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseB = np.concatenate(gc.apply(lambda x: vectorized_similarity_filtering2(x)).values)
caseA >= caseB
array([ True, True, True, True, True, True, True, False, False,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True, False,
False, True, True, True, True, True, True, True, True,
True, True, True, True, False, True, True, True, True,
True, True, True, True, True, True, False, True, True,
True, True, True, True, True, True, True, True, True,
True, True, True, True, True, True, True, True])
EDIT: formatting
The culprit is the order of the cluster groupby which is currently looping through the clusters in this order [0, 2, 1, 5, 3, 4, 11, 6, 7, 12, 8, 9, 10]. This means that the elements aren't aligned in the comparison caseA >= caseB so you are comparing the similarity of different rows to each other.
One solution is to sort your dataframe first, so that your function on the cluster groupby returns values in the same order as on the individual groupby, like this:
df = df.sort_values(by=['cluster'])
Then it should work!
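Another way to avoid the misalignment, sketched here under assumptions, is to keep the original row index attached to each group's result and align on that index before comparing. The toy dataframe, the stand-in function count_rows and the helper per_row are hypothetical and only illustrate the alignment idea; they are not the asker's data or function.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): clusters are interleaved within individual 1
df = pd.DataFrame({
    "individual": [1, 1, 1, 1, 2, 2],
    "cluster":    [0, 2, 0, 2, 1, 1],
    "x":          [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

def count_rows(group):
    # Stand-in for the real per-row function: one value per row of the group
    return np.full(len(group), len(group))

def per_row(frame, keys, func):
    # Attach each group's original index to its result, then restore row order
    parts = [pd.Series(func(g), index=g.index) for _, g in frame.groupby(keys)]
    return pd.concat(parts).sort_index()

case_a = per_row(df, "individual", count_rows)
case_b = per_row(df, ["individual", "cluster"], count_rows)

# Both results are now ordered by the original row index, so this compares like with like
print((case_a >= case_b).all())
```

Because both Series carry the original index, the comparison no longer depends on the order in which groupby visits the groups.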

Pandas: interpreting linked boolean conditions without using a for loop

I would like to achieve the following results in the column stop (based on columns price, limit and strength) without using a stupidly slow for loop.
The difficulty: the direction of the switch (False to True or True to False) of the first condition (price and limit) impacts the interpretation of the remaining one (strength).
Here is a screenshot of the desired result with my comments and explanations:
Here is the code to replicate the above DataFrame:
import pandas as pd
# initialise data of lists.
data = {'price':[1,3,2,5,3,3,4,5,6,5,3],
'limit':[1.2,3.3,2.1,4.5,3.5,3.8,3,4.5,6.3,4.5,3.5],
'strength': [False, False, False, False, False, True, True, True, True, False, False],
'stop': [True, True, True, True, True, True, False, False, False, False, True]}
# Create DataFrame
df = pd.DataFrame(data)
Many thanks in advance for your help.

Create masked array from list containing ma.masked

If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
[False True]]
A masked array has two attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])
In [343]: arr
Out[343]:
masked_array(
data=[[ True, False],
[False, True]],
mask=False,
fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1]=np.ma.masked
In [345]: arr
Out[345]:
masked_array(
data=[[True, --],
[False, True]],
mask=[[False, True],
[False, False]],
fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
Looks like this change to arr.mask occurs in MaskedArray.__setitem__ (in numpy/ma/core.py) at:
if value is masked:
    # The mask wasn't set: create a full version.
    if _mask is nomask:
        _mask = self._mask = make_mask_none(self.shape, _dtype)
    # Now, set the mask to its value.
    if _dtype.names is not None:
        _mask[indx] = tuple([True] * len(_dtype.names))
    else:
        _mask[indx] = True
    return
It checks whether the assigned value is the special constant np.ma.masked; if so, it creates the full mask if needed and sets the selected element of the mask to True.
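Building on that, one possible way to convert a nested list directly is to construct the data and the mask separately and pass both to np.ma.masked_array. This is only a sketch: the helper name to_masked_bool is hypothetical, and storing False as the data value under masked slots is an arbitrary choice.

```python
import numpy as np
import numpy.ma as ma

def to_masked_bool(nested):
    """Turn a nested list of True/False/ma.masked into a masked bool array."""
    def data_of(x):
        if isinstance(x, list):
            return [data_of(v) for v in x]
        # Masked slots get a placeholder data value (False, chosen arbitrarily)
        return False if x is ma.masked else bool(x)

    def mask_of(x):
        if isinstance(x, list):
            return [mask_of(v) for v in x]
        # ma.masked is a singleton, so an identity check is enough
        return x is ma.masked

    data = np.array(data_of(nested), dtype=bool)
    mask = np.array(mask_of(nested), dtype=bool)
    return ma.masked_array(data, mask=mask)

arr = to_masked_bool([[True, ma.masked], [False, True]])
print(arr)
```

Splitting the list into data and mask up front avoids relying on how np.array treats ma.masked inside a list.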

How can I convert a boolean matrix into a greyscale map?

I am trying to visualize the Mandelbrot set so that it looks like this image (https://mathworld.wolfram.com/SeaHorseValley.html).
I have a function that returns a boolean array, with True indicating that a point is part of the Mandelbrot set and False that it is not.
def mandelbrot(x, y, dx, dy, dims=(300, 400), threshold=25, iterations=200):
    '''Returns a boolean matrix of the given dimensions, indicating whether a
    position inside the rectangle spanning from (x, y) to (x+dx, y+dy) is part of
    the Mandelbrot set or not'''
    xs, ys = np.meshgrid(np.linspace(x, x + dx, dims[1]),
                         np.linspace(y, y + dy, dims[0]))
    c = xs + ys * 1j
    zs = np.zeros(dims, dtype=np.complex128)
    for i in range(iterations):
        abs_zs = np.abs(zs)
        zs[abs_zs < threshold] = zs[abs_zs < threshold] ** 2 + c[abs_zs < threshold]
    return np.abs(zs) < threshold
I guess the code above doesn't really matter, but when I pass in (-1.5,-1,2,2), it returns an array like the one below.
array([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
The result I want is a greyscale image like the one linked above, but I have no idea how I should approach this. Could you at least suggest tools I can use, or give hints if you have any ideas? Thank you in advance!
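One common hint, sketched here as an assumption rather than a confirmed answer, is to record how many iterations each point survives instead of a single True/False, and then map that count to a grey level. The function name mandelbrot_counts and the variables counts and grey are hypothetical; the loop mirrors the function in the question.

```python
import numpy as np

def mandelbrot_counts(x, y, dx, dy, dims=(300, 400), threshold=25, iterations=200):
    """Return, per pixel, how many iterations the point stayed below the threshold."""
    xs, ys = np.meshgrid(np.linspace(x, x + dx, dims[1]),
                         np.linspace(y, y + dy, dims[0]))
    c = xs + ys * 1j
    zs = np.zeros(dims, dtype=np.complex128)
    counts = np.zeros(dims, dtype=np.int64)
    for _ in range(iterations):
        # Only points that have not escaped yet are updated and counted
        active = np.abs(zs) < threshold
        zs[active] = zs[active] ** 2 + c[active]
        counts[active] += 1
    return counts

# Small grid and few iterations to keep the sketch fast
counts = mandelbrot_counts(-1.5, -1, 2, 2, dims=(60, 80), iterations=50)

# Scale to [0, 1] so the values can be used directly as grey levels
grey = counts / counts.max()
print(grey.shape, grey.min(), grey.max())
```

An array like grey can then be displayed with an image viewer such as matplotlib's imshow with cmap="gray"; points that never escape end up with the largest value.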

Splitting a numpy array / pandas dataframe by boolean delimiters

Assume a numpy array (actually Pandas) of the form:
[value, included,
0.123, False,
0.127, True,
0.140, True,
0.111, False,
0.159, True,
0.321, True,
0.444, True,
0.323, True,
0.432, False]
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
0.140, True],
[0.159, True,
0.321, True,
0.444, True,
0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.
You can create group labels by inverting the mask with ~ and applying Series.cumsum, keep only the True rows with boolean indexing, and then build a list of DataFrames with DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
It is also possible to convert the DataFrames to arrays with DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]
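For a NumPy-only variant, one possible sketch is to locate the edges of each run of True values and slice between them. The arrays values and included below copy the question's example; the names padded, edges, starts, stops and runs are illustrative.

```python
import numpy as np

values = np.array([0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432])
included = np.array([False, True, True, False, True, True, True, True, False])

# Pad with False on both sides so every run of True has a start and an end
padded = np.concatenate(([False], included, [False]))

# Positions where the flag changes; they alternate run start, run end (exclusive)
edges = np.flatnonzero(padded[1:] != padded[:-1])
starts, stops = edges[::2], edges[1::2]

# One array per run of consecutive True values
runs = [values[s:e] for s, e in zip(starts, stops)]
print(runs)
```

The padding also handles inputs that begin or end with True, since those runs still get a matching start or end edge.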