Splitting a numpy array / pandas dataframe by boolean delimiters - pandas

Assume a numpy array (actually Pandas) of the form:
[value, included,
0.123, False,
0.127, True,
0.140, True,
0.111, False,
0.159, True,
0.321, True,
0.444, True,
0.323, True,
0.432, False]
I'd like to split the array such that False elements are excluded and successive runs of True elements are split into their own array. So for the above case, we'd end up with:
[[0.127, True,
0.140, True],
[0.159, True,
0.321, True,
0.444, True,
0.323, True]]
I can certainly do this by pushing individual elements onto lists, but surely there must be a more numpy-ish way to do this.

You can build group labels by inverting the mask with ~ and taking Series.cumsum, keep only the True rows with boolean indexing, and then create a list of DataFrames with DataFrame.groupby:
dfs = [v for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[ value included
1 0.127 True
2 0.140 True, value included
4 0.159 True
5 0.321 True
6 0.444 True
7 0.323 True]
It is also possible to convert the DataFrames to arrays with DataFrame.to_numpy:
dfs = [v.to_numpy() for k, v in df.groupby((~df['included']).cumsum()[df['included']])]
print (dfs)
[array([[0.127, True],
[0.14, True]], dtype=object), array([[0.159, True],
[0.321, True],
[0.444, True],
[0.32299999999999995, True]], dtype=object)]
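For plain NumPy arrays, the same split can be sketched without pandas by locating the edges of each True run. This is only one possible approach; the names values and included are illustrative and stand in for the two columns from the question.

```python
import numpy as np

# Illustrative data: the two columns from the question as separate arrays.
values = np.array([0.123, 0.127, 0.140, 0.111, 0.159, 0.321, 0.444, 0.323, 0.432])
included = np.array([False, True, True, False, True, True, True, True, False])

# Pad with False on both sides so runs touching either end are still detected.
padded = np.concatenate(([False], included, [False]))

# Positions where the mask flips; they alternate run start, run stop (exclusive).
edges = np.flatnonzero(padded[1:] != padded[:-1])
starts, stops = edges[::2], edges[1::2]

# One sub-array per run of consecutive True values.
runs = [values[a:b] for a, b in zip(starts, stops)]
print(runs)
```

Each entry of runs is a view into values, so no data is copied until a run is modified or converted.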

Related

What is the best approach to remove in pandas all columns with all values equal to False?

I've seen in this question how to drop columns with all nan, but I'm looking for a way to remove all columns with all False values.
Using the info in that question, I'm thinking of replacing False with nan, dropping them, and then replacing nan back with False, but I don't know if that is the best approach.
A working piece of code with my approach would be as follows:
df = pd.DataFrame(data={'A':[True, True, False], 'B': [False, True, False], 'C':[False, False, False], 'D': [True, True, True]})
df.replace(to_replace=False, value=np.nan, inplace=True)
df.dropna(axis=1, how='all', inplace=True)
df.fillna(False, inplace=True)
You could use:
df.loc[:,~df.eq(False).all()]
Output:
A B D
0 True False True
1 True True True
2 False False True
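As a hedged alternative, assuming every column is boolean and has no missing values, a column is all False exactly when DataFrame.any returns False for it, so the result of any can be used directly as the column selector:

```python
import pandas as pd

df = pd.DataFrame(data={'A': [True, True, False],
                        'B': [False, True, False],
                        'C': [False, False, False],
                        'D': [True, True, True]})

# df.any() is True for every column that contains at least one True,
# so selecting with it drops the all-False column 'C'.
result = df.loc[:, df.any()]
print(result)
```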

Pandas: interpreting linked boolean conditions without using a for loop

I would like to achieve the following results in the column stop (based on columns price, limit and strength) without using a stupidly slow for loop.
The difficulty: the direction of the switch (False to True or True to False) of the first condition (price and limit) affects the interpretation of the remaining one (strength).
Here is a screenshot of the desired result with my comments and explanations:
Here is the code to replicate the above DataFrame:
import pandas as pd
# initialise data of lists.
data = {'price':[1,3,2,5,3,3,4,5,6,5,3],
'limit':[1.2,3.3,2.1,4.5,3.5,3.8,3,4.5,6.3,4.5,3.5],
'strength': [False, False, False, False, False, True, True, True, True, False, False],
'stop': [True, True, True, True, True, True, False, False, False, False, True]}
# Create DataFrame
df = pd.DataFrame(data)
Many thanks in advance for your help.

Create masked array from list containing ma.masked

If I have a (possibly multidimensional) Python list where each element is one of True, False, or ma.masked, what's the idiomatic way of turning this into a masked numpy array of bool?
Example:
>>> print(somefunc([[True, ma.masked], [False, True]]))
[[True --]
[False True]]
A masked array has two attributes, data and mask:
In [342]: arr = np.ma.masked_array([[True, False],[False,True]])
In [343]: arr
Out[343]:
masked_array(
data=[[ True, False],
[False, True]],
mask=False,
fill_value=True)
That starts without anything masked. Then as you suggest, assigning np.ma.masked to an element masks the slot:
In [344]: arr[0,1]=np.ma.masked
In [345]: arr
Out[345]:
masked_array(
data=[[True, --],
[False, True]],
mask=[[False, True],
[False, False]],
fill_value=True)
Here the arr.mask has been changed from scalar False (applying to the whole array) to a boolean array of False, and then the selected item has been changed to True.
arr.data hasn't changed:
In [346]: arr.data[0,1]
Out[346]: False
It looks like this change to arr.mask occurs in MaskedArray.__setitem__ at:
if value is masked:
# The mask wasn't set: create a full version.
if _mask is nomask:
_mask = self._mask = make_mask_none(self.shape, _dtype)
# Now, set the mask to its value.
if _dtype.names is not None:
_mask[indx] = tuple([True] * len(_dtype.names))
else:
_mask[indx] = True
return
It checks whether the assigned value is the special constant np.ma.masked; if so, it creates the full mask when none exists yet and sets the selected element of the mask to True.
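Building on the data and mask attributes described above, one way to write the somefunc from the question is to walk the nested list and produce the data and the mask separately. This is a minimal sketch, not an established NumPy idiom; the names to_masked_bool and _split are hypothetical, and masked slots are given a placeholder data value of False.

```python
import numpy.ma as ma

def _split(item):
    """Return (data, mask) nested lists mirroring the structure of item."""
    if isinstance(item, list):
        pairs = [_split(x) for x in item]
        return [p[0] for p in pairs], [p[1] for p in pairs]
    if item is ma.masked:
        # Placeholder data value for a masked slot; the mask flag is True.
        return False, True
    return bool(item), False

def to_masked_bool(nested):
    """Hypothetical helper: nested list of True/False/ma.masked -> masked bool array."""
    data, mask = _split(nested)
    return ma.masked_array(data, mask=mask, dtype=bool)

arr = to_masked_bool([[True, ma.masked], [False, True]])
print(arr)
```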

How to change p-value of Numpy.random.choice based on position in array?

I have a 3D array of size NxNxN. I would like to fill this array with random booleans, which I can do with:
a = np.random.choice([False,True],size=(N,N,N))
However, I would like the likelihood (or p-value) of choosing either True or False to be based on the element's position in the array. I thought maybe I could do this with the p-value parameter, but that only then works for selecting how often True/False is chosen for the entire array.
Is there any way to set specific p-values for the entire (N,N,N) array? I guess that would amount to an (N,N,N,2) array then, with the extra 2 being for the p-value for False and p-value for True (though p_True = 1 - p_False). I feel like there's a simpler way to do this that I'm not thinking of.
Edit:
So say I want to create a simple array, a, of shape (1,2) (just two elements, but multidimensional on purpose). I want to fill these two elements with True/False. I have another array filled with the likelihood or p-value with which I want those elements to be False, say p_False, where p_False.shape = (1,2). Let's say I want the first element to have a 25% chance of being False, but the second element to have a 50% chance of being false, so then p_False = np.array([0.25,0.5]).
I tried something along the lines of:
a = np.random.choice([[False,True],[False,True]],p=[[.25,.75],[.5,.5]])
but I got a ValueError: a must be 1-dimensional.
To generate an array with different probabilities, you can use the following code:
import numpy as np
# define an initial value of N
N = 512
# generate an array of probabilities; you can build your own, as long as it has N*N*N entries
prob_array = np.arange(N*N*N)
# rescale the probabilities to the range [0, 1]
prob_array = (prob_array - np.min(prob_array)) / (np.max(prob_array) - np.min(prob_array))
# draw 0/1 samples with those probabilities, cast to booleans and reshape
np.reshape(np.array(np.random.binomial(1, p=prob_array, size=N*N*N), dtype=bool), (N,N,N))
This generates an array with lots of Falses in the beginning and lots of Trues in the end:
array([[[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]],
...,
[[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
...,
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True],
[ True, True, True, ..., True, True, True]]])
Use the binomial method with an array of numbers in [0, 1]. Here is an example, which sets each element to 0 or 1 depending on a randomly chosen probability:
import numpy
gen=numpy.random.Generator(numpy.random.PCG64())
ret=gen.binomial(1, gen.uniform(size=(3, 3, 3)))
If you want each item to be True or False rather than 0 or 1, I'm afraid I don't know how to do so.
Note that numpy.random.Generator was introduced in NumPy 1.17. Upgrading to a recent version of NumPy is recommended; on older versions, you can use the following:
import numpy
ret=numpy.random.binomial(1, numpy.random.uniform(size=(3, 3, 3)))
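The 0/1 draws can be turned into booleans with astype(bool). Another option, sketched here for the p_False example from the question's edit, is to compare uniform random numbers against the per-element probability. The seed and the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

# Per-element probability of False, as in the question's edit.
p_false = np.array([[0.25, 0.5]])

# A uniform draw in [0, 1) falls below p_false with probability p_false,
# so ">=" yields True with probability 1 - p_false.
a = rng.random(p_false.shape) >= p_false

# Equivalent idea with binomial: draw 0/1 with P(1) = 1 - p_false, then cast.
flags = rng.binomial(1, 1 - p_false).astype(bool)

print(a, flags)
```

The same pattern works for an (N, N, N) array of probabilities, since both calls operate elementwise on whatever shape is passed in.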

Combining logical (boolean) expressions in numpy [duplicate]

I want to combine logical expressions but I get an exception:
array = np.arange(10)
array > 1
array([False, False, True, True, True, True, True, True, True,
True])
array < 4
array([ True, True, True, True, False, False, False, False, False,
False])
(array > 1 & array < 4)
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
What I would expect instead would be a boolean array of length 10 with True value at the indices 2 and 3 --where both conditions are met-- and False elsewhere.
You need numpy's logical_and function.
import numpy as np
np.logical_and(array>1, array<4)  # [False, False, True, True, False, False, False, False, False, False]
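The exception in the question comes from operator precedence: & binds more tightly than the comparison operators, so array > 1 & array < 4 is parsed as the chained comparison array > (1 & array) < 4, which asks for the truth value of a whole array. Parenthesising each comparison makes the & operator work elementwise, as this short sketch shows:

```python
import numpy as np

array = np.arange(10)

# Parentheses force each comparison to be evaluated before the elementwise AND.
mask = (array > 1) & (array < 4)

# np.logical_and gives the same result without relying on precedence.
same = np.logical_and(array > 1, array < 4)

print(mask)
```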