Why do asarray and list react differently? - numpy

I'm trying to understand why:
w = [0.1, 0.2, 0.3, 0.5, 0]
print(w[w != 0])
outputs: 0.2,
while
w = [0.1, 0.2, 0.3, 0.5, 0]
w = np.asarray(w)
print(w[w != 0])
outputs: [0.1 0.2 0.3 0.5], which seems more logical.
So: why do lists return the second element?

A list and an ndarray implement comparison differently. In particular:
a list returns a single bool of True or False when compared to something else. Clearly the list w is not equal to 0, so w != 0 returns True.
an ndarray implements comparison element-wise, returning an ndarray of booleans, one per element. Thus w != 0 returns array([ True,  True,  True,  True, False]).
Thus
for a list, w[w != 0] is w[True], and since True equals 1 this is treated as w[1], i.e. 0.2.
for an ndarray it is w[array([ True,  True,  True,  True, False])], which leverages NumPy's boolean (mask) indexing to return only those elements where the corresponding boolean is True.
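For completeness, a minimal sketch contrasting the two behaviours (it simply re-runs the question's own example, assuming numpy is imported as np):
import numpy as np

w = [0.1, 0.2, 0.3, 0.5, 0]

# a plain list compares as a whole object and yields a single bool
print(w != 0)       # True
print(w[w != 0])    # w[True] -> w[1] -> 0.2

# an ndarray compares element-wise and yields a boolean mask
a = np.asarray(w)
print(a != 0)       # [ True  True  True  True False]
print(a[a != 0])    # [0.1 0.2 0.3 0.5]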


Pandas: Fast way to get cols/rows containing na

In Pandas we can drop cols/rows with .dropna(how=..., axis=...), but is there a way to get an array-like of True/False indicators for each col/row, indicating whether that col/row contains na according to the how and axis arguments?
I.e. is there a way to turn .dropna(how=..., axis=...) into a method that, instead of actually removing anything, just tells us which cols/rows would be removed if we called .dropna(...) with a specific how and axis?
Thank you for your time!
You can use isna() to replicate the behaviour of dropna without actually removing data. To mimic the how and axis parameters, add any() (for how='any') or all() (for how='all') and set the axis accordingly; note that any()/all() reduce along the axis you pass them, so it is the opposite of the axis you would pass to dropna.
Here is a simple example:
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
df.isna()
Output:
      0     1      2
0  True  True  False
1  True  True   True
Which columns contain na, i.e. the columns that dropna(how='any', axis=1) would remove:
df.isna().any(axis=0)
Output:
0 True
1 True
2 True
dtype: bool
Which rows contain na, i.e. the rows that dropna(how='any', axis=0) would remove:
df.isna().any(axis=1)
Output:
0 True
1 True
dtype: bool
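The how='all' case follows the same pattern with all() instead of any(); a small sketch (the expected outputs in the comments follow from the same df as above):
import pandas as pd
df = pd.DataFrame([[pd.NA, pd.NA, 1], [pd.NA, pd.NA, pd.NA]])
# which columns consist entirely of na, i.e. what dropna(how='all', axis=1) would remove
print(df.isna().all(axis=0))   # 0: True, 1: True, 2: False
# which rows consist entirely of na, i.e. what dropna(how='all', axis=0) would remove
print(df.isna().all(axis=1))   # 0: False, 1: True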

Is there a nullable boolean type I can use in a Pandas dataframe?

In a program I am working on I have to explicitly set the type of a column that contains boolean data. Sometimes all of the values in this column are None. Unless I provide explicit type information Pandas will infer the wrong type information for that column.
Is there a pandas-compatible type that represents a nullable-bool? I want to do something like this, but preserve the Nones:
s = pandas.Series([True, False, None]).astype(bool)
print([v for v in s])
gives:
[True, False, False]
Python's built-in bool class cannot represent a null value. It can only be True or False, and in this case, because bool(None) == False, the final None is lost.
But what if I want to preserve my nulls? Is there a type I can give the column which allows for True, False and None?
I have solved a similar issue with numeric columns: for those I can use the pandas Int64 dtype, which is a nullable integer type:
s = pandas.Series([1, 2, None, numpy.nan]).astype("Int64")
print([v for v in s])
gives:
[1, 2, <NA>, <NA>]
Which is exactly the right behaviour for nullable integers; I just need an equivalent type I can use for my nullable bools.
boolean dtype should work:
>>> pd.Series([True, False, None])
0 True
1 False
2 None
dtype: object
>>> pd.Series([True, False, None]).astype("boolean")
0 True
1 False
2 <NA>
dtype: boolean
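The dtype can also be supplied at construction time, and, assuming a pandas version that ships the BooleanDtype extension type (1.0+), logical operations propagate <NA> using Kleene logic:
>>> s = pd.Series([True, False, None], dtype="boolean")
>>> s
0     True
1    False
2     <NA>
dtype: boolean
>>> s & True
0     True
1    False
2     <NA>
dtype: boolean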

Numpy bug using .any()?

I'm running into the following issue using NumPy:
>>> distance = 0.9014179933248182
>>> min_distance = np.array([0.71341723, 0.07322284])
>>> distance < min_distance
array([False, False])
which is right, but when I try:
>>> distance < min_distance.any()
True
which is obviously wrong, since distance is not smaller than any number in min_distance
What is going on here? I'm using NumPy on Google Colab, on version '1.17.3'.
Whilst numpy bugs are common, this is not one. Note that min_distance.any() returns a boolean result. So in this expression:
distance < min_distance.any()
you are comparing a float with a boolean, which unfortunately works, because of a comedy of errors:
bool is a subclass of int
True is equal to 1
floats are comparable with integers.
E.g.
>>> 0.9 < True
True
>>> 1.1 < True
False
What you wanted instead:
>>> (distance < min_distance).any()
False
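Breaking the evaluation down step by step, with the same distance and min_distance as above, makes the order of operations clear:
>>> min_distance.any()               # .any() runs first: is any element truthy?
True
>>> distance < True                  # then the comparison: True behaves like 1
True
>>> distance < min_distance          # what was intended: element-wise comparison first...
array([False, False])
>>> (distance < min_distance).any()  # ...then the reduction
False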

How to interpret the result of pandas dataframe boolean index

This operation seems a little trivial; however, I am a little lost as to its output. Below is a piece of code to illustrate my point.
# sample data for understanding concept of boolean indexing:
d_f = pd.DataFrame({'x':[0,1,2,3,4,5,6,7,8,9], 'y':[10,12,13,14,15,16,1,2,3,5]})
# changing index of dataframe:
d_f = d_f.set_index([list('abcdefghig')])
# an array of data:
myL = np.array(range(1,11))
# loop through the data to do boolean slicing and indexing:
for r in myL:
    DF2 = d_f['x'].values == r
The result of the above code is:
array([False, False, False, False, False, False, False, False, False, False], dtype=bool)
But every value in d_f['x'].values except 0 is in myL. It therefore appears that the program was doing an 'index for index' matching of the elements in myL and d_f['x'].values. Is this typical behavior of the pandas library? If so, can someone please explain the rationale behind it for me. Thank you in advance.
As @coldspeed states, you are overwriting DF2 on every iteration, so you end up with d_f['x'].values == 10, which is a boolean array that is all False (10 is not in the column).
What I think you are trying to do is this instead:
d_f['x'].isin(myL)
Output:
a False
b True
c True
d True
e True
f True
g True
h True
i True
g True
Name: x, dtype: bool
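If the goal is to keep only the matching rows rather than just to obtain the mask, the boolean Series can be used directly as an indexer; a small sketch reusing the d_f and myL defined in the question:
matches = d_f['x'].isin(myL)   # boolean Series: is each x value present in myL?
print(d_f[matches])            # drops row 'a' (x == 0), keeps every row whose x value is in myL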

When does advanced indexing on structured masked arrays *really* return a copy?

When I have a structured masked array with boolean indexing, under what conditions do I get a view and when do I get a copy? The documentation says that advanced indexing always returns a copy, but this is not true, since something like X[X>0]=42 is technically advanced indexing, but the assignment works. My situation is more complex:
I want to set the mask of a particular field based on a criterion from another field, so I need to get the field, apply the boolean index, and get the mask. There are 3! = 6 orders in which to do so.
Preparation:
In [83]: M = ma.MaskedArray(random.random(400).view("f8,f8,f8,f8")).reshape(10, 10)
In [84]: crit = M[:, 4]["f2"] > 0.5
Field - index - mask (fails):
In [85]: M["f3"][crit, 3].mask = True
In [86]: print(M["f3"][crit, 3].mask)
[False False False False False]
Index - field - mask (fails):
In [87]: M[crit, 3]["f3"].mask = True
In [88]: print(M[crit, 3]["f3"].mask)
[False False False False False]
Index - mask - field (fails):
In [94]: M[crit, 3].mask["f3"] = True
In [95]: print(M[crit, 3].mask["f3"])
[False False False False False]
Mask - index - field (fails):
In [101]: M.mask[crit, 3]["f3"] = True
In [102]: print(M.mask[crit, 3]["f3"])
[False False False False False]
Field - mask - index (succeeds):
In [103]: M["f3"].mask[crit, 3] = True
In [104]: print(M["f3"].mask[crit, 3])
[ True True True True True]
# set back to False so I can try method #6
In [105]: M["f3"].mask[crit, 3] = False
In [106]: print(M["f3"].mask[crit, 3])
[False False False False False]
Mask - field - index (succeeds):
In [107]: M.mask["f3"][crit, 3] = True
In [108]: print(M.mask["f3"][crit, 3])
[ True True True True True]
So, it looks like indexing must come last.
The issue of __setitem__ v. __getitem__ is important, but with structured arrays and masking it's a little harder to sort out when a __getitem__ is first making a copy.
Regarding structured arrays, it shouldn't matter whether the field index or the element index occurs first. However, some releases appear to have a bug in this regard; I'll try to find a recent SO question where this was a problem.
With a masked array, there's the question of how to correctly modify the mask. The .mask is a property that accesses the underlying ._mask array. But that is fetched with __getattr__. So the simple setitem v getitem distinction does not apply directly.
Let's skip the structured bit first:
In [584]: M = np.ma.MaskedArray(np.arange(4))
In [585]: M
Out[585]:
masked_array(data = [0 1 2 3],
mask = False,
fill_value = 999999)
In [586]: M.mask
Out[586]: False
In [587]: M.mask[[1,2]]=True
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-587-9010ee8f165e> in <module>()
----> 1 M.mask[[1,2]]=True
TypeError: 'numpy.bool_' object does not support item assignment
Initially mask is a scalar boolean, not an array.
This works
In [588]: M.mask=np.zeros((4,),bool) # change mask to array
In [589]: M
Out[589]:
masked_array(data = [0 1 2 3],
mask = [False False False False],
fill_value = 999999)
In [590]: M.mask[[1,2]]=True
In [591]: M
Out[591]:
masked_array(data = [0 -- -- 3],
mask = [False True True False],
fill_value = 999999)
This does not
In [592]: M[[1,2]].mask=True
In [593]: M
Out[593]:
masked_array(data = [0 -- -- 3],
mask = [False True True False],
fill_value = 999999)
M[[1,2]] is evidently a copy, and the assignment is to that copy's mask attribute, not to M.mask.
....
A masked array has a .__setmask__ method. You can study it in np.ma.core.py. And the mask property is defined with
mask = property(fget=_get_mask, fset=__setmask__, doc="Mask")
So M.mask=... does use this.
So it looks like the problem cases are doing
M.__getitem__(index).__setmask__(values)
hence the copy. The M.mask[...] = ... form is doing
M._mask.__setitem__(index, values)
since _get_mask just does return self._mask.
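Spelling those two call chains out on a small non-structured example makes the difference concrete (an illustration only; __setmask__ and _mask are the np.ma.core names quoted above):
import numpy as np
M = np.ma.MaskedArray(np.arange(4), mask=np.zeros(4, bool))
# M[[1,2]].mask = True is effectively:
tmp = M.__getitem__([1, 2])          # advanced indexing -> tmp is a copy
tmp.__setmask__(True)                # the mask property's setter; only the copy is masked
print(M.mask)                        # [False False False False] -- M is unchanged
# M.mask[[1,2]] = True is effectively:
M._mask.__setitem__([1, 2], True)    # _get_mask returns self._mask, then ordinary item assignment
print(M.mask)                        # [False  True  True False] -- M's own mask changed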
M["f3"].mask[crit, 3] = True
works because M['f3'] is a view. (M[['f1','f3']] is ok for get, but doesn't work for setting).
M.mask["f3"] is also a view. I'm not entirely sure of the order the relevant get and sets. __setmask__ has code that deals specifically with compound dtype (structured).
=========================
Looking at a structured array, without the masking complication, the indexing order matters:
In [607]: M1 = np.arange(16).view("i,i")
In [609]: M1[[3,4]]['f1']=[3,4] # no change
In [610]: M1[[3,4]]['f1']
Out[610]: array([7, 9], dtype=int32)
In [611]: M1['f1'][[3,4]]=[1,2] # change
In [612]: M1
Out[612]:
array([(0, 1), (2, 3), (4, 5), (6, 1), (8, 2), (10, 11), (12, 13), (14, 15)], dtype=[('f0', '<i4'), ('f1', '<i4')])
So we still have a __getitem__ followed by a __setitem__, and we have to pay attention as to whether the get returns a view or a copy.
This is because, although advanced indexing returns a copy when used to read, assigning to an advanced index still works. Only the methods where the advanced index comes last actually assign through advanced indexing (i.e. through __setitem__); in the other orderings the advanced index is a __getitem__ that has already made a copy.
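As an added illustration (not part of the original answer), np.shares_memory makes the view/copy distinction explicit on the plain structured array; dtype='i4' is spelled out so the example does not depend on the platform's default integer size:
import numpy as np
M1 = np.arange(16, dtype='i4').view('i,i')   # 8 elements of dtype [('f0', '<i4'), ('f1', '<i4')]
print(np.shares_memory(M1['f1'], M1))        # True  -- field access returns a view
print(np.shares_memory(M1[[3, 4]], M1))      # False -- advanced indexing returns a copy
M1['f1'][[3, 4]] = [1, 2]                    # view first, fancy index last: modifies M1
M1[[3, 4]]['f1'] = [3, 4]                    # fancy copy first: modifies only a throwaway copy
print(M1[[3, 4]])                            # [(6, 1) (8, 2)] -- only the first assignment took effect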