What is the best way to initialise a NumPy masked array with an existing mask? - numpy

I was expecting to just say something like
ma.zeros(my_shape, mask=my_mask, hard_mask=True)
(where the mask is the correct shape) but ma.zeros (or ma.ones or ma.empty) rather surprisingly doesn't recognise the mask argument. The simplest I've come up with is
ma.array(np.zeros(my_shape), mask=my_mask, hard_mask=True)
which seems to involve unnecessary copying of lots of zeros. Is there a better way?

Make a masked array:
In [162]: x = np.arange(5); mask=np.array([1,0,0,1,0],bool)
In [163]: M = np.ma.MaskedArray(x,mask)
In [164]: M
Out[164]:
masked_array(data=[--, 1, 2, --, 4],
mask=[ True, False, False, True, False],
fill_value=999999)
Modify x, and see the result in M:
In [165]: x[-1] = 10
In [166]: M
Out[166]:
masked_array(data=[--, 1, 2, --, 10],
mask=[ True, False, False, True, False],
fill_value=999999)
In [167]: M.data
Out[167]: array([ 0, 1, 2, 3, 10])
In [169]: M.data.base
Out[169]: array([ 0, 1, 2, 3, 10])
The M.data is a view of the array used in creating it. No unnecessary copies.
I haven't used functions like np.ma.zeros, but
In [177]: np.ma.zeros
Out[177]: <numpy.ma.core._convert2ma at 0x1d84a052af0>
_convert2ma is a Python class that takes a function name and returns a new callable. It does not add mask-specific parameters. Study its source yourself if necessary.
np.ma.MaskedArray, the class that actually subclasses ndarray, takes a copy parameter:
copy : bool, optional
Whether to copy the input data (True), or to use a reference instead.
Default is False.
and the first line of its __new__ is
_data = np.array(data, dtype=dtype, copy=copy,
order=order, subok=True, ndmin=ndmin)
I haven't quite sorted out whether M._data is just a reference to the source data, or a view. In either case, it isn't a copy, unless you say so.
I haven't worked a lot with masked arrays, but my impression is that, while they can be convenient, they shouldn't be used where you are concerned about performance. There's a lot of extra work required to maintain both the mask and the data. The extra time involved in copying the data array, if any, will be minor.
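As a minimal sketch of the point above (the names shape, my_mask, data, m and z are made up for illustration), wrapping an existing array and mask with np.ma.MaskedArray should not copy the data, which np.shares_memory can confirm. The second half shows one possible alternative, assuming you would rather start from np.ma.zeros and attach the mask afterwards:

```python
import numpy as np

shape = (3, 4)                      # hypothetical shape
my_mask = np.zeros(shape, dtype=bool)
my_mask[0, :2] = True               # mask two elements for illustration

# Option 1: wrap an existing zeros array; copy defaults to False
data = np.zeros(shape)
m = np.ma.MaskedArray(data, mask=my_mask, hard_mask=True)
shares = np.shares_memory(m.data, data)   # True when no copy was made

# Option 2: start from np.ma.zeros, then attach and harden the mask
z = np.ma.zeros(shape)
z.mask = my_mask
z.harden_mask()

print(shares, m.hardmask, z.hardmask)
```

Either way only one zeros array is allocated; which spelling reads better is a matter of taste.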

Related

numpy array of array with custom filtering

I am trying to filter a numpy array of array with given conditions, for example
input = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
output where the [0] >= 4, [1] >= 5, [2] >= 6
expected result = np.array([[4,5,6],[4,5,6]])
What would be the best way to achieve this with performance in mind?
Extended question: how can I retrieve the index of each output row in the input array?
You can do:
a = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
a[(a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6)]
Here you create boolean masks for the conditions on each element in each row of the data, use the logical and to combine them, and finally use the resulting mask to get the matching data rows.
To find the indices of the data rows matching the conditions, you can use NumPy's where() function:
idx = np.where((a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6))[0]
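The same filter can also be sketched with broadcasting, so the per-column conditions do not have to be written out one by one. The names thresholds, keep, rows and idx below are illustrative, and this assumes every column uses a >= comparison:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [0, 9, 19]])
thresholds = np.array([4, 5, 6])    # one lower bound per column

# (4, 3) >= (3,) broadcasts row by row; all(axis=1) requires every column to pass
keep = (a >= thresholds).all(axis=1)

rows = a[keep]                      # the matching rows
idx = np.flatnonzero(keep)          # their positions in the input array

print(rows)
print(idx)
```

Computing the boolean mask once and reusing it for both the rows and the indices avoids evaluating the comparisons twice.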
As per your request, a numba version
import numpy as np
import numba as nb
import sys
import timeit
# np.int was removed in NumPy 1.24; use an explicit 64-bit integer type instead
target = np.random.randint(low=-100000, high=100000, size=(int(sys.argv[2]), 3), dtype=np.int64)
comp = np.array([4, 5, 6])
@nb.njit((nb.int64[:, :], nb.int64[::3]), parallel=True)
def cmp(a, b):
    c = np.empty((a.shape[0],), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        c[i] = a[i][0] > b[0] and a[i][1] > b[1] and a[i][2] > b[2]
    return c
def cmp_normal(a, b):
    # return np.all(a > b, axis=1)
    return (a[:,0] >=b[0]) & (a[:,1] >= b[1]) & (a[:,2] >=b[2])
# Usage: python test.py <cmp|cmp_normal> <number of rows>
print(timeit.timeit(lambda: eval(sys.argv[1])(target, comp), number=10))
The first output time is for sequential numba, the second one is for parallel numba.
Parallel numba gives roughly a 5 times speed-up compared to sequential:
(base) xxx@xxx:~$ python test.py cmp 1000000
6.40756068899982
(base) xxx@xxx:~$ python test.py cmp 1000000
1.3425709140001345
Now vanilla numpy:
(base) xxx@xxx:~$ python test.py cmp_normal 1000000
4.04174472700015
Numba parallel is fastest. But if you try to return a[c] instead, numba will slow down, so it depends on what you write.
In [223]: arr =np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
In [224]: arr
Out[224]:
array([[ 1, 2, 3],
[ 4, 5, 6],
[ 4, 5, 6],
[ 0, 9, 19]])
Since you are testing values, one for each column, you can do a simple numpy == test (the (3,) test broadcasts with the (4,3) arr)
In [225]: arr==[4,5,6]
Out[225]:
array([[False, False, False],
[ True, True, True],
[ True, True, True],
[False, False, False]])
and where a whole row is true:
In [226]: (arr==[4,5,6]).all(axis=1)
Out[226]: array([False, True, True, False])
This can be applied as a boolean mask to select those rows from arr:
In [227]: arr[_]
Out[227]:
array([[4, 5, 6],
[4, 5, 6]])
and the numeric indices:
In [228]: np.nonzero(__)
Out[228]: (array([1, 2]),)

Udacity Deep Learning: Assignment 1, Part 5

I'm taking the Udacity Deep Learning class and working on the first assignment, problem 5, where you try to count the number of duplicates in, say, your test set and training set. (Or validation and training, etc.)
I've looked at other people's answers, but I'm not satisfied with them for various reasons. For example, I tried out someone's hash-based solution, but I felt the results returned were not likely to be correct.
So the main idea is that you have an array of images that are formatted as arrays, i.e. you're trying to compare two 3-dimensional arrays on index 0. One array is the training dataset, which is 200000 rows with each row containing a 2-D array that holds the values for the image. The other is the test set, which is 10000 rows with each row containing a 2-D array of an image. The goal is to find all rows in the test set that match (for now, an exact match is fine) a row in the training set. Since each 'row' is itself an image (which is a 2-D array), to make this work fast I must be able to compare both sets with an element-wise comparison of each row.
I worked up my own fairly simple solution like this:
# Find duplicates
# Loop through validation/test set and find ones that are identical matrices to something in the training data
def find_duplicates(compare_set, compare_labels, training_set, training_labels):
    dup_count = 0
    duplicates = []
    for i in range(len(compare_set)):
        if i > 100: continue
        if i % 100 == 0:
            print("i: ", i)
        for j in range(len(training_set)):
            if compare_labels[i] == training_labels[j]:
                if np.array_equal(compare_set[i], training_set[j]):
                    duplicates.append((i,j))
                    dup_count += 1
    return dup_count, duplicates
#print(len(valid_dataset))
print(len(train_dataset))
valid_dup_count, duplicates = find_duplicates(valid_dataset, valid_labels, train_dataset, train_labels)
print(valid_dup_count)
print(duplicates)
#test_dups = find_duplicates(test_dataset, train_dataset)
#print(test_dups)
The reason it just "continues" after 100 is because that alone takes a very long time. If I were to try to compare all 10,000 rows of the validation set to the training set, it would take forever.
I like my solution in principle because it allows me to not only count the duplicates, but get a list back of which matches existed. (Something missing on every other solution I've looked at.) This allows me to manually test that I'm getting the right solution.
What I really need is a much faster (i.e. built into Numpy) solution to compare matrices of matrices like this. I've played with 'isin' and 'where' but haven't figured out how to use those to get the results I'm after. Can someone point me in the right direction for a faster solution?
You should be able to compare a single image from compare_set throughout all the images in training_set with a single line of code using np.all(). You can provide multiple axes as a tuple in the axis argument to check array equality over rows and columns, going through each of the images. Then np.where() can give you the indices you want.
For example:
n_train = 50
n_validation = 10
h, w = 28, 28
training_set = np.random.rand(n_train, h, w)
validation_set = np.random.rand(n_validation, h, w)
# create some duplicates
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]
duplicates = []
for i, img in enumerate(validation_set):
    training_dups = np.where(np.all(training_set == img, axis=(1, 2)))[0]
    for j in training_dups:
        duplicates.append((i, j))
print(duplicates)
[(2, 5), (2, 10), (8, 5), (8, 10)]
Many numpy functions, np.all() included, let you specify the axes to operate on. For example, let's say you had the two arrays
>>> A = np.array([[1, 2], [3, 4]])
>>> B = np.array([[1, 2], [5, 6]])
>>> A
array([[1, 2],
[3, 4]])
>>> B
array([[1, 2],
[5, 6]])
Now, A and B have the same first row, but a different second row. If we check equality for them
>>> A == B
array([[ True, True],
[False, False]], dtype=bool)
We get an array the same shape as A and B. But what if I want the indices of the rows which are equal? Well in this case what we can do is say 'only return True if all the values in the row (i.e. the value in each column) are True'. So we can use np.all() after the equality check, and provide it the axis corresponding to the columns.
>>> np.all(A == B, axis=1)
array([ True, False], dtype=bool)
So this result is letting us know that the first row is equal in both arrays, and the second row is not all equal. We can then get the row indices with np.where()
>>> np.where(np.all(A == B, axis=1))
(array([0]),)
So here we see row 0, i.e. A[0] and B[0] are equal.
Now in the solution I proposed, you have a 3D array instead of these 2D arrays. We don't care if a single row is equal, we care if all the rows and columns are equal. So breaking it down as above, let's create two random 5x5 images. I'll grab one of those images and check for equality among the array of two images:
>>> imgs = np.random.rand(2, 5, 5)
>>> img = imgs[1]
>>> imgs == img
array([[[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False],
[False, False, False, False, False]],
[[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True],
[ True, True, True, True, True]]], dtype=bool)
So it is obvious that the second image is the match, but I want to reduce all those True values to one True value; I only want the index corresponding to images where every value is equal.
If we use axis=1
>>> np.all(imgs == img, axis=1)
array([[False, False, False, False, False],
[ True, True, True, True, True]], dtype=bool)
Then we get True for each column of an image if every entry along axis 1 (the image's rows) matches in that column. And really we want to reduce this further by checking equality along the remaining axis as well. So we can take this result, feed it into np.all() and reduce along the last axis of the resulting array:
>>> np.all(np.all(imgs == img, axis=1), axis=1)
array([False, True], dtype=bool)
And this gives us a boolean array marking which image inside imgs is equal to img, and we can simply get the index with np.where(). But you don't actually need to call np.all() twice like this; instead you can provide it multiple axes in a tuple to reduce along both the rows and columns in one step:
>>> np.all(imgs == img, axis=(1, 2))
array([False, True], dtype=bool)
And that's what the solution above does. Hope that clears it up!
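If the loop over the validation images is still too slow against 200000 training images, one different technique (not the np.all() approach above) is to index the training images by their raw bytes in a dictionary and look each validation image up. This is only a sketch under assumptions: the function name find_duplicates_by_bytes is made up, both sets must share dtype and image shape, and byte equality is stricter than == in corner cases such as 0.0 versus -0.0.

```python
import numpy as np
from collections import defaultdict

def find_duplicates_by_bytes(compare_set, training_set):
    """Return (i, j) pairs where compare_set[i] is byte-identical to training_set[j]."""
    index = defaultdict(list)
    for j, img in enumerate(training_set):
        index[img.tobytes()].append(j)          # raw bytes act as the dictionary key
    pairs = []
    for i, img in enumerate(compare_set):
        for j in index.get(img.tobytes(), []):  # .get does not create new entries
            pairs.append((i, j))
    return pairs

rng = np.random.default_rng(0)
training_set = rng.random((50, 28, 28))
validation_set = rng.random((10, 28, 28))
# create some duplicates, as in the example above
training_set[5] = training_set[10]
validation_set[2] = training_set[10]
validation_set[8] = training_set[10]

duplicates = find_duplicates_by_bytes(validation_set, training_set)
print(duplicates)
```

Each image is visited once, so the cost grows with the total number of images rather than with the product of the two set sizes, and the list of (i, j) pairs is still available for manual checking.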

Delete rows from a ndarray in python

I have a 2D - array A, which contains the x and y coordinates of points
array([[ 0, 0],
[ 0, 0],
[ 0, 0],
[ 3, 4],
[ 4, 1],
[ 5, 10],
[ 9, 7]])
As you can see, the point (0, 0) appears more than once.
I want to delete this point so that the array looks like this:
array([[ 3, 4],
[ 4, 1],
[ 5, 10],
[ 9, 7]])
Since the real array is very large, it is important to do this without for loops; otherwise it takes very long.
I'm new to Python but I'm used to MATLAB, where I can solve it very easily with:
A (A(:,1) == 0 & A(:,2) == 0, :) = []
I thought it would be almost the same or very similar in Python, but I can't figure it out and am totally stuck. Errors like "use a.any()/all()" or "ufunc "bitwise_and" not supported for the input types" appear and I don't know what I should change.
Technically what you are doing in MATLAB is not deleting elements from A. What you are actually doing is creating a new array that lacks the elements of A. It is equivalent to:
>> A = A (A(:,1) ~= 0 | A(:,2) ~= 0, :);
You can do exactly the same thing in numpy:
>>> a = a[(a[:,0] != 0) | (a[:,1] != 0), :]
However, thanks to numpy's automatic broadcasting, you can make this simpler:
>>> a = a[(a != [0, 0]).any(1)]
This will work for any target array so long as it has the same number of columns as a.
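A small self-contained run of the broadcasting version, using the array from the question (the names target and filtered are just for illustration):

```python
import numpy as np

a = np.array([[0, 0], [0, 0], [0, 0], [3, 4], [4, 1], [5, 10], [9, 7]])
target = np.array([0, 0])           # the point to drop

# A row is kept if it differs from target in at least one column
filtered = a[(a != target).any(axis=1)]

# A row that matches target in only one column is kept as well
partial = np.array([[0, 5], [0, 0]])
kept_partial = partial[(partial != target).any(axis=1)]

print(filtered)
print(kept_partial)
```

Note that the result is a new array; the original a is left unchanged unless you rebind the name.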

When does advanced indexing on structured masked arrays *really* return a copy?

When I have a structured masked array with boolean indexing, under what conditions do I get a view and when do I get a copy? The documentation says that advanced indexing always returns a copy, but this is not true, since something like X[X>0]=42 is technically advanced indexing, but the assignment works. My situation is more complex:
I want to set the mask of a particular field based on a criterion from another field, so I need to get the field, apply the boolean indexing, and get the mask. There are 3! = 6 orders of doing so.
Preparation:
In [83]: M = ma.MaskedArray(random.random(400).view("f8,f8,f8,f8")).reshape(10, 10)
In [84]: crit = M[:, 4]["f2"] > 0.5
Field - index - mask (fails):
In [85]: M["f3"][crit, 3].mask = True
In [86]: print(M["f3"][crit, 3].mask)
[False False False False False]
Index - field - mask (fails):
In [87]: M[crit, 3]["f3"].mask = True
In [88]: print(M[crit, 3]["f3"].mask)
[False False False False False]
Index - mask - field (fails):
In [94]: M[crit, 3].mask["f3"] = True
In [95]: print(M[crit, 3].mask["f3"])
[False False False False False]
Mask - index - field (fails):
In [101]: M.mask[crit, 3]["f3"] = True
In [102]: print(M.mask[crit, 3]["f3"])
[False False False False False]
Field - mask - index (succeeds):
In [103]: M["f3"].mask[crit, 3] = True
In [104]: print(M["f3"].mask[crit, 3])
[ True True True True True]
# set back to False so I can try method #6
In [105]: M["f3"].mask[crit, 3] = False
In [106]: print(M["f3"].mask[crit, 3])
[False False False False False]
Mask - field - index (succeeds):
In [107]: M.mask["f3"][crit, 3] = True
In [108]: print(M.mask["f3"][crit, 3])
[ True True True True True]
So, it looks like indexing must come last.
The issue of __setitem__ v. __getitem__ is important, but with structured array and masking it's a little harder to sort out when a __getitem__ is first making a copy.
Regarding the structured arrays, it shouldn't matter whether the field index occurs first or the element. However some releases appear to have a bug in this regard. I'll try to find a recent SO question where this was a problem.
With a masked array, there's the question of how to correctly modify the mask. The .mask is a property that accesses the underlying ._mask array. But that is fetched with __getattr__. So the simple setitem v getitem distinction does not apply directly.
Let's skip the structured bit first:
In [584]: M = np.ma.MaskedArray(np.arange(4))
In [585]: M
Out[585]:
masked_array(data = [0 1 2 3],
mask = False,
fill_value = 999999)
In [586]: M.mask
Out[586]: False
In [587]: M.mask[[1,2]]=True
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-587-9010ee8f165e> in <module>()
----> 1 M.mask[[1,2]]=True
TypeError: 'numpy.bool_' object does not support item assignment
Initially mask is a scalar boolean, not an array.
This works
In [588]: M.mask=np.zeros((4,),bool) # change mask to array
In [589]: M
Out[589]:
masked_array(data = [0 1 2 3],
mask = [False False False False],
fill_value = 999999)
In [590]: M.mask[[1,2]]=True
In [591]: M
Out[591]:
masked_array(data = [0 -- -- 3],
mask = [False True True False],
fill_value = 999999)
This does not
In [592]: M[[1,2]].mask=True
In [593]: M
Out[593]:
masked_array(data = [0 -- -- 3],
mask = [False True True False],
fill_value = 999999)
M[[1,2]] is evidently a copy, and the assignment is to that copy's mask attribute, not to M.mask.
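Because M was already masked at positions 1 and 2 in the session above, the difference is easier to see when starting from a clean mask. A minimal sketch (the two snapshot variable names are made up for illustration):

```python
import numpy as np

M = np.ma.MaskedArray(np.arange(4), mask=np.zeros(4, dtype=bool))

# Get-then-set: M[[1, 2]] builds a new masked array with its own mask,
# so setting .mask on that temporary leaves M untouched.
M[[1, 2]].mask = True
mask_after_copy_write = M.mask.copy()

# Indexing last: this is a __setitem__ on M's own mask array.
M.mask[[1, 2]] = True
mask_after_inplace_write = M.mask.copy()

print(mask_after_copy_write)
print(mask_after_inplace_write)
```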
....
A masked array has a .__setmask__ method. You can study it in numpy/ma/core.py. And the mask property is defined with
mask = property(fget=_get_mask, fset=__setmask__, doc="Mask")
So M.mask=... does use this.
So it looks like the problem case is doing
M.__getitem__(index).__setmask__(values)
hence the copy. The M.mask[]=... is doing
M._mask.__setitem__(index, values)
since _get_mask just does return self._mask.
M["f3"].mask[crit, 3] = True
works because M['f3'] is a view. (M[['f1','f3']] is ok for get, but doesn't work for setting).
M.mask["f3"] is also a view. I'm not entirely sure of the order of the relevant gets and sets. __setmask__ has code that deals specifically with compound dtype (structured).
=========================
Looking at a structured array, without the masking complication, the indexing order matters
In [607]: M1 = np.arange(16).view("i,i")
In [609]: M1[[3,4]]['f1']=[3,4] # no change
In [610]: M1[[3,4]]['f1']
Out[610]: array([7, 9], dtype=int32)
In [611]: M1['f1'][[3,4]]=[1,2] # change
In [612]: M1
Out[612]:
array([(0, 1), (2, 3), (4, 5), (6, 1), (8, 2), (10, 11), (12, 13), (14, 15)], dtype=[('f0', '<i4'), ('f1', '<i4')])
So we still have a __getitem__ followed by a __setitem__, and we have to pay attention as to whether the get returns a view or a copy.
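This is because, although reading with advanced indexing returns a copy, assigning to an advanced index still works in place. Only the orders where advanced indexing is the last operation actually assign through advanced indexing (via __setitem__); in the others, a __getitem__ produces a copy first and the assignment lands on that copy.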
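The get-then-set rule can be checked on a plain structured array, without any masking. This is a sketch with a made-up array name rec and an explicit dtype, so it does not depend on the platform's default integer size:

```python
import numpy as np

rec = np.zeros(8, dtype=[("f0", "i4"), ("f1", "i4")])

# Advanced index first: rec[[3, 4]] is a copy, so the field assignment is lost
rec[[3, 4]]["f1"] = [3, 4]
f1_after_copy_write = rec["f1"].copy()

# Field first: rec["f1"] is a view, and the advanced index is the final __setitem__
rec["f1"][[3, 4]] = [1, 2]
f1_after_view_write = rec["f1"].copy()

print(f1_after_copy_write)
print(f1_after_view_write)
```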

Python pandas json 2D array

I'm relatively new to pandas. I have a JSON file and a Python file:
{"dataset":{
"id": 123,
"data": [["2015-10-16",1,2,3,4,5,6],
["2015-10-15",7,8,9,10,11,12],
["2015-10-14",13,14,15,16,17]]
}}
and
import pandas
x = pandas.read_json('sample.json')
y = x.dataset.data
print(x.dataset)
Printing x.dataset and y works fine, but when I try to access a sub-element of y, it returns a 'buffer' type. What's going on? How can I access the data inside the array? Attempting y[0][1] returns an out-of-bounds error, and iterating through returns a strange series of 'nul' characters, and yet it appears to be able to return the first portion of the data after printing x.dataset...
The data attribute of a pandas Series points to the memory buffer of all the data contained in that series:
>>> df = pandas.read_json('sample.json')
>>> type(df.dataset)
pandas.core.series.Series
>>> type(df.dataset.data)
memoryview
If you have a column/row named "data", you have to access it by its string name, e.g.:
>>> type(df.dataset['data'])
list
Because of surprises like this, it's usually considered best practice to access columns through indexing rather than through attribute access. If you do this, you will get your desired result:
>>> df['dataset']['data']
[['2015-10-16', 1, 2, 3, 4, 5, 6],
['2015-10-15', 7, 8, 9, 10, 11, 12],
['2015-10-14', 13, 14, 15, 16, 17]]
>>> arr = df['dataset']['data']
>>> arr[0][0]
'2015-10-16'
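The same kind of name clash can be sketched without a JSON file. Whether .data still exists as a Series attribute depends on the pandas version, so this example adds a hypothetical "size" key instead, which clashes with the Series.size attribute; the variable names are illustrative only:

```python
import pandas as pd

dataset = pd.Series({
    "id": 123,
    "size": 99,  # hypothetical key, added only to show a name clash
    "data": [["2015-10-16", 1, 2, 3, 4, 5, 6],
             ["2015-10-15", 7, 8, 9, 10, 11, 12],
             ["2015-10-14", 13, 14, 15, 16, 17]],
})

by_attribute = dataset.size         # Series.size: the number of elements
by_key = dataset["size"]            # the value stored under the key "size"
first_date = dataset["data"][0][0]  # bracket access reaches the nested list

print(by_attribute, by_key, first_date)
```

Bracket access always looks the label up in the index, so it is not affected by whatever attributes or methods the Series class happens to define.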