How to compare numpy arrays of tuples?

Here's an MWE that illustrates the issue I have:
import numpy as np
arr = np.full((3, 3), -1, dtype="i,i")
doesnt_work = arr == (-1, -1)
n_arr = np.full((3, 3), -1, dtype=int)
works = n_arr == 10
arr is supposed to be an array of tuples, but it doesn't behave as expected.
works is an array of booleans, as expected, but doesnt_work is False. Is there a way to get numpy to do elementwise comparisons on more complex types, or do I have to resort to list comprehension, flatten and reshape?
There's a second problem:
f = arr[(0, 0)] == (-1, -1)
f is False, because arr[(0,0)] is of type numpy.void rather than a tuple. So even if the componentwise comparison worked, it would give the wrong result. Is there a clever numpy way to do this or should I just resort to list comprehension?

Both problems are actually the same problem, and both are related to the custom data type you created when you specified dtype="i,i".
If you run arr.dtype you will get dtype([('f0', '<i4'), ('f1', '<i4')]): that is, two signed 32-bit integers placed in one contiguous block of memory. This is not a Python tuple, so it is clear why the naive comparison fails: (-1, -1) is a Python tuple and is not represented in memory the same way the numpy data type is.
However, if you compare with a_comp = np.array((-1, -1), dtype="i,i"), you get the exact behavior you are expecting!
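A quick sketch of that fix:
import numpy as np

arr = np.full((3, 3), -1, dtype="i,i")
# Build the comparison value with the same structured dtype,
# so numpy compares field by field instead of giving a plain False.
a_comp = np.array((-1, -1), dtype="i,i")
print(arr == a_comp)
# [[ True  True  True]
#  [ True  True  True]
#  [ True  True  True]]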
You can read more about how the custom dtype stuff works on the numpy docs:
https://numpy.org/doc/stable/reference/arrays.dtypes.html
Oh, and to address what np.void is: the name comes from the idea of a void C pointer, which is essentially an address to a contiguous block of memory of unspecified type. Provided you (the programmer) know what is stored in that memory (in this case two back-to-back integers), that is fine, as long as you are careful to compare with the same custom data type.
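A short demonstration of what indexing actually returns, and one way (a sketch) to recover a plain Python tuple from the np.void scalar:
import numpy as np

arr = np.full((3, 3), -1, dtype="i,i")
item = arr[(0, 0)]
print(type(item))   # <class 'numpy.void'>
print(item.dtype)   # dtype([('f0', '<i4'), ('f1', '<i4')])
# .item() unpacks the record scalar into a plain Python tuple,
# so the naive tuple comparison works again:
print(item.item() == (-1, -1))   # True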

Related

Arrow ListArray from pandas has very different structure from arrow array generated by awkward?

I encountered the following issue while putting together some tests to demonstrate the usefulness of a pure pyarrow UDF in pyspark, as compared to always going through pandas.
import awkward
import numpy
import pandas
import pyarrow
counts = numpy.random.randint(0,20,size=200000)
content = numpy.random.normal(size=counts.sum())
test_jagged = awkward.JaggedArray.fromcounts(counts, content)
test_arrow = awkward.toarrow(test_jagged)
def awk_arrow(col):
    jagged = awkward.fromarrow(col)
    jagged2 = jagged**2
    return awkward.toarrow(jagged2)

def pds_arrow(col):
    pds = col.to_pandas()
    pds2 = pds**2
    return pyarrow.Array.from_pandas(pds2)
out1 = awk_arrow(test_arrow)
out2 = pds_arrow(test_arrow)
out3 = awkward.fromarrow(out1)
out4 = awkward.fromarrow(out2)
type(out3)
type(out4)
yields
<class 'awkward.array.jagged.JaggedArray'>
<class 'awkward.array.masked.BitMaskedArray'>
and
out3 == out4
yields (at the end of the stack trace):
AttributeError: no column named 'reshape'
looking at the arrays:
print(out3);print();print(out4);
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
[[0.00736072240594475 0.055560612050914775 0.4094101942882973 ... 2.4428454924678533 0.07220045904440388 3.627270394986972] [0.16496227597707766 0.44899025266849046 1.314602433843517 ... 0.07384558862546337 0.5655043672418324 4.647396184088295] [0.04356259421421215 1.8983172440218923 0.10442121937532822 0.7222467989756899 0.03199694383894229 0.954281670741488] ... [0.23437909336737087 2.3050822727237272 0.10325064534860394 0.685018355096147] [0.8678765133108529 0.007214659054089928 0.3674379091794599 0.1891573101427716 2.1412651888713317 0.1461282900111415] [0.3315468986268042 2.7520115602119772 1.3905787720409803 ... 4.476255451581318 0.7237199572195625 0.8820112289563018]]
You can see the contents and shape of the arrays are the same, but they're not comparable to each other at face value, which is very counterintuitive. Is there a good reason for dense jagged structures with no nulls to be represented as a BitMaskedArray?
All data in Arrow are nullable (at every level), and they use bit masks (as opposed to byte masks) to specify which elements are valid. The specification allows columns of entirely valid data to not write the bitmask, but not every writer takes advantage of that freedom. Quite often, you see unnecessary bitmasks.
When it encounters a bitmask, such as here, awkward inserts a BitMaskedArray.
It could be changed to check whether the mask is unnecessary and skip that step, though that adds an operation that scales with the size of the dataset (likely insignificant in most cases; bitmasks are 8 times faster to check than bytemasks). It's also a little complicated: the last byte may be incomplete if the length of the dataset is not a multiple of 8. One would need to check those bits individually, but the rest of the mask could be checked in bulk (maybe even cast as int64 to check 64 flags at a time).
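To make that concrete, here is a minimal sketch of such a check (the function name and layout are my assumptions, not awkward's API), assuming the bitmask is a 1-D uint8 numpy array packed least-significant-bit first as in the Arrow spec:
import numpy as np

def bitmask_all_valid(mask_bytes, length):
    # Full bytes can be checked in bulk: all 8 bits set means 0xFF.
    n_full, n_rest = divmod(length, 8)
    if not np.all(mask_bytes[:n_full] == 0xFF):
        return False
    # The last byte may be incomplete; check its bits individually.
    if n_rest:
        last = int(mask_bytes[n_full])
        return all((last >> i) & 1 for i in range(n_rest))
    return True

# 10 elements, all valid: one full byte plus two bits of the next.
mask = np.array([0xFF, 0b00000011], dtype=np.uint8)
print(bitmask_all_valid(mask, 10))  # True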

How to check the presence of a given numpy array in a larger-shape numpy array?

I guess the title of my question might not be very clear...
I have a small array, say a = ([[0,0,0],[0,0,1],[0,1,1]]). Then I have a bigger array of a higher dimension, say b = ([[[2,2,2],[2,0,1],[2,1,1]],[[0,0,0],[3,3,1],[3,1,1]],[...]]).
I'd like to check if one of the elements of a can be found in b. In this case, I'd find that the first element of a [0,0,0] is indeed in b, and then I'd like to retrieve the corresponding index in b.
I'd like to do that without looping, since from the very little I understand about numpy arrays, they are not meant to be iterated over in a classic way. In other words, I need it to be very fast, because my actual arrays are quite big.
Any idea?
Thanks a lot!
Arnaud.
I don't know of a direct way, but here's a function that works around the problem:
import numpy as np

def find_indices(val, arr):
    # First take a mean at the lowest level of each array,
    # then compare these to eliminate the majority of entries.
    mb = np.mean(arr, axis=2)
    ma = np.mean(val)
    Y = np.argwhere(mb == ma)
    indices = []
    # Then run a quick loop on the remaining elements to
    # eliminate arrays that don't match the order.
    for i in range(len(Y)):
        idx = (Y[i, 0], Y[i, 1])
        if np.array_equal(val, arr[idx]):
            indices.append(idx)
    return indices

# Sample arrays
a = np.array([[0,0,0],[0,0,1],[0,1,1]])
b = np.array([ [[6,5,4],[0,0,1],[2,3,3]],
               [[2,5,4],[6,5,4],[0,0,0]],
               [[2,0,2],[3,5,4],[5,4,6]],
               [[6,5,4],[0,0,0],[2,5,3]] ])

print(find_indices(a[0], b))
# [(1, 2), (3, 1)]
print(find_indices(a[1], b))
# [(0, 1)]
The idea is to take the mean of each lowest-level array and compare it with the mean of the input; np.argwhere() is the key here. That eliminates most of the unwanted entries, but a quick loop over the remainder is still needed to discard arrays that happen to have the same mean without matching element order (this shouldn't be too memory-consuming). You'll probably want to customise it further, but I hope this helps.
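For completeness, a fully vectorized alternative (a sketch that checks exact equality directly, rather than pre-filtering by means): broadcast-compare val against every lowest-level row and reduce with .all() over the last axis:
import numpy as np

def find_indices_vectorized(val, arr):
    # (arr == val) broadcasts val against every row; .all(axis=-1)
    # keeps only rows where every component matches.
    matches = (arr == val).all(axis=-1)
    return [tuple(map(int, idx)) for idx in np.argwhere(matches)]

a = np.array([[0,0,0],[0,0,1],[0,1,1]])
b = np.array([ [[6,5,4],[0,0,1],[2,3,3]],
               [[2,5,4],[6,5,4],[0,0,0]],
               [[2,0,2],[3,5,4],[5,4,6]],
               [[6,5,4],[0,0,0],[2,5,3]] ])
print(find_indices_vectorized(a[0], b))  # [(1, 2), (3, 1)]
print(find_indices_vectorized(a[1], b))  # [(0, 1)]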

Evaluate several elements of numpy object array

I have an ndarray A that stores objects of the same type, in particular various LinearNDInterpolator objects. For example's sake assume it's just 2:
>>> A
array([ <scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe122adc750>,
<scipy.interpolate.interpnd.LinearNDInterpolator object at 0x7fe11daee590>], dtype=object)
I want to be able to do two things. First, I'd like to evaluate all objects in A at a certain point and get back an ndarray of A.shape with all the values in it. Something like
>> A[[0,1]](1,1) =
array([ 1, 2])
However, I get
TypeError: 'numpy.ndarray' object is not callable
Is it possible to do that?
Second, I would like to change the interpolation values without constructing new LinearNDInterpolator objects (since the nodes stay the same). I.e., something like
A[[0,1]].values = B
where B is an ndarray containing the new values for every element of A.
Thank you for your suggestions.
The same issue, but with simpler functions:
In [221]: A = np.array([add, multiply])   # numpy's add and multiply ufuncs
In [222]: A[0](1,2) # individual elements can be called
Out[222]: 3
In [223]: A(1,2) # but not the array as a whole
---------------------------------------------------------------------------
TypeError: 'numpy.ndarray' object is not callable
We can iterate over a list of functions, or that array as well, calling each element on the parameters. Done right we can even zip a list of functions and a list of parameters.
In [224]: ll=[add,multiply]
In [225]: [x(1,2) for x in ll]
Out[225]: [3, 2]
In [226]: [x(1,2) for x in A]
Out[226]: [3, 2]
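To get back an array of A.shape, as the original question asks, one option is a comprehension plus reshape; np.frompyfunc can also wrap the per-element call. A sketch, with add and multiply standing in for the interpolators:
import numpy as np
from numpy import add, multiply

A = np.array([add, multiply], dtype=object)
# Evaluate every element at the same arguments, keeping A's shape:
vals = np.array([f(1, 2) for f in A.ravel()]).reshape(A.shape)
print(vals)          # [3 2]
# np.frompyfunc turns "call each element" into a ufunc-like operation:
call_at = np.frompyfunc(lambda f: f(1, 2), 1, 1)
print(call_at(A))    # [3 2] (object dtype)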
Another test, with the callable() function:
In [229]: callable(A)
Out[229]: False
In [230]: callable(A[0])
Out[230]: True
Can you change the interpolation values for individual Interpolators? If so, just iterate through the list and do that.
In general, dtype=object arrays function like lists: they contain the same kind of object pointers, and most operations require the same sort of iteration. Unless you need to organize the elements in multiple dimensions, object dtype arrays have few, if any, advantages over lists.
Another thought: the normal array dtypes are numeric or fixed-length strings. Those elements are not callable, so there's no need to implement a .__call__ method on such arrays. One could write something like that for object dtype arrays, but the core action is a Python call, so such a function would just hide the kind of iteration outlined above.
In another recent question I showed how to use np.char.upper to apply a string method to every element of an S dtype array. But my time tests showed that this did not speed anything up.

NumPy argmax and structured array error: expected a readable buffer object

I got the following error while using the NumPy argmax method. Could someone help me understand what happened?
import numpy as np
b = np.zeros(1, dtype={'names':['a','b'], 'formats': ['i4']*2})
b.argmax()
The error is
TypeError: expected a readable buffer object
While the following runs without a problem:
a = np.zeros(3)
a.argmax()
It seems the error is due to the structured array. Could anyone help explain the reason?
Your b is:
array([(0, 0)], dtype=[('a', '<i4'), ('b', '<i4')])
I get a different error message with argmax:
TypeError: Cannot cast array data from dtype([('a', '<i4'), ('b', '<i4')]) to dtype('V8') according to the rule 'safe'
But this works:
In [88]: b['a'].argmax()
Out[88]: 0
Generally you can't do math operations across the fields of a structured array. You can operate within each field (if it is numeric). Since the fields could be a mix of numbers, strings and other objects, there's been no effort to handle special cases where such operations might make sense.
If you really must do operations across the fields, try a different view, e.g.:
In [94]: b.view('<i4').argmax()
Out[94]: 0
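A sketch of the same view trick with nontrivial data; note it only makes sense here because both fields share the same '<i4' type:
import numpy as np

b = np.zeros(3, dtype={'names': ['a', 'b'], 'formats': ['i4']*2})
b['a'] = [1, 5, 2]
b['b'] = [4, 0, 9]
# Within one field, argmax works normally:
print(b['a'].argmax())    # 1
# Viewing as plain i4 flattens the fields into one numeric array,
# so argmax runs over all six values (index 5 is b['b'][2] == 9):
flat = b.view('<i4')
print(flat)               # [1 4 5 0 2 9]
print(flat.argmax())      # 5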

numpy concatenate of empty with non-empty arrays yields float

I just found that concatenating an empty array with a non-empty array yields a one-element array containing the non-empty array's value, but cast to float.
For example:
import numpy as np
np.concatenate(([1], [1]))
array([1, 1])
but
np.concatenate(([], [1]))
array([1.])
this works the same with np.hstack
By default, the empty array in the code
np.concatenate(([], [1]))
is created with dtype=float (a bare [] becomes np.array([]), which is float64), so concatenate casts the second, int array to float.
Now, it's worth asking whether it ever happens in practice that you use concatenate on empty arrays. Clearly, you never write code like
a = np.array([1, 2, 3])  # int array
b = np.concatenate(([], a))
One scenario where it may happen follows:
a = np.array([1, 2, 3])  # int array
b = np.concatenate((a[:j], a))  # usually j != 0 here
Then for some reason the code is run with j=0. It is true that a[:0] is empty, but it still retains dtype=int, so the result of concatenate is an array of integers anyway, as you expected.
So I would say that yes, your example shows somewhat unexpected behaviour at first sight, but it's quite harmless.
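A quick sketch demonstrating both cases (the exact integer dtype is platform-dependent):
import numpy as np

a = np.array([1, 2, 3])           # int array
print(a[:0].dtype)                # int64 - the empty slice keeps a's dtype
b = np.concatenate((a[:0], a))    # the j == 0 case
print(b, b.dtype)                 # [1 2 3] int64 - still integers

# The float result only appears when the empty operand comes from a
# plain Python [], which defaults to dtype=float64:
print(np.array([]).dtype)         # float64
print(np.concatenate(([], a)))    # [1. 2. 3.]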