How to zero out all entries of a dask array less than the top k - numpy

I want to zero out all of the elements of a dask.array except for the top few elements. How do I do this?
Example
Say I have a small dask array like the following:
import numpy as np
import dask.array as da
x = np.array([0, 4, 2, 3, 1])
x = da.from_array(x, chunks=(2,))
How do I zero out all but the two largest elements? I want something like the following:
>>> result.compute()
array([0, 4, 0, 3, 0])

You can do this with a combination of the topk function and inplace setitem
top = x.topk(2)
x[x < top[-1]] = 0
>>> x.compute()
array([0, 4, 0, 3, 0])
Note that this won't stream particularly nicely through memory. If you're using the single machine scheduler then you might want to do this in two passes by explicitly computing top ahead of time:
top = x.topk(2)
top = top.compute() # pass through data once to get top elements
x[x < top[-1]] = 0 # then pass through again applying filter
>>> x.compute()
array([0, 4, 0, 3, 0])
This only matters if you're trying to stream through a large dataset on a single machine and should not affect you much if you're on a distributed system.

Related

Numpy subarrays and relative indexing

I have been searching if there is an standard mehtod to create a subarray using relative indexes. Take the following array into consideration:
>>> m = np.arange(25).reshape([5, 5])
array([[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]])
I want to access the 3x3 matrix at a specific array position, for example [2,2]:
>>> x = 2, y = 2
>>> m[slice(x-1,x+2), slice(y-1,y+2)]
array([[ 6, 7, 8],
[11, 12, 13],
[16, 17, 18]])
For example for the above somethig like m.subarray(pos=[2,2], shape=[3,3])
I want to sample a ndarray of n dimensions on a specific position which might change.
I did not want to use a loop as it might be inneficient. Scipy functions correlate and convolve do this very efficiently, but for all positions. I am interested only in the sampling of one.
The best answer could solve the issues at edges, in my case I would like for example to have wrap mode:
(a b c d | a b c d | a b c d)
--------------------EDITED-----------------------------
Based on the answer from #Carlos Horn, I could create the following function.
def cell_neighbours(array, index, shape):
pads = [(floor(dim/2), ceil(dim / 2)) for dim in shape]
array = np.pad(self.configuration, pads, "wrap")
views = np.lib.stride_tricks.sliding_window_view
return views(array, shape)[tuple(index)]
Last concern might be about speed, from docs: For many applications using a sliding window view can be convenient, but potentially very slow. Often specialized solutions exist.
From here maybe is easier to get a faster solution.
You could build a view of 3x3 matrices into the array as follows:
import numpy as np
m = np.arange(25).reshape(5,5)
m3x3view = np.lib.stride_tricks.sliding_window_view(m, (3,3))
Note that it will change slightly your indexing on half the window size meaning
x_view = x - 3//2
y_view = y - 3//2
print(m3x3view[x_view,y_view]) # gives your result
In case a copy operation is fine, you could use:
mpad = np.pad(m, 1, mode="wrap")
mpad3x3view = np.lib.stride_tricks.sliding_window_view(mpad, (3,3))
print(mpad3x3view[x % 5,y % 5])
to use arbitrary x, y integer values.

What does the [1] do when using .where()?

I m practicing on a Data Cleaning Kaggle excercise.
In parsing dates example I can´t figure out what the [1] does at the end of the indices object.
Thanks..
# Finding indices corresponding to rows in different date format
indices = np.where([date_lengths == 24])[1]
print('Indices with corrupted data:', indices)
earthquakes.loc[indices]
As described in the documentation, numpy.where called with a single argument is equivalent to calling np.asarray([date_lengths == 24]).nonzero().
numpy.nonzero return a tuple with as many items as the dimensions of the input array with the indexes of the non-zero values.
>>> np.nonzero([1,0,2,0])
(array([0, 2]),)
Slicing [1] enables to get the second element (i.e. second dimension) but as the input was wrapped into […], this is equivalent to doing:
np.where(date_lengths == 24)[0]
>>> np.nonzero([1,0,2,0])[0]
array([0, 2])
It is an artefact of the extra [] around the condition. For example:
a = np.arange(10)
To find, for example, indices where a>3 can be done like this:
np.where(a > 3)
gives as output a tuple with one array
(array([4, 5, 6, 7, 8, 9]),)
So the indices can be obtained as
indices = np.where(a > 3)[0]
In your case, the condition is between [], which is unnecessary, but still works.
np.where([a > 3])
returns a tuple of which the first is an array of zeros, and the second array is the array of indices you want
(array([0, 0, 0, 0, 0, 0]), array([4, 5, 6, 7, 8, 9]))
so the indices are obtained as
indices = np.where([a > 3])[1]

Explicit slicing across a particular dimension

I've got a 3D tensor x (e.g 4x4x100). I want to obtain a subset of this by explicitly choosing elements across the last dimension. This would have been easy if I was choosing the same elements across last dimension (e.g. x[:,:,30:50] but I want to target different elements across that dimension using the 2D tensor indices which specifies the idx across third dimension. Is there an easy way to do this in numpy?
A simpler 2D example:
x = [[1,2,3,4,5,6],[10,20,30,40,50,60]]
indices = [1,3]
Let's say I want to grab two elements across third dimension of x starting from points specified by indices. So my desired output is:
[[2,3],[40,50]]
Update: I think I could use a combination of take() and ravel_multi_index() but some of the platforms that are inspired by numpy (like PyTorch) don't seem to have ravel_multi_index so I'm looking for alternative solutions
Iterating over the idx, and collecting the slices is not a bad option if the number of 'rows' isn't too large (and the size of the sizes is relatively big).
In [55]: x = np.array([[1,2,3,4,5,6],[10,20,30,40,50,60]])
In [56]: idx = [1,3]
In [57]: np.array([x[j,i:i+2] for j,i in enumerate(idx)])
Out[57]:
array([[ 2, 3],
[40, 50]])
Joining the slices like this only works if they all are the same size.
An alternative is to collect the indices into an array, and do one indexing.
For example with a similar iteration:
idxs = np.array([np.arange(i,i+2) for i in idx])
But broadcasted addition may be better:
In [58]: idxs = np.array(idx)[:,None]+np.arange(2)
In [59]: idxs
Out[59]:
array([[1, 2],
[3, 4]])
In [60]: x[np.arange(2)[:,None], idxs]
Out[60]:
array([[ 2, 3],
[40, 50]])
ravel_multi_index is not hard to replicate (if you don't need clipping etc):
In [65]: np.ravel_multi_index((np.arange(2)[:,None],idxs),x.shape)
Out[65]:
array([[ 1, 2],
[ 9, 10]])
In [66]: x.flat[_]
Out[66]:
array([[ 2, 3],
[40, 50]])
In [67]: np.arange(2)[:,None]*x.shape[1]+idxs
Out[67]:
array([[ 1, 2],
[ 9, 10]])
along the 3D axis:
x = [x[:,i].narrow(2,index,2) for i,index in enumerate(indices)]
x = torch.stack(x,dim=1)
by enumerating you get the index of the axis and index from where you want to start slicing in one.
narrow gives you a zero-copy length long slice from a starting index start along a certain axis
you said you wanted:
dim = 2
start = index
length = 2
then you simply have to stack these tensors back to a single 3D.
This is the least work intensive thing i can think of for pytorch.
EDIT
if you just want different indices along different axis and indices is a 2D tensor you can do:
x = [x[:,i,index] for i,index in enumerate(indices)]
x = torch.stack(x,dim=1)
You really should have given a proper working example, making it unnecessarily confusing.
Here is how to do it in numpy, now clue about torch, though.
The following picks a slice of length n along the third dimension starting from points idx depending on the other two dimensions:
# example
a = np.arange(60).reshape(2, 3, 10)
idx = [(1,2,3),(4,3,2)]
n = 4
# build auxiliary 4D array where the last two dimensions represent
# a sliding n-window of the original last dimension
j,k,l = a.shape
s,t,u = a.strides
aux = np.lib.stride_tricks.as_strided(a, (j,k,l-n+1,n), (s,t,u,u))
# pick desired offsets from sliding windows
aux[(*np.ogrid[:j, :k], idx)]
# array([[[ 1, 2, 3, 4],
# [12, 13, 14, 15],
# [23, 24, 25, 26]],
# [[34, 35, 36, 37],
# [43, 44, 45, 46],
# [52, 53, 54, 55]]])
I came up with below using broadcasting:
x = np.array([[1,2,3,4,5,6,7,8,9,10],[10,20,30,40,50,60,70,80,90,100]])
i = np.array([1,5])
N = 2 # number of elements I want to extract along each dimension. Starting points specified in i
r = np.arange(x.shape[-1])
r = np.broadcast_to(r, x.shape)
ii = i[:, np.newaxis]
ii = np.broadcast_to(ii, x.shape)
mask = np.logical_and(r-ii>=0, r-ii<=N)
output = x[mask].reshape(2,3)
Does this look alright?

Most efficient way to do this slice based multiplication in Tensorflow

I'm trying to perform an operation of multiplying a slice of a 2D matrix by a constant.
For example, if i wanted to multiply everything but the first 2 columns
To perform this in numpy, one could do:
a = np.array([[0,7,4],
[1,6,4],
[0,2,4],
[4,2,7]])
a[:, 2:] = 2.0*a[:, 2:]
>> a
>> array([[ 0, 7, 8],
[ 1, 6, 8],
[ 0, 2, 8],
[ 4, 2, 14]])
However, at least from what i've searched, tensorflow currently doesn't have a straightforward way to do this.
My current solution is to create a originally as two separate Tensors a1 and a2, multiply the second one by 2.0 and then concatenate them across axis=1. The operation is simple enough that this is possible. However I have two questions
Is that the most efficient way to do this
Is there a better (general/efficient) way to perform this to bring the functionality closer to numpy's slicing magic (perhaps https://www.tensorflow.org/api_docs/python/tf/scatter_
One option is to perform entrywise multiplication, as follows:
import tensorflow as tf
a = tf.Variable(initial_value=[[0,7,4],[1,6,4],[0,2,4],[4,2,7]])
b = tf.mul(a,[1,1,2])
s=tf.InteractiveSession()
s.run(tf.global_variables_initializer())
b.eval()
This prints
array([[ 0, 7, 8],
[ 1, 6, 8],
[ 0, 2, 8],
[ 4, 2, 14]])
More generally, if a has more columns, you can do something like that:
import tensorflow as tf
a = tf.Variable(initial_value=[[0,7,4],[1,6,4],[0,2,4],[4,2,7]])
b = tf.mul(a,[1,1]+[2 for i in range(a.get_shape()[1]-2)])
s=tf.InteractiveSession()
s.run(tf.global_variables_initializer())
b.eval()
Or if your matrix has many columns you could replace
b = tf.mul(a,[1,1]+[2 for i in range(a.get_shape()[1]-2)])
with
import numpy as np
b = tf.mul(a,np.concatenate((np.array([1,1]),2*np.ones(a.get_shape()[1]-2))))

Bitwise OR along one axis of a NumPy array

For a given NumPy array, it is easy to perform a "normal" sum along one dimension. For example:
X = np.array([[1, 0, 0], [0, 2, 2], [0, 0, 3]])
X.sum(0)
=array([1, 2, 5])
X.sum(1)
=array([1, 4, 3])
Instead, is there an "efficient" way of computing the bitwise OR along one dimension of an array similarly? Something like the following, except without requiring for-loops or nested function calls.
Example: bitwise OR along zeroeth dimension as I currently am doing it:
np.bitwise_or(np.bitwise_or(X[:,0],X[:,1]),X[:,2])
=array([1, 2, 3])
What I would like:
X.bitwise_sum(0)
=array([1, 2, 3])
numpy.bitwise_or.reduce(X, axis=whichever_one_you_wanted)
Use the reduce method of the numpy.bitwise_or ufunc.