Simple computation in numpy - numpy

I have a NumPy array like this: a = [-- -- -- 1.90 2.91 1.91 2.92]
I need to find the percentage of values greater than 2, which here is 50%.
Is there an easy way to get this? Also, why does len(a) give 7 (instead of 4)?

Try this:
import numpy as np
import numpy.ma as ma
a = ma.array([0, 1, 2, 1.90, 2.91, 1.91, 2.92])
for i in range(3):
    a[i] = ma.masked
print(a)
print(np.sum(a>2)/((len(a) - ma.count_masked(a))))
The last line prints 0.5, which is your 50%. It divides the number of values above 2 by the number of unmasked elements: the total length of your array (7) minus the number of masked elements (3), which you see as the three "--" in the output you posted.

Generally speaking, you can simply use
a = np.array([...])
threshold = 2.0
fraction_higher = (a > threshold).sum() / len(a) # in [0, 1]
percentage_higher = fraction_higher * 100

The array contains 7 elements, 3 of which are masked, which is why len(a) returns 7. This code reproduces the test case by building a masked array as well:
# generate the test case: a masked array
a = np.ma.array([-1, -1, -1, 1.90, 2.91, 1.91, 2.92], mask=[1, 1, 1, 0, 0, 0, 0])
# check its format
print(a)
[-- -- -- 1.9 2.91 1.91 2.92]
#print the output
print(a[a>2].count() / a.count())
0.5
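To make the difference between len(a) and the number of unmasked values explicit, here is a minimal sketch that reuses the test array from above. The names n_total, n_valid, valid and percentage_higher are only illustrative, and going through compressed() is one option among several, not necessarily the fastest.

```python
import numpy as np

# Same test case as above: three masked entries followed by four valid ones.
a = np.ma.array([-1, -1, -1, 1.90, 2.91, 1.91, 2.92], mask=[1, 1, 1, 0, 0, 0, 0])

n_total = len(a)        # counts every slot, masked or not
n_valid = a.count()     # counts only the unmasked entries
valid = a.compressed()  # plain ndarray holding just the unmasked values

# Mean of a boolean array is the fraction of True entries.
percentage_higher = 100 * np.mean(valid > 2)

print(n_total, n_valid, percentage_higher)  # 7 4 50.0
```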

Related

numpy array of array with custom filtering

I am trying to filter a numpy array of array with given conditions, for example
input = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
keep the rows where column 0 >= 4, column 1 >= 5, and column 2 >= 6
expected result = np.array([[4,5,6],[4,5,6]])
What would be the best way to achieve this with performance in mind?
Extended question: how can I retrieve the corresponding index in the input array for each output row?
You can do:
a = np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
a[(a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6)]
Here you create a boolean mask for the condition on each column, combine the masks with the element-wise logical AND (&), and finally use the resulting mask to select the matching rows.
To find the indices of the rows matching the conditions, you can use NumPy's where() function:
idx = np.where((a[:,0] >=4) & (a[:,1] >= 5) & (a[:,2] >=6))[0]
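When there is one lower bound per column, the three separate comparisons can also be written as a single broadcast comparison. The following is a small sketch under that assumption; lower, mask, rows and idx are illustrative names, not part of the original answer.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [0, 9, 19]])
lower = np.array([4, 5, 6])  # hypothetical per-column lower bounds

# (4, 3) >= (3,) broadcasts row by row; all(axis=1) keeps rows where every column passes.
mask = (a >= lower).all(axis=1)

rows = a[mask]             # the matching rows
idx = np.flatnonzero(mask) # their positions in the input array

print(rows)
print(idx)  # [1 2]
```

Keeping the mask around means the rows and their indices come from the same computation, so the conditions cannot drift apart.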
As per your request, a Numba version:
import numpy as np
import numba as nb
import sys
import timeit
# usage: python test.py <function name> <number of rows>
target = np.random.randint(low=-100000, high=100000, size=(int(sys.argv[2]), 3), dtype=np.int64)
comp = np.array([4, 5, 6], dtype=np.int64)
@nb.njit((nb.int64[:, :], nb.int64[:]), parallel=True)
def cmp(a, b):
    # c holds 1 where the row satisfies all three conditions, 0 otherwise
    c = np.empty((a.shape[0],), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        c[i] = a[i][0] >= b[0] and a[i][1] >= b[1] and a[i][2] >= b[2]
    return c
def cmp_normal(a, b):
    # return np.all(a >= b, axis=1)
    return (a[:,0] >=b[0]) & (a[:,1] >= b[1]) & (a[:,2] >=b[2])
print(timeit.timeit(lambda: eval(sys.argv[1])(target, comp), number=10))
The first timing is for sequential Numba, the second for parallel Numba (the command is the same in both runs).
Parallel Numba gives a roughly 5x speedup over sequential:
(base) xxx@xxx:~$ python test.py cmp 1000000
6.40756068899982
(base) xxx@xxx:~$ python test.py cmp 1000000
1.3425709140001345
Now vanilla NumPy:
(base) xxx@xxx:~$ python test.py cmp_normal 1000000
4.04174472700015
Parallel Numba is the fastest here. However, if you return a[c] instead of the mask, Numba slows down, so the result depends on what exactly you write.
In [223]: arr =np.array([[1,2,3],[4,5,6],[4,5,6],[0,9,19]])
In [224]: arr
Out[224]:
array([[ 1,  2,  3],
       [ 4,  5,  6],
       [ 4,  5,  6],
       [ 0,  9, 19]])
Since you are testing values, one for each column, you can do a simple numpy == test (the (3,) test broadcasts with the (4,3) arr)
In [225]: arr==[4,5,6]
Out[225]:
array([[False, False, False],
       [ True,  True,  True],
       [ True,  True,  True],
       [False, False, False]])
and where a whole row is true:
In [226]: (arr==[4,5,6]).all(axis=1)
Out[226]: array([False, True, True, False])
This can be applied as a boolean mask to select those rows from arr:
In [227]: arr[_]
Out[227]:
array([[4, 5, 6],
       [4, 5, 6]])
and the numeric indices:
In [228]: np.nonzero(__)
Out[228]: (array([1, 2]),)

Vectorized running bin index calculation with Tensorflow or numpy

I have an integer array like this:
in=[1, 2, 6, 1, 3, 2, 1]
I would like to calculate a running index for the equal values in the array. For the array above the output would be:
out=[0, 0, 0, 1, 0, 1, 2]
So the naive implementation would be to have a counter for all the values. I would like to have a vectorized solution to run it with tensorflow, perhaps with numpy.
I already thought of creating a 2D tensor of shape=(in.shape[0], tf.max(in), ) and writing 1 to the tensor[i, in[i]] cell, and then call a cumsum column-wise, then writing back row-wise. But my input array is quite big (with several 100k entries) with the maximum value of ~500k, thus this sparse matrix wouldn't even fit into the memory.
Do you have better suggestions? Thank you!
Here's a pandas solution:
import numpy as np
import pandas as pd
s = pd.Series([1, 2, 6, 1, 3, 2, 1])
s.groupby(s).cumcount().values
Output:
array([0, 0, 0, 1, 0, 1, 2], dtype=int64)
Test on similar sized data:
s = pd.Series(np.random.randint(0,500000, 100000))
%timeit -n 100 s.groupby(s).cumcount().values
# 23.9 ms ± 562 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use an actual sparse matrix, i.e. use sparse storage. With that an input like a = np.random.randint(0,5*10**5,10**6) is no problem:
import numpy as np
from scipy import sparse
def running(a):
    n,m = a.size,a.max()+1
    aux = sparse.csr_matrix((np.ones_like(a),a,np.arange(n+1)),(n,m)).tocsc()
    msk = aux.indptr[1:] != aux.indptr[:-1]
    indptr = aux.indptr[:-1][msk]
    aux.data[0] = 0
    aux.data[indptr[1:]] -= np.diff(indptr)
    out = np.empty_like(a)
    out[aux.indices] = aux.data.cumsum()
    return out
# alternative method for validation
def use_argsort(a):
    indices = a.argsort(kind="stable")
    ao = a[indices]
    indptr = np.concatenate([[0],(ao[1:] != ao[:-1]).nonzero()[0]+1])
    data = np.ones_like(a)
    data[0] = 0
    data[indptr[1:]] -= np.diff(indptr)
    out = np.empty_like(a)
    out[indices] = data.cumsum()
    return out
in_ = np.array([1, 2, 6, 1, 3, 2, 1])
print("OP example",in_,"->",running(in_))
print("second opinion","->",use_argsort(in_))
from timeit import timeit
A = np.random.randint(0,500_000,1_000_000)
print("large example (500k labels, 1M entries) takes",
      timeit(lambda:running(A),number=10)*100,"ms")
print("using other method takes",
      timeit(lambda:use_argsort(A),number=10)*100,"ms")
print("same result:",(use_argsort(A) == running(A)).all())
Sample run:
OP example [1 2 6 1 3 2 1] -> [0 0 0 1 0 1 2]
second opinion -> [0 0 0 1 0 1 2]
large example (500k labels, 1M entries) takes 84.1427305014804 ms
using other method takes 262.38483290653676 ms
same result: True
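For checking any of the vectorized versions on small inputs, the naive counter approach mentioned in the question can serve as a reference. This is only a sketch in plain Python (the name running_naive is illustrative), and it is not meant to be fast.

```python
from collections import defaultdict

def running_naive(values):
    """Return, for each element, how many equal elements came before it."""
    seen = defaultdict(int)  # value -> number of occurrences so far
    out = []
    for v in values:
        out.append(seen[v])
        seen[v] += 1
    return out

print(running_naive([1, 2, 6, 1, 3, 2, 1]))  # [0, 0, 0, 1, 0, 1, 2]
```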

Finding those elements in an array which are "close"

I have a 1-dimensional sorted array and would like to find all pairs of elements whose difference is no larger than 5.
A naive approach would be to make N^2 comparisons, doing something like
diffs = np.tile(x, (x.size,1) ) - x[:, np.newaxis]
D = np.logical_and(diffs>0, diffs<5)
indicies = np.argwhere(D)
Note that the output of my example consists of indices into x. If I wanted the values of x which satisfy the criteria, I could do x[indicies].
This works for smaller arrays, but not arrays of the size with which I work.
An idea I had was to find where there are gaps larger than 5 between consecutive elements. I would split the array into two pieces, and compare all the elements in each piece.
Is this a more efficient way of finding elements which satisfy my criteria? How could I go about writing this?
Here is a small example:
x = np.array([ 9, 12,
21,
36, 39, 44, 46, 47,
58,
64, 65,])
the result should look like
array([[ 0,  1],
       [ 3,  4],
       [ 5,  6],
       [ 5,  7],
       [ 6,  7],
       [ 9, 10]], dtype=int64)
Here is a solution that iterates over offsets while shrinking the set of candidates until there are none left:
import numpy as np
def f_pp(A, maxgap):
    d0 = np.diff(A)
    d = d0.copy()
    IDX = []
    k = 1
    idx, = np.where(d <= maxgap)
    vidx = idx[d[idx] > 0]
    while vidx.size:
        IDX.append(vidx[:, None] + (0, k))
        if idx[-1] + k + 1 == A.size:
            idx = idx[:-1]
        d[idx] = d[idx] + d0[idx+k]
        k += 1
        idx = idx[d[idx] <= maxgap]
        vidx = idx[d[idx] > 0]
    return np.concatenate(IDX, axis=0)
data = np.cumsum(np.random.exponential(size=10000)).repeat(np.random.randint(1, 20, (10000,)))
pairs = f_pp(data, 1)
#pairs = set(map(tuple, pairs))
from timeit import timeit
kwds = dict(globals=globals(), number=100)
print(data.size, 'points', pairs.shape[0], 'close pairs')
print('pp', timeit("f_pp(data, 1)", **kwds)*10, 'ms')
Sample run:
99963 points 1020651 close pairs
pp 43.00256529124454 ms
Your idea of slicing the array is a very efficient approach. Since your data are sorted, you can just compute the differences between consecutive elements and split wherever a gap exceeds 5:
d=np.diff(x)
ind=np.where(d>5)[0]+1  # d[i] is the gap between x[i] and x[i+1], so split before x[i+1]
pieces=np.split(x,ind)
Here pieces is a list of sub-arrays; you can loop over it and run your own code on each piece.
The best algorithm depends heavily on the nature of your data, which I'm unaware of. For example, another possibility is to write a nested loop:
pairs=[]
for i in range(x.size):
    j=i+1
    # check the bound first so that x[j] is never read past the end of the array
    while j<x.size and x[j]-x[i]<=5:
        pairs.append([i,j])
        j+=1
To make it smarter, you can change the outer loop so that it jumps ahead when j hits a gap.
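The inner loop only ever scans a contiguous range of j for each i, so on sorted data that range can also be located with a binary search. Below is a sketch using np.searchsorted; the function name close_pairs is illustrative, and the bounds follow the test in the question's code (0 < difference < maxgap, both strict), which is what reproduces the expected result shown there. It assumes x is sorted and maxgap is positive.

```python
import numpy as np

def close_pairs(x, maxgap):
    """Index pairs (i, j), i < j, with 0 < x[j] - x[i] < maxgap. x must be sorted."""
    x = np.asarray(x)
    lo = np.searchsorted(x, x, side="right")          # first j with x[j] > x[i]
    hi = np.searchsorted(x, x + maxgap, side="left")  # first j with x[j] >= x[i] + maxgap
    counts = hi - lo                                  # number of partners for each i
    i = np.repeat(np.arange(x.size), counts)
    # Position of each pair inside its own group, then shifted to start at lo[i].
    starts = np.cumsum(counts) - counts
    j = np.arange(counts.sum()) - np.repeat(starts, counts) + np.repeat(lo, counts)
    return np.column_stack([i, j])

x = np.array([9, 12, 21, 36, 39, 44, 46, 47, 58, 64, 65])
pairs = close_pairs(x, 5)
print(pairs)
```

The work is proportional to the number of pairs found plus two binary searches per element, so it avoids the N^2 comparison matrix.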

Is there a Julia analogue to numpy.argmax?

In Python, there is numpy.argmax:
In [7]: a = np.random.rand(5,3)
In [8]: a
Out[8]:
array([[ 0.00108039, 0.16885304, 0.18129883],
[ 0.42661574, 0.78217538, 0.43942868],
[ 0.34321459, 0.53835544, 0.72364813],
[ 0.97914267, 0.40773394, 0.36358753],
[ 0.59639274, 0.67640815, 0.28126232]])
In [10]: np.argmax(a,axis=1)
Out[10]: array([2, 1, 2, 0, 1])
Is there a Julia analogue to NumPy's argmax? I only found indmax, which only accepts a vector, not a two-dimensional array as np.argmax does.
The fastest implementation will usually be findmax (which allows you to reduce over multiple dimensions at once, if you wish):
julia> a = rand(5, 3)
5×3 Array{Float64,2}:
 0.867952  0.815068   0.324292
 0.44118   0.977383   0.564194
 0.63132   0.0351254  0.444277
 0.597816  0.555836   0.32167
 0.468644  0.336954   0.893425
julia> mxval, mxindx = findmax(a; dims=2)
([0.8679518267243425; 0.9773828942695064; … ; 0.5978162823947759; 0.8934254589671011], CartesianIndex{2}[CartesianIndex(1, 1); CartesianIndex(2, 2); … ; CartesianIndex(4, 1); CartesianIndex(5, 3)])
julia> mxindx
5×1 Array{CartesianIndex{2},2}:
CartesianIndex(1, 1)
CartesianIndex(2, 2)
CartesianIndex(3, 1)
CartesianIndex(4, 1)
CartesianIndex(5, 3)
According to the Numpy documentation, argmax provides the following functionality:
numpy.argmax(a, axis=None, out=None)
Returns the indices of the maximum values along an axis.
I doubt a single Julia function does that, but combining mapslices and argmax is just the ticket:
julia> a = [ 0.00108039 0.16885304 0.18129883;
0.42661574 0.78217538 0.43942868;
0.34321459 0.53835544 0.72364813;
0.97914267 0.40773394 0.36358753;
0.59639274 0.67640815 0.28126232] :: Array{Float64,2}
julia> mapslices(argmax,a,dims=2)
5x1 Array{Int64,2}:
3
2
3
1
2
Of course, because Julia's array indexing is 1-based (whereas Numpy's array indexing is 0-based), each element of the resulting Julia array is offset by 1 compared to the corresponding element in the resulting Numpy array. You may or may not want to adjust that.
If you want to get a vector rather than a 2D array, you can simply tack [:] at the end of the expression:
julia> b = mapslices(argmax,a,dims=2)[:]
5-element Array{Int64,1}:
3
2
3
1
2
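For comparison from the NumPy side, the 1-based Julia result above corresponds to the 0-based np.argmax result plus one. A short sketch using the same matrix as in the question (idx0 and idx1 are illustrative names):

```python
import numpy as np

a = np.array([[0.00108039, 0.16885304, 0.18129883],
              [0.42661574, 0.78217538, 0.43942868],
              [0.34321459, 0.53835544, 0.72364813],
              [0.97914267, 0.40773394, 0.36358753],
              [0.59639274, 0.67640815, 0.28126232]])

idx0 = np.argmax(a, axis=1)  # 0-based column of each row's maximum
idx1 = idx0 + 1              # shifted to Julia's 1-based convention

print(idx0)  # [2 1 2 0 1]
print(idx1)  # [3 2 3 1 2]
```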
To add to jub0bs's answer: argmax in Julia 1.0+ mirrors the behavior of np.argmax, with the dims keyword replacing axis, except that it returns a CartesianIndex instead of the index along the given dimension:
julia> a = [ 0.00108039 0.16885304 0.18129883;
0.42661574 0.78217538 0.43942868;
0.34321459 0.53835544 0.72364813;
0.97914267 0.40773394 0.36358753;
0.59639274 0.67640815 0.28126232] :: Array{Float64,2}
julia> argmax(a, dims=2)
5×1 Array{CartesianIndex{2},2}:
CartesianIndex(1, 3)
CartesianIndex(2, 2)
CartesianIndex(3, 3)
CartesianIndex(4, 1)
CartesianIndex(5, 2)

Numpy : resize array

I have two NumPy arrays whose sizes are 994 and 1000. When I perform the operation below:
X * Y
I get the error "ValueError: operands could not be broadcast together with shapes (994) (1000)".
As a fix, I am trying to pad the shorter array with trailing zeros using the method below:
padzero = 0
if(bw.size > w.size):
    padzero = bw.size - w.size
    w = np.pad(w,padzero, 'constant', constant_values=0)
if(bw.size < w.size):
    padzero = w.size - bw.size
    bw = np.pad(bw,padzero, 'constant', constant_values=0)
The problem is that if the size difference is 6, then 12 zeros get padded onto the array, when there should be exactly six in my case.
I tried many ways to achieve this, but none of them resolved the issue. If I try the following:
bw = np.pad(bw,padzero/2, 'constant', constant_values=0)
I get:
ValueError: Unable to create correctly shaped tuple from 3.0
How can I fix the issue?
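A single integer pad width is applied to both ends of the array, which is why you get 12 zeros instead of 6. Pass a (before, after) tuple instead:
a = np.array([1, 2, 3])
To insert zeros at the front: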
np.pad(a,(2,0),'constant', constant_values=0)
array([0, 0, 1, 2, 3])
To insert zeros at the back:
np.pad(a,(0,2),'constant', constant_values=0)
array([1, 2, 3, 0, 0])
Front and back:
np.pad(a,(1,1),'constant', constant_values=0)
array([0, 1, 2, 3, 0])
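Applied to the two arrays from the question, the (before, after) form can be wrapped in a small helper that pads whichever array is shorter with trailing zeros. This is a sketch only: pad_to_match is a hypothetical name, and the arrays of ones merely stand in for the real w and bw.

```python
import numpy as np

def pad_to_match(u, v):
    """Pad the shorter 1-D array with trailing zeros so both have equal length."""
    n = max(u.size, v.size)
    # A (0, k) pad width adds nothing in front and k zeros at the end; k may be 0.
    u = np.pad(u, (0, n - u.size), 'constant', constant_values=0)
    v = np.pad(v, (0, n - v.size), 'constant', constant_values=0)
    return u, v

w = np.ones(994)    # stand-in data with the sizes from the question
bw = np.ones(1000)
w, bw = pad_to_match(w, bw)
product = w * bw
print(w.size, bw.size, product.shape)  # 1000 1000 (1000,)
```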