Unexpected output when passing array as argument of a vectorized function - numpy

In the example code below
import numpy as np

def f(x):
    print(x)

x = np.array([[ 0.31432202,  7.94263361],
              [-0.5346868,   1.93901039],
              [-0.47571535,  4.17720033]])

np.vectorize(f)(x[0,:])
As output, I expected to get something like
[ 0.31432202 7.94263361]
Instead I get
0.31432202
0.31432202
7.94263361
Can anyone tell me what is wrong with it? Thank you

How much of the np.vectorize docs did you read?
In [129]: def f(x):
     ...:     print(x)
     ...:
     ...: x = np.array([[ 0.31432202,  7.94263361],
     ...:               [-0.5346868,   1.93901039],
     ...:               [-0.47571535,  4.17720033]])

In [130]: f1 = np.vectorize(f)
In [131]: f1(1)
1
1
Out[131]: array(None, dtype=object)
f gets called twice here: once to determine the return dtype, and then once per element. Try it with 3 elements:
In [132]: f1([1,2,3])
1
1
2
3
Out[132]: array([None, None, None], dtype=object)
Note that the return is an array with None. That's because your f doesn't have a return statement. It just does the print.
Why are you using np.vectorize? Its documentation has a clear performance disclaimer, and it also explains the return dtype and how it is determined. It's not a high performance way of calling a function that just prints something. It is mainly useful when you have a function of several scalar arguments and want to take advantage of numpy broadcasting.
Read the docs.
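For what it's worth, if f actually returns something you can also pass otypes so vectorize can skip the extra dtype-probing call. A minimal sketch, not part of the original question:

import numpy as np

def f(x):
    print(x)
    return x * 2                         # now there is a value to collect

f1 = np.vectorize(f, otypes=[float])     # otypes avoids the extra probing call
print(f1([1, 2, 3]))                     # prints 1, 2, 3, then [2. 4. 6.]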

Related

Computing quick convex hull using Numba

I came across this nice implementation of computing the convex hull of 2D points using NumPy. I would like to be able to @njit this function to use it inside my other Numba-jitted code. However, I'm not able to modify it to run, as it uses recursion and unsupported Numba features. Can anybody help me rewrite this?
import numpy as np
from numba import njit

def process(S, P, a, b):
    signed_dist = np.cross(S[P] - S[a], S[b] - S[a])
    K = [i for s, i in zip(signed_dist, P) if s > 0 and i != a and i != b]

    if len(K) == 0:
        return (a, b)

    c = max(zip(signed_dist, P))[1]
    return process(S, K, a, c)[:-1] + process(S, K, c, b)

def quickhull_2d(S: np.ndarray) -> np.ndarray:
    a, b = np.argmin(S[:,0]), np.argmax(S[:,0])
    max_index = np.argmax(S[:,0])
    max_element = S[max_index]
    return process(S, np.arange(S.shape[0]), a, max_index)[:-1] + process(S, np.arange(S.shape[0]), max_index, a)[:-1]
Example data input and output
points = np.array([[0, 0], [1, 1], [0.5, 0.5], [0, 1], [1, 0]])
ch = quickhull_2d(points)
print(ch)
[0, 4, 1, 3]
print(points[ch])
[[0. 0.]
 [1. 0.]
 [1. 1.]
 [0. 1.]]
There are many issues with this code for it to be usable with Numba.
First of all, returning variable-sized tuples is not possible in Numba because the type of a tuple implicitly includes its size. A tuple is basically a structured type, not a list. See this post and this one for more information about this issue. The solution is basically to return a list (slow) or an array (fast).
Moreover, the type of the parameters changes from one call to another. Indeed, process is called in quickhull_2d with P defined as a Numpy array and then called from process itself with P defined as a list. Lists and arrays are completely different things. It is better to use arrays when possible in Numba, unless you use a list to accumulate an unknown number of items (neither small nor bounded).
Additionally, max(zip(signed_dist, P))[1] is apparently unsupported by Numba, and it is not very efficient anyway (nor idiomatic NumPy). P[np.argmax(signed_dist)] should be used instead.
Furthermore, np.cross also does not seem to be supported for the general case, and you currently need to use cross2d instead (from numba.np.extensions).
Finally, when you use a recursive function like this, it is better to specify the input types of the parameters so as to avoid weird errors. This can be done with a signature string.
The resulting code is:
import numpy as np
from numba import njit
from numba.np.extensions import cross2d

@njit('(float64[:,:], int64[:], int64, int64)')
def process(S, P, a, b):
    signed_dist = cross2d(S[P] - S[a], S[b] - S[a])
    K = np.array([i for s, i in zip(signed_dist, P) if s > 0 and i != a and i != b], dtype=np.int64)

    if len(K) == 0:
        return [a, b]

    c = P[np.argmax(signed_dist)]
    return process(S, K, a, c)[:-1] + process(S, K, c, b)

@njit('(float64[:,:],)')
def quickhull_2d(S: np.ndarray) -> np.ndarray:
    a, b = np.argmin(S[:,0]), np.argmax(S[:,0])
    max_index = np.argmax(S[:,0])
    max_element = S[max_index]
    return process(S, np.arange(S.shape[0]), a, max_index)[:-1] + process(S, np.arange(S.shape[0]), max_index, a)[:-1]
points = np.array([[0, 0], [1, 1], [0.5, 0.5], [0, 1], [1, 0]])
ch = quickhull_2d(points)
print(ch)  # prints [0, 4, 1, 3]
Note that the compilation time is slow and the execution time will not be great either. This is due to the lists (and the temporary arrays they imply at runtime). The next step is simply to use arrays. The bad news is that concatenate is not supported by Numba (because the general case is not easy to implement, though specific cases are trivial). You can create a new array and copy each part into it (or even better: you can preallocate an array and slice it during the recursive calls).
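As a rough illustration of the copy-based alternative, here is a small helper (the function name, signature and fixed int64 dtype are my own choices, not from the original code):

import numpy as np
from numba import njit

@njit('int64[:](int64[:], int64[:])')
def concat_int64(left, right):
    # manual stand-in for np.concatenate((left, right)) on 1D int64 arrays
    out = np.empty(left.size + right.size, dtype=np.int64)
    out[:left.size] = left
    out[left.size:] = right
    return out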
Also note that any recursive function can be transformed into a non-recursive one using a manual stack. That being said, it may be slower and makes the code more verbose. There are some benefits to this approach though: it avoids stack overflows when the recursion is deep, and it may be faster if the function is rewritten so as not to stack one of the calls, in the spirit of tail-call optimization.

Explanation of numpy's einsum

I am currently doing some studies on computing a 4th-order tensor in numpy with the einsum function.
The tensor I am computing is written in Einstein notation and the function einsum does the work perfectly! But I would like to know what it is doing in the following case:
import numpy as np
a=np.array([[2,0,3],[0,1,0],[0, 0, 4]])
b= np.eye(3)
r1=np.einsum("ij,kl->ijkl", a, b)
r2=np.einsum("ik,jl->ijkl", a, b)
In r1 I am basically doing the standard tensor product (equivalent to np.tensordot(a, b, axes=0)).
What about in r2?
I know I can get the value by doing a[:,None,:,None]*b[None,:,None,:] but I do not know what the indexing is doing. Does this operation have a name?
Sorry if this is too basic!
I tried to use the transpose definition to change multiple axes.
It works for 'ij,kl->ijkl', 'ik,jl->ijkl', 'kl,ij->ijkl'
but fails for 'il,jk->ijkl', 'jl,ik->ijkl' and 'jk,il->ijkl'.
import numpy as np

a = np.eye(3)
a[0][0] = 2
a[0][-1] = 3
a[-1][-1] = 4
b = np.eye(3)

def permutation(str_, Arr):
    Arr = np.reshape(Arr, [3, 3, 3, 3])

    def splitString(str_):
        tmp1 = str_.split(',')
        tmp2 = tmp1[1].split('->')
        str_idx1 = tmp1[0]
        str_idx2 = tmp2[0]
        str_idx_out = tmp2[1]
        return str_idx1, str_idx2, str_idx_out

    idx_a, idx_b, idx_out = splitString(str_)
    dict_ = {'i': 0, 'j': 1, 'k': 2, 'l': 3}

    def split(word):
        return [char for char in word]

    a, b = split(idx_a)
    c, d = split(idx_b)
    Arr = np.transpose(Arr, (dict_[a], dict_[b], dict_[c], dict_[d]))
    return Arr

str_ = 'jk,il->ijkl'
d = np.outer(a, b)
f = np.einsum(str_, a, b)
check = permutation(str_, d)

if np.count_nonzero(f - check) == 0:
    print('Code is working!')
else:
    print("Something is wrong...")
Appreciate your suggestions!
r2 is essentially the same tensor as r1, but with the indices rearranged. In particular, r2[i,j,k,l] is equal to a[i,k]*b[j,l].
For instance:
>>> r2[0,1,2,1]
3.0
This corresponds to the fact that a[0,2]*b[1,1] is 3 * 1, which is indeed 3.
Another way to think about this is to observe that r2[:,j,:,l] is equal to a whenever j == l and is a zero matrix otherwise.
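To make the relationship concrete, here is a quick check (the transpose(0, 2, 1, 3) equivalence is my own illustration, not something from the question):

import numpy as np

a = np.array([[2, 0, 3], [0, 1, 0], [0, 0, 4]])
b = np.eye(3)

r1 = np.einsum("ij,kl->ijkl", a, b)
r2 = np.einsum("ik,jl->ijkl", a, b)

# r2 is r1 with the two middle axes swapped: r2[i,j,k,l] == r1[i,k,j,l]
print(np.allclose(r2, r1.transpose(0, 2, 1, 3)))                    # True
# and it matches the broadcasting expression from the question
print(np.allclose(r2, a[:, None, :, None] * b[None, :, None, :]))   # True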

Replace values in DataFrame column when they start with string using lambda

I have a DataFrame:
import pandas as pd
import numpy as np
x = {'Value': ['Test', 'XXX123', 'XXX456', 'Test']}
df = pd.DataFrame(x)
I want to replace the values starting with XXX with np.nan using a lambda.
I have tried many things with replace, apply and map, and the best I have been able to do is False, True, True, False.
The below works, but I would like to know a better way to do it; I think apply, replace and a lambda is probably the way to go.
df.Value.loc[df.Value.str.startswith('XXX', na=False)] = np.nan
Use the apply method:
In [80]: x = {'Value': ['Test', 'XXX123', 'XXX456', 'Test']}
In [81]: df = pd.DataFrame(x)
In [82]: df.Value.apply(lambda x: np.nan if x.startswith('XXX') else x)
Out[82]:
0    Test
1     NaN
2     NaN
3    Test
Name: Value, dtype: object
Performance comparison of apply, where, loc
np.where() performs way better here:
df.Value = np.where(df.Value.str.startswith('XXX'), np.nan, df.Value)
Performance vs apply on larger dfs:
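A rough sketch of how such a comparison could be set up on a larger frame (the frame size and the use of timeit are my own choices, not from the original answer):

import numpy as np
import pandas as pd
from timeit import timeit

# repeat the small example to get a larger frame
big = pd.DataFrame({'Value': ['Test', 'XXX123', 'XXX456', 'Test'] * 25_000})

t_apply = timeit(lambda: big.Value.apply(lambda v: np.nan if v.startswith('XXX') else v), number=10)
t_where = timeit(lambda: np.where(big.Value.str.startswith('XXX'), np.nan, big.Value), number=10)
print(t_apply, t_where)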
Use of .loc is not necessary. Write just:
df.Value[df.Value.str.startswith('XXX')] = np.nan
A lambda function would only be necessary if you wanted to compute some expression to be substituted. In this case just np.nan is enough.
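For example, a lambda would earn its keep if the replacement had to be computed rather than being a constant. A hypothetical variant of the data above, stripping the prefix instead of inserting NaN:

import pandas as pd

df = pd.DataFrame({'Value': ['Test', 'XXX123', 'XXX456', 'Test']})
# compute the replacement value instead of using a constant
df.Value = df.Value.apply(lambda v: v[3:] if v.startswith('XXX') else v)
print(df.Value.tolist())   # ['Test', '123', '456', 'Test']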

What is the meaning of `numpy.array(value)`?

numpy.array(value) evaluates to true if value is an int, float or complex. The result seems to be a shapeless array (numpy.array(value).shape returns ()).
Reshaping the above like so numpy.array(value).reshape(1) works fine and numpy.array(value).reshape(1).squeeze() reverses this and again results in a shapeless array.
What is the rationale behind this behavior? Which use-cases exist for it?
When you create a zero-dimensional array like np.array(3), you get an object that behaves as an array in 99.99% of situations. You can inspect the basic properties:
>>> x = np.array(3)
>>> x
array(3)
>>> x.ndim
0
>>> x.shape
()
>>> x[None]
array([3])
>>> type(x)
numpy.ndarray
>>> x.dtype
dtype('int32')
So far so good. The logic behind this is simple: you can process any array-like object the same way, regardless of whether it is a number, list or array, just by wrapping it in a call to np.array.
One thing to keep in mind is that when you index an array, the index tuple must have ndim or fewer elements. So you can't do:
>>> x[0]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: too many indices for array
Instead, you have to use a zero-sized tuple (since x[] is invalid syntax):
>>> x[()]
3
You can also use the array as a scalar instead:
>>> y = x + 3
>>> y
6
>>> type(y)
numpy.int32
Adding two scalars produces a scalar instance of the dtype, not another array. That being said, you can use y from this example in exactly the same way you would x, 99.99% of the time, since numpy scalars support most of the same operations as 0-d arrays. It does not matter that 3 is a Python int, since np.add will wrap it in an array regardless. y = x + x will yield identical results.
One difference between x and y in these examples is that x is not officially considered to be a scalar:
>>> np.isscalar(x)
False
>>> np.isscalar(y)
True
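If what you really want is a plain Python scalar, item() does that in both cases (a small side note of mine, not part of the original question):

>>> x.item()
3
>>> type(x.item())
<class 'int'>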
The indexing issue can potentially throw a monkey wrench into your plans to index any array-like object. You can easily get around it by supplying ndmin=1 as an argument to the constructor, or by using a reshape:
>>> x1 = np.array(3, ndmin=1)
>>> x1
array([3])
>>> x2 = np.array(3).reshape(-1)
>>> x2
array([3])
I generally recommend the former method, as it requires no prior knowledge of the dimensionality of the input.
Further reading:
Why are 0d arrays in Numpy not considered scalar?

pandas groupby transform: multiple functions applied at the same time with custom names

As the title suggests, I want to be able to do the following (best explained with some code) [pandas 0.20.1 is mandatory]:
import pandas as pd
import numpy as np

a = pd.DataFrame(np.random.rand(10, 4),
                 columns=[['a', 'a', 'b', 'b'], ['alfa', 'beta', 'alfa', 'beta']])

def as_is(x):
    return x

def power_2(x):
    return x**2

# desired result
a.transform([as_is, power_2])
The problem is that the functions could be more complex than this, and thus I would lose the "naming" feature, as pandas.DataFrame.transform only allows lists to be passed, whereas a dictionary would have been most convenient.
Going back to basics, I got to this:
dict_funct = {'as_is': as_is, 'power_2': power_2}

def wrapper(x):
    return pd.concat({k: x.apply(v) for k, v in dict_funct.items()}, axis=1)

a.groupby(level=[0,1], axis=1).apply(wrapper)
But the output DataFrame is all NaN, presumably due to the multi-index column ordering. Is there any way I can fix this?
If you need the dict, remove the axis parameter in concat so it falls back to the default (axis=0), but then it is necessary to add the parameter group_keys=False and call unstack:
def wrapper(x):
    return pd.concat({k: x.apply(v) for k, v in dict_funct.items()})

a.groupby(level=[0,1], axis=1, group_keys=False).apply(wrapper).unstack(0)
Similar solution:
def wrapper(x):
    return pd.concat({k: x.transform(v) for k, v in dict_funct.items()})

a.groupby(level=[0,1], axis=1, group_keys=False).apply(wrapper).unstack(0)
Another solution is to simply pass a list comprehension:
a.transform([v for k, v in dict_funct.items()])
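If the goal is just to control the column names while using the list form, one possible workaround (my own suggestion, relying on the fact that pandas labels the result columns with each callable's __name__) is to rename the functions before passing them:

import pandas as pd
import numpy as np

a = pd.DataFrame(np.random.rand(10, 4),
                 columns=[['a', 'a', 'b', 'b'], ['alfa', 'beta', 'alfa', 'beta']])

def as_is(x):
    return x

def power_2(x):
    return x**2

power_2.__name__ = 'squared'          # pandas uses this name for the result column
out = a.transform([as_is, power_2])
print(out.columns.get_level_values(-1).unique())   # last column level is now ['as_is', 'squared']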