Speed up applying a transformation to each index value of a given array - numpy

I need to apply a function to the result of a transformation of all index values of a given numpy array. The following code does this:
import numpy as np
from matplotlib.transforms import IdentityTransform
# some 2D array
a = np.empty((2,3))
# some affine transformation, identity is just an example here
trans = IdentityTransform()
# some function taking a 2D index and returning some value depending
# on that index, again just an example
def f(idx):
    return (idx[0]+idx[1])/2
# apply f to the result of transforming each index of a
b=np.empty_like(a)
for idx in np.ndindex(a.shape):
    b[idx] = f(trans.transform(idx))
print(b)
This prints the following correct result:
[[0. 0.5 1. ]
[0.5 1. 1.5]]
The problem now is, the code is too slow when the shape of a gets larger, say 2000x3000. Is there a way to speed this up?
My idea is to create an array of indices of a idx = [[0,0], [0,1], ..., [1,2]], then transform this array in one go using something like tmp = trans.transform(idx), and lastly apply f to every element with np.vectorize(f)(tmp).
Is this a reasonable approach? If yes, what would this actually look like? If no, are there any alternatives?
Edit: I managed to get at tmp via the following code:
tmp=trans.transform(np.asarray([idx for idx in np.ndindex(a.shape)]))
So now I have an array containing the results of the affine transformation for every index value of a. But this seems to use an awful lot of memory.

I'll post an answer myself with what I figured out now. Maybe it is of use for someone.
To answer the first part of my question, I found a fast and efficient way to create the result of transforming the index values: take the result of np.indices() and massage it until it fits what t.transform() expects.
Given some array a = np.empty((2,3)), the indices of that array can be obtained via np.indices(a.shape). This returns two 2D arrays (one for each dimension of a, actually). What I failed to understand was how to turn these results into something transform() understands.
The key here is to apply np.ravel() to each of the arrays that np.indices() returns:
>>> a=np.empty((2,3))
>>> list(map(np.ravel, np.indices(a.shape)))
[array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 1, 2])]
Now I have a list of arrays containing all the x and y indices, which just needs to be put together with np.vstack() and then transposed to get an array of all (x, y) indices, and this is the form transform() will accept.
>>> l=list(map(np.ravel, np.indices(a.shape)))
>>> np.vstack(l).transpose()
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2]])
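As a possible shortcut (a sketch, not part of the original answer), the ravel/vstack/transpose chain above can also be written as a single reshape of the np.indices() result. The variable name idx is only illustrative.

```python
import numpy as np

a = np.empty((2, 3))

# np.indices(a.shape) has shape (2, 2, 3): one index grid per dimension of a.
# Flattening each grid and transposing yields one (row, col) pair per element.
idx = np.indices(a.shape).reshape(a.ndim, -1).T

print(idx)
```

This should produce the same (N, 2) layout as the np.vstack(l).transpose() call, so it can be passed to transform() in the same way.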
And finally, for some arbitrary affine transformation:
>>> from matplotlib.transforms import Affine2D
>>> t = Affine2D().translate(10, 20).scale(0.5)
>>> t.transform(np.vstack(l).transpose())
array([[ 5. , 10. ],
[ 5. , 10.5],
[ 5. , 11. ],
[ 5.5, 10. ],
[ 5.5, 10.5],
[ 5.5, 11. ]])
This is quite fast, even for larger array sizes. If the shape gets big enough (something like 20000x30000), I run out of memory, but for shapes 10000x10000 it still is amazingly fast.
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20, 10)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0003051299718208611
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((200, 100)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0026413939776830375
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((2000, 1000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.35055489401565865
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20000, 10000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
43.62860555597581
Now for the second part, for applying the function to each of the transformed index values I use the following code for now, which is fast enough in my case.
xxyy = t.transform(np.vstack(...).transpose())
np.fromiter((f(*xy) for xy in xxyy), dtype=np.short, count=len(xxyy))
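If f can be expressed with array operations, the Python-level loop can be avoided altogether. The following is a minimal sketch under two assumptions that are not from the original post: the affine transformation is given as a plain 3x3 matrix (matplotlib's Affine2D exposes one via get_matrix()), and f is the simple averaging example from the question. The helper names transformed_indices and f_vectorized are hypothetical.

```python
import numpy as np

def transformed_indices(shape, matrix):
    """Apply a 3x3 affine matrix to every 2D index of an array of the given shape."""
    idx = np.indices(shape).reshape(2, -1).T   # (N, 2) array of (row, col) pairs
    # Affine map: linear part applied to each point, plus the translation column.
    return idx @ matrix[:2, :2].T + matrix[:2, 2]

def f_vectorized(xy):
    """Vectorized counterpart of the example f: mean of the two coordinates."""
    return (xy[:, 0] + xy[:, 1]) / 2

shape = (2, 3)
identity = np.eye(3)                           # stands in for IdentityTransform
b = f_vectorized(transformed_indices(shape, identity)).reshape(shape)
print(b)

# Hand-written matrix for "translate by (10, 20), then scale by 0.5".
shift_scale = np.array([[0.5, 0.0, 5.0],
                        [0.0, 0.5, 10.0],
                        [0.0, 0.0, 1.0]])
xy = transformed_indices(shape, shift_scale)
```

With the identity matrix, b reproduces the result of the loop in the question. Whether this is applicable depends on whether the real f can be written in terms of whole-array operations.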

Related

Eigenvector normalization in numpy

I'm using the linalg in numpy to compute eigenvalues and eigenvectors of matrices of signed reals.
I've read this previous question but still don't grasp the normalization of eigenvectors.
Here is an example straight off Wikipedia:
import numpy as np
from numpy import linalg as la
a = np.matrix([[2, 1], [1, 2]], dtype=float)
eigh_vals, eigh_vects = np.linalg.eig(a)
print('eigen_values=')
print(eigh_vals)
print('eigen_vectors=')
print(eigh_vects)
The eigenvalues are 1 and 3.
For eigenvectors we expect scalar multiples of [1, -1] and [1, 1], which I get:
eigen_values=
[ 3. 1.]
eigen_vectors=
[[ 0.70710678 -0.70710678]
[ 0.70710678 0.70710678]]
I understand the 1/sqrt(2) factor is to have the norm=1 but why?
Can normalization be 'switched off'?
Thanks!
The key message for the first eigenvector in the Wikipedia article is
Any non-zero vector with v1 = −v2 solves this equation.
So the actual solution is V1 = [x, -x]. Picking the vector V1 = [1, -1] may be pleasing to the human eye, but it is just as arbitrary as picking a vector V1 = [104051, -104051] or any other real value.
Actually, picking V1 = [1, -1] / sqrt(2) is the least arbitrary. Of all the possible vectors for V1, it's the only one (up to sign) that is of unit length.
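However, if instead of unit length you prefer the first component of each eigenvector to be 1, you can divide each column (the eigenvectors are stored as columns) by its first entry:
eigh_vects /= eigh_vects[0, :]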
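To illustrate that the scaling is purely a convention, here is a small sketch (using a plain ndarray instead of np.matrix, which is an assumption on my part) that checks both the unit-length columns returned by eig and a rescaled version against the eigenvalue equation.

```python
import numpy as np

a = np.array([[2.0, 1.0], [1.0, 2.0]])
vals, vecs = np.linalg.eig(a)

# The columns returned by eig have unit Euclidean length.
norms = np.linalg.norm(vecs, axis=0)

# Rescale each column so that its first component is 1.
scaled = vecs / vecs[0, :]

# A @ v - lambda * v should vanish for every column, whatever the scaling.
residual = a @ scaled - scaled * vals

print(norms)
print(scaled)
```

The rescaling only works when the first component of every eigenvector is non-zero, which holds for this matrix.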
import numpy as np
import sympy as sp
v = sp.Matrix([[2, 1], [1, 2]])
v_vec = v.eigenvects()
v_vec is a list containing 2 tuples:
[(1, 1, [Matrix([
[-1],
[ 1]])]), (3, 1, [Matrix([
[1],
[1]])])]
1 and 3 are the two eigenvalues. The '1' after each eigenvalue is its algebraic multiplicity. In each tuple, the third element is a list of the eigenvectors belonging to that eigenvalue. Each one is a sympy Matrix object, which you can convert to a NumPy array:
v_vec1 = np.array(v_vec[0][2], dtype=float)
v_vec2 = np.array(v_vec[1][2], dtype=float)
print('v_vec1 =', v_vec1)
print('v_vec2 =', v_vec2)
Here are the eigenvectors you would get (note that they are not scaled to unit length):
v_vec1 = [[-1. 1.]]
v_vec2 = [[1. 1.]]
If sympy is an option for you, it appears to normalize less aggressively:
import sympy
a = sympy.Matrix([[2, 1], [1, 2]])
a.eigenvects()
# [(1, 1, [Matrix([
# [-1],
# [ 1]])]), (3, 1, [Matrix([
# [1],
# [1]])])]

Partition training data by class in NumPy

I have a 50000 x 784 data matrix (50000 samples and 784 features) and the corresponding 50000 x 1 class vector (classes are integers 0-9). I'm looking for an efficient way to group the data matrix into 10 data matrices and class vectors that each have only the data for a particular class 0-9.
I can't seem to find an elegant way to do this, aside from just looping through the data matrix and constructing the 10 other matrices that way.
Does anyone know if there is a clean way to do this with something in scipy, numpy, or sklearn?
Probably the cleanest way of doing this in numpy, especially if you have many classes, is through sorting:
import numpy as np
SAMPLES = 50000
FEATURES = 784
CLASSES = 10
data = np.random.rand(SAMPLES, FEATURES)
classes = np.random.randint(CLASSES, size=SAMPLES)
sorter = np.argsort(classes)
classes_sorted = classes[sorter]
splitter, = np.where(classes_sorted[:-1] != classes_sorted[1:])
data_splitted = np.split(data[sorter], splitter + 1)
data_splitted will be a list of arrays, one for each class found in classes. Running the above code with SAMPLES = 10, FEATURES = 2 and CLASSES = 3 I get:
>>> data
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.70539842, 0.76376921],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.01985781, 0.04272293],
[ 0.93026735, 0.40216376],
[ 0.39089845, 0.01891637],
[ 0.70937483, 0.16077439],
[ 0.45383099, 0.82074859]])
>>> classes
array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])
>>> data_splitted
[array([[ 0.93026735, 0.40216376],
[ 0.70937483, 0.16077439]]),
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.45383099, 0.82074859]]),
array([[ 0.70539842, 0.76376921],
[ 0.01985781, 0.04272293],
[ 0.39089845, 0.01891637]])]
If you want to make sure the sort is stable, i.e. that data points in the same class remain in the same relative order after sorting, you will need to specify sorter = np.argsort(classes, kind='mergesort').
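A possible variation on the same idea, sketched here with hypothetical variable names: np.unique with return_index=True gives both the class labels and the split positions, so the pieces can be collected in a dict keyed by class. kind="stable" is used so that rows keep their original relative order within each class.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((10, 2))
classes = np.array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])

# Stable sort keeps rows of the same class in their original relative order.
order = np.argsort(classes, kind="stable")

# labels: the distinct classes; first: index of the first row of each class
# in the sorted array.
labels, first = np.unique(classes[order], return_index=True)

# Split at the start of every class except the first one.
groups = dict(zip(labels.tolist(), np.split(data[order], first[1:])))

for label, rows in groups.items():
    print(label, rows.shape)
```

Unlike a fixed range(10) loop, this only creates entries for classes that actually occur in the data.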
If your data and labels matrices are in numpy format, you can do:
data_class_3 = data[labels == 3, :]
If they aren't, turn them into numpy format:
import numpy as np
data = np.array(data)
labels = np.array(labels)
data_class_3 = data[labels == 3, :]
You can loop and do this for all labels automatically if you like. Something like this:
import numpy as np
split_classes = [data[labels == i, :] for i in range(10)]
In addition to @Jaime's NumPy-optimal answer, I suggest pandas, which specializes in data manipulation:
import pandas
df=pandas.DataFrame(data,index=classes).sort_index()
Then df.loc[i] is your class i.
If you want a list, just do
metadata=[df.loc[i].values for i in range(10)]
So metadata[i] is the subset you want, or make a Panel with pandas (note that Panel has been removed from recent pandas versions). All of this is based on numpy arrays, so efficiency is preserved.

Numpy eigenvectors aren't eigenvectors?

I was doing some matrix calculations and wanted to calculate the eigenvalues and eigenvectors of this particular matrix:
I found its eigenvalues and eigenvectors analytically and wanted to confirm my answer using numpy.linalg.eigh, since this matrix is symmetric. Here is the problem: I find the expected eigenvalues, but the corresponding eigenvectors do not appear to be eigenvectors at all.
Here is the little piece of code I used:
import numpy as n
def createA():
    # create the matrix A
    m = 3
    # n.ones() and the diagonal offset of n.diag() need integer arguments
    T = n.diag(n.ones(m-1), -1) + n.diag(n.ones(m)*-4.) +\
        n.diag(n.ones(m-1), 1)
    I = n.identity(m)
    A = n.zeros([m*m, m*m])
    for i in range(m):
        a, b, c = i*m, (i+1)*m, (i+2)*m
        A[a:b, a:b] = T
        if i < m - 1:
            A[b:c, a:b] = A[a:b, b:c] = I
    return A
A = createA()
ev, vecs = n.linalg.eigh(A)
print(vecs[0])
print(n.dot(A, vecs[0])/ev[0])
So for the first eigenvalue/eigenvector pair, this yields:
[ 2.50000000e-01 5.00000000e-01 -5.42230975e-17 -4.66157689e-01
3.03192985e-01 2.56458619e-01 -7.84539156e-17 -5.00000000e-01
2.50000000e-01]
[ 0.14149052 0.21187998 -0.1107808 -0.35408209 0.20831606 0.06921674
0.14149052 -0.37390646 0.18211242]
In my understanding of the eigenvalue problem, it appears that this vector doesn't satisfy the equation A.vec = ev.vec, and that therefore this vector is no eigenvector at all.
I am pretty sure the matrix A itself is correctly implemented and that there is a correct eigenvector. For example, my analytically derived eigenvector:
rvec = [0.25,-0.35355339,0.25,-0.35355339,0.5,-0.35355339,0.25,
-0.35355339,0.25]
b = n.dot(A,rvec)/ev[0]
print(n.allclose(rvec, b))
yields True.
Can anyone, by any means, explain this strange behaviour? Am I misunderstanding the Eigenvalue problem? Might numpy be erroneous?
(As this is my first post here: my apologies for any unconventionalities in my question. Thank you in advance for your patience.)
The eigenvectors are stored as column vectors as described here. So you have to use vecs[:,0] instead of vecs[0].
For example this here works for me (I use eig because A is not symmetric)
import numpy as np
import numpy.linalg as LA
import numpy.random
A = numpy.random.randint(10,size=(4,4))
# array([[4, 7, 7, 7],
# [4, 1, 9, 1],
# [7, 3, 7, 7],
# [6, 4, 6, 5]])
eval,evec = LA.eig(A)
evec[:,0]
# array([ 0.55545073+0.j, 0.37209887+0.j, 0.56357432+0.j, 0.48518131+0.j])
np.dot(A,evec[:,0]) / eval[0]
# array([ 0.55545073+0.j, 0.37209887+0.j, 0.56357432+0.j, 0.48518131+0.j])
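For the matrix from the question, the column convention can be checked directly. The sketch below rebuilds A with np.kron, which I am assuming is equivalent to the block-wise loop in createA() (T on the diagonal blocks, identity on the neighbouring blocks); the variable names are illustrative.

```python
import numpy as np

m = 3
off = np.diag(np.ones(m - 1), -1) + np.diag(np.ones(m - 1), 1)
T = off - 4.0 * np.eye(m)

# Block matrix: T on the diagonal blocks, identity on the adjacent blocks.
A = np.kron(np.eye(m), T) + np.kron(off, np.eye(m))

ev, vecs = np.linalg.eigh(A)

# Column 0 pairs with eigenvalue 0; row 0 does not.
column_ok = np.allclose(A @ vecs[:, 0], ev[0] * vecs[:, 0])
row_ok = np.allclose(A @ vecs[0], ev[0] * vecs[0])

# All pairs at once: A V = V diag(ev), with the eigenvectors as columns of V.
all_ok = np.allclose(A @ vecs, vecs * ev)

print(column_ok, row_ok, all_ok)
```

Here vecs * ev scales column j by ev[j] through broadcasting, which is a compact way to test every eigenpair in one comparison.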

Creating a numpy matrix with a lag

Let's say I have
q=2
y=[5,10,5,15,20,25,30,35,5,10,15,20]
n=len(y)
and I want to make a matrix with n x q dimensions where the first row would be [5,10], the second row would be [10,5], and the third would be [5,15] ...etc.
Is there a way to do this or would I have to use a for loop and concatenate function?
Our good friend stride_tricks to the rescue:
import numpy as np
# illustrate functionality on a 2d array
y = np.array([5,10,5,15,20,25,30,35,5,10,15,20]).reshape(2,-1)
def running_view(arr, window, axis=-1):
    """
    return a running view of length 'window' over 'axis'
    the returned array has an extra last dimension, which spans the window
    """
    shape = list(arr.shape)
    shape[axis] -= (window-1)
    assert(shape[axis]>0)
    # as_strided lives in np.lib.stride_tricks
    return np.lib.stride_tricks.as_strided(
        arr,
        shape + [window],
        arr.strides + (arr.strides[axis],))
print(running_view(y, 2))
It returns a view into the original array, so O(1) performance.
Edit: generalized to include an optional axis parameter for nd-arrays.
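On newer NumPy versions (1.20 and later, as far as I know), np.lib.stride_tricks.sliding_window_view offers the same kind of running view without computing strides by hand. A minimal sketch for the 1D input from the question:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

q = 2
y = np.array([5, 10, 5, 15, 20, 25, 30, 35, 5, 10, 15, 20])

# Each row is a window of q consecutive values; the result is a read-only view.
lagged = sliding_window_view(y, q)

print(lagged.shape)
print(lagged[:3])
```

Note that a window of length q only fits n - q + 1 times into n values, so the result has 11 rows here rather than n = 12.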
Since NumPy arrays are row-major ordered by default, you can directly reshape() to "wrap" an array to the rows of a matrix (assuming the number of columns divides the length of the array).
import numpy as np
def as_matrix(x, ncols):
    nrows = len(x) // ncols
    return np.array(x).reshape(nrows, ncols)
as_matrix(y, 2)
#> array([[ 5, 10],
#> [ 5, 15],
#> [20, 25],
#> [30, 35],
#> [ 5, 10],
#> [15, 20]])

Turn 2D NumPy array into 1D array for plotting a histogram

I'm trying to plot a histogram with matplotlib.
I need to convert my one-line 2D Array
[[1,2,3,4]] # shape is (1,4)
into a 1D Array
[1,2,3,4] # shape is (4,)
How can I do this?
Adding ravel as another alternative for future searchers. From the docs,
It is equivalent to reshape(-1, order=order).
Since the array is 1xN, all of the following are equivalent:
arr1d = np.ravel(arr2d)
arr1d = arr2d.ravel()
arr1d = arr2d.flatten()
arr1d = np.reshape(arr2d, -1)
arr1d = arr2d.reshape(-1)
arr1d = arr2d[0, :]
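The results are equal in value, but the calls differ in whether they copy the data. A short sketch to check this with np.shares_memory (the variable names are just for illustration):

```python
import numpy as np

arr2d = np.array([[1, 2, 3, 4]])

viewed = arr2d.ravel()     # returns a view when the memory layout allows it
copied = arr2d.flatten()   # always returns a copy

shares_ravel = np.shares_memory(arr2d, viewed)
shares_flatten = np.shares_memory(arr2d, copied)

print(viewed.shape, shares_ravel, shares_flatten)
```

For plotting a histogram the difference rarely matters, but it does if the 1D array is modified afterwards, since writes to a view also change the original 2D array.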
You can directly index the row:
>>> import numpy as np
>>> x2 = np.array([[1,2,3,4]])
>>> x2.shape
(1, 4)
>>> x1 = x2[0,:]
>>> x1
array([1, 2, 3, 4])
>>> x1.shape
(4,)
Or you can use squeeze:
>>> xs = np.squeeze(x2)
>>> xs
array([1, 2, 3, 4])
>>> xs.shape
(4,)
reshape will do the trick.
There's also a more specific function, flatten, that appears to do exactly what you want.
The answer provided by mtrw does the trick for an array that has only one row, like this one. However, if you have a 2D array with values in both dimensions, you can convert it as follows:
a = np.array([[1,2,3],[4,5,6]])
From here you can get the shape of the array with np.shape and take the product of its entries with np.prod, which gives the total number of elements. If you then use np.reshape() to reshape the array to a single dimension of that length, you have a solution that always works.
np.reshape(a, np.prod(a.shape))
# array([1, 2, 3, 4, 5, 6])
Use ndarray.flat
import numpy as np
import matplotlib.pyplot as plt
a = np.array([[1,0,0,1],
[2,0,1,0]])
plt.hist(a.flat, [0,1,2,3])
The flat property returns a 1D iterator over your 2D array. This method generalizes to any number of rows (or dimensions). For large arrays it can be much more efficient than making a flattened copy.