Partition training data by class in NumPy

I have a 50000 x 784 data matrix (50000 samples and 784 features) and the corresponding 50000 x 1 class vector (classes are integers 0-9). I'm looking for an efficient way to group the data matrix into 10 data matrices and class vectors that each have only the data for a particular class 0-9.
I can't seem to find an elegant way to do this, aside from just looping through the data matrix and constructing the 10 other matrices that way.
Does anyone know if there is a clean way to do this with something in scipy, numpy, or sklearn?

Probably the cleanest way of doing this in numpy, especially if you have many classes, is through sorting:
import numpy as np
SAMPLES = 50000
FEATURES = 784
CLASSES = 10
data = np.random.rand(SAMPLES, FEATURES)
classes = np.random.randint(CLASSES, size=SAMPLES)
sorter = np.argsort(classes)
classes_sorted = classes[sorter]
splitter, = np.where(classes_sorted[:-1] != classes_sorted[1:])
data_splitted = np.split(data[sorter], splitter + 1)
data_splitted will be a list of arrays, one for each class found in classes. Running the above code with SAMPLES = 10, FEATURES = 2 and CLASSES = 3 I get:
>>> data
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.70539842, 0.76376921],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.01985781, 0.04272293],
[ 0.93026735, 0.40216376],
[ 0.39089845, 0.01891637],
[ 0.70937483, 0.16077439],
[ 0.45383099, 0.82074859]])
>>> classes
array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])
>>> data_splitted
[array([[ 0.93026735, 0.40216376],
[ 0.70937483, 0.16077439]]),
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.45383099, 0.82074859]]),
array([[ 0.70539842, 0.76376921],
[ 0.01985781, 0.04272293],
[ 0.39089845, 0.01891637]])]
If you want to make sure the sort is stable, i.e. that data points in the same class remain in the same relative order after sorting, you will need to specify sorter = np.argsort(classes, kind='mergesort') (newer NumPy versions also accept kind='stable').
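As a minimal sketch, the sort-and-split steps above could be wrapped in a helper that also reports which label each group belongs to. The name split_by_class is hypothetical, and kind="stable" assumes a NumPy version that accepts it (otherwise use kind="mergesort"):

```python
import numpy as np

def split_by_class(data, classes):
    """Hypothetical helper: groups[k] holds the rows of data labelled labels[k]."""
    # Stable sort keeps the original row order within each class
    sorter = np.argsort(classes, kind="stable")
    # On sorted input, return_index gives the first position of each label
    labels, first_idx = np.unique(classes[sorter], return_index=True)
    # Split at the start of every class except the first one
    groups = np.split(data[sorter], first_idx[1:])
    return labels, groups

rng = np.random.default_rng(0)
data = rng.random((10, 2))
classes = np.array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])

labels, groups = split_by_class(data, classes)
for label, group in zip(labels, groups):
    print(label, group.shape)
```

Returning the labels alongside the groups avoids assuming that every class from 0 to 9 actually occurs in the data.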

If your data and labels matrices are in numpy format, you can do:
data_class_3 = data[labels == 3, :]
If they aren't, turn them into numpy format:
import numpy as np
data = np.array(data)
labels = np.array(labels)
data_class_3 = data[labels == 3, :]
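You can loop and do this for all labels automatically if you like. Because the classes generally contain different numbers of samples, keep the results in a list rather than wrapping them in np.array (recent NumPy versions refuse to build an array from sub-arrays of different shapes). Something like this:
import numpy as np
split_classes = [data[labels == i, :] for i in range(10)]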
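If not every label from 0 to 9 is guaranteed to occur, one option is to loop over the labels that are actually present. The following is only a sketch with made-up toy data, and by_class is a hypothetical name:

```python
import numpy as np

# Toy data (assumed for illustration): 6 samples, 2 features
data = np.arange(12).reshape(6, 2)
labels = np.array([3, 0, 3, 7, 0, 3])

# One boolean mask per label that actually occurs in `labels`
by_class = {int(label): data[labels == label] for label in np.unique(labels)}

print(sorted(by_class))   # the labels that were found
print(by_class[3])        # rows whose label is 3
```

Each mask scans the whole label vector once, so this costs one pass per class. That is fine for ten classes; with many classes the sorting approach is likely the better fit.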

Following @Jaime's optimal NumPy answer, I suggest pandas, which specializes in data manipulation:
import pandas
df=pandas.DataFrame(data,index=classes).sort_index()
Then df.loc[i] is your class i.
If you want a list, just do
metadata=[df.loc[i].values for i in range(10)]
so metadata[i] is the subset you want, or make a panel with pandas (in older pandas versions only; Panel has since been removed). All of this is built on NumPy arrays, so efficiency is preserved.

Related

Speed up applying a transformation to each index value of a given array

I need to apply a function to the result of a transformation of all index values of a given numpy array. The following code does this:
import numpy as np
from matplotlib.transforms import IdentityTransform
# some 2D array
a = np.empty((2,3))
# some affine transformation, identity is just an example here
trans = IdentityTransform()
# some function taking a 2D index and returning some value depending
# on that index, again just an example
def f(idx):
    return (idx[0]+idx[1])/2
# apply f to the result of transforming each index of a
b=np.empty_like(a)
for idx in np.ndindex(a.shape):
    b[idx] = f(trans.transform(idx))
print(b)
This prints the following correct result:
[[0. 0.5 1. ]
[0.5 1. 1.5]]
The problem now is, the code is too slow when the shape of a gets larger, say 2000x3000. Is there a way to speed this up?
My idea is to create an array of indices of a idx = [[0,0], [0,1], ..., [1,2]], then transform this array in one go using something like tmp = trans.transform(idx), and lastly apply f to every element with np.vectorize(f)(tmp).
Is this a reasonable approach? If yes, what would this actually look like? If no, are there any alternatives?
Edit: I managed to get at tmp via the following code:
tmp=trans.transform(np.asarray([idx for idx in np.ndindex(a.shape)]))
So now I have an array containing the results of the affine transformation for every index value of a. But this seems to use an awful lot of memory.
I'll post an answer myself with what I figured out now. Maybe it is of use for someone.
To answer the first part of my question, I found a fast and efficient way to create the result of transforming the index values, using the result of np.indices() and then massaging the result of that until it fits to what t.transform() expects.
Given some array a = np.empty((2,3)), the indices of that array can be obtained via np.indices(a.shape). This returns two 2D arrays (one for each dimension of a, actually). What I failed to understand was how to turn these results into something transform() understands.
The key here is to apply np.ravel() to each of the arrays that np.indices() returns:
>>> a=np.empty((2,3))
>>> list(map(np.ravel, np.indices(a.shape)))
[array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 1, 2])]
Now I have a list of arrays containing all the x and y indices, which just needs to be put together with np.vstack() and then transposed to get an array of all (x, y) indices, and this is the form transform() will accept.
>>> l=list(map(np.ravel, np.indices(a.shape)))
>>> np.vstack(l).transpose()
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2]])
And finally, for some arbitrary affine transformation:
>>> from matplotlib.transforms import Affine2D
>>> t = Affine2D().translate(10, 20).scale(0.5)
>>> t.transform(np.vstack(l).transpose())
array([[ 5. , 10. ],
[ 5. , 10.5],
[ 5. , 11. ],
[ 5.5, 10. ],
[ 5.5, 10.5],
[ 5.5, 11. ]])
This is quite fast, even for larger array sizes. If the shape gets big enough (something like 20000x30000), I run out of memory, but for shapes 10000x10000 it still is amazingly fast.
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20, 10)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0003051299718208611
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((200, 100)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0026413939776830375
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((2000, 1000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.35055489401565865
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20000, 10000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
43.62860555597581
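Now for the second part: to apply the function to each of the transformed index values I currently use the following code, which is fast enough in my case. Note that f is called here with the two transformed coordinates as separate arguments, f(x, y), rather than with a single index as in the question.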
xxyy = t.transform(np.vstack(...).transpose())
np.fromiter((f(*xy) for xy in xxyy), dtype=np.short, count=len(xxyy))
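If f can be expressed in terms of whole arrays, the Python-level loop can be avoided altogether. The sketch below assumes that is the case, and uses a plain NumPy affine map (the hypothetical names matrix and offset) as a stand-in for the matplotlib transform:

```python
import numpy as np

a = np.empty((2, 3))

# All (row, col) index pairs as an (N, 2) array, in row-major order
idx = np.indices(a.shape).reshape(a.ndim, -1).T

# Stand-in for trans.transform (assumption): linear part plus translation.
# The identity map is used here, as in the question.
matrix = np.eye(2)
offset = np.zeros(2)
pts = idx @ matrix.T + offset

def f_vec(points):
    """Array version of f: works on all transformed points at once."""
    return (points[:, 0] + points[:, 1]) / 2

b = f_vec(pts).reshape(a.shape)
print(b)
```

With the identity map this reproduces the array printed in the question. Whether a given f can be rewritten this way depends on the function, so this is not a drop-in replacement for np.fromiter in general.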

Useful way of reverting homogeneous coordinates back to 2D?

Is there some numpy sugar for reverting homogeneous coordinates back to 2D coordinates?
So this:
[[4,8,2],
[6,3,2]]
becomes this:
[[2,4],
[3,1.5]]
One approach makes use of broadcasted elementwise division (the __future__ import is only needed on Python 2) -
from __future__ import division
a[:,:2]/a[:,[-1]]
We can use a[:,-1,None] or a[:,-1][:,None] or a[:,-1].reshape(-1,1) in place of a[:,[-1]]. With a[:,[-1]], we are keeping the number of dims intact, letting us perform the broadcasting divisions.
Another with np.true_divide again using broadcasting -
np.true_divide(a[:,:2], a[:,[-1]])
Sample run -
In [194]: a
Out[194]:
array([[4, 8, 2],
[6, 3, 2]])
In [195]: a[:,:2]/a[:,[-1]]
Out[195]:
array([[ 2. , 4. ],
[ 3. , 1.5]])
In [196]: np.true_divide(a[:,:2], a[:,[-1]])
Out[196]:
array([[ 2. , 4. ],
[ 3. , 1.5]])
If you have your input as a vector called x you could do
x[:-1]/x[-1]
Full example:
import numpy as np
x = np.array([6,3,2])
x[:-1]/x[-1] # array([ 3. , 1.5])
You can also apply it to multiple coordinates in an array:
xs = np.array([[4,8,2],[6,3,2]])
np.array([x[:-1]/x[-1] for x in xs]) # array([[ 2. , 4. ],
# [ 3. , 1.5]])
If you want to reuse this you can define a function homogen:
homogen = lambda x: x[:-1]/x[-1]
# previous stuff becomes something like
np.array([homogen(x) for x in xs])
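The single-vector and multi-point cases can also be covered by one function that uses an ellipsis index, so no Python loop is needed. This is only a sketch, and from_homogeneous is a hypothetical name:

```python
import numpy as np

def from_homogeneous(points):
    """Divide all but the last coordinate by the last one (hypothetical helper)."""
    points = np.asarray(points, dtype=float)
    # points[..., -1:] keeps the last axis, so the division broadcasts
    return points[..., :-1] / points[..., -1:]

print(from_homogeneous([6, 3, 2]))
print(from_homogeneous([[4, 8, 2], [6, 3, 2]]))
```

Casting to float up front sidesteps integer division on Python 2 and leaves the input untouched.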

Numpy eigenvectors aren't eigenvectors?

I was doing some matrix calculations and wanted to calculate the eigenvalues and eigenvectors of this particular matrix:
I found its eigenvalues and eigenvectors analytically and wanted to confirm my answer using numpy.linalg.eigh, since this matrix is symmetric. Here is the problem: I find the expected eigenvalues, but the corresponding eigenvectors do not appear to be eigenvectors at all.
Here is the little piece of code I used:
import numpy as n
def createA():
    # create the matrix A
    m = 3
    T = n.diag(n.ones(m-1), -1) + n.diag(n.ones(m)*-4.) +\
        n.diag(n.ones(m-1), 1)
    I = n.identity(m)
    A = n.zeros([m*m, m*m])
    for i in range(m):
        a, b, c = i*m, (i+1)*m, (i+2)*m
        A[a:b, a:b] = T
        if i < m - 1:
            A[b:c, a:b] = A[a:b, b:c] = I
    return A
A = createA()
ev, vecs = n.linalg.eigh(A)
print(vecs[0])
print(n.dot(A, vecs[0])/ev[0])
So for the first eigenvalue/eigenvector pair, this yields:
[ 2.50000000e-01 5.00000000e-01 -5.42230975e-17 -4.66157689e-01
3.03192985e-01 2.56458619e-01 -7.84539156e-17 -5.00000000e-01
2.50000000e-01]
[ 0.14149052 0.21187998 -0.1107808 -0.35408209 0.20831606 0.06921674
0.14149052 -0.37390646 0.18211242]
In my understanding of the eigenvalue problem, it appears that this vector doesn't satisfy the equation A.vec = ev.vec, and that therefore this vector is not an eigenvector at all.
I am pretty sure the matrix A itself is correctly implemented and that there is a correct eigenvector. For example, my analytically derived eigenvector:
rvec = [0.25,-0.35355339,0.25,-0.35355339,0.5,-0.35355339,0.25,
        -0.35355339,0.25]
b = n.dot(A,rvec)/ev[0]
print(n.allclose(rvec,b))
yields True.
Can anyone, by any means, explain this strange behaviour? Am I misunderstanding the Eigenvalue problem? Might numpy be erroneous?
(As this is my first post here: my apologies for any unconventionalities in my question. Thank you in advance for your patience.)
The eigenvectors are stored as column vectors, as described in the numpy.linalg.eigh documentation: column vecs[:,i] is the eigenvector belonging to ev[i]. So you have to use vecs[:,0] instead of vecs[0].
For example, this works for me (I use eig because my A is not symmetric):
import numpy as np
import numpy.linalg as LA
import numpy.random
A = numpy.random.randint(10,size=(4,4))
# array([[4, 7, 7, 7],
# [4, 1, 9, 1],
# [7, 3, 7, 7],
# [6, 4, 6, 5]])
eval,evec = LA.eig(A)
evec[:,0]
# array([ 0.55545073+0.j, 0.37209887+0.j, 0.56357432+0.j, 0.48518131+0.j])
np.dot(A,evec[:,0]) / eval[0]
# array([ 0.55545073+0.j, 0.37209887+0.j, 0.56357432+0.j, 0.48518131+0.j])
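All eigenpairs returned by eigh can be checked in one go, which makes the column convention easy to confirm. As a small sketch, using the 3x3 tridiagonal block T from the question as a test matrix:

```python
import numpy as np

# Small symmetric test matrix: the tridiagonal block T from the question
A = np.diag(np.ones(2), -1) + np.diag(-4.0 * np.ones(3)) + np.diag(np.ones(2), 1)
ev, vecs = np.linalg.eigh(A)

# A @ vecs applies A to every column; vecs * ev scales column i by ev[i]
print(np.allclose(A @ vecs, vecs * ev))

# The same check for a single eigenpair, taking a column rather than a row
v0 = vecs[:, 0]
print(np.allclose(A @ v0, ev[0] * v0))
```

Comparing A @ v with ev * v, instead of dividing by the eigenvalue, also stays well defined when an eigenvalue happens to be zero.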

Creating a NumPy matrix with a lag

Let's say I have
q=2
y=[5,10,5,15,20,25,30,35,5,10,15,20]
n=len(y)
and I want to make a matrix with n x q dimensions where the first row would be [5,10], the second row would be [10,5], and the third would be [5,15] ...etc.
Is there a way to do this or would I have to use a for loop and concatenate function?
Our good friend as_strided (from np.lib.stride_tricks) to the rescue:
import numpy as np
from numpy.lib.stride_tricks import as_strided
# illustrate functionality on a 2d array
y = np.array([5,10,5,15,20,25,30,35,5,10,15,20]).reshape(2,-1)
def running_view(arr, window, axis=-1):
    """
    return a running view of length 'window' over 'axis'
    the returned array has an extra last dimension, which spans the window
    """
    shape = list(arr.shape)
    shape[axis] -= (window-1)
    assert(shape[axis]>0)
    return as_strided(
        arr,
        shape + [window],
        arr.strides + (arr.strides[axis],))
print(running_view(y, 2))
It returns a view into the original array, so O(1) performance.
Edit: generalized to include an optional axis parameter for nd-arrays.
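Newer NumPy releases (1.20 and later, as far as I know) ship a ready-made helper for this kind of running view, numpy.lib.stride_tricks.sliding_window_view. A minimal sketch on the 1D data from the question, assuming such a version is installed:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

y = np.array([5, 10, 5, 15, 20, 25, 30, 35, 5, 10, 15, 20])
q = 2

# Each row is a window of q consecutive values; the result is a read-only view
lagged = sliding_window_view(y, q)
print(lagged.shape)
print(lagged[:3])
```

Note that only n - q + 1 complete windows exist, so the result has 11 rows here rather than n = 12.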
Since NumPy arrays are row-major ordered by default, you can directly reshape() to "wrap" an array to the rows of a matrix (assuming the number of columns divides the length of the array).
import numpy as np
def as_matrix(x, ncols):
    nrows = len(x) // ncols
    return np.array(x).reshape(nrows, ncols)
as_matrix(y, 2)
#> array([[ 5, 10],
#> [ 5, 15],
#> [20, 25],
#> [30, 35],
#> [ 5, 10],
#> [15, 20]])

Turn 2D NumPy array into 1D array for plotting a histogram

I'm trying to plot a histogram with matplotlib.
I need to convert my one-line 2D Array
[[1,2,3,4]] # shape is (1,4)
into a 1D Array
[1,2,3,4] # shape is (4,)
How can I do this?
Adding ravel as another alternative for future searchers. From the docs,
It is equivalent to reshape(-1, order=order).
Since the array is 1xN, all of the following are equivalent:
arr1d = np.ravel(arr2d)
arr1d = arr2d.ravel()
arr1d = arr2d.flatten()
arr1d = np.reshape(arr2d, -1)
arr1d = arr2d.reshape(-1)
arr1d = arr2d[0, :]
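The results are equal in value, but they differ in whether they share memory with the original array. A short sketch of that difference for the two most common choices:

```python
import numpy as np

arr2d = np.array([[1, 2, 3, 4]])

view = arr2d.ravel()      # a view here, since arr2d is contiguous
copy = arr2d.flatten()    # always a new array

print(np.shares_memory(arr2d, view))
print(np.shares_memory(arr2d, copy))

# Writing through the view changes the original; the copy is unaffected
view[0] = 99
print(arr2d[0, 0], copy[0])
```

For plotting a histogram the distinction rarely matters, but it does if the 1D result is modified afterwards.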
You can directly index the row:
>>> import numpy as np
>>> x2 = np.array([[1,2,3,4]])
>>> x2.shape
(1, 4)
>>> x1 = x2[0,:]
>>> x1
array([1, 2, 3, 4])
>>> x1.shape
(4,)
Or you can use squeeze:
>>> xs = np.squeeze(x2)
>>> xs
array([1, 2, 3, 4])
>>> xs.shape
(4,)
reshape will do the trick.
There's also a more specific function, flatten, that appears to do exactly what you want.
The answer provided by mtrw does the trick for an array that actually has only one row, like this one. However, if you have a 2D array with values in both dimensions, you can convert it as follows:
a = np.array([[1,2,3],[4,5,6]])
From here you can find the shape of the array with np.shape and take the product of its entries with np.prod (np.product was an older alias that has been removed from recent NumPy versions), which gives the number of elements. If you now use np.reshape() to reshape the array to a single dimension of that length, you have a solution that always works.
np.reshape(a, np.prod(a.shape))
# array([1, 2, 3, 4, 5, 6])
Use the flat attribute (numpy.ndarray.flat):
import numpy as np
import matplotlib.pyplot as plt
a = np.array([[1,0,0,1],
[2,0,1,0]])
plt.hist(a.flat, [0,1,2,3])
The flat property returns a 1D iterator over your 2D array. This method generalizes to any number of rows (or dimensions). For large arrays it can be much more efficient than making a flattened copy.