I need to apply a function to the result of a transformation of all index values of a given numpy array. The following code does this:
import numpy as np
from matplotlib.transforms import IdentityTransform
# some 2D array
a = np.empty((2,3))
# some affine transformation, identity is just an example here
trans = IdentityTransform()
# some function taking a 2D index and returning some value depending
# on that index, again just an example
def f(idx):
    return (idx[0]+idx[1])/2
# apply f to the result of transforming each index of a
b=np.empty_like(a)
for idx in np.ndindex(a.shape):
    b[idx] = f(trans.transform(idx))
print(b)
This prints the following correct result:
[[0. 0.5 1. ]
[0.5 1. 1.5]]
The problem now is, the code is too slow when the shape of a gets larger, say 2000x3000. Is there a way to speed this up?
My idea is to create an array of indices of a idx = [[0,0], [0,1], ..., [1,2]], then transform this array in one go using something like tmp = trans.transform(idx), and lastly apply f to every element with np.vectorize(f)(tmp).
Is this a reasonable approach? If yes, what would this actually look like? If no, are there any alternatives?
Edit: I managed to get at tmp via the following code:
tmp=trans.transform(np.asarray([idx for idx in np.ndindex(a.shape)]))
So now I have an array containing the results of the affine transformation for every index value of a. But this seems to use an awful lot of memory.
I'll post an answer myself with what I figured out now. Maybe it is of use for someone.
To answer the first part of my question: I found a fast and efficient way to compute the transformed index values, by taking the result of np.indices() and reshaping it into the form that t.transform() expects.
Given some array a = np.empty((2,3)), the indices of that array can be obtained via np.indices(a.shape). This returns two 2D arrays (one for each dimension of a, actually). What I failed to understand was how to turn these results into something transform() understands.
The key here is to apply np.ravel() to each of the arrays that np.indices() returns:
>>> a=np.empty((2,3))
>>> list(map(np.ravel, np.indices(a.shape)))
[array([0, 0, 0, 1, 1, 1]), array([0, 1, 2, 0, 1, 2])]
Now I have a list of arrays containing all the x and y indices, which just needs to be put together with np.vstack() and then transposed to get an array of all (x, y) indices, and this is the form transform() will accept.
>>> l=list(map(np.ravel, np.indices(a.shape)))
>>> np.vstack(l).transpose()
array([[0, 0],
[0, 1],
[0, 2],
[1, 0],
[1, 1],
[1, 2]])
And finally, for some arbitrary affine transformation:
>>> from matplotlib.transforms import Affine2D
>>> t = Affine2D().translate(10, 20).scale(0.5)
>>> t.transform(np.vstack(l).transpose())
array([[ 5. , 10. ],
[ 5. , 10.5],
[ 5. , 11. ],
[ 5.5, 10. ],
[ 5.5, 10.5],
[ 5.5, 11. ]])
This is quite fast, even for larger array sizes. If the shape gets big enough (something like 20000x30000), I run out of memory, but for shapes 10000x10000 it still is amazingly fast.
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20, 10)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0003051299718208611
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((200, 100)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.0026413939776830375
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((2000, 1000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
0.35055489401565865
>>> timeit.timeit("t.transform(np.vstack(list(map(np.ravel, np.indices(a.shape, dtype=np.uint16)))).transpose())",
... "import numpy as np ; from matplotlib.transforms import Affine2D ; a = np.empty((20000, 10000)) ; t = Affine2D().translate(10, 20).scale(0.5)", number=10)
43.62860555597581
Now for the second part: to apply the function to each of the transformed index values, I use the following code for now, which is fast enough in my case.
xxyy = t.transform(np.vstack(...).transpose())
np.fromiter((f(*xy) for xy in xxyy), dtype=np.short, count=len(xxyy))
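If f can be written with plain array arithmetic, the Python-level loop can probably be dropped altogether. The following is only a sketch under assumptions: f takes the transformed x and y as two separate arguments (as in the f(*xy) call above), and the affine transformation is passed as a plain 3x3 matrix, which a matplotlib Affine2D exposes via get_matrix(). The helper name apply_on_transformed_indices is made up for this illustration.

```python
import numpy as np

def apply_on_transformed_indices(shape, matrix, func):
    """Apply func(x, y) to the affinely transformed indices of a 2D shape.

    matrix is a 3x3 affine matrix (assumption: same layout as the one
    returned by matplotlib's Affine2D.get_matrix()).
    """
    rows, cols = np.indices(shape)
    # One (row, col) pair per line, in the same order as np.ndindex
    pts = np.column_stack((rows.ravel(), cols.ravel()))
    # Linear part plus translation, applied to all points at once
    transformed = pts @ matrix[:2, :2].T + matrix[:2, 2]
    # func must accept whole arrays for this to work
    return func(transformed[:, 0], transformed[:, 1]).reshape(shape)

def f(x, y):
    return (x + y) / 2

# Identity transform: reproduces the 2x3 result from the question
b = apply_on_transformed_indices((2, 3), np.eye(3), f)
print(b)
```

The result has the same shape as the original array, so it can be used directly in place of b from the question. If f cannot be vectorized, the np.fromiter approach above remains the fallback.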
Consider the following image:
I'd like to print it as a grayscale image. I can do the conversion with scikit-image:
from skimage.io import imread
from matplotlib import pyplot as plt
from skimage.color import rgb2gray
img = imread('image.jpg')
plt.grid(which = 'both')
plt.imshow(rgb2gray(img), cmap=plt.cm.gray)
I get:
which is obviously not what I want.
My question is: Is there a way with scikit-image or with raw numpy and/or matplotlib to digitize the image so that I get a 3D array (first dimension: X index, second dimension: Y index, third dimension: value according to the colormap)? Then I could easily change the colormap to something that gives better results when printing in grayscale.
The example below demonstrates a simple way to undo a colormap's value -> RGB mapping.
def unmap_nearest(img, rgb):
    """ img is an image of shape [n, m, 3], and rgb is a colormap of shape [k, 3]. """
    d = np.sum(np.abs(img[np.newaxis, ...] - rgb[:, np.newaxis, np.newaxis, :]), axis=-1)
    i = np.argmin(d, axis=0)
    return i / (rgb.shape[0] - 1)
This function works by taking the RGB value of each pixel and looking up the index of the best matching color in the colormap. Some trickery with indexing and broadcasting allows for efficient vectorization (at the cost of memory spent on temporary arrays):
img[np.newaxis, ...] converts the image from shape [n, m, 3] to [1, n, m, 3]
rgb[:, np.newaxis, np.newaxis, :] converts the colormap from shape [k, 3] to [k, 1, 1, 3].
subtracting the resulting arrays leads to an array of shape [k, n, m, 3] that contains the difference between each colormap index k and pixel n, m for each color component.
sum(abs(..), axis=-1) takes the absolute value of the differences and sums over all color components (the last dimension) to get the total difference between all pixels and color map entries (array of shape [k, n, m]).
i = np.argmin(d, axis=0) finds the index of the minimum element along the first dimension. The result is the index of the best matching color map entry of each pixel [n, m].
return i / (rgb.shape[0] - 1) finally returns the indices normalized by the color map size so that the result is in range 0-1.
There are a few caveats with this approach:
It cannot reconstruct the original value range.
It will treat all pixels as part of the color map (i.e. continent contours will also be mapped).
If you use the wrong color map it will fail hilariously.
import numpy as np
import matplotlib.pyplot as plt
from skimage.color import rgb2gray
def unmap_nearest(img, rgb):
    """ img is an image of shape [n, m, 3], and rgb is a colormap of shape [k, 3]. """
    d = np.sum(np.abs(img[np.newaxis, ...] - rgb[:, np.newaxis, np.newaxis, :]), axis=-1)
    i = np.argmin(d, axis=0)
    return i / (rgb.shape[0] - 1)
cmap = plt.cm.jet
rgb = cmap(np.linspace(0, 1, cmap.N))[:, :3]
original = (np.arange(10)[:, None] + np.arange(10)[None, :])
plt.subplot(2, 2, 1)
plt.imshow(original, cmap='gray')
plt.colorbar()
plt.title('original')
plt.subplot(2, 2, 2)
rgb_img = cmap(original / 18)[..., :-1]
plt.imshow(rgb_img)
plt.title('color-mapped')
plt.subplot(2, 2, 3)
wrong = rgb2gray(rgb_img)
plt.imshow(wrong, cmap='gray')
plt.title('rgb2gray')
plt.subplot(2, 2, 4)
reconstructed = unmap_nearest(rgb_img, rgb)
plt.imshow(reconstructed, cmap='gray')
plt.colorbar()
plt.title('reconstructed')
plt.show()
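The temporary array of shape [k, n, m, 3] built inside unmap_nearest can get large for big images. One possible way to bound that memory is to process the image in blocks of rows. This is a sketch, not part of the answer above; the function name unmap_nearest_chunked and the rows_per_chunk parameter are made up for illustration.

```python
import numpy as np

def unmap_nearest_chunked(img, rgb, rows_per_chunk=64):
    """Like unmap_nearest, but only rows_per_chunk image rows are compared at a time.

    img has shape [n, m, 3], rgb is a colormap of shape [k, 3].
    """
    out = np.empty(img.shape[:2], dtype=float)
    for start in range(0, img.shape[0], rows_per_chunk):
        block = img[start:start + rows_per_chunk]
        # Temporary of shape [k, rows, m, 3] instead of [k, n, m, 3]
        d = np.sum(np.abs(block[np.newaxis, ...] - rgb[:, np.newaxis, np.newaxis, :]), axis=-1)
        out[start:start + rows_per_chunk] = np.argmin(d, axis=0) / (rgb.shape[0] - 1)
    return out

# Small synthetic check: a 5-entry colormap and an image built from it
demo_rgb = np.linspace(0, 1, 5)[:, None] * np.array([1.0, 0.5, 0.25])
demo_idx = np.array([[0, 1, 2], [3, 4, 0], [2, 2, 1]])
demo_img = demo_rgb[demo_idx]
recovered = unmap_nearest_chunked(demo_img, demo_rgb, rows_per_chunk=2)
```

The result should match the unchunked function; only the peak memory use changes, at the cost of a Python loop over the blocks.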
Building on @kazemakmakase's answer, if you're digitizing a figure, you probably are dealing with a copy of the original that's been converted, or maybe even printed and scanned at some point. Those things can distort colors from the "true" colormap that was originally used.
You can deal with this by using a slice through the figure's colorbar as the 'pattern' (rgb) to match against. Specifically, crop the figure down to just the color ramp (in landscape orientation in this example), then replace the rgb variable in @kazemakmakase's example with:
cmapimg = plt.imread('cropped_colorbar.png')
rgb = cmapimg[cmapimg.shape[0]//2,:,:3]  # integer division: the row index must be an int
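To see what that slice produces without needing an image file, here is a small sketch on a synthetic color ramp. The array fake_colorbar and the helper name colorbar_to_rgb are invented for this example; a real cropped colorbar loaded with plt.imread would take the place of the synthetic array.

```python
import numpy as np

def colorbar_to_rgb(cmapimg):
    """Take the middle row of a landscape color-ramp image of shape [h, k, 3 or 4].

    Returns an array of shape [k, 3] that can be used as the rgb colormap.
    """
    middle_row = cmapimg.shape[0] // 2  # integer division, so it is a valid index
    return cmapimg[middle_row, :, :3]   # drop the alpha channel if there is one

# Synthetic RGBA ramp, 5 pixels high and 8 pixels wide (blue to red)
ramp = np.linspace(0, 1, 8)
fake_colorbar = np.zeros((5, 8, 4))
fake_colorbar[..., 0] = ramp
fake_colorbar[..., 2] = 1 - ramp
fake_colorbar[..., 3] = 1.0

rgb = colorbar_to_rgb(fake_colorbar)
print(rgb.shape)
```

Note that this assumes the low end of the colorbar is on the left; if the ramp runs the other way, the rows of rgb would need to be reversed.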
Is there some numpy sugar for converting homogeneous coordinates back to 2D coordinates?
So this:
[[4,8,2],
[6,3,2]]
becomes this:
[[2,4],
[3,1.5]]
One approach making use of broadcasted elementwise divisions -
from __future__ import division
a[:,:2]/a[:,[-1]]
We can use a[:,-1,None] or a[:,-1][:,None] or a[:,-1].reshape(-1,1) in place of a[:,[-1]]. With a[:,[-1]], we are keeping the number of dims intact, letting us perform the broadcasting divisions.
Another with np.true_divide again using broadcasting -
np.true_divide(a[:,:2], a[:,[-1]])
Sample run -
In [194]: a
Out[194]:
array([[4, 8, 2],
[6, 3, 2]])
In [195]: a[:,:2]/a[:,[-1]]
Out[195]:
array([[ 2. , 4. ],
[ 3. , 1.5]])
In [196]: np.true_divide(a[:,:2], a[:,[-1]])
Out[196]:
array([[ 2. , 4. ],
[ 3. , 1.5]])
If you have your input as a vector called x you could do
x[:-1]/x[-1]
Full example:
import numpy as np
x = np.array([6,3,2])
x[:-1]/x[-1] # array([ 3. , 1.5])
You can also apply it to multiple coordinates in an array:
xs = np.array([[4,8,2],[6,3,2]])
np.array([x[:-1]/x[-1] for x in xs]) # array([[ 2. , 4. ],
# [ 3. , 1.5]])
If you want to reuse this you can define a function homogen:
homogen = lambda x: x[:-1]/x[-1]
# previous stuff becomes something like
np.array([homogen(x) for x in xs])
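The single-vector and the array case can also be covered by one function using an ellipsis index, which avoids the list comprehension. This is a sketch; the name dehomogenize is made up here.

```python
import numpy as np

def dehomogenize(points):
    """Divide all but the last coordinate by the last coordinate.

    Works for a single vector of shape (d+1,) and for arrays of shape (..., d+1).
    """
    points = np.asarray(points, dtype=float)
    # points[..., -1:] keeps the last axis, so the division broadcasts
    return points[..., :-1] / points[..., -1:]

single = dehomogenize([6, 3, 2])
many = dehomogenize([[4, 8, 2], [6, 3, 2]])
print(single)
print(many)
```

Converting to float up front means the division is a true division regardless of the input dtype.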
Elaborated question:
Let me clarify my question. I want to plot the array output below as a 2D scatter plot, with polarity along the x axis and subjectivity along the y axis. The modality value, which ranges between -1 and 1, should determine the type of marker (o, x, ^, v).
output
polarities: [ 0. 0. 0. 0.]
subjectivity: [ 0.1 0. 0. 0. ]
modalities: [ 1. -0.25 1. 1. ]
The modified code, with the marker limited to two value ranges:
print("polarities: ", a[:,0])
print("subjectivity: ", a[:,1])
print("modalities: ", a[:,2])
def markers(r):
    # object dtype so the array can hold marker strings
    markers = np.array(r, dtype=object)
    markers[(r>=0)] = 'o'
    markers[r<0] = 'x'
    return markers.tolist()
def colors(s):
    # object dtype so the array can hold color strings
    colors = np.array(s, dtype=object)
    colors[(s>=0)] = 'g'
    colors[s<0] = 'r'
    return colors.tolist()
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(a[:,0], a[:,1], marker = markers(a[:,2]), color= colors(a[:,0]), s=100, picker=5)
My intent is to check the modality value and return one of the four markers.
If I hardcode 'o', it produces the plot.
ax.scatter(a[:,0], a[:,1], marker = markers('o'), color= colors(a[:,0]), s=100, picker=5)
As a trial, I tried to mimic the color function and pass it a[:,2], but I got this error in the shell:
ValueError: Unrecognized marker style ['o', 'x', 'o', 'o']
The question is: Is my approach wrong? If not, how do I make it recognize the marker style?
Edit1
Trying to get the m value between 0 and .5
with this code
ax.scatter (p[0<m<=.5], s[0<m<=.5], marker = "v", color= colors(a[:,0]), s=100, picker=5)
yields this error
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
How can I restrict m to values between 0 and .5 in the example given in answer 2?
It's not clear from your question, but I assume your array a is of shape (N,3) and so your arrays s and r are actual arrays and not scalars.
First off, you cannot have several markers with one call of scatter(). If you want your plot to have several markers, you'll have to slice your array correctly and call scatter() once for each of your markers.
Regarding the colors, your problem is that your function colors(r) only returns one color where it should return an array of colors (with the same number of elements as a[:,0]). Like this:
def colors(s):
    # object dtype so the array can hold color strings
    colors = np.array(s, dtype=object)
    colors[(s>0.25)&(s<0.75)] = 'g'
    colors[s>=0.75] = 'b'
    colors[s<=0.25] = 'r'
    return colors.tolist()
a = np.random.random((100,))
b = np.random.random((100,))
plt.scatter(a,b,color=colors(b))
ANSWER TO YOUR EDIT 1:
You seem to be on the right track, you'll have to do as many scatter() calls as you have markers.
Your error comes from the slicing index [0<m<=.5] which you cannot use like that. You have to use the full notation [(m>0.)&(m<=.5)]
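A short illustration of why the full notation is needed: Python evaluates a chained comparison such as 0<m<=.5 as (0<m) and (m<=.5), and the `and` asks for a single truth value of a whole array. Combining two elementwise comparisons with & avoids that. The values of m below are made up for the example.

```python
import numpy as np

m = np.array([1.0, -0.25, 0.3, 0.5])

# Each comparison gives a boolean array; & combines them element by element.
# The parentheses are required because & binds tighter than the comparisons.
mask = (m > 0.) & (m <= .5)
print(mask)      # [False False  True  True]
print(m[mask])   # [0.3 0.5]
```

The same mask can be used to index p and s, so that only the matching points are passed to scatter().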
As Diziet pointed out, plt.scatter() cannot handle several markers. You therefore need to make one scatter plot per marker category. This can be done by conditioning on the property that the marker should reflect. In this case:
import numpy as np
import matplotlib.pyplot as plt
p = np.array( [ 0. , 0.2 , -0.3 , 0.2] )
s = np.array( [ 0.1, 0., 0., 0.3 ] )
m = np.array( [ 1., -0.25, 1. , -0.6 ] )
colors = np.array([(0.8*(1-x), 0.7*x, 0) for x in np.ceil(p)])
fig=plt.figure()
ax=fig.add_subplot(111)
ax.scatter(p[m>=0], s[m>=0], marker = "o", color= colors[m>=0], s=100)
ax.scatter(p[m<0], s[m<0], marker = "s", color= colors[m<0], s=100)
ax.set_xlabel("polarity")
ax.set_ylabel("subjectivity")
plt.show()
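For the four markers mentioned in the question, the same idea extends to more than two categories. Below is a sketch of how the masks could be built with np.digitize. The bin edges (-0.5, 0, 0.5) and the assignment of markers to bins are assumptions made for this example, since the question does not say which range maps to which marker; marker_masks is a made-up helper name.

```python
import numpy as np

# Assumed mapping: [-1, -0.5) -> "v", [-0.5, 0) -> "x", [0, 0.5) -> "o", [0.5, 1] -> "^"
MARKERS = ("v", "x", "o", "^")
EDGES = (-0.5, 0.0, 0.5)

def marker_masks(m):
    """Return a list of (marker, boolean mask) pairs, one per modality bin."""
    m = np.asarray(m)
    # np.digitize returns 0 for values below the first edge, 3 for values >= the last
    bins = np.digitize(m, EDGES)
    return [(marker, bins == i) for i, marker in enumerate(MARKERS)]

m = np.array([1.0, -0.25, 1.0, -0.6])
for marker, mask in marker_masks(m):
    print(marker, mask)
```

Each pair can then drive one call of the form ax.scatter(p[mask], s[mask], marker=marker, ...), exactly as in the two-category example above.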
I have a 50000 x 784 data matrix (50000 samples and 784 features) and the corresponding 50000 x 1 class vector (classes are integers 0-9). I'm looking for an efficient way to group the data matrix into 10 data matrices and class vectors that each have only the data for a particular class 0-9.
I can't seem to find an elegant way to do this, aside from just looping through the data matrix and constructing the 10 other matrices that way.
Does anyone know if there is a clean way to do this with something in scipy, numpy, or sklearn?
Probably the cleanest way of doing this in numpy, especially if you have many classes, is through sorting:
SAMPLES = 50000
FEATURES = 784
CLASSES = 10
data = np.random.rand(SAMPLES, FEATURES)
classes = np.random.randint(CLASSES, size=SAMPLES)
sorter = np.argsort(classes)
classes_sorted = classes[sorter]
splitter, = np.where(classes_sorted[:-1] != classes_sorted[1:])
data_splitted = np.split(data[sorter], splitter + 1)
data_splitted will be a list of arrays, one for each class found in classes. Running the above code with SAMPLES = 10, FEATURES = 2 and CLASSES = 3 I get:
>>> data
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.70539842, 0.76376921],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.01985781, 0.04272293],
[ 0.93026735, 0.40216376],
[ 0.39089845, 0.01891637],
[ 0.70937483, 0.16077439],
[ 0.45383099, 0.82074859]])
>>> classes
array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])
>>> data_splitted
[array([[ 0.93026735, 0.40216376],
[ 0.70937483, 0.16077439]]),
array([[ 0.45813694, 0.47942962],
[ 0.96587082, 0.73260743],
[ 0.01031978, 0.93660231],
[ 0.45434223, 0.03778273],
[ 0.45383099, 0.82074859]]),
array([[ 0.70539842, 0.76376921],
[ 0.01985781, 0.04272293],
[ 0.39089845, 0.01891637]])]
If you want to make sure the sort is stable, i.e. that data points in the same class remain in the same relative order after sorting, you will need to specify sorter = np.argsort(classes, kind='mergesort').
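If you also need to know which class each chunk belongs to, a variant of the same sorting idea is to let np.unique find the split points. This is a sketch rather than the code from the answer above; the function name split_by_class is made up.

```python
import numpy as np

def split_by_class(data, classes):
    """Return a dict mapping each class label to the rows of data with that label."""
    # Stable sort keeps rows of the same class in their original relative order
    sorter = np.argsort(classes, kind='stable')
    # On sorted input, return_index gives the first position of each label
    labels, first = np.unique(classes[sorter], return_index=True)
    groups = np.split(data[sorter], first[1:])
    return dict(zip(labels.tolist(), groups))

classes = np.array([1, 1, 2, 1, 1, 2, 0, 2, 0, 1])
data = np.arange(20).reshape(10, 2)
groups = split_by_class(data, classes)
print(groups[0])
```

Unlike the plain list, the dict still works as expected when some class in 0-9 does not occur in the data at all.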
If your data and labels matrices are in numpy format, you can do:
data_class_3 = data[labels == 3, :]
If they aren't, turn them into numpy format:
import numpy as np
data = np.array(data)
labels = np.array(labels)
data_class_3 = data[labels == 3, :]
You can loop and do this for all labels automatically if you like. Something like this:
import numpy as np
# keep a plain list: the per-class arrays usually have different numbers of rows
split_classes = [data[labels == i, :] for i in range(10)]
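One thing to watch out for: the question describes the class vector as 50000 x 1, and a boolean mask with that 2D shape cannot be used to select rows of data. Flattening the labels first should take care of it. A small sketch with made-up data:

```python
import numpy as np

data = np.arange(12).reshape(6, 2)
labels = np.array([[0], [1], [0], [2], [1], [0]])  # column vector, shape (6, 1)

# Flatten to shape (6,) so that the mask selects whole rows of data
flat_labels = labels.ravel()
split = {int(c): data[flat_labels == c] for c in np.unique(flat_labels)}
print(split[0])
```

Iterating over np.unique(flat_labels) instead of range(10) is a design choice here: it only creates entries for classes that actually occur.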
Following @Jaime's optimal numpy answer, I suggest pandas, which specializes in data manipulation:
import pandas
df=pandas.DataFrame(data,index=classes).sort_index()
Then df.loc[i] is your class i.
If you want a list, just do:
metadata=[df.loc[i].values for i in range(10)]
so metadata[i] is the subset you want, or make a panel with pandas. All that is based on numpy arrays, so efficiency is preserved.