Convert one-hot encoding back to number label - numpy

I have a series of one-hot encoded vectors, say
np.array([[1,0,0,0],[0,1,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]])
I want to convert it back to
np.array([0, 1, 1, 2, 3])
Is there an efficient way of doing this without a for loop?

As pointed out by @Divakar in the comments, NumPy's argmax is the easiest way to get the job done. Note that you need to pass the function the proper value for the axis parameter.
In [18]: import numpy as np
In [19]: x = np.array([[1,0,0,0],[0,1,0,0],[0,1,0,0],[0,0,1,0],[0,0,0,1]])
In [20]: np.argmax(x, axis=-1)
Out[20]: array([0, 1, 1, 2, 3], dtype=int64)
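As a quick round-trip check (my sketch, not part of the original answer), the inverse mapping can be built with np.eye, so labels -> one-hot -> labels is easy to verify:
import numpy as np
x = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0],
              [0, 0, 1, 0], [0, 0, 0, 1]])
labels = np.argmax(x, axis=-1)                       # array([0, 1, 1, 2, 3])
one_hot = np.eye(x.shape[1], dtype=x.dtype)[labels]  # select rows of the identity matrix
assert np.array_equal(one_hot, x)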

Related

Get 2D elements from 3D numpy array, corresponding to 2D indices

I have a 3D array that could be interpreted as a 2D matrix of positions, where each position is a 2D array of coordinates [x,y].
I then have a list of 2D indices, each one indicating a position in the matrix in terms of [row, column].
I would like to obtain the positions from the matrix corresponding to all these indices.
What I am doing is:
import numpy as np
input_matrix = np.array(
[[[0.0, 1.5], [3.0, 3.0]], [[7.0, 5.2], [6.0, 7.0]]]
)
indices = np.array([[1, 0], [1, 1]])
selected_elements = np.array([input_matrix[tuple(idx)] for idx in indices])
So for example the 2D element corresponding to the 2D index [1, 0] would be [7.0, 5.2] and so on.
My code works, but I was wondering if there is a better way, for example one using numpy entirely (i.e. without a list comprehension when there are multiple 2D indices).
I tried numpy's take, but it does not seem to produce the desired results.
You can use:
input_matrix[tuple(indices.T)]
Or, as suggested in comments:
input_matrix[indices[:,0], indices[:,1]]
Output:
array([[7. , 5.2],
       [6. , 7. ]])
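Putting it together, a minimal self-contained check (a sketch) that the vectorized indexing agrees with the original list comprehension:
import numpy as np
input_matrix = np.array(
    [[[0.0, 1.5], [3.0, 3.0]], [[7.0, 5.2], [6.0, 7.0]]]
)
indices = np.array([[1, 0], [1, 1]])
loop = np.array([input_matrix[tuple(idx)] for idx in indices])
vectorized = input_matrix[tuple(indices.T)]  # (row indices, column indices)
assert np.array_equal(loop, vectorized)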

What's the difference between `arr[tuple(seq)]` and `arr[seq]`? Relating to "Using a non-tuple sequence for multidimensional indexing is deprecated"

I am using an ndarray to slice another ndarray.
Normally I use arr[ind_arr]. numpy seems not to like this and raises a FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`.
What's the difference between arr[tuple(seq)] and arr[seq]?
Other questions on Stack Overflow seem to run into this warning in scipy and pandas, and most people suggest the cause lies in the particular versions of those packages. I am hitting the warning in pure numpy.
Example posts:
FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated use `arr[tuple(seq)]` instead of `arr[seq]`
FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated use `arr[tuple(seq)]`
FutureWarning with distplot in seaborn
MWE reproducing warning:
import numpy as np
# generate a random 2d array
A = np.random.randint(20, size=(7,7))
print(A, '\n')
# define indices
ind_i = np.array([1, 2, 3]) # along i
ind_j = np.array([5, 6]) # along j
# generate index array using meshgrid
ind_ij = np.meshgrid(ind_i, ind_j, indexing='ij')
B = A[ind_ij]
print(B, '\n')
C = A[tuple(ind_ij)]
print(C, '\n')
# note: both produce the same result
meshgrid returns a list of arrays:
In [50]: np.meshgrid([1,2,3],[4,5],indexing='ij')
Out[50]:
[array([[1, 1],
        [2, 2],
        [3, 3]]),
 array([[4, 5],
        [4, 5],
        [4, 5]])]
In [51]: np.meshgrid([1,2,3],[4,5],indexing='ij',sparse=True)
Out[51]:
[array([[1],
        [2],
        [3]]),
 array([[4, 5]])]
ix_ does the same thing, but returns a tuple:
In [52]: np.ix_([1,2,3],[4,5])
Out[52]:
(array([[1],
        [2],
        [3]]),
 array([[4, 5]]))
np.ogrid also produces a list.
In [55]: arr = np.arange(24).reshape(4,6)
indexing with the ix_ tuple:
In [56]: arr[_52]
Out[56]:
array([[10, 11],
       [16, 17],
       [22, 23]])
indexing with the meshgrid list:
In [57]: arr[_51]
/usr/local/bin/ipython3:1: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
#!/usr/bin/python3
Out[57]:
array([[10, 11],
       [16, 17],
       [22, 23]])
Often the meshgrid result is used with unpacking:
In [62]: I,J = np.meshgrid([1,2,3],[4,5],indexing='ij',sparse=True)
In [63]: arr[I,J]
Out[63]:
array([[10, 11],
       [16, 17],
       [22, 23]])
Here [I,J] is the same as [(I,J)], making a tuple of the 2 subarrays.
Basically they are trying to remove a loophole that existed for historical reasons. I don't know if they can change the handling of meshgrid results without causing further compatibility issues.
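To make the distinction concrete, here is a minimal sketch (run on a NumPy version where the deprecation has landed, so a plain list is treated as an array index):
import numpy as np
arr = np.arange(24).reshape(4, 6)
rows = np.array([[1], [2], [3]])
cols = np.array([[4, 5]])
# tuple: one index array per dimension, broadcast together -> shape (3, 2)
print(arr[(rows, cols)])
# a plain list of integers is fancy indexing along axis 0 only -> whole rows
print(arr[[1, 2]].shape)   # (2, 6)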
For future readers who hit this FutureWarning and want to resolve it correctly:
Read the answer by @hpaulj above, and notice that the key point is where they say:
[...] returns a list of [...]
[...] returns a tuple of [...]
If you don't know why the first point matters, read this answer: https://stackoverflow.com/a/71487259/5290519, also written by them. That answer provides:
1. The reason why using a list as an index causes a warning in the first place:
"[4] is a problem because in the past, certain lists were interpreted as though they were tuples. This is a legacy case that developers are trying to clean up, hence the FutureWarning."
2. A series of concise examples on the (correct) usage of list, tuple, and np.array as array indices.
If you still don't get the point, try my own answer, written after I figured out all these complicated concepts: https://stackoverflow.com/a/71493474/5290519

What does the np.array command do?

A question about the np.array command.
Let's say the content of caches, when displayed with the print command, is
caches = [array([1,2,3]),array([1,2,3]),...,array([1,2,3])]
Then I executed following code:
train_x = np.array(caches)
When I print the content of train_x I have:
train_x = [[1,2,3],[1,2,3],...,[1,2,3]]
Now, the behavior is exactly what I want, but I do not really understand in depth what the np.array(caches) command has done. Can somebody explain this to me?
Making a 1d array
In [89]: np.array([1,2,3])
Out[89]: array([1, 2, 3])
In [90]: np.array((1,2,3))
Out[90]: array([1, 2, 3])
[1,2,3] is a list; (1,2,3) is a tuple. np.array treats them as the same. (list versus tuple does make a difference when creating structured arrays, but that's a more advanced topic.)
Note the shape is (3,) (shape is a tuple)
Making a 2d array from a nested list - a list of lists:
In [91]: np.array([[1,2],[3,4]])
Out[91]:
array([[1, 2],
       [3, 4]])
In [92]: _.shape
Out[92]: (2, 2)
np.array takes data, not shape information. It infers shape from the data.
array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)
In these examples the object parameter is a list or list of lists. We aren't, at this stage, defining the other parameters.
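Applying this to the caches from the question (a minimal sketch): np.array treats a list of equal-length 1d arrays just like a nested list and stacks them along a new first axis:
import numpy as np
caches = [np.array([1, 2, 3]), np.array([1, 2, 3]), np.array([1, 2, 3])]
train_x = np.array(caches)   # list of 1d arrays -> 2d array
print(train_x.shape)         # (3, 3)
print(train_x)               # [[1 2 3]
                             #  [1 2 3]
                             #  [1 2 3]]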

Why does MinMaxScaler add lines to image?

I want to normalize the pixel values of an image to the range [0, 1] for each channel (R, G, B).
Minimal Example
#!/usr/bin/env python
import numpy as np
import scipy
from sklearn import preprocessing
original = scipy.misc.imread('Crocodylus-johnsoni-3.jpg')
scipy.misc.imshow(original)
transformed = np.zeros(original.shape, dtype=np.float64)
scaler = preprocessing.MinMaxScaler()
for channel in range(3):
    transformed[:, :, channel] = scaler.fit_transform(original[:, :, channel])
scipy.misc.imsave("transformed.jpg", transformed)
What happens
Taking https://commons.wikimedia.org/wiki/File:Crocodylus-johnsoni-3.jpg,
I get a "normalized" result with vertical lines running from top to bottom on the right side. What happened there? It seems to me that the normalization went wrong. If so: how do I fix it?
In scikit-learn, a two-dimensional array with shape (m, n) is usually interpreted as a collection of m samples, with each sample having n features.
MinMaxScaler.fit_transform() transforms each feature, so each column of your array is transformed independently of the others. That results in the vertical "stripes" in the image.
It looks like you intended to scale each color channel independently. To do that using MinMaxScaler, reshape the input so that each channel becomes one column. That is, if the original image has shape (m, n, 3), reshape it to (m*n, 3) before passing it to the fit_transform() method, and then restore the shape of the result to create the transformed array.
For example,
ascolumns = original.reshape(-1, 3)
t = scaler.fit_transform(ascolumns)
transformed = t.reshape(original.shape)
With this, transformed looks exactly like the original. That's because in the array original, the minimum and maximum are 0 and 255, respectively, in each channel:
In [41]: original.min(axis=(0, 1))
Out[41]: array([0, 0, 0], dtype=uint8)
In [42]: original.max(axis=(0, 1))
Out[42]: array([255, 255, 255], dtype=uint8)
So all fit_transform does in this case is transform all the input values to the floating point range [0.0, 1.0] uniformly. If the minimum or maximum was different in one of the channels, the transformed image would look different.
By the way, it is not difficult to perform the transform using pure numpy. (I'm using Python 3, so in the following, the division automatically casts the result to floating point. If you are using Python 2, you'll need to convert one of the arguments to floating point, or use from __future__ import division.)
In [58]: omin = original.min(axis=(0, 1), keepdims=True)
In [59]: omax = original.max(axis=(0, 1), keepdims=True)
In [60]: xformed = (original - omin)/(omax - omin)
In [61]: np.allclose(xformed, transformed)
Out[61]: True
(One potential problem with that method is division by zero if one of the channels is constant, because then one of the values in omax - omin will be 0.)
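If constant channels are a concern, one possible guard (my sketch, not part of the original answer) is to replace a zero range with 1 before dividing, which maps a constant channel to all zeros:
import numpy as np
# original: any (m, n, 3) uint8 image; made up here so the snippet is runnable
original = np.random.randint(0, 256, size=(4, 4, 3), dtype=np.uint8)
original[:, :, 2] = 128      # force one channel constant to exercise the guard
omin = original.min(axis=(0, 1), keepdims=True).astype(np.float64)
omax = original.max(axis=(0, 1), keepdims=True).astype(np.float64)
span = omax - omin
span[span == 0] = 1.0        # avoid dividing by zero
xformed = (original - omin) / span   # a constant channel becomes all zeros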

Specify the spherical covariance in numpy's multivariate_normal random sampling

In the numpy manual, it is said:
Instead of specifying the full covariance matrix, popular approximations include:
Spherical covariance (cov is a multiple of the identity matrix)
Has anybody ever specified spherical covariance? I am trying to make it work to avoid building the full covariance matrix, which uses too much memory.
If you just have a diagonal covariance matrix, it is usually easier (and more efficient) to just scale standard normal variates yourself instead of using multivariate_normal().
>>> import numpy as np
>>> stdevs = np.array([3.0, 4.0, 5.0])
>>> x = np.random.standard_normal([100, 3])
>>> x.shape
(100, 3)
>>> x *= stdevs
>>> x.std(axis=0)
array([ 3.23973255, 3.40988788, 4.4843039 ])
While @RobertKern's approach is correct, you can let numpy handle all of that for you, as np.random.normal will broadcast over multiple means and standard deviations:
>>> np.random.normal(0, [1,2,3])
array([ 0.83227999, 3.40954682, -0.01883329])
To get more than a single random sample, you have to give it an appropriate size:
>>> x = np.random.normal(0, [1, 2, 3], size=(1000, 3))
>>> np.std(x, axis=0)
array([ 1.00034817, 2.07868385, 3.05475583])
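For the spherical case specifically (cov = sigma**2 * I), the broadcasting collapses to a single scalar standard deviation, so the identity matrix never needs to be built; a minimal sketch:
import numpy as np
mean = np.array([1.0, -2.0, 0.5])
sigma = 2.0                   # spherical: cov = sigma**2 * np.eye(3), never materialized
x = mean + sigma * np.random.standard_normal((10000, 3))
print(x.mean(axis=0))         # close to mean
print(x.std(axis=0))          # close to [2, 2, 2]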