How to check if my data is one-hot encoded - pandas

If I have a data matrix, how do I check if the categorical variables have been one-hot encoded or not?
I need to use LIME to explain my prediction, and I read that LIME works only if you have category labels instead of one-hot encoded columns.
I found code to convert it, but it works only if the data has been encoded; otherwise the columns get turned into NaNs.
So I need a piece of code that looks at a NumPy array of data and tells me whether it has been one-hot encoded or not.

You can sum each row and check that every row sum equals 1, as in the following example:
Example:
import numpy as np
X = np.array(
[
[1, 0, 0],
[0, 1, 0],
[0, 0, 1],
[0, 1, 0],
[1, 0, 0]
]
)
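# Every entry must be 0 or 1, and every row must contain exactly one 1.
# (Summing the differences over all rows can cancel out, e.g. row sums of 2 and 0.)
print(f'X is one-hot-encoded: {bool(np.isin(X, [0, 1]).all() and (X.sum(axis=1) == 1).all())}')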
Result:
X is one-hot-encoded: True
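
A slightly stricter variant, as a minimal sketch: it checks each row individually rather than the total, because a total-based check passes a matrix whose row sums are 2 and 0. The helper name `is_one_hot` and the array `bad` are made up for illustration, and the sketch assumes the columns passed in all belong to a single categorical variable.

```python
import numpy as np

def is_one_hot(block):
    """Return True if every entry is 0 or 1 and each row has exactly one 1."""
    block = np.asarray(block)
    is_binary = np.isin(block, [0, 1]).all()
    one_per_row = (block.sum(axis=1) == 1).all()
    return bool(is_binary and one_per_row)

X = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
bad = np.array([[1, 1, 0], [0, 0, 0]])  # row sums are 2 and 0

# The total-based check is fooled because (2 - 1) + (0 - 1) == 0.
total_check = (bad.sum(axis=1) - np.ones(bad.shape[0])).sum() == 0

print(is_one_hot(X))    # True
print(is_one_hot(bad))  # False
print(total_check)      # True (a false positive)

# If the block is one-hot, argmax recovers the integer category labels.
labels = X.argmax(axis=1)
print(labels)           # [0 1 2]
```

For a matrix that mixes several encoded variables with numeric features, the same helper could be applied to each group of columns separately.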

Using lexsort on higher dimensional arrays

I could not for the life of me get array indexing to work properly with higher dimensional lexsort.
I have an ndarray lines of shape (N, 2, 3). You can think of it as N pairs (start and end of a line) of three-dimensional coordinates. These pairs of vectors can contain duplicates, which should be removed.
points = np.array([[1,1,0],[-1,1,0],[-1,-1,0],[1,-1,0]])
lines = np.dstack([points, np.roll(points, shift=1, axis=0)]) # create point pairs / lines
lines = np.vstack([lines, lines[..., ::-1]]) # add duplicates w/reversed direction
lines = lines.transpose(0,2,1) # change shape from N,3,2 to N,2,3
Since the pair (v1, v2) is not equal to (v2, v1), I am sorting the vectors with lexsort as follows
idx = np.lexsort((lines[..., 0], lines[..., 1], lines[..., 2]))
which gives me an array idx of shape (N, 2) indicating the order along axis 1:
array([[0, 1],
[0, 1],
[1, 0],
[1, 0],
[1, 0],
[1, 0],
[0, 1],
[0, 1]])
However, lines[idx] results in something with shape (N, 2, 2, 3). I had tried all manner of newaxis padding, axis reordering etc. to get broadcasting to work, but everything results in the output having even more dimensions, not less. I also tried lines[:, idx], but this gives (N, N, 2, 3).
Based on https://numpy.org/doc/stable/user/basics.indexing.html#integer-array-indexing
I eventually figured out that, for my concrete problem, I need an additional index array for the first axis:
idx_n = np.arange(len(lines))[:, np.newaxis]
lines[idx_n, idx]
Because lines[:, idx] mixes "advanced" and "simple" indexing, it did not work as I expected.
But is this really the most succinct it can be?
Eventually I found out that what I wanted was
np.take_along_axis(lines, idx[..., np.newaxis] , axis=1)
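
To tie the pieces together, here is a minimal runnable sketch that rebuilds the example data from the question, applies both indexing variants, and confirms they agree. The final `np.unique(..., axis=0)` step is an assumption about how the duplicates might then be removed; the names `by_take`, `by_index` and `unique_lines` are made up for illustration.

```python
import numpy as np

# Rebuild the example data from the question.
points = np.array([[1, 1, 0], [-1, 1, 0], [-1, -1, 0], [1, -1, 0]])
lines = np.dstack([points, np.roll(points, shift=1, axis=0)])
lines = np.vstack([lines, lines[..., ::-1]])  # add reversed duplicates
lines = lines.transpose(0, 2, 1)              # shape (8, 2, 3)

# Order the two endpoints of every line (sorted along the last axis of the keys).
idx = np.lexsort((lines[..., 0], lines[..., 1], lines[..., 2]))  # shape (8, 2)

# Variant 1: take_along_axis, with idx broadcast over the coordinate axis.
by_take = np.take_along_axis(lines, idx[..., np.newaxis], axis=1)

# Variant 2: explicit integer-array indexing on the first two axes.
idx_n = np.arange(len(lines))[:, np.newaxis]
by_index = lines[idx_n, idx]

print(by_take.shape, np.array_equal(by_take, by_index))

# With the endpoints in a canonical order, reversed duplicates become identical.
unique_lines = np.unique(by_take, axis=0)
print(unique_lines.shape)
```

Both variants return an array of shape (N, 2, 3); `take_along_axis` just avoids building the helper index for the first axis by hand.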

Convert multidimensional array elements into same number of arrays

I am working on a computer vision project and get the error 'setting an array element with a sequence' when I try to change the data type of the input image matrix.
I realized this happens because the rows of the input image matrix do not all have the same number of elements. Is there any way to convert that input into a 2D array with the same number of elements in each row?
The error occurs when I execute the following line:
X_train = X_train.astype('float32')
Any help would be appreciated.
Cheers.
You need to pad the rows that have fewer elements with zeros so that their lengths equal the length of the longest row in the list of lists (matrix).
Below is a code snippet that pads a list of lists of unequal lengths into a matrix with equal row lengths:
import numpy as np
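# Keep the ragged data as a plain list of lists: recent NumPy versions
# refuse to build an array from rows of unequal length.
unpadded_matrix = [[1, 2], [3, 4, 5], [6, 7, 8, 9]]
max_len = max(len(row) for row in unpadded_matrix)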
np.array([row + [0]*(max_len-len(row)) for row in unpadded_matrix])
Output:
array([[1, 2, 0, 0],
[3, 4, 5, 0],
[6, 7, 8, 9]])
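
The same idea can be wrapped in a small helper that returns the padded result directly as a float32 array, which is what the failing `astype` call in the question was after. This is a sketch only: the function name `pad_rows` and its parameters are made up for illustration, and it assumes the input is a list of flat numeric rows.

```python
import numpy as np

def pad_rows(rows, fill_value=0, dtype="float32"):
    """Pad ragged rows on the right with fill_value and return a 2D array."""
    max_len = max(len(row) for row in rows)
    padded = np.full((len(rows), max_len), fill_value, dtype=dtype)
    for i, row in enumerate(rows):
        padded[i, :len(row)] = row  # copy the row; the rest keeps fill_value
    return padded

X_train = pad_rows([[1, 2], [3, 4, 5], [6, 7, 8, 9]])
print(X_train.dtype, X_train.shape)  # float32 (3, 4)
print(X_train)
```

Whether zero padding is appropriate for image data depends on the model; cropping or resizing to a common shape may be a better fit in some cases.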

What is the difference between a Categorical Column and a Dense Column?

In TensorFlow, there are 9 different feature columns, arranged into three groups: categorical, dense and hybrid.
From reading the guide, I understand categorical columns are used to represent discrete input data with a numerical value. It gives the example of a categorical column called categorical identity column:
ID Represented using one-hot encoding
0 [1, 0, 0, 0]
1 [0, 1, 0, 0]
2 [0, 0, 1, 0]
3 [0, 0, 0, 1]
But you also have a dense column called indicator column, which 'wraps'(?) a categorical column to produce something that looks almost identical:
Category (from category column) Represented as...
0 [1, 0, 0, 0]
1 [0, 1, 0, 0]
2 [0, 0, 1, 0]
3 [0, 0, 0, 1]
So both 'categorical' and 'dense' columns seem to be able to represent discrete data, and both can use one-hot encoding, so that is not what distinguishes one from the other.
My question is: in principle, what is the difference between a 'categorical column' and a 'dense column'?
I came across this question before finding an answer on the Data Science Stack Exchange, where you can find the original answer.
If I understood it correctly, the answer is simply that a categorical column encodes the data as one-hot, whereas an indicator column encodes it as multi-hot.
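
To make the one-hot versus multi-hot distinction concrete, here is a small NumPy sketch. It only illustrates the two encodings themselves and does not use or reproduce the TensorFlow feature-column API; the helper names `one_hot` and `multi_hot` and the choice of four categories are assumptions for illustration.

```python
import numpy as np

num_categories = 4  # hypothetical vocabulary size, matching the tables above

def one_hot(category_id):
    """Encode a single category ID: exactly one position is set to 1."""
    vec = np.zeros(num_categories, dtype=int)
    vec[category_id] = 1
    return vec

def multi_hot(category_ids):
    """Encode several category IDs for one example: one 1 per ID present."""
    vec = np.zeros(num_categories, dtype=int)
    vec[list(category_ids)] = 1
    return vec

print(one_hot(2))         # [0 0 1 0]
print(multi_hot([0, 2]))  # [1 0 1 0]
print(multi_hot([2]))     # [0 0 1 0], same as one-hot for a single ID
```

With a single ID per example the two encodings coincide, which is why the two tables in the question look identical.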

Bitwise OR along one axis of a NumPy array

For a given NumPy array, it is easy to perform a "normal" sum along one dimension. For example:
X = np.array([[1, 0, 0], [0, 2, 2], [0, 0, 3]])
X.sum(0)
=array([1, 2, 5])
X.sum(1)
=array([1, 4, 3])
Instead, is there an "efficient" way of computing the bitwise OR along one dimension of an array similarly? Something like the following, except without requiring for-loops or nested function calls.
Example: bitwise OR along the zeroth dimension (combining the rows), as I am currently doing it:
np.bitwise_or(np.bitwise_or(X[0],X[1]),X[2])
=array([1, 2, 3])
What I would like:
X.bitwise_sum(0)
=array([1, 2, 3])
Use the reduce method of the numpy.bitwise_or ufunc:
numpy.bitwise_or.reduce(X, axis=whichever_one_you_wanted)
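
A short runnable sketch of the `reduce` approach, using the array from the question plus a second, made-up array `Y` whose two axes give different results (for `X` they happen to coincide):

```python
import numpy as np

X = np.array([[1, 0, 0], [0, 2, 2], [0, 0, 3]])

# OR the rows together (axis 0) and the columns together (axis 1).
or_rows = np.bitwise_or.reduce(X, axis=0)
or_cols = np.bitwise_or.reduce(X, axis=1)
print(or_rows)  # [1 2 3]
print(or_cols)  # [1 2 3]

# Hypothetical second array where the axis choice visibly matters.
Y = np.array([[1, 2], [4, 8]])
print(np.bitwise_or.reduce(Y, axis=0))  # [ 5 10]
print(np.bitwise_or.reduce(Y, axis=1))  # [ 3 12]
```

Every binary ufunc exposes `reduce`, so the same pattern works for `np.bitwise_and` or `np.bitwise_xor`.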

How to get a dense representation of one-hot vectors

Suppose a tensor contains:
[[0 0 1]
[0 1 0]
[1 0 0]]
How can I get the dense representation in a native way (without using NumPy or iteration)?
[2,1,0]
There is tf.one_hot() to do the inverse, there is also tf.sparse_to_dense() that seems to do it but I was not able to figure out how to use it.
tf.argmax(x, axis=1) should do the job.
import tensorflow as tf
vec = tf.constant([[0, 0, 1], [0, 1, 0], [1, 0, 0]])
locations = tf.where(tf.equal(vec, 1))
# This gives the locations of the "1" entries:
# => [[0, 2], [1, 1], [2, 0]]
# strip first column
indices = locations[:,1]
# TensorFlow 1.x session API; with eager execution in 2.x, use print(indices.numpy()) instead
sess = tf.Session()
print(sess.run(indices))
# => [2 1 0]
At the time of writing, TensorFlow did not have a native dense-to-sparse conversion helper (newer releases provide tf.sparse.from_dense). Given that the input array is a dense tensor, such as the one you provided, you can define a function to convert a dense tensor to a sparse tensor.
def dense_to_sparse(dense_tensor):
    # Coordinates of all non-zero entries
    where_dense_non_zero = tf.where(tf.not_equal(dense_tensor, 0))
    indices = where_dense_non_zero
    values = tf.gather_nd(dense_tensor, where_dense_non_zero)
    shape = dense_tensor.get_shape()
    # The keyword is dense_shape in TensorFlow 1.0 and later (it was shape before)
    return tf.SparseTensor(
        indices=indices,
        values=values,
        dense_shape=shape
    )
This helper function finds the indices and values where the tensor is non-zero and outputs a sparse tensor with those indices and values. Additionally, the shape is effectively copied over.
You do not want to use tf.sparse_to_dense as that gives you the opposite representation. If you want your output to be [2, 1, 0] instead, you'll need to index the indices. First, you'll need the indices where the array isn't 0:
indices = tf.where(tf.not_equal(dense_tensor, 0))
Then, you'll need to access the tensor using slicing/indexing:
output = indices[:, 1]
You might notice that the 1 in the slice above is equal to the number of dimensions of the tensor minus 1. Therefore, to make this value generic, you could do something like:
output = indices[:, len(dense_tensor.get_shape()) - 1]
I'm not exactly sure what you'd do with these values (the column index of each non-zero entry), though. Hope this helped!
EDIT: Yaroslav's answer is better if you're looking for the indices/locations where the input tensor is 1; it won't be extensible to tensors with non-1/0 values, if that is required.
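
The difference between the approaches shows up once the rows are no longer strictly one-hot. The sketch below uses NumPy as a stand-in for the TensorFlow ops, on the assumption that `np.argmax` mirrors `tf.argmax` and `np.argwhere` mirrors single-argument `tf.where`; the array `mixed` is made up for illustration.

```python
import numpy as np

one_hot = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0]])

# On strictly one-hot rows the approaches agree.
print(np.argmax(one_hot, axis=1))        # [2 1 0]
print(np.argwhere(one_hot == 1)[:, 1])   # [2 1 0]
print(np.argwhere(one_hot != 0)[:, 1])   # [2 1 0]

# Hypothetical input with a non-0/1 value and two 1s in one row.
mixed = np.array([[0, 5, 1], [1, 0, 1]])
print(np.argmax(mixed, axis=1))          # [1 0]: one index per row (position of the max)
print(np.argwhere(mixed == 1)[:, 1])     # [2 0 2]: column of every entry equal to 1
print(np.argwhere(mixed != 0)[:, 1])     # [1 2 0 2]: column of every non-zero entry
```

So argmax always yields exactly one index per row, while the where-based variants yield one index per matching entry, which only lines up with the rows when the input really is one-hot.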