What is the difference between a Categorical Column and a Dense Column? - tensorflow

In TensorFlow, there are 9 different feature columns, arranged into three groups: categorical, dense and hybrid.
From reading the guide, I understand categorical columns are used to represent discrete input data with a numerical value. The guide gives the example of a categorical column called a categorical identity column:
ID Represented using one-hot encoding
0 [1, 0, 0, 0]
1 [0, 1, 0, 0]
2 [0, 0, 1, 0]
3 [0, 0, 0, 1]
But you also have a dense column called indicator column, which 'wraps'(?) a categorical column to produce something that looks almost identical:
Category (from category column) Represented as...
0 [1, 0, 0, 0]
1 [0, 1, 0, 0]
2 [0, 0, 1, 0]
3 [0, 0, 0, 1]
So both 'categorical' and 'dense' columns seem to be able to represent discrete data, and both can use one-hot encoding, so that's not what distinguishes one from the other.
My question is: in principle, what is the difference between a 'categorical column' and a 'dense column'?

I just came across this before finding an answer on the Data Science StackExchange; you can find the original answer here.
If I understood correctly, the answer is simply that while the categorical column will indeed encode the data as one-hot, the indicator column will encode it as multi-hot.
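For reference, here is a minimal sketch of the two column types using the tf.feature_column API the guide describes (since deprecated in favour of Keras preprocessing layers); the feature key 'id' and the example values are made up for illustration:
import tensorflow as tf

# The categorical column only describes how raw input maps to sparse category
# IDs; it cannot be fed to a DNN model directly.
cat_col = tf.feature_column.categorical_column_with_identity(key='id', num_buckets=4)

# Wrapping it in an indicator column produces the dense tensor a DNN can consume.
ind_col = tf.feature_column.indicator_column(cat_col)

# If a single example carries several categories, e.g. [1, 3], the indicator
# column yields [0, 1, 0, 1] -- multi-hot rather than strictly one-hot.
features = {'id': tf.constant([[1, 3]], dtype=tf.int64)}
dense = tf.keras.layers.DenseFeatures([ind_col])(features)
print(dense.numpy())  # [[0. 1. 0. 1.]]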

Related

multilabel classification with counts

I am trying to train a model for multi-output/multilabel classification where the labels are not binary but counts. Suppose the possible labels are A, B, C, D and E; for example, the y matrix could be
y = [[2, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 3]]
I will have at most 5 labels and the count values could at most be 5 (usually 2). Options:
1. Expand the label space as {A,B,C,D,E} x {1,2,3,4,5}, hence reducing the problem to binary labels. The number of actual combinations may be less than 15 and my datasets have ~100k rows (not too big). Then use CatBoost multi-label binary classification with MultiLogloss.
2. Scale all counts by the maximum count among all labels. So the above would become (dividing all numbers by 3) [[0.66, 0, 0, 0, 0], [0, 0.33, 0.33, 0, 0], [0, 0, 0, 0.33, 0], [0, 0, 0, 0, 1]]. Then use CatBoost MultiLogloss.
Which would be better? Is there any way of training a model using xgboost objective=count:poisson and sklearn.multioutput.MultiOutputClassifier? Sample code for that would really help.
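Not from the original thread, but as a rough sketch of the xgboost route: since Poisson counts are a regression target, this wraps XGBRegressor with objective='count:poisson' in MultiOutputRegressor (rather than MultiOutputClassifier), fitting one booster per label column; the toy data below is made up for illustration.
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# Toy data: 4 samples, 3 features, 5 count-valued labels (A..E)
X = np.random.rand(4, 3)
y = np.array([[2, 0, 0, 0, 0],
              [0, 1, 1, 0, 0],
              [0, 0, 0, 1, 0],
              [0, 0, 0, 0, 3]])

# One Poisson-objective booster is fit per label column
model = MultiOutputRegressor(XGBRegressor(objective='count:poisson', n_estimators=50))
model.fit(X, y)

# Predictions are non-negative Poisson rates; round to recover integer counts
print(np.rint(model.predict(X)))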

How to check if my data is one-hot encoded

If I have a data matrix, how do I check if the categorical variables have been one-hot encoded or not?
I need to use LIME to explain my prediction, and I read that LIME works only if you have category labels instead of one-hot encoded columns.
I found code to convert it, but it works only if the data has already been encoded; otherwise the columns get turned into NaNs.
So I need a piece of code that looks at a numpy array of data and tells me whether it has been one-hot encoded or not.
You can sum along the rows and check whether you get an all-ones array, as in the following example:
Example:
import numpy as np

X = np.array(
    [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0]
    ]
)
# Every row of a one-hot matrix sums to exactly 1
print(f'X is one-hot-encoded: {np.all(X.sum(axis=1) == 1)}')
Result:
X is one-hot-encoded: True
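If you also want to rule out matrices whose rows happen to sum to 1 without being 0/1-valued (e.g. fractional entries), a slightly stricter check, sketched here, verifies the entries as well:
import numpy as np

def is_one_hot(X):
    # Entries must be only 0s and 1s, and every row must contain exactly one 1
    return bool(np.isin(X, [0, 1]).all() and (X.sum(axis=1) == 1).all())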

How do you create a "count" matrix from a series?

I have a Pandas series of lists of categorical variables. For example:
df = pd.Series([["A", "A", "B"], ["A", "C"]])
Note that in my case the series is pretty long (50K elements) and the number of possible distinct elements in the list is also big (20K elements).
I would like to obtain a matrix having a column for each distinct feature and its count as value. For the previous example, this means:
[[2, 1, 0], [1, 0, 1]]
This is the same output as the one obtained with OneHot encoding, except that it contains the count instead of just 1.
What is the best way to achieve this?
Let's try explode:
df.explode().groupby(level=0).value_counts().unstack(fill_value=0)
Output:
A B C
0 2 1 0
1 1 0 1
To get the list of lists, chain the above with .values:
array([[2, 1, 0],
[1, 0, 1]])
Note that you will end up with a 50K x 20K array.
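For reference, the explode approach end to end as a runnable snippet, using the toy series from the question:
import pandas as pd

df = pd.Series([["A", "A", "B"], ["A", "C"]])

# One row per list element, then count occurrences per original row index
counts = df.explode().groupby(level=0).value_counts().unstack(fill_value=0)
print(counts)         # columns A, B, C with the counts shown above
print(counts.values)  # the same counts as a plain NumPy array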

Bitwise OR along one axis of a NumPy array

For a given NumPy array, it is easy to perform a "normal" sum along one dimension. For example:
X = np.array([[1, 0, 0], [0, 2, 2], [0, 0, 3]])
X.sum(0)
=array([1, 2, 5])
X.sum(1)
=array([1, 4, 3])
Instead, is there an "efficient" way of computing the bitwise OR along one dimension of an array similarly? Something like the following, except without requiring for-loops or nested function calls.
Example: bitwise OR along the zeroth dimension as I am currently doing it:
np.bitwise_or(np.bitwise_or(X[:,0],X[:,1]),X[:,2])
=array([1, 2, 3])
What I would like:
X.bitwise_sum(0)
=array([1, 2, 3])
Use the reduce method of the numpy.bitwise_or ufunc:
numpy.bitwise_or.reduce(X, axis=whichever_one_you_wanted)
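To confirm on the question's own example (the nested call above ORs the three columns together within each row, i.e. a reduction along axis 1):
import numpy as np

X = np.array([[1, 0, 0], [0, 2, 2], [0, 0, 3]])

# Equivalent to np.bitwise_or(np.bitwise_or(X[:, 0], X[:, 1]), X[:, 2])
print(np.bitwise_or.reduce(X, axis=1))  # [1 2 3]

# Reduction over rows instead (happens to give [1 2 3] for this X as well)
print(np.bitwise_or.reduce(X, axis=0))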

tensorflow transform a (structured) dense matrix to sparse, when number of rows unknown

My task is to transform a specially formed dense matrix tensor into a sparse one, e.g. an input matrix M as follows (a dense positive-integer sequence followed by 0-padding in each row):
[[3 5 7 0]
[2 2 0 0]
[1 3 9 0]]
Additionally, the non-padding length of each row is given, e.g. by a tensor L =
[3, 2, 3].
The desired output would be sparse tensor S.
SparseTensorValue(indices=array([[0, 0],[0, 1],[0, 2],[1, 0],[1, 1],[2, 0],[2, 1], [2, 2]]), values=array([3, 5, 7, 2, 2, 1, 3, 9], dtype=int32), shape=array([3, 4]))
This is useful in models where objects are described by variable-sized descriptors (S is then used in embedding_lookup_sparse to combine the embeddings of the descriptors).
I am able to do this when the number of M's rows is known (using a Python loop and ops like slice and concat). However, the number of rows here is determined by the mini-batch size and could change (say, in the testing phase). Is there a good way to implement this? I have tried some control_flow_ops but haven't succeeded.
Thanks!!
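One possible approach (not from the original thread): build a boolean mask from L with tf.sequence_mask, which works for a dynamic number of rows, then gather the non-padding indices and values. A minimal sketch, assuming M is an integer matrix and L holds the per-row lengths:
import tensorflow as tf

def dense_padded_to_sparse(M, L):
    # True for non-padding positions; the shape matches M even when the number
    # of rows is only known at run time (e.g. the mini-batch size).
    mask = tf.sequence_mask(L, maxlen=tf.shape(M)[1])
    # Coordinates of the non-padding entries, already in row-major order
    indices = tf.where(mask)
    values = tf.gather_nd(M, indices)
    dense_shape = tf.shape(M, out_type=tf.int64)
    return tf.SparseTensor(indices=indices, values=values, dense_shape=dense_shape)

Applied to the M and L above, this should produce the SparseTensorValue shown in the question, and the result can then be fed to embedding_lookup_sparse.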