multilabel classification with counts - xgboost

I am trying to train a model for multi-output/multilabel classification where the labels are not binary but counts. Suppose the possible labels are A, B, C, D, and E; for example, the y matrix could be
y = [[2, 0, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 0, 1, 0], [0, 0, 0, 0, 3]]
I will have at most 5 labels, and the count values are at most 5 (usually 2). I see two options:
Option 1: Expand the label space to {A,B,C,D,E} x {1,2,3,4,5}, reducing the problem to binary multilabel classification. The number of combinations that actually occur may be fewer than 15, and my dataset has ~100k rows (not too big). Then use CatBoost multilabel binary classification with MultiLogloss.
Option 2: Scale all counts by the maximum count over all labels. The example above would become (dividing every value by 3) [[0.66, 0, 0, 0, 0], [0, 0.33, 0.33, 0, 0], [0, 0, 0, 0.33, 0], [0, 0, 0, 0, 1]]. Then use CatBoost with MultiLogloss.
Which would be better? Is there any way of training a model using xgboost with objective=count:poisson and sklearn.multioutput.MultiOutputClassifier? Sample code for that would really help.
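A note on the last part: count:poisson is a regression objective, so a minimal sketch (assuming the sklearn wrapper is MultiOutputRegressor rather than MultiOutputClassifier, and using placeholder X_train/y_train data) could look like this:

import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from xgboost import XGBRegressor

# Placeholder data: X_train is the feature matrix, y_train the count matrix for A..E
X_train = np.random.rand(1000, 10)
y_train = np.random.randint(0, 4, size=(1000, 5))

# One Poisson-objective XGBoost model is fit per output column
model = MultiOutputRegressor(
    XGBRegressor(objective="count:poisson", n_estimators=200)
)
model.fit(X_train, y_train)

# Predictions are expected counts; round them to recover integer labels
pred_counts = np.rint(model.predict(X_train)).astype(int)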

Related

How to check if my data is one-hot encoded

If I have a data matrix, how do I check if the categorical variables have been one-hot encoded or not?
I need to use LIME to explain my prediction, and I read that LIME works only if you have category labels instead of one-hot encoded columns.
I found code to convert it, but it only works if the data has actually been one-hot encoded; otherwise the columns get turned into NaNs.
So I need a piece of code that looks at a numpy data array and tells me whether it has been one-hot encoded or not.
You can sum along each row and check whether you get an all-ones vector, as in the following example:
Example:
import numpy as np

X = np.array(
    [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 1],
        [0, 1, 0],
        [1, 0, 0],
    ]
)
# Each row of a one-hot matrix sums to exactly 1
print(f'X is one-hot-encoded: {np.all(X.sum(axis=1) == 1)}')
Result:
X is one-hot-encoded: True
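Note that rows summing to 1 is necessary but not sufficient (a row like [0.5, 0.5, 0] also sums to 1). A slightly stricter check (a sketch; is_one_hot is just an illustrative helper) also verifies that every entry is 0 or 1:

import numpy as np

def is_one_hot(X):
    # Every entry must be 0 or 1, and every row must sum to exactly 1
    return np.isin(X, [0, 1]).all() and np.all(X.sum(axis=1) == 1)

print(is_one_hot(np.array([[1, 0, 0], [0, 1, 0]])))   # True
print(is_one_hot(np.array([[0.5, 0.5, 0]])))          # False, even though the row sums to 1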

What is the difference between a Categorical Column and a Dense Column?

In Tensorflow, there are 9 different feature columns, arranged into three groups: categorical, dense and hybrid.
From reading the guide, I understand categorical columns are used to represent discrete input data with a numerical value. It gives the example of a categorical column called a categorical identity column:
ID Represented using one-hot encoding
0 [1, 0, 0, 0]
1 [0, 1, 0, 0]
2 [0, 0, 1, 0]
3 [0, 0, 0, 1]
But you also have a dense column called an indicator column, which 'wraps' a categorical column to produce something that looks almost identical:
Category (from category column) Represented as...
0 [1, 0, 0, 0]
1 [0, 1, 0, 0]
2 [0, 0, 1, 0]
3 [0, 0, 0, 1]
So both 'categorical' and 'dense' columns seem to be able to represent discrete data, and both can use one-hot encoding, so that's not what distinguishes one from the other.
My question is: In principle, what are the difference between a 'categorical column' and a 'dense column'?
I just came across this question before finding an answer on the Data Science StackExchange;
you can find the original answer here.
If I understood correctly, the answer is simply that while the categorical column will indeed encode the data as one-hot, the indicator_column will encode it as multi-hot.
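To make the 'wrapping' concrete, here is a minimal sketch using the tf.feature_column API the question refers to (the key name category_id is just an illustrative choice):

import tensorflow as tf

# A categorical column only describes how raw input maps to sparse ids;
# it cannot be fed into a dense (DNN) model directly.
cat_col = tf.feature_column.categorical_column_with_identity(
    key="category_id", num_buckets=4)

# Wrapping it in an indicator_column yields a dense one-/multi-hot tensor
# that dense models such as DNNClassifier can consume.
dense_col = tf.feature_column.indicator_column(cat_col)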

what is the use of reduce command in tensorflow?

tensorflow.reduce_sum(...) computes the sum of elements across dimensions of a tensor; that part is clear.
But one thing is not clear to me: what is the purpose of the word reduce in the function name?
Is it related to the map-reduce pattern of parallel computation? That is, does it distribute the required computation to different cores, collect the results from the cores, and finally deliver the sum of the collected results?
Because you can compute the sum along a given dimension (and thereby reduce that dimension away). And no, it has nothing to do with map-reduce.
Quoting the documentation string of the method:
Reduces input_tensor along the dimensions given in axis. Unless keepdims is true, the rank of the tensor is reduced by 1 for each entry in axis. If keepdims is true, the reduced dimensions are retained with length 1.
Example from the API:
import tensorflow as tf

x = tf.constant([[1, 1, 1], [1, 1, 1]])
tf.reduce_sum(x) # 6
tf.reduce_sum(x, 0) # [2, 2, 2]
tf.reduce_sum(x, 1) # [3, 3]
tf.reduce_sum(x, 1, keepdims=True) # [[3], [3]]
tf.reduce_sum(x, [0, 1]) # 6
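The "reduce" refers to this rank reduction: summing over an axis removes that axis unless keepdims is set. A quick way to see it (a sketch, assuming TF 2.x eager execution) is to compare shapes:

import tensorflow as tf

x = tf.constant([[1, 1, 1], [1, 1, 1]])
print(x.shape)                                    # (2, 3)
print(tf.reduce_sum(x, 0).shape)                  # (3,)   -- axis 0 has been reduced away
print(tf.reduce_sum(x, 1, keepdims=True).shape)   # (2, 1) -- axis kept, but with length 1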

How to design the label for tensorflow's ctc loss layer

I just started using the CTC loss layer in TensorFlow (r1.0) and got a little confused about the "labels" input.
In tensorflow's API document, it says
labels: An int32 SparseTensor. labels.indices[i, :] == [b, t] means labels.values[i] stores the id for (batch b, time t). labels.values[i] must take on values in [0, num_labels)
Does [b, t] and values[i] mean there is a label values[i] at time t of sequence b in the batch?
It says the values must be in [0, num_labels), but for a sparse tensor almost everything is 0 except for some specified places, so I don't really know what the sparse tensor for CTC should look like.
For example, if I have a short video of a hand gesture and it has the label "1", should I label the output of all timesteps as "1", or only label the last timestep as "1" and treat the others as "blank"?
thanks!
To address your questions:
1. The notation in the documentation here seems a bit misleading, as the output label index t need not be the same as the input time slice; it's simply the index into the output sequence. A different letter could be used because the input and output sequences are not explicitly aligned. Otherwise, your assertion seems correct. I give an example below.
2. Zero is a valid class in your sequence output label. The so-called blank label in TensorFlow's CTC implementation is the last (largest) class, which should probably not be in your ground truth labels anyhow. So if you were writing a binary sequence classifier, you'd have three classes, 0 (say "off"), 1 ("on") and 2 ("blank" output of CTC).
3. CTC loss is for labeling sequence input with sequence output. If you only have a single class label output for the sequence input, you're probably better off using a softmax cross-entropy loss on the output of the last time step of the RNN cell.
If you do end up using CTC loss, you can see how I've constructed the training sequence through a reader here: How to generate/read sparse sequence labels for CTC loss within Tensorflow?.
As an example, after I batch two examples that have label sequences [44, 45, 26, 45, 46, 44, 30, 44] and [5, 8, 17, 4, 18, 19, 14, 17, 12], respectively, I get the following result from evaluating the (batched) SparseTensor:
SparseTensorValue(indices=array([[0, 0],
[0, 1],
[0, 2],
[0, 3],
[0, 4],
[0, 5],
[0, 6],
[0, 7],
[1, 0],
[1, 1],
[1, 2],
[1, 3],
[1, 4],
[1, 5],
[1, 6],
[1, 7],
[1, 8]]), values=array([44, 45, 26, 45, 46, 44, 30, 44, 5, 8, 17, 4, 18, 19, 14, 17, 12], dtype=int32), dense_shape=array([2, 9]))
Notice how the rows of the indices in the sparse tensor value correspond to the batch number and the columns correspond to the sequence index for that particular label. The values themselves are the sequence label classes. The rank is 2 and the size of the last dimension (nine in this case) is the length of the longest sequence.
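For completeness, here is a minimal sketch of how such a batched sparse label tensor could be built from plain Python label lists (sparse_from_label_sequences is a hypothetical helper, not the code from the linked answer):

import numpy as np
import tensorflow as tf

def sparse_from_label_sequences(sequences):
    # Collect one (batch, position) index per label value
    indices, values = [], []
    for b, seq in enumerate(sequences):
        for t, label in enumerate(seq):
            indices.append([b, t])
            values.append(label)
    dense_shape = [len(sequences), max(len(s) for s in sequences)]
    return (np.array(indices, dtype=np.int64),
            np.array(values, dtype=np.int32),
            np.array(dense_shape, dtype=np.int64))

indices, values, dense_shape = sparse_from_label_sequences(
    [[44, 45, 26, 45, 46, 44, 30, 44],
     [5, 8, 17, 4, 18, 19, 14, 17, 12]])
labels = tf.SparseTensor(indices, values, dense_shape)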

tensorflow: transform a (structured) dense matrix to sparse, when the number of rows is unknown

My task is to transform a specially formed dense matrix tensor into a sparse one, e.g. the input matrix M as follows (each row is a dense positive integer sequence followed by 0 padding):
[[3 5 7 0]
[2 2 0 0]
[1 3 9 0]]
Additionally, the non-padding length of each row is given, e.g. by the tensor L =
[3, 2, 3].
The desired output would be the sparse tensor S:
SparseTensorValue(indices=array([[0, 0],[0, 1],[0, 2],[1, 0],[1, 1],[2, 0],[2, 1], [2, 2]]), values=array([3, 5, 7, 2, 2, 1, 3, 9], dtype=int32), shape=array([3, 4]))
This is useful in models where objects are described by variable-sized descriptors (S is then used in embedding_lookup_sparse to combine the embeddings of the descriptors).
I am able to do it when the number of M's rows is known (with a Python loop and ops like slice and concat). However, the number of rows here is determined by the mini-batch size and could change (say, in the testing phase). Is there a good way to implement that? I have tried some control_flow_ops but haven't succeeded.
Thanks!!
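One possible approach (a sketch, not a tested solution): build a boolean mask from the per-row lengths with tf.sequence_mask, then gather the non-padding entries; since the shape is read from the tensor at runtime, the number of rows does not need to be known in advance.

import tensorflow as tf

M = tf.constant([[3, 5, 7, 0],
                 [2, 2, 0, 0],
                 [1, 3, 9, 0]], dtype=tf.int32)
L = tf.constant([3, 2, 3])

mask = tf.sequence_mask(L, maxlen=tf.shape(M)[1])        # True for non-padding cells
indices = tf.where(mask)                                  # [N, 2] int64 coordinates
values = tf.gather_nd(M, indices)                         # the non-padding values
S = tf.SparseTensor(indices, values, tf.shape(M, out_type=tf.int64))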