TensorFlow modified one-hot encode

I am trying to tweak the loss function a little bit, as shown below.
Ants, Cockroach, Horse, Pig
[1, 0, 0, 0]
[0, 1, 0, 0]
[0, 0, 1, 0]
[0, 0, 0, 1]
And of course it will be used in tensorflow code like below:
tf.nn.softmax_cross_entropy_with_logits(labels=self.y, logits=self.logits)
I was thinking that, since ants and cockroaches are insects while horses and pigs are mammals, I would like to add some extra weight to the related class, for example:
Ants: [1, 0.5, 0, 0]
Cockroach: [0.5, 1, 0, 0]
Horse: [0, 0, 1, 0.5]
Pig: [0, 0, 0.5, 1]
I thought this would encode some relationship between the labels during backpropagation of the loss function. However, I am not sure whether
tf.nn.softmax_cross_entropy_with_logits
will actually use the additional information I added to the one-hot encoding (which is not one-hot anymore). I tried to follow the TensorFlow source code, but I am still not sure.
If it is not applied as I intended, how can I make this information take part in backpropagation? Will TensorFlow allow it to be used this way?
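To make the question concrete, here is a minimal sketch (TF 2.x eager style, with made-up logits) of what I am comparing: the loss returned by the API versus the plain cross-entropy formula applied to my soft labels.
import tensorflow as tf
# Hypothetical soft labels and logits, purely for illustration.
soft_labels = tf.constant([[1.0, 0.5, 0.0, 0.0]])  # "Ants", with extra weight on "Cockroach"
logits = tf.constant([[2.0, 1.0, 0.1, 0.1]])
# Loss as computed by the API.
api_loss = tf.nn.softmax_cross_entropy_with_logits(labels=soft_labels, logits=logits)
# The plain cross-entropy formula, -sum(labels * log(softmax(logits))), written out by hand.
# If the two match, every non-zero label entry contributes to the loss (and to its gradient).
manual_loss = -tf.reduce_sum(soft_labels * tf.nn.log_softmax(logits), axis=-1)
print(api_loss.numpy(), manual_loss.numpy())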

Related

How to reliably create a multi-dimensional array and a one-dimensional view of it in numpy, so that the memory layout is contiguous?

According to the documentation of numpy.ravel,
Return a contiguous flattened array.
A 1-D array, containing the elements of the input, is returned. A copy is made only if needed.
For convenience and efficiency of indexing, I would like to have a one-dimensional view of a 2-dimensional array. I am using ravel for creating the view, and so far so good.
However, it is not clear to me what is meant by "A copy is made only if needed." If some day a copy is created while my code is executed, the code will stop working.
I know that there is numpy.reshape, but its documentation says:
It is not always possible to change the shape of an array without copying the data.
In any case, I would like the data to be contiguous.
How can I reliably create a 2-dimensional array and a 1-dimensional view into it? I would like the data to be contiguous in memory (for efficiency). Are there any attributes to specify when creating the 2-dimensional array to ensure that it is contiguous and ravel will not need to copy it?
Related question: What is the difference between flatten and ravel functions in numpy?
The warnings for ravel and reshape are the same. ravel is just reshape(-1), i.e. a reshape to 1d. Conversely, the reshape docs tell us that we can think of reshape as first doing a ravel.
Normal array construction produces a contiguous array, and reshape with the same order will produce a view. You can visually test that by looking at the ravel and checking if the values appear in the expected order.
In [348]: x = np.arange(6).reshape(2,3)
In [349]: x
Out[349]:
array([[0, 1, 2],
       [3, 4, 5]])
In [350]: x.ravel()
Out[350]: array([0, 1, 2, 3, 4, 5])
I started with the arange, reshaped it to 2d, and back to 1d. No change in order.
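Besides eyeballing the order, you can check programmatically; a small sketch (np.shares_memory is not part of the original answer):
import numpy as np
x = np.arange(6).reshape(2, 3)
# The reshaped arange is still C-contiguous...
print(x.flags['C_CONTIGUOUS'])         # True
# ...and its ravel shares memory with it, i.e. it is a view, not a copy.
print(np.shares_memory(x, x.ravel()))  # True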
But if I make a sliced view:
In [351]: x[:,:2]
Out[351]:
array([[0, 1],
       [3, 4]])
In [352]: x[:,:2].ravel()
Out[352]: array([0, 1, 3, 4])
This ravel has a gap, and thus is a copy.
A transpose is also a view, but one that cannot be raveled into a view (in the default C order):
In [353]: x.T
Out[353]:
array([[0, 3],
       [1, 4],
       [2, 5]])
In [354]: x.T.ravel()
Out[354]: array([0, 3, 1, 4, 2, 5])
Except, if we specify the right order, the ravel is a view.
In [355]: x.T.ravel(order='F')
Out[355]: array([0, 1, 2, 3, 4, 5])
The reshape docs have an extensive discussion of order. And transpose actually works by returning a view with different shape and strides. For a 2d array, transpose produces an order-F array.
So as long as you are aware of manipulations like this, you can safely assume that the reshape/ravel is contiguous.
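If you want to be defensive about it, np.ascontiguousarray returns the array itself when it is already C-contiguous and makes a contiguous copy otherwise; a small sketch (not part of the original answer):
import numpy as np
x = np.arange(6).reshape(2, 3)
# Already C-contiguous: the very same object comes back, no copy is made.
y = np.ascontiguousarray(x)
print(y is x)                    # True
# A transposed view is not C-contiguous, so a contiguous copy is returned.
z = np.ascontiguousarray(x.T)
print(np.shares_memory(z, x))    # False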
Note that even though the ravel of such a slice is a copy, assignment via .flat changes the original:
In [361]: x[:,:2].flat[:] = [3,4,2,1]
In [362]: x
Out[362]:
array([[3, 4, 2],
       [2, 1, 5]])
x[:,:2].ravel()[:] = [10,11,2,3] does not change x. In cases like this y = x[:,:2].flat may be more useful than the ravel equivalent.

About get_special_tokens_mask in huggingface-transformers

I am using the transformers tokenizer, and created a mask using the API get_special_tokens_mask.
According to the RoBERTa docs, the return value of this API is "A list of integers in the range [0, 1]: 0 for a special token, 1 for a sequence token". But it seems to me that this API returns 0 for a sequence token and 1 for a special token.
Is that right?
You are indeed correct. I tested this for both transformers 2.7 and the (at the time of writing) current release, 2.9, and in both cases I get the inverted result (0 for regular tokens and 1 for the special tokens).
For reference, this is how I tested it:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
sentence = "This is a special sentence."
encoded_sentence = tokenizer.encode(sentence)
# [0, 152, 16, 10, 780, 3645, 4, 2]
special_masks = tokenizer.get_special_tokens_mask(encoded_sentence)
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
I would suggest you report this issue in their repository, or ideally provide a pull request yourself to fix the issue ;-)
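As a side note (not part of the original answer): the mask above has two more entries than the encoded sentence because, by default, get_special_tokens_mask assumes the ids passed in do not yet contain special tokens. If your ids already include <s> and </s>, you can pass already_has_special_tokens=True, which keeps the mask aligned with the input; the 0/1 meaning stays the same (1 for special tokens).
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
ids = tokenizer.encode("This is a special sentence.")
# The ids already contain <s> and </s>, so tell the tokenizer not to account for adding them again.
mask = tokenizer.get_special_tokens_mask(ids, already_has_special_tokens=True)
print(len(mask) == len(ids))  # True: one mask entry per token id
print(mask)                   # 1 at the <s>/</s> positions, 0 elsewhere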

keras.preprocessing.text.Tokenizer equivalent in Pytorch?

Basically the title; is there any equivalent to keras.preprocessing.text.Tokenizer in PyTorch? I have yet to find anything that gives all the same utilities without handcrafting things.
I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()
BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
It comes with other types of encoders, such as spaCy's tokenizer, subword encoder, etc.
PyTorch itself does not provide a function like this; you need to do it manually (which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices), for example as sketched below.
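A minimal sketch of that manual route (made-up corpus, whitespace tokenization, index 0 reserved for padding):
import torch
corpus = ["now this ain't funny", "so don't you dare laugh"]
# 1. Tokenize (here: a simple whitespace split).
tokenized = [sentence.split() for sentence in corpus]
# 2. Build the vocabulary, reserving index 0 for padding.
vocab = {"<pad>": 0}
for tokens in tokenized:
    for token in tokens:
        vocab.setdefault(token, len(vocab))
# 3. Dictionary lookup plus padding to a common length.
max_len = max(len(tokens) for tokens in tokenized)
indices = [[vocab[token] for token in tokens] + [0] * (max_len - len(tokens))
           for tokens in tokenized]
batch = torch.tensor(indices)  # shape: (2, max_len)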
Alternatively, you can use Torchtext, which provides basic abstractions for text processing. All you need to do is create a Field object. You can use string.split, SpaCy, or a custom function for tokenization. You can provide a vocabulary or build it directly from your data. Then you just call the process method, which tokenizes the text and does the vocabulary lookup.
If you want something more complex, you might also consider AllenNLP, where the tokenization and the vocabulary lookup are done as separate steps.

How do I resolve one hot encoding if my test data has missing values in a col?

For example, if my training data has the categorical values (1, 2, 3, 4, 5) in a column, then one-hot encoding will give me 5 columns. But in the test data I have, say, only 4 out of the 5 values, i.e. (1, 3, 4, 5), so one-hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error because the column dimensions of the train and test data do not match, dim(4) != dim(5). Any suggestions on what to do with the missing column values?
Please do not make this mistake!
Yes, you can do this hack of concatenating train and test and fool yourself, but the real problem is in production: there your model will someday face an unknown level of your categorical variable and then break (see the sketch after the list below).
In reality, some of the more viable options could be:
Retrain your model periodically to account for new data.
Do not use one hot. Seriously, there are many better options, like leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), and target encoding, to name a few. Some classifiers like CatBoost even have a built-in mechanism for encoding, and there are mature libraries like target_encoders in Python, where you will find lots of other options.
Embed categorical features; this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/).
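For completeness, and not one of the options listed above: if you do stay with one-hot encoding, scikit-learn's OneHotEncoder can be fitted on the training column alone and told to ignore unseen levels, which keeps the number of columns fixed and avoids the production failure described above. A rough sketch with made-up data:
import numpy as np
from sklearn.preprocessing import OneHotEncoder
X_train = np.array([[4], [2], [3], [5], [3], [1]])  # all 5 levels appear in training
X_test = np.array([[4], [5], [1], [3]])             # level 2 is missing here
# Fit on the training column only; unseen test levels become all-zero rows.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)
train_ohe = enc.transform(X_train).toarray()  # shape (6, 5)
test_ohe = enc.transform(X_test).toarray()    # shape (4, 5): same 5 columns as training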
You can first combine the two dataframes, then call get_dummies, and then split them again so they have the exact same number of columns, i.e.
#Example Dataframes
Xtrain = pd.DataFrame({'x':np.array([4,2,3,5,3,1])})
Xtest = pd.DataFrame({'x':np.array([4,5,1,3])})
# Concat with keys then get dummies
temp = pd.get_dummies(pd.concat([Xtrain,Xtest],keys=[0,1]), columns=['x'])
# Selecting data from multi index and assigning them i.e
Xtrain,Xtest = temp.xs(0),temp.xs(1)
# Xtrain.as_matrix()
# array([[0, 0, 0, 1, 0],
#        [0, 1, 0, 0, 0],
#        [0, 0, 1, 0, 0],
#        [0, 0, 0, 0, 1],
#        [0, 0, 1, 0, 0],
#        [1, 0, 0, 0, 0]], dtype=uint8)
# Xtest.as_matrix()
# array([[0, 0, 0, 1, 0],
#        [0, 0, 0, 0, 1],
#        [1, 0, 0, 0, 0],
#        [0, 0, 1, 0, 0]], dtype=uint8)
Do not follow this approach. It is a simple trick with a lot of disadvantages; @Vast Academician's answer explains this better.
Use dummy (binary) encoding instead of one-hot encoding. Pandas pd.get_dummies() with drop_first=True creates a dummy encoding, i.e. k-1 dummies out of k categorical levels, by removing the first level. The default option drop_first=False creates a one-hot encoding.
See the official pandas documentation.
Dummy (binary) encoding also creates fewer columns.
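To illustrate the difference on a toy column (not from the original answer):
import pandas as pd
s = pd.Series(list("abca"), name="col")
pd.get_dummies(s)                   # one-hot: columns a, b, c (k columns)
pd.get_dummies(s, drop_first=True)  # dummy coding: columns b, c (k-1 columns)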

Using Index Arrays on Columns of an MXNet NDArray

Given an index array index and, say, a matrix A, I want a matrix B with the corresponding permutation of the columns of A.
In NumPy I would do the following:
>>> A = np.arange(6).reshape(2,3); A
array([[0, 1, 2],
       [3, 4, 5]])
>>> index = [2,0,1]
>>> A[:,index]
array([[2, 0, 1],
       [5, 3, 4]])
Is there a natural or efficient way to do this in MXNet? The functions pick() and take() don't seem to work in this way. I managed to come up with the following but it's not elegant.
>>> mx.nd.take(A.T, mx.nd.array([[2],[0],[1]])).T.reshape((2,3))
[[ 2.  0.  1.]
 [ 5.  3.  4.]]
<NDArray 2x3 @cpu(0)>
Finally, to throw a wrench into the works, is there a way to do this in-place?
Update: Here is a slightly more elegant, but presumably less efficient (due to the transposition), version of the above:
>>> mx.nd.take(A.T, mx.nd.array([2,0,1])).T
[[ 2.  0.  1.]
 [ 5.  3.  4.]]
<NDArray 2x3 @cpu(0)>
What you need is the so-called advanced indexing in MXNet. There is a PR submitted for getting elements from an MXNet NDArray through advanced indexing, and it will add the functionality of setting elements as well. It is expected to come out in release 1.0.
https://github.com/apache/incubator-mxnet/pull/8246
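For completeness, here is a sketch of a transpose-free alternative, assuming an MXNet version in which nd.take accepts an axis argument (check the documentation of your version):
import mxnet as mx
A = mx.nd.arange(6).reshape((2, 3))
index = mx.nd.array([2, 0, 1])
# Pick columns (axis=1) in the order given by index; no transposes needed.
B = mx.nd.take(A, index, axis=1)
# [[2. 0. 1.]
#  [5. 3. 4.]]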