About get_special_tokens_mask in huggingface-transformers - tokenize

I use the transformers tokenizer and created a mask using the get_special_tokens_mask API.
According to the RoBERTa documentation, this API returns "A list of integers in the range [0, 1]: 0 for a special token, 1 for a sequence token". However, it looks to me like the API actually returns 0 for a sequence token and 1 for a special token.
Is that right?

You are indeed correct. I tested this with both transformers 2.7 and the (at the time of writing) current release, 2.9, and in both cases I get the inverted result: 0 for regular tokens and 1 for the special tokens.
For reference, this is how I tested it:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
sentence = "This is a special sentence."
encoded_sentence = tokenizer.encode(sentence)
# [0, 152, 16, 10, 780, 3645, 4, 2]
special_masks = tokenizer.get_special_tokens_mask(encoded_sentence)
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
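For what it's worth, the mask has 10 entries for 8 input ids because the default already_has_special_tokens=False assumes the ids do not yet contain the special tokens and builds the mask for the sequence they would be added to. Passing already_has_special_tokens=True should give a mask aligned with the input ids, and the 1s still mark the special tokens:
special_masks = tokenizer.get_special_tokens_mask(encoded_sentence, already_has_special_tokens=True)
# expected: [1, 0, 0, 0, 0, 0, 0, 1]  (1s on <s> and </s>)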
I would suggest reporting this issue in their repository, or ideally opening a pull request yourself to fix it ;-)

Related

Column missing when trying to open hdf created by pandas in h5py

This is what my dataframe looks like. The first column is a single int. The second column is a single list of 512 ints.
IndexID Ids
1899317 [0, 47715, 1757, 9, 38994, 230, 12, 241, 12228...
22861131 [0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...
2163410 [0, 26039, 41156, 227, 860, 3320, 6673, 260, 1...
15760716 [0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
12244098 [0, 45651, 4128, 227, 5, 10397, 995, 731, 9, 3...
I saved it to hdf and tried opening it using
df.to_hdf('test.h5', key='df', data_columns=True)
h3 = h5py.File('test.h5', 'r')
I see 4 keys when I list the keys
h3['df'].keys()
KeysViewHDF5 ['axis0', 'axis1', 'block0_items', 'block0_values']
axis1 seems to contain the values of the first column:
h3['df']['axis1'][0:5]
array([ 1899317, 22861131,  2163410, 15760716, 12244098])
However, there doesn't seem to be any data from the second column. There is another key, block0_values, that does contain data:
h3['df']['block0_values'][0][0:5]
array([128, 4, 149, 1, 0], dtype=uint8)
But that doesn't seem to correspond to any of the data in the second column.
Purpose
I am eventually trying to create a memory-mapped datastore that retrieves data using particular indices.
So something like
h3['df']['workingIndex'][22861131, 15760716]
would retrieve
[0, 48156, 154, 6304, 43611, 11, 9496, 8982, 1...],
[0, 40883, 4086, 11, 5, 18559, 1923, 1494, 4, ...
The problem is you're trying to serialize a Pandas Series of Python lists and it is not rectangular (it is jagged).
Pandas and HDF5 are largely used for rectangular (cube, hypercube, etc) data, not for jagged lists-of-lists.
Did you see this warning when you called to_hdf()?
PerformanceWarning:
your performance may suffer as PyTables will pickle object types that it cannot
map directly to c-types [inferred_type->mixed,key->block0_values] [items->['Ids']]
What it's trying to tell you is that lists-of-lists are not supported in an intuitive, high-performance way. And if you run an HDF5 visualization tool like h5dump on your output file, you'll see what's wrong. The index (which is well-behaved) looks like this:
DATASET "axis1" {
DATATYPE H5T_STD_I64LE
DATASPACE SIMPLE { ( 5 ) / ( 5 ) }
DATA {
(0): 1899317, 22861131, 2163410, 15760716, 12244098
}
ATTRIBUTE "CLASS" {
DATA {
(0): "ARRAY"
}
}
But the values (lists of lists) look like this:
DATASET "block0_values" {
DATATYPE H5T_VLEN { H5T_STD_U8LE}
DATASPACE SIMPLE { ( 1 ) / ( H5S_UNLIMITED ) }
DATA {
(0): (128, 5, 149, 164, ...)
}
ATTRIBUTE "CLASS" {
DATA {
(0): "VLARRAY"
}
}
ATTRIBUTE "PSEUDOATOM" {
DATA {
(0): "object"
}
}
What's happening is exactly what the PerformanceWarning warned you about:
> PyTables will pickle object types that it cannot map directly to c-types
Your list-of-lists is being pickled and stored as H5T_VLEN which is just a blob of bytes.
Here are some ways you could fix this:
Store each row under a separate key in HDF5. That is, each list will be stored as an array, and they can all have different lengths. This is no problem with HDF5, because it supports any number of keys in one file.
Change your data to be rectangular, e.g. by padding the shorter lists with zeros. See: Pandas split column of lists into multiple columns
Use h5py to write the data in whatever format you like. It's much more flexible and creates simpler (and yet more powerful) HDF5 files than Pandas/PyTables. Here's one example (which shows h5py can actually store jagged arrays, though it's not pretty): Storing multidimensional variable length array with h5py
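As a rough sketch of that last option (writing with h5py directly), assuming each list in the Ids column really is 512 ints long (pad them first if not, as in the second option above): stack the column into a rectangular array and write it with plain h5py, which also gives you the index-based lookup described under "Purpose". The dataframe is assumed to be named df, and the file/dataset names here are made up:
import numpy as np
import h5py

# Stack the column of lists into a rectangular (n_rows, 512) array.
# If some lists are shorter, pad them first or this will fail.
ids = np.stack(df['Ids'].to_numpy())
index = df['IndexID'].to_numpy()

with h5py.File('ids.h5', 'w') as f:
    f.create_dataset('IndexID', data=index)
    f.create_dataset('Ids', data=ids)

# Later: look up rows by their IndexID values.
with h5py.File('ids.h5', 'r') as f:
    idx = f['IndexID'][:]
    positions = np.sort(np.flatnonzero(np.isin(idx, [22861131, 15760716])))
    rows = f['Ids'][positions]   # h5py fancy indexing wants increasing positions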

keras.preprocessing.text.Tokenizer equivalent in Pytorch?

Basically the title; is there any equivalent to keras.preprocessing.text.Tokenizer in PyTorch? I have yet to find one that gives all the utilities without handcrafting things.
I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()
BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
It comes with other types of encoders, such as spaCy's tokenizer, subword encoder, etc.
PyTorch itself does not provide a function like this; you need to do it manually (which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices).
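A minimal sketch of that manual route (assuming whitespace tokenization, with 0 reserved for padding and 1 for unknown tokens):
from collections import Counter

# Build a vocabulary from the training texts, then map tokens to indices.
texts = ["now this ain't funny", "so don't you dare laugh"]
counter = Counter(tok for text in texts for tok in text.split())
vocab = {tok: i + 2 for i, (tok, _) in enumerate(counter.most_common())}  # 0 = pad, 1 = unk

def encode(text):
    return [vocab.get(tok, 1) for tok in text.split()]

print(encode("now you laugh"))   # e.g. [2, 8, 10] with this toy vocabulary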
Alternatively, you can use Torchtext, which provides basic abstractions for text processing. All you need to do is create a Field object. You can use string.split, spaCy, or a custom function for tokenization. You can provide a vocabulary or build it directly from data. Then you just call the process method, which tokenizes the text and does the vocabulary lookup.
If you want something more complex, you might also consider AllenNLP. In AllenNLP, the tokenization and the vocabulary lookup are done separately.

Tensorflow modified onehot encode

I am trying to tweak the loss function a little bit, as shown below.
Ants, Cockroach, Horse, Pig
[1, 0, 0, 0]
[0, 1, 0, 0]
[0, 0, 0, 1]
[0, 0, 0, 1]
And of course it will be used in tensorflow code like below:
tf.nn.softmax_cross_entropy_with_logits(labels=self.y, logits=self.logits)
I was thinking that, since ants and cockroaches are insects while horses and pigs are Mammalia, I would like to add some points to it, for example:
Ants: [1, 0.5, 0, 0]
Cockroach: [0.5, 1, 0, 0]
Horse: [0, 0, 1, 0.5]
Pig: [0, 0, 0.5, 1]
I thought this would add some relatedness (or relationship) to the other label(s) in the back-propagation of the loss function. However, I am not sure whether
tf.nn.softmax_cross_entropy_with_logits
will use the additional information I added to the one-hot encoding (which is not one-hot anymore). I tried to follow the TensorFlow code but I am still not sure.
If it is not applied as I intended, then how can I apply this so that it is used in back-propagation? Will TensorFlow allow it to be used in this way?
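For reference, a small sketch (TensorFlow 2.x, eager mode; the numbers are made up) that compares the op against the formula it implements, -sum(labels * log_softmax(logits)), which suggests the extra 0.5 entries do enter the loss and hence the gradients. Note that the documentation expects each labels row to be a valid probability distribution, so you may want to normalize the modified vectors:
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1, 0.1]])
labels = tf.constant([[1.0, 0.5, 0.0, 0.0]])   # the modified "Ants" label from above

loss_op = tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
manual  = -tf.reduce_sum(labels * tf.nn.log_softmax(logits), axis=-1)
print(loss_op.numpy(), manual.numpy())         # the two values should match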

How do I resolve one hot encoding if my test data has missing values in a col?

For example, if my training data has the categorical values (1, 2, 3, 4, 5) in a column, then one-hot encoding will give me 5 columns. But in the test data I have, say, only 4 out of the 5 values, i.e. (1, 3, 4, 5), so one-hot encoding will give me only 4 columns. Therefore, if I apply my trained weights to the test data, I will get an error because the column dimensions of the train and test data do not match, dim(4) != dim(5). Any suggestions on what to do with the missing column values?
Please don't make this mistake!
Yes, you can do this hack with the concatenation of train and test and fool yourself, but the real problem is in production: there, your model will someday face an unknown level of your categorical variable and then break.
In reality, some of the more viable options could be:
Retrain your model periodically to account for new data.
Do not use one-hot. Seriously, there are many better options, like leave-one-out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154), conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), or target encoding (a toy sketch follows after this list), to name a few. Some classifiers like CatBoost even have a built-in mechanism for encoding, and there are mature libraries like category_encoders in Python, where you will find lots of other options.
Embed categorical features and this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/)
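As a toy illustration of the target (mean) encoding idea from the list above (the data and column names are made up; a real implementation needs smoothing and/or cross-validation to avoid target leakage):
import pandas as pd

# 'x' is the categorical feature, 'y' the binary target.
train = pd.DataFrame({'x': [1, 2, 3, 4, 5, 3], 'y': [0, 1, 1, 0, 1, 1]})
test  = pd.DataFrame({'x': [1, 3, 4, 5, 6]})          # 6 was never seen in training

global_mean = train['y'].mean()
means = train.groupby('x')['y'].mean()                # per-category target mean

train['x_te'] = train['x'].map(means)
test['x_te']  = test['x'].map(means).fillna(global_mean)   # unseen levels fall back to the global mean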
You can first combine the two dataframes, then call get_dummies, then split them again so they have the exact same number of columns, i.e.
#Example Dataframes
Xtrain = pd.DataFrame({'x':np.array([4,2,3,5,3,1])})
Xtest = pd.DataFrame({'x':np.array([4,5,1,3])})
# Concat with keys then get dummies
temp = pd.get_dummies(pd.concat([Xtrain,Xtest],keys=[0,1]), columns=['x'])
# Selecting data from multi index and assigning them i.e
Xtrain,Xtest = temp.xs(0),temp.xs(1)
# Xtrain.to_numpy()
# array([[0, 0, 0, 1, 0],
# [0, 1, 0, 0, 0],
# [0, 0, 1, 0, 0],
# [0, 0, 0, 0, 1],
# [0, 0, 1, 0, 0],
# [1, 0, 0, 0, 0]], dtype=uint8)
# Xtest.to_numpy()
# array([[0, 0, 0, 1, 0],
# [0, 0, 0, 0, 1],
# [1, 0, 0, 0, 0],
# [0, 0, 1, 0, 0]], dtype=uint8)
Do not follow this approach. It's a simple trick with a lot of disadvantages; Vast Academician's answer above explains this better.
Use dummy (binary) encoding instead of one-hot encoding. Pandas pd.get_dummies() with drop_first=True creates a dummy encoding, i.e. k-1 dummies out of k categorical levels by removing the first level. The default, drop_first=False, creates the usual one-hot encoding.
See the pandas official documentation.
Dummy (binary) encoding also creates fewer columns.
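For reference, a small illustration of the drop_first behaviour described above, on toy data:
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], name='x')

print(pd.get_dummies(s, prefix='x'))                    # one-hot: 5 columns x_1..x_5
print(pd.get_dummies(s, prefix='x', drop_first=True))   # dummy coding: 4 columns x_2..x_5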

Can you name this data serialization format?

OK, so an API I'm using spits out data that looks like this:
0=0\160\128\0\0\0\205\0
50=16\16\128\32\0\0\48\128
100=100\80\128\100\16\48\32\16
192=0\200\137\84\25\1\0\18
It's basically a key-value array, where the value is an array of values. So, in JSON, the first two lines might look like this:
{
  "0": [0, 160, 128, 0, 0, 0, 205, 0],
  "50": [16, 16, 128, 32, 0, 0, 48, 128]
}
Can anyone tell me if they have seen data formatted like this before? If so, I'd appreciate it if you could also explain what advantages the format has, if any.
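For what it's worth, here is a quick Python sketch of reading lines in this shape (assuming '=' separates the key from its values and '\' separates the individual values):
# Parse lines of the form "key=v1\v2\..." into a dict of int lists.
raw = r"""0=0\160\128\0\0\0\205\0
50=16\16\128\32\0\0\48\128"""

parsed = {}
for line in raw.splitlines():
    key, _, values = line.partition('=')
    parsed[int(key)] = [int(v) for v in values.split('\\')]

print(parsed)   # {0: [0, 160, 128, 0, 0, 0, 205, 0], 50: [16, 16, 128, 32, 0, 0, 48, 128]}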