keras.preprocessing.text.Tokenizer equivalent in Pytorch? - tensorflow

Basically the title; is there any equivalent tokeras.preprocessing.text.Tokenizer in Pytorch? I have yet to find any that gives all the utilities without handcrafting things.

I find Torchtext more difficult to use for simple things. PyTorch-NLP can do this in a more straightforward way:
from torchnlp.encoders.text import StaticTokenizerEncoder, stack_and_pad_tensors, pad_tensor
loaded_data = ["now this ain't funny", "so don't you dare laugh"]
encoder = StaticTokenizerEncoder(loaded_data, tokenize=lambda s: s.split())
encoded_data = [encoder.encode(example) for example in loaded_data]
print(encoded_data)
[tensor([5, 6, 7, 8]), tensor([ 9, 10, 11, 12, 13])]
encoded_data = [pad_tensor(x, length=10) for x in encoded_data]
print(stack_and_pad_tensors(encoded_data))
# alternatively, use encoder.batch_encode()
BatchedSequences(tensor=tensor([[ 5, 6, 7, 8, 0, 0, 0, 0, 0, 0], [ 9, 10, 11, 12, 13, 0, 0, 0, 0, 0]]), lengths=tensor([10, 10]))
​
It comes with other types of encoders, such as spaCy's tokenizer, subword encoder, etc.

PyTorch itself does not provide a function like this, you either need to it manually (which should be easy: use a tokenizer of your choice and do a dictionary lookup for the indices).
Alternatively, you can use Torchtext, which provides basic abstraction from text processing. All you need to do is create a Field object. You can use string.split, SpaCy or custom function for tokenization. You can provide a vocabulary or create it directly from data. Then you just call the process method which tokenizes text and does the vocabulary lookup.
If you want something more complex, you might consider using also AllenNLP. In AllenNLP, you do separately the tokenization and the vocabulary lookup.

Related

LGBMClassifier + Unbalanced data + GridSearchCV()

The dependent variable is binary, the unbalanced data is 1:10, the dataset has 70k rows, the scoring is the roc curve, and I'm trying to use LGBM + GridSearchCV to get a model. However, I'm struggling with the parameters as sometimes it doesn't recognize them even when I use the parameters as the documentation shows:
params = {'num_leaves': [10, 12, 14, 16],
'max_depth': [4, 5, 6, 8, 10],
'n_estimators': [50, 60, 70, 80],
'is_unbalance': [True]}
best_classifier = GridSearchCV(LGBMClassifier(), params, cv=3, scoring="roc_auc")
best_classifier.fit(X_train, y_train)
So:
What is the difference between putting the parameters in the GridsearchCV() and params?
As it's unbalanced data, I'm trying to use the roc_curve as the scoring metric as it's a metric that considers the unbalanced data. Should I use the argument scoring="roc_auc" put it in the params argument?
The difference between putting the parameters in GridsearchCV()or params is mentioned in the docs of GridSearch:
When you put it in params:
Dictionary with parameters names (str) as keys and lists of parameter settings to try as values, or a list of such dictionaries,
in which case the grids spanned by each dictionary in the list are
explored. This enables searching over any sequence of parameter
settings.
And yes you can put the scoring also in the params.

About get_special_tokens_mask in huggingface-transformers

I use transformers tokenizer, and created mask using API: get_special_tokens_mask.
My Code
In RoBERTa Doc, returns of this API is "A list of integers in the range [0, 1]: 0 for a special token, 1 for a sequence token". But I seem that this API returns "0 for a sequence token, 1 for a special token".
Is it all right?
You are indeed correct. I tested this for both transformers 2.7 and the (at the time of writing) current release of 2.9, and in both cases I do get the inverted results (0 for regular characters, and 1 for the special characters.
For reference, this is how I tested it:
import transformers
tokenizer = transformers.AutoTokenizer.from_pretrained("roberta-base")
sentence = "This is a special sentence."
encoded_sentence = tokenizer.encode(sentence)
# [0, 152, 16, 10, 780, 3645, 4, 2]
special_masks = tokenizer.get_special_tokens_mask(encoded_sentence)
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
I would suggest you report this issue in their repository, or ideally provide a pull request yourself to fix the issue ;-)

Tensorflow sliding window transformation over large data

I would like to feed my model with stride-1 windows from a very long data sequence (tens of millions of entries). This is similar to the aim presented in this thread, only that my data sequence may contain several features to begin with, so the final number of features is n_features * window_size. i.e. with two original features and a window size of 3, this would mean transforming this:
[[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]]
to:
[[1, 2, 3, 6, 7, 8], [2, 3, 4, 7, 8, 9], [3, 4, 5, 8, 9, 10]]
I was trying to use slicing with map_fn or Dataset.map, applied to a sequence of indices (per the answer in the above-mentioned thread), as in:
ti = tf.range(data.shape[0] - window_size)
train_dataset = tf.data.Dataset.from_tensor_slices((ti, labels))
def get_window(l, label):
wnd = tf.reshape(data_tensor[l:(l + window_size), :], (-1, window_size * n_features))
wnd = tf.squeeze(wnd)
return (wnd, label)
train_dataset = train_dataset.map(get_window)
train_dataset = train_dataset.batch(batch_size)
...
This is working in principle, but the training is extremely slow, with minimal GPU utilization (1-5%, probably in part because the mapping is done in the CPU).
When trying to do the same with tf.map_fn, the graph building becomes very lengthy, with tremendous memory utilization.
Another option I was trying is to transform all of the data in advance before I load it in Tensorflow. This works much faster (even when considering the pre-processing time, I wonder why - shouldn't it be the same operation as the mapping during training?) but is very inefficient in terms of memory and storage, as the data becomes window_size-fold larger. That is a deal-breaker for my large datasets.
I thought about splitting these transformed bloated datasets into several files ("hyper-batches") and go through them in sequence for each epoch, but this seems very inefficient and I was wondering if there is a better way to achieve this simple transformation.

How do I resolve one hot encoding if my test data has missing values in a col?

For example if my training data has the categorical values (1,2,3,4,5) in the col,then one hot encoding will give me 5 cols. But in the test data I have, say only 4 out of the 5 values i.e.(1,3,4,5).So one hot encoding will give me only 4 cols.Therefore if I apply my trained weights on the test data, I will get an error as the dimensions of the cols do not match in the train and test data, dim(4)!=dim(5).Any suggestions on what do I do with the missing col values?
The image of my code is provided below:
image
Guys don't do this mistake, please!
Yes, you can do this hack with the concatenation of train and test and fool yourself, but the real problem is in production. There your model will someday face an unknown level of your categorical variable and then break.
In reality, some of the more viable options could be:
Retrain your model periodically to account for new data.
Do not use one hot. Seriously, there are many better options like leave one out encoding (https://www.kaggle.com/c/caterpillar-tube-pricing/discussion/15748#143154) conditional probability encoding (https://medium.com/airbnb-engineering/designing-machine-learning-models-7d0048249e69), target encoding to name a few. Some classifiers like CatBoost even have a built-in mechanism for encoding, there are mature libraries like target_encoders in Python, where you will find lots of other options.
Embed categorical features and this could save you from a complete retrain (http://flovv.github.io/Embeddings_with_keras/)
You can first combine two dataframes, then get_dummies then split them so they can have exact number of columns i.e
#Example Dataframes
Xtrain = pd.DataFrame({'x':np.array([4,2,3,5,3,1])})
Xtest = pd.DataFrame({'x':np.array([4,5,1,3])})
# Concat with keys then get dummies
temp = pd.get_dummies(pd.concat([Xtrain,Xtest],keys=[0,1]), columns=['x'])
# Selecting data from multi index and assigning them i.e
Xtrain,Xtest = temp.xs(0),temp.xs(1)
# Xtrain.as_matrix()
# array([[0, 0, 0, 1, 0],
# [0, 1, 0, 0, 0],
# [0, 0, 1, 0, 0],
# [0, 0, 0, 0, 1],
# [0, 0, 1, 0, 0],
# [1, 0, 0, 0, 0]], dtype=uint8)
# Xtest.as_matrix()
# array([[0, 0, 0, 1, 0],
# [0, 0, 0, 0, 1],
# [1, 0, 0, 0, 0],
# [0, 0, 1, 0, 0]], dtype=uint8)
Do not follow this approach. Its a simple trick with lot of disadvantages. #Vast Academician answer explains better.
Use dummy(binary) encoding instead of one hot encoding. Pandas pd.dummies() with drop_first = True creates dummy encoding to get k-1 dummies out of k categorical levels by removing the first level. The default option drop_first = False creates one hot encoding.
See pandas official documentation
Also dummy(binary) encoding creates less number of columns.

Can you name this data serialization format?

OK, so an API I'm using spits out data that looks like this:
0=0\160\128\0\0\0\205\0
50=16\16\128\32\0\0\48\128
100=100\80\128\100\16\48\32\16
192=0\200\137\84\25\1\0\18
It's basically a key-value array, where the value is an array of values. So, in JSON, the first two lines might look like this:
{
0:[0, 160, 128, 0, 0, 0, 205, 0],
50:[16, 16, 128, 32, 0, 0, 28, 128]
}
Can anyone tell me if they have seen data formatted like this before? If so, I'd appreciate it if you could also explain what advantages the format has, if any.