Using tensorflow for sequence tagging : Synced sequence input and output - tensorflow

I would like to use Tensorflow for sequence tagging namely Part of Speech tagging. I tried to use the same model outlined here: http://tensorflow.org/tutorials/seq2seq/index.md (which outlines a model to translate English to French).
Since in tagging, the input sequence and output sequence have exactly the same length, I configured the buckets so that input and output sequences have same length and tried to learn a POS tagger using this model on ConLL 2000.
However it seems that the decoder sometimes outputs a taggedsequence shorter than the input sequence (it seems to feel that the EOS tag appears prematurely)
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in September .
The above sentence is tokenized to have 18 tokens which gets padded to 20 (due to bucketing).
When asked to decode the above, the decoder spits out the following:
PRP VBD DT JJ JJ NN MD VB TO VB DT NN IN NN . _EOS . _EOS CD CD
So here it ends the sequence (EOS) after 15 tokens not 18.
How can I force the sequence to learn that the decoded sequence should be the same length as the encoded one in my scenario.

If your input and output sequences are the same length you probably want something simpler than a seq2seq model (since handling different sequence lengths is one of it's strengths)
Have you tried just training (word -> tag) ?
note: that for something like pos tagging where there is clear signal from tokens on either side you'll definitely get a benefit from a bidirectional net.
If you want to go all crazy there would be some fun character level variants too where you only emit the tag at the token boundary (the rationale being that pos tagging benefits from character level features; e.g. things like out of vocab names). So many variants to try! :D

There are various ways of specifying an end of sequence parameter. The translate demo uses a flag <EOS> to determine the end of sequence. However, you can also specify end of sequence by counting the number of expected words in the output. In the lines 225-227 of the translate.py:
# If there is an EOS symbol in outputs, cut them at that point.
if data_utils.EOS_ID in outputs:
outputs = outputs[:outputs.index(data_utils.EOS_ID)]
You can see that outputs are being cut off whenever <EOS> is encountered. You can easily tweak it to constrain the number of output words. You might also consider getting rid of <EOS> flag altogether while training, considering your application.

I came to the same problem. At the end I found ptb_word_lm.py example in tensorflow's examples is exactly what we need for tokenization, NER and POS tagging.
If you look into details of the language model example, you can find out that it treats the input character sequence as X and right shift X for 1 space as Y. It is exactly what fixed length sequence labeling needs.

Related

Pre_tokenization/tokenization of DNA data using HuggingFace

I am struggling with transformers in DNA data for a supervised binary classification problem. I have very long DNA sequences (the mean is 6E7 characters) and, to be able to pass longer sequences as input to the Neural Network, I am trying to tokenize using different algorithms to work with longer sequences tokens rather than only (C, G, A, T) ones.
At the moment I am trying with HuggingFace to implement BPE, WordPiece, and Unigram algorithms. However, before training those models I do have to apply a pretokenizer to my data. All of them are based into "classic" language structures like Whitespace() but in my case I only have a list of DNA sequences like (small chunk):
['CCAGCAGCTCGGTGCGCTTGCCGCTCCAGTCGCCCAGCAGCTCGGTGCGCTTGCCGCCCCAGTCGC']
My intention is to group those characters to work with bigger tokens than only 1 single character. However, when I use for example Whitespace(), my model does not learn...
Could you recommend me some pre_tokenizer for passing as input to BPE, WPiece and UNIGRAM only characters?
Also, would you recommend padding sequence before or after tokenization process?
Thank you very much

Tensorflow bert tokenize unknown words

I am currently doing the following tf tutorial : https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu
Testing the outputs of the tokenize function on different sentences, I wonder what happens when tokenizing unknown words.
Loading model:
bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
bert_preprocess = hub.load(tfhub_handle_preprocess)
Tokenizing sentence/word:
tok = bert_preprocess.tokenize(tf.constant(['Tensorsss bla']))
print(tok)
# Output:
<tf.RaggedTensor [[[23435, 4757, 2015], [1038, 2721]]]>
Shouldnt it be so every word is tokenized to a single token ? Those are obviously made up words, but I am wondering what happens when you encode those words to fixed length vectors.
Also, how does the tokenizer transform the made up words in 3 different tokens ? Does it split the unknown words into different known parts ?
The default cache location for the tensorflow/bert_en_uncased_preprocess/3 model is /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0 (more on caching). In the assets directory, you'll find vocab.txt, which is the used vocabulary. You can use the file to look up what token the token-id i corresponds to by looking at line i+1 of the file i.e.
sed '23436q;d' /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt
> tensor
Doing that for all token-ids returns
[tensor, ##ss, ##s], [b, ##la]
As you can see, this confirms your theory that words are split into different known parts. More details on the exact algorithm can be found in Subword tokenizers.

one_hot Vs Tokenizer for Word representation

I have seen in many blogs , people using one_hot (from tf.keras.preprocessing.text.one_hot ) to convert the string of words into array of numbers which represent indices. This does not ensure unicity. Whereas Tokenizer class ensures unicity (tf.keras.preprocessing.text.Tokenizer ).
Then why is one_hot prefered over tokenizer?
Update: I got to know that hashing is used in One_hot to convert words into numbers but didn't get its importance as we can use the tokenizer class to do the same thing with more accuracy.
Not sure what you mean by uncity. I expect it has to do with the sequential relationship between the words. That of course is lost with ine hot encoding. However one-hot encoding is used when the number of words is limited. If say you have 10 words in the vocubulary you will create 10 new features which is fine for most neural networks to process. If you have other features in your data set beside the word sequences say numeric ordinal parameters you can still create a single input model. However if you have 10,000 words in the vocabulary you would create 10,000 new features which at best will take a lot to process. So in the case of a large vocabularly it is best to use "dense" encoding" versus the sparse encoding generated by one hot encoding. You can use the results of the tokenizer encoding to serve as input to a keras embedding layer which will encode the words into an n dimensional space where N is a value you specify. If you have additional ordinal features then to process the data your model will need multiple inputs. Perhaps that is why some people prefer to one hot encode the words.

Different usage of <PAD>, <EOS> and <GO> tokens

I found that there are many different usages of <PAD>, <EOS>, and <GO> tokens.
Personally, I separate those three tokens and assign different embeddings to them, assigning an all-zero embedding vector to <PAD> token specifically (with RNN-based seq2seq model).
The majority of codes show that <PAD>, <EOS> and <GO> are all represented as <PAD> token.
I want to know if there is the optimum usage of those tokens (in terms of RNN-based models or transformer-based models).
These are the special tokens used in seq2seq:
GO - the same as on the picture below - the first token which is fed to the decoder along with the though vector in order to start generating tokens of the answer
EOS - "end of sentence" - the same as on the picture below - as soon as decoder generates this token we consider the answer to be complete (you can't use usual punctuation marks for this purpose cause their meaning can be different)
UNK - "unknown token" - is used to replace the rare words that did not fit in your vocabulary. So your sentence My name is guotong1988 will be translated into My name is unk.
PAD - your GPU (or CPU at worst) processes your training data in batches and all the sequences in your batch should have the same length. If the max length of your sequence is 8, your sentence My name is guotong1988 will be padded from either side to fit this length: My name is guotong1988 pad pad pad pad
will help understand better
Reference: https://github.com/nicolas-ivanov/tf_seq2seq_chatbot/issues/15

How to pass a list of numbers as a single feature to a neural network?

I am trying to cluster sentences by clustering the sentence embedding of them taken from fasttext model. Each sentence embedding has 300 dimensions, and I want to reduce them to 50 (say). I tried t-SNE, PCA, UMAP. I wanted to see how Auto Encoder works for my data.
Now passing those 300 numbers for each sentence as separate features to the NN would make sense or they should be passed as a single entity? If so, is there an way to pass a list as a feature to NN?
I tried passing the 300 numbers as individual features and with the output I tried clustering. Could get very few meaningful clusters rest were either noise or clusters with no similar sentences but being grouped (But with other techniques like UMAP I could get far more meaningful clusters in more number). Any leads would be helpful. Thanks in advance :)