I am currently working through the following TF tutorial: https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu
Testing the outputs of the tokenize function on different sentences, I wonder what happens when tokenizing unknown words.
Loading model:
bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
bert_preprocess = hub.load(tfhub_handle_preprocess)
Tokenizing sentence/word:
tok = bert_preprocess.tokenize(tf.constant(['Tensorsss bla']))
print(tok)
# Output:
<tf.RaggedTensor [[[23435, 4757, 2015], [1038, 2721]]]>
Shouldn't every word be tokenized to a single token? Those are obviously made-up words, but I am wondering what happens when you encode them to fixed-length vectors.
Also, how does the tokenizer turn the made-up words into 3 different tokens? Does it split unknown words into known parts?
The default cache location for the tensorflow/bert_en_uncased_preprocess/3 model is /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0 (more on caching). In the assets directory you'll find vocab.txt, which is the vocabulary being used. You can look up which token a token-id i corresponds to by reading line i+1 of that file, e.g.
sed '23436q;d' /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt
> tensor
Doing that for all token-ids returns
[tensor, ##ss, ##s], [b, ##la]
As you can see, this confirms your theory that words are split into different known parts. More details on the exact algorithm can be found in Subword tokenizers.
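For convenience, here is a small Python sketch of the same lookup, reading vocab.txt directly (the cache path is the one from above and may differ on your machine; tok is the RaggedTensor from the question):
vocab_path = '/tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt'
with open(vocab_path, encoding='utf-8') as f:
    vocab = [line.rstrip('\n') for line in f]  # token-id i lives on line i+1, i.e. list index i

for word_pieces in tok.to_list()[0]:  # first (and only) sentence in the batch
    print([vocab[i] for i in word_pieces])
# ['tensor', '##ss', '##s']
# ['b', '##la']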
Related
I have seen in many blogs people using one_hot (from tf.keras.preprocessing.text.one_hot) to convert a string of words into an array of numbers representing indices. This does not guarantee unique indices, whereas the Tokenizer class (tf.keras.preprocessing.text.Tokenizer) does.
Then why is one_hot preferred over the Tokenizer?
Update: I learned that hashing is used in one_hot to convert words into numbers, but I don't see its importance, as we can use the Tokenizer class to do the same thing more accurately.
Not sure what you mean by unicity. I expect it has to do with the sequential relationship between the words, which of course is lost with one-hot encoding.
One-hot encoding is used when the number of words is limited. If, say, you have 10 words in the vocabulary, you will create 10 new features, which is fine for most neural networks to process. If you have other features in your data set besides the word sequences, say numeric ordinal parameters, you can still build a single-input model. However, if you have 10,000 words in the vocabulary, you would create 10,000 new features, which at best will take a long time to process. So in the case of a large vocabulary it is best to use a "dense" encoding rather than the sparse encoding generated by one-hot encoding.
You can use the result of the Tokenizer encoding as input to a Keras Embedding layer, which will encode the words into an N-dimensional space, where N is a value you specify. If you have additional ordinal features, then to process the data your model will need multiple inputs. Perhaps that is why some people prefer to one-hot encode the words.
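For concreteness, a minimal sketch of that dense route; the toy corpus, vocabulary size and embedding dimension here are made up:
import tensorflow as tf

texts = ["the cat sat on the mat", "the dog ate my homework"]  # toy corpus
tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=10_000, oov_token='<OOV>')
tokenizer.fit_on_texts(texts)
sequences = tf.keras.preprocessing.sequence.pad_sequences(
    tokenizer.texts_to_sequences(texts), maxlen=10)

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=10_000, output_dim=16),  # N = 16 dimensions per word
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),
])
print(model(sequences).shape)  # (2, 1): one prediction per text
The Embedding layer replaces the 10,000 sparse one-hot features with a 16-dimensional dense vector per word, which is what keeps the parameter count manageable for large vocabularies.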
I'm using tf.keras.preprocessing.text.Tokenizer to build a vocabulary for my corpus (5 million documents). The tokenizer finds 145K tokens. The issue is that the embedding layer has far too many parameters.
What's a simple way to force the tokenizer to only consider the top N most common words? Then anything outside of that would get an <OOV> token.
Solution
As indicated by @MarcoCerliani in the comments, you can simply change the num_words parameter on the tokenizer. That won't change the tokenizer's internal data; it will instead return OOV tokens for words outside the range specified by num_words.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(huge_corpus)
len(tokenizer.word_counts)
>>> (some huge vocabulary size)
# Change vocabulary size
tokenizer.num_words = 100
tokenizer.texts_to_sequences(texts)
>>> (any words in `texts` not within the 100 most common words
>>> will get `1`, the out-of-vocab index)
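A tiny self-contained check of that behaviour (the corpus and the num_words value are illustrative):
import tensorflow as tf

tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(["the cat sat", "the dog sat", "the cat ran"])
tokenizer.num_words = 3  # keep only '<OOV>' (index 1) and 'the' (index 2)
print(tokenizer.texts_to_sequences(["the cat jumped"]))
# [[2, 1, 1]]: 'the' survives; 'cat' falls outside num_words and the unseen
# 'jumped' was never in the vocabulary, so both map to the OOV index 1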
I have some sentences for which I am creating an embedding and it works great for similarity searching unless there are some truly unusual words in the sentence.
In that case, the truly unusual words actually carry the most similarity information of any words in the sentence, but all of that information is lost during embedding because the word is apparently not in the model's vocabulary.
I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.
I can then do an exact word search for those novel words in my target corpus and achieve usability for my similar sentence searching.
e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".
If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.
Any ideas on how I can extract the vocabulary known/used by GUSE?
I combined the earlier answer from @Roee Shenberg and the solution provided here to come up with a solution that is applicable to USE v4:
import importlib
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model("/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/")
graph = saved_model.meta_graphs[0].graph_def
fns = [f for f in saved_model.meta_graphs[0].graph_def.library.function if "ptb" in str(f).lower()]
print(len(fns)) # should be 1
nodes_with_sp = [n for n in fns[0].node_def if n.name == "Embeddings_words"]
print(len(nodes_with_sp)) # should be 1
words_tensor = nodes_with_sp[0].attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # should be 400004
If you are just curious about the words, I uploaded them here.
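With word_list in hand, a hedged sketch of the masking idea from the question (the tokenization is deliberately crude, and whether 'Xapian' is actually out of vocabulary depends on the model's word list):
known_words = set(word_list)

def novel_words(sentence):
    tokens = [t.strip('.,!?').lower() for t in sentence.split()]  # crude whitespace tokenization
    return [t for t in tokens if t not in known_words]

print(novel_words("I love to use Xapian!"))  # e.g. ['xapian'] if it is not in the vocabulary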
I'm assuming you have tensorflow & tensorflow_hub installed, and you have already downloaded the model.
IMPORTANT: I'm assuming you're looking at https://tfhub.dev/google/universal-sentence-encoder/4! There's no guarantee the object graph looks the same for different versions, it's likely that modifications will be needed.
Find its location on disk - it's somewhere at /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.
The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.
import importlib
MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
Some resources that helped:
A GitHub issue relating to changing the vocabulary
A Tensorflow Google-group thread linked from the issue
Extra Notes
Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:
with open('/path/to/glove.6B.50d.txt') as f:
    glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)
USE_vocabulary = set(word_list)  # from above
print(len(USE_vocabulary - glove_vocabulary))  # -> 281150
Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?
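If you want to poke at that yourself, a small sketch (reusing glove_vocabulary from above) that lists number-like entries:
numeric_like = sorted(w for w in glove_vocabulary if w.replace('.', '', 1).isdigit())
print(len(numeric_like), numeric_like[:10])  # how many there are, plus a sample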
I am training a simple model for text classification (currently with scikit-learn). To transform my document samples into word count vectors using a vocabulary I use
CountVectorizer(vocabulary=myDictionaryWords).fit_transform(myDocumentsAsArrays)
from sklearn.feature_extraction.text.
This works great, and I can subsequently train my classifier on these word count vectors as feature vectors. But what I don't know is how to inverse-transform these word count vectors back to the original documents. CountVectorizer does have an inverse_transform(X) function, but it only gives you back the unique non-zero tokens.
As far as I know CountVectorizer doesn't have any implementation of a mapping back to the original documents.
Anyone know how I can restore the original sequences of tokens from their count-vectorized representation? Is there maybe a Tensorflow or any other module for this?
CountVectorizer is "lossy": for a document like "This is the amazing string in amazing program", it only stores the counts of the words in the document (i.e. string -> 1, amazing -> 2, etc.) and loses the position information.
So by reversing it, you can create a document with the same words repeated the same number of times, but their original order in the document cannot be recovered.
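A minimal sketch of that information loss (on older scikit-learn versions, get_feature_names_out may be get_feature_names):
from sklearn.feature_extraction.text import CountVectorizer

doc = ["This is the amazing string in amazing program"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(doc)

print(vectorizer.get_feature_names_out())  # vocabulary, alphabetically ordered
print(X.toarray())                         # counts only, e.g. 'amazing' -> 2
print(vectorizer.inverse_transform(X))     # unique non-zero tokens; the original order is gone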
I would like to use TensorFlow for sequence tagging, namely part-of-speech tagging. I tried to use the same model outlined here: http://tensorflow.org/tutorials/seq2seq/index.md (which describes a model to translate English to French).
Since in tagging, the input sequence and output sequence have exactly the same length, I configured the buckets so that input and output sequences have same length and tried to learn a POS tagger using this model on ConLL 2000.
However, it seems that the decoder sometimes outputs a tagged sequence shorter than the input sequence (it seems to emit the EOS tag prematurely).
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in September .
The above sentence is tokenized into 18 tokens, which get padded to 20 (due to bucketing).
When asked to decode the above, the decoder spits out the following:
PRP VBD DT JJ JJ NN MD VB TO VB DT NN IN NN . _EOS . _EOS CD CD
So here it ends the sequence (emits EOS) after 15 tokens instead of 18.
How can I force the model to learn that, in my scenario, the decoded sequence should be the same length as the encoded one?
If your input and output sequences are the same length, you probably want something simpler than a seq2seq model (since handling different sequence lengths is one of its strengths).
Have you tried just training (word -> tag) ?
Note that for something like POS tagging, where there is a clear signal from the tokens on either side, you'll definitely get a benefit from a bidirectional net.
If you want to go all crazy, there would be some fun character-level variants too, where you only emit the tag at the token boundary (the rationale being that POS tagging benefits from character-level features, e.g. things like out-of-vocab names). So many variants to try! :D
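For reference, a minimal sketch of that simpler (word -> tag) setup with a bidirectional net in Keras; the sizes and the padded arrays X (word ids) and Y (tag ids) are assumptions, not part of the ConLL 2000 setup above:
import tensorflow as tf

vocab_size, num_tags = 10_000, 45  # illustrative sizes

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 128, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(num_tags, activation='softmax'),  # one tag distribution per token
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
# model.fit(X, Y, ...)  # X: (batch, max_len) padded word ids, Y: (batch, max_len) tag ids
Because every input position gets exactly one output position, the length-mismatch problem from the question disappears by construction.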
There are various ways of specifying an end-of-sequence parameter. The translate demo uses an <EOS> flag to determine the end of sequence. However, you can also specify the end of the sequence by counting the number of expected words in the output. In lines 225-227 of translate.py:
# If there is an EOS symbol in outputs, cut them at that point.
if data_utils.EOS_ID in outputs:
    outputs = outputs[:outputs.index(data_utils.EOS_ID)]
You can see that the outputs are cut off whenever <EOS> is encountered. You can easily tweak this to constrain the number of output words. You might also consider getting rid of the <EOS> flag altogether during training, depending on your application.
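As a hedged sketch of that tweak (outputs comes from the snippet above; token_ids is assumed to be the list of encoder input ids produced earlier in the decoding code):
# Instead of cutting at <EOS>, cut the decoded outputs at the known input length,
# so each input token gets exactly one tag.
outputs = outputs[:len(token_ids)]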
I ran into the same problem. In the end I found that the ptb_word_lm.py example in TensorFlow's examples is exactly what we need for tokenization, NER and POS tagging.
If you look into the details of that language model example, you will see that it treats the input sequence as X and the same sequence shifted by one position as Y. This is exactly what fixed-length sequence labeling needs.
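A tiny sketch of that shifted-targets idea with made-up ids:
import numpy as np

token_ids = np.array([12, 7, 99, 3, 42, 5])  # made-up word ids for one sentence
X = token_ids[:-1]  # model input at each time step
Y = token_ids[1:]   # target: the same sequence shifted by one position
print(X)  # [12  7 99  3 42]
print(Y)  # [ 7 99  3 42  5]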