Tensorflow text tokenizer incorrect tokenization - tensorflow

I am trying to use the TF Tokenizer for an NLP model:
from tensorflow.keras.preprocessing.text import Tokenizer
tokenizer = Tokenizer(num_words=200, split=" ")
sample_text = ["This is a sample sentence1 created by sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ",
"This is another sample sentence1 created by another sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]
tokenizer.fit_on_texts(sample_text)
print (tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
Output:
[[1, 7, 8, 9]]
Word_Index:
print(tokenizer.index_word[8]) ===> 'ab'
print(tokenizer.index_word[9]) ===> 'cdefghijklmnopqrstuvwxyz'
The problem is that the tokenizer creates tokens based on . in this case. I am passing split=" " to the Tokenizer, so I expect the following output:
[[1,7,8]], where tokenizer.index_word[8] should be 'ab.cdefghijklmnopqrstuvwxyz'
In other words, I want the tokenizer to create tokens based only on spaces (" ") and not on any special characters.
How do I make the tokenizer create tokens only on spaces?

The Tokenizer takes another argument called filters, which defaults to all ASCII punctuation (filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'). During tokenization, all of the characters contained in filters are replaced by the specified split string.
If you look at the source code of the Tokenizer, specifically the method fit_on_texts, you will see that it uses the function text_to_word_sequence, which receives the filters characters and treats them the same as the split string it also receives:
def text_to_word_sequence(...):
    ...
    translate_dict = {c: split for c in filters}
    translate_map = maketrans(translate_dict)
    text = text.translate(translate_map)
    seq = text.split(split)
    return [i for i in seq if i]
So, in order to split on nothing but the specified split string, just pass an empty string as the filters argument.
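For example, a minimal sketch of the fix, reusing sample_text from the question (the exact indices assume the same fitting data and the default lowercasing):

from tensorflow.keras.preprocessing.text import Tokenizer

# filters="" means no characters are stripped or replaced before splitting on " ".
tokenizer = Tokenizer(num_words=200, split=" ", filters="")
tokenizer.fit_on_texts(sample_text)
print(tokenizer.texts_to_sequences(["sample person AB.CDEFGHIJKLMNOPQRSTUVWXYZ"]))
# [[1, 7, 8]], and tokenizer.index_word[8] == 'ab.cdefghijklmnopqrstuvwxyz'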

Related

spacy IS_DIGIT or LIKE_NUM not working as expected for certain chars

I am trying to extract some numbers using the IS_DIGIT and LIKE_NUM attributes, but they seem to behave a bit strangely for a beginner like me.
The matcher is only able to detect the numbers when the 5-character string ends in M, G, or T. If it ends in any other character, the IS_DIGIT and LIKE_NUM attributes do not detect it. What am I missing here?
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{'LIKE_NUM': True}]
matcher.add("DIGIT",[pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
matches = matcher(doc, as_spans=True)
for span in matches:
    print(span.text, span.label_)
# prints only 1231, 1232 and 1236
It may be helpful to just check which tokens are true for LIKE_NUM, like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
pattern = [{"LIKE_NUM": True}]
matcher.add("DIGIT", [pattern])
doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
for tok in doc:
    print(tok, tok.like_num)
Here you'll see that sometimes the tokens you have are split in two, and sometimes they aren't. The tokens you match are only the ones that consist just of digits.
Now, why are M, G, and T split off, while H, J, and V aren't? This is because they are treated as units, as in mega-, giga-, or terabytes.
This behaviour with units may seem inconsistent and weird, but it's been chosen to be consistent with the training data used for the English models. If you need to change it for your application, look at the tokenizer customization section of the spaCy docs, which covers adding exceptions.
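As a sketch of that kind of customization (an assumption on my part, not part of the original answer), you could extend the suffix rules so that any single trailing capital letter after digits is split off, which makes all six numbers tokenize consistently:

import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.load("en_core_web_sm")

# Illustrative extra suffix rule: split off one capital letter that follows digits.
suffixes = list(nlp.Defaults.suffixes) + [r"(?<=[0-9])[A-Z]$"]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

doc = nlp("1231M 1232G 1233H 1234J 1235V 1236T")
print([tok.text for tok in doc])  # every number is now its own token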

Obtaining the index of a word between two columns in pandas

I am checking which words the spaCy Spanish lemmatizer works on using the .has_vector method. In one column of the dataframe I have the output of the function that indicates which words can be lemmatized, and in the other the corresponding phrase.
I would like to know how I can extract all the words that have False output to correct them so that I can lemmatize.
So I created the function:
def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])
And applied it to the column sentences in the DataFrame
df["Vectors"] = df.reviews.apply(lemmatizer)
And put it in another data frame as:
df2= pd.DataFrame(df[['Vectors', 'reviews']])
The output is
index Vectors reviews
1 True True True False 'La pelicula es aburridora'
Two ways to do this:
import pandas
import spacy
nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})
If you want to use has_vector:
def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]
Alternatively you can use the is_oov attribute:
def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]
Then as you already did:
df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)
Which will return:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [aaabbbcccc, xxxxyyyz] [aaabbbcccc, xxxxyyyz]
Note:
Both of these approaches are model dependent: the smaller models ship without word vectors, so the attributes have no real backing and just return default values!
That means when you run the exact same code but e.g. with en_core_web_sm you get this:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [] [aaabbbcccc, some, example, words, xxxxyyyz]
This is because in the small model these attributes are not backed by a real vector table: as the output shows, has_vector ends up True for every token, and is_oov ends up True for every token as well. So with has_vector the small model wrongly shows all words as known, and with is_oov it wrongly shows all words as unknown.
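As a quick sanity check (a sketch, not from the original answer), you can inspect whether a loaded pipeline actually ships with a vector table before trusting either attribute:

import spacy

nlp = spacy.load("en_core_web_lg")

# A model with real vectors has a non-empty table, e.g. shape (n_vectors, width);
# for the *_sm models the table is empty.
print(nlp.vocab.vectors.shape)
print(nlp.vocab.vectors.n_keys)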

Get python string from Tensor without the numpy function

I'm following the tutorial on https://storage.googleapis.com/tensorflow_docs/docs/site/en/tutorials/text/word_embeddings.ipynb
TextVectorization by default splits on whitespace, but I want to implement a custom split. I want to keep punctuation (which I have implemented in custom_standardization) and split between words and punctuation.
For instance, "fn(1,2)=1+2=3" needs to split to ["fn","(","1",",","2",")","=","1","+","2","=","3"].
def custom_split(input_data: tf.Tensor):
    assert input_data.dtype.name == 'string'
    assert hasattr(input_data, 'numpy') == False
    ???
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    split=custom_split,
    output_mode='int',
    output_sequence_length=sequence_length)
I'm confident in doing such splitting given a standard Python string. However, the input is a tf.Tensor and, following the aforementioned tutorial, input_data does not have a numpy() function.
What's the proper way to do such splitting? Is it possible to retrieve a Python string from a string Tensor?
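No answer is included here, but one common approach (a sketch under my own assumptions, not from the tutorial) is to skip the Python-string round trip entirely and do the splitting with graph-compatible tf.strings ops: pad punctuation with spaces, then split on whitespace.

import tensorflow as tf

def custom_split(input_data: tf.Tensor):
    # No .numpy() needed: everything stays a string tensor.
    # Surround every non-alphanumeric, non-whitespace character with spaces,
    # then split on whitespace. The regex is illustrative; adjust it to your data.
    spaced = tf.strings.regex_replace(input_data, r"([^\w\s])", r" \1 ")
    return tf.strings.split(spaced)

print(custom_split(tf.constant(["fn(1,2)=1+2=3"])))
# one row of tokens: fn ( 1 , 2 ) = 1 + 2 = 3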

How to go about creating embeddings (especially, token to Id mapping) for categorical columns in tensorflow 2.0+?

I have a CSV with both categorical and float dtypes. I want to do the following:
For each categorical column I will use pandas to compute the unique values (pd.unique()) present in the column, say u_l for a column.
I will use len(u_l) to decide the dimension of the embeddings I use for the particular categorical column I want to embed (this step is the reason I cannot use tensorflow_transform).
I want to create some stateful node that can map a category (token) value to an embeddings index, so that I can subsequently look up the embedding from the embeddings matrix created in step 2.
I don't know how to go about doing this currently. A very inelegant solution I can see is using tensorflow_datasets:
encoder = tfds.features.text.TokenTextEncoder(u_l, decode_token_separator=' ')
then concatenating the entire column using a space delimiter (c_l) (c_l is one string now) and using encoder.encode(c_l).
This is a very basic thing that I think tensorflow should be able to do relatively easily. Please guide me to the right solution.
If you want to build embeddings from a word corpus, say the corpus is:
corpus:
"This pasta is good"
"This pasta is very good"
then you can use TF's Tokenizer. It will create a dict containing words as keys and indices as values; for the corpus above the dict looks like:
word_index = {"this" : 1, "pasta" : 2, "good" : 3, "very" : 4}
(You can also drop stopwords, which is why "is" is not in the dict.)
Now you can encode each sentence as a sequence of word indices using this word_index dict, so that it looks like:
For sentence 1: [1, 2, 3]
For sentence 2: [1, 2, 4, 3]
Enough talk, let's see some code. Also define oov_token for out-of-vocabulary words.
You can do it like this:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type='post'
oov_tok = "<OOV>"
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(training_sentences)  # turns each sentence into a sequence of word indices
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)  # pads with zeros (at the front by default) and truncates longer sequences from the end, per `trunc_type`
testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
Also see this GitHub code of mine; I hope it helps.
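Coming back to the categorical columns in the question: a hedged sketch of an alternative in recent TF 2.x releases (my assumption of what was wanted, not part of the answer above) is to use tf.keras.layers.StringLookup for the token-to-id mapping and an Embedding layer on top; the column name and sizing heuristic below are made up for illustration:

import pandas as pd
import tensorflow as tf

df = pd.DataFrame({"city": ["london", "paris", "london", "tokyo"]})  # toy categorical column

# Steps 1-2 from the question: unique values drive the embedding sizes.
u_l = pd.unique(df["city"])
embedding_dim = min(50, (len(u_l) + 1) // 2)  # arbitrary sizing heuristic

# Step 3: a stateful token -> id mapping plus an embedding matrix.
lookup = tf.keras.layers.StringLookup(vocabulary=list(u_l))          # index 0 is reserved for OOV
embed = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(),  # len(u_l) + 1 OOV slot
                                  output_dim=embedding_dim)

ids = lookup(tf.constant(df["city"].values))
vectors = embed(ids)
print(ids.numpy(), vectors.shape)  # one id per row, embeddings of shape (4, embedding_dim)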

Tensorflow vocabularyprocessor

I am following the wildml blog on text classification using tensorflow. I am not able to understand the purpose of max_document_length in this statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) be all the same length. For example if you're working with sentences for sentiment analysis they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are
longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
      raw_documents: An iterable which yield either str or unicode.

    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids
Note the line word_ids = np.zeros(self.max_document_length, np.int64).
Each document in raw_documents will be mapped to a word-id vector of length max_document_length.
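Note that VocabularyProcessor lived in tf.contrib, which is gone in TF 2.x; a rough modern equivalent (a sketch, not part of the original answer) is the TextVectorization layer, whose output_sequence_length plays the same pad-or-trim role as max_document_length:

import numpy as np
import tensorflow as tf

x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
max_document_length = max(len(x.split(" ")) for x in x_text)

# Pads or truncates every document to output_sequence_length,
# just like max_document_length in VocabularyProcessor.
vectorizer = tf.keras.layers.TextVectorization(
    output_mode='int', output_sequence_length=max_document_length)
vectorizer.adapt(np.array(x_text))

print(vectorizer.get_vocabulary())                 # id -> word mapping (0 = padding, 1 = [UNK])
print(np.array(vectorizer(tf.constant(x_text))))   # word-id matrix, one row per document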