Training data cleaning for spaCy NER

I am trying to train spaCy NER on custom data. Each sample of my training data consists of raw text extracted from a document, and each sample contains around 100+ words. For example:
[
    [
        "Some long raw text here \n\n\n This text contains multiple line breaks...",
        {
            "entities": [
                [246, 264, "entity_1"],
                ...
            ]
        }
    ]
]
My question is: for training NER, I am feeding raw document text to the model without any pre-processing. Does spaCy perform some data pre-processing steps, or do I need to perform the following pre-processing steps on the raw data before feeding it to spaCy:
Removal of stop words
Removal of punctuation
Lower case text
Lemmatization
Normalization of spaces and line breaks
My concern is: if my documents contain terms like 'Context', 'Context:', 'context' and 'context-1', and my model vocabulary only has the word 'Context', would spaCy treat 'Context:', 'context' and 'context-1' as OOV words? And will it generate word vectors of zeros for these words, like it does for other OOV words, before feeding them to the model?
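For reference, one can check the tokenization and OOV status of such terms directly; a minimal sketch, assuming a vectors-equipped model such as en_core_web_md is installed:
import spacy

# Assumes a model with word vectors, e.g. en_core_web_md, is installed.
nlp = spacy.load("en_core_web_md")

doc = nlp("Context Context: context context-1")
for token in doc:
    # is_oov is True when the token has no vector in the model's vocabulary;
    # note that spaCy's tokenizer splits 'Context:' into 'Context' and ':'.
    print(token.text, token.is_oov, token.has_vector)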
If spaCy performs some pre-processing on the training data, what type of pre-processing does it do?
Is it possible to log in spaCy which feature vectors get fed into the model for a certain sample during training?

Related

How to do tokenization from a predefined vocab in TensorFlow or PyTorch or Keras?

I have a predefined vocab which is built from the 3,500 most commonly used Chinese characters. Now I want to tokenize the dataset with this vocab so that each character gets a fixed id. Is there any mature class or function I can inherit from to build the data reading pipeline?
Rather than go through the how-to details here, I suggest you watch a tutorial on YouTube located here. The author demonstrates how to use the tokenizer to encode text characters into sequences which can then be used as input to an embedding layer. The part you will be interested in starts at 23:30 in the video.
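As a rough illustration of the idea (not from the video), here is a minimal sketch using TensorFlow's StringLookup layer with an assumed fixed character vocabulary; the vocab list and sentence below are placeholders:
import tensorflow as tf

# Hypothetical predefined vocabulary (in practice, the 3,500 common characters).
vocab = ["的", "一", "是", "不", "了"]

# StringLookup maps each character to its index in the fixed vocab;
# characters outside the vocab map to the OOV index (0 by default).
lookup = tf.keras.layers.StringLookup(vocabulary=vocab)

# Split a sentence into individual characters, then look them up.
chars = tf.strings.unicode_split("一是不了", "UTF-8")
ids = lookup(chars)
print(ids)  # per-character integer ids, ready for an embedding layer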

How to get word vectors from a pre-trained word2vec model downloaded from TFHub?

So I'm using the following word2vec model from TFHub:
embed = hub.load("https://tfhub.dev/google/Wiki-words-250-with-normalization/2")
The type of this object is:
tensorflow.python.saved_model.load.Loader._recreate_base_user_object.<locals>._UserObject
While I can use the model to embed lists of text, it's not clear to me how I can access the word embeddings themselves.
First of all, let's discuss what embed actually is. According to the official documentation, the embed object is a TextEmbedding created based on the Skip-gram model and stored in TensorFlow 2 format.
The Skip-gram model is just a feed-forward neural network that takes the one-hot encoded representations of the words in the vocabulary as input and computes the word embeddings. So these word embeddings aren't stored within the model; they get calculated.
So, if you want the word embeddings of separate words, you can pass them one at a time like so:
>>> # word embedding of `apple`
>>> apple_embedding = embed(["apple"])
>>> apple_embedding.shape
TensorShape([1, 250])
>>> # concatenation of three different word embeddings
>>> group = embed(["apple", "banana", "carrot"])
>>> group.shape
TensorShape([3, 250])

How to use Transformers for text classification?

I have two questions about how to use the TensorFlow implementation of the Transformer for text classification.
First, it seems people mostly use only the encoder layer to do the text classification task. However, the encoder layer generates one prediction for each input word. Based on my understanding of transformers, the input to the encoder each time is one word from the input sentence. Then the attention weights and the output are calculated using the current input word, and we can repeat this process for all of the words in the input sentence. As a result, we'll end up with pairs of (attention weights, outputs) for each word in the input sentence. Is that correct? Then how would you use these pairs to perform a text classification?
Second, based on the TensorFlow implementation of the Transformer here, they embed the whole input sentence into one vector and feed a batch of these vectors to the Transformer. However, I expected the input to be a batch of words instead of sentences, based on what I've learned from The Illustrated Transformer.
Thank you!
There are two approaches you can take:
Just average the states you get from the encoder;
Prepend a special token [CLS] (or whatever you like to call it) and use the hidden state for the special token as input to your classifier.
The second approach is used by BERT. During pre-training, the hidden state corresponding to this special token is used for predicting whether two sentences are consecutive. In the downstream tasks, it is also used for sentence classification. However, my experience is that sometimes averaging the hidden states gives a better result.
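A minimal sketch of both pooling options, assuming you already have encoder outputs of shape [batch, seq_len, hidden] (the shapes below are placeholders):
import tensorflow as tf

# Placeholder encoder output: [batch, seq_len, hidden].
batch, seq_len, hidden, num_classes = 2, 10, 64, 3
encoder_output = tf.random.normal([batch, seq_len, hidden])

# Approach 1: average the hidden states over the sequence dimension.
avg_pooled = tf.reduce_mean(encoder_output, axis=1)  # [batch, hidden]

# Approach 2: take the hidden state at the first position, assuming a
# special [CLS] token was prepended to every input sequence.
cls_pooled = encoder_output[:, 0, :]  # [batch, hidden]

# Either pooled vector can feed a standard classification head.
classifier = tf.keras.layers.Dense(num_classes, activation="softmax")
probs = classifier(avg_pooled)  # [batch, num_classes]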
Instead of training a Transformer model from scratch, it is probably more convenient to use (and eventually fine-tune) a pre-trained model (BERT, XLNet, DistilBERT, ...) from the transformers package. It has pre-trained models ready to use in PyTorch and TensorFlow 2.0.
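For instance, a minimal example with the transformers package (this downloads a default pre-trained English sentiment model on first use):
from transformers import pipeline

# Ready-made text-classification pipeline built on a pre-trained model.
classifier = pipeline("sentiment-analysis")
print(classifier("This movie was great!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]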
The Transformer is designed to take the whole input sentence at once. The main motive for designing the Transformer was to enable parallel processing of the words in the sentence. This parallel processing is not possible in LSTMs, RNNs or GRUs, as they take the words of the input sentence as input one by one.
So in the encoder part of the Transformer, the very first layer contains a number of units equal to the number of words in a sentence, and each unit converts that word into the corresponding embedding vector. The rest of the processing is then carried out. For more details, you can go through the article: http://jalammar.github.io/illustrated-transformer/
How to use this Transformer for text classification: since in text classification our output is a single number, not a sequence of numbers or vectors, we can remove the decoder part and just use the encoder part. The output of the encoder is a set of vectors, the same in number as the number of words in the input sentence. We can then feed this set of output vectors into a CNN, or add an LSTM or RNN model on top and perform classification, as in the sketch below.
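A minimal sketch of such a classification head on top of encoder outputs (shapes and layer sizes are placeholders, not from the answer):
import tensorflow as tf

# Placeholder shapes: per-token encoder vectors of size `hidden`.
seq_len, hidden, num_classes = 10, 64, 3
encoder_outputs = tf.keras.Input(shape=(seq_len, hidden))

# An LSTM consumes the per-token vectors and returns its final state,
# which a dense softmax layer turns into class probabilities.
x = tf.keras.layers.LSTM(32)(encoder_outputs)
outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
model = tf.keras.Model(encoder_outputs, outputs)
model.summary()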
The input is the whole sentence, or a batch of sentences, not word by word. You may simply have misunderstood that part.

Convert textual documents to tf.data in TensorFlow for reading sequentially

In a textual corpus, there are 50 documents, each approximately 80 lines long.
I want to feed my corpus as input to TensorFlow, but I want to batch each document as the system reads it, similar to how TFRecord is used for images. How can I use tf.data to batch each document in my corpus so it can be read sequentially?
You can create a TextLineDataset that will contain the lines of your documents:
dataset = tf.data.TextLineDataset(['doc1.txt', 'doc2.txt', ...])
After you create the dataset, you can group the lines into batches using the batch method and the other methods of the Dataset class.
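For example, a minimal sketch (file names are placeholders; the batch size of 80 assumes each document is about 80 lines, as stated in the question):
import tensorflow as tf

# Each element of the dataset is one line from the listed files, in order.
dataset = tf.data.TextLineDataset(['doc1.txt', 'doc2.txt'])

# Group consecutive lines into batches of 80, i.e. roughly one document
# per batch under the stated assumption.
batched = dataset.batch(80)

for doc_lines in batched.take(1):
    print(doc_lines.shape)  # (80,) tensor of byte-string lines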

Where should pre-processing and post-processing steps be executed when a TF model is served using TensorFlow Serving?

Typically, to use a TF graph, it is necessary to convert raw data to numerical values. I refer to this process as a pre-processing step. For example, if the raw data is a sentence, one way to do this is to tokenize the sentence and map each word to a unique number. This pre-processing creates a sequence of numbers for each sentence, which will be the input of the model.
We also need to post-process the output of a model to interpret it, for example, converting a sequence of numbers generated by the model back to words and then building a sentence.
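In plain Python, the two steps might look like this (the vocabulary below is a placeholder):
# Hypothetical pre-processing: map each word of a sentence to an id.
word_to_id = {"hello": 1, "world": 2}  # assumed vocabulary
id_to_word = {i: w for w, i in word_to_id.items()}

sentence = "hello world"
ids = [word_to_id.get(w, 0) for w in sentence.split()]  # 0 = unknown
print(ids)  # [1, 2] -- this sequence is the model input

# Hypothetical post-processing: turn the model's output ids back into words.
output_ids = [2, 1]
print(" ".join(id_to_word.get(i, "<unk>") for i in output_ids))  # world hello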
TF Serving is a technology recently introduced by Google to serve a TF model. My question is:
Where should pre-processing and post-processing be executed when a TF model is served using TensorFlow Serving?
Should I encapsulate the pre-processing and post-processing steps in my TF graph (e.g. using py_func or map_fn), or is there another TensorFlow technology that I am not aware of?
I'm running into the same issue here. Even if I'm not 100% sure yet how to use the wordDict variable (I guess you use one too, to map the words to their ids), the main pre-process and post-process functions are defined here:
https://www.tensorflow.org/programmers_guide/saved_model
as export_outputs and serving_input_receiver_fn.
export_outputs
This needs to be defined in the EstimatorSpec if you are using estimators. Here is an example for a classification algorithm:
predicted_classes = tf.argmax(logits, 1)
categories_tensor = tf.convert_to_tensor(CATEGORIES, tf.string)
export_outputs = {
    "categories": export_output.ClassificationOutput(classes=categories_tensor)
}
if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(
        mode=mode,
        predictions={
            'class': predicted_classes,
            'prob': tf.nn.softmax(logits)
        },
        export_outputs=export_outputs)
serving_input_receiver_fn
This needs to be defined before exporting the trained estimator model. It assumes the input is a raw string and parses your input from there; you can write your own function, but I'm unsure whether you can use external variables. Here is a simple example for a classification algorithm:
def serving_input_receiver_fn():
    feature_spec = {"words": tf.FixedLenFeature(dtype=tf.int64, shape=[4])}
    return tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)()

export_dir = classifier.export_savedmodel(
    export_dir_base=args.job_dir,
    serving_input_receiver_fn=serving_input_receiver_fn)
Hope it helps.