Training multi-word verb and noun entities with spaCy NER - spacy

All the NER training instances I have come across are nouns, but is it possible to train entities with spaCy NER that are verb and noun combinations, for example 'stirring pot'?
Do I use a noun-based NER first and then train a nested NER on such phrases, or do I directly train the phrase in spaCy NER? I guess the answer will depend on whether spaCy NER uses POS and dependency features as part of its training.

NER technologies usually work best when the entities are fairly short, and when there are clear clues at the starts and ends of the phrases. These are both the case for recognising proper nouns in English, which is the canonical use-case the algorithms were developed for.
A noun phrase like "stepping stone" or "deciding factor" will be easy for an NER system to learn. The system will be less good at recognising verb + object constructions, as the verb and object might be arbitrarily far apart, e.g. stirring the pot, stirring the metal pot, stir the pot vigorously, etc. You should also be a bit wary of applying sequential labelling to arbitrary spans of text that aren't syntactic constituents. It will be very difficult to describe where the boundaries of the phrases should fall, so your annotators probably won't behave consistently. Indecision about the exact boundaries of the phrases will make the NER system perform very poorly, because spans which differ by one word are seen as entirely different spans by the loss function.
Finally, to answer your question about the POS and dependency parsing features: no, we don't use these in the NER at the moment.
You might be interested in the dependency tree matcher contributed in these two pull requests:
https://github.com/explosion/spaCy/pull/2732
https://github.com/explosion/spaCy/pull/2836
More improvements to the Matcher will also help you: https://github.com/explosion/spaCy/issues/1971
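For illustration, here is a minimal sketch of matching a verb + object pair with the dependency matcher that those pull requests grew into (the pattern syntax shown is spaCy v3's DependencyMatcher; the model name, lemmas, and example sentence are assumptions for the sketch, not part of the original answer):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Match a verb with lemma "stir" that governs a direct object with lemma
# "pot", no matter how many words separate them on the surface.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": "stir"}},
    {"LEFT_ID": "verb", "REL_OP": ">",
     "RIGHT_ID": "object", "RIGHT_ATTRS": {"DEP": "dobj", "LEMMA": "pot"}},
]
matcher.add("STIR_POT", [pattern])

doc = nlp("She was stirring the metal pot vigorously.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['stirring', 'pot']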

Related

Why do some entities perform better than others?

I have trained different entities within a NER task. Among others, I used spaCy, Stanford NER and BERT for this purpose.
The results show that the BERT models perform best on average. However, certain entities (3 of 9) perform better with spaCy and Stanford NER. I am now looking for general reasons why spaCy and Stanford might give better results than BERT. It would be nice if a few people could share their thoughts on this.

Reference text for pre-training with ELMo/BERT

How-to issue:
spaCy mentions that ELMo/BERT are very effective in NLP tasks if you have little data, as these two have very good transfer-learning properties.
My question: transfer learning relative to what model? If you have a language model for dogs, is finding a good language model for kangaroos easier? (My case is biology-related and has a lot of terminology.)
Well, BERT and ELMo are trained on huge corpora of data (BERT is trained on 16GB of raw text). This implies that the embeddings produced by these models are generic, which lets you leverage the capabilities of a language model in most tasks.
Since your task is biology-related, you can have a look at alternatives such as BioBERT (https://arxiv.org/abs/1901.08746).
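For example, here is a minimal sketch of loading a domain-specific checkpoint with the Hugging Face transformers library (the checkpoint id below is one publicly released BioBERT checkpoint and is an assumption of this example, not something named in the thread):

from transformers import AutoModel, AutoTokenizer

# One published BioBERT release on the Hugging Face hub; substitute
# whichever checkpoint fits your domain.
name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("The BRCA1 gene regulates DNA repair.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # contextual embeddings per token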

Which model (GPT-2, BERT, XLNet, etc.) would you use for a text classification task? Why?

I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using BERT and GPT-2 for text classification tasks, but I'm not sure which one I should pick to start with. Which of these recent NLP models, such as the original Transformer, BERT, GPT-2 or XLNet, would you start with, and why? I'd rather implement it in TensorFlow, but I'm flexible enough to go with PyTorch too.
Thanks!
It highly depends on your dataset, and it is part of the data scientist's job to find which model is more suitable for a particular task in terms of the selected performance metric, training cost, model complexity, etc.
When you work on the problem you will probably test all of the above models and compare them. Which one should you choose first? Andrew Ng in "Machine Learning Yearning" suggests starting with a simple model so you can quickly iterate and test your ideas, data preprocessing pipeline, etc.:
"Don't start off trying to design and build the perfect system. Instead, build and train a basic system quickly—perhaps in just a few days."
According to this suggestion, you can start with a simpler model such as ULMFiT as a baseline, verify your ideas and then move on to more complex models and see how they can improve your results.
Note that modern NLP models contain a large number of parameters, and it is difficult to train them from scratch without a large dataset. That's why you may want to use transfer learning: you can download a pre-trained model, use it as a basis, and fine-tune it on your task-specific dataset to achieve better performance and reduce training time.
I agree with Max's answer, but if the constraint is to use a state-of-the-art large pretrained model, there is a really easy way to do this: the library by Hugging Face called pytorch-transformers. Whether you choose BERT, XLNet, or whatever, the models are easy to swap out. Here is a detailed tutorial on using that library for text classification.
EDIT: I just came across this repo, pytorch-transformers-classification (Apache 2.0 license), which is a tool for doing exactly what you want.
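As a rough illustration of how interchangeable the models are, here is a minimal sketch using the transformers library (the renamed successor of pytorch-transformers); the model names and example sentence are just placeholders:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "bert-base-uncased"  # swap in "xlnet-base-cased", etc.
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("A surprisingly touching film.", return_tensors="pt")
logits = model(**inputs).logits  # one score per class; fine-tune before trusting them
print(logits.shape)  # (1, 2)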
Well, like others mentioned, it depends on the dataset, and multiple models should be tried and the best one chosen.
However, sharing my experience: XLNet beats all other models so far by a good margin. Hence, if learning is not the objective, I would simply start with XLNet, then try a few more down the line and conclude. It just saves time in exploring.
The repo below is excellent for doing all this quickly. Kudos to them.
https://github.com/microsoft/nlp-recipes
It uses Hugging Face transformers and makes them dead simple. 😃
I have used XLNet, BERT, and GPT-2 for summarization tasks (English only). In my experience, GPT-2 works best of the three on short, paragraph-size notes, while BERT performs better for longer texts (up to 2-3 pages). You can use XLNet as a benchmark.

Learning word-embeddings from characters using already learned word embedding

I have a corpus of text and I would like to find embeddings for words starting from characters. So I have a sequence of characters as input, and I want to project it into a multidimensional space.
As an initialization, I would like to fit already-learned word embeddings (for example, the Google ones).
I have some doubts:
1. Do I need to use a character embedding vector for each input character in the input sequence? Would it be a problem if I simply used the ASCII or UTF-8 encoding?
2. Regardless of the input vector definition (embedding vector, ASCII, ...), it's really confusing to select a proper model. There are several options, but I'm not sure which one is the better choice: seq2seq, auto-encoder, LSTM, multi-regressor + LSTM?
3. Could you give me some sample code in Keras or TensorFlow?
I'll answer each question in turn:
1. If you want to exploit character similarities (which are distant relatives of phonetic similarities, too), you need an embedding layer. Encodings are symbolic inputs while embeddings are continuous inputs. With symbolic inputs any kind of generalization is impossible, because you have no concept of distance (or similarity), while with embeddings you can behave similarly on similar inputs (and so generalize). However, since the input space is very small, short embeddings are sufficient.
2. The model highly depends on the kind of phenomena you want to capture. A model that I often see in the literature, and that seems to work well in different tasks, is a multilayer bidirectional LSTM over the characters with a linear layer on top.
3. The code is similar to any RNN implementation in TensorFlow. A good way to start is the TensorFlow tutorial https://www.tensorflow.org/tutorials/recurrent. The function for creating the bidirectional RNN is https://www.tensorflow.org/api_docs/python/tf/nn/static_bidirectional_rnn
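To make point 3 concrete, here is a minimal modern sketch of such a model in Keras (tf.keras, not the TF1 API the links above describe); all sizes are illustrative assumptions, not prescriptions from this answer:

import tensorflow as tf
from tensorflow.keras import layers

char_vocab_size = 128   # e.g. ASCII; one learned embedding per character
char_emb_dim = 16       # short embeddings suffice for a tiny input space
word_emb_dim = 300      # match the pre-trained word vectors you fit against

model = tf.keras.Sequential([
    layers.Embedding(char_vocab_size, char_emb_dim, mask_zero=True),
    layers.Bidirectional(layers.LSTM(64, return_sequences=True)),
    layers.Bidirectional(layers.LSTM(64)),   # multilayer BiLSTM over characters
    layers.Dense(word_emb_dim),              # linear layer on top
])
model.compile(optimizer="adam", loss="mse")  # regress onto pre-trained vectors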
From experience, I had problems fitting word-based word embeddings using a character model. The reason is that a word-based model will put morphologically similar words very far apart if there are no semantic similarities. A character-based model can't do this, because morphologically similar inputs cannot be distinguished very well (they are very close in the embedded space).
This is one of the reasons why, in the literature, people often use character models as a complement to word models and not as models per se. It is an open research question whether a character model can be enough to capture both semantic and morphological similarities.

Tensorflow textsum model- different source and target vocabs

I want to use the textsum model for tagging named entities, so the target vocab size is very small. While training, there doesn't seem to be an option to provide different vocabs on the encoder and decoder sides, or is there?
See these code lines on GitHub:
if hps.mode == 'train':
    model = seq2seq_attention_model.Seq2SeqAttentionModel(hps, vocab, num_gpus=FLAGS.num_gpus)
Hrishikesh, I don't believe there is a way to provide separate vocab files, but I don't fully understand why you need it. The vocab simply provides a numerical way of representing a word. Therefore, when the model is working with these words, it uses their numerical representations. Once the hypothesis is complete and the statistical choices for words have been made, it simply uses the vocab file to convert the vocab index back to its associated word. I hope this helps answer your question and solidify why you shouldn't need separate vocab files. That said, I may just be misunderstanding your need for it, and I apologize if that is the case.
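To illustrate the point, here is a toy sketch of what a shared vocab does (made-up words and ids, not textsum's actual code):

# A shared vocab is just a two-way map between words and integer ids;
# the model itself only ever sees the ids.
word_to_id = {"<UNK>": 0, "the": 1, "said": 2, "PERSON": 3, "ORG": 4}
id_to_word = {i: w for w, i in word_to_id.items()}

def encode(tokens):
    return [word_to_id.get(t, word_to_id["<UNK>"]) for t in tokens]

def decode(ids):
    return [id_to_word[i] for i in ids]

print(decode(encode(["the", "PERSON", "said"])))  # round-trips via the ids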
No, there is no out-of-the-box option to use textsum in this way. I don't see any reason why it shouldn't be possible to modify the architecture to achieve this, though. I would be interested if you could point towards some literature on using seq2seq-with-attention models for NER.