Why do some entities perform better than others? - spacy

I have trained models to recognize different entities in an NER task. Among others, I used spaCy, Stanford NER and BERT for this purpose.
The results show that BERT models perform best on average. However, certain entities (3 of 9) perform better with spaCy and Stanford NER. I am now looking for general reasons why spaCy and Stanford NER give better results than BERT on those entities. It would be nice if a few people could share their thoughts on this.

Related

Does knowledge distillation have an ensemble effect?

I don't know much about knowledge distillation.
I have one question.
I have a model that reaches 99% performance on a 10-class image classification task. I can't use a bigger model because I have to keep the inference time low.
Does it have an ensemble effect if I train it with knowledge distillation using another, bigger model?
Optionally, let me know if there is any other way to improve performance beyond this.
The technical answer is no. KD is a different technique from ensembling.
But they are related in the sense that KD was originally proposed to distill larger models, and the authors specifically cite ensemble models as the type of larger model they experimented on.
Net net, give KD a try with a bigger model as the teacher to see if you can keep a lot of the performance of the bigger model at the size of the smaller model. I have empirically found that you can retain 75%-80% of the power of a 5x larger model after distilling it down to the smaller model.
From the abstract of the KD paper:
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, especially if the individual models are large neural nets. Caruana and his collaborators have shown that it is possible to compress the knowledge in an ensemble into a single model which is much easier to deploy and we develop this approach further using a different compression technique. We achieve some surprising results on MNIST and we show that we can significantly improve the acoustic model of a heavily used commercial system by distilling the knowledge in an ensemble of models into a single model. We also introduce a new type of ensemble composed of one or more full models and many specialist models which learn to distinguish fine-grained classes that the full models confuse. Unlike a mixture of experts, these specialist models can be trained rapidly and in parallel.
https://arxiv.org/abs/1503.02531
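To make the idea concrete, here is a minimal sketch of a distillation loss for a single example, in plain Python. The function names, the default `temperature` and the `alpha` weighting are my own assumptions for illustration, not values taken from the paper. It uses a KL divergence on the softened distributions, which differs from cross-entropy against the soft targets only by a term that does not depend on the student.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, true_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of a soft-target term and a hard-label term (hypothetical helper)."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    # KL(teacher || student) on the softened distributions
    soft_term = sum(t * math.log(t / s)
                    for t, s in zip(p_teacher, p_student) if t > 0)
    # Ordinary cross-entropy against the ground-truth class at temperature 1
    hard_term = -math.log(softmax(student_logits)[true_label])
    # Scaling by T^2 keeps the soft-term gradients comparable across temperatures
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term
```

With `alpha=1.0` the student only imitates the teacher, and the loss is zero when both produce the same logits. In practice you would compute this per batch inside your training framework, with the big model frozen as the teacher.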

Where to get models for transfer learning based on topics

Suppose you're searching for a pretrained model for e.g. human gender recognition, or age estimation (Transfer Learning).
So, you'd want a net that is trained on, ideally, human faces and not on stuff like the ImageNet dataset.
I know that there are two big starting points for the search:
Keras applications
TensorFlow Hub
Now, the best I've found is to use the search tool of the TensorFlow Hub website.
That gives me some models trained on the CelebA-HQ dataset, which is something I was searching for.
But, it didn't give any results for e.g. the keywords "sport", "food" or "gun".
So, what is a good way to find pretrained models for a desired "topic"?
It's hard to find a model for each topic at a single place.
A general strategy could be searching on GitHub with the relevant tags, e.g. ["tensorflow", "sport"].
You can generally find many models on model-zoo websites: https://modelzoo.co/
This is also useful: https://github.com/tensorflow/models
If you need code (probably with pre-trained weights): paperswithcode.com is a good place to search.

Understanding the Hugging Face transformers

I am new to the Transformers concept. I am going through some tutorials and writing my own code to understand question answering on the SQuAD 2.0 dataset with transformer models. On the Hugging Face website, I came across two different links:
https://huggingface.co/models
https://huggingface.co/transformers/pretrained_models.html
I want to know the difference between these 2 websites. Does one link have just a pre-trained model and the other have a pre-trained and fine-tuned model?
Now, if I want to use, let's say, an ALBERT model for question answering, train it on my SQuAD 2.0 training dataset and evaluate it, which of the links should I follow?
I would formulate it like this:
The second link basically describes "community-accepted models", i.e., models that serve as the basis for the implemented Hugging Face classes, like BERT, RoBERTa, etc., and some related models that have a high acceptance or have been peer-reviewed.
This list has been around much longer, whereas the list in the first link was only recently introduced directly on the Hugging Face website, where the community can basically upload arbitrary checkpoints that are simply considered "compatible" with the library. Oftentimes, these are additional models trained by practitioners or other volunteers, and they have a task-specific fine-tuning. Note that all models from /pretrained_models.html are also included in the /models interface.
If you have a very narrow use case, you might as well check whether there is already a model that has been fine-tuned on your specific task. In the worst case, you'll simply end up with the base model anyway.

Reference text for pre-training with ELMo/BERT

How-to issue:
spaCy mentions that ELMo/BERT are very effective in NLP tasks if you have little data, as these two have very good transfer learning properties.
My question: transfer learning relative to what model? If you have a language model for dogs, is finding a good language model for kangaroos easier (my case is biology-related and has a lot of terminology)?
Well, BERT and ELMo are trained on huge corpora (BERT is trained on 16GB of raw text). This implies that the embeddings produced by these models are generic, so they carry the capabilities of a language model over to most tasks.
Since your task is biology-related, you can have a look at alternatives such as BioBERT (https://arxiv.org/abs/1901.08746).

Which model (GPT2, BERT, XLNet, etc.) would you use for a text classification task? Why?

I'm trying to train a model for a sentence classification task. The input is a sentence (a vector of integers) and the output is a label (0 or 1). I've seen some articles here and there about using BERT and GPT2 for text classification tasks. However, I'm not sure which one I should pick to start with. Which of these recent NLP models, such as the original Transformer, BERT, GPT2 or XLNet, would you start with? And why? I'd rather implement it in TensorFlow, but I'm flexible enough to go for PyTorch too.
Thanks!
It highly depends on your dataset, and it is part of the data scientist's job to find which model is more suitable for a particular task in terms of the selected performance metric, training cost, model complexity, etc.
When you work on the problem you will probably test all of the above models and compare them. Which one should you choose first? Andrew Ng in "Machine Learning Yearning" suggests starting with a simple model so you can quickly iterate and test your idea, data preprocessing pipeline, etc.
Don’t start off trying to design and build the perfect system.
Instead, build and train a basic system quickly—perhaps in just a few
days
According to this suggestion, you can start with a simpler model such as ULMFiT as a baseline, verify your ideas and then move on to more complex models and see how they can improve your results.
Note that modern NLP models contain a large number of parameters, and it is difficult to train them from scratch without a large dataset. That's why you may want to use transfer learning: you can download a pre-trained model, use it as a basis, and fine-tune it on your task-specific dataset to achieve better performance and reduce training time.
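To illustrate the "start simple" advice, here is a minimal sketch of a baseline that is even simpler than ULMFiT: a bag-of-words multinomial Naive Bayes classifier in plain Python. This is a swapped-in technique chosen for brevity, not something the answer above prescribes, and the function names and toy sentences are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(sentences, labels):
    """Count documents per class and words per class on whitespace tokens."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for sentence, label in zip(sentences, labels):
        tokens = sentence.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict(model, sentence):
    """Return the label with the highest log posterior (add-one smoothing)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_counts.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for token in sentence.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented purely for illustration.
train_sentences = ["great movie", "loved the acting",
                   "terrible plot", "boring and terrible"]
train_labels = [1, 1, 0, 0]
model = train_naive_bayes(train_sentences, train_labels)
```

Calling `predict(model, "great acting")` then returns a label for a new sentence. A baseline like this gives you a number to beat and lets you debug the data pipeline before spending compute on fine-tuning a large pre-trained model.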
I agree with Max's answer, but if the constraint is to use a state-of-the-art large pretrained model, there is a really easy way to do this: the library by HuggingFace called pytorch-transformers. Whether you choose BERT, XLNet, or whatever, they're easy to swap out. Here is a detailed tutorial on using that library for text classification.
EDIT: I just came across this repo, pytorch-transformers-classification (Apache 2.0 license), which is a tool for doing exactly what you want.
Well, like others mentioned, it depends on the dataset; multiple models should be tried and the best one chosen.
However, sharing my experience, XLNet has beaten all other models so far by a good margin. Hence, if learning is not the objective, I would simply start with XLNet, then try a few more down the line and conclude. It just saves time in exploring.
The repo below is excellent for doing all this quickly. Kudos to them.
https://github.com/microsoft/nlp-recipes
It uses Hugging Face transformers and makes them dead simple. 😃
I have used XLNet, BERT, and GPT2 for summarization tasks (English only). Based on my experience, GPT2 works the best among all 3 on short paragraph-size notes, while BERT performs better for longer texts (up to 2-3 pages). You can use XLNet as a benchmark.