Optimize spaCy named entity recognition for precision

I am currently working on an NER task where precision is more important than recall. To train the model I am using the CLI command python -m spacy train en . train.json dev.json. As far as I know, it optimizes for the F1 score. Is it possible to optimize for the F0.5 score using this CLI command?

Try changing the training hyperparameters:
https://spacy.io/api/cli#train-hyperparams
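Note that the hyperparameters control how training proceeds, not which metric the CLI optimizes. If you can move to spaCy v3's config-based training, you can instead change how model-best is selected via [training.score_weights]; the weights below are an illustrative sketch that favors entity precision over the default F-score, not an exact F0.5:
[training.score_weights]
ents_f = 0.2
ents_p = 0.7
ents_r = 0.1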

Related

Training with spacy on full dataset

When I train my spacy model as follows
spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy
the model gets trained on train.spacy data file, and scored on dev.spacy. Then output_updated/model-best is the model with the highest score.
Is this best model finally trained on a combination of both train and dev data? I understand it makes sense to split those datasets to avoid overfitting, but given little training data, I would like the final model to be trained on all the data I have at hand.
No, spaCy does not automatically merge your datasets before training model-best. If you want to do that, you would need to manually create a new training data set.
If you have so little data that seems like a good idea, you should probably prioritize getting more data.
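If you do decide to train a final model on everything, a minimal sketch (assuming the train.spacy and dev.spacy paths from the question) is to merge the two DocBin files into one corpus and point --paths.train at it; note the CLI still needs some held-out data for --paths.dev:
from spacy.tokens import DocBin

# Merge the two serialized corpora into a single training file
combined = DocBin().from_disk("./train.spacy")
combined.merge(DocBin().from_disk("./dev.spacy"))
combined.to_disk("./all.spacy")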

How to use Hugging Face transformers with spaCy 3.0

Let's say that I want to include distilbert (https://huggingface.co/distilbert-base-uncased) from Hugging Face in a spaCy 3.0 pipeline. I think this is possible, and I found some code on how to convert this model for spaCy 2.0, but it doesn't work in v3.0. What I really want is to load this model using something like this:
nlp = spacy.load('path_to_distilbert')
Is it even possible, and could you please provide the exact steps to do that?
You can use spacy-transformers to this end. In spaCy v3, you can train custom pipelines using a config file, where you would define the transformer component using any HF model you like in components.transformer.model.name:
[components.transformer]
factory = "transformer"
max_batch_items = 4096
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v1"
name = "bert-base-cased"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.doc_spans.v1"
[components.transformer.set_extra_annotations]
@annotation_setters = "spacy-transformers.null_annotation_setter.v1"
You can then train any other component (NER, textcat, ...) on top of this pretrained transformer model, and the transformer weights will be further finetuned, too.
You can read more about this in the docs here: https://spacy.io/usage/embeddings-transformers#transformers-training
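If you specifically want the spacy.load-style workflow from the question without a full training run, a minimal sketch is to assemble the pipeline in code and save it (assuming spacy-transformers is installed; the distilbert-base-uncased name and the path come from the question, and the span getter settings mirror the library defaults):
import spacy

# Build a blank English pipeline with a transformer component backed by
# distilbert-base-uncased from the Hugging Face hub
nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v1",
            "name": "distilbert-base-uncased",
            "tokenizer_config": {"use_fast": True},
            "get_spans": {
                "@span_getters": "spacy-transformers.strided_spans.v1",
                "window": 128,
                "stride": 96,
            },
        }
    },
)
nlp.initialize()  # downloads and loads the transformer weights
nlp.to_disk("path_to_distilbert")  # afterwards: nlp = spacy.load("path_to_distilbert")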
It appears that the only transformer that will work out of the box is their roberta-base model. The docs mention being able to connect thousands of Hugging Face models, but there is no mention of how to add them to a spaCy pipeline.
In the meantime, if you want to use the roberta model, you can do the following.
# Install spaCy with transformer support and download the pretrained pipeline
pip install spacy[transformers]
python -m spacy download en_core_web_trf
# Then load it in Python
import spacy
nlp = spacy.load("en_core_web_trf")
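A quick sanity check of the loaded pipeline (the example sentence is arbitrary):
doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")
print([(ent.text, ent.label_) for ent in doc.ents])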

Can I use the spaCy command line tools to train an NER model containing an additional entity type?

I am trying to train spaCy models using just the python -m spacy train command line tool without writing any code of my own.
I have a training set of documents to which I have added OIL_COMPANY entity spans. I used gold.docs_to_json to create training files in the JSON-serializable format.
I can train starting from an empty model. However, if I try to extend the existing en_core_web_lg model I see the following error.
KeyError: "[E022] Could not find a transition with the name 'B-OIL_COMPANY' in the NER model."
So I need to be able to tell the command line tool to add OIL_COMPANY to an existing list of NER labels. The discussion in Training an additional entity type shows how to do this in code by calling add_label on the NER pipeline, but I don't see any command line option that does this.
Is it possible to extend an existing NER model to new entities with just the command line training tools, or do I have to write code?
Ines answered this for me on the Prodigy support forum.
I think what's happening here is that the spacy train command expects
the base model you want to update to already have all labels added
that you want to train. (It processes the data as a stream, so it's
not going to compile all labels upfront and silently add them on the
fly.) So if you want to update an existing pretrained model and add a
new label, you should be able to just add the label and save out the
base model:
ner = nlp.get_pipe("ner")
ner.add_label("YOUR_LABEL")
nlp.to_disk("./base-model")
This isn't quite writing no code but it's pretty close.
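Putting that together, a minimal sketch using the en_core_web_lg base model and the OIL_COMPANY label from the question:
import spacy

# Load the pretrained pipeline, register the new entity label on the NER
# component, and save the result for the CLI to use as a base model
nlp = spacy.load("en_core_web_lg")
ner = nlp.get_pipe("ner")
ner.add_label("OIL_COMPANY")
nlp.to_disk("./base-model")
You can then point the --base-model option (see the CLI synopsis below) at ./base-model.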
See this link for the CLI in spaCy.
Train a model. Expects data in spaCy’s JSON format. On each epoch, a model will be saved out to the directory. Accuracy scores and model details will be added to a meta.json to allow packaging the model using the package command.
python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec]
[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level]
[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel]
[--textcat-positive-label] [--verbose]
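For the question's setup, the invocation would then look something like this (the output directory and data file names are illustrative):
python -m spacy train en ./output train.json dev.json --base-model ./base-model --pipeline ner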

Using model optimizer for tensorflow slim models

I am aiming to run inference on a TensorFlow Slim model with the Intel OpenVINO Model Optimizer, using the OpenVINO docs and slides for inference and the TF-Slim docs for training the model.
It's a multi-class classification problem. I have trained a TF-Slim mobilenet_v2 model from scratch (using the script train_image_classifier.py). Evaluation of the trained model on the test set gives relatively good results to begin with (using the script eval_image_classifier.py):
eval/Accuracy[0.8017] eval/Recall_5[0.9993]
However, no single .ckpt file is saved (even though at the end of the train_image_classifier.py run there is a message like "model.ckpt is saved to checkpoint_dir"); instead there are three files (.ckpt-180000.data-00000-of-00001, .ckpt-180000.index, .ckpt-180000.meta).
OpenVINO model optimizer requires a single checkpoint file.
According to the docs, I call mo_tf.py with the following params:
python mo_tf.py --input_model D:/model/mobilenet_v2_224.pb --input_checkpoint D:/model/model.ckpt-180000 -b 1
It gives the error (the same if I pass --input_checkpoint D:/model/model.ckpt):
[ ERROR ] The value for command line parameter "input_checkpoint" must be existing file/directory, but "D:/model/model.ckpt-180000" does not exist.
The error message is clear: there are no such files on disk. But as far as I know, most TF utilities convert .ckpt-????.meta to .ckpt under the hood.
Trying to call:
python mo_tf.py --input_model D:/model/mobilenet_v2_224.pb --input_meta_graph D:/model/model.ckpt-180000.meta -b 1
Causes:
[ ERROR ] Unknown configuration of input model parameters
It doesn't matter to me how I transfer the graph to the OpenVINO intermediate representation; I just need to reach that result.
Thanks a lot.
EDIT
I managed to run the OpenVINO Model Optimizer on the frozen graph of the TF-Slim model. However, I still have no idea why my previous attempts (based on the docs) failed.
You can try converting the model to frozen format (.pb) and then convert the model using OpenVINO.
.ckpt-meta contains the metagraph: the computation graph structure without variable values (the one you can observe in TensorBoard).
.ckpt-data contains the variable values, without the graph structure. To restore a model we need both the meta and data files.
A .pb file saves the whole graph (meta + data).
As per the documentation of OpenVINO:
When a network is defined in Python* code, you have to create an inference graph file. Usually, graphs are built in a form that allows model training. That means that all trainable parameters are represented as variables in the graph. To use the graph with the Model Optimizer, it should be frozen.
https://software.intel.com/en-us/articles/OpenVINO-Using-TensorFlow
OpenVINO optimizes the model by converting the weighted graph passed in frozen form.
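A minimal sketch of that freezing step, using the TF 1.x API to match the TF-Slim workflow; the checkpoint prefix comes from the question, while the output node name is an assumption for slim's mobilenet_v2 that you should verify (e.g. with TensorFlow's summarize_graph tool):
import tensorflow as tf

# Rebuild the graph from the metagraph, restore the variable values,
# and bake them into constants so the graph can be written out frozen
saver = tf.train.import_meta_graph("D:/model/model.ckpt-180000.meta")
with tf.Session() as sess:
    saver.restore(sess, "D:/model/model.ckpt-180000")
    frozen = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def, ["MobilenetV2/Predictions/Reshape_1"]  # assumed output node
    )
tf.train.write_graph(frozen, "D:/model", "frozen_mobilenet_v2.pb", as_text=False)
The frozen graph can then be passed to the Model Optimizer in the same style as the original command: python mo_tf.py --input_model D:/model/frozen_mobilenet_v2.pb -b 1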

Train two models on two GPUs

I have two models (Theano scripts) I want to train and evaluate.
I have two GPUs I can use to train them.
How can I run a model on each GPU at the same time?
When running your script, you can choose where your program will run with THEANO_FLAGS:
THEANO_FLAGS='device=gpu0' python script_1.py
THEANO_FLAGS='device=gpu1' python script_2.py
Change gpuX for each GPU (e.g. gpu0, gpu1, gpu2, ...).
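If you would rather launch and wait for both runs from a single script instead of two shells, a minimal sketch (script names taken from the answer above):
import os
import subprocess

# Start both training scripts in parallel, pinning each one to its own GPU
# by setting THEANO_FLAGS in the child process environment
procs = [
    subprocess.Popen(["python", script], env={**os.environ, "THEANO_FLAGS": f"device={device}"})
    for script, device in [("script_1.py", "gpu0"), ("script_2.py", "gpu1")]
]
for p in procs:
    p.wait()  # block until both trainings finish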