I'm using the spaCy large model, but it's incorrectly tagging entities with categories that are not relevant to my domain; e.g. 'work of art' can cause it not to recognise what should have been an Org.
Is it possible to restrict NER to only return People, Locations and Organisations?
Short answer:
No, there is no built-in option to make NER skip specific labels (or keep only specific ones).
What you can do is limit it in code or modify the model [see long answer].
Limiting it in code just means filtering the returned entities, so it won't solve your problem with misclassifications.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# keep only the labels you care about (people, locations, organisations)
wanted = {"PERSON", "GPE", "LOC", "ORG"}
entities = [ent for ent in doc.ents if ent.label_ in wanted]
Long answer:
You can restrict NER in spacy, but not with a simple parameter (currently).
Why not? Simple: NER is a supervised machine learning task. You provide text with tagged entities, it trains and then attempts to predict new instances from the parameters it learned beforehand.
If you want NER only to recognize certain entities, such as orgs, you have to train a new model only with org instances.
If you're familiar with machine learning concepts, you'll understand it this way: in a multi-class classification task, you cannot simply remove a class without retraining the entire model on filtered training data.
Check this page for more info on NER training: https://spacy.io/usage/linguistic-features/#named-entities
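For illustration, here is a minimal sketch of that retraining approach in the spaCy v2 style used elsewhere on this page. The training example, label set and iteration count are made up; in practice you would convert your own corpus so that only the PERSON/GPE/ORG annotations remain:

import random
import spacy
from spacy.util import minibatch, compounding

# hypothetical training data containing only the labels you want the model to know
TRAIN_DATA = [
    ("Tim Cook visited London to meet Apple investors.",
     {"entities": [(0, 8, "PERSON"), (17, 23, "GPE"), (32, 37, "ORG")]}),
]

nlp = spacy.blank("en")            # start from a blank English pipeline
ner = nlp.create_pipe("ner")
nlp.add_pipe(ner)
for _, annotations in TRAIN_DATA:
    for _, _, label in annotations["entities"]:
        ner.add_label(label)

optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    losses = {}
    for batch in minibatch(TRAIN_DATA, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.3, losses=losses)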
Related
I have an already existing spaCy model which I want to refine with additional training data at runtime.
For example, a training example in my dataset looks like this:
text="Anna lives in Munich and works at BMW"
entity: name=Anna
entity: city=Munich
entity: company=BMW
In my implementation I take the ner from the existing model before I start my new training with:
nlp = spacy.load(modelPath)
ner = nlp.get_pipe('ner')
and then I train my existing model with my new training data:
from spacy.util import minibatch, compounding

losses = {}
# batch up the examples using spaCy's minibatch, which is much faster than
# feeding them to the model one at a time
batches = minibatch(trainingData, size=compounding(4.0, 32.0, 1.001))
for batch in batches:
    texts, annotations = zip(*batch)
    nlp.update(
        texts,        # batch of texts
        annotations,  # batch of annotations
        # drop=0.5,   # dropout - make it harder to memorise data
        losses=losses,
    )
Now I have the following question:
My existing ner model contains already the three entities with the labels
city, name, company
But my new training dataSet has only the entities 'city' and 'name' (not the entity 'company'). Like
text="Bob lives in London"
entity: name=Bob
entity: city=London
Because only 'city' and 'name' are part of my sentence.
Now I have the impression that my model quality degrades if I retrain the model with training datasets that contain fewer entity types than the current model knows.
Would it be clever to (re)set the ner in my model with only the entity labels contained in my current training dataset before I start the training?
Something like this:
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('city')
ner.add_label('name')
Or does this not make sense?
Now I have the impression that my model quality degrades if I retrain the model with training datasets that contain fewer entity types than the current model knows.
Yes. This is called catastrophic forgetting.
Would it be clever to (re)set the ner in my model with only the entity labels contained in my current training dataset before I start the training?
In my opinion, yes. If your current training data doesn't have company names in it, the model will become biased as you keep training it, and if in the future you decide to use the same model to detect company names, it will tag cities or person names as companies because it has forgotten what company names are.
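A common complementary mitigation (not mentioned above, so treat it as a suggestion) is to mix a few "revision" examples that still contain company entities into the new training data, so the model keeps seeing all of its labels. A rough sketch in the same spaCy v2 style, reusing the example sentences from the question:

import random
import spacy
from spacy.util import minibatch, compounding

nlp = spacy.load(modelPath)   # existing model that knows city/name/company

# new data (only city/name) mixed with revision examples that still use 'company'
new_data = [("Bob lives in London",
             {"entities": [(0, 3, "name"), (13, 19, "city")]})]
revision_data = [("Anna lives in Munich and works at BMW",
                  {"entities": [(0, 4, "name"), (14, 20, "city"), (34, 37, "company")]})]
trainingData = new_data + revision_data

optimizer = nlp.resume_training()
for i in range(10):
    random.shuffle(trainingData)
    losses = {}
    for batch in minibatch(trainingData, size=compounding(4.0, 32.0, 1.001)):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, sgd=optimizer, drop=0.35, losses=losses)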
I'm currently developing a Question-Answering system (in Indonesian) using BERT for my thesis.
The dataset and the questions given are in Indonesian.
The problem is, I'm still not clear on the step-by-step process for developing the Question-Answering system with BERT.
From what I concluded after reading a number of research journals and papers, the process might be like this:
Prepare main dataset
Load Pre-Train Data
Train the main dataset with the pre-trained data (so that it produces a "fine-tuned" model)
Cluster the fine-tuned model
Testing (giving questions to the system)
Evaluation
What I want to ask is:
Are those steps correct? Or are there any missing steps?
Also, if the default pre-trained data that BERT provides is in English while my main dataset is in Indonesian, how can I create my own Indonesian pre-trained data?
Does it really require clustering of the data/model in BERT?
I appreciate any helpful answer(s).
Thank you very much in advance.
I would take a look at Huggingface's Question & Answer examples. That would at least be a good place to start.
from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
text = r"""
🤗 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""
questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer.encode_plus(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]
    text_tokens = tokenizer.convert_ids_to_tokens(input_ids)
    answer_start_scores, answer_end_scores = model(**inputs)
    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score
    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))
    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
I am trying to train spaCy models using just the python -m spacy train command line tool without writing any code of my own.
I have a training set of documents to which I have added OIL_COMPANY entity spans. I used gold.docs_to_json to create training files in the JSON-serializable format.
I can train starting from an empty model. However, if I try to extend the existing en_core_web_lg model I see the following error.
KeyError: "[E022] Could not find a transition with the name 'B-OIL_COMPANY' in the NER model."
So I need to be able to tell the command line tool to add OIL_COMPANY to an existing list of NER labels. The discussion in Training an additional entity type shows how to do this in code by calling add_label on the NER pipeline, but I don't see any command line option that does this.
Is it possible to extend an existing NER model to new entities with just the command line training tools, or do I have to write code?
Ines answered this for me on the Prodigy support forum.
I think what's happening here is that the spacy train command expects
the base model you want to update to already have all labels added
that you want to train. (It processes the data as a stream, so it's
not going to compile all labels upfront and silently add them on the
fly.) So if you want to update an existing pretrained model and add a
new label, you should be able to just add the label and save out the
base model:
ner = nlp.get_pipe("ner")
ner.add_label("YOUR_LABEL")
nlp.to_disk("./base-model")
This isn't quite writing no code but it's pretty close.
See this link for the CLI in spaCy.
Train a model. Expects data in spaCy’s JSON format. On each epoch, a model will be saved out to the directory. Accuracy scores and model details will be added to a meta.json to allow packaging the model using the package command.
python -m spacy train [lang] [output_path] [train_path] [dev_path]
[--base-model] [--pipeline] [--vectors] [--n-iter] [--n-early-stopping]
[--n-examples] [--use-gpu] [--version] [--meta-path] [--init-tok2vec]
[--parser-multitasks] [--entity-multitasks] [--gold-preproc] [--noise-level]
[--orth-variant-level] [--learn-tokens] [--textcat-arch] [--textcat-multilabel]
[--textcat-positive-label] [--verbose]
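Putting the two pieces together, an invocation could look roughly like this (all paths are placeholders, and ./base-model is the copy of en_core_web_lg saved after adding the OIL_COMPANY label as shown above):

python -m spacy train en ./output ./train.json ./dev.json \
    --base-model ./base-model \
    --pipeline ner \
    --n-iter 30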
Can we use neural machine translation (like seq2seq) for named entity recognition, e.g. using a Transformer network for the NER task? The source would be the word sequence and the target the tag sequence, like "O O O PERSON O O O LOCATION". Is this possible?
Yes, you can use Transformer-based language models for NER tasks. You may check this paper BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. The pre-trained Transformer-based language models are utilized as transfer learning (i.e., fine-tuning) to further use for some tasks, such as Question Answering, Text Classification, Named Entity Recognition, etc.
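In practice, NER with BERT is usually framed as token classification (one label per token) rather than sequence-to-sequence generation. A minimal sketch using the Hugging Face transformers pipeline; the checkpoint name is just one publicly available example, and any BERT model fine-tuned for token classification can be substituted:

from transformers import pipeline

# Assumption: "dslim/bert-base-NER" is a BERT checkpoint fine-tuned for NER on
# CoNLL-2003; any token-classification checkpoint works the same way.
ner_tagger = pipeline("ner", model="dslim/bert-base-NER")

for token in ner_tagger("Angela Merkel met the Siemens board in Munich."):
    print(token["word"], token["entity"], round(token["score"], 3))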
I'm trying to use an example LSTM, trained according to the TensorFlow LSTM example. This example lets you compute perplexity on the whole test set. But I need to use the trained model to score (get log-likelihoods for) each sentence separately, to score hypotheses from the STT decoder output. I modified the reader a bit and used this code:
mtests = list()
test_input = list()
with tf.name_scope("Test"):
    for test_data_item in test_data:
        test_input.append(PTBInput(config=eval_config, data=test_data_item, name="TestInput"))
    with tf.variable_scope("Model", reuse=True, initializer=initializer):
        for test_input_item in test_input:
            mtests.append(PTBModel(is_training=False, config=eval_config,
                                   input_=test_input_item))

sv = tf.train.Supervisor(logdir=FLAGS.model_dir)
with sv.managed_session() as session:
    checkpoint = tf.train.latest_checkpoint(FLAGS.model_dir)
    sv.saver.restore(session, checkpoint)
    sys.stderr.write("model restored\n")
    for mtest in mtests:
        score, test_perplexity = run_epoch_test(session, mtest)
        print(score)
So, using that code, I get the score of each sentence independently. If I pass 5 sentences, it works fine. But if I pass 1k sentences, it runs extremely slowly and uses a lot of memory, because I create 1k model instances in mtests. So, could you tell me another way to reach my goal? Thank you.
It seems like the model can take a batch of inputs, which is set to 20 in all cases by default. You should be able to feed a larger batch of sentences to one test model to get the output for all of them without having to create multiple model instances. This probably involves some experimenting with the reader, which you are already familiar with.