SpaCy NER doesn't seem to correctly recognize hyphenated names - spacy

1st thing 1st: I'm new to SpaCy and just started to test it. I have to say that I'm impressed by its simplicity and the doc quality. Thanks!
Now, I'm trying to identify PER in a French text. It seems to work pretty well for most but I saw a recurring incorrect pattern: names with a hyphen are not correctly recognized (ex: Pierre-Louis Durand will appear as two PER: "Pierre" and "Louis Durand").
See example:
import spacy
# nlp = spacy.load('fr')
nlp = spacy.load('fr_core_news_md')
description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
'Louis-Jean s\'est trompé. Claire a bien choisi.')
text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
entities = [e.string for e in text.ents if label==e.label_]
entities = list(entities)
print(label, entities)
output is:
LOC ['Boston ']
PER ['Jean', 'Sébastien Durand ', 'Pierre Dupond ', 'Louis', 'Jean ', 'Claire ']
It should be: "Jean-Sébastien Durand" and "Louis-Jean".
I'm not sure what to do here:
change the way tokens are extracted (I'm wondering about the side effect for non PER) - I don't think this is the issue as a PER can be an aggregation of multiple tokens
apply a magic setting somewhere so that hyphen can be used in NER for PER
train the model
go back to school ;-)
Thanks for your help (and yes I'm investigating by reading more, I love it)!
-TC

I initially thought this would be a mismatch between the tokenizer and the training data, but it's actually a problem with how the regex that handles some words with hyphens is loaded from the saved model.
A temporary fix for spacy v2.2 models (which you have to do every time after loading a French model) is to replace the problematic tokenizer setting with the correct default setting:
import spacy
from spacy.lang.fr import French
nlp = spacy.load("fr_core_news_md")
nlp.tokenizer.token_match = French.Defaults.token_match
description = ('C\'est Jean-Sébastien Durand qui leur a dit. Pierre Dupond n\'est pas venu à Boston comme attendu. '
'Louis-Jean s\'est trompé. Claire a bien choisi.')
text = nlp(description)
labels = set([w.label_ for w in text.ents])
for label in labels:
entities = [e.text for e in text.ents if label==e.label_]
entities = list(entities)
print(label, entities)
Output:
PER ['Jean-Sébastien Durand', 'Pierre Dupond']
LOC ['Boston', 'Louis-Jean', 'Claire']
(The French NER model is trained on data from Wikipedia, so it still doesn't do very well on the entity types for this particular text.)

Related

Patient name extraction using MedSpacy

I was looking for some guidence on NER using medspacy. Aware of disease extraction using MedSpacy but the aim is to extract patient name from medical report using medspacy.
Text supposed to be :
patient:Jeromy, David (DOB)
Date range 2020 to 2022. Visited Dr Brian. Suffered from ...
This type of dataset is there, want to extract patient name from all the pages of medical reports using MedSpacy. I know target rules can be helpful but any clarified guidence will be appreciated.
Thanks & regards
If you find that the default SpaCy NER model is not sufficient, as it will not pick up names such as "Byrn, John", I have a couple of suggestions:
Train a custom NER component using SpaCy's Prodigy annotation tool, which you can use to easily label some examples of names. This is a rather simple task, so you can likely train a model with less than 100 diverse examples. Note: Prodigy is a paid tool, so see my other suggestions if you do not have access/are not willing to pay.
Train a custom NER component without Prodigy. Similar to the above approach, but slightly more involved. This Medium article provides a beginner-friendly introduction to doing so, and you can also refer to SpaCy's own documentation. You can provide SpaCy with some examples of texts and the entities you want extracted, like so:
TRAIN_DATA = [
('Patient: Byrn, John', {
'entities': [(9, 19, 'PERSON')]
}),
('John Byrn received 10mg of advil', {
'entities': [(0, 10, 'PERSON')]
})
]
Build rules based on existing SpaCy components. You can leverage existing SpaCy pipeline components (you don't necessarily need MedSpaCy for this), such as POS tagging and Dependency Parsing. For example, you can look for proper nouns in your documents to identify names. Check out the docs on POS tagging here.
Try other pretrained NER models. There may be other models that are better suited to your task. Check out other models on SpaCy Universe, or even better, on HuggingFaceHub, which contains some of the best models out there for every use case. Added bonus of HF Hub is that you can try out the models on each model model page, and assess the performance on some examples before you decide.
Hope this helps!

Why don't spacy transformer models do NER for non-english models?

Why is it that spacy transformer models for languages like spanish (es_dep_news_trf) don't do named entity recognition.
However, for english (en_core_web_trf) it does.
In code:
import spacy
nlp=spacy.load("en_core_web_trf")
doc=nlp("my name is John Smith and I work at Apple and I like visiting the Eiffel Tower")
print(doc.ents)
(John Smith, Apple, the Eiffel Tower)
nlp=spacy.load("es_dep_news_trf")
doc=nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
print(doc.ents)
()
Why doesn't spanish extract entities but english does?
It has to do with the available training data. ner is only included for the trf models if the training data has NER annotation on the exact same data as for tagging and parsing.
Training trf models on partial annotation does not work well in practice and an independent NER component (as in the CNN pipelines) would mean including an additional transformer component in the pipeline, which would make the pipeline a lot larger and slower.
The spaCy models vary with regards to which NLP features they provide - this is just a result of how the respective authors created/trained them. I.e., https://spacy.io/models/en#en_core_web_trf lists "ner" in its components, but https://spacy.io/models/es#es_dep_news_trf does not.
The Spanish https://spacy.io/models/es#es_core_news_lg (as well the two smaller variants) does list "ner" in its components, so they show named entities:
>>> import spacy
>>> nlp=spacy.load("es_core_news_sm")
>>> doc=nlp("mi nombre es John Smith y trabajo en Apple y me gusta visitar la Torre Eiffel")
>>> print(doc.ents)
(John Smith, Apple, Torre Eiffel)

Any way to extract the exhaustive vocabulary of the google universal sentence encoder large?

I have some sentences for which I am creating an embedding and it works great for similarity searching unless there are some truly unusual words in the sentence.
In that case, the truly unusual words in fact contain the very most similarity information of any words in the sentence BUT all of that information is lost during embedding due to the fact that the word is apparently not in the vocabulary of the model.
I'd like to get a list of all of the words known by the GUSE embedding model so that I can mask those known words out of my sentence, leaving only the "novel" words.
I can then do an exact word search for those novel words in my target corpus and achieve usability for my similar sentence searching.
e.g. "I love to use Xapian!" gets embedded as "I love to use UNK".
If I just do a keyword search for "Xapian" instead of a semantic similarity search, I'll get much more relevant results than I would using GUSE and vector KNN.
Any ideas on how I can extract the vocabulary known/used by GUSE?
I combine the earlier answer from #Roee Shenberg and the solution provided here to come up with solution, which is applicable for USE v4:
import importlib
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model("/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/")
graph = saved_model.meta_graphs[0].graph_def
fns = [f for f in saved_model.meta_graphs[0].graph_def.library.function if "ptb" in str(f).lower()];
print(len(fns)) # should be 1
nodes_with_sp = [n for n in fns[0].node_def if n.name == "Embeddings_words"]
print(len(nodes_with_sp)) # should be 1
words_tensor = nodes_with_sp[0].attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # should be 400004
If you are just curious about the words I upload them here.
I'm assuming you have tensorflow & tensorflow_hub installed, and youhave already downloaded the model.
IMPORTANT: I'm assuming you're looking at https://tfhub.dev/google/universal-sentence-encoder/4! There's no guarantee the object graph looks the same for different versions, it's likely that modifications will be needed.
Find it's location on disk - it's somewhere at /tmp/tfhub_modules unless you set the TFHUB_CACHE_DIR environment variable (Windows/Mac have different locations). The path should contain a file called saved_model.pb, which is the model, serialized using Protocol Buffers.
Unfortunately, the dictionary is serialized inside the model's Protocol Buffers file and not as an external asset, so we'll have to load the model and get the variable from it.
The strategy is to use tensorflow's code to deserialize the file, and then travel down the serialized object tree all the way to the dictionary.
import importlib
MODEL_PATH = 'path/to/model/dir' # e.g. '/tmp/tfhub_modules/063d866c06683311b44b4992fd46003be952409c/'
# Use the tensorflow internal Protobuf loader. A regular import statement will fail.
loader_impl = importlib.import_module('tensorflow.python.saved_model.loader_impl')
saved_model = loader_impl.parse_saved_model(MODEL_PATH)
# reach into the object graph to get the tensor
graph = saved_model.meta_graphs[0].graph_def
function = graph.library.function
node_type, node_value = function[5].node_def
# if you print(node_type) you'll see it's called "text_preprocessor/hash_table"
# as well as get insight into this branch of the object graph we're looking at
words_tensor = node_value.attr.get("value").tensor
word_list = [i.decode('utf-8') for i in words_tensor.string_val]
print(len(word_list)) # -> 400004
Some resources that helped:
A GitHub issue relating to changing the vocabulary
A Tensorflow Google-group thread linked from the issue
Extra Notes
Despite what the GitHub issue may lead you to think, the 400k words here are not the GloVe 400k vocabulary. You can verify this by downloading the GloVe 6B embeddings (file link), extracting glove.6B.50d.txt, and then using the following code to compare the two dictionaries:
with open('/path/to/glove.6B.50d.txt') as f:
glove_vocabulary = set(line.strip().split(maxsplit=1)[0] for line in f)
USE_vocabulary = set(word_list) # from above
print(len(USE_vocabulary - glove_vocabulary)) # -> 281150
Inspecting the different vocabularies is interesting in and of itself, e.g. why does GloVe have an entry for '287.9'?

Spacy - English language model outruns german language model on german text?

Is it by design that the english language model performs better on german salution entities than the german model?
# pip install spacy
# python -m spacy download en
# python -m spacy download de
nlp = spacy.load('en')
# Uncomment line below to get less good results
# nlp = spacy.load('de')
# Process text
text = (u"Das Auto kauft Herr Müller oder Frau Meier, Frank Muster")
doc = nlp(text)
# Find named entities
for entity in doc.ents:
print(entity.text, entity.label_)
expected result if using nlp = spacy.load('en'). All three PERSON is returned
Das Auto ORG
Herr Müller PERSON
Frau Meier PERSON
Frank Muster PERSON
unexpected result if using nlp = spacy.load('de'). Only one of three PERSON is returned
Frank Muster PERSON
Info about spaCy
spaCy version:** 2.0.12
Platform:** Linux-4.17.2-1-ARCH-x86_64-with-arch-Arch-Linux
Python version:** 3.6.5
Models:** en, de
It's not by design, but it's certainly possible that this is a side-effect of the training data and the statistical predictions. The English model is trained on a larger NER corpus with more entity types, while the German model uses NER data based on Wikipedia.
In Wikipedia text, full names like "Frank Muster" are quite common, whereas things like "Herr Muster" are typically avoided. This might explain why the model only labels the full name as a person and not the others. The example sentence also makes it easy for the English model to guess correctly – in English, capitalization is a much stronger indicator of a named entity than it is in German. This might explain why the model consistently labels all capitalised multi-word spans as entities.
In any case, this is a good example of how subtle language-specific or stylistic conventions end up influencing a model's predictions. It also shows why you almost always want to fine-tune a model with more examples specific to your data. I do think that the German model will likely perform better on German texts overall, but if references like "Herr Müller" are common in your texts, you probably want to update the model with more examples of them in different contexts.

SpaCy similarity score makes no sense

I am trying to figure out if I trust SpaCy's similarity function and I am getting confused. Here's my toy example:
import spacy
nlp = spacy.load('en')
doc1 = nlp(u'Unsalted butter')
doc2 = nlp(u'babi carrot peel babi carrot grim french babi fresh babi roundi fresh exot petit petit peel shred carrot dole shred')
doc1.similarity(doc2)
I get the similarity of 0.64. How can it be this high for two sentences with no overlapping tokens? Could someone please explain this to me? Thank you!
The problem is that you are using en model which is most likely linked to en_core_web_sm.
The en_core_web_sm model doesn't have pretrained glove vectors, so you are using the vectors produced by NER, PoS and DEP tagger to compute similarity.
These vectors encode structural information, e.g. having the same PoS tag or DEP role in the sentence. There is no semantic information encoded in these vectors, so the result you are getting is as expected, weird.
Have also a look here.