Displaying the description of an entity from its KB ID in spaCy entity linking

I have successfully trained a spaCy entity linking model (on a limited subset of the data, of course).
My question is: how do I display the description of an entity from the KB as output?
import spacy
nlp = spacy.load(r"D:\el model\nlp")
doc = nlp("Amir Khan is a great boxer")
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents)

As Sofie Van Landeghem (spaCy entity linking representative) explained:
The descriptions are currently not stored in the KB itself because of performance reasons. However, from the intermediary results during processing, you should have a file entity_descriptions.csv that maps the WikiData ID to its description in a simple tabular format.
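A minimal sketch of that lookup, assuming entity_descriptions.csv has a header row followed by one ID and description per line, separated by `|`. The delimiter, the header row, and the helper names `load_descriptions` and `describe_entities` are assumptions for illustration, so check them against your own file first.

```python
def load_descriptions(csv_path, delimiter="|", has_header=True):
    """Read a two-column file (entity ID, description) into a dict."""
    descriptions = {}
    with open(csv_path, encoding="utf8") as f:
        if has_header:
            next(f, None)  # skip the header row (assumed to be present)
        for line in f:
            line = line.rstrip("\n")
            if delimiter not in line:
                continue  # ignore blank or malformed lines
            # Split only once so delimiters inside the description survive
            kb_id, description = line.split(delimiter, 1)
            descriptions[kb_id] = description
    return descriptions


def describe_entities(ents, descriptions):
    """Attach a description to each (text, label, kb_id) tuple."""
    return [
        (text, label, kb_id, descriptions.get(kb_id, "<no description>"))
        for text, label, kb_id in ents
    ]
```

With the `ents` list from the question, `describe_entities(ents, load_descriptions("entity_descriptions.csv"))` would pair each entity with its description. The path is a placeholder for wherever your intermediate files were written.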

Related

Patient name extraction using MedSpacy

I am looking for some guidance on NER using medspaCy. I am aware of disease extraction with medspaCy, but my aim is to extract the patient name from medical reports.
The text looks like this:
patient:Jeromy, David (DOB)
Date range 2020 to 2022. Visited Dr Brian. Suffered from ...
My dataset looks like this, and I want to extract the patient name from every page of the medical reports using medspaCy. I know target rules can be helpful, but any clearer guidance would be appreciated.
Thanks & regards
If you find that the default SpaCy NER model is not sufficient, as it will not pick up names such as "Byrn, John", I have a couple of suggestions:
Train a custom NER component using the Prodigy annotation tool (from the makers of spaCy), which you can use to easily label some examples of names. This is a rather simple task, so you can likely train a model with fewer than 100 diverse examples. Note: Prodigy is a paid tool, so see my other suggestions if you do not have access or are not willing to pay.
Train a custom NER component without Prodigy. Similar to the above approach, but slightly more involved. This Medium article provides a beginner-friendly introduction to doing so, and you can also refer to SpaCy's own documentation. You can provide SpaCy with some examples of texts and the entities you want extracted, like so:
TRAIN_DATA = [
    ('Patient: Byrn, John', {
        'entities': [(9, 19, 'PERSON')]  # text[9:19] == 'Byrn, John'
    }),
    ('John Byrn received 10mg of advil', {
        'entities': [(0, 9, 'PERSON')]  # text[0:9] == 'John Byrn' (end offset is exclusive)
    })
]
Build rules based on existing SpaCy components. You can leverage existing SpaCy pipeline components (you don't necessarily need MedSpaCy for this), such as POS tagging and Dependency Parsing. For example, you can look for proper nouns in your documents to identify names. Check out the docs on POS tagging here.
Try other pretrained NER models. There may be other models that are better suited to your task. Check out other models on spaCy Universe, or even better, on the Hugging Face Hub, which hosts some of the best models out there for every use case. An added bonus of the HF Hub is that you can try out a model on its model page and assess its performance on some examples before you decide.
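If you go the custom training route, hand-counted character offsets are easy to get wrong. The sketch below computes them from the text instead. `make_example` is a hypothetical helper, not a spaCy function, and it assumes the name occurs exactly as written and that its first occurrence is the one you want to label.

```python
def make_example(text, name, label="PERSON"):
    """Build a (text, annotations) tuple with offsets computed from the text."""
    start = text.find(name)
    if start == -1:
        raise ValueError(f"{name!r} does not occur in {text!r}")
    end = start + len(name)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


TRAIN_DATA = [
    make_example("Patient: Byrn, John", "Byrn, John"),
    make_example("John Byrn received 10mg of advil", "John Byrn"),
]

# Sanity check: print the exact span each annotation covers
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(repr(text[start:end]), label)
```

Printing the slices is a cheap way to catch spans that include a stray space or cut off a character before training.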
Hope this helps!

Does spaCy support custom types for Named Entity Recognition?

The spaCy documentation for the 'Named Entity Recognition' feature (https://spacy.io/usage/linguistic-features#named-entities)
states that spaCy can recognize 'various types' of named entities such as 'PERSON', 'LOC', 'PRODUCT' (https://spacy.io/api/annotation#named-entities).
My question is: can I also train on data with my own custom entities? For example, I would like to train on invoice data to recognize an IBAN / BIC or an invoice number. Is this possible, or is the feature restricted to a fixed list of entities?
It does support custom entities; see the documentation section titled "Training an additional entity type".
For example, to add a label called MY_ANIMAL, you can use training data like this:
TRAIN_DATA = [
    (
        "Horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, "MY_ANIMAL")]},
    ),
    ("Do they bite?", {"entities": []}),
    (
        "horses are too tall and they pretend to care about your feelings",
        {"entities": [(0, 6, "MY_ANIMAL")]},
    ),
]
You then feed that into either an existing NER model as additional training, or a newly created NER pipe.
However, a caveat: the ML model is optimized for recognizing named entities, which are usually capitalized nouns like "John", "London" or "The Times". You can also try to train it on more generic things like numbers, but it may not work as well.
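For rigidly formatted fields such as an IBAN, the caveat above suggests that a pattern-based check may be a useful complement to a trained model. The sketch below is plain Python and not part of spaCy; the names `IBAN_CANDIDATE`, `is_valid_iban` and `find_ibans` are made up for illustration. It combines a loose regular expression with the standard mod-97 IBAN checksum and does not check country-specific lengths.

```python
import re

# Loose pattern: 2 letters, 2 check digits, then 11-30 alphanumerics,
# optionally written in space-separated groups.
IBAN_CANDIDATE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]){11,30}\b")


def is_valid_iban(candidate):
    """Check the mod-97 checksum of an IBAN candidate."""
    iban = candidate.replace(" ", "").upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", iban):
        return False
    # Move country code and check digits to the end
    rearranged = iban[4:] + iban[:4]
    # Letters map to numbers: A -> 10, B -> 11, ..., Z -> 35
    digits = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(digits) % 97 == 1


def find_ibans(text):
    """Return (start, end, 'IBAN') spans for checksum-valid candidates."""
    return [
        (m.start(), m.end(), "IBAN")
        for m in IBAN_CANDIDATE.finditer(text)
        if is_valid_iban(m.group())
    ]
```

The spans use the same (start, end, label) shape as the training data above, so they could also serve as a starting point for annotations. One known limitation of this sketch: the greedy pattern can swallow an uppercase word that directly follows the number, in which case the checksum fails and the candidate is dropped.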

How to visualize results of LDA topic modelling as shown below

As part of an assignment, I have been asked to do topic modeling using LDA and to visualize the words that fall under the top 3 topics, as shown in screenshot 1 below. However, even after a lot of searching I have not been able to find a resource that helps me achieve this. All the resources on text visualization point towards word clouds, but my goal is not to use word cloud visualizations.
[Screenshot 1: required LDA topic visualization]
Any help will be greatly appreciated.
If you use gensim to generate the LDA model (gensim.models.ldamodel.LdaModel()) you can use the following to easily visualize the key words related to each topic:
import gensim

# Example of LDA model building
# (corpus and id2word are assumed to have been prepared already):
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)
# Print the keywords of each topic
print(lda_model.print_topics())
For a better visualization, I also suggest using pyLDAvis:
import pyLDAvis
# Note: in pyLDAvis 3.3 and later this module is named pyLDAvis.gensim_models
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
Here you pass in your LDA model, the corpus and a dictionary mapping ids to tokens (id2word).
Have a look at the example here for a detailed explanation:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
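If you want a per-topic view of the top words without a word cloud or a notebook, one option is to render the word weights as simple text bars. This is only a sketch: `render_topics` is a hypothetical helper and the weights below are invented for illustration. With gensim, a mapping of this shape could be built from `lda_model.show_topic(topic_id, topn=10)`, which returns (word, probability) pairs.

```python
def render_topics(topics, width=30):
    """Return a text chart with one bar per word, scaled per topic to `width`."""
    lines = []
    for topic_id, words in topics.items():
        lines.append(f"Topic {topic_id}")
        max_weight = max(weight for _, weight in words)
        longest = max(len(word) for word, _ in words)
        # Heaviest words first
        for word, weight in sorted(words, key=lambda p: p[1], reverse=True):
            bar = "#" * max(1, round(width * weight / max_weight))
            lines.append(f"  {word:<{longest}} {bar} {weight:.3f}")
    return "\n".join(lines)


# Hypothetical weights for illustration only. With a trained model, use e.g.
# {i: lda_model.show_topic(i, topn=10) for i in range(3)}
sample_topics = {
    0: [("car", 0.040), ("engine", 0.020), ("road", 0.010)],
    1: [("game", 0.030), ("team", 0.015)],
}
print(render_topics(sample_topics))
```

The same (word, weight) pairs could be passed to a plotting library to draw horizontal bar charts, one panel per topic.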

spaCy: English language model outperforms German language model on German text?

Is it by design that the English language model performs better on German salutation entities than the German model?
# pip install spacy
# python -m spacy download en
# python -m spacy download de
import spacy

nlp = spacy.load('en')
# Uncomment the line below to get worse results
# nlp = spacy.load('de')
# Process text
text = u"Das Auto kauft Herr Müller oder Frau Meier, Frank Muster"
doc = nlp(text)
# Find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)
Expected result when using nlp = spacy.load('en'): all three PERSON entities are returned.
Das Auto ORG
Herr Müller PERSON
Frau Meier PERSON
Frank Muster PERSON
Unexpected result when using nlp = spacy.load('de'): only one of the three PERSON entities is returned.
Frank Muster PERSON
Info about spaCy
spaCy version: 2.0.12
Platform: Linux-4.17.2-1-ARCH-x86_64-with-arch-Arch-Linux
Python version: 3.6.5
Models: en, de
It's not by design, but it's certainly possible that this is a side-effect of the training data and the statistical predictions. The English model is trained on a larger NER corpus with more entity types, while the German model uses NER data based on Wikipedia.
In Wikipedia text, full names like "Frank Muster" are quite common, whereas things like "Herr Muster" are typically avoided. This might explain why the model only labels the full name as a person and not the others. The example sentence also makes it easy for the English model to guess correctly – in English, capitalization is a much stronger indicator of a named entity than it is in German. This might explain why the model consistently labels all capitalised multi-word spans as entities.
In any case, this is a good example of how subtle language-specific or stylistic conventions end up influencing a model's predictions. It also shows why you almost always want to fine-tune a model with more examples specific to your data. I do think that the German model will likely perform better on German texts overall, but if references like "Herr Müller" are common in your texts, you probably want to update the model with more examples of them in different contexts.

Spacy says dependency parser not loaded

I installed spaCy v2.0.2 on Ubuntu 16.04. I then used
sudo python3 -m spacy download en
to download the English model.
After that I use Spacy as follows:
from spacy.lang.en import English
p = English(parser=True, tagger=True, entity=True)
d = p("This is a sentence. I am who I am.")
print(list(d.sents))
I get this error however:
File "doc.pyx", line 511, in __get__
ValueError: Sentence boundary detection requires the dependency parse, which requires a statistical model to be installed and loaded. For more info, see the documentation:
https://spacy.io/usage/models
I really can't figure out what is going on. I have this version of the 'en' model installed:
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.0.0/en_core_web_sm-2.0.0.tar.gz
which I think is the default. Any help is appreciated. Thank you.
I think the problem here is quite simple – when you call this:
p = English(parser=True, tagger=True, entity=True)
... spaCy will load the English language class containing the language data and special-case rules, but no model data and weights, which enable the parser, tagger and entity recognizer to make predictions. This is by design, because spaCy has no way of knowing if you want to load in model data and if so, which package.
So if you want to load an English model, you'll have to use spacy.load(), which will take care of loading the data, and putting together the language and processing pipeline:
nlp = spacy.load('en_core_web_sm') # name of model, shortcut name or path
Under the hood, spacy.load() will look for an installed model package called en_core_web_sm, load it and check the model's meta data to determine which language the model needs (in this case, English) and which pipeline it supports (in this case, tagger, parser and NER). It then initialises an instance of English, creates the pipeline, loads in the binary data from the model package and returns the object so you can call it on your text. See this section for a more detailed explanation of this.