Patient name extraction using MedSpacy - spacy

I am looking for some guidance on NER with medspaCy. I am aware of disease extraction with medspaCy, but the aim here is to extract the patient name from medical reports.
The text looks like this:
patient:Jeromy, David (DOB)
Date range 2020 to 2022. Visited Dr Brian. Suffered from ...
This is the kind of dataset I have; I want to extract the patient name from all pages of the medical reports using medspaCy. I know target rules can be helpful, but any clear guidance will be appreciated.
Thanks & regards

If you find that the default spaCy NER model is not sufficient, e.g. it will not pick up names such as "Byrn, John", I have a couple of suggestions:
Train a custom NER component using Prodigy, the annotation tool from the makers of spaCy, which lets you easily label some examples of names. This is a rather simple task, so you can likely train a model with fewer than 100 diverse examples. Note: Prodigy is a paid tool, so see my other suggestions if you do not have access or are not willing to pay.
Train a custom NER component without Prodigy. Similar to the above approach, but slightly more involved. This Medium article provides a beginner-friendly introduction, and you can also refer to spaCy's own documentation. You provide spaCy with some example texts and the entities you want extracted, like so:
TRAIN_DATA = [
    ('Patient: Byrn, John', {
        'entities': [(9, 19, 'PERSON')]
    }),
    ('John Byrn received 10mg of advil', {
        # end offsets are exclusive, so "John Byrn" spans characters 0-9
        'entities': [(0, 9, 'PERSON')]
    })
]
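From there, a minimal training loop could look roughly like the sketch below. This assumes spaCy v3's Example API; the blank "en" pipeline and the number of iterations are placeholders to adjust for your data.
import random
import spacy
from spacy.training import Example

# Sketch only: start from a blank English pipeline and add an NER component
nlp = spacy.blank("en")
ner = nlp.add_pipe("ner")
ner.add_label("PERSON")

# Convert the (text, annotations) tuples above into Example objects
examples = [Example.from_dict(nlp.make_doc(text), ann) for text, ann in TRAIN_DATA]

optimizer = nlp.initialize(lambda: examples)
for i in range(30):  # number of passes is a guess; tune on your data
    random.shuffle(examples)
    losses = {}
    nlp.update(examples, sgd=optimizer, losses=losses)

print(nlp("Patient: Byrn, John").ents)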
Build rules based on existing spaCy components. You can leverage existing spaCy pipeline components (you don't necessarily need medspaCy for this), such as POS tagging and dependency parsing. For example, you can look for proper nouns in your documents to identify names; a rough sketch follows below. Check out the docs on POS tagging here.
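For instance, a rule-based sketch with the spaCy v3-style Matcher. The "patient:" pattern below is only an assumption about how your reports are formatted, so treat it as a starting point rather than a finished rule:
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# Look for the word "patient", optional punctuation, then one or more proper nouns,
# optionally followed by a comma and more proper nouns ("Lastname, Firstname")
matcher.add("PATIENT_NAME", [[
    {"LOWER": "patient"},
    {"IS_PUNCT": True, "OP": "?"},
    {"POS": "PROPN", "OP": "+"},
    {"IS_PUNCT": True, "OP": "?"},
    {"POS": "PROPN", "OP": "*"},
]])

doc = nlp("patient: Jeromy, David (DOB)")
spans = [doc[start:end] for _, start, end in matcher(doc)]
for span in filter_spans(spans):  # keep only the longest match
    print(span.text)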
Try other pretrained NER models. There may be other models that are better suited to your task. Check out other models on spaCy Universe, or even better, on the Hugging Face Hub, which hosts some of the best models out there for every use case. An added bonus of the HF Hub is that you can try out the models on each model page and assess the performance on some examples before you decide.
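As a quick illustration of the Hugging Face route (dslim/bert-base-NER is just one example of a general-purpose English NER model on the Hub; it may or may not suit clinical text):
from transformers import pipeline

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")
print(ner("Patient: Byrn, John. Visited Dr Brian in 2021."))
# -> a list of dicts with entity_group, score, word, and start/end offsets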
Hope this helps!

Related

How do I evaluate a custom pipeline component with a custom attribute?

Questions:
How can I give GoldParse the gold data for a custom attribute?
How can I extend the properties of Scorer by custom scores which are based on a custom attribute?
Explanation
I have implemented a custom pipeline component that sets a custom attribute, registered with Doc.set_extension('results', default=[]). I want to evaluate my pipeline with labelled data (something like {text: "This is some text", results: ["banana", "picture"]}). It seems to me that GoldParse and Scorer do what I need for the default attributes, but I can't find information on how to use them with a custom attribute.
I have seen and understood examples like this, but they only ever deal with default attributes.
What I've tried
I have tried figuring out whether I can somehow configure the two classes for custom attributes/scores, but haven't found a way. The parameters of the init method of GoldParse and the Scorer properties seem to be fixed.
I have thought about extending the two classes with subclasses, but they don't seem easily extendable to me.
What I would like to avoid
Of course I can copy the code from Scorer and GoldParse which I need and add code for my custom attribute, but that seems like a bad solution. Also, considering how spaCy encourages you to extend a pipeline and a Doc, I would be surprised if the evaluation of those extensions were this hard.
Unfortunately, it actually is this hard in spacy v2. It's very hard to add things to GoldParse (basically a don't-try-this-at-home level of hard) and the Scorer is also hard to extend.
We're working on this for the upcoming spacy v3, where the scoring methods will be implemented more generally and each component will be able to provide its own score method. Be aware that this is still unstable, but if you're curious you can have a look at: https://github.com/explosion/spaCy/pull/5731. GoldParse has been replaced with Example, which stores both the gold annotation and the predicted annotation on individual Doc objects, getting rid of the restrictions related to GoldParse.
If you have a doc-level extension (as above) then you should probably just use a different library for evaluation. You could potentially use ROCAUCScore or PRFScore from spacy.scorer, but it may be easier to use something like sklearn metrics instead. (The ROCAUCScore is just a simplified version of the sklearn ROC AUC metric.)
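For example, a doc-level extension could be scored with a few lines of sklearn along these lines (untested sketch; labelled_data and nlp are placeholders for your own data and pipeline):
from sklearn.metrics import precision_recall_fscore_support

# labelled_data = [("This is some text", ["banana", "picture"]), ...]
y_true, y_pred = [], []
for text, gold_results in labelled_data:
    doc = nlp(text)
    # score each candidate label as present/absent in gold vs. predicted
    for label in set(gold_results) | set(doc._.results):
        y_true.append(label in gold_results)
        y_pred.append(label in doc._.results)

precision, recall, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(precision, recall, f1)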
If you have a token-level extension, for v2 I think the best you can do within spacy is to use PRFScore and extract the alignment bits based on words from a GoldParse to use outside of the scorer itself. Something like this:
import spacy
from spacy.gold import GoldParse
from spacy.scorer import PRFScore

nlp = spacy.load("my_model")
score = PRFScore()
# texts, gold_words_list and gold_attrs_list hold your labelled data
for text, gold_words, gold_attrs in zip(texts, gold_words_list, gold_attrs_list):
    # NOTE: gold_attrs must be aligned with gold_words
    # gold_words = ["a", "b", "c", ...]
    # gold_attrs = ["a1", "b1", "c1", ...]
    gold = GoldParse(nlp.make_doc(text), words=gold_words)
    doc = nlp(text)
    gold_values = set()
    cand_values = set()
    for i, gold_attr in enumerate(gold_attrs):
        gold_values.add((i, gold_attr))
    for token in doc:
        if token.orth_.isspace():
            continue
        gold_i = gold.cand_to_gold[token.i]
        if gold_i is not None:
            cand_values.add((gold_i, token._.attr))  # the token-level custom attribute
    score.score_set(cand_values, gold_values)
print(score.fscore)
This is an untested sketch that should parallel how token.tag is evaluated in the Scorer. The alignment bits are the trickiest part, so if you don't have misalignments between gold words and spacy's tokenization, then you may also be better off exporting your results and using a different library for evaluation.

How to visualize results of LDA topic modelling as shown below

As part of an assignment, I am asked to do topic modeling using LDA and to visualize the words that fall under the top 3 topics, as shown in the screenshot below. However, even after searching a lot I am not able to find any helpful resource that would help me achieve my goal. All the resources on text visualization I have found point towards word clouds, but my goal is not to use a word cloud visualization.
Required LDA topic visualization
Any help will be greatly appreciated.
If you use gensim to generate the LDA model (gensim.models.ldamodel.LdaModel()), you can use the following to easily visualize the keywords related to each topic:
# Example of LDA model building:
lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus,
                                            id2word=id2word,
                                            num_topics=20,
                                            random_state=100,
                                            update_every=1,
                                            chunksize=100,
                                            passes=10,
                                            alpha='auto',
                                            per_word_topics=True)

# Visualize keywords
print(lda_model.print_topics())
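If what you need is the bar-chart style figure from the screenshot (top words per topic, without a word cloud), a plain matplotlib sketch like this should get you close; the number of topics and words shown are arbitrary choices to adjust:
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for topic_id, ax in enumerate(axes):
    # show_topic returns (word, weight) pairs for one topic
    words, weights = zip(*lda_model.show_topic(topic_id, topn=10))
    ax.barh(words, weights)
    ax.invert_yaxis()  # most important word on top
    ax.set_title("Topic %d" % topic_id)
plt.tight_layout()
plt.show()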
Also, for an interactive visualization, I suggest you use pyLDAvis:
import pyLDAvis
import pyLDAvis.gensim
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim.prepare(lda_model, corpus, id2word)
vis
Here you feed in your LDA model, the corpus, and a dictionary mapping ids to tokens (id2word).
Have a look at the example here for a detailed explanation:
https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/

Discrepancy documentation and implementation of spaCy vectors for German words?

According to documentation:
spaCy's small models (all packages that end in sm) don't ship with
word vectors, and only include context-sensitive tensors. [...]
individual tokens won't have any vectors assigned.
But when I use the de_core_news_sm model, the tokens do have entries for x.vector, and x.has_vector is True.
It looks like these are context vectors, but as far as I understand the documentation, only word vectors are accessible through the vector attribute, and sm models should have none. Why does this work for a "small model"?
has_vector behaves differently than you expect.
This is discussed in the comments on an issue raised on GitHub. The gist is: since vectors are available, it is True, even though those vectors are context vectors. Note that you can still use them, e.g. to compute similarity.
Quote from spaCy contributor Ines:
We've been going back and forth on how the has_vector should behave in
cases like this. There is a vector, so having it return False would be
misleading. Similarly, if the model doesn't come with a pre-trained
vocab, technically all lexemes are OOV.
Version 2.1.0 has been announced to include German word vectors.
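A quick way to see this behaviour for yourself (a small sketch; exact values depend on your spaCy and model versions, and this reflects the v2-era sm models discussed here):
import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Das Auto kauft Herr Müller")
token = doc[1]
print(token.has_vector)           # True, although these are context-sensitive tensors
print(token.vector.shape)         # the tensor exposed through .vector
print(doc[1].similarity(doc[4]))  # similarity still works with these vectors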

spaCy - English language model outperforms German language model on German text?

Is it by design that the English language model performs better on German salutation entities than the German model?
# pip install spacy
# python -m spacy download en
# python -m spacy download de

import spacy

nlp = spacy.load('en')
# Uncomment the line below to get less good results
# nlp = spacy.load('de')

# Process text
text = u"Das Auto kauft Herr Müller oder Frau Meier, Frank Muster"
doc = nlp(text)

# Find named entities
for entity in doc.ents:
    print(entity.text, entity.label_)
Expected result when using nlp = spacy.load('en'): all three PERSON entities are returned.
Das Auto ORG
Herr Müller PERSON
Frau Meier PERSON
Frank Muster PERSON
Unexpected result when using nlp = spacy.load('de'): only one of the three PERSON entities is returned.
Frank Muster PERSON
Info about spaCy
spaCy version: 2.0.12
Platform: Linux-4.17.2-1-ARCH-x86_64-with-arch-Arch-Linux
Python version: 3.6.5
Models: en, de
It's not by design, but it's certainly possible that this is a side-effect of the training data and the statistical predictions. The English model is trained on a larger NER corpus with more entity types, while the German model uses NER data based on Wikipedia.
In Wikipedia text, full names like "Frank Muster" are quite common, whereas things like "Herr Muster" are typically avoided. This might explain why the model only labels the full name as a person and not the others. The example sentence also makes it easy for the English model to guess correctly – in English, capitalization is a much stronger indicator of a named entity than it is in German. This might explain why the model consistently labels all capitalised multi-word spans as entities.
In any case, this is a good example of how subtle language-specific or stylistic conventions end up influencing a model's predictions. It also shows why you almost always want to fine-tune a model with more examples specific to your data. I do think that the German model will likely perform better on German texts overall, but if references like "Herr Müller" are common in your texts, you probably want to update the model with more examples of them in different contexts.

How to identify relevant features in WEKA?

I would like to perform feature analysis in WEKA. I have a data set of 8 features and 65 instances.
I would like to use the feature selection and optimization functionality that is available for machine learning methods like SVM.
For example, I would like to know how I can display in WEKA which features contribute most to the classification result.
I think WEKA provides a nice graphical user interface and allows a very detailed analysis of the influence of single features, but I don't know how to use it. Any help?
You have two options:
You can perform attribute selection using filters. For instance, you can use the AttributeSelection tab (or filter) with the search method Ranker and the attribute evaluation metric InfoGainAttributeEval. This way you get a ranked list of the most predictive features according to their Information Gain score. I have done this many times with good results. Sometimes it even helps to increase the accuracy of SVMs, which are known not to need much feature selection. You can also try other search methods in order to find subgroups of coupled predictors, and other evaluation metrics.
You can just look at the coefficients in the SVM output. For instance, in linear SVMs the classifier is a linear function of the form a1·f1 + a2·f2 + ... + an·fn + f(n+1) > 0, where the ai are the attribute values for an instance and the fi are the weights learned by the SVM training algorithm. Consequently, weights close to 0 correspond to attributes that do not count for much and are therefore poor predictors; extreme weights (either positive or negative) indicate good predictors. A small illustration of this idea follows at the end of this answer.
Additionally, you can check the visualization options available for a particular classifier (e.g. J48 is a decision tree, and the attribute used in its root test is the best predictor). You can check the visualization options of the AttributeSelection tab as well.
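WEKA itself is driven through its GUI (or Java API), but the idea behind the second option can be sketched in a few lines of Python with scikit-learn, used here purely as an analogy since the rest of this page uses Python: fit a linear SVM, then rank the features by the magnitude of their weights.
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Illustration only: any small tabular dataset will do
data = load_breast_cancer()
X = StandardScaler().fit_transform(data.data)
clf = LinearSVC(max_iter=10000).fit(X, data.target)

# Features whose weights are far from 0 are the strong predictors
ranked = sorted(zip(data.feature_names, clf.coef_[0]), key=lambda t: -abs(t[1]))
for name, weight in ranked[:5]:
    print("%s: %.3f" % (name, weight))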