How to improve the accuracy of Rasa NLU when using spaCy in the pipeline? - spacy

The spaCy documentation mentions that it uses vector similarity for featurization, and hence for classification.
For example, if we test a sentence that is not in the training data but has the same meaning, it should be classified under the same intent as the corresponding training sentences.
But that is not happening.
Let's say the training data looks like this:
## intent: delete_event
- delete event
- delete all events
- delete all events of friday
- delete ...
Now if I test "remove event", it is not classified as delete_event; instead it falls under some other intent.
I have tried changing the pipeline to supervised_embeddings and have also changed the components of the spaCy pipeline, but the issue persists.
I don't want to create training data for "remove ..." texts, as this should be supported by spaCy according to its documentation.
I don't have any other intents that contain "delete ..." sentences.
Config file in Rasa:
language: "en_core_web_sm"
pipeline:
- name: "SpacyNLP"
- name: "SpacyTokenizer"
- name: "SpacyFeaturizer"
- name: "RegexFeaturizer"
- name: "SpacyEntityExtractor"
- name: "CRFEntityExtractor"
- name: "EntitySynonymMapper"
- name: "SklearnIntentClassifier"
policies:
- name: MemoizationPolicy
- name: KerasPolicy
- name: MappingPolicy

It's probably an overused answer, but most likely you just need more training data, and that probably means including some words other than delete.
Yes, spaCy can generalize beyond the words you include, but if all of your training data for that intent uses the word delete, you are training the classifier to accept only that word, or teaching it that the word is extremely important. If you include more words similar to delete, you train it that related words are allowed.
As for the TensorFlow pipeline, it doesn't even know that words exist until you use them, so you would be best served by including remove at least once so it can build the vectors connecting delete and remove (and cancel, call off, drop, etc. as well).
Also, you are currently using the small spaCy language model; it may be worth trying one of the larger ones once you have more training data.

Related

Patient name extraction using MedSpacy

I am looking for some guidance on NER using MedSpacy. I am aware of disease extraction with MedSpacy, but my aim is to extract the patient name from a medical report.
The text looks like this:
patient:Jeromy, David (DOB)
Date range 2020 to 2022. Visited Dr Brian. Suffered from ...
My dataset is of this type, and I want to extract the patient name from all pages of the medical reports using MedSpacy. I know target rules can be helpful, but any clear guidance would be appreciated.
Thanks & regards
If you find that the default SpaCy NER model is not sufficient, as it will not pick up names such as "Byrn, John", I have a couple of suggestions:
Train a custom NER component using SpaCy's Prodigy annotation tool, which you can use to easily label some examples of names. This is a rather simple task, so you can likely train a model with fewer than 100 diverse examples. Note: Prodigy is a paid tool, so see my other suggestions if you do not have access or are not willing to pay.
Train a custom NER component without Prodigy. Similar to the above approach, but slightly more involved. This Medium article provides a beginner-friendly introduction to doing so, and you can also refer to SpaCy's own documentation. You can provide SpaCy with some examples of texts and the entities you want extracted, like so:
TRAIN_DATA = [
    ('Patient: Byrn, John', {
        'entities': [(9, 19, 'PERSON')]
    }),
    ('John Byrn received 10mg of advil', {
        # end offset is exclusive: 'John Byrn' is 9 characters long
        'entities': [(0, 9, 'PERSON')]
    })
]
Build rules based on existing SpaCy components. You can leverage existing SpaCy pipeline components (you don't necessarily need MedSpaCy for this), such as POS tagging and Dependency Parsing. For example, you can look for proper nouns in your documents to identify names. Check out the docs on POS tagging here.
Try other pretrained NER models. There may be other models that are better suited to your task. Check out other models on SpaCy Universe, or even better, on the Hugging Face Hub, which hosts some of the best models out there for every use case. An added bonus of the HF Hub is that you can try out the models on each model's page and assess their performance on some examples before you decide.
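Character offsets in hand-written training examples are easy to get wrong by one, and a span should cover exactly the entity text with no surrounding whitespace. A minimal sketch of computing the offsets with a plain string search instead of counting by hand; the helper name `make_example` is hypothetical (it is not a SpaCy API), and it assumes the first occurrence of the name in the text is the one you want to label.

```python
def make_example(text, name, label="PERSON"):
    """Return a (text, annotations) pair with offsets found by string search.

    Hypothetical helper, not part of SpaCy. Uses the first occurrence of `name`.
    """
    start = text.find(name)
    if start == -1:
        raise ValueError(f"{name!r} not found in {text!r}")
    end = start + len(name)  # end offset is exclusive
    return (text, {"entities": [(start, end, label)]})


TRAIN_DATA = [
    make_example("Patient: Byrn, John", "Byrn, John"),
    make_example("John Byrn received 10mg of advil", "John Byrn"),
]

# Print each labelled span so it can be checked by eye
for text, annotations in TRAIN_DATA:
    for start, end, label in annotations["entities"]:
        print(label, repr(text[start:end]), (start, end))
```

Slicing the text with the computed offsets, as the loop does, is a quick sanity check that every span matches the name exactly.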
Hope this helps!

Gensim word2vec saves numpy arrays?

I am running the Word2Vec implementation from gensim twice, and I have a problem with the save function:
model_ = gensim.models.Word2Vec(
    all_doc,
    size=int(config['MODEL']['embed_size']),
    window=int(config['MODEL']['window']),
    workers=multiprocessing.cpu_count(),
    sg=1,
    iter=int(config['MODEL']['iteration']),
    negative=int(config['MODEL']['negative']),
    min_count=int(config['MODEL']['min_count']),
    seed=int(config['MODEL']['seed']),
)
model_.save(config['BASIC']['embedding_dir'])
I obtain different outputs each time I run it. The first time it produces "output_embedding", "output_embedding.trainables.syn1neg.npy" and "output_embedding.wv.vectors.npy". But the second time it does not produce the two npy files; it just generates "output_embedding".
The only thing I change between the first and the second run is the sentences I use as input (all_doc).
Why does it not generate the three files?
Gensim only creates the separate files when the size of the internal numpy arrays is over a certain threshold – so I suspect your all_doc corpus has a very small vocabulary in one case, and a more typically large vocabulary in the other.
When it does generate multiple files, be sure to keep them all together for later loads to work.
(If for some urgent reason you needed to change that behavior, the inherited .save() method takes an optional sep_limit argument to change the threshold - but I'd recommend against mucking with this.)
Separately: that your file names have .trainables. in them suggests you're using a pre-4.0.0 version of Gensim. There've been some improvements to Word2Vec & related algorithms in the latest Gensim, and some older code will need small changes to keep working, so you may want to upgrade to the latest version before building any more functionality on an older base.
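To make the threshold concrete, here is a rough sketch of the size check. It assumes a default sep_limit of 10 * 1024**2 and that the limit is compared against the number of elements in each array; both are assumptions about Gensim internals that may differ between versions, so check the source of your installed release. The helper `stored_separately` is hypothetical and not part of Gensim.

```python
DEFAULT_SEP_LIMIT = 10 * 1024**2  # assumed Gensim default; verify for your version


def stored_separately(vocab_size, embed_size, sep_limit=DEFAULT_SEP_LIMIT):
    """Guess whether a (vocab_size x embed_size) array gets its own .npy file.

    Hypothetical helper: assumes the array's element count is what gets
    compared against sep_limit.
    """
    return vocab_size * embed_size >= sep_limit


# A small vocabulary: 5,000 words x 100 dimensions = 500,000 elements
small = stored_separately(5_000, 100)
# A larger vocabulary: 200,000 words x 100 dimensions = 20,000,000 elements
large = stored_separately(200_000, 100)
print(small, large)
```

Under these assumptions only the larger vocabulary crosses the threshold, which matches the behaviour described in the question: one run writes extra .npy files and the other does not.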

How to optimize SpaCy pipe for NER only (using an existing model, no training)

I am looking to use SpaCy v3 to extract named entities from a large list of sentences. What I have works, but it seems slower than it should be, and before investing in more machines, I'd like to know if I am doing more work than I need to in the pipe.
I've used NLTK to parse everything into sentences as an iterator, then process these using "pipe" to get the named entities. All of this appears to work well, and Python appears to be hitting every CPU core on my machine fairly heavily, which is good.
import spacy

nlp = spacy.load("en_core_web_trf")
# lines must yield (text, context) tuples because as_tuples=True
for (doc, context) in nlp.pipe(lines, as_tuples=True, batch_size=1000):
    for ent in doc.ents:
        pass  # handle each entity
I understand that I can use nlp.disable_pipes to disable certain elements. Is there anything I can disable that won't impact accuracy and that isn't required for NER?
For NER only with the transformer model en_core_web_trf, you can disable ["tagger", "parser", "attribute_ruler", "lemmatizer"].
If you want to use a non-transformer model like en_core_web_lg (much faster but slightly lower accuracy), you can disable ["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"] and use nlp.pipe(n_process=-1) for multiprocessing on all CPUs (or n_process=N to restrict to N CPUs).
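A minimal sketch of how the two disable lists above could be wired up. The helper names (`components_to_disable`, `load_ner_only`, `iter_entities`) are hypothetical, and choosing the list by the `_trf` suffix of the model name is an assumption that holds for the two pipelines mentioned here but not necessarily for custom ones.

```python
def components_to_disable(model_name):
    """Components not needed for NER, following the lists given above."""
    not_needed = ["tagger", "parser", "attribute_ruler", "lemmatizer"]
    if model_name.endswith("_trf"):
        # Transformer pipelines: the transformer component stays enabled.
        return not_needed
    # Non-transformer pipelines such as en_core_web_lg: tok2vec can go as well.
    return ["tok2vec"] + not_needed


def load_ner_only(model_name):
    """Load a pipeline with the components NER does not need disabled."""
    import spacy  # imported lazily so the helper above works without spaCy

    return spacy.load(model_name, disable=components_to_disable(model_name))


def iter_entities(nlp, texts, **pipe_kwargs):
    """Yield (entity text, label) pairs for every entity in every text."""
    for doc in nlp.pipe(texts, **pipe_kwargs):
        for ent in doc.ents:
            yield ent.text, ent.label_
```

With the model installed, usage would look like `nlp = load_ner_only("en_core_web_lg")` followed by `for text, label in iter_entities(nlp, lines, n_process=-1): ...`, where `lines` is an iterable of plain strings.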

Tensorflow Object Detection API - Detecting the humans not wearing Helmet

*** Please note, my previous problem of detecting withouthelmet as NA is resolved.
Now I have a new issue. I used 1000 images of humans not wearing helmets, 1000 images of humans wearing helmets, and 1000 images of humans only. I used the SSD_mobilenet_v1_pets.config file for training.
Here is my pbtxt file
item {
  id: 1
  name: 'withouthelmet'
}
item {
  id: 2
  name: 'withhelmet'
}
item {
  id: 3
  name: 'person'
}
sample training Image
After training, my model detects every car as a person.
Is that because of using the ssd_mobilenet model (where id: 1 is person, but I used id: 1 for withouthelmet, and id: 3 is car, but I used id: 3 for person)?
Please help me solve this problem.
Have you set num_classes to 1 in your config?
Please note that min_negatives_per_image means the minimum number of negative anchors (not images), so your data mix has nothing to do with this parameter.
I had to modify my earlier answer: if you add a background image (an image with no ground-truth boxes) to the dataset, it should help reduce false positives. Sorry, I got confused with some other stuff.
Have you used the pre-trained SSD-MobileNetV1 model trained on the pets dataset?
I think you better use a model trained on COCO dataset since it has persons, in contrast to pets.
Of course, if you train your model, it will learn to detect persons as well, but since you don't have a lot of examples of persons without a helmet, it would probably be better to start with a model that already knows what a person is.
Regarding your question: if you only want to detect people without a helmet, you can simply drop everything else in the pbtxt file and keep only
item {
  id: 1
  name: 'withouthelmet'
  display_name: 'withouthelmet'
}
change the number of categories in the config file to 1, and fine-tune the model.
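To keep the label map and the class count in the config from drifting apart, both can be derived from one list of class names. A small sketch, assuming the same item layout as the pbtxt snippet above; the function name `label_map_pbtxt` is hypothetical and not part of the Object Detection API.

```python
def label_map_pbtxt(class_names):
    """Build label map text with ids starting at 1 (id 0 is reserved for background)."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(
            "item {\n"
            f"  id: {class_id}\n"
            f"  name: '{name}'\n"
            f"  display_name: '{name}'\n"
            "}\n"
        )
    return "\n".join(items)


classes = ["withouthelmet"]
label_map_text = label_map_pbtxt(classes)
print(label_map_text)
# The class count in the training config should match the label map
print("num_classes:", len(classes))
```

Writing `label_map_text` to the pbtxt file and copying the printed count into the config keeps the two consistent when classes are added or removed.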

Can I change Inv operation into Reciprocal in an existing graph in Tensorflow?

I am working on an image classification problem with TensorFlow. I have 2 different CNNs trained separately (in fact 3 in total, but I will deal with the third later), for different tasks, on an AWS (Amazon) machine. One tells whether there is text in the image and the other tells whether the image is safe for work or not. Now I want to use them in a single script on my computer, so that I can give an image as input and get the results of both networks as output.
I load the two graphs in a single tensorflow Session, using the import_meta_graph API and the import_scope argument and putting each subgraph in a separate scope. Then I just use the restore method of the created saver, giving it the common Session as argument.
Then, in order to run inference, I retrieve the placeholders and final output with graph=tf.get_default_graph() and my_var=graph.get_operation_by_name('name').outputs[0] before using it in sess.run (I think I could just have put 'name' in sess.run instead of fetching the output tensor and putting it in a variable, but this is not my problem).
My problem is that the text CNN works perfectly fine, but the nsfw detector always gives me the same output, no matter the input (even with np.zeros()). I have tried both separately, with the same result: text works but nsfw does not. So I don't think the problem comes from using two networks simultaneously.
I also tried on the original AWS machine I trained it on, and this time the nsfw CNN worked perfectly.
Both networks are very similar. I checked in Tensorboard whether everything was fine, and I think it is. The differences are in the number of hidden units and the fact that I use batch normalization in the nsfw model and not in the text one. Now, why this title? I noticed a warning when running the nsfw model that I didn't get when using only the text model:
W tensorflow/core/framework/op_def_util.cc:332] Op Inv is deprecated. It will cease to work in GraphDef version 17. Use Reciprocal.
So I thought maybe this was the reason, everything else being equal. I checked my GraphDef version, which seems to be 11, so Inv should still work in theory. By the way, the AWS machine uses TensorFlow version 0.10 and I use version 0.12.
I noticed that the text network only had one Inv operation (found by filtering the names of the operations returned by graph.get_operations()), and that the nsfw model had the same operation plus multiple Inv operations due to the batch normalization layers. As stated in the release notes, tf.inv has simply been renamed to tf.reciprocal, so I tried to change the names of the operations to Reciprocal with tf.group(), as proposed here, but it didn't work. I have seen that using tf.identity() and changing the name could also work, but from what I understand, TensorFlow graphs are an append-only structure, so we can't really modify their operations (which seem to be immutable anyway).
The thing is:
as I said, the Inv operation should still work in my GraphDef version;
this is only a warning;
the Inv operations only appear under name scopes that begin with 'gradients' so, from my understanding, this shouldn't be used for inference;
the text model also has an Inv operation.
For these reasons, I have serious doubts about my diagnosis. So my final questions are:
do you have another diagnosis?
if mine is correct, is it possible to replace Inv operations with Reciprocal operations, or do you have any other solution?
After a thorough examination of the output of relevant nodes, with the help of Tensorboard, I am now pretty certain that the renaming of Inv to Reciprocal has nothing to do with my problem.
It appears that the last batch normalization layer eliminates almost all variance in its output when the input varies. I will ask why elsewhere.