Why does spaCy need start and end positions in tagging annotations? - spacy

I was training spaCy named entity recognition with my custom dataset.
One question stuck in my mind: why does spaCy need the start and end position of each tag in the annotation?
[
('I want apples', {'entities': [(2, 6, 'COMMAND'), (7, 13, 'FRUIT')]})
]
Thanks in advance.

Because named entities are allowed to span several tokens, for instance:
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
"Shaka Khan" would be one entity with the label PERSON.
If instead you annotated
("Who is Shaka Khan?", {"entities": [(7, 12, "PERSON")]}),
then only "Shaka" would be the tagged entity.
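Those (start, end) values are plain character offsets into the raw text, with the end exclusive, so you can check any annotation with ordinary string slicing:

```python
text = "Who is Shaka Khan?"

# (start, end) is an end-exclusive character slice into the raw text,
# so a multi-token entity is simply a wider slice:
full_span = text[7:17]    # the full two-token PERSON entity
short_span = text[7:12]   # a narrower span covering only the first token
print(full_span)
print(short_span)
```

spaCy's `Doc.char_span` returns `None` when such offsets do not align with token boundaries, which is a quick way to validate annotations before training.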

Related

Patient name extraction using MedSpacy

I was looking for some guidance on NER using MedSpaCy. I am aware of disease extraction using MedSpaCy, but the aim is to extract the patient name from medical reports using MedSpaCy.
The text looks like this:
patient:Jeromy, David (DOB)
Date range 2020 to 2022. Visited Dr Brian. Suffered from ...
This is the type of dataset I have, and I want to extract the patient name from all pages of the medical reports using MedSpaCy. I know target rules can be helpful, but any clear guidance will be appreciated.
Thanks & regards
If you find that the default SpaCy NER model is not sufficient, as it will not pick up names such as "Byrn, John", I have a couple of suggestions:
Train a custom NER component using SpaCy's Prodigy annotation tool, which you can use to easily label some examples of names. This is a rather simple task, so you can likely train a model with less than 100 diverse examples. Note: Prodigy is a paid tool, so see my other suggestions if you do not have access/are not willing to pay.
Train a custom NER component without Prodigy. Similar to the above approach, but slightly more involved. This Medium article provides a beginner-friendly introduction to doing so, and you can also refer to SpaCy's own documentation. You can provide SpaCy with some examples of texts and the entities you want extracted, like so:
TRAIN_DATA = [
('Patient: Byrn, John', {
'entities': [(9, 19, 'PERSON')]
}),
('John Byrn received 10mg of advil', {
'entities': [(0, 9, 'PERSON')]
})
]
Build rules based on existing SpaCy components. You can leverage existing SpaCy pipeline components (you don't necessarily need MedSpaCy for this), such as POS tagging and Dependency Parsing. For example, you can look for proper nouns in your documents to identify names. Check out the docs on POS tagging here.
Try other pretrained NER models. There may be other models that are better suited to your task. Check out other models on SpaCy Universe, or even better, on HuggingFace Hub, which contains some of the best models out there for every use case. An added bonus of HF Hub is that you can try out the models on each model's page and assess the performance on some examples before you decide.
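As an illustration of the rule-based route (a sketch, not MedSpaCy's actual API): if the reports literally contain a `patient:` prefix, as in the sample line in the question, even a plain regular expression can serve as a baseline or as seed patterns for target rules. The `"Last, First"` shape assumed below is taken from that one sample and may not cover all reports:

```python
import re

# Hypothetical rule: reports contain "patient:<Last>, <First>".
# The name shape below is an assumption based on the question's sample line.
PATIENT_RE = re.compile(r"patient\s*:\s*([A-Z][a-z]+,\s*[A-Z][a-z]+)", re.IGNORECASE)

def extract_patient_names(text):
    """Return every patient name matched by the rule."""
    return PATIENT_RE.findall(text)

report = "patient:Jeromy, David (DOB)\nDate range 2020 to 2022. Visited Dr Brian."
print(extract_patient_names(report))  # → ['Jeromy, David']
```

A rule like this is brittle compared to a trained model, but it is a cheap way to bootstrap labelled examples for the training approaches above.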
Hope this helps!

React Native Mi Scale Weight Data

I am trying to get data from a Mi Scale V2. I am getting service data like this: "serviceData": {"0000181b-0000-1000-8000-00805f9b34fb": "BiTlBwcZFgsYAAAmAg=="} (5.15 kg), and I decoded the base64 string to an array like this: [66, 105, 84, 108, 66, 119, 99, 90, 70, 103, 115, 89, 65, 65, 65, 109, 65, 103, 61, 61]. But I cannot retrieve the correct result. How can I get the weight data?
The UUID 0000181b-0000-1000-8000-00805f9b34fb belongs to the pre-defined Body Composition Service (BCS). You can download the specification from here.
It should have the two characteristics Body Composition Feature and Body Composition Measurement.
The Features characteristic shows you the features supported by your scale, the measurement characteristic returns the actual measurement.
Take a look at this answer where I explain the process of decoding a sample weight measurement.
UUIDs with the format 0000xxxx-0000-1000-8000-00805f9b34fb are officially adopted Bluetooth SIG UUIDs and can be looked up online.
If you look at the following URL:
https://www.bluetooth.com/specifications/assigned-numbers/
there is a document with the title "16-bit UUIDs". I can see from that document that 0x181b is Body Composition GATT Service.
According to the "Body Composition Service 1.0" document at:
https://www.bluetooth.com/specifications/specs/
there should be a Body Composition Feature (0x2A9B) and a Body Composition Measurement (0x2A9C) characteristic available for that service.
It will be the Body Composition Measurement characteristic that will contain the weight value.
A generic Bluetooth Low Energy scanning and exploration tool like nRF Connect can be useful when exploring and understanding the data on a device.
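One note on the decoding step in the question: converting the base64 string to `[66, 105, 84, ...]` produces the ASCII codes of the base64 characters themselves, not the payload; the string has to be base64-decoded first. A decoding sketch, assuming the community-documented (not officially specified) Mi Scale V2 byte layout: bytes 0-1 flags, bytes 2-8 a timestamp, last two bytes the raw weight, little-endian:

```python
import base64
import struct

def decode_mi_scale(service_data_b64):
    """Decode Mi Scale V2 service data (service UUID 0x181B).

    The byte layout follows community reverse engineering, not the
    official BCS spec: bytes 0-1 control/unit flags, bytes 2-3 year
    (little-endian), bytes 4-8 month/day/hour/minute/second,
    last two bytes the raw weight (little-endian).
    """
    raw = base64.b64decode(service_data_b64)  # NOT the char codes of the string
    year, = struct.unpack_from("<H", raw, 2)
    month, day, hour, minute, second = raw[4:9]
    weight_raw, = struct.unpack_from("<H", raw, len(raw) - 2)
    return {
        "timestamp": (year, month, day, hour, minute, second),
        "weight_raw": weight_raw,
    }

result = decode_mi_scale("BiTlBwcZFgsYAAAmAg==")
print(result)
```

For this sample the timestamp decodes to 2021-07-25 22:11:24 and the raw weight to 550. The divisor that turns the raw value into a display weight (commonly 200 for kg, 100 for lbs or catty) depends on the unit flags in bytes 0-1, so check candidate divisors against what the scale's own display shows.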

Correct annotation to train spaCy's NER

I'm having some trouble finding the right way to annotate my data. I'm dealing with laboratory test related texts and I am using the following labels:
1) Test specification (e.g. voltage, length, ...)
2) Test object (e.g. battery, steel beam, ...)
3) Test value (e.g. 5 V; 5 m, ...)
Let's take this example sentences:
The battery voltage should be 5 V.
I would annotate this sentence like this:
The
battery voltage (test specification)
should
be
5 V (Test value)
.
However, if the sentence looks like this:
The voltage of the battery should be 5 V.
I would use the following annotation:
The
voltage (Test specification)
of
the
battery (Test object)
should
be
5 V (Test value)
.
Is anyone experienced in annotating data who can explain whether this is the right way? Or should I, in the first example, use the Test object label for battery as well? Or should I combine the labels in the second example and tag voltage of the battery as Test specification?
I am annotating the data to perform information extraction.
Thanks for any help!
All of your examples use unusual annotation formats. The typical way to tag NER data (in text) is to use an IOB/BILOU format, where each token is on one line, the file is a TSV, and one of the columns is a label. For your data it would look like:
The
voltage U-SPEC
of
the
battery U-OBJ
should
be
5 B-VALUE
V L-VALUE
.
Pretend that is a TSV; I have omitted the O tags, which are used for "other" (non-entity) tokens.
You can find documentation of these schemes in the spaCy docs.
If you already have data in the format you provided, or you find it easier to produce that way, it should at least be easy to convert. For training NER, spaCy requires the data in a particular format (see the docs for details), but basically you need the input text, character spans, and the labels of those spans. Here's example data:
TRAIN_DATA = [
("Who is Shaka Khan?", {"entities": [(7, 17, "PERSON")]}),
("I like London and Berlin.", {"entities": [(7, 13, "LOC"), (18, 24, "LOC")]}),
]
This format is trickier to produce manually than the above TSV type format, so generally you would produce the TSV-like format, possibly using a tool, and then convert it.
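A hand-rolled converter from token/BILOU-tag pairs to the character-offset format can look like the sketch below (assuming single spaces between tokens; spaCy also ships helpers for this in `spacy.training`):

```python
def biluo_to_offsets(token_tag_pairs):
    """Convert (token, BILOU-tag) pairs into the text plus the
    (start, end, label) character spans spaCy's training format expects.
    Assumes tokens are joined by single spaces."""
    text_parts, entities = [], []
    pos = 0
    start = label = None
    for token, tag in token_tag_pairs:
        begin = pos
        text_parts.append(token)
        pos += len(token)
        if tag.startswith("U-"):          # single-token entity
            entities.append((begin, pos, tag[2:]))
        elif tag.startswith("B-"):        # entity opens here
            start, label = begin, tag[2:]
        elif tag.startswith("L-"):        # entity closes here
            entities.append((start, pos, label))
            start = label = None
        pos += 1                          # the joining space
    return " ".join(text_parts), entities

pairs = [
    ("The", "O"), ("voltage", "U-SPEC"), ("of", "O"), ("the", "O"),
    ("battery", "U-OBJ"), ("should", "O"), ("be", "O"),
    ("5", "B-VALUE"), ("V", "L-VALUE"), (".", "O"),
]
text, ents = biluo_to_offsets(pairs)
print(text)
print(ents)
```

Note that the reconstructed text keeps a space before the final period; a real converter would carry per-token whitespace information instead of assuming single spaces.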
The main rule for correctly annotating entities is to be consistent (i.e. you always apply the same rules when deciding which entity is what). I can see you already have some rules for when battery voltage should be considered a test object versus a test specification.
Apply those rules consistently and you'll be ok.
Have a look at spacy-annotator.
It is a library that helps you annotate data in the way you want.
Example:
import pandas as pd
import re
from spacy_annotator.pandas_annotations import annotate as pd_annotate
# Data
df = pd.DataFrame.from_dict({'full_text' : ['The battery voltage should be 5 V.', 'The voltage of the battery should be 5 V.']})
# Annotations
pd_dd = pd_annotate(df,
col_text = 'full_text', # Column in pandas dataframe containing text to be labelled
labels = ['test_specification', 'test_object', 'test_value'], # List of labels
sample_size=1, # Size of the sample to be labelled
delimiter=',', # Delimiter to separate entities in GUI
model = None, # spaCy model for noisy pre-labelling
regex_flags=re.IGNORECASE # One (or more) regex flags to be applied when searching for entities in text
)
# Example output
pd_dd['annotations'][0]
The code will show you a user interface you can use to annotate each relevant entity.

Is spaCy supporting custom types for Named Entity Recognition?

The documentation of the Named Entity Recognition feature of spaCy (https://spacy.io/usage/linguistic-features#named-entities)
states that spaCy can recognize various types of named entities, such as PERSON, LOC, and PRODUCT (https://spacy.io/api/annotation#named-entities).
My question is: can I also train data with my own custom entities? For example, I would like to train on invoice data to recognize, for example, an IBAN/BIC or an invoice number. Is this possible, or is this feature restricted to a fixed list of entities?
It does support custom entities, cf this section titled "Training an additional entity type".
For example, to add a label called MY_ANIMAL, you can use training data like such:
TRAIN_DATA = [
(
"Horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, "MY_ANIMAL")]},
),
("Do they bite?", {"entities": []}),
(
"horses are too tall and they pretend to care about your feelings",
{"entities": [(0, 6, "MY_ANIMAL")]},
),
]
And feed that into either an existing NER model as additional training, or a newly created NER pipe.
However, a caveat: the ML model is optimized for recognizing named entities, which are usually capitalized nouns like "John", "London" or "The Times". You can also try to train it on more generic things like numbers, but it may not work as well.
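To illustrate that caveat: identifiers with a rigid surface form, like IBANs, are often easier to catch with a rule than with the statistical NER. A minimal regex sketch (the pattern below is a simplified approximation of the IBAN format, not a validator, and the sample IBAN is a well-known test value):

```python
import re

# Simplified IBAN shape: two letters, two check digits, 11-30 alphanumerics.
# An approximation for illustration, not a validating parser.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def find_ibans(text):
    """Return (start, end, label) spans for every IBAN-shaped substring."""
    return [(m.start(), m.end(), "IBAN") for m in IBAN_RE.finditer(text)]

invoice = "Please transfer the amount to DE89370400440532013000 before May."
print(find_ibans(invoice))  # → [(30, 52, 'IBAN')]
```

Rules like this can also be plugged into a spaCy pipeline via the EntityRuler component, so rule-based and statistical entities can coexist.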

use tensorflow object detection API for gender recognition

Can I use the TensorFlow Object Detection API for gender recognition?
I want to train an SSD MobileNet for gender recognition and detection. I changed the label map to:
item {
id: 1
name: 'man'
}
item {
id: 2
name: 'woman'
}
and num_classes=2
I got the training loss down to 8, but when I feed an image to the network to test, the result is awful.
What should I do? Can somebody help me?
For this kind of task you will need a huge dataset and a very long training time unless you have powerful hardware. It is also genuinely difficult: to a computer, men and women share almost all of the same visual features, much as a computer struggles to tell a female dog from a male one while a human can at a glance. Still, it is a nice idea with a lot of applications, so you should definitely try it. Good luck, and let me know if you manage to build something better.
You can. The method that you need to follow is as following:
Use SSD to extract the location of the object to be found (the face, here).
Get the relevant feature-map region at conv5 (assuming you use VGG). For example, if you find your object at (100, 100, 100, 100 - XYWH) within a (300, 300) input image, cut the conv5 features at (12, 12, 12, 12 - XYWH); the math is (100 / 300) * 38.
Now you have the activation features cut from conv5 (12 x 12 x 512), which cover only the face whose gender you want to predict.
Flatten this feature activation and apply a DNN classifier to it (e.g. the classifier head used in VGG).
Get a binary output stating either male or female.
Train your network by adding the gender loss to the global loss function.
Voila. You have the gender estimation network.
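The coordinate mapping in step 2 is just a proportional rescaling from image space to feature-map space; sketched numerically below (the sizes are the example's, not tied to any framework):

```python
def image_box_to_feature_box(box, image_size, feature_size):
    """Project an (x, y, w, h) box from image coordinates onto a feature
    map by proportional scaling, as in step 2 above (e.g. a 38x38 map
    for a 300x300 input). Purely illustrative; real pipelines usually
    use ROI pooling/align instead of integer truncation."""
    scale = feature_size / image_size
    return tuple(int(v * scale) for v in box)

print(image_box_to_feature_box((100, 100, 100, 100), 300, 38))  # → (12, 12, 12, 12)
```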