Filtering Entities Based on the type "PERSON", "ORG" etc in Spacy - spacy

After creating an nlp pipeline with spaCy and passing a document through it, I am trying to filter the entities by type.
for ent in doc.ents:
    print(ent.text)
What would be the code to filter just the PERSON or ORG entities from the ents list?

for ent in doc.ents:
    if ent.label_ in ["PERSON", "ORG"]:
        print(ent.text)
See also https://spacy.io/usage/linguistic-features#named-entities-101 for an overview of the relevant properties available from NER.

Or, with filter (it returns a lazy iterator, so wrap it in list() if you need a list):
filter(lambda e: e.label_ in ["PERSON", "ORG"], doc.ents)
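If the same filter is needed in several places, it can be wrapped in a small helper. The sketch below is only an illustration: filter_entities and the Ent stand-in are hypothetical names, not part of spaCy. Ent mimics just the two attributes of a spaCy Span used in the answers above (text and label_), so the code runs without a trained pipeline; with a real pipeline you would pass doc.ents instead.

```python
from collections import namedtuple

# Hypothetical stand-in for a spaCy Span: only the attributes used here.
Ent = namedtuple("Ent", ["text", "label_"])


def filter_entities(ents, labels):
    """Return the entities whose label_ is in `labels`, preserving order."""
    wanted = set(labels)  # a set gives fast membership checks
    return [ent for ent in ents if ent.label_ in wanted]


# With spaCy this would be: filter_entities(doc.ents, ["PERSON", "ORG"])
sample = [
    Ent("Apple", "ORG"),
    Ent("U.K.", "GPE"),
    Ent("Tim Cook", "PERSON"),
    Ent("$1 billion", "MONEY"),
]
kept = filter_entities(sample, ["PERSON", "ORG"])
print([ent.text for ent in kept])  # prints ['Apple', 'Tim Cook']
```

Returning a list rather than a filter object means the result can be iterated more than once or indexed.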

Related

Is there a way to exclude an apostrophe “s” from entities in spaCy’s NER component?

I am performing named entity recognition (NER) in the context of a custom entity linking component using spaCy 3.2 (I am running into the same issue with the latest 3.3 version as well).
When performing named entity recognition on texts that contain an apostrophe “s” (e.g. Apple's), I would like to exclude the apostrophe “s” from the named entity the component returns.
In instances where the named entity includes a single token in addition to the apostrophe “s”, the NER component correctly returns only the named entity token (e.g. Apple instead of Apple's):
import spacy
nlp = spacy.load("en_core_web_sm")
s = u"Apple's looking at buying U.K. startup for $1 billion"
doc = nlp(s)
for ent in doc.ents:
    print(f"{ent.text} (lemma={ent.lemma_}): {ent.label_}\n")
Apple (lemma=Apple): ORG
U.K. (lemma=U.K.): GPE
$1 billion (lemma=$1 billion): MONEY
for token in doc:
    print(f"{token.text} (IOB={token.ent_iob_}): {token.ent_type_}")
Apple (IOB=B): ORG
's (IOB=O):
...
However, when I perform named entity recognition on texts where the named entity spans multiple tokens (e.g. Apple Inc. instead of Apple), the apostrophe “s” is included as a part of the returned named entity:
s = u"Apple Inc.'s looking at buying U.K. startup for $1 billion"
doc = nlp(s)
for ent in doc.ents:
    print(f"{ent.text} (lemma={ent.lemma_}): {ent.label_}\n")
Apple Inc.'s (lemma=Apple Inc.'s): ORG
U.K. (lemma=U.K.): GPE
$1 billion (lemma=$1 billion): MONEY
for token in doc:
    print(f"{token.text} (IOB={token.ent_iob_}): {token.ent_type_}")
Apple (IOB=B): ORG
Inc. (IOB=I): ORG
's (IOB=I): ORG
...
There is no difference in tokenization between these texts (i.e. the 's portion is split out as its own token in both cases). I do not want the apostrophe “s” to be included in the named entity for entities like Apple Inc.'s and would like the NER component to return only the Apple Inc. portion of this named entity.
Is there a way to configure the NER component to prevent this behavior with multi-token named entities and exclude the apostrophe “s”?
There is no way to simply configure the component not to do this. What you can do is use a small custom component to remove 's from any entities.
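from spacy.tokens import Span

def my_component(doc):
    out = []
    for ent in doc.ents:
        if len(ent) > 1 and ent[-1].text == "'s":
            # Rebuild the span without the trailing 's token,
            # passing the label explicitly so the entity type is kept.
            out.append(Span(doc, ent.start, ent.end - 1, label=ent.label))
        else:
            out.append(ent)
    doc.ents = out
    return doc
See the spaCy docs on custom pipeline components for how to register it and add it to the pipeline after the ner component.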
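As a rough sketch of how such a component could be registered and run in spaCy 3: the component name strip_possessive is a made-up name for illustration, and the Doc below is built by hand to imitate the NER output from the question, so no trained model is needed. With a trained pipeline, the component would normally be added right after the ner component.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span


@Language.component("strip_possessive")  # hypothetical component name
def strip_possessive(doc):
    """Drop a trailing 's token from each entity, keeping its label."""
    out = []
    for ent in doc.ents:
        if len(ent) > 1 and ent[-1].text == "'s":
            out.append(Span(doc, ent.start, ent.end - 1, label=ent.label))
        else:
            out.append(ent)
    doc.ents = out
    return doc


# A blank pipeline keeps the sketch independent of any downloaded model.
# With a trained pipeline you would instead call
# nlp.add_pipe("strip_possessive", after="ner").
nlp = spacy.blank("en")
nlp.add_pipe("strip_possessive")

# Hand-built Doc that imitates the NER output shown in the question.
doc = Doc(
    nlp.vocab,
    words=["Apple", "Inc.", "'s", "looking"],
    spaces=[True, False, True, False],
)
doc.ents = [Span(doc, 0, 3, label="ORG")]

# Run the registered components over the prepared Doc.
for _name, component in nlp.pipeline:
    doc = component(doc)

print([(ent.text, ent.label_) for ent in doc.ents])
```

The len(ent) > 1 check is a defensive assumption: it avoids producing an empty span if the model ever labels a lone 's token as an entity.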

Spacy NER - How to Identify People names using matcher patterns

I'm trying to identify people's names using the following matcher pattern in spaCy, but it is identifying other words such as 'my' and 'name'. Can anyone help me find the issue in the pattern?
person_pattern = [
    {
        "label": "PERSON",
        "pattern": [{"POS": "PROPN"}, {"ENT_TYPE": "PERSON"}],
        "comment": "Spacy's in-built PERSON capture",
    }
]
Example:
My Name as in Google Record is Hannah, but i would like to modify Name as in AADHAR Hanna. My CDS ID is JANAN34
Result/Behavior:
text: My, pos_: PRON, ent_type_: PERSON
text: Name, pos_: NOUN, ent_type_: PERSON
I ran some sample code using your pattern and it seems that your pattern isn't matching anything, so the problem isn't with the Matcher. The problem seems to be with spaCy's NER models.
Your text is kind of unusual - "My Name as in..." is not normal capitalization, and the model seems to mistake it for an actual name. If you change "Name" to "name" then it's no longer detected as an entity.
I think this is just a case of your data not being similar to spaCy's training data, which is more like newspaper articles that use formal capitalization. The v3 models are a little weak to case changes at the moment because some data augmentation was accidentally left out when training them, but that should be resolved in the v3.1 release coming up soon.
If you have training data, you might look at training using spaCy's data augmentation to be more resilient to unusual data.

How should I model a schema in which models can be related to entities of different types in Django

Schema Option 1: https://gist.github.com/guyjacks/6ec4c1b0fa41b3f666f5c6adf2dfaf89
Schema Option 2: https://gist.github.com/guyjacks/4838cd76b2f924629d2a3f2ba316a504
I guess this is really two questions:
Which schema is recommended from a relational db perspective?
Is there an idiomatic way to model either schema in Django?
Cheers!
One way to do this would be to use Generic Relations:
from django.contrib.contenttypes.fields import GenericForeignKey
from django.contrib.contenttypes.models import ContentType
And in the model you want to relate to various models:
class SomeModel(models.Model):
    ...
    # The fields below are mandatory for a generic relation
    content_type = models.ForeignKey(ContentType, on_delete=models.CASCADE)
    object_id = models.PositiveIntegerField()
    content_object = GenericForeignKey()
And in the models you want related:
from django.contrib.contenttypes.fields import GenericRelation
class SomeOtherModel(models.Model):
    ...
    # The relation
    your_field = GenericRelation(SomeModel)
If you want to be able to use reverse queries see this from the docs:
related_query_name
The relation on the related object back to this object doesn’t exist
by default. Setting related_query_name creates a relation from the
related object back to this one. This allows querying and filtering
from the related object.
You should be careful when using them as they can quickly add complexity.
Links for more info:
https://docs.djangoproject.com/en/1.11/ref/contrib/contenttypes/#generic-relations
https://simpleisbetterthancomplex.com/tutorial/2016/10/13/how-to-use-generic-relations.html

Django one-to-many queries

Django newbie here.
I created three models: Band, Album and AlbumType (Live, EP, LP). Album has two foreign keys (to Band and to AlbumType). In the view I do something like this:
bandList = Band.objects.all()
That retrieves all the bands, but I can't figure out how to get all the albums of a band in the same view.
Any help will be appreciated. Thanks.
By default the related objects (through a ForeignKey on the other model) are accessible through <modelname>_set, so in this case that is band.album_set (note this is a Manager attribute, so you will probably use band.album_set.all() most of the time).
I personally prefer to give all my ForeignKey fields a related_name so I can name the attribute on the other model. The next example gives the Band the attribute band.albums.
class Band(models.Model):
    ...  # Band definition
class Album(models.Model):
    # on_delete is a required argument from Django 2.0 onwards
    band = models.ForeignKey(Band, related_name='albums', on_delete=models.CASCADE)
    # Rest of album definition
It would help if you shared your model definitions, but I hope this helps.
If you want to retrieve the albums for a specific band:
band = Band.objects.get(...)
band_albums = Album.objects.filter(band=band)
That will return the albums for that band.
If you want to retrieve the albums for all the bands:
bandList = Band.objects.prefetch_related('album_set')
This returns the bands as before, but fetches all their albums in one additional query and caches them on each band. (select_related only follows forward ForeignKey and one-to-one relations, so it cannot be used for the reverse album_set lookup.)

Django REST framework flat, read-write serializer

In Django REST framework, what is involved in creating a flat, read-write serializer representation? The docs refer to a 'flat representation' (end of the section http://django-rest-framework.org/api-guide/serializers.html#dealing-with-nested-objects) but don't offer examples or anything beyond a suggestion to use a RelatedField subclass.
For instance, how to provide a flat representation of the User and UserProfile relationship, below?
# Model
class UserProfile(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    favourite_number = models.IntegerField()
# Serializer
class UserProfileSerializer(serializers.ModelSerializer):
    email = serializers.EmailField(source='user.email')
    class Meta:
        model = UserProfile
        fields = ['id', 'favourite_number', 'email',]
The above UserProfileSerializer doesn't allow writing to the email field, but I hope it expresses the intention sufficiently well. So, how should a 'flat' read-write serializer be constructed to allow a writable email attribute on the UserProfileSerializer? Is it at all possible to do this when subclassing ModelSerializer?
Thanks.
Looking at the Django REST framework (DRF) source, I settled on the view that a DRF serializer is strongly tied to an accompanying Model for unserializing purposes. A Field's source param makes this less so for serializing purposes.
With that in mind, and viewing serializers as encapsulating validation and save behaviour (in addition to their (un)serializing behaviour) I used two serializers: one for each of the User and UserProfile models:
class UserSerializer(serializers.ModelSerializer):
    class Meta:
        model = User
        fields = ['email',]
class UserProfileSerializer(serializers.ModelSerializer):
    email = serializers.EmailField(source='user.email')
    class Meta:
        model = UserProfile
        fields = ['id', 'favourite_number', 'email',]
The source param on the EmailField handles the serialization case adequately (e.g. when servicing GET requests). For unserializing (e.g. when servicing PUT requests) it is necessary to do a little work in the view, combining the validation and save behaviour of the two serializers:
class UserProfileRetrieveUpdate(generics.GenericAPIView):
    def get(self, request, *args, **kwargs):
        # Only UserProfileSerializer is required to serialize data since
        # email is populated by the 'source' param on EmailField.
        serializer = UserProfileSerializer(
            instance=request.user.get_profile())
        return Response(serializer.data)
    def put(self, request, *args, **kwargs):
        # Both UserSerializer and UserProfileSerializer are required
        # in order to validate and save data on their associated models.
        user_profile_serializer = UserProfileSerializer(
            instance=request.user.get_profile(),
            data=request.DATA)
        user_serializer = UserSerializer(
            instance=request.user,
            data=request.DATA)
        if user_profile_serializer.is_valid() and user_serializer.is_valid():
            user_profile_serializer.save()
            user_serializer.save()
            return Response(
                user_profile_serializer.data, status=status.HTTP_200_OK)
        # Combine errors from both serializers.
        errors = dict()
        errors.update(user_profile_serializer.errors)
        errors.update(user_serializer.errors)
        return Response(errors, status=status.HTTP_400_BAD_REQUEST)
First: better handling of nested writes is on its way.
Second: The Serializer Relations docs say of both PrimaryKeyRelatedField and SlugRelatedField that "By default this field is read-write...", so if your email field is unique (is it?) you might be able to use SlugRelatedField and it would just work. I've not tried this yet, however.
Third: Instead I've used a plain Field subclass that uses the source="*" technique to accept the whole object. From there I manually pull the related field in to_native and return that — this is read-only. In order to write I've checked request.DATA in post_save and updated the related object there — This isn't automatic but it works.
So, Fourth: Looking at what you've already got, my approach (above) amounts to marking your email field as read-only and then implementing post_save to check for an email value and perform the update accordingly.
Although this does not strictly answer the question - I think it will solve your need. The issue may be more in the split of two models to represent one entity than an issue with DRF.
Since Django 1.5 you can define a custom user model. If all you want is some extra methods and fields, and you are otherwise happy with the Django user, all you need to do is subclass AbstractUser (AbstractBaseUser would require you to define the username field and manager yourself):
class MyUser(AbstractUser):
    favourite_number = models.IntegerField()
and in settings: AUTH_USER_MODEL = 'myapp.MyUser'
(And of course a database migration, which can be kept quite simple by using the db_table option to point to your existing user table and just adding the new columns there.)
After that, you have the common case which DRF excels at.