Is there a way to exclude an apostrophe “s” from entities in spaCy’s NER component? - spacy

I am performing named entity recognition (NER) in the context of a custom entity linking component using spaCy 3.2 (I am running into the same issue with the latest 3.3 version as well).
When performing named entity recognition on texts that contain an apostrophe “s” (e.g. Apple's), I would like to exclude the apostrophe “s” from the named entity the component returns.
In instances where the named entity includes a single token in addition to the apostrophe “s”, the NER component correctly returns only the named entity token (e.g. Apple instead of Apple's):
import spacy
nlp = spacy.load("en_core_web_sm")
s = u"Apple's looking at buying U.K. startup for $1 billion"
doc = nlp(s)
for ent in doc.ents:
    print(f"{ent.text} (lemma={ent.lemma_}): {ent.label_}\n")
Apple (lemma=Apple): ORG
U.K. (lemma=U.K.): GPE
$1 billion (lemma=$1 billion): MONEY
for token in doc:
    print(f"{token.text} (IOB={token.ent_iob_}): {token.ent_type_}")
Apple (IOB=B): ORG
's (IOB=O):
...
However, when I perform named entity recognition on texts where the named entity spans multiple tokens (e.g. Apple Inc. instead of Apple), the apostrophe “s” is included as a part of the returned named entity:
s = u"Apple Inc.'s looking at buying U.K. startup for $1 billion"
doc = nlp(s)
for ent in doc.ents:
    print(f"{ent.text} (lemma={ent.lemma_}): {ent.label_}\n")
Apple Inc.'s (lemma=Apple Inc.'s): ORG
U.K. (lemma=U.K.): GPE
$1 billion (lemma=$1 billion): MONEY
for token in doc:
    print(f"{token.text} (IOB={token.ent_iob_}): {token.ent_type_}")
Apple (IOB=B): ORG
Inc. (IOB=I): ORG
's (IOB=I): ORG
...
There is no difference in tokenization between these texts (i.e. the 's portion is split out as its own token in both cases). I do not want the apostrophe “s” to be included in the named entity for entities like Apple Inc.'s and would like the NER component to return only the Apple Inc. portion of this named entity.
Is there a way to configure the NER component to prevent this behavior with multi-token named entities and exclude the apostrophe “s”?

There is no way to simply configure the component not to do this. What you can do is use a small custom component to remove 's from any entities.
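from spacy.tokens import Span

def my_component(doc):
    out = []
    for ent in doc.ents:
        if len(ent) > 1 and ent[-1].text == "'s":
            # Build the shortened span explicitly so the entity label is kept.
            out.append(Span(doc, ent.start, ent.end - 1, label=ent.label_))
        else:
            out.append(ent)
    doc.ents = out
    return doc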
See the docs for info on how to use it.
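In spaCy v3, a function like this has to be registered under a name before it can be added to a pipeline. The following is a minimal sketch rather than a definitive implementation: the component name strip_possessive_s is made up for illustration, and the demo uses a blank pipeline with a hand-built entity (instead of en_core_web_sm) so that it runs without a trained model.

```python
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Span


@Language.component("strip_possessive_s")  # hypothetical component name
def strip_possessive_s(doc):
    """Drop a trailing "'s" token from every entity in doc.ents."""
    trimmed = []
    for ent in doc.ents:
        if len(ent) > 1 and ent[-1].text == "'s":
            # Rebuild the span one token shorter, carrying the label over.
            ent = Span(doc, ent.start, ent.end - 1, label=ent.label_)
        trimmed.append(ent)
    doc.ents = trimmed
    return doc


# Demo on a blank pipeline with a hand-built entity, so no trained model is needed.
nlp = spacy.blank("en")
words = ["Apple", "Inc.", "'s", "looking"]
spaces = [True, False, True, False]
doc = Doc(nlp.vocab, words=words, spaces=spaces)
doc.ents = [Span(doc, 0, 3, label="ORG")]  # mimics the "Apple Inc.'s" entity

doc = strip_possessive_s(doc)
print([(ent.text, ent.label_) for ent in doc.ents])
```

With a trained pipeline, the registered component would be placed after the NER step, e.g. nlp.add_pipe("strip_possessive_s", after="ner"), so that it sees the entities the model predicted.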

Related

Laravel Scout toSearchableArray attribute is not filterable

I've been doing some testing with laravel scout and according to the documentation (https://laravel.com/docs/8.x/scout#configuring-searchable-data), I've mapped my User model as such:
/**
 * Get the indexable data array for the model.
 *
 * @return array
 */
public function toSearchableArray()
{
    $data = $this->toArray();
    return array_merge($data, [
        'entity' => 'An entity'
    ]);
}
Just for the sake of testing, this is the minimal version I reduced it to while debugging.
After importing the User model with this mapping, I can see on the meilisearch dashboard it is indeed showing the user data + the entity = 'An entity'.
However, when applying this:
User::search('something')->where('entity', 'An entity')->get()
It produces the following error:
"message": " --> 1:1\n |\n1 | entity=\"An entity\"\n | ^----^\n |\n = attribute `entity` is not filterable, available filterable attributes are: ",
"exception": "MeiliSearch\\Exceptions\\ApiException",
"file": "/var/www/api/vendor/meilisearch/meilisearch-php/src/Http/Client.php",
Tracing back to view the 'filterable attributes', I've ended at the conclusion that:
$client = app(\MeiliSearch\Client::class);
dump($client->index('users')->getFilterableAttributes()); // Returns []
$client->index('users')->updateFilterableAttributes(['entity']);
dump($client->index('users')->getFilterableAttributes()); // Returns ['entity']
Forcing updateFilterableAttributes now allows me to complete the search as intended, but I don't feel this should be the regular behaviour. If it's mapped in toSearchableArray, shouldn't it be searchable? What am I not understanding, and what other approaches are there to achieve this goal?
This is actually not an issue but a requirement of Meilisearch in particular. Under the hood, Scout uses different drivers for indexing ("algolia", "meilisearch", "database", "collection" and even "null"). Each of them indexes differently, and unifying them would, I believe, be troublesome and inefficient for Scout.
Filtering (or faceted search, as Meilisearch calls it) therefore requires declaring the filterable attributes first. That list is empty by default for document fields (model attributes in Laravel).
Quoting from the docs:
This step is mandatory and cannot be done at search time. Filters need
to be properly processed and prepared by Meilisearch before they can
be used.
Updating filterableAttributes requires recreating the entire
index. This may take a significant amount of time depending on your
dataset size.
For more info please refer to meilisearch official docs https://docs.meilisearch.com/learn/advanced/filtering_and_faceted_search.html

Spacy NER - How to Identify People names using matcher patterns

I'm trying to identify people's names using the following matcher pattern in spaCy, but it is also picking up other words like 'My' and 'Name'. Can anyone help me identify the issue in the pattern?
person_pattern = [
    {"label": "PERSON",
     "pattern": [{'POS': 'PROPN'}, {"ENT_TYPE": "PERSON"}],
     "comment": "Spacy's in-built PERSON capture"
    }]
Example:
My Name as in Google Record is Hannah, but i would like to modify Name as in AADHAR Hanna. My CDS ID is JANAN34
Result/Behavior:
text: My, pos_: PRON, ent_type_: PERSON
text: Name, pos_: NOUN, ent_type_: PERSON
I ran some sample code using your pattern and it seems that your pattern isn't matching anything, so the problem isn't with the Matcher. The problem seems to be with spaCy's NER models.
Your text is kind of unusual - "My Name as in..." is not normal capitalization, and the model seems to mistake it for an actual name. If you change "Name" to "name" then it's no longer detected as an entity.
I think this is just a case of your data not being similar to spaCy's training data, which is more like newspaper articles that use formal capitalization. The v3 models are somewhat sensitive to case changes at the moment because some data augmentation was accidentally left out when training them, but that should be resolved in the v3.1 release coming up soon.
If you have training data, you might look at training using spaCy's data augmentation to be more resilient to unusual data.

Filtering Entities Based on the type "PERSON", "ORG" etc in Spacy

After creating an nlp pipeline with spaCy, I passed the doc through the pipeline.
I am trying to filter the entities based on their type.
for ent in doc.ents:
    print(ent.text)
What would be the code to filter just the PERSON or ORG entities from the ents list?
for ent in doc.ents:
    if ent.label_ in ["PERSON", "ORG"]:
        print(ent.text)
See also https://spacy.io/usage/linguistic-features#named-entities-101 for an overview of the relevant properties available from NER.
or,
filter(lambda e: e.label_ in ["PERSON", "ORG"], doc.ents)
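Note that in Python 3, filter returns a lazy iterator, so it has to be consumed (for example with list) before the results can be inspected. A small sketch of both variants follows; to keep it runnable without a model, it uses a hypothetical Ent namedtuple as a stand-in for spaCy's entity spans, modelling only the text and label_ attributes.

```python
from collections import namedtuple

# Hypothetical stand-in for spaCy's entity spans; only .text and .label_ are modelled.
Ent = namedtuple("Ent", ["text", "label_"])
ents = (Ent("Hannah", "PERSON"), Ent("Google", "ORG"), Ent("$1 billion", "MONEY"))

wanted = {"PERSON", "ORG"}

# filter() is lazy in Python 3; wrap it in list() to materialise the results.
via_filter = list(filter(lambda e: e.label_ in wanted, ents))

# Equivalent list comprehension.
via_comprehension = [e for e in ents if e.label_ in wanted]

print([e.text for e in via_filter])
```

With a real pipeline, ents would simply be doc.ents.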

Designing a rest url - filter with multiple params over entities (not last entity)

Let's say I have the following entities in my library app: Library, Room, Shelf, Book.
A Room has N Shelves, and a Shelf has N Books.
Now the following url brings me a list of books whose
library is 3, room no. is 5 and shelf no. is 43.
.../library/3/room/5/shelf/43/books
Assuming shelf 43 is unique per room only
(There is shelf 43 also in other rooms)
and Rooms are not unique (There's a few room no. 5 ) in the library.
Here is my question:
I want to filter on more fields of the entities. Here is what I want to do
(this representation is not RESTful):
.../library/id=3&type=3/room/decade=21&topic=horror/shelf/location=east&/books
This is not rest.
How do I represent it in rest?
Notes:
I don't want to do this way
.../books&param1=X&param2=X&param3=X&param4=X
because not all params are related to books.
Couple of things that you need to look into while designing your apis.
1) Are type, decade, topic, etc. required fields? If so, I would probably make them part of the path itself, such as:
../libraries/{libraryId}/type/{typeId}/rooms/{roomId}/decades/{decadeId}/topics/{topicName}/shelves/{shelfId}/locations/{shelfLocation}/books
Here I am assuming that each library has rooms with room ids unique per library, each room has shelves with ids/locations unique per room, and so on. Yes, the URL is pretty long, but that's to be expected.
2) If these fields are not required, you could use a different approach that is a bit less verbose but may confuse client developers who have never seen it before. Here's an example straight from RESTful Java with JAX-RS by Bill Burke:
@Path("{first}-{last}")
@GET
@Produces("application/xml")
public StreamingOutput getCustomer(@PathParam("first") String firstName,
                                   @PathParam("last") String lastName) {
    ...
}
Here, we have the URI path parameters {first} and {last}. If our HTTP request is
GET /customers/bill-burke, bill will be injected into the firstName parameter and
burke will be injected into the lastName parameter.
If we follow this somewhat academic approach (I have not seen this implemented on many platforms. Most platforms normally go with approach # 1, a more verbose but clear approach), your URL would look somewhat like this:
../libraries/{libraryId}-{typeId}/rooms/{roomId}-{decadeId}-{topicName}/shelves/{shelfId}-{shelfLocation}/books
This way, if the client developer doesn't pass in the non-required fields, you can handle it at the business logic level and assign these variables a default value, for example:
../libraries/3-/rooms/2-1-horror/shelves/1-/books
With this URL, libraryId = 3 and typeId = null (so it can fall back to its default value), and so on. Remember that if libraryId is a required field, you might want to make it a path parameter of its own.
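The defaulting described above can be sketched in a few lines. This is only an illustration under stated assumptions: parse_segment is a made-up helper, not part of any framework, and it assumes that only the last field of a segment may itself contain a hyphen.

```python
def parse_segment(segment, names, defaults=None):
    """Split a hyphen-delimited path segment into named fields.

    Empty or missing parts fall back to the entry in `defaults`
    (or None when no default is given).
    """
    defaults = defaults or {}
    # Split at most len(names) - 1 times so the last field keeps any hyphens.
    parts = segment.split("-", len(names) - 1)
    fields = {}
    for i, name in enumerate(names):
        value = parts[i] if i < len(parts) and parts[i] != "" else None
        fields[name] = value if value is not None else defaults.get(name)
    return fields


# Segments taken from ../libraries/3-/rooms/2-1-horror/shelves/1-/books
library = parse_segment("3-", ["libraryId", "typeId"], {"typeId": "any"})
room = parse_segment("2-1-horror", ["roomId", "decadeId", "topicName"])
shelf = parse_segment("1-", ["shelfId", "shelfLocation"])

print(library)
print(room)
print(shelf)
```

The default value "any" for typeId is an arbitrary placeholder; in practice the business logic would decide what a missing field means.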
Hope this helps!

Django REST framework flat, read-write serializer

In Django REST framework, what is involved in creating a flat, read-write serializer representation? The docs refer to a 'flat representation' (end of the section http://django-rest-framework.org/api-guide/serializers.html#dealing-with-nested-objects) but don't offer examples or anything beyond a suggestion to use a RelatedField subclass.
For instance, how to provide a flat representation of the User and UserProfile relationship, below?
# Model
class UserProfile(models.Model):
    user = models.OneToOneField(User)
    favourite_number = models.IntegerField()

# Serializer
class UserProfileSerializer(serializers.ModelSerializer):
    email = serializers.EmailField(source='user.email')

    class Meta:
        model = UserProfile
        fields = ['id', 'favourite_number', 'email',]
The above UserProfileSerializer doesn't allow writing to the email field, but I hope it expresses the intention sufficiently well. So, how should a 'flat' read-write serializer be constructed to allow a writable email attribute on the UserProfileSerializer? Is it at all possible to do this when subclassing ModelSerializer?
Thanks.
Looking at the Django REST framework (DRF) source, I settled on the view that a DRF serializer is strongly tied to an accompanying Model for unserializing purposes. A Field's source param makes this less so for serializing purposes.
With that in mind, and viewing serializers as encapsulating validation and save behaviour (in addition to their (un)serializing behaviour) I used two serializers: one for each of the User and UserProfile models:
class UserSerializer(serializers.ModelSerializer):
    class Meta:
        model = User
        fields = ['email',]

class UserProfileSerializer(serializers.ModelSerializer):
    email = serializers.EmailField(source='user.email')

    class Meta:
        model = UserProfile
        fields = ['id', 'favourite_number', 'email',]
The source param on the EmailField handles the serialization case adequately (e.g. when servicing GET requests). For unserializing (e.g. when servicing PUT requests) it is necessary to do a little work in the view, combining the validation and save behaviour of the two serializers:
from rest_framework import generics, status
from rest_framework.response import Response

class UserProfileRetrieveUpdate(generics.GenericAPIView):

    def get(self, request, *args, **kwargs):
        # Only UserProfileSerializer is required to serialize data since
        # email is populated by the 'source' param on EmailField.
        serializer = UserProfileSerializer(
            instance=request.user.get_profile())
        return Response(serializer.data)

    def put(self, request, *args, **kwargs):
        # Both UserSerializer and UserProfileSerializer are required
        # in order to validate and save data on their associated models.
        user_profile_serializer = UserProfileSerializer(
            instance=request.user.get_profile(),
            data=request.DATA)
        user_serializer = UserSerializer(
            instance=request.user,
            data=request.DATA)
        if user_profile_serializer.is_valid() and user_serializer.is_valid():
            user_profile_serializer.save()
            user_serializer.save()
            return Response(
                user_profile_serializer.data, status=status.HTTP_200_OK)
        # Combine errors from both serializers.
        errors = dict()
        errors.update(user_profile_serializer.errors)
        errors.update(user_serializer.errors)
        return Response(errors, status=status.HTTP_400_BAD_REQUEST)
First: better handling of nested writes is on its way.
Second: The Serializer Relations docs say of both PrimaryKeyRelatedField and SlugRelatedField that "By default this field is read-write...", so if your email field is unique (is it?) you might be able to use SlugRelatedField and it would just work. I've not tried this yet, however.
Third: Instead I've used a plain Field subclass that uses the source="*" technique to accept the whole object. From there I manually pull the related field in to_native and return that; this is read-only. In order to write, I've checked request.DATA in post_save and updated the related object there. This isn't automatic, but it works.
So, Fourth: Looking at what you've already got, my approach (above) amounts to marking your email field as read-only and then implementing post_save to check for an email value and perform the update accordingly.
Although this does not strictly answer the question - I think it will solve your need. The issue may be more in the split of two models to represent one entity than an issue with DRF.
Since Django 1.5 you can define a custom user model. If all you want is a few extra methods and fields, and you are otherwise happy with the Django user, all you need to do is subclass AbstractUser (which keeps the standard user fields, unlike AbstractBaseUser):
from django.contrib.auth.models import AbstractUser

class MyUser(AbstractUser):
    favourite_number = models.IntegerField()
and in settings: AUTH_USER_MODEL = 'myapp.MyUser'
(And of course a db-migration, which could be made quite simple by using db_table option to point to your existing user table and just add the new columns there).
After that, you have the common case which DRF excels at.