spaCy NER - How to identify people's names using matcher patterns

I'm trying to identify people's names using the following matcher pattern in spaCy, but it is also matching other words like 'My' and 'Name'. Can anyone help me identify the issue in the pattern?
person_pattern = [
    {"label": "PERSON",
     "pattern": [{"POS": "PROPN"}, {"ENT_TYPE": "PERSON"}],
     "comment": "spaCy's in-built PERSON capture"
    }]
Example:
My Name as in Google Record is Hannah, but i would like to modify Name as in AADHAR Hanna. My CDS ID is JANAN34
Result/Behavior:
text: My, pos_: PRON, ent_type_: PERSON
text: Name, pos_: NOUN, ent_type_: PERSON

I ran some sample code using your pattern, and it seems that your pattern isn't matching anything, so the problem isn't with the Matcher. The problem seems to be with spaCy's NER model.
Your text is kind of unusual - "My Name as in..." is not normal capitalization, and the model seems to mistake it for an actual name. If you change "Name" to "name" then it's no longer detected as an entity.
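Here is a minimal sketch of the kind of check I ran (assuming the en_core_web_sm model; the question doesn't say which model was used):

import spacy

nlp = spacy.load("en_core_web_sm")

for text in [
    "My Name as in Google Record is Hannah",  # unusual capitalization
    "My name as in Google Record is Hannah",  # lowercase "name"
]:
    doc = nlp(text)
    # Print every token the statistical NER tags as part of an entity.
    print([(t.text, t.pos_, t.ent_type_) for t in doc if t.ent_type_])

With the lowercase variant, "name" should no longer come back with ent_type_ PERSON.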
I think this is just a case of your data not being similar to spaCy's training data, which is more like newspaper articles that use formal capitalization. The v3 models are currently a little sensitive to case changes because some data augmentation was accidentally left out when training them, but that should be resolved in the upcoming v3.1 release.
If you have training data, you might look at training with spaCy's data augmentation to make the model more resilient to unusual data.

Related

Representing complex data types in XACML using Authzforce

I am new to XACML, and I would be grateful if you could help me with a problem I encountered.
I use AuthzForce Core PDP (version 17.1.2).
I am wondering what is the correct approach of representing complex data types in XACML.
Example
Access should be granted if the PIP response contains any person whose name is present in the names array from the request and whose salary is higher than the salary provided in the request.
Request
names = ["Eric", "Kyle"]
salary = 1500
PIP response
[
    {
        "name": "Kyle",
        "salary": 1000
    },
    {
        "name": "Kenny",
        "salary": 2000
    },
    {
        "name": "Eric",
        "salary": 4000
    },
    {
        "name": "Stan",
        "salary": 3000
    }
]
Access will be granted because the PIP response contains a person with the name Eric whose salary is higher than 1500.
My implementation
To represent the PIP response I ended up creating a custom type by extending the StringParseableValue class from AuthzForce. For the above-mentioned logic I use an attribute designator in XML and have a corresponding attribute provider (a class extending BaseNamedAttributeProvider) in Java performing the PIP call.
I also wrote two custom functions:
Find people with higher salary than provided in one param (returns filtered list)
Get person name (returns string)
Using those functions and standard functions, I wrote the policy, and it works.
However, my solution seems overcomplicated. I suppose what I did can be achieved using only standard functions.
Additionally, if I wanted to define a hardcoded bag of people inside another policy, a single element would look like this:
<AttributeValue DataType="person">name=Eric###salary=4000</AttributeValue>
There is always a possibility that parsing such strings might fail.
So my question is: what is a good practice for representing complex types like my PIP response in XACML using AuthzForce? Sometimes I might need to pass more complex data in the request, and I saw an example in the XACML specification showing how to pass such data inside a <Content> element.
Creating a new XACML data-type - and consequently new XACML function(s) to handle that new data-type - does seem a bit overkill indeed. Instead, you may improve your PIP (Attribute Provider) a little, so that it returns only the results for the employees named in the Request, and only their salaries (extracted from the JSON using JSONPath), returned as a bag of integers.
Then, assuming this PIP result is set to the attribute employee_salaries in your policy (a bag of integers), and min_salary is the salary in the Request, it is just a matter of applying any-of(integer-less-than, min_salary, employee_salaries) in a Condition. (I'm using short names for the functions for convenience; please refer to the XACML 3.0 standard for the full identifiers.)
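To make the semantics of that Condition concrete, here is a small Python sketch (any_of and integer_less_than simply mirror the XACML functions; this is not AuthzForce API):

# any-of(integer-less-than, min_salary, employee_salaries) is true iff
# min_salary < s for at least one salary s in the bag.
def any_of(predicate, value, bag):
    return any(predicate(value, s) for s in bag)

def integer_less_than(a, b):
    return a < b

print(any_of(integer_less_than, 1500, [1000, 4000]))  # True -> Permit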
Tips to improve the PIP:
One issue here is performance (scalability, response time/size, etc.): if you have hundreds or even thousands of employees, it is overkill to fetch the whole list from the REST service over and over, all the more so as you need only a small subset (the names in the Request). Instead, you may have some way to ask the REST service to return only specific employees, using query parameters; here is an example using RSQL (but this depends on the REST service API):
HTTP GET http://rest-service.example.com/employees?search=names=in=($employee_names)
... where you set the $employee_names variable to a (comma-separated) list of the employee names from the Request (e.g. Eric,Kyle). You can get these in your AttributeProvider implementation from the EvaluationContext argument of the overridden get(...) method (EvaluationContext#getNamedAttributeValue(...)).
Then you can use a JSON path library (as you did) to extract the salaries from the JSON response (so you have only the salaries of the employees named in the Request), using this JSON path for instance (tested with Jayway):
$[*].salary
If the previous option is not possible, i.e. you have no way of filtering employees on the REST API side, you can always do this filtering in your AttributeProvider implementation with the JSONPath library, using this JSONPath for instance (tested with Jayway against your PIP response):
$[?(@.name in [$employee_names])].salary
... where you set the $employee_names variable as in the previous option, getting the names from the EvaluationContext. So the actual JSONPath after variable replacement would be something like:
$[?(@.name in [Eric,Kyle])].salary
(You may add quotes to each name to be safe.)
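For illustration, here is the same filtering written as plain Python against the PIP response from the question (the real PIP would do this in Java with a JSONPath library such as Jayway):

pip_response = [
    {"name": "Kyle", "salary": 1000},
    {"name": "Kenny", "salary": 2000},
    {"name": "Eric", "salary": 4000},
    {"name": "Stan", "salary": 3000},
]
employee_names = ["Eric", "Kyle"]

# Equivalent of $[?(@.name in ['Eric','Kyle'])].salary
employee_salaries = [p["salary"] for p in pip_response
                     if p["name"] in employee_names]
print(employee_salaries)  # [1000, 4000]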
All things considered, if you still prefer to go with a new XACML data-type (and functions), and since you seem to have done most of the work already (impressive, btw), I have a suggestion, if doable without too much extra work: generalize the Person data-type to a more generic JSON object data-type that could be reused in any use case dealing with JSON. Then see whether the extra functions could be replaced with a generic JSONPath evaluation function applied to the new JSON object data-type. This would provide a JSON equivalent to the standard XML/XPath data-type and functions we already have in XACML, and this kind of contribution would greatly benefit the AuthzForce community.
For the JSON object data-type, you can actually use the one in the testutils module as an example: CustomJsonObjectBasedAttributeValue, which has been used to test support of JSON objects for the GeoXACML extension.

REST GET method: Can it return a list of enriched resources?

I have a question about designing a REST API.
Consider that I have a resource "Customer" with two elements on my server, like this:
[
    {
        "name": "Mary",
        "description": "An imaginary woman very tall."
    },
    {
        "name": "John",
        "description": "Just a guy."
    }
]
And I want to make an endpoint that will accept a GET request with a query parameter. The parameter will provide a piece of text, and an algorithm will count how many occurrences of that text appear across all of each resource's fields.
So if we throw this request:
GET {baseURL}/customers?letters=ry
I should get something like
[
    {
        "name": "Mary",
        "description": "An imaginary woman very tall.",
        "count": 3
    },
    {
        "name": "John",
        "description": "Just a guy.",
        "count": 0
    }
]
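For concreteness, here is a minimal sketch of the counting behaviour I have in mind (a hypothetical Flask handler; the route and data are illustrative only):

from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for the two Customer resources above.
CUSTOMERS = [
    {"name": "Mary", "description": "An imaginary woman very tall."},
    {"name": "John", "description": "Just a guy."},
]

@app.route("/customers")
def customers():
    letters = request.args.get("letters", "")
    enriched = []
    for c in CUSTOMERS:
        # Count (non-overlapping) occurrences of the query text across
        # every field of the resource.
        count = sum(str(v).count(letters) for v in c.values()) if letters else 0
        enriched.append({**c, "count": count})
    return jsonify(enriched)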
The count parameter cannot be included in the resource schema, as it depends on the value provided in the query, so the response objects have to be enriched.
I'm not getting a list of my resource but a list of modified resources.
Although this keeps the GET method idempotent, it seems to me to stray from the REST architecture concept (even REST beyond CRUD).
Is it still a valid endpoint in a RESTful API? Or should I create something like a new resource called "ratedCustomer"?
REST GET method: Can it return a list of enriched resources?
TL;DR: yes.
Longer answer...
A successful GET request returns a representation of a single resource, identified by the request-target.
The fact that the information used to create the representation of the resource comes from multiple entities in your domain model, or multiple rows in your database, or from reports produced by other services... these are all implementation details. The HTTP transfer of documents over a network application doesn't care.
That also means that we can have multiple resources that include the same information in their representations. Think "pages in wikipedia" that duplicate each others' information.
Resource identifiers on the web are semantically opaque. All three of these identifiers are understood to identify different resources:
/A
/A?enriched
/B
We human beings looking at these identifiers might expect /A?enriched to be semantically closer to /A than /B, but the machines don't make that assumption.
It's perfectly reasonable for /A?enriched to produce representations using a different schema, or even a different content-type (as far as the HTTP application is concerned, it's perfectly reasonable that /A be an HTML document and /A?enriched be an image).
Because the machines don't care, you've got additional degrees of freedom in how you design both your resources and your resource identifiers, which you can use to enjoy additional benefits, including designing a model that's easy to implement, or easy to document, or easy to interface with, or easy to monitor, and so on.
Design is what we do to get more of what we want than we would get by just doing it.

Separation between TBox and ABox in Fuseki with TDB and Pubby

For my current project I need to load a dataset and several ontologies and expose everything as linked data using Fuseki with TDB and Pubby. Pubby takes a data set from a single location and creates URIs based on that location; therefore, if we need multiple locations (as is the case with 2–3 separate ontologies), that is easy to do with Pubby by adding another data set.
The concept of dataset seems to also apply to Fuseki.
Essentially I will need to expose three types of URIs:
www.mywebsite.com/project/data
www.mywebsite.com/project/data/structure
www.mywebsite.com/project/ontology
In order to create such URIs with Pubby 0.3.3, you have to specify lines like these:
conf:dataset [
    conf:sparqlEndpoint <sparql_endpoint_url_ONE>;
    conf:sparqlDefaultGraph <sparql_default_graph_name_ONE>;
    conf:datasetBase <http://mywebsite.com/project/>;
    conf:datasetURIPattern "(data)/.*";
    (...)
]
Each data set specified in Pubby will take its data from a certain URL (typically a SPARQL endpoint).
For the ontologies you will have a dataset that uses a second datasetURIPattern, like this one:
conf:dataset [
    conf:sparqlEndpoint <sparql_endpoint_url_TWO>;
    conf:sparqlDefaultGraph <sparql_default_graph_name_TWO>;
    conf:datasetBase <http://mywebsite.com/project/>;
    conf:datasetURIPattern "(ontology)/.*";
    (...)
]
As you can see, the differences lie in the following: conf:sparqlEndpoint (the SPARQL endpoint), conf:sparqlDefaultGraph (the default graph), and conf:datasetURIPattern (needed for creating the actual URIs with Pubby).
It is not, however, clear to me how I can have separate URIs for the data sets when using Fuseki. When using Sesame, for example, I just create two different repositories, and this trick works like a charm when publishing data with Pubby; it is not immediately clear how to do the same with Fuseki.
The examples in the official Fuseki documentation present a single dataset (read-only or not, etc.), but none of them seems to cover such a scenario. There is no immediate example with a clear separation between the TBox and the ABox, even though this is a fundamental principle of Linked Data (see "Keeping ABox and TBox Split").
As far as I understand this should be possible, but how? Also, is it correct that the TBox and ABox can later be reunited under a single SPARQL endpoint by using tdb:unionDefaultGraph true;?
The dataset concept is not unique to Jena Fuseki; it's quite central in SPARQL. A dataset is a collection of named graphs and a default graph. The prefix of a URI has nothing to do with where triples about it are stored (whether in a named graph or in the default graph).
It sounds like you want to keep your ABox triples in one named graph and your TBox triples in another. Then, if the default graph is the union of the named graphs, you'll see the contents of both in the default graph. You can do that in Fuseki.
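Here is a small rdflib sketch of that behaviour (rdflib is used purely for illustration; in Fuseki itself the equivalent switch is tdb:unionDefaultGraph true in the TDB assembler configuration):

from rdflib import RDF, RDFS, Dataset, Namespace, URIRef

EX = Namespace("http://mywebsite.com/project/")

# default_union=True makes the default graph the union of all named graphs.
ds = Dataset(default_union=True)

tbox = ds.graph(URIRef("http://mywebsite.com/project/ontology"))
abox = ds.graph(URIRef("http://mywebsite.com/project/data"))

tbox.add((EX.Person, RDF.type, RDFS.Class))  # TBox triple
abox.add((EX.alice, RDF.type, EX.Person))    # ABox triple

# A query against the default graph now sees both TBox and ABox.
for row in ds.query("SELECT (COUNT(*) AS ?n) WHERE { ?s ?p ?o }"):
    print(row.n)  # 2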

RESTful API for large "mash up" data pulls?

I have recently been tasked with a small project of setting up a periodic (somewhere between daily & weekly) data dump from an internal database to a 3rd party product. This project dovetails nicely with my company's desire (one which I share) to start standing up a formal service layer/API over the top of our data.
My personal preference is that those APIs should take on the form of RESTful endpoints - however, now I have what I think is a big design question - let me explain...
Looking at the data pull in question, it's hardly complicated. If I were just going to construct a one-off query, it would conceptually look a little something like:
select o.order_num, o.order_date, p.product_description, sr.sales_rep_name
from order o, line_item li, product p, sales_rep sr
where li.order_num = o.order_num
and li.product_id = p.product_id
and sr.sales_rep_id = o.sales_rep_id
and o.order_date >= [some arbitrary date]
Flipping my brain into "Resource Mode", I can think about how to convert this basic data model into URIs/payloads without too much trouble:
GET /orders/123
{
    "order_num": 123,
    "order_date": "2014-07-07",
    "sales_rep_uri": "/salesRep/34",
    "line_items_uri": "/orders/123/lineItems"
}
Getting more information about the sales rep:
GET /salesRep/34
{
    "sales_rep_name": "Jane Doe",
    "open_orders_uri": "/salesRep/34/orders"
}
Getting more information about line items:
GET /orders/123/lineItems
{
    "line_items": [
        {"order_uri": "/orders/123", "product_uri": "/products/68"},
        {"order_uri": "/orders/123", "product_uri": "/products/99"}
    ]
}
And so on. I'm not saying it's a perfect API; I'm just trying to demonstrate that it's not exactly rocket science to think through how you might express the data model in a nicely normalized, resource-oriented way via RESTful URIs. But that is exactly where the design question comes into play...
On one hand, I can crank out a query to solve the problem very easily, but the very nature of queries requires the various domain concepts to be tightly coupled (in other words, utilizing joins to bring all of the normalized data together into one nice, custom-purpose de-normalized structure).
On the other hand, going through the mental process of thinking out a RESTful API leads me right back down the road of keeping things nicely compartmentalized - e.g. asking for "Order 123" shouldn't send me back a huge graph where I can see the full product description, the sales rep's phone number, etc. The concept of a full-blown, HATEOAS-level RESTful API dictates that consumers should make subsequent GETs to drill down for that kind of detail only as needed.
My question boils down to this: solving this use case seems really easy with a direct query and really difficult against a nice and tidy RESTful API (I'm picturing the literally thousands of individual GETs it would take to assemble a week's worth of data versus the few seconds it would take the query to run). Is there some elegant subtlety of good RESTful design that I don't understand that would prevent me from seeing a good solution, or am I trying to fit a round peg into a square hole (i.e. REST is not good at pulling big data batches across multiple resources)?
I'm just going to throw this out there as a potential solution:
Conceptually, I'd treat the results of this query as a resource unto itself - call it "orderReport".
Treating this as its own resource, the API could behave something like:
GET /orderReport/[some arbitrary date]
You could then send back a 201 Created (if the query runs relatively quickly) with a location header like Location: /orderReport/[GUID]. Alternatively, if the query takes a while to run (I honestly don't know off the top of my head whether it does), you could send back a 202 Accepted with a location header of Location: /orderReport/[GUID]/status.
You could then do follow-up GETs against those URLs to get either the report status (200 OK if it's still processing without error, 201 Created with a location header pointing to the report URL once it's done) or the report itself.
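Sketched as a hypothetical Flask service (the route names, the in-memory REPORTS store, and the threading are all illustrative only):

import threading
import uuid
from flask import Flask, jsonify

app = Flask(__name__)
REPORTS = {}  # report_id -> {"status": "processing" | "done"}

def run_report(report_id, since_date):
    # Stand-in for the real (possibly slow) query; a real implementation
    # would store the result rows alongside the status.
    REPORTS[report_id]["status"] = "done"

@app.route("/orderReport/<since_date>")
def request_report(since_date):
    # Report creation modeled as a GET on /orderReport/{date}, as above.
    # (In practice the date-based and GUID-based URIs would need to be
    # distinguishable; that detail is glossed over here.)
    report_id = str(uuid.uuid4())
    REPORTS[report_id] = {"status": "processing"}
    threading.Thread(target=run_report, args=(report_id, since_date)).start()
    resp = jsonify({"status": "processing"})
    resp.status_code = 202  # Accepted: the report is being generated
    resp.headers["Location"] = f"/orderReport/{report_id}/status"
    return resp

@app.route("/orderReport/<report_id>/status")
def report_status(report_id):
    if REPORTS.get(report_id, {}).get("status") != "done":
        return jsonify({"status": "processing"}), 200
    resp = jsonify({"status": "done"})
    resp.status_code = 201  # Created: the finished report is at Location
    resp.headers["Location"] = f"/orderReport/{report_id}"
    return resp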
There's nothing to say the report data couldn't also incorporate HATEOAS in addition to the data strictly needed to fulfill the use case requirements, like:
[
    {
        "order_num": 123,
        "order_uri": "/orders/123",
        "order_date": "2014-07-03",
        "product_description": "widget",
        "sales_rep_name": "Jane Doe",
        "sales_rep_uri": "/salesRep/34"
    },
    {
        "order_num": 456,
        "order_uri": "/orders/456",
        "order_date": "2014-07-04",
        "product_description": "gadget",
        "sales_rep_name": "Frank Smith",
        "sales_rep_uri": "/salesRep/53"
    }
]

REST API how to model resources which need to maintain a specific order

We are in the middle of designing a new feature for our API and we stumbled across a dilemma.
We have two different types of resources with a 1-N relationship.
Representations and Layers.
A Representation can contain multiple Layers. A Layer can belong only to one Representation.
The thing we are stuck on is that we need to maintain an order for the Layers in a Representation.
And we came up with two approaches:
First approach: Linked List
Each Layer knows about its previous layer. In the DB this is achieved by having a "parent" field in the Layers table that contains the id of another Layer. The first Layer will have a "parent" set to NULL.
This is then exposed through the API via the following URIs:
Create
GET /Representations/{repID}/layers
gets all the layers for a representation. The order can be worked out by going through all the layers and looking at the Parent field (see the sketch at the end of this approach).
POST /Representations/{repID}/layers
body:
Label: (string)
Parent: LayerId
This is used to create and insert a Layer in a specific position, by specifying the Parent in the request body. If you set Parent to NULL the newly created Layer will be the first Layer in the order.
If you omit the Parent field, the newly created Layer will be positioned at the bottom of the order.
The problem with this is that, in the response, we need to notify the API consumer that other layers have changed order because of the new insertion.
Update
PUT /Representations/{repID}/layers/{layerId}
body:
Label: (string)
Parent: LayerId
Again you can specify a new Parent to re-order the layer and again we will need to send back some information about all the other layers that have changed.
Delete
DELETE /Representations/{repID}/layers/{layerId}
Again we need to send back some information about all the other layers that have changed.
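To make the "work out the order" step concrete, here is a hypothetical helper that reconstructs the order from the Parent pointers (assuming well-formed input, i.e. exactly one layer with a NULL parent and no cycles):

def sort_layers(layers):
    # Index each layer by the id of its parent (previous) layer.
    by_parent = {layer["parent"]: layer for layer in layers}
    ordered, key = [], None  # the first layer has parent None
    while key in by_parent:
        layer = by_parent[key]
        ordered.append(layer)
        key = layer["id"]
    return ordered

layers = [
    {"id": 3, "parent": 2},
    {"id": 1, "parent": None},
    {"id": 2, "parent": 1},
]
print([layer["id"] for layer in sort_layers(layers)])  # [1, 2, 3]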
Second approach: Layers Order as its own resource
The idea is that layers themselves don't have a notion of order; they are merely resources.
Then you get a layersorder resource, which is in charge of keeping the information about the order of the layers.
So you will still have the CRUD functionality for layers:
GET - POST - PUT - DELETE
but when you want to know anything about their order, or you want to change their order, you will use the following URI:
/Representation/{repId}/layersorder
This resource will support only two methods
GET /Representations/{repID}/layersorder
gets back an ordered list of Layer ids in this Representation.
PUT /Representations/{repID}/layersorder
body:
[] - array of Layer ids in the new order.
Updates the order of the layers. You need to pass an array of Layer ids in the new order as the body of the request (e.g. [1,3,2,4,6,5]).
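For illustration, a hypothetical server-side helper for this PUT, which validates that the body is a permutation of the existing Layer ids before applying it:

def apply_layer_order(layers, new_order):
    by_id = {layer["id"]: layer for layer in layers}
    if sorted(new_order) != sorted(by_id):
        raise ValueError("body must be a permutation of the existing layer ids")
    return [by_id[layer_id] for layer_id in new_order]

# e.g. apply_layer_order(layers, [1, 3, 2]) returns the layers in the new order.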
As per the first approach, whenever you add or remove a Layer you will need to notify the API consumer that another resource has been updated. In the first approach that was the list of layers affected by the change; in this approach it is the new order of the layers (the layersorder resource).
I would like to hear opinions, and also examples of similar situations and how you solved the problem.
Thanks.
I've experienced part of what you are describing.
When doing a PUT/POST that affects an entity (or entities), I like to return the full object after the change. Hopefully the returned object isn't massive, but if I were using your API with the first approach, I would enjoy doing a PUT/POST to update a layer and then getting back the full Representation object with the updated Layer information.
It just makes it easy for me to confirm my changes and also start working immediately in my code with the new structure. I would dislike doing the PUT/POST and then having to do an additional GET to see the change.
For the second approach: the more work I have to do to make an API call, the more frustrated I get. I'm not sure if I read it right, but to do a PUT with the second approach, it seems like I have to construct the entire representation and layer objects just to update one piece of data. That would be frustrating.
I would prefer the syntax of the first approach, but with the concepts from the second. In other words, this is the document I would expect to see after a PUT/POST/GET:
{
    "type": "Representation",
    "id": 1,
    "layers": [
        { "id": 1, "name": "the first layer", "order": 1, "parent": "" },
        { "id": 2, "name": "the second layer", "order": 2, "parent": 1 },
        { "id": 3, "name": "the third layer", "order": 3, "parent": 2 }
    ]
}
It's already sorted for me, so I don't have to do that work, but it also carries the information used to produce the sorting, just in case. I've done this with REST APIs and it seems to work great for the users.