Custom model in Apache OpenNLP

I am currently working with custom models that I am training for my own use case. The use case is to classify emails based on whether they are address change requests. If the address change request can be understood from a single sentence, it works fine without issues. But if the address change request has to be understood from multiple sentences, it does not work.
A few examples below:
Example 1: THIS IS WORKING
a) Training file:
Guys I wish to <START:contactupdate> change my address <END> .
My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.
Please confirm once you are done.
Thanks.
b) Testing the model with the sentence below:
String input = "Guys I wish to change my address.My new address is 68 Dorset Road, Coventry, West Midlands, CV1 4ED.Please confirm once you are done. Thanks."; //Working
Example 2: THIS IS NOT WORKING
Let's say the address change request can only be deduced from multiple sentences:
"My old address is no longer valid. Need to update it."
How do I train my model in this scenario? How do I specify the custom tags for the above?
Can you please help? I am stuck.
Many Thanks

What do you mean by "not working"? That the thing you want to retrieve is not retrieved? Or that the training crashes somewhere when the tags are spread out over multiple lines?
In general, the (by default MaxEnt) model that you are training in this procedure tries to detect common features for the thing you are training for. Typically, these are named entities like persons, organisations, locations. And in many languages, these contain typical features (like the prefix Mr./Mrs., the suffix corp., the morpheme "street", respectively). This can be picked up by the model and applied to new data, leading to the recognition of whatever it is you want to recognise.
The thing you are trying to do, however, is pretty advanced NLP already. The longer the phrase, the larger the possible variation, and the more difficult it becomes to pick up commonalities. I'd say for your use case, people typically use parsing (either constituency or dependency parsing) or other tools more sophisticated than this relatively flat pattern recognition. So you may want to look into these instead.
I don't know how much data you have at your disposal, from which you can infer different ways of expressing the desire to change an address in a customer database. If it is a reasonable amount (i.e. not just a couple of sentences), you may want to manually annotate it, parse the corpus, use machine learning on the parse trees/graphs for the sentences of interest, and go about it that way. As mentioned, quite advanced NLP in my opinion, and not something that has an out-of-the-box solution.

If I understand your question correctly, I think you are trying to categorize emails to find out whether they are address change requests. But your example looks like a model for named entities. In my opinion, it might be better to use the "Document Categorizer" feature of Apache OpenNLP.
You can provide different samples for possible sentences which should be categorized as an address change. "address_change", "general_inquiry", etc. can be the categories. This way you can add as many different samples as you want, with many variations of the sentences. There are easy, basic tutorials for document categorizer training and usage.
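For illustration, a minimal training-and-classification sketch along those lines (assuming OpenNLP 1.8+; the file name train.txt and the category labels are made up):

import java.io.File;
import java.nio.charset.StandardCharsets;

import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.tokenize.SimpleTokenizer;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class AddressChangeCategorizer {

    public static void main(String[] args) throws Exception {
        // Each line of train.txt is "<category> <text>", e.g.
        // address_change My old address is no longer valid need to update it
        ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("train.txt")),
                StandardCharsets.UTF_8);
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(lines);

        DoccatModel model = DocumentCategorizerME.train(
                "en", samples, TrainingParameters.defaultParams(), new DoccatFactory());
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

        // The categorizer scores tokens, so tokenize the whole email first.
        String[] tokens = SimpleTokenizer.INSTANCE.tokenize(
                "My old address is no longer valid. Need to update it.");
        double[] outcomes = categorizer.categorize(tokens);
        System.out.println(categorizer.getBestCategory(outcomes));
    }
}

Note that the default trainer uses a feature cutoff of 5, so you need a decent number of samples per category before training will succeed.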

Related

Does Named Entity Recognition only work on nouns?

I am considering training spaCy to recognize a custom named entity, but I am curious whether this really only works for nouns, or whether it would work equally well with other parts of speech, such as adjectives.
For example, I want to train on words like depressed, anxious, paranoid, etc. I'm trying to curate a list of adjectives that are considered clinically relevant, separating them from other irrelevant adjectives like happy, sad, unwell.
Is NER the right approach here, or would it make more sense to just manually maintain a list of clinical adjectives and use a custom extension (e.g. ent._.clinical_adj) to mark them?
NER is typically used on nouns. It's not that sensitive to part-of-speech type, but picking up just adjectives would be an unusual use.
Since it sounds like you do have a specific, finite list of words you're interested in, it probably makes sense for you to just use that word list and an extension to mark them.
You might also want to look at the DependencyMatcher. If there are some nouns you are interested in, you can use the DependencyMatcher to get all adjectives that modify them, for example.

SpaCy different Language Models

I'm making some progress :) developing my little OCR project.
I was wondering if my idea is possible in this case!
After extracting the text from images (OCR), I use NLP (spaCy) to identify two entities (LOC and PER). I write them to a dictionary and later to JSON data. That works well.
Now I'm wondering if I can improve my identified entities.
One way I can imagine is to use the right language model for each text.
I have various texts in German, English, Spanish and French.
At the moment I'm using the
But now I have no idea how to put langdetect into this.
Have a great week!
Greets
Here is a link that you might find useful when it comes to detecting a language (there are multiple options, including langdetect): How to detect language
You can create a dictionary with the languages you plan to detect and match it against langdetect's output. I guess you have the rest sorted out.
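If you ever need the same dictionary idea on the JVM, here is a rough sketch using OpenNLP's built-in language detector in place of Python's langdetect (langdetect-183.bin is OpenNLP's pre-trained detector model; the spaCy model names in the map are just the usual small models and are assumptions):

import java.io.File;
import java.util.Map;

import opennlp.tools.langdetect.Language;
import opennlp.tools.langdetect.LanguageDetectorME;
import opennlp.tools.langdetect.LanguageDetectorModel;

public class ModelPicker {

    // Detected language code -> spaCy model to load. OpenNLP's
    // pre-trained detector returns ISO 639-3 codes.
    private static final Map<String, String> SPACY_MODELS = Map.of(
            "deu", "de_core_news_sm",
            "eng", "en_core_web_sm",
            "spa", "es_core_news_sm",
            "fra", "fr_core_news_sm");

    public static void main(String[] args) throws Exception {
        LanguageDetectorModel model =
                new LanguageDetectorModel(new File("langdetect-183.bin"));
        LanguageDetectorME detector = new LanguageDetectorME(model);

        Language best = detector.predictLanguage("Das ist ein deutscher Satz.");
        System.out.println(best.getLang() + " -> "
                + SPACY_MODELS.getOrDefault(best.getLang(), "en_core_web_sm"));
    }
}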

Is it possible to manipulate a description to have the same meaning but different words with data manipulation

I want to copy data from a website which sells courses like ITIL, Prince2, PMP and many other IT sector courses; there are 20,000 different course descriptions there.
I want to use Selenium to scrape all the data, but the descriptions are still subject to copyright.
Kindly let me know how I can transform all of those descriptions into text with the same meaning but different words.
Is there any API which would give me access to build code that rewrites these descriptions using synonyms, or that changes the grammar into completely new sentences with the same meaning?
Kindly let me know where to start.
Thanks,
The task you are referring to is called paraphrasing.
There is a lot of research in this field. On arXiv you will find research papers on the topic. However, since you are asking for an API, I am assuming you don't want to implement these models yourself. Luckily, some authors have published their models online on GitHub. (Note: some are re-implementations by someone else.)
When you use one of these implementations, note that most offer a pre-trained model. Do read which data set was used for training and try to pick the one that is most similar to the data you are facing. By doing so, more words in the domain of your descriptions will be available and more synonyms can be used.

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched for in a large knowledge base, and matching sentences will be shown as answers.
My knowledge base (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in context, I may consider keeping one sentence/paragraph before/after the indexed one as a payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for this kind of system. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content against search terms. To do this you would ask users to rate the answers that come back ('helpful' vs 'unhelpful'); that way you can start to rank documents against keywords based on historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
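A rough sketch of what such feedback tracking could look like (plain counting with add-one smoothing; all names here are made up):

import java.util.HashMap;
import java.util.Map;

// Tracks helpful/unhelpful votes per (keyword, document) pair and estimates
// P(helpful | keyword, document) with add-one (Laplace) smoothing.
public class FeedbackRanker {

    private final Map<String, int[]> votes = new HashMap<>(); // {helpful, unhelpful}

    private static String key(String keyword, String docId) {
        return keyword + '\u0000' + docId;
    }

    public void recordVote(String keyword, String docId, boolean helpful) {
        votes.computeIfAbsent(key(keyword, docId), k -> new int[2])[helpful ? 0 : 1]++;
    }

    // Smoothed estimate; 0.5 when no feedback has been collected yet.
    public double helpfulness(String keyword, String docId) {
        int[] c = votes.getOrDefault(key(keyword, docId), new int[2]);
        return (c[0] + 1.0) / (c[0] + c[1] + 2.0);
    }
}

The resulting score could then be folded into the Lucene relevance score, e.g. as a boost.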
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding text as payloads. That means you'll need to store each sentence three times (as the preceding context, the sentence itself, and the following context), and you'll have to manually dig into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
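A minimal indexing sketch of that ID idea (Lucene 8.x-style API; the field names and the sentence loader are made up):

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.IntPoint;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class SentenceIndexer {

    public static void main(String[] args) throws Exception {
        String[] sentences = loadSentencesSomehow();

        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            for (int id = 0; id < sentences.length; id++) {
                Document doc = new Document();
                doc.add(new TextField("sentence", sentences[id], Field.Store.YES));
                doc.add(new IntPoint("id", id));     // indexed, so neighbours can be queried
                doc.add(new StoredField("id", id));  // stored, so each hit carries its position
                writer.addDocument(doc);
            }
        }
    }

    // Hypothetical: in reality you would split your corpus into sentences here.
    private static String[] loadSentencesSomehow() {
        return new String[] {
            "The printer supports duplex printing.",
            "To enable it, open the driver settings.",
            "Tick the duplex option and save."
        };
    }
}

At query time you search the sentence field, read the stored id of each hit, and fetch the context with IntPoint.newExactQuery("id", id - 1) and IntPoint.newExactQuery("id", id + 1).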
The bigger question, though, is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better approach would be to find which text is the header of a section, and then put everything in that section into one document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, and doing so makes answering the question better than just text matching. Mahout is a machine learning and data mining framework which fits this domain better. Just a very high-level thought.
--Sai

Does my API design violate RESTful principles?

I'm currently designing (or trying to design) a RESTful API for a social network. But I'm not sure whether my current approach still accords with RESTful principles. I'd be glad if some brighter heads could give me some tips.
Suppose the following URI represents the name field of a user account:
people/{UserID}/profile/fields/name
But there are almost a hundred possible fields. So I want the client to be able to create its own field views or use predefined ones. Let's suppose that the following URI represents a predefined field view that includes the fields "name", "age" and "gender":
utils/views/field-views/myFieldView
And because field views are a kind of higher-level logic, I don't want to mix support for them into the "people/{UserID}/profile/fields" resource. Instead I want to do the following:
utils/views/field-views/myFieldView/{UserID}
Another example
Suppose we want to perform some set operations (I hope that is the right name for it in English). We have the following URIs, each of which points to a list of persons (their friends):
GET people/exampleUID-1/relationships/friends
GET people/exampleUID-2/relationships/friends
And now we want to find out which of their friends are also friends of mine. So we do this:
GET people/myUID/relationships/intersections/{Value-1};{Value-2}
Whereas "{Value-1/2}" are the url encoded values of "people/exampleUID-1/friends" and "people/exampleUID-2/friends". And then we get back a representation of all people which are friends of all three persons.
Though Leonard Richardson & Sam Ruby state in their book "RESTful Web Services" that a RESTful design is somewhat like an "extreme object-oriented" approach, I think that my approach is object-oriented and therefore accords with RESTful principles. Or am I wrong?
If not: are such "object-oriented" approaches generally encouraged when used with care, in order to avoid query-based REST-RPC hybrids?
Thanks for your feedback in advance,
peta
I've never worked with REST, but I'd have assumed that GETting a profile resource at /people/{UserId}/profile would yield a document, in XML or JSON or something, that includes all the fields. Client-side, I'd then ignore the fields I'm not interested in. Isn't that much nicer than having to (a) configure a personalised view on the server or (b) make lots of requests to fetch each field?
Hi peta,
I'm still reading through RESTful Web Services myself, but I'd suggest a slightly different approach than the one proposed.
Regarding the first part of your post:
utils/views/field-views/myFieldView/{UserID}
I don't think that this is RESTful, as utils is not a resource. Defining custom views is OK; however, these views should be (imho) a natural part of your API's URI scheme. To incorporate the above into your first URI example, I would propose one of the following instead of creating a special view for it:
people/{UserID}/profile/fields/name,age,gender/
people/{UserID}/profile/?fields=name,age,gender
The latter example treats fields as an input value for your algorithm. This might be a better approach than having fields in the URI, as it is not a resource itself; it just puts constraints on the existing view of people/{UserID}/profile/. Technically, it's very similar to pagination, where you would limit a view by default and allow clients to browse through resources using ?page=1, ?page=2 and so on.
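Server-side, such a constraint is then just a query parameter; a hypothetical JAX-RS sketch (the resource class and the buildProfile helper are made up):

import java.util.Arrays;
import java.util.List;
import java.util.Map;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.QueryParam;

@Path("people/{userId}/profile")
public class ProfileResource {

    @GET
    public Map<String, Object> getProfile(@PathParam("userId") String userId,
                                          @QueryParam("fields") String fields) {
        // No ?fields= given means "return the full representation".
        List<String> requested = (fields == null || fields.isEmpty())
                ? List.of()
                : Arrays.asList(fields.split(","));
        return buildProfile(userId, requested);
    }

    // Placeholder: a real implementation would load the user and keep only
    // the requested fields in the returned representation.
    private Map<String, Object> buildProfile(String userId, List<String> fields) {
        return Map.of("userId", userId, "requestedFields", fields);
    }
}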
Regarding the second part of your post:
This is a more difficult one to crack.
First:
Having intersections in the URI breaks your URI scheme a bit. It's not a resource by itself, and it also sits on the same level as friends, whereas it would be more suitable one level below, or as an input value for your algorithm, i.e.
GET people/{UserID}/relationships/friends/intersections/{Value-1};{Value-2}
GET people/{UserID}/relationships/friends/?intersections={Value-1};{Value-2}
I'm again personally inclined towards the latter because, as in the first case, you are just constraining the existing view of people/{UserID}/relationships/friends/.
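The intersection itself then becomes plain set logic on the server; a small hypothetical sketch:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FriendIntersection {

    // Returns the users who appear in every one of the given friend lists.
    static Set<String> intersect(List<Set<String>> friendLists) {
        Set<String> result = null;
        for (Set<String> friends : friendLists) {
            if (result == null) {
                result = new HashSet<>(friends); // start from the first list
            } else {
                result.retainAll(friends);       // keep only the common friends
            }
        }
        return result == null ? Set.of() : result;
    }

    public static void main(String[] args) {
        Set<String> a = Set.of("uid-1", "uid-2", "uid-3");
        Set<String> b = Set.of("uid-2", "uid-3", "uid-4");
        System.out.println(intersect(List.of(a, b))); // prints uid-2 and uid-3
    }
}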
Secondly, regarding:
Whereas "{Value-1/2}" are the url
encoded values of
"people/exampleUID-1/friends" and
"people/exampleUID-2/friends"
If you meant that {Value-1/2} contain the whole encoded response of the mentioned GET requests, then I would avoid that; I don't think that's the RESTful way. Since friends is a resource by itself, you may want to expose it and access it directly, i.e.:
GET friends/{UserID-1};{UserID-2};{UserID-3}
One important thing to note here: I've used ; between user IDs in the previous example, whereas I used , in the fields example above. The reasoning is that they represent different operators. In the first case we needed OR (,) in order to get all three fields, while in the last example above we had to use AND (;) in order to get an intersection.
Using two types of operators can over-complicate the API design, but it should provide more flexibility in the end.
Thanks for your clarifying answers. They are exactly what I was asking for. Unfortunately I haven't had the time to read "RESTful Web Services" from cover to cover, but I will catch up on it as soon as possible. :-)
Regarding the first part of my post:
You're right. I'm inclined towards your first example, without fields. I think I don't need it at all (at the moment). Why do you suggest using OR (,) instead of AND (;)? Intuitively I'd use the AND operator, because I want all three of them and not just the first one that exists. (Like the colorpairs example on page 121.)
Regarding the second part:
With {Value-1/2} I meant only the URL-encoded value of the URIs, not their response data. :) Here I agree with your second example. It should be obvious that under the hood an algorithm is involved when calculating intersecting friends. And besides that, I'm probably going to add some further operations to it.
peta