Generating similar named entities/compound nouns - spaCy

I have been trying to create distractors (false answers) for multiple choice questions. Using word vectors, I was able to get decent results for single-word nouns.
When dealing with compound nouns (such as "car park" or "Donald Trump"), my best attempt was to compute similar words for each part of the compound and combine them. The results are very entertaining:
Car park -> vehicle campground | automobile zoo
Fire engine -> flame horsepower | fired motor
Donald Trump -> Richard Jeopardy | Jeffrey Gamble
Barack Obama -> Obamas McCain | Auschwitz Clinton
Unfortunately, these are not very convincing. Especially in the case of named entities, I want to produce other named entities which appear in similar contexts; e.g.:
Fire engine -> Fire truck | Fireman
Donald Trump -> Barack Obama | Hillary Clinton
Niagara Falls -> American Falls | Horseshoe Falls
Does anyone have any suggestions for how this could be achieved? Is there a way to generate similar named entities/noun chunks?
I managed to get some good distractors by searching for the named entities on Wikipedia and then extracting similar entities from the summary, though I'd prefer to find a solution using just spaCy.

If you haven't seen it yet, you might want to check out sense2vec, which allows learning context-sensitive vectors by including the part-of-speech tags or entity labels. Quick usage example of the spaCy extension:
import spacy
from sense2vec import Sense2VecComponent

nlp = spacy.load("en_core_web_sm")  # any English pipeline works here
s2v = Sense2VecComponent('/path/to/reddit_vectors-1.1.0')
nlp.add_pipe(s2v)

doc = nlp(u"A sentence about natural language processing.")
most_similar = doc[3]._.s2v_most_similar(3)
# [(('natural language processing', 'NOUN'), 1.0),
#  (('machine learning', 'NOUN'), 0.8986966609954834),
#  (('computer vision', 'NOUN'), 0.8636297583580017)]
See here for the interactive demo using a sense2vec model trained on Reddit comments. Using this model, "car park" returns things like "parking lot" and "parking garage", and "Donald Trump" gives you "Sarah Palin", "Mitt Romney" and "Barack Obama". For ambiguous entities, you can also include the entity label – for example, "Niagara Falls|GPE" will show terms similar to the geopolitical entity (GPE), i.e. the city, as opposed to the actual waterfalls. The results obviously depend on what was present in the data, so for even more specific similarities, you could also experiment with training your own sense2vec vectors.
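If you want to do the same lookup programmatically rather than through the demo, here is a minimal sketch using the standalone sense2vec package (the vector path is a placeholder, and the newer 1.x standalone API is assumed):
import spacy
from sense2vec import Sense2Vec

# Load a pretrained sense2vec vectors package from disk (path is a placeholder).
s2v = Sense2Vec().from_disk("/path/to/s2v_reddit_vectors")

# Keys combine the phrase and its sense (a POS tag or an entity label).
query = "Niagara_Falls|GPE"
if query in s2v:
    for key, score in s2v.most_similar(query, n=3):
        print(key, score)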

Related

Detect predefined topics in a text

I would like to find, in a text corpus (long, boring text), allusions to some pre-defined topics (let's say I am interested in two topics: "Remuneration" and "Work conditions").
For example, finding where in my corpus (the specific paragraph) problems about "remuneration" are raised.
To accomplish that, I first thought about a deterministic approach: building a big dictionary and, with regexes, flagging those words in the text corpus. It is a very basic idea, but I do not know how I could build my dictionary efficiently (I need a lot of words in the lexical field of remuneration). Do you know some website in French which could help me build this dictionary?
Perhaps you can think of a cleverer approach based on some machine learning algorithm which could accomplish this task (I know about topic modelling, but the difference here is that I am focusing on pre-determined subjects/topics like "Remuneration"). I need a simple approach :)
The dictionary approach is a very basic one, but it could work. You can build the dictionary iteratively (a sketch in Python follows the steps below):
Suppose you want a dictionary of terms related to "work conditions".
1. Start with a seed: a small number of terms that are, with high probability, related to work conditions.
2. Use this dictionary to run through the corpus and find relevant documents.
3. Now go through the relevant documents and find terms with a high TF-IDF value (terms which have high representation in the above documents but low representation in the rest of the corpus). These terms can be assumed to refer to the subject of "work conditions" as well.
4. Add the new terms you found to the dictionary.
5. Now you can run through the corpus again and find additional relevant documents.
6. Repeat the above process a pre-configured number of times, or until no more new terms are found.
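Here is a minimal sketch of that loop in Python with scikit-learn; the seed terms, round count and substring-matching heuristic are all assumptions you would tune:
from sklearn.feature_extraction.text import TfidfVectorizer

def expand_dictionary(corpus, seed_terms, n_rounds=3, top_k=10):
    """Iteratively grow a topic dictionary from a small seed (the steps above)."""
    dictionary = {t.lower() for t in seed_terms}
    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(corpus)   # TF-IDF over the whole corpus
    vocab = vectorizer.get_feature_names_out()
    for _ in range(n_rounds):
        # find documents that mention any dictionary term
        relevant = [i for i, doc in enumerate(corpus)
                    if any(term in doc.lower() for term in dictionary)]
        if not relevant:
            break
        # terms with high average TF-IDF in the relevant documents
        scores = tfidf[relevant].mean(axis=0).A1
        top_terms = {vocab[i] for i in scores.argsort()[::-1][:top_k]}
        new_terms = top_terms - dictionary
        if not new_terms:
            break        # stop: nothing new was found
        dictionary |= new_terms
    return dictionary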
A full treatment of such "topic analysis" problems is well beyond the scope of a Stack Overflow Q&A; there are multiple books and papers on the subject.
For a small starter project: collect a number of articles which focus on discussing your topic(s), and others which discuss different topics. Rate each document according to whether or not it covers each of your chosen topics. Calculate the term frequency-inverse document frequency (TF-IDF) for each of the sample articles, and convert these into a vector of the frequency of appearance of each word in each document. (You'll probably want to eliminate extremely common or ambiguous "stop words" from the analysis and do "stemming" as well. You can also scan for common sequences of two or more words.) This then defines a set of "positive" and "negative" examples for each defined topic.
If there's only a single topic of interest, you can then use a cosine-similarity function to determine which sample article is most like your new input text. For multiple topics, you'll probably want to do something like principal component analysis on the original text samples to identify which words and word combinations are most representative of each topic.
The quality of the classification will depend largely on the number of example texts you have to train the model and how much they differ.
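A minimal sketch of the cosine-similarity route with scikit-learn (the sample articles and labels are placeholders for your rated documents):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Rated sample articles, one label per document.
samples = ["salary bonus pay rise negotiation",
           "holiday schedule shift work hours"]
labels = ["remuneration", "work conditions"]

vectorizer = TfidfVectorizer(stop_words="english")
sample_vectors = vectorizer.fit_transform(samples)

# Vectorize a new paragraph and pick the topic of the most similar sample.
new_text = ["the pay rise was refused during negotiation"]
new_vector = vectorizer.transform(new_text)
similarities = cosine_similarity(new_vector, sample_vectors)[0]
print(labels[similarities.argmax()])   # -> remuneration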
If you're talking about a coding way of solving this, why don't you write code (in your language of choice) that finds the paragraphs containing the allusion word?
For example, I would do it like this in JavaScript:
// longText is a long text that includes 4 paragraphs in total
const longText = `
In 1893, the first running, gasoline-powered American car was built and road-tested by the Duryea brothers of Springfield, Massachusetts. The first public run of the Duryea Motor Wagon took place on 21 September 1893, on Taylor Street in Metro Center Springfield.[32][33] The Studebaker Automobile Company, subsidiary of a long-established wagon and coach manufacturer, started to build cars in 1897[34]: p.66  and commenced sales of electric vehicles in 1902 and gasoline vehicles in 1904.[35]
In Britain, there had been several attempts to build steam cars with varying degrees of success, with Thomas Rickett even attempting a production run in 1860.[36] Santler from Malvern is recognized by the Veteran Car Club of Great Britain as having made the first gasoline-powered car in the country in 1894,[37] followed by Frederick William Lanchester in 1895, but these were both one-offs.[37] The first production vehicles in Great Britain came from the Daimler Company, a company founded by Harry J. Lawson in 1896, after purchasing the right to use the name of the engines. Lawson's company made its first car in 1897, and they bore the name Daimler.[37]
In 1892, German engineer Rudolf Diesel was granted a patent for a "New Rational Combustion Engine". In 1897, he built the first diesel engine.[1] Steam-, electric-, and gasoline-powered vehicles competed for decades, with gasoline internal combustion engines achieving dominance in the 1910s. Although various pistonless rotary engine designs have attempted to compete with the conventional piston and crankshaft design, only Mazda's version of the Wankel engine has had more than very limited success.
All in all, it is estimated that over 100,000 patents created the modern automobile and motorcycle.
`
document.querySelector('.searchbox').addEventListener('submit', (e)=> { e.preventDefault(); search() })
function search() {
  const allusion = document.querySelector('#searchbox').value.toLowerCase()
  const output = document.querySelector('#results-body ol')
  output.innerHTML = "" // reset the output
  const paragraphs = longText.split('\n').filter(item => item != "")
  const included = paragraphs.filter(paragraph => paragraph.toLowerCase().includes(allusion))
  let foundIn = included.map(paragraph => `<div class="result-row"> <li>${paragraph.toLowerCase()}</li> </div>`)
  foundIn = foundIn.map(el => el.replaceAll(allusion, `<span class="highlight">${allusion}</span>`))
  output.insertAdjacentHTML('afterbegin', foundIn.join('\n'))
}
.container {
  padding: 5px;
  border: .2px solid black;
}
.searchbox {
  padding-bottom: 5px;
}
.searchbox input {
  width: 90%;
}
.result-row {
  padding-bottom: 5px;
}
.highlight {
  background: yellow;
}
h3 span {
  font-size: 14px;
  font-style: italic;
}
<div class="container">
  <form class="searchbox">
    <h3>Give me a hint: <span>ex: car, gasoline, company</span></h3>
    <input id="searchbox" type="text" placeholder="allusion word, ex: car, gasoline, company">
    <button type="submit"> find </button>
  </form>
  <div id="results-body">
    <ol></ol>
  </div>
</div>

How to improve the results from DBpedia Spotlight?

I am using DBpedia Spotlight to extract DBpedia resources as follows.
import json
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse
## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = "Tolerance, safety and efficacy of Hedera helix extract in inflammatory bronchial diseases under clinical practice conditions: a prospective, open, multicentre postmarketing study in 9657 patients. In this postmarketing study 9657 patients (5181 children) with bronchitis (acute or chronic bronchial inflammatory disease) were treated with a syrup containing dried ivy leaf extract. After 7 days of therapy, 95% of the patients showed improvement or healing of their symptoms. The safety of the therapy was very good with an overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders with 1.5%). In those patients who got concomitant medication as well, it could be shown that the additional application of antibiotics had no benefit respective to efficacy but did increase the relative risk for the occurrence of side effects by 26%. In conclusion, it is to say that the dried ivy leaf extract is effective and well tolerated in patients with bronchitis. In view of the large population considered, future analyses should approach specific issues concerning therapy by age group, concomitant therapy and baseline conditions."
CONFIDENCE = '0.5'
SUPPORT = '10'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT),
    confidence=CONFIDENCE,
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []
r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']
for res in resources:
    all_urls.append(res['@URI'])
print(all_urls)
My text is the abstract assigned to TEXT in the code above.
The results I got are as follows.
['http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Helix',
'http://dbpedia.org/resource/Bronchitis',
'http://dbpedia.org/resource/Cough_medicine',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/After_7',
'http://dbpedia.org/resource/Gastrointestinal_tract',
'http://dbpedia.org/resource/Antibiotics',
'http://dbpedia.org/resource/Relative_risk',
'http://dbpedia.org/resource/Hedera',
'http://dbpedia.org/resource/Bronchitis']
As you can see, the results are not very good.
For example, consider Hedera helix extract in the text mentioned above. Even though DBpedia has a resource for Hedera helix (http://dbpedia.org/resource/Hedera_helix), Spotlight outputs it as two URIs: http://dbpedia.org/resource/Hedera and http://dbpedia.org/resource/Helix.
Given my dataset, I would like to get the longest matching term in DBpedia as the result. In that case, what improvements can I make to get my desired output?
I am happy to provide more details if needed.
Although I am answering quite late, you can use the BabelNet API in Python to obtain DBpedia URIs covering longer spans of text. I reproduced the problem using the code below:
from babelpy.babelfy import BabelfyClient

text = """Tolerance, safety and efficacy of Hedera helix extract in inflammatory
bronchial diseases under clinical practice conditions: a prospective, open,
multicentre postmarketing study in 9657 patients. In this postmarketing
study 9657 patients (5181 children) with bronchitis (acute or chronic
bronchial inflammatory disease) were treated with a syrup containing dried ivy
leaf extract. After 7 days of therapy, 95% of the patients showed improvement
or healing of their symptoms. The safety of the therapy was very good with an
overall incidence of adverse events of 2.1% (mainly gastrointestinal disorders
with 1.5%). In those patients who got concomitant medication as well, it could
be shown that the additional application of antibiotics had no benefit
respective to efficacy but did increase the relative risk for the occurrence
of side effects by 26%. In conclusion, it is to say that the dried ivy leaf
extract is effective and well tolerated in patients with bronchitis. In view
of the large population considered, future analyses should approach specific
issues concerning therapy by age group, concomitant therapy and baseline
conditions."""

# Instantiate the BabelFy client.
params = dict()
params['lang'] = 'english'
babel_client = BabelfyClient("**Your Registration Code For API**", params)

# Babelfy the text.
babel_client.babelfy(text)

# Get all merged entities.
babel_client.all_merged_entities
The output for each merged entity will be in the format shown below. You can further store and process the dictionary structure to extract the DBpedia URIs.
{'start': 34,
'end': 45,
'text': 'Hedera helix',
'isEntity': True,
'tokenFragment': {'start': 6, 'end': 7},
'charFragment': {'start': 34, 'end': 45},
'babelSynsetID': 'bn:00021109n',
'DBpediaURL': 'http://dbpedia.org/resource/Hedera_helix',
'BabelNetURL': 'http://babelnet.org/rdf/s00021109n',
'score': 1.0,
'coherenceScore': 0.0847457627118644,
'globalScore': 0.0013494092960806407,
'source': 'BABELFY'},
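To pull out just the DBpedia URIs from that structure, something like this should work (the key names follow the sample output above):
# Keep only merged entities that actually carry a DBpedia URI.
dbpedia_uris = [
    entity['DBpediaURL']
    for entity in babel_client.all_merged_entities
    if entity.get('isEntity') and entity.get('DBpediaURL')
]
print(dbpedia_uris)
# e.g. ['http://dbpedia.org/resource/Hedera_helix', ...]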

How to create family relation logic?

I'm trying to make this but for a different language. In this language there are different kinds of names for uncles and aunts: we call a paternal aunt one thing and a maternal aunt another.
I came across the graph database Neo4j. I created 5 members and got this approach to work just like I want. But the problem with this is that I have to create n * (n-1) relationships, and I'm creating a full tree here, not just 5 members of a family.
Also, this is more like brute force: I'm creating all the possibilities.
I'm looking for a smarter way to do this, creating rules like, for example, father's sister = paternal aunt and mother's sister = maternal aunt.
I also want queries like father's wife's sister, but I don't want to define them separately.
You can create functions that establish the rules, e.g.:
def is_maternal_aunt(individual, member):
    return member in individual.mother.sister
These can be arbitrary complex:
def is_maternal_aunt(individual, member):
    return is_sister(individual.mother, member)

def is_sister(individual, member):
    return individual.mother == member.mother
It is a matter of design which relationships you consider primary and which derived. You can probably derive everything from parent-child relationships (and marriage).
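For example, a minimal sketch storing only parent and spouse links and deriving everything else (class and helper names are purely illustrative):
class Person:
    def __init__(self, name, gender, mother=None, father=None, spouse=None):
        self.name = name
        self.gender = gender      # "f" or "m"
        self.mother = mother      # primary relation
        self.father = father      # primary relation
        self.spouse = spouse      # primary relation
        self.children = []        # maintained by whatever code builds the tree

def sisters(p):
    """Derived: female children of p's mother, excluding p."""
    if p is None or p.mother is None:
        return []
    return [c for c in p.mother.children if c is not p and c.gender == "f"]

def maternal_aunts(p):
    return sisters(p.mother)

def paternal_aunts(p):
    return sisters(p.father)

def fathers_wifes_sisters(p):
    # Composed on the fly: father -> wife -> sisters; nothing extra is stored.
    return sisters(p.father.spouse) if p.father else []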
You don't have to create the bidirectional relationships, and you also don't have to create shortcut relationships; you can just infer the information in the other direction or across multiple steps:
MATCH path = allShortestPaths((p1:Person {name:"Jane"})-[*]-(p2:Person {name:"John"}))
RETURN [r in relationships(path) | type(r)] as rels
which would then return, e.g., ["husband","father"] for a father-in-law,
or ["mother","sister"] for a maternal aunt.
You can then map those tuples to relationship names, either still in Cypher (with CASE) or in your Python program.
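On the Python side, that mapping can be as simple as a lookup table over the relationship-type tuples (the entries below are illustrative):
# Map relationship-type paths returned by the Cypher query to kinship names.
RELATION_NAMES = {
    ("husband", "father"): "father-in-law",
    ("mother", "sister"): "maternal aunt",
    ("father", "sister"): "paternal aunt",
}

def kinship(rels):
    return RELATION_NAMES.get(tuple(rels), "unknown relation")

print(kinship(["mother", "sister"]))  # -> maternal aunt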
Prolog is a reasonable choice... For instance, I have this small library to draw 'genealogy trees'
from a definition like this (the gender facts are used only to change a node's colour):
:- module(elizabeth, [elizabeth/0]).
:- use_module(genealogy).
elizabeth :- genealogy(elizabeth, 'Elizabeth II Family').
female('Elizabeth II').
female('Lady Elizabeth Bowes-Lyon').
female('Princess Mary of Teck').
female('Cecilia Cavendish-Bentinck').
parent_child('George VI', 'Elizabeth II').
parent_child('Lady Elizabeth Bowes-Lyon','Elizabeth II').
parent_child('George V', 'George VI').
parent_child('Princess Mary of Teck', 'George VI').
parent_child('Cecilia Cavendish-Bentinck','Lady Elizabeth Bowes-Lyon').
parent_child('Claude Bowes-Lyon', 'Lady Elizabeth Bowes-Lyon').
It requires SWI-Prolog and Graphviz.
Edit: adding some facts
female('Rose Bowes-Lyon').
parent_child('Cecilia Cavendish-Bentinck','Rose Bowes-Lyon').
parent_child('Claude Bowes-Lyon', 'Rose Bowes-Lyon').
and the rule
is_maternal_aunt(Person, Aunt) :-
    parent_child(Parent, Person),
    female(Parent),
    parent_child(GranParent, Parent),
    parent_child(GranParent, Aunt),
    Aunt \= Parent.
we get
?- is_maternal_aunt(X,Y).
X = 'Elizabeth II',
Y = 'Rose Bowes-Lyon' ;

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain which can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene; the steps are as follows:
Index all the data related to institutes, courses and locations.
Make shingles of the free text, search each shingle in the location, course and institute index directories, and then try to find out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases, e.g. B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available which can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on into a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the pattern shorter), {Something} is an annotation, {Token.kind == number} is an annotation "Token" with feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
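If you only need the gazetteer half of this and want to stay in plain Python, a minimal sketch (the variant lists are assumptions you would grow over time):
import re

# One gazetteer entry per concept, listing known surface variants.
GAZETTEER = {
    "B.Tech": ["b.tech", "btech", "b-tech", "b tech"],
}

def find_degrees(text):
    hits = []
    for canonical, variants in GAZETTEER.items():
        pattern = "|".join(re.escape(v) for v in variants)
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((canonical, m.start(), m.end()))
    return hits

print(find_degrees("She holds a BTech; he wrote b.tech on his CV."))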
I haven't used Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or the like, recording the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer/tokenizer, because words like B.tech can easily be split into two different tokens (i.e. B and tech).
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
        mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));
// mark(String, Matcher) -- means creating a chunk over the sub-matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
// without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
        Chunkers.regexp("Token", "\\w+"),
        Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
        new GraphExpChunker("Address",
                seq(
                        opt(streetAddress),
                        opt(Postoffice),
                        City,
                        StateLike,
                        Postcode,
                        Country
                )
        ).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
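Complementing the fuzzy query, you can also normalise terms to a canonical key before indexing; a minimal sketch (the character class to strip is an assumption):
import re

def normalize(term):
    # Collapse "B.tech", "b-tech", "b tech" and "btech" onto one key.
    return re.sub(r"[.\-\s]", "", term.lower())

assert normalize("B.tech") == normalize("b-tech") == normalize("btech")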

Creating, Visualizing and Querying simple Data Structures

Simple and common tree-like data structures
Data structure example:
Animated Cartoons have 4 extremities (arm, leg, limb...)
Humans have 4 ext.
Insects have 6 ext.
Arachnids have 6 ext.
Animated Cartoons have 4 fingers per extremity
Humans have 5 per ext.
Insects have 1 per ext.
Arachnids have 1 per ext.
Some kind of implementation:
Level/Table 0: Quantity, Item
Level/Table 1: ItemName, Kingdom
Level/Table 2: Kingdom, NumberOfExtremities
Level/Table 3: ExtremityName, NumberOfFingers
Example dataset:
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
Querying "Number of fingers":
Number = 1*4*4 + 1*4*4 + 2*4*5 + 3*6*1 + 2*6*1 = 102 fingers (let jon be a Human)
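For reference, the same computation as a minimal Python sketch over the dataset above:
# fingers = count * extremities * fingers_per_extremity, summed over subjects
KINDS = {
    "AnimatedCartoons": (4, 4),
    "Human": (4, 5),
    "Insects": (6, 1),
    "Arachnids": (6, 1),
}
DATASET = [("Homer Simpson", "AnimatedCartoons", 1),
           ("Ralph Wiggum", "AnimatedCartoons", 1),
           ("jon skeet", "Human", 2),
           ("Atomic ant", "Insects", 3),
           ("Shelob", "Arachnids", 2)]

total = sum(n * KINDS[kind][0] * KINDS[kind][1] for _, kind, n in DATASET)
print(total)  # 102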
I wonder if there is any tool that lets me define this in a parseable form, automatically create the inherited data, and draw this kind of tree (with the plus of making this kind of data access possible, where possible).
It could be drawn manually with, for example, FreeMind, but AFAIK it doesn't let you define datatypes or structures to automatically create inherited branches of items, so it's really annoying to have to repeat a structure by copying (with the risk of mistakes). Repeated work over repeated data (a human running repeated code) is a buggy feature.
So I would like to write the data in the right language that lets me reuse it for queries and visualization. If all the data is in XML, or Java classes, or a database file, etc., is there some tool for viewing the tree and making the query?
PS: Creating nested folders in a filesystem and using Norton Commander in tree view is not an option, I hope (just because it would have to be built manually).
Your answer is mostly going to depend on what programming skills you already have and what skills you are willing to acquire. I can tell you what I would do with what I know.
I think for drawing trees you want a LaTeX package like qtree. If you don't like this one, there are a bunch of others out there. You'd have to write a script in whatever your favorite scripting language is to parse your input into the LaTeX code to generate the trees, but this could easily be done with less than 100 lines in most languages, if I properly understand your intentions. I would definitely recommend storing your data in an XML format using a library like Ruby's REXML, or whatever your favorite scripting language has.
If you are looking to generate more interactive trees, check out the Adobe Flex Framework. Again, if you don't like this specific framework, there are bunches of others out there (I recommend the blog FlowingData).
Hope this helps and I didn't miserably misunderstand your question.
The data structure you are describing looks like it can fit the XML format. Take a look at the eXist XML database; if I may say so, it is the most complete XML database. It comes with many tools to get you started fast, like the XQuery Sandbox option in the admin HTTP interface.
Example Dataset
1 Homer Simpson, 1 Ralph Wiggum, 2 jon skeet, 3 Atomic ant, 2 Shelob (spider)
I am assuming that there are 2 instances of jon skeet, 3 instances of Atomic ant and 2 instances of Shelob.
Here is an XQuery example:
let $doc :=
  <root>
    <definition>
      <AnimatedCartoons>
        <extremities>4</extremities>
        <fingers_per_ext>4</fingers_per_ext>
      </AnimatedCartoons>
      <Human>
        <extremities>4</extremities>
        <fingers_per_ext>5</fingers_per_ext>
      </Human>
      <Insects>
        <extremities>6</extremities>
        <fingers_per_ext>1</fingers_per_ext>
      </Insects>
      <Arachnids>
        <extremities>6</extremities>
        <fingers_per_ext>1</fingers_per_ext>
      </Arachnids>
    </definition>
    <subject><name>Homer Simpson</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>Ralph Wiggum</name><kind>AnimatedCartoons</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>jon skeet</name><kind>Human</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Atomic ant</name><kind>Insects</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
    <subject><name>Shelob</name><kind>Arachnids</kind></subject>
  </root>

let $definitions := $doc/definition/*
let $subjects := $doc/subject

(: here goes some query logic :)
let $fingers := fn:sum(
  for $subject in $subjects
  return (
    for $x in $definitions
    where fn:name($x) = $subject/kind
    return $x/extremities * $x/fingers_per_ext
  )
)
return $fingers
An XML Schema editor with visualization is perhaps what I am searching for:
http://en.wikipedia.org/wiki/XML_Schema_Editor
Checking it out...