How to match concept values in ChatScript (named-entity extraction)

If I use a ~concept, how do I know what the user actually typed? This is like an entity in typical NLP frameworks. E.g.:

u: (I am from ~country)
   ^keep() ^repeat()
   You come from _0?

If the user types "I am from FRANCE", there seems to be no way to extract the value FRANCE matched by ~country, either to echo it back to the user or to store it for later use with something like $country=_0. I thought _0 might help with that, but no luck.
This works, but it uses only wildcards, not concepts:

u: (I was born [in near close by] _*)
   ^keep() ^repeat()
   You were born in _0?
Reference docs:
https://github.com/ChatScript/ChatScript/blob/master/WIKI/OVERVIEWS-AND-TUTORIALS/ChatScript-Tutorial.md#short-term-memory--_

It seems an underscore before the ~concept allows capturing the match:

u: (I am from _~country) ^keep() ^repeat() You come from '_0?

The match lands in _0 (ChatScript stores the canonical form there; '_0 retrieves the user's original text), so you can also save it for later, e.g. $country = '_0.

How to keep translations separated where the same word is used in English but a different one in other languages?

Imagine I have a report, a letter actually, which I need to translate into several languages. I have created a greeting field in the form which is filled programmatically by an onchange event method:
if self.partner_id.gender == 'female':
    self.letter_greeting = _('Dear %s %s,') % (  # translation should be "Estimada"
        self.repr_recipient_id.title.shortcut, surname
    )
elif self.partner_id.gender == 'male':
    self.letter_greeting = _('Dear %s %s,') % (  # translation: "Estimado"
        self.repr_recipient_id.title.shortcut, surname
    )
else:
    self.letter_greeting = _('Dear %s %s,') % (  # translation: "Estimado/a"
        self.partner_id.title.shortcut, surname
    )
In this case the word "Dear" should get a different Spanish translation depending on which branch is used, because the ending changes with the gender. Exporting the .po file, I found that all the occurrences are grouped into a single entry. That makes sense, because in almost all cases the translations will be the same, but not in this one:
#. module: custom_module
#: code:addons/custom_module/models/sale_order.py:334
#: code:addons/custom_module/models/sale_order.py:338
#: code:addons/custom_module/models/sale_order.py:342
#, python-format
msgid "Dear %s %s,"
msgstr "Dear %s %s,"
Solutions I can apply directly:
Put the terms in different entries to avoid the shared translation, manually, every time I update the .po file. This is cumbersome if many words have this problem, and if I do it and open the file with Poedit, this error appears: duplicate message definition.
Put all the possible combinations separated by slashes, as is done in some other parts of Odoo. For both genders this would be:
#. module: stock
#: model:res.company,msg:stock.res_company
msgid "Dear"
msgstr "Estimado/a"
This is just an example. I can think of many words that look the same in English but use different spellings or meanings in other languages depending on the context.
Possible better solutions:
I don't know if Odoo knows anything about the context of a word, to tell whether it was already translated or not. Adding a context manually could solve the problem, at least for words with different meanings.
The nicest solution would be a parameter on the translation call to make sure the word is exported as an isolated entry for that specific translation.
Do you think I am giving this too much importance, haha? Do you know of any better solution? Why does Poedit not take this problem into account at all?
I propose an extension of the models res.partner.title and res.partner.
res.partner.title should get a translatable field for storing salutation prefixes like "Dear" or "Sehr geehrter" (German). It might be worth adding something about genders too, but I won't go into detail on that here.
You probably want to show the configuring user an example like "Dear Mr. Name" or something like that; a computed field should work.
On res.partner you should implement either a computed field or a method that returns the full salutation for a partner record.
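A minimal sketch of that proposal (salutation_prefix and _compute_salutation are illustrative names, not existing Odoo fields; fields.Char(translate=True) and the inherit/compute mechanics are standard Odoo):

from odoo import api, fields, models


class ResPartnerTitle(models.Model):
    _inherit = 'res.partner.title'

    # Translatable prefix, e.g. "Dear" -> "Estimado" / "Sehr geehrter"
    salutation_prefix = fields.Char(translate=True)


class ResPartner(models.Model):
    _inherit = 'res.partner'

    salutation = fields.Char(compute='_compute_salutation')

    @api.depends('title', 'name')
    def _compute_salutation(self):
        for partner in self:
            prefix = partner.title.salutation_prefix or ''
            partner.salutation = '%s %s %s,' % (
                prefix, partner.title.shortcut or '', partner.name or '')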
To some degree this is a linguistics problem. I believe the best solution would be to use a different "Source Language", one made up of keys, and then have English as another Translation. The word "Dear" in English does not have a gender context (and typically, much of English doesn't), while the word "Estimado" in Spanish does. The translation from that Spanish word to English is more appropriately "Masculine Dear." Therefore, using keys as your source language, you would have this:
SourceText (EnglishDescription) -> Translation (English) -> Translation (Spanish)
DearMasculine -> Dear -> Estimado
DearFeminine -> Dear -> Estimada
DearNeutral -> Dear -> Estimado/a
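A rough sketch of how the key-based approach could look in the original snippet (the Dear* keys are made up for illustration; each literal key becomes its own .po msgid, so each language can translate them independently):

if self.partner_id.gender == 'female':
    greeting = _('DearFeminine')   # English .po: "Dear", Spanish: "Estimada"
elif self.partner_id.gender == 'male':
    greeting = _('DearMasculine')  # English .po: "Dear", Spanish: "Estimado"
else:
    greeting = _('DearNeutral')    # English .po: "Dear", Spanish: "Estimado/a"
self.letter_greeting = '%s %s %s,' % (
    greeting, self.partner_id.title.shortcut, surname)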

DAX / Xetra on Alphavantage

I am very happy with Alphavantage, BUT I can't find the German stocks (Xetra).
I have tried:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=xtr:lin&apikey=MYKEY
(but this works: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=NYSE:DIN&apikey=MYKEY)
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=Lin.be&apikey=MYKEY
(but this works: https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=Novo-b.CO&apikey=MYKEY)
So my question is: has anyone had any luck getting German stocks on Alphavantage (or another free service)? Real-time is not crucial, but obviously a plus.
I use the "Search Endpoint" function to find german stocks on alphavantage.
Let's say you look for "BASF" you could query:
https://www.alphavantage.co/query?function=SYMBOL_SEARCH&keywords=BASF&apikey=[your API key]&datatype=csv
You get a list with possible matches:
symbol,name,type,region,marketOpen,marketClose,timezone,currency,matchScore
BASFY,BASF SE,Equity,United States,09:30,16:00,UTC-05,USD,0.8889
BFFAF,BASF SE,Equity,United States,09:30,16:00,UTC-05,USD,0.8889
BASFX,BMO Short Tax-Free Fund Class A,Mutual Fund,United States,09:30,16:00,UTC-05,USD,0.8889
BAS.DEX,BASF SE,Equity,XETRA,08:00,20:00,UTC+02,EUR,0.7273
BAS.FRK,BASF SE,Equity,Frankfurt,08:00,20:00,UTC+02,EUR,0.7273
BASA.DEX,BASF SE,Equity,XETRA,08:00,20:00,UTC+02,EUR,0.7273
BAS.BER,BASF SE NA O.N.,Equity,Berlin,08:00,20:00,UTC+02,EUR,0.7273
BASF.NSE,BASF India Limited,Equity,India/NSE,09:15,15:30,UTC+5.5,INR,0.6000
See documentation: https://www.alphavantage.co/documentation/
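A small sketch of the same lookup in Python (assumes the requests library; the endpoint and parameters are exactly those shown above):

import csv
import io

import requests

API_KEY = 'MYKEY'  # your Alpha Vantage API key

def search_symbols(keywords, region=None):
    """Query the SYMBOL_SEARCH endpoint and optionally filter by region."""
    resp = requests.get('https://www.alphavantage.co/query', params={
        'function': 'SYMBOL_SEARCH',
        'keywords': keywords,
        'apikey': API_KEY,
        'datatype': 'csv',
    })
    resp.raise_for_status()
    rows = csv.DictReader(io.StringIO(resp.text))
    return [r for r in rows if region is None or r['region'] == region]

# e.g. only the Xetra listings for BASF
for match in search_symbols('BASF', region='XETRA'):
    print(match['symbol'], match['name'])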
It seems to work with the Yahoo symbols on Alphavantage, at least for a few stocks (I did not check all of them). BASF, for example, works with:
https://www.alphavantage.co/query?function=TIME_SERIES_DAILY&symbol=BASF.TI&apikey=MYKEY
The Alphavantage symbols for German securities consist of the Xetra symbol plus ".DE", for example EUNL.DE (for the iShares MSCI World Core ETF). You can find a list of all Xetra stocks here.

Sentence segmentation and dependency parser

I'm pretty new to Python (using Python 3) and spaCy (and to programming in general). Please bear with me.
I have three questions, of which two are more or less the same; I just can't get it to work.
I took the "syntax specific search with spacy" example and tried to make different things work.
My program currently reads a txt file, and the basic extraction

if w.lower_ != 'music':
    return False

works.
My first question is: how can I get spaCy to extract two words, for example "classical music"?
With the snippet above I can make it extract either "classical" or "music", but if I only search for one of the words I also get results I don't want, like:
classical – period / era
or, when I look only for "music":
music – baroque, modern
The second question is: how can I get the dependencies to work?
The example dependency check

elif w.dep_ != 'nsubj':  # Is it the subject of a verb?
    return False

works fine, but everything else I tried does not really work.
For example, I want to extract sentences with the word "birthday" and the dependency 'DATE' (so the "dependency" here is an entity). I got

if d.ent_type_ != 'DATE':
    return False

to work.
So now it would look like:

def extract_information(w, d):
    if w.lower_ != 'birthday':
        return False
    elif d.ent_type_ != 'DATE':
        return False
    else:
        return True
Does something like this even work?
If it does, the third question would be how I can filter sentences, for example by DATE: if a sentence contains a certain word and a DATE, exclude it.
One last thing: I read somewhere that the dependencies are based on the "Stanford typed dependencies manual". Is there a list of which of those dependencies work with spaCy?
Thank you for your patience and help :)
Before I get into offering some simple suggestions for your questions: have you tried using displaCy's visualiser on some of your sentences?
Using the example sentence "John's birthday was yesterday", you'll find that within the parsed sentence, "birthday" and "yesterday" are not necessarily direct dependencies of one another. So searching based on the word "birthday" having a dependency on a DATE-type entity might not yield the best results.
On to the first question:
A brute-force method would be to look for matching subsequent words after you have parsed the sentence.

doc = nlp(u'Mary enjoys classical music.')
for i, token in enumerate(doc):
    if token.lower_ == 'classical' and i != len(doc) - 1:
        if doc[i + 1].lower_ == 'music':
            print('Target Acquired!')
If you're unsure of what enumerate does, look it up; it's the Pythonic way of iterating with indices.
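A less brittle alternative, if your spaCy version supports it, is the built-in PhraseMatcher (a sketch; the add signature below is spaCy v3's, while v2 uses matcher.add('KEY', None, pattern), and attr='LOWER' needs v2.1+):

import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load('en_core_web_sm')  # assumes an English model is installed
doc = nlp(u'Mary enjoys classical music.')

# Match the two-word phrase case-insensitively via the LOWER attribute.
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')
matcher.add('CLASSICAL_MUSIC', [nlp('classical music')])

for match_id, start, end in matcher(doc):
    print('Target Acquired:', doc[start:end].text)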
For questions 2 and 3, one simple (but not elegant) way of solving this is to just check, in a parsed sentence, whether the word 'birthday' exists and whether the doc contains an entity of type 'DATE'.

doc = nlp(u"John's birthday was yesterday.")
for token in doc:
    if token.lower_ == 'birthday':
        for entity in doc.ents:
            if entity.label_ == 'DATE':
                print('Found ya!')
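For question 3 specifically, here is a rough sketch of filtering at the sentence level (it relies on doc.sents and Span.ents, which are available once the parser and NER have run):

doc = nlp(u"John's birthday was yesterday. Mary enjoys classical music.")

def keep_sentence(sent, word='birthday'):
    # Exclude sentences that contain both the target word and a DATE entity.
    has_word = any(t.lower_ == word for t in sent)
    has_date = any(e.label_ == 'DATE' for e in sent.ents)
    return not (has_word and has_date)

kept = [sent.text for sent in doc.sents if keep_sentence(sent)]
print(kept)  # the birthday sentence is filtered out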
As for the list of dependencies, I presume you're referring to the part-of-speech tags; check out the documentation on this page.
Good luck! I hope that helped.

Amazon CloudSearch returns false results

I have a DB of articles, and I would like to search for all the articles that:
1. contain the word 'RIO' in either the title or the excerpt,
2. contain the word 'BRAZIL' in the parent_post_content,
3. and fall within a certain time range.
The (structured) query I searched with was:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (phrase field=title 'RIO') (phrase field=excerpt 'RIO')))
but for some reason I get results that contain 'RIO' in the title but do not contain 'BRAZIL' in the parent_post_content.
This is especially weird because I tried to condition only on the title (and not the excerpt) with this query:
(and (phrase field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (phrase field=name 'RIO'))
and the results seem OK.
I'm fairly new to CloudSearch, so I very likely have syntax errors, but I can't seem to find them. Help?
You're using the phrase operator but not actually searching for a phrase; it would be better to use the term operator (or no operator at all) instead. I can't see why it should matter, but using something outside of its intended purpose can invite unintended consequences.
Here is how I'd re-write your queries:
Using term (mainly just used if you want to boost fields):
(and (term field=parent_post_content 'BRAZIL') (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or (term field=title 'RIO') (term field=excerpt 'RIO')))
Without an operator (I find this simplest):
(and parent_post_content:'BRAZIL' (range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) (or title:'RIO' excerpt:'RIO'))
If that fails, can you post the complete query? I'd like to check that, for example, you're using the structured query parser since you mentioned you're new to CloudSearch.
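For reference, a minimal sketch of sending such a structured query over HTTP with Python's requests (the domain endpoint is a placeholder; q.parser=structured is how you select the structured parser explicitly):

import requests

# Placeholder: the search endpoint of your own CloudSearch domain.
ENDPOINT = 'https://search-YOUR-DOMAIN.us-east-1.cloudsearch.amazonaws.com'

query = ("(and parent_post_content:'BRAZIL' "
         "(range field=post_date ['2016-02-16T08:13:26Z','2016-09-16T08:13:26Z'}) "
         "(or title:'RIO' excerpt:'RIO'))")

resp = requests.get(ENDPOINT + '/2013-01-01/search', params={
    'q': query,
    'q.parser': 'structured',  # make sure the structured parser is used
    'size': 10,
})
print(resp.json()['hits'])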
Here are some relevant docs from Amazon:
Compound queries for more on the various operators
Searching text for specifics on the phrase operator
Apparently the problem was not with the query but with the displayed content. I foolishly trusted that the content displayed in the CloudSearch console was complete, and so concluded that it did not contain 'BRAZIL'. But it was not the full content; when I checked the full content, 'BRAZIL' was there.
Sorry for the foolery.

Need to extract information from free text, like location, course, etc.

I need to write a text parser for the education domain which can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene. The steps are as follows:
1. Index all the data related to institutes, courses, and locations.
2. Make shingles of the free text, search each shingle in the location, course, and institute index directories, and then try to find out which part of the text represents a location, course, etc.
In this approach I am missing a lot of cases; e.g., "B.tech" can also be written as "btech", "b-tech", or "b.tech".
I want to know if there is anything available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has two main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible variants like "B.tech", "btech" and so on into a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is a "Token" annotation with the feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
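Outside GATE, the gazetteer idea itself is simple; here is a rough Python sketch of it (the variant lists and canonical names are made up for illustration):

import re

# One "gazetteer" entry per line: canonical name -> known surface variants.
GAZETTEER = {
    'B.Tech': ['b.tech', 'btech', 'b-tech', 'b tech'],
    'MIT': ['mit', 'massachusetts institute of technology'],
}

def find_entities(text):
    """Mark every gazetteer variant found in the text, longest first."""
    hits = []
    for canonical, variants in GAZETTEER.items():
        for variant in sorted(variants, key=len, reverse=True):
            pattern = r'\b' + re.escape(variant) + r'\b'
            for m in re.finditer(pattern, text, re.IGNORECASE):
                hits.append((m.start(), m.end(), canonical))
    return sorted(hits)

print(find_entities('I did my btech at MIT.'))
# [(9, 14, 'B.Tech'), (18, 21, 'MIT')]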
I didn't use Lucene, but in your case I would leave the different forms of the same keyword as they are and just keep a link table or similar, recording the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary.
Be careful about your choice of analyzer/tokenizer, because a word like B.tech can easily be split into two different tokens (i.e. B and tech).
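For example, a single regex (illustrative only) can normalize the B.tech variants mentioned above before indexing:

import re

# Matches b.tech / btech / b-tech / b tech, case-insensitively.
BTECH = re.compile(r'\bb[.\-\s]?tech\b', re.IGNORECASE)

def normalize(text):
    return BTECH.sub('btech', text)

print(normalize('I hold a B.Tech degree; my friend did B-Tech too.'))
# 'I hold a btech degree; my friend did btech too.'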
You may want to check out UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has add-ons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/. An example of address-parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
        mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token),
                             opt(GraphUtils.regexp("^\\d{4}$", Token))));
// mark(String, Matcher) -- means creating a chunk over the sub-matcher
GraphRegExp.Matcher streetAddress =
        mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
// without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
        Chunkers.regexp("Token", "\\w+"),
        Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
        new GraphExpChunker("Address",
                seq(
                        opt(streetAddress),
                        opt(Postoffice),
                        City,
                        StateLike,
                        Postcode,
                        Country
                )
        ).setDebugString(true)
);
"B.tech can be written as btech, b-tech or b.tech"
Lucene will let you do fuzzy searches based on the Levenshtein distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different forms.
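To illustrate why roam~ matches foam, here is a plain Levenshtein-distance sketch (not Lucene's implementation, just the metric it is based on):

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein('roam', 'foam'))     # 1 -> matched by roam~
print(levenshtein('btech', 'b.tech'))  # also 1 edit apart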