spaCy dependency parsing for the word "his"

How would I distinguish whether the word "his" in a sentence is a determiner or a pronoun in spaCy? If I check word.dep_ for "his", it gives poss (possession modifier) in both cases.
For example:
"Ronaldo was diagnosed with a deadly disease. His disease only became public when he was at the age of 19" - here "His" is a pronoun.
"Hemanth and his friends went to Coorg" - here "his" is a determiner.

Use coreference resolution and replace "his" (and other such words) with the name it refers to, then apply the approach above. Coreference resolution will identify which pronoun refers to which person, so you can simply substitute the name.
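For instance, a minimal sketch with the neuralcoref extension (it works with spaCy 2.x; the model name here is an assumption):
import spacy
import neuralcoref

nlp = spacy.load("en_core_web_sm")  # assumed model; any English model works
neuralcoref.add_to_pipe(nlp)

doc = nlp("Ronaldo was diagnosed with a deadly disease. "
          "His disease only became public when he was 19.")

# doc._.coref_resolved rewrites each pronoun as its antecedent,
# e.g. "His disease" becomes "Ronaldo's disease".
print(doc._.coref_resolved)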

Related

Use MeCab to separate Japanese sentences into words not morphemes in vb.net

I am using the following code to split Japanese sentences into words:
Dim parameter = New MeCabParam()
Dim tagger = MeCabTagger.Create(parameter)
For Each node In tagger.ParseToNodes(sentence)
    If node.CharType > 0 Then
        Dim features = node.Feature.Split(",")
        Console.Write(node.Surface)
        Console.WriteLine(" (" & features(7) & ") " & features(1))
    End If
Next
An input of それに応じて大きくになります。 outputs morphemes:
それ (それ) 代名詞
に (に) 格助詞
応じ (おうじ) 自立
て (て) 接続助詞
大きく (おおきく) 自立
に (に) 格助詞
なり (なり) 自立
ます (ます) *
。 (。) 句点
Rather than words like so:
それ
に
応じて
大きく
に
なります
。
Is there a way I can use a parameter to get MeCab to output the latter? I am very new to coding so would appreciate it if you explain simply. Thanks.
This is actually pretty hard to do. MeCab, Kuromoji, Sudachi, KyTea, Rakuten-MA—all of these Japanese parsers and the dictionary databases they consume (IPADIC, UniDic, Neologd, etc.) have chosen to parse morphemes, the smallest units of meaning, instead of what you call "words", which as your example shows often contain multiple morphemes.
There are a few strategies that folks usually combine to improve on this:
1. Experiment with different dictionaries. I've noticed that UniDic is sometimes more consistent than IPADIC.
2. Use a bunsetsu chunker like J.DepP, which consumes the output of MeCab to chunk morphemes together into bunsetsu. Per this paper, "We use the notion of a bunsetsu which roughly corresponds to a minimum phrase in English and consists of a content words (basically nouns or verbs) and the functional words surrounding them." The bunsetsu output by J.DepP often correspond to "words". I personally don't think of, say, a noun + particle phrase as a "word", but you might—these two are usually in a single bunsetsu. (J.DepP is also pretttty fancy, in that it also outputs a dependency tree between bunsetsu, so you can see which one modifies or is secondary to which other one. See my example.)
3. A last technique that you shouldn't overlook is scanning the dictionary (JMdict) for runs of adjacent morphemes; this helps find idioms and set phrases. It can get complicated, because the dictionary may have a deconjugated form of a phrase in your sentence, so you might have to search both the literal sentence form and the deconjugated (lemma) form of the MeCab output. A rough sketch of this idea follows this list.
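For illustration, here is a minimal Python sketch of that longest-match merging; the morpheme list and the dictionary set are toy stand-ins, not a real MeCab run or JMdict loader:
def merge_runs(morphemes, entries, max_run=4):
    # Greedily merge the longest run of adjacent morphemes found in the dictionary.
    words, i = [], 0
    while i < len(morphemes):
        for n in range(min(max_run, len(morphemes) - i), 0, -1):
            candidate = "".join(morphemes[i:i + n])
            if n == 1 or candidate in entries:
                words.append(candidate)
                i += n
                break
    return words

print(merge_runs(["それ", "に", "応じ", "て"], {"応じて"}))
# ['それ', 'に', '応じて']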
I have an open-source package that combines all of the above called Curtiz: it runs text through MeCab, chunks them into bunsetsu with J.DepP to find groups of morphemes that belong together, identifies vocabulary by looking them up in the dictionary, separates particles and conjugated phrases, etc. It is likely not going to be useful for you, since I use it to support my activities in learning Japanese and making Japanese learning tools but it shows how the above pieces can be combined to get to what you need in Japanese NLP.
Hopefully that's helpful. I'm happy to elaborate more on any of the above topics.

CountVectorizer method get_feature_names() produces codes but not words

I'm trying to vectorize some text with sklearn's CountVectorizer. Afterwards, I want to look at the features the vectorizer generated. But instead, I got a list of codes, not words. What does this mean and how do I deal with the problem? Here is my code:
vectorizer = CountVectorizer(min_df=1, stop_words='english')
X = vectorizer.fit_transform(df['message_encoding'])
vectorizer.get_feature_names()
And I got the following output:
[u'00',
u'000',
u'0000',
u'00000',
u'000000000000000000',
u'00001',
u'000017',
u'00001_copy_1',
u'00002',
u'000044392000001',
u'0001',
u'00012',
u'0004',
u'0005',
u'00077d3',
and so on.
I need real feature names (words), not these codes. Can anybody help me please?
UPDATE:
I managed to deal with this problem, but now when I want to look at my words I see many entries that are not actually words but senseless sets of letters (see screenshot attached). Does anybody know how to filter these out before I use CountVectorizer?
You are using min_df=1, which will include every word found in at least one document, i.e. all the words. min_df can be treated as a hyperparameter itself: a token is kept only if it appears in at least min_df documents, so raising it removes rare noise. I would also recommend using spaCy to tokenize the words and join them as strings before giving the result to CountVectorizer, as in the sketch below.
Note: the feature names that you see are actually part of your vocabulary. It's just noise. If you want to remove them, then set min_df > 1.
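As a sketch of that spaCy preprocessing (the model name is an assumption; df is the DataFrame from the question):
import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model

def clean(text):
    # Keep only alphabetic tokens, dropping numbers and letter/digit mixtures.
    return " ".join(tok.text for tok in nlp(text) if tok.is_alpha)

df['message_encoding'] = df['message_encoding'].apply(clean)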
Here is what you can do to get exactly what you want:
vectorizer = CountVectorizer()
vectorizer.fit_transform(df['message_encoding'])
feat_dict = vectorizer.vocabulary_.keys()
Instead of vectorizer.get_feature_names() you can use vectorizer.vocabulary_.keys() to get the words.
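Alternatively, you can filter the numeric and nonsense tokens directly by tightening CountVectorizer's token_pattern; the regex below (letters only, two or more characters) is just one possible choice:
from sklearn.feature_extraction.text import CountVectorizer

# Keep only tokens made of two or more letters; min_df=2 also drops one-off noise.
vectorizer = CountVectorizer(min_df=2, stop_words='english',
                             token_pattern=r'(?u)\b[a-zA-Z]{2,}\b')
X = vectorizer.fit_transform(df['message_encoding'])
print(sorted(vectorizer.vocabulary_.keys())[:20])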

SQL Server Full Text Search with complete sentences

I have an Azure SQL database and tried the full text search.
Is it possible to search for a complete sentence?
E.g. a query with the LIKE operator that works (but is probably not as fast as full-text search):
SELECT Sentence
FROM Sentences
WHERE 'This is a whole sentence for example.' LIKE '%'+Sentence+'%'
Would return: "a whole sentence"
I need something like that with full text search:
SELECT Sentence
FROM Sentences
WHERE FREETEXT(WorkingExperience,'This is a whole sentence for example.')
This will return each hit on a word, but not on the complete sentence.
E.g. would return: "a whole sentence" and "another sentence".
Is that possible or do I have to use the LIKE-operator?
Have you tried this:
SELECT Sentence
FROM Sentences
WHERE FREETEXT(WorkingExperience,'"This is a whole sentence for example."')
If the above doesn't work, you may need to construct a proper full-text search string with the AND operator. Note that boolean operators are only honored by CONTAINS, not FREETEXT:
SELECT Sentence
FROM Sentences
WHERE CONTAINS(WorkingExperience,'"This" AND "is" AND "a" AND "whole"
AND "sentence" AND "for" AND "example"')
Also, for more precise matching I recommend using CONTAINS or CONTAINSTABLE:
SELECT Sentence
FROM Sentences
WHERE CONTAINS(WorkingExperience,'"This is a whole sentence for example."')
HTH
If anyone else is interested here is a link for a good article with examples:
https://www.microsoftpressstore.com/articles/article.aspx?p=2201634&seqNum=3
You can choose the best method to accommodate your need from the examples.
To me, matching a whole sentence can easily be done with the WHERE clause below, as mentioned in the other answer:
WHERE CONTAINS(WorkingExperience,'"This is a whole sentence for example."')
If you need to match all the words but the user might enter them in an unusual order, I would suggest using
WHERE CONTAINS(WorkingExperience, N'NEAR(This, whole, sentence, is, a, for, example)')
You can do other magic with this, as covered in the article above. If you need to order the results based on the hit score/rank, you will need to use CONTAINSTABLE instead of CONTAINS, as sketched below.
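For reference, a minimal sketch of the rank-ordered variant. CONTAINSTABLE exposes the matched row key as [KEY] and the score as RANK; the Id key column here is an assumption about the Sentences table:
SELECT s.Sentence, ct.RANK
FROM Sentences s
INNER JOIN CONTAINSTABLE(Sentences, WorkingExperience,
    '"This is a whole sentence for example."') AS ct
    ON s.Id = ct.[KEY]  -- Id assumed to be the table's full-text key column
ORDER BY ct.RANK DESC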

How to generate (book) indexes?

I need to create an index for a book. While the task looks easy at first glance -- group words by their first letter, then sort them -- this obvious solution works only for English. The real world is, however, more complex. See http://en.wikipedia.org/wiki/Collation :
The difference between computer-style numerical sorting and true alphabetical sorting becomes obvious in languages using an extended Latin alphabet. For example, the 29-letter alphabet of Spanish treats ñ as a basic letter following n, and formerly treated ch and ll as basic letters following c and l, respectively. Ch and ll are still considered letters, but are now alphabetized as two-letter combinations. (The new alphabetization rule was issued by the Royal Spanish Academy in 1994.) On the other hand, the digraph rr follows rqu as expected, both with and without the 1994 alphabetization rule. A numeric sort may order ñ incorrectly following z and treat ch as c + h, also incorrect when using pre-1994 alphabetization.
I tried to find an existing solution.
The DocBook stylesheets do not address the problem.
The best match I found is xindy ( http://xindy.sourceforge.net/ ), but this tool is too much connected to LaTeX.
Any other suggestions?
Naively, you could examine every word in the text and create a hash, using the words as keys and building up arrays of locations (page numbers?) as values.
But indexes are generally a bit more focused than that.
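Still, for what it's worth, here is a minimal Python sketch of that naive approach; the (page_number, text) input format is an assumption:
from collections import defaultdict

def build_index(pages):
    # Map each word to the list of pages it appears on.
    index = defaultdict(list)
    for page_number, text in pages:
        for word in sorted(set(text.lower().split())):
            index[word].append(page_number)
    return index

index = build_index([(1, "the quick fox"), (2, "the lazy dog")])
print(index["the"])  # [1, 2]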
Well, after responding to the comments, I realized that I don't need a tool to generate indexes, but a library which can sort according to cultures. First experiments show that I'm going to use ICU and its Python bindings, PyICU. For example:
import icu

words = ["liche", "lichée", "lichen", "lichénoïde", "licher", "lichoter"]
collator = icu.Collator.createInstance(icu.Locale.getFrance())
# Python 3: sorted() no longer accepts cmp=, so use the collator's sort key.
for word in sorted(words, key=collator.getSortKey):
    print(word)

Add spaces between words in spaceless string

I'm on OS X, and in Objective-C I'm trying to convert, for example,
"Bobateagreenapple"
into
"Bob ate a green apple"
Is there any way to do this efficiently? Would something involving a spell checker work?
EDIT: Just some extra information:
I'm attempting to build something that takes misformatted text (for example, text copy-pasted from old PDFs that ends up without spaces, especially from internet archives like JSTOR). Since the misformatted text is probably going to be long... well, I'm just trying to figure out whether this is feasible before I actually attempt to write the system, only to find out it takes 2 hours to fix a paragraph of text.
One possibility, which I will describe in a non-OS-specific manner, is to perform a search through all the possible words that make up the collection of letters.
Basically, you chop the first letter off your letter collection and add it to the current word you are forming. If it makes a word (e.g. by dictionary lookup), then add it to the current sentence. If you manage to use up all the letters in your collection and form words out of all of them, then you have a full sentence. But you don't have to stop there. Instead, you keep running, and eventually you will produce all possible sentences.
Pseudo-code would look something like this:
FindWords(vector<Sentence> sentences, Sentence s, Word w, Letters l)
{
    if (l.empty() and w.empty())
    {
        add s to sentences;
        return;
    }
    if (l.empty())
        return;
    move first letter from l to w;
    if (w in dictionary)
    {
        add w to s;
        FindWords(sentences, s, empty word, l);
        remove w from s;
    }
    FindWords(sentences, s, w, l);
    put last letter from w back onto l;
}
There are, of course, a number of optimizations you could perform to make it go fast. For instance checking if the word is the stem of any word in the dictionary. But, this is the basic approach that will give you all possible sentences.
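Here is a runnable Python version of that backtracking idea; the word set is a toy stand-in for a real dictionary:
def find_sentences(letters, dictionary):
    # Collect every segmentation of `letters` into dictionary words.
    results = []

    def backtrack(rest, sentence):
        if not rest:
            results.append(" ".join(sentence))
            return
        for n in range(1, len(rest) + 1):
            word = rest[:n]
            if word in dictionary:
                backtrack(rest[n:], sentence + [word])

    backtrack(letters, [])
    return results

words = {"bob", "ate", "a", "tea", "green", "apple"}
print(find_sentences("bobateagreenapple", words))
# ['bob a tea green apple', 'bob ate a green apple']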
Solving this problem is much harder than anything you'll find in a framework. Notice that even in your example, there are other "solutions": "Bob a tea green apple," for one.
A very naive (and not very functional) approach might be to use a spell-checker to try to isolate one "real word" at a time in the string; of course, in this example, that would only work because "Bob" happens to be an English word.
This is not to say that there is no way to accomplish what you want, but the way you phrase this question indicates to me that it might be a lot more complicated than what you're expecting. Maybe someone can give you an acceptable solution, but I bet they'll need to know a lot more about what exactly you're trying to do.
Edit: in response to your edit, it would probably take less effort to run some kind of OCR tool on the PDF and correct its output than it would to correct what this system might give you, let alone to program it.
I implemented a solution; the code is available on CodeProject:
http://www.codeproject.com/Tips/704003/How-to-add-spaces-between-spaceless-strings
My idea was to prioritize results that use up most of the characters (preferably all of them), then favor the ones with the longest words, because 2-, 3- or 4-letter words can often come up by chance from leftover characters. Most of the time this provides the correct solution.
To find all possible permutations I used recursion. The code is quite fast even with big dictionaries (tested with 50 000 words).