spaCy token.lemma_ not identifying nouns and pronouns

I have been following a tutorial on lemmatization: https://www.machinelearningplus.com/nlp/lemmatization-examples-python/
As described in the spaCy lemmatization section, I loaded the en_core_web_sm model, parsed a given sentence, and extracted the lemma of each word.
My code is below:
import spacy

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their feet for best"
doc = nlp(sentence)
lemmatized_spacy_output = " ".join([token.lemma_ for token in doc])
print(lemmatized_spacy_output)
For input
"The striped bats are hanging on their feet for best"
It gives the output as
the stripe bat be hang on their foot for good
while the expected output is
the strip bat be hang on -PRON- foot for good
As can be seen, the word striped should be identified as a verb, but for some reason it is being classified as a noun (since the output is stripe, not strip).
Also, it is not identifying personal pronouns and is returning those tokens as-is.
I have already gone through a lot of GitHub and Stack Overflow questions, but none of them address my issue.

As aab asked in the comments: which version are you using? I am using spaCy version 3, and calling
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])
sentence = "The striped bats are hanging on their feet for best"
doc = nlp(sentence)
for token in doc:
    print(token.text, " -- ", token.pos_, " -- ", token.lemma_)
returns
The -- DET -- the
striped -- VERB -- stripe
bats -- NOUN -- bat
are -- VERB -- be
hanging -- VERB -- hang
on -- ADP -- on
their -- PRON -- their
feet -- NOUN -- foot
for -- ADP -- for
best -- ADJ -- good
This means striped is indeed identified as a verb, and its verb lemma is stripe.
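If you're not sure which version you have installed, you can check it directly. The -PRON- placeholder that the tutorial expects was specific to spaCy v2's lemmatizer; it was removed in v3, where pronouns get an ordinary lemma, as in the output above. A quick check, assuming spaCy is importable in your environment:
import spacy

print(spacy.__version__)  # the tutorial's -PRON- output matches spaCy v2 behaviour; v3 dropped it
doc = spacy.load('en_core_web_sm')("The striped bats are hanging on their feet for best")
print(" ".join(token.lemma_ for token in doc))
# on v3 this prints: the stripe bat be hang on their foot for good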

Related

Recognize newline (\n) in text as end of sentence in Spacy

I'd like to recognize a newline in text as the end of a sentence. I've tried inputting it into the nlp object like this:
import spacy

text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
nlp = spacy.load("en_core_web_lg")
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print('next sentence:')
    print(sent)
The output of this is:
next sentence:
Guest Blogging
Guest Blogging allows the user to collect backlinks
I don't understand why Spacy isn't recognizing the newline as a sentence end. My desired output is:
next sentence:
Guest Blogging
next sentence:
Guest Blogging allows the user to collect backlinks
Does anyone know how to achieve this?
The reason the sentencizer isn't doing anything here is that the parser has run first and already set all the sentence boundaries, and then the sentencizer doesn't modify any existing sentence boundaries.
The sentencizer with \n is only the right option if you know you have exactly one sentence per line in your input text. Otherwise a custom component that adds sentence starts after newlines (but doesn't set all sentence boundaries) is probably what you want.
If you want to set some custom sentence boundaries before running the parser, you need to make sure you add your custom component before the parser in the pipeline:
nlp.add_pipe("my_component", before="parser")
Your custom component would set token.is_sent_start = True for the tokens right after newlines and leave all other tokens unmodified.
Check out the second example here: https://spacy.io/usage/processing-pipelines#custom-components-simple
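Here is a minimal sketch of such a component, in the style of that example (the component name newline_sent_start is just an illustrative choice, not a built-in spaCy name):
import spacy
from spacy.language import Language

@Language.component("newline_sent_start")
def newline_sent_start(doc):
    # Mark the token right after a newline as a sentence start and leave
    # all other tokens untouched, so the parser can still set the rest.
    for i, token in enumerate(doc[:-1]):
        if "\n" in token.text:
            doc[i + 1].is_sent_start = True
    return doc

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("newline_sent_start", before="parser")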
You can also do this by excluding the parser entirely, so the sentencizer is the only component that sets sentence boundaries:
import spacy

nlp = spacy.load('en_core_web_sm', exclude=["parser"])
text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
config = {"punct_chars": ['\n']}
nlp.add_pipe("sentencizer", config=config)
for sent in nlp(text).sents:
    print("next sentence")
    print(sent)
Output:
next sentence
Guest Blogging
next sentence
Guest Blogging allows the user to collect backlinks
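If you want to double-check that the parser is really excluded and the sentencizer is the only component setting boundaries, you can inspect the pipeline; the exact component names depend on your model and spaCy version:
print(nlp.pipe_names)
# something like ['tok2vec', 'tagger', 'attribute_ruler', 'lemmatizer', 'ner', 'sentencizer'],
# with no 'parser' entry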
You could also split the text on \n yourself before feeding it to spaCy:
from spacy.lang.en import English
def get_sentences(_str):
    chunks = _str.split('\n')
    sentences = []
    nlp = English()
    nlp.add_pipe("sentencizer")
    for chunk in chunks:
        doc = nlp(chunk)
        sentences += [sent.text.strip() for sent in doc.sents]
    return sentences
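Used on the text from the question, this helper gives one sentence per line:
text = 'Guest Blogging\nGuest Blogging allows the user to collect backlinks'
print(get_sentences(text))
# ['Guest Blogging', 'Guest Blogging allows the user to collect backlinks']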

How to properly use keyword 'theorem' in Isabelle?

I obtained the following code from Isabelle's wikipedia page:
theorem sqrt2_not_rational:
"sqrt (real 2) ∉ ℚ"
proof
assume "sqrt (real 2) ∈ ℚ"
then obtain m n :: nat where
n_nonzero: "n ≠ 0" and sqrt_rat: "¦sqrt (real 2)¦ = real m / real n"
and lowest_terms: "gcd m n = 1" ..
from n_nonzero and sqrt_rat have "real m = ¦sqrt (real 2)¦ * real n" by simp
then have "real (m²) = (sqrt (real 2))² * real (n²)" by (auto simp add: power2_eq_square)
also have "(sqrt (real 2))² = real 2" by simp
also have "... * real (m²) = real (2 * n²)" by simp
finally have eq: "m² = 2 * n²" ..
hence "2 dvd m²" ..
with two_is_prime have dvd_m: "2 dvd m" by (rule prime_dvd_power_two)
then obtain k where "m = 2 * k" ..
with eq have "2 * n² = 2² * k²" by (auto simp add: power2_eq_square mult_ac)
hence "n² = 2 * k²" by simp
hence "2 dvd n²" ..
with two_is_prime have "2 dvd n" by (rule prime_dvd_power_two)
with dvd_m have "2 dvd gcd m n" by (rule gcd_greatest)
with lowest_terms have "2 dvd 1" by simp
thus False by arith
qed
However, when I copy this text into an Isabelle instance, there are multiple 'do not enter' symbols to the left of each line. One says 'Illegal application of command "theorem" at top level', so I assumed that you cannot simply define a theorem at the top level and that the Wikipedia page was not supplying a complete initial example. I wrapped the theorem in a theory as follows:
theory Scratch
imports Main
begin
(* Theorem *)
end
Isabelle stopped complaining about the theorem, but, on the second line of the theorem, it now says:
Inner lexical error at: ℚ
Failed to parse proposition
It is also complaining about the proof line:
Illegal application of command "proof" in theory mode
It also has an error for the remaining lines in the theorem.
What is the proper way to wrap this theorem provided by wikipedia so that it can be checked in Isabelle?
I completely agree with Manuel that just importing Main is not sufficient. If you're not interested in the proof itself, but just in testing irrationality, then a good option is to import $AFP/Real_Impl/Real_Impl from the Archive of Formal Proofs; then testing irrationality becomes very easy:
theory Test
imports "$AFP/Real_Impl/Real_Impl"
begin
lemma "sqrt 2 ∉ ℚ" by eval
lemma "sqrt 1.21 ∈ ℚ" by eval
lemma "sqrt 3.45 ∉ ℚ" by eval
end
Your guess that you have to wrap the “theorem” command in a theory in the way you did is correct. However, you need a few more imports; imports Main does not even load the theories containing sqrt, rational numbers, and prime numbers.
Moreover, the proof on Wikipedia is somewhat outdated. Isabelle is a very dynamic system; its maintainers port all the proofs in the library and the Archive of Formal Proofs with every release, but code snippets lying around somewhere (e.g. Wikipedia) tend to become outdated after a while, and I think this particular one is positively ancient.
For an up-to-date proof of pretty much the same thing, properly embedded in a theory with the right imports, look here:
http://isabelle.in.tum.de/repos/isabelle/file/4546c9fdd8a7/src/HOL/ex/Sqrt.thy
Note that this is for the development version of Isabelle; it may not work with your version. In any case, you should have the same file in the correct version as src/HOL/ex/Sqrt.thy in your downloaded Isabelle distribution.
You probably encountered some encoding difficulties; this was the problem in my case (I got the same error).
Isabelle uses so-called 'Isabelle symbols' to represent Unicode characters (see the reference manual, http://isabelle.in.tum.de/doc/isar-ref.pdf, from page 307).
If you use the jEdit IDE that is distributed with Isabelle 2014, then --> looks the same as \<longrightarrow> (the Isabelle symbol). The first cannot be parsed, the second is correct. If you copied and pasted the wiki code, this is the reason it broke.
You can also take a look at the examples in <yourIsabelleInstallFolder>/src/HOL/Isar_Examples.thy for further use of the Isabelle symbols and the general structure of proofs written in the Isar language.

Inconsistencies in tokenizing large English files using Stanford's PTBTokenizer?

I have the Stanford PTBTokenizer (included with POS tagger v3.2.0) from the Stanford JavaNLP API that I'm using to try to tokenize a largish (~12M) file (English language text). Invoking from bash:
java -cp ../david/Desktop/quest/lib/stanford-parser.jar \
edu.stanford.nlp.process.PTBTokenizer -options normalizeAmpersandEntity=true \
-preserveLines foo.txt >tmp.out
I see instances of punctuation not tokenized properly in some places but not others. E.g., output contains "Center, Radius{4}" and also contains elsewhere "Center , Radius -LCB- 4 -RCB-". (The former is a bad tokenization; the latter is correct.)
If I isolate the lines that don't get tokenized properly in their own file and invoke the parser on the new file, the output is fine.
Has anybody else run into this? Is there a way to work around that doesn't involve checking output for bad parses, separating them, and re-tokenizing?
Upgrading to the latest version (3.3.0) fixed the comma attachment problem. There are still spurious cases of brackets/braces not being tokenized correctly (mostly because they are [mis-]recognized as emoticons).
Thanks again to Professor Manning & John Bauer for their prompt & thorough help.

Lexing space-separated words in ANTLR3 where some words are keywords

I am working on a project that involves transforming part-of-speech tagged text into an ANTLR3 AST with phrases as nodes of the AST.
The input to ANTLR looks like:
DT-THE The NN dog VBD sat IN-ON on DT-THE the NN mat STOP .
i.e. (tag token)+, where neither the tag nor the token contains whitespace.
Is the following a good way of lexing this:
WS : (' ')+ {skip();};
TOKEN : (~' ')+;
The grammar then has entries like the following to describe the lowest level of the AST:
dtTHE:'DT-THE' TOKEN -> ^('DT-THE' TOKEN);
nn:'NN' TOKEN -> ^('NN' TOKEN);
(and 186 more of these!)
This approach seems to work, but it results in a ~9000 line Java lexer and takes a large amount of memory to build (~2 GB), so I was wondering whether this is the optimal way of solving this problem.
Could you combine the TAG and the TOKEN into a single AST node? Then you could pass both the TAG and the TOKEN into your source code for handling. If the Java code used to handle the resulting tree is very similar between the various TAGs, then you could perhaps simplify the ANTLR grammar, with the trade-off of a bit more complexity in your Java code.

ANTLR on a noisy data stream Part 2

Following a very interesting discussion with Bart Kiers on parsing a noisy data stream with ANTLR, I'm ending up with another problem...
The aim is still the same: extracting only the useful information with the following grammar:
VERB : 'SLEEPING' | 'WALKING';
SUBJECT : 'CAT'|'DOG'|'BIRD';
INDIRECT_OBJECT : 'CAR'| 'SOFA';
ANY : . {skip();};
parse
: sentenceParts+ EOF
;
sentenceParts
: SUBJECT VERB INDIRECT_OBJECT
;
A sentence like "it's 10PM and the Lazy CAT is currently SLEEPING heavily on the SOFA in front of the TV." will produce the following.
This is perfect and it's doing exactly what I want: from a big sentence, I'm extracting only the words that have meaning for me... But then, I found the following error. If somewhere in the text I introduce a word that begins exactly like a token, I end up with a MismatchedTokenException or a NoViableAltException.
it's 10PM and the Lazy CAT is currently SLEEPING heavily,
with a DOGGY bag, on the SOFA in front of the TV.
produces an error:
DOGGY is interpreted as starting with DOG, which is part of the SUBJECT token, and the lexer gets lost... How could I avoid this without defining DOGGY as a special token? I would have liked the parser to understand DOGGY as a word in itself.
Well, it seems that adding ANY2 : 'A'..'Z'+ {skip();}; solves my problem!