How to convert plural nouns to singular using spaCy?

I am using spaCy to lemmatize text, but in some special cases I need to keep the original text and only convert plural nouns to their singular forms.
Is there a way to tell spaCy to convert only plural nouns to singular, without lemmatizing the whole text (e.g. stripping -ed, -ing, etc.)? Or should I explicitly test each token to check whether it is a plural noun and then take its lemma?
P.S. The input text is dynamic, so I don't know in advance whether a given word is a noun or not.
Thanks

Thanks to bivouac0's comment, I checked the tag_ field of each token and retrieved the lemma of tokens tagged as 'NNS' or 'NNPS':
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with a tagger works
processed_text = nlp(original_text)  # original_text: the input string to process

# Penn Treebank tags for plural common and proper nouns
lemma_tags = {"NNS", "NNPS"}

for token in processed_text:
    lemma = token.text
    if token.tag_ in lemma_tags:
        # only plural nouns are replaced by their lemma (the singular form)
        lemma = token.lemma_
    ...
    # rest of code

spaCy does not offer a dedicated way to convert plural nouns to singular nouns.
What you can do is check whether a token is a plural or a singular noun.
If the token's tag is 'NNS', look that token up in a dictionary and get its singular form.
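A minimal sketch of this dictionary-lookup approach, assuming a small hypothetical plural-to-singular dictionary that you maintain yourself:
import spacy

nlp = spacy.load("en_core_web_sm")  # assumption: any English pipeline with a tagger

# hypothetical lookup table mapping plural forms to singular forms
singulars = {"children": "child", "mice": "mouse", "cats": "cat"}

doc = nlp("The children saw two mice and three cats")
converted = [
    singulars.get(token.text.lower(), token.text)
    if token.tag_ in {"NNS", "NNPS"} else token.text
    for token in doc
]
print(" ".join(converted))  # e.g. "The child saw two mouse and three cat"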

Related

How do I tag the gender of a noun with Spacy?

I would like to tag the gender of nouns using spaCy; specifically, in my case, for German.
I am not sure which spaCy pipeline component has information about noun gender, for example the Tagger or the Lemmatizer?
Different languages have different grammatical features, so you can look at the specific language model of a language to determine what pipelines it has.
For German, we can see under “Label Scheme” that the “morphologizer” pipeline lists labels that include “Gender”.
Here, it shows that the morphologizer assigns the attribute “morph” to each Token.
“morph” is of type “MorphAnalysis”.
There are different ways to access the morphological annotation from a MorphAnalysis object.
The simplest is to use the .get method, passing the name of the category you want:
Token.morph.get("Gender")
which returns a list of strings, since a category can have multiple values.
You can also return the MorphAnalysis as a dictionary with to_dict(), as a string with str(Token.morph), or iterate over Token.morph with a loop, which returns each attribute-value pair as strings.
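For example, a minimal sketch assuming the small German pipeline de_core_news_sm is installed:
import spacy

nlp = spacy.load("de_core_news_sm")  # assumption: small German pipeline
doc = nlp("Die Katze schläft auf dem Tisch")

for token in doc:
    if token.pos_ == "NOUN":
        # MorphAnalysis.get returns a list, e.g. ['Fem'] for "Katze" with this model
        print(token.text, token.morph.get("Gender"), str(token.morph))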

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text that I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
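A minimal sketch of such a component for spaCy 2.x, assuming a hypothetical regular expression that identifies the tokens to pre-tag:
import re
import spacy

nlp = spacy.load("en_core_web_sm")

SKIP_PATTERN = re.compile(r"^[A-Z]{2,}\d+$")  # hypothetical pattern for the special tokens

def preset_propn_tags(doc):
    # set TAG/POS before the tagger runs; the tagger won't overwrite pre-set tags
    for token in doc:
        if SKIP_PATTERN.match(token.text):
            token.tag_ = "NNP"
            token.pos_ = "PROPN"
    return doc

nlp.add_pipe(preset_propn_tags, name="preset_propn_tags", before="tagger")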
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
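A sketch of that, assuming it runs in a component before the parser and that i is the index of a skipped token sitting between sentences (both names are hypothetical):
def isolate_skipped(doc, i):
    # force the skipped token at position i to start its own sentence,
    # and make the following token start the next sentence
    doc[i].is_sent_start = True
    if i + 1 < len(doc):
        doc[i + 1].is_sent_start = True
    return doc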
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
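A minimal sketch of setting NORM along with TAG and POS in the retokenizer, assuming a hypothetical helper that receives the span boundaries from the scanner:
def merge_skipped_span(doc, start, end):
    # merge doc[start:end] into a single token that looks as PROPN-like as possible
    with doc.retokenize() as retokenizer:
        retokenizer.merge(
            doc[start:end],
            # "Bush" is the PROPN-like NORM suggested above
            attrs={"TAG": "NNP", "POS": "PROPN", "NORM": "Bush"},
        )
    return doc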
If you're using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as those of a similar word (check that the similar word actually has a vector first). This tells the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so if you choose them well, you might get pretty close to the results you want.

How to add new lemma rule to existing language for spacy

I want to add a new lemmatiser rule to an existing language, i.e. lemmatise all nouns ending in "z" by dropping the trailing "z".
In the case of individual words, spaCy gives the opportunity to add a tokeniser exception to an existing language after loading using
nlp.tokenizer.add_special_case("adidas", [{ORTH: 'adidas', LEMMA: 'Adidas', POS: 'NOUN', TAG: 'NNP'}])
The above sets the lemma, POS and tag of the new word, and these are not altered afterwards. (Without the exception, the default English lemmatiser returned "adida" as the lemma.)
Now, I am trying to "lemmatise" nouns like "wordz" to "word", "windowz" to "window", etc., without setting all cases as exceptions, but rather by adding a new rule: a noun ending in "z" has as its lemma the noun without the trailing "z".
I understand that it will depend on the tagger output as the rules that exist in _lemma_rules.py are pos dependent.
Is there a way to add the rule without creating a new language as a copy of an existing with just one modified file?
Since my question was very specific, I had to communicate with the spaCy developer team and got a working answer.
Actually, it does not work for the fake English example above, but it does work in a real-world scenario with the Greek models, as Greek lemmatisation is mainly rule-based.
The proposed solution is to use the Lookups Api, which is only available in versions 2.2 and later.
As they mention,
nlp.vocab.lookups.get_table("lemma_rules")
returns a dict-like table that you can write to.
Full answer in spaCy GitHub
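A minimal sketch of the kind of edit this makes possible (spaCy 2.2+). The model name and the exact table layout here are assumptions based on the lookups data, so inspect the table for your own pipeline first:
import spacy

nlp = spacy.load("el_core_news_sm")  # assumption: a Greek pipeline with rule-based lemmatisation

# dict-like table keyed by coarse POS ("noun", "verb", ...)
lemma_rules = nlp.vocab.lookups.get_table("lemma_rules")
noun_rules = lemma_rules.get("noun", [])
noun_rules.append(["z", ""])  # rule format: [suffix_to_strip, replacement]
lemma_rules.set("noun", noun_rules)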

Spacy tokenizer add exception for n't

I want to convert n't to not using this code:
doc = nlp(u"this. isn't ad-versere")
special_case = [{ORTH: u"not"}]
nlp.tokenizer.add_special_case(u"n't",specia_case)
print [text.orth_ for text in doc]
But I get the output as:
[u'this', u'.', u'is', u"n't", u'ad', u'-', u'versere']
n't is still n't
How to solve the problem?
The reason your logic doesn't work is because spaCy uses non-destructive tokenization. This means that it'll always keep a reference to the original input text, and you'll never lose any information.
The tokenizer exceptions and special cases let you define rules for how to split a string of text into a sequence of tokens – but they won't let you modify the original string. The ORTH values of the tokens plus whitespace always need to match the original text. So the tokenizer can split "isn't" into ["is", "n't"], but not into ["is", "not"].
To define a "normalised" form of the string, spaCy uses the NORM attribute, available as token.norm_. You can see this in the source of the tokenizer exceptions here – the norm of the token "n't" is "not". The NORM attribute is also used as a feature in the model, to ensure that tokens with the same norm receive similar representations (even if one is more frequent in the training data than the other).
So if you're interested in the normalised form, you can simply use the norm_ attribute instead:
>>> [t.norm_ for t in doc]
['this', '.', 'is', 'not', 'ad', '-', 'versere']
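If you also want tokens of your own to normalise this way, you can set NORM in a special case yourself. A sketch, using a made-up contraction:
import spacy
from spacy.symbols import ORTH, NORM

nlp = spacy.load("en_core_web_sm")
# hypothetical special case: split "gimme" into "gim" + "me", normalising "gim" to "give"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim", NORM: "give"}, {ORTH: "me"}])
print([t.norm_ for t in nlp("gimme that")])  # ['give', 'me', 'that']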

Antlr tokens from file

What is the best way to feed Antlr with huge numbers of tokens?
Say we have a list of 100,000 English verbs; how could we add them to our grammar? We could of course include a huge grammar file like verbs.g, but maybe there is a more elegant way, by modifying a .tokens file etc.?
grammar verbs;
VERBS:
'eat' |
'drink' |
'sit' |
...
...
| 'sleep'
;
Also, should these rather be lexer or parser rules, i.e. VERBS: or verbs:? Probably VERBS:.
I would rather use semantic predicates.
For this you have to define a generic token
WORD : [a-z]+ ;
and at every site where you want a verb (instead of a generic word), put a semantic predicate that checks whether the parsed word is in the list of verbs (see the sketch after this list).
I would recommend not using the parser/lexer for such a task, because:
each additional verb would change the grammar
each additional verb enlarges the generated code
conjugation is easier to handle this way
upper/lower case can be handled more easily
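A minimal sketch of the predicate approach, assuming the default Java target; the hard-coded verbs set is a hypothetical stand-in for the 100,000-entry list, which you would load from a file at runtime:
grammar Verbs;

@parser::members {
    // hypothetical verb set; in practice, load the verbs from a file here
    java.util.Set<String> verbs =
        new java.util.HashSet<>(java.util.Arrays.asList("eat", "drink", "sit", "sleep"));
}

// the predicate gates the rule on the lookahead token's text, so a WORD
// only parses as a verb if its text is in the runtime verb set
verb : { verbs.contains(_input.LT(1).getText()) }? WORD ;

WORD : [a-z]+ ;
WS   : [ \t\r\n]+ -> skip ;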