How to add a new lemma rule to an existing language in spaCy

I want to add a new lemmatiser rule for an existing language, e.g. lemmatise all nouns ending in "z" by stripping the trailing "z".
For individual words, spaCy lets you add a tokeniser exception to an existing language after loading, using
nlp.tokenizer.add_special_case("adidas", [{ORTH: 'adidas', LEMMA: 'Adidas', POS: 'NOUN', TAG: 'NNP'}])
The above sets the lemma, POS and tag of the new word, and these are not altered afterwards. (Without the exception, the default English lemmatiser returned "adida" as the lemma.)
Now I am trying to "lemmatise" nouns such as "wordz" to "word" and "windowz" to "window" without listing every case as an exception, but rather by adding a new rule: a noun ending in "z" has as its lemma the noun without the trailing "z".
I understand that it will depend on the tagger output, as the rules in _lemma_rules.py are POS-dependent.
Is there a way to add the rule without creating a new language as a copy of an existing one with just one modified file?

Since my question was very specific, I contacted the spaCy developer team and got a working answer.
It does not actually work for the fake English example above, but it does work in a real-case scenario with the Greek models, as Greek lemmatisation is mainly rule-based.
The proposed solution is to use the Lookups API, which is only available in versions 2.2 and later.
As they mention,
nlp.vocab.lookups.get_table("lemma_rules")
returns a dict-like table that you can write to.
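A minimal sketch of what that could look like, assuming spaCy 2.2+, a model whose lemmatiser is rule-based (the Greek el_core_news_sm here), and that the rules table is keyed by coarse POS such as "noun":
import spacy

nlp = spacy.load("el_core_news_sm")
lemma_rules = nlp.vocab.lookups.get_table("lemma_rules")

# Each entry maps a coarse POS key to a list of [suffix, replacement] pairs.
noun_rules = list(lemma_rules.get("noun", []))
noun_rules.append(["z", ""])      # e.g. "wordz" -> "word", "windowz" -> "window"
lemma_rules["noun"] = noun_rules  # the table is dict-like, so it can be written to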
Full answer in spaCy GitHub

Related

Force 'parser' to not segment sentences?

Is there an easy way to tell the "parser" pipe not to change the value of Token.is_sent_start?
So, here is the story:
I am working with documents that are pre-sentencized (1 line = 1 sentence), and this segmentation is all I need. I realized the parser's segmentation is not always the same as in my documents, so I don't want to rely on it.
I can't change the segmentation after the parser has run, so I cannot correct its mistakes (you get an error). And if I segment the text myself and then apply the parser, it overrules the segmentation I've just made, so that doesn't work.
So, to force keeping the original segmentation and still use a pretrained transformer model (fr_dep_news_trf), I either:
disable the parser,
add a custom Pipe to nlp to set Token.is_sent_start how I want,
create the Doc with nlp("an example")
or, I simply create a Doc with
doc = Doc(nlp.vocab, words=["an", "example"], sent_starts=[True, False])
and then I apply every element of the pipeline except the parser.
However, I still need the parser at some point (because I need some subtrees), and if I simply apply it to my Doc, it overrules the segmentation already in place, so in some cases the segmentation ends up incorrect. So I do the following workaround:
Keep the correct segmentation in a list sentences = list(doc.sents)
Apply the parser on the doc
Work with whatever syntactic information the parser computed
Retrieve whatever sentential information I need from the list I previously made, as I can no longer trust Token.is_sent_start.
It works, but it doesn't really feel right imho; it feels a bit messy. Is there an easier, cleaner way I missed?
Something else I am considering is setting a custom extension, so that I would, for instance, use Token._.is_sent_start instead of the default Token.is_sent_start, and a custom Doc._.sents, but I fear it might be more confusing than helpful ...
Some user suggested using span.merge() for a pretty similar topic, but the function doesn't seem to exist in recent releases of spaCy (Preventing spaCy splitting paragraph numbers into sentences)
The parser is supposed to respect sentence boundaries if they are set in advance. There is one outstanding bug where this doesn't happen, but that was only in the case where some tokens had their sentence boundaries left unset.
If you set all the token boundaries to True or False (not None) and then run the parser, does it overwrite your values? If so, it'd be great to have a specific example of that, because that sounds like a bug.
Given that, if you use a custom component to set your true sentence boundaries before the parser, it should work.
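A minimal sketch of such a component, assuming spaCy 3.x (as fr_dep_news_trf requires), that the input keeps one sentence per line, and that newlines survive tokenisation as their own tokens:
import spacy
from spacy.language import Language

@Language.component("line_boundaries")
def line_boundaries(doc):
    # Give every token an explicit True/False, never None.
    for i, token in enumerate(doc):
        token.is_sent_start = i == 0 or "\n" in doc[i - 1].text
    return doc

nlp = spacy.load("fr_dep_news_trf")
nlp.add_pipe("line_boundaries", before="parser")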
Regarding some of your other points...
I don't think it makes any sense to keep your sentence boundaries separate from the parser's - if you do that you can end up with subtrees that span multiple sentences, which will just be weird and unhelpful.
You didn't mention this in your question, but is treating each sentence/line as a separate doc an option? (It's not clear if you're combining multiple lines and the sentence boundaries are wrong, or if you're passing in a single line but it's turning into multiple sentences.)

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text that I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to merge these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
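A rough sketch of such a component under the spaCy 2.3 API (the regular expression is a made-up placeholder for the asker's pattern):
import re
import spacy

nlp = spacy.load("en_core_web_sm")
UNANALYZABLE = re.compile(r"^[A-Z]{2,}-\d+$")  # hypothetical pattern for the scanned tokens

def force_propn(doc):
    # Pre-set tags are left alone by the tagger, so these stick.
    for token in doc:
        if UNANALYZABLE.match(token.text):
            token.tag_ = "NNP"
            token.pos_ = "PROPN"
    return doc

nlp.add_pipe(force_propn, before="tagger")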
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
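For example, the NORM suggestion could be applied in the retokenizing step roughly like this (a sketch; the sentence and span are stand-ins for whatever the asker's scanner actually matches):
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("We met Xq Zzkv yesterday")   # "Xq Zzkv" stands in for a scanned span

with doc.retokenize() as retokenizer:
    # A PROPN-like NORM such as "Bush" gives the parser a familiar feature value.
    retokenizer.merge(doc[2:4], attrs={"TAG": "NNP", "POS": "PROPN", "NORM": "Bush"})

# Then run the remaining pipeline components on the retokenized Doc.
for _, proc in nlp.pipeline:
    doc = proc(doc)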
If you're using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as those of a similar word (check that the similar word has a vector first). This tells the model to use the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so if you choose them well, it's possible you'll get pretty close to the results you want.

Portuguese tokenizer: it is breaking "ao" into "a" and "o"

I am using spaCy as a tokenizer for Portuguese documents (the latest version).
But it is making a mistake in the following sentence: 'esta quebrando aonde nao devia, separando a e o em ao e aos' (roughly: "it is breaking where it shouldn't, separating a and o in ao and aos").
It is breaking "ao" into "a" and "o". The same is happening with other words like "aonde" ("a" + "onde") and others ("aos", etc.).
Other strange cases: "àquele" into "a" and "quele"; "às" into "à" and "s".
The problem can be shown in the "Test the model live (experimental)" in https://spacy.io/models/pt.
For now, I am adding some known words with tokenizer.add_special_case. But I may not remember all cases.
Is it possible to adjust this behaviour?
It seems appropriate to me to break the expression "ao" into two functional parts: preposition and article. Depending on the application, it would be simple to concatenate these parts back together as required by the official grammar.
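For completeness, the special-case workaround the asker mentions might look like this (a sketch; the model name, and the assumption that a user-added special case takes precedence over the built-in exception, may not hold for every installed pipeline):
import spacy
from spacy.symbols import ORTH

nlp = spacy.load("pt_core_news_sm")
print([t.text for t in nlp("Fui ao mercado")])   # the default exception splits "ao"

# Register a special case that keeps "ao" as a single token.
nlp.tokenizer.add_special_case("ao", [{ORTH: "ao"}])
print([t.text for t in nlp("Fui ao mercado")])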

Fuzziness in UIMA Ruta

Is there any option for fuzziness in word matching, or for ignoring some special characters?
For example:
STRINGLIST ANIMALLIST = {"LION","TIGER","MONKEY"};
DECLARE ANIMAL;
Document {-> MARKFAST(ANIMAL, ANIMALLIST, true)};
I need to match words against the list even when a special character is attached, like
Tiger- or MONKEY$
According to the documentation there are different evaluators; any idea how to use them?
Or can I use SCORE or MARKSCORE?
There are several aspects to consider here. In general, UIMA Ruta does not support fuzziness in the dictionary lookup. SCORE and MARKSCORE are language elements which can be utilized to introduce some heuristic scoring (not really fuzziness) in sequential rules. In the examples you gave in your question, you do not really need fuzzy matching.
The dictionary lookup in UIMA Ruta works on the RutaBasic annotations. These annotations are automatically created and maintained by UIMA Ruta itself (and should not be changed directly by other analysis engines or rules). The RutaBasic annotations represent the smallest fragments that other annotations refer to. By default, the seeder of the RutaEngine creates annotations for words (W -> CW, SW, CAP) and many other tokens, like SPECIAL for - or $. This means that the special character gets its own RutaBasic annotation, and the dictionary lookup can distinguish between these tokens. As a result, Tiger and Monkey should be annotated, and the example in your question should actually work (I tested it). You may need some postprocessing in order to include the SPECIAL in ANIMAL.
I have to mention that there is also functionality to use an edit distance in the dictionary lookup (Multi Tree Word List, TRIE). However, this functionality has not been maintained for several years. It should also support different weights for specific replacements. I do not know if this counts as fuzziness.
DISCLAIMER: I am a developer of UIMA Ruta

Handling Grammar / Spelling Issues in Translation Strings

We are currently implementing a Zend Framework Project, that needs to be translated in 6 different languages. We already have a pretty sophisticated translation system, based on Zend_Translate, which also handles variables in translation keys.
Our project has a new Turkish translator, and we are facing a new issue: grammar, especially Turkish grammar. I noticed that this problem is probably present in every translation system and for most languages, so I posted the question here.
Question: Any ideas how to handle translations like:
Key: I have a[n] {fruit}
Variables: apple, banana
Result: I have an apple. I have a banana.
Key: Stimme für {user}[s] Einsendung
Variables: Paul, Markus
Result: Stimme für Pauls Einsendung,
Result: Stimme für Markus Einsendung
Does anybody have a solution or idea for this? My only guess would be to avoid the issue by not using translations where it occurs.
How do other platforms handle this?
Of course the translation system has no idea which type of word it is placing where, in which type of sentence. It only does string replacements...
PS: Turkish is even more complicated:
For example, on a profile page, we have "Annie's Network". This should translate as "Annie'nin Aği".
If the first name ends in a vowel, the suffix will start with an n and look like "Annie'nin"
If the first name ends in a consonant, it will not have the first n, and look like "Kris'in"
If the last vowel is an a or ı, it will look like "Dan'ın"; or "Seyma'nın"
If the last vowel is an o or u, it will look like "Davud'un"; or "Burcu'nun"
If the last vowel is an e or i, it will look like "Erin'in"; or "Efe'nin"
If the last vowel is an ö or ü, it will look like "Göz'ün"; or "Iminönü'nün"
If the last letter is a k (like the name "Basak"), it will look like "Basağın"; or "Eriğin"
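As an illustration only (not a complete treatment of Turkish morphology, and ignoring the k-mutation case above), the vowel-harmony choice could be expressed roughly like this:
VOWEL_TO_SUFFIX = {
    "a": "ın", "ı": "ın",
    "o": "un", "u": "un",
    "e": "in", "i": "in",
    "ö": "ün", "ü": "ün",
}
VOWELS = set(VOWEL_TO_SUFFIX)

def genitive(name):
    # Pick the suffix from the last vowel; add a buffer "n" after a final vowel.
    last_vowel = next((c for c in reversed(name.lower()) if c in VOWELS), "e")
    suffix = VOWEL_TO_SUFFIX[last_vowel]
    if name.lower()[-1] in VOWELS:
        suffix = "n" + suffix
    return name + "'" + suffix

print([genitive(n) for n in ["Annie", "Kris", "Seyma", "Burcu", "Efe", "Göz"]])
# ["Annie'nin", "Kris'in", "Seyma'nın", "Burcu'nun", "Efe'nin", "Göz'ün"]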
It is actually a very hard problem, as grammar rules differ even among languages from the same family. I don't think you could easily do anything for, let's say, Slavic languages...
However, if you want to solve this problem (because it is extra challenging) and you are looking for creative (cross-inspiring) ways to do that, you might want to look into something called ChoiceFormat (an example would be the one from the ICU project), or you can look up GNU Gettext's solution to the plural forms problem.
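To illustrate the Gettext plural-forms mechanism, a minimal Python sketch (the "messages" domain and "locale" directory are placeholders; with no catalog installed, the English fallback is used):
import gettext

t = gettext.translation("messages", localedir="locale", languages=["tr"], fallback=True)
for n in (1, 5):
    # The compiled catalog's Plural-Forms header decides which string is picked.
    print(t.ngettext("%(n)d apple", "%(n)d apples", n) % {"n": n})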
ICU (mentioned above) has a SelectFormat (http://site.icu-project.org/design/formatting/select) that may be of help: it's like a ChoiceFormat but with arbitrary keywords. It also has a PluralFormat, which already encodes the plural rules for many languages.