Why is the spaCy Matcher not matching case-insensitive words in a document?

I want the spaCy Matcher to match keywords (multi-word entities) in a document irrespective of their case.
Token.lemma is case-sensitive, so with this code I can only find "product preferences", not "PRODUCT PREFERENCES" or "Product Preferences", in my document:
# is_final_token() is a helper defined elsewhere in my script
pat_piece = ({"LEMMA": token.lemma_.lower()} if is_final_token(token, tmpdoc)
             else {"LOWER": token.lower_})
Can someone suggest how I can edit my code to match ALL cases for keywords (i.e., entities)?

With the provided attributes you can only match LOWER or LEMMA, not "lowercase lemma". So if you generate this pattern:
{"LEMMA": "product"}
for a token whose lemma is PRODUCT, it simply won't match.
If you want to match lowercase lemmas, some options are:
- postprocess the docs to lowercase the lemmas before running the matcher (either separately in your script or with a custom pipeline component)
- use a custom lemmatizer that produces lowercase lemmas
- use a custom extension with a getter that returns the lowercase form of the lemma, for use with a "_" matcher pattern (a "property extension" as described here: https://spacy.io/usage/processing-pipelines#description)
If your only concern is matching lowercase lemmas, I'd suggest the first option as the easiest to implement and fastest to run in the matcher; a sketch follows below.
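A minimal sketch of the first option, assuming a spaCy v3 pipeline that includes a "lemmatizer" component (the component name "lower_lemmas" and the example text and pattern are mine, not from the original answer):

import spacy
from spacy.language import Language
from spacy.matcher import Matcher

@Language.component("lower_lemmas")
def lower_lemmas(doc):
    # Overwrite each token's lemma with its lowercase form, so that
    # {"LEMMA": ...} patterns written in lowercase always match.
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("lower_lemmas", after="lemmatizer")

matcher = Matcher(nlp.vocab)
matcher.add("KEYWORD", [[{"LEMMA": "product"}, {"LEMMA": "preference"}]])

doc = nlp("Please review your Product Preferences.")
print([doc[start:end].text for _, start, end in matcher(doc)])

The third option works along the same lines: register Token.set_extension("lower_lemma", getter=lambda t: t.lemma_.lower()) and match with a pattern like {"_": {"lower_lemma": "preference"}}.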

Related

How to add a new lemma rule to an existing language in spaCy

I want to add a new lemmatiser rule for an existing language, i.e. lemmatise all nouns ending in "z" by dropping the trailing "z".
In the case of individual words, spaCy gives you the opportunity to add a tokeniser exception to an existing language after loading, e.g.
from spacy.symbols import ORTH, LEMMA, POS, TAG
nlp.tokenizer.add_special_case("adidas", [{ORTH: 'adidas', LEMMA: 'Adidas', POS: 'NOUN', TAG: 'NNP'}])
The above sets the lemma, POS and tag of the new word, and these are not altered afterwards. (Without the exception, the default English lemmatiser returned "adida" as the lemma.)
Now I am trying to lemmatise nouns such as "wordz" to "word", "windowz" to "window", etc. without listing every case as an exception, but rather by adding a new rule: a noun ending in "z" has as its lemma the noun without the trailing "z".
I understand that it will depend on the tagger output, as the rules that exist in _lemma_rules.py are POS-dependent.
Is there a way to add the rule without creating a new language as a copy of an existing one with just one modified file?
Since my question was very specific, I had to communicate with the spaCy developer team and got a working answer.
It does not actually work for the fake English example above, but it does work in a real-world scenario with the Greek models, as Greek lemmatisation is mainly rule-based.
The proposed solution is to use the Lookups API, which is only available in versions 2.2 and later.
As they mention,
nlp.vocab.lookups.get_table("lemma_rules")
returns a dict-like table that you can write to.
The full answer is in the spaCy GitHub discussion.
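A minimal sketch of the Lookups approach, assuming a spaCy 2.2+ pipeline whose lemma tables are loaded into the vocab (in v3 the tables live on the lemmatizer component instead, e.g. nlp.get_pipe("lemmatizer").lookups); the "z" rule is the fake example from the question:

import spacy

nlp = spacy.load("en_core_web_sm")

# The writable, dict-like table mentioned in the answer.
lemma_rules = nlp.vocab.lookups.get_table("lemma_rules")

# Each key is a coarse POS; each value is a list of [suffix, replacement]
# pairs. Append a rule so that nouns ending in "z" drop the trailing "z".
noun_rules = lemma_rules.get("noun", [])
noun_rules.append(["z", ""])
lemma_rules["noun"] = noun_rules

As the answer notes, whether the new rule has any visible effect depends on how rule-based the language's lemmatisation is; it worked for Greek but not for the fake English example.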

spaCy tokenizer to handle final period in sentence

I'm using spaCy to tokenize sentences, and I know that the text I pass to the tokenizer will always be a single sentence.
In my tokenization rules, I would like non-final periods (".") to be attached to the text before them, so I updated the suffix rules to remove the rules that split on periods (this handles abbreviations correctly).
The exception, however, is that the very last period should be split into a separate token.
I see that the latest version of spaCy allows you to split tokens after the fact, but I'd prefer to do this within the Tokenizer itself so that other pipeline components work with the correct tokenization.
Here is one solution that uses some post-processing after the tokenizer:
I added "." to the suffixes so that a period is always split into its own token.
I then used a regex to find non-final periods, generated a span with doc.char_span, and merged the span into a single token with span.merge.
It would be nice to be able to do this within the tokenizer itself, if anyone knows how.
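A rough sketch of this post-processing approach, updated for spaCy v3 (where Span.merge has been replaced by the Doc.retokenize context manager); the sample sentence and regex are illustrative:

import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Add "." as a suffix so the tokenizer always splits a period into its own token.
suffixes = list(nlp.Defaults.suffixes) + [r"\."]
nlp.tokenizer.suffix_search = spacy.util.compile_suffix_regex(suffixes).search

doc = nlp("The approx. weight arrived at approx. noon.")

# Merge every non-final "word." back into a single token; the merges are
# applied when the retokenize block exits.
with doc.retokenize() as retokenizer:
    for match in re.finditer(r"\w+\.", doc.text[:-1]):  # [:-1] skips the final period
        span = doc.char_span(match.start(), match.end())
        if span is not None and len(span) > 1:
            retokenizer.merge(span)

print([t.text for t in doc])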

docx4j simulate expression evaluation

I have a docx document with some formulas, e.g.
{IF "Name" = "Foo" "Foo" "Bar"}
which should produce "Bar" in the end.
In Word I have to press F9 to get the expression evaluated.
Now I am using docx4j; can I somehow tell docx4j to do the evaluation?
I'm afraid not. You can get the expression of course (see this discussion about some classes which help), but there is currently nothing in docx4j to evaluate an IF field for you.
If the objective is to include/exclude text, you'll be able to achieve the same end with an OpenDoPE conditional content control (based on whether an XPath evaluates to true or false). (docx4j can evaluate these; they can also be nested, to support complex content)

Lucene - Which analyzer to use to avoid prepositions

I am using the Lucene standard analyzer to parse text. However, it is returning prepositions as well as words like "i", "the", "and", etc.
Is there an Analyzer I can use that will not return these words?
Thanks
StandardAnalyzer uses StopFilter.
By default the words in the STOP_WORDS_SET are excluded. If this is not sufficient, there are constructors which allow you to pass in a list of stop words which should be removed from the token stream. You can provide the list using a File, a Set, or a Reader.

"Exclude these words" feature

How do I implement an "Exclude these words" feature for a search application using Lucene?
Thanks!
Therefore, I can use the StopAnalyzer:
StopAnalyzer includes the lower-case filter, and also has a filter that drops out any "stop words": words like articles (a, an, the, etc.) that occur so commonly in English that they might as well be noise for searching purposes. StopAnalyzer comes with a default set of stop words, but you can instantiate it with your own array of stop words.
http://lucene.apache.org/java/2_3_0/api/org/apache/lucene/analysis/StopAnalyzer.html
More information:
http://www.darksleep.com/lucene/
Look at the NOT operator here. Just construct your query accordingly, or massage it if it is a user-generated query.
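For example, using the standard Lucene query parser syntax (the terms here are illustrative):

apple NOT pie

or, equivalently, apple -pie: both return documents that contain "apple" but not "pie".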