Adding singular-plural combinations to spaCy

How do I add new lemmas to spaCy? For instance, new singular-plural noun pairs.
Example:
Kirana = Singular
Kiranas = Plural
I want to add it to spaCy so that when a sentence contains "Kiranas", "Kirana" will show up as its lemma.

Just add the word "kirana" to the _nouns.py file inside the spaCy installation folder (spacy/en/lemmatizer/_nouns.py). When adding the word, keep the existing text format of the file unchanged. Since the lemmatization rules are already defined, this should then work as you intended.
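For instance, after adding "kirana" there, a quick check should show the new lemma (a minimal sketch, assuming an English model is installed and can be loaded with spacy.load("en")):
import spacy
nlp = spacy.load("en")  # any English model that picks up the edited lemmatizer data
doc = nlp("There are two Kiranas on this street.")
for token in doc:
    print(token.text, token.lemma_)
# "Kiranas" should now lemmatize to "kirana"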

Related

Using VBA, how do I create a custom field in Word that uses the current page number?

I'm trying to add a custom field in Word (in the shape { CUSTOM_FIELD }) that uses the current page number and outputs its text representation (12 => twelve), but in multiple exotic (unsupported) languages, which is why the built-in English variant ({ PAGE \* CardText }) isn't sufficient.
The VBA code won't be a problem, but I need to know how to create a custom field.
The field would be added to the footer template, before 100s of pages would be added programmatically.
I tried using a custom DocProperty, but wasn't able to find a way to integrate the needed behavior. Another linked answer seems to be using the existing { PAGE } field, which wouldn't help, as I need to insert the new field (once only) into the footer template.
This isn't the only possible way to do this, but if you have a reasonably small maximum page count then you could use a { DOCVARIABLE } field something like this, where "LANG" is just a piece of text that identifies the language you want to use:
{ DOCVARIABLE "LANG{ PAGE }" }
You would then need to use VBA to set up one document variable to store the text version of each page number. Document variables are in essence key-value pairs stored invisibly in the document.
E.g. let's suppose you wanted the text versions of the numbers from one to three in English, French and German. You could then have document variables with the following names and values:
EN1 One
EN2 Two
EN3 Three
FR1 Un
FR2 Deux
FR3 Trois
DE1 Ein
DE2 Zwei
DE3 Drei
Even if you need hundreds or thousands of these, the amount of text you can store in document variables is very large. OTOH, if you need to generate the texts dynamically by building them with an algorithm (as \* CardText probably does), this won't work.
To set up one of these variables you just need, e.g.
ActiveDocument.Variables("EN1").Value = "One"
The field you would need for the English results would be
{ DOCVARIABLE "EN{ PAGE }" }
As long as you only need to use one language in each document, you could just change the "EN" to "FR" to get the French version, etc. - after all, if you only have one footer layout, you would only need to make one change. Otherwise, you could consider storing the language code somewhere else in the document, e.g.
in a bookmark called LANG, in which case you might use
{ DOCVARIABLE "{ LANG }{ PAGE }" }
in a DOCVARIABLE called LANG, so you would use
{ DOCVARIABLE "{ DOCVARIABLE LANG }{ PAGE }" }
in a Custom document property called LANG, so you would use
{ DOCVARIABLE "{ DOCPROPERTY LANG }{ PAGE }" }
(The problem with using custom document properties for your numbers is that you can only have a small number of them).
If that general approach can't be made to fit what you're trying to achieve, I think you'll probably need to clarify your Question some more.
The Cardtext switch will give numbering in different languages, assuming that the language is applied to the text as proofing language.
I would recommend saving a { PAGE \* CardText } field as AutoText or another Building Block in your template and using code to insert it. Here is my writing on using VBA to insert AutoText.
The following creates the field at the insertion point.
' Insert an empty field at the insertion point, then type the field code into it
Selection.Fields.Add Range:=Selection.Range, Type:=wdFieldEmpty, _
    PreserveFormatting:=False
Selection.TypeText Text:="Page \* CardText"
Now that you've explained what you're trying to do, here's how it can be done without VBA, for up to 999 pages, using a compound field coded as:
{QUOTE "{=MOD(INT({PAGE}/100),10) # "'{=INT((MOD(INT({PAGE}/100),10)-1)/3)-1 # "'{=MOD(MOD(INT({PAGE}/100),10)-1,9)-7 # "'Novecientos';'Sietientos';' Ochocientos'"}';'{=MOD(MOD(INT({PAGE}/100),10)-1,9)-1 # "'Trescientos';'{=MOD({PAGE},100) # "'Ciento';;'Cien'"}';'Doscientos'"}';'{=MOD(MOD(INT({PAGE}/100),10)-1,9)-4 # "'Seisientos';'Cuatroscientos';'Quinientos'"}'"}{=MOD({PAGE},100) # "' ';;"}';;''"}{=INT((MOD({PAGE},100)+10)/20)-1 # "'{=MOD(INT({PAGE}/10),10) # "'{=INT((MOD(INT({PAGE}/10),10)-1)/3)-1 # "'{=MOD(MOD(INT({PAGE}/10),10)-1,9)-7 # "'Noventa';'Setenta';'Ochenta'"}';'{=MOD(MOD(INT({PAGE}/10),10)-1,9)-1 # "'Treinta';'';''"}';'{=MOD(MOD(INT({PAGE}/10),10)-1,9)-4 # "'Sesenta';'Cuarenta';'Cincuenta'"}'"}';;''"}{=MOD({PAGE},10) # "'{=INT((MOD({PAGE},10)-1)/3)-1 # "'{=MOD(MOD({PAGE},10)-1,9)-7 # "' y Nueve';' y Siete';' y Ocho'"}';'{=MOD(MOD({PAGE},10)-1,9)-1 # "' y Tres';' y Uno';' y Dos'"}';'{=MOD(MOD({PAGE},10)-1,9)-4 # "' y Seis';' y Cuatro';' y Cinco'"}'"}';;"}';'{=INT((MOD({PAGE},100)+10)/20)-1 # "'';'{=MOD({PAGE},10) # "'{=INT((MOD({PAGE},10)-1)/3)-1 # "'{=MOD(MOD({PAGE},10)-1,9)-7 # "'Nueve';'Siete';'Ocho'"}';'{=MOD(MOD({PAGE},10)-1,9)-1 # "'Tres';'Uno';'Dos'"}';'{=MOD(MOD({PAGE},10)-1,9)-4 # "'Seis';'Cuatro';'Cinco'"}'"}';;"}';''"}';'{=INT(MOD({PAGE},100)/10)-1 # "'{=MOD({PAGE},10) # "'{=INT((MOD({PAGE},10)-1)/3)-1 # "'{=MOD(MOD({PAGE},10)-1,9)-7 # "'Veintinueve';'Veintisiete';'Veintiocho'"}';'{=MOD(MOD({PAGE},10)-1,9)-1 # "'Veintitrés';'Veintiuno';'Veintidós'"}';'{=MOD(MOD({PAGE},10)-1,9)-4 # "'Veintiséis';'Veinticuatro';'Veinticinco'"}'"}';;'Veinte'}';;'{=MOD({PAGE},10) # "'{=INT((MOD({PAGE},10)-1)/3)-1 # "'{=MOD(MOD({PAGE},10)-1,9)-7 # "'Diecinueve';'Diecisiete';'Dieciocho'"}';'{=MOD(MOD({PAGE},10)-1,9)-1 # "'Trece';'Once';'Doce'"}';'{=MOD(MOD({PAGE},10)-1,9) -4 # "'Dieciséis';'Catorce';'Quince'"}'"}';;'Diez'"}'"}'"}"}
As coded, the page #s are output in standard Spanish text - just as an example of what can be done. I'll leave it to you to edit the text to match the Basque-Algonquian Pidgin dialect. Note that the outputs for pages 1-9 are specified twice in the field code - once for digits below 10 and once for digits above 30 (with ' y ').
As Charles Kenyon has already pointed out, for most languages all you need do is set the proofing language and employ a \* CardText switch. By using a single custom document property (for which no VBA code is needed), you could wrap a series of such fields in an IF test that tests the document property's text. For example:
{IF{DOCPROPERTY Lang}= "Basque-Algonquian" "Field code for Basque-Algonquian"}
{IF{DOCPROPERTY Lang}= "Exotic Language A" "Field code for Exotic Language A"}
etc.
The above field coding assumes you're using a system with English-language regional settings. For use with other language regional settings you'll need to replace the commas with semi-colons and the semi-colons with commas.
For a macro to convert the above into a working field, see Convert Text Representations of Fields to Working Fields in the Mailmerge Tips and Tricks page at: https://www.msofficeforums.com/mail-merge/21803-mailmerge-tips-tricks.html

Optimize single word base form extraction (lemmatization) in spacy

I am looking to reduce a word to its base form without using contextual information. I tried out spaCy, but that requires running the full nlp pipeline just to get the base form of a single word, which increases execution time.
I have gone through this post, where disabling the parser and NER pipeline components speeds up execution to some extent, but I just want a way to directly look up a word and its lemma in a table (basically the base form of a word without considering contextual information):
my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
    doc = nlp(my_word, disable=['parser', 'ner'])
    for w in doc:
        print("my_word {}, base_form {}".format(w, w.lemma_))
desired output
my_word doing, base_form do
my_word done, base_form do
my_word did, base_form do
my_word do, base_form do
Note: I also tried spacy.lemmatizer, but it does not give the expected results and requires pos as an additional argument.
If you just want lemmas from a lookup table, you can install the lookup tables and initialize a very basic pipeline that only includes a tokenizer. If the lookup tables are installed, token.lemma_ will look up the form in the table.
Install the lookup tables (which are otherwise only saved in the provided models and not included in the main spacy package to save space):
pip install spacy[lookups]
Tokenize and lemmatize:
import spacy
nlp = spacy.blank("en")
assert nlp("doing")[0].lemma_ == "do"
assert nlp("done")[0].lemma_ == "do"
spaCy's lookup tables are available in this repository:
https://github.com/explosion/spacy-lookups-data
There you can read the documentation and check the examples that might help you.

How to include comments inside text to be processed by spaCy

I'm using spaCy v2 with the French module fr_core_news_sm
Unfortunately this model produces many parsing errors so I would like to preprocess the text in order to optimize the output.
Here is an example: the interjection/adverb carrément is analyzed as the third person plural of the (imaginary) verb carrémer. I don't mind the wrong POS tag as such, but it does spoil the dependency parse. I would therefore like to replace carrément with some other adverb (like souvent) or interjection that I know spaCy will parse correctly.
For that I need to be able to add a comment saying that a replacement has taken place, something like souvent /*orig=carrément*/, so that souvent will be parsed by spaCy but NOT /*orig=carrément*/, which should have no effect on the dependency parse.
Is this possible?
Is there some other way to tell spaCy "carrément is NOT a verb but an interjection, please treat it as such", without retraining the model?
(I know this is possible in TreeTagger, where you can add a configuration file with POS tags for any word you want… but of course TreeTagger is not a dependency parser.)
You can use custom extensions to save this kind of information on each token:
https://spacy.io/usage/processing-pipelines#custom-components-attributes
from spacy.tokens import Token
Token.set_extension("orig", default="")
doc[1]._.orig = "carrément"
Here's an example from the Token docs:
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = nlp("I have an apple")
assert doc[3]._.is_fruit
(The tagger and the dependency parser are completely independent, so changing the POS of a word won't affect how it gets parsed.)
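Putting this together with the replacement idea from the question, a rough sketch (assuming fr_core_news_sm is installed; the sentence is just a made-up placeholder) might be:
import spacy
from spacy.tokens import Token

nlp = spacy.load("fr_core_news_sm")
Token.set_extension("orig", default="")

original = "Il a carrément refusé"
replaced = original.replace("carrément", "souvent")  # swap in a word the parser handles well
doc = nlp(replaced)
for token in doc:
    if token.text == "souvent":
        token._.orig = "carrément"  # remember what the source text really contained
print([(t.text, t._.orig, t.dep_) for t in doc])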

How to create spaCy doc given I have raw text and 'words' but no 'spaces' data

I want to create a spaCy Doc given that I have the raw text and the words, but no whitespace data.
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=words, spaces=spaces)
How do I do this correctly so that the whitespace information is not lost?
Example of the data I have:
data= {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
Based on our discussion in the comments, I would suggest doing either of the following:
Preferred route:
Substitute into the spaCy pipeline those elements you want to improve. If you don't trust the POS tagger for some reason, substitute in a custom parser that is more fit for purpose. Optionally, you can train the existing POS tagger model on your own annotated data using a tool like Prodigy.
Quick and dirty route:
Load the document as plain text into a spaCy doc
Loop over the tokens as spaCy parsed them and match them to your own list of tokens by checking whether all the characters match.
If you don't get matches, handle the exceptions as input for a better tokenizer / check why your tokenizer is doing things differently
If you do get a match, load your additional information as extension attributes (https://spacy.io/usage/processing-pipelines#custom-components-attributes)
Use these extra attributes in further loops to check whether they match the spaCy parser's output, and build the eventual training dataset.
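For the original question of building the Doc without a spaces list, a minimal sketch that recovers the spaces flags from the raw text (assuming every word appears in the text in order) could be:
import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")
data = {'text': 'This is just a test sample.',
        'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}

# Walk through the raw text and record whether each word is followed by a space
spaces = []
offset = 0
for word in data['words']:
    offset = data['text'].index(word, offset) + len(word)
    spaces.append(offset < len(data['text']) and data['text'][offset] == ' ')

doc = Doc(nlp.vocab, words=data['words'], spaces=spaces)
assert doc.text == data['text']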

Apache OpenNLP: How do I implement a dictionary based entity recognition?

I have already downloaded the jar files to eclipse.
http://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/index.html
How do I do the following:
1.) Be able to add my own names and tags.
2.) Be able to get the names and tags that were in the dictionary.
3.) Configure between case sensitive and insensitive.
For example, let's say, I add the name "Mike Smith" with name tag "Author".
If I have text that has that name, it should be able to recognize that its there along with the tag.
Please give actual Java code!
I have asked a very similar question here:
Is it possible to conduct 'Context Analysis' for precise entity extraction with OpenNLP?
The general consensus is that it's a two-step process: first identify whether your sentence contains an Author, then find the name.
I too would like to do it in one step (where the analysis of the corpus uses the words themselves to determine the context of the name).