I'm using spaCy v2 with the French model fr_core_news_sm.
Unfortunately this model produces many parsing errors, so I would like to preprocess the text in order to improve the output.
Here is an example: the interjection/adverb carrément is analyzed as the third person plural of the (imaginary) verb carrémer. I don't mind the wrong POS tag so much, but it does spoil the dependency parse. Therefore I would like to replace carrément with some other adverb (like souvent) or interjection that I know spaCy will parse correctly.
For that I need to be able to add a comment saying that a replacement has taken place, something like souvent /*orig=carrément*/, so that souvent is parsed by spaCy but NOT /*orig=carrément*/, which should have no effect on the dependency parse.
Is this possible?
Is there some other way to tell spaCy “carrément is NOT a verb but an interjection, please treat it as such”, without retraining the model?
(I know this is possible in TreeTagger, where you can add a configuration file with POS tags for any word you want… but of course TreeTagger is not a dependency parser.)
You can use custom extensions to save this kind of information on each token:
https://spacy.io/usage/processing-pipelines#custom-components-attributes
from spacy.tokens import Token
# Register the attribute; it becomes available as token._.orig
Token.set_extension("orig", default="")
# doc is assumed to be the output of nlp(...) on the preprocessed text,
# with "souvent" substituted at position 1
doc[1]._.orig = "carrément"
Here's an example from the Token docs:
import spacy
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = nlp("I have an apple")
assert doc[3]._.is_fruit
(The tagger and the dependency parser are completely independent, so changing the POS of a word won't affect how it gets parsed.)
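Putting the two together, here is a minimal preprocessing sketch (parse_with_substitution is a hypothetical helper; souvent and fr_core_news_sm are taken from the question):

import spacy
from spacy.tokens import Token

Token.set_extension("orig", default="", force=True)
nlp = spacy.load("fr_core_news_sm")

def parse_with_substitution(text, replacements):
    # Replace problematic words with stand-ins spaCy is known to parse well
    for orig, stand_in in replacements.items():
        text = text.replace(orig, stand_in)
    doc = nlp(text)
    # Record the original word on each substituted token
    inverse = {stand_in: orig for orig, stand_in in replacements.items()}
    for token in doc:
        if token.text in inverse:
            token._.orig = inverse[token.text]
    return doc

doc = parse_with_substitution("Il est carrément fou.", {"carrément": "souvent"})
print([(t.text, t._.orig, t.dep_) for t in doc])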
I am looking to reduce a word to its base form without using contextual information. I tried out spaCy, but that requires running nlp just to get the base form of a single word, which comes with an increase in execution time.
I have gone through this post, where disabling the parser and NER pipeline components speeds up execution to some extent, but I just want a way to look up a word's lemma directly in a database (basically the base form of a word, without considering contextual information).
my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
    doc = nlp(my_word, disable=['parser', 'ner'])
    for w in doc:
        print("my_word {}, base_form {}".format(w, w.lemma_))
Desired output:
my_word doing, base_form do
my_word done, base_form do
my_word did, base_form do
my_word do, base_form do
Note: I also tried out spacy.lemmatizer, but that is not giving the expected results and requires pos as an additional argument.
If you just want lemmas from a lookup table, you can install the lookup tables and initialize a very basic pipeline that only includes a tokenizer. If the lookup tables are installed, token.lemma_ will look up the form in the table.
Install the lookup tables (which are otherwise only saved in the provided models and not included in the main spacy package to save space):
pip install spacy[lookups]
Tokenize and lemmatize:
import spacy
nlp = spacy.blank("en")
assert nlp("doing")[0].lemma_ == "do"
assert nlp("done")[0].lemma_ == "do"
spaCy's lookup tables are available in this repository:
https://github.com/explosion/spacy-lookups-data
There you can read the documentation and check the examples, which might help you.
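Applied to the list from the question (still assuming the lookup tables are installed), a small sketch would be:

import spacy

nlp = spacy.blank("en")
for my_word in ["doing", "done", "did", "do"]:
    doc = nlp(my_word)
    print("my_word {}, base_form {}".format(my_word, doc[0].lemma_))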
I want to create a spaCy Doc given raw text and words, but the whitespace data is missing.
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=words, spaces=spaces)
How do I do this correctly so that the whitespace information is not lost?
Example of the data I have:
data= {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
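For reference, a minimal sketch that recovers the missing spaces flags from the raw text (assuming every word occurs in the text in order, as in the example above):

import spacy
from spacy.tokens import Doc

data = {'text': 'This is just a test sample.',
        'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}

nlp = spacy.blank("en")
spaces = []
pos = 0
for word in data['words']:
    pos = data['text'].index(word, pos) + len(word)
    # A token is followed by a space if the next character in the raw text is one
    spaces.append(pos < len(data['text']) and data['text'][pos] == ' ')

doc = Doc(nlp.vocab, words=data['words'], spaces=spaces)
assert doc.text == data['text']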
Based on our discussion in the comments, I would suggest doing either of the following:
Preferred route:
Substitute in the spaCy pipeline those elements you want to improve. If you don't trust the POS tagger for a reason, substitute in a custom parser that is more fit for purpose. Optionally, you can train the existing POS tagger model with your own annotated data using a tool like Prodigy.
Quick and dirty route:
Load the document as plain text into a spaCy Doc.
Loop over the tokens as spaCy parsed them and match them to your own list of tokens by checking whether all the characters match.
If you don't get matches, handle the exceptions as input for a better tokenizer / check why your tokenizer is doing things differently.
If you do get a match, load your additional information as Extension Attributes (https://spacy.io/usage/processing-pipelines#custom-components-attributes), as sketched below.
Use these extra attributes in further loops to check whether they match the spaCy parse, and output the eventual training dataset.
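As a rough illustration of the quick and dirty route (the attribute name my_info and the annotations are hypothetical):

import spacy
from spacy.tokens import Token

Token.set_extension("my_info", default=None, force=True)

nlp = spacy.load("en_core_web_sm")
my_words = ['This', 'is', 'just', 'a', 'test', 'sample', '.']   # your own tokens
my_annotations = [None, None, None, None, {'keyword': True}, None, None]

doc = nlp("This is just a test sample.")
for token, word, info in zip(doc, my_words, my_annotations):
    if token.text == word:            # all characters match
        token._.my_info = info
    else:
        # tokenization differs here; treat it as input for a better tokenizer
        print("mismatch:", token.text, "vs", word)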
I would like to build a keyword list from tokens, with a lookup back to the sentence each token came from. Thanks.
You can get the sentences from token.doc.sents and then find the one that contains your token. You can make this more convenient by adding an extension attribute to Token like this:
import spacy
from spacy.tokens import Token

def get_sentence(token):
    # Return the sentence whose span contains the token
    for sent in token.doc.sents:
        if sent.start <= token.i < sent.end:
            return sent

# Add a computed property, which will be accessible as token._.sent
Token.set_extension('sent', getter=get_sentence)

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Sentence one. Sentence two.')

print(list(doc.sents))
print(doc[0]._.sent)
print(doc[-1]._.sent)
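Building on that, a small sketch (the keyword filter here is just illustrative) that maps each keyword to the text of the sentence it came from:

keywords = {}
for token in doc:
    if token.is_alpha and not token.is_stop:
        # Keep the first sentence each keyword was seen in
        keywords.setdefault(token.text.lower(), token._.sent.text)
print(keywords)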
I am writing a multi-language website.
I read the language info from the user's cookies, and I have several translation modules such as en.go, gr.go, etc.
The modules are of type map[string]string. The problem is that in JavaScript I can do something like lang[cookies.lang]["whatever message"], but Go does not support accessing struct members this way.
I could write a switch case or a map[string]map[string]string and map all possible languages, but that is a lot of extra work.
So does Go provide some way to access members like the JS bracket notation?
Note: There was a similar question on Stack Overflow, and somebody suggested using the "reflect" package, but I could not quite understand how it works and failed to reproduce it by myself.
One possible route would be to use a map[string]map[string]string.
You could then have a base package in which you declare your base translation variable, and in your translation modules you can use an init function to populate the relevant sub-map. Whether you do this as separate packages or just separate files is essentially up to you (doing it as packages gives you better compile-time control over which languages to include; doing it as files is probably less confusing).
If you go the packages route, I suggest the following structure:
translation/base This is where you export from
translation/<language> These are "import only" packages
Then, in translation/base:
package "base"
var Lang map[string]map[string]string
And in each language-specific package:
package "<language code>"
import "language/base"
var code = "<langcode>"
func init() {
d := map[string]string{}
d[<phrase1>] = "your translation here"
d[<phrase2>] = "another translation here"
// Do this for all the translations
base.Lang[code] = d
}
Then you can use this from your main program:
package "..."
import (
"language/base"
_ "language/lang1" // We are only interested in side-effects
_ "language/lang2" // Same here...
)
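Continuing the main program above, lookups then mirror the JavaScript bracket notation from the question (the helper messageFor is just an illustrative name):

func messageFor(langCode, phrase string) string {
    // Indexing a missing language yields a nil inner map, so the
    // comma-ok lookup is safe and we can fall back to the phrase itself.
    if msg, ok := base.Lang[langCode][phrase]; ok {
        return msg
    }
    return phrase
}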
Not using separate packages is (almost) the same. You simply ensure that all the files are in the same package and you can skip the package prefix for the Lang variable.
A toy example on the Go Playground, with all the "fiddly" bits inlined.
How do I add new lemmas to spaCy? For instance, new singular/plural nouns.
Example:
Kirana = Singular
Kiranas = Plural
I want to add it to spaCy so that when a sentence contains "Kiranas", Kirana will show up as its lemma.
Just add the word "kirana" to the _nouns.py file, which is located inside the spaCy installation folder (spacy/en/lemmatizer/_nouns.py). When you add the word to the file, the text format inside the file should not be changed. Since the lemmatization rules are already defined, this should work as you intended.
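To check whether the new entry is picked up, a quick verification sketch (assuming the English model en_core_web_sm is installed) could be:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We visited two Kiranas yesterday.")
# After the change you would expect "Kiranas" to map to "kirana"
print([(token.text, token.lemma_) for token in doc])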