Spacy: How to specify/enforce a token's POS to be treated as PROPN in Spacy POS? - spacy

I'm analyzing a dataset which has a particular brand name. Instead of training POS from scratch, is there a way where we can supply the POS values of this brand name as PROPN? So that its treated as PROPN always.

Not exactly. You can post-process your doc to correct the POS for individual tokens to PROPN, but the POS annotation for the surrounding tokens won't be affected. The tagger doesn't support providing partial annotation as a starting point, so there's no way to influence the tagger by providing it in advance.
To correct the POS for individual tokens, you can use the attribute_ruler component or you can write your own small custom component.
Note that the parser and NER components don't use the POS tags as features, so their analysis will not change even if you modify the POS.

Related

Force 'parser' to not segment sentences?

Is there an easy way to tell the "parser" pipe not to change the value of Token.is_sent_start ?
So, here is the story:
I am working with documents that are pre-sentencized (1 line = 1 sentence), this segmentation is all I need. I realized the parser's segmentation is not always the same as in my documents, so I don't want to rely on the segmentation made by it.
I can't change the segmentation after the parser has done it, so I cannot correct it when it makes mistakes (you get an error). And if I segment the text myself and then apply the parser, it overrules the segmentation I've just made, so it doesn't work.
So, to force keeping the original segmentation and still use a pretrained transformer model (fr_dep_news_trf), I either :
disable the parser,
add a custom Pipe to nlp to set Token.is_sent_start how I want,
create the Doc with nlp("an example")
or, I simply create a Doc with
doc = Doc(words=["an", "example"], sent_starts=[True, False])
and then I apply every element of the pipeline except the parser.
However, if I still do need the parser at some point (which I do, because I need to know some subtrees), If I simply apply it on my Doc, it overrules the segmentation already in place, so, in some cases, the segmentation is incorrect. So I do the following workaround:
Keep the correct segmentation in a list sentences = list(doc.sents)
Apply the parser on the doc
Work with whatever syntactic information the parser computed
Retrieve whatever sentencial information I need from the list I previously made, as I now cannot trust Token.is_sent_start.
It works, but it doesn't really feel right imho, it feels a bit messy. Is there an easier, cleaner way I missed ?
Something else I am considering is setting a custom extension, so that I would, for instance, use Token._.is_sent_start instead of the default Token.is_sent_start, and a custom Doc._.sents, but I fear it might be more confusing than helpful ...
Some user suggested using span.merge() for a pretty similar topic, but the function doesn't seem to exist in recent releases of spaCy (Preventing spaCy splitting paragraph numbers into sentences)
The parser is supposed to respect sentence boundaries if they are set in advance. There is one outstanding bug where this doesn't happen, but that was only in the case where some tokens had their sentence boundaries left unset.
If you set all the token boundaries to True or False (not None) and then run the parser, does it overwrite your values? If so it'd be great to have a specific example of that, because that sounds like a bug.
Given that, if you use a custom component to set your true sentence boundaries before the parser, it should work.
Regarding some of your other points...
I don't think it makes any sense to keep your sentence boundaries separate from the parser's - if you do that you can end up with subtrees that span multiple sentences, which will just be weird and unhelpful.
You didn't mention this in your question, but is treating each sentence/line as a separate doc an option? (It's not clear if you're combining multiple lines and the sentence boundaries are wrong, or if you're passing in a single line but it's turning into multiple sentences.)

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like to remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
If you using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as a similar word (check that the similar word has a vector first). This is how to tell the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible if you choose those well, you might get pretty close to the results you want.

Spacy tokenizer to handle final period in sentence

I'm using Spacy to tokenize sentences, and I know that the text I pass to the tokenizer will always be a single sentence.
In my tokenization rules, I would like non-final periods (".") to be attached to the text before it so I updated the suffix rules to remove the rules that split on periods (this gets abbreviations correctly).
The exception, however, is that the very last period should be split into a separate token.
I see that the latest version of Spacy allows you to split tokens after the fact, but I'd prefer to do this within the Tokenizer itself so that other pipeline components are processing the correct tokenization.
Here is one solution that uses some post processing after the tokenizer:
I added "." to suffixes so that a period is always split into its own token.
I then used a regex to find non-final periods, generated a span with doc.char_span, and merged the span to a single token with span.merge.
Would be nice to be able to do this within the tokenizer if anyone knows how to do that.

Spacy tokenizer add exception for n't

I want to convert n't to not using this code:
doc = nlp(u"this. isn't ad-versere")
special_case = [{ORTH: u"not"}]
nlp.tokenizer.add_special_case(u"n't",specia_case)
print [text.orth_ for text in doc]
But I get the output as:
[u'this', u'.', u'is', u"n't", u'ad', u'-', u'versere']
n't is still n't
How to solve the problem?
The reason your logic doesn't work is because spaCy uses non-destructive tokenization. This means that it'll always keep a reference to the original input text, and you'll never lose any information.
The tokenizer exceptions and special cases let you define rules for how to split a string of text into a sequence of tokens – but they won't let you modify the original string. The ORTH values of the tokens plus whitespace always needs to match the original text. So the tokenizer can split "isn't" into ["is", "n't"], but not into ["is", "not"].
To define a "normalised" form of the string, spaCy uses the NORM attribute, available as token.norm_. You can see this in the source of the tokenizer exceptions here – the norm of the token "n't" is "not". The NORM attribute is also used as a feature in the model, to ensure that tokens with the same norm receive similar representations (even if one is more frequent in the training data than the other).
So if you're interested in the normalised form, you can simply use the norm_ attribute instead:
>>> [t.norm_ for t in doc]
['this', '.', 'is', 'not', 'ad', '-', 'versere']

How does spaCy tokenizer splits sentences?

I am finding the tokenization code quite complicated and I still couldn't find where in the code the sentences are split.
For example, how does the tokenizer know that
Mr. Smitt stayed at home. He was tired
should not be split in "Mr." and should be split before "He".? And where in the code does the split before "He" happens?
(In fact, I am unsure actually unsure if I am looking at the right place: if I search for sents in tokenizer.pyx I don't find any occurrence)
You access the splits via the doc object, with the generator:
doc.sents
The output of the generator is a series of spans.
As for how the splits are chosen, the document is parsed for dependency relationships. Understanding the parser is not trivial - you'll have to read into it if you want to understand it - it's using a neural network to inform the decision about how to construct the dependency trees; but the splits are those gaps between tokens which are not crossed by dependencies. This is not simply where you find a full-stop, and the method is more robust as a result.