How to set the sentiment attribute on a Span? - spacy

I'm trying the Keras example from the spaCy documentation, but instead of summing the sentiment score on the Doc like this:
for sent, label in zip(sentences, ys):
    sent.doc.sentiment += label - 0.5
I would like to keep the score at the sentence level, like this:
for sent, label in zip(sentences, ys):
    sent.sentiment = float(label)
This code gives me the following error:
AttributeError: attribute 'sentiment' of 'spacy.tokens.span.Span' objects is not writable
Is there a setter to call instead? I tried set_sentiment without success.
Am I missing something? Is it a bug?

You can find the implementation of Span.sentiment here. You can see it is indeed not writable, because it either looks up the value in self.doc.user_span_hooks, or takes the average of token.sentiment for the tokens in that span.
[EDITED BELOW]
The sentiment of a Token is not context-dependent though. It uses the information present in the underlying Lexeme. That means that any word, such as "love", would have the same sentiment value in any sentence/context.
So there are two things you can do. Either write to the sentiment of the lexemes, like so:
vocab["love"].sentiment = 3.0
Or implement a custom hook that allows you to define any function you want. You can do this on the span (doc.user_span_hooks) or token (doc.user_token_hooks) level:
doc.user_span_hooks["sentiment"] = lambda span: 10.0
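As a minimal sketch of the hook approach applied to the loop in the question, assuming spaCy v3 and a blank English pipeline with the rule-based sentencizer: the per-sentence scores are kept in a dictionary keyed by token offsets, and the hook looks them up. The `ys` values, the `scores` dictionary and the 0.0 fallback are made up for illustration.

```python
import spacy

# Blank pipeline with a rule-based sentence splitter (no trained model needed).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

doc = nlp("I love this. I hate that.")
sentences = list(doc.sents)
ys = [0.9, 0.1]  # hypothetical per-sentence predictions from a classifier

# Span.sentiment is read-only, so store the scores separately,
# keyed by each sentence's (start, end) token offsets.
scores = {(sent.start, sent.end): float(label) for sent, label in zip(sentences, ys)}

# The hook is called whenever span.sentiment is read on a span of this doc.
# Spans that are not known sentences fall back to 0.0 (an arbitrary choice).
doc.user_span_hooks["sentiment"] = lambda span: scores.get((span.start, span.end), 0.0)

sentence_scores = [sent.sentiment for sent in doc.sents]
print(sentence_scores)
```

The hook lives on the Doc, so it has to be set again for every new Doc, for example inside a small pipeline component.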

Related

How can I get test_value in PyMC(PyMC4)?

I am a newbie in Bayesian and probabilistic inference, so sorry for this basic question. I have recently been following some examples in Bayesian Methods, and they require me to use "tag.test_value". However, I am using PyMC rather than PyMC3, and that expression raises an error. I also tried alternatives such as init_value and initial_value, but they do not work either.
Could you kindly let me know an alternative to that expression for checking the initial value in PyMC (what was originally the test value in PyMC3)?
a = pm.Uniform("b", 0, 50)
print(a.tag.test_value)
AttributeError: 'ValidatingScratchpad' object has no attribute 'test_value'
It appears that Aesara does not compute test values by default. You need to set aesara.config.compute_test_value = "warn". Then you can call a.get_test_value(). Hope this helps!

Force 'parser' to not segment sentences?

Is there an easy way to tell the "parser" pipe not to change the value of Token.is_sent_start ?
So, here is the story:
I am working with documents that are pre-sentencized (1 line = 1 sentence), this segmentation is all I need. I realized the parser's segmentation is not always the same as in my documents, so I don't want to rely on the segmentation made by it.
I can't change the segmentation after the parser has done it, so I cannot correct it when it makes mistakes (you get an error). And if I segment the text myself and then apply the parser, it overrules the segmentation I've just made, so it doesn't work.
So, to keep the original segmentation and still use a pretrained transformer model (fr_dep_news_trf), I either:
1) disable the parser,
2) add a custom Pipe to nlp to set Token.is_sent_start how I want,
3) create the Doc with nlp("an example"),
or I simply create a Doc with
doc = Doc(nlp.vocab, words=["an", "example"], sent_starts=[True, False])
and then I apply every element of the pipeline except the parser.
However, I still need the parser at some point, because I need to know some subtrees. If I simply apply it to my Doc, it overrules the segmentation already in place, so in some cases the segmentation is incorrect. So I use the following workaround:
1) Keep the correct segmentation in a list: sentences = list(doc.sents)
2) Apply the parser on the doc
3) Work with whatever syntactic information the parser computed
4) Retrieve whatever sentence-level information I need from the list I previously made, as I now cannot trust Token.is_sent_start.
It works, but it doesn't really feel right; it feels a bit messy. Is there an easier, cleaner way that I missed?
Something else I am considering is setting a custom extension, so that I would, for instance, use Token._.is_sent_start instead of the default Token.is_sent_start, and a custom Doc._.sents, but I fear it might be more confusing than helpful ...
Some user suggested using span.merge() for a pretty similar topic, but the function doesn't seem to exist in recent releases of spaCy (Preventing spaCy splitting paragraph numbers into sentences)
The parser is supposed to respect sentence boundaries if they are set in advance. There is one outstanding bug where this doesn't happen, but that was only in the case where some tokens had their sentence boundaries left unset.
If you set all the token boundaries to True or False (not None) and then run the parser, does it overwrite your values? If so it'd be great to have a specific example of that, because that sounds like a bug.
Given that, if you use a custom component to set your true sentence boundaries before the parser, it should work.
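A minimal sketch of such a component, assuming spaCy v3 and that each line of the input is one sentence. The component name `newline_sentences` is hypothetical, and the sketch relies on the tokenizer emitting the newline as its own whitespace token. It sets every token to True or False so that no boundary is left as None.

```python
import spacy
from spacy.language import Language


@Language.component("newline_sentences")  # hypothetical component name
def newline_sentences(doc):
    """Mark a sentence start at the first token and after every newline token."""
    starts_sentence = True
    for token in doc:
        # Set every token explicitly to True or False so none is left as None.
        token.is_sent_start = starts_sentence
        starts_sentence = "\n" in token.text
    return doc


# A blank pipeline keeps the sketch runnable without downloading a model.
# With a pretrained pipeline the component would instead be added with
# nlp.add_pipe("newline_sentences", before="parser").
nlp = spacy.blank("en")
nlp.add_pipe("newline_sentences")

doc = nlp("This is the first line.\nThis is the second line.")
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```

The newline token itself ends up at the end of the preceding sentence, which is why the texts are stripped before printing.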
Regarding some of your other points...
I don't think it makes any sense to keep your sentence boundaries separate from the parser's - if you do that you can end up with subtrees that span multiple sentences, which will just be weird and unhelpful.
You didn't mention this in your question, but is treating each sentence/line as a separate doc an option? (It's not clear if you're combining multiple lines and the sentence boundaries are wrong, or if you're passing in a single line but it's turning into multiple sentences.)

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
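A rough sketch of such a component, written against the spaCy v3 API (the question uses v2.3, where a component is a plain function passed to nlp.add_pipe). The regular expression, the component name `preset_propn_tags` and the example text are all made up for illustration.

```python
import re

import spacy
from spacy.language import Language

# Hypothetical pattern for the "unanalyzable" tokens: two capitals plus digits.
SKIPPED_PATTERN = re.compile(r"^[A-Z]{2}\d+$")


@Language.component("preset_propn_tags")  # hypothetical component name
def preset_propn_tags(doc):
    """Pre-set TAG and POS on tokens whose text matches the pattern."""
    for token in doc:
        if SKIPPED_PATTERN.match(token.text):
            token.tag_ = "NNP"    # fine-grained tag used by the English models
            token.pos_ = "PROPN"  # coarse-grained universal POS
    return doc


# A blank pipeline keeps the sketch runnable without a trained model.
# With a trained pipeline the component would be placed before the tagger:
# nlp.add_pipe("preset_propn_tags", before="tagger")
nlp = spacy.blank("en")
nlp.add_pipe("preset_propn_tags")

doc = nlp("Please ship XJ9000 today")
tagged = [(token.text, token.tag_, token.pos_) for token in doc]
print(tagged)
```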
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
If you're using a model with vectors like en_core_web_lg, you can also set the vectors for the skipped token to be the same as a similar word (check that the similar word has a vector first). This is how to tell the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible if you choose those well, you might get pretty close to the results you want.

How to serialize data in example-in-example format for tensorflow-ranking?

I'm building a ranking model with tensorflow-ranking. I'm trying to serialize a data set in the TFRecord format and read it back at training time.
The tutorial doesn't show how to do this. There is some documentation here on an example-in-example data format but it's hard for me to understand: I'm not sure what the serialized_context or serialized_examples fields are or how they fit into examples and I'm not sure what the Serialize() function in the code block is.
Concretely, how can I write and read data in example-in-example format?
The context is a map from feature name to tf.train.Feature. The examples list is a list of maps from feature name to tf.train.Feature. Once you have these, the following code will create an "example-in-example":
context = {...}
examples = [{...}, {...}, ...]
serialized_context = tf.train.Example(features=tf.train.Features(feature=context)).SerializeToString()
serialized_examples = tf.train.BytesList()
for example in examples:
    tf_example = tf.train.Example(features=tf.train.Features(feature=example))
    serialized_examples.value.append(tf_example.SerializeToString())
example_in_example = tf.train.Example(features=tf.train.Features(feature={
    'serialized_context': tf.train.Feature(bytes_list=tf.train.BytesList(value=[serialized_context])),
    'serialized_examples': tf.train.Feature(bytes_list=serialized_examples)
}))
To read the examples back, you may call
tfr.data.parse_from_example_in_example(
    example_pb,
    context_feature_spec=context_feature_spec,
    example_feature_spec=example_feature_spec)
where context_feature_spec and example_feature_spec are maps from feature name to tf.io.FixedLenFeature or tf.io.VarLenFeature.
First of all, I recommend reading this article to ensure that you know how to create a tf.Example as well as tf.SequenceExample (which by the way, is the other data format supported by TF-Ranking):
Tensorflow Records? What they are and how to use them
In the second part of this article, you will see that a tf.SequenceExample has two components: 1) Context and 2) Sequence (or examples). Example-in-Example implements the same idea. The context is the set of features that are independent of the items you want to rank (a search query in the case of search, or user features in the case of a recommendation system), and the sequence part is a list of items (aka examples). This could be a list of documents (in search) or movies (in recommendation).
Once you are comfortable with tf.Example, Example-in-Example will be easier to understand. Take a look at this piece of code for how to create an EIE instance:
https://www.gitmemory.com/issue/tensorflow/ranking/95/518480361
1) bundle context features together in a tf.Example object and serialize it
2) bundle sequence(example) features (each of which could contain a list of values) in another tf.Example object and serialize this one too.
3) wrap these inside a parent tf.Example
4) (if you're writing to tfrecords) serialize the parent tf.Example object and write to your tfrecord file.
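The four steps above can be sketched end to end as follows. This is only an illustration: the feature names (`query_length`, `bm25`, `relevance`), their values, the helper functions and the temporary file path are all made up, and the read-back uses plain protobuf parsing rather than TF-Ranking so that the sketch only depends on TensorFlow.

```python
import os
import tempfile

import tensorflow as tf


def float_feature(values):
    """Wrap a list of floats in a tf.train.Feature."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=values))


def bytes_feature(values):
    """Wrap a list of byte strings in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=values))


def to_example(features):
    """Build a tf.train.Example from a feature-name -> Feature map."""
    return tf.train.Example(features=tf.train.Features(feature=features))


# Hypothetical data: one query (context) with two candidate documents.
context = {"query_length": float_feature([3.0])}
examples = [
    {"bm25": float_feature([1.2]), "relevance": float_feature([1.0])},
    {"bm25": float_feature([0.4]), "relevance": float_feature([0.0])},
]

# Steps 1-3: serialize the context and each item, then wrap them in a parent Example.
example_in_example = to_example({
    "serialized_context": bytes_feature([to_example(context).SerializeToString()]),
    "serialized_examples": bytes_feature(
        [to_example(e).SerializeToString() for e in examples]),
})

# Step 4: serialize the parent Example and write it to a TFRecord file.
path = os.path.join(tempfile.mkdtemp(), "train.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(example_in_example.SerializeToString())

# Read the record back and unpack the nested Examples for inspection.
record = next(iter(tf.data.TFRecordDataset(path))).numpy()
outer = tf.train.Example.FromString(record)
inner = [
    tf.train.Example.FromString(raw)
    for raw in outer.features.feature["serialized_examples"].bytes_list.value
]
print(len(inner), "items for this query")
```

Each query becomes one record in the file, so a full data set is written by repeating steps 1 to 4 inside the `TFRecordWriter` block.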

How to correctly pass initial value of transition_params in tensorflow linear chain CRF

I'm trying to use the linear chain CRF in my work. I took the help of the example usage code provided in -- https://github.com/tensorflow/tensorflow/tree/r1.0/tensorflow/contrib/crf
My question is how to supply some initial value of "transition_params" in "crf_log_likelihood()". For concreteness of the example, say, I want to initialize it with standard random normal distribution. In the api doc, I saw that "transition_params" can, in fact, be passed as an input argument. Inside the method I see that if no "transition_params" is passed, it is obtained by doing a "vs.get_variable()" with name = "transitions".
So should I do something similar to this, before creating the 'crf_log_likelihood' op? Something like -- transition_params = vs.get_variable("transitions", [num_tags, num_tags], initializer=tf.random_normal_initializer()) -- and then change the call of "crf_log_likelihood()" to "log_likelihood, transition_params = tf.contrib.crf.crf_log_likelihood(unary_scores, y_t, sequence_lengths_t, transition_params)"?
The get_variable() inside the definition of crf_log_likelihood() will create a fresh, randomly-initialized variable to represent the transition parameters, if you don't provide one yourself. You only need to provide an explicit transition_params if you don't want the default behavior.
To understand the behavior of get_variable(), see here:
https://www.tensorflow.org/api_docs/python/state_ops/sharing_variables#get_variable
Hope that helps!