How do I tag the gender of a noun with spaCy?

I would like to tag the gender of nouns using spaCy; specifically, in my case, German.
I am not sure which spaCy pipeline component has information about noun gender; for example, is it the Tagger or the Lemmatizer?

Different languages have different grammatical features, so you can look at the model page for a specific language to determine which pipeline components it has.
For German, we can see under “Label Scheme” that the “morphologizer” component produces labels that include gender.
Here, it shows that the morphologizer assigns the attribute “morph” to each Token.
“morph” is of type “MorphAnalysis”.
There are different ways to access the morphological annotation from a MorphAnalysis object.
The simplest is to use the “.get” method, passing the name of the feature you want (feature names like “Gender” are case-sensitive):
Token.morph.get("Gender")
which returns a list of strings, since a feature can have multiple values.
You can also return the MorphAnalysis as a dictionary with to_dict(), as a string with str(Token.morph), or iterate over Token.morph in a loop, which yields each feature-value pair as a string.
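For example, a minimal sketch with the small German pipeline (assuming de_core_news_sm is installed; any German model that includes a morphologizer should behave the same way):
import spacy

nlp = spacy.load("de_core_news_sm")  # German pipeline that includes a morphologizer
doc = nlp("Die Katze schläft auf dem Tisch.")

for token in doc:
    if token.pos_ == "NOUN":
        # MorphAnalysis.get returns a list of values, e.g. ["Fem"] or ["Masc"]
        print(token.text, token.morph.get("Gender"))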

Related

Automatically detect security identifier columns using Visions

I'm interested in using the Visions library to automate the process of identifying certain types of security (stock) identifiers. The documentation mentions that it could be used in such a way for ISBN codes but I'm looking for a more concrete example of how to do it. I think the process would be pretty much identical for the fields I'm thinking of as they all have check digits (ISIN, SEDOL, CUSIP).
My general idea is that I would create custom types for the different identifier types and could use those types to:
1. Take a dataframe where the types are unknown and identify columns matching the types (even if it's not a 100% match)
2. Validate the types on a dataframe where the intended type is known
Great question and use-case! Unfortunately, the documentation on making new types probably needs a little love right now, as there were API-breaking changes with the 0.7.0 release. Both the previous link and this post from August 2020 should cover the conceptual idea of type creation in greater detail. If any of those examples break, then mea culpa and our apologies; we switched to a dispatch-based implementation to support different backends (pandas, numpy, dask, spark, etc.) for each type. You shouldn't have to worry about that for now, but if you're interested you can find the default type definitions here with their backends here.
Building an ISBN Type
We need to make two basic decisions when defining a type:
1. What defines the type?
2. What other types is our new type related to?
For the ISBN use-case, O'Reilly provides a validation regex to match ISBN-10 and ISBN-13 codes. So,
What defines the type?
We want every element in the sequence to be a string which matches the corresponding ISBN-10 or ISBN-13 regex.
What other types is our new type related to?
Since ISBNs are themselves strings, we can use the default String type provided by visions.
Type Definition
from typing import Sequence
import pandas as pd
from visions.relations import IdentityRelation, TypeRelation
from visions.types.string import String
from visions.types.type import VisionsBaseType
isbn_regex = "^(?:ISBN(?:-1[03])?:? )?(?=[0-9X]{10}$|(?=(?:[0-9]+[- ]){3})[- 0-9X]{13}$|97[89][0-9]{10}$|(?=(?:[0-9]+[- ]){4})[- 0-9]{17}$)(?:97[89][- ]?)?[0-9]{1,5}[- ]?[0-9]+[- ]?[0-9]+[- ]?[0-9X]$"
class ISBN(VisionsBaseType):
    @staticmethod
    def get_relations() -> Sequence[TypeRelation]:
        relations = [
            IdentityRelation(String),
        ]
        return relations

    @staticmethod
    def contains_op(series: pd.Series, state: dict) -> bool:
        return series.str.contains(isbn_regex).all()
Looking at this closely, there are three things to take note of:
1. The new type inherits from VisionsBaseType.
2. We had to define a get_relations method, which is how we relate a new type to others we might want to use in a typeset. In this case, I've used an IdentityRelation to String, which means ISBNs are a subset of String. We can also use InferenceRelations when we want to support relations which change the underlying data (say, converting the string '4.2' to the float 4.2).
3. A contains_op, which is our definition of the type. In this case, we apply the regex to every element in the input and verify that it matches the pattern provided by O'Reilly.
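A quick way to sanity-check the new type is the membership test used throughout the visions docs (exact behaviour may vary slightly across 0.7.x releases):
import pandas as pd

# Uses the ISBN class defined above
books = pd.Series(["0-596-52068-9", "978-0-596-52068-7"])
not_books = pd.Series(["hello", "world"])

print(books in ISBN)      # expected: True, every element matches the regex
print(not_books in ISBN)  # expected: False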
Extensions
In theory, ISBNs can also be encoded in what looks like a 10- or 13-digit integer; to work with those you might want to create an InferenceRelation between Integer and ISBN. A simple implementation would involve coercing integers to strings and applying the above regex.
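As a rough illustration of that coercion idea (this sketches only the check-and-transform pair, not the visions relation-registration API, which changed in 0.7.0; the helper names are hypothetical):
import pandas as pd

# Hypothetical relationship test: only 10- or 13-digit integers are ISBN candidates
def integer_could_be_isbn(series: pd.Series) -> bool:
    return series.astype(str).str.match(r"^\d{10}$|^\d{13}$").all()

# Hypothetical transformer: coerce integers to strings so that the ISBN
# contains_op regex defined above can be applied to the result
def integer_to_isbn_string(series: pd.Series) -> pd.Series:
    return series.astype(str)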

How to assign lexical features to new unanalyzable tokens in spaCy?

I'm working with spaCy, version 2.3. I have a not-quite-regular-expression scanner which identifies spans of text which I don't want analyzed any further. I've added a pipe at the beginning of the pipeline, right after the tokenizer, which uses the document retokenizer to make these spans into single tokens. I'd like the remainder of the pipeline to treat these tokens as proper nouns. What's the right way to do this? I've set the POS and TAG attrs in my calls to retokenizer.merge(), and those settings persist in the resulting sentence parse, but the dependency information on these tokens makes me doubt that my settings have had the desired impact. Is there a way to update the vocabulary so that the POS tagger knows that the only POS option for these tokens is PROPN?
Thanks in advance.
The tagger and parser are independent (the parser doesn't use the tags as features), so modifying the tags isn't going to affect the dependency parse.
The tagger doesn't overwrite any existing tags, so if a tag is already set, it doesn't modify it. (The existing tags don't influence its predictions at all, though, so the surrounding words are tagged the same way they would be otherwise.)
Setting TAG and POS in the retokenizer is a good way to set those attributes. If you're not always retokenizing and you want to set the TAG and/or POS based on a regular expression for the token text, then the best way to do this is a custom pipeline component that you add before the tagger that sets tags for certain words.
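For example, a minimal sketch of such a component for spaCy v2.x (the regular expression and the chosen tag are placeholders for whatever your scanner matches):
import re
import spacy

nlp = spacy.load("en_core_web_sm")

SKIP_RE = re.compile(r"^[A-Z0-9_]{4,}$")  # hypothetical pattern for your special tokens

def preset_tags(doc):
    # Set TAG/POS before the tagger runs; the tagger won't overwrite them
    for token in doc:
        if SKIP_RE.match(token.text):
            token.tag_ = "NNP"
            token.pos_ = "PROPN"
    return doc

# spaCy v2.x allows plain functions as pipeline components
nlp.add_pipe(preset_tags, before="tagger")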
The transition-based parsing algorithm can't easily deal with partial dependencies in the input, so there isn't a straightforward solution here. I can think of a few things that might help:
The parser does respect pre-set sentence boundaries. If your skipped tokens are between sentences, you can set token.is_sent_start = True for that token and the following token so that the skipped token always ends up in its own sentence. If the skipped tokens are in the middle of a sentence or you want them to be analyzed as nouns in the sentence, then this won't help.
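A sketch of that for spaCy v2.x, assuming the skipped tokens can be recognized by their text (UNKNOWN_SKIPPED is a hypothetical placeholder):
import spacy

nlp = spacy.load("en_core_web_sm")

def isolate_skipped(doc):
    # Pre-set sentence boundaries so each skipped token ends up in its own sentence
    for i, token in enumerate(doc):
        if token.text == "UNKNOWN_SKIPPED":
            token.is_sent_start = True
            if i + 1 < len(doc):
                doc[i + 1].is_sent_start = True
    return doc

# Must run before the parser, which respects pre-set boundaries
nlp.add_pipe(isolate_skipped, before="parser")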
The parser does use the token.norm feature, so if you set the NORM feature in the retokenizer to something extremely PROPN-like, you might have a better chance of getting the intended analysis. For example, if you're using a provided English model like en_core_web_sm, use a word you think would be a frequent similar proper noun in American newspaper text from 20 years ago, so if the skipped token should be like a last name, use "Bush" or "Clinton". It won't guarantee a better parse, but it could help.
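For example, a sketch of setting NORM (together with TAG and POS) in the retokenizer; the sentence and the span picked out here are only illustrative stand-ins for whatever your scanner finds:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp.make_doc("The memo mentions Operation Silver Fox twice.")  # tokenize only

with doc.retokenize() as retokenizer:
    span = doc[3:6]  # "Operation Silver Fox", standing in for your scanner's span
    # NORM is used as a feature by the tagger and parser, so a PROPN-like norm
    # such as "Bush" can nudge the analysis toward a proper-noun reading
    retokenizer.merge(span, attrs={"NORM": "Bush", "TAG": "NNP", "POS": "PROPN"})

# Run the rest of the pipeline on the retokenized doc
for name, pipe in nlp.pipeline:
    doc = pipe(doc)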
If you're using a model with vectors like en_core_web_lg, you can also set the vector for the skipped token to be the same as that of a similar word (check that the similar word has a vector first). This tells the model to refer to the same row in the vector table for UNKNOWN_SKIPPED as for Bush.
The simpler option (that duplicates the vectors in the vector table internally):
nlp.vocab.set_vector("UNKNOWN_SKIPPED", nlp.vocab["Bush"].vector)
The less elegant version that doesn't duplicate vectors underneath:
nlp.vocab.vectors.add("UNKNOWN_SKIPPED", row=nlp.vocab["Bush"].rank)
nlp.vocab["UNKNOWN_SKIPPED"].rank = nlp.vocab["Bush"].rank
(The second line is only necessary to get this to work for a model that's currently loaded. If you save it as a custom model after the first line with nlp.to_disk() and reload it, then only the first line is necessary.)
If you just have a small set of skipped tokens, you could update the parser with some examples containing these tokens, but this can be tricky to do well without affecting the accuracy of the parser for other cases.
The NORM and vector modifications will also influence the tagger, so it's possible that if you choose those well, you'll get pretty close to the results you want.

How to convert plural nouns to singular using SpaCy?

I am using spaCy to lemmatize text, but in some special cases I need to keep the original text and just convert plural nouns to their singular forms.
Is there a way to tell spaCy to only convert plural nouns to singulars without lemmatizing the whole text (like removing ed, ing, etc.)? Or should I explicitly test each token to check if it is a plural noun and take its lemma?
P.S. Input text is dynamic, so I don't know in advance if the word is a noun or not
Thanks
Thanks to bivouac0's comment, I checked the tag_ field of each token and retrieved the lemma of tokens tagged as 'NNS' or 'NNPS':
processed_text = nlp(original_text)
lemma_tags = {"NNS", "NNPS"}
for token in processed_text:
    lemma = token.text
    if token.tag_ in lemma_tags:
        lemma = token.lemma_
    ...
    # rest of code
    ...
You cannot convert plural nouns to singular nouns using spaCy.
You can, however, check whether a token is a plural noun or a singular noun.
If the token's tag is 'NNS', look that token up in a dictionary and get the singular form of that token.
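A minimal sketch of that dictionary approach (the singular_forms mapping is just a stand-in for whatever lookup resource you use):
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical lookup table mapping plural surface forms to singular forms
singular_forms = {"children": "child", "mice": "mouse", "geese": "goose"}

doc = nlp("The children saw two mice and several geese.")
converted = []
for token in doc:
    if token.tag_ == "NNS" and token.text.lower() in singular_forms:
        converted.append(singular_forms[token.text.lower()])
    else:
        converted.append(token.text)
print(" ".join(converted))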

Is there a way to assign an internal string or identifier or tag to a matplotlib artist?

Sometimes it is useful to assign a 'tag', which can be a simple string, to a matplotlib artist in order to later find it easily.
If we imagine a scenario where, say, plt.Line2D had a property called tag which could be retrieved using plt.Line2D.get_tag(), it would be very easy to find it later in a complicated plot.
The only thing I can find that looks remotely similar is the group ID: for example line.set_gid() and line.get_gid(). I haven't found any good documentation on this. The only reference is this. Is this meant for such use as described above? Is it reserved for other operations in matplotlib?
This would be very useful for grouping different artists and then performing operations on them later, for example:
for line in ax.get_lines():
    if line.get_tag() == 'group A':
        line.set_color('red')
        # or whatever other operation
Does such a thing exist?
You can use the gid for such purposes. The only side-effect is that those names will appear in a saved svg file as the gid tag.
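For example, a small sketch using set_gid/get_gid:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
line_a, = ax.plot([0, 1], [0, 1])
line_a.set_gid("group A")
line_b, = ax.plot([0, 1], [1, 0])
line_b.set_gid("group B")

# Later: find and restyle every line tagged "group A"
for line in ax.get_lines():
    if line.get_gid() == "group A":
        line.set_color("red")
plt.show()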
Alternatively you can assign any attribute to a python object.
line, = plt.plot(...)
line.myid = "group A"
Just make sure not to use an existing attribute name in that case.
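To find those lines again later, getattr with a default avoids an AttributeError on artists that never received the attribute:
for line in ax.get_lines():
    if getattr(line, "myid", None) == "group A":
        line.set_color("red")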

Fuzziness in UIMA Ruta

Is there any option for fuzziness in word matching, or for ignoring some special characters?
For example:
STRINGLIST ANIMALLIST = {"LION","TIGER","MONKEY"};
DECLARE ANIMAL;
Document {-> MARKFAST(ANIMAL, ANIMALLIST, true)};
I need to match words against the list even when they are followed by a special character, like
Tiger- or MONKEY$
According to the documentation there are different evaluators; any idea how to use them?
Or can I use SCORE or MARKSCORE?
There are several aspects to consider here. In general, UIMA Ruta does not support fuzziness in the dictionary lookup. SCORE and MARKSCORE are language elements which can be utilized to introduce some heuristic scoring (not really fuzziness) in sequential rules. In the examples you gave in your question, you do not really need fuzzy matching.
The dictionary lookup in UIMA Ruta works on the RutaBasic annotations. These annotations are automatically created and maintained by UIMA Ruta itself (and should not be changed by other analysis engines or rules directly). The RutaBasic annotations represent the smallest fragments that annotations refer to. By default, the seeder of the RutaEngine creates annotations for words (W -> CW, SW, CAP) and many other tokens, like SPECIAL for - or $. This means that there is also a RutaBasic annotation for these special characters, and that the dictionary lookup can distinguish between these tokens. As a result, Tiger and Monkey should be annotated, and the example in your question should actually work (I tested it). You may need some postprocessing in order to include the SPECIAL in ANIMAL.
I have to mention that there is also the functionality to use an edit distance in the dictionary lookup (Multi Tree Word List, TRIE). However, this functionality has not been maintained for several years. It should also support different weights for specific replacements. I do not know if this counts as fuzziness.
DISCLAIMER: I am a developer of UIMA Ruta