How to make spaCy case insensitive

How can I make spaCy case insensitive when finding entity names?
Is there a code snippet I should add? The questions could mention entities that are not capitalized.
def analyseQuestion(question):
    doc = nlp(question)
    entity = doc.ents
    return entity
print(analyseQuestion("what is the best seller of Nicholas Sparks "))
print(analyseQuestion("what is the best seller of nicholas sparks "))
which gives
(Nicholas Sparks,)
()

This is old, but hopefully this will help anyone looking at similar problems.
You can use a truecaser to improve your results.
https://pypi.org/project/truecase/
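To illustrate the idea behind truecasing, here is a minimal sketch that restores the casing of known names before the text reaches the NER model. It is not the truecase package's API: it assumes a small hand-made lookup table, and KNOWN_NAMES and restore_case are hypothetical names.

```python
import re

# Hypothetical lookup table; a real truecaser learns casing statistically.
KNOWN_NAMES = {"nicholas sparks": "Nicholas Sparks"}

def restore_case(text, known_names=KNOWN_NAMES):
    """Replace case-insensitive occurrences of known names with their cased form."""
    for lowered, cased in known_names.items():
        pattern = re.compile(r"\b" + re.escape(lowered) + r"\b", re.IGNORECASE)
        # A function replacement avoids any backslash interpretation in `cased`.
        text = pattern.sub(lambda match, cased=cased: cased, text)
    return text

question = restore_case("what is the best seller of nicholas sparks ")
print(question)
```

The restored string could then be passed to nlp() as in the question.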

It is very easy. You just need to add a preprocessing step of question.lower() to your function:
def analyseQuestion(question):
    # Preprocess the question to make further analysis case-insensitive
    question = question.lower()
    doc = nlp(question)
    entity = doc.ents
    return entity
This solution was inspired by code from the Rasa NLU library. However, for non-English (non-ASCII) text it might not work in Python 2. In that case you can try:
question = question.decode('utf8').lower().encode('utf8')
However, the NER module in spaCy depends to some extent on the case of the tokens, and because it is a statistically trained model you might see some discrepancies.
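In Python 3, strings are already Unicode, so the decode/encode round trip above should not be needed (str has no decode method there). A small sketch of lowercasing non-ASCII text directly; the sample sentence is made up, and casefold() is shown as a more aggressive variant intended for caseless comparison:

```python
# Python 3: str is Unicode, so lower() handles non-ASCII letters directly.
question = "Quel est le best-seller de l'ÉCOLE Polytechnique ?"
lowered = question.lower()
print(lowered)

# casefold() goes further than lower(), e.g. it maps the German sharp s to "ss".
same = "Straße".casefold() == "STRASSE".casefold()
print(same)
```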


Feature and FeatureView versioning

My team is interested in a feature store solution that enables rapid experimentation with features, probably using feature versioning. In the Feast Slack history, I found @Benjamin Tan’s post that explains their Feast workflow, including FeatureView versioning:
insights_v1 = FeatureView(
    features=[
        Feature(name="insight_type", dtype=ValueType.STRING)
    ]
)
insights_v2 = FeatureView(
    features=[
        Feature(name="customer_id", dtype=ValueType.STRING),
        Feature(name="insight_type", dtype=ValueType.STRING)
    ]
)
Is this the recommended best practice for FeatureView versioning? It looks like Features do not have a version field. Is there a recommended strategy for Feature versioning?
Creating a new column for each Feature version is one approach:
driver_rating_v1
driver_rating_v2
But that could get unwieldy if we want to experiment with dozens of permutations of the same Feature.
Featureform appears to have support for feature versions through the "variant" field, but their documentation is a bit unclear.
Adding additional clarity on Featureform: Variant is analogous to version. You'd supply a string which then becomes an immutable identifier for the version of the transformation, source, etc. Variant is one of the common metadata fields provided in the Featureform API.
Using an e-commerce dataset and Spark, here's an example of using the variant field to version a source (a Parquet file in this case):
orders = spark.register_parquet_file(
    name="orders",
    variant="default",
    description="This is the core dataset. From each order you might find all other information.",
    file_path="path_to_file",
)
You can set the variant variable ahead of time:
VERSION = "v1"  # You can change this to rerun the definitions with new variants
orders = spark.register_parquet_file(
    name="orders",
    variant=f"{VERSION}",
    description="This is the core dataset. From each order you might find all other information.",
    file_path="path_to_file",
)
And you can create versions or variants of the transformations -- here I'm taking a dataframe called total_paid_per_customer_per_day and aggregating it.
# Get average order value per day
@spark.df_transformation(inputs=[("total_paid_per_customer_per_day", "default")], variant="skeller88_20220110")
def average_daily_transaction(df):
    from pyspark.sql.functions import mean
    return df.groupBy("day_date").agg(mean("total_customer_order_paid").alias("average_order_value"))
There are some more details on the Featureform CLI here: https://docs.featureform.com/getting-started/interact-with-the-cli

SpaCy: Set entity information for a token which is included in more than one span

I am trying to use SpaCy for entity context recognition in the world of ontologies. I'm a novice at using SpaCy and just playing around for starters.
I am using the ENVO Ontology as my 'patterns' list for creating a dictionary for entity recognition. In simple terms the data is an ID (CURIE) and the name of the entity it corresponds to along with its category.
Screenshot of my sample data:
The following is the workflow of my initial code:
Creating patterns and terms
# Set terms and patterns
terms = {}
patterns = []
for curie, name, category in envoTerms.to_records(index=False):
    if name is not None:
        terms[name.lower()] = {'id': curie, 'category': category}
        patterns.append(nlp(name))
Set up a custom pipeline
@Language.component('envo_extractor')
def envo_extractor(doc):
    matches = matcher(doc)
    spans = [Span(doc, start, end, label='ENVO') for matchId, start, end in matches]
    doc.ents = spans
    for i, span in enumerate(spans):
        span._.set("has_envo_ids", True)
        for token in span:
            token._.set("is_envo_term", True)
            token._.set("envo_id", terms[span.text.lower()]["id"])
            token._.set("category", terms[span.text.lower()]["category"])
    return doc
# Getter function for doc level
def has_envo_ids(self, tokens):
    return any([t._.get("is_envo_term") for t in tokens])
##EDIT: #################################################################
def resolve_substrings(matcher, doc, i, matches):
    # Get the current match and create tuple of entity label, start and end.
    # Append entity to the doc's entity. (Don't overwrite doc.ents!)
    match_id, start, end = matches[i]
    entity = Span(doc, start, end, label="ENVO")
    doc.ents += (entity,)
    print(entity.text)
#########################################################################
Implement the custom pipeline
nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
#### EDIT: Added 'on_match' rule ################################
matcher.add("ENVO", None, *patterns, on_match=resolve_substrings)
nlp.add_pipe('envo_extractor', after='ner')
and the pipeline looks like this
[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x7fac00c03bd0>),
('tagger', <spacy.pipeline.tagger.Tagger at 0x7fac0303fcc0>),
('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x7fac02fe7460>),
('ner', <spacy.pipeline.ner.EntityRecognizer at 0x7fac02f234c0>),
('envo_extractor', <function __main__.envo_extractor(doc)>),
('attribute_ruler',
<spacy.pipeline.attributeruler.AttributeRuler at 0x7fac0304a940>),
('lemmatizer',
<spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x7fac03068c40>)]
Set extensions
# Set extensions to tokens, spans and docs
Token.set_extension('is_envo_term', default=False, force=True)
Token.set_extension("envo_id", default=False, force=True)
Token.set_extension("category", default=False, force=True)
Doc.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Doc.set_extension("envo_ids", default=[], force=True)
Span.set_extension("has_envo_ids", getter=has_envo_ids, force=True)
Now when I run the pipeline on the text 'tissue culture', it throws an error:
nlp('tissue culture')
ValueError: [E1010] Unable to set entity information for token 0 which is included in more than one span in entities, blocked, missing or outside.
I know why the error occurred. It is because there are 2 entries for the 'tissue culture' phrase in the ENVO database as shown below:
Ideally I'd expect the appropriate CURIE to be tagged depending on the phrase that was present in the text. How do I address this error?
My SpaCy Info:
============================== Info about spaCy ==============================
spaCy version 3.0.5
Location *irrelevant*
Platform macOS-10.15.7-x86_64-i386-64bit
Python version 3.9.2
Pipelines en_core_web_sm (3.0.0)
It might be a little late, but to complement Sofie VL's answer, and for anyone who is still interested, here is what I (another spaCy newbie) did to get rid of overlapping spans:
import spacy
from spacy.util import filter_spans
# [Code to obtain 'entity']...
# 'entity' should be a list of Span objects, e.g. spans covering
# "Carolina" and "North Carolina"
pat_orig = len(entity)
filtered = filter_spans(entity)  # THIS DOES THE TRICK
pat_filt = len(filtered)
doc.ents = filtered
print("\nCONVERSION REPORT:")
print("Original number of patterns:", pat_orig)
print("Number of patterns after overlapping removal:", pat_filt)
It is important to mention that I am using the most recent version of spaCy at this date, v3.1.1. Additionally, this will only work if you do not mind overlapping spans being removed; if you do, you might want to look at the related thread on that. More information is available in the documentation for 'filter_spans'.
Best regards.
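For intuition, the selection rule that filter_spans is documented to apply (prefer the longest span, the earliest one among equals, and drop anything overlapping a span already kept) can be sketched on plain (start, end) token offsets. This is an illustrative re-implementation under that assumption, not spaCy's code, and filter_overlaps is a hypothetical name.

```python
def filter_overlaps(spans):
    """Keep the longest non-overlapping (start, end) spans; end is exclusive."""
    # Longest first; among equal lengths, the span that starts earlier wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Skip the span if any of its tokens is already covered.
        if seen_tokens.isdisjoint(range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    return sorted(kept)

# Tokens 3-5 ("North Carolina") beat tokens 4-5 ("Carolina");
# the duplicated (0, 2) span collapses to a single entry.
result = filter_overlaps([(4, 5), (3, 5), (0, 2), (0, 2)])
print(result)
```

Note that two spans with identical offsets, as in the 'tissue culture' case, would collapse to one, so which of the two database entries survives depends on their order in the input.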
Since spaCy v3, you can use doc.spans to store entities that may be overlapping. This functionality is not supported by doc.ents.
So you have two options:
1. Implement an on_match callback that will filter out the results of the matcher before you use the result to set doc.ents. From a quick glance at your code (and the later edits), I don't think resolve_substrings is actually resolving conflicts? Ideally, the on_match function should check whether there are conflicts with existing ents, and decide which of them to keep.
2. Use doc.spans instead of doc.ents if that works for your use case.

Many inputs to one output, access wildcards in input files

Apologies if this is a straightforward question; I couldn't find anything in the docs.
Currently my workflow looks something like this. I'm taking a number of input files created as part of this workflow and summarizing them.
Is there a way to avoid the manual regex step that parses the wildcards from the filenames?
I thought about an "expand" of cross_ids and config["chromosomes"], but I'm unsure how to guarantee a consistent order.
rule report:
    output:
        table="output/mendel_errors.txt"
    input:
        files=expand("output/{chrom}/{cross}.in", chrom=config["chromosomes"], cross=cross_ids)
    params:
        req="h_vmem=4G",
    run:
        df = pd.DataFrame(index=range(len(input.files)), columns=["stat", "chrom", "cross"])
        for i, fn in enumerate(input.files):
            # open fn / make calculations etc // stat =
            # manual regex of filename to get chrom cross // chrom, cross =
            df.loc[i] = stat, chrom, cross
This seems a bit awkward when this information must be in the environment somewhere.
(via Johannes Köster on the google group)
To answer your question:
expand uses itertools.product from the standard library. Hence, you could write
from itertools import product
product(config["chromosomes"], cross_ids)
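Building on that, the wildcard values can be paired with the input files without any regex. A minimal sketch, assuming expand enumerates combinations in the same order as product when the keyword arguments are given in the same order; the chromosome and cross values below are made up for illustration.

```python
from itertools import product

chromosomes = ["2L", "2R"]       # stands in for config["chromosomes"]
cross_ids = ["c1", "c2", "c3"]   # hypothetical cross identifiers

# Same combinations, in the same order, that the file list is built from.
pairs = list(product(chromosomes, cross_ids))
files = ["output/{}/{}.in".format(chrom, cross) for chrom, cross in pairs]

# Each file name lines up with the (chrom, cross) pair that produced it.
records = [(fn, chrom, cross) for fn, (chrom, cross) in zip(files, pairs)]
for record in records:
    print(record)
```

Inside the run block, the same zip over input.files and product(config["chromosomes"], cross_ids) would give chrom and cross for each file, provided the ordering assumption holds.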

Sentence segmentation and dependency parser

I’m pretty new to Python (using Python 3) and spaCy (and programming too). Please bear with me.
I have three questions, two of which are more or less the same; I just can’t get it to work.
I took the “syntax specific search with spacy” example and tried to make different things work.
My program currently reads a txt file, and the normal extraction
if w.lower_ != 'music':
    return False
works.
My first question is: How can I get spacy to extract two words?
For example: “classical music”
With the previously mentioned snippet I can make it extract either classical or music. But if I only search for one of the words, I also get results I don’t want, such as:
Classical – period / era
Or when I look for only music
Music – baroque, modern
The second question is: How can I get the dependencies to work?
The example dependency with:
elif w.dep_ != 'nsubj': # Is it the subject of a verb?
    return False
works fine. But everything else I tried does not really work.
For example, I want to extract sentences with the word “birthday” and the dependency 'DATE' (so the dependency is an entity).
I got
if d.ent_type_ != 'DATE':
    return False
to work.
So now it would look like:
def extract_information(w, d):
    if w.lower_ != 'birthday':
        return False
    elif d.ent_type_ != 'DATE':
        return False
    else:
        return True
Does something like this even work?
If it works, the third question would be how I can filter sentences, for example by DATE: if a sentence contains a certain word and a DATE, exclude it.
One last thing: I read somewhere that the dependencies are based on the “Stanford typed dependencies manual”. Is there a list of which of those dependencies work with spaCy?
Thank you for your patience and help :)
Before I get into offering some simple suggestions to your questions, have you tried using displaCy's visualiser on some of your sentences?
Using an example sentence 'John's birthday was yesterday', you'll find that within the parsed sentence, birthday and yesterday are not necessarily direct dependencies of one another. So searching for the word birthday having a dependency on a DATE type entity might not yield the best results.
Onto the first question:
A brute force method would be to look for matching subsequent words after you have parsed the sentence.
doc = nlp('Mary enjoys classical music.')
for i, token in enumerate(doc):
    if token.lower_ == 'classical' and i != len(doc) - 1:
        if doc[i+1].lower_ == 'music':
            print('Target Acquired!')
If you're unsure of what enumerate does, look it up. It's the pythonic way of using python.
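The same adjacent-word check can also be written by pairing each token with its successor. A sketch on plain strings, assuming simple whitespace tokenisation rather than spaCy's tokenizer; has_bigram is a hypothetical helper.

```python
def has_bigram(text, first, second):
    """Return True if `first` is immediately followed by `second`, ignoring case."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    # zip pairs every token with the one right after it.
    return any(a == first and b == second for a, b in zip(tokens, tokens[1:]))

found = has_bigram("Mary enjoys classical music.", "classical", "music")
print(found)
```

With a spaCy doc, the equivalent pairing should be zip(doc, doc[1:]), comparing token.lower_ on each side.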
To questions 2 and 3, one simple (but not elegant) way of solving this is to just identify in a parsed sentence if the word 'birthday' exists and if it contains an entity of type 'DATE'.
doc = nlp("John's birthday was yesterday.")
for token in doc:
    if token.lower_ == 'birthday':
        for entities in doc.ents:
            if entities.label_ == 'DATE':
                print('Found ya!')
As for the list of dependencies, I presume you're referring to the Part-Of-Speech tags. Check out the documentation on this page.
Good luck! Hope that helped.

How to use org.openimaj.ml.gmm to construct speaker models.

I would like to know how I can get a GMM speaker model using the OpenIMAJ library, specifically org.openimaj.ml.gmm.GaussianMixtureModelEM. I have tried the following:
GaussianMixtureModelEM gmm = new GaussianMixtureModelEM(
    DEFAULT_NUMBER_COMPONENTS, GaussianMixtureModelEM.CovarianceType.Diagonal);
MixtureOfGaussians mixture = gmm.estimate(data);
boolean converged = gmm.hasConverged();
hasConverged() returns true, so the GaussianMixtureModelEM has converged, but I am lost as to where to go from here. Any help or guidance would be appreciated.
Given your comment, then mixture.estimateLogProbability(point) should do what you want (see http://www.openimaj.org/apidocs/org/openimaj/math/statistics/distribution/MixtureOfGaussians.html#estimateLogProbability(double[])).
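For context on what such a log probability represents, here is a minimal Python sketch of the log-density of a diagonal-covariance Gaussian mixture at one point. It is not OpenIMAJ's implementation; the function name and the weights/means/variances layout are assumptions made for illustration. For speaker modelling, one common approach is to sum this value over all feature vectors of an utterance and compare the totals across speaker models.

```python
import math

def gmm_log_probability(point, weights, means, variances):
    """Log-density of a diagonal-covariance Gaussian mixture at `point`."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        # Log-density of one diagonal Gaussian: sum over independent dimensions.
        log_pdf = sum(
            -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
            for x, m, v in zip(point, mu, var)
        )
        component_logs.append(math.log(w) + log_pdf)
    # log-sum-exp keeps the computation stable for very small densities.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))

# One standard-normal component in one dimension, evaluated at its mean.
value = gmm_log_probability([0.0], [1.0], [[0.0]], [[1.0]])
print(value)
```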