Prodigy - Using pattern file with model - spacy

Can you add a pattern file to a model?
matcher = Matcher(nlp_lg.vocab)
pattern = [{"LOWER": "tumulus"}]
matcher.add("thing", [pattern])
MyText = df.loc[52]["TEXT"]
doc = nlp_lg(MyText)
spacy.displacy.render(doc, style='ent')
It seems to make no difference and doesn't tag 'tumulus'.
"(Name: SS 26271656 ORG ) Woolley Barrows PERSON ( NR ORG ). (SS 26191653 CARDINAL ) Tumulus (NR)."

A Matcher object has no special association with the pipeline when you create it; it's just an object that exists. That is why it doesn't modify the pipeline output at all.
It sounds like what you want is to add an EntityRuler, a component that wraps a Matcher, and have it overwrite entities. See the rule-based matching docs for an example of how to use the EntityRuler. It looks roughly like this:
ruler = nlp.add_pipe("entity_ruler", config={"overwrite_ents": True})
patterns = [{"label": "ORG", "pattern": [{"LOWER": "tumulus"}]}]
ruler.add_patterns(patterns)
Note nlp.add_pipe, which is key because it actually adds the component to the pipeline.
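To make the difference concrete, here is a minimal sketch that contrasts a standalone Matcher with an EntityRuler in the pipeline. It assumes spaCy v3 and uses a blank English pipeline so that no model download is needed; the MONUMENT label and the shortened sample text are made up for illustration.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")  # blank pipeline: tokenizer only, no statistical NER
text = "(SS 26191653) Tumulus (NR)."

# A standalone Matcher only reports matches when you call it yourself;
# it never touches doc.ents.
matcher = Matcher(nlp.vocab)
matcher.add("thing", [[{"LOWER": "tumulus"}]])
doc = nlp(text)
matched = [doc[start:end].text for _, start, end in matcher(doc)]
ents_before = [(ent.text, ent.label_) for ent in doc.ents]

# An EntityRuler added with nlp.add_pipe runs as part of the pipeline
# and writes its matches to doc.ents. "MONUMENT" is a hypothetical label.
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([{"label": "MONUMENT", "pattern": [{"LOWER": "tumulus"}]}])
ents_after = [(ent.text, ent.label_) for ent in nlp(text).ents]

print(matched, ents_before, ents_after)
```

With a trained pipeline such as the one in the question, the same add_pipe call applies; whether the ruler should run before or after the ner component depends on which one you want to take precedence.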


Remove whitespace from spacy doc.ents

I am trying to run a spaCy model for NER. I have a Doc object, and doc.ents shows the output below:
(august 3, 2021, book building offer, bse, nse)
All the entities have whitespace. Because of this I am receiving the error below:
ValueError: [E024] Could not find an optimal move to supervise the parser. Usually, this means that the model can't be updated in a way that's valid and satisfies the correct annotations specified in the GoldParse. For example, are all labels added to the model? If you're training a named entity recognizer, also make sure that none of your annotated entity spans have leading or trailing whitespace or punctuation. You can also use the `debug data` command to validate your JSON-formatted training data
Can anyone suggest how to remove this whitespace?
The problem was in the rule-based labeling: the entity spans had leading or trailing whitespace.
To solve this, you can use the function below:
import re

def trim_entity_spans(data: list) -> list:
    """Removes leading and trailing white spaces from entity spans.

    Args:
        data (list): The data to be cleaned in spaCy JSON format.

    Returns:
        list: The cleaned data.
    """
    invalid_span_tokens = re.compile(r'\s')

    cleaned_data = []
    for text, annotations in data:
        entities = annotations['entities']
        valid_entities = []
        for start, end, label in entities:
            valid_start = start
            valid_end = end
            # Move the start forward past any leading whitespace
            while valid_start < len(text) and invalid_span_tokens.match(
                    text[valid_start]):
                valid_start += 1
            # Move the end backward past any trailing whitespace
            while valid_end > 1 and invalid_span_tokens.match(
                    text[valid_end - 1]):
                valid_end -= 1
            valid_entities.append([valid_start, valid_end, label])
        cleaned_data.append([text, {'entities': valid_entities}])
    return cleaned_data
If you have documents labeled in the new .spacy format, you can read them with the helper function below:
import skweak.utils
new_docs = []
for doc in skweak.utils.docbin_reader("./ipo_v3.spacy", spacy_model_name='en_core_web_trf'):
    new_docs.append(doc)
Once you have all the docs, you can convert them into the JSON format:
doc_examples = []
for doc in new_docs:
    spans_ex = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
    doc_examples.append([doc.text, {'entities': spans_ex}])
Then you can simply call:
doc_examples = trim_entity_spans(doc_examples)
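To sanity-check the idea on a single span before cleaning a whole dataset, a small standalone helper can be useful. The sketch below is not part of spaCy; trim_span is a hypothetical name, and it strips whitespace only (not punctuation), computing the new character offsets from the lengths removed by str.lstrip and str.rstrip.

```python
def trim_span(text, start, end):
    """Return (start, end) offsets with surrounding whitespace removed."""
    span = text[start:end]
    if not span.strip():
        # The span is empty or whitespace only: collapse it to zero length.
        return start, start
    new_start = start + (len(span) - len(span.lstrip()))
    new_end = end - (len(span) - len(span.rstrip()))
    return new_start, new_end


text = "listed on bse and nse"
start, end = trim_span(text, 9, 14)  # text[9:14] is " bse "
print(repr(text[start:end]))
```

A span that is already clean comes back unchanged, so the helper is safe to apply to every annotation.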

Can I do any analysis on spacy display using NER?

When rendering this display with spaCy NER, can you add the found entities (in this case, from any tweets with GPE or LOC) to a new dataframe, or do any further analysis on them? I thought that once I got them into a list I could possibly use geopy to visualize them. Any thoughts?
colors = {'LOC': 'linear-gradient(90deg, #aa9cde, #dc9ce7)', 'GPE': 'radial-gradient(white, blue)'}
options = {'ents': ['LOC', 'GPE'], 'colors': colors}
spacy.displacy.render(doc, style='ent', jupyter=True, options=options)
The entities are accessible on the doc object. To get all the entities in the doc object, simply use doc.ents (a tuple of spans). For example:
import spacy
content = "Narendra Modi is the Prime Minister of India"
nlp = spacy.load('en_core_web_md')
doc = nlp(content)
print(doc.ents)
should output:
(Narendra Modi, India)
Say you want the text (or mention) of each entity and its label (PERSON, GPE, LOC, NORP, etc.); you can get them as follows:
print([(ent, ent.label_) for ent in doc.ents])
should output:
[(Narendra Modi, 'PERSON'), (India, 'GPE')]
You should be able to use them in other places as you see fit.
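For the dataframe part of the question, one possible approach is to flatten the entities into plain rows first. The sketch below is only an illustration: entity_rows and fake_doc are hypothetical helpers, and the stand-in objects merely mimic the .ents, .text and .label_ attributes of spaCy objects so that the snippet runs without a model. In practice you would pass the docs from nlp.pipe(tweets) instead.

```python
from collections import Counter
from types import SimpleNamespace


def entity_rows(docs, labels=("GPE", "LOC")):
    """Return one dict per entity whose label is in `labels`."""
    rows = []
    for doc_id, doc in enumerate(docs):
        for ent in doc.ents:
            if ent.label_ in labels:
                rows.append({"doc_id": doc_id, "text": ent.text, "label": ent.label_})
    return rows


def fake_doc(*ents):
    # Stand-in for a spaCy Doc: only the attributes used above are provided.
    return SimpleNamespace(
        ents=[SimpleNamespace(text=text, label_=label) for text, label in ents]
    )


docs = [
    fake_doc(("Narendra Modi", "PERSON"), ("India", "GPE")),
    fake_doc(("India", "GPE"), ("the Alps", "LOC")),
]
rows = entity_rows(docs)
counts = Counter(row["text"] for row in rows)
print(rows)
print(counts.most_common())
```

A list of dicts like rows can be passed directly to pandas.DataFrame(rows), and the distinct place names in counts are what you would then hand to a geocoder such as geopy.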

Make spacy nlp.pipe process tuples of text and additional information to add as document features?

Apparently, for doc in nlp.pipe(sequence) is much faster than running for el in sequence: doc = nlp(el).
The problem I have is that my sequence is really a sequence of tuples, which contain the text for spacy to convert into a document, but also additional information which I would like to get into the spacy document as document attributes (which I would register for Doc).
I am not sure how I can modify a spacy pipeline so that the first stage really picks one item from the tuple to run the tokeniser on and get the document, and then have some other function use the remaining items from the tuple to add the features to the existing document.
It sounds like you might be looking for the as_tuples argument of nlp.pipe? If you set as_tuples=True, you can pass in a stream of (text, context) tuples and spaCy will yield (doc, context) tuples (instead of just Doc objects). You can then use the context and add it to custom attributes etc.
Here's an example:
data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"})
]
for doc, context in nlp.pipe(data, as_tuples=True):
    # Let's assume you have a "meta" extension registered on the Doc
    doc._.meta = context["meta"]
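The loop above assumes the "meta" extension already exists. A self-contained sketch that also registers it might look like the following; it assumes spaCy v3 and uses a blank English pipeline (tokenizer only) in place of whichever model you actually load.

```python
import spacy
from spacy.tokens import Doc

# Register the custom attribute once; the has_extension guard avoids an
# error if this code runs twice in the same session.
if not Doc.has_extension("meta"):
    Doc.set_extension("meta", default=None)

nlp = spacy.blank("en")  # assumption: stands in for your real pipeline

data = [
    ("Some text to process", {"meta": "foo"}),
    ("And more text...", {"meta": "bar"}),
]

docs = []
for doc, context in nlp.pipe(data, as_tuples=True):
    # Copy the per-text context onto the Doc as a custom attribute
    doc._.meta = context["meta"]
    docs.append(doc)

metas = [(doc.text, doc._.meta) for doc in docs]
print(metas)
```

Because nlp.pipe yields the (doc, context) pairs in input order, each context ends up on the Doc created from its own text.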
A bit late, but in case someone comes looking for this in 2022:
There is no official/documented way to access the context (the second element of each tuple) for the Doc object from within a pipeline component. However, the context does get written to an internal doc._context attribute, so we can use this internal attribute to access the context from within our pipelines.
For example:
import spacy
from spacy.language import Language

nlp = spacy.load("en_core_web_sm")

data = [
    ("stackoverflow is great", {"attr1": "foo", "attr2": "bar"}),
    ("The sun is shining today", {"location": "Hawaii"})
]

# Set up your custom pipeline. You can access the doc's context from
# within your pipeline, such as {"attr1": "foo", "attr2": "bar"}
@Language.component("my_pipeline")
def my_pipeline(doc):
    print(doc._context)
    return doc

# Add the pipeline
nlp.add_pipe("my_pipeline")

# Process the data and do something with the doc and/or context
for doc, context in nlp.pipe(data, as_tuples=True):
    print(doc.text, context)
If you are interested in the source code, see the nlp.pipe method and the internal nlp._ensure_doc_with_context method in spaCy's language.py.

How to use a pattern to match element names in compact relaxng

I need to validate some XML from an external source in which a stuff element contains child elements whose names all start with id-.
I tried the following schema, but it is not valid:
datatypes xs = ""
start = stuff
stuff = element stuff {
    element id-* { text }*
}
Ideally I would like a regex match on the id tag names
To my knowledge it's not possible to define patterns in RELAX NG for element names. See also RelaxNG enumerated element names and relax-ng compact: attribute whose name matches a reg-ex for similar questions.
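RELAX NG name classes do allow wildcards such as any name or any name in a given namespace, but not regular expressions, so one possible workaround is to accept arbitrary child elements in the schema and check the names in a separate step. The sketch below does that second step in Python with the standard library; the sample document, the id-\w+ rule and the function name invalid_children are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical naming rule: "id-" followed by one or more word characters.
ID_NAME = re.compile(r"id-\w+")


def invalid_children(xml_text, parent_tag="stuff"):
    """Return the tags that break the naming rule (empty list if none)."""
    root = ET.fromstring(xml_text)
    if root.tag != parent_tag:
        return [root.tag]
    return [child.tag for child in root if not ID_NAME.fullmatch(child.tag)]


# Made-up sample document, not the XML from the question.
sample = "<stuff><id-1>a</id-1><id-abc>b</id-abc><other>c</other></stuff>"
bad = invalid_children(sample)
print(bad)
```

This only checks the direct children of the root element; nested structures or namespaced names would need a more careful walk.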

Cypher: Scope of Match Statements in which Variables are Valid

I think I have a general problem understanding the structure of matches and the scope in which the variables of a match live.
The specific piece of code I have the problem with is this:
// S sentiment toward A goodFor/badFor T
// => S sentiment toward the idea of A goodFor/badFor T
MATCH (S:A)-[:SOURCE]->(sent1:PS {type:"sentiment"})-[:TARGET]->(gfbf:E {type:"gfbf"}) , (A)-[:SOURCE]->(gfbf)-[:TARGET]->(T) , (Writer:A {type:"writer"})
// if there is some negative belief in any of the writers private state spaces that involve gfbf then inference is blocked
WHERE NOT (Writer)-[*1..]->({type:"believesTrue" , spec:FALSE})-[*1..]->(gfbf)
// if sent1 is in some private state spaces of the writer return all of these
OPTIONAL MATCH p=(Writer)-[*]->(sent1)
WITH NODES(p)[1..-1] AS ps_nodes
WHERE ALL(x IN ps_nodes[1..] WHERE LABELS(x) = "PS")
MERGE (S)-[:SOURCE]->(sent2:PS {type:"sentiment" , spec:(sent1.spec)})-[:TARGET]->(ideaOf:I {name:"ideaOf" , type:"ideaOf"})-[:TARGET]->(gfbf)
CASE sent2.spec
I think it's not necessary to understand what this is for; it suffices to see the structure. Basically, what it does is this: it looks for a subgraph where there is a path S-->sent1-->gfbf and also a path A-->gfbf-->T. If it finds that, it makes a new path S-->sent2-->ideaOf-->gfbf, all the while setting the properties of the new nodes depending on the properties of the nodes from the match. Furthermore, it checks whether there is also a path writer-->...-->sent1 where all nodes in the ... part have the label PS. If it finds that path, it returns it for further operations in a different part of the program.
The error I am getting is this:
py2neo.cypher.error.statement.InvalidSyntax: sent1 not defined (line 6, column 58 (offset: 421))
"MERGE (S)-[:SOURCE]->(sent2:PS {type:"sentiment" , spec:(sent1.spec)})-[:TARGET]->(ideaOf:I {name:"ideaOf" , type:"ideaOf"})-[:TARGET]->(gfbf)"
Why is sent1 no longer defined where I use it and how would I need to restructure the code to make it valid?
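sent1 isn't in the prior WITH, so it is no longer in scope after it: WITH carries forward only the variables it lists. Add sent1, along with any other variables used later (here S and gfbf):
WITH NODES(p)[1..-1] AS ps_nodes, sent1, S, gfbf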