Adding a Retokenize pipe while training NER model - spacy

I am currenly attempting to train a NER model centered around Property Descriptions. I could get a fully trained model to function to my liking however, I now want to add a retokenize pipe to the model so that I can set up the model to train other things.
From here, I am having issues getting the retokenize pipe to actually work. Here is the definition:
def retok(doc):
ents = [(ent.start, ent.end, ent.label) for ent in doc.ents]
with doc.retokenize() as retok:
string_store = doc.vocab.strings
for start, end, label in ents:
retok.merge(
doc[start: end],
attrs=intify_attrs({'ent_type':label},string_store))
return doc
i am adding it into my training like this:
nlp.add_pipe(retok, after="ner")
and I am adding it into the Language Factories like this:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
The issue I keep getting is "AttributeError: 'English' object has no attribute 'ents'". Now I am assuming I am getting this error because the parameter that is being passed through this function is not a doc but actually the NLP model itself. I am not really sure to get a doc to flow into this pipe during training. At this point I don't really know where to go from here to get the pipe to function the way I want.
Any help is appreciated, thanks.

You can potentially use the built-in merge_entities pipeline component: https://spacy.io/api/pipeline-functions#merge_entities
The example copied from the docs:
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David", "Bowie"]
merge_ents = nlp.create_pipe("merge_entities")
nlp.add_pipe(merge_ents)
texts = [t.text for t in nlp("I like David Bowie")]
assert texts == ["I", "like", "David Bowie"]
If you need to customize it further, the current implementation of merge_entities (v2.2) is a good starting point:
def merge_entities(doc):
"""Merge entities into a single token.
doc (Doc): The Doc object.
RETURNS (Doc): The Doc object with merged entities.
DOCS: https://spacy.io/api/pipeline-functions#merge_entities
"""
with doc.retokenize() as retokenizer:
for ent in doc.ents:
attrs = {"tag": ent.root.tag, "dep": ent.root.dep, "ent_type": ent.l
abel}
retokenizer.merge(ent, attrs=attrs)
return doc
P.S. You are passing nlp to retok() below, which is where the error is coming from:
Language.factories['retok'] = lambda nlp, **cfg: retok(nlp)
See a related question: Spacy - Save custom pipeline

Related

Vertex AI Model Batch prediction, issue with referencing existing model and input file on Cloud Storage

I'm struggling to correctly set Vertex AI pipeline which does the following:
read data from API and store to GCS and as as input for batch prediction.
get an existing model (Video classification on Vertex AI)
create Batch prediction job with input from point 1.
As it will be seen, I don't have much experience with Vertex Pipelines/Kubeflow thus I'm asking for help/advice, hope it's just some beginner mistake.
this is the gist of the code I'm using as pipeline
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl
from kfp.v2.dsl import component
from kfp.v2.dsl import (
Output,
Artifact,
Model,
)
PROJECT_ID = 'my-gcp-project'
BUCKET_NAME = "mybucket"
PIPELINE_ROOT = "{}/pipeline_root".format(BUCKET_NAME)
#component
def get_input_data() -> str:
# getting data from API, save to Cloud Storage
# return GS URI
gcs_batch_input_path = 'gs://somebucket/file'
return gcs_batch_input_path
#component(
base_image="python:3.9",
packages_to_install=['google-cloud-aiplatform==1.8.0']
)
def load_ml_model(project_id: str, model: Output[Artifact]):
"""Load existing Vertex model"""
import google.cloud.aiplatform as aip
model_id = '1234'
model = aip.Model(model_name=model_id, project=project_id, location='us-central1')
#dsl.pipeline(
name="batch-pipeline", pipeline_root=PIPELINE_ROOT,
)
def pipeline(gcp_project: str):
input_data = get_input_data()
ml_model = load_ml_model(gcp_project)
gcc_aip.ModelBatchPredictOp(
project=PROJECT_ID,
job_display_name=f'test-prediction',
model=ml_model.output,
gcs_source_uris=[input_data.output], # this doesn't work
# gcs_source_uris=['gs://mybucket/output/'], # hardcoded gs uri works
gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
if __name__ == '__main__':
from kfp.v2 import compiler
import google.cloud.aiplatform as aip
pipeline_export_filepath = 'test-pipeline.json'
compiler.Compiler().compile(pipeline_func=pipeline,
package_path=pipeline_export_filepath)
# pipeline_params = {
# 'gcp_project': PROJECT_ID,
# }
# job = aip.PipelineJob(
# display_name='test-pipeline',
# template_path=pipeline_export_filepath,
# pipeline_root=f'gs://{PIPELINE_ROOT}',
# project=PROJECT_ID,
# parameter_values=pipeline_params,
# )
# job.run()
When running the pipeline it throws this exception when running Batch prediction:
details = "List of found errors: 1.Field: batch_prediction_job.model; Message: Invalid Model resource name.
so I'm not sure what could be wrong. I tried to load model in the notebook (outside of component) and it correctly returns.
Second issue I'm having is referencing GCS URI as output from component to batch job input.
input_data = get_input_data2()
gcc_aip.ModelBatchPredictOp(
project=PROJECT_ID,
job_display_name=f'test-prediction',
model=ml_model.output,
gcs_source_uris=[input_data.output], # this doesn't work
# gcs_source_uris=['gs://mybucket/output/'], # hardcoded gs uri works
gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
During compilation, I get following exception TypeError: Object of type PipelineParam is not JSON serializable, though I think this could be issue of ModelBatchPredictOp component.
Again any help/advice appreciated, I'm dealing with this from yesterday, so maybe I missed something obvious.
libraries I'm using:
google-cloud-aiplatform==1.8.0
google-cloud-pipeline-components==0.2.0
kfp==1.8.10
kfp-pipeline-spec==0.1.13
kfp-server-api==1.7.1
UPDATE
After comments, some research and tuning, for referencing model this works:
#component
def load_ml_model(project_id: str, model: Output[Artifact]):
region = 'us-central1'
model_id = '1234'
model_uid = f'projects/{project_id}/locations/{region}/models/{model_id}'
model.uri = model_uid
model.metadata['resourceName'] = model_uid
and then I can use it as intended:
batch_predict_op = gcc_aip.ModelBatchPredictOp(
project=gcp_project,
job_display_name=f'batch-prediction-test',
model=ml_model.outputs['model'],
gcs_source_uris=[input_batch_gcs_path],
gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/test'
)
UPDATE 2
regarding GCS path, a workaround is to define path outside of the component and pass it as an input parameter, for example (abbreviated):
#dsl.pipeline(
name="my-pipeline",
pipeline_root=PIPELINE_ROOT,
)
def pipeline(
gcp_project: str,
region: str,
bucket: str
):
ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
gcs_prediction_input_path = f'gs://{BUCKET_NAME}/prediction_input/video_batch_prediction_input_{ts}.jsonl'
batch_input_data_op = get_input_data(gcs_prediction_input_path) # this loads input data to GCS path
batch_predict_op = gcc_aip.ModelBatchPredictOp(
project=gcp_project,
model=training_job_run_op.outputs["model"],
job_display_name='batch-prediction',
# gcs_source_uris=[batch_input_data_op.output],
gcs_source_uris=[gcs_prediction_input_path],
gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/',
).after(batch_input_data_op) # we need to add 'after' so it runs after input data is prepared since get_input_data doesn't returns anything
still not sure, why it doesn't work/compile when I return GCS path from get_input_data component
I'm glad you solved most of your main issues and found a workaround for model declaration.
For your input.output observation on gcs_source_uris, the reason behind it is because the way the function/class returns the value. If you dig inside the class/methods of google_cloud_pipeline_components you will find that it implements a structure that will allow you to use .outputs from the returned value of the function called.
If you go to the implementation of one of the components of the pipeline you will find that it returns an output array from convert_method_to_component function. So, in order to have that implemented in your custom class/function your function should return a value which can be called as an attribute. Below is a basic implementation of it.
class CustomClass():
def __init__(self):
self.return_val = {'path':'custompath','desc':'a desc'}
#property
def output(self):
return self.return_val
hello = CustomClass()
print(hello.output['path'])
If you want to dig more about it you can go to the following pages:
convert_method_to_component, which is the implementation of convert_method_to_component
Properties, basics of property in python.

Convert CONLL file to a list of Doc objects

Is there a way to convert CONLL file into list of Doc objects without having to parse the sentence using the nlp object. I have a list of annotations that I have to pass to the automatic component that uses Doc objects as input. I have found a way to create the doc:
doc = Doc(nlp.vocab, words=[...])
And that I can use the from_array function to recreate the other linguistic features. This array can be recreated by using index value from StringStore object, I have successfully created Doc object with LEMMA and TAG information but cannot recreate HEAD data. My question is how to pass HEAD data to Doc object using from_array method.
The confusing thing about the HEAD is that for sentence that has this structure:
Ona 2
je 2
otišla 2
u 4
školu 2
. 2
The output of this code snippet:
from spacy.attrs import TAG, HEAD, DEP
doc.to_array([TAG, HEAD, DEP])
is:
array([[10468770234730083819, 2, 429],
[ 5333907774816518795, 1, 405],
[11670076340363994323, 0, 8206900633647566924],
[ 6471273018469892813, 1, 8110129090154140942],
[ 7055653905424136462, 18446744073709551614, 435],
[ 7173976090571422945, 18446744073709551613, 445]],
dtype=uint64)
I cannot correlate the center column of the from_array output to dependency tree structure given above.
Thanks in advance for the help,
Daniel
Ok, so I finally cracked it it appears that head - index if the index is lower than head and 18446744073709551616 - index otherwise. This is the function that I used if anyone else needed it:
import numpy as np
from spacy.tokens import Doc
docs = []
for sent in sents:
generated_doc = Doc(doc.vocab, words=[word["word"] for word in sent])
heads = []
for idx, word in enumerate(sent):
if word["pos"] not in doc.vocab.strings:
doc.vocab.strings.add(word["pos"])
if word["dep"] not in doc.vocab.strings:
doc.vocab.strings.add(word["dep"])
if word["head"] >= idx:
heads.append(word["head"] - idx)
else:
heads.append(18446744073709551616-idx)
np_array = np.array([np.array([doc.vocab.strings[word["pos"]], heads[idx], doc.vocab.strings[word["dep"]]], dtype=np.uint64) for idx, word in enumerate(sent)], dtype=np.uint64)
generated_doc.from_array([TAG, HEAD, DEP], np_array)
docs.append(generated_doc)

How do I convert simple training style data to spaCy's command line JSON format?

I have the training data for a new NER type in the "Training an additional entity type" section of the spaCy documentation.
TRAIN_DATA = [
("Horses are too tall and they pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
("Do they bite?", {
'entities': []
}),
("horses are too tall and they pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
("horses pretend to care about your feelings", {
'entities': [(0, 6, 'ANIMAL')]
}),
("they pretend to care about your feelings, those horses", {
'entities': [(48, 54, 'ANIMAL')]
}),
("horses?", {
'entities': [(0, 6, 'ANIMAL')]
})
]
I want to train an NER model on this data using the spacy command line application. This requires data in spaCy's JSON format. How do I write the above data (i.e. text with labeled character offset spans) in this JSON format?
After looking at the documentation for that format, it's not clear to me how to manually write data in this format. (For example, do I have partition everything into paragraphs?) There is also a convert command line utility that converts from non-spaCy data formats to spaCy's format, but that doesn't take a spaCy format like the one above as input.
I understand the examples of NER training code that uses the "Simple training style", but I'd like to be able to use the command line utility for training. (Though as is apparent from my previous spaCy question, I'm unclear when you're supposed to use that style and when you're supposed to use the command line.)
Can someone show me an example of the above data in "spaCy's JSON format", or point to documentation that explains how to make this transformation.
There's a built in function to spaCy that will get you most of the way there:
from spacy.gold import biluo_tags_from_offsets
That takes in the "offset" type annotations you have there and converts them to the token-by-token BILOU format.
To put the NER annotations into the final training JSON format, you just need a bit more wrapping around them to fill out the other slots the data requires:
sentences = []
for t in TRAIN_DATA:
doc = nlp(t[0])
tags = biluo_tags_from_offsets(doc, t[1]['entities'])
ner_info = list(zip(doc, tags))
tokens = []
for n, i in enumerate(ner_info):
token = {"head" : 0,
"dep" : "",
"tag" : "",
"orth" : i[0].string,
"ner" : i[1],
"id" : n}
tokens.append(token)
sentences.append(tokens)
Make sure that you disable the non-NER pipelines before training with this data.
I've run into some issues using spacy train on NER-only data. See #1907 and also check out this discussion on the Prodigy forum for some possible workarounds.

Custom SPARQL functions in rdflib

What is a good way to hook a custom SPARQL function into rdflib?
I have been looking around in rdflib for an entry point for custom function. I found no dedicated entry point but found that rdflib.plugins.sparql.CUSTOM_EVALS might be a place to add the custom function.
So far I have made an attempt with the code below. It seems "dirty" to me. I am calling a "hidden" function (_eval) and I am not sure I got all the argument updating correct. Beyond the custom_eval.py example code (which form the basis for my code) I found little other code or documentation about CUSTOM_EVALS.
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal
NAMESPACE = Namespace('//custom/')
LENGTH = rdflib.term.URIRef(NAMESPACE + 'length')
def customEval(ctx, part):
"""Evaluate custom function."""
if part.name == 'Extend':
cs = []
for c in evalPart(ctx, part.p):
if hasattr(part.expr, 'iri'):
# A function
argument = _eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars))
if part.expr.iri == LENGTH:
e = Literal(len(argument))
else:
raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
else:
e = _eval(part.expr, c.forget(ctx, _except=part._vars))
if isinstance(e, SPARQLError):
raise e
cs.append(c.merge({part.var: e}))
return cs
raise NotImplementedError()
QUERY = """
PREFIX custom: <%s>
SELECT ?s ?length WHERE {
BIND("Hello, World" AS ?s)
BIND(custom:length(?s) AS ?length)
}
""" % (NAMESPACE,)
rdflib.plugins.sparql.CUSTOM_EVALS['exampleEval'] = customEval
for row in rdflib.Graph().query(QUERY):
print(row)
So first off, I want to thank you for showing how you implemented a new SPARQL function.
Secondly, by using your code I was able to create a SPARQL function that evaluates two strings by using the Levenshtein distance. It has been really insightful and I wish to share it for it holds additional documentation that could help other developers creating their own custom SPARQL functions.
# Import needed to introduce new SPARQL function
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal
# Import for custom function calculation
from Levenshtein import distance as levenshtein_distance # python-Levenshtein==0.12.2
def SPARQL_levenshtein(ctx:object, part:object) -> object:
"""
The first two variables retrieved from a SPARQL-query are compared using the Levenshtein distance.
The distance value is then stored in Literal object and added to the query results.
Example:
Query:
PREFIX custom: //custom/ # Note: this part refereces to the custom function
SELECT ?label1 ?label2 ?levenshtein WHERE {
BIND("Hello" AS ?label1)
BIND("World" AS ?label2)
BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
Retrieve:
?label1 ?label2
Calculation:
levenshtein_distance(?label1, ?label2) = distance
Output:
Save distance in Literal object.
:param ctx: <class 'rdflib.plugins.sparql.sparql.QueryContext'>
:param part: <class 'rdflib.plugins.sparql.parserutils.CompValue'>
:return: <class 'rdflib.plugins.sparql.processor.SPARQLResult'>
"""
# This part holds basic implementation for adding new functions
if part.name == 'Extend':
cs = []
# Information is retrieved and stored and passed through a generator
for c in evalPart(ctx, part.p):
# Checks if the function holds an internationalized resource identifier
# This will check if any custom functions are added.
if hasattr(part.expr, 'iri'):
# From here the real calculations begin.
# First we get the variable arguments, for example ?label1 and ?label2
argument1 = str(_eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars)))
argument2 = str(_eval(part.expr.expr[1], c.forget(ctx, _except=part.expr._vars)))
# Here it checks if it can find our levenshtein IRI (example: //custom/levenshtein)
# Please note that IRI and URI are almost the same.
# Earlier this has been defined with the following:
# namespace = Namespace('//custom/')
# levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
if part.expr.iri == levenshtein:
# After finding the correct path for the custom SPARQL function the evaluation can begin.
# Here the levenshtein distance is calculated using ?label1 and ?label2 and stored as an Literal object.
# This object is than stored as an output value of the SPARQL-query (example: ?levenshtein)
evaluation = Literal(levenshtein_distance(argument1, argument2))
# Standard error handling and return statements
else:
raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
else:
evaluation = _eval(part.expr, c.forget(ctx, _except=part._vars))
if isinstance(evaluation, SPARQLError):
raise evaluation
cs.append(c.merge({part.var: evaluation}))
return cs
raise NotImplementedError()
namespace = Namespace('//custom/')
levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
query = """
PREFIX custom: <%s>
SELECT ?label1 ?label2 ?levenshtein WHERE {
BIND("Hello" AS ?label1)
BIND("World" AS ?label2)
BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
""" % (namespace,)
# Save custom function in custom evaluation dictionary.
rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
for row in rdflib.Graph().query(query):
print(row)
To answer your question: "What is a good way to hook a custom SPARQL function into rdflib?
Currently I'm developing a class that handles RDF data and I believe it might be best to implement the following code in to __init__function.
For example:
class ClassName():
"""DOCSTRING"""
def __init__(self):
"""DOCSTRING"""
# Save custom function in custom evaluation dictionary.
rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
Please note, this SPARQL function will only work for the endpoint on which it is implemented. Even though the SPARQL syntax in the query is correct, it is not possible applying the function in SPARQL-queries used for databases like DBPedia. The DBPedia endpoint does not support this custom function (yet).

ClassCastException in LingPipe

I am serializing a trained model using
TradNaiveBayesClassifier classifier = new TradNaiveBayesClassifier(categories,tokenizerFactory,categoryPrior,tokenInCategoryPrior,lengthNorm);
then I trained it and compiled it using
AbstractExternalizable.compileTo(classifier,new File(modelPath));
When I read in the model using
TradNaiveBayesClassifier decompClassifier = (TradNaiveBayesClassifier)AbstractExternalizable.readObject(new File(modelPath));{
I get a ClassCastException. Any ideas?
I got it working. I had to upcast to BaseClassifier:
BaseClassifier<CharSequence> eval = (BaseClassifier<CharSequence>)AbstractExternalizable.readObject(new File(modelPath));
evaluator = new BaseClassifierEvaluator<CharSequence>(eval, cat, storeInputs);
Then I couldn't use the JointClassifierEvaluator any more, I had to use the BaseClassifierEvaluator.