Vertex AI Model Batch prediction, issue with referencing existing model and input file on Cloud Storage - kubeflow-pipelines

I'm struggling to correctly set Vertex AI pipeline which does the following:
1. read data from an API, store it to Cloud Storage, and use it as input for batch prediction
2. get an existing model (Video classification on Vertex AI)
3. create a Batch Prediction job with the input from point 1
As will be apparent, I don't have much experience with Vertex Pipelines/Kubeflow, so I'm asking for help/advice; hopefully it's just some beginner mistake.
This is the gist of the code I'm using as the pipeline:
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl
from kfp.v2.dsl import component
from kfp.v2.dsl import (
    Output,
    Artifact,
    Model,
)

PROJECT_ID = 'my-gcp-project'
BUCKET_NAME = "mybucket"
PIPELINE_ROOT = "{}/pipeline_root".format(BUCKET_NAME)


@component
def get_input_data() -> str:
    # getting data from API, save to Cloud Storage
    # return GCS URI
    gcs_batch_input_path = 'gs://somebucket/file'
    return gcs_batch_input_path


@component(
    base_image="python:3.9",
    packages_to_install=['google-cloud-aiplatform==1.8.0']
)
def load_ml_model(project_id: str, model: Output[Artifact]):
    """Load existing Vertex model"""
    import google.cloud.aiplatform as aip

    model_id = '1234'
    model = aip.Model(model_name=model_id, project=project_id, location='us-central1')


@dsl.pipeline(
    name="batch-pipeline", pipeline_root=PIPELINE_ROOT,
)
def pipeline(gcp_project: str):
    input_data = get_input_data()
    ml_model = load_ml_model(gcp_project)

    gcc_aip.ModelBatchPredictOp(
        project=PROJECT_ID,
        job_display_name=f'test-prediction',
        model=ml_model.output,
        gcs_source_uris=[input_data.output],  # this doesn't work
        # gcs_source_uris=['gs://mybucket/output/'],  # hardcoded gs uri works
        gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
    )


if __name__ == '__main__':
    from kfp.v2 import compiler
    import google.cloud.aiplatform as aip

    pipeline_export_filepath = 'test-pipeline.json'
    compiler.Compiler().compile(pipeline_func=pipeline,
                                package_path=pipeline_export_filepath)
    # pipeline_params = {
    #     'gcp_project': PROJECT_ID,
    # }
    # job = aip.PipelineJob(
    #     display_name='test-pipeline',
    #     template_path=pipeline_export_filepath,
    #     pipeline_root=f'gs://{PIPELINE_ROOT}',
    #     project=PROJECT_ID,
    #     parameter_values=pipeline_params,
    # )
    # job.run()
When running the pipeline, it throws this exception at the batch prediction step:
details = "List of found errors: 1.Field: batch_prediction_job.model; Message: Invalid Model resource name.
so I'm not sure what could be wrong. I tried to load the model in a notebook (outside of a component) and it is returned correctly.
The second issue I'm having is referencing the GCS URI output from a component as the batch job input.
input_data = get_input_data2()

gcc_aip.ModelBatchPredictOp(
    project=PROJECT_ID,
    job_display_name=f'test-prediction',
    model=ml_model.output,
    gcs_source_uris=[input_data.output],  # this doesn't work
    # gcs_source_uris=['gs://mybucket/output/'],  # hardcoded gs uri works
    gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
During compilation, I get the following exception: TypeError: Object of type PipelineParam is not JSON serializable, though I think this could be an issue with the ModelBatchPredictOp component.
Again, any help/advice is appreciated; I've been dealing with this since yesterday, so maybe I missed something obvious.
Libraries I'm using:
google-cloud-aiplatform==1.8.0
google-cloud-pipeline-components==0.2.0
kfp==1.8.10
kfp-pipeline-spec==0.1.13
kfp-server-api==1.7.1
UPDATE
After comments, some research, and tuning, this works for referencing the model:
@component
def load_ml_model(project_id: str, model: Output[Artifact]):
    region = 'us-central1'
    model_id = '1234'
    model_uid = f'projects/{project_id}/locations/{region}/models/{model_id}'
    model.uri = model_uid
    model.metadata['resourceName'] = model_uid
and then I can use it as intended:
batch_predict_op = gcc_aip.ModelBatchPredictOp(
    project=gcp_project,
    job_display_name=f'batch-prediction-test',
    model=ml_model.outputs['model'],
    gcs_source_uris=[input_batch_gcs_path],
    gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/test'
)
UPDATE 2
Regarding the GCS path, a workaround is to define the path outside of the component and pass it in as an input parameter, for example (abbreviated):
@dsl.pipeline(
    name="my-pipeline",
    pipeline_root=PIPELINE_ROOT,
)
def pipeline(
    gcp_project: str,
    region: str,
    bucket: str
):
    ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

    gcs_prediction_input_path = f'gs://{BUCKET_NAME}/prediction_input/video_batch_prediction_input_{ts}.jsonl'

    batch_input_data_op = get_input_data(gcs_prediction_input_path)  # this loads input data to the GCS path

    batch_predict_op = gcc_aip.ModelBatchPredictOp(
        project=gcp_project,
        model=training_job_run_op.outputs["model"],
        job_display_name='batch-prediction',
        # gcs_source_uris=[batch_input_data_op.output],
        gcs_source_uris=[gcs_prediction_input_path],
        gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/',
    ).after(batch_input_data_op)  # we need 'after' so it runs once the input data is prepared, since get_input_data doesn't return anything
I'm still not sure why it doesn't compile when I return the GCS path from the get_input_data component.

I'm glad you solved most of your main issues and found a workaround for model declaration.
Regarding your input.output observation on gcs_source_uris, the reason behind it is the way the function/class returns the value. If you dig inside the classes/methods of google_cloud_pipeline_components, you will find that it implements a structure that allows you to use .outputs on the returned value of the called function.
If you go to the implementation of one of the pipeline components, you will find that it returns an output array from the convert_method_to_component function. So, in order to have that in your custom class/function, it should return a value which can be accessed as an attribute. Below is a basic implementation of it.
class CustomClass():
    def __init__(self):
        self.return_val = {'path': 'custompath', 'desc': 'a desc'}

    @property
    def output(self):
        return self.return_val


hello = CustomClass()
print(hello.output['path'])
If you want to dig into it more, you can check the following pages:
convert_method_to_component: the implementation of convert_method_to_component
Properties: the basics of property in Python
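For illustration, here is a minimal sketch (with placeholder component names) of how a lightweight KFP v2 component exposes its return values as attributes on the task object, which is why a single typed return is consumed via .output and a NamedTuple return via .outputs['name']:

from typing import NamedTuple

from kfp.v2 import dsl
from kfp.v2.dsl import component


@component
def make_gcs_path() -> NamedTuple('Outputs', [('gcs_path', str)]):
    # Placeholder lightweight component: build a GCS URI and expose it as a named output
    from collections import namedtuple
    outputs = namedtuple('Outputs', ['gcs_path'])
    return outputs('gs://somebucket/prediction_input/file.jsonl')


@component
def consume_path(path: str):
    # Placeholder consumer, only here to show how the output is referenced
    print(path)


@dsl.pipeline(name="outputs-demo")
def outputs_demo_pipeline():
    path_task = make_gcs_path()
    # the named output is an attribute-like structure on the returned task,
    # analogous to the .output / .outputs mechanism described above
    consume_path(path=path_task.outputs['gcs_path'])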

Related

How to get the uri of the current pipeline's artifact

Consider the following pipeline:
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)

statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])

schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True)

transform = tfx.components.Transform(
    examples=example_gen.outputs['examples'],
    schema=schema_gen.outputs['schema'],
    module_file=os.path.abspath('preprocessing_fn.py'))

_trainer_module_file = 'run_fn.py'
trainer = tfx.components.Trainer(
    module_file=os.path.abspath(_trainer_module_file),
    examples=transform.outputs['transformed_examples'],
    transform_graph=transform.outputs['transform_graph'],
    schema=schema_gen.outputs['schema'],
    train_args=tfx.proto.TrainArgs(num_steps=10),
    eval_args=tfx.proto.EvalArgs(num_steps=6),
)

pusher = tfx.components.Pusher(
    model=trainer.outputs['model'],
    push_destination=tfx.proto.PushDestination(
        filesystem=tfx.proto.PushDestination.Filesystem(
            base_directory=_serving_model_dir)
    )
)

components = [
    example_gen,
    statistics_gen,
    schema_gen,
    transform,
    trainer,
    pusher,
]

_pipeline_data_folder = './simple_pipeline_data'
pipeline = tfx.dsl.Pipeline(
    pipeline_name='simple_pipeline',
    pipeline_root=_pipeline_data_folder,
    metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
        f'{_pipeline_data_folder}/metadata.db'),
    components=components)

tfx.orchestration.LocalDagRunner().run(pipeline)
Now, let's assume that once the pipeline is done, I would like to do something with the artifacts. I know I can query the ML Metadata like this:
import ml_metadata as mlmd
connection_config = pipeline.metadata_connection_config
store = mlmd.MetadataStore(connection_config)
print(store.get_artifact_types())
But this way, I have no idea which IDs belong to the current pipeline. Sure, I can assume that the largest IDs represent the current pipeline artifacts but that's not going to be a practical approach in production when multiple executions might try to work with the same metadata store concurrently.
So, the question is how can I figure out the artifact IDs that were just created by the current execution?
[UPDATE]
To clarify the problem, consider the following partial solution:
def get_latest_artifact(metadata_connection_config, pipeline_name: str, component_name: str, type_name: str):
    with Metadata(metadata_connection_config) as metadata:
        context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
        artifacts = metadata.store.get_artifacts_by_context(context.id)
        artifact_type = metadata.store.get_artifact_type(type_name)
        latest_artifact = max([a for a in artifacts if a.type_id == artifact_type.id],
                              key=lambda a: a.last_update_time_since_epoch)
        artifact = types.Artifact(artifact_type)
        artifact.set_mlmd_artifact(latest_artifact)
        return artifact

sqlite_path = './pipeline_data/metadata.db'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)
examples_artifact = get_latest_artifact(metadata_connection_config, 'simple_pipeline',
                                        'SchemaGen', 'Schema')
Using get_latest_artifact function, I can get the latest artifact of a specific type from a specific pipeline. This will work even if two pipelines (with different names) create new artifacts concurrently. But it will fail when I try to extract the artifact of the "just finished" pipeline if multiple instances of the same pipeline are making changes to the store concurrently. That's because the function takes in the pipeline name as an input argument (as opposed to some pipeline unique ID).
I'm looking for a solution that works no matter how many different (or identical) pipelines work with the same store concurrently. At this point, I'm not sure if this can be done with MLMD. And if it cannot be done at the moment, I consider that a missing feature, a very crucial one.
OK, this is the solution I found. When defining the pipeline's components, you should use the .with_id() method and give the component a custom ID. That way you can find it later on.
Here's an example. Let's say that I want to find the schema generated as part of the recently executed pipeline.
schema_gen = tfx.components.SchemaGen(
    statistics=statistics_gen.outputs['statistics'],
    infer_feature_shape=True).with_id('some_unique_id')
Then, the same function I defined above can be used like this:
def get_latest_artifact(metadata_connection_config, pipeline_name: str, component_name: str, type_name: str):
    with Metadata(metadata_connection_config) as metadata:
        context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
        artifacts = metadata.store.get_artifacts_by_context(context.id)
        artifact_type = metadata.store.get_artifact_type(type_name)
        latest_artifact = max([a for a in artifacts if a.type_id == artifact_type.id],
                              key=lambda a: a.last_update_time_since_epoch)
        artifact = types.Artifact(artifact_type)
        artifact.set_mlmd_artifact(latest_artifact)
        return artifact

sqlite_path = './pipeline_data/metadata.db'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)
examples_artifact = get_latest_artifact(metadata_connection_config, 'simple_pipeline',
                                        'some_unique_id', 'Schema')
Once your TFX pipeline completes the run, you can query the ML metadata using the code below.
connection_config = interactive_context.metadata_connection_config
store = mlmd.MetadataStore(connection_config)
# All TFX artifacts are stored in the base directory
base_dir = connection_config.sqlite.filename_uri.split('metadata.sqlite')[0]
Once the metadata is fetched, you can use the helper functions below to view data from the MD store. The display_types() function queries the list of all stored ArtifactTypes. The display_artifacts() function lists all artifacts of a given artifact type and their URIs. The display_properties() function gives the execution properties for a given artifact.
Please refer to the MLMD tutorial for a detailed implementation of the functions below.
def display_types(types):
    # Helper function to render dataframes for the artifact and execution types
    table = {'id': [], 'name': []}
    for a_type in types:
        table['id'].append(a_type.id)
        table['name'].append(a_type.name)
    return pd.DataFrame(data=table)

def display_artifacts(store, artifacts):
    # Helper function to render dataframes for the input artifacts
    table = {'artifact id': [], 'type': [], 'uri': []}
    for a in artifacts:
        table['artifact id'].append(a.id)
        artifact_type = store.get_artifact_types_by_id([a.type_id])[0]
        table['type'].append(artifact_type.name)
        table['uri'].append(a.uri.replace(base_dir, './'))
    return pd.DataFrame(data=table)

def display_properties(store, node):
    # Helper function to render dataframes for artifact and execution properties
    table = {'property': [], 'value': []}
    for k, v in node.properties.items():
        table['property'].append(k)
        table['value'].append(
            v.string_value if v.HasField('string_value') else v.int_value)
    for k, v in node.custom_properties.items():
        table['property'].append(k)
        table['value'].append(
            v.string_value if v.HasField('string_value') else v.int_value)
    return pd.DataFrame(data=table)
Example code to get the execution properties of the latest pushed model:
# get all artifacts with ArtifactType PushedModel
pushed_models = store.get_artifacts_by_type("PushedModel")
# get the latest pushed model
pushed_model = pushed_models[-1]
# get execution properties for latest pushed model
display_properties(store, pushed_model)
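A further hedged sketch of the same idea at the run level, assuming the orchestrator registers a 'pipeline_run' context per run (the context type name and the use of create_time_since_epoch should be verified against your own store):

import ml_metadata as mlmd


def get_artifacts_of_latest_run(connection_config, type_name: str):
    # Assumption: a 'pipeline_run' context is registered for every pipeline run
    store = mlmd.MetadataStore(connection_config)
    run_contexts = store.get_contexts_by_type('pipeline_run')
    latest_run = max(run_contexts, key=lambda c: c.create_time_since_epoch)
    artifacts = store.get_artifacts_by_context(latest_run.id)
    artifact_type = store.get_artifact_type(type_name)
    return [a for a in artifacts if a.type_id == artifact_type.id]


# Hypothetical usage: schemas produced by the most recent run
# schemas = get_artifacts_of_latest_run(metadata_connection_config, 'Schema')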

How to switch between policies in multiagent rllib?

For my thesis, I want to train a PPO model in a collaborative multi-agent situation. The 'teammates' are already trained Behavior Cloning models, each representing a different kind of behavior. These models will not be trained simultaneously with the PPO. I want the PPO to encounter all these different models during training, meaning I want to switch out the 'teammate' of the PPO every x epochs. How do I do this?
I am using the overcooked-ai environment. All code would be too much to include. But the generation of the trainer ends up returning this:
trainer = PPOTrainer(env="overcooked_multi_agent", config={
    "multiagent": multi_agent_config,
    "callbacks": TrainingCallbacks,
    "custom_eval_function": get_rllib_eval_function(evaluation_params, environment_params['eval_mdp_params'], environment_params['env_params'],
                                                    environment_params["outer_shape"], 'ppo', 'ppo' if self_play else 'bc',
                                                    verbose=params['verbose']),
    "env_config": environment_params,
    "eager": False,
    **training_params
}, logger_creator=custom_logger_creator)
and the multi-agent config is created like this:
def gen_policy(policy_type="ppo"):
    # supported policy types thus far
    assert policy_type in ["ppo", "bc"]
    if policy_type == "ppo":
        config = {
            "model": {
                "custom_options": model_params,
                "custom_model": "MyPPOModel"
            }
        }
        return (None, env.ppo_observation_space, env.action_space, config)
    elif policy_type == "bc":
        bc_cls = bc_params['bc_policy_cls']
        bc_config = bc_params['bc_config']
        return (bc_cls, env.bc_observation_space, env.action_space, bc_config)

# Rllib compatible way of setting the directory we store agent checkpoints in
logdir_prefix = "{0}_{1}_{2}".format(params["experiment_name"], params['training_params']['seed'], timestr)

def custom_logger_creator(config):
    """Creates a Unified logger that stores results in <params['results_dir']>/<params["experiment_name"]>_<seed>_<timestamp>
    """
    results_dir = params['results_dir']
    if not os.path.exists(results_dir):
        try:
            os.makedirs(results_dir)
        except Exception as e:
            print("error creating custom logging dir. Falling back to default logdir {}".format(DEFAULT_RESULTS_DIR))
            results_dir = DEFAULT_RESULTS_DIR
    logdir = tempfile.mkdtemp(
        prefix=logdir_prefix, dir=results_dir)
    logger = UnifiedLogger(config, logdir, loggers=None)
    return logger

# Create rllib compatible multi-agent config based on params
multi_agent_config = {}
all_policies = ['ppo']

# Whether both agents should be learned
self_play = iterable_equal(multi_agent_params['bc_schedule'], OvercookedMultiAgent.self_play_bc_schedule)
if not self_play:
    all_policies.append('bc')

multi_agent_config['policies'] = {policy: gen_policy(policy) for policy in all_policies}

def select_policy(agent_id):
    if agent_id.startswith('ppo'):
        return 'ppo'
    if agent_id.startswith('bc'):
        return 'bc'

multi_agent_config['policy_mapping_fn'] = select_policy
multi_agent_config['policies_to_train'] = 'ppo'
Do I already change my configuration to include multiple BC models (the code above is built to handle one), or can I somehow swap the models in the training loop?
Any ideas would be super helpful!
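One possible direction, as a minimal sketch rather than a tested solution: keep a single 'bc' policy in the multiagent config and periodically overwrite its weights with those of the next pre-trained BC model inside the training loop. Here bc_weight_sets, swap_every and num_iterations are placeholders, and it is assumed that every BC model's weights are compatible with the registered 'bc' policy.

# Placeholder: weight dicts of the pre-trained BC models, e.g. loaded from checkpoints
bc_weight_sets = [bc_model_1_weights, bc_model_2_weights, bc_model_3_weights]
swap_every = 50       # swap the teammate every 50 training iterations (placeholder)
num_iterations = 500  # placeholder

for i in range(num_iterations):
    result = trainer.train()

    if i % swap_every == 0:
        next_weights = bc_weight_sets[(i // swap_every) % len(bc_weight_sets)]
        # overwrite the 'bc' policy weights on the local worker ...
        trainer.get_policy('bc').set_weights(next_weights)
        # ... and on the rollout workers, so sampling uses the new teammate
        trainer.workers.foreach_worker(
            lambda w: w.get_policy('bc').set_weights(next_weights))

An alternative is to register each BC model as its own policy up front and change which one the policy mapping function returns.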

How to extract the [Documentation] text from Robot framework test case

I am trying to extract the content of the [Documentation] section as a string for comparison with another part in a Python script.
I was told to use the Robot Framework API (https://robot-framework.readthedocs.io/en/stable/)
to extract it, but I have no idea how.
However, I am required to work with version 3.1.2
Example:
*** Test Cases ***
ATC Verify that Sensor Battery can enable and disable manufacturing mode
    [Documentation]    E1: This is the description of the test 1
    ...                E2: This is the description of the test 2
    [Tags]    E1    TRACE{Trace_of_E1}
    ...       E2    TRACE{Trace_of_E2}
Extract the string as
E1: This is the description of the test 1
E2: This is the description of the test 2
Have a look at these examples. I did something similar to generate test plan descriptions. I tried to adapt my code to your requirements, and this could maybe work for you.
import os
import re
from robot.api.parsing import (
    get_model, get_tokens, Documentation, EmptyLine, KeywordCall,
    ModelVisitor, Token
)


class RobotParser(ModelVisitor):
    def __init__(self):
        # Store the formatted documentation text
        self.text = ''

    def get_text(self):
        return self.text

    def visit_TestCase(self, node):
        # The matched `TestCase` node is a block with `header` and
        # `body` attributes. `header` is a statement with familiar
        # `get_token` and `get_value` methods for getting certain
        # tokens or their value.
        for keyword in node.body:
            # skip statements that are not [Documentation]
            if keyword.get_value(Token.DOCUMENTATION) is None:
                continue
            self.text += keyword.get_value(Token.ARGUMENT)

    def visit_Documentation(self, node):
        # The matched "Documentation" node with its value
        self.text += node.value + '\n'

    def visit_File(self, node):
        # Call `generic_visit` to visit also child nodes.
        return self.generic_visit(node)


if __name__ == "__main__":
    path = "../tests"
    for filename in os.listdir(path):
        if re.match(r".*\.robot", filename):
            model = get_model(os.path.join(path, filename))
            robot_parser = RobotParser()
            robot_parser.visit(model)
            text = robot_parser.get_text()
The code marked as the best answer didn't quite work for me and has a lot of redundancy, but it inspired me enough to get into the parsing and write it in a much more readable and efficient way that actually works as is. You just have to have your own way of iterating through the filesystem and calling the get_robot_metadata(filepath) function (see the sketch after the code below).
from robot.api.parsing import (get_model, ModelVisitor, Token)


class RobotParser(ModelVisitor):
    def __init__(self):
        self.testcases = {}

    def visit_TestCase(self, node):
        testcasename = (node.header.name)
        self.testcases[testcasename] = {}
        for section in node.body:
            if section.get_value(Token.DOCUMENTATION) != None:
                documentation = section.value
                self.testcases[testcasename]['Documentation'] = documentation
            elif section.get_value(Token.TAGS) != None:
                tags = section.values
                self.testcases[testcasename]['Tags'] = tags

    def get_testcases(self):
        return self.testcases


def get_robot_metadata(filepath):
    if filepath.endswith('.robot'):
        robot_parser = RobotParser()
        model = get_model(filepath)
        robot_parser.visit(model)
        metadata = robot_parser.get_testcases()
        return metadata
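A minimal usage sketch (the directory path is just a placeholder), mirroring the directory walk from the earlier answer:

import os

path = "../tests"  # placeholder folder containing .robot files
for filename in os.listdir(path):
    metadata = get_robot_metadata(os.path.join(path, filename))
    if metadata:  # None is returned for files that do not end with '.robot'
        for testcase, data in metadata.items():
            print(testcase, data.get('Documentation'), data.get('Tags'))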
This function will be able to extract the [Documentation] section from the testcase:
def documentation_extractor(testcase):
    documentation = []
    for setting in testcase.settings:
        if len(setting) > 2 and setting[1].lower() == "[documentation]":
            for doc in setting[2:]:
                if doc.startswith("#"):
                    # the start of a comment, so skip rest of the line
                    break
                documentation.append(doc)
            break
    return "\n".join(documentation)

How to send numpy array to sagemaker endpoint using lambda function

How do I invoke a SageMaker endpoint with input data of type numpy.ndarray?
I have deployed a SageMaker model and am trying to hit it using a Lambda function,
but I am unable to figure out how to do it. I am getting a server error.
One row of the input data (the total dataset has shape=(91, 5, 12); below is only one row):
array([[[0.30440741, 0.30209799, 0.33520652, 0.41558442, 0.69096432,
0.69611016, 0.25153326, 0.98333333, 0.82352941, 0.77187154,
0.7664042 , 0.74468085],
[0.30894981, 0.33151662, 0.22907725, 0.46753247, 0.69437367,
0.70410559, 0.29259044, 0.9 , 0.80882353, 0.79401993,
0.89501312, 0.86997636],
[0.33511896, 0.34338939, 0.24065546, 0.48051948, 0.70384005,
0.71058715, 0.31031288, 0.86666667, 0.89705882, 0.82724252,
0.92650919, 0.89125296],
[0.34617355, 0.36150251, 0.23726854, 0.54545455, 0.71368726,
0.71703244, 0.30228356, 0.85 , 0.86764706, 0.86157254,
0.97112861, 0.94089835],
[0.36269508, 0.35923332, 0.40285461, 0.62337662, 0.73325475,
0.7274392 , 0.26241391, 0.85 , 0.82352941, 0.89922481,
0.9343832 , 0.90780142]]])
I am using the following code but am unable to invoke the endpoint:
import boto3

def lambda_handler(event, context):
    # The SageMaker runtime is what allows us to invoke the endpoint that we've created.
    runtime = boto3.Session().client('sagemaker-runtime')

    endpoint = 'sagemaker-tensorflow-2019-04-22-07-16-51-717'

    print('givendata ', event['body'])
    # data = numpy.array([numpy.array(xi) for xi in event['body']])
    data = event['body']
    print('numpy array ', data)

    # Now we use the SageMaker runtime to invoke our endpoint, sending the review we were given
    response = runtime.invoke_endpoint(EndpointName=endpoint,        # The name of the endpoint we created
                                       ContentType='application/json',  # The data format that is expected
                                       Body=data)                    # The actual review

    # The response is an HTTP response whose body contains the result of our inference
    result = response['Body'].read().decode('utf-8')
    print('response', result)

    # Round the result so that our web app only gets '1' or '0' as a response.
    result = round(float(result))

    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'text/plain', 'Access-Control-Allow-Origin': '*'},
        'body': str(result)
    }
I am unable to figure out what should be written in place of ContentType, because I am not aware of the MIME type for numpy.ndarray.
Illustration of what I had and how I solved it:
from sagemaker.tensorflow import TensorFlowPredictor
predictor = TensorFlowPredictor('sagemaker-tensorflow-serving-date')
data = np.array(raw_data)
response = predictor.predict(data=data)
predictions = response['predictions']
print(predictions)
What I did to find the answer:
Looked up predictions.py and content_types.py implementation in sagemaker python library to see what content types it used and what arguments it had.
First I thought that application/x-npy content_type was used and thus tried using serialisation code from predictor.py and passing the application/x-npy as content_type to invoke_endpoint.
After receiving 415 (unsupported media type), the issue was still the content_type. The following print statements helped me to reveal what content_type predictor actually uses (application/json) and thus I took the appropriate serialisation code from predictor.py
from sagemaker.tensorflow import TensorFlowPredictor
predictor = TensorFlowPredictor('sagemaker-tensorflow-serving-date')
data = np.array(raw_data)
response = predictor.predict(data=data)
print(predictor.content_type)
print(predictor.accept)
predictions = response['predictions']
print(predictions)
TL;DR
Solution for lambda:
import json
import boto3
import botocore.config
import numpy as np

ENDPOINT_NAME = 'sagemaker-tensorflow-serving-date'
config = botocore.config.Config(read_timeout=80)
runtime = boto3.client('runtime.sagemaker', config=config)

data = np.array(raw_data)
payload = json.dumps(data.tolist())

response = runtime.invoke_endpoint(EndpointName=ENDPOINT_NAME,
                                   ContentType='application/json',
                                   Body=payload)
result = json.loads(response['Body'].read().decode())
res = result['predictions']
Note: numpy is not included in Lambda by default, so you would either need to bundle numpy yourself or, instead of data.tolist(), operate on a Python list and json.dumps that list (of lists). From your code it seems to me you have a Python list instead of a numpy array, so a simple json.dumps should work (see the sketch below).
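A minimal sketch of that numpy-free variant inside the Lambda handler, assuming event['body'] already arrives as a nested Python list (as it appears to in the question):

import json
import boto3

runtime = boto3.client('runtime.sagemaker')

def lambda_handler(event, context):
    # assumption: event['body'] is already a nested Python list, so no numpy is needed
    payload = json.dumps(event['body'])
    response = runtime.invoke_endpoint(
        EndpointName='sagemaker-tensorflow-serving-date',  # endpoint name from the snippet above
        ContentType='application/json',
        Body=payload)
    result = json.loads(response['Body'].read().decode())
    return {'statusCode': 200, 'body': json.dumps(result['predictions'])}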
If you are training and hosting a custom algorithm on SageMaker using TensorFlow, you can serialize/deserialize the request and response as JSON, as in the TensorFlow Serving Predict API.
import numpy
from sagemaker.predictor import json_serializer, json_deserializer
# define predictor
predictor = estimator.deploy(1, instance_type)
# format request
data = {'instances': numpy.asarray(np_array).astype(float).tolist()}
# set predictor request/response formats
predictor.accept = 'application/json'
predictor.content_type = 'application/json'
predictor.serializer = json_serializer
predictor.deserializer = json_deserializer
# run inference using SageMaker predict class
# https://github.com/aws/sagemaker-python-sdk/blob/master/src/sagemaker/predictor.py
predictor.predict(data)
You can refer to the example notebook here to train and host a custom TensorFlow container.

Custom SPARQL functions in rdflib

What is a good way to hook a custom SPARQL function into rdflib?
I have been looking around in rdflib for an entry point for custom functions. I found no dedicated entry point, but found that rdflib.plugins.sparql.CUSTOM_EVALS might be a place to add the custom function.
So far I have made an attempt with the code below. It seems "dirty" to me. I am calling a "hidden" function (_eval) and I am not sure I got all the argument updating correct. Beyond the custom_eval.py example code (which forms the basis for my code) I found little other code or documentation about CUSTOM_EVALS.
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal

NAMESPACE = Namespace('//custom/')
LENGTH = rdflib.term.URIRef(NAMESPACE + 'length')


def customEval(ctx, part):
    """Evaluate custom function."""
    if part.name == 'Extend':
        cs = []
        for c in evalPart(ctx, part.p):
            if hasattr(part.expr, 'iri'):
                # A function
                argument = _eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars))
                if part.expr.iri == LENGTH:
                    e = Literal(len(argument))
                else:
                    raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
            else:
                e = _eval(part.expr, c.forget(ctx, _except=part._vars))
                if isinstance(e, SPARQLError):
                    raise e
            cs.append(c.merge({part.var: e}))
        return cs
    raise NotImplementedError()


QUERY = """
PREFIX custom: <%s>
SELECT ?s ?length WHERE {
  BIND("Hello, World" AS ?s)
  BIND(custom:length(?s) AS ?length)
}
""" % (NAMESPACE,)

rdflib.plugins.sparql.CUSTOM_EVALS['exampleEval'] = customEval

for row in rdflib.Graph().query(QUERY):
    print(row)
So first off, I want to thank you for showing how you implemented a new SPARQL function.
Secondly, by using your code I was able to create a SPARQL function that compares two strings using the Levenshtein distance. It has been really insightful, and I wish to share it, as it holds additional documentation that could help other developers create their own custom SPARQL functions.
# Imports needed to introduce a new SPARQL function
import rdflib
from rdflib.plugins.sparql.evaluate import evalPart
from rdflib.plugins.sparql.sparql import SPARQLError
from rdflib.plugins.sparql.evalutils import _eval
from rdflib.namespace import Namespace
from rdflib.term import Literal

# Import for the custom function calculation
from Levenshtein import distance as levenshtein_distance  # python-Levenshtein==0.12.2


def SPARQL_levenshtein(ctx: object, part: object) -> object:
    """
    The first two variables retrieved from a SPARQL-query are compared using the Levenshtein distance.
    The distance value is then stored in a Literal object and added to the query results.

    Example:

    Query:
        PREFIX custom: //custom/   # Note: this part references the custom function
        SELECT ?label1 ?label2 ?levenshtein WHERE {
          BIND("Hello" AS ?label1)
          BIND("World" AS ?label2)
          BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
        }

    Retrieve: ?label1 ?label2
    Calculation: levenshtein_distance(?label1, ?label2) = distance
    Output: save the distance in a Literal object.

    :param ctx: <class 'rdflib.plugins.sparql.sparql.QueryContext'>
    :param part: <class 'rdflib.plugins.sparql.parserutils.CompValue'>
    :return: <class 'rdflib.plugins.sparql.processor.SPARQLResult'>
    """
    # This part holds the basic implementation for adding new functions
    if part.name == 'Extend':
        cs = []

        # Information is retrieved, stored and passed through a generator
        for c in evalPart(ctx, part.p):

            # Checks if the function holds an internationalized resource identifier.
            # This will check if any custom functions are added.
            if hasattr(part.expr, 'iri'):

                # From here the real calculations begin.
                # First we get the variable arguments, for example ?label1 and ?label2
                argument1 = str(_eval(part.expr.expr[0], c.forget(ctx, _except=part.expr._vars)))
                argument2 = str(_eval(part.expr.expr[1], c.forget(ctx, _except=part.expr._vars)))

                # Here it checks if it can find our levenshtein IRI (example: //custom/levenshtein).
                # Please note that IRI and URI are almost the same.
                # Earlier this has been defined with the following:
                #   namespace = Namespace('//custom/')
                #   levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')
                if part.expr.iri == levenshtein:

                    # After finding the correct path for the custom SPARQL function, the evaluation can begin.
                    # Here the Levenshtein distance is calculated using ?label1 and ?label2 and stored as a Literal object.
                    # This object is then stored as an output value of the SPARQL-query (example: ?levenshtein)
                    evaluation = Literal(levenshtein_distance(argument1, argument2))

                # Standard error handling and return statements
                else:
                    raise SPARQLError('Unhandled function {}'.format(part.expr.iri))
            else:
                evaluation = _eval(part.expr, c.forget(ctx, _except=part._vars))
                if isinstance(evaluation, SPARQLError):
                    raise evaluation
            cs.append(c.merge({part.var: evaluation}))
        return cs
    raise NotImplementedError()


namespace = Namespace('//custom/')
levenshtein = rdflib.term.URIRef(namespace + 'levenshtein')

query = """
PREFIX custom: <%s>
SELECT ?label1 ?label2 ?levenshtein WHERE {
  BIND("Hello" AS ?label1)
  BIND("World" AS ?label2)
  BIND(custom:levenshtein(?label1, ?label2) AS ?levenshtein)
}
""" % (namespace,)

# Save custom function in the custom evaluation dictionary.
rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein

for row in rdflib.Graph().query(query):
    print(row)
To answer your question: "What is a good way to hook a custom SPARQL function into rdflib?"
Currently I'm developing a class that handles RDF data, and I believe it might be best to implement the following code into the __init__ function.
For example:
class ClassName():
    """DOCSTRING"""

    def __init__(self):
        """DOCSTRING"""
        # Save custom function in the custom evaluation dictionary.
        rdflib.plugins.sparql.CUSTOM_EVALS['SPARQL_levenshtein'] = SPARQL_levenshtein
Please note, this SPARQL function will only work for the endpoint on which it is implemented. Even though the SPARQL syntax in the query is correct, it is not possible to apply the function in SPARQL queries against databases like DBpedia. The DBpedia endpoint does not support this custom function (yet).