output for append job in BigQuery using Luigi Orchestrator - google-bigquery

I have a Bigquery task which only aims to append a daily temp table (Table-xxxx-xx-xx) to an existing table (PersistingTable).
I am not sure how to handle the output(self) method. Indeed, I can not just output PersistingTable as a luigi.contrib.bigquery.BigQueryTarget, since it already exists before the process started. Has anyone asked himself such a question?

I could not find an answer anywhere else so I will give my solution even though this is a very old question.
I created a new class that inherits from luigi.contrib.bigquery.BigQueryLoadTask
class BigQueryLoadIncremental(luigi.contrib.bigquery.BigQueryLoadTask):
'''
a subclass that checks whether a write-log on gcs exists to append data to the table
needs to define Two Outputs! [0] of type BigQueryTarget and [1] of type GCSTarget
Everything else is left unchanged
'''
def exists(self):
return luigi.contrib.gcs.GCSClient.exists(self.output()[1].path)
#property
def write_disposition(self):
"""
Set to WRITE_APPEND as this subclass only makes sense for this
"""
return luigi.contrib.bigquery.WriteDisposition.WRITE_APPEND
def run(self):
output = self.output()[0]
gcs_output = self.output()[1]
assert isinstance(output,
luigi.contrib.bigquery.BigQueryTarget), 'Output[0] must be a BigQueryTarget, not %s' % (
output)
assert isinstance(gcs_output,
luigi.contrib.gcs.GCSTarget), 'Output[1] must be a Cloud Storage Target, not %s' % (
gcs_output)
bq_client = output.client
source_uris = self.source_uris()
assert all(x.startswith('gs://') for x in source_uris)
job = {
'projectId': output.table.project_id,
'configuration': {
'load': {
'destinationTable': {
'projectId': output.table.project_id,
'datasetId': output.table.dataset_id,
'tableId': output.table.table_id,
},
'encoding': self.encoding,
'sourceFormat': self.source_format,
'writeDisposition': self.write_disposition,
'sourceUris': source_uris,
'maxBadRecords': self.max_bad_records,
'ignoreUnknownValues': self.ignore_unknown_values
}
}
}
if self.source_format == luigi.contrib.bigquery.SourceFormat.CSV:
job['configuration']['load']['fieldDelimiter'] = self.field_delimiter
job['configuration']['load']['skipLeadingRows'] = self.skip_leading_rows
job['configuration']['load']['allowJaggedRows'] = self.allow_jagged_rows
job['configuration']['load']['allowQuotedNewlines'] = self.allow_quoted_new_lines
if self.schema:
job['configuration']['load']['schema'] = {'fields': self.schema}
# test write to and removal of GCS pseudo output in order to make sure this does not fail.
gcs_output.fs.put_string(
'test write for task {} (this file should have been removed immediately)'.format(self.task_id),
gcs_output.path)
gcs_output.fs.remove(gcs_output.path)
bq_client.run_job(output.table.project_id, job, dataset=output.table.dataset)
gcs_output.fs.put_string(
'success! The following BigQuery Job went through without errors: {}'.format(self.task_id), gcs_output.path)
it uses a second output (which might violate luigis atomicity principle) on google cloud storage. Example usage:
class LeadsToBigQuery(BigQueryLoadIncremental):
date = luigi.DateParameter(default=datetime.date.today())
def output(self):
return luigi.contrib.bigquery.BigQueryTarget(project_id=...,
dataset_id=...,
table_id=...), \
create_gcs_target(...)

Related

How to get the uri of the current pipeline's artifact

Consider the following pipeline:
example_gen = tfx.components.ImportExampleGen(input_base=_dataset_folder)
statistics_gen = tfx.components.StatisticsGen(examples=example_gen.outputs['examples'])
schema_gen = tfx.components.SchemaGen(
statistics=statistics_gen.outputs['statistics'],
infer_feature_shape=True)
transform = tfx.components.Transform(
examples=example_gen.outputs['examples'],
schema=schema_gen.outputs['schema'],
module_file=os.path.abspath('preprocessing_fn.py'))
_trainer_module_file = 'run_fn.py'
trainer = tfx.components.Trainer(
module_file=os.path.abspath(_trainer_module_file),
examples=transform.outputs['transformed_examples'],
transform_graph=transform.outputs['transform_graph'],
schema=schema_gen.outputs['schema'],
train_args=tfx.proto.TrainArgs(num_steps=10),
eval_args=tfx.proto.EvalArgs(num_steps=6),)
pusher = tfx.components.Pusher(
model=trainer.outputs['model'],
push_destination=tfx.proto.PushDestination(
filesystem=tfx.proto.PushDestination.Filesystem(
base_directory=_serving_model_dir)
)
)
components = [
example_gen,
statistics_gen,
schema_gen,
transform,
trainer,
pusher,
]
_pipeline_data_folder = './simple_pipeline_data'
pipeline = tfx.dsl.Pipeline(
pipeline_name='simple_pipeline',
pipeline_root=_pipeline_data_folder,
metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(
f'{_pipeline_data_folder}/metadata.db'),
components=components)
tfx.orchestration.LocalDagRunner().run(pipeline)
Now, let's assume that once the pipeline is down, I would like to do something with the artifacts. I know I can query the ML Metadata like this:
import ml_metadata as mlmd
connection_config = pipeline.metadata_connection_config
store = mlmd.MetadataStore(connection_config)
print(store.get_artifact_types())
But this way, I have no idea which IDs belong to the current pipeline. Sure, I can assume that the largest IDs represent the current pipeline artifacts but that's not going to be a practical approach in production when multiple executions might try to work with the same metadata store concurrently.
So, the question is how can I figure out the artifact IDs that were just created by the current execution?
[UPDATE]
To clarify the problem, consider the following partial solution:
def get_latest_artifact(metadata_connection_config, pipeline_name: str, component_name: str, type_name: str):
with Metadata(metadata_connection_config) as metadata:
context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
artifacts = metadata.store.get_artifacts_by_context(context.id)
artifact_type = metadata.store.get_artifact_type(type_name)
latest_artifact = max([a for a in artifacts if a.type_id == artifact_type.id],
key=lambda a: a.last_update_time_since_epoch)
artifact = types.Artifact(artifact_type)
artifact.set_mlmd_artifact(latest_artifact)
return artifact
sqlite_path = './pipeline_data/metadata.db'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)
examples_artifact = get_latest_artifact(metadata_connection_config, 'simple_pipeline',
'SchemaGen', 'Schema')
Using get_latest_artifact function, I can get the latest artifact of a specific type from a specific pipeline. This will work even if two pipelines (with different names) create new artifacts concurrently. But it will fail when I try to extract the artifact of the "just finished" pipeline if multiple instances of the same pipeline are making changes to the store concurrently. That's because the function takes in the pipeline name as an input argument (as opposed to some pipeline unique ID).
I'm looking for a solution that works no matter how many different (or the same) pipelines work with the same store concurrently. At this point, I'm not sure if this can be done with MlMD. And if it cannot be done at the moment, I consider that a missed feature, a very crucial one.
OK, this is the solution I found. When defining the pipeline's components, you should use .with_id() method and give the component a custom ID. That way you can find it later on.
Here's an example. Let's say that I want to find the schema generated as part of the recently executed pipeline.
schema_gen = tfx.components.SchemaGen(
statistics=statistics_gen.outputs['statistics'],
infer_feature_shape=True).with_id('some_unique_id')
Then, the same function I defined above can be used like this:
def get_latest_artifact(metadata_connection_config, pipeline_name: str, component_name: str, type_name: str):
with Metadata(metadata_connection_config) as metadata:
context = metadata.store.get_context_by_type_and_name('node', f'{pipeline_name}.{component_name}')
artifacts = metadata.store.get_artifacts_by_context(context.id)
artifact_type = metadata.store.get_artifact_type(type_name)
latest_artifact = max([a for a in artifacts if a.type_id == artifact_type.id],
key=lambda a: a.last_update_time_since_epoch)
artifact = types.Artifact(artifact_type)
artifact.set_mlmd_artifact(latest_artifact)
return artifact
sqlite_path = './pipeline_data/metadata.db'
metadata_connection_config = tfx.orchestration.metadata.sqlite_metadata_connection_config(sqlite_path)
examples_artifact = get_latest_artifact(metadata_connection_config, 'simple_pipeline',
'some_unique_id', 'Schema')
Once your TFX pipeline completes the run, you can query the ML metadata using below code.
connection_config = interactive_context.metadata_connection_config
store = mlmd.MetadataStore(connection_config)
# All TFX artifacts are stored in the base directory
base_dir = connection_config.sqlite.filename_uri.split('metadata.sqlite')[0]
Once metadata is fetched, you can use below helper functions to view data from MD store. The display_types() function queries the list of all its stored ArtifactTypes. Then display_artifacts() function list all artifacts for given artifact type and their uri. The display_properties() function will give execution properties for given artifact.
Please refer MLMD tutorial for detailed implementation of below functions.
def display_types(types):
# Helper function to render dataframes for the artifact and execution types
table = {'id': [], 'name': []}
for a_type in types:
table['id'].append(a_type.id)
table['name'].append(a_type.name)
return pd.DataFrame(data=table)
def display_artifacts(store, artifacts):
# Helper function to render dataframes for the input artifacts
table = {'artifact id': [], 'type': [], 'uri': []}
for a in artifacts:
table['artifact id'].append(a.id)
artifact_type = store.get_artifact_types_by_id([a.type_id])[0]
table['type'].append(artifact_type.name)
table['uri'].append(a.uri.replace(base_dir, './'))
return pd.DataFrame(data=table)
def display_properties(store, node):
# Helper function to render dataframes for artifact and execution properties
table = {'property': [], 'value': []}
for k, v in node.properties.items():
table['property'].append(k)
table['value'].append(
v.string_value if v.HasField('string_value') else v.int_value)
for k, v in node.custom_properties.items():
table['property'].append(k)
table['value'].append(
v.string_value if v.HasField('string_value') else v.int_value)
return pd.DataFrame(data=table)
Example code to get the latest pushed model execution properties.
# get all artifacts with ArtifactType as PusherModel
pushed_models = store.get_artifacts_by_type("PushedModel")
# get the latest pushed model
pushed_model = pushed_models[-1]
# get execution properties for latest pushed model
display_properties(store, pushed_model)

Scalding Unit Test - How to Write A Local File?

I work at a place where scalding writes are augmented with a specific API to track dataset meta data. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to compare the binary file is the same prior to conversion and after conversion to the special API.
Given I cannot provide the specific api for the metadata-inclusive writes, I only ask how can I write a unit test for .write method on a TypedPipe?
implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)
val fileStrPath = root + "/test"
println("writing data to " + fileStrPath)
TypedPipe
.from(Seq[Long](1, 2, 3, 4, 5))
// .map((x: Long) => { println(x.toString); System.out.flush(); x })
.write(TypedTsv[Long](fileStrPath))
.forceToDisk
The above doesn't seem to write anything to local (OSX) disk.
So I wonder if I need to use a MiniDFSCluster something like this:
def setUpTempFolder: String = {
val tempFolder = new TemporaryFolder
tempFolder.create()
tempFolder.getRoot.getAbsolutePath
}
val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
val configuration = new Configuration()
configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
new MiniDFSCluster.Builder(configuration)
.manageNameDfsDirs(true)
.manageDataDfsDirs(true)
.format(true)
.build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)
However, my attempts to get this MiniCluster to work haven't panned out either - somehow I need to link the MiniCluster with the Scalding write.
Note: The Scalding JobTest framework for unit testing isn't going to work due actual data written is sometimes wrapped in bijection codec or setup with case class wrappers prior to the writes made by the metadata-inclusive writes APIs.
Any ideas how I can write a local file (without using the Scalding REPL) with either Scalding alone or a MiniCluster? (If using the later, I need a hint how to read the file.)
Answering ... There is an example of how to use a mini cluster for exactly reading and writing to HDFS. I will be able to cross read with my different writes and examine them. Here it is in the tests for scalding's TypedParquet type
HadoopPlatformJobTest is an extension for JobTest that uses a MiniCluster.
With some hand-waiving on detail in the link, the bulk of the code is this:
"TypedParquetTuple" should {
"read and write correctly" in {
import com.twitter.scalding.parquet.tuple.TestValues._
def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)
HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
.arg("output", "output1")
.sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
toMap(_) shouldBe toMap(values)
}
.run()
HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
.arg("input", "output1")
.arg("output", "output2")
.sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
.run()
}
}

Vertex AI Model Batch prediction, issue with referencing existing model and input file on Cloud Storage

I'm struggling to correctly set Vertex AI pipeline which does the following:
read data from API and store to GCS and as as input for batch prediction.
get an existing model (Video classification on Vertex AI)
create Batch prediction job with input from point 1.
As it will be seen, I don't have much experience with Vertex Pipelines/Kubeflow thus I'm asking for help/advice, hope it's just some beginner mistake.
this is the gist of the code I'm using as pipeline
from google_cloud_pipeline_components import aiplatform as gcc_aip
from kfp.v2 import dsl
from kfp.v2.dsl import component
from kfp.v2.dsl import (
Output,
Artifact,
Model,
)
PROJECT_ID = 'my-gcp-project'
BUCKET_NAME = "mybucket"
PIPELINE_ROOT = "{}/pipeline_root".format(BUCKET_NAME)
#component
def get_input_data() -> str:
# getting data from API, save to Cloud Storage
# return GS URI
gcs_batch_input_path = 'gs://somebucket/file'
return gcs_batch_input_path
#component(
base_image="python:3.9",
packages_to_install=['google-cloud-aiplatform==1.8.0']
)
def load_ml_model(project_id: str, model: Output[Artifact]):
"""Load existing Vertex model"""
import google.cloud.aiplatform as aip
model_id = '1234'
model = aip.Model(model_name=model_id, project=project_id, location='us-central1')
#dsl.pipeline(
name="batch-pipeline", pipeline_root=PIPELINE_ROOT,
)
def pipeline(gcp_project: str):
input_data = get_input_data()
ml_model = load_ml_model(gcp_project)
gcc_aip.ModelBatchPredictOp(
project=PROJECT_ID,
job_display_name=f'test-prediction',
model=ml_model.output,
gcs_source_uris=[input_data.output], # this doesn't work
# gcs_source_uris=['gs://mybucket/output/'], # hardcoded gs uri works
gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
if __name__ == '__main__':
from kfp.v2 import compiler
import google.cloud.aiplatform as aip
pipeline_export_filepath = 'test-pipeline.json'
compiler.Compiler().compile(pipeline_func=pipeline,
package_path=pipeline_export_filepath)
# pipeline_params = {
# 'gcp_project': PROJECT_ID,
# }
# job = aip.PipelineJob(
# display_name='test-pipeline',
# template_path=pipeline_export_filepath,
# pipeline_root=f'gs://{PIPELINE_ROOT}',
# project=PROJECT_ID,
# parameter_values=pipeline_params,
# )
# job.run()
When running the pipeline it throws this exception when running Batch prediction:
details = "List of found errors: 1.Field: batch_prediction_job.model; Message: Invalid Model resource name.
so I'm not sure what could be wrong. I tried to load model in the notebook (outside of component) and it correctly returns.
Second issue I'm having is referencing GCS URI as output from component to batch job input.
input_data = get_input_data2()
gcc_aip.ModelBatchPredictOp(
project=PROJECT_ID,
job_display_name=f'test-prediction',
model=ml_model.output,
gcs_source_uris=[input_data.output], # this doesn't work
# gcs_source_uris=['gs://mybucket/output/'], # hardcoded gs uri works
gcs_destination_output_uri_prefix=f'gs://{PIPELINE_ROOT}/prediction_output/'
)
During compilation, I get following exception TypeError: Object of type PipelineParam is not JSON serializable, though I think this could be issue of ModelBatchPredictOp component.
Again any help/advice appreciated, I'm dealing with this from yesterday, so maybe I missed something obvious.
libraries I'm using:
google-cloud-aiplatform==1.8.0
google-cloud-pipeline-components==0.2.0
kfp==1.8.10
kfp-pipeline-spec==0.1.13
kfp-server-api==1.7.1
UPDATE
After comments, some research and tuning, for referencing model this works:
#component
def load_ml_model(project_id: str, model: Output[Artifact]):
region = 'us-central1'
model_id = '1234'
model_uid = f'projects/{project_id}/locations/{region}/models/{model_id}'
model.uri = model_uid
model.metadata['resourceName'] = model_uid
and then I can use it as intended:
batch_predict_op = gcc_aip.ModelBatchPredictOp(
project=gcp_project,
job_display_name=f'batch-prediction-test',
model=ml_model.outputs['model'],
gcs_source_uris=[input_batch_gcs_path],
gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/test'
)
UPDATE 2
regarding GCS path, a workaround is to define path outside of the component and pass it as an input parameter, for example (abbreviated):
#dsl.pipeline(
name="my-pipeline",
pipeline_root=PIPELINE_ROOT,
)
def pipeline(
gcp_project: str,
region: str,
bucket: str
):
ts = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
gcs_prediction_input_path = f'gs://{BUCKET_NAME}/prediction_input/video_batch_prediction_input_{ts}.jsonl'
batch_input_data_op = get_input_data(gcs_prediction_input_path) # this loads input data to GCS path
batch_predict_op = gcc_aip.ModelBatchPredictOp(
project=gcp_project,
model=training_job_run_op.outputs["model"],
job_display_name='batch-prediction',
# gcs_source_uris=[batch_input_data_op.output],
gcs_source_uris=[gcs_prediction_input_path],
gcs_destination_output_uri_prefix=f'gs://{BUCKET_NAME}/prediction_output/',
).after(batch_input_data_op) # we need to add 'after' so it runs after input data is prepared since get_input_data doesn't returns anything
still not sure, why it doesn't work/compile when I return GCS path from get_input_data component
I'm glad you solved most of your main issues and found a workaround for model declaration.
For your input.output observation on gcs_source_uris, the reason behind it is because the way the function/class returns the value. If you dig inside the class/methods of google_cloud_pipeline_components you will find that it implements a structure that will allow you to use .outputs from the returned value of the function called.
If you go to the implementation of one of the components of the pipeline you will find that it returns an output array from convert_method_to_component function. So, in order to have that implemented in your custom class/function your function should return a value which can be called as an attribute. Below is a basic implementation of it.
class CustomClass():
def __init__(self):
self.return_val = {'path':'custompath','desc':'a desc'}
#property
def output(self):
return self.return_val
hello = CustomClass()
print(hello.output['path'])
If you want to dig more about it you can go to the following pages:
convert_method_to_component, which is the implementation of convert_method_to_component
Properties, basics of property in python.

In Karate - Feature file calling from another feature file along with variable value

My apologies it seems repetitive question but it is really troubling me.
I am trying to call one feature file from another feature file along with variable values. and it is not working at all.
Below is the structure I am using.
my request json having variable name. Filename:InputRequest.json
{
"transaction" : "123",
"transactionDateTime" : "#(sTransDateTime)"
}
my featurefile1 : ABC.Feature
Background:
* def envValue = env
* def config = { username: '#(dbUserName)', password: '#(dbPassword)', url: '#(dbJDBCUrl)', driverClassName: "oracle.jdbc.driver.OracleDriver"};
* def dbUtils = Java.type('Common.DbUtils')
* def request1= read(karate.properties['user.dir'] + 'InputRequest.json')
* def endpoint= '/v1/ABC'
* def appDb = new dbUtils(config);
Scenario: ABC call
* configure cookies = null
Given url endpoint
And request request1
When method Post
Then status 200
Feature file from which I am calling ABC.Feature
#tag1
**my featurefile1: XYZ.Feature**
`Background`:
* def envValue = env
Scenario: XYZ call
* def sTransDateTime = function() { var SimpleDateFormat = Java.type('java.text.SimpleDateFormat'); var sdf = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSS'+00:00'"); return sdf.format(new java.util.Date()); }
* def result = call read(karate.properties['user.dir'] + 'ABC.feature') { sTransDateTime: sTransDateTime }
Problem is,
While executing it, runnerTest has tag1 configured to execute.
Currently, it is ignoring entire ABC.feature to execute and also not generating cucumber report.
If I mention the same tag for ABC.feature (Which is not expected for me as this is just reusable component for me ) then it is being executed but sTransDateTime value is not being passed from XYZ.feature to ABC.feature. Eventually, InputRequest.json should have that value while communicating with the server as a part of the request.
I am using 0.9.4 Karate version. Any help please.
Change to this:
{ sTransDateTime: '#(sTransDateTime)' }
And read this explanation: https://github.com/intuit/karate#call-vs-read
I'm sorry the other part doesn't make sense and shouldn't happen, please follow this process: https://github.com/intuit/karate/wiki/How-to-Submit-an-Issue

mongoengine slow serialization of embedded documents with reference fields

I have a small DB with about 500 records. I'm trying to implement a versioning scheme where I save the form along with its current version to my Record collection. Ideally, I would like to store the form along with its version number in an embedded document to keep things nice and tidy:
class Structure(db.EmbeddedDocument):
form = db.ReferenceField(Form, required = True)
version = db.IntField(required = True)
#property
def short(self):
return {
'form': self.form,
'version': self.version
}
class Record(db.Document):
structure = db.EmbeddedDocumentField(Structure)
#property
def short(self):
return {
'structure': self.structure.short
}
This way when I recall a record I can grab the form and the version that was used at the time. Running some timing tests:
start = time.clock()
records = Record.objects.select_related()
print ('Time: ', time.clock() - start)
response = [i.short for i in records]
print ('Time: ', time.clock() - start)
I find the query time for all records Record.objects.select_related() to be reasonable at, ~ 1.12 s, however, I'm finding serialization for the purpose of JSON transfer is extremely expensive at ~ 24.1s!
If I make a slight modification by removing use of the EmbeddedDocument:
class Record(db.Document):
form = db.ReferenceField(Form, required = True)
version = db.IntField(required = True)
#property
def short(self):
return {
'form': self.form,
'version': self.version
}
Running the same test I find the query time to be pretty much unchanged at ~ 1.36 s, however, the serialization time improved by 24x to 1.14s. I really do not understand why use of an embedded document would lead to such as massive penalty in serialization time...? Is dereferencing in an embedded object more difficult?