How to use Google DataFlow Runner and Templates in tf.Transform? - tensorflow

We are in the process of establishing a Machine Learning pipeline on Google Cloud, leveraging GC ML-Engine for distributed TensorFlow training and model serving, and DataFlow for distributed pre-processing jobs.
We would like to run our Apache Beam apps as DataFlow jobs on Google Cloud. Looking at the ML-Engine samples, it appears possible to tell tensorflow_transform.beam.impl AnalyzeAndTransformDataset which PipelineRunner to use, as follows:
import apache_beam as beam
from tensorflow_transform.beam import impl as tft
pipeline_name = "DirectRunner"
p = beam.Pipeline(pipeline_name)
p | "xxx" >> xxx | "yyy" >> yyy | tft.AnalyzeAndTransformDataset(...)
TemplatingDataflowPipelineRunner provides the ability to separate our preprocessing development from parameterized operations - see here: https://cloud.google.com/dataflow/docs/templates/overview - basically:
A) in PipelineOptions derived types, change option types to ValueProvider (python way: type inference or type hints ???)
B) change runner to TemplatingDataflowPipelineRunner
C) mvn archetype:generate to store template in GCS (python way: a yaml file like TF Hypertune ???)
D) gcloud beta dataflow jobs run --gcs-location --parameters
The question is: can you show me how we can use tf.Transform with TemplatingDataflowPipelineRunner?

Python templates are available as of April 2017 (see documentation). The way to operate them is the following:
Define UserOptions subclassed from PipelineOptions.
Use the add_value_provider_argument API to add specific arguments to be parameterized.
Regular non-parameterizable options will continue to be defined using argparse's add_argument.
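from apache_beam.options.pipeline_options import PipelineOptions
class UserOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        # Parameterizable when the template is run.
        parser.add_value_provider_argument('--value_provider_arg', default='some_value')
        # Fixed when the template is created.
        parser.add_argument('--non_value_provider_arg', default='some_other_value')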
Note that Python doesn't have a TemplatingDataflowPipelineRunner, and neither does Java 2.X (unlike what happened in Java 1.X).
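Once a template has been staged, step D from the question can be scripted. The following is a minimal sketch that only assembles the gcloud command line and does not run it. The helper name, job name, bucket paths, and parameter value are hypothetical; the parameter name reuses value_provider_arg from the options class.

```python
import shlex


def build_template_run_command(job_name, template_gcs_path, parameters):
    """Assemble the argv for launching a staged Dataflow template.

    Hypothetical helper: it builds the command but does not execute it.
    """
    # Runtime parameters are passed as one comma-separated KEY=VALUE list,
    # so values containing commas would need extra care.
    param_list = ",".join(
        f"{key}={value}" for key, value in sorted(parameters.items())
    )
    return [
        "gcloud", "beta", "dataflow", "jobs", "run", job_name,
        "--gcs-location", template_gcs_path,
        "--parameters", param_list,
    ]


command = build_template_run_command(
    "preprocess-job",                       # hypothetical job name
    "gs://my-bucket/templates/preprocess",  # hypothetical template location
    {"value_provider_arg": "gs://my-bucket/input.csv"},
)
print(shlex.join(command))
```

The resulting list can be handed to subprocess.run once the values have been checked.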

Unfortunately, Python pipelines cannot be used as templates; templates are only available for Java today. Since you need to use the Python library, this will not be feasible.
tensorflow_transform would also need to support ValueProvider so that you can pass in options as a value provider type through it.

Related

How do I launch a SmartSim orchestrator without models?

I'm trying to prototype using the SmartRedis Python client to interact with the SmartSim Orchestrator. Is it possible to launch the orchestrator without any other models in the experiment? If so, what would be the best way to do so?
It is entirely possible to do that. A SmartSim Experiment can contain different types of 'entities' including Models, Ensembles (i.e. groups of Models), and Orchestrator (i.e. the Redis-backed database). None of these entities, however, are 'required' to be in the Experiment.
Here's a short script that creates an experiment which includes only a database.
from smartsim import Experiment
NUM_DB_NODES = 3
exp = Experiment("Database Only")
db = exp.create_database(db_nodes=NUM_DB_NODES)
exp.generate(db)
exp.start(db)
After this, the Orchestrator (with the number of shards specified by NUM_DB_NODES) will have been spun up. You can then connect the Python client (after importing smartredis) using the following line:
client = smartredis.Client(db.get_address()[0], NUM_DB_NODES > 1)

How to concatenate the OutputPathPlaceholder with a string in Kubeflow pipelines?

I am using Kubeflow pipelines (KFP) with GCP Vertex AI pipelines. I am using kfp==1.8.5 (kfp SDK) and google-cloud-pipeline-components==0.1.7. Not sure if I can find which version of Kubeflow is used on GCP.
I am building a component (YAML) using Python, inspired by this GitHub issue. I am defining an output like:
outputs=[(OutputSpec(name='drt_model', type='Model'))]
This will be a base output directory on Cloud Storage for a few artifacts, such as model checkpoints and the model itself.
I would like to keep one base output directory but add subdirectories depending on the artifact:
<output_dir_base>/model
<output_dir_base>/checkpoints
<output_dir_base>/tensorboard
but I couldn't find how to concatenate the OutputPathPlaceholder('drt_model') with a string like '/model'.
How can I append extra folder structure like /model or /tensorboard to the OutputPathPlaceholder that KFP will set at run time?
I didn't realize at first that ConcatPlaceholder accepts both artifacts and strings. This is exactly what I wanted to achieve:
ConcatPlaceholder([OutputPathPlaceholder('drt_model'), '/model'])
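At run time the container program simply receives each concatenated path as an ordinary command-line string. A minimal sketch of the receiving side follows, assuming the component passes one such path per artifact type. The flag names (--model-dir, --checkpoint-dir, --tensorboard-dir) and the helper are hypothetical; the directories are created explicitly because they may not exist yet.

```python
import argparse
import os
import tempfile


def prepare_output_dirs(argv):
    """Parse the output paths passed to the container and create them."""
    parser = argparse.ArgumentParser()
    # Hypothetical flags; each would receive "<output_dir_base>/<suffix>".
    parser.add_argument("--model-dir", required=True)
    parser.add_argument("--checkpoint-dir", required=True)
    parser.add_argument("--tensorboard-dir", required=True)
    args = parser.parse_args(argv)
    paths = {
        "model": args.model_dir,
        "checkpoints": args.checkpoint_dir,
        "tensorboard": args.tensorboard_dir,
    }
    for path in paths.values():
        # The subdirectories are not guaranteed to exist beforehand.
        os.makedirs(path, exist_ok=True)
    return paths


# Local demonstration: a temporary directory stands in for the base path.
with tempfile.TemporaryDirectory() as base:
    created = prepare_output_dirs([
        "--model-dir", base + "/model",
        "--checkpoint-dir", base + "/checkpoints",
        "--tensorboard-dir", base + "/tensorboard",
    ])
    print(sorted(os.listdir(base)))
```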

Dataflow BigQuery read from ValueProvider: 'StaticValueProvider' object has no attribute 'projectId'

I'm using the Python SDK for Apache Beam. I'm attempting to read data from BigQuery via a ValueProvider (as the documentation states that these are allowed).
def run(bq_source_table: ValueProvider,
        pipeline_options=None):
    pipeline_options.view_as(SetupOptions).setup_file = "./setup.py"
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from BigQuery" >> ReadFromBigQuery(table=bq_source_table)
        )
The options are declared as follows:
class CPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            "--bq_source_table",
            help="The BigQuery source table name.\n"
                 '"<project>:<dataset>.<table>".'
        )
Executing the pipeline yields the error below:
AttributeError: 'StaticValueProvider' object has no attribute 'projectId' [while running 'Read from BigQuery/Read/SDFBoundedSourceReader/ParDo(SDFBoundedSourceDoFn)/SplitAndSizeRestriction']
Any suggestions on how to resolve this? I do not want to use Flex Templates.
EDIT: Good thing to mention is that the query param does support the ValueProvider. Could this be a bug?
My only suggestion would be to use Flex Templates.

Is it possible to use service accounts to schedule queries in the BigQuery "Schedule Query" feature?

We are using the Beta Scheduled query feature of BigQuery.
Details: https://cloud.google.com/bigquery/docs/scheduling-queries
We have a few ETL scheduled queries running overnight to optimize the aggregation and reduce query cost. This works well and there haven't been many issues.
The problem arises when the person who scheduled the query using their own credentials leaves the organization. I know we can do "update credential" in such cases.
I read through the documentation and experimented a bit, but couldn't find out whether we can use a service account instead of individual accounts to schedule queries.
Service accounts are cleaner, tie into the rest of the IAM framework, and do not depend on a single user.
So if you have any additional information regarding scheduled queries and service accounts, please share.
Thanks for taking time to read the question and respond to it.
Regards
BigQuery Scheduled Query now does support creating a scheduled query with a service account and updating a scheduled query with a service account. Will these work for you?
While it's not supported in the BigQuery UI, it's possible to create a transfer (including a scheduled query) using the Python GCP SDK for DTS, or from the BQ CLI.
The following is an example using Python SDK:
r"""Example of creating TransferConfig using service account.
Usage Example:
1. Install GCP BQ python client library.
2. If it has not been done, please grant p4 service account with
   iam.serviceAccounts.getAccessToken permission on your project.
     $ gcloud projects add-iam-policy-binding {user_project_id} \
       --member='serviceAccount:service-{user_project_number}@'\
       'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
       --role='roles/iam.serviceAccountTokenCreator'
   where {user_project_id} and {user_project_number} are the user project's
   project id and project number, respectively. E.g.,
     $ gcloud projects add-iam-policy-binding my-test-proj \
       --member='serviceAccount:service-123456789@'\
       'gcp-sa-bigquerydatatransfer.iam.gserviceaccount.com' \
       --role='roles/iam.serviceAccountTokenCreator'
3. Set environment var PROJECT_ID to your user project, and
   GOOGLE_APPLICATION_CREDENTIALS to the service account key path. E.g.,
     $ export PROJECT_ID='my_project_id'
     $ export GOOGLE_APPLICATION_CREDENTIALS='./serviceacct-creds.json'
4. $ python3 ./create_transfer_config.py
"""
import os
from google.cloud import bigquery_datatransfer
from google.oauth2 import service_account
from google.protobuf.struct_pb2 import Struct
PROJECT = os.environ["PROJECT_ID"]
SA_KEY_PATH = os.environ["GOOGLE_APPLICATION_CREDENTIALS"]
credentials = (
    service_account.Credentials.from_service_account_file(SA_KEY_PATH))
client = bigquery_datatransfer.DataTransferServiceClient(
    credentials=credentials)
# Scheduled-query parameters; only the SQL text is set here.
params = Struct()
params["query"] = "SELECT CURRENT_DATE() as date, RAND() as val"
transfer_config = {
    "destination_dataset_id": "my_data_set",
    "display_name": "scheduled_query_test",
    "data_source_id": "scheduled_query",
    "params": params,
}
# Full resource path of the project and location.
parent = "projects/{}/locations/us".format(PROJECT)
response = client.create_transfer_config(
    parent=parent, transfer_config=transfer_config)
print(response)
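The BQ CLI route mentioned in this answer can be sketched in the same spirit. The snippet below only assembles a bq mk --transfer_config command line and does not run it. The helper name and the service account email are hypothetical, the dataset, display name, and query reuse the values from the script, and the flags should be checked against the installed bq version.

```python
import json
import shlex


def build_bq_scheduled_query_command(dataset, display_name, query,
                                     service_account):
    """Assemble the argv for creating a scheduled query with the bq CLI.

    Hypothetical helper: it builds the command but does not execute it.
    """
    # The scheduled-query parameters are passed as a JSON object.
    params = json.dumps({"query": query})
    return [
        "bq", "mk", "--transfer_config",
        f"--target_dataset={dataset}",
        f"--display_name={display_name}",
        f"--params={params}",
        "--data_source=scheduled_query",
        f"--service_account_name={service_account}",
    ]


command = build_bq_scheduled_query_command(
    "my_data_set",
    "scheduled_query_test",
    "SELECT CURRENT_DATE() as date, RAND() as val",
    "etl-runner@my-test-proj.iam.gserviceaccount.com",  # hypothetical account
)
print(shlex.join(command))
```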
As far as I know, unfortunately you can't use a service account to directly schedule queries yet. Maybe a Googler will correct me, but the BigQuery docs implicitly state this:
https://cloud.google.com/bigquery/docs/scheduling-queries#quotas
A scheduled query is executed with the creator's credentials and
project, as if you were executing the query yourself
If you need to use a service account (which is great practice BTW), then there are a few workarounds listed here. I've raised a FR here for posterity.
This question is very old; I came across this thread while searching for the same thing.
Yes, it is possible to use a service account to schedule BigQuery jobs.
While creating a scheduled query, click on "Advanced options" and you will get the option to select a service account.
By default it uses the credentials of the requesting user.
[Image: BigQuery "create scheduled query" dialog]

Tensorflow: checkpoints simple load

I have a checkpoint file:
checkpoint-20001 checkpoint-20001.meta
How do I extract variables from it without having to load the previous model, start a session, etc.?
I want to do something like
cp = load(checkpoint-20001)
cp.var_a
It's not documented, but you can inspect the contents of a checkpoint from Python using the class tf.train.NewCheckpointReader.
Here's a test case that uses it, so you can see how the class works.
https://github.com/tensorflow/tensorflow/blob/861644c0bcae5d56f7b3f439696eefa6df8580ec/tensorflow/python/training/saver_test.py#L1203
Since it isn't a documented class, its API may change in the future.