TensorFlow Extended Kubeflow Multiple Workers - tensorflow

I have an issue with TFX on the Kubeflow DAG Runner. The issue is that I only manage to start one pod per run. I don't see any configuration for "workers" except in the Apache Beam arguments, which doesn't help.
Running the CSV load on one pod results in an OOMKilled error because the file is more than 5 GB. I tried splitting the file into 100 MB parts, but that did not help either.
So my question is: how do I run a TFX job/stage on Kubeflow across multiple "worker" pods, and is that even possible?
Here is the code I've been using:
# Imports (TFX 0.26)
from tfx.components import CsvExampleGen, StatisticsGen
from tfx.orchestration import pipeline
from tfx.orchestration.kubeflow import kubeflow_dag_runner
from tfx.utils.dsl_utils import external_input

examples = external_input(data_root)
example_gen = CsvExampleGen(input=examples)
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])

dsl_pipeline = pipeline.Pipeline(
    pipeline_name=pipeline_name,
    pipeline_root=pipeline_root,
    components=[
        example_gen,
        statistics_gen,
    ],
    enable_cache=True,
    beam_pipeline_args=['--num_workers=%d' % 5],
)

if __name__ == '__main__':
    tfx_image = 'custom-aws-imgage:tfx-0.26.0'
    config = kubeflow_dag_runner.KubeflowDagRunnerConfig(
        kubeflow_metadata_config=kubeflow_dag_runner.get_default_kubeflow_metadata_config(),
        tfx_image=tfx_image)
    kfp_runner = kubeflow_dag_runner.KubeflowDagRunner(config=config)
    # KubeflowDagRunner compiles the DSL pipeline object into a KFP pipeline package,
    # named <pipeline_name>.tar.gz by default.
    kfp_runner.run(dsl_pipeline)
Environment:
Docker image: tensorflow/tfx:0.26.0 with boto3 installed (aws related issue)
Kubernetes: AWS EKS latest
Kubeflow: 1.0.4

It seems that this is not possible at this time.
See:
https://github.com/kubeflow/kubeflow/issues/1583
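Within the single pod, the most that seems possible is to let Beam's DirectRunner use multiple local processes, which parallelizes CPU work but still cannot spread the memory load over several pods. A minimal sketch, assuming the default DirectRunner and that these flags exist in the Beam version bundled with TFX 0.26:
beam_pipeline_args=[
    '--direct_running_mode=multi_processing',
    # 0 lets Beam pick the number of worker processes from the available CPUs.
    '--direct_num_workers=0',
]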

Related

Deploy TensorFlow probability regression model as Sagemaker endpoint

I would like to develop a TensorFlow Probability regression model locally and deploy it as a SageMaker endpoint. I have deployed standard XGB models like this previously and understand that one can deploy a TensorFlow model like so:
from sagemaker.tensorflow.model import TensorFlowModel

tensorflow_model = TensorFlowModel(
    name=tensorflow_model_name,
    source_dir='code',
    entry_point='inference.py',
    model_data=<TENSORFLOW_MODEL_S3_URI>,
    role=role,
    framework_version='<TENSORFLOW_VERSION>')

tensorflow_model.deploy(endpoint_name=<ENDPOINT_NAME>,
                        initial_instance_count=1,
                        instance_type='ml.m5.4xlarge',
                        wait=False)
However, I do not think this will cover for example the dependency:
import tensorflow_probability as tfp
Do I need to use script mode or Docker instead? Any pointer would be very much appreciated. Thanks.
Another way would be to add "tensorflow_probability" to "requirements.txt" and pass both your code and the requirements file via dependencies if you are not using source_dir:
Model(entry_point='inference.py',
      dependencies=['code', 'requirements.txt'])
You can create a requirements.txt in your source_dir (code) and place tensorflow-probability in it. SageMaker will install the dependencies listed in requirements.txt before running your script.
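For example, a sketch of the layout this implies (the file names are the ones already used in the question; pinning a version is optional):
code/
    inference.py        # can now do: import tensorflow_probability as tfp
    requirements.txt    # contains a line such as: tensorflow-probability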

tensorflow data validation tfdv fails on google cloud dataflow with "Can't get attribute 'NumExamplesStatsGenerator' "

I am following this "get started" TensorFlow tutorial on how to run TFDV on Apache Beam on Google Cloud Dataflow. My code is very similar to the one in the tutorial:
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions
PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'
# downloaded locally with: pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
# (it would be great to have it on cloud storage instead) PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
# Create and set your PipelineOptions.
options = PipelineOptions()
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'
setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]
# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2
print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")
The code above starts a Dataflow job, but unfortunately it fails with the following error:
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
return dill.loads(s)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
return load(file, ignore, **kwds)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>
I couldn't find anything similar on the internet.
If it can help, on my local machine (MACOS) I have the following versions:
Apache Beam version: 2.34.0
Tensorflow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0
Apache Beam on the cloud runs with the Apache Beam Python 3.8 SDK 2.34.0.
BONUS QUESTION: Another question I have is around PATH_TO_WHL_FILE. I tried to put the wheel in a storage bucket, but Beam doesn't seem to be able to pick it up; only a local path works, which is a problem because it makes this code harder to distribute. What would be a good practice for distributing this wheel file?
Based on the name of the attribute, NumExamplesStatsGenerator is a generator, which is not pickle-able.
However, I couldn't find that attribute in the module's current code; a search indicates that the module does contain it in 1.4.0.
So you may want to try a newer TFDV version. Note also that the wheel you stage to the workers (1.7.0) does not match the TFDV version on your local machine (1.4.0), so the pickled pipeline and the worker code may be out of sync.
PATH_TO_WHL_FILE indicates a file to stage/distribute to Dataflow for execution, so you can use a file on GCS.
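A minimal sketch of staging the wheel from GCS, reusing the bucket path from the commented-out line in the question:
# Upload the locally downloaded wheel once, e.g.:
#   gsutil cp tensorflow_data_validation-1.7.0-*.whl gs://my-bucket/wheels/
setup_options.extra_packages = [
    'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
]
Beam then stages the package to the Dataflow workers itself.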

Can't get the subnet id of azure using python

I want to get the resource ID of a subnet in a virtual network in Azure using Python. The call I have used is this line: subnets = network_client.subnets.get(resource_group, 'XXX', 'XXX')
But what I get is an error: HttpResponseError: (InvalidApiVersionParameter) The api-version '2021-02-01' is invalid. The supported versions are '2021-04-01,2021-01-01,2020-10-01,2020-09-01,2020-08-01,2020-07-01,2020-06-01,2020-05-01,2020-01-01,2019-11-01,2019-10-01,2019-09-01,2019-08-01,2019-07-01,2019-06-01,2019-05-10,2019-05-01,2019-03-01,2018-11-01,2018-09-01,2018-08-01,2018-07-01,2018-06-01,2018-05-01,2018-02-01,2018-01-01,2017-12-01,2017-08-01,2017-06-01,2017-05-10,2017-05-01,2017-03-01,2016-09-01,2016-07-01,2016-06-01,2016-02-01,2015-11-01,2015-01-01,2014-04-01-preview,2014-04-01,2014-01-01,2013-03-01,2014-02-26,2014-04'.
I have tried different API versions, but I keep getting errors. Any ideas?
The version of azure-mgmt-network I used is 19.0.0
Please make sure that you have the two packages below installed before executing the script:
pip install azure-mgmt-network
pip install azure-identity
Then use the script below to get the subnet ID of a specific subnet in your subscription:
from azure.identity import AzureCliCredential
from azure.mgmt.network import NetworkManagementClient
credential = AzureCliCredential()
subscription_id = "948d4068-xxxx-xxxx-xxxx-e00a844e059b"
network_client = NetworkManagementClient(credential, subscription_id)
resource_group_name = "ansumantest"
location = "West US 2"
virtual_network_name = "ansuman-vnet"
subnet_name = "acisubnet"
subnet = network_client.subnets.get(resource_group_name, virtual_network_name, subnet_name)
print(subnet.id)
Output: the full resource ID of the subnet is printed.
Note: I am using pip 21.2.4 and Python 3.9, and the same azure-mgmt-network version as you (19.0.0).
If you are still facing the issue, try installing the newer one, i.e. 19.1.0.
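For example (assuming pip points at the same environment the script runs in):
pip install --upgrade azure-mgmt-network==19.1.0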

MultiWorkerMirroredStrategy() not working on Google AI-Platform (CMLE)

I'm getting the following error while using MultiWorkerMirroredStrategy() to train a Custom Estimator on Google AI Platform (CMLE).
ValueError: Unrecognized task_type: 'master', valid task types are: "chief", "worker", "evaluator" and "ps".
Both MirroredStrategy() and ParameterServerStrategy() are working fine on AI Platform with their respective config.yaml files. I'm currently not providing device scopes for any operations, nor am I providing any device filter in the session config, tf.ConfigProto(device_filters=device_filters).
The config.yaml file which I'm using for training with MultiWorkerMirroredStrategy() is:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4
The masterType input is mandatory for submitting a training job on AI Platform.
Note: it shows 'chief' as a valid task type and 'master' as invalid. I'm providing tensorflow-gpu==1.14.0 in setup.py for the trainer package.
I ran into the same issue. As far as I understand, the MultiWorkerMirroredStrategy config values are different from the other strategies and from what CMLE provides by default: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#multi-worker_configuration
It doesn't support a 'master' node; it calls it 'chief' instead.
If you are running your jobs in container, you can try using 'useChiefInTfConfig' flag, see documentation here: https://developers.google.com/resources/api-libraries/documentation/ml/v1/python/latest/ml_v1.projects.jobs.html
Otherwise you might try hacking your TF_CONFIG manually:
TF_CONFIG = os.environ.get('TF_CONFIG')
if TF_CONFIG and '"master"' in TF_CONFIG:
os.environ['TF_CONFIG'] = TF_CONFIG.replace('"master"', '"chief"')
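Note that this rewrite has to run before the strategy is created, because MultiWorkerMirroredStrategy resolves the cluster from TF_CONFIG at construction time. For illustration, the cluster spec it expects looks roughly like this (host names are placeholders):
# {"cluster": {"chief": ["host0:2222"], "worker": ["host1:2222", "host2:2222"]},
#  "task": {"type": "chief", "index": 0}}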
(1) This appears to be a bug with MultiWorkerMirroredStrategy, then; please file a bug in TensorFlow. In TensorFlow 1.x it should be using master, and in TensorFlow 2.x it should be using chief. The code is (wrongly) asking for chief, and AI Platform (because you are using 1.14) is providing only master. Incidentally: master = chief + evaluator.
(2) Do not add tensorflow to your setup.py. Specify the TensorFlow version you want AI Platform to use via the --runtime-version flag to gcloud (see https://cloud.google.com/ml-engine/docs/runtime-version-list).
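A sketch of such a submission, with placeholder job name, package path, and module name:
gcloud ai-platform jobs submit training my_job \
    --module-name trainer.task \
    --package-path trainer/ \
    --region us-central1 \
    --config config.yaml \
    --runtime-version 1.14 \
    --python-version 3.5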

Unable to build the TensorFlow model using bazel build

I have set up TensorFlow Serving in a Docker machine running on a Windows machine by following the instructions from https://tensorflow.github.io/serving/serving_basic. I am able to build and run the mnist_model successfully. However, when I try to build the model for the wide_n_deep_tutorial example by running the command "bazel build //tensorflow_serving/example:wide_n_deep_tutorial.py", the build does not succeed: no files are generated in the bazel-bin folder.
Since no error message is displayed during the build, I am unable to figure out the problem. I would really appreciate it if someone could help me debug and solve this.
You are just guessing at the command line here, as there is no target in the TensorFlow Serving BUILD file for wide_n_deep_tutorial.py.
You can only build the mnist and inception targets as of today.
Adding a target for the wide and deep model to the BUILD file solves the problem.
I added the following to the BUILD file:
py_binary(
    name = "wide_n_deep_model",
    srcs = [
        "wide_n_deep_model.py",
    ],
    deps = [
        "//tensorflow_serving/apis:predict_proto_py_pb2",
        "//tensorflow_serving/apis:prediction_service_proto_py_pb2",
        "@org_tensorflow//tensorflow:tensorflow_py",
    ],
)
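With that target in place, the build command refers to the target name rather than the .py file (assuming the script is saved as wide_n_deep_model.py under tensorflow_serving/example/):
bazel build //tensorflow_serving/example:wide_n_deep_model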