TensorFlow object detection: using transfer learning when running locally

From the TensorFlow docs, we can use transfer learning for object detection when running on the cloud. Can we also use transfer learning when running locally? I see that there is a doc about running locally, but I cannot find any documentation about transfer learning when running on a local machine. I expected it to be in the pipeline config, but I cannot find it. So how can we configure transfer learning when running locally?

You are correct. You can use transfer learning when running locally.
The steps are similar to the instructions for running the pets example on Google Cloud, but your training config should reference your local file system instead of a path on GCS. In particular, you need to set the train_config.fine_tune_checkpoint field to point to the unzipped checkpoint.
See the sample resnet101 config for more details. You should only have to change the PATH_TO_BE_CONFIGURED strings to point to the correct values.
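For illustration only, a local run might look like the following; the exact script name varies between releases of the Object Detection API (older ones use train.py, newer ones model_main.py), and all paths below are placeholders:
# Point train_config.fine_tune_checkpoint in your pipeline config at the
# unzipped checkpoint, e.g.
#   fine_tune_checkpoint: "/home/user/models/faster_rcnn_resnet101_coco/model.ckpt"
# then start training locally:
python object_detection/train.py \
  --logtostderr \
  --pipeline_config_path=/home/user/configs/faster_rcnn_resnet101_pets.config \
  --train_dir=/home/user/training_output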

Related

Tensorflow Serving: How to re-train a model that is currently served in production using tensorflow serving?

How to re-train a model with new data that is currently served in production using TensorFlow Serving?
Do we have to train the model manually and serve it again? Or is there an automated way of doing this?
I am using tensorflow serving with docker.
Basically the idea is that:
Considering there is already a model served using TensorFlow Serving, and in the future I get a bunch of additional data that I want the model to be fitted with, how can we apply this training to the same model?
Question 1: I do have a script to train the model, but does the training have to be done locally/manually?
Answer: As far as I understand, you are asking whether it should be done locally or on some remote server. You can do it wherever is convenient; the important step for TensorFlow Serving is to save the model in the format that the server can use. Please refer to the link below on how to save it, as well as how to load it into the serving Docker container.
serving tensorflow model
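As an aside (not from the linked answer; the path and version number below are placeholders), TensorFlow Serving loads models from numbered version sub-directories, and you can sanity-check an export with saved_model_cli:
# Inspect the exported SavedModel that will live under the base_path,
# e.g. /tmp/my_first_model/1/ containing saved_model.pb and variables/.
saved_model_cli show --dir /tmp/my_first_model/1 --all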
Question 2: Suppose I create an entirely new model (apart from modelA currently served), how can I load it into tensorflow serving again? Do I have to manually load it to the docker target path?
Answer: Yes, if you are loading it without using a serving config, you will have to manually shut down the container, remap the path in the command, and then load it into the Docker container again. That is where the serving config helps: it lets you load models at runtime.
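For reference, a minimal sketch of manually (re)loading a single model into the serving container; the host path and model name are placeholders:
# Mount the SavedModel directory into the container and serve it over REST;
# swapping the model this way means stopping and re-running the container.
docker run -p 8501:8501 \
  --mount type=bind,source=/tmp/my_first_model,target=/models/my_first_model \
  -e MODEL_NAME=my_first_model \
  -t tensorflow/serving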
Question 3: The TFX documentation says to update the model.config file to add new models, but how can I update it while the server is running?
Answer: A basic configuration file would look like this:
model_config_list {
  config {
    name: 'my_first_model'
    base_path: '/tmp/my_first_model/'
    model_platform: 'tensorflow'
  }
  config {
    name: 'my_second_model'
    base_path: '/tmp/my_second_model/'
    model_platform: 'tensorflow'
  }
}
This file needs to be mapped into the Docker container before starting it, and of course so do the paths where the different models are located. When this config file is changed, the serving Docker container will load the new models accordingly. You can also maintain different versions of the same model. For more info please refer to this link: serving config. The server can be told to look up this file periodically, and as soon as it detects a change it will load the new models without the need to restart the Docker container (see the example below).
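As a hedged illustration (host paths and the poll interval are placeholders; the poll flag is available in recent tensorflow/serving images), starting the container with a model config file and periodic re-polling could look like this:
# Serve all models listed in models.config and re-read the file every 60s,
# so newly added models are loaded without restarting the container.
docker run -p 8500:8500 -p 8501:8501 \
  --mount type=bind,source=/path/to/models.config,target=/models/models.config \
  --mount type=bind,source=/tmp/my_first_model,target=/tmp/my_first_model \
  --mount type=bind,source=/tmp/my_second_model,target=/tmp/my_second_model \
  -t tensorflow/serving \
  --model_config_file=/models/models.config \
  --model_config_file_poll_wait_seconds=60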

Submit a Keras training job to Google cloud

I am trying to follow this tutorial:
https://medium.com/@natu.neeraj/training-a-keras-model-on-google-cloud-ml-cb831341c196
to upload and train a Keras model on Google Cloud Platform, but I can't get it to work.
Right now I have downloaded the package from GitHub, and I have created a cloud environment with AI-Platform and a bucket for storage.
I am uploading the files (with the suggested folder structure) to my Cloud Storage bucket (basically to the root of my storage), and then trying the following command in the cloud terminal:
gcloud ai-platform jobs submit training JOB1 \
  --module-name=trainer.cnn_with_keras \
  --package-path=./trainer \
  --job-dir=gs://mykerasstorage \
  --region=europe-north1 \
  --config=gs://mykerasstorage/trainer/cloudml-gpu.yaml
But I get errors. First, the cloudml-gpu.yaml file can't be found ("no such folder or file"), and when I just remove that flag, I get errors because it says the __init__.py file is missing, but it isn't, even though it is empty (which it was when I downloaded it from the tutorial GitHub). I am guessing I haven't uploaded it the right way.
Any suggestions of how I should do this? There is really no info on this in the tutorial itself.
I have read in another guide that it is possible to let gcloud package and upload the job directly, but I am not sure how to do this or where to write the commands: in my terminal with the gcloud command, or in the Cloud Shell in the browser? And how do I define the path where my Python files are located?
I should mention that I am working on a Mac, and I'm pretty new to using Keras and Python.
I was able to follow the tutorial you mentioned successfully, with some modifications along the way.
I will mention all the steps although you made it halfway as you mentioned.
First of all create a Cloud Storage Bucket for the job:
gsutil mb -l europe-north1 gs://keras-cloud-tutorial
To answer your question on where you should write these commands: it depends on where you want to store the files that you will download from GitHub. In the tutorial you posted, the writer is using his own computer to run the commands, and that's why he initializes the gcloud command with gcloud init. However, you can submit the job from the Cloud Shell too, if you download the needed files there.
The only files we need from the repository are the trainer folder and the setup.py file. So, if we put them in a folder named keras-cloud-tutorial we will have this file structure:
keras-cloud-tutorial/
├── setup.py
└── trainer
    ├── __init__.py
    ├── cloudml-gpu.yaml
    └── cnn_with_keras.py
Now, a possible reason for the ImportError: No module named eager error is that you might have changed the runtimeVersion inside the cloudml-gpu.yaml file. As we can read here, eager was introduced in Tensorflow 1.5. If you have specified an earlier version, this error is to be expected. So the structure of cloudml-gpu.yaml should be like this:
trainingInput:
  scaleTier: CUSTOM
  # standard_gpu provides 1 GPU. Change to complex_model_m_gpu for 4 GPUs
  masterType: standard_gpu
  runtimeVersion: "1.5"
Note: "standard_gpu" is a legacy machine type.
Also, the setup.py file should look like this:
from setuptools import setup, find_packages

setup(name='trainer',
      version='0.1',
      packages=find_packages(),
      description='Example on how to run keras on gcloud ml-engine',
      author='Username',
      author_email='user@gmail.com',
      install_requires=[
          'keras==2.1.5',
          'h5py'
      ],
      zip_safe=False)
Attention: As you can see, I have specified that I want version 2.1.5 of keras. This is because if I don't do that, the latest version is used which has compatibility issues with versions of Tensorflow earlier than 2.0.
If everything is set, you can submit the job by running the following command inside the folder keras-cloud-tutorial:
gcloud ai-platform jobs submit training test_job --module-name=trainer.cnn_with_keras --package-path=./trainer --job-dir=gs://keras-cloud-tutorial --region=europe-west1 --config=trainer/cloudml-gpu.yaml
Note: I used gcloud ai-platform instead of gcloud ml-engine command although both will work. At some point in the future though, gcloud ml-engine will be deprecated.
Attention: Be careful when choosing the region in which the job will be submitted. Some regions do not support GPUs and will throw an error if chosen. For example, if in my command I set the region parameter to europe-north1 instead of europe-west1, I will receive the following error:
ERROR: (gcloud.ai-platform.jobs.submit.training) RESOURCE_EXHAUSTED:
Quota failure for project . The request for 1 K80
accelerators exceeds the allowed maximum of 0 K80, 0 P100, 0 P4, 0 T4,
0 TPU_V2, 0 TPU_V3, 0 V100. To read more about Cloud ML Engine quota,
see https://cloud.google.com/ml-engine/quotas.
- '@type': type.googleapis.com/google.rpc.QuotaFailure violations:
- description: The request for 1 K80 accelerators exceeds the allowed maximum of
0 K80, 0 P100, 0 P4, 0 T4, 0 TPU_V2, 0 TPU_V3, 0 V100.
subject:
You can read more about the features of each region here and here.
EDIT:
After the completion of the training job, there should be 3 folders in the bucket that you specified: logs/, model/ and packages/. The model is saved in the model/ folder as an .h5 file. Keep in mind that if you set a specific folder for the destination you should include the '/' at the end. For example, you should set gs://my-bucket/output/ instead of gs://my-bucket/output. If you do the latter you will end up with the folders output, outputlogs and outputmodel, with the packages inside output. The job page link points to the output folder, so make sure to check the rest of the bucket too!
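For instance (the bucket name is a placeholder), listing the output prefix after a completed job should show the three folders mentioned above:
# Check the bucket layout once the job finishes.
gsutil ls gs://my-bucket/output/
# expected entries: .../output/logs/  .../output/model/  .../output/packages/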
In addition, on the AI-Platform job page you should be able to see information regarding CPU, GPU and network utilization.
Also, I would like to clarify something as I saw that you posted some related questions as an answer:
Your local environment, whether it is your personal Mac or the Cloud Shell, has nothing to do with the actual training job. You don't need to install any specific package or framework locally. You just need to have the Google Cloud SDK installed (in the Cloud Shell it is, of course, already installed) to run the appropriate gcloud and gsutil commands. You can read more on how exactly training jobs on the AI-Platform work here.
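For example, a minimal local setup only needs the SDK itself (the project ID below is a placeholder):
# Authenticate and point the SDK at your project; no TensorFlow or Keras
# installation is required on your own machine.
gcloud auth login
gcloud config set project my-gcp-project
gsutil ls gs://keras-cloud-tutorial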
I hope that you will find my answer helpful.
I got it to work halfway now by not uploading the files manually but just running the gcloud commands from my local terminal; however, there was an error while it was running, ending in "job failed".
It seems it was trying to import something from the TensorFlow backend ("from tensorflow.python.eager import context"), but there was an ImportError: No module named eager.
I have tried "pip install tf-nightly", which was suggested elsewhere, but it says I don't have permission, or I lose the connection to Cloud Shell (exactly when I try to run the command).
I have also tried making a virtual environment locally to match the one on gcloud, using Conda with Python 3.5, TensorFlow 1.14.0 and Keras 2.2.5, which should be supported on gcloud.
The python program works fine in this environment locally, but I still get the (ImportError: No module named eager) when trying to run the job on gcloud.
I am passing the flag --python-version 3.5 when submitting the job, but when I run "python -V" in the Google Cloud Shell, it says Python 2.7. Could this be the issue? I have not found a way to update the Python version from the Cloud Shell prompt, but it says Google Cloud should support Python 3.5. If this is indeed the issue, any suggestions on how to upgrade the Python version on Google Cloud?
It is also possible to manually start a new job in the Google Cloud web interface. Doing this, I get a different error message: ERROR: Could not find a version that satisfies the requirement cnn_with_keras.py (from versions: none) and No matching distribution found for cnn_with_keras.py, where cnn_with_keras.py is my Python code from the tutorial, which runs fine locally.
Really don't know what to do next. Any suggestions or tips would be very helpful!
The issue with the GPU is solved now; it was something as simple as my Google Cloud account having GPU settings disabled and needing to be upgraded.

Nightly TF / Cloned TFX - how to manage Image for Kubeflow?

When I access my Kubeflow endpoint to upload and run a pipeline using a cloned TFX, the process starts hanging at the first step producing this message:
"This step is in Pending state with this message: ImagePullBackOff: Back-off pulling image "tensorflow/tfx:0.14.0dev", which is the same image used in the created pipeline yaml file.
My overall goal is to build an ExampleGen for tfrecords files, just as described in the guide here. The most recent tfx version in pip is 0.13 and does not yet include the necessary functions. For this reason, I install tf-nightly and clone/build tfx (dev-version 0.14). Doing so and installing some additional modules, e.g. tensorflow_data_validation, I can now create my pipeline using the tfx components and including an ExampleGen for tfrecords files. I finally build the pipeline with the KubeflowRunner. Yet this yields the error stated above.
I now wonder about an appropriate way to address this. I guess one way would be to build an image myself with the specified versions, but maybe there is a more practical way?
TFX doesn't have a nightly image build as yet. Currently, it defaults to using the image tagged with the version of the library you use to build the pipeline, hence the reason the tag is 0.14dev0. This is the current version at HEAD, see here:
https://github.com/tensorflow/tfx/blob/a1f43af5e66f9548ae73eb64813509445843eb53/tfx/version.py#L17
You can build your own image and push it somewhere, for example gcr.io/your-gcp-project/your-image-name:tag, and specify that the pipeline use this image instead, by customizing the tfx_image argument to the pipeline:
https://github.com/tensorflow/tfx/blob/74f9b6ab26c51ebbfb5d17826c5d5288a67dcf85/tfx/orchestration/kubeflow/base_component.py#L54
See for example:
https://github.com/tensorflow/tfx/blob/b3796fc37bd4331a4e964c822502ba5096ad4bb6/tfx/examples/chicago_taxi_pipeline/taxi_pipeline_kubeflow.py#L243
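As a rough sketch (registry path, image name and tag are placeholders, and the Dockerfile to use depends on your cloned TFX checkout), building and pushing such an image could look like this:
# Build an image from your modified TFX checkout, push it to a registry the
# Kubeflow cluster can pull from, then pass its name via the tfx_image
# argument when building the pipeline.
docker build -t gcr.io/your-gcp-project/your-image-name:tag .
docker push gcr.io/your-gcp-project/your-image-name:tag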

How to fix trainer package not found in AI Platform GPU-distributed training job

I'm attempting to train a Tensorflow Estimator on AI Platform. The model trains locally perfectly fine, albeit extremely slowly, but as soon as I try to run distributed GPU training on AI Platform I run into this error:
CommandException: No URLs matched: gs://path/.../trainer-0.1.tar.gz
I have my code packaged with the trainer module as recommended by Google Cloud AI Platform. Any help would be appreciated!
I was actually able to fix my issue: it appears that if I don't set up a staging bucket, then the model dir where checkpoints are stored will overwrite the trainer package before the worker replicas are able to download the trainer! I'm unsure how checkpoints even began to be stored when the worker replicas hadn't all downloaded the trainer yet, but adding a staging bucket that was different from my model dir fixed this.
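For illustration, a submission along these lines keeps the staged trainer package separate from the checkpoint directory (bucket names, module name and region below are placeholders, not the actual values from the question):
# --staging-bucket holds the uploaded trainer-0.1.tar.gz, while --job-dir
# (where checkpoints are written) points somewhere else entirely.
gcloud ai-platform jobs submit training my_training_job \
  --package-path=./trainer \
  --module-name=trainer.task \
  --staging-bucket=gs://my-staging-bucket \
  --job-dir=gs://my-model-bucket/model_dir \
  --region=us-central1 \
  --config=config.yaml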

Running a JSON graph locally

I've been using Flowhub.io to do my development on the Node.js runtime. Now that the GUI-based design is done, I'm ready to take it offline and run the code via the command line. How do I do this? I have the JSON file corresponding to the graph I created online, but I'm not sure how to use the noflo nodejs module.
Could someone help me by showing me an example of how to load a graph using the noflo module, please? Thanks!
If you want to run an existing graph, you can use the --graph option.
noflo-nodejs --graph graphs/MyMainGraph.json
If you also want the process to exit when the network stops, you can pass --batch.
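Combined, that would be:
noflo-nodejs --graph graphs/MyMainGraph.json --batch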
PS: I added this to the noflo-nodejs README.