Google Cloud ML Engine: pass multiple file paths as arguments - tensorflow

I am trying to run a job on Google Cloud ML Engine and can't seem to pass multiple file paths as arguments to the parser.
Here is what I am writing in the terminal:
JOB_NAME=my_job_name
BUCKET_NAME=my_bucket_name
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
DATA_PATH=gs://$BUCKET_NAME/my_data_directory
REGION=us-east1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--file-path "${DATA_PATH}/*" \
--num-epochs 10
Here my_data_directory contains multiple files that I later want to read. The problem is that --file-path ends up containing only ['gs://my_bucket_name/my_data_directory'] and not a list of the files in that directory.
How do I fix this?
Many thanks in advance.

Since everything you pass after the -- \ line is forwarded as user arguments, how these arguments are handled depends entirely on the trainer you defined. I would go back and modify the trainer program so that it either treats the directory specially or accepts multiple paths, like this:
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
--scale-tier STANDARD_1 \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--verbosity DEBUG \
--eval-steps 100
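On the trainer side, the multiple-paths approach could look like this minimal sketch (assuming TF 1.x and argparse; the flag names mirror your command but are otherwise illustrative). tf.gfile.Glob understands gs:// wildcards, so it also covers the single-pattern case:
import argparse
import tensorflow as tf

parser = argparse.ArgumentParser()
# nargs='+' accepts one or more values: --file-path a b c
parser.add_argument('--file-path', nargs='+', required=True)
parser.add_argument('--num-epochs', type=int, default=10)
args = parser.parse_args()

# Expand any wildcard patterns (e.g. gs://bucket/dir/*) into concrete
# file paths; tf.gfile.Glob understands gs:// URLs.
files = []
for pattern in args.file_path:
    files.extend(tf.gfile.Glob(pattern))
print(files)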
Some links that will be helpful for developing your own trainer: [1] [2]

Related

Please post working tflite_convert command line examples

I am converting several models from TensorFlow.js, Keras, and TensorFlow to TensorFlow Lite and then to TensorFlow Micro C-header files.
I can do the main conversions but have found little information about using tflite_convert for quantization.
I am wondering if people could post working command-line examples. As far as I can tell we are encouraged to use Python to do the conversions, but I would prefer to stay on the command line.
I have summarized what I am working on here https://github.com/hpssjellis/my-examples-for-the-arduino-portentaH7/tree/master/m09-Tensoflow/tfjs-convert-to-arduino-header.
This is what I have so far; it works for converting a saved TensorFlow.js model.json into a .pb file, which is converted to a .tflite and then to a C header file to run on an Arduino-style microcontroller.
tensorflowjs_converter --input_format=tfjs_layers_model --output_format=keras_saved_model ./model.json ./
tflite_convert --keras_model_file ./ --output_file ./model.tflite
xxd -i model.tflite model.h
But my files do not get any smaller when I try any quantization.
The tflite_convert command-line help at TensorFlow is not specific enough: https://www.tensorflow.org/lite/convert/cmdline
Here are some examples I have found using either tflite_convert or tensorflowjs_converter; some seem to work on other people's models but do not seem to work on my own:
tflite_convert --output_file=/home/wang/Downloads/deeplabv3_mnv2_pascal_train_aug/optimized_graph.tflite --graph_def_file=/home/wang/Downloads/deeplabv3_mnv2_pascal_train_aug/frozen_inference_graph.pb --inference_type=FLOAT --inference_input_type=QUANTIZED_UINT8 --input_arrays=ImageTensor --input_shapes=1,513,513,3 --output_arrays=SemanticPredictions --allow_custom_ops
tflite_convert --graph_def_file=<your_frozen_graph> \
--output_file=<your_chosen_output_location> \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=QUANTIZED_UINT8 \
--output_arrays=<your_output_arrays> \
--input_arrays=<your_input_arrays> \
--mean_values=<mean of input training data> \
--std_dev_values=<standard deviation of input training data>
tflite_convert --graph_def_file=/tmp/frozen_cifarnet.pb \
--output_file=/tmp/quantized_cifarnet.tflite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=QUANTIZED_UINT8 \
--output_arrays=CifarNet/Predictions/Softmax \
--input_arrays=input \
--mean_values 121 \
--std_dev_values 64
tflite_convert \
--graph_def_file=frozen_inference_graph.pb \
--output_file=new_graph.tflite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--input_shape=1,600,600,3 \
--input_array=image_tensor \
--output_array=detection_boxes,detection_scores,detection_classes,num_detections \
--inference_type=QUANTIZED_UINT8 \
--inference_input_type=QUANTIZED_UINT8 \
--mean_values=128 \
--std_dev_values=127
tflite_convert --graph_def_file=~YOUR PATH~/yolov3-tiny.pb --output_file=~YOUR PATH~/yolov3-tiny.tflite --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE --input_shape=1,416,416,3 --input_array=~YOUR INPUT NAME~ --output_array=~YOUR OUTPUT NAME~ --inference_type=FLOAT --input_data_type=FLOAT
tflite_convert \
--graph_def_file=built_graph/yolov2-tiny.pb \
--output_file=built_graph/yolov2_graph.lite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--input_shape=1,416,416,3 \
--input_array=input \
--output_array=output \
--inference_type=FLOAT \
--input_data_type=FLOAT
tflite_convert --graph_def_file=frozen_inference_graph.pb --output_file=optimized_graph.lite --input_format=TENSORFLOW_GRAPHDEF --output_format=TFLITE --input_shape=1,1024,1024,3 --input_array=image_tensor --output_array=Softmax
tensorflowjs_converter --quantize_float16 --input_format=tf_hub 'https://tfhub.dev/google/imagenet/mobilenet_v1_100_224/classification/1' ./
tensorflowjs_converter --control_flow_v2=True --input_format=tf_hub --quantize_uint8=* --strip_debug_ops=True --weight_shard_size_bytes=4194304 --output_node_names='Postprocessor/ExpandDims_1,Postprocessor/Slice' --signature_name 'serving_default' https://tfhub.dev/tensorflow/ssd_mobilenet_v2/2 test
If anyone has working examples of quantization that they can explain (especially what is important to include and what is optional), that would be very helpful. I use Netron to visualize the models, so I should be able to see when a float input has been changed to int8. A bit of an explanation would be helpful too.
I recently tried this set of commands, which ran without error, but the quantized file was larger than the un-quantized file:
tensorflowjs_converter --input_format=tfjs_layers_model --output_format=keras_saved_model ./model.json ./saved_model
tflite_convert --keras_model_file ./saved_model --output_file ./model.tflite
xxd -i model.tflite model.h
tflite_convert --saved_model_dir=./saved_model \
--output_file=./model_int8.tflite \
--input_format=TENSORFLOW_GRAPHDEF \
--output_format=TFLITE \
--inference_type=QUANTIZED_UINT8 \
--output_arrays=1,1 \
--input_arrays=1,2 \
--mean_value=128 \
--std_dev_value=127
xxd -i model_int8.tflite model_int8.h
The Python way is easy as well, and you can find official examples here:
https://www.tensorflow.org/lite/performance/post_training_quantization
There is an entire section on this. I think you didn't train the model yourself, so post-training quantization is what you are looking for.
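For reference, here is a minimal Python sketch of post-training quantization (assuming TF 2.x and the Keras SavedModel produced above; the input shape in the calibration data is a hypothetical placeholder you would replace with your model's real input shape):
import numpy as np
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model('./saved_model')

# Dynamic-range quantization: weights are stored as int8, which typically
# shrinks the file to roughly a quarter of its float32 size.
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# For full-integer quantization (int8 inputs/outputs, visible in Netron),
# additionally supply a representative dataset to calibrate activations.
def representative_dataset():
    for _ in range(100):
        # Placeholder data: replace with real samples shaped like your input.
        yield [np.random.rand(1, 28, 28, 1).astype(np.float32)]

converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open('model_int8.tflite', 'wb') as f:
    f.write(tflite_model)
The resulting file can then be turned into a header with xxd -i model_int8.tflite model_int8.h as before.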

ValueError: Invalid tensors 'normalized_input_image_tensor' were found

OS: Ubuntu 18.04
TensorFlow model: ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03
I have retrained the ssd_mobilenet_v1_quantized_coco model using my own data. I successfully generated frozen_inference_graph.pb using the script export_inference_graph.py, but when I ran the script tflite_convert.py, the error "ValueError: Invalid tensors 'normalized_input_image_tensor' were found" was raised. The parameters of the tflite_convert.py script are:
python tflite_convert.py \
--output_file="converted_quant_traffic_tflite/traffic_tflite_graph.tflite" \
--graph_def_file="traffic_inference_graph_lite/frozen_inference_graph.pb" \
--input_arrays='normalized_input_image_tensor' \
--inference_type=QUANTIZED_UINT8 \
--output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
--mean_values=128 \
--std_dev_values=128 \
--input_shapes=1,300,300,3 \
--default_ranges_min=0 \
--default_ranges_max=6 \
--change_concat_input_ranges=false \
--allow_nudging_weights_to_use_fast_gemm_kernel=true \
--allow_custom_ops
Obviously, input_arrays was not set correctly. Please advise me on how to set input_arrays.
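One way to find valid names for input_arrays is to list the node names in the frozen graph (a minimal sketch, assuming a TF 1.x-style GraphDef; the path mirrors the command above):
import tensorflow as tf

# Load the frozen graph and list node names to find valid
# --input_arrays / --output_arrays values.
graph_def = tf.compat.v1.GraphDef()
with open('traffic_inference_graph_lite/frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

for node in graph_def.node:
    if node.op == 'Placeholder':
        print('input candidate:', node.name)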

Evaluation and Visualization of DeeplabV3 completed or not?

Disclaimer: this is the first time I am trying machine learning!
We have a requirement for automatic segmentation of objects in an image from the background. Searching the internet, we found that DeepLab should solve our purpose. We downloaded DeepLab from its official site, followed all the instructions mentioned there, and trained on the pascal_voc_2012 dataset with the command below:
python deeplab/train.py \
--logtostderr \
--training_number_of_steps=30000 \
--train_split="train" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--train_crop_size=513 \
--train_crop_size=513 \
--train_batch_size=1 \
--dataset="pascal_voc_seg" \
--tf_initial_checkpoint=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/checkpoint \
--train_logdir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train \
--dataset_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/tfrecord
Training finished after 50 hours. Then I started the evaluation with the command below:
python deeplab/eval.py \
--logtostderr \
--eval_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--eval_crop_size=513 \
--eval_crop_size=513 \
--dataset="pascal_voc_seg" \
--checkpoint_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/ \
--eval_logdir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/eval/ \
--dataset_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/tfrecord
After executing the above command, it found one checkpoint correctly, but after that it hangs with this message:
"Waiting for checkpoint at home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/"
So I terminated the eval run after 2 hours and started the visualization with the command below:
python deeplab/vis.py \
--logtostderr \
--vis_split="val" \
--model_variant="xception_65" \
--atrous_rates=6 \
--atrous_rates=12 \
--atrous_rates=18 \
--output_stride=16 \
--decoder_output_stride=4 \
--vis_crop_size=513 \
--vis_crop_size=513 \
--dataset="pascal_voc_seg" \
--checkpoint_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/ \
--vis_logdir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/vis/ \
--dataset_dir=/home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/tfrecord/
Visualization also executed for one checkpoint and then got the same message as eval:
"Waiting for checkpoint at home/ktpl13/Desktop/models-master/research/deeplab/datasets/pascal_voc_seg/exp/train_on_train_set/train/"
Again I terminated the vis run. A folder named "segmentation_results" was generated under vis, containing a "prediction.png" for each input image, and every one of them is a completely black image.
Now my questions are:
Did my evaluation and visualization complete, or am I doing something wrong?
Why are the predicted images all black?
About eval waiting for another checkpoint: by default, it expects to run alongside the training process. To run the eval script only once, after training, add this flag to the eval.sh script:
--max_number_of_evaluations=1
You can then view the values using TensorBoard.
The vis.sh script appears to be running correctly, since it is saving the images to the directory. The all-black images are a different problem (e.g. dataset configuration, label weights, colormap removal, etc.).
For future reference, I ran into the same problem. After I found out what happened, I laughed so hard.
Both eval and vis ran as expected.
For eval, right above your output of "waiting for checkpoints," there should be a line that says "miou[your model accuracy here]". It is a tiny line and easy to miss.
For vis, you will find your segmented results in the vis logdir you provided in your vis command.
More in depth: both eval and vis have successfully analyzed the network you trained, and as a feature they wait for more checkpoints in case you decide to train more networks to compare.

Can't submit training job gcloud ml

I get this error when I try to submit my training job.
ERROR: (gcloud.ml-engine.jobs.submit.training) Could not copy [dist/object_detection-0.1.tar.gz] to [packages/10a409168355064d603079b7c34cdd7010a13b181a8f7776751e9110d66a5bdf/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I'm running the following code:
gcloud ml-engine jobs submit training ${train1} \
--job-dir=gs://${object-detection-tutorial-bucket1/}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train1 \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://${object-detection-tutorial-bucket1/}/train \
--pipeline_config_path=gs://${object-detection-tutorial-bucket1/}/data/ssd_mobilenet_v1_coco.config
It looks like the syntax you're using is incorrect.
If the name of your bucket is object-detection-tutorial-bucket1, then you specify that with:
--job-dir=gs://object-detection-tutorial-bucket1/train
or you can run:
export YOUR_GCS_BUCKET="gs://object-detection-tutorial-bucket1"
and then specify the bucket as:
--job-dir=${YOUR_GCS_BUCKET}/train
The ${} syntax is used for accessing the value of a variable, but object-detection-tutorial-bucket1/ isn't a valid variable name, so the expression doesn't expand to your bucket name.
Sources:
https://cloud.google.com/blog/big-data/2017/06/training-an-object-detector-using-cloud-machine-learning-engine
Difference between ${} and $() in Bash
Just remove the ${ } in the script. Considering your bucket name to be object-detection-tutorial-bucket1, run the script below:
gcloud ml-engine jobs submit training train1 \
--job-dir=gs://object-detection-tutorial-bucket1/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train1 \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://object-detection-tutorial-bucket1/train \
--pipeline_config_path=gs://object-detection-tutorial-bucket1/data/ssd_mobilenet_v1_coco.config
A terrible fix, but something that worked for me: just remove the $variable format completely.
Here is an example:
!gcloud ai-platform jobs submit training anurag_card_fraud \
--scale-tier basic \
--job-dir gs://anurag/credit_card_fraud/models/JOB_20210401_194058 \
--master-image-uri gcr.io/anurag/xgboost_fraud_trainer:latest \
--config trainer/hptuning_config.yaml \
--region us-central1 \
-- \
--training_dataset_path=$TRAINING_DATASET_PATH \
--validation_dataset_path=$EVAL_DATASET_PATH \
--hptune

RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`

Trying to run this tutorial experiment:
https://cloud.google.com/ml-engine/docs/how-tos/getting-started-training-prediction#local-train-single
Running locally, in virtualenv, Python v2.7, TensorFlow v1.2
When executing this command:
gcloud ml-engine local train \
--module-name trainer.task \
--package-path trainer/ \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--job-dir $MODEL_DIR \
--eval-steps 100
I get the following error:
RuntimeError: `RunConfig` instance is expected to be used by the `Estimator` inside the `Experiment`.
expected
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833938>,
but got
_cluster_spec={},
_environment=u'cloud',
_evaluation_master='',
_is_chief=True,
_master='',
_model_dir='output',
_num_ps_replicas=0,
_num_worker_replicas=0,
_task_id=0,
_task_type=None,
_tf_config=<tensorflow.core.protobuf.config_pb2.ConfigProto object at 0x111833c80>
So, it appears it is not finding the _tf_config at the right address. I have not been able to find documentation on how to set that up. Thanks.
UPDATE:
This appears to have something to do with my virtualenv setup. It works fine when I install TensorFlow natively.