Can't submit training job with gcloud ml-engine - tensorflow

I get this error when I try to submit my training job:
ERROR: (gcloud.ml-engine.jobs.submit.training) Could not copy [dist/object_detection-0.1.tar.gz] to [packages/10a409168355064d603079b7c34cdd7010a13b181a8f7776751e9110d66a5bdf/object_detection-0.1.tar.gz]. Please retry: HTTPError 404: Not Found
I'm running the following code:
gcloud ml-engine jobs submit training ${train1} \
--job-dir=gs://${object-detection-tutorial-bucket1/}/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train1 \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://${object-detection-tutorial-bucket1/}/train \
--pipeline_config_path=gs://${object-detection-tutorial-bucket1/}/data/ssd_mobilenet_v1_coco.config

It looks like the syntax you're using is incorrect.
If the name of your bucket is object-detection-tutorial-bucket1, then you specify that with:
--job-dir=gs://object-detection-tutorial-bucket1/train
or you can run:
export YOUR_GCS_BUCKET="gs://object-detection-tutorial-bucket1"
and then specify the bucket as:
--job-dir=${YOUR_GCS_BUCKET}/train
The ${} syntax is used for accessing the value of a variable, but object-detection-tutorial-bucket1/ isn't a valid variable name (hyphens aren't allowed), so the shell doesn't expand it to the bucket name you intended, and the resulting gs:// path doesn't exist -- hence the 404.
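As a quick bash demonstration of what actually happens (assuming no variable named object is set, the shell treats ${object-...} as a use-a-default expansion):
echo "gs://${object-detection-tutorial-bucket1/}/train"
# prints gs://detection-tutorial-bucket1//train -- not the bucket you intended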
Sources:
https://cloud.google.com/blog/big-data/2017/06/training-an-object-detector-using-cloud-machine-learning-engine
Difference between ${} and $() in Bash

Just remove the ${ } in the script. Assuming your bucket name is object-detection-tutorial-bucket1, run the script below:
gcloud ml-engine jobs submit training train1 \
--job-dir=gs://object-detection-tutorial-bucket1/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train1 \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
--runtime-version=1.4 \
-- \
--train_dir=gs://object-detection-tutorial-bucket1/train \
--pipeline_config_path=gs://object-detection-tutorial-bucket1/data/ssd_mobilenet_v1_coco.config

Terrible fix, but something that worked for me: just remove the $variable format completely and hard-code the values.
Here is an example:
!gcloud ai-platform jobs submit training anurag_card_fraud \
--scale-tier basic \
--job-dir gs://anurag/credit_card_fraud/models/JOB_20210401_194058 \
--master-image-uri gcr.io/anurag/xgboost_fraud_trainer:latest \
--config trainer/hptuning_config.yaml \
--region us-central1 \
-- \
--training_dataset_path=$TRAINING_DATASET_PATH \
--validation_dataset_path=$EVAL_DATASET_PATH \
--hptune

Related

How to run 'run_squad.py' on google colab? It gives 'invalid syntax' error

I downloaded the file first using:
!curl -L -O https://github.com/huggingface/transformers/blob/master/examples/legacy/question-answering/run_squad.py
Then I used the following code:
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-uncased \
--output_dir models/bert/ \
--data_dir data/squad \
--overwrite_output_dir \
--overwrite_cache \
--do_train \
--train_file /content/train.json \
--version_2_with_negative \
--do_lower_case \
--do_eval \
--predict_file /content/val.json \
--per_gpu_train_batch_size 2 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 384 \
--doc_stride 128 \
--threads 10 \
--save_steps 5000
I also tried the following:
!python run_squad.py \
--model_type bert \
--model_name_or_path bert-base-cased \
--do_train \
--do_eval \
--do_lower_case \
--train_file /content/train.json \
--predict_file /content/val.json \
--per_gpu_train_batch_size 12 \
--learning_rate 3e-5 \
--num_train_epochs 2.0 \
--max_seq_length 584 \
--doc_stride 128 \
--output_dir /content/
Both commands fail with the same error:
File "run_squad.py", line 7
^ SyntaxError: invalid syntax
What exactly is the issue? How can I run the .py file?
SOLVED: It was giving an error because I was downloading the GitHub HTML page rather than the script itself. Once I used the 'Raw' link to download the script, the code ran.
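For reference, the working download simply swaps the blob URL for its raw.githubusercontent.com counterpart:
!curl -L -O https://raw.githubusercontent.com/huggingface/transformers/master/examples/legacy/question-answering/run_squad.py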

How to convert configure options for use with cmake

I have a script for building a project that I need to upgrade from using configure to cmake. The original configure command is
CFLAGS="$SLKCFLAGS" \
CXXFLAGS="$SLKCFLAGS" \
./configure \
--with-clang \
--prefix=$PREFIX \
--libdir=$PREFIX/lib${LIBDIRSUFFIX} \
--incdir=$PREFIX/include \
--mandir=$PREFIX/man/man1 \
--etcdir=$PREFIX/etc/root \
--docdir=/usr/doc/$PRGNAM-$VERSION \
--enable-roofit \
--enable-unuran \
--disable-builtin-freetype \
--disable-builtin-ftgl \
--disable-builtin-glew \
--disable-builtin-pcre \
--disable-builtin-zlib \
--disable-builtin-lzma \
$GSL_FLAGS \
$FFTW_FLAGS \
$QT_FLAGS \
--enable-shared \
--build=$ARCH-slackware-linux
I am not familiar enough with cmake to know how to do the equivalent. I would prefer a command line option but am open to modifying the CMakeLists.txt file as well.
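As a rough sketch of the usual configure-to-cmake mapping (the -D names for the project-specific switches below are assumptions; list the project's real options with cmake -LH or read its CMakeLists.txt):
mkdir -p build && cd build
# NOTE: the project-specific -D option names are assumptions to verify;
# CFLAGS/CXXFLAGS, compiler choice, and --prefix have direct CMake equivalents
cmake .. \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_FLAGS="$SLKCFLAGS" \
  -DCMAKE_CXX_FLAGS="$SLKCFLAGS" \
  -DCMAKE_INSTALL_PREFIX="$PREFIX" \
  -DCMAKE_INSTALL_LIBDIR="lib${LIBDIRSUFFIX}" \
  -Droofit=ON \
  -Dunuran=ON \
  -Dbuiltin_freetype=OFF \
  -Dbuiltin_pcre=OFF \
  -Dbuiltin_zlib=OFF \
  -Dbuiltin_lzma=OFF
cmake --build . --target install
In general, --enable-X/--disable-X configure flags become -DX=ON/OFF cache variables, but the exact spelling is project-specific.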

ValueError: Invalid tensors 'normalized_input_image_tensor' were found

OS: Ubuntu 18.04
TensorFlow model: ssd_mobilenet_v2_quantized_300x300_coco_2019_01_03
I have retrained the ssd_mobilenet_v1_quantized_coco model using my own data. I successfully generated frozen_inference_graph.pb using the export_inference_graph.py script, but when I ran tflite_convert.py it failed with "ValueError: Invalid tensors 'normalized_input_image_tensor' were found." The parameters I passed to tflite_convert.py are:
python tflite_convert.py \
--output_file="converted_quant_traffic_tflite/traffic_tflite_graph.tflite" \
--graph_def_file="traffic_inference_graph_lite/frozen_inference_graph.pb" \
--input_arrays='normalized_input_image_tensor' \
--inference_type=QUANTIZED_UINT8 \
--output_arrays='TFLite_Detection_PostProcess','TFLite_Detection_PostProcess:1','TFLite_Detection_PostProcess:2','TFLite_Detection_PostProcess:3' \
--mean_values=128 \
--std_dev_values=128 \
--input_shapes=1,300,300,3 \
--default_ranges_min=0 \
--default_ranges_max=6 \
--change_concat_input_ranges=false \
--allow_nudging_weights_to_use_fast_gemm_kernel=true \
--allow_custom_ops
Obviously, input_arrays was not set correctly. Please advise me on how to set input_arrays.
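One way to find the real input name is to list the Placeholder ops in the frozen graph. A minimal sketch, assuming TensorFlow 1.x and the paths from the question:
python - <<'EOF'
import tensorflow as tf

# assumption: the frozen graph path used in the question
graph_def = tf.GraphDef()
with tf.gfile.GFile('traffic_inference_graph_lite/frozen_inference_graph.pb', 'rb') as f:
    graph_def.ParseFromString(f.read())

# Placeholder ops are the graph's inputs; their names are what --input_arrays expects
for node in graph_def.node:
    if node.op == 'Placeholder':
        print(node.name)
EOF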

Tensorflow inception-v4 Classify Image

I used TF-slim's inception-v4 to train a model from scratch.
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=mydata \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v4 \
--clone_on_cpu=true \
--max_number_of_steps=1000 \
--log_every_n_steps=100
# Run evaluation.
python eval_image_classifier.py \
--checkpoint_path=${TRAIN_DIR} \
--eval_dir=${TRAIN_DIR} \
--dataset_name=mydata \
--dataset_split_name=validation \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v4 \
--batch_size=32
# Fine-tune all the new layers for 1000 steps.
python train_image_classifier.py \
--train_dir=${TRAIN_DIR}/all \
--dataset_name=mydata \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--model_name=inception_v4 \
--clone_on_cpu=true \
--checkpoint_path=${TRAIN_DIR} \
--max_number_of_steps=1000 \
--log_every_n_steps=100 \
--batch_size=32 \
--learning_rate=0.0001 \
--learning_rate_decay_type=fixed \
--save_interval_secs=600 \
--save_summaries_secs=600 \
--optimizer=rmsprop \
--weight_decay=0.00004
Then I froze the graph:
python export_inference_graph.py \
--alsologtostderr \
--model_name=inception_v4 \
--is_training=True \
--labels_offset=999 \
--output_file=${OUTPUT_DIR}/unfrozen_inception_v4_graph.pb \
--dataset_dir=${DATASET_DIR}
#NEWEST_CHECKPOINT=$(cat ${TRAIN_DIR}/all/checkpoint |head -n1|awk -F\" '{print $2}')
NEWEST_CHECKPOINT=$(ls -t1 ${TRAIN_DIR}/all|grep model.ckpt |head -n1)
echo ${NEWEST_CHECKPOINT%.*}
python ${OUTPUT_DIR}/tensorflow/tensorflow/python/tools/freeze_graph.py \
--input_graph=${OUTPUT_DIR}/unfrozen_inception_v4_graph.pb \
--input_checkpoint=${TRAIN_DIR}/all/${NEWEST_CHECKPOINT%.*} \
--input_binary=true \
--output_graph=${OUTPUT_DIR}/frozen_inception_v4.pb \
--output_node_names=InceptionV4/Logits/Predictions \
--input_meta_graph=True
After all this, I got a frozen_inception_v4.pb file.
For this example, https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/label_image/label_image.py, what is the input layer for inception_v4?
Does anyone know how to solve this?
That depends on the particular implementation of slim you used. Look at where they define the input and see what the name of that tensor is.
Try this:
bazel build tensorflow/tools/graph_transforms:summarize_graph
bazel-bin/tensorflow/tools/graph_transforms/summarize_graph \
--in_graph=/path/to/your_frozen.pb
It will show the possible input and output layers.
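Once summarize_graph reports the names, they can be passed to the label_image.py example. A hedged sketch, assuming slim's export_inference_graph created its usual placeholder named input and that inception_v4 expects 299x299 images:
# labels.txt and test.jpg are hypothetical; input_layer/output_layer should
# match whatever summarize_graph actually printed
python label_image.py \
  --graph=${OUTPUT_DIR}/frozen_inception_v4.pb \
  --labels=labels.txt \
  --image=test.jpg \
  --input_layer=input \
  --output_layer=InceptionV4/Logits/Predictions \
  --input_height=299 \
  --input_width=299
You may also need to adjust --input_mean/--input_std to match slim's inception preprocessing.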

Google Cloud ML Engine pass multiple file paths as arguments

I am trying to run a job on Google Cloud ML Engine and can't seem to pass multiple file paths as arguments to the parser.
Here is what I am writing in the terminal:
JOB_NAME=my_job_name
BUCKET_NAME=my_bucket_name
OUTPUT_PATH=gs://$BUCKET_NAME/$JOB_NAME
DATA_PATH=gs://$BUCKET_NAME/my_data_directory
REGION=us-east1
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
-- \
--file-path "${DATA_PATH}/*" \
--num-epochs 10
Here, my_data_directory contains multiple files that I later want to read. The problem is that --file-path ends up containing only ['gs://my_bucket_name/my_data_directory'] and not a list of the files in that directory.
How do I fix this?
Many thanks in advance.
Since the arguments you pass after the -- line are forwarded to your program as user arguments, how they are handled depends entirely on the trainer you defined. I would go back and modify the trainer so that it either expands the wildcard itself (see the sketch after the links below) or takes multiple paths, like this:
gcloud ml-engine jobs submit training $JOB_NAME \
--job-dir $OUTPUT_PATH \
--runtime-version 1.2 \
--module-name trainer.task \
--package-path trainer/ \
--region $REGION \
--scale-tier STANDARD_1 \
-- \
--train-files $TRAIN_DATA \
--eval-files $EVAL_DATA \
--train-steps 1000 \
--verbosity DEBUG \
--eval-steps 100
Some links that will be helpful for developing your own trainer: [1] [2]
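If you would rather keep the single wildcard argument, the trainer itself can expand it. A minimal sketch, assuming TF 1.x, whose tf.gfile.Glob understands gs:// patterns:
python - <<'EOF'
import tensorflow as tf

# hypothetical pattern, mirroring the question's --file-path argument
file_pattern = 'gs://my_bucket_name/my_data_directory/*'
files = tf.gfile.Glob(file_pattern)
print(files)  # the individual file paths, not the literal pattern
EOF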