Error while training TensorFlow on gcloud ml-engine

I am following this ml-engine guide. I have set up gcloud and also created a VM. For TensorFlow, I am using Anaconda 3 to create my Python environment, and I created a new environment with python=3.6. But when I run this command:
gcloud ml-engine local train --module-name trainer.task --package-path trainer -- --train-files c:\Anaconda3\mytensorflowcode\cloudml-samples-master\census\estimator\data\adult.data.csv --eval-files c:\Anaconda3\mytensorflowcode\cloudml-samples-master\census\estimator\data\adult.test.csv --train-steps 1000 --job-dir c:\Anaconda3\mytensorflowcode\cloudml-samples-master\census\estimator\output --eval-steps 100
I get the following error:
Traceback (most recent call last):
File "D:\gcsdk174\google-cloud-sdk\platform\bundledpython\lib\runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "D:\gcsdk174\google-cloud-sdk\platform\bundledpython\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Anaconda3\mytensorflowcode\cloudml-samples-master\census\estimator\trainer\task.py", line 4, in <module>
import model
File "trainer\model.py", line 20, in <module>
import tensorflow as tf
ImportError: No module named tensorflow
I was able to install tensorflow successfully with the pip install -r ../requirements.txt command, as per the guide.
Can anybody point out what I am doing wrong?

Update: this issue should now be fixed with the most recent version of gcloud. Can you give it a try and see if it works for you? First do:
gcloud components update
What's happening is that gcloud is (silently) requiring Python 2.7, which is causing your import error. This is a bug that we will fix soon. (It's particularly problematic on Windows, since TF doesn't support a 2.7 install on Windows.) We'll update here when it's fixed.
In the meantime, the best option is probably to test locally by just running your python script directly (unless you are trying to test distributed training locally).
If you are trying to test distributed training locally, then your best temporary option is probably to use Docker and the TensorFlow docker container.
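To see which interpreter a command actually runs under, and whether tensorflow is importable from it, a small diagnostic script can help. This is only a sketch: the helper names `can_import` and `describe_environment` are made up for illustration, and the script deliberately avoids Python 3 only features so it also runs under Python 2.7.

```python
import sys


def can_import(module_name):
    """Return True if the module can be imported by this interpreter."""
    try:
        __import__(module_name)
    except ImportError:
        return False
    return True


def describe_environment(module_name="tensorflow"):
    """Collect the interpreter path, its version, and module availability."""
    return {
        "executable": sys.executable,
        "python_version": "%d.%d" % (sys.version_info[0], sys.version_info[1]),
        "module_found": can_import(module_name),
    }


if __name__ == "__main__":
    info = describe_environment()
    for key in sorted(info):
        print("%s: %s" % (key, info[key]))
```

Running the script once with the Anaconda environment's python and once with the interpreter named in the traceback should show whether the two disagree.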

Related

Tensorflow docker installation problem. (undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb)

I am trying to learn more about tensorflow and about building a custom model with it.
I am following instructions in which I build and run a docker container with these steps:
git clone https://github.com/tensorflow/models.git
cd models
docker build -f research/object_detection/dockerfiles/tf2/Dockerfile -t od .
docker run -it od
That all works OK.
The next step is to run a test script inside the container:
python object_detection/builders/model_builder_tf2_test.py
This fails and outputs:
Traceback (most recent call last):
File "object_detection/builders/model_builder_tf2_test.py", line 22, in <module>
import tensorflow.compat.v1 as tf
File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/__init__.py", line 438, in <module>
_ll.load_library(_main_dir)
File "/home/tensorflow/.local/lib/python3.6/site-packages/tensorflow/python/framework/load_library.py", line 154, in load_library
py_tf.TF_LoadLibrary(lib)
tensorflow.python.framework.errors_impl.NotFoundError: /usr/local/lib/python3.6/dist-packages/tensorflow/core/kernels/libtfkernel_sobol_op.so: undefined symbol: _ZN10tensorflow8OpKernel11TraceStringEPNS_15OpKernelContextEb
There is a Stack Overflow question about this error, and also a follow-up to it.
If I understand correctly, they say to uninstall tensorflow and install it again, but that does not work.
The version is 2.6.1. If I try to downgrade, I run into all kinds of other problems.
I am stuck ;-( Any clues?
I am working on a VPS. Is it possible to run the tensorflow docker image on 'normal' hardware?

How can I install the tensorflow object detection API on Google colab?

I tried running the TensorFlow Object Detection API on Colab by following a linked guide.
I got the error below at the first step, "Install required packages".
How can I solve it?
Background: Python 2, GPU
/root
fatal: destination path 'models' already exists and is not an empty directory.
/root/models/research
Traceback (most recent call last):
File "object_detection/builders/model_builder_test.py", line 23, in <module>
from object_detection.builders import model_builder
ImportError: No module named object_detection.builders
It is not clear which directory you are executing the command from.
If you are executing it from the content directory, change to the models directory and then to its research directory:
%cd ~/models/research
!python object_detection/builders/model_builder_test.py
If you don't have the models directory, clone it with:
!git clone --quiet https://github.com/tensorflow/models.git
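Before running the test, it can be worth confirming that the current directory really is the research directory. The following is a rough sketch, not part of the tutorial; the helper name `looks_like_research_dir` is hypothetical, and it only checks that the test file named in the traceback exists under the given path.

```python
import os


def looks_like_research_dir(path):
    """Return True if path contains object_detection/builders/model_builder_test.py."""
    target = os.path.join(
        path, "object_detection", "builders", "model_builder_test.py"
    )
    return os.path.isfile(target)


if __name__ == "__main__":
    cwd = os.getcwd()
    if looks_like_research_dir(cwd):
        print("OK: %s contains the object_detection test script" % cwd)
    else:
        print("Not the research directory: %s" % cwd)
```

In a Colab cell this could be run right after the %cd line to catch a wrong working directory early.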

Tensorflow: TypeError: __new__() got an unexpected keyword argument 'file'

I'm trying to run the pet detector Google Cloud example seen here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_pets.md. I get most of the way through, until I actually try to run the training with this command:
gcloud ml-engine jobs submit training `whoami`_object_detection_`date +%s` \
--runtime-version 1.2 \
--job-dir=gs://test-run-2/train \
--packages dist/object_detection-0.1.tar.gz,slim/dist/slim-0.1.tar.gz \
--module-name object_detection.train \
--region us-central1 \
--config object_detection/samples/cloud/cloud.yml \
-- \
--train_dir=gs://test-run-2/train \
--pipeline_config_path=gs://test-run-2/data/pipeline.config
Which inevitably leads to the error:
Traceback (most recent call last):
File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
exec code in run_globals
File "/root/.local/lib/python2.7/site-packages/object_detection/train.py", line 49, in <module>
from object_detection import trainer
File "/root/.local/lib/python2.7/site-packages/object_detection/trainer.py", line 27, in <module>
from object_detection.builders import preprocessor_builder
File "/root/.local/lib/python2.7/site-packages/object_detection/builders/preprocessor_builder.py", line 21, in <module>
from object_detection.protos import preprocessor_pb2
File "/root/.local/lib/python2.7/site-packages/object_detection/protos/preprocessor_pb2.py", line 71, in <module>
options=None, file=DESCRIPTOR),
TypeError: __new__() got an unexpected keyword argument 'file'
I'm not sure what exactly I'm doing wrong here or where that error comes from, as I can't see inside the preprocessor_pb2.py file. Any help would be greatly appreciated!
I met this issue after upgrading tensorflow to version 1.11 and solved it by running:
pip install --upgrade protobuf
I ran into this issue. It is likely due to using an updated tensorflow dist (1.4) while trying to run the job on runtime version 1.2. I solved it by passing in a yaml config file that sets the runtime version:
trainingInput:
  runtimeVersion: "1.4"
  pythonVersion: "3.5"
If you run into a matplotlib import error, you'll need to follow https://github.com/tensorflow/models/issues/2739#issuecomment-351213863 for now, until the tensorflow models repo fixes it.
@olive_tree has a good answer. What ended up being my issue was the version of protobuf I was compiling with. Make sure your protoc version matches the one described here: https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md
This is what I ended up having to run (on OSX):
brew uninstall protobuf
brew install protobuf@2.6
voila, problems solved!
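Since the answers above come down to a mismatch between the protoc compiler and the protobuf runtime, a quick comparison of the two version strings can help narrow it down. This is a minimal sketch under assumptions: the helper names `parse_version` and `compiler_newer_than_runtime` are invented here, and the example strings are illustrative rather than taken from the question.

```python
import re


def parse_version(text):
    """Extract the first dotted version number from text as a tuple of ints."""
    match = re.search(r"(\d+(?:\.\d+)+)", text)
    if match is None:
        raise ValueError("no version number found in %r" % (text,))
    return tuple(int(part) for part in match.group(1).split("."))


def compiler_newer_than_runtime(protoc_output, runtime_version):
    """Return True if the protoc version is newer than the runtime version."""
    return parse_version(protoc_output) > parse_version(runtime_version)


if __name__ == "__main__":
    # Illustrative inputs only; substitute the real values from your machine.
    print(parse_version("libprotoc 3.5.1"))
    print(compiler_newer_than_runtime("libprotoc 3.5.1", "3.1.0"))
```

The first argument can be taken from the output of protoc --version and the second from google.protobuf.__version__ in the environment that runs the job.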

ModuleNotFoundError: No module named 'tensorflow.contrib.lite.toco.python'

Exact command to reproduce: toco --help
I am trying to run the TensorFlow Lite codelab tutorial. After installing tf-nightly, when I try to run the command "toco --help", I get the error ModuleNotFoundError: No module named 'tensorflow.contrib.lite.toco.python'.
I have tried this on 3 computers (all Windows) and the same problem persists.
Source code / logs
C:\Users\HP\Downloads>toco --help
Traceback (most recent call last):
File "c:\programdata\anaconda3\lib\runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "c:\programdata\anaconda3\lib\runpy.py", line 85, in _run_code
exec(code, run_globals)
File "C:\ProgramData\Anaconda3\Scripts\toco.exe\__main__.py", line 5, in <module>
ModuleNotFoundError: No module named 'tensorflow.contrib.lite.toco.python'
I was getting the same error, and apparently TOCO doesn't work on Windows machines:
https://github.com/tensorflow/tensorflow/issues/16374
My alternative for optimizing the model was to use TensorFlow Mobile instead of TensorFlow Lite, with 'optimize_for_inference'. You could also try a Linux environment.
I solved it by downgrading tensorflow to 1.7:
pip install --upgrade "tensorflow==1.7.*"
Issue Solved
Clone the tensorflow repository and copy the lite folder from tensorflow\tensorflow\contrib
and paste it in the C:\Users\$USERNAME$\AppData\Local\Programs\Python\Python36\Lib\site-packages\tensorflow\contrib\lite
The tensorflow library installed in Python36 has some missing files. My Python36 folder is at c:\Python36, so toco loads from "C:\Python36\Lib\site-packages\tensorflow\contrib\lite\python". Whoever packaged toco forgot to copy the whole folder there.
You need to copy the missing files from your tensorflow source folder into that lite folder. My tensorflow source is at "c:\tensorflow".
Copy all files from "C:\tensorflow\tensorflow\contrib\lite\python" to "C:\Python36\Lib\site-packages\tensorflow\contrib\lite\python".
Then test it by running "toco --help".

Tensorflow - import error in CIFAR tutorial

I recently installed tensorflow and got a Python error when running the CIFAR tutorial.
I'm using Mac OS X, CPU only, Python 2.7.
$ python cifar10_train.py
Filling queue with 20000 CIFAR images before starting to train. This will take a few minutes.
Traceback (most recent call last):
File "cifar10_train.py", line 120, in <module>
tf.app.run()
File "/Users/sunwoo/tensorflow/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 30, in run
sys.exit(main(sys.argv[:1] + flags_passthrough))
File "cifar10_train.py", line 116, in main
train()
File "cifar10_train.py", line 76, in train
class _LoggerHook(tf.train.SessionRunHook):
AttributeError: 'module' object has no attribute 'SessionRunHook'
How can I import tf.train.SessionRunHook?
It looks like you are using the master branch of cifar10_train.py, with an older installed version of TensorFlow (0.11 or earlier). The master branch was recently modified to use a new API, which wasn't available in TensorFlow 0.11 or earlier.
There are two ways to fix this problem. Either upgrade TensorFlow to version 0.12 or later, or check out the r0.11 branch of the TensorFlow source, and use the version of cifar10_train.py from that branch.
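Instead of comparing version numbers, a script can check directly whether the installed tensorflow exposes the API it needs. The following is a small sketch; the helper name `has_session_run_hook` is hypothetical, and it applies to the TensorFlow versions discussed in this answer, where the hook lives under tf.train.

```python
def has_session_run_hook(tf_module):
    """Return True if tf_module.train exposes a SessionRunHook attribute."""
    train = getattr(tf_module, "train", None)
    return train is not None and hasattr(train, "SessionRunHook")


# Typical use with a real installation (not executed here):
#
#     import tensorflow as tf
#     if not has_session_run_hook(tf):
#         raise SystemExit(
#             "tensorflow %s has no tf.train.SessionRunHook; "
#             "upgrade tensorflow or use the r0.11 version of the script"
#             % tf.__version__
#         )
```

Failing early with a clear message like this is easier to diagnose than the AttributeError raised in the middle of train().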