training MNIST with TPU generates errors - tensorflow

I am following the Running MNIST on Cloud TPU tutorial and get the following error when I try to train:
python /usr/share/models/official/mnist/ \
--tpu=$TPU_NAME \
--use_tpu=True \
--iterations=500 \
alexryan#alex-tpu:~/tpu$ ./
W1025 20:21:39.351166 139745816463104] file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/", line 41, in autodetect
from . import file_cache
File "/usr/local/lib/python2.7/dist-packages/googleapiclient/discovery_cache/", line 41, in <module>
'file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth')
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Traceback (most recent call last):
File "/usr/share/models/official/mnist/", line 173, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/", line 125, in run
File "/usr/share/models/official/mnist/", line 152, in main
tpu_config=tf.contrib.tpu.TPUConfig(FLAGS.iterations, FLAGS.num_shards),
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/", line 207, in __init__
self._master = cluster.master()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/", line 223, in master
job_tasks = self.cluster_spec().job_tasks(self._job_name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/cluster_resolver/python/training/", line 269, in cluster_spec
(compat.as_text(self._tpu), response['health']))
RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT"
The only places where I varied from the instructions were:
Instead of running ctpu in the cloud shell, I ran it on the mac.
>ctpu version
ctpu version: 1.7
The zone where the TPU resided was different than the default zone of my config, so I specified it as an option like so:
ctpu up --zone us-central1-b --preemptible
I was able to copy the MNIST files from the VM to the GCS bucket without any problem:
alexryan#alex-tpu:~$ gsutil cp -r ./data ${STORAGE_BUCKET}
Copying file://./data/validation.tfrecords [Content-Type=application/octet-stream]...
Copying file://./data/train-images-idx3-ubyte.gz [Content-Type=application/octet-stream]...
I tried the (Optional) Set up TensorBoard >
Running cloud_tpu_profiler
Go to the Cloud Console > TPUs > and click on the TPU you created.
Locate the service account name for the Cloud TPU and copy it, for
In the list of buckets, select the bucket you want to use, select Show
Info Panel, and then select Edit bucket permissions. Paste your
service account name into the add members field for that bucket and
select the following permissions:
"Cloud Console > TPUs" does not exist as an option,
so I used the service account associated with the VM:
"Cloud Console > Compute Engine > alex-tpu"
Since the last error message was RuntimeError: TPU "alex-tpu" is unhealthy: "TIMEOUT", I used ctpu to delete the VM and re-create it, then ran the training again.
This time I got more errors:
This one seems like it might be just a warning ...
ImportError: file_cache is unavailable when using oauth2client >= 4.0.0 or google-auth
Not sure about this one ...
ERROR:tensorflow:Operation of type Placeholder (reshape_input) is not supported on the TPU. Execution will fail if this op is used in the graph.
this one seemed to kill the training ...
INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562') [[{{node save/SaveV2}} = SaveV2[dtypes=[DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64],
_device="/job:worker/replica:0/task:0/device:CPU:0"](save/ShardedFilename, save/SaveV2/tensor_names, save/SaveV2/shape_and_slices, conv2d/bias/Read/ReadVariableOp, conv2d/kernel/Read/ReadVariableOp, conv2d_1/bias/Read/ReadVariableOp, conv2d_1/kernel/Read/ReadVariableOp, dense/bias/Read/ReadVariableOp, dense/kernel/Read/ReadVariableOp, dense_1/bias/Read/ReadVariableOp, dense_1/kernel/Read/ReadVariableOp, global_step/Read/ReadVariableOp)]]
Caused by op u'save/SaveV2', defined at:
File "/usr/share/models/official/mnist/", line 173, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/", line 125, in run
_sys.exit(main(argv))
File "/usr/share/models/official/mnist/", line 163, in main
estimator.train(input_fn=train_input_fn, max_steps=FLAGS.train_steps)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/contrib/tpu/python/tpu/", line 2394, in train
saving_listeners=saving_listeners
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/", line 356, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/", line 1181, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/", line 1215, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/estimator/", line 1406, in _train_with_estimator_spec
log_step_count_steps=self._config.log_step_count_steps) as mon_sess:
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 504, in MonitoredTrainingSession
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 921, in __init__
stop_grace_period_secs=stop_grace_period_secs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 643, in __init__
self._sess = _RecoverableSession(self._coordinated_creator)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 1107, in __init__
_WrappedSession.__init__(self, self._create_session())
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 1112, in _create_session
return self._sess_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 800, in create_session
self.tf_sess = self._session_creator.create_session()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 557, in create_session
self._scaffold.finalize()
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 215, in finalize
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 1106, in build
self._build(self._filename, build_save=True, build_restore=True)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 1143, in _build
build_save=build_save, build_restore=build_restore)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 778, in _build_internal
save_tensor = self._AddShardedSaveOps(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 369, in _AddShardedSaveOps
return self._AddShardedSaveOpsForV2(filename_tensor, per_device)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 343, in _AddShardedSaveOpsForV2
sharded_saves.append(self._AddSaveOps(sharded_filename, saveables))
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 284, in _AddSaveOps
save = self.save_op(filename_tensor, saveables)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/", line 202, in save_op
tensors)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/", line 1690, in save_v2
shape_and_slices=shape_and_slices, tensors=tensors, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/", line 787, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/", line 488, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/", line 3272, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/", line 1768, in __init__
self._traceback = tf_stack.extract_stack()
UnimplementedError (see above for traceback): File system scheme '[local]' not implemented (file: '/tmp/tmpaiggRW/model.ckpt-0_temp_9216e11a1368405795d9b5282775f562')
I get this error ...
INFO:tensorflow:Error recorded from training_loop: File system scheme '[local]' not implemented
... even when --use_tpu=False
alexryan#alex-tpu:~/tpu$ cat
python /usr/share/models/official/mnist/ \
--tpu=$TPU_NAME \
--use_tpu=False \
--iterations=500 \
This Stack Overflow answer suggests that the TPU is trying to write to a non-existent file system instead of the GCS bucket I specified. It is unclear to me why that would be happening.

In the first scenario, it seems the TPU you created was not in a healthy state, so deleting and re-creating the TPU or the entire VM is the right way to resolve it.
I think the error in the second scenario (where you deleted the VM and re-created it) occurs because your ${STORAGE_BUCKET} is either undefined or not a proper GCS bucket. It must be a GCS bucket. A local path won't work and produces the "File system scheme '[local]' not implemented" error from the question.
More information on creating a GCS bucket is in the section "Create a Cloud Storage bucket" at
Hope this answers your question.
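To rule out the undefined-or-local ${STORAGE_BUCKET} case described above before launching a run, a quick check along these lines may help. This is a minimal sketch, not part of the tutorial: the helper names is_gcs_path and describe_bucket_setting are hypothetical, and the check only looks at the form of the value (a gs:// prefix and a non-empty bucket name), not at whether the bucket exists or is accessible to the TPU.

```python
import os


def is_gcs_path(path):
    """Return True if `path` has the form gs://<bucket>[/...]."""
    prefix = "gs://"
    if not path or not path.startswith(prefix):
        return False
    # Everything up to the first slash after the prefix is the bucket name.
    bucket = path[len(prefix):].split("/", 1)[0]
    return bool(bucket)


def describe_bucket_setting(var_name="STORAGE_BUCKET"):
    """Summarize whether the environment variable holds a gs:// URI."""
    value = os.environ.get(var_name, "")
    if not value:
        return "%s is not set" % var_name
    if not is_gcs_path(value):
        return "%s=%r is not a gs:// path" % (var_name, value)
    return "%s=%r looks like a GCS path" % (var_name, value)


if __name__ == "__main__":
    print(describe_bucket_setting())
```

If the variable is empty, --model_dir=${STORAGE_BUCKET}/output expands to the local path /output, which would match the "undefined" case in the answer.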

Ran into the same problem and found that there was a typo in the tutorial. If you check you'll find that the params need to be lowercase.
If you change that, it works fine.
python /usr/share/models/official/mnist/ \
--tpu=$TPU_NAME \
--data_dir=${STORAGE_BUCKET}/data \
--model_dir=${STORAGE_BUCKET}/output \
--use_tpu=True \
--iterations=500 \


Continuous Machine Learning pipeline broken by tensorflow requirements installation

I am doing Continuous Machine Learning (CML) on my own GitLab server. My goal is to test a basic CML pipeline with a Python script as an example.
My .gitlab-ci.yml file is a basic one:
- cml_run
stage: cml_run
image: dvcorg/cml-py3:latest
- pip3 install -r requirements.txt
- python
- cat metrics.txt >>
- cml-publish confusion_matrix.png --md >>
- cml-send-comment
The pandas, sklearn and Keras entries in requirements.txt install successfully, but the pipeline breaks while installing TensorFlow:
$ pip3 install -r requirements.txt
Collecting pandas
Downloading pandas-1.0.5-cp36-cp36m-manylinux1_x86_64.whl (10.1 MB)
Collecting sklearn
Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting keras
Downloading Keras-2.4.3-py2.py3-none-any.whl (36 kB)
Collecting tensorflow
Downloading tensorflow-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2 MB)
ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/", line 188, in _main
status =, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/", line 185, in wrapper
return func(self, options, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/commands/", line 333, in run
reqs, check_supported_wheels=not options.target_dir
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolution/legacy/", line 179, in resolve
discovered_reqs.extend(self._resolve_one(requirement_set, req))
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolution/legacy/", line 362, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolution/legacy/", line 314, in _get_abstract_dist_for
abstract_dist = self.preparer.prepare_linked_requirement(req)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/", line 469, in prepare_linked_requirement
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/", line 259, in unpack_url
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/", line 130, in get_http_url
link, downloader, temp_dir.path, hashes
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/", line 281, in _download_http_url
for chunk in download.chunks:
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/", line 166, in iter
for x in it:
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/network/", line 39, in response_chunks
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/", line 564, in stream
data =, decode_content=decode_content)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/", line 507, in read
data = if not fp_closed else b""
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/", line 65, in read
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/", line 52, in _close
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/", line 309, in cache_response
cache_url, self.serializer.dumps(request, response, body=body)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/", line 72, in dumps
return b",".join([b"cc=4", msgpack.dumps(data, use_bin_type=True)])
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 35, in packb
return Packer(**kwargs).pack(o)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 936, in pack
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 920, in _pack
len(obj), dict_iteritems(obj), nest_limit - 1
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 1021, in _pack_map_pairs
self._pack(v, nest_limit - 1)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 920, in _pack
len(obj), dict_iteritems(obj), nest_limit - 1
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 1021, in _pack_map_pairs
self._pack(v, nest_limit - 1)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/", line 865, in _pack
return self._buffer.write(obj)
Any ideas on how to overcome this issue with the CML pipeline on GitLab?
It is difficult to diagnose without knowing more about your .gitlab-ci.yml file. But based on the MemoryError message, it seems the runner does not have enough memory to install TensorFlow in addition to your other project dependencies.
You can try installing TensorFlow with the --no-cache-dir flag.
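The traceback ends inside pip's HTTP cache code (cachecontrol serializing the response with msgpack), which is why disabling the cache is worth trying for a 516 MB wheel. In the CI script that would mean changing the install line to pip3 install --no-cache-dir -r requirements.txt. The same invocation can be sketched from Python; build_pip_command and install_requirements are hypothetical helper names, and whether this is enough depends on how much memory the runner actually has.

```python
import subprocess
import sys


def build_pip_command(requirements_file, use_cache=False):
    """Build a `pip install -r ...` command, without pip's cache by default."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if not use_cache:
        # --no-cache-dir stops pip from caching downloaded files.
        cmd.append("--no-cache-dir")
    cmd.extend(["-r", requirements_file])
    return cmd


def install_requirements(requirements_file):
    """Run the install and raise CalledProcessError if pip fails."""
    subprocess.check_call(build_pip_command(requirements_file))


if __name__ == "__main__":
    # Only prints the command; call install_requirements() to run it.
    print(" ".join(build_pip_command("requirements.txt")))
```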


I'm using an object detection module for classifying images. My specs are as follows:
OS: Ubuntu 18.04 LTS
Python: 3.6.7
VirtualEnv: Version: 16.4.3
Pip3 version inside virtualenv: 19.0.3
TensorFlow Version: 1.13.1
Protoc Version: 3.0.0-9
I'm working on Windows virtualenv and google-colab. This is the error message I get:
python3 legacy/ --logtostderr --train_dir=training/ --pipeline_config_path=training/ssd_mobilenet_v1_pets.config
INFO:tensorflow:global step 1: loss = 18.5013 (48.934 sec/step)
INFO:tensorflow:Finished training! Saving model to disk.
/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/summary/writer/ UserWarning: Attempting to use a closed FileWriter. The operation will be a noop unless the FileWriter is explicitly reopened.
warnings.warn("Attempting to use a closed FileWriter. "
Traceback (most recent call last):
File "legacy/", line 184, in <module>
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/platform/", line 125, in run
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/util/", line 324, in new_func
return func(*args, **kwargs)
File "legacy/", line 180, in main
File "/home/priyank/venv/models-master/research/object_detection/legacy/", line 416, in train
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/contrib/slim/python/slim/", line 785, in train
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/training/", line 832, in stop
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/training/", line 389, in join
File "/home/priyank/venv/lib/python3.6/site-packages/", line 693, in reraise
raise value
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/training/", line 257, in _run
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/client/", line 1257, in _single_operation_run
self._call_tf_sessionrun(None, {}, [], target_list, None)
File "/home/priyank/venv/lib/python3.6/site-packages/tensorflow/python/client/", line 1407, in _call_tf_sessionrun
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[15,1,1755,2777,3] and type float on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu
[[{{node batch}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
You can try the following fixes:
1. Reduce the image dimensions if you are using very high-resolution images.
2. Reduce the batch size.
3. Check whether any other process is using up your memory.
Could you also please share your config file?
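To see why the first two suggestions matter, it helps to work out how large the tensor in the error message is. The sketch below multiplies out the reported shape [15,1,1755,2777,3], assuming "type float" means 4-byte float32; tensor_bytes is a hypothetical helper name, and the figure covers only this one tensor, not the rest of the input queue or the model.

```python
from functools import reduce
from operator import mul


def tensor_bytes(shape, bytes_per_element=4):
    """Size in bytes of a dense tensor with the given shape."""
    return reduce(mul, shape, 1) * bytes_per_element


# Shape reported in the OOM message: a batch of 15 images of 1755 x 2777 x 3.
full_batch = tensor_bytes((15, 1, 1755, 2777, 3))
single_image = tensor_bytes((1, 1, 1755, 2777, 3))

if __name__ == "__main__":
    print("full batch:   %d bytes (~%.2f GB)" % (full_batch, full_batch / 1e9))
    print("single image: %d bytes (~%.1f MB)" % (single_image, single_image / 1e6))
```

The full batch comes to 877,254,300 bytes (about 0.88 GB) for a single allocation, so either a smaller batch size or smaller images cuts the request proportionally.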

TypeError: __init__() got an unexpected keyword argument 'repeated'

I am using a Google Cloud VM instance to develop a custom object detector with the TensorFlow Object Detection API. I am using a pretrained model.
After creating all the necessary TFRecord files for input and configuring the object_detection pipeline config files, I used the following command for training:
python --logtostderr --train_dir=training /
I get the following error:
Traceback (most recent call last):
File "", line 184, in <module>
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "", line 180, in main
File "/opt/models/research/object_detection/", line 274, in train
train_config.prefetch_queue_capacity, data_augmentation_options)
File "/opt/models/research/object_detection/", line 59, in create_input_queue
tensor_dict = create_tensor_dict_fn()
File "", line 121, in get_next
File "/opt/models/research/object_detection/builders/", line 176, in build
File "/opt/models/research/object_detection/data_decoders/", line 204, in __init__
TypeError: __init__() got an unexpected keyword argument 'repeated'
How should I fix the error? I am quite new to this. Any help would be appreciated.
Check that your command is correct and that the config file is in the correct relative directory. I see there is a space in "training /"; it should be "training/".
My assumption is that the error is due to an incompatibility between the file and the installed TensorFlow version. Try removing that argument. Hopefully this helps.
I had a similar issue: I had an older TensorFlow installed and was trying to use newer models. Upgrading TensorFlow solved my problem.
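A version mismatch of this kind can be checked directly by asking whether a constructor accepts the keyword that the calling code passes. The sketch below uses Python 3's inspect.signature (the question runs Python 2.7, where this function is not available); OldHandler and NewHandler are made-up stand-ins for an older and a newer library class, not real TensorFlow classes.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with keyword argument `name`."""
    params = inspect.signature(func).parameters
    by_name = (inspect.Parameter.POSITIONAL_OR_KEYWORD,
               inspect.Parameter.KEYWORD_ONLY)
    if name in params and params[name].kind in by_name:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD
               for p in params.values())


class OldHandler(object):
    """Stand-in for the older API: no `repeated` argument."""
    def __init__(self, shape_keys=None):
        self.shape_keys = shape_keys


class NewHandler(object):
    """Stand-in for the newer API that the calling code expects."""
    def __init__(self, shape_keys=None, repeated=False):
        self.shape_keys = shape_keys
        self.repeated = repeated


if __name__ == "__main__":
    for cls in (OldHandler, NewHandler):
        print(cls.__name__, accepts_keyword(cls, "repeated"))
```

Calling OldHandler(repeated=True) raises the same kind of TypeError as in the question, which is consistent with the suggestion to upgrade the installed library rather than edit the calling code.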

Renaming the model Directory in Tensorflow

I renamed the model dir from /home/abcd/andrew_model_jul25_tif/ which contained model and summary directories to /home/abcd/andrew_model_sep22/ which contained the same two folders. When I ran the python script it gave me the following error:
Traceback (most recent call last):
File "", line 127, in <module>
File "/home/abcd/virtualenvs/project/local/lib/python2.7/site-packages/tensorflow/python/platform/", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "", line 119, in main
do_eval_on_whole(model_dir, file, file[a:], output_dir)
File "", line 51, in do_eval_on_whole
File "/home/abcd/virtualenvs/project/local/lib/python2.7/site-packages/tensorflow/python/training/", line 1602, in latest_checkpoint
if file_io.get_matching_files(v2_path) or file_io.get_matching_files(
File "/home/abcd/virtualenvs/project/local/lib/python2.7/site-packages/tensorflow/python/lib/io/", line 334, in get_matching_files
compat.as_bytes(single_filename), status)
File "/usr/lib/python2.7/", line 24, in __exit__
File "/home/abcd/virtualenvs/project/local/lib/python2.7/site-packages/tensorflow/python/framework/", line 466, in raise_exception_on_not_ok_status
tensorflow.python.framework.errors_impl.NotFoundError: /home/abcd/andrew_model_jul25_tif/model
When I changed the folder's name back to andrew_model_jul25 the script worked. Can changing the folder's name have such an effect?
I'm using the 1.1.0 version of tf, and running it on a GPU.
The problem arises here:
Try updating the name of your model_dir variable
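Another possibility, which this thread does not confirm, is that the old directory name is stored in the text file named checkpoint inside the model directory. tf.train.latest_checkpoint reads that file, and older TensorFlow versions could record absolute paths in its model_checkpoint_path and all_model_checkpoint_paths entries. If that is the case here, rewriting the stored prefix is one way out. The sketch below works on the file's text only; rewrite_checkpoint_state is a hypothetical helper and the model.ckpt-1000 name in the sample is made up.

```python
def rewrite_checkpoint_state(text, old_dir, new_dir):
    """Replace a quoted directory prefix in checkpoint-state text."""
    old_prefix = '"' + old_dir.rstrip("/") + "/"
    new_prefix = '"' + new_dir.rstrip("/") + "/"
    return text.replace(old_prefix, new_prefix)


# Made-up sample in the format of the `checkpoint` state file.
SAMPLE_STATE = (
    'model_checkpoint_path: '
    '"/home/abcd/andrew_model_jul25_tif/model/model.ckpt-1000"\n'
    'all_model_checkpoint_paths: '
    '"/home/abcd/andrew_model_jul25_tif/model/model.ckpt-1000"\n'
)

if __name__ == "__main__":
    print(rewrite_checkpoint_state(
        SAMPLE_STATE,
        "/home/abcd/andrew_model_jul25_tif",
        "/home/abcd/andrew_model_sep22",
    ))
```

Back up the checkpoint file before changing it; the data files themselves do not need to be touched.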

Redownloaded custom-built TensorFlow wheel won't install

I've built TensorFlow with custom SIMD extensions and created a wheel for it. If I simply do pip install /tmp/tensorflow_pkg/tensorflow-1.0.0-cp27-cp27mu-linux_x86_64.whl on the box that I built it on, that works. However, if I upload the whl file to cloud storage and do pip install, I get this error:
Collecting tensorflow==1.0.0 from
Traceback (most recent call last):
File "/usr/local/lib/python2.7/site-packages/pip/", line 215, in main
status =, args)
File "/usr/local/lib/python2.7/site-packages/pip/commands/", line 335, in run
File "/usr/local/lib/python2.7/site-packages/pip/", line 749, in build
File "/usr/local/lib/python2.7/site-packages/pip/req/", line 380, in prepare_files
File "/usr/local/lib/python2.7/site-packages/pip/req/", line 620, in _prepare_file
session=self.session, hashes=hashes)
File "/usr/local/lib/python2.7/site-packages/pip/", line 821, in unpack_url
File "/usr/local/lib/python2.7/site-packages/pip/", line 663, in unpack_http_url
unpack_file(from_path, location, content_type, link)
File "/usr/local/lib/python2.7/site-packages/pip/utils/", line 599, in unpack_file
flatten=not filename.endswith('.whl')
File "/usr/local/lib/python2.7/site-packages/pip/utils/", line 484, in unzip_file
zip = zipfile.ZipFile(zipfp, allowZip64=True)
File "/usr/local/lib/python2.7/", line 770, in __init__
File "/usr/local/lib/python2.7/", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
Do I need to configure my build differently somehow?
(capturing the solution as an answer)
The URL used for the download is not correct. The base url needed to be
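A wheel is a zip archive, so "File is not a zip file" usually means the URL returned something else, such as an HTML error or login page. Before blaming the build, it can help to check what was actually downloaded. The sketch below uses the standard library's zipfile.is_zipfile; is_zip_payload and is_zip_file_on_disk are hypothetical helper names, and the check says nothing about whether the wheel's contents are valid.

```python
import io
import zipfile


def is_zip_payload(data):
    """Return True if the bytes look like a zip archive (as a wheel must)."""
    return zipfile.is_zipfile(io.BytesIO(data))


def is_zip_file_on_disk(path):
    """Same check for a file that has already been downloaded."""
    with open(path, "rb") as handle:
        return is_zip_payload(handle.read())


if __name__ == "__main__":
    # Build a tiny zip in memory to stand in for a real wheel.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("example.txt", "hello")
    print(is_zip_payload(buf.getvalue()))              # a real archive
    print(is_zip_payload(b"<html>Not Found</html>"))   # an error page
```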