Exception: Cannot start TensorBoard. in Google Cloud Datalab - tensorflow

I can see that the TensorBoard process is running, and files are being written into the model directory. However, I repeatedly get the exception "Cannot start TensorBoard." I am using tf.estimator.
I am running my code on Google Cloud Datalab. I have tried changing the model directory and restarting the Datalab instance many times, and I have also tried killing all running TensorBoard processes. Nothing has worked so far. It was working earlier, and it still magically succeeds about once in every 10-15 attempts. What's happening?
This is how I am starting Tensorboard.
from google.datalab.ml import TensorBoard as tb
tb.start(model_dir)
This is how my Estimator is configured.
run_config = tf.estimator.RunConfig(
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tf_random_seed=FLAGS.tf_random_seed,
    model_dir=model_dir
)
estimator = tf.estimator.Estimator(model_fn=model_fn,
                                   config=run_config)
Below are the files being written into the model directory by tf.estimator.
eval 8 minutes ago
checkpoint 124 B 9 minutes ago
events.out.tfevents.1559025239.78fe4cbf0fad 603 kB 9 minutes ago
graph.pbtxt 399 kB 12 minutes ago
model.ckpt-1.data-00000-of-00001 261 MB 11 minutes ago
model.ckpt-1.index 811 B 11 minutes ago
model.ckpt-1.meta 170 kB 11 minutes ago
model.ckpt-5.data-00000-of-00001 261 MB 9 minutes ago
model.ckpt-5.index 811 B 9 minutes ago
model.ckpt-5.meta 170 kB 9 minutes ago
The error I am getting is below. It is the same every time, and I have no further information to identify what is going wrong.
Exception Traceback (most recent call last)
in <module>()
2 #tensorboard --logdir ./logs/1/train --host localhost --port 8081
3 from google.datalab.ml import TensorBoard as tb
----> 4 tb.start(model_dir)
/usr/local/envs/py3env/lib/python3.5/site-packages/google/datalab/ml/_tensorboard.py in start(logdir)
77 retry -= 1
78
---> 79 raise Exception('Cannot start TensorBoard.')
80
81 @staticmethod
Exception: Cannot start TensorBoard.
When I list the running TensorBoard processes using the code below, this is what I get.
x = tb.list() #Returns a dataframe
print(x)
logdir pid port
0 ./model_no_reuse/2 6236 40269
1 ./model_no_reuse/2 6241 57895
Please help me identify what is going wrong.

I tried increasing the VM configuration from 2 vCPU/4.5 GB to 4 vCPU/20 GB, and the issue is resolved. It looks like even though the TensorBoard process does get started, it needs a certain minimum of resources before it can actually open. I will update this answer if I arrive at any other conclusion.
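Before falling back on a bigger VM, it can also be worth stopping the stale TensorBoard processes that tb.list() reports, so a fresh instance is not competing with orphaned ones for ports and memory. A minimal sketch, assuming the google.datalab.ml TensorBoard class exposes a stop(pid) counterpart to start() and list():

from google.datalab.ml import TensorBoard as tb

# Stop every TensorBoard process Datalab knows about, then start fresh.
# Assumes tb.stop(pid) exists alongside tb.start() and tb.list().
for pid in tb.list()['pid']:
    tb.stop(pid)
tb.start(model_dir)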

Related

MachineConfig when training Model with TPU on GCP with Tensorflow-cloud

I am trying to train a rather large model (Longformer-large with a CNN classification head on top) on Google Cloud Platform. I am using tensorflow-cloud and Colab to run my model. I tried to run this with batch size 4 and 4 P100 GPUs but I still get an OOM error, so I would like to try it with a TPU; I have increased the batch size to 8 now.
However, I get an error saying that a TPU config cannot be the chief_worker_config.
This is my code:
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(
        image_build_bucket=GCS_BUCKET
    ),
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME})
This is the error:
Validating environment and input parameters.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-e1be60d71623> in <module>()
19 worker_config= tfc.COMMON_MACHINE_CONFIGS["TPU"],
20 chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
---> 21 job_labels={"job": JOB_NAME},
22 )
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/run.py in run(entry_point, requirements_txt, docker_config, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, job_labels, service_account, **kwargs)
256 job_labels=job_labels or {},
257 service_account=service_account,
--> 258 docker_parent_image=docker_config.parent_image,
259 )
260 print("Validation was successful.")
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in validate(entry_point, requirements_txt, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_build_bucket, called_from_notebook, job_labels, service_account, docker_parent_image)
78 _validate_distribution_strategy(distribution_strategy)
79 _validate_cluster_config(
---> 80 chief_config, worker_count, worker_config, docker_parent_image
81 )
82 _validate_job_labels(job_labels or {})
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in _validate_cluster_config(chief_config, worker_count, worker_config, docker_parent_image)
160 "Invalid `chief_config` input. "
161 "`chief_config` cannot be a TPU config. "
--> 162 "Received {}.".format(chief_config)
163 )
164
ValueError: Invalid `chief_config` input. `chief_config` cannot be a TPU config. Received <tensorflow_cloud.core.machine_config.MachineConfig object at 0x7f5860afe210>.
Can someone tell me how I can run my code on GCP TPUs? I actually don't care too much about training time; I just want some configuration that runs without OOM issues (so GPU, if it works, is totally fine with me as well).
Thank you!
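For reference, the check that raises here is specifically about chief_config: tensorflow-cloud insists that the chief be an ordinary machine, with the TPU attached as a worker. A hedged sketch of a configuration that should pass this validation, assuming tfc.COMMON_MACHINE_CONFIGS also provides a "CPU" entry:

tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(image_build_bucket=GCS_BUCKET),
    chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],   # chief only coordinates
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],  # the TPU does the training
    job_labels={"job": JOB_NAME})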

Load dataset from Roboflow in colab

I'm trying to retrieve a Roboflow project dataset in Google Colab. It works for two of the dataset versions, but not the latest one I have created (same project, version 5).
Does anyone know what is going wrong?
Snippet:
from roboflow import Roboflow
rf = Roboflow(api_key="keyremoved")
project = rf.workspace().project("project name")
dataset = project.version(5).download("yolov5")
loading Roboflow workspace...
loading Roboflow project...
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-22-7f073ab2bc86> in <module>()
7 rf = Roboflow(api_key="keyremoved")
8 project = rf.workspace().project("projectname")
----> 9 dataset = project.version(5).download("yolov5")
10
11
/usr/local/lib/python3.7/dist-packages/roboflow/core/version.py in download(self, model_format, location)
76 link = resp.json()['export']['link']
77 else:
---> 78 raise RuntimeError(resp.json())
79
80 def bar_progress(current, total, width=80):
RuntimeError: {'error': {'message': 'Unsupported get request. Export with ID `idremoved` does not exist or cannot be loaded due to missing permissions.', 'type': 'GraphMethodException', 'hint': 'You can find the API docs at https://docs.roboflow.com'}}
There can be limits on the number of images plus augmentations that you can export with Roboflow, depending on the plan you use. Please check your account details and limits, and contact Roboflow support if you need more help.

Jupyter kernel dies while reading file

I am reading a 22.2 GB CSV file into a pandas DataFrame in a Jupyter notebook on an EC2 instance, but I keep getting an error that kills the kernel.
The instance is a t3.2xlarge; while reading the file the CPU utilization is 13.4%, and the total volume size is 60 GB.
I am not sure what is causing this issue. Any ideas?
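A kernel that dies partway through a read like this is usually memory exhaustion rather than a pandas bug: a t3.2xlarge has 32 GiB of RAM, and a 22.2 GB CSV generally needs well over its on-disk size once parsed into a DataFrame. One workaround sketch is to read and reduce the file in chunks; the file name, column names, dtypes, and filter below are illustrative placeholders:

import pandas as pd

# Process the file in bounded pieces instead of materializing all 22 GB at once.
parts = []
for chunk in pd.read_csv('big_file.csv',
                         usecols=['col_a', 'col_b'],   # load only needed columns
                         dtype={'col_a': 'float32'},   # smaller dtypes save memory
                         chunksize=1_000_000):         # rows per chunk
    parts.append(chunk[chunk['col_b'] > 0])            # reduce each chunk early
df = pd.concat(parts, ignore_index=True)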

Tensorflow SavedModel file size increases with each save

I have TensorFlow r1.13 training code that saves a SavedModel periodically during a long training run (I am following this excellent article on the topic). I have noticed that each time the model is saved, the size increases. In fact it seems to increase exactly linearly, each file being a multiple of the initial file size. I wonder if TF is keeping a reference to all previously saved files and accumulating them on each later save. Below are the file sizes for several SavedModel files written in sequence over time as training progresses.
-rw-rw-r-- 1 ubuntu ubuntu 576962 Apr 15 23:56 ./model_accuracy_0.361/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 1116716 Apr 15 23:58 ./model_accuracy_0.539/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 1656470 Apr 16 00:11 ./model_accuracy_0.811/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 2196440 Apr 16 00:15 ./model_accuracy_0.819/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 2736794 Apr 16 00:17 ./model_accuracy_0.886/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 3277150 Apr 16 00:19 ./model_accuracy_0.908/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 3817530 Apr 16 00:21 ./model_accuracy_0.919/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 4357950 Apr 16 00:25 ./model_accuracy_0.930/saved_model.pb
-rw-rw-r-- 1 ubuntu ubuntu 4898492 Apr 16 00:27 ./model_accuracy_0.937/saved_model.pb
Is there a way to cull the previous saved versions? Or at least prevent them from accumulating in the first place? I will certainly keep only the last file, but it seems to be 10x larger than it should be.
Below is my code (largely copied from Silva):
# Creates the TensorInfo protobuf objects that encapsulate the input/output tensors
tensor_info_input_data_1 = tf.saved_model.utils.build_tensor_info(gd.data_1)
tensor_info_input_data_2 = tf.saved_model.utils.build_tensor_info(gd.data_2)
tensor_info_input_keep = tf.saved_model.utils.build_tensor_info(gd.keep)

# output tensor info
tensor_info_output_pred = tf.saved_model.utils.build_tensor_info(gd.targ_pred_oneh)
tensor_info_output_soft = tf.saved_model.utils.build_tensor_info(gd.targ_pred_soft)

# Define the SignatureDef for this export
prediction_signature = \
    tf.saved_model.signature_def_utils.build_signature_def(
        inputs={
            'data_1': tensor_info_input_data_1,
            'data_2': tensor_info_input_data_2,
            'keep': tensor_info_input_keep
        },
        outputs={
            'pred_orig': tensor_info_output_pred,
            'pred_soft': tensor_info_output_soft
        },
        method_name=tf.saved_model.signature_constants.CLASSIFY_METHOD_NAME)

graph_entry_point_name = "my_model"  # The logical name for the model in TF Serving

try:
    builder = tf.saved_model.builder.SavedModelBuilder(saved_model_path)
    builder.add_meta_graph_and_variables(
        sess=sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={graph_entry_point_name: prediction_signature}
    )
    builder.save(as_text=False)
    if verbose:
        print(" SavedModel graph written successfully. ")
    success = True
except Exception as e:
    print(" WARNING::SavedModel write FAILED. ")
    traceback.print_tb(e.__traceback__)
    success = False
return success
@Hephaestus,
If you're constructing a SavedModelBuilder each time, then it'll add new save operations to the graph every time you save.
Instead, you can construct SavedModelBuilder only once and just call builder.save repeatedly. This will not add new ops to the graph on each save call.
Alternatively I think you can create your own tf.train.Saver and pass it to add_meta_graph_and_variables. Then it shouldn't create any new operations.
A good debugging aid is tf.get_default_graph().finalize() once you're done building the graph, which will throw an exception rather than silently expanding the graph like this.
Hope this helps.
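Putting the first two suggestions together, a minimal sketch (TF 1.x; sess, prediction_signature, and the per-export directories are placeholders standing in for the question's own variables):

import tensorflow as tf

saver = tf.train.Saver()             # build the save ops exactly once
tf.get_default_graph().finalize()    # any later op creation now raises

for export_dir in export_dirs:       # a fresh directory per periodic export
    builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
    builder.add_meta_graph_and_variables(
        sess=sess,
        tags=[tf.saved_model.tag_constants.SERVING],
        signature_def_map={'my_model': prediction_signature},
        saver=saver)                 # reuse the pre-built Saver; no new ops
    builder.save(as_text=False)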
Set clear_extraneous_savers=True when exporting the MetaGraphDef from your Saver:
https://github.com/tensorflow/tensorflow/blob/b78d23cf92656db63bca1f2cbc9636c7caa387ca/tensorflow/python/saved_model/builder_impl.py#L382
meta_graph_def = saver.export_meta_graph(
    clear_devices=clear_devices, clear_extraneous_savers=True,
    strip_default_attrs=strip_default_attrs)

AttributeError: 'module' object has no attribute 'to_rgba'

I got an error when I used matplotlib.pyplot to show an image:
5 plt.ylim(-5,6)
6 plt.title('Question 1(c): sample cluster data (10,000 points per cluster)')
----> 7 plt.show()
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\matplotlib\pyplot.py in show(*args, **kw)
242 In non-interactive mode, display all figures and block until
243 the figures have been closed; in interactive mode it has no
--> 244 effect unless figures were created prior to a change from
245 non-interactive to interactive mode (not recommended). In
246 that case it displays the figures but does not block.
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\ipykernel\pylab\backend_inline.pyc in show(close, block)
37 display(
38 figure_manager.canvas.figure,
---> 39 metadata=_fetch_figure_metadata(figure_manager.canvas.figure)
40 )
41 finally:
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\ipykernel\pylab\backend_inline.pyc in _fetch_figure_metadata(fig)
172 """Get some metadata to help with displaying a figure."""
173 # determine if a background is needed for legibility
--> 174 if _is_transparent(fig.get_facecolor()):
175 # the background is transparent
176 ticksLight = _is_light([label.get_color()
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\ipykernel\pylab\backend_inline.pyc in _is_transparent(color)
193 def _is_transparent(color):
194 """Determine transparency from alpha."""
--> 195 rgba = colors.to_rgba(color)
196 return rgba[3] < .5
AttributeError: 'module' object has no attribute 'to_rgba'
According to the post, I updated matplotlib to 2.2.3, but it still doesn't work. How can I fix it?
I also encountered this situation; it was caused by the ipykernel version. I changed ipykernel from 4.10.0 to 4.9.0 and the problem was solved.
Open the command line on Windows (or the terminal on Mac) and run:
conda install ipykernel=4.9.0