MachineConfig when training Model with TPU on GCP with Tensorflow-cloud - tensorflow

I am trying to train a rather large model (Longformer-large with a CNN classification head on top) on Google Cloud Platform. I am using Tensorflow-cloud and Colab to run my model. I tried to run this with a batch size of 4 on 4 P100 GPUs, but I still get an OOM error, so I would like to try it on a TPU. I have increased the batch size to 8 now.
However, I get an error saying that chief_config cannot be a TPU config.
This is my code:
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(
        image_build_bucket=GCS_BUCKET
    ),
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME})
This is the error:
Validating environment and input parameters.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-e1be60d71623> in <module>()
19 worker_config= tfc.COMMON_MACHINE_CONFIGS["TPU"],
20 chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
---> 21 job_labels={"job": JOB_NAME},
22 )
2 frames
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/run.py in run(entry_point, requirements_txt, docker_config, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, job_labels, service_account, **kwargs)
256 job_labels=job_labels or {},
257 service_account=service_account,
--> 258 docker_parent_image=docker_config.parent_image,
259 )
260 print("Validation was successful.")
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in validate(entry_point, requirements_txt, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_build_bucket, called_from_notebook, job_labels, service_account, docker_parent_image)
78 _validate_distribution_strategy(distribution_strategy)
79 _validate_cluster_config(
---> 80 chief_config, worker_count, worker_config, docker_parent_image
81 )
82 _validate_job_labels(job_labels or {})
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in _validate_cluster_config(chief_config, worker_count, worker_config, docker_parent_image)
160 "Invalid `chief_config` input. "
161 "`chief_config` cannot be a TPU config. "
--> 162 "Received {}.".format(chief_config)
163 )
164
ValueError: Invalid `chief_config` input. `chief_config` cannot be a TPU config. Received <tensorflow_cloud.core.machine_config.MachineConfig object at 0x7f5860afe210>.
Can someone tell me how I can run my code on GCP TPUs? I don't care much about training time; I just want a configuration that runs without OOM errors (so a GPU setup, if it works, is totally fine with me as well).
Thank you!
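The validation error only rejects the chief, not the worker. One arrangement that may pass validation, assuming tensorflow-cloud accepts a CPU chief together with a single TPU worker (this is not confirmed anywhere in this thread), is sketched below. The helper name build_tpu_run_kwargs and the placeholder mapping are hypothetical and exist only so the sketch runs without tensorflow_cloud installed.

```python
from typing import Any, Dict, Mapping


def build_tpu_run_kwargs(machine_configs: Mapping[str, Any]) -> Dict[str, Any]:
    """Return cluster arguments with a CPU chief and a single TPU worker.

    `machine_configs` is assumed to behave like tfc.COMMON_MACHINE_CONFIGS,
    i.e. a mapping that contains the keys "CPU" and "TPU".
    """
    return {
        "chief_config": machine_configs["CPU"],   # the chief must not be a TPU
        "worker_count": 1,
        "worker_config": machine_configs["TPU"],  # the TPU goes on the worker
    }


# Placeholder mapping (hypothetical values) standing in for the real configs.
placeholder_configs = {"CPU": "cpu-machine", "TPU": "tpu-machine"}
cluster_kwargs = build_tpu_run_kwargs(placeholder_configs)
print(cluster_kwargs)
```

With the real library, the same dictionary would presumably be built from tfc.COMMON_MACHINE_CONFIGS and unpacked into tfc.run(...) alongside the other arguments from the question, replacing its worker_count, worker_config and chief_config lines.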

Related

aiplatform.PipelineJob.submit() fails due to ServiceUnavailable: 503 failed to connect to all addresses

I am trying to run a PipelineJob from a local instance (a local Windows machine with the GCP CLI installed, using a local JupyterLab), but I'm getting:
_InactiveRpcError Traceback (most recent call last)
The above exception was the direct cause of the following exception:
ServiceUnavailable Traceback (most recent call last)
~\AppData\Local\Temp\1\ipykernel_8896\1564544395.py in <module>
15 #credentials=CREDENTIALS,
16
---> 17 job.submit() #service_account=SERVICE_ACCOUNT
18 # job.run()
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\cloud\aiplatform\pipeline_jobs.py in submit(self, service_account, network)
284 parent=self._parent,
285 pipeline_job=self._gca_resource,
--> 286 pipeline_job_id=self.job_id,
287 )
288
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\cloud\aiplatform_v1\services\pipeline_service\client.py in create_pipeline_job(self, request, parent, pipeline_job, pipeline_job_id, retry, timeout, metadata)
1197
1198 # Send the request.
-> 1199 response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
1200
1201 # Done; return the response.
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\api_core\gapic_v1\method.py in __call__(self, timeout, retry, *args, **kwargs)
152 kwargs["metadata"] = metadata
153
--> 154 return wrapped_func(*args, **kwargs)
155
156
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\api_core\grpc_helpers.py in error_remapped_callable(*args, **kwargs)
66 return callable_(*args, **kwargs)
67 except grpc.RpcError as exc:
---> 68 raise exceptions.from_grpc_error(exc) from exc
69
70 return error_remapped_callable
ServiceUnavailable: 503 failed to connect to all addresses; last error: UNKNOWN: ipv4:xxx.xx.xxx.x:443: tcp handshaker shutdown
The user and project are set up, and the same code runs perfectly fine on a machine with macOS (same user account, same project).
This is the code I was trying to run:
#code for the pipeline here, was too long for the question
compiler.Compiler().compile(pipeline_func=pipeline, package_path="intro_pipeline.json")
DISPLAY_NAME = "intro_" + UUID
job = aiplatform.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path="intro_pipeline.json",
    pipeline_root=PIPELINE_ROOT,
    project=PROJECT_ID,
)
job.submit()
Please help me debug this. Maybe there is an issue with some certificates, but I have no idea where I should look. Thanks for any help.

NCCL operation ncclGroupEnd() failed: unhandled system error

I am able to run the notebook vit_jax.ipynb on Colab, train the model, and run my experiments, but when I try to replicate it on my cluster, I get the error shown below during training.
However, the forward pass to calculate accuracy works fine on my cluster.
My cluster has 4 GTX 1080 GPUs with CUDA 10.1, and I am using tensorflow==2.4.0 and jax[cuda101]==0.2.18. I am running this as a Jupyter notebook from inside a Docker container.
---------------------------------------------------------------------------
UnfilteredStackTrace Traceback (most recent call last)
<ipython-input-57-176d6124ae02> in <module>()
11 opt_repl, loss_repl, update_rng_repl = update_fn_repl(
---> 12 opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)
13 losses.append(loss_repl[0])
/usr/local/lib/python3.7/dist-packages/jax/_src/traceback_util.py in reraise_with_filtered_traceback(*args, **kwargs)
182 try:
--> 183 return fun(*args, **kwargs)
184 except Exception as e:
/usr/local/lib/python3.7/dist-packages/jax/_src/api.py in f_pmapped(*args, **kwargs)
1638 name=flat_fun.__name__, donated_invars=tuple(donated_invars),
-> 1639 global_arg_shapes=tuple(global_arg_shapes_flat))
1640 return tree_unflatten(out_tree(), out)
/usr/local/lib/python3.7/dist-packages/jax/core.py in bind(self, fun, *args, **params)
1620 assert len(params['in_axes']) == len(args)
-> 1621 return call_bind(self, fun, *args, **params)
1622
/usr/local/lib/python3.7/dist-packages/jax/core.py in call_bind(primitive, fun, *args, **params)
1551 tracers = map(top_trace.full_raise, args)
-> 1552 outs = primitive.process(top_trace, fun, tracers, params)
1553 return map(full_lower, apply_todos(env_trace_todo(), outs))
/usr/local/lib/python3.7/dist-packages/jax/core.py in process(self, trace, fun, tracers, params)
1623 def process(self, trace, fun, tracers, params):
-> 1624 return trace.process_map(self, fun, tracers, params)
1625
/usr/local/lib/python3.7/dist-packages/jax/core.py in process_call(self, primitive, f, tracers, params)
606 def process_call(self, primitive, f, tracers, params):
--> 607 return primitive.impl(f, *tracers, **params)
608 process_map = process_call
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in xla_pmap_impl(fun, backend, axis_name, axis_size, global_axis_size, devices, name, in_axes, out_axes_thunk, donated_invars, global_arg_shapes, *args)
636 ("fingerprint", fingerprint))
--> 637 return compiled_fun(*args)
638
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in execute_replicated(compiled, backend, in_handler, out_handler, *args)
1159 input_bufs = in_handler(args)
-> 1160 out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
1161 if xla.needs_check_special():
UnfilteredStackTrace: RuntimeError: Internal: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:203: NCCL operation ncclGroupEnd() failed: unhandled system error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-57-176d6124ae02> in <module>()
10
11 opt_repl, loss_repl, update_rng_repl = update_fn_repl(
---> 12 opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)
13 losses.append(loss_repl[0])
14 lrs.append(lr_fn(step))
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in execute_replicated(compiled, backend, in_handler, out_handler, *args)
1158 def execute_replicated(compiled, backend, in_handler, out_handler, *args):
1159 input_bufs = in_handler(args)
-> 1160 out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
1161 if xla.needs_check_special():
1162 for bufs in out_bufs:
RuntimeError: Internal: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:203: NCCL operation ncclGroupEnd() failed: unhandled system error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
Has anyone faced this issue before, or is there any way to resolve it?
It is hard to know for sure without more information, but this error can be caused by running out of GPU memory. Depending on your local settings, you may be able to remedy it by increasing the proportion of GPU memory reserved by XLA, e.g. by setting the XLA_PYTHON_CLIENT_MEM_FRACTION environment variable to 0.9 or something similarly high.
Alternatively, you could try running your code on a smaller problem that fits into memory on your local hardware.
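As a minimal sketch of the first suggestion, the variable can be set from Python itself. The helper name reserve_gpu_memory_fraction is hypothetical, and the value 0.9 is just the example figure from above. The assumption here is that the variable must be in place before JAX initializes its GPU backend, so setting it before the first import of jax is the safe ordering.

```python
import os


def reserve_gpu_memory_fraction(fraction: float) -> str:
    """Set XLA_PYTHON_CLIENT_MEM_FRACTION and return the stored string."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the interval (0, 1]")
    value = f"{fraction:.2f}"
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = value
    return value


# Call this before the first `import jax` in the process.
mem_fraction = reserve_gpu_memory_fraction(0.9)
print(mem_fraction)
```

In a Docker setup like the one in the question, the same variable could instead be exported in the container environment before the notebook server starts.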

Instance Normalization Error while converting model from tensorflow to Coreml (4.0)

I am trying to convert my model from TensorFlow to CoreML, but I get the error below. Is it not possible to convert an instance normalization layer to CoreML? Is there any workaround?
ValueError Traceback (most recent call last)
in ()
6
7 model = ct.convert(
----> 8 tf_keras_model )
6 frames
/usr/local/lib/python3.6/dist-packages/coremltools/converters/mil/mil/block.py in remove_ops(self, existing_ops)
700 + "used by ops {}"
701 )
--> 702 raise ValueError(msg.format(op.name, i, v.name, child_op_names))
703 # Check that the output Var isn't block's output
704 if v in self._outputs:
ValueError: Cannot delete op 'Generator/StatefulPartitionedCall/StatefulPartitionedCall/encoder_down_resblock_0/instance_norm_0/Shape' with active output at id 0: 'Generator/StatefulPartitionedCall/StatefulPartitionedCall/encoder_down_resblock_0/instance_norm_0/Shape' used by ops ['Generator/StatefulPartitionedCall/StatefulPartitionedCall/encoder_down_resblock_0/instance_norm_0/strided_slice']
I used keras-contrib instead of tensorflow_addons and it works fine. Please see the issue and its solution below. The issue is still open for tensorflow_addons.
https://github.com/apple/coremltools/issues/1007

Running TensorFlow tests in Google Colab

I want to run tests on Google Colab to ensure reproducibility, but I get a system error at the end, which I do not get on my local machine.
I set up TensorFlow in Google Colab with
!pip install tensorflow==1.12.0
import tensorflow as tf
print(tf.__version__)
which, after some lines of installation, prints:
1.12.0
I then want to run a simple test:
import tensorflow as tf
class Tests(tf.test.TestCase):
    def test_gpu(self):
        self.assertEqual(False, tf.test.is_gpu_available())
tf.test.main()
The test passes (along with a default session test) on my local machine, and also on Colab, but after that the kernel returns a system error:
..
----------------------------------------------------------------------
Ran 2 tests in 0.005s
OK
An exception has occurred, use %tb to see the full traceback.
SystemExit: False
After calling %tb, I get a long stack trace pasted below, which gives little indication. How can I fix it?
The stacktrace is:
SystemExit Traceback (most recent call last)
<ipython-input-20-6a87bf6320f2> in <module>()
7 self.assertEqual(False, tf.test.is_gpu_available())
8
----> 9 tf.test.main()
10
11
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/test.py in main(argv)
62 """Runs all unit tests."""
63 _test_util.InstallStackTraceHandler()
---> 64 return _googletest.main(argv)
65
66
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/googletest.py in main(argv)
98 args = sys.argv
99 return app.run(main=g_main, argv=args)
--> 100 benchmark.benchmarks_main(true_main=main_wrapper)
101
102
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/benchmark.py in benchmarks_main(true_main, argv)
342 app.run(lambda _: _run_benchmarks(regex), argv=argv)
343 else:
--> 344 true_main()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/googletest.py in main_wrapper()
97 if args is None:
98 args = sys.argv
---> 99 return app.run(main=g_main, argv=args)
100 benchmark.benchmarks_main(true_main=main_wrapper)
101
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py in run(main, argv)
123 # Call the main function, passing through any arguments
124 # to the final program.
--> 125 _sys.exit(main(argv))
126
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/googletest.py in g_main(argv)
68 if ('TEST_TOTAL_SHARDS' not in os.environ or
69 'TEST_SHARD_INDEX' not in os.environ):
---> 70 return unittest_main(argv=argv)
71
72 total_shards = int(os.environ['TEST_TOTAL_SHARDS'])
/usr/lib/python3.6/unittest/main.py in __init__(self, module, defaultTest, argv, testRunner, testLoader, exit, verbosity, failfast, catchbreak, buffer, warnings, tb_locals)
93 self.progName = os.path.basename(argv[0])
94 self.parseArgs(argv)
---> 95 self.runTests()
96
97 def usageExit(self, msg=None):
/usr/lib/python3.6/unittest/main.py in runTests(self)
256 self.result = testRunner.run(self.test)
257 if self.exit:
--> 258 sys.exit(not self.result.wasSuccessful())
259
260 main = TestProgram
SystemExit: False
The error you're seeing comes from unittest trying to exit the Python process, which Jupyter prevents on your behalf. You can avoid that with e.g.:
import tensorflow as tf
class Tests(tf.test.TestCase):
    def test_gpu(self):
        self.assertEqual(False, tf.test.is_gpu_available())
import unittest
unittest.main(argv=['first-arg-is-ignored'], exit=False)
(note the last line is different to yours and is lifted from https://github.com/jupyter/notebook/issues/2746)
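Another way to sidestep the exit call, shown here as a sketch with plain unittest rather than anything from the answer above, is to load the test case and hand it to a runner directly. The class ExampleTests is a hypothetical stand-in. Since tf.test.TestCase derives from unittest.TestCase, the same pattern should in principle apply to the Tests class from the question, though that is not verified here.

```python
import unittest


class ExampleTests(unittest.TestCase):
    def test_addition(self):
        self.assertEqual(2, 1 + 1)


# Load and run the tests directly. TextTestRunner.run() returns a result
# object and does not call sys.exit(), so the notebook kernel keeps running.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print(result.wasSuccessful())
```

The returned result object also exposes failures and errors, which makes it easy to inspect the outcome in a later notebook cell.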

AttributeError: 'module' object has no attribute 'to_rgba'

I got an error when I used matplotlib.pyplot to show an image:
5 plt.ylim(-5,6)
6 plt.title('Question 1(c): sample cluster data (10,000 points per cluster)')
----> 7 plt.show()
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\matplotlib\pyplot.py in show(*args, **kw)
242 In non-interactive mode, display all figures and block until
243 the figures have been closed; in interactive mode it has no
--> 244 effect unless figures were created prior to a change from
245 non-interactive to interactive mode (not recommended). In
246 that case it displays the figures but does not block.
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\ipykernel\pylab\backend_inline.pyc in show(close, block)
37 display(
38 figure_manager.canvas.figure,
---> 39 metadata=_fetch_figure_metadata(figure_manager.canvas.figure)
40 )
41 finally:
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\ipykernel\pylab\backend_inline.pyc in _fetch_figure_metadata(fig)
172 """Get some metadata to help with displaying a figure."""
173 # determine if a background is needed for legibility
--> 174 if _is_transparent(fig.get_facecolor()):
175 # the background is transparent
176 ticksLight = _is_light([label.get_color()
C:\Users\yashi\Anaconda3\envs\CSC411\lib\site-packages\ipykernel\pylab\backend_inline.pyc in _is_transparent(color)
193 def _is_transparent(color):
194 """Determine transparency from alpha."""
--> 195 rgba = colors.to_rgba(color)
196 return rgba[3] < .5
AttributeError: 'module' object has no attribute 'to_rgba'
According to the post, I updated matplotlib to 2.23, but it still doesn't work. How can I fix it?
I also ran into this; it is caused by the ipykernel version. I changed ipykernel from 4.10.0 to 4.9.0, and that solved the problem.
Open the command line on Windows (or the terminal on Mac) and run:
conda install ipykernel=4.9.0
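To confirm whether the mismatch is still present after changing versions, a small check like the following may help. It is only a sketch: the helper name module_has_attribute is hypothetical, and it is demonstrated on a standard-library module so that it runs without matplotlib installed.

```python
import importlib


def module_has_attribute(module_name, attribute):
    """Return True if the importable module exposes the given attribute."""
    module = importlib.import_module(module_name)
    return hasattr(module, attribute)


# Demonstrated on a standard-library module so the sketch runs anywhere.
has_sqrt = module_has_attribute("math", "sqrt")
has_to_rgba = module_has_attribute("math", "to_rgba")
print(has_sqrt, has_to_rgba)
```

In the affected environment, calling module_has_attribute("matplotlib.colors", "to_rgba") would presumably show whether the function that the traceback reports as missing is available to the installed ipykernel.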