aiplatform.PipelineJob.submit() fails due to ServiceUnavailable: 503 failed to connect to all addresses - ssl

I am trying to run a PipelineJob from a local machine (Windows, with the GCP CLI installed, using a local JupyterLab), but I'm getting:
_InactiveRpcError Traceback (most recent call last)
The above exception was the direct cause of the following exception:
ServiceUnavailable Traceback (most recent call last)
~\AppData\Local\Temp\1\ipykernel_8896\1564544395.py in <module>
15 #credentials=CREDENTIALS,
16
---> 17 job.submit() #service_account=SERVICE_ACCOUNT
18 # job.run()
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\cloud\aiplatform\pipeline_jobs.py in submit(self, service_account, network)
284 parent=self._parent,
285 pipeline_job=self._gca_resource,
--> 286 pipeline_job_id=self.job_id,
287 )
288
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\cloud\aiplatform_v1\services\pipeline_service\client.py in create_pipeline_job(self, request, parent, pipeline_job, pipeline_job_id, retry, timeout, metadata)
1197
1198 # Send the request.
-> 1199 response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
1200
1201 # Done; return the response.
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\api_core\gapic_v1\method.py in __call__(self, timeout, retry, *args, **kwargs)
152 kwargs["metadata"] = metadata
153
--> 154 return wrapped_func(*args, **kwargs)
155
156
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\api_core\grpc_helpers.py in error_remapped_callable(*args, **kwargs)
66 return callable_(*args, **kwargs)
67 except grpc.RpcError as exc:
---> 68 raise exceptions.from_grpc_error(exc) from exc
69
70 return error_remapped_callable
ServiceUnavailable: 503 failed to connect to all addresses; last error: UNKNOWN: ipv4:xxx.xx.xxx.x:443: tcp handshaker shutdown
The user and project are set up, and the same code runs perfectly fine on a macOS machine (same user account, same project).
This is the code I was trying to run:
#code for the pipeline here, was too long for the question
compiler.Compiler().compile(pipeline_func=pipeline, package_path="intro_pipeline.json")
DISPLAY_NAME = "intro_" + UUID
job = aiplatform.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path="intro_pipeline.json",
    pipeline_root=PIPELINE_ROOT,
    project=PROJECT_ID,
)
job.submit()
Please help me debug this. Maybe there is an issue with some certificates, but I have no idea where I should look. Thanks for any help.

MachineConfig when training Model with TPU on GCP with Tensorflow-cloud

I am trying to train a rather large model (Longformer-large with a CNN classification head on top) on Google Cloud Platform. I am using Tensorflow-cloud and Colab to run my model. I tried to run this with batchsize 4 and 4 P100-GPUs but I still get an OOM error, so I would like to try it with TPU. I have increased batch size to 8 now.
However, I get an error saying that a TPU config cannot be used as the chief_config.
This is my code:
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(
        image_build_bucket=GCS_BUCKET
    ),
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME})
This is the error:
Validating environment and input parameters.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-e1be60d71623> in <module>()
19 worker_config= tfc.COMMON_MACHINE_CONFIGS["TPU"],
20 chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
---> 21 job_labels={"job": JOB_NAME},
22 )
2 frames
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/run.py in run(entry_point, requirements_txt, docker_config, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, job_labels, service_account, **kwargs)
256 job_labels=job_labels or {},
257 service_account=service_account,
--> 258 docker_parent_image=docker_config.parent_image,
259 )
260 print("Validation was successful.")
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in validate(entry_point, requirements_txt, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_build_bucket, called_from_notebook, job_labels, service_account, docker_parent_image)
78 _validate_distribution_strategy(distribution_strategy)
79 _validate_cluster_config(
---> 80 chief_config, worker_count, worker_config, docker_parent_image
81 )
82 _validate_job_labels(job_labels or {})
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in _validate_cluster_config(chief_config, worker_count, worker_config, docker_parent_image)
160 "Invalid `chief_config` input. "
161 "`chief_config` cannot be a TPU config. "
--> 162 "Received {}.".format(chief_config)
163 )
164
ValueError: Invalid `chief_config` input. `chief_config` cannot be a TPU config. Received <tensorflow_cloud.core.machine_config.MachineConfig object at 0x7f5860afe210>.
Can someone tell me how I can run my code on GCP TPUs? I don't care too much about training time; I just want a configuration that runs without OOM errors (so a GPU setup is totally fine with me as well, if it works).
Thank you!

NCCL operation ncclGroupEnd() failed: unhandled system error

I am able to run the notebook vit_jax.ipynb on Colab, train, and run my experiments, but when I try to replicate it on my cluster I get the error below during training.
However, the forward pass to calculate accuracy works fine on my cluster.
My cluster has 4 GTX 1080 GPUs with CUDA 10.1, and I am using tensorflow==2.4.0 and jax[cuda101]==0.2.18. I am running this as a Jupyter notebook from inside a Docker container.
---------------------------------------------------------------------------
UnfilteredStackTrace Traceback (most recent call last)
<ipython-input-57-176d6124ae02> in <module>()
11 opt_repl, loss_repl, update_rng_repl = update_fn_repl(
---> 12 opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)
13 losses.append(loss_repl[0])
/usr/local/lib/python3.7/dist-packages/jax/_src/traceback_util.py in reraise_with_filtered_traceback(*args, **kwargs)
182 try:
--> 183 return fun(*args, **kwargs)
184 except Exception as e:
/usr/local/lib/python3.7/dist-packages/jax/_src/api.py in f_pmapped(*args, **kwargs)
1638 name=flat_fun.__name__, donated_invars=tuple(donated_invars),
-> 1639 global_arg_shapes=tuple(global_arg_shapes_flat))
1640 return tree_unflatten(out_tree(), out)
/usr/local/lib/python3.7/dist-packages/jax/core.py in bind(self, fun, *args, **params)
1620 assert len(params['in_axes']) == len(args)
-> 1621 return call_bind(self, fun, *args, **params)
1622
/usr/local/lib/python3.7/dist-packages/jax/core.py in call_bind(primitive, fun, *args, **params)
1551 tracers = map(top_trace.full_raise, args)
-> 1552 outs = primitive.process(top_trace, fun, tracers, params)
1553 return map(full_lower, apply_todos(env_trace_todo(), outs))
/usr/local/lib/python3.7/dist-packages/jax/core.py in process(self, trace, fun, tracers, params)
1623 def process(self, trace, fun, tracers, params):
-> 1624 return trace.process_map(self, fun, tracers, params)
1625
/usr/local/lib/python3.7/dist-packages/jax/core.py in process_call(self, primitive, f, tracers, params)
606 def process_call(self, primitive, f, tracers, params):
--> 607 return primitive.impl(f, *tracers, **params)
608 process_map = process_call
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in xla_pmap_impl(fun, backend, axis_name, axis_size, global_axis_size, devices, name, in_axes, out_axes_thunk, donated_invars, global_arg_shapes, *args)
636 ("fingerprint", fingerprint))
--> 637 return compiled_fun(*args)
638
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in execute_replicated(compiled, backend, in_handler, out_handler, *args)
1159 input_bufs = in_handler(args)
-> 1160 out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
1161 if xla.needs_check_special():
UnfilteredStackTrace: RuntimeError: Internal: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:203: NCCL operation ncclGroupEnd() failed: unhandled system error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-57-176d6124ae02> in <module>()
10
11 opt_repl, loss_repl, update_rng_repl = update_fn_repl(
---> 12 opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)
13 losses.append(loss_repl[0])
14 lrs.append(lr_fn(step))
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in execute_replicated(compiled, backend, in_handler, out_handler, *args)
1158 def execute_replicated(compiled, backend, in_handler, out_handler, *args):
1159 input_bufs = in_handler(args)
-> 1160 out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
1161 if xla.needs_check_special():
1162 for bufs in out_bufs:
RuntimeError: Internal: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:203: NCCL operation ncclGroupEnd() failed: unhandled system error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
Has anyone faced this issue before, or is there any way to resolve it?
It is hard to know for sure without more information, but this error can be caused by running out of GPU memory. Depending on your local settings, you may be able to remedy it by increasing the proportion of GPU memory reserved by XLA, e.g. by setting the XLA_PYTHON_CLIENT_MEM_FRACTION environment variable to 0.9 or something similarly high.
Alternatively, you could try running your code on a smaller problem that fits into memory on your local hardware.
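As a minimal sketch of the first suggestion, the variable can be set from Python at the top of the notebook. This assumes it is set before JAX initializes its GPU backend, so it is safest to do it before `import jax`; the value 0.9 is only the example figure from above, not a tuned recommendation.

```python
import os

# Assumption: this runs before JAX is imported, so the setting is picked up
# when the GPU backend starts. Restart the kernel if JAX was already imported.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "0.9"

# import jax  # import JAX only after the variable is set

print("XLA_PYTHON_CLIENT_MEM_FRACTION =", os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"])
```

Inside a Docker container the same variable can instead be passed in through the container's environment, which avoids depending on import order.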

What is OSError: [Errno 95] Operation not supported for pandas to_csv on colab?

My input is:
test=pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.head()
It ran as expected.
But the following call fails:
test.to_csv('submitV1.csv', header=False)
The full error message that I got was:
OSError Traceback (most recent call last)
<ipython-input-5-fde243a009c0> in <module>()
9 from google.colab import files
10 print(test)'''
---> 11 test.to_csv('submitV1.csv', header=False)
12 files.download('/gdrive/My Drive/data-kaggle/submission/submitV1.csv')
2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
3018 doublequote=doublequote,
3019 escapechar=escapechar, decimal=decimal)
-> 3020 formatter.save()
3021
3022 if path_or_buf is None:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in save(self)
155 f, handles = _get_handle(self.path_or_buf, self.mode,
156 encoding=self.encoding,
--> 157 compression=self.compression)
158 close = True
159
/usr/local/lib/python3.6/dist-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
422 elif encoding:
423 # Python 3 and encoding
--> 424 f = open(path_or_buf, mode,encoding=encoding, newline="")
425 elif is_text:
426 # Python 3 and no explicit encoding
OSError: [Errno 95] Operation not supported: 'submitV1.csv'
Additional Information about the error:
Before running this command, if I run
df=pd.DataFrame()
df.to_csv("file.csv")
files.download("file.csv")
It is running properly, but the same code is producing the operation not supported error if I try to run it after trying to convert test data frame to a csv file.
I am also getting the message "A Google Drive timeout has occurred (most recently at 13:02:43). More info." just before running the command.
You are currently in a directory in which you don't have write permissions.
Check your current directory with pwd. It might be gdrive or some directory inside it, which is why you are unable to save there.
Now change the current working directory to some other directory where you have permission to write. cd ~ will work fine. It will change the directory to /root.
Now you can use:
test.to_csv('submitV1.csv', header=False)
It will save 'submitV1.csv' to /root
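The same check can be done from Python instead of with pwd and cd. The sketch below is one way to do it, not the only one: the helper names `is_writable` and `pick_writable_dir` are made up for this example, and the probe simply tries to create a temporary file, since that reflects what open() will actually be allowed to do.

```python
import os
import tempfile

def is_writable(path):
    """Return True if a file can actually be created inside `path`."""
    try:
        # Creating (and immediately discarding) a temp file is the real test.
        with tempfile.TemporaryFile(dir=path):
            return True
    except OSError:
        return False

def pick_writable_dir(candidates):
    """Return the first candidate directory that accepts new files."""
    for path in candidates:
        if is_writable(path):
            return path
    # Fall back to the system temp directory if no candidate works.
    return tempfile.gettempdir()

# Hypothetical preference order: current directory first, then the home directory.
out_dir = pick_writable_dir([os.getcwd(), os.path.expanduser("~")])
out_path = os.path.join(out_dir, "submitV1.csv")
print("current directory:", os.getcwd())
print("will write to:", out_path)
# test.to_csv(out_path, header=False)  # `test` is the DataFrame from the question
```

Passing an explicit path to to_csv this way avoids relying on whatever the notebook's working directory happens to be.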

Running TensorFlow tests in Google Colab

I want to run tests on Google Colab to ensure reproducibility, but I get a system error at the end, which I do not get on my local machine.
I set up TensorFlow in Google Colab with
!pip install tensorflow==1.12.0
import tensorflow as tf
print(tf.__version__)
which, after some lines of installation, prints:
1.12.0
I then want to run a simple test:
import tensorflow as tf
class Tests(tf.test.TestCase):
    def test_gpu(self):
        self.assertEqual(False, tf.test.is_gpu_available())
tf.test.main()
The test passes (along with a default session test) on my local machine, and also on Colab, but after that the kernel returns a system error:
..
----------------------------------------------------------------------
Ran 2 tests in 0.005s
OK
An exception has occurred, use %tb to see the full traceback.
SystemExit: False
After calling %tb, I get a long stack trace pasted below, which gives little indication. How can I fix it?
The stacktrace is:
SystemExit Traceback (most recent call last)
<ipython-input-20-6a87bf6320f2> in <module>()
7 self.assertEqual(False, tf.test.is_gpu_available())
8
----> 9 tf.test.main()
10
11
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/test.py in main(argv)
62 """Runs all unit tests."""
63 _test_util.InstallStackTraceHandler()
---> 64 return _googletest.main(argv)
65
66
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/googletest.py in main(argv)
98 args = sys.argv
99 return app.run(main=g_main, argv=args)
--> 100 benchmark.benchmarks_main(true_main=main_wrapper)
101
102
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/benchmark.py in benchmarks_main(true_main, argv)
342 app.run(lambda _: _run_benchmarks(regex), argv=argv)
343 else:
--> 344 true_main()
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/googletest.py in main_wrapper()
97 if args is None:
98 args = sys.argv
---> 99 return app.run(main=g_main, argv=args)
100 benchmark.benchmarks_main(true_main=main_wrapper)
101
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/app.py in run(main, argv)
123 # Call the main function, passing through any arguments
124 # to the final program.
--> 125 _sys.exit(main(argv))
126
/usr/local/lib/python3.6/dist-packages/tensorflow/python/platform/googletest.py in g_main(argv)
68 if ('TEST_TOTAL_SHARDS' not in os.environ or
69 'TEST_SHARD_INDEX' not in os.environ):
---> 70 return unittest_main(argv=argv)
71
72 total_shards = int(os.environ['TEST_TOTAL_SHARDS'])
/usr/lib/python3.6/unittest/main.py in __init__(self, module, defaultTest, argv, testRunner, testLoader, exit, verbosity, failfast, catchbreak, buffer, warnings, tb_locals)
93 self.progName = os.path.basename(argv[0])
94 self.parseArgs(argv)
---> 95 self.runTests()
96
97 def usageExit(self, msg=None):
/usr/lib/python3.6/unittest/main.py in runTests(self)
256 self.result = testRunner.run(self.test)
257 if self.exit:
--> 258 sys.exit(not self.result.wasSuccessful())
259
260 main = TestProgram
SystemExit: False
The error you're seeing is coming from unittest trying to exit the python process, which Jupyter prevents on your behalf. You can avoid that with e.g.:
import tensorflow as tf
class Tests(tf.test.TestCase):
    def test_gpu(self):
        self.assertEqual(False, tf.test.is_gpu_available())
import unittest
unittest.main(argv=['first-arg-is-ignored'], exit=False)
(note the last line is different to yours and is lifted from https://github.com/jupyter/notebook/issues/2746)
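Another way to sidestep the exit call, sketched here with the standard library only, is to build the suite and run it with a TextTestRunner directly, which never calls sys.exit. The test case below is a stand-in chosen so the example runs without TensorFlow; since tf.test.TestCase is a subclass of unittest.TestCase, the same pattern should apply to the class from the question.

```python
import unittest

class Tests(unittest.TestCase):
    # Stand-in test for illustration; replace with the real test methods.
    def test_addition(self):
        self.assertEqual(2, 1 + 1)

# Load only the tests from this class and run them without exiting the process.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(Tests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
print("all passed:", result.wasSuccessful())
```

The returned result object also exposes the failures and errors lists, which is convenient when the notebook needs to act on the outcome.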

Accessing S3 from Dask.bag

As the title suggests, I'm trying to use a dask.bag to read a single file from S3 on an EC2 instance:
from distributed import Executor, progress
from dask import delayed
import dask
import dask.bag as db
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq')
I get a very long error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
322 bucket, key = split_path(path)
--> 323 out = self.s3.head_object(Bucket=bucket, Key=key)
324 out = {'ETag': out['ETag'], 'Key': '/'.join([bucket, key]),
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
277 # The "self" in this scope is referring to the BaseClient.
--> 278 return self._make_api_call(operation_name, kwargs)
279
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
571 if http.status_code >= 300:
--> 572 raise ClientError(parsed_response, operation_name)
573 else:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
<ipython-input-43-0ad435c69ecc> in <module>()
4 #data = db.read_text('/Users/zen/Code/git/sra_data.fastq')
5 #data = db.read_text('/Users/zen/Code/git/pycuda-euler/data/Ba10k.sim1.fq')
----> 6 data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq', blocksize=900000)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bag/text.py in read_text(urlpath, blocksize, compression, encoding, errors, linedelimiter, collection, storage_options)
89 _, blocks = read_bytes(urlpath, delimiter=linedelimiter.encode(),
90 blocksize=blocksize, sample=False, compression=compression,
---> 91 **(storage_options or {}))
92 if isinstance(blocks[0], (tuple, list)):
93 blocks = list(concat(blocks))
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, **kwargs)
210 return read_bytes(storage_options.pop('path'), delimiter=delimiter,
211 not_zero=not_zero, blocksize=blocksize, sample=sample,
--> 212 compression=compression, **storage_options)
213
214
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in read_bytes(path, s3, delimiter, not_zero, blocksize, sample, compression, **kwargs)
91 offsets = [0]
92 else:
---> 93 size = getsize(s3_path, compression, s3)
94 offsets = list(range(0, size, blocksize))
95 if not_zero:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in getsize(path, compression, s3)
185 def getsize(path, compression, s3):
186 if compression is None:
--> 187 return s3.info(path)['Size']
188 else:
189 with s3.open(path, 'rb') as f:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
327 return out
328 except (ClientError, ParamValidationError):
--> 329 raise FileNotFoundError(path)
330
331 def _walk(self, path, refresh=False):
FileNotFoundError: pycuda-euler-data/Ba10k.sim1.fq
As far as I can tell, this is exactly what the docs say to do and unfortunately many examples I see online use the older from_s3() method that no longer exists.
However I am able to access the file using s3fs:
sample, partitions = s3.read_bytes('pycuda-euler-data/Ba10k.sim1.fq', s3=s3files, delimiter=b'\n')
sample
b'#gi|30260195|ref|NC_003997.3|_5093_5330_1:0:0_1:0:0_0/1\nGATAACTCGATTTAAACCAGATCCAGAAAATTTTCA\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_7142_7326_1:1:0_0:0:0_1/1\nCTATTCCGCCGCATCAACTTGGTGAAGTAATGGATG\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_5524_5757_3:0:0_2:0:0_2/1\nGTAATTTAACTGGTGAGGACGTGCGTGATGGTTTAT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_2706_2926_1:0:0_3:0:0_3/1\nAGTAAAACAGATATTTTTGTAAATAGAAAAGAATTT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_500_735_3:1:0_0:0:0_4/1\nATACTCTGTGGTAAATGATTAGAATCATCTTGTGCT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_2449_2653_3:0:0_1:0:0_5/1\nCTTGAATTGCTACAGATAGTCATAGGTTAGCCCTTC\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_3252_3460_0:0:0_0:0:0_6/1\nCGATGTAATTGATACAGGTGGCGCTGTAAAATGGTT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_1860_2095_0:0:0_1:0:0_7/1\nATAAAAGATTCAATCGAAATATCAGCATCGTTTCCT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_870_1092_1:0:0_0:0:0_8/1\nTTGGAAAAACCCATTTAATGCATGCAATTGGCCTTT\n ... etc.
What am I doing wrong?
EDIT:
Upon suggestion, I went back and checked permissions. On the bucket I added a Grantee Everyone List, and on the file a Grantee Everyone Open/Download. I still get the same error.
Dask uses the library s3fs to manage data on S3. The s3fs project uses Amazon's boto3. You can provide credentials in two ways:
Use a .boto file
You can put a .boto file in your home directory
Use the storage_options= keyword
You can add a storage_options= keyword to your db.read_text call to include credential information by hand. This option is a dictionary whose entries are passed as keyword arguments to the s3fs.S3FileSystem constructor.
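A minimal sketch of the storage_options= route follows. It assumes the credentials live in the usual AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables and uses the `key`, `secret` and `anon` arguments of s3fs.S3FileSystem; the helper names are invented for this example, and the anonymous fallback only makes sense for objects that are publicly readable.

```python
import os

def s3_storage_options(env=os.environ):
    """Build a storage_options dict for s3fs from environment variables."""
    key = env.get("AWS_ACCESS_KEY_ID")
    secret = env.get("AWS_SECRET_ACCESS_KEY")
    if key and secret:
        return {"key": key, "secret": secret}
    # No credentials found: fall back to anonymous access (public objects only).
    return {"anon": True}

def read_fastq_lines(path, storage_options):
    """Open a text file on S3 as a dask bag (needs dask and s3fs installed)."""
    import dask.bag as db  # imported lazily so the helper above works without dask
    return db.read_text(path, storage_options=storage_options)

options = s3_storage_options()
print("using credentials:", "anon" not in options)
# data = read_fastq_lines('s3://pycuda-euler-data/Ba10k.sim1.fq', options)
```

Keeping the secret out of the notebook itself and reading it from the environment means the same cell can be shared without leaking credentials.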