I am able to run vit_jax.ipynb on Colab, train, and run my experiments, but when I try to replicate this on my cluster I get the error below during training.
However, the forward pass to calculate accuracy works fine on my cluster.
My cluster has four GTX 1080 GPUs with CUDA 10.1, and I am using tensorflow==2.4.0 and jax[cuda101]==0.2.18. I am running this as a Jupyter notebook inside a Docker container.
---------------------------------------------------------------------------
UnfilteredStackTrace Traceback (most recent call last)
<ipython-input-57-176d6124ae02> in <module>()
11 opt_repl, loss_repl, update_rng_repl = update_fn_repl(
---> 12 opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)
13 losses.append(loss_repl[0])
/usr/local/lib/python3.7/dist-packages/jax/_src/traceback_util.py in reraise_with_filtered_traceback(*args, **kwargs)
182 try:
--> 183 return fun(*args, **kwargs)
184 except Exception as e:
/usr/local/lib/python3.7/dist-packages/jax/_src/api.py in f_pmapped(*args, **kwargs)
1638 name=flat_fun.__name__, donated_invars=tuple(donated_invars),
-> 1639 global_arg_shapes=tuple(global_arg_shapes_flat))
1640 return tree_unflatten(out_tree(), out)
/usr/local/lib/python3.7/dist-packages/jax/core.py in bind(self, fun, *args, **params)
1620 assert len(params['in_axes']) == len(args)
-> 1621 return call_bind(self, fun, *args, **params)
1622
/usr/local/lib/python3.7/dist-packages/jax/core.py in call_bind(primitive, fun, *args, **params)
1551 tracers = map(top_trace.full_raise, args)
-> 1552 outs = primitive.process(top_trace, fun, tracers, params)
1553 return map(full_lower, apply_todos(env_trace_todo(), outs))
/usr/local/lib/python3.7/dist-packages/jax/core.py in process(self, trace, fun, tracers, params)
1623 def process(self, trace, fun, tracers, params):
-> 1624 return trace.process_map(self, fun, tracers, params)
1625
/usr/local/lib/python3.7/dist-packages/jax/core.py in process_call(self, primitive, f, tracers, params)
606 def process_call(self, primitive, f, tracers, params):
--> 607 return primitive.impl(f, *tracers, **params)
608 process_map = process_call
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in xla_pmap_impl(fun, backend, axis_name, axis_size, global_axis_size, devices, name, in_axes, out_axes_thunk, donated_invars, global_arg_shapes, *args)
636 ("fingerprint", fingerprint))
--> 637 return compiled_fun(*args)
638
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in execute_replicated(compiled, backend, in_handler, out_handler, *args)
1159 input_bufs = in_handler(args)
-> 1160 out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
1161 if xla.needs_check_special():
UnfilteredStackTrace: RuntimeError: Internal: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:203: NCCL operation ncclGroupEnd() failed: unhandled system error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
RuntimeError Traceback (most recent call last)
<ipython-input-57-176d6124ae02> in <module>()
10
11 opt_repl, loss_repl, update_rng_repl = update_fn_repl(
---> 12 opt_repl, flax.jax_utils.replicate(step), batch, update_rng_repl)
13 losses.append(loss_repl[0])
14 lrs.append(lr_fn(step))
/usr/local/lib/python3.7/dist-packages/jax/interpreters/pxla.py in execute_replicated(compiled, backend, in_handler, out_handler, *args)
1158 def execute_replicated(compiled, backend, in_handler, out_handler, *args):
1159 input_bufs = in_handler(args)
-> 1160 out_bufs = compiled.execute_sharded_on_local_devices(input_bufs)
1161 if xla.needs_check_special():
1162 for bufs in out_bufs:
RuntimeError: Internal: external/org_tensorflow/tensorflow/compiler/xla/service/gpu/nccl_utils.cc:203: NCCL operation ncclGroupEnd() failed: unhandled system error: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).
Has anyone faced this issue before, and is there a way to resolve it?
It is hard to know for sure without more information, but this error can be caused by running out of GPU memory. Depending on your local settings, you may be able to remedy it by increasing the proportion of GPU memory reserved by XLA, e.g. by setting the XLA_PYTHON_CLIENT_MEM_FRACTION environment variable to 0.9 or something similarly high.
Alternatively, you could try running your code on a smaller problem that fits into memory on your local hardware.
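If you want to try the memory-fraction route, here is a minimal sketch; 0.9 is just an example value to tune for your hardware. The variable has to be set before JAX initializes its GPU backend, i.e. before the first import of jax in the notebook:
import os

# Must run before `import jax`, otherwise XLA has already allocated its memory pool.
# The fraction below is an example value, not a recommendation for your exact setup.
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '0.9'

import jax
print(jax.devices())  # sanity check: all four GPUs should still be listed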
Related
Trying to run a PipelineJob from a local instance (a local Windows machine with the GCP CLI installed and a local JupyterLab), but I'm getting:
_InactiveRpcError Traceback (most recent call last)
The above exception was the direct cause of the following exception:
ServiceUnavailable Traceback (most recent call last)
~\AppData\Local\Temp\1\ipykernel_8896\1564544395.py in <module>
15 #credentials=CREDENTIALS,
16
---> 17 job.submit() #service_account=SERVICE_ACCOUNT
18 # job.run()
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\cloud\aiplatform\pipeline_jobs.py in submit(self, service_account, network)
284 parent=self._parent,
285 pipeline_job=self._gca_resource,
--> 286 pipeline_job_id=self.job_id,
287 )
288
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\cloud\aiplatform_v1\services\pipeline_service\client.py in create_pipeline_job(self, request, parent, pipeline_job, pipeline_job_id, retry, timeout, metadata)
1197
1198 # Send the request.
-> 1199 response = rpc(request, retry=retry, timeout=timeout, metadata=metadata,)
1200
1201 # Done; return the response.
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\api_core\gapic_v1\method.py in __call__(self, timeout, retry, *args, **kwargs)
152 kwargs["metadata"] = metadata
153
--> 154 return wrapped_func(*args, **kwargs)
155
156
c:\users\<user>\pipelines_draft_env\lib\site-packages\google\api_core\grpc_helpers.py in error_remapped_callable(*args, **kwargs)
66 return callable_(*args, **kwargs)
67 except grpc.RpcError as exc:
---> 68 raise exceptions.from_grpc_error(exc) from exc
69
70 return error_remapped_callable
ServiceUnavailable: 503 failed to connect to all addresses; last error: UNKNOWN: ipv4:xxx.xx.xxx.x:443: tcp handshaker shutdown
I was trying to run the code below:
The user and project are set up, and the same code runs perfectly fine on a macOS machine (same user account, same project).
#code for the pipeline here, was too long for the question
compiler.Compiler().compile(pipeline_func=pipeline, package_path="intro_pipeline.json")
DISPLAY_NAME = "intro_" + UUID
job = aiplatform.PipelineJob(
    display_name=DISPLAY_NAME,
    template_path="intro_pipeline.json",
    pipeline_root=PIPELINE_ROOT,
    project=PROJECT_ID,
)
job.submit()
Please help me debug this. Maybe there is an issue with some certificates, but I have no idea where I should look. Thanks for any help.
I am trying to train a rather large model (Longformer-large with a CNN classification head on top) on Google Cloud Platform, using Tensorflow-cloud and Colab to run it. I tried to run this with a batch size of 4 on four P100 GPUs but I still get an OOM error, so I would like to try it on a TPU. I have now increased the batch size to 8.
However, I get an error saying that a TPU config cannot be used as the chief_config.
This is my code:
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(
        image_build_bucket=GCS_BUCKET
    ),
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME})
This is the error:
Validating environment and input parameters.
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-26-e1be60d71623> in <module>()
19 worker_config= tfc.COMMON_MACHINE_CONFIGS["TPU"],
20 chief_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
---> 21 job_labels={"job": JOB_NAME},
22 )
2 frames
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/run.py in run(entry_point, requirements_txt, docker_config, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, job_labels, service_account, **kwargs)
256 job_labels=job_labels or {},
257 service_account=service_account,
--> 258 docker_parent_image=docker_config.parent_image,
259 )
260 print("Validation was successful.")
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in validate(entry_point, requirements_txt, distribution_strategy, chief_config, worker_config, worker_count, entry_point_args, stream_logs, docker_image_build_bucket, called_from_notebook, job_labels, service_account, docker_parent_image)
78 _validate_distribution_strategy(distribution_strategy)
79 _validate_cluster_config(
---> 80 chief_config, worker_count, worker_config, docker_parent_image
81 )
82 _validate_job_labels(job_labels or {})
/usr/local/lib/python3.7/dist-packages/tensorflow_cloud/core/validate.py in _validate_cluster_config(chief_config, worker_count, worker_config, docker_parent_image)
160 "Invalid `chief_config` input. "
161 "`chief_config` cannot be a TPU config. "
--> 162 "Received {}.".format(chief_config)
163 )
164
ValueError: Invalid `chief_config` input. `chief_config` cannot be a TPU config. Received <tensorflow_cloud.core.machine_config.MachineConfig object at 0x7f5860afe210>.
Can someone tell me how I can run my code on GCP TPUs? I actually don't care too much about training time; I just want some configuration that runs without OOM issues (so a GPU configuration that works is totally fine with me as well).
Thank you!
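For reference, the validation error above only rules out a TPU chief. Below is a minimal sketch of the pattern used in tensorflow_cloud's own TPU examples (a plain CPU chief plus a single TPU worker); this is an assumption based on those docs, not a configuration verified for this particular model:
# Sketch only: keep the chief on a CPU machine and attach the TPU as the worker,
# mirroring tensorflow_cloud's documented TPU setup. All other arguments are
# carried over unchanged from the call in the question.
tfc.run(
    distribution_strategy="auto",
    requirements_txt="requirements.txt",
    docker_config=tfc.DockerConfig(image_build_bucket=GCS_BUCKET),
    chief_config=tfc.COMMON_MACHINE_CONFIGS["CPU"],
    worker_count=1,
    worker_config=tfc.COMMON_MACHINE_CONFIGS["TPU"],
    job_labels={"job": JOB_NAME})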
I'm using TF 2.0 on Google Colab. I copied most of the code from the TensorFlow "Load CSV Data" tutorial and changed some config variables for my training and eval/test CSV files. When I ran it, I got the error below (only the last frame is shown; the full output is here):
In
NUMERIC_FEATURES = ['microtime', 'dist']
packed_train_data = raw_train_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))
packed_test_data = raw_test_data.map(
    PackNumericFeatures(NUMERIC_FEATURES))
Out
/tensorflow-2.0.0/python3.6/tensorflow_core/python/autograph/impl/api.py in wrapper(*args, **kwargs)
235 except Exception as e: # pylint:disable=broad-except
236 if hasattr(e, 'ag_error_metadata'):
--> 237 raise e.ag_error_metadata.to_exception(e)
238 else:
239 raise
KeyError: in converted code:
<ipython-input-19-85ea56f80c91>:6 __call__ *
numeric_features = [features.pop(name) for name in self.names]
/tensorflow-2.0.0/python3.6/tensorflow_core/python/autograph/impl/api.py:396 converted_call
return py_builtins.overload_of(f)(*args)
KeyError: 'dist'
The "in converted code" is used when autograph wraps errors that likely occur in user code. In this case, the following like is relevant:
<ipython-input-19-85ea56f80c91>:6 __call__ *
numeric_features = [features.pop(name) for name in self.names]
The error message is missing some critical information and we should fix that, but it suggests that the call to features.pop(name) is what raises the KeyError, so the key 'dist' is likely missing from features.
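A quick way to confirm this is to print the feature keys the dataset actually yields and compare them with NUMERIC_FEATURES. A minimal sketch, assuming raw_train_data was built with tf.data.experimental.make_csv_dataset as in the tutorial:
# Inspect one batch: make_csv_dataset yields (features_dict, label) pairs,
# so the keys below are the column names the input pipeline actually sees.
for features, label in raw_train_data.take(1):
    print(sorted(features.keys()))  # 'dist' and 'microtime' must both appear here
If 'dist' is not in that list, check the CSV header row (and any column_names or select_columns arguments) used to build the dataset.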
My input is:
test=pd.read_csv("/gdrive/My Drive/data-kaggle/sample_submission.csv")
test.head()
It ran as expected.
But for
test.to_csv('submitV1.csv', header=False)
The full error message that I got was:
OSError Traceback (most recent call last)
<ipython-input-5-fde243a009c0> in <module>()
9 from google.colab import files
10 print(test)'''
---> 11 test.to_csv('submitV1.csv', header=False)
12 files.download('/gdrive/My Drive/data-kaggle/submission/submitV1.csv')
2 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, tupleize_cols, date_format, doublequote, escapechar, decimal)
3018 doublequote=doublequote,
3019 escapechar=escapechar, decimal=decimal)
-> 3020 formatter.save()
3021
3022 if path_or_buf is None:
/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py in save(self)
155 f, handles = _get_handle(self.path_or_buf, self.mode,
156 encoding=self.encoding,
--> 157 compression=self.compression)
158 close = True
159
/usr/local/lib/python3.6/dist-packages/pandas/io/common.py in _get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text)
422 elif encoding:
423 # Python 3 and encoding
--> 424 f = open(path_or_buf, mode, encoding=encoding, newline="")
425 elif is_text:
426 # Python 3 and no explicit encoding
OSError: [Errno 95] Operation not supported: 'submitV1.csv'
Additional Information about the error:
Before running this command, if I run
df=pd.DataFrame()
df.to_csv("file.csv")
files.download("file.csv")
That runs properly, but the same code produces the "operation not supported" error when I run it after trying to convert the test data frame to a CSV file.
Just before running the command, I also get the message "A Google Drive timeout has occurred (most recently at 13:02:43)".
You are currently in a directory in which you don't have write permissions.
Check your current directory with pwd. It is probably gdrive or some directory inside it; that's why you are unable to save there.
Now change the current working directory to some other directory where you have write permissions. cd ~ will work fine; it will change the directory to /root.
Now you can use:
test.to_csv('submitV1.csv', header=False)
It will save 'submitV1.csv' to /root.
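Putting those steps together, a minimal sketch (the path /root is simply where cd ~ lands in Colab; any writable directory works, and the %cd notebook magic can be used instead of os.chdir):
import os
from google.colab import files

os.chdir('/root')                           # same effect as `cd ~`
test.to_csv('submitV1.csv', header=False)   # now writes to /root/submitV1.csv
files.download('submitV1.csv')              # fetch the file through the browser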
As the title suggests, I'm trying to use a dask.bag to read a single file from S3 on an EC2 instance:
from distributed import Executor, progress
from dask import delayed
import dask
import dask.bag as db
data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq')
I get a very long error:
---------------------------------------------------------------------------
ClientError Traceback (most recent call last)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
322 bucket, key = split_path(path)
--> 323 out = self.s3.head_object(Bucket=bucket, Key=key)
324 out = {'ETag': out['ETag'], 'Key': '/'.join([bucket, key]),
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _api_call(self, *args, **kwargs)
277 # The "self" in this scope is referring to the BaseClient.
--> 278 return self._make_api_call(operation_name, kwargs)
279
/home/ubuntu/anaconda3/lib/python3.5/site-packages/botocore/client.py in _make_api_call(self, operation_name, api_params)
571 if http.status_code >= 300:
--> 572 raise ClientError(parsed_response, operation_name)
573 else:
ClientError: An error occurred (403) when calling the HeadObject operation: Forbidden
During handling of the above exception, another exception occurred:
FileNotFoundError Traceback (most recent call last)
<ipython-input-43-0ad435c69ecc> in <module>()
4 #data = db.read_text('/Users/zen/Code/git/sra_data.fastq')
5 #data = db.read_text('/Users/zen/Code/git/pycuda-euler/data/Ba10k.sim1.fq')
----> 6 data = db.read_text('s3://pycuda-euler-data/Ba10k.sim1.fq', blocksize=900000)
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bag/text.py in read_text(urlpath, blocksize, compression, encoding, errors, linedelimiter, collection, storage_options)
89 _, blocks = read_bytes(urlpath, delimiter=linedelimiter.encode(),
90 blocksize=blocksize, sample=False, compression=compression,
---> 91 **(storage_options or {}))
92 if isinstance(blocks[0], (tuple, list)):
93 blocks = list(concat(blocks))
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/core.py in read_bytes(urlpath, delimiter, not_zero, blocksize, sample, compression, **kwargs)
210 return read_bytes(storage_options.pop('path'), delimiter=delimiter,
211 not_zero=not_zero, blocksize=blocksize, sample=sample,
--> 212 compression=compression, **storage_options)
213
214
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in read_bytes(path, s3, delimiter, not_zero, blocksize, sample, compression, **kwargs)
91 offsets = [0]
92 else:
---> 93 size = getsize(s3_path, compression, s3)
94 offsets = list(range(0, size, blocksize))
95 if not_zero:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/dask/bytes/s3.py in getsize(path, compression, s3)
185 def getsize(path, compression, s3):
186 if compression is None:
--> 187 return s3.info(path)['Size']
188 else:
189 with s3.open(path, 'rb') as f:
/home/ubuntu/anaconda3/lib/python3.5/site-packages/s3fs/core.py in info(self, path, refresh)
327 return out
328 except (ClientError, ParamValidationError):
--> 329 raise FileNotFoundError(path)
330
331 def _walk(self, path, refresh=False):
FileNotFoundError: pycuda-euler-data/Ba10k.sim1.fq
As far as I can tell, this is exactly what the docs say to do. Unfortunately, many of the examples I see online use the older from_s3() method, which no longer exists.
However, I am able to access the file using s3fs:
sample, partitions = s3.read_bytes('pycuda-euler-data/Ba10k.sim1.fq', s3=s3files, delimiter=b'\n')
sample
b'#gi|30260195|ref|NC_003997.3|_5093_5330_1:0:0_1:0:0_0/1\nGATAACTCGATTTAAACCAGATCCAGAAAATTTTCA\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_7142_7326_1:1:0_0:0:0_1/1\nCTATTCCGCCGCATCAACTTGGTGAAGTAATGGATG\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_5524_5757_3:0:0_2:0:0_2/1\nGTAATTTAACTGGTGAGGACGTGCGTGATGGTTTAT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_2706_2926_1:0:0_3:0:0_3/1\nAGTAAAACAGATATTTTTGTAAATAGAAAAGAATTT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_500_735_3:1:0_0:0:0_4/1\nATACTCTGTGGTAAATGATTAGAATCATCTTGTGCT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_2449_2653_3:0:0_1:0:0_5/1\nCTTGAATTGCTACAGATAGTCATAGGTTAGCCCTTC\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_3252_3460_0:0:0_0:0:0_6/1\nCGATGTAATTGATACAGGTGGCGCTGTAAAATGGTT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_1860_2095_0:0:0_1:0:0_7/1\nATAAAAGATTCAATCGAAATATCAGCATCGTTTCCT\n+\n222222222222222222222222222222222222\n#gi|30260195|ref|NC_003997.3|_870_1092_1:0:0_0:0:0_8/1\nTTGGAAAAACCCATTTAATGCATGCAATTGGCCTTT\n ... etc.
What am I doing wrong?
EDIT:
Following a suggestion, I went back and checked permissions. On the bucket I added a Grantee "Everyone" List permission, and on the file a Grantee "Everyone" Open/Download permission. I still get the same error.
Dask uses the library s3fs to manage data on S3. The s3fs project uses Amazon's boto3. You can provide credentials in two ways:
Use a .boto file
You can put a .boto file in your home directory
Use the storage_options= keyword
You can add a storage_options= keyword to your db.read_text call to pass credential information by hand. This option is a dictionary whose contents are passed to the s3fs.S3FileSystem constructor.
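For example, a minimal sketch of the storage_options= route; 'key' and 'secret' are the argument names s3fs.S3FileSystem accepts, and the credential values are placeholders:
import dask.bag as db

# Credentials are forwarded to s3fs.S3FileSystem; replace the placeholders
# with real values, or rely on a .boto / IAM-role configuration instead.
data = db.read_text(
    's3://pycuda-euler-data/Ba10k.sim1.fq',
    storage_options={
        'key': 'YOUR_AWS_ACCESS_KEY_ID',
        'secret': 'YOUR_AWS_SECRET_ACCESS_KEY',
    })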