How to add ModelCheckpoint as a callback when running a model on TPU? - tensorflow

I am trying to save my model using tf.keras.callbacks.ModelCheckpoint with the filepath set to a folder on Drive, but I am getting this error:
File system scheme '[local]' not implemented (file: './ckpt/tensorflow/training_20220111-093004_temp/part-00000-of-00001')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Does anybody know the reason for this and a workaround?

It looks to me like you are trying to access the file system of your host VM from the TPU, which is not directly possible.
When using a TPU and you want to access files on, e.g., the Google Colab host, you should place the code within:
with tf.device('/job:localhost'):
    <YOUR_CODE>
Now to your problem:
The local host acts as the parameter server when training on a TPU. So if you want to checkpoint your training, the local host must do so.
When you check the documentation for said callback, you can find the parameter options.
checkpoint_options = tf.train.CheckpointOptions(experimental_io_device='/job:localhost')
checkpoint = tf.keras.callbacks.ModelCheckpoint(<YOUR_PATH>, options=checkpoint_options)
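For context, here is a minimal sketch of how those options might be wired into a TPU training run (the model, dummy data, and checkpoint path are placeholders, not from the original question):
import numpy as np
import tensorflow as tf

# Connect to the TPU (Colab/Cloud TPU; resolver arguments are environment-specific).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver()
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Build and compile the model under the TPU strategy scope.
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Route checkpoint I/O through the local host so local paths are usable.
checkpoint_options = tf.train.CheckpointOptions(experimental_io_device='/job:localhost')
checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    './ckpt/tensorflow/model_{epoch}',  # hypothetical local path
    options=checkpoint_options)

# Dummy data just to make the sketch self-contained.
x = np.random.rand(128, 20).astype('float32')
y = np.random.randint(0, 10, size=(128,))
model.fit(x, y, epochs=2, callbacks=[checkpoint_cb])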
Hope this solves your issue!
Best,
Sascha

Related

Losing variables while fixing runtime in Google Colab

I am training a machine learning model in Google Colab and prevented the automatic disconnecting by running this code in the browser console, as suggested in this question (Google Colab session timeout):
function ClickConnect(){
    console.log("Working");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
}
var clicker = setInterval(ClickConnect, 60000);
However, after a certain amount of time, all of my variables become undefined, even though I need them after the model has been trained. Is there any way to avoid this?

Objective-C plugin and CoreML model failing after email or Box transfer

I have a plugin written in Objective-C which incorporates a CoreML model. The plugin and ML model compile and run fine locally. If I email the plugin and CoreML model or transfer them via Box, my plugin crashes and throws a "damaged" error. I can get the plugin to function by removing extended attributes in the terminal: xattr -cr me/myplugin.plugin, but the ML section of code still fails.
If I monitor in Xcode, I notice the following when the CoreML model fails:
[coreml] Input feature input_layer required but not passed to neural network.
[coreml] Failure verifying inputs.
Is there some signature or attached attribute that would lead to this issue when transferring via email/box?
Since you have access to both versions of each file (before and after emailing / transferring via Box), go to both versions of each file and do the following:
ls -la
If it has extended attributes there will be an @ symbol. For example:
drwxr-xr-x@ 254 hoakley staff 8636 24 Jul 18:39 miscDocs
If the versions after transfer do not have an @ symbol, then they do not have extended attributes.
Then for each file (both versions) do:
xattr -l filepath
This will display the extended attributes of each file.
You should compare the attributes of both versions of each file and see the difference. This should answer your question. If there is no difference, then no extended attribute has been added or removed.
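If it helps, here is a small Python sketch that wraps the same xattr -l command to diff the attribute listings of the two copies (the file paths are hypothetical):
import subprocess

def list_xattrs(path):
    # Wraps `xattr -l` and returns its output as a set of lines.
    result = subprocess.run(['xattr', '-l', path], capture_output=True, text=True)
    return set(result.stdout.splitlines())

before = list_xattrs('original/myplugin.plugin')
after = list_xattrs('transferred/myplugin.plugin')
print('Only in original copy:', before - after)
print('Only in transferred copy:', after - before)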
Read: https://eclecticlight.co/2017/08/14/show-me-your-metadata-extended-attributes-in-macos-sierra/

Ways to provide authorisation header to use google ml-engine

I'm currently involved in a project using GCP ML Engine. It's already set up and ready, so my task is to use its predict command to leverage the API. The whole project lives on a VM instance, so I want to know: is there a more concise way to get an access token, e.g. an SDK or something like that? I didn't find anything useful. If not, what are my options here? JWT?
You might find this useful. https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/ml_engine/online_prediction/predict.py
Especially these lines:
# Create the ML Engine service object.
# To authenticate set the environment variable
# GOOGLE_APPLICATION_CREDENTIALS=<path_to_service_account_file>
service = googleapiclient.discovery.build('ml', 'v1')
name = 'projects/{}/models/{}'.format(project, model)

if version is not None:
    name += '/versions/{}'.format(version)

response = service.projects().predict(
    name=name,
    body={'instances': instances}
).execute()
You can create the service account from the project's IAM page and download its key file onto the VM.
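As an alternative to the environment variable, here is a minimal sketch that loads the service account key explicitly (the key path is a placeholder, and it assumes the google-auth and google-api-python-client packages are installed):
from google.oauth2 import service_account
import googleapiclient.discovery

credentials = service_account.Credentials.from_service_account_file(
    '/path/to/service_account_key.json',  # hypothetical key file downloaded from the IAM page
    scopes=['https://www.googleapis.com/auth/cloud-platform'])

# Pass the credentials explicitly instead of relying on GOOGLE_APPLICATION_CREDENTIALS.
service = googleapiclient.discovery.build('ml', 'v1', credentials=credentials)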

Recommended way of profiling distributed TensorFlow

Currently, I am using the TensorFlow Estimator API to train my TF model. I am using distributed training with roughly 20-50 workers and 5-30 parameter servers depending on the training data size. Since I do not have access to the session, I cannot use run metadata with full tracing to look at the Chrome trace. I see there are two other approaches:
1) tf.profiler.profile
2) tf.train.ProfilerHook
I am specifically using
tf.estimator.train_and_evaluate(estimator, train_spec, test_spec)
where my estimator is a prebuilt estimator.
Can someone give me some guidance (concrete code samples and code pointers would be really helpful since I am very new to TensorFlow) on the recommended way to profile an Estimator? Do the two approaches give different information, or do they serve the same purpose? Also, is one recommended over the other?
There are two things you can try:
ProfileContext
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/profile_context.py
Example usage:
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
    train_loop()
ProfilerService
https://www.tensorflow.org/tensorboard/r2/tensorboard_profiling_keras
You can start a ProfilerServer via tf.python.eager.profiler.start_profiler_server(port) on all workers and parameter servers, and then use TensorBoard to capture a profile.
Note that this is a very new feature, so you may want to use tf-nightly.
TensorFlow has recently added a way to sample multiple workers.
Please have a look at the API:
https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/client/trace?version=nightly
The parameter of the above API that is important in this context is:
service_addr: A comma-delimited string of gRPC addresses of the workers to profile, e.g.
service_addr='grpc://localhost:6009'
service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466'
service_addr='grpc://localhost:12345,grpc://localhost:23456'
Also, please look at the API,
https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/ProfilerOptions?version=nightly
The parameter of the above API that is important in this context is:
delay_ms: Requests that all hosts start profiling at a timestamp that is delay_ms away from the current time. delay_ms is in milliseconds. If zero, each host will start profiling immediately upon receiving the request. The default value is None, allowing the profiler to guess the best value.
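Putting the two pieces together, a minimal sketch might look like this (the port, worker addresses, and log directory are hypothetical):
import tensorflow as tf

# On every worker and parameter server, start a profiler server first, e.g.:
# tf.profiler.experimental.server.start(6009)

# Then, from any machine, sample all workers for a couple of seconds.
tf.profiler.experimental.client.trace(
    service_addr='grpc://10.0.0.2:6009,grpc://10.0.0.3:6009',  # hypothetical worker addresses
    logdir='gs://my-bucket/profile-logs',                      # hypothetical log directory
    duration_ms=2000)
# The captured traces can then be inspected in TensorBoard's Profile tab.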

Google word2vec load error

I want to use Google word2vec (GoogleNews-vectors-negative300.bin)
I downloaded it from https://code.google.com/archive/p/word2vec/
When I load it, a memory error occurs:
(Process finished with exit code 139 (interrupted by signal 11: SIGSEGV))
from gensim.models.word2vec import Word2Vec
embedding_path = "data/GoogleNews-vectors-negative300.bin"
word2vec = Word2Vec.load_word2vec_format(embedding_path, binary=True)
print word2vec
I am using Ubuntu 16.04 / GTX 1070 (8 GB) / RAM (16 GB).
How can I fix it?
A SIGSEGV is an error that occurs when a process tries to access a segment of memory that it does not have permission to access.
So you should check permissions and, by debugging, see which memory location gives you the error.
This way you can tell whether another program is interfering.
The problem might also be CUDA related, as @TheM00s3 suggested.
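If the crash turns out to be memory related, one thing worth trying (my own suggestion, not from the original answer) is gensim's KeyedVectors loader with its limit argument, which caps how many vectors are read into RAM:
from gensim.models import KeyedVectors

embedding_path = "data/GoogleNews-vectors-negative300.bin"
# Load only the first 500,000 vectors instead of the full ~3M-word vocabulary.
word2vec = KeyedVectors.load_word2vec_format(embedding_path, binary=True, limit=500000)
print(word2vec['king'][:5])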