I am training a machine learning model in Google Colab and fixed the automatic disconnecting by running this code in the browser's developer console, as suggested in this question (Google Colab session timeout):
function ClickConnect(){
    console.log("Working");
    document.querySelector("#top-toolbar > colab-connect-button").shadowRoot.querySelector("#connect").click();
}
var clicker = setInterval(ClickConnect, 60000);
However, after some amount of time, all of my variables become undefined, and I need them after the model has been trained. Is there any way to avoid this?
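There is no way to keep in-memory variables alive across a runtime reset, so the practical mitigation is to persist state to Google Drive during training and reload it afterwards. A minimal sketch of my own, assuming a Keras model; the paths and variable names are illustrative, not from the original post:
import tensorflow as tf
from google.colab import drive

# Mount Google Drive so checkpoints survive a runtime reset.
drive.mount('/content/drive')

# Save the model weights at the end of every epoch (path is a placeholder).
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    '/content/drive/MyDrive/ckpt/model_{epoch:02d}.h5',
    save_weights_only=True)

# model, x_train, and y_train are assumed to be defined elsewhere.
model.fit(x_train, y_train, epochs=10, callbacks=[ckpt_cb])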
I recently upgraded my Google Colab account to premium. But when I select the Premium GPU in the runtime settings, I encounter an issue while trying to train my model: the code runs without showing any error, but it also shows no progress (screenshots from the original post omitted). Only when I change back to the normal GPU under the runtime settings does the model train.
Here is the code I am using to set up and train the model:
model = ClassificationModel('roberta', 'roberta-base', num_labels=3, weight=class_weights.tolist(),
                            use_cuda=True, args={'reprocess_input_data': True, 'overwrite_output_dir': True,
                                                 'num_train_epochs': 10})
model.train_model(train)
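Nothing in this snippet is specific to the GPU tier, so one diagnostic worth running first (my own suggestion, not from the post) is to confirm that the Premium GPU runtime actually exposes a CUDA device, since ClassificationModel runs on PyTorch and use_cuda=True depends on it:
import torch

# If this prints False on the Premium GPU runtime, the problem is the
# runtime itself rather than the training code.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))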
I'm using Google's Natural Language API to run sentiment analysis on text blocks, and according to the instructions I'm following, I need to be using the latest version of google-cloud-language.
So I'm running this at the start of my Colab notebook.
!pip install --upgrade google-cloud-language
When that install finishes, it requires me to restart the runtime, which means I can't run my entire notebook automatically; I have to manually attend to the runtime restart.
This SO post touches on the topic, but only offers the 'crash' solution, and I'm wondering if anything else is available now, 3 years later:
Restart kernel in Google Colab
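For reference, the 'crash' workaround from that post amounts to killing the Python process so that Colab restarts the runtime on its own:
import os

# Killing the current process makes Colab restart the runtime
# automatically, after which the upgraded package is picked up.
os.kill(os.getpid(), 9)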
So I'm curious whether there's any workaround, or a way to permanently upgrade google-cloud-language, to avoid that.
Thank you for any input.
Here's the NL code I'm running, if helpful.
# Imports the Google Cloud client library
from google.cloud import language_v1

# Instantiates a client
client = language_v1.LanguageServiceClient()

def get_sentiment(text):
    # The text to analyze
    document = language_v1.Document(
        content=text,
        type_=language_v1.types.Document.Type.PLAIN_TEXT
    )
    # Detects the sentiment of the text
    sentiment = client.analyze_sentiment(
        request={"document": document}
    ).document_sentiment
    return sentiment

dfTW01["sentiment"] = dfTW01["text"].apply(get_sentiment)
I am trying to save my model by using tf.keras.callbacks.ModelCheckpoint with a filepath pointing to a folder in Drive, but I am getting this error:
File system scheme '[local]' not implemented (file: './ckpt/tensorflow/training_20220111-093004_temp/part-00000-of-00001')
Encountered when executing an operation using EagerExecutor. This error cancels all future operations and poisons their output tensors.
Does anybody know the reason for this, and a workaround?
It looks to me like you are trying to access the file system of your host VM from the TPU, which is not directly possible.
When using the TPU and you want to access files on, e.g., the Colab host, you should place the code within:
with tf.device('/job:localhost'):
    <YOUR_CODE>
Now to your problem:
The local host acts as the parameter server when training on a TPU, so if you want to checkpoint your training, the local host must do so. When you check the documentation for said callback, you can find the parameter options.
checkpoint_options = tf.train.CheckpointOptions(experimental_io_device='/job:localhost')
checkpoint = tf.keras.callbacks.ModelCheckpoint(<YOUR_PATH>, options=checkpoint_options)
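Putting the two pieces together, a minimal sketch (the path, model, and dataset are placeholders, not from the original answer):
import tensorflow as tf

# Route checkpoint I/O through the local host, which acts as the
# parameter server during TPU training.
checkpoint_options = tf.train.CheckpointOptions(
    experimental_io_device='/job:localhost')

checkpoint_cb = tf.keras.callbacks.ModelCheckpoint(
    './ckpt/tensorflow/model',  # hypothetical local path
    options=checkpoint_options)

# model and train_ds are assumed to be defined elsewhere.
model.fit(train_ds, epochs=10, callbacks=[checkpoint_cb])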
Hope this solves your issue!
Best,
Sascha
I am working in the Google Colab environment to create a Siamese network using Keras. I have used this code from GitHub, but I got an error when I tried to run the pickle.dump code:
with open(os.path.join(save_path, "train.pickle"), "wb") as f:
    pickle.dump((X, c), f)
The error: OverflowError: cannot serialize a bytes object larger than 4 GiB
So, as suggested in Use pickle with protocol=4, I used:
pickle.dump((X,c), f, protocol=4)
but the session stopped while running this code, and I got the messages "Session crashed for an unknown reason" and "Your session crashed after using all available RAM".
How can I solve this problem?
My guess is that your runtime is crashing because it runs out of memory.
I was able to pickle 4 GB of data, but it required ~8G of memory in Python to do so.
You can view the runtime logs with 'View runtime logs' from the Runtime menu. That often has hints about crashes. In this case, it reports many large allocations.
The session manager will also show memory usage (screenshots of the logs and the memory display from the original answer omitted).
I tried this and it's working for me:
import pickle
pickle_out = open("train.pickle","wb")
pickle.dump((X,c), pickle_out)
pickle_out.close()
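If the crash really is the serializer exhausting RAM, one alternative worth trying (my own suggestion, assuming X is a NumPy array) is to write the large array with np.save, which streams it to disk rather than building one huge serialized object, and pickle only the small remainder:
import pickle
import numpy as np

# Stream the large array straight to disk (X is assumed to be a NumPy array).
np.save("train_X.npy", X)

# Pickle only the small remainder.
with open("train_c.pickle", "wb") as f:
    pickle.dump(c, f, protocol=4)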
Currently, I am using the TensorFlow Estimator API to train my model. I am using distributed training with roughly 20-50 workers and 5-30 parameter servers, depending on the training-data size. Since I do not have access to the session, I cannot use run metadata with a full trace to look at the Chrome trace. I see there are two other approaches:
1) tf.profiler.profile
2) tf.train.ProfilerHook
I am specifically using
tf.estimator.train_and_evaluate(estimator, train_spec, test_spec)
where my estimator is a prebuilt estimator.
Can someone give me some guidance on the recommended way to profile an estimator? (Concrete code samples and pointers would be really helpful, since I am very new to TensorFlow.) Do the two approaches provide different information, or do they serve the same purpose? Is one recommended over the other?
There are two things you can try:
ProfilerContext
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/profiler/profile_context.py
Example usage:
with tf.contrib.tfprof.ProfileContext('/tmp/train_dir') as pctx:
    train_loop()
ProfilerService
https://www.tensorflow.org/tensorboard/r2/tensorboard_profiling_keras
You can start a ProfilerServer via tf.python.eager.profiler.start_profiler_server(port) on all workers and parameter servers, and then use TensorBoard to capture a profile.
Note that this is a very new feature, you may want to use tf-nightly.
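A minimal sketch of starting the server on each host, using the module path named above (the port is arbitrary, and this exact import is from the tf-nightly of that era, so it may have moved since):
from tensorflow.python.eager import profiler

# Start a profiler server on this host; TensorBoard can then connect
# to grpc://<host>:6009 to capture a profile.
profiler.start_profiler_server(6009)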
TensorFlow has recently added a way to sample multiple workers.
Please have a look at the API:
https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/client/trace?version=nightly
The parameter of the above API that is important in this context is:
service_addr: A comma-delimited string of gRPC addresses of the workers to profile, e.g. service_addr='grpc://localhost:6009', service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466', or service_addr='grpc://localhost:12345,grpc://localhost:23456'.
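A sketch of tracing two workers at once with this API (addresses, log directory, and duration are hypothetical):
import tensorflow as tf

# Capture 2 seconds of profile data from both workers into one logdir
# that TensorBoard can read.
tf.profiler.experimental.client.trace(
    service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466',
    logdir='gs://your-bucket/profile-logs',
    duration_ms=2000)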
Also, please have a look at the API:
https://www.tensorflow.org/api_docs/python/tf/profiler/experimental/ProfilerOptions?version=nightly
The parameter of the above API that is important in this context is:
delay_ms: Requests that all hosts start profiling at a timestamp that is delay_ms away from the current time; delay_ms is in milliseconds. If zero, each host will start profiling immediately upon receiving the request. The default value is None, which allows the profiler to guess the best value.
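For example, to ask all hosts to start profiling one second from now so that their traces line up (values are hypothetical, building on the trace call above):
import tensorflow as tf

# delay_ms=1000 makes every host start profiling at the same future
# timestamp instead of whenever it receives the request.
options = tf.profiler.experimental.ProfilerOptions(delay_ms=1000)

tf.profiler.experimental.client.trace(
    service_addr='grpc://10.0.0.2:8466,grpc://10.0.0.3:8466',
    logdir='gs://your-bucket/profile-logs',
    duration_ms=2000,
    options=options)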