saving weights of a tensorflow model in Databricks - tensorflow

In a Databricks notebook which is running on Cluster1 when I do
path='dbfs:/Shared/P1-Prediction/Weights_folder/Weights'
model.save_weights(path)
and then immediately try
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
I see the actual weights file in the output display
But when I run the exact same command
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
on a different Databricks notebook which is running on cluster 2, I am getting the error
ls: cannot access 'dbfs:/Shared/P1-Prediction/Weights_folder': No such file or directory
I am not able to interpret this. Does that mean my "save_weights" is saving the weights in the cluster's memory and not in an actual physical location? If so, is there a solution for it?
Any help is highly appreciated.

TensorFlow uses Python's local file API, which doesn't work with dbfs:/... paths - you need to change the path to use /dbfs/... instead of dbfs:/....
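For example, keeping the question's directory layout but switching to the FUSE mount path (a minimal sketch; it assumes the standard DBFS FUSE mount is available on the cluster and that model is the Keras model from the question):
# Sketch: same DBFS location as in the question, addressed through the /dbfs FUSE mount
# so that TensorFlow's local-file API can write to it.
path = '/dbfs/Shared/P1-Prediction/Weights_folder/Weights'
model.save_weights(path)
# From any cluster attached to the same workspace, the files should then be visible:
# %sh ls /dbfs/Shared/P1-Prediction/Weights_folder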
But really, it would be better to log the model using MLflow; in that case you can easily load it for inference. See the documentation and maybe this example.
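A rough sketch of the MLflow route (it assumes a Databricks runtime with MLflow 2.x, where mlflow.tensorflow.log_model accepts a Keras model directly; older versions expose the equivalent mlflow.keras.log_model, and the artifact name "weights_model" is arbitrary):
import mlflow

with mlflow.start_run() as run:
    # Log the whole model (architecture + weights) as an MLflow artifact.
    mlflow.tensorflow.log_model(model, "weights_model")

# Any notebook on any cluster in the workspace can then load it for inference:
# loaded = mlflow.tensorflow.load_model(f"runs:/{run.info.run_id}/weights_model")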

Related

How to save tensorflow models in RAMDisk?

In my original Python code, there is a frequent restore of the ckpt model file. It takes too much time to read the checkpoints again and again, so I decided to save the model in memory. A simple way is to create a RAMDisk and save the model on that disk. However, something unexpected happens.
I deployed 1G of RAMDisk according to the tutorial How to Create RAM Disk in Windows 10 for Super-Fast Read and Write Speeds. My system is Windows 11.
I made two attempts. In the first one, I copied my code to the RAMDisk E: and used tf.train.Saver().save(self.sess,'./') to save the model, but it reports UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb4 in position 114: invalid start byte. However, if I put the code in other normal folders, it runs successfully.
In the second attempt, I put the code under D: and modified the line to tf.train.Saver().save(self.sess,'E:\\'), and it reports cannot create directory E: Permission Denied. Obviously, E:\ is not a directory that should need to be created, so I don't know how to handle this.
Your jupyter/python environment cannot go beyond the directory from which jupyter/python is started, and that's why you get a permission denied error.
However, you can run shell commands from the jupyter notebook. If your user has write access to your destination, you can do the following.
model.save("my_model") # This will save the model to the current directory.
!mv "my_model" "E:\my_model" # This will move the model from the current directory to your required directory.
On a side note, when searching for tf.train.Saver().save(), I get this page as the only relevant result, which says it is used for saving checkpoints and not the model. They also recommend switching to the newer tf.train.Checkpoint or tf.keras.Model.save_weights. Nonetheless, the above method should work as expected.
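If you do want to try the newer API, here is a minimal sketch of tf.train.Checkpoint (it assumes TF2-style trackable objects rather than the session-based tf.train.Saver code in the question; the RAMDisk directory is a placeholder):
import tensorflow as tf

# Track whatever objects you need to persist (model, optimizer, counters, ...).
ckpt = tf.train.Checkpoint(model=model)
manager = tf.train.CheckpointManager(ckpt, directory="E:/tf_ckpts", max_to_keep=3)

manager.save()                            # write a checkpoint to the RAMDisk
ckpt.restore(manager.latest_checkpoint)   # later: restore the most recent one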

Why GCP gets killed when reading a partitioned parquet file from Google Storage but not locally?

I am running a Notebook instance from the AI Platform on an E2 high-memory VM with 4 vCPUs and 32 GB RAM.
I need to read a partitioned parquet file of about 1.8 GB from Google Storage using pandas.
It needs to be completely loaded in RAM and I can't use Dask compute for it.
Nonetheless, I tried loading it through that route and it gave the same problem.
When I download the file locally on the VM, I can read it with pd.read_parquet.
The RAM consumption goes up to about 13 GB and then down to 6 GB when the file is loaded. It works.
df = pd.read_parquet("../data/file.parquet",
engine="pyarrow")
When I try to load it directly from Google Storage, the RAM goes up to about 13 GB and then the kernel dies. No logs, warnings or errors are raised.
df = pd.read_parquet("gs://path_to_file/file.parquet",
engine="pyarrow")
Some info on the packages versions
Python 3.7.8
pandas==1.1.1
pyarrow==1.0.1
What could be causing it?
I found a thread where it is explained how to execute this task in a different way.
For your scenario, using the gcsfs library is a good option, for example:
import pyarrow.parquet as pq
import gcsfs

fs = gcsfs.GCSFileSystem(project='my-project-name')  # your GCP project ID
f = fs.open('my_bucket/path.parquet')                # the parquet object in the bucket
myschema = pq.ParquetFile(f).schema
print(myschema)
If you want to know more about this service, take a look at this document
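Once the schema looks right, the same gcsfs handle (fs) can be passed straight to pandas, since pd.read_parquet accepts a file-like object (a sketch; the bucket and object names are placeholders):
import pandas as pd

with fs.open('my_bucket/path.parquet') as f:
    df = pd.read_parquet(f, engine="pyarrow")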
The problem was being caused by a deprecated image version on the VM.
According to GCP's support, you can find out whether the image is deprecated as follows:
Go to GCE and click on “VM instances”.
Click on the “VM instance” in question.
Look for the section “Boot disk” and click on the Image link.
If the image has been deprecated, there will be a field showing it.
The solution is to create a new Notebook Instance and export/import whatever you want to keep. That way the new VM will have an updated image which hopefully has a fix for the problem.

Use tensorboard with object detection API in sagemaker

Following this, I successfully created a training job on SageMaker using the TensorFlow Object Detection API in a Docker container. Now I'd like to monitor the training job with TensorBoard, but cannot find anything explaining how to do it. I don't use a SageMaker notebook.
I think I can do it by saving the logs into an S3 bucket and pointing a local TensorBoard instance there, but I don't know how to tell the TensorFlow Object Detection API where to save the logs (is there any command line argument for this?).
Something like this, but the script generate_tensorboard_command.py fails because my training job doesn't have the sagemaker_submit_directory parameter.
The fact is, when I start the training job nothing is created on my S3 until the job finishes and uploads everything. There should be a way to tell TensorFlow where to save the logs (S3) during training, hopefully without modifying the API source code...
Edit
I can finally make it work with the accepted solution (TensorFlow natively supports read/write to S3); there are however additional steps to do:
Disable network isolation in the training job configuration
Provide credentials to the docker image to write to S3 bucket
The only thing is that TensorFlow continuously polls the filesystem (i.e. looking for an updated model in serving mode) and this causes useless requests to S3, which you will have to pay for (together with a bunch of errors in the console). I opened a new question here for this. At least it works.
Edit 2
I was wrong: TF just writes logs, it is not polling, so it's expected behavior and the extra costs are minimal.
Looking through the example you posted, it looks as though the model_dir passed to the TensorFlow Object Detection package is configured to /opt/ml/model:
import os

# These are the paths to where SageMaker mounts interesting things in your container.
prefix = '/opt/ml/'
input_path = os.path.join(prefix, 'input/data')
output_path = os.path.join(prefix, 'output')
model_path = os.path.join(prefix, 'model')
param_path = os.path.join(prefix, 'input/config/hyperparameters.json')
During the training process, tensorboard logs will be written to /opt/ml/model, and then uploaded to s3 as a final model artifact AFTER training: https://docs.aws.amazon.com/sagemaker/latest/dg/your-algorithms-training-algo-envvariables.html.
You might be able to side-step the SageMaker artifact upload step and point the model_dir of TensorFlow Object Detection API directly at an s3 location during training:
model_path = "s3://your-bucket/path/here
This means that the TensorFlow library within the SageMaker job is writing directly to S3 instead of to the filesystem inside its container. Assuming the underlying TensorFlow Object Detection code can write directly to S3 (something you'll have to verify), you should be able to see the TensorBoard logs and checkpoints there in real time.
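As a rough sketch of what that looks like from the training code's side, TensorFlow's summary writer can target an s3:// URI directly when S3 filesystem support is available (built into older TF releases, provided by tensorflow-io on newer ones); the bucket and prefix below are placeholders:
import tensorflow as tf

log_dir = "s3://your-bucket/tensorboard-logs/run-1"
writer = tf.summary.create_file_writer(log_dir)

with writer.as_default():
    tf.summary.scalar("loss", 0.42, step=0)   # example scalar, written straight to S3
writer.flush()

# A local TensorBoard can then follow the run while the SageMaker job is training:
#   tensorboard --logdir s3://your-bucket/tensorboard-logs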

kaggle directly download input data from copied kernel

How can I download all the input data from a kaggle kernel? For example this kernel: https://www.kaggle.com/davidmezzetti/cord-19-study-metadata-export.
Once you make a copy and have the option to edit, you have the ability to run the notebook and make changes.
One thing I have noticed is that anything that goes into the output directory gets a download button next to the file icon. So I see that I could just read each and every file and write it to the output, but that seems like a waste.
Am I missing something here?
The notebook you list contains two data sources:
another notebook (https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings)
and a dataset (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
You can use Kaggle's API to retrieve a kernel's output:
kaggle kernels output davidmezzetti/cord-19-analysis-with-sentence-embeddings
And to download dataset files:
kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge

Saving Variable state in Colaboratory

When I am running a Python script in Colaboratory, it's running all previous code cells.
Is there any way by which the previous cell state/output can be saved, so that I can directly run the next cell after returning to the notebook?
The outputs of Colab cells shown in your browser are stored in notebook JSON saved to Drive. Those will persist.
If you want to save your Python variable state, you'll need to use something like pickle to save to a file and then save that file somewhere outside of the VM.
Of course, that's a bit of trouble. One way to make things easier is to use a FUSE filesystem to mount some persistent storage where you can easily save regular files but have them persist beyond the lifetime of the VM (a sketch of the save/load cycle follows the list below).
An example of using a Drive FUSE wrapper to do this is in this example notebook:
https://colab.research.google.com/notebook#fileId=1mhRDqCiFBL_Zy_LAcc9bM0Hqzd8BFQS3
This notebook shows the following:
Installing a Google Drive FUSE wrapper.
Authenticating and mounting a Google Drive backed filesystem.
Saving local Python variables using pickle as a file on Drive.
Loading the saved variables.
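A minimal sketch of that save/load cycle, assuming Drive is mounted at /content/drive (the notebook above uses the older FUSE wrapper; google.colab.drive.mount is the current equivalent) and the variable names are placeholders:
import pickle
from google.colab import drive

drive.mount('/content/drive')

state = {"epoch": 7, "history": [0.9, 0.7, 0.5]}   # any picklable variables

# Save to Drive so the data outlives the Colab VM.
with open('/content/drive/MyDrive/state.pkl', 'wb') as f:
    pickle.dump(state, f)

# After a VM reset, load the saved state instead of re-running earlier cells.
with open('/content/drive/MyDrive/state.pkl', 'rb') as f:
    state = pickle.load(f)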
It's a nope. As #Bob says in this recent thread: "VMs time out after a period of inactivity, so you'll want to structure your notebooks to install custom dependencies if needed."