Kaggle: directly download input data from a copied kernel

How can I download all the input data from a kaggle kernel? For example this kernel: https://www.kaggle.com/davidmezzetti/cord-19-study-metadata-export.
Once you make a copy and have the option to edit, you have the ability to run the notebook and make changes.
One thing I have noticed is that anything placed in the output directory gets a download button next to the file icon. So I could simply read each input file and write it back to the output, but that seems like a waste.
Am I missing something here?

The notebook you list contains two data sources:
another notebook (https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings)
and a dataset (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
You can use Kaggle's API to retrieve a kernel's output:
kaggle kernels output davidmezzetti/cord-19-analysis-with-sentence-embeddings
And to download dataset files:
kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge
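If you prefer to do the same from Python rather than the command line, the kaggle package exposes the same operations through its API client. A minimal sketch, assuming your kaggle.json credentials are already in place; the output directories kernel_output and dataset are arbitrary names:

from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json

# Fetch the output files of the source notebook
api.kernels_output("davidmezzetti/cord-19-analysis-with-sentence-embeddings",
                   path="kernel_output")

# Fetch and unzip the dataset files
api.dataset_download_files("allen-institute-for-ai/CORD-19-research-challenge",
                           path="dataset", unzip=True)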

Related

saving weights of a tensorflow model in Databricks

In a Databricks notebook which is running on Cluster1 when I do
path='dbfs:/Shared/P1-Prediction/Weights_folder/Weights'
model.save_weights(path)
and then immediately try
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
I see the actual weights file in the output display
But when I run the exact same command
ls 'dbfs:/Shared/P1-Prediction/Weights_folder'
on a different Databricks notebook which is running on cluster 2, I am getting the error
ls: cannot access 'dbfs:/Shared/P1-Prediction/Weights_folder': No such file or directory
I am not able to interpret this. Does that mean my "save_weights" is saving the weights in the cluster's memory and not in an actual physical location? If so, is there a solution for it?
Any help is highly appreciated.
TensorFlow uses Python's local file API, which doesn't work with dbfs:/... paths - you need to change the path to use /dbfs/... instead of dbfs:/....
But really, it could be better to log the model using MLflow; in that case you can easily load it for inference. See the documentation and maybe this example.
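A minimal sketch of the path change; the tiny model below is just a stand-in for your own, and the folder is the one from the question, addressed through the /dbfs mount:

import tensorflow as tf

# Stand-in model; replace with your own.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])

# dbfs:/... is a Spark/Databricks URI; save_weights goes through Python's
# local file API, so address the same location via the /dbfs mount instead.
path = "/dbfs/Shared/P1-Prediction/Weights_folder/Weights"
model.save_weights(path)

# A notebook on any other cluster in the same workspace can then do:
model.load_weights(path)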

A good way to locate colab Notebook (from code inside colab notebook)

Colab code runs on a temporarily allocated machine, thus the running environment is not aware of the notebook location on Google Drive.
I am wondering if there is an API which Colab provides, which I may invoke programmatically, that could tell me the location of the Colab notebook in Google Drive (I can do it manually by clicking on: File > Locate in Drive, but I need to do this via code in Colab). This would be useful for saving the data generated by my code.
Of course I can hard code this path (after mounting the gDrive), but each time I would need to update it, and if I forget, it can even overwrite the previous data. Is there a way it could be detected automatically? It seems this solution: https://stackoverflow.com/a/71438046/1953366 walks through every path until it matches the filename, which is not efficient and will also fail in case I have the same file name at different locations.
Any better solution(s)?
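For reference, the hard-coding workaround mentioned above looks roughly like this; the folder under MyDrive is hypothetical and has to be kept in sync with the notebook's real location by hand, which is exactly the manual step the question wants to avoid:

import os
from google.colab import drive

drive.mount("/content/drive")

# Hypothetical, manually maintained path pointing at the notebook's folder.
output_dir = "/content/drive/MyDrive/experiments/run-01"
os.makedirs(output_dir, exist_ok=True)

with open(os.path.join(output_dir, "results.txt"), "w") as f:
    f.write("data generated by the notebook\n")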

What is the use of .profile-empty file in Tensorflow events folder

There is this file (events.out.tfevents.1611631707.8f60fbcf7419.profile-empty) that appears alongside other files, e.g. events.out.tfevents.1611897478.844156cf4a75.61.560.v2.
My model training is not going well at all, so I am looking all over to identify things I don't understand and see if they may be the cause. What is this .profile-empty file for? An image below shows the files.
This is a file written by the TensorFlow profiler. It is there to help TensorBoard identify which directory contains the profile data.
From the commit c66b603:
save empty event file in logdir when running profiler. TensorBoard will use this event file to identify the logdir that contains profile data
And from the commit 23d8e38:
Save an empty event file when StartTracing is called. This is to help with TensorBoard subdirectory searching.
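For context, the marker only appears once profiling is enabled, e.g. through the Keras TensorBoard callback. A minimal sketch; the toy model, the random data and the logs directory are placeholders:

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")

# profile_batch != 0 enables the profiler, which writes the
# events.out.tfevents.*.profile-empty marker next to the regular event files.
tb = tf.keras.callbacks.TensorBoard(log_dir="logs", profile_batch=2)
model.fit(np.random.rand(32, 4), np.random.rand(32, 1), epochs=1, callbacks=[tb])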

Does google colab permanently change file

I am doing some data pre-processing on Google Colab and just wondering how it works when manipulating a dataset. For example, R does not change the original dataset until you use write.csv to export the changed dataset. Does it work similarly in Colab? Thank you!
Until you explicitly save your changed data, e.g. using df.to_csv to the same file you read from, your changed dataset is not saved.
You must remember that due to inactivity (up to an hour or so), your Colab session might expire and all progress will be lost.
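As a small illustration of the point above (the file name and column are made up), nothing on disk changes until the explicit to_csv call:

import pandas as pd

df = pd.read_csv("data.csv")        # the original file is only read into memory
df["price"] = df["price"] * 1.1     # this change exists in memory only

df.to_csv("data.csv", index=False)  # only now is the file on disk overwritten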
Update
To download a model, dataset or a big file from Google Drive, the gdown command is already available:
!gdown https://drive.google.com/uc?id=FILE_ID
Download your code from GitHub and run predictions using the model you already downloaded
!git clone https://USERNAME:PASSWORD@github.com/username/project.git
Write ! before a line of your code in Colab and it will be treated as a bash command. You can download files from the internet using wget, for example:
!wget file_url
You can commit and push your updated code to GitHub etc. And updated dataset / model to Google Drive or Dropbox.

Saving Variable state in Colaboratory

When I am running a Python script in Colaboratory, it runs all previous code cells.
Is there any way the previous cell state/output can be saved, so that I can directly run the next cell after returning to the notebook?
The outputs of Colab cells shown in your browser are stored in notebook JSON saved to Drive. Those will persist.
If you want to save your Python variable state, you'll need to use something like pickle to save to a file and then save that file somewhere outside of the VM.
Of course, that's a bit of a hassle. One way to make things easier is to use a FUSE filesystem to mount some persistent storage where you can easily save regular files but have them persist beyond the lifetime of the VM.
An example of using a Drive FUSE wrapper to do this is in this example notebook:
https://colab.research.google.com/notebook#fileId=1mhRDqCiFBL_Zy_LAcc9bM0Hqzd8BFQS3
This notebook shows the following:
Installing a Google Drive FUSE wrapper.
Authenticating and mounting a Google Drive backed filesystem.
Saving local Python variables using pickle as a file on Drive.
Loading the saved variables.
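With the Drive mount that is now built into Colab (a simpler equivalent of the FUSE wrapper above), the pickle round trip looks roughly like this; the path under MyDrive is arbitrary:

import pickle
from google.colab import drive

drive.mount("/content/drive")  # authenticate and mount Google Drive

state = {"step": 42, "history": [0.91, 0.72, 0.55]}  # variables to persist

# Arbitrary location on the mounted Drive; the file outlives the VM.
with open("/content/drive/MyDrive/state.pkl", "wb") as f:
    pickle.dump(state, f)

# In a fresh VM, mount Drive again and reload:
with open("/content/drive/MyDrive/state.pkl", "rb") as f:
    state = pickle.load(f)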
It's a no. As @Bob says in this recent thread: "VMs time out after a period of inactivity, so you'll want to structure your notebooks to install custom dependencies if needed."