Does google colab permanently change file - google-colaboratory

I am doing some data pre-processing on Google Colab and just wondering how it handles dataset manipulation. For example, R does not change the original dataset until you use write.csv to export the changed dataset. Does it work similarly in Colab? Thank you!

Until you explicitly save your changed data, e.g. using df.to_csv to the same file you read from, your changed dataset is not saved.
You must remember that due to inactivity (up to an hour or so), your Colab session might expire and all progress be lost.
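To illustrate the point, here is a minimal sketch using only the standard csv module (pandas' df.to_csv follows the same pattern): edits happen in memory, and the file on disk only changes when you write it back.

```python
import csv, os, tempfile

# Create a small CSV standing in for the uploaded dataset (hypothetical data).
path = os.path.join(tempfile.mkdtemp(), 'data.csv')
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows([['x'], ['1'], ['2']])

# Read the data and modify it in memory only.
with open(path, newline='') as f:
    rows = list(csv.reader(f))
rows.append(['3'])

# The file on disk is unchanged until we explicitly write it back.
with open(path, newline='') as f:
    on_disk = list(csv.reader(f))
print(len(on_disk))  # 3 — header plus two values, the appended row is not there

# Persist the change, analogous to df.to_csv(path).
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)
with open(path, newline='') as f:
    final = list(csv.reader(f))
print(len(final))  # 4
```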
Update
To download a model, dataset, or a big file from Google Drive, the gdown command is already available:
!gdown https://drive.google.com/uc?id=FILE_ID
Clone your code from GitHub and run predictions using the model you already downloaded:
!git clone https://USERNAME:PASSWORD@github.com/username/project.git
Write ! before a line of your code in Colab and it will be treated as a bash command. You can download files from the internet using wget, for example:
!wget file_url
You can commit and push your updated code to GitHub, and push an updated dataset / model to Google Drive or Dropbox.

Related

Data that should be saved in drive during run is gone

I have a colab pro plus subscription, and I ran a python code that runs a network using pytorch.
Until now, I succeeded in training my network and using my Google drive to save checkpoints and such.
But now I had a run that ran for around 16 hours, and no checkpoint or any other data was saved - even though the logs clearly show that I have saved the data and even evaluated the metrics on saved data.
Maybe the data was saved on a different folder somehow?
I looked in the drive activity and I could not see any data that was saved.
Has anyone run into this before?
Any help would be appreciated.
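No answer was posted, but one common cause of this symptom is that writes to the Drive mount sit in a local buffer and are lost when the VM recycles. A minimal sketch of writing a checkpoint with an explicit flush + fsync (the torch.save call is replaced here with a placeholder write; the flushing pattern is the point):

```python
import os, tempfile

# Hypothetical local stand-in; in Colab this would be a path under the
# mounted Drive, e.g. '/content/drive/MyDrive/checkpoints/ckpt.pt'.
ckpt_path = os.path.join(tempfile.mkdtemp(), 'ckpt.pt')

# An explicit flush + fsync forces the bytes out of Python's buffers before
# the cell returns; torch.save(model.state_dict(), f) would replace the
# placeholder write below.
with open(ckpt_path, 'wb') as f:
    f.write(b'checkpoint bytes')  # stand-in for torch.save(...)
    f.flush()
    os.fsync(f.fileno())

print(os.path.getsize(ckpt_path) > 0)  # True
```

In Colab you can additionally call drive.flush_and_unmount() from google.colab at the end of a run to push any pending writes up to Drive before the session ends.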

How do I download a 7GB tensorflow-dataset in Google Colab without ending the 12 hour limit?

I'm trying to download the wmt14_translate/fr-en dataset from TensorFlow Datasets in Google Colab under the free tier. Downloading the dataset itself is taking me over 12 hours. Is there any alternative using Google Drive or something, since I already have the data stored on my laptop?
PS: the dataset's file format isn't really clear, since the filenames don't even contain an extension.
Upload the dataset to Google Drive.
Then, in Colab:
from google.colab import drive
drive.mount('/content/drive')
You can use wget to download datasets; downloading with the wget utility is much faster than uploading.
Also, if you ever use a Kaggle dataset, you can use kaggle datasets download.
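Since the data is already on your laptop, uploading it to Drive once and then copying it onto the VM's local disk each session is usually far faster than re-downloading. A minimal sketch with temporary directories standing in for the mounted Drive folder and the VM's local disk (paths and filename are hypothetical):

```python
import os, shutil, tempfile

# Stand-ins: 'drive_dir' plays the role of the mounted Drive folder
# ('/content/drive/MyDrive') and 'local_dir' the Colab VM's local disk.
drive_dir = tempfile.mkdtemp()
local_dir = tempfile.mkdtemp()

# Pretend the large archive is already in Drive (here a tiny placeholder file).
src = os.path.join(drive_dir, 'wmt14_fr_en.tar.gz')
with open(src, 'wb') as f:
    f.write(b'archive bytes')

# Copying Drive -> local VM disk avoids re-downloading, and local-disk reads
# are much faster than reading repeatedly through the Drive mount.
dst = shutil.copy(src, local_dir)
print(os.path.exists(dst))  # True
```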

kaggle directly download input data from copied kernel

How can I download all the input data from a kaggle kernel? For example this kernel: https://www.kaggle.com/davidmezzetti/cord-19-study-metadata-export.
Once you make a copy and have the option to edit, you have the ability to run the notebook and make changes.
One thing I have noticed is that anything that goes in the output directory is provided with an option of a download button next to the file icon. So I see that I can surely just read each and every file and write to the output but it seems like a waste.
Am I missing something here?
The notebook you list contains two data sources:
another notebook (https://www.kaggle.com/davidmezzetti/cord-19-analysis-with-sentence-embeddings)
and a dataset (https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge)
You can use Kaggle's API to retrieve a kernel's output:
kaggle kernels output davidmezzetti/cord-19-analysis-with-sentence-embeddings
And to download dataset files:
kaggle datasets download allen-institute-for-ai/CORD-19-research-challenge

Write out file with google colab

Is there a way to write out files with Google Colab?
For example, if I use
import requests
r = requests.get(url)
Where will those files be stored? Can they be found?
And similarly, can I get the file I output via, say, TensorFlow's save function?
saver = tf.train.Saver(....)
...
path = saver.save(sess, "./my_model.ckpt")
Thanks!
In your first example, the data is still in r.content, so you need to save it first, e.g. with open('data.dat', 'wb').write(r.content)
Then you can download them with files.download
from google.colab import files
files.download('data.dat')
Downloading your model is the same:
files.download('my_model.ckpt')
I found it is easier to first mount your Google Drive to the non-persistent VM and then use os.chdir() to change your current working folder.
After doing this, you can do exactly the same things as on a local machine.
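A short sketch of that workflow. The drive.mount call is Colab-only, so it is shown as a comment here, with a temporary directory standing in for the mounted Drive folder:

```python
import os, tempfile

# In Colab you would mount Drive first (Colab-only API):
#   from google.colab import drive
#   drive.mount('/content/drive')
#   os.chdir('/content/drive/MyDrive/my_project')

# Illustrated here with a temp directory standing in for the Drive folder:
work_dir = tempfile.mkdtemp()
os.chdir(work_dir)

# After chdir, plain relative paths resolve inside that folder, so code
# written for a local machine works unchanged.
with open('results.txt', 'w') as f:
    f.write('saved via a relative path\n')

print(os.path.exists(os.path.join(work_dir, 'results.txt')))  # True
```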
I have a gist listing several ways to save and transfer files between Colab VM and Google drive, but I think mounting Google drive is the easiest approach.
For more details, please refer to mount_your_google_drive.md in this gist
https://gist.github.com/Joshua1989/dc7e60aa487430ea704a8cb3f2c5d6a6

Saving Variable state in Colaboratory

When I am running a Python Script in Colaboratory, it's running all previous code cell.
Is there any way by which previous cell state/output can be saved and I can directly run next cell after returning to the notebook.
The outputs of Colab cells shown in your browser are stored in notebook JSON saved to Drive. Those will persist.
If you want to save your Python variable state, you'll need to use something like pickle to save to a file and then save that file somewhere outside of the VM.
Of course, that's a bit of trouble. One way to make things easier is to use a FUSE filesystem to mount some persistent storage where you can easily save regular files but have them persist beyond the lifetime of the VM.
An example of using a Drive FUSE wrapper to do this is in this example notebook:
https://colab.research.google.com/notebook#fileId=1mhRDqCiFBL_Zy_LAcc9bM0Hqzd8BFQS3
This notebook shows the following:
Installing a Google Drive FUSE wrapper.
Authenticating and mounting a Google Drive backed filesystem.
Saving local Python variables using pickle as a file on Drive.
Loading the saved variables.
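The pickle round-trip in those steps can be sketched like this (the variable names and the path are hypothetical; in Colab the path would live under the mounted Drive, e.g. '/content/drive/MyDrive/state.pkl'):

```python
import os, pickle, tempfile

# Variables whose state should survive the VM (example values).
state = {'epoch': 12, 'history': [0.9, 0.7, 0.5]}

# Save them as a file; put this on mounted Drive so it outlives the VM.
path = os.path.join(tempfile.mkdtemp(), 'state.pkl')
with open(path, 'wb') as f:
    pickle.dump(state, f)

# In a fresh session, remount Drive and load the variables back.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == state)  # True
```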
It's a no. As @Bob says in this recent thread: "VMs time out after a period of inactivity, so you'll want to structure your notebooks to install custom dependencies if needed."