Possibility to save uploaded data in Google Colab for reopening - google-colaboratory

I recently started solving Kaggle competitions, using two computers (a laptop and a PC). Kaggle provides a large amount of data for training ML models.
The biggest problem for me is downloading that data (about 30 GB) and, the even bigger issue, unzipping it. I was working on my laptop but decided to move to the PC, so I saved the .ipynb file and closed the laptop.
After opening the file again I saw that all the unzipped data had gone missing and I would need to spend another two hours downloading and unzipping it.
Is it possible to save all the unzipped data with this notebook? Or is it maybe stored somewhere on Google Drive?

You can leverage the storage capacity of Google Drive. Colab lets you keep this data on your Drive and access it from a Colab notebook as follows:
from google.colab import drive
import matplotlib.image as mpimg
import matplotlib.pyplot as plt
import pandas as pd
drive.mount('/content/gdrive')
img = mpimg.imread(r'/content/gdrive/My Drive/top.bmp') # Reading image files
df = pd.read_csv('/content/gdrive/My Drive/myData.csv') # Loading CSV
When it mounts, it will ask you to visit a particular URL to grant permission to access your Drive; just paste the token it returns. This needs to be done only once.
The best thing about Colab is that you can also run shell commands from code; all you need to do is prefix the command with a ! (bang). This is useful when you need to unzip files, etc.
import os
os.chdir('/content/gdrive/My Drive/data')  # change the working directory
!ls                                        # list the files there
!unzip -q iris_data.zip                    # unzip quietly
df3 = pd.read_csv('/content/gdrive/My Drive/data/iris_data.csv')
Note: since you have said the data is about 30 GB, this may not be enough if you are on Google's free tier (which gives only 15 GB per account); you may have to look elsewhere.
You can also visit this particular question for more solutions on Kaggle integration with Google Colab.
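If Drive space is the blocker, one alternative is to pull the data straight from Kaggle into the Colab VM's local disk at the start of each session, which is usually larger than the free Drive quota. A rough sketch (assuming you have a Kaggle API token in kaggle.json; the competition name is a placeholder):
!pip install -q kaggle
from google.colab import files
files.upload()  # upload your kaggle.json when prompted
!mkdir -p ~/.kaggle && cp kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json
!kaggle competitions download -c COMPETITION_NAME -p /content/data
!unzip -q /content/data/COMPETITION_NAME.zip -d /content/data
The download still has to happen each session, but it runs at datacenter speed rather than over your home connection.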

Related

Why does the kernel get killed on GCP when reading a partitioned Parquet file from Google Storage, but not locally?

I am running a Notebook instance from the AI Platform on an E2 high-memory VM with 4 vCPUs and 32 GB of RAM.
I need to read a partitioned Parquet file of about 1.8 GB from Google Storage using pandas.
It needs to be completely loaded in RAM and I can't use Dask compute for it. Nonetheless, I tried loading it that way and it gave the same problem.
When I download the file locally to the VM, I can read it with pd.read_parquet. The RAM consumption goes up to about 13 GB and then down to 6 GB once the file is loaded. It works.
df = pd.read_parquet("../data/file.parquet",
engine="pyarrow")
When I try to load it directly from Google Storage, the RAM goes up to about 13 GB and then the kernel dies. No logs, warnings or errors are raised.
df = pd.read_parquet("gs://path_to_file/file.parquet",
engine="pyarrow")
Some info on the package versions:
Python 3.7.8
pandas==1.1.1
pyarrow==1.0.1
What could be causing it?
I found a thread that explains how to execute this task in a different way.
For your scenario, using the gcsfs library is a good option, for example:
import pyarrow.parquet as pq
import gcsfs
fs = gcsfs.GCSFileSystem(project='my-project-name')  # your GCP project id
f = fs.open('my_bucket/path/file.parquet')            # open the object on GCS
myschema = pq.ParquetFile(f).schema                   # inspect the schema without loading the data
print(myschema)
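Building on that, a hedged sketch of actually loading the data into pandas through the same gcsfs file handle (the bucket and path above are placeholders):
table = pq.ParquetFile(f).read()  # read the Parquet data as an Arrow table
df = table.to_pandas()            # convert to a pandas DataFrame
This keeps the schema inspection and the full read on the same open file handle.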
If you want to know more about this service, take a look at this document
The problem was being caused by a deprecated image version on the VM.
According to GCP's support, you can find out whether the image is deprecated as follows:
1. Go to GCE and click on “VM instances”.
2. Click on the VM instance in question.
3. Look for the “Boot disk” section and click on the Image link.
4. If the image has been deprecated, there will be a field showing it.
The solution is to create a new Notebook instance and export/import whatever you want to keep. That way the new VM will have an updated image, which hopefully has a fix for the problem.
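If you prefer the command line, a rough sketch (the image and project names are placeholders, and the exact output format may vary) is to describe the boot image with gcloud and look for a deprecated block in the output:
!gcloud compute images describe IMAGE_NAME --project IMAGE_PROJECT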

How do I download a 7 GB tensorflow-dataset in Google Colab without hitting the 12-hour limit?

I'm trying to download the wmt14_translate/fr-en dataset from TensorFlow Datasets in Google Colab under the free tier. Downloading the dataset itself is taking me over 12 hours. Is there any alternative using Google Drive or something similar, since I already have the data stored on my laptop?
PS: the file format of the dataset isn't really clear, since the files don't even have an extension.
Upload the dataset to Google Drive. Then, in Colab:
from google.colab import drive
drive.mount('/content/drive')
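As a follow-up sketch (assuming the prepared dataset files were copied into a tensorflow_datasets folder on Drive; the folder name and path are assumptions), you can then point tensorflow-datasets at that copy instead of re-downloading:
import tensorflow_datasets as tfds
ds = tfds.load('wmt14_translate/fr-en',
               data_dir='/content/drive/MyDrive/tensorflow_datasets')  # reuse the copy on Drive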
You can use wget to download datasets; downloading with the wget utility is much faster than uploading. Also, if you ever use a Kaggle dataset, you can use kaggle datasets download.
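For instance, a rough sketch (the URL and dataset name below are placeholders; the Kaggle command needs an API token set up):
!wget -q https://example.com/wmt14_fr_en.tar.gz -P /content/data
!kaggle datasets download -d username/dataset-name -p /content/data --unzip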

Does Google Colab permanently change files?

I am doing some data pre-processing on Google Colab and am just wondering how it works when manipulating a dataset. For example, R does not change the original dataset until you use write.csv to export the changed dataset. Does it work similarly in Colab? Thank you!
Until you explicitly save your changed data, e.g. by using df.to_csv to write to the same file you read from, your changed dataset is not saved.
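For example (a minimal sketch; the file and column names are made up):
import pandas as pd
df = pd.read_csv('data.csv')               # the file on disk is untouched
df['price'] = df['price'].fillna(0)        # this change exists only in memory
df.to_csv('data_clean.csv', index=False)   # nothing is written until this line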
You must also remember that, due to inactivity (after an hour or so), your Colab session might expire and all progress be lost.
Update
To download a model, dataset or other big file from Google Drive, the gdown command is already available:
!gdown https://drive.google.com/uc?id=FILE_ID
Download your code from GitHub and run predictions using the model you already downloaded:
!git clone https://USERNAME:PASSWORD@github.com/username/project.git
Write ! before a line of your code in Colab and it will be treated as a bash command. You can download files from the internet using wget, for example:
!wget file_url
You can commit and push your updated code to GitHub, etc., and push the updated dataset / model to Google Drive or Dropbox.
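A minimal sketch of pushing results back from the Colab VM (the repository path and commit message are assumptions; the repo must already be cloned with credentials as above):
!git -C /content/project add -A
!git -C /content/project commit -m "update results"
!git -C /content/project push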

Write out file with google colab

Is there a way to write out files with Google Colab?
For example, if I use
import requests
r = requests.get(url)
Where will those files be stored? Can they be found?
And similarly, can I get the file I output via, say, the TensorFlow save function?
saver = tf.train.Saver(...)
...
path = saver.save(sess, "./my_model.ckpt")
Thanks!
In your first example, the data is still in r.content, so you need to save it first, e.g. with open('data.dat', 'wb').write(r.content).
Then you can download it with files.download:
from google.colab import files
files.download('data.dat')
Downloading your model is the same:
files.download('my_model.ckpt')
I found it easier to first mount your Google Drive to the non-persistent VM and then use os.chdir() to change your current working folder.
After doing this, you can do exactly the same things as on your local machine.
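A minimal sketch of that workflow (the folder name under Drive is an assumption):
import os
from google.colab import drive
drive.mount('/content/gdrive')
os.chdir('/content/gdrive/My Drive/colab_outputs')  # work inside a Drive folder
with open('notes.txt', 'w') as fh:                  # this file now persists on Drive
    fh.write('saved from Colab')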
I have a gist listing several ways to save and transfer files between the Colab VM and Google Drive, but I think mounting Google Drive is the easiest approach.
For more details, please refer to mount_your_google_drive.md in this gist
https://gist.github.com/Joshua1989/dc7e60aa487430ea704a8cb3f2c5d6a6

Export Excel files from Google Colab

I use the following code to save some DataFrame data in Google Colab, but it is saved in the "local file system", not on my computer or in my Google Drive. How can I get the Excel file from there?
Thanks!
writer = pd.ExcelWriter('hey.xlsx')
df.to_excel(writer)
writer.save()
from google.colab import files
files.download('hey.xlsx')  # download the file you just wrote
Use Google Chrome; Firefox shows a network error.
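An alternative sketch, if the browser download is unreliable: mount Drive and write the Excel file straight to a Drive folder (df here is the DataFrame from the question; the target path is an assumption):
from google.colab import drive
drive.mount('/content/drive')
df.to_excel('/content/drive/My Drive/hey.xlsx', index=False)  # now visible in your Drive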
You'll want to upload the file using something like:
from google.colab import files
uploaded = files.upload()
Then, you can pick out the file data using something like uploaded['your_file_here.xls'].
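For example, a small sketch of turning those uploaded bytes into a DataFrame (the filename key is whatever file you pick in the dialog, and reading Excel files requires an engine such as openpyxl to be installed):
import io
import pandas as pd
from google.colab import files
uploaded = files.upload()  # opens a file picker in the browser
df = pd.read_excel(io.BytesIO(uploaded['your_file_here.xls']))  # parse the uploaded bytes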
There's a distinct question that includes recipes for working with Excel files in Drive that might be useful:
https://stackoverflow.com/a/47440841/8841057