Taking forever to save a pandas DataFrame from a Google Colab session to my Google Drive

I mounted my Google Drive in my Colab notebook, and I have a fairly big pandas DataFrame that I try to save with mydf.to_feather(path), where path is on my Google Drive. The file is expected to be about 100 MB, and the write is taking forever.
Is this to be expected? It seems the network link between Colab and Google Drive is not great. Does anyone know whether the servers are in the same region/zone?
I may need to change my workflow to avoid this. If you have any best practices or suggestions, please let me know, anything short of going all-in on GCP (which I expect wouldn't have this kind of latency).

If you call df.to_feather("somewhere on your gdrive") from Google Colab with a file on the order of a few hundred MB, you may see sporadic performance: it can take anywhere from a few minutes to a whole hour to save the file. I can't explain this behavior.
Workaround: first save to /content/, the Colab host machine's local directory, then copy the file from /content to your Drive mount directory. This works much more consistently and much faster for me; I just can't explain why .to_feather directly to Drive suffers so much.

Related

A good way to locate a Colab notebook (from code inside the notebook)

Colab code runs on a temporarily allocated machine, so the running environment is not aware of the notebook's location on Google Drive.
I am wondering whether Colab provides an API I can invoke programmatically to tell me the location of the notebook in Google Drive (I can do it manually via File > Locate in Drive, but I need to do it from code in Colab). This would be useful for saving the data generated by my code.
Of course I can hard-code this path (after mounting Google Drive), but I would need to update it each time, and if I forget, I can even overwrite the previous data. Is there a way for it to be detected automatically? This solution: https://stackoverflow.com/a/71438046/1953366 seems to walk through every path until it matches the filename, which is inefficient and will also fail if I have the same file name in different locations.
Any better solution(s)?

How to set up credentials file in Google Colab python notebook

Goal
I'd like to store a credentials file for things like API keys in a place where someone I've shared the Colab notebook with can't access the file or the information.
Situation
I'm calling several APIs in a Colab notebook and have multiple keys for different APIs. If there are approaches at different levels of complexity, I'd prefer a simpler one.
Current attempts
For now I'm storing the keys in the main Python notebook while I research the best way to approach this. I'm pretty new to authentication, so I would prefer a simpler solution. I haven't seen any articles addressing this directly.
I'd greatly appreciate any input on this.
You can store the credential files in your Google Drive.
Only you can access them at /content/drive/MyDrive/ after mounting it; other people need their own credential files in their own Drive.
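As a hedged sketch of this approach (the keys.json file name and the key name are made-up examples, and you'd create the file in your own Drive yourself), the shared notebook reads the key from Drive instead of embedding it in the code:

```python
import json


def load_api_key(path: str, name: str) -> str:
    """Read one named key from a JSON credentials file, e.g.
    a file containing {"weather_api": "abc123..."}."""
    with open(path) as f:
        return json.load(f)[name]


# In Colab, after drive.mount('/content/drive'):
# key = load_api_key('/content/drive/MyDrive/keys.json', 'weather_api')
```

People you share the notebook with see only the path, not the file contents, since the mount authenticates as whoever runs the cell.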

How to permanently upload files onto Google Colab such that it can be directly accessed by multiple people?

My friend and I are working on a project together on Google Colab, for which we need a dataset, but we keep running into the same problem while uploading it.
What we do right now is upload the dataset to Drive, give each other access, and then mount Google Drive each time. This becomes time-consuming and irritating, as we need to authorize and mount on every session.
Is there a better way, so that we can upload the dataset to the home directory and access it directly each time? Or is that not possible because we're assigned a different machine each time?
If you create a new notebook, you can set it to mount Google Drive automatically, with no need to authenticate every time.
See this demo.

How to access data from machine using google colab

I want to use Google Colab, but my data is pretty huge, so I want to access it directly from my machine in Colab. I also want to save files directly to a directory on my machine. Is there a way to do that? I can't seem to find one.
Look at how to use a local runtime, described here:
https://research.google.com/colaboratory/local-runtimes.html
Otherwise, you can store your data on Google Drive, GCS, or S3 and then simply mount it, with no need to upload it every time.
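A sketch of the local-runtime setup described on that page (package versions and the port are up to you; run these on your own machine, then pick "Connect to local runtime" in Colab and paste the backend URL the server prints):

```shell
# Install and enable the extension that lets Colab talk to a local Jupyter server
pip install jupyter_http_over_ws
jupyter serverextension enable --py jupyter_http_over_ws

# Start a local server that accepts connections from the Colab frontend;
# notebooks then read and write files directly on this machine's disk
jupyter notebook \
  --NotebookApp.allow_origin='https://colab.research.google.com' \
  --port=8888 \
  --NotebookApp.port_retries=0
```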

Write out file with google colab

Is there a way to write out files with Google Colab?
For example, if I use
import requests
r = requests.get(url)
Where will those files be stored? Can they be found?
And similarly, can I retrieve a file I wrote out via, say, TensorFlow's save function?
saver = tf.train.Saver(...)
...
path = saver.save(sess, "./my_model.ckpt")
Thanks!
In your first example, the data is still in r.content, so you first need to write it to disk, e.g. with open('data.dat', 'wb').write(r.content).
Then you can download it with files.download:
from google.colab import files
files.download('data.dat')
Downloading your model works the same way:
files.download('my_model.ckpt')
I find it easier to first mount your Google Drive on the non-persistent VM and then use os.chdir() to change the current working directory.
After doing this, you can do exactly the same things as on a local machine.
I have a gist listing several ways to save and transfer files between the Colab VM and Google Drive, but I think mounting Google Drive is the easiest approach.
For more details, please refer to mount_your_google_drive.md in this gist:
https://gist.github.com/Joshua1989/dc7e60aa487430ea704a8cb3f2c5d6a6
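A minimal sketch of the mount-then-chdir approach; the Drive project folder and the helper below are hypothetical, and the Colab-only calls are shown as comments:

```python
import os

# In Colab you would first run:
#   from google.colab import drive
#   drive.mount('/content/drive')                 # authorize once per session
#   os.chdir('/content/drive/MyDrive/project')    # hypothetical Drive folder

# After the chdir, relative paths resolve inside the Drive folder, so code
# written for a local machine works unchanged and its output persists:
def write_report(text: str, filename: str = "report.txt") -> str:
    """Write text to a relative path; it lands in the current working
    directory, i.e. on Drive once you've chdir'd into the mount."""
    with open(filename, "w") as f:
        f.write(text)
    return os.path.abspath(filename)
```

Anything the notebook writes this way survives the VM being recycled, since it lives on Drive rather than on the temporary machine.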