How to make persistent installation with Colaboratory? - google-colaboratory

Is there a way to have a persistent installation on the machine you get when you launch a notebook from the Colaboratory Environment ?
There is such a mechanism with mybinder.org with a requirements.txt or setup.py that specify the different packages you want at the startup.
https://mybinder.readthedocs.io/en/latest/config_files.html#requirements-txt-install-a-python-environment
I have tested a colab notebook with an installation procedure but I have to rerun a sequence of cells each time I want to work.
Try:
https://colab.research.google.com/drive/1u5Y-92-b4rVcJjkUpPPa5xnuvKAHcnNa
Also how to define environment variables for once (at startup) ?
Do I have also to rerun their settings each time ?
Thanks
Patrick

Here's how I install a library jdc permanently, by installing it in Google Drive.
import os, sys
from google.colab import drive
drive.mount('/content/mnt')
nb_path = '/content/notebooks'
os.symlink('/content/mnt/My Drive/Colab Notebooks', nb_path)
sys.path.insert(0, nb_path)
# call this one time only
!pip install --target=$nb_path jdc
# later just import it
import jdc # for %%add_to
And here's how to set environmental variables.
%env VAR1=value1
%env VAR2=value2
Put them in your first cell and run it.

Related

Install RAPIDS library on Googe Colab notebook

I was wondering if I could install RAPIDS library (executing machine learning tasks entirely on GPU) in Google Colaboratory notebook?
I've done some research but I've not been able to find the way to do that...
This is now possible with the new T4 instances https://medium.com/rapids-ai/run-rapids-on-google-colab-for-free-1617ac6323a8
To enable cuGraph too, you can replace the wget command with:
!conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c pytorch \
-c numba -c conda-forge -c numba -c defaults \
boost cudf=0.6 cuml=0.6 python=3.6 cugraph=0.6 -y
Dec 2019 update
New process for RAPIDS v0.11+
Because
RAPIDS v0.11 has dependencies (pyarrow) which were
not covered by the prior install script,
the notebooks-contrib repo, which contains RAPIDS demo notebooks (e.g.
colab_notebooks) and the Colab install script, now follows RAPIDS standard version-specific branch structure*
and some Colab users still enjoy v0.10,
our honorable notebooks-contrib overlord taureandyernv has updated the script which now:
If running v0.11 or higher, updates pyarrow library to 0.15.x.
Here's the code cell to run in Colab for v0.11:
# Install RAPIDS
!wget -nc https://raw.githubusercontent.com/rapidsai/notebooks-contrib/890b04ed8687da6e3a100c81f449ff6f7b559956/utils/rapids-colab.sh
!bash rapids-colab.sh
import sys, os
dist_package_index = sys.path.index("/usr/local/lib/python3.6/dist-packages")
sys.path = sys.path[:dist_package_index] + ["/usr/local/lib/python3.6/site-packages"] + sys.path[dist_package_index:]
sys.path
if os.path.exists('update_pyarrow.py'): ## This file only exists if you're using RAPIDS version 0.11 or higher
exec(open("update_pyarrow.py").read(), globals())
For a walk thru setting up Colab & implementing this script, see How to Install RAPIDS in Google Colab
-* e.g. branch-0.11 for v0.11 and branch-0.12 for v0.12 with default set to the current version
Looks like various subparts are not yet pip-installable so the only way to get them on colab would be to build them on colab, which might be more effort than you're interested in investing in this :)
https://github.com/rapidsai/cudf/issues/285 is the issue to watch for rapidsai/cudf (presumably the other rapidsai/ libs will follow suit).
Latest solution;
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh
import sys, os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
was pushed a few days ago, see issues #104 or #110, or the full rapids-colab.sh script for more info.
Note: instillation currently requires a Tesla T4 instance, checking for this can be done with;
# check gpu type
!nvidia-smi
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)
# your dolphin is broken, please reset & try again
if device_name != b'Tesla T4':
raise Exception("""Unfortunately this instance does not have a T4 GPU.
Please make sure you've configured Colab to request a GPU instance type.
Sometimes Colab allocates a Tesla K80 instead of a T4. Resetting the instance.
If you get a K80 GPU, try Runtime -> Reset all runtimes...""")
# got a T4, good to go
else:
print('Woo! You got the right kind of GPU!')

'%matplotlib notebook' behavior in Jupyter Lab [duplicate]

With old Jupyter notebooks, I could create interactive plots via:
import matplotlib.pyplot as plt
%matplotlib notebook
x = [1,2,3]
y = [4,5,6]
plt.figure()
plt.plot(x,y)
However, in JupyterLab, this gives an error:
JavaScript output is disabled in JupyterLab
I have also tried the magic (with jupyter-matplotlib installed):
%matplotlib ipympl
But that just returns:
FigureCanvasNbAgg()
Inline plots work, but they are not interactive plots:
%matplotlib inline
JupyterLab 3.0+
Install jupyterlab and ipympl.
For pip users:
pip install --upgrade jupyterlab ipympl
For conda users:
conda update -c conda-forge jupyterlab ipympl
Restart JupyterLab.
Decorate the cell containing plotting code with the header:
%matplotlib widget
# plotting code goes here
JupyterLab 2.0
Install nodejs, e.g. conda install -c conda-forge nodejs.
Install ipympl, e.g. conda install -c conda-forge ipympl.
[Optional, but recommended.] Update JupyterLab, e.g.
conda update -c conda-forge jupyterlab==2.2.9==py_0.
[Optional, but recommended.] For a local user installation, run:
export JUPYTERLAB_DIR="$HOME/.local/share/jupyter/lab".
Install extensions:
jupyter labextension install #jupyter-widgets/jupyterlab-manager
jupyter labextension install jupyter-matplotlib
Enable widgets: jupyter nbextension enable --py widgetsnbextension.
Restart JupyterLab.
Decorate with %matplotlib widget.
To enable the jupyter-matplotlib backend, use the matplotlib Jupyter magic:
%matplotlib widget
import matplotlib.pyplot as plt
plt.figure()
x = [1,2,3]
y = [4,5,6]
plt.plot(x,y)
More info here jupyter-matplotlib on GitHub
As per Georgy's suggestion, this was caused by Node.js not being installed.
Steps for JupyterLab 3.*
I had previously used Mateen's answer several times, but when I tried them with JupyterLab 3.0.7 I found that jupyter labextension install #jupyter-widgets/jupyterlab-manager returned an error and I had broken widgets.
After a lot of headaches and googling I thought I would post the solution for anyone else who finds themselves here.
The steps are now simplified, and I was able to get back to working interactive plots with the following:
pip install jupyterlab
pip install ipympl
Decorate with %matplotlib widget
Step 2 will automatically take care of the rest of the dependencies, including the replacements for (the now depreciated?) #jupyter-widgets/jupyterlab-manager
Hope this saves someone else some time!
Summary
In a complex setup, where jupyter-lab process and the Jupyter/IPython kernel process are running in different Python virtual environments, pay attention to Jupyter-related Python package and Jupyter extension (e.g. ipympl, jupyter-matplotlib) versions and their compatibility between the environments.
And even in single Python virtual environment make sure you comply with the ipympl compatibility table.
Example
A couple of examples how to run JupyterLab.
Simple(st)
The simplest cross-platform way to run JupyterLab, I guess, is running it from a Docker container. You can build and run JupyterLab 3 container like this.
docker run --name jupyter -it -p 8888:8888 \
# This line on a Linux- and non-user-namespaced Docker will "share"
# the directory between Docker host and container, and run from the user.
-u 1000 -v $HOME/Documents/notebooks:/tmp/notebooks \
-e HOME=/tmp/jupyter python:3.8 bash -c "
mkdir /tmp/jupyter; \
pip install --user 'jupyterlab < 4' 'ipympl < 0.8' pandas matplotlib; \
/tmp/jupyter/.local/bin/jupyter lab --ip=0.0.0.0 --port 8888 \
--no-browser --notebook-dir /tmp/notebooks;
"
When it finishes (and it'll take a while), the bottommost lines in the terminal should be something like.
To access the server, open this file in a browser:
...
http://127.0.0.1:8888/lab?token=abcdef...
You can just click on that link and JupyterLab should open in your browser. Once you shut down the JupyterLab instance the container will stop. You can restart it with docker start -ai jupyter.
Complex
This GitHub Gist illustrates the idea how to build a Python virtual environment with JupyterLab 2 and also building all required extensions with Nodejs in the container, without installing Nodejs on host system. With JupyterLab 3 and pre-build extensions this approach gets less relevant.
Context
I was scratching my head today while debugging the %matplotlib widget not working in JupyterLab 2. I have separate pre-built JupyterLab venv (as described above) which powers local JupyterLab as Chromium "app mode" (i.e. c.LabApp.browser = 'chromium-browser --app=%s' in the config), and a few IPython kernels from simple Python venvs with specific dependencies (rarely change) and an application exposing itself as an IPython kernel. The issue with the interactive "widget" mode manifested in different ways.
For instance, having
in JupyterLab "host" venv: jupyter-matplotlib v0.7.4 extension and ipympl==0.6.3
in the kernel venv: ipympl==0.7.0 and matplotlib==3.4.2
In the browser console I had these errors:
Error: Module jupyter-matplotlib, semver range ^0.9.0 is not registered as a widget module
Error: Could not create a model.
Could not instantiate widget
In the JupyterLab UI:
%matplotlib widget succeeds on restart
Charts stuck in "Loading widget..."
Nothing on re-run of the cell with chart output
On previous attempts %matplotlib widget could raise something like KeyError: '97acd0c8fb504a2288834b349003b4ae'
On downgrade of ipympl==0.6.3 in the kernel venv in the browser console:
Could not instantiate widget
Exception opening new comm
Error: Could not create a model.
Module jupyter-matplotlib, semver range ^0.8.3 is not registered as a widget module
Once I made the packages/extensions according to ipympl compatibility table:
in JupyterLab "host" venv: jupyter-matplotlib v0.8.3 extension, ipympl==0.6.3
in the kernel venv: ipympl==0.6.3, matplotlib==3.3.4
It more or less works as expected. Well, there are verious minor glitches like except I put %matplotlib widget per cell with chart, say on restart, the first chart "accumulates" all the contents of all the charts in the notebook. With %matplotlib widget per cell, only one chart is "active" at a time. And on restart only last widget is rendered (but manual re-run of a cell remediates).
This solution works in jupyterlab
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
n = 10
a = np.zeros((n, n))
plt.figure()
for i in range(n):
plt.imshow(a)
plt.show()
a[i, i] = 1
clear_output(wait=True)

import local file to google colab

I don't understand how colab works with directories, I created a notebook, and colab put it in /Google Drive/Colab Notebooks.
Now I need to import a file (data.py) where I have a bunch of functions I need. Intuition tells me to put the file in that same directory and import it with:
import data
but apparently that's not the way...
I also tried adding the directory to the set of paths but I am specifying the directory incorrectly..
Can anyone help with this?
Thanks in advance!
Colab notebooks are stored on Google Drive. But it is run on another virtual machine. So, you need to copy your data.py there too. Do this to upload data.py through Colab.
from google.colab import files
files.upload()
# choose the file on your computer to upload it then
import data
Now google is officially providing support for accessing and working with Gdrive at ease.
You can use the below code to mount your drive to Colab:
from google.colab import drive
drive.mount('/gdrive')
%cd /gdrive/My\ Drive/{location you want to move}
To easily upload a local file you can use the new Google Colab feature:
click on right arrow on the left of your screen (below the Google
Colab logo)
select Files tab
click Upload button
It will open a popup to choose file to upload from your local filesystem.
To upload Local files from system to collab storage/directory.
from google.colab import files
def getLocalFiles():
_files = files.upload()
if len(_files) >0:
for k,v in _files.items():
open(k,'wb').write(v)
getLocalFiles()
So, here is how I finally solved this. I have to point out however, that in my case I had to work with several files and proprietary modules that were changing all the time.
The best solution I found to do this was to use a FUSE wrapper to "link" colab to my google account. I used this particular tool:
https://github.com/astrada/google-drive-ocamlfuse
There is an example of how to set up your environment there, but here is how I did it:
# Install a Drive FUSE wrapper.
!apt-get install -y -qq software-properties-common python-software-properties module-init-tools
!add-apt-repository -y ppa:alessandro-strada/ppa 2>&1 > /dev/null
!apt-get update -qq 2>&1 > /dev/null
!apt-get -y install -qq google-drive-ocamlfuse fuse
# Generate auth tokens for Colab
from google.colab import auth
auth.authenticate_user()
# Generate creds for the Drive FUSE library.
from oauth2client.client import GoogleCredentials
creds = GoogleCredentials.get_application_default()
import getpass
!google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret} < /dev/null 2>&1 | grep URL
vcode = getpass.getpass()
!echo {vcode} | google-drive-ocamlfuse -headless -id={creds.client_id} -secret={creds.client_secret}
At this point you'll have installed the wrapper and the code above will generate a couple of links for you to authorize access to your google drive account.
The you have to create a folder in the colab file system (remember this is not persistent, as far as I know...) and mount your drive there:
# Create a directory and mount Google Drive using that directory.
!mkdir -p drive
!google-drive-ocamlfuse drive
print ('Files in Drive:')
!ls drive/
the !ls command will print the directory contents so you can check it works, and that's it. You now have all the files you need and you can make changes to them with no further complications. Remember that you may need to restar the kernel to update the imports and variables.
Hope this works for someone!
you can write following commands in colab to mount the drive
from google.colab import drive
drive.mount('/content/gdrive')
and you can download from some external url into the drive through simple linux command wget like this
!wget 'https://dataverse.harvard.edu/dataset'

Changing directory in Google colab (breaking out of the python interpreter)

So I'm trying to git clone and cd into that directory using Google collab - but I cant cd into it. What am I doing wrong?
!rm -rf SwitchFrequencyAnalysis && git clone https://github.com/ACECentre/SwitchFrequencyAnalysis.git
!cd SwitchFrequencyAnalysis
!ls
datalab/ SwitchFrequencyAnalysis/
You would expect it to output the directory contents of SwitchFrequencyAnalysis - but instead its the root. I'm feeling I'm missing something obvious - Is it something to do with being within the python interpreter? (where is the documentation??)
Demo here.
use
%cd SwitchFrequencyAnalysis
to change the current working directory for the notebook environment (and not just the subshell that runs your ! command).
you can confirm it worked with the pwd command like this:
!pwd
further information about jupyter / ipython magics:
http://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-cd
As others have pointed out, the cd command needs to start with a percentage sign:
%cd SwitchFrequencyAnalysis
Difference between % and !
Google Colab seems to inherit these syntaxes from Jupyter (which inherits them from IPython).
Jake VanderPlas explains this IPython behaviour here. You can see the excerpt below.
If you play with IPython's shell commands for a while, you might
notice that you cannot use !cd to navigate the filesystem:
In [11]: !pwd
/home/jake/projects/myproject
In [12]: !cd ..
In [13]: !pwd
/home/jake/projects/myproject
The reason is that
shell commands in the notebook are executed in a temporary subshell.
If you'd like to change the working directory in a more enduring way,
you can use the %cd magic command:
In [14]: %cd ..
/home/jake/projects
Another way to look at this: you need % because changing directory is relevant to the environment of the current notebook but not to the entire server runtime.
In general, use ! if the command is one that's okay to run in a separate shell. Use % if the command needs to be run on the specific notebook.
Use os.chdir. Here's a full example:
https://colab.research.google.com/notebook#fileId=1CSPBdmY0TxU038aKscL8YJ3ELgCiGGju
Compactly:
!mkdir abc
!echo "file" > abc/123.txt
import os
os.chdir('abc')
# Now the directory 'abc' is the current working directory.
# and will show 123.txt.
!ls
If you want to use the cd or ls functions , you need proper identifiers before the function names ( % and ! respectively)
use %cd and !ls to navigate
.
!ls # to find the directory you're in ,
%cd ./samplefolder #if you wanna go into a folder (say samplefolder)
or if you wanna go out of the current folder
%cd ../
and then navigate to the required folder/file accordingly
!pwd
import os
os.chdir('/content/drive/My Drive/Colab Notebooks/Data')
!pwd
view this answer for detailed explaination
https://stackoverflow.com/a/61636734/11535267
I believe you'd have to mount the Google Drive first before you do anything else.
from google.colab import drive
drive.mount('/content/drive')

Shipping and using virtualenv in a pyspark job

PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
DETAIL:
In my local environment I have setup a virtualenv that includes numpy as well as a private repo I use in my project and other various libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages where 'venv' is my virtual environment. I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
--conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
--py-files "${parent}/lib/libs.zip" \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 2G \
--driver-memory 2G \
$parent/src/features/pi.py
I also know that on the remote nodes there is a /usr/local/bin/python2.7 folder that includes a python 2.7 install.
so in my conf/spark-env.sh I have set the following:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
When I run the script I get the error above. If I screen print the installed_distributions I get a zero length list []. Also my private library imports correctly (which says to me it is actually accessing my libs.zip site-packages.). My pi.py file looks something like this:
from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
EXPECTATION/MY THOUGHTS:
I expect this to import numpy correctly especially since I know numpy works correctly in my local virtualenv. I suspect this is because I'm not actually using the version of python that is installed in my virtualenv on the remote node. My question is first, how do I fix this and second how do I use my virtualenv installed python on the remote nodes instead of the python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this but frankly they are not well written.
With --conf spark.pyspark.{} and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (executors) use the following syntax:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
What we did in such cases: we shipped an entire anaconda distribution over hdfs.
<edit 20170908>
If we are talking about different environments (MacOs vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create yourself a 'portable' anaconda, i.e. install Anaconda in a Linux VM and zip it.
Regarding --archives vs. --py-files:
--py-files adds python files/packages to the python path. From the spark-submit documentation:
For Python applications, simply pass a .py file in the place of instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
--archives means these are extracted into the working directory of each executor (only yarn clusters).
However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.
In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
</edit>
See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.