Shipping and using virtualenv in a pyspark job - numpy

PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
DETAIL:
In my local environment I have set up a virtualenv that includes numpy as well as a private repo I use in my project and various other libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages, where 'venv' is my virtual environment. I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
--conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
--py-files "${parent}/lib/libs.zip" \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 2G \
--driver-memory 2G \
$parent/src/features/pi.py
I also know that the remote nodes have a Python 2.7 install at /usr/local/bin/python2.7.
So in my conf/spark-env.sh I have set the following:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
When I run the script I get the error above. If I print the installed_distributions I get a zero-length list []. Also, my private library imports correctly (which tells me it is actually accessing the site-packages in my libs.zip). My pi.py file looks something like this:
from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
EXPECTATION/MY THOUGHTS:
I expect this to import numpy correctly, especially since I know numpy works correctly in my local virtualenv. I suspect this is because I'm not actually using the version of Python that is installed in my virtualenv on the remote node. My question is, first, how do I fix this, and second, how do I use my virtualenv-installed Python on the remote nodes instead of the Python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this but frankly they are not well written.

With the --conf spark.pyspark.* options and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (the executors), use the following syntax:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
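As a quick check (a sketch of my own, not part of the original answer, which assumes an existing SparkContext named sc), you can confirm which interpreter the driver and the executors actually use:
import sys

# Interpreter used by the driver.
print("driver python:", sys.executable)

# Interpreter(s) used by the executors; if PYSPARK_PYTHON is only exported for
# the driver, these paths will differ from the driver's.
print(sc.parallelize(range(2), 2).map(lambda _: sys.executable).distinct().collect())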
Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
What we did in such cases: we shipped an entire anaconda distribution over hdfs.
<edit 20170908>
If we are talking about different environments (macOS vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create a 'portable' Anaconda yourself, i.e. install Anaconda in a Linux VM and zip it.
Regarding --archives vs. --py-files:
--py-files adds Python files/packages to the Python path. From the spark-submit documentation:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
--archives means these are extracted into the working directory of each executor (YARN clusters only).
However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.
In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
</edit>
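To make that distinction visible (a hedged sketch of my own, not from the original answer; it assumes an existing SparkContext named sc and a YARN cluster), you can inspect what an executor actually sees:
import os, sys

def describe(_):
    # Archives passed via --archives are unpacked into the executor's working directory.
    cwd_entries = sorted(os.listdir("."))
    # Zips passed via --py-files are appended to the executor's Python path.
    zips_on_path = [p for p in sys.path if p.endswith(".zip")]
    return cwd_entries, zips_on_path

print(sc.parallelize([0], 1).map(describe).collect())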
See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.

Related

I failed to convert caffe model into mlmodel using coremltools 5

I am trying to convert a Caffe model using coremltools v5.
This is my code:
import coremltools
caffe_model = ('oxford102.caffemodel', 'deploy.prototxt')
labels = 'flower-labels.txt'
coreml_model = coremltools.converters.caffe.convert(
    caffe_model,
    class_labels=labels,
    image_input_names='data'
)
coreml_model.save('FlowerClassifier.mlmodel')
I run the conversion with the command below:
python3 convert-script.py
And I get an error message like the one below:
[error message screenshot]
Has anybody faced this problem and found a solution?
I just came across this as I was having the same problem. Caffe support is not available in the newer versions of the coremltools API. To make this code run, an older version of coremltools (such as 3.4) must be used, which requires Python 2.7 - best done in a virtual environment.
I assume you've solved your issue already, but I added this in case anyone else stumbles onto this question.
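As an additional sanity check (a sketch of my own, not part of the answer above), you can fail early with a clear message when the installed coremltools no longer ships the Caffe converter:
import coremltools

try:
    # The caffe converter module was removed from newer coremltools releases.
    from coremltools.converters import caffe
except ImportError:
    raise RuntimeError(
        "coremltools %s does not include the caffe converter; "
        "install an older release (e.g. coremltools==3.4 under Python 2.7)."
        % coremltools.__version__
    )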
There are several solutions depending on your case:
I had the same issue on my M1 Mac. You can resolve it by duplicating your Terminal and running it with Rosetta (this worked for me):
cd ~/.virtualenvs/<your venv name here>/bin
mkdir bk; cp python bk; mv -f bk/python .;rmdir bk
codesign -s - --preserve-metadata=identifier,entitlements,flags,runtime -f python
For more solutions and details, you can watch this issue on GitHub.
I had the same error running Python 3.7.
In the virtualenv, the solution is to run:
pip install coremltools==3.0
You don't have to change Python versions; just rerun the script.

Problems at running ImageDataBunch in Deepnote

I'm having trouble running this line of code in Deepnote; does anyone know why?
data = ImageDataBunch.from_folder(path, train="train", valid ="test",ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
The error says:
NameError: name 'ImageDataBunch' is not defined
And I have previously imported the fastai library, so I don't get it!
The FastAI setup in Deepnote is not that straightforward. It's best to use a custom environment where you set stuff up in a Dockerfile and everything works afterwards in the notebook. I am not sure if the ImageDataBunch or whatever you're trying to do works the same way in FastAI v1 and v2, but here are the details for v1.
This is a Dockerfile which sets up the FastAI environment via conda:
# This is Dockerfile
FROM deepnote/python:3.9
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH $HOME/miniconda/bin:$PATH
ENV PYTHONPATH $HOME/miniconda
RUN $HOME/miniconda/bin/conda install python=3.9 ipykernel -y
RUN $HOME/miniconda/bin/conda install -c fastai -c pytorch fastai -y
RUN $HOME/miniconda/bin/python -m ipykernel install --user --name=conda
ENV DEFAULT_KERNEL_NAME "conda"
After that, you can test the fastai imports in the notebook:
import fastai
from fastai.vision import *
print(fastai.__version__)
ImageDataBunch
And if you download and unpack this sample MNIST dataset, you should be able to load the data like you suggested:
data = ImageDataBunch.from_folder(path, train="train", valid ="test",ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
Feel free to check out or clone my Deepnote project to continue working on this.

Install RAPIDS library on Google Colab notebook

I was wondering if I could install the RAPIDS library (for executing machine learning tasks entirely on the GPU) in a Google Colaboratory notebook?
I've done some research but I've not been able to find a way to do that...
This is now possible with the new T4 instances: https://medium.com/rapids-ai/run-rapids-on-google-colab-for-free-1617ac6323a8
To enable cuGraph too, you can replace the wget command with:
!conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c pytorch \
-c numba -c conda-forge -c numba -c defaults \
boost cudf=0.6 cuml=0.6 python=3.6 cugraph=0.6 -y
Dec 2019 update
New process for RAPIDS v0.11+
Because RAPIDS v0.11 has dependencies (pyarrow) which were not covered by the prior install script, because the notebooks-contrib repo (which contains the RAPIDS demo notebooks, e.g. colab_notebooks, and the Colab install script) now follows RAPIDS' standard version-specific branch structure*, and because some Colab users still enjoy v0.10, our honorable notebooks-contrib overlord taureandyernv has updated the script, which now:
If running v0.11 or higher, updates the pyarrow library to 0.15.x.
Here's the code cell to run in Colab for v0.11:
# Install RAPIDS
!wget -nc https://raw.githubusercontent.com/rapidsai/notebooks-contrib/890b04ed8687da6e3a100c81f449ff6f7b559956/utils/rapids-colab.sh
!bash rapids-colab.sh
import sys, os
dist_package_index = sys.path.index("/usr/local/lib/python3.6/dist-packages")
sys.path = sys.path[:dist_package_index] + ["/usr/local/lib/python3.6/site-packages"] + sys.path[dist_package_index:]
sys.path
if os.path.exists('update_pyarrow.py'): ## This file only exists if you're using RAPIDS version 0.11 or higher
    exec(open("update_pyarrow.py").read(), globals())
For a walkthrough of setting up Colab and implementing this script, see How to Install RAPIDS in Google Colab.
* e.g. branch-0.11 for v0.11 and branch-0.12 for v0.12, with the default set to the current version.
Looks like various subparts are not yet pip-installable, so the only way to get them on Colab would be to build them on Colab, which might be more effort than you're interested in investing in this :)
https://github.com/rapidsai/cudf/issues/285 is the issue to watch for rapidsai/cudf (presumably the other rapidsai/ libs will follow suit).
Latest solution:
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh
import sys, os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
This was pushed a few days ago; see issues #104 or #110, or the full rapids-colab.sh script, for more info.
Note: installation currently requires a Tesla T4 instance; checking for this can be done with:
# check gpu type
!nvidia-smi
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)
# your dolphin is broken, please reset & try again
if device_name != b'Tesla T4':
    raise Exception("""Unfortunately this instance does not have a T4 GPU.
Please make sure you've configured Colab to request a GPU instance type.
Sometimes Colab allocates a Tesla K80 instead of a T4. Resetting the instance.
If you get a K80 GPU, try Runtime -> Reset all runtimes...""")
# got a T4, good to go
else:
    print('Woo! You got the right kind of GPU!')

How to set up Spark to use pandas managed by anaconda?

We've updated Spark from version 2.2 to 2.3, but the admins didn't update pandas. So our jobs fail with the following error:
ImportError: Pandas >= 0.19.2 must be installed; however, your version was 0.18.1
Our admin team suggested creating an environment by downloading the latest version from Anaconda (using the command conda create -n myenv anaconda).
I did that, and after activating the local environment using source activate myenv and then launching pyspark2, I found it was picking up the new version of pandas.
But when I submit a job using the spark2-submit command, it does not work. I added the configuration below to the spark2-submit command:
--conf spark.pyspark.virtualenv.enabled=true
--conf spark.pyspark.virtualenv.type=conda
--conf spark.pyspark.virtualenv.requirements=/home/<user>/.conda/requirements_conda.txt
--conf spark.pyspark.virtualenv.bin.path=/home/<user>/.conda/envs/myenv/bin
I also zipped the whole Python 2.7 folder and passed it via the --py-files option along with other .py files (--py-files /home/<user>/python.zip), but I am still getting the same version issue for pandas.
I tried to follow the instructions at https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html, but still no luck.
How can I fix this and be able to spark2-submit with the proper pandas?
I think you may need to define environment variables such as SPARK_HOME and PYTHONPATH pointing to the corresponding locations in your virtualenv:
export SPARK_HOME=path_to_spark_in_virtualenv
export PYTHONPATH=$SPARK_HOME/python
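To verify that the executors actually pick up the new pandas, here is a diagnostic sketch of my own (not from the answer above), assuming an existing SparkSession named spark:
import pandas as pd

# Version seen by the driver.
print("driver pandas:", pd.__version__)

# Each executor imports pandas itself and reports the version it sees.
executor_versions = (spark.sparkContext
                     .parallelize(range(2), 2)
                     .map(lambda _: __import__("pandas").__version__)
                     .distinct()
                     .collect())
print("executor pandas:", executor_versions)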

'%matplotlib notebook' behavior in Jupyter Lab [duplicate]

With old Jupyter notebooks, I could create interactive plots via:
import matplotlib.pyplot as plt
%matplotlib notebook
x = [1,2,3]
y = [4,5,6]
plt.figure()
plt.plot(x,y)
However, in JupyterLab, this gives an error:
JavaScript output is disabled in JupyterLab
I have also tried the magic (with jupyter-matplotlib installed):
%matplotlib ipympl
But that just returns:
FigureCanvasNbAgg()
Inline plots work, but they are not interactive plots:
%matplotlib inline
JupyterLab 3.0+
Install jupyterlab and ipympl.
For pip users:
pip install --upgrade jupyterlab ipympl
For conda users:
conda update -c conda-forge jupyterlab ipympl
Restart JupyterLab.
Decorate the cell containing plotting code with the header:
%matplotlib widget
# plotting code goes here
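For example, a minimal cell of my own (assuming the steps above succeeded and the kernel was restarted) might look like:
%matplotlib widget
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])  # an interactive pan/zoom toolbar should appear with the figure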
JupyterLab 2.0
Install nodejs, e.g. conda install -c conda-forge nodejs.
Install ipympl, e.g. conda install -c conda-forge ipympl.
[Optional, but recommended.] Update JupyterLab, e.g.
conda update -c conda-forge jupyterlab=2.2.9=py_0.
[Optional, but recommended.] For a local user installation, run:
export JUPYTERLAB_DIR="$HOME/.local/share/jupyter/lab".
Install extensions:
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install jupyter-matplotlib
Enable widgets: jupyter nbextension enable --py widgetsnbextension.
Restart JupyterLab.
Decorate with %matplotlib widget.
To enable the jupyter-matplotlib backend, use the matplotlib Jupyter magic:
%matplotlib widget
import matplotlib.pyplot as plt
plt.figure()
x = [1,2,3]
y = [4,5,6]
plt.plot(x,y)
More info: jupyter-matplotlib on GitHub.
As per Georgy's suggestion, this was caused by Node.js not being installed.
Steps for JupyterLab 3.*
I had previously used Mateen's answer several times, but when I tried it with JupyterLab 3.0.7 I found that jupyter labextension install @jupyter-widgets/jupyterlab-manager returned an error and I had broken widgets.
After a lot of headaches and googling I thought I would post the solution for anyone else who finds themselves here.
The steps are now simplified, and I was able to get back to working interactive plots with the following:
pip install jupyterlab
pip install ipympl
Decorate with %matplotlib widget
Step 2 will automatically take care of the rest of the dependencies, including the replacements for (the now deprecated?) @jupyter-widgets/jupyterlab-manager.
Hope this saves someone else some time!
Summary
In a complex setup, where the jupyter-lab process and the Jupyter/IPython kernel process run in different Python virtual environments, pay attention to the versions of the Jupyter-related Python packages and Jupyter extensions (e.g. ipympl, jupyter-matplotlib) and their compatibility between the environments.
And even in a single Python virtual environment, make sure you comply with the ipympl compatibility table.
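A minimal way to check this (a sketch of my own, not from the original answer) is to print the relevant versions in each environment and compare them against the compatibility table:
# Run this in the JupyterLab venv and in the kernel venv, then compare the output
# against the ipympl compatibility table.
import matplotlib
print("matplotlib:", matplotlib.__version__)
try:
    import ipympl
    print("ipympl    :", ipympl.__version__)
except ImportError:
    print("ipympl is not installed in this environment")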
Example
A couple of examples of how to run JupyterLab.
Simple(st)
The simplest cross-platform way to run JupyterLab, I guess, is running it from a Docker container. You can build and run a JupyterLab 3 container like this:
docker run --name jupyter -it -p 8888:8888 \
# This line on a Linux- and non-user-namespaced Docker will "share"
# the directory between Docker host and container, and run from the user.
-u 1000 -v $HOME/Documents/notebooks:/tmp/notebooks \
-e HOME=/tmp/jupyter python:3.8 bash -c "
mkdir /tmp/jupyter; \
pip install --user 'jupyterlab < 4' 'ipympl < 0.8' pandas matplotlib; \
/tmp/jupyter/.local/bin/jupyter lab --ip=0.0.0.0 --port 8888 \
--no-browser --notebook-dir /tmp/notebooks;
"
When it finishes (and it'll take a while), the bottommost lines in the terminal should look something like:
To access the server, open this file in a browser:
...
http://127.0.0.1:8888/lab?token=abcdef...
You can just click on that link and JupyterLab should open in your browser. Once you shut down the JupyterLab instance the container will stop. You can restart it with docker start -ai jupyter.
Complex
This GitHub Gist illustrates how to build a Python virtual environment with JupyterLab 2, building all required extensions with Node.js inside the container, without installing Node.js on the host system. With JupyterLab 3 and pre-built extensions this approach becomes less relevant.
Context
I was scratching my head today while debugging %matplotlib widget not working in JupyterLab 2. I have a separate pre-built JupyterLab venv (as described above) which powers a local JupyterLab in Chromium "app mode" (i.e. c.LabApp.browser = 'chromium-browser --app=%s' in the config), plus a few IPython kernels from simple Python venvs with specific, rarely changing dependencies, and an application exposing itself as an IPython kernel. The issue with the interactive "widget" mode manifested itself in different ways.
For instance, having:
in JupyterLab "host" venv: jupyter-matplotlib v0.7.4 extension and ipympl==0.6.3
in the kernel venv: ipympl==0.7.0 and matplotlib==3.4.2
In the browser console I had these errors:
Error: Module jupyter-matplotlib, semver range ^0.9.0 is not registered as a widget module
Error: Could not create a model.
Could not instantiate widget
In the JupyterLab UI:
%matplotlib widget succeeds on restart
Charts stuck in "Loading widget..."
Nothing on re-run of the cell with chart output
On previous attempts %matplotlib widget could raise something like KeyError: '97acd0c8fb504a2288834b349003b4ae'
On downgrade to ipympl==0.6.3 in the kernel venv, in the browser console:
Could not instantiate widget
Exception opening new comm
Error: Could not create a model.
Module jupyter-matplotlib, semver range ^0.8.3 is not registered as a widget module
Once I aligned the packages/extensions according to the ipympl compatibility table:
in JupyterLab "host" venv: jupyter-matplotlib v0.8.3 extension, ipympl==0.6.3
in the kernel venv: ipympl==0.6.3, matplotlib==3.3.4
It more or less works as expected. Well, there are various minor glitches: unless I put %matplotlib widget in every cell that renders a chart, then on restart the first chart "accumulates" the contents of all the charts in the notebook. With %matplotlib widget per cell, only one chart is "active" at a time, and on restart only the last widget is rendered (but manually re-running a cell remediates that).
This solution works in JupyterLab:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output
n = 10
a = np.zeros((n, n))
plt.figure()
for i in range(n):
    plt.imshow(a)
    plt.show()
    a[i, i] = 1
    clear_output(wait=True)