How to set up Spark to use pandas managed by anaconda?

We've updated Spark from 2.2 to 2.3, but the admins didn't update pandas, so our jobs fail with the following error:
ImportError: Pandas >= 0.19.2 must be installed; however, your version was 0.18.1
Our admin team suggested creating a conda environment with the latest version from Anaconda (using the command conda create -n myenv anaconda).
I did that, and after activating the environment with source activate myenv and starting pyspark2, I found it was picking up the new version of pandas.
But when I submit a job with the spark2-submit command, it does not work. I added the configuration below to the spark2-submit command:
--conf spark.pyspark.virtualenv.enabled=true
--conf spark.pyspark.virtualenv.type=conda
--conf spark.pyspark.virtualenv.requirements=/home/<user>/.conda/requirements_conda.txt
--conf spark.pyspark.virtualenv.bin.path=/home/<user>/.conda/envs/myenv/bin
I also zipped the whole Python 2.7 folder and passed it in the --py-files option along with the other .py files (--py-files /home/<user>/python.zip), but I still get the same pandas version error.
I tried to follow the instructions at https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html , but had no luck.
How can I fix this and run spark2-submit with the proper pandas?

I think you may need to define environment variables such as SPARK_HOME and PYTHONPATH, pointing to the corresponding locations in your virtualenv:
export SPARK_HOME=path_to_spark_in_virtualenv
export PYTHONPATH=$SPARK_HOME/python
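To see quickly whether a given interpreter would pass the check from the error message, a comparison like the one below may help. This is only a minimal sketch: the helper names version_tuple and meets_minimum are made up for illustration, and the 0.19.2 threshold is simply copied from the ImportError above. It is not how Spark itself performs the check.

```python
import re


def version_tuple(version):
    """Turn '0.19.2' into (0, 19, 2); anything after the leading digits of a component is ignored."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum="0.19.2"):
    """Hypothetical helper: True if `installed` is at least `minimum` (threshold taken from the error message)."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '1.0' and '1.0.0' compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


print(meets_minimum("0.18.1"))  # the version reported in the error
print(meets_minimum("0.23.4"))
```

In a real job you would pass pandas.__version__ as the first argument, both on the driver and inside a task, to see which side still picks up the old install.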

Related

Get a list of installed python packages with AWS EMR Notebooks

I'm using AWS EMR Notebooks with the PySpark kernel.
Within my notebook, I'd like to use Python to analyze a list of the Python packages installed.
The following displays the list of packages, but the packages variable appears to be None:
packages = sc.list_packages()
type(packages) # <class 'NoneType'>
How can I get a list of packages into a Python variable for further analysis?
Make sure the virtualenv Spark config is enabled, i.e. "spark.pyspark.virtualenv.enabled": "true". Here is an example config:
%%configure -f
{ "conf":{
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type":"native",
"spark.pyspark.virtualenv.bin.path":"/usr/bin/virtualenv"
}
}
Check out the following documentation: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
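If the goal is just to get the package list into a variable, one option that does not depend on list_packages() returning anything is to ask the interpreter directly. The sketch below uses the standard-library importlib.metadata (Python 3.8+); installed_packages is a hypothetical helper name, and it only reports what is visible to the Python process that runs it, which I assume is the driver-side interpreter in an EMR notebook.

```python
from importlib import metadata  # standard library, Python 3.8+


def installed_packages():
    """Hypothetical helper: sorted list of (name, version) for the current interpreter."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip distributions with incomplete metadata
            found.add((name, version))
    return sorted(found, key=lambda item: item[0].lower())


packages = installed_packages()
print(type(packages), len(packages))
```

The result is an ordinary list of tuples, so it can be filtered or turned into a dict for further analysis.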

I failed to convert caffe model into mlmodel using coremltools 5

I am trying to convert a Caffe model using coremltools v5.
This is my code:
import coremltools
caffe_model = ('oxford102.caffemodel', 'deploy.prototxt')
labels = 'flower-labels.txt'
coreml_model = coremltools.converters.caffe.convert(
caffe_model,
class_labels=labels,
image_input_names='data'
)
coreml_model.save('FlowerClassifier.mlmodel')
I convert it using the command below:
python3 convert-script.py
And I get an error message like the one below.
[image: error message]
Has anybody faced this problem and found a solution?
I just came across this as I was having the same problem. Caffe support is not available in the newer versions of the coremltools API. To make this code run, an older version of coremltools (such as 3.4) must be used, which requires Python 2.7; this is best done in a virtual environment.
I assume you've solved your issue already, but I added this in case anyone else stumbles onto this question.
There are several solutions, depending on your case:
I had the same issue on my M1 Mac. You can resolve it by duplicating your Terminal app and running the copy with Rosetta (this worked for me).
cd ~/.virtualenvs/<your venv name here>/bin
mkdir bk; cp python bk; mv -f bk/python .;rmdir bk
codesign -s - --preserve-metadata=identifier,entitlements,flags,runtime -f python
For more solutions, see the related issue on GitHub.
I had the same error running Python 3.7.
In the virtualenv, the solution is to run:
pip install coremltools==3.0
There is no need to change Python versions; just rerun the script.

Problems running ImageDataBunch in Deepnote

I'm having trouble running this line of code in Deepnote, does anyone know why?
data = ImageDataBunch.from_folder(path, train="train", valid ="test",ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
The error says:
NameError: name 'ImageDataBunch' is not defined
I imported the fastai library beforehand, so I don't get it!
The FastAI setup in Deepnote is not that straightforward. It's best to use a custom environment where you set stuff up in a Dockerfile and everything works afterwards in the notebook. I am not sure if the ImageDataBunch or whatever you're trying to do works the same way in FastAI v1 and v2, but here are the details for v1.
This is a Dockerfile which sets up the FastAI environment via conda:
# This is Dockerfile
FROM deepnote/python:3.9
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH $HOME/miniconda/bin:$PATH
ENV PYTONPATH $HOME/miniconda
RUN $HOME/miniconda/bin/conda install python=3.9 ipykernel -y
RUN $HOME/miniconda/bin/conda install -c fastai -c pytorch fastai -y
RUN $HOME/miniconda/bin/python -m ipykernel install --user --name=conda
ENV DEFAULT_KERNEL_NAME "conda"
After that, you can test the fastai imports in the notebook:
import fastai
from fastai.vision import *
print(fastai.__version__)
ImageDataBunch
And if you download and unpack this sample MNIST dataset, you should be able to load the data like you suggested:
data = ImageDataBunch.from_folder(path, train="train", valid ="test",ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
Feel free to check out or clone my Deepnote project to continue working on this.
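On the v1 vs v2 question: ImageDataBunch belongs to fastai v1 (fastai.vision), while fastai v2 provides ImageDataLoaders (fastai.vision.all) in its place, so a NameError can simply mean a v2 install. A tiny sketch of that check, where loader_class_for is a made-up helper and the version strings are only examples:

```python
def loader_class_for(fastai_version):
    """Hypothetical helper: name of the image data class to expect for a fastai version string."""
    major = int(fastai_version.split(".")[0])
    # fastai v1 exposes ImageDataBunch; v2 exposes ImageDataLoaders.
    return "ImageDataBunch" if major < 2 else "ImageDataLoaders"


for example_version in ("1.0.61", "2.7.12"):
    print(example_version, "->", loader_class_for(example_version))
```

In the notebook you would pass fastai.__version__ to it before deciding which import style to use.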

Install RAPIDS library on Google Colab notebook

I was wondering whether I could install the RAPIDS library (which executes machine learning tasks entirely on the GPU) in a Google Colaboratory notebook.
I've done some research, but I haven't been able to find a way to do that...
This is now possible with the new T4 instances: https://medium.com/rapids-ai/run-rapids-on-google-colab-for-free-1617ac6323a8
To enable cuGraph too, you can replace the wget command with:
!conda install -c nvidia/label/cuda10.0 -c rapidsai/label/cuda10.0 -c pytorch \
-c numba -c conda-forge -c numba -c defaults \
boost cudf=0.6 cuml=0.6 python=3.6 cugraph=0.6 -y
Dec 2019 update
New process for RAPIDS v0.11+
Because
- RAPIDS v0.11 has dependencies (pyarrow) which were not covered by the prior install script,
- the notebooks-contrib repo, which contains RAPIDS demo notebooks (e.g. colab_notebooks) and the Colab install script, now follows the RAPIDS standard version-specific branch structure*, and
- some Colab users still enjoy v0.10,
our honorable notebooks-contrib overlord taureandyernv has updated the script, which now:
- if running v0.11 or higher, updates the pyarrow library to 0.15.x.
Here's the code cell to run in Colab for v0.11:
# Install RAPIDS
!wget -nc https://raw.githubusercontent.com/rapidsai/notebooks-contrib/890b04ed8687da6e3a100c81f449ff6f7b559956/utils/rapids-colab.sh
!bash rapids-colab.sh
import sys, os
dist_package_index = sys.path.index("/usr/local/lib/python3.6/dist-packages")
sys.path = sys.path[:dist_package_index] + ["/usr/local/lib/python3.6/site-packages"] + sys.path[dist_package_index:]
sys.path
if os.path.exists('update_pyarrow.py'):  ## This file only exists if you're using RAPIDS version 0.11 or higher
    exec(open("update_pyarrow.py").read(), globals())
For a walkthrough of setting up Colab and running this script, see How to Install RAPIDS in Google Colab.
* e.g. branch-0.11 for v0.11 and branch-0.12 for v0.12, with the default set to the current version
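The sys.path lines in the cell above put site-packages just ahead of dist-packages, but sys.path.index raises ValueError if the dist-packages entry is not there. A slightly more defensive variant might look like this. It is a sketch only: with_site_packages is a hypothetical helper, and the fallback of appending at the end is my assumption, not something the RAPIDS script does.

```python
def with_site_packages(path_entries, dist_dir, site_dir):
    """Hypothetical helper: return a new path list with site_dir placed just before dist_dir."""
    # Drop any existing copy so the entry is not duplicated on re-runs.
    entries = [p for p in path_entries if p != site_dir]
    try:
        index = entries.index(dist_dir)
    except ValueError:
        index = len(entries)  # assumption: append if dist_dir is absent
    return entries[:index] + [site_dir] + entries[index:]


dist = "/usr/local/lib/python3.6/dist-packages"
site = "/usr/local/lib/python3.6/site-packages"
print(with_site_packages(["/content", dist, "/usr/lib/python3.6"], dist, site))
```

In Colab you would then assign the result back, e.g. sys.path = with_site_packages(sys.path, dist, site).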
Looks like various subparts are not yet pip-installable so the only way to get them on colab would be to build them on colab, which might be more effort than you're interested in investing in this :)
https://github.com/rapidsai/cudf/issues/285 is the issue to watch for rapidsai/cudf (presumably the other rapidsai/ libs will follow suit).
Latest solution:
!wget -nc https://github.com/rapidsai/notebooks-extended/raw/master/utils/rapids-colab.sh
!bash rapids-colab.sh
import sys, os
sys.path.append('/usr/local/lib/python3.6/site-packages/')
os.environ['NUMBAPRO_NVVM'] = '/usr/local/cuda/nvvm/lib64/libnvvm.so'
os.environ['NUMBAPRO_LIBDEVICE'] = '/usr/local/cuda/nvvm/libdevice/'
This was pushed a few days ago; see issues #104 or #110, or the full rapids-colab.sh script, for more info.
Note: installation currently requires a Tesla T4 instance. You can check for this with:
# check gpu type
!nvidia-smi
import pynvml
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
device_name = pynvml.nvmlDeviceGetName(handle)
# your dolphin is broken, please reset & try again
if device_name != b'Tesla T4':
    raise Exception("""Unfortunately this instance does not have a T4 GPU.
Please make sure you've configured Colab to request a GPU instance type.
Sometimes Colab allocates a Tesla K80 instead of a T4. Resetting the instance.
If you get a K80 GPU, try Runtime -> Reset all runtimes...""")
# got a T4, good to go
else:
    print('Woo! You got the right kind of GPU!')
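One thing to watch in the check above: it compares against a bytes literal, and I believe the device name may come back as either bytes or str depending on the pynvml version installed. A small normalizing helper avoids a false "no T4" result in that case. This is a sketch; is_tesla_t4 is a made-up name, and the bytes/str difference is an assumption you should confirm against your installed version.

```python
def is_tesla_t4(device_name):
    """Hypothetical helper: accept the GPU name as bytes or str and compare it to 'Tesla T4'."""
    if isinstance(device_name, bytes):
        device_name = device_name.decode("utf-8", errors="replace")
    return device_name.strip() == "Tesla T4"


print(is_tesla_t4(b"Tesla T4"))
print(is_tesla_t4("Tesla K80"))
```

You would call it as is_tesla_t4(pynvml.nvmlDeviceGetName(handle)) in place of the direct comparison.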

Shipping and using virtualenv in a pyspark job

PROBLEM: I am attempting to run a spark-submit script from my local machine to a cluster of machines. The work done by the cluster uses numpy. I currently get the following error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
DETAIL:
In my local environment I have set up a virtualenv that includes numpy, a private repo I use in my project, and various other libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages, where 'venv' is my virtual environment. I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
--conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
--py-files "${parent}/lib/libs.zip" \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 2G \
--driver-memory 2G \
$parent/src/features/pi.py
I also know that on the remote nodes there is a /usr/local/bin/python2.7 folder that includes a Python 2.7 install, so in my conf/spark-env.sh I have set the following:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
When I run the script I get the error above. If I screen print the installed_distributions I get a zero length list []. Also my private library imports correctly (which says to me it is actually accessing my libs.zip site-packages.). My pi.py file looks something like this:
from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
EXPECTATION/MY THOUGHTS:
I expect this to import numpy correctly especially since I know numpy works correctly in my local virtualenv. I suspect this is because I'm not actually using the version of python that is installed in my virtualenv on the remote node. My question is first, how do I fix this and second how do I use my virtualenv installed python on the remote nodes instead of the python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this but frankly they are not well written.
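With --conf spark.pyspark.{} and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster, use the following syntax (on YARN, spark.yarn.appMasterEnv.* sets environment variables for the application master, and spark.executorEnv.* does the same for executors):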
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
What we did in such cases: we shipped an entire anaconda distribution over hdfs.
<edit 20170908>
If we are talking about different environments (macOS vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if your virtualenv contains packages with binaries (as is the case with numpy). In that case I suggest you create a 'portable' Anaconda yourself, i.e. install Anaconda in a Linux VM and zip it.
Regarding --archives vs. --py-files:
--py-files adds python files/packages to the python path. From the spark-submit documentation:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
--archives means these are extracted into the working directory of each executor (only yarn clusters).
However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.
In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
</edit>
See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.
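To make the --archives / --py-files split concrete, here is a rough sketch that only assembles the spark-submit argument list. It does not run anything. The function name, the anaconda.zip and libs.zip file names, the "environment" alias, and the assumption that the unpacked archive has bin/python at its top level are all hypothetical; the flags and the two PYSPARK_PYTHON settings are standard Spark-on-YARN options.

```python
def build_submit_command(script, archive, alias="environment", py_files=None,
                         spark_submit="spark-submit"):
    """Hypothetical helper: assemble a spark-submit argv for YARN cluster mode."""
    # Assumption: the archive unpacks so that <alias>/bin/python exists on each node.
    python_path = "./{}/bin/python".format(alias)
    command = [
        spark_submit,
        "--master", "yarn",
        "--deploy-mode", "cluster",
        # 'path#alias' makes YARN unpack the archive under ./alias in each container.
        "--archives", "{}#{}".format(archive, alias),
        "--conf", "spark.yarn.appMasterEnv.PYSPARK_PYTHON=" + python_path,
        "--conf", "spark.executorEnv.PYSPARK_PYTHON=" + python_path,
    ]
    if py_files:
        command += ["--py-files", ",".join(py_files)]
    command.append(script)
    return command


print(" ".join(build_submit_command("src/features/pi.py", "anaconda.zip",
                                    py_files=["lib/libs.zip"])))
```

The printed line can be pasted into a shell script, or the list can be handed to subprocess.run once the paths match your cluster.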