Get a list of installed Python packages with AWS EMR Notebooks

I'm using AWS EMR Notebooks with the PySpark kernel.
Within my notebook, I'd like to use Python to analyze a list of the Python packages installed.
The following displays the list of packages, but the packages variable itself is None:
packages = sc.list_packages()
type(packages) # <class 'NoneType'>
How can I get a list of packages into a Python variable for further analysis?

Make sure the virtualenv Spark config "spark.pyspark.virtualenv.enabled" is set to "true". Here is an example config:
%%configure -f
{
  "conf": {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.type": "native",
    "spark.pyspark.virtualenv.bin.path": "/usr/bin/virtualenv"
  }
}
Check out the following documentation: https://aws.amazon.com/blogs/big-data/install-python-libraries-on-a-running-cluster-with-emr-notebooks/
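Note that sc.list_packages() prints its report and returns None, which is why the variable ends up empty. If you also want the list in a Python variable, one workaround (a sketch, not EMR-specific; it enumerates whatever the driver's interpreter can see, which should correspond to the virtualenv configured above) is to use pkg_resources:
import pkg_resources

# Build a {package_name: version} dict from the packages visible to the
# interpreter running the driver (the notebook session's environment).
packages = {
    dist.project_name: dist.version
    for dist in pkg_resources.working_set
}

print(len(packages), "packages installed")
print(sorted(packages.items())[:5])  # peek at a few entries
From here, packages is an ordinary dict you can filter, sort, or turn into a DataFrame for further analysis.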

Related

GitLab CI/CD with Terraform + Python

In my GitLab CI/CD pipeline, I have Terraform code that requires Python to use an external module.
When running terraform plan via GitLab pipelines, I get the following error:
module.notify_slack.module.lambda.data.aws_caller_identity.current[0]: Refreshing state...
Error: can't find external program "python3"
on .terraform/modules/notify_slack.lambda/terraform-aws-lambda-1.6.0/package.tf line 3, in data "external" "archive_prepare":
3: data "external" "archive_prepare" {
ERROR: Job failed: exit code 1
What image do I need to use that contains Terraform and Python? Will I need to create my own Docker image?
I know this is a bit of an old post, but I'll share my solution in case anyone else stumbles upon this problem too.
Choose an existing Python image and install Terraform manually - this seems to me to be the easiest solution, if pragmatism is important to you.
This is the relevant section of my .gitlab-ci.yml file:
default:
  image: python:latest
  before_script:
    - python -V  # display version for debugging purposes only
    - apt-get update -y
    - apt-get install unzip wget -y
    - wget https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip
    - unzip terraform_${TERRAFORM_VERSION}_linux_amd64.zip
    - mv terraform /usr/local/bin/
    - terraform --version  # display version for debugging purposes only
The TERRAFORM_VERSION environment variable is set in the GitLab CI/CD settings; alternatively, just replace it with the specific Terraform version you want.
I was pleasantly surprised at how quickly this installation completes, given that it clearly isn't an optimal way to do it: the best-performing runner would probably use your own custom image with all of your required dependencies pre-installed. I'll leave you to decide whether that's worth it for your own purposes. Nonetheless, this solution doesn't appear to be prohibitively slow.

How to submit a Spark job whose JAR is hosted in an S3 object store

I have a Spark cluster running on YARN, and I want to put my job's JAR into an S3-compatible object store. From a web search, submitting the job seems to be as simple as:
spark-submit --master yarn --deploy-mode cluster <...other parameters...> s3://my_bucket/jar_file
However, the S3 object store requires a user name and password (access key and secret key) for access. How can I configure those credentials so that Spark can download the JAR from S3?
Many thanks!
You can use the Default Credential Provider Chain (see the AWS docs):
export AWS_ACCESS_KEY_ID=your_access_key
export AWS_SECRET_ACCESS_KEY=your_secret_key
./bin/spark-submit \
--master local[2] \
--class org.apache.spark.examples.SparkPi \
s3a://your_bucket/.../spark-examples_2.11-2.4.6-SNAPSHOT.jar
I also needed to download the following JARs from Maven and put them into Spark's jars directory to make the s3a:// scheme usable with spark-submit (note that you can use the --packages option to pull in these dependencies for your application code, but not for spark-submit itself):
# build the Spark `assembly` project
sbt "project assembly" package
cd assembly/target/scala-2.11/jars/
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.7/hadoop-aws-2.7.7.jar
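The same credentials can also be supplied as Hadoop s3a configuration properties rather than environment variables. Below is a hedged PySpark sketch for reading data from inside the application (the JAR fetched at submit time still relies on the credential chain above); the endpoint URL, key values, and bucket path are placeholders:
from pyspark.sql import SparkSession

# Placeholder credentials and endpoint; in practice load them from the
# environment or a secrets store rather than hard-coding them.
spark = (
    SparkSession.builder
    .appName("s3a-credentials-sketch")
    .config("spark.hadoop.fs.s3a.access.key", "your_access_key")
    .config("spark.hadoop.fs.s3a.secret.key", "your_secret_key")
    # For a non-AWS, S3-compatible object store, also point s3a at its endpoint.
    .config("spark.hadoop.fs.s3a.endpoint", "https://s3.example.internal")
    .getOrCreate()
)

df = spark.read.text("s3a://your_bucket/some/prefix/")
df.show(5)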

How to set up Spark to use pandas managed by Anaconda?

We've updated Spark from 2.2 to 2.3, but the admins didn't update pandas. So our jobs fail with the following error:
ImportError: Pandas >= 0.19.2 must be installed; however, your version was 0.18.1
Our admin team suggested creating a virtual environment with the latest version from Anaconda (using the command conda create -n myenv anaconda).
I did that, and after activating the environment with source activate myenv and launching pyspark2, I found it was picking up the new version of pandas.
But when I submit a job with the spark2-submit command, it does not work. I added the configuration below to the spark2-submit command:
--conf spark.pyspark.virtualenv.enabled=true
--conf spark.pyspark.virtualenv.type=conda
--conf spark.pyspark.virtualenv.requirements=/home/<user>/.conda/requirements_conda.txt
--conf spark.pyspark.virtualenv.bin.path=/home/<user>/.conda/envs/myenv/bin
I also zipped the whole Python 2.7 folder and passed it in the --py-files option along with the other .py files (--py-files /home/<user>/python.zip), but I still get the same pandas version issue.
I tried to follow the instructions at https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html, but no luck yet.
How can I fix this so that spark2-submit picks up the proper pandas?
I think you may need to define environment variables such as SPARK_HOME and PYTHONPATH pointing to the corresponding locations in your virtualenv.
export SPARK_HOME=path_to_spark_in_virtualenv
export PYTHONPATH=$SPARK_HOME/python
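To confirm whether the executors actually pick up the virtualenv's pandas rather than the system one, here is a small diagnostic sketch (it assumes a live SparkContext named sc):
import sys
import pandas as pd

# Version seen by the driver.
print("driver:", sys.executable, pd.__version__)

def versions_on_executor(_):
    # Imported inside the function so the lookup happens on the executor.
    import sys
    import pandas as pd
    return [(sys.executable, pd.__version__)]

# Version seen by one executor.
print(sc.parallelize([0], 1).flatMap(versions_on_executor).collect())
If the executor side still reports the old pandas, the virtualenv/conda settings are not reaching the executors.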

'%matplotlib notebook' behavior in JupyterLab

With old Jupyter notebooks, I could create interactive plots via:
import matplotlib.pyplot as plt
%matplotlib notebook
x = [1,2,3]
y = [4,5,6]
plt.figure()
plt.plot(x,y)
However, in JupyterLab, this gives an error:
JavaScript output is disabled in JupyterLab
I have also tried the magic (with jupyter-matplotlib installed):
%matplotlib ipympl
But that just returns:
FigureCanvasNbAgg()
Inline plots work, but they are not interactive plots:
%matplotlib inline
JupyterLab 3.0+
Install jupyterlab and ipympl.
For pip users:
pip install --upgrade jupyterlab ipympl
For conda users:
conda update -c conda-forge jupyterlab ipympl
Restart JupyterLab.
Decorate the cell containing plotting code with the header:
%matplotlib widget
# plotting code goes here
JupyterLab 2.0
Install nodejs, e.g. conda install -c conda-forge nodejs.
Install ipympl, e.g. conda install -c conda-forge ipympl.
[Optional, but recommended.] Update JupyterLab, e.g.
conda update -c conda-forge jupyterlab=2.2.9=py_0.
[Optional, but recommended.] For a local user installation, run:
export JUPYTERLAB_DIR="$HOME/.local/share/jupyter/lab".
Install extensions:
jupyter labextension install @jupyter-widgets/jupyterlab-manager
jupyter labextension install jupyter-matplotlib
Enable widgets: jupyter nbextension enable --py widgetsnbextension.
Restart JupyterLab.
Decorate with %matplotlib widget.
To enable the jupyter-matplotlib backend, use the matplotlib Jupyter magic:
%matplotlib widget
import matplotlib.pyplot as plt
plt.figure()
x = [1,2,3]
y = [4,5,6]
plt.plot(x,y)
More info here: jupyter-matplotlib on GitHub.
As per Georgy's suggestion, this was caused by Node.js not being installed.
Steps for JupyterLab 3.*
I had previously used Mateen's answer several times, but when I tried it with JupyterLab 3.0.7 I found that jupyter labextension install @jupyter-widgets/jupyterlab-manager returned an error and I had broken widgets.
After a lot of headaches and googling I thought I would post the solution for anyone else who finds themselves here.
The steps are now simplified, and I was able to get back to working interactive plots with the following:
pip install jupyterlab
pip install ipympl
Decorate with %matplotlib widget
Step 2 will automatically take care of the rest of the dependencies, including the replacements for the (now deprecated?) @jupyter-widgets/jupyterlab-manager.
Hope this saves someone else some time!
Summary
In a complex setup, where the jupyter-lab process and the Jupyter/IPython kernel process run in different Python virtual environments, pay attention to the Jupyter-related Python package and Jupyter extension (e.g. ipympl, jupyter-matplotlib) versions and their compatibility between the environments.
Even in a single Python virtual environment, make sure you comply with the ipympl compatibility table.
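To check the kernel-side versions against that table, a quick sketch is to print them from a notebook cell (the JupyterLab-side extension versions can be listed separately with jupyter labextension list):
# Print the widget-related package versions seen by the kernel's environment.
import matplotlib
import ipywidgets
import ipympl

print("matplotlib:", matplotlib.__version__)
print("ipywidgets:", ipywidgets.__version__)
print("ipympl:    ", ipympl.__version__)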
Example
A couple of examples of how to run JupyterLab.
Simple(st)
The simplest cross-platform way to run JupyterLab, I guess, is to run it from a Docker container. You can build and run a JupyterLab 3 container like this:
# On Linux with a non-user-namespaced Docker, the -u/-v options below "share" the
# notebooks directory between the Docker host and the container and run as that user.
docker run --name jupyter -it -p 8888:8888 \
    -u 1000 -v $HOME/Documents/notebooks:/tmp/notebooks \
    -e HOME=/tmp/jupyter python:3.8 bash -c "
        mkdir /tmp/jupyter; \
        pip install --user 'jupyterlab < 4' 'ipympl < 0.8' pandas matplotlib; \
        /tmp/jupyter/.local/bin/jupyter lab --ip=0.0.0.0 --port 8888 \
            --no-browser --notebook-dir /tmp/notebooks;
    "
When it finishes (and it will take a while), the bottommost lines in the terminal should look something like this:
To access the server, open this file in a browser:
...
http://127.0.0.1:8888/lab?token=abcdef...
You can just click on that link and JupyterLab should open in your browser. Once you shut down the JupyterLab instance the container will stop. You can restart it with docker start -ai jupyter.
Complex
This GitHub Gist illustrates how to build a Python virtual environment with JupyterLab 2, building all required extensions with Node.js inside the container, without installing Node.js on the host system. With JupyterLab 3 and prebuilt extensions, this approach is less relevant.
Context
I was scratching my head today while debugging %matplotlib widget not working in JupyterLab 2. I have a separate pre-built JupyterLab venv (as described above) which powers a local JupyterLab in Chromium "app mode" (i.e. c.LabApp.browser = 'chromium-browser --app=%s' in the config), plus a few IPython kernels from simple Python venvs with specific, rarely changing dependencies, and an application exposing itself as an IPython kernel. The issue with the interactive "widget" mode manifested itself in different ways.
For instance, having
in JupyterLab "host" venv: jupyter-matplotlib v0.7.4 extension and ipympl==0.6.3
in the kernel venv: ipympl==0.7.0 and matplotlib==3.4.2
In the browser console I had these errors:
Error: Module jupyter-matplotlib, semver range ^0.9.0 is not registered as a widget module
Error: Could not create a model.
Could not instantiate widget
In the JupyterLab UI:
%matplotlib widget succeeds on restart
Charts stuck in "Loading widget..."
Nothing on re-run of the cell with chart output
On previous attempts %matplotlib widget could raise something like KeyError: '97acd0c8fb504a2288834b349003b4ae'
On downgrade of ipympl==0.6.3 in the kernel venv in the browser console:
Could not instantiate widget
Exception opening new comm
Error: Could not create a model.
Module jupyter-matplotlib, semver range ^0.8.3 is not registered as a widget module
Once I aligned the packages/extensions with the ipympl compatibility table:
in JupyterLab "host" venv: jupyter-matplotlib v0.8.3 extension, ipympl==0.6.3
in the kernel venv: ipympl==0.6.3, matplotlib==3.3.4
It more or less works as expected. There are still various minor glitches: unless I put %matplotlib widget in every cell with a chart, then (say, on restart) the first chart "accumulates" the contents of all the charts in the notebook. With %matplotlib widget per cell, only one chart is "active" at a time, and on restart only the last widget is rendered (though manually re-running a cell remediates that).
This solution works in JupyterLab:
import numpy as np
import matplotlib.pyplot as plt
from IPython.display import clear_output

n = 10
a = np.zeros((n, n))
plt.figure()

for i in range(n):
    plt.imshow(a)
    plt.show()
    a[i, i] = 1
    clear_output(wait=True)

Shipping and using virtualenv in a pyspark job

PROBLEM: I am attempting to run a spark-submit script from my local machine against a cluster of machines. The work done on the cluster uses numpy. I currently get the following error:
ImportError:
Importing the multiarray numpy extension module failed. Most
likely you are trying to import a failed build of numpy.
If you're working with a numpy git repo, try `git clean -xdf` (removes all
files not under version control). Otherwise reinstall numpy.
Original error was: cannot import name multiarray
DETAIL:
In my local environment I have set up a virtualenv that includes numpy, a private repo I use in my project, and various other libraries. I created a zip file (lib/libs.zip) from the site-packages directory at venv/lib/site-packages, where 'venv' is my virtual environment. I ship this zip to the remote nodes. My shell script for performing the spark-submit looks like this:
$SPARK_HOME/bin/spark-submit \
--deploy-mode cluster \
--master yarn \
--conf spark.pyspark.virtualenv.enabled=true \
--conf spark.pyspark.virtualenv.type=native \
--conf spark.pyspark.virtualenv.requirements=${parent}/requirements.txt \
--conf spark.pyspark.virtualenv.bin.path=${parent}/venv \
--py-files "${parent}/lib/libs.zip" \
--num-executors 1 \
--executor-cores 2 \
--executor-memory 2G \
--driver-memory 2G \
$parent/src/features/pi.py
I also know that on the remote nodes there is a Python 2.7 install at /usr/local/bin/python2.7.
So in my conf/spark-env.sh I have set the following:
export PYSPARK_PYTHON=/usr/local/bin/python2.7
export PYSPARK_DRIVER_PYTHON=/usr/local/bin/python2.7
When I run the script I get the error above. If I print the installed_distributions I get an empty list ([]). Also, my private library imports correctly, which tells me it is actually accessing the site-packages in my libs.zip. My pi.py file looks something like this:
from myprivatelibrary.bigData.spark import spark_context
spark = spark_context()
import numpy as np
spark.parallelize(range(1, 10)).map(lambda x: np.__version__).collect()
EXPECTATION/MY THOUGHTS:
I expect this to import numpy correctly, especially since I know numpy works correctly in my local virtualenv. I suspect this is because I'm not actually using the version of Python that is installed in my virtualenv on the remote node. My questions are: first, how do I fix this, and second, how do I use the Python installed in my virtualenv on the remote nodes instead of the Python that is just manually installed and currently sitting on those machines? I've seen some write-ups on this, but frankly they are not well written.
With --conf spark.pyspark.* and export PYSPARK_PYTHON=/usr/local/bin/python2.7 you set options for your local environment / your driver. To set options for the cluster (the executors), use the following syntax:
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON
Furthermore, I guess you should make your virtualenv relocatable (this is experimental, however). <edit 20170908> This means that the virtualenv uses relative instead of absolute links. </edit>
What we did in such cases: we shipped an entire anaconda distribution over hdfs.
<edit 20170908>
If we are talking about different environments (macOS vs. Linux, as mentioned in the comment below), you cannot just submit a virtualenv, at least not if it contains packages with binaries (as is the case with numpy). In that case I suggest you create a 'portable' Anaconda, i.e. install Anaconda in a Linux VM and zip it.
Regarding --archives vs. --py-files:
--py-files adds Python files/packages to the Python path. From the spark-submit documentation:
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
--archives means these are extracted into the working directory of each executor (YARN clusters only).
However, a crystal-clear distinction is lacking, in my opinion - see for example this SO post.
In the given case, add the anaconda.zip via --archives, and your 'other python files' via --py-files.
</edit>
See also: Running Pyspark with Virtualenv, a blog post by Henning Kropp.
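Whichever way you ship the environment (--archives with a zipped Anaconda, or the virtualenv options), a quick sanity check from the driver is to ask an executor which interpreter and numpy it actually loads. This is a sketch assuming a SparkContext named sc, in place of the custom spark object from the question's pi.py:
def env_report(_):
    # Imported here so the resolution happens on the executor, not the driver.
    import sys
    import numpy as np
    return [(sys.executable, np.__version__)]

# If the executor still reports /usr/local/bin/python2.7 (or the system numpy),
# the shipped environment is not being used.
print(sc.parallelize([0], 1).flatMap(env_report).collect())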