How to run a Python project (package) on AWS EMR Serverless

I have a Python project with several modules, classes, and a dependencies file (requirements.txt). I want to pack it into one file with all the dependencies and give the file path to AWS EMR Serverless, which will run it.
The problem is that I don't understand how to pack a Python project with all its dependencies, which kind of file EMR Serverless can consume, etc. All the examples I have found use a single Python file.
In simple words: what should I do if my Python project is not a single file but something more complex?
Can anyone help with some details?

There are a few ways to do this with EMR Serverless. Regardless of which way you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.
Let's assume you've got a job structure like this, where main.py is your entrypoint that creates a Spark session and runs your jobs, and job1 and job2 are your local modules.
├── jobs
│   ├── job1.py
│   └── job2.py
├── main.py
└── requirements.txt
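For reference, here's a minimal sketch of what such a main.py might look like. The run(spark) functions are an assumption about how your job modules are structured, not a requirement:
# main.py -- hypothetical entrypoint sketch
from pyspark.sql import SparkSession

from jobs import job1, job2  # local modules (a jobs/__init__.py may be needed for zip imports)

if __name__ == "__main__":
    spark = SparkSession.builder.appName("my-emr-serverless-job").getOrCreate()
    job1.run(spark)  # assumes each job module exposes a run(spark) function
    job2.run(spark)
    spark.stop()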
Option 1. Use --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies
Zip up your job files
zip -r job_files.zip jobs
Create a virtual environment using venv-pack with your dependencies.
Note: This has to be done on a similar OS and Python version as EMR Serverless, so I prefer using a multi-stage Dockerfile with custom outputs.
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

# Build everything inside a virtual environment so venv-pack can package it
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

# The requirements file must be copied into the build context for pip to find it
COPY requirements.txt .

RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install -r requirements.txt

RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

# Export stage: copies only the packaged environment back to the host
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system.
Upload main.py, job_files.zip, and pyspark_deps.tar.gz to a location on S3.
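If you'd rather script that upload than use the console, here's a quick boto3 sketch (the bucket name and artifacts/ key prefix are placeholders; use your own):
import boto3

s3 = boto3.client("s3")
for artifact in ["main.py", "job_files.zip", "pyspark_deps.tar.gz"]:
    # "my-bucket" and the "artifacts/" prefix are placeholders
    s3.upload_file(artifact, "my-bucket", f"artifacts/{artifact}")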
Run your EMR Serverless job with a command like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
    --application-id $APPLICATION_ID \
    --execution-role-arn $JOB_ROLE_ARN \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/main.py",
            "sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'
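If you're driving this from Python instead of the CLI, the equivalent call via boto3 looks roughly like this (assuming a boto3 version recent enough to include the emr-serverless client; same placeholders as above):
import boto3

client = boto3.client("emr-serverless")
response = client.start_job_run(
    applicationId="<APPLICATION_ID>",
    executionRoleArn="<JOB_ROLE_ARN>",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/main.py",
            # Same sparkSubmitParameters string as the CLI example above
            "sparkSubmitParameters": (
                "--py-files s3://<YOUR_BUCKET>/job_files.zip "
                "--conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python "
                "--conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python "
                "--conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
            ),
        }
    },
)
print(response["jobRunId"])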
Option 2. Package your local modules as a Python library and use --archives with a packaged virtual environment
This is probably the most reliable way, but it requires you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt:
# dynamic dependencies need a reasonably recent setuptools, so pin the build backend
[build-system]
requires = ["setuptools>=62.0.0"]
build-backend = "setuptools.build_meta"

[project]
name = "mysparkjobs"
version = "0.0.1"
dynamic = ["dependencies"]

[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
You can then use a multi-stage Dockerfile with custom build outputs to package your modules and dependencies into a virtual environment.
Note: This requires Docker BuildKit to be enabled.
FROM --platform=linux/amd64 amazonlinux:2 AS base

RUN yum install -y python3

ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"

WORKDIR /app
COPY . .

# Installing "." pulls in both your local modules and the requirements.txt dependencies
RUN python3 -m pip install --upgrade pip && \
    python3 -m pip install venv-pack==0.2.0 && \
    python3 -m pip install .

RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz

FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with all your dependencies. Upload this file and your main.py script to S3.
Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
    --application-id <APPLICATION_ID> \
    --execution-role-arn <JOB_ROLE_ARN> \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
            "sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
        }
    }'
Note the additional sparkSubmitParameters that specify your dependencies and configure the driver and executor environment variables with the proper paths to Python.
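As a quick sanity check, you can have the entrypoint log which interpreter the driver actually picked up; if the archive was unpacked and the environment variables applied, it should point into the packed environment (a debugging sketch, not part of the setup):
import sys

# Expect something like ./environment/bin/python when spark.archives is wired up correctly
print("Driver Python:", sys.executable)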

Related

Do I need to set the AWS config in Docker Compose as a volume?

In my project I need to configure an AWS bucket download, because it always gets a read timeout or connection error when downloading a fairly large file in my deployment. I have a .aws/config in my root directory, and in my Dockerfile I use "ADD . .", which adds all the files in the project. To build the image I use Docker Compose. However, for some reason it is not using the AWS config values. Is there a way to pass these values to Docker so that they are actually used?
This is my "config" file which is in ".aws" in the root of the project.
[default]
read_timeout = 1200
connect_timeout = 1200
http_socket_timeout = 1200
s3 =
  max_concurrent_requests = 2
  multipart_threshold = 8MB
  multipart_chunksize = 8MB
My Dockerfile looks like this:
FROM python:3.7.7-stretch AS BASE

RUN apt-get update \
    && apt-get --assume-yes --no-install-recommends install \
        build-essential \
        curl \
        git \
        jq \
        libgomp1 \
        vim

WORKDIR /app

# upgrade pip version
RUN pip install --no-cache-dir --upgrade pip
RUN pip3 install boto3

ADD . .
I expected that through the "ADD . ." boto3 would use the config file. But that is unfortunately not the case.
Perhaps this would answer your question on why the ADD command didn't work:
Hidden file .env not copied using Docker COPY
Instead of relying on the local config settings of the machine where the Docker image is built, you might want to put the configuration as an explicit file in your repo, copy it to ~/.aws/config (or anywhere in the container) and reference it by setting its path in AWS_CONFIG_FILE; or use any one of the methods defined in the AWS documentation below:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
wherein you can define your configuration as part of your Python code or declare it via environment variables.
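For example, the timeout and S3 transfer settings from the config file above can also be expressed directly in code. A sketch with boto3 (bucket and key names are placeholders):
import boto3
from botocore.config import Config
from boto3.s3.transfer import TransferConfig

# Equivalent of read_timeout / connect_timeout from the [default] section
s3 = boto3.client("s3", config=Config(read_timeout=1200, connect_timeout=1200))

# Equivalent of the s3 subsection: concurrency and multipart settings
transfer_config = TransferConfig(
    max_concurrency=2,
    multipart_threshold=8 * 1024 * 1024,
    multipart_chunksize=8 * 1024 * 1024,
)

s3.download_file("my-bucket", "path/to/large-file", "large-file", Config=transfer_config)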

GraphDB Docker Container Fails to Run: adoptopenjdk/openjdk12:alpine

When using the standard Dockerfile available here, GraphDB fails to start with the following output:
Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
Looking into it, the Dockerfile uses adoptopenjdk/openjdk11:alpine, which was recently updated to Alpine 3.14.
If I switch to an older Docker image (or use adoptopenjdk/openjdk12:alpine), then GraphDB starts without a problem.
How can I fix this while still using the latest version of adoptopenjdk/openjdk11:alpine?
Below is the Dockerfile:
FROM adoptopenjdk/openjdk11:alpine

# Build time arguments
ARG version=9.1.1
ARG edition=ee

ENV GRAPHDB_PARENT_DIR=/opt/graphdb
ENV GRAPHDB_HOME=${GRAPHDB_PARENT_DIR}/home
ENV GRAPHDB_INSTALL_DIR=${GRAPHDB_PARENT_DIR}/dist

WORKDIR /tmp

RUN apk add --no-cache bash curl util-linux procps net-tools busybox-extras wget less && \
    curl -fsSL "http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip" > \
        graphdb-${edition}-${version}.zip && \
    bash -c 'md5sum -c - <<<"$(curl -fsSL http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip.md5) graphdb-${edition}-${version}.zip"' && \
    mkdir -p ${GRAPHDB_PARENT_DIR} && \
    cd ${GRAPHDB_PARENT_DIR} && \
    unzip /tmp/graphdb-${edition}-${version}.zip && \
    rm /tmp/graphdb-${edition}-${version}.zip && \
    mv graphdb-${edition}-${version} dist && \
    mkdir -p ${GRAPHDB_HOME}

ENV PATH=${GRAPHDB_INSTALL_DIR}/bin:$PATH

CMD ["-Dgraphdb.home=/opt/graphdb/home"]
ENTRYPOINT ["/opt/graphdb/dist/bin/graphdb"]

EXPOSE 7200
The issue comes from an update in the base image. A few weeks ago AdoptOpenJDK switched to Alpine 3.14, which has some issues with older container runtimes (runc). The issue can be seen in the release notes: https://wiki.alpinelinux.org/wiki/Release_Notes_for_Alpine_3.14.0
Updating your Docker will fix the issue. However, if you don't wish to update your Docker, there's a workaround.
Some additional info:
The cause of the issue is that, for some reason, containers running on older Docker versions with Alpine 3.14 seem to have issues with the test flag "-x", so an if [ -x /opt/java/openjdk/bin/java ] returns false although java is there and is executable.
You can work around this for now as follows:
1. Pull the GraphDB distribution.
2. Unzip it.
3. Open "setvars.in.sh" in the bin folder.
4. Find and remove the if block around line 32:
if [ ! -x "$JAVA" ]; then
    echo "Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME"
    exit 1
fi
5. Zip it again and provide it in the Dockerfile without pulling it from maven.ontotext.com. Passing it to the Dockerfile is done with 'ADD'.
You can check the GraphDB free version's Dockerfile for a reference on how to pass the zip file to the Dockerfile: https://github.com/Ontotext-AD/graphdb-docker/blob/master/free-edition/Dockerfile

Problems running ImageDataBunch in Deepnote

I'm having trouble running this line of code in Deepnote. Does anyone know why?
data = ImageDataBunch.from_folder(path, train="train", valid="test", ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
The error says:
NameError: name 'ImageDataBunch' is not defined
And I have previously imported the fastai library, so I don't get it!
The FastAI setup in Deepnote is not that straightforward. It's best to use a custom environment where you set stuff up in a Dockerfile and everything works afterwards in the notebook. I am not sure if the ImageDataBunch or whatever you're trying to do works the same way in FastAI v1 and v2, but here are the details for v1.
This is a Dockerfile which sets up the FastAI environment via conda:
# This is Dockerfile
FROM deepnote/python:3.9
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH $HOME/miniconda/bin:$PATH
ENV PYTHONPATH $HOME/miniconda
RUN $HOME/miniconda/bin/conda install python=3.9 ipykernel -y
RUN $HOME/miniconda/bin/conda install -c fastai -c pytorch fastai -y
RUN $HOME/miniconda/bin/python -m ipykernel install --user --name=conda
ENV DEFAULT_KERNEL_NAME "conda"
After that, you can test the fastai imports in the notebook:
import fastai
from fastai.vision import *
print(fastai.__version__)
ImageDataBunch
And if you download and unpack this sample MNIST dataset, you should be able to load the data like you suggested:
data = ImageDataBunch.from_folder(path, train="train", valid="test", ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
Feel free to check out or clone my Deepnote project to continue working on this.

hdfsBuilderConnect error while using TF Serving to load a model from HDFS

Here is my environment info:
TensorFlow Serving version 1.14
OS: macOS 10.15.7
I want to load a model from HDFS using TF Serving.
When I build a tensorflow-serving:hadoop Docker image, like this:
FROM tensorflow/serving:2.2.0

RUN apt update && apt install -y openjdk-8-jre

RUN mkdir /opt/hadoop-2.8.2
COPY /hadoop-2.8.2 /opt/hadoop-2.8.2

ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_HDFS_HOME /opt/hadoop-2.8.2
ENV HADOOP_HOME /opt/hadoop-2.8.2
ENV LD_LIBRARY_PATH ${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server
# ENV PATH $PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

# Entrypoint script: build the Hadoop CLASSPATH, then start the model server
RUN echo '#!/bin/bash \n\n\
CLASSPATH=$(${HADOOP_HDFS_HOME}/bin/hadoop classpath --glob) \
tensorflow_model_server --port=8500 --rest_api_port=9000 \
--model_name=${MODEL_NAME} --model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \
"$@"' > /usr/bin/tf_serving_entrypoint.sh \
&& chmod +x /usr/bin/tf_serving_entrypoint.sh

EXPOSE 8500
EXPOSE 9000

ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]
and then run:
docker run -p 9001:9000 --name tensorflow-serving-11 -e MODEL_NAME=tfrest -e MODEL_BASE_PATH=hdfs://ip:port/user/cess2_test/workspace/cess/models -t tensorflow_serving:1.14-hadoop-2.8.2
I met this problem (PS: I have already modified the Hadoop config in hadoop-2.8.2):
hdfsBuilderConnect(forceNewInstance=0, nn=ip:port, port=0, kerbTicketCachePath=(NULL), userName=(NULL))
error:(unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
Are there any suggestions on how to solve this problem?
Thanks
I solved this problem by adding the absolute Hadoop path to the CLASSPATH.

Create default files for conan without install

I'm creating a Docker image as a build environment where I can mount a project and build it. For the build I use CMake and Conan. The Dockerfile of this image:
FROM alpine:3.9
RUN ["apk", "add", "--no-cache", "gcc", "g++", "make", "cmake", "python3", "python3-dev", "linux-headers", "musl-dev"]
RUN ["pip3", "install", "--upgrade", "pip"]
RUN ["pip3", "install", "conan"]
WORKDIR /project
Files like
~/.conan/profiles/default
are created after I call
conan install ..
so that these files are created in the container and not in the image. The default behavior of conan is to set
compiler.libcxx=libstdc++
I'd like to run something like
RUN ["sed", "-i", "s/compiler.libcxx=libstdc++/compiler.libcxx=libstdc++11/", "~/.conan/profiles/default"]
to change the libcxx value, but this file does not exist at this point. The only way I found to have Conan create the default profile would be to install something.
Currently I'm running this container with
docker run --rm -v $(dirname $(realpath $0))/project:/project build-environment /bin/sh -c "\
    rm -rf build && \
    mkdir build && \
    cd build && \
    conan install -s compiler.libcxx=libstdc++11 .. --build missing && \
    cmake .. && \
    cmake --build . ; \
    chown -R $(id -u):$(id -u) /project/build \
"
but I need to remove -s compiler.libcxx=libstdc++11 as it should be dependent on the image and not fixed by the build script.
Is there a way to initialize Conan inside the image and edit the configuration without installing something? Currently I'm planning to write the whole configuration myself, but that seems like a little too much, as I want to use the default configuration and change only one line.
You can also create an image from a running container. Try installing Conan in a running container and then create an image of it. As it is installed in the running container, the image will have all the dependencies it needs. To create that image you can follow this link:
https://docs.docker.com/engine/reference/commandline/commit/