Do I need to set aws config in docker compose as volume? - amazon-s3

In my project I need to configure downloads from an AWS S3 bucket, because my deployment always hits a read timeout or a connection error when downloading a fairly large file. I have a .aws/config file in my project's root directory, and my Dockerfile uses "ADD . .", which adds all the files in the project. I build the image with Docker Compose. However, for some reason the AWS config values are not being used. Is there a way to pass these values to Docker so that they are actually used?
This is my "config" file which is in ".aws" in the root of the project.
[default]
read_timeout = 1200
connect_timeout = 1200
http_socket_timeout = 1200
s3 =
  max_concurrent_requests = 2
  multipart_threshold = 8MB
  multipart_chunksize = 8MB
My Dockerfile looks like this:
FROM python:3.7.7-stretch AS BASE
RUN apt-get update \
&& apt-get --assume-yes --no-install-recommends install \
build-essential \
curl \
git \
jq \
libgomp1 \
vim
WORKDIR /app
# upgrade pip version
RUN pip install --no-cache-dir --upgrade pip
RUN pip3 install boto3
ADD . .
I expected that through the "ADD . ." boto3 would use the config file. But that is unfortunately not the case.

This might explain why the ADD command didn't work:
Hidden file .env not copied using Docker COPY
Note also that boto3 looks for the config file at ~/.aws/config by default, while "ADD . ." with WORKDIR /app places yours at /app/.aws/config.
Instead of relying on the local config of the machine where the Docker image is built, you might want to keep the configuration as an explicit file in your repo, copy it to ~/.aws/config (or anywhere else in the container), and point the AWS_CONFIG_FILE environment variable at its path. Alternatively, use any one of the methods described in the AWS documentation below:
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html
There you can define your configuration as part of your Python code or declare it through environment variables.
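The "configuration as part of your Python code" route could look roughly like the sketch below. It mirrors the values from the question's config file. The function name download_large_file and the two settings dictionaries are made up for illustration, and mapping max_concurrent_requests to TransferConfig's max_concurrency is my assumption about the closest equivalent. boto3 is imported inside the function only so that the settings can be inspected without the SDK installed.

```python
MB = 1024 * 1024

# Values copied from the question's .aws/config file.
CLIENT_SETTINGS = {"connect_timeout": 1200, "read_timeout": 1200}
TRANSFER_SETTINGS = {
    "multipart_threshold": 8 * MB,
    "multipart_chunksize": 8 * MB,
    "max_concurrency": 2,  # assumed equivalent of max_concurrent_requests
}


def download_large_file(bucket, key, destination):
    """Download one S3 object with explicit timeouts and transfer settings."""
    # Imported lazily so the settings above can be checked without boto3.
    import boto3
    from boto3.s3.transfer import TransferConfig
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(**CLIENT_SETTINGS))
    s3.download_file(
        bucket, key, destination, Config=TransferConfig(**TRANSFER_SETTINGS)
    )
```

With this in place, a call such as download_large_file("my-bucket", "big/file.bin", "/tmp/file.bin") (placeholder names) would not depend on any config file being present in the image.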

Related

How to run a Python project (package) on AWS EMR serverless

I have a python project with several modules, classes, and dependencies files (a requirements.txt file). I want to pack it into one file with all the dependencies and give the file path to AWS EMR serverless, which will run it.
The problem is that I don't understand how to pack a python project with all the dependencies, which file the EMR can consume, etc. All the examples I have found used one python file.
In simple words, what should I do if my python project is not a single file but is more complex?
Can anyone help with some details?
There are a few ways to do this with EMR Serverless. Regardless of which way you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.
Let's assume you've got a job structure like this where main.py is your entrypoint that creates a Spark session and runs your jobs and job1 and job2 are your local modules.
├── jobs
│   ├── job1.py
│   └── job2.py
├── main.py
└── requirements.txt
Option 1. Use --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies
Zip up your job files
zip -r job_files.zip jobs
Create a virtual environment using venv-pack with your dependencies.
Note: This has to be done with a similar OS and Python version as EMR Serverless, so I prefer using a multi-stage Dockerfile with custom outputs.
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
# Copy the dependency list into the image so pip can read it
COPY requirements.txt .
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install venv-pack==0.2.0 && \
python3 -m pip install -r requirements.txt
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system.
Upload main.py, job_files.zip, and pyspark_deps.tar.gz to a location on S3.
Run your EMR Serverless job with a command like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
--application-id $APPLICATION_ID \
--execution-role-arn $JOB_ROLE_ARN \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<YOUR_BUCKET>/main.py",
"sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
}
}'
Option 2. Package your local modules as a Python library and use --archives with a packaged virtual environment
This is probably the most reliable way, but it will require you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt
[project]
name = "mysparkjobs"
version = "0.0.1"
dynamic = ["dependencies"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
You then can use a multi-stage Dockerfile and custom build outputs to package your modules and dependencies into a virtual environment.
Note: This requires you to enable Docker Buildkit
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY . .
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install venv-pack==0.2.0 && \
python3 -m pip install .
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with all your dependencies. Upload this file and your main.py script to S3.
Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
--application-id <APPLICATION_ID> \
--execution-role-arn <JOB_ROLE_ARN> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
"sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
}
}'
Note the additional sparkSubmitParameters that specify your dependencies and configure the driver and executor environment variables for the proper paths to python.
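Since the sparkSubmitParameters string gets long and is easy to mistype, here is a small sketch that assembles the --job-driver payload for either option. The helper name build_job_driver and its arguments are hypothetical; the Spark settings are the ones used in the commands above.

```python
import json

# Path to the interpreter inside the unpacked "#environment" archive.
PYTHON_IN_ARCHIVE = "./environment/bin/python"


def build_job_driver(entry_point, archive_uri, py_files_uri=None):
    """Return the --job-driver payload for `aws emr-serverless start-job-run`."""
    params = []
    if py_files_uri:  # only needed for Option 1 (zipped local modules)
        params.append(f"--py-files {py_files_uri}")
    conf = {
        "spark.archives": f"{archive_uri}#environment",
        "spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON": PYTHON_IN_ARCHIVE,
        "spark.emr-serverless.driverEnv.PYSPARK_PYTHON": PYTHON_IN_ARCHIVE,
        "spark.executorEnv.PYSPARK_PYTHON": PYTHON_IN_ARCHIVE,
    }
    params += [f"--conf {key}={value}" for key, value in conf.items()]
    return {
        "sparkSubmit": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": " ".join(params),
        }
    }


if __name__ == "__main__":
    base = "s3://<YOUR_BUCKET>/code/pyspark/myjob"
    driver = build_job_driver(f"{base}/main.py", f"{base}/pyspark_deps.tar.gz")
    print(json.dumps(driver, indent=2))
```

The printed JSON can be pasted as the --job-driver value. For Option 1, pass the S3 path of job_files.zip as the third argument.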

GraphDB Docker Container Fails to Run: adoptopenjdk/openjdk11:alpine

When using the standard Dockerfile available here, GraphDB fails to start with the following output:
Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
Looking into it, the Dockerfile uses adoptopenjdk/openjdk11:alpine, which was recently updated to Alpine 3.14.
If I switch to an older Docker image (or use adoptopenjdk/openjdk12:alpine) then GraphDB starts without a problem.
How can I fix this while still using the latest version of adoptopenjdk/openjdk11:alpine?
Below is the Dockerfile:
FROM adoptopenjdk/openjdk11:alpine
# Build time arguments
ARG version=9.1.1
ARG edition=ee
ENV GRAPHDB_PARENT_DIR=/opt/graphdb
ENV GRAPHDB_HOME=${GRAPHDB_PARENT_DIR}/home
ENV GRAPHDB_INSTALL_DIR=${GRAPHDB_PARENT_DIR}/dist
WORKDIR /tmp
RUN apk add --no-cache bash curl util-linux procps net-tools busybox-extras wget less && \
curl -fsSL "http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip" > \
graphdb-${edition}-${version}.zip && \
bash -c 'md5sum -c - <<<"$(curl -fsSL http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip.md5) graphdb-${edition}-${version}.zip"' && \
mkdir -p ${GRAPHDB_PARENT_DIR} && \
cd ${GRAPHDB_PARENT_DIR} && \
unzip /tmp/graphdb-${edition}-${version}.zip && \
rm /tmp/graphdb-${edition}-${version}.zip && \
mv graphdb-${edition}-${version} dist && \
mkdir -p ${GRAPHDB_HOME}
ENV PATH=${GRAPHDB_INSTALL_DIR}/bin:$PATH
CMD ["-Dgraphdb.home=/opt/graphdb/home"]
ENTRYPOINT ["/opt/graphdb/dist/bin/graphdb"]
EXPOSE 7200
The issue comes from an update to the base image. A few weeks ago, AdoptOpenJDK switched to Alpine 3.14, which has some issues with older container runtimes (runc). The issue is described in the release notes: https://wiki.alpinelinux.org/wiki/Release_Notes_for_Alpine_3.14.0
Updating your Docker will fix the issue. However, if you don't wish to update your Docker, there's a workaround.
Some additional info:
The cause is that containers running Alpine 3.14 on older Docker versions seem to have issues with the test flag "-x", so a check such as if [ -x /opt/java/openjdk/bin/java ] returns false even though java is there and is executable.
You can work around this for now as follows:
Pull the GraphDB distribution
Unzip it
Open "setvars.in.sh" in the bin folder
Find and remove the if block around line 32
if [ ! -x "$JAVA" ]; then
echo "Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME"
exit 1
fi
Zip it again and provide it in the Dockerfile without pulling it from maven.ontotext.com
Passing it to the Dockerfile is done with 'ADD'
You can check the GraphDB free version's Dockerfile for a reference on how to pass the zip file to the Dockerfile https://github.com/Ontotext-AD/graphdb-docker/blob/master/free-edition/Dockerfile
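If you would rather not edit the file by hand, the "remove the if block" step could be scripted. The following is only a sketch: the function names are made up, and it assumes the check in setvars.in.sh looks exactly like the snippet above and contains no nested if statements.

```python
import re
from pathlib import Path

# Matches the whole `if [ ! -x "$JAVA" ]; then ... fi` block and its newline.
JAVA_CHECK = re.compile(
    r'^[ \t]*if \[ ! -x "\$JAVA" \]; then\n.*?^[ \t]*fi[ \t]*$\n?',
    flags=re.MULTILINE | re.DOTALL,
)


def remove_java_check(script_text):
    """Return the script text with the first executable-java check removed."""
    return JAVA_CHECK.sub("", script_text, count=1)


def patch_setvars(path):
    """Rewrite the given setvars.in.sh in place without the check."""
    script = Path(path)
    script.write_text(remove_java_check(script.read_text()))
```

After unzipping the distribution, call patch_setvars on the setvars.in.sh file in its bin folder, then zip the distribution again.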

Problems at running ImageDataBunch in Deepnote

I'm having trouble running this line of code in Deepnote, does anyone know why?
data = ImageDataBunch.from_folder(path, train="train", valid ="test",ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
The error says:
NameError: name 'ImageDataBunch' is not defined
And previously, I have imported the Fastai library. So I don't get it!
The FastAI setup in Deepnote is not that straightforward. It's best to use a custom environment where you set stuff up in a Dockerfile and everything works afterwards in the notebook. I am not sure if the ImageDataBunch or whatever you're trying to do works the same way in FastAI v1 and v2, but here are the details for v1.
This is a Dockerfile which sets up the FastAI environment via conda:
# This is Dockerfile
FROM deepnote/python:3.9
RUN wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
RUN bash ~/miniconda.sh -b -p $HOME/miniconda
ENV PATH $HOME/miniconda/bin:$PATH
ENV PYTHONPATH $HOME/miniconda
RUN $HOME/miniconda/bin/conda install python=3.9 ipykernel -y
RUN $HOME/miniconda/bin/conda install -c fastai -c pytorch fastai -y
RUN $HOME/miniconda/bin/python -m ipykernel install --user --name=conda
ENV DEFAULT_KERNEL_NAME "conda"
After that, you can test the fastai imports in the notebook:
import fastai
from fastai.vision import *
print(fastai.__version__)
ImageDataBunch
And if you download and unpack this sample MNIST dataset, you should be able to load the data like you suggested:
data = ImageDataBunch.from_folder(path, train="train", valid ="test",ds_tfms=get_transforms(), size=(256,256), bs=32, num_workers=4).normalize()
Feel free to check out or clone my Deepnote project to continue working on this.
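Because ImageDataBunch is a fastai v1 name (v2 provides ImageDataLoaders instead), one possible cause of the NameError is simply that the kernel has a different major version installed than expected. A quick check that does not import fastai itself could look like this. The helper names are made up for illustration.

```python
from importlib import metadata


def major_version(version_string):
    """Return the leading integer of a version string such as '1.0.61'."""
    return int(version_string.split(".")[0])


def installed_fastai_major():
    """Return fastai's major version, or None when fastai is not installed."""
    try:
        return major_version(metadata.version("fastai"))
    except metadata.PackageNotFoundError:
        return None


major = installed_fastai_major()
if major is None:
    print("fastai is not installed in this kernel")
elif major >= 2:
    print("fastai v2+: ImageDataBunch is a v1 name; v2 offers ImageDataLoaders")
else:
    print("fastai v1: `from fastai.vision import *` should provide ImageDataBunch")
```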

How to install apache module in docker container at the correct location

I have the following Dockerfile:
FROM wodby/apache:2.4
MAINTAINER NAME EMAIL
ENV http_proxy 'http://xxx.xxx.xxx.de:80'
ENV https_proxy 'http://xxx.xxx.xxx.xxx:80'
ENV APP_ROOT="/var/www/html" \
APACHE_DIR="/usr/local/apache2"
WORKDIR /usr/local/apache2
USER root
RUN ls
RUN set -x \
&& apk add apache-mod-auth-kerb
CMD ["tail", "-f", "/dev/null"]
My intention is to add the apache-mod-auth-kerb module to my container.
The base image is Alpine, but wodby/apache inherits from wodby/httpd, which is Debian.
The module is installed under /usr/lib/apache2, but the Apache in wodby/apache seems to load its modules from /usr/local/apache2/modules.
I don't think the solution is to move the module with cp or a symlink, is it?
Here are the links to the base dockerfiles:
https://github.com/wodby/httpd
https://github.com/wodby/apache
How can I make sure that the module and config are put in the correct location? I think the problem might be the difference between the used Linux distros.
Any hints?
docker-library/httpd (maintained by Docker) supports both Alpine-based and Debian-based images.
Since wodby/httpd is forked from docker-library/httpd, you can still see Debian-related Dockerfiles in the repository, but according to its README.md only Alpine-based images are supported.
The wodby/apache images are Alpine-based as well.
For modules, you can create a conf file as shown below
mod_auth_kerb.conf
LoadModule auth_kerb_module /usr/lib/apache2/mod_auth_kerb.so
Dockerfile
FROM wodby/apache:2.4
MAINTAINER NAME EMAIL
ENV http_proxy 'http://xxx.xxx.xxx.de:80'
ENV https_proxy 'http://xxx.xxx.xxx.xxx:80'
ENV APP_ROOT="/var/www/html" \
APACHE_DIR="/usr/local/apache2"
WORKDIR /usr/local/apache2
USER root
RUN ls
RUN set -x \
&& apk add apache-mod-auth-kerb
COPY mod_auth_kerb.conf /usr/local/apache2/conf/conf.d/mod_auth_kerb.conf
You can then check that the module is loaded:
bash-4.4# httpd -M | grep auth_kerb_module
auth_kerb_module (shared)

docker file to run automation test in JS files

I am trying to create a Dockerfile to run Selenium tests for a JavaScript-based project. Below is my Dockerfile so far:
#base image
FROM selenium/standalone-chrome
#access to the project within docker container - Bundle app source
COPY ./seleniumTest/project /app
# Install Node.js
RUN sudo apt-get update
RUN sudo apt-get install --yes curl
RUN curl --silent --location https://deb.nodesource.com/setup_8.x | sudo bash -
#binding
EXPOSE 8080
#Define runtime
ENTRYPOINT /app/login.test.js
Running it with $ docker run -p 4000:80 lamgadekamal/dockertest
returns: Unable to find image 'lamkam/dockertest:latest' locally docker: Error response from daemon: manifest for lamkam/dockertest:latest not found. I could not figure out why I am getting this.
I suspect that you need to build your image first, since the image cannot be found.
Run this command from the same directory where your Dockerfile is located. This will build the image.
docker build -t lamgadekamal/dockertest .
You can then verify that the image exists by running docker images
EDIT: After looking at this again, it appears that you are trying to run the wrong image. You are trying to run lamgadekamal/dockertest, but you built the image with the tag lamkam/dockertest? Seems like you have a typo. I would suggest running docker images to see exactly what is there, but in all likelihood, you need to run lamkam/dockertest.
docker run -p 4000:80 lamkam/dockertest