How to work with Mahout? - apache

I have tried to work with mahout using the link:
http://girlincomputerscience.blogspot.in/2010/11/apache-mahout.html
When i execute the command
anand#ubuntu:~/Downloads/mahout-distribution-0.9$ bin/mahout recommenditembased --input mydata.dat --usersFile user.dat --numRecommendations 2 --output output/ --similarityClassname SIMILARITY_PEARSON_CORRELATION
it shows:
hadoop binary is not in PATH,HADOOP_HOME/bin,HADOOP_PREFIX/bin, running locally
Error occurred during initialization of VM
Could not reserve enough space for object heap
Could not create the Java virtual machine.
What may be the possible reasons for the errors?

You need to setup Apache Hadoop first ( also described here ):
$ http://mirror.metrocast.net/apache/hadoop/common/hadoop-1.2.1/hadoop-1.2.1-bin.tar.gz
$ tar zxf hadoop-1.2.1-bin.tar.gz
$ cd hadoop-1.2.1
$ export PATH=`pwd`/bin:$PATH
Then try to setup Apache Mahout
Apache Mahout setup
$ wget -c http://archive.apache.org/dist/mahout/0.9/mahout-distribution-0.9.tar.gz
$ tar zxf mahout-distribution-0.9.tar.gz
$ cd mahout-distribution-0.9
Setup env variables:
export HADOOP_HOME=/path/to/hadoop-1.2.1
export PATH=$HADOOP_HOME/bin:$PATH
export MAHOUT_HOME=/path/to/mahout-distribution-0.9
export PATH=$MAHOUT_HOME/bin:$PATH
Now your Mahout installation should work fine ( read here for more ).

Related

How to run a Python project (package) on AWS EMR serverless

I have a python project with several modules, classes, and dependencies files (a requirements.txt file). I want to pack it into one file with all the dependencies and give the file path to AWS EMR serverless, which will run it.
The problem is that I don't understand how to pack a python project with all the dependencies, which file the EMR can consume, etc. All the examples I have found used one python file.
In simple words, what should I do if my python project is not a single file but is more complex?
Can anyone help with some details?
There's a few ways to do this with EMR Serverless. Regardless of which way you choose, you will need to provide a main entrypoint Python script to the EMR Serverless StartJobRun command.
Let's assume you've got a job structure like this where main.py is your entrypoint that creates a Spark session and runs your jobs and job1 and job2 are your local modules.
├── jobs
│ └── job1.py
│ └── job2.py
├── main.py
├── requirements.txt
Option 1. Use --py-files with your zipped local modules and --archives with a packaged virtual environment for your external dependencies
Zip up your job files
zip -r job_files.zip jobs
Create a virtual environment using venv-pack with your dependencies.
Note: This has to be done with a similar OS and Python version as EMR Serverless, so I prefer using a multi-stage Dockerfile with custom outputs.
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install venv-pack==0.2.0 && \
python3 -m pip install -r requirements.txt
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
If you run DOCKER_BUILDKIT=1 docker build --output . ., you should now have a pyspark_deps.tar.gz file on your local system.
Upload main.py, job_files.zip, and pyspark_deps.tar.gz to a location on S3.
Run your EMR Serverless job with a command like this (replacing APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET):
aws emr-serverless start-job-run \
--application-id $APPLICATION_ID \
--execution-role-arn $JOB_ROLE_ARN \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<YOUR_BUCKET>/main.py",
"sparkSubmitParameters": "--py-files s3://<YOUR_BUCKET>/job_files.zip --conf spark.archives=s3://<YOUR_BUCKET>/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
}
}'
Option 2. Package your local modules as a Python library and use --archives with a packaged virtual environment
This is probably the most reliable way, but it will require you to use setuptools. You can use a simple pyproject.toml file along with your existing requirements.txt
[project]
name = "mysparkjobs"
version = "0.0.1"
dynamic = ["dependencies"]
[tool.setuptools.dynamic]
dependencies = {file = ["requirements.txt"]}
You then can use a multi-stage Dockerfile and custom build outputs to package your modules and dependencies into a virtual environment.
Note: This requires you to enable Docker Buildkit
FROM --platform=linux/amd64 amazonlinux:2 AS base
RUN yum install -y python3
ENV VIRTUAL_ENV=/opt/venv
RUN python3 -m venv $VIRTUAL_ENV
ENV PATH="$VIRTUAL_ENV/bin:$PATH"
WORKDIR /app
COPY . .
RUN python3 -m pip install --upgrade pip && \
python3 -m pip install venv-pack==0.2.0 && \
python3 -m pip install .
RUN mkdir /output && venv-pack -o /output/pyspark_deps.tar.gz
FROM scratch AS export
COPY --from=base /output/pyspark_deps.tar.gz /
Now you can run DOCKER_BUILDKIT=1 docker build --output . . and a pyspark_deps.tar.gz file will be generated with all your dependencies. Upload this file and your main.py script to S3.
Assuming you uploaded both files to s3://<YOUR_BUCKET>/code/pyspark/myjob/, run the EMR Serverless job like this (replacing the APPLICATION_ID, JOB_ROLE_ARN, and YOUR_BUCKET:
aws emr-serverless start-job-run \
--application-id <APPLICATION_ID> \
--execution-role-arn <JOB_ROLE_ARN> \
--job-driver '{
"sparkSubmit": {
"entryPoint": "s3://<YOUR_BUCKET>/code/pyspark/myjob/main.py",
"sparkSubmitParameters": "--conf spark.archives=s3://<YOUR_BUCKET>/code/pyspark/myjob/pyspark_deps.tar.gz#environment --conf spark.emr-serverless.driverEnv.PYSPARK_DRIVER_PYTHON=./environment/bin/python --conf spark.emr-serverless.driverEnv.PYSPARK_PYTHON=./environment/bin/python --conf spark.executorEnv.PYSPARK_PYTHON=./environment/bin/python"
}
}'
Note the additional sparkSubmitParameters that specify your dependencies and configure the driver and executor environment variables for the proper paths to python.

GraphDB Docker Container Fails to Run: adoptopenjdk/openjdk12:alpine

When using the standard DockerFile available here, GraphDB fails to start with the following output:
Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME
Looking into it, the DockerFile uses adoptopenjdk/openjdk11:alpine which was recently updated to Alpine 3.14.
If I switch to an older Docker image (or use adoptopenjdk/openjdk12:alpine) then GraphDB starts without a problem.
How can I fix this while still using the latest version of adoptopenjdk/openjdk11:alpine?
Below is the DockerFile:
FROM adoptopenjdk/openjdk11:alpine
# Build time arguments
ARG version=9.1.1
ARG edition=ee
ENV GRAPHDB_PARENT_DIR=/opt/graphdb
ENV GRAPHDB_HOME=${GRAPHDB_PARENT_DIR}/home
ENV GRAPHDB_INSTALL_DIR=${GRAPHDB_PARENT_DIR}/dist
WORKDIR /tmp
RUN apk add --no-cache bash curl util-linux procps net-tools busybox-extras wget less && \
curl -fsSL "http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip" > \
graphdb-${edition}-${version}.zip && \
bash -c 'md5sum -c - <<<"$(curl -fsSL http://maven.ontotext.com/content/groups/all-onto/com/ontotext/graphdb/graphdb-${edition}/${version}/graphdb-${edition}-${version}-dist.zip.md5) graphdb-${edition}-${version}.zip"' && \
mkdir -p ${GRAPHDB_PARENT_DIR} && \
cd ${GRAPHDB_PARENT_DIR} && \
unzip /tmp/graphdb-${edition}-${version}.zip && \
rm /tmp/graphdb-${edition}-${version}.zip && \
mv graphdb-${edition}-${version} dist && \
mkdir -p ${GRAPHDB_HOME}
ENV PATH=${GRAPHDB_INSTALL_DIR}/bin:$PATH
CMD ["-Dgraphdb.home=/opt/graphdb/home"]
ENTRYPOINT ["/opt/graphdb/dist/bin/graphdb"]
EXPOSE 7200
The issue comes from an update in the base image. From a few weeks adopt switched to alpine 3.14 which has some issues with older container runtime (runc). The issue can be seen in the release notes: https://wiki.alpinelinux.org/wiki/Release_Notes_for_Alpine_3.14.0
Updating your Docker will fix the issue. However, if you don't wish to update your Docker, there's a workaround.
Some additional info:
The cause of the issue is that for some reason containers running in older docker versions and alpine 3.14 seem to have issues with the test flag "-x" so an if [ -x /opt/java/openjdk/bin/java ] returns false, although java is there and is executable.
You can workaround this for now by
Pull the GraphDB distribution
Unzip it
Open "setvars.in.sh" in the bin folder
Find and remove the if block around line 32
if [ ! -x "$JAVA" ]; then
echo "Could not find any executable java binary. Please install java in your PATH or set JAVA_HOME"
exit 1
fi
Zip it again and provide it in the Dockerfile without pulling it from maven.ontotext.com
Passing it to the Dockerfile is done with 'ADD'
You can check the GraphDB free version's Dockerfile for a reference on how to pass the zip file to the Dockerfile https://github.com/Ontotext-AD/graphdb-docker/blob/master/free-edition/Dockerfile

hdfsBuilderConnect error while using tfserving load model from hdfs

there is my environment info:
TensorFlow Serving version 1.14
os mac10.15.7
i want to load modle from hdfs by using tfserving.
when i build a tensorflow-serving:hadoop docker image,like this:
FROM tensorflow/serving:2.2.0
RUN apt update && apt install -y openjdk-8-jre
RUN mkdir /opt/hadoop-2.8.2
COPY /hadoop-2.8.2 /opt/hadoop-2.8.2
ENV JAVA_HOME /usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_HDFS_HOME /opt/hadoop-2.8.2
ENV HADOOP_HOME /opt/hadoop-2.8.2
ENV LD_LIBRARY_PATH
${LD_LIBRARY_PATH}:${JAVA_HOME}/jre/lib/amd64/server
# ENV PATH $PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
RUN echo '#!/bin/bash \n\n\
CLASSPATH=(${HADOOP_HDFS_HOME}/bin/hadoop classpath --glob)
tensorflow_model_server --port=8500 --rest_api_port=9000 \
--model_name=${MODEL_NAME} --
model_base_path=${MODEL_BASE_PATH}/${MODEL_NAME} \
"#"' > /usr/bin/tf_serving_entrypoint.sh \
&& chmod +x /usr/bin/tf_serving_entrypoint.sh
EXPOSE 8500
EXPOSE 9000
ENTRYPOINT ["/usr/bin/tf_serving_entrypoint.sh"]
and then run :
docker run -p 9001:9000 --name tensorflow-serving-11 -e MODEL_NAME=tfrest -e MODEL_BASE_PATH=hdfs://ip:port/user/cess2_test/workspace/cess/models -t tensorflow_serving:1.14-hadoop-2.8.2
i met this problem. ps:i have already modify hadoop config in hadoop-2.8.2
hdfsBuilderConnect(forceNewInstance=0, nn=ip:port, port=0, kerbTicketCachePath=(NULL), userName=(NULL))
error:(unable to get stack trace for java.lang.NoClassDefFoundError exception: ExceptionUtils::getStackTrace error.)
is there any suggestions how to solve this problem?
thanks
i solved this problem by adding hadoop absolute path to classpath.

Building Apache Impala fails

I was trying to build Apache Impala from source(newest version on github).
I followed following instructions to build Impala:
(1) clone Impala
> git clone https://git-wip-us.apache.org/repos/asf/incubator-impala.git
> cd Impala
(2) configure environmental variables
> export JAVA_HOME=/usr/lib/jvm/java-7-oracle-amd64
> export IMPALA_HOME=<path to Impala>
> export BOOST_LIBRARYDIR=/usr/lib/x86_64-linux-gnu
> export LC_ALL="en_US.UTF-8"
(3)build
${IMPALA_HOME}/buildall.sh -noclean -skiptests -build_shared_libs -format
(4) errors are shown below:
Heap is needed to find the cause. Looks like the compiler does not support the GLIBCXX_3.4.21. But the GCC is automatically downloaded by the building script.
Appreciate your help!!!
Starting from this commit https://github.com/apache/impala/commit/d5cefe07c931a0d3bf02bca97bbba05400d91a48 , Impala has been shipped with a development bootstrap script.
I tried the master branch in a fresh ubuntu 16.04 docker image and it works fine. Here is what I did.
checkout the latest impala code base and do
docker run --rm -it --privileged -v /home/amos/git/impala/:/root/Impala ubuntu:16.04
inside docker, do
apt-get update
apt-get install sudo
cd /root/Impala
comment this out in bin/bootstrap_system.sh if you don't need test data
# if ! [[ -d ~/Impala-lzo ]]
# then
# git clone https://github.com/cloudera/impala-lzo.git ~/Impala-lzo
# fi
# if ! [[ -d ~/hadoop-lzo ]]
# then
# git clone https://github.com/cloudera/hadoop-lzo.git ~/hadoop-lzo
# fi
# cd ~/hadoop-lzo/
# time -p ant package
also add this line before ssh localhost whoami
echo "source ${IMPALA_HOME}/bin/impala-config-local.sh" >> ~/.bashrc
change the build command to whatever you like in bin/bootstrap_development.sh
${IMPALA_HOME}/buildall.sh -noclean -skiptests -build_shared_libs -format
then run bin/bootstrap_development.sh
You'll be prompted for some input. Just fill in default value and it'll work.

Apache Tomcat 8 not starting within a docker container

I am experimenting with Docker and am very new to it. I am struck at a point for a long time and am not getting a way through and hence came up with this question here...
Problem Statement:
I am trying to create an image from a docker file containing Apache and lynx installation. Once done I am trying to access tomcat on 8080 of the container which is in turn forwarded to the 8082 of the host. But when running the image I never get tomcat started in the container.
The Docker file
FROM ubuntu:16.10
#Install Lynx
Run apt-get update
Run apt-get install -y lynx
#Install Curl
Run apt-get install -y curl
#Install tools: jdk
Run apt-get update
Run apt-get install -y openjdk-8-jdk wget
#Install apache tomcat
Run groupadd tomcat
Run useradd -s /bin/false -g tomcat -d /opt/tomcat tomcat
Run cd /tmp
Run curl -O http://apache.mirrors.ionfish.org/tomcat/tomcat- 8/v8.5.12/bin/apache-tomcat-8.5.12.tar.gz
Run mkdir /opt/tomcat
Run tar xzvf apache-tomcat-8*tar.gz -C /opt/tomcat --strip-components=1
Run cd /opt/tomcat
Run chgrp -R tomcat /opt/tomcat
Run chmod -R g+r /opt/tomcat/conf
Run chmod g+x /opt/tomcat/conf
Run chown -R tomcat /opt/tomcat/webapps /opt/tomcat/work /opt/tomcat/temp opt/tomcat/logs
Run cd /opt/tomcat/bin
Expose 8080
CMD /opt/tomcat/bin/catalina.sh run && tail -f /opt/tomcat/logs/catalina.out
When the image is built I tried running the container by the two below methods
docker run -d -p 8082:8080 imageid tail -f /dev/null
While using the above, container is running but tomcat is not started inside the container and hence not accessible from localhost:8082. Also I do not see anything if I perform docker logs longcontainerid
docker run -d -p 8082:8080 imageid /path/to/catalina.sh start tail -f /dev/null
I see tomcat started when I do docker logs longconatainrid
While using the above the container is started and stopped immediately and is not running as I can see from docker ps and hence again not accessible from localhost:8082.
Can anyone please tell me where I am going wrong?
P.s. I searched a lot on the internet but could not get the thing right. Might be there is some concept that i am not getting clearly.
Looking at the docker run command documentation, the doc states that any command passed to the run will override the original CMD in your Dockerfile:
As the operator (the person running a container from the image), you can override that CMD instruction just by specifying a new COMMAND
1/ Then when you run:
docker run -d -p 8082:8080 imageid tail -f /dev/null
The container is run with COMMAND tail -f /dev/null, the original command starting tomcat is overridden.
To resolve your problem, try to run:
docker run -d -p 8082:8080 imageid
and
docker log -f containerId
To see if tomcat is correctly started.
2/ You should not use the start argument with catalina.sh. Have a look at this official tomcat Dokerfile, the team uses :
CMD ["catalina.sh", "run"]
to start tomcat (when you use start, docker ends container at the end of the shell script and tomcat will start but not maintain a running process).
3/ Finally, why don't you use tomcat official image to build your container? You could just use the :
FROM tomcat:latest
directive at the beginning of your Dockerfile, and add you required elements (new files, webapps war, settings) to the docker image.