mrjob fails when running on Hadoop with the lxml library

I'm working on a project that uses Hadoop MapReduce. My project tree is shown below:
MyProject
├── parse_xml_file.py
├── store_xml_directory
│   └── my_xml_file.xml
├── requirements.txt
├── input_to_hadoop.txt
└── testMrjob.py
It runs without error locally with this command:
python testMrjob.py < input_to_hadoop.txt > output
But when I run it on Hadoop with one of the following commands (all nodes have the lxml library installed):
python testMrjob.py -r hadoop --file parse_xml_file.py < input_to_hadoop.txt
Or
python testMrjob.py -r hadoop --file parse_xml_file.py --file store_xml_directory/my_xml_file.xml < input_to_hadoop.txt > output
I get this error:
no configs found; falling back on auto-configuration
creating tmp directory /tmp/testMrjob.haduser.20141018.152349.482573
Uploading input to hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input
reading from STDIN
Copying non-input files into hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/
Using Hadoop version 1.2.1
HADOOP: Loaded the native-hadoop library
HADOOP: Snappy native library not loaded
HADOOP: Total input paths to process : 1
HADOOP: getLocalDirs(): [/opt/hadoop/dfs/mapred/local]
HADOOP: Running job: job_201410182107_0012
HADOOP: To kill this job, run:
HADOOP: /opt/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201410182107_0012
HADOOP: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201410182107_0012
HADOOP: map 0% reduce 0%
HADOOP: map 100% reduce 100%
HADOOP: To kill this job, run:
HADOOP: /opt/hadoop/libexec/../bin/hadoop job -Dmapred.job.tracker=master:54311 -kill job_201410182107_0012
HADOOP: Tracking URL: http://master:50030/jobdetails.jsp?jobid=job_201410182107_0012
HADOOP: Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201410182107_0012_m_000000
HADOOP: killJob...
HADOOP: Streaming Command Failed!
STDOUT: packageJobJar: [/opt/hadoop/tmp/hadoop-unjar9122722052766576889/] [] /tmp/streamjob2542718124608434574.jar tmpDir=null
Job failed with return code 1: ['/opt/hadoop/bin/hadoop', 'jar', '/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar', '-files', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/testMrjob.py#testMrjob.py,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/requirements.txt#requirements.txt,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/parse_xml_file.py#parse_xml_file.py', '-archives', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input', '-output', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/output', '-mapper', 'python testMrjob.py --step-num=0 --mapper', '-reducer', 'python testMrjob.py --step-num=0 --reducer']
Scanning logs for probable cause of failure
Traceback (most recent call last):
File "testMrjob.py", line 25, in <module>
MRWordFrequencyCount.run()
File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 516, in run
mr_job.execute()
File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 532, in execute
self.run_job()
File "/usr/lib/python2.7/dist-packages/mrjob/job.py", line 602, in run_job
runner.run()
File "/usr/lib/python2.7/dist-packages/mrjob/runner.py", line 516, in run
self._run()
File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 239, in _run
self._run_job_in_hadoop()
File "/usr/lib/python2.7/dist-packages/mrjob/hadoop.py", line 442, in _run_job_in_hadoop
raise Exception(msg)
Exception: Job failed with return code 1: ['/opt/hadoop/bin/hadoop', 'jar', '/opt/hadoop/contrib/streaming/hadoop-streaming-1.2.1.jar', '-files', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/testMrjob.py#testMrjob.py,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/requirements.txt#requirements.txt,hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/parse_xml_file.py#parse_xml_file.py', '-archives', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/files/mrjob.tar.gz#mrjob.tar.gz', '-cmdenv', 'PYTHONPATH=mrjob.tar.gz', '-input', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/input', '-output', 'hdfs:///user/haduser/tmp/mrjob/testMrjob.haduser.20141018.152349.482573/output', '-mapper', 'python testMrjob.py --step-num=0 --mapper', '-reducer', 'python testMrjob.py --step-num=0 --reducer']

To distribute Python modules with mrjob, use --python-archive rather than --file.
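A minimal sketch of how such an archive could be built, assuming --python-archive accepts a gzipped tarball whose root ends up on PYTHONPATH (worth checking against the mrjob version in use, since newer releases may handle this differently). The helper names build_python_archive and archive_members are hypothetical, and the demo uses a stand-in file rather than the real parse_xml_file.py.

```python
import os
import tarfile
import tempfile


def build_python_archive(module_paths, archive_path):
    """Pack the given module files at the root of a gzipped tarball."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in module_paths:
            # Store each file under its bare name so it is importable
            # once the archive root is placed on PYTHONPATH.
            tar.add(path, arcname=os.path.basename(path))
    return archive_path


def archive_members(archive_path):
    """Return the sorted member names of a gzipped tarball."""
    with tarfile.open(archive_path, "r:gz") as tar:
        return sorted(tar.getnames())


def demo():
    """Build an archive from a stand-in module in a temporary directory."""
    with tempfile.TemporaryDirectory() as workdir:
        module = os.path.join(workdir, "parse_xml_file.py")
        with open(module, "w") as handle:
            handle.write("# stand-in for the real parser module\n")
        archive = build_python_archive(
            [module], os.path.join(workdir, "parse_xml_file.tar.gz"))
        return archive_members(archive)


if __name__ == "__main__":
    print(demo())
```

With a real archive in place, the job could presumably be launched as python testMrjob.py -r hadoop --python-archive parse_xml_file.tar.gz < input_to_hadoop.txt.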

Related

"No CMAKE_CXX_COMPILER could be found" error while deploying Flask app on gcloud

I have a Flask application that I'm deploying on Google Cloud Run. The app uses the 'face_recognition' library, which requires CMake. I install CMake with a command in the Dockerfile, but the build fails with an error that I don't understand.
Here is my Dockerfile:
# Use the official lightweight Python image.
# https://hub.docker.com/_/python
FROM python:3.9-slim
# Allow statements and log messages to immediately appear in the Knative logs
ENV PYTHONUNBUFFERED True
# Copy local code to the container image.
ENV APP_HOME /app
WORKDIR $APP_HOME
COPY . ./
# Install production dependencies.
RUN apt-get update && apt-get install -y cmake
RUN pip install --no-cache-dir -r requirements.txt
RUN pip install gunicorn
# Run the web service on container startup. Here we use the gunicorn
# webserver, with one worker process and 8 threads.
# For environments with multiple CPU cores, increase the number of workers
# to be equal to the cores available.
# Timeout is set to 0 to disable the timeouts of the workers to allow Cloud Run to handle instance scaling.
CMD exec gunicorn --bind :$PORT --workers 1 --threads 8 --timeout 0 main:app
Here is the error
CMake Error at CMakeLists.txt:14 (project):
No CMAKE_CXX_COMPILER could be found.
Tell CMake where to find the compiler by setting either the environment
variable "CXX" or the CMake cache entry CMAKE_CXX_COMPILER to the full path
to the compiler, or to the compiler name if it is in the PATH.
-- Configuring incomplete, errors occurred!
See also "/tmp/pip-install-2m1peq73/dlib_d6f82528b68745578021b2f234f89d7c/build/temp.linux-x86_64-3.9/CMakeFiles/CMakeOutput.log".
See also "/tmp/pip-install-2m1peq73/dlib_d6f82528b68745578021b2f234f89d7c/build/temp.linux-x86_64-3.9/CMakeFiles/CMakeError.log".
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-2m1peq73/dlib_d6f82528b68745578021b2f234f89d7c/setup.py", line 222, in <module>
setup(
File "/usr/local/lib/python3.9/site-packages/setuptools/__init__.py", line 153, in setup
return distutils.core.setup(**attrs)
File "/usr/local/lib/python3.9/distutils/core.py", line 148, in setup
dist.run_commands()
File "/usr/local/lib/python3.9/distutils/dist.py", line 966, in run_commands
self.run_command(cmd)
File "/usr/local/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.9/site-packages/setuptools/command/install.py", line 61, in run
return orig.install.run(self)
File "/usr/local/lib/python3.9/distutils/command/install.py", line 546, in run
self.run_command('build')
File "/usr/local/lib/python3.9/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/usr/local/lib/python3.9/distutils/command/build.py", line 135, in run
self.run_command(cmd_name)
File "/usr/local/lib/python3.9/distutils/cmd.py", line 313, in run_command
self.distribution.run_command(command)
File "/usr/local/lib/python3.9/distutils/dist.py", line 985, in run_command
cmd_obj.run()
File "/tmp/pip-install-2m1peq73/dlib_d6f82528b68745578021b2f234f89d7c/setup.py", line 134, in run
self.build_extension(ext)
File "/tmp/pip-install-2m1peq73/dlib_d6f82528b68745578021b2f234f89d7c/setup.py", line 171, in build_extension
subprocess.check_call(cmake_setup, cwd=build_folder)
File "/usr/local/lib/python3.9/subprocess.py", line 373, in check_call
raise CalledProcessError(retcode, cmd)
The python:3.9-slim base image is very stripped down. If your application requires CMake, which usually means it also needs a compiler such as gcc, you have at least two options:
Use a more feature-rich base image such as debian:buster.
Choose an image with those tools already configured.
Example Dockerfile to build a base image:
FROM debian:buster
RUN apt update && apt install -y gcc clang clang-tools cmake python3
You can then use that image as the base for future images, or modify the Dockerfile to include your application.
Docker debian:buster
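The error in the question says that CMake found no C++ compiler. As a quick diagnostic inside the image, a sketch like the following could check for one before pip tries to build dlib. The candidate names (c++, g++, clang++) and the helper find_cxx_compiler are assumptions for illustration, not the exact search that CMake performs.

```python
import os
import shutil


def find_cxx_compiler(candidates=("c++", "g++", "clang++"), env=None):
    """Return the path of the first C++ compiler found on PATH, or None."""
    env = os.environ if env is None else env
    names = []
    # An explicit CXX variable is tried first, following the hint in the
    # CMake error message.
    if env.get("CXX"):
        names.append(env["CXX"])
    names.extend(candidates)
    for name in names:
        path = shutil.which(name)
        if path:
            return path
    return None


if __name__ == "__main__":
    compiler = find_cxx_compiler()
    if compiler:
        print("C++ compiler found:", compiler)
    else:
        print("No C++ compiler on PATH; CMake would likely fail to configure.")
```

If this reports no compiler, installing one in the Dockerfile before the pip step is the likely fix.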

EMR JupyterHub: S3 persistence of notebooks not working

I am trying to set up an EMR cluster with JupyterHub and S3 persistence. I have the following classification:
{
"Classification": "jupyter-s3-conf",
"Properties": {
"s3.persistence.enabled": "true",
"s3.persistence.bucket": "my-persistence-bucket"
}
}
I am installing dask with the following step (otherwise, opening the notebook would result in a 500 error):
command-runner.jar
Arguments: /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install dask
However, when I then open a new notebook, it is not persisted and the bucket stays empty. The cluster does have access to S3: a Spark job running with the same configuration can read from and write to the same bucket.
In the Jupyter log on the master node, I see this:
[E 2019-08-07 12:27:14.609 SingleUserNotebookApp application:574] Exception while loading config file /etc/jupyter/jupyter_notebook_config.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 562, in _load_config_files
config = loader.load_config()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 457, in load_config
self._read_file_as_dict()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
py3compat.execfile(conf_filename, namespace)
File "/opt/conda/lib/python3.6/site-packages/ipython_genutils/py3compat.py", line 198, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "/etc/jupyter/jupyter_notebook_config.py", line 5, in <module>
from s3contents import S3ContentsManager
File "/opt/conda/lib/python3.6/site-packages/s3contents/__init__.py", line 15, in <module>
from .gcsmanager import GCSContentsManager
File "/opt/conda/lib/python3.6/site-packages/s3contents/gcsmanager.py", line 8, in <module>
from s3contents.gcs_fs import GCSFS
File "/opt/conda/lib/python3.6/site-packages/s3contents/gcs_fs.py", line 3, in <module>
import gcsfs
File "/opt/conda/lib/python3.6/site-packages/gcsfs/__init__.py", line 4, in <module>
from .dask_link import register as register_dask
File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 56, in <module>
register()
File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 51, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
What am I missing and what is going wrong?
It turned out to be a chain reaction: upgrading and installing custom packages broke compatibility. I install additional packages in my cluster with command-runner, and I had some issues there: I could only run one conda install command, and the second one failed with no module named 'conda'.
So I first updated Anaconda by running /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda with command-runner. After that, jinja2 could not find markupsafe. Installing markupsafe pulled jupyterhub up to 1.0.0, which broke even more things.
So here is how I got it to work (executed in order with command-runner.jar):
1. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda
   updates Anaconda.
2. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install --freeze-installed markupsafe
   installs markupsafe, which is needed after step 1.
3. Install the desired additional packages into the container, always with the --freeze-installed option to avoid breaking anything installed by EMR.
4. A custom bootstrap action that runs a script from S3 installs the packages from step 3 with pip-3.6 as well, so they work for PySpark (for this to work, they have to be installed directly on all nodes).
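The command-runner steps above can be sketched as step definitions in the dictionary shape that the EMR API uses for job flow steps. This only builds the data and does not call AWS. The function name conda_install_step, the step names, and the choice of "CONTINUE" for ActionOnFailure are assumptions for illustration.

```python
def conda_install_step(package, freeze_installed=True):
    """Build an EMR step that installs one conda package in the jupyterhub container."""
    args = [
        "/usr/bin/sudo", "/usr/bin/docker", "exec", "jupyterhub",
        "conda", "install",
    ]
    if freeze_installed:
        # Avoid changing packages that EMR has already installed.
        args.append("--freeze-installed")
    args.append(package)
    return {
        "Name": "conda install " + package,   # hypothetical step name
        "ActionOnFailure": "CONTINUE",        # assumed failure policy
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": args,
        },
    }


if __name__ == "__main__":
    # Steps are kept in a list because the order matters.
    steps = [conda_install_step(p) for p in ("markupsafe", "dask")]
    for step in steps:
        print(" ".join(step["HadoopJarStep"]["Args"]))
```

A list built this way could be passed to whatever tool submits the steps, one package per step, in the order described above.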

Tensorflow build error : Cannot find cudnn.h under ~

I am trying to build tensorflow r1.12 using bazel 0.15 on Redhat 7.5 ppc64le.
I am stuck with the following error.
[u0017649@sys-97184 tensorflow]$ bazel build --config=opt --config=cuda //tensorflow/tools/pip_package:build_pip_package
...
ERROR: error loading package 'tensorflow/tools/pip_package': Encountered error while reading extension file 'cuda/build_defs.bzl': no such package
'@local_config_cuda//cuda': Traceback (most recent call last):
File
"/home/u0017649/files/tensorflow/third_party/gpus/cuda_configure.bzl", line 1447
_create_local_cuda_repository(repository_ctx)
File
"/home/u0017649/files/tensorflow/third_party/gpus/cuda_configure.bzl", line 1187, in _create_local_cuda_repository
_get_cuda_config(repository_ctx)
File
"/home/u0017649/files/tensorflow/third_party/gpus/cuda_configure.bzl", line 911, in _get_cuda_config
_cudnn_version(repository_ctx, cudnn_install_base..., ...)
File
"/home/u0017649/files/tensorflow/third_party/gpus/cuda_configure.bzl", line 582, in _cudnn_version
_find_cudnn_header_dir(repository_ctx, cudnn_install_base...)
File
"/home/u0017649/files/tensorflow/third_party/gpus/cuda_configure.bzl", line 869, in _find_cudnn_header_dir
auto_configure_fail(("Cannot find cudnn.h under %s" ...))
File
"/home/u0017649/files/tensorflow/third_party/gpus/cuda_configure.bzl", line 317, in auto_configure_fail
fail(("\n%sCuda Configuration Error:%...)))
Cuda Configuration Error: Cannot find cudnn.h under /usr/local/cuda-9.2/targets/ppc64le-linux/lib
I do have a soft link for cudnn.h under /usr/local/cuda-9.2/targets/ppc64le-linux/lib as below.
[u0017649@sys-97184 tensorflow]$ ls -l /usr/local/cuda-9.2/targets/ppc64le-linux/lib/cudnn.h
lrwxrwxrwx. 1 root root 57 Feb 20 10:15 /usr/local/cuda-9.2/targets/ppc64le-linux/lib/cudnn.h -> /usr/local/cuda-9.2/targets/ppc64le-linux/include/cudnn.h
Any suggestions?
After reading tensorflow/third_party/gpus/cuda_configure.bzl, I solved this by creating the following symlink:
$ sudo ln -sf /usr/local/cuda-9.2/targets/ppc64le-linux/include/cudnn.h /usr/include/cudnn.h
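To see which copies of the header exist and where any symlinks point, a small check along these lines may help. It is only a sketch: locate_header is a hypothetical helper, and the directory list simply repeats the paths mentioned in this thread rather than the actual search order used by cuda_configure.bzl.

```python
import os
import tempfile


def locate_header(header, search_dirs):
    """Return (path, resolved_target) for each directory that contains header."""
    hits = []
    for directory in search_dirs:
        candidate = os.path.join(directory, header)
        # os.path.exists follows symlinks, so broken links are skipped.
        if os.path.exists(candidate):
            hits.append((candidate, os.path.realpath(candidate)))
    return hits


if __name__ == "__main__":
    dirs = [
        "/usr/include",
        "/usr/local/cuda-9.2/targets/ppc64le-linux/include",
        "/usr/local/cuda-9.2/targets/ppc64le-linux/lib",
    ]
    for path, target in locate_header("cudnn.h", dirs):
        print(path, "->", target)
```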

Unable to setup WebRTC native android code on ubuntu instance

I followed the steps provided on this page: https://webrtc.org/native-code/android/
When I ran the command "ninja -C out/Debug AppRTCMobile", I got the following response:
ninja: Entering directory `out/Debug'
ninja: fatal: chdir to 'out/Debug' - No such file or directory
I am stuck here and need help to continue with the next steps of the code setup.
Ubuntu version: 16.04.2
I followed the same procedure once again to set up the code. This time I got a new error.
Steps:
-> fetch --nohooks webrtc_android
-> gclient sync
-> gn gen out/Debug --args='target_os="android" target_cpu="arm"'
-> ninja -C out/Debug
ninja: Entering directory `out/Debug'
[1/8508] ACTION
//base:android_runtime_jni_headers__jni_Runtime(//build/toolchain/android:android_clang_arm)
FAILED: gen/android_runtime_jni_headers/base/jni/Runtime_jni.h python
../../base/android/jni_generator/jni_generator.py --depfile
gen/base/android_runtime_jni_headers__jni_Runtime.d --jar_file
../../third_party/android_tools/sdk/platforms/android-26/android.jar
--input_file java/lang/Runtime.class --ptr_type=long --output_dir gen/android_runtime_jni_headers/base/jni --includes
../../../../../../base/android/jni_generator/jni_generator_helper.h
--native_exports_optional
Traceback (most recent call last):
File "../../base/android/jni_generator/jni_generator.py", line 1428, in <module>
sys.exit(main(sys.argv))
File "../../base/android/jni_generator/jni_generator.py", line 1421, in main
GenerateJNIHeader(input_file, output_file, options)
File "../../base/android/jni_generator/jni_generator.py", line 1326, in GenerateJNIHeader
jni_from_javap = JNIFromJavaP.CreateFromClass(input_file, options)
File "../../base/android/jni_generator/jni_generator.py", line 662, in CreateFromClass
stderr=subprocess.PIPE)
File "/usr/lib/python2.7/subprocess.py", line 710, in __init__
errread, errwrite)
File "/usr/lib/python2.7/subprocess.py", line 1327, in _execute_child
raise child_exception
OSError: [Errno 2] No such file or directory
[3/8508] CC obj/third_party/boringssl/boringssl/v3_ncons.o
ninja: build stopped: subcommand failed.
Could someone please help me resolve this issue?
You have to generate the build files with GN before trying to compile. The instructions are here:
https://webrtc.org/native-code/android/
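A rough way to tell whether GN has already been run for an output directory is to look for the build.ninja file that gn gen writes there. The helper below is a sketch under that assumption; needs_gn_gen is a hypothetical name, and out/Debug is the directory used in the question.

```python
import os
import tempfile


def needs_gn_gen(out_dir):
    """Return True if out_dir has no build.ninja, i.e. gn gen has not been run."""
    return not os.path.isfile(os.path.join(out_dir, "build.ninja"))


if __name__ == "__main__":
    out_dir = "out/Debug"
    if needs_gn_gen(out_dir):
        print("Run gn gen for", out_dir, "before calling ninja.")
    else:
        print(out_dir, "is ready for ninja.")
```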

Buildroot - installing of the "pytest-runner" package fails

Buildroot version: 2017-02
I need to integrate the Python package 'chardet' into my build. chardet requires the 'pytest-runner' package.
Case 1
'chardet' is selected for integration into the build. 'pytest-runner' is not pre-fetched, and the 'chardet' package was updated to the latest version (3.0.3) with the scanpypi script before the build. When make is run, the following error message is shown, indicating a problem with 'pytest-runner':
>>> python-chardet 3.0.3 Building
(cd /home/nnnn/bldr_lab/buildroot/output/build/python-chardet-3.0.3//;
PATH="/home/nnnn/bldr_lab/buildroot/output/host/bin:/home/nnnn/bldr_lab
/buildroot/output/host/sbin:/home/nnnn/bldr_lab/buildroot/output/host/usr
/bin:/home/nnnn/bldr_lab/buildroot/output/host/usr/sbin:/home/nnnn/x-tools
/arm-cortex_a8-linux-gnueabihf/bin:/home/nnnn/bin:/home/nnnn/.local
/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr
/games:/usr/local/games" PYTHONPATH="/home/nnnn/bldr_lab/buildroot/output
/target/usr/lib/python2.7/sysconfigdata/:/home/nnnn/bldr_lab/buildroot
/output/target/usr/lib/python2.7/site-packages/" _python_sysroot=/home
/nnnn/bldr_lab/buildroot/output/host/usr/arm-buildroot-linux-gnueabihf
/sysroot _python_prefix=/usr _python_exec_prefix=/usr /home/nnnn/bldr_lab
/buildroot/output/host/usr/bin/python setup.py build )
Download error on https://pypi.python.org/simple/pytest-runner/: unknown
url type: https -- Some packages may not be found!
Couldn't find index page for 'pytest-runner' (maybe misspelled?)
Download error on https://pypi.python.org/simple/: unknown url type: https
-- Some packages may not be found!
No local packages or download links found for pytest-runner
Traceback (most recent call last):
File "setup.py", line 52, in <module>
['chardetect = chardet.cli.chardetect:main']})
File "/home/nnnn/bldr_lab/buildroot/output/host/usr/lib/python2.7
/distutils/core.py", line 111, in setup
_setup_distribution = dist = klass(attrs)
File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 268, in
__init__
File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 313, in
fetch_build_eggs
File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 846, in resolve
File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1091, in best_match
File "build/bdist.linux-x86_64/egg/pkg_resources/__init__.py", line 1103, in obtain
File "build/bdist.linux-x86_64/egg/setuptools/dist.py", line 380, in fetch_build_egg
File "build/bdist.linux-x86_64/egg/setuptools/command/easy_install.py", line 633, in easy_install
distutils.errors.DistutilsError: Could not find suitable distribution for Requirement.parse('pytest-runner')
package/pkg-generic.mk:216: recipe for target '/home/nnnn/bldr_lab
/buildroot/output/build/python-chardet-3.0.3/.stamp_built' failed
make[1]: *** [/home/nnnn/bldr_lab/buildroot/output/build/python-chardet-
3.0.3/.stamp_built] Error 1
Makefile:79: recipe for target '_all' failed
make: *** [_all] Error 2
Case 2
When I try to create the 'pytest-runner' package with scanpypi, the following error message is shown:
nnnn@xxxx:~/bldr2/buildroot$ ./support/scripts/scanpypi pytest-runner -o package
buildroot package name for pytest-runner: python-pytest-runner
Package: python-pytest-runner
Fetching package pytest-runner
Downloading package pytest-runner from https://pypi.python.org/packages/9e/4d/08889e5e27a9f5d6096b9ad257f4dea1faabb03c5ded8f665ead448f5d8a/pytest-runner-2.11.1.tar.gz...
Error: Could not install package pytest-runner
When I download 'pytest-runner-2.11.1.tar.gz' from PyPI manually, it looks like a normal PyPI tarball. Any idea what the root cause is and how to solve the problem?
-timo-