EMR JupyterHub: S3 persistence of notebooks not working

I am trying to set up an EMR cluster with JupyterHub and S3 persistence. I have the following classification:
{
"Classification": "jupyter-s3-conf",
"Properties": {
"s3.persistence.enabled": "true",
"s3.persistence.bucket": "my-persistence-bucket"
}
}
I am installing dask with the following step (otherwise, opening the notebook would result in a 500 error):
command-runner.jar
Arguments: /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install dask
However, when I then open a new notebook, it is not persisted; the bucket stays empty. The cluster does have access to S3: a Spark job running with the same configuration can read from and write to the same bucket.
Looking into the Jupyter log on the master node, I see this:
[E 2019-08-07 12:27:14.609 SingleUserNotebookApp application:574] Exception while loading config file /etc/jupyter/jupyter_notebook_config.py
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/application.py", line 562, in _load_config_files
config = loader.load_config()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 457, in load_config
self._read_file_as_dict()
File "/opt/conda/lib/python3.6/site-packages/traitlets/config/loader.py", line 489, in _read_file_as_dict
py3compat.execfile(conf_filename, namespace)
File "/opt/conda/lib/python3.6/site-packages/ipython_genutils/py3compat.py", line 198, in execfile
exec(compiler(f.read(), fname, 'exec'), glob, loc)
File "/etc/jupyter/jupyter_notebook_config.py", line 5, in <module>
from s3contents import S3ContentsManager
File "/opt/conda/lib/python3.6/site-packages/s3contents/__init__.py", line 15, in <module>
from .gcsmanager import GCSContentsManager
File "/opt/conda/lib/python3.6/site-packages/s3contents/gcsmanager.py", line 8, in <module>
from s3contents.gcs_fs import GCSFS
File "/opt/conda/lib/python3.6/site-packages/s3contents/gcs_fs.py", line 3, in <module>
import gcsfs
File "/opt/conda/lib/python3.6/site-packages/gcsfs/__init__.py", line 4, in <module>
from .dask_link import register as register_dask
File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 56, in <module>
register()
File "/opt/conda/lib/python3.6/site-packages/gcsfs/dask_link.py", line 51, in register
dask.bytes.core._filesystems['gcs'] = DaskGCSFileSystem
AttributeError: module 'dask.bytes.core' has no attribute '_filesystems'
What am I missing and what is going wrong?
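A quick way to narrow down a failure like the one in the traceback is to try the imports one by one inside the same Python environment. The following is only a minimal sketch: the helper name check_imports is hypothetical, and the module names are simply the ones that appear in the traceback.

```python
import importlib


def check_imports(module_names):
    """Try to import each module; map its name to None (ok) or a short error string."""
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:
            # Catch everything: the failure in the log is an AttributeError
            # raised during import, not an ImportError.
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Module names taken from the traceback in the Jupyter log.
    for name, error in check_imports(["dask", "gcsfs", "s3contents"]).items():
        print(f"{name}: {error or 'ok'}")
```

Run inside the jupyterhub container, this shows which package in the import chain fails first without having to open a notebook.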

It turned out to be a chain reaction: upgrading and installing custom packages broke compatibility. I install additional packages in my cluster with command-runner, and I had an issue there: I could only run one conda install command; the second one failed with no module named 'conda'.
So I updated conda first by running /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda with command-runner. This caused jinja2 to no longer find markupsafe. Installing markupsafe pulled jupyterhub up to 1.0.0, which broke even more things.
So here is how I got it to work (executed in order with command-runner.jar):
1. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda update -n base conda
   updates conda.
2. /usr/bin/sudo /usr/bin/docker exec jupyterhub conda install --freeze-installed markupsafe
   installs markupsafe, which is needed after step 1.
3. Install the desired additional packages into the container, always with the --freeze-installed option, to avoid breaking anything installed by EMR.
4. A custom bootstrap action that runs a script from S3 also installs the packages from step 3 with pip-3.6, so that they work for PySpark (for this to work, they have to be installed directly on all nodes).
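The command-runner steps above can also be generated programmatically. This is a rough sketch rather than the exact setup used here: the helper names conda_step and build_steps are made up, "ActionOnFailure": "CONTINUE" is an assumption, and the dictionaries follow the step shape that boto3's add_job_flow_steps accepts (nothing is sent to AWS in this snippet).

```python
# Prefix used by every command in the answer above.
DOCKER_EXEC = ["/usr/bin/sudo", "/usr/bin/docker", "exec", "jupyterhub"]


def conda_step(name, conda_args):
    """Build one EMR step that runs a conda command inside the jupyterhub container."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # assumption: keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": DOCKER_EXEC + ["conda"] + list(conda_args),
        },
    }


def build_steps(extra_packages):
    """Return the steps in the order described: update conda, markupsafe, then extras."""
    steps = [
        conda_step("update conda", ["update", "-n", "base", "conda"]),
        conda_step("install markupsafe", ["install", "--freeze-installed", "markupsafe"]),
    ]
    for package in extra_packages:
        steps.append(
            conda_step(f"install {package}", ["install", "--freeze-installed", package])
        )
    return steps


if __name__ == "__main__":
    for step in build_steps(["dask"]):
        print(step["Name"], "->", " ".join(step["HadoopJarStep"]["Args"]))
```

Keeping --freeze-installed in one helper makes it harder to forget the flag on a later package.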

Related

pipenv installed pip does not work with specified python version

On a Raspberry Pi OS Bullseye system, I tried to install numpy with pipenv using a specific python version and got this:
$ pipenv --python /opt/python/3.7/bin/python3 install numpy --verbose
Creating a virtualenv for this project…
Using /opt/python/3.7/bin/python3 (3.7.9) to create virtualenv…
⠋created virtual environment CPython3.7.9.final.0-32 in 410ms
creator CPython3Posix(dest=/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC, clear=False, no_vcs_ignore=False, global=False)
seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/home/pi/.local/share/virtualenv)
added seed packages: pip==20.3.4, pkg_resources==0.0.0, setuptools==44.1.1, wheel==0.34.2
activators BashActivator,CShellActivator,FishActivator,PowerShellActivator,PythonActivator,XonshActivator
Virtualenv location: /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC
Installing numpy…
⠙Installing 'numpy'
$ "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip" install --verbose "numpy" -i https://pypi.org/simple --exists-action w
⠙
Error: An error occurred while installing numpy!
Traceback (most recent call last):
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip", line 5, in <module>
from pip._internal.cli.main import main
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/main.py", line 10, in <module>
from pip._internal.cli.autocompletion import autocomplete
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/autocompletion.py", line 9, in <module>
from pip._internal.cli.main_parser import create_main_parser
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/main_parser.py", line 7, in <module>
from pip._internal.cli import cmdoptions
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/cmdoptions.py", line 23, in <module>
from pip._vendor.packaging.utils import canonicalize_name
ModuleNotFoundError: No module named 'pip._vendor.packaging'
Looking at the verbose output, I see that the path to pip used by pipenv is /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip.
Calling this pip directly indeed leads to the same error:
$ /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip --version
Traceback (most recent call last):
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip", line 5, in <module>
from pip._internal.cli.main import main
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/main.py", line 10, in <module>
from pip._internal.cli.autocompletion import autocomplete
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/autocompletion.py", line 9, in <module>
from pip._internal.cli.main_parser import create_main_parser
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/main_parser.py", line 7, in <module>
from pip._internal.cli import cmdoptions
File "/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip/_internal/cli/cmdoptions.py", line 23, in <module>
from pip._vendor.packaging.utils import canonicalize_name
ModuleNotFoundError: No module named 'pip._vendor.packaging'
Which python is used in that case? Looking at the shebang line it would seem it's the one I passed to pipenv initially:
$ head -n 1 /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip
#!/home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/python
$ ls -l /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/python
lrwxrwxrwx 1 pi pi 27 Dec 11 11:00 /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/python -> /opt/python/3.7/bin/python3
But when I explicitly use that exact interpreter there is no error:
$ /opt/python/3.7/bin/python3 /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/bin/pip --version
pip 20.1.1 from /opt/python/3.7/lib/python3.7/site-packages/pip (python 3.7)
The difference seems to be that in the case it goes wrong, the pip installation in /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip is used while in the working case it's the one in /opt/python/3.7/lib/python3.7/site-packages/pip.
But why? My understanding of the shebang is that it points to the interpreter that is to be used. In the working example, all I do is call that interpreter explicitly myself. Why is there a difference in behaviour?
And also, why did pipenv even install its own pip in /home/pi/.local/share/virtualenvs/deep-dregs-eaJke9eC/lib/python3.7/site-packages/pip? Why didn't it reuse the pip that comes with the Python version I passed? And if that's just how pipenv works, why is its pip broken? What's going on, and how can I fix it?
EDIT
When I use my system Python 3.9 installation, it works fine.
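One way to investigate the difference between the two invocations is to print what each interpreter considers its prefix and where it would import pip from. As far as I understand, an interpreter started through a virtualenv's bin/python path picks up the virtualenv's prefix (it looks for pyvenv.cfg relative to the path it was started from), while starting the symlink target directly uses the base installation. The script below is a small diagnostic sketch; describe_interpreter is a hypothetical helper name.

```python
import importlib.util
import sys


def describe_interpreter():
    """Report which prefix is active and where `pip` would be imported from."""
    spec = importlib.util.find_spec("pip")  # locates pip without importing it
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "base_prefix": sys.base_prefix,
        "in_virtualenv": sys.prefix != sys.base_prefix,
        "pip_location": spec.origin if spec else None,
    }


if __name__ == "__main__":
    for key, value in describe_interpreter().items():
        print(f"{key}: {value}")
```

Running it once with the virtualenv's bin/python and once with /opt/python/3.7/bin/python3 should show whether the two runs resolve pip from different site-packages directories.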

Can't build spark py-files with pandas included

I am attempting to package up my dependencies for a spark program I am creating. I have a requirements.txt file as below
pandas
I then run
pip3 install -t dependencies -r requirements.txt
cd dependencies
zip -r ../dependencies.zip .
pyspark --py-files dependencies.zip
And run the line -
import pandas
And I get the error -
Traceback (most recent call last):
File "/mnt/tmp/spark-REDACTED/userFiles-REDACTED/dependencies.zip/pandas/__init__.py", line 31, in <module>
File "/mnt/tmp/spark-REDACTED/userFiles-REDACTED/dependencies.zip/pandas/_libs/__init__.py", line 3, in <module>
File "/mnt/tmp/spark-REDACTED/userFiles-REDACTED/dependencies.zip/pandas/_libs/tslibs/__init__.py", line 3, in <module>
ModuleNotFoundError: No module named 'pandas._libs.tslibs.conversion'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/mnt/tmp/spark-REDACTED/userFiles-REDACTED/dependencies.zip/pandas/__init__.py", line 36, in <module>
ImportError: C extension: No module named 'pandas._libs.tslibs.conversion' not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
Any ideas on how to fix this?
There are two ways to ship dependencies to the workers. One is exactly what you did: zip the files (or use a single .py file) and pass them with --py-files. The problem you encountered is caused by a missing C dependency on the worker side. Packages like NumPy and pandas all have C dependencies.
To solve this, create a virtualenv and zip the whole virtualenv, including
the Python executable, then set:
PYSPARK_DRIVER_PYTHON=<path to current working python>
PYSPARK_PYTHON=./venv/<path to python executable>
pyspark --archives <path to zip file>#venv
or follow this link
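To see why the zip approach breaks for pandas, it can help to list the compiled extension files inside the dependencies directory: Python's zipimport cannot load dynamic modules (.so, .pyd) from a zip archive. The snippet below is only a sketch; find_native_extensions is a hypothetical helper, and the suffix list is a simplification.

```python
import os

# Simplified list of suffixes used by compiled extension modules.
NATIVE_SUFFIXES = (".so", ".pyd")


def find_native_extensions(root):
    """Return sorted relative paths of compiled extension files under `root`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(NATIVE_SUFFIXES):
                full_path = os.path.join(dirpath, filename)
                found.append(os.path.relpath(full_path, root))
    return sorted(found)


if __name__ == "__main__":
    import sys

    # "dependencies" is the target directory used with pip3 install -t in the question.
    target = sys.argv[1] if len(sys.argv) > 1 else "dependencies"
    for path in find_native_extensions(target):
        print(path)
```

If the list is non-empty, the package is not safe to ship through --py-files as a zip, and an approach that unpacks the files on the workers (such as --archives) is needed.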

Could not install gsutil with latest version

I upgraded to the latest BQ (2.0.14) then downloaded the latest gsutil tar package and manually updated gsutil thus:
"python setup.py install"
When I run gsutil I now get the following error message:
Traceback (most recent call last):
File "/usr/local/bin/gsutil", line 8, in
load_entry_point('gsutil==3.31', 'console_scripts', 'gsutil')()
File "build/bdist.linux-i686/egg/pkg_resources.py", line 318, in load_entry_point
File "build/bdist.linux-i686/egg/pkg_resources.py", line 2221, in load_entry_point
File "build/bdist.linux-i686/egg/pkg_resources.py", line 1954, in load
File "/usr/local/lib/python2.7/site-packages/gsutil-3.31-py2.7.egg/gslib/main.py", line 32, in
from gslib import util
File "/usr/local/lib/python2.7/site-packages/gsutil-3.31-py2.7.egg/gslib/util.py", line 28, in
from oauth2client.client import HAS_CRYPTO
ImportError: cannot import name HAS_CRYPTO
I couldn't find a way to actually uninstall gsutil so I'm stuck.
Any ideas?
Thanks
Have you tried upgrading oauth2client?
pip install -U oauth2client
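Before and after upgrading, it can be useful to check which versions are actually installed. The following is a minimal sketch that assumes a modern Python 3.8+ interpreter (the question itself uses Python 2.7, where importlib.metadata is not available); installed_version is a hypothetical helper name.

```python
from importlib import metadata  # available in Python 3.8+


def installed_version(distribution):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("oauth2client", "gsutil"):
        print(name, installed_version(name))
```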

Plone ldap add on installation issue

I am trying to get ldap authentication to work on Plone version 4.2. I have hammered at the issue for several hours without results. I have even tried these steps:
Install python-ldap 2.6 (C:\Python26)
Install Plone 4.2 with the installer (D:\Plone)
Edit buildout.cfg with plone.app.ldap in the EGG and ZCML section
Create a new folder called python_ldap-2.3.12-py2.6.egg in D:\Plone\buildout-cache\eggs\
Copy C:\Python26\lib\site-packages\python_ldap-2.3.12-py2.6.egg-info to D:\Plone\buildout-cache\eggs\python_ldap-2.3.12-py2.6.egg\ and rename to EGG-INFO
Also copy the ldap folder in C:\Python26\lib\site-packages\ to D:\Plone\buildout-cache\eggs\python_ldap-2.3.12-py2.6.egg\
Also copy the file ldapurl.py from C:\Python26\lib\site-packages\ to D:\Plone\buildout-cache\eggs\python_ldap-2.3.12-py2.6.egg\
Next copy:
folder: C:\Python26\lib\site-packages\python_ldap-2.3.12-py2.6.egg-info
folder: C:\Python26\lib\site-packages\ldap
to D:\Plone\python\Lib\site-packages
Start commandbox and run bin\buildout
Start Plone, log in as admin and go to the extra products section. Here you will find the LDAP product. Install it and enter your LDAP details.
None of that really helped. When I try bin/buildout, I get the following message:
Installing instance.
Getting distribution for 'dataflake.fakeldap'.
zip_safe flag not set; analyzing archive contents...
Installed /tmp/easy_install-oISsVG/dataflake.fakeldap-1.0/setuptools_git-0.4.2-py2.6.egg
Got dataflake.fakeldap 1.0.
Generated script '/usr/local/Plone/zinstance/bin/instance'.
Installing zopepy.
Generated interpreter '/usr/local/Plone/zinstance/bin/zopepy'.
Installing zopeskel.
Generated script '/usr/local/Plone/zinstance/bin/zopeskel'.
Generated script '/usr/local/Plone/zinstance/bin/paster'.
Updating backup.
Updating chown.
chown: Running
echo Dummy references to force this to execute after referenced parts
echo /usr/local/Plone/zinstance/var/backups sudo -u plone
chmod 600 .installed.cfg
find /usr/local/Plone/zinstance/var -type d -exec chmod 700 {} \;
chmod 744 /usr/local/Plone/zinstance/bin/*
Dummy references to force this to execute after referenced parts
/usr/local/Plone/zinstance/var/backups sudo -u plone
Updating repozo.
Updating unifiedinstaller.
*************** PICKED VERSIONS ****************
[versions]
Products.LDAPMultiPlugins = 1.14
Products.LDAPUserFolder = 2.23
Products.PloneLDAP = 1.1
collective.sendaspdf = 2.6
dataflake.fakeldap = 1.0
jquery.pyproxy = 0.4.1
plone.app.ldap = 1.2.8
*************** /PICKED VERSIONS ***************
When I try bin/buildout, it says daemon process started and gives an id, but when I try localhost:8080, it says "Problem loading page" and the page does not load. I tried bin/instance fg to display the errors and I get the following message:
bin/instance fg
2012-07-24 08:53:18 INFO ZServer HTTP server started at Tue Jul 24 08:53:18 2012
Hostname: 0.0.0.0
Port: 8080
2012-07-24 08:53:18 INFO Zope Set effective user to "plone"
2012-07-24 08:53:19 WARNING SecurityInfo Conflicting security declarations for "setText"
2012-07-24 08:53:19 WARNING SecurityInfo Class "ATTopic" had conflicting security declarations
2012-07-24 08:53:19 ERROR Application Could not import Products.LDAPMultiPlugins
Traceback (most recent call last):
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/OFS/Application.py", line 606, in import_product
product=__import__(pname, global_dict, global_dict, silly)
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPMultiPlugins-1.14-py2.6.egg/Products/LDAPMultiPlugins/__init__.py", line 22, in <module>
from Products.LDAPMultiPlugins.LDAPMultiPlugin import addLDAPMultiPluginForm
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPMultiPlugins-1.14-py2.6.egg/Products/LDAPMultiPlugins/LDAPMultiPlugin.py", line 29, in <module>
from Products.LDAPUserFolder import manage_addLDAPUserFolder
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPUserFolder-2.23-py2.6.egg/Products/LDAPUserFolder/__init__.py", line 20, in <module>
from Products.LDAPUserFolder.LDAPUserFolder import LDAPUserFolder
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPUserFolder-2.23-py2.6.egg/Products/LDAPUserFolder/LDAPUserFolder.py", line 52, in <module>
from Products.LDAPUserFolder.LDAPDelegate import filter_format
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPUserFolder-2.23-py2.6.egg/Products/LDAPUserFolder/LDAPDelegate.py", line 19, in <module>
import ldap
File "/usr/local/Plone/buildout-cache/eggs/python_ldap-2.3.12-py2.6.egg/ldap/__init__.py", line 22, in <module>
from _ldap import *
ImportError: No module named _ldap
Traceback (most recent call last):
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/Zope2/Startup/run.py", line 76, in <module>
run()
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/Zope2/Startup/run.py", line 22, in run
starter.prepare()
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/Zope2/Startup/__init__.py", line 86, in prepare
self.startZope()
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/Zope2/Startup/__init__.py", line 259, in startZope
Zope2.startup()
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/Zope2/__init__.py", line 47, in startup
_startup()
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/Zope2/App/startup.py", line 67, in startup
OFS.Application.import_products()
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/OFS/Application.py", line 583, in import_products
import_product(product_dir, product_name, raise_exc=debug_mode)
File "/usr/local/Plone/buildout-cache/eggs/Zope2-2.13.15-py2.6.egg/OFS/Application.py", line 606, in import_product
product=__import__(pname, global_dict, global_dict, silly)
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPMultiPlugins-1.14-py2.6.egg/Products/LDAPMultiPlugins/__init__.py", line 22, in <module>
from Products.LDAPMultiPlugins.LDAPMultiPlugin import addLDAPMultiPluginForm
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPMultiPlugins-1.14-py2.6.egg/Products/LDAPMultiPlugins/LDAPMultiPlugin.py", line 29, in <module>
from Products.LDAPUserFolder import manage_addLDAPUserFolder
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPUserFolder-2.23-py2.6.egg/Products/LDAPUserFolder/__init__.py", line 20, in <module>
from Products.LDAPUserFolder.LDAPUserFolder import LDAPUserFolder
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPUserFolder-2.23-py2.6.egg/Products/LDAPUserFolder/LDAPUserFolder.py", line 52, in <module>
from Products.LDAPUserFolder.LDAPDelegate import filter_format
File "/usr/local/Plone/buildout-cache/eggs/Products.LDAPUserFolder-2.23-py2.6.egg/Products/LDAPUserFolder/LDAPDelegate.py", line 19, in <module>
import ldap
File "/usr/local/Plone/buildout-cache/eggs/python_ldap-2.3.12-py2.6.egg/ldap/__init__.py", line 22, in <module>
from _ldap import *
ImportError: No module named _ldap
What am I doing wrong? Help will be deeply appreciated.
Your buildout ran successfully; there were no problems there. Some of the packages you picked were not pinned, so your buildout reported which versions it chose for you.
Your server itself is indeed not running, because the Python LDAP egg seems to be incorrectly installed. The buildout-cache/eggs/python_ldap-2.3.12-py2.6.egg/ldap/_ldap.so library file is missing.
Remove the whole egg (rm -rf buildout-cache/eggs/python_ldap-2.3.12-py2.6.egg) and make sure you have the OpenLDAP 2.x library and headers installed on your system (on Ubuntu and Debian, the libldap2-dev package should be enough). Then re-run buildout to reinstall the egg.
Alternatively, you could try to install the system python-ldap package (after removing the egg) and see if buildout picks that up instead.
You need to install 2 libs:
sudo apt-get install libldap2-dev
sudo apt-get install libsasl2-dev
Hope that will help.
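Regarding the unpinned packages that buildout reported: the block between the PICKED VERSIONS markers is already in [versions] format, so it can be extracted and pasted into buildout.cfg to pin them. Here is a small sketch of that extraction; parse_picked_versions is a hypothetical helper, and it assumes the output looks like the one shown in the question.

```python
import configparser


def parse_picked_versions(buildout_output):
    """Extract the [versions] pins printed between the PICKED VERSIONS markers."""
    lines = []
    inside = False
    for line in buildout_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("*") and "/PICKED VERSIONS" in stripped:
            break  # closing marker reached
        if inside:
            lines.append(stripped)
        elif stripped.startswith("*") and "PICKED VERSIONS" in stripped:
            inside = True  # opening marker reached
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep package names case-sensitive
    parser.read_string("\n".join(lines))
    return dict(parser["versions"]) if parser.has_section("versions") else {}


if __name__ == "__main__":
    sample = """Updating unifiedinstaller.
*************** PICKED VERSIONS ****************
[versions]
Products.LDAPMultiPlugins = 1.14
plone.app.ldap = 1.2.8
*************** /PICKED VERSIONS ***************
"""
    for package, version in parse_picked_versions(sample).items():
        print(f"{package} = {version}")
```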

Django-nonrel import cache fail

I am trying to set up django-nonrel on GAE (Google App Engine),
following the steps here: http://www.allbuttonspressed.com/projects/djangoappengine#installation
The test application works great.
I was able to use the cache API in the application, but not in the tests or the shell:
Attempting to from django.core.cache import cache in the shell gives me:
>>> from django.core.cache import cache
Traceback (most recent call last):
File "<console>", line 1, in <module>
File "django-testapp/django/core/cache/__init__.py", line 182, in <module>
cache = get_cache(DEFAULT_CACHE_ALIAS)
File "django-testapp/django/core/cache/__init__.py", line 180, in get_cache
return backend_cls(location, params)
File "django-testapp/django/core/cache/backends/memcached.py", line 154, in __init__
import memcache
ImportError: No module named memcache
Similarly attempting ./manage.py test fails the same way.
Any idea why ./manage.py runserver works fine, but ./manage.py shell or ./manage.py test fails to import cache?
I had the same problem when I upgraded to Google App Engine 1.6.0 from 1.5.5.
I solved the problem by installing python-memcached:
pip install python-memcached
For Gentoo users, the recommended way is:
emerge -av dev-python/python-memcached
I also did it like this:
sudo pip install python-memcached
Then I restarted Django, and it worked.