Workflow Disconnection - snakemake

When running a workflow from the repo like:
snakemake --use-conda -p --cluster-config cluster.yaml --cluster "qsub -l {cluster.l} -m {cluster.m} -N {cluster.N} -r {cluster.r} -V" --jobs 1
My job starts but runs into a urllib-related error (see below). I'm running Snakemake v3.13.3 on a compute server. Any tips on how to avoid this? Thanks in advance.
File "miniconda3/lib/python3.5/site-packages/snakemake/__init__.py", line 469, in snakemake
force_use_threads=use_threads)
File "miniconda3/lib/python3.5/site-packages/snakemake/workflow.py", line 450, in execute
dag.create_conda_envs(dryrun=dryrun)
File "miniconda3/lib/python3.5/site-packages/snakemake/dag.py", line 166, in create_conda_envs
hash = env.hash
File "miniconda3/lib/python3.5/site-packages/snakemake/conda.py", line 55, in hash
md5hash.update(self.content)
File "miniconda3/lib/python3.5/site-packages/snakemake/conda.py", line 38, in content
content = urlopen(env_file).read()
File "miniconda3/lib/python3.5/urllib/request.py", line 163, in urlopen
return opener.open(url, data, timeout)
File "miniconda3/lib/python3.5/urllib/request.py", line 466, in open
response = self._open(req, data) ...
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno 101] Network is unreachable>
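The [Errno 101] Network is unreachable error is raised while Snakemake reads the conda environment file with urlopen(), i.e. the machine where Snakemake builds the conda environments has no outbound network access. A commonly suggested workaround, sketched below, is to build the environments once from a host that does have network access (e.g. a login node) and store them on a path the compute nodes can see. The --conda-prefix and --create-envs-only options exist in newer Snakemake releases but are assumptions for v3.13.3, and /shared/conda-envs is a placeholder path.
# 1. Pre-build the conda environments from a host with network access:
snakemake --use-conda --conda-prefix /shared/conda-envs --create-envs-only
# 2. Then submit as before, reusing the pre-built environments:
snakemake --use-conda --conda-prefix /shared/conda-envs -p \
    --cluster-config cluster.yaml \
    --cluster "qsub -l {cluster.l} -m {cluster.m} -N {cluster.N} -r {cluster.r} -V" \
    --jobs 1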

Related

Continuous Machine Learning pipeline broken by tensorflow requirements installation

I am doing Continuous Machine Learning (https://cml.dev/) on my own GitLab server. My goal is to test the basic Continuous Machine Learning pipeline with a Python script as an example.
My .gitlab-ci.yml file is a basic one:
stages:
  - cml_run
cml:
  stage: cml_run
  image: dvcorg/cml-py3:latest
  script:
    - pip3 install -r requirements.txt
    - python train.py
    - cat metrics.txt >> report.md
    - cml-publish confusion_matrix.png --md >> report.md
    - cml-send-comment report.md
pandas, sklearn, and Keras from requirements.txt install successfully, but the pipeline breaks during the TensorFlow installation:
$ pip3 install -r requirements.txt
Collecting pandas
Downloading pandas-1.0.5-cp36-cp36m-manylinux1_x86_64.whl (10.1 MB)
Collecting sklearn
Downloading sklearn-0.0.tar.gz (1.1 kB)
Collecting keras
Downloading Keras-2.4.3-py2.py3-none-any.whl (36 kB)
Collecting tensorflow
Downloading tensorflow-2.2.0-cp36-cp36m-manylinux2010_x86_64.whl (516.2 MB)
ERROR: Exception:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/base_command.py", line 188, in _main
status = self.run(options, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/req_command.py", line 185, in wrapper
return func(self, options, args)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/commands/install.py", line 333, in run
reqs, check_supported_wheels=not options.target_dir
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolution/legacy/resolver.py", line 179, in resolve
discovered_reqs.extend(self._resolve_one(requirement_set, req))
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolution/legacy/resolver.py", line 362, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/resolution/legacy/resolver.py", line 314, in _get_abstract_dist_for
abstract_dist = self.preparer.prepare_linked_requirement(req)
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/prepare.py", line 469, in prepare_linked_requirement
hashes=hashes,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/prepare.py", line 259, in unpack_url
hashes=hashes,
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/prepare.py", line 130, in get_http_url
link, downloader, temp_dir.path, hashes
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/operations/prepare.py", line 281, in _download_http_url
for chunk in download.chunks:
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/cli/progress_bars.py", line 166, in iter
for x in it:
File "/usr/local/lib/python3.6/dist-packages/pip/_internal/network/utils.py", line 39, in response_chunks
decode_content=False,
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 564, in stream
data = self.read(amt=amt, decode_content=decode_content)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/urllib3/response.py", line 507, in read
data = self._fp.read(amt) if not fp_closed else b""
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 65, in read
self._close()
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/filewrapper.py", line 52, in _close
self.__callback(self.__buf.getvalue())
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/controller.py", line 309, in cache_response
cache_url, self.serializer.dumps(request, response, body=body)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/cachecontrol/serialize.py", line 72, in dumps
return b",".join([b"cc=4", msgpack.dumps(data, use_bin_type=True)])
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/__init__.py", line 35, in packb
return Packer(**kwargs).pack(o)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/fallback.py", line 936, in pack
self._pack(obj)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/fallback.py", line 920, in _pack
len(obj), dict_iteritems(obj), nest_limit - 1
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/fallback.py", line 1021, in _pack_map_pairs
self._pack(v, nest_limit - 1)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/fallback.py", line 920, in _pack
len(obj), dict_iteritems(obj), nest_limit - 1
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/fallback.py", line 1021, in _pack_map_pairs
self._pack(v, nest_limit - 1)
File "/usr/local/lib/python3.6/dist-packages/pip/_vendor/msgpack/fallback.py", line 865, in _pack
return self._buffer.write(obj)
MemoryError
Any ideas on how to overcome this issue with the CML pipeline on GitLab?
It is difficult to diagnose without knowing more about your GitLab runner, but based on the MemoryError message it seems the runner does not have enough memory to install TensorFlow in addition to your other project dependencies.
You can try installing TensorFlow with pip's --no-cache-dir flag.
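The traceback shows the MemoryError inside pip's cache serialization (cachecontrol/msgpack) while it handles the 516 MB TensorFlow wheel, so skipping the cache avoids holding the whole wheel in memory. A minimal sketch of the change, with the rest of the script: section unchanged:
pip3 install --no-cache-dir -r requirements.txt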

Ambari cluster restart error: Timeline Service V2.0 Reader not restarting

Attempting to restart an Ambari-managed cluster and getting errors related to the Timeline Service V2.0 Reader service starting:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 108, in <module>
ApplicationTimelineReader().execute()
File "/usr/lib/ambari-agent/lib/resource_management/libraries/script/script.py", line 353, in execute
method(env)
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/timelinereader.py", line 51, in start
hbase(action='start')
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 80, in hbase
createTables()
File "/var/lib/ambari-agent/cache/stacks/HDP/3.0/services/YARN/package/scripts/hbase_service.py", line 147, in createTables
logoutput=True)
File "/usr/lib/ambari-agent/lib/resource_management/core/base.py", line 166, in __init__
self.env.run()
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 160, in run
self.run_action(resource, action)
File "/usr/lib/ambari-agent/lib/resource_management/core/environment.py", line 124, in run_action
provider_action()
File "/usr/lib/ambari-agent/lib/resource_management/core/providers/system.py", line 263, in action_run
returns=self.resource.returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 72, in inner
result = function(command, **kwargs)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 102, in checked_call
tries=tries, try_sleep=try_sleep, timeout_kill_strategy=timeout_kill_strategy, returns=returns)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 150, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/ambari-agent/lib/resource_management/core/shell.py", line 308, in _call
raise ExecuteTimeoutException(err_msg)
resource_management.core.exceptions.ExecuteTimeoutException: Execution of 'ambari-sudo.sh su yarn-ats -l -s /bin/bash -c 'export PATH='"'"'/usr/sbin:/sbin:/usr/lib/ambari-server/*:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/texlive/2016/bin/x86_64-linux:/usr/lib64/qt-3.3/bin:/usr/local/texlive/2016/bin/x86_64-linux:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/opt/maven/bin:/root/bin:/opt/maven/bin:/opt/maven/bin:/var/lib/ambari-agent'"'"' ; sleep 10;export HBASE_CLASSPATH_PREFIX=/usr/hdp/3.0.0.0-1634/hadoop-yarn/timelineservice/*; /usr/hdp/3.0.0.0-1634/hbase/bin/hbase --config /usr/hdp/3.0.0.0-1634/hadoop/conf/embedded-yarn-ats-hbase org.apache.hadoop.yarn.server.timelineservice.storage.TimelineSchemaCreator -Dhbase.client.retries.number=35 -create -s'' was killed due timeout after 300 seconds
I have not changed any configs or installed anything new before this restart attempt; I simply stopped the cluster services and tried to restart them. I'm not sure what this error message means. Any debugging tips or fixes?
I found the solution in another community post: navigate to the host where the Timeline Reader is installed and install the HBase Client on that host.
Here is how I installed the HBase Client via the Ambari UI:
In the Ambari UI, go to Hosts, then click the host you want to install the HBase Client component on.
In that host's list of components, you will have the option to add more.
From there I installed the HBase Client.
Then I stopped and restarted the cluster via the Ambari UI. I got a notification of stale configs, though I'm not sure whether that was my problem all along or whether installing the HBase Client raised the stale-configs alert.
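If you would rather script this than click through the UI, the same steps can be done with Ambari's REST API. The commands below are a sketch: AMBARI_HOST, CLUSTER, TARGET_HOST, and the admin:admin credentials are placeholders you would need to replace.
# Add the HBASE_CLIENT component to the target host
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER/hosts/TARGET_HOST/host_components/HBASE_CLIENT
# Ask Ambari to install it by setting the desired state to INSTALLED
curl -u admin:admin -H "X-Requested-By: ambari" -X PUT \
  -d '{"HostRoles": {"state": "INSTALLED"}}' \
  http://AMBARI_HOST:8080/api/v1/clusters/CLUSTER/hosts/TARGET_HOST/host_components/HBASE_CLIENT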

Cannot run dask-mpi with Python 3.7 -- timeout when connecting client to dask-mpi scheduler

I'm attempting to run the Dask-MPI "Getting Started" (http://mpi.dask.org/en/latest/) example in a fresh Anaconda environment.
I set up an environment using
conda create -n dask-mpi -c conda-forge python=3.7 dask-mpi
conda activate dask-mpi
Inside the environment, I run
mpirun -np 4 dask-mpi --scheduler-file ./scheduler.json
Then, from a Python interpreter on the same machine (and in the same folder), I run
from dask.distributed import Client
client = Client(scheduler_file='/path/to/scheduler.json')
This results in the following error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 712, in __init__
self.start(timeout=timeout)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 858, in start
sync(self.loop, self._start, **kwargs)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 331, in sync
six.reraise(*error[0])
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/six.py", line 693, in reraise
raise value
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/utils.py", line 316, in f
result[0] = yield future
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 954, in _start
yield self._ensure_connected(timeout=timeout)
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 736, in run
yielded = self.gen.throw(*exc_info) # type: ignore
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/distributed/client.py", line 1015, in _ensure_connected
timedelta(seconds=timeout), self._update_scheduler_info()
File "/home/nleaf/anaconda3/envs/dask-mpi/lib/python3.7/site-packages/tornado/gen.py", line 729, in run
value = future.result()
tornado.util.TimeoutError: Timeout
The terminal that I ran dask-mpi from does not have any output which would indicate that something is trying to connect. I have verified that the port in question, 8786, is open. I've also verified via debugger that the client is getting the correct address from the scheduler file.
I've tried this in quite a few different environments and on a few different machines, including a fresh Ubuntu 18.04 docker container. I'm completely at a loss for what steps I might be missing.
It turns out this was due to a bug in newer versions of dask.distributed (1.25.3) that broke the behavior of dask-mpi. This seems to be fixed as of dask-mpi 1.0.3 (https://github.com/dask/dask-mpi/releases/tag/1.0.3).
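If you are hitting the same timeout, upgrading dask-mpi inside the environment should pick up the fix; the version pin below simply mirrors the release linked above.
conda activate dask-mpi
conda install -c conda-forge "dask-mpi>=1.0.3"
# or, if the package came from pip:
pip install --upgrade "dask-mpi>=1.0.3"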

Apache Ambari : Datanode installation failed while installing in existing cluster

I have created a Hadoop cluster with 3 DataNodes using Apache Ambari 2.1.0.
Now, when I try to add another DataNode to the existing cluster, it throws the following error:
resource_management.core.exceptions.Fail: Execution of '/usr/bin/yum
-d 0 -e 0 -y install 'hadoop_2_3_*'' returned 1. No Presto metadata available for base
Delta RPMs reduced 3.6 M of updates to 798 k (78% saved)
Here is my web UI console log:
Traceback (most recent call last):
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/datanode.py", line 153, in
DataNode().execute()
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 218, in execute
method(env)
File "/var/lib/ambari-agent/cache/common-services/HDFS/2.1.0.2.0/package/scripts/datanode.py", line 34, in install
self.install_packages(env, params.exclude_packages)
File "/usr/lib/python2.6/site-packages/resource_management/libraries/script/script.py", line 376, in install_packages
Package(name)
File "/usr/lib/python2.6/site-packages/resource_management/core/base.py", line 157, in init
self.env.run()
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 152, in run
self.run_action(resource, action)
File "/usr/lib/python2.6/site-packages/resource_management/core/environment.py", line 118, in run_action
provider_action()
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/init.py", line 45, in action_install
self.install_package(package_name, self.resource.use_repos, self.resource.skip_repos)
File "/usr/lib/python2.6/site-packages/resource_management/core/providers/package/yumrpm.py", line 49, in install_package
shell.checked_call(cmd, sudo=True, logoutput=self.get_logoutput())
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 70, in inner
result = function(command, **kwargs)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 92, in checked_call
tries=tries, try_sleep=try_sleep)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 140, in _call_wrapper
result = _call(command, **kwargs_copy)
File "/usr/lib/python2.6/site-packages/resource_management/core/shell.py", line 291, in _call
raise Fail(err_msg)
resource_management.core.exceptions.Fail: Execution of '/usr/bin/yum
-d 0 -e 0 -y install 'hadoop_2_3_*'' returned 1. No Presto metadata available for base Delta RPMs reduced 3.6 M of updates to 798 k (78%
saved)
Error downloading packages:
hadoop_2_3_4_0_3485-yarn-proxyserver-2.7.1.2.3.4.0-3485.el6.x86_64:
[Errno 256] No more mirrors to try.
It looks like there are two issues with yum and your repositories.
First, I see the message:
No Presto metadata available for base Delta RPMs reduced 3.6 M of
updates to 798 k (78% saved)
Try running the following command on the host that you are trying to add as a datanode to fix the first issue:
sudo yum clean all
Then see if you can perform this command successfully:
sudo yum -v install 'hadoop_2_3_*'
If you get to the prompt that asks whether you want to install (y/n), the command succeeded: choose no, and retry the add-DataNode action from Ambari. If you get an error or some other failure, look at the verbose output to troubleshoot the problem further.
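Since the original failure ends with "[Errno 256] No more mirrors to try.", it is also worth confirming that every enabled repository on the new host is reachable before retrying; these are standard yum commands, nothing Ambari-specific:
sudo yum clean metadata
sudo yum repolist all   # the repositories Ambari set up (typically HDP and HDP-UTILS) should be enabled
sudo yum makecache      # a failure here points at an unreachable or misconfigured repo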

Celery not connecting to Redis broker: Connection to broker lost

I'm trying to get Redis to work as a broker for my Celery 3.0.19 install on Django. I see that redis-server is running on port 6379. When I run a simple Celery test, I get the following stack trace:
Ubuntu Lucid 10.04
Celery 3.0.19
celery -A tasks worker --loglevel=info
[2013-05-02 18:56:27,835: INFO/MainProcess] consumer: Connected to redis://127.0.0.1:6379/0.
[2013-05-02 18:56:27,835: ERROR/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.6/dist-packages/celery/worker/consumer.py", line 394, in start
self.reset_connection()
File "/usr/local/lib/python2.6/dist-packages/celery/worker/consumer.py", line 744, in reset_connection
self.connection, on_decode_error=self.on_decode_error,
File "/usr/local/lib/python2.6/dist-packages/celery/app/amqp.py", line 311, in __init__
**kw
File "/usr/local/lib/python2.6/dist-packages/kombu/messaging.py", line 355, in __init__
self.revive(self.channel)
File "/usr/local/lib/python2.6/dist-packages/kombu/messaging.py", line 367, in revive
self.declare()
File "/usr/local/lib/python2.6/dist-packages/kombu/messaging.py", line 377, in declare
queue.declare()
File "/usr/local/lib/python2.6/dist-packages/kombu/entity.py", line 490, in declare
self.queue_declare(nowait, passive=False)
File "/usr/local/lib/python2.6/dist-packages/kombu/entity.py", line 516, in queue_declare
nowait=nowait)
File "/usr/local/lib/python2.6/dist-packages/kombu/transport/virtual/__init__.py", line 404, in queue_declare
return queue, self._size(queue), 0
File "/usr/local/lib/python2.6/dist-packages/kombu/transport/redis.py", line 516, in _size
sizes = cmds.execute()
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 1919, in execute
return execute(conn, stack, raise_on_error)
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 1811, in _execute_transaction
self.parse_response(connection, '_')
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 1882, in parse_response
self, connection, command_name, **options)
File "/usr/local/lib/python2.6/dist-packages/redis/client.py", line 387, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.6/dist-packages/redis/connection.py", line 312, in read_response
raise response
ResponseError: unknown command 'MULTI'
You need Redis version >= 2.2.0.
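To confirm which Redis version the broker at 127.0.0.1:6379 is actually running, you can check it directly with standard Redis tooling; the stock redis-server package on Ubuntu Lucid is likely older than 2.2, so an upgrade from a newer package source would then be needed.
redis-server --version
redis-cli -h 127.0.0.1 -p 6379 info server | grep redis_version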