I'm trying to create a tensorflow cluster on top of the Ignite cluster in my local multi-node environment.
I followed the tutorials and tried the following command:
./ignite-tf.sh start TESTDATA models python /usr/local/grid/cifar10_main.py
This gives me an "Unmatched argument" error, as follows:
Unmatched argument:
Usage: ignite-tf [-hV] [-c=<cfg>] [COMMAND]
Apache Ignite and TensorFlow integration command line tool that allows to
start, maintain and stop distributed deep learning utilizing Apache Ignite
infrastructure and data.
-c, --config=<cfg> Apache Ignite client configuration.
-h, --help Show this help message and exit.
-V, --version Print version information and exit.
Commands:
start Starts a new TensorFlow cluster and attaches to user script process.
stop Stops a running TensorFlow cluster.
attach Attaches to running TensorFlow cluster (user script process).
ps Prints identifiers of all running TensorFlow clusters.
I'm not sure which argument is unmatched. I need help getting this to work.
I have downloaded this specific version and tried to run the same command; I don't get any error messages, and it tries to start a node:
311842 pts/12 S+ 0:00 | | \_ bash ./ignite-tf.sh start TESTDATA models python /usr/local/grid/cifar10_main.py
311902 pts/12 Sl+ 0:03 | | \_ /usr/lib/jvm/java-8-openjdk-amd64//bin/java -XX:+AggressiveOpts -Xms1g -Xmx1g -server -XX:MaxMetaspaceSize=256m -DIGNITE_QUIET=false -DIGNITE_SUCCESS_FILE=/home/user/Downloads/gridgain-community-8.7.24/work/ignite_success_20e882c5-5b64-4d0a-b7ed-9587c08a0e44 -DIGNITE_HOME=/home/user/Downloads/gridgain-community-8.7.24 -DIGNITE_PROG_NAME=./ignite-tf.sh -cp /home/user/Downloads/gridgain-community-8.7.24/libs/*:/home/user/Downloads/gridgain-community-8.7.24/libs/ignite-control-utility/*:/home/user/Downloads/gridgain-community-8.7.24/libs/ignite-indexing/*:/home/user/Downloads/gridgain-community-8.7.24/libs/ignite-opencensus/*:/home/user/Downloads/gridgain-community-8.7.24/libs/ignite-spring/*:/home/user/Downloads/gridgain-community-8.7.24/libs/licenses/*:/home/user/Downloads/gridgain-community-8.7.24/libs/optional//ignite-tensorflow/*:/home/user/Downloads/gridgain-community-8.7.24/libs/optional//ignite-slf4j/* org.apache.ignite.tensorflow.submitter.JobSubmitter start TESTDATA models python /usr/local/grid/cifar10_main.py
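If you want to rule out a simple version mismatch between the two installations, the usage text above documents a version flag, so you can check what each one reports:

./ignite-tf.sh --version    # -V / --version is listed in the usage output above; run it on both installs and compare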
I am having some difficulties starting Airflow using docker-compose with the appropriate GPU libraries to run my machine learning tasks.
The airflow-scheduler throws this error:
airflow-scheduler_1 | 2022-03-21 12:33:36.919960: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory
Basically, there are no CUDA libraries installed in /usr/local within the Airflow container, hence the error. I have installed nvidia-container-runtime and set the daemon default runtime in the daemon.json file:
curl -s -L https://nvidia.github.io/nvidia-container-runtime/gpgkey | \
  sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-container-runtime/$distribution/nvidia-container-runtime.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-runtime.list
sudo apt-get update
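The daemon.json change looks roughly like this (a sketch of what I set; the runtime path is the default one installed by nvidia-container-runtime, so adjust if yours differs):

# assumes nvidia-container-runtime was installed at its default path
sudo tee /etc/docker/daemon.json <<'EOF'
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
EOF
sudo systemctl restart docker    # reload the daemon so the new default runtime takes effect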
I have also managed to set runtime: nvidia in the docker-compose.yaml file. This way, nvidia-smi works inside the Airflow container. However, the CUDA libraries are still missing.
Is there a way to install these libraries automatically (ideally building FROM tensorflow/tensorflow:latest-gpu, since that image already sets up the CUDA libraries within the container)?
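What I have in mind is roughly the following (just a sketch, not something I have working; whether pip-installing Airflow on top of the TensorFlow GPU image like this is viable is exactly what I am unsure about):

# build a custom Airflow image on top of the GPU image so the CUDA libraries ship with it
cat > Dockerfile.airflow-gpu <<'EOF'
FROM tensorflow/tensorflow:latest-gpu
RUN pip install --no-cache-dir apache-airflow
EOF
docker build -f Dockerfile.airflow-gpu -t airflow-gpu .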
On the other hand, if I am not using docker-compose I can start a container with docker:
docker run -it --gpus all tensorflow/tensorflow:latest-gpu
This container has all the libraries that I need. However, I would like to use docker-compose, as it makes it much easier to run multiple containers and set up the networking, so I would like to avoid this approach.
Alternatively, I can use Docker from within Airflow by mounting the Docker socket into the Airflow container, so that I can launch a new container from Airflow. This way I can have all the CUDA libraries installed too, but it seems very counter-intuitive, and I am having difficulties understanding why I can't set all of this up within the Airflow container in the first place.
import docker

client = docker.from_env()
# run the container
response = client.containers.run(
    # The container you wish to call
    'tensorflow/tensorflow:latest-gpu',
    # The command to run inside the container
    'find / -name "libcudart.so.11.0"',
    # Passing the GPU access
    device_requests=[
        docker.types.DeviceRequest(count=-1, capabilities=[['gpu']])
    ]
)
I would appreciate it if you could point me in the right direction.
I am trying to migrate from DSpace 6.4 to 7.1. The new DSpace is installed on another machine (a virtual machine running CentOS 7 with 8 GB of RAM).
I have created a full-site AIP backup with user passwords (total size of the packages is 11 GB).
I've tried to do a full restore but always get the same error.
So I'm just trying to import only the "first level without children":
JAVA_OPTS="-Xmx2048m -Xss4m -Dfile.encoding=UTF-8" /dspace/bin/dspace packager -r -k -t AIP -e dinkwi.test#gmail.com -o skipIfParentMissing=true -i 123456789/0 /home/dimich/11111/repo.zip
It doesn't matter whether I use the -k or -f param; the output is always the same:
Ingesting package located at /home/dimich/11111/repo.zip
Exception: null
java.lang.StackOverflowError
at org.dspace.eperson.GroupServiceImpl.getChildren(GroupServiceImpl.java:788)
at org.dspace.eperson.GroupServiceImpl.getChildren(GroupServiceImpl.java:802)
.... (more than 1k lines)
at org.dspace.eperson.GroupServiceImpl.getChildren(GroupServiceImpl.java:802)
My dspace.log ended with:
2021-12-20 11:05:28,141 INFO unknown unknown org.dspace.eperson.GroupServiceImpl # dinkwi.test#gmail.com::update_group:group_id=9e6a2038-01d9-41ad-96b9-c6fb55b44381
2021-12-20 11:05:30,048 INFO unknown unknown org.dspace.eperson.GroupServiceImpl # dinkwi.test#gmail.com::update_group:group_id=23aaa7e9-ca2d-4af5-af64-600f7126e2be
2021-12-20 11:05:30,800 INFO unknown unknown org.springframework.cache.ehcache.EhCacheManagerFactoryBean # Shutting down EhCache CacheManager 'org.dspace.services'
So I just want to figure out what the problem is: a stack that is too small, some bug in a user/group that leads to an infinite loop/recursion, or maybe something else...
The main problem is that I'm good with PHP/MySQL but have no experience with Java/PostgreSQL, or with how to debug this...
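These are the two checks I was planning to start with, in case they make sense (assumptions on my part: the packager honours the JAVA_OPTS prefix the same way as above, the group hierarchy lives in the standard group2group table with parent_id/child_id columns, and the database and user are both named dspace):

# 1) test the "stack too small" hypothesis: raise the thread stack size and retry the same import
JAVA_OPTS="-Xmx2048m -Xss32m -Dfile.encoding=UTF-8" /dspace/bin/dspace packager -r -k -t AIP -e dinkwi.test#gmail.com -o skipIfParentMissing=true -i 123456789/0 /home/dimich/11111/repo.zip

# 2) look for a direct two-way parent/child cycle in the group hierarchy, the simplest case that would make getChildren() recurse forever
psql -U dspace -d dspace -c "SELECT a.parent_id, a.child_id FROM group2group a JOIN group2group b ON a.parent_id = b.child_id AND a.child_id = b.parent_id;"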
Any help would be appreciated.
P.S. After a failed restore I always run the command:
/dspace/bin/dspace cleanup -v
I have installed Redis Cluster 3.0.0 but want to upgrade it to 3.0.7. Can somebody tell me the steps to do it?
I don't want to lose any data, and I don't want any downtime either.
These are the steps I took when upgrading from 2.9.101 to the 3.0 release; I hope they will work for upgrading to 3.0.7 too (a redis-cli sketch of the replicate/failover steps follows the list).
Compile 3.0.7 from the source and start several instances with cluster enabled.
Let the 3.0.7 instances replicate the 3.0.0 instances as slaves.
Connect to each 3.0.7 instance and do a manual failover, then the 3.0.0 masters would become slaves after several seconds.
Wait for your application to connect to the new masters; also check the configuration files and modify the entries for the new masters as needed.
Remove those slaves
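A rough redis-cli sketch of the replicate/failover steps for a single shard (assumptions: the old 3.0.0 master listens on 7000, the new 3.0.7 instance on 7100, and <master-node-id> is the old master's ID taken from CLUSTER NODES):

redis-cli -p 7100 CLUSTER MEET 127.0.0.1 7000          # join the new 3.0.7 instance to the cluster
redis-cli -p 7100 CLUSTER REPLICATE <master-node-id>   # make it a slave of the old 3.0.0 master
redis-cli -p 7100 CLUSTER FAILOVER                     # promote it; the old master becomes a slave
redis-cli -p 7100 INFO replication                     # confirm role:master after a few seconds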
UPDATE : Docker approach
As you probably can't replace the binary executable while the process is still alive, you could do it by running Redis in Docker.
First you should install Docker on your machine and pull the Redis image, or pull a basic OS image and manually build Redis in it, whichever you prefer.
Based on this image, you are supposed to do the following (a sketch of these steps follows the list):
copy your current redis.conf into it
make sure the dir exists in the image (cluster-config-file could be the same for all the containers as they are saved individually in their own fs)
make sure the directory for logfile exists and is not the same as dir (we will later map this directory to the host)
leave port and logfile set to anything you like, as they are specified when a container is started
commit the image as redis-3.0.7
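One possible way to do that preparation with the stock image (a sketch; it assumes the official redis:3.0.7 tag, that your tuned config sits in ./redis.conf, and that its dir points at /srv/redis):

docker run -d --name redis-build redis:3.0.7 sleep infinity            # temporary container to modify
docker exec redis-build mkdir -p /etc/redis /var/log/redis /srv/redis  # config dir, logfile dir, and the dir from redis.conf
docker cp ./redis.conf redis-build:/etc/redis/redis.conf               # copy in your current redis.conf
docker commit redis-build redis-3.0.7                                  # commit the image as redis-3.0.7
docker rm -f redis-build                                               # clean up the temporary container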
Now launch a containerized Redis. Suppose your logfile is located in /var/log/redis/, this Redis instance binds to :8000, and your config file in the image is /etc/redis/redis.conf:
docker run -d --net=host -v /var/log/redis:/var/log/redis \
-p 8000:8000 -t redis-3.0.7 \
/usr/bin/redis-server /etc/redis/redis.conf \
--port 8000 \
--logfile /var/log/redis/redis_8000.log
Now you have a Redis 3.0.7 instance, and you are ready to finish the rest of the steps in the previous part.
I was wondering if anybody could help me with this issue in deploying a spark cluster using the bdutil tool.
When the total number of cores increases (>= 1024), it fails all the time for the following reasons:
Some machines never become sshable, like "Tue Dec 8 13:45:14 PST 2015: 'hadoop-w-5' not yet sshable (255); sleeping"
Some nodes fail with an "Exited 100" error when deploying spark worker nodes, like "Tue Dec 8 15:28:31 PST 2015: Exited 100 : gcloud --project=cs-bwamem --quiet --verbosity=info compute ssh hadoop-w-6 --command=sudo su -l -c "cd ${PWD} && ./deploy-core-setup.sh" 2>>deploy-core-setup_deploy.stderr 1>>deploy-core-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-f"
In the log file, it says:
hadoop-w-40: ==> deploy-core-setup_deploy.stderr <==
hadoop-w-40: dpkg-query: package 'openjdk-7-jdk' is not installed and no information is available
hadoop-w-40: Use dpkg --info (= dpkg-deb --info) to examine archive files,
hadoop-w-40: and dpkg --contents (= dpkg-deb --contents) to list their contents.
hadoop-w-40: Failed to fetch http://httpredir.debian.org/debian/pool/main/x/xml-core/xml-core_0.13+nmu2_all.deb Error reading from server. Remote end closed connection [IP: 128.31.0.66 80]
hadoop-w-40: E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I tried 16-core 128-node, 32-core 64-node, 32-core 32-node, and other configurations of over 1024 cores, but either reason 1 or 2 above shows up.
I also tried modifying the ssh-flag to change the ConnectTimeout to 1200s, and changing bdutil_env.sh to set the polling interval to 30s, 60s, ..., but none of it works. There are always some nodes that fail.
Here is one of the configurations that I used:
time ./bdutil \
--bucket $BUCKET \
--force \
--machine_type n1-highmem-32 \
--master_machine_type n1-highmem-32 \
--num_workers 64 \
--project $PROJECT \
--upload_files ${JAR_FILE} \
--env_var_files hadoop2_env.sh,extensions/spark/spark_env.sh \
deploy
To summarize some of the information that came out of a separate email discussion: as IP mappings change and different Debian mirrors get assigned, there can be occasional problems where the concurrent calls to apt-get install during a bdutil deployment either overload some unbalanced servers or trigger DDoS protections, leading to deployment failures. These tend to be transient, and at the moment it appears I can deploy large clusters in zones like us-east1-c and us-east1-d successfully again.
There are a few options you can take to reduce the load on the Debian mirrors:
Set MAX_CONCURRENT_ASYNC_PROCESSES to a much smaller value than the default 150 inside bdutil_env.sh, such as 10 to only deploy 10 at a time; this will make the deployment take longer, but would lighten the load as if you just did several back-to-back 10-node deployments.
If the VMs were successfully created but the deployment steps fail, instead of needing to retry the whole delete/deploy cycle, you can try ./bdutil <all your flags> run_command -t all -- 'rm -rf /home/hadoop' followed by ./bdutil <all your flags> run_command_steps to just run through the whole deployment attempt.
Incrementally build your cluster using resize_env.sh; initially set --num_workers 10 and deploy your cluster, and then edit resize_env.sh to set NEW_NUM_WORKERS=20, and run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy and it will only deploy the new workers 10-20 without touching those first 10. Then you just repeat, adding another 10 workers to NEW_NUM_WORKERS each time. If a resize attempt fails, you simply ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete to only delete those extra workers without affecting the ones you already deployed successfully.
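For example, the incremental approach from the last option boils down to the following (a sketch; <all your flags> stands for the same flags you already pass to bdutil):

./bdutil <all your flags> --num_workers 10 deploy        # initial deployment with 10 workers
# edit extensions/google/experimental/resize_env.sh and set NEW_NUM_WORKERS=20, then:
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy
# repeat, raising NEW_NUM_WORKERS by 10 each time; if a resize attempt fails, remove only the extra workers:
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete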
Finally, if you're looking for more reproducible and optimized deployments, you should consider using Google Cloud Dataproc, which lets you use the standard gcloud CLI to deploy clusters, submit jobs, and further manage/delete clusters without needing to remember your bdutil flags or keep track of which clusters you have on your client machine. You can SSH into Dataproc clusters and use them basically the same as bdutil clusters, with some minor differences; for example, the Dataproc DEFAULT_FS is HDFS, so any GCS paths you use should fully specify the complete gs://bucket/object name.
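If you do go the Dataproc route, the basic lifecycle looks roughly like this (a sketch; cluster name, zone, and jar path are placeholders, and depending on your gcloud version the commands may live under gcloud beta dataproc or also require a --region flag):

gcloud dataproc clusters create my-cluster --zone us-east1-c --num-workers 64                  # create the cluster
gcloud dataproc jobs submit spark --cluster my-cluster --jar gs://mybucket/my-spark-job.jar    # run a Spark job
gcloud dataproc clusters delete my-cluster                                                     # tear it down when done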
I created a Django application which runs inside a Docker container. I needed to create a thread inside the Django application, so I used Celery, with Redis as the Celery broker.
If I install redis in the docker image (Ubuntu 14.04):
RUN apt-get update && apt-get -y install redis-server
RUN pip install redis
The Redis server is not launched: the Django application throws an exception because the connection is refused on port 6379. If I manually start Redis, it works.
If I start the Redis server with the following command, it hangs:
RUN redis-server
If I try to tweak the previous line, it does not work either:
RUN nohup redis-server &
So my question is: is there a way to start Redis in the background and to make it restart when the Docker container is restarted?
The Docker "last command" (CMD) is already used:
CMD uwsgi --http 0.0.0.0:8000 --module mymodule.wsgi
RUN commands only add new image layers; they are executed during the build of the image, not at runtime.
Use CMD instead. You can combine multiple commands by externalizing them into a shell script which is invoked by CMD:
CMD start.sh
In the start.sh script you write the following:
#!/bin/bash
nohup redis-server &
uwsgi --http 0.0.0.0:8000 --module mymodule.wsgi
When you run a Docker container, there is always a single top level process. When you fire up your laptop, that top level process is an "init" script, systemd or the like. A docker image has an ENTRYPOINT directive. This is the top level process that runs in your docker container, with anything else you want to run being a child of that. In order to run Django, a Celery Worker, and Redis all inside a single Docker container, you would have to run a process that starts all three of them as child processes. As explained by Milan, you could set up a Supervisor configuration to do it, and launch supervisor as your parent process.
Another option is to actually boot the init system. This will get you very close to what you want since it will basically run things as though you had a full scale virtual machine. However, you lose many of the benefits of containerization by doing that :)
The simplest way altogether is to run several containers using Docker-compose. A container for Django, one for your Celery worker, and another for Redis (and one for your data store as well?) is pretty easy to set up that way. For example...
# docker-compose.yml
web:
  image: myapp
  command: uwsgi --http 0.0.0.0:8000 --module mymodule.wsgi
  links:
    - redis
    - mysql
celeryd:
  image: myapp
  command: celery worker -A myapp.celery
  links:
    - redis
    - mysql
redis:
  image: redis
mysql:
  image: mysql
This would give you four containers for your four top-level processes. redis and mysql would be exposed with the DNS names "redis" and "mysql" inside your app containers, so instead of pointing at "localhost" you'd point at "redis".
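For a quick sanity check of that DNS wiring, you could resolve the service name from inside the web container (this assumes the myapp image has Python available, which it should for a Django app):

docker-compose run web python -c "import socket; print(socket.gethostbyname('redis'))"   # prints the redis container's address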
There is a lot of good info in the Docker Compose docs.
Use supervisord, which would control both processes. The conf file might look like this:
...
[program:redis]
command= /usr/bin/redis-server /srv/redis/redis.conf
stdout_logfile=/var/log/supervisor/redis-server.log
stderr_logfile=/var/log/supervisor/redis-server_err.log
autorestart=true
[program:nginx]
command=/usr/sbin/nginx
stdout_events_enabled=true
stderr_events_enabled=true
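To make supervisord the container's top-level process, you then just start it in the foreground as the container's command, e.g. (a sketch; it assumes the config above is installed at the Debian/Ubuntu default location):

/usr/bin/supervisord -n -c /etc/supervisor/supervisord.conf   # -n keeps supervisord in the foreground so the container stays up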