Running a VASP job on an HPC cluster - pyiron

Using pyiron, I have built up my script and would like to submit it to the cluster for running. I was wondering how I can do that?
Note: VASP is already installed on my cluster.

pyiron uses pysqa to submit jobs to a queuing system:
https://github.com/pyiron/pysqa
With sample queuing configurations available at:
https://github.com/pyiron/pysqa/tree/master/tests/config
So in your pyiron resources directory you create a folder named queues which contains the pysqa queuing system configuration.
Once this is done you can use:
job.server.list_queues()
to view the available queues and:
job.server.view_queues()
to get more information about the individual queues. Then select a queue using:
job.server.queue = 'queue_name'
where queue_name is the name of the queue you want to select, and then specify the cores and run_time using:
job.server.cores = 8
job.server.run_time = 30000
Finally, when you call job.run(), the job is automatically submitted to the queue.
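Putting the pieces together, a minimal submission script might look like the sketch below. Everything except the job.server calls quoted above is an assumption: the project name, the VASP job creation and structure calls depend on your pyiron version and setup, and 'slurm_queue' is a placeholder for whatever queue your queues configuration defines.
from pyiron import Project

# Minimal sketch; project name, structure and queue name are placeholders,
# and the exact creation calls depend on your pyiron version.
pr = Project("vasp_submission_test")
job = pr.create_job(pr.job_type.Vasp, "vasp_job")
job.structure = pr.create_ase_bulk("Al")

print(job.server.list_queues())    # queues defined in your queues configuration
print(job.server.view_queues())    # limits (cores, run time) per queue
job.server.queue = "slurm_queue"   # placeholder queue name
job.server.cores = 8
job.server.run_time = 30000        # run time limit, here in seconds
job.run()                          # now submitted to the queuing system instead of running locally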

Related

How to attach multiple workers to a queue in RabbitMQ

I am using an exchange-based pattern in RabbitMQ.
Producer --> Exchange --> Queues --> Consumer1
How do I run multiple consumers (C1, C2, C3 and so on...) for load balancing and scalability of the consumers?
Is it OK to run ./worker.js two or three times based on usage?
Yes, it should be OK to run your worker multiple times, as that runs multiple instances of your worker listening to the same queue, which achieves what you want. Please refer to this tutorial from RabbitMQ for more info, specifically the section on round-robin dispatching.
To quote a few details:
One of the advantages of using a Task Queue is the ability to easily parallelise work. If we are building up a backlog of work, we can just add more workers and that way, scale easily.
You need three consoles open. Two will run the worker.js script. These consoles will be our two consumers - C1 and C2.
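For illustration only, here is roughly what such a worker could look like in Python using the pika client (the queue name task_queue, the host and the callback body are assumptions, not taken from the question). Starting this script in two or three terminals gives you C1, C2, C3 competing on the same queue, just like running worker.js multiple times:
import pika

# Connect to a local RabbitMQ broker (adjust host/credentials as needed)
connection = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = connection.channel()

# Durable queue shared by all worker instances (name is an assumption)
channel.queue_declare(queue="task_queue", durable=True)

def callback(ch, method, properties, body):
    print("Received %r" % body)  # replace with the real work
    ch.basic_ack(delivery_tag=method.delivery_tag)

# Hand each worker only one unacknowledged message at a time, so messages
# are dispatched to whichever worker is free (round-robin across workers)
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="task_queue", on_message_callback=callback)
channel.start_consuming()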
Just to add to AJS's answer: you may want to make use of a process monitor/manager like Supervisord to manage your long-running program C1 and, most importantly, to run multiple copies of it (C1, C2, C3 and so on...). Just install Supervisor in your environment (local, VPS, Docker etc.); a configuration file like the example below will then run, monitor and restart multiple worker.js processes as needed.
So, create a Supervisor configuration file for your program, e.g. my_awesome_worker.conf, and place it in the /etc/supervisor/conf.d directory.
[program:wise_worker]
process_name=%(program_name)s_%(process_num)02d
command=node /my_app_location/worker.js
autostart=true
autorestart=true
numprocs=4
stderr_logfile=/var/log/myapp.err.log
stdout_logfile=/var/log/myapp.out.log
user=myuser
To update the changes, run
$ sudo supervisorctl reread
$ sudo supervisorctl update
Note that the process_name and numprocs settings are responsible for running 4 worker.js processes (keep numprocs equal to or less than your number of CPUs). numprocs, in combination with the process_name expression %(program_name)s_%(process_num)02d, will create four processes, namely wise_worker_00, wise_worker_01, wise_worker_02 and wise_worker_03.
Verify that they are all running using
$ sudo systemctl status supervisor
or
$ sudo service supervisor status

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in a local virtual machine and another in a remote cloud. Both clusters are in standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to do this:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark master URL to spark://local1:7077 and spark://remote1:7077, and then run my code from IntelliJ IDEA. That is, I don't want to use spark-submit to submit a job.
But I ran into a problem:
When I use the local cluster, everything goes well. Running the code in IntelliJ IDEA or using spark-submit both submit the job to the cluster and finish it.
But when I use the remote cluster, I get a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
It says sufficient resources, not sufficient memory!
And this log keeps printing with no further action. Both spark-submit and running the code in IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to the remote cluster?
If so, what configuration does it need?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my situation is different. When I run my code in IntelliJ IDEA and set the Spark master to the local virtual machine cluster, it works. With the remote cluster I get the Initial job has not accepted any resources;... warning instead.
I want to know whether a security policy or firewall could cause this?
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads within the Spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IntelliJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
were not properly matched to the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
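As an aside, the executor resources can also be requested programmatically on the SparkConf before the context is created, which is convenient when launching from the IDE. The sketch below uses PySpark purely for illustration, and the master URL, input path and values are placeholders; the same setters exist on the Scala SparkConf. Note that spark.driver.memory generally has to be set before the driver JVM starts (e.g. in spark-defaults.conf or via spark-submit), so it is left out here:
from pyspark import SparkConf, SparkContext

# Placeholders: adjust the master URL and resource values to your cluster
conf = (SparkConf()
        .setAppName("word-count-from-ide")
        .setMaster("spark://remote1:7077")
        .set("spark.executor.memory", "8g")
        .set("spark.executor.cores", "8"))

sc = SparkContext(conf=conf)

# Simple word count, just to have something that requests executors
counts = (sc.textFile("input.txt")
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
print(counts.take(10))
sc.stop()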

elasticsearch-mesos not getting listed under frameworks in the Mesos UI

I am trying to run elasticsearch-mesos on Mesos. My machine is running Ubuntu 14.04. I have a running Mesos cluster installed with Mesosphere packages by following these instructions. When I run the test frameworks they get listed under the frameworks section of the Mesos UI, but elasticsearch-mesos does not get listed there. I want to run elasticsearch-mesos on top of Mesos. I followed the instructions given here. When I run ./elasticsearch-mesos I get this message in the terminal:
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
I tried running ./elasticsearch-mesos on both the Mesos masters and slaves.
The last few lines of the terminal output are given below:
2015-01-08 17:24:01,881:23844(0x7f175bfff700):ZOO_INFO#zookeeper_init#786: Initiating
client connection, host=localhost:2181 sessionTimeout=10000 watcher=0x7f1762a3e6a0
sessionId=0 sessionPasswd=<null> context=0x7f1710002530 flags=0
I0108 17:24:01.881392 23858 sched.cpp:137] Version: 0.21.1
2015-01-08 17:24:01,881:23844(0x7f172b7fe700):ZOO_INFO#check_events#1703: initiated
connection to server [127.0.0.1:2181]
2015-01-08 17:24:01,897:23844(0x7f172b7fe700):ZOO_INFO#check_events#1750: session
establishment complete on server [127.0.0.1:2181], sessionId=0x14ac7c469270006,
negotiated timeout=10000
I0108 17:24:01.898455 23861 group.cpp:313] Group process (group(1)#127.0.1.1:38668)
connected to ZooKeeper
I0108 17:24:01.898509 23861 group.cpp:790] Syncing group operations: queue size (joins,
cancels, datas) = (0, 0, 0)
I0108 17:24:01.898540 23861 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
According to the README at https://github.com/mesosphere/elasticsearch-mesos,
you may need to modify mesos.master.url to point to the same ZK url that the Mesos master is using (maybe not localhost). If you're using a single-master Mesos cluster, you can skip the ZK url and point this parameter directly to the Mesos master.
Please also note that the elasticsearch framework is a bit outdated, so use it with caution.

How to delete temporary RabbitMQ queues once the corresponding result has been retrieved?

My question builds off this one: Temporary queue made in Celery
My application needs to retrieve results, as it uploads them to an S3 file. However, the number of temporary queues being made is causing my broker to crash (the machine doesn't have enough memory). I want to delete each temporary queue once the corresponding result has been retrieved. In my Celery client script, I am iterating through a list of results (where each result is from function.delay()):
for result in result_list:
    while True:
        if result.ready():
            # do something with result
            # I WANT TO DELETE THE TEMPORARY QUEUE HERE
            break
Is there any way I can achieve the above, deleting the temporary queue once the result has been retrieved?
I would have used the CELERY_TASK_RESULT_EXPIRES option in my celeryconfig, but I don't know when I can safely clean up the temporary queue, as the result may not have been retrieved yet. Is there any way I can delete specific queues in this script (note that I have the queue id from the result)?
ADDITIONAL NOTE:
I am running all rabbitmq servers in a cluster with HA enabled.
The way I did this was to use the rabbitmqadmin tool from RabbitMQ. I downloaded it via
wget localhost:15672/cli/rabbitmqadmin
after installing the management plugin
rabbitmq-plugins enable rabbitmq_management
Make sure your user has the administrator tag in RabbitMQ, or you will not be able to perform these commands. I then deleted the queue in my script by using Python's subprocess module to call rabbitmqadmin delete queue name='' . Keep in mind that the queue name is the same as the corresponding result id, except without the hyphens.
Also make sure you add the parameters -v myvhost -u myusername -p mypassword to the rabbitmqadmin commands; the default vhost is /.
I believe this will delete queues across all nodes in a cluster, though I am not completely sure of this.
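A rough sketch of that part of the client script, assuming rabbitmqadmin has been downloaded into the working directory as described above; the vhost and credentials are the same placeholders as in the previous paragraph:
import subprocess

def delete_result_queue(result):
    # The temporary queue name is the result id with the hyphens stripped
    queue_name = str(result.id).replace("-", "")
    subprocess.check_call([
        "python", "rabbitmqadmin",
        "--vhost", "myvhost",        # placeholder vhost
        "--username", "myusername",  # placeholder credentials
        "--password", "mypassword",
        "delete", "queue", "name=" + queue_name,
    ])

for result in result_list:
    if result.ready():
        # ... upload the result to S3 ...
        delete_result_queue(result)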

How to configure and run remote celery worker correctly?

I'm new to Celery and may be doing something wrong, but I have already spent a lot of time trying to figure out how to configure Celery correctly.
So, in my environment I have 2 remote servers; one is the main one (it has a public IP address and most of the stuff like the database server, RabbitMQ server and the web server running my web application is there) and the other is used for specific tasks which I want to asynchronously invoke from the main server using Celery.
I was planning to use RabbitMQ as the broker and as the result backend. The Celery config is very basic:
CELERY_IMPORTS = ("main.tasks", )
BROKER_HOST = "Public IP of my main server"
BROKER_PORT = 5672
BROKER_USER = "guest"
BROKER_PASSWORD = "guest"
BROKER_VHOST = "/"
CELERY_RESULT_BACKEND = "amqp"
When I'm running a worker on the main server, tasks are executed just fine, but when I'm running it on the remote server only a few tasks are executed and then the worker gets stuck, unable to execute any task. When I restart the worker it executes a few more tasks and gets stuck again. There is nothing special inside the task and I even tried a test task that just adds 2 numbers. I tried to run the worker differently (daemonizing and not, setting different concurrency and using celeryd_multi), but nothing really helped.
What could be the reason? Did I miss something? Do I have to run something on the main server other than the broker (RabbitMQ)? Or is it a bug in Celery (I tried a few versions: 2.2.4, 2.3.3 and dev, but none of them worked)?
Hm... I've just reproduced the same problem on the local worker, so I don't really know what it is... Is it required to restart the Celery worker after every N executed tasks?
Any help will be very much appreciated :)
I don't know if you ended up solving the problem, but I had similar symptoms. It turned out that (for whatever reason) print statements from within tasks were causing tasks not to complete (maybe some sort of deadlock situation?). Only some of the tasks had print statements, so when these tasks executed, eventually all of the worker processes (set by the concurrency option) were exhausted, which caused tasks to stop executing.
Try to set your celery config to
CELERYD_PREFETCH_MULTIPLIER = 1
CELERYD_MAX_TASKS_PER_CHILD = 1
docs
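For context, this is roughly where those two options would sit in the celeryconfig from the question; the broker values are just the ones already shown above, and whether this resolves the underlying hang depends on the cause:
# celeryconfig.py (Celery 2.x/3.x style setting names, as used above)
CELERY_IMPORTS = ("main.tasks", )
BROKER_HOST = "Public IP of my main server"
BROKER_PORT = 5672
BROKER_USER = "guest"
BROKER_PASSWORD = "guest"
BROKER_VHOST = "/"
CELERY_RESULT_BACKEND = "amqp"

# Reserve only one message per worker process at a time
CELERYD_PREFETCH_MULTIPLIER = 1
# Replace each pool process after it has executed a single task
CELERYD_MAX_TASKS_PER_CHILD = 1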