srun: error: Slurm controller not responding, sleeping and retrying

Running the following command in Slurm:
$ srun -J FRD_gpu --partition=gpu --gres=gpu:1 --time=0-02:59:00 --mem=2000 --ntasks=1 --cpus-per-task=1 --pty /bin/bash -i
Returns the following error:
srun: error: Slurm controller not responding, sleeping and retrying.
The Slurm controller seems to be up:
$ scontrol ping
Slurmctld(primary) at narvi-install is UP
Any idea why this happens and how to resolve it?
$ scontrol -V
slurm 18.08.8
System info: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal up 7-00:00:00 1 drain* me99
normal up 7-00:00:00 3 down* me[64-65,97]
normal up 7-00:00:00 1 drain me89
normal up 7-00:00:00 23 mix me[55,67,86,88,90-94,96,98,100-101],na[27,41-42,44-45,47-49,51-52]
normal up 7-00:00:00 84 alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158],na[01-26,28-40,43,46,50,53-60]
normal up 7-00:00:00 3 idle me[82,151-152]
test* up 4:00:00 1 drain* me99
test* up 4:00:00 3 down* me[64-65,97]
test* up 4:00:00 2 drain me[04,89]
test* up 4:00:00 27 mix me[55,67,86,88,90-94,96,98,100-101,248,260],meg[11-12],na[27,41-42,44-45,47-49,51-52]
test* up 4:00:00 130 alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158,233-247,249-259,261-280],na[01-26,28-40,43,46,50,53-60]
test* up 4:00:00 14 idle me[01-03,50-54,82,151-152],meg10,nag[01,14]
grid up 7-00:00:00 10 mix na[27,41-42,44-45,47-49,51-52]
grid up 7-00:00:00 42 alloc na[01-26,28-32,43,46,50,53-60]
gpu up 7-00:00:00 15 mix meg[11-12],nag[02-10,12-13,16-17]
gpu up 7-00:00:00 4 idle meg10,nag[01,11,15]

If you are positive the Slurm controller is up and running (for instance, the sinfo command responds), SSH to the compute node allocated to your job and run scontrol ping there to test connectivity to the master. If that fails, look for firewall rules blocking the connection from the compute node to the master.
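
One way to separate a firewall problem from a Slurm problem is to probe the controller's TCP port directly from the compute node. The sketch below is only an illustration: can_reach is a hypothetical helper name, it assumes bash and the coreutils timeout command are installed, and 6817 is Slurm's default SlurmctldPort, which your site may have changed (scontrol show config lists the configured value).

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp redirection and the coreutils `timeout` command.
can_reach() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example use on the compute node (host name taken from the scontrol ping output above;
# port 6817 is an assumption, check `scontrol show config | grep SlurmctldPort`):
#   scontrol ping
#   can_reach narvi-install 6817 && echo "port reachable" || echo "port blocked or closed"
```

If scontrol ping fails and the port probe also fails, a firewall or routing problem between the two hosts is the likelier cause; if the port opens but Slurm commands still fail, the problem is more likely in the Slurm configuration itself.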

Related

kdump no crash log in centos 7

rpm -qa | grep kexec
kexec-tools-2.0.15-13.el7.x86_64
cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet
dmesg | grep -i crash
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet
[ 0.000000] Reserving 161MB of memory at 688MB for crashkernel (System RAM: 16383MB)
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet
[ 1.033253] crash memory driver: version 1.1
grep -v ^# /etc/kdump.conf
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31
systemctl status kdump
● kdump.service - Crash recovery kernel arming
   Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
   Active: active (exited) since Fri 2022-08-26 07:34:03 CST; 1h 34min ago
  Process: 996 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
 Main PID: 996 (code=exited, status=0/SUCCESS)
   CGroup: /system.slice/kdump.service
Aug 26 07:34:01 c-1 systemd[1]: Starting Crash recovery kernel arming...
Aug 26 07:34:03 c-1 kdumpctl[996]: kexec: loaded kdump kernel
Aug 26 07:34:03 c-1 kdumpctl[996]: Starting kdump: [OK]
Aug 26 07:34:03 c-1 systemd[1]: Started Crash recovery kernel arming.
kdumpctl status
Kdump is operational
Everything looks fine, but when I run echo c > /proc/sysrq-trigger to test it, it doesn't work (no crash log is produced). Please help, thanks!

Making Dockerized Flask server concurrent

I have a Flask server that I'm running on AWS Fargate. My task has 2 vCPUs and 8 GB of memory. My server is only able to respond to one request at a time. If I send 2 API requests at the same time, each of which takes 7 seconds, the first request takes 7 seconds to return and the second takes 14 seconds.
This is my Dockerfile (based on the tiangolo/uwsgi-nginx-flask image):
FROM tiangolo/uwsgi-nginx-flask:python3.7
COPY ./requirements.txt requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
RUN python3 -m spacy download en
RUN apt-get update
RUN apt-get install wkhtmltopdf -y
RUN apt-get install poppler-utils -y
RUN apt-get install xvfb -y
COPY ./ /app
I have the following config file:
[uwsgi]
module = main
callable = app
enable-threads = true
These are my logs when I start the server:
Checking for script in /app/prestart.sh
Running script /app/prestart.sh
Running inside /app/prestart.sh, you could add migrations to this file, e.g.:
#! /usr/bin/env bash
# Let the DB start
sleep 10;
# Run migrations
alembic upgrade head
/usr/lib/python2.7/dist-packages/supervisor/options.py:298: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2019-10-05 06:29:53,438 CRIT Supervisor running as root (no user in config file)
2019-10-05 06:29:53,438 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2019-10-05 06:29:53,446 INFO RPC interface 'supervisor' initialized
2019-10-05 06:29:53,446 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2019-10-05 06:29:53,446 INFO supervisord started with pid 1
2019-10-05 06:29:54,448 INFO spawned: 'nginx' with pid 9
2019-10-05 06:29:54,450 INFO spawned: 'uwsgi' with pid 10
[uWSGI] getting INI configuration from /app/uwsgi.ini
[uWSGI] getting INI configuration from /etc/uwsgi/uwsgi.ini
;uWSGI instance configuration
[uwsgi]
cheaper = 2
processes = 16
ini = /app/uwsgi.ini
module = main
callable = app
enable-threads = true
ini = /etc/uwsgi/uwsgi.ini
socket = /tmp/uwsgi.sock
chown-socket = nginx:nginx
chmod-socket = 664
hook-master-start = unix_signal:15 gracefully_kill_them_all
need-app = true
die-on-term = true
show-config = true
;end of configuration
*** Starting uWSGI 2.0.18 (64bit) on [Sat Oct 5 06:29:54 2019] ***
compiled with version: 6.3.0 20170516 on 09 August 2019 03:11:53
os: Linux-4.14.138-114.102.amzn2.x86_64 #1 SMP Thu Aug 15 15:29:58 UTC 2019
nodename: ip-10-0-1-217.ec2.internal
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 2
current working directory: /app
detected binary path: /usr/local/bin/uwsgi
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to UNIX address /tmp/uwsgi.sock fd 3
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
Python version: 3.7.4 (default, Jul 13 2019, 14:20:24) [GCC 6.3.0 20170516]
Python main interpreter initialized at 0x55e1e2b181a0
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 1239640 bytes (1210 KB) for 16 cores
*** Operational MODE: preforking ***
2019-10-05 06:29:55,483 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2019-10-05 06:29:55,484 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

running gem5 with SPEC2006

I am running gem5 for X86 in SE mode and trying to run bzip2 from SPEC2006. At first it failed, saying it can't run dynamically linked executables, so I compiled the benchmark with the -static flag.
Now I get this error:
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.
gem5 compiled Oct 27 2018 00:36:02
gem5 started Dec 22 2018 18:16:40
gem5 executing on Dan
command line: ./build/X86/gem5.opt configs/example/se.py -c /home/dan/SPEC2006/benchspec/CPU2006/401.bzip2/exe/bzip2_base.ia64-gcc42 -i /home/dan/SPEC2006/benchspec/CPU2006/401.bzip2/data/test/input/dryer.jpg
Could not import 03_BASE_FLAT
Could not import 03_BASE_NARROW
Global frequency set at 1000000000000 ticks per second
warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (4096 Mbytes)
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue # 0. Starting simulation...
panic: Tried to write unmapped address 0xffffedd8. Inst is at 0x400da4
# tick 5500
[invoke:build/X86/arch/x86/faults.cc, line 160]
Memory Usage: 4316736 KBytes
Program aborted at tick 5500
Aborted (core dumped)
I am running gem5 on Ubuntu 17.10.
I searched Google for solutions but didn't find anyone referring to this problem. Does anyone know how to fix it?

Please check your host machine configuration. In my experience, bzip2 does not work on a 32-bit host: my desktop is a dual-core machine with a 32-bit x86 architecture, and when I tried to run bzip2 there it showed the same error.

Why does my script not work on FreeBSD? (awk: syntax error)

Why does this script not work on FreeBSD? I ran it on CentOS and Debian and all was fine. On FreeBSD (10.2) I encounter the following error:
awk: syntax error at source line 1
context is
match($0, "^listen >>> queue:[[:space:]]+(.*)", <<<
awk: bailing out at source line 1
-0.9902
As an example, here is some output of the php-fpm status page:
pool: www
process manager: ondemand
start time: 29/Feb/2016:15:18:54 +0200
start since: 2083770
accepted conn: 1467128
listen queue: 0
max listen queue: 129
listen queue len: 128
idle processes: 1
active processes: 2
total processes: 3
max active processes: 64
max children reached: 1
slow requests: 0
On CentOS and Debian, when I run:
/path/to/script/php-fpm-check.sh "idle processes" http://127.0.0.1/status
I get 1, but on FreeBSD I get the error mentioned above.

The 3-argument form of match() is a GNU awk extension. You'll have to find another way to capture the match (perhaps using the RSTART and RLENGTH variables, which match() sets as a side effect), or install gawk on your FreeBSD system.
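
As a sketch of the RSTART/RLENGTH approach, the function below pulls one field out of the status text using only the 2-argument match(), which POSIX awk provides. The name fpm_field is hypothetical and this is not the original php-fpm-check.sh; it assumes the status page arrives on stdin in the "key: value" layout shown in the question.

```shell
# Hypothetical helper: print the value of one field from php-fpm status text on stdin.
# Uses only POSIX awk features: index(), substr(), and the 2-argument match().
fpm_field() {
    awk -v key="$1" '
        # Only consider lines that begin with "<key>:".
        index($0, key ":") == 1 {
            rest = substr($0, length(key) + 2)        # text after "<key>:"
            # match() sets RSTART and RLENGTH to the span from the first
            # non-blank character to the end of the line.
            if (match(rest, /[^[:space:]].*$/))
                print substr(rest, RSTART, RLENGTH)
        }
    '
}

# Example use (curl is an assumption; any tool that prints the page to stdout works):
#   curl -s http://127.0.0.1/status | fpm_field "idle processes"
```

Anchoring the key at the start of the line keeps "listen queue" from also matching "max listen queue" or "listen queue len".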

mpich cluster test error, unable to change wdir

I have set up an MPICH2 cluster, and the machinefile is:
pc3@ub3:4 # this will spawn 4 processes on ub3
pc1@ub1 # this will spawn 1 process on ub1
When I run the test program, it should print:
Hello from processor 0 of 8
Hello from processor 1 of 8
Hello from processor 2 of 8
Hello from processor 3 of 8
Hello from processor 4 of 8
Hello from processor 5 of 8
Hello from processor 6 of 8
Hello from processor 7 of 8
But it returned:
pc1@ub1:~$ mpiexec -n 8 -f machinefile ./mpi_hello
[proxy:0:0@ub3] launch_procs (./pm/pmiserv/pmip_cb.c:648): unable to change wdir to /home/pc1 (No such file or directory)
[proxy:0:0@ub3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:893): launch_procs returned error
[proxy:0:0@ub3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ub3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ub1] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@ub1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@ub1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@ub1] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
I have successfully enabled passwordless SSH, so pc1 can connect to pc3 without a password. Even so, I still think there is something wrong with SSH or with access permissions. My OS is Ubuntu 14.04 LTS 32-bit.
Thanks for your help.

Make sure the user name is the same on every node, then change the machinefile to:
ub3:4 # this will spawn 4 processes on ub3
ub1 # this will spawn 1 process on ub1
Then copy the compiled binary to the corresponding directory on every node.
Also make sure every node's hostname is listed in the /etc/hosts file on all the nodes.
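
Since the error says the proxy could not change into /home/pc1 on ub3, it may help to confirm that the launch directory exists on every host before calling mpiexec. The following is a rough sketch, not part of MPICH: machinefile_hosts is a hypothetical helper that assumes the simple "host[:count] # comment" machinefile layout shown above.

```shell
# Hypothetical helper: list the host names in a machinefile, one per line.
# Strips "# comment" text, drops blank lines, and removes the optional ":N" count.
machinefile_hosts() {
    sed -e 's/[[:space:]]*#.*$//' \
        -e '/^[[:space:]]*$/d' \
        -e 's/:.*$//' "$1"
}

# Example use from the directory where mpiexec is run (assumes passwordless SSH):
#   for h in $(machinefile_hosts machinefile); do
#       ssh "$h" "test -d '$PWD'" || echo "$h: $PWD is missing"
#   done
```

Any host reported as missing the directory needs the same path created and the compiled program copied into it.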
