srun: error: Slurm controller not responding, sleeping and retrying - ssh
Running the following command in Slurm:
$ srun -J FRD_gpu --partition=gpu --gres=gpu:1 --time=0-02:59:00 --mem=2000 --ntasks=1 --cpus-per-task=1 --pty /bin/bash -i
Returns the following error:
srun: error: Slurm controller not responding, sleeping and retrying.
The Slurm controller seems to be up:
$ scontrol ping
Slurmctld(primary) at narvi-install is UP
Any idea why this happens and how to resolve it?
$ scontrol -V
slurm 18.08.8
System info: gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
normal up 7-00:00:00 1 drain* me99
normal up 7-00:00:00 3 down* me[64-65,97]
normal up 7-00:00:00 1 drain me89
normal up 7-00:00:00 23 mix me[55,67,86,88,90-94,96,98,100-101],na[27,41-42,44-45,47-49,51-52]
normal up 7-00:00:00 84 alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158],na[01-26,28-40,43,46,50,53-60]
normal up 7-00:00:00 3 idle me[82,151-152]
test* up 4:00:00 1 drain* me99
test* up 4:00:00 3 down* me[64-65,97]
test* up 4:00:00 2 drain me[04,89]
test* up 4:00:00 27 mix me[55,67,86,88,90-94,96,98,100-101,248,260],meg[11-12],na[27,41-42,44-45,47-49,51-52]
test* up 4:00:00 130 alloc me[56-63,66,68-74,76-81,83-85,87,95,102,153-158,233-247,249-259,261-280],na[01-26,28-40,43,46,50,53-60]
test* up 4:00:00 14 idle me[01-03,50-54,82,151-152],meg10,nag[01,14]
grid up 7-00:00:00 10 mix na[27,41-42,44-45,47-49,51-52]
grid up 7-00:00:00 42 alloc na[01-26,28-32,43,46,50,53-60]
gpu up 7-00:00:00 15 mix meg[11-12],nag[02-10,12-13,16-17]
gpu up 7-00:00:00 4 idle meg10,nag[01,11,15]
If you are positive the Slurm controller is up and running (for instance, sinfo responds), SSH to the compute node allocated to your job and run scontrol ping there to test connectivity to the controller. If it fails, look for firewall rules blocking the connection from the compute node to the controller.
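For example, a rough sketch of that check (nag01 is just one of the idle gpu nodes from the sinfo output above, and 6817 is Slurm's default SlurmctldPort; substitute your own node name and whatever port your slurm.conf sets):
$ ssh nag01
$ scontrol ping
Slurmctld(primary) at narvi-install is UP
$ nc -zv narvi-install 6817
If scontrol ping fails or nc times out or is refused from the node, inspect the firewall on both the node and the controller (for example with iptables -L -n) and make sure nothing drops traffic to the slurmctld port.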
Related
kdump no crash log in centos 7
rpm -qa | grep kexec
kexec-tools-2.0.15-13.el7.x86_64

cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet

dmesg | grep -i crash
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet
[ 0.000000] Reserving 161MB of memory at 688MB for crashkernel (System RAM: 16383MB)
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.10.0-862.el7.x86_64 root=/dev/mapper/centos-root ro crashkernel=auto rd.lvm.lv=centos/root rd.lvm.lv=centos/swap rhgb quiet
[ 1.033253] crash memory driver: version 1.1

grep -v ^# /etc/kdump.conf
path /var/crash
core_collector makedumpfile -l --message-level 1 -d 31

systemctl status kdump
● kdump.service - Crash recovery kernel arming
Loaded: loaded (/usr/lib/systemd/system/kdump.service; enabled; vendor preset: enabled)
Active: active (exited) since Fri 2022-08-26 07:34:03 CST; 1h 34min ago
Process: 996 ExecStart=/usr/bin/kdumpctl start (code=exited, status=0/SUCCESS)
Main PID: 996 (code=exited, status=0/SUCCESS)
CGroup: /system.slice/kdump.service
Aug 26 07:34:01 c-1 systemd[1]: Starting Crash recovery kernel arming...
Aug 26 07:34:03 c-1 kdumpctl[996]: kexec: loaded kdump kernel
Aug 26 07:34:03 c-1 kdumpctl[996]: Starting kdump: [OK]
Aug 26 07:34:03 c-1 systemd[1]: Started Crash recovery kernel arming.

kdumpctl status
Kdump is operational

Everything looks fine, but when I run echo c > /proc/sysrq-trigger to test it, no crash dump is produced. Please help, thanks!
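A couple of additional checks that may help narrow this down (a rough sketch; the commands are standard on CentOS 7, but whether they explain the missing dump here is an assumption):
cat /sys/kernel/kexec_crash_loaded    # should print 1 when the kdump kernel is actually loaded
grep -i "crash kernel" /proc/iomem    # shows the memory region reserved by crashkernel=auto
df -h /var/crash                      # the dump target needs enough free space for the vmcore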
Making Dockerized Flask server concurrent
I have a Flask server that I'm running on AWS Fargate. My task has 2 vCPUs and 8 GB of memory. My server is only able to respond to one request at a time. If I run 2 API requests at the same time, each of which takes 7 seconds, the first request will take 7 seconds to return and the second will take 14 seconds to return. This is my Dockerfile (using this repo):

FROM tiangolo/uwsgi-nginx-flask:python3.7
COPY ./requirements.txt requirements.txt
RUN pip3 install --no-cache-dir -r requirements.txt
RUN python3 -m spacy download en
RUN apt-get update
RUN apt-get install wkhtmltopdf -y
RUN apt-get install poppler-utils -y
RUN apt-get install xvfb -y
COPY ./ /app

I have the following config file:

[uwsgi]
module = main
callable = app
enable-threads = true

These are my logs when I start the server:

Checking for script in /app/prestart.sh
Running script /app/prestart.sh
Running inside /app/prestart.sh, you could add migrations to this file, e.g.:
#! /usr/bin/env bash
# Let the DB start
sleep 10;
# Run migrations
alembic upgrade head
/usr/lib/python2.7/dist-packages/supervisor/options.py:298: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
'Supervisord is running as root and it is searching '
2019-10-05 06:29:53,438 CRIT Supervisor running as root (no user in config file)
2019-10-05 06:29:53,438 INFO Included extra file "/etc/supervisor/conf.d/supervisord.conf" during parsing
2019-10-05 06:29:53,446 INFO RPC interface 'supervisor' initialized
2019-10-05 06:29:53,446 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2019-10-05 06:29:53,446 INFO supervisord started with pid 1
2019-10-05 06:29:54,448 INFO spawned: 'nginx' with pid 9
2019-10-05 06:29:54,450 INFO spawned: 'uwsgi' with pid 10
[uWSGI] getting INI configuration from /app/uwsgi.ini
[uWSGI] getting INI configuration from /etc/uwsgi/uwsgi.ini
;uWSGI instance configuration
[uwsgi]
cheaper = 2
processes = 16
ini = /app/uwsgi.ini
module = main
callable = app
enable-threads = true
ini = /etc/uwsgi/uwsgi.ini
socket = /tmp/uwsgi.sock
chown-socket = nginx:nginx
chmod-socket = 664
hook-master-start = unix_signal:15 gracefully_kill_them_all
need-app = true
die-on-term = true
show-config = true
;end of configuration
*** Starting uWSGI 2.0.18 (64bit) on [Sat Oct 5 06:29:54 2019] ***
compiled with version: 6.3.0 20170516 on 09 August 2019 03:11:53
os: Linux-4.14.138-114.102.amzn2.x86_64 #1 SMP Thu Aug 15 15:29:58 UTC 2019
nodename: ip-10-0-1-217.ec2.internal
machine: x86_64
clock source: unix
pcre jit disabled
detected number of CPU cores: 2
current working directory: /app
detected binary path: /usr/local/bin/uwsgi
your memory page size is 4096 bytes
detected max file descriptor number: 1024
lock engine: pthread robust mutexes
thunder lock: disabled (you can enable it with --thunder-lock)
uwsgi socket 0 bound to UNIX address /tmp/uwsgi.sock fd 3
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
Python version: 3.7.4 (default, Jul 13 2019, 14:20:24) [GCC 6.3.0 20170516]
Python main interpreter initialized at 0x55e1e2b181a0
uWSGI running as root, you can use --uid/--gid/--chroot options
*** WARNING: you are running uWSGI as root !!! (use the --uid flag) ***
python threads support enabled
your server socket listen backlog is limited to 100 connections
your mercy for graceful operations on workers is 60 seconds
mapped 1239640 bytes (1210 KB) for 16 cores
*** Operational MODE: preforking ***
2019-10-05 06:29:55,483 INFO success: nginx entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2019-10-05 06:29:55,484 INFO success: uwsgi entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
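The merged configuration above shows the image's defaults (cheaper = 2, processes = 16) alongside the custom /app/uwsgi.ini. One thing worth experimenting with, as a rough sketch rather than a confirmed fix, is an /app/uwsgi.ini that states the worker and thread counts explicitly (the numbers below are only illustrative and should be sized to the task's 2 vCPUs):

[uwsgi]
module = main
callable = app
enable-threads = true
processes = 4
threads = 2

If requests are still serialized with several workers available, the bottleneck may be in the application itself (for example a shared lock or a blocking external call), which no uWSGI setting can fix.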
running gem5 with SPEC2006
When running gem5 X86 in SE mode, I am trying to run bzip2 from SPEC2006. At first it failed because gem5 said it cannot run dynamically linked executables, so I compiled the benchmark with the -static flag. Now I get this error:

gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.
gem5 compiled Oct 27 2018 00:36:02
gem5 started Dec 22 2018 18:16:40
gem5 executing on Dan
command line: ./build/X86/gem5.opt configs/example/se.py -c /home/dan/SPEC2006/benchspec/CPU2006/401.bzip2/exe/bzip2_base.ia64-gcc42 -i /home/dan/SPEC2006/benchspec/CPU2006/401.bzip2/data/test/input/dryer.jpg
Could not import 03_BASE_FLAT
Could not import 03_BASE_NARROW
Global frequency set at 1000000000000 ticks per second
warn: DRAM device capacity (8192 Mbytes) does not match the address range assigned (4096 Mbytes)
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
panic: Tried to write unmapped address 0xffffedd8.
Inst is at 0x400da4 @ tick 5500
[invoke:build/X86/arch/x86/faults.cc, line 160]
Memory Usage: 4316736 KBytes
Program aborted at tick 5500
Aborted (core dumped)

I am running gem5 on Ubuntu 17.10. I searched on Google but did not find anyone referring to this problem. Does anyone know how to fix it?
Please check your host machine configuration. bzip2 does not work on a 32-bit machine. My desktop is a dual-core machine with a 32-bit x86 architecture, and when I tried to run bzip2 it showed the same error.
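As a quick sanity check (a sketch; the binary path is simply the one from the command line above), you can verify both the host architecture and how the benchmark was compiled:
$ uname -m
$ file /home/dan/SPEC2006/benchspec/CPU2006/401.bzip2/exe/bzip2_base.ia64-gcc42
file reports whether the executable is a 32-bit or 64-bit ELF and whether it is statically linked, which should tell you if this answer applies to your setup.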
Why does my script not work on FreeBSD? (awk: syntax error)
Why does this script not work on FreeBSD? I ran the script on CentOS and Debian, and all was fine. On FreeBSD (10.2) I encounter the following error:

awk: syntax error at source line 1
context is
match($0, "^listen >>> queue:[[:space:]]+(.*)", <<<
awk: bailing out at source line 1
-0.9902

As an example, here is some output of php-fpm status:

pool: www
process manager: ondemand
start time: 29/Feb/2016:15:18:54 +0200
start since: 2083770
accepted conn: 1467128
listen queue: 0
max listen queue: 129
listen queue len: 128
idle processes: 1
active processes: 2
total processes: 3
max active processes: 64
max children reached: 1
slow requests: 0

On CentOS and Debian, when I run:

/path/to/script/php-fpm-check.sh "idle processes" http://127.0.0.1/status

I get 1, but on FreeBSD I get the error mentioned above.
The 3-argument form of match() is a GNU awk extension (docs). You'll have to find another way to capture the match (perhaps using the RSTART and RLENGTH variables that match() sets as a side effect), or install gawk on your FreeBSD system.
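For example, a POSIX-compatible rewrite of the extraction step might look like this (a sketch: the exact invocation inside php-fpm-check.sh is not shown, so the key name and the curl pipeline are assumptions based on the question):

curl -s http://127.0.0.1/status | awk -v key="idle processes" 'match($0, "^" key ":[[:space:]]+") { print substr($0, RSTART + RLENGTH) }'

The 2-argument match() is POSIX and sets RSTART/RLENGTH in both BSD awk and gawk, so the same command prints 1 on FreeBSD, CentOS, and Debian for the sample output above.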
mpich cluster test error, unable to change wdir
I have built up an MPICH2 cluster, and the machinefile is:

pc3@ub3:4 # this will spawn 4 processes on ub3
pc1@ub1 # this will spawn 1 process on ub1

When I run the test program, it should print:

Hello from processor 0 of 8
Hello from processor 1 of 8
Hello from processor 2 of 8
Hello from processor 3 of 8
Hello from processor 4 of 8
Hello from processor 5 of 8
Hello from processor 6 of 8
Hello from processor 7 of 8

But it returned:

pc1@ub1:~$ mpiexec -n 8 -f machinefile ./mpi_hello
[proxy:0:0@ub3] launch_procs (./pm/pmiserv/pmip_cb.c:648): unable to change wdir to /home/pc1 (No such file or directory)
[proxy:0:0@ub3] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:893): launch_procs returned error
[proxy:0:0@ub3] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:0@ub3] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ub1] control_cb (./pm/pmiserv/pmiserv_cb.c:202): assert (!closed) failed
[mpiexec@ub1] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@ub1] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:197): error waiting for event
[mpiexec@ub1] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion

I have successfully enabled passwordless SSH, so pc1 can connect to pc3 without a password. Even so, I still suspect something is wrong with SSH or access permissions. My OS is Ubuntu 14.04 LTS 32-bit. Thanks for the help.
Make sure all the user names are the same, so change the machinefile to:

ub3:4 # this will spawn 4 processes on ub3
ub1 # this will spawn 1 process on ub1

Then copy all the compiled files to the corresponding directory on each node, and make sure all the hostnames are present in every node's /etc/hosts file.
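Alternatively, if the user names have to stay different, the "unable to change wdir" error itself can be worked around by giving mpiexec a working directory that exists on every node. A rough sketch (assuming the binary has been copied to /tmp on both machines; -wdir is a standard Hydra mpiexec option):

pc1@ub1:~$ mpiexec -n 8 -f machinefile -wdir /tmp /tmp/mpi_hello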