byobu (tmux) hangs shortly after startup

Yesterday, I was happily using Byobu/tmux (byobu version 5.74, tmux 2.0) with zsh as my shell on my machine. Since this morning, every byobu session hangs after a short time and no longer accepts any input. I start a session and, for a while (around 0-15 seconds), I see the clock updating in the status bar; then it suddenly stops and I cannot do anything apart from killing tmux.
I already deleted ~/.byobu, which did not change anything. Changing my shell back to bash did not help either. Running byobu as root does not exhibit the problem, and switching to screen as the backend also fixes it.
When running byobu with strace, the last lines of the log before the hang are:
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 4294967295) = 1 ([{fd=6, revents=POLLIN}])
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 5
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 7
close(5) = 0
close(7) = 0
recvmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\317\0\0\0\20\0\0\0\10\0\0\0\377\377\377\377", 65535}], msg_controllen=0, msg_flags=0}, 0) = 16
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN|POLLOUT}], 2, 4294967295) = 1 ([{fd=6, revents=POLLOUT}])
sendmsg(6, {msg_name(0)=NULL, msg_iov(1)=[{"\320\0\0\0\20\0\0\0\10\0\0\0\377\377\377\377", 16}], msg_controllen=0, msg_flags=0}, 0) = 16
poll([{fd=4, events=POLLIN}, {fd=6, events=POLLIN}], 2, 4294967295) = 1 ([{fd=4, revents=POLLIN}])
recvfrom(4, "a", 1, 0, NULL, NULL) = 1
[ last two lines repeated 20 times ]
I can also attach the complete trace if that helps, or provide other debugging data if told how to :)

Related

apscheduler loses a job after reboot

I've run into a problem with Python APScheduler.
I've made a simple script:
from apscheduler.schedulers.background import BackgroundScheduler
from time import sleep
from datetime import datetime

scheduler = BackgroundScheduler({
    'apscheduler.jobstores.default': {
        'type': 'redis',
        'host': "127.0.0.1",
        'port': 6379,
        'db': 0,
        'encoding': "utf-8",
        'encoding_errors': "strict",
        'decode_responses': False,
        'retry_on_timeout': False,
        'ssl': False,
        'ssl_cert_reqs': "required"
    },
    'apscheduler.executors.default': {
        'class': 'apscheduler.executors.pool:ThreadPoolExecutor',
        'max_workers': '20'
    },
    'apscheduler.job_defaults.coalesce': 'false',
    'apscheduler.job_defaults.max_instances': '3',
    'apscheduler.timezone': 'UTC',
}, daemon=False)

scheduler.start()

def testfunc():
    with open('./data.log', 'a') as f:
        f.write(f'{datetime.now()}\n')

scheduler.add_job(testfunc, 'interval', minutes=1, id='my_job_id')
scheduler.add_job(testfunc, 'date', run_date="2120-1-1 11:12:13", id='my_job_id_2')

while True:
    scheduler.print_jobs()
    sleep(10)
So, I'm using BackgroundScheduler with a ThreadPoolExecutor and a Redis jobstore, quite simple, as in the documentation.
It works fine: the tasks are added, and I can see the data in redis-cli as well.
Then I reboot the server and check the data in Redis again. All I see is the my_job_id_2 task; the one with the interval trigger has disappeared completely.
Redis is set to save data to RDB every minute. The same thing happens if I execute the SAVE command in redis-cli before the reboot.
Why is this happening?
Redis is an in-memory store; unless you configure persistence, data will be dropped between reboots.
https://redis.io/topics/persistence details the options available to you.
Are you sure it's saving?
Check the configuration:
redis-cli config get save
1) "save"
2) "3600 1 300 100 60 10000"
Check the redis log to ensure the save has occurred:
1:M 16 Sep 2021 16:06:31.375 * DB saved on disk
Example with an error:
547130:M 16 Sep 2021 09:12:58.100 # Error moving temp DB file temp-547130.rdb on the final destination dump.rdb (in server root dir /home/namizaru/redis/src): Operation not permitted
On restart, make sure the redis log shows the file being loaded:
549156:M 16 Sep 2021 09:14:31.522 * DB loaded from disk: 0.000 seconds
Consider using AOF, as it logs all of the commands, so you'll at least be able to figure out which commands are getting missed.
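If you would rather run these checks from Python than from redis-cli, something along these lines works with the redis-py client (a minimal sketch, assuming Redis on 127.0.0.1:6379 as in the question's config):
# Minimal sketch using redis-py to verify persistence settings and the last save.
import redis

r = redis.Redis(host="127.0.0.1", port=6379, db=0, decode_responses=True)

print(r.config_get("save"))        # RDB save points, e.g. {'save': '3600 1 300 100 60 10000'}
print(r.config_get("appendonly"))  # {'appendonly': 'yes'} if AOF is enabled

info = r.info("persistence")
print(info["rdb_last_bgsave_status"])  # should be 'ok'
print(info["rdb_last_save_time"])      # unix timestamp of the last successful RDB save
print(info["loading"])                 # 1 while Redis is still loading a dump at startup
If rdb_last_bgsave_status is anything other than 'ok', the log entries above will usually tell you why the save failed.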

Application failed to start after timestamp check failed in monit

I have a simple monit control file which contains the following:
check process fooBar1 with pidfile fooBar1PidFile
    start program = "/etc/init.d/fooBar1 start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar1 stop"
    if 5 restarts within 5 cycles then unmonitor

check process fooBar2 with pidfile fooBar2PidFile
    start program = "/etc/init.d/fooBar2 start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar2 stop"
    if 5 restarts within 5 cycles then unmonitor

check process fooBar with pidfile fooBarPidFile
    start program = "/etc/init.d/fooBar start" with timeout 10 seconds
    stop program = "/etc/init.d/fooBar stop"
    if 5 restarts within 5 cycles then unmonitor
    if memory usage > 25.0 MB for 4 cycles then alert
    depends on fooBar1
    depends on fooBar2
    depends on checkFile

check file checkFile with path pathToFile
    if timestamp > 4 minute for 8 cycles then restart
The intention here is to restart the fooBar, fooBar1 and fooBar2 applications when the timestamp check for checkFile fails. But what actually happens is that monit tries to restart checkFile instead of fooBar.
This check was working fine with monit version 5.5, but it does not work with 5.18.
This is what I get when the timestamp check fails and the 8 cycles have elapsed:
'checkFile' timestamp for pathToFile failed -- current timestamp is Fri, 08 Dec 2017 12:47:04
'fooBar' failed to start -- could not start required services: 'checkFile'
'fooBar' start action failed
Am I missing something here?
Thanks in advance
I tried a workaround:
check file checkFile with path pathToFile
    if timestamp > 4 minute for 8 cycles then
        exec "/etc/init.d/fooBar restart"
        repeat every 1 cycles
This restarts the fooBar application every cycle while the timestamp check fails, but I am wondering whether there is a better way to do it.

open MPI - ring_c on multiple hosts fails

I have recently installed Open MPI on two Ubuntu 14.04 hosts and I am now testing its functionality with the two provided test programs hello_c and ring_c. The hosts are called 'hermes' and 'zeus', and both have the user 'mpiuser', who can log in non-interactively (via ssh-agent).
The commands mpirun hello_c and mpirun --host hermes,zeus hello_c both work properly.
Calling mpirun --host zeus ring_c locally also works. The output is the same for both hermes and zeus:
mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host zeus ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting
But calling mpirun --host zeus,hermes ring_c fails and gives the following output:
mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host hermes,zeus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[zeus:2930] *** An error occurred in MPI_Recv
[zeus:2930] *** on communicator MPI_COMM_WORLD
[zeus:2930] *** MPI_ERR_TRUNCATE: message truncated
[zeus:2930] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Process 0 sent to 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 2930 on
node zeus exiting improperly. There are two reasons this could occur:
1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.
2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"
This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
I haven't found any documentation on how to solve such a problem and I don't have a clue where to look for the mistake on the basis of the error output. How can I fix this?
You've changed two things between the first and second runs - you've increased the number of processes from 1 to 2, and run on multiple hosts rather than a single host.
I'd suggest you first check you can run on 2 processes on the same host:
mpirun -n 2 ring_c
and see what you get.
When debugging on a cluster it's often useful to know where each process is running. You should always print out the total number of processes as well. Try using the following code at the top of ring_c.c:
char nodename[MPI_MAX_PROCESSOR_NAME];
int namelen;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
MPI_Get_processor_name(nodename, &namelen);
printf("Rank %d out of %d running on node %s\n", rank, size, nodename);
The error you're getting is saying that the incoming message is too large for the receive buffer, which is weird given that the code always sends and receives a single integer.

What are the best ways to automate a GDB debugging session?

Does GDB have a built-in scripting mechanism, should I code up an expect script, or is there an even better solution out there?
I'll be sending the same sequence of commands every time and I'll be saving the output of each command to a file (most likely using GDB's built-in logging mechanism, unless someone has a better idea).
Basically, in this example I wanted to get some variable values at particular places in the code, and have them printed until the program crashes. So here, first, is a little program which is guaranteed to crash in a few steps, test.c:
#include <stdio.h>
#include <stdlib.h>

int icount = 1; // default value

int main(int argc, char *argv[])
{
    int i;

    if (argc == 2) {
        icount = atoi(argv[1]);
    }

    i = icount;
    while (i > -1) {
        int b = 5 / i;
        printf(" 5 / %d = %d \n", i, b );
        i = i - 1;
    }

    printf("Finished\n");
    return 0;
}
The only reason the program accepts command-line arguments is to be able to choose the number of steps before crashing - and to show that gdb ignores --args in batch mode. I compile this with:
gcc -g test.c -o test.exe
Then I prepare the following script - the main trick here is to assign a command to each breakpoint, which will eventually continue (see also Automate gdb: show backtrace at every call to function puts). I call this script test.gdb:
# http://sourceware.org/gdb/wiki/FAQ: to disable the
# "---Type <return> to continue, or q <return> to quit---"
# in batch mode:
set width 0
set height 0
set verbose off
# at entry point - cmd1
b main
commands 1
print argc
continue
end
# printf line - cmd2
b test.c:17
commands 2
p i
p b
continue
end
# int b = line - cmd3
b test.c:16
commands 3
p i
p b
continue
end
# show arguments for program
show args
printf "Note, however: in batch mode, arguments will be ignored!\n"
# note: even if arguments are shown;
# must specify cmdline arg for "run"
# when running in batch mode! (then they are ignored)
# below, we specify command line argument "2":
run 2 # run
#start # alternative to run: runs to main, and stops
#continue
Note that, if you intend to use it in batch mode, you have to "start up" the script at the end, with run or start or something similar.
With this script in place, I can call gdb in batch mode - which will generate the following output in the terminal:
$ gdb --batch --command=test.gdb --args ./test.exe 5
Breakpoint 1 at 0x804844d: file test.c, line 10.
Breakpoint 2 at 0x8048485: file test.c, line 17.
Breakpoint 3 at 0x8048473: file test.c, line 16.
Argument list to give program being debugged when it is started is "5".
Note, however: in batch mode, arguments will be ignored!
Breakpoint 1, main (argc=2, argv=0xbffff424) at test.c:10
10 if (argc == 2) {
$1 = 2
Breakpoint 3, main (argc=2, argv=0xbffff424) at test.c:16
16 int b = 5 / i;
$2 = 2
$3 = 134513899
Breakpoint 2, main (argc=2, argv=0xbffff424) at test.c:17
17 printf(" 5 / %d = %d \n", i, b );
$4 = 2
$5 = 2
5 / 2 = 2
Breakpoint 3, main (argc=2, argv=0xbffff424) at test.c:16
16 int b = 5 / i;
$6 = 1
$7 = 2
Breakpoint 2, main (argc=2, argv=0xbffff424) at test.c:17
17 printf(" 5 / %d = %d \n", i, b );
$8 = 1
$9 = 5
5 / 1 = 5
Breakpoint 3, main (argc=2, argv=0xbffff424) at test.c:16
16 int b = 5 / i;
$10 = 0
$11 = 5
Program received signal SIGFPE, Arithmetic exception.
0x0804847d in main (argc=2, argv=0xbffff424) at test.c:16
16 int b = 5 / i;
Note that while we specify the command-line argument 5, the loop still spins only two times (as specified by run 2 in the gdb script); if run had no arguments, it would spin only once (the program's default value), confirming that --args ./test.exe 5 is ignored.
However, since this is now produced in a single call and without any user interaction, the command-line output can easily be captured in a text file using bash redirection, say:
gdb --batch --command=test.gdb --args ./test.exe 5 > out.txt
There is also an example of using Python to automate gdb on C code in GDB auto stepping - automatic printout of lines, while free running?
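For completeness, gdb's embedded Python API can express the same breakpoint-plus-print loop without a .gdb command file. Here is a minimal sketch against the test.c example above (the script name trace.py is arbitrary; run it with gdb --batch -x trace.py ./test.exe):
# trace.py - minimal sketch of scripting gdb via its built-in Python API
import gdb

class PrintAndContinue(gdb.Breakpoint):
    """Prints i and b each time the breakpoint is hit, then keeps running."""
    def stop(self):
        i = gdb.parse_and_eval("i")
        b = gdb.parse_and_eval("b")
        gdb.write("i = %s, b = %s\n" % (i, b))
        return False  # returning False means: do not actually stop

PrintAndContinue("test.c:17")  # the printf line, as in test.gdb above
gdb.execute("run 2")           # arguments given to run are honoured here, too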
gdb also executes the file .gdbinit on startup.
So you can add your commands to this file and see if that works for you.
This is an example .gdbinit that prints a backtrace for every call to f():
set pagination off
set logging file gdb.txt
set logging on
file a.out
b f
commands
bt
continue
end
info breakpoints
r
set logging off
quit
If a -x with a file is too much for you, just use multiple -ex's.
This is an example that tracks a running program, showing (and saving) the backtrace when it crashes:
sudo gdb -p "$(pidof my-app)" -batch \
    -ex "set logging on" \
    -ex continue \
    -ex "bt full" \
    -ex quit

Is it possible to ack nagios alerts from the terminal?

I have nagios alerts set up to come through jabber with an http link to ack.
Is it possible to have a script I can run from a terminal on a remote workstation that takes the hostname as a parameter and acks the alert?
./ack hostname
The benefit, while seemingly mundane, is threefold. First, it takes HTTP load off Nagios. Second, Nagios HTTP pages can take 10-20 seconds to load, so I want to save time there. Third, it avoids the slower combination of mouse + web interface + an annoyingly slow browser.
Ideally, I would like a script bound to a keyboard shortcut that simply acks the most recent alert. Finally, I want to take the inputs from a joystick, buttons and whatnot, and wire one up to a big red button bound to the script, so I can ack the most recent Nagios alert just by hitting the button. (It would be rad, too, if the button's enclosure had a screen showing the text of the alert being acked.)
Make fun of me all you want, but this is actually something that would be useful to me. If I can save five seconds per alert, and I get 200 alerts per day that I need to ack, that's over 15 minutes a day. And isn't the whole point of being a sysadmin to automate whatever can be automated?
Thanks!
Yes, it's possible - you can see which services need acking by parsing the /var/lib/nagios3/retention.dat file.
See:
#!/usr/bin/env python
# -*- coding: utf8 -*-
# vim:ts=4:sw=4

import sys

file = "/var/lib/nagios3/retention.dat"

try:
    sys.argv[1]
except:
    print("Usage:\n" + sys.argv[0] + " <HOST>\n")
    sys.exit(1)

f = open(file, "r")
line = f.readline()

c = 0
name = {}
state = {}
host = {}

while line:
    if "service_description=" in line:
        name[c] = line.split("=", 2)[1]
    elif "current_state=" in line:
        state[c] = line.split("=", 2)[1]
    elif "host_name=" in line:
        host[c] = line.split("=", 2)[1]
    elif "}" in line:
        c += 1
    line = f.readline()

for i in name:
    num = int(state[i])
    if num > 0 and sys.argv[1] == host[i].strip():
        print(name[i].strip("\n"))
You simply have to pass the host as a parameter, and the script will display that host's broken services.
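To actually acknowledge an alert from the terminal, rather than just list the broken services, you can write an external command to Nagios's command pipe. The following is a minimal sketch, assuming the Debian nagios3 default command-file path and that external commands are enabled (check_external_commands=1 in nagios.cfg); the script name ack.py and its defaults are just illustrative:
#!/usr/bin/env python
# Minimal sketch: acknowledge a service problem by writing an external
# command to Nagios's command pipe. Adjust CMD_FILE for your installation.
# Usage: ./ack.py <HOST> <SERVICE>
import sys
import time

CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"  # assumed Debian nagios3 default

def ack_service(hostname, service, author="ack-script", comment="acked from terminal"):
    now = int(time.time())
    # [time] ACKNOWLEDGE_SVC_PROBLEM;<host>;<service>;<sticky>;<notify>;<persistent>;<author>;<comment>
    cmd = "[%d] ACKNOWLEDGE_SVC_PROBLEM;%s;%s;2;1;1;%s;%s\n" % (
        now, hostname, service, author, comment)
    with open(CMD_FILE, "w") as f:
        f.write(cmd)

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: %s <HOST> <SERVICE>" % sys.argv[0])
        sys.exit(1)
    ack_service(sys.argv[1], sys.argv[2])
Host alerts work the same way with ACKNOWLEDGE_HOST_PROBLEM, which takes the same fields minus the service description. The user running the script needs write access to the command file (typically membership in the nagios group).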