RabbitMQ lager_error_logger_h dropped messages - rabbitmq

Please help me solve this problem.
Here is what I have:
RabbitMQ - 3.7.2
Erlang - 20.1
Connections: 527
Channels: 500
Exchanges: 49
Queues: 4437
Consumers: 131
Publish rate ~ 200/s
Ack rate ~ 200/s
Config:
disk_free_limit.absolute = 5GB
log.default.level = warning
log.file.level = warning
Messages like these constantly appear in the logs:
11:42:16.000 [warning] <0.32.0> lager_error_logger_h dropped 105 messages in the last second that exceeded the limit of 100 messages/sec
11:42:17.000 [warning] <0.32.0> lager_error_logger_h dropped 101 messages in the last second that exceeded the limit of 100 messages/sec
11:42:18.000 [warning] <0.32.0> lager_error_logger_h dropped 177 messages in the last second that exceeded the limit of 100 messages/sec
How do I get rid of them correctly? How can I remove these messages from the logs?

The RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
The message means that RabbitMQ is generating a very large number of error messages and that they are being dropped to avoid filling the log too rapidly. If "dropped X messages in the last second" is the only message you are seeing in the logs, you need to find out what messages are being dropped in order to get to the root of the problem. You can do this by temporarily raising the limit with the following command:
rabbitmqctl eval '[lager:set_loghwm(H, 250) || H <- gen_event:which_handlers(lager_event)].'
You should then see a much larger number of messages, which should reveal the underlying issue. To revert to the previous setting, run this command:
rabbitmqctl eval '[lager:set_loghwm(H, 50) || H <- gen_event:which_handlers(lager_event)].'
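If you want the higher watermark to persist across restarts while you investigate, lager's error_logger_hwm setting can, as far as I know, also be set through advanced.config. This is only a sketch, assuming your RabbitMQ version passes the setting through to lager (the value 250 mirrors the temporary limit above; the default is 50):
%% advanced.config (assumption: the lager error_logger_hwm application
%% environment variable is honoured by your RabbitMQ version)
[
  {lager, [
    {error_logger_hwm, 250}
  ]}
].
Either way, the real fix is to track down whatever is generating hundreds of log messages per second, not to silence the warning.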

Related

Monit - Check if process exists and kill when consuming X memory

I have 'someprocess' running on some hosts, and want a single monit check for hosts with and without 'someprocess' running, which kills someprocess when memory usage exceeds a threshold.
The check below works, but on hosts not running someprocess, monit continually logs "process is not running" in /var/log/monit.log.
check process someprocess
matching "someprocess"
if memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
I want to also include 'if exists', but I keep getting monit syntax errors; I'm not sure I can have more than one if statement.
Does anyone know if I can do this, so have something along the lines of:
if exists AND memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
You can add an additional test and disable the monitoring if the application is not available, to avoid flooding the logs with messages.
check process someprocess matching "someprocess"
if memory usage > 2% for 1 cycle then exec "/usr/bin/kill someprocess"
if not exist for 5 cycles then unmonitor
This will disable the check if the application is not available for 5 monitoring cycles. To enable monitoring again, use "monit monitor someprocess".
See https://mmonit.com/monit/documentation/monit.html#EXISTENCE-TESTS

How to understand the output of rabbitmqctl commands

$ rabbitmqctl list_queues
Timeout: 60.0 seconds ...
Listing queues for vhost / ...
privateTransactionQ 2
amq.gen-o9dl3Zj7HxS50gkTC2xbBQ 0
task_queue 0
The output of rabbitmqctl looks like this. I can't make out what each column is meant for. How can I see the meaning of each column?
There is no "easy" solution for this, but we're IT and we can build one. I'm not an expert in RabbitMQ or in programming, but I'll do my best to give a good answer, just in case someone lands here looking for help.
Let's take the exact case of listing queues with rabbitmqctl. By typing "rabbitmqctl" with no arguments, you get the list of available commands:
Commands:
[...]
list_queues [-p <vhost>] [--online] [--offline] [--local] [<queueinfoitem> ...] [-t <timeout>]
[...]
Assuming you know what a vhost and a queue are, let's say you want to list all the queues in the vhost "TEST". You would type:
> rabbitmqctl list_queues -p TEST
Timeout: 60.0 seconds ...
Listing queues for vhost TEST ...
test.queue 0
By default, you only get the name of each queue and its current depth.
Where do you find all the other queue parameters? Pay special attention to the word "queueinfoitem" in the help output you saw earlier. At the end of the rabbitmqctl help (shown by typing "rabbitmqctl"), you can see the list of available options for the "<queueinfoitem>" parameter.
Now let's see an example where you want a more detailed view of a queue, say: messages ready, messages unacknowledged, messages in RAM, consumers, consumer utilisation, the state of the queue and, of course, its name.
You are right about one thing: rabbitmqctl doesn't return the result in a friendly way. By default, you get this:
rabbitmqctl list_queues -p TEST messages_ready messages_unacknowledged messages_ram consumers consumer_utilisation state name
Timeout: 60.0 seconds ...
Listing queues for vhost TEST ...
0 0 0 0 running test.queue
But with a bit of imagination, you can achieve this:
----------------------------------------------------------
Msg. * Msg. * Msg. ** ** Cons. ** **** Name
Rdy * Unack * RAM *** Cons. * Util. ** State ***
----------------------------------------------------------
0 0 0 0 running test.queue
It's no big deal, but it's better than the default.
I achieved that with a small Python script:
import os
import logging

logging.basicConfig(level=logging.INFO)

# List all vhosts, skipping the "Listing vhosts ..." header line
# and the trailing newline.
vhosts = os.popen("rabbitmqctl list_vhosts name").read()
logging.info(vhosts)
vhosts = vhosts.split("\n", 1)[1]
vhosts = vhosts[:-1]
vhosts = vhosts.split("\n")

for vhost in vhosts:
    header_a = "Msg. * Msg. * Msg. ** ** Cons. ** **** Name\n"
    header_b = "Rdy * Unack * RAM *** Cons. * Util. ** State *** \n"
    dash = "----------------------------------------------------------\n"
    # Info items are space-separated (no commas).
    queues = os.popen("rabbitmqctl list_queues -p " + vhost +
                      " messages_ready messages_unacknowledged messages_ram"
                      " consumers consumer_utilisation state name").read()
    # Skip the "Timeout" and "Listing queues" header lines.
    queues = queues.split("\n", 2)[2]
    queues_list = dash + header_a + header_b + dash + queues
    print(queues_list)
Of course this can be improved in many ways and criticism is always welcome; I still hope it helps someone.
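As a side note: if your RabbitMQ installation is recent enough, rabbitmqctl should also accept a --formatter option (for example --formatter json) for machine-readable output, and the management plugin's rabbitmqadmin tool prints queue listings as proper ASCII tables; I haven't checked this against every version, so treat it as a pointer rather than a recipe.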
Cheers.

Erlang VM killed when creating millions of processes

So, after Joe Armstrong's claims that Erlang processes are cheap and the VM can handle millions of them, I decided to test it on my machine:
process_galore(N) ->
    io:format("process limit: ~p~n", [erlang:system_info(process_limit)]),
    statistics(runtime),
    statistics(wall_clock),
    L = for(0, N, fun() -> spawn(fun() -> wait() end) end),
    {_, Rt} = statistics(runtime),
    {_, Wt} = statistics(wall_clock),
    lists:foreach(fun(Pid) -> Pid ! die end, L),
    io:format("Processes created: ~p~n"
              "Run time ms: ~p~n"
              "Wall time ms: ~p~n"
              "Average run time: ~p microseconds!~n",
              [N, Rt, Wt, (Rt / N) * 1000]).

wait() ->
    receive
        die -> done
    end.

for(N, N, _) ->
    [];
for(I, N, Fun) when I < N ->
    [Fun() | for(I + 1, N, Fun)].
The results are impressive for a million processes: I get approximately 6.6 microseconds (!) average spawn time. But when starting 3 million processes, the OS shell prints "Killed" and the Erlang runtime is gone.
I run erl with the +P 5000000 flag; the system is Arch Linux with a quad-core i7 and 8 GB RAM.
Erlang processes are cheap, but they're not free. A process created with spawn uses 338 words of memory, which is 2704 bytes on a 64-bit system. Spawning 3 million processes will therefore use at least 8112 MB of RAM, not counting the overhead of building the list of pids or the anonymous function created for each process (I'm not sure whether those are shared when created the way you're creating them). You'll probably need 10-12 GB of free RAM to spawn and keep alive 3 million (almost) empty processes.
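As a quick sanity check, you can reproduce the arithmetic in the Erlang shell (just a sketch; the 338-word figure is the per-process size from the efficiency guide and ignores every other overhead):
%% Back-of-the-envelope estimate for 3 million idle processes.
WordSize = erlang:system_info(wordsize),       %% 8 bytes per word on a 64-bit VM
BytesPerProcess = 338 * WordSize,              %% 2704 bytes
TotalBytes = BytesPerProcess * 3000000,        %% 8112000000 bytes, i.e. ~8.1 GB
TotalGiB = TotalBytes / (1024 * 1024 * 1024).  %% roughly 7.6 GiB before any overhead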
As I pointed out in the comments (and you later verified), the "Killed" message was printed by the Linux Kernel when it killed the Erlang VM, most likely for using up too much RAM. More information here.
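If you want to confirm that it was the kernel's OOM killer, it normally leaves a trace in the kernel log: something like dmesg | grep -i 'killed process' should show an entry naming the beam process and how much memory it was using.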

What could cause Redis RDB Snapshotting to Stall?

I have a redis install on Ubuntu 14.04, and I seem to have nearly weekly issues with RDB snapshots completing. Redis version is 3.0.4 64 bit.
3838:M 24 Feb 09:46:28.826 * Background saving terminated with success
3838:M 24 Feb 09:47:29.088 * 100000 changes in 60 seconds. Saving...
3838:M 24 Feb 09:47:29.230 * Background saving started by pid 17281
17281:signal-handler (1456338079) Received SIGTERM scheduling shutdown...
3838:M 24 Feb 13:24:19.358 # Background saving terminated by signal 9
3838:M 24 Feb 13:24:19.622 * 10 changes in 900 seconds. Saving...
3838:M 24 Feb 13:24:19.730 * Background saving started by pid 17477
What you see there is that at 9:47 am the background save started, but when I found it at 1:24 pm it appeared to be completely stalled. I found the forked process to have basically no activity - the amount of memory it was consuming wasn't increasing. I tried to kill the child process, but it never actually quit, so I had to kill it with extreme prejudice (-9).
When things are getting bad, I get the following errors in my app:
2016-02-24 13:11:12,046 [2344] ERROR kCollectors.Main - Error while adding to Redis: No connection is available to service this operation: SADD ALLCH
My Redis config does RDB snapshots only (no AOF). The load is modification-heavy, with thousands of writes per second.
Currently I'm at the point where no Redis background save is succeeding, and the background process becomes so much larger than the regular process that my VM starts swapping. Here's my top output; 3838 is my Redis instance, and 17477 is the background save process (as noted above):
top - 14:06:42 up 118 days,  2:05,  1 user,  load average: 1.07, 1.07, 1.13
Tasks:  81 total,   3 running,  78 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.8 us,  1.5 sy,  0.0 ni, 45.8 id, 51.3 wa,  0.0 hi,  0.5 si,  0.0 st
KiB Mem:   8176996 total,  8036792 used,   140204 free,      120 buffers
KiB Swap:  6289404 total,  3968236 used,  2321168 free.     4044 cached Mem

  PID USER  PR  NI    VIRT    RES   SHR S %CPU %MEM     TIME+ COMMAND
   36 root  20   0       0      0     0 S  2.3  0.0 288:05.05 kswapd0
 3838 rrr   20   0 7791836 3.734g   612 S  2.0 47.9 330:08.65 redis-server
17477 rrr   20   0 7792228 6.606g   364 D  1.0 84.7   0:43.49 redis-server
This is very interesting, since I don't remember ever reading about such issues, so discovering the root cause could be very useful.
So here you are reporting a child process that stays active for a long time and even continues to allocate memory. I have no explanation for this other than data corruption in the process memory, causing the RDB process to hit unexpected conditions and loop forever in some way.
A few questions:
Does this happen even if you restart the process? (However, please DON'T DO IT if you can avoid restarting and have not restarted yet, otherwise we may no longer be able to understand the root cause.)
While the RDB saving is active, do you see high CPU usage, and is the process running according to ps/top?
Could you try to interrupt the process with gdb -p <pid> and obtain a stack trace of the process?
Could you provide the Redis INFO output so we can check the version, configuration and state?
Could you check free output while this happens?
TL;DR: is it possible the system is out of memory and swapping a lot? In that case the child process, while saving the RDB file, would visit all the pages and force everything into the resident set. The system can't cope with that much I/O, so it takes ages to complete the RDB save.
EDIT: I just noticed you reported memory info:
KiB Mem: 8176996 total, 8036792 used, 140204 free, 120 buffers
So the system is out of memory and is swapping like crazy, and this results in the behavior above. As RDB saving starts, copy-on-write (COW) uses a lot of additional memory, pushing the server over its memory limits.
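Your top output shows exactly this: the parent redis-server holds about 3.7 GiB resident and the forked child about 6.6 GiB on an 8 GiB box, roughly 4 GiB of swap is already in use, and the CPU spends about half its time in iowait.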
Thanks.

How to understand Exim log file?

Can someone help me understand the Exim log file, and also point me to good documentation about its logs?
LINE 1
2010-12-05 17:30:15 1PPKHn-0003mA-5w <= username=example.com.br--4219--bounce#mydomain.com.br H=myserver.com.br () [174.120.195.18] P=esmtpa A=dovecot_plain:email#e-mydomain.com.br S=3851 id=4cfbe84724135_7b201579466da9b433988131#myserver.com.br.tmail
LINE 2
2010-12-05 17:30:12 H=mydomain.com.br () [111.111.111.11] Warning: Sender rate 1455.2 / 1h
LINE 3
2010-12-05 17:30:12 1PPGo3-00010A-FL == super#domain.in R=lookuphost T=remote_smtp defer (-53): retry time not reached for any host
Also, how can I parse the Exim log file to find out which ISP (e.g. hotmail.com, gmail.com) is blocking my server's IP?
Exim logs message arrivals and deliveries in a compact format, described in the "Log files" chapter of the online documentation. The log files are configurable, so you can add or remove information using the log_selector option.
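To read the sample lines above: the two-character flag after the message ID tells you what happened.
<= marks a message arrival (LINE 1); H= is the sending host, P= the protocol, A= the authenticator and login used, S= the message size in bytes, and id= the message ID taken from the headers.
== marks a deferred delivery (LINE 3); R= is the router and T= the transport that were chosen, followed by the reason for the deferral ("retry time not reached for any host" simply means Exim is waiting before retrying).
=> marks a successful delivery and ** a failed one; neither appears in your sample.
LINE 2 is not a delivery record at all but a warning, apparently from a sender rate-limiting ACL.
As for the ISP question: failed and deferred deliveries (** and == lines) include the remote host and its SMTP response, so searching the main log for those markers together with the destination domain is the usual way to spot providers that are rejecting your IP.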