RabbitMQ crash with bump_reduce_memory_use

I am using RabbitMQ 3.7.3 on Erlang 20.2.2, deployed in Docker (image rabbitmq:3.7-management).
Memory is set up like this: Memory high watermark set to 6000 MiB (6291456000 bytes) of 8192 MiB (8589934592 bytes) total.
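(An absolute watermark like that can be set in rabbitmq.conf along these lines; the exact mechanism used in this deployment is not shown, so treat this as a sketch:)
# rabbitmq.conf (sketch; 6000 MiB matches the figure quoted above)
vm_memory_high_watermark.absolute = 6000MiB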
Here is the crash report that I am getting on automatic restart of RabbitMQ:
CRASH REPORT Process <0.818.0> with 0 neighbours exited with reason:
no function clause matching
rabbit_priority_queue:handle_info(bump_reduce_memory_use,
{state,rabbit_variable_queue,[{10,{vqstate,{0,{[],[]}},{0,{[],[]}},{delta,undefined,0,0,undefined},...}},...],...})
line 396 in gen_server2:terminate/3 line 1161
It seems to be due to messages posted to a queue filled with 500k+ messages.
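For context, the {state, rabbit_variable_queue, [{10, ...}, ...]} term in the crash report is the per-priority sub-queue state kept by rabbit_priority_queue, which suggests the queue was declared with an x-max-priority argument. A minimal sketch of such a declaration (the queue name, durability and exact priority level are assumptions, not taken from the actual setup):
# hypothetical declaration via rabbitmqadmin; the real queue arguments are not shown here
rabbitmqadmin declare queue name=priority-orders durable=true arguments='{"x-max-priority": 10}'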
Thanks for your help!

I filed this bug and opened these pull requests to fix this issue - 3.7.x PR, master PR. This fix will ship in RabbitMQ 3.7.4.
In the future, it would be preferable to discuss or report issues on the mailing list as the RabbitMQ core team monitors it daily.
Thanks for reporting this issue and for using RabbitMQ.

Related

RabbitMQ crash rabbit_disk_monitor

The crash exception log is below:
2022-12-06 03:40:44 =ERROR REPORT====
** Generic server rabbit_disk_monitor terminating
** Last message in was update
** When Server state == {state,"c:/Users/appadmin/AppData/Roaming/RabbitMQ/db/rabbit#FR-PPA004-mnesia",50000000,79131406336,100,10000,#Ref<0.526994179.3769630732.93858>,false,true,10,120000}
** Reason for termination ==
** {eacces,[{erlang,open_port,[{spawn,[67,58,92,87,105,110,100,111,119,115,92,115,121,115,116,101,109,51,50,92,99,109,100,46,101,120,101,32,47,99,32,100,105,114,32,47,45,67,32,47,87,32,34,92,92,63,92,"c:","\","Users","\","appadmin","\","AppData","\","Roaming","\","RabbitMQ","\","db","\","rabbit#FR-PPA004-mnesia",34]},[binary,stderr_to_stdout,stream,in,hide]],[{file,"erlang.erl"},{line,2272}]},{os,cmd,2,[{file,"os.erl"},{line,275}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,255}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,209}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,181}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,689}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,765}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
2022-12-06 03:40:44 =CRASH REPORT====
crasher:
initial call: rabbit_disk_monitor:init/1
pid: <0.436.0>
registered_name: rabbit_disk_monitor
exception error: {eacces,[{erlang,open_port,[{spawn,[67,58,92,87,105,110,100,111,119,115,92,115,121,115,116,101,109,51,50,92,99,109,100,46,101,120,101,32,47,99,32,100,105,114,32,47,45,67,32,47,87,32,34,92,92,63,92,"c:","\","Users","\","appadmin","\","AppData","\","Roaming","\","RabbitMQ","\","db","\","rabbit#FR-PPA004-mnesia",34]},[binary,stderr_to_stdout,stream,in,hide]],[{file,"erlang.erl"},{line,2272}]},{os,cmd,2,[{file,"os.erl"},{line,275}]},{rabbit_disk_monitor,get_disk_free,2,[{file,"src/rabbit_disk_monitor.erl"},{line,255}]},{rabbit_disk_monitor,internal_update,1,[{file,"src/rabbit_disk_monitor.erl"},{line,209}]},{rabbit_disk_monitor,handle_info,2,[{file,"src/rabbit_disk_monitor.erl"},{line,181}]},{gen_server,try_dispatch,4,[{file,"gen_server.erl"},{line,689}]},{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,765}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
ancestors: [rabbit_disk_monitor_sup,rabbit_sup,<0.281.0>]
message_queue_len: 0
messages: []
links: [<0.435.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 6772
stack_size: 28
reductions: 35778
neighbours:
Does anyone have an idea what this error is?
I'd appreciate your help. Thanks.
You must provide the RabbitMQ and Erlang versions any time you ask a question or report an issue with RabbitMQ. In fact, any time you are talking about software, providing versions is necessary!
This is the important part of the error text:
{eacces,[{erlang,open_port,[{spawn,[67,58,92,87,105,110,100,111,119,115,92,115,121,115,116,101,109,51,50,92,99,109,100,46,101,120,101,32,47,99,32,100,105,114,32,47,45,67,32,47,87,32,34,92,92,63,92,"c:","\","Users","\","appadmin","\","AppData","\","Roaming","\","RabbitMQ","\","db","\","rabbit#FR-PPA004-mnesia",34]},
eacces means that there is a permission issue. Since I don't know what version of RabbitMQ you are using, my guess is that the free disk space check fails because RabbitMQ/Erlang can't start up PowerShell or cmd.exe.
I have fixed issues around free disk space monitoring and made RabbitMQ more UTF-8 friendly (in case you are using a non-English language).
My suggestion is to upgrade to the latest RabbitMQ (3.11.4) and Erlang (25.1.2). If that does not resolve your issue, you should start a new discussion here.
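As a quick check after upgrading, or after fixing the permissions for the account the RabbitMQ Windows service runs under, you can confirm that the disk monitor reports a real value, for example:
# run under the same Windows account as the RabbitMQ service, so any cmd.exe
# permission problem (eacces) is reproduced or ruled out
rabbitmqctl status
The output should include a free disk space figure.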
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

JVM appears to be hung with out-of-heap-space error when the response payload size is more than 3 MB in Mule 4

I am using Mule 4 to retrieve records from a database and return them in the response. All the components execute successfully, but streaming the response fails. I am calling the API from Postman and see this error:
<h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
In Studio, I get logs like:
Pinging the JVM took 9 seconds to respond.
JVM appears hung: Timed out waiting for signal from JVM. Requesting thread dump.
Dumping JVM state.
JVM appears hung: Timed out waiting for signal from JVM. Restarting JVM.
JVM exited after being requested to terminate.
JVM Restarts disabled. Shutting down.
<-- Wrapper Stopped
Could anyone help me with this?
Thanks,
Sanjukta
Something is not being streamed. You didn't provide any details of the implementation, but clearly something is consuming a lot of heap memory. It may not be the database but some other component; check the streaming configuration of your components.
To identify the cause locally, you can capture a heap dump and analyze it while the runtime in Studio is timing out on the wrapper ping, before it crashes. That timeout is probably caused by high garbage-collection activity.
This is a symptom of your JVM heap memory being full. Check your settings in Anypoint Studio and see how much heap is allocated.
Check this article
https://help.mulesoft.com/s/article/Out-Of-Memory-in-Studio-Application-How-to-increase-the-maximum-heap-size?r=6&ui-force-components-controllers-recordGlobalValueProvider.RecordGvp.getRecord=1
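For example, the heap can be raised in the Studio run configuration's VM arguments, or in wrapper.conf for a standalone runtime; the values below are only illustrative:
# Anypoint Studio: Run Configurations > your application > Arguments > VM arguments
-Xms1024m -Xmx2048m
# Standalone Mule runtime: $MULE_HOME/conf/wrapper.conf
wrapper.java.initmemory=1024
wrapper.java.maxmemory=2048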

Splunk 7.2.9.1 Universal forwarder on SUSE Linux 12.4 not communicating and forwarding logs to indexer after a certain period of time

I have noticed that the Splunk 7.2.9.1 universal forwarder on SUSE Linux 12.4 stops communicating with the deployment server and forwarding logs to the indexer after a certain period of time. The "splunkd" process appears to be running while the issue persists.
I have to restart the universal forwarder for it to resume communicating with the deployment server and forwarding logs, but it stops again after a certain period of time.
I cannot see any specific logs in splunkd.log while this issue occurs.
However, I noticed the message below in watchdog.log:
06-16-2020 11:51:09.055 +0200 ERROR Watchdog - No response received from IMonitoredThread=0x7f24365fdcd0 within 8000 ms. Looks like thread name='Shutdown' is busy !? Starting to trace with 8000 ms interval.
Can somebody help me understand what is causing this issue?
This appears to be a Known Issue. From the 7.2.9.1 release notes:
Universal Forwarders stop sending data repeatedly throughout the day
Workaround: In limits.conf, try changing file_tracking_db_threshold_mb
in the [inputproc] stanza to a lower value.
I did not find a version where this is not listed as a known problem.
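For reference, the workaround corresponds to a stanza like this in limits.conf on the forwarder (the value is only illustrative; pick something lower than your current or default setting):
# $SPLUNK_HOME/etc/system/local/limits.conf
[inputproc]
file_tracking_db_threshold_mb = 500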

Why is "await Publish<T>" hanging / not completing / not finishing

The following piece of code has been working for some time and it has suddenly stopped returning:
await availableChangedPublishEndpoint
    .Publish<IAvailableStockChanged>(
        AvailableStockCounter.ConvertSkuQtyToAvailableStockChangedEvent(
            newAvailable,
            absMessage.Warehouse));
There is nothing clever in ConvertSkuQtyToAvailableStockChangedEvent - it just maps one simple class to another.
We added logs before and after this code and it definitely stops at this point. Other systems are publishing fine, and other messages are being sent from this application (e.g. logs are actually sent via RabbitMQ). We have redeployed and upgraded to the latest MassTransit version. We can see that the messages are being published, possibly multiple times, but this Publish method never returns.
We had a broken RabbitMQ node and a clean service restart on one node fixed it. I appreciate there might be other reasons for this behaviour, but this was our problem.
systemctl restart rabbitmq-server
Looking further into RabbitMQ, we saw that some of the empty queues bound to this exchange were not synchronized, and when we tried to synchronize them it wouldn't work.
We also couldn't delete some of these unsynchronized queues.
We believe an unexpected shutdown of one of the nodes had caused this problem - but it left most queues / exchanges completely OK.
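For reference, the state can be inspected with commands along these lines (the queue name is a placeholder):
# list classic mirrored queues with their mirror and synchronisation state
rabbitmqctl list_queues name slave_pids synchronised_slave_pids
# ask RabbitMQ to synchronise a specific queue (this did not help in our case)
rabbitmqctl sync_queue my-queue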

Problems using BitTornado for file distribution

I'm experimenting with BitTornado-0.3.17 to distribute a file to several machines (*nix). I ran into a couple of problems while doing so. Here is what I have done so far.
Downloaded BitTornado-0.3.17.tar.gz from http://download2.bittornado.com/download/BitTornado-0.3.17.tar.gz and untarred it.
Created a torrent file and started a tracker following the instructions in the README file.
Started a seeder:
./btdownloadheadless.py ../BitTornado-0.3.17.tar.gz.torrent --saveas ../BitTornado-0.3.17.tar.gz
saving: BitTornado-0.3.17.tar.gz (0.2 MB)
percent done: 0.0
time left: Download Succeeded!
download to: /home/srikanth/BitTornado-0.3.17.tar.gz
download rate:
upload rate: 0.0 kB/s
share rating: 0.000 (0.0 MB up / 0.0 MB down)
seed status: 0 seen recently, plus 0.000 distributed copies
peer status: 0 seen now, 0.0% done at 0.0 kB/s
Now we have a seeder. I start a peer on another machine to download BitTornado-0.3.17.tar.gz.
./btdownloadheadless.py BitTornado-0.3.17.tar.gz.torrent
At this point I do not observe my peer downloading data from the seeder. However, if I kill my seeder and start it again, the peer immediately downloads from the seeder. Why is it happening this way? The first time the seeder reports to the tracker, the tracker should be aware of the seeder and share that information with the newly joined peer. The download only happens when I start the seeder after the peer joins the network.
Has anyone used BitTornado to distribute files programmatically (not using GUI tools at all)?
Thanks :-)
EDIT: Here is what happened a few days later. I dug into the tracker logs and figured out that the seeder was binding to a private IP address interface and reporting that address, which prevented other clients from reaching the seeder, hence no download. So I passed the --ip option to it, which made it report the machine's public IP address to the tracker. Even then, for some reason, I couldn't get the client to download from the seeder. However, I got it working by starting the client first and the seeder last, and this worked consistently. I can't think of any reason why it shouldn't work the other way around, so for now I'm starting the clients first and the seeder last.
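For reference, the sequence that now works consistently for me looks roughly like this (the tracker options and the IP below are placeholders, not my exact values):
# tracker, started first as per the README
./bttrack.py --port 6969 --dfile tracker-state
# downloading peer, started before the seeder
./btdownloadheadless.py BitTornado-0.3.17.tar.gz.torrent
# seeder, started last, forced to announce its public address
./btdownloadheadless.py BitTornado-0.3.17.tar.gz.torrent --saveas BitTornado-0.3.17.tar.gz --ip <public-ip>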
All the symptoms indicate that only one of your machines is able to connect to the other (in this case, the "seeder" machine). Restarting the "seeder" means it announces to the tracker, gets the other peers' info, and then connects. If the downloader is unconnectable, it simply cannot do anything until the seeder sees its IP.
This may also be related to rerequest_interval in download_bt1.py or reannounce_interval in track.py. Setting them to smaller values may help you check whether the tracker receives and distributes the right information.
When I diffed BitTornado against Twitter's murder code, I found a small difference, specifically at line 75 of Downloader.py:
self.backlog = max(50, int(self.backlog * 0.075))
This fixes the bug where the download never completes.