RabbitMQ server crashes randomly and I don't know why

I want to know the cause of a RabbitMQ crash that occurs randomly. Can you let me know what kinds of causes should be considered?
Also, my team has to restart RabbitMQ manually whenever a crash happens, so I would like to know whether there is a way to restart the RabbitMQ server automatically.
Here is the error report from when the crash occurs:
=WARNING REPORT==== 6-Dec-2017::07:56:43 ===
closing AMQP connection <0.4387.0> (000000:23070 -> 00000:5672, vhost: '/', user: '00000'):
client unexpectedly closed TCP connection
Also, this is part of the sasl.gsd file:
=SUPERVISOR REPORT==== 7-Dec-2017::10:03:15 ===
Supervisor: {local,sockjs_session_sup}
Context: child_terminated
Reason: {function_clause,
[{gen_server,cast,
[{},sockjs_closed],
[{file,"gen_server.erl"},{line,218}]},
{rabbit_ws_sockjs,service_stomp,3,
[{file,"src/rabbit_ws_sockjs.erl"},{line,150}]},
{sockjs_session,emit,2,
[{file,"src/sockjs_session.erl"},{line,173}]},
{sockjs_session,terminate,2,
[{file,"src/sockjs_session.erl"},{line,311}]},
{gen_server,try_terminate,3,
[{file,"gen_server.erl"},{line,629}]},
{gen_server,terminate,7,
[{file,"gen_server.erl"},{line,795}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,247}]}]}
Offender: [{pid,<0.20883.1160>},
{id,undefined},
{mfargs,
{sockjs_session,start_link,
["pd4tvvi0",
{service,"/stomp",
#Fun<rabbit_ws_sockjs.1.47892404>,{},
"//cdn.jsdelivr.net/sockjs/1.0.3/sockjs.min.js",
false,true,5000,25000,131072,
#Fun<rabbit_ws_sockjs.0.47892404>,undefined},
[{peername,{{172,31,6,213},9910}},
{sockname,{{172,31,5,49},15674}},
{path,"/stomp/744/pd4tvvi0/htmlfile"},
{headers,[]},
{socket,#Port<0.12491352>}]]}},
{restart_type,transient},
{shutdown,5000},
{child_type,worker}]
=CRASH REPORT==== 7-Dec-2017::10:03:20 ===
crasher:
initial call: sockjs_session:init/1
pid: <0.25851.1160>
registered_name: []
exception exit: {function_clause,
[{gen_server,cast,
[{},sockjs_closed],
[{file,"gen_server.erl"},{line,218}]},
{rabbit_ws_sockjs,service_stomp,3,
[{file,"src/rabbit_ws_sockjs.erl"},{line,150}]},
{sockjs_session,emit,2,
[{file,"src/sockjs_session.erl"},{line,173}]},
{sockjs_session,terminate,2,
[{file,"src/sockjs_session.erl"},{line,311}]},
{gen_server,try_terminate,3,
[{file,"gen_server.erl"},{line,629}]},
{gen_server,terminate,7,
[{file,"gen_server.erl"},{line,795}]},
{proc_lib,init_p_do_apply,3,
[{file,"proc_lib.erl"},{line,247}]}]}
in function gen_server:terminate/7 (gen_server.erl, line 800)
ancestors: [sockjs_session_sup,<0.177.0>]
messages: []
links: [<0.178.0>]
dictionary: []
trap_exit: true
status: running
heap_size: 987
stack_size: 27
reductions: 175
neighbours:
Please check the error reports I posted above and let me know the possible causes of the RabbitMQ crash and a way to restart the RabbitMQ server automatically.
Thanks!!
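For the automatic-restart part: one common approach, assuming RabbitMQ runs as a systemd service on Linux, is a restart-on-failure drop-in created with systemctl edit rabbitmq-server. A minimal sketch:
[Service]
Restart=on-failure
RestartSec=10
After a systemctl daemon-reload, systemd will restart the broker whenever the process exits abnormally; the root cause of the crashes should still be investigated separately.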

Related

Unable to join Akka.NET cluster

I am having a problem joining, and debugging joining, an Akka.NET cluster. I am using version 1.3.8. My setup is as follows:
Lighthouse
Almost default code from GitHub. Runs in a console; akka.hocon is as follows:
lighthouse {
  actorsystem: "sng"
}
petabridge.cmd {
  host = "0.0.0.0"
  port = 9110
}
akka {
  loglevel = DEBUG
  loggers = ["Akka.Logger.Serilog.SerilogLogger, Akka.Logger.Serilog"]
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
    debug {
      receive = on
      autoreceive = on
      lifecycle = on
      event-stream = on
      unhandled = on
    }
  }
  remote {
    log-sent-messages = on
    log-received-messages = on
    log-remote-lifecycle-events = on
    enabled-transports = ["akka.remote.dot-netty.tcp"]
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      hostname = "0.0.0.0"
      port = 4053
    }
    log-remote-lifecycle-events = DEBUG
  }
  cluster {
    auto-down-unreachable-after = 5s
    seed-nodes = []
    roles = [lighthouse]
  }
}
Working node
Also a console (net461) application with as simple as possible startup and joining. It works as expected. akka.hocon:
akka {
  loglevel = DEBUG
  loggers = ["Akka.Logger.Serilog.SerilogLogger, Akka.Logger.Serilog"]
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  remote {
    log-sent-messages = on
    log-received-messages = on
    log-remote-lifecycle-events = on
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      hostname = "0.0.0.0"
      port = 0
    }
  }
  cluster {
    auto-down-unreachable-after = 5s
    seed-nodes = ["akka.tcp://sng#127.0.0.1:4053"]
    roles = [monitor]
  }
}
Not working node
A .NET 4.6.1 library, registered as COM and started in another application (MediaMonkey) with VBA code:
Sub OnStartup
  Set o = CreateObject("MediaMonkey.Akka.Agent.MediaMonkeyAkkaProxy")
  o.Init(SDB)
End Sub
The Akka system is, as in the console application, created with the standard ActorSystem.Create("sng", config);
akka.hocon:
akka {
  loglevel = DEBUG
  loggers = ["Akka.Logger.Serilog.SerilogLogger, Akka.Logger.Serilog"]
  actor {
    provider = "Akka.Cluster.ClusterActorRefProvider, Akka.Cluster"
  }
  remote {
    log-sent-messages = on
    log-received-messages = on
    log-remote-lifecycle-events = on
    dot-netty.tcp {
      transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
      applied-adapters = []
      transport-protocol = tcp
      hostname = "0.0.0.0"
      port = 0
    }
  }
  cluster {
    auto-down-unreachable-after = 5s
    seed-nodes = ["akka.tcp://sng#127.0.0.1:4053"]
    roles = [mediamonkey]
  }
}
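One thing worth double-checking in the COM-hosted case is that akka.hocon is actually found and parsed inside the host process; the creation presumably looks roughly like this (the file path and loading mechanism are assumptions):
using System.IO;
using Akka.Actor;
using Akka.Configuration;

// Load the HOCON from an explicit location rather than relying on the
// host application's working directory (path is a placeholder).
var hocon = File.ReadAllText(@"C:\path\to\akka.hocon");
var config = ConfigurationFactory.ParseString(hocon);
var system = ActorSystem.Create("sng", config);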
Debugging workflow
Start up the Lighthouse application:
Configuration Result:
[Success] Name sng.Lighthouse
[Success] ServiceName sng.Lighthouse
Topshelf v4.0.0.0, .NET Framework v4.0.30319.42000
[Lighthouse] ActorSystem: sng; IP: 127.0.0.1; PORT: 4053
[Lighthouse] Performing pre-boot sanity check. Should be able to parse address [akka.tcp://sng#127.0.0.1:4053]
[Lighthouse] Parse successful.
[21:01:35 INF] Starting remoting
[21:01:35 INF] Remoting started; listening on addresses : [akka.tcp://sng#127.0.0.1:4053]
[21:01:35 INF] Remoting now listens on addresses: [akka.tcp://sng#127.0.0.1:4053]
[21:01:35 INF] Cluster Node [akka.tcp://sng#127.0.0.1:4053] - Starting up...
[21:01:35 INF] Cluster Node [akka.tcp://sng#127.0.0.1:4053] - Started up successfully
The sng.Lighthouse service is now running, press Control+C to exit.
[21:01:35 INF] petabridge.cmd host bound to [0.0.0.0:9110]
[21:01:35 INF] Node [akka.tcp://sng#127.0.0.1:4053] is JOINING, roles [lighthouse]
[21:01:35 INF] Leader is moving node [akka.tcp://sng#127.0.0.1:4053] to [Up]
Started and stopped working console node
Lighthouse logs:
[21:05:40 INF] Node [akka.tcp://sng#0.0.0.0:37516] is JOINING, roles [monitor]
[21:05:40 INF] Leader is moving node [akka.tcp://sng#0.0.0.0:37516] to [Up]
[21:05:54 INF] Connection was reset by the remote peer. Channel [[::ffff:127.0.0.1]:4053->[::ffff:127.0.0.1]:37517](Id=1293c63a)
[21:05:54 INF] Message AckIdleCheckTimer from akka://sng/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fsng%400.0.0.0%3A37516-1/endpointWriter to akka://sng/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fsng%400.0.0.0%3A37516-1/endpointWriter was not delivered. 1 dead letters encountered.
[21:05:55 INF] Message GossipStatus from akka://sng/system/cluster/core/daemon to akka://sng/deadLetters was not delivered. 2 dead letters encountered.
[21:05:55 INF] Message Heartbeat from akka://sng/system/cluster/core/daemon/heartbeatSender to akka://sng/deadLetters was not delivered. 3 dead letters encountered.
[21:05:56 INF] Message GossipStatus from akka://sng/system/cluster/core/daemon to akka://sng/deadLetters was not delivered. 4 dead letters encountered.
[21:05:56 INF] Message Heartbeat from akka://sng/system/cluster/core/daemon/heartbeatSender to akka://sng/deadLetters was not delivered. 5 dead letters encountered.
[21:05:57 INF] Message GossipStatus from akka://sng/system/cluster/core/daemon to akka://sng/deadLetters was not delivered. 6 dead letters encountered.
[21:05:57 INF] Message Heartbeat from akka://sng/system/cluster/core/daemon/heartbeatSender to akka://sng/deadLetters was not delivered. 7 dead letters encountered.
[21:05:58 INF] Message GossipStatus from akka://sng/system/cluster/core/daemon to akka://sng/deadLetters was not delivered. 8 dead letters encountered.
[21:05:58 INF] Message Heartbeat from akka://sng/system/cluster/core/daemon/heartbeatSender to akka://sng/deadLetters was not delivered. 9 dead letters encountered.
[21:05:59 WRN] Cluster Node [akka.tcp://sng#127.0.0.1:4053] - Marking node(s) as UNREACHABLE [Member(address = akka.tcp://sng#0.0.0.0:37516, Uid=1060233119 status = Up, role=[monitor], upNumber=2)]. Node roles [lighthouse]
[21:06:01 WRN] AssociationError [akka.tcp://sng#127.0.0.1:4053] -> akka.tcp://sng#0.0.0.0:37516: Error [Association failed with akka.tcp://sng#0.0.0.0:37516] []
[21:06:01 WRN] Tried to associate with unreachable remote address [akka.tcp://sng#0.0.0.0:37516]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://sng#0.0.0.0:37516] Caused by: [System.AggregateException: One or more errors occurred. ---> Akka.Remote.Transport.InvalidAssociationException: No connection could be made because the target machine actively refused it tcp://sng#0.0.0.0:37516
at Akka.Remote.Transport.DotNetty.TcpTransport.<AssociateInternal>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Akka.Remote.Transport.DotNetty.DotNettyTransport.<Associate>d__22.MoveNext()
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at Akka.Remote.Transport.ProtocolStateActor.<>c.<InitializeFSM>b__11_54(Task`1 result)
at System.Threading.Tasks.ContinuationResultTaskFromResultTask`2.InnerInvoke()
at System.Threading.Tasks.Task.Execute()
---> (Inner Exception #0) Akka.Remote.Transport.InvalidAssociationException: No connection could be made because the target machine actively refused it tcp://sng#0.0.0.0:37516
at Akka.Remote.Transport.DotNetty.TcpTransport.<AssociateInternal>d__1.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Akka.Remote.Transport.DotNetty.DotNettyTransport.<Associate>d__22.MoveNext()<---
]
[21:06:04 INF] Cluster Node [akka.tcp://sng#127.0.0.1:4053] - Leader is auto-downing unreachable node [akka.tcp://sng#127.0.0.1:4053]
[21:06:04 INF] Marking unreachable node [akka.tcp://sng#0.0.0.0:37516] as [Down]
[21:06:05 INF] Leader is removing unreachable node [akka.tcp://sng#0.0.0.0:37516]
[21:06:05 WRN] Association to [akka.tcp://sng#0.0.0.0:37516] having UID [1060233119] is irrecoverably failed. UID is now quarantined and all messages to this UID will be delivered to dead letters. Remote actorsystem must be restarted to recover from this situation.
Working node logs:
[21:05:38 INF] Starting remoting
[21:05:38 INF] Remoting started; listening on addresses : [akka.tcp://sng#0.0.0.0:37516]
[21:05:38 INF] Remoting now listens on addresses: [akka.tcp://sng#0.0.0.0:37516]
[21:05:38 INF] Cluster Node [akka.tcp://sng#0.0.0.0:37516] - Starting up...
[21:05:38 INF] Cluster Node [akka.tcp://sng#0.0.0.0:37516] - Started up successfully
[21:05:40 INF] Welcome from [akka.tcp://sng#127.0.0.1:4053]
[21:05:40 INF] Member is Up: Member(address = akka.tcp://sng#127.0.0.1:4053, Uid=439782041 status = Up, role=[lighthouse], upNumber=1)
[21:05:40 INF] Member is Up: Member(address = akka.tcp://sng#0.0.0.0:37516, Uid=1060233119 status = Up, role=[monitor], upNumber=2)
//shutdown logs are missing
Started and stopped COM node
Lighthouse logs:
[21:12:02 INF] Connection was reset by the remote peer. Channel [::ffff:127.0.0.1]:4053->[::ffff:127.0.0.1]:37546](Id=4ca91e15)
COM node logs:
[WARNING][18. 07. 2018 19:11:15][Thread 0001][ActorSystem(sng)] The type name for serializer 'hyperion' did not resolve to an actual Type: 'Akka.Serialization.HyperionSerializer, Akka.Serialization.Hyperion'
[WARNING][18. 07. 2018 19:11:15][Thread 0001][ActorSystem(sng)] Serialization binding to non existing serializer: 'hyperion'
[21:11:15 DBG] Logger log1-SerilogLogger [SerilogLogger] started
[21:11:15 DBG] StandardOutLogger being removed
[21:11:15 DBG] Default Loggers started
[21:11:15 INF] Starting remoting
[21:11:15 DBG] Starting prune timer for endpoint manager...
[21:11:15 INF] Remoting started; listening on addresses : [akka.tcp://sng#0.0.0.0:37543]
[21:11:15 INF] Remoting now listens on addresses: [akka.tcp://sng#0.0.0.0:37543]
[21:11:15 INF] Cluster Node [akka.tcp://sng#0.0.0.0:37543] - Starting up...
[21:11:15 INF] Cluster Node [akka.tcp://sng#0.0.0.0:37543] - Started up successfully
[21:11:15 DBG] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[21:11:15 DBG] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[21:11:16 DBG] [Uninitialized] Received Akka.Cluster.InternalClusterAction+JoinSeedNodes
[21:11:16 DBG] [Uninitialized] Received Akka.Cluster.InternalClusterAction+Subscribe
[21:11:26 WRN] Couldn't join seed nodes after [2] attempts, will try again. seed-nodes=[akka.tcp://sng#127.0.0.1:4053]
[21:11:31 WRN] Couldn't join seed nodes after [3] attempts, will try again. seed-nodes=[akka.tcp://sng#127.0.0.1:4053]
[21:11:36 WRN] Couldn't join seed nodes after [4] attempts, will try again. seed-nodes=[akka.tcp://sng#127.0.0.1:4053]
[21:11:40 ERR] No response from remote. Handshake timed out or transport failure detector triggered.
[21:11:40 WRN] AssociationError [akka.tcp://sng#0.0.0.0:37543] -> akka.tcp://sng#127.0.0.1:4053: Error [Association failed with akka.tcp://sng#127.0.0.1:4053] []
[21:11:40 WRN] Tried to associate with unreachable remote address [akka.tcp://sng#127.0.0.1:4053]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: [Association failed with akka.tcp://sng#127.0.0.1:4053] Caused by: [Akka.Remote.Transport.AkkaProtocolException: No response from remote. Handshake timed out or transport failure detector triggered.
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Akka.Remote.Transport.AkkaProtocolTransport.<Associate>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Akka.Remote.EndpointWriter.<AssociateAsync>d__23.MoveNext()]
[21:11:40 DBG] Disassociated [akka.tcp://sng#0.0.0.0:37543] -> akka.tcp://sng#127.0.0.1:4053
[21:11:40 INF] Message InitJoin from akka://sng/system/cluster/core/daemon/joinSeedNodeProcess-1 to akka://sng/deadLetters was not delivered. 1 dead letters encountered.
[21:11:40 INF] Message InitJoin from akka://sng/system/cluster/core/daemon/joinSeedNodeProcess-1 to akka://sng/deadLetters was not delivered. 2 dead letters encountered.
[21:11:40 INF] Message InitJoin from akka://sng/system/cluster/core/daemon/joinSeedNodeProcess-1 to akka://sng/deadLetters was not delivered. 3 dead letters encountered.
[21:11:40 INF] Message InitJoin from akka://sng/system/cluster/core/daemon/joinSeedNodeProcess-1 to akka://sng/deadLetters was not delivered. 4 dead letters encountered.
[21:11:40 INF] Message InitJoin from akka://sng/system/cluster/core/daemon/joinSeedNodeProcess-1 to akka://sng/deadLetters was not delivered. 5 dead letters encountered.
[21:11:40 INF] Message AckIdleCheckTimer from akka://sng/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fsng%40127.0.0.1%3A4053-1/endpointWriter to akka://sng/system/endpointManager/reliableEndpointWriter-akka.tcp%3A%2F%2Fsng%40127.0.0.1%3A4053-1/endpointWriter was not delivered. 6 dead letters encountered.
[21:11:41 WRN] Couldn't join seed nodes after [5] attempts, will try again. seed-nodes=[akka.tcp://sng#127.0.0.1:4053]
[21:11:41 INF] Message InitJoin from akka://sng/system/cluster/core/daemon/joinSeedNodeProcess-1 to akka://sng/deadLetters was not delivered. 7 dead letters encountered.
[21:11:46 WRN] Couldn't join seed nodes after [6] attempts, will try again. seed-nodes=[akka.tcp://sng#127.0.0.1:4053]
[21:11:51 WRN] Couldn't join seed nodes after [7] attempts, will try again. seed-nodes=[akka.tcp://sng#127.0.0.1:4053]
Do you have any idea how to debug and/or resolve this?
The first thing I notice is that the non-working node's HOCON configuration contains a different "seed-nodes" address from the working node's.
IMHO the "seed-nodes" in all the applications [nodes, as they are called in a cluster] within the cluster need to be the same. So in the non-working node, instead of
seed-nodes = ["akka.tcp://songoulash#127.0.0.1:4053"]
replace it with the value below, which is used in the working node:
seed-nodes = ["akka.tcp://sng#127.0.0.1:4053"]
Also, please check this GitHub link for a sample: https://github.com/AJEETX/Akka.Cluster
and another link: https://github.com/AJEETX/AkkaNet.Cluster.RoundRobinGroup
@Rok, kindly let me know if this was helpful or whether I should investigate further.
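If the addresses already match, it can also help to subscribe to cluster events in the joining node so its join attempts show up in its own logs. A minimal sketch using Akka.Cluster's event subscription API (actor name and logging are illustrative):

using System;
using Akka.Actor;
using Akka.Cluster;

public class ClusterListener : ReceiveActor
{
    public ClusterListener()
    {
        Receive<ClusterEvent.MemberUp>(m => Console.WriteLine($"Member up: {m.Member.Address}"));
        Receive<ClusterEvent.UnreachableMember>(m => Console.WriteLine($"Unreachable: {m.Member.Address}"));
        Receive<ClusterEvent.MemberRemoved>(m => Console.WriteLine($"Removed: {m.Member.Address}"));
    }

    protected override void PreStart()
    {
        // Replay current cluster state as events, then follow member/reachability changes.
        Cluster.Get(Context.System).Subscribe(Self, ClusterEvent.InitialStateAsEvents,
            typeof(ClusterEvent.IMemberEvent), typeof(ClusterEvent.UnreachableMember));
    }

    protected override void PostStop()
    {
        Cluster.Get(Context.System).Unsubscribe(Self);
    }
}

// e.g. right after ActorSystem.Create("sng", config):
// system.ActorOf(Props.Create<ClusterListener>(), "cluster-listener");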

Logstash - Storing RabbitMQ Logs - Multiline

I have been using ELK for about six months now, and it's been great so far. I'm on logstash version 6.2.3.
RabbitMQ makes up the heart of my distributed system (RabbitMQ is itself distributed), and as such it is very important that I track the logs of RabbitMQ.
Most other conversations on this forum seem to use RabbitMQ as an input/output stage, but I just want to monitor the logs.
The only problem I'm finding is that RabbitMQ has multiline logging, like so:
=WARNING REPORT==== 19-Nov-2017::06:53:14 ===
closing AMQP connection <0.27161.0> (...:32799 -> ...:5672, vhost: '/', user: 'worker'):
client unexpectedly closed TCP connection
=WARNING REPORT==== 19-Nov-2017::06:53:18 ===
closing AMQP connection <0.22410.0> (...:36656 -> ...:5672, vhost: '/', user: 'worker'):
client unexpectedly closed TCP connection
=WARNING REPORT==== 19-Nov-2017::06:53:19 ===
closing AMQP connection <0.26045.0> (...:55427 -> ...:5672, vhost: '/', user: 'worker'):
client unexpectedly closed TCP connection
=WARNING REPORT==== 19-Nov-2017::06:53:20 ===
closing AMQP connection <0.5484.0> (...:47740 -> ...:5672, vhost: '/', user: 'worker'):
client unexpectedly closed TCP connection
I have found a brilliant code example here, which I have stripped just to the filter stage, such that it looks like this:
filter {
  if [type] == "rabbitmq" {
    codec => multiline {
      pattern => "^="
      negate => true
      what => "previous"
    }
    grok {
      type => "rabbit"
      patterns_dir => "patterns"
      pattern => "^=%{WORD:report_type} REPORT=+ %{RABBIT_TIME:time_text} ===.*$"
    }
    date {
      type => "rabbit"
      time_text => "dd-MMM-yyyy::HH:mm:ss"
    }
    mutate {
      type => "rabbit"
      add_field => [
        "message",
        "%{#message}"
      ]
    }
    mutate {
      gsub => [
        "message", "^=[A-Za-z0-9: =-]+=\n", "",
        # interpret message header text as "severity"
        "report_type", "INFO", "1",
        "report_type", "WARNING", "3",
        "report_type", "ERROR", "4",
        "report_type", "CRASH", "5",
        "report_type", "SUPERVISOR", "5"
      ]
    }
  }
}
But when I save this to a conf file and restart logstash I get the following error:
[2018-04-04T07:01:57,308][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"fb_apache", :directory=>"/usr/share/logstash/modules/fb_apache/configuration"}
[2018-04-04T07:01:57,316][INFO ][logstash.modules.scaffold] Initializing module {:module_name=>"netflow", :directory=>"/usr/share/logstash/modules/netflow/configuration"}
[2018-04-04T07:01:57,841][INFO ][logstash.runner ] Starting Logstash {"logstash.version"=>"6.2.3"}
[2018-04-04T07:01:57,973][INFO ][logstash.agent ] Successfully started Logstash API endpoint {:port=>9600}
[2018-04-04T07:01:58,037][ERROR][logstash.agent ] Failed to execute action {:action=>LogStash::PipelineAction::Create/pipeline_id:main, :exception=>"LogStash::ConfigurationError", :message=>"Expected one of #, { at line 3, column 15 (byte 54) after filter {\n if [type] == \"rabbitmq\" {\n codec ", :backtrace=>["/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:42:in `compile_imperative'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:50:in `compile_graph'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:12:in `block in compile_sources'", "org/jruby/RubyArray.java:2486:in `map'", "/usr/share/logstash/logstash-core/lib/logstash/compiler.rb:11:in `compile_sources'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:51:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline.rb:169:in `initialize'", "/usr/share/logstash/logstash-core/lib/logstash/pipeline_action/create.rb:40:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:315:in `block in converge_state'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:141:in `with_pipelines'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:312:in `block in converge_state'", "org/jruby/RubyArray.java:1734:in `each'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:299:in `converge_state'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:166:in `block in converge_state_and_update'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:141:in `with_pipelines'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:164:in `converge_state_and_update'", "/usr/share/logstash/logstash-core/lib/logstash/agent.rb:90:in `execute'", "/usr/share/logstash/logstash-core/lib/logstash/runner.rb:348:in `block in execute'", "/usr/share/logstash/vendor/bundle/jruby/2.3.0/gems/stud-0.0.23/lib/stud/task.rb:24:in `block in initialize'"]}
Any ideas what the issue could be?
Thanks,
In case you are sending your logs from the RabbitMQ server to Logstash with Filebeat, you should configure the multiline handling there.
The answer is indeed multiline. The goal is to merge lines that start with something other than a date into the previous line that did start with a date. This is how:
multiline.pattern: '^\d{4}-\d{2}-\d{2}'
multiline.negate: true
multiline.match: after
Note: I previously tried to merge any lines starting with whitespace (^\s+), but that did not work because not all warning or error messages start with a space.
Complete filebeat input (7.5.2 format)
filebeat:
  inputs:
    - exclude_lines:
        - 'Failed to publish events caused by: EOF'
      fields:
        type: rabbitmq
      fields_under_root: true
      paths:
        - /var/log/rabbitmq/*.log
      tail_files: false
      timeout: 60s
      type: log
      multiline.pattern: '^\d{4}-\d{2}-\d{2}'
      multiline.negate: true
      multiline.match: after
Logstash patterns:
# RabbitMQ
RABBITMQDATE %{MONTHDAY}-%{MONTH}-%{YEAR}::%{HOUR}:%{MINUTE}:%{SECOND}
RABBITMQLINE (?m)=%{DATA:severity} %{DATA}==== %{RABBITMQDATE:timestamp} ===\n%{GREEDYDATA:message}
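A grok filter that uses those patterns might look roughly like this (the patterns_dir path is an assumption; adjust field names to your pipeline):

filter {
  if [type] == "rabbitmq" {
    grok {
      # directory containing the RABBITMQDATE/RABBITMQLINE patterns above
      patterns_dir => ["/etc/logstash/patterns"]
      match => { "message" => "%{RABBITMQLINE}" }
      overwrite => [ "message" ]
    }
    date {
      match => [ "timestamp", "dd-MMM-yyyy::HH:mm:ss" ]
    }
  }
}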
I am sure they had good reasons to log in this odd way in RMQ 3.7.x but without knowing them, it really makes our life hard.
You can't use a codec as a filter plugin. Codecs can only be used in input or output plugins (see the doc), with the codec configuration option.
You'll have to put your multiline codec in the input plugin that's producing your rabbitmq logs.
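For example, with the file input the multiline codec from the question's filter block moves into the input, roughly like this (the log path is an assumption):

input {
  file {
    path => "/var/log/rabbitmq/*.log"
    type => "rabbitmq"
    codec => multiline {
      pattern => "^="
      negate => true
      what => "previous"
    }
  }
}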

Rabbitmq's management plugin's aliveness-test timing out for all cluster nodes

I was doing a POC of the aliveness-test and saw a strange scenario where it started timing out for all the cluster nodes. I am planning to use an application-level health check in HAProxy based on this aliveness-test API, but this scenario scared me because HAProxy started showing all the cluster nodes as DOWN while the aliveness-test was timing out for every node, even though the RabbitMQ server port was still accepting connections. As per the documentation, the RabbitMQ management plugin (which listens on port 15672) can be assumed reachable while the server (which listens on port 5672) is up and running.
When I restarted the Rabbit nodes, they went back into the same state.
It recovered and started returning response code 200 only after I killed the RabbitMQ processes on all the hosts with kill -9 and started the app gracefully.
How can this be avoided?
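For context, the HAProxy health check described above is wired up roughly like this (backend name, server addresses, and credentials are placeholders; Z3Vlc3Q6Z3Vlc3Q= is base64 for guest:guest):

backend rabbitmq_mgmt
    option httpchk GET /api/aliveness-test/%2F HTTP/1.1\r\nAuthorization:\ Basic\ Z3Vlc3Q6Z3Vlc3Q=
    http-check expect status 200
    server rabbitmq1 pgperf-rabbitmq1:15672 check
    server rabbitmq2 pgperf-rabbitmq2:15672 check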
Here are some error logs from one of the nodes:
=INFO REPORT==== 7-Jul-2017::16:05:18 ===
Mirrored queue 'aliveness-test' in vhost '/': Slave <rabbit#pgperf-rabbitmq2.1.26928.0> saw deaths of mirrors <rabbit#pgperf-rabbitmq1.1.18235.0>
=WARNING REPORT==== 7-Jul-2017::16:05:18 ===
Mnesia('rabbit#pgperf-rabbitmq2'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
=ERROR REPORT==== 7-Jul-2017::16:05:18 ===
** Generic server <0.27470.0> terminating
** Last message in was {'DOWN',#Ref<0.0.24.29729>,process,<7887.11927.0>,
noproc}
** When Server state == {state,
{20,<0.27470.0>},
{{17,<7888.18916.0>},#Ref<0.0.24.29728>},
{{10,<7887.11927.0>},#Ref<0.0.24.29729>},
{resource,<<"/">>,queue,<<"aliveness-test">>},
rabbit_mirror_queue_slave,
{21,
[{{10,<7887.11927.0>},
{view_member,
{10,<7887.11927.0>},
[],
{20,<0.27470.0>},
{14,<7888.18825.0>}}},
{{12,<0.26929.0>},
{view_member,
{12,<0.26929.0>},
[],
{14,<7888.18825.0>},
{17,<7888.18916.0>}}},
{{14,<7888.18825.0>},
{view_member,
{14,<7888.18825.0>},
[],
{10,<7887.11927.0>},
{12,<0.26929.0>}}},
{{17,<7888.18916.0>},
{view_member,
{17,<7888.18916.0>},
[],
{12,<0.26929.0>},
{20,<0.27470.0>}}},
{{20,<0.27470.0>},
{view_member,
{20,<0.27470.0>},
[],
{17,<7888.18916.0>},
{10,<7887.11927.0>}}}]},
-1,
[{{10,<7887.11927.0>},{member,{[],[]},1,1}},
{{12,<0.26929.0>},
{member,{[{1,process_death}],[]},1,-1}},
{{14,<7888.18825.0>},{member,{[],[]},0,0}},
{{17,<7888.18916.0>},{member,{[],[]},-1,-1}},
{{20,<0.27470.0>},{member,{[],[]},-1,-1}}],
[<0.27469.0>],
{[],[]},
[],0,undefined,
#Fun<rabbit_misc.execute_mnesia_transaction.1>,
false}
** Reason for termination ==
** {bad_return_value,
{error,
{function_clause,
[{gm,check_membership,
[{20,<0.27470.0>},{error,not_found}],
[{file,"src/gm.erl"},{line,1590}]},
{gm,'-record_dead_member_in_group/5-fun-1-',4,
[{file,"src/gm.erl"},{line,1132}]},
{mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
{mnesia_tm,execute_transaction,5,
[{file,"mnesia_tm.erl"},{line,808}]},
{rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
[{file,"src/rabbit_misc.erl"},{line,537}]},
{worker_pool_worker,'-run/2-fun-0-',3,
[{file,"src/worker_pool_worker.erl"},{line,77}]}]}}}
=ERROR REPORT==== 7-Jul-2017::16:05:18 ===
** Generic server <0.27469.0> terminating
** Last message in was {'EXIT',<0.27470.0>,
{bad_return_value,
{error,
{function_clause,
[{gm,check_membership,
[{20,<0.27470.0>},{error,not_found}],
[{file,"src/gm.erl"},{line,1590}]},
{gm,'-record_dead_member_in_group/5-fun-1-',4,
[{file,"src/gm.erl"},{line,1132}]},
{mnesia_tm,apply_fun,3,
[{file,"mnesia_tm.erl"},{line,833}]},
{mnesia_tm,execute_transaction,5,
[{file,"mnesia_tm.erl"},{line,808}]},
{rabbit_misc,
'-execute_mnesia_transaction/1-fun-0-',1,
[{file,"src/rabbit_misc.erl"},{line,537}]},
{worker_pool_worker,'-run/2-fun-0-',3,
[{file,"src/worker_pool_worker.erl"},
{line,77}]}]}}}}
** When Server state == {state,
{amqqueue,
{resource,<<"/">>,queue,<<"aliveness-test">>},
false,false,none,[],<7888.18913.0>,
[<7887.11753.0>],
[<7887.11753.0>],
['rabbit#pgpdr-rabbitmq2'],
[{vhost,<<"/">>},
{name,<<"ha-all">>},
{pattern,<<>>},
{'apply-to',<<"all">>},
{definition,
[{<<"ha-mode">>,<<"all">>},
{<<"ha-sync-mode">>,<<"automatic">>}]},
{priority,0}],
[{<7888.18916.0>,<7888.18913.0>},
{<7887.11774.0>,<7887.11753.0>}],
[],live},
<0.27470.0>,rabbit_priority_queue,
{passthrough,rabbit_variable_queue,
{vqstate,
{0,{[],[]}},
{0,{[],[]}},
{delta,undefined,0,undefined},
{0,{[],[]}},
{0,{[],[]}},
0,
{0,nil},
{0,nil},
{0,nil},
{qistate,
"/paytm/rabbitmq/mnesia/rabbit#pgperf-rabbitmq2/queues/1EZ1LHRKWKS0CPF59OD5ZBSYL",
{{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
[]},
undefined,0,32768,
#Fun<rabbit_variable_queue.2.95522769>,
#Fun<rabbit_variable_queue.3.95522769>,
{0,nil},
{0,nil},
[],[]},
{undefined,
{client_msstate,msg_store_transient,
<<190,153,122,22,186,127,25,5,168,81,229,140,5,
142,73,73>>,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
{state,438349,
"/paytm/rabbitmq/mnesia/rabbit#pgperf-rabbitmq2/msg_store_transient"},
rabbit_msg_store_ets_index,
"/paytm/rabbitmq/mnesia/rabbit#pgperf-rabbitmq2/msg_store_transient",
<0.370.0>,442446,434242,446543,450640,
{2000,500}}},
false,0,4096,0,0,0,0,0,infinity,0,0,0,0,0,0,
{rates,0.0,0.0,0.0,0.0,-576459879667187727},
{0,nil},
{0,nil},
{0,nil},
{0,nil},
0,0,0,0,2048,default}},
undefined,undefined,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}},
{state,
{dict,0,16,16,8,80,48,
{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
{{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],
[]}}},
delegate},
undefined}
** Reason for termination ==
** {bad_return_value,
{error,
{function_clause,
[{gm,check_membership,
[{20,<0.27470.0>},{error,not_found}],
[{file,"src/gm.erl"},{line,1590}]},
{gm,'-record_dead_member_in_group/5-fun-1-',4,
[{file,"src/gm.erl"},{line,1132}]},
{mnesia_tm,apply_fun,3,[{file,"mnesia_tm.erl"},{line,833}]},
{mnesia_tm,execute_transaction,5,
[{file,"mnesia_tm.erl"},{line,808}]},
{rabbit_misc,'-execute_mnesia_transaction/1-fun-0-',1,
[{file,"src/rabbit_misc.erl"},{line,537}]},
{worker_pool_worker,'-run/2-fun-0-',3,
[{file,"src/worker_pool_worker.erl"},{line,77}]}]}}}
=WARNING REPORT==== 7-Jul-2017::16:05:18 ===
Mnesia('rabbit#pgperf-rabbitmq2'): ** WARNING ** Mnesia is overloaded: {dump_log,write_threshold}
=ERROR REPORT==== 7-Jul-2017::16:47:39 ===
Channel error on connection <0.27776.0> (<rabbit#pgperf-rabbitmq2.1.27776.0>, vhost: '/', user: 'guest'), channel 1:
operation basic.get caused a channel exception not_found: "failed to perform operation on queue 'aliveness-test' in vhost '/' due to timeout"
=ERROR REPORT==== 7-Jul-2017::16:47:39 ===
webmachine error: path="/api/aliveness-test/%2F"
"Not Found"
=ERROR REPORT==== 7-Jul-2017::16:47:39 ===
Channel error on connection <0.10847.1> (<rabbit#pgperf-rabbitmq2.1.10847.1>, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'aliveness-test' in vhost '/' due to timeout"
=ERROR REPORT==== 7-Jul-2017::16:47:39 ===
Channel error on connection <0.151.1> (<rabbit#pgperf-rabbitmq2.1.151.1>, vhost: '/', user: 'guest'), channel 1:
operation queue.declare caused a channel exception not_found: "failed to perform operation on queue 'aliveness-test' in vhost '/' due to timeout"

RabbitMQ STOMP connection

I am working on a fun project that requires me to learn message queues and WebSockets. I am trying to connect browsers to an instance of RabbitMQ using SockJS rather than plain WebSockets. On Rabbit I have activated the plugins for stomp and web_stomp (web_stomp is required when using SockJS).
The problem I am running into is that, while the call from the browser seems to work (a very brief connection to Rabbit is made through the web_stomp/STOMP connection), after 2 or 3 seconds the connection is dropped by Rabbit.
This is confirmed by the rabbitmq logs:
=INFO REPORT==== 11-Jul-2016::23:01:54 ===
accepting STOMP connection (192.168.1.10:49746 -> 192.168.1.100:55674)
=INFO REPORT==== 11-Jul-2016::23:02:02 ===
closing STOMP connection (192.168.1.10:49746 -> 192.168.1.100:55674)
This is the browser code that connects to RabbitMQ via the webstomp plugin:
var url = "http://192.168.1.100:55674/stomp";
var ws = new SockJS(url);
var client = Stomp.over(ws);
var header = {
  login: 'test',
  passcode: 'test'
};
client.connect(header,
  function() {
    console.log('Hooray! Connected');
  },
  function(error) {
    console.log('Error connecting to WS via stomp:' + JSON.stringify(error));
  }
);
Here is the Rabbit config:
[
  {rabbitmq_stomp, [
    {default_user, [
      {login, "test"},
      {passcode, "test"}
    ]},
    {tcp_listeners, [{"192.168.1.100", 55674}]},
    {heartbeat, 0}
  ]}
]
I have been over the Rabbit docs a million times but this feels like something simple that I am overlooking.
Resolved. After combing through the logs I realized that web_stomp was listening on port 15674 so I changed the config file to reflect that. I swear I had made that change at some point but it did not seem to make a difference.
One of the last changes I made before posting this was to turn off heartbeats. Everything I have read states that SockJS does not support heartbeats, and there were suggestions to turn them off rather than use the default. In addition to turning heartbeats off in the config file, I also added this to the browser code:
client.heartbeat.outgoing=0;
client.heartbeat.incoming=0;
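For reference, the fix amounts to making the SockJS URL and the web_stomp listener agree, i.e. either pointing the browser at the Web-STOMP default port:
var url = "http://192.168.1.100:15674/stomp";
or configuring the web_stomp listener explicitly, along these lines (the exact config key can vary by RabbitMQ version):
{rabbitmq_web_stomp, [{tcp_config, [{port, 15674}]}]}
The rabbitmq_stomp tcp_listeners setting in the original config only affects the plain STOMP (TCP) listener, not the SockJS/Web-STOMP endpoint.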

Is There a Way to Prevent MassTransit From Taking Down Our Service?

We're seeing one exception that, if I recall correctly, should have been fixed in MT 3.0 (we're on 3.1). We see it when our environment is under a very high load:
Exception Info: RabbitMQ.Client.Exceptions.AlreadyClosedException
Stack:
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.AsyncMethodBuilderCore.<ThrowAsync>b__1(System.Object)
at System.Threading.QueueUserWorkItemCallback.WaitCallback_Context(System.Object)
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem()
at System.Threading.ThreadPoolWorkQueue.Dispatch()
at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()
This exception is causing our Windows service to crash, and someone has to go start it back up. Is there an event or some configuration we can set up at a higher level to handle these unforeseen scenarios?
I know there are bus events, but none specific to exceptions outside of my message handling.
I also saw there is exception handling; unfortunately, it only covers scenarios where there's an error while processing a message, not when RabbitMQ throws an exception unrelated to retrieving or sending a message.
I just added some logging and service-level restart logic; I will update this with more details if they become available.
AppDomain.CurrentDomain.UnhandledException += UnhandledException;

public void UnhandledException(object sender, UnhandledExceptionEventArgs e)
{
    // Shut down, dispose, reinitialize internal objects and log messages.
}
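As a stopgap for the manual restarts, Windows service recovery options can be set so the service restarts automatically when the process dies, for example (service name is a placeholder):
sc failure "MyMassTransitService" reset= 86400 actions= restart/60000/restart/60000/restart/60000
This does not address the underlying handshake timeouts, but it removes the need to start the service back up by hand.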
Edit 1 - RabbitMQ Logs
We saw this error at 11:37:47; in all 3 of the RabbitMQ servers we're seeing these types of logs:
=INFO REPORT==== 5-May-2016::11:37:06 ===
accepting AMQP connection <0.925.14> (10.50.0.3:59625 -> 10.50.0.123:5672)
=ERROR REPORT==== 5-May-2016::11:37:10 ===
closing AMQP connection <0.917.14> (10.50.0.3:59588 -> 10.50.0.123:5672):
{handshake_timeout,handshake}
=WARNING REPORT==== 5-May-2016::11:37:15 ===
closing AMQP connection <0.614.14> (10.50.0.3:58645 -> 10.50.0.123:5672):
connection_closed_abruptly
=ERROR REPORT==== 5-May-2016::11:37:16 ===
closing AMQP connection <0.925.14> (10.50.0.3:59625 -> 10.50.0.123:5672):
{handshake_timeout,handshake}
=INFO REPORT==== 5-May-2016::11:37:17 ===
accepting AMQP connection <0.941.14> (10.50.0.3:59665 -> 10.50.0.123:5672)
=WARNING REPORT==== 5-May-2016::11:37:24 ===
closing AMQP connection <0.642.14> (10.50.0.3:58726 -> 10.50.0.123:5672):
connection_closed_abruptly
=INFO REPORT==== 5-May-2016::11:37:25 ===
accepting AMQP connection <0.955.14> (10.50.0.3:59774 -> 10.50.0.123:5672)
=WARNING REPORT==== 5-May-2016::11:37:27 ===
closing AMQP connection <0.955.14> (10.50.0.3:59774 -> 10.50.0.123:5672):
connection_closed_abruptly
=ERROR REPORT==== 5-May-2016::11:37:27 ===
closing AMQP connection <0.941.14> (10.50.0.3:59665 -> 10.50.0.123:5672):
{handshake_timeout,handshake}
=INFO REPORT==== 5-May-2016::11:37:29 ===
accepting AMQP connection <0.962.14> (10.50.0.3:59796 -> 10.50.0.123:5672)
=WARNING REPORT==== 5-May-2016::11:37:30 ===
closing AMQP connection <0.670.14> (10.50.0.3:58769 -> 10.50.0.123:5672):
connection_closed_abruptly
=INFO REPORT==== 5-May-2016::11:37:35 ===
accepting AMQP connection <0.972.14> (10.50.0.3:59814 -> 10.50.0.123:5672)
=INFO REPORT==== 5-May-2016::11:37:36 ===
accepting AMQP connection <0.975.14> (10.50.0.3:59824 -> 10.50.0.123:5672)
=WARNING REPORT==== 5-May-2016::11:37:36 ===
closing AMQP connection <0.975.14> (10.50.0.3:59824 -> 10.50.0.123:5672):
connection_closed_abruptly
=ERROR REPORT==== 5-May-2016::11:37:39 ===
closing AMQP connection <0.962.14> (10.50.0.3:59796 -> 10.50.0.123:5672):
{handshake_timeout,handshake}
=INFO REPORT==== 5-May-2016::11:37:44 ===
accepting AMQP connection <0.993.14> (10.50.0.3:59872 -> 10.50.0.123:5672)
=WARNING REPORT==== 5-May-2016::11:37:45 ===
closing AMQP connection <0.705.14> (10.50.0.3:58934 -> 10.50.0.123:5672):
connection_closed_abruptly
=ERROR REPORT==== 5-May-2016::11:37:45 ===
closing AMQP connection <0.972.14> (10.50.0.3:59814 -> 10.50.0.123:5672):
{handshake_timeout,handshake}
=INFO REPORT==== 5-May-2016::11:37:46 ===
accepting AMQP connection <0.1005.14> (10.50.0.3:59881 -> 10.50.0.123:5672)
=WARNING REPORT==== 5-May-2016::11:37:46 ===
closing AMQP connection <0.1005.14> (10.50.0.3:59881 -> 10.50.0.123:5672):
connection_closed_abruptly
=INFO REPORT==== 5-May-2016::11:37:47 ===
accepting AMQP connection <0.1010.14> (10.50.0.3:59892 -> 10.50.0.123:5672)
Going back a few minutes we see blocks of this several times:
=INFO REPORT==== 5-May-2016::11:28:25 ===
Mirrored queue 'bus-JEMSO04-w3wp-ktbyyynsicyfb1scbdjzj6targ' in vhost 'Beta': Adding mirror on node rabbit#redisd01: <18045.5072.16>
=INFO REPORT==== 5-May-2016::11:28:25 ===
Mirrored queue 'bus-JEMSO04-w3wp-ktbyyynsicyfb1scbdjzj6targ' in vhost 'Beta': Adding mirror on node rabbit#redisd02: <6520.2073.16>
=INFO REPORT==== 5-May-2016::11:28:25 ===
Mirrored queue 'bus-JEMSO04-w3wp-ktbyyynsicyfb1scbdjzj6targ' in vhost 'Beta': Synchronising: 0 messages to synchronise
=INFO REPORT==== 5-May-2016::11:28:25 ===
Mirrored queue 'bus-JEMSO04-w3wp-ktbyyynsicyfb1scbdjzj6targ' in vhost 'Beta': Synchronising: all slaves already synced
=INFO REPORT==== 5-May-2016::11:28:25 ===
Mirrored queue 'bus-JEMSO04-w3wp-ktbyyynsicyfb1scbdjzj6targ' in vhost 'Beta': Synchronising: 0 messages to synchronise
=INFO REPORT==== 5-May-2016::11:28:25 ===
Mirrored queue 'bus-JEMSO04-w3wp-ktbyyynsicyfb1scbdjzj6targ' in vhost 'Beta': Synchronising: all slaves already synced
=ERROR REPORT==== 5-May-2016::11:28:26 ===
closing AMQP connection <0.32168.13> (10.50.0.3:47716 -> 10.50.0.123:5672):
{handshake_timeout,handshake}
Edit 2 - Updated to MassTransit 3.3.3
We have a new issue after upgrading; it is coming from our consumer and is not under load:
MassTransit.Util.TaskSupervisor Error: 0 : Failed to close scope MassTransit.RabbitMqTransport.Pipeline.RabbitMqBasicConsumer - rabbitmq://rabbitmqdlb.jsa.local:5672/LocalDev/bus-MRHODEN-DT-Se
rvice.vshost-xabyyydu3ecy84dibdjamdsbrb?prefetch=16, System.Threading.Tasks.TaskCanceledException: A task was canceled.
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
at MassTransit.RabbitMqTransport.Pipeline.RabbitMqBasicConsumer.<Stop>d__32.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task)
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at System.Runtime.CompilerServices.ConfiguredTaskAwaitable.ConfiguredTaskAwaiter.GetResult()
at MassTransit.Util.TaskSupervisorExtensions.<>c__DisplayClass2_0.<<CreateParticipant>b__0>d.MoveNext()