Why do my RabbitMQ cluster connections and channels stay in flow state?

I have been testing my 3-node RabbitMQ cluster these days, using the Java PerfTest tool:
[root@server-42 bin]$ ./runjava com.rabbitmq.perf.PerfTest -x1 -y1 -e testex -h 'amqp://username:password@123.123.123.2/test' -t topic -k sample.info -s 1500 -i 20
id: test-154506-639, starting consumer #0
id: test-154506-639, starting consumer #0, channel #0
id: test-154506-639, starting producer #0
id: test-154506-639, starting producer #0, channel #0
id: test-154506-639, time: 20.000s, sent: 8913 msg/s, received: 8804 msg/s, min/avg/max latency: 6317/251907/727492 microseconds
id: test-154506-639, time: 40.004s, sent: 8993 msg/s, received: 8991 msg/s, min/avg/max latency: 157294/256691/387926 microseconds
id: test-154506-639, time: 60.011s, sent: 9029 msg/s, received: 9019 msg/s, min/avg/max latency: 146744/255631/384696 microseconds
id: test-154506-639, time: 80.017s, sent: 8946 msg/s, received: 8972 msg/s, min/avg/max latency: 164969/259147/723908 microseconds
id: test-154506-639, time: 100.019s, sent: 8971 msg/s, received: 8949 msg/s, min/avg/max latency: 164012/258115/353767 microseconds
I find that my RabbitMQ connections and channels stay in the flow state.
Why is that, and is there any way to increase performance?
I thought the flow state was there to stop the publisher from sending messages too quickly, in case the server cannot queue them.
But the sending rate in my test does not seem high at all, so why are they still in the flow state?
Can anyone help? Thanks in advance.

Flow Control:
RabbitMQ will reduce the speed of connections which are publishing too quickly for queues to keep up.
If you want to learn more about credit flow, you can read this doc, in particular:
To see how credit_flow and its settings affect publishing, let’s see how internal messages flow in RabbitMQ. Keep in mind that RabbitMQ is implemented in Erlang, where processes communicate by sending messages to each other.
You can try increasing the credit_flow settings; a sketch of what that could look like follows.
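A minimal sketch of that, assuming the credit_flow_default_credit key in advanced.config (double-check the exact key and defaults against the docs for your RabbitMQ version):

%% /etc/rabbitmq/advanced.config
[
  {rabbit, [
    %% {InitialCredit, MoreCreditAfter}; the commonly cited default is {400, 200}.
    %% Larger values let a publisher burst further before flow control engages,
    %% at the cost of more memory. Restart the node after changing this.
    {credit_flow_default_credit, {800, 400}}
  ]}
].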

In my case, I was getting this due to a lack of memory: too many unacked messages filled up memory, which put the connections into the flow state.
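If too many unacked messages are the cause, one common mitigation is to cap them with a consumer prefetch limit. A minimal sketch with the RabbitMQ Java client (the URI and queue name are placeholders, not taken from the question):

import com.rabbitmq.client.*;

public class BoundedConsumer {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setUri("amqp://username:password@broker-host/test"); // placeholder URI
        try (Connection conn = factory.newConnection();
             Channel channel = conn.createChannel()) {
            // At most 100 unacknowledged messages per consumer on this channel.
            channel.basicQos(100);
            channel.basicConsume("some-queue", false /* manual ack */,
                (consumerTag, delivery) -> {
                    // ... process the message ...
                    channel.basicAck(delivery.getEnvelope().getDeliveryTag(), false);
                },
                consumerTag -> { /* consumer cancelled */ });
            Thread.sleep(Long.MAX_VALUE); // keep the consumer alive
        }
    }
}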

Related

Keda RabbitMQ - Keda not spawning additional jobs when queue has few messages

I have a Keda ScaledJob configured to spawn 1 job per message in the 'ready' state in RabbitMQ.
It has a max replica count set to 70.
Observed:
When there are many messages in the queue, let's say 300, Keda correctly creates new jobs to reach the max replica count limit => So there are 70 running jobs each consuming 1 message from the queue.
When there are few messages in the queue, let's say 1 Ready and 1 Unacked, Keda refuses to create a new job even if there's enough resources in the cluster.
It's like waiting until the current running job finishes to spawn a new job.
Here's my Keda configuration:
---
# Reference - https://keda.sh/docs/2.0/concepts/scaling-jobs/
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: scaledjob-puppeteer
  labels:
    environment: development
    app: puppeteer-display
spec:
  jobTargetRef:
    parallelism: 1 # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1 # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    activeDeadlineSeconds: 7200 # (2 hours) Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 2 # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      spec:
        volumes:
          ...
        containers:
          ...
  pollingInterval: 10
  successfulJobsHistoryLimit: 0
  failedJobsHistoryLimit: 0
  maxReplicaCount: 75
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: tasks
        mode: QueueLength
        value: "1"
      authenticationRef:
        name: keda-trigger-auth-rabbitmq-conn
---
How can I make Keda create a job whenever the queue has >= 1 message?
Edit: It seems like it waits for at least 1 hour before creating the new job.
The problem seems to be the missing scalingStrategy setting. You can add the following configuration:
scalingStrategy:
  strategy: "accurate"
The accurate strategy is meant for the case where your jobs consume (remove) messages from the queue instead of locking them; this is often how other message queues are used.
For reference you can look into https://keda.sh/docs/2.7/concepts/scaling-jobs/
You can find further information about the scaling strategies in the details section.
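For reference, scalingStrategy sits at the top level of the ScaledJob spec, next to triggers; a minimal sketch based on the manifest above:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: scaledjob-puppeteer
spec:
  jobTargetRef:
    # ... as in the manifest above ...
  pollingInterval: 10
  maxReplicaCount: 75
  scalingStrategy:
    strategy: "accurate"
  triggers:
    - type: rabbitmq
      metadata:
        protocol: amqp
        queueName: tasks
        mode: QueueLength
        value: "1"
      authenticationRef:
        name: keda-trigger-auth-rabbitmq-conn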

Python program using Stomp protocol to connect to ActiveMQ keeps disconnecting

Below are the connection parameters in the Python program that connects to ActiveMQ:
import stomp

broker_url = config_params.items('BROKERS')
conn = stomp.Connection12(broker_url,
                          reconnect_sleep_initial=20.0,
                          reconnect_sleep_increase=2.0,
                          reconnect_attempts_max=10,
                          heartbeats=(60000, 60000))
So the ReadCheckInterval and WriteCheckInterval are set to 1 minute for the connection. It looks like the heartbeats are being missed. I am just trying to figure out whether the heartbeats are being missed on the client side or the ActiveMQ server side. Can someone help me?
Below are the logs from the Python program:
2020-02-25 12:27:16,141 - INFO - Attempting connection to host
2020-02-25 12:27:16,142 - INFO - Established connection to host
2020-02-25 12:27:16,142 - INFO - Starting receiver loop
2020-02-25 12:27:16,143 - DEBUG - Sending frame: ['STOMP', '\n', 'accept-version:1.2\n', 'client-id:\n', 'heart-beat:60000,60000\n']
2020-02-25 12:27:16,143 - DEBUG - Received frame: 'CONNECTED', headers={'server': 'ActiveMQ/5.15.2', 'heart-beat': '60000,60000'}, body=''
2020-02-25 12:27:16,143 - DEBUG - Sending frame: ['SUBSCRIBE', '\n', 'ack:auto\n', 'activemq.subscriptionName:subscriber\n']
2020-02-25 12:30:16,144 - DEBUG - Received frame: 'heartbeat', headers={}, body=None
2020-02-25 12:30:16,145 - ERROR - disconnected from broker, will attempt to reconnect...
2020-02-25 12:30:16,145 - INFO - Receiver loop ended
2020-02-25 12:30:16,320 - INFO - Attempting connection to host
2020-02-25 12:30:16,321 - INFO - Established connection to host
2020-02-25 12:30:16,321 - INFO - Starting receiver loop
2020-02-25 12:30:16,321 - DEBUG - Sending frame: ['STOMP', '\n', 'accept-version:1.2\n', 'client-id:\n', 'heart-beat:60000,60000\n']
2020-02-25 12:30:16,322 - DEBUG - Received frame: 'CONNECTED', headers={'server': 'ActiveMQ/5.15.2', 'heart-beat': '60000,60000'}, body=''
2020-02-25 12:30:16,322 - DEBUG - Sending frame: ['SUBSCRIBE', '\n', 'ack:auto\n', 'activemq.subscriptionName:subscriber\n']
I see both the client and the server missing heartbeats to each other. Below is a log where the client has missed sending a heartbeat. The connection gets established at 12:03:32. The client sends the first heartbeat at 12:03:32 and then subscribes to the ActiveMQ destination. It keeps getting messages, so there is activity, until 12:12:08. Then there is a period of inactivity until 12:13:32 (>60 seconds) and the connection gets terminated. Is the ActiveMQ server being too intolerant of missed heartbeats from the client? Would increasing the client's heartbeat interval to 120 seconds help in this case?
2020-02-26 12:03:32,498 - INFO - Established connection to host, port 61613
2020-02-26 12:03:32,499 - INFO - Sending frame: 'STOMP', headers={'heart-beat': '60000,60000'}
2020-02-26 12:03:32,512 - INFO - Received frame: 'CONNECTED', headers={'heart-beat': '60000,60000'}
2020-02-26 12:03:32,513 - INFO - Sending frame: 'SUBSCRIBE'
2020-02-26 12:04:27,924 - INFO - Received frame: 'MESSAGE'
.
.
2020-02-26 12:12:08,475 - INFO - Received frame: 'MESSAGE'
2020-02-26 12:13:32,519 - INFO - Received frame: 'heartbeat'
2020-02-26 12:13:32,548 - ERROR - disconnected from broker
I also see problems with the server failing to send the heartbeat and the client getting a heartbeat timeout error. I am thinking of disabling heartbeats from the server by setting the heartbeat configuration to (120000, 0). Any suggestions?
After some testing, it turned out that even a few milliseconds of delay in the client heartbeat was causing the broker to close the connection.
For this reason, ActiveMQ 5.9.0 added
transport.hbGracePeriodMultiplier (default = 1), which multiplies the heartbeat timeout by the configured value. Here is the JIRA issue where this feature was implemented:
https://issues.apache.org/jira/browse/AMQ-4674
I also removed the broker-to-client heartbeat by setting the heartbeats to (60000, 0), as it was redundant.
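A minimal sketch of that client-side change with stomp.py (the host, port, and credentials are placeholders):

import stomp

# (60000, 0): the client promises a heartbeat every 60 s, but the second value
# of 0 tells the broker not to send heartbeats back.
conn = stomp.Connection12([('broker-host', 61613)],   # placeholder host/port
                          reconnect_sleep_initial=20.0,
                          reconnect_sleep_increase=2.0,
                          reconnect_attempts_max=10,
                          heartbeats=(60000, 0))
conn.connect('user', 'password', wait=True)            # placeholder credentials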
The first value in the CONNECT frame's heart-beat header is the 'will send' value for client heartbeats to the broker. The client should attempt to maintain a consistent heartbeat at the indicated level, which the STOMP spec defines as the
smallest number of milliseconds between heart-beats that it can
guarantee
The broker will allow some grace period based on that value, after which, if the client has not sent a heartbeat or any other frame, the connection will be closed. From the trace given, the client is not sending any heartbeats or other wire-level activity, so the broker is dropping the connection.

Asterisk: "TLS clean shutdown alert reading data" after 120s in SIP call

I am using a Secure SIP trunk provided by Twilio to implement an IVR. I have set it up per Twilio's Asterisk configuration guide, installed SRTP to /usr/local/lib, and applied the configuration in https://wiki.asterisk.org/wiki/display/AST/Secure+Calling+Tutorial.
The problem is that any call longer than 2 minutes cannot be ended cleanly and causes Asterisk to restart.
sip.conf (using chan_sip, not pjsip):
[general]
; other configuration lines removed
tlsenable=yes
tlsbindaddr=0.0.0.0
tlscertfile=/etc/pki/tls/private/pbx.pem
tlscafile=/etc/pki/tls/private/gd_bundle-g2-g1.crt
tlscipher=ALL
tlsclientmethod=tlsv1
tlsdontverifyserver=yes
[twilio-trunk](!)
type=peer
context=from-twilio ;Which dialplan to use for incoming calls
dtmfmode=rfc4733
canreinvite=no
insecure=port,invite
transport=tls
qualify=yes
encryption=yes
media_encryption=sdes
I can make and receive calls just fine, and I have confirmed the calls are encrypted both via wireshark and confirmation from Twilio's own support queue.
At exactly 120 seconds into every call, this debug pops up:
[Dec 6 13:14:39] DEBUG[30015]: iostream.c:157 iostream_read: TLS clean shutdown alert reading data
[Dec 6 13:14:39] DEBUG[30015]: chan_sip.c:2905 sip_tcptls_read: SIP TCP/TLS server has shut down
The call continues to flow bi-directionally, and the caller never knows there is a problem until they hit a hangup in the context, i.e. h,1,Hangup(). Then Asterisk is restarted (new PID) and the caller hangs in limbo for another 5 minutes before the call times out with a fast busy. Twilio confirms they see the BYE and return an ACK at the point of the Hangup.
I was on 13.11 and updated to 15.1.3 with the same result: calls longer than 120s produce the TLS message in the debug log and Asterisk restarts.
There are no Google results for this out there, and Twilio hasn't been very helpful. Can anyone shed some light on what is happening and where I need to look next?
More logs:
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: channel.c:5551 set_format: Channel SIP/twilio0-00000000 setting write format path: gsm -> ulaw
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4017 rtp_raw_write: Difference is 2472, ms is 329
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (50 requested / 50 actual) timer ticks per second
– <SIP/twilio0-00000000> Playing ‘IVR/omnicare_9d_account.gsm’ (language ‘en’)
[Dec 8 10:18:48] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4928 ast_rtcp_interpret: Got RTCP report of 64 bytes from 34.203.250.7:10475
[Dec 8 10:18:53] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4928 ast_rtcp_interpret: Got RTCP report of 64 bytes from 34.203.250.7:10475
[Dec 8 10:18:55] DEBUG[4992]: iostream.c:157 iostream_read: TLS clean shutdown alert reading data
[Dec 8 10:18:55] DEBUG[4992]: chan_sip.c:2905 sip_tcptls_read: SIP TCP/TLS server has shut down
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:3192 ast_settimeout_full: Scheduling timer at (0 requested / 0 actual) timer ticks per second
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: channel.c:5551 set_format: Channel SIP/twilio0-00000000 setting write format path: ulaw -> ulaw
[Dec 8 10:18:58] DEBUG[4993][C-00000001]: res_rtp_asterisk.c:4928 ast_rtcp_interpret: Got RTCP report of 64 bytes from 34.203.250.7:10475
[Dec 8 10:19:01] DEBUG[4914]: cdr.c:4305 ast_cdr_engine_term: CDR Engine termination request received; waiting on messages…
Asterisk uncleanly ending (0).
Executing last minute cleanups
== Destroying musiconhold processes
[Dec 8 10:19:01] DEBUG[4914]: res_musiconhold.c:1627 moh_class_destructor: Destroying MOH class ‘default’
[Dec 8 10:19:01] DEBUG[4914]: cdr.c:1289 cdr_object_finalize: Finalized CDR for SIP/twilio0-00000000 - start 1512749813.880448 answer 1512749813.881198 end 1512749941.201797 dispo ANSWERED
== Manager unregistered action DBGet
== Manager unregistered action DBPut
== Manager unregistered action DBDel
== Manager unregistered action DBDelTree
[Dec 8 10:19:01] DEBUG[4914]: asterisk.c:2157 really_quit: Asterisk ending (0).
Check your firewall logs. We've had issues with sessions being torn down by firewalls that thought the NAT entries were stale/old.
You can also try configuring Asterisk to send keep-alive packets using the options qualify=yes and nat=yes in your sip.conf entry for that user/trunk, or inside the RTP stream with rtpkeepalive=<secs>; a sketch follows. The best docs I could find for sip.conf are the example config on GitHub.
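A minimal sketch of what that could look like for the trunk peer from the question (the rtpkeepalive interval is only an example value):

[twilio-trunk](!)
; ... existing options from the question ...
qualify=yes        ; send periodic OPTIONS keep-alives to the peer
nat=yes            ; keep NAT bindings refreshed (chan_sip)
rtpkeepalive=30    ; send an RTP keep-alive after 30 seconds of RTP inactivity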
I dug in the source code for the text "TLS clean shutdown alert reading data", which pointed me to some OpenSSL docs which suggest a clean/normal closure (which I'm guessing was caused by your firewall):
The TLS/SSL connection has been closed. If the protocol version is SSL 3.0 or higher, this result code is returned only if a closure alert has occurred in the protocol, i.e. if the connection has been closed cleanly. Note that in this case SSL_ERROR_ZERO_RETURN does not necessarily indicate that the underlying transport has been closed.

ignite: possible starvation in striped pool with igniteCache

Hi there,
I got the error below when using an Ignite cache.
My system selects a master node using ZooKeeper and has many slave nodes. The master processes values that expire from the Ignite cache and puts them into an Ignite queue. The slave nodes feed data into the Ignite cache with streamer.addData(k, v) and consume the Ignite queue.
My code is:
Ignite cache and streamer:
// use ZooKeeper IpFinder
ignite = Ignition.getOrStart(igniteConfiguration);
igniteCache = ignite.getOrCreateCache(cacheConfiguration);
igniteCache.registerCacheEntryListener(new MutableCacheEntryListenerConfiguration<>(
        (Factory<CacheEntryListener<K, CountValue>>) () -> (CacheEntryExpiredListener<K, CountValue>) this::onCacheExpired,
        null, true, true));
// onCacheExpired: the master resolves the expired entry and puts it into igniteQueue

cacheConfiguration.setExpiryPolicyFactory(CreatedExpiryPolicy.factoryOf(Duration.ONE_MINUTE));
igniteDataStreamer = ignite.dataStreamer(igniteCache.getName());
igniteDataStreamer.deployClass(BaseIgniteStreamCount.class);
igniteDataStreamer.allowOverwrite(true);
igniteDataStreamer.receiver(StreamTransformer.from((CacheEntryProcessor<K, CountValue, Object>) (e, arg) -> {
    // process the value.
    return null;
}));
The master processes the entries that expire from the cache and puts them into an Ignite queue:
CollectionConfiguration collectionConfiguration = new CollectionConfiguration().setCollocated(true);
queue = ignite.queue(igniteQueueName, 0, collectionConfiguration);
The slaves consume the queue.
But I got the warning below after the system had been running for some hours:
2017-09-14 17:06:45,256 org.apache.ignite.logger.java.JavaLogger warning
WARNING: >>> Possible starvation in striped pool.
Thread name: sys-stripe-6-#7%ignite%
Queue: []
Deadlock: false
Completed: 77168
Thread [name="sys-stripe-6-#7%ignite%", id=134, state=WAITING, blockCnt=0, waitCnt=68842]
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:176)
at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:139)
at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:935)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.onEntryUpdate(CacheContinuousQueryHandler.java:850)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler.access$700(CacheContinuousQueryHandler.java:82)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryHandler$1.onEntryUpdated(CacheContinuousQueryHandler.java:413)
at o.a.i.i.processors.cache.query.continuous.CacheContinuousQueryManager.onEntryExpired(CacheContinuousQueryManager.java:429)
at o.a.i.i.processors.cache.GridCacheMapEntry.onExpired(GridCacheMapEntry.java:3046)
at o.a.i.i.processors.cache.GridCacheMapEntry.onTtlExpired(GridCacheMapEntry.java:2961)
at o.a.i.i.processors.cache.GridCacheTtlManager$1.applyx(GridCacheTtlManager.java:61)
at o.a.i.i.processors.cache.GridCacheTtlManager$1.applyx(GridCacheTtlManager.java:52)
at o.a.i.i.util.lang.IgniteInClosure2X.apply(IgniteInClosure2X.java:38)
at o.a.i.i.processors.cache.IgniteCacheOffheapManagerImpl.expire(IgniteCacheOffheapManagerImpl.java:1007)
at o.a.i.i.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:198)
at o.a.i.i.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:160)
at o.a.i.i.processors.cache.GridCacheUtils.unwindEvicts(GridCacheUtils.java:854)
at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1073)
at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:561)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:378)
at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:304)
at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:99)
at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:293)
at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1184)
at o.a.i.i.managers.communication.GridIoManager.access$4200(GridIoManager.java:126)
at o.a.i.i.managers.communication.GridIoManager$9.run(GridIoManager.java:1097)
at o.a.i.i.util.StripedExecutor$Stripe.run(StripedExecutor.java:483)
at java.lang.Thread.run(Thread.java:745)
The striped pool is responsible for message processing. This warning tells you that no progress is happening on some of the stripes. It may happen due to a bad network connection or when you put massive objects into a cache or a queue.
You may find more information about it in these threads:
http://apache-ignite-users.70518.x6.nabble.com/Possible-starvation-in-striped-pool-td14892.html
http://apache-ignite-users.70518.x6.nabble.com/Possible-starvation-in-striped-pool-message-td15993.html
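Beyond those causes, one thing worth checking (my suggestion, not something stated in the warning itself) is that the expiry listener from the question returns quickly, because it is invoked from Ignite-managed threads. A minimal sketch of handing the queue work to a dedicated executor, assuming a distributed queue of expired keys:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import javax.cache.event.CacheEntryEvent;
import javax.cache.event.CacheEntryListenerException;
import org.apache.ignite.IgniteQueue;

// Hands expired entries to a dedicated executor so the Ignite callback thread
// is released immediately instead of waiting on the distributed queue operation.
public class ExpiryOffload<K, V> {
    private final IgniteQueue<K> expiredKeys;                     // assumed: a queue of expired keys
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    public ExpiryOffload(IgniteQueue<K> expiredKeys) {
        this.expiredKeys = expiredKeys;
    }

    // Same shape as the onCacheExpired method referenced by the question's listener.
    public void onCacheExpired(Iterable<CacheEntryEvent<? extends K, ? extends V>> events)
            throws CacheEntryListenerException {
        for (CacheEntryEvent<? extends K, ? extends V> e : events) {
            K key = e.getKey();                                   // copy what is needed out of the event
            worker.submit(() -> { expiredKeys.add(key); });       // distributed queue op happens off the Ignite thread
        }
    }
}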

What are the numbers in the square brackets in NSLog() output?

What is the stuff between the [] in the log message below? I get this in my iPhone app, and I have no idea where the message is coming from. My first guess would be a line number, but which file would it be in?
2010-10-19 08:56:12.006 Encore[376:6907]
The first number is the process ID, the second is the logging thread's Mach port. A desktop example:
2010-10-19 17:37:13.189 nc_init[28617:a0f] nc <CFNotificationCenter 0x10010d170 [0x7fff70d96f20]> - default <CFNotificationCenter 0x10010d2a0 [0x7fff70d96f20]>
(gdb) i thread
Thread 1 has current state "WAITING"
Mach port #0xa0f (gdb port #0x4203)
frame 0: main () at nc_init.m:10
pthread ID: 0x7fff70ebfc20
system-wide unique thread id: 0x167b49
dispatch queue name: "com.apple.main-thread"
dispatch queue flags: 0x0
total user time: 13232000
total system time: 16099000
scaled cpu usage percentage: 0
scheduling policy in effect: 0x1
run state: 0x3 (WAITING)
flags: 0x0
number of seconds that thread has slept: 0
current priority: 31
max priority: 63
suspend count: 0.
(gdb) p/x (int)mach_thread_self()
$1 = 0xa0f
Notice how 0xa0f is reported as the thread's Mach port.
The first number is the process ID; I'm not sure about the second. This prefix will precede every line that's printed to the console from your application.
Possibly an NSLog(@""); is causing this.
Is your application running, or has it crashed by this stage?
The first number is the process ID, as the others have said. The second number is the thread ID, at least I'm pretty sure that's what it is...
It's the process ID, in fact. You can see that in the GDB console with a line somewhere that should read "[Switching to process 376]".