My storm topology fails after running for 10 days - apache

My Storm topology fails after running for about 10 days. When I deploy the same topology (same JAR) under a new name, it runs fine and is still running today. So my question is: what new resources, including ZooKeeper memory, get allocated for a newly deployed Storm topology? If I redeploy the topology under the OLD name, it fails again within a few hours.
I made no changes before deploying it under the new topology name.
Does a Storm topology consume memory space on the worker nodes after running for a long period, which I would need to take care of?

I'm familiar with at least one bug in Storm pre-1.0.0 that can cause workers to hang. If you aren't on the latest Storm version, try upgrading.
Other than that, your best bet for debugging this is to use jstack or kill -3 on the worker JVM to figure out what your hanging worker is doing. You may also want to enable debug logging if it doesn't harm your performance too much. You do this by calling config.setDebug(true) when setting up the topology.
Once you know why the worker isn't processing tuples, you can try posting the stack trace here. Maybe there's an issue in Storm.

Related

How do I configure Embedded Infinispan to handle K8s rolling updates?

I have a simple project that allows you to add keys to a distributed cache in an application that is running Infinispan version 13 in embedded mode. It is all published here.
I run a Kubernetes setup that can run in minikube. When I run my example with six pods and perform a rolling update, Infinispan performance degrades from the start of the rollout until four minutes after the last pod has restarted and created its cache. After that, the cluster operates normally again. By degraded I mean that getting the count of items in the cache takes 2-3 seconds, compared to under 0.5 seconds normally. With my setup this happens consistently, and it consistently recovers after four minutes.
When running the project on my local machine without a Kubernetes environment, I have not experienced the same kind of delays.
I have tried using TRACE logs, but can see no event of significance that happens after these four minutes.
Is there something obvious that I'm missing in my configuration of Infinispan (that you can see in my referenced project), or some additional operation that needs to be performed? (currently I start the cache on startup, and perform stop on shutdown).
A colleague found the following logs when running Infinispan in non embedded mode:
2022-01-09 14:56:45,378 DEBUG (jgroups-230,infinispan-server-2) [org.jgroups.protocols.UNICAST3] infinispan-server-2: removing expired connection for infinispan-server-0 (240058 ms old) from recv_table
After these log entries, service performance returned to normal. This led us to suspect that JGroups somehow tries to use old connections to pods that have been removed. By changing the conn_close_timeout setting on UNICAST3 in JGroups to 10 seconds instead of the default of 4 minutes, we could confirm that the degradation cleared after 10 seconds instead of 4 minutes.
Additionally, this fix seems to work only when the service runs as a StatefulSet, not as a Deployment. I don't have an explanation for exactly why, but in conclusion, making the service a StatefulSet and changing conn_close_timeout on UNICAST3 in the JGroups configuration fixed our problem.
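For illustration, the setting change described above might look roughly like this in an embedded Infinispan XML configuration. This is only a sketch under assumptions: the stack name fast-close is made up, and it assumes the cache manager otherwise uses Infinispan's bundled kubernetes stack. The only value taken from the text is conn_close_timeout on UNICAST3, set to 10 seconds (the attribute is in milliseconds).

```xml
<infinispan>
  <jgroups>
    <!-- Hypothetical stack name; inherits from the bundled "kubernetes" stack (assumption). -->
    <stack name="fast-close" extends="kubernetes">
      <!-- Close idle unicast connections after 10 s instead of the 4 min default. -->
      <UNICAST3 conn_close_timeout="10000" stack.combine="COMBINE"/>
    </stack>
  </jgroups>
  <cache-container>
    <!-- Point the transport at the customised stack. -->
    <transport stack="fast-close"/>
  </cache-container>
</infinispan>
```

If the project defines its own JGroups file instead of inheriting a bundled stack, the same attribute would go on the existing UNICAST3 element there.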

How to get all running containers in Yarn ApplicationMaster code?

I'm developing a long-running service using the YARN framework. The ApplicationMaster code just allocates and starts some containers and lets them run forever. The AM also periodically reports the status of every running container. The AM knows every container it allocated and started because it explicitly stores them in an in-memory map.
Now the question is: in case of an AM restart, i.e. when a new application attempt is made, how does the new AM know all of the running containers that the old AM allocated? The new AM needs this because it has to report their status.
The AMRMClient doesn't seem to have an interface for the AM to get the container list of its application.
AM (one per Job) is a container which has been initialized after the RM (one per cluster) reserves the Memory and Vcores for that Job.
Are you talking about an AM failure that requires a restart? If so, then by default the new AM will start the new attempt with new containers (it loses the connection to the old containers), and the old ones will be freed after some timeout, driven by the periodic NM (NodeManager) heartbeat to the RM.
Regarding the code, I am not sure how it is implemented.

How to track celery and rabbitmq in production server

I have installed both Celery and RabbitMQ. Now I would like to track how many messages are in the queue and how they are distributed, and to see the list of Celery consumers and the tasks they are executing. This is because I had issues with Celery getting stuck under memory pressure. I tried installing the RabbitMQ management plugin for a start, and when I tried to log in at myservr.com:15672 it said it can only be used through localhost. Is there any workaround? Also, is it a good idea to run such monitoring on production servers? Is there any chance of memory leaks?
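
As a rough sketch of the kind of tracking described above: the RabbitMQ management plugin exposes queue information as JSON over its HTTP API (a GET on /api/queues), and each queue object carries counters such as messages_ready, messages_unacknowledged and consumers. The helper below only reduces an already-decoded response to those counters. The function name and the sample payload are made up for illustration, and fetching the JSON (with a user that is allowed to log in remotely) is left out.

```python
import json


def summarize_queues(queues):
    """Reduce a decoded /api/queues response to per-queue counters.

    `queues` is the list of queue objects returned by the management API.
    Missing counters are treated as 0 so that idle queues still show up.
    """
    summary = {}
    for queue in queues:
        summary[queue["name"]] = {
            "ready": queue.get("messages_ready", 0),
            "unacked": queue.get("messages_unacknowledged", 0),
            "consumers": queue.get("consumers", 0),
        }
    return summary


# Made-up sample payload, shaped like a trimmed /api/queues response.
SAMPLE_RESPONSE = """
[
  {"name": "celery", "messages_ready": 120, "messages_unacknowledged": 8, "consumers": 2},
  {"name": "emails", "messages_ready": 0, "messages_unacknowledged": 0, "consumers": 0}
]
"""

if __name__ == "__main__":
    for name, counters in summarize_queues(json.loads(SAMPLE_RESPONSE)).items():
        print(name, counters)
```

A growing "ready" count with a non-zero "consumers" count would be one hint that workers are connected but stuck.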

RabbitMQ creates a number of strange processes

I happened to find a number of strange processes created by RabbitMQ on my RabbitMQ server. I run the RabbitMQ server in a Docker container. I recreated the container, and hours later those processes appeared again. There are some consumers connecting to it. Any idea what those processes are for? Thanks!

Celery workers missing heartbeats and getting substantial drift over Ec2

I am testing my celery implementation over 3 ec2 machines right now. I am pretty confident in my implementation now, but I am getting problems with the actual worker execution. My test structure is as follows:
1 ec2 machine is designated as the broker, also runs a celery worker
1 ec2 machine is designated as the client (runs the client celery script that enqueues all the tasks using .delay()), also runs a celery worker
1 ec2 machine is purely a worker.
All the machines have 1 celery worker running. Before, I was immediately getting the message:
"Substantial drift from celery#[other ec2 ip] may mean clocks are out of sync."
A drift amount in seconds would then be printed, which would increase over time.
I would also get messages: "missed heartbeat from celery#[other ec2 ip]".
The machine would be doing very little work at this point, so my AutoScaling config in ec2 would shut down the instance automatically once its CPU utilization got very low (<5%).
So to try to solve this problem, I attempted to sync all my machines' clocks (although I thought celery handled this) with these commands, which were run at startup on all machines:
apt-get -qy install ntp
service ntp start
With this, they all performed well for about 10 minutes with no hitches, after which I started getting missed heartbeats and my ec2 instances stalled and shut down. The weird thing is, the drift increased and then decreased sometimes.
Any idea on why this is happening?
I am using the newest version of celery (3.1) and rabbitmq
EDIT: It should be noted that I am utilizing us-west-1a and us-west-1c availability zones on ec2.
EDIT2: I am starting to think memory problems might be an issue. I am using a t2.micro instance, and running 3 celery workers on the same machine (only 1 instance), which is also the broker, still causes heartbeat misses and stalls.
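
To make the drift warning above more concrete, here is a minimal sketch of how such a check could work. It is not Celery's actual code: the function names are hypothetical and the 16-second threshold is an assumption used for illustration. The point is that a drift computed as "local receive time minus the timestamp stamped by the sender" includes any delay in delivering or processing the heartbeat, so a stalled or memory-starved worker or broker could produce growing drift even when the clocks are in sync.

```python
import time

# Assumed threshold in seconds, for illustration only.
MAX_DRIFT_SECONDS = 16.0


def heartbeat_drift(sent_timestamp, received_at=None):
    """Absolute gap between the sender's timestamp and the local receive time.

    The gap mixes two effects: clock offset between the machines and the
    time the heartbeat spent in transit or waiting to be processed.
    """
    if received_at is None:
        received_at = time.time()
    return abs(received_at - sent_timestamp)


def is_substantial(drift, max_drift=MAX_DRIFT_SECONDS):
    """True when the drift exceeds the warning threshold."""
    return drift > max_drift


if __name__ == "__main__":
    # Heartbeat stamped at t=100 s but only handled locally at t=130 s.
    drift = heartbeat_drift(100.0, received_at=130.0)
    print(drift, is_substantial(drift))
```

Under this reading, a drift that rises and falls over time would be more consistent with variable processing delay than with a fixed clock offset.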