How to get all running containers in YARN ApplicationMaster code? - hadoop-yarn

I'm developing a long-running service using the YARN framework. The ApplicationMaster code just allocates and starts some containers and lets them run forever. The AM also reports the status of every running container periodically. The AM knows every container it has allocated and started because it explicitly stores them in an in-memory map.
Now the question is: in case of an AM restart, i.e. when a new app attempt is made, how does the new AM learn about all of the running containers that the old AM allocated? The new AM needs this information because it has to report their status.
The AMRMClient clearly doesn't provide an interface for the AM to get the list of containers belonging to its application.

The AM (one per job) is itself a container that is started after the RM (one per cluster) reserves the memory and vcores for that job.
Are you talking about an AM failure that requires a restart? If so, the new AM will start the new attempt with new containers (it loses the connection to the old containers), and the old ones will be freed after some timeout via the NodeManager (NM) heartbeat to the RM, which happens periodically.
Regarding the code, I am not sure how it is implemented.
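For what it's worth, here is a minimal sketch of one way a new attempt could recover containers from a previous attempt, assuming the application was submitted with keepContainersAcrossApplicationAttempts enabled; the host name, port, and tracking URL below are placeholders, not values from the question.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.protocolrecords.RegisterApplicationMasterResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Sketch: register the new AM attempt and read back the containers that are
// still running from earlier attempts (requires the submission context to set
// keepContainersAcrossApplicationAttempts = true).
public class PreviousContainersSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<AMRMClient.ContainerRequest> amrmClient = AMRMClient.createAMRMClient();
        amrmClient.init(new Configuration());
        amrmClient.start();

        RegisterApplicationMasterResponse response =
                amrmClient.registerApplicationMaster("am-host", 0, "");

        // Containers that survived the previous attempt; the new AM can rebuild
        // its in-memory map from this list and resume status reporting.
        List<Container> previous = response.getContainersFromPreviousAttempts();
        for (Container c : previous) {
            System.out.println("Still running: " + c.getId() + " on " + c.getNodeId());
        }
    }
}
```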

Related

Apache Ignite: Getting affinity for too old topology version that is already out of history (try to increase 'IGNITE_AFFINITY_HISTORY_SIZE')

I am getting this exception intermittently while trying to run co-located join queries on cached data. Below are some specifics of the environment and how the caches are initialized.
Running embedded with a spring boot application
Deployed in Kubernetes environment with TcpDiscoveryJdbcIpFinder
Running on 3+ nodes
The caches are created dynamically using BinaryObjects and QueryEntity
The affinity keys are forced to be a static value using AffinityKeyMapper (for the same group of data)
I am getting "Getting affinity for too old topology version that is already out of history (try to increase 'IGNITE_AFFINITY_HISTORY_SIZE')" sporadically. Sometimes this happens continuously for a few minutes; sometimes it works on a second or third try, and sometimes we don't see this error for hours. I already increased IGNITE_AFFINITY_HISTORY_SIZE to 100000 and we are still getting this message.
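For illustration, a minimal sketch of the kind of dynamic cache creation described above, using BinaryObjects and a QueryEntity; the cache name, type names, and fields (myCache, MyRecord, groupId, value) are placeholders, and the AffinityKeyMapper wiring mentioned in the question is left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.cache.QueryEntity;
import org.apache.ignite.configuration.CacheConfiguration;

// Sketch: dynamically create a SQL-queryable cache accessed via BinaryObjects.
public class DynamicCacheSketch {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        LinkedHashMap<String, String> fields = new LinkedHashMap<>();
        fields.put("groupId", "java.lang.String");
        fields.put("value", "java.lang.String");

        QueryEntity entity = new QueryEntity("java.lang.String", "MyRecord")
                .setFields(fields);

        CacheConfiguration<String, BinaryObject> cfg =
                new CacheConfiguration<String, BinaryObject>("myCache")
                        .setQueryEntities(Collections.singletonList(entity));
        // A custom AffinityKeyMapper (as in the question) would be set on cfg here.

        IgniteCache<String, BinaryObject> cache =
                ignite.getOrCreateCache(cfg).withKeepBinary();
        System.out.println("Cache size: " + cache.size());
    }
}
```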

Dynatrace OneAgent container in ECS Fargate stops but application container is running

I am trying to install OneAgent in my ECS Fargate task. Along with the application container, I have added another container definition for OneAgent with the image alpine:latest, using runtime injection.
While running the task, the OneAgent container is initially in the running state, but after about a minute it goes to the stopped state while the application container stays running.
In Dynatrace, the same host is available and keeps getting recreated every 5-10 minutes.
Actually, the issue I had was that the task was in draining status because of an application issue, which is why the host kept getting recreated in Dynatrace. At the same time, since I used runtime injection for my ECS Fargate task, once the binaries are downloaded and injected into the volume, the OneAgent container stops while the application container keeps running and sending logs to Dynatrace.
I had the same problem; after connecting to the cluster via SSH, I saw that the agent needs to be privileged. The only thing that worked for me was sending traces and metrics through OpenTelemetry.
https://aws-otel.github.io/docs/components/otlp-exporter
Alternative:
use sleep infinity in the command field of your OneAgent container, for example as sketched below.
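This is only an illustrative fragment of an ECS container definition, not a verified configuration; the container name is a placeholder, the image is the one mentioned in the question, and marking the sidecar as non-essential is an assumption about the intended setup.

```json
{
  "name": "oneagent",
  "image": "alpine:latest",
  "essential": false,
  "command": ["sleep", "infinity"]
}
```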

How do I configure Embedded Infinispan to handle K8s rolling updates?

I have a simple project that allows you to add keys to a distributed cache in an application that is running Infinispan version 13 in embedded mode. It is all published here.
I run a Kubernetes setup that can run in minikube. I observe that when I run my example with six pods and perform a rolling update, my Infinispan performance degrades from the start of the rollout up until four minutes after the last pod has restarted and created its cache. After this time the cluster operates normally again. By degrading I mean that the operation of getting the count of items in the cache takes 2-3 seconds to execute, compared to below 0.5 seconds in normal operation. With my setup this happens consistently, and it consistently works again after four minutes.
When running the project on my local machine without a kubernetes environment I have not experienced the same kind of delays.
I have tried using TRACE logs, but I can see no event of significance happening around the four-minute mark.
Is there something obvious that I'm missing in my configuration of Infinispan (which you can see in my referenced project), or some additional operation that needs to be performed? (Currently I start the cache on startup and stop it on shutdown, roughly as sketched below.)
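For context, a minimal sketch of the start-on-startup / stop-on-shutdown pattern with an embedded clustered cache; the cache name and distribution mode are placeholders, not the referenced project's actual configuration.

```java
import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

// Sketch: embedded Infinispan with a clustered (distributed) cache that is
// started on application startup and stopped on shutdown.
public class EmbeddedCacheSketch {
    public static void main(String[] args) {
        DefaultCacheManager manager = new DefaultCacheManager(
                GlobalConfigurationBuilder.defaultClusteredBuilder().build());

        manager.defineConfiguration("myCache",
                new ConfigurationBuilder()
                        .clustering().cacheMode(CacheMode.DIST_SYNC)
                        .build());

        Cache<String, String> cache = manager.getCache("myCache"); // starts the cache
        cache.put("key", "value");
        System.out.println("size = " + cache.size());

        manager.stop(); // stop on shutdown
    }
}
```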
A colleague found the following logs when running Infinispan in non-embedded mode:
2022-01-09 14:56:45,378 DEBUG (jgroups-230,infinispan-server-2) [org.jgroups.protocols.UNICAST3] infinispan-server-2: removing expired connection for infinispan-server-0 (240058 ms old) from recv_table
After these logs appeared, the service performance returned to normal. This led us to suspect that JGroups somehow tries to use old connections to pods that have been removed. By changing the conn_close_timeout setting on UNICAST3 in JGroups to 10 seconds instead of the default value of 4 minutes, we confirmed that the service degradation now cleared after 10 seconds instead of 4 minutes.
Additionally, it seems that this fix only works when the service runs as a StatefulSet and not when it runs as a Deployment. I don't have an explanation for exactly why this is, but in conclusion, making the service a StatefulSet and changing conn_close_timeout on UNICAST3 in the JGroups configuration fixed our problem.
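For reference, a sketch of what that single change might look like in a JGroups stack definition; the surrounding protocols are omitted, and the value of 10000 ms is assumed to correspond to the 10 seconds mentioned above.

```xml
<!-- Fragment of a JGroups protocol stack: shorten how long UNICAST3 keeps
     connection state for members that have left (value in milliseconds). -->
<UNICAST3 conn_close_timeout="10000"/>
```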

What will happen to an OpenStack instance if a server (node) goes offline?

I'm new to OpenStack and have a basic question about it. Assume that we have 3 master nodes (controllers) and 10 slave nodes (compute nodes) in our cloud, and we create 50 VMs (instances) on it. What will happen if one node (controller or compute node) goes offline (fails)? What is the best way to prevent a VM from shutting down if a server goes offline?
Best regards
This question requires more than a short Stack Overflow answer. Here are a few initial thoughts.
When a controller goes offline, the instance itself continues running, but if the failed controller hosts a router, the instance might be cut off from the network. Generally, if the controller has anything that the instance needs, that thing won't be available anymore. There are measures like HA routers that can help in such a case.
When the instance's compute host goes down, the instance doesn't run anymore. You can evacuate instances from a failed compute host, which means that they are rebuilt on different hosts. If an instance's root disk resides on a volume or an ephemeral disk that is shared with other compute hosts, this means a mere instance reboot. If the instance has an ephemeral disk inside the failed host, it must be rebuilt from scratch.
OpenStack has a project named Masakari whose goal is to make instances resilient by redundancy. In short, instance HA. The application keeps running even if an instance crashes.
By the way, master and slave are not correct terminology in this context. Use controller and compute instead.

My Storm topology fails after running for 10 days

My Storm topology fails after running for 10 days. When I deploy the same topology (same JAR) under a new name, it has run fine to date, so my question is: what new resources get allocated for a newly deployed Storm topology, including ZooKeeper memory? If I redeploy the topology under the OLD name, it fails again within a few hours.
I have not made any changes before deploying it under the new topology name.
Does a Storm topology consume memory on the worker nodes after running for a long period that I need to take care of?
I'm familiar with at least one bug in Storm pre-1.0.0 that can cause workers to hang. If you aren't on the latest Storm version, try upgrading.
Other than that, your best bet for debugging this is to use jstack or kill -3 on the worker JVM to figure out what your hanging worker is doing. You may also want to enable debug logging if it doesn't hurt your performance too much. You do this by calling config.setDebug(true); when setting up the topology.
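A minimal sketch of where that call goes when submitting a topology; the topology name, worker count, and the omitted spout/bolt wiring are placeholders, and the org.apache.storm packages apply to Storm 1.x and later (older releases use backtype.storm).

```java
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.topology.TopologyBuilder;

// Sketch: enable debug logging for a topology at submission time.
public class DebugTopologySketch {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // builder.setSpout(...) and builder.setBolt(...) omitted here.

        Config config = new Config();
        config.setDebug(true);   // logs every emitted tuple -- very verbose
        config.setNumWorkers(2); // placeholder worker count

        StormSubmitter.submitTopology("my-topology", config, builder.createTopology());
    }
}
```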
Once you know why the worker isn't processing tuples, you can try posting the stack trace here; maybe there's an issue in Storm.