Apache Ignite unable to find a deployed service - ignite

I've noticed a strange behaviour of Apache Ignite which occurs fairly reliably on my 5-node Apache Ignite cluster but can be replicated even with a two-node cluster. I use Apache Ignite 2.7 for .NET on Linux, deployed in a Kubernetes cluster (each pod hosts one node).
The problem is as follows. Assume we've got a cluster consisting of 2 Apache Ignite nodes, A and B. Both nodes start and initialize. A couple of Ignite Services are deployed on each node during the initialization phase. Among others, a service named QuoteService is deployed on node B.
So far so good. The cluster works as expected. Then node B crashes or gets stopped for whatever reason and restarts. All the Ignite services hosted on node B get redeployed, and the node rejoins the cluster.
However, when a service on node A tries to call the QuoteService expected to be available on node B, an exception is thrown with the following message: Failed to find deployed service: QuoteService. This is strange, as the line registering the service did run during the restart of node B:
services.DeployMultiple("QuoteGenerator", new Services.Ignite.QuoteGenerator(), 8, 2);
(deploying the service as a singleton does not make any difference)
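For reference, the caller on node A obtains the service through a service proxy along these lines. This is a minimal sketch: the IQuoteService interface and its GetQuote method are made up for illustration, and only the service name comes from the exception message.
// Hedged sketch of the calling side on node A.
// IQuoteService and GetQuote are hypothetical placeholders.
var services = ignite.GetServices();
var quoteService = services.GetServiceProxy<IQuoteService>("QuoteService");
var quote = quoteService.GetQuote("ACME"); // throws "Failed to find deployed service: QuoteService" after node B rejoins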
A restart of either node A or node B separately does not help. The problem can only be resolved by shutting down the entire Ignite cluster and restarting all the nodes.
This condition can be reproduced even when 5 nodes are running.
This bug report may look a bit unspecific, but it is hard to give concrete reproduction steps, as replication involves setting up at least two Ignite nodes and stopping and restarting them in sequence. So let me pose the questions this way:
1. Have you ever noticed such a condition, or have you received similar reports from other users?
2. If so, what steps can you recommend to address this problem?
3. Should I wait for the next version of Apache Ignite as I read that the service deployment mechanism is currently being overhauled?
UPD:
I'm getting a similar problem on a running cluster even without stopping/starting nodes. It seems to have a different genesis, so I will open another question on SA.

I've figured out what caused the described behavior (although I don't understand why exactly).
I wanted to ensure that the Ignite service is only deployed on the current node so I used the following C# code to deploy the service:
var services = ignite.GetCluster().ForLocal().GetServices();
services.DeployMultiple("FlatFileService", new Services.Ignite.FlatFileService(), 8, 2);
When I changed my code to rely only on a NodeFilter to limit the deployment of the service to a specific set of nodes and got rid of "GetCluster().ForLocal().", the bug disappeared. The final code is as follows:
var flatFileServiceCfg = new ServiceConfiguration
{
    Service = new Services.Ignite.FlatFileService(),
    Name = "FlatFileService",
    NodeFilter = new ProductServiceNodeFilter(),
    MaxPerNodeCount = 2,
    TotalCount = 8
};
var services = ignite.GetServices();
services.DeployAll(new[] { flatFileServiceCfg, ... other services... });
It is still strange, however, that the old code worked until the topology changed.
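For completeness, a node filter like the ProductServiceNodeFilter referenced above can be a small serializable class. The sketch below assumes that ServiceConfiguration.NodeFilter accepts an IClusterNodeFilter implementation and uses a made-up user attribute to mark the target nodes.
// Minimal sketch of a node filter. The "product.role" attribute and its value
// are assumptions; they would be set via IgniteConfiguration.UserAttributes
// on the nodes that should host the service.
using System;
using Apache.Ignite.Core.Cluster;

[Serializable]
public class ProductServiceNodeFilter : IClusterNodeFilter
{
    public bool Invoke(IClusterNode node)
    {
        // Deploy only on nodes that advertise the expected attribute value.
        return node.TryGetAttribute<string>("product.role", out var role)
               && role == "product-service";
    }
}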

Related

Certain java-based containers throw "UnknownHostException"

I have two issues with my Kubernetes cluster.
Kubernetes version 1.12.5, Ubuntu 16.04.
The first issue:
Occasionally, containers on a specific node are restarted, including kube-proxy, and kernel messages like the following are logged:
kernel: IPVS: rr TCP - no destination available
IPVS: __ip_vs_del_service:enter
net_ratelimit: callbacks suppressed
While these logs are continuously recorded, the load average of the node is rather high and the Docker containers scheduled on the node keep restarting.
In this case, draining the node relieves the symptoms.
The second issue:
Certain Java-based containers throw "UnknownHostException".
Restarting the container manually resolves the symptoms.
Should I look at the container deployment settings?
Should I look at the cluster dns, resolve related settings?
I want to know if UnknownHostException is related to dns settings.
Can you give me some advice?

Ignite error upgrading the setup in Kubernetes

While upgrading the Ignite deployment in Kubernetes (EKS) to address the Log4j vulnerability, I got the error below:
[ignite-1] Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (54b55de4-7742-4e82-9212-7158bf51b4a9) is not compatible with BaselineTopology in the cluster. Joining node BlT id (4) is greater than cluster BlT id (3). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
The setup is a 3-node cluster with native persistence enabled (PVC). This has happened many times in our journey with Apache Ignite, even though we followed the official guide.
I cannot clean the storage, as the pod gets restarted every now and then; by the time I get a shell into the pod, it crashes and restarts.
This might be due to a wrong startup order; starting the nodes manually in reverse order may resolve it, but I'm not sure if that is possible in K8s. Another possible cause is baseline auto-adjustment, which might change your baseline unexpectedly; I suggest you turn it off if it's enabled.
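If auto-adjustment is enabled, it can be switched off from the command line. This is a sketch based on the auto_adjust subcommand introduced around Ignite 2.8, so verify it against your version:
# Disable baseline auto-adjustment so the baseline only changes when set explicitly.
./control.sh --baseline auto_adjust disable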
One (quite tricky) workaround to clean the DB of a failing pod is to replace the Ignite image with a simple image like plain Debian or Alpine (just to get CLI access) while keeping the same PVC attached, and once you fix the persistence issue, set the Ignite image back. Another is to access the underlying PV directly, if possible, and do the surgery in place.

Re-join cluster node after seed node got restarted

Let's imagine the following scenario. I have three nodes inside my Akka cluster (nodes A, B, C). Each node is deployed to a different physical device inside a network.
All of those nodes are wrapped inside Topshelf Windows services.
Node A is my seed node; the other ones are simply 'worker' nodes with a port specified.
When I run the cluster, stop node (service) B or C, and then restart them, the nodes rejoin with no issues.
I'd like to ask whether it's possible to handle another scenario: when I stop the seed node (node A) while the other node services keep running, and then restart node A, I'd like nodes B and C to rejoin the cluster and make the whole ecosystem work again.
Is such a scenario possible to implement? If yes, how should I do it?
In an Akka.NET cluster, any node can serve as a seed node for others as long as it's part of the cluster. "Seeds" are just a configuration thing, so you can define a list of well-known node addresses that are part of the cluster.
Regarding your case, there are several solutions I can think of:
A quite common approach is to define more than one seed node in the configuration, so that a single node doesn't become a single point of failure. As long as at least one of the configured seed nodes is alive, everything should work fine. Keep in mind that the seed nodes should be defined in exactly the same order in every node's configuration (see the configuration sketch below).
If your "worker" nodes have statically assigned endpoints, they can be used as seed nodes as well.
Since you can initialize the cluster programmatically from code, you can also use a 3rd-party service for node discovery. You can use e.g. Consul for that - I've started a project which provides such functionality. While it's not yet published, feel free to fork it or contribute if it helps you.
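For illustration, a multi-seed setup loaded from C# might look roughly like this. It's a sketch only: the system name, host names and ports are placeholders, and the exact remote section depends on your Akka.NET version.
// Hedged sketch: every node lists the same seed nodes in the same order.
// "my-cluster", the host names and the ports are placeholders.
using Akka.Actor;
using Akka.Configuration;

var config = ConfigurationFactory.ParseString(@"
    akka {
        actor.provider = cluster
        remote.dot-netty.tcp {
            hostname = ""node-a""
            port = 4053
        }
        cluster.seed-nodes = [
            ""akka.tcp://my-cluster@node-a:4053"",
            ""akka.tcp://my-cluster@node-b:4053"",
            ""akka.tcp://my-cluster@node-c:4053""
        ]
    }");

var system = ActorSystem.Create("my-cluster", config);
// Nodes B and C use the same seed-nodes list (same order) with their own
// hostname/port, so a restarted node can rejoin through any live seed.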

How to setup multiple gemfire/geode WAN clusters on one machine for testing?

What's needed to run multiple gemfire/geode clusters on one machine? I'm trying to test using WAN gateways locally, before setting it up on servers.
I have one cluster (i.e. gemfire.distributed-system-id=1) up and running with one locator and one server.
I am trying to set up a second cluster (i.e. gemfire.distributed-system-id=2), but I receive the following error when attempting to connect to the locator in cluster 2:
Exception caused JMX Manager startup to fail because: 'HTTP service failed to start'
I assume the error is due to a JMX Manager already running in cluster 1, so I'm guessing I need to start a second JMX Manager on a different port in cluster 2. Is this a correct assumption? If so, how do I set up the second JMX Manager?
Your assumption is correct; the exception is being thrown because the members of the first cluster already started some services (Pulse, the JMX manager, etc.) on the default ports.
You basically want to make sure that the properties http-service-port and jmx-manager-port (not an exhaustive list; there are other properties you need to look at) are different in the second cluster.
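For example, the second cluster's gemfire.properties could move those services off the defaults used by cluster 1. This is only a sketch with arbitrary example ports:
# gemfire.properties for cluster 2 (example values)
distributed-system-id=2
jmx-manager=true
# the default jmx-manager-port 1099 is already used by cluster 1
jmx-manager-port=2099
# the default http-service-port 7070 is already used by cluster 1 (Pulse)
http-service-port=8081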
Hope this helps.
Cheers.

Apache ignite node won't join cluster if server nodes start simultaneously

We have a problem with Ignite when we start two Ignite server nodes at exactly the same time. We are currently implementing our own discovery mechanism by extending TcpDiscoveryIpFinderAdapter. The first time the TcpDiscoveryIpFinderAdapter is called, neither Ignite server is able to find the other node (due to the nature of our discovery mechanism). Subsequent invocations do report the other node with a correct IP, yet the Ignite nodes will not start talking to each other.
If we start the servers with some delay, the second server will (on the first attempt) find the other node and join the cluster successfully.
Is there a way to get the two nodes to talk to each other even after both of them initially think they are a cluster of one node?