Apache Ignite cluster (2.9.1-1) on Ubuntu 18.04 - ignite

For testing, I built an Apache Ignite (2.9.1-1) cluster. Starting the first node works fine, but when I start the second node I get an error: "Failed to add node to topology because it has the same hash code for partitioned affinity as one of existing nodes". Since I am not an expert in Apache Ignite, I would like to know how I can fix this error.

You need to specify a different consistentId for every node in the cluster.
In this case, it is possible that you are starting both nodes with myIgniteNode01.
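A quick way to confirm this before starting the nodes is to compare the consistentId configured for each one. The following is only a minimal sketch: it assumes each node is configured through a Spring XML file that sets a consistentId property on the IgniteConfiguration bean, and the helper names and sample XML are hypothetical, not part of Ignite.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def consistent_id(spring_xml: str):
    """Return the value of the first consistentId property, or None if unset."""
    root = ET.fromstring(spring_xml)
    for elem in root.iter():
        # Strip any XML namespace so "property" and "{ns}property" both match.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "property" and elem.get("name") == "consistentId":
            return elem.get("value")
    return None


def duplicate_ids(configs: dict) -> dict:
    """Map each consistentId used by more than one node to those node labels."""
    by_id = defaultdict(list)
    for label, xml_text in configs.items():
        by_id[consistent_id(xml_text)].append(label)
    return {cid: labels for cid, labels in by_id.items()
            if cid is not None and len(labels) > 1}


# Hypothetical sample config; only the consistentId value differs per node.
TEMPLATE = """
<beans xmlns="http://www.springframework.org/schema/beans">
  <bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="consistentId" value="{cid}"/>
  </bean>
</beans>
"""

if __name__ == "__main__":
    clash = {"node1": TEMPLATE.format(cid="myIgniteNode01"),
             "node2": TEMPLATE.format(cid="myIgniteNode01")}
    # Reports the ID that both node configs share.
    print(duplicate_ids(clash))
```

If the check reports a shared ID, give each node its own value (for example myIgniteNode01 and myIgniteNode02) before restarting.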

Related

Node is not able to join cluster in RabbitMQ v3.8.24

We are upgrading our system from RabbitMQ v3.6.10 with Erlang v19.3.4 to RabbitMQ v3.8.24 with Erlang v23.3.4.8.
We use RightScale for our deployments. During resiliency testing on a 3-node cluster we deleted one node (node3), and as a result a new node (node4) was automatically churned with the same cluster ID. All the cluster join commands are in place and work properly on 3.6.10. However, after the upgrade, the newly launched node on v3.8.24 is not able to join the cluster. Instead, it treats itself as a new single-node deployment.
On the 1st and 2nd nodes we get the errors below in the crash.log file.
2022-02-17 09:01:32 =ERROR REPORT====
** gen_event handler lager_exchange_backend crashed.
** Was installed in lager_event
** Last event was: {log,{lager_msg,[],[{pid,<0.44.0>}],info,{["2022",45,"02",45,"17"],["07",58,"45",58,"12",46,"982"]},{1645,83912,982187},[65,112,112,108,105,99,97,116,105,111,110,32,"mnesia",32,101,120,105,116,101,100,32,119,105,116,104,32,114,101,97,115,111,110,58,32,"stopped"]}}
** When handler state == {state,{mask,127},lager_default_formatter,[date," ",time," ",color,"[",severity,"] ",{pid,[]}," ",message,"\n"],-576448326,{resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>}}
** Reason == {badarg,[{ets,lookup,[rabbit_exchange,{resource,<<"/">>,exchange,<<"amq.rabbitmq.log">>}],[]},{rabbit_misc,dirty_read,1,[{file,"src/rabbit_misc.erl"},{line,367}]},{rabbit_basic,publish,1,[{file,"src/rabbit_basic.erl"},{line,65}]},{lager_exchange_backend,handle_log_event,2,[{file,"src/lager_exchange_backend.erl"},{line,173}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,620}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,602}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,604}]},{gen_event,handle_msg,6,[{file,"gen_event.erl"},{line,343}]}]}
2022-02-17 09:01:37 =ERROR REPORT====
** Connection attempt from node 'rabbit#node-4' rejected. Invalid challenge reply. **
2022-02-17 09:01:37 =ERROR REPORT====
** Connection attempt from node 'rabbitmqcli-481-rabbit#node-4' rejected. Invalid challenge reply. **
*node-4 is the new node, which was churned automatically.
We have two concerns:
Why is the newly churned node not able to join the cluster?
After termination, the old node's details are still present in the Disc Nodes section. Is there a specific reason for retaining them, or are there configuration changes that need to be made?
Regards
Kushagra
I was able to resolve the issue after some googling, so I thought I would share my findings in case they help someone.
Based on the RabbitMQ recommendations, it is best for a RabbitMQ cluster to consist of static nodes.
An unresponsive node may be able to rejoin the cluster once it has recovered, so dynamic removal of nodes is not recommended. Please refer to:
https://www.rabbitmq.com/cluster-formation.html#:~:text=Nodes%20in%20clusters,understood%20and%20considered.
After due diligence, if we still want to remove the unused node, we can run forget_cluster_node from any working node and pass it the name of the expired node. It will clean up all of that node's entries.
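As a rough illustration, the cleanup could be scripted like this. It is only a sketch: the node name rabbit@node3 is an assumption based on the node3 mentioned in the question (use whatever name the cluster status reports), and the helper names are made up. By default it only builds the command so it can be reviewed before anything is removed.

```python
import subprocess


def forget_node_command(node_name: str) -> list:
    """Build the rabbitmqctl argv that removes node_name from the cluster."""
    return ["rabbitmqctl", "forget_cluster_node", node_name]


def forget_node(node_name: str, dry_run: bool = True) -> list:
    """Return the command; run it on this (working) node unless dry_run is set."""
    cmd = forget_node_command(node_name)
    if not dry_run:
        # check=True raises CalledProcessError if rabbitmqctl reports a failure.
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run: print the command for the (assumed) name of the deleted node.
    print(" ".join(forget_node("rabbit@node3")))
```

Call forget_node with dry_run=False on one of the surviving nodes once the printed command looks right.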
I hope this helps.
Regards
Kushagra

Infinispan console shows only one node for clustered servers in the cache node view

We are working with Infinispan version 9.4.8 in domain mode, with a cluster of two host servers with two nodes.
In the statistics of the cluster view we can see that both nodes get hits, but when we look at the cache nodes view for a distributed cache we can see only one node.
In the Infinispan 8 console we used to see both nodes in the cache nodes view, but after upgrading to version 9 this is no longer the case.
Could you please advise whether this is a bug in the console for version 9.4.8 or whether something is missing in the configuration?
This is a bug which has just been fixed and will be included in the upcoming 9.4.18.Final release. The issue is tracked by ISPN-11265.
In the future please utilise the Infinispan JIRA directly if you suspect a bug.

Ignite thin Client unstable behavior

I am a newbie to Ignite and am trying to play around with the example https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/client/ClientPutGetExample.java
I first tried the example with one server node and ran the client; everything worked fine.
Then I started a second node and used the following client config
IgniteClient igniteClient = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800", "127.0.0.1:10801"));
with CacheMode.REPLICATED.
I re-ran the code and it worked fine. Then, keeping the same config, I shut down one of the nodes.
When I re-ran the code the result was unstable: sometimes it gives me "Ignite cluster is unavailable", and sometimes it gives me an empty cache:
Thin client put-get example started.
Created cache [put-get-example].
Loaded [null] from the cache.
1- As per the documentation, the Ignite thin client is supposed to fail over to one of the running nodes.
2- Why is the cache not replicated?
Is there something that I am missing here?
Thank you for your help.
This looks like IGNITE-11599 - Thin Client will not failover properly if some of addresses were not up when it started.
It was fixed recently but has not made it into any released version yet. I'm afraid you will have to work around it by doing manual failover.
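One way to read "manual failover" is to stop handing the client the whole address list and instead try one address at a time, keeping the first connection that succeeds. The sketch below shows only that retry logic. The connect callable is a stand-in for whatever your client library uses to open a connection to a single address, and fake_connect exists only for the demo.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def connect_with_failover(addresses: Iterable[str],
                          connect: Callable[[str], T]) -> T:
    """Try each address in order and return the first successful connection."""
    errors = []
    for address in addresses:
        try:
            return connect(address)
        except Exception as exc:  # a real client would catch its own error type
            errors.append(f"{address}: {exc}")
    raise ConnectionError("no node reachable: " + "; ".join(errors))


def fake_connect(address: str) -> str:
    # Stand-in for a real thin-client connect call; only port 10801 is "up".
    if not address.endswith(":10801"):
        raise ConnectionRefusedError("connection refused")
    return f"connected to {address}"


if __name__ == "__main__":
    nodes = ["127.0.0.1:10800", "127.0.0.1:10801"]
    # The first address is refused, so the second one is used.
    print(connect_with_failover(nodes, fake_connect))
```

The same loop can be re-run whenever an operation fails with a connection error, so the client reconnects to whichever node is still alive.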

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint of the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other checks we can do?
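To compare servers, the election entries can be counted per node name. This is a minimal sketch that assumes the entry layout seen in the three lines above (<session>-<node_name>-n_<sequence>). Fetching the child names from ZooKeeper is left out, since that needs a client library, and the function name is made up.

```python
from collections import Counter


def entries_per_node(election_children):
    """Count election entries per Solr node name."""
    counts = Counter()
    for child in election_children:
        # Assumed layout: <session>-<node_name>-n_<sequence>
        _, _, rest = child.partition("-")
        node_name, sep, _ = rest.rpartition("-n_")
        if sep:  # skip names that do not match the assumed layout
            counts[node_name] += 1
    return counts


# The three (anonymized) entries quoted in the question.
SAMPLE = [
    "45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx",
    "74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx",
    "74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx",
]

if __name__ == "__main__":
    for node, count in entries_per_node(SAMPLE).items():
        print(node, count)
```

A node with more entries than its peers is the one to look at first.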
We've experienced this error under 2 conditions.
Condition 1
On a single ZooKeeper host there was an orphaned ZooKeeper ephemeral node in /overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point 2 services were started. Because the 2nd service cannot start (both cannot listen on the same port), it seems either to sit there hanging in a failed state, or to fail to start the process while still disrupting the other Solr processes, possibly by overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node was completely missing on the cluster where duplicate Solr instances were attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or Java, if you know it's safe), and restart the Solr service.
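For Condition 2, the unit file can be sanity-checked with a few lines of script. This is a rough sketch: it only verifies that Type=forking and a PIDFile entry are present in [Service], not that the PID file path matches what Solr actually writes. The sample unit and its paths are hypothetical.

```python
import configparser


def check_forking_unit(unit_text: str) -> list:
    """Return a list of problems found in the [Service] section."""
    # strict=False tolerates repeated keys, which unit files may contain.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep systemd's case-sensitive key names
    parser.read_string(unit_text)
    if not parser.has_section("Service"):
        return ["missing [Service] section"]
    service = parser["Service"]
    problems = []
    if service.get("Type") != "forking":
        problems.append("Type is not forking")
    if not service.get("PIDFile"):
        problems.append("PIDFile is not set")
    return problems


# Hypothetical unit file; adjust the paths to your installation.
SAMPLE_UNIT = """
[Unit]
Description=Apache Solr

[Service]
Type=forking
PIDFile=/var/solr/solr-8983.pid
ExecStart=/opt/solr/bin/solr start
"""

if __name__ == "__main__":
    # An empty list means both settings are present.
    print(check_forking_unit(SAMPLE_UNIT))
```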
We figured it out!
The issue was that Jetty didn't really stop, so we had 2 running processes; for whatever reason this was fine for reading but not for writing.
Killing the older Java process solved the issue.

Brisk TaskTracker not starting in a multi-node Brisk setup

I have a 3 node Brisk cluster (Briskv1.0_beta2). Cassandra is working fine (all three nodes see each other and data is balanced across the ring). I started the nodes with the brisk cassandra -t command. I cannot, however, run any Hive or Pig jobs. When I do, I get an exception saying that it cannot connect to the task tracker.
During the startup process, I see the following in the log:
TaskTracker.java (line 695) TaskTracker up at: localhost.localdomain/127.0.0.1:34928
A few lines later, however, I see this:
Retrying connect to server: localhost.localdomain/127.0.0.1:8012. Already tried 9 time(s).
INFO [TASK-TRACKER-INIT] RPC.java (line 321) Server at localhost.localdomain/127.0.0.1:8012 not available yet, Zzzzz...
Those lines are repeated non-stop as long as my cluster is running.
My cassandra.yaml file specifies the box IP (not 0.0.0.0 or localhost) as the listen_address, and the rpc_address is set to 0.0.0.0.
Why is the client attempting to connect to a different port than the log shows the task tracker as using? Is there anywhere these addresses/ports can be specified?
I figured this out. In case anyone else has the same issues, here's what was going on:
Brisk uses the first entry in the Cassandra cluster's seed list to pick the initial jobtracker. One of my nodes had 127.0.0.1 in the seed list. This worked for the Cassandra setup since all the other nodes in the cluster connected to that box to get the cluster topology but this didn't work for the job tracker selection.
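If you want to check each node for this, something like the following works as a rough sketch. It assumes you have already pulled the comma-separated seeds value out of that node's cassandra.yaml; the function name is made up, and hostnames other than localhost are not resolved.

```python
import ipaddress


def first_seed_problem(seeds: str):
    """Return a warning if the first seed is a loopback address, else None."""
    first = seeds.split(",")[0].strip()
    if not first:
        return "seed list is empty"
    if first == "localhost":
        return "first seed is localhost"
    try:
        if ipaddress.ip_address(first).is_loopback:
            return f"first seed {first} is a loopback address"
    except ValueError:
        pass  # a hostname other than localhost; not resolved here
    return None


if __name__ == "__main__":
    # A loopback first entry is flagged; a routable first entry is not.
    print(first_seed_problem("127.0.0.1,10.0.0.2"))
    print(first_seed_problem("10.0.0.1,10.0.0.2"))
```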
Looks like your jobtracker isn't running. What do you see when you run "brisktool jobtracker"?