Redis cluster cannot add nodes

There are two Redis servers, and I have run three Redis instances on each server.
When I executed cluster meet [ip] [port] to add cluster nodes, I found I could only add the nodes running on the same server. Every time I run this command it echoes "OK", but when I use cluster nodes to check the node list, it always shows something like this:
172.18.0.155:7010> cluster meet 172.18.0.156 7020
OK
172.18.0.155:7010> cluster nodes
ad829d8b297c79f644f48609f17985c5586b4941 127.0.0.1:7010@17010 myself,master - 0 1540538312000 1 connected
87a8017cfb498e47b6b48f0ad69fc066c466a9c2 172.18.0.156:7020@17020 handshake - 1540538308677 0 0 disconnected
fdf5879554741759aab14eba701dc185b605ac16 127.0.0.1:7012@17012 master - 0 1540538313000 0 connected
ec7b3ecba7a175ddb81f254821243dd469a7f961 127.0.0.1:7011@17011 master - 0 1540538314288 2 connected
You can see the new node's status is disconnected, and if you check again about 5 seconds later, it has disappeared from the list.
Has anybody met this problem before? I have no idea how to solve it. Please help me. Thanks a lot.

I have solved the problem. I had made a mistake in the bind configuration: once I set bind to only the one IP that the other nodes use to reach each instance, the cluster nodes were added normally.
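For reference, a minimal sketch of that change in redis.conf, assuming 172.18.0.155 is the address the other server uses to reach these instances (the 127.0.0.1 entries in the cluster nodes output above suggest the instances were previously reachable only on loopback):
# redis.conf for the instances on 172.18.0.155 (sketch; adjust per host and port)
bind 172.18.0.155
cluster-enabled yes
After changing bind, the instances need to be restarted and, if stale handshake entries remain, the cluster meet commands re-run.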

Related

Aerospike - Python Client - get host name from INFO_NODE

I am looking for details on the questions below.
Using INFO_NODE, we want to build a UI that displays something like the network statistics below.
Network Information (2020-07-21 00:39:27 UTC)
Cluster Name: test
Node: testbox.test.com:3000
Node Id: *aaaaaaaaaaaa0
Ip: XX.XXX.XX>XXXX:3000
Build: E-4.5.3.5
Cluster Size: 1
Migrations: 0.000
Cluster Key: key12344
Cluster Integrity: True
Principal: aaaaaaaaaaaa0
Client Conns: 3
Uptime: 14:12:13
How can we get the cluster name and node name?
Can we get the other details from INFO_NODE?
Can we run quiesce and recluster from the Python client?
You should be able to issue any info command from a client. See this Knowledge Base article. Getting any statistics or configuration parameter values can also be done through similar info calls. For example node-id or cluster-name.
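Not from the original answer, but a rough sketch of what those info calls can look like with the Python client, assuming its info_node method and the standard node / cluster-name / statistics info commands (host and port are placeholders):
import aerospike

# hypothetical seed host; replace with one of your nodes
config = {'hosts': [('testbox.test.com', 3000)]}
client = aerospike.client(config).connect()

host = ('testbox.test.com', 3000)
node_id = client.info_node('node', host)               # node id, returned as a raw string
cluster_name = client.info_node('cluster-name', host)  # cluster name
stats = client.info_node('statistics', host)           # uptime, client connections, etc.
print(node_id, cluster_name)

# quiesce/recluster appear to be plain info commands as well, e.g.:
# client.info_node('quiesce:', host)
# client.info_node('recluster:', host)

client.close()
The responses come back as tab- and semicolon-delimited strings, so some parsing is needed to build a table like the one shown in the question.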

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that uses Quartz2 as the scheduler. The requirement is to make it a cluster. The code is deployed to WebLogic 12c, and Quartz is configured as in many samples, with clustering enabled.
This is my properties file (without the datasource):
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
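For completeness, the omitted datasource section for a clustered JDBCJobStore typically looks something like the following (the datasource name, URL and credentials here are hypothetical, not the poster's actual values):
# hypothetical datasource section, for illustration only
org.quartz.jobStore.dataSource = quartzDS
org.quartz.dataSource.quartzDS.driver = oracle.jdbc.OracleDriver
org.quartz.dataSource.quartzDS.URL = jdbc:oracle:thin:@//dbhost:1521/ORCL
org.quartz.dataSource.quartzDS.user = quartz_user
org.quartz.dataSource.quartzDS.password = quartz_pass
org.quartz.dataSource.quartzDS.maxConnections = 10
Both nodes must point at the same database schema for clustering to work.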
When I deploy and start both nodes I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
I am guessing that because of this, one node is only invoked once in a while, while the other node is invoked all the time (so occasionally both nodes are triggered at the same time).
I have tried a clean restart of the WebLogic nodes, but the issue is still there.
This is what my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
I figured out what caused the issue. Apparently there were orphan WebLogic processes running on one (or even both) of the nodes - that would be a question for our tech archs as to why it was such a mess. ps was showing two WebLogic servers running on a node: one that I had started recently and one that had been there for about a month.
Expecting this would never happen in the production environment, I assume the issue has been resolved.
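As an aside, a quick way to spot such orphaned instances is to look for the WebLogic server JVMs by their main class, for example:
ps -ef | grep '[w]eblogic.Server'
(The bracketed first letter just keeps grep from matching its own process.)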

Authentication failures in cassandra when 1 of 16 nodes is down

I have a Cassandra cluster running:
Cassandra 2.0.11.83 | DSE 4.6.0 | CQL spec 3.1.1 | Thrift protocol 19.39.0
The cluster has 18 nodes, split among 3 datacenters, 6 in each. My system_auth keyspace has the following replication defined:
replication = {
'class': 'NetworkTopologyStrategy',
'DC1': '4',
'DC2': '4',
'DC3': '4'}
and my authenticator/authorizer are set to:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
This morning I brought down one of the nodes in DC1 for maintenance. Within a few seconds/minutes, client applications started logging exceptions like this:
"User my_application_user has no MODIFY permission on or any of its parents"
Running 'LIST ALL PERMISSIONS of my_application_user' on one of the other nodes shows that user to have SELECT and MODIFY on the keyspace xxxxx, so I am rather confused. Do I have a setup issue? Is this a bug of some sort?
Re-posting this as the answer, as BrianC suggested above.
So this is resolved... Here's the sequence of events that seems to have fixed it:
Add 18 more nodes
Run cleanup on original nodes (this was part of the original plan)
Run a scrub on 1 table, since it was throwing exceptions on cleanup
Run a repair on the system_auth KS on the original troubled node
Wait for repair service to complete a full pass on all keyspaces
Decom original 18 nodes.
Honestly, I don't know what fixed it. The system_auth repair makes the most sense, but what doesn't make sense is that it had run many passes before, so why it worked this time, I don't know. I hope this at least helps someone.
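For anyone following the same steps, the system_auth repair above is just a keyspace-scoped repair run on the affected node, along the lines of:
nodetool repair system_auth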

julia on PBS cluster: what to give to addprocs()?

I'm trying to set up a cluster across machines on a PBS-managed cluster. I'm perfectly able to compute within one node by running julia -p 12 (after having reserved one node with 12 CPUs).
I understand that to use several machines, I have to add them to the master process with addprocs. I was able to do that on a different cluster (SGE), but on this one something is going wrong.
You can see everything I'm doing, including submit scripts etc., on this branch of a GitHub repo.
To get a list of machines, I parse the PBS_NODEFILE, which for a submit script with the option
#PBS -l nodes=2:ppn=12 # give me 2 nodes with 12 processors each
looks something like this:
red0004
red0004
...
red0004
red0347
...
red0347
I parse this file with bind_pe_procs() in sge.jl in the repo and give a vector of machine names to addprocs. When I submit this I get an error; I put up a gist with the resulting SSH error. I don't know what it means.
Does this have to do with a system setting, i.e. do I have to talk to the sysadmin about SSH between machines? What are the right questions to ask?
I am also unsure about what exactly I have to give to addprocs(). I don't want to add the master process (I don't want worker 1 SSHing into itself?), so I exclude ENV["HOST"] = node001 from my list. But what about all the processors with the same name, node002? Do I list all of them
machines = [ "red0347" for i=1:12]
or just once
machines = ["red0347"]
in addprocs(machines)
thanks!
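For what it's worth, addprocs also accepts (host, count) tuples as machine specifications, so a sketch along these lines (a hypothetical stand-in for the repo's bind_pe_procs, assuming a recent Julia where addprocs lives in Distributed) avoids listing every slot by hand:
using Distributed   # on older Julia versions addprocs is in Base

# count how many times each host appears in the PBS node file
counts = Dict()
open(ENV["PBS_NODEFILE"]) do f
    for line in eachline(f)
        host = String(strip(line))
        counts[host] = get(counts, host, 0) + 1
    end
end

# drop the master's own host, then hand (host, count) tuples to addprocs
machines = [(host, n) for (host, n) in counts if host != gethostname()]
addprocs(machines)
Note that addprocs' SSH manager requires that the master can ssh to the worker machines without a password, so the SSH error in the gist may well be an environment question for the system administrators.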

Voldemort replication and failover details

I'm evaluating Voldemort and ran into some confusing things related to replication and failover. I tried to make a simple 2-node cluster configuration where each node is a backup for the other, so data written to node 1 should be replicated to node 2 and vice versa. In case node 1 fails, the second node should serve client requests, and after node 1 recovers the data should be transferred back to it. I think it's a very common and clear case, so I made the following configuration.
<cluster>
  <name>perf_cluster</name>
  <server>
    <id>0</id>
    <host>10.50.3.156</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>0, 1, 2, 3</partitions>
    <zone-id>0</zone-id>
  </server>
  <server>
    <id>1</id>
    <host>10.50.3.157</host>
    <http-port>8081</http-port>
    <socket-port>6666</socket-port>
    <admin-port>6667</admin-port>
    <partitions>4, 5, 6, 7</partitions>
    <zone-id>0</zone-id>
  </server>
</cluster>
<stores>
  <store>
    <name>perftest</name>
    <persistence>memory</persistence>
    <description>Performance Test store</description>
    <owners>owner</owners>
    <routing>client</routing>
    <replication-factor>2</replication-factor>
    <required-reads>1</required-reads>
    <required-writes>1</required-writes>
    <key-serializer>
      <type>string</type>
    </key-serializer>
    <value-serializer>
      <type>java-serialization</type>
    </value-serializer>
  </store>
</stores>
I perform the following test:
Start both nodes;
Connect to the cluster via the shell using 'bin/voldemort-shell.sh perftest tcp://10.50.3.156:6666';
Put the key-value "1" "a";
Perform 'preflist "1"', which returns 'Node 1' 'Node 0', so I assume a 'get' request will be sent to Node 1 first;
Crash Node 1;
Get key "1". I see some errors related to the loss of connectivity, but it finally returns the correct value;
Start Node 1;
Get key "1". It says that Node 1 is available but returns 'null' instead of the value. So I assume Node 1 didn't get the data from Node 0, and since my required-reads = 1 it doesn't ask Node 0 and returns null.
Crash Node 0;
Key "1" is lost forever because it wasn't replicated to Node 1.
I'm quite sure that I misunderstand something in the configuration or the cluster replication details. Could you clarify why the data isn't replicated back from Node 0 to Node 1 after recovery? And am I right that replication is a client responsibility, not the server's? If so, how should the data be replicated after node recovery?
Thanks in advance.
I don't know if you've already solved the problem, but take a look at: http://code.google.com/p/project-voldemort/issues/detail?id=246
Remember that the memory store is only for testing (JUnit) purposes; you should use the readonly or bdb stores.
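In the stores.xml above, that would mean swapping the persistence element, for example:
<persistence>bdb</persistence>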