On my way to using Akka.NET for a scalable application, I am trying to set up a cluster of Lighthouse seed nodes. I am testing 3 Lighthouse nodes as seed nodes, each running on the same machine on a different port. The following is my HOCON config sample:
lighthouse.actorsystem: "my-system"
# See petabridge.cmd configuration options here: https://cmd.petabridge.com/articles/install/host-configuration.html
petabridge.cmd.host = "0.0.0.0"
petabridge.cmd.port = 9111/9112/9113 #one in each node
akka.actor.provider = cluster
akka.remote.log-remote-lifecycle-events = DEBUG
akka.remote.dot-netty.tcp.transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
akka.remote.dot-netty.tcp.applied-adapters = []
akka.remote.dot-netty.tcp.transport-protocol = tcp
akka.remote.dot-netty.tcp.public-hostname = "localhost"
akka.remote.dot-netty.tcp.hostname = "localhost"
akka.remote.dot-netty.tcp.port = 4001/4002/4003
akk.cluster.seed-nodes = ["akka.tcp://my-system@localhost:4001","akka.tcp://my-system@localhost:4002","akka.tcp://my-system@localhost:4003"]
akk.cluster.roles = [lighthouse]
If I start these nodes from 3 command prompts, each one prints the following messages:
[INFO][22-01-2019 11:45:17][Thread 0020][Cluster] Cluster Node [akka.tcp://my-system@localhost:4001/4002/4003] - Node [akka.tcp://my-system@localhost:4001/4002/4003] is JOINING itself (with roles []) and forming a new cluster
[INFO][22-01-2019 11:45:17][Thread 0020][Cluster] Cluster Node [akka.tcp://my-system@localhost:4001/4002/4003] - Leader is moving node [akka.tcp://my-system@localhost:4001/4002/4003] to [Up]
My concern is that, according to the logs, these three instances are not forming one cluster but seem to be forming three separate clusters, as none of the nodes receives any message about the other Lighthouse nodes.
Can somebody please clarify whether this is the expected behavior? I could not find an example online.
I have an Apache Camel project that uses Quartz2 as the scheduler. The requirement is to run it in a cluster. The code is deployed to WebLogic 12c, and Quartz is configured with clustering enabled, as in many samples.
This is my properties file (without the datasource):
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
When I deploy and start both nodes, I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
My guess is that, because of this, one node is triggered only occasionally while the other is triggered every time (so occasionally both nodes fire at the same time).
I have tried a clean restart of the WebLogic nodes, but the issue is still there.
This is what my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
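In Camel endpoint URIs the `+` characters stand in for spaces, so the expression above is an ordinary six-field Quartz cron. As a minimal sketch (the class and method names are hypothetical, not part of Camel or Quartz), the following decodes it and lists the minutes it fires on:

```java
import java.util.ArrayList;
import java.util.List;

public class CronSketch {
    // Replace the URI-safe '+' separators with the spaces Quartz expects.
    static String decode(String uriCron) {
        return uriCron.replace('+', ' ');
    }

    // Expand the explicit list in the minutes field (index 1) into integers.
    static List<Integer> minutes(String cron) {
        List<Integer> out = new ArrayList<>();
        for (String m : cron.split(" ")[1].split(",")) {
            out.add(Integer.parseInt(m));
        }
        return out;
    }

    public static void main(String[] args) {
        String cron = decode("0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?");
        System.out.println(cron);
        System.out.println(minutes(cron));
    }
}
```

Since the list covers every fifth minute starting at 0, the same schedule can also be written as `0 0/5 * * * ?`.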
I figured out what caused the issue. Apparently there were orphaned WebLogic processes running on one (or even both) of the nodes. Why this was such a mess is a question for our tech architects. ps showed two WebLogic servers running on a node: one that I had started recently and one that had been there for about a month.
Assuming this would never happen in a production environment, I consider the issue resolved.
I am planning to use an Apache Ignite distributed queue.
I am using Ignite with a Spring Boot application. On bootup, I add 20 names to a queue. But since there are 3 servers in the cluster, the same 20 names get added 3 times. I want to add them only once.
Ignite ignite = Ignition.ignite();
IgniteQueue<String> queue = ignite.queue(
"queueName", // Queue name.
0, // Queue capacity. 0 for unbounded queue.
null // Collection configuration.
);
Distributed executors will poll from the queue and run the task. Each executor is expected to poll a name, run the task, and then add the same name back to the queue; I am trying to achieve round robin here.
Only one executor should run a given task at any point in time, even though there are multiple servers in the cluster.
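The rotation described here can be sketched locally with a `java.util.concurrent.BlockingQueue` standing in for the distributed `IgniteQueue` (which implements the same interface). This is only an illustration under that assumption; the class name, method name, and task names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class RoundRobinSketch {
    // Poll the next name, run it, then append it to the tail again.
    // While one executor holds a name it is not in the queue,
    // so no other executor can pick up the same name concurrently.
    static String runNext(BlockingQueue<String> queue, List<String> log) {
        String name = queue.poll(); // null when the queue is empty
        if (name == null) {
            return null;
        }
        try {
            log.add(name); // placeholder for the real task
        } finally {
            queue.offer(name); // hand the name back even if the task fails
        }
        return name;
    }

    public static void main(String[] args) {
        BlockingQueue<String> queue =
                new LinkedBlockingQueue<>(Arrays.asList("a", "b", "c"));
        List<String> log = new ArrayList<>();
        for (int i = 0; i < 5; i++) {
            runNext(queue, log);
        }
        System.out.println(log);
    }
}
```

One caveat of this pattern: if an executor dies between the poll and the offer, that name is gone from the queue until something re-adds it.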
Any suggestions?
You can launch an Ignite cluster singleton service (https://apacheignite.readme.io/docs/cluster-singletons) that fills the queue. Alternatively, you can add the data from the coordinator node (the oldest node in the cluster), which you can detect with ignite.cluster().forOldest().node().isLocal().
I fixed the duplicate cache loading at bootup this way:
final IgniteAtomicLong cacheLoadCnt = ignite.atomicLong(cacheName + "Cnt", 0, true);
if (cacheLoadCnt.get() == 0) {
    loadCache();
    cacheLoadCnt.addAndGet(1);
}
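Note that reading the counter and incrementing it afterwards are two separate steps, so two nodes starting at the same moment could both see 0. A compare-and-set closes that window. The sketch below uses `java.util.concurrent.atomic.AtomicLong` as a local stand-in; `IgniteAtomicLong` has a `compareAndSet` method as well, but treat the mapping to Ignite as an assumption to verify. `loadOnce`, `simulate`, and the class name are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class LoadOnceSketch {
    // Runs the loader only for the caller that wins the 0 -> 1 transition.
    static boolean loadOnce(AtomicLong flag, Runnable loader) {
        if (flag.compareAndSet(0, 1)) {
            loader.run();
            return true;
        }
        return false;
    }

    // Simulates several nodes booting at once and returns how many loads ran.
    static int simulate(int nodes) {
        AtomicLong flag = new AtomicLong(0);
        AtomicInteger loads = new AtomicInteger();
        CountDownLatch start = new CountDownLatch(1);
        Thread[] threads = new Thread[nodes];
        for (int i = 0; i < nodes; i++) {
            threads[i] = new Thread(() -> {
                try {
                    start.await(); // release all "nodes" together
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                loadOnce(flag, loads::incrementAndGet);
            });
            threads[i].start();
        }
        start.countDown();
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return loads.get();
    }

    public static void main(String[] args) {
        System.out.println(simulate(3));
    }
}
```

The trade-off is that the flag is set before the load finishes, so if the winning node fails halfway through, the other nodes will not retry on their own.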
We are using the multi-site WAN configuration. We have two clusters across geographical distances in North America and Europe.
Context: Cluster 1 has two members, A and B, that are both gateway senders. Cluster 2 has two members, C and D, that are both gateway receivers. When member A in cluster 1 starts, it reads data from the database and loads it into the GemFire cache, which gets sent to cluster 2. So far, so good.
Problem: If both members in cluster 2 are restarted at the same time, they lose all the GemFire regions/data. At that point we could restart member A in cluster 1, so that it loads the data from the DB again and pushes it to cluster 2. But we would prefer to avoid restarting member A, and to avoid persisting to disk.
Is there a solution where cluster 2, when restarted, can request a full copy of the data from cluster 1?
Not sure if it's possible, but could we somehow set up peer-to-peer for the gateway receivers in cluster 2 (on top of WAN), so that they would be updated automatically upon restart?
Thanks
Getting a full copy of the data over WAN is not supported at this time. What you could do instead is run a function on all members of the sending site (cluster 1 in your case) that simply iterates over all the data and puts it back into the region, i.e. something like:
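public void execute(FunctionContext context) {
    RegionFunctionContext ctx = (RegionFunctionContext) context;
    // Only the entries hosted on this member; the function runs on every member.
    Region localData = PartitionRegionHelper.getLocalDataForContext(ctx);
    for (Object key : localData.keySet()) {
        Object val = localData.get(key);
        // Re-putting the same value generates a fresh update event for the gateway sender.
        localData.put(key, val);
    }
}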
For testing, I wanted to shrink my 3-node cluster to 2 nodes, so that I can later do the same for my 5-node cluster.
However, after following the best practice for shrinking a cluster:
1. Back up all tables
2. For all tables: alter table xyz set (number_of_replicas=2) if it was less than 2 before
3. SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>;
   3a. If the data check should always be green, set the min_availability to 'full': https://crate.io/docs/reference/configuration.html#graceful-stop
4. Initiate graceful stop on one node
5. Wait for the data check to turn green
6. Repeat from 3.
7. When done, persist the node configurations in crate.yml:
   gateway.recover_after_nodes: n
   discovery.zen.minimum_master_nodes: (n/2) + 1
   gateway.expected_nodes: n
My cluster never went back to "green" again, and I also have a critical node check failing.
What went wrong here?
crate.yml:
...
################################## Discovery ##################################
# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.
# Set to ensure a node sees M other master eligible nodes to be considered
# operational within the cluster. Its recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# We highly recommend to set the minimum master nodes as follows:
# minimum_master_nodes: (N / 2) + 1 where N is the cluster size
# That will ensure a full recovery of the cluster state.
#
discovery.zen.minimum_master_nodes: 2
# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
# discovery.zen.ping.timeout: 3s
#
# Time a node is waiting for responses from other nodes to a published
# cluster state.
#
# discovery.zen.publish_timeout: 30s
# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
# For example, Amazon Web Services doesn't support multicast discovery.
# Therefore, you need to specify the instances you want to connect to a
# cluster as described in the following steps:
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
# to perform discovery when new nodes (master or data) are started:
#
# If you want to debug the discovery process, you can set a logger in
# 'config/logging.yml' to help you doing so.
#
################################### Gateway ###################################
# The gateway persists cluster meta data on disk every time the meta data
# changes. This data is stored persistently across full cluster restarts
# and recovered after nodes are started again.
# Defines the number of nodes that need to be started before any cluster
# state recovery will start.
#
gateway.recover_after_nodes: 3
# Defines the time to wait before starting the recovery once the number
# of nodes defined in gateway.recover_after_nodes are started.
#
#gateway.recover_after_time: 5m
# Defines how many nodes should be waited for until the cluster state is
# recovered immediately. The value should be equal to the number of nodes
# in the cluster.
#
gateway.expected_nodes: 3
So there are two things that are important:
The number of replicas is essentially the number of nodes you can lose in a typical setup (2 is recommended so that you can scale down AND lose a node in the process and still be OK).
The procedure is recommended for clusters > 2 nodes ;)
CrateDB automatically distributes the shards across the cluster so that no replica shares a node with its primary. If that is not possible (which is the case if you have 2 nodes and 1 primary with 2 replicas), the data check will never return to 'green'. So in your case, set the number of replicas to 1 to get the cluster back to green (alter table mytable set (number_of_replicas = 1)).
The critical node check fails because the cluster has not received an updated crate.yml yet: your file still contains the configuration of a 3-node cluster, hence the message. Since CrateDB only loads expected_nodes at startup (it's not a runtime setting), a restart of the whole cluster is required to conclude scaling down. This can be done as a rolling restart, but be sure to run SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>; with the proper value first, otherwise the consensus will not work.
Also, it's recommended to scale down one node at a time in order to avoid overloading the cluster with rebalancing and accidentally losing data.
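The arithmetic behind both points is small enough to check by hand. As a rough sketch (hypothetical helper names, not a CrateDB API): a table needs one node per copy of a shard, i.e. 1 primary plus number_of_replicas, and the master quorum is (n / 2) + 1 with integer division.

```java
public class ShrinkCheck {
    // Quorum of master-eligible nodes: (n / 2) + 1, using integer division.
    static int minimumMasterNodes(int nodes) {
        return nodes / 2 + 1;
    }

    // Every copy of a shard (1 primary + replicas) must sit on a different node.
    static boolean canAllocateAllCopies(int nodes, int replicas) {
        return replicas + 1 <= nodes;
    }

    public static void main(String[] args) {
        System.out.println(minimumMasterNodes(3));      // 2
        System.out.println(minimumMasterNodes(2));      // 2
        System.out.println(canAllocateAllCopies(2, 2)); // false
        System.out.println(canAllocateAllCopies(2, 1)); // true
    }
}
```

With 2 nodes, 2 replicas would need a third node, which is why the data check cannot turn green until number_of_replicas is lowered to 1.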
I'm trying to use the asadmin interface to monitor a thread-pool on GlassFish 3.1.1. I'm executing the following command:
asadmin get -m server.network.my-listener.thread-pool.*
and I'm getting data back, but most of it has lastsampletime = -1 (so the related data is zero and worthless).
Note: I've also tried the REST interface, which I believe asadmin delegates to, and the JMX interface. Same problem: much of the data has lastsampletime = -1.
I've already turned monitoring to HIGH for all modules. What am I missing?
It seems like redeploying my application was necessary for the monitoring to actually get values. Perhaps I interpreted the manual incorrectly but it seems to suggest that a restart/redeploy wouldn't be required:
Oracle GlassFish Server 3.1 Administration Guide
Also, it is weird that the following shows there is no monitoring data:
asadmin get -m server.thread-pools.thread-pool.http-thread-pool.*
Instead you must go through a specific network listener like:
asadmin get -m server.network.http-listener-2.thread-pool.*
It also took me by surprise that enabling thread-pool monitoring IS NOT enough to see thread pool statistics. You must also enable http-service monitoring:
asadmin enable-monitoring
asadmin set server.monitoring-service.module-monitoring-levels.thread-pool=HIGH
asadmin set server.monitoring-service.module-monitoring-levels.http-service=HIGH
That's all you should need to do.
1. Enable monitoring, set to HIGH, for the http-service module on the DAS, stand-alone instance, or cluster you want to monitor.
2. Deploy an app to the DAS, stand-alone instance, or cluster and make HTTP requests.
3. asadmin get -m *instancename*.network.*listener*.thread-pool.*
It looks like you are monitoring the DAS, since you are using asadmin get -m server.network.my-listener.thread-pool.*.
I deployed a simple WAR to the DAS and made a bunch of HTTP requests. I see that corethreads-count and maxthreads-count have a last sample time of -1, while the remaining statistics have actual last sample times.
asadmin get -m "server.network.http-listener-1.thread-pool.*"
server.network.http-listener-1.thread-pool.corethreads-count = 0
server.network.http-listener-1.thread-pool.corethreads-description = Core number of threads in the thread pool
server.network.http-listener-1.thread-pool.corethreads-lastsampletime = -1
server.network.http-listener-1.thread-pool.corethreads-name = CoreThreads
server.network.http-listener-1.thread-pool.corethreads-starttime = 1320764890444
server.network.http-listener-1.thread-pool.corethreads-unit = count
server.network.http-listener-1.thread-pool.currentthreadcount-count = 5
server.network.http-listener-1.thread-pool.currentthreadcount-description = Provides the number of request processing threads currently in the listener thread pool
server.network.http-listener-1.thread-pool.currentthreadcount-lastsampletime = 1320765351708
server.network.http-listener-1.thread-pool.currentthreadcount-name = CurrentThreadCount
server.network.http-listener-1.thread-pool.currentthreadcount-starttime = 1320764890445
server.network.http-listener-1.thread-pool.currentthreadcount-unit = count
server.network.http-listener-1.thread-pool.currentthreadsbusy-count = 0
server.network.http-listener-1.thread-pool.currentthreadsbusy-description = Provides the number of request processing threads currently in use in the listener thread pool serving requests
server.network.http-listener-1.thread-pool.currentthreadsbusy-lastsampletime = 1320765772814
server.network.http-listener-1.thread-pool.currentthreadsbusy-name = CurrentThreadsBusy
server.network.http-listener-1.thread-pool.currentthreadsbusy-starttime = 1320764890445
server.network.http-listener-1.thread-pool.currentthreadsbusy-unit = count
server.network.http-listener-1.thread-pool.dotted-name = server.network.http-listener-1.thread-pool
server.network.http-listener-1.thread-pool.maxthreads-count = 0
server.network.http-listener-1.thread-pool.maxthreads-description = Maximum number of threads allowed in the thread pool
server.network.http-listener-1.thread-pool.maxthreads-lastsampletime = -1
server.network.http-listener-1.thread-pool.maxthreads-name = MaxThreads
server.network.http-listener-1.thread-pool.maxthreads-starttime = 1320764890443
server.network.http-listener-1.thread-pool.maxthreads-unit = count
server.network.http-listener-1.thread-pool.totalexecutedtasks-count = 31
server.network.http-listener-1.thread-pool.totalexecutedtasks-description = Provides the total number of tasks, which were executed by the thread pool
server.network.http-listener-1.thread-pool.totalexecutedtasks-lastsampletime = 1320765772814
server.network.http-listener-1.thread-pool.totalexecutedtasks-name = TotalExecutedTasksCount
server.network.http-listener-1.thread-pool.totalexecutedtasks-starttime = 1320764890444
server.network.http-listener-1.thread-pool.totalexecutedtasks-unit = count
Command get executed successfully.
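To spot the unsampled statistics in output like the above without reading every line, the `name = value` pairs can be filtered for a `-lastsampletime` of -1. A small sketch, assuming the output format shown here; the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class UnsampledStats {
    private static final String SUFFIX = "-lastsampletime";

    // Returns the statistic names whose lastsampletime is -1 (never sampled).
    static List<String> unsampled(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            int eq = line.indexOf(" = ");
            if (eq < 0) {
                continue; // e.g. "Command get executed successfully."
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 3).trim();
            if (key.endsWith(SUFFIX) && value.equals("-1")) {
                // Strip the suffix, then keep only the last dotted segment.
                String stat = key.substring(0, key.length() - SUFFIX.length());
                result.add(stat.substring(stat.lastIndexOf('.') + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "server.network.http-listener-1.thread-pool.corethreads-lastsampletime = -1",
            "server.network.http-listener-1.thread-pool.currentthreadcount-lastsampletime = 1320765351708",
            "server.network.http-listener-1.thread-pool.maxthreads-lastsampletime = -1",
            "Command get executed successfully.");
        System.out.println(unsampled(sample));
    }
}
```

For the listing above this singles out corethreads and maxthreads, matching what can be read off by eye.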
To enable monitoring instantly, without a restart, use the enable-monitoring command:
enable-monitoring
enable-monitoring --modules jvm=LOW
enable-monitoring --modules thread-pool=HIGH
enable-monitoring --modules http-service=HIGH
enable-monitoring --modules jdbc-connection-pool=HIGH
The trick is that the thread-pool and http-service modules must both be set to HIGH to get monitoring info.
For more information, see https://docs.oracle.com/cd/E26576_01/doc.312/e24928/monitoring.htm#GSADG00558