How to scale down a CrateDB cluster?

For testing, I wanted to shrink my 3 node cluster to 2 nodes, to later go and do the same thing for my 5 node cluster.
However, after following the best practice of shrinking a cluster:
1. Back up all tables.
2. For all tables: alter table xyz set (number_of_replicas=2) if it was less than 2 before.
3. SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>;
3a. If the data check should always stay green, set min_availability to 'full':
https://crate.io/docs/reference/configuration.html#graceful-stop
4. Initiate a graceful stop on one node.
5. Wait for the data check to turn green.
6. Repeat from step 3.
When done, persist the node configurations in crate.yml:
gateway.recover_after_nodes: n
discovery.zen.minimum_master_nodes: (n / 2) + 1
gateway.expected_nodes: n
My cluster never went back to "green" again, and I also have a critical node check failing.
What went wrong here?
crate.yml:
...
################################## Discovery ##################################
# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.
# Set to ensure a node sees M other master eligible nodes to be considered
# operational within the cluster. Its recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# We highly recommend to set the minimum master nodes as follows:
# minimum_master_nodes: (N / 2) + 1 where N is the cluster size
# That will ensure a full recovery of the cluster state.
#
discovery.zen.minimum_master_nodes: 2
# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
# discovery.zen.ping.timeout: 3s
#
# Time a node is waiting for responses from other nodes to a published
# cluster state.
#
# discovery.zen.publish_timeout: 30s
# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
# For example, Amazon Web Services doesn't support multicast discovery.
# Therefore, you need to specify the instances you want to connect to a
# cluster as described in the following steps:
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
# to perform discovery when new nodes (master or data) are started:
#
# If you want to debug the discovery process, you can set a logger in
# 'config/logging.yml' to help you doing so.
#
################################### Gateway ###################################
# The gateway persists cluster meta data on disk every time the meta data
# changes. This data is stored persistently across full cluster restarts
# and recovered after nodes are started again.
# Defines the number of nodes that need to be started before any cluster
# state recovery will start.
#
gateway.recover_after_nodes: 3
# Defines the time to wait before starting the recovery once the number
# of nodes defined in gateway.recover_after_nodes are started.
#
#gateway.recover_after_time: 5m
# Defines how many nodes should be waited for until the cluster state is
# recovered immediately. The value should be equal to the number of nodes
# in the cluster.
#
gateway.expected_nodes: 3

So there are two things that are important:
The number of replicas is essentially the number of nodes you can lose in a typical setup (2 is recommended so that you can scale down AND lose a node in the process and still be OK)
The procedure is recommended for clusters > 2 nodes ;)
CrateDB will automatically distribute the shards across the cluster in a way that no replica and primary share a node. If that is not possible (which is the case if you have 2 nodes and 1 primary with 2 replicas), the data check will never return to 'green'. So in your case, set the number of replicas to 1 in order to get the cluster back to green (alter table mytable set (number_of_replicas = 1)).
The critical node check is due to the cluster not yet having received an updated crate.yml: your file still contains the configuration of a 3-node cluster, hence the message. Since CrateDB only loads expected_nodes at startup (it's not a runtime setting), a restart of the whole cluster is required to conclude the scale-down. This can be done with a rolling restart, but be sure to set SET GLOBAL PERSISTENT discovery.zen.minimum_master_nodes = <half of the cluster + 1>; properly beforehand, otherwise consensus will not work.
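For the 3-to-2 scale-down here, the persisted values would end up as follows (a minimal sketch of the relevant crate.yml lines; apply them on both remaining nodes before the rolling restart):
gateway.recover_after_nodes: 2
gateway.expected_nodes: 2
discovery.zen.minimum_master_nodes: 2 # (2 / 2) + 1 = 2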
Also, it's recommended to scale down one node at a time in order to avoid overloading the cluster with rebalancing and accidentally losing data.

Related

Redis Gears events in cluster

I have a Redis cluster with the following configuration:
91d426e9a569b1c1ad84d75580607e3f99658d30 127.0.0.1:7002@17002 myself,master - 0 1596197488000 1 connected 0-5460
9ff311ae9f413b48578ff0519e97fef2ced57b1e 127.0.0.1:7003@17003 master - 0 1596197490000 2 connected 5461-10922
4de4d36b968bd0b5b5dc8023cb00a5a2ab62effc 127.0.0.1:7004@17004 master - 0 1596197492253 3 connected 10923-16383
a32088043c31c5d3f20828bfe06306b9f0717635 127.0.0.1:7005@17005 slave 91d426e9a569b1c1ad84d75580607e3f99658d30 0 1596197490251 1 connected
b5e9ec7851dfd8dc5ab0cf35c230a0e716dd934c 127.0.0.1:7006@17006 slave 9ff311ae9f413b48578ff0519e97fef2ced57b1e 0 1596197489000 2 connected
a34cc74321e1c75e4cf203248bc0883833c928c7 127.0.0.1:7007@17007 slave 4de4d36b968bd0b5b5dc8023cb00a5a2ab62effc 0 1596197492000 3 connected
I want to create a set containing all keys in the cluster by listening to key operations with RedisGears and storing the key names in a Redis set called keys.
To do that, I run this RedisGears command:
RG.PYEXECUTE "GearsBuilder('KeysReader').foreach(lambda x: execute('sadd', 'keys', x['key'])).register(readValue=False)"
It works, but only if the updated key is stored on the same node as the key keys.
Example:
With my cluster configuration, the key keys is stored on node 91d426e9a569b1c1ad84d75580607e3f99658d30 (the first node).
If I run:
SET foo bar
SET bar foo
SMEMBERS keys
I have the following result:
127.0.0.1:7002> SET foo bar
-> Redirected to slot [12182] located at 127.0.0.1:7004
OK
127.0.0.1:7004> SET bar foo
-> Redirected to slot [5061] located at 127.0.0.1:7002
OK
127.0.0.1:7002> SMEMBERS keys
1) "bar"
2) "keys"
127.0.0.1:7002>
The first key name foo is not saved in the set keys.
Is it possible to have key names from other nodes saved in the keys set with RedisGears?
Redis version : 6.0.6
Redis gears version : 1.0.1
Thanks.
If the key was written to a shard that does not contain the 'keys' key, you need to make sure to move it to that shard with the repartition operation (https://oss.redislabs.com/redisgears/operations.html#repartition), so this should work:
RG.PYEXECUTE "GearsBuilder('KeysReader').repartition(lambda x: 'keys').foreach(lambda x: execute('sadd', 'keys', x['key'])).register(readValue=False)"
The repartition operation will move the record to the correct shard and the 'sadd' will succeed.
Another option is to maintain a set per shard and collect them using another Gear function. To do that you need to use the hashtag function (https://oss.redislabs.com/redisgears/runtime.html#hashtag) to make sure the set created belongs to the current shard. So the following registration will maintain a set per shard:
RG.PYEXECUTE "GearsBuilder('KeysReader').foreach(lambda x: execute('sadd', 'keys{%s}' % hashtag(), x['key'])).register(mode='sync', readValue=False)"
Notice that the sync mode tells RedisGears not to start a distributed execution and it should be much faster to follow the keys this way.
Then to collect all the values:
RG.PYEXECUTE "GB('ShardsIDReader').flatmap(lambda x: execute('smembers', 'keys{%s}' % hashtag())).run()"
The first approach is good for read-intensive use cases and the second approach is good for write-intensive use cases. Depending on your use case, you need to choose the right approach.

Inconsistent behavior of Quartz2 scheduler in Apache Camel

I have an Apache Camel project that is using Quartz2 as the scheduler. The requirement is to make it clustered. The code is deployed to WebLogic 12c. Quartz is configured as per many samples, with clustering enabled.
This is my properties file (without the datasource):
org.quartz.scheduler.instanceName = MyScheduler
org.quartz.scheduler.instanceId = AUTO
org.quartz.scheduler.skipUpdateCheck = true
org.quartz.scheduler.jobFactory.class = org.quartz.simpl.SimpleJobFactory
org.quartz.threadPool.class = org.quartz.simpl.SimpleThreadPool
org.quartz.threadPool.threadCount = 10
org.quartz.threadPool.threadPriority = 5
org.quartz.jobStore.misfireThreshold = 60000
org.quartz.jobStore.class=org.quartz.impl.jdbcjobstore.JobStoreTX
org.quartz.jobStore.driverDelegateClass=org.quartz.impl.jdbcjobstore.oracle.OracleDelegate
org.quartz.jobStore.useProperties=true
org.quartz.JobBuilder.requestRecovery=true
org.quartz.jobStore.isClustered = true
org.quartz.jobStore.clusterCheckinInterval = 20000
When I deploy and start both nodes, I see that the QRTZ_SCHEDULER_STATE table has an extra entry for one of the nodes:
MyScheduler-routerContext server_node21567108546690
MyScheduler-routerContext-1 server_node11565896495100
MyScheduler-routerContext-1 server_node11567108547295
And I am guessing that because of this, one node is invoked only once in a while, while the other node gets called all the time (so occasionally both nodes are invoked at the same time).
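For reference, the checked-in instances and their last check-in times can be listed straight from the job store (a sketch against the default Quartz table names; adjust if your schema uses a different prefix):
SELECT SCHED_NAME, INSTANCE_NAME, LAST_CHECKIN_TIME, CHECKIN_INTERVAL
FROM QRTZ_SCHEDULER_STATE;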
I have tried to do a clean restart of the WebLogic nodes, but the issue is still there.
This is what my route(s) look like:
from("quartz2://provRegGroup/createUsersTrigger?cron={{create_users_cron}}&job.name=createUsersJob")
.routeId("createUsersRB")
.log("**** starting check for create users");
//where
//create_users_cron=0+0,5,10,15,20,25,30,35,40,45,50,55+*+*+*+?
//expecting one node being called by the scheduler at a time..
I figured out what caused the issue. Apparently there were orphan WebLogic processes running on one (or even both) of the nodes - this would be a question to our tech archs as to why it was such a mess. ps was showing two WebLogic servers running on a node: one that I had started recently and one that had been there for, say, a month.
Expecting this would never happen in the production environment, I assume the issue has been resolved.
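For reference, a quick way to spot such orphans on a node (assuming the standard weblogic.Server main class shows up in the process list):
ps -ef | grep weblogic.Server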

How to run multiple Akka.NET Lighthouse seeds

On my way to using Akka.NET for a scalable application, I am trying to set up a cluster of Lighthouse seed nodes. I am testing 3 Lighthouse nodes as seed nodes, each running on the same machine on a different port. The following is my HOCON config sample:
lighthouse.actorsystem: "my-system"
# See petabridge.cmd configuration options here: https://cmd.petabridge.com/articles/install/host-configuration.html
petabridge.cmd.host = "0.0.0.0"
petabridge.cmd.port = 9111/9112/9113 #one in each node
akka.actor.provider = cluster
akka.remote.log-remote-lifecycle-events = DEBUG
akka.remote.dot-netty.tcp.transport-class = "Akka.Remote.Transport.DotNetty.TcpTransport, Akka.Remote"
akka.remote.dot-netty.tcp.applied-adapters = []
akka.remote.dot-netty.tcp.transport-protocol = tcp
akka.remote.dot-netty.tcp.public-hostname = "localhost"
akka.remote.dot-netty.tcp.hostname = "localhost"
akka.remote.dot-netty.tcp.port = 4001/4002/4003
akk.cluster.seed-nodes = ["akka.tcp://my-system@localhost:4001","akka.tcp://my-system@localhost:4002","akka.tcp://my-system@localhost:4003"]
akk.cluster.roles = [lighthouse]
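Expanding the slash notation, the first node concretely runs with the following values (the other two nodes differ only in the ports, 9112/4002 and 9113/4003):
petabridge.cmd.port = 9111
akka.remote.dot-netty.tcp.port = 4001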
If I start up these nodes from 3 command prompts, each prints the following messages:
[INFO][22-01-2019 11:45:17][Thread 0020][Cluster] Cluster Node [akka.tcp://my-system@localhost:4001/4002/4003] - Node [akka.tcp://my-system@localhost:4001/4002/4003] is JOINING itself (with roles []) and forming a new cluster
[INFO][22-01-2019 11:45:17][Thread 0020][Cluster] Cluster Node [akka.tcp://my-system@localhost:4001/4002/4003] - Leader is moving node [akka.tcp://my-system@localhost:4001/4002/4003] to [Up]
My concern here is that, as per the logs printed, these three instances are not forming a single cluster; they seem to be forming three separate clusters, as the nodes are not getting any messages about the other Lighthouse nodes.
Can somebody please clarify whether this is the expected behavior, as no example seems to be available online?

How to disable the rollups tables in OpsCenter keyspace?

We are using DSE 4.8.8 with OpsCenter 5.2.4. I don't want to collect 1min and 5min metrics. Here are my cluster config file settings:
[cassandra_metrics]
1min_ttl = -1
5min_ttl = -1
2hr_ttl = 7776000
24hr_ttl = 31536000
I have the same issue with DSE 4.8.15. According to the documentation, a TTL value of -1 should disable the metric collection. I restarted OpsCenter. However, I still see the 1min and 5min metrics being collected. I even truncated the tables and cleared the snapshots to make sure there were no old records. I checked the rollups60 and rollups300 directories, and the number and size of the files grow over time. How do I disable this Cassandra metrics collection?

Wrong balance between Aerospike instances in cluster

I have an application with a high load for batch read operations. My Aerospike cluster (v 3.7.2) has 14 servers, each one with 7GB RAM and 2 CPUs in Google Cloud.
By looking at the Google Cloud Monitoring graphs, I noticed a very unbalanced load between servers: some servers have almost 100% CPU load, while others have less than 50% (image below). Even after hours of operation, the unbalanced pattern doesn't change.
Is there any configuration I could change to make this cluster more homogeneous? How can I optimize node balancing?
Edit 1
All servers in the cluster have the identical aerospike.conf file:
# Aerospike database configuration file.
service {
user root
group root
paxos-single-replica-limit 1 # Number of nodes where the replica count is automatically reduced to 1.
paxos-recovery-policy auto-reset-master
pidfile /var/run/aerospike/asd.pid
service-threads 32
transaction-queues 32
transaction-threads-per-queue 32
batch-index-threads 32
proto-fd-max 15000
batch-max-requests 200000
}
logging {
# Log file must be an absolute path.
file /var/log/aerospike/aerospike.log {
context any info
}
}
network {
service {
#address any
port 3000
}
heartbeat {
mode mesh
mesh-seed-address-port 10.240.0.6 3002
mesh-seed-address-port 10.240.0.5 3002
port 3002
interval 150
timeout 20
}
fabric {
port 3001
}
info {
port 3003
}
}
namespace test {
replication-factor 3
memory-size 5G
default-ttl 0 # 30 days, use 0 to never expire/evict.
ldt-enabled true
storage-engine device {
file /data/aerospike.dat
write-block-size 1M
filesize 180G
}
}
Edit 2:
$ asinfo
1 : node
BB90600F00A0142
2 : statistics
cluster_size=14;cluster_key=E3C3672DCDD7F51;cluster_integrity=true;objects=3739898;sub-records=0;total-bytes-disk=193273528320;used-bytes-disk=26018492544;free-pct-disk=86;total-bytes-memory=5368709120;used-bytes-memory=239353472;data-used-bytes-memory=0;index-used-bytes-memory=239353472;sindex-used-bytes-memory=0;free-pct-memory=95;stat_read_reqs=2881465329;stat_read_reqs_xdr=0;stat_read_success=2878457632;stat_read_errs_notfound=3007093;stat_read_errs_other=0;stat_write_reqs=551398;stat_write_reqs_xdr=0;stat_write_success=549522;stat_write_errs=90;stat_xdr_pipe_writes=0;stat_xdr_pipe_miss=0;stat_delete_success=4;stat_rw_timeout=1862;udf_read_reqs=0;udf_read_success=0;udf_read_errs_other=0;udf_write_reqs=0;udf_write_success=0;udf_write_err_others=0;udf_delete_reqs=0;udf_delete_success=0;udf_delete_err_others=0;udf_lua_errs=0;udf_scan_rec_reqs=0;udf_query_rec_reqs=0;udf_replica_writes=0;stat_proxy_reqs=7021;stat_proxy_reqs_xdr=0;stat_proxy_success=2121;stat_proxy_errs=4739;stat_ldt_proxy=0;stat_cluster_key_err_ack_dup_trans_reenqueue=607;stat_expired_objects=0;stat_evicted_objects=0;stat_deleted_set_objects=0;stat_evicted_objects_time=0;stat_zero_bin_records=0;stat_nsup_deletes_not_shipped=0;stat_compressed_pkts_received=0;err_tsvc_requests=110;err_tsvc_requests_timeout=0;err_out_of_space=0;err_duplicate_proxy_request=0;err_rw_request_not_found=17;err_rw_pending_limit=19;err_rw_cant_put_unique=0;geo_region_query_count=0;geo_region_query_cells=0;geo_region_query_points=0;geo_region_query_falsepos=0;fabric_msgs_sent=58002818;fabric_msgs_rcvd=57998870;paxos_principal=BB92B00F00A0142;migrate_msgs_sent=55749290;migrate_msgs_recv=55759692;migrate_progress_send=0;migrate_progress_recv=0;migrate_num_incoming_accepted=7228;migrate_num_incoming_refused=0;queue=0;transactions=101978550;reaped_fds=6;scans_active=0;basic_scans_succeeded=0;basic_scans_failed=0;aggr_scans_succeeded=0;aggr_scans_failed=0;udf_bg_scans_succeeded=0;udf_bg_scans_failed=0;batch_index_initiate=40457778;batch_index_queue=0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0,0:0;batch_index_complete=40456708;batch_index_timeout=1037;batch_index_errors=33;batch_index_unused_buffers=256;batch_index_huge_buffers=217168717;batch_index_created_buffers=217583519;batch_index_destroyed_buffers=217583263;batch_initiate=0;batch_queue=0;batch_tree_count=0;batch_timeout=0;batch_errors=0;info_queue=0;delete_queue=0;proxy_in_progress=0;proxy_initiate=7021;proxy_action=5519;proxy_retry=0;proxy_retry_q_full=0;proxy_unproxy=0;proxy_retry_same_dest=0;proxy_retry_new_dest=0;write_master=551089;write_prole=1055431;read_dup_prole=14232;rw_err_dup_internal=0;rw_err_dup_cluster_key=1814;rw_err_dup_send=0;rw_err_write_internal=0;rw_err_write_cluster_key=0;rw_err_write_send=0;rw_err_ack_internal=0;rw_err_ack_nomatch=1767;rw_err_ack_badnode=0;client_connections=366;waiting_transactions=0;tree_count=0;record_refs=3739898;record_locks=0;migrate_tx_objs=0;migrate_rx_objs=0;ongoing_write_reqs=0;err_storage_queue_full=0;partition_actual=296;partition_replica=572;partition_desync=0;partition_absent=3228;partition_zombie=0;partition_object_count=3739898;partition_ref_count=4096;system_free_mem_pct=61;sindex_ucgarbage_found=0;sindex_gc_locktimedout=0;sindex_gc_inactivity_dur=0;sindex_gc_activity_dur=0;sindex_gc_list_creation_time=0;sindex_gc_list_deletion_time=0;sindex_gc_objects_validated=0;sindex_gc_garbage_found=0;sindex_gc_garbage_cleaned=0;system_swapping=false;err_replica_null_node=0;err_replica_non_null_node=0;err_sync_copy_null_master=0;storage_defrag_corrupt_record=0;err_write_fail_prole_unknown=0;err_write_fail_prole_generation=0;err_write_fail_unknown=0;err_write_fail_key_exists=0;err_write_fail_generation=0;err_write_fail_generation_xdr=0;err_write_fail_bin_exists=0;err_write_fail_parameter=0;err_write_fail_incompatible_type=0;err_write_fail_noxdr=0;err_write_fail_prole_delete=0;err_write_fail_not_found=0;err_write_fail_key_mismatch=0;err_write_fail_record_too_big=90;err_write_fail_bin_name=0;err_write_fail_bin_not_found=0;err_write_fail_forbidden=0;stat_duplicate_operation=53184;uptime=1001388;stat_write_errs_notfound=0;stat_write_errs_other=90;heartbeat_received_self=0;heartbeat_received_foreign=145137042;query_reqs=0;query_success=0;query_fail=0;query_abort=0;query_avg_rec_count=0;query_short_running=0;query_long_running=0;query_short_queue_full=0;query_long_queue_full=0;query_short_reqs=0;query_long_reqs=0;query_agg=0;query_agg_success=0;query_agg_err=0;query_agg_abort=0;query_agg_avg_rec_count=0;query_lookups=0;query_lookup_success=0;query_lookup_err=0;query_lookup_abort=0;query_lookup_avg_rec_count=0
3 : features
cdt-list;pipelining;geo;float;batch-index;replicas-all;replicas-master;replicas-prole;udf
4 : cluster-generation
61
5 : partition-generation
11811
6 : edition
Aerospike Community Edition
7 : version
Aerospike Community Edition build 3.7.2
8 : build
3.7.2
9 : services
10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.41:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.23:3000
10 : services-alumni
10.0.3.1:3000;10.240.0.42:3000;10.0.3.1:3000;10.240.0.5:3000;10.0.3.1:3000;10.240.0.13:3000;10.0.3.1:3000;10.240.0.14:3000;10.0.3.1:3000;10.240.0.18:3000;10.0.3.1:3000;10.240.0.23:3000;10.0.3.1:3000;10.240.0.24:3000;10.0.3.1:3000;10.240.0.27:3000;10.0.3.1:3000;10.240.0.30:3000;10.0.3.1:3000;10.240.0.37:3000;10.0.3.1:3000;10.240.0.43:3000;10.0.3.1:3000;10.240.0.33:3000;10.0.3.1:3000;10.240.0.41:3000
I have a few comments about your configuration. First, transaction-threads-per-queue should be set to 3 or 4 (don't set it to the number of cores).
The second has to do with your batch-read tuning. You're using the (default) batch-index protocol, and the config params you'll need to tune for batch-read performance are:
You have batch-max-requests set very high. This is probably affecting both your CPU load and your memory consumption. It only takes a slight imbalance in the number of keys accessed per node for it to show up in the graphs you've shared, so this is possibly the issue. It's better to iterate over smaller batches than to try to fetch 200K records per node at a time (see the sketch after this list).
batch-index-threads – by default its value is 4, and you set it to 32 (of a max of 64). You should tune this incrementally by running the same test and benchmarking the performance: on each iteration adjust it higher, then back down if performance decreased. For example: test with 32, +8 = 40, +8 = 48, -4 = 44. There's no easy rule of thumb for this setting; you'll need to tune it through iterations on the hardware you'll be using and monitor the performance.
batch-max-buffer-per-queue – this is more directly linked to the number of concurrent batch-read operations the node can support. Each batch-read request will consume at least one buffer (more if the data cannot fit in 128K). If you do not have enough of these allocated to support the number of concurrent batch-read requests, you will get exceptions with error code 152 (BATCH_QUEUES_FULL). Track and log such events clearly, because it means you need to raise this value. Note that this is the number of buffers per queue. Each batch response worker thread has its own queue, so you'll have batch-index-threads x batch-max-buffer-per-queue buffers, each taking 128K of RAM. The batch-max-unused-buffers setting caps the memory usage of all these buffers combined, destroying unused buffers until their number is reduced. There's an overhead to allocating and destroying these buffers, so you do not want to set it too low compared to the total. Your current cost is 32 x 256 x 128KB = 1GB.
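Illustrating the smaller-batches point above, here is a minimal client-side sketch (assuming the Aerospike Python client; the host address, the BATCH size, the keys list, and the process() handler are all placeholders):
import aerospike

# Connect to any one node; the client discovers the rest of the cluster.
client = aerospike.client({'hosts': [('10.240.0.5', 3000)]}).connect()

BATCH = 5000  # well below batch-max-requests; tune for your workload
for i in range(0, len(keys), BATCH):
    # Each get_many() call issues one batch-index request for this chunk.
    records = client.get_many(keys[i:i + BATCH])
    process(records)  # placeholder for your record handling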
Finally, you're storing your data on a filesystem. That's fine for development instances, but not recommended for production. In GCE you can provision either a SATA SSD or an NVMe SSD for your data storage; those should be initialized and used as block devices. Take a look at the GCE recommendations for more details. I suspect you have warnings in your log about the device not keeping up.
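For reference, a device-based stanza might look like this (a sketch only; the device path is an assumption for your GCE instances, and 128K is a commonly used write-block-size for SSDs):
namespace test {
  replication-factor 3
  memory-size 5G
  storage-engine device {
    device /dev/nvme0n1    # raw block device instead of a file on a filesystem
    write-block-size 128K
  }
}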
It's likely that one of your nodes is an outlier with regard to the number of partitions it has (and therefore the number of objects). You can confirm this with asadm -e 'asinfo -v "objects"'. If that's the case, you can terminate that node and bring up a new one; this will force the partitions to be redistributed. It does trigger a migration, which takes quite a bit longer on the CE server than on the EE one.
For anyone interested, Aerospike Enterprise 4.3 introduced 'uniform-balance', which homogeneously balances data partitions. Read more here: https://www.aerospike.com/blog/aerospike-4-3-all-flash-uniform-balance/