Redis Cluster Single Node issue - redis

Can Redis run in cluster mode enabled but with single instance. It seems we get cluster status as fail when we try to deploy with single node because no slots are added to it.
https://serverfault.com/a/815813/510599
I understand we can manually add SLOTS after deployment. But I wonder if we can modify Redis source code to make this change.

Related

Google cloud kubernetes cluster newbie question

I am a newbie of GKE. I created a GKE cluster with very simple setup. It only has on gpu node and all other stuff was default. After the cluster is up, I was able to list the nodes and ssh into the nodes. But I have two questions here.
I tried to install nvidia driver using the command:
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
It output that:
kubectl apply --filename https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
daemonset.apps/nvidia-driver-installer configured
But 'nvidia-smi' cannot be found at all. Should I do something else to make it work?
On the worker node, there wasn't the .kube directory and the file 'config'. I had to copy it from the master node to the worker node to make things work. And the config file on the master node automatically updates so I have to copy again and again. Did I miss some steps in the creation of the cluster or how to resolve this problem?
I appreciate someone can shed some light on this. It drove me crazy after working on it for several days.
Tons of thanks.
Alex.
For the DaemonSet to work, you need to have a tag on your worker Node as cloud.google.com/gke-accelerator (see this line). The DaemonSet checks for this tag on a node before scheduling any pods for installing the driver. I'm guessing a default node pool you create did not have this tag on it. You can find more details on this on the GKE docs here.
The worker nodes, by design are just that worker nodes. They do not need privileged access to the Kubernetes API so they don't need any kubeconfig files. The communication between worker nodes and the API is strictly controlled through the kubelet binary running on the node. Therefore, you will never find kubeconfig files on a worker node. Also, you should never put them on the worker node either, since if a node gets compromised, the keys in that file can be used to damage the API Server. Instead, you should make it a habit to either use the master nodes for kubectl commands, or better yet, have the kubeconfig on your local machine, and keep it safe, and issue commands remotely to your cluster.
After all, all you need is access to an API endpoint for your Kubernetes API server, and it shouldn't matter where you access it from, as long as the endpoint is reachable. So, there is no need whatsoever to have kubeconfig on the worker nodes :)

Ignite error upgrading the setup in Kubernetes

While I upgraded the Ignite that is deployed in Kubernetes (EKS) for Log4j vulnerability, I get the error below
[ignite-1] Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (54b55de4-7742-4e82-9212-7158bf51b4a9) is not compatible with BaselineTopology in the cluster. Joining node BlT id (4) is greater than cluster BlT id (3). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
The setup is a 3 node cluster, with native persistence enabled (PVC). This seems to be occurring many times in our journey with Apache Ignite, having followed the official guide.
I cannot clean the storage as the pod gets restarted every now and then, by the time I get the pod shell the pod crash & restarts.
This might happen to be due to the wrong startup order, starting nodes manually in reverse order may resolve this, but I'm not sure if that is possible in K8s. Another possible issue might be related to the baseline auto-adjustment that might change your baseline unexpectedly, I suggest you turn it off if it's enabled.
One of the workarounds to clean a DB of a failing POD might be (quite tricky) - to replace Ignite image with some simple image like a plain Debian or Alpine docker images (just to be able to access CLI) keeping the same PVC attached, and once you fix the persistence issue, set the Ignite image back. The other one is - to access underlying PV directly if possible and do surgery in place.

How to clean Apache Ignite caches and sort of start over?

I have a 3 node ignite cluster and 1 client that creates cache. During the development and testing, I had to stop the cluster or interrupt the cache building several time and the entire system is broken now. Only one node starts and the other nodes crashes. The client is blocked and it does not do anythin.
Is there any way to clean everything and sort of start fresh?
I am using Ignite 2.1 and using Persistent Cache storage.
Thank you for your help.
Just delete Ignite work directory - by default, it's ${IGNITE_HOME}/work.
Also, if you configured WAL store path, you need to clean it too:
https://apacheignite.readme.io/docs/distributed-persistent-store#section-write-ahead-log
Note: All data in persistent store will be lost.

Failing over with single Replication Group on ElastiCache Redis

I'm testing out ElastiCache backed by Redis with the following specs:
Using Redis 2.8, with Multi-AZ
Single replication group
1 master node in us-east-1b, 1 slave node in us-east-1c, 1 slave node in us-east-1d
The part of the application writing is directly using the endpoint for the master node (primary-node.use1.cache.amazonaws.com)
The part of the application doing only reads is pointing to a custom endpoint (readonly.redis.mydomain.com) configured in HAProxy, which then points to the two other read slave end points. (readslave1.use1.cache.amazonaws.com and readslave2.use1.cache.amazonaws.com)
Now lets say the primary node (master) fails in us-east-1b.
From what I understand, if the master instance fails, I won't have to change the url for the end point for writing to Redis (primary-node.use1.cache.amazonaws.com), although from there, I still have the following questions:
Do I have to change the endpoint names for the read only slaves?
How long until the missing slave is added into the pool?
If there's anything else I'm missing, I'd appreciate the advice/information.
Thanks!
If you are using ElastiCache, you should make use the "Primary EndpointThe" provided by AWS.
That endpoint actually is backed by Route53, if the primary (master) redis is down, since you enable MutliA-Z, it will auto fail over to one of the read replica (slave).
In that case, you don't need to modify the endpoint of your redis.
I don't know why you have such design, seems you only want write to master, but always read from slave.
For HA Proxy part, you should include TCP check for ALL 3 redis nodes, using their "Read Endpoint"
In haproxy, you can check if the endpoint is SLAVE, if yes, your haproxy should redirect the traffic to that.
Notice that in the application layer, if your redis driver don't support auto reconnect, your script will fail to connect to the new master nodes.
In addition to "auto reconnect", since AWS is using Route53 DNS to do fail over, some lib will NOT do NS lookup again, which means the DNS is still pointing to the OLD ip which is the old master.
Using HAproxy can solve this problem.

CouchBase 2.5 2 nodes in replica: 1 node fail: the service is no more available

We are testing Couchbase with a two node cluster with one replica.
When we stop the service on one node, the other one does not respond until we restart the service or manually failover the stopped node.
Is there a way to maintain the service from the good node when one node is temporary unavailable?
If a node goes down then in order to activate the replicas on the other node you will need to manually fail it over. If you want this to happen automatically then you can enable auto-failover, but in order to use that feature I'm pretty sure you must have at least a three node cluster. When you want to add the failed node back then you can just re-add it to the cluster and rebalance.