Migrate a (storm+nimbus) cluster to a new Zookeeper, without loosing the information or having downtime - apache

I have a nimbus+storm cluster using Zookeeper, and I wish to move my cluster and point it to a new Zookeeper. Do you know if this is possible? Can I keep all the information of the old zookeeper and save it in the new one? Is it possible to do it without downtime?
I have looked in the internet for this procedure but I have not found much.
Would it be as simples as change the storm.yml file in both the master . and worker nodes? Do I need a restart afterwards?
# storm.zookeeper.servers:
# - "server1"
# - "server2"

If you just change storm.yml, you'd be pointing Storm at a new empty Zookeeper cluster, and it will be like you just installed Storm from scratch. More likely, you want to grow your Zookeeper cluster to include your new machines, then update storm.yml to point at the new machines, then shrink the cluster to exclude the machines you want to move away from. That way, your Zookeeper quorum is preserved even though you've moved to other physical machines.
This is easier to do on Zookeeper 3.5 with dynamic reconfiguration http://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html. I'm unsure whether Storm will run on Zookeeper 3.5, but you may consider investigating whether you can upgrade to 3.5 before growing/shrinking the cluster.
Otherwise you will have to do a rolling restart to add the new Zookeeper nodes, then do another one to remove the old machines once the cluster has stabilized.

Let me suggest a hack here. This was a script provided by microsoft for migration on HD Insight cluster , but you can change it and use it for your need.
The script can be downloaded from : https://github.com/hdinsight/hdinsight-storm-examples/tree/master/tools/zkdatatool-1.0 and you can read more about it here :
https://blogs.msdn.microsoft.com/azuredatalake/2017/02/24/restarting-storm-eventhub/
I have used it in the past when i had to migrate some stuff between PaaS clusters and i can confirm it works ok!

Related

Ignite error upgrading the setup in Kubernetes

While I upgraded the Ignite that is deployed in Kubernetes (EKS) for Log4j vulnerability, I get the error below
[ignite-1] Caused by: class org.apache.ignite.spi.IgniteSpiException: BaselineTopology of joining node (54b55de4-7742-4e82-9212-7158bf51b4a9) is not compatible with BaselineTopology in the cluster. Joining node BlT id (4) is greater than cluster BlT id (3). New BaselineTopology was set on joining node with set-baseline command. Consider cleaning persistent storage of the node and adding it to the cluster again.
The setup is a 3 node cluster, with native persistence enabled (PVC). This seems to be occurring many times in our journey with Apache Ignite, having followed the official guide.
I cannot clean the storage as the pod gets restarted every now and then, by the time I get the pod shell the pod crash & restarts.
This might happen to be due to the wrong startup order, starting nodes manually in reverse order may resolve this, but I'm not sure if that is possible in K8s. Another possible issue might be related to the baseline auto-adjustment that might change your baseline unexpectedly, I suggest you turn it off if it's enabled.
One of the workarounds to clean a DB of a failing POD might be (quite tricky) - to replace Ignite image with some simple image like a plain Debian or Alpine docker images (just to be able to access CLI) keeping the same PVC attached, and once you fix the persistence issue, set the Ignite image back. The other one is - to access underlying PV directly if possible and do surgery in place.

Should all pods using a redis cache be constrained to the same node as the rediscache itself?

We are running one of our services in a newly created kubernetes cluster. Because of that, we have now switched them from the previous "in-memory" cache to a Redis cache.
Preliminary tests on our application which exposes an API shows that we experience timeouts from our applications to the Redis cache. I have no idea why and it issue pops up very irregularly.
So I'm thinking maybe the reason for these timeouts are actually network related. Is it a good idea to put in affinity so we always run the Redis-cache on the same nodes as the application to prevent network issues?
The issues have not arisen during "very high load" situations so it's concerning me a bit.
This is an opinion question so I'll answer in an opinionated way:
Like you mentioned I would try to put the Redis and application pods on the same node, that would rule out wire networking issues. You can accomplish that with Kubernetes pod affinity. But you can also try nodeslector, that way you always pin your Redis and application pods to a specific node.
Another way to do this is to taint your nodes where you want to run your workloads and then add a toleration to the Redis and your application pods.
Hope it helps!

SQL Database Blue / Green deployments

I am dealing with the infrastructure for a new project. It is a standard Laravel stack = PHP, SQL server, and Nginx. For the PHP + Nginx part, we are using Kubernetes cluster - so scaling and blue/green deployments are taken care of.
When it comes to the database I am a bit unsure. We don't want to use Kubernetes for SQL, so the current idea is to go for Google Cloud SQL managed service (Are the competitors better for blue/green deployment of SQL?). The question is can it sync the data between old and new versions of the database nodes?
Let's say that we have 3 active Pods and at least 2 active database nodes (and a load balancer).
So the standard deployment should look like this:
Pod with the new code is created.
New database node is created with current data.
The new Pod gets new environment variables to connect to the new database.
Database migrations are run on the new database node.
Health check for the new Pod is run, if it passes Pod starts to receive traffic.
One of the old Pods is taken offline.
It should keep doing this iteration until all of the Pods and Database nodes are replaced.
The question is can this work with the database? Let's imagine there is a user on the website that is using the last OLD database node to write some data and when switched to the NEW database node the data are simply not there until the last database node is upgraded. Can they be synced behind the scenes? Does Google Cloud SQL managed service provide that?
Or is there a completely different and better solution to this problem?
Thank you!
I'm not 100% sure if this is what you are looking for, but for my understanding, Cloud SQL replicas would be a better solution. You can have read replicas [1], that are a copy of the master instance and have different options [2]
A read replica is a copy of the master that reflects changes to the master instance in almost real time. You create a replica to offload read requests or analytics traffic from the master. You can create multiple read replicas for a single master instance.
or a failover replica [3], that in case the master goes down, the data continue to be available there.
If an instance configured for high availability experiences an outage or becomes unresponsive, Cloud SQL automatically fails over to the failover replica, and your data continues to be available to clients. This is called a failover.
You can combine those if you need.

Clone RabbitMQ admin users, etc. on replacement server

We have a couple of crusty AWS hosts running a RabbitMQ implementation in a cluster. We need to upgrade the hardware, and therefore we developed a Chef cookbook to spawn replacement servers.
One thing that we would rather not recreate by hand is the admin users, the queues, etc.
What is the best method to get that stuff from the old hosts to the new ones? I believe it's everything that lives in the /var/lib/rabbitmq/mnesia directory.
Is it wise to copy the files from one host to another?
Is there a programmatic means to do this?
Can it be coded into our Chef cookbook?
You can definitely export and import configuration via command line: https://www.rabbitmq.com/management-cli.html
I'm not sure about admin user, though.
If you create new rabbitmq nodes on your new hardware, you will get all the users in that new node. This is easy to try:
run docker container with image of rabbitmq (with management plugin)
and create a user
run another container and add that node to the
cluster of the first one
kill rabbitmq on the first one, or delete
the docker container and you will see that you still have the newly
created user on the 2nd (but now master) node
I wrote docker since it's faster to create a cluster this way, but if you already have a cluster you could use it for testing if you prefer.
For the queues and exchanges, I don't want to quote almost everything found in the rabbitmq doc page for the high availability, but I will just say that you have to pay attention to the following:
exclusive queues because they are gone once the client connection is gone
queue mirroring (if you have any set up, if not it would be wise to consider it, if not even necessary)
I would do the migration gradually, waiting for the queues to get emptied and then kill of the nodes on the old hardware. It maybe doable in a big-bang fashion, but seems riskier. If you have a running system, than set up queue mirroring and try to find appropriate moment to do manual sync - but careful, this has a huge impact on the broker performance.
Additionally there is this shovel plugin (I have to point out that I did not use it or even explore it) but that may be another way to go since (quoting form the link):
In essence, a shovel is a simple pump. Each shovel:
connects to the source broker and the destination broker, consumes
messages from the queue, re-publishes each message to the destination
broker (using, by default, the original exchange name and
routing_key).

How to fix the Zookeeper error for Hbase

Main OS is windows 7 64bit. Using VM player to create two vm CentOS 5.6 system. The net connection is bridge. I installed Hbase on both of the CentOS system, one is master, the other is slave. When I enter the shell, and run status 'details'.
The error from master is
zookeeper.ZKConfig: no valid quorum servers found in zoo.cfg ERROR:
org.apache.hadoop.hbase.ZooKeeperConnectionException: An error is
preventing HBase from connecting to ZooKeeper
And the error from slave is
ERROR: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is
able to connect to ZooKeeper but the connection closes immediately.
This could be a sign that the server has too many connections (30 is
the default). Consider inspecting your ZK server logs for that error
and then make sure you are reusing HBaseConfiguration as often as you
can. See HTable's javadoc for more information.
Please give me some suggestion.
Thanks a lot
Check if this is within your .bashrc, if not, add them and restart all hbase services (do not forget to manually run them as well), that did it for me with a pseudo-distributed installation. My problem (and maybe yours as well) was that Hbase wasn't detecting it's configuration.
export HADOOP_CONF_DIR=/etc/hadoop/conf
export HBASE_CONF_DIR=/etc/hbase/conf
I see this very often on my machine. I don't have a failsafe cure, but end up running stop-all.sh, and deleting every place that hadoop and dfs (its a dfs failure) store their temp files. It seems to happen after my computer goes to sleep while dfs is running.
I am going to experiment with single-user mode to avoid this. I dont need distribution while developing.