what decides the roles in Paxos? - paxos

When people describe Paxos, they always assume that there are already some proposers in the cluster. But where do the proposers come from, or what decides which processes become proposers?

How the cluster is initially configured and how it is changed is down to the administrator who is trying to optimise the system.
You can run the different roles on different hosts and have different numbers of them. You could run three proposers, five acceptors and seven learners, whatever you choose. Clients that need to write a value only need to connect to proposers. With multi-Paxos for state replication that is sufficient, and the clients don't need to exchange messages with any other role type. Yet there is nothing to prevent clients from also being learners by seeing messages from acceptors.
As long as you follow the Paxos algorithm it all comes down to minimising network hops (latency and bandwidth), costs of hardware, and complexity of the software for your particular workload.
From a practical perspective your clients need to be able to find proposers in the face of failures. A cluster administrator will be configuring which nodes are to be proposers and making sure that they are discovered by clients.
It is hard to visualize from descriptions of the abstract algorithm how things might work, as many messaging topologies are possible. When applying the algorithm to a practical application it's far more obvious what setup minimises latency, bandwidth, hardware and complexity. An example might be a three-node MySQL cluster running Paxos. You want all three servers to have all the data, so they are all learners. All must be acceptors, as you need three at a minimum to have one node fail and still maintain progress. They may as well all be proposers to give the best availability and simplicity of software and configuration. Note that one will become the distinguished leader. The database administrator doesn't think about the Paxos roles, as they just set up a three-node database cluster.
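To make the three-node example a bit more concrete, here is a minimal in-memory sketch of single-decree Paxos where the same small set of nodes holds the acceptor state and any caller can act as proposer. It is only an illustration under simplifying assumptions (no network, no persistence, no leader, proposal numbers picked by hand); the names Acceptor and propose are made up for this sketch and do not come from any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Acceptor:
    """Acceptor role: remembers its highest promise and last accepted value."""
    promised: int = -1
    accepted_n: int = -1
    accepted_value: Optional[str] = None
    alive: bool = True  # hypothetical flag used to simulate a dead node

    def prepare(self, n):
        # Phase 1b: promise to ignore proposals numbered below n.
        if n > self.promised:
            self.promised = n
            return (self.accepted_n, self.accepted_value)
        return None

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered promise was made.
        if n >= self.promised:
            self.promised = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False


def propose(acceptors, n, value):
    """Proposer role: return the chosen value, or None if no quorum answered."""
    quorum = len(acceptors) // 2 + 1
    live = [a for a in acceptors if a.alive]

    # Phase 1: collect promises from the reachable acceptors.
    promises = [p for p in (a.prepare(n) for a in live) if p is not None]
    if len(promises) < quorum:
        return None

    # If some acceptor already accepted a value, the proposer must adopt
    # the one with the highest proposal number instead of its own.
    prior_n, prior_value = max(promises, key=lambda p: p[0])
    if prior_n >= 0:
        value = prior_value

    # Phase 2: ask the acceptors to accept the value.
    acks = sum(1 for a in live if a.accept(n, value))
    return value if acks >= quorum else None


if __name__ == "__main__":
    nodes = [Acceptor(), Acceptor(), Acceptor()]
    nodes[2].alive = False                 # one of the three nodes is down
    print(propose(nodes, n=1, value="A"))  # two of three is still a quorum
    print(propose(nodes, n=2, value="B"))  # a later proposer learns "A"
```

With three acceptors the quorum is two, which is why one node can fail and the cluster still makes progress; with a second node down, propose returns None.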
The roles in the cluster may need to change. For example, you might want to expand the capacity of a database cluster. Or a server might die so you need to change the cluster membership to swap the dead one for a fresh one. For the Paxos algorithm to work every process must have a strongly consistent view of which processes are in which roles. How do you get consensus? You use Paxos to fix a new value of the cluster membership.

Related

ActiveMQ datastore for cluster setup

We have been using an ActiveMQ version 5.16.0 broker with single instances in production. Now we are planning to use a cluster of AMQ brokers for HA and load distribution with consistency in message data. Currently we are using only one queue.
HA can be achieved using failover, but do we need to use the same datastore or can it be separate? If I use different instances for the AMQ brokers, how do I set up a common datastore?
Please guide me on how to set up the datastore for HA and load distribution.
Multiple ActiveMQ servers clustered together can provide HA in a few ways:
1. Scale message flow by using compute resources across multiple broker nodes
2. Maintain message flow during a planned or unplanned outage of a single broker node
3. Share the data store in the event of an ActiveMQ process failure
A network of brokers solves #1 and #2. A standard 3-node cluster will give you excellent performance and the ability to scale the number of producers and consumers, along with splitting the full flow across 3 nodes to provide increased capacity.
Solving for #3 is complicated-- in all messaging products. Brokers are always working to be completely empty-- so clustering the data store of a single-broker becomes an anti-pattern of sorts. Many times, relying on RAID disk with a single broker node will provide higher reliability than adding NFSv4, GFSv2, or JDBC and using shared-store.
That being said, if you must use a shared store-- follow best practices and use GFSv2 or NFSv4. JDBC is much slower and requires significant DB maintenance to keep running efficiently.
Note: Kevin Boone's note about CIFS/SMB is incorrect and CIFS/SMB should not be used. Otherwise, his responses are solid.
You can configure ActiveMQ so that instances share a message store, or so they have separate message stores. If they share a message store, then (essentially) the brokers will automatically form a master-slave configuration, such that only one broker (at a time) will accept connections from clients, and only one broker will update the store. Clients need to identify both brokers in their connection URIs, and will connect to whichever broker happens to be master.
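As a rough illustration of "identify both brokers in the connection URI": ActiveMQ's failover transport takes a comma-separated list of broker URIs in parentheses. The helper below just assembles that string; the function name and the host names are placeholders made up for this sketch, and the resulting URI is what you would hand to a JMS/OpenWire client, not something Python connects with.

```python
def failover_uri(brokers, randomize=False):
    """Build a failover transport URI from (host, port) pairs.

    `brokers` and the host names passed in are hypothetical examples.
    """
    endpoints = ",".join(f"tcp://{host}:{port}" for host, port in brokers)
    # randomize=false makes clients try the brokers in the listed order.
    flag = "true" if randomize else "false"
    return f"failover:({endpoints})?randomize={flag}"


if __name__ == "__main__":
    print(failover_uri([("amq-a.example.internal", 61616),
                        ("amq-b.example.internal", 61616)]))
```

With a shared store only the current master accepts connections, so the client simply ends up on whichever listed broker answers.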
With a shared message store like this, locks in the message store coordinate the master-slave assignment, which makes the choice of message store critical. Stores can be shared filesystems, or shared databases. Only a few shared filesystem implementations work properly -- anything based on NFS 4.x should work. CIFS/SMB stores can work, but there's so much variation between providers that it's hard to be sure. NFS v3 doesn't work, however well-implemented, because the locking semantics are inappropriate. In any case, the store needs to be robust, or replicated, or both, because the whole broker cluster depends on it. No store, no brokers.
In my experience, it's easier to get good throughput from a shared file store than a shared database although, of course, there are many factors to consider. Poor network connectivity will make it hard to get good throughput with any kind of shared store (or any kind of cluster, for that matter).
When using individual message stores, it's typical to put the brokers into some kind of mesh, with 'network connectors' to pass messages from one broker to another. Both brokers will accept connections from clients (there is no master), and the network connections will deal with the situation where messages are sent to one broker, but need to be consumed from another.
Clients don't necessarily need to specify all brokers in their connection URIs, but generally will, in case one of the brokers is down.
A mesh is generally easier to set up, and (broadly speaking) can handle more client load, than a master-slave with shared filestore. However, (a) losing a broker amounts to losing any messages that were associated with it (until the broker can be restored) and (b) the mesh interferes with messaging patterns like message grouping and exclusive consumers.
There's really no hard-and-fast rule to determine which configuration to use. Many installers who already have some sort of shared store infrastructure (a decent relational database, or a clustered NFS, for example) will tend to want to use it. The rise in cloud deployments has had the effect that mesh operation with no shared store has become (I think) a lot more popular, because it's so symmetric.
There's more -- a lot more -- that could be said here. As a broad question, I suspect the OP is a bit out-of-scope for SO. You'll probably get more traction if you break your question up into smaller, more focused, parts.

Paxos: How are proposers, accepters and learners selected?

I want to elect a leader from a number of identical processes. All the explanations of Paxos say that some processes are Proposers, some are Voters and some are Accepters. Do I need to assign these roles to my processes when I launch them?
What if all my proposers die? Can I switch existing learners/voters to proposers?
The way I thought about this is that all processes come up as voters and then wait with a random timeout for messages. If they don't get any message before the timeout expires, they assume the role of a proposer. If they receive a message, they kick off another timeout and can then become a proposer again if they don't receive any more messages before that timeout ends (or they have an agreement).
Is that a valid approach?
I am a maintainer of several production paxos systems. I'll detail what I have seen in mine and other practical/production systems.
All the explanations of Paxos say that some processes are Proposers, some are Voters and some are Accepters. Do I need to assign these roles to my processes when I launch them?
In a practical system the proposers and accepters are the same set of nodes. That is, proposer and acceptor are two roles taken on by the same process.
What if all my proposers die?
Many Paxos systems (such as those used as kv stores) work on a sequence of paxos instances (like a transaction log). It is infeasible to assume that the same set of processes will live forever, so there must be mechanisms to change the set of nodes in the quorum.
In some of the systems I maintain there are two parts to the proposed/chosen value: the payload from the customer and the paxos membership. The paxos round is run by the quorum chosen in the prior value. Doing it this way does impede the ability of your system to pipeline the values chosen; you'll have to batch them instead if you want to choose multiple values at once. Alternatively, you can look at the way Raft chooses membership.
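A small sketch of the "membership travels with the value" idea may help. It assumes a log of already-chosen values, each carrying the membership for the next instance; ChosenValue, quorum_members and the bootstrap set are names invented for this illustration, not taken from any real system.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ChosenValue:
    payload: str                 # what the customer asked to store
    membership: FrozenSet[str]   # nodes that will run the NEXT instance


def quorum_members(log: List[ChosenValue], instance: int,
                   bootstrap: FrozenSet[str]) -> FrozenSet[str]:
    """Return the nodes allowed to run the given paxos instance."""
    if instance > len(log):
        # The membership is unknown until the prior value is chosen,
        # which is why instances cannot simply be pipelined.
        raise ValueError(
            f"instance {instance} must wait for instance {instance - 1}")
    return bootstrap if instance == 0 else log[instance - 1].membership


if __name__ == "__main__":
    boot = frozenset({"a", "b", "c"})
    log = [ChosenValue("x=1", frozenset({"a", "b", "d"}))]  # swaps c for d
    print(sorted(quorum_members(log, 1, boot)))
```

The ValueError branch is the pipelining limitation in miniature: nobody knows who may vote in instance N+1 until instance N has been decided.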
In Raft, changing quorum members is a two-phase process. First, the new quorum is proposed and committed; then both quorums are used for a while; and finally the new quorum takes over. Specifically, in the interim a majority of both the old and the new quorums is required to commit anything, including the take-over by the new quorum.
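The "majority of both" rule during the interim can be sketched in a few lines. This is only the counting rule, assuming node names are plain strings; the function names are made up for the example and nothing here is Raft library code.

```python
def has_majority(acks, members):
    """True if strictly more than half of `members` acknowledged."""
    return len(set(acks) & set(members)) > len(members) // 2


def joint_commit_ok(acks, old_members, new_members):
    """During the transition an entry needs a majority of BOTH quorums."""
    return (has_majority(acks, old_members)
            and has_majority(acks, new_members))


if __name__ == "__main__":
    old = {"a", "b", "c"}
    new = {"c", "d", "e"}
    print(joint_commit_ok({"a", "b", "c"}, old, new))  # old majority only
    print(joint_commit_ok({"a", "c", "d"}, old, new))  # majority of both
```

Requiring both majorities means the old and the new quorum can never decide different things on their own while the change is in flight.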

Zookeeper vs In-memory-data-grid vs Redis

I've found different zookeeper definitions across multiple resources. Maybe some of them are taken out of context, but look at them pls:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, that would be easier for me to understand Zookeeper by comparing it with them.
Could you please compare Zookeeper with in-memory-data-grids and Redis?
If distributed-memory computation, how does zookeeper differ from in-memory-data-grids?
If synchronization across a cluster, then how does it differ from all other in-memory storages? The same in-memory-data-grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about in-memory consistent data, then there are other alternatives. IMDGs allow you to achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, Zookeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they generally are sent to every client as soon as the write takes place). Finally, the maximum size of a "file" (znode) in Zookeeper is 1MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and it is definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for this task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation and for most data sets, won't be able to store the data with any performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nice with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull" so nodes are notified very quickly of changes.
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not provide the data. For example, let's say you had to process rows 1-100 of a table. You might put 10 ZK nodes up, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10.
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (e.g. Redis, a SQL database, etc.). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again.
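The claim-and-release behaviour described above can be imitated with a toy in-memory coordinator. To be clear, this is a stand-in for the idea of ephemeral nodes, not the ZooKeeper API: ToyCoordinator, claim_next and the session names are all made up, and a real client would create ephemeral znodes instead.

```python
class ToyCoordinator:
    """In-memory stand-in for ephemeral claims; NOT the ZooKeeper API."""

    def __init__(self):
        self._claims = {}  # row range -> session that claimed it

    def try_claim(self, row_range, session):
        # Mirrors "create the node, fail if it already exists".
        if row_range in self._claims:
            return False
        self._claims[row_range] = session
        return True

    def session_expired(self, session):
        # Ephemeral entries vanish together with their session.
        self._claims = {r: s for r, s in self._claims.items()
                        if s != session}


def claim_next(coord, row_ranges, session):
    """Grab the first unclaimed range, or return None if all are taken."""
    for row_range in row_ranges:
        if coord.try_claim(row_range, session):
            return row_range
    return None


if __name__ == "__main__":
    coord = ToyCoordinator()
    ranges = ["1-10", "11-20"]
    print(claim_next(coord, ranges, "worker-1"))
    print(claim_next(coord, ranges, "worker-2"))
    coord.session_expired("worker-1")          # worker-1 died mid-job
    print(claim_next(coord, ranges, "worker-3"))
```

The point is only that the claim disappears with the worker's session, so another worker can pick the range up without anyone cleaning up by hand.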
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is master and the other 2 are slaves. If the master goes down, a slave node must become the new leader. This type of thing is perfect for ZK.
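For leader election the usual ZooKeeper recipe has every candidate create an ephemeral sequential node, and whoever holds the lowest sequence number is the leader. Below is a toy simulation of just that rule, with no ZooKeeper involved; ToyElection and the node names are invented for the sketch.

```python
import itertools


class ToyElection:
    """Simulates 'lowest sequence number wins'; NOT the ZooKeeper API."""

    def __init__(self):
        self._seq = itertools.count()
        self._candidates = {}  # sequence number -> node name

    def join(self, node):
        # Stands in for creating an ephemeral sequential node.
        number = next(self._seq)
        self._candidates[number] = node
        return number

    def leave(self, node):
        # Stands in for the node's session ending.
        self._candidates = {n: c for n, c in self._candidates.items()
                            if c != node}

    def leader(self):
        if not self._candidates:
            return None
        return self._candidates[min(self._candidates)]


if __name__ == "__main__":
    election = ToyElection()
    for name in ("node-1", "node-2", "node-3"):
        election.join(name)
    print(election.leader())
    election.leave("node-1")   # the master goes down
    print(election.leader())
```

When the current leader's entry disappears, the next-lowest candidate takes over, which is the master/slave failover described above.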
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both reads and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc) the client will not know if the update has applied or not. We take steps to minimize the failures, but the only guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound. (On the order of tens of seconds.) Either system changes will be seen by a client within this bound, or the client will detect a service outage.

Apache Kafka: Mirroring vs. Replication

Mirroring is replicating data between Kafka clusters, while Replication is for replicating data across nodes within a Kafka cluster.
Is there any specific use of Replication, if Mirroring has already been setup?
They are used for different use cases. Let's try to clarify.
As described in the documentation,
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades. We have the following high-level goals:
Inside a cluster there might be failures (a single server fails, the network partitions, and so forth), therefore we want to provide replication between the nodes. Given a setup of three nodes and one cluster, if server1 fails, there are two replicas Kafka can choose from. Same cluster implies same response times (OK, it also depends on how these servers are configured, sure, but in a normal scenario they should not differ so much).
Mirroring, on the other hand, seems to be very valuable, for example, when you are migrating a data center, or when you have multiple data centers (e.g., AWS in the US and AWS in Ireland). Of course, these are just a couple of use cases. So what you do here is to give applications belonging to the same data center a faster and better way to access data - data locality in some contexts is everything.
If you have one node in each cluster, in case of failure, you might have way higher response times to go, let's say, from AWS located in Ireland to AWS in the US.
You might claim that in order to achieve data locality (services in cluster one read from Kafka in cluster one) one still needs to copy the data from one cluster to the other. That's definitely true, but the advantages you get with mirroring could outweigh those of reading directly (via an SSH tunnel?) from Kafka located in another data center. Reading directly exposes you to, for example, single connections going down, longer client connection/session times (depending on the location of the data center), and legislation (some data can be collected in one country while other data shouldn't be).
Replication is the basis of higher availability. You shouldn't use Mirroring to handle high availability in a context where data locality matters. At the same time, you should not use just Replication where you need to duplicate data across data centers (I don't even know if you can without Mirroring/an ssh tunnel).

How to achieve high availability?

My boss wants to have a system that takes a continent-wide catastrophic event into account. He wants to have two servers in the US and two servers in Asia (1 login server and 1 worker server in each continent).
In the event that an earthquake breaks the connection between the two continents, both should work alone. When the connection is revived, they should sync each other back to normal.
External cloud systems are not allowed, as he has no confidence in them.
The system should take scalability into account, which means the addition of new servers should be easy to configure.
The servers should be load balanced.
The connection between the servers should be very secure (encrypted and sent through SSL, although SSL takes care of encryption).
The system should let one and only one user log in with one account. (Beware of latency between the continents: two users sharing an account may reach both login servers at the same time.)
Please help. I'm already at the end of my wit. Thank you in advance.
I imagine that these requirements (if properly analysed) are essentially incompatible, in that they cannot work according to CAP Theorem.
If you have several datacentres, even if they are close by, partitions WILL happen. If a partition happens, either availability OR consistency MUST be lost, because either:
you have a pre-determined "master", which keeps working and other "slave" DCs which fail (or go readonly). This keeps consistency at the expense of availability.
OR you lose consistency for the duration of the partition (this means that operations which depend on immediate consistency are also unavailable).
This is incompatible with your requirements, as far as I can see. What your boss wants is clearly impossible. He needs to understand CAP theorem.
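The two options above can be put side by side in a tiny sketch. It assumes two sites named "us" and "asia" with "us" as the pre-determined master; the function and the policy names are invented purely for illustration.

```python
def sites_accepting_login(policy, sites=("us", "asia"), master="us"):
    """Which sites accept a login for one account while partitioned.

    `policy`, `sites` and `master` are hypothetical names for this sketch.
    """
    if policy == "keep_consistency":
        # Only the pre-determined master keeps working; the other site
        # fails or goes read-only, so availability is lost there.
        return [s for s in sites if s == master]
    if policy == "keep_availability":
        # Every site keeps working alone, so the same account can be
        # logged in on both sides at once.
        return list(sites)
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    print(sites_accepting_login("keep_consistency"))
    print(sites_accepting_login("keep_availability"))
```

Either the Asia site refuses logins during the partition, or the "one and only one login per account" rule can be broken; no setting of the policy gives both.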
Now, in YOUR application case, you may decide that you can bend the rules and redefine what consistency or availability are, for convenience, and have a system which degrades into an inconsistent but temporarily acceptable state.
You probably want to get product management to have a look at the business case for these requirements. Dropping some of them is probably OK. Consistency is a good requirement to keep, as it makes things behave as people expect; this means dropping availability or partition tolerance. Keeping consistency is definitely easier from an engineering perspective.
This is another one of those things where employers tend not to understand the benefits of using an off-the-shelf solution. If you as a programmer don't really even know where to start with this, then rolling your own is probably going to be a huge money and time sink. There's nothing wrong with not knowing this stuff either; high-availability, failsafe networking that takes into consideration catastrophic failure of critical components is a large problem domain that many people pour a lot of effort and money into. Why not take advantage of what providers have to offer?
Give talking to your boss about using existing cloud providers one more try.
You could contact one of the solid and experienced hosting providers (we use Rackspace) that have data centers in different regions worldwide and get their recommendations based on your requirements.
This will require expert assistance and a large budget, and serious planning.
A better option would be to contact a reputable provider with a global footprint, select a premium solution with a solid SLA backing up their service, and let them tailor a solution that comes close to your needs.
Just realize that even the guys like Google, Yahoo, Microsoft and Amazon (to name a few) have at one time or another had some issue that rendered segments of their systems offline to certain users.