Hello, I am Mohamad, a master's student.
I want to ask a question about ZooKeeper.
I read that for a write operation in ZooKeeper, the server connected to the client must first contact the leader; the leader then holds a vote, and once it hears back from more than half of the servers it replies to the server connected to the client so that the operation can proceed.
My first question is: what is the voting process for? What exactly do they vote on?
My second question is: how do they vote? Do they send messages, and how do they notify the leader? And my third question is: why do they need voting at all? I read that there is a version number used to check for updated data, so why is voting needed?
Please reply as quickly as you can.
Thanks in advance
This is the fastest I can. For a better understanding of how these systems work, you should get the book Distributed Algorithms by Nancy Lynch.
Background -
The algorithm paradigm is Paxos, though ZooKeeper has its own implementation that differs a bit. ZooKeeper commits data using a two-phase commit. All the communication happens over FIFO channels using an atomic broadcast protocol to preserve ordering.
What is the voting process for? The voting process is for finding a leader, not for the two-phase commit. No ballots, nothing: the node with the highest zxid becomes the leader.
Voting is for leader election. The two-phase commit is for write operations. For more info, check out the ZooKeeper docs and, more importantly, the distributed algorithms book to understand why these systems behave the way they do :).
--Sai
ZooKeeper follows a simple client-server model where clients are nodes (i.e., machines) that make use of the service, and servers are nodes that provide the service. A collection of ZooKeeper servers forms a ZooKeeper ensemble. At any given time, one ZooKeeper client is connected to exactly one ZooKeeper server. Each ZooKeeper server can handle a large number of client connections at the same time. Each client periodically sends pings to the ZooKeeper server it is connected to, to let it know that it is alive and connected. The ZooKeeper server in question responds with an acknowledgment of the ping, indicating that the server is alive as well. When the client doesn't receive an acknowledgment from the server within the specified time, the client connects to another server in the ensemble, and the client session is transparently transferred over to the new ZooKeeper server. Check this to understand the ZooKeeper architecture.
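To make the session handling above concrete, here is a minimal sketch using the official org.apache.zookeeper Java client; the connect string, host names, and session timeout are illustrative assumptions, not values from the question:

```java
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class ZkSessionExample {
    public static void main(String[] args) throws Exception {
        CountDownLatch connected = new CountDownLatch(1);

        // The client library handles the pings itself; if the session times out,
        // it transparently tries the other servers in the connect string.
        ZooKeeper zk = new ZooKeeper(
                "zk1:2181,zk2:2181,zk3:2181",   // assumed ensemble
                15_000,                         // session timeout in milliseconds
                (WatchedEvent event) -> {
                    if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                        connected.countDown();
                    } else if (event.getState() == Watcher.Event.KeeperState.Expired) {
                        System.out.println("Session expired; a new ZooKeeper handle is needed");
                    }
                });

        connected.await();
        System.out.println("Connected, session id: 0x" + Long.toHexString(zk.getSessionId()));
        zk.close();
    }
}
```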
When a client requests to read the contents of a particular znode, the read takes place at the server that the client is connected to. Consequently, since only one server from the ensemble is involved, reads are quick and scalable. However, for writes to be completed successfully, a strict majority of the nodes of the ZooKeeper ensemble are required to be available. When the ZooKeeper service is brought up, one node from the ensemble is elected as the leader. When a client issues a write request, the connected server passes on the request to the leader. This leader then issues the same write request to all the nodes of the ensemble. If a strict majority of the nodes (also known as a quorum) respond successfully to this write request, the write request is considered to have succeeded. A successful return code is then returned to the client who initiated the write request. If a quorum of nodes are not available in an ensemble, the ZooKeeper service is nonfunctional. Check this to understand the voting process for write operations
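The commit rule itself is just "strict majority of the ensemble". As a toy illustration only (this is not ZooKeeper's actual commit code), the leader's decision amounts to counting acknowledgments against a majority threshold:

```java
// Toy model of the quorum rule described above: a write "commits" only when a
// strict majority of the ensemble has acknowledged it.
public class QuorumCheck {
    public static boolean isCommitted(int ensembleSize, int acks) {
        int majority = ensembleSize / 2 + 1;    // strict majority, e.g. 3 out of 5
        return acks >= majority;
    }

    public static void main(String[] args) {
        System.out.println(isCommitted(5, 2));  // false: 2 acks out of 5
        System.out.println(isCommitted(5, 3));  // true:  3 acks out of 5
    }
}
```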
For the service to be reliable and scalable, it is replicated over a set of machines. ZooKeeper uses a version of the famous Paxos algorithm, to keep replicas consistent.
ZooKeeper gives the following consistency guarantees:
Sequential Consistency: Updates from a client will be applied in the order that they were sent.
Atomicity: Updates either succeed or fail -- there are no partial results.
Single System Image: A client will see the same view of the service regardless of the server that it connects to.
Reliability: Once an update has been applied, it will persist from that time forward until a client overwrites the update.
Timeliness: The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.
Here are the answers to your questions
Question 1: The voting is on whether the write operation should be committed or not.
Question 2: The communication between clients and servers in the ZooKeeper ensemble happens over TCP connections, through the exchange of messages using the ZAB protocol.
Question 3: For the service to be reliable and fault tolerant, the data has to be replicated to a quorum of servers.
Related
In this example I have a setup of 2 consumers and 2 publishers in my network. The centre is a RabbitMQ broker, as shown in the screenshot below. For fail-safety reasons, I am wondering if RabbitMQ supports load-balancing or mirroring of the server (broker) in any way. I would just like to get rid of the star topology, for two reasons:
1) If one broker fails, another one can take over immediately
2) If one broker's network throughput is not good enough, the other takes over
Solving one or the other (or even both) would be great.
My current infrastructure
Preferred infrastructure
RabbitMQ clustering (docs) can meet your first requirement. Use three nodes and be sure your applications are coded and tested to take failure scenarios into account.
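For the client side of that setup, a minimal sketch with the RabbitMQ Java client is shown below; the node host names and queue name are assumptions for illustration, and whether queue contents survive a node failure still depends on how the queues themselves are configured on the cluster:

```java
import com.rabbitmq.client.Address;
import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import java.util.Arrays;
import java.util.List;

public class FailoverPublisher {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setAutomaticRecoveryEnabled(true);   // reconnect if the current node dies
        factory.setNetworkRecoveryInterval(5_000);   // retry every 5 seconds

        // List every node of the cluster; the client connects to the first
        // reachable one and recovers onto another if that node fails.
        List<Address> nodes = Arrays.asList(
                new Address("rabbit1", 5672),
                new Address("rabbit2", 5672),
                new Address("rabbit3", 5672));

        try (Connection connection = factory.newConnection(nodes);
             Channel channel = connection.createChannel()) {
            channel.queueDeclare("events", true, false, false, null);
            channel.basicPublish("", "events", null, "hello".getBytes());
        }
    }
}
```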
I don't know of anything out-of-the-box that can meet your second requirement. You will have to implement something that uses cluster statistics or application statistics to determine when to switch to another cluster due to lower throughput.
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
I've found different ZooKeeper definitions across multiple resources. Maybe some of them are taken out of context, but please look at them:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, so it would be easier for me to understand ZooKeeper by comparing it with them.
Could you please compare ZooKeeper with in-memory data grids and Redis?
If it's distributed-memory computation, how does ZooKeeper differ from in-memory data grids?
If it's synchronization across a cluster, then how does it differ from all the other in-memory stores? The same in-memory data grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about consistent in-memory data, then there are other alternatives. IMDGs let you achieve the same thing, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, Zookeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they generally are sent to every client as soon as the write takes place). Finally, the maximum size of a "file" (znode) in Zookeeper is 1MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and is definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for that task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation and, for most data sets, won't be able to store the data with any reasonable performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that, in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nicely across multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull", so nodes are notified very quickly of changes.
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not to provide the data. For example, let's say you had to process rows 1-100 of a table. You might put up 10 ZK znodes, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10.
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (e.g., Redis, a SQL database, etc.). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again.
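A hedged sketch of that claim step with the plain ZooKeeper API; the path is the one from the example above, and it assumes the parent znodes already exist as persistent nodes (ephemeral znodes cannot have children):

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class PartitionClaim {
    // Tries to claim a partition (e.g. "rows_1_10") by creating an ephemeral znode.
    // If this client dies or disconnects, the znode disappears and another worker
    // can claim the partition again.
    public static boolean claim(ZooKeeper zk, String partition) throws Exception {
        String path = "/clients/192.168.77.66/processing/" + partition;
        try {
            zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;                       // we own this partition now
        } catch (KeeperException.NodeExistsException alreadyClaimed) {
            return false;                      // another worker got there first
        }
    }
}
```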
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is the master and the other two are slaves. If the master goes down, a slave must become the new leader. This type of thing is perfect for ZK.
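A minimal sketch of that leader-election pattern using Apache Curator's LeaderLatch recipe; the connect string and latch path are assumptions for illustration:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionExample {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181",        // assumed ensemble
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // All three nodes run this; ZooKeeper guarantees at most one of them
        // holds leadership at a time. If the leader dies, its ephemeral znode
        // goes away and another participant is promoted automatically.
        try (LeaderLatch latch = new LeaderLatch(client, "/myapp/leader")) {
            latch.start();
            latch.await();                           // blocks until this node is the leader
            System.out.println("I am the leader now; doing master work...");
        } finally {
            client.close();
        }
    }
}
```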
Consistency Guarantees
ZooKeeper is a high-performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know whether the update has been applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.
My app will work as follows:
I'll have a bunch of replica servers and a load balancer. The data updates will be managed outside CometD. EDIT: I still intend to notify each CometD server of those updates, if necessary, so they can respond back to clients.
The clients are only subscribing to those updates (i.e. read only), so the CometD server nodes don't need to know anything about each other's behavior.
Am I right in thinking I could have server-side "client" instances on the load balancer, one per client connection, where each instance listens on the same channel as its respective client and forwards any messages back to it? If so, are there any disadvantages to this approach compared to using Oort?
Reading the docs about Oort, it seems that the nodes "know" about each other, which I don't need. Would it be better, then, for me to avoid using Oort altogether in my case? My concern is that if I ended up adding many, many nodes, the fact that they communicate with "each other" could mean unnecessary processing.
The description of the issue specifies that the data updates are managed outside CometD, but it does not detail how the CometD servers are notified of these data updates.
The two common solutions are A) to notify each CometD server or B) to use Oort.
In solution A) you have an event that triggers a data update, and some external application performs the data update on, say, a database. At this point the external application must notify the CometD servers that there was a data update. If the external application runs on a JVM, it can use the CometD Java client to send a message to each CometD server, notifying them of the data update; in turn, the CometD servers will notify the remote clients.
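As a hedged sketch of solution A), assuming a CometD 3/4-style Java client; the server URLs, channel name, and message payload are made up for illustration:

```java
import java.util.List;
import java.util.Map;
import org.cometd.client.BayeuxClient;
import org.cometd.client.transport.LongPollingTransport;
import org.eclipse.jetty.client.HttpClient;

public class DataUpdateNotifier {
    public static void main(String[] args) throws Exception {
        HttpClient httpClient = new HttpClient();
        httpClient.start();

        // One message per CometD server; each server then pushes the update to
        // the remote clients subscribed to the channel.
        List<String> cometdServers = List.of(
                "http://node1:8080/cometd",
                "http://node2:8080/cometd");

        for (String url : cometdServers) {
            BayeuxClient client = new BayeuxClient(url, new LongPollingTransport(null, httpClient));
            client.handshake();
            client.waitFor(5_000, BayeuxClient.State.CONNECTED);
            client.getChannel("/data/updates")
                  .publish(Map.of("table", "orders", "action", "updated"));
            client.disconnect();
        }
        httpClient.stop();
    }
}
```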
In solution B) the external application must notify just one CometD server that there was a data update; the Oort cluster will do the rest, broadcasting that message across the cluster, and then to remote clients.
Solution A) does not require the Oort cluster, but requires the external application to know about every node and to send a message to each of them.
Solution B) uses Oort, so the external application needs only to know one Oort node.
Oort requires a bit of additional processing because the nodes are interconnected, but depending on the case this processing may be negligible, or the complications of notifying each CometD server "manually" (as in solution A) may be greater than running Oort.
I don't understand exactly what you mean by having "server side client instances on the load balancer". Typically load balancers don't run a JVM so it is not possible to run CometD clients on them, so this sentence does not sound right.
Besides the CometD documentation, you can also look at these slides.
Is OpenLDAP (or any of LDAP's flavors) capable of providing write concern? I know it's an eventually consistent model, but there are more than a few DBs that offer eventual consistency plus write concern.
After doing some research, I'm still not able to figure out whether or not it's a thing.
The UnboundID Directory Server provides support for an assured replication mode in which you can request that the server delay the response to an operation until it has been replicated in a manner that satisfies your desired constraints. This can be controlled on a per-operation basis by including a special control in the add/delete/modify/modify DN request, or by configuring the server with criteria that can be used to identify which operations should use this assured replication mode (e.g., you can configure the server so that operations targeting a particular set of attributes are subjected to a greater level of assurance than others).
Our assured replication implementation allows you to define separate requirements for local servers (servers in the same data center as the one that received the request from the client) and nonlocal servers (servers in other data centers). This allows you to tune the server to achieve a balance between performance and behavior; a rough sketch of attaching such a per-operation control is shown after the assurance levels below.
For local servers, the possible assurance levels are:
Do not perform any special assurance processing. The server will send the response to the client as soon as it's processed locally, and the change will be replicated to other servers as soon as possible. It is possible (although highly unlikely) that a permanent failure that occurs immediately after the server sends the response to the client but before it gets replicated could cause the change to be lost.
Delay the response to the client until the change has been replicated to at least one other server in the local data center. This ensures that the change will not be lost even in the event of the loss of the instance that the client was communicating with, but the change may not yet be visible on all instances in the local data center by the time the client receives the response.
Delay the response to the client until the result of the change is visible in all servers in the local data center. This ensures that no client accessing local servers will see out-of-date information.
The assurance options available for nonlocal servers are:
Do not perform any special assurance processing. The server will not delay the response to the client based on any communication with nonlocal servers, but a change could be lost or delayed if an entire data center is lost (e.g., by a massive natural disaster) or becomes unavailable (e.g., because it loses network connectivity).
Delay the response to the client until the change has been replicated to at least one other server in at least one other data center. This ensures that the change will not be lost even if a full data center is lost, but does not guarantee that the updated information will be visible everywhere by the time the client receives the response.
Delay the response to the client until the change has been replicated to at least one server in every other data center. This ensures that the change will be processed in every data center even if a network partition makes a data center unavailable for a period of time immediately after the change is processed. But again this does not guarantee that the updated information will be visible everywhere by the time the client receives the response.
Delay the response to the client until the change is visible in all available servers in all other data centers. This ensures that no client will see out-of-date information regardless of the location of the server they are using.
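As noted above, the per-operation behavior is requested by attaching a control to the request. A hedged sketch with the UnboundID LDAP SDK follows; the control OID and its criticality are placeholders, not the server's real assured-replication control, and the host, port, and entry are made up:

```java
import com.unboundid.ldap.sdk.Control;
import com.unboundid.ldap.sdk.LDAPConnection;
import com.unboundid.ldap.sdk.Modification;
import com.unboundid.ldap.sdk.ModificationType;
import com.unboundid.ldap.sdk.ModifyRequest;

public class AssuredWriteExample {
    public static void main(String[] args) throws Exception {
        LDAPConnection conn = new LDAPConnection("ldap.example.com", 389);
        try {
            ModifyRequest modify = new ModifyRequest(
                    "uid=jdoe,ou=People,dc=example,dc=com",
                    new Modification(ModificationType.REPLACE, "mail", "jdoe@example.com"));

            // Placeholder for the server-specific assured replication request control.
            // The real OID and encoded value are defined by the directory server's
            // documentation; "1.2.3.4" is NOT the actual OID.
            modify.addControl(new Control("1.2.3.4", true));

            // The server delays this response until the configured assurance level
            // (e.g. "replicated to at least one other local server") has been met.
            conn.modify(modify);
        } finally {
            conn.close();
        }
    }
}
```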
The UnboundID Directory Server also provides features to help ensure that clients are not exposed to out-of-date information under normal circumstances. Our replication mechanism is very fast so that changes generally appear everywhere in a matter of milliseconds. Each server is constantly monitoring its own replication backlog and can take action if the backlog becomes too great (e.g., mild action like alerting administrators or more drastic measures like rejecting client requests until replication has caught up). And because most replication backlogs are encountered when the server is taken offline for some reason, the server also has the ability to delay accepting connections from clients at startup until it has caught up with all changes processed in the environment while it was offline. And if you further combine this with the advanced load-balancing and health checking capabilities of the UnboundID Directory Proxy Server, you can ensure that client requests are only forwarded to servers that don't have a replication backlog or any other undesirable condition that may cause the operation to fail, take an unusually long time to complete, or encounter out-of-date information.
From reviewing RFC 3384's discussion of replication requirements for LDAP, it looks as though LDAP only requires eventual consistency and does not require transactional consistency. Therefore any products which support this feature are likely to do so with vendor-specific implementations.
CA Directory does support a proprietary replication model called MULTI-WRITE which guarantees that the client obtains write confirmation only after all replicated instances have been updated. In addition it supports the standard X.525 Shadowing Protocol which provides lesser consistency guarantees and better performance.
With typical LDAP implementations, an update request will normally return as soon as the DSA handling the request has been updated, not when the replica instances have been updated. I believe this is the case with OpenLDAP. The benefit is speed; the downside is the lack of a guarantee that an update has been applied to all replicas.
CA's directory product uses a memory-mapped system, and writes are so fast that this is not a concern.
I have implemented a WCF service and now my client wants three copies of it, working independently on different machines: a master-slave approach. I need to find a solution that enables the following behavior:
The first service that is instantiated "asks" the other two whether they are alive; if not, it becomes the master and is the one active on the network. The other two, once instantiated, see that there is already a master alive, so they become slaves and start sleeping. There needs to be some mechanism to periodically check whether the master is dead and, if so, to choose the next copy that is alive to become the master (until it dies in turn).
I think this should be a kind of architectural pattern, so I would be more than glad to be given any advice.
Thanks
I would suggest looking at the WCF peer channel (System.Net.PeerToPeer) to facilitate each node knowing about the other nodes. Here is a link that offers a decent introduction.
As for determining which node should be the master, the trick will be negotiating which node should be the master if two or more nodes come online at about the same time. Once the nodes become aware of each other, there needs to be some deterministic mechanism for establishing the master. For example, you could use the earliest creation time, the lowest value of the last octet of each node's IP address, or anything really. You'll just need to define some scheme that allows the nodes to negotiate this automatically.
Finally, as for checking if the master is still alive, I would suggest using the event-based mechanism described here. The master could send out periodic health-and-status events that the other nodes would register for. I'd put a try/catch/finally block at the code entry point so that if the master were to crash, it could publish one final MasterClosing event to let the slaves know it's going away. What this does not account for is a system crash, e.g., power failure, etc. To handle this, provide a timeout in the slaves so that when the timeout expires, the slaves can query the master to see if it's still there. If not, the slaves can negotiate between themselves using your deterministic algorithm about who should be the next master.
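The question is about WCF/.NET, so the following is only a language-agnostic sketch (written in Java) of the deterministic tie-break plus heartbeat timeout described above; the node ids and the timeout value are assumptions:

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

public class MasterElection {
    // A node is identified by some deterministic id (creation time, last octet of
    // its IP address, etc.) and reports a heartbeat timestamp.
    record Node(String id, Instant lastHeartbeat) {}

    static final long HEARTBEAT_TIMEOUT_MS = 10_000;

    // A node is considered alive if its last heartbeat is recent enough.
    static boolean isAlive(Node node, Instant now) {
        return now.toEpochMilli() - node.lastHeartbeat().toEpochMilli() < HEARTBEAT_TIMEOUT_MS;
    }

    // Deterministic rule: among all live nodes, the lowest id wins. Every node
    // applies the same rule to the same membership list, so they all agree on
    // the master without any extra coordination.
    static String electMaster(List<Node> knownNodes, Instant now) {
        return knownNodes.stream()
                .filter(n -> isAlive(n, now))
                .min(Comparator.comparing(Node::id))
                .map(Node::id)
                .orElseThrow(() -> new IllegalStateException("no live nodes"));
    }
}
```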