Which consistency model do I have?

I have implemented a replicated key/value store on top of Redis. I have passive replication, in which all write and read requests are forwarded to the leader, which always returns the last value written for the key. The system uses a quorum, so it keeps working even if some nodes are down or cut off by a network partition. In that case the values on those nodes are not consistent, but this does not prevent the system from returning the most recently written value. Do I have an eventual consistency model or a strict one? Thanks

You mentioned that it is a quorum-based system with one node as the leader, and that read and write requests are always forwarded to the leader.
For the sake of simplicity, let's assume there are 5 nodes in the system, one of which is the leader; the other 4 nodes are secondaries.
Quorum-based systems typically build on consensus protocols, so out of 5 nodes it is enough for 3 to hold the latest value in order to always return it.
This is how writes should work:
1. The leader first updates the key/value in its own database.
2. It forwards the request to the remaining 4 nodes, the secondaries.
3. The leader waits for at least 2 of the secondaries to acknowledge that they have written the latest key/value to their databases. That means at least 3 of the 5 nodes now hold the latest value.
4. If the leader does not get a response from at least 2 of the secondaries within the specified time period (the request times out), it returns a failure to the client, and the client needs to retry.
So a write succeeds only if 3 out of 5 nodes have the latest value. At any point, 2 of the nodes may or may not have the latest value and can catch up later.
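Here is a minimal leader-side sketch of that write path in Java (all names are hypothetical; a real system would also need a replication log and failure detection):

    import java.util.List;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.TimeUnit;

    public class QuorumWrite {
        // Hypothetical handle to a secondary replica.
        interface Secondary {
            boolean replicate(String key, String value) throws Exception;
        }

        // Leader-side write: apply locally, then wait for enough secondary
        // acks to reach a majority of the 5-node cluster (leader + 2 = 3).
        static boolean write(String key, String value,
                             ConcurrentMap<String, String> leaderDb,
                             List<Secondary> secondaries,
                             ExecutorService pool) throws InterruptedException {
            leaderDb.put(key, value); // copy #1: the leader itself

            CountDownLatch acks = new CountDownLatch(2); // 2 more acks = quorum of 3
            for (Secondary s : secondaries) {
                pool.submit(() -> {
                    try {
                        if (s.replicate(key, value)) {
                            acks.countDown();
                        }
                    } catch (Exception ignored) {
                        // Node is down or partitioned; it can catch up later.
                    }
                });
            }
            // Fail the write if the quorum is not reached in time, so the
            // client knows to retry.
            return acks.await(2, TimeUnit.SECONDS);
        }
    }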
For reads, the leader (which always has the latest key/value) returns the response.
What happens when the leader machine is unable to serve requests due to some issue (e.g. a network error)?
Typically these systems have a leader election protocol for the case where the current leader can no longer serve requests.
The new leader is chosen from among the secondaries that have the latest updates. So the newly elected leader has the latest state and can start serving read requests with the latest set of values.
Your system is strictly consistent.

Related

Why do we need distributed lock in Redis if we partition on Key?

Folks,
I read through Why we need distributed lock in the Redis, but it does not answer my question.
I went through the following:
https://redis.io/topics/distlock (to be specific, https://redis.io/topics/distlock#the-redlock-algorithm) and
https://redis.io/topics/partitioning
I understand that partitioning is used to distribute the load across N nodes. Doesn't this mean that the data of interest will always be on one node? If so, why should I go for a distributed lock that locks across all N nodes?
Example:
Say I persist a trade with the id 123. The partitioning, based on the hashing function, will work out which node it has to go to; for the sake of this example, say it goes to node 0.
Now my clients (multiple instances of my service) will want to access trade 123.
Redis, again based on the hashing, will find out which node trade 123 lives on.
That means the clients (in reality only one instance among the multiple instances, i.e. only one client) will lock trade 123 on node 0.
So why would it care to lock all N nodes?
Unless partitioning is done on a single record across all the nodes, i.e. say a single trade is 1 MB in size and it is partitioned across 4 nodes with 0.25 MB on each node.
Kindly share your thoughts. I am sure I am missing something, but any leads are much appreciated.
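For reference, Redis Cluster deterministically maps every key to one of 16384 hash slots via CRC16(key) mod 16384 (hash tags aside), which is why a given key always lives on a single primary node. A self-contained Java sketch of that mapping:

    import java.nio.charset.StandardCharsets;

    public class RedisSlot {
        // CRC16-CCITT (XModem), the variant Redis Cluster uses for slot mapping.
        static int crc16(byte[] bytes) {
            int crc = 0x0000;
            for (byte b : bytes) {
                crc ^= (b & 0xFF) << 8;
                for (int i = 0; i < 8; i++) {
                    crc = ((crc & 0x8000) != 0) ? ((crc << 1) ^ 0x1021) : (crc << 1);
                    crc &= 0xFFFF;
                }
            }
            return crc;
        }

        public static void main(String[] args) {
            String key = "123"; // the trade id from the example
            int slot = crc16(key.getBytes(StandardCharsets.UTF_8)) % 16384;
            // Every client computes the same slot, so they all contend on the
            // same node; a single-node lock there (e.g. SET key value NX PX ttl)
            // already serializes them.
            System.out.println("key " + key + " -> slot " + slot);
        }
    }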

Apache Ignite Replicated Cache race conditions?

I'm quite new to Apache Ignite so please be gentle. My question is simple.
Suppose I have a replicated cache using Apache Ignite, I write key 123 to this cache, and my cluster has 10 nodes.
First question is:
Does a replicated cache mean that key 123 must be written to all 10 nodes before the put call returns? Or does the call return immediately and the replication happen behind the scenes?
Second question is:
Let's say key 123 is written on node 1 and is now being replicated to all other nodes. However, a few microseconds later node 2 tries to write key 123 with a different value. Do I now have a race condition? Or does Ignite somehow handle this situation so that node 2's attempt to write key 123 won't happen until node 1's put has replicated across all nodes?
For some context, what I'm trying to build is a de-duplication system across a cluster of API machines. I was hoping that I would be able to create a hash of my API request (with only values that make the request unique) and write it to the Ignite Cache. The API request would only proceed if the cache does not already contain the unique hash (possibly created by a different API instance). Of course the cache would have an eviction policy to evict these cache keys after a few seconds because they won't be needed anymore.
A REPLICATED cache is the same as a PARTITIONED cache with an infinite number of backups, plus some optimizations. So it has primary partitions that are distributed across the nodes according to the affinity function.
Now when you perform an update, the request comes to the primary node, and the primary node, in its turn, updates all backups. The property CacheConfiguration.setWriteSynchronizationMode() controls the way entries are updated. By default it's PRIMARY_SYNC, which means the thread that calls put() waits only for the primary partition update, and the backups are updated asynchronously. If you set it to FULL_SYNC, the thread is released only when all backups have been updated.
Answering your second question: there will not be a race condition, because all requests for a given key come to its primary node.
Additionally, to your clarification: if a backup node hasn't been updated yet, the get() request will go to the primary node, so in PRIMARY_SYNC mode you'll never get null if the primary partition has a value.
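A minimal sketch of this configuration applied to the de-duplication use case from the question (cache name, TTL, and request hash are assumptions; putIfAbsent is atomic on the primary node, so only one API instance wins):

    import java.util.concurrent.TimeUnit;

    import javax.cache.expiry.CreatedExpiryPolicy;
    import javax.cache.expiry.Duration;

    import org.apache.ignite.Ignite;
    import org.apache.ignite.IgniteCache;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.cache.CacheMode;
    import org.apache.ignite.cache.CacheWriteSynchronizationMode;
    import org.apache.ignite.configuration.CacheConfiguration;

    public class DedupExample {
        public static void main(String[] args) {
            Ignite ignite = Ignition.start();

            CacheConfiguration<String, Boolean> ccfg = new CacheConfiguration<>("requestDedup");
            ccfg.setCacheMode(CacheMode.REPLICATED);
            // FULL_SYNC waits for all backups; the default is PRIMARY_SYNC.
            ccfg.setWriteSynchronizationMode(CacheWriteSynchronizationMode.FULL_SYNC);
            // Entries expire a few seconds after creation (the dedup window).
            ccfg.setExpiryPolicyFactory(
                    CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.SECONDS, 10)));

            IgniteCache<String, Boolean> cache = ignite.getOrCreateCache(ccfg);

            String requestHash = "ab12cd34"; // hash of the request's unique fields
            if (cache.putIfAbsent(requestHash, Boolean.TRUE)) {
                // First time we've seen this request: proceed.
            } else {
                // Duplicate (possibly from another API instance): reject.
            }
        }
    }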

Zookeeper vs In-memory-data-grid vs Redis

I've found different ZooKeeper definitions across multiple resources. Maybe some of them are taken out of context, but please take a look at them:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, so it would be easier for me to understand ZooKeeper by comparing it with them.
Could you please compare ZooKeeper with in-memory data grids and Redis?
If it's distributed-memory computation, how does ZooKeeper differ from in-memory data grids?
If it's synchronization across a cluster, then how does it differ from all the other in-memory stores? The same in-memory data grids also provide cluster-wide locks, and Redis also has some kind of transactions.
If it's only about consistent in-memory data, then there are other alternatives. IMDGs let you achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, ZooKeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is heavily optimized for reads, while writes are comparatively slow (since they generally have to be committed by a majority of the servers before they complete). Finally, the maximum size of a "file" (znode) in ZooKeeper is 1 MB, but typically znodes hold single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and is definitely not a cache. Instead, it's for managing heartbeats/knowing which servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for that task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation, and for most data sets it won't be able to store the data with any reasonable performance.
Data is replicated to all nodes in the cluster (there's nothing like Redis clustering, where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that, in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nice with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs to make this part easy), and has a "push" paradigm instead of "pull", so nodes are notified very quickly of changes.
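To illustrate the push paradigm, here is a minimal sketch of a ZooKeeper watch (the connect string and znode path are placeholders):

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    public class ConfigWatch {
        public static void main(String[] args) throws Exception {
            // Session-level watcher just logs connection events.
            ZooKeeper zk = new ZooKeeper("zk1:2181", 15000,
                    (WatchedEvent e) -> System.out.println("session: " + e));

            Stat stat = new Stat();
            // Read the znode and register a one-shot watch: ZooKeeper pushes a
            // notification the moment the data changes, no polling required.
            byte[] data = zk.getData("/app/config", (WatchedEvent e) -> {
                System.out.println("config changed: " + e);
                // Re-read and re-register the watch here to keep watching.
            }, stat);

            System.out.println("current config: " + new String(data));
        }
    }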
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not to provide the data. For example, let's say you had to process rows 1-100 of a table. You might put up 10 znodes, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (e.g. Redis, a SQL database, etc.). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again, as in the sketch below.
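A minimal sketch of that claim step using the plain ZooKeeper API (paths are hypothetical, and the parent znode is assumed to exist):

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class WorkClaim {
        // Claim a batch by creating an ephemeral znode. If another worker
        // already holds it, the create fails. Because the znode is ephemeral,
        // it vanishes when this client's session dies, so a crashed worker's
        // batch automatically becomes claimable again.
        static boolean claim(ZooKeeper zk, String batch) throws Exception {
            try {
                zk.create("/processing/" + batch, new byte[0],
                        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
                return true;
            } catch (KeeperException.NodeExistsException alreadyClaimed) {
                return false;
            }
        }
    }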
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes: one is the master and the other 2 are slaves. If the master goes down, a slave must become the new leader. This type of thing is perfect for ZK.
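Curator ships a ready-made recipe for exactly this; a minimal sketch (connect string and latch path are placeholders):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.framework.recipes.leader.LeaderLatch;
    import org.apache.curator.retry.ExponentialBackoffRetry;

    public class LeaderElection {
        public static void main(String[] args) throws Exception {
            CuratorFramework client = CuratorFrameworkFactory.newClient(
                    "zk1:2181", new ExponentialBackoffRetry(1000, 3));
            client.start();

            LeaderLatch latch = new LeaderLatch(client, "/app/leader");
            latch.start();

            latch.await(); // blocks until this process wins the election
            // We are now the leader. If this process crashes, its ephemeral
            // znode disappears and one of the remaining nodes takes over.
        }
    }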
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know whether the update was applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.

Cassandra client receives duplicated events

I'm using Cassandra 2.1.2 and DataStax cassandra-driver-core-2.1.2. Here is a strange problem: when a keyspace is created (or a table created or deleted), some of my clients receive the event duplicated, about 200+ times. My cluster and my clients are in different places (not on one LAN).
This causes a lot of problems. Once a client receives such an event, it has to refresh the schema and fetch all schema info from system.keyspaces and so on. In the end, it also calls refreshNodeListAndTokenMap. All of these operations cause data transfer, and 200+ events in one second is horrible. So does anybody know why this happens, and how to prevent it?
Thanks for reading this.
When you mention "some of my clients receive duplicated events", I'm assuming you have multiple Cluster instances, is that correct? If you only have one Cluster object and you are getting multiple events, I wonder if that is a bug. The java-driver subscribes only one connection (named the 'Control Connection') to schema, topology, and node status changes per Cluster instance. That connection is established to one of your contact points initially (and if the connection is lost, it'll choose another node in the cluster).
Without understanding more about your configuration, I would consider the following:
Follow the 4 simple rules, namely creating only 1 Cluster instance per application (JVM), as in the sketch below.
If you want to prevent 1 node from being responsible for sending events to your clients, randomize your contact points so the same one isn't always chosen for the Control Connection. (Note: there is a ticket, JAVA-618, so that the java-driver can do this for you.)
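A minimal sketch of the single-Cluster-per-JVM rule (contact points are placeholders):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CassandraConnector {
        // One Cluster and one Session per JVM, shared by the whole application,
        // so only one Control Connection subscribes to schema/topology events.
        private static final Cluster CLUSTER = Cluster.builder()
                .addContactPoints("10.0.0.1", "10.0.0.2", "10.0.0.3")
                .build();
        private static final Session SESSION = CLUSTER.connect();

        public static Session session() {
            return SESSION;
        }
    }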

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is: when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the application will be greater
Thank you in advance for your time :)
By default, when data is written into Couchbase, the client returns success as soon as the data has been written to one node's memory. After that, Couchbase persists it to disk and performs the replication.
If you want to ensure that the data is persisted to disk, most client libraries have functions that allow you to do that, and with their help you can also ensure that the data has been replicated to another node. This mechanism is called observe.
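For illustration, a sketch with the legacy 1.x Java SDK (bucket, node address, and key are placeholders; check your SDK version for the exact durability API):

    import java.net.URI;
    import java.util.Arrays;

    import com.couchbase.client.CouchbaseClient;
    import net.spy.memcached.PersistTo;
    import net.spy.memcached.ReplicateTo;

    public class DurableWrite {
        public static void main(String[] args) throws Exception {
            CouchbaseClient client = new CouchbaseClient(
                    Arrays.asList(URI.create("http://10.0.0.1:8091/pools")),
                    "default", "");

            // Block until the write is persisted to disk on the master node
            // and replicated to at least one replica (observe under the hood).
            boolean ok = client.set("trade::123", 0, "{\"qty\":10}",
                    PersistTo.MASTER, ReplicateTo.ONE).get();
            System.out.println("durable write succeeded: " + ok);

            client.shutdown();
        }
    }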
When a node goes down, it should be failed over. Couchbase Server can do that automatically when the auto-failover timeout is set in the server settings. E.g. if you have a 3-node cluster, the stored data has 2 replicas, and one node goes down, you will not lose data. If a second node fails, you will still not lose all the data: it will be available on the last node.
If the node that was the master for a key goes down and is failed over, another live node becomes the master. In your client you point to all the servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you have only 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication) between them, and check server availability yourself with HA proxies or something similar. That way you get a single IP to connect to (the proxy's IP), which will automatically serve data from the live server.
Couchbase is in fact a good fit for HA systems.
Let me explain in a few sentences how it works. Suppose you have a 5-node cluster. The application, using the client API/SDK, is always aware of the topology of the cluster (and of any change in the topology).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose which node it should be written to. So the client selects the node using a CRC32 hash and writes to that node. Then, asynchronously, the cluster copies 1 or more replicas to the other nodes (depending on your configuration).
Couchbase has only 1 active copy of a document at a time, so it is easy to stay consistent: applications get and set this active document.
In case of failure, the server has some work to do. Once the failure is discovered (automatically or by a monitoring system), a "fail over" occurs. This means the replicas are promoted to active, and it is now possible to work as before. Usually you then rebalance the nodes to balance the cluster properly.
The sentence you are commenting on simply says that the fewer nodes you have, the bigger the impact of a failure/rebalance, since you will have to route the same number of requests to a smaller number of nodes. Happily, you do not lose data ;)
You can find some very detailed information about this way of working on the blog of Couchbase's CTO:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I work as a developer evangelist at Couchbase.