Why does the QueryNode method of Aerospike's Java client only return master objects?

I have an Aerospike cluster of 3 nodes with a replication factor of 3. So if I have 10,000 objects, every node holds about 3,333 master objects and about 6,667 replicated objects. I am using the default settings, I believe.
I realized that the QueryNode method of the Java client only returns the master objects on a node, so I always get about 1/3 of my total number of objects, even though the nodes do hold replicated objects as well. Furthermore, the Query method uses QueryNode to fetch results from every node.
I found this not ideal in my case: at one point all 3 nodes were functioning, but the network connection from one node to the client program went down. The Query method could then only fetch master objects from 2 nodes, while on the server side all 3 nodes were still functioning, so the master objects remained distributed evenly. As a result, I could only fetch about 2/3 of my objects using the Query method.
I can add an initial check to verify that all my network connections are OK, but still: since every node holds all the objects (master or replica), why fetch only the master objects?

You should read the Getting Started documentation for the Java client, and its JavaDoc reference.
There is a difference between a scan and a query, unless you're using Query without a predicate (which converts into a Scan). If you intend to use a predicate you need to create a secondary index over the bin you'll be querying.
Both queries and scans only return master objects. The purpose of replica objects is to ensure that data is not lost when a node goes down. You shouldn't expect to get replica objects back from either.
The one case where you can actually read a replica object is when you change your replica read policy to RANDOM. The read may then go to the master or to one of the replicas, rather than to the master only (the default value is MASTER).
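Why a client-side connectivity gap loses visibility of records is easier to see with a toy partition map. This is a simplified Python sketch, not the real Aerospike partition table (which has 4096 partitions): each partition has exactly one master node, and a master-only query simply skips the partitions whose master is unreachable, even though replicas exist on the reachable nodes.

```python
NODES = ["A", "B", "C"]
N_PARTITIONS = 12  # toy value; real Aerospike uses 4096

# Each partition has exactly one master node; with replication factor 3,
# every node also holds a replica of every other partition.
def master_of(partition):
    return NODES[partition % len(NODES)]

def query(reachable_nodes):
    """Collect records partition by partition, asking only the master."""
    seen = []
    for p in range(N_PARTITIONS):
        if master_of(p) in reachable_nodes:
            seen.append(p)
    return seen

# All nodes reachable: every partition's master answers -> full result.
print(len(query({"A", "B", "C"})))   # 12
# Client loses connectivity to node C: C's 4 master partitions are
# skipped even though A and B hold replicas of them.
print(len(query({"A", "B"})))        # 8
```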

Apache Ignite Spring Data save method transaction behaviour with map parameter

As per the Apache Ignite Spring Data documentation, there are two methods to save data in the Ignite cache:
1. org.apache.ignite.springdata.repository.IgniteRepository.save(key, value)
and
2. org.apache.ignite.springdata.repository.IgniteRepository.save(Map<ID, S> entities)
So, I just want to understand the transaction behavior of the 2nd method. Suppose we save 100 records using the save(Map<Id,S>) method, and for some reason some nodes go down after 70 records. In this case, will all 70 records be rolled back?
Note: With the 1st method, if we use @Transactional at the method level, it will roll back the particular entity.
First of all, you should read about the transaction mechanism used in Apache Ignite. It is described very well in the articles here:
https://apacheignite.readme.io/v1.0/docs/transactions#section-two-phase-commit-2pc
The most interesting part for you is "Backup Node Failures" and "Primary Node Failures":
Backup Node Failures
If a backup node fails during either "Prepare" phase or "Commit" phase, then no special handling is needed. The data will still be committed on the nodes that are alive. GridGain will then, in the background, designate a new backup node and the data will be copied there outside of the transaction scope.
Primary Node Failures
If a primary node fails before or during the "Prepare" phase, then the coordinator will designate one of the backup nodes to become primary and retry the "Prepare" phase. If the failure happens before or during the "Commit" phase, then the backup nodes will detect the crash and send a message to the Coordinator node to find out whether to commit or rollback. The transaction still completes and the data within distributed cache remains consistent.
In your case, all updates for all values in the map will either be committed in one transaction or rolled back. I hope these articles answer your question.
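The all-or-nothing behaviour can be illustrated with a small simulation. This is not Ignite code; it just shows the prepare-then-commit semantics the 2PC articles describe, assuming a TRANSACTIONAL cache: nothing staged during a failed prepare ever becomes visible.

```python
class TransactionalStore:
    """Toy all-or-nothing map save: stage everything first ("prepare"),
    then apply atomically ("commit"). Illustrative only, not Ignite."""
    def __init__(self):
        self.data = {}

    def save_all(self, entities, fail_after=None):
        staged = {}
        for i, (k, v) in enumerate(entities.items()):
            if fail_after is not None and i >= fail_after:
                # Prepare failed -> nothing staged ever becomes visible.
                raise RuntimeError("node failure during prepare")
            staged[k] = v
        self.data.update(staged)  # commit point

store = TransactionalStore()
try:
    store.save_all({f"k{i}": i for i in range(100)}, fail_after=70)
except RuntimeError:
    pass
print(len(store.data))  # 0 -- the first 70 records were not partially written
store.save_all({f"k{i}": i for i in range(100)})
print(len(store.data))  # 100
```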

Forcing Riak to store data on distinct physical servers

I'm concerned by this note in Riak's documentation:
N=3 simply means that three copies of each piece of data will be stored in the cluster. That is, three different partitions/vnodes will receive copies of the data. There are no guarantees that the three replicas will go to three separate physical nodes; however, the built-in functions for determining where replicas go attempts to distribute the data evenly.
https://docs.basho.com/riak/kv/2.1.3/learn/concepts/replication/#so-what-does-n-3-really-mean
I have a cluster of 6 physical servers with N=3. I want to be 100% sure that total loss of some number of nodes (1 or 2) will not lose any data. As I understand the caveat above, Riak cannot guarantee that. It appears that there is some (admittedly low) portion of my data that could have all 3 copies stored on the same physical server.
In practice, this means that for a sufficiently large data set I'm guaranteed to completely lose records if I have a catastrophic failure on a single node (gremlins eat/degauss the drive or something).
Is there a Riak configuration that avoids this concern?
Unfortunate confounding reality: I'm on an old version of Riak (1.4.12).
There is no configuration that avoids the minuscule possibility that a partition might have 2 or more copies on one physical node (although having 5+ nodes in your cluster makes it extremely unlikely that a single node will have more than 2 copies of a partition). With your 6-node cluster it is extremely unlikely that you would have 3 copies of a partition on one physical node.
The riak-admin command line tool can help you explore your partitions/vnodes. For example, running riak-admin vnode-status (http://docs.basho.com/riak/kv/2.1.4/using/admin/riak-admin/#vnode-status) on each node will output the status of all vnodes running on that node. If you run it on every node in your cluster, you can confirm whether or not your data is distributed in a satisfactory way.
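To see why co-location can happen at all, here is a toy model of ring-based replica placement. This is an assumption-laden sketch, not Riak's actual claim algorithm: the N copies go to the N consecutive vnodes clockwise from the key's partition, so replicas land on the same physical machine exactly when adjacent partitions are owned by the same node.

```python
def preference_list(ring, start, n=3):
    """Ring-style placement: the n replicas go to the n consecutive
    vnodes clockwise from the key's home partition. (Toy model.)"""
    return [ring[(start + i) % len(ring)] for i in range(n)]

# 6 physical nodes, 12 partitions claimed round-robin: the replicas of
# every partition land on 3 distinct physical nodes.
even_ring = [f"node{i % 6}" for i in range(12)]
print(all(len(set(preference_list(even_ring, p))) == 3
          for p in range(12)))  # True

# After joins/leaves the claim can become uneven; adjacent partitions
# owned by the same node put 2 copies on one physical machine.
uneven_ring = ["node0", "node0", "node1", "node2", "node3", "node4",
               "node5", "node1", "node2", "node3", "node4", "node5"]
print(len(set(preference_list(uneven_ring, 0))))  # 2 -- co-located copies
```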

Couchbase node failure

My understanding could be amiss here. As I understand it, Couchbase uses a smart client to automatically select which node to write to or read from in a cluster. What I DON'T understand is, when this data is written/read, is it also immediately written to all other nodes? If so, in the event of a node failure, how does Couchbase know to use a different node from the one that was 'marked as the master' for the current operation/key? Do you lose data in the event that one of your nodes fails?
This sentence from the Couchbase Server Manual gives me the impression that you do lose data (which would make Couchbase unsuitable for high availability requirements):
With fewer larger nodes, in case of a node failure the impact to the
application will be greater
Thank you in advance for your time :)
By default, when data is written into Couchbase, the client returns success as soon as the data is written to one node's memory. After that, Couchbase saves it to disk and replicates it.
If you want to ensure that data is persisted to disk, most client libraries have functions that allow you to do that. With the help of those functions you can also ensure that data is replicated to another node. This function is called observe.
When one node goes down, it should be failed over. Couchbase Server can do that automatically when the auto-failover timeout is set in the server settings. I.e. if you have a 3-node cluster, stored data has 2 replicas, and one node goes down, you will not lose data. If a second node fails you still won't lose all data - it will be available on the last node.
If a node that was the master for some data goes down and is failed over, another live node becomes the master. In your client you point to all servers in the cluster, so if it is unable to retrieve data from one node, it tries to get it from another.
Also, if you have 2 nodes at your disposal, you can install 2 separate Couchbase servers, configure XDCR (cross datacenter replication), and manually check server availability with HAProxy or something similar. That way you get only one IP to connect to (the proxy's IP), which will automatically get data from the live server.
Couchbase is actually a good fit for HA systems.
Let me explain in a few sentences how it works. Suppose you have a 5-node cluster. The application, using the client API/SDK, is always aware of the topology of the cluster (and of any change in the topology).
When you set/get a document in the cluster, the client API uses the same algorithm as the server to choose the node it should be written to. So the client selects the node using a CRC32 hash and writes to this node. Then, asynchronously, the cluster copies 1 or more replicas to the other nodes (depending on your configuration).
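A rough Python sketch of that client-side mapping. This is simplified: the real SDK applies a specific CRC32 variant and fetches the vBucket-to-node map from the cluster, but the key point is that every client computes the same location from the key alone.

```python
import zlib

N_VBUCKETS = 1024  # Couchbase's fixed vBucket count

def vbucket_for(key: bytes) -> int:
    # Simplified stand-in for the client-side mapping: hash the key with
    # CRC32 and keep only enough bits to pick one of the 1024 vBuckets.
    return zlib.crc32(key) % N_VBUCKETS

# The cluster publishes a vBucket -> node map; every client uses it the
# same way, so all clients agree on where a document's active copy lives.
vbucket_map = {vb: f"node{vb % 5}" for vb in range(N_VBUCKETS)}  # toy map

key = b"user::42"
vb = vbucket_for(key)
print(vb, "->", vbucket_map[vb])
```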
Couchbase has only 1 active copy of a document at a time, so it is easy to be consistent: applications get and set against this active document.
In case of failure, the server has some work to do. Once the failure is discovered (automatically or by a monitoring system), a "fail over" occurs. This means that the replicas are promoted to active, and it is now possible to work as before. Usually you then rebalance to distribute the data properly across the cluster.
The sentence you are quoting simply says that the fewer nodes you have, the bigger the impact in case of failure/rebalance, since the same number of requests will have to be routed to a smaller number of nodes. You do not lose data, though ;)
You can find some very detailed information about this way of working on Couchbase CTO blog:
http://damienkatz.net/2013/05/dynamo_sure_works_hard.html
Note: I am working as developer evangelist at Couchbase

replicas in replication

Data in a system is a collection of items, i.e. objects. These logical objects are implemented by a collection of physical copies called replicas. The replicas are physical objects, each stored at a single computer, with data and behaviour that are tied to some degree of consistency by the system's operation.
My questions are:
1. Shouldn't the object be physical and the replicas logical?
2. Is a replica an exact copy, or just part of the original, i.e. enough information to reconstruct it?
3. Where are replicas stored, and how many are there for a single object?
4. When clients connect to an object, are they accessing a replica or the original object?
I tried to find answers to my questions online, but couldn't so had to post on stackoverflow.
The answer mostly depends on what "system" you use; there is no general replication mechanism. However, the answers to questions #1 and #2 are always the same: 1. A replica is a physical object. 2. A replica is an exact copy.
Almost every distributed system uses something home-grown. Here's some examples:
MySQL replication: a client/server setup. Transactions executed on the master are transferred to the slaves. A replica and the original are not the same: a replica is a delayed version of the original. Answers to your questions: 3. The number of replicas is the number of slave nodes configured. 4. It's up to the client which node to use, the master or one of the slaves.
Couchbase cluster: all nodes are equal; there is no master node. Objects and replicas are distributed among the nodes by a hash function. If a node fails, the rest of the nodes redistribute the objects and replicas of the failed node. Answers to your questions: 3. You can configure the number of replicas you want to have. 4. There are 2 options:
the client can connect to any node, and the node will proxy the request if the object is located somewhere else
the client is aware of the object distribution mechanism and knows the structure of the cluster, so it can connect directly to the node that stores the required object.

What happens when redis gets overloaded?

If redis gets overloaded, can I configure it to drop set requests? I have an application where data is updated in real time (10-15 times a second per item) for a large number of items. The values are outdated quickly and I don't need any kind of consistency.
I would also like to compute parallel sum of the values that are written in real time. What's the best option here? LUA executed in redis? Small app located on the same box as redis using UNIX sockets?
When Redis gets overloaded it will just slow down its clients. For most commands, the protocol itself is synchronous.
Redis does support pipelining, but there is no way a client can cancel traffic that is still in the pipeline and not yet acknowledged by the server. Redis itself does not really queue the incoming traffic; the TCP stack does.
So it is not possible to configure the Redis server to drop set requests. However, it is possible to implement a last-value queue on the client side:
the queue is actually represented by 2 maps indexed by your items (only one value stored per item). The primary map is used by the application; the secondary map is used by a specific thread. The contents of the 2 maps can be swapped atomically.
a specific thread blocks while the primary map is empty. When it is not, it swaps the contents of the two maps and sends the content of the secondary map asynchronously to Redis, using aggressive pipelining and variadic commands. It also receives the acks from Redis.
while the thread is working with the secondary map, the application can still fill the primary map. If Redis is too slow, the application simply accumulates last values in the primary map.
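A minimal Python sketch of this double-map strategy. The original suggestion is C with hiredis; here the Redis pipeline is replaced by a plain `send` callback to keep the example self-contained, so only the swap-and-flush logic is real.

```python
import threading

class LastValueQueue:
    """Toy client-side last-value queue: the app writes into the primary
    map; a flush step swaps maps and pushes only the latest value per
    item downstream (in real life: to Redis, in one pipelined batch)."""
    def __init__(self, send):
        self.send = send          # callable taking a dict of item -> value
        self.primary = {}
        self.lock = threading.Lock()

    def put(self, item, value):
        with self.lock:
            self.primary[item] = value   # any older value is overwritten

    def flush(self):
        with self.lock:                  # atomic swap of the two maps
            secondary, self.primary = self.primary, {}
        if secondary:
            self.send(secondary)         # one batch, last values only

sent = []
q = LastValueQueue(sent.append)
for v in range(100):
    q.put("item1", v)                    # 100 updates while Redis is "slow"
q.put("item2", 7)
q.flush()
print(sent)  # [{'item1': 99, 'item2': 7}] -- only the last values survive
```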
This strategy could be implemented in C with hiredis and the event loop of your choice.
However, it is not trivial to implement, so I would first check whether the performance of Redis against the full traffic is enough for your purpose. It is not uncommon to benchmark Redis at more than 500K ops/s these days (using a single core). Nothing prevents you from sharding your data across multiple Redis instances if needed.
You will likely saturate the network links before the CPU of the Redis server. That's why it is better to implement the last value queue (if needed) on client side rather than server side.
Regarding the sum computation, I would try to calculate and maintain it in real time. For instance, the GETSET command can be used to set a new value while returning the previous one.
Instead of just setting your values, you could do:
[old value] = GETSET item <new value>
INCRBY mysum [new value] - [old value]
The mysum key will contain the sum of your values for all the items at any time. With Redis 2.6, you can use Lua to encapsulate this calculation and save round trips.
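The bookkeeping can be checked with a tiny simulation that mimics the GETSET/INCRBY semantics with a plain dict. No Redis server is involved; with redis-py, the same two steps would be `getset` and `incrby` calls against the server.

```python
# Maintain a running sum alongside the per-item values, exactly as the
# GETSET + INCRBY pair does: adjust the sum by (new - old) on every set.
store = {"mysum": 0}

def set_value(item, new_value):
    old_value = store.get(item, 0)            # [old value] = GETSET item <new>
    store[item] = new_value
    store["mysum"] += new_value - old_value   # INCRBY mysum (new - old)

set_value("item:a", 10)
set_value("item:b", 5)
set_value("item:a", 3)    # replaces 10, so the sum adjusts by -7
print(store["mysum"])     # 8 -- always the sum of the current values
```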
Running a big batch to calculate statistics on existing data (this is how I understand your "parallel" sum) is not really suitable for Redis. It is not designed for map/reduce-like computation.