CAP theorum is applicale for Replication or Sharding?

CAP theorum is applicale for Replication or Sharding? - sql

I went through CAP . After going though it, my understanding is CAP makes sense only in context
of replication (where write happens on one node i.e. Master and then replicated across slaves) not for sharding/horizontal scaling(where data is
partitioned based on some key. So different data lies on different nodes).
Ideally data will be always be consistent/available in shards(mainly NoSql DB)
as single node contains required data and there is no need to write the same data to other node. So in NoSql there is no need to be
parttion tolerant as there is node of communication between node until and unless replication is required. So why CAP theorum comes into picture for NoSql DB where sharding
is used not replication.
To me choosing b/w C and A should makes sense where we are using replication not shard which mainly happens in SQL DB not in NOSql DB but reading the articles on google primarily talks about NoSql DB's in terms of CAP
I know I am missing something as CAP theorum but not sure what it is ?

It is true that CAP theorem does not apply if there is only a single master for each shard (without any type of replication). But most implementations have a master shard plus one, two or more slaves for read only queries. And this configuration repeats for each shard in the cluster. Then, there is replication hence CAP theorem applies (and PACELC theorem applies). The replication can be for scaling the shard but mostly for availaibility. If the master shard fails, one of the other takes the master role.

Related

Apache Kafka: Mirroring vs. Replication

Mirroring is replicating data between Kafka cluster, while Replication is for replicating nodes within a Kafka cluster.
Is there any specific use of Replication, if Mirroring has already been setup?

They are used for different use cases. Let's try to clarify.
As described in the documentation,
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades. We have the following high-level goals:
Inside a cluster there might be network partitions (a single server fails, and so forth), therefore we want to provide replication between the nodes. Given a setup of three nodes and one cluster, if server1 fails, there are two replicas Kafka can choose from. Same cluster implies same response times (ok, it also depends on how these servers are configured, sure, but in a normal scenario they should not differ so much).
Mirroring, on the other hand, seems to be very valuable, for example, when you are migrating a data center, or when you have multiple data centers (e.g., AWS in the US and AWS in Ireland). Of course, these are just a couple of use cases. So what you do here is to give applications belonging to the same data center a faster and better way to access data - data locality in some contexts is everything.
If you have one node in each cluster, in case of failure, you might have way higher response times to go, let's say, from AWS located in Ireland to AWS in the US.
You might claim that in order to achieve data locality (services in cluster one read from kafka in cluster one) one still needs to copy the data from one cluster to the other. That's definitely true, but the advantages you might get with mirroring could be higher than those you would get by reading directly (via an SSH tunnel?) from Kafka located in another data center, for example single connections down, clients connection/session times longer (depending on the location of the data center), legislation (some data can be collected in a country while some other data shouldn't).
Replication is the basis of higher availability. You shouldn't use Mirroring to handle high availability in a context where data locality matters. At the same time, you should not use just Replication where you need to duplicate data across data centers (I don't even know if you can without Mirroring/an ssh tunnel).

synch data in Redis multi masters configuration

I'm a newbie to Redis and I was wondering if someone could help me to understand if it can be the right tool.
This is my scenario:
I have many different nodes, everyone behaving like a master and accepting clients connections to read and write a few geographical data data and the timestamp of the incoming record.
Each master node could be hosted onto a drone that only randomly get in touch and can comunicate with others, accordind to network conditions; when this happens they should synchronize their data according to their age (only the ones more recent than a specified time).
Is there any way to achieve this by Redis or do I have to implement this feature at application level?
I tried master/slaves configuration without success and I was wondering if Redis Cluster can somewhat meet my neeeds.
I googled around, but what I found had not an answer good for me
https://serverfault.com/questions/717406/redis-multi-master-replication
Using Redis Replication on different machines (multi master)

Teo, as a matter of fact, redis don't have a multi master replication.
And the cluster shard it's data through different instances. Say you have only two redis instances. Instance1 will accept store and retrieve instance1 and instance2 data. But he will ask for, and store in, instance2 every key that does not belong to his shard.
This is not, I think, really what you want. You could give a try to PostgreSQL+BDR as PostgreSQL supports nosql store and BDR provides a real master master replication (https://wiki.postgresql.org/wiki/BDR_Project) if that's really what you need.
I work with both today (and also MongoDB). Each one with a different goal. Redis would provide a smaller overhead and memory use, fast connection and fast replication. But it won't provide multi master (if you really need it).

Redis replication and not RO slaves

Good day!
Suppose we have a redis-master and several slaves. The master goal is to store all data while slaves are used for quering data for users. Hovewer quering is a bit complex and some temporary data needs to be stored. And also I want to cache the query result for a couple of minutes.
How should I configure replication to save temporary data and caches?

Redis slaves have optional support to accept writes, however you have to understand a few limitations of writable slaves before to use them, since they have non trivial issues.
Keys created on the slaves will not support expires. Actually in recent versions of Redis they appear to work but are actually leaked instead of expired, until the next time you resynchronize the slave with the master from scratch or issue FLUSHALL or alike. There are deep reasons for this issue... it is currently not clear if we'll deprecate writable slaves at all, find a solution, or deny expires for writable slaves.
You may want, anyway, to use a different Redis numerical DB (SELECT command) in order to store your intermediate data (you may use MULTI/.../MOVE/EXEC transaction in order to generate your intermediate results in the currently selected DB where data belongs, and MOVE the keys off to some other DB, so it will be clear if keys are accumulating and you can FLUSHDB from time to time).
The keys you create on your slave are volatile, they may go away in any moment when the master will resynchronize with the slave. Does not look like an issue for you since if they key is no longer there, you could recompute, but care should be take,
If you elect this slave into a master you have additional keys inside.
So there are definitely things to take in mind in this setup, however it is doable in some way. However you may want to consider alternative strategies.
Lua scripts on the slave side in order to filter your data inside Lua. Not as fast as Redis C commands often.
Precomputation of data directly in the actual data set in order to make your queries possible just using read only commands.
MIGRATE in order to migrate interesting keys from a slave to an instance (another master) designed specifically to perform post-computations.
Hard to tell what's the best strategy without in-depth analysis of the actual use case / problem, but I hope this general guidelines help.

Azure SQL Replication

I have an application that, for performance reasons, will have completely independent standalone instances in several Azure data centers. The stack of Azure IaaS and PaaS components at each data center will be exactly the same. Primarily, there will be a front end application and a database.
So let's say I have the application hosted in 4 data centers. I would like to have the data coming into each Azure SQL database replicate it's data asynchronously to all of the other 3 databases, in an eventually consistent manner. Each of these databases needs to be updatable.
Does anyone know if Active Geo-Replication can handle this scenario? I know I can do this using a VM and IaaS, but would prefer to use SQL Azure.
Thanks...

Peer-to-peer tranasaction replication supports what you're asking for, to some extent - I'm assuming that's what you're referring to when you mention setting it up in IaaS, but it seems like it would be self defeating if you're looking to it for a boost in write performance (and against their recommendations):
From https://msdn.microsoft.com/en-us/library/ms151196.aspx
Although peer-to-peer replication enables scaling out of read operations, write performance for the topology is like that for a single node. This is because ultimately all inserts, updates, and deletes are propagated to all nodes. Replication recognizes when a change has been applied to a given node and prevents changes from cycling through the nodes more than one time. We strongly recommend that write operations for each row be performed at only node, for the following reasons:
If a row is modified at more than one node, it can cause a conflict or even a lost update when the row is propagated to other nodes.
There is always some latency involved when changes are replicated. For applications that require the latest change to be seen immediately, dynamically load balancing the application across multiple nodes can be problematic.
This makes me think that you'd be better off using Active Geo Replication - you get the benefit of PaaS and not having to manage your own VMs, not having to manage TR, which gets messy, and if the application is built to deal with "eventual consistency" in the UI, you might be able to get away with slight delays in the secondaries being up to date.

IBM Worklight 6.2. Analytics topology. Master and data Nodes

I'm reading about production topology for the Analytics part of Worklight 6.2.
https://www-01.ibm.com/support/knowledgecenter/api/content/SSZH4A_6.2.0/com.ibm.worklight.monitor.doc/monitor/t_setting_up_production_cluster.html
It explains that nodes can act both as Master Node or as Data Node or only as one of them.
My question is why we should configure dedicated nodes, Master OR Data instead of configuring all the nodes for both Master AND Data.
I assume the the node (only one) acting as master will provide worst performance in its Data role but on the other hand the configuration will be simpler and the high availability will be higher.
Thank you.

Your assumption is correct.
A master node is responsible for handling communication between the data nodes. The data nodes will be responsible for indexing data. Having dedicated master and data nodes will allow them to focus their processing time and memory on their specific tasks. However, as you mentioned, in some cases its not worth doing this to complicate the configuration.
Another reason is that its not necessary to put a master node on a high performing machine. You can reserve the better machines for the data nodes.
The analytics console uses Elasticsearch under the covers. It would be worth looking up the benefits and drawbacks of choosing master and data nodes in Elasticsearch since it is an open source library and there are several resources available for it.
Edit:
As you can imagine, there is no one size fits all configuration. The configuration depends on several factors such as:
How long you wish to keep data stored
How many machines you have to dedicate to analytics
How verbose your client logs have been set
Your preferences between availability and performance
In my personal tests, I typically keep each node as a data and master node. Its possible that in the future we will document how the different configurations affect performance.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas