MarkLogic Replicas - Active-Active or Active-Passive? - replication

In MarkLogic 7 are replicas active-active or active-passive?

Are you asking about local-disk failover (aka forest replication), database replication, or flexible replication?
In fact all three are designed for active-passive use: in other words, single master. But if you are new to MarkLogic you should put some thought into which features you plan to use, and how. Here is a quick summary: see the docs for more details.
Forest replication (local-disk failover) is like RAID-1: useful for high availability. All replicas receive updates from the master, sharing the same MVCC timestamps. Only the active master is used for queries. Replication is synchronous.
Database replication is good for disaster recovery. The replicas receive updates from the master, sharing the same MVCC timestamps. You can query the master or any replicas, but updates can only happen on the master. Replication lag is configurable.
Flexible replication is good for application-specific use-cases. Document-level updates propagate using triggers, which is slower than the timestamp-based approaches. All updates should happen on the master, but queries can run on the master or any replica. Because it is trigger-based, flexrep allows replication of a subset of documents, and allows arbitrary XQuery to run as part of replication tasks. In theory this could even be used to implement multi-master (active-active) replication.
http://docs.marklogic.com/guide/cluster/failover
http://docs.marklogic.com/guide/database-replication
http://docs.marklogic.com/guide/flexrep

Related

Is the CAP theorem applicable to replication or sharding?

I read about the CAP theorem. My understanding is that CAP only makes sense in the context of replication (where writes happen on one node, the master, and are then replicated to the slaves), not for sharding/horizontal scaling (where data is partitioned by some key, so different data lives on different nodes).
Ideally, data will always be consistent and available in shards (mainly NoSQL databases), since a single node contains the required data and there is no need to write the same data to another node. So in NoSQL there is no need to be partition tolerant, as there is no need for communication between nodes unless replication is required. Why, then, does the CAP theorem come into the picture for NoSQL databases, where sharding is used rather than replication?
To me, choosing between C and A makes sense where replication is used, not sharding, and replication mainly happens in SQL databases rather than NoSQL ones. Yet the articles I find on Google primarily discuss NoSQL databases in terms of CAP.
I know I am missing something about the CAP theorem, but I am not sure what it is.
It is true that the CAP theorem does not apply if there is only a single master for each shard (without any type of replication). But most implementations have a master shard plus one, two or more slaves for read-only queries, and this configuration repeats for each shard in the cluster. Then there is replication, hence the CAP theorem applies (and so does the PACELC theorem). The replication can be for scaling the shard, but it is mostly for availability: if the master shard fails, one of the others takes the master role.
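The point about sharding plus replication can be made concrete with a toy model. The sketch below is only an illustration: the `Shard` class, the `partitioned` flag and the `prefer_consistency` switch are hypothetical names, not part of any real database. It shows that once each shard has a replica, a partition between master and replica forces a choice between answering with possibly stale data and refusing to answer.

```python
from dataclasses import dataclass, field


@dataclass
class Shard:
    """One shard: a master copy plus one read-only replica (toy model)."""
    master: dict = field(default_factory=dict)
    replica: dict = field(default_factory=dict)
    partitioned: bool = False  # True when master and replica cannot talk

    def write(self, key, value):
        # Writes always go to the master of the shard.
        self.master[key] = value
        if not self.partitioned:
            # Replication only succeeds while the link is up.
            self.replica[key] = value

    def read_from_replica(self, key, prefer_consistency):
        if self.partitioned and prefer_consistency:
            # "C over A": refuse to answer rather than risk stale data.
            raise RuntimeError("replica unavailable during partition")
        # "A over C" (or no partition): answer with what the replica has.
        return self.replica.get(key)


def shard_for(key, shards):
    # Route by a stable hash of the key (sum of its bytes).
    return shards[sum(key.encode()) % len(shards)]


shards = [Shard(), Shard()]
shard = shard_for("user:1", shards)
shard.write("user:1", "v1")
shard.partitioned = True          # network partition inside the shard
shard.write("user:1", "v2")       # master moves on, replica does not
stale = shard.read_from_replica("user:1", prefer_consistency=False)
```

Here `stale` holds the old value `"v1"`, while the same read with `prefer_consistency=True` raises instead. Sharding decided which node owns the key; the C-versus-A choice only appeared because that shard is replicated.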

If rabbitmq can't be used as a locking service, then what can?

The two main issues are:
Not resilient to network partitions
Not resilient to network failures
This article says why it can be used as a locking service: https://www.rabbitmq.com/blog/2014/02/19/distributed-semaphores-with-rabbitmq/
This article goes into more depth explaining why it can't be used as one due to the issues listed above: https://aphyr.com/posts/315-jepsen-rabbitmq
So to recap, if rabbitmq can't be used as a locking service, then what can?
Try:
ZooKeeper
https://dzone.com/articles/distributed-lock-using
Hashicorp Consul
https://www.consul.io/docs/guides/semaphore.html
Azure Blobs have a lease feature that can be used
https://learn.microsoft.com/en-us/azure/architecture/patterns/leader-election
Etcd
Any relational database. With the correct use of row locks to guarantee linearizable writes to a row you can create a distributed lock.
There are surely many more.

Apache Kafka: Mirroring vs. Replication

Mirroring replicates data between Kafka clusters, while Replication replicates data across the nodes within a single Kafka cluster.
Is there any specific use of Replication, if Mirroring has already been setup?
They are used for different use cases. Let's try to clarify.
As described in the documentation,
The purpose of adding replication in Kafka is for stronger durability and higher availability. We want to guarantee that any successfully published message will not be lost and can be consumed, even when there are server failures. Such failures can be caused by machine error, program error, or more commonly, software upgrades. We have the following high-level goals:
Inside a cluster there might be network partitions (a single server fails, and so forth), therefore we want to provide replication between the nodes. Given a setup of three nodes and one cluster, if server1 fails, there are two replicas Kafka can choose from. Same cluster implies same response times (ok, it also depends on how these servers are configured, sure, but in a normal scenario they should not differ so much).
Mirroring, on the other hand, seems to be very valuable, for example, when you are migrating a data center, or when you have multiple data centers (e.g., AWS in the US and AWS in Ireland). Of course, these are just a couple of use cases. So what you do here is to give applications belonging to the same data center a faster and better way to access data - data locality in some contexts is everything.
If you have one node in each cluster, in case of failure, you might have way higher response times to go, let's say, from AWS located in Ireland to AWS in the US.
You might claim that in order to achieve data locality (services in cluster one read from Kafka in cluster one) one still needs to copy the data from one cluster to the other. That's definitely true, but the advantages you get with mirroring can outweigh those of reading directly (via an SSH tunnel?) from Kafka located in another data center, which brings problems such as single connections going down, longer client connection/session times (depending on the location of the data center), and legislation (some data can be collected in one country while other data shouldn't be).
Replication is the basis of higher availability. You shouldn't use Mirroring to handle high availability in a context where data locality matters. At the same time, you should not use just Replication where you need to duplicate data across data centers (I don't even know if you can without Mirroring/an ssh tunnel).
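The difference between the two can be sketched with a toy model. This is not the Kafka API: the `Cluster` class and the `mirror` function are hypothetical and only mimic the behaviour described above. Replication keeps copies on several brokers inside one cluster, so a broker failure loses nothing. Mirroring copies messages from one cluster into another, so consumers in the second data center can read locally.

```python
class Cluster:
    """Toy cluster: each broker holds a list of messages."""

    def __init__(self, name, brokers, replication_factor):
        self.name = name
        self.brokers = {b: [] for b in range(brokers)}
        self.replication_factor = replication_factor
        self.alive = set(self.brokers)

    def publish(self, message):
        # Replication: store the message on several live brokers at once.
        for broker in sorted(self.alive)[: self.replication_factor]:
            self.brokers[broker].append(message)

    def fail(self, broker):
        self.alive.discard(broker)

    def readable(self):
        # A message is readable if at least one live broker still has it.
        seen = []
        for broker in sorted(self.alive):
            for message in self.brokers[broker]:
                if message not in seen:
                    seen.append(message)
        return seen


def mirror(source, target):
    # Mirroring: copy whatever the source can serve into the other cluster.
    for message in source.readable():
        if message not in target.readable():
            target.publish(message)


us = Cluster("us", brokers=3, replication_factor=3)
eu = Cluster("eu", brokers=3, replication_factor=3)
us.publish("m1")
us.fail(0)        # in-cluster replication keeps "m1" readable in the US
mirror(us, eu)    # mirroring makes "m1" readable locally in the EU
```

With `replication_factor=1` the same broker failure would make `m1` unreadable in the `us` cluster, and there would be nothing left to mirror. In this model the two mechanisms complement each other and neither replaces the other.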

Azure SQL Replication

I have an application that, for performance reasons, will have completely independent standalone instances in several Azure data centers. The stack of Azure IaaS and PaaS components at each data center will be exactly the same. Primarily, there will be a front end application and a database.
So let's say I have the application hosted in 4 data centers. I would like the data coming into each Azure SQL database to replicate asynchronously to all of the other 3 databases, in an eventually consistent manner. Each of these databases needs to be updatable.
Does anyone know if Active Geo-Replication can handle this scenario? I know I can do this using a VM and IaaS, but would prefer to use SQL Azure.
Thanks...
Peer-to-peer transactional replication supports what you're asking for, to some extent. I'm assuming that's what you're referring to when you mention setting it up in IaaS, but it seems like it would be self-defeating if you're looking to it for a boost in write performance (and it goes against their recommendations):
From https://msdn.microsoft.com/en-us/library/ms151196.aspx
Although peer-to-peer replication enables scaling out of read operations, write performance for the topology is like that for a single node. This is because ultimately all inserts, updates, and deletes are propagated to all nodes. Replication recognizes when a change has been applied to a given node and prevents changes from cycling through the nodes more than one time. We strongly recommend that write operations for each row be performed at only one node, for the following reasons:
If a row is modified at more than one node, it can cause a conflict or even a lost update when the row is propagated to other nodes.
There is always some latency involved when changes are replicated. For applications that require the latest change to be seen immediately, dynamically load balancing the application across multiple nodes can be problematic.
This makes me think that you'd be better off using Active Geo Replication - you get the benefit of PaaS and not having to manage your own VMs, not having to manage TR, which gets messy, and if the application is built to deal with "eventual consistency" in the UI, you might be able to get away with slight delays in the secondaries being up to date.
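The lost-update risk mentioned in the quoted documentation can be shown with a small simulation. This is a sketch under a stated assumption: the nodes are reconciled with a simple last-writer-wins rule keyed on a timestamp, which is chosen here for illustration and is not a description of how SQL Server resolves peer-to-peer conflicts.

```python
def replicate_last_writer_wins(node_a, node_b):
    """Merge two nodes so both keep the row version with the higher timestamp."""
    for key in set(node_a) | set(node_b):
        candidates = [node[key] for node in (node_a, node_b) if key in node]
        winner = max(candidates, key=lambda version: version["ts"])
        node_a[key] = dict(winner)
        node_b[key] = dict(winner)


# Both nodes start in sync: 10 units in stock.
node_a = {"row1": {"ts": 1, "stock": 10}}
node_b = {"row1": {"ts": 1, "stock": 10}}

# The same row is modified at both nodes before replication runs:
# each node sells one unit, based on its own local copy.
node_a["row1"] = {"ts": 2, "stock": 9}
node_b["row1"] = {"ts": 3, "stock": 9}

replicate_last_writer_wins(node_a, node_b)
```

After the merge both nodes agree on a stock of 9, although two units were sold, so one of the updates has been silently lost. Keeping all writes for a given row on a single node avoids this situation.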

Database Replication or Mirroring?

What is the difference between Replication and Mirroring in SQL Server 2005?
In short, mirroring allows you to have a second server be a "hot" stand-by copy of the main server, ready to take over any moment the main server fails. So mirroring offers fail-over and reliability.
Replication, on the other hand, allows two or more servers to stay "in sync" - that means the secondary servers can answer queries and (depending on setup) actually change data (it will be merged in the sync). You can also use it for local caching, load balancing, etc.
Mirroring is a feature that creates a copy of your database at bit level. Basically you have the same, identical database in two places. You cannot optionally leave out parts of the database. You can have only one mirror, and the 'mirror' is always offline (it cannot be modified). Mirroring works by shipping the database log to the mirror as it is being created and applying (redoing) the log on the mirror. Mirroring is a technology for high availability and disaster recoverability.
Replication is a feature that allows 'slices' of a database to be replicated between several sites. The 'slice' can be a set of database objects (i.e. tables), but it can also contain parts of a table, like only certain rows (horizontal slicing) or only certain columns (vertical slicing). You can have multiple replicas, and the 'replicas' are available to query and can even be updated. Replication works by tracking/detecting changes (either by triggers or by scanning the log) and shipping the changes, as T-SQL statements, to the subscribers (replicas). Replication is a technology for making data available at off sites and for consolidating data to central sites. Although it is sometimes used for high availability or for disaster recoverability, that is an artificial use for a problem that mirroring and log shipping address better.
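To make the idea of a 'slice' concrete, here is a small sketch that treats a table as a list of dictionaries. The `make_slice` helper and the sample `customers` data are hypothetical and have nothing to do with the actual SQL Server replication API. The sketch only shows what filtering by rows (horizontal) and by columns (vertical) means for the data a subscriber ends up with.

```python
def make_slice(rows, row_filter=None, columns=None):
    """Return the part of a table a subscriber would receive.

    row_filter -- keeps only matching rows (horizontal slicing)
    columns    -- keeps only the named columns (vertical slicing)
    """
    selected = [row for row in rows if row_filter is None or row_filter(row)]
    if columns is None:
        return [dict(row) for row in selected]
    return [{col: row[col] for col in columns} for row in selected]


customers = [
    {"id": 1, "region": "EU", "name": "Ann", "card": "1111"},
    {"id": 2, "region": "US", "name": "Bob", "card": "2222"},
]

# Horizontal slice: only the EU rows go to the EU subscriber.
eu_only = make_slice(customers, row_filter=lambda row: row["region"] == "EU")

# Vertical slice: every row, but without the sensitive column.
no_cards = make_slice(customers, columns=["id", "region", "name"])
```

A mirror, by contrast, would always correspond to the full `customers` list, since no part of the database can be left out.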
There are several types and flavours of replication (merge, transactional, peer-to-peer etc.), and they differ in how they implement change tracking or update propagation. If you want to know more details, you should read the MSDN documentation on the subject.
Database mirroring is used to increase database uptime and reliability.
Replication is used primarily to distribute portions of your primary database -- the publisher -- to one or more subscriber databases. This is often done to make data available (typically for read only) on remote servers so that remote clients can access the data locally (to them) rather than directly from the publisher across a slower WAN connection. Although, as the previous posts indicate, there are more complex scenarios where updates are permitted on the subscribers. It also can have the benefit of reducing the I/O load on the publisher.