LMAX Replicator Design - How to support high availability? - replication

LMAX Disruptor is generally implemented using the following approach:
As in this example, the Replicator is responsible for replicating the input events/commands to the slave nodes. Replicating across a set of nodes requires a consensus algorithm if we want the system to remain available in the presence of network failures, master failure, and slave failures.
I was thinking of applying the RAFT consensus algorithm to this problem. One observation is that "RAFT requires that the input events/commands are stored to disk (durable storage) during replication" (Reference this link).
This observation essentially means that we cannot perform purely in-memory replication. Hence it appears that we might have to combine the functionality of the replicator and the journaller in order to successfully apply the RAFT algorithm to LMAX.
There are two options to do this:
Option 1: Using the replicated log as input event queue
The receiver would read from the network and push the event onto the replicated log instead of the ring buffer
A separate "reader" can read from the log and publish the events onto the ring buffer.
The log can be replicated across nodes using RAFT. We do not need the replicator and journaller as the functionality is already accomplished by RAFT's replicated log
I think a disadvantage of this option is the additional data copy step (receiver to event queue instead of directly to the ring buffer).
Option 2: Use the Replicator to push input events/commands to the slave's input log file
I was wondering if there is any other solution for the design of the Replicator? What different design options have people employed for replicators? In particular, is there any design that can support in-memory replication?

Your intuition is correct about folding the replication and journalling into the Raft component. But, the Raft protocol dictates exactly when things need to be stored on disk.
Here are two different ways to look at it.
I'm assuming there is no hefty computation, such as transaction processing, before the replication, because you don't show any in your diagrams.
Personally, I would do the first because it separates concerns into different processes. If I were implementing Raft myself, I would take the first half of the second scenario and put it in its own process.
External Raft Replication
In which Raft is implemented by an external process.
The replication component outsources to an external Raft process the business of replication. After some time, Raft responds to the replication component that it is, in fact, replicated. The replication component updates the items in the ring buffer, and moves its published cursor forward. The business logic sees the published cursor (via waitFor) and consumes the freshly replicated data.
In this scenario, the replication component probably has a lot of in-flight events, so its read cursor is far ahead of the cursor it publishes to the business logic.
There is no need for a journalling component in this scenario because the external Raft system does the journalling for you.
Note, the replication may be the slowest component of the system!
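Roughly, the gating could look like the sketch below, assuming Disruptor 3.x. The OrderEvent type, the business-logic handler body, and the raftClient callback are placeholders, not part of any real library; the point is only that the business logic waits on a cursor the replicator advances after the Raft acknowledgement.
```java
import com.lmax.disruptor.BatchEventProcessor;
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.Sequence;
import com.lmax.disruptor.SequenceBarrier;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ExternalRaftReplicationSketch {

    // Minimal event carried on the ring buffer (placeholder).
    static class OrderEvent {
        byte[] payload;
    }

    public static void main(String[] args) {
        RingBuffer<OrderEvent> ringBuffer =
                RingBuffer.createMultiProducer(OrderEvent::new, 1 << 14);

        // The cursor the replication component publishes. It only advances
        // once the external Raft process has acknowledged the event.
        Sequence replicatedCursor = new Sequence();

        // The business logic waits on a barrier gated by replicatedCursor,
        // so it never consumes an event before it has been replicated.
        SequenceBarrier barrier = ringBuffer.newBarrier(replicatedCursor);
        BatchEventProcessor<OrderEvent> businessLogic =
                new BatchEventProcessor<OrderEvent>(ringBuffer, barrier,
                        (event, sequence, endOfBatch) -> {
                            // real business logic goes here
                        });
        ringBuffer.addGatingSequences(businessLogic.getSequence());

        ExecutorService executor = Executors.newCachedThreadPool();
        executor.submit(businessLogic);

        // Replicator side (runs on its own thread): read ahead of the
        // published cursor, ship each event to the external Raft process,
        // and advance the cursor when the acknowledgement arrives, e.g.
        //   raftClient.onCommitted(seq -> replicatedCursor.set(seq));
        // (raftClient is a hypothetical client for the external Raft process.)
    }
}
```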
Integrated Raft Replication
In which Raft is implemented in the same process as the "Real Business Logic."
In terms of Raft, replication is the business logic. Actually, you have multiple levels of business logic, or equivalently, multiple stages of business logic.
I'm going to use two input disruptors and two output disruptors for this to emphasize the separate business logic. You can combine, split, or rearrange to your heart's content. Or your profiler's content.
The first stage, as I mentioned, is Raft replication. Client events go into the Replication Input Disruptor. The Raft logic picks them up, perhaps in batches, and sends them out to the Followers on the Replication Output Disruptor. All Raft messages also go into the Replication Input Disruptor. The Raft logic picks these up as well and sends the appropriate responses to the appropriate Followers/Master on the Replication Output Disruptor.
A journaller component hangs off the Input Ring Buffer; it only has to handle certain types of messages as dictated by Raft. This will likely be the slowest part of the system.
When the data is considered replicated, it is moved to the second stage, via the "Real Business Logic" Input Disruptor. There it is processed, sent to the Client Outbound Disruptor, and then sent to one of your millions of happy paying customers.
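For illustration, here is one possible wiring of the first stage with the Disruptor DSL (Disruptor 3.x assumed; RaftMessage and the handler bodies are hypothetical). Running the journaller before the Raft logic is one way to respect Raft's rule that entries must be durable before they are acknowledged.
```java
import com.lmax.disruptor.EventHandler;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class IntegratedRaftWiringSketch {

    // Placeholder message type for the replication stage.
    static class RaftMessage {
        byte[] payload;
    }

    public static void main(String[] args) {
        Disruptor<RaftMessage> replicationIn = new Disruptor<>(
                RaftMessage::new, 1 << 14, DaemonThreadFactory.INSTANCE);

        // The journaller persists entries to disk; the Raft logic runs after
        // it, so it only acknowledges entries that are already durable.
        EventHandler<RaftMessage> journaller = (msg, seq, endOfBatch) -> {
            // append and fsync the entry to the local log
        };
        EventHandler<RaftMessage> raftLogic = (msg, seq, endOfBatch) -> {
            // run the Raft state machine, publish replies on the Replication
            // Output Disruptor, and hand committed entries to the
            // "Real Business Logic" Input Disruptor
        };

        replicationIn.handleEventsWith(journaller).then(raftLogic);
        replicationIn.start();
    }
}
```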

Related

ActiveMQ datastore for cluster setup

We have been using an ActiveMQ version 5.16.0 broker as a single instance in production. Now we are planning to use a cluster of AMQ brokers for HA and load distribution with consistency in message data. Currently we are using only one queue.
HA can be achieved using failover, but do we need to use the same datastore, or can it be separated? If I use different instances for the AMQ brokers, how do I set up a common datastore?
Please guide me on how to set up the datastore for HA and load distribution.
Multiple ActiveMQ servers clustered together can provide HA in a couple of ways:
1. Scale message flow by using compute resources across multiple broker nodes
2. Maintain message flow during a planned or unplanned outage of a single broker node
3. Share the data store in the event of ActiveMQ process failure
A network of brokers solves #1 and #2. A standard 3-node cluster will give you excellent performance and the ability to scale the number of producers and consumers, along with splitting the full flow across 3 nodes to provide increased capacity.
Solving for #3 is complicated -- in all messaging products. Brokers are always working to be completely empty -- so clustering the data store of a single broker becomes an anti-pattern of sorts. Many times, relying on RAID disk with a single broker node will provide higher reliability than adding NFSv4, GFSv2, or JDBC and using a shared store.
That being said, if you must use a shared store-- follow best practices and use GFSv2 or NFSv4. JDBC is much slower and requires significant DB maintenance to keep running efficiently.
Note: @Kevin Boone's note about CIFS/SMB is incorrect and CIFS/SMB should not be used. Otherwise, his responses are solid.
You can configure ActiveMQ so that instances share a message store, or so they have separate message stores. If they share a message store, then (essentially) the brokers will automatically form a master-slave configuration, such that only one broker (at a time) will accept connections from clients, and only one broker will update the store. Clients need to identify both brokers in their connection URIs, and will connect to whichever broker happens to be master.
With a shared message store like this, locks in the message store coordinate the master-slave assignment, which makes the choice of message store critical. Stores can be shared filesystems, or shared databases. Only a few shared filesystem implementations work properly -- anything based on NFS 4.x should work. CIFS/SMB stores can work, but there's so much variation between providers that it's hard to be sure. NFS v3 doesn't work, however well-implemented, because the locking semantics are inappropriate. In any case, the store needs to be robust, or replicated, or both, because the whole broker cluster depends on it. No store, no brokers.
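To illustrate the client side of this master-slave setup, a sketch using the failover transport is below; brokerA/brokerB are hypothetical hostnames, and the client simply follows whichever broker currently holds the store lock.
```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import org.apache.activemq.ActiveMQConnectionFactory;

public class SharedStoreClientSketch {
    public static void main(String[] args) throws Exception {
        // With a shared store, only the broker holding the store lock accepts
        // connections; the failover transport tries both brokers and
        // reconnects to whichever one is currently master.
        ConnectionFactory factory = new ActiveMQConnectionFactory(
                "failover:(tcp://brokerA:61616,tcp://brokerB:61616)?randomize=false");
        Connection connection = factory.createConnection();
        connection.start();
        // ... create sessions, producers, and consumers as usual ...
        connection.close();
    }
}
```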
In my experience, it's easier to get good throughput from a shared file store than a shared database although, of course, there are many factors to consider. Poor network connectivity will make it hard to get good throughput with any kind of shared store (or any kind of cluster, for that matter).
When using individual message stores, it's typical to put the brokers into some kind of mesh, with 'network connectors' to pass messages from one broker to another. Both brokers will accept connections from clients (there is no master), and the network connections will deal with the situation where messages are sent to one broker, but need to be consumed from another.
Clients don't necessarily need to specify all brokers in their connection URIs, but generally will, in case one of the brokers is down.
A mesh is generally easier to set up, and (broadly speaking) can handle more client load, than a master-slave with shared filestore. However, (a) losing a broker amounts to losing any messages that were associated with it (until the broker can be restored) and (b) the mesh interferes with messaging patterns like message grouping and exclusive consumers.
There's really no hard-and-fast rule to determine which configuration to use. Many installers who already have some sort of shared store infrastructure (a decent relational database, or a clustered NFS, for example) will tend to want to use it. The rise in cloud deployments has had the effect that mesh operation with no shared store has become (I think) a lot more popular, because it's so symmetric.
There's more -- a lot more -- that could be said here. As a broad question, I suspect the OP is a bit out-of-scope for SO. You'll probably get more traction if you break your question up into smaller, more focused, parts.

Can I use Aerospike as persistent layer

Aerospike is a key-value store database with support for persistence.
But can I trust this persistence enough to use it as a database altogether?
As I understand it, it writes data to memory first and then persists it to disk.
I can live with eventual consistency, but I don't want to be in a state where something was committed but, due to machine failure, it never got written to disk and hence can never be retrieved.
I tried looking at the various use cases, but I was just curious about this one.
Also, what guarantee does client.put provide as far as saving a new record is concerned?
Aerospike provides a user-configurable replication factor. Most people use 2; if you are really concerned, you can use 3 or even more, and size the cluster accordingly. For RF=3, put returns when 3 nodes have written the data to their write-block in memory, which is asynchronously flushed to the persistence layer. So it depends on what node-failure pattern you are trying to protect against. If you are worried about the entire cluster crashing instantly, then you may have a case for one second (the default) worth of lost data. That one second can be configured lower as well.
Aerospike also provides rack-aware configuration, which protects against data loss if an entire rack goes down: the put always goes to nodes in different racks. Finally, Aerospike provides cross-data-center replication -- it's asynchronous, but it does give you an option to replicate your data across geos. Of course, going across geos does have its latency.
Finally, if you are totally concerned about an entire cluster shutdown, you can connect to two separate clusters in your application and always push updates to both. Of course, you must then worry about consistency if the application fails between the two writes. I don't know of anyone who had to resort to that.
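For illustration, with the Aerospike Java client you can ask a put to wait for all replicas to apply the write in memory before returning. This is only a sketch: the host, namespace, set, and key names are made up, and the flush to disk is still asynchronous as described above.
```java
import com.aerospike.client.AerospikeClient;
import com.aerospike.client.Bin;
import com.aerospike.client.Key;
import com.aerospike.client.policy.CommitLevel;
import com.aerospike.client.policy.WritePolicy;

public class AerospikePutSketch {
    public static void main(String[] args) {
        AerospikeClient client = new AerospikeClient("127.0.0.1", 3000);

        // Wait for the master and all replicas to apply the write in memory
        // before put() returns; the write-block is still flushed to disk
        // asynchronously.
        WritePolicy policy = new WritePolicy();
        policy.commitLevel = CommitLevel.COMMIT_ALL;

        Key key = new Key("test", "orders", "order-123");
        client.put(policy, key, new Bin("status", "created"));
        client.close();
    }
}
```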

Zookeeper vs In-memory-data-grid vs Redis

I've found different ZooKeeper definitions across multiple resources. Maybe some of them are taken out of context, but please take a look at them:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apacheā„¢ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, so it would be easier for me to understand ZooKeeper by comparing it with them.
Could you please compare Zookeeper with in-memory-data-grids and Redis?
If it's distributed-memory computation, how does ZooKeeper differ from in-memory data grids?
If it's synchronization across the cluster, then how does it differ from all other in-memory storages? The same in-memory data grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about consistent in-memory data, then there are other alternatives. IMDGs allow you to achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, Zookeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they generally are sent to every client as soon as the write takes place). Finally, the maximum size of a "file" (znode) in Zookeeper is 1MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and is definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for this task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation and for most data sets, won't be able to store the data with any performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that, in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nicely with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull", so nodes are notified very quickly of changes.
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not provide the data. For example, let's say you had to process rows 1-100 of a table. You might put up 10 znodes, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (e.g. Redis, a SQL database, etc.). If a worker failed partway through the computation, another could see this (after 30-60 seconds, when the ephemeral node expires) and pick the job up again.
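A bare-bones sketch of claiming a chunk with an ephemeral znode follows; the ensemble address is hypothetical, the path mirrors the example above, and its parent znodes are assumed to already exist.
```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkClaimSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30_000, event -> { });

        // Claim the "rows 1-10" chunk with an ephemeral znode. If this worker
        // dies, the znode disappears when its session expires and another
        // worker can pick the chunk up.
        zk.create("/clients/192.168.77.66/processing/rows_1_10",
                  new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                  CreateMode.EPHEMERAL);
    }
}
```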
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is master and the other 2 are slaves. If the master goes down, a slave node must become the new leader. This type of thing is perfect for ZK.
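A minimal leader-election sketch using Apache Curator's LeaderLatch recipe is below; the connection string and election path are made up.
```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderElectionSketch {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
        latch.start();
        latch.await();   // blocks until this instance becomes the leader
        // ... act as master; if this process dies, another instance's latch wins ...
    }
}
```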
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both reads and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has been applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The clients' view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.

Need Design & Implementation inputs on Cassandra based use case

I am planning to store high-volume order transaction records from a commerce website to a repository (Have to use cassandra here, that is our DB). Let us call this component commerceOrderRecorderService.
Second part of the problem is - I want to process these orders and push to other downstream systems. This component can be called batchCommerceOrderProcessor.
commerceOrderRecorderService & batchCommerceOrderProcessor both will run on a java platform.
I need suggestion on design of these components. Especially the below:
commerceOrderRecorderService
What is the best way to design the columns, considering performance and scalability? Should I store the entire order (a complex entity) as a single JSON object? There is no search requirement on the order attributes; we can at least wait until they are processed by the batch processor. Consider that a single order can contain many sub-items, each of which can be fulfilled differently at processing time. Designing columns for such a data structure may be overkill.
What should be the key, given that data volumes would be high (let's say 10 transactions per second during peak)? Are there any libraries or best practices for creating such transactional data in Cassandra? Can TTL also be used effectively?
batchCommerceOrderProcessor
How should the rows be retrieved for processing?
How do we ensure that a multi-threaded implementation of the batch processor (potentially running on multiple nodes as well) has row-level isolation? That is, no two instances would read and process the same row at the same time, and there would be no duplicate processing.
How to purge the data after a certain period of time, while being friendly to cassandra processes like compaction.
Appreciate design inputs, code samples and pointers to libraries. Thanks.
Depending on the overall requirements of your system, it could be feasible to employ the architecture composed of:
Cassandra to store the orders, analytics and what have you.
Message queue - your commerce order recorder service would simply enqueue the new order to a transactional, persistent queue and return. Scalability and performance should not be an issue here, as you can easily achieve thousands of transactions per second with a single queue server. You may have a look at RabbitMQ as one of the available choices.
Stream processing framework - you could read the stream of messages from the queue in a scalable fashion using a streaming framework such as Twitter Storm. You could then implement three simple pipelined processes in Java on Storm (a rough sketch of the wiring follows this list):
a) A spout process that dequeues the next order from the queue and passes it to the second process
b) A bolt that inserts each order into Cassandra and passes it to the third bolt
c) A bolt that pushes the order to the other downstream systems
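Here is a rough sketch of such a topology, using Apache Storm 1.x+ package names (older Twitter Storm releases used backtype.storm); the spout and bolt bodies are stubs you would fill in with the real queue, Cassandra, and downstream calls.
```java
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.StormSubmitter;
import org.apache.storm.spout.SpoutOutputCollector;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.topology.base.BaseRichSpout;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

public class OrderTopologySketch {

    // a) Spout: dequeues the next order from the message queue (stubbed).
    public static class OrderQueueSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        public void open(Map conf, TopologyContext ctx, SpoutOutputCollector c) { collector = c; }
        public void nextTuple() {
            String orderJson = pollQueue();
            if (orderJson != null) collector.emit(new Values(orderJson));
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("order")); }
        private String pollQueue() { return null; /* dequeue from RabbitMQ here */ }
    }

    // b) Bolt: writes the order to Cassandra and passes it on.
    public static class CassandraWriterBolt extends BaseRichBolt {
        private OutputCollector collector;
        public void prepare(Map conf, TopologyContext ctx, OutputCollector c) { collector = c; }
        public void execute(Tuple t) {
            // INSERT the order into Cassandra here
            collector.emit(new Values(t.getStringByField("order")));
            collector.ack(t);
        }
        public void declareOutputFields(OutputFieldsDeclarer d) { d.declare(new Fields("order")); }
    }

    // c) Bolt: pushes the order to the downstream systems.
    public static class DownstreamPushBolt extends BaseRichBolt {
        private OutputCollector collector;
        public void prepare(Map conf, TopologyContext ctx, OutputCollector c) { collector = c; }
        public void execute(Tuple t) { /* call the downstream systems here */ collector.ack(t); }
        public void declareOutputFields(OutputFieldsDeclarer d) { }
    }

    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("order-queue", new OrderQueueSpout(), 2);
        builder.setBolt("cassandra-writer", new CassandraWriterBolt(), 4)
               .shuffleGrouping("order-queue");
        builder.setBolt("downstream-push", new DownstreamPushBolt(), 2)
               .shuffleGrouping("cassandra-writer");
        StormSubmitter.submitTopology("commerce-orders", new Config(), builder.createTopology());
    }
}
```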
Such an architecture offers high performance, scalability, and near-real-time, low-latency data processing. It takes into account that Cassandra is very strong at high-speed data writes, but not so strong at reading sequential lists of records. We use the Storm+Cassandra combination in our InnoQuant MOCA platform and handle 25,000 tx/second and more, depending on hardware.
Finally, you should consider if such an architecture is not an overkill for your scenario. Nowadays, you can easily achieve 10 tx/second with nearly any single-box database.
This example may help a little. It loads a lot of transactions using the jmxbulkloader and then batches the results into files of a certain size to be transported elsewhere. It is multi-threaded, but within the same process.
https://github.com/PatrickCallaghan/datastax-bulkloader-writer-example
Hope it helps. BTW, it uses the latest Cassandra, 2.0.5.

Database good system decoupling point?

We have two systems where system A sends data to system B. It is a requirement that each system can run independently of the other and neither will blow up if the other is down. The question is what is the best way for system A to communicate with system B while meeting the decoupling requirement.
System B currently has a process that polls data in a db table and processes any new rows that have been inserted.
One proposed design is for system A to just insert data into system B's DB table and have system B process the new rows via the existing process. The question is: does this solution meet the requirement of decoupling the two systems? Is the database considered part of system B, which might become unavailable and cause system A to blow up?
Another solution is for system A to put data into an MQ queue and have a process that would read from MQ and then insert into system B's database. But is this just extra overhead? Ultimately is an MQ queue any more fault tolerant than a db table?
Generally speaking, database sharing is a close coupling and not to be preferred except possibly for speed purposes. Not only for availability purposes, but also because system A and B will be changed and upgraded at several points in their future, and should have minimal dependencies on each other - message passing is an obvious dependency, whereas shared databases tend to bite you (or your inheritors) on the posterior when least expected. If you go the database sharing route, at least make the sharing interface explicit with dedicated tables or views.
There are four common levels of integration:
Database sharing
File sharing
Remote procedure call
Message passing
which can be applied and combined in various situations, with different availability and maintainability. You have an excellent overview at the enterprise integration patterns site.
As with any central integration infrastructure, MQ should be hosted in an environment with great availability, full failover &c. There are other queue solutions which allow you to distribute the queue coordination.
Use Queues for communication. Do not "pass" data from System A to System B through the database. You're using the database as a giant, expensive, complex message queue.
Use a message queue as a message queue.
This is not "Extra" overhead. This is the best way to decouple systems. It's called Service Oriented Architecture (SOA) and using messages is absolutely central to the design.
An MQ queue is far simpler than a DB table.
Don't compare "fault tolerance" because an RDBMS uses huge (almost unimaginable) overheads to achieve a reasonable level of assurance that your transaction finished properly. Locking. Buffering. Write Queues. Storage Management. Etc. Etc.
A reliable message queue implementation uses some backing store to keep the queue's state. The overhead is much, much less than an RDBMS. The performance is much better. And it's much, much simpler to interact with.
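As a sketch, system A could hand the data off with a few lines of JMS (shown here with the ActiveMQ client; the broker URL, queue name, and payload are hypothetical):
```java
import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import org.apache.activemq.ActiveMQConnectionFactory;

public class SystemAPublisherSketch {
    public static void main(String[] args) throws Exception {
        ConnectionFactory factory =
                new ActiveMQConnectionFactory("failover:(tcp://mq-host:61616)");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("systemA.to.systemB");

        // Persistent delivery: the broker stores the message until system B
        // consumes it, so neither system depends on the other being up.
        MessageProducer producer = session.createProducer(queue);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);
        producer.send(session.createTextMessage("{\"orderId\": 42}"));

        connection.close();
    }
}
```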
In SQL Server I would do this through an SSIS package or a job (depending on the number of records and the complexity of what I was moving). Other databases also have ETL solutions. I like the ETL solution because I can keep logs of what was changed and what errors were processed, and I can send records which for some reason won't go to the other system (data structures are rarely the same between two databases) to a holding table without killing the rest of the process. I can also make changes to the data as it flows to adjust for database differences (things like lookup table values, say the completed status in db1 is 5 and it is 7 in db2, or say db2 has a required field that db1 does not and you have to add a default value to the field if it is null). If one or the other server is down, the job running the SSIS package will fail and neither system will be affected, so it keeps the databases decoupled in a way that using triggers or replication would not.