Fault tolerant system design - replication

There is a DB as the data store and y (> 5) other machines. There is a machine A whose data is updated every x minutes. The y machines get the data from machine A every x minutes and update the data in the database. Having every machine do the same work is for some fault tolerance. Is there a clean way to model this with fault tolerance?
Any pointers are appreciated.

This is a problem with very large scope. How is the data structured? How do the "db loaders" get the data from the "data producing" machine? What happens if an update fails -- is the data lost, or must it be persisted at any cost?
I will make some assumptions and suggest a solution:
1. The data can be partitioned.
2. You have access to a central persistent buffer. e.g. MSMQ or WebSphere MQ.
The machine generating the data puts chunks into a central queue. Each chunk is composed of a set of record IDs and the new values for the relevant properties -- you decide the granularity.
The "db loaders" listen to the queue and each dequeues a chunk (the contention is only on the dequeue stage and is very optimized) and updates its own set of IDs.
This way the insert work is distributed among the machines, each handles its own portion, and if one crashes, well -- the others will simply work a bit harder.
In case of a failure to update you can return the chunk to the queue and retry later (transactional read).
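To make the flow concrete, here is a minimal Python sketch of one such "db loader" worker. The Chunk type and apply_chunk function are placeholders, and the stdlib queue.Queue stands in for MSMQ/WebSphere MQ purely to show the control flow:

import queue
from dataclasses import dataclass

@dataclass
class Chunk:
    record_ids: list          # IDs owned by this chunk
    values: dict              # record_id -> new property values

def apply_chunk(chunk: Chunk) -> None:
    """Write the chunk's records to the database (stub)."""
    for rid in chunk.record_ids:
        print(f"UPDATE ... SET ... WHERE id = {rid}")   # real code would hit the DB

def worker(q: "queue.Queue[Chunk]") -> None:
    while True:
        chunk = q.get()               # contention is only on the dequeue step
        try:
            apply_chunk(chunk)        # each loader updates only its own set of IDs
        except Exception:
            q.put(chunk)              # on failure, return the chunk and retry later
        finally:
            q.task_done()

With a real broker, the dequeue and the database update would run inside one transaction, so a crash before the commit simply puts the chunk back on the queue.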

Related

Can I use Aerospike as persistent layer

Aerospike is a key-value store database with support for persistence.
But can I trust this persistence enough to use it as a database altogether?
As I understand it, it writes data to memory first and then persists it.
I can live with eventual consistency, but I don't want to be in a state where something was committed but due to machine failure it never got written down to the disk and hence can never be retrieved.
I tried looking at the various use cases but I was just curious about this one.
Also, what guarantee does client.put provide as far as saving a new record is concerned?
Aerospike provides a user-configurable replication factor. Most people use 2; if you are really concerned, you can use 3 or even more. Size the cluster accordingly. For RF=3, put returns when 3 nodes have written the data to their write-block in memory, which is asynchronously flushed to the persistence layer. So it depends on what node-failure pattern you are trying to protect against. If you are worried about the entire cluster crashing instantly, then you may have a case for one second's (default) worth of lost data. The one second can be configured lower as well.

Aerospike also provides a rack-aware configuration, which protects against data loss if an entire rack goes down: the put always goes to nodes in different racks. Finally, Aerospike provides cross-data-center replication -- it's asynchronous, but it does give you the option to replicate your data across geographies. Of course, going across geographies has its latency.

Finally, if you are truly concerned about an entire cluster shutting down, you can connect to two separate clusters in your application and always push updates to both. Of course, you must then worry about consistency if the application fails between the two writes. I don't know of anyone who had to resort to that.
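For the client.put question, a minimal sketch with the Aerospike Python client might look like the following. The host, namespace, set and bin names are made up, and the replication factor itself is a server-side namespace setting rather than something the client chooses per write:

# Hypothetical sketch using the Aerospike Python client.
import aerospike

config = {'hosts': [('127.0.0.1', 3000)]}
client = aerospike.client(config).connect()

key = ('test', 'users', 'user-42')

# commit_level=ALL makes put() return only after all replicas (per the
# namespace's replication-factor) have applied the write to their in-memory
# write-block; MASTER returns after the master copy alone. The flush to the
# persistent layer is still asynchronous, as described above.
write_policy = {'commit_level': aerospike.POLICY_COMMIT_LEVEL_ALL}

client.put(key, {'name': 'Alice', 'balance': 100}, policy=write_policy)
client.close()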

Zookeeper vs In-memory-data-grid vs Redis

I've found different ZooKeeper definitions across multiple resources. Maybe some of them are taken out of context, but please look at them:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, so it would be easier for me to understand ZooKeeper by comparing it with them.
Could you please compare Zookeeper with in-memory-data-grids and Redis?
If it's distributed-memory computation, how does ZooKeeper differ from in-memory data grids?
If it's synchronization across a cluster, then how does it differ from all the other in-memory storages? The same in-memory data grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about in-memory consistent data, then there are other alternatives. IMDGs let you achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, ZooKeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they generally must be propagated to every node as soon as the write takes place). Finally, the maximum size of a "file" (znode) in ZooKeeper is 1 MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and is definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for this task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation, and for most data sets it won't be able to store the data with any reasonable performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that, in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nicely with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull", so nodes are notified very quickly of changes.
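To make that niche concrete, here is a small sketch with the kazoo Python client. The znode paths and values are invented, but it shows the three things discussed here: ephemeral presence nodes, watches pushed to clients, and cluster-wide locks:

from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

# Ephemeral node: disappears automatically when this client's session ends,
# which is what makes ZooKeeper good for heartbeats / "who is online".
zk.create('/services/worker-1', b'192.168.77.66:8080', ephemeral=True, makepath=True)

# Watch: the change is pushed to us when the config znode changes.
@zk.DataWatch('/config/app')
def on_config_change(data, stat):
    print("config is now:", data, "version:", stat.version if stat else None)

# Cluster-wide lock on a znode.
lock = zk.Lock('/locks/reindex', 'worker-1')
with lock:
    pass  # only one client in the whole cluster runs this at a time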
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not provide the data. For example, let's say you had to process rows 1-100 of a table. You might put 10 ZK nodes up, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (i.e. Redis, a SQL database, etc.). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again.
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is master and the other 2 are slaves. If the master goes down, a slave node must become the new leader. This type of thing is perfect for ZK.
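As a sketch of that leader-election pattern, using kazoo's Election recipe (the election path and identifier are made up):

import time
from kazoo.client import KazooClient

zk = KazooClient(hosts='127.0.0.1:2181')
zk.start()

def lead():
    # Only the elected leader runs this; it should block for as long as this
    # node wants to remain leader.
    print("I am the master now")
    while True:
        time.sleep(1)   # master-only work would go here

# Every node runs the same code; the Election recipe picks one leader. If the
# leader's session dies, one of the blocked contenders is elected next.
election = zk.Election('/election/mycluster', identifier='node-3')
election.run(lead)   # blocks; runs lead() only when this node wins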
Consistency Guarantees
ZooKeeper is a high performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has been applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound. (On the order of tens of seconds.) Either system changes will be seen by a client within this bound, or the client will detect a service outage.
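In client terms, this is why a plain read can return slightly stale data, and why a client that needs the latest committed value typically issues a sync before the read. A small kazoo sketch (the path is hypothetical):

from kazoo.client import KazooClient

zk = KazooClient(hosts='zk1:2181,zk2:2181,zk3:2181')
zk.start()

# A plain read may be served by a follower that is slightly behind the leader
# (sequential consistency, not linearizability).
data, stat = zk.get('/config/app')

# If this client needs the latest committed value, sync() the path first so
# the server it is connected to catches up with the leader before the read.
zk.sync('/config/app')
data, stat = zk.get('/config/app')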

SQL Service Broker: Collecting data -- plug-in scenario analysis

(2nd update from 2012/12/06 -- new protocol, a slightly different view)
The question is whether the solution below seems reasonable for you, or whether there is any flaw that I did not notice (being quite new to SQL Server Service Broker)...
I would like to continue the analysis of the problem presented in SQL Service Broker: Collecting data from distributed sources. I would like to focus on the problem of the protocol to be used when collecting data from the satellite SQL servers. The usage of SQL Server Service Broker is a must -- it is dictated also by other reasons not presented here. So, please, do not suggest completely alternative solutions.
I would like to focus on details of what should be done and how to use the Service Broker naturally (the best possible way) for the exact problem. The overall goal was presented in the above mentioned question. The picture first:
Now more details to be considered...
Plug-in architecture wanted
The satellite machines are related to real physical production lines. It can happen that some machine is added to the technology process, some machine can disappear, some machine can be replaced in the sense it will use the same production-line identification, but it is physically different -- i.e. its SQL server is a different instance.
The central server knows nothing about a satellite until it gets the first messages from it. There is no centralized database of the satellite servers, and no knowledge about what and how many satellite SQL servers are to be included in the system. It is always decided on the satellite site.
Any activity related to collecting the data should be initiated by events generated by the satellite machines.
Important: The goal is to continually transfer all the newly created data (from sensors), and to discover and fix drop-outs -- independently of whatever could cause them.
To give you the concrete example:
The machine identified by line number 3 (yellow) was recently added to the environment. Its SQL Server Express was launched and it started to collect the sensor data (the third party solution, dedicated table with special structure). The machine was not connected to the central server, yet.
The only configuration is the reliably assigned, fixed identification of the production line (here 3), and all the necessary details to connect to the central SQL server. But the central SQL server does not know this information. The central is just ready to accept data from any new source, but never knows when. (It was already tested for one machine using the approach suggested by Remus Rusanu's answer to the question SQL Service Broker — one central SQL and more satelite SQL….)
The piece of SQL software is deployed on machine 3 just a bit later. It starts to talk with the central. The satellite part is not dumb, but its own activity is to send the sensor data whenever a new record is inserted into the sensor data table (see point 1 above). From the record, the UTC time is calculated (from the proprietary format), the several sensor readings in one record are converted to the same number of normalized records (formatted as one XML message), and sent to the central SQL server.
The central is activated by the message with the sensor data sent from the satellite machine. Failures of the physical connection are masked by the Service Broker queues.
After a reasonable interval (here one hour), the central server checks whether the so far collected data should be processed or not. There is a work unit that takes some production time, and the data should be processed and added to the documentation of the unit. The processing should happen only when the unit was finished.
The central also checks whether it has all the data for the unit. As the sensor sampling is done at known regular intervals (here about 1 minute), the central can check whether there are some drop-outs (see the sketch after this example). There also is an initial "drop-out" for the time interval when the satellite was not connected to the central via SSB. The mechanism should recover from whatever situation. It can also happen that the sensors were out of order or the data were not collected. The detected drop-out at the central may actually mean that the central asks: "I have no data from you for this time interval. Send me some of them if they exist, or tell me they do not exist."
The satellite should send only as much data as can be sent between the sampling times. The recovery from drop-outs can be rather slow. The delay of processing the data at the central server is not critical. However, the central should know when the data is ready (or does not exist for the detected time interval).
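A rough sketch of the drop-out check mentioned above, assuming the central already has the collected UTC timestamps for one production line (the names and the tolerance are made up):

from datetime import datetime, timedelta

SAMPLING_INTERVAL = timedelta(minutes=1)
TOLERANCE = timedelta(seconds=30)        # allow some jitter in the sampling

def find_dropouts(timestamps: list[datetime]) -> list[tuple[datetime, datetime]]:
    """Return (gap_start, gap_end) intervals where samples are missing."""
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > SAMPLING_INTERVAL + TOLERANCE:
            gaps.append((previous, current))
    return gaps

# For each gap the central would send a request along the lines of:
# "I have no data from you for (gap_start, gap_end). Send it if it exists,
#  or tell me it does not exist."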
Some picture, more solution details
I have chosen the "Recycling conversations" by Remus Rusanu as the basic framework for the communication between the satellite and the central. It defines the EndOfStream message type to signal that the conversation handle should be thrown away and the new one should be used. The lifetime is limited by the above mentioned one hour interval generated by the Service Broker timer.
The message is (mis)used at the central server also for activation of the data processing. At about the same time, the central checks for drop-outs. The central keeps track of the time up to which drop-outs have already been checked. This way it knows what data are ready to be processed.
Do you consider the scenario reasonable? Can you see any problem with it?
(I am going to refine the question to reflect your suggestions.)
Thanks for your time and experience, and have a nice day.
Petr
All data should be stored in a table. On the satellite side, you should create a table in which to store the last processed row. When a new request from the Central arrives, a new data pack will be sent back to the Central depending on the last processed record value.
Note: I recommend limiting the number of rows to be sent, depending on your data, so you do not create very large data packs (see the sketch below).
When the Central has processed all rows, an appropriate message should be sent to the Satellite. It should also contain information about any data import errors that occurred.
You can start a Service Broker conversation when database activity is registered (using DML/DDL triggers on both the Central and Satellite databases) or on a schedule (using a Central Agent job).
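As an illustration of the satellite side of this answer, here is a hypothetical Python/pyodbc sketch of building a limited data pack starting from the last processed row. The table and column names are invented, and the actual transport would be the Service Broker conversation described above:

import pyodbc

BATCH_SIZE = 500

def next_data_pack(conn) -> list[tuple]:
    # Read the local bookmark, then fetch at most BATCH_SIZE newer sensor rows.
    cur = conn.cursor()
    cur.execute("SELECT last_row_id FROM dbo.SendBookmark")
    (last_id,) = cur.fetchone()
    cur.execute(
        "SELECT TOP (?) row_id, sampled_at_utc, sensor_values "
        "FROM dbo.SensorData WHERE row_id > ? ORDER BY row_id",
        BATCH_SIZE, last_id)
    return cur.fetchall()

def acknowledge(conn, rows) -> None:
    # Called when the Central reports the pack was imported (possibly with
    # per-row error details); only then is the bookmark moved forward.
    if rows:
        cur = conn.cursor()
        cur.execute("UPDATE dbo.SendBookmark SET last_row_id = ?", rows[-1][0])
        conn.commit()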

Database good system decoupling point?

We have two systems where system A sends data to system B. It is a requirement that each system can run independently of the other and neither will blow up if the other is down. The question is what is the best way for system A to communicate with system B while meeting the decoupling requirement.
System B currently has a process that polls data in a db table and processes any new rows that have been inserted.
One proposed design is for system A to just insert data into system B's db table and have system B process the new rows with the existing process. The question is: does this solution meet the requirement of decoupling the two systems? Is the database considered part of system B, which might become unavailable and cause system A to blow up?
Another solution is for system A to put data into an MQ queue and have a process that would read from MQ and then insert into system B's database. But is this just extra overhead? Ultimately, is an MQ queue any more fault tolerant than a db table?
Generally speaking, database sharing is a close coupling and not to be preferred except possibly for speed purposes. Not only for availability purposes, but also because system A and B will be changed and upgraded at several points in their future, and should have minimal dependencies on each other - message passing is an obvious dependency, whereas shared databases tend to bite you (or your inheritors) on the posterior when least expected. If you go the database sharing route, at least make the sharing interface explicit with dedicated tables or views.
There are four common levels of integration:
Database sharing
File sharing
Remote procedure call
Message passing
which can be applied and combined in various situations, with different availability and maintainability. You have an excellent overview at the enterprise integration patterns site.
As with any central integration infrastructure, MQ should be hosted in an environment with great availability, full failover &c. There are other queue solutions which allow you to distribute the queue coordination.
Use Queues for communication. Do not "pass" data from System A to System B through the database. You're using the database as a giant, expensive, complex message queue.
Use a message queue as a message queue.
This is not "Extra" overhead. This is the best way to decouple systems. It's called Service Oriented Architecture (SOA) and using messages is absolutely central to the design.
An MQ queue is far simpler than a DB table.
Don't compare "fault tolerance" because an RDBMS uses huge (almost unimaginable) overheads to achieve a reasonable level of assurance that your transaction finished properly. Locking. Buffering. Write Queues. Storage Management. Etc. Etc.
A reliable message queue implementation uses some backing store to keep the queue's state. The overhead is much, much less than an RDBMS. The performance is much better. And it's much, much simpler to interact with.
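A minimal sketch of that queue-based decoupling, using RabbitMQ via pika purely as an example broker (the queue name and message shape are made up). System A only needs the broker to be up, not system B:

import json
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(host='localhost'))
channel = connection.channel()
channel.queue_declare(queue='system_b_inbox', durable=True)   # survives broker restart

# --- System A: publish and forget ---
channel.basic_publish(
    exchange='',
    routing_key='system_b_inbox',
    body=json.dumps({'order_id': 42, 'status': 'shipped'}),
    properties=pika.BasicProperties(delivery_mode=2),          # persistent message
)

# --- System B: consume whenever it is running ---
def handle(ch, method, properties, body):
    row = json.loads(body)
    # ... insert into system B's own tables here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)             # ack only after the insert

channel.basic_consume(queue='system_b_inbox', on_message_callback=handle)
# channel.start_consuming()   # run inside system B's process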
In SQL Server I would do this through an SSIS package or a job (depending on the number of records and the complexity of what I was moving). Other databases also have ETL solutions. I like the ETL solution because I can keep logs of what was changed and what errors were processed, and I can send records which for some reason won't go to the other system (data structures are rarely the same between two databases) to a holding table without killing the rest of the process. I can also make changes to the data as it flows to adjust for database differences (things like lookup table values -- say the completed status in db1 is 5 and it is 7 in db2, or say db2 has a required field that db1 does not and you have to add a default value to the field if it is null). If one or the other server is down, the job running the SSIS package will fail and neither system will be affected, so it keeps the databases decoupled in a way that using triggers or replication would not.

Is it possible to get sub-1-second latency with transactional replication?

Our database architecture consists of two Sql Server 2005 servers each with an instance of the same database structure: one for all reads, and one for all writes. We use transactional replication to keep the read database up-to-date.
The two servers are very high-spec indeed (the write server has 32GB of RAM), and are connected via a fibre network.
When deciding upon this architecture we were led to believe that the latency for data to be replicated to the read server would be in the order of a few milliseconds (depending on load, obviously). In practice we are seeing latency of around 2-5 seconds in even the simplest of cases, which is unsatisfactory. By a simplest case, I mean updating a single value in a single row in a single table on the write db and seeing how long it takes to observe the new value in the read database.
What factors should we be looking at to achieve latency below 1 second? Is this even achievable?
Alternatively, is there a different mode of replication we should consider? What is the best practice for the locations of the data and log files?
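A rough sketch of that simplest-case measurement (write on the publisher, poll the subscriber); the connection strings and probe table are placeholders:

import time
import pyodbc

write_conn = pyodbc.connect('DSN=WriteServer')   # publisher
read_conn = pyodbc.connect('DSN=ReadServer')     # subscriber

marker = int(time.time())
start = time.monotonic()

cur = write_conn.cursor()
cur.execute("UPDATE dbo.LatencyProbe SET value = ? WHERE id = 1", marker)
write_conn.commit()

# Poll the read database until the new value shows up.
while True:
    row = read_conn.cursor().execute(
        "SELECT value FROM dbo.LatencyProbe WHERE id = 1").fetchone()
    if row and row[0] == marker:
        break
    time.sleep(0.05)

print(f"replication latency ~ {time.monotonic() - start:.2f}s")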
Edit
Thanks to all for the advice and insight - I believe that the latency periods we are experiencing are normal; we were misled by our db hosting company as to what latency times to expect!
We're using the technique described near the bottom of this MSDN article (under the heading "scaling databases"), and we'd failed to deal properly with this warning:
The consequence of creating such specialized databases is latency: a write is now going to take time to be distributed to the reader databases. But if you can deal with the latency, the scaling potential is huge.
We're now looking at implementing a change to our caching mechanism that enforces reads from the write database when an item of data is considered to be "volatile".
No. It's highly unlikely you could achieve sub-1s latency times with SQL Server transactional replication even with fast hardware.
If you can get 1 - 5 seconds latency then you are doing well.
From here:
Using transactional replication, it is possible for a Subscriber to be a few seconds behind the Publisher. With a latency of only a few seconds, the Subscriber can easily be used as a reporting server, offloading expensive user queries and reporting from the Publisher to the Subscriber.
In the following scenario (using the Customer table shown later in this section) the Subscriber was only four seconds behind the Publisher. Even more impressive, 60 percent of the time it had a latency of two seconds or less. The time is measured from when the record was inserted or updated at the Publisher until it was actually written to the subscribing database.
I would say it's definitely possible.
I would look at:
Your network
Run ping commands between the two servers and see if there are any issues
If the servers are next to each other you should have < 1 ms.
Bottlenecks on the server
This could be network traffic (volume)
Like network cards not being configured for 1 Gb/sec
Anti-virus or other things
Do some analysis on some queries and see if you can identify indexes or locking which might be a problem
See if any of the selects on the read database might be blocking the writes.
Add with (nolock), and see if this makes a difference on one or two queries you're analyzing.
Essentially you have a complicated system which you have a problem with, you need to determine which component is the problem and fix it.
Transactional replication is probably best if the reports / selects you need to run need to be up to date. If they don't you could look at log shipping, although that would add some down time with each import.
For data/log files, make sure they're on separate drives so that performance is maximized.
Something to remember about transactional replication is that a single update now requires several operations to happen for that change to occur.
First you update the source table.
Next the log reader sees the change and writes the change to the distribution database.
Next the distribution agent sees the new entry in the distribution database and reads that change, then runs the correct stored procedure on the subscriber to update the row.
If you monitor the statement run times on the two servers you'll probably see that they are running in just a few milliseconds. However it is the lag time while waiting for the log reader and distribution agent to see that they need to do something which is going to kill you.
If you truly need sub-second processing time then you will want to look into writing your own processing engine to handle data moving from one server to another. I would recommend using SQL Service Broker to handle this, as this way everything is native to SQL Server and no third-party code has to be written.