I was watching a screencast at RailsLab where the presenter claims that it's possible to have a master DB for write operations and a slave DB for read operations. For certain types of Web sites (e.g. blogs, social networks, Web 2.0 sites, etc.) it is acceptable for the master and slave DBs not to be 100% synchronized for short periods of time, but AFAIK this is not acceptable in domains such as banking and insurance.
My question is whether such uses of master-slave replication are reliable enough for banking and insurance (and similar) applications, where there's no room for violating the integrity of the system. In other words, is it acceptable for the master and slave DBs to be out of sync for short periods of time?
If not, what horizontal (not vertical) solutions are available for scaling a database system in environments where there's absolutely no room for system integrity to be compromised?
Clustering
I am trying to do something new, something I have never done before, and I am looking for advice or a pointer in the right direction on how to choose the technology. I am trying to build a race simulation app that will have thousands of IoT devices streaming data into a central platform. I understand that I can use some sort of IoT hub with the cloud providers, but what technology do I choose for storing the data?
An example is an online indoor biking app: there are apps where you can connect your indoor bike online and take part in a simulated race. For my project I am trying to build something similar. Do I use a NoSQL DB in this scenario? What technology will allow an application like this to scale, since there could be millions of devices around the world in a "simulated" race? I am not worried about the front end and things like that, but about the backend: the IoT hub, storing the data, and presenting it in real time.
At this point it is important to understand what kind of data your IoT devices will stream, and at what rate, because that has a significant impact on the answer.
If it's just location information and some other small data sent, let's say, once a second, then even if you're talking about tens of thousands of devices this is not a big load of information, and any standard database, like MySQL, will be able to deal with it. You will of course need a multi-threaded server (or servers) capable of handling many requests in parallel.
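As a rough sketch of what that first scenario could look like, here is a small Python example that batches once-per-second telemetry rows into a relational table. The table and function names are made up for illustration, and SQLite is used only so the snippet runs stand-alone; against MySQL you would swap in a MySQL client but keep the same batching idea.

```python
# Minimal sketch: batching small, frequent telemetry rows into a
# relational table. SQLite is used only so the example is
# self-contained; the batching idea is the same with MySQL.
# Table and column names are illustrative, not prescriptive.
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE telemetry (
        device_id  TEXT,
        ts         REAL,
        lat        REAL,
        lon        REAL,
        speed_kmh  REAL
    )
""")

def ingest_batch(rows):
    """Insert a batch of telemetry rows in one transaction.

    Batching keeps per-row overhead low when tens of thousands of
    devices each report roughly once per second.
    """
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT INTO telemetry VALUES (?, ?, ?, ?, ?)", rows
        )

# Example: a batch of readings collected over the last second.
batch = [
    ("bike-001", time.time(), 52.37, 4.89, 34.2),
    ("bike-002", time.time(), 52.36, 4.90, 31.7),
]
ingest_batch(batch)
print(conn.execute("SELECT COUNT(*) FROM telemetry").fetchone()[0])
```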
If your IoT devices will stream HD video, then you're looking at a completely different solution, with a much stronger server capable of handling a lot of streams in parallel, with significant bandwidth requirements from your hosting company, as well as storage space for all the videos. In this case you will store the streams as files (if you need them later on), and you won't need any special database either.
In any case, once you reach millions of users, you'll be able to scale most modern databases and servers, for example with MySQL's replication capability. Take a look at how Wikipedia relies on MySQL: https://www.mysql.com/why-mysql/case-studies/mysql-cs-wikipedia.html
So I wouldn't worry about the database at this stage, but I would make sure that the design of the system is in accordance with the type of data and the rate at which it is streamed.
Hope this gives you a pointer.
When people describe Paxos, they always assume that there are already some proposers in the cluster. But where do the proposers come from, or what decides which processes act as proposers?
How the cluster is initially configured and how it is changed is down to the administrator who is trying to optimise the system.
You can run the different roles on different hosts and have different numbers of them. We could run three proposers, five acceptors and seven learners, whatever you choose. Clients that need to write a value only need to connect to proposers. With multi-Paxos for state replication that is sufficient, and the clients don't need to exchange messages with any other role type. Yet there is nothing to prevent clients from also being learners by seeing messages from acceptors.
As long as you follow the Paxos algorithm it all comes down to minimising network hops (latency and bandwidth), costs of hardware, and complexity of the software for your particular workload.
From a practical perspective your clients need to be able to find proposers in the face of failures. A cluster administrator will be configuring which nodes are to be proposers and making sure that they are discovered by clients.
It is hard to visualize from descriptions of the abstract algorithm how things might work, as many messaging topologies are possible. When applying the algorithm to a practical application it's far more obvious what setup minimises latency, bandwidth, hardware and complexity. An example might be a three-node MySQL cluster running Paxos. You want all three servers to have all the data, so they are all learners. All must be acceptors, as you need three at a minimum to have one node fail and still maintain progress. They may as well all be proposers to give the best availability and simplicity of software and configuration. Note that one will become the distinguished leader. The database administrator doesn't think about the Paxos roles; they just set up a three-node database cluster.
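To make the role bookkeeping concrete, here is a toy sketch in Python of how that three-node setup could be described, with every node acting as proposer, acceptor and learner, plus a helper that computes the acceptor majority quorum. This is illustrative only, not a Paxos implementation, and all names are invented.

```python
# Illustrative only: a toy representation of Paxos role assignment,
# not a working Paxos implementation.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    # Any subset of {"proposer", "acceptor", "learner"}.
    roles: set = field(default_factory=set)

def acceptor_quorum(nodes):
    """Majority quorum size over the acceptors.

    With three acceptors the quorum is 2, so the cluster can lose
    one node and still make progress, as described above.
    """
    acceptors = [n for n in nodes if "acceptor" in n.roles]
    return len(acceptors) // 2 + 1

# The three-node database cluster from the example: every node
# plays every role.
cluster = [
    Node(name, {"proposer", "acceptor", "learner"})
    for name in ("db1", "db2", "db3")
]
print(acceptor_quorum(cluster))  # -> 2
```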
The roles in the cluster may need to change. For example, you might want to expand the capacity of a database cluster. Or a server might die so you need to change the cluster membership to swap the dead one for a fresh one. For the Paxos algorithm to work every process must have a strongly consistent view of which processes are in which roles. How do you get consensus? You use Paxos to fix a new value of the cluster membership.
I just finished a database layer based on Redis that can select between multiple databases, but I have no first-hand experience of what the common-sense approach is. Reliability is my biggest focus.
How are writes and reads commonly organised in applications where both a master and a slave database are available?
How do the big guys pull it off?
Rule 1: Don't.
Rule 2: Don't until you've measured and proven that the database really is your bottleneck. Most web application bottlenecks are the time required to serve static content and stale content. Nothing to do with database transactions.
Rule 3: Even then, look at other ways of partitioning your data rather than duplicating your database. Get history away from current data into a warehouse. Split data by customer or subject areas or web application into multiple peer databases with limited or no sharing.
Rule 4: When you can prove that there is no alternative, look at master-slave databases.
That's how many folks tackle this problem.
For single master, multi-slave it's often as simple as sending all data modification queries to the master and all selects to the slave. Typically your database abstraction layer can easily handle this for you. This article has some details on this particular kind of setup.
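In the Redis case that routing can be as thin as wrapping two clients, one pointed at the master and one at a slave. A minimal sketch with redis-py follows; the hostnames are placeholders, and connection pooling, error handling and failover are left out.

```python
# Rough sketch of routing writes to the master and reads to a slave
# with redis-py. Hostnames are placeholders for your own servers.
import redis

master = redis.Redis(host="redis-master.example.com", port=6379)
replica = redis.Redis(host="redis-slave.example.com", port=6379)

def write(key, value):
    # All data modification goes to the master.
    master.set(key, value)

def read(key):
    # Reads go to the slave; the value may lag the master slightly
    # because replication is asynchronous.
    return replica.get(key)

write("user:42:name", "alice")
print(read("user:42:name"))
```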
What is the difference between Replication and Mirroring in SQL server 2005?
In short, mirroring allows you to have a second server be a "hot" stand-by copy of the main server, ready to take over any moment the main server fails. So mirroring offers fail-over and reliability.
Replication, on the other hand, allows two or more servers to stay "in sync" - that means the secondary servers can answer queries and (depending on setup) actually change data (it will be merged in the sync). You can also use it for local caching, load balancing, etc.
Mirroring is a feature that creates a copy of your database at bit level. Basically you have the same, identical, database in two places. You cannot optionally leave out parts of the database. You can have only one mirror, and the 'mirror' is always offline (it cannot be modified). Mirroring works by shipping the database log to the mirror as it is being created and applying (redoing) the log on the mirror. Mirroring is a technology for high availability and disaster recoverability.
Replication is a feature that allows 'slices' of a database to be replicated between several sites. The 'slice' can be a set of database objects (i.e. tables), but it can also contain parts of a table, like only certain rows (horizontal slicing) or only certain columns. You can have multiple replicas, and the 'replicas' are available to query and can even be updated. Replication works by tracking/detecting changes (either by triggers or by scanning the log) and shipping the changes, as T-SQL statements, to the subscribers (replicas). Replication is a technology for making data available at off sites and for consolidating data to central sites. Although it is sometimes used for high availability or for disaster recoverability, it is an artificial use for a problem that mirroring and log shipping address better.
There are several types and flavours of replication (merge, transactional, peer-to-peer etc.) and they differ in how they implement change tracking or update propagation, if you want to know more details you should read the MSDN spec on the subject.
Database mirroring is used to increase database uptime and reliability.
Replication is used primarily to distribute portions of your primary database -- the publisher -- to one or more subscriber databases. This is often done to make data available (typically for read only) on remote servers so that remote clients can access the data locally (to them) rather than directly from the publisher across a slower WAN connection, although, as the previous posts indicate, there are more complex scenarios where updates are permitted on the subscribers. It can also have the benefit of reducing the I/O load on the publisher.
I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus led to implementing replication at the application layer -- by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last-writer conditions are sorted out correctly.
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?
Lotus/Domino is your answer. I've been working with it for ten years and it's exactly what you need. It may not be trendy (a perception that I would challenge) but it's powerful, adaptable and very secure. The latest version, R8, is the best yet.
You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. Replication in Notes/Domino is a very powerful feature and enables full replication of data between sites. Even if a server is unavailable, the next time it connects it will simply replicate and get back in sync.
As far as the SOA service tier goes, you could then use Domino Designer to write a web service. Since Notes/Domino 7.5.x (I believe), Domino has been able to provision and consume web services.
As others have advised, I would also recommend Lotus Notes/Domino. 8.5 is a very powerful application development platform.
You don't give enough specifics to be certain of your needs, but I think you should check out SQL Server merge replication. It allows for asynchronous replication of multiple databases with full conflict resolution. You will need to designate a global master and all the other databases will replicate to that one, but all the database instances are fully functional (read/write), so you can schedule replication at whatever intervals suit you. If any region goes offline it can catch up later with no issues; if the master goes offline everyone will work independently until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino of course which is not very trendy these days).
I think that your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers so that you can rely on published updates being received eventually. If all of your access to the data repositories is via services, you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates. If the master database is the only one sending the updates, you can exclude routing the notifications to the node that generated them in the first place, thus avoiding update loops.
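A very rough sketch of that loop-avoidance idea in Python, with a toy in-process delivery function standing in for whatever reliable messaging you actually have between data centers. All names here are invented for illustration.

```python
# Toy illustration of publishing update events tagged with their
# origin data center so the originator can be skipped, avoiding
# replication loops. A real system would use a durable message bus.
LOCAL_DC = "eu-west"
SUBSCRIBER_DCS = ["eu-west", "us-east", "ap-south"]

def publish_update(event, origin_dc):
    """Fan an update event out to every data center except its origin."""
    for dc in SUBSCRIBER_DCS:
        if dc == origin_dc:
            continue  # never route the event back to where it came from
        deliver(dc, event)

def deliver(dc, event):
    # Stand-in for the reliable messaging layer between sites.
    print(f"deliver to {dc}: {event}")

# An update applied locally is published once, tagged with this DC.
publish_update({"entity": "policy-123", "op": "update"}, origin_dc=LOCAL_DC)
```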