Is a database a good system decoupling point?

We have two systems where system A sends data to system B. It is a requirement that each system can run independently of the other and neither will blow up if the other is down. The question is what is the best way for system A to communicate with system B while meeting the decoupling requirement.
System B currently has a process that polls a db table and processes any new rows that have been inserted.
One proposed design is for system A to simply insert data into system B's db table and have system B process the new rows with its existing process. The question is: does this solution meet the requirement of decoupling the two systems? Is the database considered part of system B, which might become unavailable and cause system A to blow up?
Another solution is for system A to put data onto an MQ queue and have a process that reads from MQ and then inserts into system B's database. But is this just extra overhead? Ultimately, is an MQ queue any more fault tolerant than a db table?

Generally speaking, database sharing is a form of tight coupling and is best avoided except possibly for performance reasons. That's not only for availability purposes: system A and B will be changed and upgraded at several points in their future and should have minimal dependencies on each other. Message passing is an obvious, explicit dependency, whereas shared databases tend to bite you (or your inheritors) on the posterior when least expected. If you go the database-sharing route, at least make the sharing interface explicit with dedicated tables or views.
There are four common levels of integration:
Database sharing
File sharing
Remote procedure call
Message passing
which can be applied and combined in various situations, with different availability and maintainability characteristics. There is an excellent overview at the enterprise integration patterns site.
As with any central integration infrastructure, MQ should be hosted in an environment with high availability, full failover, etc. There are other queue solutions which allow you to distribute the queue coordination.

Use Queues for communication. Do not "pass" data from System A to System B through the database. You're using the database as a giant, expensive, complex message queue.
Use a message queue as a message queue.
This is not "Extra" overhead. This is the best way to decouple systems. It's called Service Oriented Architecture (SOA) and using messages is absolutely central to the design.
An MQ queue is far simpler than a DB table.
Don't compare "fault tolerance" because an RDBMS uses huge (almost unimaginable) overheads to achieve a reasonable level of assurance that your transaction finished properly. Locking. Buffering. Write Queues. Storage Management. Etc. Etc.
A reliable message queue implementation uses some backing store to keep the queue's state. The overhead is much, much less than an RDBMS. The performance is much better. And it's much, much simpler to interact with.
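To make the contrast concrete, here is a minimal sketch of what "use a message queue as a message queue" looks like from System A's side. It assumes RabbitMQ via the Python pika client purely as a stand-in for whichever broker you actually use; the queue name, host and message shape are invented for illustration.

```python
import json

import pika  # RabbitMQ client, standing in for whatever MQ product you choose

def publish_order(order: dict) -> None:
    """System A only knows about the broker, never about System B's schema."""
    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    # Durable queue + persistent messages so nothing is lost if the broker restarts.
    channel.queue_declare(queue="system_b.orders", durable=True)
    channel.basic_publish(
        exchange="",
        routing_key="system_b.orders",
        body=json.dumps(order),
        properties=pika.BasicProperties(delivery_mode=2),
    )
    connection.close()

# System A keeps working while System B is down: the broker buffers messages,
# and System B drains the queue (inserting into its own tables) when it is back.
publish_order({"id": 42, "status": "new"})
```

System B's side would be a small consumer that reads from the queue and inserts into its own table, which is exactly the second design proposed in the question.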

In SQL Server I would do this through an SSIS package or a job (depending on the number of records and the complexity of what I was moving). Other databases also have ETL solutions. I like the ETL solution because I can keep logs of what was changed and what errors were processed, and I can send records which for some reason won't go to the other system (data structures are rarely the same between two databases) to a holding table without killing the rest of the process. I can also make changes to the data as it flows to adjust for database differences (things like lookup table values: say the completed status in db1 is 5 and it is 7 in db2, or say db2 has a required field that db1 does not and you have to add a default value to the field if it is null). If one or the other server is down, the job running the SSIS package will fail and neither system will be affected, so it keeps the databases decoupled in a way that using triggers or replication would not.

Related

nservicebus db insert duplicate

We have a data loader service that uses NServiceBus to insert data (if not already present) into a SQL DB. The queue is configured with Concurrencylevel > 1 as the data to load might get huge. Since the Concurrencylevel > 1, it results in duplicate inserts. Is there a way to handle this within NServiceBus?
Note: We have already considered and ruled out creating thread-safe locks.
Generally speaking, there's no need to run the endpoint with a concurrency level of one. You also don't need to manage the threading or fiddle with concurrency/locks yourself when it comes to NServiceBus. There are other factors in how the system needs to be designed to make this work:
Different transports have different levels of transaction support. Choose one that supports transactions. That way, if a message is retried, you won't get duplicated messages/data.
Try to design your system to be idempotent. That means that even without transactions (whether not supported by the transport or disabled in code), processing a message twice won't produce duplicate data or side effects. The 'how' depends on the data you're dealing with and your domain; a rough sketch of one approach follows.
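As a transport-agnostic illustration of that idempotency point (the original question is about NServiceBus and SQL Server, but the idea doesn't depend on either), here is a Python sketch using sqlite3 as a stand-in; the table, columns and message shape are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real SQL database
conn.execute(
    """
    CREATE TABLE IF NOT EXISTS loaded_data (
        message_id TEXT PRIMARY KEY,   -- natural/business key of the message
        payload    TEXT NOT NULL
    )
    """
)

def handle(message_id: str, payload: str) -> None:
    # INSERT OR IGNORE makes the handler safe to run twice for the same message:
    # the second attempt hits the primary key and simply does nothing.
    with conn:
        conn.execute(
            "INSERT OR IGNORE INTO loaded_data (message_id, payload) VALUES (?, ?)",
            (message_id, payload),
        )

# Two concurrent consumers (or a retry) processing the same message leave one row.
handle("order-42", "...payload...")
handle("order-42", "...payload...")
print(conn.execute("SELECT COUNT(*) FROM loaded_data").fetchone()[0])  # prints 1
```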

Zookeeper vs In-memory-data-grid vs Redis

I've found different ZooKeeper definitions across multiple resources. Maybe some of them are taken out of context, but take a look at them, please:
A canonical example of Zookeeper usage is distributed-memory computation...
ZooKeeper is an open source Apache™ project that provides a centralized infrastructure and services that enable synchronization across a cluster.
Apache ZooKeeper is an open source file application program interface (API) that allows distributed processes in large systems to synchronize with each other so that all clients making requests receive consistent data.
I've worked with Redis and Hazelcast, so it would be easier for me to understand ZooKeeper by comparing it with them.
Could you please compare Zookeeper with in-memory-data-grids and Redis?
If it's distributed-memory computation, how does ZooKeeper differ from in-memory data grids?
If it's synchronization across a cluster, then how does it differ from all the other in-memory storages? The same in-memory data grids also provide cluster-wide locks. Redis also has some kind of transactions.
If it's only about in-memory consistent data, then there are other alternatives. IMDGs allow you to achieve the same, don't they?
https://zookeeper.apache.org/doc/current/zookeeperOver.html
By default, ZooKeeper replicates all your data to every node and lets clients watch the data for changes. Changes are sent very quickly (within a bounded amount of time) to clients. You can also create "ephemeral nodes", which are deleted within a specified time if a client disconnects. ZooKeeper is highly optimized for reads, while writes are very slow (since they generally have to be replicated to every server, and watching clients are notified as soon as the write takes place). Finally, the maximum size of a "file" (znode) in ZooKeeper is 1MB, but typically they'll be single strings.
Taken together, this means that ZooKeeper is not meant to store much data, and definitely not a cache. Instead, it's for managing heartbeats/knowing what servers are online, storing/updating configuration, and possibly message passing (though if you have large numbers of messages or high throughput demands, something like RabbitMQ will be much better for this task).
Basically, ZooKeeper (and Curator, which is built on it) helps in handling the mechanics of clustering -- heartbeats, distributing updates/configuration, distributed locks, etc.
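A small sketch of those two mechanics (ephemeral nodes for "who is online" and watches for push-style notification), using the Python kazoo client rather than Curator; the connection string and paths are assumptions.

```python
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")  # assumed local ZooKeeper ensemble
zk.start()

# An ephemeral node disappears automatically when this client's session ends,
# which is what makes it useful for heartbeats / liveness tracking.
zk.create("/services/worker-1", b"192.168.77.66", ephemeral=True, makepath=True)

# A watch: ZooKeeper pushes membership changes to clients instead of making them poll.
@zk.ChildrenWatch("/services")
def on_members_changed(children):
    print("currently online:", children)
```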
It's not really comparable to Redis, but for the specific questions...
It doesn't support any computation and, for most data sets, won't be able to store the data with acceptable performance.
It's replicated to all nodes in the cluster (there's nothing like Redis clustering where the data can be distributed). All messages are processed atomically in full and are sequenced, so there are no real transactions. It can be USED to implement cluster-wide locks for your services (it's very good at that, in fact), and there are a lot of locking primitives on the znodes themselves to control which nodes access them.
Sure, but ZooKeeper fills a niche. It's a tool for making distributed applications play nice with multiple instances, not for storing/sharing large amounts of data. Compared to using an IMDG for this purpose, ZooKeeper will be faster, manages heartbeats and synchronization in a predictable way (with a lot of APIs for making this part easy), and has a "push" paradigm instead of "pull", so nodes are notified very quickly of changes.
The quotation from the linked question...
A canonical example of Zookeeper usage is distributed-memory computation
... is, IMO, a bit misleading. You would use it to orchestrate the computation, not to provide the data. For example, let's say you had to process rows 1-100 of a table. You might put up 10 ZK nodes, with names like "1-10", "11-20", "21-30", etc. Client applications would be notified of this change automatically by ZK, and the first one would grab "1-10" and set an ephemeral node clients/192.168.77.66/processing/rows_1_10.
The next application would see this and go for the next group to process. The actual data to compute would be stored elsewhere (e.g. Redis, a SQL database, etc.). If the node failed partway through the computation, another node could see this (after 30-60 seconds) and pick up the job again.
I'd say the canonical example of ZooKeeper is leader election, though. Let's say you have 3 nodes -- one is master and the other 2 are slaves. If the master goes down, a slave node must become the new leader. This type of thing is perfect for ZK.
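For the leader-election case, a minimal sketch with kazoo's Election recipe (Curator's LeaderSelector is the Java-side equivalent); the election path and the "master work" are placeholders.

```python
import socket
import time

from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def act_as_master():
    # Only one of the nodes runs this at a time. If it dies, its ZooKeeper
    # session expires and one of the remaining nodes is elected automatically.
    print("I am the master now")
    while True:
        time.sleep(5)  # placeholder for master-only work

election = zk.Election("/myapp/election", identifier=socket.gethostname())
election.run(act_as_master)  # blocks until elected, then runs the callback
```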
Consistency Guarantees
ZooKeeper is a high-performance, scalable service. Both read and write operations are designed to be fast, though reads are faster than writes. The reason for this is that in the case of reads, ZooKeeper can serve older data, which in turn is due to ZooKeeper's consistency guarantees:
Sequential Consistency
Updates from a client will be applied in the order that they were sent.
Atomicity
Updates either succeed or fail -- there are no partial results.
Single System Image
A client will see the same view of the service regardless of the server that it connects to.
Reliability
Once an update has been applied, it will persist from that time forward until a client overwrites the update. This guarantee has two corollaries:
If a client gets a successful return code, the update will have been applied. On some failures (communication errors, timeouts, etc.) the client will not know if the update has been applied or not. We take steps to minimize the failures, but the guarantee is only present with successful return codes. (This is called the monotonicity condition in Paxos.)
Any updates that are seen by the client, through a read request or successful update, will never be rolled back when recovering from server failures.
Timeliness
The client's view of the system is guaranteed to be up-to-date within a certain time bound (on the order of tens of seconds). Either system changes will be seen by a client within this bound, or the client will detect a service outage.

Scalability design question - master/slave databases

I just finished a database layer based on Redis that offers the option to select between multiple databases, but I have no experience myself with what is considered common sense to do here. Reliability is my biggest focus.
How are writes and reads commonly organised in applications where both a master and a slave database are available?
How do the big guys pull it off?
Rule 1: Don't.
Rule 2: Don't until you've measured and proven that the database really is your bottleneck. Most web application bottlenecks are the time required to serve static content and stale content. Nothing to do with database transactions.
Rule 3: Even then, look at other ways of partitioning your data rather than duplicating your database. Get history away from current data into a warehouse. Split data by customer or subject areas or web application into multiple peer databases with limited or no sharing.
Rule 4: When you can prove that there is no alternative, look at master-slave databases.
That's how many folks tackle this problem.
For a single-master, multi-slave setup it's often as simple as sending all data modification queries to the master and all selects to a slave. Typically your database abstraction layer can easily handle this for you. This article has some details on this particular kind of setup.
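A minimal sketch of that routing rule, with sqlite3 connections standing in for real master and replica servers (a production version would also need failover handling and awareness of replication lag):

```python
import sqlite3

class RoutingDatabase:
    """Send data-modification statements to the master, selects to a replica."""

    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas
        self._next = 0

    def execute_write(self, sql, params=()):
        with self.master:                      # writes always hit the master
            return self.master.execute(sql, params)

    def execute_read(self, sql, params=()):
        if not self.replicas:                  # degrade gracefully to the master
            return self.master.execute(sql, params)
        conn = self.replicas[self._next % len(self.replicas)]  # round-robin
        self._next += 1
        return conn.execute(sql, params)

# The sqlite3 stand-ins don't actually replicate; the point is only the routing.
db = RoutingDatabase(
    master=sqlite3.connect(":memory:"),
    replicas=[sqlite3.connect(":memory:"), sqlite3.connect(":memory:")],
)
db.execute_write("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute_write("INSERT INTO users (name) VALUES (?)", ("alice",))
```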

MSMQ v Database Table

An existing process changes the status field of a booking record in a table, in response to user input.
I have another process to write, that will run asynchronously for records with a particular status. It will read the table record, perform some operations (including calls to third party web services), and update the record's status field to indicate that processing is completed (or In Error, with an error count).
This operation sounds very similar to a queue. What are the benefits and tradeoffs of using MSMQ over a SQL Table in this situation, and why should I choose one over the other?
It is our software that is adding and updating records in the table.
It is a new piece of work (a Windows Service) that will be performing the asynchronous processing. This needs to be "always up".
There are several reasons, which were discussed on the Fog Creek forum here: http://discuss.fogcreek.com/joelonsoftware5/default.asp?cmd=show&ixPost=173704&ixReplies=5
The main benefit is that MSMQ can still be used when there is intermittent connectivity between computers (using a store-and-forward mechanism on the local machine). As far as the application is concerned, it has delivered the message to MSMQ, even though MSMQ will possibly deliver the message later.
You can only insert a record to a table when you can connect to the database.
A table approach is better when a workflow approach is required, where the process moves through various stages and these stages need to be persisted in the DB.
If the rate at which booking records are created is low, I would have the second process periodically check the table for new bookings (see the polling sketch below).
Unless you are already using MSMQ, introducing it just gives you an extra platform component to support.
If the database is heavily loaded, or you get a lot of lock contention with two process reading and writing to the same region of the bookings table, then consider introducing MSMQ.
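For reference, a sketch of the polling approach mentioned above: a worker that periodically scans for new bookings and records the outcome in the status column. Table and column names are invented, and sqlite3 stands in for the real database.

```python
import sqlite3
import time

conn = sqlite3.connect("bookings.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS bookings (id INTEGER PRIMARY KEY, status TEXT)"
)

def process_booking(booking_id):
    pass  # placeholder for the real work (third-party web service calls, etc.)

def poll_once():
    rows = conn.execute("SELECT id FROM bookings WHERE status = 'New'").fetchall()
    for (booking_id,) in rows:
        try:
            process_booking(booking_id)
            new_status = "Completed"
        except Exception:
            new_status = "InError"
        with conn:
            conn.execute(
                "UPDATE bookings SET status = ? WHERE id = ?",
                (new_status, booking_id),
            )

while True:
    poll_once()
    time.sleep(30)  # booking rate is low, so a 30-second interval is plenty
```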
I also like this answer from le dorfier in the previous discussion:
I've used tables first, then refactored to a full-fledged msg queue when (and if) there's reason - which is trivial if your design is reasonable.
Thanks, folks, for all the answers. Most helpful.
With MSMQ you can also offload the work to another server very easily, by changing the location of the queue to another machine rather than the db server.
By the way, as of SQL Server 2005 there is a built-in queue in the DB. It's called SQL Server Service Broker.
See : http://msdn.microsoft.com/en-us/library/ms345108.aspx
Also see previous discussion.
If you have MSMQ expertise, it's a good option. If you know databases but not MSMQ, ask yourself if you want to become expert in another technology; whether your application is a critical one; and which you'd rather debug when there's a problem.
I have recently been investigating this myself, so I wanted to mention my findings. The location of the database relative to your application is a big factor in deciding which option is faster.
I tested the time it took to insert 100 database entries versus logging the exact same data into a local MSMQ message queue. I then took the average of the results of performing this test several times.
What I found was that when the database is on the local network, inserting a row was up to 4 times faster than logging to an MSMQ.
When the database was being accessed over a decent internet connection, inserting a row into the database was up to 6 times slower than logging to an MSMQ.
So:
Local database - the DB is faster; otherwise MSMQ is.
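For what it's worth, the shape of that comparison is easy to reproduce; here is a rough timing-harness sketch in which insert_row and send_to_queue are hypothetical stand-ins for the real database insert and the MSMQ send.

```python
import statistics
import time

def time_runs(operation, writes_per_run=100, runs=10):
    """Time `writes_per_run` calls per run, repeat, and average - as in the test above."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        for i in range(writes_per_run):
            operation(i)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations)

def insert_row(i):
    pass  # stand-in: INSERT the entry into the (local or remote) database

def send_to_queue(i):
    pass  # stand-in: write the same data to the local MSMQ queue

print("db insert, avg seconds per run: ", time_runs(insert_row))
print("queue send, avg seconds per run:", time_runs(send_to_queue))
```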
Instead of making raw MSMQ calls, it might be easier if you implement your service as a queued COM+ component and make queued function calls from your client application. In the end, the asynchronous service still uses MSMQ in the background, but your code will be much clearer and easier to use.
I would probably go with MSMQ or ActiveMQ myself. I would suggest (presuming that, since you are considering MSMQ, you are using Windows with MS technology) looking into WCF, or, if you are using MS SQL 2005+, having a trigger that calls into .NET code to run your processing.
Service Broker was introduced in SQL 2005 and it is designed to be very quick at handling messages, as the process is relatively simple (I believe its roots were in triggers). If you are concerned about scalability, in SQL 2008 they released an independent processing executable to separate the processing from SQL Server (in standard Service Broker, everything is controlled by the SQL Server instances).
I would definitely consider using Service Broker over MSMQ, but this is dependent on your SQL development/DBA resources and their knowledge.
Besides Mitch's answer, some other scenarios:
1. Each of your messages has its own due date to trigger the action. This can be done through MQ as well, but in this case I prefer to store it in the db as it is more controllable;
2. The subscriber needs to filter messages and then process a portion of them. This can be done with LINQ too; depending on how complex the filter is, the db approach is better because I can use LINQ to EF to do complex queries easily;
3. For deployment, I want a fully automated deployment process, so the DB is a better choice for me. I am not a big fan of manual configuration.

Application Level Replication Technologies

I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back-end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus led to implementing replication at the application layer -- by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last-writer conditions are sorted out correctly.
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?
Lotus/Domino is your answer. I've been working with it for ten years and it's exactly what you need. It may not be trendy (a perception that I would challenge), but it's powerful, adaptable and very secure. The latest version, R8, is the best yet.
You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. Replication in Notes/Domino is definitely a very powerful feature and enables full replication of data between sites. Even if a server is unavailable, the next time it connects it will simply replicate and get back in sync.
As for the SOA service tier, you could then use Domino Designer to write a web service. Since Notes/Domino 7.5.x (I believe), Domino has been able to provision and consume web services.
As others have advised, I will also recommend Lotus Notes/Domino. 8.5 is really a very powerful application development platform.
You don't give enough specifics to be certain of your needs, but I think you should check out SQL Server merge replication. It allows for asynchronous replication of multiple databases with full conflict resolution. You will need to designate a global master and all the other databases will replicate to that one, but all the database instances are fully functional (read/write), so you can schedule replication at whatever intervals suit you. If any region goes offline, it can catch up later with no issues - if the master goes offline, everyone will work independently until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino of course which is not very trendy these days).
I think that your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers so that you can rely on published updates being received eventually. If all of your access to the data repositories is via services, you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates. If the master database is the only one sending the updates, you can exclude routing the notifications to the node that generated them in the first place, thus avoiding update loops.
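A small sketch of that loop-avoidance idea: every published update carries the identity of the data center that originated it, and subscribers drop (or are never sent) events they produced themselves. The in-process queues and names below are stand-ins for the real inter-data-center messaging.

```python
import json
import queue
import uuid

LOCAL_DC = "eu-west"                                                 # this data center's identity
peer_buses = {"us-east": queue.Queue(), "ap-south": queue.Queue()}   # stand-in transports

def publish_update(entity, payload):
    event = {
        "event_id": str(uuid.uuid4()),
        "origin": LOCAL_DC,          # record who made the change
        "entity": entity,
        "payload": payload,
    }
    for bus in peer_buses.values():  # note: never routed back to LOCAL_DC
        bus.put(json.dumps(event))

def handle_incoming(raw_event):
    event = json.loads(raw_event)
    if event["origin"] == LOCAL_DC:
        return                       # we already applied this locally; avoid a loop
    apply_to_local_repositories(event)

def apply_to_local_repositories(event):
    print("applying", event["entity"], "update from", event["origin"])

publish_update("customer/42", {"status": "active"})
```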