I'm working on a new implementation and have some questions about the pros and cons of a shared database in a microservice architecture.
Context:
Service A listens to an event from Kafka and, based on its parameters, updates a particular table. This table is owned entirely by Service A and is not shared. Some of the data in this table needs to be accessed by other services, depending on the value of a particular field.
My Approach:
Once the table is updated, if we know that this data might be required by some other service (by checking the value of the field), Service A also writes it to an Elasticsearch (ES) index. I want to keep the ES index shared across services.
The other services would read the ES index whenever required. These services only read from the index, while Service A is the only service that writes to it.
Also, I've added a fallback API in Service A that hits the table directly in case ES is down. Please check out the diagram; I've added a link to it below.
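Roughly, the write path in Service A looks like this (a simplified sketch only; the index name, the "needed by other services" check and the client setup are placeholders, not our actual code):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

import java.util.Map;

public class SharedIndexWriter {

    // Hypothetical index name; only Service A knows the write path.
    private static final String SHARED_INDEX = "service-a-shared-data";

    private final RestHighLevelClient esClient = new RestHighLevelClient(
            RestClient.builder(new HttpHost("localhost", 9200, "http")));

    /** Called after Service A has updated the row in its own table. */
    public void publishIfShared(String rowId, Map<String, Object> row) throws Exception {
        // Placeholder check for "some other service will need this".
        if (!"SHARED".equals(row.get("visibility"))) {
            return;
        }
        // Service A is the only writer; the other services only query this index.
        esClient.index(new IndexRequest(SHARED_INDEX).id(rowId).source(row), RequestOptions.DEFAULT);
    }
}
```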
Issues:
One issue I can think of: if ES is completely down, Service A won't be able to write to it, and that row update will fail. How do I handle this?
I also need help identifying the fundamental scalability and deployment issues that introducing a shared ES index could create, which would be counterproductive in a microservice architecture. I think I have mitigated some of the resiliency issues by adding a fallback API for the other services to use when ES is down.
Please criticise my design. Design Diagram
I see three options:
Option A: Service A needs to implement something equivalent to the two-phase commit protocol where an event consumed from Kafka by Service A would not be acknowledged until both the DB and ES have acknowledged their write.
This puts a big burden on your service: if one of the two sub-systems (DB and/or ES) goes down, it has to spend time retrying and cannot consume more events from Kafka, so events start piling up in the topic. 2PC is also hard to implement correctly in a distributed environment.
Option B: Service A consumes from Kafka topic A, does its thing and produces another event to another Kafka topic B. Two other consumer groups, each responsible for updating one sub-system, then consume those events from topic B: one keeps updating the DB and the other keeps updating ES. Service A can do its job rapidly and does not get bogged down with updates. Each update can be retried independently by each consumer group without impacting upstream event consumption. Eventually, everything will be in sync (a rough sketch of this option follows below).
Option C: A more lightweight variation of Option B. Service A consumes events from the Kafka topic, does its job and updates the DB as it does now. Another process (CDC, Logstash, etc.) consumes updates from the DB and updates ES asynchronously, and is also responsible for retrying if ES is down. Eventually, everything will be in sync as well.
There are other options, but these 3 are the most obvious ones to me.
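To make Option B concrete, here is a minimal sketch of the relay step in Service A (topic names, serializers and the processing step are placeholders; error handling and shutdown are omitted):

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ServiceARelay {

    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "service-a");
        consumerProps.put("enable.auto.commit", "false"); // only ack once topic B has the event
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            consumer.subscribe(Collections.singletonList("topic-a")); // placeholder topic names
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Service A's own processing would happen here; the DB writer and the
                    // ES writer are separate consumer groups on topic-b, retrying independently.
                    String result = record.value(); // placeholder transformation
                    producer.send(new ProducerRecord<>("topic-b", record.key(), result));
                }
                producer.flush();      // make sure topic-b has the events...
                consumer.commitSync(); // ...before acknowledging topic-a
            }
        }
    }
}
```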
I'm trying to configure Gemfire/Geode in order to have an async event queue with parallel=true on a replicated region. However, I'm getting the following exception at startup:
com.gemstone.gemfire.internal.cache.wan.AsyncEventQueueConfigurationException: Parallel Async Event Queue myQueue can not be used with replicated region /myRegion
This (i.e. to prevent parallel queues on replicated regions) seems to be a design decision, but I can't understand why it is the case.
I have read all the documentation I've been able to find (primarily http://gemfire.docs.pivotal.io/docs-gemfire/latest/reference/book_intro.html and related docs), and searched for any reference to this exception on the internet, but I didn't find any clear explanation of why I can't have an event listener on each member hosting a replicated region. My conclusion is that I must be missing some fundamental concept about replicated regions and/or parallel queues, but since I can't find the appropriate documentation on my own, I'm asking for an explanation and/or pointers to the right resources to read.
Thanks in advance.
EDIT: Let me put the question into context.
I have an external system sending data to my application using REST services, which are load balanced between nodes in order to maximize performance. Each of the nodes hosts the same regions (let's say three, named A, B and C). The data travels through all those regions (A to B to C) and is processed along the way. This means that region A holds data that has just been received, region B holds data that has been partially processed, and region C holds data whose processing is complete.
I am using event listeners to process data and move it from region to region, and in case of the listener for region C, to export it to another external system.
All the listeners must (and I repeat, must) be transactional.
I also need horizontal scalability (i.e. adding nodes on the fly to increase throughput) and the maximum amount of data replication that can possibly be achieved.
Moreover, I want to run all of the nodes with the same GemFire configuration.
I have already tried partitioned regions, but they do not fit my needs for a bunch of reasons that I won't explain here for the sake of brevity (just trust me, it is not currently possible).
So I thought that having all the nodes host the replicated regions could be the way, but I need all of them to be able to process events independently and perform region synchronization afterwards in an active/active scenario. It is my understanding that this requires event queues to be parallel, but it does not seem possible (by design).
So the (updated) question(s) are:
Is this scenario even possible? And if it is, how can I achieve it?
Any explanation and/or documentation, example, resource or anything else is more than welcome.
Again, thanks in advance.
An AsyncEventQueue is used to write data that arrives in GemFire to some other data store, and you would ideally want to do this only once. Since the content of a replicated region is the same on all members of the system, you only need an async event listener on one member; hence parallel=true is not supported.
For partitioned regions, if only one member hosted the AsyncEventQueue, every single put to the partitioned region would also be routed through that member, introducing a single point of contention in the system. The solution to this problem was the introduction of parallel AsyncEventQueues, so that events on each member are only queued up locally on that member.
GemFire also supports CacheListeners, which are invoked on each member even for replicated regions; however, they are synchronous. You can introduce a thread pool in your CacheListener to get similar behaviour.
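A rough sketch of that workaround, assuming the pre-Geode com.gemstone.gemfire API from your stack trace (the listener name, pool size and doProcess call are placeholders; note that work handed to the pool runs outside the original cache transaction):

```java
import com.gemstone.gemfire.cache.EntryEvent;
import com.gemstone.gemfire.cache.util.CacheListenerAdapter;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class AsyncProcessingListener extends CacheListenerAdapter<String, Object> {

    // Thread pool so the listener callback returns quickly and does not block
    // the cache operation; the size is an arbitrary placeholder.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    @Override
    public void afterCreate(EntryEvent<String, Object> event) {
        String key = event.getKey();
        Object value = event.getNewValue();
        // Hand the work off to the pool; once it leaves the callback thread it is
        // no longer part of the cache operation that triggered it.
        pool.submit(() -> doProcess(key, value));
    }

    private void doProcess(String key, Object value) {
        // placeholder for moving/processing the entry (e.g. region A -> region B)
    }
}
```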
I'm working on a real-time application and building it on Azure.
The idea is that every user reports something about himself and all the other users should see it immediately (they poll the service every second or so for new info).
My approach so far has been to use a Web Role hosting a WCF REST service that does all the writing to the DB (SQL Azure) directly, without a Worker Role, so that data is written immediately.
I've come to think that using a Worker Role and a queue to do the writing might be much more scalable, but it might interfere with the real-time side of the service (the Worker Role might not pick the job up from the queue immediately).
Is it true? How should I go about this issue?
Thanks
While it's true that the queue will add a bit of latency, you'll be able to scale out the number of Worker Role instances to handle the sheer volume of messages.
You can also optimize queue-reading by getting more than one message at a time. Since a single queue has a scalability target of 500 TPS, this lets you go well beyond 500 messages per second on reads.
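As an illustration of batched reads, here is a minimal sketch using the current azure-storage-queue Java SDK (the original Worker Role storage client differs; the queue name and connection string are placeholders):

```java
import com.azure.storage.queue.QueueClient;
import com.azure.storage.queue.QueueClientBuilder;
import com.azure.storage.queue.models.QueueMessageItem;

public class QueueDrainer {

    public static void main(String[] args) {
        QueueClient queue = new QueueClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .queueName("user-updates") // placeholder queue name
                .buildClient();

        // Pull up to 32 messages per round trip instead of one at a time.
        for (QueueMessageItem msg : queue.receiveMessages(32)) {
            process(msg); // placeholder: write the update to SQL / cache
            queue.deleteMessage(msg.getMessageId(), msg.getPopReceipt());
        }
    }

    private static void process(QueueMessageItem msg) {
        // placeholder for the actual write
    }
}
```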
You might look into a Cache for buffering the latest user updates, so when polling occurs, your service reads from cache instead of SQL Azure. That might help as the volume of information increases.
You could have a look at SignalR. It does not support farm scenarios out of the box, but it should work using either internal endpoint calls to update every instance, the Azure Service Bus, or the AppFabric Cache. This way you get a push scenario rather than a pull scenario, so you don't have to poll your endpoints for potential updates.
Please consider the following questions in the context of multiple publications from a scaled-out publisher (using DB subscription storage) and multiple subscriptions with scaled-out subscribers (using distributors), where installs and uninstalls happen regularly for initial deployments, upgrades, etc., using automated MSIs.
Using DB subscription storage, what happens if the DB goes down? If access to the Subscription DB is required in order to Publish a message, how will it be delivered? Will it get lost? Will the call to Bus.Publish throw an exception?
Assuming you need to have no down-time deployments: What if you want to move your subscription DB for a particular publication to a different server? How do you manage a transition like this?
Same question goes for a distributor on the subscriber side: What if you want to move your distributor endpoint? One scenario I can think of is if you have multiple subscriptions utilizing a single distributor machine, it might be hard if you want to move some of them to another distributor server to reduce load.
What would the install/uninstall scenarios look like for a setup like this (both initially, and for continuous upgrades)? It seems like you would want to have some special install/uninstall scripts for deployment of the "logical publication" and subscription DB, as well as for the "logical subscriptions" and the distributors. The publisher instances wouldn't need any special install/uninstall logic (since they just start publishing messages using the configured subscription DB, and then stop when they are uninstalled). The subscriber worker nodes wouldn't need anything special on install other than the correct configuration of the distributor endpoint, but would need uninstall logic to make sure they are removed from the distributors list of worker nodes.
Eventually the publisher will fail and the messages will build up in the internal queue. You will have to plan the size of disk you need to handle this based on the message size and how long you want to wait for the DB to come back up; as a rough, made-up example, 2 KB messages at 50 per second over an hour of downtime is about 2 KB × 50 × 3,600 ≈ 360 MB of queued data. From there it is a question of how much downtime you can handle. You can use DB mirroring or clustering to reduce the DB's downtime.
Mirroring and clustering technologies can also help with this. It depends on whether you want manual or automatic failover and where you're doing it (remote sites?).
Clustering MSMQ could help you here. If you want to drop a distributor and move it within a cluster you'd be ok. Another possibility is to expose your distributors via HTTP and load balance them behind either a software or hardware load balancing solution. Behind the load balancer you'd be more free to move things around.
Sounds like you have a good grasp on this one already :)
To your first question, about the high availability of the subscription DB: you can use a cluster for failover. If the DB is down, then yes, Bus.Publish will throw an exception. It is recommended to keep the subscription DB separate from your application DB to avoid having to bring it down when upgrading your app. This doesn't have to be a separate DB server; a separate DB on the same server is fine.
About moving servers, this is usually managed at a DNS level where for a certain period of time you'll have both running, until communication moves over.
On your third question about distributors - don't share a distributor between different publishers or subscribers.
As a rule of thumb, it is recommended not to add or remove subscribers while doing these kinds of maintenance activities. This usually simplifies things quite a bit.
We have two systems where system A sends data to system B. It is a requirement that each system can run independently of the other and neither will blow up if the other is down. The question is what is the best way for system A to communicate with system B while meeting the decoupling requirement.
System B currently has a process that polls data in a db table and processes any new rows that have been inserted.
One proposed design is for system A to just insert data into system B's DB table and have system B process the new rows with its existing process. The question is: does this solution meet the requirement of decoupling the two systems? Is the database considered part of system B, which might become unavailable and cause system A to blow up?
Another solution is for system A to put data onto an MQ queue and have a process read from MQ and then insert into system B's database. But is this just extra overhead? Ultimately, is an MQ queue any more fault-tolerant than a DB table?
Generally speaking, database sharing is a form of close coupling and is not to be preferred, except possibly for speed. This is not only for availability reasons: system A and system B will be changed and upgraded at several points in their future and should have minimal dependencies on each other. Message passing is an obvious, explicit dependency, whereas shared databases tend to bite you (or your inheritors) on the posterior when least expected. If you go the database-sharing route, at least make the sharing interface explicit with dedicated tables or views.
There are four common levels of integration:
Database sharing
File sharing
Remote procedure call
Message passing
which can be applied and combined in various situations, with different availability and maintainability. You have an excellent overview at the enterprise integration patterns site.
As with any central integration infrastructure, MQ should be hosted in an environment with great availability, full failover &c. There are other queue solutions which allow you to distribute the queue coordination.
Use Queues for communication. Do not "pass" data from System A to System B through the database. You're using the database as a giant, expensive, complex message queue.
Use a message queue as a message queue.
This is not "Extra" overhead. This is the best way to decouple systems. It's called Service Oriented Architecture (SOA) and using messages is absolutely central to the design.
An MQ queue is far simpler than a DB table.
Don't compare "fault tolerance" because an RDBMS uses huge (almost unimaginable) overheads to achieve a reasonable level of assurance that your transaction finished properly. Locking. Buffering. Write Queues. Storage Management. Etc. Etc.
A reliable message queue implementation uses some backing store to keep the queue's state. The overhead is much, much less than an RDBMS. The performance is much better. And it's much, much simpler to interact with.
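As a sketch of what the queue hand-off could look like with a JMS broker such as ActiveMQ (the connection URL, queue name and payload are placeholders; system B would run its own consumer process against the same queue):

```java
import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.MessageProducer;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class SystemASender {

    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://localhost:61616"); // placeholder broker URL

        Connection connection = factory.createConnection();
        try {
            connection.start();
            Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
            Queue queue = session.createQueue("systemB.inbound"); // placeholder queue name

            MessageProducer producer = session.createProducer(queue);
            // Persistent delivery: the broker stores the message, so system B can be
            // down while system A keeps sending.
            producer.setDeliveryMode(DeliveryMode.PERSISTENT);

            TextMessage message = session.createTextMessage("{\"orderId\": 42}"); // placeholder payload
            producer.send(message);
        } finally {
            connection.close();
        }
    }
}
```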
In SQL Server I would do this through an SSIS package or a job (depending on the number of records and the complexity of what I was moving). Other databases also have ETL solutions. I like the ETL solution because I can keep logs of what was changed and what errors were processed, and I can send records which for some reason won't go to the other system (data structures are rarely the same between two databases) to a holding table without killing the rest of the process. I can also transform the data as it flows to adjust for database differences (things like lookup table values: say the completed status in db1 is 5 and it is 7 in db2, or db2 has a required field that db1 does not and you have to supply a default value when it is null). If one or the other server is down, the job running the SSIS package will fail and neither system will be affected, so it keeps the databases decoupled in a way that using triggers or replication would not.
An existing process changes the status field of a booking record in a table, in response to user input.
I have another process to write, that will run asynchronously for records with a particular status. It will read the table record, perform some operations (including calls to third party web services), and update the record's status field to indicate that processing is completed (or In Error, with an error count).
This operation sounds very similar to a queue. What are the benefits and tradeoffs of using MSMQ over a SQL Table in this situation, and why should I choose one over the other?
It is our software that is adding and updating records in the table.
It is a new piece of work (a Windows Service) that will be performing the asynchronous processing. This needs to be "always up".
There are several reasons, which were discussed on the Fog Creek forum here: http://discuss.fogcreek.com/joelonsoftware5/default.asp?cmd=show&ixPost=173704&ixReplies=5
The main benefit is that MSMQ can still be used when there is intermittent connectivity between computers (using a store-and-forward mechanism on the local machine). As far as the application is concerned, it has delivered the message to MSMQ, even though MSMQ may actually deliver the message later.
You can only insert a record to a table when you can connect to the database.
A table approach is better when a workflow is required: the process moves through various stages, and those stages need to be persisted in the DB.
If the rate at which booking records are created is low, I would have the second process periodically check the table for new bookings.
Unless you are already using MSMQ, introducing it just gives you an extra platform component to support.
If the database is heavily loaded, or you get a lot of lock contention with two processes reading and writing to the same region of the bookings table, then consider introducing MSMQ.
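If you do stay with the table, the polling side of the Windows Service is straightforward; here is a rough sketch of the pattern (shown with JDBC for illustration; table, column and status names are placeholders, and real code would add transactions or row locking to avoid double-processing):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class BookingPoller {

    // Placeholder connection details.
    private static final String URL = "jdbc:sqlserver://localhost;databaseName=bookings";

    public void pollOnce() throws Exception {
        try (Connection conn = DriverManager.getConnection(URL, "user", "password")) {
            // 1. Find bookings in the status the existing process sets for us.
            List<Long> ids = new ArrayList<>();
            try (PreparedStatement select = conn.prepareStatement(
                         "SELECT booking_id FROM booking WHERE status = 'READY_FOR_PROCESSING'");
                 ResultSet rs = select.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong("booking_id"));
                }
            }

            // 2. Process each one (web service calls etc.) and record the outcome.
            for (long id : ids) {
                String newStatus = process(id); // placeholder for the real work
                try (PreparedStatement update = conn.prepareStatement(
                        "UPDATE booking SET status = ? WHERE booking_id = ?")) {
                    update.setString(1, newStatus);
                    update.setLong(2, id);
                    update.executeUpdate();
                }
            }
        }
    }

    private String process(long bookingId) {
        return "COMPLETED"; // or "IN_ERROR" plus an error count, per the question
    }
}
```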
I also like this answer from le dorfier in the previous discussion:
"I've used tables first, then refactor to a full-fledged msg queue when (and if) there's reason - which is trivial if your design is reasonable."
Thanks, folks, for all the answers. Most helpful.
With MSMQ you can also offload the work to another server very easily, by changing the location of the queue to another machine rather than the DB server.
By the way, as of SQL Server 2005 there is a built-in queue in the DB. It's called SQL Server Service Broker.
See: http://msdn.microsoft.com/en-us/library/ms345108.aspx
Also see previous discussion.
If you have MSMQ expertise, it's a good option. If you know databases but not MSMQ, ask yourself if you want to become expert in another technology; whether your application is a critical one; and which you'd rather debug when there's a problem.
I have recently been investigating this myself, so I wanted to mention my findings. The location of the database relative to your application is a big factor in deciding which option is faster.
I measured the time it took to insert 100 database entries versus logging the exact same data as local MSMQ messages, and then averaged the results over several runs of the test.
What I found was that when the database is on the local network, inserting a row was up to 4 times faster than logging to MSMQ.
When the database was being accessed over a decent internet connection, inserting a row into the database was up to 6 times slower than logging to MSMQ.
So:
If the database is local, the DB is faster; otherwise, MSMQ is.
Instead of making raw MSMQ calls, it might be easier to implement your service as a queued COM+ component and make queued function calls from your client application. In the end, the asynchronous service still uses MSMQ in the background, but your code will be much clearer and easier to use.
I would probably go with MSMQ or ActiveMQ myself. I would suggest (presuming that, since you are considering MSMQ, you are on Windows with MS technology) looking into WCF, or, if you are using MS SQL 2005+, a trigger that calls into .NET code to run your processing.
Service Broker was introduced in SQL 2005 and is designed to be very quick at handling messages, as the process is relatively simple (I believe its roots were in triggers). If you are concerned about scalability, in SQL 2008 they released an independent processing executable to separate the processing from SQL Server (in standard Service Broker, everything is controlled by the SQL Server instances).
I would definitely consider using Service Broker over MSMQ, but this is dependent on your SQL development/DBA resources and their knowledge.
Besides Mitch's answer, some other scenarios:
1. Each of your messages has its own due date to trigger the action. This can be done through MQ as well, but in this case I prefer to store it in the DB as it is more controllable;
2. The subscriber needs to filter messages and then process only a portion of them. This can be done with LINQ too; depending on how complex the filter is, the DB approach is better because I can use LINQ to EF to do complex queries easily;
3. For deployment, I want a fully automated deployment process, so the DB is a better choice for me. I am not a big fan of manual configuration.