NServiceBus DB insert duplicate

We have a data loader service that uses NServiceBus to insert data (if not already present) into a SQL database. The queue is configured with a concurrency level greater than 1 because the data to load might get huge. Since the concurrency level is greater than 1, it results in duplicate inserts. Is there a way to handle this within NServiceBus?
Note: We have already considered and ruled out creating thread-safe locks.

Generally speaking, there's no need to run the endpoint with a concurrency level of one. You also don't need to manage threading or fiddle with concurrency locks yourself when it comes to NServiceBus. There are other factors in how the system needs to be designed to make it work:
Different transports have different levels of transaction support. Choose one that supports transactions; then, if a message is retried, you won't get duplicated messages/data.
Design your processing to be idempotent. That means that even without transactions (not supported by the transport, or disabled in code), processing a message twice won't produce duplicate data or side effects. The 'how' depends on the data you're dealing with and your domain.
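To illustrate the idempotency point: one option is to pair a unique index on the natural key with an "insert if not exists" statement, so a duplicate or retried message becomes a no-op. A minimal sketch using the NServiceBus handler signature; the message, table, and column names here are invented for illustration:

```csharp
using System.Data.SqlClient;
using System.Threading.Tasks;
using NServiceBus;

// Hypothetical message type for illustration only.
public class ImportRecord : ICommand
{
    public string ExternalId { get; set; }
    public string Payload { get; set; }
}

// Sketch of an idempotent handler: a unique index on ExternalId plus an
// "insert if not exists" statement means a retried or duplicate message
// results in zero rows inserted instead of a duplicate row.
public class ImportRecordHandler : IHandleMessages<ImportRecord>
{
    private readonly string connectionString = "...";   // supplied elsewhere

    public async Task Handle(ImportRecord message, IMessageHandlerContext context)
    {
        const string sql = @"
            INSERT INTO ImportedRecords (ExternalId, Payload)
            SELECT @ExternalId, @Payload
            WHERE NOT EXISTS
                (SELECT 1 FROM ImportedRecords WHERE ExternalId = @ExternalId);";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@ExternalId", message.ExternalId);
            command.Parameters.AddWithValue("@Payload", message.Payload);
            await connection.OpenAsync();
            await command.ExecuteNonQueryAsync();
        }
    }
}
```

Even with a concurrency level above 1, two handlers racing on the same ExternalId can't both insert: the unique index rejects the second attempt, which the handler can catch and ignore.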

Related

Multiple application on network with same SQL database

I will have multiple computers on the same network with the same C# application running, connecting to a SQL database.
I am wondering if I need to use the service broker to ensure that if I update record A in table B on Machine 1, the change is pushed to Machine 2. I have seen applications that need to use messaging servers to accomplish this before but I was wondering why this is necessary, surely if they connect to the same database, any changes from one machine will be reflected on the other?
Thanks :)
This is mostly about consistency and latency.
If your applications always perform atomic operations on the database, and they always read whatever they need with no caching, everything will be consistent.
In practice, this is seldom the case. There are plenty of hidden opportunities for caching, like when you have an edit form: it holds the values the entity had before you started editing, but what if someone modified them in the meantime? You'd simply overwrite their changes with your data.
Solving this is a bunch of architectural decisions. Different scenarios require different approaches.
Once data is committed in the database, everyone reading it will see the same thing - but only if they actually get around to reading it, and the two reads aren't separated by another commit.
Update notifications are mostly concerned with invalidating caches, and perhaps some push-style processing (e.g. IM client might show you a popup saying you got a new message). However, SQL Server notifications are not reliable - there is no guarantee that you'll get the notification, and even less so that you'll get it in time. This means that to ensure consistency, you must not depend on the cached data, and you have to force an invalidation once in a while anyway, even if you didn't get a change notification.
Remember, even if you're actually using a database that's close enough to ACID, it's usually not the default setting (for performance and availability, mostly). You need to understand what kind of guarantees you're getting, and how to write code to handle this. Even the most perfect ACID database isn't going to help your consistency if your application introduces those inconsistencies :)
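For the edit-form case above, a common mitigation is optimistic concurrency: remember the row version you read and make the update conditional on it. A rough sketch assuming a SQL Server rowversion column; the table and column names are invented for the example:

```csharp
using System.Data;
using System.Data.SqlClient;

public class RecordEditor
{
    private readonly string connectionString = "...";   // supplied elsewhere

    // Optimistic concurrency: the UPDATE matches zero rows if someone else
    // changed the record after we read it, so their changes are never
    // silently overwritten.
    public void SaveEdit(int recordId, string editedName, byte[] originalRowVersion)
    {
        const string sql = @"
            UPDATE Records
            SET    Name = @Name
            WHERE  Id = @Id AND RowVersion = @OriginalRowVersion;";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(sql, connection))
        {
            command.Parameters.AddWithValue("@Name", editedName);
            command.Parameters.AddWithValue("@Id", recordId);
            command.Parameters.AddWithValue("@OriginalRowVersion", originalRowVersion);

            connection.Open();
            if (command.ExecuteNonQuery() == 0)
            {
                // Someone else committed first: reload and let the user merge or retry.
                throw new DBConcurrencyException("Record was modified by another machine.");
            }
        }
    }
}
```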

When exactly do we use WCF transactions?

I was trying to get some info on WCF transactions, and I did manage to get info on how to use them. What I didn't get much info on is Why/When to use them.
What is the difference between database transactions and WCF transactions? Is there any specific case when either of these approach is preferred over the other?
By WCF transactions what you are really asking about is Microsoft's implementation of the WS-AtomicTransaction web service extension standard.
Why/When to use them
Similar to using a database transaction to guarantee consistency within the database, WS-AtomicTransaction is used to guarantee consistency within a larger, distributed system based on communication over SOAP 1.2 services. This distributed system may or may not include database writes, but more often than not it will.
Transactions propagated to the service from clients will cause the internal code of the service to execute within the context of the client's transaction.
So, in that same way a database transaction can wrap multiple database writes into a single unit of work, a WCF transaction can wrap multiple service calls into a single unit of work, so that a failure in one will roll back the others.
This, as you can imagine, is hugely costly from a resource perspective, so these kinds of cross-network transactions should be rarely (if ever) used unless absolutely necessary.
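For reference, flowing a client transaction into WCF calls is mostly a matter of attributes plus a TransactionScope on the client; the binding must also have transaction flow enabled. A sketch with invented contract and type names:

```csharp
using System.ServiceModel;
using System.Transactions;

public class Order
{
    public int CustomerId { get; set; }
    public decimal Total { get; set; }
}

[ServiceContract]
public interface IOrderService
{
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]      // the client MAY flow its transaction
    void PlaceOrder(Order order);
}

[ServiceContract]
public interface IBillingService
{
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void ChargeCustomer(int customerId, decimal total);
}

public class OrderService : IOrderService
{
    [OperationBehavior(TransactionScopeRequired = true)]  // execute inside the flowed transaction
    public void PlaceOrder(Order order)
    {
        // database writes here enlist in the client's transaction
    }
}

public static class Client
{
    // Both service calls commit or roll back as a single unit of work.
    public static void PlaceAndBill(IOrderService orders, IBillingService billing, Order order)
    {
        using (var scope = new TransactionScope())
        {
            orders.PlaceOrder(order);
            billing.ChargeCustomer(order.CustomerId, order.Total);
            scope.Complete();   // if we never get here, both calls roll back
        }
    }
}
```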

Good ways to decouple GUIs from SOAP/WS-API update/write calls?

Let's assume we have some configuration GUI that in its current form uses direct DB transactions to submit new configurations for more than one configurable component in a consistent manner.
Now let's move the data (DB) stuff behind some SOAP/WS API. The GUI has no direct DB access anymore. The transactional behaviour must remain, but the API should NOT be designed to explicitly accommodate the GUI form submissions. In fact, I don't even know how the new GUI will work or how the user input will be structured. Therefore I need to provide something like WS-AtomicTransaction on the API server side. However, there are (at least) two caveats:
The GUI is written in PHP: I don't think there is any WS-Transaction support in PHP available.
I don't want to keep DB transactions open on the server side while waiting for additional client requests.
Solutions I can think of:
using Camel's aggregation. However, that would make things more complicated in at least two ways:
You cannot use DB row IDs of newly inserted rows in subsequent calls inside the same transaction. You would need some sort of symbolic back-referencing, because there would be no communication between client and server while the aggregated messages are processed.
call replies would not be immediate (or the immediate and separate reply to each single call would only be some sort of a stub, ie. not containing any useful information beyond "your message has been attached to TX xyz" -- if that's at all possible in the Camel aggregation case).
the two disadvantages of the previous solution make me think of request batches where possibly the WS standards provide means for referencing call results in subsequent calls inside the batch transaction. Is there any such thing already available? Maybe even as a PHP client?
trying to eliminate lock contention in the database by carefully using row-level locks etc. However, when inserting new elements, my guess is that usually pages and index pages need to be locked by the DB.
maybe some server-side persistence layer using optimistic locking? But again, that would not return any DB IDs back to the client before the final commit if DB writes would be postponed until the commit (don't know if that's possible at all).
What do YOU think?
Transactions are a powerful tool and we easily get into a thinking pattern in which we see every problem as a nail we hit with this big hammer. I can relate to your confusion because I've experienced it myself. Unfortunately I have no better advice for you than to try not to think in terms of transactions but of atomic API calls.
When I think in terms of transactions, my thought pattern usually goes like this:
start transaction
read (repeat as required)
update (repeat as required)
commit/roll back
It takes some time to realize that we overuse this pattern. Actual conflicts are rare and there are many other ways of dealing with them. Here is one commonly used in APIs:
read and send data to client (atomic API call)
update data (on the client)
send original + updates back to the server (atomic API call)
start transaction (on server)
read
compare with original from client
if not same, return error (client should retry)
if same, update
commit
The last six points are part of the implementation of the API call; a rough sketch follows below.
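A rough sketch of that pattern, with invented types and a hypothetical repository; the conflict check here compares a version number read earlier by the client:

```csharp
using System.Transactions;

public class Booking
{
    public int Id { get; set; }
    public int Version { get; set; }     // bumped on every successful save
    public string Status { get; set; }
}

public interface IBookingRepository        // hypothetical data access
{
    Booking Load(int id);
    void Save(Booking booking);            // expected to increment Version
}

public class UpdateBookingRequest
{
    public Booking Original { get; set; }  // the state the client read earlier
    public Booking Updated { get; set; }   // the client's edits
}

public enum UpdateResult { Success, Conflict }

public class BookingService
{
    private readonly IBookingRepository repository;

    public BookingService(IBookingRepository repository) => this.repository = repository;

    public UpdateResult UpdateBooking(UpdateBookingRequest request)
    {
        using (var tx = new TransactionScope())                   // start transaction (on server)
        {
            var current = repository.Load(request.Original.Id);   // read

            if (current.Version != request.Original.Version)      // compare with original
                return UpdateResult.Conflict;                      // client should retry

            repository.Save(request.Updated);                      // update
            tx.Complete();                                          // commit
            return UpdateResult.Success;
        }
    }
}
```

The transaction stays open only for the duration of this one call; nothing is held while waiting for the user.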
Ferenc Mihaly
http://theamiableapi.com

Database good system decoupling point?

We have two systems where system A sends data to system B. It is a requirement that each system can run independently of the other and neither will blow up if the other is down. The question is what is the best way for system A to communicate with system B while meeting the decoupling requirement.
System B currently has a process that polls data in a db table and processes any new rows that have been inserted.
One proposed design is for system A to just insert data into system b's db table and have system B process the new rows by the existing process. Question is does this solution meet the requirement of decoupling the two systems? Is a database considered part of a system B which might become unavailable and cause system A to blow up?
Another solution is for system A to put data into an MQ queue and have a process that would read from MQ and then insert into system B's database. But is this just extra overhead? Ultimately is an MQ queue any more fault tolerant than a db table?
Generally speaking, database sharing is a close coupling and not to be preferred except possibly for speed purposes. Not only for availability purposes, but also because system A and B will be changed and upgraded at several points in their future, and should have minimal dependencies on each other - message passing is an obvious dependency, whereas shared databases tend to bite you (or your inheritors) on the posterior when least expected. If you go the database sharing route, at least make the sharing interface explicit with dedicated tables or views.
There are four common levels of integration:
Database sharing
File sharing
Remote procedure call
Message passing
which can be applied and combined in various situations, with different availability and maintainability. You have an excellent overview at the enterprise integration patterns site.
As with any central integration infrastructure, MQ should be hosted in an environment with high availability and full failover, etc. There are other queue solutions which allow you to distribute the queue coordination.
Use Queues for communication. Do not "pass" data from System A to System B through the database. You're using the database as a giant, expensive, complex message queue.
Use a message queue as a message queue.
This is not "Extra" overhead. This is the best way to decouple systems. It's called Service Oriented Architecture (SOA) and using messages is absolutely central to the design.
An MQ queue is far simpler than a DB table.
Don't compare "fault tolerance" because an RDBMS uses huge (almost unimaginable) overheads to achieve a reasonable level of assurance that your transaction finished properly. Locking. Buffering. Write Queues. Storage Management. Etc. Etc.
A reliable message queue implementation uses some backing store to keep the queue's state. The overhead is much, much less than an RDBMS. The performance is much better. And it's much, much simpler to interact with.
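As a sketch of what the queue-based hand-off can look like with plain MSMQ (System.Messaging); the queue path and message type here are invented:

```csharp
using System.Messaging;

// Hypothetical payload type passed from System A to System B.
public class BookingData
{
    public int Id { get; set; }
    public string Payload { get; set; }
}

public static class SystemIntegration
{
    private const string QueuePath = @".\Private$\SystemB.Input";

    // System A: fire and forget; succeeds even if System B is down right now.
    public static void Send(BookingData data)
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(BookingData) });
            queue.Send(new Message(data));
        }
    }

    // System B: drains the queue whenever it is running.
    public static void ProcessOne()
    {
        using (var queue = new MessageQueue(QueuePath))
        {
            queue.Formatter = new XmlMessageFormatter(new[] { typeof(BookingData) });
            var message = queue.Receive();            // blocks until a message arrives
            var data = (BookingData)message.Body;
            // insert into System B's database here
        }
    }
}
```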
In SQL Server I would do this through an SSIS package or a job (depending on the number of records and the complexity of what I was moving). Other databases also have ETL solutions. I like the ETL solution because I can keep logs of what was changed and which errors were processed, and I can send records which for some reason won't go to the other system (data structures are rarely the same between two databases) to a holding table without killing the rest of the process. I can also make changes to the data as it flows to adjust for database differences (things like lookup table values: say the completed status in db1 is 5 and it is 7 in db2, or say db2 has a required field that db1 does not and you have to add a default value to the field if it is null). If one server or the other is down, the job running the SSIS package will fail and neither system will be affected, so it keeps the databases decoupled in a way that triggers or replication would not.

MSMQ v Database Table

An existing process changes the status field of a booking record in a table, in response to user input.
I have another process to write, that will run asynchronously for records with a particular status. It will read the table record, perform some operations (including calls to third party web services), and update the record's status field to indicate that processing is completed (or In Error, with an error count).
This operation sounds very similar to a queue. What are the benefits and tradeoffs of using MSMQ over a SQL Table in this situation, and why should I choose one over the other?
It is our software that is adding and updating records in the table.
It is a new piece of work (a Windows Service) that will be performing the asynchronous processing. This needs to be "always up".
There are several reasons, which were discussed on the Fog Creek forum here: http://discuss.fogcreek.com/joelonsoftware5/default.asp?cmd=show&ixPost=173704&ixReplies=5
The main benefit is that MSMQ can still be used when there is intermittent connectivity between computers (using a store-and-forward mechanism on the local machine). As far as the application is concerned, it has delivered the message to MSMQ, even though MSMQ may actually deliver it later.
You can only insert a record into a table when you can connect to the database.
A table approach is better when a workflow approach is required, and the process will move through various stages, and these stages need persisting in the DB.
If the rate at which booking records are created is low, I would have the second process periodically check the table for new bookings (see the sketch below).
Unless you are already using MSMQ, introducing it just gives you an extra platform component to support.
If the database is heavily loaded, or you get a lot of lock contention with two process reading and writing to the same region of the bookings table, then consider introducing MSMQ.
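A rough sketch of the table-polling approach mentioned above, with invented table, column, and status names; the READPAST hint lets several pollers run without grabbing the same rows:

```csharp
using System.Data.SqlClient;

public class BookingPoller
{
    private readonly string connectionString = "...";   // supplied elsewhere

    // Claim a batch of new bookings and mark them as in progress in one
    // statement, so two pollers never pick up the same row.
    public void ProcessNewBookings()
    {
        const string claimSql = @"
            UPDATE TOP (10) Bookings WITH (ROWLOCK, READPAST)
            SET    Status = 'Processing'
            OUTPUT inserted.BookingId
            WHERE  Status = 'New';";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(claimSql, connection))
        {
            connection.Open();
            using (var reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    int bookingId = reader.GetInt32(0);
                    // call the third-party web services here, then set Status to
                    // 'Completed' or 'Error' for this booking in a follow-up UPDATE
                }
            }
        }
    }
}
```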
I also like this answer from le dorfier in the previous discussion:
I've used tables first, then refactor to a full-fledged msg queue when (and if) there's reason - which is trivial if your design is reasonable.
Thanks, folks, for all the answers. Most helpful.
With MSMQ you can also offload the work to another server very easily, by changing the location of the queue to another machine rather than the DB server.
By the way, as of SQL Server 2005 there is a built-in queue in the DB. It's called SQL Server Service Broker.
See : http://msdn.microsoft.com/en-us/library/ms345108.aspx
Also see previous discussion.
If you have MSMQ expertise, it's a good option. If you know databases but not MSMQ, ask yourself if you want to become expert in another technology; whether your application is a critical one; and which you'd rather debug when there's a problem.
I have recently been investigating this myself, so I wanted to mention my findings. The location of the database relative to your application is a big factor in deciding which option is faster.
I tested the time it took to insert 100 database entries versus logging the exact same data into a local MSMQ message, and took the average of the results over several runs.
What I found was that when the database is on the local network, inserting a row was up to 4 times faster than logging to an MSMQ.
When the database was being accessed over a decent internet connection, inserting a row into the database was up to 6 times slower than logging to an MSMQ.
So:
Local database: inserting directly into the DB is faster; remote database: MSMQ is.
Instead of making raw MSMQ calls, it might be easier if you implement your service as a queued COM+ component and make queued function calls from your client application. In the end, the asynchronous service still uses MSMQ in the background, but your code will be much clearer and easier to use.
I would probably go with MSMQ or ActiveMQ myself. Presuming that, since you are considering MSMQ, you are on Windows with Microsoft technology, I would suggest looking into WCF, or, if you are using MS SQL 2005+, having a trigger that calls into .NET code to run your processing.
Service Broker was introduced in SQL 2005 and it is designed to be very quick at handling messages, as the process is relatively simple (I believe its roots were in triggers). If you are concerned about scalability, in SQL 2008 they released an independent processing executable to separate the processing from SQL Server (in standard Service Broker, everything is controlled by the SQL Server instances).
I would definitely consider using Service Broker over MSMQ, but this is dependent on your SQL development/DBA resources and their knowledge.
Besides Mitch's answer, some other scenarios:
1. Each of your messages has its own due date to trigger the action. This can be done through MQ as well, but in this case I prefer to store it in the DB as it is more controllable.
2. The subscriber needs to filter messages and then process only a portion of them. This can be done with LINQ too; depending on how complex the filter is, the DB approach is better because I can use LINQ to EF to do complex queries easily.
3. For deployment, I want a fully automated deployment process, so the DB is a better choice for me. I am not a big fan of manual configurations.