Gathering distributed data into central database

Gathering distributed data into central database - wcf

I was assigned to update existing system of gathering data coming from points of sale and inserting it into central database. The one that is working now is based on FTP/SFTP transmission, where the information is sent once a day, usually at night. Unfortunately, because of unstable connection links (low quality 2G/3G modems), some of the files appear to be broken. With just a few shops connected that way everything was working smooth, but along with increasing number of shops, errors became more often. What is worse, the time needed to insert data into central database is taking up to 12 - 14h (including waiting for the data to be downloaded from all of the shops) and that cannot happen during the working day as it would block the process of creating sale reports and other activities with the database - so we are really tight with processing time here.
The idea my manager suggested is to send the data continuously, during the day. Data packages would be significantly smaller, so their transmission and insertion would be much faster, central server would contain actual (almost real time) data and night could be used for long running database activities like creating backups, rebuilding indexes etc.
After going through many websites, I found that:
using ASMX web service is now obsolete and WCF should be used instead
WCF with MSMQ or System Messaging could be used to safely transmit data, where I don't have to care that much about acknowledging delivery of data, consistency, nodes going offline etc.
according to http://blogs.msdn.com/b/motleyqueue/archive/2007/09/22/system-messaging-versus-wcf-queuing.aspx WCF queuing is better
there are also other technologies for implementing message queue, like RabbitMQ, ZeroMQ etc.
And that is where I become confused. With so many options, do you have any pros and cons of these technologies?
We were using .NET with Windows Forms and SQL Server, but if it would be necessary, we could change to something more suitable. I am also a bit afraid of server efficiency. After some calculations, server would be receiving about 15 packages of data per second (peak). Is it much? I know there are many websites without serious server infrastructure, that handle hundreds of visitors online and still run smooth, but the website mainly uploads data to the client, and here we would download it from the client.
I also found somewhat similar SO question: Middleware to build data-gathering and monitoring for a distributed system
where DDS was mentioned. What do you think about introducing some middleware servers that would cope with low quality links to points of sale, so the main server would not be clogged with 1KB/s transmission?
I'd be grateful with all your help. Thank you in advance!

Rabbitmq can easily cope with thousands of 1kb messages per second.
As your use case is not about processing real time data, I'd say you should combine few messages and send them as a batch. That would be good enough in order to spread load over the day.
As the motivation here is not to process the data in real time, then any transport layer would do the job. Even ftp/sftp. As rabbitmq will work fine here, it's not the typical use case for it.
As you mentioned that one of your concerns is slow/unreliable network, I'd suggest to compress the files before sending them, and on the receiving end, immediately verify their integrity. Rsync or similar will probably do great job in doing that.

From what I understand, you have basically two problems:
Potential for loss/corruption of call data
Database write performance
The potential for loss/corruption of call data is being caused by a lack of reliability in the transmission of data from client to service.
And it's not clear what is causing the database contention/performance issues, beyond a vague reference to high volumes, so this answer will be more geared towards solving the first problem.
You have correctly identified the need for reliable asynchronous communication transport as a way to address the reliability issues in your current setup.
Looking at MSMQ to deliver this is a valid first step. MSMQ provides reliable communication via a store and forward messaging semantic which comes out of the box and requires very little in the way of configuration.
Unfortunately, while suitable for your needs, MSMQ relies on 2 things:
A reliable network protocol, and
A client service running on both sending and receiving machine.
From your description above, I don't believe 1 exists (the internet is not a reliable network), and you might well struggle with 2 - MSMQ only ships with Windows Server or business/enterprise versions of Windows on the desktop.(*see below...)
As a possible solution to the network reliability problem, you could use a WCF or a RESTful endpoint (using Nancy or WebApi) to expose a service operation(s) exposed over HTTP, which would accept the incoming calls from the client machines. These technologies are quite different, so you'll need to make sure you're making the correct choice early on.
WCF supports WS-ReliableMessaging from the SOAP 1.2 specification out of the box, which allows for reliable web service calls over http, however it's very config-heavy and not generally a nice framework to work with.
REST much simpler than WCF in .Net, is very lightweight and easy to use. However, for reliable delivery you would have to expose some kind of GET operation (in addition to a POST to allow the client to send data) to be called (within a reasonable time-frame) to verify the data was committed. The client would have to implement some kind of retry semantic if the result of the GET "acknowledgement" was negative.
Despite requiring two operations rather than one for the WCF route, I would favour the REST approach. I've done plenty of both and find REST services way nicer to work with.
(*) That's not to say that MSMQ wouldn't work in your ultimate solution, just that it would not be used to address the transmission reliability issue. However it could still be used to address another of your problems, that of database write contention. If you were to queue incoming requests once they came into the server, then these could be processed by an "offline" process, which could then perform the required database operations in a reliable manner. This could be done by using MSMQ transactional queues.
In response to comments:
99% messages are passed from shop to main server, but if some change
is needed (price correction, discounts etc.), that data has to be sent
to shop.
This kind of changes things. Had I understood from the beginning that you had a bidirectional requirement, and seeing as how you have managed to establish msmq communication, I would have nudged you towards NServiceBus, which is a really, really cool wrapper around MSMQ. The reason I would have done this is that you appear to have both a one way, and a publish-subscribe requirement, which is supported really nicely by NServiceBus.

Related

Deploy REST API over multiple servers world-wide, but stay in sync

I've built a REST API with a pretty decent latency. Each request is answered in ~100 ms with a thousand requests per second. This is however with a relatively low physical distance to the data center. The users of this API would, however, be spread all over the globe. From the US for example (to a data center in Germany), the response time for a single request is ~400 ms under no load.
What would be the best approach to deploying this API? I suspect multiple servers at different locations, with a load balancer in front. How would I ensure that the MySQL database would stay in sync between the servers in that case?
With multiple servers and a load balancer, the costs rise exponentially, which is something I can hopefully afford in the future, but not at the moment.
I'd love to hear your ideas!

Afaik. for big projects people use event sourcing with an event storage and microservices and message queues between them or a basic solution is polling the event storage through a simple REST API something like send me the latest events since the last event I received. If you can accept eventual consistency, then I think this approach can work pretty well. It makes writing somewhat slower, but reading can be very fast with it. No need to sync the MySQL databases directly, you just pull the latest events and use a projection to update the local MySQL database. So the event storage is the single source of truth.

nservicebus distributor model vs msft sql server

We are currently setting up nServiceBus in a distributor/worker model and I was wondering if it is really worth it for us.
In our initial test lab, I have 2 clustered distributors and one worker (more workers in prod). What I am wondering is if it would be just as effective to leverage our high availability SQL Server for storage and rebuild the servers to all handle the work instead of having dedicated distributors and workers. All of our messages get onto the bus via a simple .Net Web API service. I could install that service on each box along with the endpoint dlls and have them all talk to SQL server which has more than enough horsepower to handle the load. We have a load balancer available to us to distribute the messages to the handlers.
What would some of the drawbacks be in taking this approach vs the distributor model?
What has me concerned is a line from David Boike's book on nServiceBus (great book BTW) that I just read...
"Using SQL Server as a transport can be a great choice for small
projects on teams that already use SQL Server"
The small projects part is what I am worried about. This is by no means a small project and it will have a pretty high volume of messages flowing through this layer as we refactor more systems to be message driven.
Has anyone been down the same road comparing SQL server to distributor and where did you come out?
Thanks

What I was referring into the book on the quote you mentioned was that there are times when you have a fairly small solution, all in a single SQL Server database, and you want to introduce some messaging around the edges. The SQL Server transport makes it easy to do that without adding a bunch of additional overhead and moving parts. If you keep everything in one database, you can even ditch the Distributed Transactions Coordinator. It can also be really useful for integrating with a legacy system where you monitor for changes via database triggers.
However, keep in mind (and if there's a next edition, I'll be sure to go into a little more detail about this) that the SQL Server transport uses a Broker pattern, that is, all communication must go through SQL Server so it becomes a central point of failure and a central bottleneck. The default MSMQ transport, on the other hand, follows the Bus architectural style, meaning it's completely decentralized. Each endpoint can run completely on its own, at least until you introduce additional dependencies.
Andreas benchmarked the new transports, and found that on V4 MSMQ was capable of roughly 6000 sends/s and 2300 receives/s, and that SqlServer was on par with that, but on MSMQ that is roughly per server (each server gets its own throughput), with the SQL Server transport that is going to be your total achievable throughput, period, and any endpoints you add will have to share it.
Of course, broker-style transports (the rest of the new transports in 4.0 are brokers too) do have some advantages over MSMQ. The biggest is that you don't need to use the Distributor to scale out. In a broker, the "queue" is centralized so you can simply spin up additional endpoints pointing at the same input queue in a competing consumers pattern.
Of course as in all things, your mileage may vary, but if you are planning an ambitious system, then the SQL Server transport may not be for you, as you will at some point get mired down in that point where your only option is to scale up your SQL Server instance.

How can I handle 200K request per sec in wcf

I need to design a system that can handle 200K request per second in each machine over HTTP.
The wcf service need to be hosted under win service.
I wonder if wcf can handle such a requirement?
What is the best system setup/ best configuration?
The machine itself is pretty heavy 32G RAM and 8 core (or more), and can be upgraded if needed
Can I handle such amount of request in each single machine with wcf using http?

Doing this on a single machine is likely to be pretty tough (if indeed it's possible). It would be better to make your system scale horizontally, so you can add lots of machines as required. How you do that will depend on what your system actually needs to do. If it's some simple calculation which requires no persisted state, it shouldn't be too hard. If you've got some interaction with storage of some form which really needs to be read/written on each request, it'll be a lot harder - and choosing your persistence technology is likely to be pretty key to making it all hang together.
Note that there are other benefits to scaling horizontally too - in particular, the ability to upgrade the system without any downtime (if you're careful) and removing a huge single point of failure.

You need to give some more info on this.
Do you get the request and have to process it immediately?
Can you store the request data and delegate the processing to some other thread/process? Is there any way to scale the system out instead of up?
Is this in fact the only piece of infrastructure you can deploy stuff to?
I would start by asking what is it that I want to do during request handling. then what the bottlenecks are going to be.

Should I use MSMQ or SQL Service Broker for transactions?

I've been asked by my team leader to investigate MSMQ as an option for the new version of our product. We use SQL Service Broker in our current version. I've done my fair share of experimentation and Googling to find which product is better for my needs, but I thought I'd ask the best site I know for programming answers.
Some details:
Our client is .NET 1.1 and 2.0 code; this is where the message will be sent from.
The target in a SQL Server 2005 instance. All messages end up being database updates or inserts.
We will send several updates that must be treated as a transaction.
We have to have perfect message recoverability; no messages can be lost.
We have to be asynchronous and able to accept messages even when the target SQL server is down.
Developing our own queuing solution isn't an option; we're a small team.
Things I've discovered so far:
Both MSMQ and SQL Service Broker can do the job.
It appears that service broker is faster for transactional messages.
Service Broker requires a SQL server running somewhere, whereas MSMQ needs any configured Windows machine running somewhere.
MSMQ appears to be better/faster/easier to set up/run in clusters.
Am I missing something? Is there a clear winner here? Any thoughts, experiences, or links would be valued. Thank you!
EDIT: We ended up sticking with service broker because we have a custom DB framework used in some of our client code (we handle transactions better). That code captured SQL for transactions, but not . The client code was also all version 1.1 of .NET, so we'd have to upgrade all the client code. Thanks for your help!

Having just migrated my application from Service Broker to MSMQ, I would have to vote for using MSMQ. There are several factors to take into account, but most of which have to do with how you are using your data and where the processing lives.
If processing is done in the database? Service Broker
If it is just data move? Service Broker
Is processing done in .NET/COM code? MSMQ
Do you need remote distributed transactions (for example, processing on a box different than SQL)? MSMQ
Do you need to be able to send messages while the destination is down? MSMQ
Do you want to use nServiceBus, MassTransit, Rhino-ESB, etc.? MSMQ
Things to consider no matter what you choose
How do you know the health of your queue? Both options handle failover differently. For example Service Broker will disable your queue in certain scenarios which can take down your application.
How will you perform reporting? If you already use SQL Tables in your reports, Service Broker can easily fit in as it's just another dynamic table. If you are already using Performance Monitor MSMQ may fit in nicer. Service Broker does have a lot of performance counters, so don't let this be your only factor.
How do you measure uptime? Is it merely making sure you don't lose transactions, or do you need to respond synchronously? I find that the distributed nature of MSMQ allows for higher uptime because the main queue can go offline and not lose anything. Whereas with Service Broker your database must be online or else you lose out.
Do you already have experience with one of these technologies? Both have a lot of implementation details that can come back and bite you.
No mater what choice you make, how easy is it to switch out the underlying Queueing technology? I recommend having a generic IQueue interface that you write a concrete implementation against. This way the choice you make can easily be changed later on if you find that you made the wrong one. After all, a queue is just a queue and should not lock you into a specific implementation.

I've used MSMQ before and the only item I'd add to your list is a prerequisite check for versioning. I ran into an issue where one site had Win 2000 Server and therefore MSMQ v.2, versus Win 2003 Server and MSMQ v3. All my .NET code targeted v.3 and they aren't compatible... or at least not easily so.
Just a consideration if you go the MSMQ route.

The message size limitation in MSMQ has halted my digging in that direction. I am learning Service Broker for the project.

Do you need to be able to send messages while the destination is down? MSMQ
I don't understand why? SSB can send messages to disconnected destination without any problem. All this messages going to transmission queue and would be delivered when destination stay reachable.

Application Level Replication Technologies

I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus lead to implementing replication at the application layer -- by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last writer conditions are sorted out correctly.
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?

Lotus/ Domino is your answer. I've been working with it for ten years and its exactly what you need. It may not be trendy (a perception that I would challenge) but its powerful, adaptable and very secure, The latest version R8 is the best yet.

You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. The replicate in Notes/Domino is definitely a very powerful feature and enables for full replication of data between sites. Even if a server is unavailable the next time it connects it will simply replicate and get back in sync.
As far as SOA Service tier you could then use Domino Designer to write a webservice. Since Notes/Domino 7.5.x (I believe) Domino has been able to provision and consume webservices.

AS what other advised, I will recommend also Lotus Notes/Domino. 8.5 is really very powerful application development platfrom

You dont give enough specifics to be certain of your needs but I think you should check out SQL Server Merge replication. It allows for asynchronous replication of multiple databases with full conflict resolution. You will need to designate a Global master and all the other databases will replicate to that one, but all the database instances are fully functional (read/write) and so you can schedule replication at whatever intervals suit you. If any region goes offline they can catch up later with no issues - if the master goes offline everyone will work independantly until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino of course which is not very trendy these days).

I think that your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers so that you can rely on published updates being received eventually. If all of your access to the data repositories is via service you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates. If the master database is the only one sending the updates you can exclude routing the notifications to the node that generated them in the first place thus avoiding update loops.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas