I've just started at a new communications company, and we're looking at a workflow / intranet system to manage jobs and processes.
Basically, we receive data files from clients, which we then process through our systems:
1. Receive data file (FTP, email, etc.)
2. Process data file (either a generic script with data mapping to the file, or a bespoke ETL package). This step adds address values
3. Create printstream (send processed data file into a PostScript / PDF composition engine), or create email output
4. Send output to production floor (copy to printer input stream, mailing machines)
5. Process other streams (e.g. send emails / faxes, upload to e-Archive)
6. Update internal systems (e.g. warehouse stock, invoicing)
We also have a lot of other internal business processes (e.g. reprocessing damaged output, processing dead/returned mail).
I'm trying to keep all elements separated. Some will be off the shelf (e.g. printstream composition, email sending / management, CRM). Some will be built in house (e.g. reprocess damaged output).
But, I'm looking for something to tie it all together, and put the business workflow processes in. E.g. scheduling jobs, kicking off data processing tasks in sequence and managing errors. A lot of this will have human steps. Also, put in SLA management and business activity monitoring / reporting.
One key requirement soon is for automated file receipt and processing (i.e. directory watching and matching to client / application).
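As a rough illustration of directory watching and matching to a client / application, here is a minimal Python sketch. The file-name patterns, client names and the `Route` type are all hypothetical; a real system would load its routing rules from configuration and would probably use OS-level file notifications rather than polling.

```python
import fnmatch
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class Route:
    pattern: str      # glob matched against the file name (hypothetical convention)
    client: str
    application: str


# Hypothetical routing table; real rules would come from configuration.
ROUTES = [
    Route("ACME_INV_*.csv", "acme", "invoices"),
    Route("ACME_STMT_*.csv", "acme", "statements"),
    Route("GLOBEX_*.xml", "globex", "letters"),
]


def match_route(file_name: str, routes=ROUTES) -> Optional[Route]:
    """Return the first route whose pattern matches, or None if unrecognised."""
    for route in routes:
        # Upper-case both sides so matching ignores case on every platform.
        if fnmatch.fnmatchcase(file_name.upper(), route.pattern.upper()):
            return route
    return None


def scan_once(inbox: Path, seen: set) -> list:
    """Poll the inbox once and return (path, route) pairs for files not seen before."""
    found = []
    for path in sorted(inbox.iterdir()):
        if path.is_file() and path.name not in seen:
            seen.add(path.name)
            found.append((path, match_route(path.name)))
    return found
```

Calling `scan_once` on a timer gives a basic watcher; files whose route is `None` could be sent to a human for manual matching.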
I'm keen for something that's easy to manage and maintain (e.g. adding in new steps to a workflow, or conditional logic, or whatever).
I realise this is a big job, and at the moment we're focusing on each individual component and putting manual processes in place until we get a system to manage it. We don't want to design a gargantuan bespoke system to tie it all in, but would rather look at buying some kind of workflow or integration system.
Any suggestions? I've had a look at BizTalk, but I'm not sure whether it's overkill or not suited to internal-only systems. Another product I've been exposed to is Sagent Automation, but it looks a little pokey.
-- EDIT --
Forgot to mention, our existing skillset is largely Microsoft. So anything in Microsoft technologies / .NET based would be preferable. But if there's a fantastic product, we're not averse to upskilling.
Check out Apache ActiveMQ. It implements the Java Message Service 1.1 specification, layers on a servlet API, and has tons of features that should address your requirements. You can also layer on Camel, which adds a rich implementation of many enterprise integration patterns.
Typically, JMS messages are persisted in a transactional database, which can be configured to give you extremely high degrees of fault tolerance (e.g., RAID, master-backup database machine pairs, multiple copies of transaction log files). On top of the database can go multiple load-balanced app server machines running ActiveMQ, to give you scalability and high availability. I think you'll find that you can write your components in a very decoupled fashion if you use ActiveMQ as your common message bus.
In JMS, when a consumer dequeues a message, the consuming process must later confirm that the message was handled successfully. If the confirmation does not arrive in time, the JMS system will redeliver the message so that another consuming process can attempt to handle it. This means you can run multiple copies of your application to gain reliability and fault tolerance.
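To make the acknowledge-or-redeliver idea concrete, here is a toy in-memory model in Python. It is not the JMS API: the `AckQueue` class and its timeout-based redelivery are assumptions made for illustration, and real brokers differ in how and when they decide to redeliver.

```python
import time
from collections import deque


class AckQueue:
    """Toy in-memory queue: a message stays 'in flight' until acknowledged."""

    def __init__(self, ack_timeout: float):
        self.ack_timeout = ack_timeout
        self._ready = deque()
        self._in_flight = {}  # message id -> (body, redelivery deadline)
        self._next_id = 0

    def send(self, body):
        self._ready.append((self._next_id, body))
        self._next_id += 1

    def receive(self, now=None):
        """Hand out the next message, first reviving any whose deadline passed."""
        now = time.monotonic() if now is None else now
        expired = [m for m, (_, deadline) in self._in_flight.items() if deadline <= now]
        for msg_id in expired:
            body, _ = self._in_flight.pop(msg_id)
            self._ready.append((msg_id, body))
        if not self._ready:
            return None
        msg_id, body = self._ready.popleft()
        self._in_flight[msg_id] = (body, now + self.ack_timeout)
        return msg_id, body

    def ack(self, msg_id):
        """Confirm successful handling; the message is removed for good."""
        self._in_flight.pop(msg_id, None)
```

A consumer that crashes between `receive` and `ack` simply never acknowledges, so the message becomes available to another consumer once the timeout passes.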
Take a look at O'Reilly's Java Message Service, 2nd Edition, which just came out this week.
A different avenue would be to look into BPEL (Business Process Execution Language).
Edit: I'm not very familiar with Microsoft offerings, but MSMQ seems like the equivalent to JMS.
You should be able to use ActiveMQ in a Microsoft environment. They claim to support "cross language clients" like "C# and .NET". And even if that should be problematic, since ActiveMQ has a Java servlet-based API for queueing and de-queuing messages, the outside world only has to be able to make HTTP requests to the ActiveMQ server. That should limit the amount of learning your team would have to do. Good luck, this sounds like an awesome project!
SharePoint has a workflow engine that works very well. You can build your workflow using SharePoint designer or Visual Studio 2008. It uses Windows Workflow, which is similar to BizTalk (if not the same engine), but without BizTalk's other services that may not be necessary for your application.
My organization moves data for customers between systems. These integrations are built in BizTalk and are mostly file-based, sometimes to/from APIs. More and more customers are switching to APIs, so we are facing more and more API-to-API integrations.
I'm mostly a backend developer but have been tasked with finding a more generic pattern or system for building these integrations. We are talking about close to a thousand integrations.
But not thousands of different APIs; many customers use the same sort of systems.
What I want is a solution that:
1. Fetches data from the source API
2. Transforms the data to the format for the target API
3. Sends the data to the target API
Another requirement is that it should be possible to set a schedule when these jobs should run.
This is easily done in BizTalk but as mentioned there will be thousands of integrations and if we need to change something in one of the steps it will be a lot of work.
My vision is something that holds interfaces to all the APIs that we communicate with and also contains the scheduled jobs we want to run between them, preferably with logging/tracking.
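The shape described above (one reusable fetch, transform and send step per API, plus a schedule per job) could be sketched in Python roughly as follows. The `Integration` type, the `due` helper and the interval-based schedule are hypothetical names chosen for illustration, not part of any product.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Integration:
    name: str
    fetch: Callable[[], Iterable[dict]]   # reads from the source API
    transform: Callable[[dict], dict]     # maps source format to target format
    send: Callable[[dict], None]          # writes to the target API
    interval_seconds: int                 # how often the job should run

    def run_once(self) -> int:
        """Run fetch -> transform -> send and return the number of records moved."""
        count = 0
        for record in self.fetch():
            self.send(self.transform(record))
            count += 1
        return count


def due(jobs: list, last_run: dict, now: float) -> list:
    """Return the integrations whose interval has elapsed since their last run."""
    return [
        job for job in jobs
        if now - last_run.get(job.name, float("-inf")) >= job.interval_seconds
    ]
```

Because each step is an ordinary callable, a connector for a given system is written once and shared by every integration that talks to that system, so a change to one step does not have to be repeated per customer.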
There must be something out there that does this?
Suggestions?
NOTE: No cloud-based solutions since they are not allowed in our organization.
You can easily implement this using the open source Temporal project (temporal.io). You can code your integrations in a general-purpose programming language. Temporal ensures that the integration runs to completion in the presence of all sorts of intermittent failures. Scheduling is also supported out of the box.
Disclaimer: I'm a founder of the Temporal project.
I was assigned to update an existing system that gathers data from points of sale and inserts it into a central database. The current system is based on FTP/SFTP transmission, where the information is sent once a day, usually at night. Unfortunately, because of unstable connection links (low-quality 2G/3G modems), some of the files arrive corrupted. With just a few shops connected that way everything worked smoothly, but as the number of shops increased, errors became more frequent. What is worse, inserting the data into the central database takes up to 12 to 14 hours (including waiting for the data to be downloaded from all of the shops), and that cannot happen during the working day because it would block the creation of sales reports and other database activities, so we are really tight on processing time here.
The idea my manager suggested is to send the data continuously during the day. Data packages would be significantly smaller, so their transmission and insertion would be much faster; the central server would contain current (almost real-time) data, and the night could be used for long-running database activities like creating backups, rebuilding indexes, etc.
After going through many websites, I found that:
ASMX web services are now obsolete and WCF should be used instead
WCF with MSMQ or System.Messaging could be used to transmit data safely, so that I don't have to care that much about acknowledging delivery of data, consistency, nodes going offline etc.
according to http://blogs.msdn.com/b/motleyqueue/archive/2007/09/22/system-messaging-versus-wcf-queuing.aspx WCF queuing is better
there are also other technologies for implementing message queue, like RabbitMQ, ZeroMQ etc.
And that is where I become confused. With so many options, do you have any pros and cons of these technologies?
We were using .NET with Windows Forms and SQL Server, but if necessary we could change to something more suitable. I am also a bit worried about server efficiency. According to my calculations, the server would be receiving about 15 packages of data per second (peak). Is that a lot? I know there are many websites without serious server infrastructure that handle hundreds of visitors online and still run smoothly, but a website mainly uploads data to the client, and here we would download it from the client.
I also found somewhat similar SO question: Middleware to build data-gathering and monitoring for a distributed system
where DDS was mentioned. What do you think about introducing some middleware servers that would cope with low quality links to points of sale, so the main server would not be clogged with 1KB/s transmission?
I'd be grateful for all your help. Thank you in advance!
RabbitMQ can easily cope with thousands of 1 KB messages per second.
As your use case is not about processing real-time data, I'd suggest combining a few messages and sending them as a batch. That would be enough to spread the load over the day.
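Batching can be as simple as grouping records before they are handed to the transport. A minimal Python sketch, where the `batches` helper and the record shape are assumptions for illustration:

```python
from typing import Iterable, Iterator, List


def batches(records: Iterable[dict], max_size: int) -> Iterator[List[dict]]:
    """Group records into lists of at most max_size, preserving order."""
    if max_size < 1:
        raise ValueError("max_size must be at least 1")
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == max_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch
```

Each yielded list would then be serialised and sent as one message, whatever transport is chosen.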
As the motivation here is not to process the data in real time, any transport layer would do the job, even FTP/SFTP. RabbitMQ will work fine here, but this is not its typical use case.
As one of your concerns is a slow, unreliable network, I'd suggest compressing the files before sending them and verifying their integrity immediately on the receiving end. rsync or a similar tool will probably do a great job of that.
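If you end up doing the compression and integrity check yourself rather than relying on a tool, the idea looks roughly like this in Python. The helper names are hypothetical, and the sketch assumes the SHA-256 digest travels alongside the file (for example in a small sidecar file or a message header).

```python
import gzip
import hashlib
import zlib


def pack(payload: bytes) -> tuple:
    """Compress the payload and return (compressed bytes, SHA-256 of the original)."""
    return gzip.compress(payload), hashlib.sha256(payload).hexdigest()


def unpack(compressed: bytes, expected_sha256: str) -> bytes:
    """Decompress and verify; raise ValueError if the data is damaged."""
    try:
        payload = gzip.decompress(compressed)
    except (OSError, EOFError, zlib.error) as exc:
        raise ValueError("file is truncated or not valid gzip") from exc
    if hashlib.sha256(payload).hexdigest() != expected_sha256:
        raise ValueError("checksum mismatch")
    return payload


def is_intact(compressed: bytes, expected_sha256: str) -> bool:
    """Convenience wrapper for the receiving end: True if the file can be trusted."""
    try:
        unpack(compressed, expected_sha256)
        return True
    except ValueError:
        return False
```

A file that fails `is_intact` can be rejected straight away and requested again, instead of being discovered as broken during the database load.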
From what I understand, you have basically two problems:
1. Potential for loss/corruption of call data
2. Database write performance
The potential for loss/corruption of call data is being caused by a lack of reliability in the transmission of data from client to service.
And it's not clear what is causing the database contention/performance issues, beyond a vague reference to high volumes, so this answer will be more geared towards solving the first problem.
You have correctly identified the need for reliable asynchronous communication transport as a way to address the reliability issues in your current setup.
Looking at MSMQ to deliver this is a valid first step. MSMQ provides reliable communication via a store and forward messaging semantic which comes out of the box and requires very little in the way of configuration.
Unfortunately, while suitable for your needs, MSMQ relies on two things:
1. A reliable network protocol, and
2. A client service running on both the sending and receiving machines.
From your description above, I don't believe 1 exists (the internet is not a reliable network), and you might well struggle with 2: MSMQ only ships with Windows Server or the business/enterprise editions of desktop Windows. (*see below...)
As a possible solution to the network reliability problem, you could use WCF or a RESTful endpoint (using Nancy or Web API) to expose one or more service operations over HTTP, which would accept the incoming calls from the client machines. These technologies are quite different, so you'll need to make sure you make the right choice early on.
WCF supports WS-ReliableMessaging out of the box, which allows for reliable SOAP web service calls over HTTP. However, it's very config-heavy and not generally a nice framework to work with.
REST is much simpler than WCF in .NET: it is very lightweight and easy to use. However, for reliable delivery you would have to expose some kind of GET operation (in addition to a POST to allow the client to send data) to be called (within a reasonable time-frame) to verify the data was committed. The client would have to implement some kind of retry semantic if the result of the GET "acknowledgement" was negative.
Despite requiring two operations rather than one for the WCF route, I would favour the REST approach. I've done plenty of both and find REST services way nicer to work with.
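The client side of the POST-then-GET acknowledgement described above might look something like this Python sketch. The two callables stand in for the real HTTP calls and are hypothetical, as is the idea of keying each upload by a client-generated message id; the sketch also assumes the server treats a repeated POST with the same id as a no-op, so retries cannot create duplicates.

```python
import time
from typing import Callable


def send_with_ack(
    post: Callable[[str, bytes], None],    # hypothetical: POST the payload under an id
    is_committed: Callable[[str], bool],   # hypothetical: GET the acknowledgement for an id
    message_id: str,
    payload: bytes,
    max_attempts: int = 5,
    delay_seconds: float = 1.0,
) -> int:
    """Send until the server confirms the commit; return the attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            post(message_id, payload)
            if is_committed(message_id):
                return attempt
        except ConnectionError:
            pass  # unreliable link: fall through and retry
        if attempt < max_attempts:
            time.sleep(delay_seconds)
    raise RuntimeError(f"{message_id} not acknowledged after {max_attempts} attempts")
```

In practice the delay would probably grow between attempts, and unacknowledged payloads would be kept on disk at the shop so that they survive a restart.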
(*) That's not to say that MSMQ wouldn't work in your ultimate solution, just that it would not be used to address the transmission reliability issue. However it could still be used to address another of your problems, that of database write contention. If you were to queue incoming requests once they came into the server, then these could be processed by an "offline" process, which could then perform the required database operations in a reliable manner. This could be done by using MSMQ transactional queues.
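The queue-then-write idea can be illustrated with a small Python sketch. Here an in-process `queue.Queue` and SQLite stand in for MSMQ and the central database, and the `sales` table is hypothetical; unlike an MSMQ transactional queue, the in-process queue is not durable, so the sketch puts rows back by hand if the write fails.

```python
import queue
import sqlite3


def drain(inbox: "queue.Queue", conn: sqlite3.Connection, batch_size: int = 100) -> int:
    """Move up to batch_size queued sales into the database in one transaction."""
    rows = []
    while len(rows) < batch_size:
        try:
            rows.append(inbox.get_nowait())
        except queue.Empty:
            break
    if rows:
        try:
            with conn:  # commits on success, rolls back if the insert raises
                conn.executemany(
                    "INSERT INTO sales (shop_id, amount) VALUES (?, ?)", rows
                )
        except sqlite3.Error:
            for row in rows:
                inbox.put(row)  # nothing was written, so re-queue the batch
            raise
    return len(rows)
```

The web endpoint only enqueues and returns, so slow database writes no longer hold up the shops, and the worker calling `drain` can be throttled or paused while reports are running.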
In response to comments:
"99% messages are passed from shop to main server, but if some change is needed (price correction, discounts etc.), that data has to be sent to shop."
This kind of changes things. Had I understood from the beginning that you had a bidirectional requirement, and seeing as you have managed to establish MSMQ communication, I would have nudged you towards NServiceBus, which is a really, really cool wrapper around MSMQ. The reason I would have done this is that you appear to have both a one-way and a publish-subscribe requirement, which NServiceBus supports really nicely.
We are currently setting up NServiceBus in a distributor/worker model and I was wondering if it is really worth it for us.
In our initial test lab, I have 2 clustered distributors and one worker (more workers in prod). What I am wondering is whether it would be just as effective to leverage our high-availability SQL Server for storage and rebuild the servers so that they all handle the work, instead of having dedicated distributors and workers. All of our messages get onto the bus via a simple .NET Web API service. I could install that service on each box along with the endpoint DLLs and have them all talk to SQL Server, which has more than enough horsepower to handle the load. We have a load balancer available to us to distribute the messages to the handlers.
What would some of the drawbacks be in taking this approach vs the distributor model?
What has me concerned is a line from David Boike's book on NServiceBus (great book BTW) that I just read...
"Using SQL Server as a transport can be a great choice for small
projects on teams that already use SQL Server"
The small projects part is what I am worried about. This is by no means a small project and it will have a pretty high volume of messages flowing through this layer as we refactor more systems to be message driven.
Has anyone been down the same road comparing SQL Server to the distributor, and where did you come out?
Thanks
What I was referring to in the quote you mentioned from the book was that there are times when you have a fairly small solution, all in a single SQL Server database, and you want to introduce some messaging around the edges. The SQL Server transport makes it easy to do that without adding a bunch of additional overhead and moving parts. If you keep everything in one database, you can even ditch the Distributed Transaction Coordinator. It can also be really useful for integrating with a legacy system where you monitor for changes via database triggers.
However, keep in mind (and if there's a next edition, I'll be sure to go into a little more detail about this) that the SQL Server transport uses a Broker pattern, that is, all communication must go through SQL Server so it becomes a central point of failure and a central bottleneck. The default MSMQ transport, on the other hand, follows the Bus architectural style, meaning it's completely decentralized. Each endpoint can run completely on its own, at least until you introduce additional dependencies.
Andreas benchmarked the new transports and found that on V4, MSMQ was capable of roughly 6000 sends/s and 2300 receives/s, and that SQL Server was on par with that. On MSMQ, however, that figure is roughly per server (each server gets its own throughput), whereas with the SQL Server transport it is your total achievable throughput, period, and any endpoints you add will have to share it.
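To put numbers on that difference, here is a small Python calculation using the receive rate quoted above. It is an idealised upper bound only: the function names are made up for illustration, and the model ignores every other bottleneck, such as a shared business database.

```python
def total_receive_rate(endpoints: int, rate_per_second: float, shared_broker: bool) -> float:
    """Idealised aggregate receive rate across all endpoints."""
    if endpoints < 1:
        raise ValueError("endpoints must be at least 1")
    # A shared broker caps the total; decentralised queues add up per server.
    return rate_per_second if shared_broker else endpoints * rate_per_second


def per_endpoint_share(endpoints: int, rate_per_second: float, shared_broker: bool) -> float:
    """Average rate available to each endpoint under the same idealised model."""
    return total_receive_rate(endpoints, rate_per_second, shared_broker) / endpoints
```

With four servers at 2300 receives/s, the decentralised case tops out around 9200 receives/s in total, while the shared-broker case stays at 2300 in total, or about 575 per endpoint.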
Of course, broker-style transports (the rest of the new transports in 4.0 are brokers too) do have some advantages over MSMQ. The biggest is that you don't need to use the Distributor to scale out. In a broker, the "queue" is centralized so you can simply spin up additional endpoints pointing at the same input queue in a competing consumers pattern.
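The competing consumers pattern itself is easy to demonstrate. The Python sketch below uses threads and an in-process queue as stand-ins for separate endpoints and a central broker queue; the function name and the `endpoint-N` labels are hypothetical, and messages are assumed to be hashable so they can be used as dictionary keys.

```python
import queue
import threading


def run_competing_consumers(messages, handler, consumers: int = 3) -> dict:
    """Process every message exactly once using several workers on one queue."""
    shared = queue.Queue()
    for message in messages:
        shared.put(message)
    handled = {}
    lock = threading.Lock()

    def worker(name: str) -> None:
        while True:
            try:
                message = shared.get_nowait()
            except queue.Empty:
                return  # queue drained; a real endpoint would keep polling
            result = handler(message)
            with lock:
                handled[message] = (name, result)

    threads = [
        threading.Thread(target=worker, args=(f"endpoint-{i}",))
        for i in range(consumers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return handled
```

Adding capacity is just a matter of raising `consumers`; no component has to decide in advance which worker gets which message.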
Of course, as in all things, your mileage may vary, but if you are planning an ambitious system, then the SQL Server transport may not be for you, as you will at some point reach the stage where your only option is to scale up your SQL Server instance.
I've been trying to find ways to improve the performance of our NServiceBus code. I searched and stumbled on the profiles that you can set when running/installing the NServiceBus host.
Currently we're running the NServiceBus host as-is, and I read that by default we are using the "Lite" profile. I've also learnt from this link:
http://docs.particular.net/nservicebus/hosting/nservicebus-host/profiles
that there are Integration and Production profiles. The documentation does not say much. Has anyone tried the Production profile and noticed an improvement in NServiceBus performance, specifically in the speed of consuming messages from the queues?
One major difference between the NSB profiles is how they handle storage of subscriptions.
The lite, integration and production profiles allow NSB to configure how reliable it is. For example, the lite profile uses in-memory subscription storage for all pub/sub registrations. This is a concern because in order to register a subscriber in the lite profile, the publisher has to already be running (so the publisher can store the subscriber list in memory). What this means is that if the publisher crashes for any reason (or is taken offline), all the subscription information is lost (until each subscriber is restarted).
So, the lite profile is good if you are running on a developer machine and want to quickly test how your services interact. However, it is just not suitable for other environments.
The integration profile stores subscription information on a local queue. This can be good for simple environments (like QA etc.). However, in a highly distributed environment holding the subscription information in a database is best, hence the production profile.
So, to answer your question, I don't think that by changing profiles you will see a performance gain. If anything, changing from the lite profile to one of the other profiles is likely to decrease performance (because you incur the cost of accessing queue or database storage).
Unless you have already tuned the logging yourself, start there: we've seen large improvements from reduced logging. The performance of reading off the queues is the same all around. Since the queues are local, you won't gain much from the transport. I would take a look at tuning your handlers and the underlying infrastructure. You may want to check out tuning MSMQ and look at the disk you are using etc. Another spot to look at is how distributed transactions are working, assuming you are using a remote database that requires them.
Another option to increase processing throughput is to increase the number of threads consuming the queue. This will require a license. If a license is not an option, you can have multiple instances of a single-threaded endpoint running. This requires you to shard your work based on message type or something else.
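Sharding by message type can be as simple as a stable hash. This Python sketch is only an illustration of the idea; `shard_for` is a hypothetical helper, not an NServiceBus feature, and it assumes whatever puts messages on the queues can call it to pick the target instance.

```python
import zlib


def shard_for(message_type: str, instances: int) -> int:
    """Map a message type to one of `instances` endpoint instances, stably."""
    if instances < 1:
        raise ValueError("instances must be at least 1")
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    return zlib.crc32(message_type.encode("utf-8")) % instances
```

Note that hashing on message type alone gives uneven load if one type dominates; in that case a finer key (such as a customer or order id) would spread the work better.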
Continuing up the scale you can then get into using the Distributor to load balance work. Again this will require a license, but you'll be able to add more nodes as necessary. All of the opportunities above also apply to this topology.
I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back-end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus led to implementing replication at the application layer, by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last-writer conditions are sorted out correctly.
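To make the two concerns concrete (replication loops and last-writer conditions), here is a minimal Python sketch of one possible approach: every request carries a unique id so a region can ignore updates it has already seen, and conflicts are resolved by timestamp with the origin name as a tie-breaker. All names are hypothetical, and the sketch assumes region clocks are roughly synchronised.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    request_id: str    # globally unique id assigned where the write originated
    origin: str        # region that first accepted the write
    key: str
    value: str
    timestamp: float   # origin's clock; assumes clocks are roughly synchronised


class Region:
    def __init__(self, name: str):
        self.name = name
        self.data = {}        # key -> (timestamp, origin, value)
        self.applied = set()  # request ids already seen, to break replication loops

    def apply(self, update: Update) -> bool:
        """Apply an update; return True if it should be forwarded to other regions."""
        if update.request_id in self.applied:
            return False  # already seen: do not apply or forward again
        self.applied.add(update.request_id)
        current = self.data.get(update.key)
        candidate = (update.timestamp, update.origin, update.value)
        # Last writer wins; the origin name breaks timestamp ties deterministically.
        if current is None or candidate[:2] > current[:2]:
            self.data[update.key] = candidate
        return True
```

Because every region applies the same ordering rule, two regions that receive the same set of updates in different orders end up with the same value, and a region that was offline can simply replay what it missed.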
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?
Lotus/Domino is your answer. I've been working with it for ten years and it's exactly what you need. It may not be trendy (a perception that I would challenge) but it's powerful, adaptable and very secure. The latest version, R8, is the best yet.
You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. Replication in Notes/Domino is a very powerful feature and allows full replication of data between sites. Even if a server is unavailable, the next time it connects it will simply replicate and get back in sync.
As for the SOA service tier, you could then use Domino Designer to write a web service. Since Notes/Domino 7.5.x (I believe), Domino has been able to provision and consume web services.
As others have advised, I would also recommend Lotus Notes/Domino. Version 8.5 is a really powerful application development platform.
You don't give enough specifics to be certain of your needs, but I think you should check out SQL Server merge replication. It allows for asynchronous replication of multiple databases with full conflict resolution. You will need to designate a global master, and all the other databases will replicate to that one, but all the database instances are fully functional (read/write), so you can schedule replication at whatever intervals suit you. If any region goes offline it can catch up later with no issues; if the master goes offline, everyone will work independently until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino of course which is not very trendy these days).
I think that your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers, so that you can rely on published updates being received eventually. If all access to the data repositories is via services, you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates. If the master database is the only one sending the updates, you can exclude the node that generated a change from the routing of its notification, thus avoiding update loops.