We use NServiceBus for a few applications and monitor endpoint heartbeats and failed messages through ServicePulse.
Most of the time messages are processed within minutes, but occasionally there is a spike in traffic and clients will ask if there is a problem. I would like to know the length of an endpoint queue so that I can respond and provide estimates.
We use SQL Server as the transport and subscription store. I cannot view the database remotely.
What is the best approach to surface this data?
I could expose an SSRS report on top of the database, add code to ServiceControl and ServicePulse since they are both open source, or add a custom check through ServicePulse...
How about running a job (at a configured interval on the SQL Server) against the queue tables that writes the number of messages to a table you can query?
You can then use that table to drive your monitoring tool and generate alerts, or indeed write a custom check so you get alerts in ServicePulse...
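For illustration, here is a minimal sketch of such a custom check, assuming the NServiceBus.CustomChecks package (the exact PerformCheck signature varies slightly between package versions, some also take a CancellationToken) plus plain ADO.NET. The queue table name, connection string and threshold are placeholders you would replace with your own values.

    using System;
    using System.Data.SqlClient;
    using System.Threading.Tasks;
    using NServiceBus.CustomChecks;

    // Periodically counts the rows in the SQL transport queue table and fails
    // the check (which raises an alert in ServicePulse) once the backlog
    // exceeds a threshold. Table name, connection string and threshold are
    // placeholders - adjust them to your environment.
    public class QueueLengthCheck : CustomCheck
    {
        const int Threshold = 1000; // assumed alerting threshold
        const string ConnectionString = "Data Source=.;Initial Catalog=Transport;Integrated Security=True";

        public QueueLengthCheck()
            : base("MyEndpoint queue length", "Monitoring", TimeSpan.FromMinutes(1))
        {
        }

        public override async Task<CheckResult> PerformCheck()
        {
            using (var connection = new SqlConnection(ConnectionString))
            using (var command = new SqlCommand("SELECT COUNT(*) FROM [dbo].[MyEndpoint]", connection))
            {
                await connection.OpenAsync();
                var count = (int)await command.ExecuteScalarAsync();

                return count > Threshold
                    ? CheckResult.Failed($"Queue length is {count} (threshold {Threshold})")
                    : CheckResult.Pass;
            }
        }
    }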
While this is a temporary solution, we are working on filling that gap; take a look at this announcement: https://groups.google.com/d/msg/particularsoftware/zRJ18bxeY2Y/zrLu9WOIAQAJ
We've been working on enhancing the Particular Service Platform to close existing gaps and provide a means of monitoring your NServiceBus-based system more easily.
The initial offering will focus on identifying key metrics (one of them being queue length) for assessing the health of a system and then presenting these metrics to you in a manner that's easy to visualize and consume.
In the weeks ahead we will share more information about our monitoring philosophy and how we are looking to ease the pain of implementing it. So follow our blog to get notified of updates.
In the meantime you are welcome to join the live webinar on the monitoring theme, Wednesday, June 28 at 12:00 EDT (17:00 BST).
Also: my colleague William Brander and I will show the metrics you should consider when monitoring microservices.
Link: https://particular.net/what-to-consider-when-monitoring-microservices
Hope this helps.
If I can help further, please feel free to email support at particular.net.
In our current project we use 8 worker role machines side by side, which actually work a little differently than Azure may expect.
Short outline of the system:
each worker starts up to 8 processes that connect to Cloud Queue and process messages
each process accesses three different cloud queues for collecting messages for different purposes (delta recognition, backup, metadata)
each message leads to a WCF call to an ERP system to gather information, and the retrieved response is finally added to a Redis cache
This approach was chosen over many smaller machines due to cost and performance. While 24 one-core machines would achieve about 400 calls/s to the ERP system, 8 four-core machines with 8 processes each do over 800 calls/s.
Now to the question: when we increased the number of machines even further, to push performance to 1200 calls/s, we experienced outages of Cloud Queue. At the same moment, 80% of the machines' processes stop processing messages.
Here we have two problems:
Remote debugging is not possible for these processes, but it was possible to use dile to get some information out.
We use the GetMessages method of Cloud Queue to get up to 4 messages from the queue. Cloud Queue always answers with 0 messages. Reconnecting to the cloud queue does not help.
Restarting the workers does help, but shortly afterwards leads to the same problem.
Are we hitting the natural limit of Cloud Queue's scalability, and should we switch to Service Bus?
Update:
I have not been able to fully understand the problem; I described it in the natural borders of Cloud Queue.
To summarize:
The number of TCP connections was impressive; actually too impressive (several hundred).
Going back to the original memory size let the system operate normally again.
In my experience I have been able to get better raw performance out of Azure Cloud Queues than Service Bus, but Service Bus has better enterprise features (reliable messaging, topics, etc.). Azure Cloud Queue should process up to 2,000 messages/second per queue.
https://azure.microsoft.com/en-us/documentation/articles/storage-scalability-targets/
You can also try partitioning to multiple queues if there is some natural partition key.
Make sure that your processes don't have some sort of thread deadlock that is the real culprit. You can test this by connecting to the queue when it appears hung and trying to pull messages from it. If that works, it is your process, not the queue.
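As a rough illustration of that test, a minimal console probe using the classic WindowsAzure.Storage SDK might look like the following; the connection string and queue name are placeholders, not values from your system.

    using System;
    using Microsoft.WindowsAzure.Storage;
    using Microsoft.WindowsAzure.Storage.Queue;

    class QueueProbe
    {
        static void Main()
        {
            // Placeholder connection string and queue name - replace with your own.
            var account = CloudStorageAccount.Parse("DefaultEndpointsProtocol=https;AccountName=myaccount;AccountKey=...");
            var queue = account.CreateCloudQueueClient().GetQueueReference("delta-recognition");

            // Check how many messages the service thinks are waiting.
            queue.FetchAttributes();
            Console.WriteLine($"Approximate message count: {queue.ApproximateMessageCount}");

            // Try to pull a small batch, the same way the workers do.
            foreach (CloudQueueMessage message in queue.GetMessages(4, TimeSpan.FromSeconds(30)))
            {
                Console.WriteLine($"Got message {message.Id}");
                // Not deleting here - the message becomes visible again after 30 seconds.
            }
        }
    }

If the probe pulls messages while the workers do not, the problem is in the worker processes rather than in the queue itself.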
Also take a look at this to set up some other monitors:
https://azure.microsoft.com/en-us/documentation/articles/storage-monitor-storage-account/
It took some time to solve this issue:
First, a summary of the usage of the storage account:
We used the blob storage once a day pretty heavily.
The "normal" diagonistics that Azure provides out of the box also used the same storage account.
Some controlling processes used small tables to store and read information once an hour for about 20 minutes.
There may be up to 800 calls/s that increment a counter tracking calls to an ERP system.
When we recognized that the storage account was put under heavy load, we split it up.
Now there are three physical storage accounts handling the 2 queues.
The original one still handles up to 800 calls/s for incrementing counters.
Diagnostics are still on the original one.
The controlling information has also been moved.
The system has now been running for 2 weeks and is working like a charm. There are several things we learned from that:
No, the infrastructure is "not just there" and it doesn't scale endlessly.
Even though we thought we didn't use "that much", summed up we used it quite heavily and in an uncontrolled way.
There are no "best practices" anywhere on the net that tell the complete story. Especially when starting to work with the storage account, a guide from MS would be quite helpful.
Exception handling in storage is quite bad. Even if the storage account is overused, I would expect some kind of exception and not just a return of zero messages without any surrounding information.
Read the complete story here: natural borders of cloud storage scalability
UPDATE:
Scalability is influenced by many factors. You may be interested in Azure Service Bus: Massive count of listeners and senders to be aware of some more pitfalls.
I was assigned to update an existing system for gathering data coming from points of sale and inserting it into a central database. The one that is working now is based on FTP/SFTP transmission, where the information is sent once a day, usually at night. Unfortunately, because of unstable connection links (low quality 2G/3G modems), some of the files appear to be broken. With just a few shops connected that way everything was working smoothly, but along with the increasing number of shops, errors became more frequent. What is worse, the time needed to insert data into the central database is taking up to 12-14 hours (including waiting for the data to be downloaded from all of the shops) and that cannot happen during the working day, as it would block the process of creating sale reports and other activities with the database - so we are really tight on processing time here.
The idea my manager suggested is to send the data continuously, during the day. Data packages would be significantly smaller, so their transmission and insertion would be much faster, the central server would contain current (almost real-time) data, and the night could be used for long-running database activities like creating backups, rebuilding indexes, etc.
After going through many websites, I found that:
using ASMX web services is now obsolete and WCF should be used instead
WCF with MSMQ or System Messaging could be used to safely transmit data, where I don't have to care that much about acknowledging delivery of data, consistency, nodes going offline etc.
according to http://blogs.msdn.com/b/motleyqueue/archive/2007/09/22/system-messaging-versus-wcf-queuing.aspx WCF queuing is better
there are also other technologies for implementing message queue, like RabbitMQ, ZeroMQ etc.
And that is where I became confused. With so many options, do you have any pros and cons for these technologies?
We are using .NET with Windows Forms and SQL Server, but if it were necessary, we could change to something more suitable. I am also a bit worried about server efficiency. After some calculations, the server would be receiving about 15 packages of data per second (peak). Is that much? I know there are many websites without serious server infrastructure that handle hundreds of visitors online and still run smoothly, but a website mainly uploads data to the client, whereas here we would download it from the client.
I also found somewhat similar SO question: Middleware to build data-gathering and monitoring for a distributed system
where DDS was mentioned. What do you think about introducing some middleware servers that would cope with the low quality links to points of sale, so the main server would not be clogged with 1 KB/s transmissions?
I'd be grateful for all your help. Thank you in advance!
RabbitMQ can easily cope with thousands of 1 KB messages per second.
As your use case is not about processing real-time data, I'd say you should combine a few messages and send them as a batch. That would be good enough to spread the load over the day.
As the motivation here is not to process the data in real time, any transport layer would do the job, even FTP/SFTP. RabbitMQ will work fine here, but it's not the typical use case for it.
As you mentioned that one of your concerns is a slow/unreliable network, I'd suggest compressing the files before sending them and, on the receiving end, immediately verifying their integrity. Rsync or similar will probably do a great job of that.
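To make the batching and compression idea concrete, here is a rough sketch using the RabbitMQ .NET client (the pre-7.x IModel API). The host name, queue name and payload format are assumptions for illustration, not part of the original setup.

    using System.IO;
    using System.IO.Compression;
    using System.Text;
    using RabbitMQ.Client;

    class BatchSender
    {
        // Compresses a batch of sale records and publishes it as a single
        // durable message. Host, queue name and payload format are placeholders.
        public static void SendBatch(string[] saleRecords)
        {
            var payload = Compress(string.Join("\n", saleRecords));

            var factory = new ConnectionFactory { HostName = "central-server" };
            using (var connection = factory.CreateConnection())
            using (var channel = connection.CreateModel())
            {
                channel.QueueDeclare("pos-batches", durable: true, exclusive: false, autoDelete: false, arguments: null);

                var properties = channel.CreateBasicProperties();
                properties.Persistent = true; // survive a broker restart

                channel.BasicPublish(exchange: "", routingKey: "pos-batches", basicProperties: properties, body: payload);
            }
        }

        static byte[] Compress(string text)
        {
            using (var output = new MemoryStream())
            {
                using (var gzip = new GZipStream(output, CompressionMode.Compress))
                {
                    var bytes = Encoding.UTF8.GetBytes(text);
                    gzip.Write(bytes, 0, bytes.Length);
                }
                return output.ToArray();
            }
        }
    }

The consumer on the central server would decompress, verify, and then insert the batch; how you verify integrity (checksum, record counts) is up to you.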
From what I understand, you have basically two problems:
Potential for loss/corruption of call data
Database write performance
The potential for loss/corruption of call data is being caused by a lack of reliability in the transmission of data from client to service.
And it's not clear what is causing the database contention/performance issues, beyond a vague reference to high volumes, so this answer will be more geared towards solving the first problem.
You have correctly identified the need for reliable asynchronous communication transport as a way to address the reliability issues in your current setup.
Looking at MSMQ to deliver this is a valid first step. MSMQ provides reliable communication via a store and forward messaging semantic which comes out of the box and requires very little in the way of configuration.
Unfortunately, while suitable for your needs, MSMQ relies on 2 things:
A reliable network protocol, and
A client service running on both sending and receiving machine.
From your description above, I don't believe 1 exists (the internet is not a reliable network), and you might well struggle with 2 - MSMQ only ships with Windows Server or business/enterprise versions of Windows on the desktop (* see below).
As a possible solution to the network reliability problem, you could use WCF or a RESTful endpoint (using Nancy or WebApi) to provide a service operation (or operations) exposed over HTTP, which would accept the incoming calls from the client machines. These technologies are quite different, so you'll need to make sure you're making the correct choice early on.
WCF supports WS-ReliableMessaging from the SOAP 1.2 specification out of the box, which allows for reliable web service calls over HTTP; however, it's very config-heavy and not generally a nice framework to work with.
REST is much simpler than WCF in .NET; it's very lightweight and easy to use. However, for reliable delivery you would have to expose some kind of GET operation (in addition to a POST that allows the client to send data) to be called (within a reasonable time-frame) to verify the data was committed. The client would have to implement some kind of retry semantic if the result of the GET "acknowledgement" was negative.
Despite requiring two operations rather than one for the WCF route, I would favour the REST approach. I've done plenty of both and find REST services far nicer to work with.
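A bare-bones sketch of that POST/GET pair, assuming ASP.NET Web API 2 with attribute routing enabled; the route, the batchId scheme and the BatchStore helper are illustrative stand-ins rather than an existing API.

    using System.Collections.Concurrent;
    using System.Net;
    using System.Web.Http;

    // Sketch of the two REST operations described above (ASP.NET Web API 2,
    // attribute routing enabled via config.MapHttpAttributeRoutes()).
    [RoutePrefix("api/salesbatch")]
    public class SalesBatchController : ApiController
    {
        // POST api/salesbatch/{batchId}
        // The client generates the batchId itself, so a retried POST of the
        // same batch can be detected and treated as a no-op.
        [HttpPost, Route("{batchId}")]
        public IHttpActionResult Post(string batchId, [FromBody] string payload)
        {
            if (BatchStore.AlreadyReceived(batchId))
                return Ok(); // duplicate retry, nothing more to do

            BatchStore.Save(batchId, payload); // e.g. hand off to a queue or staging table
            return StatusCode(HttpStatusCode.Created);
        }

        // GET api/salesbatch/{batchId}
        // The "acknowledgement" call: the client retries the POST if this returns 404.
        [HttpGet, Route("{batchId}")]
        public IHttpActionResult Get(string batchId)
        {
            return BatchStore.AlreadyReceived(batchId) ? (IHttpActionResult)Ok() : NotFound();
        }
    }

    // Stand-in for whatever actually persists the batches.
    static class BatchStore
    {
        static readonly ConcurrentDictionary<string, string> batches = new ConcurrentDictionary<string, string>();
        public static bool AlreadyReceived(string batchId) { return batches.ContainsKey(batchId); }
        public static void Save(string batchId, string payload) { batches[batchId] = payload; }
    }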
(*) That's not to say that MSMQ wouldn't work in your ultimate solution, just that it would not be used to address the transmission reliability issue. However, it could still be used to address another of your problems: that of database write contention. If you were to queue incoming requests once they arrived at the server, they could be processed by an "offline" process, which could then perform the required database operations in a reliable manner. This could be done using MSMQ transactional queues.
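For completeness, a sketch of that server-side pattern with System.Messaging; the queue path and the string payload are placeholders, and in a real setup the receive and the database write would typically share a single DTC transaction rather than the simple MessageQueueTransaction shown here.

    using System;
    using System.Messaging;

    // Offloads database writes through a local transactional MSMQ queue.
    class IncomingRequestQueue
    {
        const string Path = @".\private$\incoming-sales"; // placeholder queue path

        public static MessageQueue GetOrCreate()
        {
            return MessageQueue.Exists(Path)
                ? new MessageQueue(Path)
                : MessageQueue.Create(Path, transactional: true);
        }

        // Called by the web endpoint once a request has been received.
        public static void Enqueue(string payload)
        {
            using (var queue = GetOrCreate())
            using (var tx = new MessageQueueTransaction())
            {
                tx.Begin();
                queue.Send(payload, tx);
                tx.Commit();
            }
        }

        // Called by the "offline" process that performs the database work.
        public static void ProcessNext(Action<string> writeToDatabase)
        {
            using (var queue = GetOrCreate())
            using (var tx = new MessageQueueTransaction())
            {
                queue.Formatter = new XmlMessageFormatter(new[] { typeof(string) });
                tx.Begin();
                var message = queue.Receive(TimeSpan.FromSeconds(30), tx);
                writeToDatabase((string)message.Body);
                tx.Commit(); // an exception before this point leaves the message on the queue
            }
        }
    }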
In response to comments:
99% of messages are passed from shop to main server, but if some change is needed (price correction, discounts etc.), that data has to be sent to the shop.
This kind of changes things. Had I understood from the beginning that you had a bidirectional requirement, and seeing as how you have managed to establish MSMQ communication, I would have nudged you towards NServiceBus, which is a really, really cool wrapper around MSMQ. The reason I would have done this is that you appear to have both a one-way and a publish-subscribe requirement, which is supported really nicely by NServiceBus.
We are currently setting up NServiceBus in a distributor/worker model and I was wondering if it is really worth it for us.
In our initial test lab, I have 2 clustered distributors and one worker (more workers in prod). What I am wondering is whether it would be just as effective to leverage our highly available SQL Server for storage and rebuild the servers to all handle the work, instead of having dedicated distributors and workers. All of our messages get onto the bus via a simple .NET Web API service. I could install that service on each box along with the endpoint DLLs and have them all talk to SQL Server, which has more than enough horsepower to handle the load. We have a load balancer available to us to distribute the messages to the handlers.
What would some of the drawbacks be in taking this approach vs the distributor model?
What has me concerned is a line from David Boike's book on NServiceBus (great book BTW) that I just read...
"Using SQL Server as a transport can be a great choice for small
projects on teams that already use SQL Server"
The small projects part is what I am worried about. This is by no means a small project and it will have a pretty high volume of messages flowing through this layer as we refactor more systems to be message driven.
Has anyone been down the same road comparing SQL server to distributor and where did you come out?
Thanks
What I was referring to in the book, in the quote you mentioned, was that there are times when you have a fairly small solution, all in a single SQL Server database, and you want to introduce some messaging around the edges. The SQL Server transport makes it easy to do that without adding a bunch of additional overhead and moving parts. If you keep everything in one database, you can even ditch the Distributed Transaction Coordinator. It can also be really useful for integrating with a legacy system where you monitor for changes via database triggers.
However, keep in mind (and if there's a next edition, I'll be sure to go into a little more detail about this) that the SQL Server transport uses a broker pattern; that is, all communication must go through SQL Server, so it becomes a central point of failure and a central bottleneck. The default MSMQ transport, on the other hand, follows the bus architectural style, meaning it's completely decentralized. Each endpoint can run completely on its own, at least until you introduce additional dependencies.
Andreas benchmarked the new transports and found that on V4, MSMQ was capable of roughly 6000 sends/s and 2300 receives/s, and that SQL Server was on par with that. The difference is that with MSMQ those numbers are roughly per server (each server gets its own throughput), whereas with the SQL Server transport that is going to be your total achievable throughput, period, and any endpoints you add will have to share it.
Of course, broker-style transports (the rest of the new transports in 4.0 are brokers too) do have some advantages over MSMQ. The biggest is that you don't need to use the Distributor to scale out. With a broker, the "queue" is centralized, so you can simply spin up additional endpoints pointing at the same input queue in a competing-consumers pattern.
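To illustrate what that scale-out looks like in configuration terms (using the later EndpointConfiguration-style API for brevity; the endpoint name and connection string are placeholders), every machine that runs the same endpoint configuration simply competes for the same queue table:

    using System.Threading.Tasks;
    using NServiceBus;

    class Program
    {
        static async Task Main()
        {
            // Each machine running this exact configuration reads from the same
            // [Sales] queue table, giving competing consumers without a Distributor.
            var endpointConfiguration = new EndpointConfiguration("Sales");
            endpointConfiguration.SendFailedMessagesTo("error");

            var transport = endpointConfiguration.UseTransport<SqlServerTransport>();
            transport.ConnectionString("Data Source=my-sql-cluster;Initial Catalog=Transport;Integrated Security=True");

            var endpoint = await Endpoint.Start(endpointConfiguration);

            // ... message handlers run until shutdown ...
            await endpoint.Stop();
        }
    }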
Of course, as in all things, your mileage may vary, but if you are planning an ambitious system, then the SQL Server transport may not be for you, as you will at some point reach the point where your only option is to scale up your SQL Server instance.
How can I monitor traffic going out of a WCF service (self-hosted) on Windows Azure? The amount of data going into my stress-test app doesn't seem to add up to what I'm seeing on the pricing page (which doesn't seem to be updated live anyway). The service uses HTTPS and messages are pretty small. Is the SSL handshake traffic negligible? I also have a data-miner worker role that continuously downloads data from the internet, but from what I've read, inbound traffic is free, so it shouldn't count towards the OUT traffic.
How can I get a reliable traffic monitor?
The billing page is usually updated once a day (once in a 24-hour period), so you have to wait a while before you see the results of your stress test added to the billing page for your account.
One place where you can monitor this (among other KPIs for your application) is the MONITOR tab in the Management Portal. You can navigate to the Cloud Service under test, click the MONITOR menu item, then click Add Metric at the bottom, and finally choose Network Out. This monitoring dashboard gets data every 5 minutes, so it should reflect the network usage you are talking about.
Another option you have is to use a network performance counter such as Network Interface: Bytes Sent/sec. You have to configure Windows Azure Diagnostics to monitor that specific performance counter. You can then set a scheduled transfer period of 1 minute and dig into the table created by the diagnostics agent for the data.
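As a rough example, assuming the classic Windows Azure Diagnostics agent (the older DiagnosticMonitor API), the counter could be wired up from the role's OnStart along these lines:

    using System;
    using Microsoft.WindowsAzure.Diagnostics;

    public static class DiagnosticsSetup
    {
        // Call this from the role's OnStart to start collecting the counter.
        public static void ConfigureNetworkCounter()
        {
            var config = DiagnosticMonitor.GetDefaultInitialConfiguration();

            config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
            {
                CounterSpecifier = @"\Network Interface(*)\Bytes Sent/sec",
                SampleRate = TimeSpan.FromSeconds(30)
            });

            // Transfer the samples to the WADPerformanceCountersTable every minute.
            config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

            DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);
        }
    }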
P.S. And yes, you are correct - INBOUND data for Azure is FREE.
I'm working on a real-time application and building it on Azure.
The idea is that every user reports something about himself and all the other users should see it immediately (they poll the service every second or so for new info).
My approach so far has been to use a Web Role hosting a WCF REST service where I do all the writing to the DB (SQL Azure) without a Worker Role, so that data is written immediately.
I've come to think that using a Worker Role and a queue to do the writing might be much more scalable, but it might interfere with the real-time side of the service (the Worker Role might not take the job from the queue immediately).
Is it true? How should I go about this issue?
Thanks
While it's true that the queue will add a bit of latency, you'll be able to scale out the number of Worker Role instances to handle the sheer volume of messages.
You can also optimize queue-reading by getting more than one message at a time. Since a single queue has a scalability target of 500 TPS, this lets you go well beyond 500 messages per second on reads.
You might look into a cache for buffering the latest user updates, so that when polling occurs, your service reads from the cache instead of SQL Azure. That might help as the volume of information increases.
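As a sketch of that cache-aside idea (using MemoryCache as a stand-in for whatever distributed cache you choose; the one-second expiry and the loader delegate are assumptions for illustration):

    using System;
    using System.Collections.Generic;
    using System.Runtime.Caching;

    // Cache-aside sketch for the polling path. MemoryCache stands in for a
    // distributed cache, and the loader delegate is a placeholder for the
    // SQL Azure query that fetches the latest user updates.
    public class UpdatesCache
    {
        readonly MemoryCache cache = MemoryCache.Default;
        readonly Func<IList<string>> loadFromDatabase;

        public UpdatesCache(Func<IList<string>> loadLatestUpdatesFromDatabase)
        {
            loadFromDatabase = loadLatestUpdatesFromDatabase;
        }

        public IList<string> GetLatestUpdates()
        {
            var cached = cache.Get("latest-updates") as IList<string>;
            if (cached != null)
                return cached; // polling clients are served from memory

            var updates = loadFromDatabase(); // only hit SQL Azure on a cache miss
            cache.Set("latest-updates", updates,
                new CacheItemPolicy { AbsoluteExpiration = DateTimeOffset.UtcNow.AddSeconds(1) });
            return updates;
        }
    }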
You could have a look at SignalR. It does not support farm scenarios out of the box, but it should be able to work using either internal endpoint calls to update every instance, the Azure Service Bus, or the AppFabric Cache. This way you get a push scenario rather than a pull scenario, so you don't have to poll your endpoints for potential updates.