Maybe this is a dumb question, but please bear with me and help me understand the concept.
There are two servers interacting using REST APIs, with a default timeout of 2 minutes.
System A calls System B using a REST API, and System B is in a fault state due to a database issue: it receives requests but doesn't respond. Meanwhile System A keeps sending hundreds of requests and eventually its users complain of slowness.
My understanding is that System A reserves/consumes some resource for each request, and that request waits up to 2 minutes for System B before timing out, so if hundreds of requests are queued, the system slows down.
I have a .NET Core app running on IIS. Please tell me whether this understanding is correct or whether other factors could be involved. When System B came back to a normal state, both systems behaved correctly.
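To make my understanding concrete, this is roughly the kind of change I have in mind on the System A side (a sketch with hypothetical names; the 10-second timeout and the cap of 20 concurrent calls are arbitrary example values):

using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class SystemBClient
{
    // Shared HttpClient with a deliberately short timeout, so a hung
    // System B fails fast instead of holding resources for 2 minutes.
    private static readonly HttpClient Http = new HttpClient
    {
        Timeout = TimeSpan.FromSeconds(10)
    };

    // Cap concurrent outbound calls so a slow dependency cannot pile up
    // hundreds of waiting requests inside System A.
    private static readonly SemaphoreSlim Gate = new SemaphoreSlim(20);

    public async Task<string> GetFromSystemBAsync(string url)
    {
        await Gate.WaitAsync();
        try
        {
            using (HttpResponseMessage response = await Http.GetAsync(url))
            {
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsStringAsync();
            }
        }
        finally
        {
            Gate.Release();
        }
    }
}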
I built a set of 3 APIs using ASP.NET Web API 2, self-hosted using OWIN in an Azure Cloud Service Worker Role.
The Worker Role is exposed to the internet with a custom domain.
Each API has a single controller doing some normal dictionary operations, table calls, and Azure Redis calls. One request in two just does a single Redis call and returns in around 10 ms.
The average call when going through all the API code is 150ms.
The response is a JSON object of around 10 KB in size.
Everything works fine, but I have a problem.
I'm seeing around 25 peak connections per second and no more than 2 million requests per day, and I can barely get the CPU below 40% with 3 Azure D2_v2 (2 cores, 8 GB RAM) instances running.
I'm in trouble because I'm spending almost $1.5k a month for an API serving just 15-25 calls per second.
If I remove or scale down an instance, the CPU goes up to 55-60%, Redis and Azure Table calls slow down a lot, and an API request takes 3-5 seconds to come back.
I tried everything to the best of my abilities. I thought it could be bots or a DDoS attack, so I installed the NuGet package WebApiThrottle and set a maximum of 1 request per IP per second.
Nothing changed.
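For reference, the throttling was wired up roughly like this (a sketch from memory of the WebApiThrottle package; treat the exact type names, especially the repository class, as assumptions):

using System.Web.Http;
using WebApiThrottle;

public static class ThrottleConfig
{
    public static void Register(HttpConfiguration config)
    {
        // Limit each client IP to 1 request per second. An in-memory
        // repository is used because the service is OWIN self-hosted
        // and has no ASP.NET runtime cache.
        config.MessageHandlers.Add(new ThrottlingHandler
        {
            Policy = new ThrottlePolicy(perSecond: 1)
            {
                IpThrottling = true
            },
            Repository = new MemoryCacheRepository()
        });
    }
}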
I reviewed all the code and configuration to cut unoptimized parts, but one call in two just calls Redis and returns, and the others are very clean and simple C#, returning in 150 ms with 2 Azure Table calls + 1 Azure Queue write.
The API Controllers are async, everything is async.
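To give an idea of the shape of the code, a simplified sketch of one of the Redis-only actions (hypothetical names, assuming StackExchange.Redis with a single shared ConnectionMultiplexer):

using System.Threading.Tasks;
using System.Web.Http;
using StackExchange.Redis;

public class LookupController : ApiController
{
    // One multiplexer for the whole process; creating a connection per
    // request is a classic cause of high CPU and slow Redis calls.
    private static readonly ConnectionMultiplexer Redis =
        ConnectionMultiplexer.Connect("example.redis.cache.windows.net:6380,ssl=true,password=<secret>");

    // GET api/lookup?key=abc
    public async Task<IHttpActionResult> Get(string key)
    {
        IDatabase db = Redis.GetDatabase();
        RedisValue value = await db.StringGetAsync(key);

        if (value.IsNullOrEmpty)
            return NotFound();

        // value holds the cached payload (~10 KB of JSON).
        return Ok((string)value);
    }
}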
I enabled Profiling; the CPU is high in the main Azure process and in the Redis Get method, nothing else relevant here, no bottlenecks.
I enabled Diagnostics, no errors.
I installed Application Insights, and here I see something strange that I can't tell whether it is normal or not.
I see this IP, 13.88.23.0, making thousands of requests to the APIs with query string values generally used in normal requests. A lot of them fail.
This IP belongs to Azure itself; why is it calling the API?
Some of these requests are stuck for minutes; I can see that from the Application Insights panel, and it's always the same IP.
Then I see the remaining logs, dependencies, etc., nothing relevant.
Apart from that, what could I do to understand the problem?
I can't believe it's normal to consume so many CPU resources for an API with just 2 million calls a day, or is it?
Is there an additional profiling technique I could use?
Based on your experience, how many API calls should I expect to serve with 3 dual-core, 8 GB RAM servers under normal conditions (assuming there is something wrong in my configuration)?
Thanks
UPDATE
I separated the APIs into two cloud services, 2 in one and 1 in the other.
I still see calls in Application Insights from another IP belonging to Microsoft.
I suppose this is normal; probably Application Insights cannot detect the real IP of the client, since this is a Worker Role, and shows an internal one instead.
But the problem of having to use so much power for so few calls remains.
Any thoughts on that?
FYI: This will be my first real foray into Async/Await; for too long I've been settling for the familiar territory of BackgroundWorker. It's time to move on.
I wish to build a WCF service, self-hosted in a Windows service running on a remote machine in the same LAN, that does this:
Accepts a request for a single .ZIP archive
Creates the archive and packages several files
Returns the archive as its response to the request
I have to support archives as large as 10GB. Needless to say, this scenario isn't covered by basic WCF designs; we must take additional steps to meet the requirement. We must eliminate timeouts while the archive is building and memory errors while it's being sent. Both of these occur under basic WCF designs, depending on the size of the file returned.
My plan is to proceed using task-based asynchronous WCF calls and streaming mode.
I have two concerns:
Is this the proper approach to the problem?
Microsoft has done a nice job of abstracting all of this, but what about the underlying protocols? What goes on under the hood? Does the server keep the connection alive while the archive is building (which could take several minutes), or does it instead close the connection and initiate a new one once the operation is complete, thereby requiring me to properly route the request through the client machine's firewall?
For #2, clearly I'm hoping for the former (keep-alive). But after some searching I'm not easily finding an answer. Perhaps you know.
You need streaming for big payloads. That is the right approach. This has nothing at all to do with asynchronous IO. The two are independent. The client cannot even tell that the server is async internally.
I'll add my standard answers for whether to use async IO or not:
https://stackoverflow.com/a/25087273/122718 Why does the EF 6 tutorial use asychronous calls?
https://stackoverflow.com/a/12796711/122718 Should we switch to use async I/O by default?
Each request runs over a single connection that is kept alive. This goes for both streaming big amounts of data as well as big initial delays. Not sure why you are concerned about routing. Does your router kill such connections? That's a problem.
Regarding keep alive, there is nothing going over the wire to do that. TCP sessions can stay open indefinitely without any kind of wire traffic.
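For reference, a minimal sketch of the streamed, task-based setup being discussed (the service shape, paths and timeout values are hypothetical examples, not a drop-in implementation; ZipFile needs a reference to System.IO.Compression.FileSystem):

using System;
using System.IO;
using System.IO.Compression;
using System.ServiceModel;
using System.Threading.Tasks;

[ServiceContract]
public interface IArchiveService
{
    // Returning a Stream, combined with a streamed transfer mode on the
    // binding, lets WCF send a 10 GB archive without buffering it in memory.
    [OperationContract]
    Task<Stream> GetArchiveAsync(string requestId);
}

public class ArchiveService : IArchiveService
{
    public async Task<Stream> GetArchiveAsync(string requestId)
    {
        // Build the archive on disk first; this can take several minutes,
        // during which the single connection simply stays open.
        string zipPath = Path.Combine(Path.GetTempPath(), requestId + ".zip");
        string sourceDir = Path.Combine(@"C:\exports", requestId);   // hypothetical location
        await Task.Run(() => ZipFile.CreateFromDirectory(sourceDir, zipPath));

        // WCF closes the returned stream once the reply has been sent.
        return new FileStream(zipPath, FileMode.Open, FileAccess.Read);
    }
}

public static class ArchiveBindings
{
    // Used on both ends; the client needs the large MaxReceivedMessageSize
    // because it is the side receiving the stream.
    public static BasicHttpBinding CreateStreamedBinding()
    {
        return new BasicHttpBinding
        {
            TransferMode = TransferMode.StreamedResponse,
            MaxReceivedMessageSize = 11L * 1024 * 1024 * 1024,
            SendTimeout = TimeSpan.FromHours(1),     // example values; tune to
            ReceiveTimeout = TimeSpan.FromHours(1)   // your build + transfer time
        };
    }
}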
I'm trying to nail down a performance issue under load in an application which I didn't build, but have become very familiar with the workings of.
The architecture is: mobile apps call an ASP.NET MVC 3 website to get data to display. The ASP.NET site calls a third-party SOAP API using WCF clients (basicHttpBinding), caching results as much as it can to minimize load on that third party.
The load from the mobile apps is in the order of 200+ requests per second at peak times, which translates to something in the order of 20 SOAP requests per second to the third-party, after caching.
Normally it runs fine but we get periods of cascading slowness where every request to the API starts taking 5 seconds.. then 10.. 15.. 20.. 25.. 30.. at which point they time out (we set the WCF client timeout to 30 seconds). Clearly there is a bottleneck somewhere which is causing an increasingly long queue until requests can't be serviced inside 30 seconds.
Now, the third-party API is out of my control but they swear that it should not be having any issues whatsoever with 20 requests per second. So I've been looking into the possibility of a bottleneck at my end.
I've read questions on StackOverflow about ServicePointManager.DefaultConnectionLimit and connectionManagement, but digging through the source, I think the problem is somewhat more fundamental. It seems that our WCF client object (which is a standard System.ServiceModel.ClientBase<T> auto-generated by "Add Service Reference") is being stored in the cache, and thus when multiple requests come in to the ASP.NET site simultaneously, they will share a single Client object.
From a quick experiment with a couple of console apps and spawning multiple threads to call a deliberately slow WCF service with a shared Client object, it seems to me that only one call will occur at a time when multiple threads use a single ClientBase. This would explain a bottleneck when e.g. 20 calls need to be made per second and each one takes more than 50ms to complete.
Can anyone confirm that this is indeed the case?
And if so, if I switched to having every request create its own WCF client object, would I just need to raise ServicePointManager.DefaultConnectionLimit above the default (which I believe is 2?) before creating the client objects, in order to increase my maximum number of simultaneous connections?
(sorry for the verbose question, I figured too much information was better than too little)
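If it helps, here is a rough sketch of the per-request-channel pattern (IThirdPartyService and the endpoint name "ThirdPartyEndpoint" are hypothetical stand-ins for the generated contract and config):

using System;
using System.Net;
using System.ServiceModel;

public static class ThirdPartyApi
{
    static ThirdPartyApi()
    {
        // The non-web default is 2 connections per host; ASP.NET's autoConfig
        // normally raises it, but setting it explicitly removes the doubt.
        ServicePointManager.DefaultConnectionLimit = 50;
    }

    // The ChannelFactory is the expensive object, so reuse it; channels are
    // cheap, so create one per request instead of caching a single client.
    private static readonly ChannelFactory<IThirdPartyService> Factory =
        new ChannelFactory<IThirdPartyService>("ThirdPartyEndpoint");

    public static TResult Call<TResult>(Func<IThirdPartyService, TResult> action)
    {
        IThirdPartyService channel = Factory.CreateChannel();
        try
        {
            TResult result = action(channel);
            ((IClientChannel)channel).Close();
            return result;
        }
        catch
        {
            ((IClientChannel)channel).Abort();
            throw;
        }
    }
}

// Hypothetical generated contract, shown only so the sketch compiles.
[ServiceContract]
public interface IThirdPartyService
{
    [OperationContract]
    string GetData(int id);
}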
I know that ZMQ offers all of the flexibility to do your own load balancing. However, I would expect the out-of-the-box broker, about 4 lines of code using the line
zmq_device (ZMQ_QUEUE, frontend, backend);
to load-balance quite well, as the documentation says it does:
ZMQ_QUEUE creates a shared queue that collects requests from a set of clients, and distributes these fairly among a set of services. Requests are fair-queued from frontend connections and load-balanced between backend connections. Replies automatically return to the client that made the original request.
I have an army of back-end services, and yet I find that my front-end clients often have to wait several seconds for something that takes less than 1/10 of a second in a 1:1 setting (there are the same number of client and service machines). I suspect that ZMQ is not load balancing properly out of the box; it's sending too many requests to the same service even though that service has no spare capacity.
I think this is partly because the services are multithreaded in a way that lets them take up to 10 concurrent requests, yet they slow down greatly near the 10th request even though they can still accept more. Random distribution would be ideal. Is there an out-of-the-box way to do this, can it be done in a few lines of code, or do I have to write my own broker from scratch?
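For what it's worth, the out-of-the-box queue device corresponds to something like the following with the NetMQ C# binding (addresses and names are made up, and the exact NetMQ API is written from memory, so treat it as an assumption):

using NetMQ;
using NetMQ.Sockets;

class ZmqQueueSketch
{
    // Broker: the equivalent of zmq_device(ZMQ_QUEUE, frontend, backend).
    // Requests are fair-queued from clients and round-robined, message by
    // message, to the connected workers regardless of how busy each one is.
    static void RunBroker()
    {
        using (var frontend = new RouterSocket("@tcp://*:5559"))   // clients connect here
        using (var backend = new DealerSocket("@tcp://*:5560"))    // workers connect here
        {
            new Proxy(frontend, backend).Start();                  // blocks forever
        }
    }

    // Worker: a REP-style socket processes one request at a time, but the
    // broker keeps round-robining to it, so requests queue up at a worker
    // that has fallen behind; the broker has no view of worker load.
    static void RunWorker()
    {
        using (var worker = new ResponseSocket(">tcp://broker-host:5560"))
        {
            while (true)
            {
                string request = worker.ReceiveFrameString();
                worker.SendFrame("ok:" + request);   // application work goes here
            }
        }
    }
}

If blind round-robin isn't good enough, the usual answer is the ZeroMQ guide's load-balancing broker pattern, where workers explicitly tell the broker when they are ready for more work.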
FWIW, the issue was that the workers were taking on work when they didn't have room for it; the issue was not in the ZMQ layer per se.
I have a WCF service that works fine in IIS 7; however, once deployed to Windows Server 2003 with IIS 6, I'm getting a "The thread was being aborted" error message. This happens after the service has been running for a few minutes.
I've tried manually changing some timeout values and turning off IIS keep-alives.
Any ideas on how to fix this problem would be welcomed.
Thanks
If you're having this problem - please read! Hopefully you'll save yourself A LOT of trouble knowing this. Get coffee first!
You might come from a traditional programming background, in fields not related to SOA, and now you're writing SOA services with the mindset of a "traditional programmer". Here are 4 of the most important lessons I've learnt from building SOA services.
Rule number 1
Try your very best not to write services that take an extended amount of time to complete. I know this can be VERY tricky to accomplish, but it is much more reliable to have smaller operations called many times than one long service call that performs all the work and then returns a response. For example, I recently wrote a service which processed ALL tasks. Each task was stored as an XML file in the IIS site, and each task would export data to a system, for example SharePoint. At any given time during high volumes there could be up to 30,000 tasks waiting to be processed. Over the past 2 months I have yet to get it 100% reliable, and this is after diving deep into timeout settings in IIS, AppPools and WCF bindings. Every now and again I would get "The thread was being aborted" with no reason or explanation as to why it was happening. I exhausted all online knowledge bases; no one seemed the wiser. Eventually, after not being able to fix the issues or even reproduce them in a reliable way, I opted for a complete rewrite. I changed my code to process just 1 task at a time instead of ALL tasks.
This essentially meant calling 1 web service 30,000 times rather than calling it once, but performance-wise it is around the same. Each call returns a response quickly and does a lot less work. This has another benefit: I can provide instant feedback on each operation to the client. With the long call, you get a response back only at the very end, ALL at once.
You can also much more easily catch and retry a service call if it does fail, because you don't have to redo the whole call for every operation again, only the operation that failed.
It's easier to test too, not only because of the live feedback, but also because you can test 1 inner operation without the overhead of the loop if you want to.
Lastly, it scales better if you plan on extending your application later, because you've broken things down into more manageable units of work. For example: before, you had 1 service which processed ALL tasks; now you have a web service that can process 1 task, so you can more easily extend the functionality if you need to process 10 tasks, or tasks by selection.
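To make the shape of that change concrete, here is a rough sketch (service and method names are hypothetical) of the one-task-per-call loop with per-task retry and feedback:

using System;
using System.Collections.Generic;
using System.Threading;

public class TaskExporter
{
    private readonly ITaskService _service;   // hypothetical client proxy for the task service

    public TaskExporter(ITaskService service)
    {
        _service = service;
    }

    public void ExportAll(IEnumerable<string> taskIds)
    {
        foreach (string taskId in taskIds)
        {
            // Each call is small and fast, so a timeout or "thread aborted"
            // failure costs only one task, and only that task is retried.
            for (int attempt = 1; attempt <= 3; attempt++)
            {
                try
                {
                    _service.ProcessTask(taskId);
                    Console.WriteLine("Task " + taskId + " done.");   // instant per-task feedback
                    break;
                }
                catch (Exception ex)
                {
                    if (attempt == 3) throw;
                    Console.WriteLine("Task " + taskId + " failed (" + ex.Message + "), retrying...");
                    Thread.Sleep(TimeSpan.FromSeconds(5));
                }
            }
        }
    }
}

public interface ITaskService
{
    void ProcessTask(string taskId);
}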
Rule Number 2
Don't upgrade your existing ASMX web services to WCF 3 just because you think it's a better technology. WCF 3 is over-architected and not a real pleasure to work with or deploy. If you need to go WCF, try your best to hold out for the version that ships with .NET 4; it seems to have been revamped. Another thing you will miss is that WCF has no test forms, so you can't just quickly fire up a web browser to test your services. If you're like me ("keep it simple, stupid"), then WCF 3.5 will frustrate you.
Rule Number 3
IIS 6 can be dodgy. If at all possible, avoid having to host your services in IIS 6 if you're after reliable services. I am not saying it's impossible to achieve reliability in IIS 6, but it requires a LOT of work and a great deal of testing. If you're dealing with critical services, try to avoid using a product developed in 2001.
Rule Number 4
Don't underestimate the development and testing required to create reliable SOA services. To be honest, all I can say is that it is a massive undertaking.
I thought I'd mention that this error is thrown by SharePoint when calling some functions from a user account. Those functions need to be run with SPSecurity.RunWithElevatedPrivileges.
This answer shows up when searching for "wcf sharepoint Thread was being aborted", so hopefully this can be useful to someone, since 'thread was being aborted' isn't a very helpful error for SharePoint to throw when it's actually a permissions issue.
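For anyone landing here, the pattern looks roughly like this (the site URL is a placeholder; the important part is that the SharePoint objects are created inside the elevated delegate):

using Microsoft.SharePoint;

public static class ElevatedExample
{
    public static void DoPrivilegedWork(string siteUrl)
    {
        // Re-run the failing code under the application pool account
        // instead of the calling user's account.
        SPSecurity.RunWithElevatedPrivileges(delegate()
        {
            // Objects must be re-created inside the elevated block;
            // reusing ones created outside keeps the original user context.
            using (SPSite site = new SPSite(siteUrl))
            using (SPWeb web = site.OpenWeb())
            {
                // ... call the SharePoint API that was failing for normal users ...
            }
        });
    }
}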