Debugging Performance of Spring Cloud Gateway with Netty - spring-webflux

I have Spring Cloud Gateway (Greenwich) running with Netty. This application receives request and then sends request downstream applications depending on the route configuration.
Randomly few request take lot of time(> 70s). Even though the downstream server responded back within 5 sec, Netty threads (reactor-http-epoll-*) are not picking up the response. I have enabled debug logs to see what those threads are doing. From preliminary analysis, it look like those threads are processing something else and are always in runnable state. When this happens the traffic to server is not unusual and it's same as before.
My question here is:
Why response was not processed by reactor threads while response was received(according to the logging of the downstream app, it sent the response. However, spring-cloud app received response way too late in the logs). Is it possible that all the threads are busy doing other things.
Is there any run book on how such issues should be investigated?
Some-places in logs I do see high number of inactive connections in logs but not sure if that is impacting anything. (Channel cleaned, now 56 active connections and 1400 inactive connections)
Any general guidance on how to proceed with investigation to understand why random slowness is happening in application will really help. Thanks for the help in advance.

Okay, so I ended up doing below things and after lot of investigation it started working fine for me.
Enable logging. Look at how many connections are getting created. In my case, lot of new connections were getting created and and they were not getting re-used.
io.netty.leakDetectionLevel=paranoid
logging.level.reactor.netty=DEBUG
logging.level.reactor.netty.channel.FluxReceive=DEBUG
spring.cloud.gateway.httpclient.wiretap=true
spring.cloud.gateway.httpserver.wiretap=true
Make sure there is no blocking code running on reactor-http-epoll-* threads.
I upgraded Spring Cloud dependencies from Greenwhich train to latest version of Hoxton train.

Related

Why did we only receive the response half of the time (round-robin) with "Spring Cloud DataFlow for HTTP request/response" approach deployed in PCF?

This issue is related to 2 earlier questions:
How to implement HTTP request/reply when the response comes from a rabbitMQ reply queue using Spring Integration DSL?
How do I find the connection information of a RabbitMQ server that is bound to a SCDF stream deployed on Tanzu (Pivotal/PCF) environment?
As you can see the update for the question 2 above, we can receive the correct response back from the rabbit sink. However, it only works half of the time alternated as round-robin way (success-timeout-success-timeout-...). The outside http app was implemented with Spring Integration showed in question 1 - sending the request to the request rabbit source queue and receiving the response from the response rabbit sink queue. This only happened in PCF environment after we deployed both the outside http app and created the stream (see following POC stream) there. However, it's working locally all the time (NOT alternately). Did we miss anything? Not sure what's the culprit in PCF. Thanks.
rabbitSource: rabbit --queues=rabbitSource | my-processor | rabbitSink: rabbit --routing-key=pocStream.rabbitSink.pocStream
Sounds like you have several instances of your stream in that PCF environment. This way there are more then one (round-robin feels like two) subscribers to the same RabbitMQ queue. Where only one consumer must be for that queue since only initiator of the request waits for reply, but odd (or even) replies go to different consumer of the same queue. I don't place it as an answer, just because it is the best guess what is going on since you don't see a problem locally.
Please, investigate your PCF environment and how does it scale instances for your stream. There also might be some option of SCDF which does scaling for us.

Failed Deployment in App Engine Google Cloud

I am deploying my nodejs application in google cloud app engine but it is giving error
This request caused a new process to be started for your application, and thus caused your application code to be loaded for the first time.
This request may thus take longer and use more CPU than a typical request for your application. -- when making request.
I had also saw some stackoverflow answers, but they didn't worked for me.
my app.yaml have this config
runtime: nodejs10
Can anyone help me out
You could add the following to your app.yaml:
inbound_services:
- warmup
And then implement a handler that will catch all warmup requests, so that your application doesn't get the full load. The full explanation is given here. Another detailed post about this topic can be found here.
Additionally you can also add automatic scaling options. You can play a bit with those to find the optimum for your application. Especially the latency related variables are important. Good to note that they can be set in a standard GAE environment.
automatic_scaling:
min_idle_instances: automatic
max_idle_instances: automatic
min_pending_latency: automatic
max_pending_latency: automatic
More scaling options can be found here.
The "request caused a new process to be started" notification usually occurred when there is no warm up request present in your application.
Can you try to implement a health check handler that only returns a ready status when the application is warmed up. This will allow your service to not receive traffic until it is ready.
Warning: Legacy health checks using the /_ah/health path are now
deprecated, and you should migrate to use split health checks.
Here you can find Split health checks for Nodejs
Liveness checks
Liveness checks confirm that the VM and the Docker container are
running. Instances that are deemed unhealthy are restarted.
path: "/liveness_check"
check_interval_sec: 30
timeout_sec: 4
failure_threshold: 2
success_threshold: 2
Readiness checks
Readiness checks confirm that an instance can accept incoming
requests. Instances that don't pass the readiness check are not added
to the pool of available instances.
path: "/readiness_check"
check_interval_sec: 5
timeout_sec: 4
failure_threshold: 2
success_threshold: 2
app_start_timeout_sec: 300
Edit
For App Engine Standard, which doesn't afford you that flexibility, hardware and software failures that cause early termination or frequent restarts can occur without prior warning. link
App Engine attempts to keep manual and basic scaling instances running
indefinitely. However, at this time there is no guaranteed uptime for
manual and basic scaling instances. Hardware and software failures
that cause early termination or frequent restarts can occur without
prior warning and can take considerable time to resolve; thus, you
should construct your application in a way that tolerates these
failures.
Here are some good strategies for avoiding downtime due to instance
restarts:
Reduce the amount of time it takes for your instances restart or for
new ones to start.
For long-running computations, periodically create
checkpoints so that you can resume from that state.
Your app should be "stateless" so that nothing is stored on the instance.
Use queues for performing asynchronous task execution.
If you configure your instances to manual scaling: Use load balancing across > multiple instances. Configure more instances than required to handle normal
traffic. Write fall-back logic that uses cached results when a manual
scaling instance is unavailable.
Instance Uptime

IIS 8.5 Request Management | Queue | Timeout problem

I have a WCF SOAP service that receives too many synchronous requests from other systems
I am having problem when too many request comes at that time IIS Queue will not provide proper result and server memory and CPU usage is gone high and discard request or time out
I did a normal Load test (100 requests with 100 concurrent user) and the IIS started to discard the requests after the maximum queue length reach as well as all the request coming delays to provide the response and Other requests coming in are delayed until the first one either times out, or responds.
Below is server configuration
WCF application code is tested with Resharper tool and there is no object or memory dispose issue
Is there any settings for setup application pool or worker process to manage queue ?
Can i apply web garden in Application ?
Please help me to solve this issue
Thanks in Advance

Mulesoft cloudhub timeout although it works well locally

Mule cloudhub times out only on HTTPs.
I have a mule http listener in my flow. It works all right locally as well as on the cloudhub upon deployment.
To add security, I switched on HTTPs and I did so as per this blog.
https://docs.mulesoft.com/runtime-manager/building-an-https-service
Works fine locally.But once it's deployed to cloudhub, it starts timing out with HTTP 504. I've even increased the idle time out to a pretty high value. But, it still times out.
Anyone faced this? would be great to get some solution for this.
Reason 1) There is a hard timeout for the Mule Cloudhub which is 5 minutes. If your API is exceeding it then that's the reason for the time out in the cloud. As per my knowledge, it's not configurable/overwritable for each organization client.
You might have to look into a callback model here.
Reason 2) If your API not taking 5 minutes or more time for processing, then it might be something with the SSL configurations
Reason 3) If the above two items are good then this might be something with network connectivity. Please verify the firewall rules in this case.

How best to manage Redis connections using ServiceStack?

I work on a few .NET web apps that use Redis heavily for caching along with ServiceStack's Redis client. In all cases I've got Redis running on the same machine. I've used both BasicRedisClientManager and PooledRedisClientManager (always implemented as singletons) and have had some issues with both approaches.
With BasicRedisClientManager, things would work fine for a while, but eventually Redis would start refusing connections. Using netstat we discovered that thousands of TCP connections to the default Redis port were hanging around in TIME_WAIT status.
We then switched to PooledRedisClientManager, which seemed to fix the problem immediately. However, not long after, we started noticing occasional CPU spikes that we narrowed down to thread waiting (System.Threading.Monitor.Wait calls) caused by PooledRedisClientManager.GetClient.
In code, we use a get-in-get-out approach (using ServiceStack's handy ExecAs shortcuts) so in general connections are acquired very frequently but held as briefly as possible.
We get a modest amount of traffic but we're no StackExchange, and I can't help but think the ServiceStack client is up to the job and we're just doing something wrong. Is PooledRedisClientManager the correct approach here? Would it be advisable to simply increase the pool size? Or is that likely just masking a problem with our code?
Just looking for general guidance here, I don't have specific code I need help with at this point. Thanks in advance.
Are you absolutely sure all Redis connections are being disposed?
With ServiceStack, the Redisproperty on Service and ViewPageBase (if you're using SS Razor) do dispose themselves, but any time you request a connection from the pool yourself you must dispose it yourself.
However, despite this, we recently had issues with our pool being exhausted of all connections, too. One of my colleagues discovered that there wasn't proper clean up for Razor pages and made a pull request here - This means that there has only been correct disposal on Razor pages since ServiceStack v4.0.21. I have not checked if that fix has been back-ported to the v3 branch.
My colleague also added TrackingRedisClientsManager that may help you track down the improper disposal. See here
You can also check the stats of a PooledRedisClientManager by using this helper method. We threw it on a little razor page to check the stats as we feel appropriate) but you could write better code around this to monitor the pool health of specific nodes, too.