ELK with RabbitMQ forwarding logs using NLog fails - rabbitmq

I have the following setup: an ELK stack on a CentOS server, and RabbitMQ with mirrored queues. Publishers use the NLog-HAF-RabbitMQ target to forward logs to the RabbitMQ nodes behind a load balancer.
One publisher is a web application hosted on IIS. Sometimes it stops logging to the ELK stack after the app pool recycle, which happens early in the morning.
Here are my findings:
I enabled NLog's internal logging to check for any connection failures (config sketch below).
During the recycle we get warnings from WAS in Event Viewer:
A process serving application pool exceeded time limits during shut down. The process id.
A worker process serving application pool failed to stop a listener channel for protocol 'http' in the allotted time. The data field contains the error number.
IIS shutdown time limit 3 seconds (default)
IIS start up time limit 3 seconds (default)
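For context, the NLog internal logging mentioned above is enabled via attributes on the root element of NLog.config, roughly like this (a sketch; the path and level are placeholders, not our exact values):
<nlog xmlns="http://www.nlog-project.org/schemas/NLog.xsd"
      internalLogLevel="Warn"
      internalLogFile="C:\logs\nlog-internal.log">
  <!-- targets and rules omitted -->
</nlog>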
Based on the information above, what could be the reason?

Related

Moleculer JS "Redis-pub client is disconnected" every 10 minutes

My application (Node.js) uses Moleculer for microservices and Redis as the transporter. However, the application logs Redis-pub client is disconnected every 10 minutes and then reconnects a few seconds later with the log Redis-pub client is connected. This is a problem because if a client sends a Moleculer action during that window, it fails.
Any idea what is causing this? Let me know if more information is needed.
Azure Cache for Redis currently has a 10-minute idle timeout for connections, so the idle timeout setting in your client application should be less than 10 minutes. Most common client libraries have a configuration setting that lets them send Redis PING commands to the server automatically and periodically. When using a client library without such a setting, the application itself is responsible for keeping the connection alive.
More info: https://learn.microsoft.com/en-us/azure/azure-cache-for-redis/cache-best-practices-connection#idle-timeout

ambari cluster + poor connection between ambari-agent and ambari server

We have an Ambari cluster with 872 data-node machines, running Ambari version 2.6.x.
We currently have some network problems.
After a long investigation we found that the Ambari agent running on some machines does not communicate well with the Ambari server.
As a result we see strange behavior, such as 5 dead data-nodes on the Ambari dashboard, while the data-node machines are definitely healthy.
Is it possible to set a more tolerant value in the Ambari agent configuration, so that the acknowledgement between the Ambari agent and the Ambari server is given a little more time and the network problems are ignored?
Something like a timeout or connection time between the Ambari agent and the Ambari server.
First of all, you need to find the root cause of why the DataNode is showing as dead.
The Ambari agent runs on every node. It is responsible for sending metrics and heartbeats to the Ambari server, which then publishes them to your Ambari web UI.
The NameNode waits for 10 minutes before it declares a DataNode dead and copies its blocks to other DataNodes.
If Ambari shows that a DataNode is dead, check the Ambari agent status on that specific node by running service ambari-agent status. In parallel, check ambari-agent.log on that worker node to see why the Ambari agent stopped working (commands below).
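For example (assuming the stock Ambari package layout; the log location may differ on your installation):
service ambari-agent status
tail -n 200 /var/log/ambari-agent/ambari-agent.log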
You can configure the HTTP timeouts the Ambari agent uses for server communication and service tasks in its configuration file:
https://github.com/apache/ambari/blob/trunk/ambari-agent/conf/unix/ambari-agent.ini
There is an HTTP timeout section that you can tune based on your network throughput.
On an installed node the file is typically /etc/ambari-agent/conf/ambari-agent.ini.

Weblogic server fails to respond under load

We have quite a strange situation. Under load, our WebLogic 10.3.2 server fails to respond. We are using RESTEasy with HttpClient version 3.1 to communicate with a web service deployed as a WAR.
We have a calculation process that runs in 4 containers on 4 physical machines, and each of them sends requests to WebLogic during the calculation.
On each run we see messages from HttpClient like this:
[THREAD1] INFO I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server OUR_SERVER_NAME failed to respond
[THREAD1] INFO Retrying request
HttpClient retries the request several times until it gets the data it needs.
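For context, that automatic retry comes from HttpClient 3.1's method retry handler, which can be set per method; roughly like this (a sketch, not our exact code, with a placeholder URL):
import org.apache.commons.httpclient.DefaultHttpMethodRetryHandler;
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class RetryExample {
    public static void main(String[] args) throws Exception {
        HttpClient client = new HttpClient();
        GetMethod method = new GetMethod("http://OUR_SERVER_NAME/some/resource"); // placeholder URL
        // Retry up to 3 times; do not retry requests that have already been fully sent.
        method.getParams().setParameter(HttpMethodParams.RETRY_HANDLER,
                new DefaultHttpMethodRetryHandler(3, false));
        int status = client.executeMethod(method); // a NoHttpResponseException triggers the retry handler
        System.out.println(status);
        method.releaseConnection();
    }
}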
I want to understand why WebLogic can refuse connections. I read about the WebLogic thread pool that processes HTTP requests and found that WebLogic allocates a separate thread per web request and that the number of threads is not bounded in the default configuration. Our server is also configured with Maximum Open Sockets: -1, which means the number of open sockets is unlimited.
From this, I would like to understand where the issue is: on the WebLogic side, or in our business logic? Can you help me investigate the situation more deeply?
What else should I check to confirm that our WebLogic server is configured to handle as many requests as we need?

WCF Service: Status 200 with sc-win32-status of 64

We observed the following behavior on one of the servers hosting a WCF service on IIS 6.0:
The IIS log shows a high value for time-taken (> 100,000 ms, i.e. more than 100 seconds)
The HTTP status code is 200
sc-win32-status code shows a value of 64
I found out that an sc-win32-status code of 64 indicates "The specified network name is no longer available"
Initially I suspected that it could be because of limits set on MinFileBytesPerSecond, which sets the minimum throughput rate that HTTP.sys enforces when sending data from the client to the server, and back from the server to the client.
But the values for sc-bytes and cs-bytes indicate that the amount of data sent is within the range generally observed for the service.
Also note that the WCF service is hosted on four load-balanced boxes, but the problem occurs on only one of the servers at a time (and not necessarily the same server). The problem is also intermittent.
Has anybody else encountered this error? Any clues about what could be wrong?
Update
Note: Observation on IIS 7.5 (IIS version does not really matter)
I was able to replicate the issue. The issue occurs if:
1. The WCF service takes a long time to respond
2. The client proxy times out before it receives a response from the server. In this case it leads to TimeoutException on the client.
3. The server keeps waiting for a TCP ACK from the client, which it will never receive.
Hence the long time-taken (the TCP socket timeout, default value 4 minutes) and the sc-win32-status of 64.
So essentially it appears that the WCF code is taking a long time to respond and the client is timing out; what I observe in the IIS log is just a symptom, not the underlying problem.
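For reference, the client-side timeout in point 2 is controlled by the timeouts on the client binding, for example (basicHttpBinding and the values are only illustrative, not our production settings):
<bindings>
  <basicHttpBinding>
    <binding name="MyServiceBinding"
             sendTimeout="00:01:00"
             receiveTimeout="00:10:00" />
  </basicHttpBinding>
</bindings>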
The behavior you are describing will also occur if you exceed the WCF service's maximum sessions, calls, or instances (depending on how the service's InstanceContextMode is configured). If you observe the System.ServiceModel performance counters for % of max concurrent sessions and/or % of max concurrent calls (again depending on your service's instance context), you may see a correlation with the IIS log entries.
Note that these maximums can be configured in the service throttling behavior (example below the link).
https://msdn.microsoft.com/en-us/library/vstudio/system.servicemodel.description.servicethrottlingbehavior(v=vs.100).aspx
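For example, a serviceThrottling element in the service behavior (the behavior name and numbers here are only illustrative, not recommended values):
<behaviors>
  <serviceBehaviors>
    <behavior name="ThrottledBehavior">
      <serviceThrottling maxConcurrentCalls="16"
                         maxConcurrentSessions="100"
                         maxConcurrentInstances="116" />
    </behavior>
  </serviceBehaviors>
</behaviors>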
I saw your question again and wanted to point out that I found a solution for this. It turned out to be this piece of code in the web.config:
<pages smartNavigation="true">
After turning this off I stopped receiving the same time-out errors. See also the answer here
IIS puts the service to sleep to save resources.
Copied from here (WCF REST Service goes to sleep after inactivity)
The application pool hosting your service defines an Idle Time-out property (advanced settings of the app pool in the IIS management console) which defaults to 20 minutes. If no request is received by the app pool within the idle timeout, the worker process serving the pool is terminated. After receiving a new request, IIS must start the process again, the process must load the application domain and all related assemblies, compile the .svc file, run the service host, and process the request. The solution can be increasing the idle time-out, but the purpose of this time-out is correct handling of server resources: if the process is not needed, it should be stopped. Another ugly workaround is using some ping process (for example a cron job or scheduled task on the server) which regularly calls some method on the service or a page in the same application.
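For example, the idle time-out can be raised or disabled with appcmd ("MyAppPool" is a placeholder; 00:00:00 disables the idle time-out):
%windir%\system32\inetsrv\appcmd.exe set apppool "MyAppPool" /processModel.idleTimeout:00:00:00
Disabling it keeps the worker process warm, at the cost of keeping its resources allocated permanently.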

How to configure Glassfish to drop hanging requests?

Can I configure Glassfish to drop any request that takes longer than 10 seconds to process?
Example:
I'm using Glassfish to host my web service. The thread pool is configured to have max 5 connections.
My service has a method that does this:
System.out.println("New request");
Thread.sleep(1000 * 1000); // hangs for ~1,000 seconds; real code must handle InterruptedException
I create 5 requests to the service and see 5 "New request" messages in the log. Then the server stops responding for a very long time.
In the live environment, every request must be processed in less than a second. If processing takes longer, there is a problem with that request, and I want Glassfish to drop such requests while staying alive and serving other requests.
Currently I'm using a workaround in code: at the beginning of my web method I launch a separate thread for request processing, with a timeout, as suggested here: How to timeout a thread (see the sketch at the end of this question).
I do not like this solution and still believe there must be a configuration setting in Glassfish that applies this logic to all requests, not just to one method.
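For reference, the code-level workaround looks roughly like this (a minimal sketch; the class and method names are illustrative, and the worker thread is only truly freed if the business logic reacts to interruption):
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutWrapper {

    private static final ExecutorService WORKER = Executors.newCachedThreadPool();

    // Run the real work on a separate thread and give up after 10 seconds,
    // so the container thread is released even if the work hangs.
    public static <T> T callWithTimeout(Callable<T> work) throws Exception {
        Future<T> task = WORKER.submit(work);
        try {
            return task.get(10, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            task.cancel(true); // interrupt the hanging worker
            throw new Exception("Request processing exceeded 10 seconds", e);
        } catch (ExecutionException e) {
            throw new Exception(e.getCause());
        }
    }
}
The web method then wraps its business logic in a Callable and calls callWithTimeout(...), which is exactly the per-method boilerplate I would like to replace with a container-level setting.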