All host and service statuses in the top counter and the monitoring page flap to UNKNOWN then back to normal every few minutes - centreon-api

Eric here,
I did a fresh install of Centreon 21.10.8, with a Central and Database server.
Soon after adding hosts and services to be monitored, I noticed that their status goes to UNKNOWN for a few seconds (about 5 to 10 s) before coming back to normal, both in the top counter and in the monitoring views. This happens at random every few minutes.
However, the real status of the servers is unchanged, and checks run on the command line from the Central poller are OK.
--
OS: Red Hat 8
Centreon Version: Centreon 21.10.8
Browser: Firefox 106.0.1, Chrome 107.0.5304.63
Steps to reproduce:
I simply open the browser on a monitoring view and observe for a few minutes.
What I have tried:
Test Network Performance: I tried to see what is going on in the browser; I can see a lot of AJAX/XHR requests from the browser to the server. The execution time of these requests in the browser does not seem long (100 ms to 200 ms for the top counter statuses and 1 s to 2 s for the monitoring views). I tried the same requests via curl on the CLI of the Central server and I get the same execution times.
Modify Refresh Settings: I tried changing the settings Administration > Setting > Centreon web > Statistics page Refresh Interval from 15s to 47s and Administration > Setting > Centreon web > Monitoring page Refresh Interval from 15s to 73s.
I also noticed there is a JavaScript file called vendor.2d6b7428.js that makes a large number of status requests (one every 2 s) to the API right after the first status requests initiated by the web page itself. I found it on the server at /usr/share/centreon/www/static/vendor.2d6b7428.js and in the header of the Centreon web page in a statement:
<script defer="defer" src="./static/vendor.2d6b7428.js"></script>
The flapping behavior persists.

A solution was found in GitHub issue #5609; the solution consisted of setting the parameter Instance timeout (Configuration > Pollers > Broker configuration > Output > Instance timeout, or "instance_timeout" in /etc/centreon-broker/central-broker.json) back to its default value. The value previously set by my team was 20 seconds, which caused a race condition between the freshness verification task and the refresh interval for resource statuses, resulting in the flapping statuses.
Additional information for future readers:
The "instance_timeout" (Configuration > Pollers > Broker configuration > Output > Instance timeout) defines a freshness time for the statuses in the GUI; passed that interval, statuses are considered expired and shown as UNKNOWN until refreshed by a routine call to the API. Default value is 300s.
The "monitoring_default_refresh_interval" (Administration > Parameters > Centreon UI > Refresh Properties) defines an interval of time after which a query will be made to API to update the status of resources. Default value is 15s.

Related

WSO2 login screen timeouts?

Back when we were running the regular Apereo CAS, there was a setting for login session timeouts, so that if someone went to the CAS login screen and just let it sit, the login session would time out after a certain period of time (5-10 minutes, IIRC).
I was curious whether there is a similar configuration setting in WSO2, and if so, which parameter it is.
The reason I'm asking is that on Saturday we did our first round of incoming student registrations, and apparently the Admissions folks logged in to all of the lab computers and got them to the login screen about an hour before the students went to use them, and no one could log in until they refreshed their browsers. So I expect there is a setting for that somewhere, I'm just not sure which one. Just looking at the identity.xml file, there are quite a few configurable timeout settings, and I'm not sure if it's even one of these:
...../repository/conf/identity # cat identity.xml | grep -i timeout
<CleanUpTimeout>720</CleanUpTimeout>
<CleanUpTimeout>2</CleanUpTimeout>
<SessionIdleTimeout>720</SessionIdleTimeout>
<RememberMeTimeout>10080</RememberMeTimeout>
<AppInfoCacheTimeout>-1</AppInfoCacheTimeout>
<AuthorizationGrantCacheTimeout>-1</AuthorizationGrantCacheTimeout>
<SessionDataCacheTimeout>-1</SessionDataCacheTimeout>
<ClaimCacheTimeout>-1</ClaimCacheTimeout>
<PersistanceCacheTimeout>157680000</PersistanceCacheTimeout>
<SessionIndexCacheTimeout>157680000</SessionIndexCacheTimeout>
<ClientTimeout>10000</ClientTimeout>
<!--<Cache name="AppAuthFrameworkSessionContextCache" enable="false" timeout="1" capacity="5000"/>-->
<CacheTimeout>120</CacheTimeout>
The global configuration can be found in the <IS_HOME>/repository/conf/identity/identity.xml file under the <TimeConfig> element.
<TimeConfig>
<SessionIdleTimeout>15</SessionIdleTimeout>
<RememberMeTimeout>20160</RememberMeTimeout>
</TimeConfig>
Management console session timeout: open repository/conf/tomcat/carbon/WEB-INF/web.xml and increase the session-timeout value:
<session-config>
<session-timeout>240</session-timeout>
<cookie-config>
<secure>true</secure>
</cookie-config>
</session-config>

HAProxy passive health checking

I'm new to HAProxy and load balancing. I want to see what happens when a backend host is turned off while the proxy is running.
The problem is, if I turn off one of the backends and refresh the browser, the page immediately exposes a 503 error to the user. On the next page load I no longer get the error, since presumably that backend has been removed from the pool.
As a test I have set up two backend Flask apps and configured HAProxy to balance them like so:
backend app
mode http
balance roundrobin
server app1 127.0.0.1:5001 check
server app2 127.0.0.1:5002 check
My understanding, according to this:
https://www.haproxy.com/doc/aloha/7.0/haproxy/healthchecks.html#check-parameters
is that every 2 seconds the backend hosts are pinged to see if they are up, and they are removed from the pool if they are down. The 5xx error happens in the window between the time I kill the backend and that next check.
I would think there is a way to avoid this 5xx error by having HAProxy perform a little logic: if a request to a backend fails, remove that failed backend from the pool, switch to another one, and retry the request. This way the user would never see the failure.
Is there a way to do this, or should I try something else so that my users don't get an error?
By default, HAProxy will retry 3 times (retries) at 1 s intervals against the same backend. To allow it to pick another backend, you should set option redispatch (see the config sketch after the list below).
Also consider (carefully, as it can be harmful):
decrease fall (default is 3),
decrease error-limit (default is 10) and set on-error to mark-down or sudden-death,
tune the health check intervals with inter/fastinter/downinter.
Note: HAProxy retries only on connection errors (e.g. ECONNREFUSED, as in your case); it will not resend/resubmit the request or its data.
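A minimal sketch of the backend from the question with those options applied; the specific numbers are illustrative assumptions, not recommendations:
backend app
mode http
balance roundrobin
retries 3
option redispatch
default-server inter 2s fastinter 500ms downinter 5s fall 2 error-limit 5 on-error mark-down
server app1 127.0.0.1:5001 check
server app2 127.0.0.1:5002 check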

RUN@cloud consistently throws me out during a heavy operation

I'm using a large app instance to run a basic Java web application (GWT + Spring). There's an expensive operation within my application (a report) which takes a long time to execute.
I've tried running it with the CloudBees SDK on my local machine, with settings similar to those it would have on the cloud, and it seems to function just fine. It runs in about 3-4 minutes.
On the cloud it seems to be taking longer. The problem isn't the fact that it takes long. What happens is that CloudBees terminates the session after 5 minutes and gives me an error in my browser saying 'Unable to connect to server. Please contact your administrator'. A report which doesn't take as long runs just fine. My application has a session timeout of 30 minutes, so that isn't a problem either.
What could possibly be going wrong? Is it something to do with cloudbees?
This may be due to proxy buffering of your request through the routing layer (revproxy), so it most likely isn't a session timeout but the HTTP connection getting cut.
You can either set proxyBuffering=false via the bees CLI command (e.g. when you deploy the app), which will ensure longer-running connections can work.
Ideally, however, you could change the app slightly to return to the browser right away with a token it can poll with to get completion status; even with a connection that is allowed to last that long, holding it open over the internet may give a bad experience compared to running locally. A sketch of that pattern is below.
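A minimal sketch of that token-and-poll pattern, assuming a Spring MVC controller and a hypothetical ReportService; none of these names come from the original application:
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Controller;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.ResponseBody;

// Hypothetical service that builds the expensive report.
interface ReportService {
    byte[] buildReport();
}

@Controller
public class ReportController {

    // Jobs are kept in memory for simplicity; a real app would also evict finished ones.
    private final Map<String, Future<byte[]>> jobs = new ConcurrentHashMap<String, Future<byte[]>>();
    private final ExecutorService executor = Executors.newFixedThreadPool(2);
    private final ReportService reportService;

    @Autowired
    public ReportController(ReportService reportService) {
        this.reportService = reportService;
    }

    // Kick off the report in the background and hand the browser a token immediately,
    // so no single HTTP request has to stay open for the full 3-4 minutes.
    @RequestMapping("/report/start")
    @ResponseBody
    public String start() {
        String token = UUID.randomUUID().toString();
        jobs.put(token, executor.submit(new Callable<byte[]>() {
            public byte[] call() {
                return reportService.buildReport();
            }
        }));
        return token;
    }

    // The browser polls this endpoint every few seconds with the token,
    // then fetches the finished report once the status is DONE.
    @RequestMapping("/report/status")
    @ResponseBody
    public String status(@RequestParam("token") String token) {
        Future<byte[]> job = jobs.get(token);
        return (job != null && job.isDone()) ? "DONE" : "RUNNING";
    }
}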

LR: VuGen web_set_timeout function unrealistic?

I understand that VuGen's web_set_timeout function allows me to set a timeout value higher than the usual value (which seems to be 120 seconds).
What I do not understand: doesn't this imply that all users would have to set their browser's HTTP POST timeout to a new, higher value? Don't I then test with a (simulated/virtual) user configuration that no real-world user would or could use?
Wouldn't I also require all proxies between the user and the web server to be configured with an at-least-as-high timeout value for a custom browser timeout to matter? Otherwise my users' transactions would fail while my load test passes.
Context: load test of a browser-based (Ajax) frontend with VuGen 9.51. The browser times out on a web server request with Error -27728 Step download timeout (120 seconds) has expired when downloading non-resource(s), and I hesitate to use web_set_timeout for obvious reasons.
Each browser has a different time-out value defined. This value can also be changed rather easily by users.
Have a look at http://support.microsoft.com/kb/181050 for info on IE timeouts.
In short it says:
Internet Explorer imposes a time-out limit for the server to return data.
By default, the time-out limit is as follows:
Internet Explorer 4.0 and 4.01: 5 minutes
Internet Explorer 5.x and 6.x: 60 minutes
Internet Explorer 7 and 8: 60 minutes
Internet Explorer does not wait endlessly for the server to come back with data when the server has a problem.
Also, many services used today are machine-to-machine services (often SOAP requests are used for this) and they may have time-outs that are interface specific.
The place in VuGen where this is set from the UI is Run-Time Settings | Preferences | Options; in that list the following timeouts can be set:
HTTP-Request connect timeout: default 120 seconds
HTTP-Request response timeout: default 120 seconds
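In the script itself the equivalent is the web_set_timeout function; a minimal sketch, raising only the step-download timeout around the one slow transaction (the URL and transaction name are made up for illustration):
web_set_timeout("STEP", "300");   /* seconds, passed as a string */
lr_start_transaction("long_report");
web_url("report", "URL=http://myserver.example/report", LAST);
lr_end_transaction("long_report", LR_AUTO);
web_set_timeout("STEP", "120");   /* restore the default */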
In practice, however, if a normal web UI takes more than 5-10 seconds to respond to user clicks, the service will be considered slow by the users.
The exception here is SAP EP, where 30+ minutes of waiting for simple things is OK ... :)

WCF receive timeout

When attempting to connect/communicate with my service I have to wait almost exactly 20 seconds each time before the exception is fired. Since this is all going to be running on a local network, I would like to decrease that timeout period to 5 seconds. I tried decreasing the receiveTimeout on my client, but it didn't work. I looked all over my code for a 20-second timeout variable being set, but couldn't find any. What should I be changing?
There are different timeout settings (see http://msdn.microsoft.com/en-us/library/ms731078.aspx). They can be set, for example, in a config file (web.config or app.config); see http://msdn.microsoft.com/en-us/library/ms731343.aspx for an example. At http://msdn.microsoft.com/en-us/library/ms731399.aspx you can choose the binding you use and set the corresponding settings.
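Purely as an illustration (the binding type, name and values here are assumptions, not taken from your service), the timeouts end up looking roughly like this in app.config/web.config. Note that receiveTimeout governs how long an idle connection is kept open, while openTimeout/sendTimeout are what the client waits on when calling, which may be why lowering receiveTimeout alone did not help:
<bindings>
<netTcpBinding>
<binding name="shortTimeouts" openTimeout="00:00:05" sendTimeout="00:00:05" receiveTimeout="00:10:00" closeTimeout="00:00:05" />
</netTcpBinding>
</bindings>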
UPDATED: You probably have the timeout set at the TCP level. Try reducing the TcpMaxConnectRetransmissions (default value 2) or TcpInitialRTT (default value 3; on NT 4.0 the parameter is named InitialRTT) parameters in the registry, reboot your computer and try your experiments one more time. You can read about the effect of the roughly 21-second delay in http://support.microsoft.com/kb/223450, http://support.microsoft.com/kb/175523, http://support.microsoft.com/kb/170359 or http://www.boyce.us/windows/tipcontent.asp?ID=189. You can read a description of the TCP/IP default configuration values at http://support.microsoft.com/kb/314053 (for Windows XP) and http://technet.microsoft.com/en-us/library/cc739819(WS.10).aspx (for Windows Server 2003 with SP2).
What you may actually be seeing is the cold start of your web app. The Service Not Found exception would fire back pretty quickly unless you had hit it pretty hard and started queueing service requests beyond what WCF was configured to handle.
However, if your website was unloaded (appdomain and worker process), it could take 20 seconds to reach the code that builds the channel to your service. So the timeout may be masking something else.
If your website and service are in different application pools, this is magnified because it has to cold start the website and then cold start the service, which happens in succession instead of simultaneously.
To somewhat alleviate this you can use a keepalive/ping service: something that constantly hits the URL to keep the AppDomain in memory and the worker process alive (if not shared). By default IIS 6 will shut down the worker process after 20 minutes of inactivity, so when the first request comes in, http.sys starts up a new worker process, which loads the framework, which loads your app, which starts the pipeline, which executes your code, which delivers to your user. :)