Why is my server's load average 5 while the CPU is not busy? - centos5

My server's load average is 5, but the CPU is not busy. Does anybody know how to fix this issue? What could be the reason? Thanks.

Probably iowait. From the Wikipedia article on "load average":
Most UNIX systems count only processes in the running (on CPU) or runnable (waiting for CPU) states. However, Linux also includes processes in uninterruptible sleep states (usually waiting for disk activity), which can lead to markedly different results if many processes remain blocked in I/O due to a busy or stalled I/O system.
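To confirm this on the box, you can check the iowait percentage and look for processes stuck in the uninterruptible "D" state. A minimal sketch, assuming standard Linux tools (output columns vary a little between versions):

# Watch the "wa" (iowait) column, sampling once per second:
vmstat 1 5
# List processes currently in uninterruptible sleep (state D):
ps -eo state,pid,cmd | grep "^D"

If the second command consistently shows processes, the load average is being driven by blocked I/O rather than CPU demand.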

RabbitMQ poor performance

We are facing bad performance in our RabbitMQ clusters, even when they are idle.
After installing the rabbitmq-top plugin, we see many processes with very high reductions/sec: 100k and more!
Questions:
What does it mean?
How to control it?
What might be causing such slowness without any errors?
Notes:
Our clusters are running on Kubernetes 1.15.11
We allocated 3 nodes, each with 8 CPU and 8 GB limits, and set vm_watermark to 7G. Actual usage is ~1.5 CPU and 1 GB RAM.
RabbitMQ 3.8.2. Erlang 22.1
We don't have many consumers or producers; the slowness occurs even in a fairly idle environment.
rabbitmqctl status is very slow to return details (sometimes taking 2 minutes) but does not show any errors.
After some more investigation, we found that the problem was actually made up of two issues.
The RabbitMQ (Erlang) runtime configuration by default (using the Bitnami Helm chart) assigns only a single scheduler. This is fine for a simple app with a few concurrent connections, but a production-grade deployment with thousands of connections needs many more schedulers. Bumping up from 1 to 8 schedulers improved throughput dramatically.
Our monitoring was hammering RabbitMQ with a lot of requests per second (about 100/sec). The monitoring hits the aliveness-test endpoint, which creates a connection, declares a (non-mirrored) queue, publishes a message, and then consumes that message. Disabling the monitoring reduced load dramatically: an 80-90% drop in CPU usage, and reductions/sec also dropped by about 90%.
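For reference, here is roughly how the scheduler fix can be applied. This is a sketch under our setup's assumptions: RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS is the standard way to pass extra Erlang VM flags to RabbitMQ, +S sets the scheduler count (see the runtime guide linked below), and the Bitnami chart lets you set extra environment variables on the container:

# Give the Erlang VM 8 schedulers (8 total : 8 online) instead of 1:
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="+S 8:8"
# A much cheaper health check than hitting the aliveness-test endpoint:
rabbitmq-diagnostics ping

Switching the monitoring to a lightweight check like rabbitmq-diagnostics ping avoids the connect/declare/publish/consume cycle that aliveness-test performs on every probe.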
References
Performance:
https://www.rabbitmq.com/runtime.html#scheduling
https://www.rabbitmq.com/blog/2020/06/04/how-to-run-benchmarks/
https://www.rabbitmq.com/blog/2020/08/10/deploying-rabbitmq-to-kubernetes-whats-involved/
https://www.rabbitmq.com/runtime.html#cpu-reduce-idle-usage
Monitoring:
http://rabbitmq.1065348.n5.nabble.com/RabbitMQ-API-aliveness-test-td32723.html
https://groups.google.com/forum/#!topic/rabbitmq-users/9pOeHlhQoHA
https://www.rabbitmq.com/monitoring.html

Blocked Processes

As far as I know, certain conditions must be satisfied for a process to continue running. If they are not met, the process is blocked so as not to waste time. Once those conditions are satisfied, the process enters the ready state.
However, I came across a passage like this in Andrew Tanenbaum's book "Modern Operating Systems": There are two types of processes: system processes and user processes. If the processor takes a disk interrupt while executing a user application, the system decides to stop running the current process and starts running the disk process. In this case, the application process is kept in the blocked state. After the disk is read, or something is written to the disk, the process that was waiting for it is unblocked.
I know that a process is blocked only when a requirement or condition is not satisfied. However, this passage seems to say that the disk process has higher precedence, and that is why the application process is blocked. Is precedence a factor in blocking a process?
What you are describing makes no sense. I have to wonder if this is the result of how you have quoted the book.
First of all, the processor does not block processes; the operating system does.
Second, I have not worked on an operating system that works anything like the way you describe here.
Usually, if a disk drive triggers an interrupt, the current process handles that interrupt. While in kernel mode, the operating system does whatever queuing is necessary for the disk operation. Only if the process's time slice is up does the running process change; otherwise, after the interrupt is handled, the process picks up where it left off.
I cannot imagine a "modern" operating system that invokes a disk process to handle disk interrupts.

Google Compute Engine VM constantly crashes

On a Compute Engine VM in us-west-1b, I run 16 vCPUs at near 99% usage. After a few hours, the VM crashes. This is not a one-time incident, and I have to restart the VM manually each time.
There are a few instances of CPU usage suddenly dropping to around 30%, then bouncing back to 99%.
There are no logs for the VM at the time of the crash. Is there any other way to get the error logs?
How do I prevent VMs from crashing?
[CPU usage graph]
This could be the process manager indicating that your processes are out of resources. You might want to look into kernel tuning, where you can increase the limits on the number of active processes on your VM/OS and the resources available to them. Alternatively, try a bigger machine with more physical resources. In short, your machine is running short on resources, so in order to keep the OS up, the process manager shuts down processes; SSH is one of those processes. Once you reset the machine, everything comes back to normal.
How the process manager/kernel decides to kill a process varies in many ways. It could simply be that a process has stayed up long enough to consume too many resources. Also note that the OS images you use to create a VM on GCP are custom-hardened by Google to limit the malicious capabilities of processes running on such machines.
One of the best ways to tackle this is:
increase the resources of your VM
then go back to the code and find out if something is leaking processes or memory (the sketch below shows how to check for OOM kills first)
if all else fails, you might want to do some kernel tuning to make sure your processes have higher priority than other system processes, though this is a bad idea since you could end up creating a zombie VM.
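Before any tuning, it is worth confirming whether the kernel's OOM killer (or another resource limit) is actually what is ending your processes. A rough sketch, assuming a Linux guest and the gcloud CLI; VM_NAME and the zone are placeholders for your own values:

# Crash-time kernel messages survive in the serial console even when the guest is down:
gcloud compute instances get-serial-port-output VM_NAME --zone us-west1-b
# Inside the VM after a restart, look for OOM kills:
sudo dmesg -T | egrep -i "out of memory|killed process"
# Current limits on the number of processes/threads:
sysctl kernel.pid_max
ulimit -u

If the serial console or dmesg shows "Out of memory: Killed process ...", the fix is more memory or a leak fix, not kernel tuning.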

Why is Tomcat not scaling throughput with increased concurrent load?

For this test, I have a simple Java servlet that reads data in and calculates the CRC32 for it. When making serial requests of 512MB each, I get about 600MB/sec. That makes sense, since a single request can't use all 24 cores available to me to calculate a CRC. The program driving this I/O sits on the local box, to eliminate the possibility of networking issues. I am running Tomcat 8.0.24.0 on FreeBSD using OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode).
Next, I attempt the same test with 6 concurrent requests, expecting that the performance per request might be lower than 600MB/sec, but that the aggregate performance across all 6 requests would be significantly higher.
What I see is that the CPU has some idle time at ALL times (so it doesn't appear that I'm CPU-bound). I also see that all processing threads in Tomcat are running concurrently, as anticipated. However, it looks like I'm only getting around 800MB/sec in aggregate. The threads in Tomcat spend most of their time waiting to read from the socket, as shown below.
I would appreciate any thoughts on how to improve Tomcat throughput, and on why so much time is spent waiting for more data (which I assume is what's going on below).
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitLatch(NioEndpoint.java:1386)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitReadLatch(NioEndpoint.java:1388)
at org.apache.tomcat.util.net.NioBlockingSelector.read(NioBlockingSelector.java:185)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:251)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:232)
at org.apache.coyote.http11.InternalNioInputBuffer.fill(InternalNioInputBuffer.java:133)
at org.apache.coyote.http11.InternalNioInputBuffer$SocketInputBuffer.doRead(InternalNioInputBuffer.java:177)
at org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:110)
at org.apache.coyote.http11.AbstractInputBuffer.doRead(AbstractInputBuffer.java:416)
at org.apache.coyote.Request.doRead(Request.java:469)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:342)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:395)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:367)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:190)
...
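The trace shows the worker parked in NioBlockingSelector.read, i.e., waiting for more request-body bytes from the client, which points at the driver program rather than at Tomcat's processing. One quick way to rule the driver in or out is to run the uploads from independent client processes instead; a sketch, where /crc and payload.bin are hypothetical stand-ins for your servlet path and a 512MB test file:

# Six independent upload clients, so a single driver process can't be the shared bottleneck:
for i in 1 2 3 4 5 6; do
  curl -s -o /dev/null --data-binary @payload.bin http://localhost:8080/crc &
done
wait

If aggregate throughput scales with separate clients, the limit is in the original driver (for example, a shared buffer or too small a thread pool), not in Tomcat.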

SQL Server Log Messages / Memory Clog

I have a dedicated server that's been running for years with no recent code or configuration changes, but about a week ago the MS SQL Server DB suddenly started becoming unresponsive, and shortly thereafter the entire site goes down due to memory issues on the server. It is sporadic, which leads me to believe it could be a malicious DDoS-like attack, but I am not sure how to confirm what's going on.
After a reboot, it can stay up for a few days, or only a few hours, before I start seeing rampant occurrences of the Info messages below in the Windows logs, shortly before it seizes up and fails. Research has not yet yielded any actionable info. Please help, and thank you.
Process 52:0:2 (0xaa0) Worker 0x07E340E8 appears to be non-yielding on Scheduler 0. Thread creation time: 13053491255443. Approx Thread CPU Used: kernel 280 ms, user 35895 ms. Process Utilization 0%%. System Idle 93%%. Interval: 6505497 ms.
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 2940 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 91%%.
Here's a blog post about the issue that should help you get started: danieladeniji.wordpress.
It seems unlikely that this is a DDoS.
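The second message points at worker-thread starvation on Node 0 rather than at external traffic, so the next time it happens it may be worth inspecting the schedulers directly. A sketch using sqlcmd (server name and Windows auth are assumptions; the DMV and configuration option are standard SQL Server):

# Are runnable tasks piling up on any scheduler?
sqlcmd -S localhost -E -Q "SELECT scheduler_id, current_tasks_count, runnable_tasks_count, work_queue_count FROM sys.dm_os_schedulers WHERE status = 'VISIBLE ONLINE';"
# Check the current max worker threads setting (0 means SQL Server decides):
sqlcmd -S localhost -E -Q "EXEC sp_configure 'show advanced options', 1; RECONFIGURE; EXEC sp_configure 'max worker threads';"

A steadily growing work_queue_count alongside near-zero process utilization would match the "not been picked up by a worker thread" message, and usually traces back to blocked or long-running queries rather than an attack.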