During peak hours of our service, I notice the CPU load goes up to 100%. However, when I SSH into the machine and use top or htop, I don't ever see the CPU usage go above 25%. This instance is a dedicated load balancer running HAProxy.
Here is a screenshot of the Dashboard: https://gyazo.com/010715208f81ec97c1bc9b78123fe4d4
Here is a screenshot of top in the instance: https://gyazo.com/afa7098331f7a3f66c018041b9be686d
During these peak times, I noticed there are some latency and this is not caused by database or my other server instances as I checked their loads and were far below threshold. I was wondering if Compute Engine is throttling by mistaking it is at 100% CPU usage or something?
Has anyone had a similar experience?
Related
We are trying to build a low latency system using Java.
As part of fine tuning, we do the initial warm up of the application during startup to go above the default compile threshold 10000. We could see that any live processing latency immediately after the warmup looks better. The latency looks good when the system is continuously processing events. But when the system goes back to idle state where there is no events coming for 10 minutes, the latency starts degrading.
I initially thought it could be JVM evicting the generated code from the cache. I have set a higher size for code cache and enabled events of code cache and noticed that the code cache is not even reaching 50% capacity.
Any idea what could be the issue? Basically the initial warm up during start up goes in vain once the system goes into idle mode for a short duration.
I created a TruClient Web (IE) protocol script in LR12.55, when I try to run the script with 50 users, only some would go into running state (in between 25-37) and the rest would stuck in init forever.
I tried to change the Controller -> Options-> Timeout and changed Init timeout from default 180 to 999 however it does not resolve the issue. Can anybody comment on how to resolve this????
TruClient runs a real browser for each vuser (virtual-user), so system resource consumption is higher the API-level testing.
It is possible that 50 vusers is too much for your load-generator machine.
I'd suggest checking CPU and memory levels during the run. If either is over 80% utilization, you should split your load between multiple load-generator machines.
If resources are not fully utilized, the failures should be analyzed to determine the root cause.
To further e-Dough's excellent response, you should expect not to execute these virtual users on the same hardware as the controller. You should expect at least three load generators to be involved, two as primary load and one as a control set. This is in addition to the controller.
Your issue does manifest as the classical, "system out of resources" condition. Consider the same best practices for monitoring your load generator health as you would in monitoring your application under test infrastructure. You want to have monitors for your classical finite resource model components ( CPU, DISK, MEMORY and NETWORK) plus additional sub components, such as a breakout of System and Application under CPU, to understand where and how your system is performing. You want to be able to eliminate false negatives on scalability where your load generators are so unhealthy that they are distorting your test results - Virtual users showing the application is slow when in fact the Virtual Users are slow because the machine in use is resource constrained.
On the Compute Engine VM in us-west-1b, I run 16 vCPUs near 99% usage. After a few hours, the VM automatically crashes. This is not a one-time incident, and I have to manually restart the VM.
There are a few instances of CPU usage suddenly dropping to around 30%, then bouncing back to 99%.
There are no logs for the VM at the time of the crash. Is there any other way to get the error logs?
How do I prevent VMs from crashing?
CPU usage graph
This could be your process manager saying that your processes are out of resources. You might wanna look into Kernel tuning where you can increase the limits on the number of active processes on your VM/OS and their resources. Or you can try using a bigger machine with more physical resources. In short, your machine is falling short on resources and hence in order to keep the OS up, process manager shuts down the processes. SSH is one of those processes. Once you reset the machine, all comes back to normal.
How process manager/kernel decides to quit a process varies in many ways. It could simply be that a process has consistently stayed up for way long time to consume too many resources. Also, one thing to note is that OS images that you use to create a VM on GCP is custom hardened by Google to make sure that they can limit malicious capabilities of processes running on such machines.
One of the best ways to tackle this is:
increase the resources of your VM
then go back to code and find out if there's something that is leaking in the process or memory
if all fails, then you might wanna do some kernel tuning to make sure your processes have higer priority than other system process. Though this is a bad idea since you could end up creating a zombie VM.
I was trying to do a capacity test on an apache web server, but there are some result I can't understand: according to the theory part of capacity planning, I should see three different regions on the plot of throughput in/out.
In the first region the expected result is the line y=x, meaning that the web server can follow my requests and reply to all with the code 200-OK (Thus, the throughput I request is equal to the throughput I get).
In the second region the expected result is the line y=k, where k is that throughput that indicates the saturation of the web server (Thus, the throughput I get can't go further k).
In the third region the expected result is a curve that goes from k to zero, that shows the degradation of web server, which for memory or CPU leaks starts to reject requests.
I have tried to replicate the experiment with a Virtual Machine running an instance of Apache as a server and the Physical Machine running an instance of Apache JMeter as a client. The result that I get is only the first two points, but also if I request a very very huge number of samples/seconds as throughput, I always get the saturation value.
Why I can't get the server going down, even if the CPU is 0% idle and the remaining memory is about 10MB? Or maybe is this the correct behavior and my hypothesis was incorrect?
Thank you in advance.
For this test, I have a simple Java servlet that reads data in and calculates the CRC32 for it. When making serial requests of 512MB each, I get about 600MB/sec. That makes sense since I can't use all 24 cores available to me to calculate a CRC. The program driving this I/O is sitting on the local box to eliminate the possibility of networking issues. I am running Tomcat 8.0.24.0 on FreeBSD using OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode).
Next, I attempt the same test with 6 concurrent requests, expecting that the performance per request might be lower than 600MB/sec, but that the aggregate performance across all 6 requests would be significantly higher.
What I see is the CPU has some idle time at ALL times (so it doesn't appear that I'm CPU-bound). I also see that all processing threads in Tomcat are running concurrently as anticipated. However, it looks like I'm only getting around 800MB/sec in aggregate. The threads in Tomcat spend most of their time waiting to read from the socket, as shown below.
I would appreciate any thoughts on how to improve Tomcat throughput / why so much time is spent waiting for more data (which I assume is what's going on below).
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitLatch(NioEndpoint.java:1386)
at org.apache.tomcat.util.net.NioEndpoint$KeyAttachment.awaitReadLatch(NioEndpoint.java:1388)
at org.apache.tomcat.util.net.NioBlockingSelector.read(NioBlockingSelector.java:185)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:251)
at org.apache.tomcat.util.net.NioSelectorPool.read(NioSelectorPool.java:232)
at org.apache.coyote.http11.InternalNioInputBuffer.fill(InternalNioInputBuffer.java:133)
at org.apache.coyote.http11.InternalNioInputBuffer$SocketInputBuffer.doRead(InternalNioInputBuffer.java:177)
at org.apache.coyote.http11.filters.IdentityInputFilter.doRead(IdentityInputFilter.java:110)
at org.apache.coyote.http11.AbstractInputBuffer.doRead(AbstractInputBuffer.java:416)
at org.apache.coyote.Request.doRead(Request.java:469)
at org.apache.catalina.connector.InputBuffer.realReadBytes(InputBuffer.java:342)
at org.apache.tomcat.util.buf.ByteChunk.substract(ByteChunk.java:395)
at org.apache.catalina.connector.InputBuffer.read(InputBuffer.java:367)
at org.apache.catalina.connector.CoyoteInputStream.read(CoyoteInputStream.java:190)
...