The application, when subjected to load, sometimes utilises 100% CPU.
Doing a kill -QUIT <pid> showed 1100+ threads in WAITING state, as follows:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (16.3-b01 mixed mode):
"http-8080-1198" daemon prio=10 tid=0x00007f17b465c800 nid=0x2061 in Object.wait() [0x00007f1762b6e000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007f17cb087890> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
at java.lang.Object.wait(Object.java:485)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
- locked <0x00007f17cb087890> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
at java.lang.Thread.run(Thread.java:619)
"http-8080-1197" daemon prio=10 tid=0x00007f17b465a800 nid=0x2060 in Object.wait() [0x00007f1762c6f000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
- waiting on <0x00007f17cb14f460> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
at java.lang.Object.wait(Object.java:485)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
- locked <0x00007f17cb14f460> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
at java.lang.Thread.run(Thread.java:619)
............
The state does not change even when the application-context is undeployed OR the DB is restarted.
Please suggest a probable cause.
App Server: Apache Tomcat 6.0.26
Max Threads: 1500
Threads in WAITING state: 1138
"waiting on" is not a problem. The thread is waiting to be notified - and in this case it is locked on the JIoEndpoint.Worker
The background thread that listens for
incoming TCP/IP connections and hands
them off to an appropriate processor.
So I think this is waiting for actual requests to come in.
Firstly, CPU utilization actually increases when you have many threads, due to the high amount of context switching. Do you actually need 1500? Can you try reducing it (maxThreads on the Connector in conf/server.xml)?
Secondly, is it hogging memory or GC-ing too often?
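One quick way to check the GC side from inside the JVM is the standard GarbageCollectorMXBean API (jstat or GC logs work just as well). The class below is only an illustrative sketch, and it reports on the JVM it runs in, so it would have to run inside Tomcat or be polled over JMX:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    public static void main(String[] args) {
        // Cumulative GC counts and time since JVM start; poll periodically to see
        // whether collections are frequent enough to dominate CPU time.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}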
"waiting for" would be a problem if you see those. Do you have any BLOCKED(on object monitor) or waiting to lock () in the stack trace?
On a Solaris system you can use the command
prstat -L -p <pid> 0 1 > filename.txt
This will give you a breakdown of each lightweight process (LWP) doing work on the CPU, keyed by LWP ID instead of the PID. When you look at your thread dump you can match the LWP ID up to the nid (or tid, depending on the implementation) shown on the first line of each thread's entry. Note that the nid is printed in hex while prstat's LWPID is decimal, so nid=0x2061 corresponds to LWPID 8289. By matching these two up you will be able to tell which of your threads is the CPU hog.
Here is an example of the output.
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/LWPID
687 user 1024M 891M sleep 59 0 0:40:07 12.0% java/5
687 user 1024M 891M sleep 59 0 0:34:43 15.3% java/4
687 user 1024M 891M sleep 59 0 0:17:00 7.6% java/3
687 user 1024M 891M sleep 59 0 1:00:07 31.4% java/2
Then with a corresponding thread dump, you can find these threads
"GC task thread#0 (ParallelGC)" prio=3 tid=0x00065295 nid=0x2 runnable
"GC task thread#1 (ParallelGC)" prio=3 tid=0x00012345 nid=0x3 runnable
"GC task thread#2 (ParallelGC)" prio=3 tid=0x0009a765 nid=0x4 runnable
"GC task thread#3 (ParallelGC)" prio=3 tid=0x0003456b nid=0x5 runnable
So in this high-CPU case, the problem was in garbage collection. This is seen by matching the nid with the LWPID field: java/2 through java/5 in the prstat output correspond to nid=0x2 through nid=0x5, the four GC task threads above.
If this helps you out, I would suggest making a script that captures the prstat output and the CPU usage all at once. This will provide you with the most accurate representation of your application.
As for your original two threads, #joseK was correct. Those threads are sitting and waiting to take a request from a user; there is no problem there.
Related
We have Ignite running in server mode in our JVM. Ignite is going into deadlock in the following scenario (I have added the thread stacks at the end of this question):
a. Create a cache with write-through enabled.
b. In the CacheWriter.write() implementation:
   1. Wait for a second, so that step c can be invoked.
   2. Try to read from another cache.
c. While step b is executing, trigger a thread which creates a new cache.
d. On executing the above scenario, Ignite goes into deadlock, because:
   1. The read lock has been acquired by the cache.put() operation.
   2. When cache creation is triggered in a separate thread, Partition Map Exchange (PME) is also started.
   3. PME tries to acquire all 16 locks, but waits, as one read lock is already acquired.
   4. While reading from the other cache, cache.get() cannot complete, as it waits for the current Partition Map Exchange to complete.
We have faced this issue in production; the above scenario is just a reproducer. The write-through implementation is just trying to read from a cache, and the cache creation is happening in a totally different thread.
Why is Ignite blocking all cache.get() operations for PME when PME does not even have all the required locks yet? Shouldn't the call be blocked only after the PME operation has all the locks?
Why does PME stop everything? If I create cache A, then only operations related to cache A or its cache group should be stopped.
Also, is there any solution to this deadlock?
Thread executing cache.put() and write-through:
"main" #1 prio=5 os_prio=0 tid=0x0000000003505000 nid=0x43f4 waiting on condition [0x000000000334b000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:4870)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.repairableGet(GridCacheAdapter.java:4830)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.get(GridCacheAdapter.java:1463)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.get(IgniteCacheProxyImpl.java:1128)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.get(GatewayProtectedCacheProxy.java:688)
at ReadWriteThroughInterceptor.write(ReadWriteThroughInterceptor.java:70)
at org.apache.ignite.internal.processors.cache.GridCacheLoaderWriterStore.write(GridCacheLoaderWriterStore.java:121)
at org.apache.ignite.internal.processors.cache.store.GridCacheStoreManagerAdapter.put(GridCacheStoreManagerAdapter.java:585)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.update(GridCacheMapEntry.java:6468)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:6239)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry$AtomicCacheUpdateClosure.call(GridCacheMapEntry.java:5923)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.invokeClosure(BPlusTree.java:4041)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Invoke.access$5700(BPlusTree.java:3935)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invokeDown(BPlusTree.java:2039)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.invoke(BPlusTree.java:1923)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke0(IgniteCacheOffheapManagerImpl.java:1734)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.invoke(IgniteCacheOffheapManagerImpl.java:1717)
at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.invoke(IgniteCacheOffheapManagerImpl.java:441)
at org.apache.ignite.internal.processors.cache.GridCacheMapEntry.innerUpdate(GridCacheMapEntry.java:2327)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateSingle(GridDhtAtomicCache.java:2553)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update(GridDhtAtomicCache.java:2016)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal0(GridDhtAtomicCache.java:1833)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.updateAllAsyncInternal(GridDhtAtomicCache.java:1692)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.sendSingleRequest(GridNearAtomicAbstractUpdateFuture.java:300)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.map(GridNearAtomicSingleUpdateFuture.java:481)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapOnTopology(GridNearAtomicSingleUpdateFuture.java:441)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.map(GridNearAtomicAbstractUpdateFuture.java:249)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update0(GridDhtAtomicCache.java:1147)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.put0(GridDhtAtomicCache.java:615)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2571)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2550)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.put(IgniteCacheProxyImpl.java:1337)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.put(GatewayProtectedCacheProxy.java:868)
at com.eqtechnologic.eqube.cache.tests.readerwriter.WriteReadThroughTest.writeToCache(WriteReadThroughTest.java:54)
at com.eqtechnologic.eqube.cache.tests.readerwriter.WriteReadThroughTest.lambda$runTest$0(WriteReadThroughTest.java:26)
at com.eqtechnologic.eqube.cache.tests.readerwriter.WriteReadThroughTest$$Lambda$1095/2028767654.execute(Unknown Source)
at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:50)
at org.junit.jupiter.api.AssertDoesNotThrow.assertDoesNotThrow(AssertDoesNotThrow.java:37)
at org.junit.jupiter.api.Assertions.assertDoesNotThrow(Assertions.java:3060)
at WriteReadThroughTest.runTest(WriteReadThroughTest.java:24)
PME thread waiting for locks:
"exchange-worker-#39" #56 prio=5 os_prio=0 tid=0x0000000022b91800 nid=0x450 waiting on condition [0x000000002866e000]
java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for <0x000000076e73b428> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireInterruptibly(AbstractQueuedSynchronizer.java:897)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1222)
at java.util.concurrent.locks.ReentrantReadWriteLock$WriteLock.lockInterruptibly(ReentrantReadWriteLock.java:998)
at org.apache.ignite.internal.util.StripedCompositeReadWriteLock$WriteLock.lock0(StripedCompositeReadWriteLock.java:192)
at org.apache.ignite.internal.util.StripedCompositeReadWriteLock$WriteLock.lockInterruptibly(StripedCompositeReadWriteLock.java:172)
at org.apache.ignite.internal.util.IgniteUtils.writeLock(IgniteUtils.java:10487)
at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.updateTopologyVersion(GridDhtPartitionTopologyImpl.java:272)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.updateTopologies(GridDhtPartitionsExchangeFuture.java:1269)
at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:1028)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body0(GridCachePartitionExchangeManager.java:3370)
at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:3197)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:125)
at java.lang.Thread.run(Thread.java:748)
Technically, you have answered your question on your own; that is great work, to be honest.
You are not supposed to have blocking methods in your write-through cache store implementation that might come into conflict with PME or cause thread pool starvation.
You have to remember that PME is a show-stopper mechanism: the entire user load is stopped. In short, that is required to ensure ACID guarantees. The lock is indeed divided into multiple parts to speed up processing, i.e. allowing up to 16 threads to perform cache operations concurrently. But a PME does need exclusive control over the cluster, thus it acquires the write lock over all of those parts.
Shouldn't the call be blocked only after the PME operation has all the locks?
Yes, that's indeed how it's supposed to work. But in your case PME tries to take the write lock while a read lock is still held, so it waits for that read lock to be released, and all further read-lock requests are queued behind the pending write lock.
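To make that ordering concrete, here is a simplified model of the striped read/write lock idea. It is not Ignite's StripedCompositeReadWriteLock, just a sketch built on plain ReentrantReadWriteLocks to show why a single held read lock stalls PME and everything queued after it:

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StripedRwLockModel {
    // 16 stripes, mirroring the "16 locks" mentioned above.
    private final ReentrantReadWriteLock[] stripes = new ReentrantReadWriteLock[16];

    public StripedRwLockModel() {
        for (int i = 0; i < stripes.length; i++)
            stripes[i] = new ReentrantReadWriteLock();
    }

    // A cache operation takes the read lock of one stripe only (cheap, concurrent).
    public void readLock(int stripe)   { stripes[stripe].readLock().lock(); }
    public void readUnlock(int stripe) { stripes[stripe].readLock().unlock(); }

    // PME needs exclusive control, so it takes the write lock on every stripe.
    // If any read lock is still held (the cache.put() in your dump), this call blocks,
    // and new readers (like the cache.get() inside the CacheStore) queue up behind
    // the pending writer, which is the deadlock you observed.
    public void writeLockAll() {
        for (ReentrantReadWriteLock stripe : stripes)
            stripe.writeLock().lock();
    }

    public void writeUnlockAll() {
        for (ReentrantReadWriteLock stripe : stripes)
            stripe.writeLock().unlock();
    }
}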
Also, is there any solution to this deadlock?
Move cache-related logic out of the CacheStore. Ideally, do not start caches dynamically, since that triggers PME; have them created in advance if possible (see the sketch after this list).
Check whether other mechanisms like continuous queries or entry processors would work.
But still, it all depends on your use case.
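A minimal sketch of the "create caches in advance" suggestion, assuming static configuration at node start; the cache names are made up for illustration:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class StaticCachesExample {
    public static void main(String[] args) {
        // Declare every cache up front so they exist when the node starts,
        // instead of calling getOrCreateCache() later and triggering a PME
        // while user load (and the CacheStore) is running.
        IgniteConfiguration cfg = new IgniteConfiguration()
                .setCacheConfiguration(
                        new CacheConfiguration<>("mainCache"),
                        new CacheConfiguration<>("lookupCache"));

        Ignite ignite = Ignition.start(cfg);

        // No dynamic cache creation is needed afterwards.
        ignite.cache("mainCache").put(1, "value");
    }
}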
I don't think creating a cache inside the cache store will work. From the documentation for CacheWriter:
A CacheWriter is used for write-through to an external resource.
(Emphasis mine.)
Without knowing your use case, it's difficult to suggest an alternative approach, but creating your caches in advance or using a continuous query as a trigger works in similar situations.
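If a continuous query fits your use case, a rough sketch of that approach follows. The cache names and the listener body are hypothetical; the point is only that the reaction to an update happens outside the CacheWriter path:

import javax.cache.Cache;
import javax.cache.event.CacheEntryEvent;
import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.cache.query.ContinuousQuery;
import org.apache.ignite.cache.query.QueryCursor;

public class TriggerViaContinuousQuery {
    public static QueryCursor<Cache.Entry<Integer, String>> register(Ignite ignite) {
        IgniteCache<Integer, String> main = ignite.cache("mainCache");
        IgniteCache<Integer, String> lookup = ignite.cache("lookupCache");

        ContinuousQuery<Integer, String> qry = new ContinuousQuery<>();

        // The local listener fires after the originating update has been applied,
        // outside the write-through code path, so it can touch other caches freely.
        qry.setLocalListener(events -> {
            for (CacheEntryEvent<? extends Integer, ? extends String> e : events)
                lookup.get(e.getKey());   // placeholder side effect; replace with real logic
        });

        // The caller keeps the returned cursor open for as long as the listener
        // should stay registered, and closes it to deregister.
        return main.query(qry);
    }
}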
When a process is currently running on the CPU and suddenly has to wait for I/O, the scheduler saves its state (program counter, registers, ...) into its PCB and then adds it to the device queue of the device it is waiting on for I/O.
When does the process know to move from a waiting (device) queue to the ready queue?
And if I call Thread.Sleep(50000) in code, does the process move to the waiting queue?
Thanks!
The terms you are using are all pedagogical. How this is done is entirely operating-system specific.
The process of going from unexecutable due to pending I/O to a ready-for-execution state varies among systems.
If you're doing blocking (synchronous) I/O, there can only be one blocking I/O call pending per process (or thread). When that completes, the process should be executable. That would occur in the interrupt handler for the I/O request completion.
On some systems, completion of I/O will boost the priority of the process (or thread). In such a system, the process will move ahead of other processes that are waiting because they used up their CPU quantum (as opposed to having yielded the CPU voluntarily).
Many process state changes occur during timer interrupt servicing. The O/S will schedule regular timer interrupts on the CPU. The timer interrupt handler usually looks for sleeping processes that need to be woken, I/O requests that have been queued for completion, and process switching.
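As for the Thread.Sleep(50000) part of the question: at the JVM level a sleeping thread is simply reported as TIMED_WAITING and is not runnable until the timer wakes it; whether the OS parks it on a dedicated timer/wait queue is an implementation detail. A minimal Java sketch of that observable behaviour:

public class SleepStateDemo {
    public static void main(String[] args) throws InterruptedException {
        Thread sleeper = new Thread(() -> {
            try {
                Thread.sleep(50_000);   // analogous to the Thread.Sleep(50000) in the question
            } catch (InterruptedException ignored) { }
        }, "sleeper");

        sleeper.start();
        Thread.sleep(100);              // let the thread reach sleep()

        // The thread is off the run queue until its timer expires (or it is interrupted).
        System.out.println(sleeper.getState());   // TIMED_WAITING

        sleeper.interrupt();            // wake it early so the demo exits promptly
        sleeper.join();
    }
}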
I have a dedicated server that's been running for years, with no recent code or configuration changes, but about a week ago the MS SQL Server DB suddenly started becoming unresponsive, and shortly thereafter the entire site goes down due to memory issues on the server. It is sporadic, which leads me to believe it could be a malicious DDOS-like attack, but I am not sure how to confirm what's going on.
After a reboot, it can stay up for a few days, or only a few hours, before I start seeing rampant occurrences of these Info messages in the Windows logs, shortly before it seizes up and fails. Research has not yielded any actionable info as of yet. Please help, and thank you.
Process 52:0:2 (0xaa0) Worker 0x07E340E8 appears to be non-yielding on Scheduler 0. Thread creation time: 13053491255443. Approx Thread CPU Used: kernel 280 ms, user 35895 ms. Process Utilization 0%%. System Idle 93%%. Interval: 6505497 ms.
New queries assigned to process on Node 0 have not been picked up by a worker thread in the last 2940 seconds. Blocking or long-running queries can contribute to this condition, and may degrade client response time. Use the "max worker threads" configuration option to increase number of allowable threads, or optimize current running queries. SQL Process Utilization: 0%%. System Idle: 91%%.
Here's a blog post about the issue that should help you get started: danieladeniji.wordpress.
Seems unlikely that it would be a DDOS.
I read that the process scheduler will replace the process currently being executed by the CPU with a higher-priority process. At any point only one process is executed by the processor, so in that case, where is the scheduler running when it needs to notify the CPU about a higher-priority process while the CPU is busy executing a lower-priority process?
The process scheduler is the component of the operating system that is
responsible for deciding whether the currently running process should
continue running and, if not, which process should run next.
To help the scheduler monitor processes and the amount of CPU time that they use, a programmable interval timer interrupts the processor periodically (typically 50 or 60 times per second). This timer is programmed when the operating system initializes itself. At each interrupt, the operating system’s scheduler gets to run and decide whether the currently running process should be allowed to continue running or whether it should be suspended and another ready process allowed to run. This is the mechanism used for preemptive scheduling.
So, basically, the process scheduler resides in main memory, but it only runs when it is invoked by an interrupt. Hence, it isn't running all the time.
BTW, that was a great conceptual question to answer. Best wishes for your topic.
The higher-priority thread/process will preempt the lower-priority thread when an interrupt causes the scheduler to run and decide which set of threads should run next, and the scheduler algorithm decides that the lower-priority thread should be replaced by the higher-priority one.
Interrupts come in two flavours:
Software interrupts (syscalls) from threads that are already running, which change the state of other threads, e.g. by signaling an event, mutex, or semaphore upon which another thread is waiting, and so making it ready to run. See the sketch after this list.
Hardware interrupts (e.g. disk, NIC, keyboard, mouse, timer) that cause a driver to run; that driver chooses to invoke the scheduler on exit because an I/O operation has completed or some timeout interval has expired, which requires changing the set of running threads.
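A user-level illustration of the first flavour, using Java's Semaphore as a stand-in for the kernel object being signaled; the release() plays the role of the call that makes the waiting thread ready, after which the scheduler decides whether it preempts something:

import java.util.concurrent.Semaphore;

public class WakeupDemo {
    public static void main(String[] args) throws InterruptedException {
        Semaphore ready = new Semaphore(0);

        Thread waiter = new Thread(() -> {
            try {
                ready.acquire();        // blocks: the thread is WAITING, not runnable
                System.out.println("signaled, running again");
            } catch (InterruptedException ignored) { }
        }, "waiter");

        waiter.start();
        Thread.sleep(100);
        System.out.println(waiter.getState());  // WAITING (parked on the semaphore)

        // The "software interrupt" step: signaling the object the other thread waits on
        // makes it ready to run; whether it preempts anything is up to the scheduler.
        ready.release();
        waiter.join();
    }
}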
I am trying to analyse thread dumps I have taken from my Tomcat server. One of the thread dumps was taken after a couple of minutes of uptime and shows a thread pool of about 70 threads, several of them in WAITING state. I left a script hitting the server overnight and took another thread dump in the morning. Comparing the two dumps, I can see that the thread pool has increased from 70 threads to 90 threads. I can also see that the same threads are in a WAITING state between one dump and the other, while 20 new threads have been added.

Would this suggest that there is some bug in my application, or is this standard behaviour? I am wondering why the threads that are WAITING are not being re-used, and new threads are being created instead. I am assuming that the waiting threads have not been re-used at all from one dump to the other, because the dump file reports them as "waiting on <id>" where the number in <> is the same from one dump to another; is this assumption correct?
For example, from my initial thread dump I see this:
"http-8000-40" - Thread t#74
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Native Method)
- waiting on <4fd24389> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
at java.lang.Object.wait(Object.java:485)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- None
and then I can see the same thread in the dump from the following morning, in the same state and waiting on the same object (I am assuming this from the numbers in "<>"):
"http-8000-40" - Thread t#74
java.lang.Thread.State: WAITING
at java.lang.Object.wait(Native Method)
- waiting on <4fd24389> (a org.apache.tomcat.util.net.JIoEndpoint$Worker)
at java.lang.Object.wait(Object.java:485)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.await(JIoEndpoint.java:458)
at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:484)
at java.lang.Thread.run(Thread.java:662)
Locked ownable synchronizers:
- None
Tomcat needs to spend some time managing threads and other resources even after your webapp's code completes processing a request. In order to keep up with the load, Tomcat will allocate new threads if enough aren't available.
If you have 70 total threads and 70 simultaneous requests, all should be well. If one request (of 70) completes (that is, the client has received all the data) and another is made before Tomcat is fully done with the request-processor thread, another thread will be allocated to handle the new request, resulting in a thread pool of size 71.
This can happen many times, because it's not deterministic: context switches, GC pauses, etc. can interfere with the exact timing of everything happening on the server.
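If you want to watch the pool grow over time without taking full dumps, a rough sketch using the standard ThreadMXBean follows. It has to run inside the Tomcat JVM (or poll the same MBean over remote JMX), and the "http-8000-" prefix is just the pool name from your dumps:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class HttpThreadCounter {
    public static void main(String[] args) throws InterruptedException {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        while (true) {
            int httpWorkers = 0;
            for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
                if (info.getThreadName().startsWith("http-8000-"))   // connector pool prefix
                    httpWorkers++;
            }
            // Total/peak counts include non-connector threads (GC, JMX, etc.).
            System.out.printf("total=%d peak=%d http-workers=%d%n",
                    threads.getThreadCount(), threads.getPeakThreadCount(), httpWorkers);
            Thread.sleep(60_000);
        }
    }
}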