How can I tell if there are hung threads in WebSphere Application Server - workload-scheduler

I'm using IBM Workload Scheduler (TWS) and when the product does not behave as expected or does not reply in a timely fashion, I am under the impression that there could be a thread hanging or blocked somewhere.
Is there a way to tell if there is a blocked thread?

The first step is to check whether the SystemOut.log file of WebSphere Application Server (located in WAS_profile_path/logs/server1/SystemOut.log on UNIX or WAS_profile_path\logs\server1\SystemOut.log on Windows, on the master domain manager) contains any evidence that one or more threads are hanging. To do this, you can run the following command from a UNIX shell:
cat WAS_profile_path/logs/server1/SystemOut*.log | grep hung
If this command returns something like:
root#MASTER:/opt/IBM/TWA/WAS/TWSProfile/logs/server1# cat SystemOut*.log | grep hung
[6/20/17 5:45:33:988 CEST] 000000b9 ThreadMonitor W WSVR0605W: Thread "WorkManager.ResourceAdvisorWorkManager : 0" (0000009e) has been active for 697451 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
this indicates that a WebSphere thread may be hung.
This may or may not be true: sometimes a thread simply performs a lot of work and exceeds the configured time limit (the default is 10 minutes).
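One quick way to distinguish a slow thread from a truly hung one is to check whether a later message reports that the same thread completed; WebSphere normally logs a WSVR0606W entry when a thread that was previously reported as hung finishes. A minimal check, assuming the same log location as above:
grep -E "WSVR0605W|WSVR0606W" WAS_profile_path/logs/server1/SystemOut*.log
# if a later WSVR0606W line refers to the same thread id, the thread was only slow, not hung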
If you suspect a genuinely hung thread, take a look at the following articles, which provide detailed instructions for collecting the data needed to diagnose and resolve the issue:
WebSphere MustGather procedure on Linux
WebSphere MustGather procedure on Windows
A similar document also exists for the AIX platform.
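If you want to look at the threads yourself before engaging support, a common approach on UNIX is to trigger javacore thread dumps against the WebSphere java process: the IBM JVM writes a javacore*.txt file when it receives SIGQUIT. A rough sketch (the grep pattern used to find the PID is only an example and may need adjusting on your system):
# find the PID of the WebSphere server1 java process (pattern is an example)
ps -ef | grep '[s]erver1' | grep java
# request a javacore (does not stop the JVM); repeat 2-3 times, a minute apart
kill -3 <PID>
# the javacore*.txt files normally appear under the profile directory
ls WAS_profile_path/javacore*.txt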

Related

Splunk 7.2.9.1 Universal forwarder on SUSE Linux12.4 not communicating and forwarding logs to Indexer after certain period of time

I have noticed that the Splunk 7.2.9.1 Universal Forwarder on SUSE Linux 12.4 stops communicating with the deployment server and forwarding logs to the indexer after a certain period of time. The "splunkd" process appears to be running while this issue persists.
I have to restart the Universal Forwarder for it to resume communication with the deployment server and forward logs, but it stops communicating again after a certain period of time.
I cannot see any specific logs in splunkd.log while this issue occurs.
However, I noticed the message below in watchdog.log:
06-16-2020 11:51:09.055 +0200 ERROR Watchdog - No response received from IMonitoredThread=0x7f24365fdcd0 within 8000 ms. Looks like thread name='Shutdown' is busy !? Starting to trace with 8000 ms interval.
Can somebody help me understand what is causing this issue?
This appears to be a Known Issue. From the 7.2.9.1 release notes:
Universal Forwarders stop sending data repeatedly throughout the day
Workaround: In limits.conf, try changing file_tracking_db_threshold_mb
in the [inputproc] stanza to a lower value.
I did not find a version where this is not listed as a known problem.
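For reference, the workaround from the release notes translates into something like the following in $SPLUNK_HOME/etc/system/local/limits.conf on the forwarder (the value shown is only an illustration; pick something lower than your current or default setting and restart the forwarder):
[inputproc]
file_tracking_db_threshold_mb = 250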

Frozen replication on MySQL 5.7.23 and 5.7.24

We have repeatedly seen this problem on MySQL 5.7.23 and 5.7.24: replication freezes on an error and I cannot manually restart it using "stop slave; start slave;".
MySQL runs on Debian 9 on VMs on Google Compute Engine and all packages are up to date. The VMs have 4 CPUs/26 GB RAM. On the MySQL replicas we use parallel replication workers, ROW binlog format and LOGICAL_CLOCK for slave-parallel-type.
Scenario of our problems:
Replication on read-only replica stops with error 1205.
Error text: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
In the binlog I see a normal UPDATE statement - we have tons of them during the day.
Check of performance_schema.replication_applier_status_by_worker shows error like this: "Worker 1 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx; Lock wait timeout exceeded; try restarting transaction"
I run "stop slave;" from the mysql command line tool but the command freezes - the processlist shows | 56327 | root | localhost | NULL | Query | 61716 | Killing slave | stop slave | running indefinitely.
A manual reboot of the instance from the Linux command line does not work; the instance is frozen and I cannot SSH into it, so I have to force a restart from the Google GCE web GUI.
In error.log I can see a sequence of error messages: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxx; Could not execute Update_rows event on table xxxx.xxxx; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx, Error_code: 1205
The sequence ends with the error message: worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
I tried setting the slave_transaction_retries variable higher (to 30), which lowered the number of "frozen" cases, but the problem remains. If replication stops I cannot restart it manually from the mysql command line tool.
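For reference, the change and the worker check can be done from the shell roughly like this (the exact statements are illustrative):
# raise the retry count at runtime (and persist it in my.cnf as slave_transaction_retries = 30)
mysql -e "SET GLOBAL slave_transaction_retries = 30;"
# inspect the failing workers when replication stops
mysql -e "SELECT WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE FROM performance_schema.replication_applier_status_by_worker\G"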
We did not have these frozen-replication problems on 5.7.22 or older releases. Although from time to time we saw error 1205 in replication due to the huge number of UPDATEs we run during the day, a manual restart of replication from the mysql command line tool always worked without problems.
The situation seems a bit better on 5.7.24, which came with many replication fixes; on .24 we see far fewer cases of this problem, but it is still there.
Can I influence this behavior by some parameter?
What would you recommend to check if this problem happens again?
Can I force restart of frozen replication without restarting MySQL?
Thank you very much for any idea or help.

Weblogic 10.3.6 generates empty heapdump on OutOfMemoryError

I'm trying to generate a full heapdump from Weblogic 10.3.6 due to an OutOfMemoryError generated by a Web Application deployed on the Server.
I've set the following options in the start script:
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/heapdump
When the OutOfMemoryError occurs, WebLogic generates an empty hprof file (0 bytes) in the /path/to/heapdump folder, and nothing else happens: the server remains in RUNNING mode even though it is no longer reachable.
The java process is still alive, but at 0% CPU.
Even the server.out log seems completely frozen, without any trace of the OutOfMemoryError.
What's wrong with the configuration?
You can probably use Java Flight Recorder to record events and check which objects are generating the OOM (any profiler should work as well).
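If the server happens to run on an Oracle JVM with Flight Recorder available (JRockit, or HotSpot 7u4+ with commercial features unlocked), a recording can usually be started against the live process; the commands below are only a sketch and the exact options depend on your JVM version:
# HotSpot 7u4+/8 (requires -XX:+UnlockCommercialFeatures -XX:+FlightRecorder at startup)
jcmd <PID> JFR.start name=oom_check duration=10m filename=/tmp/oom_check.jfr
# JRockit (WebLogic 10.3.x is often run on JRockit)
jrcmd <PID> start_flightrecording duration=600s filename=/tmp/oom_check.jfr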
Been there :(. I remember at the time we found it somewhat logical: since there was not enough memory for normal operation, the JVM could not magically find enough memory to create a heap dump either. If memory serves me well, we did two things to debug the memory leak. First, we were "lucky" enough that the problem happened fairly regularly, so close manual monitoring was possible (watching gc.log for repeated full GCs and watching the performance tab in the console). Knowing when the problem was setting in, we issued kill -3 to get the dump manually. We also used jstack {PID} (JDK 1.6 on Linux) with some luck. With those, the devs were able to identify the memory leak at the time. Hope that helps.
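In case it helps, the manual dumps mentioned above boil down to roughly these commands (the PID is the WebLogic server's java process; the output path is illustrative):
# thread dump written to the server's stdout log (HotSpot and JRockit both honour SIGQUIT)
kill -3 <PID>
# or capture it to a file with jstack (HotSpot JDK 6+)
jstack <PID> > /tmp/threaddump.txt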
Okay, your configuration looks alright; you might want to check whether the WebLogic process user has the rights to write the heap dump file.
You can also take a heap dump with the Java tools:
JAVA_HOME/bin/jmap -dump:format=b,file=path_of_the_file <pid>
OR
%JROCKIT_HOME%\bin\jrcmd <pid> hprofdump filename=path_of_the_file

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint of the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the zookeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are mentioned only twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds",
you will need to restart the Solr hosts as well to resolve the problem.
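To check for this condition, the election nodes and the overseer queue can be inspected directly with the ZooKeeper CLI; a rough sketch (host/port and node name are examples):
# list the election nodes; look for an entry whose session no longer exists
./zkCli.sh -server zk-host:2181 ls /overseer_elect/election
# a growing queue is another symptom of a stuck overseer
./zkCli.sh -server zk-host:2181 ls /overseer/queue
# stat shows the ephemeralOwner session id of a suspicious node
./zkCli.sh -server zk-host:2181 stat /overseer_elect/election/<node-name>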
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
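As a sketch, the relevant part of the unit file should contain something like the following (the pid file path is entirely site-specific and shown only as a placeholder):
[Service]
Type=forking
PIDFile=/path/to/solr.pid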
systemd was not tracking the PID correctly: it thought the service was dead when it was not, and at some point two services were started. Because the second service cannot start (both cannot listen on the same port), it either just sits there hanging in a failed state, or it fails to start the process but still interferes with the other Solr process, possibly by overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But this node was completely missing on the cluster where duplicate Solr instances were attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe) and restart the Solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had two running processes; for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.
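If you hit the same thing, a quick way to spot the duplicate is to list the Jetty/java processes with their elapsed time and kill the older one (the grep pattern is just an example for a start.jar-based install):
# list solr/jetty java processes with PID and elapsed time
ps -eo pid,etime,args | grep '[s]tart.jar'
# kill the older (longer-running) process, then verify the remaining one serves requests
kill <OLD_PID>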

jmeter hangs up and won't return

I am running 340 concurrent users to load test a server using JMeter.
But in most cases JMeter hangs and won't return; even if I try to close the connection it just hangs, and eventually I have to close the application.
Any idea how to check what is holding the requests, how to inspect the requests sent by JMeter, and how to find the bottleneck?
I got the following message when closing the threads:
Shutting down thread, please be patient
I've hit this several times over the past few years. In each of my cases (it may not be so in yours) the issue was with the load balancer (F5) I was sending my traffic through. Basically a feature called OneConnect was holding the connections in a time-wait state and never killing them.
Run a packet capture tool like Wireshark and see what's happening with the requests.
Try distributed testing; 340 concurrent users is not a big deal, but you can still try it and see if that decreases your pain. Also take a look at the following link:
http://jmeter.apache.org/usermanual/best-practices.html#lean_mean
First check that your script is OK with one user.
Ensure you use assertions.
Then run your test following JMeter best practices:
no gui
no costly listeners
You should then be able to see in the CSV output the longest requests and fix your issue.
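A minimal non-GUI run along those lines might look like this (file names are examples; the property override just forces CSV result output):
./jmeter -n -t testplan.jmx -l results.csv -Jjmeter.save.saveservice.output_format=csv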
I also encountered this problem before: when I ran JMeter on my laptop (Core 2 Duo 1.5 GHz) it always hung in the middle of processing. I tried running on another PC that is more powerful than my laptop, and it now works smoothly. So JMeter will run more effectively if your PC or laptop has better specs.
Note: it is also advisable to run JMeter in non-GUI mode.
Example of running JMeter on a Linux box:
$ ./jmeter -t test.jmx -n -l /Users/home/test.jtl
I had the
one or more test threads won't exit
message because a firewall was blocking some requests. So I had to wait out the firewall's timeout for all blocked requests... then it returned.
You are probably getting this error because the JVM is not capable of running so many threads. If you take a look at your terminal, you will see the exception:
You can solve this by doing remote (distributed) testing with multiple JMeter instances running, instead of one.
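If that is the error you see, you can also try giving the JMeter JVM more headroom and a smaller per-thread stack before moving to distributed testing; the standard startup scripts honour the JVM_ARGS environment variable (the values below are only examples):
JVM_ARGS="-Xms1g -Xmx4g -Xss256k" ./jmeter -n -t testplan.jmx -l results.csv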