Frozen replication on MySQL 5.7.23 and 5.7.24 - replication

We have repeatedly seen this problem on MySQL 5.7.23 and 5.7.24: replication freezes on an error and I cannot restart it manually with "stop slave; start slave;".
MySQL runs on Debian 9 on VMs on Google Compute Engine, and all packages are up to date. The VMs have 4 CPUs and 26 GB RAM. On the MySQL replicas we use parallel replication, the ROW binlog format, and LOGICAL_CLOCK for slave-parallel-type.
Scenario of our problems:
Replication on read-only replica stops with error 1205.
Error text: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
In the binlog I see an ordinary UPDATE statement; we have tons of them during the day.
Check of performance_schema.replication_applier_status_by_worker shows error like this: "Worker 1 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx; Lock wait timeout exceeded; try restarting transaction"
I run "stop slave;" from the mysql command-line tool, but the command hangs. The processlist shows it running indefinitely: | 56327 | root | localhost | NULL | Query | 61716 | Killing slave | stop slave |
A manual reboot of the instance from the Linux command line does not work. The instance is frozen and I cannot SSH into it; I have to force a restart from the Google GCE web GUI.
In error.log I can see a sequence of error messages like: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxx; Could not execute Update_rows event on table xxxx.xxxx; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx, Error_code: 1205
The sequence ends with the error message: worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
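The worker error texts quoted in the scenario above share a fixed shape, so the failing worker, binlog file and position can be pulled out and recorded each time the problem occurs. This is a minimal sketch, assuming the message format is exactly as quoted; `parse_worker_error` is a hypothetical helper name, not part of MySQL.

```python
import re

# Pattern for the worker error text quoted in the scenario (assumed format).
WORKER_ERROR = re.compile(
    r"Worker (?P<worker>\d+) failed executing transaction '(?P<txn>[^']+)' "
    r"at master log (?P<log>\S+), end_log_pos (?P<pos>\w+)"
)


def parse_worker_error(message):
    """Return the worker id, transaction, binlog file and position, or None."""
    match = WORKER_ERROR.search(message)
    if match is None:
        return None
    return {
        "worker": int(match.group("worker")),
        "transaction": match.group("txn"),
        "binlog": match.group("log"),
        # Kept as a string because the position is masked in the quoted logs.
        "end_log_pos": match.group("pos"),
        # True when the text carries the error 1205 wording.
        "lock_wait_timeout": "Lock wait timeout exceeded" in message,
    }
```

Feeding it the LAST_ERROR_MESSAGE text from performance_schema.replication_applier_status_by_worker, or a line from error.log, gives a small record that can be compared across incidents.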
I raised slave_transaction_retries to 30, which reduced the number of "frozen" cases, but the problem remains. When replication stops, I still cannot restart it manually from the mysql command-line tool.
We did not have these problems with frozen replication on 5.7.22 or older releases. Although we occasionally had 1205 errors in replication because of the huge number of UPDATEs we run during the day, a manual restart of replication from the mysql command-line tool always worked without problems.
The situation seems a bit better on 5.7.24, which came with many replication fixes. On 5.7.24 we see far fewer cases of this problem, but it is still there.
Can I influence this behavior by some parameter?
What would you recommend checking if this problem happens again?
Can I force a restart of frozen replication without restarting MySQL?
Thank you very much for any idea or help.

Predis "Error while reading line from server", timeout fix

I am using Redis with daemon processes as well as for regular caching:
Daemon processes with supervisor (Laravel Redis queues)
Regular caching as key-value pairs
timeout is currently set to 300 in my redis.conf file.
Several GitHub issues suggest changing it to 0 (https://github.com/predis/predis/issues/33).
My concern is that with timeout set to 0, the Redis server will not drop any idle connections.
Over time, I see a risk of hitting the "max number of clients reached" error.
I am seeking advice on changing timeout to 0 in redis.conf.
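For reference, redis.conf directives are written as a name followed by a space and a value (for example `timeout 300`), and a timeout of 0 disables the closing of idle client connections. Below is a minimal sketch for checking what a given config file actually sets; `read_redis_directives` and `idle_clients_are_dropped` are hypothetical helper names, and the parser ignores quoting and include files.

```python
def read_redis_directives(conf_text):
    """Parse redis.conf text into a {directive: value} dict (last one wins)."""
    directives = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        value = parts[1].strip() if len(parts) > 1 else ""
        directives[parts[0].lower()] = value
    return directives


def idle_clients_are_dropped(directives):
    """True if the server closes idle client connections (timeout > 0)."""
    # A missing directive is treated as 0, i.e. idle timeout disabled.
    return int(directives.get("timeout", "0")) > 0
```

With `timeout 0` the second function returns False, which is exactly the situation the concern above is about: idle connections then stay open until the client closes them or the connection breaks.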
Currently, with timeout set to 300, I frequently get the following error log (every 2-3 min):
{"message":"Error while reading line from the server. [tcp://10.10.101.237:6379]","context":{"exception":{"class":"Predis\\Connection\\ConnectionException","message":"Error while reading line from the server. [tcp://10.10.101.237:6379]","code":0,"file":"/var/www/api/vendor/predis/predis/src/Connection/AbstractConnection.php:155"}},"level":400,"level_name":"ERROR","channel":"production","datetime":{"date":"2020-09-23 07:14:01.207506","timezone_type":3,"timezone":"Asia/Kolkata"},"extra":[]}
I changed to timeout = 0.
Everything is working fine with it!
PS: I am posting this after observing the system for 2 months following the change.

How can I tell if there are hung threads in WebSphere Application Server

I'm using IBM Workload Scheduler (TWS) and when the product does not behave as expected or does not reply in a timely fashion, I am under the impression that there could be a thread hanging or blocked somewhere.
Is there a way to tell if there is a blocked thread?
The first step is to check whether the SystemOut.log file of WebSphere Application Server (located in WAS_profile_path/logs/server1/SystemOut.log or WAS_profile_path\logs\server1\SystemOut.log on the master domain manager) contains any evidence that one or more threads are hanging. To do this, you can run the following command in a UNIX shell:
cat WAS_profile_path/logs/server1/SystemOut*.log | grep hung
If this command returns something like:
root#MASTER:/opt/IBM/TWA/WAS/TWSProfile/logs/server1# cat SystemOut*.log | grep hung
[6/20/17 5:45:33:988 CEST] 000000b9 ThreadMonitor W WSVR0605W: Thread "WorkManager.ResourceAdvisorWorkManager : 0" (0000009e) has been active for 697451 milliseconds and may be hung. There is/are 1 thread(s) in total in the server that may be hung.
this might mean that a WebSphere thread could be hung.
This may or may not be true: sometimes a thread simply performs a lot of work and exceeds the configured time limit (the default value is 10 minutes).
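To see at a glance which threads were flagged and for how long, the same WSVR0605W lines can be parsed rather than just grepped. This is a minimal sketch, assuming the warning text has the shape shown in the sample output above; `find_possible_hung_threads` is a hypothetical helper name, and the 10-minute default mirrors the limit mentioned above.

```python
import re

# Pattern for the WSVR0605W warning as it appears in the sample output.
HUNG_THREAD = re.compile(
    r'WSVR0605W: Thread "(?P<name>[^"]+)" \((?P<id>[0-9a-fA-F]+)\) '
    r"has been active for (?P<ms>\d+) milliseconds"
)


def find_possible_hung_threads(log_lines, min_minutes=10.0):
    """Return (thread name, minutes active) for warnings at or above min_minutes."""
    results = []
    for line in log_lines:
        match = HUNG_THREAD.search(line)
        if match is None:
            continue
        minutes = int(match.group("ms")) / 60000
        if minutes >= min_minutes:
            results.append((match.group("name"), round(minutes, 1)))
    return results
```

Raising `min_minutes` is one way to filter out threads that are merely slow; a thread that keeps reappearing with a growing active time is the more likely candidate for a real hang.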
If you suspect that a thread is really hung, have a look at the following articles, which describe in detail how to collect the data needed to diagnose and resolve the issue:
WebSphere MustGather procedure on Linux
WebSphere MustGather procedure on Windows
A similar document also exists for the AIX platform.

High availability: Jobs not getting submitted immediately after NameNode failover

We have an application configured for high availability.
Of the 2 nodes, one (say NN1) is made active and the NameNode process on the other (say NN2) is killed. So now NN1 is active.
Now we submit a MapReduce job, and the logs keep saying:
"Application submission is not finished, submitted application application_someid is still in NEW_SAVING".
This happens for about 17 minutes and then the job gets executed successfully.
So this means the failover has happened and NN1 is active. But why does it take so long?
The YARN NodeManager log says:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: . Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Can somebody please explain why this is happening?
Thanks in advance
I don't know the cause of this problem,
but restarting the YARN service helped me solve it.
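The retry policy in the NodeManager log line quoted in the question can be read mechanically. This is a minimal sketch, assuming the policy text always has the shape shown in that line; `retry_cycle_seconds` is a hypothetical helper name, and the result is only arithmetic on the logged values, not a diagnosis of the 17-minute delay.

```python
import re

# Pattern for the retry policy text as it appears in the quoted log line.
RETRY_POLICY = re.compile(
    r"RetryUpToMaximumCountWithFixedSleep\(maxRetries=(?P<retries>\d+), "
    r"sleepTime=(?P<sleep>\d+) MILLISECONDS\)"
)


def retry_cycle_seconds(log_line):
    """Return the total sleep time of one retry cycle in seconds, or None."""
    match = RETRY_POLICY.search(log_line)
    if match is None:
        return None
    retries = int(match.group("retries"))
    sleep_ms = int(match.group("sleep"))
    # Sleep time only; connect timeouts per attempt are not included.
    return retries * sleep_ms / 1000
```

For the quoted line this gives 10.0 seconds of sleep per cycle (10 retries of 1000 ms each), so a delay of about 17 minutes would have to come from many repeated cycles or from other waits that this line does not show.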

Solr issue: ClusterState says we are the leader, but locally we don't think so

Today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other checks we can do?
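Counting how often each server appears among the election entries can be automated once the child names have been listed. This is a minimal sketch, assuming each entry has the layout `<session id>-<node name>-n_<sequence>`, which is inferred only from the three entries quoted above; `count_election_entries` is a hypothetical helper name.

```python
from collections import Counter


def count_election_entries(znode_names):
    """Count election entries per Solr node name."""
    counts = Counter()
    for name in znode_names:
        # Drop the session id prefix (everything up to the first "-").
        _session, _, rest = name.partition("-")
        # Drop the "-n_<sequence>" suffix, keeping the node name.
        node_name, _, _sequence = rest.rpartition("-n_")
        counts[node_name] += 1
    return counts
```

A node name with a higher count than the others is the one registered more often in the election, which matches the observation above about one server appearing 3 times.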
We've experienced this error under 2 conditions.
Condition 1
On a single ZooKeeper host there was an orphaned ZooKeeper ephemeral node in /overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point 2 services were started. Because the 2nd instance cannot start properly (both can't listen on the same port), it seems to either sit there hanging in a failed state, or fail to start while still disrupting the other Solr process, possibly by overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this ZooKeeper node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But this node was completely missing on the cluster where duplicate Solr instances were attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe), and restart the Solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had 2 running processes. For whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.

Sunspot lock issue on EngineYard

I am having a problem when creating a new record on a RoR3 server.
It updates the Solr indexes and runs into a problem with a lock.
RSolr::Error::Http (RSolr::Error::Http - 500 Internal Server Error
Error: Lock obtain timed out: NativeFSLock#/data/dfcgit_r3/releases/20130620195714/solr/data/production/index/write.lock
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock#/data/dfcgit_r3/releases/20130620195714/solr/data/production/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1108)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:171)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:219)
Any help with this?
We had the same error when running sunspot solr on amazon ec2.
The 'write.lock' indicated that some process had not released the lock on a resource: either the web server process was still at it or Solr had some other process running. I checked the running Solr processes by executing
ps aux | grep solr
It showed there were 4 processes running! So I stopped Solr with sunspot:solr:stop, ran the grep again, killed the Solr processes still listed (kill -9), and then ran sunspot:solr:start.
And the Sun shone again. It worked fine thereafter.
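The process check described above can also be scripted, which helps when it has to be repeated after every deploy. This is a minimal sketch, assuming the standard 11-column `ps aux` layout with the command in the last column; `solr_pids` is a hypothetical helper name, and it only reports PIDs without killing anything.

```python
def solr_pids(ps_output):
    """Return PIDs of processes in `ps aux` output whose command mentions solr."""
    pids = []
    for line in ps_output.splitlines():
        # USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # skips the header line and malformed lines
        command = fields[10]
        # Ignore the grep process itself, as with a manual `ps aux | grep solr`.
        if "solr" in command.lower() and "grep" not in command:
            pids.append(int(fields[1]))
    return pids
```

More than one PID in the result for a single Solr instance is the situation described above, where leftover processes were still holding the write lock.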