Predis Error while reading line from server , timeout fix - redis

I am using Redis with Daemon processes as well as in regular caching
Daemon processes with supervisor (Laravel Redis queues)
Regular caching as key value pair
timeout=300 is currently at my redis.conf file
It had been suggested to change it to timeout=0 at several Git links (https://github.com/predis/predis/issues/33)
My concern is that, if I do a timeout as 0, the redis sever will not drop any connection
Over a period of time, I see chances of getting error of max number of clients reached
Seeking advice for changing timeout --> 0 at redis.conf
Currently, I get following error logs frequently (every 2-3 min) [timeout=300]
{"message":"Error while reading line from the server. [tcp://10.10.101.237:6379]","context":
{"exception":{"class":"Predis\\Connection\\ConnectionException","message":"Error while reading
line from the server.
[tcp://10.10.101.237:6379]","code":0,"file":"/var/www/api/vendor/predis/predis/src/Connection/Ab
stractConnection.php:155"}},"level":400,"level_name":"ERROR","channel":"production","datetime":
{"date":"2020-09-23 07:14:01.207506","timezone_type":3,"timezone":"Asia/Kolkata"},"extra":[]}

I had changed to timeout = 0
Everything is working fine with it !!!
PS: Posting this, post an observation of 2 months after change

Related

Splunk 7.2.9.1 Universal forwarder on SUSE Linux12.4 not communicating and forwarding logs to Indexer after certain period of time

I have noticed Splunk 7.2.9.1 Universal forwarder on SUSE Linux12.4 is not communicating to deployment server and forwarding logs to indexer after certain period of time. "splunkd" process appears to be running while this issue persists.
I have to restart UFW for it to resume communication to deployment and forward logs. But this will again stop communication after certain period of time.
I cannot see any specific logs in splunkd.log while this issue occurs.
However, i noticed below message from watchdog.log
06-16-2020 11:51:09.055 +0200 ERROR Watchdog - No response received from IMonitoredThread=0x7f24365fdcd0 within 8000 ms. Looks like thread name='Shutdown' is busy !? Starting to trace with 8000 ms interval.
Can somebody help to understand what is causing this issue.
This appears to be a Known Issue. From the 7.2.9.1 release notes:
Universal Forwarders stop sending data repeatedly throughout the day
Workaround: In limits.conf, try changing file_tracking_db_threshold_mb
in the [inputproc] stanza to a lower value.
I did not find a version where this is not listed as a known problem.

Unstable cluster with hosts.xml resets to default

While performing management api operations such as removing app servers from a MarkLogic cluster, it becomes unstable resetting hosts.xml to default/localhost setting.
Logs shows something like:
MarkLogic: Slow send xx.xx.34.113:57692-xx.xx.34.170:7999, 4.605 KB in 1.529 sec; check host xxxx
Consider infrastructure is slow or not slow, but automatic recovery is still not happening.
How to overcome this situation?
Anyone who can provide more info on how management api is working under the hood?
Adding further details:
DELETE http://${BootstrapHost}:8002${serveruri}
POST http://${BootstrapHost}:8002${forest}?state=detach
DELETE http://${BootstrapHost}:8002${forest}?replicas=delete&level=full
DELETE http://${BootstrapHost}:8001/admin/v1/host-config?remote-host=${hostname}
When removing servers 1st request or removing host 4th request, few nodes in the cluster restarts and we check for nodes availability. However this is uncertain and sometimes hosts.xml resets to a default xml which says it is not part of any cluster.
How we fix, we copy hosts.xml from another host to this faulty host and it starts working again.
We found that it was very less likely to come in MarkLogic 8, but with MarkLogic 9 this problem is frequent and if it is on AWS it is even more frequent.

Frozen replication on MySQL 5.7.23 and 5.7.24

We have seen repeatedly this problem on MySQL 5.7.23 and 5.7.24. Replication is frozen on error and I cannot manually restart it using "stop slave; start slave;"
MySQL runs on Debian 9 on VMs on Google compute engine and all packages are up to date. VMs have 4CPUs/26GB RAM. On MySQL replicas we use parallel replication processes, ROW binlog format and LOGICAL_CLOCK for slave-parallel-type
Scenario of our problems:
Replication on read-only replica stops with error 1205.
Error text: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
In bin log I see some normal UPDATE command - we have tons of them during the day.
Check of performance_schema.replication_applier_status_by_worker shows error like this: "Worker 1 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx; Lock wait timeout exceeded; try restarting transaction"
I start command "stop slave;" from mysql command line tool but command is frozen - processlist shows process | 56327 | root | localhost | NULL | Query | 61716 | Killing slave | stop slave | running indefinitely
manual reboot of the instance from Linux command line does not work. Instance is frozen and I cannot ssh it, I have to force restart from Google GCE web gui.
In error.log I can see sequence of error messages Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxx; Could not execute Update_rows event on table xxxx.xxxx; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx, Error_code: 1205
sequence ends with error message: worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
I tried to set higher variable slave_transaction_retries (to 30) which lowered number of "frozen cases" but problem still stays. If replication stops I cannot restart it manually from mysql command line tool.
We did not have these problems with frozen replication on 5.7.22 or older releases. Although from time to time we had errors 1205 in replication due to huge amount of UPDATEs we have during the day, manual restart of replication from mysql command line tool always worked without problems.
Situation seems to be a bit better on 5.7.24 which came with many repairs in replication. On 24 we see much less cases of this problem but it is still there.
Can I influence this behavior by some parameter?
What would you recommend to check if this problem happens again?
Can I force restart of frozen replication without restarting MySQL?
Thank you very much for any idea or help.

Large commit stalls halfway through

I have a problem with our subversion server. Doing small commits works fine, but as soon as someone tried to commit a large collection sizeable files the commit stalls halfway through and the client finally time out. My test set consists of roughly 2000 files and the total size of the commit is about 1 GB. When I commit the files the file uploading starts but about halfway through the transfer rate drops to 0kb/s and the commit just stalls and never recovers. If I splitting the commit into smaller pieces (<150 Mb) everything works just fine, but that breaks the atomicity of the commit structure and is something I really want to avoid.
When I look at the logs generate by Apache there is no error messages.
When I bumped the loglevel from debug to trace6 on the Apache server, there is some errors appearing at the moment when the upload stalls:
...
OpenSSL: I/O error, 2229 bytes expected to read on BIO
OpenSSL: read 1460/2229 bytes from BIO
...
Versions used:
We are running the connection to the subversion via apache, mod_dav, mod_dav_svn, mod_authz_svn and mod_auth_digest. The client connects via https.
Server:
OpenSuse 42.3
svnserve: 1.9.7
Apache: 2.4.23
Client:
Windows 10 enterprise
svn client: 1.10.0-dev.
What I tried so far:
I have tried increasing the TimeOut value in the apache configuration. The only difference is that the client ends up in stalled mode longer before posting the timeout message.
I have tried increasing the MaxKeepAliveRequests from 100 to 1000. No change.
I have tried adding SVNAllowBulkUpdates Prefer to the svn settings. No change.
Have anyone got any hints on how to debug these types of errors?

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we run into a disturbing solr issue.
After a restart of the whole cluster one of the shard stop being able to index/store documents.
We had no hint about the issue until we started indexing (querying the server looks fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in Cluster mode (5 shards) on jetty.
Each shard run on a different host with one zookeeper server.
I checked the zookeeper log and I cannot see anything there.
The only difference is that in the /overseer_election/election folder I see this specific server repeated 3 times, while the other server are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other check can we do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds
You will need to restart the Solr hosts as well to resolve the problem.
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly, it thought the service was dead, but it wasn't, and at some point 2 services were started. Because the 2nd service will not be able to start (as they both can't listen on the same port) it seems to just sit there in a failed state hanging, or fails to start the process but just messes up the other solr processes somehow by possibly overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that zookeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, killall instances of solr (or java if you know it's safe), and restart the solr service.
We figured out!
The issue was that jetty didn't really stop so we had 2 running processes, for whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.