High availability: Jobs not getting submitted immediately after NameNode failover - hadoop-yarn

We have a cluster configured for NameNode high availability.
Of the two NameNodes, one (say NN1) is made active, and the other one's (say NN2) NameNode process is killed. So now NN1 is active.
Now we submit a MapReduce job, and the logs keep saying
"Application submission is not finished, submitted application application_someid is still in NEW_SAVING".
This goes on for about 17 minutes, and then the job executes successfully.
So the failover has happened and NN1 is active. But why does it take so long?
The YARN NodeManager log says:
INFO org.apache.hadoop.ipc.Client: Retrying connect to server: . Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
Can somebody please explain why this is happening?
Thanks in advance.

I don't know the cause of this problem, but restarting the YARN services solved it for me.
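For reference, a minimal sketch of that restart plus a check of the HA state, assuming default Hadoop 2.x script locations and the service IDs nn1/nn2 (and rm1, if ResourceManager HA is enabled); adjust to your configuration:
# Check which NameNode is active (service IDs are assumptions)
hdfs haadmin -getServiceState nn1
hdfs haadmin -getServiceState nn2
# If ResourceManager HA is enabled, check it too
yarn rmadmin -getServiceState rm1
# Restart the YARN daemons
$HADOOP_HOME/sbin/stop-yarn.sh
$HADOOP_HOME/sbin/start-yarn.sh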

Related

Splunk 7.2.9.1 Universal Forwarder on SUSE Linux 12.4 not communicating and forwarding logs to indexer after a certain period of time

I have noticed that the Splunk 7.2.9.1 Universal Forwarder on SUSE Linux 12.4 stops communicating with the deployment server and forwarding logs to the indexer after a certain period of time. The "splunkd" process appears to be running while the issue persists.
I have to restart the Universal Forwarder for it to resume communicating with the deployment server and forwarding logs, but it stops again after a certain period of time.
I cannot see any relevant entries in splunkd.log while this issue occurs.
However, I noticed the message below in watchdog.log:
06-16-2020 11:51:09.055 +0200 ERROR Watchdog - No response received from IMonitoredThread=0x7f24365fdcd0 within 8000 ms. Looks like thread name='Shutdown' is busy !? Starting to trace with 8000 ms interval.
Can somebody help me understand what is causing this issue?
This appears to be a Known Issue. From the 7.2.9.1 release notes:
Universal Forwarders stop sending data repeatedly throughout the day
Workaround: In limits.conf, try changing file_tracking_db_threshold_mb
in the [inputproc] stanza to a lower value.
I did not find a version where this is not listed as a known problem.
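In practice the workaround amounts to something like the sketch below on the forwarder host; the install path and the 250 MB value are assumptions, so tune the threshold to your environment:
# Append the workaround stanza to the local limits.conf (path assumes a default UF install)
cat >> /opt/splunkforwarder/etc/system/local/limits.conf <<'EOF'
[inputproc]
file_tracking_db_threshold_mb = 250
EOF
# Restart the forwarder so the change takes effect
/opt/splunkforwarder/bin/splunk restart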

Spinnaker Pipeline is failing after starting it manually

Spinnaker on GKE:
The pipeline fails after being started manually.
The first issue was that .spin/config didn't get created; I created it manually, as that step is missing from https://cloud.google.com/solutions/continuous-delivery-spinnaker-kubernetes-engine
Then, when I started the pipeline manually, it gave me an error in the production stage:
Exception (Wait For Manifest To Stabilize)
WaitForManifestStableTask of stage Deploy Production Backend timed out after 30 minutes 4 seconds. pausedDuration: 0 seconds, elapsedTime: 30 minutes 4 seconds, timeoutValue: 30 minutes
Many times this is caused by Kubernetes being unable to schedule the pods. I just had the same problem this morning, and it was due to a lack of resources; that would be the first thing I'd check. You can find the pod in the "Server Group" under the "Infrastructure" tab and view the details of why it's in the state it's in.
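A minimal sketch of that check with kubectl, assuming the deployment lands in the default namespace (adjust -n to whatever your pipeline targets):
# Look for pods stuck in Pending or CrashLoopBackOff
kubectl get pods -n default
# Describe a stuck pod; scheduling failures appear under Events (pod name is a placeholder)
kubectl describe pod <pod-name> -n default
# Recent events often show "Insufficient cpu" / "Insufficient memory" messages
kubectl get events -n default --sort-by=.lastTimestamp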
I find this Spinnaker GitHub page insightful. I wanted to add it as a comment, but I still don't have enough reputation.

Frozen replication on MySQL 5.7.23 and 5.7.24

We have seen this problem repeatedly on MySQL 5.7.23 and 5.7.24. Replication freezes on an error, and I cannot restart it manually using "stop slave; start slave;".
MySQL runs on Debian 9 on VMs on Google Compute Engine, and all packages are up to date. The VMs have 4 CPUs/26 GB RAM. On the MySQL replicas we use parallel replication workers, ROW binlog format, and LOGICAL_CLOCK for slave-parallel-type.
The scenario of our problem:
Replication on a read-only replica stops with error 1205.
Error text: Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.
In the binlog I see some normal UPDATE command - we have tons of them during the day.
A check of performance_schema.replication_applier_status_by_worker shows an error like this: "Worker 1 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx; Lock wait timeout exceeded; try restarting transaction"
I issue "stop slave;" from the mysql command-line tool, but the command hangs - the processlist shows | 56327 | root | localhost | NULL | Query | 61716 | Killing slave | stop slave | running indefinitely.
A manual reboot of the instance from the Linux command line does not work. The instance is frozen and I cannot ssh into it; I have to force a restart from the Google GCE web GUI.
In error.log I can see a sequence of error messages: Worker 7 failed executing transaction 'ANONYMOUS' at master log mysql-bin.00xxxx, end_log_pos xxxxxxxx; Could not execute Update_rows event on table xxxx.xxxx; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log mysql-bin.00xxxx, end_log_pos xxxxxxxxx, Error_code: 1205
The sequence ends with the error message: worker thread retried transaction 10 time(s) in vain, giving up. Consider raising the value of the slave_transaction_retries variable. Error_code: 1205
I tried setting the slave_transaction_retries variable higher (to 30), which lowered the number of "frozen cases", but the problem remains: once replication stops, I cannot restart it manually from the mysql command-line tool.
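For reference, inspecting the worker errors and raising the retry count looks roughly like this (a sketch; 30 is simply the value we tried, not a recommendation):
# Show per-worker applier errors
mysql -e "SELECT * FROM performance_schema.replication_applier_status_by_worker\G"
# Raise the retry count for transactions failing on lock wait timeouts
mysql -e "SET GLOBAL slave_transaction_retries = 30;"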
We did not have these frozen-replication problems on 5.7.22 or older releases. Although from time to time we had 1205 errors in replication due to the huge number of UPDATEs we run during the day, a manual restart of replication from the mysql command-line tool always worked without problems.
The situation seems a bit better on 5.7.24, which shipped with many replication fixes; on .24 we see far fewer cases of this problem, but it is still there.
Can I influence this behavior with some parameter?
What would you recommend checking if this problem happens again?
Can I force a restart of frozen replication without restarting MySQL?
Thank you very much for any ideas or help.

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint of the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are mentioned only twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
I'm not even sure if this is relevant. (Can it be?)
Any clue as to what other checks we can do?
We've experienced this error under two conditions.
Condition 1
On a single ZooKeeper host there was an orphaned ephemeral node in /overseer_elect/election. The session this ephemeral node was associated with no longer existed, so the orphaned ephemeral node could not be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition is also accompanied by an /overseer/queue directory that is clogged up with queue items that are forever waiting to be processed.
To resolve the issue, you must restart the ZooKeeper node in question with the orphaned ephemeral node.
If, after the restart, you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
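A sketch of how to spot the orphaned node with the ZooKeeper CLI; host and port are assumptions:
# Connect to the ZooKeeper node in question
zkCli.sh -server localhost:2181
# Inside the CLI: list the election entries and inspect each one
ls /overseer_elect/election
stat /overseer_elect/election/<entry>   # ephemeralOwner shows the owning session id
ls /overseer/queue                      # an enormous listing here suggests the clogged queue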
Condition 2
Cause: a misconfigured systemd service unit.
Make sure you have Type=forking and PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point two services were started. Because the second service cannot actually start (both can't listen on the same port), it either just sits there hanging in a failed state or fails to start the process while messing up the other Solr process, possibly by overwriting temporary cluster-state files locally.
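A quick sketch for checking this on a host, assuming the service is named solr:
# Inspect the unit: it should have Type=forking and a correct PIDFile= (per the note above)
systemctl cat solr
# Compare what systemd believes with what is actually running
systemctl status solr
ps aux | grep [s]olr   # more than one Solr JVM here means duplicate instances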
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this ZK node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node was completely missing on the cluster where the duplicate Solr instances were attempting to start.
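A quick way to check for this symptom from the ZooKeeper CLI (the collection name is a placeholder):
zkCli.sh -server localhost:2181 get /collections/<name>/leaders/shard1/leader
# A NoNode error (or empty output) means the leader znode is missing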
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe) and restart the Solr service.
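In shell terms, the fix is roughly the following sketch; the service name is an assumption:
# Kill every Solr instance (kill the java processes instead only if nothing else runs on the box)
pkill -f solr
# Verify nothing survived before restarting
ps aux | grep [s]olr
# Start the service again
systemctl start solr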
We figured it out!
The issue was that Jetty didn't really stop, so we had two running processes; for whatever reason this was fine for reading but not for writing.
Killing the older Java process solved the issue.

Sunspot lock issue on EngineYard

I am having a problem when creating a new record on a Ruby on Rails 3 server.
It updates Solr indexes, and it's running into a problem with a lock:
RSolr::Error::Http (RSolr::Error::Http - 500 Internal Server Error
Error: Lock obtain timed out: NativeFSLock#/data/dfcgit_r3/releases/20130620195714/solr/data/production/index/write.lock
org.apache.lucene.store.LockObtainFailedException: Lock obtain timed out: NativeFSLock#/data/dfcgit_r3/releases/20130620195714/solr/data/production/index/write.lock
at org.apache.lucene.store.Lock.obtain(Lock.java:84)
at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1108)
at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:83)
at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:101)
at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:171)
at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:219)
Any help with this?
We had the same error when running Sunspot Solr on Amazon EC2.
The write.lock error indicated that some process had not released the lock on the index: either the web server process was still holding it, or Solr had some other process running. I checked the running Solr processes by executing
ps aux | grep solr
and it showed there were 4 processes running! So I stopped Solr with rake sunspot:solr:stop, ran the grep again, killed the Solr processes still listed (kill -9), and then ran rake sunspot:solr:start.
And the sun shined again; it worked fine thereafter.