Unstable cluster with hosts.xml resets to default - marklogic-9

While performing Management API operations, such as removing app servers from a MarkLogic cluster, the cluster becomes unstable and hosts.xml is reset to its default/localhost setting.
The logs show something like:
MarkLogic: Slow send xx.xx.34.113:57692-xx.xx.34.170:7999, 4.605 KB in 1.529 sec; check host xxxx
Whether or not the infrastructure is slow, automatic recovery still does not happen.
How can we overcome this situation?
Can anyone provide more information on how the Management API works under the hood?
Adding further details:
DELETE http://${BootstrapHost}:8002${serveruri}
POST http://${BootstrapHost}:8002${forest}?state=detach
DELETE http://${BootstrapHost}:8002${forest}?replicas=delete&level=full
DELETE http://${BootstrapHost}:8001/admin/v1/host-config?remote-host=${hostname}
When removing servers (the 1st request) or removing a host (the 4th request), a few nodes in the cluster restart, and we then check for node availability. However, the outcome is unpredictable, and sometimes hosts.xml is reset to a default file that says the host is not part of any cluster.
Our current fix is to copy hosts.xml from another host to the faulty host, after which it starts working again.
We found that this was very unlikely to happen in MarkLogic 8, but with MarkLogic 9 the problem is frequent, and on AWS it is even more frequent.
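One thing we are considering is to wait until every host answers again before sending the next request in the sequence. A minimal sketch in Python, under stated assumptions: `wait_until` is a hypothetical helper, the readiness check is supplied by the caller (which endpoint to poll is not confirmed anywhere in this thread), and we have not verified that this prevents the hosts.xml reset.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 300.0,
               interval: float = 5.0,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll check() until it returns True or `timeout` seconds elapse."""
    deadline = clock() + timeout
    while True:
        try:
            if check():
                return True
        except OSError:
            # Connection refused/reset while a host restarts: keep polling.
            pass
        if clock() >= deadline:
            return False
        sleep(interval)
```

The idea would be to call `wait_until(...)` between the DELETE/POST requests above and only continue when it returns True.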

Related

"The search engine appears to be down or failing to respond to the search query"

I've installed FusionAuth (awesome product) into a Docker Swarm cluster using the official docker-compose.yml file and everything seems to work brilliantly.
EXCEPT
Periodically, when a user goes to log in, they will be presented with the above error stating that the search engine is not available. If they try again immediately, everything works correctly! I would, obviously, prefer that they never saw the error.
Elasticsearch is definitely running and is responding to API calls correctly, and I can see the fusionauth_user index is present and populated with docs.
I guess my question is twofold:
1) What role does the ElasticSearch engine play in the FusionAuth ecosystem and can it be disabled?
2) Is there a configurable timeout somewhere that is causing the error message and, if so, where can I change it?
I've searched the docs for answers to the above but I can't seem to find anything :-(
Thanks for the kind feedback.
1) What role does the ElasticSearch engine play in the FusionAuth ecosystem and can it be disabled?
Elasticsearch provides full text search of user data. Each time a user is created or updated the user is re-indexed. In this case during login, we are updating the search index with the last login instant.
This service is required and cannot be disabled. We have had clients request that we make this service optional for embedded applications or small-scale scenarios where Elasticsearch may not be required. While this is not currently planned, it is possible we may revisit this option in the future.
2) Is there a configurable timeout somewhere that is causing the error message and, if so, where can I change it?
Not currently.
Full disclosure, I am not a Docker or Docker Swarm expert at all - perhaps there are some nuances to Swarm and response time due to spin up and spin down of resources?
Do you see any exceptions in the log when a user sees this error on the login?

Ignite thin Client unstable behavior

I am new to Ignite and trying to play around with the example https://github.com/apache/ignite/blob/master/examples/src/main/java/org/apache/ignite/examples/client/ClientPutGetExample.java
I first tried the example with one server node and ran the client; everything worked fine.
Then I started a second node and used the following client config
IgniteClient igniteClient = Ignition.startClient(new ClientConfiguration().setAddresses("127.0.0.1:10800", "127.0.0.1:10801"));
with CacheMode.REPLICATED.
I re-ran the code and it worked fine. Then I kept the same config and shut down one of the nodes.
When I re-ran the code, the result was unstable: sometimes it gives me "Ignite cluster is unavailable" and sometimes it gives me an empty cache:
Thin client put-get example started.
Created cache [put-get-example].
Loaded [null] from the cache.
1- As per the documentation, the Ignite thin client is supposed to fail over to one of the running nodes.
2- Why is the cache not replicated?
Is there something that I am missing here?
Thank you for your help.
This looks like IGNITE-11599: the thin client will not fail over properly if some of the addresses were not up when it started.
It was fixed recently, but the fix has not made it into any released version. I'm afraid you will have to work around it by doing manual failovers.
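A manual failover can be as simple as trying one address at a time and keeping the first connection that succeeds. Here is a rough, language-neutral sketch written in Python; `connect_first_available` and the `connect` callable are hypothetical names, and I am assuming `connect` raises an `OSError` subclass when a node is down (in the Java example the equivalent would be calling `Ignition.startClient` with a single address and catching the connection exception).

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def connect_first_available(addresses: Sequence[str],
                            connect: Callable[[str], T]) -> T:
    """Try each address in order and return the first successful connection."""
    last_error = None
    for address in addresses:
        try:
            return connect(address)
        except OSError as exc:
            # This node is not reachable; remember why and try the next one.
            last_error = exc
    raise ConnectionError(
        f"no node reachable among {list(addresses)}") from last_error
```

This only covers the initial connection. If the node you are connected to goes down later, you would have to call it again to reconnect.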

DotNetNuke Lucene Search not working 'Lock obtain timed out' in load balanced env, how to fix?

We have a DotNetNuke site running on two servers that are load balanced. To ensure the files are in sync on these servers, we are using File Replication Service.
Search works fine on DotNetNuke when not load balanced, but in the load balanced setup the search stops working after a while (no suggestions, no results).
The following related exception is all over our log files:
[D:2][T:31][ERROR] DotNetNuke.Services.Exceptions.Exceptions - Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: NativeFSLock#D:\Sites\SiteName\App_Data\Search\write.lock
at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout)
at Lucene.Net.Index.IndexWriter.Init(Directory d, Analyzer a, Boolean create, IndexDeletionPolicy deletionPolicy, Int32 maxFieldLength, IndexingChain indexingChain, IndexCommit commit)
at Lucene.Net.Index.IndexWriter..ctor(Directory d, Analyzer a, MaxFieldLength mfl)
at DotNetNuke.Services.Search.Internals.LuceneControllerImpl.get_Writer()
at DotNetNuke.Services.Search.Internals.LuceneControllerImpl.Delete(Query query)
at DotNetNuke.Services.Search.Internals.InternalSearchControllerImpl.DeleteSearchDocumentInternal(SearchDocument searchDocument, Boolean autoCommit)
at DotNetNuke.Services.Search.Internals.InternalSearchControllerImpl.DeleteSearchDocumentsByModule(Int32 portalId, Int32 moduleId, Int32 moduleDefId)
at DotNetNuke.Services.Search.SearchDataStore.StoreSearchItems(SearchItemInfoCollection searchItems)
at DotNetNuke.Services.Search.SearchEngine.IndexContent()
at DotNetNuke.Services.Search.SearchEngineScheduler.DoWork()
My best guess is that both servers are running their own search indexing while the File Replication Service syncs the index files between them, which causes conflicts.
What would be the best way to solve this?
Add an exclusion rule to not replicate the search index folder, but let both servers keep running search?
Somehow disable one server from indexing?
Any other suggestions?
Installation details:
DNN v. 09.02.00 (366)
.NET Framework 4.6
There's a 'scheduler' tool inside of the settings section that contains all CRON/background jobs functionality.
One of the background jobs is the 'Search: Site Crawler' job which is responsible for indexing the website. When that job runs at the same time on both servers, unexpected conflicts occur. To prevent this from happening, you can configure the job to only run on a specified server using the 'Servers' setting.
After configuring the job to only run on one server, the issue did not come back and search still works on both servers.
Thanks @Sanjay for pointing me in the right direction.
If I remember correctly, search is done via a scheduled task. Have you tried setting up the task to run on only one server and then using file replication to sync the index across to the other server?

How to submit code to a remote Spark cluster from IntelliJ IDEA

I have two clusters, one in a local virtual machine and another in a remote cloud. Both clusters run in Standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
What I want to achieve:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark Master URL to spark://local1:7077 or spark://remote1:7077, and then run the code from IntelliJ IDEA. That is, I don't want to use spark-submit to submit the job.
But I ran into a problem:
When I use the local cluster, everything goes well. Both running the code in IntelliJ IDEA and using spark-submit submit the job to the cluster, and the job finishes.
But when I use the remote cluster, I get a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Note that it says sufficient resources, not sufficient memory!
This log keeps printing and nothing else happens. Both spark-submit and running the code in IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to a remote cluster?
If so, does it need any configuration?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my scenario is different. When I run my code in IntelliJ IDEA with the Spark Master set to the local virtual machine cluster, it works. With the remote cluster, I get the "Initial job has not accepted any resources;..." warning instead.
I want to know whether a security policy or firewall can cause this?
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least, there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads within the Spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
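The comparison itself is simple arithmetic: an executor can only be placed on a worker that has at least the requested memory and cores free. A small sketch in Python to illustrate it; `memory_to_mb` and `workers_that_fit` are hypothetical helpers, and the worker dictionaries (`free_memory_mb`, `free_cores`) are an assumed shape that you would fill in by hand from the numbers shown in the UI.

```python
import re

# Multipliers to MiB for the size suffixes used in spark-defaults.conf.
_UNITS_MB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def memory_to_mb(value: str) -> int:
    """Convert a Spark-style size such as '8g' or '512m' to MiB."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if not match:
        raise ValueError(f"unrecognised memory size: {value!r}")
    return int(int(match.group(1)) * _UNITS_MB[match.group(2)])


def workers_that_fit(workers, executor_memory: str, executor_cores: int):
    """Return the workers with enough free memory and cores for one executor."""
    needed_mb = memory_to_mb(executor_memory)
    return [w for w in workers
            if w["free_memory_mb"] >= needed_mb
            and w["free_cores"] >= executor_cores]
```

If the returned list is empty for the values in your configuration, no worker can host an executor and the job will keep printing the warning above.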

Solr issue: ClusterState says we are the leader, but locally we don't think so

So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint of the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in Cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
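To narrow down which entry might be the orphaned one, it can help to group the children of the election node by Solr node and see which node is registered under more than one session. A rough sketch in Python, assuming the entry names follow the `<session id>-<node name>-n_<sequence>` pattern visible in the question; `sessions_per_node` is a hypothetical helper, and you have to list the entries from ZooKeeper yourself.

```python
from collections import defaultdict


def sessions_per_node(entries):
    """Group election entries by Solr node name.

    Assumes each entry looks like '<session id>-<node name>-n_<sequence>'.
    """
    grouped = defaultdict(set)
    for entry in entries:
        # The session id is everything before the first hyphen.
        session_id, _, rest = entry.partition("-")
        # The node name is everything before the trailing '-n_<sequence>'.
        node_name, _, _sequence = rest.rpartition("-n_")
        grouped[node_name].add(session_id)
    return dict(grouped)
```

A node with more than one session id is only a hint. You still need to check whether each of those sessions actually exists before concluding that an entry is orphaned.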
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point a second instance was started. Because the second instance cannot start properly (both can't listen on the same port), it seems to either hang in a failed state or fail to start while still disrupting the other Solr process, possibly by overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of solr (or java if you know it's safe), and restart the solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had 2 running processes. For whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.