In Knox config file in Ambari we have defined:
<url>http://{{namenode_host}}:{{namenode_http_port}}/webhdfs</url>
The problem is we have 2 namenodes, one active and one passive for high availability. Our active namenode01 failed so namenode02 became active.
This caused problems for a lot of scripts, as they were hardcoded to point to namenode01. So we failed namenode02 back over to namenode01 using a command in a terminal, not Ambari.
Now, the macro {{namenode_host}} is defined as namenode02 and not namenode01.
So, where is {{namenode_host}} defined?
Or do we need to fail over from namenode01 to namenode02, then fail over again to namenode01 using Ambari, to update the macro?
If we need to fail over the namenode using Ambari, I'm assuming we need to select the "Restart" option? There isn't a direct failover command.
See issue here:
https://issues.apache.org/jira/browse/AMBARI-12763
This was committed to Ambari to support HA mode for Knox. However, if you're still looking for the location, take a look at the file that's edited in the patch. That file is where the macros are set. You'll have to find it on your local machine though.
It should be something like params_linux.py
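Before and after any failover, it can help to confirm which NameNode HDFS itself considers active. The following is a minimal sketch, assuming the `hdfs` CLI is on the PATH; it relies on `hdfs haadmin -getServiceState`, and the service IDs in the usage note are placeholders, not values taken from this cluster.

```python
import subprocess


def get_service_state(service_id):
    """Ask HDFS for the HA state of one NameNode ("active" or "standby")."""
    result = subprocess.run(
        ["hdfs", "haadmin", "-getServiceState", service_id],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def find_active(service_ids, get_state=get_service_state):
    """Return the first service ID that reports "active", or None if none does."""
    for service_id in service_ids:
        if get_state(service_id) == "active":
            return service_id
    return None
```

For example, `find_active(["nn1", "nn2"])` would return whichever ID is active, where "nn1" and "nn2" are hypothetical and should be replaced by the IDs listed under dfs.ha.namenodes.<nameservice> in hdfs-site.xml.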
I'm just trying out Splunk and have had an issue integrating a search head cluster with an indexer cluster.
I have 3 machines in a search head cluster and 3 machines in an indexer cluster. These are all on CentOS 7 with no firewall installed, and all machines are able to ping / view each other's Splunk instances (ip:8000 / ip:8089).
When following https://docs.splunk.com/Documentation/Splunk/6.6.2/DistSearch/SHCandindexercluster specifically
splunk edit cluster-config -mode searchhead -master_uri 10.152.31.202:8089 -secret newsecret123
I get an error of
Could not contact master. Check that the master is up, the master_uri=10.152.31.202:8089 and secret are specified correctly
I have removed the https:// part from the IPs above as I couldn't post with them included.
I have set the pass4SymmKey to be the same on all servers.
thanks
Please check the shclustering pass4SymmKey on both the search head cluster and the master.
I suspect a pass4SymmKey issue.
You should check splunkd.log to see what the error is. I would recommend not setting up the pass4SymmKey and verifying it works first; if not, then you have found your issue.
Also, you did not mention having an extra server acting as the cluster master. This should be an independent server from your indexers. You have one, right?
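For the splunkd.log check suggested above, a small filter can narrow the log down to the relevant lines. This is only a sketch: it assumes each log line carries its level as a space-delimited word such as ERROR or WARN, and the keywords in the usage note are illustrative choices rather than strings Splunk is guaranteed to log.

```python
def find_problem_lines(lines, levels=("ERROR", "WARN"), keywords=()):
    """Return lines at one of the given levels that mention any keyword.

    Keyword matching is case-insensitive. With no keywords, every line at a
    matching level is returned.
    """
    lowered = [k.lower() for k in keywords]
    matches = []
    for line in lines:
        # Keep only lines whose level appears as a separate word
        if not any(f" {level} " in line for level in levels):
            continue
        text = line.lower()
        if lowered and not any(k in text for k in lowered):
            continue
        matches.append(line.rstrip("\n"))
    return matches
```

A typical call would open $SPLUNK_HOME/var/log/splunk/splunkd.log on a search head and pass the file object as `lines`, with keywords such as "master" or "pass4symmkey".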
I'm trying to follow the steps in the RabbitMQ docs here to get clustering with SSL working on Windows. I'm noticing though that the "rabbitmqctl status" command starts failing after the environment variables defined in those steps are set. I'm getting the following error when executing "rabbitmqctl status":
Error: unable to connect to node 'rabbit@server1': nodedown
I've already configured RabbitMQ to use TLS 1.2 and have verified that it's working. I've ensured that my Erlang 18 cookie is the same in the user directory C:\users\me and C:\Windows on the machine, but the error persists, and is stopping other servers from clustering with it. The docs say that the Windows SSL Cluster setup is "Coming soon"... Here are the steps I've taken so far on server1. I think that Erlang wants forward slashes in the paths - this matches the rabbit.config SSL settings.
Combined the contents of my server\cert.pem and server\key.pem into rabbit.pem via the command "type server\cert.pem server\key.pem > server\rabbit.pem"
Created environment variable ERL_SSL_PATH and set to: "C:/Program Files/erl7.0/lib/ssl-7.0/ebin"
Created environment variable RABBITMQ_CTL_ERL_ARGS and set to: -pa "%ERL_SSL_PATH%" -proto_dist inet_tls -ssl_dist_opt server_certfile C:/OpenSSL-Win64/server/rabbit.pem -ssl_dist_opt server_secure_renegotiate true client_secure_renegotiate true
Created environment variable RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS and set to same value as RABBITMQ_CTL_ERL_ARGS
Copied the Erlang cookie at C:\Windows\.erlang.cookie to my local user profile directory.
Restarted rabbit using rabbitmq-service start
At this point, on server1, "rabbitmqctl status" no longer works. Attempts to try to join server2 to server1 result in a "node down" error.
Edit 1: I can't get the initial step in the docs working to ask Erlang to report its SSL directory on Windows in order to set ERL_SSL_PATH correctly. Erlang is installed at C:\Program Files\erl7.0 on my server.
Edit 2: Using werl.exe (at C:\Program Files\erl7.0\bin\werl.exe), I was able to issue a command "Foo=io:format(code:lib_dir(ssl, ebin))." and it reported the path as: c:/Program Files/erl7.0/lib/ssl-7.0/ebin. However, this doesn't seem to be the cause of this issue since that's already what I was using.
Thanks,
Andy
For environment changes to take effect on Windows, the service must be
re-installed. It is not sufficient to restart the service. This can be
done using the installer or on the command line with administrator
permissions
(source)
This will do:
rabbitmq-service.bat stop
rabbitmq-service.bat remove
rabbitmq-service.bat install
rabbitmq-service.bat start
Also, if the other cluster nodes were running while the node you're working on was down, its state might be assumed to have gone out of sync with theirs. In that case, the node might fail to start up and you might need to run:
rabbitmqctl force_boot
Check the logs to confirm (at %RABBIT_BASE%\log\rabbit@server.log).
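Since the question also depends on the Erlang cookie being identical in both locations, it may be worth verifying that mechanically rather than by eye. A minimal sketch, assuming the cookie files are small text files; the paths in the usage note come from the question and the helper name is made up for illustration.

```python
from pathlib import Path


def cookies_match(paths):
    """Return True if every path exists and all files hold the same cookie.

    Surrounding whitespace is ignored, so a trailing newline added by an
    editor does not count as a difference.
    """
    values = set()
    for p in paths:
        p = Path(p)
        if not p.is_file():
            return False
        values.add(p.read_text().strip())
    return len(values) == 1
```

For example, `cookies_match([r"C:\Windows\.erlang.cookie", r"C:\users\me\.erlang.cookie"])` should return True before the service is re-installed and started.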
Late answer but, hopefully this could help a searcher...
I have two clusters, one in a local virtual machine and another in a remote cloud. Both clusters are in Standalone mode.
My Environment:
Scala: 2.10.4
Spark: 1.5.1
JDK: 1.8.40
OS: CentOS Linux release 7.1.1503 (Core)
The local cluster:
Spark Master: spark://local1:7077
The remote cluster:
Spark Master: spark://remote1:7077
I want to accomplish this:
Write code (just a simple word count) in IntelliJ IDEA locally (on my laptop), set the Spark Master URL to spark://local1:7077 or spark://remote1:7077, then run my code in IntelliJ IDEA. That is, I don't want to use spark-submit to submit a job.
But I ran into a problem:
When I use the local cluster, everything goes well. Running the code in IntelliJ IDEA or using spark-submit both submit the job to the cluster, and the job finishes.
But when I use the remote cluster, I get a warning log:
TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Note that it says sufficient resources, not sufficient memory!
This log keeps printing, with no further action. Both spark-submit and running the code in IntelliJ IDEA give the same result.
I want to know:
Is it possible to submit code from IntelliJ IDEA to a remote cluster?
If so, does it need any configuration?
What are the possible reasons that can cause my problem?
How can I handle this problem?
Thanks a lot!
Update
There is a similar question here, but I think my situation is different. When I run my code in IntelliJ IDEA with the Spark Master set to the local virtual machine cluster, it works. With the remote cluster, I get the Initial job has not accepted any resources;... warning instead.
I want to know whether a security policy or firewall can cause this.
Submitting code programmatically (e.g. via SparkSubmit) is quite tricky. At the least, there is a variety of environment settings and considerations, handled by the spark-submit script, that are quite difficult to replicate within a Scala program. I am still uncertain how to achieve it, and there have been a number of long-running threads within the Spark developer community on the topic.
My answer here is about a portion of your post: specifically the
TaskSchedulerImpl: Initial job has not accepted any resources; check
your cluster UI to ensure that workers are registered and have
sufficient resources
The reason is typically a mismatch between the memory and/or number of cores requested by your job and what is available on the cluster. Possibly, when submitting from IJ, the settings in
$SPARK_HOME/conf/spark-defaults.conf
did not match the parameters required for your task on the existing cluster. You may need to update:
spark.driver.memory 4g
spark.executor.memory 8g
spark.executor.cores 8
You can check the Spark UI on port 8080 to verify that the resources you requested are actually available on the cluster.
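The comparison described above can be sketched as a small check: given the free cores and memory of each worker (as read off the master UI), does any worker have room for one executor of the requested size? The dictionary keys and worker names here are hypothetical, chosen for illustration, and the check is a simplification of what the scheduler does.

```python
def parse_memory_mb(value):
    """Convert a size such as "8g" or "512m" to megabytes."""
    units = {"m": 1, "g": 1024}
    value = value.strip().lower()
    return int(value[:-1]) * units[value[-1]]


def workers_that_fit(workers, executor_memory, executor_cores):
    """Return the names of workers with enough free cores and memory
    to host at least one executor of the requested size."""
    need_mb = parse_memory_mb(executor_memory)
    return [
        w["name"]
        for w in workers
        if w["free_cores"] >= executor_cores and w["free_memory_mb"] >= need_mb
    ]
```

If the returned list is empty for the values in spark-defaults.conf (for example "8g" and 8 cores), no worker can host such an executor, which is consistent with the warning never going away.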
So today we ran into a disturbing Solr issue.
After a restart of the whole cluster, one of the shards stopped being able to index/store documents.
We had no hint of the issue until we started indexing (querying the server looked fine).
The error is:
2014-05-19 18:36:20,707 ERROR o.a.s.u.p.DistributedUpdateProcessor [qtp406017988-19] ClusterState says we are the leader, but locally we don't think so
2014-05-19 18:36:20,709 ERROR o.a.s.c.SolrException [qtp406017988-19] org.apache.solr.common.SolrException: ClusterState says we are the leader (http://x.x.x.x:7070/solr/shard3_replica1), but locally we don't think so. Request came from null
at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:503)
at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:267)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:550)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.processUpdate(JsonLoader.java:126)
at org.apache.solr.handler.loader.JsonLoader$SingleThreadedJsonLoader.load(JsonLoader.java:101)
at org.apache.solr.handler.loader.JsonLoader.load(JsonLoader.java:65)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1916)
We run Solr 4.7 in cluster mode (5 shards) on Jetty.
Each shard runs on a different host with one ZooKeeper server.
I checked the ZooKeeper log and I cannot see anything there.
The only difference is that in the /overseer_elect/election folder I see this specific server repeated 3 times, while the other servers are only mentioned twice.
45654861x41276x432-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x368-x.x.x.x:7070_solr-n_00000003xx
74030267x31685x369-x.x.x.x:7070_solr-n_00000003xx
Not even sure if this is relevant. (Can it be?)
Any clue what other checks we can do?
We've experienced this error under 2 conditions.
Condition 1
On a single zookeeper host there was an orphaned Zookeeper ephemeral node in
/overseer_elect/election. The session this ephemeral node was associated with no longer existed.
The orphaned ephemeral node cannot be deleted.
Caused by: https://issues.apache.org/jira/browse/ZOOKEEPER-2355
This condition will also be accompanied by a /overseer/queue directory that is clogged-up with queue items that are forever waiting to be processed.
To resolve the issue you must restart the Zookeeper node in question with the orphaned ephemeral node.
If after the restart you see "Still seeing conflicting information about the leader of shard shard1 for collection <name> after 30 seconds", you will need to restart the Solr hosts as well to resolve the problem.
Condition 2
Cause: a mis-configured systemd service unit.
Make sure you have Type=forking and have PIDFile configured correctly if you are using systemd.
systemd was not tracking the PID correctly: it thought the service was dead when it wasn't, and at some point 2 services were started. Because the 2nd service cannot start properly (both can't listen on the same port), it seems either to sit there hanging in a failed state, or to fail to start the process while still interfering with the other Solr processes, possibly by overwriting temporary clusterstate files locally.
Solr logs reported the same error the OP posted.
Interestingly enough, another symptom was that ZooKeeper listed no leader for our collection in /collections/<name>/leaders/shard1/leader. Normally this zk node contains contents such as:
{"core":"collection-name_shard1_replica1",
"core_node_name":"core_node7",
"base_url":"http://10.10.10.21:8983/solr",
"node_name":"10.10.10.21:8983_solr"}
But the node is completely missing on the cluster with duplicate solr instances attempting to start.
This error also appeared in the Solr Logs:
HttpSolrCall null:org.apache.zookeeper.KeeperException$SessionExpiredException: KeeperErrorCode = Session expired for /roles.json
To correct the issue, kill all instances of Solr (or java, if you know it's safe) and restart the Solr service.
We figured it out!
The issue was that Jetty didn't really stop, so we had 2 running processes. For whatever reason this was fine for reading but not for writing.
Killing the older java process solved the issue.
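Both answers come down to two Solr processes running where one was expected, so a quick way to spot that may be useful. The sketch below only parses the text produced by `ps -eo pid,args`; the marker string in the usage note (start.jar) is an assumption about how a Jetty-based Solr is launched and should be adjusted to whatever identifies your instance.

```python
def pids_matching(ps_output, marker):
    """Return PIDs from `ps -eo pid,args` output whose command line contains marker."""
    pids = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        # The header row is skipped because its first field is not a number
        if len(parts) == 2 and parts[0].isdigit() and marker in parts[1]:
            pids.append(int(parts[0]))
    return pids
```

For instance, feeding it the output of `subprocess.run(["ps", "-eo", "pid,args"], capture_output=True, text=True).stdout` with the marker "start.jar" and getting back more than one PID on a host that should run a single instance points to the duplicate-process situation described above.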
I have deployed a system integrated with WebLogic, but I have run into a problem: WebLogic keeps growing the stdout.out file heavily (by GBs per week), which makes the system load more and more slowly.
Is there any way to prevent the file from growing so much, or to redirect the output into a .log file?
Thanks a lot
As David Herget says above, using the WebLogic Scripting Tool (WLST) to redirect StdOut and StdErr did not actually work for me either; I also had to do so through the web console (even though the values already appeared to be set in the console) and restart the relevant JVMs.
I can't reply to David's comment above due to being a newbie. [Edited since for clarity]
I'm not totally sure I fully understand your question.
Are you talking about the {server_name}.out file located in {Domain_Path}/servers/{server_name}/logs?
If so, I've never found any way to rotate those logs automatically, so I run a script each day to rotate it (basically copying it to another name, zipping it, and echoing a NULL into the original file, then erasing the older copy).
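The daily rotation described above (copy, compress, empty the original, drop old copies) can be sketched roughly as follows. This is not the script the answer refers to, just one way to do the same thing; the function name, the timestamped archive naming, and the `keep` default are choices made for illustration. Anything the server writes between the copy and the truncate is lost.

```python
import gzip
import shutil
import time
from pathlib import Path


def rotate_copy_truncate(log_path, keep=7):
    """Compress a copy of the log, empty the original, and prune old archives.

    Returns the path of the archive that was just written.
    """
    log_path = Path(log_path)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = log_path.with_name(f"{log_path.name}.{stamp}.gz")

    # Copy the current contents into a gzip archive next to the log
    with open(log_path, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)

    # Empty the original in place so the server keeps its open file
    with open(log_path, "r+b") as f:
        f.truncate(0)

    # Timestamped names sort chronologically; delete all but the newest `keep`
    archives = sorted(log_path.parent.glob(f"{log_path.name}.*.gz"))
    for old in archives[: max(len(archives) - keep, 0)]:
        old.unlink()
    return archive
```

It could be run once a day from cron against the {server_name}.out file.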
If you are talking about redirecting StdOut to the logs though, that can be done within the console for each server in the logging tab by checking "Redirect stdout logging enabled". Configuration to rotate those logs can also be done within that tab.
On that note, StdErr can also be redirected, but not from the console (in WL9). You have to set "RedirectStderrToServerLogEnabled" to true in the MBean tree using WLST (it's located at /Servers/{server_name}/Log/{server_name}).
I know the question was asked a long time ago, but I hope this helps nonetheless.
WebLogic provides log file rotation based on size and time interval.
You can try rotating the log files based on size. You would need to configure the log rotation policy from the admin console. Please refer to the link below for further details.
http://docs.oracle.com/cd/E12840_01/wls/docs103/ConsoleHelp/taskhelp/logging/RotateLogFiles.html
If you want to rotate the log files on demand, you can use the WLST script below.
C:\>java weblogic.WLST
#connect WLST to an Administration Server
wls:/offline> connect('username','password')
#navigate to the ServerRuntime MBean hierarchy
wls:/mydomain/serverConfig> serverRuntime()
wls:/mydomain/serverRuntime> ls()
#navigate to the server LogRuntimeMBean
wls:/mydomain/serverRuntime> cd('LogRuntime/myserver')
wls:/mydomain/serverRuntime/LogRuntime/myserver> ls()
-r-- Name myserver
-r-- Type LogRuntime
-r-x forceLogRotation java.lang.Void :
#force the immediate rotation of the server log file
wls:/mydomain/serverRuntime/LogRuntime/myserver> cmo.forceLogRotation()
wls:/mydomain/serverRuntime/LogRuntime/myserver>
http://docs.oracle.com/cd/E12840_01/wls/docs103/logging/config_logs.html#wp1001654