ambari cluster + poor connection between ambari-agent and ambari-server - ambari

We have an Ambari cluster with 872 data-node machines, running Ambari version 2.6.x.
We currently have some network problems.
After a long investigation we found that the ambari-agent running on some machines does not communicate well with the ambari-server.
As a result we see strange behavior, such as 5 dead data-nodes in the Ambari dashboard, while the datanode machines themselves are definitely healthy.
Is it possible to set a more tolerant value in the ambari-agent configuration, so that the ack between the ambari-agent and the ambari-server waits a little longer and the network problems are ignored?
Something like a timeout or connection time between the ambari-agent and the ambari-server.

First of all, you need to find the root cause of why the DataNode is showing as dead.
The ambari-agent runs on every node. It is responsible for sending
metrics and a heartbeat to the Ambari server, which then publishes them to
your Ambari web UI.
The NameNode waits for 10 minutes before it declares a DataNode dead and copies
its blocks to other DataNodes.
If Ambari shows that a DataNode is dead, check the ambari-agent status on
that specific node by running service ambari-agent status. In parallel, check ambari-agent.log on that node to see why the ambari-agent stopped working.
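For example, on the node reported as dead (a minimal check; /var/log/ambari-agent is the default log location, adjust if yours differs):

# confirm the agent is up, then look at its most recent log entries
sudo service ambari-agent status
tail -n 100 /var/log/ambari-agent/ambari-agent.log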

You can configure the HTTP timeouts the ambari-agent uses for service tasks:
https://github.com/apache/ambari/blob/trunk/ambari-agent/conf/unix/ambari-agent.ini
There is an HTTP timeout section in that file that you can tune based on your network throughput.
On an installed node the file is /etc/ambari-agent/conf/ambari-agent.ini.
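As an illustration of the kind of settings involved (parameter names differ between Ambari versions, so verify them against the ambari-agent.ini linked above; the values below are only placeholders):

[server]
hostname=ambari-server-host
url_port=8440
secured_url_port=8441
; seconds the agent waits before retrying a lost connection to the server
connect_retry_delay=10

[heartbeat]
; seconds between full state reports from the agent to the server
state_interval_seconds=60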

Related

Redis cluster node failure not detected on MISCONF

We currently have a Redis cache cluster with 3 masters and 3 slaves hosted on 3 Windows servers (1 master/slave per server). We are using StackExchange.Redis as our client.
We have RDB disabled but AOF enabled, and are experiencing some problems with the cluster in the following situation:
One of our servers became full and the Redis node on this server was unable to write to the AOF file (the error returned to the client was MISCONF Errors writing to the AOF file: No space left on device).
The cluster did not detect that the node was failing and so did not exclude it from the cluster.
All cache operations were blocked until we freed some space on the server.
We know that we don't need the AOF, so we disabled it after the incident.
But we would like to confirm or refute our understanding of Redis clustering: in our view, if a node experiences a failure, the cluster should redirect all requests to another node. We have tested this by stopping a master node, and a slave was promoted to master, so we are confident that our cluster is working; but we are not sure why, in our case, the node was not marked as failing.
Is the cluster capable of detecting a node failure when the failure only happens when a request is made from a client to the cluster?
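One thing worth checking: Redis Cluster failure detection is driven by the cluster bus ping/pong exchange (bounded by the cluster-node-timeout setting), not by command-level errors such as MISCONF, so a node that still answers pings is not flagged as failing. You can see how the cluster itself views each node with standard redis-cli commands (host/port below are placeholders):

# cluster-wide state as seen by this node
redis-cli -h <node-ip> -p <port> cluster info
# per-node flags; a node the cluster considers failing is marked fail?/fail
redis-cli -h <node-ip> -p <port> cluster nodes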

Activemq stops working - activemq/zookeeper setup

I've configured 3 zookeepers and 3 activemq instances in 1 cluster.
Scenario
3 activemq instances, with 1 master and the other two as slaves.
all 3 activemq instances are running, i.e. sudo service activemq status returns running, but checking the logs, 1 instance (activemq1) is currently waiting for other cluster members, 1 instance (activemq2) has stopped, and 1 instance (activemq3) has an error. Assuming that we only require two instances to elect a master, this setup should be able to run successfully.
two activemq instances should be running
zookeeper instances are running fine.
Issue
Below are the stack traces of the respective activemq instances. Based on my understanding, it needs at least two properly running activemq instances for the cluster to nominate a master instance. Given that all activemq instances report running when issued sudo service activemq status, I'm assuming there is an issue inside each activemq instance - refer to the stack traces below. Now, I noticed in the logs that activemq1 only fails to run properly because the other activemq instances failed internally. Notice the stack trace of activemq2: it is stuck after it successfully connected to zookeeper, and activemq3 has an issue I still need to figure out. The issue is fixed when I restart activemq2 and activemq3. However, I can't be sure this won't happen again, hence this question.
activemq1 shows the stack trace below, which I assume is because the other 2 activemq instances are running but have errors
Session establishment complete on server 10.5.4.111/10.5.4.111:2181, sessionid = 0x1582db00708000c, negotiated timeout = 4000
Not enough cluster members connected to elect a master.
Not enough cluster members connected to elect a master.
Not enough cluster members connected to elect a master.
activemq2 has the stack trace below, which is the one I don't understand. It has stopped after successfully connecting to zookeeper, which should have been detected by the other activemq instances belonging to the cluster (activemq1 and activemq3)
Opening socket connection to server 10.5.4.111/10.5.4.111:2181
Socket connection established to 10.5.4.111/10.5.4.111:2181, initiating session
Session establishment complete on server 10.5.4.111/10.5.4.111:2181, sessionid = 0x1582db00708000d, negotiated timeout = 4000
activemq3 has the below stacktrace
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:568)[apache-jsp-8.0.9.M3.jar:2.3]
Configuration for activemq
The previous config here had a 2s zkSessionTimeout, which is the default. I changed it to 4s, as suggested by some searching, to give an activemq instance more time to register itself with zookeeper.
<persistenceAdapter>
<replicatedLevelDB
directory="${activemq.data}/leveldb"
replicas="3"
bind="tcp://0.0.0.0:61619"
zkAddress="zookeeper_addresses_here"
hostname="activemq_hostname_here"
zkSessionTimeout="4s"
/>
</persistenceAdapter>
Configuration for zookeeper
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/my/data/dir
clientPort=2181
server.1=activemq1_privateIP:2888:3888
server.2=activemq2_privateIP:2888:3888
server.3=activemq3_privateIP:2888:3888
autopurge.purgeInterval=24
autopurge.snapRetainCount=5
Zookeeper version 3.4.9
ActiveMQ version 5.13.4
Setup via Opswork
The attribute "directory" master-slave mq is need to refer to the same folder

Flink Jobmanager not able to see task managers

So I've installed an apache flink cluster on our network. I've done the configurations as illustrated below. This Master (JobManager) starts, and sends the start command to all the slaves via ssh. I can see that the task managers are running after they were started by the master node.
Config file on all nodes:
jobmanager.rpc.address: flmaster
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
taskmanager.tmp.dirs: /apps/storage/runtime/flink/workspace
recovery.mode: zookeeper
recovery.zookeeper.quorum:zk1:2181, zk2:2181, zk3:2181
recovery.zookeeper.storageDir: /apps/runtime/flink/recovery
env.java.home: /apps/java/
Then I have a file called slaves in the config folder with a list of the slave nodes.
flSlave1
flSlave2
flSlave3
I then start it
../bin/start-cluster.sh
This opens an ssh session to all the slave nodes, and starts the task manager. I can see this with ps ax | grep java
I can open the Web-Ui on flMaster:8081
On the WebUI I can see the slave node count is 0. I have no task managers.
As a test, I started the wordcount.jar job, and it tells me it cannot run the job since there are no slots open.
/apps/flink/bin/flink run /apps/flink/examples/batch/WordCount.jar
the response:
07/20/2016 13:19:01 Job execution switched to status FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job.*
Well, I guess if there are no task managers/slave nodes, there will be no slots.
Any one ever seen this issue?
Use the fully qualified hostname instead of the short name, e.g. hostname.xyz.com instead of just hostname. Or you could also try using the IP address.
Try doing a telnet to the JobManager machine's RPC port. The TaskManagers talk to the JobManager through RPC, so check the network settings to confirm that the JobManager's and TaskManagers' RPC ports are reachable.
Also check the blob server port. Check the TaskManager logs to see whether they are able to connect to the JobManager's blob server or not.
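For example (a minimal sketch; 6123 matches jobmanager.rpc.port above, and the log path is assumed from the /apps/flink install mentioned in the question):

# from a task manager host, verify the JobManager RPC port is reachable
telnet flmaster 6123

# look for registration / blob server connection errors in the TaskManager log
grep -iE "blob|jobmanager|connect" /apps/flink/log/flink-*-taskmanager-*.log | tail -n 50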

Hadoop Cluster deployment using Apache Ambari

I have listed a few queries related to Ambari below:
Can I configure more than one Hadoop cluster via the Ambari UI?
(Note: I am using Ambari 1.6.1 for my Hadoop cluster deployment and I am aware that this can be done via the Ambari API, but I am not able to find it in the Ambari portal.)
We can check the status of the services on each node with the "jps" command if we have configured the Hadoop cluster without Ambari.
Is there any way similar to "jps" to check from the backend whether the Hadoop cluster setup was successful?
(Note: I can see that the services are showing UP on the Ambari portal.)
Help Appreciated !!
Please let me know if any additional information is required.
Thanks
The Ambari UI is served up by the Ambari server for a specific cluster. To configure another cluster you need to point your browser to the URL of that other cluster's Ambari server. So you can't see the configuration of multiple clusters on the same web page, but you can set browser bookmarks to jump from configuration to configuration.
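As for checking from the backend: jps still works on any node, and the Ambari REST API (the same API mentioned in the question) can report cluster and service state. A minimal sketch, assuming the default port 8080 and admin credentials:

# list the clusters managed by this Ambari server
curl -u admin:admin http://ambari-server:8080/api/v1/clusters

# show the state of every service in a given cluster
curl -u admin:admin "http://ambari-server:8080/api/v1/clusters/<cluster_name>/services?fields=ServiceInfo/state"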

Datastax OpsCenter not showing nodes

I installed DataStax Enterprise on my Win7 system, but it is not displaying any node in the OpsCenter dashboard. (Actually I have re-installed DataStax due to some issue with the previous installation.)
I am getting the node details on the command line using the nodetool command, but no node is present in the DataStax OpsCenter dashboard.
I think the OpsCenter agent is failing to connect to the node.
Please help me
Thanks,
Subhra
The agent might not be started on your system. On Linux it is in /usr/share/datastax-agent/bin; run the 'install_agent'.
Also check that the ports required to run OpsCenter are not blocked.
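For example, on a Linux node (the service name follows the packaged install; on Windows, check the equivalent DataStax agent service in the Services console instead):

# check whether the agent service is running, and start it if it is not
sudo service datastax-agent status
sudo service datastax-agent start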
Follow the procedure below:
1) Check that the datastax-agent is installed on the nodes and that its service is running.
2) Check that the port connections for the datastax-agent are open.
http://docs.datastax.com/en/archived/opscenter/5.1/opsc/reference/opscPorts_r.html
3) Reconfigure your existing cluster details in OpsCenter, after deleting the previous configuration in OpsCenter.
4) If the issue still exists, check the OpsCenter log file (opscenterd.log).
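A quick way to test the port connectivity from step 2 (61620 and 61621 are the usual opscenterd and agent ports; confirm them against the port reference linked above):

# from a cluster node, check that opscenterd is reachable
nc -zv opscenter_host 61620

# from the OpsCenter host, check that the agent on a node is reachable
nc -zv node_host 61621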