DataStax agent failing to report metrics once in a while

I'm running a DSE 4.6.5 cluster (Cassandra 2.0.14.352) with OpsCenter 5.1.1.
Once or twice a day, one of the nodes (sometimes more than one) stops reporting metrics until I manually restart the datastax-agent.
The agent process is still alive before I restart it. Here's the agent log:
WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131176 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,277 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,277 131177 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131178 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131179 operations dropped so far.
WARN [Thread-13] 2015-04-14 23:20:23,278 Cassandra operation queue is full, discarding cassandra operation
WARN [Thread-13] 2015-04-14 23:20:23,278 131180 operations dropped so far.
ERROR [cassandra-processor-1] 2015-04-14 23:20:24,387 Error when proccessing cassandra call
com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (no host was tried)
Please note that:
All the nodes are in the same datacenter, with the same hardware specs and the same configuration.
Nodes are using two NICs, so rpc_address and listen_address are on different networks.
OpsCenter is running on one of the cluster nodes.
Writes are intensive: please check my other question.
To sum up: on one of the machines (a different one each time, in round-robin fashion), the agent stops reporting data while on the others it keeps working fine.
Restarting the agent service corrects the issue, but shouldn't it recover by itself? Is this a bug? How can I work around this?
Please tell me if you need more information.
Thanks.

I've seen this same thing. Two things you can try.
1) Exclude or limit the keyspaces/CFs you collect metrics from (see the config sketch below): http://docs.datastax.com/en/opscenter/5.1/opsc/configure/opscControllingDataCollection_c.html?scroll=concept_ds_jlq_xk4_gk
2) Run OpsCenter on a separate cluster (a small one- or two-node cluster, separate from your main cluster): http://www.datastax.com/dev/blog/storing-opscenter-data-in-a-separate-cluster
Honestly, option 2 is the smarter move: you don't need large nodes, and if you store metrics on your main cluster and that cluster crashes, you're running blind.
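For reference, here is roughly what either option looks like in the per-cluster OpsCenter config file. Treat it as a sketch: the option names are the ones described in the linked docs and blog post, but the path, keyspace/CF names and storage-cluster addresses are placeholders, so double-check them against your OpsCenter 5.1 install.
# /etc/opscenter/clusters/<cluster_name>.conf (path may differ for tarball installs)
# Option 1: stop collecting metrics for keyspaces/CFs you don't need
[cassandra_metrics]
ignored_keyspaces = system, system_traces, OpsCenter
ignored_column_families = mykeyspace.my_big_cf
# Option 2: send the OpsCenter metrics to a separate storage cluster
[storage_cassandra]
seed_hosts = 10.0.0.50, 10.0.0.51
api_port = 9160
Restart opscenterd (and the agents) after editing so the new settings are picked up.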

Related

Data Node and Node Manager are not starting in pseudo-cluster mode (Apache Hadoop)

The Data Node and Node Manager are not starting in pseudo-cluster mode (Apache Hadoop).
Seeing this error in the log file:
2017-08-22 17:15:08,403 ERROR org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Unexpected error starting NodeStatusUpdater
org.apache.hadoop.yarn.exceptions.YarnRuntimeException: Recieved SHUTDOWN signal from Resourcemanager ,Registration of NodeManager failed, Message from ResourceManager: NodeManager from archit doesn't satisfy minimum allocations, Sending SHUTDOWN signal to the NodeManager.
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.registerWithRM(NodeStatusUpdaterImpl.java:278)
at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.serviceStart(NodeStatusUpdaterImpl.java:197)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:120)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.serviceStart(NodeManager.java:272)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.initAndStartNodeManager(NodeManager.java:496)
at org.apache.hadoop.yarn.server.nodemanager.NodeManager.main(NodeManager.java:543)
2017-08-22 17:15:08,404 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl failed in state STARTED; cause: org.apache.hadoop.yarn.exceptions.YarnRuntimeException:
The "ResourceManager: NodeManager from archit doesn't satisfy minimum allocations" error is seen when node on which node manager is being started does not have enough resources w.r.t yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb configurations.
Reduce values of yarn.scheduler.minimum-allocation-vcores and yarn.scheduler.minimum-allocation-mb then restart yarn.
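As a sketch, the scheduler minimums go in yarn-site.xml on the ResourceManager side; the values below are illustrative, not recommendations, and should be kept at or below what the host can actually offer:
<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>256</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
Registration fails when the resources the NodeManager advertises (yarn.nodemanager.resource.memory-mb / yarn.nodemanager.resource.cpu-vcores) fall below these minimums, so the alternative is to raise what the NodeManager advertises instead.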

Datanode error: NameSystem.getDatanode

Help please, folks.
I am trying to set up my Hadoop multi-node environment (1 master, 1 secondary and 3 slaves; Hadoop 2.7.1/Ubuntu 14 on AWS) and I am getting a "NameSystem.getDatanode" ERROR message. I have browsed, read and tried, but I have reached my limits. Could you point me at least in some direction?
Logs (extract) from the master; xxx-141/142/143 are the IPs of the slaves:
Line 134: 2016-01-23 17:36:19,432 ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node DatanodeRegistration(XXX.XX.XX.143:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node 172.31.22.141:50010 is expected to serve this storage.
Line 135: 2016-01-23 17:36:19,457 ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node DatanodeRegistration(XXX.XX.XX.142:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node 172.31.22.141:50010 is expected to serve this storage.
Line 159: 2016-01-23 17:36:20,988 ERROR org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.getDatanode: Data node DatanodeRegistration(XXX.XX.XX.141:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node XXX.XX.XX.143:50010 is expected to serve this storage.
Extract from the SLAVE2 server logs:
2016-01-23 17:36:14,812 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
2016-01-23 17:36:18,607 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Unsuccessfully sent block report 0x3c90bbfe60c, containing 1 storage report(s), of which we sent 0. The reports had 1 total blocks and used 0 RPC(s). This took 4 msec to generate and 144 msecs for RPC and NN processing. Got back no commands.
2016-01-23 17:36:18,608 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool BP-1050309752-MAST.XX.XX.169-1453113991010 (Datanode Uuid 6826238d-9213-4b19-a6eb-13115e3bea8d) service to master/MAST.XX.XX.169:9000 is shutting down
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.UnregisteredNodeException): Data node DatanodeRegistration(1XX.XX.XX.142:50010, datanodeUuid=6826238d-9213-4b19-a6eb-13115e3bea8d, infoPort=50075, infoSecurePort=0, ipcPort=50020, storageInfo=lv=-56;cid=CID-57295bbd-e78e-4265-99f7-fdacccbcb33a;nsid=1674724909;c=0) is attempting to report storage ID 6826238d-9213-4b19-a6eb-13115e3bea8d. Node 1XX.XX.XX.141:50010 is expected to serve this storage.
at org.apache.hadoop.hdfs.server.blockmanagement.DatanodeManager.getDatanode(DatanodeManager.java:495)
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.processReport(BlockManager.java:1791)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.blockReport(NameNodeRpcServer.java:1315)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolServerSideTranslatorPB.blockReport(DatanodeProtocolServerSideTranslatorPB.java:163)
at org.apache.hadoop.hdfs.protocol.proto.DatanodeProtocolProtos$DatanodeProtocolService$2.callBlockingMethod(DatanodeProtocolProtos.java:28543)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:969)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2045)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2043)
at org.apache.hadoop.ipc.Client.call(Client.java:1476)
at org.apache.hadoop.ipc.Client.call(Client.java:1407)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
at com.sun.proxy.$Proxy13.blockReport(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:199)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:463)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:688)
at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:823)
at java.lang.Thread.run(Thread.java:745)
2016-01-23 17:36:18,610 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Ending block pool service for: Block pool BP-1050309752-MAST.XX.XX.169-1453113991010 (Datanode Uuid 6826238d-9213-4b19-a6eb-13115e3bea8d) service to master/MAST.XX.XX.169:9000
2016-01-23 17:36:18,611 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Removed Block pool BP-1050309752-MAST.XX.XX.169-1453113991010 (Datanode Uuid 6826238d-9213-4b19-a6eb-13115e3bea8d)
2016-01-23 17:36:18,611 INFO org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl: Removing block pool BP-1050309752-MAST.XX.XX.169-1453113991010
2016-01-23 17:36:20,611 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Exiting Datanode
2016-01-23 17:36:20,613 INFO org.apache.hadoop.util.ExitUtil: Exiting with status 0
2016-01-23 17:36:20,614 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down DataNode at ip-SLAV-XX-XX-142/1XX.XX.XX.142
************************************************************/
It looks like you have three slaves:
172.31.22.141:50010
172.31.22.142:50010
172.31.22.143:50010
and that you created two of them by cloning the first slave after it was already included in the cluster.
The two clones now have a copy of the DFS data and use the same storage ID as the first slave. The NameNode expects only one node per storage ID, and it is trying to tell you this by logging:
[...] is attempting to report storage ID [...].
Node [...]:50010 is expected to serve this storage.
You can try removing the dfs directory on two of the slaves, then restarting them.
That is, stop the slaves and do an rm -rf on the dfs directory, like:
rm -rf /tmp/hadoop-hadoop/dfs/
You can then restart and check that all slaves are connecting and test file replication, e.g. by setting the replication level to 4 for some files like:
hdfs dfs -setrep -w 4 -R /user/somedir
The -w option causes the command to wait until replication has succeeded.
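Putting that together, a minimal per-clone sequence might look like the following, assuming $HADOOP_HOME/sbin is on the PATH and the data directory is the default /tmp/hadoop-hadoop/dfs (check dfs.datanode.data.dir in hdfs-site.xml before deleting anything):
hadoop-daemon.sh stop datanode     # stop the DataNode on the cloned slave
rm -rf /tmp/hadoop-hadoop/dfs/     # wipe the cloned storage; a new storage ID is generated on the next start
hadoop-daemon.sh start datanode    # re-register with the NameNode
hdfs dfsadmin -report              # confirm all three DataNodes are now listed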

The bgsave on a Redis server on Windows always fails, what should I do?

Today I took a look and found that one of our Redis servers has a very serious problem: it can't save.
Output of INFO:
# Persistence
loading:0
rdb_changes_since_last_save:82904
rdb_bgsave_in_progress:1
rdb_last_save_time:1444978843
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:1444978844
rdb_current_bgsave_time_sec:1448356070
Basically, the last successful bgsave happened on Oct 11th. I can't restart the server because I think I'll lose all the in-memory data.
What should I do?
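A few checks that may help narrow this down, using standard redis-cli commands (adjust host/port to your instance; this is a diagnostic sketch, not a fix):
redis-cli INFO persistence   # rdb_bgsave_in_progress, rdb_last_bgsave_status, rdb_last_save_time
redis-cli LASTSAVE           # unix timestamp of the last successful save to disk
redis-cli BGSAVE             # request a fresh background save; any error will show up in the Redis log
If rdb_bgsave_in_progress stays at 1, Redis believes a save is still running and will refuse to start another one.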

Why does an ActiveMQ cluster fail with "server null" when the Zookeeper master node goes offline?

I have encountered an issue with ActiveMQ where the entire cluster will fail when the master Zookeeper node goes offline.
We have a 3-node ActiveMQ cluster set up in our development environment. Each node has ActiveMQ 5.12.0 and ZooKeeper 3.4.6 (note: we have done some testing with ZooKeeper 3.4.7, but this has failed to resolve the issue; time constraints have so far prevented us from testing ActiveMQ 5.13).
What we have found is that when we stop the master ZooKeeper process (via the "end process tree" command in Task Manager), the remaining two ZooKeeper nodes continue to function as normal. Sometimes the ActiveMQ cluster is able to handle this, but sometimes it is not.
When the cluster fails, we typically see this in the ActiveMQ log:
2015-12-18 09:08:45,157 | WARN | Too many cluster members are connected. Expected at most 3 members but there are 4 connected. | org.apache.activemq.leveldb.replicated.MasterElector | WrapperSimpleAppMain-EventThread
...
...
2015-12-18 09:27:09,722 | WARN | Session 0x351b43b4a560016 for server null, unexpected error, closing socket connection and attempting reconnect | org.apache.zookeeper.ClientCnxn | WrapperSimpleAppMain-SendThread(192.168.0.10:2181)
java.net.ConnectException: Connection refused: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)[:1.7.0_79]
at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)[:1.7.0_79]
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)[zookeeper-3.4.6.jar:3.4.6-1569965]
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)[zookeeper-3.4.6.jar:3.4.6-1569965]
We were immediately concerned by the fact that (A) ActiveMQ seems to think there are four members in the cluster when it is only configured with 3, and (B) when the exception is raised, the server appears to be null. We then increased ActiveMQ's logging level to DEBUG in order to display the list of members:
2015-12-18 09:33:04,236 | DEBUG | ZooKeeper group changed: Map(localhost -> ListBuffer((0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}), (0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}), (0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}), (0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null}))) | org.apache.activemq.leveldb.replicated.MasterElector | ActiveMQ BrokerService[localhost] Task-14
Can anyone suggest why this may be happening and/or suggest a way to resolve this? Our configurations are shown below:
ZooKeeper:
tickTime=2000
dataDir=C:\\zookeeper-3.4.7\\data
clientPort=2181
initLimit=5
syncLimit=2
server.1=192.168.0.10:2888:3888
server.2=192.168.0.11:2888:3888
server.3=192.168.0.12:2888:3888
ActiveMQ (server.1):
<persistenceAdapter>
<replicatedLevelDB
directory="activemq-data"
replicas="3"
bind="tcp://0.0.0.0:61619"
zkAddress="192.168.0.11:2181,192.168.0.10:2181,192.168.0.12:2181"
zkPath="/activemq/leveldb-stores"
hostname="192.168.0.10"
weight="5"/>
<!-- server.2 has a weight of 10, server.3 has a weight of 1 -->
</persistenceAdapter>
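One way to check, from any host with nc (or telnet) available, whether the two surviving ZooKeeper nodes still form a quorum after the master is killed is ZooKeeper's four-letter-word commands (the IPs below are the ones from the config above):
echo ruok | nc 192.168.0.11 2181              # replies "imok" if the server is up
echo stat | nc 192.168.0.11 2181 | grep Mode  # Mode: leader or follower; repeat for 192.168.0.12
If both survivors answer and one of them reports Mode: leader, ZooKeeper itself still has a quorum, which helps separate a ZooKeeper election problem from the ActiveMQ-side reconnect behaviour shown in the log.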

RabbitMQ consumes memory and shuts down

I just installed OpenStack Juno using devstack and observed that RabbitMQ (package rabbitmq-server-3.1.5-10, installed by yum) is not stable, i.e. it quickly eats up the memory and shuts down; the machine has 2G of RAM. Below are the messages from the logs and 'systemctl status' before the daemon died:
=INFO REPORT==== 18-Dec-2014::01:25:40 ===
vm_memory_high_watermark clear. Memory used:835116352 allowed:835212083
=WARNING REPORT==== 18-Dec-2014::01:25:40 ===
memory resource limit alarm cleared on node rabbit@node
=INFO REPORT==== 18-Dec-2014::01:25:40 ===
accepting AMQP connection <0.27011.5> (10.0.0.11:55198 -> 10.0.0.11:5672)
=INFO REPORT==== 18-Dec-2014::01:25:41 ===
vm_memory_high_watermark set. Memory used:850213192 allowed:835212083
=WARNING REPORT==== 18-Dec-2014::01:25:41 ===
memory resource limit alarm set on node rabbit@node.
**********************************************************
*** Publishers will be blocked until this alarm clears ***
**********************************************************
rabbitmqctl[770]: ===========
rabbitmqctl[770]: nodes in question: [rabbit@node]
rabbitmqctl[770]: hosts, their running nodes and ports:
rabbitmqctl[770]: - node: [{rabbitmqctl770,40089}]
rabbitmqctl[770]: current node details:
rabbitmqctl[770]: - node name: rabbitmqctl770@node
rabbitmqctl[770]: - home dir: /var/lib/rabbitmq
rabbitmqctl[770]: - cookie hash: FftrRFUESg4RKWsyb1cPqw==
systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2
systemd[1]: Unit rabbitmq-server.service entered failed state.
I know about set_vm_memory_high_watermark, but it doesn't solve the issue. I want to ensure that the daemon doesn't shut down abruptly. Has anyone seen this before and could advise?
Thanks.
UPDATE
Upgraded to version 3.4.2, taken directly from www.rabbitmq.com/download.html. The new version doesn't consume RAM as fast and tends to run longer than the previous version, but it eventually still eats up all the memory and shuts down.
I think the number of connections to the server keeps increasing and they are held open without ever being closed, which is why memory consumption keeps growing. Once RAM usage goes beyond the watermark, the RabbitMQ server stops accepting network requests. You can either close the connections that are left open or add RAM to the system, but adding RAM only postpones the problem, so it is better to close the connections.
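If leaked connections are the suspicion, here is a hedged sketch of how to confirm and clean them up with rabbitmqctl; "<connection_pid>" is a placeholder for a PID printed by list_connections:
rabbitmqctl list_connections pid peer_host peer_port state   # see what is holding connections open
rabbitmqctl close_connection "<connection_pid>" "closing idle connection"
rabbitmqctl set_vm_memory_high_watermark 0.6                 # optionally raise the threshold (fraction of total RAM)
Raising the watermark only buys time; the durable fix is making the clients close or reuse their connections.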
Try using CloudAMQP instead of installing RabbitMQ locally; that will sidestep the problem. First create an account here: https://customer.cloudamqp.com/signup.
Then create your queue there and connect your application to it.