Flink JobManager not able to see task managers - ssh

So I've installed an Apache Flink cluster on our network, configured as shown below. The master (JobManager) starts and sends the start command to all the slaves via ssh. I can see that the task managers are running after they were started by the master node.
Config file on all nodes:
jobmanager.rpc.address: flmaster
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 1
taskmanager.memory.preallocate: false
parallelism.default: 1
jobmanager.web.port: 8081
taskmanager.tmp.dirs: /apps/storage/runtime/flink/workspace
recovery.mode: zookeeper
recovery.zookeeper.quorum: zk1:2181, zk2:2181, zk3:2181
recovery.zookeeper.storageDir: /apps/runtime/flink/recovery
env.java.home: /apps/java/
Then I have a file called slaves in the config folder with a list of the slave nodes.
flSlave1
flSlave2
flSlave3
I then start it
../bin/start-cluster.sh
This opens an ssh session to all the slave nodes and starts the task managers. I can see this with ps ax | grep java
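Beyond the process check, the taskmanager logs show whether registration with the jobmanager failed (a sketch; the log directory under /apps/flink is an assumption based on the paths above):
ps ax | grep -i taskmanager
tail -n 100 /apps/flink/log/flink-*-taskmanager-*.log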
I can open the web UI on flMaster:8081
In the web UI I can see the slave node count is 0. I have no task managers.
As a test, I started the WordCount job, and it tells me it cannot run the job since there are no slots open.
/apps/flink/bin/flink run /apps/flink/examples/batch/WordCount.jar
The response:
07/20/2016 13:19:01 Job execution switched to status FAILING.
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Not enough free slots available to run the job.
Well, I guess if there are no task managers/slave nodes, there will be no slots.
Has anyone ever seen this issue?

Use the fully qualified hostname instead of the short name, e.g. hostname.xyz.com instead of just hostname. Alternatively, you could try using the IP address.
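In this setup that would mean changing the RPC address in the config file on every node (a sketch; the xyz.com domain is a stand-in for your real one):
jobmanager.rpc.address: flmaster.xyz.com
The slaves file would list the fully qualified names (flSlave1.xyz.com etc.) the same way.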

Try doing a telnet to the jobmanager machine's RPC port. The taskmanagers talk to the jobmanager through RPC, so check the network settings to confirm that the jobmanager's and the taskmanagers' RPC ports are reachable.
Also check the blob server port: look in the taskmanager logs to see whether they are able to connect to the jobmanager's blob server.
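A sketch of those checks, run from a slave node (the RPC port comes from the config above; blob.server.port is ephemeral by default, so pinning it to a fixed value is an assumption that makes it testable through a firewall):
telnet flmaster 6123
grep -i blob /apps/flink/log/flink-*-taskmanager-*.log
and in flink-conf.yaml on all nodes:
blob.server.port: 6124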

Related

JMeter remote testing: master machine starts but freezes

I have followed the tutorial below to set up a distributed testing environment for JMeter:
https://www.perfmatrix.com/configuration-process-for-distributed-testing-in-jmeter-5-3/
I have managed to start the remote (slave machine) server and then to trigger the test from the master machine in non-GUI mode.
But it doesn't finish the execution. What could be the reasons for this?
(I am using JMeter version 5.4 on both machines, and they are on the same network. The master machine runs Windows and the slave machine runs macOS.)
Details about the test
In the Test Plan I have a simple HTTP Request sampler that makes a request to https://www.google.com (port 443), and no customized listener plugins in the Thread Group, just a simple listener. I have no externalized data such as a CSV file either.
In master jmeter.properties file I have only added an entry:
remote_hosts=[internal IP-address]
I have also copied over the .jks file generated from the master to the bin folder of the slave machine.
I have first started the jmeter-server from the slave machine with the following command:
sh ./jmeter-server -Djava.rmi.server.hostname=[slave machine internal IP-address]
Afterwards I have started the master jmeter in NON-GUI by following:
jmeter -n -t [UNC-path to jmx file] -r
If you need additional details, just let me know!
The referenced article contains several steps which are not required and some statements which are not true at all.
We cannot help you without seeing:
Your test plan, at least Thread Group configuration
jmeter.log file from master
jmeter-server.log file from slave
The most common problems are:
RMI ports are not open in the firewall, so the master cannot communicate with the slave or vice versa (a sketch of pinning the ports follows this list)
Test plan uses a JMeter Plugin which is not installed on the slave
Test plan uses an external data file, i.e. a CSV file referenced in the CSV Data Set Config, and the file isn't present on the slave
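A minimal sketch of pinning the RMI ports so specific firewall rules can be added for them (the port numbers are assumptions; set the same values in user.properties on both master and slave):
server_port=1099
server.rmi.localport=4000
client.rmi.localport=4001
With fixed ports you can open exactly these in the firewalls instead of the whole ephemeral range.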
More information:
Remote Testing
Apache JMeter Distributed Testing Step-by-step
How to Perform Distributed Testing in JMeter
Remote hosts and RMI configuration

Ambari cluster + poor connection between ambari-agent and ambari-server

We have an Ambari cluster with 872 data-node machines, on Ambari version 2.6.x.
We currently have some network problems.
After a long investigation we found that the ambari-agent running on some machines does not communicate well with the Ambari server.
Because of that we see strange behavior, such as 5 dead DataNodes on the Ambari dashboard, while the DataNode machines are in fact healthy.
Is it possible to set a more tolerant value in the ambari-agent configuration, so that the ack between the ambari-agent and the Ambari server is given a little more time and the network problems are ignored?
Something like a timeout or connection-time setting between the ambari-agent and the Ambari server.
First of all, you need to find the root cause of why the DataNode is showing as dead.
The Ambari agent runs on every node. It is responsible for sending metrics and a heartbeat to the Ambari server, which then publishes them to your Ambari web UI.
The NameNode waits 10 minutes before it declares a DataNode dead and copies its blocks to other DataNodes.
If a DataNode is shown as dead, check the Ambari agent status on that specific node by running service ambari-agent status. In parallel, check ambari-agent.log on the worker node to see why the Ambari agent stopped working.
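A sketch of those checks on an affected worker (the log path is the standard ambari-agent default):
service ambari-agent status
tail -n 200 /var/log/ambari-agent/ambari-agent.log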
You can configure the HTTP timeouts the ambari-agent uses for service tasks:
https://github.com/apache/ambari/blob/trunk/ambari-agent/conf/unix/ambari-agent.ini
There is an HTTP timeout section you can tune based on your network throughput.
On the nodes the file lives at /etc/ambari-agent/conf/ambari-agent.ini.
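A sketch of the kind of settings involved (property names differ between Ambari versions, so treat these as assumptions and compare against the linked file for your release):
[server]
hostname=ambari-server.example.com
url_port=8440
secured_url_port=8441
connect_retry_delay=10
max_reconnect_retry_delay=30
Larger retry delays make the agent more tolerant of transient connection failures to the server.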

Remote workers on multiple servers with python-rq (redis)

I am using redis and python-rq to manage a data processing task. I wish to distribute the data processing across multiple servers (each server would manage several rq workers) but I would like to keep a unique queue on a master server.
Is there a way to achieve this using python-rq?
Thank you.
It turned out to be easy enough. There are two steps:
1) Configure Redis on the master machine so that it is open to external connections from the remote "agent" servers. This is done by editing the bind setting, as explained in this post. Make sure to set a password if you set the bind value to 0.0.0.0, as this opens the Redis connection to anyone.
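A sketch of the two relevant lines in redis.conf (the path and password are assumptions; narrow 0.0.0.0 to your subnet where possible, and restart Redis afterwards):
# /etc/redis/redis.conf
bind 0.0.0.0
requirepass your_master_redis_password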
2) Start the worker on the remote "agent" server using the url parameter:
rq worker --url redis://:[your_master_redis_password]@[your_master_server_IP_address]
On the master server, you can check that the connection was properly made by typing:
rq info --url redis://:[your_master_redis_password]@localhost
If you enabled the localhost binding, this should display all the workers available to Redis from your "master", including the new worker you created on the remote server.

JMeter distributed load testing: no results returned from slave to master in GUI mode

I am not getting any results from the slave machine, and there is no entry in the DB either.
1. Connectivity between master and slave is established, since when I run the test remotely from the master, the slave states 'Start the test' and 'Finish the test'.
2. The script also executes successfully when run standalone on either the master or the slave.
3. Since the server has a dynamic IP, I am not able to provide a fixed IP and port.
I am not able to work out what exactly the problem is. If you can check via TeamViewer, please guide me further when you get some time.
[screenshot: slave machine]
[screenshot: master machine]
Make sure both master and slave are running the same Java version
Make sure both master and slave are running the same JMeter version (also consider upgrading to latest JMeter version - JMeter 3.3 as of now)
If you use any JMeter Plugins or there are any libraries in JMeter Classpath make sure to copy them over to slave machines as well.
Check jmeter-server.log on the slave side
If you want to see request and response details in the GUI, add the next line to the user.properties file on all slaves:
mode=Standard
Check out the How to Load Test Opening a URL on a Remote Machine with JMeter article for a comprehensive explanation and step-by-step instructions.
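For the first two checks above, a quick version-parity sketch to run on both master and slave:
java -version
jmeter --version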

Enterprise Jenkins HA plugin not working as it should

I've been trying to set up Enterprise Jenkins with the High Availability setup. The current setup consists of two Jenkins masters sharing the same Jenkins home, say master1 and master2, and an installation of the jenkins-ha-monitor-1.1-1.1 rpm on both of these masters, say monitor1 and monitor2. With this setup, according to the documentation at least, the HA plugin should work as expected. The promotion and demotion scripts are similar to the ones in the documentation (only the IP and interface differ, same approach), i.e.
For demotion
ifconfig eth0:2 down
For promotion
ifconfig eth0:2 the.floating.ip
Now, for the nodes to get registered correctly, I have to start master1, master2, monitor1 and monitor2 in that order. Tailing the logs, I see that when the services are started in that order they are registered correctly by both monitor services as nodes in a cluster, and in the HA status GUI in the Jenkins console.
Now, when master1 is killed by sending it a KILL signal, monitor2 recognizes this and runs the promotion script. But monitor1 keeps throwing:
Oct 24, 2012 3:47:36 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
Oct 24, 2012 3:47:39 PM com.cloudbees.jenkins.ha.singleton.HASingleton$3 suspect
INFO: Suspecting a node failure in a cluster: jenkins-master-1-285
continuously, without ever running the demotion script. Since master2 has taken up the floating IP via its promotion script, and master1 still has that IP because the demotion script was never run, the setup ends up with two boxes claiming the same IP. Moreover, restarting master1 does not help: master1 does not get added to the cluster as a secondary node, monitor1 keeps writing the above messages to the log, the floating IP keeps returning "Unable to connect", and master2 and monitor2 show the cluster as master2, monitor2 and monitor1. So my question/problem is twofold: why isn't master1 accepted back into the cluster? And why isn't the demotion script run as it should be?
Also, FYI, I have tried to do a
service jenkins stop
and in that case the demotion script runs but again there are similar issues when
service jenkins start
is run on the master that was stopped earlier, since the promotion script is run regardless of whether a primary Jenkins exists. In this case the two monitors register different clusters, like so: monitor1: master1, monitor1 and monitor2: master2, monitor2.
Running ifconfig shows that both masters have taken up the floating IP at this point.
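To confirm the duplicate-IP state, something like the following can be run (a sketch; arping, the eth0 interface name, and running the second command from a third box are assumptions about the environment):
ip addr show eth0
arping -c 3 the.floating.ip
Replies from two different MAC addresses in the arping output mean both masters are answering for the IP.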
Any help is appreciated! Thanks!
Still under investigation with support. The originally reported problem (here) suggests that the two nodes are communicating fine, but promotions/demotions are not run correctly—either a bug in JGroups or in its usage in Jenkins high availability.
But further tests turned up problems with UDP multicast communication, which has been reported for RedHat/CentOS hosts. Work is underway to offer an alternate JGroups stack which does not rely on multicast (or UDP) at all, using the shared $JENKINS_HOME directory to register Jenkins and monitor instances (as TCP address:port records).