flink web.port can not be configured correctly in yarn mode - hadoop-yarn

I want to get flink's metrics information via REST api, my flink is managed by YARN, but after changing web.port configuration in flink-conf.yaml, the change has no affect, and the web.port in the flink dashboard is always 0. So I can not get the flink metrics information via REST api.
Environment:
ubuntu 16.04
openjdk-8
hadoop 2.7.1.2.3.6.0-3796
flink 1.4.0

When running Flink on Yarn, Flink will pick a random port (0) for the web UI in order to avoid port conflicts with other applications running on the same machine.
In order to access the Flink web UI you can query the Yarn web application proxy (YarnResourceManagerURL/proxy/application_/...). But be aware that only GET requests are properly forwarded to the Yarn application.
Alternatively, Flink logs the web UI URL to stdout when starting a Yarn session. Moreover, you can retrieve the chosen port from the log files. In newer versions (>= 1.5) Flink will log Rest endpoint listening at hostname:port on INFO level and in older versions (<= 1.4 or if using the legacy mode) Flink will log Web frontend listening at hostname:port.

Related

How to manually deploy Mule application package on the on-premises cluster?

I'm looking for the advice of how to manually (i.e. without using Runtime Manager - RM) deploy a mule application package on the on-premises Mule cluster. The official documentation suggests using the RM for the purpose either via the gui or cli or api. However, the RM is not available on our environment.
I can manually deploy the package on a single node by copying it to the /apps folder. But this way the application is only deployed on a single node, not on the cluster.
I've tried using the AMC agent rest API for the purpose with the same result - it only deploys on a single node.
So, what's the correct way of manually deploying a mule application on the Mule servers cluster without using Anypoint RM?
We are on Mule 4.4 EE.
Copy the application jar file into the apps directory of every node. Mule clusters do not transfer applications between nodes.
Alternatively ou can use the Runtime Manager Agent instead however it also works in a per node basis. You need to send the same request to each node to deploy.
Each connector may or may not be cluster aware. Read each connector documentation to understand how they behave. In particular the documentation of the VM connector states:
When running in cluster mode, persistent queues are instead backed by the memory grid. This means that when a Mule flow uses VM Connector to publish content to a queue, Mule runtime engine (Mule) decides whether to process that message in the same origin node or to send it out to the cluster to be picked up and processed by another node.
You can register the multiple nodes through AMC agent on the cloudhub control plane and create a server group and deploy code through control plain runtime manager it does the job of deployment to same app in n nodes

Monitoring rabbitmq (v 3.6.8) with prometheus

I have a challenge - build and release monitoring system in consist of RabbitMQ cluster (with 3 nodes) and standalone Grafana server for visualisation metrics.
I have found in official documentation of prometheus plugin for RabbitMQ (documentation) next section:
This plugin is new as of RabbitMQ 3.8.0.
But i have cluster of version 3.6.8 and when i run the next command
rabbitmq-plugins enable rabbitmq_prometheus
The output is:
Error: The following plugins could not be found:
rabbitmq_prometheus
Upgrade the cluster now is not possible and my question is:
How do i may configure monitoring of the cluster without upgrade it and with prometheus (preferred option) and grafana?
Thanks in advance!
The Prometheus plugin is not the only way to monitor a RabbitMQ cluster.
You can also use the rabbitmq exporter in sidecar. If you are not on a docker platform, you can download the exporter from the release assets and install it as a service somewhere.
It would be best to install the exporter on every server hosting the RabbitMQ node because:
you will need to have as many install as there are nodes (Prometheus is service oriented monitoring)
from the settings, the exporter is accessing the management plugin interface of RabbitMQ; it should stay bound to localhost to reduce attack surface
If your hands are really tied, you can deploy them anywhere (let say on the same server) and point each exporter to a different RabbitMQ node. Prometheus configuration can then identify the underlying service.
- job_name: rabbitmq
honor_labels: true
static_configs:
- targets: ['monitoring-server:97001']
labels:
instance: 'rabbitmq_node_A'
- targets: ['monitoring-server:97002']
labels:
instance: 'rabbitmq_node_B'
# or play with relabeling to acchieve the same.
An important drawback is that there are more cases where the exporter may not be able to access RabbitMQ and you end up alert on events not impacting your RabbitMQ cluster.

Why do we need Redis for running CKAN?

I was wondering why do we need Redis server for running CKAN.
If needed, why? And How do I configure it with CKAN?
p.s
I am running my ckan instance in RHEL7.
Update: Redis has been a requirement since CKAN 2.7, when a new system for asynchronous background jobs was introduced which relies on Redis. You can configure the Redis connection using the ckan.redis.url option.
Redis is not required for the current version of CKAN (2.6.2 at the time of this writing), it's not even mentioned in the CKAN 2.6.2 documentation.
However, the upcoming 2.7 release will require Redis for its new system of asynchronous background jobs. Redis will be configured using the new ckan.redis.url option.

Hadoop Cluster deployment using Apache Ambari

I have listed few queries related to ambari as follows:
Can I configure more than one hadoop cluster via UI of ambari ?
( Note : I am using Ambari 1.6.1 for my hadoop cluster deployment purpose and I am aware that this can be done via Ambari API, but not able to find via ambari portal)
We can check the status of services on each node by “jps” command, if we have configured hadoop cluster w/o ambari.
Is there any way similar to “jps” to check from back end if the setup for hadoop cluster was successful from the backend ?
( Note : I can see that services are showing UP on the ambari portal )
Help Appreciated !!
Please let me know if any additional information is required.
Thanks
The ambari UI is served up by the ambari server for a specific cluster. To configure another cluster you need to point your browser to the URL for that other cluster's ambari server. So you can't see the configuration for multiple servers on the same web page, but you can set browser bookmarks to jump from configuration to configuration.

What is yarn-client mode in Spark?

Apache Spark has recently updated the version to 0.8.1, in which yarn-client mode is available. My question is, what does yarn-client mode really mean? In the documentation it says:
With yarn-client mode, the application will be launched locally. Just like running application or spark-shell on Local / Mesos / Standalone mode. The launch method is also the similar with them, just make sure that when you need to specify a master url, use “yarn-client” instead
What does it mean "launched locally"? Locally where? On the Spark cluster?
What is the specific difference from the yarn-standalone mode?
So in spark you have two different components. There is the driver and the workers. In yarn-cluster mode the driver is running remotely on a data node and the workers are running on separate data nodes. In yarn-client mode the driver is on the machine that started the job and the workers are on the data nodes. In local mode the driver and workers are on the machine that started the job.
When you run .collect() the data from the worker nodes get pulled into the driver. It's basically where the final bit of processing happens.
For my self i have found yarn-cluster mode to be better when i'm at home on the vpn, but yarn-client mode is better when i'm running code from within the data center.
Yarn-client mode also means you tie up one less worker node for the driver.
A Spark application consists of a driver and one or many executors. The driver program is the main program (where you instantiate SparkContext), which coordinates the executors to run the Spark application. The executors run tasks assigned by the driver.
A YARN application has the following roles: yarn client, yarn application master and list of containers running on the node managers.
When Spark application runs on YARN, it has its own implementation of yarn client and yarn application master.
With those background, the major difference is where the driver program runs.
Yarn Standalone Mode: your driver program is running as a thread of the yarn application master, which itself runs on one of the node managers in the cluster. The Yarn client just pulls status from the application master. This mode is same as a mapreduce job, where the MR application master coordinates the containers to run the map/reduce tasks.
Yarn client mode: your driver program is running on the yarn client where you type the command to submit the spark application (may not be a machine in the yarn cluster). In this mode, although the drive program is running on the client machine, the tasks are executed on the executors in the node managers of the YARN cluster.
Reference: http://spark.incubator.apache.org/docs/latest/cluster-overview.html
A Spark application running in
yarn-client mode:
driver program runs in client machine or local machine where the application has been launched.
Resource allocation is done by YARN resource manager based on data locality on data nodes and driver program from local machine will control the executors on spark cluster (Node managers).
Please refer this cloudera article for more info.
The difference between standalone mode and yarn deployment mode,
Resource optimization won't be efficient in standalone mode.
In standalone mode, driver program launch an executor in every node of a cluster irrespective of data locality.
standalone is good for use case, where only your spark application is being executed and the cluster do not need to allocate resources for other jobs in efficient manner.
Both spark and yarn are distributed framework , but their roles are different:
Yarn is a resource management framework, for each application, it has following roles:
ApplicationMaster: resource management of a single application, including ask for/release resource from Yarn for the application and monitor.
Attempt: an attempt is just a normal process which does part of the whole job of the application. For example , a mapreduce job which consists of multiple mappers and reducers , each mapper and reducer is an Attempt.
A common process of summiting a application to yarn is:
The client submit the application request to yarn. In the
request, Yarn should know the ApplicationMaster class; For
SparkApplication, it is
org.apache.spark.deploy.yarn.ApplicationMaster,for MapReduce job ,
it is org.apache.hadoop.mapreduce.v2.app.MRAppMaster.
Yarn allocate some resource for the ApplicationMaster process and
start the ApplicationMaster process in one of the cluster nodes;
After ApplicationMaster starts, ApplicationMaster will request resource from Yarn for this Application and start up worker;
For Spark, the distributed computing framework, a computing job is divided into many small tasks and each Executor will be responsible for each task, the Driver will collect the result of all Executor tasks and get a global result. A spark application has only one driver with multiple executors.
So, then ,the problem comes when Spark is using Yarn as a resource management tool in a cluster:
In Yarn Cluster Mode, Spark client will submit spark application to
yarn, both Spark Driver and Spark Executor are under the supervision
of yarn. In yarn's perspective, Spark Driver and Spark Executor have
no difference, but normal java processes, namely an application
worker process. So, when the client process is gone , e.g. the client
process is terminated or killed, the Spark Application on yarn is
still running.
In yarn client mode, only the Spark Executor are under the
supervision of yarn. The Yarn ApplicationMaster will request resource
for just spark executor. The driver program is running in the client
process which have nothing to do with yarn, just a process submitting
application to yarn.So ,when the client leave, e.g. the client
process exits, the Driver is down and the computing terminated.
First of all, let's make clear what's the difference between running Spark in standalone mode and running Spark on a cluster manager (Mesos or YARN).
When running Spark in standalone mode, you have:
a Spark master node
some Spark slaves nodes, which have been "registered" with the Spark master
So:
the master node will execute the Spark driver sending tasks to the executors & will also perform any resource negotiation, which is quite basic. For example, by default each job will consume all the existing resources.
the slave nodes will run the Spark executors, running the tasks submitted to them from the driver.
When using a cluster manager (I will describe for YARN which is the most common case), you have :
A YARN Resource Manager (running constantly), which accepts requests for new applications and new resources (YARN containers)
Multiple YARN Node Managers (running constantly), which consist the pool of workers, where the Resource manager will allocate containers.
An Application Master (running for the duration of a YARN application), which is responsible for requesting containers from the Resource Manager and sending commands to the allocated containers.
Note that there are 2 modes in that case: cluster-mode and client-mode. In the client mode, which is the one you mentioned:
the Spark driver will be run in the machine, where the command is executed.
The Application Master will be run in an allocated Container in the cluster.
The Spark executors will be run in allocated containers.
The Spark driver will be responsible for instructing the Application Master to request resources & sending commands to the allocated containers, receiving their results and providing the results.
So, back to your questions:
What does it mean "launched locally"? Locally where? On the Spark
cluster?
Locally means in the server in which you are executing the command (which could be a spark-submit or a spark-shell). That means that you could possibly run it in the cluster's master node or you could also run it in a server outside the cluster (e.g. your laptop) as long as the appropriate configuration is in place, so that this server can communicate with the cluster and vice-versa.
What is the specific difference from the yarn-standalone mode?
As described above, the difference is that in the standalone mode, there is no cluster manager at all. A more elaborate analysis and categorisation of all the differences concretely for each mode is available in this article.
With yarn-client mode, your spark application is running in your local machine. With yarn-standalone mode, your spark application would be submitted to YARN's ResourceManager as yarn ApplicationMaster, and your application is running in a yarn node where ApplicationMaster is running.
In both case, yarn serve as spark's cluster manager. Your application(SparkContext) send tasks to yarn.