How to configure Apache Hadoop Vaidya

I'm looking to tune my Hadoop MapReduce jobs for better performance with optimal resource utilization, but I'm unable to get started. Can anyone tell me how to configure Apache Hadoop Vaidya? I was following the Apache blog post on Hadoop Vaidya, which describes very well how to use it.
In one blog I found the path
$HADOOP_HOME/contrib/vaidya/bin/
which is not present on my machine, so I'm assuming that I have to install/configure Apache Hadoop Vaidya myself.
Any help will be appreciated!!

Related

apache flink write data to separate hive cluster

With Apache Flink, is it possible to write to a Hive cluster such that the cluster is able to distribute the data among its nodes?
The example described here seems to indicate that the data is written to an HDFS on the Apache Flink node itself. But what options exist if you intend to have the HDFS on a separate cluster and not on the Flink worker nodes?
Please bear with me, I am totally new to this topic and I could have something conceptually completely wrong.
Yes, you can read from and write to Hive using Flink. There's an overview available at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/hive/hive_read_write/
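As a rough sketch of what that looks like (the catalog name, database, table names, and hive-conf directory below are placeholders, and the Flink Hive connector dependencies are assumed to be on the classpath), you register a HiveCatalog that points at the Hive cluster's hive-site.xml; as long as that configuration describes the remote metastore and its HDFS, the data is written into that cluster's warehouse and distributed by that cluster, not by the Flink workers.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveWriteSketch {
    public static void main(String[] args) {
        // Table API environment in batch mode; streaming inserts work similarly.
        TableEnvironment tableEnv =
                TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // "myhive", "default" and the conf dir are placeholders. The hive-site.xml
        // in this directory points at the remote metastore/HDFS of the Hive cluster.
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/etc/hive/conf");
        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // The INSERT is planned by Flink, but the files land in the Hive
        // cluster's warehouse and are distributed by that cluster's HDFS.
        tableEnv.executeSql(
                "INSERT INTO some_hive_table SELECT id, name FROM some_source_table");
    }
}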

What Amazon Web Services *native* offering is closest to Apache Kudu?

I am looking for a native offering, such as any of the RDS solutions, ElastiCache, or Amazon Redshift, not something that I would have to host myself.
From the Apache Kudu website, https://kudu.apache.org/ :
Kudu provides a combination of fast inserts/updates and efficient columnar
scans to enable multiple real-time analytic workloads across a single storage
layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the
flexibility to address a wider variety of use cases without exotic workarounds.
As I understand it, Kudu is a columnar distributed storage engine for tabular data that allows for fast scans and ad-hoc analytical queries but ALSO allows for random updates and inserts. Every table has a primary key that you can use to find and update single records...
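To make that concrete, here is a minimal Java sketch against the Kudu client API, with a made-up master address and made-up table and column names; the point is the primary-key-based upsert, which lets you update a single record in place rather than rewriting files as you would on plain HDFS.

import org.apache.kudu.client.KuduClient;
import org.apache.kudu.client.KuduException;
import org.apache.kudu.client.KuduSession;
import org.apache.kudu.client.KuduTable;
import org.apache.kudu.client.PartialRow;
import org.apache.kudu.client.Upsert;

public class KuduUpsertSketch {
    public static void main(String[] args) throws KuduException {
        // Placeholder master address and table name.
        KuduClient client =
                new KuduClient.KuduClientBuilder("kudu-master:7051").build();
        try {
            KuduTable table = client.openTable("metrics");
            KuduSession session = client.newSession();

            // The upsert is keyed on the table's primary key ("host" here),
            // so re-running it updates the existing row in place.
            Upsert upsert = table.newUpsert();
            PartialRow row = upsert.getRow();
            row.addString("host", "web-01");       // primary key column
            row.addLong("bytes_sent", 123456L);    // regular column
            session.apply(upsert);
            session.flush();
        } finally {
            client.close();
        }
    }
}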
Second answer, after the question was revised.
The answer is Amazon EMR running Apache Kudu.
Amazon EMR is Amazon's service for Hadoop. Apache Kudu is one of the many packages that you can install on Hadoop to process "Big Data".
If you are looking for a managed service for only Apache Kudu, then there is nothing. Apache Kudu is an open source tool that sits on top of Hadoop and is a companion to Apache Impala. On AWS both require Amazon EMR running Hadoop version 2.x or greater.

Flume Distribution

From the Apache Flume 1.6 official website, I find that Flume is distributed. But the master-slave architecture has been deprecated since Flume 1.x. How does Flume distribute the work? I have Flume installed on two hosts via Hortonworks. I feel like I'm running 2 separate Flume agents on two different machines. Can someone help me understand its distribution? What if I have an HttpAgent on Host A and Host B with the same source and sink configured? Will there be replication of the input data, or does Flume coordinate? Kindly correct my understanding.
Flume has a kind of "do it yourself" distribution model. Check the documentation for some details and examples.
What if I have an HttpAgent on Host A and Host B with the same source and sink configured?
It will replicate the data unless your source provides "exactly once" delivery semantics.
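As a sketch of what that looks like in practice, the hypothetical agent configuration below (all names and paths are made up) could be deployed unchanged on Host A and Host B. The two agents do not coordinate with each other at all; each one independently forwards whatever its own HTTP source receives, so sending the same input to both hosts yields two copies at the sink unless something upstream or downstream deduplicates.

# Hypothetical agent "a1", deployed identically on Host A and Host B.
a1.sources = http-in
a1.channels = mem
a1.sinks = hdfs-out

a1.sources.http-in.type = http
a1.sources.http-in.port = 8080
a1.sources.http-in.channels = mem

a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000

a1.sinks.hdfs-out.type = hdfs
a1.sinks.hdfs-out.hdfs.path = /flume/events
a1.sinks.hdfs-out.channel = mem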

Does Apache Drill have any negative influence on the other members of the Hadoop ecosystem in an existing Hadoop cluster?

If I deploy Apache Drill in an existing Hadoop cluster, does Apache Drill have any negative influence on the other members of the Hadoop ecosystem in that cluster?
It won't have any negative impact on the other members of the ecosystem, but it will hog a lot of the node's memory. Make sure you have enough memory before installing Drill.
It's a Hadoop-compatible component. The influence only happens if you don't provide enough resources for all the members to operate.
You can co-locate Drill with HDFS on the same node or in the same cluster to get the best performance.

Using Kubernetes or Apache Mesos

We have a product which is described in some Dockerfiles, which can create the necessary Docker containers. Some containers will just run basic apps, while other containers will run clusters (Hadoop).
Now the question is which cluster manager I need to use:
Kubernetes, Apache Mesos, or both?
I read that Kubernetes is good for 100% containerized environments, while Apache Mesos is better for environments that are partly containerized and partly not. But Apache Mesos is better for running Hadoop in Docker (?).
Our environment is composed only of Docker containers, but some run a Hadoop cluster and some run apps.
What would be the best?
Both functionally do the same thing, orchestrate Docker containers, but they obviously do it in different ways, and what you can easily achieve with one might prove difficult in the other, and vice versa.
Mesos has a higher complexity and learning curve, in my opinion. Kubernetes is relatively simpler and easier to grasp. You can literally spawn your own Kube master and minions by running one command and specifying the provider: Vagrant, AWS, etc. Kubernetes can also be integrated with Mesos, so there is also the possibility of trying both.
For the Hadoop-specific use case you mention, Mesos might have an edge: it might integrate better into the Apache ecosystem, since Mesos and Spark were created by the same minds.
Final thoughts: start with Kube, progressively exploring how to make it work for your use case. Then, after you have a good grasp on it, do the same with Mesos. You might end up liking pieces of each and you can have them coexist, or find that Kube is enough for what you need.