Flume Distribution - apache

From the Apache Flume 1.6 official website, I see that Flume is distributed. But the master-slave architecture has been deprecated since Flume 1.x. How does Flume distribute the work? I have Flume installed on two hosts via Hortonworks. It feels like I am running two separate Flume agents on two different machines. Can someone help me understand its distribution model? What if I have an HttpAgent on Host A and Host B with the same source and sink configured? Will the input data be replicated, or does Flume coordinate? Kindly correct my understanding.

Flume's distribution model is mostly "do it yourself": you design the topology and each agent runs independently. Check the documentation for details and examples.
What if I have a HttpAgent on Host A and Host B with same source and sink configured?
It will replicate the data unless your source has "exactly once" delivery semantics.
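For illustration, each host runs its own self-contained agent defined by a local properties file; nothing is shared or coordinated between Host A and Host B. The agent name, port, and HDFS path below are made-up values:

# httpagent.conf, deployed identically on Host A and Host B
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# HTTP source: this agent listens on its own port; the other host's agent is unaware of it
a1.sources.r1.type = http
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 5140
a1.sources.r1.channels = c1

# Channel local to this agent
a1.channels.c1.type = memory

# HDFS sink: if a client posts the same event to both hosts, it lands in HDFS twice
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events
a1.sinks.k1.channel = c1

Start it on each host with bin/flume-ng agent --conf conf --conf-file httpagent.conf --name a1. Distribution in Flume means chaining such agents yourself, e.g. an Avro sink on one host feeding an Avro source on another.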

Related

Kafka S3 Sink basic doubts

Do I really need to use Confluent (the CLI, maybe)? Can I write my own custom connector?
How can I write my first Kafka sink? How do I deploy it?
For now, let's assume we have the following details:
Topic: curious.topic
S3 bucket name: curious.s3
Data in the topic: Text/String
My OS: Mac
You start at the documentation for the S3 Sink, look over the configuration properties, and understand how to run Connect itself and how to deploy any connector (use the REST API); no, the confluent CLI is never needed.
You don't need to "write your own sink" because Confluent already has an S3 Sink Connector. Sure, you could fork their open-source repo and compile it yourself, but that doesn't seem to be what you're asking.
You can download the connector with the separate confluent-hub command.
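For example, after installing the connector with confluent-hub install confluentinc/kafka-connect-s3:latest, a minimal config for your topic and bucket could look like the sketch below. The connector name, region, flush size, and Connect host are assumptions on my part; since your data is plain text, the byte-array converter/format is used so records are written as-is:

{
  "name": "curious-s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "curious.topic",
    "s3.bucket.name": "curious.s3",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.bytearray.ByteArrayFormat",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "flush.size": "100"
  }
}

Save it as s3-sink.json and deploy it through the Connect REST API:

curl -X POST -H "Content-Type: application/json" --data @s3-sink.json http://localhost:8083/connectors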
Note: pinterest/secor does the same thing, without Kafka Connect.

apache flink write data to separate hive cluster

With Apache Flink, is it possible to write to a Hive cluster such that the cluster is able to distribute the data among its nodes?
The example described here seems to indicate that the data is written to an HDFS on the Apache Flink node itself. But what options exist if you intend to have the HDFS on a separate cluster and not on the Flink worker nodes?
Please bear with me; I am totally new to this topic and may have something conceptually completely wrong.
Yes, you can read from and write to Hive using Flink. There's an overview available at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/hive/hive_read_write/
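As a rough sketch of the "separate cluster" part: register a HiveCatalog whose hive-conf-dir points at a hive-site.xml copied from the remote cluster, so the metastore and the table locations it stores reference that cluster rather than the Flink workers. The catalog, database, table, and path names below are made up:

-- register a catalog backed by the remote cluster's Hive metastore
CREATE CATALOG remote_hive WITH (
  'type' = 'hive',
  'default-database' = 'default',
  'hive-conf-dir' = '/etc/remote-hive-conf'
);

USE CATALOG remote_hive;

-- table locations come from the remote metastore, so data lands on that cluster's HDFS
INSERT INTO events_hive SELECT id, payload FROM events_source;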

Kafka HBase Sink Connector unable to deliver its messages to HBase

I have a particular Kafka HBase Sink Connector problem for which I will appreciate any advice or suggestions.
It is a 3-node Kafka cluster: 2 nodes for connect-distributed and 1 node for the schema registry + Kafka Streams. The Kafka version is 0.10.1 and it is part of the Hortonworks platform 2.6.3. There are SSL and Kerberos authentication settings as well. On top of it I have a custom Kafka application that receives messages, processes them via Kafka Streams, and delivers them to HBase.
The process model is:
1) Input topic;
2) Processing (in Kafka streaming);
3) Output topic;
4) HBase sink connector;
5) HBase.
The messages delivered to 1) are successfully transferred and processed up to and including step 3). Then, although the sink connector appears to work fine, no message is delivered to HBase.
That being said, I tested our custom application with unit tests that create an embedded Kafka cluster with its own basic settings, and the tests were successful. This quite likely indicates that the connectivity problem comes from some cluster setting(s).
For your information, I observed 3 specific things:
The standard console consumer is able to successfully consume the messages from the sink topic;
No consumer id is established for the sink connection;
The connection process starts successfully but stops for reasons that are not logged, and never reaches the WorkerSinkTask Java class, where the actual writing to HBase happens.
An additional important point is the whole SSL encryption and Kerberos authentication setup, which might be misconfigured.
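For reference, my understanding is that a distributed worker needs the security settings both at the top level and duplicated with the consumer. prefix, because the sink tasks read through an internal consumer. A sketch of what I mean is below, with placeholder paths and values rather than my real ones:

# connect-distributed.properties (placeholder values)
security.protocol=SASL_SSL
sasl.kerberos.service.name=kafka
ssl.truststore.location=/etc/security/kafka.truststore.jks
ssl.truststore.password=changeit

# same settings repeated for the connectors' consumers
consumer.security.protocol=SASL_SSL
consumer.sasl.kerberos.service.name=kafka
consumer.ssl.truststore.location=/etc/security/kafka.truststore.jks
consumer.ssl.truststore.password=changeit

The JAAS/keytab configuration is passed to the worker JVM via KAFKA_OPTS (-Djava.security.auth.login.config=...). If the consumer.-prefixed copies were missing, the worker would start cleanly but the sink's consumer would never join a group, which would match what I observe.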
In case anyone has faced such an issue, I will greatly appreciate any comments that could be of help.
Dimitar

Transfer messages of different topics to hdfs by kafka-connect-hdfs

I want to transfer data from Kafka to HDFS with Confluent, and I have successfully done the experiments from the quickstart in CLI mode.
Now I intend to deploy the Confluent platform in a production environment. Is there a detailed tutorial about distributed deployment?
Also, there are many topics in Kafka, such as register_info, video_play_info, video_like_info, video_repost_info, etc.
I need to process the messages with different converters and transfer them to different Hive tables.
What should I do?
I need to process the messages with different converters and transfer them to different Hive tables
Run bin/connect-distributed etc/kafka/connect-distributed.properties
Create individual JSON files for each HDFS Connector
POST them to the REST endpoint of Kafka Connect
Distributed mode is documented here
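For example, a per-topic connector JSON (steps 2 and 3) could look like the sketch below; the connector name, HDFS and metastore URLs, converter, and flush size are illustrative assumptions, while the topic name comes from the question:

{
  "name": "hdfs-sink-register-info",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "register_info",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://hive-metastore:9083",
    "hive.database": "default",
    "schema.compatibility": "BACKWARD"
  }
}

curl -X POST -H "Content-Type: application/json" --data @hdfs-sink-register-info.json http://connect-host:8083/connectors

Repeat with a separate JSON (and different converter/format settings as needed) for video_play_info, video_like_info, and so on; each connector then feeds its own Hive table.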

How to configure Apache Hadoop Vaidya

I'm looking to tune my Hadoop MapReduce jobs for better performance with optimal resource utilization, but I'm unable to get started. Can anyone tell me how to configure Apache Hadoop Vaidya? I was following the Apache blog for Hadoop Vaidya, which describes very well how to use it.
In some blog I found the path
$HADOOP_HOME/contrib/vaidya/bin/
which is not present on my machine, so I'm assuming that I have to install/configure Apache Hadoop Vaidya.
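For context, the usage I am trying to reproduce looks roughly like this, based on my reading of the Hadoop 1.x Vaidya docs (the flags and paths here are only a sketch and may differ between versions):

$HADOOP_HOME/contrib/vaidya/bin/vaidya.sh \
  -jobconf file:///path/to/job_conf.xml \
  -joblog file:///path/to/job_history_log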
Any help will be appreciated!!