Transfer messages from different topics to HDFS with kafka-connect-hdfs - hive

I want to transfer data from Kafka to HDFS using Confluent, and I have successfully run the quickstart experiments in CLI mode.
Now I intend to deploy the Confluent Platform in a production environment. Is there any detailed tutorial about distributed deployment?
Also, there are many topics in Kafka, such as register_info, video_play_info, video_like_info, video_repost_info, etc.
I need to process the messages with different converters and load them into different Hive tables.
What should I do?

I need to process messages with different converters, and transfer them to different Hive tables
Run bin/connect-distributed etc/kafka/connect-distributed.properties
Create an individual JSON file for each HDFS connector
POST them to the REST endpoint of Kafka Connect (see the sketch below)
Distributed mode is documented here
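For example, the JSON for one of the connectors might look roughly like this sketch (the connector name, topic, URLs, database, and converter are placeholders for illustration, not values taken from the question; kafka-connect-hdfs also requires flush.size, and Hive integration expects a non-NONE schema.compatibility):

    {
      "name": "hdfs-sink-video-play-info",
      "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "tasks.max": "3",
        "topics": "video_play_info",
        "hdfs.url": "hdfs://namenode:8020",
        "flush.size": "1000",
        "hive.integration": "true",
        "hive.metastore.uris": "thrift://hive-metastore:9083",
        "hive.database": "default",
        "schema.compatibility": "BACKWARD",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081"
      }
    }

Then POST it to any worker in the Connect cluster:

    curl -X POST -H "Content-Type: application/json" \
         --data @hdfs-sink-video-play-info.json \
         http://localhost:8083/connectors

Repeating this with a separate JSON file per topic lets each connector use its own converter and write to its own Hive table.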

Related

Send data from kafka to s3 using python

For my current project, I am working with Kafka in Python and wanted to know if there is any method by which I can send the streaming Kafka data to an AWS S3 bucket (without using Confluent). I am getting my source data from the Reddit API.
I also wanted to know whether Kafka + S3 is a good combination for storing data that will later be processed with PySpark, or whether I should skip the S3 step and read the data directly from Kafka.
Kafka S3 Connector doesn't require "using Confluent". It's completely free, open source and works with any Apache Kafka cluster.
Otherwise, sure, Spark or a plain Kafka Python consumer can write events to S3, but you haven't clearly explained what happens once the data is in S3, so maybe start by processing the data directly from Kafka.
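If you do end up writing to S3 from Python yourself, a minimal sketch with kafka-python and boto3 could look like the following (the topic name, bucket name, object-key scheme, and batch size are made-up placeholders; a real pipeline would also need error handling and offset management tuned to your delivery guarantees):

    import json

    import boto3
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "reddit-posts",                        # placeholder topic name
        bootstrap_servers="localhost:9092",
        group_id="s3-writer",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )
    s3 = boto3.client("s3")

    batch = []
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= 500:                  # arbitrary flush threshold
            # Use the last record's partition/offset to build a unique object key.
            key = f"reddit/{message.topic}/{message.partition}-{message.offset}.json"
            s3.put_object(
                Bucket="my-raw-data-bucket",   # placeholder bucket
                Key=key,
                Body=json.dumps(batch).encode("utf-8"),
            )
            batch = []

Whether S3 is worth keeping depends on what happens next: if PySpark jobs re-read the data many times, landing it in S3 is convenient; if you only process it once as a stream, reading directly from Kafka is simpler.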

apache flink write data to separate hive cluster

With Apache Flink, is it possible to write to a Hive cluster such that the cluster is able to distribute the data among its nodes?
The example described here seems to indicate that the data is written to an HDFS on the Apache Flink node itself. But what options exist if you intend to have the HDFS on a separate cluster and not on the Flink worker nodes?
Please bear with me, I am totally new to this topic and I may have gotten something conceptually completely wrong.
Yes, you can read from and write to Hive using Flink. There's an overview available at https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/connectors/table/hive/hive_read_write/
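The part that matters for a separate cluster is that Flink only needs network access to that cluster's Hive metastore and HDFS; the data does not have to live on the Flink worker nodes. A rough PyFlink sketch, assuming the Hive connector and Hadoop dependencies are on the classpath, and where /etc/remote-hive-conf (a placeholder path) contains the remote cluster's hive-site.xml and the table names are placeholders too:

    from pyflink.table import EnvironmentSettings, TableEnvironment
    from pyflink.table.catalog import HiveCatalog

    t_env = TableEnvironment.create(EnvironmentSettings.in_batch_mode())

    # The hive-site.xml in this directory (and the fs.defaultFS it points to)
    # determines where data is written, not the local disks of the Flink nodes.
    catalog = HiveCatalog("myhive", "default", "/etc/remote-hive-conf")
    t_env.register_catalog("myhive", catalog)
    t_env.use_catalog("myhive")

    # The insert goes through the metastore to the remote cluster's HDFS,
    # which then distributes the data across its own datanodes.
    t_env.execute_sql("INSERT INTO target_table SELECT * FROM source_table").wait()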

Scaling Kafka Connect to handle 10K S3 buckets

I want to load data from various S3 buckets (more than 10,000 buckets, each file around 20-50 MB) into Apache Kafka. The list of buckets is dynamic - buckets are added and removed at runtime. Ideally, each bucket configuration should have its own polling interval (how often to scan for new files - at least 60 seconds, but possibly much more) and priority (the number of files processed concurrently).
Note that setting up notifications from each of the S3 buckets to SQS/SNS/Lambda is not an option due to various IT policies in the organizations of each of the bucket owners.
Kafka Connect seems to be the most commonly used tool for such tasks, and its pluggable architecture will make it easier to add new sources in the future, so it fits well. Configuring each S3 bucket as its own connector lets me set a different number of tasks (which maps to priorities) and a different polling interval for each one. And building a custom Java Kafka Connect source task for my expected file format sounds reasonable.
However, the Kafka Connect code indicates that each running task is assigned its own thread for the lifetime of the task. So if I have 10K buckets, each configured with its own connector and with a single task, I will have 10K threads running in my Kafka Connect distributed worker pool. That's a lot of threads that are mostly just sleep()-ing.
What is the correct approach to scaling the number of tasks/connectors in Kafka Connect?
Kafka Connect is a framework that can run in standalone mode or as a distributed cluster. In distributed mode you build a Kafka Connect cluster out of several commodity servers, each hosting a Connect worker that can execute connector tasks; if you need more capacity, you add more servers hosting Connect workers.
Reading the S3 source connector documentation, I did not find a way to "whitelist" or use a regex to make a single connector read from multiple buckets...
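For context, each S3 source connector instance is configured against a single bucket, so a definition like the following has to be generated and POSTed once per bucket (property names follow Confluent's S3 source connector, but check the documentation of the exact connector and version you use; the bucket name and region are placeholders):

    {
      "name": "s3-source-bucket-0001",
      "config": {
        "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
        "tasks.max": "1",
        "s3.bucket.name": "bucket-0001",
        "s3.region": "us-east-1",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat"
      }
    }

One such connector per bucket is exactly the 10K-connector situation described in the question.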

How can I configure Redis as a Spring Cloud Dataflow Source?

I've searched for examples and I have not found any.
My intention is to use a Redis Stream as a source for Spring Cloud Dataflow and route messages to AWS Kinesis or S3 data sinks.
Redis is not listed as a Spring Cloud Dataflow source. Will I have to create a custom binder?
Redis only seems to be available as a sink, with Pub/Sub.
There used to be a redis-binder for Spring Cloud Stream, but that has been deprecated for a while now. We have plans to implement a binder for Redis Streams in the future, though.
That said, if you have data in Redis, it'd be good to start building a redis-source as a custom application. We have many suppliers/sources that you can use as a reference.
There's also a blog series currently in the works, which can provide further guidance when building custom applications.
Lastly, feel free to contribute the redis-supplier/source to the applications repo; we can collaborate on a pull request.

S3 connectors to connect with Kafka for streaming data from on-premise to cloud

I want to stream data from on-premise to the cloud (S3) using Kafka, for which I would need to install Kafka on the source machine and also in the cloud. But I don't want to install it in the cloud. I need some S3 connector through which I can connect with Kafka and stream data from on-premise to the cloud.
If your data is in Avro or JSON format (or can be converted to those formats), you can use the S3 sink connector for Kafka Connect. See Confluent's docs on that.
Should you want to move actual (bigger) files via Kafka, be aware that Kafka is designed for small messages and not for file transfers.
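A minimal sink configuration, run by a Connect worker next to the on-premise Kafka cluster, might look roughly like this (the bucket, region, topic, and flush size are placeholders):

    {
      "name": "s3-sink",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "tasks.max": "1",
        "topics": "my-onprem-topic",
        "s3.bucket.name": "my-cloud-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "flush.size": "1000"
      }
    }

Because the Connect worker runs on-premise and only pushes data out to the S3 API, nothing needs to be installed in the cloud.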
There is also a kafka-connect-s3 project from Spreadfast, consisting of both a sink and a source connector, which can handle text format. Unfortunately it is not really updated anymore, but it works nevertheless.