Writing events from Kafka to Hive in ORC format - hive

I am trying to use a Kafka connector to write data to Hive in ORC format from a Kafka bus.
The events on the bus are in Avro format. I need something like NiFi's ConvertAvroToORC
processor, but with Kafka Connect.

ORC is not currently supported by the HDFS Kafka Connect connector.
You're welcome to build the PR and try it out on your own.
https://github.com/confluentinc/kafka-connect-hdfs/pull/294

This has since been released: io.confluent.connect.hdfs.orc.OrcFormat is now included in the HDFS connector (io.confluent.connect.hdfs.HdfsSinkConnector). See https://docs.confluent.io/current/connect/kafka-connect-hdfs/configuration_options.html#hdfs-2-sink-connector-configuration-properties
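For illustration, a minimal sketch of such a connector config; the topic name, HDFS URL, metastore URI, Schema Registry URL and flush size below are placeholders I made up, not values from the question:

{
  "name": "hdfs-orc-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "format.class": "io.confluent.connect.hdfs.orc.OrcFormat",
    "topics": "my-avro-topic",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://hive-metastore:9083",
    "schema.compatibility": "BACKWARD",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

With hive.integration enabled the connector also creates and updates the Hive table for the topic; note that Hive integration requires schema.compatibility to be BACKWARD, FORWARD or FULL rather than NONE.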

Related

Send data from kafka to s3 using python

For my current project, I am working with Kafka (Python) and wanted to know if there is any method by which I can send the streaming Kafka data to an AWS S3 bucket (without using Confluent). I am getting my source data from the Reddit API.
I also wanted to know whether Kafka + S3 is a good combination for storing the data which will be processed using PySpark, or whether I should skip the S3 step and read data directly from Kafka.
Kafka S3 Connector doesn't require "using Confluent". It's completely free, open source and works with any Apache Kafka cluster.
Otherwise, sure, Spark or a plain Kafka Python consumer can write events to S3, but you haven't clearly explained what happens once the data is in S3, so maybe start by processing the data directly from Kafka.
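For what it's worth, a rough sketch of such an S3 sink connector config; the bucket, region, topic and flush size are placeholders, not values from the question:

{
  "name": "s3-sink",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "topics": "reddit-events",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "flush.size": "1000",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false"
  }
}

The Kafka Connect worker that runs this config ships with Apache Kafka itself; only the S3 connector plugin and your AWS credentials are needed on top of it.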

How to use BigQuery as sink in Apache Flink?

Is it possible to use the JDBC Connector to write a Flink DataStream to BigQuery, or are there any other options?
I'm new to Apache Flink; any suggestions/examples would be very helpful.
BigQuery is currently not supported as a JDBC dialect by Flink. An overview of the currently supported versions can be found at https://nightlies.apache.org/flink/flink-docs-master/docs/connectors/table/jdbc/
I'm not aware of a BigQuery sink being available. That implies that in order to write to BigQuery, you would have to create your own custom sink.

Kafka S3 Source Connector

I have a requirement where sources outside of our application will drop a file in an S3 bucket that we have to load into a Kafka topic. I am looking at Confluent's S3 Source connector and am currently working on defining the configuration for setting up the connector in our environment. But a couple of posts indicated that one can only use the S3 Source connector if the S3 Sink connector was used to drop the file in S3.
Is the above true? Where / what property do I use to define the output topic in the configuration? And can the messages be transformed when reading from S3 and putting them into the topic? Both will be JSON / Avro formats.
Confluent's Quick Start example also assumes you have used the S3 Sink connector, hence the question.
Thank you
I received a response from Confluent that it is true that the Confluent S3 Source connector can only be used with the Confluent S3 Sink connector. It cannot be used independently.
Confluent released version 2.0.0 on 2021-12-15. This version includes a generalized S3 source connector mode.
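For reference, a rough sketch of a 2.x-style config in that generalized mode. The bucket, topic and regex are placeholders, and the property names (in particular mode, topics.dir, topic.regex.list and confluent.topic.bootstrap.servers) come from my reading of the generalized-mode docs, so double-check them against the current S3 Source connector documentation:

{
  "name": "s3-source",
  "config": {
    "connector.class": "io.confluent.connect.s3.source.S3SourceConnector",
    "mode": "GENERIC",
    "topics.dir": "incoming",
    "topic.regex.list": "my-output-topic:.*\\.json",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "confluent.topic.bootstrap.servers": "broker:9092"
  }
}

As far as I understand, in this mode the destination topic comes from that topic-to-object-key mapping rather than from a topics property, and ordinary Connect Single Message Transforms can still be applied to the records on the way into the topic.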

Transfer messages of different topics to hdfs by kafka-connect-hdfs

I want to transfer data from Kafka to HDFS with Confluent, and I have successfully run the quickstart experiments in CLI mode.
Now I intend to deploy Confluent Platform in a production environment. Is there a detailed tutorial about distributed deployment?
Also, there are many topics in Kafka, such as register_info, video_play_info, video_like_info, video_repost_info, etc.
I need to process messages with different converters and transfer them to different Hive tables.
What should I do?
I need to process messages with different converters and transfer them to different Hive tables
Run bin/connect-distributed etc/kafka/connect-distributed.properties
Create an individual JSON file for each HDFS Connector (see the example below)
POST them to the REST endpoint of Kafka Connect
Distributed mode is documented here
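As a sketch, one of those per-topic JSON files could look roughly like this (hostnames, ports and the Hive database are placeholders), POSTed to the worker's REST endpoint, which listens on http://<connect-host>:8083/connectors by default:

{
  "name": "hdfs-sink-video_play_info",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "topics": "video_play_info",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "1000",
    "hive.integration": "true",
    "hive.metastore.uris": "thrift://hive-metastore:9083",
    "hive.database": "video",
    "schema.compatibility": "BACKWARD",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}

Repeat with a different name, topics, converter and hive.database for register_info, video_like_info and so on; converters can be overridden per connector, so each topic can use a different one.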

S3 connectors to connect with Kafka for streaming data from on-premise to cloud

I want to stream data from on-premise to the cloud (S3) using Kafka, for which I would need to install Kafka on the source machine and also in the cloud. But I don't want to install it in the cloud. I need some S3 connector through which I can connect to Kafka and stream data from on-premise to the cloud.
If your data is in Avro or JSON format (or can be converted to those formats), you can use the S3 connector for Kafka Connect. See Confluent's docs on that.
Should you want to move actual (bigger) files via Kafka, be aware that Kafka is designed for small messages and not for file transfers.
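Note that Kafka Connect itself runs wherever you start it, for example on the on-premise machine next to your brokers; only the bucket, region and AWS credentials point at the cloud, so nothing has to be installed on AWS. A minimal standalone worker file could look roughly like this (the broker address and paths are placeholders):

# worker.properties, running on the on-premise machine
bootstrap.servers=onprem-broker:9092
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false
offset.storage.file.filename=/tmp/connect.offsets
plugin.path=/opt/connectors
# AWS credentials are picked up from the environment (e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY)

# started together with an S3 sink connector properties file, e.g.:
# bin/connect-standalone worker.properties s3-sink.properties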
There is a kafka-connect-s3 project from Spreadfast consisting of both a sink and a source connector, which can handle text format. Unfortunately it is not really updated any more, but it works nevertheless.