How to implement Kafka Connect with my own message format - amazon-s3

I have a Kafka topic that contains binary messages (byte arrays).
I would like to write these messages to S3 in Parquet format.
I tried to use Kafka Connect and struggled with the configuration.
My records also carry some Kafka headers that need to be written to Parquet as well.
What is the right configuration in this case?
It's not Avro and not JSON.
Can I write the byte array as-is to the Parquet file without serializing it?
Thanks
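For reference, a minimal sketch (standalone .properties style) of the general shape of such a sink configuration follows; the connector name, topic, bucket, and region are placeholders, and the comments point out the core problem with raw bytes:

# A sketch only: connector name, topic, bucket, and region are placeholders.
name=s3-parquet-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=my-binary-topic
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.parquet.ParquetFormat
flush.size=1000
# ByteArrayConverter passes the bytes through with no Connect schema, and
# ParquetFormat needs structured, schema-aware records (e.g. via the Avro
# converter), so raw byte arrays cannot be written as Parquet as-is; a
# schema-aware converter or a custom SMT/converter that attaches a schema
# is needed, and headers additionally require an SMT (see the related
# questions below).
value.converter=org.apache.kafka.connect.converters.ByteArrayConverter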

Related

Proper recovery of KSQL target stream to preserve offset for Connector to avoid duplicate records

We recently adopted Kafka Streams via KSQL to convert our source JSON topics into Avro so that our S3 Sink Connector can store the data in Parquet format in their respective buckets.
Our Kafka cluster was taken down over the weekend, and we've noticed that some of our target streams (Avro) have no data, yet all of our source streams do (checked via PRINT 'topic_name'; in KSQL).
I know that I can drop the target stream and recreate it, but will that lose the offset and duplicate records in our sink?
Also, I know that if I recreate the target stream with the same topic name, I may run into the "topic already exists, with different partition/offset.." error, so I am hesitant to try this.
So what is the best way to recreate/recover our target streams such that we preserve the topic name and offsets for our Sink Connector?
Thanks.
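For context, the JSON-to-Avro conversion described above is typically set up with a CREATE STREAM ... AS SELECT along these lines; the stream, topic, and column names here are purely illustrative:

-- Illustrative only: stream, topic, and column names are made up for this sketch.
CREATE STREAM source_json (id VARCHAR, payload VARCHAR)
  WITH (KAFKA_TOPIC='source_topic', VALUE_FORMAT='JSON');

-- CSAS statement that re-serializes the records as Avro onto a target topic,
-- which the S3 Sink Connector then reads.
CREATE STREAM target_avro
  WITH (KAFKA_TOPIC='target_topic', VALUE_FORMAT='AVRO') AS
  SELECT * FROM source_json;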

How to parse record headers in Kafka Connect S3?

I use Kafka Connect S3 Sink and it only writes the record's value to S3. I want to incorporate some of the record's headers into the final payload that is written to S3.
How can I do it?
You would need to use a Single Message Transform (SMT) to intercept the records, unpack the headers, and "move" them into the value section of the record object.
In the source code of the Kafka Connect S3 sink you can see that only the record value is indeed written.
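No such transform ships with the S3 sink itself, but as an illustration only, a custom SMT along the following lines could copy the headers into the value. The class name and the hdr_ field prefix are made up for this sketch, and it assumes the record value is a Connect Struct:

package com.example.smt;

import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import org.apache.kafka.connect.header.Header;
import org.apache.kafka.connect.transforms.Transformation;

public class HeadersToValue<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    public R apply(R record) {
        if (!(record.value() instanceof Struct)) {
            return record; // this sketch only handles Struct values
        }
        Struct value = (Struct) record.value();

        // Rebuild the value schema with one extra optional string field per header.
        SchemaBuilder builder = SchemaBuilder.struct();
        for (Field field : value.schema().fields()) {
            builder.field(field.name(), field.schema());
        }
        for (Header header : record.headers()) {
            builder.field("hdr_" + header.key(), Schema.OPTIONAL_STRING_SCHEMA);
        }
        Schema newSchema = builder.build();

        // Copy the existing fields, then the (stringified) headers, into the new Struct.
        Struct newValue = new Struct(newSchema);
        for (Field field : value.schema().fields()) {
            newValue.put(field.name(), value.get(field));
        }
        for (Header header : record.headers()) {
            Object headerValue = header.value();
            newValue.put("hdr_" + header.key(),
                headerValue == null ? null : headerValue.toString());
        }

        return record.newRecord(record.topic(), record.kafkaPartition(),
            record.keySchema(), record.key(), newSchema, newValue, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef(); // nothing configurable in this sketch
    }

    @Override
    public void configure(Map<String, ?> configs) {
        // nothing to configure
    }

    @Override
    public void close() {
        // no resources to release
    }
}

The transform would then be registered on the connector with transforms=headersToValue and transforms.headersToValue.type=com.example.smt.HeadersToValue (both names hypothetical).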

How to tag S3 bucket objects using Kafka connect s3 sink connector

Is there any way we can tag the objects written to S3 buckets through the Kafka Connect S3 sink connector?
I am reading messages from Kafka and writing Avro files to an S3 bucket using the S3 sink connector. When the files are written to the S3 bucket, I need to tag them.
There is a method called addTags() inside the connector's source code on GitHub, but it is private and not exposed to the connector client, apart from a small config option called S3_OBJECT_TAGGING_CONFIG which allows you to tag S3 objects with the start/end offsets as well as the record count:
configDef.define(
    S3_OBJECT_TAGGING_CONFIG,
    Type.BOOLEAN,
    S3_OBJECT_TAGGING_DEFAULT,
    Importance.LOW,
    "Tag S3 objects with start and end offsets, as well as record count.",
    group,
    ++orderInGroup,
    Width.LONG,
    "S3 Object Tagging"
);
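For completeness, enabling that built-in tagging is just a matter of setting the corresponding property (s3.object.tagging, per the connector source) in the sink configuration; a minimal sketch with placeholder names:

# Sketch only: connector name, topic, bucket, and region are placeholders.
name=s3-sink-with-tagging
connector.class=io.confluent.connect.s3.S3SinkConnector
topics=my-topic
s3.bucket.name=my-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.avro.AvroFormat
flush.size=1000
# Adds the built-in start/end offset and record count tags described above.
s3.object.tagging=true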
If you want to add other/custom tags, then the answer is no, you cannot do it right now.
A useful feature would be to take the tags from a predefined part of the input document in Kafka, but this is not available right now.

Convert Avro in Kafka to Parquet directly into S3

I have topics in Kafka that are stored in Avro format. I would like to consume an entire topic (which at the time of receipt will not change any messages) and convert it to Parquet, saving directly to S3.
I currently do this, but it requires consuming the messages from Kafka one at a time, processing them on a local machine, converting them to a Parquet file, and, once the entire topic is consumed and the Parquet file completely written, closing the writing process and then initiating an S3 multipart file upload. Or, for short: | Avro in Kafka -> convert to parquet on local -> copy file to S3 |.
What I'd like to do instead is | Avro in Kafka -> parquet in S3 |.
One of the caveats is that the Kafka topic name isn't static; it needs to be fed in as an argument, used once, and then never used again.
I've looked into Alpakka and it seems like it might be possible - but it's unclear; I haven't seen any examples. Any suggestions?
You just described Kafka Connect :)
Kafka Connect is part of Apache Kafka, and you can use it with the S3 connector plugin. Although, at the moment, the development of Parquet support is still in progress.
For a primer in Kafka Connect see http://rmoff.dev/ksldn19-kafka-connect
Try adding "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat" to your PUT request when you set up your connector.
You can find more details here.
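For example, putting that together, a PUT to the Kafka Connect REST API would look roughly like this; the connector name, topic, bucket, region, and Schema Registry URL are placeholders:

# Sketch only: endpoint, connector name, topic, bucket, and URLs are placeholders.
curl -X PUT -H "Content-Type: application/json" \
  http://localhost:8083/connectors/s3-parquet-sink/config \
  -d '{
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "my-avro-topic",
    "s3.bucket.name": "my-bucket",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "flush.size": "1000",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://localhost:8081"
  }'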

NiFi: The incoming flow file can not be read as an Avro file

I just started with NiFi 1.4.
I am trying to send a pipe-delimited message via Kafka into Hive, so I am using the ConsumeKafkaRecord_0_10 and PutHiveStreaming processors. ConsumeKafkaRecord sends data on success to PutHiveStreaming.
ConsumeKafkaRecord writes the data in Avro format, but PutHiveStreaming gives the error:
"The incoming flow file can not be read as an Avro file: java.io.IOException: Not a data file."
PutHiveStreaming can only read Avro datafiles, so you have to make sure the writer used by ConsumeKafkaRecord is an AvroRecordSetWriter with Schema Write Strategy set to Embedded Schema.
If the schema isn't embedded, then by the time the flow file reaches the Hive processor it won't be a valid Avro datafile.