How to parse record headers in Kafka Connect S3?

I use Kafka Connect S3 Sink and it only writes the record's value to S3. I want to incorporate some of the record's headers into the final payload that is written to S3.
How can I do it?

You would need to use a Single Message Transform (SMT) to intercept the records, unpack the headers, and "move" them into the value section of the record object.
In the source code of the Kafka Connect S3 sink you can see that only the record value is written.
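A minimal sketch of such a custom SMT, assuming the record value is a schemaless Map (for example, JsonConverter with schemas.enable=false); the class name, package, and "hdr_" prefix are illustrative, not something the connector ships with:

package com.example.smt;

import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.common.config.ConfigDef;
import org.apache.kafka.connect.connector.ConnectRecord;
import org.apache.kafka.connect.header.Header;
import org.apache.kafka.connect.transforms.Transformation;

// Hypothetical SMT that copies every record header into the value map.
public class HeadersToValue<R extends ConnectRecord<R>> implements Transformation<R> {

    @Override
    @SuppressWarnings("unchecked")
    public R apply(R record) {
        if (!(record.value() instanceof Map)) {
            return record; // pass records we don't understand through unchanged
        }
        Map<String, Object> value = new HashMap<>((Map<String, Object>) record.value());
        for (Header header : record.headers()) {
            value.put("hdr_" + header.key(), header.value()); // prefix to avoid field collisions
        }
        return record.newRecord(record.topic(), record.kafkaPartition(),
                record.keySchema(), record.key(), null, value, record.timestamp());
    }

    @Override
    public ConfigDef config() {
        return new ConfigDef();
    }

    @Override
    public void configure(Map<String, ?> configs) {
    }

    @Override
    public void close() {
    }
}

You would package the class as a jar on the Connect worker's plugin path and wire it into the sink with the standard transforms properties, e.g. transforms=headersToValue and transforms.headersToValue.type=com.example.smt.HeadersToValue (names assumed).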

Related

Proper recovery of KSQL target stream to preserve offset for Connector to avoid duplicate records

We recently adopted Kafka Streams via KSQL to convert our source JSON topics into Avro so that our S3 Sink Connector can store the data in Parquet format in the respective buckets.
Our Kafka cluster was taken down over the weekend, and we've noticed that some of our target streams (Avro) have no data, yet all of our source streams do (checked via print 'topic_name'; in KSQL).
I know that I can drop the target stream and recreate it, but will that lose the offsets and create duplicate records in our Sink?
Also, I know that if I recreate the target stream with the same topic name, I may run into the "topic already exists, with different partition/offset.." error, so I am hesitant to try this.
So what is the best way to recreate/recover our target streams such that we preserve the topic name and offset for our Sink Connector?
Thanks.

update s3 record based on kafka event

We have records in an S3 bucket (which get updated daily via a job). We need to listen to a Kafka stream/topic, and when a new event arrives in this stream, we need to update that particular record in S3.
Is this possible?
To my understanding, we would need to take a data dump from S3 (via Scala code or something) and write to it. IMO, this is not a practical approach.
Is there an efficient way to do it?
Unclear if you meant an S3 event or a Kafka event.
For S3 write events, you can use a Lambda to trigger some code to run, and Kafka does not need to be involved. It could be any language - Python and NodeJS seem to be the most popular for Lambdas.
For Kafka records, your consumer would see the event, then use an S3 client API to do whatever it needs with that data, such as writing or updating a file. Again, any language that supports the Kafka protocol can be used, not only Scala. After that, a Lambda could also react to the resulting S3 event.
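A rough sketch of the Kafka-records approach in Java, assuming string-valued records, the AWS SDK v2, and a placeholder topic, bucket, and key layout; it simply overwrites the S3 object that corresponds to the record key:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class S3UpdatingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder brokers
        props.put("group.id", "s3-updater");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             S3Client s3 = S3Client.create()) {
            consumer.subscribe(Collections.singletonList("record-updates")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Overwrite the S3 object addressed by the record key with the new payload.
                    s3.putObject(PutObjectRequest.builder()
                                    .bucket("my-bucket")
                                    .key("records/" + record.key() + ".json")
                                    .build(),
                            RequestBody.fromString(record.value()));
                }
            }
        }
    }
}

In practice you would add error handling, batching, and whatever offset-commit strategy matches your delivery guarantees.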

How to store just the latest version of a record in AWS S3 while writing data from a Kafka connector

We are reading orders from an Oracle DB and pushing them into a Kafka topic using a Kafka Source Connector. Then, with a Kafka Sink Connector, we are pulling these orders from the Kafka topic and writing them into AWS S3.
Since one order can get multiple updates, and the data in Kafka is partitioned by date, the same order ends up being saved in multiple date folders in S3.
Now, when we try to read this data using AWS Athena, we get multiple rows with the same id.
We also tried partitioning the data by id, but in that case folders were created in S3 per order id, each folder containing multiple versions of the order.
Is there a way to overwrite the record in S3 with every order update that the Kafka Sink receives?
Can anyone please help me in solving this use-case?
Thanks in advance

Flink exactly-once streaming with S3 sink

I am a newbie to Flink and I am trying to write a simple streaming job with exactly-once semantics that reads from Kafka and writes the data to S3. By "exactly once" I mean that I don't want to end up with duplicates if there is an intermediate failure between writing to S3 and the file sink operator committing. I am using Kafka v2.5.0, and according to the connector described in this page, I am guessing my use case will end up with exactly-once behavior.
Questions:
1) Is my assumption correct that my use case will end up with exactly-once behavior even if a failure occurs at any step, so that I can say my S3 files won't have duplicate records?
2) How does Flink handle exactly-once with S3? The documentation says it uses multipart uploads to get exactly-once semantics, but my question is: how is this handled internally? Say the task fails after the S3 multipart upload has succeeded but before the operator commits; in that case, once the operator restarts, will it stream the data that was already written to S3 again, resulting in duplicates?
If you read from Kafka and then write to S3 with the StreamingFileSink you should indeed be able to get exactly once.
Though it is not specifically about S3, this article gives a nice explanation of how to ensure exactly-once semantics in general:
https://flink.apache.org/features/2018/03/01/end-to-end-exactly-once-apache-flink.html
My key takeaway: After a failure we must always be able to see where we stand from the perspective of the sink.
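For reference, a rough sketch of that kind of job, assuming the Flink 1.x DataStream API with the Kafka connector and an S3 filesystem plugin on the classpath; brokers, topic, and bucket path are placeholders. The important part is that checkpointing is enabled, since the sink only commits its part files when a checkpoint completes, which (as I understand it) is what prevents duplicates after a restart.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class KafkaToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Checkpoints drive both the Kafka offset commits and the S3 part-file commits.
        env.enableCheckpointing(60_000, CheckpointingMode.EXACTLY_ONCE);

        Properties kafkaProps = new Properties();
        kafkaProps.setProperty("bootstrap.servers", "localhost:9092"); // placeholder brokers
        kafkaProps.setProperty("group.id", "flink-s3-sink");

        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), kafkaProps));

        StreamingFileSink<String> sink = StreamingFileSink
                .forRowFormat(new Path("s3a://my-bucket/events"),
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        events.addSink(sink);
        env.execute("kafka-to-s3-exactly-once");
    }
}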

How to tag S3 bucket objects using Kafka connect s3 sink connector

Is there any way we can tag the objects written to S3 buckets through the Kafka Connect S3 sink connector?
I am reading messages from Kafka and writing Avro files to an S3 bucket using the S3 sink connector. When the files are written to the S3 bucket, I need to tag them.
There is a method called addTags() in the source code on GitHub, but it is private and not exposed to the connector client, except through a small config feature called S3_OBJECT_TAGGING_CONFIG, which allows you to add the start/end offsets as well as the record count as tags on the S3 object.
configDef.define(
    S3_OBJECT_TAGGING_CONFIG,
    Type.BOOLEAN,
    S3_OBJECT_TAGGING_DEFAULT,
    Importance.LOW,
    "Tag S3 objects with start and end offsets, as well as record count.",
    group,
    ++orderInGroup,
    Width.LONG,
    "S3 Object Tagging"
);
If you want to add other/custom tags, then the answer is no, you cannot do it right now.
A useful feature would be to take the tags from a predefined part of the input document in Kafka, but this is not available right now.