Convert Avro in Kafka to Parquet directly into S3 - amazon-s3

I have topics in Kafka whose messages are stored in Avro format. I would like to consume an entire topic (which, by the time it is consumed, will not receive any further messages) and convert it to Parquet, saving directly to S3.
I currently do this, but it requires consuming the messages from Kafka one at a time, processing them on a local machine, and converting them into a Parquet file; once the entire topic is consumed and the Parquet file completely written, I close the writing process and then initiate an S3 multipart file upload. Or | Avro in Kafka -> convert to Parquet locally -> copy file to S3 | for short.
What I'd like to do instead is | Avro in Kafka -> parquet in S3 |
One of the caveats is that the Kafka topic name isn't static; it needs to be fed in as an argument, used once, and then never used again.
I've looked into Alpakka and it seems like it might be possible, but it's unclear; I haven't seen any examples. Any suggestions?

You just described Kafka Connect :)
Kafka Connect is part of Apache Kafka, and you can use it with the S3 sink connector plugin. At the moment, though, Parquet support for that connector is still in development.
For a primer in Kafka Connect see http://rmoff.dev/ksldn19-kafka-connect

Try adding "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat" to your PUT request when you set up your connector.
You can find more details here.
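For reference, a minimal connector config along these lines could be PUT to the Connect REST API (the topic, bucket, region, and Schema Registry URL below are placeholders); ParquetFormat needs schema-aware records, so the Avro converter backed by Schema Registry is assumed here:

{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
  "topics": "my-avro-topic",
  "s3.bucket.name": "my-bucket",
  "s3.region": "us-east-1",
  "flush.size": "1000",
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "http://schema-registry:8081"
}

Since the topic name isn't static, you could template this JSON and create (and later delete) the connector per run through the same REST API.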

Related

Apache Flink: Reading parquet files from S3 in Data Stream APIs

We have several external jobs producing small (500 MiB) Parquet objects on S3, partitioned by time. The goal is to create an application that reads those files, joins them on a specific key, and dumps the result into a Kinesis stream or another S3 bucket.
Can this be achieved with Flink alone? Can it monitor for new S3 objects as they are created and load them into the application?
The newer FileSource class (available in recent Flink versions) supports monitoring a directory for new/modified files. See FileSource.forBulkFileFormat() in particular, for reading Parquet files.
You then take the FileSourceBuilder returned by that method call and add .monitorContinuously(Duration.ofHours(1)) (or whatever interval makes sense).
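A rough sketch of that wiring, assuming Flink 1.15+ with the flink-parquet and S3 filesystem dependencies on the classpath (the ParquetColumnarRowInputFormat constructor differs slightly between Flink versions, and the bucket path and column names below are placeholders):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.FileSourceSplit;
import org.apache.flink.core.fs.Path;
import org.apache.flink.formats.parquet.ParquetColumnarRowInputFormat;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.runtime.typeutils.InternalTypeInfo;
import org.apache.flink.table.types.logical.BigIntType;
import org.apache.flink.table.types.logical.LogicalType;
import org.apache.flink.table.types.logical.RowType;
import org.apache.flink.table.types.logical.VarCharType;

public class MonitorS3ParquetJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Columns to project out of the Parquet files (placeholder schema).
        RowType rowType = RowType.of(
                new LogicalType[] {new VarCharType(VarCharType.MAX_LENGTH), new BigIntType()},
                new String[] {"join_key", "amount"});

        ParquetColumnarRowInputFormat<FileSourceSplit> parquetFormat =
                new ParquetColumnarRowInputFormat<>(
                        new org.apache.hadoop.conf.Configuration(), // S3 settings if needed
                        rowType,
                        InternalTypeInfo.of(rowType),
                        500,    // batch size
                        false,  // interpret timestamps as UTC
                        true);  // case-sensitive column matching

        FileSource<RowData> source = FileSource
                .forBulkFileFormat(parquetFormat, new Path("s3://my-bucket/input/"))
                .monitorContinuously(Duration.ofHours(1)) // pick up new objects every hour
                .build();

        DataStream<RowData> rows =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "s3-parquet-source");

        rows.print(); // replace with the join / aggregation and the Kinesis or S3 sink
        env.execute("monitor-s3-parquet");
    }
}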

How to implement Kafka Connect with my own message format

I have a Kafka topic that contains binary messages (byte arrays).
I would like to write the messages to S3 in Parquet format.
I tried to use Kafka Connect and struggled with the configuration.
My Kafka messages also carry some Kafka headers that need to be written to Parquet as well.
What is the right configuration in this case?
The data is not Avro and not JSON.
Can I write the byte array as-is to the Parquet file without serializing it?
Thanks

Read S3 file based on the path that comes in Kafka - Apache Flink

I have a pipeline that listens to a Kafka topic that receives the S3 file name & path. The pipeline has to read the file from S3 and do some transformation & aggregation.
I see that Flink has support for reading an S3 file directly as a source connector, but in this use case the read has to happen as part of the transformation stage.
I don't believe this is currently possible.
An alternative might be to keep a Flink session cluster running, and dynamically create and submit a new Flink SQL job running in batch mode to handle the ingestion of each file.
Another approach you might be tempted by would be to implement a RichFlatMapFunction that accepts the path as input, reads the file, and emits its records one by one. But this is unlikely to work well unless the files are rather small, because Flink really doesn't like user functions that run for long periods of time.
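If you do go that route for small files, a minimal sketch of such a function might look like this (the class name is made up; it assumes the appropriate Flink S3 filesystem plugin, e.g. flink-s3-fs-hadoop, is installed so that s3:// paths resolve):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.core.fs.FileSystem;
import org.apache.flink.core.fs.Path;
import org.apache.flink.util.Collector;

// Takes an S3 path arriving from the Kafka topic and emits the file's lines as records.
public class ReadS3FileFunction extends RichFlatMapFunction<String, String> {
    @Override
    public void flatMap(String s3Path, Collector<String> out) throws Exception {
        Path path = new Path(s3Path);                 // e.g. "s3://my-bucket/some/key.csv"
        FileSystem fs = path.getFileSystem();         // resolved via the configured S3 filesystem
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                out.collect(line);                    // one record per line; parse as needed
            }
        }
    }
}

You would then apply it as pathStream.flatMap(new ReadS3FileFunction()) on the stream of paths coming from Kafka. Again, this blocks a task slot for the duration of each file read, so it only suits small files.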

How to tag S3 bucket objects using Kafka connect s3 sink connector

Is there any way we can tag the objects written to S3 buckets through the Kafka Connect S3 sink connector?
I am reading messages from Kafka and writing Avro files to an S3 bucket using the S3 sink connector. When the files are written to the S3 bucket, I need to tag them.
There is a method called addTags() in the connector's source code on GitHub, but it is private and not exposed to the connector client, apart from a small config feature called S3_OBJECT_TAGGING_CONFIG, which lets you add the start/end offsets as well as the record count as tags on the S3 object:
configDef.define(
        S3_OBJECT_TAGGING_CONFIG,
        Type.BOOLEAN,
        S3_OBJECT_TAGGING_DEFAULT,
        Importance.LOW,
        "Tag S3 objects with start and end offsets, as well as record count.",
        group,
        ++orderInGroup,
        Width.LONG,
        "S3 Object Tagging"
);
If you want to add other/custom tags, the answer is no, you cannot do that right now.
A useful feature would be to take the tags from a predefined part of the input document in Kafka, but this is not available right now.
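If the built-in start/end-offset and record-count tags are all you need, enabling that flag is just a matter of adding the corresponding property (s3.object.tagging, assuming current connector versions) to your connector config:

"s3.object.tagging": "true"

Anything beyond that would currently mean forking the connector (for example to expose addTags()) or tagging the objects after the fact with a separate process.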

Batching and uploading real-time traffic to S3

I am looking for suggestions/solutions for implementing an archiving workflow at big-data scale.
The source of the data is messages in Kafka, which are written in real time. The destination is an S3 bucket.
I need to partition the data based on a field in the message. For each partition I need to batch the data into 100 MB chunks and then upload them.
The data rate is ~5 GB/minute, so a 100 MB batch should fill within a couple of seconds.
My trouble is around scaling and batching: since I need to batch and compress the data per "field" in the message, I need to bring that part of the data together by partitioning. Any suggestions on technology/workflow?
You can use Kafka Connect. There's a connector for S3:
http://docs.confluent.io/current/connect/connect-storage-cloud/kafka-connect-s3/docs/s3_connector.html
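A hedged sketch of such a connector config (topic, bucket, and field names are placeholders): FieldPartitioner puts records into one S3 prefix per value of the chosen message field, and the connector rotates files by record count (flush.size) and/or time (rotate.interval.ms), so you would tune those to approximate your 100 MB target rather than setting a byte size directly.

{
  "connector.class": "io.confluent.connect.s3.S3SinkConnector",
  "storage.class": "io.confluent.connect.s3.storage.S3Storage",
  "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
  "topics": "events",
  "s3.bucket.name": "archive-bucket",
  "s3.region": "us-east-1",
  "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
  "partition.field.name": "tenant",
  "flush.size": "100000",
  "rotate.interval.ms": "60000"
}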
You can use Apache Spark to handle the scaling and batching for you. So basically the flow can look like this:
Apache Kafka -> Apache Spark -> Amazon S3.
The Spark Streaming API enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources such as Kafka, processed using high-level functions like map, reduce, join, and window, and finally pushed out to filesystems such as Amazon S3.
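For the Spark route, a sketch in Java using Structured Streaming (it assumes JSON message values containing a tenant field to partition by; the broker address, topic, and bucket/checkpoint paths are placeholders):

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.get_json_object;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class KafkaToS3Archiver {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("kafka-to-s3-archive").getOrCreate();

        // Read the raw Kafka records as a streaming DataFrame.
        Dataset<Row> kafka = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "events")
                .load();

        // Extract the partitioning field from the message value (assumed to be JSON here).
        Dataset<Row> parsed = kafka
                .selectExpr("CAST(value AS STRING) AS json")
                .withColumn("tenant", get_json_object(col("json"), "$.tenant"));

        // Write Parquet files to S3, one directory per value of the partitioning field.
        StreamingQuery query = parsed.writeStream()
                .format("parquet")
                .option("path", "s3a://archive-bucket/events/")
                .option("checkpointLocation", "s3a://archive-bucket/checkpoints/events/")
                .partitionBy("tenant")
                .trigger(Trigger.ProcessingTime("1 minute")) // micro-batch interval
                .start();

        query.awaitTermination();
    }
}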