Proper recovery of KSQL target stream to preserve offset for Connector to avoid duplicate records - amazon-s3

We recently adopted Kafka Streams via KSQL to convert our source JSON topics into Avro so that our S3 Sink Connector can store the data in Parquet format in their respective buckets.
Our Kafka cluster was taken down over the weekend and we've noticed that some of our target streams (Avro) have no data, yet all of our source streams do (checked via PRINT 'topic_name'; in ksql).
I know that I can drop the target stream and recreate it, but will that reset the offsets and cause duplicate records in our Sink?
Also, I know that if I recreate the target stream with the same topic name, I may run into the "topic already exists, with different partition/offset.." error, so I am hesitant to try this.
So what is the best way to recreate/recover our target streams such that we preserve the topic name and offset for our Sink Connector?
Thanks.

Related

Replaying Kafka events stored in S3

I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (i.e. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/GB/mo. or less) after seven days using the S3 Sink Connector and eventually move them further down the chain of S3 storage classes.
However, what I don't understand is: if I need to replay a topic from the beginning to restore a service, how would Kafka get that data back on demand from S3? I know I can utilize a source connector, but it seems that is only for populating a new topic, not for pulling data back into an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back to, but you may want to review the storage configuration properties regarding topics.dir and the topic relationship.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
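As a rough illustration of that approach, a minimal Java sketch might look like the following. It assumes the archived events were written as newline-delimited JSON under a known prefix; the bucket, prefix, and topic names are placeholders, not part of your actual setup:

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.GetObjectRequest;
    import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.Properties;

    public class S3Replayer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (S3Client s3 = S3Client.create();
                 KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {

                // List every archived object under the sink's output prefix (placeholder names)
                s3.listObjectsV2Paginator(ListObjectsV2Request.builder()
                            .bucket("my-archive-bucket")
                            .prefix("topics/orders/")
                            .build())
                  .contents()
                  .forEach(obj -> {
                      GetObjectRequest get = GetObjectRequest.builder()
                              .bucket("my-archive-bucket")
                              .key(obj.key())
                              .build();
                      try (BufferedReader reader = new BufferedReader(
                              new InputStreamReader(s3.getObject(get)))) {
                          String line;
                          while ((line = reader.readLine()) != null) {
                              // Replay each archived event onto a (new or existing) topic
                              producer.send(new ProducerRecord<>("orders-replay", line));
                          }
                      } catch (Exception e) {
                          throw new RuntimeException(e);
                      }
                  });
            }
        }
    }

This is only a sketch; a real replay tool would also need to handle keys, headers, ordering across files, and the original Avro/JSON format written by the sink.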
Keep in mind, for your recovery cost calculations, that reads from the colder S3 storage tiers cost progressively more.
You may also want to follow the developments of Kafka's native tiered storage support (or, similarly, look at Apache Pulsar as an alternative).

CDC (Change Data Capture) with Kafka

I want to implement CDC with Kafka: DB1 -> Kafka -> various sinks (i.e. S3, DB, other Kafka consumers).
One of the use cases is to restore a given table using only events from a given period of time.
E.g. for TBL1 all events are in Kafka, both the initial load and the deltas. I want to restore this table in a given sink (e.g. S3, a DB) using only events from date1-date2.
How would you approach this, in addition to restoring the full state?
You can use Debezium to do an initial snapshot of a database table, then track new changes to Kafka.
There is no out-of-the-box tool that will copy data to/from a Kafka topic between two time intervals; you'd need to write that yourself (from Kafka, as sketched below), or use a range query plus filter from your database/filesystem into a KafkaProducer. The Confluent S3 Source Connector, for example, will read all files (written by the S3 sink), and the JDBC Source Connector (I assume this is what you mean by "db") needs internal manipulation and a custom query to start or end at an arbitrary timestamp/record.
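As a rough sketch of the "write it yourself from Kafka" option in Java: seek each partition to the first offset at or after the start timestamp with offsetsForTimes, then stop once records pass the end timestamp. The topic name and date range here are placeholders:

    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;
    import org.apache.kafka.common.serialization.StringDeserializer;

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;
    import java.util.stream.Collectors;

    public class TimeRangeReplay {
        public static void main(String[] args) {
            long start = Instant.parse("2021-01-01T00:00:00Z").toEpochMilli(); // date1
            long end   = Instant.parse("2021-02-01T00:00:00Z").toEpochMilli(); // date2

            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.deserializer", StringDeserializer.class.getName());
            props.put("value.deserializer", StringDeserializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = consumer.partitionsFor("TBL1").stream()
                        .map(p -> new TopicPartition(p.topic(), p.partition()))
                        .collect(Collectors.toList());
                consumer.assign(partitions);

                // Find the first offset at or after the start timestamp for each partition
                Map<TopicPartition, Long> query = new HashMap<>();
                partitions.forEach(tp -> query.put(tp, start));
                consumer.offsetsForTimes(query).forEach((tp, offsetAndTs) -> {
                    if (offsetAndTs != null) {
                        consumer.seek(tp, offsetAndTs.offset());
                    }
                });

                boolean done = false;
                while (!done) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    if (records.isEmpty()) break;
                    for (ConsumerRecord<String, String> record : records) {
                        if (record.timestamp() > end) { done = true; break; }
                        // Forward the record to the sink of your choice (S3, DB, ...)
                        System.out.println(record.value());
                    }
                }
            }
        }
    }

Note this simple sketch stops as soon as any record passes the end timestamp; a robust version would track the end per partition.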

Hive partitioning LAYOUT table format in BigQuery

I have many questions wrapped up in this situation, so here goes:
Has anyone ever written Kafka's output to a Google Cloud Storage (GCS) bucket, such that the data in that bucket is partitioned using the "default hive partitioning layout"?
The intent behind doing that is that this external table needs to be "queryable" in BigQuery.
Google's documentation on that is here ( https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs ), but I wanted to see if someone has an example.
For example, the documentation says "files follow the default layout, with the key/value pairs laid out as directories with an = sign as a separator, and the partition keys are always in the same order."
What's not clear is:
a) Does Kafka create these directories on the fly, or do I have to pre-create them? Let's say I want Kafka to write to directories based on date in GCS, e.g.
gs://bucket/table/dt=2020-04-07/
Tonight, after midnight, do I have to pre-create this new directory gs://bucket/table/dt=2020-04-08/, or can Kafka create it for me? And in all this, how does the hive partitioning layout help me?
Does my table's data, which I am trying to put in these directories every day, need to have "dt" (from gs://bucket/table/dt=2020-04-07/) as a column in it?
Since the goal in all this is to have BigQuery query this external table, which underneath references all the data in this bucket, i.e.
gs://bucket/table/dt=2020-04-06/
gs://bucket/table/dt=2020-04-07/
gs://bucket/table/dt=2020-04-08/
Just trying to see if this would be the right approach for it.
Kafka itself is a messaging system that allows you to exchange data between processes, applications, and servers, but it requires producers and consumers (here is an example) that move the data. For instance:
The Producer needs to send the data in a format that BigQuery can read.
And the Consumer needs to write the data with a valid Hive Layout.
The Consumer should write to GCS, so you would need to find the proper connector for your application (e.g. this Java connector or the Confluent connector). And when writing the messages to GCS you need to take care to use a valid 'default hive partitioning layout'.
For example, in gs://bucket/table/dt=2020-04-07/, dt is a column the table is partitioned on and 2020-04-07 is one of its values, so keep this in mind. Once you have a valid Hive layout in GCS, you need to create a table in BigQuery. I recommend creating a native table from the UI, selecting Google Cloud Storage as the source and enabling 'Source Data Partitioned', but you can also use --hive_partitioning_source_uri_prefix and --hive_partitioning_mode to link the GCS data to a BigQuery table, as sketched below.
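A minimal sketch with the bq CLI might look like the following; it assumes the files are Parquet, and the bucket, dataset, and table names are placeholders:

    bq mkdef --source_format=PARQUET \
      --hive_partitioning_mode=AUTO \
      --hive_partitioning_source_uri_prefix=gs://bucket/table \
      "gs://bucket/table/*" > table_def.json

    bq mk --external_table_definition=table_def.json mydataset.kafka_events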
As all this implies different layers of development and configuration, if this process makes sense for you, I recommend opening new questions for any specific errors you run into.
Last but not least, the Kafka to BigQuery connector and other connectors for ingesting from Kafka into GCP may serve you better if a Hive layout is not mandatory for your use case.

Convert Avro in Kafka to Parquet directly into S3

I have topics in Kafka that are stored in Avro format. I would like to consume the entire topic (which, at the time of consumption, will not be receiving any new messages) and convert it into Parquet, saving directly to S3.
I currently do this, but it requires consuming the messages from Kafka one at a time and processing them on a local machine, converting them to a Parquet file, and once the entire topic is consumed and the Parquet file completely written, closing the writing process and then initiating an S3 multipart file upload. In short: | Avro in Kafka -> convert to Parquet locally -> copy file to S3 |.
What I'd like to do instead is | Avro in Kafka -> parquet in S3 |
One of the caveats is that the Kafka topic name isn't static; it needs to be fed in as an argument, used once, and then never used again.
I've looked into Alpakka and it seems like it might be possible - but it's unclear, I haven't seen any examples. Any suggestions?
You just described Kafka Connect :)
Kafka Connect is part of Apache Kafka, and the S3 connector is available as a plugin for it. Although, at the moment, Parquet support is still under development.
For a primer in Kafka Connect see http://rmoff.dev/ksldn19-kafka-connect
Try to add "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat" in your PUT request when you set up your connector.
You can find more details here.
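For reference, a connector configuration along those lines might look like the following sketch; the topic, bucket, region, and Schema Registry URL are placeholders, and the exact properties depend on your connector version:

    {
      "name": "avro-to-parquet-s3",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my_avro_topic",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
        "parquet.codec": "snappy",
        "flush.size": "10000",
        "value.converter": "io.confluent.connect.avro.AvroConverter",
        "value.converter.schema.registry.url": "http://schema-registry:8081"
      }
    }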

Batching and uploading real-time traffic to S3

I am looking for some suggestions/solutions for implementing an archiving workflow at big data scale.
The source of the data is messages in Kafka, which are written in real time. The destination is an S3 bucket.
I need to partition the data based on a field in the message. For each partition I need to batch the data into 100 MB chunks and then upload them.
The data rate is ~5 GB/minute, so each 100 MB batch should fill within a couple of seconds.
My trouble is around scaling and batching. Since I need to batch and compress data per "field" in the message, I need to bring that part of the data together by partitioning. Any suggestions on tech/workflow?
You can use Kafka Connect. There's a connector for S3:
http://docs.confluent.io/current/connect/connect-storage-cloud/kafka-connect-s3/docs/s3_connector.html
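As a rough sketch, an S3 sink configuration that partitions the output by a field in the message could look like the following; the topic, bucket, and field names are placeholders, and note that flush.size counts records per file (not bytes), so you would tune it to approximate your 100 MB target:

    {
      "name": "traffic-s3-archive",
      "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "traffic-events",
        "s3.bucket.name": "my-archive-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
        "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
        "partition.field.name": "customer_id",
        "flush.size": "100000",
        "tasks.max": "4"
      }
    }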
You can use Apache Spark to do the scaling and batching for you. So basically the flow can look like this:
Apache Kafka -> Apache Spark -> Amazon S3.
The Spark Streaming API enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka and processed with high-level functions like map, reduce, join, and window. Finally, the processed data can be pushed out to filesystems like Amazon S3.
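As a hedged sketch of that flow using Spark's newer Structured Streaming API in Java; the topic, bucket paths, and the customer_id field extracted from the JSON payload are placeholders for whatever field you actually partition on:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.Trigger;

    public class KafkaToS3 {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder().appName("KafkaToS3").getOrCreate();

            // Read the live stream from Kafka (placeholder topic and brokers)
            Dataset<Row> events = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")
                    .option("subscribe", "traffic-events")
                    .load();

            // Extract the partitioning field from the JSON payload (placeholder field name)
            Dataset<Row> parsed = events.selectExpr(
                    "CAST(value AS STRING) AS json",
                    "get_json_object(CAST(value AS STRING), '$.customer_id') AS customer_id");

            // Write micro-batches as Parquet to S3, one directory per field value
            parsed.writeStream()
                    .format("parquet")
                    .partitionBy("customer_id")
                    .option("path", "s3a://my-archive-bucket/traffic/")
                    .option("checkpointLocation", "s3a://my-archive-bucket/_checkpoints/traffic/")
                    .trigger(Trigger.ProcessingTime("30 seconds"))
                    .start()
                    .awaitTermination();
        }
    }

File sizes are governed by the trigger interval and incoming volume rather than an exact 100 MB cutoff, so you would tune the trigger (or compact afterwards) to approach your target chunk size.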