CDC (Change Data Capture) with Kafka - amazon-s3

I want to implement CDC with Kafka: DB1 -> Kafka -> various sinks (e.g. S3, a database, other Kafka consumers).
One of the use cases is to restore a given table using only events from a given period of time.
E.g. for TBL1 all events are in Kafka, both the initial load and the deltas. I want to restore this table in a given sink (e.g. S3, a database) using only the events from date1-date2.
How would you approach this, in addition to restoring the full state?

You can use Debezium to take an initial snapshot of a database table and then stream new changes into Kafka.
There is no out-of-the-box tool that will copy data to/from a Kafka topic between two points in time; you'd need to write that yourself (consuming from Kafka and filtering on record timestamps), or use a range query plus a filter from your database/filesystem into a KafkaProducer. The Confluent S3 Source Connector, for example, reads back all files written by the S3 sink, and the JDBC Source Connector (I assume this is what you mean by "db") needs internal manipulation and a custom query to start/end at an arbitrary timestamp/record.
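For the "write it yourself from Kafka" option, a minimal sketch (topic names, servers, and dates are placeholders; it assumes the kafka-python client and that record timestamps are usable for your window) would be to seek each partition to the first offset at or after date1, copy records into a dedicated "restore" topic until date2, and then point the sink connector (S3, JDBC, ...) at that topic:

    # Sketch only: replay events of a CDC topic whose timestamps fall in
    # [START_MS, END_MS) into a "restore" topic for a sink connector to consume.
    from datetime import datetime, timezone
    from kafka import KafkaConsumer, KafkaProducer, TopicPartition

    SOURCE_TOPIC = "dbserver1.inventory.TBL1"   # hypothetical Debezium topic name
    RESTORE_TOPIC = "TBL1.restore"              # topic the sink connector reads
    START_MS = int(datetime(2020, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
    END_MS = int(datetime(2020, 2, 1, tzinfo=timezone.utc).timestamp() * 1000)

    consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
    producer = KafkaProducer(bootstrap_servers="localhost:9092")

    # Per partition, find the first offset at or after the start timestamp.
    partitions = [TopicPartition(SOURCE_TOPIC, p)
                  for p in consumer.partitions_for_topic(SOURCE_TOPIC)]
    consumer.assign(partitions)
    start_offsets = consumer.offsets_for_times({tp: START_MS for tp in partitions})

    active = set()
    for tp, offset_ts in start_offsets.items():
        if offset_ts is None:                   # no data after START_MS here
            consumer.pause(tp)
            continue
        consumer.seek(tp, offset_ts.offset)
        active.add(tp)

    # Copy records until every partition has passed the end timestamp.
    # (For brevity this assumes each active partition eventually has a record
    # at or past END_MS; a real tool would also check the partition end offsets.)
    while active:
        for tp, records in consumer.poll(timeout_ms=1000).items():
            for record in records:
                if record.timestamp >= END_MS:
                    active.discard(tp)
                    consumer.pause(tp)
                    break
                producer.send(RESTORE_TOPIC, key=record.key, value=record.value)

    producer.flush()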

Related

Proper recovery of KSQL target stream to preserve offset for Connector to avoid duplicate records

We recently adopted Kafka Streams via KSQL to convert our source JSON topics into Avro so that our S3 Sink Connector can store the data in Parquet format in their respective buckets.
Our Kafka cluster was taken down over the weekend, and we've noticed that some of our target (Avro) streams have no data, yet all of our source streams do (checked via print 'topic_name'; in KSQL).
I know that I can drop the target stream and recreate it but will that lose the offset and duplicate records in our Sink?
Also, I know that if I recreate the target stream with the same topic name, I may run into the "topic already exists, with different partition/offset.." error, so I am hesitant to try this.
So what is the best way to recreate/recover our target streams such that we preserve the topic name and offset for our Sink Connector?
Thanks.

Replaying Kafka events stored in S3

I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (e.g. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/GB/mo. or less) after seven days using the S3 Sink Connector, and eventually keep moving them down the chain of S3 storage classes.
However, I don't understand how, if I need to replay a topic from the beginning to restore a service, Kafka would get that data back on demand from S3. I know I can utilize a source connector, but it seems that is only for setting up a new topic, not for pulling data back into an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back to, but you may want to refer to the storage configuration properties regarding the topics.dir and topic relationship.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
Keep in mind, for your recovery cost calculations, that reads from the colder S3 tiers cost progressively more.
You may also want to follow the development of Kafka's native tiered storage support (or, similarly, look at Apache Pulsar as an alternative).
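As a minimal sketch of the "write some code" option above (assuming the sink wrote JSON, one record per line, under the default topics.dir layout; the bucket, prefix, and topic names are placeholders):

    # Sketch: read files written by the S3 sink connector and replay them
    # into a new Kafka topic for downstream services to rebuild from.
    import json
    import boto3
    from kafka import KafkaProducer

    BUCKET = "my-kafka-archive"
    PREFIX = "topics/orders/"     # default layout: <topics.dir>/<topic>/partition=<n>/
    REPLAY_TOPIC = "orders.replay"

    s3 = boto3.client("s3")
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
        for obj in page.get("Contents", []):
            body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"]
            for line in body.iter_lines():
                if line:
                    producer.send(REPLAY_TOPIC, json.loads(line))

    producer.flush()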

Hive partitioning LAYOUT table format in BigQuery

I have many questions wrapped up in this situation, so here goes:
Has anyone ever written Kafka's output to a Google Cloud Storage (GCS) bucket, such that the data in that bucket is partitioned using the "default hive partitioning layout"?
The intent behind doing that is that this external table needs to be "queryable" in BigQuery.
Google's documentation on that is here, but I wanted to see if someone has an example ( https://cloud.google.com/bigquery/docs/hive-partitioned-queries-gcs ).
For example, the documentation says "files follow the default layout, with the key/value pairs laid out as directories with an = sign as a separator, and the partition keys are always in the same order."
What's not clear is:
a) does Kafka create these directories on the fly, OR do I have to pre-create them? Let's say I WANT to have Kafka write to directories based on date in GCS:
gs://bucket/table/dt=2020-04-07/
Tonight, after midnight, do I have to PRE-create this new directory gs://bucket/table/dt=2020-04-08/, or CAN Kafka create it for me? And in all this, how does the Hive partitioning LAYOUT help me?
b) Does my table's data, which I am trying to put in these dirs every day, need to have "dt" (from gs://bucket/table/dt=2020-04-07/) as a column in it?
Since the goal in all this is to have BigQuery query this external table, which under the hood references all the data in this bucket, i.e.
gs://bucket/table/dt=2020-04-06/
gs://bucket/table/dt=2020-04-07/
gs://bucket/table/dt=2020-04-08/
Just trying to see if this would be the right approach for it.
Kafka itself is a messaging system that lets you exchange data between processes, applications, and servers, but it requires producers and consumers (here is an example) that actually move the data. For instance:
The Producer needs to send the data in a format that BigQuery can read.
And the Consumer needs to write the data with a valid Hive Layout.
The Consumer should write to GCS, so you would need to find the proper connector for your application (e.g. this Java connector or the Confluent connector). When writing the messages to GCS, you need to take care to use a valid 'default hive partitioning layout'.
For example, in gs://bucket/table/dt=2020-04-07/, dt is a column the table is partitioned on and 2020-04-07 is one of its values, so take care with this. Once you have a valid Hive layout in GCS, you need to create a table in BigQuery; I recommend a native table from the UI, selecting Google Cloud Storage as the source and enabling 'Source Data Partitioned', but you can also use --hive_partitioning_source_uri_prefix and --hive_partitioning_mode to link the GCS data with a BigQuery table.
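To make the directory question above concrete, here is a hedged Python sketch (not the Confluent connector; bucket, topic, and credential setup are placeholders) of a consumer writing messages under a default Hive partitioning layout. GCS has no real directories, so an object key containing dt=2020-04-08/ "creates" that path implicitly and nothing has to be pre-created, and because dt and its value live in the path, the payload does not have to repeat the column. A real connector would batch records rather than write one object per message, as done here for clarity.

    # Sketch: write each Kafka message to GCS under a Hive-style dt=YYYY-MM-DD/ key.
    import json
    from datetime import datetime, timezone
    from google.cloud import storage
    from kafka import KafkaConsumer

    bucket = storage.Client().bucket("my-bucket")          # placeholder bucket
    consumer = KafkaConsumer("my_topic",                   # placeholder topic
                             bootstrap_servers="localhost:9092",
                             value_deserializer=lambda v: json.loads(v.decode("utf-8")))

    for msg in consumer:
        dt = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        # Hive-style key: the partition column (dt) and its value are encoded
        # in the object name, so the "directory" appears automatically.
        key = f"table/dt={dt}/{msg.partition}-{msg.offset}.json"
        bucket.blob(key).upload_from_string(json.dumps(msg.value),
                                            content_type="application/json")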
As all of this implies different layers of development and configuration, if this process makes sense for you, I recommend opening new questions for any specific errors you run into.
Last but not least, the Kafka to BigQuery connector and other connectors for ingesting from Kafka into GCP can help more if a Hive layout is not mandatory for your use case.

Hive data transformations in real time?

I have the following data pipeline:
A process writes messages to Kafka
A Spark Structured Streaming application listens for new Kafka messages and writes them as-is to HDFS
A batch Hive job runs on an hourly basis, reads the newly ingested messages from HDFS, and via some moderately complex INSERT INTO statements populates some tables (I do not have materialized views available). EDIT: Essentially, after my Hive job I end up with Table1 storing the raw data, then another table Table2 = fun1(Table1), then Table3 = fun2(Table2), then Table4 = join(Table2, Table3), etc. Fun is a selection or an aggregation.
A Tableau dashboard visualizes the data I wrote.
As you can see, step 3 makes my pipeline not real time.
What can you suggest in order to make my pipeline fully real time? EDIT: I'd like to have Table1, ..., TableN updated in real time!
Using Hive with Spark Streaming is not recommended at all, since the whole purpose of Spark Streaming is low latency. Hive introduces about the highest latency possible (OLAP), because at the backend it executes an MR/Tez job (depending on hive.execution.engine).
Recommendation: use Spark Streaming with a low-latency store such as HBase or Phoenix.
Solution: develop a Spark Streaming job with Kafka as a source and use a custom sink to write the data into HBase/Phoenix.
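As a rough sketch of that shape (not a drop-in solution): a Structured Streaming job reads from Kafka and writes each micro-batch out through foreachBatch. The generic JDBC write below is a stand-in; for HBase/Phoenix you would swap in the phoenix-spark or an HBase connector. The schema, topic, URL, and table names are placeholders, and it assumes the spark-sql-kafka package and the relevant driver are on the classpath.

    # Sketch only: Kafka source -> foreachBatch -> a JDBC-accessible store.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType, TimestampType

    spark = SparkSession.builder.appName("kafka-to-lowlatency-store").getOrCreate()

    schema = StructType([
        StructField("id", StringType()),
        StructField("event_time", TimestampType()),
        StructField("payload", StringType()),
    ])

    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "events")            # placeholder topic
              .load()
              .select(from_json(col("value").cast("string"), schema).alias("e"))
              .select("e.*"))

    def write_batch(batch_df, batch_id):
        # Placeholder connection details; appends each micro-batch to the store.
        # Replace this generic JDBC write with your Phoenix/HBase connector.
        (batch_df.write
         .format("jdbc")
         .option("url", "jdbc:postgresql://db-host:5432/realtime")   # stand-in URL
         .option("dbtable", "table1_raw")
         .mode("append")
         .save())

    (events.writeStream
     .foreachBatch(write_batch)
     .option("checkpointLocation", "/tmp/checkpoints/kafka-to-store")
     .start()
     .awaitTermination())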
Introducing HDFS obviously isn't real time. MemSQL or Druid/Imply offer much more real-time ingestion from Kafka.
You need historical data to perform roll-ups and aggregations. Tableau may cache datasets, but it doesn't store anything persistently itself. You therefore need some storage, and you've chosen to use HDFS rather than a database.
Note: Hive/Presto can read directly from Kafka, so you don't really even need Spark.
If you want to do rolling aggregates over Kafka and make them queryable, KSQL could be used instead, or you can write your own Kafka Streams solution.

Batching and Uploading real-time traffic to S3

I am looking for suggestions/solutions on implementing an archiving workflow at big data scale.
The source of the data is messages in Kafka, which are written in real time. The destination is an S3 bucket.
I need to partition the data based on a field in the message. For each partition I need to batch the data into 100 MB chunks and then upload them.
The data rate is ~5 GB/minute, so a 100 MB batch should fill within a couple of seconds.
My trouble is around scaling and batching. Since I need to batch and compress data per "field" value in a message, I need to bring that part of the data together by partitioning. Any suggestions on tech/workflow?
You can use Kafka Connect. There's a connector for S3:
http://docs.confluent.io/current/connect/connect-storage-cloud/kafka-connect-s3/docs/s3_connector.html
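As an illustration of what that might look like (a sketch, not a verified config: property names follow the Confluent S3 sink documentation, so check them against your connector version; the topic, bucket, field, and sizing values are placeholders), you could register the connector through the Kafka Connect REST API. Note that flush.size counts records rather than bytes, so it has to be derived from your average message size to approximate 100 MB objects, and the FieldPartitioner is what groups records by a message field.

    # Sketch: register an S3 sink connector via the Connect REST API (assumed
    # to be listening on localhost:8083). All values below are placeholders.
    import json
    import requests

    connector = {
        "name": "s3-archive-sink",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": "traffic",
            "tasks.max": "8",
            "s3.bucket.name": "my-archive-bucket",
            "s3.region": "us-east-1",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            # Partition the S3 layout by a field of the message:
            "partitioner.class": "io.confluent.connect.storage.partitioner.FieldPartitioner",
            "partition.field.name": "customer_id",
            # flush.size is a record count, not bytes; ~100k records of ~1 KB ~= 100 MB:
            "flush.size": "100000",
        },
    }

    resp = requests.post("http://localhost:8083/connectors",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(connector))
    resp.raise_for_status()
    print(resp.json())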
You can use Apache Spark to do the scaling and batching for you. So basically the flow can look like this:
Apache Kafka -> Apache Spark -> Amazon S3.
The Spark Streaming API enables scalable, high-throughput, fault-tolerant processing of live data streams. Data can be ingested from many sources like Kafka, processed using complex algorithms expressed with high-level functions like map, reduce, join, and window, and finally pushed out to filesystems like Amazon S3.
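A sketch of that flow under stated assumptions (the spark-sql-kafka and hadoop-aws packages are available; the schema, topic, bucket, and trigger interval are placeholders): read the stream, partition the output by the message field so each field's data lands together, and write Parquet files to S3.

    # Sketch: Kafka -> Spark Structured Streaming -> S3, partitioned by a field.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, from_json
    from pyspark.sql.types import StringType, StructField, StructType

    spark = SparkSession.builder.appName("kafka-to-s3-archive").getOrCreate()

    schema = StructType([
        StructField("customer_id", StringType()),   # the field to partition on
        StructField("payload", StringType()),
    ])

    messages = (spark.readStream
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "traffic")       # placeholder topic
                .load()
                .select(from_json(col("value").cast("string"), schema).alias("m"))
                .select("m.*"))

    (messages.writeStream
     .format("parquet")
     .partitionBy("customer_id")                      # brings each field's data together
     .option("path", "s3a://my-archive-bucket/traffic/")
     .option("checkpointLocation", "s3a://my-archive-bucket/_checkpoints/traffic/")
     .trigger(processingTime="30 seconds")            # rough batching interval
     .start()
     .awaitTermination())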