Replaying Kafka events stored in S3 - amazon-s3

I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (i.e. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/Gb/mo. or less) after seven days using the S3 Sink Connector and eventually continually move them down the chain of S3 storage levels.
However, I don't understand that if I need to replay a topic from the beginning to restore a service, how would Kafka get that data back on demand from S3? I know I can utilize a source connector, but it seems that is only for setting up a new topic. Not for pulling data back from an existing topic.

The Confluent S3 Source Connector doesn't dictate where the data is written back into. But you may want to refer the storage configuration properties regarding topics.dir and topic relationship.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
Keep in mind, for your recovery payment calculations that reads from different tiers of S3 cost more and more.
You may also want to follow the developments of Kafka native tiered storage support (or similarly, look at Apache Pulsar as an alternative)

Related

update s3 record based on kafka event

We have records in the s3 bucket (which get updated daily via a job). And We need to listen to a Kafka stream/topic and when a new event arrives in this Kafka stream, we need to update that particular record in s3.
Is this possible?
To my understanding, we need to take the data dump of s3 (via scala code or something) and write to it. IMO, this is not a practical way.
Is there an efficient way to do it?
Unclear if you meant S3 event or Kafka event.
For S3 write events, you can use a Lambda job to notify some code to run, and Kafka does not need involved. Could be any language - Python or NodeJS seem to be most popular for lambdas.
For Kafka records, your consumer would see the event, then use a S3 client API to do whatever it needs with that data, such as write/update a file. Again, any language can be used that supports Kafka protocol, not only Scala. After which, a Lambda could also read that S3 event.

CDC(Change Data Capture) with Kafka

I want to implement CDC with kafka: DB1->kafka-> various sinks (ie: s3, DB, other kafka consumer).
One of the use cases is to restore a given table only with events from given period of time.
Ie: for TBL1 all events are in kafka, initial load and deltas. I want to restore in given sink(ie. s3, db) this table only with events from date1-date2.
How would you approach this, additionally to restoring full state?
You can use Debezium to do an initial snapshot of a database table, then track new changes to Kafka.
There is no available tool out-of-the-box that will copy data to/from a Kafka topic between two time intervals; you'd need to write that yourself (from Kafka), or using a range-query + filter from your database/filesystem to a KafkaProducer. The Confluent S3 Source Connector, for example, will read all files (written by the S3 sink), and the JDBC (I assume this is what you mean by "db") Source Connector needs internal manipulation and a custom query to start/end at any timestamp/record.

SQS and AWS Lambda Integration

I am developing an Audit Trail System, that will act as a central location for all the critical events happening around the organization. I am planning to use Amazon SQS as a temporary queue to hold the messages that in turn will trigger the AWS lambda function to write the messages into AWS S3 store. I want to segregate the data at tenantId level (some identifiable id) and persist the messages as batches in S3, that will reduce the no of calls from lambda to S3. Moreover, I want to trigger the lambda every hour. But, I have 2 issues here, one the max batch size provided by SQS is 10, also the lambda trigger polls the SQS service on regular basis, that's gonna increase the no of calls to my S3. I want to create a manual batch of 1000 messages(say) before calling the S3 batch api. I am not very much sure how to architecture my system, so that above requirements can be met. Help or idea provided is very much appreciable!
Simplified Architecture:
Thanks!
I would recommend that you instead use Amazon Kinesis Data Firehose. It basically does what you're wanting to do:
Accepts incoming messages
Buffers them for a period of time
Writes output to S3 or Elasticsearch
This is all done as a managed service, and can also integrate with AWS Lambda to provide custom processing (eg filter out certain records).
However, you might have to do something special to segregate the data at tenantId. See: Can I customize partitioning in Kinesis Firehose before delivering to S3?

Using Google Cloud ecosystem vs building your own microservice architecture

Building in the Google Cloud ecosystem is really powerful. I really like how you can ingest files to Cloud Storage then Data Flow enriches, transforms and aggregates the data, and then finally stored in BigQuery or Cloud SQL.
I have a couple of questions to help me have a better understanding.
If you are to build a big data product using the Google services.
When a front-end web application (might be built in React) submits a file to Cloud storage it may take some time before it completely processes. The client might want to view the status the file in the pipeline. They then might want to do something with the result on completion. How are front-end clients expected know when a file has completed processed and ready? Do they need to poll data from somewhere?
If you currently have a microservice architecture in which each service does a different kind of processing. For example one might parse a file, another might processes messages. The services communicate using Kafka or RabbitMQ and store data in Postgres or S3.
If you adopt the Google services ecosystem could you replace that microservice architecture with Cloud storage, dataflow, Cloud SQL/Store?
Did you look at Cloud Pub/Sub (topic subscription/publication service).
Cloud Pub/Sub brings the scalability, flexibility, and reliability of enterprise message-oriented middleware to the cloud. By providing many-to-many, asynchronous messaging that decouples senders and receivers, it allows for secure and highly available communication between independently written applications.
I believe Pub/Sub can mostly substitute Kafka or RabitMQ in your case.
How are front-end clients expected know when a file has completed processed and ready? Do they need to poll data from somewhere?
For example, if you are using dataflow API to process the file, Cloud dataflow can publish the progress and send the status to a topic. Your front end (app engine) just needs to subscribe to that topic and receive update.
1)
Dataflow does not offer inspection to intermediary results. If a frontend wants more progress about an element being processed in a Dataflow pipeline, custom progress reporting will need to be built into the Pipline.
One idea, is to write progress updates to a sink table and output molecules to that at various parts of the pipeline. I.e. have a BigQuery sink where you write rows like ["element_idX", "PHASE-1 DONE"]. Then a frontend can query for those results. (I would avoid overwriting old rows personally, but many approaches can work).
You cand do this by consuming the PCollection in both the new sink, and your pipeline's next step.
2)
Is your Microservice architecture using a "Pipes and filters" pipeline style approach? I.e. each service reads from a source (Kafka/RabbitMQ) and writes data out, then the next consumes it?
Probably the best way to do setup one a few different Dataflow pipelines, and output their results using a Pub/Sub or Kafka sink, and have the next pipeline consume that Pub/Sub sink. You may also wish to sink them to a another location like BigQuery/GCS, so that you can query out these results again if you need to.
There is also an option to use Cloud Functions instead of Dataflow, which have Pub/Sub and GCS triggers. A microservice system can be setup with several Cloud Functions.

push logs in S3 to dynamoDB continuously

we have our application logs pumped to S3 via Kinesis Firehose. we want this data to also flow to DynamoDB so that we can efficiently query the data to be presented in web UI (Ember app). need for this is so that users are able to filter and sort the data and so on. basically to support querying abilities via web UI.
i looked into AWS Data pipeline. this is reliable but more tuned to one time imports or scheduled imports. we want the flow of data from s3 to dynamoDB to be continuous.
what other choices are out there to achieve this? moving data from S3 to dynamoDB isn't a very unique requirement. so how have you solved this problem?
Is an S3 event triggered lambda an option? if yes, then how to make this lambda fault tolerant?
For Full Text Querying
You can design your solution as follows for better querying using AWS Elasticsearch as the destination for rich querying.
Setup Kinesis Firehouse Destination to Amazon Elastic Search. This will allow you to do full text querying from your Web UI.
You can choose to either back up failed records only or all records. If you choose all records, Kinesis Firehose backs up all incoming source data to your S3 bucket concurrently with data delivery to Amazon Elasticsearch. 
For Basic Querying
If you plan to use DynamoDB to store the metadata of logs its better to configure S3 Trigger to Lambda which will retrieve the file and update the metadata to DynamoDB.
Is an S3 event triggered lambda an option?
This is definitely an option. You can create a PutObject event on your S3 bucket and have it call your Lambda function, which will invoke it asynchronously.
if yes, then how to make this lambda fault tolerant?
By default, asynchronous invocations will retry twice upon failure. To ensure fault-tolerance beyond the two retries, you can use Dead Letter Queues and send the failed events to an SQS queue or SNS topic to be handled at a later time.