I am developing an Audit Trail System, that will act as a central location for all the critical events happening around the organization. I am planning to use Amazon SQS as a temporary queue to hold the messages that in turn will trigger the AWS lambda function to write the messages into AWS S3 store. I want to segregate the data at tenantId level (some identifiable id) and persist the messages as batches in S3, that will reduce the no of calls from lambda to S3. Moreover, I want to trigger the lambda every hour. But, I have 2 issues here, one the max batch size provided by SQS is 10, also the lambda trigger polls the SQS service on regular basis, that's gonna increase the no of calls to my S3. I want to create a manual batch of 1000 messages(say) before calling the S3 batch api. I am not very much sure how to architecture my system, so that above requirements can be met. Help or idea provided is very much appreciable!
Simplified Architecture:
Thanks!
I would recommend that you instead use Amazon Kinesis Data Firehose. It basically does what you're wanting to do:
Accepts incoming messages
Buffers them for a period of time
Writes output to S3 or Elasticsearch
This is all done as a managed service, and can also integrate with AWS Lambda to provide custom processing (eg filter out certain records).
However, you might have to do something special to segregate the data at tenantId. See: Can I customize partitioning in Kinesis Firehose before delivering to S3?
Related
I have different data sources and I need to publish them to S3 in real-time. I also need to process and validate data before delivering them to S3 buckets. So, I have to use AWS Lambda and validating data. The question is that what is the difference between AWS Kinesis Data Firehose and using AWS Lambda to directly store data into S3 Bucket? Clearly, what is the advantages of using Kinesis Data Firehose? because we can use AWS Lambda to directly put records into S3!
We might want to clarify near real time, as for me, it is below 1 sec.
Kinesis Firehose in this case will batch the items before delivering them into S3. This will result in more items per S3 object.
You can configured how often you want the data to be stored. (You can also connect a lambda to firehose, so you can process the data before delivering them to S3). Kinesis Firehose will scale automatically.
Note that each PUT to S3 as a cost associated to it.
If you connect your data source to AWS Lambda, then each event will trigger the lambda (unless you have a batching mechanism in place, which you didn't mention) and for each event, you will make a PUT request to S3. This will result in a lot of small object in S3 and therefore a lot of S3 PUT api.
Also, depending on the number of items received per seconds, Lambda might not be able to scale and cost associated will increase.
I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (i.e. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/Gb/mo. or less) after seven days using the S3 Sink Connector and eventually continually move them down the chain of S3 storage levels.
However, I don't understand that if I need to replay a topic from the beginning to restore a service, how would Kafka get that data back on demand from S3? I know I can utilize a source connector, but it seems that is only for setting up a new topic. Not for pulling data back from an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back into. But you may want to refer the storage configuration properties regarding topics.dir and topic relationship.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
Keep in mind, for your recovery payment calculations that reads from different tiers of S3 cost more and more.
You may also want to follow the developments of Kafka native tiered storage support (or similarly, look at Apache Pulsar as an alternative)
I am creating an app where the long running tasks are getting executed in ECS Fargate and logs are getting pushed to CloudWatch. Now, am looking for a way to give the users an ability in the UI to see those realtime live logs while their tasks are running.
I am thinking of the below approach..
Saving logs temporarily in DynamoDB
DynamoDB stream with batch will trigger a lambda.
Lambda will trigger an AWS Appsync mutation with None data source.
In UI client will subscribed to that mutation to get real time updates. (depends on the batch size, example 5 batch means 5 logs lines )
https://aws.amazon.com/premiumsupport/knowledge-center/appsync-notify-subscribers-real-time/
Is there any other techniques or methods that i can think of?
Why not use Cloudwatch default save in S3 bucket ability and add SNS to let clients choose which topic they want to trail the log. Removing extra DynamoDB.
I have a sqs queue, that my application constantly sends messages to (about 5-15 messages per second).
I need to take the messages data and put it in redshift.
Right now, I have background service which gets X messages from the queue every Y minutes, then the service put them in an s3 file, and transfer the data into redshift using the COPY command.
This implementation have some problems:
In my service, I get X messages at a time, and because of the sqs limits, amazon allow to receive only 10 messages at max at a time (meaning that if I want to get 1000 messages, I will need to make 100 network calls)
My service doesn't scale as the application scales -> when there will be 30 (or 300) messages per second, my service won't be able to handle all the messages.
Using aws firehose is a little inconvenient the way I see it, because SHARDS are not scalable (I will need to configure manually to add shards) but maybe I'm wrong here
...
A a result of those things, I need something that will be scalable and efficient as possible. any ideas?
For the purpose you have described, I think AWS would say that Kinesis Data Streams plus Kinesis Data Firehose is a more appropriate service than SQS.
Yes, like you said, you do have to configure the shards. But just one shard can handle 1000 incoming records/sec. Also there are ways to automate the scaling, for example like AWS have documented here
One further advantage of using Kinesis Data Firehose is you can create a delivery stream which pushes the data straight into Redshift if you wish.
we have our application logs pumped to S3 via Kinesis Firehose. we want this data to also flow to DynamoDB so that we can efficiently query the data to be presented in web UI (Ember app). need for this is so that users are able to filter and sort the data and so on. basically to support querying abilities via web UI.
i looked into AWS Data pipeline. this is reliable but more tuned to one time imports or scheduled imports. we want the flow of data from s3 to dynamoDB to be continuous.
what other choices are out there to achieve this? moving data from S3 to dynamoDB isn't a very unique requirement. so how have you solved this problem?
Is an S3 event triggered lambda an option? if yes, then how to make this lambda fault tolerant?
For Full Text Querying
You can design your solution as follows for better querying using AWS Elasticsearch as the destination for rich querying.
Setup Kinesis Firehouse Destination to Amazon Elastic Search. This will allow you to do full text querying from your Web UI.
You can choose to either back up failed records only or all records. If you choose all records, Kinesis Firehose backs up all incoming source data to your S3 bucket concurrently with data delivery to Amazon Elasticsearch.
For Basic Querying
If you plan to use DynamoDB to store the metadata of logs its better to configure S3 Trigger to Lambda which will retrieve the file and update the metadata to DynamoDB.
Is an S3 event triggered lambda an option?
This is definitely an option. You can create a PutObject event on your S3 bucket and have it call your Lambda function, which will invoke it asynchronously.
if yes, then how to make this lambda fault tolerant?
By default, asynchronous invocations will retry twice upon failure. To ensure fault-tolerance beyond the two retries, you can use Dead Letter Queues and send the failed events to an SQS queue or SNS topic to be handled at a later time.