I have an SQS queue that my application constantly sends messages to (about 5-15 messages per second).
I need to take the message data and put it in Redshift.
Right now, I have a background service that gets X messages from the queue every Y minutes, writes them to a file in S3, and loads the data into Redshift using the COPY command.
This implementation has some problems:
In my service, I get X messages at a time, and because of SQS limits, Amazon allows receiving at most 10 messages per call (meaning that if I want to get 1000 messages, I need to make 100 network calls).
My service doesn't scale as the application scales: when there are 30 (or 300) messages per second, my service won't be able to handle all the messages.
Using AWS Firehose is a little inconvenient the way I see it, because shards are not scalable (I would need to add shards manually), but maybe I'm wrong here.
...
As a result of these things, I need something that is as scalable and efficient as possible. Any ideas?
For the purpose you have described, I think AWS would say that Kinesis Data Streams plus Kinesis Data Firehose is a more appropriate service than SQS.
Yes, like you said, you do have to configure the shards. But a single shard can handle up to 1,000 incoming records/sec (or 1 MB/sec, whichever limit is reached first). There are also ways to automate the scaling, as AWS has documented.
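To get a feel for how far a shard goes, here is a rough sizing sketch. The function name and the average record size are assumptions made for illustration; the per-shard write limits are the 1,000 records/sec and 1 MB/sec figures (the latter treated as 10**6 bytes here).

```python
import math

# Per-shard write limits for a Kinesis data stream.
RECORDS_PER_SEC_PER_SHARD = 1000
BYTES_PER_SEC_PER_SHARD = 1_000_000  # assumption: 1 MB taken as 10**6 bytes


def shards_needed(records_per_sec: float, avg_record_bytes: int) -> int:
    """Return the smallest shard count that satisfies both write limits."""
    by_count = math.ceil(records_per_sec / RECORDS_PER_SEC_PER_SHARD)
    by_bytes = math.ceil(
        records_per_sec * avg_record_bytes / BYTES_PER_SEC_PER_SHARD
    )
    return max(1, by_count, by_bytes)


# The rates mentioned in the question, assuming ~2 KB per message.
for rate in (15, 30, 300):
    print(rate, "msg/s ->", shards_needed(rate, 2048), "shard(s)")
```

Under that assumed message size, even 300 messages per second stays within one shard, so manual resharding would only matter at much higher volumes.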
One further advantage of using Kinesis Data Firehose is you can create a delivery stream which pushes the data straight into Redshift if you wish.
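On the producer side, the application would write to the delivery stream instead of to SQS. A minimal sketch, assuming a boto3 Firehose client is passed in and that events are sent as newline-delimited JSON (both are assumptions, as are the function names). `put_record_batch` accepts up to 500 records per call; this sketch does not retry rejected records or check the request size limit.

```python
import json
from typing import Any, Iterable, Iterator, List

MAX_RECORDS_PER_CALL = 500  # PutRecordBatch limit per request


def chunked(items: Iterable[Any], size: int) -> Iterator[List[Any]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[Any] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def send_events(firehose, stream_name: str, events: Iterable[dict]) -> int:
    """Send events in batches; return how many records Firehose rejected."""
    failed = 0
    for batch in chunked(events, MAX_RECORDS_PER_CALL):
        # One JSON document per line (assumed format for the later COPY).
        records = [
            {"Data": (json.dumps(event) + "\n").encode("utf-8")}
            for event in batch
        ]
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name, Records=records
        )
        failed += response["FailedPutCount"]
    return failed
```

Usage would look like `send_events(boto3.client("firehose"), "my-delivery-stream", events)`, where the stream name is a placeholder.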
Related
I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (e.g. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/GB/mo. or less) after seven days using the S3 Sink Connector, and eventually move them progressively down the chain of S3 storage tiers.
However, I don't understand how, if I need to replay a topic from the beginning to restore a service, Kafka would get that data back on demand from S3. I know I can use a source connector, but it seems that is only for setting up a new topic, not for pulling data back into an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back to. But you may want to refer to the storage configuration properties regarding the relationship between topics.dir and the topic.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
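A minimal sketch of that approach, assuming the archived objects are newline-delimited JSON and that `producer` is something like a kafka-python `KafkaProducer` (it only needs `send(topic, value=...)` and `flush()`). The function name is made up for illustration.

```python
from typing import Iterable


def replay_lines(lines: Iterable[bytes], producer, topic: str) -> int:
    """Send each non-empty line as one Kafka message; return the count sent."""
    sent = 0
    for line in lines:
        payload = line.strip()
        if not payload:
            continue  # skip blank lines between records
        producer.send(topic, value=payload)
        sent += 1
    producer.flush()  # block until buffered messages are delivered
    return sent


# Possible wiring (bucket, key, and topic names are placeholders):
#   body = boto3.client("s3").get_object(Bucket="archive", Key=key)["Body"]
#   replay_lines(body.iter_lines(), KafkaProducer(bootstrap_servers="..."), "events")
```

To preserve the original ordering, the objects would have to be read back in offset order per partition; this sketch leaves that to the caller.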
Keep in mind for your recovery cost calculations that reads get more expensive the colder the S3 storage tier is.
You may also want to follow the development of Kafka's native tiered storage support (or, similarly, look at Apache Pulsar as an alternative).
I have a process that uploads files to S3. The rate at which these files are pumped to S3 is not constant. Another process needs to look at the latest files uploaded to this bucket and update, say, a watermark. We need a best-effort strategy to make this "latest file" information available as soon as possible.
S3 has event notification integration with SNS/SQS. Since I don't need fan-out, I thought I could simply do an S3 -> SQS integration. But on digging deeper into SQS, I see that although there is no limit on the number of SQS queues you can have per account (I would need quite a lot of queues if I were to assign one queue per partition in S3), there is a limit on the number of messages you can receive per call: 10.
I could set up one queue per partition, i.e. Q1 for root/child1, Q2 for root/child2, etc., but the number of files pumped into each of these child folders could itself be massive. In that case, instead of draining everything in the queue JUST to get the latest file in the child directory, is there any other mechanism I could apply?
Note: I am not 100% done with my POC and I certainly don't have metrics yet. But given the trade-offs of long polling (the longer you wait, the greater the delay in getting the latest file information, so short polling is probably what I should use, but then a request may not reach all SQS servers, so I would need multiple calls to get the latest event out of SQS; I need to find a balance there), the limit of 10 messages per call, and so on, I doubt I am using the right tool for the problem. Am I missing something? Or am I terribly wrong about SQS?
I have yet to experiment with SNS. Does it do rate limiting for events, along the lines of "if there are 10000 events per minute I will only send you the latest one"?
Please let me know the best way to get the latest file uploaded to S3 when the upload rate is high.
I am developing an audit trail system that will act as a central location for all the critical events happening around the organization. I am planning to use Amazon SQS as a temporary queue to hold the messages, which in turn will trigger an AWS Lambda function that writes the messages to S3. I want to segregate the data at the tenantId level (some identifiable ID) and persist the messages to S3 in batches, which will reduce the number of calls from Lambda to S3. Moreover, I want to trigger the Lambda every hour. But I have two issues here: first, the maximum batch size provided by SQS is 10; second, the Lambda trigger polls SQS on a regular basis, which is going to increase the number of calls to S3. I want to build a manual batch of, say, 1000 messages before calling the S3 batch API. I am not sure how to architect my system so that the above requirements are met. Any help or ideas would be much appreciated!
Simplified Architecture:
Thanks!
I would recommend that you instead use Amazon Kinesis Data Firehose. It basically does what you're wanting to do:
Accepts incoming messages
Buffers them for a period of time
Writes output to S3 or Elasticsearch
This is all done as a managed service, and it can also integrate with AWS Lambda to provide custom processing (e.g. filtering out certain records).
However, you might have to do something special to segregate the data at tenantId. See: Can I customize partitioning in Kinesis Firehose before delivering to S3?
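One possible shape for that "something special" is a small regrouping step, for example in a Lambda that runs over each buffered batch. This is only a sketch under assumptions: every message carries a `tenantId` field, the key layout and the `audit` prefix are invented for illustration, and the actual upload (one `put_object` per key) is left out.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List


def group_by_tenant(messages: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group messages by their tenantId, preserving arrival order."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for msg in messages:
        groups[msg["tenantId"]].append(msg)
    return dict(groups)


def to_s3_objects(
    messages: Iterable[dict], now: datetime, prefix: str = "audit"
) -> Dict[str, bytes]:
    """Return {s3_key: body}, one newline-delimited JSON object per tenant."""
    hour_path = now.strftime("%Y/%m/%d/%H")
    stamp = now.strftime("%Y%m%dT%H%M%S")
    objects: Dict[str, bytes] = {}
    for tenant, msgs in group_by_tenant(messages).items():
        key = f"{prefix}/tenantId={tenant}/{hour_path}/batch-{stamp}.json"
        body = "".join(json.dumps(m) + "\n" for m in msgs)
        objects[key] = body.encode("utf-8")
    return objects


batch = [
    {"tenantId": "acme", "event": "login"},
    {"tenantId": "globex", "event": "export"},
    {"tenantId": "acme", "event": "logout"},
]
for key in to_s3_objects(batch, datetime.now(timezone.utc)):
    print(key)
```

With this layout, a batch of 1000 messages results in one S3 write per tenant rather than one per message.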
I am looking for suggestions/solutions for implementing an archiving workflow at big data scale.
The source of the data is messages in Kafka, which is written to in real time. The destination is an S3 bucket.
I need to partition the data based on a field in the message. For each partition I need to batch the data into 100 MB chunks and then upload them.
The data rate is ~5 GB/minute, so a 100 MB batch should fill within a couple of seconds.
My trouble is around scaling and batching. Since I need to batch and compress data per "field" in the message, I need to bring that part of the data together by partitioning. Any suggestions on tech/workflow?
You can use Kafka Connect. There's a connector for S3:
http://docs.confluent.io/current/connect/connect-storage-cloud/kafka-connect-s3/docs/s3_connector.html
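As a rough idea of what the connector setup could look like, here is a sketch that builds the JSON payload you would POST to the Kafka Connect REST API (`/connectors`). The connector name, topic, bucket, region, and field name are placeholders. Note that `flush.size` counts records rather than bytes, so the 100 MB target is only approximated here from an assumed average record size.

```python
import json

TARGET_CHUNK_BYTES = 100 * 1024 * 1024  # the 100 MB batch size from the question


def build_sink_config(
    topic: str, bucket: str, field: str, avg_record_bytes: int
) -> dict:
    """Build a Connect payload for an S3 sink partitioned by a record field."""
    # flush.size is a record count, so derive it from the byte target.
    flush_size = max(1, TARGET_CHUNK_BYTES // avg_record_bytes)
    return {
        "name": "s3-archive-sink",  # hypothetical connector name
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": topic,
            "tasks.max": "8",  # placeholder; scale with topic partitions
            "s3.bucket.name": bucket,
            "s3.region": "us-east-1",  # placeholder region
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "s3.compression.type": "gzip",
            "partitioner.class": (
                "io.confluent.connect.storage.partitioner.FieldPartitioner"
            ),
            "partition.field.name": field,
            "flush.size": str(flush_size),
        },
    }


print(json.dumps(build_sink_config("events", "my-archive", "customer_id", 1024), indent=2))
```

The field partitioner needs record values that actually expose the named field, so the converter settings on the worker matter; those are not shown here.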
You can use Apache Spark to handle the scaling and batching for you. The flow can look like this:
Apache Kafka -> Apache Spark -> Amazon S3.
The Spark Streaming API enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Data can be ingested from many sources like Kafka and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems like Amazon S3.
I use an AWS Kinesis stream with several shards. The partition keys I set when I put records in the stream are not constant, so that the records are spread across all the shards.
To be sure that every shard is used, how can I monitor the activity of the shards?
I saw that with the enhanced level of AWS CloudWatch, the Kinesis metrics can be split by shard. I don't have that enabled, and since my need is only occasional, I don't want to pay for it.
You can enable shard level metrics when you want, then disable when you don't need to. Although you specified that you did not want this solution, this is by far the best way.
On the consumer side, you can use custom logging. For each record batch processed in your IRecordProcessor implementation, you can count the incoming records for each shard. Sample code here. You can even add third-party metrics platforms (such as Prometheus).
You can customize the producer and log the PutRecord responses. Each response tells you which shard the record was placed in. See the AWS documentation for details.
Generally, if you have a problem with non-uniform data distribution between your shards, the best way is to use a random partition key when sending data from Kinesis producer applications.
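A minimal sketch of the producer-side option, assuming a boto3 Kinesis client is passed in (the function name is made up). It uses a random partition key per record and tallies the `ShardId` returned by each `put_record` call, so shards that never show up in the tally are the ones not receiving data.

```python
import uuid
from collections import Counter
from typing import Iterable, Optional


def put_and_count(
    kinesis,
    stream_name: str,
    payloads: Iterable[bytes],
    counts: Optional[Counter] = None,
) -> Counter:
    """Put each payload with a random partition key; tally records per shard."""
    counts = Counter() if counts is None else counts
    for data in payloads:
        response = kinesis.put_record(
            StreamName=stream_name,
            Data=data,
            PartitionKey=str(uuid.uuid4()),  # random key spreads records out
        )
        counts[response["ShardId"]] += 1
    return counts


# Possible usage (stream name is a placeholder):
#   counts = put_and_count(boto3.client("kinesis"), "my-stream", payloads)
#   for shard_id, n in sorted(counts.items()):
#       print(shard_id, n)
```

The same kind of counter could be fed from the consumer side instead, keyed by the shard each record processor is assigned.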