I came across this problem, and so far it seems that the only solution is a stronger consistency model. The service is Amazon S3, which provides eventual consistency. We use it as our blob storage backend.
The problem is that we introduced a messaging pattern to our application, and we love it. There's no doubt about its benefits. However, it seems to demand stronger consistency. Scenario:
subsystem acquires data from user
data is saved to S3
message is sent
message is received by another subsystem
data is read from S3
...crickets. Is this the old data? Sometimes it is.
So we tried the obvious: send the data inside the message to avoid the inconsistent read from S3. But that's a pretty nasty thing to do. The messages get unnecessarily big, and when the receiver is too busy or goes down and only processes the message late, after newer data is already available, it fails.
Is there a solution to this or do we really need to dump S3 for some more consistent backend like RDBMS or MongoDB?
If your scenario allows your data to always be written to S3 under a new key (by always creating new objects), then you can rely on Amazon's read-after-write consistency.
Here is Amazon's description of this consistency model:
Amazon S3 buckets in the US West (Northern California), EU (Ireland),
Asia Pacific (Singapore), and Asia Pacific (Tokyo) Regions provide
read-after-write consistency for PUTS of new objects and eventual
consistency for overwrite PUTS and DELETES. Amazon S3 buckets in the
US Standard Region provide eventual consistency.
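One way to take advantage of that guarantee is to never overwrite an object: write each revision under a fresh key and put that key in the message, so the receiver always reads a brand-new object. Below is a minimal sketch with boto3; the bucket name, key scheme, and message shape are illustrative assumptions, not something from the original answer.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-app-blobs"  # assumed bucket name

def save_and_announce(payload: bytes, queue_send) -> str:
    """Write the payload under a brand-new key and tell the receiver which key to read."""
    # A fresh key per write means the receiver benefits from
    # read-after-write consistency for PUTs of new objects.
    key = f"blobs/{uuid.uuid4()}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=payload)

    # The message carries only the key, not the data itself.
    queue_send(json.dumps({"s3_key": key}))
    return key

def receive_and_read(message: str) -> bytes:
    """Receiver side: read exactly the object named in the message."""
    key = json.loads(message)["s3_key"]
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
```

Since each message names the exact object it refers to, a slow receiver can never accidentally pick up a newer revision, which also removes the failure mode described in the question.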
Related
I'm putting together a datastore technology that relies on a backing key-value store and has TypeScript generator code to populate an index made of tiny JSON objects.
With AWS S3 in mind as an example store, I am alarmed at the possibility of a bug in my TypeScript index generation simply continuing to write endless entries at unlimited cost. AWS has no mechanism that I know of to defend me from bankruptcy, so I don't want to use their services.
However, the volume of data might grow to megabytes or gigabytes for certain use cases, and access might be only occasional, so cheap long-term storage would be great.
What simple key-value cloud stores equivalent to S3 are there that will allow me to define limits on storage and retrieval numbers or cost? In such a service, a bug would mean I eventually start seeing e.g. 403, 429 or 507 errors as feedback that I have hit a quota, limiting my financial exposure. Preferably this would be on a per-bucket basis, rather than the whole account being frozen.
I am using S3 as a reference for the bare minimum needed to fulfil the tool's requirements, but any similar blob or object storage API (put, get, delete, list in UTF-8 order) that eventually starts rejecting requests when a quota is exceeded would be fine too.
Learning the names of qualifying systems and the terminology for their quota limit feature would give me important insights and allow me to review possible options.
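For reference, the minimal contract sketched in the question (put, get, delete, and list in UTF-8 key order, with quota exhaustion surfacing as errors) could be expressed like this. The interface below is a hypothetical illustration of those requirements, not any particular vendor's API.

```python
from typing import Iterator, Protocol

class QuotaExceededError(Exception):
    """Raised when the backing store rejects a request (e.g. HTTP 403, 429, or 507)."""

class BlobStore(Protocol):
    """Minimal key-value blob store contract described in the question."""

    def put(self, key: str, value: bytes) -> None:
        """Store value under key; raises QuotaExceededError once limits are hit."""
        ...

    def get(self, key: str) -> bytes:
        """Return the value stored under key."""
        ...

    def delete(self, key: str) -> None:
        """Remove key if it exists."""
        ...

    def list(self, prefix: str = "") -> Iterator[str]:
        """Yield keys with the given prefix in UTF-8 (lexicographic) order."""
        ...
```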
I might be thinking of this incorrectly, but we're looking to set up a connection between Kafka and S3. We are using Kafka as the backbone of our microservice event sourcing system and may occasionally need to replay events from the beginning of time in certain scenarios (i.e. building a new service, rebuilding a corrupted database view).
Instead of storing events indefinitely in AWS EBS storage ($0.10/GB/mo.), we'd like to shift them to S3 ($0.023/GB/mo. or less) after seven days using the S3 Sink Connector, and eventually keep moving them down the chain of S3 storage classes.
However, I don't understand how, if I need to replay a topic from the beginning to restore a service, Kafka would get that data back on demand from S3. I know I can utilize a source connector, but it seems that is only for setting up a new topic, not for pulling data back into an existing topic.
The Confluent S3 Source Connector doesn't dictate where the data is written back to, but you may want to refer to the storage configuration properties regarding topics.dir and its relationship to the topic.
Alternatively, write some code to read your S3 events and send them into a Kafka producer client.
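A rough sketch of that approach using boto3 and kafka-python is below; the bucket, prefix, and topic names are placeholders, and real replay code would need batching, ordering, and error handling around it.

```python
import boto3
from kafka import KafkaProducer

s3 = boto3.client("s3")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

BUCKET = "my-kafka-archive"   # assumed bucket written by the S3 sink connector
PREFIX = "topics/orders/"     # assumed topics.dir/topic layout
TOPIC = "orders-replay"       # replay target topic

# List every archived object under the topic prefix and re-publish its contents.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read()
        # Assumes one record per object; sink connectors typically pack many
        # records per file, so real code would have to split them back out.
        producer.send(TOPIC, value=body)

producer.flush()
```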
Keep in mind, for your recovery cost calculations, that reads from the colder S3 storage tiers cost progressively more.
You may also want to follow the development of Kafka's native tiered storage support (or, similarly, look at Apache Pulsar as an alternative).
One of the things I see becoming more of a problem in a microservice architecture is disaster recovery. For instance, a common pattern is to store large data objects such as multimedia in S3, while JSON data goes in DynamoDB. But what happens when a hacker manages to delete a whole bunch of data from your DynamoDB?
You also need to make sure your S3 bucket is restored to the same state it was in at that time, but are there elegant ways of doing this? The concern is that it will be difficult to guarantee that the S3 backup and the DynamoDB database are in sync.
I am not aware of a solution that performs a genuinely synchronised backup and restore across services. However, you could use DynamoDB's native point-in-time restore and the third-party S3-pit-restore library to restore both services to a common point in time.
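For the DynamoDB half, the point-in-time restore can be driven from boto3 roughly as sketched below; the table names and timestamp are placeholders, and you would feed the same timestamp to the S3 restore (e.g. via S3-pit-restore) so both stores converge on one moment.

```python
from datetime import datetime, timezone

import boto3

dynamodb = boto3.client("dynamodb")

# The moment just before the incident; reuse this exact timestamp for the
# S3 restore so both stores converge on a single point in time.
restore_point = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)

# Point-in-time recovery must already be enabled on the source table.
dynamodb.restore_table_to_point_in_time(
    SourceTableName="media-metadata",           # assumed table name
    TargetTableName="media-metadata-restored",  # PITR always restores into a new table
    RestoreDateTime=restore_point,
)
```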
Does anyone know of any real-world analysis of data loss using these two AWS S3 storage options? I know from the AWS docs (via Quora) that one is 99.999999999% guaranteed and the other only 99.99%, but I'm looking for data from a non-AWS source.
Anecdotes or something more thorough would both be great. I apologize if this isn't the right SE site for this question. Feel free to suggest a place to migrate it.
I guess it depends on the data you're storing whether you really need the 99.999999999% level of durability …
If you keep copies of your data locally and are just using S3 as a convenient place to store data that is actively being accessed by services within the AWS infrastructure, RRS might be the right choice for you :)
In my case, I keep fresh files at the normal durability level until I have created a local backup, and then move them to RRS, which saves quite a bit of money.
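Changing the storage class of an existing object is done by copying it onto itself; a minimal boto3 sketch of that move-to-RRS step is below (the bucket and key are placeholders).

```python
import boto3

s3 = boto3.client("s3")
bucket, key = "my-media-bucket", "uploads/video.mp4"  # placeholders

# Once a local backup exists, rewrite the object in place with the
# REDUCED_REDUNDANCY storage class (an in-place copy onto the same key).
s3.copy_object(
    Bucket=bucket,
    Key=key,
    CopySource={"Bucket": bucket, "Key": key},
    StorageClass="REDUCED_REDUNDANCY",
    MetadataDirective="COPY",
)
```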
Are there known limitations of S3 scaling? Anyone ever had so many simultaneous reads or writes that a bucket started returning errors? I'm a bit more interested in writes than reads because S3 is likely to be optimized for reads.
Eric's comment sums it up already on a conceptual level, as addressed in the FAQ What happens if traffic from my application suddenly spikes? as well:
Amazon S3 was designed from the ground up to handle traffic for any
Internet application. [...] Amazon S3’s massive scale enables us to
spread load evenly, so that no individual application is affected by
traffic spikes.
Of course, you still need to account for possible issues and Tune [your] Application for Repeated SlowDown errors (see Amazon S3 Error Best Practices):
As with any distributed system, S3 has protection mechanisms which
detect intentional or unintentional resource over-consumption and
react accordingly. SlowDown errors can occur when a high request rate
triggers one of these mechanisms. Reducing your request rate will
decrease or eliminate errors of this type. Generally speaking, most
users will not experience these errors regularly; however, if you
would like more information or are experiencing high or unexpected
SlowDown errors, please post to our Amazon S3 developer forum
http://developer.amazonwebservices.com/connect/forum.jspa?forumID=24
or sign up for AWS Premium Support
http://aws.amazon.com/premiumsupport/. [emphasis mine]
While rare, these slowdowns do of course happen; here is an AWS team response illustrating the issue (pretty dated by now, though):
Amazon S3 will return this error when the request rate is high enough
that servicing the requests would cause degraded service for other
customers. This error is very rarely triggered. If you do receive
it, you should exponentially back off. If this error occurs, system
resources will be reactively rebalanced/allocated to better support a
higher request rate. As a result, the time period during which this
error would be thrown should be relatively short. [emphasis mine]
Your assumption about read vs. write optimization is confirmed there as well:
The threshold where this error is triggered varies and will depend, in
part, on the request type and pattern. In general, you'll be able to
achieve higher rps with gets vs. puts and with lots of gets for a
small number of keys vs. lots of gets for a large number of keys.
When getting or putting a large number of keys you'll be able to achieve
higher rps if the keys are in alphanumeric order vs. random/hashed
order.
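In practice, the "exponentially back off" advice quoted above is something the AWS SDKs can handle for you; for example, boto3 can be configured with a capped retry policy roughly like this (the retry mode and attempt count are illustrative choices, not anything prescribed by the answers above).

```python
import boto3
from botocore.config import Config

# Retry throttling errors such as SlowDown with exponential backoff.
# "adaptive" also rate-limits the client when throttling is detected;
# "standard" mode gives plain capped exponential backoff instead.
retry_config = Config(retries={"max_attempts": 10, "mode": "adaptive"})

s3 = boto3.client("s3", config=retry_config)

# Heavy read/write loops then get automatic backoff on 503 SlowDown
# responses instead of failing on the first throttled request.
s3.put_object(Bucket="my-busy-bucket", Key="example", Body=b"payload")
```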