How to add read-after-update consistency to AWS S3

I'm trying to use S3 as a datastore for an application (for reasons that are not relevant to this question, I can't use a proper DBMS).
My issue is that I want read-after-update consistency: if I try to read an object immediately after updating it, I want to be guaranteed to get the updated object.
S3 guarantees read-after-write consistency (i.e. if you try to read an object immediately after creating it, you're guaranteed to succeed).
However, it doesn't guarantee read-after-update consistency (i.e. if you try to read an object immediately after updating it, you may get a previous version of the object).
Additionally, it has some more caveats.
What can I add to my solution so that it would have read-after-update consistency?
I've thought of a couple of options:
A write-through cache: each write updates the cache first (I'm thinking of using MongoDB, since I'm most familiar with it and it has fast reads for recent objects), then writes to S3. Each read tries the cache first and falls back to S3 if the object isn't found.
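A rough Java sketch of what I have in mind (the bucket name, database/collection names and default clients are just placeholders, nothing I've settled on):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.ReplaceOptions;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class WriteThroughStore {
    private static final String BUCKET = "my-bucket";              // placeholder
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final MongoCollection<Document> cache =
            MongoClients.create().getDatabase("cache").getCollection("objects");

    public void put(String key, String json) {
        // Write the cache first so an immediate read sees the new value...
        cache.replaceOne(eq("_id", key),
                new Document("_id", key).append("body", json),
                new ReplaceOptions().upsert(true));
        // ...then write through to S3, which stays the system of record.
        s3.putObject(BUCKET, key, json);
    }

    public String get(String key) {
        Document hit = cache.find(eq("_id", key)).first();
        if (hit != null) {
            return hit.getString("body");
        }
        // Cache miss: fall back to S3 (which may still serve an older version).
        return s3.getObjectAsString(BUCKET, key);
    }
}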
However, this solution doubles the chances of failure (i.e. if either MongoDB or S3 is down, my whole database is down). Ideally I'd like to have only S3 as a point of failure.
Always write new objects, never update: I thought I could work around this by exploiting the read-after-write guarantee; instead of creating object /my-bucket/obj-id.json, I'll create /my-bucket/obj-id/v1.json. When I update, I'll just create v2, and so on.
When reading, I'll need to list the keys under the obj-id prefix and choose the latest version.
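Something along these lines (again, all names are placeholders, and I'm ignoring paging of the listing for brevity):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.S3ObjectSummary;

import java.util.Comparator;

public class VersionedReader {
    private static final String BUCKET = "my-bucket";              // placeholder
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Reads the newest version under obj-id/, e.g. obj-id/v1.json, obj-id/v2.json, ...
    public String readLatest(String objId) {
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName(BUCKET)
                .withPrefix(objId + "/v");
        return s3.listObjectsV2(req).getObjectSummaries().stream()
                .map(S3ObjectSummary::getKey)
                .max(Comparator.comparingInt(VersionedReader::versionOf))
                .map(key -> s3.getObjectAsString(BUCKET, key))
                .orElse(null);
    }

    // Extracts N from ".../vN.json".
    private static int versionOf(String key) {
        return Integer.parseInt(
                key.substring(key.lastIndexOf("/v") + 2).replace(".json", ""));
    }
}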
However, one of the above caveats is:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
Is there any way to "cheat" S3's eventual consistency? Or am I stuck with a cache component?
Or maybe there's a third, better, idea?
thanks!

Related

How to prevent duplicate keys with S3?

How is it possible to implement, in application logic, a way to prevent duplicate keys, given the eventually consistent nature of S3?
So one way to check if a key exists is:
public boolean exists(String path, String name) {
    try {
        s3.getObjectMetadata(bucket, getS3Path(path) + name);
    } catch (AmazonServiceException e) {
        // Note: this also returns false for errors other than 404 (e.g. 403 or 500).
        return false;
    }
    return true;
}
Is there a guarantee that, when we gate our application logic with this check, it will always correctly report whether the key exists, given (again) the eventual consistency of S3? Say two requests come in with exactly the same key/path: would one of them get the response that the key exists (i.e. exists() == true), or would both objects be stored, just as different versions?
I would like to point out that I am using S3 as document storage (similar to a JSON store).
That code won't work as intended.
The first time you ever call s3.getObjectMetadata(...) on a key that S3 has never seen before, it will correctly tell you that there's no such key. However, if you then upload an object with that key and call s3.getObjectMetadata(...) again, you may still see S3 telling you that there's no such key.
This is documented on the Introduction to Amazon S3: Amazon S3 data consistency model page:
Amazon S3 provides read-after-write consistency for PUTS of new objects in your S3 bucket in all Regions with one caveat. The caveat is that if you make a HEAD or GET request to a key name before the object is created, then create the object shortly after that, a subsequent GET might not return the object due to eventual consistency.
There's no way to do exactly what you describe with S3 alone. You need a strongly consistent data store for that kind of query. Something like DynamoDB (with strongly consistent reads), RDS, etc.
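For instance, here's a minimal sketch of guarding key creation with a DynamoDB conditional put; the table name and attribute name are made up for illustration:

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.PutItemRequest;

import java.util.Map;

public class KeyRegistry {
    // Hypothetical table whose partition key is "objectKey".
    private static final String TABLE = "s3-object-keys";
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Returns true if this request claimed the key, false if it already exists.
    public boolean claim(String path, String name) {
        try {
            dynamo.putItem(new PutItemRequest()
                    .withTableName(TABLE)
                    .withItem(Map.of("objectKey", new AttributeValue(path + name)))
                    // Strongly consistent "put if absent": rejected if the key is taken.
                    .withConditionExpression("attribute_not_exists(objectKey)"));
            return true;
        } catch (ConditionalCheckFailedException e) {
            return false;   // another request already claimed this key
        }
    }
}

Only the request for which claim(...) returns true would go on to write the object to S3; the other one learns the key is a duplicate without ever asking S3.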
Alternatively, if you want to try to use just S3, there's one thing you might be able to do, depending on the specifics of your problem. If you have the liberty to choose the key you'll use to write the object to S3, and if you know the full contents of the object you'll write, you could use keys that are a hash of the object's contents. Hash collisions apart, a given key will then only ever exist in S3 if that exact piece of data is there, because for a given piece of data there's only one possible name.
The write operation would become idempotent. Here's why: if you check for existence and it returns false, you can write the object. If the "return false" was due to eventual consistency, it probably isn't an issue, because all you'd be doing is overwriting an object with the exact same contents, which is almost a no-op (the exception being if you trigger jobs when objects are written; you'd need to check the idempotency of those, too).
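A minimal sketch of that idea (the "objects/" key prefix is just an illustration):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class ContentAddressedWriter {
    private static final String BUCKET = "my-bucket";        // placeholder
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

    // Writes the document under a key derived from its contents and returns that key.
    public String write(String json) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(json.getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) {
            hex.append(String.format("%02x", b));
        }
        String key = "objects/" + hex;
        // Even if an eventually-consistent "doesn't exist" answer makes us write twice,
        // we only ever overwrite the key with identical bytes, so the write is idempotent.
        s3.putObject(BUCKET, key, json);
        return key;
    }
}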
However, that solution may not be applicable to your case. If it isn't, then you'll need to use a strongly consistent storage system for metadata.
Using another "S3-compatible" service, like Wasabi, solves this problem, as stated in this article:
Wasabi also uses a data consistency model that means any operation followed by another operation will always give the same results. This Wasabi data consistency approach is in contrast to the Amazon S3 model which is "eventually consistent" in that you may get different results in two requests.

Avoid two-phase commits in a event sourced application saving BLOB data

Let's assume we have an Aggregate User which has a UserPortraitImage and a Contract as a PDF file. I want to store files in a dedicated document-based store and just hold process-relevant data in the event (with a link to the BLOB data).
But how do I avoid a two-phase commit when I have to store the files and store the new event?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback.
But could there be an additional problem?
The next question is how to design the aggregate and the event. If the aggregate only holds a reference to the BLOB storage, what is the process after a SignUp command got called?
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Is there a better design which unburdens the aggregate of knowing that BLOB data is saved in another store? And who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Sounds like you are working with something analogous to an AtomPub media-entry/media-link-entry pair: the blob goes into your data store, and the metadata gets copied into the aggregate history.
But how do I avoid a two-phase commit when I have to store the files and store the new event?
In practice, you probably don't.
That is to say, if the blob store and the aggregate store happen to be the same database, then you can update both in the same transaction. That couples the two stores, and adds some pretty strong constraints to your choice of storage, but it is doable.
Another possibility is that you accept that the two changes that you are making are isolated from one another, and therefore that for some period of time the two stores are not consistent with each other.
In this second case, the saga pattern is what you are looking for, and it is exactly what you describe; you pair the first action with a compensating action to take if the second action fails. So "manual" rollback.
Or not - in a sense, the git object database uses a two-phase commit; an object gets copied into the object store, and then the trees get updated, and then the commit... garbage collection comes along later to discard the objects that you don't need.
who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Well, ultimately it is an infrastructure concern; does your model actually need to interact with the document, or is it just carrying a claim check that can be redeemed later?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback. But could there be an additional problem?
Not that I can think of, aside from wasted disk space. That's what I typically do when I want to avoid distributed transactions or when they're not available across the two types of data stores. Oftentimes, one of the two operations is less important and you can afford to let it complete even if the master operation fails later.
Cleaning up botched attempts can be done during exception handling, as an out-of-band process or as part of a Saga as @VoiceOfUnreason explained.
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Yes. Usually the Application layer component (the Command handler, in your case) acts as a coordinator between the different data stores and gets back everything it needs to know from one store before talking to the other or to the Domain.
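To make that concrete, here's a rough sketch of such a command handler; all of the type and method names are hypothetical, and the cleanup in the catch block is the compensating action / Saga step mentioned above:

record User(String id, String portraitRef, String contractRef) {}
record SignUpCommand(String userId, byte[] portraitImage, byte[] contractPdf) {}

interface DocumentStore { String save(byte[] bytes); void delete(String ref); }
interface UserRepository { void save(User user); }   // appends the resulting event(s)

public class SignUpCommandHandler {
    private final DocumentStore documents;
    private final UserRepository users;

    public SignUpCommandHandler(DocumentStore documents, UserRepository users) {
        this.documents = documents;
        this.users = users;
    }

    public void handle(SignUpCommand cmd) {
        // 1. Store the BLOBs first; if this step fails, the command simply fails.
        String portraitRef = documents.save(cmd.portraitImage());
        String contractRef = documents.save(cmd.contractPdf());
        try {
            // 2. The aggregate only carries the references ("claim checks"),
            //    never the documents themselves.
            users.save(new User(cmd.userId(), portraitRef, contractRef));
        } catch (RuntimeException e) {
            // Compensating action ("manual" rollback): discard the orphaned documents,
            // or leave them for an out-of-band cleanup job.
            documents.delete(portraitRef);
            documents.delete(contractRef);
            throw e;
        }
    }
}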

Is there optimistic locking in AWS S3?

I have an Excel file in S3. Since different programs read and write it, I need to guarantee that each of them writes to the version it read.
S3 only guarantees read-after-write consistency for newly created objects, and eventual consistency for overwriting and deleting objects. If your Excel file is small enough (less than 400 KB), you could store it in a binary attribute of a DynamoDB item and use conditional updates on a version attribute to ensure read-after-write consistency for the file. Otherwise, if the file is bigger than 400 KB, you could upload each version of the file to a new key in S3 and then track the S3 URL of the latest version in a versioned DynamoDB item.
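A rough sketch of that second option, keeping a pointer to the latest S3 key in a versioned DynamoDB item (the table, key and attribute names are made up):

import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.ConditionalCheckFailedException;
import com.amazonaws.services.dynamodbv2.model.UpdateItemRequest;

import java.util.Map;

public class LatestVersionPointer {
    // Hypothetical table whose partition key is "fileId".
    private static final String TABLE = "excel-file-pointers";
    private final AmazonDynamoDB dynamo = AmazonDynamoDBClientBuilder.defaultClient();

    // Points fileId at a freshly uploaded S3 key, but only if nobody updated it since we read it.
    public boolean publish(String fileId, String newS3Key, long versionRead) {
        try {
            dynamo.updateItem(new UpdateItemRequest()
                    .withTableName(TABLE)
                    .withKey(Map.of("fileId", new AttributeValue(fileId)))
                    .withUpdateExpression("SET s3Key = :k, #v = :next")
                    // Optimistic lock: only succeed if the version we read is still current.
                    .withConditionExpression("#v = :read")
                    .withExpressionAttributeNames(Map.of("#v", "version"))
                    .withExpressionAttributeValues(Map.of(
                            ":k", new AttributeValue(newS3Key),
                            ":next", new AttributeValue().withN(Long.toString(versionRead + 1)),
                            ":read", new AttributeValue().withN(Long.toString(versionRead)))));
            return true;
        } catch (ConditionalCheckFailedException e) {
            return false;   // someone else published a newer version; re-read and retry
        }
    }
}

Each writer would read the item with a strongly consistent read to learn the current s3Key and version, download and edit that file, upload the result to a brand-new S3 key, and then call publish(...); a false return value means another program got there first.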
This is not possible with S3.
Specifically, it is impossible to determine conclusively and authoritatively whether the version currently visible to you is already in the process of being overwritten, or has been very recently overwritten, because an overwrite in progress does not disturb the current version until it is complete (or a short time later), due to the eventual consistency model of overwrites.
This is even true when bucket versioning is not enabled. It is sometimes possible to overwrite an object and still download the previous version for a brief time after the overwrite is complete.
GET and HEAD and ListObjects are all eventually consistent.
Since December 2020, S3 is strongly consistent, and you can use the standard HTTP If-Match header to implement optimistic locking.
https://aws.amazon.com/es/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

Is listing Amazon S3 objects a strong consistency operation or eventual consistency operation?

I know how consistency works when you create/update/delete a file on S3.
What about S3 bucket listing operation?
Is it strongly consistent or eventually consistent?
The List Objects operation appears to be eventually consistent, even for new objects. According to a support forum post from an AWS employee, ChrisP@AWS:
Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated.
— https://forums.aws.amazon.com/thread.jspa?messageID=687028&#687028
The preceding answer from @Michael was accurate before the latest announcement from AWS.
AWS has recently announced that both read-after-write and list operations are now strongly consistent.
Snippet from AWS: "After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected."
Reference - https://aws.amazon.com/s3/consistency/

riak backup solution for a single bucket

What are your recommendations for solutions that allow backing up [either by streaming or snapshot] a single riak bucket to a file?
Backing up just a single bucket is going to be a difficult operation in Riak.
All of the solutions will boil down to the following two steps:
List all of the objects in the bucket. This is the tricky part, since there is no "manifest" or a list of contents of any bucket, anywhere in the Riak cluster.
Issue a GET to each one of those objects from the list above, and write it to a backup file. This part is generally easy, though for maximum performance you want to make sure you're issuing those GETs in parallel, in a multithreaded fashion, and using some sort of connection pooling.
As far as listing all of the objects, you have one of three choices.
One is to do a Streaming List Keys operation on the bucket via HTTP (e.g. /buckets/bucket/keys?keys=stream) or Protocol Buffers -- see http://docs.basho.com/riak/latest/dev/references/http/list-keys/ and http://docs.basho.com/riak/latest/dev/references/protocol-buffers/list-keys/ for details. Under no circumstances should you do a non-streaming regular List Keys operation. (It will hang your whole cluster, and will eventually either time out or crash once the number of keys grows large enough).
Two is to issue a Secondary Index (2i) query to get that object list. See http://docs.basho.com/riak/latest/dev/using/2i/ for discussion and caveats.
And three would be if you're using Riak Search and can retrieve all of the objects via a single paginated search query. (However, Riak Search has a query result limit of 10,000 results, so, this approach is far from ideal).
For an example of a standalone app that can backup a single bucket, take a look at Riak Data Migrator, an experimental Java app that uses the Streaming List Keys approach combined with efficient parallel GETs.
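As a rough illustration of those two steps over the HTTP interface (the host, port, output format and the regex-based handling of the streamed chunks are all simplifications; a real tool should use the client libraries or at least a proper JSON parser):

import java.io.PrintWriter;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleBucketBackup {
    private static final String RIAK = "http://127.0.0.1:8098";   // assumed local node
    private final HttpClient http = HttpClient.newHttpClient();

    public void backup(String bucket, String outFile) throws Exception {
        // Step 1: streaming list-keys (never the non-streaming variant, as noted above).
        HttpResponse<String> listing = http.send(
                HttpRequest.newBuilder(URI.create(
                        RIAK + "/buckets/" + bucket + "/keys?keys=stream")).build(),
                HttpResponse.BodyHandlers.ofString());

        // The stream is a sequence of {"keys":[...]} chunks; this regex is only for illustration.
        List<String> keys = new ArrayList<>();
        Matcher m = Pattern.compile("\"keys\":\\[([^\\]]*)\\]").matcher(listing.body());
        while (m.find()) {
            for (String k : m.group(1).split(",")) {
                String key = k.replace("\"", "").trim();
                if (!key.isEmpty()) keys.add(key);
            }
        }

        // Step 2: GET every object in parallel and append it to the backup file.
        try (PrintWriter out = new PrintWriter(outFile, StandardCharsets.UTF_8)) {
            List<CompletableFuture<String>> fetches = new ArrayList<>();
            for (String key : keys) {
                String url = RIAK + "/buckets/" + bucket + "/keys/"
                        + URLEncoder.encode(key, StandardCharsets.UTF_8);
                fetches.add(http.sendAsync(HttpRequest.newBuilder(URI.create(url)).build(),
                                HttpResponse.BodyHandlers.ofString())
                        .thenApply(HttpResponse::body));
            }
            for (int i = 0; i < keys.size(); i++) {
                out.println(keys.get(i) + "\t" + fetches.get(i).join());
            }
        }
    }
}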
The Basho contrib collection has an Erlang solution for backing up a single bucket. It is a custom function, but it should do the trick.
http://contrib.basho.com/bucket_exporter.html
As far as I know, there's no automated solution for backing up a single bucket in Riak. You'd have to use the riak-admin command-line tool to back up a single physical node. You could write something yourself to retrieve all the keys in a single bucket, using a low r value (r = 1) if you want it to be fast rather than safe.
Buckets are a logical namespace; all of the keys are stored in the same bitcask structure. That's why the only way to back up just a single bucket is to write a tool to stream the keys yourself.