How to prevent deletion of a blob?

Is it possible to make a particular blob non-deletable? If so, how? I want one blob to stay forever, and no one should be able to delete it.

Solution 1:
Try making the blob immutable.
Immutable storage for Azure Blob Storage enables users to store business-critical data in a WORM (Write Once, Read Many) state. While in a WORM state, data cannot be modified or deleted for a user-specified interval. By configuring immutability policies for blob data, you can protect your data from overwrites and deletes.
Immutable storage for Azure Blob Storage supports two types of immutability policies:
1) Time-based retention policies: with a time-based retention policy, users can set policies to store data for a specified interval. When a time-based retention policy is set, objects can be created and read, but not modified or deleted. After the retention period has expired, objects can be deleted but not overwritten. For more details, refer to Time-based retention policies for immutable blob data.
2) Legal hold policies: a legal hold stores immutable data until the legal hold is explicitly cleared. When a legal hold is set, objects can be created and read, but not modified or deleted. For more details, refer to Legal holds for immutable blob data.
For more details, refer to this document.
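As a rough illustration of the WORM semantics described above (this models only the policy rules, not the Azure SDK; class and method names are made up for the sketch):

```python
from datetime import datetime, timedelta, timezone

class WormBlob:
    """Models time-based retention: always readable; no modify/delete
    until the retention interval expires; no overwrite even after expiry."""

    def __init__(self, data: bytes, retention: timedelta):
        self.data = data
        self.expires = datetime.now(timezone.utc) + retention

    def read(self) -> bytes:
        return self.data

    def delete(self):
        if datetime.now(timezone.utc) < self.expires:
            raise PermissionError("blob is under a time-based retention policy")
        self.data = None  # deletion is allowed once retention has expired

    def overwrite(self, data: bytes):
        # WORM: overwrites are rejected both during and after retention
        raise PermissionError("overwrites are not allowed under WORM")
```

For the question as asked ("no one should ever be able to delete it"), a legal hold is the closer fit, since it has no expiry until explicitly cleared.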
Solution 2:
Use RBAC (role-based access control) to secure access to containers.
Solution 3:
Use blob soft delete.
Blob soft delete protects an individual blob and its versions, snapshots, and metadata from accidental deletes or overwrites by retaining the deleted data in the system for a specified period of time. (Note that soft delete only lets you recover deleted data within the retention period; it does not prevent deletion outright.)
For more details, refer to this document.

Solution 4: Lease the blob. An active lease acts as a lock: the blob cannot be deleted without presenting the lease ID.
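A minimal sketch of the lease-as-lock idea (this models the semantics only; it is not the Azure SDK, and the class is made up for illustration):

```python
import uuid

class LeasedBlob:
    """Models Azure-style blob leasing: while a lease is active,
    delete requires the correct lease ID."""

    def __init__(self, data: bytes):
        self.data = data
        self.lease_id = None

    def acquire_lease(self) -> str:
        if self.lease_id is not None:
            raise RuntimeError("lease already held")
        self.lease_id = str(uuid.uuid4())
        return self.lease_id

    def delete(self, lease_id: str = None):
        if self.lease_id is not None and lease_id != self.lease_id:
            raise PermissionError("blob is leased; lease ID required")
        self.data = None
```

Keep in mind a lease is weaker protection than an immutability policy: whoever holds (or breaks) the lease can still delete the blob.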

Related

Need for metadata store while storing an object

While checking out the design of a service like pastebin, I noticed the usage of two different storage systems:
An object store (such as Amazon S3) for storing the actual "paste" data
A metadata store for other things pertaining to that "paste" data, such as the URL hash (to access that paste data) and a reference to the actual paste data.
I am trying to understand the need for this metadata store.
Is this generally the recommended way? Any specific advantage we get from using the metadata store?
Do object storage systems NOT allow metadata to be stored along with the actual object in the same storage server?
Object storage systems generally do allow quite a lot of metadata to be attached to the object.
But then your metadata is at the mercy of the object store.
Your metadata search is limited to what the object store allows.
Analysis, notification (a la inotify), etc. are limited to what the object store allows.
If you wanted to move from S3 to Google Cloud Storage, or to do both, you'd have to normalize your metadata.
Your metadata size limits are those of the object store.
You can't do cross-object-store metadata (e.g. a link that refers to multiple paste data).
You might not be able to have binary metadata.
Typically, metadata is both very important and very heavily used by the business, so it has different usage characteristics from the data, and it makes sense to put it on storage with different characteristics.
I can't find anywhere how pastebin.com makes money, so I don't know how heavily they use metadata, but merely the lookup, the translation between URL and paste data, is not something you can do securely with object storage alone.
Great answer above; just to add on, two more advantages are caching and scaling the two storage systems independently.
If you just use an object store, and say a paste is 5 MB, would you cache all of it? A metadata store also lets you improve UX by caching, say, the first 10 or 100 KB of a paste for the user to preview while the complete object is fetched in the background. This upper bound also helps you size the cache deterministically.
You can also scale the object store and the metadata store independently of each other as performance/capacity needs dictate. Lookups in the metadata store will also be quicker, since it's far less bulky.
Your concern is legitimate that separating the storage into two tables (or mediums) adds some latency, but system design is always a compromise; there is hardly ever a win-win situation.
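The split described above can be sketched like this (dicts stand in for the metadata database and the object store; names and the 100-byte preview bound are illustrative, not prescriptive):

```python
class PasteService:
    """Sketch: a metadata store (dict standing in for a database row)
    plus an object store (dict standing in for S3), with a bounded
    preview cached in metadata for fast rendering."""

    PREVIEW_BYTES = 100  # deterministic upper bound for the cache

    def __init__(self):
        self.object_store = {}    # object key -> full paste bytes
        self.metadata_store = {}  # url hash -> metadata row

    def save(self, url_hash: str, data: bytes):
        object_key = f"pastes/{url_hash}"
        self.object_store[object_key] = data
        self.metadata_store[url_hash] = {
            "object_key": object_key,              # reference to the blob
            "size": len(data),
            "preview": data[:self.PREVIEW_BYTES],  # cheap-to-cache preview
        }

    def preview(self, url_hash: str) -> bytes:
        # served from metadata only; the full object can be fetched lazily
        return self.metadata_store[url_hash]["preview"]

    def fetch(self, url_hash: str) -> bytes:
        return self.object_store[self.metadata_store[url_hash]["object_key"]]
```

Because the metadata row is small and fixed-shape, it can live on fast, indexed storage while the blobs live on cheap bulk storage.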

AWS S3: change storage class without replicating the object

My bucket has a replication rule to backup the object into another region/bucket.
Now I want to change the storage class of the source object (Standard -> Infrequent Access), but it seems this change, applied through the CopyObjectRequest API (Java client), triggers the replication. This is unfortunate because cross-region replication has a cost.
So at the moment the "journey" is the following:
object is stored in standard class, source bucket
I change the storage class to IA
object gets replicated into another region (standard class)
after 1 day it's moved to glacier.
As you can see this is a total waste of money, because the replication will end up moving the very same object into glacier again.
How can I avoid this scenario?
Use a lifecycle policy in the source bucket to convert the current object version to the desired storage class. This should migrate the current object without changing its version-id, and should not trigger a replication event.
Otherwise, you'd need to create objects with the desired storage class from the beginning. There isn't a way for a user action to change an object's storage class without creating a new object version, so the seemingly redundant replication event can't otherwise be avoided -- because you are creating a new object version.
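For the lifecycle-policy approach, the rule can be expressed as a configuration document like the following (the bucket name is a placeholder and the 30-day transition age is illustrative; the boto3 call needs credentials, so it is shown commented out):

```python
# Lifecycle rule: transition current object versions to STANDARD_IA
# after 30 days, without creating a new object version.
lifecycle = {
    "Rules": [
        {
            "ID": "current-versions-to-ia",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # applies to the whole bucket
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"}
            ],
        }
    ]
}

# Applying it (requires AWS credentials; "my-source-bucket" is hypothetical):
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="my-source-bucket",
#     LifecycleConfiguration=lifecycle,
# )
```

Note that S3 requires objects to be at least 30 days old before a lifecycle transition to STANDARD_IA.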

Avoid two-phase commits in a event sourced application saving BLOB data

Let's assume we have an Aggregate User which has a UserPortraitImage and a Contract as a PDF file. I want to store files in a dedicated document-based store and just hold process-relevant data in the event (with a link to the BLOB data).
But how do I avoid a two-phase commit when I have to store the files and store the new event?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback.
But could there be an additional problem?
The next question is how to design the aggregate and the event. If the aggregate only holds a reference to the BLOB storage, what is the process after a SignUp command is called?
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Is there a better design which unburdens the aggregate of knowing that BLOB data is saved in another store? And who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Sounds like you are working with something analogous to an AtomPub media-entry/media-link-entry pair. The blob goes into your data store, and the metadata gets copied into the aggregate history.
But how do I avoid a two-phase commit when I have to store the files and store the new event?
In practice, you probably don't.
That is to say, if the blob store and the aggregate store happen to be the same database, then you can update both in the same transaction. That couples the two stores, and adds some pretty strong constraints to your choice of storage, but it is doable.
Another possibility is that you accept that the two changes that you are making are isolated from one another, and therefore that for some period of time the two stores are not consistent with each other.
In this second case, the saga pattern is what you are looking for, and it is exactly what you describe; you pair the first action with a compensating action to take if the second action fails. So "manual" rollback.
Or not - in a sense, the git object database uses a two-phase commit; an object gets copied into the object store, then the trees get updated, then the commit... garbage collection comes along later to discard the objects you don't need.
who is responsible for storing BLOB data and forwarding the reference to the aggregate?
Well, ultimately it is an infrastructure concern; does your model actually need to interact with the document, or is it just carrying a claim check that can be redeemed later?
At first I'd store the documents and then the event; if the first transaction fails it doesn't matter, the command failed. If the second transaction fails it also doesn't matter even if we generated some dead files in the store, the command fails; we could even apply a rollback. But could there be an additional problem?
Not that I can think of, aside from wasted disk space. That's what I typically do when I want to avoid distributed transactions or when they're not available across the two types of data stores. Oftentimes, one of the two operations is less important and you can afford to let it complete even if the master operation fails later.
Cleaning up botched attempts can be done during exception handling, as an out-of-band process, or as part of a Saga, as @VoiceOfUnreason explained.
SignUpCommand ==> Store documents (UserPortraitImage and Contract) ==> Create new User aggregate with the given BLOB storage references and store it?
Yes. Usually the Application layer component (the Command handler in your case) acts as a coordinator between the different data stores and gets back all it needs to know from one store before talking to the other or to the Domain.
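That coordination, including the compensating action on failure, can be sketched like this (the store interfaces are assumptions for the sketch, not any particular library):

```python
class SignUpHandler:
    """Command handler sketch: store the blobs first, then append the
    event; if the event append fails, compensate by deleting the
    now-orphaned blobs. blob_store/event_store are assumed interfaces."""

    def __init__(self, blob_store, event_store):
        self.blob_store = blob_store
        self.event_store = event_store

    def handle(self, user_id: str, portrait: bytes, contract: bytes):
        # step 1: store the documents, keeping only references
        portrait_ref = self.blob_store.put(f"{user_id}/portrait", portrait)
        contract_ref = self.blob_store.put(f"{user_id}/contract", contract)
        try:
            # step 2: record the event with the blob references
            self.event_store.append({
                "type": "UserSignedUp",
                "user_id": user_id,
                "portrait_ref": portrait_ref,
                "contract_ref": contract_ref,
            })
        except Exception:
            # compensating action: remove the dead files, then re-raise
            self.blob_store.delete(portrait_ref)
            self.blob_store.delete(contract_ref)
            raise
```

The aggregate itself only ever sees the references, so it stays unaware of where the BLOB data physically lives.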

Is listing Amazon S3 objects a strong consistency operation or eventual consistency operation?

I know how consistency works when you create/update/delete a file on S3.
What about S3 bucket listing operation?
Is it strongly consistent or eventually consistent?
The List Objects operation appears to be eventually consistent, even for new objects. According to a support forum post from an AWS employee, ChrisP@AWS:
Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated.
— https://forums.aws.amazon.com/thread.jspa?messageID=687028&#687028
The preceding answer from @Michael was accurate before the latest announcement from AWS.
AWS has since announced that both read-after-write and list operations are now strongly consistent.
Snippet from AWS: "After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected."
Reference - https://aws.amazon.com/s3/consistency/

DynamoDB - How to do incremental backup?

I am using DynamoDB tables with keys and throughput optimized for application use cases. To support other ad hoc administrative and reporting use cases I want to keep a complete backup in S3 (a day old backup is OK). Again, I cannot afford to scan the entire DynamoDB tables to do the backup. The keys I have are not sufficient to find out what is "new". How do I do incremental backups? Do I have to modify my DynamoDB schema, or add extra tables just to do this? Any best practices?
Update:
DynamoDB Streams solves this problem.
DynamoDB Streams captures a time-ordered sequence of item-level modifications in any DynamoDB table, and stores this information in a log for up to 24 hours. Applications can access this log and view the data items as they appeared before and after they were modified, in near real time.
I see two options:
Generate the current snapshot. You'll have to read from the table to do this, which you can do at a very slow rate to stay under your capacity limits (Scan operation). Then, keep an in-memory list of updates performed for some period of time. You could put these in another table, but you'll have to read those, too, which would probably cost just as much. This time interval could be a minute, 10 minutes, an hour, whatever you're comfortable losing if your application exits. Then, periodically grab your snapshot from S3, replay these changes on the snapshot, and upload your new snapshot. I don't know how large your data set is, so this may not be practical, but I've seen this done with great success for data sets up to 1-2GB.
Add read throughput and backup your data using a full scan every day. You say you can't afford it, but it isn't clear if you mean paying for capacity, or that the scan would use up all the capacity and the application would begin failing. The only way to pull data out of DynamoDB is to read it, either strongly or eventually consistent. If the backup is part of your business requirements, then I think you have to determine if it's worth it. You can self-throttle your read by examining the ConsumedCapacityUnits property on your results. The Scan operation has a Limit property that you can use to limit the amount of data read in each operation. Scan also uses eventually consistent reads, which are half the price of strongly consistent reads.
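The replay step in option 1 can be sketched as a pure function (dicts stand in for the S3 snapshot and DynamoDB items; the change-log format is an assumption for the sketch):

```python
def replay_changes(snapshot: dict, changes: list) -> dict:
    """Apply a time-ordered change log to the previous snapshot.
    Each change is a (key, item_or_None) pair; None means the item
    was deleted since the last snapshot."""
    new_snapshot = dict(snapshot)  # don't mutate the old snapshot
    for key, item in changes:
        if item is None:
            new_snapshot.pop(key, None)  # deletion
        else:
            new_snapshot[key] = item     # insert or update
    return new_snapshot
```

Because changes are applied in time order, the result matches replaying each update against the table state at snapshot time.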
You can now use DynamoDB Streams to have data persisted into another table or to maintain another copy of the data in another datastore.
https://aws.amazon.com/blogs/aws/dynamodb-streams-preview/
For incremental backups, you can associate your DynamoDB Stream with a Lambda function to automatically trigger code for every data update (i.e., copy the data to another store such as S3).
A lambda function you can use to tie up with DynamoDb for incremental backups:
https://github.com/PageUpPeopleOrg/dynamodb-replicator
I've provided a detailed walk through on how you can use DynamoDB Streams, Lambda and S3 versioned buckets to create incremental backups for your data in DynamoDb on my blog:
https://www.abhayachauhan.com/category/aws/dynamodb/dynamodb-backups
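A minimal sketch of such a stream-triggered handler (the event shape follows the DynamoDB Streams Lambda payload with a NEW_IMAGE view; the `write` callback stands in for `s3.put_object` so the logic stays testable, and the key layout is illustrative):

```python
import json

def handle_stream(event, write):
    """Persist each item-level change from a DynamoDB Stream as a
    small S3 object keyed by the item's primary key.
    write(key, body) abstracts the S3 client call."""
    for record in event["Records"]:
        keys = record["dynamodb"]["Keys"]
        # deterministic object key derived from the item's primary key
        object_key = "backup/" + json.dumps(keys, sort_keys=True)
        if record["eventName"] in ("INSERT", "MODIFY"):
            write(object_key, json.dumps(record["dynamodb"]["NewImage"]))
        elif record["eventName"] == "REMOVE":
            # keep a tombstone instead of deleting, so history survives
            write(object_key, json.dumps({"deleted": True}))
```

With an S3 versioned bucket as the target (as in the walkthrough above), each write becomes a new object version, which is what makes the backup incremental.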
Alternatively, DynamoDB has just released On-Demand backups and restores. They aren't incremental, but full backup snapshots.
Check out https://www.abhayachauhan.com/2017/12/dynamodb-scheduling-on-demand-backups/ for more information.
HTH
On November 29th, 2017, On-Demand Backup was introduced. It allows you to create backups directly in DynamoDB essentially instantly, without consuming any capacity. Here are a few snippets from the blog post:
This feature is designed to help you to comply with regulatory requirements for long-term archival and data retention. You can create a backup with a click (or an API call) without consuming your provisioned throughput capacity or impacting the responsiveness of your application. Backups are stored in a highly durable fashion and can be used to create fresh tables.
...
The backup is available right away! It is encrypted with an Amazon-managed key and includes all of the table data, provisioned capacity settings, Local and Global Secondary Index settings, and Streams. It does not include Auto Scaling or TTL settings, tags, IAM policies, CloudWatch metrics, or CloudWatch Alarms.
You may be wondering how this operation can be instant, given that some of our customers have tables approaching half of a petabyte. Behind the scenes, DynamoDB takes full snapshots and saves all change logs. Taking a backup is as simple as saving a timestamp along with the current metadata for the table.
One more suggestion: if a table's hash key is an auto-incremented integer, save the key of the last record backed up, and pass it as the ExclusiveStartKey parameter of the next backup's Scan request; the Scan then resumes from that point and returns only the records after it. (Be careful: Scan order is determined by an internal hash of the partition key rather than its numeric value, so verify that new records actually sort after old ones in your table before relying on this.)
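The resume-from-saved-key pattern can be sketched like this (`scan_page(exclusive_start_key)` stands in for boto3's `table.scan(ExclusiveStartKey=...)`, which returns pages with `Items` and, while more data remains, `LastEvaluatedKey`):

```python
def incremental_scan(scan_page, start_key=None):
    """Page through a Scan starting after start_key, following
    LastEvaluatedKey until the table is exhausted."""
    items = []
    key = start_key
    while True:
        page = scan_page(key)            # one paged Scan request
        items.extend(page["Items"])
        key = page.get("LastEvaluatedKey")
        if key is None:                  # no key => last page reached
            break
    return items
```

The last item's key from one backup run becomes the `start_key` of the next run.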