AWS S3: change storage class without replicating the object

My bucket has a replication rule to back up objects into another region/bucket.
Now I want to change the storage class of the source object (Standard -> Infrequent Access), but it seems this change, applied through the CopyObjectRequest API (Java client), triggers the replication. This is unfortunate because cross-region replication has a cost.
So at the moment the "journey" is the following:
- the object is stored in the Standard class in the source bucket;
- I change the storage class to IA;
- the object gets replicated into the other region (Standard class);
- after 1 day it is moved to Glacier.
As you can see this is a total waste of money, because the replication will end up moving the very same object into Glacier again.
How can I avoid this scenario?

Use a lifecycle policy in the source bucket to convert the current object version to the desired storage class. This should migrate the current object without changing its version-id, and should not trigger a replication event.
Otherwise, you'd need to create objects with the desired storage class from the beginning. There isn't a way for a user action to change an object's storage class without creating a new object version, so the seemingly redundant replication event can't otherwise be avoided -- because you are creating a new object version.
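For example, since the question already uses the Java client, a transition rule might look like the sketch below (AWS SDK for Java v1; the bucket name and rule ID are illustrative, and note that S3 requires objects to be at least 30 days old before a lifecycle transition to STANDARD_IA):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Transition;
import com.amazonaws.services.s3.model.StorageClass;
import com.amazonaws.services.s3.model.lifecycle.LifecycleFilter;

import java.util.Arrays;

public class LifecycleTransitionExample {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();

        // Transition current object versions to STANDARD_IA 30 days after
        // creation (30 days is the minimum S3 allows for this transition).
        // The transition is performed by S3 itself, keeps the version-id,
        // and is not a user-initiated write, so it is not replicated.
        BucketLifecycleConfiguration.Rule rule = new BucketLifecycleConfiguration.Rule()
                .withId("move-current-versions-to-ia")   // illustrative rule id
                .withFilter(new LifecycleFilter())       // empty filter = whole bucket
                .withTransitions(Arrays.asList(
                        new Transition()
                                .withDays(30)
                                .withStorageClass(StorageClass.StandardInfrequentAccess)))
                .withStatus(BucketLifecycleConfiguration.ENABLED);

        s3.setBucketLifecycleConfiguration("my-source-bucket",
                new BucketLifecycleConfiguration().withRules(Arrays.asList(rule)));
    }
}
```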

Related

How to prevent deletion of a blob?

Is it possible to make a particular blob non-deletable? If so, how? I want one blob to stay forever, and no one should be able to delete it.
Solution 1:
Try making the blob immutable.
Immutable storage for Azure Blob Storage enables users to store business-critical data in a WORM (Write Once, Read Many) state. While in a WORM state, data cannot be modified or deleted for a user-specified interval. By configuring immutability policies for blob data, you can protect your data from overwrites and deletes.
Immutable storage for Azure Blob Storage supports two types of immutability policies:
1) Time-based retention policies: with a time-based retention policy, users can set policies to store data for a specified interval. When a time-based retention policy is set, objects can be created and read, but not modified or deleted. After the retention period has expired, objects can be deleted but not overwritten. For more details, refer to Time-based retention policies for immutable blob data.
2) Legal hold policies: a legal hold stores immutable data until the legal hold is explicitly cleared. When a legal hold is set, objects can be created and read, but not modified or deleted. For more details, refer to Legal holds for immutable blob data.
For more details, refer to this document.
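If it helps, here is a minimal sketch of both policy types using the azure-storage-blob Java SDK (v12). It assumes version-level immutability support is enabled on the account/container; the connection string, container, and blob names are placeholders:

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.blob.models.BlobImmutabilityPolicy;
import com.azure.storage.blob.models.BlobImmutabilityPolicyMode;

import java.time.OffsetDateTime;

public class ImmutableBlobExample {
    public static void main(String[] args) {
        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("critical-data")   // placeholder
                .blobName("report.pdf")           // placeholder
                .buildClient();

        // Time-based retention: the blob can be read but not modified or
        // deleted until the expiry time passes.
        blob.setImmutabilityPolicy(new BlobImmutabilityPolicy()
                .setExpiryTime(OffsetDateTime.now().plusDays(30))
                .setPolicyMode(BlobImmutabilityPolicyMode.UNLOCKED));

        // Legal hold: the blob stays immutable until the hold is cleared.
        blob.setLegalHold(true);
    }
}
```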
Solution 2:
Use RBAC to secure access to containers.
Solution 3:
Use blob soft delete.
Blob soft delete protects an individual blob and its versions, snapshots, and metadata from accidental deletes or overwrites by maintaining the deleted data in the system for a specified period of time.
For more details, refer to this document.
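A minimal sketch of enabling soft delete with the azure-storage-blob Java SDK (v12); the 14-day retention value is illustrative:

```java
import com.azure.storage.blob.BlobServiceClient;
import com.azure.storage.blob.BlobServiceClientBuilder;
import com.azure.storage.blob.models.BlobRetentionPolicy;
import com.azure.storage.blob.models.BlobServiceProperties;

public class EnableSoftDeleteExample {
    public static void main(String[] args) {
        BlobServiceClient service = new BlobServiceClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .buildClient();

        // Fetch the current service properties and turn on blob soft delete,
        // keeping deleted blobs recoverable for 14 days.
        BlobServiceProperties props = service.getProperties();
        props.setDeleteRetentionPolicy(new BlobRetentionPolicy()
                .setEnabled(true)
                .setDays(14));
        service.setProperties(props);
    }
}
```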
Solution 4: lease the blob. An active lease acts as a lock: the blob cannot be deleted or overwritten without the lease ID until the lease is released or broken.
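A lease can be acquired like this (sketch with the v12 Java SDK; an infinite lease, duration -1, blocks deletes and overwrites, though anyone who holds the lease ID or is allowed to break leases can still delete the blob):

```java
import com.azure.storage.blob.BlobClient;
import com.azure.storage.blob.BlobClientBuilder;
import com.azure.storage.blob.specialized.BlobLeaseClient;
import com.azure.storage.blob.specialized.BlobLeaseClientBuilder;

public class LeaseBlobExample {
    public static void main(String[] args) {
        BlobClient blob = new BlobClientBuilder()
                .connectionString(System.getenv("AZURE_STORAGE_CONNECTION_STRING"))
                .containerName("critical-data")   // placeholder
                .blobName("report.pdf")           // placeholder
                .buildClient();

        // Acquire an infinite lease (-1 seconds); keep the returned id safe,
        // since it is needed to write to, delete, or release the blob.
        BlobLeaseClient lease = new BlobLeaseClientBuilder()
                .blobClient(blob)
                .buildClient();
        String leaseId = lease.acquireLease(-1);
        System.out.println("Acquired lease: " + leaseId);
    }
}
```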

Ceph Object Gateway: what is the best backup strategy?

I have a Ceph cluster managed by Rook with a single RGW store over it. We are trying to figure out the best backup strategy for this store. We are considering the following options: using rclone to back up objects via the S3 interface, using s3fs-fuse (we haven't tested it yet, but s3fs-fuse is known not to be reliable enough), and using NFS-Ganesha to re-export the RGW store as an NFS share.
We are going to have quite a lot of RGW users and quite a lot of buckets, so all three solutions do not scale well for us.
Another possibility is to perform snapshots of the RADOS pools backing the RGW store and to back up these snapshots, but the RTO will be much higher in that case. Another problem with snapshots is that it does not seem possible to perform them consistently across all RGW-backing pools. We never delete objects from the RGW store, so this problem does not seem to be that big if we start snapshotting from the metadata pool: all the data it refers to will remain in place even if we create a snapshot on the data pool a bit later. It won't be super consistent but it should not be broken either.
It's not entirely clear how to restore single objects in a timely manner using this snapshotting scheme (to be honest, it's not entirely clear how to restore using this scheme at all), but it seems to be worth trying.
What other options do we have? Am I missing something?
We're planning to implement Ceph in 2021.
We don't expect a large number of users and buckets, initially.
While waiting for https://tracker.ceph.com/projects/ceph/wiki/Rgw_-_Snapshots, I successfully tested the following solution to protect the object store, taking advantage of a multisite configuration + sync policy (https://docs.ceph.com/en/latest/radosgw/multisite-sync-policy/) in the "Octopus" version.
Assuming you have all zones in the Prod site zone-synced to the DRS:
- create a zone in the DRS, e.g. "backupZone", not zone-synced from or to any of the other Prod or DRS zones;
- put the endpoints for this backupZone on 2 or more DRS cluster nodes;
- using rclone (https://rclone.org/s3/), write a bash script: for each of the "bucket"s in the DRS zones, create a version-enabled "bucket"-p in the backupZone and schedule a sync, e.g. twice a day, from "bucket" to "bucket"-p (a sketch follows this list);
- protect access to the backupZone endpoints so that no ordinary user (or integration) can reach them; they should be accessible only from the other nodes in the cluster (obviously) and from the server running the rclone-based script;
- when there is a failure, just recover all the objects from the *-p buckets, once again using rclone, to the original buckets or to a filesystem.
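The original suggestion is a bash script; purely as an illustration, here is the same loop sketched in Java via ProcessBuilder. The rclone remotes "drs" and "backup" are hypothetical names from my config, and the *-p buckets are assumed to already exist with versioning enabled:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

public class BucketBackup {
    // Runs a command and returns its stdout lines (stderr merged in).
    private static List<String> run(String... cmd) throws Exception {
        Process p = new ProcessBuilder(cmd).redirectErrorStream(true).start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) lines.add(line);
        }
        if (p.waitFor() != 0)
            throw new RuntimeException("failed: " + String.join(" ", cmd));
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // `rclone lsd drs:` prints one bucket per line; the bucket name is
        // the last whitespace-separated field on each line.
        for (String line : run("rclone", "lsd", "drs:")) {
            String[] fields = line.trim().split("\\s+");
            String bucket = fields[fields.length - 1];
            // Sync each bucket to its protected *-p counterpart.
            run("rclone", "sync", "drs:" + bucket, "backup:" + bucket + "-p");
        }
    }
}
```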
This protects from the following failures:
Infra:
- bucket or pool failure;
- pervasive object corruption;
- loss of a site.
Human error:
- deletion of versions or objects;
- removal of buckets;
- elimination of entire pools.
Notes:
- only the latest version of each object is synced to the protected (*-p) bucket, but if the script runs several times you have the latest states of the objects through time;
- when an object is deleted in the prod bucket, rclone just flags the object with a DeleteMarker upon sync;
- this does not scale! As the number of buckets increases, the time to sync becomes untenable.

How to add read-after-update consistency to AWS S3

I'm trying to use S3 as a datastore for an application (for reasons that are not relevant to this question, I can't use a proper DBMS).
My issue is that I want read-after-update consistency: if I try to read an object immediately after updating it, I want to be guaranteed to get the updated object.
S3 guarantees read-after-write consistency (i.e. if you try to read an object immediately after creating it, you're guaranteed to succeed).
However, it doesn't guarantee read-after-update consistency (i.e. if you try to read an object immediately after updating it, you may get a previous version of the object).
Additionally, it has some more caveats.
What can I add to my solution so that it would have read-after-update consistency?
I've thought of a couple of options:
A write-through cache: each write updates a cache first (I'm thinking of using MongoDB, as I'm most familiar with it and it has fast reads for recent objects), then S3. Each read tries the cache first, and falls back to S3 if the object is not found.
However, this solution doubles the chances of failure (i.e. if either MongoDB or S3 is down, then my whole database is down). Ideally I'd like to have only S3 as a point of failure.
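For concreteness, a sketch of the write-through idea, with an in-memory map standing in for MongoDB (class and bucket names are illustrative, error handling omitted):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Write-through cache sketch: the map stands in for MongoDB. */
public class CachedStore {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final String bucket;

    public CachedStore(String bucket) { this.bucket = bucket; }

    public void put(String key, String json) {
        cache.put(key, json);            // cache first, so reads see the update
        s3.putObject(bucket, key, json); // then persist to S3
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) return cached;
        // Fall back to S3 (throws if the object is absent).
        return s3.getObjectAsString(bucket, key);
    }
}
```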
Always write new objects, never update: I thought I could work around this by utilizing the read-after-write guarantee; instead of creating the object /my-bucket/obj-id.json, I'll create /my-bucket/obj-id/v1.json. When I update, I'll just create v2, etc.
When reading, I'll need to list the keys under the obj-id path and choose the latest version.
However, one of the above caveats is:
A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list.
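For concreteness, a sketch of this versioned-key scheme with the AWS SDK for Java v1 (names illustrative, pagination ignored for brevity); the read path depends on LIST, which is exactly where the caveat above bites:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class VersionedKeyStore {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
    private final String bucket;

    public VersionedKeyStore(String bucket) { this.bucket = bucket; }

    // Write a new version under obj-id/v<N>.json instead of overwriting.
    public void write(String objId, int version, String json) {
        s3.putObject(bucket, objId + "/v" + version + ".json", json);
    }

    // Read the highest version found by listing the obj-id/ prefix.
    // At the time of the question, LIST was eventually consistent,
    // so a just-written version might not appear here yet.
    public String readLatest(String objId) {
        String latestKey = null;
        int latest = -1;
        for (S3ObjectSummary o : s3.listObjectsV2(
                new ListObjectsV2Request()
                        .withBucketName(bucket)
                        .withPrefix(objId + "/v")).getObjectSummaries()) {
            // Key looks like "obj-id/v3.json"; extract the number.
            String key = o.getKey();
            int v = Integer.parseInt(
                    key.substring(key.lastIndexOf("/v") + 2).replace(".json", ""));
            if (v > latest) { latest = v; latestKey = key; }
        }
        return latestKey == null ? null : s3.getObjectAsString(bucket, latestKey);
    }
}
```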
Is there any way to "cheat" S3's eventual consistency? Or am I stuck with a cache component?
Or maybe there's a third, better idea?
Thanks!

Is listing Amazon S3 objects a strong consistency operation or eventual consistency operation?

I know how consistency works when you create/update/delete a file on S3.
What about S3 bucket listing operation?
Is it strongly consistent or eventually consistent?
The List Objects operation appears to be eventually consistent, even for new objects. According to a support forum post from an AWS employee (ChrisP@AWS):
Read-after-write consistency is only valid for GETS of new objects - LISTS might not contain the new objects until the change is fully propagated.
— https://forums.aws.amazon.com/thread.jspa?messageID=687028&#687028
The preceding answer from @Michael was accurate before the latest announcement from AWS.
AWS has recently announced that both read-after-write and list operations are now strongly consistent.
Snippet from AWS: "After a successful write of a new object, or an overwrite or delete of an existing object, any subsequent read request immediately receives the latest version of the object. S3 also provides strong consistency for list operations, so after a write, you can immediately perform a listing of the objects in a bucket with any changes reflected."
Reference - https://aws.amazon.com/s3/consistency/
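A quick illustration of what the new guarantee means in practice, using the AWS SDK for Java v1 (bucket and key are placeholders):

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.S3ObjectSummary;

public class StrongListDemo {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        String bucket = "my-bucket"; // placeholder

        s3.putObject(bucket, "new-key.txt", "hello");

        // With strong consistency, the listing immediately reflects the write.
        for (S3ObjectSummary o : s3.listObjectsV2(bucket, "new-key").getObjectSummaries()) {
            System.out.println(o.getKey()); // prints new-key.txt
        }
    }
}
```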

NSManagedObjectContext confusion

I am learning about Core Data. Obviously, one of the main classes you encounter is NSManagedObjectContext. I am unclear about the exact role of this. From the articles I've read, it seems that you can have multiple NSManagedObjectContexts. Does this mean that NSManagedObjectContext is basically a copy of the backend?
How would this resolve into a consistent backend when there is multiple different copies lying around?
So, 2 questions basically:
Is NSManagedObjectContext a copy of the backend database?
and...
For example, say I make a change in context A and make some other change in context B. Then I call save on A first, then B. Will B prevail?
Thanks
The NSManagedObjectContext is not a copy of the backend database. The documentation describes it as a scratch pad:
An instance of NSManagedObjectContext represents a single “object space” or scratch pad in an application. Its primary responsibility is to manage a collection of managed objects. These objects form a group of related model objects that represent an internally consistent view of one or more persistent stores. A single managed object instance exists in one and only one context, but multiple copies of an object can exist in different contexts. Thus object uniquing is scoped to a particular context.
The NSManagedObjectContext is just a temporary place to make changes to your managed objects in a transactional way. When you make changes to objects in a context, it does not affect the backend database unless and until you save the context. And, as you know, you can have multiple contexts that you can make changes in, which is really important for concurrency.
For question number 2, the answer to who prevails will depend on the merge policy you set for your contexts and on which one is saved last, which would be B. Here are the merge policies that can be set; they affect the second context to be saved.
NSErrorMergePolicyType: specifies a policy that causes a save to fail if there are any merge conflicts.
NSMergeByPropertyStoreTrumpMergePolicyType: specifies a policy that merges conflicts between the persistent store’s version of the object and the current in-memory version, giving priority to external changes.
NSMergeByPropertyObjectTrumpMergePolicyType: specifies a policy that merges conflicts between the persistent store’s version of the object and the current in-memory version, giving priority to in-memory changes.
NSOverwriteMergePolicyType: specifies a policy that overwrites state in the persistent store for the changed objects in conflict.
NSRollbackMergePolicyType: specifies a policy that discards in-memory state changes for objects in conflict.
An NSManagedObjectContext is a specific representation of your data model. Each context maintains its own state (hence "context"), so changes in one context will not directly affect other contexts. When you work with multiple contexts, it is your responsibility to keep them consistent by merging changes when a context saves its changes to the store.
Your question is about this process, which may also involve merge conflicts. Whenever you save a context, its changes are committed to the store and a merge policy is used to resolve conflicts.
When you save a context, it will post various notifications regarding progress. In your case, if [contextA save:&error] succeeds, the context will post the notification NSManagedObjectContextDidSaveNotification. When you have multiple contexts, you typically observe this notification and call:
[contextB mergeChangesFromContextDidSaveNotification:notification];
This will merge the changes saved on contextA into contextB.
EDIT: removed the 'thread-safe' comment. NSManagedObjectContext is not thread safe.