S3 eventual consistency for read after write

I have read a lot about different scenarios and questions concerning S3 eventual consistency and how to handle it so that I don't get 404 errors. But here I have a slightly unusual use case/requirement! What I'm doing is writing a bunch of files to a temp/transient folder in an S3 bucket (using a Spark job and making sure the job does not fail), then removing the main/destination folder if the previous step succeeded, and finally copying the files over from the temp folder to the main folder in the same bucket. Here is part of my code:
# first writing objects into the tempPrefix here using pyspark
...
# delete the main folder (old data) here
...
# copy files from the temp to the main folder
for obj in bucket.objects.filter(Prefix=tempPrefix):
    # this function makes sure the specific key is available for read
    # by calling HeadObject with retries - throwing an exception otherwise
    waitForObjectToBeAvaiableForRead(bucketName, obj.key)
    copy_source = {
        "Bucket": bucketName,
        "Key": obj.key
    }
    new_key = obj.key.replace(tempPrefix, mainPrefix, 1)
    new_obj = bucket.Object(new_key)
    new_obj.copy(copy_source)
This seems to work for avoiding 404 (NoSuchKey) errors on immediate read after write. My question is: will bucket.objects.filter always give me the newly written objects/keys? Can eventual consistency affect that as well? The reason I'm asking is that the HeadObject call (in the waitForObjectToBeAvaiableForRead function) sometimes returns a 404 error for a key that was returned by bucket.objects.filter. In other words, bucket.objects returns a key which is not yet available for read!
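For reference, here is a minimal sketch of what such a wait helper could look like, assuming boto3's built-in object_exists waiter (which retries HeadObject under the hood); the function name just mirrors the one in the question, and the retry settings are placeholders:
import boto3
from botocore.exceptions import WaiterError

s3_client = boto3.client("s3")

def waitForObjectToBeAvaiableForRead(bucket_name, key):
    # The object_exists waiter polls HeadObject until the key is readable.
    waiter = s3_client.get_waiter("object_exists")
    try:
        # Placeholder settings: poll every 2 seconds, up to 10 attempts.
        waiter.wait(Bucket=bucket_name, Key=key,
                    WaiterConfig={"Delay": 2, "MaxAttempts": 10})
    except WaiterError as err:
        raise Exception(f"s3://{bucket_name}/{key} never became readable") from err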

When you delete an object in S3, AWS writes a "delete marker" for the object (this assumes that the bucket is versioned). The object appears to be deleted, but that is a sort of illusion created by AWS.
So, if you are writing objects over previously-existing-but-now-deleted keys, you are actually updating an object, which results in "eventual consistency" rather than "strong consistency."
Some helpful comments from AWS docs:
A delete marker is a placeholder (marker) for a versioned object that was named in a simple DELETE request. Because the object was in a versioning-enabled bucket, the object was not deleted. The delete marker, however, makes Amazon S3 behave as if it had been deleted.
If you try to get an object and its current version is a delete marker, Amazon S3 responds with:
- a 404 (Object not found) error
- a response header, x-amz-delete-marker: true
Specific Answers
My question is will the bucket.objects.filter give me the newly written objects/keys always?
Yes, newly written objects/keys will be included in the listing. Note that the underlying list API returns at most 1,000 keys per request; the boto3 collection pages through the results automatically, so with more than 1,000 objects it simply issues additional requests.
Can eventual consistency affect that as well?
Eventual consistency affects the availability of the latest version of an object, not the presence of the object in filter results. The 404 errors come from reading newly written objects whose keys had previously been deleted, before full consistency has been achieved.
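To illustrate the quoted behaviour, a 404 caused by a delete marker can be distinguished from a key that simply does not exist by looking at the response headers on the error. A minimal sketch, assuming a low-level boto3 client and placeholder bucket/key names:
import boto3
from botocore.exceptions import ClientError

s3_client = boto3.client("s3")

try:
    s3_client.head_object(Bucket="example-bucket", Key="some/key")
except ClientError as err:
    status = err.response["ResponseMetadata"]["HTTPStatusCode"]
    headers = err.response["ResponseMetadata"]["HTTPHeaders"]
    if status == 404 and headers.get("x-amz-delete-marker") == "true":
        # The key exists, but its current version is a delete marker.
        print("Latest version is a delete marker")
    elif status == 404:
        # The key does not exist (or is not visible yet).
        print("Key not found")
    else:
        raise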

Related

Is multipart copy really needed to revert an object to a prior version?

For https://github.com/wlandau/gittargets/issues/6, I am trying to programmatically revert an object in an S3 bucket to an earlier version. From reading https://docs.aws.amazon.com/AmazonS3/latest/userguide/RestoringPreviousVersions.html, it looks like copying the object to itself (old version to current version) is recommended. However, I also read that there is a 5 GB limit for copying objects in S3. Does that limit apply to reverting an object to a previous version in the same bucket? A local download followed by a multi-part upload seems extremely inefficient for this use case.
You can create a multi-part transfer request that transfers from S3 to S3. It still takes time, but it doesn't require downloading the object's data and uploading it again, so in practice it tends to be considerably faster than other options:
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('example-bucket')

bucket.copy(
    {
        'Bucket': 'example-bucket',
        'Key': 'test.dat',
        'VersionId': '0011223344',  # From previous call to bucket.object_versions
    },
    Key='test.dat',
)
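The VersionId in the snippet above comes from listing the key's versions first. Continuing with the same bucket object, it could be looked up along these lines (the printed attributes are just for illustration):
# List all versions of the key so you can pick the one to restore.
for version in bucket.object_versions.filter(Prefix='test.dat'):
    print(version.id, version.last_modified, version.is_latest)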

AWS structure of S3 trigger

I am building a Python Lambda in AWS and wanted to add an S3 trigger to it. Following these instructions I saw how to get the bucket and key that the trigger fired on, using:
def func(event):
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
There is an example of such an object in the link, but I wasn't able to find a description of the entire event object anywhere in AWS's documentation.
Is there a documentation for this object's structure? Where might I find it?
You can find documentation about the whole object in the S3 documentation:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/notification-content-structure.html
I would also advise iterating over the records, because there could be multiple in a single event:
for record in event['Records']:
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']
    # [...]
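Putting the two snippets together, a minimal handler sketch might look like the following; the unquote_plus call mirrors the question's code, since object keys arrive URL-encoded in the event, and lambda_handler is just the conventional entry point name:
import urllib.parse

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Keys are URL-encoded in the event (spaces arrive as '+'), so decode them.
        key = urllib.parse.unquote_plus(record['s3']['object']['key'], encoding='utf-8')
        print(f"Received event for s3://{bucket}/{key}")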

how to update metadata on an S3 object larger than 5GB?

I am using the boto3 API to update the S3 metadata on an object.
I am making use of How to update metadata of an existing object in AWS S3 using python boto3?
My code looks like this:
s3_object = s3.Object(bucket, key)
new_metadata = {'foo': 'bar'}
s3_object.metadata.update(new_metadata)
s3_object.copy_from(
    CopySource={'Bucket': bucket, 'Key': key},
    Metadata=s3_object.metadata,
    MetadataDirective='REPLACE'
)
This code fails when the object is larger than 5GB. I get this error:
botocore.exceptions.ClientError: An error occurred (InvalidRequest) when calling the CopyObject operation: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120
How does one update the metadata on an object larger than 5GB?
Due to the size of your object, try a multipart copy: initiate a multipart upload and use copy_from on each part (MultipartUploadPart.copy_from / UploadPartCopy). See the boto3 docs here for more information:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.MultipartUploadPart.copy_from
Apparently, you can't just update the metadata in place - you need to re-copy the object in S3. You can copy it from S3 back to S3, but you can't simply update, which is annoying for objects in the 100-500GB range.
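An alternative to driving the multipart copy by hand is boto3's managed copy on the client, which switches to a multipart copy automatically once the source exceeds the transfer threshold and accepts the metadata directive via ExtraArgs. A minimal sketch, reusing the bucket and key variables from the question; whether this fits your case is worth verifying against the current boto3 docs:
import boto3

s3_client = boto3.client('s3')

# Managed transfer: boto3 falls back to a multipart copy for large objects,
# so the 5 GB limit of a single CopyObject call does not apply here.
s3_client.copy(
    CopySource={'Bucket': bucket, 'Key': key},
    Bucket=bucket,
    Key=key,
    ExtraArgs={
        'Metadata': {'foo': 'bar'},
        'MetadataDirective': 'REPLACE',
    },
)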

AWS S3 life cycle configuration

I want to set the lifecycle rule of an S3 bucket so that each file in the bucket is deleted 7 days after it is generated.
If I set the lifecycle rule as follows (the below is Terraform code, similar to the console setting, so I just use it here), will all the files in the bucket "test" be removed 7 days from today, or will each file be deleted on a different date, since they were created on different days? I want them to be deleted on different dates, not all together.
BTW, I guess I do not need to configure "Permanently delete previous versions" because my S3 bucket is not versioning-enabled. Please correct me if I am wrong.
resource "aws_s3_bucket" "s3" {
bucket = "test"
lifecycle_rule {
id = "remove_after_7d"
enabled = true
expiration {
days = 7
}
}
}
The objects will be removed 7 days after their individual creation -- not 7 days after you create the rule. If, for example, all the objects in a bucket are at least 7 days old, they should all be gone within approximately 24 hours after you create the rule.
Note that the timing is not precise, because the deletion process runs in the background, so objects will usually linger a few hours longer than you might expect if you assume exactly 7 × 24 hours is how long an object remains in the bucket. It may take a day or two for objects to disappear after the policy is first created. However, once the policy has been fully evaluated against all the objects, S3 stops billing you for storage of expired objects when their expiration time arrives, even if the delete process hasn't actually gotten around to removing them yet.
For non-versioned buckets, you are correct -- there is no previous version to delete. Using versioned buckets is generally a good idea, though, since it eliminates the risk of data loss from inadvertently deleting or overwriting an object, for whatever reason (like a bug in your application).
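As a side note, you can check how the rule applies to an individual object by reading the Expiration field that S3 starts returning once a lifecycle rule matches the object. A minimal boto3 sketch with placeholder bucket and key names:
import boto3

s3_client = boto3.client('s3')

# When an expiration rule matches the object, the response carries an
# 'Expiration' string with the expiry date and the id of the matching rule.
response = s3_client.head_object(Bucket='test', Key='some/file.txt')
print(response.get('Expiration'))
# e.g. expiry-date="Fri, 21 Mar 2025 00:00:00 GMT", rule-id="remove_after_7d"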
As far as I know, the above configuration applies to every object in the bucket. If you want to delete only a specific file some days after that object's creation, you have to mention its path as a prefix, e.g. to delete log.txt which is inside the log folder of the bucket:
resource "aws_s3_bucket" "bucket"
{
bucket = "<<bucket_name>>"
acl = "private"
lifecycle_rule {
id = "log"
enabled = true
prefix = "log/log.txt"
expiration {
days = 7
}
}
}
But I was facing an issue (error 409) while updating an existing bucket.

Using Leigh version of S3Wrapper.cfc Can't get past Init

I am new to S3 and need to use it for image storage. I found a half dozen versions of an S3 wrapper for CF, but it appears that the only one set up for v4 signing is the one modified by Leigh:
https://gist.github.com/Leigh-/26993ed79c956c9309a9dfe40f1fce29
I dropped it in the com directory and created a "test" page that contains the following code:
s3 = createObject('component','com.S3Wrapper').init(application.s3.AccessKeyId,application.s3.SecretAccessKey);
but got the following error:
So I changed line 37 from
variables.Sv4Util = createObject('component', 'Sv4').init(arguments.S3AccessKey, arguments.S3SecretAccessKey);
to
variables.Sv4Util = createObject('component', 'Sv4Util').init(arguments.S3AccessKey, arguments.S3SecretAccessKey);
Now I am getting:
I feel like going through Leigh's code and starting to change things is a bad idea, since I have lurked here for years and know Leigh's code is solid.
Does anyone know if there are any examples of how to use this anywhere? If not, what am I doing wrong? If it makes a difference, I am using Lucee 5 and not Adobe's CF engine.
UPDATE:
I followed Leigh's directions and the error is now gone. I added some more code to my test page, which now looks like this:
<cfscript>
    s3 = createObject('component', 'com.S3v4').init(application.s3.AccessKeyId, application.s3.SecretAccessKey);
    bucket = "imgbkt.domain.com";
    obj = "fake.ping";
    region = "s3-us-west-1";
    test = s3.getObject(bucket, obj, region);
    writeDump(test);
    test2 = s3.getObjectLink(bucket, obj, region);
    writeDump(test2);
    writeDump(s3);
</cfscript>
Regardless of what I put in for bucket, obj, or region, I get:
Just in case, I did go to AWS and get new keys:
Leigh, if you are still around, or anyone who has used one of the S3 wrappers: any suggestions or guidance?
UPDATE #2:
Even after Alex's help I am not able to get this to work. The link I receive from getObjectLink is not valid and getObject never downloads an object. I thought I would try the putObject method
test3 = s3.putObject(bucketName=bucket,regionName=region,keyName="favicon.ico");
writeDump(test3);
to see if there is any additional information. I received this:
I did find this article https://shlomoswidler.com/2009/08/amazon-s3-gotcha-using-virtual-host.html but it is pretty old, and since S3 specifically suggests using dots in bucket names, I don't think it is relevant any longer. There is obviously something I am doing wrong, but I have spent hours trying to resolve this and I can't seem to figure out what it might be.
I will give you a rundown of what the code does:
getObjectLink returns an HTTP URL for the file fake.ping found in the bucket imgbkt.domain.com of region s3-us-west-1. This link is temporary and expires after 60 seconds by default.
getObject invokes getObjectLink and immediately requests the URL using HTTP GET. The response is then saved to the directory of the S3v4.cfc with the filename fake.ping by default. Finally the function returns the full path of the downloaded file: E:\wwwDevRoot\taa\fake.ping
To save the file in a different location, you would invoke:
downloadPath = 'E:\';
test = s3.getObject(bucket,obj,region,downloadPath);
writeDump(test);
The HTTP request is synchronous, meaning the file will have been downloaded completely when the function returns the file path.
If you want to access the actual content of the file, you can do this:
test = s3.getObject(bucket,obj,region);
contentAsString = fileRead(test); // returns the file content as string
// or
contentAsBinary = fileReadBinary(test); // returns the content as binary (byte array)
writeDump(contentAsString);
writeDump(contentAsBinary);
(You might want to stream the content if the file is large, since fileRead/fileReadBinary reads the whole file into a buffer. Use fileOpen to stream the content.)
Does that help you?