Cross-region copy fails for Linode Object Storage

We're using boto3 with Linode Object Storage, which is compatible with AWS S3 according to their documentation.
Everything seems to work well except the cross-region copy operation.
When I download an object from the source region/bucket and then upload it to the destination region/bucket, everything works fine. However, I'd like to avoid that unnecessary download/upload step.
I have a bucket named test-bucket in both regions, and I'd like to copy the object named test-object from the us-east-1 cluster to the us-southeast-1 cluster.
Here is the example code I'm using:
from boto3 import client
from boto3.session import Session

sess = Session(
    aws_access_key_id='***',
    aws_secret_access_key='***'
)

s3_client_src = sess.client(
    service_name='s3',
    region_name='us-east-1',
    endpoint_url='https://us-east-1.linodeobjects.com'
)

# test-bucket and test-object already exist.
s3_client_trg = sess.client(
    service_name='s3',
    region_name='us-southeast-1',
    endpoint_url='https://us-southeast-1.linodeobjects.com'
)

copy_source = {
    'Bucket': 'test-bucket',
    'Key': 'test-object'
}

s3_client_trg.copy(CopySource=copy_source, Bucket='test-bucket', Key='test-object', SourceClient=s3_client_src)
When I call:
s3_client_src.list_objects(Bucket='test-bucket')['Contents']
it shows that test-object exists. But when I run the copy, it throws the following error message:
An error occurred (NoSuchKey) when calling the CopyObject operation: Unknown
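For reference, the download-then-upload workaround that does work looks roughly like this (a sketch reusing the two clients defined above; it streams the body so the object never touches the disk):

# Fallback: stream the object from the source cluster and re-upload it to the target cluster.
response = s3_client_src.get_object(Bucket='test-bucket', Key='test-object')
# response['Body'] is a file-like StreamingBody, so it can be handed straight to upload_fileobj.
s3_client_trg.upload_fileobj(response['Body'], 'test-bucket', 'test-object')

This is exactly the extra hop I'd like to avoid.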
Any help is appreciated!

Related

AWS S3 bucket notification lambda throws exception (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey)

We have an AWS Glue DataBrew job that writes its output to a folder in an S3 bucket. A Java Lambda is then notified via this Put notification, but the following sample code throws an exception:
S3EventNotification.S3EventNotificationRecord record = event.getRecords().get(0);
String s3Bucket = record.getS3().getBucket().getName();
String s3Key = record.getS3().getObject().getUrlDecodedKey();
// the following throws a 404 NoSuchKey exception
S3Object s3object = s3Client.getObject(s3Bucket, s3Key);
In the logs we can see that the key is something like:
input_files/processed_file_22Dec2022_1671678897600/fdg629ae-4f91-4869-891c-79200772fb92/databrew-test-put-object.temp
So is it that the Lambda gets the file while it is still being copied into the S3 folder? When we upload the file manually using the console, it works fine, but when the DataBrew job uploads it, we see this issue.
I expect the file to be read by the Lambda function with the correct key.
Thanks
What is your trigger event type?
So is it that the Lambda gets the file while it is still being copied into the S3 folder?
If you have a Put trigger, Lambda will get triggered when the object upload completes. S3 wouldn't create a temporary object and then delete it.
I haven't used AWS Glue DataBrew before but perhaps that is creating that temporary object? If that is the case, you need to handle it in your code.
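For example, one way to handle it in the Lambda is to simply skip keys that still carry the temporary suffix and only process finished objects. A rough sketch in Python rather than the original Java (the ".temp" check is an assumption based on the key shown above):

import urllib.parse
import boto3

s3_client = boto3.client('s3')

def handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # S3 event keys are URL-encoded, hence the unquote (the Java code uses getUrlDecodedKey)
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        if key.endswith('.temp'):
            # Intermediate object written while the DataBrew job is still running; ignore it.
            continue

        obj = s3_client.get_object(Bucket=bucket, Key=key)
        body = obj['Body'].read()
        # ... process `body` here ...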

how to update metadata on an S3 object larger than 5GB?

I am using the boto3 API to update the S3 metadata on an object.
I am making use of How to update metadata of an existing object in AWS S3 using python boto3?
My code looks like this:
s3_object = s3.Object(bucket,key)
new_metadata = {'foo':'bar'}
s3_object.metadata.update(new_metadata)
s3_object.copy_from(CopySource={'Bucket':bucket,'Key':key}, Metadata=s3_object.metadata, MetadataDirective='REPLACE')
This code fails when the object is larger than 5GB. I get this error:
botocore.exceptions.ClientError: An error occurred (InvalidRequest) when calling the CopyObject operation: The specified copy source is larger than the maximum allowable size for a copy source: 5368709120
How does one update the metadata on an object larger than 5GB?
Due to the size of your object, try a multipart upload and copy the source into it part by part with MultipartUploadPart.copy_from. See the boto3 docs here for more information:
https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.MultipartUploadPart.copy_from
Apparently, you can't just update the metadata in place - you need to re-copy the object to S3. You can copy it from S3 back to S3, but not simply update it, which is annoying for objects in the 100-500GB range.
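A rough sketch of that multipart approach with the low-level client (untested here; the 500 MB part size is arbitrary, and bucket/key stand in for the values used in the question):

import boto3

s3 = boto3.client('s3')

bucket, key = 'my-bucket', 'my-key'   # same bucket/key as in the question
new_metadata = {'foo': 'bar'}
part_size = 500 * 1024 * 1024         # parts must be between 5 MB and 5 GB

size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

# Start a multipart upload onto the same key, supplying the new metadata up front.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key, Metadata=new_metadata)

parts = []
for part_number, offset in enumerate(range(0, size, part_size), start=1):
    last_byte = min(offset + part_size, size) - 1
    resp = s3.upload_part_copy(
        Bucket=bucket,
        Key=key,
        PartNumber=part_number,
        UploadId=mpu['UploadId'],
        CopySource={'Bucket': bucket, 'Key': key},
        CopySourceRange='bytes={}-{}'.format(offset, last_byte),
    )
    parts.append({'ETag': resp['CopyPartResult']['ETag'], 'PartNumber': part_number})

# Completing the upload replaces the object in place with the new metadata attached.
s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=mpu['UploadId'],
    MultipartUpload={'Parts': parts},
)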

How to test mocked (moto/boto) S3 read/write in PySpark

I am trying to unit-test a function that writes data to S3 and then reads the same data back from the same S3 location. I am trying to use moto and boto (2.x) to achieve that [1]. The problem is that the service returns that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it is not resolved yet.
Has anyone ever successfully tested mocked s3 read/write in PySpark to share some insights?
[1]
import boto
import pytest
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)
    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')
    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    df = (spark
          .read
          .csv(s3_uri))
[2]
(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)
[3]
https://github.com/spulec/moto/issues/1543
moto is a library used to mock AWS resources.
1. Create the resource:
If you try to access an S3 bucket which doesn't exist, AWS will return a Forbidden error.
Usually, we need these resources created even before our tests run. So, create a pytest fixture with autouse set to True:
import pytest
import boto3
from moto import mock_s3

@pytest.fixture(autouse=True)
def fixture_mock_s3():
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='MYBUCKET')  # an empty test bucket is created
        yield
The above code creates a mock S3 bucket named "MYBUCKET". The bucket is empty.
The name of the bucket should be the same as that of the original bucket.
With autouse, the fixture is automatically available across tests.
You can confidently run tests, as your tests will not have access to the original bucket.
2. Define and run tests involving the resource:
Suppose you have code that writes a file to the S3 bucket:
def write_to_s3(filepath: str):
    s3 = boto3.resource('s3', region_name='us-east-1')
    s3.Bucket('MYBUCKET').upload_file(filepath, 'A/B/C/P/data.txt')
This can be tested the following way:
from botocore.exceptions import ClientError

def test_write_to_s3():
    dummy_file_path = f"{TEST_DIR}/data/dummy_data.txt"
    # The S3 bucket is created by the fixture and is empty:
    # test for emptiness
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket("MYBUCKET")
    objects = list(bucket.objects.filter(Prefix="/"))
    assert objects == []
    # Now, let's write a file to S3
    write_to_s3(dummy_file_path)
    # head_object is a client-level call, so go through s3.meta.client;
    # the assert below doesn't throw any error
    assert s3.meta.client.head_object(Bucket='MYBUCKET', Key='A/B/C/P/data.txt')

boto3 s3 head_object method is returning Storage class as None

Code:
import boto3

s3_cli = boto3.client('s3')
object_summary = s3_cli.head_object(
    Bucket='test-cf',
    Key='RDS.template',
    VersionId='szA3ws4bH6k.rDXOEAchlh1x3OgthNEB'
)
print('LastModified: {}'.format(object_summary.get('LastModified')))
print('StorageClass: {}'.format(object_summary.get('StorageClass')))
print('Metadata: {}'.format(object_summary.get('Metadata')))
print('ContentLength(KB): {}'.format(object_summary.get('ContentLength')/1024))
Output:
LastModified: 2017-06-08 09:22:43+00:00
StorageClass: None
Metadata: {}
ContentLength(KB): 15
I am unable to get the StorageClass of the key using the boto3 SDK, even though I can see the storage class set to Standard in the AWS console. I have also tried s3.ObjectSummary and s3.ObjectVersion in the boto3 S3 resource, but they also returned None.
Not sure why it is returning None; let me check my version of Boto3. In the meantime, you can use the following code to get the storage class:
s3 = boto3.resource('s3')
bucket = s3.Bucket('test-cf')
for object in bucket.objects.all():
    print(object.key, object.storage_class)
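For what it's worth, if I remember the HeadObject docs correctly, S3 omits the x-amz-storage-class header for objects in the STANDARD storage class, so a None here effectively means STANDARD. A small fallback along those lines:

# Assumption: a missing StorageClass in the head_object response means the default STANDARD class.
storage_class = object_summary.get('StorageClass') or 'STANDARD'
print('StorageClass: {}'.format(storage_class))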

AWS SDK Boto3 : boto3.exceptions.unknownapiversionerror

I am trying to upload content to Amazon S3, but I am getting this error:
boto3.exceptions.UnknownAPIVersionError: The 's3' resource does not have an API version of: ...
Valid API versions are: 2006-03-01
import boto3

boto3.resource('s3', **AWS_ACCESS_KEY_ID**, **AWS_PRIVATE_KEY**)
bucket = s3.Bucket(**NAME OF BUCKET**)
obj = bucket.Object(**KEY**)
obj.upload_fileobj(**FILE OBJECT**)
The error comes from the DataNotFound exception path in the boto3.Session resource() source code: because the credentials are passed positionally, boto3 treats them as region_name and api_version, can't find resource data for that bogus "API version", and re-raises it as UnknownAPIVersionError. Perhaps the developers didn't anticipate people making the mistake of not passing the arguments by keyword.
If you read the boto3 documentation example, this is the correct way to upload data.
import boto3

s3 = boto3.resource(
    's3',
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_PRIVATE_KEY
)
bucket = s3.Bucket(BUCKET_NAME)
obj = bucket.Object("prefix/object_key_name")

# You must pass a file object!
with open('filename', 'rb') as fileobject:
    obj.upload_fileobj(fileobject)