I am trying to unit-test a function that writes data to S3 and then reads the same data back from the same S3 location. I am using moto and boto (2.x) to achieve that [1]. The problem is that the service responds that I am forbidden to access the key [2]. A similar problem (even though the error message is a bit different) is reported in the moto GitHub repository [3], but it is not resolved yet.
Has anyone successfully tested mocked S3 reads/writes in PySpark and can share some insights?
[1]
import pytest
import boto
from boto.s3.key import Key
from moto import mock_s3
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

_test_bucket = 'test-bucket'
_test_key = 'data.csv'

@pytest.fixture(scope='function')
def spark_context(request):
    conf = SparkConf().setMaster("local[2]").setAppName("pytest-pyspark-local-testing")
    sc = SparkContext(conf=conf)
    # fake credentials for the mocked S3 endpoint
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", 'test-access-key-id')
    sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", 'test-secret-access-key')
    request.addfinalizer(lambda: sc.stop())
    quiet_py4j(sc)  # helper (defined elsewhere) that quiets py4j logging
    return sc

spark_test = pytest.mark.usefixtures("spark_context")

@spark_test
@mock_s3
def test_tsv_read_from_and_write_to_s3(spark_context):
    spark = SQLContext(spark_context)
    s3_conn = boto.connect_s3()
    s3_bucket = s3_conn.create_bucket(_test_bucket)
    k = Key(s3_bucket)
    k.key = _test_key
    k.set_contents_from_string('')
    s3_uri = 's3n://{}/{}'.format(_test_bucket, _test_key)
    df = (spark
          .read
          .csv(s3_uri))
[2]
(...)
E py4j.protocol.Py4JJavaError: An error occurred while calling o33.csv.
E : org.apache.hadoop.fs.s3.S3Exception: org.jets3t.service.S3ServiceException: S3 HEAD request failed for '/data.csv' - ResponseCode=403, ResponseMessage=Forbidden
(...)
[3]
https://github.com/spulec/moto/issues/1543
moto is a library used to mock AWS resources.
1. Create the resource:
If you try to access an S3 bucket that doesn't exist, AWS returns a Forbidden (403) error.
Usually, we need these resources created before any of our tests run, so create a pytest fixture with autouse set to True:
import pytest
import boto3
from moto import mock_s3

@pytest.fixture(autouse=True)
def fixture_mock_s3():
    with mock_s3():
        conn = boto3.resource('s3', region_name='us-east-1')
        conn.create_bucket(Bucket='MYBUCKET')  # an empty test bucket is created
        yield
The above code creates a mock S3 bucket named "MYBUCKET". The bucket is empty.
The name of the bucket should be the same as that of the original bucket.
With autouse=True, the fixture is automatically applied to every test.
You can run your tests confidently, as they will not have access to the original bucket.
2. Define and run tests involving the resource:
Suppose you have code that writes a file to an S3 bucket:
def write_to_s3(filepath: str):
    s3 = boto3.resource('s3', region_name='us-east-1')
    s3.Bucket('MYBUCKET').upload_file(filepath, 'A/B/C/P/data.txt')
This can be tested in the following way:
from botocore.exceptions import ClientError

def test_write_to_s3():
    dummy_file_path = f"{TEST_DIR}/data/dummy_data.txt"
    # The S3 bucket is created by the fixture and lies empty;
    # test for emptiness
    s3 = boto3.resource('s3', region_name='us-east-1')
    bucket = s3.Bucket("MYBUCKET")
    objects = list(bucket.objects.all())
    assert objects == []
    # Now, let's write a file to S3
    write_to_s3(dummy_file_path)
    # head_object is a client method; it raises ClientError if the key is missing,
    # so the assert below doesn't throw any error
    assert s3.meta.client.head_object(Bucket='MYBUCKET', Key='A/B/C/P/data.txt')
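If you also want to verify the data that was written (the original goal is write then read), here is a minimal sketch; the test name is mine, and it assumes the same mocked bucket and the same TEST_DIR dummy file as above:

def test_write_then_read_from_s3():
    s3 = boto3.resource('s3', region_name='us-east-1')
    dummy_file_path = f"{TEST_DIR}/data/dummy_data.txt"
    write_to_s3(dummy_file_path)
    # read the object back from the mocked bucket and compare it with the local file
    body = s3.Object('MYBUCKET', 'A/B/C/P/data.txt').get()['Body'].read()
    with open(dummy_file_path, 'rb') as f:
        assert body == f.read()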
Related
We're using boto3 with Linode Object Storage, which is compatible with AWS S3 according to their documentation.
Everything seems to work well, except the cross-region copy operation.
When I download an object from the source region/bucket and then upload it to the destination region/bucket, everything works well. However, I'd like to avoid that unnecessary download/upload step.
I have the bucket named test-bucket on both regions. And I'd like to copy the object named test-object from us-east-1 to us-southeast-1 cluster.
Here is the example code I'm using:
from boto3.session import Session

sess = Session(
    aws_access_key_id='***',
    aws_secret_access_key='***'
)
s3_client_src = sess.client(
    service_name='s3',
    region_name='us-east-1',
    endpoint_url='https://us-east-1.linodeobjects.com'
)
# test-bucket and test-object already exist.
s3_client_trg = sess.client(
    service_name='s3',
    region_name='us-southeast-1',
    endpoint_url='https://us-southeast-1.linodeobjects.com'
)
copy_source = {
    'Bucket': 'test-bucket',
    'Key': 'test-object'
}
s3_client_trg.copy(CopySource=copy_source, Bucket='test-bucket', Key='test-object', SourceClient=s3_client_src)
When I call:
s3_client_src.list_objects(Bucket='test-bucket')['Contents']
It shows me that test-object exists, but when I run the copy, it throws the following message:
An error occurred (NoSuchKey) when calling the CopyObject operation: Unknown
Any help is appreciated!
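For reference, here is a minimal sketch of the download/upload workaround mentioned above, reusing the s3_client_src and s3_client_trg clients from the example code; it buffers the whole object in memory, so it only suits reasonably small objects:

# fetch the object from the source cluster
obj = s3_client_src.get_object(Bucket='test-bucket', Key='test-object')
body = obj['Body'].read()

# write it to the destination cluster
s3_client_trg.put_object(Bucket='test-bucket', Key='test-object', Body=body)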
How do I activate a Data Pipeline when new files arrive on S3? For EMR, scheduling is currently triggered using SNS when new files arrive on S3.
You can execute a Data Pipeline without using SNS when files arrive in the S3 location:
Create an S3 event that invokes a Lambda function (see the sketch after the Lambda code below).
Create the Lambda function (make sure the role you give it has S3, Lambda, and Data Pipeline permissions).
Paste the code below into the Lambda function to activate the Data Pipeline (mention your own pipeline id):
import boto3

def lambda_handler(event, context):
    try:
        client = boto3.client('datapipeline', region_name='ap-southeast-2')
        s3_client = boto3.client('s3')
        data_pipeline_id = "df-09312983K28XXXXXXXX"  # mention your own pipeline id
        # check that the pipeline exists, then activate it
        response_pipeline = client.describe_pipelines(pipelineIds=[data_pipeline_id])
        activate = client.activate_pipeline(pipelineId=data_pipeline_id, parameterValues=[])
    except Exception as e:
        raise Exception("Pipeline is not found or not active")
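For the first step, here is a minimal sketch of wiring the S3 event to the Lambda function with boto3; the bucket name and function ARN are placeholders, and the Lambda function must separately allow s3.amazonaws.com to invoke it (e.g. via lambda add-permission):

import boto3

s3 = boto3.client('s3')
s3.put_bucket_notification_configuration(
    Bucket='my-input-bucket',  # placeholder bucket name
    NotificationConfiguration={
        'LambdaFunctionConfigurations': [{
            'LambdaFunctionArn': 'arn:aws:lambda:ap-southeast-2:123456789012:function:activate-pipeline',  # placeholder ARN
            'Events': ['s3:ObjectCreated:*'],
        }]
    }
)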
I am testing image recognition (Amazon Rekognition) from AWS. So far, so good. What I am having problems with is indexing faces from the CLI. I can index one at a time, but I would like to tell AWS to index all faces in a bucket. To index one face at a time I call this:
aws rekognition index-faces --image "S3Object={Bucket=bname,Name=123.jpg}" --collection-id "myCollection" --detection-attributes "ALL" --external-image-id "myImgID"
How do I tell it to index all images in the "bname" bucket?
I tried this:
aws rekognition index-faces --image "S3Object={Bucket=bname}" --collection-id "myCollection" --detection-attributes "ALL" --external-image-id "myImgID"
No luck.
You currently can't index multiple faces in one index-faces call. A script that calls list-objects on the bucket and then loops through the results would accomplish what you want.
In case it helps anyone in the future, I had a similar need, so I wrote this Python 3.6 script to do exactly what @chris-adzima recommends, and I executed it from a Lambda function.
import boto3
import concurrent.futures

bucket_name = "MY_BUCKET_NAME"
collection_id = "MY_COLLECTION_ID"

rekognition = boto3.client('rekognition')
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket_name)

def handle_image(key):
    # index the faces found in a single S3 object
    rekognition.index_faces(
        CollectionId=collection_id,
        Image={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': key
            }
        }
    )

def lambda_handler(event, context):
    # list every .png in the bucket and index them in parallel
    pic_keys = [o.key for o in bucket.objects.all() if o.key.endswith('.png')]
    with concurrent.futures.ThreadPoolExecutor() as executor:
        executor.map(handle_image, pic_keys)
Code:
import boto3
s3_cli = boto3.client('s3')
object_summary = s3_cli.head_object(
    Bucket='test-cf',
    Key='RDS.template',
    VersionId='szA3ws4bH6k.rDXOEAchlh1x3OgthNEB'
)
print('LastModified: {}'.format(object_summary.get('LastModified')))
print('StorageClass: {}'.format(object_summary.get('StorageClass')))
print('Metadata: {}'.format(object_summary.get('Metadata')))
print('ContentLength(KB): {}'.format(object_summary.get('ContentLength')/1024))
Output:
LastModified: 2017-06-08 09:22:43+00:00
StorageClass: None
Metadata: {}
ContentLength(KB): 15
I am unable to get the StorageClass of the key using the boto3 SDK. I can see the storage class set as STANDARD in the AWS console. I have also tried the s3.ObjectSummary and s3.ObjectVersion methods of the boto3 S3 resource, but they also returned None.
Not sure why it is returning None. Meanwhile, use the following code to get the storage class. Let me check my version of Boto3.
s3 = boto3.resource('s3')
bucket = s3.Bucket('test-cf')
for object in bucket.objects.all():
    print(object.key, object.storage_class)
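As a side note, the client's list_objects_v2 should also report a StorageClass for each listed object; here is a minimal sketch with the same bucket and key as above:

s3_cli = boto3.client('s3')
resp = s3_cli.list_objects_v2(Bucket='test-cf', Prefix='RDS.template')
for obj in resp.get('Contents', []):
    print(obj['Key'], obj['StorageClass'])  # e.g. 'STANDARD'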
I am trying to upload content to Amazon S3, but I am getting this error:
boto3.exceptions.UnknownApiVersionError: The 's3' resource does not exist for the API version.
Valid API versions are: 2006-03-01
import boto3
boto3.resource('s3',**AWS_ACCESS_KEY_ID**,**AWS_PRIVATE_KEY**)
bucket = s3.Bucket( **NAME OF BUCKET**)
obj = bucket.Object(**KEY**)
obj.upload_fileobj(**FILE OBJECT**)
The error is caused by the DataNotFound exception raised in the boto3.Session source code: the credentials passed positionally are interpreted as region_name and api_version, so boto3 cannot find a matching API version. Perhaps the developers didn't anticipate people making the mistake of not passing the arguments correctly.
If you read the boto3 documentation examples, this is the correct way to upload data:
import boto3

# Pass the credentials as keyword arguments; passed positionally they are
# treated as region_name / api_version, which triggers UnknownApiVersionError.
s3 = boto3.resource('s3',
                    aws_access_key_id=AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=AWS_PRIVATE_KEY)
bucket = s3.Bucket(NAME_OF_BUCKET)
obj = bucket.Object("prefix/object_key_name")
# You must pass a file object!
with open('filename', 'rb') as fileobject:
    obj.upload_fileobj(fileobject)
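Alternatively, if the data is already in a local file, upload_file takes a path instead of an open file object; a minimal sketch with the same placeholder names as above:

s3 = boto3.resource('s3',
                    aws_access_key_id=AWS_ACCESS_KEY_ID,
                    aws_secret_access_key=AWS_PRIVATE_KEY)
# upload_file opens and streams the local file itself
s3.Bucket(NAME_OF_BUCKET).upload_file('filename', 'prefix/object_key_name')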