Python boto3 load model tar file from s3 and unpack it

I am using SageMaker and have a bunch of model.tar.gz files that I need to unpack and load in sklearn. I've been testing list_objects with a delimiter to get to the tar.gz files:
response = s3.list_objects(
    Bucket=bucket,
    Prefix='aleks-weekly/models/',
    Delimiter='.csv'
)
for i in response['Contents']:
    print(i['Key'])
And then I plan to extract with
import tarfile
tf = tarfile.open(model.read())
tf.extractall()
But how do I get the actual tar.gz file from S3 instead of some boto3 object?

You can download objects to files using s3.download_file(). This will make your code look like:
import os
import boto3
s3 = boto3.client('s3')
bucket = 'my-bukkit'
prefix = 'aleks-weekly/models/'
# List objects matching your criteria
response = s3.list_objects(
    Bucket=bucket,
    Prefix=prefix,
    Delimiter='.csv'
)
# Iterate over each file found and download it
for i in response['Contents']:
    key = i['Key']
    dest = os.path.join('/tmp', key)
    # The key contains slashes, so create the local directories first
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    print("Downloading file", key, "from bucket", bucket)
    s3.download_file(
        Bucket=bucket,
        Key=key,
        Filename=dest
    )
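To go from the downloaded object to unpacked files, one option is to read the archive bytes and hand them to tarfile through its fileobj argument. The following is a minimal sketch, assuming the archive is a gzipped tar that fits in memory and comes from a trusted source. The helper names extract_tar_gz_bytes and download_and_extract are hypothetical, not part of boto3.

```python
import io
import os
import tarfile


def extract_tar_gz_bytes(data, dest_dir):
    """Unpack a gzipped tar archive held in memory into dest_dir.

    Returns the member names found in the archive.
    """
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tf:
        names = tf.getnames()
        # Only extract archives you trust: members can carry unsafe paths.
        tf.extractall(path=dest_dir)
    return names


def download_and_extract(bucket, key, dest_dir):
    """Fetch s3://bucket/key and unpack it (hypothetical helper)."""
    import boto3  # imported here so the helper above works without boto3

    s3 = boto3.client("s3")
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    return extract_tar_gz_bytes(body, dest_dir)
```

After extraction, the model file inside dest_dir can be loaded with whatever serializer produced it (for sklearn models this is often joblib, but that depends on how the model was saved).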

Related

Extracting tar files from an S3 bucket to another S3 bucket using Python

We need to extract the contents of zip and tar files to another S3 bucket.
We have the code to extract the zip files working.
We need to use meta.client.upload_fileobj or meta.client.copy so that a multipart upload or copy is used when necessary.
import zipfile
from io import BytesIO
import boto3
def unzip_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    zip_obj = s3_resource.Object(
        bucket_name=source_bucketname, key=source_file_name)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as z:
        for filename in z.namelist():
            # Upload each archive member under a prefix named after the zip
            s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{filename}'
            )
We can't get the extraction of tar files to work.
def untar_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    s3_client = boto3.client('s3')
    s3_object = s3_client.get_object(Bucket=source_bucketname, Key=source_file_name)
    tar_file = s3_object['Body'].read()
    file_object = io.BytesIO(tar_file)
    with tarfile.open(fileobj=file_object, mode='r:gz') as z:
        for filename in z.getmembers():
            s3_resource.meta.client.upload_fileobj(
                filename,  # z.open(filename)
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{filename}'
            )
The problem is specifying the file object in the meta.client.upload_fileobj call.
We have tried z.open(filename).
We would be very grateful if anyone has any ideas.

Anon Coward answered this, but the answer seems to have been deleted.
s3_resource.meta.client.upload_fileobj(
    filename,  # z.open(filename)
    Bucket=target_bucketname,
    Key=f'{source_file_name}/{filename}'
)
needs to be
s3_resource.meta.client.upload_fileobj(
    z.extractfile(filename),
    Bucket=target_bucketname,
    Key=f'{source_file_name}/{filename.name}'
)
The source file needs to be z.extractfile(filename) and the destination key needs to use filename.name, because z.getmembers() returns TarInfo objects rather than file objects or names.
Many thanks Anon Coward
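Putting the fix together, here is a minimal sketch of the whole loop. It assumes the archive fits in memory, and it skips directory entries because extractfile() does not return a readable object for them. The function name untar_to_bucket and its parameters are hypothetical; s3_client is expected to be something like boto3.client('s3').

```python
import io
import tarfile


def untar_to_bucket(tar_bytes, target_bucket, key_prefix, s3_client):
    """Upload every regular file in a .tar.gz archive to target_bucket.

    Returns the list of keys that were uploaded.
    """
    uploaded = []
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links and other non-file entries
            fileobj = archive.extractfile(member)
            key = f"{key_prefix}/{member.name}"
            s3_client.upload_fileobj(fileobj, Bucket=target_bucket, Key=key)
            uploaded.append(key)
    return uploaded
```

Passing the client in as an argument keeps the function easy to exercise with a stand-in object before pointing it at a real bucket.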

Pass credentials / .json file stored in S3 bucket to GoogleServiceAccountClient

Here is the code to initialize the GoogleServiceAccountClient using the credentials from a JSON key file.
oauth2_client = oauth2.GoogleServiceAccountClient(key_file, oauth2.GetAPIScope('ad_manager'))
The key file (.json) is stored in an S3 bucket.
Is there any way to pass a .json file (with credentials) stored in S3 to GoogleServiceAccountClient?
P.S. Note to Stack Overflow member DalmTo: please do not close or merge this question :)

You can read the file from S3 and write it as a JSON file to your /tmp folder:
import json
from pathlib import Path
import boto3
def readFileFromS3(file_name):
    tmp_path = "/tmp/" + file_name
    file_path = Path(tmp_path)
    # Reuse the local copy if it was already downloaded
    if file_path.is_file():
        return tmp_path
    s3 = boto3.resource(
        's3',
        aws_access_key_id=<AWS_ACCESS_KEY>,
        aws_secret_access_key=<AWS_SECRET>,
        region_name=<YOUR_REGION_NAME>
    )
    content_object = s3.Object(<BUCKET_NAME>, file_name)
    file_content = content_object.get()['Body'].read().decode('utf-8')
    json_content = json.loads(file_content)
    with open(tmp_path, 'w') as res_file:
        json.dump(json_content, res_file, indent=4)
    return tmp_path
Then use the path returned from the above function in GoogleServiceAccountClient:
key_file = readFileFromS3(<key_file_name_in_s3>)
oauth2_client = oauth2.GoogleServiceAccountClient(key_file, oauth2.GetAPIScope('ad_manager'))
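A possible variation, sketched under the assumption that AWS credentials come from the environment or an IAM role instead of being hard-coded: let the client's download_file write the key file directly into a temporary directory and reuse it on later calls. The name fetch_key_file and its parameters are hypothetical; s3_client is expected to be something like boto3.client('s3').

```python
import os
import tempfile


def fetch_key_file(s3_client, bucket, key, cache_dir=None):
    """Download bucket/key once and return the local path to the copy."""
    cache_dir = cache_dir or tempfile.gettempdir()
    local_path = os.path.join(cache_dir, os.path.basename(key))
    # Skip the download when a previous call already fetched the file
    if not os.path.isfile(local_path):
        s3_client.download_file(Bucket=bucket, Key=key, Filename=local_path)
    return local_path
```

The returned path can then be passed to GoogleServiceAccountClient in the same way as in the answer above.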

Delete Cache-Control metadata from S3 using boto3 [duplicate]

boto3 documentation does not clearly specify how to update the user metadata of an already existing S3 Object.

It can be done using the copy_from() method:
import boto3
s3 = boto3.resource('s3')
s3_object = s3.Object('bucket-name', 'key')
s3_object.metadata.update({'id':'value'})
s3_object.copy_from(CopySource={'Bucket':'bucket-name', 'Key':'key'}, Metadata=s3_object.metadata, MetadataDirective='REPLACE')

You can do this using copy_from() on the resource, as this answer mentions, but you can also use the client's copy_object() and specify the same source and destination. The methods are equivalent and invoke the same code underneath.
import boto3
s3 = boto3.client("s3")
src_key = "my-key"
src_bucket = "my-bucket"
s3.copy_object(Key=src_key, Bucket=src_bucket,
               CopySource={"Bucket": src_bucket, "Key": src_key},
               Metadata={"my_new_key": "my_new_val"},
               MetadataDirective="REPLACE")
The 'REPLACE' value specifies that the metadata passed in the request should overwrite the source metadata entirely. If you mean to only add new key-values, or delete only some keys, you have to first read the original metadata, edit it, and then call copy_object with the edited copy.
To replace only a subset of the metadata correctly:
1. Retrieve the original metadata with head_object(Key=src_key, Bucket=src_bucket). Also take note of the ETag in the response.
2. Make the desired changes to the metadata locally.
3. Call copy_object as above to upload the new metadata, but pass CopySourceIfMatch=original_etag in the request to ensure the remote object has the metadata you expect before overwriting it. original_etag is the one you got in step 1. In case the metadata (or the data itself) has changed since head_object was called (e.g. by another program running simultaneously), copy_object will fail with an HTTP 412 error.
Reference: boto3 issue 389
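The three steps above can be sketched as a single helper. This is an illustration under assumptions, not a definitive implementation: update_metadata and its parameters are hypothetical names, s3_client is expected to be something like boto3.client('s3'), and the ETag is passed back exactly as head_object returned it.

```python
def update_metadata(s3_client, bucket, key, changes, remove=()):
    """Add, change or delete user metadata entries on an existing object.

    Returns the metadata dict that was written.
    """
    # Step 1: read the current metadata and remember the ETag
    head = s3_client.head_object(Bucket=bucket, Key=key)
    metadata = dict(head["Metadata"])

    # Step 2: apply the edits locally
    metadata.update(changes)
    for name in remove:
        metadata.pop(name, None)

    # Step 3: copy the object onto itself, guarded by the original ETag
    s3_client.copy_object(
        Bucket=bucket,
        Key=key,
        CopySource={"Bucket": bucket, "Key": key},
        Metadata=metadata,
        MetadataDirective="REPLACE",
        CopySourceIfMatch=head["ETag"],
    )
    return metadata
```

Deleting an entry such as a user-defined cache-control key then becomes a call like update_metadata(s3, "my-bucket", "my-key", {}, remove=["cache-control"]), where the bucket, key and entry names are placeholders.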

Similar to this answer, but with the existing metadata preserved while modifying only what is needed. From the system-defined metadata, I've only preserved ContentType and ContentDisposition in this example. Other system-defined metadata can be preserved similarly.
import boto3
s3 = boto3.client('s3')
response = s3.head_object(Bucket=bucket_name, Key=object_name)
response['Metadata']['new_meta_key'] = "new_value"
response['Metadata']['existing_meta_key'] = "new_value"
# head_object omits headers that were never set, so copy only the ones present
preserved = {name: response[name]
             for name in ('ContentType', 'ContentDisposition')
             if name in response}
result = s3.copy_object(Bucket=bucket_name, Key=object_name,
                        CopySource={'Bucket': bucket_name,
                                    'Key': object_name},
                        Metadata=response['Metadata'],
                        MetadataDirective='REPLACE', TaggingDirective='COPY',
                        **preserved)

You can update metadata either by adding a new entry or by replacing the value of an existing one. Here is the code I am using:
import boto3
param_1 = "YOUR_ACCESS_KEY"
param_2 = "YOUR_SECRET_KEY"
param_3 = "YOUR_END_POINT"
param_4 = "YOUR_BUCKET"
# Create the S3 client
s3ressource = boto3.client(
    service_name='s3',
    endpoint_url=param_3,
    aws_access_key_id=param_1,
    aws_secret_access_key=param_2,
    use_ssl=True,
)
# Build the list of image objects in a bucket
def BuildObjectListPerBucket(variablebucket):
    global listofObjectstobeanalyzed
    listofObjectstobeanalyzed = []
    extensions = ('.jpg', '.png')
    for key in s3ressource.list_objects(Bucket=variablebucket)["Contents"]:
        onemoreObject = key['Key']
        if onemoreObject.endswith(extensions):
            listofObjectstobeanalyzed.append(onemoreObject)
        else:
            # Warning: every object that is not a .jpg or .png is deleted
            s3ressource.delete_object(Bucket=variablebucket, Key=onemoreObject)
    return listofObjectstobeanalyzed
# For a given local file, upload it with initial metadata
def createmetdata(bucketname, objectname):
    s3ressource.upload_file(objectname, bucketname, objectname, ExtraArgs={"Metadata": {"metadata1": "ImageName", "metadata2": "ImagePROPERTIES", "metadata3": "ImageCREATIONDATE"}})
# For a given existing object, add new metadata
def ADDmetadata(bucketname, objectname):
    k = s3ressource.head_object(Bucket=bucketname, Key=objectname)
    m = k["Metadata"]
    m["new_metadata"] = "ImageNEWMETADATA"
    s3ressource.copy_object(Bucket=bucketname, Key=objectname, CopySource=bucketname + '/' + objectname, Metadata=m, MetadataDirective='REPLACE')
# For a given existing object, update a metadata entry with a new value
def CHANGEmetadata(bucketname, objectname):
    k = s3ressource.head_object(Bucket=bucketname, Key=objectname)
    m = k["Metadata"]
    m.update({'watson_visual_rec_dic': 'ImageCREATIONDATEEEEEEEEEEEEEEEEEEEEEEEEEE'})
    s3ressource.copy_object(Bucket=bucketname, Key=objectname, CopySource=bucketname + '/' + objectname, Metadata=m, MetadataDirective='REPLACE')
# Print the user metadata of an object
def readmetadata(bucketname, objectname):
    ALLDATAOFOBJECT = s3ressource.get_object(Bucket=bucketname, Key=objectname)
    ALLDATAOFOBJECTMETADATA = ALLDATAOFOBJECT['Metadata']
    print(ALLDATAOFOBJECTMETADATA)
# Create the list of objects on a per-bucket basis
BuildObjectListPerBucket(param_4)
# Call functions to see the results
for objectitem in listofObjectstobeanalyzed:
    # Call the function you want
    readmetadata(param_4, objectitem)
    ADDmetadata(param_4, objectitem)
    readmetadata(param_4, objectitem)
    CHANGEmetadata(param_4, objectitem)
    readmetadata(param_4, objectitem)

downloading file from S3 using boto3 key error

I am trying to download a joblib file from S3 but I am getting errors with the key format.
This is my S3 path to the file:
"s3://v1/v2/v3/v4/model.joblib"
This is my code:
import boto3
bucketname = "v1"
key = "v2/v3/v4"
filename = "model.joblib"
s3 = boto3.resource('s3')
obj = s3.Object(bucketname, key)
body = obj.get()['label_model.joblib'].read()
Ultimately I want to be able to do:
from joblib import load
model = load("model.joblib")
Error I got:
NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.

You are trying to access the file without the filename.
Your code is:
import boto3
bucketname = "v1"
key = "v2/v3/v4"
filename = "model.joblib"
s3 = boto3.resource('s3')
obj = s3.Object(bucketname, key)
body = obj.get()['label_model.joblib'].read()
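But the filename must be part of the key: the full key is v2/v3/v4/model.joblib. Note also that the object's content is under 'Body' in the get() response, not under the filename. Here is an example that downloads the file from S3:
import boto3
bucketname = "v1"
key = "v2/v3/v4"
filename = "model.joblib"
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucketname)
# Write to the local file named by the filename variable
with open(filename, 'wb') as f:
    bucket.download_fileobj(f'{key}/{filename}', f)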
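Since the confusion here is about which part of the S3 path is the bucket and which is the key, a small helper that splits the URI can make the mapping explicit. This is a minimal sketch using only the standard library; split_s3_uri is a hypothetical name.

```python
from urllib.parse import urlparse


def split_s3_uri(uri):
    """Split 's3://bucket/some/key' into (bucket, key)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    # The first path component after s3:// is the bucket; the rest is the key
    return parsed.netloc, parsed.path.lstrip("/")


bucket_name, full_key = split_s3_uri("s3://v1/v2/v3/v4/model.joblib")
print(bucket_name, full_key)
```

The resulting pair can be passed to s3.Object(bucket_name, full_key) or to download_fileobj as in the answer above.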

S3 Boto3 python - change all files acl to public read

I am trying to change ACL of 500k files within a S3 bucket folder from 'private' to 'public-read'
Is there any way to speed this up?
I am using the below snippet.
from boto3.session import Session
from multiprocessing.pool import ThreadPool
pool = ThreadPool(processes=100)
BUCKET_NAME = ""
aws_access_key_id = ""
aws_secret_access_key = ""
Prefix='pics/'
session = Session(aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
_s3 = session.resource("s3")
_bucket = _s3.Bucket(BUCKET_NAME)
def upload(eachObject):
    eachObject.Acl().put(ACL='public-read')
counter = 0
filenames = []
for eachObject in _bucket.objects.filter(Prefix=Prefix):
    counter += 1
    filenames.append(eachObject)
    if counter % 100 == 0:
        pool.map(upload, filenames)
        print(counter)
if filenames:
    pool.map(upload, filenames)

As far as I can tell, short of applying the ACL to the entire bucket, there is no way to apply an ACL to all items sharing a prefix without iterating through each item, like below:
import boto3
bucketName = 'YOUR_BUCKET_NAME'
prefix = "YOUR_FOLDER_PREFIX"
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucketName)
for obj in bucket.objects.filter(Prefix=prefix):
    obj.Acl().put(ACL='public-read')
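Since every object still needs its own request, the remaining lever is to issue those requests concurrently. Note that the snippet in the question never clears its filenames list, so each batch appears to re-send all earlier objects. Below is a minimal sketch that submits each key exactly once, assuming a single low-level client is shared across threads (boto3 documents clients as generally thread-safe). The name make_all_public and the worker count are hypothetical choices.

```python
from concurrent.futures import ThreadPoolExecutor


def make_all_public(s3_client, bucket, keys, max_workers=32):
    """Set ACL 'public-read' on each key, using a pool of worker threads.

    Returns the keys in the order they were given.
    """
    def put_acl(key):
        s3_client.put_object_acl(Bucket=bucket, Key=key, ACL="public-read")
        return key

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map hands each key to exactly one worker
        return list(pool.map(put_acl, keys))
```

The keys could come from iterating bucket.objects.filter(Prefix=prefix) and collecting obj.key; how many workers help before S3 starts throttling is something to measure rather than assume.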
