Extracting tar files from an S3 bucket to another S3 bucket using Python

We need to extract the contents of zip and tar files from one S3 bucket into another S3 bucket.
We have the code for extracting zip files working.
We need to use meta.client.upload_fileobj or meta.client.copy so that multipart upload or copy is used when necessary.
import zipfile
from io import BytesIO

import boto3

def unzip_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    # Read the zip archive from the source bucket into memory
    zip_obj = s3_resource.Object(
        bucket_name=source_bucketname, key=source_file_name)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as z:
        for filename in z.namelist():
            # Upload each archive member to the target bucket under a
            # prefix named after the source archive
            s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{filename}'
            )
We can't get the extraction of tar files to work.
import io
import tarfile

import boto3

def untar_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    s3_client = boto3.client('s3')
    # Read the tar.gz archive from the source bucket into memory
    s3_object = s3_client.get_object(Bucket=source_bucketname, Key=source_file_name)
    tar_file = s3_object['Body'].read()
    file_object = io.BytesIO(tar_file)
    with tarfile.open(fileobj=file_object, mode='r:gz') as z:
        for filename in z.getmembers():
            s3_resource.meta.client.upload_fileobj(
                filename,  # z.open(filename)
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{filename}'
            )
The problem is specifying the file object to pass to the meta.client.upload_fileobj call.
We have tried z.open(filename)
We would be very grateful if anyone has any ideas.

Anon Coward answered this but the answer seems to have been deleted.
s3_resource.meta.client.upload_fileobj(
    filename,  # z.open(filename)
    Bucket=target_bucketname,
    Key=f'{source_file_name}/{filename}'
)
needs to be
s3_resource.meta.client.upload_fileobj(
    z.extractfile(filename),
    Bucket=target_bucketname,
    Key=f'{source_file_name}/{filename.name}'
)
The source file needs to be z.extractfile(filename) and the destination filename needs to be filename.name.
Many thanks Anon Coward
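For reference, here is a corrected version of the whole function as a sketch; it assumes a gzip-compressed tar archive and skips non-file members (directories, links), for which z.extractfile() returns None.

import io
import tarfile

import boto3

def untar_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    s3_client = boto3.client('s3')
    # Read the tar.gz archive from the source bucket into memory
    s3_object = s3_client.get_object(Bucket=source_bucketname, Key=source_file_name)
    file_object = io.BytesIO(s3_object['Body'].read())
    with tarfile.open(fileobj=file_object, mode='r:gz') as z:
        for member in z.getmembers():
            if not member.isfile():
                # extractfile() returns None for directories and links
                continue
            s3_resource.meta.client.upload_fileobj(
                z.extractfile(member),
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{member.name}'
            )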

Related

Pass credentials / .json file stored in S3 bucket to GoogleServiceAccountClient

Here is the code to initialize the GoogleServiceAccountClient using the credentials from a JSON key file.
oauth2_client = oauth2.GoogleServiceAccountClient(key_file, oauth2.GetAPIScope('ad_manager'))
The key_file .json is stored in an S3 bucket.
Is there any way to pass the .json file (with credentials) stored in S3 to GoogleServiceAccountClient?
P.S. Note to Stack Overflow member DalmTo: please do not close or merge this question :)
You can read the file from S3 and write it as a JSON file to your /tmp folder:
import json
from pathlib import Path

import boto3

def readFileFromS3(file_name):
    tmp_path = "/tmp/" + file_name
    file_path = Path(tmp_path)
    # Reuse the file if it has already been downloaded
    if file_path.is_file():
        return tmp_path
    s3 = boto3.resource(
        's3',
        aws_access_key_id=<AWS_ACCESS_KEY>,
        aws_secret_access_key=<AWS_SECRET>,
        region_name=<YOUR_REGION_NAME>
    )
    content_object = s3.Object(<BUCKET_NAME>, file_name)
    file_content = content_object.get()['Body'].read().decode('utf-8')
    json_content = json.loads(file_content)
    # Write the credentials to /tmp so they can be passed as a file path
    with open(tmp_path, 'w') as res_file:
        json.dump(json_content, res_file, indent=4)
    return tmp_path
Then use the path returned from the above function in GoogleServiceAccountClient
key_file = readFileFromS3(<key_file_name_in_s3>)
oauth2_client = oauth2.GoogleServiceAccountClient(key_file, oauth2.GetAPIScope('ad_manager'))
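If you don't need to parse and re-serialize the JSON, a shorter variant is to download the object byte-for-byte; a minimal sketch, assuming the same /tmp caching and placeholder bucket/key names:

from pathlib import Path

import boto3

def download_key_file(bucket_name, file_name):
    tmp_path = "/tmp/" + file_name
    if not Path(tmp_path).is_file():
        # Download the key file as-is; no JSON parsing needed
        boto3.resource('s3').Object(bucket_name, file_name).download_file(tmp_path)
    return tmp_path

key_file = download_key_file(<BUCKET_NAME>, <key_file_name_in_s3>)
oauth2_client = oauth2.GoogleServiceAccountClient(key_file, oauth2.GetAPIScope('ad_manager'))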

Python boto3 load model tar file from s3 and unpack it

I am using SageMaker and have a bunch of model.tar.gz files that I need to unpack and load in sklearn. I've been testing using list_objects with a delimiter to get to the tar.gz files:
response = s3.list_objects(
    Bucket = bucket,
    Prefix = 'aleks-weekly/models/',
    Delimiter = '.csv'
)
for i in response['Contents']:
    print(i['Key'])
And then I plan to extract with
import tarfile
tf = tarfile.open(model.read())
tf.extractall()
But how do I get to the actual tar.gz file from S3 instead of some boto3 object?
You can download objects to files using s3.download_file(). This will make your code look like:
import os

import boto3

s3 = boto3.client('s3')
bucket = 'my-bukkit'
prefix = 'aleks-weekly/models/'

# List objects matching your criteria
response = s3.list_objects(
    Bucket = bucket,
    Prefix = prefix,
    Delimiter = '.csv'
)

# Iterate over each file found and download it
for i in response['Contents']:
    key = i['Key']
    dest = os.path.join('/tmp', key)
    # Make sure the local directory exists, since the key contains '/' separators
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    print("Downloading file", key, "from bucket", bucket)
    s3.download_file(
        Bucket = bucket,
        Key = key,
        Filename = dest
    )

Read and parse CSV file in S3 without downloading the entire file using Python

So, I want to read a large CSV file from an S3 bucket, but I don't want the file to be downloaded completely into memory; what I want to do is somehow stream the file in chunks and then process it.
So far this is what I have done, but I don't think this is going to solve the problem.
import logging
import boto3
import codecs
import os
import csv

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # retrieve bucket name and file_key from the S3 event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    chunk, chunksize = [], 1000
    if file_key.endswith('.csv'):
        LOGGER.info('Reading {} from {}'.format(file_key, bucket_name))
        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_object = obj['Body']
        count = 0
        for i, line in enumerate(file_object):
            count += 1
            if (i % chunksize == 0 and i > 0):
                process_chunk(chunk)
                del chunk[:]
            chunk.append(line)

def process_chunk(chuck):
    print(len(chuck))
This will do what you want to achieve. It won't download the whole file into memory; instead it will download it in chunks, process them, and proceed:
from smart_open import smart_open
import csv

def get_s3_file_stream(s3_path):
    """
    This function will return a stream of the s3 file.
    The s3_path should be of the format: '<bucket_name>/<file_path_inside_the_bucket>'
    """
    # This is the full path with credentials:
    complete_s3_path = 's3://' + aws_access_key_id + ':' + aws_secret_access_key + '@' + s3_path
    return smart_open(complete_s3_path, encoding='utf8')

def download_and_process_csv(s3_path):
    datareader = csv.DictReader(get_s3_file_stream(s3_path))
    for row in datareader:
        yield process_csv(row)  # write a function to do whatever you want to do with the CSV
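A minimal usage sketch, assuming process_csv and the AWS credential variables are defined elsewhere; the bucket and file path are placeholders:

# Stream and process the CSV row by row without loading it all into memory
for result in download_and_process_csv('<bucket_name>/path/to/large_file.csv'):
    print(result)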
Did you try AWS Athena (https://aws.amazon.com/athena/)?
It's serverless and pay-as-you-go, and it does everything you want without downloading the file.
BlazingSQL is open source and also useful for big-data problems.

S3 Boto3 python - change all files acl to public read

I am trying to change the ACL of 500k files within an S3 bucket folder from 'private' to 'public-read'.
Is there any way to speed this up?
I am using the below snippet.
from boto3.session import Session
from multiprocessing.pool import ThreadPool

pool = ThreadPool(processes=100)

BUCKET_NAME = ""
aws_access_key_id = ""
aws_secret_access_key = ""
Prefix='pics/'

session = Session(aws_access_key_id=aws_access_key_id, aws_secret_access_key=aws_secret_access_key)
_s3 = session.resource("s3")
_bucket = _s3.Bucket(BUCKET_NAME)

def upload(eachObject):
    eachObject.Acl().put(ACL='public-read')

counter = 0
filenames = []
for eachObject in _bucket.objects.filter(Prefix=Prefix):
    counter += 1
    filenames.append(eachObject)
    if counter % 100 == 0:
        pool.map(upload, filenames)
        print(counter)

if filenames:
    pool.map(upload, filenames)
As far as I can tell, short of applying the ACL to the entire bucket, there is no way to apply the ACL to all items sharing the same prefix without iterating through each item, like below:
import boto3

bucketName = 'YOUR_BUCKET_NAME'
prefix = "YOUR_FOLDER_PREFIX"

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucketName)

# Apply the ACL to every object under the prefix
[obj.Acl().put(ACL='public-read') for obj in bucket.objects.filter(Prefix=prefix).all()]
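To speed the loop itself up, the per-object ACL calls can be run concurrently. The sketch below uses concurrent.futures instead of multiprocessing.pool; it is an illustration rather than a benchmarked solution, and the bucket and prefix names are placeholders:

from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('YOUR_BUCKET_NAME')

def make_public(obj):
    # One ACL PUT request per object; there is no single call for a whole prefix
    obj.Acl().put(ACL='public-read')

with ThreadPoolExecutor(max_workers=50) as executor:
    # objects.filter() paginates automatically, so this covers all the keys
    executor.map(make_public, bucket.objects.filter(Prefix='YOUR_FOLDER_PREFIX'))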

How to pick up file name dynamically while uploading the file in S3 with Python?

I am working on a requirement where I have to save the logs of my ETL scripts to an S3 location.
I am able to store the logs on my local system, and now I need to upload them to S3.
For this I have written the following code:
import logging
import datetime
import boto3
from boto3.s3.transfer import S3Transfer
from etl import CONFIG

FORMAT = '%(asctime)s [%(levelname)s] %(filename)s:%(lineno)s %(funcName)s() : %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'

logger = logging.getLogger()
logger.setLevel(logging.INFO)

S3_DOMAIN = 'https://s3-ap-southeast-1.amazonaws.com'
S3_BUCKET = CONFIG['S3_BUCKET']
filepath = ''
folder_name = 'etl_log'
filename = ''

def log_file_conf(merchant_name, table_name):
    log_filename = (datetime.datetime.now().strftime('%Y-%m-%dT%H-%M-%S')
                    + '_' + table_name + '.log')
    fh = logging.FileHandler("E:/test/etl_log/" + merchant_name + "/" + log_filename)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(FORMAT, DATETIME_FORMAT))
    logger.addHandler(fh)

client = boto3.client('s3',
                      aws_access_key_id=CONFIG['S3_KEY'],
                      aws_secret_access_key=CONFIG['S3_SECRET'])
transfer = S3Transfer(client)
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + filename)
The issue I am facing is that logs are generated for different merchants, so their names are based on the merchant; I have taken care of this when saving locally.
But when uploading to S3 I don't know how to select the log file name.
Can anyone please help me achieve my goal?
S3 is an object store; it doesn't have a "real path", and the so-called path (the "/" separator) is actually cosmetic. So nothing prevents you from using something similar to your local file naming convention, e.g.
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + merchant_name + "/" + filename)
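To pick the file name up dynamically instead of hard-coding it, one option is to derive the S3 key from the local log path. A sketch, assuming log_file_conf is changed to return the path it builds (that return value is hypothetical, not in the original code):

import os

# local_log_path would be the value returned by log_file_conf()
local_log_path = "E:/test/etl_log/" + merchant_name + "/" + log_filename
filename = os.path.basename(local_log_path)
key = folder_name + "/" + merchant_name + "/" + filename

transfer.upload_file(local_log_path, S3_BUCKET, key)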
To list all the files under an arbitrary path (it is called a "prefix"), you just do this:
# simple list_objects call, not handling pagination; max 1000 objects listed
client.list_objects(
    Bucket = S3_BUCKET,
    Prefix = folder_name + "/" + merchant_name
)