How can I get an S3 zip file and attach it to my email using boto3? - amazon-s3

I'm trying to get a zip file from my S3 bucket and then attach it to my email using boto3. I tried this, but it doesn't work:
import boto3
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart

def get_object(bucket, key):
    client = boto3.client("s3")
    return client.get_object(Bucket=bucket, Key=key)

file = get_object(BUCKET, key)

msg = MIMEMultipart()
msg_1 = MIMEBase('application')
msg_1.set_payload(file['Body'].read())
encoders.encode_base64(msg_1)
msg_1.add_header('Content-Disposition', 'attachment', filename='file.zip')
msg.attach(msg_1)
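For reference, here is a minimal sketch of one way this can work, assuming the mail is sent through SES (the bucket, key and addresses are placeholders, and send_raw_email is an assumption about your mail transport); note that MIMEBase expects both a main type and a subtype:

import boto3
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart

BUCKET = 'my-bucket'          # placeholder
key = 'reports/file.zip'      # placeholder

s3 = boto3.client('s3')
ses = boto3.client('ses')

# Build the message container
msg = MIMEMultipart()
msg['Subject'] = 'Your report'
msg['From'] = 'sender@example.com'       # placeholder, must be SES-verified
msg['To'] = 'recipient@example.com'      # placeholder

# Fetch the zip from S3 and wrap it as a base64-encoded attachment
obj = s3.get_object(Bucket=BUCKET, Key=key)
part = MIMEBase('application', 'zip')    # both maintype and subtype are required
part.set_payload(obj['Body'].read())
encoders.encode_base64(part)
part.add_header('Content-Disposition', 'attachment', filename='file.zip')
msg.attach(part)

# Send the raw MIME message through SES
ses.send_raw_email(
    Source=msg['From'],
    Destinations=[msg['To']],
    RawMessage={'Data': msg.as_string()},
)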

Related

boto3 default session Profile - Access denied

I need to put a file on S3 using a library based on boto3 (great_expectations), and I only have write permissions on a specific profile.
Using AWS_PROFILE seems to set the default session's profile, but it does not help. To replicate, I ran the code myself and got the same result (AccessDenied):
import boto3
import os
import json

s3 = boto3.resource("s3")
bucket = 'my-bucket'
key = 'path/to/the/file.json'
content_encoding = "utf-8"
content_type = "application/json"
value = json.dumps({'test': 'test indeed'})

os.environ['AWS_PROFILE']
>>> mycorrectprofile
boto3.DEFAULT_SESSION.profile_name
>>> mycorrectprofile

s3_object = s3.Object(bucket, key)
s3_object.put(
    Body=value.encode(content_encoding),
    ContentEncoding=content_encoding,
    ContentType=content_type,
)
That results in AccessDenied.
Now, right after that, I do:
my_session = boto3.session.Session(profile_name=os.getenv('AWS_PROFILE'))
ss3 = my_session.resource('s3')
r2_s3 = ss3.Object(bucket, key)
That runs smoothly.
What is happening here, and how can I resolve that behavior?
boto3.__version__ = '1.15.0'
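One hedged workaround sketch, based on the observation above that an explicit session works: pin the default session to the profile before any client or resource is created, so that code you do not control (such as great_expectations) picks up the same credentials. The profile and object names below are placeholders:

import os
import boto3

# Assumption: AWS_PROFILE is already exported in the environment
boto3.setup_default_session(profile_name=os.environ['AWS_PROFILE'])

# Resources created after this point use the pinned profile's credentials
s3 = boto3.resource('s3')
s3.Object('my-bucket', 'path/to/the/file.json').put(
    Body=b'{"test": "test indeed"}',
    ContentEncoding='utf-8',
    ContentType='application/json',
)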

downloading file from S3 using boto3 key error

I am trying to download a joblib file from S3 but am getting errors with the key format.
This is my S3 path to the file:
"s3://v1/v2/v3/v4/model.joblib"
This is my code:
import boto3
bucketname = "v1"
key = "v2/v3/v4"
filename = "model.joblib"
s3 = boto3.resource('s3')
obj = s3.Object(bucketname, key)
body = obj.get()['label_model.joblib'].read()
Ultimately I want to be able to do:
from joblib import load
model = load("model.joblib")
The error I got:
NoSuchKey: An error occurred (NoSuchKey) when calling the GetObject operation: The specified key does not exist.
You are trying to access the file without the filename.
Your code is:
import boto3
bucketname = "v1"
key = "v2/v3/v4"
filename = "model.joblib"
s3 = boto3.resource('s3')
obj = s3.Object(bucketname, key)
body = obj.get()['label_model.joblib'].read()
But you need to add the filename to the key variable. Here is an example downloading the file from s3:
bucketname = "v1"
key = "v2/v3/v4"
filename = "model.joblib"
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucketname)
with open(filename, 'wb') as f:
    bucket.download_fileobj(f'{key}/{filename}', f)
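To come back to the original goal, here is a short sketch that loads the model without touching the local filesystem (the bucket and key are taken from the question; the only extra assumption is that joblib.load is handed a file-like object):

import io
import boto3
from joblib import load

s3 = boto3.resource('s3')
# The key must include the filename, as explained above
obj = s3.Object('v1', 'v2/v3/v4/model.joblib')
body = obj.get()['Body'].read()      # 'Body' is the response field that holds the object bytes
model = load(io.BytesIO(body))       # joblib.load also accepts a file-like object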

Python boto3 load model tar file from s3 and unpack it

I am using Sagemaker and have a bunch of model.tar.gz files that I need to unpack and load in sklearn. I've been testing using list_objects with delimiter to get to the tar.gz files:
response = s3.list_objects(
    Bucket=bucket,
    Prefix='aleks-weekly/models/',
    Delimiter='.csv'
)
for i in response['Contents']:
    print(i['Key'])
And then I plan to extract with
import tarfile
tf = tarfile.open(model.read())
tf.extractall()
But how do I get to the actual tar.gz file from S3 instead of some boto3 object?
You can download objects to files using s3.download_file(). This will make your code look like:
import os
import boto3

s3 = boto3.client('s3')
bucket = 'my-bukkit'
prefix = 'aleks-weekly/models/'

# List objects matching your criteria
response = s3.list_objects(
    Bucket=bucket,
    Prefix=prefix,
    Delimiter='.csv'
)

# Iterate over each file found and download it
for i in response['Contents']:
    key = i['Key']
    dest = os.path.join('/tmp', key)
    print("Downloading file", key, "from bucket", bucket)
    s3.download_file(
        Bucket=bucket,
        Key=key,
        Filename=dest
    )

How to load a zip file (containing shp) from s3 bucket to Geopandas?

I zipped the name.shp, name.shx and name.dbf files and uploaded them to an AWS S3 bucket. Now I want to load this zip file and convert the contained shapefile into a geopandas GeoDataFrame.
I can do it perfectly if the file is a zipped GeoJSON instead of a zipped shapefile.
import io
import boto3
import geopandas as gpd
import zipfile

cliente = boto3.client("s3", aws_access_key_id=ak, aws_secret_access_key=sk)
bucket_name = 'bucketname'
object_key = 'myfolder/locations.zip'
bytes_buffer = io.BytesIO()
cliente.download_fileobj(Bucket=bucket_name, Key=object_key, Fileobj=bytes_buffer)
geojson = bytes_buffer.getvalue()

with zipfile.ZipFile(bytes_buffer) as zi:
    with zi.open("locations.shp") as file:
        print(gpd.read_file(file.read().decode('ISO-8859-9')))
I got this error:
ç­¤íEÀ¡ËÆ3À: No such file or directory
Basically, the geopandas package allows you to read files directly from S3, and, as mentioned in the answer above, it can read zip files as well. Below is the code that reads a zip file from S3 without downloading it: start the path with zip+s3://, then append the S3 path.
geopandas.read_file(f'zip+s3://bucket-name/file.zip')
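Applied to the bucket and key from the question (credentials are resolved from the usual AWS configuration; the names are the question's placeholders), that looks like:

import geopandas as gpd

# locations.zip must contain all shapefile parts (.shp, .shx, .dbf, ...)
gdf = gpd.read_file('zip+s3://bucketname/myfolder/locations.zip')
print(gdf.head())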
You can read the zip directly; there is no need to use zipfile. You need all parts of the shapefile, not just the .shp itself, which is why it works with GeoJSON. You just need to pass it with zip:///. So instead of
gpd.read_file('path/file.shp')
You go with
gpd.read_file('zip:///path/file.zip')
I am not familiar enough with boto3 to know at which point you actually have this path, but I think it will help.
I do not know if it can be of any help, but I faced a similar problem recently, though I only wanted to read the .shp with fiona. Like others, I ended up zipping the relevant .shp, .dbf, .cpg and .shx on the bucket.
To read from the bucket, I do it like so:
from io import BytesIO
from pathlib import Path
from typing import List
from typing import Union

import boto3
import fiona
from fiona.io import ZipMemoryFile
from pydantic import BaseSettings
from shapely.geometry import Point
from shapely.geometry import Polygon


class S3Configuration(BaseSettings):
    """
    S3 configuration class
    """
    s3_access_key_id: str = ''
    s3_secret_access_key: str = ''
    s3_region_name: str = ''
    s3_endpoint_url: str = ''
    s3_bucket_name: str = ''
    s3_use: bool = False


S3_CONF = S3Configuration()
S3_STR = 's3'
S3_SESSION = boto3.session.Session()
S3 = S3_SESSION.resource(
    service_name=S3_STR,
    aws_access_key_id=S3_CONF.s3_access_key_id,
    aws_secret_access_key=S3_CONF.s3_secret_access_key,
    endpoint_url=S3_CONF.s3_endpoint_url,
    region_name=S3_CONF.s3_region_name,
    use_ssl=True,
    verify=True,
)
BUCKET = S3_CONF.s3_bucket_name
CordexShape = Union[Polygon, List[Polygon], List[Point]]
ZIP_EXT = '.zip'


def get_shapefile_data(file_path: Path, s3_use: bool = S3_CONF.s3_use) -> CordexShape:
    """
    Retrieves the shapefile content associated to the passed file_path (either on disk or on S3).
    file_path is a .shp file.
    """
    if s3_use:
        return load_zipped_shp(get_s3_object(file_path.with_suffix(ZIP_EXT)), file_path)
    return load_shp(file_path)


def get_s3_object(file_path: Path) -> bytes:
    """
    Retrieve as bytes the content associated to the passed file_path
    """
    return S3.Object(bucket_name=BUCKET, key=forge_key(file_path)).get()['Body'].read()


def forge_key(file_path: Path) -> str:
    """
    Edit this code at your convenience to forge the bucket key out of the passed file_path
    """
    return str(file_path.relative_to(*file_path.parts[:2]))


def load_shp(file_path: Path) -> CordexShape:
    """
    Retrieve a list of Polygons stored at file_path location
    """
    with fiona.open(file_path) as shape:
        parsed_shape = list(shape)
    return parsed_shape


def load_zipped_shp(zipped_data: bytes, file_path: Path) -> CordexShape:
    """
    Retrieve a list of Polygons stored at file_path location
    """
    with ZipMemoryFile(BytesIO(zipped_data)) as zip_memory_file:
        with zip_memory_file.open(file_path.name) as shape:
            parsed_shape = list(shape)
    return parsed_shape
There is quite a lot of code, but the first part is very helpful for easily using a MinIO proxy for local development (you just have to change the .env).
The key to solving the issue for me was fiona's ZipMemoryFile, which is not so well documented (in my opinion) but was a life saver (in my case :)).

Read and parse CSV file in S3 without downloading the entire file using Python

So, I want to read a large CSV file from an S3 bucket, but I don't want the file to be completely downloaded into memory. What I want to do is somehow stream the file in chunks and then process it.
This is what I have done so far, but I don't think it is going to solve the problem.
import logging
import boto3
import codecs
import os
import csv

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)
s3 = boto3.client('s3')

def lambda_handler(event, context):
    # retrieve bucket name and file_key from the S3 event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    chunk, chunksize = [], 1000
    if file_key.endswith('.csv'):
        LOGGER.info('Reading {} from {}'.format(file_key, bucket_name))
        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_object = obj['Body']
        count = 0
        for i, line in enumerate(file_object):
            count += 1
            if i % chunksize == 0 and i > 0:
                process_chunk(chunk)
                del chunk[:]
            chunk.append(line)

def process_chunk(chuck):
    print(len(chuck))
This will do what you want to achieve. It won't download the whole file into memory; instead it will download in chunks, process them, and proceed:
from smart_open import smart_open
import csv

def get_s3_file_stream(s3_path):
    """
    This function will return a stream of the s3 file.
    The s3_path should be of the format: '<bucket_name>/<file_path_inside_the_bucket>'
    """
    # This is the full path with credentials:
    complete_s3_path = 's3://' + aws_access_key_id + ':' + aws_secret_access_key + '@' + s3_path
    return smart_open(complete_s3_path, encoding='utf8')

def download_and_process_csv(s3_path):
    datareader = csv.DictReader(get_s3_file_stream(s3_path))
    for row in datareader:
        yield process_csv(row)  # write a function to do whatever you want to do with the CSV
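A hypothetical usage sketch of the two helpers above (the credentials, the bucket/key and the process_csv handler are all placeholders):

aws_access_key_id = 'AKIA...'        # placeholder, read by get_s3_file_stream
aws_secret_access_key = '...'        # placeholder

def process_csv(row):
    # hypothetical per-row handler: pick out a single column
    return row['some_column']

for result in download_and_process_csv('my-bucket/path/to/big_file.csv'):
    print(result)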
Did you try AWS Athena (https://aws.amazon.com/athena/)?
It is extremely good: serverless and pay-as-you-go. It does everything you want without downloading the file.
BlazingSQL is open source and also useful for big-data problems.