Read and parse a CSV file in S3 without downloading the entire file using Python

I want to read a large CSV file from an S3 bucket, but I don't want the file to be downloaded completely into memory. What I want is to somehow stream the file in chunks and then process each chunk.
This is what I have so far, but I don't think it solves the problem.
import logging
import boto3
import codecs
import os
import csv

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # retrieve bucket name and file_key from the S3 event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    chunk, chunksize = [], 1000
    if file_key.endswith('.csv'):
        LOGGER.info('Reading {} from {}'.format(file_key, bucket_name))
        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_object = obj['Body']
        count = 0
        for i, line in enumerate(file_object):
            count += 1
            if (i % chunksize == 0 and i > 0):
                process_chunk(chunk)
                del chunk[:]
            chunk.append(line)

def process_chunk(chunk):
    print(len(chunk))

This will do what you want to achieve. It won't download the whole file into memory; instead it will stream it, process it in chunks, and move on:
from smart_open import smart_open
import csv

def get_s3_file_stream(s3_path):
    """
    This function will return a stream of the s3 file.
    The s3_path should be of the format: '<bucket_name>/<file_path_inside_the_bucket>'
    """
    # This is the full path with credentials:
    complete_s3_path = 's3://' + aws_access_key_id + ':' + aws_secret_access_key + '@' + s3_path
    return smart_open(complete_s3_path, encoding='utf8')

def download_and_process_csv(s3_path):
    datareader = csv.DictReader(get_s3_file_stream(s3_path))
    for row in datareader:
        yield process_csv(row)  # write a function to do whatever you want to do with the CSV
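For completeness, a hypothetical call site could look like this (the bucket and key below are placeholders, and process_csv is still the function you supply):

# Hypothetical usage of the generator above; 'my-bucket/data/large_file.csv' is a placeholder path.
for result in download_and_process_csv('my-bucket/data/large_file.csv'):
    print(result)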

Did you try AWS Athena (https://aws.amazon.com/athena/)?
It is serverless and pay-as-you-go, and it can query the file in place without downloading it.
BlazingSQL is open source and also useful for big-data problems.
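Not part of the original answer, but to make the Athena suggestion concrete: a minimal boto3 sketch, assuming you have already created an Athena table (here my_csv_table in database my_db) over the CSV's S3 prefix and have a writable results location; all names are placeholders.

import time
import boto3

athena = boto3.client('athena')

# Kick off a query against the table defined over the CSV files in S3.
query = athena.start_query_execution(
    QueryString='SELECT col_a, col_b FROM my_csv_table LIMIT 100',
    QueryExecutionContext={'Database': 'my_db'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-results/'}
)
query_id = query['QueryExecutionId']

# Poll until the query reaches a terminal state.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

# Page through the result rows without ever downloading the source CSV.
if state == 'SUCCEEDED':
    for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
        print([field.get('VarCharValue') for field in row['Data']])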

Related

Extracting tar files from an S3 bucket to another S3 bucket using Python

We need to extract the contents of zip and tar files to another S3 bucket.
We have the code to extract the zip files working.
We need to use meta.client.upload_fileobj or meta.client.copy so that multipart upload or copy is used when necessary.
def unzip_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    s3_client = boto3.client('s3')
    target_directory = source_file_name + '/'
    zip_obj = s3_resource.Object(
        bucket_name=source_bucketname, key=source_file_name)
    buffer = BytesIO(zip_obj.get()["Body"].read())
    with zipfile.ZipFile(buffer, mode='r', allowZip64=True) as z:
        for filename in z.namelist():
            file_info = z.getinfo(filename)
            s3_resource.meta.client.upload_fileobj(
                z.open(filename),
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{filename}'
            )
We can't get the extraction of tar files to work.
def untar_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    s3_client = boto3.client('s3')
    target_directory = source_file_name + '/'
    s3_object = s3_client.get_object(Bucket=source_bucketname, Key=source_file_name)
    tar_file = s3_object['Body'].read()
    file_object = io.BytesIO(tar_file)
    with tarfile.open(fileobj=file_object, mode=('r:gz')) as z:
        for filename in z.getmembers():
            s3_resource.meta.client.upload_fileobj(
                filename,  # z.open(filename)
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{filename}'
            )
The problem is specifying the filename object in the meta.client.upload_fileobj command.
We have tried z.open(filename)
We would be very grateful if anyone has any ideas.
Anon Coward answered this but the answer seems to have been deleted.
s3_resource.meta.client.upload_fileobj(
    filename,  # z.open(filename)
    Bucket=target_bucketname,
    Key=f'{source_file_name}/{filename}'
)
needs to be
s3_resource.meta.client.upload_fileobj(
    z.extractfile(filename),
    Bucket=target_bucketname,
    Key=f'{source_file_name}/{filename.name}'
)
The source file needs to be z.extractfile(filename) and the destination filename needs to be filename.name.
Many thanks Anon Coward
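Putting the fix together, a corrected version of the function would look roughly like this (untested sketch based on the answer above; it also skips non-file members such as directories):

import io
import boto3
import tarfile

def untar_file(source_bucketname, source_file_name, target_bucketname):
    s3_resource = boto3.resource('s3')
    s3_client = boto3.client('s3')
    s3_object = s3_client.get_object(Bucket=source_bucketname, Key=source_file_name)
    file_object = io.BytesIO(s3_object['Body'].read())
    with tarfile.open(fileobj=file_object, mode='r:gz') as z:
        for member in z.getmembers():
            if not member.isfile():
                continue  # skip directories and other non-file entries
            s3_resource.meta.client.upload_fileobj(
                z.extractfile(member),  # file-like object for the member
                Bucket=target_bucketname,
                Key=f'{source_file_name}/{member.name}'
            )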

How to upload large file (~100mb) to Azure blob storage using Python SDK?

I am using the latest Azure Storage SDK (azure-storage-blob 12.7.1). It works fine for smaller files but throws exceptions for larger files (> 30 MB):
azure.core.exceptions.ServiceResponseError: ('Connection aborted.',
timeout('The write operation timed out'))
from azure.storage.blob import BlobServiceClient, PublicAccess, BlobProperties, ContainerClient

def upload(file):
    settings = read_settings()
    connection_string = settings['connection_string']
    container_client = ContainerClient.from_connection_string(connection_string, 'backup')
    blob_client = container_client.get_blob_client(file)
    with open(file, "rb") as data:
        blob_client.upload_blob(data)
    print(f'{file} uploaded to blob storage')

upload('crashes.csv')
Your code worked for me when I tried to upload a ~180 MB .txt file. But since uploading small files works for you, uploading your big file in small parts could be a workaround. Try the code below:
from azure.storage.blob import BlobClient

storage_connection_string = ''
container_name = ''
dest_file_name = ''
local_file_path = ''

blob_client = BlobClient.from_connection_string(storage_connection_string, container_name, dest_file_name)

# upload 4 MB for each request
chunk_size = 4 * 1024 * 1024

if blob_client.exists():
    blob_client.delete_blob()
blob_client.create_append_blob()

with open(local_file_path, "rb") as stream:
    while True:
        read_data = stream.read(chunk_size)
        if not read_data:
            print('uploaded')
            break
        blob_client.append_block(read_data)
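Not from the original answer, but if you want to keep a regular block blob instead of an append blob, another commonly suggested workaround is to let upload_blob do the chunking while tuning block size and concurrency; a rough sketch, assuming azure-storage-blob 12.x (the size values are arbitrary):

from azure.storage.blob import BlobClient

# Force uploads above 4 MB to be split into 4 MB blocks instead of one large PUT.
blob_client = BlobClient.from_connection_string(
    storage_connection_string,
    container_name,
    dest_file_name,
    max_single_put_size=4 * 1024 * 1024,
    max_block_size=4 * 1024 * 1024,
)

with open(local_file_path, "rb") as data:
    # Several blocks are uploaded in parallel; adjust max_concurrency to taste.
    blob_client.upload_blob(data, overwrite=True, max_concurrency=4)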

Python boto3 load model tar file from s3 and unpack it

I am using Sagemaker and have a bunch of model.tar.gz files that I need to unpack and load in sklearn. I've been testing using list_objects with delimiter to get to the tar.gz files:
response = s3.list_objects(
    Bucket = bucket,
    Prefix = 'aleks-weekly/models/',
    Delimiter = '.csv'
)
for i in response['Contents']:
    print(i['Key'])
And then I plan to extract with
import tarfile
tf = tarfile.open(model.read())
tf.extractall()
But how do I get to the actual tar.gz file from S3 instead of some boto3 object?
You can download objects to files using s3.download_file(). This will make your code look like:
import os
import boto3

s3 = boto3.client('s3')
bucket = 'my-bukkit'
prefix = 'aleks-weekly/models/'

# List objects matching your criteria
response = s3.list_objects(
    Bucket = bucket,
    Prefix = prefix,
    Delimiter = '.csv'
)

# Iterate over each file found and download it
for i in response['Contents']:
    key = i['Key']
    dest = os.path.join('/tmp', key)
    # make sure the local directory structure exists before downloading
    os.makedirs(os.path.dirname(dest), exist_ok=True)
    print("Downloading file", key, "from bucket", bucket)
    s3.download_file(
        Bucket = bucket,
        Key = key,
        Filename = dest
    )
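To actually unpack each archive after the download (which is what the question was building towards), a small follow-up sketch that would go inside the loop above, right after download_file; it assumes gzip-compressed .tar.gz archives and the /tmp/extracted path is a placeholder:

import tarfile

# Unpack the archive that was just written to `dest` into a per-key directory.
extract_dir = os.path.join('/tmp/extracted', key)
os.makedirs(extract_dir, exist_ok=True)
with tarfile.open(dest, mode='r:gz') as tf:
    tf.extractall(extract_dir)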

How to load a zip file (containing shp) from s3 bucket to Geopandas?

I zipped the name.shp, name.shx and name.dbf files and uploaded them into an AWS S3 bucket. Now I want to load this zip file and convert the contained shapefile into a geopandas GeoDataFrame.
I can do it perfectly if the file is a zipped GeoJSON instead of a zipped shapefile.
import io
import boto3
import geopandas as gpd
import zipfile

cliente = boto3.client("s3", aws_access_key_id=ak, aws_secret_access_key=sk)
bucket_name = 'bucketname'
object_key = 'myfolder/locations.zip'

bytes_buffer = io.BytesIO()
cliente.download_fileobj(Bucket=bucket_name, Key=object_key, Fileobj=bytes_buffer)
geojson = bytes_buffer.getvalue()

with zipfile.ZipFile(bytes_buffer) as zi:
    with zi.open("locations.shp") as file:
        print(gpd.read_file(file.read().decode('ISO-8859-9')))
I got this error:
ç­¤íEÀ¡ËÆ3À: No such file or directory
Basically, the geopandas package allows reading files directly from S3, and as mentioned in the answer above it can read zip files as well. So below you can see code that reads a zip file from S3 without downloading it. You just prefix the path in S3 with zip+s3://.
geopandas.read_file(f'zip+s3://bucket-name/file.zip')
You can read the zip directly, no need to use zipfile. You need all parts of the Shapefile, not just the .shp itself; that is why it works with GeoJSON. You just need to pass the path with zip:///. So instead of
gpd.read_file('path/file.shp')
You go with
gpd.read_file('zip:///path/file.zip')
I am not familiar enough with boto3 to know at which point you actually have this path, but I think it will help.
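Applied to the bucket and key from the question, and assuming AWS credentials are available from the environment rather than passed explicitly, that would be something like:

import geopandas as gpd

bucket_name = 'bucketname'
object_key = 'myfolder/locations.zip'

# Read the zipped shapefile straight from S3; credentials come from the environment
# (e.g. AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY).
gdf = gpd.read_file(f'zip+s3://{bucket_name}/{object_key}')
print(gdf.head())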
I do not know if it is of any help, but I faced a similar problem recently, though I only wanted to read the .shp with fiona. I ended up, like others, zipping the relevant shp, dbf, cpg and shx files on the bucket.
And to read from the bucket, I do it like so:
from io import BytesIO
from pathlib import Path
from typing import List
from typing import Union

import boto3
from fiona.io import ZipMemoryFile
from pydantic import BaseSettings
from shapely.geometry import Point
from shapely.geometry import Polygon
import fiona

class S3Configuration(BaseSettings):
    """
    S3 configuration class
    """
    s3_access_key_id: str = ''
    s3_secret_access_key: str = ''
    s3_region_name: str = ''
    s3_endpoint_url: str = ''
    s3_bucket_name: str = ''
    s3_use: bool = False

S3_CONF = S3Configuration()
S3_STR = 's3'
S3_SESSION = boto3.session.Session()
S3 = S3_SESSION.resource(
    service_name=S3_STR,
    aws_access_key_id=S3_CONF.s3_access_key_id,
    aws_secret_access_key=S3_CONF.s3_secret_access_key,
    endpoint_url=S3_CONF.s3_endpoint_url,
    region_name=S3_CONF.s3_region_name,
    use_ssl=True,
    verify=True,
)
BUCKET = S3_CONF.s3_bucket_name
CordexShape = Union[Polygon, List[Polygon], List[Point]]
ZIP_EXT = '.zip'

def get_shapefile_data(file_path: Path, s3_use: bool = S3_CONF.s3_use) -> CordexShape:
    """
    Retrieves the shapefile content associated to the passed file_path (either on disk or on S3).
    file_path is a .shp file.
    """
    if s3_use:
        return load_zipped_shp(get_s3_object(file_path.with_suffix(ZIP_EXT)), file_path)
    return load_shp(file_path)

def get_s3_object(file_path: Path) -> bytes:
    """
    Retrieve as bytes the content associated to the passed file_path
    """
    return S3.Object(bucket_name=BUCKET, key=forge_key(file_path)).get()['Body'].read()

def forge_key(file_path: Path) -> str:
    """
    Edit this code at your convenience to forge the bucket key out of the passed file_path
    """
    return str(file_path.relative_to(*file_path.parts[:2]))

def load_shp(file_path: Path) -> CordexShape:
    """
    Retrieve a list of Polygons stored at file_path location
    """
    with fiona.open(file_path) as shape:
        parsed_shape = list(shape)
    return parsed_shape

def load_zipped_shp(zipped_data: bytes, file_path: Path) -> CordexShape:
    """
    Retrieve a list of Polygons stored at file_path location
    """
    with ZipMemoryFile(BytesIO(zipped_data)) as zip_memory_file:
        with zip_memory_file.open(file_path.name) as shape:
            parsed_shape = list(shape)
    return parsed_shape
There is quite a lot of code, but the first part is very helpful for easily using a MinIO proxy for local development (you just have to change the .env).
The key to solving the issue for me was fiona's ZipMemoryFile, which is not so well documented (in my opinion) but was a life saver (in my case :)).
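As a hypothetical example of calling the helpers above (the path layout is made up and depends on how you adapt forge_key):

from pathlib import Path

# Hypothetical call: with s3_use=True the helpers fetch the matching .zip from the
# bucket and read the .shp of the same name out of it in memory.
features = get_shapefile_data(Path('data/shapefiles/regions/regions.shp'), s3_use=True)
print(len(features))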

How to pick up file name dynamically while uploading the file in S3 with Python?

I am working on a requirement where I have to save the logs of my ETL scripts to an S3 location.
I am able to store the logs on my local system, and now I need to upload them to S3.
For this I have written the following code:
import logging
import datetime
import boto3
from boto3.s3.transfer import S3Transfer
from etl import CONFIG

FORMAT = '%(asctime)s [%(levelname)s] %(filename)s:%(lineno)s %(funcName)s() : %(message)s'
DATETIME_FORMAT = '%Y-%m-%d %H:%M:%S'

logger = logging.getLogger()
logger.setLevel(logging.INFO)

S3_DOMAIN = 'https://s3-ap-southeast-1.amazonaws.com'
S3_BUCKET = CONFIG['S3_BUCKET']
filepath = ''
folder_name = 'etl_log'
filename = ''

def log_file_conf(merchant_name, table_name):
    log_filename = datetime.datetime.now().strftime('%Y-%m-%dT%H-%M-%S') + '_' + table_name + '.log'
    fh = logging.FileHandler("E:/test/etl_log/" + merchant_name + "/" + log_filename)
    fh.setLevel(logging.DEBUG)
    fh.setFormatter(logging.Formatter(FORMAT, DATETIME_FORMAT))
    logger.addHandler(fh)

client = boto3.client('s3',
                      aws_access_key_id=CONFIG['S3_KEY'],
                      aws_secret_access_key=CONFIG['S3_SECRET'])
transfer = S3Transfer(client)
transfer.upload_file(filepath, S3_BUCKET, folder_name + "/" + filename)
The issue I am facing is that logs are generated for different merchants, so their names are based on the merchant; I have taken care of that when saving locally.
But for uploading to S3 I don't know how to select the log file name.
Can anyone please help me achieve this?
S3 is an object store; it doesn't have a "real path". The so-called path, e.g. the "/" separator, is purely cosmetic. So nothing prevents you from using something similar to your local file naming convention, e.g.:
transfer.upload_file(filepath, S3_BUCKET, folder_name+"/" + merchant_name + "/" + filename)
To list all the files under an arbitrary path (it is called a "prefix"), you just do this:
# simple list_objects call, not handling pagination. max 1000 objects listed
client.list_objects(
    Bucket = S3_BUCKET,
    Prefix = folder_name + "/" + merchant_name
)
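If a merchant can accumulate more than 1000 log files, a paginator (a standard boto3 feature) handles the pagination for you; a minimal sketch reusing the same client and names:

# Paginate through all keys under the merchant's prefix (no 1000-object limit).
paginator = client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=S3_BUCKET, Prefix=folder_name + "/" + merchant_name):
    for obj in page.get('Contents', []):
        print(obj['Key'])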