How to upload a large file (~100 MB) to Azure Blob Storage using the Python SDK? - azure-storage

I am using the latest Azure Storage SDK (azure-storage-blob 12.7.1). It works fine for smaller files, but throws exceptions for larger files (> 30 MB):
azure.core.exceptions.ServiceResponseError: ('Connection aborted.',
timeout('The write operation timed out'))
from azure.storage.blob import BlobServiceClient, PublicAccess, BlobProperties, ContainerClient

def upload(file):
    settings = read_settings()
    connection_string = settings['connection_string']
    container_client = ContainerClient.from_connection_string(connection_string, 'backup')
    blob_client = container_client.get_blob_client(file)
    with open(file, "rb") as data:
        blob_client.upload_blob(data)
    print(f'{file} uploaded to blob storage')

upload('crashes.csv')

Your code works fine for me when I tried uploading a ~180 MB .txt file. But since uploading small files works for you, uploading your big file in smaller parts could be a workaround. Try the code below:
from azure.storage.blob import BlobClient

storage_connection_string = ''
container_name = ''
dest_file_name = ''
local_file_path = ''

blob_client = BlobClient.from_connection_string(storage_connection_string, container_name, dest_file_name)

# upload 4 MB for each request
chunk_size = 4 * 1024 * 1024

if blob_client.exists():
    blob_client.delete_blob()

blob_client.create_append_blob()

with open(local_file_path, "rb") as stream:
    while True:
        read_data = stream.read(chunk_size)
        if not read_data:
            print('uploaded')
            break
        blob_client.append_block(read_data)
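The chunked loop above can be sanity-checked locally without touching Azure at all: this sketch replaces the `append_block` call with simply collecting the chunks from an in-memory stream, so you can verify the chunk sizes before wiring it to a real blob (the 10 MB buffer and `iter_chunks` helper are made up for illustration).

```python
import io

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB per request, same as the append-blob example

def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive chunks from a binary stream, mirroring the upload loop."""
    while True:
        read_data = stream.read(chunk_size)
        if not read_data:
            break
        yield read_data

# Simulate a 10 MB file in memory and check how it gets split.
fake_file = io.BytesIO(b"x" * (10 * 1024 * 1024))
chunks = list(iter_chunks(fake_file))
print(len(chunks))        # 3 chunks: 4 MB + 4 MB + 2 MB
print(len(chunks[-1]))    # 2097152
```

Each chunk would then be passed to `blob_client.append_block(chunk)` in place of being collected into a list.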
Flume not writing correctly to Amazon S3 (weird characters)

My flume config:
agent.sinks = s3hdfs
agent.sources = MySpooler
agent.channels = channel
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
agent.sources.MySpooler.channels = channel
agent.sources.MySpooler.type = spooldir
agent.sources.MySpooler.spoolDir = /flume_to_aws
agent.sources.MySpooler.fileHeader = true
agent.channels.channel.type = memory
agent.channels.channel.capacity = 100
Now I add a file to the /flume_to_aws folder with the following text content:
Oracle and SQL Server
After it was uploaded to S3, I downloaded and opened the file, and it showed the following:
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable
Œúg ÊC•ý¤ïM·T.C ! †"û­þ Oracle and SQL ServerÿÿÿÿŒúg ÊC•ý¤ïM·T.C
Why is the file not uploaded with only the text "Oracle and SQL Server"?
Problem solved. I found this question on Stack Overflow here. By default, Flume was generating files in binary SequenceFile format instead of text format (hence the SEQ header in the output). So I added the following lines:
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
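Putting the fix together with the original config, the sink section would look something like this (same agent and sink names as above; this is just the consolidated sketch, not tested against every Flume version):

```
agent.sinks.s3hdfs.type = hdfs
agent.sinks.s3hdfs.hdfs.path = s3a://mybucket/test
agent.sinks.s3hdfs.hdfs.filePrefix = FilePrefix
agent.sinks.s3hdfs.hdfs.writeFormat = Text
agent.sinks.s3hdfs.hdfs.fileType = DataStream
agent.sinks.s3hdfs.channel = channel
agent.sinks.s3hdfs.hdfs.useLocalTimeStamp = true
```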

How can I get a zip file from S3 and attach it to my email using boto3?

I'm trying to get a zip file from my S3 bucket and then attach it to an email using boto3. I tried this, but it doesn't work:
import boto3
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart

msg = MIMEMultipart()

def get_object(bucket, key):
    client = boto3.client("s3")
    return client.get_object(Bucket=bucket, Key=key)

file = get_object(BUCKET, key)

msg_1 = MIMEBase('application')
msg_1.set_payload(file['Body'].read())
encoders.encode_base64(msg_1)
msg_1.add_header('Content-Disposition', 'attachment', filename='file.zip')
msg.attach(msg_1)
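One likely problem here: MIMEBase requires both a main type and a subtype, so MIMEBase('application') raises a TypeError. Below is a hedged sketch of the attachment step with the S3 read stubbed out as plain bytes (zip_bytes stands in for file['Body'].read(); the attach_zip helper name is made up):

```python
from email import encoders
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart

def attach_zip(msg, zip_bytes, filename):
    """Attach raw zip bytes to a MIME message as a base64-encoded part."""
    part = MIMEBase('application', 'zip')   # both main type AND subtype are required
    part.set_payload(zip_bytes)
    encoders.encode_base64(part)
    part.add_header('Content-Disposition', 'attachment', filename=filename)
    msg.attach(part)
    return msg

msg = MIMEMultipart()
attach_zip(msg, b'PK\x03\x04fake-zip-bytes', 'file.zip')
print(msg.get_payload()[0].get_content_type())   # application/zip
```

In your code you would pass `file['Body'].read()` as `zip_bytes` and then send `msg` as usual.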

How to load a zip file (containing a shp) from an S3 bucket into GeoPandas?

I zipped the name.shp, name.shx, and name.dbf files and uploaded them to an AWS S3 bucket. Now I want to load this zip file and convert the contained shapefile into a GeoPandas GeoDataFrame. It works perfectly if the file is a zipped GeoJSON instead of a zipped shapefile.
import io
import zipfile

import boto3
import geopandas as gpd

cliente = boto3.client("s3", aws_access_key_id=ak, aws_secret_access_key=sk)
bucket_name = 'bucketname'
object_key = 'myfolder/locations.zip'
bytes_buffer = io.BytesIO()
cliente.download_fileobj(Bucket=bucket_name, Key=object_key, Fileobj=bytes_buffer)
geojson = bytes_buffer.getvalue()
with zipfile.ZipFile(bytes_buffer) as zi:
    with zi.open("locations.shp") as file:
        print(gpd.read_file(file.read().decode('ISO-8859-9')))
I got this error:
ç­¤íEÀ¡ËÆ3À: No such file or directory
The geopandas package can read files directly from S3, and as mentioned in the answer above, it can also read zip files. So the code below reads a zip file from S3 without downloading it first: prefix the path with zip+s3:// and then add the path within S3.
geopandas.read_file(f'zip+s3://bucket-name/file.zip')
You can read the zip directly; there is no need for zipfile. You need all the parts of the shapefile, not just the .shp itself (that is why it works with GeoJSON, which is a single file). You just need to pass the path with the zip:// scheme. So instead of
gpd.read_file('path/file.shp')
use
gpd.read_file('zip:///path/file.zip')
I am not familiar enough with boto3 to know at which point you actually have this path, but I think it will help.
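To see why opening a single member with zipfile fails, note that a shapefile's .shp references its .shx and .dbf siblings, so reading one member in isolation loses the rest. Here is a small stdlib-only check that an archive actually contains all the required parts (the in-memory zip and the member names are made up for illustration):

```python
import io
import os.path
import zipfile

REQUIRED = {'.shp', '.shx', '.dbf'}

def missing_shapefile_parts(zip_bytes):
    """Return the required shapefile extensions absent from a zip archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        present = {os.path.splitext(name)[1] for name in zf.namelist()}
    return REQUIRED - present

# Build a throwaway zip holding only the .shp, as in zi.open("locations.shp")
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('locations.shp', b'fake shapefile bytes')
print(missing_shapefile_parts(buf.getvalue()))   # the .shx and .dbf are missing
```

With all three members present the function returns an empty set, which is the state gpd.read_file('zip://...') expects.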
I do not know if it is of any help, but I faced a similar problem recently, though I only wanted to read the .shp with fiona. Like others, I ended up zipping the relevant .shp, .dbf, .cpg, and .shx files on the bucket.
To read from the bucket, I do it like so:
from io import BytesIO
from pathlib import Path
from typing import List
from typing import Union

import boto3
import fiona
from fiona.io import ZipMemoryFile
from pydantic import BaseSettings
from shapely.geometry import Point
from shapely.geometry import Polygon

class S3Configuration(BaseSettings):
    """
    S3 configuration class
    """
    s3_access_key_id: str = ''
    s3_secret_access_key: str = ''
    s3_region_name: str = ''
    s3_endpoint_url: str = ''
    s3_bucket_name: str = ''
    s3_use: bool = False

S3_CONF = S3Configuration()
S3_STR = 's3'
S3_SESSION = boto3.session.Session()
S3 = S3_SESSION.resource(
    service_name=S3_STR,
    aws_access_key_id=S3_CONF.s3_access_key_id,
    aws_secret_access_key=S3_CONF.s3_secret_access_key,
    endpoint_url=S3_CONF.s3_endpoint_url,
    region_name=S3_CONF.s3_region_name,
    use_ssl=True,
    verify=True,
)
BUCKET = S3_CONF.s3_bucket_name
CordexShape = Union[Polygon, List[Polygon], List[Point]]
ZIP_EXT = '.zip'

def get_shapefile_data(file_path: Path, s3_use: bool = S3_CONF.s3_use) -> CordexShape:
    """
    Retrieve the shapefile content associated with the passed file_path (either on disk or on S3).
    file_path is a .shp file.
    """
    if s3_use:
        return load_zipped_shp(get_s3_object(file_path.with_suffix(ZIP_EXT)), file_path)
    return load_shp(file_path)

def get_s3_object(file_path: Path) -> bytes:
    """
    Retrieve as bytes the content associated with the passed file_path
    """
    return S3.Object(bucket_name=BUCKET, key=forge_key(file_path)).get()['Body'].read()

def forge_key(file_path: Path) -> str:
    """
    Edit this code at your convenience to forge the bucket key out of the passed file_path
    """
    return str(file_path.relative_to(*file_path.parts[:2]))

def load_shp(file_path: Path) -> CordexShape:
    """
    Retrieve a list of Polygons stored at the file_path location
    """
    with fiona.open(file_path) as shape:
        parsed_shape = list(shape)
    return parsed_shape

def load_zipped_shp(zipped_data: bytes, file_path: Path) -> CordexShape:
    """
    Retrieve a list of Polygons stored at the file_path location
    """
    with ZipMemoryFile(BytesIO(zipped_data)) as zip_memory_file:
        with zip_memory_file.open(file_path.name) as shape:
            parsed_shape = list(shape)
    return parsed_shape
There is quite a lot of code, but the first part is very helpful for easily using a MinIO proxy for local development (you just have to change the .env).
The key to solving the issue for me was fiona's ZipMemoryFile, which is not so well documented (in my opinion) but was a life saver (in my case :)).

Read and parse CSV file in S3 without downloading the entire file using Python

So, I want to read a large CSV file from an S3 bucket, but I don't want it to be downloaded completely into memory. What I want is to somehow stream the file in chunks and then process them.
This is what I have done so far, but I don't think it will solve the problem.
import logging
import boto3
import codecs
import os
import csv

LOGGER = logging.getLogger()
LOGGER.setLevel(logging.INFO)

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # retrieve bucket name and file_key from the S3 event
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    file_key = event['Records'][0]['s3']['object']['key']
    chunk, chunksize = [], 1000
    if file_key.endswith('.csv'):
        LOGGER.info('Reading {} from {}'.format(file_key, bucket_name))
        # get the object
        obj = s3.get_object(Bucket=bucket_name, Key=file_key)
        file_object = obj['Body']
        count = 0
        for i, line in enumerate(file_object):
            count += 1
            if (i % chunksize == 0 and i > 0):
                process_chunk(chunk)
                del chunk[:]
            chunk.append(line)

def process_chunk(chuck):
    print(len(chuck))
This will do what you want to achieve. It won't download the whole file into memory; instead it will download it in chunks, processing as it goes:
from smart_open import smart_open
import csv

def get_s3_file_stream(s3_path):
    """
    Return a stream of the S3 file.
    The s3_path should be of the format: '<bucket_name>/<file_path_inside_the_bucket>'
    """
    # This is the full path with credentials:
    complete_s3_path = 's3://' + aws_access_key_id + ':' + aws_secret_access_key + '@' + s3_path
    return smart_open(complete_s3_path, encoding='utf8')

def download_and_process_csv(s3_path):
    datareader = csv.DictReader(get_s3_file_stream(s3_path))
    for row in datareader:
        yield process_csv(row)  # write a function to do whatever you want to do with the CSV
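The streaming idea above can be sanity-checked locally: csv.DictReader consumes any line iterator lazily, so swapping the smart_open stream for an io.StringIO shows the chunking behaviour without S3 (the helper name and the tiny sample CSV are made up; note the trailing partial chunk is handled, which the question's code drops):

```python
import csv
import io

def process_rows_in_chunks(line_stream, chunksize=2):
    """Lazily read CSV rows from a stream and yield them chunksize at a time."""
    reader = csv.DictReader(line_stream)
    chunk = []
    for row in reader:
        chunk.append(row)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []
    if chunk:                     # don't drop the trailing partial chunk
        yield chunk

fake_stream = io.StringIO("id,name\n1,a\n2,b\n3,c\n")
chunks = list(process_rows_in_chunks(fake_stream))
print([len(c) for c in chunks])   # [2, 1]
```

The same function works unchanged if you pass it the stream returned by get_s3_file_stream.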
Did you try AWS Athena (https://aws.amazon.com/athena/)?
It is extremely good: serverless and pay-as-you-go. It does everything you want without downloading the file.
BlazingSQL is open source and also useful for big-data problems.

Test file upload in Flask

I have a Flask controller (POST) to upload a file:
f = request.files['external_data']
filename = secure_filename(f.filename)
f.save(filename)
I have tried to test it:
handle = open(filepath, 'rb')
fs = FileStorage(stream=handle, filename=filename, name='external_data')
payload['files'] = fs
url = '/my/upload/url'
test_client.post(url, data=payload)
But in the controller, request.files contains:
ImmutableMultiDict: ImmutableMultiDict([('files', <FileStorage: u'myfile.png' ('image/png')>)])
My tests pass if I replace 'external_data' with 'files'.
How can I create a Flask test request so that request.files contains 'external_data'?
You're not showing where payload comes from, which is the issue.
payload should probably be a .copy() of a dict() version of your original object.
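For what it's worth, a common gotcha here is that with Flask's test client the key in the data dict becomes the form-field name, overriding the name passed to FileStorage, so the file has to be posted under the key 'external_data'. A minimal self-contained sketch (the app and route are made up to mirror the question; assumes Flask is installed):

```python
import io

from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)

@app.route('/my/upload/url', methods=['POST'])
def upload():
    f = request.files['external_data']   # same key the controller expects
    return secure_filename(f.filename)

client = app.test_client()
# The dict key ('external_data') is the field name; the value is (stream, filename).
resp = client.post(
    '/my/upload/url',
    data={'external_data': (io.BytesIO(b'fake png bytes'), 'myfile.png')},
    content_type='multipart/form-data',
)
print(resp.get_data(as_text=True))   # myfile.png
```

So in the test from the question, `payload['files'] = fs` is what renamed the field; `payload['external_data'] = fs` should make the original controller lookup work.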