How to download multiple files via Flask and boto3 from S3 - amazon-s3

I have a dynamically created list of .zip files on S3 that is passed to the Flask view /download. My problem doesn't seem to be in looping through this list, but rather in returning a response so that all files from the list are downloaded to the user's computer. If the view returns a response, only the first file is downloaded, because return ends the request.
I have tried a number of things and looked at similar issues (like this one: Boto3 to download all files from a S3 Bucket ), but so far I have had no luck in resolving this. I have also looked at streaming (as in here: http://flask.pocoo.org/docs/1.0/patterns/streaming/ ) and tried creating a subfunction which is a generator, but the same issue persists, as I still have to pass a return value to the view function. Here is that last code example:
@app.route('/download', methods=['POST'])
def download():
    download = []
    download = request.form.getlist('checked')
    def generate(result):
        s3_resource = boto3.resource('s3')
        my_bucket = s3_resource.Bucket(S3_BUCKET)
        d_object = my_bucket.Object(result).get()
        yield d_object['Body'].read()
    for filename in download:
        return Response(generate(filename), mimetype='application/zip', headers={'Content-Disposition': 'attachment;filename=' + filename})
What would be the best way of doing this so that all the files are downloaded?
Is there an easier way to pass a list to boto3 or any other flask module to download those files?
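An HTTP response carries a single body, so one common workaround is to pack the requested objects into one archive and return that instead. The following is a minimal sketch under that assumption, not the asker's code: build_archive and its fetch argument are hypothetical names, and in the view fetch could wrap the my_bucket.Object(key).get()['Body'].read() call from the snippet above.

```python
import io
import zipfile


def build_archive(keys, fetch):
    """Pack several objects into one in-memory zip archive.

    keys:  iterable of object keys (for example the checked filenames)
    fetch: hypothetical callable that maps a key to the object's bytes
    """
    buffer = io.BytesIO()
    # ZIP_STORED skips compression, since the inputs here are already .zip files.
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        for key in keys:
            archive.writestr(key, fetch(key))
    buffer.seek(0)  # rewind so the caller can read from the start
    return buffer
```

In the Flask view, the returned buffer could then be passed to send_file(buffer, mimetype='application/zip', as_attachment=True, download_name='files.zip'); the keyword is download_name in Flask 2.0 and later, while older releases call it attachment_filename. This keeps the whole archive in memory, which may not suit very large files.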

Related

Scrapy upload files to dynamically created directories in S3 based on field

I've been experimenting with Scrapy for some time now and have recently been trying to upload files (data and images) to an S3 bucket. If the directory is static, it is pretty straightforward and I didn't hit any roadblocks. But what I want to achieve is to dynamically create directories based on a certain field from the extracted data and place the data & media in those directories. The template path, if you will, is below:
s3://<bucket-name>/crawl_data/<account_id>/<media_type>/<file_name>
For example if the account_id is 123, then the images should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/images/file_name.jpeg
and the data file should be placed in the following directory:
s3://<bucket-name>/crawl_data/123/data/file_name.json
I have been able to achieve this for the media downloads (kind of a crude way to segregate media types, as of now), with the following custom File Pipeline:
class CustomFilepathPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None, *, item=None):
        adapter = ItemAdapter(item)
        account_id = adapter["account_id"]
        file_name = os.path.basename(urlparse(request.url).path)
        if ".mp4" in file_name:
            media_type = "video"
        else:
            media_type = "image"
        file_path = f"crawl_data/{account_id}/{media_type}/{file_name}"
        return file_path
The following settings have been configured at a spider level with custom_settings:
custom_settings = {
    'FILES_STORE': 's3://<my_s3_bucket_name>/',
    'FILES_RESULT_FIELD': 's3_media_url',
    'DOWNLOAD_WARNSIZE': 0,
    'AWS_ACCESS_KEY_ID': <my_access_key>,
    'AWS_SECRET_ACCESS_KEY': <my_secret_key>,
}
So, the media part works flawlessly and I have been able to download the images and videos into their separate directories in the S3 bucket, based on the account_id. My question is:
Is there a way to achieve the same results with the data files as well? Maybe another custom pipeline?
I have tried to experiment with the 1st example on the Item Exporters page but couldn't make any headway. One thing that I thought might help is to use boto3 to establish connection and then upload files but that might possibly require me to segregate files locally and upload those files together, by using a combination of Pipelines (to split data) and Signals (once spider is closed to upload the files to S3).
Any thoughts and/or guidance on this or a better approach would be greatly appreciated.
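One possible direction, sketched here under stated assumptions rather than as a tested Scrapy solution, is an item pipeline that buffers items per account_id and writes one JSON file per account when the spider closes. Scrapy item pipelines are plain classes with process_item and close_spider methods. The class name AccountDataPipeline, the put_object callable, and the <account_id>.json file name are hypothetical; with boto3, put_object could wrap s3_client.put_object(Bucket=..., Key=..., Body=...).

```python
import json
from collections import defaultdict


class AccountDataPipeline:
    """Buffer items per account_id, then write one JSON file per account."""

    def __init__(self, put_object, prefix="crawl_data"):
        # put_object is a hypothetical callable: (key, body_bytes) -> None.
        self.put_object = put_object
        self.prefix = prefix
        self.items_by_account = defaultdict(list)

    def process_item(self, item, spider):
        # Group the scraped data by the field that determines the directory.
        self.items_by_account[item["account_id"]].append(dict(item))
        return item

    def close_spider(self, spider):
        # One upload per account, using the key layout from the question.
        for account_id, items in self.items_by_account.items():
            key = f"{self.prefix}/{account_id}/data/{account_id}.json"
            self.put_object(key, json.dumps(items).encode("utf-8"))
```

Buffering everything until the spider closes is simple but holds all items in memory; for large crawls, flushing per account at intervals may be a better fit.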

Generate index.html for AWS S3

I am trying to simulate directory listing for my bucket on AWS S3. Currently I am creating "index.html" locally as follows:
for root, dirs, files in os.walk(job_dir):
    objects = []
    for obj in dirs+files:
        # join with root (the directory currently being walked)
        m_time_epoch = os.stat(os.path.join(root,obj)).st_mtime
        mtime = datetime.fromtimestamp(m_time_epoch).strftime('%c')
        size = os.stat(os.path.join(root,obj)).st_size
        type = 'dir' if os.path.isdir(os.path.join(root,obj)) else 'file'
        objects.append({'name': obj,
                        'mtime': mtime,
                        'size': size,
                        'type': type})
    generate_index(objects, dest_path)
I then pass it, together with the destination path (bucket URL), to a function which creates "index.html" using a Jinja template.
Is there a better way to do it? I would like to avoid JavaScript, though. I did some googling but so far have not found an elegant solution.
What would be the easiest alternative to "os.walk" using the boto3 Python client?
I found some snippets, e.g. here:
How do I list directory contents of an S3 bucket using Python and Boto3?
But isn't there a simpler solution?
Thanks...
I'd recommend using the list_objects_v2 method in boto3.
import boto3
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
response_iterator = paginator.paginate(
    Bucket='MyBucket'
)
objects = []
for response in response_iterator:
    # 'Contents' is absent from a page when no keys are returned
    for r in response.get('Contents', []):
        print("File is called {}".format(r['Key']))
While iterating through the objects in the bucket, you could build an object to pass to a Jinja template that creates the index.html page.
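Since S3 keys are flat strings, an os.walk-like view has to be derived from the key names. The sketch below is one way to do that, assuming the keys have already been collected (for example from the paginator above); walk_keys is a hypothetical helper, not part of boto3.

```python
import posixpath
from collections import defaultdict


def walk_keys(keys):
    """Yield (root, dirs, files) tuples, similar to os.walk, from flat S3 keys."""
    dirs = defaultdict(set)
    files = defaultdict(list)
    for key in keys:
        parent, name = posixpath.split(key)
        dirs.setdefault(parent, set())  # make sure the directory itself is listed
        if name:  # keys ending in "/" are folder placeholders, not files
            files[parent].append(name)
        # Register every ancestor directory under its own parent.
        while parent:
            grandparent, base = posixpath.split(parent)
            dirs[grandparent].add(base)
            parent = grandparent
    for root in sorted(set(dirs) | set(files)):
        yield root, sorted(dirs[root]), sorted(files[root])
```

Each yielded tuple could feed one index.html, with the empty string standing for the bucket root.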

Get AWS bucket List Boto/Boto3

I have developed a Tornado API that returns the contents of an AWS S3 bucket. Below is the code snippet, which runs perfectly with Boto. However, it doesn't work for buckets in some other locations.
The method returns a list (resp) consisting of file name, size, and file type.
I want to achieve the same using Boto3. I have tried a lot, but the Boto3 methods return all contents of the S3 bucket with the full path.
def post(self):
    try:
        resp = []
        path = self.get_argument('path')
        bucket_name = self.get_argument('bucket_name')
        path_len = len(path)
        conn = S3Connection()
        bucket = conn.get_bucket(bucket_name)
        folders = bucket.list(path, "/")
        for folder in folders:
            if folder.name == path:
                continue
            if str(folder.name).endswith("/"):
                file_type = 'd'
                file_name = str(folder.name)[path_len:-1]
            else:
                _file_size = self.filesize(folder.size)
                file_type = 'f'
                file_name = str(folder.name)[path_len:]
            resp.append({"bucket": bucket_name, "path": path, "name": file_name, "type": file_type,
                         "size": _file_size if file_type == 'f' else ""})
        self.write(json.dumps(resp))
    # (the except clause is not included in the posted snippet)
Razvan Tudorica built a small replacement for Boto3's upload and delete methods which uses Tornado’s AsyncHTTPClient; he published a blog post here concerning the work and posted his code on GitHub.
As the original SO enquiry highlights that the code snippet supplied "doesn't work for the buckets in some different location", of specific interest here is Razvan's note that, "the main idea around [his] replacement is to use botocore to build the request (AWS wants the requests to be signed using different algorithms based on AWS zones and request data) and only to use the AsyncHTTPClient for the actual asynchronous call."
I hope Razvan's work still proves useful to you or, minimally, to others researching similar efforts (as I was recently).
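For the Boto3 side of the question, a one-level listing comparable to bucket.list(path, "/") can be approximated by passing Prefix and Delimiter to list_objects_v2. The following is a sketch under assumptions, not a drop-in replacement for the handler: list_folder is a hypothetical helper, client is assumed to be a boto3 S3 client, and the size is returned as raw bytes rather than through the handler's self.filesize.

```python
def list_folder(client, bucket_name, path):
    """Return name/type/size entries for one 'folder' level of a bucket."""
    resp = []
    paginator = client.get_paginator("list_objects_v2")
    pages = paginator.paginate(Bucket=bucket_name, Prefix=path, Delimiter="/")
    for page in pages:
        # Sub-"folders" are reported as common prefixes ending in "/".
        for common in page.get("CommonPrefixes", []):
            resp.append({"bucket": bucket_name, "path": path,
                         "name": common["Prefix"][len(path):-1],
                         "type": "d", "size": ""})
        for obj in page.get("Contents", []):
            if obj["Key"] == path:  # skip the placeholder for the folder itself
                continue
            resp.append({"bucket": bucket_name, "path": path,
                         "name": obj["Key"][len(path):],
                         "type": "f", "size": obj["Size"]})
    return resp
```

With the delimiter set, keys nested deeper than one level are rolled up into CommonPrefixes, which avoids getting every object back with its full path.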

How to get information of S3 bucket?

Say for example I have the following bucket set up:
bucketone
…/folderone
…/text1.txt
…/text2.txt
…/foldertwo
…/file1.json
…/folderthree
…/folderthreesub
…/file2.json
…/file3.json
But it only goes down one level.
What’s the proper way of retrieving information under a bucket?
Will be sure to accept/upvote answer.
What's wrong with just doing this from the CLI?
aws s3 cp s3://bucketing . --recursive
Contrary to what you might expect, rsplit() returns the pieces in left-to-right order, even though it splits starting from the right.
Therefore, you actually want to obtain the last element of the split:
filename = obj['Key'].rsplit('/', 1)[-1]
See: Python rsplit() documentation
Also, be careful of 'pretend directories' that might be created via the console. They are actually zero-length objects that make the folder appear in the UI. Therefore, skip keys with no name after the final slash.
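As a small illustration of both points, here is how rsplit() behaves on a nested key and on a folder placeholder; the key names are borrowed from the bucket layout in the question.

```python
key = "folderthree/folderthreesub/file2.json"

# One split from the right; pieces still come back left to right.
parts = key.rsplit("/", 1)
print(parts)      # ['folderthree/folderthreesub', 'file2.json']

filename = parts[-1]
print(filename)   # file2.json

# A console-created "folder" key ends with a slash, so the name is empty.
placeholder_name = "folderone/".rsplit("/", 1)[-1]
print(repr(placeholder_name))  # ''
```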
Make those fixes and it works as desired:
import boto3
import os
s3client = boto3.client('s3')
for obj in s3client.list_objects_v2(Bucket='my-bucket')['Contents']:
    filename = obj['Key'].rsplit('/', 1)[-1]
    localfiledir = os.path.join('/tmp', filename)
    if filename != '':
        s3client.download_file('my-bucket', obj['Key'], localfiledir)

Locally calculate dropbox hash of files

The Dropbox REST API's metadata function has a parameter named "hash": https://www.dropbox.com/developers/reference/api#metadata
Can I calculate this hash locally without calling any remote REST API function?
I need to know this value to reduce upload bandwidth.
https://www.dropbox.com/developers/reference/content-hash explains how Dropbox computes their file hashes. A Python implementation of this is below:
import hashlib
DROPBOX_HASH_CHUNK_SIZE = 4*1024*1024
def compute_dropbox_hash(filename):
    with open(filename, 'rb') as f:
        block_hashes = b''
        while True:
            chunk = f.read(DROPBOX_HASH_CHUNK_SIZE)
            if not chunk:
                break
            # SHA-256 digest of each 4 MB block, concatenated
            block_hashes += hashlib.sha256(chunk).digest()
        # SHA-256 of the concatenated block digests, as hex
        return hashlib.sha256(block_hashes).hexdigest()
The "hash" parameter on the metadata call isn't actually the hash of the file, but a hash of the metadata. Its purpose is to save you from re-downloading the metadata if it hasn't changed: you supply the hash in the metadata request. It is not intended to be used as a file hash.
Unfortunately I don't see any way via the Dropbox API to get a hash of the file itself. I think your best bet for reducing your upload bandwidth would be to keep track of the hashes of your files locally and detect whether they have changed when deciding whether to upload them. Depending on your system, you will likely also want to keep track of the "rev" (revision) value returned on the metadata request, so you can tell whether the version on Dropbox itself has changed.
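The local bookkeeping described here could look roughly like the sketch below. It assumes a manifest dictionary mapping file paths to the hash recorded at the last successful upload; sha256_of_file, changed_files, and the manifest itself are hypothetical names for illustration, and how the manifest is persisted is left open.

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files are not read into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def changed_files(paths, manifest):
    """Return {path: new_hash} for files whose hash differs from the manifest."""
    changed = {}
    for path in paths:
        current = sha256_of_file(path)
        if manifest.get(path) != current:
            changed[path] = current
    return changed
```

After each upload succeeds, the caller would record the new hash with manifest[path] = new_hash, so that a failed upload is retried on the next run.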
This won't directly answer your question; it is meant more as a workaround. The Dropbox SDK includes a simple updown.py example that uses file size and modification time to check whether a file is up to date.
An abbreviated example taken from updown.py:
dbx = dropbox.Dropbox(api_token)
...
# returns a dictionary of name: FileMetaData
listing = list_folder(dbx, folder, subfolder)
# name is the name of the file
md = listing[name]
# fullname is the path of the local file
mtime = os.path.getmtime(fullname)
mtime_dt = datetime.datetime(*time.gmtime(mtime)[:6])
size = os.path.getsize(fullname)
if (isinstance(md, dropbox.files.FileMetadata) and mtime_dt == md.client_modified and size == md.size):
    print(name, 'is already synced [stats match]')
As far as I know, no, you can't.
The only way is using Dropbox API which is explained here.
The rclone go program from https://rclone.org has exactly what you want:
rclone hashsum dropbox localfile
rclone hashsum dropbox localdir
It can't take more than one path argument but I suspect that's something you can work with...
t0|todd@tlaptop/p8 ~/tmp|295$ echo "Hello, World!" > dropbox-hash-demo/hello.txt
t0|todd@tlaptop/p8 ~/tmp|296$ rclone copy dropbox-hash-demo/hello.txt dropbox-ttf:demo
t0|todd@tlaptop/p8 ~/tmp|297$ rclone hashsum dropbox dropbox-hash-demo
aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77 hello.txt
t0|todd@tlaptop/p8 ~/tmp|298$ rclone hashsum dropbox dropbox-ttf:demo
aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77 hello.txt