How can I save a telegram audio file directly to S3 from Telegram? - amazon-s3

I am trying to save a user-sent Telegram voice message directly to S3. This happens inside AWS Lambda, so saving to disk and using s3.upload_file(filename, ...) will not work. This fails:
    def audio_handler(update, context):
        message = update.effective_message
        file = message.voice.get_file()
        s3 = boto3.client('s3')
        s3.upload_file(file, Bucket='mybucket', Key='onelove.ogg')

    ValueError: Filename must be a string
If I attempt to use

    s3.upload_fileobj(BytesIO(file).getbuffer(), Bucket='mybucket', Key='onelove.ogg')

I get:

    TypeError: a bytes-like object is required, not 'File'

Voice.get_file returns an object of type File. To download the voice to memory, you can e.g. pass an empty BytesIO object to the out argument of File.download. Please also have a look at the wiki section on working with files and media.
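A minimal sketch of that approach, assuming the synchronous (v13-style) python-telegram-bot API and the bucket/key names from the question:

    from io import BytesIO
    import boto3

    def audio_handler(update, context):
        message = update.effective_message
        tg_file = message.voice.get_file()

        # Download the voice message into an in-memory buffer instead of to disk
        buf = BytesIO()
        tg_file.download(out=buf)  # in v20+ this becomes an async download_to_memory(buf)
        buf.seek(0)

        # upload_fileobj accepts any file-like object
        s3 = boto3.client('s3')
        s3.upload_fileobj(buf, Bucket='mybucket', Key='onelove.ogg')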
Disclaimer: I'm currently the maintainer of python-telegram-bot.

Related

How to access a file inside sagemaker entrypoint script

I want to know how to access a file or folder in a private S3 bucket inside the script.py entry point of SageMaker.
I uploaded the file to S3 using the following code:
    boto3_client = boto3.Session(
        region_name='us-east-1',
        aws_access_key_id='xxxxxxxxxxx',
        aws_secret_access_key='xxxxxxxxxxx'
    )
    sess = sagemaker.Session(boto3_client)
    role = sagemaker.session.get_execution_role(sagemaker_session=sess)
    inputs = sess.upload_data(path="df.csv", bucket=sess.default_bucket(), key_prefix=prefix)
This is the code for the estimator:
    import sagemaker
    from sagemaker.pytorch import PyTorch

    pytorch_estimator = PyTorch(
        entry_point='script.py',
        instance_type='ml.g4dn.xlarge',
        source_dir='./',
        role=role,
        sagemaker_session=sess,
    )
Now, inside the script.py file, I want to access the df.csv file from S3.
This is my code inside script.py:
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    args, _ = parser.parse_known_args()

    # create session
    sess = Session(boto3.Session(region_name='us-east-1'))

    S3Downloader.download(s3_uri=args.data_dir,
                          local_path='./',
                          sagemaker_session=sess)
    df = pd.read_csv('df.csv')
But this gives the error:
ValueError: Expecting 's3' scheme, got: in /opt/ml/input/data/training., exit code: 1
I think one way is to pass the secret key and access key, but I am already passing sagemaker_session. How can I use that session inside the script.py file and get my file read?
I think this approach is conceptually wrong.
Files within SageMaker jobs (whether training or otherwise) should be passed in when the machines are initialized. Imagine you have to create a job with 10 machines: do you want each machine to read the file separately, or have it read once and replicated to all of them?
In the case of a training job, the files should be passed to fit (in the case of direct code like yours) or as a TrainingInput in the case of a pipeline.
You can follow this official AWS example: "Train an MNIST model with PyTorch"
However, the important part is simply passing a dictionary of input channels to fit:

    pytorch_estimator.fit({'training': s3_input_train})

You can name the channel (in this case 'training') however you want. The S3 path should be the one where your df.csv lives.
Within your script.py, you can then read df.csv via environment variables (or at least use them as the argparse default). Generic code with this default will suffice:

    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
The naming convention is "SM_CHANNEL_" + your channel name (upper-cased).
So if you had put "train": s3_path instead, the variable would have been called SM_CHANNEL_TRAIN.
Then you can read your file directly by pointing to the path in that environment variable.
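Putting the two sides together, a minimal sketch could look like this (assuming inputs is the S3 URI returned by sess.upload_data above; names follow the question's own code):

    # --- launcher / notebook side ---
    s3_input_train = inputs  # S3 URI of the uploaded df.csv
    pytorch_estimator.fit({'training': s3_input_train})

    # --- inside script.py ---
    import argparse
    import os

    import pandas as pd

    parser = argparse.ArgumentParser()
    parser.add_argument("--data-dir", type=str, default=os.environ["SM_CHANNEL_TRAINING"])
    args, _ = parser.parse_known_args()

    # SageMaker has already copied the channel's contents into the container,
    # so the file can be read straight from the local channel directory.
    df = pd.read_csv(os.path.join(args.data_dir, "df.csv"))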

Size of PDF breaks FastAPI using python-multipart?

I am trying to upload a PDF to FastAPI. After turning the PDF into a base64-blob and storing it in a txt-file, I POST this file to FastAPI using Postman.
This is my server-side code:
    from fastapi import FastAPI, File, UploadFile
    import base64

    app = FastAPI()

    @app.post("/uploadfile/")
    async def create_upload_file(file: UploadFile = File(...)):
        contents = await file.read()
        blob = base64.b64decode(contents)
        pdf = open('result.pdf', 'wb')
        pdf.write(blob)
        pdf.close()
        return {"filename": file.filename}
This procedure works fine for a single-page PDF document of size 279KB (blob-size: 372KB), but it doesn't for a multi-page document of size 1.8MB (blob-size: 2.4MB).
When I try, I get the following WARNING and a 400 Bad Request response (along with the response "detail": "There was an error parsing the body"):
"Did not find boundary character 55 at index 2"
I'm sure there must be an explanation for this behavior? Maybe it has something to do with async?
This is most likely an issue with how the file is saved using open().
For large files, pdf.close() may run before pdf.write() has finished flushing all of the file's contents to disk.
To ensure the whole file is written before it is closed, use with, like this:
    with open('failed.pdf', 'wb') as outfile:
        outfile.write(blob)
Using with, you will not need to call close() after writing. with should also be considered best practice over keeping the file handle in a local variable.
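For reference, the endpoint from the question rewritten with a context manager might look like the sketch below (same names as in the question; 'result.pdf' is the output path used there):

    from fastapi import FastAPI, File, UploadFile
    import base64

    app = FastAPI()

    @app.post("/uploadfile/")
    async def create_upload_file(file: UploadFile = File(...)):
        contents = await file.read()
        blob = base64.b64decode(contents)
        # The context manager closes the file once the block exits,
        # after the write has completed.
        with open('result.pdf', 'wb') as outfile:
            outfile.write(blob)
        return {"filename": file.filename}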

How to download multiple files via Flask and boto3 from S3

I have a list of .zip files on S3 which is dynamically created and passed to the Flask view /download. My problem doesn't seem to be in looping through this list, but rather in returning a response so that all files from the list are downloaded to the user's computer. If I have the view return a response, it only downloads the first file, as return closes the connection.
I have tried a number of things and looked at similar issues (like this one: Boto3 to download all files from a S3 Bucket ), but so far had no luck in resolving this. I have also looked at streaming (as in here: http://flask.pocoo.org/docs/1.0/patterns/streaming/ ) and tried creating a subfunction which is a generator, but the same issue persists as I still have to pass a return value to View function - here is that last code example:
    @app.route('/download', methods=['POST'])
    def download():
        download = []
        download = request.form.getlist('checked')

        def generate(result):
            s3_resource = boto3.resource('s3')
            my_bucket = s3_resource.Bucket(S3_BUCKET)
            d_object = my_bucket.Object(result).get()
            yield d_object['Body'].read()

        for filename in download:
            return Response(generate(filename), mimetype='application/zip',
                            headers={'Content-Disposition': 'attachment;filename=' + filename})
What would be the best way of doing this so that all the files are downloaded?
Is there an easier way to pass a list to boto3 or any other flask module to download those files?
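One note not from the original thread: a Flask view can only return a single response per request, so a common workaround is to bundle the selected S3 objects into one in-memory zip archive and send that as a single download. A rough sketch under that assumption (S3_BUCKET and the 'checked' form field are the question's own names):

    import io
    import zipfile

    import boto3
    from flask import Flask, request, send_file

    app = Flask(__name__)
    S3_BUCKET = 'mybucket'  # placeholder

    @app.route('/download', methods=['POST'])
    def download():
        keys = request.form.getlist('checked')
        s3_resource = boto3.resource('s3')
        my_bucket = s3_resource.Bucket(S3_BUCKET)

        # Write every requested object into one in-memory zip archive
        archive = io.BytesIO()
        with zipfile.ZipFile(archive, 'w') as zf:
            for key in keys:
                body = my_bucket.Object(key).get()['Body'].read()
                zf.writestr(key, body)
        archive.seek(0)

        # Flask 1.x uses attachment_filename; Flask 2+ renamed it to download_name
        return send_file(archive, mimetype='application/zip',
                         as_attachment=True, attachment_filename='files.zip')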

AWS SDK Boto3 : boto3.exceptions.unknownapiversionerror

I am trying to upload content to Amazon S3, but I am getting this error:

    boto3.exceptions.UnknownAPIVersionError: The 's3' resource does not an
    API Valid API versions are: 2006-03-01
    import boto3

    s3 = boto3.resource('s3', **AWS_ACCESS_KEY_ID**, **AWS_PRIVATE_KEY**)
    bucket = s3.Bucket(**NAME OF BUCKET**)
    obj = bucket.Object(**KEY**)
    obj.upload_fileobj(**FILE OBJECT**)
The error is caused by a "DataNotFound" exception raised inside the boto3.Session source code: because the credentials are passed as positional arguments, boto3.resource('s3', ...) treats the third argument as an API version, cannot find data for it, and raises this error. Perhaps the developers didn't anticipate people passing the wrong objects in this way.
If you read the boto3 documentation example, this is the correct way to upload data:
    import boto3

    # Pass the credentials as keyword arguments; extra positional arguments are
    # treated as region_name and api_version, which triggers the error above.
    s3 = boto3.resource('s3',
                        aws_access_key_id=**AWS_ACCESS_KEY_ID**,
                        aws_secret_access_key=**AWS_PRIVATE_KEY**)
    bucket = s3.Bucket(**NAME OF BUCKET**)
    obj = bucket.Object("prefix/object_key_name")

    # You must pass the file object!
    with open('filename', 'rb') as fileobject:
        obj.upload_fileobj(fileobject)

Locally calculate dropbox hash of files

The Dropbox REST API's metadata function has a parameter named "hash": https://www.dropbox.com/developers/reference/api#metadata
Can I calculate this hash locally without calling any remote API function?
I need to know this value to reduce upload bandwidth.
https://www.dropbox.com/developers/reference/content-hash explains how Dropbox computes their file hashes. A Python implementation of this is below:
    import hashlib
    import math
    import os

    DROPBOX_HASH_CHUNK_SIZE = 4 * 1024 * 1024

    def compute_dropbox_hash(filename):
        file_size = os.stat(filename).st_size
        with open(filename, 'rb') as f:
            block_hashes = b''
            while True:
                chunk = f.read(DROPBOX_HASH_CHUNK_SIZE)
                if not chunk:
                    break
                block_hashes += hashlib.sha256(chunk).digest()
            return hashlib.sha256(block_hashes).hexdigest()
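A small usage sketch ('photo.jpg' is a hypothetical local file): the resulting hex digest can be compared against the content_hash value that the v2 API reports for the remote copy, so an upload can be skipped when they match.

    local_hash = compute_dropbox_hash('photo.jpg')
    print(local_hash)  # 64-character hex digest, comparable to Dropbox's content_hash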
The "hash" parameter on the metadata call isn't actually the hash of the file, but a hash of the metadata. It's purpose is to save you having to re-download the metadata in your request if it hasn't changed by supplying it during the metadata request. It is not intended to be used as a file hash.
Unfortunately I don't see any way via the Dropbox API to get a hash of the file itself. I think your best bet for reducing your upload bandwidth would be to keep track of the hash's of your files locally and detect if they have changed when determining whether to upload them. Depending on your system you also likely want to keep track of the "rev" (revision) value returned on the metadata request so you can tell whether the version on Dropbox itself has changed.
This won't directly answer your question, but is meant more as a workaround: the Dropbox SDK ships a simple updown.py example that uses file size and modification time to check whether a file is current.
An abbreviated example taken from updown.py:
    dbx = dropbox.Dropbox(api_token)
    ...
    # returns a dictionary of name: FileMetadata
    listing = list_folder(dbx, folder, subfolder)
    # name is the name of the file
    md = listing[name]
    # fullname is the path of the local file
    mtime = os.path.getmtime(fullname)
    mtime_dt = datetime.datetime(*time.gmtime(mtime)[:6])
    size = os.path.getsize(fullname)
    if (isinstance(md, dropbox.files.FileMetadata) and
            mtime_dt == md.client_modified and size == md.size):
        print(name, 'is already synced [stats match]')
As far as I am aware, no, you can't.
The only way is using the Dropbox API, which is explained here.
The rclone Go program from https://rclone.org has exactly what you want:

    rclone hashsum dropbox localfile
    rclone hashsum dropbox localdir

It can't take more than one path argument, but I suspect that's something you can work with...
    t0|todd@tlaptop/p8 ~/tmp|295$ echo "Hello, World!" > dropbox-hash-demo/hello.txt
    t0|todd@tlaptop/p8 ~/tmp|296$ rclone copy dropbox-hash-demo/hello.txt dropbox-ttf:demo
    t0|todd@tlaptop/p8 ~/tmp|297$ rclone hashsum dropbox dropbox-hash-demo
    aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77  hello.txt
    t0|todd@tlaptop/p8 ~/tmp|298$ rclone hashsum dropbox dropbox-ttf:demo
    aa4aeabf82d0f32ed81807b2ddbb48e6d3bf58c7598a835651895e5ecb282e77  hello.txt