Bulk Upload to Azure Storage using Secure connection - azure-storage

I need to upload files in batch to azure storage securely, and the files will be processed by an ADF pipeline later.
When I tried to upload files from my Angular front-end to the API, large files took so long to upload that my request timed out.
Any suggestions as to how I can approach this problem better?

Related

File Service Architecture & Cost Analysis

Context
I am developing a web app that
Takes a URL from the user
Downloads and stores the associated file onto my server
The user can fetch the file from my server at any time before the file is eventually expired and removed
I am planning to deploy this application on AWS, specifically using EC2 and S3.
Challenge
I am trying to come up with a design that is both cost-effective and performant to offer this service.
Analysis
The following assumptions are used:
the downloaded file will be available to only one user, the one who provided the URL and initiated the download
the user will only fetch the file once from the server
the file will only stay on the server for at most 24 hours before getting removed
the file sizes are in the 100MB - 5GB range
Consider the following application flow:
Internet → EC2: Download the file onto local storage
EC2 → S3: Upload the downloaded file to S3, then delete the local copy on EC2
EC2 → User: Provide the user with a direct URL to fetch from S3
S3 → user: The user fetches the file from S3
S3: file is removed after 24 hours.
In terms of network performance, steps 1 and 2 will be the bottlenecks, as EC2 has limited download and upload bandwidth. Step 4 should not be a problem, since S3 takes care of the bandwidth for transferring the file to the end user.
In terms of costs, the fixed costs are the EC2 instances, and the main variable cost is step 4, where AWS charges $0.09/GB for data transfer. Since the files are removed after 24 hours, the storage fee is comparatively tiny.
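To make the cost estimate concrete, here is a small worked calculation. Only the $0.09/GB rate comes from the analysis above; the download count, average file size and EC2 price are hypothetical inputs chosen purely for illustration.

```python
# Rough monthly cost model for the flow above.
# Only the $0.09/GB transfer rate comes from the question; every other
# number passed in below is a hypothetical placeholder.

TRANSFER_OUT_USD_PER_GB = 0.09  # S3 -> user rate quoted in the question


def monthly_cost(downloads_per_month: int,
                 avg_file_gb: float,
                 ec2_fixed_usd: float) -> dict:
    """Split the monthly bill into fixed (EC2) and variable (step 4) parts."""
    transfer_gb = downloads_per_month * avg_file_gb
    transfer_usd = transfer_gb * TRANSFER_OUT_USD_PER_GB
    return {
        "transfer_gb": transfer_gb,
        "transfer_usd": round(transfer_usd, 2),
        "fixed_usd": ec2_fixed_usd,
        "total_usd": round(transfer_usd + ec2_fixed_usd, 2),
    }


if __name__ == "__main__":
    # Hypothetical workload: 1,000 downloads of 2 GB each, $50 of EC2.
    print(monthly_cost(1000, 2.0, 50.0))
```

With inputs like these, the variable transfer cost quickly outweighs the fixed cost, which matches the claim that step 4 is the main variable cost.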
Question
Have I correctly identified the performance bottlenecks in this application flow?
Is my cost analysis correct?
Is this the optimal flow in terms of costs? Is there any way to further reduce the cost?
Since step 1 and step 2 (downloading from Internet and uploading to S3) will be very bandwidth-consuming when simultaneously downloading multiple large files, will it significantly affect the responsiveness of my server to serve regular API requests? Should I use a dedicated EC2 instance just for handling API calls from the clients, and another dedicated EC2 instance just for downloading and uploading? This will slightly further complicate the design, as I will have to manage the communication between the 2 instances as well.
Can you use more AWS services? Are you aware of AWS Lambda? https://aws.amazon.com/lambda/details/ It can run code in response to events, e.g. it could delete a file from S3 shortly after it is downloaded. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
This alleviates the need to track downloads and delete them, once you get past the learning curve of AWS Lambda. It can also handle other processing, so you only have to upload to S3 from EC2.
Regarding cost, S3 has different quality levels, and the "reduced redundancy" might be sufficient for your needs, saving a little money.
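The Lambda suggestion above could look roughly like the sketch below. It is an illustration only: the function names are hypothetical, and which event triggers the function is an assumption. As far as I know, S3 event notifications cover things like object creation rather than downloads, so the "after it is downloaded" wiring is left open here.

```python
# Sketch of a Lambda handler that deletes the S3 objects named in an
# S3 event notification. Which event triggers it is an assumption.
from urllib.parse import unquote_plus


def object_refs(event: dict) -> list:
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    refs = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Keys arrive URL-encoded in S3 notifications (spaces become '+').
        refs.append((s3["bucket"]["name"], unquote_plus(s3["object"]["key"])))
    return refs


def handler(event, context, s3_client=None):
    """Lambda entry point; s3_client is injectable so the logic can be tested."""
    if s3_client is None:
        import boto3  # imported lazily; only needed when running in AWS
        s3_client = boto3.client("s3")
    deleted = []
    for bucket, key in object_refs(event):
        s3_client.delete_object(Bucket=bucket, Key=key)
        deleted.append(key)
    return {"deleted": deleted}
```

Passing the client in as a parameter keeps the parsing and deletion logic testable without AWS credentials.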
How about allowing the client to upload files directly to S3?
Your application would generate a pre-signed url, so that you can control which users can upload files, but after that the client interacts directly with S3. This would remove the costly "download then upload" process in steps 1 & 2.
See this document http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html
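A minimal sketch of generating such a pre-signed upload URL with boto3 follows. The wrapper function, bucket name and key are hypothetical; `generate_presigned_url` is the boto3 client call, and the one-hour expiry is an arbitrary choice.

```python
# Sketch: hand the client a time-limited URL it can PUT a file to.
# The wrapper name and its defaults are assumptions for illustration.

def presigned_upload_url(bucket: str, key: str,
                         expires_in: int = 3600, s3_client=None) -> str:
    """Return a URL that allows one PUT of `key` into `bucket` until expiry."""
    if s3_client is None:
        import boto3  # imported lazily so the function can be tested offline
        s3_client = boto3.client("s3")
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,  # seconds
    )
```

The application decides who receives a URL; the client then sends the file body in an HTTP PUT to that URL, so the bytes never pass through the application server.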

Uploading large file (10+ GB) from Web client via azure web site to azure blob storage

I've got a bit of a problem in uploading a really large file into azure blob storage.
I have no problem uploading that file into the web site as a file upload in an upload directory.
I have no problem either putting this into the blob storage, as chunking will be handled internally.
The problem I'm having is that moving the large file from the upload directory to the blob storage takes longer than the browser timeout, so the customer sees an error message.
As far as I know, the solution is to chunk-upload directly from the web browser.
But how do I deal with the block ids? Since the web service is supposed to be stateless, I don't think I can keep around a list of blocks already uploaded.
Also, can the blob storage deal with out-of-order blocks?
And do I have to deal with all the state manually?
Or is there an easier way, maybe just handing the blob service the httprequest input stream from the file upload post request (multipart form data)?
Lots of Greetings!
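On the block ID question: block blobs accept blocks in any order, and the final block list decides the order in which they are committed. Block IDs must be Base64 strings of equal length within one blob. One way to avoid keeping state is to derive each ID from the chunk index, as in the sketch below. The `block-NNNNNN` naming scheme and the 4 MiB chunk size are assumptions, not something the blob service prescribes.

```python
# Sketch: deterministic block IDs, so neither the browser nor the web
# service has to remember which IDs were handed out.
import base64

BLOCK_SIZE = 4 * 1024 * 1024  # hypothetical chunk size chosen by the client


def block_id(index: int) -> str:
    """Base64 block ID with a fixed-width index, so all IDs have equal length."""
    return base64.b64encode(f"block-{index:06d}".encode("ascii")).decode("ascii")


def block_list(file_size: int, block_size: int = BLOCK_SIZE) -> list:
    """IDs in commit order; recomputable from the file size alone."""
    count = -(-file_size // block_size)  # ceiling division
    return [block_id(i) for i in range(count)]
```

Because the list can be recomputed from the file size, the final commit request can rebuild it without the server having tracked individual blocks.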
You could move the file from the web server to blob storage asynchronously: return success for the original request once the file is on the web server, then have JavaScript poll your web server periodically to confirm that the file has made it to durable blob storage. The polling script can display success to the user once the web server confirms that the file has reached blob storage.
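The server side of this asynchronous approach can be sketched as follows. All names are hypothetical, `move_to_blob` stands in for whatever code copies the file to blob storage, and the in-memory status table is a simplification: a deployment with several web server instances would need a shared store instead.

```python
# Sketch: start the slow copy in the background and expose a status
# lookup that the browser can poll.
import threading
import uuid

_status = {}              # upload_id -> "pending" | "done" | "failed"
_lock = threading.Lock()


def start_transfer(local_path, move_to_blob):
    """Kick off the copy in a background thread and return an ID right away."""
    upload_id = uuid.uuid4().hex
    with _lock:
        _status[upload_id] = "pending"

    def worker():
        try:
            move_to_blob(local_path)  # hypothetical callable doing the copy
            result = "done"
        except Exception:
            result = "failed"
        with _lock:
            _status[upload_id] = result

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    return upload_id, thread


def get_status(upload_id):
    """What the polling endpoint would return for a given upload."""
    with _lock:
        return _status.get(upload_id, "unknown")
```

The upload request returns the ID immediately, and the page polls a status endpoint backed by `get_status` until it reports "done" or "failed".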

Script to take a S3 bucket, Compress it, push the compressed file to an SFTP server

I have an S3 bucket with about 100 GB of small files (in folders).
I have been requested to back this up to a local NAS on a weekly basis.
I have access to an EC2 instance that is attached to the S3 storage.
My NAS allows me to run an SFTP server.
I also have access to a local server in which I can run a cron job to pull the backup if need be.
How can I best go about this? If possible, I would like to download only the files that have been added or changed, or compress the bucket on the server end and then push the compressed file to the SFTP server on the NAS.
The end goal is to have a complete backup of the S3 bucket on my NAS with the lowest amount of transfer each week.
Any suggestions are welcome!
Thanks for your help!
Ryan
I think the most scalable method for you to achieve this is using AWS Elastic Map Reduce and Data pipeline.
The architecture is this way:
You will use Data Pipeline to configure S3 as an input data node, then EC2 with Pig/Hive scripts to do the required processing and send the data to SFTP. Pig can be extended with a custom UDF (user-defined function) to send data to SFTP. You can then set up this pipeline to run at a periodic interval. That said, it requires quite a bit of reading to achieve all this, but it is a good skill to acquire if you foresee future data transformation needs.
Start reading from here:
http://aws.typepad.com/aws/2012/11/the-new-amazon-data-pipeline.html
A similar method can be used for taking periodic backups of DynamoDB to S3, or for reading files from FTP servers, processing them, and moving them to, say, S3 or RDS.
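For the "only download the files that have been added or changed" part of the question, a lighter alternative to the pipeline above is to compare the bucket listing against a manifest saved by the previous backup. This is a different technique from the Data Pipeline approach, sketched here under assumptions: both inputs are plain dictionaries mapping object key to ETag, and how the listing is obtained is left open. The AWS CLI's `aws s3 sync` command is built around the same idea of copying only new and changed files.

```python
# Sketch: decide what a weekly backup run has to transfer.
# `remote` is the current bucket listing, `manifest` the previous run's
# record; both map object key -> ETag (an assumed representation).

def plan_sync(remote: dict, manifest: dict):
    """Return (keys to download, keys to delete locally), both sorted."""
    to_download = sorted(
        key for key, etag in remote.items() if manifest.get(key) != etag
    )
    to_delete = sorted(key for key in manifest if key not in remote)
    return to_download, to_delete


if __name__ == "__main__":
    previous = {"a.txt": "etag1", "b.txt": "etag2", "old.txt": "etag3"}
    current = {"a.txt": "etag1", "b.txt": "etag9", "new.txt": "etag4"}
    print(plan_sync(current, previous))
```

After a successful run, the current listing would be saved as the new manifest, so each week only the difference is transferred.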

Uploading large files with Carrierwave to S3

So I will need to upload large files (zip files that are a few GB large) to S3, and I would like Carrierwave to manage the download/distribution of those files.
Meaning, when a user pays, Carrierwave can automagically generate the dynamic URL and send it to them. I know how to do this already, but it just occurred to me that I have never uploaded files via Carrierwave that are bigger than a few dozen MB, much less a few GB, to S3.
Given the flakiness of HTTP connections, I figure this is a suboptimal way to do it.
I don't have that many files to upload (maybe 10 - 20 max), and users won't be uploading them. It will be a storefront where the customers will be buying/downloading the files, not uploading them.
It would be nice if there was a way for me to upload the files into my S3 bucket separately (say FTP, git, or some other mechanism) and then just link it to my app through Carrierwave in some way.
What's the best way to approach this?
Also, don't forget that you will encounter the Heroku 30 second timeout when you are uploading the file in the first place.
Don't worry though, there are options:
Direct Upload - S3 supports direct upload, where you present a form that uploads directly to S3, bypassing Heroku; you then receive a callback into your application with the uploaded file's details for you to process (https://github.com/dwilkie/carrierwave_direct)
Upload to S3 and then expose bucket/folder in your application to connect to your models. We do this approach with a number of clients. They use Transmit (Mac Client) to upload large assets to S3 and then visit their app to link the asset to a Rails model.
Also, I'm pretty sure S3 is an HTTP based service so you're only going to be able to upload via HTTP.

AWS S3 and AjaXplorer

I'm using AjaXplorer to give my clients access to a shared directory stored in Amazon S3. I installed the SDK, configured the plugin (http://ajaxplorer.info/plugins/access/s3/) and could upload and download files, but the upload size is limited to my host's PHP limit, which is 64MB.
Is there a way I can upload directly to S3 without going through my host, to improve speed and be subject to S3's size limit rather than PHP's?
Thanks
I think that is not possible, because the file is first uploaded to the server through PHP, and only then transferred to the bucket.
Maybe
The only way around this is to use some jQuery or JS that bypasses your server/PHP entirely and streams directly into S3. This involves enabling CORS and creating a signed policy on the fly to allow your uploads, but it can be done!
I ran into just this issue with some inordinately large media files for our website users that I no longer wanted to host on the web servers themselves.
The best place to start, IMHO, is here:
https://github.com/blueimp/jQuery-File-Upload
A demo is here:
https://blueimp.github.io/jQuery-File-Upload/
This was written to upload+write files to a variety of locations, including S3. The only tricky bits are getting your MIME type correct for each particular upload, and getting your bucket policy the way you need it.
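The "signed policy" mentioned above can be sketched with the standard library alone. This is an illustration under assumptions, not the plugin's or the library's own code: the function name and parameters are hypothetical, the credentials are placeholders, and the signing follows AWS Signature Version 4 for browser POST uploads as I understand it.

```python
# Sketch: build the signed form fields for a browser POST upload to S3.
import base64
import datetime
import hashlib
import hmac
import json


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode("utf-8"), hashlib.sha256).digest()


def signed_post_fields(access_key, secret_key, region, bucket,
                       key_prefix, max_bytes, now, ttl_seconds=900):
    """`now` must be a UTC datetime; returns fields to embed in the form."""
    date = now.strftime("%Y%m%d")
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    credential = f"{access_key}/{date}/{region}/s3/aws4_request"
    expires = now + datetime.timedelta(seconds=ttl_seconds)
    policy = {
        "expiration": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "conditions": [
            {"bucket": bucket},
            ["starts-with", "$key", key_prefix],      # restrict target keys
            ["content-length-range", 0, max_bytes],   # cap the upload size
            {"x-amz-algorithm": "AWS4-HMAC-SHA256"},
            {"x-amz-credential": credential},
            {"x-amz-date": amz_date},
        ],
    }
    policy_b64 = base64.b64encode(
        json.dumps(policy).encode("utf-8")).decode("ascii")
    # Derive the SigV4 signing key, then sign the encoded policy.
    k_date = _hmac(("AWS4" + secret_key).encode("utf-8"), date)
    k_signing = _hmac(_hmac(_hmac(k_date, region), "s3"), "aws4_request")
    signature = hmac.new(k_signing, policy_b64.encode("ascii"),
                         hashlib.sha256).hexdigest()
    return {
        "policy": policy_b64,
        "x-amz-algorithm": "AWS4-HMAC-SHA256",
        "x-amz-credential": credential,
        "x-amz-date": amz_date,
        "x-amz-signature": signature,
    }
```

The server generates these fields per upload; the browser form adds a `key` field starting with the allowed prefix and the file itself as the last field. A condition on `Content-Type` could be added to the policy to pin down the MIME type mentioned above.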