How to upload multiple files to a Google Cloud bucket and validate them efficiently? - file-upload

I have a DICOM viewer application that allows users to upload DICOM studies (500 MB - 3 GB in size). Each study can contain 200-2000 individual DICOM files. Users upload these DICOM studies directly to a Google Cloud Storage bucket that is publicly writable. After a study is fully uploaded to the bucket, the frontend application sends a request to a Cloud Function to process the uploaded files.
There are 4 parts to processing the files:
1. Validate that all uploaded files are DICOM files
2. Move the validated DICOM files from the publicly writable bucket to a private bucket
3. Run some analysis and machine learning algorithms on only a subset of the files
4. Move all the valid uploaded files to the Google Cloud Healthcare API
The issue I am having is that it takes too long to run all these processing steps after the full study is uploaded. One solution I was considering is to invoke a Cloud Function for each individual DICOM file as it is uploaded to the bucket, run #1 and #2 on it, and then wait for the study to fully upload before running #3 and #4. My concern with this approach is that, since the bucket is publicly writable, a malicious user could upload a very large number of files and invoke many Cloud Functions, which would result in unnecessary charges.
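For reference, the per-file approach maps naturally onto a storage-triggered Cloud Function. Below is a minimal Node.js sketch, not a definitive implementation: it assumes a gen-1 background function on the google.storage.object.finalize event, the @google-cloud/storage client, and a hypothetical private bucket name. It checks the DICOM "DICM" magic bytes at offset 128 (#1) and copies valid files to the private bucket (#2).

// Sketch only: Cloud Function triggered when an object is finalized
// in the public upload bucket (google.storage.object.finalize).
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();

const PRIVATE_BUCKET = 'my-private-dicom-bucket'; // hypothetical name

exports.validateAndMoveDicom = async (object) => {
  const file = storage.bucket(object.bucket).file(object.name);

  // Read only the first 132 bytes: a DICOM Part 10 file has the
  // magic string "DICM" at byte offset 128.
  const header = await new Promise((resolve, reject) => {
    const chunks = [];
    file.createReadStream({ start: 0, end: 131 })
      .on('data', (c) => chunks.push(c))
      .on('end', () => resolve(Buffer.concat(chunks)))
      .on('error', reject);
  });

  if (header.length < 132 || header.slice(128, 132).toString('ascii') !== 'DICM') {
    // Not a DICOM file: remove it from the public bucket.
    await file.delete();
    return;
  }

  // Step #2: copy the validated file to the private bucket, then remove the original.
  await file.copy(storage.bucket(PRIVATE_BUCKET).file(object.name));
  await file.delete();
};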
Another approach is to only allow authenticated users to upload files to a private GCS bucket, but that would require me to generate a signed URL for each DICOM file. So if there are 2000 DICOM files, the front-end app would need to request 2000 signed URLs from the backend.
I am not sure how to approach this issue. Any advice on the design or implementation would be helpful.
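On the signed-URL concern: the frontend does not need one round trip per URL; a single authenticated request can return signed URLs for a whole batch of filenames. A minimal sketch, assuming a Node.js backend with @google-cloud/storage (the bucket name and function are illustrative, not from the question):

// Sketch: return V4 signed upload URLs for a batch of DICOM filenames
// supplied by the authenticated frontend in one request.
const { Storage } = require('@google-cloud/storage');
const storage = new Storage();
const UPLOAD_BUCKET = 'my-private-upload-bucket'; // hypothetical name

async function createUploadUrls(studyId, fileNames) {
  return Promise.all(
    fileNames.map(async (name) => {
      const [url] = await storage
        .bucket(UPLOAD_BUCKET)
        .file(`${studyId}/${name}`)
        .getSignedUrl({
          version: 'v4',
          action: 'write',
          expires: Date.now() + 15 * 60 * 1000, // 15 minutes
          contentType: 'application/dicom',
        });
      return { name, url };
    })
  );
}

The frontend can then PUT each file directly to its URL; even 2000 URLs in one response is a tiny payload compared to the study itself.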

Related

Where and how to store files uploaded by a user using a REST API?

Currently I'm using shared storage (Azure File Storage) to store profile pictures, company logos, and some custom Python scripts uploaded by admins. My REST services run in a Docker Swarm cluster where all the nodes have access to the shared location. I'm saving the files to that location, creating a URL for each file, and serving it as a static resource through my Nginx reverse proxy/load balancer. Are there any drawbacks to this kind of design, and how can I make it better?
There are several ways to access, store, and manipulate files in Azure File Storage using the REST API:
The Azure File service offers the following four resources: the storage account, shares, directories, and files. Shares provide a way to organize sets of files and also can be mounted as an SMB file share that is hosted in the cloud.
More info here
When it comes to the design, it will depend on what kind of concerns your customers may have: slow connectivity, whether they will need these files permanently, etc.
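To make the resource hierarchy above concrete (storage account, share, directory, file), here is a minimal Node.js sketch assuming the @azure/storage-file-share client; the share, directory, and file names are illustrative, not from the question:

// Sketch: storage account -> share -> directory -> file, as described above.
const { ShareServiceClient } = require('@azure/storage-file-share');

async function uploadLogo(connectionString, localPath) {
  const service = ShareServiceClient.fromConnectionString(connectionString);
  const share = service.getShareClient('assets');          // hypothetical share name
  const dir = share.getDirectoryClient('company-logos');   // hypothetical directory
  const fileClient = dir.getFileClient('logo.png');        // hypothetical file name

  // Creates the file in the share and uploads the local file's contents.
  await fileClient.uploadFile(localPath);
}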

File Service Architecture & Cost Analysis

Context
I am developing a webapp that
Takes a URL from the user
Downloads and stores the associated file onto my server
The user can fetch the file from my server at any time before the file eventually expires and is removed
I am planning to deploy this application on AWS, more specifically using EC2 and S3.
Challenge
I am trying to come up with a design that is both cost-effective and performant to offer this service.
Analysis
The following assumptions are used:
the downloaded file will be available to only one user, the one who provided the URL and initiated the download
the user will only fetch the file once from the server
the file will only stay on the server for at maximum 24 hours before getting removed
the file sizes are in the 100MB - 5GB range
Consider the following application flow:
1. Internet → EC2: Download the file onto local storage
2. EC2 → S3: Upload the downloaded file to S3, then delete the local copy on EC2
3. EC2 → User: Provide the user with a direct URL to fetch the file from S3 (see the sketch after this list)
4. S3 → User: The user fetches the file from S3
5. S3: The file is removed after 24 hours.
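Step 3 is typically done with a time-limited pre-signed GET URL, so the user pulls the object straight from S3. A minimal Node.js sketch, assuming the AWS SDK v3 (region, bucket, and function names are illustrative assumptions):

// Sketch: generate a pre-signed download URL that matches the 24-hour retention window.
const { S3Client, GetObjectCommand } = require('@aws-sdk/client-s3');
const { getSignedUrl } = require('@aws-sdk/s3-request-presigner');

const s3 = new S3Client({ region: 'us-east-1' }); // region is an assumption

async function downloadUrlFor(key) {
  const command = new GetObjectCommand({
    Bucket: 'my-download-bucket', // hypothetical bucket name
    Key: key,
  });
  // URL valid for 24 hours, matching the file's lifetime on S3.
  return getSignedUrl(s3, command, { expiresIn: 24 * 60 * 60 });
}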
In terms of network performance, steps 1 and 2 will be the bottlenecks, as EC2 has limited download and upload bandwidth. Step 4 should not be a problem, since S3 handles the bandwidth for transferring the file to the end user.
In terms of costs, the fixed costs are the EC2 instances, and the main variable cost is step 4, where AWS charges $0.09/GB for data transfer out. Since the files are removed after 24 hours, the storage fee is comparatively tiny.
Question
Have I correctly identified the performance bottlenecks in this application flow?
Is my cost analysis correct?
Is this the optimal flow in terms of costs? Is there any way to further reduce the cost?
Since steps 1 and 2 (downloading from the Internet and uploading to S3) will be very bandwidth-consuming when multiple large files are downloaded simultaneously, will this significantly affect the responsiveness of my server for regular API requests? Should I use one dedicated EC2 instance just for handling API calls from clients and another dedicated instance just for downloading and uploading? This would slightly complicate the design further, as I would also have to manage the communication between the two instances.
Can you use more AWS services? Are you aware of AWS Lambda? https://aws.amazon.com/lambda/details/ It can perform actions in response to events; for example, it could delete a file from S3 shortly after it is downloaded. http://docs.aws.amazon.com/lambda/latest/dg/with-s3-example.html
This alleviates the need to track downloads and delete them, once you get past the learning curve of AWS Lambda. It can also handle other processing, so you only have to upload to S3 from EC2.
Regarding cost, S3 has different storage classes, and "Reduced Redundancy" might be sufficient for your needs, saving a little money.
How about allowing the client to upload files directly to S3?
Your application would generate a pre-signed URL, so you can control which users can upload files, but after that the client interacts directly with S3. This would remove the costly "download then upload" process in steps 1 and 2.
See this document http://docs.aws.amazon.com/AmazonS3/latest/dev/PresignedUrlUploadObject.html
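If the client does push the bytes itself, the pre-signed upload URL is generated much like the download URL above, just for a PUT. A short Node.js sketch, again assuming the AWS SDK v3 (names are illustrative):

// Sketch: pre-signed PUT URL so the client uploads straight to S3, bypassing EC2.
const { S3Client, PutObjectCommand } = require('@aws-sdk/client-s3');
const { getSignedUrl } = require('@aws-sdk/s3-request-presigner');

const s3 = new S3Client({ region: 'us-east-1' }); // region is an assumption

async function uploadUrlFor(key, contentType) {
  const command = new PutObjectCommand({
    Bucket: 'my-upload-bucket', // hypothetical bucket name
    Key: key,
    ContentType: contentType,
  });
  // Short expiry: the client must start the upload within 15 minutes.
  return getSignedUrl(s3, command, { expiresIn: 15 * 60 });
}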

Uploaded File Storage/Retrieval

I am developing a web application that needs to store uploaded files - images, PDFs, etc. I need this to be secure and to scale - I don't have a fixed number of uploads to plan for. From my research, the best practice seems to be storing files on a private file system, storing paths and metadata in the database, and serving them through an authenticated script.
My question is where should these files be stored?
I can't store them on the web servers because I have more than one, I'd be worried about disk space, and I don't want the performance hit from replication.
Should they be programmatically uploaded to a CDN? Should I spin up a file server/cluster to handle this?
Is there a standard way for securely storing/retrieving a large number of files for web applications?
"My question is where should these files be stored?"
I would suggest using a dedicated storage server or cloud service such as Amazon AWS. It is secure and completely scalable. That is how it is usually done these days.
"Should they be programmatically uploaded to a CDN?" - yes, along with a matching db entry of some sort for retrieval.
"Should I spin up a file server/cluster to handle this?" - you could. I would suggest one of the many cloud storage services though.
"Is there a standard way for securely storing/retrieving a large number of files for web applications?" Yes. Uploading files and info via web or mobile app (.php, rails, .net, etc) where the app uploads to storage area (not located in public directory) and then inserts file info into a database.

Uploading large files with Carrierwave to S3

So I will need to upload large files (zip files that are a few GB in size) to S3, and I would like Carrierwave to manage the download/distribution of those files.
Meaning, when a user pays, Carrierwave can automagically generate the dynamic URL and send it to them. I know how to do this already, but it just occurred to me that I have never uploaded files via Carrierwave bigger than a few dozen MB, much less a few GB, to S3.
Given the flakiness of HTTP connections, I figure this is a suboptimal way to do it.
I don't have that many files to upload (maybe 10 - 20 max), and users won't be uploading them. It will be a storefront where the customers will be buying/downloading the files, not uploading them.
It would be nice if there were a way for me to upload the files into my S3 bucket separately (say via FTP, git, or some other mechanism) and then just link them to my app through Carrierwave in some way.
What's the best way to approach this?
Also, don't forget that you will hit Heroku's 30-second request timeout when you upload the file in the first place.
Don't worry though, there are options:
Direct upload - S3 supports direct upload, where you present a form that uploads directly to S3, bypassing Heroku; you then receive a callback into your application with the uploaded file's details for you to process (https://github.com/dwilkie/carrierwave_direct)
Upload to S3 and then expose the bucket/folder in your application to connect to your models. We take this approach with a number of clients: they use Transmit (a Mac client) to upload large assets to S3 and then visit their app to link the assets to a Rails model.
Also, I'm pretty sure S3 is an HTTP-based service, so you're only going to be able to upload via HTTP.

How to receive an uploaded file using node.js formidable library and save it to Amazon S3 using knox?

I would like to upload a form from a web page and directly save the file to S3 without first saving it to disk. This node.js app will be deployed to Heroku, where there is no local disk to save the file to.
The node-formidable library provides a great way to upload files and save them to disk. I am not sure how to turn off formidable (or connect-form) from saving file first. The Knox library on the other hand provides a way to read a file from the disk and save it on Amazon S3.
1) Is there a way to hook into formidable's events (e.g. its 'data' events) to send the stream to Knox, so that I can directly save the uploaded file in my Amazon S3 bucket?
2) Are there any libraries or code snippets that let me take the uploaded file and save it directly to Amazon S3 using node.js?
There is a similar question here but the answers there do not address NOT saving the file to disk.
It looks like there is no good way to do it. One reason might be that the node-formidable library saves the uploaded file to disk; I could not find any option to do otherwise. The knox library then takes the saved file on disk and, using your Amazon S3 credentials, uploads it to Amazon S3.
Since I cannot save files locally on Heroku, I ended up using the Transloadit service. Their authentication docs involve a bit of a learning curve, but I found the service useful.
For those who want to use Transloadit with node.js, the following code sample may help (the Transloadit page had only Ruby and PHP examples):
// Compute the request signature Transloadit expects:
// an HMAC-SHA1 of the params string, keyed with your auth secret, hex-encoded.
var crypto = require('crypto');

var signature = crypto.createHmac('sha1', 'auth secret') // your Transloadit auth secret
  .update('some string')                                 // the params string to sign
  .digest('hex');

console.log(signature);
This is Andy, creator of AwsSum:
https://github.com/appsattic/node-awssum/
I just released v0.2.0 of this library. It uploads the files that were created by Express' bodyParser(), though as you say, this won't work on Heroku:
https://github.com/appsattic/connect-stream-s3
However, I shall be looking at adding the ability to stream from formidable directly to S3 in the next (v0.3.0) version. For the moment though, take a look and see if it can help. :)
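For completeness: with today's libraries the original goal (streaming an upload to S3 without ever touching disk) is straightforward. A minimal Node.js sketch using busboy and the AWS SDK v3 multipart helper, neither of which is from the original thread; the bucket name and port are assumptions:

// Sketch: pipe an incoming multipart upload straight to S3, no temp files.
const http = require('http');
const busboy = require('busboy');
const { S3Client } = require('@aws-sdk/client-s3');
const { Upload } = require('@aws-sdk/lib-storage');

const s3 = new S3Client({ region: 'us-east-1' }); // region is an assumption

http.createServer((req, res) => {
  const bb = busboy({ headers: req.headers });

  bb.on('file', async (fieldName, fileStream, info) => {
    // Upload handles multipart chunking internally, so the stream is
    // forwarded to S3 as it arrives instead of being buffered on disk.
    const upload = new Upload({
      client: s3,
      params: {
        Bucket: 'my-upload-bucket', // hypothetical bucket name
        Key: info.filename,
        Body: fileStream,
        ContentType: info.mimeType,
      },
    });
    await upload.done();
    res.end('uploaded');
  });

  req.pipe(bb);
}).listen(3000);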