Reuploading to EdgeCast or Amazon S3? - amazon-s3

I am working on a project in which we are planning to use EdgeCast to store our data. I am concerned about it, because the client wants to upload the image to our server first, and then use curl to upload it to EdgeCast. In this case our servers will be "proxying" the request, doubling the time needed for uploads.
What would you suggest? And is direct uploading risky?
PS the reason I mentioned S3 is because of its similarity to EdgeCast. Hence I assume the same principle will apply.

Yep - Martin's right - it's usually a good idea to have a proxy in front of storage rather than giving users direct access. EdgeCast supports rsync, which will automatically sync content between your server and the EdgeCast storage account. Or you can use our "customer origin" reverse proxy feature, which lets our network pull content automatically from your servers as it's requested by the public. Feel free to contact us at sales#edgecast.com with questions.

Having your server in between the end user and the storage is probably a good idea. Whenever I have given users direct access to storage, with FTP or SSH, it has tended to get really messy. A place where you can upload files that then become accessible from the web gets used for all sorts of things.
With your server in between, you can organise the uploaded files into some rational structure: a folder per date, for instance, and perhaps also enforce strict naming of the files themselves, avoiding URL encoding problems etc.
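To make the folder-per-date and strict-naming idea concrete, here is a minimal sketch of what the server in the middle might do before forwarding a file. The function name storage_key and the exact naming rules are my own assumptions, not anything EdgeCast or S3 requires.

```python
import re
import unicodedata
from datetime import date
from pathlib import PurePosixPath


def storage_key(original_name: str, uploaded_on: date) -> str:
    """Build a key such as '2024/05/17/my-photo.jpg' from a user-supplied name."""
    # Drop any directory part the client sent (including Windows-style paths).
    name = PurePosixPath(original_name.replace("\\", "/")).name
    stem, dot, ext = name.rpartition(".")
    if not dot:  # the name had no extension at all
        stem, ext = name, ""
    # Reduce to ASCII, then keep only lowercase letters, digits and hyphens,
    # so the key never needs URL encoding.
    stem = unicodedata.normalize("NFKD", stem).encode("ascii", "ignore").decode()
    stem = re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-") or "file"
    ext = re.sub(r"[^a-z0-9]", "", ext.lower())
    filename = f"{stem}.{ext}" if ext else stem
    return f"{uploaded_on:%Y/%m/%d}/{filename}"
```

In practice you would probably also add a unique suffix (a counter or random token) so that two uploads with the same name on the same day do not overwrite each other.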

There is no reason to be concerned about EdgeCast. In my opinion, it always does its best to serve its customers, so that their websites are as fast as possible and also secure and well optimized. For a full comparison of EdgeCast vs Amazon, see http://jodihost.com/2014_edgecast_vs_amazon.php

Related

Static files as API GET targets

I'm creating a RESTful backend API for eventual use by a phone app, and am toying with the idea of making some of the API read functions nothing more than static files, created and periodically updated by my server-side code, that the app will simply GET directly.
Is this a good idea?
My hope is to significantly reduce the CPU and memory load on the server by not requiring any code to run at all for many of the API calls. However, there could potentially be a huge number of these files (at least one per user of the phone app, which will be a public app listed in the app stores that I naturally hope will get lots of downloads) and I'm wondering if that alone will lead to latency issues I'm trying to avoid.
Here are more details:
It's an Apache server
The hardware is a hosting provider's VPS with about 1gb memory and 20gb free disk space
The average file size (in terms of content and not disk footprint) will probably be < 1kb
I imagine my server-side code might update a given user's data once a day or so at most.
The app will probably do GETs on these files just a few times a day. (There's no real-time interaction going on.)
I might password protect the directory the files will be in at the .htaccess level, though there's no personal or proprietary information in any of the files, so maybe I don't need to, but if I do, will that make a difference in terms of the main question of feasibility and performance?
Thanks for any help you can give me.
This is generally a good thing to do: anything that can be static rather than dynamic is a win for performance and cost (it's why we do caching!), but the main issue is with authorization (which you'll still need to do for each incoming request).
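For the "created and periodically updated by my server-side code" part, a rough sketch of how the files could be written follows. The function name publish_user_file, the 256-way directory sharding and the JSON format are all assumptions for illustration. The point is to write to a temporary file and rename it, so the web server never serves a half-written file, and to spread users across subdirectories so no single directory grows huge.

```python
import json
import os
import tempfile
from pathlib import Path


def publish_user_file(root: Path, user_id: int, payload: dict) -> Path:
    """Write one user's data as a static JSON file under root."""
    # Shard into up to 256 subdirectories keyed by the user id.
    target_dir = root / f"{user_id % 256:02x}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{user_id}.json"

    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_name = tempfile.mkstemp(dir=target_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.chmod(tmp_name, 0o644)     # mkstemp creates the file owner-only
        os.replace(tmp_name, target)  # atomic when on the same filesystem
    except BaseException:
        os.unlink(tmp_name)
        raise
    return target
```

With files under 1 KB and one update per user per day, the disk and CPU cost of this is small; the web server then only has to serve plain files.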
You might also want to consider using a cloud service for storage of the static data (e.g., Amazon S3 or Google Cloud Storage). There are neat ways to provide temporary authorized URLs that you can pass to users so that they can read the data for a short time and then must re-authorize to continue having access.
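To illustrate the general idea behind temporary authorized URLs, here is a small sketch using an HMAC over the path and an expiry time. This is not the actual signing scheme of S3 or Google Cloud Storage (use their SDKs for that); SECRET_KEY, sign_url and verify_url are names I made up for the example.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlsplit

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical; never ship it in the app


def _signature(path: str, expires: int) -> str:
    message = f"{path}\n{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def sign_url(path: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return path plus an expiry timestamp and a signature over both."""
    issued = time.time() if now is None else now
    expires = int(issued) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(path, expires)})
    return f"{path}?{query}"


def verify_url(url: str, now: Optional[float] = None) -> bool:
    """Accept the URL only if it is unexpired and the signature matches."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    if current > expires:
        return False
    expected = _signature(parts.path, expires)
    return hmac.compare_digest(signature.encode(), expected.encode())
```

The API hands out sign_url("/data/42.json", 300) after authenticating the user once; whatever serves the file only has to run verify_url, which needs no database lookup.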

What is the best approach to handle file upload in graphql?

I'm looking for a way to handle file upload in my backend powered by prisma (graphcool). However, I am a beginner, it looks very intimidating, and I don't know anything about how file upload works. What is the best approach to do this? Can I do it using prisma? I have read about Amazon S3 buckets but it looks like a complicated approach to begin with.
There are many ways to accomplish this. I will lay out a few solutions and resources and you should have something to get you going.
Two common ways to handle this differ primarily in the method used to persist the uploaded files: upload directly to the server file system vs. upload to a cloud service, typically S3.
For most use cases, the second option, uploading to a cloud service will be superior due to ease of scalability, ease of backing up data in something like S3 and additional security features such as signed urls. One more neat thing about using S3 in particular is that you can take advantage of AWS Athena, described (by AWS) as
an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL
Note that the examples below all leverage the awesome work by Jayden Seric, who has done tons of work on uploads using GraphQL.
Upload to FS, API and react example: Upload files to node server file system
middleware for apollo-upload-server: small abstraction on top of apollo-upload-server which simplifies server development. This example also shows how to integrate with S3.

Server Load & Scalability for Massive Uploads

I want users to be able to upload millions of audio items to my server. The current app is designed to receive the content, transcode it and finally send it by FTP to storage servers. I want to know:
1- Can the app server bear the enormous load of user tasks like commenting, uploading and transcoding after scaling out to more servers (to carry the web app load)?
2- If the answer to the above question is yes, is it the correct and best approach? A good architecture would be to send the transcoding to the storage servers, wait for the job to finish and send the response to the app server, but that brings more complexity and insecurity.
3- What is the common method for this type of website?
4- If I send the upload and transcoding jobs to the storage servers, is that compatible with enterprise storage technologies in terms of long-term scalability?
5- The current app is based on PHP. Is it possible to move the tmp folder to other servers to overcome upload overload?
Thanks for the answer. Regarding the tmp folder in question number 5: I mean the tmp folder in Apache. I know that all uploaded files are stored in the tmp folder of Apache before they are moved to their final storage destination (e.g. storage servers or any other solution). I was wondering whether this is a rule for Apache and all uploaded files must land on the app server first; if so, how can I control, scale and redirect this massive load to a temporary storage or server? I mean a server or storage solution that acts as Apache's tmp folder and just hosts the uploaded files before they are sent to their final storage places. I have studied and designed everything about scaling the database, storage, load balancing, memcache etc., but this is one of my unsolved questions. Where do newly arrived user files land in a scaled architecture? And what is the common solution for this? (In a one-box solution all files sit temporarily in Apache's tmp dir, but what about a massive amount of content in a scaled system?)
Regards
You might want to take a look at the Viddler architecture: http://highscalability.com/blog/2011/5/10/viddler-architecture-7-million-embeds-a-day-and-1500-reqsec.html
Since I don't feel I can answer this (I wanted to add a comment, but my text was too long), some thoughts:
1- If you are creating such a large system (as it sounds), you should have some performance tests to see how many concurrent connections/uploads your architecture can handle. As I always say: if you don't know, the answer is "no, it can't".
2- I think the best way to deal with heavy load (that is, a lot of uploads, which tie up a lot of blocked threads on the app server) is not to use the app server to handle the file uploads at all. Perform all your heavy operations (transcoding) asynchronously (e.g. queue the uploaded files and process them afterwards). In any case the application server should not wait for the response of the transcoding system: just tell the user that their file is going to be processed and send them a message (or whatever) when it's finished. You can use something like Gearman for that.
3- I would search for existing architectures that also have to handle a lot of uploads/conversions (e.g. Flickr): just go to SlideShare and search for "flickr" or "scalable web architecture".
4- I do not really understand this, but I would use servers based on their tasks (e.g. application servers, database servers, transcoding servers, storage, ...); each server should do what it does best.
5- I am afraid I don't know what you are talking about when you say tmp folder.
Good luck
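A minimal sketch of the "queue it and process afterwards" idea, using an in-process queue and a worker thread as a stand-in for a real job server such as Gearman. The names handle_upload, transcode and worker are made up for the example, and the transcode step is only a placeholder.

```python
import queue
import threading

jobs = queue.Queue()  # pending uploads waiting to be transcoded
finished = []         # results; a real system would notify the user instead


def transcode(path: str) -> str:
    # Placeholder for the real, slow transcoding step (hypothetical).
    return path.rsplit(".", 1)[0] + ".mp3"


def worker() -> None:
    """Runs apart from request handling and drains the queue."""
    while True:
        path = jobs.get()
        if path is None:  # sentinel value: shut the worker down
            jobs.task_done()
            break
        finished.append(transcode(path))
        jobs.task_done()


def handle_upload(path: str) -> str:
    """What the request handler does: enqueue and return immediately."""
    jobs.put(path)
    return f"{path} accepted, processing in background"
```

Start the worker with threading.Thread(target=worker, daemon=True).start(). In a scaled setup the queue would live in a separate service and the workers would run on the transcoding servers, so the app servers never block on a conversion.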

Storing files (videos/images/music) in CouchDB/Cloudant vs CDN (CloudFront)?

I am new to CouchDB/Cloudant and CDN (CloudFront).
I am about to build an application using CouchDB as database.
This web application will handle a lot of files.
I know that CouchDB can store files in the database as attachments. But then I have heard about leveraging CDN to store and distribute the files all over the world.
My questions:
How is storing files in CouchDB compared to CDN (CloudFront)?
How is Cloudant's service compared to CDN (CloudFront)?
Is Google storage also a CDN?
What is the difference between Amazon CloudFront and S3?
Do I have to choose to store files either in CouchDB/Cloudant or CDN, or could/should I actually combine them?
What are best practices for storing files when using CouchDB?
Some of these questions are based on your specific implementation, but here's a generalization (not in any particular order):
Unless they have Cloudant mirrored on numerous servers around the world (effectively a CDN in its own right, just sans static files), a true CDN would probably have better response time, depending mostly on how you used Cloudant (eg, you might get good response times, but if you load the entire file into memory before outputting it, you're losing the CDN battle).
CouchDB has to process more data server-side before it can output an attachment.
CloudFront (and CDNs in general) are optimized for the fastest possible response time with the closest server.
S3 is only storage; CloudFront uses that storage and distributes it across many servers that serve the content based upon which one is closer to the user requesting that content.
Yes, you have to choose between Cloudant or the CDN; one stores them in the filesystem verbatim, the other stores them in the filesystem within the database.
I don't know the answer to some of these, eg, how CouchDB actually handles attachment storage at a low level, nor its best practices, however, this should give you enough of an idea to at least start thinking about which suits your needs best.
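To show the storage vs CDN distinction in code, here is a toy model of a single edge location that pulls from an origin on a cache miss and serves its own copy afterwards. It is only a sketch of the general pull-through idea, not how CloudFront is actually implemented; EdgeCache and its parameters are invented for the example.

```python
import time
from typing import Callable, Dict, Tuple


class EdgeCache:
    """One edge location: serves its local copy, pulls from the origin on a miss."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes],
                 ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch_from_origin  # e.g. a read from the storage service
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: Dict[str, Tuple[float, bytes]] = {}
        self.origin_requests = 0

    def get(self, key: str) -> bytes:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and cached[0] > now:
            return cached[1]  # cache hit: the origin is not contacted
        # Miss or expired entry: go back to the origin and remember the result.
        self.origin_requests += 1
        body = self._fetch(key)
        self._store[key] = (now + self._ttl, body)
        return body
```

The origin here could be S3, your own server, or in principle CouchDB attachments served over HTTP; the edge only needs something it can fetch from.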

Planning the development of a scalable web application

We have created a product that will potentially generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient; it has not been a problem so far, but we want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and providing low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we make a copy over to S3 using CloudFront. The idea here is to get data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight into how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine, and if this is what you need, you should first evaluate Amazon's SimpleDB: a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazon's load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware; if you have an up-to-date snapshot you can just kick a new instance online.
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structures used in NoSQL solutions that may help you see whether this is an appropriate solution:
WTF is a Super Column