Handling untracked attachments between backend and S3 - amazon-s3

I wanted to ask if you have any proven ways to deal with removing non-tracked attachments from S3?
Architecture diagram
I am using the S3 in project to hold files in a way that the Frontend fetching a pre-signed URL from Backend (1). Then Frontend adds an attachment to S3 (2), and then adds a URL to some resource (3).
The challenge is that there may be a difference between attachments on the server and URLs in the resource - eg Frontend getting the URL from Backend, sends file to S3, and does not add it to the Backend resource.
Does any of you have a way to deal with such a problem?
Fetching all the resources from database once in a day and comparing their attachments with those on the server sounds unsatisfactory. :/

Distributed consistency is hard sometimes... Orphaned records in non-transactional systems are hard and expensive to track down.
You have to decide what your tolerance for data-consistency-drift is. If you need it to be absolute, then you need to put in place things like rollbacks when transactions fail (e.g. when the writ to the backend fails, rollback the write to S3 when the backend write fails). Alternatively you can tag files in S3 with a tag that says something like "backend:pending" when creating the files and update it to "backend:done" so that you can easily identify s3 objects that need cleanup.
A more scalable option would be to use something like SQS to manage your workflow, ensuring that both changes are handled, and creating hospital queues for when things go sideways.

Related

gun db storage model for large centrally stored data, tiny collaborative clients

Use Case:
Say I wanted to create a realtime-collaborative document editing system.
In this scenario many users could create and collaborate on many documents.
Due to client-device constraints, it's not possible for any client to keep a replica of all documents, only just a handful.
There needs to be central storage server where all documents always live, and this server is always backed up.
Each client can 'subscribe' to any document, and all clients subscribed can see realtime changes of all other clients subscribed/editing the same document.
Questions:
Since each client can't store all documents, there needs to be a way to remove the replicas of 'old' documents from the client, without deleting the document from the central store, ideally based on an automatic least-recently-used approach. How is this handled in gun?
In gun, how can a document be deleted from the central store, so it's then effectively permanently removed from, and no longer accessible to, all clients?
When a document is deleted from the central store, how is the physical storage space ever actually reclaimed for later use?
Great questions, #user2672083 . Here is the current lay out:
Collaborative realtime document editing is possible with gun. Here is a quick prototype I recorded a long time ago, however there are no full pre-built examples/implementations yet.
Not all data is stored on every client by default. The browser only keeps the data it requests/gets/subscribes to.
The default server already acts as a backup. I recommend using the S3 storage adapter, because then you do not have to worry about running out of disk space.
Removing old replicas. Currently, if I want the server to act as a central "master", I just put a localStorage.clear() at the top of my browser code. This will force the browser to have to always load the latest from the server. This is not ideal though, an LRU specific feature is coming soon according to the roadmap.
Permanently removing data and reclaiming space. While this should be easy for a central setup, because gun is P2P by default, it uses a technique called tombstoning to delete data. Given a lot of requests (like yours) for LRU/TTL/GC/deleting, there will be better support for this in the future. Currently, you have to use a mix of rm data.json, localStorage.clear() and 30 day lifecycles on S3 to get this to work. This will be more integrated/easier in the future.
Now a question for you: What are you working on, and how can I help? Many of the things you asked about are possible (with some work) now, but slated to be the focus of the next version of gun - I'd love to get your feedback as we build this out.
All peers reply to requests for data (#2), meaning that localStorage and the server will both reply. Because localStorage is physically closer to a user, it will reply first/fastest and then replies from the server will be merged. GUN does not try each peer "in sequence" doing try/catch cascades, GUN replies from all peers in parallel.
GUN has swappable storage and transport interfaces, so yes, it is easy to build other layers on top or into it.

Server Load & Scalability for Massive Uploads

I want to upload millions of audio items by users to my server. The current app has designed to give the contents, transcode them and finally send by ftp to storage servers. I want to know:
Does the app server can bear the enormous tasks by user like commenting, uploading, transcoding after scaling to more servers (to carry web app load)?
If the answer of above question is yes, is it correct and best approach? Because a good architecture will be to send transcoding to storage servers wait for finishing the job and sending respond to app server but at the same time it has more complexity and insecurity.
What is the common method for this type of websites?
If I send the upload and transcoding job to storage servers does it compatible with enterprise storage technologies in a long term scalability?
5- The current App is based on PHP. Is it possible to move tmp folder to another servers to overcome upload overload?
Thanks for answer, for tmp folder question number 5. I mean the tmp folder in Apache. I know that all uploaded files before moving to final storage destination (eg: storage servers or any solution) are stored in tmp folder of apache. I was wondering if this is a rule for apache and all uploaded files should be located first in app server, so how can I control, scale and redirect this massive load of storage to a temporary storage or server? I mean a server or storage solution as tmp folder of appche to just be guest of uploaded files before sending to the final storages places. I have studied and designed all the things about scaling of database, storages, load balancing, memcache etc. but this is one of my unsolved question. Where new arrived files by users to main server will be taken place in a scaled architect? And what is the common solution for this? (In one box solution all files will be temporary in the tmp dir of appche but for massive amount of contents and in a scaled system?).
Regards
You might want to take a look at the Viddler architecture: http://highscalability.com/blog/2011/5/10/viddler-architecture-7-million-embeds-a-day-and-1500-reqsec.html
Since I don't feel I can answer this (I wanted to add a comment, but my text was too long), some thoughts:
If you are creating such a large system (as it sounds) you should have some performance tests to see, how many concurrent connections/uploads,... whatever your architecture can handle. As I always say: If you don't know it: "no, it can't ".
I think the best way to deal with heavy load (this is: a lot of uploads, requiring a lot of blocked Threads from the appserver (-> this means, I would not use the Appserver to handle the fileuploads). Perform all your heavy operations (transcoding) asynchronously (e.g. queue the uploaded files, processess them afterwards). In any case the Applicaiton server should not wait for the response of the transcoding system -> just tell the user, that his file are going to be processed and send him a message (or whatever) when its finished. You can use something like gearman for that.
I would search for existing architectures, that have to handle a lot of uploads/conversion too (e.g. flickr) just go to slideshare and search for "flickr" or "scalable web architecture"
I do not really understand this - but I would use Servers based on their tasks (e.g. Applicaiton server, Database serversm, Transconding servers, Storage,...) - each server should do, what he can do best.
I am afraid I don't know what you are talking about when you say tmp folder.
Good luck

Reuploading to EdgeCast or Amazon S3?

I am working on a project in which we are planning to use EdgeCast to store our data. I am concerned about it, because the client wants to upload the image to our server first, and then use curl to upload it to EdgeCast. In this case our servers will be "proxying" the request, doubling the time needed for uploads.
What would you suggest? And is direct uploading risky?
PS the reason I mentioned S3 is because of its similarity to EdgeCast. Hence I assume the same principle will apply.
Yep - Martin's right - usually a good idea when letting users have direct access to storage to have a proxy. EdgeCast supports rsync which will automatically sync content from your server and the EdgeCast storage account. Or you can use our "customer origin" reverse proxy feature for our network to pull content automatically from your servers as its requested by the public. Feel free to contact us at sales#edgecast.com with questions.
Having your server in between the end user and the storage, is probably a good idea. Whenever I let users direct access to storage places, with FTP or SSH, it tends to get really messy. A place where you can upload files, that get accessible from the web, is used for all sorts of things.
Having your server in between you can organise the files uploaded into some rational structure. A folder per date for instance, and perhaps also enforce some strict naming of the files themselves, avoiding URL encoding problems etc.
There is no reason to be concerned about Edgecast. On my opinion, it always makes its best to serve its customers the best way so that the customers have their websites as fast as it's possible and also secured and well-optimized. The whole comparison of Edgecasr vs Amazon look at http://jodihost.com/2014_edgecast_vs_amazon.php

Storing files (videos/images/music) in CouchDB/Cloudant vs CDN (CloudFront)?

I am new to CouchDB/Cloudant and CDN (CloudFront).
I am about to build an application using CouchDB as database.
This web application will handle a lot of files.
I know that CouchDB can store files in the database as attachments. But then I have heard about leveraging CDN to store and distribute the files all over the world.
My questions:
How is storing files in CouchDB compared to CDN (CloudFront)?
How is Cloudant's service compared to CDN (CloudFront)?
Is Google storage also a CDN?
What is the difference between Amazon CloudFront and S3?
Do I have to choose to store files either in CouchDB/Cloudant or CDN, or could/should I actually combine them?
What are best practices for storing files when using CouchDB?
Some of these questions are based on your specific implementation, but here's a generalization (not in any particular order):
Unless they have Cloudant mirrored on numerous servers around the world (effectively a CDN in its own right, just sans static files), a true CDN would probably have better response time, depending mostly on how you used Cloudant (eg, you might get good response times, but if you load the entire file into memory before outputting it, you're losing the CDN battle).
CouchDB has to process more data server-side before it can output an attachment.
CloudFront (and CDNs in general) are optimized for the fastest possible response time with the closest server.
S3 is only storage; CloudFront uses that storage and distributes it across many servers that serve the content based upon which one is closer to the user requesting that content.
Yes, you have to choose between Cloudant or the CDN; one stores them in the filesystem verbatim, the other stores them in the filesystem within the database.
I don't know the answer to some of these, eg, how CouchDB actually handles attachment storage at a low level, nor its best practices, however, this should give you enough of an idea to at least start thinking about which suits your needs best.

How do I access the same bucket using multiple domains on Amazon S3?

I would like to parallelize requests to the resources stored on Amazon's S3 service by accessing them with different domains. I realize that I could just make multiple buckets and make sure the content is always the same. This is annoying however. Everytime I upload something, I have to do it 5 times.
Thanks.
You don't need to do this. S3 should handle many download requests at once. If you need more than that, then use cloud front, which gets the files closer to your clients.
If by 'parallelize requests' you mean something else, then please elaborate.