gun db storage model for large centrally stored data, tiny collaborative clients

Use Case:
Say I wanted to create a realtime-collaborative document editing system.
In this scenario many users could create and collaborate on many documents.
Due to client-device constraints, it's not possible for any client to keep a replica of all documents, only a handful of them.
There needs to be a central storage server where all documents always live, and this server is always backed up.
Each client can 'subscribe' to any document, and all clients subscribed to the same document see each other's edits in realtime.
Questions:
Since each client can't store all documents, there needs to be a way to remove the replicas of 'old' documents from the client, without deleting the document from the central store, ideally based on an automatic least-recently-used approach. How is this handled in gun?
In gun, how can a document be deleted from the central store, so it's then effectively permanently removed from, and no longer accessible to, all clients?
When a document is deleted from the central store, how is the physical storage space ever actually reclaimed for later use?

Great questions, #user2672083. Here is the current layout:
Collaborative realtime document editing is possible with gun. Here is a quick prototype I recorded a long time ago; however, there are no full pre-built examples/implementations yet.
Not all data is stored on every client by default. The browser only keeps the data it requests/gets/subscribes to.
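As a rough sketch of what that looks like for the use case above (the server URL and key names are assumptions, not from this answer):

```javascript
// Connect to the central server peer; only data this client asks for
// gets replicated into its local storage.
var gun = Gun('https://example.com/gun');

// 'Subscribe' to one document: .on() fires for the initial state and
// again for every realtime change made by any other subscribed client.
gun.get('documents').get('doc-42').on(function (data, key) {
  console.log('realtime update for', key, data);
});

// Edits merge and broadcast to every peer subscribed to the document.
gun.get('documents').get('doc-42').get('title').put('Hello, world');
```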
The default server already acts as a backup. I recommend using the S3 storage adapter, because then you do not have to worry about running out of disk space.
Removing old replicas. Currently, if I want the server to act as a central "master", I just put a localStorage.clear() at the top of my browser code. This forces the browser to always load the latest from the server. This is not ideal, though; an LRU-specific feature is coming soon according to the roadmap.
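A minimal sketch of that workaround (the server URL is an assumption):

```javascript
// Treat the server as the "master": wipe the browser's local replica on
// every page load so GUN must fetch the latest state from the server peer.
localStorage.clear();

var gun = Gun('https://example.com/gun'); // hypothetical central server peer
```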
Permanently removing data and reclaiming space. While this should be easy for a central setup, because gun is P2P by default it uses a technique called tombstoning to delete data. Given the many requests (like yours) for LRU/TTL/GC/deletion, there will be better support for this in the future. Currently, you have to use a mix of rm data.json, localStorage.clear() and 30-day lifecycles on S3 to get this to work. This will be more integrated/easier in the future.
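A hedged sketch of the tombstoning part: in gun, overwriting a key with null is what marks the data as deleted, and that tombstone propagates to peers rather than physically reclaiming space:

```javascript
// Tombstone the document: peers see null instead of the old data, but
// the underlying storage is only reclaimed by the cleanup steps above.
gun.get('documents').get('doc-42').put(null);
```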
Now a question for you: What are you working on, and how can I help? Many of the things you asked about are possible (with some work) now, but slated to be the focus of the next version of gun - I'd love to get your feedback as we build this out.
All peers reply to requests for data (#2), meaning that localStorage and the server will both reply. Because localStorage is physically closer to the user, it replies first/fastest, and the replies from the server are then merged in. GUN does not try each peer "in sequence" with try/catch cascades; replies come from all peers in parallel.
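For illustration, pointing a client at several peers looks like this (the URLs are placeholders); localStorage is effectively an extra local peer:

```javascript
// GUN asks all of these peers (plus localStorage) in parallel and
// merges whatever replies come back, fastest first.
var gun = Gun({ peers: ['https://server1.example/gun', 'https://server2.example/gun'] });
```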
GUN has swappable storage and transport interfaces, so yes, it is easy to build other layers on top or into it.

Related

Deploy REST API over multiple servers world-wide, but stay in sync

I've built a REST API with pretty decent latency: each request is answered in ~100 ms at a thousand requests per second. This is, however, with a relatively short physical distance to the data center, while the users of this API would be spread all over the globe. From the US, for example (to a data center in Germany), the response time for a single request is ~400 ms under no load.
What would be the best approach to deploying this API? I suspect multiple servers at different locations, with a load balancer in front. How would I ensure that the MySQL database would stay in sync between the servers in that case?
With multiple servers and a load balancer, the costs rise exponentially, which is something I can hopefully afford in the future, but not at the moment.
I'd love to hear your ideas!
AFAIK, for big projects people use event sourcing, with an event store and microservices and message queues between them; a basic solution is polling the event store through a simple REST API, something like "send me the latest events since the last event I received". If you can accept eventual consistency, then I think this approach can work pretty well. It makes writing somewhat slower, but reading can be very fast. There is no need to sync the MySQL databases directly; you just pull the latest events and use a projection to update the local MySQL database. So the event store is the single source of truth.
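A minimal sketch of that polling loop, assuming a hypothetical /events endpoint that returns events newer than a given id:

```javascript
// Poll the event store for anything newer than the last event we saw,
// then project each event into the local MySQL database.
let lastEventId = 0;

function applyEvent(event) {
  // stub: the projection that updates the local MySQL database
  console.log('applying', event.type, event.id);
}

async function pollEvents() {
  const res = await fetch(`https://api.example.com/events?since=${lastEventId}`);
  const events = await res.json();
  for (const event of events) {
    applyEvent(event);
    lastEventId = event.id; // remember how far we have caught up
  }
}

setInterval(pollEvents, 60 * 1000); // eventual consistency: poll periodically
```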

Reliability of Windows Azure Storage Logging

We are in the process of creating a piece of software to back up a storage account (blobs & tables, no queues), and while researching how to do this we came across the possibility of storage logging. We would like to use this feature to do smart incremental backups after an initial full backup. However, in the introductory post for this feature here, the following caveat is mentioned:
During normal operation all requests are logged; but it is important to note that logging is provided on a best effort basis. This means we do not guarantee that every message will be logged due to the fact that the log data is buffered in memory at the storage front-ends before being written out, and if a role is restarted then its buffer of logs would be lost.
As this is a backup solution, this behavior makes the feature unusable: we can't afford to miss a file. However, I wonder if this has changed in the meantime, as Microsoft has built a number of features on top of it, like blob function triggers and, very recently, their new Azure Event Grid.
My question is whether this behavior has changed in the meantime, or whether the logs are still on a best-effort basis and we should stick to our 'scanning' strategy.
The behavior for Azure Storage logs is still the same. For your case, you might be better off using the Event Grid notification for Blob storage: https://azure.microsoft.com/en-us/blog/introducing-azure-event-grid-an-event-service-for-modern-applications/

How to handle temporary unreachable online api

This is a more general question, so bear with my abstraction of the following problem.
I'm currently developing an application that interfaces with a remote server over a public API. The API in question provides mechanisms for fetching data based on a timestamp (e.g. "get me everything that changed since xxx"). Since the amount of data is quite high, I keep a local copy in a database and check for changes on the remote side every hour.
While this makes the application robust against network problems (remote server in maintenance, network outage, etc.) and enables employees to continue working with the application, there is one big gaping problem:
The API in question also offers write access, e.g. my application can instruct the remote server to create a new object. Currently I'm sending the request via the API and, upon success, creating the object in my local database too. It will eventually propagate via the hourly data fetching, where my application (ideally) sees that no changes need to be made to the local database.
Now when the API is unreachable, I create the object in my database and cache the request until the API is reachable again. This has multiple problems:
If the request fails (due to errors that could not be validated beforehand), I end up with an object in the database which shouldn't even exist. I could delete it, but it seems hard to explain to the user(s) ("something went wrong with the API, so we deleted that object again").
The problem especially cascades when dependent actions queue up, e.g. creating the object plus two more requests modifying it. When the initial create fails, so will the modifying requests (since the object does not exist on the remote side).
The worst case is deletion: when an object is deleted locally but will not be deleted on the remote side, I have no way of (easily) restoring it.
One might suggest never creating objects locally and letting them propagate only through the hourly data sync. That unfortunately is not an option: if the API is not accessible, it can be for hours, and it is mandatory that employees can continue working with the application (which they cannot when said objects don't exist locally).
So bottom line:
How do you handle such a scenario, where the API might not be reachable but certain requests must be cached locally and replayed when the API is reachable again? Especially, how do you handle the cases where those requests unpredictably fail?
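For what it's worth, here is one possible sketch of the cached-request ("outbox") approach the question describes; the api client, the err.permanent flag, and rollbackLocal() are assumptions for illustration, and in practice the queue would be persisted (e.g. in the local database) so it survives restarts:

```javascript
// Pending requests, replayed in order once the remote API is reachable.
const outbox = [];

function enqueue(request) {
  outbox.push(request); // remember the request for later replay
}

async function flushOutbox(api) {
  while (outbox.length > 0) {
    const request = outbox[0];
    try {
      await api.send(request);  // replay against the remote API
      outbox.shift();           // drop it only once it succeeded
    } catch (err) {
      if (err.permanent) {
        // The remote rejected the request outright: roll back the local
        // object (and any queued requests depending on it) and tell the
        // user why, instead of retrying forever.
        rollbackLocal(request, err);
        outbox.shift();
      } else {
        break;                  // API still unreachable: retry later
      }
    }
  }
}

function rollbackLocal(request, err) {
  // stub: undo the optimistic local change and notify the user
  console.warn('rolled back', request, 'because', err.message);
}
```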

Server Load & Scalability for Massive Uploads

I want to let users upload millions of audio items to my server. The current app is designed to take the content, transcode it, and finally send it by FTP to the storage servers. I want to know:
1- Can the app server bear the enormous user-driven tasks (commenting, uploading, transcoding) after scaling to more servers (to carry the web app load)?
2- If the answer to the above question is yes, is that the correct and best approach? A good architecture would be to send the transcoding to the storage servers, wait for the job to finish, and send the response back to the app server, but at the same time that has more complexity and insecurity.
3- What is the common method for this type of website?
4- If I send the upload and transcoding jobs to the storage servers, is that compatible with enterprise storage technologies and long-term scalability?
5- The current app is based on PHP. Is it possible to move the tmp folder to other servers to overcome the upload overload?
Thanks for the answer. Regarding the tmp folder in question 5: I mean the tmp folder used by Apache. I know that all uploaded files are stored in Apache's tmp folder before moving to their final storage destination (e.g. the storage servers or any other solution). I was wondering whether it is a rule for Apache that all uploaded files must first land on the app server, and if so, how I can control, scale and redirect this massive storage load to a temporary storage or server. I mean a server or storage solution that acts as Apache's tmp folder, just hosting uploaded files before they are sent to their final storage places. I have studied and designed everything regarding scaling of the database, storage, load balancing, memcache, etc., but this is one of my unsolved questions: where are newly arrived user files placed in a scaled architecture, and what is the common solution for this? (In a one-box solution all files sit temporarily in Apache's tmp dir, but what about a massive amount of content in a scaled system?)
Regards
You might want to take a look at the Viddler architecture: http://highscalability.com/blog/2011/5/10/viddler-architecture-7-million-embeds-a-day-and-1500-reqsec.html
Since I don't feel I can answer this (I wanted to add a comment, but my text was too long), some thoughts:
If you are creating such a large system (as it sounds), you should run performance tests to see how many concurrent connections/uploads (or whatever) your architecture can handle. As I always say: if you don't know, then "no, it can't".
I think the best way to deal with heavy load (here: a lot of uploads, which requires a lot of blocked threads on the app server) is to not use the app server to handle the file uploads. Perform all your heavy operations (transcoding) asynchronously (e.g. queue the uploaded files and process them afterwards). In any case the application server should not wait for the response of the transcoding system; just tell the user that his files are going to be processed and send him a message (or whatever) when it's finished. You can use something like Gearman for that.
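A minimal sketch of that split, with an in-memory array standing in for a real job queue (e.g. Gearman) and stubbed helpers:

```javascript
// The request handler only persists the upload and enqueues a job;
// a separate worker does the transcoding later, off the request path.
const jobs = [];

function handleUpload(req, res) {
  const file = saveToTempStorage(req);  // persist the raw upload first
  jobs.push({ file });                  // enqueue; do NOT transcode here
  res.end('Your file is queued for processing; we will notify you when it is done.');
}

setInterval(function worker() {
  const job = jobs.shift();
  if (job) transcode(job.file);         // heavy work runs asynchronously
}, 1000);

function saveToTempStorage(req) {
  return '/tmp/upload-' + Date.now();   // stub: stream the body to disk
}

function transcode(file) {
  console.log('transcoding', file);     // stub: invoke the real transcoder
}
```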
I would search for existing architectures that have to handle a lot of uploads/conversions too (e.g. Flickr); just go to SlideShare and search for "flickr" or "scalable web architecture".
I do not really understand this, but I would use servers based on their tasks (e.g. application servers, database servers, transcoding servers, storage, ...); each server should do what it does best.
I am afraid I don't know what you are talking about when you say tmp folder.
Good luck

Planning the development of a scalable web application

We have created a product that potentially will generate tons of requests for a data file that resides on our server. Currently we have a shared hosting server that runs a PHP script to query the DB and generate the data file for each user request. This is not efficient and has not been a problem so far, but we want to move to a more scalable system, so we're looking into EC2. Our main concerns are being able to handle high amounts of traffic when they occur, and providing low latency to users downloading the data files.
I'm not 100% sure on how this is all going to work yet but this is the idea:
We use an EC2 instance to host our admin panel and to generate the files that are being served to app users. When any admin makes a change that affects these data files (which are downloaded by users), we copy them over to S3 and serve them through CloudFront. The idea here is to have the data cached and waiting on S3 so we can keep our compute times low, and to use CloudFront to get low latency for all users requesting the files.
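As an illustrative sketch of the "generate, then push to S3" step (using the AWS SDK for JavaScript; the bucket name, key, and generateDataFile() are assumptions), CloudFront would then use this bucket as its origin:

```javascript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const s3 = new S3Client({ region: 'us-east-1' });

function generateDataFile() {
  // stub: query the DB and build the file contents (the PHP step, sketched in JS)
  return JSON.stringify({ generatedAt: new Date().toISOString() });
}

async function publishDataFile() {
  await s3.send(new PutObjectCommand({
    Bucket: 'my-data-files',      // hypothetical bucket fronted by CloudFront
    Key: 'data/latest.json',
    Body: generateDataFile(),
    ContentType: 'application/json',
  }));
}
```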
I am still learning the system and wanted to know if anyone had any feedback on this idea or insight in to how it all might work. I'm also curious about the purpose of projects like Cassandra. My understanding is that simply putting our application on EC2 servers makes it scalable by the nature of the servers. Is Cassandra just about keeping resource usage low, or is there a reason to use a system like this even when on EC2?
CloudFront: http://aws.amazon.com/cloudfront/
EC2: http://aws.amazon.com/ec2/
Cassandra: http://cassandra.apache.org/
Cassandra is a non-relational database engine, and if this is what you need, you should first evaluate Amazon's SimpleDB: a non-relational database engine built on top of S3.
If the file only needs to be updated based on time (daily, hourly, ...) then this seems like a reasonable solution. But you may consider placing a load balancer in front of 2 EC2 images, each running a copy of your application. This would make it easier to scale later and safer if one instance fails.
Some other services you should read up on:
http://aws.amazon.com/elasticloadbalancing/ -- Amazons load balancer solution.
http://aws.amazon.com/sqs/ -- Used to pass messages between systems, in your DA (distributed architecture). For example if you wanted the systems that create the data file to be different than the ones hosting the site.
http://aws.amazon.com/autoscaling/ -- Allows you to adjust the number of instances online based on traffic
Make sure to have a good backup process with EC2: snapshot your OS drive often and place any volatile data (e.g. database files) on an EBS volume. EC2 doesn't fail often, but when it does you don't have access to the hardware, and if you have an up-to-date snapshot you can just kick a new instance online.
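A hedged sketch of automating that snapshot advice with the AWS SDK for JavaScript; the volume ID is a placeholder:

```javascript
import { EC2Client, CreateSnapshotCommand } from '@aws-sdk/client-ec2';

const ec2 = new EC2Client({ region: 'us-east-1' });

// Snapshot the EBS volume holding the volatile data (e.g. database files).
await ec2.send(new CreateSnapshotCommand({
  VolumeId: 'vol-0123456789abcdef0',    // placeholder volume ID
  Description: 'scheduled backup snapshot',
}));
```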
Depending on the datasets, Cassandra can also significantly improve response times for queries.
There is an excellent explanation of the data structures used in NoSQL solutions that may help you see whether this is an appropriate fit:
WTF is a Super Column