Storing NLP Models in Git Repo vs S3? - amazon-s3

What is the best way to store NLP models? I have multiple NLP models, about 800 MB in total, which my code loads into memory at startup. Should I store them in the git repo and load them directly from the local file system, or should I store them in an external location like S3 and load them from there? What are the advantages and disadvantages of each? Or do people use some other method that I haven't considered?

Do your NLP models need to be version controlled? Do you ever need to revert to a previous NLP model? If not, storing your artifacts in an S3 bucket is certainly sufficient. If you are planning on storing many NLP models for a long period of time, I also recommend AWS Glacier. Glacier is extremely cost-effective for long-term storage.

Very good question, though very few people pay attention to it.
Here are a few factors I'd point out:
Cost: (1) storing the files and (2) bandwidth, i.e. the cost of downloading/uploading resources (models, etc.).
Lazy download: Not all the resources are required for running an NLP system. It's a headache for the end user to download many resources that are of little use for their purpose. In other words, the system should download (ideally by itself) any resource it needs, when it's required.
Convenience.
And options are:
S3: The benefit is that once you have it working, it's convenient. But the issue is that someone familiar with S3 and AWS has to monitor the system for failures/payments/etc. And it's often expensive. Not only do you pay for the storage space; more importantly, you also pay for bandwidth. If you have resources like word embeddings or dictionaries (in addition to your models), each of which takes a few GB, it's not hard to hit terabytes of bandwidth usage. AI2 uses S3 and they have a simple Scala system for their usage. Their system is "lazy", i.e. your program downloads (and caches) a given resource only when it's required.
Keep it in the repo: checking big binary files into the repo is certainly not a good idea, unless you use LFS to keep the big files outside your git history. Even then, I'm not sure how you'd make programmatic calls to your files. You'd need scripts and instructions for users to manually download the files, etc. (which is ugly).
I'm adding these two options too:
Maven dependency: Basically package everything in Jar files, deploy them and add them as dependencies. We used to use this, and some people still use it (e.g. the StanfordNLP people, who ask you to add models as a Maven dependency). I personally do not recommend it, mainly because Maven is not designed to take care of big resources (sometimes it hangs, etc.). And this approach is not lazy, meaning that Maven downloads EVERYTHING at once at compile/run time (e.g. when trying StanfordCoreNLP for the first time, you'll HAVE TO download a few gigabytes of files that you might never need to use, which is a headache). Also, if you're a Java user you know that working with the classpath is a BIGx10 headache.
Your own server: Install an object storage server (like Minio), store your files there and, whenever required, send programmatic calls to the server in your desired language (their APIs are available for different languages on their GitHub page). We've written a convenient Java system to access it that might come in handy for you. This gives you the lazy behavior (like S3) while not being expensive (unlike S3); basically you'd get all the benefits of S3.
Just to summarize my opinion: I've tried S3 in the past, and it was pretty convenient, but it was expensive. Since we have a server that's often idle, we are using Minio and we're happy with it. I'd go with this option if you have a reliable remote server to store your files.
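The "lazy" behavior described above (download a resource only when it's first needed, then reuse the cached copy) can be sketched in a few lines of Python. This is a minimal illustration, not AI2's or anyone else's actual system: the `LazyResourceCache` class, the `fake_fetch` function and the resource name are all hypothetical, and in practice `fetch` would wrap a GET against S3, Minio or whatever server holds the files.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


class LazyResourceCache:
    """Fetch a named resource the first time it is requested, then reuse the local copy."""

    def __init__(self, cache_dir: Path, fetch: Callable[[str], bytes]) -> None:
        self.cache_dir = cache_dir
        self.fetch = fetch  # hypothetical hook: would wrap an S3/Minio download in practice
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def path_for(self, name: str) -> Path:
        # Hash the name so that any resource key maps to a safe local file name.
        return self.cache_dir / hashlib.sha256(name.encode("utf-8")).hexdigest()

    def get(self, name: str) -> bytes:
        path = self.path_for(name)
        if not path.exists():
            # Cache miss: download once, write to a temp file, then rename into place
            # so a half-written file is never mistaken for a complete resource.
            data = self.fetch(name)
            tmp = path.with_suffix(".tmp")
            tmp.write_bytes(data)
            tmp.replace(path)
        return path.read_bytes()


download_log = []


def fake_fetch(name: str) -> bytes:
    # Stand-in for a real network download; records each call so we can count them.
    download_log.append(name)
    return f"weights for {name}".encode("utf-8")


cache = LazyResourceCache(Path(tempfile.mkdtemp()), fake_fetch)
first = cache.get("ner-model-v1")   # triggers the (fake) download
second = cache.get("ner-model-v1")  # served from the local cache
print(len(download_log))
```

Resources that are never requested are never downloaded, which is the property that the Maven approach lacks.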

Related

MLflow: Why can't backend-store-uri be an s3 location?

I'm new to mlflow and I can't figure out why the artifact store can't be the same as the backend store?
The only reason I can think of is to be able to query the experiments with SQL syntax... but since we can interact with the runs using mlflow ui, I just don't understand why all artifacts and parameters can't go to the same location (which is what happens when using local storage).
Can anyone shed some light on this?
MLflow's Artifacts are typically ML models, i.e. relatively large binary files. On the other hand, run data are typically a couple of floats.
In the end it is not a question of what is possible or not (many things are possible if you put enough effort into it), but rather to follow good practices:
storing large binary artifacts in an SQL database is possible but is bound to degrade the performance of the database sooner or later, and this in turn will degrade your user experience.
storing a couple of floats in a SQL database for quick retrieval for display in a front-end or via command line is a robust industry-proven classic.
It remains true that the documentation of MLflow on the architecture design rationale could be improved (as of 2020).
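To make the split concrete, here is a minimal sketch of the general pattern using only the Python standard library. It is not MLflow's implementation or API: the `runs` table, the `log_run` helper and the file layout are assumptions made up for illustration. Small run data goes into SQL, while the large binary goes to a separate store and the database keeps only its location.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical stand-ins: a temp directory plays the artifact store (S3 in practice),
# an in-memory SQLite database plays the backend store.
artifact_root = Path(tempfile.mkdtemp())
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE runs (run_id TEXT PRIMARY KEY, accuracy REAL, artifact_uri TEXT)")


def log_run(run_id: str, accuracy: float, model_bytes: bytes) -> None:
    # The large binary is written to the artifact store...
    artifact_path = artifact_root / run_id / "model.bin"
    artifact_path.parent.mkdir(parents=True)
    artifact_path.write_bytes(model_bytes)
    # ...and the database row holds only a float and a pointer to it.
    db.execute("INSERT INTO runs VALUES (?, ?, ?)", (run_id, accuracy, str(artifact_path)))


log_run("run-a", 0.91, b"\x00" * 1024)
log_run("run-b", 0.87, b"\x01" * 2048)

# Querying and sorting runs touches only small rows; the binary is read on demand.
best_id, best_uri = db.execute(
    "SELECT run_id, artifact_uri FROM runs ORDER BY accuracy DESC LIMIT 1"
).fetchone()
best_model = Path(best_uri).read_bytes()
print(best_id, len(best_model))
```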

Object storage for a web application

I am currently working on a website where roughly 40 million documents and images should be served to its users. I need suggestions on which method is the most suitable for storing content subject to these requirements.
System should be highly available, scalable and durable.
Files have to be stored permanently and users should be able to modify them.
Due to client restrictions, 3rd party object storage providers such as Amazon S3 and CDNs are not suitable.
File size of content can vary from 1 MB to 30 MB. (However about 90% of the files would be less than 2 MB)
Content retrieval latency is not much of a problem. Therefore indexing or caching is not very important.
I did some research and found out about the following solutions:
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for the last two years, and the work is still in progress. However, the requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes range from several bytes to several megabytes, and, most importantly, deduplication should be implemented at both the document and block levels. If two different users upload the same file to the storage, only one copy of the file should be kept. Also, if two different files partially overlap, only one copy of their common part should be stored.
But let's focus on your requirements, where deduplication is not needed. First of all, high availability implies replication. You'll have to store each file in several replicas (typically 2 or 3, although there are techniques to reduce the redundancy overhead) on independent machines so that you stay alive if one of the storage servers in your backend dies. Also, given the estimated amount of data, it's clear that all your data just won't fit on a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take concurrency control into account to avoid race conditions when two different clients try to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). So, to summarize, these facts mean that you are actually looking for a distributed database designed to store BLOBs.
One of the biggest problems in distributed systems is the difficulty of maintaining the global state of the system. In brief, there are two approaches:
Choose a leader that will communicate with the other peers and maintain the global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that the leader becomes a single point of failure. If the leader dies, either some observer must assign the leader role to one of the replicas (the common case for master-slave replication in the RDBMS world), or the remaining peers need to elect a new one (algorithms like Paxos and Raft are designed to address this issue). Either way, almost all incoming system traffic goes through the leader. This leads to "hot spots" in the backend: a situation where CPU and IO costs are unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check the etcd and consul limitations if you are interested).
Avoid global state altogether. Weaken the guarantees to eventual consistency. Disable the updating of files: if someone wants to edit a file, you need to save it as a new file. Use a system that is organized as a peer-to-peer network. There is no peer in the cluster that keeps full track of the system, so there is no single point of failure. This results in high write throughput and nice horizontal scalability.
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it's a good option to store files in a traditional RDBMS, because these systems are optimized for structured data and strong consistency, and you need neither. You'll also have difficulties with backups and scaling. People usually don't use an RDBMS in this way.
Using GridFS to chunk and store content.
I'm not sure, but it looks like GridFS is built on top of MongoDB. Again, this is a document-oriented database designed to store JSON documents, not BLOBs. Also, MongoDB had problems with clustering for many years. MongoDB passed Jepsen tests only in 2017. This may mean that MongoDB clustering is not mature yet. Run performance and stress tests if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I have used neither of these solutions, but HDFS looks like overkill, because you become dependent on the Hadoop stack. I have no idea about GlusterFS performance. Always consider the design of distributed file systems: if they have some kind of dedicated "metadata" servers, treat them as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object storage is not well known outside the Russian part of the Internet, but it's mature and stable, and its performance is excellent. It was developed at Yandex (a Russian search engine), and a lot of Yandex services (like Disk, Mail, Music, Picture hosting and so on) are built on top of it. I used it in a previous project. It may take some time for your ops to get into it, but it's worth it if you're OK with the GPL license.
Ceph. This is a real object storage. It's also open source, but it seems that only Red Hat people know how to deploy and maintain it, so get ready for vendor lock-in. I've also heard that its settings are too complicated. I've never used it in production, so I don't know about its performance.
Minio. This is S3-compatible object storage, under active development at the moment. Never used it in production, but it seems to be well-designed.
You may also check wiki page with the full list of available solutions.
And the last point: I strongly recommend not using OpenStack Swift (there are a lot of reasons why, but first of all, Python is just not good for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .

Storing large objects in Couchbase - best practice?

In my system, a user can upload very large files, which I need to store in Couchbase. I don't need such very large objects to persist in memory; I want them to always be read from and written to disk. These files are read-only (never modified). The user can upload them, delete them and download them, but never update them. Due to some technical constraints, my system cannot store those files in the file system, so they have to be stored in the database.
I've done some research and found an article[1] saying that storing large objects in a database is generally a bad idea, especially with Couchbase, but which at the same time provides some advice: create a secondary bucket with a low RAM quota and tune the value/full eviction policy. My concern is the limit of 20 MB mentioned by the author. My files would be much larger than that.
What's the best approach for storing large files in Couchbase without having them persist in memory? Is it possible to raise the 20 MB limit? Should I create a secondary bucket with a very low RAM quota and a full eviction policy?
[1]http://blog.couchbase.com/2016/january/large-objects-in-a-database
Generally, Couchbase engineers recommend that you not store large files in Couchbase. Instead, you can store the files on some file server (like AWS S3 or Azure Blob Storage or something) and store only the metadata about the files in Couchbase.
There's a couchbase blog posting that gives a pretty detailed breakdown of how to do what you want to do in Couchbase.
This is Java API specific but the general approach can work with any of the Couchbase SDKs, I'm actually in the midst of doing something pretty similar right now with the node SDK.
I can't speak for what couchbase engineers recommend but they've posted this blog entry detailing how to do it.
For large files, you'll certainly want to split into chunks. Do not attempt to store a big file all in one document. The approach I'm looking at is to chunk the data, and insert it under the file sha1 hash. So file "Foo.docx" would get split into say 4 chunks, which would be "sha1|0", "sha1|1" and so on, where sha1 is the hash of the document. This would also enable a setup where you can store the same file under many different names.
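A minimal sketch of that chunking scheme, using plain Python dicts as stand-ins for the Couchbase buckets (no Couchbase SDK calls are shown or implied). The function names, the metadata layout and the tiny `CHUNK_SIZE` are hypothetical and chosen for illustration only; a real chunk size would be far larger while staying below the document size limit.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example splits into a few chunks


def put_file(chunks_bucket: dict, names_bucket: dict, filename: str, data: bytes) -> str:
    """Split data into fixed-size chunks keyed as '<sha1>|<index>' and record the name."""
    digest = hashlib.sha1(data).hexdigest()
    pieces = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
    for index, piece in enumerate(pieces):
        # Identical content always maps to the same keys, so re-uploading is idempotent.
        chunks_bucket[f"{digest}|{index}"] = piece
    # The file name is only a pointer to the content hash and the chunk count.
    names_bucket[filename] = {"sha1": digest, "chunks": len(pieces)}
    return digest


def get_file(chunks_bucket: dict, names_bucket: dict, filename: str) -> bytes:
    """Reassemble a file by reading its chunks in index order."""
    meta = names_bucket[filename]
    return b"".join(chunks_bucket[f"{meta['sha1']}|{i}"] for i in range(meta["chunks"]))


chunks_bucket, names_bucket = {}, {}
payload = b"hello world!!"  # 13 bytes -> 4 chunks of at most 4 bytes
put_file(chunks_bucket, names_bucket, "Foo.docx", payload)
put_file(chunks_bucket, names_bucket, "Copy of Foo.docx", payload)

restored = get_file(chunks_bucket, names_bucket, "Copy of Foo.docx")
print(len(chunks_bucket), restored == payload)
```

Because the chunk keys depend only on the content, the second upload adds a name entry but no new chunks.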
Tradeoffs -- if integration with Amazon S3 is an option for you, you might be better off with that. In general chunking data in a DB like what I describe is going to be more complicated to implement, and much slower, than using something like Amazon S3. But that has to be traded off other requirements, like whether or not you can keep sensitive files in S3, or whether you want to deal with maintaining a filesystem and the associated scaling of that.
So it depends on what your requirements are. If you want speed/performance, don't put your files in Couchbase -- but can you do it? Sure. I've done it myself, and the blog post above describes a separate way to do it.
There are all kinds of interesting extensions you might wish to implement, depending on your needs. For example, if you commonly store many different files with similar content, you might implement a blocking strategy that would allow single-store of many common segments, to save space. Other solutions like S3 will happily store copies of copies of copies of copies, and gleefully charge you huge amounts of money to do so.
EDIT as a follow-up, there's this other Couchbase post talking about why storing in the DB might not be a good idea. Reasonable things to consider - but again it depends on your application-specific requirements. "Use S3" I think would be generally good advice, but won't work for everyone.
MongoDB has an option to do this sort of thing, and it's supported in almost all drivers: GridFS. You could do something like GridFS in Couchbase, which is to make a metadata collection (bucket) and a chunk collection with fixed-size blobs. GridFS allows you to change the chunk size per file, but all chunks of a file must be the same size (except the last one, which holds the remainder). The file size is stored in the metadata. GridFS's default chunk size is 255 KB.
You don't need memory cache for files, you can queue up the chunks for download in your app server. You may want to try GridFS on Mongo first, and then see if you can adapt it to Couchbase, but there is always this: https://github.com/couchbaselabs/cbfs
This is the best practice: do not use Couchbase as the main database; treat it as a sync database. No matter how you chunk the data into small pieces, it will eventually exceed the 20 MB size limit, which will hurt you in the long run. Putting a robust database like MySQL in the middle to store the large data, and using Couchbase for real-time and sync only, will help.

Post Processing of Resized Image In clustered environment

I've been playing with ImageResizer for a bit now, and I'm trying to do something but am having trouble understanding how to go about it.
Mainly, I would like to stick to the idea of using the pipeline, and not try to cheat it.
So... let's say I use ImageResizer in a pretty standard way, for something like:
giants_logo.jpg?w=280&h=100
The file is giants_logo.jpg.
The processing request is for a resized version with 'w=280&h=100'.
In a clustered environment, consider what happens if this same request is served by 3 machines.
All 3 would end up doing the resize and then storing their cached version in a local folder on disk. I could leverage a shared drive or something, but that has its own limitations.
What I am looking to do is get the processed file and then copy it back up to the DB or S3 where the main images are served from.
My thought is that I might have to write something like DiskCache, but with completely different guts, using the DB or S3 as the back end instead of the file system.
I realize the point of caching is speed, and what I am suggesting negates that aspect... but maybe that's not the case if we layer things.
Anyway, What I am focused on is trying to keep track of the files generated, as well as avoid processing on multiple servers.
Any thoughts on the route I should look at to accomplish this?
TLDR; When DiskCache actually stops working well (usually between 1 and 20 million unique images), then switch to a CDN (unless it's too expensive), or a reverse proxy (unless your data set is really too huge to be bound by mortal infrastructure).
For petabyte data sets on the cheap when performance isn't king, it's a good plan. But for most people, it's premature. Even users with upwards of 20TB (source images) still use DiskCache. Really. Terabyte drives are cheap.
Latency is the killer.
To make this work you would need a central Redis server. MSSQL won't cut it (at least not on a VM or commodity hardware, we've tried). Given a Redis server, you can track what is done and stored (and perhaps even what is in progress, to de-duplicate effort in real time, as DiskCache does).
If you can track it, you can reuse it, and you can delete it. Reuse will be slower, since you're doubling the network traffic, moving the result twice. (But also decreasing it linearly with the number of servers in the cluster for source image fetches).
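The tracking idea can be sketched as follows. To keep the example self-contained, an in-memory class stands in for the central Redis server; no Redis client calls are shown, and `ResizeTracker`, `handle_request` and the `resized/...` location scheme are hypothetical names, not part of ImageResizer or DiskCache. The `claim` method plays the role of an atomic set-if-absent operation.

```python
import threading
from typing import Optional


class ResizeTracker:
    """In-memory stand-in for the central store that tracks done and in-progress work."""

    IN_PROGRESS = "in_progress"

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._state = {}  # cache key -> IN_PROGRESS or the blob location

    def claim(self, key: str) -> bool:
        """Atomic set-if-absent: True means the caller should perform the resize."""
        with self._lock:
            if key in self._state:
                return False
            self._state[key] = self.IN_PROGRESS
            return True

    def complete(self, key: str, location: str) -> None:
        with self._lock:
            self._state[key] = location

    def lookup(self, key: str) -> Optional[str]:
        with self._lock:
            value = self._state.get(key)
            return None if value == self.IN_PROGRESS else value


tracker = ResizeTracker()
blob_store = {}    # stand-in for S3 or the DB holding resized images
resize_calls = []  # records which server actually did the CPU work


def handle_request(server: str, key: str) -> Optional[str]:
    if tracker.claim(key):
        resize_calls.append(server)
        location = f"resized/{key}"
        blob_store[location] = f"resized bytes for {key}"
        tracker.complete(key, location)
        return location
    # Someone else did (or is doing) the work. A real system would wait, poll,
    # or fall back to a local resize when lookup() returns None.
    return tracker.lookup(key)


key = "giants_logo.jpg?w=280&h=100"
results = [handle_request(server, key) for server in ("web-1", "web-2", "web-3")]
print(resize_calls, results)
```

In this sequential run only the first server resizes; the other two reuse the stored result, at the cost of an extra network fetch each.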
If bandwidth saturation is your bottleneck (very common), this could make performance worse. In fact, unless your read/write ratio is write and CPU heavy, you'll likely see worse performance than duplicated CPU effort under individual disk caches.
If you have the infrastructure to test it, put DiskCache on a SAN or shared drive; this will give you a solid estimate of the performance you can expect (assuming said drive and your blob storage system have comparable IO perf).
However, it's a fair amount of work, and you're essentially duplicating a subset of the functionality of a reverse proxy (but with worse performance, since every response has to be proxied through the unlucky cluster server, instead of being spooled directly from disk).
CDNs and Reverse proxies to the rescue
Amazon CloudFront or Varnish can serve quite well as reverse proxies/caches for a web farm or cluster. Now, you'll have a bit less control over the 'garbage collection' process, but... also less code to maintain.
There's also ARR, but I've heard neither success nor failure stories about it.
But it sounds fun!
Send me a Github link and I'll help out.
I'd love to get a Redis-coordinated, cloud-agnostic poor-man's blob cache system out there. You bring the petabytes and infrastructure, I'll help you with the integration and troublesome bits. Efficient HTTP proxying is probably the hardest part; the rest is state management and basic threading.
You might want to have a look at a modified AzureReader2 plugin at https://github.com/orbyone/Sensible.ImageResizer.Plugins.AzureReader2
This implementation stores the transformed image back to the Azure blob container on the initial requests, so subsequent requests are redirected to that copy.

Which technology should be used for serving large number of static files?

My main aim is to serve a large number of XML files (> 1bn, each < 1 KB) via a web server. The files can be considered static, as they will be modified by external code at a relatively low frequency (about 50k updates per day). The files will be requested at a high frequency (> 30 req/sec).
The current suggestion from my team is to create a dedicated Java application that implements the HTTP protocol and uses memcached to speed things up, keeping all file data in an RDBMS and getting rid of the file system.
On the other hand, I think a tweaked Apache Web Server or lighttpd should be enough. Caching can be left to the OS or the web server's default caching. There is no point in keeping the data in a DB if the same output is required and it is only queried by file name. I'm not sure how memcached would work here. Also, updating an external cache (memcached) whenever external code updates a file will add complexity.
Another question: if I choose to use files, is it possible to store them in a directory structure like \a\b\c\d.xml and access them via abcd.xml? Or should I put all 1bn files in a single directory (I'm not sure whether the OS will allow it)?
This is NOT a website, but for an application API in closed network so Cloud/CDN is of no use.
I am planning to use CentOS + Apache/lighttpd. Suggest any alternative and best possible solution.
This is the only public note I found on this topic, and it is a little old too.
1bn files at 1KB each, that's about 1TB of data. Impressive. So it won't fit into memory unless you have very expensive hardware. It can even be a problem on disk if your file system wastes a lot of space for small files.
30 requests a second is far less impressive. It's certainly not the limiting factor for the network nor for any serious web server out there. It might be a little challenge for a slow harddisk.
So my advice is: Put the XML files on a hard disk and serve them with a plain vanilla web server of your choice. Then measure the throughput and optimize it, if you don't reach 50 files a second. But don't invest into anything unless you have shown it to be a limiting factor.
Possible optimizations are:
Find a better layout in the file system, i.e. distribute your files over enough directories so that you don't have too many files (more than 5,000) in a single directory.
Distribute the files over several hard disks so that they can be accessed in parallel
Use faster hard disks
Use solid state disks (SSD). They are expensive, but can easily serve hundreds of files a second.
If a large number of the files are requested several times a day, then even a slow hard disk should be enough because your OS will have the files in the file cache. And with today's file cache size, a considerable amount of your daily deliveries will fit into the cache. Because at 30 requests a second, you serve 0.25% of all files a day, at most.
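The back-of-the-envelope numbers used above can be checked with a few lines of Python. The inputs are the upper bounds from the question (1bn files of at most 1 KB, 30 requests per second); the variable names are just for illustration.

```python
FILE_COUNT = 1_000_000_000      # > 1bn files in the question; 1bn used as the baseline
FILE_SIZE_BYTES = 1_000         # each file is under 1 KB
REQUESTS_PER_SECOND = 30
SECONDS_PER_DAY = 24 * 60 * 60

# Total data volume if every file were a full 1 KB.
total_terabytes = FILE_COUNT * FILE_SIZE_BYTES / 10**12

# Even if every request hit a different file, this is the most files served in a day.
requests_per_day = REQUESTS_PER_SECOND * SECONDS_PER_DAY
max_share_served = requests_per_day / FILE_COUNT

print(f"total data: {total_terabytes:.1f} TB")
print(f"requests per day: {requests_per_day:,}")
print(f"share of files touched per day: at most {max_share_served:.2%}")
```

So a day's traffic touches roughly a quarter of a percent of the files at most, or about 2.6 GB of data, which is why the OS file cache can absorb a large part of it when requests repeat.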
Regarding distributing your files over several directories, you can hide this with an Apache RewriteRule, e.g.:
RewriteRule ^/xml/(.)(.)(.)(.)(.*)\.xml /xml/$1/$2/$3/$4/$5.xml
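If the code that writes the files also needs to compute where a file lives, the same mapping can be mirrored in Python. This is a sketch that assumes the pattern from the rule above, with one difference: the regular expression here is anchored at the end with `$`, which the rule is not. The `sharded_path` name is hypothetical.

```python
import re

# Same capture groups as the RewriteRule: four single characters, then the rest of the name.
PATTERN = re.compile(r"^/xml/(.)(.)(.)(.)(.*)\.xml$")


def sharded_path(url_path: str) -> str:
    """Map a flat URL path such as /xml/abcdefgh.xml to its nested on-disk path."""
    match = PATTERN.match(url_path)
    if match is None:
        # Names with fewer than four characters before ".xml" do not match the rule.
        raise ValueError(f"path does not match the sharding pattern: {url_path}")
    a, b, c, d, rest = match.groups()
    return f"/xml/{a}/{b}/{c}/{d}/{rest}.xml"


print(sharded_path("/xml/abcdefgh.xml"))
```

Note that with this pattern a four-character name such as abcd.xml maps to /xml/a/b/c/d/.xml rather than /xml/a/b/c/d.xml, so the pattern would need adjusting if names can be that short.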
Another thing you could look at is Pomegranate, which seems very similar to what you are trying to do.
I believe that a dedicated application with everything feeding off a memcache db would be the best bet.