Kubernetes + TF Serving - how to serve hundreds of ML models without keeping hundreds of idle pods running? - tensorflow

I have hundreds of models, based on categories, projects, etc. Some of the models are heavily used, while others are not used very frequently.
How can I trigger a scale-up operation only when needed (for the models that are not frequently used), instead of running hundreds of pods serving hundreds of models while most of them are not being used, which is a huge waste of computing resources?

What you are trying to do is scale deployments to zero when they are not in use.
K8s does not provide such functionality out of the box.
You can achieve it using Knative Pod Autoscaler.
Knative is probably the most mature solution available at the moment of writing this answer.
There are also some more experimental solutions like osiris or zero-pod-autoscaler that you may find interesting and that may be a good fit for your use case.
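For illustration only, here is a rough sketch of what a scale-to-zero TF Serving service could look like when created through the Kubernetes Python client. The image tag, model name, bucket path, and namespace are placeholder assumptions, and the exact autoscaling annotation keys depend on your Knative version:

```python
# Sketch: create a Knative Service wrapping TF Serving, allowed to scale to zero.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

knative_service = {
    "apiVersion": "serving.knative.dev/v1",
    "kind": "Service",
    "metadata": {"name": "resnet-model", "namespace": "default"},
    "spec": {
        "template": {
            "metadata": {
                "annotations": {
                    # Knative Pod Autoscaler: scale down to zero when idle
                    # (annotation names vary slightly between Knative versions)
                    "autoscaling.knative.dev/min-scale": "0",
                    "autoscaling.knative.dev/max-scale": "3",
                }
            },
            "spec": {
                "containers": [{
                    "image": "tensorflow/serving:latest",
                    "env": [
                        {"name": "MODEL_NAME", "value": "resnet"},
                        {"name": "MODEL_BASE_PATH", "value": "s3://my-bucket/models"},
                    ],
                    "ports": [{"containerPort": 8501}],  # TF Serving REST port
                }]
            },
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.knative.dev",
    version="v1",
    namespace="default",
    plural="services",
    body=knative_service,
)
```

One such Service per model keeps rarely-used models at zero replicas until a request arrives, at which point Knative spins up a pod on demand.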

Related

Can OptaPlanner solve multiple problems concurrently

When OptaPlanner is used in a web service, meaning an OptaPlanner app is required to solve multiple problems in parallel threads, are there any limitations that prevent OptaPlanner from doing this? Is synchronization required in any OptaPlanner functions? Thanks
OptaPlanner supports this: it's a common use case.
In a single JVM, look at SolverManager to solve multiple datasets of the same use case in parallel. This works even if the constraint weights differ per dataset (see ConstraintConfiguration), so some datasets can enable or disable constraints while others don't.
For different use cases in a single JVM, just create multiple SolverFactory or SolverManager instances. This is uncommon because usually each use case is a different app (= microservice?).
Across multiple JVMs (= pods), there are several good techniques. Our ActiveMQ quickstart scales horizontally beautifully. Read Radovan's blog post about how ActiveMQ is used to load-balance the work across the solver pods.

Tensorflow Mirror Strategy and Horovod Distribution Strategy

I am trying to understand the basic differences between Tensorflow Mirror Strategy and Horovod Distribution Strategy.
From the documentation and the source code investigation I found that Horovod (https://github.com/horovod/horovod) uses the Message Passing Interface (MPI) to communicate between multiple nodes. Specifically, it uses MPI's all_reduce and all_gather.
From my observation (I may be wrong), Mirror Strategy also uses an all_reduce algorithm (https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/distribute).
Both of them use a data-parallel, synchronous training approach.
So I am a bit confused about how they differ. Is the difference only in the implementation, or are there other (theoretical) differences?
And how does the performance of Mirror Strategy compare to Horovod?
Mirror Strategy has its own all_reduce algorithm, which uses remote procedure calls (gRPC) under the hood.
As you mentioned, Horovod uses MPI/Gloo to communicate between multiple processes.
Regarding performance, one of my colleagues previously ran experiments using 4 Tesla V100 GPUs and the code from here. The results suggested that three settings work best: replicated with all_reduce_spec=nccl, collective_all_reduce with a properly tuned allreduce_merge_scope (e.g. 32), and horovod. I did not see significant differences among these three.
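For reference, here is a minimal sketch of how the two APIs are used today (the current tf.distribute API and the Horovod Keras wrapper); the toy model and optimizer settings are placeholders, not the benchmark setup mentioned above.

```python
# Alternative A: tf.distribute.MirroredStrategy (single process, in-graph
# replication, gRPC/NCCL all-reduce under the hood).
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
with strategy.scope():
    model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
    model.compile(optimizer="adam", loss="mse")
# model.fit(...) then runs synchronous data-parallel training across the GPUs
```

```python
# Alternative B: Horovod (one process per GPU, gradients averaged via
# MPI/Gloo/NCCL all-reduce). Launch with: horovodrun -np 4 python train.py
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    # pin each worker process to its own GPU
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam())  # wraps all-reduce
model.compile(optimizer=opt, loss="mse")
# pass callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)] to fit()
# so all workers start from the same initial weights
```

The conceptual difference shows up in how you launch them: MirroredStrategy is one process driving all local GPUs, while Horovod is one process per GPU coordinated by an external launcher.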

Storing NLP Models in Git Repo vs S3?

What is the best way to store NLP models? I have multiple NLP models which are about 800MB in size in total. My code will load the models into memory at startup time. However, I am wondering what the best way to store the models is. Should I store them in the git repo and load them directly from the local file system, or should I store them in an external location like S3 and load them from there? What are the advantages/disadvantages of each? Or do people use some other method which I haven't considered?
Do your NLP models need to be version controlled? Do you ever need to revert back to a previous NLP model? If not, storing your artifacts in an S3 bucket is certainly sufficient. If you are planning on storing many NLP models for a long period of time, I also recommend AWS Glacier. Glacier is extremely cost-effective for long-term storage.
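As a small illustration of that route (bucket, key, and file names are placeholder assumptions), pushing an artifact to S3 with boto3 might look like the sketch below; the GLACIER storage class only makes sense for archival copies you don't need to read without a restore step:

```python
# Sketch: upload a model artifact to S3, keeping the version in the object key.
import boto3

s3 = boto3.client("s3")
s3.upload_file(
    "models/ner-v3.bin",              # local artifact (placeholder path)
    "my-nlp-artifacts",               # bucket (placeholder)
    "ner/v3/model.bin",               # key; encode the version in the key
    ExtraArgs={"StorageClass": "GLACIER"},  # archival-only copies
)
```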
Very good question, though very few people pay attention to it.
Here are a few factors to consider:
Cost: (1) storing the files and (2) bandwidth, i.e. the cost of downloading/uploading resources (models, etc.).
Lazy download: not all resources are required for running an NLP system. It's a headache for the end user to download many resources that are of no use for their purpose. In other words, the system should (ideally by itself) download any resource it needs only when it is required.
Convenience.
And the options are:
S3: The benefit is that if you have it working, it's convenient. But the issue is that someone familiar with S3 and Amazon AWS has to monitor the system for failures/payments/etc. And it's often expensive: not only do you pay for the storage space, but more importantly you also pay for bandwidth. If you have resources like word embeddings or dictionaries (in addition to your models), each taking a few GB, it's not hard to hit terabytes of bandwidth usage. AI2 uses S3 and has a simple Scala system for this. Their system is "lazy", i.e. your program downloads (and caches) a given resource only when it's required.
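A minimal Python version of that lazy download-and-cache pattern (bucket, key, and cache directory are placeholder assumptions) could look like this:

```python
# Sketch: fetch a resource from S3 only on first use, then serve it from a local cache.
import os
import boto3

_CACHE_DIR = os.path.expanduser("~/.cache/my_nlp_models")

def get_resource(key, bucket="my-nlp-artifacts"):
    """Return a local path to the resource, downloading it on first use."""
    local_path = os.path.join(_CACHE_DIR, key)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        boto3.client("s3").download_file(bucket, key, local_path)
    return local_path

# e.g. model = load_model(get_resource("embeddings/glove.6B.300d.bin"))
```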
Keep it in the repo: checking big binary files into the repo is certainly not a good idea, unless you use Git LFS to keep the big files outside your git history. Even then, I'm not sure how you would make programmatic calls to your files; you would have to provide scripts and instructions for users to download the files manually, etc. (which is ugly).
I'm adding these two options too:
Maven dependency: basically, package everything in JAR files, deploy them, and add them as dependencies. We used to use this, and some people still do (e.g. the StanfordNLP folks ask you to add the models as a Maven dependency). I personally do not recommend it, mainly because Maven is not designed to handle big resources (sometimes it hangs, etc.). This approach is also not lazy, meaning that Maven downloads EVERYTHING at once at compile/run time (e.g. when trying StanfordCoreNLP for the first time, you HAVE TO download a few gigabytes of files that you might never need, which is a headache). Also, if you're a Java user, you know that working with the classpath is a BIGx10 headache.
Your own server: install a file server (like MinIO), store your files there, and whenever a file is required, make programmatic calls to the server from your language of choice (client APIs for different languages are available on their GitHub page). We've written a convenient Java system for accessing it that might come in handy for you. This gives you the lazy behavior (like S3) while not being expensive (unlike S3); basically you get all the benefits of S3.
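For illustration, the same lazy access pattern against a self-hosted MinIO server, using the official MinIO Python SDK (endpoint, credentials, and names are placeholders), could look roughly like this:

```python
# Sketch: lazy download-and-cache against a self-hosted MinIO server.
import os
from minio import Minio

client = Minio("minio.example.com:9000",       # placeholder endpoint
               access_key="ACCESS_KEY",
               secret_key="SECRET_KEY",
               secure=True)

def get_resource(object_name, bucket="nlp-resources",
                 cache_dir=os.path.expanduser("~/.cache/nlp")):
    local_path = os.path.join(cache_dir, object_name)
    if not os.path.exists(local_path):
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        client.fget_object(bucket, object_name, local_path)  # download to file
    return local_path
```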
Just to summarize my opinion: I've tried S3 in the past, and it was pretty convenient, but it was expensive. Since we have a server that's often idle, we are using MinIO and we're happy with it. I'd go with this option if you have a reliable remote server to store your files.

Converting a deep learning model from a GPU-powered framework, such as Theano, to a common, easily handled one, such as Numpy

I have been playing around with building some deep learning models in Python and now I have a couple of outcomes I would like to be able to show friends and family.
Unfortunately(?), most of my friends and family aren't really up to the task of installing any of the advanced frameworks that are more or less necessary to have when creating these networks, so I can't just send them my scripts in the present state and hope to have them run.
But then again, I have already created the nets, and just using the finished product is considerably less demanding than making it. We don't need advanced graph compilers or GPU compute power for the show-and-tell. We just need the ability to do a few matrix multiplications.
"Just" being a weasel word, regrettably. What I would like to do is convert the whole model (connectivity, functions and parameters) to a model expressed in e.g. regular Numpy (which, though not part of the standard library, is both much easier to install and easier to bundle reliably with a script).
I have failed to find any ready-made solutions for this (I find it difficult to pick specific search keywords for it). But it seems to me that I can't be the first person who wants to use a ready-made deep learning model on a lower-spec machine operated by people who aren't necessarily inclined to spend months learning how to set the parameters in an artificial neural network.
Are there established ways of transferring a model from e.g. Theano to Numpy?
I'm not necessarily requesting those specific libraries. The main point is I want to go from a GPU-capable framework in the creation phase to one that is trivial to install or bundle in the usage phase, to alleviate or eliminate the threshold the dependencies create for users without extensive technical experience.
An interesting option for you would be to deploy your project to Heroku, as explained on this page:
https://github.com/sugyan/tensorflow-mnist
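To illustrate the "just a few matrix multiplications" idea from the question: once the trained weights have been exported (for a Theano model, via get_value() on the shared variables and np.savez), inference can be shipped as a small NumPy-only script. A minimal sketch, with placeholder layer names and file paths:

```python
# Sketch: NumPy-only forward pass for a small feed-forward net whose weights
# were exported after training (W0, b0, W1, b1 are placeholder array names).
import numpy as np

params = np.load("model_params.npz")   # created with np.savez after training

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict(x):
    h = relu(x @ params["W0"] + params["b0"])
    return softmax(h @ params["W1"] + params["b1"])

# For a Theano model, the values can be pulled out with shared_var.get_value()
# and saved with np.savez before distributing the script.
```

The recipient then only needs Python and NumPy, with no GPU framework involved.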

Examples of distributed computing tasks relatively common among users

Can you give an example of such tasks?
I'm particularly interested in tasks relevant to a fairly large number of people, which could be solved by using distributed computing (not global projects such as SETI@home, Folding@home, etc.).
As an example, we can take rendering and the http://www.renderfarm.fi community.
Cryptocurrency mining is not relevant.
Thank you!
Well, I don't know much about rendering, but when talking about tasks that can be solved by distributed computing, you will probably want to take a look at Bag-of-Tasks (BoT) applications.
"Bag-of-Tasks applications (those parallel applications whose tasks are
independent) are both relevant and amendable for execution on computational grids. In fact, one can argue that Bag-of-Tasks applications
are the applications most suited for grids, where communication can
easily become a bottleneck for tightly-coupled parallel applications."
This was taken from a paper that talks exactly about Bag-of-Tasks applications with grid computing. You can read the full paper here.
Now finding a task relevant to users is a matter of creativity. This list of distributed computing projects might give you some insights.
Setting up the BOINC server and, mainly, programming the BOINC applications will be the hard tasks here. This BOINC wiki helps you get a notion of what is needed in the "background" of a BOINC project.
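As a toy illustration of the Bag-of-Tasks idea (the work function and task count below are placeholders): because the tasks are independent, they can be farmed out to any pool of workers; BOINC or a grid middleware generalizes the same pattern across volunteer machines instead of local processes.

```python
# Sketch: a bag of independent tasks distributed over a pool of workers.
from concurrent.futures import ProcessPoolExecutor

def run_task(params):
    # placeholder for one self-contained unit of work
    # (render one frame, evaluate one model configuration, ...)
    return sum(i * params for i in range(1_000_000))

if __name__ == "__main__":
    tasks = range(100)  # 100 independent work units
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(run_task, tasks))
    print(len(results), "tasks completed")
```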
Old question, but fresh answer.
I have my own Distributed Computing Library written completely in C++ (search for gridman raspberry pi).
I am using it for:
- Distributed Neural Networks training / validation
- Distributed raytracing (for fun)
- Distributed MD5 crunching (for fun)
- Distributed WEP crunching (for fun)
- Distributed WPA crunching (for fun)
And in general, I always think of it this way: if something takes too long for me, then I split it across several PCs. Real-world examples?
Take investment banking, for example: all these models have to be calculated a million times with different parameters.
Take neural networks - a good example: training takes ages (depending on the data) - if you split it across 10 PCs, your results are obtained 10 times faster.