deep learning model in production, backend or frontend? - tensorflow

I recently built a web app in which a user can upload photos; the app then sends a POST request to the backend, which runs predictions on the photos. Currently, the typical use case is that someone opens the browser on their phone, takes a photo with the phone, and uploads it. So the app basically runs in a mobile browser rather than on a computer.
Backend: Keras + Flask + Gunicorn + Nginx, hosted on a GPU-powered machine (2x 1080 Ti).
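For context, the prediction endpoint is roughly the sketch below (the MobileNetV2 model, route name and preprocessing here are illustrative placeholders rather than my exact code):

    import io
    import numpy as np
    from flask import Flask, request, jsonify
    from PIL import Image
    from tensorflow.keras.applications import MobileNetV2
    from tensorflow.keras.applications.mobilenet_v2 import preprocess_input, decode_predictions

    app = Flask(__name__)
    model = MobileNetV2(weights="imagenet")  # placeholder model, loaded once at startup

    @app.route("/predict", methods=["POST"])
    def predict():
        # The phone browser POSTs the photo as multipart/form-data under the key "photo"
        img = Image.open(io.BytesIO(request.files["photo"].read())).convert("RGB")
        x = np.expand_dims(np.array(img.resize((224, 224)), dtype="float32"), axis=0)
        preds = model.predict(preprocess_input(x))
        top = decode_predictions(preds, top=3)[0]
        return jsonify([{"label": label, "score": float(score)} for _, label, score in top])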
My question is: is this a good architecture in terms of speed?
I've heard people say that the POST request would be very slow because sending a photo over HTTP is slow.
I wonder if loading the model on the client side with TensorFlow.js would be a better choice? It looks attractive since there is no need to POST photos to the backend, but it also means my GPUs would not be used?
I've searched the Internet but couldn't find any reference or comparison.
Thank you!

There are many variables to consider, the key one being how many user requests you expect to serve per minute. The bottleneck in the system will be 'prediction', as you've termed it. Prediction speed will vary depending on many factors, e.g. image resolution and algorithm complexity. You should do some simple tests. Build an algorithm for the type of prediction you want to do, e.g. classification, detection, segmentation, etc. There are stock algorithms available which balance speed against accuracy, and they will give you a sense of what's possible. From memory, on a single 1080 Ti machine, an SSD detection algorithm takes less than 1 second (perhaps even 0.2 s) for high-resolution images. Build your system diagram, identify the key risks, and run tests for each risk identified.
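As a very rough way to get that sense of single-image latency, a quick check with a stock Keras classifier could look like the sketch below (the model, input size and iteration count are assumptions; substitute the detector or segmenter you actually plan to serve):

    import time
    import numpy as np
    from tensorflow.keras.applications import MobileNetV2

    model = MobileNetV2(weights="imagenet")                  # stock classifier as a stand-in
    batch = np.random.rand(1, 224, 224, 3).astype("float32")

    model.predict(batch)                                     # warm-up: graph build / GPU allocation
    runs = 50
    start = time.time()
    for _ in range(runs):
        model.predict(batch)
    print("mean prediction time: %.3f s" % ((time.time() - start) / runs))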

Related

TensorFlow Serving and serving more models than the memory can allow

TensorFlow Serving can serve multiple models by configuring the --model_config_file command line argument. I had success using this feature in small experiments.
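For context, a client call against one of the configured models looks roughly like this on my side (the server address, model name, signature and tensor key below are placeholders; they have to match the --model_config_file entries and the exported signature):

    import grpc
    import numpy as np
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    channel = grpc.insecure_channel("localhost:8500")
    stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

    request = predict_pb2.PredictRequest()
    request.model_spec.name = "model_a"                  # selects one model from the config
    request.model_spec.signature_name = "serving_default"
    request.inputs["input"].CopyFrom(
        tf.make_tensor_proto(np.random.rand(1, 224, 224, 3).astype("float32")))

    response = stub.Predict(request, 10.0)               # 10-second deadline
    print(response.outputs)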
However, it's unclear to me what happens when the total memory required by these models is larger than, say, the available GPU memory.
Does the server just crash? Or does it support keeping a subset of models available and possibly unloading/loading models based on the usage?
Thanks.
Trying to load a model when you are out of memory will simply fail for that model. There's no dynamic loading/unloading at this time.
As currently written, it will crash if there isn't enough memory for all of the models requested to load. Internally there is a feature to gracefully decline to load a model that doesn't fit, which you could enable by writing a small PR that pipes the ServerCore::Options::total_model_memory_limit_bytes option [1] to a flag in main.cc. Note, however, that the notion of "fitting in memory" is based on a somewhat crude way of estimating model RAM footprint.
As Gautam said, it does not dynamically load/unload, although there is a library implemented for that (which isn't currently used in the released binary), called CachingManager [2].
[1] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/model_servers/server_core.h#L112
[2] https://github.com/tensorflow/serving/blob/master/tensorflow_serving/core/caching_manager.h

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose a server on the client. Issue 4 is actually about adding a load balancer in front of those instances, which TensorFlow Serving itself currently lacks.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines, which can improve both latency and throughput if the model is "wide" and can be parallelized well. I do not see any mention of combining this with TensorFlow Serving in either documentation.
My question is: is it possible to configure TensorFlow Serving to use a distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
Make the tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to the tensorflow package by default) by modifying the BUILD file.
Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look sufficient for "real-world usage", for three reasons:
The grpc_session target is probably internal for a reason.
As noted in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
tensorflow_model_server creates its session only once, during bootstrap. If the master node of the cluster goes down and then comes back up, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially you are taking computations which can be done independently and adding a dependency. If one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding your questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file or, equivalently, str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
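A minimal way to do that check programmatically, assuming the TF 1.x graph-mode API:

    import tensorflow as tf

    # Print the ops that carry explicit device ("pin") assignments in the current graph.
    graph_def = tf.get_default_graph().as_graph_def()
    for node in graph_def.node:
        if node.device:
            print(node.name, "->", node.device)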
If any worker restarts, or there is a temporary loss of network connectivity, your sess.run call fails with an Unavailable error and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with Serving you need to handle it yourself.
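A minimal sketch of that handling, assuming you keep a session factory that can rebuild the gRPC session against the cluster:

    import tensorflow as tf

    def run_with_retry(make_session, fetches, feed_dict, retries=3):
        # make_session is your own factory, e.g. lambda: tf.Session("grpc://master:12345")
        sess = make_session()
        for _ in range(retries):
            try:
                return sess.run(fetches, feed_dict=feed_dict)
            except tf.errors.UnavailableError:
                sess.close()
                sess = make_session()  # master came back up: rebuild the session and retry
        raise RuntimeError("cluster still unavailable after %d retries" % retries)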
If your model doesn't use images, or isn't extremely large, you shouldn't need much compute for each inference/serve; I'm saying this based on Inception-v#, which takes ~1 sec to serve a response for an image on a Google Cloud Platform n1-standard-1 machine.
Now, that being said, perhaps it's the throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be Docker Swarm & Compose, or Kubernetes, to help scale out and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.

What is squeeze testing?

In the talk "Beyond DevOps: How Netflix Bridges the Gap," around 29:10 Josh Evans mentions squeeze testing as something that can help them understand system drift. What is squeeze testing and how is it implemented?
It seems to be a term used by the folks at Netflix.
The idea is to run tests/benchmarks to observe changes in performance and find the breaking point of the application, and then either see whether the latest change introduced an inefficiency or determine the recommended auto-scaling parameters before deploying it.
There is a little more information here and here:
One practice which isn’t yet widely adopted but is used consistently by our edge teams (who push most frequently) is automated squeeze testing. Once the canary has passed the functional and ACA analysis phases the production traffic is differentially steered at an increased rate against the canary, increasing in well-defined steps. As the request rate goes up key metrics are evaluated to determine effective carrying capacity; automatically determining if that capacity has decreased as part of the push.
As someone who helped with the development of squeeze testing at Netflix: it uses the large volume of stateless requests from actual production traffic to test the system. One approach is to put inordinately more load on one instance of a service until it breaks, monitor the key performance metrics of that instance, and use that information to decide how to configure auto-scaling policies. It eliminates the problem of fake traffic not stressing the system in the right way.
The reasons it might not work for everyone:
you need more traffic than any one instance can handle.
the requests need to be somewhat uniform in demand on the service under test.
clients & protocol need to be resilient to errors should things go wrong.
The way it is set up is that a proxy is put in front of the service, configured to send a specific RPS to one instance. I used Bresenham's line algorithm to evenly spread the fluctuating incoming traffic over time into a precise outgoing RPS. Turn up the dial on the RPS, watch it burn.
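As a toy illustration of that Bresenham-style spreading (not the actual proxy code), the accumulator carries the remainder forward so the integer number of requests per tick sums exactly to the target over the window:

    def spread_requests(total_requests, ticks):
        # Bresenham-style error accumulator: integer requests per tick that sum
        # exactly to total_requests, spread as evenly as possible.
        plan, error = [], 0
        for _ in range(ticks):
            error += total_requests
            n = error // ticks
            error -= n * ticks
            plan.append(n)
        return plan

    print(spread_requests(7, 5))  # -> [1, 1, 2, 1, 2]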

Streaming IP Camera solutions that do not require a computer?

I want to embed a video stream into my web page, which is part of our own cloud based software. The video should be low-latency (like video conferencing), and it would be preferable, but not required, for it to include audio. I am comfortable serving streaming binary data from the server-side, and embedding it into the page using HTML5 video.
What I am not comfortable with is how to capture the video data in the first place. The client does not already have a solution in place, and is looking to us for assistance. The video would be routed through our server equipment, and would not be an embedded piece that connects directly to the video source.
Using a USB or built-in camera on a computer is a known quantity for us. What I would like more information about is stand-alone cameras.
Some camera models have their own API documentation (example). From what I have read, a manufacturer typically has its own API that it reuses across many or all of its models, and each manufacturer's API is different. However, I have only done surface-level reading and hope to learn more from someone who has already researched this, or perhaps even has first-hand experience.
Do stand-alone cameras generally include an API? (Wouldn't this be a common requirement, so that security software can work with multiple lines of cameras?) If not an API, how is the data retrieved from the on-board web server? Is it usually Flash-based? Perhaps there is a reusable video stream I could capture from there? Or does the stream format usually vary widely?
What would I run into when trying to get the server-side to capture that data?
How does latency on a stand-alone device compare with a USB camera solution?
Do you have tips on picking out a stand-alone camera that would be a good fit for streaming through a server?
I am experienced at using JavaScript (both HTML5 and Node.JS), Perl and Java.
Each camera manufacturer has their own take on this in terms of access points; generally you should be able to ask for a snapshot or an MJPEG stream, but it can vary. Take a look at this entry on CodeProject; it tackles two common methodologies. Here's another one targeted at Foscam specifically.
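As a language-agnostic illustration (shown here in Python; the snapshot path and credentials are hypothetical and vary by manufacturer), grabbing a frame from a camera's snapshot endpoint over HTTP can be as simple as:

    import requests

    SNAPSHOT_URL = "http://192.168.1.64/snapshot.jpg"  # hypothetical camera endpoint

    def grab_frame(url, auth=("admin", "password"), timeout=5):
        # Fetch a single JPEG frame; many cameras also expose an MJPEG URL that
        # returns a multipart stream of such frames.
        resp = requests.get(url, auth=auth, timeout=timeout)
        resp.raise_for_status()
        return resp.content  # raw JPEG bytes

    if __name__ == "__main__":
        with open("frame.jpg", "wb") as f:
            f.write(grab_frame(SNAPSHOT_URL))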
Get a good NAS (I suggest Synology) and check out their long list of supported IP webcams. You can connect the cameras through a hub or a router or whatever you wish. It's not a "computer" as in a "tower", but it does many computer jobs, it can stay on while your computer is off or away, and it can handle things like video feeds, torrents, backups, etc.
I'm not an expert on all the features, so I don't know how to get it to broadcast without recording, but even if it does record, at least it's a separate device. Synology is a popular brand and there are a lot of authorized and unauthorized plugins for it. Check them out and see if one suits you.

Data usage from any application

I want to read how much 3G data each app uses. Is this possible in iOS 5.x? And in iOS 4.x? My goal is, for example:
Maps consumed 3 MB from your data plan
Mail consumed 420 kB from your data plan
etc, etc. Is this possible?
EDIT:
I just found an app doing that: Data Man Pro
EDIT 2:
I'm starting a bounty. Extra points go to the answer that makes this clear. I know it is possible (see the screenshot from Data Man Pro), and I'm sure the solution is limited. But what is the solution, and how do I implement it?
These are just hints, not a solution. I've thought about this many times but never really started implementing the whole thing.
First of all, you can calculate transferred bytes by querying the network interfaces; take a look at this SO answer for code and a nice explanation of network interfaces on iOS.
Use sysctl or similar system functions to detect which apps are currently running (by "running" I mean the process state is set to RUNNING, as the ps or top commands show on OS X; I've never tried it, I just assume this is possible on iOS and hope there are no problems with an app running as an unprivileged user), so you can deduce which apps are running and save the traffic stats for those apps. Obviously, given that applications can run in the background, it is hard to determine which app is transferring data.
It might also be possible to retrieve information about network activity per process/app, like nettop does on OS X Lion; unfortunately nettop uses the private NetworkStatistics.framework, so you can't dig anything out of its implementation.
Take time into account.
My 2 cents
No. All applications in iOS are sandboxed, meaning you cannot access anything outside of the application, so I do not believe this is possible. Nor do I believe data traffic is recorded at this level on the device; if it were, Apple would have exposed it in either the network page or the usage page in Settings.app.
Besides that, not everybody has a "data plan". E.g. in Sweden it's common for data traffic to be free of charge, with no limit on either volume or speed.