Run multiple networks on the same Intel Neural Compute Stick 2 (NCS2/MYRIAD)? - raspbian

I want to load and run multiple networks on the same NCS2: a one-class object detection network (like a person detector), and a network for some recognition on that detection (like gesture recognition).
I tried loading the networks on one NCS2 from two different threads, but when the second network loads, the program exits without any warning or error. Each network works fine on its own (one at a time).
I am using Python on a Raspberry Pi 4 with Raspbian Buster, and the networks are in IR (xml + bin) format.
Is it possible to load multiple networks on the same NCS2 at all?
If yes, what am I missing? Is there some configuration I need to do?

Yes, it is possible, and no specific configuration is required.
There are examples of such functionality in the open-model-zoo repo.
For example, the action recognition demo: it is based on two networks and is implemented in Python.
Any chance you could share the source code of your app? It would make it a lot easier to understand what may be going wrong.
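As a quick illustration, here is a minimal sketch of loading several IR networks onto one device through a single core object. The helper is generic; the model paths and the `MYRIAD` device name in the usage comment are placeholders, and the `read_network`/`load_network` calls assume the 2020-era OpenVINO Python API (`IECore`):

```python
def load_models(core, model_xmls, device="MYRIAD"):
    """Read several IR networks and load them all onto one device.

    `core` is an openvino.inference_engine.IECore instance (or anything
    exposing the same read_network/load_network interface). Each .xml is
    paired with its sibling .bin weights file.
    """
    exec_nets = []
    for xml in model_xmls:
        net = core.read_network(model=xml, weights=xml.replace(".xml", ".bin"))
        exec_nets.append(core.load_network(network=net, device_name=device))
    return exec_nets

# Usage on a real stick (paths are hypothetical):
#   from openvino.inference_engine import IECore
#   detector, recognizer = load_models(
#       IECore(), ["person-detection.xml", "gesture-recognition.xml"])
```

Note that both networks go through the same core and the same device name; creating two separate core objects in two threads is a common source of exactly the silent-exit behaviour described.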

Related

Can I create multiple virtual devices from multiple GPUs in Tensorflow?

I'm currently using logical device configuration in TensorFlow 2.3.0 to simulate multi-GPU training, and it is working. If I buy another GPU, will I be able to use the same functionality with each GPU?
Right now I have 4 virtual GPUs and one physical GPU. I want to buy another GPU and have 2x4 virtual GPUs. I haven't found any information about this, and because I don't have another GPU yet, I can't test it. Is it supported? I'm afraid it's not.
Yes, you can add another GPU. There is no restriction on the number of GPUs, so you can make use of all the GPU devices you have.
As the documentation says:
A visible tf.config.PhysicalDevice will by default have a single tf.config.LogicalDevice associated with it once the runtime is initialized. Specifying a list of tf.config.LogicalDeviceConfiguration objects allows multiple devices to be created on the same tf.config.PhysicalDevice.
You can follow this documentation for more details on using multiple GPUs.
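For reference, a minimal sketch of that configuration (TensorFlow 2.3+; the 1024 MB `memory_limit` is a placeholder, and this must run before any GPU has been initialized). It simply applies the same per-GPU split to every physical device found, so adding a second card scales the logical-device count automatically:

```python
import tensorflow as tf

# Split every physical GPU into 4 logical devices (~1 GiB each).
# With two physical GPUs this would yield 2 x 4 = 8 logical GPUs.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.set_logical_device_configuration(
        gpu,
        [tf.config.LogicalDeviceConfiguration(memory_limit=1024)] * 4,
    )

print(len(tf.config.list_logical_devices("GPU")))
```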

Is there any way to do federated learning across multiple real machines using the tensorflow-federated API?

I am studying the tensorflow-federated API in order to do federated learning across multiple real machines.
But I found an answer on this site saying that it does not support federated learning across multiple real machines.
Is there really no way to do federated learning with multiple real machines?
Even if I set up a network of 2 client PCs and 1 server PC, is it impossible to build that system using the tensorflow-federated API? Or even if I adapt the code, can I still not build the system I want?
If it is possible by modifying the code, can you give me a tip? If not, when will there be an example that runs on real computers?
In case you are still looking for something: if you're not bound to TensorFlow, you could have a look at PySyft, which uses PyTorch. Here is a practical example of an FL system built with one server and two Raspberry Pis as clients.
TFF is really about expressing the federated computations you wish to execute. In terms of physical deployments, TFF includes two distinct runtimes: one is a "reference executor" which simply interprets the syntactic artifact TFF generates, serially, all in Python, without any fancy constructs or optimizations; the other, still under development but demonstrated in the tutorials, uses asyncio and hierarchies of executors to allow for flexible executor architectures. Both of these are really about simulation and FL research, not about deploying to devices.
In principle, this may address your question (in particular, see tff.framework.RemoteExecutor). But I assume that you are asking more about deployment to "real" FL systems, e.g. data coming from sources that you don't control. This is really out of scope for TFF. From the FAQ:
Although we designed TFF with deployment to real devices in mind, at this stage we do not currently provide any tools for this purpose. The current release is intended for experimentation uses, such as expressing novel federated algorithms, or trying out federated learning with your own datasets, using the included simulation runtime.
We anticipate that over time the open source ecosystem around TFF will evolve to include runtimes targeting physical deployment platforms.

deep learning model in production, backend or frontend?

I recently built a web app in which a user can upload photos, after which a POST request is sent to the backend for predictions on those photos. Currently the typical use case is that someone opens the browser on their phone, takes a photo, and uploads it, so the web app runs in a phone browser rather than on a computer.
Backend: Keras + Flask + Gunicorn + Nginx hosted on a GPU-powered machine (two 1080 Ti cards).
My question is: is this a good architecture in terms of speed?
I've heard that the POST request can be very slow, because sending a photo over HTTP is slow.
I wonder if loading the model on the client side with TensorFlow.js would be a better choice? It looks great since there is no need to POST photos to the backend, but it also means my GPUs would not be used?
I've searched the Internet but couldn't find any reference or comparison.
Thank you!
There are many variables to consider, the key one being how many user requests you expect to service per minute. The bottleneck in the system will be "prediction", as you've termed it. Prediction speed varies with many factors, e.g. image resolution and algorithm complexity.
You should do some simple tests. Build an algorithm for the type of prediction you want to do, e.g. classification, detection, segmentation, etc. There are stock algorithms available which balance speed vs. performance, and they will give you a sense of what's possible. From memory, on a single 1080 Ti machine, an SSD detection algorithm takes less than 1 sec (perhaps even 0.2 sec) for high-resolution images.
Build your system diagram, identify the key risks, and run tests for each risk identified.
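The simple tests suggested above are easy to script. Here is a small, stdlib-only harness sketch; the `predict` callable is a stand-in for your actual model call (e.g. a function that runs one image through the network):

```python
import statistics
import time

def benchmark(predict, inputs, warmup=2):
    """Time a prediction callable over a list of inputs.

    Returns median and 95th-percentile latency (ms) and throughput
    (requests/sec). A few warmup calls absorb lazy initialization.
    """
    for x in inputs[:warmup]:
        predict(x)

    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append(time.perf_counter() - start)

    latencies.sort()
    return {
        "p50_ms": statistics.median(latencies) * 1000,
        "p95_ms": latencies[int(0.95 * (len(latencies) - 1))] * 1000,
        "throughput_rps": len(latencies) / sum(latencies),
    }
```

Run it against a representative sample of real phone photos; the p95 number, not the average, is what your users will notice.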

Is it possible to use TensorFlow Serving with distributed TensorFlow cluster to improve throughput/latency?

I'm looking into ways to improve the latency and/or throughput of a TensorFlow Serving instance. I've seen the "Serving Inception" manual and three GitHub issues (2, 3, 4), but all of them seem to create a separate instance of TensorFlow Serving per server and then choose a server on the client side. Issue 4 is actually about adding a load balancer in front of that, which TensorFlow Serving itself currently lacks.
However, there is also the "Distributed TensorFlow" tutorial, which shows how to join a set of machines into a fixed cluster and then manually "pin" some computations to some machines; this can improve both latency and throughput if the model is "wide" and parallelizes well. However, I do not see any mention of combining this with TensorFlow Serving in either documentation.
Question is: is it possible to configure TensorFlow Serving to use distributed TensorFlow cluster?
I was able to make it create and use gRPC sessions (instead of local) with some hacks:
Make tensorflow/core/distributed_runtime/rpc:grpc_session target publicly visible (it's internal to tensorflow package by default) by modifying BUILD file.
Add it as a dependency to the tensorflow_serving/model_servers:tensorflow_model_server target.
Add an extra flag to tensorflow_model_server called --session_target which sets up session_bundle_config.session_target() in main.cc.
Run the binary with --session_target=grpc://localhost:12345, where localhost:12345 is an arbitrary node which will be used to create master sessions.
See my cluster performing some computations on behalf of TensorFlow Serving.
However, this set of hacks does not look enough for "real-world usage" for three reasons:
grpc_session target is probably internal for a reason.
As noted in my other question, distributed TensorFlow works better when computations are manually "pinned" to specific machines. So if we use TensorFlow Serving, we need a way to save those "pins", and the model's structure becomes tied to the cluster's structure. I'm not sure whether this information is exported with Exporter/Saver at all.
tensorflow_model_server creates the session once, during bootstrap. If the master node of the cluster goes down and then comes back up, the serving server still holds the "old" session and cannot process further requests.
All in all, it looks like this scenario is not officially supported yet, but I'm not sure.
If your model fits on a single machine, then it's hard to see how distributing it over many machines will improve throughput. Essentially, you are taking computations which can be done independently and adding a dependency: if one of your machines is slow or crashes, instead of making some queries slow, it will make all queries slow.
That said, it's worth benchmarking to see if it helps, in which case it would make sense to ask for this use-case to be officially supported.
Regarding questions:
Worker assignments are done through the device field in the graph .pbtxt. Some importers/exporters clear those assignments and have a clear_devices flag. You could open the graph definition (the .pbtxt file, or equivalently str(tf.get_default_graph().as_graph_def())) and grep for device strings to check.
If any worker restarts, or there's a temporary loss of network connectivity, your sess.run fails with an error (Unavailable) and you need to recreate the session. This is handled automatically by MonitoredTrainingSession in tf.train, but with Serving you need to handle it yourself.
If your model is not using images, or is not overly large, you shouldn't need much compute for each inference/serve. I'm saying this having used Inception-v#, which takes ~1 sec to serve a response to an image on a Google Cloud Platform n1-standard-1 machine.
That being said, perhaps it's throughput that you need to scale up, and that is a different problem. Your best option for scale at that point would be Docker Swarm & Compose, as well as Kubernetes, to help scale up and serve your inference "micro-service". You could also use Flask to iterate over a sequence of requests if your use case warrants it.

How can I stream data directly into tensorflow as opposed to reading files on disc?

Every TensorFlow tutorial I've found so far works by first loading the training/validation/test images into memory and then processing them. Does anyone have a guide or recommendations for streaming images and labels into TensorFlow as input? I have a lot of images stored on a different server, and I would like to stream them into TensorFlow rather than saving the images on my machine first.
Thank you!
TensorFlow does have queues, which support streaming, so you don't have to load the full dataset into memory. But yes, by default they only support reading from files on the same server. The real problem is that you want to load data into memory from some other server. I can think of the following ways to do this:
Expose your images through a REST service. Write your own queueing mechanism in Python, read the data (using urllib or similar), and feed it to TensorFlow placeholders.
Instead of Python queues (as above) you can use TensorFlow queues as well (see this answer), although it's slightly more complicated. The advantage is that TensorFlow queues can use multiple cores, giving you better performance than normal Python multi-threaded queues.
Use a network mount to fool your OS into believing the data is on the same machine.
Also, remember that with this sort of distributed setup you will always incur network overhead (the time taken to transfer images from Server 1 to Server 2), which can slow your training down a lot. To counteract this, you'd have to build a multi-threaded queueing mechanism with fetch-execute overlap, which is a lot of effort. An easier option, IMO, is to just copy the data onto your training machine.
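The fetch-execute overlap mentioned above can be sketched in plain Python: a producer thread downloads batches ahead of time into a bounded queue while the training loop consumes them. Here `fetch_batch` is a placeholder for your HTTP download (e.g. a urllib call against the REST service from option 1):

```python
import queue
import threading

def prefetch(fetch_batch, num_batches, capacity=4):
    """Yield batches while a background thread fetches the next ones.

    fetch_batch(i) should return batch i (e.g. via an HTTP GET). The
    bounded queue keeps at most `capacity` batches buffered, so the
    producer stays a few steps ahead of training without exhausting RAM.
    """
    q = queue.Queue(maxsize=capacity)
    sentinel = object()  # marks end of stream

    def producer():
        for i in range(num_batches):
            q.put(fetch_batch(i))  # blocks when the buffer is full
        q.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while True:
        batch = q.get()
        if batch is sentinel:
            return
        yield batch
```

The training loop then becomes `for images, labels in prefetch(...)`, and network latency is hidden as long as one fetch takes less time than one training step.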
You can use the socket module in Python to transfer a batch of images and labels from your server to your host. Your graph needs to be defined to take a placeholder as input, and the placeholder must be compatible with your batch size.
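A minimal sketch of that socket transfer, assuming both ends run your own trusted Python code (pickle must never be used with untrusted peers). Each batch is length-prefixed so the receiver knows exactly how many bytes make up one message:

```python
import pickle
import socket
import struct

def send_batch(sock, images, labels):
    """Serialize an (images, labels) batch and send it with a 4-byte length prefix."""
    payload = pickle.dumps((images, labels))
    sock.sendall(struct.pack(">I", len(payload)) + payload)

def _recv_exact(sock, n):
    """Read exactly n bytes (recv may return partial chunks)."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed mid-message")
        buf += chunk
    return buf

def recv_batch(sock):
    """Receive one length-prefixed batch and deserialize it."""
    (length,) = struct.unpack(">I", _recv_exact(sock, 4))
    return pickle.loads(_recv_exact(sock, length))
```

On the host side you would then feed the received arrays into the placeholder via feed_dict, one batch per sess.run call.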