Is there a concept of multiple CPUs in MXNet? - mxnet

As we know, there is a concept of multiple GPUs in MXNet: you specify -gpu on the command line. And if we don't specify a GPU, it runs on the CPU. How many CPUs does it run on? Is it possible to specify multiple CPUs?

You can use several CPUs with the following code (R version; it is pretty similar in Python):
devices = lapply(1:2, function(i) {
  mx.cpu(i)
})
Then train the network as usual. Also, if you have the MKL library, the system automatically computes with all cores.

A couple ways to look at this.
If you compile MXNet with a good BLAS library, those math operations will use all the CPU cores available.
Also, you can specify how many CPU worker threads through the environment variable MXNET_CPU_WORKER_NTHREADS. See http://mxnet.io/how_to/env_var.html
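For example, a small sketch of how you might set that variable from Python (the value 4 is arbitrary here; the variable should be set before mxnet is imported so the engine picks it up):

import os

# Ask the MXNet engine for 4 CPU worker threads.
os.environ["MXNET_CPU_WORKER_NTHREADS"] = "4"

import mxnet as mx  # the engine reads the variable when it starts up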

Related

How to define multiple gres resources in SLURM using the same GPU device?

I'm running machine learning (ML) jobs that make use of very little GPU memory.
Thus, I could run multiple ML jobs on a single GPU.
To achieve that, I would like to add multiple lines in the gres.conf file that specify the same device.
However, it seems the Slurm daemon doesn't accept this, the service returning:
fatal: Gres GPU plugin failed to load configuration
Is there any option I'm missing to make this work?
Or maybe a different way to achieve that with SLURM?
It is kind of similar to this one, but that one seems specific to some CUDA code with compilation enabled, which is way more specific than my general case (or at least as far as I understand).
How to run multiple jobs on a GPU grid with CUDA using SLURM
I don't think you can oversubscribe GPUs, so I see two options:
You can configure the CUDA Multi-Process Service or
pack multiple calculations into a single job that has one GPU and run them in parallel.
Besides the NVIDIA MPS mentioned by Marcus Boden, which is relevant for V100-type cards, there is also Multi-Instance GPU (MIG), which is relevant for A100-type cards.

Optimizing a neural net for running in an embedded system

I am running some code on an embedded system with an extremely limited memory, and even more limited processing power.
I am using TensorFlow for this implementation.
I have never had to work in this kind of environment before.
What are some steps I can take to ensure I am being as efficient as possible in my implementation/optimization?
Some ideas -
- Pruning code -
https://jacobgil.github.io/deeplearning/pruning-deep-learning
- Ensure loops are as minimal as possible (in the big O sense)
- ...
Thanks a lot.
I suggest using TensorFlow Lite.
It will enable you to compress and quantize your model to make it smaller and faster to run.
It also supports leveraging a GPU and/or hardware accelerators if any are available to you.
https://www.tensorflow.org/lite
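As an illustration, a small sketch of a conversion with default post-training quantization, assuming a trained Keras model (model is a placeholder) and the TensorFlow 2.x-style converter API:

import tensorflow as tf

# Convert a trained Keras model to a quantized TFLite flatbuffer.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Write the flatbuffer out for deployment on the embedded target.
with open("model.tflite", "wb") as f:
    f.write(tflite_model)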
If you are working with TensorFlow 1.13 (the latest stable version before the 2.0 prototype), there is a pruning function in the tf.contrib submodule. It has a sparsity parameter that you can tune to determine the size of the network.
I suggest you take a look at the whole tf.contrib.model_pruning submodule here. It has plenty of functions you might need for your specific task.

What does it do if I choose "None" in Hardware Accelerator?

Pretty straightforward question. I was just wondering what does the computation when this option is chosen. Does it run on Google's CPU or on my hardware?
I have looked on Google, Stack Overflow and Colab's help without finding a precise answer.
Thanks :)
PS: When running a full Dense network "without" accelerator, it is approx. as fast as with TPU and a lot faster than with GPU.
Your guess is correct: None means CPU only, but on a Colab-managed cloud VM rather than your local machine (unless you've connected to a local Jupyter instance).
Also keep in mind that you'll need to adjust your code in order to take advantage of hardware accelerators like GPUs and TPUs.
Speedup on a GPU is often a bit magical since many frameworks automatically detect and take advantage of GPUs. Built-in support for TPUs is rare, and obtaining a speedup from TPUs will require adjusting your code.
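A quick way to check what the runtime actually gives you (a sketch, assuming TensorFlow is available and that Colab still exposes the COLAB_TPU_ADDR variable on TPU runtimes):

import os
import tensorflow as tf

# Empty string means no GPU was attached to this runtime.
gpu = tf.test.gpu_device_name()
# COLAB_TPU_ADDR is set only on Colab TPU runtimes.
tpu_addr = os.environ.get("COLAB_TPU_ADDR")

print("GPU:", gpu if gpu else "none")
print("TPU:", tpu_addr if tpu_addr else "none")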

Simple explanation of terms while running configure command during Tensorflow installation

I am installing tensorflow from this link.
When I run the ./configure command, I see the following terms:
XLA JIT
GDR
VERBS
OpenCL
Can somebody explain in simple language, what these terms mean and what are they used for?
XLA stands for 'Accelerated Linear Algebra'. The XLA page states that 'XLA takes graphs ("computations") [...] and compiles them into machine instructions for various architectures.' As far as I understand, this will take the computation you define in TensorFlow and compile it. Think of producing code in C, running it through the C compiler for the CPU, and loading the resulting shared library with the code for the full computation, instead of making separate calls from Python to compiled functions for each part of your computation. Theano does something like this by default. JIT stands for 'just in time compiler', i.e. the graph is compiled 'on the fly'.
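For instance, a rough sketch of turning the JIT on with the TensorFlow 1.x API (session-wide, so that eligible parts of the graph get compiled before running):

import tensorflow as tf

# Enable XLA JIT compilation for the whole session (TF 1.x style).
config = tf.ConfigProto()
config.graph_options.optimizer_options.global_jit_level = tf.OptimizerOptions.ON_1

with tf.Session(config=config) as sess:
    # ... build and run your graph here as usual ...
    pass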
GDR seems to be support for exchanging data between GPUs on different servers via GPUDirect. GPUDirect makes it possible for, e.g., the network card which receives data from another server over the network to write directly into the local GPU's memory without going through the CPU or main memory.
VERBS refers to the InfiniBand Verbs application programming interface ('library'). InfiniBand is a low-latency network used in many supercomputers, for example. This can be used for communication between servers when you want to run TensorFlow on more than one of them. The Verbs API is to InfiniBand what the Berkeley socket API is to TCP/IP communication (although there are many more communication options and different semantics optimized for performance with Verbs).
OpenCL is a programming language suited for executing parallel computing tasks on CPU and non-CPU devices such as GPUs, with a C-like syntax. Compared with C, however, there are certain restrictions, such as no support for recursion. One could probably say that OpenCL is to AMD what CUDA is to NVIDIA (although OpenCL is also used by other companies such as Altera).

Julia uses only 20-30% of my CPU. What should I do?

I am running a program that does numeric ODE integration in Julia. I am running Windows 10 (64-bit), with an Intel Core i7-4710MQ @ 2.50GHz (8 logical processors).
I noticed that when my code was running in Julia, at most 30% of the CPU was in use. Going through the parallelization documentation, I started Julia using:
C:\Users\*****\AppData\Local\Julia-0.4.5\bin\julia.exe -p 8
and expected to see improvements. I did not see them, however.
Therefore my question is the following:
Is there a special way I have to write my code in order for it to use the CPU more efficiently? Is this maybe a limitation posed by my operating system (Windows 10)?
I submit my code in the julia console with the command:
include("C:\\Users\\****\\AppData\\Local\\Julia-0.4.5\\13. Fast Filesaving Format.jl").
Within this code I use some additional packages with:
using ODE; using PyPlot; using JLD.
I measure the CPU usage with windows' "Task Manager".
The -p 8 option to julia starts 8 worker processes and disables multithreading in libraries like BLAS and FFTW so that the workers don't oversubscribe the physical threads on the system, since that kills performance in well-balanced distributed workloads. If you want to get more speed out of -p 8, you need to distribute work between those workers, e.g. by having each of them do an independent computation, or by having them collaborate on a computation via SharedArrays. You can't just add workers and not change the program.
If you are using BLAS (doing lots of matrix multiplies) or FFTW (doing lots of Fourier transforms) and you don't use the -p flag, you'll automatically get multithreading from those libraries. Otherwise, there is no (non-experimental) user-level threading in Julia yet. There is experimental threading support, and version 1.0 will support threading, but I wouldn't recommend that yet unless you're an expert.