Optimizing Tensorflow for a 32-cores computer - tensorflow

I'm running a tensorflow code on an Intel Xeon machine with 2 physical CPU each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, I run the code keeping the system monitor open and I notice that just a small fraction of these 32 vCores are used and that the average CPU usage is below 10%.
I'm quite the tensorflow beginner and I haven't configured the session in any way. My question is: should I somehow tell tensorflow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else? (for example, slow access to the hard disk)

TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is the slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
A single session.run() call takes little time. So, you end up going back and forth between python and the execution engine.
You can find useful suggestions here
Use the timeline to see what is executed when

Related

Is it possible to use system memory instead of GPU memory for processing Dask tasks

We have been running DASK clusters on Kubernetes for some time. Up to now, we have been using CPUs for processing and, of course, system memory for storing our Dataframe of around 1,5 TB (per DASK cluster, split onto 960 workers). Now we want to update our algorithm to take advantage of GPUs. But it seems like the available memory on GPUs is not going to be enough for our needs, it will be a limiting factor(with our current setup, we are using more than 1GB of memory per virtual core).
I was wondering if it is possible to use GPUs (thinking about NVDIA, AMD cards with PCIe connections and their own VRAMS, not integrated GPUs that use system memory) for processing and system memory (not GPU memory/VRAM) for storing DASK Dataframes. I mean, is it technically possible? Have you ever tried something like this? Can I schedule a kubernetes pod such that it uses GPU cores and system memory together?
Another thing is, even if it was possible to allocate the system RAM as VRAM of GPU, is there a limitation to the size of this allocatable system RAM?
Note 1. I know that using system RAM with GPU (if it was possible) will create an unnecessary traffic through PCIe bus, and will result in a degraded performance, but I would still need to test this configuration with real data.
Note 2. GPUs are fast because they have many simple cores to perform simple tasks at the same time/in parallel. If an individual GPU core is not superior to an individual CPU core then may be I am chasing the wrong dream? I am already running dask workers on kubernetes which already have access to hundreds of CPU cores. In the end, having a huge number of workers with a part of my data won't mean better performance (increased shuffling). No use infinitely increasing the number of cores.
Note 3. We are mostly manipulating python objects and doing math calculations using calls to .so libraries implemented in C++.
Edit1: DASK-CUDA library seems to support spilling from GPU memory to host memory but spilling is not what I am after.
Edit2: It is very frustrating that most of the components needed to utilize GPUs on Kubernetes are still experimental/beta.
Dask-CUDA: This library is experimental...
NVIDIA device plugin: The NVIDIA device plugin is still considered beta and...
Kubernetes: Kubernetes includes experimental support for managing AMD and NVIDIA GPUs...
I don't think this is possible directly as of today, but it's useful to mention why and reply to some of the points you've raised:
Yes, dask-cuda is what comes to mind first when I think of your use-case. The docs do say it's experimental, but from what I gather, the team has plans to continue to support and improve it. :)
Next, dask-cuda's spilling mechanism was designed that way for a reason -- while doing GPU compute, your biggest bottleneck is data-transfer (as you have also noted), so we want to keep as much data on GPU-memory as possible by design.
I'd encourage you to open a topic on Dask's Discourse forum, where we can reach out to some NVIDIA developers who can help confirm. :)
A sidenote, there are some ongoing discussion around improving how Dask manages GPU resources. That's in its early stages, but we may see cool new features in the coming months!

OpenVINO unable to get optimum performance while running multiple inference engines

I am running multiple python processes( 4 in this case using multiprocessing module) for person detection (using ssd mobilenet model), each having it's own inference engine of OpenVINO. I am getting a very low FPS (not more than 10) for each process. My suspicion is the CPUs are not getting utilized optimally because the number of threads being spawned by each engine are high, which is adding to the overhead and also the sharing of CPUs across processes.
Also for single process, I am getting upto 60fps with OMP_NUM_THREADS set to 4.
My CPU details are:-
2 Sockets
4 cores each socket
1 thread each core
Total - 8 CPUs
So what would be the
Optimal value for OMP_NUM_THREADS in this case?
How can I avoid Sharing of CPUs across each process?
Currently I am playing with OMP_NUM_THREADS and KMP_AFFINITY variables, but just doing a hit and trail on setting the values. Any detail on how to set would be really helpful. Thanks
In case of multiple networks inference you may try to set OMP_WAIT_POLICY to PASSIVE.
BTW, OpenVINO 2019R1 moved from OpenMP to TBB. It might give better efficiency in case of deep learning networks pipeline.
In case if you are using the same model for all the processes consider to use OV multi-stream inference. Using this you can load single network and next to create a multiple infer requests. Using this you will have a better CPU utilization (if compare to running one infer request across multiple cores) and in result better throughput.
To understand how to use multi stream inference take a look on inference_engine/samples/python_samples/benchmark_app/benchmark sample
As well you can use benchmark sample to do a grid search to find an optimal configuration (number of streams, batch size).

How to allocate more CPU and RAM to SUMO (Simulation of Urban Mobility)

I have downloaded and unzipped sumo-win64-0.32.0 and running sumo.exe this on a powerful machine (64GB ram, Xeon CPU E5-1650 v4 3.6GHz) for about 140k trips, 108k edges, and 25k vehicles types which are departed in the first 30 min of simulation. I have noticed that my CPU is utilized only 30% and Memory only 38%, Is there any way to increase the speed by forcing sumo to use more CPU and ram, or possibly run in parallel? From "Can SUMO be run in parallel (on multiple cores or computers)?
The simulation itself always runs on a single core."
it appears that parallel processing is not possible t, but what about dedicating more CPU and ram?
Windows usually shows the CPU utilization such that 100% means all cores are used, so 30% is probably already more than one core and there is no way of increasing that with a single threaded application as sumo. Also if your scenario fits within RAM completely there is no point of increasing that. You might want to try one of the several parallelization approaches SUMO has but none of them got further than some toy examples (and none is in the official distribution) and the speed improvements are sometimes only marginal. Probably the best you can do is to do some profiling and find the performance bottlenecks and/or send your results to the developers.

Programe Execution Optimization

I am writing a Program for Parabolic Time Price Systems based on the book written by J.Welles Wilder Jr.
I am have way through the program, running with an execution time of 122 microsecs. This is way above the benchmark limit. What I was looking for is a few views and tips if I
write a kernel space program to achieve the same. Implementing it through drivers
[Really keen on this method] Is it possible, if yes then how and where I should start looking, passing instructions to a graphic driver to perform the steps and calculation (Read this in a blog somewhere).
Thanks in Advance.
--->Programming on c
What makes GPU very fast is the fact that it can run around 2000~ (depending on the card) threads asynchronously.
If you code can be divided into threads then it might improve your performance to do the calculations on the gpgpu since average CPU speed is 50-100 GFlops and average GPU speed is 1500~ when used correctly.
Also you might want to consider the difficulties of maintaining gpgpu code. I suggest you that if you have an NVidia GPU you should check out 'Managed CUDA' since it contains a debugger and a GPU profiler which makes it possible to work with.
TL;DR: use gpgpu only for async code and preferably use 'managed CUDA' if possible

Slow Parallel programming - MPI, VB.NET and FORTRAN

I'm working on parallelizing a software which simulates transport and flow process in the unsaturated soil zone. The software consists of a VB.NET user interface, and a FORTRAN DLL kernel to do the calculations.
I parallelized the software by using the package MPI.NET in the VB.NET part. When the program is started with a number of processes, all of them but the master process go into a wait function, while the master process takes care of the interaction of the software with the user. When all the data required for the simulation is entered, the master process enters the FORTRAN DLL, and calls the other processes. These jump to the starting point of the function in the DLL, and together all the processes solve a linear system of equations for about 10-20 times (the original partial differential equation is nonlinear, therefore these iterations in order to gain accuracy in the solution). When the solution is computed, all the processes go back to VB.NET, This is done for all the timesteps of the simulation. When all steps are computed, the master process continues with the user interaction, while the other processes go back
into the wait function, until they are called again by the master process.
The thing is that this program runs much slower than the original, sequential version of it. Now there might be a number of reasons for this. I used the PETSc library in the FORTRAN DLL to solve the system of equations, and I think I have configured it quite well. My question is if at some point in the architecture I described there could be a point or two which could cause a significant slowdown if not handled correctly. I'm not sure f.e. if the subsequent calls of DLL function can cost a lot of time.
My system is a Intel Xeon 3470 processor with 8GB RAM. The systems I tried to solve had up to 120.000 unknowns, which I know is at the very lower bound of what should be calculated in parallel, but at least with the 120.000 matrix I would have expected a better performance than I did measure.
Thanks in advance for your thoughts,
Martin
I would say that 120,000 degrees of freedom and 10-20 iterations is not that large a problem. Million degree of freedom problems were done when I did finite element analysis for a living, and that was 16 years ago.
Is it possible to solve it using an in-memory solver, without parallelization, with 8GB of RAM? That would certainly be your benchmark. Is that what you're comparing your parallel results to?
Are the parallel processes running on different processors or different machines? Parallelization doesn't buy you anything if everything is done on a single processor. You have to context switch and time slice processes, and there's overhead associated with MPI to communicate between processes. I would expect a parallel solution on a single processor to run more slowly than a single thread, in-memory solution.
If you have multiple processes, then I'd say it's a matter of tuning. I'd plot performance versus number of parallel processes. If there's a speedup, you should find that it improves with more processes until you reach a saturation point, beyond which the overhead is greater than the benefit.
If you have multiple cores, when you run your program sequentially can you see that only one or a few processor are utilized?
If the load in the sequential case is high and evenly distributed over all cores then IMHO there is no need to parallelize your program.
My system has a Xeon 3470, which is a quadcore processor. So the computations are all done on these 4 on 1 machine. I don't run the program with more than 4 processes of course.The old solver that the software had was sequential of course, and that still runs faster than the parallel version. When I plot number of processes against runtime, I see that runtime even increases a little bit with smaller models - but that is to be expected because of the communication overhead.
In both the sequential and the parallel case all 4 processors are utilized, and the load balance between them is acceptable.
Like I said, I know that the models I've tested so far are not ideal to talk about parallel performance. I was just wondering if besides the communication overhead due to MPI there could still be another point that could lead to the slowdown of the program.