FMU execution using GPU

I'm currently running simulations using FMI.
Because of performance issues, I'm looking for a way to execute the FMI on a GPU.
Is there a solution for executing FMI directly on a GPU (at a high level)?
Regards

Related

How to speed up tensorflow-gpu while running CUDA code simultaneously

I only have one GPU (GTX 1070, 8 GB VRAM) and I would like to use tensorflow-gpu and another CUDA program simultaneously on that same GPU.
However, running the CUDA code and tensorflow-gpu at the same time makes tensorflow-gpu roughly twice as slow.
Is there any way to speed things up when tensorflow-gpu and CUDA code are used together?
A slightly longer version of @talonmies' comment:
GPUs are awesome, but they still have finite resources. Any competently-built application that uses the GPU will do its best to saturate the device, leaving few resources for other applications. In fact, one of the goals and challenges of optimizing GPU code - whether it be a shader, CUDA or CL kernel - is making sure that all CUs are used as efficiently as possible.
Assuming that TF is already doing that: when you run another GPU-heavy application, you're sharing a resource that's already running full-tilt. So, things slow down.
Some options are:
Get a second, or faster, GPU.
Optimize your CUDA kernels to reduce requirements and simplify your TF stuff. While this is always important to keep in mind when developing for GPGPU, it's unlikely to help with your current problem.
Don't run these things at the same time. This may turn out to be slightly faster than this quasi time-slicing situation that you currently have.

OpenVINO unable to get optimum performance while running multiple inference engines

I am running multiple Python processes (4 in this case, using the multiprocessing module) for person detection (using an SSD MobileNet model), each with its own OpenVINO inference engine. I am getting a very low FPS (no more than 10) for each process. My suspicion is that the CPUs are not being utilized optimally: each engine spawns a high number of threads, which adds overhead, and the CPUs are shared across processes.
For a single process, I get up to 60 FPS with OMP_NUM_THREADS set to 4.
My CPU details are:-
2 Sockets
4 cores each socket
1 thread each core
Total - 8 CPUs
So:
What would be the optimal value for OMP_NUM_THREADS in this case?
How can I avoid sharing CPUs across processes?
Currently I am playing with the OMP_NUM_THREADS and KMP_AFFINITY variables, but I'm just setting the values by trial and error. Any detail on how to set them would be really helpful. Thanks
When running inference on multiple networks, you may try setting OMP_WAIT_POLICY to PASSIVE.
BTW, OpenVINO 2019R1 moved from OpenMP to TBB. It might be more efficient for a pipeline of deep learning networks.
If you are using the same model in all the processes, consider using OV multi-stream inference. With it you load a single network and then create multiple infer requests. This gives better CPU utilization (compared to running one infer request across multiple cores) and, as a result, better throughput.
To understand how to use multi-stream inference, take a look at the inference_engine/samples/python_samples/benchmark_app/benchmark sample.
You can also use the benchmark sample to do a grid search for an optimal configuration (number of streams, batch size).
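On the question of keeping processes off each other's CPUs, one possible approach is to give every worker its own disjoint set of cores and a matching thread count. The following is only a minimal sketch, not an OpenVINO API: partition_cpus and worker are hypothetical helper names, os.sched_setaffinity is Linux-only, and the sketch assumes that consecutive CPU ids sit on the same socket (check with lscpu, since numbering varies by machine). It also assumes the OpenMP build of the engine, which typically reads OMP_NUM_THREADS once at start-up.

```python
import os


def partition_cpus(n_cpus, n_workers):
    """Split CPU ids 0..n_cpus-1 into n_workers contiguous, disjoint groups."""
    per_worker = n_cpus // n_workers
    if per_worker == 0:
        raise ValueError("more workers than CPUs")
    return [list(range(i * per_worker, (i + 1) * per_worker))
            for i in range(n_workers)]


def worker(cpu_ids):
    """Hypothetical worker entry point: limit threads, then pin to cpu_ids."""
    # Set the thread count before importing any OpenMP-based library,
    # because the runtime usually reads the variable only once.
    os.environ["OMP_NUM_THREADS"] = str(len(cpu_ids))
    if hasattr(os, "sched_setaffinity"):  # available on Linux only
        os.sched_setaffinity(0, cpu_ids)  # 0 means "the calling process"
    # Assumption: the model would be loaded and the detection loop run here.
```

With the 8-CPU machine from the question, partition_cpus(8, 4) yields four groups of two cores, and each group can be passed to worker through multiprocessing.Process(target=worker, args=(group,)). Whether two threads per process beats fewer, larger processes is something only a measurement on the actual machine can tell.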

Google-colaboratory: No backend with GPU available

Here it is described how to use gpu with google-colaboratory:
Simply select "GPU" in the Accelerator drop-down in Notebook Settings (either through the Edit menu or the command palette at cmd/ctrl-shift-P).
However, when I select gpu in Notebook Settings I get a popup saying:
Failed to assign a backend
No backend with GPU available. Would you like to use a runtime with no accelerator?
When I run:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
    raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))
Of course, I get GPU device not found. It seems the description is incomplete. Any ideas what needs to be done?
You need to configure the Notebook with GPU device
Click Edit->notebook settings->hardware accelerator->GPU
You'll need to try again later when a GPU is available. The message indicates that all available GPUs are in use.
The FAQ provides additional info:
How may I use GPUs and why are they sometimes unavailable?
Colaboratory is intended for interactive use. Long-running background
computations, particularly on GPUs, may be stopped. Please do not use
Colaboratory for cryptocurrency mining. Doing so is unsupported and
may result in service unavailability. We encourage users who wish to
run continuous or long-running computations through Colaboratory’s UI
to use a local runtime.
There seems to be a cooldown on continuous training with GPUs. So, if you encounter the error dialog, try again later, and perhaps try to limit long-term training in subsequent sessions.
My reputation is just slightly too low to comment, but here's a bit of additional info for @Bob Smith's answer regarding the cooldown period.
There seems to be a cooldown on continuous training with GPUs. So, if you encounter the error dialog, try again later, and perhaps try to limit long-term training in subsequent sessions.
Based on my own recent experience, I believe Colab will allocate you at most 12 hours of GPU usage, after which there is roughly an 8 hour cool-down period before you can use compute resources again. In my case, I could not connect to an instance even without a GPU. I'm not entirely sure about this next bit but I think if you run say 3 instances at once, your 12 hours are depleted 3 times as fast. I don't know after what period of time the 12 hour limit resets, but I'd guess maybe a day.
Anyway, I'm still missing a few details, but the main takeaway is that if you exceed your limit, you'll be locked out from connecting to an instance for 8 hours (which is a great pain if you're actively working on something).
After Reset runtime didn't work, I did:
Runtime -> Reset all runtimes -> Yes
I then got a happy:
Found GPU at: /device:GPU:0
This is the precise answer to your question man.
According to a post from Colab:
overall usage limits, as well as idle timeout periods, maximum VM
lifetime, GPU types available, and other factors, vary over time.
GPUs and TPUs are sometimes prioritized for users who use Colab
interactively rather than for long-running computations, or for users
who have recently used less resources in Colab. As a result, users who
use Colab for long-running computations, or users who have recently
used more resources in Colab, are more likely to run into usage limits
and have their access to GPUs and TPUs temporarily restricted. Users
with high computational needs may be interested in using Colab’s UI
with a local runtime running on their own hardware.
Google Colab uses TensorFlow 2.0 by default. Change it to TensorFlow 1 by adding this line:
%tensorflow_version 1.x
Run it before any Keras or TensorFlow code.

Keras using GPU vs CPU

Are there any differences from a programming point of view (syntax, functions or anything else) in Keras with the TensorFlow back end between working on "Keras GPU" and "Keras CPU"? I mean, if a program can run on GPU-enabled Keras, will the same program run on Keras CPU (efficiency doesn't matter)?
GPU code running on a CPU? Sure, it is basic multithreading. And you needed to avoid the race conditions anyway.
For most intents a GPU is just a giant load of really weak CPUs, which allows highly effective multithreading (basically a display-height times display-width core).
The other way (running procedural CPU code in a massively parallel GPU environment) is where the work lies.
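One way to check this for a specific program is to run the unchanged script with the GPU hidden from it. The sketch below assumes a CUDA-based TensorFlow back end, where setting the CUDA_VISIBLE_DEVICES environment variable to -1 is a commonly used way to hide all CUDA devices from a process. cpu_only_env is a hypothetical helper name, not part of Keras or TensorFlow.

```python
import os


def cpu_only_env(base_env=None):
    """Return a copy of the environment in which CUDA devices are hidden."""
    env = dict(os.environ if base_env is None else base_env)
    # "-1" matches no real device id, so a CUDA runtime started with this
    # environment should see no GPUs and the program falls back to the CPU.
    env["CUDA_VISIBLE_DEVICES"] = "-1"
    return env
```

The result can be passed as the env argument of subprocess.run when launching the training script (for example a hypothetical train.py), so the same code is exercised on the CPU path without editing it.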

Getting the most of the GPU in an Embedded Platform

My platform is Ubuntu running on an Exynos 4412 CPU, which has the Mali-400 GPU. I would like to do some computer vision using OpenCV and OpenGL, and I'm also going to write some fragment shaders. My question is: what is the fastest way to copy contents from the GPU to the CPU? Doing it with glReadPixels is really slow on my platform. Is it beneficial to run glReadPixels in its own thread, or to use OpenMP? Suggestions are welcome please :).
The Exynos 4412 doesn't have separate CPU and GPU memory at the hardware level; it's all the same RAM and physically accessible by both. Thus, there is likely to be some way to access the GPU's portion of the memory directly from the CPU.