Can a capture card be used to move results off the GPU more efficiently than loading them into RAM?

I have been working on GPU simulations for some time now. One of the biggest bottlenecks in some of my simulations is the transfer of intermediate results off the GPU.
I have never worked with capture cards before, but as I read about them I wondered whether they could be used to get intermediate results off the GPU more efficiently than loading them into RAM and then storing them.
As I understand it, capture cards receive data via HDMI/DisplayPort, and the amount of data that can be sent is limited by that medium.
I can imagine that a capture card could save a lot of CPU instructions and memory loads. What are your thoughts on this idea of using a capture card to dump intermediate results from simulations on the GPU?
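For context, this is the standard path I am comparing against -- a minimal CUDA sketch of "compute on the GPU, copy into RAM, write to disk". The simulation_step kernel, the buffer size, and the output file are just placeholders, not my real code:

```
// Minimal sketch of the usual path: compute on the GPU, copy the intermediate
// result into host RAM, then write it to disk. simulation_step is a
// hypothetical kernel standing in for the real simulation.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void simulation_step(float *state, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) state[i] *= 1.0001f;   // placeholder update
}

int main() {
    const int n = 1 << 20;
    float *d_state, *h_state;
    cudaMalloc(&d_state, n * sizeof(float));
    cudaMallocHost(&h_state, n * sizeof(float));   // pinned memory speeds up the copy
    cudaMemset(d_state, 0, n * sizeof(float));

    FILE *f = fopen("results.bin", "wb");
    for (int step = 0; step < 100; ++step) {
        simulation_step<<<(n + 255) / 256, 256>>>(d_state, n);
        // This device-to-host transfer is the bottleneck in question.
        cudaMemcpy(h_state, d_state, n * sizeof(float), cudaMemcpyDeviceToHost);
        fwrite(h_state, sizeof(float), n, f);
    }
    fclose(f);
    cudaFreeHost(h_state);
    cudaFree(d_state);
    return 0;
}
```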

Related

Optimizing TensorFlow for a 32-core computer

I'm running TensorFlow code on an Intel Xeon machine with 2 physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.
I'm quite the TensorFlow beginner and I haven't configured the session in any way. My question is: should I somehow tell TensorFlow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else (for example, slow access to the hard disk)?
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is a slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
Each session.run() call may do only a small amount of work, so you end up spending much of the time going back and forth between Python and the execution engine rather than computing.
The TensorFlow performance documentation has further useful suggestions.
Use the timeline tool to see what is executed when.

How to prevent CPU usage from changing timing in LabVIEW?

I'm trying to write a program in which, every 1 ms, a number is incremented by one and replaces the old number (something like a chronometer!).
The problem is that whenever CPU usage increases because of other programs running on the PC, this 1 ms interval also increases and the timing in my program changes.
Is there any way to prevent CPU load changes from affecting the timing in my program?
It sounds as though you are trying to generate an analogue output waveform with a digital-to-analogue converter card using software timing, where your software is responsible for determining what value should be output at any given time and updating the output accordingly.
This is OK for stationary or low-speed signals but you are trying to do it at 1 ms intervals, in other words to output 1000 samples per second or 1 ks/s. You cannot do this reliably on a desktop operating system - there are too many other processes going on which can use CPU time and block your program from running for many milliseconds (or even seconds, e.g. for network access).
Here are a few ways you could solve this:
Use buffered, hardware-clocked output if your analogue output device supports it. Instead of writing one sample at a time, you send the device a waveform or array of samples and it outputs them at regular intervals using a timing signal generated in hardware. Unfortunately, low-end DAQ devices often don't support hardware-clocked output.
Instead of expecting the loop that writes your samples to the analogue output to run every millisecond, read LabVIEW's Tick Count (ms) value inside the loop and use it as an index into your array of samples: rather than trying to output every sample in turn, your code now asks 'what time is it now, and therefore what should the output be?' (the idea is sketched in code after this list). That won't give you a perfect output signal, but at least it should keep the correct frequency rather than being 'slowed down'; instead you will see glitches imposed on the signal whenever the loop can't keep up. This is easy to test and may well be adequate for your needs.
Use a real-time operating system instead of a desktop OS. In the case of LabVIEW this would mean using the Real-Time software module and either a National Instruments hardware device that supports RT, such as the CompactRIO series, or installing the RT OS on a dedicated PC if the hardware is compatible. This is not a cheap option, obviously (unless it's strictly for personal, home use). In any case you would need to have an RT-compatible driver for your output device.
Use your computer's sound output as the output device. LabVIEW has functions for buffered sound output and you should be able to get reliable results. You'll need to upsample your signal to one of the sound output's available sample rates, probably 44.1 ks/s. The drawbacks are that the output level is limited in range and is not calibrated, and will probably be AC-coupled so you can't output a DC or very low-frequency signal. However if the level is OK for what you want to connect it to, or you can add suitable signal conditioning, this could be a neat solution. If you need the output level to be calibrated you could simultaneously measure it with your DAQ card and scale the sound waveform you're outputting to keep it correct.
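LabVIEW code is graphical, so as a rough illustration of the second option above, here is the 'index by time' idea written out in a C-style sketch; the write_analog_output function is a hypothetical stand-in for your DAQ or LabVIEW output call, and here it just prints the value:

```
// Sketch of the "index by time" approach: instead of assuming the loop runs
// every 1 ms, ask what time it is and output the sample that belongs to now.
#include <chrono>
#include <cmath>
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the real DAQ write call.
void write_analog_output(double value) { std::printf("%f\n", value); }

int main() {
    const int n = 1000;   // 1000 samples per waveform period at a nominal 1 kS/s
    const double two_pi = 6.283185307179586;
    std::vector<double> waveform(n);
    for (int i = 0; i < n; ++i)
        waveform[i] = std::sin(two_pi * i / n);

    auto start = std::chrono::steady_clock::now();
    for (;;) {
        auto elapsed = std::chrono::steady_clock::now() - start;
        long ms = std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
        // However fast or slow the loop actually runs, output the sample that
        // belongs to the current time rather than simply "the next" sample.
        write_analog_output(waveform[ms % n]);
    }
}
```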
The answer to your question is "not on a desktop computer." This is why products like LabVIEW Real-Time and dedicated deterministic hardware exist: you need a computer dedicated to a particular process in order to serve that process consistently. Every application on a regular Windows/Mac/Linux desktop system has the problem you are seeing of potentially being interrupted by other system processes, particularly in its UI layer.
There is no way to prevent CPU load changes from affecting the timing in your program unless the computer has a real-time clock.
If it doesn't have a real-time clock, there is no reason to expect it to behave deterministically. Do you need your program to run at that pace?

Using a GPU both as video card and GPGPU

Where I work, we do a lot of numerical computations and we are considering buying workstations with NVIDIA video cards because of CUDA (to work with TensorFlow and Theano).
My question is: should these computers come with another video card to handle the display and leave the NVIDIA card free for GPGPU work?
I would appreciate it if anyone has hard data on using a video card for display and GPGPU at the same time.
Having been through this, I'll add my two cents.
It is helpful to have a dedicated card for computations, but it is definitely not necessary.
I have used a development workstation with a single high-end GPU for both display and compute. I have also used workstations with multiple GPUs, as well as headless compute servers.
My experience is that doing compute on the display GPU is fine as long as demands on the display are typical for software engineering. In a Linux setup with a couple of monitors, web browsers, text editors, etc., the display uses about 200 MB of the card's 6 GB -- only about 3% overhead. You might see the display stutter a bit during a web page refresh or something like that, but the throughput demands of the display are very small.
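If you want to check this on your own card, one quick way (assuming the CUDA toolkit is installed) is to query free vs. total device memory before your compute process allocates anything; the difference is roughly what the display and driver are already holding. nvidia-smi reports much the same thing, broken down per process.

```
// Print how much of the card's memory is already in use (display, driver, etc.)
// before any compute allocations are made.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    std::printf("in use: %.0f MB of %.0f MB total\n",
                (total_bytes - free_bytes) / 1e6, total_bytes / 1e6);
    return 0;
}
```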
One technical issue worth noting for completeness is that the NVIDIA driver, GPU firmware, or OS may have a timeout for kernel completion on the display GPU (run NVIDIA's 'deviceQueryDrv' to see the driver's "run time limit on kernels" setting). In my experience (on Linux), with machine learning, this has never been a problem since the timeout is several seconds and, even with custom kernels, synchronization across multiprocessors constrains how much you can stuff into a single kernel launch. I would expect the typical runs of the pre-baked ops in TensorFlow to be two or more orders of magnitude below this limit.
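For reference, here is a small sketch of how you could query that same setting from your own code rather than running deviceQueryDrv:

```
// Check whether the watchdog / run-time limit on kernels is enabled for device 0.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    std::printf("%s: run time limit on kernels: %s\n",
                prop.name, prop.kernelExecTimeoutEnabled ? "Yes" : "No");
    return 0;
}
```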
That said, there are some big advantages to having multiple compute-capable cards in a workstation (whether or not one is used for display). Of course there is the potential for more throughput (if your software can use it). The main advantage in my experience, however, is being able to run long experiments while concurrently developing new ones.
It is of course feasible to start with one card and then add one later, but make sure your motherboard has lots of room and your power supply can handle the load. If you decide to have two cards, with one being a low-end card dedicated to display, I would specifically advise against having the low-end card be a CUDA-capable card lest it get selected as a default for computation.
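If you do end up with mixed cards, a simple precaution is to select the compute device explicitly instead of trusting the default. A minimal sketch follows; picking the card with the most memory is just one possible heuristic, and setting CUDA_VISIBLE_DEVICES is another way to hide the display card entirely.

```
// Explicitly select the compute GPU instead of relying on device 0 being the
// "right" one. Here the card with the most memory is assumed to be the compute card.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    int best = 0;
    size_t best_mem = 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        std::printf("device %d: %s, %.0f MB\n", d, prop.name, prop.totalGlobalMem / 1e6);
        if (prop.totalGlobalMem > best_mem) { best_mem = prop.totalGlobalMem; best = d; }
    }
    cudaSetDevice(best);   // all subsequent allocations and kernels go to this card
    return 0;
}
```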
Hope that helps.
In my experience it is awkward to share a GPU card between numerical computation tasks and driving a video monitor. For example, there is limited memory available on any GPU, which is often the limiting factor in the size of a model you can train. Unless you're doing gaming, a fairly modest GPU is probably adequate to drive the video. But for serious ML work you will probably want a high-performance card. Where I work (Google) we typically put two GPUs in desk-side machines when one is to be used for numerical computation.

Which is best for implementing large matrix addition or matrix multiplication: hybrid CPU-GPU, GPU only, or CPU only?

Suppose a matrix addition application is implemented as a hybrid CPU-GPU program (in CUDA, using pthreads, where each thread performs a partial matrix addition either on the host CPU or on the GPU). For instance, if the matrix size is 1000, the first 500 elements are computed by the host CPU and the rest by the GPU, so the computation is split between the two. Is this better than CPU-only or GPU-only computation?
Please help me understand this concept.
Is there a profiling tool that can help compare the performance of these three approaches? I'm new to CUDA, so any help/guidance would be appreciated.
Thank you!
The problem with CPU-GPU hybrid computation, when you need the result back on the CPU, is the latency between the two. From starting a computation on the GPU to getting the results back on the CPU there can easily be several milliseconds of delay, so the amount of work done on the GPU should be significant, or you need a significant amount of CPU work to do between launching the GPU computation and collecting its results. A 1000-element matrix addition is a tiny amount of work, so you would be better off performing the entire computation on the CPU. You also have the overhead of transferring the data back and forth between the CPU and GPU across the PCIe bus, which adds to the cost; computations that need only a small amount of data transferred between the two lean more towards a hybrid solution.
If you never need to read the result back from the GPU to the CPU, you avoid the latency issue. For example, you could run an N-body simulation on the GPU and also do the visualization on the GPU, so the result never needs to reach the CPU. But the moment you need the simulation result back on the CPU, you have to deal with the latency.
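To make the overhead concrete, here is a minimal sketch of the hybrid split described in the question: the first half of the addition on the CPU, the second half on the GPU. For n = 1000 the cudaMemcpy calls and the kernel launch will dominate the total time; timing the two halves separately, or profiling the run with nvprof, makes that visible.

```
// Hybrid addition: the CPU adds the first half of the elements, the GPU the
// second half. For n = 1000 the transfers and launch cost far more than the
// arithmetic, which is the point made above.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add_kernel(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1000, half = n / 2;
    float a[n], b[n], c[n];
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = 2.0f * i; }

    // CPU part: first half.
    for (int i = 0; i < half; ++i) c[i] = a[i] + b[i];

    // GPU part: second half (copy in, launch, copy out).
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, half * sizeof(float));
    cudaMalloc(&d_b, half * sizeof(float));
    cudaMalloc(&d_c, half * sizeof(float));
    cudaMemcpy(d_a, a + half, half * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b + half, half * sizeof(float), cudaMemcpyHostToDevice);
    add_kernel<<<(half + 255) / 256, 256>>>(d_a, d_b, d_c, half);
    cudaMemcpy(c + half, d_c, half * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("c[0] = %f, c[%d] = %f\n", c[0], n - 1, c[n - 1]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```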

Non-graphics benchmarks for GPU

Most benchmarks for GPU performance and load testing are graphics related. Is there any benchmark that is computationally intensive but not graphics related? I am using a
Dell XPS 15 laptop,
NVIDIA GT 525M graphics card,
Ubuntu 11.04 with Bumblebee installed.
I want to load-test my system to find the maximum load the graphics card can handle. Are there any non-graphics benchmarks for GPUs?
What exactly do you want to measure?
To measure GFLOPS on the card, just write a simple kernel in CUDA (or OpenCL).
If you have never written anything in CUDA, let me know and I can post something for you.
If your application is not compute-intensive (take a look at a roofline paper), then I/O will be the bottleneck: getting data from global (card) memory to the processors takes hundreds of cycles.
On the other hand, if your application IS compute-intensive, then just time it and calculate how many operations you perform per second. To hit the maximum GFLOPS (your card can do about 230), you need many FLOPs per memory access, so that the processors stay busy instead of stalling for memory and switching threads.
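As a starting point, here is a minimal sketch of that kind of compute-bound kernel: each thread does many fused multiply-adds on a value it loaded once, and CUDA events time the launch so you can divide total FLOPs by seconds. The array size and iteration count are arbitrary.

```
// Compute-bound micro-benchmark: each thread performs many FMAs on a value it
// loaded once, so arithmetic throughput rather than memory bandwidth is measured.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_kernel(float *data, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = data[i];
    for (int k = 0; k < iters; ++k)
        x = x * 1.000001f + 0.000001f;   // 2 FLOPs per iteration
    data[i] = x;                         // write back so the loop isn't optimized away
}

int main() {
    const int n = 1 << 20, iters = 10000, threads = 256;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    fma_kernel<<<n / threads, threads>>>(d_data, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double flops = 2.0 * (double)n * (double)iters;   // 2 FLOPs per iteration per element
    std::printf("%.1f GFLOPS\n", flops / (ms * 1e-3) / 1e9);

    cudaFree(d_data);
    return 0;
}
```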