How do rasterizers do so many computations per second, but shaders can't even come close?

I am playing around on Shadertoy and I kept running into a hard loop count of about 12,000 iterations, so I decided to check just how many calculations per frame it could do without dropping frames. Sure enough, the shaders don't seem to be able to do anything more than about 12,000 calculations per frame without the frame rate dropping. This seems odd, because I had thought that shaders run directly on the GPU, which regularly does far more work (something like 100 calculations for each of 150K polygons!) under rasterization APIs like OpenGL and Vulkan. So essentially my question is: how can I send calculations directly through the GPU, the way a rasterizer does, to get that kind of speedy data crunching?
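As a point of reference, a rough back-of-the-envelope count of what that per-pixel loop already implies, assuming (my numbers, not the question's) a 1920x1080 fullscreen Shadertoy pass at 60 fps:

```c
#include <stdio.h>

/* Rough throughput estimate. The resolution and frame rate are assumptions
 * for illustration; only the 12,000-iteration loop count comes from the
 * question above. */
int main(void) {
    double pixels = 1920.0 * 1080.0;       /* assumed fullscreen 1080p pass  */
    double iterations_per_pixel = 12000.0; /* loop count from the question   */
    double frames_per_second = 60.0;       /* assumed target frame rate      */

    double iterations_per_second =
        pixels * iterations_per_pixel * frames_per_second;
    printf("Approx. loop iterations per second: %.3e\n", iterations_per_second);
    return 0;   /* prints roughly 1.5e12 iterations per second */
}
```

In other words, because the fragment shader's loop runs once per pixel, even a 12,000-iteration cap already corresponds to on the order of a trillion iterations per second under these assumptions.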

Related

OxyPlot: IsValidPoint on a realtime LineSeries

I've been using OxyPlot for a month now and I'm pretty happy with what it delivers. I'm getting data from an oscilloscope and, after some fast processing, I'm plotting it in real time to a graph.
However, if I compare my application's CPU usage to the one provided by the oscilloscope manufacturer, I'm loading the CPU a lot more. Maybe they're using some GPU-based plotter, but I think I can reduce my CPU usage with some modifications.
I'm capturing 10,000 samples per second and adding them to a LineSeries. I'm not plotting all that data; I'm decimating it to a constant number of points, let's say 80 points for a 20-second measurement, so I have 4 points/sec while totally zoomed out and a bit more detail if I zoom in to a specific range.
With the aid of ReSharper, I've noticed that the application is calling the IsValidPoint method a huge number of times (something like 400,000,000 calls across my 6 different plots), which is taking a lot of time.
I think the problem is that, when I add new points to the series, it checks every point for validity, instead of only the newly added values.
Also, it spends a lot of time in the MeasureText/DrawText methods.
My question is: is there a way to override those methods and adapt them to my needs? I'm adding 10,000 new values each second, but the first ones remain the same, so there's no need to re-validate them. Also, the text shown doesn't change.
Thank you in advance for any advice you can give me. Have a good day!
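To make the idea concrete, here is a minimal sketch of the "only validate what was just appended" approach. It is plain C for illustration only (the question itself concerns OxyPlot/C#), and the struct and function names are made up:

```c
#include <math.h>
#include <stddef.h>

/* Hypothetical illustration: remember how many points have already been
 * validated and only check the samples appended since the last update,
 * instead of re-checking the whole series on every refresh. */
typedef struct {
    const double *values;   /* backing data of the series            */
    size_t count;           /* total points currently in the series  */
    size_t validated;       /* points already known to be valid      */
} series_t;

/* Returns how many of the newly appended points are valid (finite),
 * which is roughly the kind of check a validity test has to make. */
size_t validate_new_points(series_t *s) {
    size_t ok = 0;
    for (size_t i = s->validated; i < s->count; ++i)
        if (isfinite(s->values[i]))
            ++ok;
    s->validated = s->count;    /* next call starts after these points */
    return ok;
}
```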

An example: Am I understanding GPU advantage correctly?

Just reading a bit about what the advantage of a GPU is, and I want to verify I understand on a practical level. Let's say I have 10,000 arrays, each containing a billion simple equations to run. On a CPU it would need to go through every single equation, one at a time, but with a GPU I could run all 10,000 arrays as 10,000 different threads, all at the same time, so it would finish a ton faster... Is this example spot on, or have I misunderstood something?
I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least that the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically built to handle a much larger number of similar calculations at once on data that has been set up for it in advance.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: for running a single task on one piece of data, the CPU will normally be faster, because a single CPU core is generally much faster than a single GPU core. However, a GPU has many more cores, so for running a single task across many pieces of data (where the task has to run once per element), the GPU will usually be faster. These are data-driven situations, though, so each case should be assessed individually to determine which processor to use and how to use it.
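To make the "one task, many pieces of data" case concrete, here is a minimal C sketch of the kind of loop that maps well onto a GPU. The function name and the choice of SAXPY as the example are mine, not from the answer above:

```c
#include <stddef.h>

/* The "same operation applied to many independent elements" pattern.
 * On a CPU this loop runs (mostly) one iteration after another; a GPU
 * would hand each element to its own thread and process huge batches
 * in parallel, because no iteration depends on any other. */
void saxpy(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];   /* every iteration is independent */
}
```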

How to measure GPU bandwidth in C code

I have a question about how to measure the bandwidth of a GPU. I have tried some different ways, but none of them work. For example, I tried to use the amount of data transferred divided by the time used to calculate the bandwidth. However, since the GPU can switch which warps are currently executing, the amount of data transferred varies during execution. I wonder whether you could give me some advice about how to do this. That would be really appreciated.
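For what it's worth, the usual definition is effective bandwidth = bytes moved / elapsed time. The sketch below times a plain host-side memcpy just to illustrate the arithmetic; it is not GPU code, and the buffer size is an arbitrary choice. For a device you would time the kernel or transfer the same way (for example with the API's event timers) and count the bytes it reads and writes:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int main(void) {
    size_t bytes = 256u * 1024u * 1024u;   /* 256 MiB, arbitrary test size */
    char *src = malloc(bytes);
    char *dst = malloc(bytes);
    if (!src || !dst)
        return 1;
    memset(src, 1, bytes);                 /* touch the pages before timing */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, bytes);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    /* memcpy reads and writes every byte once, so count 2 * bytes moved. */
    printf("Effective bandwidth: %.2f GB/s\n", 2.0 * bytes / seconds / 1e9);

    free(src);
    free(dst);
    return 0;
}
```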

JNI/JOCL kernel optimization

I have a kernel running in OpenCL (via a JOCL front end) that is running horribly slow compared to the other kernels, and I'm trying to figure out why and how to accelerate it. This kernel is very basic: its sole job is to decimate the number of sample points we have. It copies every Nth point from the input array to a smaller output array to shrink our array size.
The kernel is passed a float specifying how many points to skip between 'good' points. So if it is passed 1.5, it will skip one point, then two, then one, etc., to keep an average of 1.5 points being skipped. The input array is already on the GPU (it was generated by an earlier kernel) and the output array will stay on the GPU, so there is no expense to transfer data to or from the CPU.
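For reference, the general shape of such a kernel looks something like the sketch below. This is not the original code; it is an OpenCL C illustration with made-up names, and it interprets the float as the spacing between kept samples. The point is the strided (non-coalesced) read on the input:

```c
/* Hypothetical OpenCL C sketch of a strided decimation kernel.
 * Each work-item writes one output sample, reading from a strided
 * position in the input; neighbouring work-items therefore read
 * memory locations that are 'step' apart, which is not coalesced. */
__kernel void decimate(__global const float *input,
                       __global float *output,
                       const float step)   /* spacing between kept samples */
{
    size_t gid = get_global_id(0);
    size_t src = (size_t)(gid * step);     /* strided, non-coalesced read */
    output[gid] = input[src];
}
```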
This kernel is running 3-5 times slower than any of the other kernels, and as much as 20 times slower than some of the fast kernels. I realize that I'm paying a penalty for not coalescing my array accesses, but I can't believe that it would cause me to run this horribly slow. After all, every other kernel is touching every sample in the array; I would think touching every Xth sample in the array, even if not coalesced, should be at least around the same speed as touching every sample.
The original kernel actually decimated two arrays at once, for real and imaginary data. I tried splitting the kernel up into two kernel calls, one to decimate the real data and one to decimate the imaginary data, but this didn't help at all. Likewise, I tried 'unrolling' the kernel by having one thread be responsible for decimating 3-4 points, but this didn't help either. I've tried varying the amount of data passed into each kernel call (i.e., one kernel call on many thousands of data points, or a few kernel calls on a smaller number of data points), which has allowed me to tweak out small performance gains, but not the order of magnitude I need for this kernel to be considered worth implementing on the GPU.
Just to give a sense of scale, this kernel is taking 98 ms to run per iteration, while the FFT takes only 32 ms for the same input array size and every other kernel is taking 5 ms or less. What else could cause such a simple kernel to run so absurdly slow compared to the rest of the kernels we're running? Is it possible that I actually can't optimize this kernel sufficiently to warrant running it on the GPU? I don't need this kernel to run faster than the CPU, just not this much slower, so I can keep all the processing on the GPU.
It turns out the issue isn't with the kernel at all. Instead, the problem is that when I try to release the buffer I was decimating, it causes the entire program to stall while the kernel (and all other kernels in the queue) complete. This appears to be functioning incorrectly; as far as I understand, clReleaseMemObject should only decrement a reference counter, not block on the queue. However, the important point is that my kernel is running as efficiently as it should be.

How to periodically update a LabVIEW chart when collecting multi-channel data at a high rate

Looking for some help with a LabVIEW data collection program. I collect 2 ms of data at 8 kHz (which gives 16 data points) per channel; I am collecting data on 4 analog channels with a National Instruments data acquisition board. The DAQmx collection task gives a 1D array of 4 waveforms.
If I don't display the data, all my computation takes about 2 ms, and it is OK if the processing loop lags a little behind the collection loop. Updating the chart on LabVIEW's front panel, however, introduces an unacceptable delay. We don't need to update the display very quickly; 5-10 Hz would probably be sufficient, but I don't know how to set this up.
My current LabVIEW VI has three parallel loops:
A timed-loop for data collection
A loop for analysis and processing
A low priority loop for caching data to disk as a TDMS file
Data is passed from the collection loop to the other loops using a queue. The LabVIEW examples gave me some ideas, but I am stuck.
Any suggestions, references, ideas would be appreciated.
Thanks
Azim
Follow Up Question
eaolson suggests that I re-sample the data for display purposes. The data coming from the DAQmx read is a one-dimensional array of waveforms, so I would need to somehow build or concatenate the waveform data for each channel and then re-sample it before updating the front panel chart. I suppose the best approach would be to queue the data, and in a display loop dequeue the stack, build and re-sample the data based on screen resolution, and then update the chart. Would there be any other approach? I will look on the
[NI LabVIEW Forum](http://forums.ni.com/ni/board?board.id=170) for more information, as suggested by eaolson.
Updates
changed acceptable update rate for graphs to 5-10Hz (thanks Underflow and eaolson)
disk cache loop is a low priority one (thanks eaolson)
Thanks for all the responses.
Your overall architecture description sounds solid, but... getting to 30Hz for any non-trivial graph is going to be challenging. Make sure you really need that rate before trying to make it happen. Optimizing to that level might take some time.
References that should be helpful:
You can defer panel updates. This keeps the front panel from refreshing until you're ready for it to do so, allowing you to buffer data in the background, and only draw it occasionally.
You should know about (a)synchronous display. This option allows some control over display rates.
There is some general advice available about speeding execution.
There is a (somewhat dated) report on execution speed on the LAVA forums. Googling around the LAVA forums is a great idea if you need to optimize your speed.
Television updates at about 30 Hz; much beyond that the human eye can't really follow. 30 Hz should be the maximum update rate you consider for a display, not the starting point. Consider an update rate of 5-10 Hz.
LabVIEW charts append the most recent data to the historical data they store and display all the data at once. At 8 kHz, you're acquiring at least 8000 data points per channel per second. That means the array backing that graph has to continuously be resized to hold the new data. Also, even if your graph is 1000 pixels across, that means you're displaying 8 data points per screen pixel. There's not usually any reason to display any more than one data point per pixel. If you really need fast update rates, plot less data. Create an array to hold the historical data and plot only every Nth data point, where N is chosen so you're plotting, say, only a few hundred points.
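A hedged sketch of that last suggestion, written in C for illustration only (a LabVIEW VI would do this with array/decimate functions rather than text code); the names and the target of a few hundred points are placeholders:

```c
#include <stddef.h>

/* Keep only every Nth point of the history for display, choosing N so
 * that roughly target_points values are plotted. The caller must size
 * 'display' for at least (n_history + n - 1) / n elements. */
size_t thin_for_display(const double *history, size_t n_history,
                        double *display, size_t target_points) {
    size_t n = (target_points > 0) ? n_history / target_points : 1;
    if (n == 0)
        n = 1;                       /* fewer samples than target: keep all */

    size_t count = 0;
    for (size_t i = 0; i < n_history; i += n)
        display[count++] = history[i];
    return count;                    /* number of points actually plotted */
}
```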
Remember that your loops can run at different rates. It may be satisfactory to run the write-to-disk loop at a much lower frequency than the data collection rate, maybe every couple of seconds.
Avoid property nodes if you can. They run in the UI thread, which is slower than most other execution.
Other than that, it's really hard to offer a lot of substantial advice without seeing code or more specifics. Consider also asking your question at the NI LabVIEW forums. There are a lot of helpful people there.