An example: Am I understanding GPU advantage correctly? - gpu

Just reading a bit about what the advantage of GPU is, and I want to verify I understand on a practical level. Lets say I have 10,000 arrays each containing a billion simple equations to run. On a cpu it would need to go through every single equation, 1 at a time, but with a GPU I could run all 10,000 arrays as as 10,000 different threads, all at the same time, so it would finish a ton faster...is this example spot on or have I misunderstood something?

I wouldn't call it spot on, but I think you're headed in the right direction. Mainly, a GPU is optimized for graphics-related calculations. This does not, however, mean that's all it is capable of.
Without knowing how much detail you want me to go into here, I can say at the very least the concept of running things in parallel is relevant. The GPU is very good at performing many tasks simultaneously in one go (known as running in parallel). CPUs can do this too, but the GPU is specifically optimized to handle much larger numbers of specific calculations with preset data.
For example, to render every pixel on your screen requires a calculation, and the GPU will attempt to do as many of these calculations as it can all at the same time. The more powerful the GPU, the more of these it can handle at once and the faster its clock speed. The end result is a higher-end GPU can run your OS and games in 4k resolution, whereas other cards (or integrated graphics) might only be able to handle 1080p or less.
There's a lot more to this as well, but I figured you weren't looking for the insanely technical explanation.
The bottom line is this: For running a single task on one piece of data, the CPU will normally be faster. A single CPU core is generally much faster than a single GPU core. However, they typically have many cores and for running a single task on many pieces of data (so you have to run it once for each), the GPU will usually be faster. But these are data-driven situations, and as such each situation should be assessed on an individual basis to determine which to use and how to use it.

Related

How can computational requirements be compared?

Calculating the solution to an optimization problem takes a 2 GHz CPU one hour. During this process there are no background processes, no RAM is being used and the CPU is at 100% capacity.
Based on this information, can it be derived that a 1 GHz CPU will take two hours to solve the same problem?
A quick search of IPC, frequence, and chip architecture will show you this topic has been breached many times. There are many things that can determine the execution speed of a program (without even going into threading at all) the main ones that pop to mind:
Instruction set - If one chip has an instruction for multiplication, than a*b is atomic. If not, you will need a lot of atomic instructions to perform such an action - big difference in speed, which can prove to make even higher frequency chips slower.
Cycles per second - this is the frequency of the chip.
Instructions per cycle (IPC) - what you are really interested is IPC*frequency, not just frequency. How many atomic actions can you can perform in a second. After the amount of atomic actions (see 1), on a single threaded application this might act as you expect (x2 this => x2 faster program), though no guarantees.
and there are a ton of other nuance technologies that can affect this, like branch prediction which hit the news big time recently. For a complete understanding a book/course might be a better resource.
So, in general, no. If you are comparing two single core, same architecture chips (unlikely), then maybe yes.

Is faster code also more power efficient?

Assume I have a CPU running at a constant rate, pulling an equal amount of energy per instruction. I also have two functionally identical programs, which result in the same output, except one has been optimized to execute only 100 instructions, while the other program executes 200 instructions. Is the 100 instruction program necessarily faster than the 200 instruction program? Does a program with fewer instructions draw less power than a program with more instructions?
Things are much more complex than this.
For example execution speed is in many cases dominated by memory. As a practical example some code could process the pixels of an image first in rows and then in columns... a different code instead could be more complex but processing rows and columns at the same time.
The second version could execute more instructions because of more complex housekeeping of the data but I wouldn't be surprised if it was faster because of how memory is organized: reading an image one column at a time is going to "trash the cache" and it's very possible that despite being simple the code working that way could be a LOT slower than the more complex one doing the processing in a memory-friendly way. The simpler code may end up being "stalled" a lot waiting for the cache lines to be filled or flushed to the external memory.
This is just an example, but in reality what happens inside a CPU when code is executed is for many powerful processors today a very very complex process: instructions are exploded in micro-instructions, registers are renamed, there is speculative execution of parts of code depending on what branch predictors guess even before the program counter really reaches a certain instruction and so on. Today the only way to know for sure if something is faster or slower is in many cases just trying with real data and measure.
Is the 100 instruction program necessarily faster than the 200 instruction program?
No. Firstly, on some architectures (such as x86) different instructions can take a different number of cycles. Secondly, there are effects — such cache misses, page faults and branch mispreditictions — that complicate the picture further.
From this it follows that the answer to your headline question is "not necessarily".
Further reading.
I found a paper from 2017 comparing the energy usage, speed, and memory consumption of various programming languages. There is an obvious positive correlation between faster languages also using less energy.

Could a GPU speed up comparing every pixel between two images?

I've implemented the game where the user must spot 5 differences in two side by side images, and I've made the image comparison engine to find the different regions first. The performance is pretty good (4-10 ms to compare 800x600), but I'm aware GPUs have so much power.
My question is could a performance gain be realized by using all those cores (just to compare each pixel once)... at the cost of copying the images in. My hunch says it may be worthwhile, but my understanding of GPUs is foggy.
Yes, implementing this process to run on the GPU can result in much faster processing time. The amount of performance increase you get is, as you allude to, related to the size of the images you use. The bigger the images, the faster the GPU will complete the process compared to the CPU.
In the case of processing just two images, with dimensions of 800 x 600, the GPU will still be faster. Relatively, that is a very small amount of memory and can be written to the GPU memory quickly.
The algorithm of performing this process on the GPU is not overly complicated, but assuming a person had no experience of writing code for the graphics card, the cost of learning how to code a GPU is potentially not worth the result of having this algorithm implemented on a GPU. If however, the goal was to learn GPU programming, this could be a good early exercise. I would recommend, to first learn gpu programming, which will take some time and should start with even simpler exercises.

MATLAB parallel computing setup

I have a quad core computer; and I use the parallel computing toolbox.
I set different number for the "worker" number in the parallel computing setting, for example 2,4,8..............
However, no matter what I set, the AVERAGE cpu usage by MATLAB is exactly 25% of total CPU usage; and None of the cores run at 100% (All are around 10%-30%). I am using MATLAB to run optimization problem, so I really want my quad core computer using all its power to do the computing. Please help
Setting a number of workers (up to 4 on a quad-core) is not enough. You also need to use a command like parfor to signal to Matlab what part of the calculation should be distributed among the workers.
I am curious about what kind of optimization you're running. Normally, optimization problems are very difficult to parallelize, since the result of every iteration depends on the previous one. However, if you want to e.g. try and fit multiple models to the data, or if you have to fit multiple data sets, then you can easily run these in parallel as opposed to sequentially.
Note that having many cores may not be sufficient in terms of resources - if performing the optimization on one worker uses k GB of RAM, performing it on n workers requires at least n*k GB of RAM.

Scattered-write speed versus scattered-read speed on modern Intel or AMD CPUs?

I'm thinking of optimizing a program via taking a linear array and writing each element to a arbitrary location (random-like from the perspective of the CPU) in another array. I am only doing simple writes and not reading the elements back.
I understand that a scatted read for a classical CPU can be quite slow as each access will cause a cache miss and thus a processor wait. But I was thinking that a scattered write could technically be fast because the processor isn't waiting for a result, thus it may not have to wait for the transaction to complete.
I am unfortunately unfamiliar with all the details of the classical CPU memory architecture and thus there may be some complications that may cause this also to be quite slow.
Has anyone tried this?
(I should say that I am trying to invert a problem I have. I currently have an linear array from which I am read arbitrary values -- a scattered read -- and it is incredibly slow because of all the cache misses. My thoughts are that I can invert this operation into a scattered write for a significant speed benefit.)
In general you pay a high penalty for scattered writes to addresses which are not already in cache, since you have to load and store an entire cache line for each write, hence FSB and DRAM bandwidth requirements will be much higher than for sequential writes. And of course you'll incur a cache miss on every write (a couple of hundred cycles typically on modern CPUs), and there will be no help from any automatic prefetch mechanism.
I must admit, this sounds kind of hardcore. But I take the risk and answer anyway.
Is it possible to divide the input array into pages, and read/scan each page multiple times. Every pass through the page, you only process (or output) the data that belongs in a limited amount of pages. This way you only get cache-misses at the start of each input page loop.