Program Execution Optimization - optimization

I am writing a program for the Parabolic Time/Price System, based on the book written by J. Welles Wilder Jr.
I am halfway through the program, and it currently runs with an execution time of 122 microseconds. This is way above the benchmark limit. I am looking for views and tips on the following options:
Writing a kernel-space program to achieve the same result.
Implementing it through drivers.
[Really keen on this method] Passing instructions to a graphics driver to perform the steps and calculations (I read about this in a blog somewhere). Is it possible? If yes, how, and where should I start looking?
Thanks in advance.
---> Programming in C

What makes a GPU very fast is that it can run around 2000 threads (depending on the card) asynchronously.
If your code can be divided into threads, then doing the calculations on the GPGPU might improve your performance, since average CPU speed is 50-100 GFLOPS and average GPU speed is around 1500 GFLOPS when used correctly.
You should also consider the difficulty of maintaining GPGPU code. If you have an NVidia GPU, I suggest you check out 'Managed CUDA', since it contains a debugger and a GPU profiler, which make it practical to work with.
TL;DR: use GPGPU only for async code, and preferably use 'Managed CUDA' if possible.
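Whether the code "can be divided into threads" depends on its data dependencies. As a rough illustration only, here is a simplified sketch of the core Parabolic SAR recurrence for a rising trend. It assumes a fixed acceleration factor and leaves out trend reversal and the acceleration-factor update, so it is not the asker's program or Wilder's full method. The names `sar_step` and `sar_series` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One step of the simplified recurrence: move the stop towards the
 * extreme price by a fraction af (assumed constant in this sketch). */
double sar_step(double sar, double extreme, double af)
{
    return sar + af * (extreme - sar);
}

/* Fill sar[0..n-1] for one price series.
 * Step i needs the value produced by step i-1, so this loop is a serial
 * chain and cannot be spread over many threads. Only independent series
 * (for example, different instruments) could be processed in parallel. */
void sar_series(const double *extreme, double *sar, size_t n,
                double sar0, double af)
{
    assert(extreme != NULL && sar != NULL);
    double cur = sar0;
    for (size_t i = 0; i < n; ++i) {
        sar[i] = cur;
        cur = sar_step(cur, extreme[i], af);
    }
}
```

If the real program processes a single series of this shape, the work per call is a short dependent chain, which is the kind of code the answer says is a poor fit for a GPU.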

Related

Translate OpenCL code for CPU compilation

Sometimes I find myself writing OpenCL kernel code (using pyopencl), even for tasks of moderate computational complexity, because it is easier to develop than a chain of numpy operations (especially if no appropriate numpy function exists).
However, in those cases the transfer overhead/delay between host and device may exceed the time spent on computation.
I was thinking about creating a Python tool that automatically translates the OpenCL code to e.g. Cython code (or similar) which, after compilation for the CPU, can work directly on the underlying memory of the numpy arrays, without the need to copy the data to the device. I know that the CPU is capable of executing OpenCL kernels with appropriate drivers. However, this still has the disadvantage of additional delay due to the to_device operation. A multicore CPU could also exploit the OpenCL programming model for parallel execution. Furthermore, this approach removes the need for special OpenCL drivers and just requires some build tools for C code compilation.
Is that a reasonable idea? I do not want to reinvent the wheel. Any hints for existing frameworks/tools which could achieve my goals are much appreciated.
While converting OpenCL code to parallel CPU-oriented code is probably possible, it is very hard (if not impossible) to generate efficient code.
Indeed, OpenCL encourages (or even forces) programmers to perform big computational steps (kernels) that often read/write a relatively big portion of memory. However, GPU memory bandwidth is generally much higher than that of CPUs (e.g. my Nvidia 1660S has a bandwidth of 336 GB/s, while my i5-9600KF with 2 DDR4 channels reaches about 40 GB/s, although the two had a similar price). OpenCL computing kernels will not be fully optimized for CPUs, whatever low-level transformation is applied to the code. The main problem lies in the OpenCL algorithms themselves as well as the programming model. Rewriting OpenCL kernels as CPU code can often result in more efficient execution if the code is specifically optimized for such a platform. Low-level optimizations include working in caches using data chunks, using register blocking, and using the best SIMD instructions available. High-level optimizations consist in choosing the best algorithm and data structure for the target problem. The best sorting algorithm on a GPU is likely very different from the best one on a CPU. The same applies to other problems like computing a prefix sum, a partition/median, or even string searching. Thus, one should keep in mind that different hardware requires different computing methods/algorithms.
A high-level algorithmic transformation could theoretically result in efficient code, but such a transformation is insanely complex to perform, if it is possible at all. Indeed, there are fundamental theoretical limitations that strongly prevent many generalized advanced code analyses/transformations, from the halting problem to high-level optimization.
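To make the "working in caches using data chunks" point a bit more concrete, here is a minimal C sketch. It assumes two trivial element-wise steps (scale, then add an offset) standing in for two separate OpenCL kernels. The function names and the `CHUNK` size are hypothetical, and the right chunk size depends on the actual cache sizes of the target CPU.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical chunk size: 4096 floats = 16 KiB, picked to fit in a
 * typical L1 data cache. Tune for the real target. */
#define CHUNK 4096

/* Kernel-style version: each step is a full pass over the array, as two
 * separate kernels would be. For large n the data is streamed from
 * main memory twice. */
void two_pass(float *data, size_t n, float scale, float offset)
{
    assert(data != NULL);
    for (size_t i = 0; i < n; ++i)
        data[i] *= scale;
    for (size_t i = 0; i < n; ++i)
        data[i] += offset;
}

/* Chunked version: both steps run on one cache-sized block before moving
 * to the next, so the second step finds its input still in cache. */
void chunked(float *data, size_t n, float scale, float offset)
{
    assert(data != NULL);
    for (size_t start = 0; start < n; start += CHUNK) {
        size_t end = (n - start < CHUNK) ? n : start + CHUNK;
        for (size_t i = start; i < end; ++i)
            data[i] *= scale;
        for (size_t i = start; i < end; ++i)
            data[i] += offset;
    }
}
```

Both functions compute the same result. The difference is only in memory traffic, which is the kind of restructuring a line-by-line translation of the kernels would not produce on its own.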

Optimizing Tensorflow for a 32-cores computer

I'm running TensorFlow code on an Intel Xeon machine with 2 physical CPUs, each with 8 cores and hyperthreading, for a grand total of 32 available virtual cores. However, when I run the code with the system monitor open, I notice that only a small fraction of these 32 vCores is used and that the average CPU usage is below 10%.
I'm quite the tensorflow beginner and I haven't configured the session in any way. My question is: should I somehow tell tensorflow how many cores it can use? Or should I assume that it is already trying to use all of them but there is a bottleneck somewhere else? (for example, slow access to the hard disk)
TensorFlow will attempt to use all available CPU resources by default. You don't need to configure anything for it. There can be many reasons why you might be seeing low CPU usage. Here are some possibilities:
The most common case, as you point out, is a slow input pipeline.
Your graph might be mostly linear, i.e. a long narrow chain of operations on relatively small amounts of data, each depending on outputs of the previous one. When a single operation is running on smallish inputs, there is little benefit in parallelizing it.
You can also be limited by the memory bandwidth.
A single session.run() call takes little time, so you end up going back and forth between Python and the execution engine.
You can find useful suggestions here
Use the timeline to see what is executed when

Could a GPU be Programmed and/or Modified to Carry Out CPU Instructions?

I was wondering if a GPU could behave like a CPU if modified or programmed to do so. If there is a way, I would also like to know how that could be done. The reason why is, well, sometimes I do that kind of stuff as experiments, just for fun. Plus, if it isn't a big hassle, then it would be much better than buying an expensive processor just to get better performance. I usually don't need my GPU, only because I use my computer for the simplest of things. My other computer, that's a slightly different story (because I use it for video playback), but you get the idea.
Yes, it's called GPGPU (general purpose GPU), and with it you could program some CPU-like workloads on your GPU using languages like CUDA or OpenCL.
Of course, this method doesn't work well with every workload: the CPU is still much better at single-threaded, hard-to-parallelize code, or code with complicated control flow (thanks to branch predictors) or memory locality (thanks to better caching and prefetching). GPGPUs are mostly better at performing very straightforward, highly parallel, vectorizable code.
In fact, this method of computation gained enough traction to create new lines of products (such as Xeon Phi, formerly Larrabee) and to enhance existing GPUs (e.g. Tesla/Fermi and others).
EDIT
Having reread your question: if you mean running an actual CPU ISA on such a GPGPU, not just some general CPU task, then the best bet is the Xeon Phi mentioned above. It's intended to be based on the same ISA as the CPU (it's the only x86 GPGPU I know of).

Slow Parallel programming - MPI, VB.NET and FORTRAN

I'm working on parallelizing a software which simulates transport and flow process in the unsaturated soil zone. The software consists of a VB.NET user interface, and a FORTRAN DLL kernel to do the calculations.
I parallelized the software by using the package MPI.NET in the VB.NET part. When the program is started with a number of processes, all of them but the master process go into a wait function, while the master process takes care of the interaction of the software with the user. When all the data required for the simulation is entered, the master process enters the FORTRAN DLL and calls the other processes. These jump to the starting point of the function in the DLL, and together all the processes solve a linear system of equations about 10-20 times (the original partial differential equation is nonlinear, hence these iterations to gain accuracy in the solution). When the solution is computed, all the processes go back to VB.NET. This is done for all the timesteps of the simulation. When all steps are computed, the master process continues with the user interaction, while the other processes go back into the wait function until they are called again by the master process.
The thing is that this program runs much slower than the original, sequential version of it. Now there might be a number of reasons for this. I used the PETSc library in the FORTRAN DLL to solve the system of equations, and I think I have configured it quite well. My question is whether there could be a point or two in the architecture I described that could cause a significant slowdown if not handled correctly. I'm not sure, for example, whether the subsequent calls of the DLL function can cost a lot of time.
My system is an Intel Xeon 3470 processor with 8GB RAM. The systems I tried to solve had up to 120,000 unknowns, which I know is at the very lower bound of what should be calculated in parallel, but at least with the 120,000 matrix I would have expected better performance than I measured.
Thanks in advance for your thoughts,
Martin
I would say that 120,000 degrees of freedom and 10-20 iterations is not that large a problem. Million-degree-of-freedom problems were being solved when I did finite element analysis for a living, and that was 16 years ago.
Is it possible to solve it using an in-memory solver, without parallelization, with 8GB of RAM? That would certainly be your benchmark. Is that what you're comparing your parallel results to?
Are the parallel processes running on different processors or different machines? Parallelization doesn't buy you anything if everything is done on a single processor. You have to context switch and time slice processes, and there's overhead associated with MPI to communicate between processes. I would expect a parallel solution on a single processor to run more slowly than a single thread, in-memory solution.
If you have multiple processes, then I'd say it's a matter of tuning. I'd plot performance versus number of parallel processes. If there's a speedup, you should find that it improves with more processes until you reach a saturation point, beyond which the overhead is greater than the benefit.
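The "performance versus number of parallel processes" comparison can be reduced to a few numbers before plotting. Below is a minimal C sketch, assuming you have already measured one wall-clock runtime per process count. The helper names are hypothetical, and the timings in the usage note are made up for illustration rather than taken from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Speedup of a parallel run relative to the sequential baseline. */
double speedup(double t_seq, double t_par)
{
    assert(t_par > 0.0);
    return t_seq / t_par;
}

/* Parallel efficiency: speedup divided by the number of processes.
 * 1.0 is ideal scaling; values far below 1.0 mean overhead dominates. */
double efficiency(double t_seq, double t_par, int procs)
{
    assert(procs > 0);
    return speedup(t_seq, t_par) / procs;
}

/* Given runtime[i] measured with i+1 processes, return the process
 * count with the lowest runtime, i.e. the saturation point. */
int best_process_count(const double *runtime, size_t n)
{
    assert(runtime != NULL && n > 0);
    size_t best = 0;
    for (size_t i = 1; i < n; ++i)
        if (runtime[i] < runtime[best])
            best = i;
    return (int)best + 1;
}
```

For example, with hypothetical runtimes of 10.0, 6.0, 5.0 and 5.5 seconds for 1 to 4 processes, `best_process_count` returns 3 and the efficiency at 3 processes is about 0.67. If the speedup is below 1 at every process count, as described in the question, the overhead outweighs the benefit for that problem size.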
If you have multiple cores, can you see that only one or a few processors are utilized when you run your program sequentially?
If the load in the sequential case is high and evenly distributed over all cores then IMHO there is no need to parallelize your program.
My system has a Xeon 3470, which is a quad-core processor. So the computations are all done on these 4 cores on 1 machine. I don't run the program with more than 4 processes, of course. The old solver that the software had was sequential, and that still runs faster than the parallel version. When I plot number of processes against runtime, I see that runtime even increases a little bit with smaller models, but that is to be expected because of the communication overhead.
In both the sequential and the parallel case all 4 processors are utilized, and the load balance between them is acceptable.
Like I said, I know that the models I've tested so far are not ideal to talk about parallel performance. I was just wondering if besides the communication overhead due to MPI there could still be another point that could lead to the slowdown of the program.

Is "the optimized delay" a myth or is it real?

From time to time you hear stories that are meant to illustrate how good someone is at something, and sometimes you hear about the guy who is so into code optimization that he optimizes his delay loop.
This really sounds like a strange thing to do, since it's much better to start a "timer interrupt" than to use an optimized busy wait,
and nobody ever tells you the name of the optimizing hacker.
That has left me wondering whether it is an urban myth or real.
What do you say, reality or fiction?
Thanks
Johan
Update: It sounds like ShuggyCoUk was on to something,
I wonder if we can find an example.
Update: Just a little clarification: this question is about the "delay" function itself and how it is implemented, not how and where you call it.
And what its purpose was, and how the system became better.
Update: It's no myth, those guys seem to exist
Thanks
ShuggyCoUk
This has more than a kernel of truth about it...
Spin wait can be much better than a signal based interrupt or a yield.
You trade some throughput for much reduced latency.
Often this is vitally important within an OS itself.
You allow yourself the freedom to do operations not possible within an interrupt handler, memory allocation for example.
You can get considerably finer grained control of the interval waited since you can essentially measure the cycle count.
However spin waits are tricky to get right.
If you can, you should use proper idle instructions, for these reasons:
They can power down parts of the core, improving power usage/heat dissipation and even allowing other cores to go faster.
On Hyper-Threading CPUs, they allow the other logical thread to use the full CPU pipeline while you spin.
An instruction you might think is a no-op could be executed out of order by the CPU's superscalar execution units. The resulting code may suffer unforeseen out-of-order artefacts, which force the CPU to spend a great deal of effort on unwanted stalls and memory barriers.
This is why you let someone else write the spin wait loop for you in most cases.
In Linux there is the cpu_relax macro
on arm this is barrier()
on x86 this is rep_nop()
In Windows there is YieldProcessor
Accessible in .Net via Thread.SpinWait
OS X eschews providing a standard implementation unless you are in the kernel
see this document and note that it encourages the use only of lck_spin_t
As to some citations of using PAUSE for spin waits:
PostgreSQL
Linux
See also the note that this is better on non P4 as well due to reducing power
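For illustration, here is a minimal user-space sketch in C11 of a bounded spin wait that issues the x86 PAUSE hint through the `_mm_pause()` intrinsic. It is only a sketch under stated assumptions, not how any of the systems listed above implement it: the names `cpu_relax_hint` and `spin_wait` are hypothetical, and on non-x86 targets the hint is simply left empty here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
/* PAUSE tells the core this is a spin loop, which saves power and
 * frees pipeline resources for the sibling hardware thread. */
static inline void cpu_relax_hint(void) { _mm_pause(); }
#else
/* Assumption: no hint is issued on other architectures in this sketch. */
static inline void cpu_relax_hint(void) { }
#endif

/* Spin until *flag becomes true or max_spins iterations have passed.
 * Returns true if the flag was observed set. The bound keeps a caller
 * from spinning forever; real code would usually fall back to blocking. */
bool spin_wait(atomic_bool *flag, unsigned long max_spins)
{
    assert(flag != NULL);
    for (unsigned long i = 0; i < max_spins; ++i) {
        if (atomic_load_explicit(flag, memory_order_acquire))
            return true;
        cpu_relax_hint();
    }
    return atomic_load_explicit(flag, memory_order_acquire);
}
```

Another thread would set the flag with `atomic_store_explicit(flag, true, memory_order_release)`. In line with the advice above, prefer the platform's own primitive (`cpu_relax`, `YieldProcessor`, `Thread.SpinWait`) where one exists.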
The version I've always heard is of a group of hardware programmers who developed a special instruction that optimised the idle (not busy) loop of their operating system. This is mentioned in Kernighan & Pike's book The Practice Of Programming, but even there they admit it may be an Urban Myth.
I've heard stories of programmers who intentionally put in long delay loops early in projects and removed them later as "optimizations" to impress management. Never figured out if the stories were apocryphal or not.