How to optimize MATLAB loops?

How to optimize MATLAB loops? - optimization

I have been working lately on a number of iterative algorithms in MATLAB, and been getting hit hard by MATLAB's performance (or lack thereof) when it comes to loops. I'm aware of the benefit of vectorizing code when possible, but are there any tools for optimization when you need the loop for your algorithm?
I am aware of the MEX-file option to write small subroutines in C/C++, although given my algorithms, this can be a very painful option given the data structures required. I mainly use MATLAB for the simplicity and speed of prototyping, so a syntactically complex, statically typed language is not ideal for my situation.
Are there any other suggestions? Even other languages (python?) which have relatively painless matrix tools are an option.

It was once true that vectorization would improve the speed of your MATLAB code. However, that is largely no longer true with the JIT-accelerator
This video demonstrating the MATLAB profiler might help.

PROFILER is very useful tool to find bottlenecks in Matlab code. it does not change your code of course, but helps to find which functions/lines to optimize with vectorization or mex.
http://www.mathworks.com/access/helpdesk/help/techdoc/ref/profile.html
http://www.mathworks.com/access/helpdesk/help/techdoc/matlab_env/f9-17018.html

If you have a choice, be sure to set up your loops so you scan the data column-wise which is how the data in MATLAB are arranged. In addition, be sure to preallocate any output arrays before the loop and index into them instead of growing the array inside the for-loop.

If you can cast your code so your operations are called on the whole matrix then you will see great improvement in the speed of your code. Many functions are much quicker when operating on the whole matrix rather than in an element-wise fashion with loops.

You might want to investigate MATLAB's Parallel Computing Toolbox which can make a big difference if you have the right hardware. I re-wrote about 12 lines of code and got 4 - 6 times speedup for one of our loop-intensive programs on and eight core PC.

Related

How to code fastest for loop ? vectors, iterators, basic ? what is the best method?

Before I wasn't considering any speed issue during 3D graphics programming, but now I have to consider it seriously for real-time applications. In real-time, as well as offline I have to use a lots of loops, I used to write algorithms with basic for loop like this..
for (i=0; i<nb; i++){}
So when I rum my algorithm with above for loop well you can also consider nested loops too, I have to wait for 5-10 seconds to finish it, but when I see any professional software they have much faster way for the same algorithm, just in eye blink time the algorithm is finished. Not professional, but the simple example of MeshLab.
So what the meshlab is using, is iterators like below.
std::vector<int> pStorage;
vector<int>::iterator it;
for (it = pStorage.begin(); it!=pStorage.end(); ++it){}
So, my question is, how and which for loop should I use for my algorithms for speeding up my application. Any suggestion?
I have been doing research on this topic since days, I found iterators are faster, but I have tried it, used it, I found no difference in speed, why ?
3D graphics programming usually heavy for processing, But I have to find a fastest way to make my loops faster.
Thanks.

Does VB.NET offer any performance improvement over VB6 for CPU-bound processes?

I'm working on a mathematical model written in VB6. The amount of CPU time this model is consuming is becoming a concern to some of our customers and the notion has been floated that porting it to VB.NET will improve its performance.
The model is performing a lot of single-precision arithmetic (a finite-difference scheme over a large grid) with small bursts of database access every five seconds or so (not enough to be important). Only basic arithmetic functions with occasional use of the ^ 4 operator are involved.
Does anyone think porting to VB.NET is likely to improve matters (or not)? Does anyone know any reliable articles or papers I can check over to help with this decision?

My opinion is that VB.Net won't improve performance by far. The improvement is given by your ability to make an optimized algorith.

Probably the best performance boost you can get is eliminating the DB access (even if it doesnt look important I/O usually is the bottleneck, not the language itself). If possible get the data upfront and save it at the end instead of accesing every 5 secs.
Also as other pointed out, change the algorithm if possible since porting the code to .NET probably will only get you small performance benefits.
But if you change it to .NET 4.0 maybe you can use the parallel extensions and really get a boost by using multiple cores. http://msdn.microsoft.com/en-us/library/dd460693.aspx , but that also means, changing the algorithm
Hope it helps. ;-)

I think that improvements in memory management improve performance in VB.NET

To provide you with a correct answer we should check your code...
But certainly VB.NET in theory should be more performant:
possibility to compile on 64 bit machines (without too much effort)
VB6 was interpreted, VB.NET is almost compiled
You can use threads (depends on your algorithm) and other "tricks", so you can use more CPUs to make calculations in parallel
Best thing to try: port the most CPU consuming part of your application to VB.NET and compare.

The same algorithm will perform faster in VB6, because it is compiled in native language.
If the program has extensive memory allocation, it may perform faster in a .NET when running in a 64 bit environment though.

What's the difference between code written for a desktop machine and a supercomputer?

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
My impression is that it has to do with taking advantage of certain libraries (e.g., BLAS, LAPLACK) where possible, revising algorithms (reducing iteration), profiling, parallelizing, considering memory-hard disk-processor use/access... I am aware of the adage, "want to optimize your code? don't do it", but if one were interested in learning about writing efficient code, what references might be available?
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).

Profiling is a must at any level of machinery. In common usage, I've found that scaling to larger and larger grids requires a better understanding of the grid software and the topology of the grid. In that sense, everything you learn about optimizing for one machine is still applicable, but understanding the grid software gets you additional mileage. Hadoop is one of the most popular and widespread grid systems, so learning about the scheduler options, interfaces (APIs and web interfaces), and other aspects of usage will help. Although you may not use Hadoop for a given supercomputer, it is one of the less painful methods for learning about distributed computing. For parallel computing, you may pursue MPI and other systems.
Additionally, learning to parallelize code on a single machine, across multiple cores or processors, is something you can begin learning on a desktop machine.
Recommendations:
Learn to optimize code on a single machine:
Learn profiling
Learn to use optimized libraries (after profiling: so that you see the speedup)
Be sure you know algorithms and data structures very well (*)
Learn to do embarrassingly parallel programming on multiple core machines.
Later: consider multithreaded programming. It's harder and may not pay off for your problem.
Learn about basic grid software for distributed processing
Learn about tools for parallel processing on a grid
Learn to program for alternative hardware, e.g. GPUs, various specialized computing systems.
This is language agnostic. I have had to learn the same sequence in multiple languages and multiple HPC systems. At each step, take a simpler route to learn some of the infrastructure and tools; e.g. learn multicore before multithreaded, distributed before parallel, so that you can see what fits for the hardware and problem, and what doesn't.
Some of the steps may be reordered depending on local computing practices, established codebases, and mentors. If you have a large GPU or MPI library in place, then, by all means, learn that rather than foist Hadoop onto your collaborators.
(*) The reason to know algorithms very well is that as soon as your code is running on a grid, others will see it. When it is hogging up the system, they will want to know what you're doing. If you are running a process that is polynomial and should be constant, you may find yourself mocked. Others with more domain expertise may help you find good approximations for NP-hard problems, but you should know that the concept exists.

Parallelization would be the key.
Since the problems you cited (e.g. CFD, multiphysics, mass transfer) are generally expressed as large-scale linear algebra problems, you need matrix routines that parallelize well. MPI is a standard for those types of problems.
Physics can influence as well. For example, it's possible to solve some elliptical problems efficiently using explicit dynamics and artificial mass and damping matricies.
3D multiphysics means coupled differential equations with varying time scales. You'll want a fine mesh to resolve details in both space and time, so the number of degrees of freedom will rise rapidly; time steps will be governed by the stability requirements of your problem.
If someone ever figures out how to run linear algebra as a map-reduce problem they'll have it knocked.

Hypothetically speaking, if my scientific work was leading toward the development of functions/modules/subroutines (on a desktop), what would I need to know to incorporate it into a large-scale simulation to be run on a supercomputer (which might simulate molecules, fluids, reactions, and so on)?
First, you would need to understand the problem. Not all problems can be solved in parallel (and I'm using the term parallel in as wide meaning as it can get). So, see how the problem is now solved. Can it be solved with some other method quicker. Can it be divided in independent parts ... and so on ...
Fortran is the language specialized for scientific computing, and during the recent years, along with the development of new language features, there has also been some very interesting development in terms of features that are aiming for this "market". The term "co-arrays" could be an interesting read.
But for now, I would suggest reading first into a book like Using OpenMP - OpenMP is a simpler model but the book (fortran examples inside) explains nicely the fundamentals. Message parsing interface (for friends, MPI :) is a larger model, and one of often used. Your next step from OpenMP should probably go in this direction. Books on the MPI programming are not rare.
You mentioned also libraries - yes, some of those you mentioned are widely used. Others are also available. A person who does not know exactly where the problem in performance lies should IMHO never try to undertake the task of trying to rewrite library routines.
Also there are books on parallel algorithms, you might want to check out.
I think this question is language agnostic, but since many number-crunching packages for biomolecular simulation, climate modeling, etc. are written in some version of Fortran, this language would probably be my target of interest (and I have programmed rather extensively in Fortran 77).
In short it comes down to understanding the problem, learning where the problem in performance is, re-solving the whole problem again with a different approach, iterating a few times, and by that time you'll already know what you're doing and where you're stuck.

We're in a position similar to yours.
I'm most in agreement with #Iterator's answer, but I think there's more to say.
First of all, I believe in "profiling" by the random-pausing method, because I'm not really interested in measuring things (It's easy enough to do that) but in pinpointing code that is causing time waste, so I can fix it. It's like the difference between a floodlight and a laser.
For one example, we use LAPACK and BLAS. Now, in taking my stack samples, I saw a lot of the samples were in the routine that compares characters. This was called from a general routine that multiplies and scales matrices, and that was called from our code. The matrix-manipulating routine, in order to be flexible, has character arguments that tell it things like, if a matrix is lower-triangular or whatever. In fact, if the matrices are not very large, the routine can spend more than 50% of its time just classifying the problem. Of course, the next time it is called from the same place, it does the same thing all over again. In a case like that, a special routine should be written. When it is optimized by the compiler, it will be as fast as it reasonably can be, and will save all that classifying time.
For another example, we use a variety of ODE solvers. These are optimized to the nth degree of course. They work by calling user-provided routines to calculate derivatives and possibly a jacobian matrix. If those user-provided routines don't actually do much, samples will indeed show the program counter in the ODE solver itself. However, if the user-provided routines do much more, samples will find the lower end of the stack in those routines mostly, because they take longer, while the ODE code takes roughly the same time. So, optimization should be concentrated in the user-provided routines, not the ODE code.
Once you've done several of the kind of optimization that is pinpointed by stack sampling, which can speed things up by 1-2 orders of magnitude, then by all means exploit parallelism, MPI, etc. if the problem allows it.

What's the best language for physics modeling?

I've been out of the modeling biz, so to speak, for a while now. When I was in college, most of the models I worked with were written in FORTRAN, which I never liked. I'm looking to get back into science, so I'm wondering if there are modern languages with feature sets suited for this kind of application. What would you consider to be an optimal language for simulating complex physics systems?

While certainly Fortran was the absolute ruler for this, Python is being used more and more exactly for this purpose. While it is very hard to say which is the BEST program for this, I've found python pretty useful for physics simulations and physics education.

It depends on the task
C++ is good at complicated data structures, but it is bad at slicing and multiply matrices. (This task equires you to spend a lot of time writing for loops.)
FORTRAN has a nice notation for slicing and multiplying matrices, but it is clumsy for creating complicated data structure such as graphs and linked lists.
Python/scipy has a nice notation for everything, but python is an interepreted language, so it is slow at certain tasks.
Some people are interested in languages like CUDA that allow you to use your GPU to speed up your simulations.
In the molecular dynamics community c++ seems to be popular, because you need somewhat complicated data structures to represent the shapes of the molecules.

I think it's arguable that FORTRAN is still dominant when it comes to solving large-scale problems in physics, as long as we're talking about serial calculations.
I know that parallelization is changing the game. I'm less certain about whether or not parallelized versions of LINPACK and other linear algebra packages are still written in FORTRAN.
A lot of engineers are using MATLAB and Mathematica these days, because they combine numerical and graphics capabilities.
I'd also point out that there's a difference between calculation and display engines. The former might still be written in FORTRAN, but the latter may be using more modern languages and OpenGL.
I'm also unsure about how much modeling has crept into biology. Physical chemistry might be a very different animal altogether.
If you write a terrific parallel linear algebra package in Scala or F# or Haskell that performs well, the world will beat a path to your door.

Python + Matplotlib + NumPy + α

The nuclear/particle/high energy physics community has moved heavily toward c++ (in part due to ROOT and Geant4), with some interest in Python (because it has ROOT bindings).
But you'll note that this is sub-discipline dependent..."physics" and "modeling" are big, broad topics, so there is no one answer.

Modelica is a specialized language for modeling (and simulating) physical systems. OpenModelica is an open source implementation of Modelica.

Python is very popular among science-oriented people, as is Matlab. The issue with these is that they are both VERY slow (to run). If you want to do large simulations that may take minutes/hours/days, you're going to have to pick another language.
As long as you are picking a language for speed, suck it up and use C/C++, maybe with CUDA depending on your needs.
Final thought though: if it takes you two days longer to write and debug your model in C than in python, and the resulting code takes 10 minutes to run instead of an hour, have your really saved any time?

There's also a lot of capability with MATLAB. Especially when interfacing your simulations with hardware, or if you need your results visualised.

I'll chime in with Python but you should also look to R for any statistical work you may need to do. You should really be asking more about what packages for which languages to use rather than the language itself.

How fast is VB .net compared to native code for arithmetic?

I need to write software that will do a lot of math. Mostly it will be matrix multiplication with integers to compute DCT. How much faster should I expect the code to run in native c as compared to VB .Net? Factor of 2, factor of 10, factor of 1000...? Has someone tried and collected statistics on this?

.Net code is JIT-compiled to native code before execution, so it should not be slower than native code in general. I'd expect a factor < 10.
Moreover, adaptive optimization techniques profile the code as it runs, gaining more information than a typical static compiler. So, the JIT can make more informed decisions for further optimizations

The .NET code is compiled into native code by the JIT compiler, so you get native code in both cases.
The difference is that the C code has somewhat less overhead around the calculations, so you should perhaps expect a performace difference of factor 2.

VB is 93.7% as fast as C. If you pick the right scenario.
Actually, if your 'native C' includes regular calls to malloc() and free(), any kind of Gargage Collected language like VB.Net is going to literally run circles around it. GC can be 10x faster than mallocs in your inner loops.
If you break down and use C, try to reuse structures that you declared just once instead of making new ones, to avoid this problem. This may be of benefit even in VB if your solution lends itself to it. However it will be harder to program and GC is very fast.
As far as bounds/overflow checks, if speed is important and testing has revealed they don't happen, and you're not risking life or millions from an error or abend, they are a waste of time. But if you can't get rid of them, your time is likely still more valuable in a language with which you can program more quickly.
If you expect serious size and usage, it pays to split the task with a controlling program and store the allocated 'task definitions' into a shared directory with a file per task solver, or a database. Then you can run a solver per processor (2 per HT CPU), or network computers. Be weary of queue structures - it's tough to atomicly 'Mark-Taken-And-Get-Data-If-Not-Taken'. You know how many task solvers you're going to start. I did this with an imaging utility I develop, it was much easier than expected, and it creamed the previous version. Plus if you use multiple processes with a properly dividable problem domain, you avoid the slight-to-significant programming burden of multithreading. Or convincing your coworkers that your culrly braces are in the right place. Peace.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas