Here is my problem: I have a fortran code with a certain amount of nested loops and first I wanted to know if it's possible to optimize (rearranging) them in order to get a time gain? Second I wonder if I could use OpenMP to optimize them?
I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.
Here are two very similar examples of loops that I have, first:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)
end do
end do
end do
end do
end do
and the other one is:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)
end do
end do
end do
end do
end do
The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).
I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.
From the initial comments it may appear that at least in some cases you may be using more memory than the available RAM, which means you may be using the swap file, with all the bad consequences on the performances. To fix this, you have to either install more RAM if possible, or deeply reorganize your code to not store the full B array (by far the largest one) at once (again, if possible).
Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses (all the more given the the size of B). Changing the loop order if possible is a way to go.
Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):
!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
end do
end do
end do
!!$OMP END PARALLEL DO
What I did:
moved the loops on p and q from outer to inner positions (it's not always as easy than it is here)
replaced them with array syntax (no performance gain to expect, just a code easier to read)
Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for the performances. The code is even ready for OpenMP multithreading on the (now) outer loop.
EDIT/Hint
Fortran stores the arrays in "column-major order", that is when incrementing the first index one accesses contiguous elements in memory. In C the arrays are stored in "row-major order", that is when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indeces (and the opposite in C).
It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.
For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.
Working from the bottom up, you start with a simple sum of two matrices.
!
! a_ij := a_ij + beta * b_ij
!
pure subroutine apb(A,B,beta)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:)
real(dp), intent(in) :: beta
A = A + beta*B
end subroutine
(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)
One step higher is a tensor contraction
!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:,:,:)
real(dp), intent(in) :: C(:,:)
integer :: k, l
do l = 1, size(B,4)
do k = 1, size(B,3)
call apb( A, B(:,:,k,l), C(k,l) )
end do
end do
end subroutine
Note the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped (i.e. c_lk instead of c_kl.)
Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:
!
! A_n := A_n + B1_cd C_cdn + B2_kl D_kln
!
! The elements of A_n are a_ijn
! The elements of B1_cd are B1_ijcd
! The elements of B2_kl are B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
real(dp), intent(inout), contiguous :: A(:,:,:)
real(dp), intent(in) :: B1(:,:,:,:)
real(dp), intent(in) :: B2(:,:,:,:)
real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)
real(dp), pointer :: p_C(:,:,:) => null()
real(dp), pointer :: p_D(:,:,:) => null()
integer :: k
integer :: nc, nd
nc = size(B1,3)*size(B1,4)
nd = size(B2,3)*size(B2,4)
if (nc /= size(C,1)) then
error stop "FATAL ERROR: Dimension mismatch between B1 and C"
end if
if (nd /= size(D,1)) then
error stop "FATAL ERROR: Dimension mismatch between B2 and D"
end if
! Pointer remapping of arrays C and D to rank-3
p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D
!$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
do k = 1, size(A,3)
call reduce_b( A(:,:,k), B1, p_C(:,:,k))
call reduce_b( A(:,:,k), B2, p_D(:,:,k))
end do
!$omp end parallel do
end subroutine
Finally, we reach the main level where we select the subblocks of B
program doit
use transform, only: abcd, dp
implicit none
! n0 [2,10]
!
integer, parameter :: n0 = 6
integer, parameter :: n00 = n0*n0
integer, parameter :: N, nVV
real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)
! N [100,200]
!
read(*,*) N
nVV = (N - n0)**2
allocate(A(N,N,nVV))
allocate(B(N,N,N,N))
allocate(C(nVV,nVV))
allocate(D(n00,nVV))
print *, "Memory occupied (MB): ", &
real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2
A = 0
call random_number(B)
call random_number(C)
call random_number(D)
call abcd(A=A, &
B1=B(:,:,n0+1:N,n0+1:N), &
B2=B(:,:,1:n0,1:n0), &
C=C, &
D=D)
deallocate(A,B,C,D)
end program
Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100 and I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.
The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d
With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics as explained in Bringing Tensor Cores to Standard Fortran or a tool like the Tensor Contraction Engine.
One last thing I found odd was that large parts of B are un-unused. The sub sections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.
Veryreverie suggested an answer (https://stackoverflow.com/a/73366165/7462275) to merge two programs (a 2D and a 3D ones). The solution is to have only a 3D program and to set the third dimension to 1 for the the 2D case. I did not imagined this solution because I thought that there would be overhead by adding an useful third dimension. (Program is written in Fortran)
Then, I did many tests to compare 2D and 3D with third dimension = 1.
For example :
module m
implicit none
type :: int_3dim_t
integer :: x
integer :: y
integer :: z
end type int_3dim_t
end module m
program main
use m
implicit none
type(int_3dim_t) :: ii_curr, full_dim
real, dimension(:,:,:,:), allocatable :: myMatrix
real :: aaa
integer :: ll
open(unit=50,file='dim2.dat',status='old')
read(50,*) full_dim%x,full_dim%y
close(50)
full_dim%z = 1
allocate(myMatrix(1:8,1:full_dim%x,1:full_dim%y,1:full_dim%z))
associate (ii_x => ii_curr%x, ii_y => ii_curr%y, ii_z => ii_curr%z)
do ii_z = 1, full_dim%z
do ii_y = 1, full_dim%y
do ii_x = 1, full_dim%x
do ll=1,8
myMatrix(ll,ii_curr%x,ii_curr%y,ii_curr%z) = myMatrix(ll,ii_curr%x,ii_curr%y,ii_curr%z) + my_calc(ii_curr,ll)
enddo
enddo
enddo
enddo
end associate
print *,myMatrix(:,10,10,1)
contains
real function my_calc(ii,ll)
type(int_3dim_t), intent(in) :: ii
integer, intent(in) :: ll
my_calc = sqrt(real((ii%x/100.0)**2+(ii%y/100.0)**2))*ll
end function my_calc
end program main
The input file is 10000 10000 (another values do not change the conclusion), gfortran version is 12.1.1 and O2 optimisation flag is used.
(For 2D version, everything concerning the third dimension is removed)
And surprisingly, I did not found any difference in execution time.
So, I am looking for simple examples where this third dimension (set to 1) has an effect on the execution time. (Of course, with the correct fortran loop order : first indice in the innner loop and last indice in the outer loop)
I am trying to do calculations as part of regression model in Excel.
I need to calculate ((X^T)WX)^(-1)(X^T)WY. Where X, W, Y are matrices and ^T and ^-1 are denoting the matrix transpose and inverting operation.
Now when X, W, Y are of small dimensions I simply run my macro which calculates these values very very fast.
However sometimes I am dealing with the case when say, the dimensions of X, W, Y are 5000 X 5, 5000 X 1 and 5000 X 1 respectively, then the macro can take a lot longer to run.
I have two questions:
Would, instead of using my macro which generates the matrices on Excel sheets and then uses Excel formulas like MMULT and MINVERSE etc. to calculate the output, it be faster for larger dimension matrices if I used arrays in VBA to do all the calculations? (I am not too sure how arrays work in VBA so I don't actually know if it would do anything to excel, and hence if it would be any quicker/less computationally intensive.)
If the answer to the above question is no it would be no quicker. Then does anybody have an idea how to speed such calculations up? Or do I need to simply put up with it and wait.
Thanks for your time.
Considering that the algorithm of the code is the same, the speed ranking is the following:
Dll custom library with C#, C++, C, Java or anything similar
VBA
Excel
I have compared a VBA vs C++ function here, in the long term the result is really bad for VBA.
So, the following Fibonacci with recursion in C++:
int __stdcall FibWithRecursion(int & x)
{
int k = 0;
int p = 0;
if (x == 0)
return 0;
if (x == 1)
return 1;
k = x - 1;
p = x - 2;
return FibWithRecursion(k) + FibWithRecursion(p);
}
is exponentially better, when called in Excel, than the same complexity function in VBA:
Public Function FibWithRecursionVBA(ByRef x As Long) As Long
Dim k As Long: k = 0
Dim p As Long: p = 0
If (x = 0) Then FibWithRecursionVBA = 0: Exit Function
If (x = 1) Then FibWithRecursionVBA = 1: Exit Function
k = x - 1
p = x - 2
FibWithRecursionVBA = FibWithRecursionVBA(k) + FibWithRecursionVBA(p)
End Function
Better late than never:
I use matrices that are bigger, 3 or 4 dimensions, sized like 16k x 26 x 5.
I run through them to find data, apply one or two formulas or make combos with other matrices.
Number one, after starting the macro, open another application like notepad, you might have a nice speed increase ☺ !
Then, I guess you switched of screen updating etc, and turned of automatic calculation
As last: don't put the data in cells, not in arrays.
Just something like:
Dim Matrix1 as String ===>'put it in declarations if you want to use it in other macros as well. Remember you can not do "blabla=activecell.value2" etc anymore!!
In the "Sub()" code, use ReDim Matrix1(1 to a_value, 1 to 2nd_value, ... , 1 to last_value)
Matrix1(45,32,63)="what you want to put there"
After running, just drop the
Matrix1(1 to a_value, 1 to 2nd_value,1) at 1st sheet,
Matrix1(1 to a_value, 1 to 2nd_value,2) at 2nd sheet, etc
Switch on screen updating again, etc
In this way my calculation went from 45 minutes to just one, by avoiding the intermediary screen update
Success, I hope it is useful for somebody
I already checked similar posting. The solution is given by M. S. B. here Reading data file in Fortran with known number of lines but unknown number of entries in each line
So, the problem I am having is that from text file I am trying to read inputs. In one line there is supposed to be 3 variables. But sometimes the input file may have 2 variables. In that case I need to make the last variable zero. I tried using READ statement with IOSTAT but if there is only two values it goes to the next line and reads the next available value. I need to make it stop in the 1st line after reading 2 values when there is no 3rd value.
I found one way to do that is to have a comment/other than the type I am trying to read (in this case I am reading float while a comment is a char) which makes a IOSTAT>0 and I can use that as a check. But if in some cases I may not have that comment. I want to make sure it works even than.
Part of the code
read(15,*) x
read(15,*,IOSTAT=ioerr) y,z,w
if (ioerr.gt.0) then
write(*,*)'No value was found'
w=0.0;
goto 409
elseif (ioerr.eq.0) then
write(*,*)'Value found', w
endif
409 read(15,*) a,b
read(15,*) c,d
INPUT FILE is of the form
-1.000 abcd
12.460 28.000 8.00 efg
5.000 5.000 hijk
20.000 21.000 lmno
I need to make it work even when there is no "8.00 efg"
for this case
-1.000 abcd
12.460 28.000
5.000 5.000 hijk
20.000 21.000 lmno
I can not use the string method suggested by MSB. Is there any other way?
I seem to remember trying to do something similar in the past. If you know that the size of a line of the file won't exceed a certain number, you might be able to try something like:
...
character*(128) A
read(15,'(A128)') A !This now holds 1 line of text, padded on the right with spaces
read(A,*,IOSTAT=ioerror) x,y,z
if(IOSTAT.gt.0)then
!handle error here
endif
I'm not completely sure how portable this solution is from one compiler to the next and I don't have time right now to read up on it in the f77 standard...
I have a routine that counts the number of reals on a line. You could adapt this to your purpose fairly easily I think.
subroutine line_num_columns(iu,N,count)
implicit none
integer(4),intent(in)::iu,N
character(len=N)::line
real(8),allocatable::r(:)
integer(4)::r_size,count,i,j
count=0 !Set to zero in case of premature return
r_size=N/5 !Initially try out this max number of reals
allocate(r(r_size))
read(iu,'(a)') line
50 continue
do i=1,r_size
read(line,*,end=99) (r(j),j=1,i) !Try reading i reals
count=i
!write(*,*) count
enddo
r_size=r_size*2 !Need more reals
deallocate(r)
allocate(r(r_size))
goto 50
return
99 continue
write(*,*) 'I conclude that there are ',count,' reals on the first line'
end subroutine line_num_columns
If a Fortran 90 solution is fine, you can use the following procedure to parse a line with multiple real values:
subroutine readnext_r1(string, pos, value)
implicit none
character(len=*), intent(in) :: string
integer, intent(inout) :: pos
real, intent(out) :: value
integer :: i1, i2
i2 = len_trim(string)
! initial values:
if (pos > i2) then
pos = 0
value = 0.0
return
end if
! skip blanks:
i1 = pos
do
if (string(i1:i1) /= ' ') exit
i1 = i1 + 1
end do
! read real value and set pos:
read(string(i1:i2), *) value
pos = scan(string(i1:i2), ' ')
if (pos == 0) then
pos = i2 + 1
else
pos = pos + i1 - 1
end if
end subroutine readnext_r1
The subroutine reads the next real number from a string 'string' starting at character number 'pos' and returns the value in 'value'. If the end of the string has been reached, 'pos' is set to zero (and a value of 0.0 is returned), otherwise 'pos' is incremented to the character position behind the real number that was read.
So, for your case you would first read the line to a character string:
character(len=1024) :: line
...
read(15,'(A)') line
...
and then parse this string
real :: y, z, w
integer :: pos
...
pos = 1
call readnext_r1(line, pos, y)
call readnext_r1(line, pos, z)
call readnext_r1(line, pos, w)
if (pos == 0) w = 0.0
where the final 'if' is not even necessary (but this way it is more transparent imho).
Note, that this technique will fail if there is a third entry on the line that is not a real number.
I know the following simple solution:
w = 0.0
read(15,*,err=600)y, z, w
goto 610
600 read(15,*)y, z
610 do other stuff
But it contains "goto" operators
You might be able to use the wonderfully named colon edit descriptor. This allows you to skip the rest of a format if there are no further items in the I/O list:
Program test
Implicit None
Real :: a, b, c
Character( Len = 10 ) :: comment
Do
c = 0.0
comment = 'No comment'
Read( *, '( 2( f7.3, 1x ), :, f7.3, a )' ) a, b, c, comment
Write( *, * ) 'I read ', a, b, c, comment
End Do
End Program test
For instance with gfortran I get:
Wot now? gfortran -W -Wall -pedantic -std=f95 col.f90
Wot now? ./a.out
12.460 28.000 8.00 efg
I read 12.460000 28.000000 8.0000000 efg
12.460 28.000
I read 12.460000 28.000000 0.00000000E+00
^C
This works with gfortran, g95, the NAG compiler, Intel's compiler and the Sun/Oracle compiler. However I should say I'm not totally convinced I understand this - if c or comment are NOT read are they guaranteed to be 0 and all spaces respectively? Not sure, need to ask elsewhere.