Adding an extra dimension of size 1 and overhead - optimization

Veryreverie suggested an answer (https://stackoverflow.com/a/73366165/7462275) to merge two programs (a 2D and a 3D ones). The solution is to have only a 3D program and to set the third dimension to 1 for the the 2D case. I did not imagined this solution because I thought that there would be overhead by adding an useful third dimension. (Program is written in Fortran)
Then, I did many tests to compare 2D and 3D with third dimension = 1.
For example :
module m
implicit none
type :: int_3dim_t
integer :: x
integer :: y
integer :: z
end type int_3dim_t
end module m
program main
use m
implicit none
type(int_3dim_t) :: ii_curr, full_dim
real, dimension(:,:,:,:), allocatable :: myMatrix
real :: aaa
integer :: ll
open(unit=50,file='dim2.dat',status='old')
read(50,*) full_dim%x,full_dim%y
close(50)
full_dim%z = 1
allocate(myMatrix(1:8,1:full_dim%x,1:full_dim%y,1:full_dim%z))
associate (ii_x => ii_curr%x, ii_y => ii_curr%y, ii_z => ii_curr%z)
do ii_z = 1, full_dim%z
do ii_y = 1, full_dim%y
do ii_x = 1, full_dim%x
do ll=1,8
myMatrix(ll,ii_curr%x,ii_curr%y,ii_curr%z) = myMatrix(ll,ii_curr%x,ii_curr%y,ii_curr%z) + my_calc(ii_curr,ll)
enddo
enddo
enddo
enddo
end associate
print *,myMatrix(:,10,10,1)
contains
real function my_calc(ii,ll)
type(int_3dim_t), intent(in) :: ii
integer, intent(in) :: ll
my_calc = sqrt(real((ii%x/100.0)**2+(ii%y/100.0)**2))*ll
end function my_calc
end program main
The input file is 10000 10000 (another values do not change the conclusion), gfortran version is 12.1.1 and O2 optimisation flag is used.
(For 2D version, everything concerning the third dimension is removed)
And surprisingly, I did not found any difference in execution time.
So, I am looking for simple examples where this third dimension (set to 1) has an effect on the execution time. (Of course, with the correct fortran loop order : first indice in the innner loop and last indice in the outer loop)

Related

Is it possible to optimize these fortran loops?

Here is my problem: I have a fortran code with a certain amount of nested loops and first I wanted to know if it's possible to optimize (rearranging) them in order to get a time gain? Second I wonder if I could use OpenMP to optimize them?
I have seen a lot of posts about nested do loops in fortran and how to optimize them but I didn't find one example that is suited to mine. I have also searched about OpenMP for nested do loops in fortran but I'm level 0 in OpenMP and it's difficult for me to know how to use it in my case.
Here are two very similar examples of loops that I have, first:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,c,d) - B(p,q,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + (B(p,q,k,l) - B(p,q,l,k))*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,c,d) - B(p,q,d,c))*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + (B(p,q,k,l) - B(p,q,l,k))*G(kl,ij)
end do
end do
end do
end do
end do
and the other one is:
do p=1,N
do q=1,N
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,c,d)*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
A(p,q,ab) = A(p,q,ab) + B(p,q,k,l)*D(kl,ab)
end do
end do
end do
do ij=1,nOO
cd = 0
do c=nO+1,N
do d=nO+1,N
cd = cd + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,c,d)*F(cd,ij)
end do
end do
kl = 0
do k=1,nO
do l=1,nO
kl = kl + 1
E(p,q,ij) = E(p,q,ij) + B(p,q,k,l)*G(kl,ij)
end do
end do
end do
end do
end do
The very small difference between the two examples is mainly in the indices of the loops. I don't know if you need more info about the different integers in the loops but you have in general: nO < nOO < N < nVV. So I don't know if it's possible to optimize these loops and/or possibly put them in a way that will facilitate the use of OpenMP (I don't know yet if I will use OpenMP, it will depend on how much I can gain by optimizing the loops without it).
I already tried to rearrange the loops in different ways without any success (no time gain) and I also tried a little bit of OpenMP but I don't know much about it, so again no success.
From the initial comments it may appear that at least in some cases you may be using more memory than the available RAM, which means you may be using the swap file, with all the bad consequences on the performances. To fix this, you have to either install more RAM if possible, or deeply reorganize your code to not store the full B array (by far the largest one) at once (again, if possible).
Now, let's assume that you have enough RAM. As I wrote in the comments, the access pattern on the B array is far from optimal, as the inner loops correspond to the last indeces of B, which can result in many cache misses (all the more given the the size of B). Changing the loop order if possible is a way to go.
Just looking at your first example, I am focusing on the computation of the array A (the computation of the array E looks completely independent of A, so it can be processed separately):
!! test it at first without OpenMP
!!$OMP PARALLEL DO PRIVATE(cd,c,d,kl,k,l)
do ab=1,nVV
cd = 0
do c=nO+1,N
do d=c+1,N
cd = cd + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,c,d) - B(:,:,d,c))*C(cd,ab)
end do
end do
kl = 0
do k=1,nO
do l=k+1,nO
kl = kl + 1
A(:,:,ab) = A(:,:,ab) + (B(:,:,k,l) - B(:,:,l,k))*D(kl,ab)
end do
end do
end do
!!$OMP END PARALLEL DO
What I did:
moved the loops on p and q from outer to inner positions (it's not always as easy than it is here)
replaced them with array syntax (no performance gain to expect, just a code easier to read)
Now the inner loops (abstracted by the array syntax) tackle contiguous elements in memory, which is much better for the performances. The code is even ready for OpenMP multithreading on the (now) outer loop.
EDIT/Hint
Fortran stores the arrays in "column-major order", that is when incrementing the first index one accesses contiguous elements in memory. In C the arrays are stored in "row-major order", that is when incrementing the last index one accesses contiguous elements in memory. So a general rule is to have the inner loops on the first indeces (and the opposite in C).
It would be helpful if you could describe the operations you'd like to perform using tensor notation and the Einstein summation rule. I have the feeling the code could be written much more succinctly using something like np.einsum in NumPy.
For the second block of loop nests (the ones where you iterate across a square sub-section of B as opposed to a triangle) you could try to introduce some sub-programs or primitives from which the full solution is built.
Working from the bottom up, you start with a simple sum of two matrices.
!
! a_ij := a_ij + beta * b_ij
!
pure subroutine apb(A,B,beta)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:)
real(dp), intent(in) :: beta
A = A + beta*B
end subroutine
(for first code block in the original post, you would substitute this primitive with one that only updates the upper/lower triangle of the matrix)
One step higher is a tensor contraction
!
! a_ij := a_ij + b_ijkl c_kl
!
pure subroutine reduce_b(A,B,C)
real(dp), intent(inout) :: A(:,:)
real(dp), intent(in) :: B(:,:,:,:)
real(dp), intent(in) :: C(:,:)
integer :: k, l
do l = 1, size(B,4)
do k = 1, size(B,3)
call apb( A, B(:,:,k,l), C(k,l) )
end do
end do
end subroutine
Note the dimensions of C must match the last two dimensions of B. (In the original loop nest above, the storage order of C is swapped (i.e. c_lk instead of c_kl.)
Working our way upward, we have the contractions with two different sub-blocks of B, moreover A, C, and D have an additional outer dimension:
!
! A_n := A_n + B1_cd C_cdn + B2_kl D_kln
!
! The elements of A_n are a_ijn
! The elements of B1_cd are B1_ijcd
! The elements of B2_kl are B2_ijkl
!
subroutine abcd(A,B1,C,B2,D)
real(dp), intent(inout), contiguous :: A(:,:,:)
real(dp), intent(in) :: B1(:,:,:,:)
real(dp), intent(in) :: B2(:,:,:,:)
real(dp), intent(in), contiguous, target :: C(:,:), D(:,:)
real(dp), pointer :: p_C(:,:,:) => null()
real(dp), pointer :: p_D(:,:,:) => null()
integer :: k
integer :: nc, nd
nc = size(B1,3)*size(B1,4)
nd = size(B2,3)*size(B2,4)
if (nc /= size(C,1)) then
error stop "FATAL ERROR: Dimension mismatch between B1 and C"
end if
if (nd /= size(D,1)) then
error stop "FATAL ERROR: Dimension mismatch between B2 and D"
end if
! Pointer remapping of arrays C and D to rank-3
p_C(1:size(B1,3),1:size(B1,4),1:size(C,2)) => C
p_D(1:size(B2,3),1:size(B2,4),1:size(D,2)) => D
!$omp parallel do default(private) shared(A,B1,p_C,B2,p_D)
do k = 1, size(A,3)
call reduce_b( A(:,:,k), B1, p_C(:,:,k))
call reduce_b( A(:,:,k), B2, p_D(:,:,k))
end do
!$omp end parallel do
end subroutine
Finally, we reach the main level where we select the subblocks of B
program doit
use transform, only: abcd, dp
implicit none
! n0 [2,10]
!
integer, parameter :: n0 = 6
integer, parameter :: n00 = n0*n0
integer, parameter :: N, nVV
real(dp), allocatable :: A(:,:,:), B(:,:,:,:), C(:,:), D(:,:)
! N [100,200]
!
read(*,*) N
nVV = (N - n0)**2
allocate(A(N,N,nVV))
allocate(B(N,N,N,N))
allocate(C(nVV,nVV))
allocate(D(n00,nVV))
print *, "Memory occupied (MB): ", &
real(sizeof(A) + sizeof(B) + sizeof(C) + sizeof(D),dp) / 1024._dp**2
A = 0
call random_number(B)
call random_number(C)
call random_number(D)
call abcd(A=A, &
B1=B(:,:,n0+1:N,n0+1:N), &
B2=B(:,:,1:n0,1:n0), &
C=C, &
D=D)
deallocate(A,B,C,D)
end program
Similar to the answer by PierU, parallelization is on the outermost loop. On my PC, for N = 50, this re-engineered routine is about 8 times faster when executed serially. With OpenMP on 4 threads the factor is 20. For N = 100 and I got tired of waiting for the original code; the re-engineered version on 4 threads took about 3 minutes.
The full code I used for testing, configurable via environment variables (ORIG=<0|1> N=100 ./abcd), is available here: https://gist.github.com/ivan-pi/385b3ae241e517381eb5cf84f119423d
With more fine-tuning it should be possible to bring the numbers down even further. Even better performance could be sought with a specialized library like cuTENSOR (also used under the hood of Fortran intrinsics as explained in Bringing Tensor Cores to Standard Fortran or a tool like the Tensor Contraction Engine.
One last thing I found odd was that large parts of B are un-unused. The sub sections B(:,:,1:n0,n0+1:N) and B(:,:,n0+1:N,1:n0) appear to be wasted space.

Why is my moving average algorithm so slow?

I've written a Fortran function that calculates the moving average of a 1D array of numbers in a very straightforward way:
function moving_average(data, w)
implicit none
integer, intent(in) :: w
real(8), intent(in) :: data(:)
integer :: n, i
real(8) :: moving_average(size(data)-w+1)
n = w-1
do i=1, size(data)-n
moving_average(i) = mean(data(i:i+n))
end do
end function
Where the function mean is defined as:
real(8) function mean(data)
implicit none
real(8), dimension(:), intent(in) :: data
mean = sum(data)/size(data)
end function
When running the function moving_average on my laptop with a data set of 100000 numbers and a window width of 1000, it takes 0.1 seconds. However, the function running_mean in this post using numpy takes only 1 ms. Why is my algorithm so much slower?
Your algorithm is of the order O(n*m) with n the size of the moving average and m the size of the array.
Every time you compute a point in the array moving_average, you do the following steps:
extract a part of the array
compute the sum over that slice
divide by the constant n
However, moving_average(i) and moving_average(i+1) are related in the following way:
moving_average(i+i) = moving_average(i) + (data(i+n) - data(i-1))/n
When you use this, you can reduce the computational time from O(n*m) to O(m)

passing a portion of an array of a derived type that contains an allocatable component

I'm having trouble when I pass a portion of an array of a derived type which has an allocatable component. A (probable) memory leaks seems to occur (unfortunatly valgrind is not yet available on mac 10.14/mojave, so I simply used a long do loop and looked the memory used with top).
Consider the following test program which reproduces my problem in a simpler context:
program test_leak
implicit none
type :: vint_t
integer, allocatable :: col(:)
end type vint_t
type(vint_t) :: row(4)
integer :: i, Ind(2)=[2,3]
do i = 1, 4
allocate(row(i)%col(i))
end do
row(1)%col(1) = 11
row(2)%col(1) = 21 ; row(2)%col(2) = 22
row(3)%col(1) = 31 ; row(3)%col(2) = 32 ; row(3)%col(3) = 33
row(4)%col(1) = 41 ; row(4)%col(2) = 42 ; row(4)%col(3) = 43 ; row(4)%col(4) = 44
do i = 1, 1000000
call do_nothing ( row(Ind) ) ! (version #A) with this version and with gfortran: memory grows with iter
!call do_nothing ( row(2:3) ) ! (version #B) but not with this version
if (mod(i,10000) == 0) then
print*,i ; read*
end if
end do
contains
subroutine do_nothing ( a )
type(vint_t), intent(in) :: a(:)
end subroutine do_nothing
end program test_leak
The problem occurs with gfortran (8.2) and not with ifort (19) nor with nagfor (6.2) (with nagfor and ifort the used memory is stable during the iterations and ca 400 Kb, while it reaches ca 30 Mb with gfortran!).
A final procedure doesn't solve the problem.
The problem disapears if the allocatable member "col" is replaced by an explicit array (say col(4)) or if the version #B is used.
The problem also occurs with an allocatable character component.
If I print "a" in the subroutine "do_nothing" both versions give the correct result.
Does anyone have some insight on this issue? Of course I can design the called subroutines by passing the array Ind (do_nothing(row, Ind)) but that would be somewhat a pity.

Logical Indexing based on "find" in Fortran 90

I am trying to make a logical array (B) to use in logical indexing based on values between .1 and .999 in an array (EP_G2) using a couple different methods 1) where loop 2) ANY.
program flux_3d
implicit none
INTEGER :: RMAX, YMAX, ZMAZ, timesteps
DOUBLE PRECISION, PARAMETER :: pmin=0.1
DOUBLE PRECISION, PARAMETER :: pmax=0.999
INTEGER :: sz
DOUBLE PRECISION, ALLOCATABLE :: EP_G2(:,:), C(:)
INTEGER, DIMENSION(RMAX*ZMAX*YMAX) :: B
LOGICAL, DIMENSION(RMAX*ZMAX*YMAX) :: A
! dimensions of array,
RMAX = 540
YMAX = 204
ZMAX = 54
timesteps = 1
!Open ascii array
OPEN(100, FILE ='EP_G2', form = 'formatted')
ALLOCATE( EP_G2(RMAX*ZMAX*YMAX, timesteps))
READ(100, *) EP_G2
WHERE(pmin<EP_G2(:,timesteps)<pmax) B = 1
ELSEWHERE
B = 0
ENDWHERE
PRINT*, B
! SZ # OF POINTS IN B
sz = count(B.eq.1)
!alternate way of finding points between pmin and pmax
A = ANY(pmin<EP_G2(:,t)<pmax)
print*, sz
!Then use the logical matrix B (or A) to make new array
!Basically I want a new array that isolates just the points in array between
!pmin and pmax
ALLOCATE(C(sz))
C = EP_G2(LOGICAL(B), 1)
The issue I am getting is that I either get the whole array or nothing and the command LOGICAL(B) gets an error that it isn't the right kind. I am new to Fortran coming from Matlab where I would just use find.
Additionally, since this array is over 5,948,640 x 1 computational time is an issue. I am using the Intel Fortran compiler (15.0 I believe).
Basically, I am looking for the fastest way to find the indexes of points in an array between two numbers.
Any help would be very appreciated.
I'm a little confused by your code and question but I think you want to find the indices of elements in an array whose values lie between 0.1 and 0.999. It's not entirely clear that you want the indices, or just the elements themselves, but bear with me and I'll explain how to get both.
Suppose your original array is declared something like
real, dimension(10**6) :: values
and that it is populated with values. Then I might declare an index array like this
integer, dimension(10**6) :: indices = [(ix,ix=1,10**6)]
(obviously you'll need to declare the variable ix too).
Now, the expression
pack(indices, values>pmin.and.values<pmax)
will return those elements of indices whose values lie strictly between pmin and pmax. Of course, if you only want the values themselves you can dispense with indices entirely and write
pack(values, values>pmin.and.values<pmax)
Both of these uses of pack will return an array of rank 1, and you can assign the returned array to an allocatable, Fortran will take care of the sizing for you.
I'll leave it to you to expand this approach to the rank-2 array you are actually working with, but it shouldn't be too difficult to do that. Ask for more help if you need it.
And is this the fastest approach ? I'm not sure, but it was very quick to write and if you're interested in its run-time performance I suggest you test that.
Incidentally, the Fortran intrinsic function logical is for transforming logical values from one kind to another, it's not defined on integers (or any other intrinsic type). Fortran predates the madness of regarding 0 and 1 as logical values.
How can the arrays be the same?
RMAX, YMAX, ZMAZ, and timesteps values are not known until after A, B, and C are declared - So they (A and B) will not likely be the size you want.
implicit none
INTEGER :: RMAX, YMAX, ZMAZ, timesteps
DOUBLE PRECISION, PARAMETER :: pmin=0.1
DOUBLE PRECISION, PARAMETER :: pmax=0.999
INTEGER :: sz
DOUBLE PRECISION, ALLOCATABLE :: EP_G2(:,:), C(:)
INTEGER, DIMENSION(RMAX*ZMAX*YMAX) :: B
LOGICAL, DIMENSION(RMAX*ZMAX*YMAX) :: A
! dimensions of array,
RMAX = 540
YMAX = 204
ZMAX = 54
timesteps = 1
You probably want either this:
implicit none
INTEGER , PARAMETER :: RMAX = 504
INTEGER , PARAMETER :: YMAX = 204
INTEGER , PARAMETER :: ZMAZ = 54
INTEGER , PARAMETER :: timesteps = 1
DOUBLE PRECISION, PARAMETER :: pmin = 0.1
DOUBLE PRECISION, PARAMETER :: pmax = 0.999
INTEGER :: sz
DOUBLE PRECISION, ALLOCATABLE :: EP_G2(:,:), C(:)
INTEGER, DIMENSION(RMAX*ZMAX*YMAX) :: B
LOGICAL, DIMENSION(RMAX*ZMAX*YMAX) :: A
! dimensions of array,
!RMAX = 540
!YMAX = 204
!ZMAX = 54
!timesteps = 1
Or this:
implicit none
! dimensions of array
INTEGER , PARAMETER :: RMAX = 504
INTEGER , PARAMETER :: YMAX = 204
INTEGER , PARAMETER :: ZMAZ = 54
INTEGER , PARAMETER :: timesteps = 1
DOUBLE PRECISION, PARAMETER :: pmin = 0.1
DOUBLE PRECISION, PARAMETER :: pmax = 0.999
INTEGER :: sz
DOUBLE PRECISION, ALLOCATABLE :: EP_G2(:,:), C(:)
INTEGER, DIMENSION(:), ALLOCATABLE :: B
LOGICAL, DIMENSION(:), ALLOCATABLE :: A
! Then allocate A and B
And you may also want to consider the use of SHAPE or SIZE to see if the rank and size of the array are correct.
IF(SHAPE(A) /= SHAPE(B) ) ... chuck and error message.
IF(SIZE(A,1) /= SIZE(B,1) ) etc

How do I read numbers from a file in Fortran and immediately do calculations with each numbers?

I'm totally new to Fortran, and I'm trying to learn the language here:
http://www.fortrantutorial.com/files-precision/index.php. I have some basic experience with C and Python, but not much, like introduction class and such.
So in exercises 4.1, they ask me to input some numbers from a file, and check if these numbers are even or odd. Here is the code to input the numbers:
program readdata
implicit none
!reads data from a file called mydata.txt
real :: x,y,z
open(10,file='mydata.txt')
read(10,*) x,y,z
print *,x,y,z
end program readdata
The file mydata.txt contains some random numbers. And they can check if the number is even or odd by:
if (mod(num,2)>0) then……
My question is that: if this file have like 10, or 1000 numbers, do I have to manually assign every single one of them? Is there any other way for me to do quick calculation with mass numbers situation like that?
Every read also moves the read pointer forwards. So with every new read, a new line is read in from the file.
The easiest thing to do is to keep reading until the READ statement returns an error. Of course, you have to pass a variable for the READ to write its error into. Something like this:
program readdata
implicit none
real :: x, y, z
integer :: iounit, ios
open(newunit=iounit, file='mydata.txt', iostat=ios, action='READ')
if (ios /= 0) STOP 1
do
read(iounit, *, iostat=ios) x, y, z
if (ios /= 0) exit
print *, x, y, z
end do
close(iounit)
end program readdata
Update: If you're limited to Fortran 95, as OP suggested in his comments, here's what to change: Instead of
integer :: iounit, ios
open(newunit=iounit, ...)
you use
integer :: ios
integer, paramter :: iounit = 100
open(unit=iounit, ...)
All that's important is that iounit is a number, greater than 10, which is not used as a unit for any other read/write operation.
well a mass of numbers means that X, Y, and Z need to be more than a single number of each... so something like this should get you vectored towards a solution:
program readdata
implicit none
!reads data from a file called mydata.txt
INTEGER, PARAMETER :: max2process = 10000
INTEGER :: I
INTEGER :: K = 0
real , DIMENSION(max2process) :: x,y,z
LOGICAL, DIMENSION(max2process) :: ODD = .FALSE.
open(10,file='mydata.txt')
10 CONTINUE
X(:) = 0
Y(:) = 0
K(:) = 0
DO I = 1, max2process
K = K + I
read(10,*, EOF=99) x(I), y(I), z(I)
print *,x(I), y(I), z(I)
ENDDO
WRITE(*,22) I, MINVAL(X(1:I)), MAXVAL(X(1:I)
22 FORMAT(' MIN(X(1:',I5,')=',0PE12.5,' Max=',0PE12.5)
odd = .FALSE.
WHERE (MODULO(X, 2) /= 0)
ODD = .TRUE.
ENDWHERE
DO I = 1, 10
WRITE(*,88) I, X(I), odd(I)
88 FORMAT('x(', I2,')=',0PE12.5,' odd="',L1,'"')
EBDDO
WRITE(*,*) ' we did not hit the end of file... Go back and read some more'
GOTO 10
99 WRITE(*,*) 'END of file reached at ', (K-1)
end program readdata
Basically something like that, but I probably have a typo (LTR). If MODULO is ELEMENTAL then you can do it easily... And MODULO is ELEMENTAL, so you are in luck (which is common).
There is a https://rosettacode.org/wiki.Even_or_Odd#Fortran that looks nice. the functions could be further enhanced using ELEMENTAL FUNCTION.
that should give you enough to work it out.
or you could read the file to find the size, close and reopen it, and then allocate X, Y, Z and Odd... the example i gave was more of a streaming example. one should generally use a DO WHILE rather than a GOTO, but starting out a GOTO can be conceptually easier. (You need an EXIT or some DO WHILE (IOSTAT /= ) in order to break out)