flags, compilers and algorithm strategies to optimise Fortran loops

I'm studying the impact of loop unrolling and optimisation flags on Fortran code. I came up with the following, very trivial, case:
program do_order
  implicit none
  integer :: j, s, n, nLoops
  integer, dimension(4) :: Iv
  real*8, dimension(16, 4) :: tmp, Ov
  real :: start, finish
  nLoops = 1000000
  !! Initialize the values of Input vector;
  do n = 1,4
    Iv(n) = n**2
  end do
  !! Explicit Do-loop + implicit Do-loop working across columns (to be Fortran efficient)
  call cpu_time(start)
  do n = 1, nLoops
    tmp = 0.d0
    Ov = 0.d0
    do j = 1,4
      tmp(1:Iv(j),j) = Ov(1:Iv(j),j) - 10.0d0
    end do
  end do
  call cpu_time(finish)
  print '("Loop-1 Time = ",f6.3," seconds.")',finish-start
  tmp = 0.d0
  Ov = 0.d0
  !! Un-rolled loop + implicit Do-loop
  call cpu_time(start)
  do n = 1, nLoops
    tmp = 0.d0
    Ov = 0.d0
    tmp(1:Iv(1),1) = Ov(1:Iv(1),1) - 10.0d0
    tmp(1:Iv(2),2) = Ov(1:Iv(2),2) - 10.0d0
    tmp(1:Iv(3),3) = Ov(1:Iv(3),3) - 10.0d0
    tmp(1:Iv(4),4) = Ov(1:Iv(4),4) - 10.0d0
  end do
  call cpu_time(finish)
  print '("Loop-2 Time = ",f6.3," seconds.")',finish-start
end program
Compiled with the -O3 -mprefer-avx128 flags, it gives me the following timings:
Loop-1 Time = 4.487 seconds.
Loop-2 Time = 3.657 seconds.
I used gfortran: GNU Fortran (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0. I also have access to
ifort 19.1.3.304. The same test with ifort -O3 gives me:
Loop-1 Time = 0.939 seconds.
Loop-2 Time = 0.873 seconds.
My questions:
Why is there such a huge gap in performance between gfortran and ifort?
Did I unroll the second loop correctly?
Are there other optimization flags that would speed up the code even more (e.g. Loop unrolling & optimization, Decreasing the fortran run time by extra optimization flags)?
Are there other strategies to speed up the code even more (without moving to multithreading, for now)?

Related

What is a time complexity of the following algorithm in Big Theta Notation?

res = 0
for i in range (1,n):
    j = i
    while j % 2 == 0:
        j = j/2
    res = res + j
I understand that an upper bound is O(n log n); however, I'm wondering whether it's possible to find a tighter bound. I'm stuck with the analysis.

Some ideas that may be helpful:
You could create a function g(n) that annotates your function f(n) to count how many operations occur when running f(n):
def f(n):
    res = 0
    for i in range (1,n):
        j = i
        while j % 2 == 0:
            j = j/2
        res = res + j
    return res
def g(n):
    comparisons = 0
    operations = 0
    assignments = 0
    assignments += 1
    res = 0
    assignments += 1  # i = 1
    comparisons += 1  # i < n
    for i in range (1,n):
        assignments += 1
        j = i
        operations += 1
        comparisons += 1
        while j % 2 == 0:
            operations += 1
            assignments += 1
            j = j/2
            operations += 1
            assignments += 1
            res = res + j
            operations += 1
            comparisons += 1
        operations += 1 # i + 1
        assignments += 1 # assign to i
        comparisons += 1 # i < n ?
    return operations + comparisons + assignments
For n = 1, the code runs without hitting any loops: assigning the value of res; assigning i as 1; comparing i to n and skipping the loop as a result.
For n > 1, you get into the for loop, and the for statement is the only thing that changes the loop variable, so the loop body runs n - 1 times and the total cost is at least linear in n.
Once in the loop:
If i is odd, then you only assign j, perform the mod operation and compare to zero. That will be the case for half the values of i, so half of the loop iterations add only a small, fixed number of operations (including the loop operations). So, that's still O(n), just with a larger constant.
If i is even, then we divide by 2 until it is odd. This is what we need to work out the impact of.
Based on my counting of the different operations, I get:
g_initial_setup = 3 (every time)
g_for_any_i = 6 (half the time, it is just this)
g_for_even_i = 6 for each time we divide by two (the other half of the time)
For a random even i between 2 and n, half the time we will only need to divide by two once, half of the remaining time twice, half of what then remains three times, and so on. So, as n goes to infinity, we have an infinite series sum(1/2^m) for m >= 1, multiplied by the 6 operations done for each halving of j.
I would expect from this:
g(n) = 3 + (n * 6) + (n * 6) * sum( 1 / pow(2,m) for m between 1 and n )
Given that the infinite series sum(1/2^m) for m >= 1 converges to 1, we simplify that to:
g(n) = 3 + 12n as n approaches infinity.
That implies that the algorithm is O(n). Huh. I did not expect that.
Let's try out the function g(n) from above, counting all the operations that are occurring as f(n) is computed.
g(1) = 3 operations
g(2) = 9
g(3) = 21
g(4) = 27
g(5) = 45
g(10) = 99
g(100) = 1167
g(1000) = 11943
g(10000) = 119943
g(100000) = 1199931
g(1000000) = 11999919
g(10000000) = 119999907
Okay, unless I've really made a serious error here, it's O(n).
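As a cross-check on the counting above, here is a minimal sketch (not part of the original answer) that tallies operations with the same scheme as g(n): 3 for setup, 6 per loop iteration and 6 per halving. It compares the tally with a closed form. The names count_ops and closed_form are illustrative, and the closed form assumes Legendre's formula: the halvings for i = 1..n-1 sum to (n-1) minus the number of 1 bits in n-1.

```python
def count_ops(n):
    """Tally operations with the same scheme as g(n): 3 for setup,
    6 per loop iteration, and 6 per halving of j."""
    total = 3
    for i in range(1, n):
        total += 6
        j = i
        while j % 2 == 0:
            j //= 2
            total += 6
    return total


def closed_form(n):
    """Closed form, assuming Legendre's formula: the halvings for
    i = 1..n-1 sum to (n-1) - popcount(n-1)."""
    m = n - 1
    return 3 + 6 * m + 6 * (m - bin(m).count("1"))


# The direct tally and the closed form should agree for every n tried.
for n in (1, 2, 3, 4, 5, 100, 1000, 10000):
    assert count_ops(n) == closed_form(n)
    print(n, count_ops(n))
```

Because the number of 1 bits in n-1 is at most log2(n) + 1, the closed form equals 12n minus a term of order log n, which is consistent with the linear growth seen in the table.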

determine the time complexity in Java

public class complexity {
    public int calc(int n) {
        int a = 2 * Math.abs(n);
        int b = Math.abs(n) - a;
        int result = n;
        for (int i = 0; i < a; i++) {
            result += i;
            for (int j = 0; j > b; j--) {
                result -= j;
            }
            for (int j = a/2; j > 1 ; j/=2) {
                System.out.println(j);
            }
        }
        int tmp = result;
        while (tmp > 0) {
            result += tmp;
            tmp--;
        }
        return result;
    }
}
I have been given the Java program above and need to determine its time complexity (Tw(n)).
I checked these pages:
Big O, how do you calculate/approximate it?
How to find time complexity of an algorithm
But I still have trouble understanding it.
Here is the solution given by the professor. From the for loop onwards, I didn't understand how he came up with the different time complexities. Can anybody explain it?

Let's go over it step by step.
Before the for loop, every instruction executes in constant time.
Then:
for (int i = 0; i < a; i++) is executed 2n + 1 times: since a = 2n, going from 0 to a takes 2n steps, and the additional step is the check at i = 2n, where the program finds that the condition fails and breaks out of the loop.
Each instruction in the loop body is executed 2n times (as we are looping from 0 to 2n - 1), which explains why result += i is done 2n times.
The next instruction is another for loop, so we apply the same logic to the line
for (int j = 0; j > b; j--): as b = -n, j goes from 0 down to -n, which is n steps plus the extra check mentioned for the first for loop, i.e. n + 1 steps. As this instruction sits inside the outer loop, it is executed 2n times, hence 2n * (n+1).
The instructions inside this for loop are done n times per pass, so result -= j is done n times; and since the loop itself is run 2n times, result -= j is done n * 2n times in total.
The same goes for the last for loop, except that here we go from a/2 (which is n, as a = 2n) down to 1, dividing by 2 each time. This is a bit more complicated, so let's do some steps: in the first iteration j = n, then j = n/2, then j = n/4, until j <= 1. So how many steps do we need to reach 1?
to reach n/2 we need 1 = log(2) step: n => n/2
to reach n/4 we need 2 = log(4) steps: n => n/2 => n/4
We see that to reach n/x we need log(x) steps (logarithms base 2), and as 1 = n/n, we need log(n) steps to reach 1.
Therefore each instruction in this loop is executed log(n) times per pass, and as it is inside the parent loop, it is done 2n times over => 2n * log(n) times.
For the while loop, we need the value of result after the for loop: result = n + (0 + 1 + 2 + ... + (2n - 1)) + 2n * (0 + 1 + ... + (n - 1)).
These are the additions done in the for loop; evaluating the arithmetic series gives
result = n^2 (n+1), which is the number of times the while loop runs.
I hope this is clear; don't hesitate to ask otherwise!
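To check these counts empirically, here is a rough Python transcription of calc that counts how often each loop body runs. It is only a sketch: the function name calc_counts and the counter names are made up for illustration, the println is replaced by a counter, and Java's int overflow is ignored.

```python
def calc_counts(n):
    """Mirror the Java calc(n), returning
    (outer, inner, halving, result_before_while, while_iterations)."""
    a = 2 * abs(n)
    b = abs(n) - a
    result = n
    outer = inner = halving = 0
    for i in range(a):
        outer += 1
        result += i
        j = 0
        while j > b:              # for (int j = 0; j > b; j--)
            result -= j
            inner += 1
            j -= 1
        j = a // 2
        while j > 1:              # for (int j = a/2; j > 1; j /= 2)
            halving += 1
            j //= 2
    before_while = result
    tmp = result
    while_iterations = 0
    while tmp > 0:
        result += tmp
        tmp -= 1
        while_iterations += 1
    return outer, inner, halving, before_while, while_iterations


for n in (1, 2, 4, 8):
    print(n, calc_counts(n), n * n * (n + 1))
```

For n = 4 this gives 8 outer iterations, 32 inner iterations, 16 halving steps and a value of 80 = n^2 (n+1) before the while loop, which then runs 80 times.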

I would like to clarify the last while loop in particular. Unfortunately, my answer does not match the professor's, and I am not certain that I have not made a grave mistake.
result starts with n
for i from 0 up to 2|n|
    result += i
Done 2|n| times with an average i of |n|, so result increases by about 2n·n = 2n².
    for j from 0 down to -|n|
        result -= j
Done 2|n| times; result increases by 2n·n/2 = n².
So result is n + 3n².
The outer for loop remains O(n²), as the inner println for loop is only O(log n).
The last while loop would be the same as:
for tmp from n + 3n² down to 0
    result += tmp
This is also O(n²), like the outer for loop, so the entire complexity is O(n²).
The result is roughly 3n²·3n²/2, or about 4n⁴.

Optimizing a Fortran subroutine

I've written a minimal implementation of the fast xoroshiro128plus pseudo-random number generator in Fortran to replace the intrinsic random_number. This implementation is quite fast (4x faster than random_number) and the quality is good enough for my purposes; I don't use it in cryptographic applications.
My question is: how can I optimize this subroutine to get the last drop of performance from my compiler? Even a 10% improvement would be appreciated. This subroutine is to be used in tight loops inside long simulations. I'm more interested in generating a single random number at a time than big vectors or nD arrays at once.
Here is a test program to give you some context about how my subroutine is used:
program test_xoroshiro128plus
  implicit none
  integer, parameter :: n = 10000
  real*8 :: A(n,n)
  integer :: i, j, t0, t1, count_rate, count_max
  call system_clock(t0, count_rate, count_max)
  do j = 1,n
    do i = 1,n
      call drand128(A(i,j))
    end do
  end do
  ! call drand128(A) ! works also with 2D
  call system_clock(t1)
  print *, "Time :", real(t1-t0)/count_rate
  print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)
contains
  impure elemental subroutine drand128(r)
    real*8, intent(out) :: r
    integer*8 :: s0 = 113, s1 = 19937
    s1 = xor(s0,s1)
    s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
    s1 = ior(ishft(s1,36), ishft(s1,-28))
    r = ishft(s0+s1, -1) / 9223372036854775808.d0
  end
end program

Only now did I realize that you are asking about this particular PRNG. I am using it in Fortran myself: https://bitbucket.org/LadaF/elmm/src/eb5b54b9a8eb6af158a38038f72d07865fe23ee3/src/rng_par_zig.f90?at=master&fileviewer=file-view-default
My code in the link is slower than yours, because it calls several subroutines and aims to be more universal. But let's try to condense the code I use into a single subroutine.
So let's just compare the performance of your code, the optimized version by @SeverinPappadeux, and my optimized code, with Gfortran 4.8.5:
> gfortran -cpp -O3 -mtune=native xoroshiro.f90
Time drand128 sub: 1.80900002
Time drand128 fun: 1.80900002
Time rng_uni: 1.32900000
The code is below. Remember to let the CPU spin up: the first iteration of the k loop is just garbage!
program test_xoroshiro128plus
  use iso_fortran_env
  implicit none
  integer, parameter :: n = 30000
  real*8 :: A(n,n)
  real*4 :: B(n,n)
  integer :: i, j, k, t0, t1, count_rate, count_max
  integer(int64) :: s1 = int(Z'1DADBEEFBAADD0D0', int64), s2 = int(Z'5BADD0D0DEADBEEF', int64)
  !let the CPU spin-up
  do k = 1, 3
    call system_clock(t0, count_rate, count_max)
    do j = 1,n
      do i = 1,n
        call drand128(A(i,j))
      end do
    end do
    ! call drand128(A) ! works also with 2D
    call system_clock(t1)
    print *, "Time drand128 sub:", real(t1-t0)/count_rate
    call system_clock(t0, count_rate, count_max)
    do j = 1,n
      do i = 1,n
        A(i,j) = drand128_fun()
      end do
    end do
    ! call drand128(A) ! works also with 2D
    call system_clock(t1)
    print *, "Time drand128 fun:", real(t1-t0)/count_rate
    call system_clock(t0, count_rate, count_max)
    do j = 1,n
      do i = 1,n
        call rng_uni(A(i,j))
      end do
    end do
    call system_clock(t1)
    print *, "Time rng_uni:", real(t1-t0)/count_rate
  end do
  print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)
contains
  impure elemental subroutine drand128(r)
    real*8, intent(out) :: r
    integer*8 :: s0 = 113, s1 = 19937
    s1 = xor(s0,s1)
    s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
    s1 = ior(ishft(s1,36), ishft(s1,-28))
    r = ishft(s0+s1, -1) / 9223372036854775808.d0
  end
  impure elemental real*8 function drand128_fun()
    real*8, parameter :: c = 1.0d0/9223372036854775808.d0
    integer*8 :: s0 = 113, s1 = 19937
    s1 = xor(s0,s1)
    s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
    s1 = ior(ishft(s1,36), ishft(s1,-28))
    drand128_fun = ishft(s0+s1, -1) * c
  end
  impure elemental subroutine rng_uni(fn_val)
    real(real64), intent(inout) :: fn_val
    integer(int64) :: ival
    ival = s1 + s2
    s2 = ieor(s2, s1)
    s1 = ieor( ieor(rotl(s1, 24), s2), shiftl(s2, 16))
    s2 = rotl(s2, 37)
    ival = ior(int(Z'3FF0000000000000',int64), shiftr(ival, 12))
    fn_val = transfer(ival, 1.0_real64) - 1;
  end subroutine
  function rotl(x, k)
    integer(int64) :: rotl
    integer(int64) :: x
    integer :: k
    rotl = ior( shiftl(x, k), shiftr(x, 64-k))
  end function
end program
The main difference should come from the faster and better way of converting integers to reals: http://experilous.com/1/blog/post/perfect-fast-random-floating-point-numbers#half-open-range
If you are bored, you could try to inline rotl() manually, but I trust the compiler here.
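To make the integer-to-real trick in rng_uni easier to follow, here is a small Python sketch of the same idea. It is an illustration under assumptions rather than a port of the linked module: the names bits_to_unit_double and next_uniform are made up, and Python's unbounded integers are masked to 64 bits to imitate integer(int64) wrap-around.

```python
import struct

MASK64 = (1 << 64) - 1  # keep values in the unsigned 64-bit range


def rotl(x, k):
    """Rotate a 64-bit value left by k bits."""
    return ((x << k) | (x >> (64 - k))) & MASK64


def bits_to_unit_double(bits):
    """Map 64 random bits to a double in [0, 1).

    The top 52 bits become the mantissa of a double whose exponent
    field is 0x3FF, i.e. a value in [1, 2); subtracting 1 shifts it
    to [0, 1).
    """
    pattern = 0x3FF0000000000000 | ((bits & MASK64) >> 12)
    return struct.unpack("<d", struct.pack("<Q", pattern))[0] - 1.0


def next_uniform(state):
    """One step mirroring rng_uni; state is a two-element list [s1, s2]."""
    s1, s2 = state
    ival = (s1 + s2) & MASK64
    s2 ^= s1
    state[0] = rotl(s1, 24) ^ s2 ^ ((s2 << 16) & MASK64)
    state[1] = rotl(s2, 37)
    return bits_to_unit_double(ival)


state = [0x1DADBEEFBAADD0D0, 0x5BADD0D0DEADBEEF]
samples = [next_uniform(state) for _ in range(5)]
print(samples)
```

The conversion needs no integer-to-float instruction and no division: it only shifts, ORs in the exponent bits, reinterprets the pattern and subtracts 1. That is the operation transfer(ival, 1.0_real64) - 1 performs in the Fortran code.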

OK, here is my attempt. First, I turned it into a function: in the x64 ABI (and similar ones), a function returning a floating-point value returns it in a register, which is much faster than passing the result back through a parameter. Second,
I replaced the final division by a multiplication, though the Intel compiler might do that for you.
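As a quick sanity check on the second change, the sketch below (in Python, with illustrative names TWO_63 and C) confirms that multiplying by the precomputed reciprocal gives the same values as dividing. This holds here because the divisor is a power of two, so its reciprocal is exactly representable; it would not hold for an arbitrary divisor.

```python
import random

TWO_63 = 9223372036854775808.0   # 2**63, the divisor used in drand128
C = 1.0 / TWO_63                 # exactly 2**-63, a power of two

rng = random.Random(12345)
for _ in range(100_000):
    x = rng.getrandbits(63)      # stands in for ishft(s0 + s1, -1)
    # Scaling by a power of two is exact, so both forms agree.
    assert x / TWO_63 == x * C
print("division and multiplication agree")
```

So the change affects only speed, not the generated numbers.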
Timing, Intel i7 6820, WSL, Ubuntu 18.04:
before - 0.850000024
after - 0.601000011
GNU Fortran 7.3.0, command line
gfortran -std=gnu -O3 -ffast-math -mavx2 /mnt/c/Users/kkk/Documents/CPP/a.for
Code
program test_xoroshiro128plus
  implicit none
  integer, parameter :: n = 10000
  real*8 :: A(n,n)
  integer :: i, j, t0, t1, count_rate, count_max
  call system_clock(t0, count_rate, count_max)
  do j = 1,n
    do i = 1,n
      A(i,j) = drand128()
    end do
  end do
  A = drand128() ! works also with 2D
  call system_clock(t1)
  print *, "Time :", real(t1-t0)/count_rate
  print *, "Mean :", sum(A)/size(A), char(10), A(1:2,1:3)
contains
  impure elemental real*8 function drand128()
    real*8, parameter :: c = 1.0d0/9223372036854775808.d0
    integer*8 :: s0 = 113, s1 = 19937
    s1 = xor(s0,s1)
    s0 = xor(xor(ior(ishft(s0,55), ishft(s0,-9)),s1), ishft(s1,14))
    s1 = ior(ishft(s1,36), ishft(s1,-28))
    drand128 = ishft(s0+s1, -1) * c
  end
end program

For loop with variable upper bound

I'd like to write a for loop with a variable upper limit in Mathematica 9. So, instead of
j = 0;
For[n = 1, n <= 3, n++, j = j + n];
j
(*6*)
I'd like to do
N = 3;
j = 0;
For[n = 1, n <= N, n++, j = j + n];
j
n
(*
0
1
*)
But, as shown, this does not give the right result at all; it would appear from the value of n that the body of the loop was not evaluated at all.
I've looked through the Mathematica docs both on for loops and on loops and control structures more generally (and also done some DuckDuckGo searches), but there's still something fundamental I'm missing. What is it?
For completeness, I should note that my ultimate goal is to put this in a function:
foo[N] =
Module[{j = 0},
For[n = 1, n <= N, n++, j = j + n;];
j]
foo[3]

Your code shows several problems common among new users. For example:
N is a reserved word (it is the built-in numerical-evaluation function)
You shouldn't start your identifiers with upper-case letters
The function foo[] should be defined with SetDelayed (:=) and not with Set (=)
You need to use patterns (_) in the function definition arguments
For[] loops, and iterations in general, should be avoided in Mathematica
I think you should carefully read all the answers to this post to get a better grip on Mathematica.
Anyway, your code may be rewritten as
foo[k_] := Module[{j = 0}, For[n = 1, n <= k, n++, j = j + n]; j]
foo[3]
(*6*)
But this is horrible Mathematica coding.
The following are much better ways in Mathematica:
foo[j_ , k_] := Fold[Plus, j, Range@k]
foo[j_ , k_] := j + Total@Range@k
foo[j_ , k_] := j + Tr@Range@k

Fortran Error Meanings

I have been following books and PDFs on writing in FORTRAN to write an integration program. I compile the code with gfortran and get several copies of the following errors.
1)Unexpected data declaration statement at (1)
2)Unterminated character constant beginning at (1)
3)Unclassifiable statement at (1)
4)Unexpected STATEMENT FUNCTION statement at (1)
5)Expecting END PROGRAM statement at (1)
6)Syntax error in data declaration at (1)
7)Statement function at (1) is recursive
8)Unexpected IMPLICIT NONE statement at (1)
I do not know what they truly mean or how to fix them; a Google search turned up nothing, and the other topics on this site were about other errors. For error 5), I put in Program main and end program main, like I might in C++, but still got the same result. Error 7) makes no sense: I am trying for recursion in the program. As for error 8), I read that implicit none was there to prevent unnecessary declarations.
I'll post the code itself, but I am more interested in the compile errors, because I still need to fine-tune the array data handling and I can't do that until I get it working.
Program main
implicit none
real, dimension(:,:), allocatable :: m, oldm
real a
integer io, nn
character(30) :: filename
real, dimension(:,:), allocatable :: alt, temp, nue, oxy
integer locationa, locationt, locationn, locationo, i
integer nend
real dz, z, integral
real alti, tempi, nuei, oxyi
integer y, j
allocate( m(0, 0) ) ! size zero to start with?
nn = 0
j = 0
write(*,*) 'Enter input file name: '
read(*,*) filename
open( 1, file = filename )
do !reading in data file
read(1, *, iostat = io) a
if (io < 0 ) exit
nn = nn + 1
allocate( oldm( size(m), size(m) ) )
oldm = m
deallocate( m )
allocate( m(nn, nn) )
m = oldm
m(nn, nn) = a ! The nnth value of m
deallocate( oldm )
enddo
! Decompose matrix array m into column arrays [1,n]
write(*,*) 'Enter Column Number for Altitude'
read(*,*) locationa
write(*,*) 'Enter Column Number for Temperature'
read(*,*) locationt
write(*,*) 'Enter Column Number for Nuetral Density'
read(*,*) locationn
write(*,*) 'Enter Column Number for Oxygen density'
read(*,*) locationo
nend = size(m, locationa) !length of column #locationa
do i = 1, nend
alt(i, 1) = m(i, locationa)
temp(i, 1) = log(m(i, locationt))
nue(i, 1) = log(m(i, locationn))
oxy(i, 1) = log(m(i, locationo))
enddo
! Interpolate Column arrays, Constant X value will be array ALT with the 3 other arrays
!real dz = size(alt)/100, z, integral = 0
!real alti, tempi, nuei, oxyi
!integer y, j = 0
dz = size(alt)/100
do z = 1, 100, dz
y = z !with chopped rounding alt(y) will always be lowest integer for smooth transition.
alti = alt(y, 1) + j*dz ! the addition of j*dz's allow for all values not in the array between two points of the array.
tempi = exp(linear_interpolation(alt, temp, size(alt), alti))
nuei = exp(linear_interpolation(alt, nue, size(alt), alti))
oxyi = exp(linear_interpolation(alt, oxy, size(alt), alti))
j = j + 1
!Integration
integral = integral + tempi*nuei*oxyi*dz
enddo
end program main
!Functions
real function linear_interpolation(x, y, n, x0)
implicit none
integer :: n, i, k
real :: x(n), y(n), x0, y0
k = 0
do i = 1, n-1
if ((x0 >= x(i)) .and. (x0 <= x(i+1))) then
k = i ! k is the index where: x(k) <= x <= x(k+1)
exit ! exit loop
end if
enddo
if (k > 0) then ! compute the interpolated value for a point not in the array
y0 = y(k) + (y(k+1)-y(k))/(x(k+1)-x(k))*(x0-x(k))
else
write(*,*)'Error computing the interpolation !!!'
write(*,*) 'x0 =',x0, ' is out of range <', x(1),',',x(n),'>'
end if
! return value
linear_interpolation = y0
end function linear_interpolation
I can provide a more detailed description of the exact errors; I was hoping that the error names would be enough, since I have a few of each type.

I think I can spot a few serious errors in your code sample. The syntax error is that you have unbalanced parentheses in the exp(... statements. They should be like this:
tempi = exp(linear_interpolation(alt, temp, size(alt), alti) ) ! <- extra ")"
nuei = exp(linear_interpolation(alt, nue, size(alt), alti) )
oxyi = exp(linear_interpolation(alt, oxy, size(alt), alti) )
It's precisely things like this that can produce long strings of confusing errors like you're getting; therefore the advice Dave and Jonathan have given can't be repeated often enough.
Another error (the "unclassifiable statement") applies to your loops:
do(i=1, nend)
! ...
do(z=1, 100, dz)
! ...
These should be written without parentheses.
The "data declaration error" stems from your attempt to declare and initialise multiple variables like
real dz = size(alt)/100, z, integral = 0
Along with being positioned incorrectly in the code (as noted), this can only be done with the double colon separator:
real :: dz = size(alt)/100, z, integral = 0
I personally recommend always writing declarations like this. It must be noted, though, that initialising variables like this has the side effect of implicitly giving them the save attribute. This has no effect in a main program, but is important to know; you can avoid it by putting the initialisations on a separate line.
