Performance of Fortran versus MPI file writing/reading - optimization

I am confused about the writing and reading performance (speed) of plain Fortran I/O versus MPI-IO for small and big files.
I wrote the following simple dummy program to test this (just writing dummy values to files):
PROGRAM test
!
IMPLICIT NONE
!
#if defined (__MPI)
!
! Include file for MPI
!
#if defined (__MPI_MODULE)
USE mpi
#else
INCLUDE 'mpif.h'
#endif
#else
! dummy world and null communicator
INTEGER, PARAMETER :: MPI_COMM_WORLD = 0
INTEGER, PARAMETER :: MPI_COMM_NULL = -1
INTEGER, PARAMETER :: MPI_COMM_SELF = -2
#endif
INTEGER (kind=MPI_OFFSET_KIND) :: lsize, pos, pos2
INTEGER, PARAMETER :: DP = 8
REAL(kind=DP), ALLOCATABLE, DIMENSION(:) :: trans_prob, array_cpu
INTEGER :: ierr, i, error, my_pool_id, world_comm
INTEGER (kind=DP) :: fil
REAL :: start, finish
INTEGER :: iunepmat, npool, arr_size, loop, pos3, j
real(dp):: dummy
integer*8 :: unf_recl
integer :: ios, direct_io_factor, recl
iunepmat = 10000
arr_size = 102400
loop = 500
! Initialize MPI
CALL MPI_INIT(ierr)
call MPI_COMM_DUP(MPI_COMM_WORLD, world_comm, ierr)
call MPI_COMM_RANK(world_comm,my_pool_id,error)
ALLOCATE(trans_prob(arr_size))
trans_prob(:) = 1.5d0
!Write using Fortran
CALL MPI_BARRIER(world_comm,error)
!
CALL cpu_time(start)
!
DO i=1, loop
! Unformatted sequential output also stores the record length as a 4-byte marker around each record.
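! No FILE= is given, so each rank writes to the default file for its unit (fort.<unit> with gfortran).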
OPEN(unit=10+my_pool_id, form='unformatted', position='append', action='write')
WRITE(10+my_pool_id ) trans_prob(:)
CLOSE(unit=10+my_pool_id)
ENDDO
CALL MPI_COMM_SIZE(world_comm, npool, error)
! Master collect and write
IF (my_pool_id==0) THEN
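! IOLENGTH gives the RECL unit needed to hold one REAL(DP) value; it is used to size the direct-access read below.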
INQUIRE (IOLENGTH=direct_io_factor) dummy
unf_recl = direct_io_factor * int(arr_size * loop, kind=kind(unf_recl))
ALLOCATE (array_cpu( arr_size * loop ))
array_cpu(:) = 0.0d0
OPEN(unit=100,file='merged.dat',form='unformatted', status='new', position='append', action='write')
DO i=0, npool - 1
OPEN(unit=10+i,form='unformatted', status ='old', access='direct', recl = unf_recl )
READ(unit=10+i, rec=1) array_cpu(:)
CLOSE(unit=10+i)
WRITE(unit=100) array_cpu(:)
ENDDO
CLOSE(unit=100)
DEALLOCATE (array_cpu)
ENDIF
call cpu_time(finish)
!Print time
CALL MPI_BARRIER(world_comm,error)
IF (my_pool_id==0) print*, ' Fortran time', finish-start
!Write using MPI
CALL MPI_BARRIER(world_comm,error)
!
CALL cpu_time(start)
!
lsize = INT( arr_size , kind = MPI_OFFSET_KIND)
pos = 0
pos2 = 0
CALL MPI_FILE_OPEN(world_comm, 'MPI.dat',MPI_MODE_WRONLY + MPI_MODE_CREATE,MPI_INFO_NULL,iunepmat,ierr)
DO i=1, loop
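! absolute byte offset of this rank's block in this iteration (8 bytes per REAL(DP) element)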
pos = pos2 + INT( arr_size * (my_pool_id), kind = MPI_OFFSET_KIND ) * 8_MPI_OFFSET_KIND
CALL MPI_FILE_SEEK(iunepmat, pos, MPI_SEEK_SET, ierr)
CALL MPI_FILE_WRITE(iunepmat, trans_prob, lsize, MPI_DOUBLE_PRECISION,MPI_STATUS_IGNORE,ierr)
pos2 = pos2 + INT( arr_size * (npool -1), kind = MPI_OFFSET_KIND ) * 8_MPI_OFFSET_KIND
ENDDO
!
CALL MPI_FILE_CLOSE(iunepmat,ierr)
CALL cpu_time(finish)
CALL MPI_BARRIER(world_comm,error)
IF (my_pool_id==0) print*, ' MPI time', finish-start
DEALLOCATE (trans_prob)
END PROGRAM
The code is compiled with:
mpif90 -O3 -x f95-cpp-input -D__FFTW -D__MPI -D__SCALAPACK test_mpi2.f90 -o a.x
and then run in parallel with 4 cores:
mpirun -np 4 ./a.x
I get the following results:
Loop size 1
array size 10,240,000
File size: 313 MB
Fortran time 0.237030014 sec
MPI time 0.164155006 sec
Loop size 10
array size 1,024,000
File size: 313 MB
Fortran time 0.242821991 sec
MPI time 0.172048002 sec
Loop size 100
array size 102,400
File size: 313 MB
Fortran time 0.235879987 sec
MPI time 9.78289992E-02 sec
Loop size 50
array size 1,024,000
File size: 1.6 GB
Fortran time 1.60272002 sec
MPI time 3.40623116 sec
Loop size 500
array size 102,400
File size: 1.6 GB
Fortran time 1.44547606 sec
MPI time 3.38340592 sec
As you can see, the MPI performance degrades significantly for larger files. Is it possible to improve MPI performance for large files?
Is this behavior expected?

Fortran "undefined reference to [function]" error that goes away by compiling the .f90 files directly

Before someone closes the question, yes, there are many questions that seem similar, but so far I haven't found one with this exact weird problem that seems to go away only sometimes.
I had an odd Fortran error while trying to make a module for linear regression.
The module is named "LSQregression" and the main program "LSQAdvertising". Compiling it using the following gfortran command works:
gfortran ../LSQregression.f90 LinearAdvertising.f90 -llapack -o LinAd
However, I'd like to be able to turn my module into a .o file and link it with whatever program I may need instead of compiling again every time. So I tried to do that:
gfortran -c ../LSQregression.f90
And then link it with my program file like that:
gfortran ../LSQregression.o LinearAdvertising.f90 -llapack -o LinAd
I also tried turning the program into a .o file. Neither approach works; they both return the following error:
/usr/bin/ld: /tmp/ccf7T1aj.o: in function `MAIN__':
LinearAdvertising.f90:(.text+0x1f8b): undefined reference to `__lsqregression_MOD_lsqestimate_simple'
collect2: error: ld returned 1 exit status
The error refers to the following function in the module, which is called here inside the program:
print *, LSQestimate(test,LSQbeta(inputs,sales,(/1,2/)) )
The function itself is defined as part of an interface:
interface LSQestimate
procedure LSQestimate_simple, LSQestimate_using
end interface LSQestimate
real(8) function LSQestimate_simple(X, beta)
implicit none
real(8), dimension(:,:) :: X, beta
real(8) :: boundary, Y
integer :: i, Xlen, betalen
Xlen = size(X, 1)
betalen = size(beta, 1)
if (Xlen + 1 .ne. betalen) stop "incompatible beta and X"
Y = beta(1,1)
do i = 1, Xlen
Y = Y + beta(i+1, 1)*X(i,1)
end do
LSQestimate_simple = Y
end function LSQestimate_simple
It's very odd: I've done the same thing with another function and it seems to work fine. The problem only happens when I turn the module into a .o file first, and it goes away if I compile the .f90 file directly. I can't figure out why one works and the other doesn't.
EDIT: Someone told me to run nm ../LSQregression.o | grep -i regression. I have no idea what that does, but I did it, so here are the results if they help at all:
0000000000002768 T __lsqregression_MOD_inv
0000000000000b85 T __lsqregression_MOD_lsqbeta_simple
000000000000035c T __lsqregression_MOD_lsqbeta_using
0000000000000000 T __lsqregression_MOD_lsqdecision
Oddly enough the name of the function in question doesn't even appear here.
EDIT 2: Decided to post the entire code of the module:
module LSQregression
implicit none
public :: LSQbeta, LSQestimate
interface LSQbeta
procedure LSQbeta_simple, LSQbeta_using
end interface LSQbeta
interface LSQestimate
procedure LSQestimate_simple, LSQestimate_using
end interface LSQestimate
contains
function inv(A) result(Ainv)
implicit none
real(8), dimension(:,:), intent(in) :: A
real(8), dimension(size(A,1),size(A,2)) :: Ainv
real(8), dimension(size(A,1)) :: work ! work array for LAPACK
integer, dimension(size(A,1)) :: ipiv ! pivot indices
integer :: n, info
! External procedures defined in LAPACK
external DGETRF
external DGETRI
! Store A in Ainv to prevent it from being overwritten by LAPACK
Ainv = A
n = size(A,1)
! DGETRF computes an LU factorization of a general M-by-N matrix A
! using partial pivoting with row interchanges.
call DGETRF(n, n, Ainv, n, ipiv, info)
if (info /= 0) then
stop 'Matrix is numerically singular!'
end if
! DGETRI computes the inverse of a matrix using the LU factorization
! computed by DGETRF.
call DGETRI(n, Ainv, n, ipiv, work, n, info)
if (info /= 0) then
stop 'Matrix inversion failed!'
end if
end function inv
!----------------------------------------------------------------------
function LSQbeta_simple(Xold, Y) result(beta)
real(8), dimension(:,:) :: Xold, Y
real(8), dimension(size(Xold,1), size(Xold,2)+1) :: X
real(8), dimension(:,:), allocatable :: beta, XTX, IXTX, pinv
integer :: rowsX, colsX, rowsY, colsY,i
rowsX = size(X(:,1))
colsX = size(X(1,:))
rowsY = size(Y(:,1))
colsY = size(Y(1,:))
if (colsY .ne. 1) stop 'Inappropriate y'
if (rowsY .ne. rowsX) stop "Y and X rows don't match"
!---X-ify-----------
X(:,1) = 1.0 !includes 1 in the first column
do i = 2, colsX
X(:,i) = Xold(:,i-1)
end do
!--------------------
allocate(XTX(colsX,colsX))
allocate(IXTX(colsX,colsX))
allocate(pinv(colsX,rowsY))
allocate(beta(colsX,1))
XTX = matmul(transpose(X),X)
IXTX = inv(XTX)
pinv = matmul(IXTX, transpose(X))
beta = matmul(pinv, Y)
deallocate(pinv)
deallocate(XTX)
deallocate(IXTX)
return
deallocate(beta)
end function LSQbeta_simple
!--------------------------------------------
function LSQbeta_using(Xold, Y, indexes) result(beta)
implicit none
real(8), dimension(:,:) :: Xold, Y
integer, dimension(:) :: indexes
real(8), dimension(size(indexes)+1,1) :: beta
real(8), dimension(size(Xold,1),size(indexes)) :: X
integer :: indLen,i
indLen = size(indexes)
do i = 1, indLen
X(:,i) = Xold(:,indexes(i))
end do
beta = LSQbeta_simple(X,Y)
end function LSQbeta_using
real(8) function LSQestimate_simple(X, beta)
implicit none
real(8), dimension(:,:) :: X, beta
real(8) :: boundary, Y
integer :: i, Xlen, betalen
Xlen = size(X, 1)
betalen = size(beta, 1)
if (Xlen + 1 .ne. betalen) stop "incompatible beta and X"
Y = beta(1,1)
do i = 1, Xlen
Y = Y + beta(i+1, 1)*X(i,1)
end do
LSQestimate_simple = Y
end function LSQestimate_simple
real(8) function LSQestimate_using()
LSQestimate_using = 1.0D0
end function LSQestimate_using
logical function LSQdecision(X, beta, boundary)
implicit none
real(8), dimension(:,:) :: X, beta
real(8) :: boundary
LSQdecision = LSQestimate(X, beta) > boundary
end function LSQdecision
end module LSQregression

Definition of operator(+) with multiple addends in Fortran with derived types: an issue with an allocatable array

I am trying to define the (+) operator between Fortran derived types that describe matrices (linear operators).
My goal is to implicitly define a matrix M = M1 + M2 + M3 such that, given a vector x, Mx = M1x + M2x + M3x.
First, I defined an abstract type (abs_linop) with the abstract interface for a matrix-vector multiplication (y = M*x).
Then, I built a derived type (add_linop), extending the abstract type (abs_linop).
The operator (+) is defined for the type (add_linop). I then create a concrete type (eye), extending the abstract type (abs_linop), that describes the identity matrix. This type is used in the main program. This is the source code:
module LinearOperator
implicit none
private
public :: abs_linop,multiplication
type, abstract :: abs_linop
integer :: nrow=0
integer :: ncol=0
character(len=20) :: name='empty'
contains
!> Procedure for computation of (matrix) times (vector)
procedure(multiplication), deferred :: Mxv
end type abs_linop
abstract interface
!>-------------------------------------------------------------
!> Abstract procedure defining the interface for a general matrix-vector multiplication
!<-------------------------------------------------------------
subroutine multiplication(this,vec_in,vec_out,info,lun_err)
import abs_linop
implicit none
class(abs_linop), intent(inout) :: this
real(kind=8), intent(in ) :: vec_in(this%ncol)
real(kind=8), intent(inout) :: vec_out(this%nrow)
integer, optional, intent(inout) :: info
integer, optional, intent(in ) :: lun_err
end subroutine multiplication
end interface
!>---------------------------------------------------------
!> Structure variable for Identity matrix
!> (rectangular case included)
!>---------------------------------------------------------
type, extends(abs_linop), public :: eye
contains
!> Static constructor
procedure, public, pass :: init => init_eye
!> Compute matrix times vector operation
procedure, public, pass :: Mxv => apply_eye
end type eye
!>----------------------------------------------------------------
!> Structure variable to build implicit matrix defined
!> as composition and sum of linear operator
!>----------------------------------------------------------------
public :: add_linop, operator(+)
type, extends(abs_linop) :: add_linop
class(abs_linop) , pointer :: matrix_1
class(abs_linop) , pointer :: matrix_2
real(kind=8), allocatable :: scr(:)
contains
procedure, public , pass:: Mxv => add_Mxv
end type add_linop
INTERFACE OPERATOR (+)
module PROCEDURE mmsum
END INTERFACE OPERATOR (+)
contains
!>------------------------------------------------------
!> Function that, given two linear operators A1 and A2,
!> defines, implicitly, the linear operator
!> A=A1+A2
!> (public procedure for class add_linop)
!>
!> usage:
!> 'var' = A1 + A2
!<-------------------------------------------------------------
function mmsum(matrix_1,matrix_2) result(this)
implicit none
class(abs_linop), target, intent(in) :: matrix_1
class(abs_linop), target, intent(in) :: matrix_2
type(add_linop) :: this
! local
integer :: res
character(len=20) :: n1,n2
if (matrix_1%nrow .ne. matrix_2%nrow) &
write(*,*) 'Error mmproc dimension must agree '
if (matrix_1%ncol .ne. matrix_2%ncol) &
write(*,*) 'Error mmproc dimension must agree '
this%matrix_1 => matrix_1
this%matrix_2 => matrix_2
this%nrow = matrix_1%nrow
this%ncol = matrix_2%ncol
this%name=etb(matrix_1%name)//'+'//etb(matrix_2%name)
write(*,*) 'Sum Matrix initialization '
write(*,*) 'M1 : ',this%matrix_1%name
write(*,*) 'M2 : ',this%matrix_2%name
write(*,*) 'sum : ',this%name
allocate(this%scr(this%nrow),stat=res)
contains
function etb(strIn) result(strOut)
implicit none
! vars
character(len=*), intent(in) :: strIn
character(len=len_trim(adjustl(strIn))) :: strOut
strOut=trim(adjustl(strIn))
end function etb
end function mmsum
recursive subroutine add_Mxv(this,vec_in,vec_out,info,lun_err)
implicit none
class(add_linop), intent(inout) :: this
real(kind=8), intent(in ) :: vec_in(this%ncol)
real(kind=8), intent(inout) :: vec_out(this%nrow)
integer, optional, intent(inout) :: info
integer, optional, intent(in ) :: lun_err
write(*,*) 'Matrix vector multipliction',&
'matrix:',this%name,&
'M1: ',this%matrix_1%name,&
'M2: ',this%matrix_2%name
select type (mat=>this%matrix_1)
type is (add_linop)
write(*,*) 'is allocated(mat%scr) ?', allocated(mat%scr)
end select
call this%matrix_1%Mxv(vec_in,this%scr,info=info,lun_err=lun_err)
call this%matrix_2%Mxv(vec_in,vec_out,info=info,lun_err=lun_err)
vec_out = this%scr + vec_out
end subroutine add_Mxv
subroutine init_eye(this,nrow)
implicit none
class(eye), intent(inout) :: this
integer, intent(in ) :: nrow
this%nrow = nrow
this%ncol = nrow
end subroutine init_eye
subroutine apply_eye(this,vec_in,vec_out,info,lun_err)
class(eye), intent(inout) :: this
real(kind=8), intent(in ) :: vec_in(this%ncol)
real(kind=8), intent(inout) :: vec_out(this%nrow)
integer, optional, intent(inout) :: info
integer, optional, intent(in ) :: lun_err
! local
integer :: mindim
vec_out = vec_in
if (present(info)) info=0
end subroutine apply_eye
end module LinearOperator
program main
use LinearOperator
implicit none
real(kind=8) :: x(2),y(2),z(2),t(2)
type(eye) :: id1,id2,id3
type(add_linop) :: sum12,sum23,sum123_ok,sum123_ko
integer :: i
call id1%init(2)
id1%name='I1'
call id2%init(2)
id2%name='I2'
call id3%init(2)
id3%name='I3'
x=1.0d0
y=1.0d0
z=1.0d0
write(*,*) ' Vector x =', x
call id1%Mxv(x,t)
write(*,*) ' Vector t = I1 *x', t
write(*,*) ' '
sum12 = id1 + id2
call sum12%Mxv(x,t)
write(*,*) ' Vector t = (I1 +I2) *x', t
write(*,*) ' '
sum23 = id2 + id3
sum123_ok = id1 + sum23
call sum123_ok%Mxv(x,t)
write(*,*) ' Vector t = ( I1 + (I2 + I3) )*x', t
write(*,*) ' '
sum123_ko = id1 + id2 + id3
call sum123_ko%Mxv(x,t)
write(*,*) ' Vector t = ( I1 +I2 + I3) *x', t
end program main
I compile this code with gfortran version 7.5.0 and flags
"-g -C -Wall -fcheck=all -O -ffree-line-length-none -mcmodel=medium "
and this is what I get
Vector x = 1.0000000000000000 1.0000000000000000
Vector t = I1 *x 1.0000000000000000 1.0000000000000000
Sum Matrix initialization
M1 : I1
M2 : I2
sum : I1+I2
Matrix vector multiplictionmatrix:I1+I2 M1: I1 M2: I2
Vector t = (I1 +I2) *x 2.0000000000000000 2.0000000000000000
Sum Matrix initialization
M1 : I2
M2 : I3
sum : I2+I3
Sum Matrix initialization
M1 : I1
M2 : I2+I3
sum : I1+I2+I3
Matrix vector multiplictionmatrix:I1+I2+I3 M1: I1 M2: I2+I3
Matrix vector multiplictionmatrix:I2+I3 M1: I2 M2: I3
Vector t = ( I1 + (I2 + I3) )*x 3.0000000000000000 3.0000000000000000
Sum Matrix initialization
M1 : I1
M2 : I2
sum : I1+I2
Sum Matrix initialization
M1 : I1+I2
M2 : I3
sum : I1+I2+I3
Matrix vector multiplictionmatrix:I1+I2+I3 M1: I1+I2 M2: I3
is allocated(mat%scr) ? F
Matrix vector multiplictionmatrix:I1+I2 M1: I1 M2: I2
At line 126 of file LinearOperator.f90
Fortran runtime error: Allocatable actual argument 'this' is not allocated
Everything works fine when I use the (+) operator with 2 terms, but when 3 terms are used there is an issue with the allocatable array scr, a member of type (add_linop), which is not allocated.
Does anybody know the reason for this issue and how to solve it?
I include the Makefile used for compiling the code.
#Gfortran compiler
FC = gfortran
OPENMP = -fopenmp
MODEL = -mcmodel=medium
OFLAGS = -O5 -ffree-line-length-none
DFLAGS = -g -C -Wall -fcheck=all -O -ffree-line-length-none
#DFLAGS = -g -C -Wall -ffree-line-length-none -fcheck=all
PFLAGS = -pg
CPPFLAGS = -D_GFORTRAN_COMP
ARFLAGS =
ODIR = objs
MDIR = mods
LDIR = libs
INCLUDE = -J$(MODDIR)
OBJDIR = $(CURDIR)/$(ODIR)
MODDIR = $(CURDIR)/$(MDIR)
LIBDIR = $(CURDIR)/$(LDIR)
INCLUDE += -I$(MODDIR)
FFLAGS = $(OFLAGS) $(MODEL) $(INCLUDE)
LIBSRCS =
DEST = .
EXTHDRS =
HDRS =
LIBS = -llapack -lblas
LIBMODS =
LDFLAGS = $(MODEL) $(INCLUDE) -L. -L/usr/lib -L/usr/local/lib -L$(LIBDIR)
LINKER = $(FC)
MAKEFILE = Makefile
PRINT = pr
CAT = cat
PROGRAM = main.out
SRCS = LinearOperator.f90
OBJS = LinearOperator.f90
PRJS= $(SRCS:jo=.prj)
OBJECTS = $(SRCS:%.f90=$(OBJDIR)/%.o)
MODULES = $(addprefix $(MODDIR)/,$(MODS))
.SUFFIXES: .prj .f90
print-% :
@echo $* = $($*)
.f.prj:
ftnchek -project -declare -noverbose $<
.f90.o:
$(FC) $(FFLAGS) $(INCLUDE) -c $<
all::
@make dirs
@make $(PROGRAM)
$(PROGRAM): $(LIBS) $(MODULES) $(OBJECTS)
$(LINKER) -o $(PROGRAM) $(LDFLAGS) $(OBJECTS) $(LIBS)
$(LIBS):
@set -e; for i in $(LIBSRCS); do cd $$i; $(MAKE) --no-print-directory -e CURDIR=$(CURDIR); cd $(CURDIR); done
$(OBJECTS): $(OBJDIR)/%.o: %.f90
$(FC) $(CPPFLAGS) $(FFLAGS) -o $@ -c $<
dirs:
@-mkdir -p $(OBJDIR) $(MODDIR) $(LIBDIR)
clean-emacs:
@-rm -f $(CURDIR)/*.*~
@-rm -f $(CURDIR)/*\#*
check: $(PRJS)
ftnchek -noverbose -declare $(PRJS) -project -noextern -library > $(PROGRAM).ftn
profile:; @make "FFLAGS=$(PFLAGS) $(MODEL) " "CFLAGS=$(PFLAGS) $(MODEL)" "LDFLAGS=$(PFLAGS) $(LDFLAGS)" $(PROGRAM)
debug:; @make "FFLAGS=$(DFLAGS) $(MODEL) $(INCLUDE)" "LDFLAGS=$(DFLAGS) $(LDFLAGS)" $(PROGRAM)
openmp:; @make "FFLAGS=$(OFLAGS) $(OPENMP) $(MODEL) $(INCLUDE)" "LDFLAGS=$(LDFLAGS) $(OPENMP)" $(PROGRAM)
clean:; @rm -f $(OBJECTS) $(MODULES) $(PROGRAM).cat $(PROGRAM).ftn
@set -e; for i in $(LIBSRCS); do cd $$i; $(MAKE) --no-print-directory clean; cd $(CURDIR); done
clobber:; @rm -f $(OBJECTS) $(MODULES) $(PROGRAM).cat $(PROGRAM).ftn $(PROGRAM)
@-rm -rf $(OBJDIR) $(MODDIR) $(LIBDIR)
@-rm -f $(CURDIR)/*.*~
@-rm -f $(CURDIR)/*\#*
.PHONY: mods
index:; ctags -wx $(HDRS) $(SRCS)
install: $(PROGRAM)
install -s $(PROGRAM) $(DEST)
print:; $(PRINT) $(HDRS) $(SRCS)
cat:; $(CAT) $(HDRS) $(SRCS) > $(PROGRAM).cat
program: $(PROGRAM)
profile: $(PROFILE)
tags: $(HDRS) $(SRCS); ctags $(HDRS) $(SRCS)
update: $(DEST)/$(PROGRAM)
main.o: linearoperator.mod
# DO NOT EDIT --- auto-generated file
linearoperator.mod : LinearOperator.f90
$(FC) $(FCFLAGS) -c $<
Your program is not valid Fortran.
The function result of mmsum has a pointer component which, during the execution of the function, is pointer associated with a dummy argument. This dummy argument (correctly for this use) has the target attribute. However, the actual argument does not have the target attribute: when the function execution completes the pointer component becomes of undefined pointer association status.
In the subroutine add_Mxv there is an attempt to dereference this pointer. This is not allowed.
It will be necessary to revisit how the operands are handled in your data type. Note in particular that an expression cannot have the target attribute: in the case of id1+id2+id3 the id1+id2 expression won't usefully remain as something to reference later on.
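One possible way to revisit this (a sketch of an alternative design, not taken from the post above) is to let add_linop own deep copies of its operands through allocatable components instead of pointers, so that nothing stored in the sum can outlive its target:
type, extends(abs_linop) :: add_linop
class(abs_linop), allocatable :: matrix_1
class(abs_linop), allocatable :: matrix_2
real(kind=8), allocatable :: scr(:)
contains
procedure, public, pass :: Mxv => add_Mxv
end type add_linop
function mmsum(matrix_1,matrix_2) result(this)
implicit none
class(abs_linop), intent(in) :: matrix_1
class(abs_linop), intent(in) :: matrix_2
type(add_linop) :: this
! sourced allocation deep-copies each operand, so the result of
! id1 + id2 + id3 no longer references the temporary id1 + id2
allocate(this%matrix_1, source=matrix_1)
allocate(this%matrix_2, source=matrix_2)
this%nrow = matrix_1%nrow
this%ncol = matrix_2%ncol
this%name = trim(adjustl(matrix_1%name))//'+'//trim(adjustl(matrix_2%name))
allocate(this%scr(this%nrow))
end function mmsum
The copies cost memory for large operands, but they remove any dependence on the lifetime or target attribute of the actual arguments, so a nested sum built in a single expression should then be safe to apply with Mxv.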

Segmentation fault with deferred functions and non_overridable keyword

I am developing an object-oriented Fortran code for numerical optimization, with polymorphism supported by abstract types. As good TDD practice, I'm trying to write all optimization tests against the abstract type class(generic_optimizer); they should then be run by each instantiated type, e.g., type(newton_raphson).
All the optimization tests feature a call to my_problem%solve(...), which is defined as deferred in the abstract type and of course has a different implementation in each derived type.
The issue is: if, in each non-abstract type, I define the deferred function as non_overridable, I get a segmentation fault such as:
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) where
#0 0x0000000000000000 in ?? ()
#1 0x0000000000913efe in __newton_raphson_MOD_nr_solve ()
#2 0x00000000008cfafa in MAIN__ ()
#3 0x00000000008cfb2b in main ()
#4 0x0000003a3c81ed5d in __libc_start_main () from /lib64/libc.so.6
#5 0x00000000004048f9 in _start ()
After some trial and error, I've noticed that I can avoid the error if I remove the non_overridable declaration. In this case that is not a problem, but I wanted to enforce it, since two levels of polymorphism are unlikely for this code. Was I violating any requirement of the standard instead?
Here is a sample code that reproduces the error. I've been testing it with gfortran 5.3.0 and 6.1.0.
module generic_type_module
implicit none
private
type, abstract, public :: generic_type
real(8) :: some_data
contains
procedure (sqrt_interface), deferred :: square_root
procedure, non_overridable :: sqrt_test
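! sqrt_test is implemented once below and calls the deferred square_root binding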
end type generic_type
abstract interface
real(8) function sqrt_interface(this,x) result(sqrtx)
import generic_type
class(generic_type), intent(in) :: this
real(8), intent(in) :: x
end function sqrt_interface
end interface
contains
subroutine sqrt_test(this,x)
class(generic_type), intent(in) :: this
real(8), intent(in) :: x
print *, 'sqrt(',x,') = ',this%square_root(x)
end subroutine sqrt_test
end module generic_type_module
module actual_types_module
use generic_type_module
implicit none
private
type, public, extends(generic_type) :: crashing
real(8) :: other_data
contains
procedure, non_overridable :: square_root => crashing_square_root
end type crashing
type, public, extends(generic_type) :: working
real(8) :: other_data
contains
procedure :: square_root => working_square_root
end type working
contains
real(8) function crashing_square_root(this,x) result(sqrtx)
class(crashing), intent(in) :: this
real(8), intent(in) :: x
sqrtx = sqrt(x)
end function crashing_square_root
real(8) function working_square_root(this,x) result(sqrtx)
class(working), intent(in) :: this
real(8), intent(in) :: x
sqrtx = sqrt(x)
end function working_square_root
end module actual_types_module
program deferred_test
use actual_types_module
implicit none
type(crashing) :: crashes
type(working) :: works
call works%sqrt_test(2.0_8)
call crashes%sqrt_test(2.0_8)
end program
To narrow down the problem, I removed the abstract attribute and the data members from the OP's code, obtaining
module types
implicit none
type :: Type1
contains
procedure :: test
procedure :: square => Type1_square
endtype
type, extends(Type1) :: Type2
contains
procedure, non_overridable :: square => Type2_square
endtype
contains
subroutine test( this, x )
class(Type1) :: this
real :: x
print *, "square(", x, ") = ",this % square( x )
end subroutine
function Type1_square( this, x ) result( y )
class(Type1) :: this
real :: x, y
y = -100 ! dummy
end function
function Type2_square( this, x ) result( y )
class(Type2) :: this
real :: x, y
y = x**2
end function
end module
program main
use types
implicit none
type(Type1) :: t1
type(Type2) :: t2
call t1 % test( 2.0 )
call t2 % test( 2.0 )
end program
With this code, gfortran-6 gives
square( 2.00000000 ) = -100.000000
square( 2.00000000 ) = -100.000000
while ifort-{14,16} and Oracle fortran 12.5 give
square( 2.000000 ) = -100.0000
square( 2.000000 ) = 4.000000
I also tried replacing the functions with subroutines (to print which routines are actually called):
subroutine test( this, x )
class(Type1) :: this
real :: x, y
call this % square( x, y )
print *, "square(", x, ") = ", y
end subroutine
subroutine Type1_square( this, x, y )
class(Type1) :: this
real :: x, y
print *, "Type1_square:"
y = -100 ! dummy
end subroutine
subroutine Type2_square( this, x, y )
class(Type2) :: this
real :: x, y
print *, "Type2_square:"
y = x**2
end subroutine
with all the other parts kept the same. Then, gfortran-6 gives
Type1_square:
square( 2.00000000 ) = -100.000000
Type1_square:
square( 2.00000000 ) = -100.000000
while ifort-{14,16} and Oracle fortran 12.5 give
Type1_square:
square( 2.000000 ) = -100.0000
Type2_square:
square( 2.000000 ) = 4.000000
If I remove non_overridable from the above codes, gfortran gives the same result as the other compilers. So this may be an issue specific to gfortran + non_overridable (if the above code is standard-conforming)...
(The reason the OP got a segmentation fault may be that gfortran accessed the deferred procedure in the parent type (generic_type), which has a null pointer; if this is the case, the story becomes consistent.)
Edit
The same exceptional behavior of gfortran also occurs when we declare Type1 as abstract. Specifically, if we change the definition of Type1 as
type, abstract :: Type1 ! now an abstract type (cannot be instantiated)
contains
procedure :: test
procedure :: square => Type1_square
endtype
and the main program as
program main
use types
implicit none
type(Type2) :: t2
call t2 % test( 2.0 )
end program
we get
ifort-16 : square( 2.000000 ) = 4.000000
oracle-12.5 : square( 2.0 ) = 4.0
gfortran-6 : square( 2.00000000 ) = -100.000000
If we further make square() in Type1 deferred (i.e., no implementation given), making the code almost equivalent to the OP's case,
type, abstract :: Type1 ! now an abstract type (cannot be instantiated)
contains
procedure :: test
procedure(Type1_square), deferred :: square ! has no implementation yet
endtype
abstract interface
function Type1_square( this, x ) result( y )
import
class(Type1) :: this
real :: x, y
end function
end interface
then ifort-16 and Oracle-12.5 give 4.0 with call t2 % test( 2.0 ), while gfortran-6 results in a segmentation fault. Indeed, if we compile as
$ gfortran -fsanitize=address test.f90 # on Linux x86_64
we get
ASAN:SIGSEGV (<-- or "ASAN:DEADLYSIGNAL" on OSX 10.9)
=================================================================
==22045==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000
(pc 0x000000000000 bp 0x7fff1d23ecd0 sp 0x7fff1d23eac8 T0)
==22045==Hint: pc points to the zero page.
So overall, it seems as if the binding name square() in Type1 (which has no implementation) is called erroneously by gfortran (possibly through a null pointer). More importantly, if we drop non_overridable from the definition of Type2, gfortran also gives 4.0 (with no segmentation fault).
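For reference, the change that makes gfortran agree with the other compilers here is simply to drop non_overridable from the overriding binding in Type2:
type, extends(Type1) :: Type2
contains
procedure :: square => Type2_square   ! no non_overridable: gfortran-6 now dispatches to Type2_square
endtype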

PETSc - MatMultScale? Matrix X vector X scalar

I'm using PETSc and I wanted to compute something like $y = \frac{1}{2} A x$, i.e. matrix times vector times scalar.
I know I can do:
Mat A
Vec x,y
MatMult(A,x,y)
VecScale(y,0.5)
I was just curious if there is a function that would do all of these in one shot. It seems like it would save a loop.
MatMultScale(A,x,0.5,y)
Does such a function exist?
This function (or anything close) does not seem to be in the list of functions operating on Mat. So a brief answer to your question would be... no.
If you often use $y=\frac12 Ax$, a solution would be to scale the matrix once and for all, using MatScale(A,0.5);.
Would such a function be useful? One way to check is to use the -log_summary option of PETSc to get some profiling information. If your matrix is dense, you will see that the time spent in MatMult() is much larger than the time spent in VecScale(). The question is meaningful only if a sparse matrix is handled, with a few nonzero terms per row.
Here is a code to test it, using 2xIdentity as the matrix:
static char help[] = "Tests solving linear system on 0 by 0 matrix.\n\n";
#include <petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
Vec x, y;
Mat A;
PetscReal alpha=0.5;
PetscErrorCode ierr;
PetscInt n=42;
PetscInitialize(&argc,&args,(char*)0,help);
ierr = PetscOptionsGetInt(NULL,"-n",&n,NULL);CHKERRQ(ierr);
/* Create the vector*/
ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
ierr = VecSetFromOptions(x);CHKERRQ(ierr);
ierr = VecDuplicate(x,&y);CHKERRQ(ierr);
/*
Create matrix. When using MatCreate(), the matrix format can
be specified at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatSetUp(A);CHKERRQ(ierr);
/*
This matrix is diagonal, two times identity
should have preallocated, shame
*/
PetscInt i,col;
PetscScalar value=2.0;
for (i=0; i<n; i++) {
col=i;
ierr = MatSetValues(A,1,&i,1,&col,&value,INSERT_VALUES);CHKERRQ(ierr);
}
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
/*
let's do this 42 times for nothing :
*/
for(i=0;i<42;i++){
ierr = MatMult(A,x,y);CHKERRQ(ierr);
ierr = VecScale(y,alpha);CHKERRQ(ierr);
}
ierr = VecDestroy(&x);CHKERRQ(ierr);
ierr = VecDestroy(&y);CHKERRQ(ierr);
ierr = MatDestroy(&A);CHKERRQ(ierr);
ierr = PetscFinalize();
return 0;
}
The makefile:
include ${PETSC_DIR}/conf/variables
include ${PETSC_DIR}/conf/rules
include ${PETSC_DIR}/conf/test
CLINKER=g++
all : ex1
ex1 : main.o chkopts
${CLINKER} -w -o main main.o ${PETSC_LIB}
${RM} main.o
run :
mpirun -np 2 main -n 10000000 -log_summary -help -mat_type mpiaij
And here the resulting two lines of -log_summary that could answer your question :
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecScale 42 1.0 1.0709e+00 1.0 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 4 50 0 0 0 4 50 0 0 0 392
MatMult 42 1.0 5.7360e+00 1.1 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 50 0 0 0 20 50 0 0 0 73
So the 42 VecScale() operations took 1 second while the 42 MatMult() operations took 5.7 seconds. Suppressing the VecScale() operations would speed up the code by 20% in the best case, and the overhead of the for loop is even lower than that. I guess that's the reason why this function does not exist.
I apologize for the poor performance of my computer (392 MFlops for VecScale()...). I am curious to know what happens on yours!

Faster way to structure operations on offset neighborhoods in OpenCL

How can an operation on many overlapping but offset blocks of a 2D array be structured for more efficient execution in OpenCL?
For example, I have the following OpenCL kernel:
__kernel void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int2 pos0 = (int2)(pos.x - pos.x % 16, pos.y - pos.y % 16);
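// pos0 is the top-left corner of the aligned 16x16 block containing this work item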
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) -
read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j));
}
}
write_imageui(dest, pos, diff);
}
It produces correct results, but is slow... only ~25 GFLOPS on NVS4200M with 1k by 1k input. (The hardware spec is 155 GFLOPS). I'm guessing this has to do with the memory access patterns. Each work item reads one 16x16 block of data which is the same as all its neighbors in a 16x16 area, and also another offset block of data most of the time overlaps with that of its immediate neighbors. All reads are through samplers. The host program is PyOpenCL (I don't think that actually changes anything) and the work-group size is 16x16.
EDIT: New version of kernel per suggestion below, copy work area to local variables:
__kernel __attribute__((reqd_work_group_size(16, 16, 1)))
void test_kernel(
read_only image2d_t src,
write_only image2d_t dest,
const int width,
const int height
)
{
const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
int2 pos = (int2)(get_global_id(0), get_global_id(1));
int dx = pos.x % 16;
int dy = pos.y % 16;
__local uint4 local_src[16*16];
__local uint4 local_src2[32*32];
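// each work item loads one pixel of its 16x16 tile into local_src and four pixels into the
// 32x32 tile local_src2, which covers all the offset reads done in the loop below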
local_src[(pos.y % 16) * 16 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, pos);
local_src2[(pos.y % 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16)] = read_imageui(src, sampler, (int2)(pos.x, pos.y + 16));
local_src2[(pos.y % 16 + 16) * 32 + (pos.x % 16) + 16] = read_imageui(src, sampler, (int2)(pos.x + 16, pos.y + 16));
barrier(CLK_LOCAL_MEM_FENCE);
uint4 diff = (uint4)(0, 0, 0, 0);
for (int i=0; i<16; i++)
{
for (int j=0; j<16; j++)
{
diff += local_src[ j*16 + i ] - local_src2[ (j+dy)*32 + i+dx ];
}
}
write_imageui(dest, pos, diff);
}
Result: output is correct, running time is 56% slower. If using local_src only (not local_src2), the result is ~10% faster.
EDIT: Benchmarked on much more powerful hardware: an AMD Radeon HD 7850 gets 420 GFLOPS; the spec is 1751 GFLOPS. To be fair, the spec is for multiply-add, and there is no multiply here, so the expected peak is ~875 GFLOPS, but this is still off by quite a lot compared to the theoretical performance.
EDIT: To ease running tests for anyone who would like to try this out, the host-side program in PyOpenCL below:
import pyopencl as cl
import numpy
import numpy.random
from time import time
CL_SOURCE = '''
// kernel goes here
'''
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)
prg = cl.Program(ctx, CL_SOURCE).build()
h, w = 1024, 1024
src = numpy.zeros((h, w, 4), dtype=numpy.uint8)
src[:,:,:] = numpy.random.rand(h, w, 4) * 255
mf = cl.mem_flags
src_buf = cl.image_from_array(ctx, src, 4)
fmt = cl.ImageFormat(cl.channel_order.RGBA, cl.channel_type.UNSIGNED_INT8)
dest_buf = cl.Image(ctx, mf.WRITE_ONLY, fmt, shape=(w, h))
# warmup
for n in range(10):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
# benchmark
t1 = time()
for n in range(100):
event = prg.test_kernel(queue, (w, h), (16,16), src_buf, dest_buf, numpy.int32(w), numpy.int32(h))
event.wait()
t2 = time()
print "Duration (host): ", (t2-t1)/100
print "Duration (event): ", (event.profile.end-event.profile.start)*1e-9
EDIT: Thinking about the memory access patterns, the original naive version may be pretty good; when calling read_imageui(src, sampler, (int2)(pos0.x + i, pos0.y + j)) all work-items in a work group are reading the same location (so this is just one read??), and when calling read_imageui(src, sampler, (int2)(pos.x + i, pos.y + j)) they are reading sequential locations (so the reads can be coalesced perfectly??).
This is definitely a memory access problem. Neighbouring work items' pixels can overlap by as much as 15x16, and worse yet, each work item will overlap at least 225 others.
I would use local memory and get work groups to cooperatively process many 16x16 blocks. I like to use a large, square block for each work group. Rectangular blocks are a bit more complicated, but can get better memory utilization for you.
If you read blocks of n by n pixels from your source image, the borders will overlap by n x 15 (or 15 x n). You need to calculate the largest possible value of n based on your available local memory size (LDS). If you are using OpenCL 1.1 or greater, the LDS is at least 32 kB; OpenCL 1.0 promises 16 kB per work group.
n <= sqrt(32kb / sizeof(uint4))
n <= sqrt(32768 / 16)
n ~ 45
Using n=45 will use 32400 out of 32768 bytes of the LDS, and let you use 900 work items per group (45-15)^2 = 900. Note: Here's where a rectangular block would help out; for example 64x32 would use all of the LDS, but with group size = (64-15)*(32-15) = 833.
steps to use LDS for your kernel:
allocate a 1D or 2D local array for your cached block of the image. I use a #define constant, and it rarely has to change.
read the uint values from your image, and store locally.
adjust 'pos' for each work item to relate to the local memory
execute the same i,j loops you have, but using the local memory to read values. remember that the i and j loops stop 15 short of n.
Each step can be searched online if you are not sure how to implement it, or you can ask me if you need a hand.
Chances are good that the LDS on your device will outperform the texture read speed. This is counter-intuitive, but remember that you are reading tiny amounts of data at a time, so the gpu may not be able to cache the pixels effectively. The LDS usage will guarantee that the pixels are available, and given the number of times each pixel is read, I expect this to make a huge difference.
Please let me know what kind of results you observe.
UPDATE: Here's my attempt to better explain my solution. I used graph paper for my drawings, because I'm not all that great with image manipulation software.
Above is a sketch of how the values were read from src in your first code snippet. The big problem is that the pos0 rectangle -- 16x16 uint4 values -- is being read in its entirety for each work item in the group (256 of them). My solution involves reading a large area and sharing the data among all 256 work items.
If you store a 31x31 region of your image in local memory, all 256 work items' data will be available.
steps:
use work group dimensions: (16,16)
read the values of src into a large local buffer, i.e. uint4 buff[31][31]; the buffer needs to be translated such that 'pos0' is at buff[0][0]
barrier(CLK_LOCAL_MEM_FENCE) to wait for memory copy operations
do the same i,j for loops you had originally, except you leave out the pos and pos0 values. only use i and j for the location. Accumulate 'diff' in the same way you were doing so originally.
write the solution to 'dest'
This is the same as my first response to your question, except I use n=16. This value does not utilize the local memory fully, but will probably work well for most platforms. 256 tends to be a common maximum work group size.
I hope this clears things up for you.
Some suggestions:
Compute more than 1 output pixel in each work item. It will increase data reuse.
Benchmark different work-group sizes to maximize the usage of texture cache.
Maybe there is a way to separate the kernel into two passes (horizontal and vertical).
Update: more suggestions
Instead of loading everything in local memory, try loading only the local_src values, and use read_image for the other one.
Since you do almost no computations, you should measure read speed in GB/s, and compare to the peak memory speed.