I'm using PETSc and I wanted to do something like,
I know I can do:
Mat A
Vec x,y
I was just curious if there is a function that would do all of these in one shot. It seems like it would save a loop.
Does such a function exist?

This function (or anything close) does not seem to be in the list of functions operating on Mat. So a brief answer to your question would be: no.
If you often use $y=\frac12 Ax$, a solution would be to scale the matrix once and for all, using MatScale(A,0.5);.
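To illustrate why scaling the matrix once gives the same result as scaling every product, here is a minimal pure-Python sketch. The helpers mat_mult, vec_scale and mat_scale are hypothetical stand-ins for MatMult(), VecScale() and MatScale() on a tiny dense matrix; they are not PETSc calls.

```python
# Hypothetical stand-ins for MatMult / VecScale / MatScale on a small dense matrix.
def mat_mult(A, x):
    """Return the product A*x for a list-of-rows matrix A."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def vec_scale(y, alpha):
    """Return alpha*y."""
    return [alpha * yi for yi in y]

def mat_scale(A, alpha):
    """Return alpha*A."""
    return [[alpha * a for a in row] for row in A]

A = [[2.0, 0.0], [0.0, 2.0]]   # two times the identity
x = [3.0, 4.0]
alpha = 0.5

# Two steps per product: y = A x, then y = alpha * y.
y_two_step = vec_scale(mat_mult(A, x), alpha)

# Scale the matrix once, then every later product needs a single step.
A_half = mat_scale(A, alpha)
y_prescaled = mat_mult(A_half, x)

print(y_two_step, y_prescaled)
```

The scaled matrix pays off only when the same alpha is reused for many products, and it overwrites the original matrix values unless a copy is kept.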
Would such a function be useful? One way to check is to use the -log_summary option of PETSc to get some profiling information. If your matrix is dense, you will see that the time spent in MatMult() is much larger than the time spent in VecScale(). The question is meaningful only for a sparse matrix with a few nonzero entries per row.
Here is some code to test it, using two times the identity as the matrix:
static char help[] = "Tests solving linear system on 0 by 0 matrix.\n\n";
#include <petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
  Vec            x, y;
  Mat            A;
  PetscReal      alpha=0.5;
  PetscErrorCode ierr;
  PetscInt       n=42;
  PetscInt       i,col;
  PetscScalar    value=2.0;

  PetscInitialize(&argc,&args,(char*)0,help);
  ierr = PetscOptionsGetInt(NULL,"-n",&n,NULL);CHKERRQ(ierr);
  /* Create the vectors */
  ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
  ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
  ierr = VecSetFromOptions(x);CHKERRQ(ierr);
  ierr = VecDuplicate(x,&y);CHKERRQ(ierr);
  /* Give x defined values so that MatMult() does not read uninitialized memory */
  ierr = VecSet(x,1.0);CHKERRQ(ierr);
  /*
     Create matrix. When using MatCreate(), the matrix format can
     be specified at runtime.
     Performance tuning note: For problems of substantial size,
     preallocation of matrix memory is crucial for attaining good
     performance. See the matrix chapter of the users manual for details.
  */
  ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
  ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
  ierr = MatSetFromOptions(A);CHKERRQ(ierr);
  ierr = MatSetUp(A);CHKERRQ(ierr);
  /*
     This matrix is diagonal, two times identity
     should have preallocated, shame
  */
  for (i=0; i<n; i++) {
    col  = i;
    ierr = MatSetValues(A,1,&i,1,&col,&value,INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  /* let's do this 42 times for nothing : */
  for (i=0; i<42; i++) {
    ierr = MatMult(A,x,y);CHKERRQ(ierr);
    ierr = VecScale(y,alpha);CHKERRQ(ierr);
  }
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return 0;
}
The makefile :
include ${PETSC_DIR}/conf/variables
include ${PETSC_DIR}/conf/rules
include ${PETSC_DIR}/conf/test
all : ex1
ex1 : main.o chkopts
	${CLINKER} -w -o main main.o ${PETSC_LIB}
	${RM} main.o
run :
	mpirun -np 2 main -n 10000000 -log_summary -help -mat_type mpiaij
And here are the two resulting lines of -log_summary that could answer your question:
Event                Count      Time (sec)     Flops                             --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
--- Event Stage 0: Main Stage
VecScale              42 1.0 1.0709e+00 1.0 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00  4 50  0  0  0   4 50  0  0  0   392
MatMult               42 1.0 5.7360e+00 1.1 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 50  0  0  0  20 50  0  0  0    73
So the 42 VecScale() operations took about 1.1 seconds while the 42 MatMult() operations took 5.7 seconds. Removing the VecScale() operation would therefore speed up the code by less than 20%, in the best case. The overhead due to the for loop is even lower than that. I guess that's the reason why this function does not exist.
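The bound can be checked directly from the two timings in the log. This short sketch only redoes the arithmetic; it assumes the best case in which a fused operation would cost exactly as much as MatMult() alone.

```python
# Timings copied from the two -log_summary lines (seconds, 42 calls each).
t_vecscale = 1.0709
t_matmult = 5.7360

# Best case: a fused y = alpha*A*x costs no more than MatMult() alone.
combined = t_vecscale + t_matmult
saving = t_vecscale / combined          # fraction of the combined time removed
speedup = combined / t_matmult          # ratio of old time to new time

print("fraction saved: %.3f, speedup factor: %.3f" % (saving, speedup))
```

The fraction saved is relative to the two operations only; relative to the whole run (the %T column) the gain is smaller still.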
I apologize for the poor performance of my computer (392Mflops for VecScale()...). I am curious to know what happens on yours !


Performance of Fortran versus MPI files writing/reading

I am confused about the speed of writing and reading files with plain Fortran I/O versus MPI I/O, for both small and big files.
I wrote the following simple dummy program to test this (just writing dummy values to files):
#if defined (__MPI)
! Include file for MPI
#if defined (__MPI_MODULE)
USE mpi
INCLUDE 'mpif.h'
! dummy world and null communicator
INTEGER (kind=MPI_OFFSET_KIND) :: lsize, pos, pos2
REAL(kind=DP), ALLOCATABLE, DIMENSION(:) :: trans_prob, array_cpu
INTEGER :: ierr, i, error, my_pool_id, world_comm
INTEGER (kind=DP) :: fil
REAL :: start, finish
INTEGER :: iunepmat, npool, arr_size, loop, pos3, j
real(dp):: dummy
integer*8 :: unf_recl
integer :: ios, direct_io_factor, recl
iunepmat = 10000
arr_size = 102400
loop = 500
! Initialize MPI
call MPI_COMM_DUP(MPI_COMM_WORLD, world_comm, ierr)
call MPI_COMM_RANK(world_comm,my_pool_id,error)
trans_prob(:) = 1.5d0
!Write using Fortran
CALL MPI_BARRIER(world_comm,error)
CALL cpu_time(start)
DO i=1, loop
! This writes also info on the record length using a real with 4 bytes.
OPEN(unit=10+my_pool_id, form='unformatted', position='append', action='write')
WRITE(10+my_pool_id ) trans_prob(:)
CALL MPI_COMM_SIZE(world_comm, npool, error)
! Master collect and write
IF (my_pool_id==0) THEN
INQUIRE (IOLENGTH=direct_io_factor) dummy
unf_recl = direct_io_factor * int(arr_size * loop, kind=kind(unf_recl))
ALLOCATE (array_cpu( arr_size * loop ))
array_cpu(:) = 0.0d0
OPEN(unit=100,file='merged.dat',form='unformatted', status='new', position='append', action='write')
DO i=0, npool - 1
OPEN(unit=10+i,form='unformatted', status ='old', access='direct', recl = unf_recl )
READ(unit=10+i, rec=1) array_cpu(:)
WRITE(unit=100) array_cpu(:)
DEALLOCATE (array_cpu)
call cpu_time(finish)
!Print time
CALL MPI_BARRIER(world_comm,error)
IF (my_pool_id==0) print*, ' Fortran time', finish-start
!Write using MPI
CALL MPI_BARRIER(world_comm,error)
CALL cpu_time(start)
lsize = INT( arr_size , kind = MPI_OFFSET_KIND)
pos = 0
pos2 = 0
DO i=1, loop
pos = pos2 + INT( arr_size * (my_pool_id), kind = MPI_OFFSET_KIND ) * 8_MPI_OFFSET_KIND
CALL MPI_FILE_SEEK(iunepmat, pos, MPI_SEEK_SET, ierr)
pos2 = pos2 + INT( arr_size * (npool -1), kind = MPI_OFFSET_KIND ) * 8_MPI_OFFSET_KIND
CALL MPI_FILE_CLOSE(iunepmat,ierr)
CALL cpu_time(finish)
CALL MPI_BARRIER(world_comm,error)
IF (my_pool_id==0) print*, ' MPI time', finish-start
DEALLOCATE (trans_prob)
The compilation is made with:
mpif90 -O3 -x f95-cpp-input -D__FFTW -D__MPI -D__SCALAPACK test_mpi2.f90 -o a.x
and then run in parallel with 4 cores:
mpirun -np 4 ./a.x
I get the following results:
Loop size 1
array size 10,240,000
File size: 313 Mb
Fortran time 0.237030014 sec
MPI time 0.164155006 sec
Loop size 10
array size 1,024,000
File size: 313 Mb
Fortran time 0.242821991 sec
MPI time 0.172048002 sec
Loop size 100
array size 102,400
File size: 313 Mb
Fortran time 0.235879987 sec
MPI time 9.78289992E-02 sec
Loop size 50
array size 1,024,000
File size: 1.6G
Fortran time 1.60272002 sec
MPI time 3.40623116 sec
Loop size 500
array size 102,400
File size: 1.6G
Fortran time 1.44547606 sec
MPI time 3.38340592 sec
As you can see, the MPI performance degrades significantly for the larger files. Is it possible to improve MPI performance for large files?
Is this behavior expected?
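When comparing the two write paths, it can help to check the byte offsets each rank is expected to use. The sketch below assumes a round-robin layout (one block per rank per loop iteration); this layout and the helper block_offset are illustrative assumptions, not necessarily what the program above does.

```python
BYTES_PER_REAL = 8  # same factor as 8_MPI_OFFSET_KIND in the program

def block_offset(iteration, rank, nranks, arr_size):
    """Byte offset of the block written by `rank` in loop `iteration` (0-based),
    assuming a hypothetical round-robin layout."""
    return (iteration * nranks + rank) * arr_size * BYTES_PER_REAL

nranks, arr_size, loop = 4, 102400, 500
block_bytes = arr_size * BYTES_PER_REAL

offsets = sorted(block_offset(it, r, nranks, arr_size)
                 for it in range(loop) for r in range(nranks))

# Consecutive blocks must start exactly one block apart: no gaps, no overlap.
no_overlap = all(b - a == block_bytes for a, b in zip(offsets, offsets[1:]))
total_bytes = offsets[-1] + block_bytes

print(no_overlap, total_bytes)
```

With 4 ranks, arr_size = 102400 and loop = 500, the total comes to about 1.6 GB, which agrees with the file size reported for that run.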

ceres solver analytical derivative doesn't work

template<typename ConcreteOccGridMapUtil>
class getResidual : public ceres::SizedCostFunction<1,3>
{
public:
    ConcreteOccGridMapUtil* occ;
    DataContainer dataPoints;

    getResidual(ConcreteOccGridMapUtil* occ, const DataContainer& dataPoints)
    {
        this->occ = occ;
        this->dataPoints = dataPoints;
    }
    virtual ~getResidual() {}
    virtual bool Evaluate(double const* const* parameters,
                          double* residuals,
                          double** jacobians) const
    {
        Eigen::Matrix<double, 3, 1> pose1(parameters[0][0],parameters[0][1],parameters[0][2]);
        Eigen::Vector3f pose = pose1.cast<float>();
        Eigen::Affine2f transform(occ->getTransformForState(pose)); // transform: rotation->translation
        float sinRot = std::sin(pose[2]);
        float cosRot = std::cos(pose[2]);
        int size = dataPoints.getSize();
        residuals[0] = 0;
        for (int i = 0; i < size; ++i)
        {
            const Eigen::Vector2f& currPoint (dataPoints.getVecEntry(i)); /// lidar point
            Eigen::Vector3f transformedPointData(occ->interpMapValueWithDerivatives(transform * currPoint)); /// {M,dM/dx,dM/dy}
            float funVal = 1.0f - transformedPointData[0];
            // float weight=util::WeightValue(funVal);
            float weight=1.0;
            residuals[0] += static_cast<double>(funVal);
            jacobians[0][0] += static_cast<double>(transformedPointData[1]);
            jacobians[0][1] += static_cast<double>(transformedPointData[2]);
            double rotDeriv = ((-sinRot * currPoint.x() - cosRot * currPoint.y()) * transformedPointData[1] + (cosRot * currPoint.x() - sinRot * currPoint.y()) * transformedPointData[2]);
            jacobians[0][2] += static_cast<double>(rotDeriv);
        }
        return true;
    }
};
My parameter to optimize is the pose = [x,y,theta].
My objective is to minimize the occupancy value for the pose and the laser points. Here I add the per-point values together manually into residuals[0].
I have 3 parameters [x,y,theta], so my Jacobian has 3 entries in jacobians[0].
But when I run the program, the report is like below:
Solver Summary (v 1.12.0-eigen-(3.2.0)-lapack-suitesparse-(4.2.1)-openmp)
Original Reduced
Parameter blocks 1 1
Parameters 3 3
Residual blocks 1 1
Residual 1 1
Dense linear algebra library EIGEN
Trust region strategy LEVENBERG_MARQUARDT
Given Used
Linear solver DENSE_QR DENSE_QR
Threads 1 1
Linear solver threads 1 1
Linear solver ordering AUTOMATIC 1
Initial 8.569800e+04
Final 8.569800e+04
Change 0.000000e+00
Minimizer iterations 1
Successful steps 1
Unsuccessful steps 0
Time (in seconds):
Preprocessor 0.0001
Residual evaluation 0.0000
Jacobian evaluation 0.0050
Linear solver 0.0000
Minimizer 0.0051
Postprocessor 0.0000
Total 0.0052
Termination: CONVERGENCE (Gradient tolerance reached. Gradient max norm: 0.000000e+00 <= 1.000000e-10)
Since I have set the jacobians, how can it say that the gradient norm is so small?
Two things.
1. You cannot unconditionally set the Jacobian; you need to check that the solver is actually asking for it and that the pointers are non-null.
2. There is something wrong with your Jacobian evaluation, because as far as Ceres can tell it is seeing a zero gradient. A simple thing to check would be to dump out the Jacobian and Jacobian' * residual from the CostFunction before returning.
For example, are you sure size != 0?
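The two points can be sketched in a few lines of Python. This is a toy stand-in, not Ceres code: evaluate plays the role of Evaluate() for one residual and three parameters, and the per-point tuples (residual term and its three derivatives) are hypothetical inputs.

```python
def evaluate(points, want_jacobian):
    """Toy cost function: sums residual terms and, only when requested,
    their derivatives. Each point is (r, dr_dx, dr_dy, dr_dtheta)."""
    residual = 0.0
    # Mirror the null-pointer check: touch the Jacobian only if it was requested,
    # and start the accumulation from zero.
    jacobian = [0.0, 0.0, 0.0] if want_jacobian else None
    for r, dr_dx, dr_dy, dr_dtheta in points:
        residual += r
        if jacobian is not None:
            jacobian[0] += dr_dx
            jacobian[1] += dr_dy
            jacobian[2] += dr_dtheta
    return residual, jacobian

def gradient(residual, jacobian):
    """Jacobian' * residual for a single residual."""
    return [j * residual for j in jacobian]

points = [(0.4, -0.1, 0.2, 0.05), (0.7, 0.3, -0.2, 0.10)]
res, jac = evaluate(points, want_jacobian=True)
print("residual:", res, "jacobian:", jac, "gradient:", gradient(res, jac))

# With no points the Jacobian stays zero, so the gradient is zero as well.
res0, jac0 = evaluate([], want_jacobian=True)
print("empty input gradient:", gradient(res0, jac0))
```

Printing these quantities just before returning is the kind of dump suggested above; an all-zero Jacobian leads directly to the "Gradient max norm: 0" termination.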

How to explain the strange results for while loop with floating point in swift

I tested the while loop below and don't understand the result.
var x: Float = 0.0
var counter = 0
while x < 1.41 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.5
How is it possible to get x = 1.5 with the while condition x < 1.41? How can this result be explained?
One more question: why are the results different for Double and Float?
var x: Double = -0.5
var counter = 0
while x < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 16
print(x)       // 1.1
var x: Float = -0.5
var counter = 0
while x < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.0
Update 2
And another one: why is there no difference between the < and <= conditions? Does it mean that using <= makes no sense for floating point?
var x: Double = 0.0
var counter = 0
while x < 1.5 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.5
var x: Double = 0.0
var counter = 0
while x <= 1.5 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.5
What else would you expect? The loop is executed 15 times. After the 14th iteration, x is about 1.4, which is still less than 1.41, so the loop runs once more and adds another 0.1, making it 1.5.
If you expect the loop to terminate at 1.4, you should increment x before checking the while condition, not after.
If you expect the loop to terminate at 1.41, your increment is wrong and you should use
x += 0.01
instead, making it 141 iterations.
As for the second question, I am aware that Float should not be used for monetary calculations and such due to its lack of precision. However, I trusted Double so far, and the while loop in run 15 actually claims the Double value to be less than 1.0 while it is reported to be 1.0. We have a precision problem here, as we can see if we subtract x from 1.0:
which returns: 1.11022302462516e-16
At the same time, Float seems to be imprecise in the other direction. In the last run, it is a little bigger than 0.9 (0.9 + 5.96046e-08), making it no longer smaller than 1.0 in the following run.
The reason why Double and Float are wrong in different directions is just a matter of how the values are stored, and the result will be different depending on the number. For example, with 2.0 both actual values are bigger: Double by 4.440892..e-16 and Float by 2.38419e-07. For 3.0 Double is bigger by 1.33226e-15 and Float smaller by 7.1525e-07.
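The two loops can be replayed outside Swift to see the stored values exactly. The sketch below is in Python, whose float is an IEEE 754 double; single precision is emulated by rounding through struct after every addition. The helper names to_f32 and run_loop are made up for this illustration.

```python
import struct
from decimal import Decimal

def to_f32(v):
    """Round a Python float (a double) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", v))[0]

def run_loop(start, step, limit, single):
    """Replay `while x < limit { x += step }` in double or emulated single precision."""
    cast = to_f32 if single else float
    x, step, limit = cast(start), cast(step), cast(limit)
    counter = 0
    while x < limit:
        x = cast(x + step)   # round the sum to the chosen precision
        counter += 1
    return counter, x

d_count, d_x = run_loop(-0.5, 0.1, 1.0, single=False)
f_count, f_x = run_loop(-0.5, 0.1, 1.0, single=True)

# Decimal(...) shows the exact binary value that is stored, without rounding for display.
print("Double:", d_count, Decimal(d_x))
print("Float: ", f_count, Decimal(f_x))
print("0.1 as Double:", Decimal(0.1))
print("0.1 as Float: ", Decimal(to_f32(0.1)))
```

This reproduces the counters reported in the question (16 for Double, 15 for Float): the accumulated Double lands just below 1.0 after 15 additions, while the accumulated Float lands exactly on 1.0.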
The same problems occur using x.isLess(than: 1.0), but this method is the basis for the < operator as of
isLessThanOrEqualTo(1.0), on the other hand, seems to work reliably as expected.
This answer is pretty much a question itself by now, so I'm curious if anyone has an in-depth explanation of this...
The more I think about it, the less of a Swift problem it is. Basically, you have that problem in all floating point calculations, because most decimal fractions cannot be represented exactly. Neither Float nor Double is exact; Double simply carries more precision (a 53-bit significand versus 24 bits for Float). This means that comparisons like == are unreliable with floating point values unless both sides are rounded. Therefore, good advice for loops like yours with a known precision (in your case one decimal) would be to round to that precision before doing any kind of comparison. For example, this would fix the loop:
var x: Double = -0.5
var counter = 0
while (round(x * 1000) / 1000) < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.0
var x: Float = -0.5
var counter = 0
while (round(x * 1000) / 1000) < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.0
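The same rounding idea also sheds light on the < versus <= question from Update 2. The sketch below is in Python (its float is a double, like Swift's Double); count_steps is a hypothetical helper that replays the loop from 0.0 in steps of 0.1, with or without rounding before the comparison.

```python
def count_steps(limit, inclusive, rounded):
    """Count iterations of `while x < limit` (or `<=`), starting at 0.0 with step 0.1.
    If `rounded` is set, x is rounded to three decimals before each comparison."""
    x, counter = 0.0, 0
    while True:
        value = round(x * 1000) / 1000.0 if rounded else x
        keep_going = value <= limit if inclusive else value < limit
        if not keep_going:
            return counter
        x += 0.1
        counter += 1

for inclusive in (False, True):
    for rounded in (False, True):
        op = "<=" if inclusive else "<"
        print(op, "rounded" if rounded else "raw", count_steps(1.5, inclusive, rounded))
```

Without rounding, both conditions stop after 15 steps, because the accumulated value skips past 1.5 instead of landing on it. With rounding, < and <= differ again, which suggests that <= is meaningful once the comparison is made at the intended precision.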

PyCUDA large nonuniform matrix operations

I am working with large, nonuniform matrices and am having problems with what I believe to be mismatching on the elements.
In the example module, get_simulated_ipp() builds echo and tx, two linear arrays of size 250000 and 25000 respectively. The code also hardcodes sr=25.
My code is attempting to complex multiply tx into echo along different stretches, depending on specified ranges and value of sr. This will then be stored in an array S.
After searching through some other people's examples, I found a way of building blocks and grids here that I thought would work well. I'm unfamiliar with C code, but have been trying to learn over the past week. Here is my code:
#This iteration only works on the first and last elements, mismatching after that.
# However, this doesn't result in any empty elements in S
import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
#pull simulated data and get info about it
((echo,tx)) = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr =
#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work
block_dim_x = 8 #thread number is product of block dims,
block_dim_y = 8 # want a multiple of 32 (warp multiple)
blocks_x = np.ceil(len(ranges)/float(block_dim_x)).astype(np.int32).item() #float division, so ceil can round up
blocks_y = np.ceil(len(tx)/float(block_dim_y)).astype(np.int32).item()
#include <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
int *ranges, int sr)
unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int threads_in_block = blockDim.x * blockDim.y;
unsigned long int idx = threads_in_block * block_num + thread_num;
//aligning the i,j to idx, something is mismatched?
int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
((block_num % gridDim.x) * blockDim.x);
int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
((block_num / gridDim.x) * blockDim.y);
result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
## want something to work like this:
## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri,r in enumerate(ranges):
compare[ri,:] = echo[txidx+r*sr]*tx
print np.subtract(S, compare)
At the bottom here, I've put in a CPU implementation of what I'm attempting to accomplish and put in a subtraction. The result is that the very first and very last elements come out as 0+0j, but the rest do not. The kernel is attempting to align an i and j to the idx so that I can traverse echo, ranges, and tx more easily.
Is there a better way to implement something like this? Also, why might the result not come out as all 0+0j as I intend?
Trying a little example to get a better grasp of how the arrays are being indexed with this block/grid configuration, I stumbled upon something very strange. Before I tried to index the elements, I just wanted to run a little test multiplication. It seems like my block/grid covers all of the ary_in matrix, but the result ends up only doubling the top half of ary_in, and the bottom half returns whatever was left over from the previous bottom half calculation.
If I change blocks_x to 4 so that I cover more space than needed, however, the doubling works fine. If I then run it with a 4x4 grid, with * 3 instead, it'll work out fine with ary_out as ary_in tripled. When I run it again with a 2x4 grid and only doubling, the top half of ary_out returns the doubled ary_in, but the bottom half returns the previous result in memory, a tripled value instead. I would understand this to be something in my index/block/grid mapping wrongly to the values, but I can't figure out what.
ary_in = np.arange(128).reshape((8,16))
print ary_in
ary_out = np.zeros_like(ary_in)
block_dim_x = 4
block_dim_y = 4
blocks_x = 2
blocks_y = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y
mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
unsigned int threads_in_block = blockDim.x * blockDim.y;
unsigned int idx = threads_in_block * block_num + thread_num;
if (idx < n) {
// ary_out[idx] = thread_num;
ary_out[idx] = ary_in[idx] * 2;
indexing_order = mod.get_function("indexing_order")
indexing_order(drv.In(ary_in), drv.Out(ary_out), np.int32(limit),
print ary_out
I figured out the problems. In the edit just above, the ary_in is by default an int64, mismatching with the int initialization in the C code of an int32. This only allocated half the amount of data needed on the GPU for the entire array, so only the top half was moved over and operated on. Adding a .astype(np.int32) solved this problem.
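The effect of the dtype mismatch can be imitated on the CPU. The sketch below is a simplified model, assuming a little-endian machine: it reinterprets the bytes of 128 int64 values as 32-bit words, lets a pretend kernel double the first 128 words into an output buffer that starts as zeros, and reads the output back as int64.

```python
import struct

n = 128
# Host array as NumPy would build it by default: 128 int64 values (1024 bytes).
host_in = struct.pack("<%dq" % n, *range(n))

# The kernel declares int*, so it sees the same bytes as 32-bit words.
words_in = struct.unpack("<%di" % (2 * n), host_in)
words_out = [0] * (2 * n)            # output buffer, zero here instead of stale memory

# Pretend kernel: one thread per idx, guarded by idx < n as in the test kernel.
for idx in range(n):
    words_out[idx] = words_in[idx] * 2

# Back on the host the output is interpreted as int64 again.
result = struct.unpack("<%dq" % n, struct.pack("<%di" % (2 * n), *words_out))

print(result[:4], result[62:66])
```

Only the first 64 values come back doubled; the rest keep whatever the output buffer held before, which agrees with the half-updated array described above. Converting the arrays with .astype(np.int32) makes the host and kernel element sizes agree.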
This allowed me to figure out the ordering of the indexing in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
I still don't understand how to get this working with non even division of block dimensions into the output array (e.g. 16x16), even with an if (idx
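One way to reason about sizes that do not divide evenly is to simulate the launch on the CPU. The sketch below is plain Python, not PyCUDA; simulate_launch is a hypothetical helper that uses the same linear idx, the i = idx / row_len and j = idx % row_len mapping, and an idx < n guard, and counts how often each output element is written.

```python
def simulate_launch(n_rows, row_len, threads_in_block=16):
    """Replay a 1-D launch with a bounds guard; return (number of blocks, hit counts)."""
    n = n_rows * row_len
    # Integer ceiling division: enough blocks to cover n even when it is not a multiple.
    n_blocks = -(-n // threads_in_block)
    hits = [[0] * row_len for _ in range(n_rows)]
    for block_num in range(n_blocks):
        for thread_num in range(threads_in_block):
            idx = threads_in_block * block_num + thread_num
            if idx < n:                      # surplus threads in the last block do nothing
                i = idx // row_len
                j = idx % row_len
                hits[i][j] += 1
    return n_blocks, hits

n_blocks, hits = simulate_launch(7, 5)       # 35 elements, 16 threads per block
print(n_blocks, all(h == 1 for row in hits for h in row))
```

The important detail is rounding the block count up with integer arithmetic; an expression such as np.ceil(a/b) on two integers rounds nothing under Python 2, because the division has already been floored before ceil is applied.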

Discrete Wavelet Transform on images and watermark embedding in LL band coefficients, data is lost when IDWT-DWT is performed again?

I'm writing an image watermarking system to hide a watermark in an image's low frequency band by transforming the image's luminance channel with a Discrete Wavelet Transform, then modifying coefficients in the LL band of the DWT output. I then do an Inverse DWT and rebuild my image.
The problem I'm having is when I modify coefficients in the DWT output, then inverse-DWT, and then DWT again, the modified coefficients are radically different.
For example, one of the output coefficients in the LL band of the 2-scale DWT was -0.10704, I modified this coefficient to be 16.89, then performed the IDWT on my data. I then took the output of the IDWT and performed a DWT on it again, and my coefficient which was modified to be 16.89 became 0.022.
I'm fairly certain that the DWT and IDWT code is correct because I've tested it against other libraries and the output from each transform matches when the filter coefficients and other parameters are the same. (Within what can be expected due to rounding error)
The main problem may be that I don't understand the DWT all that well. I thought the DWT and IDWT were supposed to be reasonably lossless (aside from rounding error and such), yet this doesn't seem to be the case here.
I'm hoping someone more familiar with the transform can point me at a possible issue. Is it possible that I'm losing data because the coefficients in my other subbands (LH, HL, HH) for that position are insignificant? If so, how can I determine which coefficients this may happen to?
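As a point of comparison, here is a minimal sketch of the round trip with a one-level, one-dimensional orthonormal Haar transform in Python. It is not the transform or filter used in the question; haar_forward, haar_inverse and to_pixels are illustrative helpers, and the pixel rounding and clipping step is an assumption about how an image might be rebuilt.

```python
import math

S = math.sqrt(2.0)

def haar_forward(x):
    """One level of the orthonormal Haar DWT: returns (approximation, detail)."""
    a = [(x[i] + x[i + 1]) / S for i in range(0, len(x), 2)]
    d = [(x[i] - x[i + 1]) / S for i in range(0, len(x), 2)]
    return a, d

def haar_inverse(a, d):
    """Inverse of haar_forward."""
    x = []
    for ai, di in zip(a, d):
        x += [(ai + di) / S, (ai - di) / S]
    return x

def to_pixels(x):
    """Round and clip samples to the 8-bit range, as storing an image would."""
    return [min(255, max(0, int(round(v)))) for v in x]

signal = [10.0, 12.0, 200.0, 202.0, 50.0, 52.0, 90.0, 91.0]
a, d = haar_forward(signal)
a[0] = 400.0                                   # overwrite one low-band coefficient

exact, _ = haar_forward(haar_inverse(a, d))              # no pixel conversion
stored, _ = haar_forward(to_pixels(haar_inverse(a, d)))  # with rounding and clipping

print(exact[0], stored[0])
```

In this toy case the modified coefficient survives the IDWT and DWT round trip when the samples stay in floating point, and changes only once they are rounded and clipped. Whether something similar happens in the pipeline above depends on steps not shown in the question.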
My embedding function is below, coefficients are chosen in the LL band, "strong" is determined to be true if the absolute value of the LH, HH, or HL band for the selected location is larger than the mean value of the corresponding subband.
//If this evaluates to true, then the texture is considered strong.
if ((Math.Abs(LH[i][w]) >= LHmean) || (Math.Abs(HL[i][w]) >= HLmean) || (Math.Abs(HH[i][w]) >= HHmean))
static double MarkCoeff(int index, double coeff,bool strong)
{
    int q1 = 16;
    int q2 = 8;
    int quantizestep = 0;
    byte watermarkbit = binaryWM[index];
    // Strong texture uses the larger quantization step.
    if (strong)
        quantizestep = q1;
    else
        quantizestep = q2;
    coeff /= (double)quantizestep;
    double coeffdiff = 0;
    if(coeff > 0.0)
        coeffdiff = coeff - (int)coeff;
    else
        coeffdiff = coeff + (int)coeff;
    if (1 == ((int)coeff % 2))
    {
        // Quantized value is odd: change it only when the bit to embed is 0.
        if (watermarkbit == 0)
        {
            if (Math.Abs(coeffdiff) > 0.5)
                coeff += 1.0;
            else
                coeff -= 1.0;
        }
    }
    else
    {
        // Quantized value is not odd: change it only when the bit to embed is 1.
        if (watermarkbit == 1)
        {
            if (Math.Abs(coeffdiff) > 0.5)
                coeff += 1.0;
            else
                coeff -= 1.0;
        }
    }
    coeff *= (double)quantizestep;
    return coeff;
}
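For reference, a common parity-based quantization scheme can be sketched as follows. This is not the MarkCoeff function above; embed_bit and extract_bit are hypothetical helpers that snap the coefficient to a multiple of the step whose index has the parity of the watermark bit. Python's % always returns a non-negative result for a positive modulus, whereas in C# the % operator on a negative odd integer yields -1, so a test of the form 1 == (q % 2) would miss negative odd values there.

```python
def embed_bit(coeff, bit, step):
    """Return the multiple of `step` nearest to `coeff` whose index parity equals `bit`."""
    scaled = coeff / step
    q = int(round(scaled))
    if q % 2 != bit:
        # Move to the neighbouring index on the side where the coefficient lies.
        q += 1 if scaled >= q else -1
    return q * step

def extract_bit(coeff, step):
    """Recover the bit as the parity of the nearest quantization index."""
    return int(round(coeff / step)) % 2

step = 16
for coeff in (-37.3, -0.10704, 16.89, 123.4):
    for bit in (0, 1):
        marked = embed_bit(coeff, bit, step)
        # The bit should survive a perturbation smaller than half a step.
        recovered = [extract_bit(marked + noise, step) for noise in (-6.0, 0.0, 6.0)]
        print(coeff, bit, marked, recovered)
```

Because the marked coefficient sits exactly on a quantization level, extraction tolerates changes smaller than half a step, at the cost of a distortion of up to one step in the coefficient.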