I have some generic code that I am trying to move to SSE to speed it up, since it gets called a lot. The code in question is basically this:
for (int i = 1; i < mysize; ++i)
{
buf[i] = myMin(buf[i], buf[i - 1] + offset);
}
where myMin is the simple min function (a < b) ? a : b (I've looked at the disassembly and there are jumps in there).
My SSE code (which I've gone through several iterations to speed up) is in this form now:
float tmpf = *(tmp - 1);
__m128 off = _mm_set_ss(offset);
for (int l = 0; l < mysize; l += 4)
{
__m128 post = _mm_load_ps(tmp);
__m128 pre = _mm_move_ss(post, _mm_set_ss(tmpf));
pre = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(0, 3, 2, 1));
pre = _mm_add_ss(pre, off);
post = _mm_min_ss(post, pre);
// reversed
pre = _mm_shuffle_ps(post, post, _MM_SHUFFLE(2, 1, 0, 3));
post = _mm_add_ss(post, off );
pre = _mm_min_ss(pre, post);
post = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(2, 1, 0, 3));
pre = _mm_add_ss(pre, off);
post = _mm_min_ss(post, pre);
// reversed
pre = _mm_shuffle_ps(post, post, _MM_SHUFFLE(2, 1, 0, 3));
post = _mm_add_ss(post, off);
pre = _mm_min_ss(pre, post);
post = _mm_shuffle_ps(pre, pre, _MM_SHUFFLE(2, 1, 0, 3));
_mm_store_ps(tmp, post);
tmpf = tmp[3];
tmp += 4;
}
Ignoring any edge-case scenarios (which I've handled fine, and whose overhead is negligible given the size of buf/tmp), can anyone explain why the SSE version is 2x slower? VTune keeps attributing it to L1 misses, but as far as I can see it should make 4x fewer trips to L1 and has no branches/jumps, so it should be faster, but it's not. What am I missing here?
Thanks
EDIT:
So I did find something else in a separate test case. I didn't think this would matter, but alas it did. mysize above is actually not that big (about 30-50), but there are a LOT of these buffers and they are all processed serially. In that case, the ternary expression is faster than SSE. However, if the situation is reversed, with mysize in the millions and only 30-50 of these buffers, the SSE version is faster. Any idea why? I would think the memory interactions would be the same for both, including pre-emptive prefetching etc...
If this code is performance critical, you'll have to look at the data that you get. It's the serial dependency that's killing you, and you need to get rid of it.
One very small value in buf [i] will influence a lot of the following values. For example, if offset = 1, buf [0] = 0, and all other values are > 1 million, that one value will influence the next one million. On the other hand, that kind of thing might happen very rarely.
If it is rare, then you check, fully vectorised, whether buf [i] > buf [i - 1] + offset, replace it where that is the case, and keep track of where changes were made, without considering that buf [i] values could trickle upwards. Then you look at where changes were made, and re-check them.
In extreme cases, say buf [i] is always between 0 and 1, and offset > 0.5, you know that buf [i] cannot influence buf [i + 2] at all, so you just ignore the serial dependency and do everything in parallel, fully vectorised.
On the other hand, if you have some tiny values in your buffer that influence large numbers of consecutive values, then you start with the first value buf [0] and fully vectorised check whether buf [i] < buf [0] + i * offset, replacing values, until the check fails.
You say "the values can be anything". If that is the case, for example if buf [i] is randomly chosen anywhere between 0 and 1,000,000, and offset is not very large, then you will have elements buf [i] which force lots of following elements to be buf [i] + (k - i) * offset. For example if offset = 1, and you find buf [i] is about 10,000, then it will force on average about 100 values to be equal to buf [i] + (k - i) * offset.
Here's a branchless solution you could try
for (int i = 1; i < mysize; i++) {
float a = buf[i];
float b = buf[i-1] + offset;
buf[i] = b + (a<b)*(a-b); // evaluates to a when a<b, otherwise b, with no branch
}
Here is the assembly:
.L6:
addss xmm0, xmm4
movss xmm1, DWORD PTR [rax]
movaps xmm2, xmm1
add rax, 4
movaps xmm3, xmm6
cmpltss xmm2, xmm0
subss xmm1, xmm0
andps xmm3, xmm2
andnps xmm2, xmm5
orps xmm2, xmm3
mulss xmm1, xmm2
addss xmm0, xmm1
movss DWORD PTR [rax-4], xmm0
cmp rax, rdx
jne .L6
But the version with a branch is probably already better
for (int i = 1; i < mysize; i++) {
float a = buf[i];
float b = buf[i-1] + offset;
buf[i] = a<b ? a : b;
}
Here is the assembly
.L15:
addss xmm0, xmm2
movss xmm1, DWORD PTR [rax]
add rax, 4
minss xmm1, xmm0
movss DWORD PTR [rax-4], xmm1
cmp rax, rdx
movaps xmm0, xmm1
jne .L15
This produces code which is branchless anyway using minss (cmp rax, rdx applies to the loop iterator).
Finally, here is code you can use with MSVC which produces the same branchless assembly as GCC:
__m128 offset4 = _mm_set1_ps(offset);
for (int i = 1; i < mysize; i++) {
__m128 a = _mm_load_ss(&buf[i]);
__m128 b = _mm_load_ss(&buf[i-1]);
b = _mm_add_ss(b, offset4);
a = _mm_min_ss(a,b);
_mm_store_ss(&buf[i], a);
}
Here is another form you can try which uses a branch
__m128 offset4 = _mm_set1_ps(offset);
for (int i = 1; i < mysize; i++) {
__m128 a = _mm_load_ss(&buf[i]);
__m128 b = _mm_load_ss(&buf[i-1]);
b = _mm_add_ss(b, offset4);
if(_mm_comige_ss(a, b)) // store only when b <= a, i.e. when b is the new minimum
_mm_store_ss(&buf[i], b);
}
Related
I'm quite new to CUDA and GPU programming. I'm trying to write a kernel for an application in physics. The parallelization is made over a quadrature of directions, each direction resulting in a sweep of a 2D Cartesian domain. Here is the kernel; it actually works well, giving good results.
However, a very high number of registers per block leads to a spill to local memory that harshly slows down the code's performance.
__global__ void KERNEL (int imax, int jmax, int mmax, int lg, int lgmax,
double *x, double *y, double *qd, double *kappa,
double *S, double *G, double *qw, double *SkG,
double *Ska,double *a, double *Ljm, int *data)
{
int m = 1+blockIdx.x*blockDim.x + threadIdx.x ;
int tid = threadIdx.x ;
//Var needed for thread execution
...
extern __shared__ double shared[] ;
//Read some data from Global mem
mu = qd[ (m-1)];
eta = qd[ MSIZE+(m-1)];
wm = qd[3*MSIZE+(m-1)];
amu = fabs(mu);
aeta= fabs(eta);
ista = data[ (m-1)] ;
iend = data[1*MSIZE+(m-1)] ;
istp = data[2*MSIZE+(m-1)] ;
jsta = data[3*MSIZE+(m-1)] ;
jend = data[4*MSIZE+(m-1)] ;
jstp = data[5*MSIZE+(m-1)] ;
j1 = (1-jstp) ;
j2 = (1+jstp)/2 ;
i1 = (1-istp) ;
i2 = (1+istp)/2 ;
isw = ista-istp ;
jsw = jsta-jstp ;
dy = dx = 1.0e-2 ;
for(i=1 ; i<=imax; i++) Ljm[MSIZE*(i-1)+m] = S[jsw*(imax+2)+i] ;
//Beginning of the vertical Sweep, can be from left to right,
// or opposite depending on the thread
for(j=jsta ; j1*jend + j2*j<=j2*jend + j1*j ; j=j+jstp) {
Lw = S[j*(imax+2)+isw] ;
//Beginning of the horizontal Sweep, can be from left to right,
// or opposite depending on the thread
for(i=ista ; i1*iend + i2*i<=i2*iend + i1*i ; i=i+istp) {
ax = dy ;
Lx = ax*amu/ex ;
ay = dx ;
Ly = ay*aeta/ey ;
dv = ax*ay ;
L0 = dv*kappaij ;
Sp = S[j*(imax+2)+i]*dv ;
Ls = Ljm[MSIZE*(i-1)+m] ;
Lp = (Lx*Lw+Ly*Ls+Sp)/(Lx+Ly+L0) ;
Lw = Lw+(Lp-Lw)/ex ;
Ls = Ls+(Lp-Ls)/ey ;
Ljm[MSIZE*(i-1)+m] = Ls ;
shared[tid] = wm*Lp ;
__syncthreads();
for (s=16; s>0; s>>=1) {
if (tid < s) {
shared[tid] += shared[tid + s] ;
}
}
if(tid==0) atomicAdd(&SkG[imax*(j-1)+(i-1)],shared[tid]*kappaij);
}
// End of horizontal sweep
}
// End of vertical sweep
}
How can I optimize the execution of this code? I run it over 8 blocks of 32 threads.
The occupancy for this kernel is really low, limited by registers according to the Visual Profiler.
I have no idea how to improve it.
Thanks!
First of all, you are using blocks of 32 threads; because of that, the kernel occupancy is too low. Your GPU is running only 256 threads in parallel, but it can run up to 1536 threads per multiprocessor (compute capability 2.x).
How many registers are you using?
You can also try declaring your variables in their local scope, which helps the device reuse registers better.
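For example, here is a minimal sketch of what declaring in local scope means (the kernel body and names are illustrative, not taken from the kernel above):
// Illustrative sketch only: temporaries declared inside the loop body go
// dead at the end of each iteration, which gives the compiler more freedom
// to reuse their registers than one block of declarations at the top.
__global__ void sweepSketch(const double *S, double *out, int imax, int jmax)
{
    for (int j = 0; j < jmax; ++j) {
        for (int i = 0; i < imax; ++i) {
            double sp = S[j * imax + i];  // lives only inside this iteration
            double lp = 0.5 * sp;         // placeholder for the real update
            out[j * imax + i] = lp;
        }
    }
}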
I'm writing a program in OpenCL that receives two arrays of points, and calculates the nearest neighbour for each point.
I have two programs for this. One of them will calculate distance for 4 dimensions, and one for 6 dimensions. They are below:
4 dimensions:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float dist = fast_distance(curY, m[x]);
if (dist < minDist)
{
minDist = dist;
minIdx = x;
}
}
i[index] = minIdx;
y[index] = minDist;
}
6 dimensions:
kernel void BruteForce(
global read_only float8* m,
global float8* y,
global write_only ushort* i,
read_only uint mx)
{
int index = get_global_id(0);
float8 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = -1;
int x = 0;
int mmx = mx;
for(x = 0; x < mmx; x++)
{
float8 mx = m[x];
float d0 = mx.s0 - curY.s0;
float d1 = mx.s1 - curY.s1;
float d2 = mx.s2 - curY.s2;
float d3 = mx.s3 - curY.s3;
float d4 = mx.s4 - curY.s4;
float d5 = mx.s5 - curY.s5;
float dist = sqrt(d0 * d0 + d1 * d1 + d2 * d2 + d3 * d3 + d4 * d4 + d5 * d5);
if (dist < minDist)
{
minDist = dist;
minIdx = index;
}
}
i[index] = minIdx;
y[index] = minDist;
}
I'm looking for ways to optimize this program for GPGPU. I've read some articles (including http://www.macresearch.org/opencl_episode6, which comes with source code) about GPGPU optimization using local memory. I've tried applying it and came up with this code:
kernel void BruteForce(
global read_only float4* m,
global float4* y,
global write_only ushort* i,
__local float4 * shared)
{
int index = get_global_id(0);
int lsize = get_local_size(0);
int lid = get_local_id(0);
float4 curY = y[index];
float minDist = MAXFLOAT;
ushort minIdx = 64000;
int x = 0;
for(x = 0; x < {0}; x += lsize)
{
if((x+lsize) > {0})
lsize = {0} - x;
if ( (x + lid) < {0})
{
shared[lid] = m[x + lid];
}
barrier(CLK_LOCAL_MEM_FENCE);
for (int x1 = 0; x1 < lsize; x1++)
{
float dist = distance(curY, shared[x1]);
if (dist < minDist)
{
minDist = dist;
minIdx = x + x1;
}
}
barrier(CLK_LOCAL_MEM_FENCE);
}
i[index] = minIdx;
y[index] = minDist;
}
I'm getting garbage results for my 'i' output (e.g. many values that are the same). Can anyone point me in the right direction? I'd appreciate any answer that helps me improve this code, or maybe find the problem with the optimized version above.
Thank you very much
Cauê
One way to get a big speed up here is to use local data structures and compute entire blocks of data at a time. You should also only need a single read/write global vector (float4). The same idea can be applied to the 6d version using smaller blocks. Each work group is able to work freely through the block of data it is crunching. I will leave the exact implementation to you because you will know the specifics of your application.
some pseudo-ish-code (4d):
computeBlockSize is the size of the blocks to read from global and crunch.
this value should be a multiple of your work group size. I like 64 as a WG
size; it tends to perform well on most platforms. will be
allocating 2 * float4 * computeBlockSize + uint * computeBlockSize of shared memory.
(max value for ocl 1.0 ~448, ocl 1.1 ~896)
#define computeBlockSize 256
__local float4 blockA[computeBlockSize];
__local float4 blockB[computeBlockSize];
__local uint blockAnearestIndex[computeBlockSize];
Now blockA gets computed against all blockB combinations; this is the job of a single work group.
*important*: only blockA ever gets written to. blockB is stored in local memory, but never changed or copied back to global.
steps:
load blockA into local memory with async_work_group_copy
blockA is located at get_group_id(0) * computeBlockSize in the global vector
optional: set all blockA 'w' values to MAXFLOAT
optional: load blockAnearestIndex into local memory with async_work_group_copy if needed
need to compute blockA against itself first, then go into the blockB's
be careful to only write to blockA[j], NOT blockA[k]. j is exclusive to this work item
for(j=get_local_id(0); j<computeBlockSize;j++)
for(k=0;k<computeBlockSize; k++)
if(j==k) continue; //no self-comparison
calculate distance of blockA[j] vs blockA[k]
store min distance in blockA[j].w
store global index (= i*computeBlockSize +k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
for (i=0;i<get_num_groups(0);i++)
if (i==get_group_id(0)) continue;
load blockB into local memory: async_work_group_copy(...)
for(j=get_local_id(0); j<computeBlockSize;j++)
for(k=0;k<computeBlockSize; k++)
calculate distance of blockA[j] vs blockB[k]
store min distance in blockA[j].w
store global index (= i*computeBlockSize +k) of nearest in blockAnearestIndex[j]
barrier(local_mem_fence)
write blockA and blockAnearestIndex to global memory using two async_work_group_copy
There should be no problem in reading a blockB while another work group writes the same block (as its own blockA), because only the W values may have changed. If there happens to be trouble with this -- or if you do require two different vectors of points, you could use two global vectors like you have above, one with the A's (writeable) and the other with the B's (read only).
This algorithm works best when your global data size is a multiple of computeBlockSize. To handle the edges, two solutions come to mind. I recommend writing a second kernel for the non-square edge blocks that would work in a similar manner as above. The new kernel can execute after the first, and you could save the second PCI-e transfer. Alternatively, you can use a distance of -1 to signify a skip in the comparison of two elements (i.e. if either blockA[j].w == -1 or blockB[k].w == -1, continue). This second solution would result in a lot more branching in your kernel though, which is why I recommend writing a new kernel. A very small percentage of your data points will actually fall in an edge block.
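To make the data-movement skeleton concrete, here is a hedged OpenCL sketch of just the async_work_group_copy plumbing from the pseudocode above; computeBlockSize, the single read/write points vector, the nearestIndex vector and the elided comparison bodies are assumptions rather than tested code:
#define computeBlockSize 256

kernel void BruteForceBlocked(global float4 *points,   // .w will hold the min distance
                              global uint *nearestIndex)
{
    local float4 blockA[computeBlockSize];
    local float4 blockB[computeBlockSize];
    local uint nearestA[computeBlockSize];
    uint group = get_group_id(0);

    // load this work group's own block once
    event_t evA = async_work_group_copy(blockA, points + group * computeBlockSize,
                                        computeBlockSize, 0);
    wait_group_events(1, &evA);

    // ... compare blockA against itself here, filling blockA[j].w and nearestA[j] ...

    for (uint b = 0; b < get_num_groups(0); ++b) {
        if (b == group) continue;
        event_t evB = async_work_group_copy(blockB, points + b * computeBlockSize,
                                            computeBlockSize, 0);
        wait_group_events(1, &evB);
        // ... compare blockA[j] (j owned by this work item) against every blockB[k],
        //     updating blockA[j].w and nearestA[j] where a closer point is found ...
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // write the results back to global memory
    event_t evOut = async_work_group_copy(points + group * computeBlockSize, blockA,
                                          computeBlockSize, 0);
    event_t evIdx = async_work_group_copy(nearestIndex + group * computeBlockSize, nearestA,
                                          computeBlockSize, 0);
    wait_group_events(1, &evOut);
    wait_group_events(1, &evIdx);
}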
I want to calculate the product A^T*A (A is a 2000x1000 matrix). Also, I only want to compute the upper triangular part. In the inner loop I have to compute the dot product of two vectors.
Now, here is the problem: using cblas_ddot() is not faster than calculating the dot product with a loop. How is this possible? (Using an Intel Core(TM) i7 CPU M620 @ 2.67GHz, 1.92GB RAM.)
The problem is caused essentially by the matrix size, not by ddot. Your matrices are so large that they do not fit in the cache. The solution is to rearrange the three nested loops such that as much as possible is done with a line in cache, thereby reducing cache refreshes. A model implementation of both the ddot and a daxpy approach follows. On my computer the time ratio was about 15:1.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that the need for cache refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t1 = clock();
printf("Time for DOT : %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
t0 = clock();
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}
The CBLAS dot product is effectively just a computation in a slightly unrolled loop. The Netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
i.e. just a loop unrolled to a stride of 5.
If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:
#include <emmintrin.h>
double ddotsse2(const double *x, const double *y, const int n)
{
double result[2];
int n2 = 2 * (n/2);
__m128d dtemp;
/* fold in the trailing element up front if n is odd */
if ( (n % 2) == 0) {
dtemp = _mm_setzero_pd();
} else {
dtemp = _mm_set_sd(x[n-1] * y[n-1]);
}
/* accumulate two products per iteration */
for(int i=0; i<n2; i+=2) {
__m128d x1 = _mm_loadu_pd(x+i);   /* unaligned loads so x and y need not be 16-byte aligned */
__m128d y1 = _mm_loadu_pd(y+i);
__m128d xy = _mm_mul_pd(x1, y1);
dtemp = _mm_add_pd(dtemp, xy);
}
/* horizontal sum of the two accumulator lanes */
_mm_storeu_pd(&result[0], dtemp);
return result[0] + result[1];
}
(not tested, never been compiled, buyer beware).
This may or may not be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.
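As a hedged illustration of that unrolling idea (a sketch with the same caveats as above), one could keep two independent accumulators so the additions don't serialise; n is assumed to be a multiple of 4 here just to keep the sketch short:
#include <emmintrin.h>

double ddotsse2_unrolled(const double *x, const double *y, int n)
{
    __m128d acc0 = _mm_setzero_pd();   /* two independent partial sums */
    __m128d acc1 = _mm_setzero_pd();
    for (int i = 0; i < n; i += 4) {
        acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(x + i),     _mm_loadu_pd(y + i)));
        acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(x + i + 2), _mm_loadu_pd(y + i + 2)));
    }
    double r[2];
    _mm_storeu_pd(r, _mm_add_pd(acc0, acc1));   /* combine and sum the lanes */
    return r[0] + r[1];
}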
If you're not using SSE2 intrinsics, or you're using a data type that doesn't benefit from them, you can transpose the matrix for an easy performance improvement in larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
int i, j, k;
MKL_INT incx, incy;
incx = 1;
incy = b_size;
//copy out multiplying matrix from larger matrix
float *temp = (float*) malloc(n * n * sizeof(float));
for (i = 0; i < n; ++i) {
cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
}
//transpose
mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
for (i = 0; i < n; i+= BLOCK_SIZE) {
for (j = 0; j < n; j++) {
for (k = 0; k < BLOCK_SIZE; ++k) {
C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
}
}
}
free(temp);
}
On my machine, this code is about an order of magnitude faster than the 3-loop code (but also an order of magnitude slower than a cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL.)
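For reference, the gemm call being compared against would look roughly like this for A^T*A (a sketch only; the row-major leading dimensions are assumptions to check against your own layout, and cblas_ssyrk is the more specific routine for a symmetric product):
#include <mkl_cblas.h>   /* or <cblas.h> with a reference CBLAS */

/* C = A^T * A for a row-major rows x cols matrix A; C is cols x cols. */
void at_a_sgemm(const float *A, float *C, int rows, int cols)
{
    cblas_sgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                cols, cols, rows,      /* M, N, K */
                1.0f, A, cols,         /* op(A) = A^T is cols x rows */
                      A, cols,         /* op(B) = A is rows x cols  */
                0.0f, C, cols);
}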
This one has been bothering me for a while now: Is there a difference (e.g. memory-wise) between this
Pointer *somePointer;
for (...)
{
somePointer = something;
// do stuff with somePointer
}
and this
for (...)
{
Pointer *somePointer = something;
// do stuff with somePointer
}
If you want to use the pointer after you're done with the loop, you need the first form, e.g.:
Pointer *somePointer;
Pointer *somePointer2;
for(loopA)
{
if(meetsSomeCriteria(somePointer)) break;
}
for(loopB)
{
if(meetsSomeCriteria(somePointer2)) break;
}
/* do something with the two pointers */
someFunc(somePointer,somePointer2);
Well, first, in your second example somePointer will be valid only inside the loop (its scope), so if you want to use it outside you have to do it like in snippet #1.
If we look at the assembly we can see that the second snippet needs only 2 more instructions per loop iteration:
Snippet 1:
for(c = 0; c <= 10; c++)
(*p1)++;
0x080483c1 <+13>: lea -0x8(%ebp),%eax # eax = &g
0x080483c4 <+16>: mov %eax,-0xc(%ebp) # p1 = &g
0x080483c7 <+19>: movl $0x0,-0x4(%ebp) # c = 0
0x080483ce <+26>: jmp 0x80483e1 <main+45> # dive in the loop
0x080483d0 <+28>: mov -0xc(%ebp),%eax # eax = p1
0x080483d3 <+31>: mov (%eax),%eax # eax = *p1
0x080483d5 <+33>: lea 0x1(%eax),%edx # edx = eax + 1
0x080483d8 <+36>: mov -0xc(%ebp),%eax # eax = p1
0x080483db <+39>: mov %edx,(%eax) # *p1 = edx
0x080483dd <+41>: addl $0x1,-0x4(%ebp) # c++
0x080483e1 <+45>: cmpl $0xa,-0x4(%ebp) # re-loop if needed
0x080483e5 <+49>: jle 0x80483d0 <main+28>
Snippet 2:
for(c = 0; c <= 10; c++) {
int *p2 = &g;
(*p2)--;
}
0x080483f0 <+60>: lea -0x8(%ebp),%eax # eax = &g
0x080483f3 <+63>: mov %eax,-0x10(%ebp) # p2 = eax
0x080483f6 <+66>: mov -0x10(%ebp),%eax # eax = p2
0x080483f9 <+69>: mov (%eax),%eax # eax = *p2
0x080483fb <+71>: lea -0x1(%eax),%edx # edx = eax - 1
0x080483fe <+74>: mov -0x10(%ebp),%eax # eax = p2
0x08048401 <+77>: mov %edx,(%eax) # *p2 = edx
0x08048403 <+79>: addl $0x1,-0x4(%ebp) # increment c
0x08048407 <+83>: cmpl $0xa,-0x4(%ebp) # loop if needed
0x0804840b <+87>: jle 0x80483f0 <main+60>
OK, the difference is in the first two instructions of snippet #2, which are executed on every loop iteration, while in the first snippet they're executed just once, before entering the loop.
Hope I was clear. ;)
Well, with the first version you only have to release once, after the loop. With the second version you can't use the pointer outside the loop, so you need to release inside the loop. Memory-wise it shouldn't matter that much, but you do have some allocation overhead in your second example, I think.
Check out a similar answer on Stack Overflow here, with some good answers. However, this is probably compiler/language independent...
int lcm_old(int a, int b) {
int n;
for(n=1;;n++)
if(n%a == 0 && n%b == 0)
return n;
}
int lcm(int a,int b) {
int n = 0;
__asm {
lstart:
inc n;
mov eax, n;
mov edx, 0;
idiv a;
mov eax, 0;
cmp eax, edx;
jne lstart;
mov eax, n;
mov edx, 0;
idiv b;
mov eax, 0;
cmp eax, edx;
jnz lstart;
}
return n;
}
I'm trying to beat/match the code for the top function with my own function (bottom). Have you got any ideas how I can optimize my routine?
PS. This is just for fun.
I would optimize by using a different algorithm. Searching linearly like you are doing is super slow. It's a fact that the least common multiple of two natural numbers is their product divided by their greatest common divisor. You can compute the greatest common divisor quickly using the Euclidean algorithm.
Thus:
int lcm(int a, int b) {
int p = a * b;
return p / gcd(a, b);
}
where you need to implement gcd(int, int). As the average number of steps in the Euclidean algorithm is O(log n), we beat the naive linear search hands down.
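For completeness, a minimal iterative Euclidean gcd (a sketch assuming non-negative inputs) could look like this:
int gcd(int a, int b) {
    /* Euclidean algorithm: repeatedly replace (a, b) with (b, a mod b) */
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}
Note that a * b can overflow int; computing a / gcd(a, b) * b instead avoids that.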
There are other approaches to this problem. If you had an algorithm that could quickly factor integers (say, a quantum computer), then you could also solve this problem as follows. Write each of a and b in its canonical prime factorization
a = p_a0^e_a0 * p_a1^e_a1 * ... * p_am^e_am
b = p_b0^e_b0 * p_b1^e_b1 * ... * p_bn^e_bn
then the least common multiple of a and b is obtained by taking each prime factor that appears in at least one of the factorizations, with the maximum exponent with which it appears in the factorization of a or b. For example:
28 = 2^2 * 7
312 = 2^3 * 3 * 13
so that
lcm(28, 312) = 2^3 * 3 * 7 * 13 = 2184
All of this is to point out that naive approaches are admirable in their simplicity but you can spend all day optimizing every last nanosecond out of them and still not beat a superior algorithm.
I'm going to assume you want to keep the same algorithm. This should at least be a slightly more efficient implementation of it. The main difference is that the code in the loop only uses registers, not memory.
int lcm(int a,int b) {
__asm {
xor ecx, ecx
mov esi, a
mov edi, b
lstart:
inc ecx
mov eax, ecx
xor edx, edx
idiv esi
test edx, edx
jne lstart
mov eax, ecx;
idiv edi
test edx, edx
jnz lstart
mov eax, ecx
leave
ret
}
}
As Jason pointed out, however, this really isn't a very efficient algorithm -- multiplying, finding the GCD, and dividing will normally be faster (unless a and b are quite small).
Edit: there is another algorithm that's almost as simple to understand, and should also be a lot faster (than the original -- not than multiplying and then dividing by the GCD). Instead of generating consecutive numbers until you find one that both a and b divide, generate consecutive multiples of one of them (preferably the larger) until you find one that the other divides evenly:
int lcm2(int a, int b) {
__asm {
xor ecx, ecx
mov esi, a
mov edi, b
lstart:
add ecx, esi
mov eax, ecx
xor edx, edx
idiv edi
test edx, edx
jnz lstart
mov eax, ecx
leave
ret
}
}
This remains dead simple to understand, but should give a considerable improvement over the original.
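For reference, the same multiples-of-one-input idea in plain C (a sketch of the algorithm, not the source the __asm block was compiled from):
int lcm2_c(int a, int b) {
    int m = a;               /* walk through multiples of a */
    while (m % b != 0)       /* stop at the first one that b divides evenly */
        m += a;
    return m;
}
Swapping the arguments so that a is the larger of the two reduces the number of iterations.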