Out of memory error. Allocating... - gprof

I'm trying to use the gprof command gprof -s executable.exe gmon.out gmon.sum to merge profiling data gathered from two runs of my program, but the following error appears:
gprof: out of memory allocating 3403207348 bytes after a total of 196608 bytes
My program is quite simple (just one for loop). If I run it once, the run time is too short (it shows 0.00s) for gprof to record anything.
In Cygwin, I do the following steps:
gcc -pg -o fl forAndWhilLoop.c
fl (run the program)
mv gmon.out gmon.sum
fl (run the program)
gprof -s fl.exe gmon.out gmon.sum
gprof fl.exe gmon.sum>gmon.out
gprof fl.exe
My program:
int main(void)
{
int fac=1;
int count=10;
int k;
for(k=1;k<=count;k++)
{
fac = fac * k;
}
return 0;
}
So can anyone help me with this problem? Thanks!

If all you want is to time it, on my machine it's 105ns. Here's the code:
void forloop(void){
int fac=1;
int count=10;
int k;
for(k=1;k<=count;k++)
{
fac = fac * k;
}
}
int main(int argc, char* argv[])
{
int i;
for (i = 0; i < 1000000000; i++){
forloop();
}
return 0;
}
Get the idea? I used a hand-held stopwatch. Since the loop runs 10^9 times, the total run time in seconds equals the time per call in nanoseconds.
Unrolling the inner loop like this reduced the time to 92ns:
int k = 1;
while(k+5 <= count){
fac *= k * (k+1) * (k+2) * (k+3) * (k+4);
k += 5;
}
while(k <= count){
fac *= k++;
}
Switching to Release build from Debug brought it down to 21ns. You can only expect that kind of speedup in an actual hotspot, which this is.
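If a stopwatch is not at hand, the same measurement can be sketched with clock() from <time.h>. This is only an illustration: ns_per_call is a hypothetical helper name, and fac is marked volatile here (unlike in the code above) so that an optimizing compiler cannot delete the loop, which also changes the absolute numbers.

```c
#include <time.h>

/* Same loop body as above; volatile keeps the multiplication from being optimized away. */
static void forloop(void)
{
    volatile int fac = 1;
    int count = 10;
    for (int k = 1; k <= count; k++)
        fac = fac * k;
}

/* Average CPU time per call of forloop(), in nanoseconds, measured with clock(). */
double ns_per_call(long iterations)
{
    clock_t begin = clock();
    for (long i = 0; i < iterations; i++)
        forloop();
    clock_t end = clock();
    double seconds = (double)(end - begin) / CLOCKS_PER_SEC;
    return seconds * 1e9 / (double)iterations;
}
```

Calling ns_per_call(1000000000L) mirrors the 10^9 iterations used above; the result depends on the machine and on the build settings.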

It seems that pprof should be run instead of gprof.


SMHasher setup?

The SMHasher test suite for hash functions is touted as the best of the lot. But the latest version I've got (from rurban) gives absolutely no clue on how to check your own proposed hash function (it does include an impressive battery of hash functions, but some of interest, if only for historical value, are missing). On top of that, I'm a complete CMake newbie.
It's actually quite simple. You just need to install CMake.
Building SMHasher
To build SMHasher on a Linux/Unix machine:
git clone https://github.com/rurban/smhasher
cd smhasher/
git submodule init
git submodule update
cmake .
make
Adding a new hash function
To add a new function, you can edit just three files: Hashes.cpp, Hashes.h and main.cpp.
For example, I will add the ElfHash:
unsigned long ElfHash(const unsigned char *s)
{
unsigned long h = 0, high;
while (*s)
{
h = (h << 4) + *s++;
if (high = h & 0xF0000000)
h ^= high >> 24;
h &= ~high;
}
return h;
}
First, you need to modify it slightly to take a seed and a length:
uint32_t ElfHash(const void *key, int len, uint32_t seed)
{
uint32_t h = seed, high; /* fixed 32-bit state so the result does not depend on sizeof(long) */
const uint8_t *data = (const uint8_t *)key;
for (int i = 0; i < len; i++)
{
h = (h << 4) + *data++;
if ((high = h & 0xF0000000) != 0)
h ^= high >> 24;
h &= ~high;
}
return h;
}
Add this function definition to Hashes.cpp. Also add the following to Hashes.h:
uint32_t ElfHash(const void *key, int len, uint32_t seed);
inline void ElfHash_test(const void *key, int len, uint32_t seed, void *out) {
*(uint32_t *) out = ElfHash(key, len, seed);
}
In file main.cpp add the following line into array g_hashes:
{ ElfHash_test, 32, 0x0, "ElfHash", "ElfHash 32-bit", POOR, {0x0} },
(The third value is the self-verification value. You only learn the correct value after running the test once.)
Finally, rebuild and run the test:
make
./SMHasher ElfHash
It will show you all the tests that this hash function fails. (It is very bad.)
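Before wiring a new function into SMHasher, it can help to compile the function and its wrapper on their own and check a value or two by hand. The sketch below is a standalone C version of the seeded function and wrapper from above; it assumes a 32-bit state variable (uint32_t rather than unsigned long) so that the result is the same on every platform.

```c
#include <stdint.h>

/* Seeded ElfHash over an arbitrary byte buffer.
 * The 32-bit state is an assumption of this sketch. */
uint32_t ElfHash(const void *key, int len, uint32_t seed)
{
    uint32_t h = seed, high;
    const uint8_t *data = (const uint8_t *)key;
    for (int i = 0; i < len; i++) {
        h = (h << 4) + *data++;
        if ((high = h & 0xF0000000u) != 0)
            h ^= high >> 24;   /* fold the top nibble back into the low bits */
        h &= ~high;            /* then clear the top nibble */
    }
    return h;
}

/* Same adapter shape as ElfHash_test above: writes the 32-bit result to out. */
void ElfHash_test(const void *key, int len, uint32_t seed, void *out)
{
    *(uint32_t *)out = ElfHash(key, len, seed);
}
```

For the one-byte key "a" with seed 0 the result is 97 (the byte value), and for "ab" it is (97 << 4) + 98 = 1650; neither key is long enough to reach the high-nibble folding.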

Understanding RiscV objdump

I am examining the objdump of a C file that I have compiled using the following commands:
riscv64-unknown-elf-gcc -O0 -o maxmul.o maxmul.c
riscv64-unknown-elf-objdump -d maxmul.o > maxmul.dump
Strangely (or not), the addresses appear to be aligned not on 32-bit words but on 16-bit boundaries.
Can anyone please explain why?
Thanks.
objdump excerpt:
00000000000101da <main>:
101da: 7155 addi sp,sp,-208
101dc: e586 sd ra,200(sp)
101de: e1a2 sd s0,192(sp)
101e0: 0980 addi s0,sp,208
...
C-code:
int main()
{
int first[3][3], second[3][3], multiply[3][3];
int golden[3][3];
int sum = 0; /* initialized: it is accumulated with += before the first assignment */
first[0][0] = 1; first[0][1] = 2; first[0][2] = 3;
first[1][0] = 4; first[1][1] = 5; first[1][2] = 6;
first[2][0] = 7; first[2][1] = 8; first[2][2] = 9;
second[0][0] = 9; second[0][1] = 8; second[0][2] = -7;
second[1][0] = -6; second[1][1] = 5; second[1][2] = 4;
second[2][0] = 3; second[2][1] = 2; second[2][2] = -1;
golden[0][0] = 6; golden[0][1] = 24; golden[0][2] = -2;
golden[1][0] = 24; golden[1][1] = 69; golden[1][2] = -14;
golden[2][0] = 42; golden[2][1] = 1140; golden[2][2] = -26;
int i, ii, iii;
for (i = 0; i < 3; i++) {
for (ii = 0; ii < 3; ii++) {
for (iii = 0; iii < 3; iii++) {
//printf("first[%d][%d] * second[%d][%d] \n", i, iii, iii, ii);
//printf("%d * %d (%d,%d)\n", first[i][ii], second[ii][i], i, ii);
sum += first[i][iii] * second[iii][ii];
}
//printf("sum = %d\n", sum);
multiply[i][ii] = sum;
sum = 0;
}
}
int c, d;
int err = 0; /* initialized: it is incremented and compared with 0 below */
for ( c = 0; c < 3; c++) {
for ( d = 0; d < 3; d++) {
//printf("%d\t", multiply[c][d]);
if (multiply[c][d] != golden[c][d]) {
fail(golden[c][d], multiply[c][d]);
err++;
}
}
//printf("\n");
}
if (err == 0) {
pass();
}
return 0;
}
I suspect that your gcc compiles by default with the compressed instruction format, in which 16-bit and 32-bit instructions can be intermixed. In that case, 16-bit instructions are 16-bit aligned, as you can see in the disassembled code.
Objdump shows the address, the encoding, and the mnemonic. The encoding in your excerpt is always 16 bits wide, which means that the compiler selected 16-bit instructions wherever possible.
By enabling verbose mode (-v), you can see that the defaults are -march=rv64imafdc and -mabi=lp64d. The default target ISA includes the compressed extension, and the target ABI requires the double-precision floating-point extension.
With -march=rv64imafd and the ABI left unchanged, gcc compiles using only 32-bit instructions, because the compressed extension is no longer enabled.
Instruction addresses are then always 32-bit aligned.
When compiling (or assembling) to RV64GC or RV32GC (or another target that enables the "C" Standard Extension Compressed Instructions), the compiler (or assembler) automatically replaces some instructions with compressed ones.
Non-compressed instructions are encoded in 32 bit, while compressed instructions are encoded in 16 bit.
When a compressed instruction is emitted, it changes the alignment of the next instruction, either from 32 bit to 16 bit or from 16 bit to 32 bit. That means not only 16-bit-wide instructions may sit at a 16-bit-aligned address, but 32-bit-wide ones as well. In other words, both types of instructions (compressed and normal) are tightly packed side by side.
By default, objdump -d doesn't explicitly indicate that an instruction is compressed, because it uses the same mnemonic as for the uncompressed variant, although the number of bytes in the displayed raw instruction gives it away (4 vs. 2 bytes).
However, you can tell objdump to use separate mnemonics for compressed instructions so that they are more easily recognizable (they then start with c.), e.g.:
$ riscv64-unknown-elf-objdump -d -M no-aliases rotate
[..]
101e4: 00d66533 or a0,a2,a3
101e8: 8082 c.jr ra
00000000000101ea <rotr>:
101ea: 00b55633 srl a2,a0,a1
[..]
Note that with -M no-aliases, pseudo-instructions are no longer displayed; the corresponding real instruction(s) are shown instead.
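The "4 vs. 2 bytes" distinction can also be read off the encoding itself: in the RISC-V base encoding scheme, an instruction whose two lowest bits are not 11 is a 16-bit compressed instruction, and one whose two lowest bits are 11 (with bits 4:2 not all ones) is 32 bits wide. A minimal sketch in C, where rv_insn_length is a hypothetical helper name and not part of any toolchain:

```c
#include <stdint.h>

/* Length in bytes of a RISC-V instruction, given its lowest 16 bits.
 * Returns 2 for a compressed instruction, 4 for a standard 32-bit one,
 * and 0 for the longer encodings that this sketch does not decode. */
int rv_insn_length(uint16_t low_halfword)
{
    if ((low_halfword & 0x3u) != 0x3u)
        return 2;               /* bits [1:0] != 11: compressed */
    if ((low_halfword & 0x1Cu) != 0x1Cu)
        return 4;               /* bits [1:0] == 11, bits [4:2] != 111 */
    return 0;                   /* wider than 32 bits */
}
```

Applied to the listings above, 0x7155, 0xe586 and 0x8082 all give 2, while the low halves of 00d66533 and 00b55633 (0x6533 and 0x5633) give 4.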

Debug data/neon performance hazards in arm neon code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON and a minor part of it was taking 80% of the time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performs better:
typedef unsigned(*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);
CalcMaxFunc CalcMaxFuncs[] =
{
CalcMaxFunc_NEON_0,
CalcMaxFunc_NEON_1,
CalcMaxFunc_NEON_2,
CalcMaxFunc_NEON_3,
CalcMaxFunc_C_0
};
int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
auto f = CalcMaxFuncs[i % N];
unsigned retI = f(a, b);
// just random code to ensure that cpu waits for the results
// and compiler doesn't optimize it away
if (retI > 1000000)
break;
ret |= retI;
}
I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I moved CalcMaxFunc_NEON_3 to be first, it became 3-4 times slower, and according to the profiler it stalled at the very end of the function, where it moves data from a NEON register to an ARM register.
So, what makes it stall sometimes and not at other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between the calls to these functions in the loop, the unreliable behavior went away: now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in actual code?
Update:
I tried to create a simple test function and then optimize it in stages and see how I could possibly avoid neon->arm stalls.
Here's the test runner function:
void NeonStallTest()
{
int findMinErr(uint8_t* var1, uint8_t* var2, int size);
srand(0);
uint8_t var1[1280];
uint8_t var2[1280];
for (int i = 0; i < sizeof(var1); ++i)
{
var1[i] = rand();
var2[i] = rand();
}
#if 0 // early exit?
for (int i = 0; i < 16; ++i)
var1[i] = var2[i];
#endif
int ret = 0;
for (int i=0; i<10000000; ++i)
ret += findMinErr(var1, var2, sizeof(var1));
exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
int ret = 0;
int ret_err = INT_MAX;
for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
{
int err = 0;
for (int j = 0; j < 16; ++j)
{
int x = var1[j] - var2[j];
err += x * x;
}
if (ret_err > err)
{
ret_err = err;
ret = i;
}
}
return ret;
}
Basically, it computes the sum of squared differences between each pair of uint8_t[16] blocks and returns the index of the block pair with the lowest squared difference. I then rewrote it with NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
int ret = 0;
int ret_err = INT_MAX;
for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
{
int err;
uint8x8_t var1_0 = vld1_u8(var1 + 0);
uint8x8_t var1_1 = vld1_u8(var1 + 8);
uint8x8_t var2_0 = vld1_u8(var2 + 0);
uint8x8_t var2_1 = vld1_u8(var2 + 8);
int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1
err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
uint32x4_t err0 = vpaddlq_u16(u0);
uint32x4_t err1 = vpaddlq_u16(u1);
err0 = vaddq_u32(err0, err1);
uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
err00 = vpadd_u32(err00, err00);
err = vget_lane_u32(err00, 0);
#endif
if (ret_err > err)
{
ret_err = err;
ret = i;
#if 0 // enable early exit?
if (ret_err == 0)
break;
#endif
}
}
return ret;
}
Now, if (ret_err > err) is clearly a data hazard. I then manually unrolled the loop by two, modified the code to use err0 and err1, and checked them only after performing the next round of computation. According to the profiler I got some improvement. In the simple NEON loop, roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err). After unrolling by two, these operations took 25% (i.e. I got roughly a 10% overall speedup). Also, in the armv7 version there are only 8 instructions between where err0 is set (vmov.32 r6, d16[0]) and where it's accessed (cmp r12, r6).
Note that the early exit is ifdefed out in the code. Enabling it makes the function even slower. When I unrolled by four, switched to four errN variables, and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. When I checked the generated asm, it appeared that the compiler defeats these optimization attempts because it reuses some of the errN registers, which makes the CPU access the results of vget_lane_u32 much earlier than I want (I aim to delay the access by 10-20 instructions).
Only when I unrolled by four and marked all four errN variables as volatile did vget_lane_u32 disappear from the profiler entirely. However, the if (ret_err > errN) checks then became very slow, probably because the variables ended up as regular stack variables; overall, these four checks in the 4x manually unrolled loop took 40%.
It looks like proper hand-written asm could make this work: keep the early loop exit, avoid NEON->ARM stalls, and still have some ARM logic in the loop. However, the extra effort of dealing with ARM asm makes that kind of code 10x harder to maintain in a large project (one that doesn't have any ARM asm).
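To make the "defer the check by one round" idea concrete, here is a scalar C sketch of the loop structure only. It is not the NEON code: block_err stands in for the vector body, findMinErr_deferred is a hypothetical name, and, as described above, the compiler remains free to reschedule these operations, so this shows the source-level shape rather than a guaranteed instruction order.

```c
#include <limits.h>
#include <stdint.h>

/* Sum of squared differences over one 16-byte block (stand-in for the NEON body). */
static int block_err(const uint8_t *a, const uint8_t *b)
{
    int err = 0;
    for (int j = 0; j < 16; ++j) {
        int x = a[j] - b[j];
        err += x * x;
    }
    return err;
}

/* The error of block i-1 is compared only after block i has been computed. */
int findMinErr_deferred(const uint8_t *var1, const uint8_t *var2, int size)
{
    int blocks = size / 16;
    int ret = 0;
    int ret_err = INT_MAX;
    if (blocks == 0)
        return 0;
    int pending = block_err(var1, var2);            /* block 0 */
    for (int i = 1; i < blocks; ++i) {
        int next = block_err(var1 + 16 * i, var2 + 16 * i);
        if (ret_err > pending) {                    /* deferred check of block i-1 */
            ret_err = pending;
            ret = i - 1;
        }
        pending = next;
    }
    if (ret_err > pending)                          /* check of the last block */
        ret = blocks - 1;
    return ret;
}
```

Like findMinErr above, it returns the index of the first block with the smallest error.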
Update:
Here's a sample stall when moving data from a NEON register to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop iteration. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried adding lots of no-ops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 as a no-op: no difference. What is the reason for the stall, or is the profiler simply showing wrong results?

Gaussian Elimination in OpenMP - Performance Problems

I'm new to OpenMP. I was trying to parallelize a Gaussian elimination, and I'm having trouble with performance. I'm compiling the code below using:
gcc -o gaussian_elimination gaussian_elimination.c -lm -lgsl -lgslcblas -fopenmp -Wall
I set the number of threads in the terminal with export OMP_NUM_THREADS.
My problem is that the parallel version of this code runs much slower than the serial version. I believe this is because I put #pragma omp parallel for inside the outer loop, which would force OpenMP to create and destroy threads at each iteration. That would be incredibly costly, but I haven't seen any other clear way to do the same kind of operation, and I don't think I can interchange the outer loop with the inner parallel ones.
I'm probably missing something, but I have not found any other threads here discussing this particular problem. As far as correctness goes, my code seems to work fine; the problem is purely performance.
Thanks in advance.
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <time.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_rng.h>
#define DEBUG_MODE false
int random_matrix(double *A, int N,long long int seed);
int print_matrix(double *A, int N);
int print_vector(float *b,int N);
int main(int argc, char **argv){
int N=1000;
int i,j,k,l,i_p,s,err,D=N+1;
long long int seed=9089123498274; // just a fixed seed only not to bother
double *A,pivot,sw,tmp,begin,end,time_spent;
double *Aref,*bref;
gsl_matrix_view gsl_m;
gsl_vector_view gsl_b;
gsl_vector *gsl_x;
gsl_permutation *gsl_p;
/* Input */
//scanf("%d",&N);
A = (double*)malloc(N*(N+1)*sizeof(double));
if(A==NULL){
printf("Matrix A not allocated\n");
return 1;
}
Aref = (double*)malloc(N*N*sizeof(double));
if(Aref==NULL){
printf("Matrix Aref not allocated\n");
return 1;
}
bref = (double*)malloc(N*sizeof(double));
if(bref==NULL){
printf("Vector B not allocated\n");
return 2;
}
/*
for(i=0;i<N;i+=1)
for(j=0;j<N;j+=1)
scanf("%f",&(A[i*N+j]));
for(i=0;i<N;i+=1)
scanf("%f",&(b[i]));
*/
/*
for(i=0;i<N*N;i++)
A[i]=(float) a_data[i];
for(i=0;i<N;i+=1)
b[i]=(float) b_data[i]; */
err= random_matrix(A,N,seed);
if(err!=0)
return err;
for(i=0;i<N;i++)
for(j=0;j<N;j+=1)
Aref[i*N+j]= A[i*D+j];
for(i=0;i<N;i+=1)
bref[i]= A[i*D+N];//b[i];
printf("GSL reference:\n");
gsl_m = gsl_matrix_view_array (Aref, N, N);
gsl_b = gsl_vector_view_array (bref, N);
gsl_x = gsl_vector_alloc (N);
gsl_p = gsl_permutation_alloc(N);
begin = clock();
gsl_linalg_LU_decomp(&gsl_m.matrix, gsl_p, &s);
gsl_linalg_LU_solve(&gsl_m.matrix, gsl_p, &gsl_b.vector, gsl_x);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("gsl matrix solver: %lf s\n",time_spent);
if(DEBUG_MODE==true)
gsl_vector_fprintf(stdout,gsl_x,"%f");
gsl_permutation_free(gsl_p);
gsl_vector_free(gsl_x);
begin = omp_get_wtime();
for(i=0;i<N;i+=1){
i_p = i;
pivot = fabs(A[i*D+i]);
for(j=i;j<N;j+=1)
if(pivot<fabs(A[j*D+i])){
pivot = fabs(A[j*D+i]);
i_p = j;
}
#pragma omp parallel for shared(i,N,A,i_p) private(j,sw)
for(j=i;j<D;j+=1){
sw = A[i*D+j];
A[i*D+j] = A[i_p*D+j];
A[i_p*D+j] = sw;
}
pivot=A[i*D+i];
#pragma omp parallel for shared(i,D,pivot,A) private(j)
for(j=0;j<D;j++)
A[i*D+j]=A[i*D+j]/pivot;
#pragma omp parallel for shared(i,A,N,D) private(tmp,j,k,l)
for(j=i+1;j<N+i;j++){
k=j%N;
tmp=A[k*D+i];
for(l=0;l<D;l+=1)
A[k*D+l]=A[k*D+l]-tmp*A[i*D+l];
}
}
end = omp_get_wtime();
time_spent = (end - begin);
printf("omp matrix solver: %lf s\n",time_spent);
/* Output */
if(DEBUG_MODE==true){
printf("\nCalculated: \n");
for(i=0;i<N;i+=1)
printf("%.6f \n",A[i*(N+1)+N]);
printf("\n");
}
free(A);
free(Aref);
free(bref);
return 0;
}
int random_matrix(double *A, int N,long long int seed){
int i,j;
const gsl_rng_type * T;
gsl_rng *r;
gsl_rng_env_setup();
T = gsl_rng_default;
r = gsl_rng_alloc (T);
for(i=0;i<N;i++)
for(j=0;j<=N;j++)
A[i*(N+1)+j]= gsl_rng_uniform (r);
gsl_rng_free (r);
return 0;
}
int print_matrix(double *A, int N){
int i,j;
/* Each row of the augmented matrix has N+1 entries: columns 0..N-1 hold A, column N holds b. */
for(i=0;i<N;i++)
for(j=0;j<=N;j++){
if(j==0 || j==N)
printf(" | ");
printf("%.2f ",A[i*(N+1)+j]);
if(j==N)
printf(" |\n");
}
return 0;
}
int print_vector(float *b,int N){
int i;
for(i=0;i<N;i+=1)
printf("%f\n", b[i]);
return 0;
}
I updated the code above to use omp_get_wtime(), and now the wall time decreases as I add more threads, so it does behave as it should, although not as cleanly as I would like.
For 1000 x 1000 matrices I get 0.25 s for the GSL library, 4.4 s for the serial OMP run and 1.5 s for the 4-thread run.
For 3000 x 3000 matrices I get ~9 s for the GSL library, ~117 s for the serial OMP run and ~44 s for the 4-thread run, so adding more threads does speed up the program!
Thanks a lot, everyone.
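On the worry about opening a parallel region in every outer iteration: one way to rule it out is to open a single parallel region around the whole outer loop and use work-sharing constructs inside it. The sketch below is an assumption-laden illustration, not a tuned solver: gauss_jordan_one_region is a hypothetical name, the pivot search, row swap and normalization are simply left to one thread, and the update loops start at column i because the columns to the left of the pivot are already zero in the rows involved. Without -fopenmp the pragmas are ignored and the code runs serially.

```c
/* Absolute value without <math.h>, so the sketch needs no -lm. */
static double absd(double x) { return x < 0.0 ? -x : x; }

/* Gauss-Jordan elimination on an N x (N+1) augmented matrix stored row-major in A.
 * On return, column N holds the solution. One parallel region spans the outer loop. */
void gauss_jordan_one_region(double *A, int N)
{
    int D = N + 1;
    #pragma omp parallel
    {
        for (int i = 0; i < N; ++i) {      /* every thread runs the outer loop */
            #pragma omp single
            {
                /* One thread: partial pivoting, row swap, normalization of row i. */
                int i_p = i;
                double best = absd(A[i * D + i]);
                for (int j = i + 1; j < N; ++j)
                    if (absd(A[j * D + i]) > best) {
                        best = absd(A[j * D + i]);
                        i_p = j;
                    }
                for (int j = i; j < D; ++j) {
                    double sw = A[i * D + j];
                    A[i * D + j] = A[i_p * D + j];
                    A[i_p * D + j] = sw;
                }
                double pivot = A[i * D + i];
                for (int j = i; j < D; ++j)
                    A[i * D + j] /= pivot;
            }                              /* implicit barrier: row i is ready */

            #pragma omp for schedule(static)
            for (int k = 0; k < N; ++k) {  /* rows are split among the threads */
                if (k == i)
                    continue;
                double tmp = A[k * D + i];
                for (int l = i; l < D; ++l)
                    A[k * D + l] -= tmp * A[i * D + l];
            }                              /* implicit barrier before the next pivot */
        }
    }
}
```

For the 2 x 2 system 2x + y = 5, x + 3y = 10, stored as {2, 1, 5, 1, 3, 10}, the last column ends up holding x = 1 and y = 3. Whether this is actually faster than the version above depends on the OpenMP runtime, since many runtimes already reuse their threads between parallel regions.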

Create a Fraction array

I have to create a dynamic array capable of holding 2*n Fractions.
If the dynamic array cannot be allocated, the function prints a message and calls exit(1).
It then fills the array with reduced random Fractions whose numerator
is between 1 and 20, inclusive, and whose initial denominator
is between 2 and 20, inclusive.
I already wrote the function that creates a fraction and reduces it; this is what I have so far. When I compile and run this program, it crashes and I can't find out why. If I put 1 instead of 10 in test.c, it doesn't crash but it gives me a crazy fraction. If I put 7, 8, or 11 in test.c, it crashes. I would appreciate it if someone could help me.
FractionSumTester.c
Fraction randomFraction(int minNum, int minDenom, int max)
{
Fraction l;
Fraction m;
Fraction f;
l.numerator = randomInt(minNum, max);
l.denominator = randomInt(minDenom, max);
m = reduceFraction(l);
while (m.denominator <= 1)
{
l.numerator = randomInt(minNum, max);
l.denominator = randomInt(minDenom, max);
m = reduceFraction(l);
}
return m;
}
Fraction *createFractionArray(int n)
{
Fraction *p;
int i;
p = malloc(n * sizeof(Fraction));
if (p == NULL)
{
printf("error");
exit(1);
}
for(i=0; i < 2*n ; i++)
{
p[i] = randomFraction(1,2,20);
printf("%d/%d\n", p[i].numerator, p[i].denominator);
}
return p;
}
This is what I am using to test these two functions.
test.c
#include "Fraction.h"
#include "FractionSumTester.h"
#include <stdio.h>
int main()
{
createFractionArray(10);
return 0;
}
In your createFractionArray() function, you malloc() space for n items. Then, in the for loop, you write 2*n items into that space, which overruns your buffer and causes the crash.
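A sketch of the fix, with the allocation and the loop sharing one element count. Fraction.h is not shown in the question, so the struct layout here is an assumption, and randomFractionStub is a hypothetical stand-in for your randomFraction() and reduceFraction():

```c
#include <stdio.h>
#include <stdlib.h>

/* Assumed layout; the real definition lives in Fraction.h. */
typedef struct {
    int numerator;
    int denominator;
} Fraction;

static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Stand-in for randomFraction(): a reduced fraction with denominator > 1. */
static Fraction randomFractionStub(int minNum, int minDenom, int max)
{
    Fraction f;
    do {
        f.numerator = minNum + rand() % (max - minNum + 1);
        f.denominator = minDenom + rand() % (max - minDenom + 1);
        int g = gcd(f.numerator, f.denominator);
        f.numerator /= g;
        f.denominator /= g;
    } while (f.denominator <= 1);
    return f;
}

Fraction *createFractionArray(int n)
{
    size_t count = 2 * (size_t)n;             /* the array must hold 2*n Fractions */
    Fraction *p = malloc(count * sizeof *p);  /* allocate exactly what the loop fills */
    if (p == NULL) {
        fprintf(stderr, "could not allocate %zu Fractions\n", count);
        exit(1);
    }
    for (size_t i = 0; i < count; i++)
        p[i] = randomFractionStub(1, 2, 20);
    return p;
}
```

The caller owns the returned array and should free() it when done; test.c above currently discards the pointer.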