I have been banging my head for several hours over a strange problem. I suspect I have a memory issue.
I have a PCB with an ATmega328P in DIP format and an I2C OLED display with 128x64 pixels. At first I was using the Adafruit library, but I quickly ran into stability issues when my RAM usage reached about 48%. I learned that the Adafruit library uses more RAM than my compiler (arduino-cli) reports.
So I migrated to the <U8x8lib.h> library, which works significantly better.
But now that my program has started to grow, I again face strange behaviour.
The 'weird' part is happening in this switch-case:
void updateLCD()
{
    clearDisplay() ;
    delay(10);
    printNumberAt(5,5,2,mode) ;
    delay(1000);
    switch( mode )
    {
    case locos :
        drawSpeed( speed ) ;
        drawFunctions() ;
        printAt(0,0, F("Loco:")) ; printNumberAt( 7, 0, 3, currentAddress ) ;
        break ;
    case points :
        printAt(0, 0, F("POINT #") ) ;
        printNumberAt( 7, 0, 3, pointAddress ) ;
        uint8_t bit = pointAddress % 8 ;
        uint8_t group = pointAddress / 8 ;
        bool state = !bitRead( pointStates[ group ], bit) ;
        if( !state ) printCustom( 13, 0, 0, straight ) ;
        else printCustom( 13, 0, 0, curved ) ;
        break ;
    case gettingAddress:
        printAt(0, 0, F("ENTER ADDRESS") ) ;
        break ;
    case gettingSlot:
        printAt(0, 0, F("ENTER SLOT") ) ;
        break ;
    case pointStreets :
        printAt(0, 0, F("setting street" ) ) ;
        printAt(0, 1, F("enter number" ) ) ;
        break ;
    case locoSlots:
        drawFunctions() ;
        drawSpeed( speed) ;
        printAt(0,0, F("Loco Slot")) ; printNumberAt( 9, 0, 3, slot ) ;
        printDescription( eepromLoco.name, 1 );
        break ;
    case programs:
        printAt(0, 0, F("PROGRAM MODE") ) ;
        printAt(0, 1, F("CHANNEL #") ) ;
        printNumberAt(5,2, 4, channel ) ;
        uint8 state1 = program1.getState() ;
        if( state1 == recording ) printAt(0, 3, F("recording") ) ;
        if( state1 == playing ) printAt(0, 3, F("playing") ) ;
        if( state1 == finishing ) printAt(0, 3, F("finishing") ) ;
        if( state1 == idle ) printAt(0, 3, F("idle") ) ;
        break;
    }
}
What is happening: every single case used to work flawlessly, and now only the top one or two cases work. If I change the order of the cases, I can make some texts 'disappear' and others 'reappear'.
To verify that the switch variable mode did not contain an invalid value, I added these lines:
delay(10);
printNumberAt(5,5,2,mode) ;
delay(1000);
Mode is always correct in all situations, yet the text of the selected case is not shown.
My best guess is that my heap and stack are colliding. From what I have read, that can cause rather vague symptoms like the ones I am having now.
As my program is this big and the bug is related to the OLED display, with which I had stability issues earlier, I think I may be on the right track.
Sketch uses 22554 bytes (69%) of program storage space. Maximum is 32256 bytes.
Global variables use 1414 bytes (69%) of dynamic memory, leaving 634 bytes for local variables. Maximum is 2048 bytes.
Are there methods to verify or detect that heap and stack are colliding?
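One technique often used for this on small microcontrollers is "stack painting": fill the free RAM between the heap and the stack with a known byte at startup, then later count how many bytes still hold it. The sketch below only illustrates the idea on an ordinary buffer so that it can run on a PC. The names paint_region, untouched_bytes and PAINT_BYTE are made up for this example, and locating the real free region on the ATmega328P depends on the toolchain and is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAINT_BYTE 0xA5u   /* hypothetical fill pattern */

/* Fill the (still unused) region with the pattern, e.g. early at startup. */
void paint_region(uint8_t *region, size_t len)
{
    assert(region != NULL);
    memset(region, PAINT_BYTE, len);
}

/* Count the bytes at the low end that still hold the pattern.
   Assumption: only the stack eats into the region, growing downward
   from the high end, so the low end is the last part to be touched. */
size_t untouched_bytes(const uint8_t *region, size_t len)
{
    size_t n = 0;
    while (n < len && region[n] == PAINT_BYTE)
        n++;
    return n;
}
```

If the count ever drops to zero, or close to it, the stack has reached the far end of the region at some point, which would support the collision theory.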
What other causes could be probable?
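One candidate besides RAM exhaustion that may be worth ruling out: `case points` and `case programs` declare and initialize locals (bit, group, state, state1) directly under the case label, without braces. In C++, jumping to a later case label past such an initialization is ill-formed. As far as I know, the Arduino build passes -fpermissive, which turns this error into a warning, and the cases that follow can then misbehave. This is an assumption about the cause, not a confirmed diagnosis. A minimal sketch of the usual remedy, giving each such case its own block, with hypothetical mode names and return codes:

```c
#include <stdint.h>

/* Hypothetical modes, standing in for the sketch's own enum. */
enum Mode { MODE_LOCOS, MODE_POINTS, MODE_ADDRESS };

/* Returns a small code per mode so the selected branch can be checked. */
int handle_mode(enum Mode mode, uint8_t pointAddress)
{
    switch (mode) {
    case MODE_LOCOS:
        return 1;

    case MODE_POINTS: {
        /* The braces give these locals their own scope, so no later
           case label can jump past their initialization. */
        uint8_t bit   = pointAddress % 8;
        uint8_t group = pointAddress / 8;
        return 100 + group * 10 + bit;
    }

    case MODE_ADDRESS:
        return 2;
    }
    return -1;   /* unknown mode */
}
```

Compiling with warnings enabled should show whether the compiler reports a "crosses initialization" warning for the original function.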
I'm stuck. I need a (deceptively) simple operation on a tibble...
One of the columns is a string. I also have vars, a character vector whose elements match column names of the tibble.
So I need to replace every occurrence of my vars in my_tib$thestring with the corresponding value in the tibble.
Here is an example:
library(tidyverse)  # tribble, mutate, map2, str_replace_all
vars <- c("Yes", "No", "Maybe")
my_tib <- tribble(
~Yes, ~No, ~Maybe, ~thestring,
1, 0, 2 , "Sometimes Yes is YES",
1, 0, 3 , "Sometimes Yes others is No or Maybe",
1, 0, 4 , "Sometimes Yes while Maybe...",
1, 0, 5 , "Sometimes Yes is Yes and No and maybe",
)
# Intended Result
my_tib_result <- tribble(
~Yes, ~No, ~Maybe, ~thestring,
1, 0, 2 , "Sometimes 1 is YES",
1, 0, 3 , "Sometimes 1 others is 0 or 3",
1, 0, 4 , "Sometimes 1 while 4...",
1, 0, 5 , "Sometimes 1 is 1 and 0 and 5",
)
I'm sure it's simple (:) or not :))... but I'm not moving from this point... so a push would be most welcome.
Thank you very much for your comments and help.
AC
I have found a method... not the most elegant, but it works... so I am sharing the solution.
If anyone has a better idea, I would appreciate it.
My solution:
# Create a function
chg_each <- function(str, tb){
  tb %>% mutate(
    # Note the 'as.character' and 'get' ... for map2
    # ('thestring' is the text column of the example tibble)
    thestring = map2_chr(thestring, as.character(get(str)),
                         ~if_else(is.na(.x), "",
                                  str_replace_all(.x, str, .y))))
}
# Iterate over all vars to change
end_my_tib <- my_tib
for(var in vars){
  end_my_tib <- chg_each(var, end_my_tib)
}
I have tried to work it out, but my answer doesn't match the solution in the text.
Could anyone explain how to find the time complexity?
for (int i=0; i<n; i++)
    for (int j=i; j< i*i; j++)
        if (j%i == 0)
        {
            for (int k=0; k<j; k++)
                printf("*");
        }
Let f(n) be the number of operations aggregated from the outer loop.
Let g(n) be the number of operations aggregated at the level of the first inner loop.
Let h(n) be the number of operations performed at the level of the third (innermost) loop.
Looking at the innermost loop
for (int k=0; k<j; k++)
    printf("*");
we can say that h(j) = j.
Now, as j varies from i to i*i, the following values of j satisfy j%i = 0, i.e. j is a multiple of i:
j = 1.i
j = 2.i
j = 3.i
...
j = (i-1).i
So
g(i) = sum(j=i, j<i^2, h(j) if j%i=0, else 0)
= h(i) + h(2.i) + ... + h((i-1).i)
= i + 2.i + ... + (i-1).i
= i.(1 + 2 + ... + i-1) = i.i.(i-1)/2
= 0.5i^3 // dropped the term -0.5i^2 dominated by i^3 as i -> +Inf
=> f(n) = sum(i=0, i<n, g(i))
= sum(i=0, i<n, 0.5i^3)
<= sum(i=0, i<n, 0.5n^3)
<= 0.5n^4
=> f(n) = O(n^4)
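As a sanity check on g(i) = i.i.(i-1)/2, here is a small sketch that counts the printf calls directly and compares them with the sum of g(i). The function names count_visits and closed_form are made up for this example. Note that for i = 0 and i = 1 the j-loop body never runs, so j%i is never evaluated with i = 0.

```c
#include <stdint.h>

/* Count how many times the innermost statement would run for a given n. */
uint64_t count_visits(uint32_t n)
{
    uint64_t visits = 0;
    for (uint64_t i = 0; i < n; i++)
        for (uint64_t j = i; j < i * i; j++)
            if (j % i == 0)
                visits += j;   /* the k-loop runs exactly j times */
    return visits;
}

/* Sum of g(i) = i*i*(i-1)/2 over i = 2 .. n-1 (g is 0 for i < 2). */
uint64_t closed_form(uint32_t n)
{
    uint64_t total = 0;
    for (uint64_t i = 2; i < n; i++)
        total += i * i * (i - 1) / 2;
    return total;
}
```

Both functions agree for small n, and since the sum of i^3 dominates, the total grows like n^4, consistent with the O(n^4) bound.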
Could anyone explain how to find the time complexity?
A claim of O( N^5 ) that was posted was not supported by experimental data.
It is best to start with experiments at a small scale:
#include <stdio.h>

int main( void )
{
    /* N is capped so that i*i and N*2 stay inside the int range;
       the larger scales already take a very long time to run. */
    for ( int aScaleOfBigO_N  = 1;
              aScaleOfBigO_N <= 2048;
              aScaleOfBigO_N *= 2
          ){
        printf( "START: running experiment for a scale of N( %d ) produces this:\n",
                 aScaleOfBigO_N
                 );
        /* 64-bit counter: the count exceeds the int range from N = 512 on */
        long long letsAlsoExplicitlyCountTheVisits = 0;
        for ( int i = 0; i < aScaleOfBigO_N; i++ )
            for ( int j = i; j < i*i; j++ )
                if ( j % i == 0 )
                {
                    for ( int k = 0; k < j; k++ )
                    {
                        // printf( "*" ); // avoid devastating UI
                        letsAlsoExplicitlyCountTheVisits++;
                    }
                }
        printf( "  END: running experiment visits this many( %lld ) times the code\n",
                 letsAlsoExplicitlyCountTheVisits
                 );
    }
    return 0;
}
Having collected a reasonably large number of data points ( N, countedVisits ), your next step may be to fit the observed data points and formulate the best-matching O( f(N) ) function of N.
It can be this simple:
START: running experiment for a scale of N( 1 )
END: running experiment visits this many( 0 ) times the code.
START: running experiment for a scale of N( 2 )
END: running experiment visits this many( 0 ) times the code.
START: running experiment for a scale of N( 4 )
END: running experiment visits this many( 11 ) times the code.
START: running experiment for a scale of N( 8 )
END: running experiment visits this many( 322 ) times the code.
START: running experiment for a scale of N( 16 )
END: running experiment visits this many( 6580 ) times the code.
START: running experiment for a scale of N( 32 )
END: running experiment visits this many( 117800 ) times the code.
START: running experiment for a scale of N( 64 )
END: running experiment visits this many( 1989456 ) times the code.
START: running experiment for a scale of N( 128 )
END: running experiment visits this many( 32686752 ) times the code.
START: running experiment for a scale of N( 256 )
END: running experiment visits this many( 529904960 ) times the code.
START: running experiment for a scale of N( 512 )
END: running experiment visits this many( 8534108800 ) times the code.
START: running experiment for a scale of N( 1024 )
END: running experiment visits this many(136991954176 ) times the code.
START: running experiment for a scale of N( 2048 )
...
The experimental data show roughly how this algorithm's time complexity behaves in practice:
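One quick way to read an exponent off the data above, without any curve-fitting tool: each step doubles N, so a cost of about N^p should grow by a factor of about 2^p per step. The sketch below uses the counts reported above for N = 64 to 1024; the names visits and doubling_ratio are made up for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Visit counts reported above for N = 64, 128, 256, 512, 1024. */
static const double visits[] = {
    1989456.0, 32686752.0, 529904960.0, 8534108800.0, 136991954176.0
};
#define NUM_POINTS (sizeof visits / sizeof visits[0])

/* Ratio of two consecutive counts; doubling N multiplies an N^p cost by 2^p. */
double doubling_ratio(size_t k)
{
    assert(k + 1 < NUM_POINTS);
    return visits[k + 1] / visits[k];
}
```

The ratios come out at roughly 16.4, 16.2, 16.1 and 16.05, approaching 16 = 2^4, which points to growth of about N^4 rather than N^5 (which would give ratios near 32).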
I want to create a 4-bit counter that generates this sequence: 1,3,5,7,9,8,6,4,2,0,1... in ABEL HDL. I got it working for the odd numbers, but it fails for the even numbers. Can anyone explain where my mistake is?
I am thinking of doing it the way I would in VHDL:
contor=1;
if contor >9 contor = contor + 2; else contor = 8;
if contor >0 contor -2 ;
But I don't understand how to use #if.
This is the code I wrote:
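For reference, the intended sequence can be written as a next-state function. The sketch below is in C rather than ABEL, purely to pin down the transitions the counter has to implement; the name next_state is made up for this example, and it assumes the counter only ever holds the values 0 to 9.

```c
#include <assert.h>
#include <stdint.h>

/* Next value of the counter for the sequence 1,3,5,7,9,8,6,4,2,0,1,... */
uint8_t next_state(uint8_t q)
{
    assert(q <= 9);   /* states 10..15 are not part of the sequence */
    if (q & 1u)       /* odd: count up by 2, turn around at 9 */
        return (q < 9) ? (uint8_t)(q + 2) : 8;
    /* even: count down by 2, restart at 1 after 0 */
    return (q > 0) ? (uint8_t)(q - 2) : 1;
}
```

There are four distinct transitions (odd below 9, exactly 9, even above 0, exactly 0), so the hardware description needs a condition for each of them.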
*<MODULE CounterV2
declarations
"pin declaration
X = .x.;
x = .X.;
"pin of clock , load "
clock pin 1;
ld pin 7;
i_d pin 8;
"output pin"
q3, q2, q1, q0 pin 19, 18, 17, 16 istype 'reg';
contor = [q3,q2,q1,q0];
MOD = [ld, i_d];
STOP =(MOD == [0, X]);
PAR =(MOD == [0, 1]);
IMPAR =(MOD == [0, 0]);
equations
[q3,q2,q1,q0].clk = clock;
"there is the core of code :D "
when IMPAR then contor :=1#(contor + 2);
when PAR then {
contor := 8;
when contor == 8 then contor :=contor -2;
}
" there a made test vector
test_vectors 'test'
([i_d,clock]->[q3,q2,q1,q0])
[0,.c.]->[x,x,x,x];
[0,.c.]->[x,x,x,x];
[0,.c.]->[x,x,x,x];
[0,.c.]->[x,x,x,x];
[0,.c.]->[x,x,x,x];
[1,.c.]->[x,x,x,x];
[1,.c.]->[x,x,x,x];
[1,.c.]->[x,x,x,x];
[1,.c.]->[x,x,x,x];
[1,.c.]->[x,x,x,x];
[1,.c.]->[x,x,x,x];
END
>*
I'm quite new to CUDA and GPU programming. I'm trying to write a kernel for a physics application. The parallelization is done over a quadrature of directions, each direction resulting in a sweep of a 2D Cartesian domain. Here is the kernel. It actually works well and gives good results.
However, a very high number of registers per block leads to spilling to local memory, which severely slows down the code.
__global__ void KERNEL (int imax, int jmax, int mmax, int lg, int lgmax,
double *x, double *y, double *qd, double *kappa,
double *S, double *G, double *qw, double *SkG,
double *Ska,double *a, double *Ljm, int *data)
{
int m = 1+blockIdx.x*blockDim.x + threadIdx.x ;
int tid = threadIdx.x ;
//Var needed for thread execution
...
extern __shared__ double shared[] ;
//Read some data from Global mem
mu = qd[ (m-1)];
eta = qd[ MSIZE+(m-1)];
wm = qd[3*MSIZE+(m-1)];
amu = fabs(mu);
aeta= fabs(eta);
ista = data[ (m-1)] ;
iend = data[1*MSIZE+(m-1)] ;
istp = data[2*MSIZE+(m-1)] ;
jsta = data[3*MSIZE+(m-1)] ;
jend = data[4*MSIZE+(m-1)] ;
jstp = data[5*MSIZE+(m-1)] ;
j1 = (1-jstp) ;
j2 = (1+jstp)/2 ;
i1 = (1-istp) ;
i2 = (1+istp)/2 ;
isw = ista-istp ;
jsw = jsta-jstp ;
dy = dx = 1.0e-2 ;
for(i=1 ; i<=imax; i++) Ljm[MSIZE*(i-1)+m] = S[jsw*(imax+2)+i] ;
//Beginning of the vertical Sweep, can be from left to right,
// or opposite depending on the thread
for(j=jsta ; j1*jend + j2*j<=j2*jend + j1*j ; j=j+jstp) {
Lw = S[j*(imax+2)+isw] ;
//Beginning of the horizontal Sweep, can be from left to right,
// or opposite depending on the thread
for(i=ista ; i1*iend + i2*i<=i2*iend + i1*i ; i=i+istp) {
ax = dy ;
Lx = ax*amu/ex ;
ay = dx ;
Ly = ay*aeta/ey ;
dv = ax*ay ;
L0 = dv*kappaij ;
Sp = S[j*(imax+2)+i]*dv ;
Ls = Ljm[MSIZE*(i-1)+m] ;
Lp = (Lx*Lw+Ly*Ls+Sp)/(Lx+Ly+L0) ;
Lw = Lw+(Lp-Lw)/ex ;
Ls = Ls+(Lp-Ls)/ey ;
Ljm[MSIZE*(i-1)+m] = Ls ;
shared[tid] = wm*Lp ;
__syncthreads();
for (s=16; s>0; s>>=1) {
if (tid < s) {
shared[tid] += shared[tid + s] ;
}
}
if(tid==0) atomicAdd(&SkG[imax*(j-1)+(i-1)],shared[tid]*kappaij);
}
// End of horizontal sweep
}
// End of vertical sweep
}
How can I optimize the execution of this code? I run it with 8 blocks of 32 threads.
The occupancy for this kernel is really low, limited by the registers according to the Visual Profiler.
I have no idea how to improve it.
Thanks!
First of all, you are using blocks of 32 threads; because of that, the kernel's occupancy is too low. Your GPU is running only 256 threads in parallel, but it can run up to 1536 threads per multiprocessor (compute capability 2.x).
How many registers are you using?
You can also try declaring your variables in their local scope, which helps the compiler reuse registers better.
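To put numbers on the block-size point: on compute capability 2.x a multiprocessor holds at most 8 resident blocks and 1536 resident threads, so 32-thread blocks can never fill it. The sketch below is plain host-side arithmetic in C, not CUDA code; the function names are made up for this example, and it deliberately ignores the register and shared-memory limits, which in this kernel are likely to lower the figure further.

```c
#include <assert.h>

/* Per-multiprocessor limits for compute capability 2.x. */
#define MAX_THREADS_PER_SM 1536
#define MAX_BLOCKS_PER_SM  8

/* Resident threads per multiprocessor when only the block-count and
   thread-count limits apply (registers and shared memory are ignored). */
int resident_threads(int threads_per_block)
{
    assert(threads_per_block > 0 && threads_per_block <= 1024);
    int blocks = MAX_THREADS_PER_SM / threads_per_block;
    if (blocks > MAX_BLOCKS_PER_SM)
        blocks = MAX_BLOCKS_PER_SM;
    return blocks * threads_per_block;
}

/* Occupancy as a whole percentage of the thread limit (rounded down). */
int occupancy_percent(int threads_per_block)
{
    return 100 * resident_threads(threads_per_block) / MAX_THREADS_PER_SM;
}
```

With 32 threads per block this gives 256 resident threads (16%), while 192 or 256 threads per block could reach the full 1536, provided the register usage allows it.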
I am using this algorithm in my program:
for( i=0 ; i<N ; i++ )
    for( j=i+1 ; j<N+1 ; j++ )
        for( k=0 ; k<i ; k++ )
            doWork();
Can anyone help me find the time complexity of this snippet?
I guess for the first two loops it is
N*(N+1)/2
right? What about the three loops all together?
Thanks to @Tim Meyer for correcting me:
Simply counting the calls for N = 0, 1, 2, 3, 4, 5, 6, 7, 8, ... gives the series 0, 0, 1, 4, 10, 20, 35, 56, 84, ..., which is described by the following formula:
u(n) = (n - 1)n(n + 1)/6
So it will have O((N - 1)N(N + 1)/6) time complexity, which simplifies to O(N^3).
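The series can be reproduced by counting the doWork() calls directly and comparing them with u(n). A minimal sketch, where count_calls is a made-up name and the increment stands in for doWork():

```c
#include <stdint.h>

/* Count how many times doWork() would be called for a given N. */
uint64_t count_calls(uint32_t n)
{
    uint64_t calls = 0;
    for (uint32_t i = 0; i < n; i++)
        for (uint32_t j = i + 1; j < n + 1; j++)
            for (uint32_t k = 0; k < i; k++)
                calls++;   /* stands in for doWork() */
    return calls;
}

/* Closed form u(n) = (n - 1) n (n + 1) / 6. */
uint64_t u(uint32_t n)
{
    if (n == 0)
        return 0;   /* avoid the unsigned wrap-around of n - 1 */
    return (uint64_t)(n - 1) * n * (n + 1) / 6;
}
```

Both functions return 0, 0, 1, 4, 10, 20, 35, 56, 84 for N = 0 to 8.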
Formally, you can do the following:
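One way to write this out, as a sketch that follows the loop bounds of the snippet (the middle loop runs N - i times and the inner loop i times) and agrees with u(n) above:

```latex
\sum_{i=0}^{N-1} \; \sum_{j=i+1}^{N} \; \sum_{k=0}^{i-1} 1
  = \sum_{i=0}^{N-1} i\,(N - i)
  = N \cdot \frac{(N-1)N}{2} - \frac{(N-1)N(2N-1)}{6}
  = \frac{(N-1)\,N\,(N+1)}{6}
  = O(N^3)
```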