Different execution time between a local variable and a static structure

I am writing a program for AVR microcontrollers. It should read the current temperature and show it on a 7-segment display. Here is the problem I have: I made a structure holding all the variables related to the temperature (temperature, pointer position, sign and unit) and noticed that the execution time of, e.g., dividing by 10 or taking mod 10 is much longer than when I use an ordinary local variable. I don't know why. I use Atmel Studio 6.2.
struct dane
{
int32_t temperature;
int8_t pointer;
int8_t sign;
int8_t unit;
};
//************************************
//inside function of timer interrupt
static struct dane present;
//*****************************
//tested operations:
present.temperature % 10; //execution time: ~380 processor's cycles, on normal local variable ~4 cycles.
present.temperature /= 10; //execution time: ~611 cycles
Here is the function where I use it, along with a bit of the assembly code.
ISR(TIMER0_OVF_vect)
{
static int8_t i = 4;
static struct dane present;
if(i == 4 && (TCCR0 & (1 << CS01)))
{
i = 0;
present = current;
if(present.temperature < 0)
present.temperature = -present.temperature;
}
if((TCCR0 & ((1 << CS00) | (1 << CS02))) && i != 0)
{
i = 0;
}
if(present.unit == current.unit) //time between here and the first instruction in function print equals about 300 cycles.
{
print((i * present.sign == 3 && present.temperature % 10 == 0) ? 16 : present.temperature % 10, displays[i], i == present.pointer);
}
else
{
print(current.unit, displays[i],0);
if(i == 4)
{
i = 3;
TCCR0 = (1 << CS01);
present.unit = current.unit;
}
}
present.temperature /= 10;
i++;
}
And the assembly code for the second-to-last statement:
present.temperature /= 10;
0000021F LDI R28,0x7D Load immediate
00000220 LDI R29,0x00 Load immediate
00000221 LDD R22,Y+0 Load indirect with displacement
00000222 LDD R23,Y+1 Load indirect with displacement
00000223 LDD R24,Y+2 Load indirect with displacement
00000224 LDD R25,Y+3 Load indirect with displacement
00000225 LDI R18,0x0A Load immediate
00000226 LDI R19,0x00 Load immediate
00000227 LDI R20,0x00 Load immediate
00000228 LDI R21,0x00 Load immediate
00000229 RCALL PC+0x01AC Relative call subroutine
0000022A STD Y+0,R18 Store indirect with displacement
0000022B STD Y+1,R19 Store indirect with displacement
0000022C STD Y+2,R20 Store indirect with displacement
0000022D STD Y+3,R21 Store indirect with displacement
I can't use int16_t for the temperature because I use the same structure inside the function which converts the temperature from the sensor, and it is easier to work on a number with a decimal part when I multiply it by a suitable power of 10.

There must be something wrong with your timings:
present.temperature % 10; //execution time: ~380 processor's cycles, on normal local variable ~4 cycles.
present.temperature /= 10; //execution time: ~611 cycles
A modulo-10 operation for a 32-bit value is never going to happen in 4 clock cycles with an AVR. The 380 cycles sounds like a lot, but it is far more realistic for a 32-bit by 32-bit division. I am afraid any integer division on long integers will take a lot of time on an AVR, which has no hardware divider.
It is quite natural that operations on module-level static variables take a bit longer, because they have to be fetched from and stored back to RAM. This takes a few extra clock cycles per byte compared to register variables (local variables are often kept in registers; each SRAM load or store typically costs about 2 cycles on classic AVR cores). The variable being in a struct should not change the timings at all in a case like this (with pointers to structs it may have an effect).
The only real way to get to know what is happening is to look at the assembly code produced by the compiler in each case.
And, please, include a minimal but complete example of both cases in your question. Then it is easier to see if there is something clearly wrong.
If you are interested in making your code faster, I suggest you try to use int16_t for the temperature. Your dynamic range in temperature measurement is hardly more than 12 bits (that would be, e.g., 0.1 °C for temperatures between -100°C..+300°C.) so that 16-bit ints should be sufficient.
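As a rough illustration of the int16_t suggestion, here is a minimal sketch in plain C. It assumes the temperature is stored in tenths of a degree (so 235 means 23.5 °C), which is an assumption and not something the question confirms; the helper name `split_digits` is hypothetical. Everything stays in 16-bit unsigned arithmetic, so the compiler only needs its 16-bit division routine instead of the 32-bit one.

```c
#include <stdint.h>

/* Hypothetical helper: split a temperature given in tenths of a degree
 * into decimal digits, least significant digit first.
 * Returns the number of digits written (1 to 5 for an int16_t input). */
static uint8_t split_digits(int16_t tenths, uint8_t digits[5])
{
    /* Take the magnitude in unsigned arithmetic so that -32768 is handled
     * without signed overflow. */
    uint16_t value = (tenths < 0) ? (uint16_t)(0u - (uint16_t)tenths)
                                  : (uint16_t)tenths;
    uint8_t count = 0;

    do {
        /* Adjacent % and / on the same operands; many compilers can serve
         * both from a single 16-bit divide call (toolchain dependent). */
        digits[count++] = (uint8_t)(value % 10u);
        value = (uint16_t)(value / 10u);
    } while (value != 0);

    return count;
}
```

Calling `split_digits(235, digits)` fills `digits` with 5, 3, 2 and returns 3; the sign would be handled separately, as the `sign` field in the question's struct already suggests.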

Parallel Dynamic Programming with CUDA

This is my first attempt to implement recursion with CUDA. The goal is to extract all the combinations from a set of chars "12345", using the power of CUDA to parallelize the task dynamically. Here is my kernel:
__device__ char route[31] = { "_________________________"};
__device__ char init[6] = { "12345" };
__global__ void Recursive(int depth) {
// up to depth 6
if (depth == 5) return;
// newroute = route - idx
int x = depth * 6;
printf("%s\n", route);
int o = 0;
int newlen = 0;
for (int i = 0; i<6; ++i)
{
if (i != threadIdx.x)
{
route[i+x-o] = init[i];
newlen++;
}
else
{
o = 1;
}
}
Recursive<<<1,newlen>>>(depth + 1);
}
__global__ void RecursiveCount() {
Recursive <<<1,5>>>(0);
}
The idea is to exclude 1 item (the item corresponding to the threadIdx) in each different thread. In each recursive call, using the variable depth, it works over a different base (variable x) on the route device variable.
I expect the kernel to print something like:
2345_____________________
1345_____________________
1245_____________________
1234_____________________
2345_345_________________
2345_245_________________
2345_234_________________
2345_345__45_____________
2345_345__35_____________
2345_345__34_____________
..
2345_245__45_____________
..
But it prints ...
·_____________
·_____________
·_____________
·_____________
·_____________
·2345
·2345
·2345
·2345
...
What am I doing wrong?
I may not articulate every problem with your code, but these items should get you a lot closer.
I recommend providing a complete example. In my view it is basically required by Stack Overflow, see item 1 here, note use of the word "must". Your example is missing any host code, including the original kernel call. It's only a few extra lines of code, why not include it? Sure, in this case, I can deduce what the call must have been, but why not just include it? Anyway, based on the output you indicated, it seems fairly evident the launch configuration of the host launch would have to be <<<1,1>>>.
This doesn't seem to be logical to me:
I expect the kernel to print something like:
2345_____________________
The very first thing your kernel does is print out the route variable, before making any changes to it, so I would expect _____________________. However we can "fix" this by moving the printout to the end of the kernel.
You may be confused about what a __device__ variable is. It is a global variable, and there is only one copy of it. Therefore, when you modify it in your kernel code, every thread, in every kernel, is attempting to modify the same global variable, at the same time. That cannot possibly have orderly results, in any thread-parallel environment. I chose to "fix" this by making a local copy for each thread to work on.
You have an off-by-1 error, as well as an extent error in this loop:
for (int i = 0; i<6; ++i)
The off-by-1 error is due to the fact that you are iterating over 6 possible items (that is, i can reach a value of 5) but there are only 5 items in your init variable (the 6th item being a null terminator). The correct indexing starts out over 0-4 (with one of those being skipped). On subsequent iteration depths, it's necessary to reduce this indexing extent by 1. Note that I've chosen to fix the first error here by increasing the length of init. There are other ways to fix it, of course. My method inserts an extra _ between depths in the result.
You assume that at each iteration depth, the correct choice of items is the same, and in the same order, i.e. init. However this is not the case. At each depth, the choices of items must be selected not from the unchanging init variable, but from the choices passed from previous depth. Therefore we need a local, per-thread copy of init also.
A few other comments about CUDA Dynamic Parallelism (CDP). When passing pointers to data from one kernel scope to a child scope, local space pointers cannot be used. Therefore I allocate for the local copy of route from the heap, so it can be passed to child kernels. init can be deduced from route, so we can use an ordinary local variable for myinit.
You're going to quickly hit some dynamic parallelism (and perhaps memory) limits here if you continue this. I believe the total number of kernel launches for this is 5^5, which is 3125 (I'm doing this quickly, I may be mistaken). CDP has a pending launch limit of 2048 kernels by default. We're not hitting this here according to what I see, but you'll run into that sooner or later if you increase the depth or width of this operation. Furthermore, in-kernel allocations from the device heap are by default limited to 8MB. I don't seem to be hitting that limit here, but nothing I allocate is ever freed, so my design should probably be modified to fix that.
Finally, in-kernel printf output is limited to the size of a particular buffer. If this technique is not already hitting that limit, it will soon if you increase the width or depth.
Here is a worked example, attempting to address the various items above. I'm not claiming it is defect free, but I think the output is closer to your expectations. Note that due to character limits on SO answers, I've truncated/excerpted some of the output.
$ cat t1639.cu
#include <stdio.h>
__device__ char route[31] = { "_________________________"};
__device__ char init[7] = { "12345_" };
__global__ void Recursive(int depth, const char *oroute) {
char *nroute = (char *)malloc(31);
char myinit[7];
if (depth == 0) memcpy(myinit, init, 6);
else memcpy(myinit, oroute+(depth-1)*6, 6);
myinit[6] = 0;
if (nroute == NULL) {printf("oops\n"); return;}
memcpy(nroute, oroute, 30);
nroute[30] = 0;
// up to depth 6
if (depth == 5) return;
// newroute = route - idx
int x = depth * 6;
//printf("%s\n", nroute);
int o = 0;
int newlen = 0;
for (int i = 0; i<(6-depth); ++i)
{
if (i != threadIdx.x)
{
nroute[i+x-o] = myinit[i];
newlen++;
}
else
{
o = 1;
}
}
printf("%s\n", nroute);
Recursive<<<1,newlen>>>(depth + 1, nroute);
}
__global__ void RecursiveCount() {
Recursive <<<1,5>>>(0, route);
}
int main(){
RecursiveCount<<<1,1>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t1639 t1639.cu -rdc=true -lcudadevrt -arch=sm_70
$ cuda-memcheck ./t1639
========= CUDA-MEMCHECK
2345_____________________
1345_____________________
1245_____________________
1235_____________________
1234_____________________
2345__345________________
2345__245________________
2345__235________________
2345__234________________
2345__2345_______________
2345__345___45___________
2345__345___35___________
2345__345___34___________
2345__345___345__________
2345__345___45____5______
2345__345___45____4______
2345__345___45____45_____
2345__345___45____5______
2345__345___45____5_____5
2345__345___45____4______
2345__345___45____4_____4
2345__345___45____45____5
2345__345___45____45____4
2345__345___35____5______
2345__345___35____3______
2345__345___35____35_____
2345__345___35____5______
2345__345___35____5_____5
2345__345___35____3______
2345__345___35____3_____3
2345__345___35____35____5
2345__345___35____35____3
2345__345___34____4______
2345__345___34____3______
2345__345___34____34_____
2345__345___34____4______
2345__345___34____4_____4
2345__345___34____3______
2345__345___34____3_____3
2345__345___34____34____4
2345__345___34____34____3
2345__345___345___45_____
2345__345___345___35_____
2345__345___345___34_____
2345__345___345___45____5
2345__345___345___45____4
2345__345___345___35____5
2345__345___345___35____3
2345__345___345___34____4
2345__345___345___34____3
2345__245___45___________
2345__245___25___________
2345__245___24___________
2345__245___245__________
2345__245___45____5______
2345__245___45____4______
2345__245___45____45_____
2345__245___45____5______
2345__245___45____5_____5
2345__245___45____4______
2345__245___45____4_____4
2345__245___45____45____5
2345__245___45____45____4
2345__245___25____5______
2345__245___25____2______
2345__245___25____25_____
2345__245___25____5______
2345__245___25____5_____5
2345__245___25____2______
2345__245___25____2_____2
2345__245___25____25____5
2345__245___25____25____2
2345__245___24____4______
2345__245___24____2______
2345__245___24____24_____
2345__245___24____4______
2345__245___24____4_____4
2345__245___24____2______
2345__245___24____2_____2
2345__245___24____24____4
2345__245___24____24____2
2345__245___245___45_____
2345__245___245___25_____
2345__245___245___24_____
2345__245___245___45____5
2345__245___245___45____4
2345__245___245___25____5
2345__245___245___25____2
2345__245___245___24____4
2345__245___245___24____2
2345__235___35___________
2345__235___25___________
2345__235___23___________
2345__235___235__________
2345__235___35____5______
2345__235___35____3______
2345__235___35____35_____
2345__235___35____5______
2345__235___35____5_____5
2345__235___35____3______
2345__235___35____3_____3
2345__235___35____35____5
2345__235___35____35____3
2345__235___25____5______
2345__235___25____2______
2345__235___25____25_____
2345__235___25____5______
2345__235___25____5_____5
2345__235___25____2______
2345__235___25____2_____2
2345__235___25____25____5
2345__235___25____25____2
2345__235___23____3______
2345__235___23____2______
2345__235___23____23_____
2345__235___23____3______
2345__235___23____3_____3
2345__235___23____2______
2345__235___23____2_____2
2345__235___23____23____3
2345__235___23____23____2
2345__235___235___35_____
2345__235___235___25_____
2345__235___235___23_____
2345__235___235___35____5
2345__235___235___35____3
2345__235___235___25____5
2345__235___235___25____2
2345__235___235___23____3
2345__235___235___23____2
2345__234___34___________
2345__234___24___________
2345__234___23___________
2345__234___234__________
2345__234___34____4______
2345__234___34____3______
2345__234___34____34_____
2345__234___34____4______
2345__234___34____4_____4
2345__234___34____3______
2345__234___34____3_____3
2345__234___34____34____4
2345__234___34____34____3
2345__234___24____4______
2345__234___24____2______
2345__234___24____24_____
2345__234___24____4______
2345__234___24____4_____4
2345__234___24____2______
2345__234___24____2_____2
2345__234___24____24____4
2345__234___24____24____2
2345__234___23____3______
2345__234___23____2______
2345__234___23____23_____
2345__234___23____3______
2345__234___23____3_____3
2345__234___23____2______
2345__234___23____2_____2
2345__234___23____23____3
2345__234___23____23____2
2345__234___234___34_____
2345__234___234___24_____
2345__234___234___23_____
2345__234___234___34____4
2345__234___234___34____3
2345__234___234___24____4
2345__234___234___24____2
2345__234___234___23____3
2345__234___234___23____2
2345__2345__345__________
2345__2345__245__________
2345__2345__235__________
2345__2345__234__________
2345__2345__345___45_____
2345__2345__345___35_____
2345__2345__345___34_____
2345__2345__345___45____5
2345__2345__345___45____4
2345__2345__345___35____5
2345__2345__345___35____3
2345__2345__345___34____4
2345__2345__345___34____3
2345__2345__245___45_____
2345__2345__245___25_____
2345__2345__245___24_____
2345__2345__245___45____5
2345__2345__245___45____4
2345__2345__245___25____5
2345__2345__245___25____2
2345__2345__245___24____4
2345__2345__245___24____2
2345__2345__235___35_____
2345__2345__235___25_____
2345__2345__235___23_____
2345__2345__235___35____5
2345__2345__235___35____3
2345__2345__235___25____5
2345__2345__235___25____2
2345__2345__235___23____3
2345__2345__235___23____2
2345__2345__234___34_____
2345__2345__234___24_____
2345__2345__234___23_____
2345__2345__234___34____4
2345__2345__234___34____3
2345__2345__234___24____4
2345__2345__234___24____2
2345__2345__234___23____3
2345__2345__234___23____2
1345__345________________
1345__145________________
1345__135________________
1345__134________________
1345__1345_______________
1345__345___45___________
1345__345___35___________
1345__345___34___________
1345__345___345__________
1345__345___45____5______
1345__345___45____4______
1345__345___45____45_____
1345__345___45____5______
1345__345___45____5_____5
1345__345___45____4______
1345__345___45____4_____4
1345__345___45____45____5
1345__345___45____45____4
1345__345___35____5______
1345__345___35____3______
1345__345___35____35_____
1345__345___35____5______
1345__345___35____5_____5
1345__345___35____3______
1345__345___35____3_____3
1345__345___35____35____5
1345__345___35____35____3
1345__345___34____4______
1345__345___34____3______
1345__345___34____34_____
1345__345___34____4______
1345__345___34____4_____4
1345__345___34____3______
1345__345___34____3_____3
1345__345___34____34____4
1345__345___34____34____3
1345__345___345___45_____
1345__345___345___35_____
1345__345___345___34_____
1345__345___345___45____5
1345__345___345___45____4
1345__345___345___35____5
1345__345___345___35____3
1345__345___345___34____4
1345__345___345___34____3
1345__145___45___________
1345__145___15___________
1345__145___14___________
1345__145___145__________
1345__145___45____5______
1345__145___45____4______
1345__145___45____45_____
1345__145___45____5______
1345__145___45____5_____5
1345__145___45____4______
1345__145___45____4_____4
1345__145___45____45____5
1345__145___45____45____4
1345__145___15____5______
1345__145___15____1______
1345__145___15____15_____
1345__145___15____5______
1345__145___15____5_____5
1345__145___15____1______
1345__145___15____1_____1
1345__145___15____15____5
1345__145___15____15____1
1345__145___14____4______
1345__145___14____1______
1345__145___14____14_____
1345__145___14____4______
1345__145___14____4_____4
1345__145___14____1______
1345__145___14____1_____1
1345__145___14____14____4
1345__145___14____14____1
1345__145___145___45_____
1345__145___145___15_____
1345__145___145___14_____
1345__145___145___45____5
1345__145___145___45____4
1345__145___145___15____5
1345__145___145___15____1
1345__145___145___14____4
1345__145___145___14____1
1345__135___35___________
1345__135___15___________
1345__135___13___________
1345__135___135__________
1345__135___35____5______
1345__135___35____3______
1345__135___35____35_____
1345__135___35____5______
1345__135___35____5_____5
1345__135___35____3______
1345__135___35____3_____3
1345__135___35____35____5
1345__135___35____35____3
1345__135___15____5______
1345__135___15____1______
1345__135___15____15_____
1345__135___15____5______
1345__135___15____5_____5
1345__135___15____1______
1345__135___15____1_____1
1345__135___15____15____5
1345__135___15____15____1
1345__135___13____3______
1345__135___13____1______
1345__135___13____13_____
1345__135___13____3______
1345__135___13____3_____3
1345__135___13____1______
1345__135___13____1_____1
1345__135___13____13____3
1345__135___13____13____1
1345__135___135___35_____
1345__135___135___15_____
1345__135___135___13_____
1345__135___135___35____5
1345__135___135___35____3
1345__135___135___15____5
1345__135___135___15____1
1345__135___135___13____3
1345__135___135___13____1
1345__134___34___________
1345__134___14___________
1345__134___13___________
1345__134___134__________
1345__134___34____4______
1345__134___34____3______
1345__134___34____34_____
1345__134___34____4______
1345__134___34____4_____4
1345__134___34____3______
1345__134___34____3_____3
1345__134___34____34____4
1345__134___34____34____3
1345__134___14____4______
1345__134___14____1______
1345__134___14____14_____
1345__134___14____4______
1345__134___14____4_____4
1345__134___14____1______
1345__134___14____1_____1
1345__134___14____14____4
1345__134___14____14____1
1345__134___13____3______
1345__134___13____1______
1345__134___13____13_____
1345__134___13____3______
1345__134___13____3_____3
1345__134___13____1______
1345__134___13____1_____1
1345__134___13____13____3
1345__134___13____13____1
1345__134___134___34_____
1345__134___134___14_____
1345__134___134___13_____
1345__134___134___34____4
1345__134___134___34____3
1345__134___134___14____4
1345__134___134___14____1
1345__134___134___13____3
1345__134___134___13____1
1345__1345__345__________
1345__1345__145__________
1345__1345__135__________
1345__1345__134__________
1345__1345__345___45_____
1345__1345__345___35_____
1345__1345__345___34_____
1345__1345__345___45____5
1345__1345__345___45____4
1345__1345__345___35____5
1345__1345__345___35____3
1345__1345__345___34____4
1345__1345__345___34____3
1345__1345__145___45_____
1345__1345__145___15_____
1345__1345__145___14_____
1345__1345__145___45____5
1345__1345__145___45____4
1345__1345__145___15____5
1345__1345__145___15____1
1345__1345__145___14____4
1345__1345__145___14____1
1345__1345__135___35_____
1345__1345__135___15_____
1345__1345__135___13_____
1345__1345__135___35____5
1345__1345__135___35____3
1345__1345__135___15____5
1345__1345__135___15____1
1345__1345__135___13____3
1345__1345__135___13____1
1345__1345__134___34_____
1345__1345__134___14_____
1345__1345__134___13_____
1345__1345__134___34____4
1345__1345__134___34____3
1345__1345__134___14____4
1345__1345__134___14____1
1345__1345__134___13____3
1345__1345__134___13____1
1245__245________________
1245__145________________
1245__125________________
1245__124________________
1245__1245_______________
1245__245___45___________
1245__245___25___________
1245__245___24___________
1245__245___245__________
1245__245___45____5______
1245__245___45____4______
1245__245___45____45_____
1245__245___45____5______
1245__245___45____5_____5
1245__245___45____4______
1245__245___45____4_____4
1245__245___45____45____5
1245__245___45____45____4
1245__245___25____5______
1245__245___25____2______
1245__245___25____25_____
1245__245___25____5______
1245__245___25____5_____5
1245__245___25____2______
1245__245___25____2_____2
1245__245___25____25____5
1245__245___25____25____2
1245__245___24____4______
1245__245___24____2______
1245__245___24____24_____
1245__245___24____4______
1245__245___24____4_____4
1245__245___24____2______
1245__245___24____2_____2
1245__245___24____24____4
1245__245___24____24____2
1245__245___245___45_____
1245__245___245___25_____
1245__245___245___24_____
1245__245___245___45____5
1245__245___245___45____4
1245__245___245___25____5
1245__245___245___25____2
1245__245___245___24____4
1245__245___245___24____2
1245__145___45___________
1245__145___15___________
1245__145___14___________
1245__145___145__________
1245__145___45____5______
1245__145___45____4______
1245__145___45____45_____
1245__145___45____5______
1245__145___45____5_____5
1245__145___45____4______
...
1235__1235__235___25_____
1235__1235__235___23_____
1235__1235__235___35____5
1235__1235__235___35____3
1235__1235__235___25____5
1235__1235__235___25____2
1235__1235__235___23____3
1235__1235__235___23____2
1235__1235__135___35_____
1235__1235__135___15_____
1235__1235__135___13_____
1235__1235__135___35____5
1235__1235__135___35____3
1235__1235__135___15____5
1235__1235__135___15____1
1235__1235__135___13____3
1235__1235__135___13____1
1235__1235__125___25_____
1235__1235__125___15_____
1235__1235__125___12_____
1235__1235__125___25____5
1235__1235__125___25____2
1235__1235__125___15____5
1235__1235__125___15____1
1235__1235__125___12____2
1235__1235__125___12____1
1235__1235__123___23_____
1235__1235__123___13_____
1235__1235__123___12_____
1235__1235__123___23____3
1235__1235__123___23____2
1235__1235__123___13____3
1235__1235__123___13____1
1235__1235__123___12____2
1235__1235__123___12____1
1234__234________________
1234__134________________
1234__124________________
1234__123________________
1234__1234_______________
1234__234___34___________
1234__234___24___________
1234__234___23___________
1234__234___234__________
1234__234___34____4______
1234__234___34____3______
1234__234___34____34_____
1234__234___34____4______
1234__234___34____4_____4
1234__234___34____3______
1234__234___34____3_____3
1234__234___34____34____4
1234__234___34____34____3
1234__234___24____4______
1234__234___24____2______
1234__234___24____24_____
1234__234___24____4______
1234__234___24____4_____4
1234__234___24____2______
1234__234___24____2_____2
1234__234___24____24____4
1234__234___24____24____2
1234__234___23____3______
1234__234___23____2______
1234__234___23____23_____
1234__234___23____3______
1234__234___23____3_____3
1234__234___23____2______
1234__234___23____2_____2
1234__234___23____23____3
1234__234___23____23____2
1234__234___234___34_____
1234__234___234___24_____
1234__234___234___23_____
1234__234___234___34____4
1234__234___234___34____3
1234__234___234___24____4
1234__234___234___24____2
1234__234___234___23____3
1234__234___234___23____2
1234__134___34___________
1234__134___14___________
1234__134___13___________
1234__134___134__________
1234__134___34____4______
1234__134___34____3______
1234__134___34____34_____
1234__134___34____4______
1234__134___34____4_____4
1234__134___34____3______
1234__134___34____3_____3
1234__134___34____34____4
1234__134___34____34____3
1234__134___14____4______
1234__134___14____1______
1234__134___14____14_____
1234__134___14____4______
1234__134___14____4_____4
1234__134___14____1______
1234__134___14____1_____1
1234__134___14____14____4
1234__134___14____14____1
1234__134___13____3______
1234__134___13____1______
1234__134___13____13_____
1234__134___13____3______
1234__134___13____3_____3
1234__134___13____1______
1234__134___13____1_____1
1234__134___13____13____3
1234__134___13____13____1
1234__134___134___34_____
1234__134___134___14_____
1234__134___134___13_____
1234__134___134___34____4
1234__134___134___34____3
1234__134___134___14____4
1234__134___134___14____1
1234__134___134___13____3
1234__134___134___13____1
1234__124___24___________
1234__124___14___________
1234__124___12___________
1234__124___124__________
1234__124___24____4______
1234__124___24____2______
1234__124___24____24_____
1234__124___24____4______
1234__124___24____4_____4
1234__124___24____2______
1234__124___24____2_____2
1234__124___24____24____4
1234__124___24____24____2
1234__124___14____4______
1234__124___14____1______
1234__124___14____14_____
1234__124___14____4______
1234__124___14____4_____4
1234__124___14____1______
1234__124___14____1_____1
1234__124___14____14____4
1234__124___14____14____1
1234__124___12____2______
1234__124___12____1______
1234__124___12____12_____
1234__124___12____2______
1234__124___12____2_____2
1234__124___12____1______
1234__124___12____1_____1
1234__124___12____12____2
1234__124___12____12____1
1234__124___124___24_____
1234__124___124___14_____
1234__124___124___12_____
1234__124___124___24____4
1234__124___124___24____2
1234__124___124___14____4
1234__124___124___14____1
1234__124___124___12____2
1234__124___124___12____1
1234__123___23___________
1234__123___13___________
1234__123___12___________
1234__123___123__________
1234__123___23____3______
1234__123___23____2______
1234__123___23____23_____
1234__123___23____3______
1234__123___23____3_____3
1234__123___23____2______
1234__123___23____2_____2
1234__123___23____23____3
1234__123___23____23____2
1234__123___13____3______
1234__123___13____1______
1234__123___13____13_____
1234__123___13____3______
1234__123___13____3_____3
1234__123___13____1______
1234__123___13____1_____1
1234__123___13____13____3
1234__123___13____13____1
1234__123___12____2______
1234__123___12____1______
1234__123___12____12_____
1234__123___12____2______
1234__123___12____2_____2
1234__123___12____1______
1234__123___12____1_____1
1234__123___12____12____2
1234__123___12____12____1
1234__123___123___23_____
1234__123___123___13_____
1234__123___123___12_____
1234__123___123___23____3
1234__123___123___23____2
1234__123___123___13____3
1234__123___123___13____1
1234__123___123___12____2
1234__123___123___12____1
1234__1234__234__________
1234__1234__134__________
1234__1234__124__________
1234__1234__123__________
1234__1234__234___34_____
1234__1234__234___24_____
1234__1234__234___23_____
1234__1234__234___34____4
1234__1234__234___34____3
1234__1234__234___24____4
1234__1234__234___24____2
1234__1234__234___23____3
1234__1234__234___23____2
1234__1234__134___34_____
1234__1234__134___14_____
1234__1234__134___13_____
1234__1234__134___34____4
1234__1234__134___34____3
1234__1234__134___14____4
1234__1234__134___14____1
1234__1234__134___13____3
1234__1234__134___13____1
1234__1234__124___24_____
1234__1234__124___14_____
1234__1234__124___12_____
1234__1234__124___24____4
1234__1234__124___24____2
1234__1234__124___14____4
1234__1234__124___14____1
1234__1234__124___12____2
1234__1234__124___12____1
1234__1234__123___23_____
1234__1234__123___13_____
1234__1234__123___12_____
1234__1234__123___23____3
1234__1234__123___23____2
1234__1234__123___13____3
1234__1234__123___13____1
1234__1234__123___12____2
1234__1234__123___12____1
========= ERROR SUMMARY: 0 errors
$
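The number of kernel launches, estimated above as 5^5, can be checked on the host. The sketch below is plain C rather than CUDA; `count_launches` is a hypothetical name. It assumes, as in the worked example, that every thread launches exactly one child grid whose size is the `newlen` computed by the `(6-depth)` loop, and that grids at depth 5 launch nothing further.

```c
/* Hypothetical host-side model of the launch tree of the worked example.
 * Counts the grid described by (depth, threads) plus all of its
 * descendants. */
static long count_launches(int depth, int threads)
{
    long total = 1;            /* this grid itself */
    if (depth == 5)
        return total;          /* the kernel returns before launching */

    int extent = 6 - depth;    /* same loop bound as in Recursive */
    for (int t = 0; t < threads; ++t) {
        int newlen = 0;
        for (int i = 0; i < extent; ++i) {
            if (i != t)
                newlen++;      /* one item is skipped per thread */
        }
        total += count_launches(depth + 1, newlen);
    }
    return total;
}
```

Under these assumptions, `count_launches(0, 5)` evaluates to 1031 `Recursive` grids (plus the single `RecursiveCount` grid), so the tree is smaller than the quick 5^5 estimate, though it still grows rapidly with depth or width.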
Robert Crovella's answer is correct on its 5th point: the mistake was using init in every recursive call. But I want to clarify something that may be useful for other CUDA beginners.
I used this variable because whenever I tried to launch a child kernel passing a local variable, I got the error: Error: a pointer to local memory cannot be passed to a launch as an argument.
As an experienced C# developer, I'm not used to working with pointers (ref does that low-level work for me), so I thought there was no way to do this in CUDA/C programming.
As Robert shows in his code, it is possible: allocate the buffer with malloc so that the pointer can be passed as an argument.
Here is a simplified kernel as an example of deep recursion.
__device__ char init[6] = { "12345" };
__global__ void Recursive(int depth, const char* route) {
// up to depth 6
if (depth == 5) return;
//declaration for a referable argument (point 6)
char* newroute = (char*)malloc(6);
memcpy(newroute, route, 5);
int o = 0;
int newlen = 0;
for (int i = 0; i < (6 - depth); ++i)
{
if (i != threadIdx.x)
{
newroute[i - o] = route[i];
newlen++;
}
else
{
o = 1;
}
}
printf("%s\n", newroute);
Recursive <<<1, newlen>>>(depth + 1, newroute);
}
__global__ void RecursiveCount() {
Recursive <<<1, 5>>>(0, init);
}
I don't include the main call because I'm using ManagedCUDA for C#, but as Robert says, it is easy to work out how RecursiveCount is called.
About ending char arrays with \0 ... sorry, but I don't know exactly what the benefit is; this code works fine without them.
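On the terminating `\0`: in C, `printf("%s", ...)` keeps reading bytes until it finds a zero byte, so a buffer without one leads to reads past its end. The kernel above appears to work because its loop happens to copy the terminator of `init` along with the digits, which is easy to break by accident. A minimal host-side sketch of making the terminator explicit follows; the helper name `copy_terminated` is hypothetical and not taken from either answer.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy the first n characters of src into a freshly
 * allocated buffer and terminate it, so it is safe to pass to "%s".
 * Returns NULL if the allocation fails; the caller frees the result. */
static char *copy_terminated(const char *src, size_t n)
{
    char *dst = malloc(n + 1);   /* one extra byte for the terminator */
    if (dst == NULL)
        return NULL;
    memcpy(dst, src, n);
    dst[n] = '\0';               /* without this, "%s" may run past the end */
    return dst;
}
```

Given a source array of the five characters '1' to '5' with no terminator, `copy_terminated(src, 5)` returns a buffer that `strlen` reports as 5 characters long and that can be printed with `%s`.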

How is it possible that O(1) constant time code is slower than O(n) linear time code?

"...It is very possible for O(N) code to run faster than O(1) code for specific inputs. Big O just describes the rate of increase."
According to my understanding:
O(N) - Time taken for an algorithm to run based on the varying values of input N.
O(1) - Constant time taken for the algorithm to execute irrespective of the size of the input e.g. int val = arr[10000];
Can someone help me understand based on the author's statement?
O(N) code run faster than O(1)?
What are the specific inputs the author is alluding to?
Rate of increase of what?
O(n) linear time can absolutely be faster than O(1) constant time. The reason is that constant factors and fixed costs are totally ignored in Big O, which is a measure of how fast an algorithm's complexity increases as input size n increases, and nothing else. It's a measure of growth rate, not running time.
Here's an example:
int constant(int[] arr) {
int total = 0;
for (int i = 0; i < 10000; i++) {
total += arr[0];
}
return total;
}
int linear(int[] arr) {
int total = 0;
for (int i = 0; i < arr.length; i++) {
total += arr[i];
}
return total;
}
In this case, constant does a lot of work, but it's fixed work that will always be the same regardless of how large arr is. linear, on the other hand, appears to have few operations, but those operations are dependent on the size of arr.
In other words, as arr increases in length, constant's performance stays the same, but linear's running time increases linearly in proportion to its argument array's size.
Call the two functions with a single-item array like
constant(new int[] {1});
linear(new int[] {1});
and it's clear that constant runs slower than linear.
But call them like:
int[] arr = new int[10000000];
constant(arr);
linear(arr);
Which runs slower?
After you've thought about it, run the code given various inputs of n and compare the results.
Just to show that this phenomenon of run time != Big O isn't just for constant-time functions, consider:
void exponential(int n) throws InterruptedException {
for (int i = 0; i < Math.pow(2, n); i++) {
Thread.sleep(1);
}
}
void linear(int n) throws InterruptedException {
for (int i = 0; i < n; i++) {
Thread.sleep(10);
}
}
Exercise (using pen and paper): up to which n does exponential run faster than linear?
Consider the following scenario:
Op1) Given an array of length n where n>=10, print the first ten elements twice on the console. --> This is a constant time (O(1)) operation, because for any array of size>=10, it will execute 20 steps.
Op2) Given an array of length n where n>=10, find the largest element in the array. --> This is a linear time (O(N)) operation, because for any array of length N, it will execute N steps.
Now if the array size is between 10 and 20 (exclusive), Op1 will be slower than Op2. But if we take an array of size>20 (e.g. size=1000), Op1 will still take 20 steps to complete, while Op2 will take 1000 steps.
That's why big-O notation is about the growth (rate of increase) of an algorithm's complexity.
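The Op1/Op2 scenario above can be sketched as a small step-counting model in C++. This is only an illustration: the function names op1Steps and op2Steps are made up for this example, and a running sum stands in for printing to the console.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Op1: visit the first ten elements twice (stand-in for printing them).
// Returns the number of steps performed, which is always 20.
std::size_t op1Steps(const std::vector<int>& arr) {
    assert(arr.size() >= 10);
    std::size_t steps = 0;
    long sum = 0;  // stands in for the console output
    for (int pass = 0; pass < 2; ++pass) {
        for (std::size_t i = 0; i < 10; ++i) {
            sum += arr[i];
            ++steps;
        }
    }
    (void)sum;
    return steps;
}

// Op2: find the largest element. Returns the number of steps performed,
// which equals the array length; the maximum is written to `largest`.
std::size_t op2Steps(const std::vector<int>& arr, int& largest) {
    assert(arr.size() >= 10);
    largest = arr[0];
    std::size_t steps = 0;
    for (int v : arr) {
        if (v > largest) largest = v;
        ++steps;
    }
    return steps;
}
```

For a 15-element array op1Steps reports 20 steps and op2Steps reports 15; for a 1000-element array the counts are 20 and 1000.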

Debug data/neon performance hazards in arm neon code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON and some minor part of it was taking 80% of the time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performs better:
typedef unsigned(*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);
CalcMaxFunc CalcMaxFuncs[] =
{
CalcMaxFunc_NEON_0,
CalcMaxFunc_NEON_1,
CalcMaxFunc_NEON_2,
CalcMaxFunc_NEON_3,
CalcMaxFunc_C_0
};
int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
auto f = CalcMaxFuncs[i % N];
unsigned retI = f(a, b);
// just random code to ensure that cpu waits for the results
// and compiler doesn't optimize it away
if (retI > 1000000)
break;
ret |= retI;
}
I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I moved CalcMaxFunc_NEON_3 to be first, it became 3-4 times slower, and according to the profiler it stalled at the last bit of the function, where it tried to move data from a NEON register to an ARM register.
So, what makes it stall sometimes and not at other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between the calls to these functions in the loop, I eliminated the unreliable behavior: now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in actual code?
Update:
I tried to create a simple test function and then optimize it in stages and see how I could possibly avoid neon->arm stalls.
Here's the test runner function:
void NeonStallTest()
{
int findMinErr(uint8_t* var1, uint8_t* var2, int size);
srand(0);
uint8_t var1[1280];
uint8_t var2[1280];
for (int i = 0; i < sizeof(var1); ++i)
{
var1[i] = rand();
var2[i] = rand();
}
#if 0 // early exit?
for (int i = 0; i < 16; ++i)
var1[i] = var2[i];
#endif
int ret = 0;
for (int i=0; i<10000000; ++i)
ret += findMinErr(var1, var2, sizeof(var1));
exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
int ret = 0;
int ret_err = INT_MAX;
for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
{
int err = 0;
for (int j = 0; j < 16; ++j)
{
int x = var1[j] - var2[j];
err += x * x;
}
if (ret_err > err)
{
ret_err = err;
ret = i;
}
}
return ret;
}
Basically, it computes the sum of squared differences between each pair of uint8_t[16] blocks and returns the index of the block pair that has the lowest sum. Then I rewrote it with NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
int ret = 0;
int ret_err = INT_MAX;
for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
{
int err;
uint8x8_t var1_0 = vld1_u8(var1 + 0);
uint8x8_t var1_1 = vld1_u8(var1 + 8);
uint8x8_t var2_0 = vld1_u8(var2 + 0);
uint8x8_t var2_1 = vld1_u8(var2 + 8);
int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1
err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
uint32x4_t err0 = vpaddlq_u16(u0);
uint32x4_t err1 = vpaddlq_u16(u1);
err0 = vaddq_u32(err0, err1);
uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
err00 = vpadd_u32(err00, err00);
err = vget_lane_u32(err00, 0);
#endif
if (ret_err > err)
{
ret_err = err;
ret = i;
#if 0 // enable early exit?
if (ret_err == 0)
break;
#endif
}
}
return ret;
}
Now, if (ret_err > err) is clearly a data hazard. I then manually "unrolled" the loop by two and modified the code to use err0 and err1 and check them after performing the next round of compute. According to the profiler I got some improvement. In the simple NEON loop, roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err). After I unrolled by two, these operations started to take 25% (i.e. I got roughly a 10% overall speedup). Also, in the armv7 version there are only 8 instructions between where err0 is set (vmov.32 r6, d16[0]) and where it's accessed (cmp r12, r6).
Note that in the code the early exit is ifdefed out; enabling it would make the function even slower. When I unrolled by four, changed the code to use four errN variables and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. When I checked the generated asm, it appeared that the compiler destroys all these optimization attempts because it reuses some of the errN registers, which effectively makes the CPU access the results of vget_lane_u32 much earlier than I want (I aim to delay the access by 10-20 instructions). Only when I unrolled by four and marked all four errN as volatile did vget_lane_u32 totally disappear from the profiler's radar. However, the if (ret_err > errN) check obviously got slow as hell, as these variables probably ended up as regular stack variables; overall, these four checks in the 4x manually unrolled loop started to take 40%. It looks like with proper manual asm it's possible to make it work properly: have an early loop exit, avoid NEON->ARM stalls and have some ARM logic in the loop. However, the extra maintenance required to deal with ARM asm makes that kind of code 10x more complex to maintain in a large project (one that doesn't have any ARM asm).
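For readers who want to see the shape of the "defer the check by one round" idea, here is a minimal scalar sketch in portable C++ (no NEON). The names blockErr and findMinErrDeferred are hypothetical and not from my real code, and a compiler is free to reorder scalar code like this, so it only illustrates the structure of the loop, not the stall behavior itself.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// Sum of squared differences over one 16-byte block.
static int blockErr(const std::uint8_t* a, const std::uint8_t* b) {
    int err = 0;
    for (int j = 0; j < 16; ++j) {
        int x = a[j] - b[j];
        err += x * x;
    }
    return err;
}

// Same result as findMinErr, but the comparison for block i-1 is issued
// only after the computation for block i has been started.
int findMinErrDeferred(const std::uint8_t* var1, const std::uint8_t* var2, int size) {
    assert(size >= 0);
    const int blocks = size / 16;
    if (blocks == 0) return 0;

    int ret = 0;
    int retErr = INT_MAX;
    int pending = blockErr(var1, var2);  // block 0, not yet compared

    for (int i = 1; i < blocks; ++i) {
        // Start the next block before looking at the previous result.
        int next = blockErr(var1 + 16 * i, var2 + 16 * i);
        if (retErr > pending) {
            retErr = pending;
            ret = i - 1;
        }
        pending = next;
    }
    // Drain the last pending result.
    if (retErr > pending) {
        ret = blocks - 1;
    }
    return ret;
}
```

Because the comparisons still happen in block order with a strict >, the returned index matches the straightforward version, including for ties.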
Update:
Here's a sample stall when moving data from a NEON register to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop iteration. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried adding lots of no-ops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 for the no-ops: no difference. What's the reason for the stall, or does the profiler simply show wrong results?

cudaMallocHost vs malloc for better performance shows no difference

I have gone through this site. From it I learned that pinned memory allocated with cudaMallocHost gives better performance than memory allocated with malloc. I then wrote two simple programs and measured their execution time:
using cudaMallocHost
#include <stdio.h>
#include <time.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start=clock();/* Line 8 */
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
cudaMallocHost((void **) &a_h, size);
//a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
cudaFreeHost(a_h);
cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
using malloc
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cuda.h>
// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx<N) a[idx] = a[idx] * a[idx];
}
// main routine that executes on the host
int main(void)
{
clock_t start;
start=clock();/* Line 8 */
clock_t finish;
float *a_h, *a_d; // Pointer to host & device arrays
const int N = 100000; // Number of elements in arrays
size_t size = N * sizeof(float);
a_h = (float *)malloc(size); // Allocate array on host
cudaMalloc((void **) &a_d, size); // Allocate array on device
// Initialize host array and copy it to CUDA device
for (int i=0; i<N; i++) a_h[i] = (float)i;
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
// Do calculation on device:
int block_size = 4;
int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
square_array <<< n_blocks, block_size >>> (a_d, N);
// Retrieve result from device and store it in host array
cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);
// Print results
for (int i=0; i<N; i++) printf("%d %f\n", i, a_h[i]);
// Cleanup
free(a_h); cudaFree(a_d);
finish = clock() - start;
double interval = finish / (double)CLOCKS_PER_SEC;
printf("%f seconds elapsed", interval);
}
During execution of both programs, the execution time was almost the same.
Is there anything wrong in the implementation? What is the exact difference in execution between malloc and cudaMallocHost?
Also, the execution time decreases with each run.
If you want to see the difference in execution time for the copy operation, just time the copy operation. In many cases you will see approximately a 2x difference in execution time for just the copy operation when the underlying memory is pinned. Make your copy operation large enough/long enough that you are well above the granularity of whatever timing mechanism you are using. The various profilers, such as the visual profiler and nvprof, can help here.
The cudaMallocHost operation under the hood is doing something like a malloc plus additional OS functions to "pin" each page associated with the allocation. These additional OS operations take extra time, as compared to just doing a malloc. And note that as the size of the allocation increases, the registration ("pinning") cost will generally increase as well.
Therefore, for many examples, just timing the overall execution doesn't show much difference, because while the cudaMemcpy operation may be quicker from pinned memory, the cudaMallocHost takes longer than the corresponding malloc.
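A rough estimate makes it clearer why whole-program timing hides the copy. The sketch below is plain host-side C++ arithmetic, not CUDA code; the helper names copyMicros and achievedGBps are invented for this example, and the bandwidth values you feed it are assumptions rather than measurements.

```cpp
#include <cassert>
#include <cstddef>

// Estimated time in microseconds to move `bytes` at `gbPerSec`
// (1 GB taken as 1e9 bytes).
double copyMicros(std::size_t bytes, double gbPerSec) {
    assert(gbPerSec > 0.0);
    return static_cast<double>(bytes) / (gbPerSec * 1e9) * 1e6;
}

// Effective bandwidth in GB/s for a copy of `bytes` that took `micros`
// microseconds, e.g. as reported by a profiler.
double achievedGBps(std::size_t bytes, double micros) {
    assert(micros > 0.0);
    return static_cast<double>(bytes) / (micros * 1e-6) / 1e9;
}
```

The program in the question copies 100000 floats, i.e. 400000 bytes. At an assumed 6 GB/s that is about 67 microseconds per copy, and about 133 microseconds at an assumed 3 GB/s. Either way it is a tiny fraction of a run that also prints 100000 lines, so the copy has to be much larger, or timed in isolation, before the difference shows up.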
So what's the point?
You may be interested in using pinned memory (i.e. cudaMallocHost) when you will be doing repeated transfers from a single buffer. You only pay the extra cost to pin it once, but you benefit on each transfer/usage.
Pinned memory is required to overlap data transfer operations (cudaMemcpyAsync) with compute activities (kernel calls). Refer to the programming guide.
I too found that just allocating a piece of memory with cudaHostAlloc / cudaMallocHost doesn't do much.
To be sure, run nvprof with --print-gpu-trace and see whether the throughput for memcpyHtoD or memcpyDtoH is good. For PCIe 2.0, you should get around 6-8 GB/s.
However, pinned memory is a prerequisite for cudaMemcpyAsync to actually run asynchronously.
After calling cudaMemcpyAsync, I moved whatever computations I had on the host to right after it. In this way you can overlap the asynchronous memcpys with the host computations.
I was surprised that I was able to save quite a lot of time this way; it's worth a try.

Something weird in for loop speed

Here is a part of my program code:
int test;
for(uint i = 0; i < 1700; i++) {
test++;
}
The whole program takes 0.5 seconds to finish, but when I change it to:
int test[1];
for(uint i = 0; i < 1700; i++) {
test[0]++;
}
it takes 3.5 seconds! And when I change the int to a double, it gets much worse:
double test;
for(uint i = 0; i < 1700; i++) {
test++;
}
It takes about 18 seconds to finish!
I have to increment an int array element and a double variable in my real for loop, and it takes about 30 seconds!
What's happening here? Why should it take that much time for just an increment?
I know a floating-point data type like double has a different structure from a fixed-point data type like int, but is that the only cause for such a big difference in time? And what about the second example, which uses an int array element?
Thanks
You have answered your question yourself.
Floating-point (float, double) operations are different from integer ones, even if you just add 1.0f.
Your second example takes longer than the first one just because you added a pointer dereference. An array in C is, at bottom, not much different from a pointer to its first element. Accessing any element, even the first one, makes the machine code load the starting address of the array, multiply the index (0 in this case) by the size of each element (4 bytes, or whatever size int has) and add that (0) to the pointer. Then it has to dereference the pointer, meaning it actually loads the value at that address, adds one and writes back the result.
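Spelled out by hand, the steps described above look roughly like this. It is a sketch in C++ with a made-up function name (incrementViaPointer), showing what test[index]++ means conceptually; an optimizing compiler will usually collapse most of it.

```cpp
#include <cassert>

// Conceptual expansion of `base[index]++`.
// Returns the new value of the element.
int incrementViaPointer(int* base, int index) {
    assert(base != nullptr);
    int* addr = base + index;  // start address + index * sizeof(int)
    int value = *addr;         // load the value at that address
    value += 1;                // add one
    *addr = value;             // write the result back
    return value;
}
```

Calling incrementViaPointer(test, 0) has the same effect as test[0]++, since test[index] is defined as *(test + index).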
A smart modern compiler should optimize this a bit. If you want to avoid this optimization, modify the code a bit and don't use a constant for the index.
I never tried that with a modern Objective-C compiler, but I guess that this code would take much longer than 3.5s to run:
int test[2];
int index = 0;
for(uint i = 0; i < 1700; i++) {
test[index]++;
}
If that does not make much of a change then try this:
-(void)foo:(int)index {
int test[2];
for(uint i = 0; i < 1700; i++) {
test[index]++;
}
}
and then call it with [self foo:0];
Give it a try and let us know :)