Gcov reports branches for plain function calls

I'm using gcov to get code coverage for our project, but it frequently reports 50% conditional coverage for plain function calls. It doesn't make any difference if the function takes any parameters or returns any data or not. I'm using gcovr and Cobertura with Jenkins, but a simple gcov file gives the same result.
The actual tested code is attached below together with the stubbed functions, all in gcov format.
Any ideas why gcov treats these function calls as branches?
-: 146:/*****************************************************************************/
function _Z12mw_log_clearv called 2 returned 100% blocks executed 100%
2: 147:void mw_log_clear( void )
2: 147-block 0
-: 148:{
2: 149: uint8_t i = 0;
2: 150: uint8_t clear_tuple[EE_PAGE_SIZE] = { 0xff };
-: 151:
66: 152: for (i = 0; i < (int16_t)EE_PAGE_SIZE; i++)
2: 152-block 0
64: 152-block 1
66: 152-block 2
branch 0 taken 97%
branch 1 taken 3% (fallthrough)
-: 153: {
64: 154: clear_tuple[i] = 0xff;
-: 155: }
-: 156:
-: 157: /* Write pending data */
2: 158: mw_eeprom_write_blocking();
2: 158-block 0
call 0 returned 100%
branch 1 taken 100% (fallthrough) <---- This is a plain function call, not a branch
branch 2 taken 0% (throw) <---- This is a plain function call, not a branch
-: 159:
26: 160: for (i = 0; i < (RESERVED_PAGES_PER_PAREMETER_SET - POPULATED_PAGES_PER_PAREMETER_SET); i++)
2: 160-block 0
24: 160-block 1
26: 160-block 2
branch 0 taken 96%
branch 1 taken 4% (fallthrough)
-: 161: {
25: 162: if (status_ok != mw_eeprom_write(LOG_TUPLE_START_ADDRESS + i * EE_PAGE_SIZE, clear_tuple, sizeof(clear_tuple)))
25: 162-block 0
call 0 returned 100%
branch 1 taken 100% (fallthrough) <---- This is a plain function call, not a branch
branch 2 taken 0% (throw) <---- This is a plain function call, not a branch
25: 162-block 1
branch 3 taken 4% (fallthrough)
branch 4 taken 96%
-: 163: {
1: 164: mw_error_handler_add(mw_error_eeprom_busy);
1: 164-block 0
call 0 returned 100%
branch 1 taken 100% (fallthrough) <---- This is a plain function call, not a branch
branch 2 taken 0% (throw) <---- This is a plain function call, not a branch
1: 165: break;
1: 165-block 0
-: 166: }
-: 167:
24: 168: mw_eeprom_write_blocking();
24: 168-block 0
call 0 returned 100%
branch 1 taken 100% (fallthrough) <---- This is a plain function call, not a branch
branch 2 taken 0% (throw) <---- This is a plain function call, not a branch
-: 169: }
2: 170:}
2: 170-block 0
-: 171:
-: 172:/*****************************************************************************/
/*****************************************************************************/
void mw_eeprom_write_blocking(void)
{
stub_data.eeprom_write_blocking_calls++;
}
/*****************************************************************************/
void mw_error_handler_add(mw_error_code_t error_code)
{
EXPECT_EQ(error_code, stub_data.expected_error_code);
stub_data.registered_error_code = error_code;
}
/*****************************************************************************/
status_t mw_eeprom_write(
const uint32_t eeprom_start_index,
void *const source_start_address,
const uint32_t length)
{
stub_data.eeprom_write_start_index = eeprom_start_index;
stub_data.eeprom_write_length = length;
stub_data.eeprom_write_called = true;
EXPECT_NE(NULL, (uint32_t)source_start_address);
EXPECT_NE(0, length);
EXPECT_LE(eeprom_start_index + length, EEPROM_SIZE);
if (status_ok == stub_data.eeprom_write_status)
memcpy(&stub_data.eeprom[eeprom_start_index], source_start_address, length);
return stub_data.eeprom_write_status;
}

Solved!
Found the answer in this thread:
Why gcc 4.1 + gcov reports 100% branch coverage and newer (4.4, 4.6, 4.8) reports 50% for "p = new class;" line?
It seems gcov was reacting to "invisible" exception handling code generated for these function calls, so adding "-fno-exceptions" to the g++ options made all these spurious branches disappear.

Reaching unexpected objc breakpoint with method __Block_byref_object_copy_

I'm using lldb to debug an Objective-C based service. Several breakpoints have been placed in the code, and according to the stack trace one of them is reached unexpectedly.
The method containing this breakpoint shouldn't have been called, yet it still shows up in the stack trace (file1.mm:97), although the code there doesn't seem to be executing.
I suspect that the internal Objective-C helper __Block_byref_object_copy_ is responsible: it copies the block that involves both the caller and the callee (MyClass in the upper stack frame and the method at file1.mm:97).
While the copy is in progress, the debugger probably thinks execution has reached this line and stops there, when in fact the line is only being visited to copy the block that involves those two methods.
Can anybody confirm this, or explain why I hit this breakpoint where it shouldn't occur?
* frame #0: 0x0000000107e03ce0 MyLib`::__Block_byref_object_copy_((null)=0x00007fda19a86b30, (null)=0x00007ffeea7f3bd0) at file1.mm:97:27
frame #1: 0x00007fff7de6bb78 libsystem_blocks.dylib`_Block_object_assign + 325
frame #2: 0x0000000107dd960a MyLib`::__copy_helper_block_ea8_32r((null)=0x00007fda19a86540, (null)=0x00007ffeea7f3ba8) at file2.mm:47:55
frame #3: 0x00007fff7de6b9f3 libsystem_blocks.dylib`_Block_copy + 104
frame #4: 0x00007fff7c64e1e8 libobjc.A.dylib`objc_setProperty_atomic_copy + 53
frame #5: 0x00007fff5411d16b Foundation`-[NSXPCConnection _sendInvocation:orArguments:count:methodSignature:selector:withProxy:] + 1885
frame #6: 0x00007fff54168508 Foundation`-[NSXPCConnection _sendSelector:withProxy:arg1:arg2:] + 125
frame #7: 0x00007fff54168485 Foundation`_NSXPCDistantObjectSimpleMessageSend2 + 46
frame #8: 0x0000000107e0520e MyLib`::-[MyClass func:withVar0:Var1:Var2:withError:](self=0x00007fda17c2cb50, _cmd="funcWithVar0:Var1:Var2:Var3:withError:", var0="aaa", var1=0x0000000000000000, var2="bbb", var3=0x00007fda17d41dd0, err=0x00007ffeea7f4258) at MyClass.mm:196:5
UPDATE:
Thanks to the comments below, it turns out that setting a breakpoint by file and line gives me 3 locations (!?):
breakpoint set --file myfile.mm --line 97
Now when I list my breakpoints, besides the expected location I get 2 locations that aren't related to the method that actually contains that line:
3.2: where = my class`::__Block_byref_object_copy_() + 16 at myfile:97:27, address = 0x0000000107e03ce0, unresolved, hit count = 0
3.3: where = myclass `::__Block_byref_object_dispose_() + 16 at myfile:97:27, address = 0x0000000107e03d40, unresolved, hit count = 0
Not really an answer, more an illustration ...
#import <Foundation/Foundation.h>

unsigned fn ( void )
{
return 3; // Set bp here and see what happens
}
int main(int argc, const char * argv[])
{
@autoreleasepool
{
// insert code here...
NSLog(@"Hello, World!");
unsigned int x = 0;
x = 1;
x = 2;
x = 3;
if ( x >= 0 )
{
switch ( 5 )
{
case 1 : x = 4; break;
case 2 : x = 4; break; // Set bp here - it will trigger!!!!
case 3 : x = 4; break;
case 4 : x = 4; break;
case 5 : x = 4; break; // Set bp here and see what happens
case 6 : x = 4; break;
case 7 : x = 4; break;
default : x = 4; break;
}
}
__block unsigned y;
void ( ^ b1 )( void ) = ^ {
y = fn();
};
void ( ^ b2 )( void ) = ^ {
b1();
};
if ( YES )
{
b2();
}
NSLog ( @"x = %u y = %u", x, y );
}
return 0;
}
Without going into too much detail, note that most of the code above will be optimised away. The compiler optimises loops and switches aggressively and removes superfluous assignments and checks (e.g. x >= 0 for unsigned x). Here even the blocks get inlined, and you end up with what appears to be very strange behaviour. I think the blocks are what is relevant to your specific problem.
If you optimise, the first breakpoint indicated in the code does not get triggered, because the blocks all get inlined (I think; I did not look at the disassembly). The third one does get triggered, but because of optimisation it shows up in the strangest of places. A large part of the code is reduced to a single assignment, and the compiler does not really know which source line to attribute it to, so when that bp is triggered it looks as if it is in the most disconnected place possible.
Finally, even the second (!!!) one will trigger. That code should never execute, but since it all collapses under optimisation you can even get it to trigger.
So I would not be too perplexed about what triggers your bp ...
I mean, above I just proved that 2 == 5 if I take the bp seriously. And that is not even a variable but a constant 2 == 5!!! Quite amusing ...

Why does my VGPR usage increase so much when I use this assignment statement in OpenCL?

if (condition)
{
printf("find:%d\n",gid);
*foundFlag = 1;
dst[gid] = gid * crack_cnt + num;
break;
}
This code ends the kernel function when the password is found.
foundFlag is a pointer to a char value, declared among the kernel function's formal parameters as follows:
__global char * foundFlag,
When I comment out *foundFlag = 1, CodeXL reports that the kernel uses only 4 VGPRs.
But with this line in place, the reported usage increases to 88 VGPRs.
I guarantee that this line is the only thing I modified, and that foundFlag is used only here. I'm confused. If you can help me solve this problem I will be grateful. (0_0)

What's the most efficient way to calculate the warp id / lane id in a 1-D grid?

In CUDA, each thread knows its block index in the grid and thread index within the block. But two important values do not seem to be explicitly available to it:
Its index as a lane within its warp (its "lane id")
The index of the warp of which it is a lane within the block (its "warp id")
Assuming the grid is 1-dimensional (a.k.a. linear, i.e. blockDim.y and blockDim.z are 1), one can obviously obtain these as follows:
enum : unsigned { warp_size = 32 };
auto lane_id = threadIdx.x % warp_size;
auto warp_id = threadIdx.x / warp_size;
and if you don't trust the compiler to optimize that, you could rewrite it as:
enum : unsigned { warp_size = 32, log_warp_size = 5 };
auto lane_id = threadIdx.x & (warp_size - 1);
auto warp_id = threadIdx.x >> log_warp_size;
Is that the most efficient thing to do? It still seems wasteful for every thread to have to compute this.
(inspired by this question.)
The naive computation is currently the most efficient.
Note: This answer has been heavily edited.
It is very tempting to try and avoid the computation altogether - as these two values seem to already be available if you look under the hood.
You see, NVIDIA GPUs have special registers which your (compiled) code can read to access various kinds of useful information. One such register holds threadIdx.x; another holds blockDim.x; another holds the clock tick count; and so on. C++ as a language does not expose these, obviously; and, in fact, neither does CUDA. However, PTX, the intermediate representation into which CUDA code is compiled, does expose these special registers (since PTX 1.3, i.e. with CUDA versions >= 2.1).
Two of these special registers are %warpid and %laneid. Now, CUDA supports inlining PTX code within CUDA code with the asm keyword - just like it can be used for host-side code to emit CPU assembly instructions directly. With this mechanism one can use these special registers:
__forceinline__ __device__ unsigned lane_id()
{
unsigned ret;
asm volatile ("mov.u32 %0, %laneid;" : "=r"(ret));
return ret;
}
__forceinline__ __device__ unsigned warp_id()
{
// this is not equal to threadIdx.x / 32
unsigned ret;
asm volatile ("mov.u32 %0, %warpid;" : "=r"(ret));
return ret;
}
... but there are two problems here.
The first problem - as @Patwie suggests - is that %warpid does not give you what you actually want - it's not the index of the warp in the context of the grid, but rather in the context of the physical SM (which can only hold so many warps resident at a time), and those two are not the same. So don't use %warpid.
As for %laneid, it does give you the correct value, but it will almost surely hurt your performance: even though it's a "register", it's not like the regular registers in your register file, with 1-cycle access latency. It's a special register, which in the actual hardware is retrieved using an S2R instruction, which can exhibit long latency. Since you almost certainly already have the value of threadIdx.x in a register, it is faster to apply a bitmask to this value than to retrieve %laneid.
Bottom line: Just compute the warp ID and lane ID from the thread ID. We can't get around this - for now.
The other answer is very dangerous! Compute the lane-id and warp-id yourself.
#include <cuda.h>
#include <cstdio>
inline __device__ unsigned get_lane_id() {
unsigned ret;
asm volatile("mov.u32 %0, %laneid;" : "=r"(ret));
return ret;
}
inline __device__ unsigned get_warp_id() {
unsigned ret;
asm volatile("mov.u32 %0, %warpid;" : "=r"(ret));
return ret;
}
__global__ void kernel() {
const int actual_warpid = get_warp_id();
const int actual_laneid = get_lane_id();
const int expected_warpid = threadIdx.x / 32;
const int expected_laneid = threadIdx.x % 32;
if (expected_laneid == 0) {
printf("[warp:] actual: %i expected: %i\n", actual_warpid,
expected_warpid);
printf("[lane:] actual: %i expected: %i\n", actual_laneid,
expected_laneid);
}
}
int main(int argc, char const *argv[]) {
dim3 grid(8, 7, 1);
dim3 block(4 * 32, 1);
kernel<<<grid, block>>>();
cudaDeviceSynchronize();
return 0;
}
which gives something like
[warp:] actual: 4 expected: 3
[warp:] actual: 10 expected: 0
[warp:] actual: 1 expected: 1
[warp:] actual: 12 expected: 1
[warp:] actual: 4 expected: 3
[warp:] actual: 0 expected: 0
[warp:] actual: 13 expected: 2
[warp:] actual: 12 expected: 1
[warp:] actual: 6 expected: 1
[warp:] actual: 6 expected: 1
[warp:] actual: 13 expected: 2
[warp:] actual: 10 expected: 0
[warp:] actual: 1 expected: 1
...
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
[lane:] actual: 0 expected: 0
See also the PTX docs:
A predefined, read-only special register that returns the thread's
warp identifier. The warp identifier provides a unique warp number
within a CTA but not across CTAs within a grid. The warp identifier
will be the same for all threads within a single warp.
Note that %warpid is volatile and returns the location of a thread at
the moment when read, but its value may change during execution, e.g.,
due to rescheduling of threads following preemption.
Hence, it is the warp ID as seen by the hardware scheduler, without any guarantee that it matches the virtual warp ID (counted from 0 within the block).
The docs make this clear:
For this reason, %ctaid and %tid should be used to compute a virtual warp index if such a value is needed in kernel code; %warpid is
intended mainly to enable profiling and diagnostic code to sample and
log information such as work place mapping and load distribution.
If you are thinking of using CUB instead: the same caveat applies to cub::WarpId():
Returns the warp ID of the calling thread. Warp ID is guaranteed to be
unique among warps, but may not correspond to a zero-based ranking
within the thread block.
EDIT: Using %laneid seems to be safe.

Trouble Understanding Fork Logic

Can someone help me understand what is happening in this segment of code? I am having trouble understanding why the output is how it is. Output is:
0 1 2 3 4
3
2
1
0
int main() {
int i;
for (i = 0; i < 5 && !fork(); i++) {
fflush(stdout);
printf("%d ", i);
}
wait(NULL);
printf("\n");
return 0;
}
Two things matter here:
First, fork() returns 0 in the child process, while it returns the child's non-zero pid to the parent process.
Second, && short-circuits.
The first process (p0) starts by evaluating i < 5 && !fork(). At this point i = 0 and another process (p1) is created. In p0 the test !fork() fails, so p0 leaves the for loop and waits for its child to end. In p1 the test succeeds, so p1 prints 0, increments i to 1, creates process p2 and then leaves the for loop itself, just as p0 did. This repeats down the chain.
Because of short-circuiting, fork() is no longer called once i equals 5.
Each child also inherits a copy of its parent's unflushed stdout buffer and writes it out with fflush(stdout), which is why the digits 0 to 3 appear a second time when the parents that printed them finally reach printf("\n") and exit.

system() Returning Wrong Value

I have an ARM-based device running Embedded Linux and I have observed that when I use the C library's system() call, the return code is incorrect. Here is a test program that demonstrates the behavior:
#include <stdlib.h>
#include <stdio.h>
int main(void)
{
int ret = system("exit 42");
printf("Should return 42 for system() call: %d\n", ret);
printf("Returning 43 to shell..\n");
exit(43);
}
And here is the program output on the device:
# returnCodeTest
Should return 42 for system() call: 10752
Returning 43 to shell..
The value "10752" is returned by system() instead of "42". 10752 is 42 when left-shifted by 8:
Python 2.7.3 (default, Feb 27 2014, 20:00:17)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 42<<8
10752
So I suspect one of the following is going on somewhere:
The byte order is getting swapped
The value is getting shifted by 8 bits
Incompatible struct definitions are being used
When I run strace I see the following:
# strace /usr/bin/returnCodeTest
...
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x4001e308) = 977
wait4(977, [{WIFEXITED(s) && WEXITSTATUS(s) == 42}], 0, NULL) = 977
rt_sigaction(SIGINT, {SIG_DFL, [], 0x4000000 /* SA_??? */}, NULL, 8) = 0
rt_sigaction(SIGQUIT, {SIG_DFL, [], 0x4000000 /* SA_??? */}, NULL, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=977, si_status=42, si_utime=0, si_stime=0} ---
fstat64(1, {st_mode=S_IFCHR|0622, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x4001f000
write(1, "Should return 42 for system() ca"..., 42Should return 42 for system() call: 10752
) = 42
write(1, "Returning 43 to shell..\n", 24Returning 43 to shell..
) = 24
exit_group(43) = ?
+++ exited with 43 +++
wait4() returns with the correct status (si_status=42), but when it gets printed to standard output the value is shifted by 8 bits, and it looks like it is happening in a library. Interestingly the write returns a value of 42. I wonder if this is a hint as to what is going on...
Unfortunately I cannot get ltrace to compile and run on the device. Has anyone seen this type of behavior before or have any ideas (possibly architecture-specific) on where to look?
$man 3 system
Return Value
The value returned is -1 on error (e.g., fork(2) failed), and the
return status of the command otherwise. This latter return status is
in the format specified in wait(2). Thus, the exit code of the command
will be WEXITSTATUS(status).
$man 2 wait
WEXITSTATUS(status) returns the exit status of the child. This
consists of the least significant 8 bits of the status argument that
the child specified in a call to exit(3) or _exit(2) or as the
argument for a return statement in main(). This macro should only be
employed if WIFEXITED returned true.
I think the exit code is not the same thing as the value returned by system(), and the encoding is specific to the OS.
Linux does the following when you call exit with a code:
(error_code&0xff)<<8
SYSCALL_DEFINE1(exit, int, error_code)
{
do_exit((error_code&0xff)<<8);
}
Take a look at the link below (exit status codes in Linux):
Are there any standard exit status codes in Linux?