I'm under the impression that these two statements, X = X + 1 and X += 1, have the same result, namely incrementing X by 1, but that the latter is probably more efficient.
If this is not correct, please explain the difference.
If it is correct, why should the latter be more efficient? Shouldn't they both compile to the same IL?
Thanks.
From the MSDN library for +=:
Using this operator is almost the same as specifying result = result + expression, except that result is only evaluated once.
So they are not identical and that is why x += 1 will be more efficient.
Update: I just noticed that my MSDN Library link was to the JScript page instead of the VB page, which does not contain the same quote.
Therefore upon further research and testing, that answer does not apply to VB.NET. I was wrong. Here is a sample console app:
Module Module1
    Sub Main()
        Dim x = 0
        Console.WriteLine(PlusEqual1(x))
        Console.WriteLine(Add1(x))
        Console.WriteLine(PlusEqual2(x))
        Console.WriteLine(Add2(x))
        Console.ReadLine()
    End Sub
    Public Function PlusEqual1(ByVal x As Integer) As Integer
        x += 1
        Return x
    End Function
    Public Function Add1(ByVal x As Integer) As Integer
        x = x + 1
        Return x
    End Function
    Public Function PlusEqual2(ByVal x As Integer) As Integer
        x += 2
        Return x
    End Function
    Public Function Add2(ByVal x As Integer) As Integer
        x = x + 2
        Return x
    End Function
End Module
The IL for PlusEqual1 and Add1 is indeed identical:
.method public static int32 Add1(int32 x) cil managed
{
.maxstack 2
.locals init (
[0] int32 Add1)
L_0000: nop
L_0001: ldarg.0
L_0002: ldc.i4.1
L_0003: add.ovf
L_0004: starg.s x
L_0006: ldarg.0
L_0007: stloc.0
L_0008: br.s L_000a
L_000a: ldloc.0
L_000b: ret
}
The IL for PlusEqual2 and Add2 is nearly identical to that as well, differing only in the constant loaded (ldc.i4.2 instead of ldc.i4.1):
.method public static int32 Add2(int32 x) cil managed
{
.maxstack 2
.locals init (
[0] int32 Add2)
L_0000: nop
L_0001: ldarg.0
L_0002: ldc.i4.2
L_0003: add.ovf
L_0004: starg.s x
L_0006: ldarg.0
L_0007: stloc.0
L_0008: br.s L_000a
L_000a: ldloc.0
L_000b: ret
}
I wrote a simple console app:
static void Main(string[] args)
{
    int i = 0;
    i += 1;
    i = i + 1;
    Console.WriteLine(i);
}
I disassembled it using Reflector and here's what I got:
private static void Main(string[] args)
{
    int i = 0;
    i++;
    i++;
    Console.WriteLine(i);
}
They are the same.
They compile to the same thing; the second is just easier to type.
IMPORTANT:
The answers about evaluation are certainly correct about what += does in languages in general. But in VB.NET, I assume the X specified in the OP is a variable or a property.
They'll probably compile to the same IL.
UPDATE (to address the "probably" controversy):
VB.NET is a specification of a programming language. Any compiler that conforms to what's defined in the spec can be a VB.NET implementation. If you edit the source code of the MS VB.NET compiler to generate crappy code for the X += 1 case, you'll still conform to the VB.NET spec (because it doesn't say anything about how it's going to work; it just says the effect will be exactly the same, which makes it logical to generate the same code).
While the compiler is very, very likely to generate the same code for both (and I feel it really does), it's a pretty complex piece of software. Heck, you can't even guarantee that a compiler generates the exact same code when the same code is compiled twice!
What you can feel 100% secure in saying (unless you know the source code of the compiler intimately) is that a good compiler should generate the same code, performance-wise, which might or might not be the exact same code.
So many speculations! Even the conclusion drawn from the Reflector output is not necessarily true, because Reflector can apply optimizations while disassembling.
So why don't any of you just look at the IL code? Have a look at the following C# program:
static void Main(string[] args)
{
    int x = 2;
    int y = 3;
    x += 1;
    y = y + 1;
    Console.WriteLine(x);
    Console.WriteLine(y);
}
This code snippet compiles to:
.method private hidebysig static void Main(string[] args) cil managed
{
.entrypoint
// Code size 25 (0x19)
.maxstack 2
.locals init ([0] int32 x,
[1] int32 y)
// some commands omitted here
IL_0004: ldloc.0
IL_0005: ldc.i4.1
IL_0006: add
IL_0007: stloc.0
IL_0008: ldloc.1
IL_0009: ldc.i4.1
IL_000a: add
IL_000b: stloc.1
// some commands omitted here
}
As you can see, it is in fact exactly the same. And why is that? Because IL's purpose is to say what to do, not how to do it. Optimization is the job of the JIT compiler. By the way, it's the same in VB.NET.
On x86, if x is in register eax, they will both result in something like
inc eax;
So you're right, after some compilation stage, the IL will be the same.
There's a whole class of questions like this that can be answered with "trust your optimizer."
The famous myth is that
x++;
is less efficient than
++x;
because it has to store a temporary value. If you never use the temporary value, the optimizer will remove that store.
Yes, they behave the same.
No, the latter is not more efficient; they are probably equally efficient. Optimizers are good at that sort of thing. If you'd like to double-check, compile the code with optimizations on and view it in Reflector.
The optimizer probably produces the same result, if x is a simple type like int or float.
If you used some other language (limited VB knowledge here; can you overload +=?) where x could be one big honking object, the former creates an extra copy, which can be hundreds of megs. The latter does not.
They are the same.
x=x+1
is, read mathematically, a contradiction, whereas
x+=1
isn't, and it is quicker to type.
They may be the same in VB; they are not necessarily the same in C (where the operator comes from).
In C++ it depends what datatype is x and how are operators defined. If x is an instance of some class you can get completely different results.
Or maybe you should fix the question and specify that x is an integer or whatever.
I thought the differences were due to the additional clock cycles used for memory references, but I turned out to be wrong! I can't understand this myself.
instruction type        example                   cycles
===================================================================
ADD reg,reg             add ax,bx                 1
ADD mem,reg             add total, cx             3
ADD reg,mem             add cx,incr               2
ADD reg,immed           add bx,6                  1
ADD mem,immed           add pointers[bx][si],6    3
ADD accum,immed         add ax,10                 1
INC reg                 inc bx                    1
INC mem                 inc vpage                 3
MOV reg,reg             mov bp,sp                 1
MOV mem,reg             mov array[di],bx          1
MOV reg,mem             mov bx,pointer            1
MOV mem,immed           mov [bx],15               1
MOV reg,immed           mov cx,256                1
MOV mem,accum           mov total,ax              1
MOV accum,mem           mov al,string             1
MOV segreg,reg16        mov ds,ax                 2, 3
MOV segreg,mem16        mov es,psp                2, 3
MOV reg16,segreg        mov ax,ds                 1
MOV mem16,segreg        mov stack_save,ss         1
MOV reg32,controlreg    mov eax,cr0               22
                        mov eax,cr2               12
                        mov eax,cr3               21, 46
                        mov eax,cr4               14
MOV controlreg,reg32    mov cr0,eax               4
MOV reg32,debugreg      mov edx,dr0               DR0-DR3,DR6,DR7=11;
                                                  DR4,DR5=12
MOV debugreg,reg32      mov dr0,ecx               DR0-DR3,DR6,DR7=11;
                                                  DR4,DR5=12
source: http://turkish_rational.tripod.com/trdos/pentium.txt
The instructions may be translated as:
;for i = i+1 ; cycles
mov ax, [i] ; 1
add ax, 1 ; 1
mov [i], ax ; 1
;for i += 1
; not sure of the exact syntax; it should be the ADD mem,immed form (the "pointers" row above)
;for i++
inc i ; 3
;or
mov ax, [i] ; 1
inc ax ; 1
mov [i], ax ; 1
;for ++i
mov ax, [i] ; 1
;do stuff ; matters not
inc ax ; 1
mov [i], ax ; 1
They all turn out to be the same.
It's just some data that may be helpful. Please comment!
Something worth noting is that in some languages (Java, for example), +=, -=, *= etc. do an implicit cast.
int i = 0;
i = i + 5.5; // doesn't compile.
i += 5.5; // compiles.
At run time (at least with Perl) there is no difference. x+=1 is roughly .5 seconds faster to type than x = x+1, though.
There is no difference in programmatic efficiency; just typing efficiency.
Back in the early 1980s, one of the really cool optimizations of the Lattice C Compiler was that "x = x + 1;", "x += 1;" and "x++;" all produced exactly the same machine code. If they could do it, a compiler written in this millennium should definitely be able to do it.
If x is a simple integer scalar variable, they should be the same.
If x is a large expression, possibly with side effects, +=1 and ++ should be twice as fast.
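To make the side-effect case concrete, here is a minimal sketch in C rather than VB (the helper `next_index` and the two `count_*` functions are hypothetical names made up for this illustration). With the compound form the index expression runs once; spelled out, it runs twice:

```c
/* Counts how many times the index expression is evaluated. */
static int calls = 0;

/* Hypothetical helper with a side effect: bumps the counter, returns slot 0. */
static int next_index(void) {
    calls++;
    return 0;
}

/* Increments a[0] via the compound form; returns the evaluation count. */
int count_compound(int *a) {
    calls = 0;
    a[next_index()] += 1;                    /* index evaluated once */
    return calls;
}

/* Increments a[0] via the expanded form; returns the evaluation count. */
int count_expanded(int *a) {
    calls = 0;
    a[next_index()] = a[next_index()] + 1;   /* index evaluated twice */
    return calls;
}
```

Both functions add 1 to a[0]; only the number of calls to the helper differs (1 versus 2). Whether that translates into "twice as fast" depends on how expensive the expression is.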
Many people concentrate on this kind of low-level optimization as if that's what optimization is all about. I assume you know it's a much bigger subject.
Which algorithm is fastest for returning the last n bits in an unsigned integer?
1.
return num & ((1 << bits) - 1)
2.
return num % (1 << bits)
3.
let shift = num.bitWidth - bits
return (num << shift) >> shift
(where bitWidth is the width of the integer, in bits)
Or is there another, faster algorithm?
This is going to depend heavily on what compiler you have, what the optimization settings are, and what size of integers you're working with.
My hypothesis going into this section was that the answer would be "the compiler will be smart enough to optimize all of these in a way that's better than whatever you'd choose to write." And in some sense, that's correct. Consider the following three pieces of code:
#include <stdint.h>
#include <limits.h>

uint32_t lastBitsOf_v1(uint32_t number, uint32_t howManyBits) {
    return number & ((1 << howManyBits) - 1);
}

uint32_t lastBitsOf_v2(uint32_t number, uint32_t howManyBits) {
    return number % (1 << howManyBits);
}

uint32_t lastBitsOf_v3(uint32_t number, uint32_t howManyBits) {
    uint32_t shift = sizeof(number) * CHAR_BIT - howManyBits;
    return (number << shift) >> shift;
}
Over at the godbolt compiler explorer with optimization turned up to -Ofast with -march=native enabled, we get this code generated for the three functions:
lastBitsOf_v1(unsigned int, unsigned int):
    bzhi eax, edi, esi
    ret
lastBitsOf_v2(unsigned int, unsigned int):
    bzhi eax, edi, esi
    ret
lastBitsOf_v3(unsigned int, unsigned int):
    mov eax, 32
    sub eax, esi
    shlx edi, edi, eax
    shrx eax, edi, eax
    ret
Notice that the compiler recognized what you were trying to do with the first two versions of this function and completely rewrote the code to use the bzhi x86 instruction. This instruction copies the lower bits of one register into another. In other words, the compiler was able to generate a single assembly instruction! On the other hand, the compiler didn't recognize what the last version was trying to do, so it actually generated the code as written and actually did the shifts and subtraction.
But that's not the end of the story. Imagine that the number of bits to extract is known in advance. For example, suppose we want the lower 13 bits. Now, watch what happens with this code:
#include <stdint.h>
#include <limits.h>

uint32_t lastBitsOf_v1(uint32_t number) {
    return number & ((1 << 13) - 1);
}

uint32_t lastBitsOf_v2(uint32_t number) {
    return number % (1 << 13);
}

uint32_t lastBitsOf_v3(uint32_t number) {
    return (number << 19) >> 19;
}
These are literally the same functions, just with the bit amount hardcoded. Now look at what gets generated:
lastBitsOf_v1(unsigned int):
    mov eax, edi
    and eax, 8191
    ret
lastBitsOf_v2(unsigned int):
    mov eax, edi
    and eax, 8191
    ret
lastBitsOf_v3(unsigned int):
    mov eax, edi
    and eax, 8191
    ret
All three versions get compiled to the exact same code. The compiler saw what we're doing in each case and replaced it with this much simpler code that's basically the first version.
After seeing all of this, what should you do? My recommendation would be the following:
Unless this code is an absolute performance bottleneck - as in, you've measured your code's runtime and you're absolutely certain that the code for extracting the low bits of numbers is what's actually slowing you down - I wouldn't worry too much about this at all. Pick the most readable code that you can. I personally find option (1) the cleanest, but that's just me.
If you absolutely must get every ounce of performance out of this that you can, rather than taking my word for it, I'd recommend tinkering around with different versions of the code and seeing what assembly gets generated in each case and running some performance experiments. After all, if something like this is really important, you'd want to see it for yourself!
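Before timing anything, it's worth confirming that the variants you're comparing actually return the same values. Here's a minimal sketch of such a check (the function names are mine, not from the question). It uses an unsigned 1 for the shifts and only covers widths 1 through 31, since shifting a 32-bit value by 32 is undefined in C and the variants can't be compared there:

```c
#include <stdint.h>
#include <limits.h>

/* Variant 1: mask built from an unsigned 1 so the shift stays defined. */
static uint32_t low_bits_mask(uint32_t number, uint32_t bits) {
    return number & ((UINT32_C(1) << bits) - 1);
}

/* Variant 2: remainder modulo a power of two. */
static uint32_t low_bits_mod(uint32_t number, uint32_t bits) {
    return number % (UINT32_C(1) << bits);
}

/* Variant 3: shift the unwanted high bits out, then shift back. */
static uint32_t low_bits_shift(uint32_t number, uint32_t bits) {
    uint32_t shift = (uint32_t)(sizeof(number) * CHAR_BIT) - bits;
    return (number << shift) >> shift;
}

/* Returns 1 if all three variants agree on `number` for widths 1..31. */
int low_bits_agree(uint32_t number) {
    for (uint32_t bits = 1; bits < 32; bits++) {
        uint32_t expected = low_bits_mask(number, bits);
        if (low_bits_mod(number, bits) != expected) return 0;
        if (low_bits_shift(number, bits) != expected) return 0;
    }
    return 1;
}
```

Once the variants agree on a good spread of inputs, any timing difference you measure is a property of the generated code rather than of a bug in one of them.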
Hope this helps!
The ARM Cortex STM32's HardFault_Handler can only get the values of a few registers (r0, r1, r2, r3, lr, pc, xPSR) when a crash happens. But there is no FP or SP in the stack, so I could not unwind the stack.
Is there any solution for this? Thanks a lot.
[update]
I followed a web instruction to make ARMGCC (Keil uVision IDE) generate the FP by adding the compiler option "--use_frame_pointer", but I could not find the FP in the stack. I am a real newbie here. Below is my demo code:
int test2(int i, int j)
{
    return i/j;
}
int main()
{
    SCB->CCR |= 0x10;
    int a = 10;
    int b = 0;
    int c;
    c = test2(a,b);
}
enum { r0 = 0, r1, r2, r3, r11, r12, lr, pc, psr};
void Hard_Fault_Handler(uint32_t *faultStackAddress)
{
    uint32_t r0_val = faultStackAddress[r0];
    uint32_t r1_val = faultStackAddress[r1];
    uint32_t r2_val = faultStackAddress[r2];
    uint32_t r3_val = faultStackAddress[r3];
    uint32_t r12_val = faultStackAddress[r12];
    uint32_t r11_val = faultStackAddress[r11];
    uint32_t lr_val = faultStackAddress[lr];
    uint32_t pc_val = faultStackAddress[pc];
    uint32_t psr_val = faultStackAddress[psr];
}
I have two questions here:
1. I am not sure what the index of FP (r11) in the stack is, or whether it is pushed onto the stack at all. I assumed it comes before r12, because I compared the assembly output before and after adding the option "--use_frame_pointer". I also compared the values read in Hard_Fault_Handler, and it seems r11 is not in the stack, because the r11 value I read points to a place that is not my code.
[update] I have confirmed that FP is pushed onto the stack. The second question still needs to be answered.
See the snippets below:
Without the option "--use_frame_pointer"
test2 PROC
MOVS r0,#3
BX lr
ENDP
main PROC
PUSH {lr}
MOVS r0,#0
BL test2
MOVS r0,#0
POP {pc}
ENDP
with the option "--use_frame_pointer"
test2 PROC
PUSH {r11,lr}
ADD r11,sp,#4
MOVS r0,#3
MOV sp,r11
SUB sp,sp,#4
POP {r11,pc}
ENDP
main PROC
PUSH {r11,lr}
ADD r11,sp,#4
MOVS r0,#0
BL test2
MOVS r0,#0
MOV sp,r11
SUB sp,sp,#4
POP {r11,pc}
ENDP
2. It seems FP is not in the input parameter faultStackAddress of Hard_Fault_Handler(). Where can I get the caller's FP to unwind the stack?
[update again]
Now I understand that the last FP (r11) is not stored in the stack. All I need to do is read the value of the r11 register; then I can unwind the whole stack.
So now my final question is how to read it using the inline assembler in C. Following the reference at http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0472f/Cihfhjhg.html, I tried the code below, but it failed to read the correct value from r11:
volatile int top_fp;
__asm
{
mov top_fp, r11
}
r11's value is 0x20009DCC
top_fp's value is 0x00000004
[update 3] Below is my whole code.
int test5(int i, int j, int k)
{
    char a[128] = {0} ;
    a[0] = 'a';
    return i/j;
}
int test2(int i, int j)
{
    char a[18] = {0} ;
    a[0] = 'a';
    return test5(i, j, 0);
}
int main()
{
    SCB->CCR |= 0x10;
    int a = 10;
    int b = 0;
    int c;
    c = test2(a,b); //create a divide by zero crash
}
/* The fault handler implementation calls a function called Hard_Fault_Handler(). */
#if defined(__CC_ARM)
__asm void HardFault_Handler(void)
{
TST lr, #4
ITE EQ
MRSEQ r0, MSP
MRSNE r0, PSP
B __cpp(Hard_Fault_Handler)
}
#else
void HardFault_Handler(void)
{
__asm("TST lr, #4");
__asm("ITE EQ");
__asm("MRSEQ r0, MSP");
__asm("MRSNE r0, PSP");
__asm("B Hard_Fault_Handler");
}
#endif
void Hard_Fault_Handler(uint32_t *faultStackAddress)
{
    volatile int top_fp;
    __asm
    {
        mov top_fp, r11
    }
    //TODO: use top_fp to unwind the whole stack.
}
[update 4] Finally, I worked it out. My solution:
Note: To access r11, we have to use the embedded assembler (see here), which took me a long time to figure out.
//we have to use embedded assembler.
__asm int getRegisterR11()
{
mov r0,r11
BX LR
}
//call it from Hard_Fault_Handler function.
/*
Function call stack frame:
FP1(r11) -> | lr |(High Address)
| FP2|(prev FP)
| ...|
Current FP(r11) ->| lr |
| FP1|(prev FP)
| ...|(Low Address)
With FP, we can access lr (link register), which is the address to return to when the current function returns (where you were).
Then (current FP - 1) points to prev FP.
Thus we can unwind the stack.
*/
void unwindBacktrace(uint32_t topFp, uint16_t* backtrace)
{
    uint32_t nextFp = topFp;
    int j = 0;
    //#define BACK_TRACE_DEPTH 5
    //loop backtrace using FP(r11), save lr into an uint16_t array.
    for(int i = 0; i < BACK_TRACE_DEPTH; i++)
    {
        uint32_t lr = *((uint32_t*)nextFp);
        if ((lr >= 0x08000000) && (lr <= 0x08FFFFFF))
        {
            backtrace[j*2] = LOW_16_BITS(lr);
            backtrace[j*2 + 1] = HIGH_16_BITS(lr);
            j += 1;
        }
        nextFp = *((uint32_t*)nextFp - 1);
        if (nextFp == 0)
        {
            break;
        }
    }
}
#if defined(__CC_ARM)
__asm void HardFault_Handler(void)
{
TST lr, #4
ITE EQ
MRSEQ r0, MSP
MRSNE r0, PSP
B __cpp(Hard_Fault_Handler)
}
#else
void HardFault_Handler(void)
{
__asm("TST lr, #4");
__asm("ITE EQ");
__asm("MRSEQ r0, MSP");
__asm("MRSNE r0, PSP");
__asm("B Hard_Fault_Handler");
}
#endif
void Hard_Fault_Handler(uint32_t *faultStackAddress)
{
    //get back trace
    int topFp = getRegisterR11();
    unwindBacktrace(topFp, persistentData.faultStack.back_trace);
}
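The frame layout described in the comment above (lr at FP, previous FP one word below it) can also be tried out on a PC before flashing anything. This is only a sketch under my own assumptions: `walk_fp_chain` is a made-up name, and "addresses" are word indices into an ordinary array so that it runs on a 64-bit host:

```c
#include <stddef.h>
#include <stdint.h>

/* Walks a simulated frame-pointer chain.
   mem[fp] holds the saved lr and mem[fp - 1] holds the previous fp;
   a previous fp of 0 ends the chain. Addresses are word indices into
   mem (an assumption made so the sketch is host-runnable). */
size_t walk_fp_chain(const uint32_t *mem, size_t mem_words,
                     uint32_t top_fp, uint32_t *out_lr, size_t max_depth) {
    size_t depth = 0;
    uint32_t fp = top_fp;
    while (fp != 0 && fp < mem_words && depth < max_depth) {
        out_lr[depth++] = mem[fp];   /* saved return address of this frame */
        fp = mem[fp - 1];            /* hop to the caller's frame */
    }
    return depth;
}
```

The bounds check on fp is the main difference from the on-target loop: a corrupted chain then stops the walk instead of reading arbitrary memory.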
A very primitive method to unwind the stack in such a case is to read all stack memory above the SP seen at the time of HardFault_Handler and process it using arm-none-eabi-addr2line. All link register entries saved on the stack will be transformed into source lines (remember that the actual code path goes through the line before the one LR points to). Note that if functions in between were called using a branch instruction (b) instead of branch and link (bl), you will not see them using this method.
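A rough sketch of that scan, under some assumptions of mine: the code lives in the 0x08000000-0x08FFFFFF flash range used in the question, return addresses carry the Thumb bit (bit 0 set), and the function name is made up. The filter is only a heuristic, so ordinary data that happens to look like a code address will show up too:

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed code region: the STM32 flash range used in the question. */
#define FLASH_START 0x08000000u
#define FLASH_END   0x08FFFFFFu

/* Scans `words` stack words upward from `sp` and copies every value that
   looks like a saved return address into `out`. Returns the count found. */
size_t scan_stack_for_lr(const uint32_t *sp, size_t words,
                         uint32_t *out, size_t out_cap) {
    size_t found = 0;
    for (size_t i = 0; i < words && found < out_cap; i++) {
        uint32_t v = sp[i];
        /* In range and odd (Thumb bit set): treat it as a candidate LR. */
        if (v >= FLASH_START && v <= FLASH_END && (v & 1u)) {
            out[found++] = v;
        }
    }
    return found;
}
```

The collected values are what you would then feed to arm-none-eabi-addr2line.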
(I don't have enough reputation points to write comments, so I'm editing my answer):
UPDATE for question 2:
Why do you expect Hard_Fault_Handler to have any arguments? Hard_Fault_Handler is usually a function whose address is stored in the vector (exception) table. When the processor exception happens, Hard_Fault_Handler is executed. No argument passing is involved in this. But still, all registers at the time the fault happens are preserved. Specifically, if you compiled without omit-frame-pointer you can just read the value of R11 (or R7 in Thumb-2 mode). However, to be sure that Hard_Fault_Handler in your code is actually the real hard fault handler, look into the startup.s code and see whether Hard_Fault_Handler is at the third entry in the vector table. If another function is there, it means Hard_Fault_Handler is just called from that function explicitly. See this article for details. You can also read my blog :) There is a chapter about the stack which is based on an Android example, but a lot of things are the same in general.
Also note that faultStackAddress most probably holds a stack pointer, not a frame pointer.
UPDATE 2
OK, let's clarify some things. Firstly, please paste the code from which you call Hard_Fault_Handler. Secondly, I guess you call it from within the real HardFault exception handler. In that case you cannot expect R11 to be at faultStackAddress[r11]. You already mentioned this in the first sentence of your question. There will be only r0-r3, r12, lr, pc and psr.
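In other words, the index enum for the stacked frame has eight slots and no r11 among them. A minimal sketch (the FRAME_* names and `stacked_pc` are mine, and this assumes the basic frame without floating-point state):

```c
#include <stdint.h>

/* Word offsets in the frame the core pushes on exception entry,
   in the order listed above: r0-r3, r12, lr, pc, psr. No r11 slot. */
enum {
    FRAME_R0 = 0,
    FRAME_R1,
    FRAME_R2,
    FRAME_R3,
    FRAME_R12,
    FRAME_LR,
    FRAME_PC,
    FRAME_PSR,
    FRAME_WORDS   /* number of stacked words: 8 */
};

/* Reads the faulting pc out of the raw word array passed to the handler. */
uint32_t stacked_pc(const uint32_t *faultStackAddress) {
    return faultStackAddress[FRAME_PC];
}
```

With an extra r11 entry before r12, as in the question's enum, every index from r12 onward is off by one word.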
You've also written:
But there is no FP and SP in the stack. Thus I could not unwind the
stack. Is there any solution for this?
The SP is not "in the stack" because you already have it in one of the stack pointer registers (msp or psp). See again THIS ARTICLE. Also, FP is not crucial for unwinding the stack because you can do it without it (by "navigating" through saved link registers). Another thing: if you dump memory below your SP, you can expect FP to be right next to the saved LR if you really need it.
Answering your last question: I don't know how you're verifying this code or how you're calling it (you need to paste the full code). You can look at the assembly of that function and see what's happening under the hood. Another thing you can do is follow this post as a template.
I'm trying to write an assembly program that adds together an unknown number of ints, like
sum(int a,int b, ...)
My code is
.globl notnull
notnull:
leal 4(%esp),%ebx
jmp next2
next:
leal 4(%ebx),%ebx
next2:
cmp $0,(%ebx)
je end
movl (%ebx),%eax
jmp next
end:
ret
I test it with this program:
#include <stdio.h>
extern int notnull();
int main()
{
    int x=notnull(3,2,1,0);
    printf("3,2,1,0 = %d\n",x);
    x=notnull(2,1,0);
    printf("2,1,0 = %d\n",x);
    x=notnull(1,0);
    printf("1,0 = %d\n",x);
    x=notnull(0);
    printf("0 = %d\n",x);
    x=notnull();
    printf("_ = %d\n",x);
    return 0;
}
Which gives me this output:
3,2,1,0 = 1 (#1)
2,1,0 = 1 (#2)
1,0 = 1 (#3)
0 = 8 (#4)
_ = 8 (#5)
What I want is for the program to return 0 when there are no arguments (see #5), and also to work without needing 0 as the last argument.
The ideal output of notnull(3,2) would be 2, and notnull() would be 0.
You need to read up on C argument passing conventions.
Basically, there is no way to automatically determine how many arguments are being passed to a function.
This is why all C functions either have a fixed number of arguments, or if they use variable arguments (varargs) they have one fixed argument before the variable part, which somehow expresses how many additional arguments are being passed.
Using an empty argument list makes it possible to validly call the function in any manner, but it doesn't help with the core problem of (in the function) determining how many arguments are available.
You might be able to figure it out by inspecting the stack, but of course that requires intimate knowledge of exactly how your particular compiler chooses to implement the call. This might vary for different numbers of arguments, too.
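For the stated goal of adding the arguments together, the usual pattern is the one described above: a fixed first argument carries the count. A minimal sketch in C (the name `sum_ints` is just for illustration):

```c
#include <stdarg.h>

/* Sums `count` int arguments; the caller states the count explicitly,
   because the callee has no other way to learn it. */
int sum_ints(int count, ...) {
    va_list args;
    int total = 0;
    va_start(args, count);
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);   /* fetch the next int argument */
    }
    va_end(args);
    return total;
}
```

A call such as sum_ints(3, 3, 2, 1) returns 6, and sum_ints(0) returns 0, with no sentinel needed at the end of the list.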
For this example, I am working with objective-c, but answers from the broader C/C++ community are welcome.
@interface BSWidget : NSObject {
    float tre[3];
}
@property(assign) float* tre;
.
- (void)assignToTre:(float*)triplet {
    tre[0] = triplet[0];
    tre[1] = triplet[1];
    tre[2] = triplet[2];
}
.
- (void)copyToTre:(float*)triplet {
    memcpy(tre, triplet, sizeof(tre) );
}
So between these two approaches, and considering the fact that these setter functions will only generally handle dimensions of 2,3, or 4...
What would be the most efficient approach for this situation?
Will gcc generally reduce these to the same basic operations?
Thanks.
A quick test seems to show that the compiler, when optimising, replaces the memcpy call with the instructions to perform the assignment.
Disassembling the following code, compiled both unoptimised and with -O2, shows that in the optimised case the testMemcpy function does not contain a call to memcpy.
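#include <stdlib.h>
#include <string.h>
/* Definition not shown in the original; assumed from the initializer below. */
struct test { int a; char b; };
struct test src = { .a=1, .b='x' };
void testMemcpy(void)
{
    struct test *dest = malloc(sizeof(struct test));
    memcpy(dest, &src, sizeof(struct test));
}
void testAssign(void)
{
    struct test *dest = malloc(sizeof(struct test));
    *dest = src;
}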
Unoptimised testMemcpy, with a memcpy call as expected
(gdb) disassemble testMemcpy
Dump of assembler code for function testMemcpy:
0x08048414 <+0>: push %ebp
0x08048415 <+1>: mov %esp,%ebp
0x08048417 <+3>: sub $0x28,%esp
0x0804841a <+6>: movl $0x8,(%esp)
0x08048421 <+13>: call 0x8048350 <malloc@plt>
0x08048426 <+18>: mov %eax,-0xc(%ebp)
0x08048429 <+21>: movl $0x8,0x8(%esp)
0x08048431 <+29>: movl $0x804a018,0x4(%esp)
0x08048439 <+37>: mov -0xc(%ebp),%eax
0x0804843c <+40>: mov %eax,(%esp)
0x0804843f <+43>: call 0x8048340 <memcpy@plt>
0x08048444 <+48>: leave
0x08048445 <+49>: ret
Optimised testAssign
(gdb) disassemble testAssign
Dump of assembler code for function testAssign:
0x080483f0 <+0>: push %ebp
0x080483f1 <+1>: mov %esp,%ebp
0x080483f3 <+3>: sub $0x18,%esp
0x080483f6 <+6>: movl $0x8,(%esp)
0x080483fd <+13>: call 0x804831c <malloc@plt>
0x08048402 <+18>: mov 0x804a014,%edx
0x08048408 <+24>: mov 0x804a018,%ecx
0x0804840e <+30>: mov %edx,(%eax)
0x08048410 <+32>: mov %ecx,0x4(%eax)
0x08048413 <+35>: leave
0x08048414 <+36>: ret
Optimised testMemcpy does not contain a memcpy call
(gdb) disassemble testMemcpy
Dump of assembler code for function testMemcpy:
0x08048420 <+0>: push %ebp
0x08048421 <+1>: mov %esp,%ebp
0x08048423 <+3>: sub $0x18,%esp
0x08048426 <+6>: movl $0x8,(%esp)
0x0804842d <+13>: call 0x804831c <malloc@plt>
0x08048432 <+18>: mov 0x804a014,%edx
0x08048438 <+24>: mov 0x804a018,%ecx
0x0804843e <+30>: mov %edx,(%eax)
0x08048440 <+32>: mov %ecx,0x4(%eax)
0x08048443 <+35>: leave
0x08048444 <+36>: ret
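The test above uses a struct; for the float triplet from the question, a comparable pair in plain C might look like the sketch below (the function names are hypothetical, and I haven't disassembled this exact pair). Compiling both with -O2 and comparing the output, as above, would show whether your compiler treats them alike:

```c
#include <string.h>

/* Element-by-element assignment of a 3-float triplet. */
void assign_triplet(float dst[3], const float src[3]) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
}

/* The same copy through memcpy. The size is spelled out because sizeof
   on an array parameter would give the size of a pointer instead. */
void copy_triplet(float dst[3], const float src[3]) {
    memcpy(dst, src, 3 * sizeof(float));
}
```

Whatever the generated code looks like, the two functions leave the destination with identical contents.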
Speaking from a C background, I recommend using direct assignment. That version of the code is more obvious as to your intent, and less error-prone if your array changes in the future and adds extra indices that your function doesn't need to copy.
The two are not strictly equivalent. memcpy is typically implemented as a loop that copies the data in fixed-size chunks (that may be smaller than a float), so the compiler probably won't generate the same code for the memcpy case. The only way to know for sure is to build it both ways and look at the emitted assembly in a debugger.
Even if the memcpy call is inlined, it will probably result in more code and slower execution time. The direct assignment case should be more efficient (unless your target platform requires special code to handle float datatypes). This is only an educated guess, however; the only way to know for sure is to try it both ways and profile the code.
memcpy:
1. Do function prolog.
2. Initialize counter and pointers.
3. Check whether there are bytes left to copy.
4. Copy memory.
5. Increment pointer.
6. Increment pointer.
7. Increment counter.
8. Repeat 3-7 3 or 11 more times.
9. Do function epilog.
Direct assignment:
1. Copy memory.
2. Copy memory.
3. Copy memory.
As you see, direct assignment is much faster.
Today I had a problem converting a Long (Int64) to an Integer (Int32). The problem is that my code was always working in 32-bit environments, but when I try THE SAME executable on a 64-bit computer it crashes with a System.OverflowException.
I've prepared this test code in Visual Studio 2008 in a new project with default settings:
Module Module1
    Sub Main()
        Dim alpha As Long = -1
        Dim delta As Integer
        Try
            delta = CInt(alpha And UInteger.MaxValue)
            Console.WriteLine("CINT OK")
            delta = Convert.ToInt32(alpha And UInteger.MaxValue)
            Console.WriteLine("Convert.ToInt32 OK")
        Catch ex As Exception
            Console.WriteLine(ex.GetType().ToString())
        Finally
            Console.ReadLine()
        End Try
    End Sub
End Module
On my 32-bit setups (Windows XP SP3 32-bit and Windows 7 32-bit) it prints up to "CINT OK", but on the 64-bit computer (Windows 7 64-bit) where I tested THE SAME executable it prints only the exception name.
Is this behavior documented? I tried to find a reference, but I failed miserably.
For reference I leave the CIL code too:
.method public static void Main() cil managed
{
.entrypoint
.custom instance void [mscorlib]System.STAThreadAttribute::.ctor() = ( 01 00 00 00 )
// Code size 88 (0x58)
.maxstack 2
.locals init ([0] int64 alpha,
[1] int32 delta,
[2] class [mscorlib]System.Exception ex)
IL_0000: nop
IL_0001: ldc.i4.m1
IL_0002: conv.i8
IL_0003: stloc.0
IL_0004: nop
.try
{
.try
{
IL_0005: ldloc.0
IL_0006: ldc.i4.m1
IL_0007: conv.u8
IL_0008: and
IL_0009: conv.ovf.i4
IL_000a: stloc.1
IL_000b: ldstr "CINT OK"
IL_0010: call void [mscorlib]System.Console::WriteLine(string)
IL_0015: nop
IL_0016: ldloc.0
IL_0017: ldc.i4.m1
IL_0018: conv.u8
IL_0019: and
IL_001a: call int32 [mscorlib]System.Convert::ToInt32(int64)
IL_001f: stloc.1
IL_0020: ldstr "Convert.ToInt32 OK"
IL_0025: call void [mscorlib]System.Console::WriteLine(string)
IL_002a: nop
IL_002b: leave.s IL_0055
} // End .try
catch [mscorlib]System.Exception
{
IL_002d: dup
IL_002e: call void [Microsoft.VisualBasic]Microsoft.VisualBasic.CompilerServices.ProjectData::SetProjectError(class [mscorlib]System.Exception)
IL_0033: stloc.2
IL_0034: nop
IL_0035: ldloc.2
IL_0036: callvirt instance class [mscorlib]System.Type [mscorlib]System.Exception::GetType()
IL_003b: callvirt instance string [mscorlib]System.Type::ToString()
IL_0040: call void [mscorlib]System.Console::WriteLine(string)
IL_0045: nop
IL_0046: call void [Microsoft.VisualBasic]Microsoft.VisualBasic.CompilerServices.ProjectData::ClearProjectError()
IL_004b: leave.s IL_0055
} // End handler
} // End .try
finally
{
IL_004d: nop
IL_004e: call string [mscorlib]System.Console::ReadLine()
IL_0053: pop
IL_0054: endfinally
} // End handler
IL_0055: nop
IL_0056: nop
IL_0057: ret
} // End of method Module1::Main
I suspect that the instruction that is behaving differently is either conv.ovf.i4 or the ldc.i4.m1/conv.u8 pair.
What is going on?
Convert.ToInt32(long) fails in both environments. It is only CInt(Long) which is behaving differently.
Unfortunately, the 64-bit version is accurate. It really is an overflow: the result of the expression is a Long with the value &hffffffff. The sign bit is AND-ed off the value, so it is no longer negative. The resulting value cannot be converted to an Integer, whose maximum value is &h7fffffff. You can see this by adding this code to your snippet:
Dim value As Long = alpha And UInteger.MaxValue
Console.WriteLine(value)
Output: 4294967295
The x64 jitter uses an entirely different way to check for overflows: it doesn't rely on the CPU overflow exception but explicitly compares the values to Integer.MaxValue and Integer.MinValue. The x86 jitter gets it wrong: it optimizes the code too much and ends up making an unsigned operation that doesn't trip the CPU exception.
Filing a bug report at connect.microsoft.com is probably not worth the effort; fixing this for the x86 jitter would be a drastically breaking change. You'll have to rework this logic. I'm not sure how, since I don't see what you are trying to do.
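The explicit comparison against the Integer limits described above boils down to a range check before narrowing. Here is a sketch of the idea in C rather than VB.NET (the function name is made up, and this illustrates the check itself, not the jitter's actual code):

```c
#include <stdbool.h>
#include <stdint.h>

/* Range-checked narrowing from 64 to 32 bits.
   Returns false (and leaves *out untouched) when the value does not fit. */
bool checked_to_int32(int64_t value, int32_t *out) {
    if (value < INT32_MIN || value > INT32_MAX) {
        return false;   /* the case where a checked conversion overflows */
    }
    *out = (int32_t)value;
    return true;
}
```

With the value from the question, -1 masked down to 4294967295, the check fails, which matches the 64-bit behaviour; an unmasked -1 passes and converts to -1.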
I don't know of any real reference as such, but if you go to this page:
http://msdn.microsoft.com/en-us/library/system.int32.aspx
You can see in the sample where they use CInt they do wrap it in a OverflowException handler (try searching for CInt on that page to find it). So at least they say implicitly that CInt can throw that in certain circumstances.
If you do not want the exceptions being thrown you can change the Remove integer overflow checks setting on the Advanced Compile Options page.
Try changing the build platform target from "Any CPU" to "x86".
Just to complete the documentation of this issue I made this:
Imports System.Runtime.InteropServices
Module Module1
    <DllImport("KERNEL32.DLL", EntryPoint:="DebugBreak", _
    SetLastError:=False, CharSet:=CharSet.Unicode, _
    ExactSpelling:=True, _
    CallingConvention:=CallingConvention.StdCall)> _
    Public Sub DebugBreak()
    End Sub
    Sub Main()
        Dim alpha As Long = -1
        Dim delta As Integer
        DebugBreak() ' To call OllyDbg
        ' Needed to prevent the jitter from raising the overflow exception in the second CInt without really doing the conversion first
        alpha = alpha Xor Environment.TickCount
        Console.WriteLine(alpha)
        delta = CInt(alpha And UInteger.MaxValue)
        Console.WriteLine(delta)
        alpha = alpha And UInteger.MaxValue
        delta = CInt(alpha)
        Console.WriteLine(delta)
        Console.ReadLine()
    End Sub
End Module
Using OllyDbg I got this:
CPU Disasm
Address Hex dump Command Comments
00D10070 55 PUSH EBP
00D10071 8BEC MOV EBP,ESP
00D10073 57 PUSH EDI
00D10074 56 PUSH ESI
00D10075 53 PUSH EBX
00D10076 E8 A1BFC7FF CALL 0098C01C
00D1007B E8 A18C1879 CALL <JMP.&KERNEL32.GetTickCount> ; Jump to KERNEL32.GetTickCount
00D10080 99 CDQ
00D10081 F7D0 NOT EAX
00D10083 F7D2 NOT EDX
00D10085 8BF0 MOV ESI,EAX
00D10087 8BFA MOV EDI,EDX
00D10089 E8 62D25D78 CALL 792ED2F0 ; Called everytime Console is referenced here
00D1008E 57 PUSH EDI
00D1008F 56 PUSH ESI
00D10090 8BC8 MOV ECX,EAX
00D10092 8B01 MOV EAX,DWORD PTR DS:[ECX]
00D10094 FF90 C4000000 CALL DWORD PTR DS:[EAX+0C4] ; Console.WriteLine(Int64)
00D1009A 8BDE MOV EBX,ESI ; Note: EDI:ESI holds alpha variable
00D1009C 83E3 FF AND EBX,FFFFFFFF ; delta = CInt(alpha And UInteger.MaxValue)
00D1009F E8 4CD25D78 CALL 792ED2F0
00D100A4 8BC8 MOV ECX,EAX
00D100A6 8BD3 MOV EDX,EBX
00D100A8 8B01 MOV EAX,DWORD PTR DS:[ECX]
00D100AA FF90 BC000000 CALL DWORD PTR DS:[EAX+0BC] ; Console.WriteLine(Int32)
00D100B0 33FF XOR EDI,EDI ; alpha = alpha And UInteger.MaxValue
00D100B2 85F6 TEST ESI,ESI ; delta = CInt(alpha) [Begins here]
00D100B4 7C 06 JL SHORT 00D100BC
00D100B6 85FF TEST EDI,EDI
00D100B8 75 2B JNE SHORT 00D100E5
00D100BA EB 05 JMP SHORT 00D100C1
00D100BC 83FF FF CMP EDI,-1
00D100BF 75 24 JNE SHORT 00D100E5
00D100C1 8BDE MOV EBX,ESI ; delta = CInt(alpha) [Ends here]
00D100C3 E8 28D25D78 CALL 792ED2F0
00D100C8 8BC8 MOV ECX,EAX
00D100CA 8BD3 MOV EDX,EBX
00D100CC 8B01 MOV EAX,DWORD PTR DS:[ECX]
00D100CE FF90 BC000000 CALL DWORD PTR DS:[EAX+0BC] ; Console.WriteLine(Int32)
00D100D4 E8 1B1AA878 CALL 79791AF4
00D100D9 8BC8 MOV ECX,EAX
00D100DB 8B01 MOV EAX,DWORD PTR DS:[ECX]
00D100DD FF50 64 CALL DWORD PTR DS:[EAX+64]
00D100E0 5B POP EBX
00D100E1 5E POP ESI
00D100E2 5F POP EDI
00D100E3 5D POP EBP
00D100E4 C3 RETN
As you can see, the second CInt statement is much more complex than just an AND (which could actually be omitted, since EBX won't change and the EFLAGS are not consumed anywhere). The probable origin of this problem can be seen in Hans' answer.