I have a question regarding C++/CLI vs. native C++ speed. I have written a little test application and I am seeing very surprising results.
It seems that the C++/CLI code is significantly slower, even with managed code turned off. Basically I created two console apps: a standard Win32 console app and a CLR console app.
Here is the code I used for the test. I kept the code exactly the same in all versions of the test.
#include <windows.h>   // GetTickCount, DWORD
#include <iostream>

const int NumberOfTests = 10000000;

void GrowBalance(int numberOfYears)
{
    std::cout<<"Called"<<std::endl;
    DWORD startTime = GetTickCount();
    int numberOfRandom = 0;
    for(int i = 0; i < NumberOfTests; i++)
    {
        double dBalance = 10000.0;
        for(int year = 0; year < numberOfYears; year++)
        {
            dBalance *= 1.05;
            if(dBalance > 20000.00 && dBalance < 22000.00)
            {
                numberOfRandom++;
            }//if
        }//for
    }//for
    DWORD endTime = GetTickCount();
    std::cout<<"Time Elapsed: "<<endTime - startTime<<std::endl;
    std::cout<<"Number of random: "<<numberOfRandom<<std::endl;
}
Output managed code:
Called
Time Elapsed: 9937
Number of random: 20000000
Output managed code with #pragma managed(push, off):
Called
Time Elapsed: 24516
Number of random: 20000000
Output native code:
Called
Time Elapsed: 2156
Number of random: 20000000
In main I am just calling GrowBalance with 90 years specified. Pretty basic test. Is there something I am doing wrong, or am I really looking at code that is going to be 4.5 times slower by using C++/CLI? I also don't understand the case of turning managed code off. Everything I have read said this would compile the code to native C++, but it is insanely slower. Any help with this would be very much appreciated.
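For reference, main is nothing more than a single call to GrowBalance (a minimal sketch; only the call with 90 years is from my actual code):
int main()
{
    GrowBalance(90);
    return 0;
}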
Update:
I just ran this test in Visual Studio 2005 instead of 2008, and the C++/CLI build now matches the native C++ performance.
Update #2:
I just put my test code into a class instead of a single function and am getting much better results. Now the mixed code is performing at an average run time of ~5000ms.
But in 2005 I am seeing much, much faster results: an average run time of about ~1875ms. Maybe I will just stick to 2005 for my CLI development, unless someone has a reason why this could be occurring.
One thing you may be running into is that for native C++, optimizations are controlled by command-line arguments to the compiler, but for managed code, optimizations are controlled by how you start the application (i.e. if you launch in debugger, many optimizations are disabled even if you did an optimized build). You shouldn't be running performance tests "in" Visual Studio at all.
The native compiler also has a LOT of extra optimizations. It might even be smart enough to figure out that dBalance is strictly increasing, and continuing the inner for loop once dBalance > 22000.0 has no observable side effects.
What happens in all three cases if you change that inner for loop like this (it will only do 17 iterations, as long as numberOfYears >= 17)?
double dBalance = 10000.0;
for(int year = 0; year < numberOfYears && dBalance < 22000.0; year++)
{
    dBalance *= 1.05;
    if(dBalance > 20000.0 && dBalance < 22000.0)
    {
        numberOfRandom++;
    }//if
}//for
How about:
if (numberOfYears > 14) {
    double dBalance = 19799.315994393973883056640625;
    for(int year = 14; year < numberOfYears && dBalance < 22000.0; year++)
    {
        dBalance *= 1.05;
        if(dBalance < 22000.0)
        {
            numberOfRandom++;
        }//if
    }//for
}
And how about:
if (numberOfYears > 14) {
    numberOfRandom += (numberOfYears >= 16)? 2: numberOfYears - 14;
}
I'm searching for a proper way to use Google OR-Tools for solving an annual ship crew scheduling problem. I tried to follow the scheduling problem examples provided, but I couldn't find a way to set up the 4-dimensional decision variable needed (D[i,j,k,t], with i for captains, j for engineers, k for ships and t for time periods (days or weeks)).
Although there are many examples given (for C#), the major problem I faced is how to set up and use this main decision variable, and how to use the DecisionBuilder, since in all the examples the variables had 2 dimensions and were flattened in order to make the comparisons. Unfortunately, I haven't found a way to use smaller D-variables, since the penalty score (it is a penalty-minimization problem) is estimated over possible sets of Captains-Engineers, Captains-Ships, and Engineers-Ships.
Why don't you create your 4D array, and then fill it with variables one by one?
Here is the code for matrices:
public IntVar[,] MakeIntVarMatrix(int rows, int cols, long min, long max) {
  IntVar[,] array = new IntVar[rows, cols];
  for (int i = 0; i < rows; ++i) {
    for (int j = 0; j < cols; ++j) {
      array[i,j] = MakeIntVar(min, max);
    }
  }
  return array;
}
This being said, please use the CP-SAT solver as the original CP solver is deprecated.
For an introduction, see:
https://developers.google.com/optimization/cp/cp_solver#cp-solver_example/
https://github.com/google/or-tools/tree/master/ortools/sat/doc
I am confused about this particular line:
result = (double) hi * (1 << 30) * 4 + lo;
of the following code:
void access_counter(unsigned *hi, unsigned *lo)
// Set *hi and *lo to the high and low order bits of the cycle
// counter.
{
    asm("rdtscp; movl %%edx,%0; movl %%eax,%1" // Read cycle counter
        : "=r" (*hi), "=r" (*lo)               // and move results to
        : /* No input */                       // the two outputs
        : "%edx", "%eax");
}
double get_counter()
// Return the number of cycles since the last call to start_counter.
{
    unsigned ncyc_hi, ncyc_lo;
    unsigned hi, lo, borrow;
    double result;
    /* Get cycle counter */
    access_counter(&ncyc_hi, &ncyc_lo);
    lo = ncyc_lo - cyc_lo;
    borrow = lo > ncyc_lo;
    hi = ncyc_hi - cyc_hi - borrow;
    result = (double) hi * (1 << 30) * 4 + lo;
    if (result < 0) {
        fprintf(stderr, "Error: counter returns neg value: %.0f\n", result);
    }
    return result;
}
The thing I cannot understand is why hi is being multiplied by 2^30 and then by 4, and then lo added to it. Could someone please explain what is happening in this line of code? I do know what hi and lo contain.
The short answer:
That line turns a 64bit integer that is stored as 2 32bit values into a floating point number.
Why doesn't the code just use a 64bit integer? Well, gcc has supported 64bit numbers for a long time, but presumably this code predates that. In that case, the only way to support numbers that big is to put them into a floating point number.
The long answer:
First, you need to understand how rdtscp works. When this assembler instruction is invoked, it does 2 things:
1) Sets ecx to IA32_TSC_AUX MSR. In my experience, this generally just means ecx gets set to zero.
2) Sets edx:eax to the current value of the processor's time-stamp counter. This means that the lower 32 bits of the counter go into eax, and the upper 32 bits go into edx.
With that in mind, let's look at the code. When called from get_counter, access_counter is going to put edx in 'ncyc_hi' and eax in 'ncyc_lo.' Then get_counter is going to do:
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
What does this do?
Since the time is stored in 2 different 32bit numbers, if we want to find out how much time has elapsed, we need to do a bit of work to find the difference between the old time and the new. When it is done, the result is stored (again, using 2 32bit numbers) in hi / lo.
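If it helps, here is a tiny self-contained sketch (the sample counter readings are made up) showing that this hi/lo subtraction with a borrow produces the same difference you would get by subtracting the two readings as 64bit numbers:
#include <stdio.h>
#include <stdint.h>

int main() {
    /* Made-up "old" and "new" counter readings, split into 32bit halves. */
    uint32_t cyc_hi  = 0x00000001, cyc_lo  = 0xFFFFFFF0;
    uint32_t ncyc_hi = 0x00000002, ncyc_lo = 0x00000010;

    /* The multi-word subtraction from the question. */
    uint32_t lo = ncyc_lo - cyc_lo;          /* wraps around, giving the low 32 bits */
    uint32_t borrow = lo > ncyc_lo;          /* 1 if the low subtraction wrapped */
    uint32_t hi = ncyc_hi - cyc_hi - borrow; /* propagate the borrow into the high word */

    /* The same difference, computed directly in 64 bits. */
    uint64_t diff = (((uint64_t)ncyc_hi << 32) | ncyc_lo)
                  - (((uint64_t)cyc_hi  << 32) | cyc_lo);

    printf("hi:lo = %08x:%08x, 64bit diff = %016llx\n",
           (unsigned)hi, (unsigned)lo, (unsigned long long)diff);
    return 0;
}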
Which finally brings us to your question.
result = (double) hi * (1 << 30) * 4 + lo;
If we could use 64bit integers, converting 2 32bit values to a single 64bit value would look like this:
unsigned long long result = hi; // put hi into the 64bit number.
result <<= 32;                  // shift those 32 bits to the upper part of the number
result |= lo;                   // OR in the lower 32 bits.
If you aren't used to bit shifting, maybe looking at it like this will help. If lo = 1 and hi = 2, then expressed as hex numbers:
result = hi;    0x0000000000000002
result <<= 32;  0x0000000200000000
result |= lo;   0x0000000200000001
But if we assume the compiler doesn't support 64bit integers, that won't work. While floating point numbers can hold values that big, they don't support shifting. So we need to figure out a way to shift 'hi' left by 32bits, without using left shift.
Ok then, shifting left by 1 is really the same as multiplying by 2. Shifting left by 2 is the same as multiplying by 4. Shifting left by [omitted...] Shifting left by 32 is the same as multiplying by 4,294,967,296.
By an amazing coincidence, 4,294,967,296 == (1 << 30) * 4.
So why write it in that complicated fashion? Well, 4,294,967,296 is a pretty big number. In fact, it's too big to fit in a 32-bit integer, which means that if we put it in our source code, a compiler that doesn't support 64-bit integers may have trouble figuring out how to process it. Written like this, the compiler can generate whatever floating point instructions it might need to work on that really big number.
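If you want to convince yourself that the two spellings are the same thing, here is a quick check (the sample values for hi and lo are made up):
#include <stdio.h>

int main() {
    unsigned hi = 3, lo = 7;                        /* arbitrary sample halves */
    double a = (double) hi * (1 << 30) * 4 + lo;    /* the expression in question */
    double b = (double) hi * 4294967296.0 + lo;     /* the same thing, written as 2^32 */
    printf("%.0f %.0f\n", a, b);                    /* both print 12884901895 */
    return 0;
}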
Why the current code is wrong:
It looks like variations of this code have been wandering around the internet for a long time. Originally (I assume) access_counter was written using rdtsc instead of rdtscp. I'm not going to try to describe the difference between the two (google them), other than to point out that rdtsc does not set ecx, and rdtscp does. Whoever changed rdtsc to rdtscp apparently didn't know that, and failed to adjust the inline assembler stuff to reflect it. While your code might work fine despite this, it might do something weird instead. To fix it, you could do:
asm("rdtscp; movl %%edx,%0; movl %%eax,%1" // Read cycle counter
: "=r" (*hi), "=r" (*lo) // and move results to
: /* No input */ // the two outputs
: "%edx", "%eax", "%ecx");
While this will work, it isn't optimal. Registers are a valuable and scarce resource on i386. This tiny fragment uses 5 of them. With a slight modification:
asm("rdtscp" // Read cycle counter
: "=d" (*hi), "=a" (*lo)
: /* No input */
: "%ecx");
Now we have 2 fewer assembly statements, and we only use 3 registers.
But even that isn't the best we can do. In the (presumably long) time since this code was written, gcc has added both support for 64bit integers and a function to read the tsc, so you don't need to use asm at all:
unsigned int a;
unsigned long long result;
result = __builtin_ia32_rdtscp(&a);
'a' is the (useless?) value that was being returned in ecx. The function call requires it, but we can just ignore the returned value.
So, instead of doing something like this (which I assume your existing code does):
unsigned cyc_hi, cyc_lo;
access_counter(&cyc_hi, &cyc_lo);
// do something
double elapsed_time = get_counter(); // Find the difference between cyc_hi, cyc_lo and the current time
We can do:
unsigned int a;
unsigned long long before, after;
before = __builtin_ia32_rdtscp(&a);
// do something
after = __builtin_ia32_rdtscp(&a);
unsigned long long elapsed_time = after - before;
This is shorter, doesn't use hard-to-understand assembler, is easier to read and maintain, and produces the best possible code.
But it does require a relatively recent version of gcc.
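For example, a complete self-contained version might look like this (a sketch; it assumes gcc on an x86 processor that supports rdtscp, and the loop being timed is only a placeholder):
#include <stdio.h>

int main() {
    unsigned int aux;                 /* receives IA32_TSC_AUX; ignored here */
    unsigned long long before, after;

    before = __builtin_ia32_rdtscp(&aux);

    /* ...something worth timing; this loop is just a placeholder... */
    volatile double x = 10000.0;
    for (int year = 0; year < 90; ++year)
        x *= 1.05;

    after = __builtin_ia32_rdtscp(&aux);
    printf("elapsed cycles: %llu\n", after - before);
    return 0;
}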
People keep telling me that instead of writing "shift 1 bit to the left" I should just write "multiply by 2", because it's a lot more readable and the compiler will be smart enough to do the optimization.
What else would compilers generally do that developers should not do (for code readability)? I always write string.length == 0 instead of string == "" because I read somewhere 5-6 years ago that numeric operations are much faster. Is this still true?
Or, would most compilers be smart enough to convert the following:
int result = 0;
for (int i = 0; i <= 100; i++)
{
    result += i;
}
into: int result = 5050;?
What is your favourite "optimization" that you do because most compilers won't?
Algorithms: no compiler on the planet so far can choose a better algorithm for you. Too many people hastily jump to the rewrite-in-C part after they benchmark, when they should have really considered replacing the algorithm they're using in the first place.
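To make that concrete, here is an illustrative sketch (not tied to any particular benchmark): a compiler can unroll or vectorize the quadratic version below, but it will never rewrite it into the hash-based one; picking the better algorithm is your job.
#include <cstddef>
#include <unordered_set>
#include <vector>

// O(n^2): fine for tiny inputs, hopeless for large ones.
bool hasDuplicateNaive(const std::vector<int>& v) {
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] == v[j]) return true;
    return false;
}

// O(n) expected: a different algorithm, not something a compiler will derive for you.
bool hasDuplicateHashed(const std::vector<int>& v) {
    std::unordered_set<int> seen;
    for (int x : v)
        if (!seen.insert(x).second) return true;
    return false;
}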
Does anyone know a good reference to help with understanding the relative cost of operations like copying variables, declaring new variables, file I/O, array operations, etc.? I've been told to study decompilation and machine code, but a quick reference would be nice. For example, something to tell me how much worse
for(int i = 0; i < 100; i++){
    double d = 7.65;
    calc(d);
}
is than
double d = 7.65;
for(int i = 0; i < 100; i++){
    calc(d);
}
Here is a nice paper by Felix von Leitner on the state of C compiler optimization. I learned of it on this Lambda the Ultimate page.
The performance of the operations you mention, such as file I/O, memory access, and computation, is highly dependent on a computer's architecture. Much of the optimization of software for today's desktop computers is focused on cache memory.
You would gain much from an architecture book or course. Here's a good example from CMU.
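As a rough illustration of the cache point (a sketch; exact numbers depend entirely on your hardware): both functions below add up the same matrix, but the row-major version walks memory sequentially and is typically several times faster than the column-major version once the matrix no longer fits in cache.
#include <cstddef>
#include <vector>

// Assumes a rectangular matrix (every row has the same length).
long long sumRowMajor(const std::vector<std::vector<int> >& m) {
    long long total = 0;
    for (std::size_t i = 0; i < m.size(); ++i)         // contiguous, cache-friendly
        for (std::size_t j = 0; j < m[i].size(); ++j)
            total += m[i][j];
    return total;
}

long long sumColumnMajor(const std::vector<std::vector<int> >& m) {
    if (m.empty()) return 0;
    long long total = 0;
    for (std::size_t j = 0; j < m[0].size(); ++j)      // strided, cache-unfriendly
        for (std::size_t i = 0; i < m.size(); ++i)
            total += m[i][j];
    return total;
}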
Martinus gives a good example of code where the compiler optimizes the code at run-time by calculating out the multiplication:
Martinus' code
int x = 0;
for (int i = 0; i < 100 * 1000 * 1000 * 1000; ++i) {
    x += x + x + x + x + x;
}
System.out.println(x);
His code after constant folding, a compiler optimization performed at compile time (thanks to Abelenky for pointing that out):
int x = 0;
for (int i = 0; i < 100000000000; ++i) {
    x += x + x + x + x + x;
}
System.out.println(x);
This optimization technique seems trivial, in my opinion.
I guess that it may be one of the techniques Sun started to keep trivial recently.
I am interested in two types of optimizations made by compilers:
optimizations which are omitted by today's compilers as trivial, such as in Java's compiler at run-time
optimizations which are used by the majority of today's compilers
Please put each optimization technique in a separate answer.
Which techniques did compilers use in the 90s (1), and which do they use today (2)?
Just buy the latest edition of the Dragon Book.
How about loop unrolling?:
for (i = 0; i < 100; i++)
    g ();
To:
for (i = 0; i < 100; i += 2)
{
    g ();
    g ();
}
From http://www.compileroptimizations.com/. They have many more - too many for an answer per technique.
Check out Trace Trees for a cool interpreter/just-in-time optimization.
The optimization shown in your example, of collapsing 100*1000*1000*1000 => 100000000000, is NOT a run-time optimization. It happens at compile time (and I wouldn't even call it an optimization).
I don't know of any optimizations that happen at run-time, unless you count VM engines that have JIT (just-in-time) compiling.
Optimizations that happen at compile-time are wide ranging, and frequently not simple to explain. But they can include in-lining small functions, re-arranging instructions for cache-locality, re-arranging instructions for better pipelining or hyperthreading, and many, many other techniques.
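For example (an illustrative sketch, not from the original question): with optimization enabled, a typical compiler will inline square() and fold foo() down to simply returning a constant, so no call and no multiplication is left at run time.
static int square(int x) { return x * x; }

int foo(void) {
    /* With something like -O2, this typically compiles to "return 169;":
       the call is inlined and the multiplication is folded at compile time. */
    return square(13);
}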
EDIT: Some F*ER edited my post... and then down-voted it. My original post clearly indicated that collapsing multiplication happens at COMPILE TIME, not RUN TIME, as the poster suggested. Then I mentioned I don't really consider collapsing constants to be much of an optimization. The pre-processor even does it.
Masi: if you want to answer the question, then answer the question. Do NOT edit other people's answers to put in words they never wrote.
Compiler books should provide a pretty good resource.
If this is obvious, please ignore it, but you're asking about low-level optimizations, the only ones that compilers can do. In non-toy programs, high-level optimizations are far more productive, but only the programmer can do them.