Rust optimizing out loops?

I was doing some very simple benchmarks to compare the performance of C and Rust. I used a function adding the integers 1 + 2 + ... + n (something I could verify by hand), where n = 10^10.
The code in Rust looks like this:
fn main() {
    let limit: u64 = 10000000000;
    let mut buf: u64 = 0;
    for u64::range(1, limit) |i| {
        buf = buf + i;
    }
    io::println(buf.to_str());
}
The C code is as follows:
#include <stdio.h>
int main()
{
    unsigned long long buf = 0;
    for (unsigned long long i = 0; i < 10000000000; ++i) {
        buf = buf + i;
    }
    printf("%llu\n", buf);
    return 0;
}
I compiled and ran them:
$ rustc sum.rs -o sum_rust
$ time ./sum_rust
13106511847580896768
real 6m43.122s
user 6m42.597s
sys 0m0.076s
$ gcc -Wall -std=c99 sum.c -o sum_c
$ time ./sum_c
13106511847580896768
real 1m3.296s
user 1m3.172s
sys 0m0.024s
Then I tried with optimization flags on, again for both C and Rust:
$ rustc sum.rs -o sum_rust -O
$ time ./sum_rust
13106511847580896768
real 0m0.018s
user 0m0.004s
sys 0m0.012s
$ gcc -Wall -std=c99 sum.c -o sum_c -O9
$ time ./sum_c
13106511847580896768
real 0m16.779s
user 0m16.725s
sys 0m0.008s
These results surprised me. I did expect the optimizations to have some effect, but the optimized Rust version is over 20,000 times faster than the unoptimized one :).
I tried changing n (the only limit was the u64 range; the run time stayed virtually zero), and even tried a different problem (1^5 + 2^5 + 3^5 + ... + n^5), with similar results: executables compiled with rustc -O are several orders of magnitude faster than without the flag, and also many times faster than the same algorithm compiled with gcc -O9.
So my question is: what's going on? :) I could understand a compiler optimizing 1 + 2 + ... + n = (n*n + n)/2, but I can't imagine that any compiler could derive a formula for 1^5 + 2^5 + 3^5 + ... + n^5. On the other hand, as far as I can see, the result must have been computed somehow (and it seems to be correct).
Oh, and:
$ gcc --version
gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3
$ rustc --version
rustc 0.6 (dba9337 2013-05-10 05:52:48 -0700)
host: i686-unknown-linux-gnu

Yes, compilers do use the 1 + ... + n = n*(n+1)/2 optimisation to remove the loop, and there are similar tricks for any power of the summation variable: the sums of k^1 are the triangular numbers, the sums of k^2 are the pyramidal numbers, the sums of k^3 are the squared triangular numbers, and so on. In general, Faulhaber's formula gives a closed-form polynomial for ∑ k^p for any positive integer p.
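For illustration, here is a minimal C sketch of the closed form the optimizer can substitute for the first benchmark's loop (the function name sum_closed_form is mine, not from the original post). Dividing the even factor before multiplying keeps the result correct modulo 2^64, so it reproduces the wrapped value printed above:

#include <stdio.h>

/* Sum 1 + 2 + ... + (limit-1), the same half-open range as the loops
 * above, via n*(n+1)/2. One of n and n+1 is even; dividing that factor
 * first keeps the product congruent to the true sum modulo 2^64,
 * matching the wrap-around behaviour of the u64 loop. */
unsigned long long sum_closed_form(unsigned long long limit)
{
    unsigned long long n = limit - 1;
    if (n % 2 == 0)
        return (n / 2) * (n + 1);
    return n * ((n + 1) / 2);
}

int main(void)
{
    printf("%llu\n", sum_closed_form(10000000000ULL)); /* 13106511847580896768 */
    return 0;
}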
You can use a more complicated expression so that the compiler doesn't have a trick to remove the loop, e.g.
fn main() {
    let limit: u64 = 1000000000;
    let mut buf: u64 = 0;
    for u64::range(1, limit) |i| {
        buf += i + i ^ (i*i);
    }
    io::println(buf.to_str());
}
and
#include <stdio.h>
int main()
{
    unsigned long long buf = 0;
    for (unsigned long long i = 0; i < 1000000000; ++i) {
        buf += i + i ^ (i * i);
    }
    printf("%llu\n", buf);
    return 0;
}
which gives me
real 0m0.700s
user 0m0.692s
sys 0m0.004s
and
real 0m0.698s
user 0m0.692s
sys 0m0.000s
respectively (with -O for both compilers).

Related

SMHasher setup?

The SMHasher test suite for hash functions is touted as the best of the lot. But the latest version I've got (from rurban) gives absolutely no clue on how to check your proposed hash function (it does include an impressive battery of hash functions, but some of interest --if only for historic value-- are missing). Add that I'm a complete CMake newbie.
It's actually quite simple. You just need to install CMake.
Building SMHasher
To build SMHasher on a Linux/Unix machine:
git clone https://github.com/rurban/smhasher
cd smhasher/
git submodule init
git submodule update
cmake .
make
Adding a new hash function
To add a new function, you can edit just three files: Hashes.cpp, Hashes.h and main.cpp.
For example, I will add the ElfHash:
unsigned long ElfHash(const unsigned char *s)
{
    unsigned long h = 0, high;
    while (*s)
    {
        h = (h << 4) + *s++;
        if (high = h & 0xF0000000)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
First, we need to modify it slightly to take a seed and a length:
uint32_t ElfHash(const void *key, int len, uint32_t seed)
{
    uint32_t h = seed, high;
    const uint8_t *data = (const uint8_t *)key;
    for (int i = 0; i < len; i++)
    {
        h = (h << 4) + *data++;
        if ((high = h & 0xF0000000))  /* assignment intended */
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
Add this function definition to Hashes.cpp. Also add the following to Hashes.h:
uint32_t ElfHash(const void *key, int len, uint32_t seed);
inline void ElfHash_test(const void *key, int len, uint32_t seed, void *out) {
    *(uint32_t *) out = ElfHash(key, len, seed);
}
In main.cpp, add the following line to the g_hashes array:
{ ElfHash_test, 32, 0x0, "ElfHash", "ElfHash 32-bit", POOR, {0x0} },
(The third value is the self-verification value. You will only learn it after running the test once.)
Finally, rebuild and run the test:
make
./SMHasher ElfHash
It will show you all the tests that this hash function fails. (It is very bad.)

What does the JVM interpreter (NOT the JIT compiler) actually do?

Please note that my question is about the JVM interpreter, not the JIT compiler. The JIT compiler converts Java bytecode to native machine code, which must mean that the interpreter within the JVM does not convert bytecode to machine code. Hence the question: what, in essence, does the interpreter do? If someone can help me answer this with a simple example, such as the bytecode equivalent of 1 + 1 = 2: what does the interpreter do to execute this add operation? (My implicit question is: if the interpreter does not translate to machine code that the CPU then executes, how is the ADD operation performed? What machine code is actually executed to support it?)
The expression 1+1 will compile to the following bytecode:
iconst_1
iconst_1
iadd
(Actually, it will just compile to iconst_2 because the Java compiler performs constant-folding, but let's ignore that for the purposes of this answer.)
So to find out exactly what the interpreter does for those instructions, we should look at its source code (bytecodeInterpreter.cpp in HotSpot). The relevant sections for iconst_1 and iadd start at line 983 and line 1221 respectively, so let's take a look:
#define OPC_CONST_n(opcode, const_type, value)                        \
    CASE(opcode):                                                     \
        SET_STACK_ ## const_type(value, 0);                           \
        UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);

OPC_CONST_n(_iconst_m1, INT, -1);
OPC_CONST_n(_iconst_0,  INT,  0);
OPC_CONST_n(_iconst_1,  INT,  1);
// goes on for several other constants
// ...

#define OPC_INT_BINARY(opcname, opname, test)                         \
    CASE(_i##opcname):                                                \
        if (test && (STACK_INT(-1) == 0)) {                           \
            VM_JAVA_ERROR(vmSymbols::java_lang_ArithmeticException(), \
                          "/ by zero", note_div0Check_trap);          \
        }                                                             \
        SET_STACK_INT(VMint##opname(STACK_INT(-2),                    \
                                    STACK_INT(-1)),                   \
                      -2);                                            \
        UPDATE_PC_AND_TOS_AND_CONTINUE(1, -1);

// and then the same thing for longs instead of ints
OPC_INT_BINARY(add, Add, 0);
// other operators
The whole thing is inside a switch-statement that examines the opcode of the current instruction.
If we expand the macro-magic, replace the surrounding code with an extremely simplified template and make some simplifying assumptions (such as the stack only consisting of ints), we end up with something like this:
enum OpCode {
    _iconst_1, _iadd
};
// ...
int* stack = new int[calculate_maximum_stack_size()];
size_t top_of_stack = 0;
size_t program_counter = 0;
while (program_counter < program_size) {
    switch (opcodes[program_counter]) {
        case _iconst_1:
            // SET_STACK_INT(1, 0);
            stack[top_of_stack] = 1;
            // UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);
            program_counter += 1;
            top_of_stack += 1;
            break;
        case _iadd:
            // SET_STACK_INT(VMintAdd(STACK_INT(-2), STACK_INT(-1)), -2);
            stack[top_of_stack - 2] = stack[top_of_stack - 1] + stack[top_of_stack - 2];
            // UPDATE_PC_AND_TOS_AND_CONTINUE(1, -1);
            program_counter += 1;
            top_of_stack += -1;
            break;
    }
}
So for 1+1 the sequence of operations would be:
stack[0] = 1;
stack[1] = 1;
stack[0] = stack[1] + stack[0];
And top_of_stack would be 1, so we'd end with a stack that contains the value 2 as its only element.
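To make the dispatch loop concrete, here is a self-contained C sketch of the same idea (my own illustration, not HotSpot code) that interprets the three-instruction program above and prints the result:

#include <stdio.h>

/* A toy stack-based interpreter for the two opcodes discussed above,
 * plus a HALT opcode so the program can terminate. */
enum opcode { ICONST_1, IADD, HALT };

int main(void)
{
    enum opcode program[] = { ICONST_1, ICONST_1, IADD, HALT };
    int stack[16];
    size_t top_of_stack = 0;
    size_t pc = 0;

    for (;;) {
        switch (program[pc]) {
        case ICONST_1:                  /* push the constant 1 */
            stack[top_of_stack++] = 1;
            pc += 1;
            break;
        case IADD:                      /* replace the two top ints by their sum */
            stack[top_of_stack - 2] = stack[top_of_stack - 2] + stack[top_of_stack - 1];
            top_of_stack -= 1;
            pc += 1;
            break;
        case HALT:
            printf("%d\n", stack[top_of_stack - 1]); /* prints 2 */
            return 0;
        }
    }
}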

Understanding RiscV objdump

I am examining the objdump of a C file that I have compiled using the following commands:
riscv64-unknown-elf-gcc -O0 -o maxmul.o maxmul.c
riscv64-unknown-elf-objdump -d maxmul.o > maxmul.dump
Strangely (or not), the addresses appear not to be aligned on 32-bit words but on 16-bit boundaries.
Can anyone explain why?
Thanks.
objdump excerpt:
00000000000101da <main>:
101da: 7155 addi sp,sp,-208
101dc: e586 sd ra,200(sp)
101de: e1a2 sd s0,192(sp)
101e0: 0980 addi s0,sp,208
...
C-code:
int main()
{
    int first[3][3], second[3][3], multiply[3][3];
    int golden[3][3];
    int sum = 0;

    first[0][0] = 1; first[0][1] = 2; first[0][2] = 3;
    first[1][0] = 4; first[1][1] = 5; first[1][2] = 6;
    first[2][0] = 7; first[2][1] = 8; first[2][2] = 9;

    second[0][0] = 9; second[0][1] = 8; second[0][2] = -7;
    second[1][0] = -6; second[1][1] = 5; second[1][2] = 4;
    second[2][0] = 3; second[2][1] = 2; second[2][2] = -1;

    golden[0][0] = 6; golden[0][1] = 24; golden[0][2] = -2;
    golden[1][0] = 24; golden[1][1] = 69; golden[1][2] = -14;
    golden[2][0] = 42; golden[2][1] = 1140; golden[2][2] = -26;

    int i, ii, iii;
    for (i = 0; i < 3; i++) {
        for (ii = 0; ii < 3; ii++) {
            for (iii = 0; iii < 3; iii++) {
                //printf("first[%d][%d] * second[%d][%d] \n", i, iii, iii, ii);
                //printf("%d * %d (%d,%d)\n", first[i][ii], second[ii][i], i, ii);
                sum += first[i][iii] * second[iii][ii];
            }
            //printf("sum = %d\n", sum);
            multiply[i][ii] = sum;
            sum = 0;
        }
    }

    int c, d;
    int err = 0;
    for (c = 0; c < 3; c++) {
        for (d = 0; d < 3; d++) {
            //printf("%d\t", multiply[c][d]);
            if (multiply[c][d] != golden[c][d]) {
                fail(golden[c][d], multiply[c][d]);
                err++;
            }
        }
        //printf("\n");
    }
    if (err == 0) {
        pass();
    }
    return 0;
}
I suspect that your gcc compiles with the compressed instruction format by default, in which 16-bit and 32-bit instructions are intermixed; in that case 16-bit instructions only need to be 16-bit aligned, as you can see in the disassembled code.
Objdump shows the address, the encoding, and the mnemonic; the encoding in your case is always 16 bits wide, which means the compiler has selected compressed instructions whenever possible.
By enabling verbose mode (-v), you can see that the defaults are -march=rv64imafdc and -mabi=lp64d, i.e. the targeted ISA includes the compressed extension, and the targeted ABI requires the double-float extension.
By setting -march=rv64imafd and leaving the ABI unchanged, gcc compiles using only 32-bit instructions, because the compressed ISA is no longer enabled.
The instruction addresses are then always 32-bit aligned.
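For example, reusing the compile command from the question with an explicit -march/-mabi pair (the flag values here are my addition):
riscv64-unknown-elf-gcc -O0 -march=rv64imafd -mabi=lp64d -o maxmul.o maxmul.c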
When compiling (or assembling) to RV64GC or RV32GC (or another target that enables the "C" Standard Extension Compressed Instructions), the compiler (or assembler) automatically replaces some instructions with compressed ones.
Non-compressed instructions are encoded in 32 bit, while compressed instructions are encoded in 16 bit.
When a compressed instruction is emitted, it changes the alignment for the next instruction, either from 32 bit to 16 bit or from 16 bit to 32 bit. That means that not only 16-bit-wide instructions may be aligned to a 16-bit address, but also 32-bit-wide ones. In other words, both types of instructions (compressed and normal) are tightly packed side by side.
By default, objdump -d doesn't explicitly indicate that an instruction is compressed, because it uses the same mnemonic as for the uncompressed variant, although the number of bytes in the displayed raw encoding gives it away (4 vs. 2 bytes).
However, you can tell objdump to use separate mnemonics for compressed instructions such that they are more easily recognizable (those start with c. then), e.g.:
$ riscv64-unknown-elf-objdump -d -M no-aliases rotate
[..]
101e4: 00d66533 or a0,a2,a3
101e8: 8082 c.jr ra
00000000000101ea <rotr>:
101ea: 00b55633 srl a2,a0,a1
[..]
Note that with the switch -M no-aliases, pseudo-instructions aren't displayed anymore; the corresponding base instruction(s) are shown instead.

Passing an inlined CArray in a CStruct to a shared library using NativeCall

This is a follow-up question to "How to declare native array of fixed size in Perl 6?".
In that question it was discussed how to incorporate an array of a fixed size into a CStruct. In this answer it was suggested to use HAS to inline a CArray in the CStruct. When I tested this idea, I ran into some strange behavior that could not be resolved in the comments section below the question, so I decided to write it up as a new question. Here is my C test library code:
slib.c:
#include <stdio.h>

struct myStruct
{
    int A;
    int B[3];
    int C;
};

void use_struct(struct myStruct *s) {
    printf("sizeof(struct myStruct): %zu\n", sizeof(struct myStruct));
    printf("sizeof(struct myStruct *): %zu\n", sizeof(struct myStruct *));
    printf("A = %d\n", s->A);
    printf("B[0] = %d\n", s->B[0]);
    printf("B[1] = %d\n", s->B[1]);
    printf("B[2] = %d\n", s->B[2]);
    printf("C = %d\n", s->C);
}
To generate a shared library from this, I used:
gcc -c -fpic slib.c
gcc -shared -o libslib.so slib.o
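As a sanity check of the C side alone (this small driver and its file name test_struct.c are my addition, not part of the original question), you can call use_struct from a C main and confirm the expected layout before involving NativeCall:

#include <stdio.h>

struct myStruct
{
    int A;
    int B[3];
    int C;
};

/* provided by libslib.so, built above */
void use_struct(struct myStruct *s);

int main(void)
{
    struct myStruct s = { 1, { 2, 3, 4 }, 5 };
    use_struct(&s); /* should print A = 1, B[0..2] = 2, 3, 4, C = 5 */
    return 0;
}

Built and run with something like:
gcc -o test_struct test_struct.c -L. -lslib
LD_LIBRARY_PATH=. ./test_struct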
Then, the Perl 6 code:
p.p6:
use v6;
use NativeCall;
class myStruct is repr('CStruct') {
    has int32 $.A is rw;
    HAS int32 @.B[3] is CArray is rw;
    has int32 $.C is rw;
}
sub use_struct(myStruct $s) is native("./libslib.so") { * };
my $s = myStruct.new();
$s.A = 1;
$s.B[0] = 2;
$s.B[1] = 3;
$s.B[2] = 4;
$s.C = 5;
say "Expected size of Perl 6 struct: ", (nativesizeof(int32) * 5);
say "Actual size of Perl 6 struct: ", nativesizeof( $s );
say 'Number of elements of $s.B: ', $s.B.elems;
say "B[0] = ", $s.B[0];
say "B[1] = ", $s.B[1];
say "B[2] = ", $s.B[2];
say "Calling library function..";
say "--------------------------";
use_struct( $s );
The output from the script is:
Expected size of Perl 6 struct: 20
Actual size of Perl 6 struct: 24
Number of elements of $s.B: 3
B[0] = 2
B[1] = 3
B[2] = 4
Calling library function..
--------------------------
sizeof(struct myStruct): 20
sizeof(struct myStruct *): 8
A = 1
B[0] = 0 # <-- Expected 2
B[1] = 653252032 # <-- Expected 3
B[2] = 22030 # <-- Expected 4
C = 5
Questions:
Why does nativesizeof( $s ) give 24 (and not the expected value of 20)?
Why is the content of the array B in the structure not as expected when printed from the C function?
Note:
I am using Ubuntu 18.04 and Perl 6 Rakudo version 2018.04.01, but I have also tested with version 2018.05.
Your code is correct. I just fixed that bug in MoarVM, and added tests to rakudo, similar to your code:
In C:
typedef struct {
    int a;
    int b[3];
    int c;
} InlinedArrayInStruct;
In Perl 6:
class InlinedArrayInStruct is repr('CStruct') {
    has int32 $.a is rw;
    HAS int32 @.b[3] is CArray;
    has int32 $.c is rw;
}
See these patches:
https://github.com/MoarVM/MoarVM/commit/ac3d3c76954fa3c1b1db14ea999bf3248c2eda1c
https://github.com/rakudo/rakudo/commit/f8b79306cc1900b7991490eef822480f304a56d9
If you are not building Rakudo (and also NQP and MoarVM) directly from the latest source on GitHub, you probably have to wait for the 2018.08 release that will appear here: https://rakudo.org/files

Out of memory error. Allocating...

I'm trying to use a gprof command: gprof -s executable.exe gmon.out gmon.sum to merge profiling data gathered from two runs of my program, but the following error appears:
gprof: out of memory allocating 3403207348 bytes after a total of 196608 bytes
My program is quite simple (just one for loop). If I run it once, the run time is too short (it shows 0.00s) for gprof to record anything.
In Cygwin, I do the following steps:
gcc -pg -o fl forAndWhilLoop.c
fl (run the program)
mv gmon.out gmon.sum
fl (run the program)
gprof -s fl.exe gmon.out gmon.sum
gprof fl.exe gmon.sum>gmon.out
gprof fl.exe
My program:
int main(void)
{
    int fac = 1;
    int count = 10;
    int k;
    for (k = 1; k <= count; k++)
    {
        fac = fac * k;
    }
    return 0;
}
So can anyone help me with this problem? Thanks!
If all you want is to time it, on my machine it's 105ns. Here's the code:
void forloop(void) {
    int fac = 1;
    int count = 10;
    int k;
    for (k = 1; k <= count; k++)
    {
        fac = fac * k;
    }
}

int main(int argc, char* argv[])
{
    int i;
    for (i = 0; i < 1000000000; i++) {
        forloop();
    }
    return 0;
}
Get the idea? I used a hand-held stopwatch. Since it runs 10^9 times, total seconds equal nanoseconds per call.
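If you would rather measure in code than with a stopwatch, here is a minimal sketch using POSIX clock_gettime (this harness is my addition, not part of the original answer; compile without heavy optimization, or the unused work may be removed entirely):

#include <stdio.h>
#include <time.h>

/* the same inner loop that is being timed */
void forloop(void) {
    int fac = 1;
    int count = 10;
    for (int k = 1; k <= count; k++) {
        fac = fac * k;
    }
}

int main(void)
{
    const long iterations = 1000000000L;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++) {
        forloop();
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    /* with 10^9 iterations, elapsed seconds = nanoseconds per call */
    printf("%.1f ns per call\n", elapsed);
    return 0;
}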
Unrolling the inner loop like this reduced the time to 92ns:
int k = 1;
while (k + 5 <= count) {
    fac *= k * (k+1) * (k+2) * (k+3) * (k+4);
    k += 5;
}
while (k <= count) {
    fac *= k++;
}
Switching to Release build from Debug brought it down to 21ns. You can only expect that kind of speedup in an actual hotspot, which this is.
It seems that pprof should be executed instead of gprof.