Understanding RISC-V objdump

I am examining the objdump of a C file that I have compiled using the following commands:
riscv64-unknown-elf-gcc -O0 -o maxmul.o maxmul.c
riscv64-unknown-elf-objdump -d maxmul.o > maxmul.dump
Strangely (or not), the addresses appear to be aligned not on 32-bit words but on 16-bit boundaries.
Can anyone please explain to me why?
Thanks.
objdump excerpt:
00000000000101da <main>:
101da: 7155 addi sp,sp,-208
101dc: e586 sd ra,200(sp)
101de: e1a2 sd s0,192(sp)
101e0: 0980 addi s0,sp,208
...
C-code:
int main()
{
    int first[3][3], second[3][3], multiply[3][3];
    int golden[3][3];
    int sum;
    first[0][0] = 1; first[0][1] = 2; first[0][2] = 3;
    first[1][0] = 4; first[1][1] = 5; first[1][2] = 6;
    first[2][0] = 7; first[2][1] = 8; first[2][2] = 9;
    second[0][0] = 9; second[0][1] = 8; second[0][2] = -7;
    second[1][0] = -6; second[1][1] = 5; second[1][2] = 4;
    second[2][0] = 3; second[2][1] = 2; second[2][2] = -1;
    golden[0][0] = 6; golden[0][1] = 24; golden[0][2] = -2;
    golden[1][0] = 24; golden[1][1] = 69; golden[1][2] = -14;
    golden[2][0] = 42; golden[2][1] = 1140; golden[2][2] = -26;
    int i, ii, iii;
    for (i = 0; i < 3; i++) {
        for (ii = 0; ii < 3; ii++) {
            for (iii = 0; iii < 3; iii++) {
                //printf("first[%d][%d] * second[%d][%d] \n", i, iii, iii, ii);
                //printf("%d * %d (%d,%d)\n", first[i][ii], second[ii][i], i, ii);
                sum += first[i][iii] * second[iii][ii];
            }
            //printf("sum = %d\n", sum);
            multiply[i][ii] = sum;
            sum = 0;
        }
    }
    int c, d;
    int err;
    for (c = 0; c < 3; c++) {
        for (d = 0; d < 3; d++) {
            //printf("%d\t", multiply[c][d]);
            if (multiply[c][d] != golden[c][d]) {
                fail(golden[c][d], multiply[c][d]);
                err++;
            }
        }
        //printf("\n");
    }
    if (err == 0) {
        pass();
    }
    return 0;
}

I suspect that your gcc compiles by default for the compressed instruction format, where 16-bit and 32-bit instructions can be intermixed. In that case, 16-bit instructions are 16-bit aligned, as you can see in the disassembled code.
Objdump shows the address, the encoding, and the mnemonics; the encoding in your case is always 16 bit, which means that the compiler has selected 16-bit instructions whenever possible.
By enabling verbose mode (-v), you can see that, by default, -march=rv64imafdc and -mabi=lp64d. So the default targeted ISA includes the compressed extension, and the targeted ABI (lp64d) requires the double-precision floating-point extension.
By setting -march=rv64imafd and leaving the ABI unchanged, gcc compiles using only 32-bit instructions, because the compressed ISA is no longer enabled.
The instruction addresses are then always 32-bit aligned.
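For example (a sketch assuming the same file names as in the question; whether the link step succeeds depends on which multilibs your toolchain was built with):
riscv64-unknown-elf-gcc -O0 -march=rv64imafd -mabi=lp64d -o maxmul.o maxmul.c
riscv64-unknown-elf-objdump -d maxmul.o > maxmul.dump
Every instruction in the new dump should then be 4 bytes wide and sit at an address that is a multiple of 4.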

When compiling (or assembling) to RV64GC or RV32GC (or another target that enables the "C" Standard Extension Compressed Instructions), the compiler (or assembler) automatically replaces some instructions with compressed ones.
Non-compressed instructions are encoded in 32 bits, while compressed instructions are encoded in 16 bits.
When a compressed instruction is emitted, it changes the alignment for the next instruction, either from 32 bit to 16 bit or from 16 bit to 32 bit. That means that not only 16-bit-wide instructions but also 32-bit-wide ones may end up aligned to a 16-bit address. In other words, both types of instructions (compressed and normal) are tightly packed side by side.
By default, objdump -d doesn't explicitly indicate that an instruction is compressed, because it uses the same mnemonic as for the uncompressed variant; the number of bytes in the displayed raw instruction gives it away, though (4 vs. 2 bytes).
However, you can tell objdump to use separate mnemonics for compressed instructions so that they are more easily recognizable (they then start with c.), e.g.:
$ riscv64-unknown-elf-objdump -d -M no-aliases rotate
[..]
101e4: 00d66533 or a0,a2,a3
101e8: 8082 c.jr ra
00000000000101ea <rotr>:
101ea: 00b55633 srl a2,a0,a1
[..]
Note that with the switch -M no-aliases pseudo-instructions aren't displayed anymore, but the corresponding instruction(s) instead.

Related

SMHasher setup?

The SMHasher test suite for hash functions is touted as the best of the lot. But the latest version I've got (from rurban) gives absolutely no clue on how to check your proposed hash function (it does include an impressive battery of hash functions, but some of interest --if only for historic value-- are missing). Add that I'm a complete CMake newbie.
It's actually quite simple. You just need to install CMake.
Building SMHasher
To build SMHasher on a Linux/Unix machine:
git clone https://github.com/rurban/smhasher
cd smhasher/
git submodule init
git submodule update
cmake .
make
Adding a new hash function
To add a new function, you can edit just three files: Hashes.cpp, Hashes.h and main.cpp.
For example, I will add the ElfHash:
unsigned long ElfHash(const unsigned char *s)
{
    unsigned long h = 0, high;
    while (*s)
    {
        h = (h << 4) + *s++;
        if (high = h & 0xF0000000)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
First, we need to modify it slightly to take a seed and a length:
uint32_t ElfHash(const void *key, int len, uint32_t seed)
{
    unsigned long h = seed, high;
    const uint8_t *data = (const uint8_t *)key;
    for (int i = 0; i < len; i++)
    {
        h = (h << 4) + *data++;
        if (high = h & 0xF0000000)
            h ^= high >> 24;
        h &= ~high;
    }
    return h;
}
Add this function definition to Hashes.cpp. Also add the following to Hashes.h:
uint32_t ElfHash(const void *key, int len, uint32_t seed);
inline void ElfHash_test(const void *key, int len, uint32_t seed, void *out) {
*(uint32_t *) out = ElfHash(key, len, seed);
}
In file main.cpp add the following line into array g_hashes:
{ ElfHash_test, 32, 0x0, "ElfHash", "ElfHash 32-bit", POOR, {0x0} },
(The third value is the self-verification value; you will learn it only after running the test once.)
Finally, rebuild and run the test:
make
./SMHasher ElfHash
It will show you all the tests that this hash function fails. (It is very bad.)

What does JVM interpreter (NOT the JIT compiler) actually do?

Please note that my question is around JVM interpreter, not JIT compiler. JIT compiler converts java bytecodes to native machine code. As such, this MUST mean that the interpreter within the JVM DOES NOT convert bytecodes to machine code. Hence the question: in essence what does the interpreter do? If someone can help me answer this with a simple example of bytecodes equivalent of 1+1 = 2, i.e. what does the interpreter do with respect to executing this add operation? (My implicit question is, if interpreter does not translate to machine code which CPU then executes the ADD operation, how then is this operation performed? what machine code is ACTUALLY executed to support this ADD operation?)
The expression 1+1 will compile to the following bytecode:
iconst_1
iconst_1
iadd
(Actually, it will just compile to iconst_2 because the Java compiler performs constant-folding, but let's ignore that for the purposes of this answer.)
So to find out exactly what the interpreter does for those instructions, we should look at its source code. The relevant sections for iconst_1 and iadd start at line 983 and line 1221 respectively, so let's take a look:
#define OPC_CONST_n(opcode, const_type, value) \
CASE(opcode): \
SET_STACK_ ## const_type(value, 0); \
UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);
OPC_CONST_n(_iconst_m1, INT, -1);
OPC_CONST_n(_iconst_0, INT, 0);
OPC_CONST_n(_iconst_1, INT, 1);
// goes on for several other constants
//...
#define OPC_INT_BINARY(opcname, opname, test) \
CASE(_i##opcname): \
if (test && (STACK_INT(-1) == 0)) { \
VM_JAVA_ERROR(vmSymbols::java_lang_ArithmeticException(), \
"/ by zero", note_div0Check_trap); \
} \
SET_STACK_INT(VMint##opname(STACK_INT(-2), \
STACK_INT(-1)), \
-2); \
UPDATE_PC_AND_TOS_AND_CONTINUE(1, -1); \
// and then the same thing for longs instead of ints
OPC_INT_BINARY(add, Add, 0);
// other operators
The whole thing is inside a switch-statement that examines the opcode of the current instruction.
If we expand the macro-magic, replace the surrounding code with an extremely simplified template and make some simplifying assumptions (such as the stack only consisting of ints), we end up with something like this:
enum OpCode {
    _iconst_1, _iadd
};
// ...
int* stack = new int[calculate_maximum_stack_size()];
size_t top_of_stack = 0;
size_t program_counter = 0;
while (program_counter < program_size) {
    switch (opcodes[program_counter]) {
    case _iconst_1:
        // SET_STACK_INT(1, 0);
        stack[top_of_stack] = 1;
        // UPDATE_PC_AND_TOS_AND_CONTINUE(1, 1);
        program_counter += 1;
        top_of_stack += 1;
        break;
    case _iadd:
        // SET_STACK_INT(VMintAdd(STACK_INT(-2), STACK_INT(-1)), -2);
        stack[top_of_stack - 2] = stack[top_of_stack - 1] + stack[top_of_stack - 2];
        // UPDATE_PC_AND_TOS_AND_CONTINUE(1, -1);
        program_counter += 1;
        top_of_stack += -1;
        break;
    }
}
So for 1+1 the sequence of operations would be:
stack[0] = 1;
stack[1] = 1;
stack[0] = stack[1] + stack[0];
And top_of_stack would be 1, so we'd end with a stack that contains the value 2 as its only element.
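To make the idea concrete, here is a self-contained, runnable C version of the same dispatch loop (my own illustration, not HotSpot code); it executes the three-instruction program for 1+1 and prints 2:
#include <stdio.h>

enum OpCode { OP_ICONST_1, OP_IADD };

int main(void)
{
    /* The "bytecode" for the expression 1 + 1. */
    enum OpCode program[] = { OP_ICONST_1, OP_ICONST_1, OP_IADD };
    int program_size = 3;

    int stack[16];              /* operand stack */
    int top_of_stack = 0;       /* index of the next free slot */
    int program_counter = 0;

    while (program_counter < program_size) {
        switch (program[program_counter]) {
        case OP_ICONST_1:
            stack[top_of_stack] = 1;   /* push the constant 1 */
            top_of_stack += 1;
            program_counter += 1;
            break;
        case OP_IADD:
            /* pop two ints, push their sum */
            stack[top_of_stack - 2] = stack[top_of_stack - 2] + stack[top_of_stack - 1];
            top_of_stack -= 1;
            program_counter += 1;
            break;
        }
    }

    printf("result: %d\n", stack[top_of_stack - 1]);   /* prints 2 */
    return 0;
}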

Debug data/neon performance hazards in arm neon code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON, and a minor part of it was taking 80% of the time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performs better:
typedef unsigned(*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);
CalcMaxFunc CalcMaxFuncs[] =
{
    CalcMaxFunc_NEON_0,
    CalcMaxFunc_NEON_1,
    CalcMaxFunc_NEON_2,
    CalcMaxFunc_NEON_3,
    CalcMaxFunc_C_0
};
int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
    auto f = CalcMaxFuncs[i % N];
    unsigned retI = f(a, b);
    // just random code to ensure that cpu waits for the results
    // and compiler doesn't optimize it away
    if (retI > 1000000)
        break;
    ret |= retI;
}
I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I moved CalcMaxFunc_NEON_3 to be first, it would be 3-4 times slower, and according to the profiler it would stall at the last bit of the function where it tried to move data from a NEON to an ARM register.
So, what makes it stall sometimes and not other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between calls to these functions in the loop, I eliminated the unreliable behavior; now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in actual code?
Update:
I tried to create a simple test function and then optimize it in stages and see how I could possibly avoid neon->arm stalls.
Here's the test runner function:
void NeonStallTest()
{
    int findMinErr(uint8_t* var1, uint8_t* var2, int size);
    srand(0);
    uint8_t var1[1280];
    uint8_t var2[1280];
    for (int i = 0; i < sizeof(var1); ++i)
    {
        var1[i] = rand();
        var2[i] = rand();
    }
#if 0 // early exit?
    for (int i = 0; i < 16; ++i)
        var1[i] = var2[i];
#endif
    int ret = 0;
    for (int i = 0; i < 10000000; ++i)
        ret += findMinErr(var1, var2, sizeof(var1));
    exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err = 0;
        for (int j = 0; j < 16; ++j)
        {
            int x = var1[j] - var2[j];
            err += x * x;
        }
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
        }
    }
    return ret;
}
Basically, it computes the sum of squared differences between each uint8_t[16] block and returns the index of the block pair that has the lowest squared difference. Then I rewrote it in NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err;
        uint8x8_t var1_0 = vld1_u8(var1 + 0);
        uint8x8_t var1_1 = vld1_u8(var1 + 8);
        uint8x8_t var2_0 = vld1_u8(var2 + 0);
        uint8x8_t var2_1 = vld1_u8(var2 + 8);
        int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
        int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
        uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
        uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1
        err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
        uint32x4_t err0 = vpaddlq_u16(u0);
        uint32x4_t err1 = vpaddlq_u16(u1);
        err0 = vaddq_u32(err0, err1);
        uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
        err00 = vpadd_u32(err00, err00);
        err = vget_lane_u32(err00, 0);
#endif
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
#if 0 // enable early exit?
            if (ret_err == 0)
                break;
#endif
        }
    }
    return ret;
}
Now, if (ret_err > err) is clearly a data hazard. So I manually "unrolled" the loop by two and modified the code to use err0 and err1 and check them only after performing the next round of computation. According to the profiler I got some improvements. In the simple NEON loop, roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err). After I unrolled by two, these operations started to take 25% (i.e. roughly a 10% overall speedup). Also, checking the armv7 version, there are only 8 instructions between where err0 is set (vmov.32 r6, d16[0]) and where it is accessed (cmp r12, r6).
Note that in the code the early exit is ifdefed out. Enabling it would make the function even slower. If I unrolled by four, changed the code to use four errN variables and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. When I checked the generated asm, it appeared that the compiler destroys all the optimization attempts because it reuses some of the errN registers, which effectively makes the CPU access the results of vget_lane_u32 much earlier than I want (I aim to delay access by 10-20 instructions). Only when I unrolled by 4 and marked all four errN as volatile did vget_lane_u32 totally disappear from the radar in the profiler; however, the if (ret_err > errN) checks obviously got slow as hell, as these values probably ended up as regular stack variables, and overall these 4 checks in the 4x manual unroll started to take 40%. It looks like with proper manual asm it would be possible to make it work: have an early loop exit while avoiding NEON->ARM stalls and keeping some ARM logic in the loop. However, the extra maintenance required to deal with ARM asm makes that kind of code 10x more complex to maintain in a large project (that doesn't have any ARM asm).
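For illustration, here is a minimal sketch of that unroll-by-two shape (my reconstruction under stated assumptions, not the original code): it assumes size is a multiple of 32, and block_sse is simply the armv7 reduction path above factored into a helper. The point is that both NEON-to-ARM lane moves are issued before either scalar compare, giving each move a little more time to complete before its result is consumed:
#include <arm_neon.h>
#include <limits.h>
#include <stdint.h>

// Sum of squared differences of one 16-byte block pair (same reduction as the
// armv7 path in findMinErr_NEON above, factored out for this sketch).
static inline uint32_t block_sse(const uint8_t* a, const uint8_t* b)
{
    uint8x8_t a0 = vld1_u8(a), a1 = vld1_u8(a + 8);
    uint8x8_t b0 = vld1_u8(b), b1 = vld1_u8(b + 8);
    int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(a0, b0));
    int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(a1, b1));
    uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
    uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
    uint32x4_t e = vaddq_u32(vpaddlq_u16(u0), vpaddlq_u16(u1));
    uint32x2_t e2 = vpadd_u32(vget_low_u32(e), vget_high_u32(e));
    e2 = vpadd_u32(e2, e2);
    return vget_lane_u32(e2, 0);
}

// Unroll by two: both lane extractions are issued before either comparison.
int findMinErr_NEON_unroll2(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; i += 2, var1 += 32, var2 += 32)
    {
        uint32_t err0 = block_sse(var1, var2);
        uint32_t err1 = block_sse(var1 + 16, var2 + 16);
        if (ret_err > (int)err0) { ret_err = (int)err0; ret = i; }
        if (ret_err > (int)err1) { ret_err = (int)err1; ret = i + 1; }
    }
    return ret;
}
Whether the compiler actually keeps the two moves ahead of the compares still depends on its scheduling, which is exactly the register-reuse problem described above.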
Update:
Here's a sample stall when moving data from a NEON to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried to add lots of nops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried to use vorr d0,d0,d0 as the nop: no difference. What's the reason for the stall, or is the profiler simply showing wrong results?

Progress 10.1C 4GL Encode Function

Does anyone know which algorithm Progress 10.1C uses in the Encode Function?
I found this: http://knowledgebase.progress.com/articles/Article/21685
The Progress 4GL ENCODE function uses a CRC-16 algorithm to generate its encoded output.
Progress 4GL:
ENCODE("Test").
gives as output "LkwidblanjsipkJC"
But for example on http://www.nitrxgen.net/hashgen/ with the password "Test", I never get the same result as from Progress.
Any Ideas? :)
I've made the algorithm available on https://github.com/pvginkel/ProgressEncode.
I needed this function in Java. So I ported Pieter's C# code (https://github.com/pvginkel/ProgressEncode) to Java. At least all test cases passed. Enjoy! :)
public class ProgressEncode {
static int[] table = { 0x0000, 0xC0C1, 0xC181, 0x0140, 0xC301, 0x03C0,
0x0280, 0xC241, 0xC601, 0x06C0, 0x0780, 0xC741, 0x0500, 0xC5C1,
0xC481, 0x0440, 0xCC01, 0x0CC0, 0x0D80, 0xCD41, 0x0F00, 0xCFC1,
0xCE81, 0x0E40, 0x0A00, 0xCAC1, 0xCB81, 0x0B40, 0xC901, 0x09C0,
0x0880, 0xC841, 0xD801, 0x18C0, 0x1980, 0xD941, 0x1B00, 0xDBC1,
0xDA81, 0x1A40, 0x1E00, 0xDEC1, 0xDF81, 0x1F40, 0xDD01, 0x1DC0,
0x1C80, 0xDC41, 0x1400, 0xD4C1, 0xD581, 0x1540, 0xD701, 0x17C0,
0x1680, 0xD641, 0xD201, 0x12C0, 0x1380, 0xD341, 0x1100, 0xD1C1,
0xD081, 0x1040, 0xF001, 0x30C0, 0x3180, 0xF141, 0x3300, 0xF3C1,
0xF281, 0x3240, 0x3600, 0xF6C1, 0xF781, 0x3740, 0xF501, 0x35C0,
0x3480, 0xF441, 0x3C00, 0xFCC1, 0xFD81, 0x3D40, 0xFF01, 0x3FC0,
0x3E80, 0xFE41, 0xFA01, 0x3AC0, 0x3B80, 0xFB41, 0x3900, 0xF9C1,
0xF881, 0x3840, 0x2800, 0xE8C1, 0xE981, 0x2940, 0xEB01, 0x2BC0,
0x2A80, 0xEA41, 0xEE01, 0x2EC0, 0x2F80, 0xEF41, 0x2D00, 0xEDC1,
0xEC81, 0x2C40, 0xE401, 0x24C0, 0x2580, 0xE541, 0x2700, 0xE7C1,
0xE681, 0x2640, 0x2200, 0xE2C1, 0xE381, 0x2340, 0xE101, 0x21C0,
0x2080, 0xE041, 0xA001, 0x60C0, 0x6180, 0xA141, 0x6300, 0xA3C1,
0xA281, 0x6240, 0x6600, 0xA6C1, 0xA781, 0x6740, 0xA501, 0x65C0,
0x6480, 0xA441, 0x6C00, 0xACC1, 0xAD81, 0x6D40, 0xAF01, 0x6FC0,
0x6E80, 0xAE41, 0xAA01, 0x6AC0, 0x6B80, 0xAB41, 0x6900, 0xA9C1,
0xA881, 0x6840, 0x7800, 0xB8C1, 0xB981, 0x7940, 0xBB01, 0x7BC0,
0x7A80, 0xBA41, 0xBE01, 0x7EC0, 0x7F80, 0xBF41, 0x7D00, 0xBDC1,
0xBC81, 0x7C40, 0xB401, 0x74C0, 0x7580, 0xB541, 0x7700, 0xB7C1,
0xB681, 0x7640, 0x7200, 0xB2C1, 0xB381, 0x7340, 0xB101, 0x71C0,
0x7080, 0xB041, 0x5000, 0x90C1, 0x9181, 0x5140, 0x9301, 0x53C0,
0x5280, 0x9241, 0x9601, 0x56C0, 0x5780, 0x9741, 0x5500, 0x95C1,
0x9481, 0x5440, 0x9C01, 0x5CC0, 0x5D80, 0x9D41, 0x5F00, 0x9FC1,
0x9E81, 0x5E40, 0x5A00, 0x9AC1, 0x9B81, 0x5B40, 0x9901, 0x59C0,
0x5880, 0x9841, 0x8801, 0x48C0, 0x4980, 0x8941, 0x4B00, 0x8BC1,
0x8A81, 0x4A40, 0x4E00, 0x8EC1, 0x8F81, 0x4F40, 0x8D01, 0x4DC0,
0x4C80, 0x8C41, 0x4400, 0x84C1, 0x8581, 0x4540, 0x8701, 0x47C0,
0x4680, 0x8641, 0x8201, 0x42C0, 0x4380, 0x8341, 0x4100, 0x81C1,
0x8081, 0x4040 };
public static byte[] Encode(byte[] input) {
    if (input == null)
        return null;
    byte[] scratch = new byte[16];
    int hash = 17;
    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < input.length; j++)
            scratch[15 - (j % 16)] ^= input[j];
        for (int j = 0; j < 16; j += 2) {
            hash = Hash(scratch, hash);
            scratch[j] = (byte) (hash & 0xFF);
            scratch[j + 1] = (byte) ((hash >>> 8) & 0xFF);
        }
    }
    byte[] target = new byte[16];
    for (int i = 0; i < 16; i++) {
        byte lower = (byte) (scratch[i] & 0x7F);
        if ((lower >= 'A' && lower <= 'Z') || (lower >= 'a' && lower <= 'z'))
            target[i] = lower;
        else
            target[i] = (byte) (((scratch[i] >>> 4 & 0xF) + 0x61) & 0xFF);
    }
    return target;
}

private static int Hash(byte[] scratch, int hash) {
    for (int i = 15; i >= 0; i--)
        hash = ((hash >>> 8) & 0xFF ^ table[hash & 0xFF] ^ table[scratch[i] & 0xFF]) & 0xFFFF;
    return hash;
}
}
There are several implementations of CRC-16. Progress Software (deliberately) does not document which variant is used.
For what purpose are you looking for this?
Rather than trying to use "encode" I'd recommend studying OE's cryptography functionality. I'm not sure what 10.1C supports; the 11.0 docs I have say OE supports:
• DES — Data Encryption Standard
• DES3 — Triple DES
• AES — Advanced Encryption Standard
• RC4 — Also known as ARC4
The OE PDF docs are available here:
http://communities.progress.com/pcom/docs/DOC-16074
The ENCODE function only works one way. Progress has never disclosed the algorithm behind it, and they have never built in a function to decode it.
As of OE 10.0B, Progress has implemented cryptography within the ABL. Have a look at the ENCRYPT and DECRYPT functions.

Out of memory error. Allocating...

I'm trying to use a gprof command, gprof -s executable.exe gmon.out gmon.sum, to merge profiling data gathered from 2 runs of my program. But the following error appears:
gprof: out of memory allocating 3403207348 bytes after a total of 196608 bytes
My program is quite simple (just one for loop). If I run it once, the run time is too short (it shows 0.00s) for gprof to record.
In Cygwin, I do the following steps:
gcc -pg -o fl forAndWhilLoop.c
fl (run the program)
mv gmon.out gmon.sum
fl (run the program)
gprof -s fl.exe gmon.out gmon.sum
gprof fl.exe gmon.sum>gmon.out
gprof fl.exe
My program:
int main(void)
{
    int fac = 1;
    int count = 10;
    int k;
    for (k = 1; k <= count; k++)
    {
        fac = fac * k;
    }
    return 0;
}
So can anyone help me with this problem? Thanks!
If all you want is to time it, on my machine it's 105ns. Here's the code:
void forloop(void){
    int fac = 1;
    int count = 10;
    int k;
    for (k = 1; k <= count; k++)
    {
        fac = fac * k;
    }
}

int main(int argc, char* argv[])
{
    int i;
    for (i = 0; i < 1000000000; i++){
        forloop();
    }
    return 0;
}
Get the idea? I used a hand-held stopwatch. Since it runs 10^9 times, seconds = nanoseconds.
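If a stopwatch is inconvenient, the same measurement can be done in code. Here is a minimal sketch (my addition, not part of the original answer) using the standard C clock() timer; it assumes the forloop() function from the snippet above lives in a separate translation unit so the calls are not optimized away:
#include <stdio.h>
#include <time.h>

void forloop(void);   /* the test function shown above, compiled in another file */

int main(void)
{
    const long iterations = 1000000000L;
    clock_t start = clock();
    for (long i = 0; i < iterations; i++)
        forloop();
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;
    /* With 1e9 iterations, total seconds equals nanoseconds per call. */
    printf("%.1f ns per call\n", seconds);
    return 0;
}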
Unrolling the inner loop like this reduced the time to 92ns:
int k = 1;
while (k + 5 <= count) {
    fac *= k * (k+1) * (k+2) * (k+3) * (k+4);
    k += 5;
}
while (k <= count) {
    fac *= k++;
}
Switching to Release build from Debug brought it down to 21ns. You can only expect that kind of speedup in an actual hotspot, which this is.
It seems that pprof should be executed instead of gprof.