Related
I am new to Capstone and PE structure... beg your indulgence
I want to extract hex number which means opcode and operands in Python using Capstone
Here is my example:
if there is a file that looks like this
.text:00404499 8B 0D 14 61 41 00
mov ecx, dword_416114
the hex num -'8D' is the hex number including (mov, ecx, dword_416114) right?
so I tried to extract the exact hex num, but i am having trouble...
Here is my code:
for ins in cs.disasm(pe.sections[0].get_data(), 0x0):
print("0x%x:\t%s\t%s" %(ins.address, ins.mnemonic, ins.op_str))
above code will show this:
0x0: xor eax, 0xff77ee8b
0x5: mov ch, dh
0x7: ja 0xffffff9f
...
0x15: inc esi
how can i get what i want?
I think I'm thinking about this the wrong way, but I'm wondering how an embedded system with less than 32-bits can use 32-bit data values. I'm a beginner programmer so go easy on me :)
base 10
0100 <- carry in/out
5432
+1177
======
6609
never brought up in class but we can now extend that to two operations
100
32
+77
======
09
01
54
+11
======
66
and come up with the 6609 result because we understand that it is column based and each column treated separately.
base 2
1111
+0011
=====
11110
1111
+0011
=====
10010
110
11
+11
=====
10
111
11
+00
=====
100
result 10010
you can break your operations up into however many bits you want 8, 16, 13, 97 whatever. it is column based (for addition) and it just works. division you should be able to figure out, multiplication is just shifting and adding and can turn that into multiple operations as well
n bits * n bits = 2*n bits so if you have an 8 bit * 8 bit = 16 bit multiply you can use that on an 8 bit system otherwise you have to limit to 4 bits * 4 bits = 8 bits and work with that (or if no multiply then just do the shift and add).
base 2
abcd
* 1101
========
abcd
0000
abcd
+abcd
=========
which you can break down into a shifting and adding problem, can do N bits with a 4 or 8 or M bit processor/registers/alu
Or look at it another way, grade school algebra
(a+b)*(c+d) = ac + bc + ad + bd
mnop * tuvw = ((mn*0x100)+(op)) * ((tu*0x100)+(vw)) = (a+b)*(c+d)
and you should find that you can combine the with 0x100 terms and without,
do those separately from the without putting together parts of the answer using an 8 bit alu (or 4 bits of the 8 bit as needed).
shifting should be obvious just move the bits over to the next byte or (half)word or whatever.
and bitwise operations (xor, and, or) are bitwise so dont need anything special just keep the columns lined up.
EDIT
Or you could just try it
unsigned long fun1 ( unsigned long a, unsigned long b )
{
return(a+b);
}
00000000 <_fun1>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 1d41 0006 mov 6(r5), r1
c: 6d40 0008 add 10(r5), r0
10: 6d41 000a add 12(r5), r1
14: 0b40 adc r0
16: 1585 mov (sp)+, r5
18: 0087 rts pc
00000000 <fun1>:
0: 0e 5c add r12, r14
2: 0f 6d addc r13, r15
4: 30 41 ret
00000000 <fun1>:
0: 62 0f add r22, r18
2: 73 1f adc r23, r19
4: 84 1f adc r24, r20
6: 95 1f adc r25, r21
8: 08 95 ret
bonus points if you can figure out these instruction sets.
unsigned long fun2 ( unsigned long a, unsigned long b )
{
return(a*b);
}
00000000 <_fun2>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 10e6 mov r3, -(sp)
6: 1d41 0006 mov 6(r5), r1
a: 1d40 000a mov 12(r5), r0
e: 1043 mov r1, r3
10: 00a1 clc
12: 0c03 ror r3
14: 74d7 fff2 ash $-16, r3
18: 6d43 0004 add 4(r5), r3
1c: 70c0 mul r0, r3
1e: 00a1 clc
20: 0c00 ror r0
22: 7417 fff2 ash $-16, r0
26: 6d40 0008 add 10(r5), r0
2a: 7040 mul r0, r1
2c: 10c0 mov r3, r0
2e: 6040 add r1, r0
30: 0a01 clr r1
32: 1583 mov (sp)+, r3
34: 1585 mov (sp)+, r5
36: 0087 rts pc
An 8 bit system can perform 8 bit operations in a single instruction and single memory access, on such an 8 bit system, 16 and 32 bit operations require additional data accesses and additional instructions.
For example, typical architectures place arithmetic results in register (often an accumulator but some architectures are more_orthogonal_ and can use any register for results), and arithmetic overflow results in a carry flag being set in a status register. In operations larger that the native architecture, the code can inspect the carry flag in order to take the appropriate action in subsequent instructions.
So say for an 8 bit system you add 1 to 255, the result in the 8 bit accumulator will be zero, with the carry flag set; the next instruction can then add one to the upper byte of a 16 bit value in response to the carry flag. This can be made to ripple through to any number of bytes or words, so that a system can be made to process operations of arbitrary bit length above that of the underlying architecture just not in a single instruction operation.
working on a project with a self checking test bench and having a problem I do not understand.
The problem for the following code is an error in the simulation. I will point to where the error is coming from in the code:
LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
ENTITY TestBenchAutomated IS
-- Generics passed in
generic (m: integer := 3; n: integer := 5; h: integer := 4; DATA_SIZE: integer :=5);
END TestBenchAutomated;
ARCHITECTURE behavior OF TestBenchAutomated IS
-- Component Declaration for the Unit Under Test (UUT)
COMPONENT TopLevelM_M
generic (m: integer := 3; n: integer := 5; h: integer := 4; DATA_SIZE: integer :=5);
PORT(
clk : IN std_logic;
next_in : IN std_logic; --User input
rst_in : IN std_logic; --User input
OUTPUT : OUT SIGNED((DATA_SIZE+DATA_SIZE)+(m-1)-1 downto 0) --Calculated DATA output
);
END COMPONENT;
--Inputs
signal clk : std_logic := '0';
signal next_in : std_logic := '0';
signal rst_in : std_logic := '0';
--Outputs
signal OUTPUT : SIGNED((DATA_SIZE+DATA_SIZE)+(m-1)-1 downto 0);
-- Clock period definitions
constant clk_period : time := 10 ns;
--Variable to be used in assert section
type Vector is record
OUTPUT_test : SIGNED((DATA_SIZE+DATA_SIZE)+(m-1)-1 downto 0);
end record;
type VectorArray is array (natural range <>) of Vector;
constant Vectors : VectorArray := (
-- Values to be compaired to calculated output
(OUTPUT_test =>"000000110000"), -- 48
(OUTPUT_test =>"000011110110"), -- 246
(OUTPUT_test =>"000101001000"), -- 382 <--- Purposefully incorrect value, Should be '000100001000' = 264
(OUTPUT_test =>"111111010011"), -- -45
(OUTPUT_test =>"111101001100"), -- -180
(OUTPUT_test =>"111111001111"), -- -49
(OUTPUT_test =>"000000101011"), -- 43 Purposefully incorrect value, Should be '000010101011' = 171
(OUTPUT_test =>"000000010011"), -- 19
(OUTPUT_test =>"111111100101"), -- -27
(OUTPUT_test =>"111110111011"), -- -69
(OUTPUT_test =>"111110111011"), -- -69
(OUTPUT_test =>"000000101101"), -- 45
(OUTPUT_test =>"111011011110"), -- -290
(OUTPUT_test =>"000001010110"), -- 86
(OUTPUT_test =>"000011110010"), -- 242
(OUTPUT_test =>"00000111110"), -- 125
(OUTPUT_test =>"111111001001"), -- -55
(OUTPUT_test =>"000100010101"), -- 277
(OUTPUT_test =>"111111100011"), -- -29
(OUTPUT_test =>"111101111101"));-- -131
BEGIN
-- Instantiate the Unit Under Test (UUT)
uut: TopLevelM_M PORT MAP (
clk => clk,
next_in => next_in,
rst_in => rst_in,
OUTPUT => OUTPUT
);
-- Clock process definitions
clk_process :process
begin
clk <= '0';
wait for clk_period/2;
clk <= '1';
wait for clk_period/2;
end process;
-- Process to simulate user input and to check output is correct
Test :process
variable i : integer;
begin
wait for 100 ns;
rst_in <= '1';
wait for clk_period*3;
rst_in <= '0';
--Loops through enough times to cover matrix and more to show it freezes in S_Wait state
for i in 0 to 50 loop
for i in Vectors'range loop
next_in <= '1';
wait for clk_period*5;
next_in <= '0';
wait for clk_period*4; --Appropriate amount of clock cycles needed for calculations to be displayed at output
--Check the output is the same as expected
assert OUTPUT = Vectors(i).OUTPUT_test
report "Incorrect Output on vector line" & integer'image(i) &
lf & "Expected:" & integer'image(i)(to_integer((Vectors(i).OUTPUT_test))) --& lf &
--"But got" & integer'image(i)(to_integer(signed(OUTPUT)))
severity error;
end loop;
end loop;
wait;
end process;
END;
As you can see in the vector, I have inserted two incorrect values to make sure the code works. I there for expect an error in the simulation telling me that there is an error on address 2 of the vector and what integer it is. However, the simulation stops and i get this:
ERROR: Index 328 out of bound 1 to 1.
ERROR: In process TestBenchAutomated.vhd:Test
INFO: Simulator is stopped.
Obviously the integer 328 that is represented by the binary number in the vector causes this error, but I dont understand why it causes THIS error instead of the one I have coded. What is this index out of bound OF?
Any help would be much appreciated.
Thanks
This:
report "Incorrect Output on vector line" & integer'image(i) &
lf & "Expected:" & integer'image(i)(to_integer((Vectors(i).OUTPUT_test)))
Should be:
report "Incorrect Output on vector line" & integer'image(i) &
lf & "Expected:" & integer'image(to_integer((Vectors(i).OUTPUT_test)))
It's complaining that the value (to_integer((Vectors(i).OUTPUT_test))) is out of range for a character when it should have been used as parameter to 'IMAGE, which you supplied already as i.
For a simplified test case:
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
entity foo is
constant m: integer := 3;
constant n: integer := 5;
constant h: integer := 4;
constant DATA_SIZE: integer :=5;
end entity;
architecture fum of foo is
signal OUTPUT : SIGNED((DATA_SIZE+DATA_SIZE)+(m-1)-1 downto 0) := "000011110110" ;
type Vector is record
OUTPUT_test : SIGNED((DATA_SIZE+DATA_SIZE)+(m-1)-1 downto 0);
end record;
type VectorArray is array (natural range <>) of Vector;
constant Vectors : VectorArray := (
-- Values to be compaired to calculated output
(OUTPUT_test =>"000011110110"), -- 246 (CORRECT)
(OUTPUT_test =>"000101001000") -- 382 (INCORRECT)
);
begin
TEST:
process
begin
for i in Vectors'RANGE loop
assert OUTPUT = Vectors(i).OUTPUT_test
report "Incorrect Output on vector line " & integer'image(i) &
-- lf & "Expected:" & integer'image(i)(to_integer((Vectors(i).OUTPUT_test)))
lf & "Expected:" & integer'image(to_integer((Vectors(i).OUTPUT_test)))
severity error;
end loop;
wait;
end process;
end architecture;
And the incorrect usage, Nick Gasson's nvc gave:
david_koontz#Macbook: nvc -a foo.vhdl
** Error: expected 2 parameters for attribute IMAGE but have 3
File foo.vhdl, Line 34
lf & "Expected:" & integer'image(i)(to_integer((Vectors(i).OUTPUT_t ...
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
With the correct number of arguments to `'IMAGE' (shown in the example):
david_koontz#Macbook: nvc -r foo
** Fatal: 0ms+0: Assertion Error: Incorrect Output on vector line 1
Expected:328
Process :foo:test
File foo.vhdl, Line 32
Found a ghdl bug not reporting this when it likely should. It worked either way (this should be a run time error). An integer value of 382 isn't a character eligible for concatenation.
Addendum:
Tristan Gingold (ghdl author) pointed out that the expression is an element index to the string output of the 'IMAGE function.
Further analysis reveals the basis for the error message on the original code for the question:
& integer'image(i)(to_integer((Vectors(i).OUTPUT_test)))
T'IMAGE(X)
Kind: Function.
Prefix: Any scalar type or subtype T.
Parameter: An expression whose type is the base type of T.
Result Type: Type String.
Result: The string representation of the parameter value, without
leading or trailing whitespace.
No concatenation operator following.
(to_integer( ( Vectors(i).OUTPUT ) ) ) returns the integer value for the record element OUTPUT, type signed. (superfluous parentheses aside).
The contents of Vectors(i).OUTPUT is
constant Vectors : VectorArray := (
(OUTPUT_test =>"000011110110"), -- 246 (CORRECT)
(OUTPUT_test =>"000101001000") -- 382 (INCORRECT)
);
The 382 should be 328, its 0x148. Dyslexia is hard to spell.
And in this case for i = 1 (Vectors'RANGE is (0 to 1), is
"000101001000" which to_integer is 328, out of range for an element of a string (element type character).
An integer, value 328 or not is not an element index type for a record (while OUTPUT is).
The subtype for the unnamed string output of 'IMAGE is the length of the string for i, whose value is 1, the length is 1, the range is 1 to 1. 328 is out of range.
Notice the ISIM message said exactly that in the original model:
ERROR: Index 328 out of bound 1 to 1. ERROR: In process TestBenchAutomated.vhd:Test
This still looks like a ghdl error. It does also make nvc's error message suspect however.
For academic purposes I want to try to write an ARM NEON optimization of the following algorithm, even only to test if it is possible to obtain any performance improvement or not. I think this is not a good candidate for SIMD optimization because results are merged togheter losing parallelization gains.
This is the algorithm:
const uchar* center = ...;
int t0, t1, val;
t0 = center[0]; t1 = center[1];
val = t0 < t1;
t0 = center[2]; t1 = center[3];
val |= (t0 < t1) << 1;
t0 = center[4]; t1 = center[5];
val |= (t0 < t1) << 2;
t0 = center[6]; t1 = center[7];
val |= (t0 < t1) << 3;
t0 = center[8]; t1 = center[9];
val |= (t0 < t1) << 4;
t0 = center[10]; t1 = center[11];
val |= (t0 < t1) << 5;
t0 = center[12]; t1 = center[13];
val |= (t0 < t1) << 6;
t0 = center[14]; t1 = center[15];
val |= (t0 < t1) << 7;
d[i] = (uchar)val;
This is what I thought in ARM assembly:
VLD2.8 {d0, d1} ["center" addr]
supposing 8 bit chars, this first operation should load all the t0 and t1 values alternatively in 2 registers.
VCLT.U8 d2, d0, d1
a single operation of "less then" for all the comparisons. NOTES: I've read that VCLT is possible only with a #0 constant as second operand, so this must be inverted in a >=. Reading ARM documentation i think the result of every 8 bit value will be "all 1" for true (11111111) or "all 0" for false (00000000).
VSHR.U8 d4, d2, #7
this right shift will delete 7 out of 8 values in the register 8-bit "cells" (mainly to delete 7 ones). I've used d4 because of the next step will be the first d register mapped in q2.
Now problems begin: shifting and ORs.
VSHLL.U8 q2[1], d4[1], 1
VSHLL.U8 q2[2], d4[2], 2
...
VSHLL.U8 q2[7], d4[7], 7
I can imagine only this way (if it's possible to use [offsets]) for left shifts. Q2 should be specified instead of d4 according to the documentation.
VORR(.U8) d4[0], d4[1], d4[0]
VORR(.U8) d4[0], d4[2], d4[0]
...
VORR(.U8) d4[0], d4[7], d4[0]
Last step should give the result.
VST1.8 d4[0], [d[i] addr]
Simple store of the result.
It is my first approach to ARM NEON, so probably many assumptions may be incorrect. Help me understand possible errors, and suggest a better solution if possible.
EDIT:
This is the final working code after the suggested solutions:
__asm__ __volatile ("VLD2.8 {d0, d1}, [%[ordered_center]] \n\t"
"VCGT.U8 d2, d1, d0 \n\t"
"MOV r1, 0x01 \n\t"
"MOV r2, 0x0200 \n\t"
"ORR r2, r2, r1 \n\t"
"MOV r1, 0x10 \n\t"
"MOV r3, 0x2000 \n\t"
"ORR r3, r3, r1 \n\t"
"MOVT r2, 0x0804 \n\t"
"MOVT r3, 0x8040 \n\t"
"VMOV.32 d3[0], r2 \n\t"
"VMOV.32 d3[1], r3 \n\t"
"VAND d0, d2, d3 \n\t"
"VPADDL.U8 d0, d0 \n\t"
"VPADDL.U16 d0, d0 \n\t"
"VPADDL.U32 d0, d0 \n\t"
"VST1.8 d0[0], [%[desc]] \n\t"
:
: [ordered_center] "r" (ordered_center), [desc] "r" (&desc[i])
: "d0", "d1", "d2", "d3", "r1", "r2", "r3");
After the comparison, you have an array of 8 booleans represented by 0xff or 0x00. The reason SIMD comparisons (on any architecture) produce those values is to make them useful for a bit-mask operation (and/or bit-select in NEON's case) so you can turn the result into an arbitrary value quickly, without a multiply.
So rather than reducing them to 1 or 0 and shifting them about, you'll find it easier to mask them with the constant 0x8040201008040201. Then each lane contains the bit corresponding to its position in the final result. You can pre-load the constant into another register (I'll use d3).
VAND d0, d2, d3
Then, to combine the results, you can use VPADD (instead of OR), which will combine adjacent pairs of lanes, d0[0] = d0[0] + d0[1], d0[1] = d0[2] + d0[3], etc... Since the bit patterns do not overlap there is no carry and add works just as well as or. Also, because the output is half as large as the input we have to fill in the second half with junk. I've used a second copy of d0 for that.
You'll need to do the add three times to get all columns combined.
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
VPADD.u8 d0, d0, d0
and now the result will now be in d0[0].
As you can see, d0 has room for seven more results; and some lanes of the VPADD operations have been working with junk data. It would be better if you could fetch more data at once, and feed that additional work in as you go so that none of the arithmetic is wasted.
EDIT
Supposing the loop is unrolled four times; with results in d4, d5, d6, and d7; the constant mentioned earlier should be loaded into, say, d30 and d31, and then some q register arithmetic can be used:
VAND q0, q2, q15
VAND q1, q3, q15
VPADD.u8 d0, d0, d1
VPADD.u8 d2, d2, d3
VPADD.u8 d0, d0, d2
VPADD.u8 d0, d0, d0
With the final result in d0[0..3], or simply the 32-bit value in d0[0].
There seem to be lots of registers free to unroll it further, but I don't know how many of those you'll use up on other calculations.
load a d register with the value 0x8040201008040201
vand with the result of vclt
vpaddl.u8 from 2)
vpaddl.u16 from 3)
vpaddl.u32 from 4)
store the lowest single byte from 5)
Start with expressing the parallelism explicitly to begin with:
int /* bool, whatever ... */ val[8] = {
center[0] < center[1],
center[2] < center[3],
center[4] < center[5],
center[6] < center[7],
center[8] < center[9],
center[10] < center[11],
center[12] < center[13],
center[14] < center[15]
};
d[i] = extract_mask(val);
The shifts are equivalent to a "mask move", as you want each comparison to result in a single bit.
The comparison of the above sixteen values can be done by first doing a structure load (vld2.8) to split adjacent bytes into two uint8x8_t, then the parallel compare. The result of that is a uint8x8_t with either 0xff or 0x00 in the bytes. You want one bit of each, in the respective bit position.
That's a "mask extract"; on Intel SSE2, that'd be MASKMOV but on Neon, no direct equiv exists; three vpadd as shown above (or see SSE _mm_movemask_epi8 equivalent method for ARM NEON for more on this) are a suitable substitute.
I wrote a storable vector instance for the data type below (original question here):
data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
The code for defining those instances for Storable vector is below. While I am getting very good performance with the code below, I am very much interested in generic suggestions to improve the performance of that storable instance. By generic suggestion, I mean the following:
It is not specific to a GHC compiler version. You can assume GHC 6.12.3+ to exclude performance bugs if any present in earlier versions, and relevant to the code here.
Platform-specific suggestions are ok. You may assume x86_64 Linux platform.
A generic suggestion more in the form of algorithm improvement (big O) is very much valued, than a suggestion that exploits hardware-specific optimizations. But, given a basic operation like peek/poke here, there is not much scope for algorithmic improvement, as far as I can tell (and hence more valuable because it is a scarce commodity :)
Compiler flags for x86_64 are acceptable (e.g., telling compiler about removing floating point safe check etc.). I am using "-O2 --make" option to compile the code.
If there is any known good library source code that does similar thing (i.e., define storable instances for union/recursive data types), I will be very much interested in checking them.
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
import GHC.Int
data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
deriving (Show)
instance Storable Atoms where
sizeOf _ = 1 + sizeOf (undefined :: Int32)
alignment _ = 1 + alignment (undefined :: Int32)
{-# INLINE peek #-}
peek p = do
let p1 = (castPtr p::Ptr Word8) `plusPtr` 1 -- get pointer to start of the element. First byte is type of element
t <- peek (castPtr p::Ptr Word8)
case t of
0 -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
return (I x)
1 -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
return (S x)
{-# INLINE poke #-}
poke p x = case x of
I a -> do
poke (castPtr p :: Ptr Word8) 0
pokeElemOff (castPtr p1) 0 a
S a -> do
poke (castPtr p :: Ptr Word8) 1
pokeElemOff (castPtr p1) 0 a
where p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the element. First byte is type of element
Update:
Based on feedback from Daniel and dflemstr, I rewrote the alignment, and also, updated the constructor to be of type Word32 instead of Word8. But, it seems that for this to be effective, the data constructor too should be updated to have unpacked values - that was an oversight on my part. I should have written data constructor to have unpacked values in the first place (see performance slides by John Tibbell - slide #49). So, rewriting the data constructor, coupled with alignment and constructor changes, made a big impact on the performance, improving it by about 33% for functions over vector (a simple sum function in my benchmark test). Relevant changes below (warning - not portable but it is not an issue for my use case):
Data constructor change:
data Atoms = I {-# UNPACK #-} !GHC.Int.Int32 | S {-# UNPACK #-} !GHC.Int.Int16
Storable sizeof and alignment changes:
instance Storable Atoms where
sizeOf _ = 2*sizeOf (undefined :: Int32)
alignment _ = 4
{-# INLINE peek #-}
peek p = do
let p1 = (castPtr p::Ptr Word32) `plusPtr` 1
t <- peek (castPtr p::Ptr Word32)
case t of
0 -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
return (I x)
_ -> do
x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
return (S x)
{-# INLINE poke #-}
poke p x = case x of
I a -> do
poke (castPtr p :: Ptr Word32) 0
pokeElemOff (castPtr p1) 0 a
S a -> do
poke (castPtr p :: Ptr Word32) 1
pokeElemOff (castPtr p1) 0 a
where p1 = (castPtr p :: Ptr Word32) `plusPtr` 1
Four or eight byte aligned memory access is typically much faster than oddly aligned access. It may be that the alignment for your instance is automatically rounded up to eight bytes, but I'd advise to at least measure with explicit eight byte alignment, using 32 bits (Int32 or Word32) for the constructor tag and reading and writing both types of payloads as Int32. That'll waste bits, but there's a good chance it'll be faster. Since you're on a 64-bit platform, it may be even faster to use 16-byte alignment and reading/writing Int64. Benchmark, benchmark, benchmark to find out what serves you best.
If speed is what you're after, then this kind of bit packing isn't the right direction to go in.
A processor always deals with word-sized operations, meaning that if you have e.g. a 32-bit processor, the smallest amount of memory that the processor can (physically) deal with is 32 bits or 4 bytes (and for 64-bit processors its 64 bits or 8 bytes). Further; a processor can only load memory at word-boundaries, meaning at byte addresses that are multiples of the word size.
So if you use an alignment of 5 (in this case), it means that your data is stored like this:
| 32 bits | 32 bits | 32 bits | 32 bits |
[ data ] [ data ] [ data ]
00 00 00 00 01 01 00 01 00 00 00 12 34 56 78 00
IX Value IX Value XX XX IX Value
IX = Constructor index
Value = The stored value
XX = Unused byte
As you can see, the data gets more and more out of sync with the word boundaries, making the processor/program have to do more work to access each element.
If you increase your alignment to 8 (64 bits), your data will be stored like this:
| 32 bits | 32 bits | 32 bits | 32 bits | 32 bits | 32 bits |
[ data ] [ data ] [ data ]
00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78 00 00 00
IX Value XX XX XX IX Value XX XX XX XX XX IX Value XX XX XX
This makes you "waste" 3 bytes per element, but your data structure will be much faster, since each datum can be loaded and interpreted with far fewer instructions and aligned memory loads.
If you are going to use 8 bytes anyways, you might as well make your constructor index to a Int32, since you aren't using those bytes for anything else anyways, and making all of your datum elements word-aligned further increases speed:
| 32 bits | 32 bits | 32 bits | 32 bits | 32 bits | 32 bits |
[ data ] [ data ] [ data ]
00 00 00 00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78
Index Value Index Value XX XX Index Value
This is the price you have to pay for a faster data structures on current processor architectures.