Optimization suggestions when writing a storable vector definition for a union struct

I wrote a storable vector instance for the data type below (original question here):
data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
The code defining the Storable instance is below. While I am getting very good performance with it, I am very much interested in generic suggestions to improve the performance of that Storable instance. By generic suggestions, I mean the following:
They are not specific to a particular GHC compiler version. You can assume GHC 6.12.3+, to exclude any performance bugs present in earlier versions that are relevant to the code here.
Platform-specific suggestions are OK. You may assume an x86_64 Linux platform.
A suggestion in the form of an algorithmic improvement (big O) is valued much more than one that exploits hardware-specific optimizations. But, given basic operations like peek/poke here, there is not much scope for algorithmic improvement as far as I can tell (and hence it is more valuable, because it is a scarce commodity :)
Compiler flags for x86_64 are acceptable (e.g., telling the compiler to remove floating-point safety checks, etc.). I am using the "-O2 --make" options to compile the code.
If there is any known good library source code that does a similar thing (i.e., defines Storable instances for union/recursive data types), I will be very interested in checking it out.
import Data.Vector.Storable
import qualified Data.Vector.Storable as V
import Foreign
import Foreign.C.Types
import GHC.Int

data Atoms = I GHC.Int.Int32 | S GHC.Int.Int16
             deriving (Show)

instance Storable Atoms where
  sizeOf _    = 1 + sizeOf (undefined :: Int32)
  alignment _ = 1 + alignment (undefined :: Int32)

  {-# INLINE peek #-}
  peek p = do
    let p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the element. First byte is type of element
    t <- peek (castPtr p :: Ptr Word8)
    case t of
      0 -> do
        x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
        return (I x)
      1 -> do
        x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
        return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
        poke (castPtr p :: Ptr Word8) 0
        pokeElemOff (castPtr p1) 0 a
      S a -> do
        poke (castPtr p :: Ptr Word8) 1
        pokeElemOff (castPtr p1) 0 a
    where
      p1 = (castPtr p :: Ptr Word8) `plusPtr` 1 -- get pointer to start of the element. First byte is type of element
Update:
Based on the feedback from Daniel and dflemstr, I rewrote the alignment and also updated the constructor tag to be of type Word32 instead of Word8. But it seems that for this to be effective, the data constructors too should hold unpacked values - that was an oversight on my part. I should have written the data constructors with unpacked strict fields in the first place (see the performance slides by John Tibbell - slide #49). Rewriting the data constructors, together with the alignment and constructor-tag changes, made a big impact on the performance, improving it by about 33% for functions over the vector (a simple sum function in my benchmark). Relevant changes are below (warning - not portable, but that is not an issue for my use case):
Data constructor change:
data Atoms = I {-# UNPACK #-} !GHC.Int.Int32 | S {-# UNPACK #-} !GHC.Int.Int16
Storable sizeOf and alignment changes:
instance Storable Atoms where
  sizeOf _    = 2 * sizeOf (undefined :: Int32)
  alignment _ = 4

  {-# INLINE peek #-}
  peek p = do
    let p1 = (castPtr p :: Ptr Word32) `plusPtr` 4 -- payload follows the 4-byte constructor tag
    t <- peek (castPtr p :: Ptr Word32)
    case t of
      0 -> do
        x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int32) 0
        return (I x)
      _ -> do
        x <- peekElemOff (castPtr p1 :: Ptr GHC.Int.Int16) 0
        return (S x)

  {-# INLINE poke #-}
  poke p x = case x of
      I a -> do
        poke (castPtr p :: Ptr Word32) 0
        pokeElemOff (castPtr p1) 0 a
      S a -> do
        poke (castPtr p :: Ptr Word32) 1
        pokeElemOff (castPtr p1) 0 a
    where
      p1 = (castPtr p :: Ptr Word32) `plusPtr` 4 -- payload follows the 4-byte constructor tag
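The benchmark mentioned above simply sums over the vector. A minimal, hypothetical example of such a function (the name sumAtoms and the choice of a strict left fold are mine, not the exact benchmark code; it assumes the imports from the question):

-- Sum every element, widening both payload types to Int.
sumAtoms :: V.Vector Atoms -> Int
sumAtoms = V.foldl' step 0
  where
    step acc (I x) = acc + fromIntegral x
    step acc (S x) = acc + fromIntegral x

V.sum would require a Num instance for Atoms, so an explicit fold like this is the simplest way to consume the vector.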

Four- or eight-byte aligned memory access is typically much faster than oddly aligned access. It may be that the alignment for your instance is automatically rounded up to eight bytes, but I'd advise at least measuring with an explicit eight-byte alignment, using 32 bits (Int32 or Word32) for the constructor tag and reading and writing both types of payload as Int32. That'll waste bits, but there's a good chance it'll be faster. Since you're on a 64-bit platform, it may be even faster to use 16-byte alignment and to read/write Int64. Benchmark, benchmark, benchmark to find out what serves you best.
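A rough, untested sketch of that suggestion (it would replace the instance above; the Int16 payload is simply widened to Int32 on the way in and narrowed on the way out):

-- Sketch only: eight-byte cells, Word32 tag at offset 0, payload always
-- read and written as an Int32 at offset 4.
instance Storable Atoms where
  sizeOf _    = 8
  alignment _ = 8
  {-# INLINE peek #-}
  peek p = do
    t <- peek (castPtr p :: Ptr Word32)
    v <- peekByteOff p 4 :: IO Int32
    return (if t == 0 then I v else S (fromIntegral v))
  {-# INLINE poke #-}
  poke p (I a) = do
    poke (castPtr p :: Ptr Word32) 0
    pokeByteOff p 4 a
  poke p (S a) = do
    poke (castPtr p :: Ptr Word32) 1
    pokeByteOff p 4 (fromIntegral a :: Int32)

Whether the extra padding pays off is exactly the kind of thing the benchmark has to decide.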

If speed is what you're after, then this kind of bit packing isn't the right direction to go in.
A processor always deals with word-sized operations, meaning that if you have e.g. a 32-bit processor, the smallest amount of memory that the processor can (physically) deal with is 32 bits or 4 bytes (and for 64-bit processors it's 64 bits or 8 bytes). Further, a processor can only load memory at word boundaries, meaning at byte addresses that are multiples of the word size.
So if you use an alignment of 5 (in this case), it means that your data is stored like this:
|  32 bits  |  32 bits  |  32 bits  |  32 bits  |
[     data    ][     data    ][     data    ]
00 00 00 00 01 01 00 01 00 00 00 12 34 56 78 00
IX Value       IX Value XX XX IX Value
IX = Constructor index
Value = The stored value
XX = Unused byte
As you can see, the data gets more and more out of sync with the word boundaries, making the processor/program have to do more work to access each element.
If you increase your alignment to 8 (64 bits), your data will be stored like this:
|  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
[         data         ][         data         ][         data         ]
00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78 00 00 00
IX Value       XX XX XX IX Value XX XX XX XX XX IX Value       XX XX XX
This makes you "waste" 3 bytes per element, but your data structure will be much faster, since each datum can be loaded and interpreted with far fewer instructions and aligned memory loads.
If you are going to use 8 bytes anyway, you might as well make your constructor index an Int32, since you aren't using those bytes for anything else, and making all of your datum elements word-aligned further increases speed:
|  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |  32 bits  |
[         data         ][         data         ][         data         ]
00 00 00 00 00 00 00 01 00 00 00 01 00 01 00 00 00 00 00 00 12 34 56 78
Index       Value       Index       Value XX XX Index       Value
This is the price you have to pay for faster data structures on current processor architectures.

Related

How to extract the hex number that encodes the opcode and operands

I am new to Capstone and PE structure... beg your indulgence
I want to extract the hex numbers that encode the opcode and operands, in Python, using Capstone.
Here is my example:
If there is a file that looks like this:
.text:00404499 8B 0D 14 61 41 00
mov ecx, dword_416114
the hex numbers '8B 0D 14 61 41 00' are the bytes encoding (mov ecx, dword_416114), right?
So I tried to extract the exact hex numbers, but I am having trouble...
Here is my code:
for ins in cs.disasm(pe.sections[0].get_data(), 0x0):
    print("0x%x:\t%s\t%s" % (ins.address, ins.mnemonic, ins.op_str))
The above code will show this:
0x0: xor eax, 0xff77ee8b
0x5: mov ch, dh
0x7: ja 0xffffff9f
...
0x15: inc esi
How can I get what I want?

Where the Hamming Distance Constants Came From

The function:
function popcount (x, n) {
  if (n !== undefined) {
    x &= (1 << n) - 1
  }
  x -= x >> 1 & 0x55555555
  x = (x & 0x33333333) + (x >> 2 & 0x33333333)
  x = x + (x >> 4) & 0x0f0f0f0f
  x += x >> 8
  x += x >> 16
  return x & 0x7f
}
is for calculating the Hamming weight. I am wondering where these constants come from and, more generally, how this method was discovered. Does anyone know of a resource that describes it?
The masks select the even-numbered k-bit parts: k=1 gives 0x55555555, k=2 gives 0x33333333, k=4 gives 0x0f0f0f0f.
In binary the masks look like:
0x55555555 = 01010101010101010101010101010101
0x33333333 = 00110011001100110011001100110011
0x0f0f0f0f = 00001111000011110000111100001111
They are also the result of 0xffffffff / 3, 0xffffffff / 5 and 0xffffffff / 17 but this arithmetic insight is probably not useful in this context.
Overall this method of computing the Hamming weight has the form of a tree where first adjacent bits are summed into a 2-bit number, then adjacent 2-bit numbers are summed into 4-bit numbers, and so on.
All the steps could have this form:
x = (x & m[k]) + ((x >> k) & m[k])
where m[k] is a mask selecting the even-numbered k-bit parts.
But many steps have short-cuts available for them. For example, to sum adjacent bits, there are only 4 cases to consider:
00 -> 00
01 -> 01
10 -> 01
11 -> 10
This could be done by extracting both bits and summing them, but x -= x >> 1 & 0x55555555 also works. This subtracts the top bit from the 2-bit part, so
00 -> 00 - 0 = 00
01 -> 01 - 0 = 01
10 -> 10 - 1 = 01
11 -> 11 - 1 = 10
Maybe this could be discovered through "cleverness and insight", whatever those are.
In the step x = (x + (x >> 4)) & 0x0f0f0f0f (extra parentheses added for clarity), a couple of properties are used. The results from the previous steps are the Hamming weights of 4-bit strings stored in 4 bits each, so they are at most 0100. That means two of them can be added in-place without carrying into the next higher part, because their sum will be at most 1000 which still fits. So instead of masking twice before the sum, it is enough to mask once after the sum, this mask effectively zero-extends the even numbered 4-bit parts into 8-bit parts. This could be discovered by considering the maximum values at each step.
The step x += x >> 8 has similar reasoning, but it works out even better: even masking after the sum is not needed. This leaves some "stray bits" in the second byte from the bottom and in the top byte, but that is not damaging to the next step: the >> 16 throws away the second byte from the bottom, and in the end all the stray bits are removed with x & 0x7f.
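For concreteness, here is the same tree-summing method transcribed to Haskell on a Word32 (a direct transliteration for illustration, not the fastest way to count bits in Haskell; Data.Bits already provides popCount for real use):

import Data.Bits (shiftR, (.&.))
import Data.Word (Word32)

-- Tree-summing popcount on a 32-bit word, mirroring the steps above.
popcount32 :: Word32 -> Int
popcount32 x0 = fromIntegral (x5 .&. 0x7f)
  where
    x1 = x0 - (shiftR x0 1 .&. 0x55555555)                  -- adjacent bits -> 2-bit counts
    x2 = (x1 .&. 0x33333333) + (shiftR x1 2 .&. 0x33333333) -- 2-bit counts  -> 4-bit counts
    x3 = (x2 + shiftR x2 4) .&. 0x0f0f0f0f                  -- 4-bit counts  -> byte counts
    x4 = x3 + shiftR x3 8                                   -- sum bytes (stray bits remain)
    x5 = x4 + shiftR x4 16                                  -- sum halves; the final mask cleans up

popcount32 0xffffffff evaluates to 32 and popcount32 0 to 0, matching the JavaScript version above.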

If an embedded system coded in C is 8 or 16-bit, how will it manipulate 32-bit data types like int?

I think I'm thinking about this the wrong way, but I'm wondering how an embedded system with fewer than 32 bits can use 32-bit data values. I'm a beginner programmer, so go easy on me :)
base 10
 0100     <- carry in/out
 5432
+1177
=====
 6609
This is never brought up in class, but we can now extend that to two operations:
 100
  32
 +77
 ===
  09

  01
  54
 +11
 ===
  66
and come up with the 6609 result, because we understand that it is column based and each column is treated separately.
base 2
 11110    <- carry in/out
  1111
 +0011
 =====
 10010
 110
  11
 +11
 ===
  10

 111
  11
 +00
 ===
 100

result 10010
You can break your operations up into however many bits you want: 8, 16, 13, 97, whatever. It is column based (for addition) and it just works. Division you should be able to figure out; multiplication is just shifting and adding, and you can turn that into multiple operations as well.
N bits * N bits = 2*N bits, so if you have an 8-bit * 8-bit = 16-bit multiply you can use that on an 8-bit system; otherwise you have to limit yourself to 4 bits * 4 bits = 8 bits and work with that (or, if there is no multiply at all, just do the shift and add).
base 2
     abcd
   * 1101
 ========
     abcd
    0000
   abcd
 +abcd
 ========
which you can break down into a shifting-and-adding problem; you can do N bits with a 4-, 8-, or M-bit processor/registers/ALU.
Or look at it another way, with grade-school algebra:
(a+b)*(c+d) = ac + bc + ad + bd
mnop * tuvw = ((mn*0x100)+(op)) * ((tu*0x100)+(vw)) = (a+b)*(c+d)
and you should find that you can separate the terms with the 0x100 factor from those without, work on each group separately, and then put the parts of the answer together using an 8-bit ALU (or 4 bits of the 8-bit ALU as needed).
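A small Haskell model of that decomposition (my own illustration, assuming the only primitive available is an 8-bit by 8-bit multiply with a 16-bit result, as on many 8-bit parts):

import Data.Bits (shiftL, shiftR, (.&.))
import Data.Word (Word8, Word16, Word32)

-- The only multiply primitive we allow ourselves: 8x8 -> 16.
mul8 :: Word8 -> Word8 -> Word16
mul8 a b = fromIntegral a * fromIntegral b

-- 16x16 -> 32 multiply built from four 8x8 partial products,
-- following (a+b)*(c+d) = ac + bc + ad + bd with a, c the high bytes.
mul16 :: Word16 -> Word16 -> Word32
mul16 x y = (ac `shiftL` 16) + ((ad + bc) `shiftL` 8) + bd
  where
    hi, lo :: Word16 -> Word8
    hi v = fromIntegral (v `shiftR` 8)
    lo v = fromIntegral (v .&. 0xff)
    ac = fromIntegral (mul8 (hi x) (hi y)) :: Word32
    ad = fromIntegral (mul8 (hi x) (lo y)) :: Word32
    bc = fromIntegral (mul8 (lo x) (hi y)) :: Word32
    bd = fromIntegral (mul8 (lo x) (lo y)) :: Word32

A quick sanity check: mul16 60000 60000 gives 3600000000, which still fits in the 32-bit result.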
Shifting should be obvious: just move the bits over to the next byte or (half)word or whatever.
And bitwise operations (xor, and, or) are bitwise, so they don't need anything special; just keep the columns lined up.
EDIT
Or you could just try it
unsigned long fun1 ( unsigned long a, unsigned long b )
{
    return(a+b);
}
00000000 <_fun1>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 1d40 0004 mov 4(r5), r0
8: 1d41 0006 mov 6(r5), r1
c: 6d40 0008 add 10(r5), r0
10: 6d41 000a add 12(r5), r1
14: 0b40 adc r0
16: 1585 mov (sp)+, r5
18: 0087 rts pc
00000000 <fun1>:
0: 0e 5c add r12, r14
2: 0f 6d addc r13, r15
4: 30 41 ret
00000000 <fun1>:
0: 62 0f add r22, r18
2: 73 1f adc r23, r19
4: 84 1f adc r24, r20
6: 95 1f adc r25, r21
8: 08 95 ret
bonus points if you can figure out these instruction sets.
unsigned long fun2 ( unsigned long a, unsigned long b )
{
    return(a*b);
}
00000000 <_fun2>:
0: 1166 mov r5, -(sp)
2: 1185 mov sp, r5
4: 10e6 mov r3, -(sp)
6: 1d41 0006 mov 6(r5), r1
a: 1d40 000a mov 12(r5), r0
e: 1043 mov r1, r3
10: 00a1 clc
12: 0c03 ror r3
14: 74d7 fff2 ash $-16, r3
18: 6d43 0004 add 4(r5), r3
1c: 70c0 mul r0, r3
1e: 00a1 clc
20: 0c00 ror r0
22: 7417 fff2 ash $-16, r0
26: 6d40 0008 add 10(r5), r0
2a: 7040 mul r0, r1
2c: 10c0 mov r3, r0
2e: 6040 add r1, r0
30: 0a01 clr r1
32: 1583 mov (sp)+, r3
34: 1585 mov (sp)+, r5
36: 0087 rts pc
An 8-bit system can perform 8-bit operations in a single instruction and a single memory access; on such a system, 16- and 32-bit operations require additional data accesses and additional instructions.
For example, typical architectures place arithmetic results in a register (often an accumulator, but some architectures are more orthogonal and can use any register for results), and arithmetic overflow results in a carry flag being set in a status register. In operations larger than the native word size, the code can inspect the carry flag in order to take the appropriate action in subsequent instructions.
So say on an 8-bit system you add 1 to 255: the result in the 8-bit accumulator will be zero, with the carry flag set; the next instruction can then add one to the upper byte of a 16-bit value in response to the carry flag. This can be made to ripple through any number of bytes or words, so a system can process operations of arbitrary bit length above that of the underlying architecture, just not in a single instruction.
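To make the ripple-carry idea concrete, here is a small Haskell model (my own sketch, unrelated to the listings above) that adds two 32-bit values using only 8-bit additions plus an explicit carry, the way the adc sequences in the disassembly do:

import Data.Bits (shiftL, shiftR)
import Data.Word (Word8, Word16, Word32)

-- Split a 32-bit value into four little-endian 8-bit limbs.
limbs :: Word32 -> [Word8]
limbs v = [ fromIntegral (v `shiftR` s) | s <- [0, 8, 16, 24] ]

-- One 8-bit add with carry-in, producing a sum limb and a carry-out,
-- like an ADC instruction.
adc8 :: Word8 -> Word8 -> Word8 -> (Word8, Word8)
adc8 x y cin = (fromIntegral s, fromIntegral (s `shiftR` 8))
  where
    s :: Word16
    s = fromIntegral x + fromIntegral y + fromIntegral cin

-- A 32-bit addition as a ripple of four 8-bit ADCs; the final carry-out
-- is dropped, matching the modulo-2^32 wraparound of a plain 32-bit add.
add32via8 :: Word32 -> Word32 -> Word32
add32via8 a b = go 0 0 (zip (limbs a) (limbs b)) 0
  where
    go _     _     []            acc = acc
    go shift carry ((x, y) : xs) acc =
      let (limb, carry') = adc8 x y carry
      in go (shift + 8) carry' xs (acc + (fromIntegral limb `shiftL` shift))

For every pair of inputs, add32via8 a b equals a + b on Word32.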

Getting TPM's public EK: meaning of leading/trailing bits

I have been trying to get a TPM's EK's public key using two methods:
using Hyper-V's Get-PlatformIdentifier I get the following result:
3082010a0282010100<EKPUBLICKEY>0203010001
Using Urchin's C Library:
<EKPUBLICKEY>
Can anyone explain what 3082010a0282010100 and 0203010001 mean/encode?
It is the DER encoding of ASN.1 types.
For example, 3082010A0282010100<KEY>0203010001 breaks down as follows:
30: SEQUENCE type
82010A: a SEQUENCE of length 0x010A (the leading 82, being greater than 80, indicates that the length information occupies the following 2 bytes)
02: INTEGER type
820101: an INTEGER of length 0x0101 (257 bytes)
00<KEY>: this integer is the modulus; the leading 00 marks it as a positive integer, so after dropping the 00 the modulus is 256 bytes
Finally, the exponent:
0203010001: 02 is the INTEGER type, 03 is the length (3 bytes), and 010001 is the exponent (65537)
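A minimal sketch of decoding just these tag/length headers by hand (for illustration only; real code should use an ASN.1/DER library):

import Data.Word (Word8)

-- Decode one DER tag-length header from a list of bytes, returning
-- (tag, content length, remaining bytes). Handles only the definite
-- length forms that occur in this key: short form, and long form with
-- one (0x81) or two (0x82) length bytes.
derHeader :: [Word8] -> Maybe (Word8, Int, [Word8])
derHeader (tag : l : rest)
  | l < 0x80 = Just (tag, fromIntegral l, rest)
  | l == 0x81, (a : rest') <- rest =
      Just (tag, fromIntegral a, rest')
  | l == 0x82, (a : b : rest') <- rest =
      Just (tag, fromIntegral a * 256 + fromIntegral b, rest')
derHeader _ = Nothing

Applied to the Hyper-V output, the first call yields tag 0x30 (SEQUENCE) with length 0x010A, and a second call on the remainder yields tag 0x02 (INTEGER) with length 0x0101, matching the breakdown above.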

How to divide a BCD by 2 on an 8085 processor?

On an 8085 processor, an efficient algorithm for dividing a BCD by 2 comes in handy when converting a BCD to binary representation. You might think of repeated subtraction or multiplying by 0.5; however, these algorithms require lengthy arithmetic.
Therefore, I would like to share with you the following code (in 8085 assembler) that does it more efficiently. The code has been thoroughly tested on the GNUSim8085 and ASM80 emulators. If this code was helpful to you, please share your experience with me.
Before running the code, put the BCD in register A. Set the carry flag if there is a remainder to be received from a more significant byte (worth 50). After execution, register A will contain the result. The carry flag is used to pass the remainder, if any, to the next less significant byte.
The algorithm uses the DAA instruction after manipulating the C and AC flags in a very special way, thus taking into account that any remainder passed down to the next nibble (i.e. half-octet) is worth 5 instead of 8.
;Division of BCD by 2 on an 8085 processor
;Set initial values.
;Register A contains a two-digit BCD. Carry flag contains remainder.
stc
cmc
mvi a, 85H
;Do modified decimal adjust before division.
cmc
cma
rar
adc a
cma
daa
cmc
;Divide by 2.
rar
;Save quotient and remainder to registers B and C.
mov b, a
mvi a, 00H
rar
mov c, a
;Continue working on decimal adjust.
mov a, b
sui 33H
mov b, a
mov a, c
ral
mov a, b
hlt
Suppose a two-digit BCD number is represented as: D7 D6 D5 D4 D3 D2 D1 D0.
For a division by 2 in binary (or hex), simply right-shift the number by one place. If there is an overflow, the remainder is 1, and 0 otherwise. The same thing applies to a two-digit (8-bit) BCD number when D4 is 0, i.e. when no bit is shifted out of the higher-order four bits into the lower ones. Now if D4 is 1 (before the shift), the shift will introduce an 8 (1000) into the lower-order four bits, which apparently jeopardizes this process. Observe that in BCD the shifted-down bit should be worth 10/2 = 5, not 16/2 = 8. Thus we can simply adjust by subtracting 8 - 5 = 3 from the lower-order four bits, i.e. 03H from the entire number. The following code summarizes this strategy. We assume the accumulator holds the data; after the division, the result is kept in the accumulator and the remainder is kept in register B.
MVI B,00H ; remainder = 0
STC
CMC ; clear the carry flag
RAR ; right shift the data
JNC SKIP
INR B ; CY=1 so, remainder = 1
SKIP: MOV D,A ; backup
ANI 08H ; if get D3 after the shift (or D4 before the shift)
MOV A,D ; get the data from backup
JZ FIN ; if D4 before the shift was 0
SUI 03H ; adjustment for the shift
FIN: HLT ; A has the result, B has the remainder
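For reference, a small Haskell model of the same shift-and-adjust strategy (my own cross-check, operating on a packed two-digit BCD byte, with no incoming remainder from a more significant byte):

import Data.Bits (shiftR, testBit, (.&.))
import Data.Word (Word8)

-- Divide a packed two-digit BCD byte by 2, returning (quotient, remainder).
-- Mirrors the second routine above: shift right, then subtract 03H when a
-- bit crossed from the tens digit into the ones digit.
bcdDiv2 :: Word8 -> (Word8, Word8)
bcdDiv2 a = (adjusted, remainder)
  where
    remainder = a .&. 0x01                 -- the bit shifted out at the bottom
    shifted   = a `shiftR` 1
    adjusted
      | testBit shifted 3 = shifted - 0x03 -- D4 was set: the shifted-in 8 is worth 5 in BCD
      | otherwise         = shifted

For example, bcdDiv2 0x37 gives (0x18, 1): 37 divided by 2 is 18 with remainder 1.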