Optimizing RC4 with CUDA - optimization

I have made some attempts to implement an efficient RC4 cipher in CUDA. I used shared memory to store the internal permutation state, taking care of the banked memory layout to avoid bank-conflict penalties when the threads of a warp access it in parallel. I also tried to minimize the number of accesses by exploiting the fact that reads/writes with the 'i' index are contiguous and can be packed into 32-bit words. Lastly, I used constant memory to initialize the permutation state.
Despite these 'clever' tricks, I only achieve roughly 50% of the throughput of the best reported implementations (see the guapdf cracker, for example), even taking into account that non-blocking host/GPU transfers could be used to partially hide the computation. I can't figure out why, and I am looking for new improvement ideas or comments on bad assumptions I may have made.
Here is a toy implementation of my KSA (key scheduling) kernel, with the key reduced to 4 bytes.
__constant__ unsigned int c_init[256*32/4];
__global__ void rc4Block(unsigned int *d_out, unsigned int *d_in)
{
__shared__ unsigned int s_data[256*32/4];
int inOffset = blockDim.x * blockIdx.x;
int in = inOffset + threadIdx.x;
unsigned int key;
// initialization
key = d_in[in];
for(int i=0; i<(256/4); i++) { // read from constant memory
s_data[i*32+threadIdx.x] = c_init[i*32+threadIdx.x];
}
// key mixing
unsigned char j = 0;
unsigned char k0 = key & 0xFF;
unsigned char k1 = (key >> 8) & 0xFF;
unsigned char k2 = (key >> 16) & 0xFF;
unsigned char k3 = (key >> 24) & 0xFF;
for(int i=0; i<256; i+=4) { // unrolled
unsigned int u, sj, v;
unsigned int si = s_data[(i/4)*32+threadIdx.x];
unsigned int shiftj;
u = si & 0xff;
j = (j + k0 + u) & 0xFF;
sj = s_data[(j/4)*32+threadIdx.x];
shiftj = 8*(j%4);
v = (sj >> shiftj) & 0xff;
si = (si & 0xffffff00) | v;
sj = (sj & ~(0xFFu << (8*(j%4)))) | (u << shiftj);
s_data[(j/4)*32+threadIdx.x] = sj;
u = (si >> 8) & 0xff;
j = (j + k1 + u) & 0xFF;
sj = s_data[(j/4)*32+threadIdx.x];
shiftj = 8*(j%4);
v = (sj >> shiftj) & 0xff;
si = (si & 0xffff00ff) | (v<<8);
sj = (sj & ~(0xFFu << (8*(j%4)))) | (u << shiftj);
s_data[(j/4)*32+threadIdx.x] = sj;
u = (si >> 16) & 0xff;
j = (j + k2 +u) & 0xFF;
sj = s_data[(j/4)*32+threadIdx.x];
shiftj = 8*(j%4);
v = (sj >> shiftj) & 0xff;
si = (si & 0xff00ffff) | (v<<16);
sj = (sj & ~(0xFFu << (8*(j%4)))) | (u << shiftj);
s_data[(j/4)*32+threadIdx.x] = sj;
u = (si >> 24) & 0xff;
j = (j + k3 + u) & 0xFF;
sj = s_data[(j/4)*32+threadIdx.x];
shiftj = 8*(j%4);
v = (sj >> shiftj) & 0xff;
si = (si & 0xffffff) | (v<<24);
sj = (sj & ~(0xFFu << (8*(j%4)))) | (u << shiftj);
s_data[(j/4)*32+threadIdx.x] = sj;
s_data[(i/4)*32+threadIdx.x] = si;
}
d_out[in] = s_data[threadIdx.x]; // irrelevant debug output
}
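For completeness, c_init just holds the identity permutation, packed 4 bytes per word and replicated for the 32 lanes; a host-side routine along these lines can fill it (initConstantState is only an illustrative sketch, assumed to live in the same .cu file as the __constant__ declaration):

// Pack the identity permutation S[i] = i into 32-bit words, one copy per warp lane.
// Word w of lane t holds bytes S[4w], S[4w+1], S[4w+2], S[4w+3] (little-endian),
// matching the s_data[(i/4)*32 + threadIdx.x] indexing used in the kernel.
void initConstantState()
{
    unsigned int h_init[256 * 32 / 4];
    for (int w = 0; w < 256 / 4; ++w) {
        unsigned int packed = (4u * w) | ((4u * w + 1) << 8) |
                              ((4u * w + 2) << 16) | ((4u * w + 3) << 24);
        for (int t = 0; t < 32; ++t)
            h_init[w * 32 + t] = packed;
    }
    cudaMemcpyToSymbol(c_init, h_init, sizeof(h_init));
}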

It seems the code at least partially involves re-ordering bytes. If you are using a Fermi-class GPU, you could look into the __byte_perm() intrinsic, which maps to a hardware instruction on those devices and allows bytes to be re-ordered more efficiently than with shift/mask sequences.
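For example, something along these lines could replace the explicit shift/mask sequences (get_byte / put_byte are just illustrative helper names, and this targets sm_20 or later; it is a sketch, not a drop-in patch for the kernel above):

// Extract byte 'pos' (0..3) of packed word w into the low byte of the result.
// Result bytes 1..3 select byte 4, i.e. the low byte of the zero operand.
__device__ unsigned int get_byte(unsigned int w, unsigned int pos)
{
    return __byte_perm(w, 0, 0x4440 + pos);
}

// Replace byte 'pos' of packed word w with the low byte of b.
// Start from the identity selector 0x3210 and redirect nibble 'pos' to byte 4,
// which is the low byte of the second operand.
__device__ unsigned int put_byte(unsigned int w, unsigned int b, unsigned int pos)
{
    unsigned int sel = (0x3210u & ~(0xFu << (4 * pos))) | (4u << (4 * pos));
    return __byte_perm(w, b, sel);
}

With helpers like these, v = (sj >> shiftj) & 0xff becomes v = get_byte(sj, j % 4), and the corresponding byte insertion becomes sj = put_byte(sj, u, j % 4).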
I assume when you compare to other implementations it is apples-to-apples, i.e. on the same type of GPU. This code looks entirely compute bound, so the throughput will largely depend on the integer-instruction throughput of the GPU, and the performance spectrum is wide.

Related

Fibonacci shift register pseudo-random number generator

I am attempting to get the following code working for a Fibonacci shift register to generate pseudo-random numbers. I can't seem to get it working, so are there any obvious issues?
Shared Function Main() As Integer
Dim start_state As UShort = &HACE1UI ' Any nonzero start state will work.
Dim lfsr As UShort = start_state
Dim bit As UInteger
Dim period As UInteger = 0
Do While lfsr <> start_state
' taps: 16 14 13 11; feedback polynomial: x^16 + x^14 + x^13 + x^11 + 1
bit = ((lfsr >> 0) Xor (lfsr >> 2) Xor (lfsr >> 3) Xor (lfsr >> 5)) And 1
lfsr = (lfsr >> 1) Or (bit << 15)
period += 1
Loop
Return 0
End Function
Last, does "period" need to be divided by a large integer to get U(0,1)'s?
Below is the original C++ code:
# include <stdint.h>
int main(void)
{
uint16_t start_state = 0xACE1u; /* Any nonzero start state will work. */
uint16_t lfsr = start_state;
uint16_t bit; /* Must be 16bit to allow bit<<15 later in the code */
unsigned period = 0;
do
{
/* taps: 16 14 13 11; feedback polynomial: x^16 + x^14 + x^13 + x^11 + 1 */
bit = ((lfsr >> 0) ^ (lfsr >> 2) ^ (lfsr >> 3) ^ (lfsr >> 5) ) & 1;
lfsr = (lfsr >> 1) | (bit << 15);
++period;
} while (lfsr != start_state);
return 0;
}
As in #dummy's comment,
Do While lfsr <> start_state
...
Loop
doesn't run because lfsr = start_state at the beginning.
The code equivalent to C++
do {
...
} while (lfsr != start_state);
in VB.NET is
Do
...
Loop While lfsr <> start_state

AES 128 CTR Mode Bit Shifting to Create Counter

I have access to some VC++ source code which I am trying to convert to VB.NET. I previously asked a question regarding bit shifting, and although the answers given made sense and seemed rather simple to convert over to VB.NET, I am having difficulty getting things to work out. Here is some VC++ code that I need to convert to VB.NET:
#define bitShift(_val) \
((u64)(((((u64)_val) & 0xff00000000000000ull) >> 56) | \
((((u64)_val) & 0x00ff000000000000ull) >> 40) | \
((((u64)_val) & 0x0000ff0000000000ull) >> 24) | \
((((u64)_val) & 0x000000ff00000000ull) >> 8 ) | \
((((u64)_val) & 0x00000000ff000000ull) << 8 ) | \
((((u64)_val) & 0x0000000000ff0000ull) << 24) | \
((((u64)_val) & 0x000000000000ff00ull) << 40) | \
((((u64)_val) & 0x00000000000000ffull) << 56)))
Now, the returned value will be used as the counter for AES decryption in CTR Mode. The following VC++ code is used to calculate the counter:
u8 counter[16];
*(u64 *)(counter + 0) = bitShift(i);
*(u64 *)(counter + 8) = 0;
This is where I am currently at with the VB.NET code:
Public Function SwapBits(ByVal value As Int64) As Int64
Dim uvalue As UInt64 = CULng(value)
Dim swapped As UInt64 = ((&HFF00000000000000UL) And (uvalue >> 56) Or _
(&HFF000000000000L) And (uvalue >> 40) Or _
(&HFF0000000000L) And (uvalue >> 24) Or _
(&HFF00000000L) And (uvalue >> 8) Or _
(&HFF000000UI) And (uvalue << 8) Or _
(&HFF0000) And (uvalue << 24) Or _
(&HFF00) And (uvalue << 40) Or _
(&HFF) And (uvalue << 56))
Return CLng(swapped)
End Function
Here is the code used to create the counter:
Dim blocks As Integer = file_size \ 16
For i As Integer = 0 To blocks - 1
Dim buffer As Byte() = New Byte(15) {}
Array.Copy(BitConverter.GetBytes(SwapBits(CULng(i))), 0, buffer, 0, 8)
'AES decryption takes place after this...
The counter is 16 bytes, but only the first 8 bytes are encrypted using AES 128-bit ECB and then XOR'd with the current encrypted block of data, which is also 16 bytes (AES CTR mode). I can get the code to run without any errors, but the output of decrypted data is incorrect, which leads me to believe I am not calculating the counter used for encryption correctly.
Once again, any help is obviously appreciated, and thanks in advance!
EDIT: Current SwapBits function... still not right though
Public Function SwapBits(ByVal uvalue As UInt64) As UInt64
Dim swapped As UInt64 = ((((uvalue) And &HFF00000000000000) >> 56) Or _
(((uvalue) And &HFF000000000000) >> 40) Or _
(((uvalue) And &HFF0000000000) >> 24) Or _
(((uvalue) And &HFF00000000) >> 8) Or _
(((uvalue) And &HFF000000) << 8) Or _
(((uvalue) And &HFF0000) << 24) Or _
(((uvalue) And &HFF00) << 40) Or _
(((uvalue) And &HFF) << 56))
Return swapped
End Function
This actually causes an "Arithmetic operation resulted in an overflow." error when uvalue reaches a value of 128. When the value of 1 is passed to SwapBits, my return value = 72057594037927936. My interpretation of the VC++ code is that my counter should simply be a 16 byte array incrementing by 1 each time. For example, if
uvalue = 1
then my counter needs to be
0000000100000000
if
uvalue = 25
then my counter needs to be
0000002500000000
etc, etc... Or am I misinterpreting something somewhere?
Not sure what you're expecting from the C++ code. But when I use this:
#include <iostream>
using namespace std;
#define bitShift(_val) \
((unsigned __int64)(((((unsigned __int64)_val) & 0xff00000000000000ull) >> 56) | \
((((unsigned __int64)_val) & 0x00ff000000000000ull) >> 40) | \
((((unsigned __int64)_val) & 0x0000ff0000000000ull) >> 24) | \
((((unsigned __int64)_val) & 0x000000ff00000000ull) >> 8 ) | \
((((unsigned __int64)_val) & 0x00000000ff000000ull) << 8 ) | \
((((unsigned __int64)_val) & 0x0000000000ff0000ull) << 24) | \
((((unsigned __int64)_val) & 0x000000000000ff00ull) << 40) | \
((((unsigned __int64)_val) & 0x00000000000000ffull) << 56)))
int main()
{
unsigned __int64 test = bitShift(25);
return 0;
}
I get the exact same return value (1801439850948198400, i.e. &H1900000000000000) as this:
Dim result As ULong = SwapBits(25)
Public Function SwapBits(ByVal uvalue As UInt64) As UInt64
Dim swapped As UInt64 = ((((uvalue) And &HFF00000000000000UL) >> 56) Or _
(((uvalue) And &HFF000000000000UL) >> 40) Or _
(((uvalue) And &HFF0000000000UL) >> 24) Or _
(((uvalue) And &HFF00000000UL) >> 8) Or _
(((uvalue) And &HFF000000UL) << 8) Or _
(((uvalue) And &HFF0000UL) << 24) Or _
(((uvalue) And &HFF00UL) << 40) Or _
(((uvalue) And &HFFUL) << 56))
Return swapped
End Function
I don't have much experience in C++, care to share what this is doing:
u8 counter[16];
*(u64 *)(counter + 0) = bitShift(i);
*(u64 *)(counter + 8) = 0;
Basically, that section of code increments the first 8 bytes of counter by 1 on each iteration of i, starting with the right-most byte and carrying left on overflow. For instance, when the counter reaches 999, counter[7] will hold 231 (&HE7) and counter[6] will hold 3 (&H3), which, looking at the whole array, gives &H00000000000003E7, which equals 999 decimal.
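A quick standalone C sketch of that behaviour; for i = 999 it reproduces the &H3 / &HE7 bytes mentioned above:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint64_t i = 999;
    uint8_t counter[16] = {0};
    /* same effect as storing bitShift(i) at counter+0 on a little-endian host:
       the block index ends up big-endian in the first 8 bytes of the counter */
    for (int b = 0; b < 8; ++b)
        counter[7 - b] = (uint8_t)(i >> (8 * b));
    printf("counter[6]=%d counter[7]=%d\n", counter[6], counter[7]); /* 3 231 */
    return 0;
}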
Something tells me conversion is better done using the GetBytes and ToUInt64() methods, a for loop and a temporary variable. It would be much easier to read and probably fast enough for most purposes.

How to implement MurmurHash3 in VB.NET

I'm trying to implement MurmurHash3 in VB.NET, converting from this C# implementation.
First part of the function in C#:
public static SqlInt32 MurmurHash3(SqlBinary data)
{
const UInt32 c1 = 0xcc9e2d51;
const UInt32 c2 = 0x1b873593;
int curLength = data.Length; /* Current position in byte array */
int length = curLength; /* the const length we need to fix tail */
UInt32 h1 = seed;
UInt32 k1 = 0;
/* body, eat stream a 32-bit int at a time */
Int32 currentIndex = 0;
while (curLength >= 4)
{
/* Get four bytes from the input into an UInt32 */
k1 = (UInt32)(data[currentIndex++]
| data[currentIndex++] << 8
| data[currentIndex++] << 16
| data[currentIndex++] << 24);
/* bitmagic hash */
k1 *= c1;
k1 = rotl32(k1, 15);
k1 *= c2;
h1 ^= k1;
h1 = rotl32(h1, 13);
h1 = h1 * 5 + 0xe6546b64;
curLength -= 4;
}
And the same in VB.NET:
Public Shared Function MurmurHash3(data As Byte()) As Int32
Const c1 As UInt32 = &HCC9E2D51UI
Const c2 As UInt32 = &H1B873593
Dim curLength As Integer = data.Length
' Current position in byte array
Dim length As Integer = curLength
' the const length we need to fix tail
Dim h1 As UInt32 = seed
Dim k1 As UInt32 = 0
' body, eat stream a 32-bit int at a time
Dim dBytes As Byte()
Dim currentIndex As Int32 = 0
While curLength >= 4
' Get four bytes from the input into an UInt32
dBytes = New Byte() {data(currentIndex), data(currentIndex + 1), data(currentIndex + 2), data(currentIndex + 3)}
k1 = BitConverter.ToUInt32(dBytes, 0)
currentIndex += 4
' bitmagic hash
k1 *= c1
k1 = rotl32(k1, 15)
k1 *= c2
h1 = h1 Xor k1
h1 = rotl32(h1, 13)
h1 = h1 * 5 + &HE6546B64UI
curLength -= 4
End While
Private Shared Function rotl32(x As UInt32, r As Byte) As UInt32
Return (x << r) Or (x >> (32 - r))
End Function
The line
k1 *= c1
throws the error "Arithmetic operation resulted in an overflow."
Any suggestions on how this should be implemented? I'm not sure whether the "Get four bytes from the input into an UInt32" part is the problem, or whether it is related to something else, since there are some differences in bitwise operations between C# and VB.
For reference, a Java implementation also exists:
https://github.com/yonik/java_util/blob/master/src/util/hash/MurmurHash3.java
I'd first convert the 32-bit k1 to a 64-bit variant, e.g.:
k1_64 = CType(k1, UInt64)
then do the multiplication modulo 2^32:
k1_64 = (k1_64 * c1) And &HFFFFFFFFUI
and finally recast back to 32 bits:
k1 = CType(k1_64 And &HFFFFFFFFUI, UInt32)
To add more performance, you might want to consider replacing the BitConverter.ToUInt32 call with something else.
EDIT: Here's a simpler version without an additional variable (but with a 'helper constant'):
Const LOW_32 As UInt32 = &HFFFFFFFFUI
' ... intervening code ...
k1 = (1UL * k1 * c1) And LOW_32
' ... later on ...
h1 = (1UL * h1 * 5UL + &HE6546B64UL) And LOW_32
The 1UL forces the calculation within the parentheses to be performed as ULong (UInt64); a plain Long is not quite enough, since k1 * c1 can exceed Long.MaxValue. The And LOW_32 pares the result down to 32 non-zero bits, and the value is then converted back to UInt32 on assignment. A similar thing happens on the h1 line.
Reference: http://www.undermyhat.org/blog/2009/08/secrets-and-lies-of-type-suffixes-in-c-and-vb-net/ (scroll down to the section "Secrets of constants and type suffixes")
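For comparison, a small C sketch of what the C# original relies on implicitly: unsigned 32-bit arithmetic wraps modulo 2^32, so widening to 64 bits and masking, as done above, gives the same value:

#include <stdint.h>
#include <assert.h>

int main(void)
{
    const uint32_t c1 = 0xcc9e2d51u;
    uint32_t k1 = 0xdeadbeefu;                 /* arbitrary test value */

    uint32_t wrapped = k1 * c1;                /* uint32_t multiply wraps mod 2^32 */
    uint32_t widened = (uint32_t)(((uint64_t)k1 * c1) & 0xFFFFFFFFu);

    assert(wrapped == widened);                /* the two formulations agree */
    return 0;
}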
Unfortunately, it is not possible to do the equivalent of unchecked {} in VB.NET. You could use a Try/Catch block and do the shift manually if you overflow. Just be careful: putting an error handler in there will slow down the hash calculation.

How can I convert an RGB integer to the corresponding RGB tuple (R,G,B)?

How can I convert an ARGB integer to the corresponding ARGB tuple (A,R,G,B)?
I receive some XML where a color tag is given with some integer value (e.g -16777216). I need to draw a rectangle filled with that color. But I am unable to retrieve values of the A,R,G,B components from the integer value.
If the integer is ARGB I think it should be:
unsigned char b = color & 0x000000FF;
unsigned char g = (color>> 8) & 0x000000FF;
unsigned char r = (color>>16) & 0x000000FF;
unsigned char a = (color>>24) & 0x000000FF;
Use bitwise AND and shift right to select individual bytes from the 32-bit integer.
uint32_t color = -16777216;
uint8_t b = (color & 0x000000ff);
uint8_t g = (color & 0x0000ff00) >> 8;
uint8_t r = (color & 0x00ff0000) >> 16;
uint8_t a = (color & 0xff000000) >> 24;
You can try using a union. Something like this (note that the mapping of bytes to fields depends on endianness; the order below assumes a little-endian host, where the least significant byte of the integer is the blue channel):
struct color
{
unsigned char b;
unsigned char g;
unsigned char r;
unsigned char alpha;
};
union argb
{
struct color selector;
unsigned int base;
};
Assign the packed integer to base and read the individual components from selector.
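Pulled together into a complete little test (little-endian host assumed; reading a union member other than the one last written is the usual C idiom for this, though strict C++ does not formally sanction it):

#include <stdio.h>

struct color
{
    unsigned char b;
    unsigned char g;
    unsigned char r;
    unsigned char alpha;
};

union argb
{
    struct color selector;
    unsigned int base;
};

int main(void)
{
    union argb c;
    c.base = (unsigned int)-16777216;   /* 0xFF000000: opaque black */
    printf("A=%d R=%d G=%d B=%d\n",
           c.selector.alpha, c.selector.r, c.selector.g, c.selector.b);
    return 0;
}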
Try the following code:
unsigned a = (color >> 24) & 0x000000FF;
unsigned b = (color >> 16) & 0x000000FF;
unsigned g = (color >> 8) & 0x000000FF;
unsigned r = color & 0x000000FF;
CGFloat rf = (CGFloat)r / 255.f;
CGFloat gf = (CGFloat)g / 255.f;
CGFloat bf = (CGFloat)b / 255.f;
CGFloat af = (CGFloat)a / 255.f;

VB.NET Bit manipulation: how to extract byte from short?

Given this Short (signed):
&Hxxxx
I want to:
Extract the most right &HxxFF as SByte (signed)
Extract the left &H7Fxx as Byte (unsigned)
Identify if the most left &H8xxx is positive or negative (bool result)
Extract the most right 0xxxFF:
myShort & 0x00FF
Extract the left 0xFFxx:
(myShort & 0xFF00) >> 8
Identify if the most left 0x8xxx bit is set, i.e. whether the value is positive or negative (it's a signed short):
(myShort & 0x8000) != 0; // true when the sign bit is set (negative)
Dim test As UInt16 = &HD 'a test value 1101
Dim rb As Byte 'lsb
Dim lb As Byte 'msb - 7 bits
Dim rm As UInt16 = &HFF 'lsb mask
Dim lm As UInt16 = &H7F00 'msb mask
Dim sgn As Byte = &H80 'sign mask
For x As Integer = 0 To 15 'shift the test value one bit at a time
rb = CByte(test And rm) 'get lsb
lb = CByte((test And lm) >> 8) 'get msb
Dim lbS, rbS As Boolean 'sign
'set signs
If (rb And sgn) = sgn Then rbS = True _
Else rbS = False
If (lb And sgn) = sgn Then lbS = True _
Else lbS = False 'should always be false based on mask
Console.WriteLine(String.Format("{0} {1} {2} {3} {4}",
x.ToString.PadLeft(2, " "c),
Convert.ToString(lb, 2).PadLeft(8, "0"c),
Convert.ToString(rb, 2).PadLeft(8, "0"c),
lbS.ToString, rbS.ToString))
test = test << 1
Next
inline char getLsb(short s)
{
return s & 0xff;
}
inline char getMsb(short s)
{
return (s & 0xff00) >> 8;
}
inline bool isBitSet(short s, unsigned pos)
{
return (s & (1 << pos)) > 0;
}
Uh...
value & 0x00ff
(value & 0xff00) >> 8
(value & 0x8000) != 0 // negative if the sign bit is set
EDIT: I suppose you want the byte value and not just the upper 8 bits.
Extract the most right &HxxFF as SByte (signed)
CType(s And &H00FF, SByte) ' note: overflows for low bytes >= &H80 when integer overflow checks are on (the VB default)
Extract the left &H7Fxx as Byte (unsigned)
CType((s And &H7F00) >> 8, Byte)
Identify if the most left &H8xxx is positive or negative (bool result)
(s And &H8000) <> 0 ' True when the sign bit is set, i.e. the Short is negative
I think those work; it's been a while since I have worked in VB.