Calculating a 64-bit checksum of part of a file in Swift - objective-c

I'm trying to port the code for calculating the OpenSubtitles hash, using the Objective-C example as a reference (http://trac.opensubtitles.org/projects/opensubtitles/wiki/HashSourceCodes#Objective-C). The formula for the hash is: file size + 64-bit checksum of the first 64 KiB of the file + 64-bit checksum of the last 64 KiB of the file.
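For reference, here is a minimal, untested C++ sketch of that formula (it assumes the file is at least 64 KiB and ignores error handling and large-file offsets); it is only meant to pin down the algorithm, not to replace the Swift port:

#include <cstdint>
#include <cstdio>
#include <vector>

// hash = file size + 64-bit sum of the first 64 KiB + 64-bit sum of the last 64 KiB
// Wrap-around on overflow is intentional (unsigned arithmetic).
uint64_t openSubtitlesHash(const char *path)
{
    const size_t CHUNK_SIZE = 65536;

    FILE *f = std::fopen(path, "rb");
    if (!f) return 0;

    std::fseek(f, 0, SEEK_END);
    uint64_t fileSize = (uint64_t)std::ftell(f);
    uint64_t hash = fileSize;

    std::vector<uint64_t> chunk(CHUNK_SIZE / sizeof(uint64_t));

    // sum the first 64 KiB as 64-bit words
    std::fseek(f, 0, SEEK_SET);
    std::fread(chunk.data(), 1, CHUNK_SIZE, f);
    for (uint64_t word : chunk) hash += word;

    // sum the last 64 KiB as 64-bit words
    std::fseek(f, (long)(fileSize - CHUNK_SIZE), SEEK_SET);
    std::fread(chunk.data(), 1, CHUNK_SIZE, f);
    for (uint64_t word : chunk) hash += word;

    std::fclose(f);
    return hash;
}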
I'm having trouble with the bit of code that calculates the checksums. This is the important part of the code in Objective-C:
const NSUInteger CHUNK_SIZE=65536;
NSData *fileDataBegin, *fileDataEnd;
uint64_t hash=0;

fileDataBegin = [handle readDataOfLength:(NSUInteger)CHUNK_SIZE];
[handle seekToEndOfFile];
unsigned long long fileSize = [handle offsetInFile];

uint64_t * data_bytes= (uint64_t*)[fileDataBegin bytes];
for( int i=0; i< CHUNK_SIZE/sizeof(uint64_t); i++ )
    hash+=data_bytes[i];
I tried converting most of the code to Swift by just rewriting it in a similar fashion. I'm having trouble coming up with the replacement code for this little bit:
uint64_t * data_bytes= (uint64_t*)[fileDataBegin bytes];
for( int i=0; i< CHUNK_SIZE/sizeof(uint64_t); i++ )
    hash+=data_bytes[i];
Any help would be great.

uint64_t * data_bytes= (uint64_t*)[fileDataBegin bytes];
can be translated as
let data_bytes = UnsafeBufferPointer<UInt64>(
    start: UnsafePointer(fileDataBegin.bytes),
    count: fileDataBegin.length / sizeof(UInt64)
)
which has the additional advantage that data_bytes is not just
a pointer, but also stores the number of elements. An
UnsafeBufferPointer can be treated almost like a Swift Array.
Therefore
for( int i=0; i< CHUNK_SIZE/sizeof(uint64_t); i++ )
    hash+=data_bytes[i];
can be written simply as
var hash : UInt64 = 0
// ...
hash = reduce(data_bytes, hash) { $0 &+ $1 }
using
/// Return the result of repeatedly calling `combine` with an
/// accumulated value initialized to `initial` and each element of
/// `sequence`, in turn.
func reduce<S : SequenceType, U>(sequence: S, initial: U, combine: (U, S.Generator.Element) -> U) -> U
and the "overflow operator" &+:
Unlike arithmetic operators in C, arithmetic operators in Swift do not
overflow by default. Overflow behavior is trapped and reported as an
error. To opt in to overflow behavior, use Swift’s second set of
arithmetic operators that overflow by default, such as the overflow
addition operator (&+). All of these overflow operators begin with an
ampersand (&).
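For context on why a wrapping operator is needed at all: the hash += data_bytes[i] in the Objective-C original relies on the fact that unsigned 64-bit arithmetic in C wraps modulo 2^64, which Swift's checked + would instead treat as an error. A tiny C++ illustration of that baseline behaviour:

#include <cstdint>
#include <cstdio>

int main()
{
    uint64_t h = UINT64_MAX;
    h += 1;                                       // well-defined: wraps to 0 instead of trapping
    std::printf("%llu\n", (unsigned long long)h); // prints 0
    return 0;
}

Swift's &+ reproduces exactly this wrap-around, so the two checksum loops produce the same result.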

Related

Gather AVX2&512 intrinsic for 16-bit integers?

Imagine this piece of code:
void Function(int16 *src, int *indices, float *dst, int cnt, float mul)
{
for (int i=0; i<cnt; i++) dst[i] = float(src[indices[i]]) * mul;
};
This really asks for gather intrinsics, e.g. _mm_i32gather_epi32. I've had great success with these when loading floats, but are there any for 16-bit ints? Another problem here is that I need to go from 16 bits on the input to 32 bits (float) on the output.
There is indeed no instruction to gather 16-bit integers, but (assuming there is no risk of a memory-access violation) you can just load 32-bit integers starting at the corresponding addresses and mask out the upper half of each value.
For uint16_t this would be a simple bit-and; for signed integers you can shift the values to the left so that the sign bit ends up in the most-significant position. You can then (arithmetically) shift the values back before converting them to float, or, since you multiply them anyway, just scale the multiplication factor accordingly.
Alternatively, you could load from two bytes earlier and arithmetically shift to the right. Either way, your bottleneck will likely be the load ports (vpgatherdd requires 8 load uops; together with the load for the indices you have 9 loads distributed over two ports, which should result in 4.5 cycles for 8 elements).
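A scalar sketch of the same trick for a single element may make it clearer (a hypothetical helper, untested; it assumes reading 4 bytes at the element's address cannot fault):

#include <cstdint>
#include <cstring>

// Read 32 bits starting at the 16-bit element: the low half is the wanted value,
// the high half is garbage from the neighbouring element. Shifting left by 16
// discards the garbage and puts the sign bit in place; the factor of 0x10000 is
// folded into the multiplier, exactly as in the vectorized code below.
float gather_one(const int16_t *src, int idx, float mul)
{
    uint32_t raw;
    std::memcpy(&raw, reinterpret_cast<const unsigned char *>(src) + 2 * idx, 4);
    int32_t shifted = static_cast<int32_t>(raw << 16);  // value * 0x10000, correctly signed
    return static_cast<float>(shifted) * (mul * (1.0f / 0x10000));
}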
Untested possible AVX2 implementation (it does not handle the last elements; if cnt is not a multiple of 8, just execute your original loop at the end):
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void Function(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    __m256 mul = _mm256_set1_ps(mul_ * float(1.0f / 0x10000));
    for (size_t i = 0; i + 8 <= cnt; i += 8) { // todo: handle last elements
        // load indices:
        __m256i idx = _mm256_loadu_si256(reinterpret_cast<__m256i const *>(indices + i));
        // load 16-bit integers into the lower halves + garbage into the upper halves:
        __m256i values = _mm256_i32gather_epi32(reinterpret_cast<int const *>(src), idx, 2);
        // shift each value into the upper half (removes the garbage, puts the sign bit in the right place)
        // values are now too large by a factor of 0x10000
        values = _mm256_slli_epi32(values, 16);
        // convert to float, scale and multiply:
        __m256 fvalues = _mm256_mul_ps(_mm256_cvtepi32_ps(values), mul);
        // store the result:
        _mm256_storeu_ps(dst + i, fvalues);
    }
}
Porting this to AVX-512 should be straightforward.
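As a starting point, an equally untested AVX-512 sketch of the same loop, processing 16 elements per iteration (note that the AVX-512 gather intrinsic takes the index vector as its first argument; tail handling is again left out):

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

void Function512(int16_t const *src, int const *indices, float *dst, size_t cnt, float mul_)
{
    __m512 mul = _mm512_set1_ps(mul_ * (1.0f / 0x10000));
    for (size_t i = 0; i + 16 <= cnt; i += 16) { // todo: handle last elements
        __m512i idx = _mm512_loadu_si512(indices + i);
        // 32-bit gather starting at each 16-bit element (scale 2)
        __m512i values = _mm512_i32gather_epi32(idx, src, 2);
        // move the wanted 16 bits into the upper half, as in the AVX2 version
        values = _mm512_slli_epi32(values, 16);
        __m512 fvalues = _mm512_mul_ps(_mm512_cvtepi32_ps(values), mul);
        _mm512_storeu_ps(dst + i, fvalues);
    }
}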

Implementing a side channel timing attack

I'm working on a project implementing a side-channel timing attack in C on HMAC. I've done so by computing the hex-encoded tag and brute-forcing it byte by byte, taking advantage of strcmp's early-exit optimization. So for every digit in my test tag, I measure how long each hex char takes to verify. I take the hex char that corresponds to the highest measured time, infer that it is the correct char in the tag, and move on to the next byte. However, strcmp's timing is very unpredictable. Although it is easy to see the timing difference between comparing two equal strings and two totally different strings, I'm having difficulty finding the char that takes my test string the most time to compare when every other string I'm comparing against is very similar (differing by only 1 byte).
The changeByte method below takes in customTag, the tag that has been computed up to that point, and attempts to find the correct byte at position index. changeByte is called n times, where n is the length of the tag. hexTag is a global variable holding the correct tag. timeCompleted stores the average time taken to compare tempTag for each of the 16 hex characters tried at that position. Any help would be appreciated; thank you for your time.
// Checks if the index of the given byte is correct or not
void changeByte(unsigned char *k, unsigned char *m, unsigned char *algorithm, unsigned char *customTag, int index)
{
    long iterations = 50000;
    // used for every byte sequence to test the timing
    unsigned char *tempTag = (unsigned char *)(malloc(sizeof(unsigned char) * (strlen(customTag) + 1)));
    sprintf(tempTag, "%s", customTag);
    int timeIndex = 0;
    // stores the time completed for every respective ascii char
    double *timeCompleted = (double *)(malloc(sizeof(double) * 16));
    // iterates through the hex chars 0-9, a-f
    for (int i = 48; i <= 102; i++) {
        if (i >= 58 && i <= 96) continue;
        double total = 0;
        for (long j = 0; j < iterations; j++) {
            // measures the time it takes to compare for every char in that position
            tempTag[index] = (unsigned char)i;
            struct rusage usage;
            struct timeval start, end;
            getrusage(RUSAGE_SELF, &usage);
            start = usage.ru_stime;
            for (int k = 0; k < 50000; k++) externalStrcmp(tempTag, hexTag); // this just calls strcmp in another file
            getrusage(RUSAGE_SELF, &usage);
            end = usage.ru_stime;
            double startTime = ((double)start.tv_sec + (double)start.tv_usec) / 10000;
            double endTime = ((double)end.tv_sec + (double)end.tv_usec) / 10000;
            total += endTime - startTime;
        }
        double val = total / iterations;
        timeCompleted[timeIndex] = val;
        timeIndex++;
    }
    // sets the next char equal to the hex char corresponding to the index
    customTag[index] = getCorrectChar(timeCompleted);
    free(timeCompleted);
    free(tempTag);
}

// finds the highest time. The hex char corresponding to the highest time it took the
// verify function to complete is the correct one
unsigned char getCorrectChar(double *timeCompleted)
{
    double high = -1;
    int index = 0;
    for (int i = 0; i < 16; i++) {
        if (timeCompleted[i] > high) {
            high = timeCompleted[i];
            index = i;
        }
    }
    return (index + 48) <= 57 ? (unsigned char)(index + 48) : (unsigned char)(index + 87);
}
// finds the highest time. The hex char corresponding with the highest time it took the
// verify function to complete is the correct one
unsigned char getCorrectChar(double * timeCompleted)
{
double high =-1;
int index=0;
for (int i=0; i<16; i++){
if (timeCompleted[i]>high){
high=timeCompleted[i];
index=i;
}
}
return (index+48)<=57 ?(unsigned char) (index+48) : (unsigned char)(index+87);
}
I'm not sure if it's the main problem, but you add seconds to microseconds directly, as though 1 us == 1 s. This gives wrong results whenever the number of seconds in startTime and endTime differs.
Also, the scaling factor between usec and sec is 1,000,000 (thanks zaph). So this should work better:
double startTime=(double)start.tv_sec + (double)start.tv_usec/1000000;
double endTime=(double)end.tv_sec + (double)end.tv_usec/1000000;
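If you prefer, the same correction can be packaged in a small helper (just a sketch; subtracting the fields first also avoids losing precision when tv_sec gets large):

#include <sys/time.h>

// Elapsed time in seconds between two struct timeval samples
// (for example the ru_stime values obtained from getrusage).
static double elapsedSeconds(struct timeval start, struct timeval end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_usec - start.tv_usec) / 1000000.0;
}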

Have a memset issue in C++/CLI

I'm a newbie currently using C++/CLI to wrap a few classes from my .lib file, and I am in dire need of using memset in my C++/CLI code. Does anyone here know how to use memset in C++/CLI?
The C++ code I'm trying to use in my C++/CLI code:
memset(&DeviceInfo, 0, sizeof(FS_DEVICE_INFO));
Here's my C++/CLI code, where I get an error when I try to use the same memset line from my C++ code:
bool newIFSWDevice::GetDeviceInfo(PFS_DEVICE_INFO pDevInfo)
{
    IFSDevice *pDeviceWheel = nullptr;
    FS_DEVICE_INFO DeviceInfo;
    int x = 0;

    while (nullptr != (pDeviceWheel = newFSDeviceEnumerator::EnumerateInstance(x++)))
    {
        memset(&DeviceInfo, 0, sizeof(FS_DEVICE_INFO)); // error line
        pDeviceWheel->GetDeviceInfo(&DeviceInfo);
        if (0 == wcscmp(DeviceInfo.Name, FS_DEVICE_WHEEL_PORSCHE_NAME))
            break;
I tried using a for loop instead...
for (int i = 0; i <= sizeof(FS_DEVICE_INFO); i++)
    FS_DEVICE_INFO[i] = 0;
But it still gives me an error "expression must have a constant value". Help would be much appreciated! :)
As noted in the comments, you're missing the header file #include <string.h>. See the documentation.
It's also worth noting that your for loop to do the clearing has several problems:
for (int i = 0; i <= sizeof(FS_DEVICE_INFO); i++)
    FS_DEVICE_INFO[i] = 0;
sizeof(FS_DEVICE_INFO) gives you the size of that struct in bytes, but FS_DEVICE_INFO[i] indexes into an array of structs: [1] would be the second struct in the array, not the second byte! You would need to cast the pointer to char or something similar.
i <= sizeof(FS_DEVICE_INFO): The <= is incorrect. If the struct is 10 bytes large, you'd end up operating on bytes 0 through 10, which is 11 bytes total, stomping on whatever happened to be after the struct.
FS_DEVICE_INFO[i]: FS_DEVICE_INFO is the name of the class, your local variable is DeviceInfo, so this should be DeviceInfo[i]. This is why you're getting the expression must have a constant value error.
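Putting those points together, here is a sketch of what the corrected clearing code could look like (clearDeviceInfo is a hypothetical helper; FS_DEVICE_INFO is the native struct from the question, so this is not compilable on its own):

#include <string.h> // declares memset (and size_t)

// FS_DEVICE_INFO is the native struct from the question.
void clearDeviceInfo(FS_DEVICE_INFO &DeviceInfo)
{
    // simplest fix: clear the variable with memset
    memset(&DeviceInfo, 0, sizeof(DeviceInfo));

    // equivalent byte-by-byte loop, shown only to illustrate the fixes above:
    // index through a char pointer to the variable (not the type), and use < rather than <=
    unsigned char *bytes = (unsigned char *)&DeviceInfo;
    for (size_t i = 0; i < sizeof(DeviceInfo); i++)
        bytes[i] = 0;
}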

Optimizing a Bit-Wise Manipulation Kernel

I have the following code, which progressively goes through a string of bits and rearranges them into blocks of 20 bytes. I'm using 32*8 blocks with 40 threads per block. However, the process takes something like 36 ms on my GT630M. Are there any further optimizations I can do, especially with regard to removing the if-else in the innermost loop?
__global__ void test(unsigned char *data)
{
    __shared__ unsigned char dataBlock[20];
    __shared__ int count;

    count = 0;
    unsigned char temp = 0x00;

    for (count = 0; count < (streamSize / 8); count++)
    {
        for (int i = 0; i < 8; i++)
        {
            if (blockIdx.y >= i)
                temp |= (*(data + threadIdx.x * (blockIdx.x + gridDim.x * (i + count))) & (0x01 << blockIdx.y)) >> (blockIdx.y - i);
            else
                temp |= (*(data + threadIdx.x * (blockIdx.x + gridDim.x * (i + count))) & (0x01 << blockIdx.y)) << (i - blockIdx.y);
        }
        dataBlock[threadIdx.x] = temp;
        // do something
    }
}
It's not clear what your code is trying to accomplish, but a couple of obvious opportunities are:
1) if possible, use 32-bit words instead of unsigned char.
2) use block sizes that are multiples of 32.
3) The conditional code may not be costing you as much as you expect. You can check by compiling with --cubin --gpu-architecture sm_xx (where xx is the SM version of your target hardware), and using cuobjdump --dump-sass on the resulting cubin file to look at the generated assembly. You may have to modify the source code to loft the common subexpression into a separate variable, and/or use the ternary operator ? : to hint to the compiler to use predication.
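For point 3, a sketch of what that rewrite of the inner loop could look like (untested; it keeps the kernel's existing indexing, hoists the repeated load-and-mask into a local variable, and replaces the if/else with ?: so the compiler can predicate it):

for (int i = 0; i < 8; i++)
{
    // hoisted common subexpression: the selected bit of the input byte
    unsigned int bit = *(data + threadIdx.x * (blockIdx.x + gridDim.x * (i + count)))
                       & (0x01 << blockIdx.y);
    // one statement instead of if/else, a good candidate for predication
    temp |= (blockIdx.y >= i) ? (bit >> (blockIdx.y - i))
                              : (bit << (i - blockIdx.y));
}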

Using memcpy and malloc resulting in corrupted data stream

The code below attempts to save a data stream to a file using fwrite. The version using malloc works, but with the other version the data stream is 70% corrupted. Can someone explain to me why that version is corrupted and how I can remedy it?
short int fwBuffer[1000000];
// short int *fwBuffer[1000000];
unsigned long fwSize[1000000];

// Not Working *********
if (dataFlow) {
    size = sizeof(short int) * length * inchannels;
    short int tmpbuffer[length * inchannels];
    int count = 0;
    for (count = 0; count < length * inchannels; count++)
    {
        tmpbuffer[count] = (short int)(inbuffer[count]);
    }
    memcpy(&fwBuffer[saveBufferCount], tmpbuffer, sizeof(tmpbuffer));
    fwSize[saveBufferCount] = size;
    saveBufferCount++;
    totalSize += size;
}

// Working ***********
if (dataFlow) {
    size = sizeof(short int) * length * inchannels;
    short int *tmpbuffer = (short int *)malloc(size);
    int count = 0;
    for (count = 0; count < length * inchannels; count++)
    {
        tmpbuffer[count] = (short int)(inbuffer[count]);
    }
    fwBuffer[saveBufferCount] = tmpbuffer;
    fwSize[saveBufferCount] = size;
    saveBufferCount++;
    totalSize += size;
}

// Write to file ***********
for (int i = 0; i < saveBufferCount; i++) {
    if (isRecording && outFile != NULL) {
        // fwrite(fwBuffer[i], 1, fwSize[i], outFile);
        fwrite(&fwBuffer[i], 1, fwSize[i], outFile);
        if (fwBuffer[i] != NULL) {
            // free(fwBuffer[i]);
        }
        fwBuffer[i] = NULL;
    }
}
You initialize your size as
size = sizeof(short int) * length * inchannels;
then you declare an array of size
short int tmpbuffer[size];
This is already highly suspect. Why did you include sizeof(short int) in the size and then declare an array of short int elements with that size? The byte size of your array in this case is
sizeof(short int) * sizeof(short int) * length * inchannels
i.e. the sizeof(short int) is factored in twice.
Later you initialize only length * inchannels elements of the array, which, for the reasons described above, is not the entire array. But the memcpy that follows still copies the entire array:
memcpy(&fwBuffer[saveBufferCount], &tmpbuffer, sizeof (tmpbuffer));
(The tail portion of the copied data is garbage.) I'd suspect that you are copying sizeof(short int) times more data than intended. The recipient memory overflows and gets corrupted.
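Concretely, if length * inchannels is N, then size is sizeof(short int) * N = 2 * N bytes, so
short int tmpbuffer[size];
declares 2 * N elements, i.e. 4 * N bytes. Only the first N elements are ever filled, yet memcpy(..., sizeof(tmpbuffer)) copies all 4 * N bytes into fwBuffer.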
The version based on malloc does not suffer from this problem, since the malloc-ed memory size is specified in bytes, not in short ints.
If you want to simulate the malloc behavior in the upper version of the code, you need to declare your tmpbuffer as an array of char elements, not of short int elements.
This also has a very good chance of crashing:
short int tmpbuffer[(short int)(size)];
First, size could be too big; and truncating it to short int and ending up with whatever size results from that is probably not what you want.
Edit: Try to write the whole code without a single cast. Only then does the compiler have a chance to tell you if something is wrong.