How to obtain the size of a BLOB without reading the BLOB's content - sql

I have the following questions regarding BLOBs in sqlite:
Does sqlite keep track of sizes of BLOBs?
I'm guessing that it does, but then, does the length function use it, or does it read the BLOB's content?
If sqlite keeps track of the size of the BLOB and length doesn't use it, is the size accessible via some other functionality?
I'm asking this because I'm wondering if I should implement triggers that set BLOBs' sizes in additional columns, of if I can obtain the sizes dynamically without the performance hit of sqlite reading the BLOBs.

From the source:
** In an SQLite index record, the serial type is stored directly before
** the blob of data that it corresponds to. In a table record, all serial
** types are stored at the start of the record, and the blobs of data at
** the end. Hence these functions allow the caller to handle the
** serial-type and data blob seperately.
**
** The following table describes the various storage classes for data:
**
** serial type bytes of data type
** -------------- --------------- ---------------
** 0 0 NULL
** 1 1 signed integer
** 2 2 signed integer
** 3 3 signed integer
** 4 4 signed integer
** 5 6 signed integer
** 6 8 signed integer
** 7 8 IEEE float
** 8 0 Integer constant 0
** 9 0 Integer constant 1
** 10,11 reserved for expansion
** N>=12 and even (N-12)/2 BLOB
** N>=13 and odd (N-13)/2 text
In other words, the blob size is in the serial, and it's length is simply "(serial_type-12)/2".
This serial is stored before the actual blob, so you don't need to read the blob to get its size.
Call sqlite3_blob_open and then sqlite3_blob_bytes to get this value.

Write a 1byte and a 10GB blob in a test database. If length() takes the same time for both blobs, the blob's length is probably accessed. Otherwise the blob is probably read.
OR: download the source code and debug through it: http://www.sqlite.org/download.html. These are some relevant bits:
/*
** Implementation of the length() function
*/
static void lengthFunc(
sqlite3_context *context,
int argc,
sqlite3_value **argv
){
int len;
assert( argc==1 );
UNUSED_PARAMETER(argc);
switch( sqlite3_value_type(argv[0]) ){
case SQLITE_BLOB:
case SQLITE_INTEGER:
case SQLITE_FLOAT: {
sqlite3_result_int(context, sqlite3_value_bytes(argv[0]));
break;
}
case SQLITE_TEXT: {
const unsigned char *z = sqlite3_value_text(argv[0]);
if( z==0 ) return;
len = 0;
while( *z ){
len++;
SQLITE_SKIP_UTF8(z);
}
sqlite3_result_int(context, len);
break;
}
default: {
sqlite3_result_null(context);
break;
}
}
}
and then
/*
** Return the number of bytes in the sqlite3_value object assuming
** that it uses the encoding "enc"
*/
SQLITE_PRIVATE int sqlite3ValueBytes(sqlite3_value *pVal, u8 enc){
Mem *p = (Mem*)pVal;
if( (p->flags & MEM_Blob)!=0 || sqlite3ValueText(pVal, enc) ){
if( p->flags & MEM_Zero ){
return p->n + p->u.nZero;
}else{
return p->n;
}
}
return 0;
}
You can see that the length of text data is calculated on the fly. That of blobs... well, I'm not fluent enough in C... :-)

If you have access to the raw c api sqlite3_blob_bytes will do the job for you. If not please provide additional information.

Related

How to make sense of CoreBluetooth data

I've been playing around with CoreBluetooth a lot lately, and although I can connect to some devices, I never seem to be able to properly read the data (characteristic values).
Right now I am connecting with the Wahoo BT Heartrate monitor, and I am getting all the signals but I can't make the data into anything sensible. (Yes, I am aware there is an API, but I am trying to connect without it, to properly get something working with CoreBluetooth).
I have so far not been able to turn the NSData (characteristic.value) into anything sensible. If you have any suggestions on how to make sense of this data that would be very much apreciated.
Below some code to fully parse all the HeartRate measurement characteristic data.
How to process the data depends on several things:
is the BPM written into a single byte or two?
is there EE data present?
calculate the number of RR-interval values, as there can be multiple values within in one message (I have seen up to three).
Here is the actual spec of the Heart_rate_measurement characteristic
// Instance method to get the heart rate BPM information
- (void) getHeartBPMData:(CBCharacteristic *)characteristic error:(NSError *)error
{
// Get the BPM //
// https://developer.bluetooth.org/gatt/characteristics/Pages/CharacteristicViewer.aspx?u=org.bluetooth.characteristic.heart_rate_measurement.xml //
// Convert the contents of the characteristic value to a data-object //
NSData *data = [characteristic value];
// Get the byte sequence of the data-object //
const uint8_t *reportData = [data bytes];
// Initialise the offset variable //
NSUInteger offset = 1;
// Initialise the bpm variable //
uint16_t bpm = 0;
// Next, obtain the first byte at index 0 in the array as defined by reportData[0] and mask out all but the 1st bit //
// The result returned will either be 0, which means that the 2nd bit is not set, or 1 if it is set //
// If the 2nd bit is not set, retrieve the BPM value at the second byte location at index 1 in the array //
if ((reportData[0] & 0x01) == 0) {
// Retrieve the BPM value for the Heart Rate Monitor
bpm = reportData[1];
offset = offset + 1; // Plus 1 byte //
}
else {
// If the second bit is set, retrieve the BPM value at second byte location at index 1 in the array and //
// convert this to a 16-bit value based on the host’s native byte order //
bpm = CFSwapInt16LittleToHost(*(uint16_t *)(&reportData[1]));
offset = offset + 2; // Plus 2 bytes //
}
NSLog(#"bpm: %i", bpm);
// Determine if EE data is present //
// If the 3rd bit of the first byte is 1 this means there is EE data //
// If so, increase offset with 2 bytes //
if ((reportData[0] & 0x03) == 1) {
offset = offset + 2; // Plus 2 bytes //
}
// Determine if RR-interval data is present //
// If the 4th bit of the first byte is 1 this means there is RR data //
if ((reportData[0] & 0x04) == 0)
{
NSLog(#"%#", #"Data are not present");
}
else
{
// The number of RR-interval values is total bytes left / 2 (size of uint16) //
NSUInteger length = [data length];
NSUInteger count = (length - offset)/2;
NSLog(#"RR count: %lu", (unsigned long)count);
for (int i = 0; i < count; i++) {
// The unit for RR interval is 1/1024 seconds //
uint16_t value = CFSwapInt16LittleToHost(*(uint16_t *)(&reportData[offset]));
value = ((double)value / 1024.0 ) * 1000.0;
offset = offset + 2; // Plus 2 bytes //
NSLog(#"RR value %lu: %u", (unsigned long)i, value);
}
}
}
Well, you should be implementing the Heart Rate Profile (see here), which uses the Heart Rate Service. If you look at the Heart Rate Service Specifications, you will see that the format of the Heart Rate Measurement Characteristic changes according to the flags set in the least significant octet of the data packet.
This means that you need to set up your code to handle dynamic packet sizes.
So your general process would be:
Get the first byte of the value property and check it for:
Is the heart rate measurement 8 bits or 16 bits?
Is sensor contact supported?
Is sensor contact detected?
Is Energy Expended supported?
Is RR-Interval measurement supported?
If the heart rate measurement is 8 bits (bit 0 of byte 0 is 0), then cast the next byte into its intended format (hint: it's uint8_t). If it is 16 bits (i.e. bit 0 of byte 0 is 1), then cast the next two bytes into uint16_t.
If Energy Expended is supported (bit 3 of byte 0 is 1), then cast the next byte into a uint16_t.
Do the same with RR-Intervals.
Using NSData - especially with Core Bluetooth - takes some getting used to, but it's not that bad once you grasp the concept.
Good luck!
Well...
What you'll have to do, when you read the value for the characteric :
NSData *data = [characteritic value];
theTypeOfTheData value;
[data getByte:&value lenght:sizeof(value)];
But, theTypeOfTheData can be whatever has thought the developer of the device. So it could be UInt8, UInt16, or struct (with or without bitfield)... You'll have to get info by contacting the developer of the device or looking if there is some documentation.
For example, with my colleague, we use the the type of data that consumes the less space (because the device hasn't infinite memory) according to what is need.
Look into the descriptor of the characteristic. It may points out the type of data.

Using memcpy and malloc resulting in corrupted data stream

The code below attempts to save a data stream to a file using fwrite. The first example using malloc works but with the second example the data stream is %70 corrupted. Can someone explain to me why the second example is corrupted and how I can remedy it?
short int fwBuffer[1000000];
// short int *fwBuffer[1000000];
unsigned long fwSize[1000000];
// Not Working *********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int tmpbuffer[length*inchannels];
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
memcpy(&fwBuffer[saveBufferCount], tmpbuffer, sizeof(tmpbuffer));
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Working ***********
if (dataFlow) {
size = sizeof(short int)*length*inchannels;
short int *tmpbuffer = (short int*)malloc(size);
int count = 0;
for (count = 0; count < length*inchannels; count++)
{
tmpbuffer[count] = (short int) (inbuffer[count]);
}
fwBuffer[saveBufferCount] = tmpbuffer;
fwSize[saveBufferCount] = size;
saveBufferCount++;
totalSize += size;
}
// Write to file ***********
for (int i = 0; i < saveBufferCount; i++) {
if (isRecording && outFile != NULL) {
// fwrite(fwBuffer[i], 1, fwSize[i],outFile);
fwrite(&fwBuffer[i], 1, fwSize[i],outFile);
if (fwBuffer[i] != NULL) {
// free(fwBuffer[i]);
}
fwBuffer[i] = NULL;
}
}
You initialize your size as
size = sizeof(short int) * length * inchannels;
then you declare an array of size
short int tmpbuffer[size];
This is already highly suspect. Why did you include sizeof(short int) into the size and then declare an array of short int elements with that size? The byte size of your array in this case is
sizeof(short int) * sizeof(short int) * length * inchannels
i.e. the sizeof(short int) is factored in twice.
Later you initialize only length * inchannels elements of the array, which is not entire array, for the reasons described above. But the memcpy that follows still copies the entire array
memcpy(&fwBuffer[saveBufferCount], &tmpbuffer, sizeof (tmpbuffer));
(Tail portion of the copied data is garbage). I'd suspect that you are copying sizeof(short int) times more data than was intended. The recipient memory overflows and gets corrupted.
The version based on malloc does not suffer from this problem since malloc-ed memory size is specified in bytes, not in short int-s.
If you want to simulate the malloc behavior in the upper version of the code, you need to declare your tmpbuffer as an array of char elements, not of short int elements.
This has very good chances to crash
short int tmpbuffer[(short int)(size)];
first size could be too big, but then truncating it and having whatever size results of that is probably not what you want.
Edit: Try to write the whole code without a single cast. Only then the compiler has a chance to tell you if there is something wrong.

Objective c, Scanf() string taking in the same value twice

Hi all I am having a strange issue, when i use scanf to input data it repeats strings and saves them as one i am not sure why.
Please Help
/* Assment Label loop - Loops through the assment labels and inputs the percentage and the name for it. */
i = 0;
j = 0;
while (i < totalGradedItems)
{
scanf("%s%d", assLabel[i], &assPercent[i]);
i++;
}
/* Print Statement */
i = 0;
while (i < totalGradedItems)
{
printf("%s", assLabel[i]);
i++;
}
Input Data
Prog1 20
Quiz 20
Prog2 20
Mdtm 15
Final 25
Output Via Console
Prog1QuizQuizProg2MdtmMdtmFinal
Final diagnosis
You don't show your declarations...but you must be allocating just 5 characters for the strings:
When I adjust the enum MAX_ASSESSMENTLEN from 10 to 5 (see the code below) I get the output:
Prog1Quiz 20
Quiz 20
Prog2Mdtm 20
Mdtm 15
Final 25
You did not allow for the terminal null. And you didn't show us what was causing the bug! And the fact that you omitted newlines from the printout obscured the problem.
What's happening is that 'Prog1' is occupying all 5 bytes of the string you read in, and is writing a null at the 6th byte; then Quiz is being read in, starting at the sixth byte.
When printf() goes to read the string for 'Prog1', it stops at the first null, which is the one after the 'z' of 'Quiz', producing the output shown. Repeat for 'Prog2' and 'Mtdm'. If there was an entry after 'Final', it too would suffer. You are lucky that there are enough zero bytes around to prevent any monstrous overruns.
This is a basic buffer overflow (indeed, since the array is on the stack, it is a basic Stack Overflow); you are trying to squeeze 6 characters (Prog1 plus '\0') into a 5 byte space, and it simply does not work well.
Preliminary diagnosis
First, print newlines after your data.
Second, check that scanf() is not returning errors - it probably isn't, but neither you nor we can tell for sure.
Third, are you sure that the data file contains what you say? Plausibly, it contains a pair of 'Quiz' and a pair of 'Mtdm' lines.
Your variable j is unused, incidentally.
You would probably be better off having the input loop run until you are either out of space in the receiving arrays or you get a read failure. However, the code worked for me when dressed up slightly:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
char assLabel[10][10];
int assPercent[10];
int i = 0;
int totalGradedItems = 5;
while (i < totalGradedItems)
{
if (scanf("%9s%d", assLabel[i], &assPercent[i]) != 2)
{
fprintf(stderr, "Error reading\n");
exit(1);
}
i++;
}
/* Print Statement */
i = 0;
while (i < totalGradedItems)
{
printf("%-9s %3d\n", assLabel[i], assPercent[i]);
i++;
}
return 0;
}
For the quoted input data, the output results are:
Prog1 20
Quiz 20
Prog2 20
Mdtm 15
Final 25
I prefer this version, though:
#include <stdio.h>
enum { MAX_GRADES = 10 };
enum { MAX_ASSESSMENTLEN = 10 };
int main(void)
{
char assLabel[MAX_GRADES][MAX_ASSESSMENTLEN];
int assPercent[MAX_GRADES];
int i = 0;
int totalGradedItems;
for (i = 0; i < MAX_GRADES; i++)
{
if (scanf("%9s%d", assLabel[i], &assPercent[i]) != 2)
break;
}
totalGradedItems = i;
for (i = 0; i < totalGradedItems; i++)
printf("%-9s %3d\n", assLabel[i], assPercent[i]);
return 0;
}
Of course, if I'd set up the scanf() format string 'properly' (meaning safely) so as to limit the length of the assessment names to fit into the space allocated, then the loop would stop reading on the second attempt:
...
char format[10];
...
snprintf(format, sizeof(format), "%%%ds%%d", MAX_ASSESSMENTLEN-1);
...
if (scanf(format, assLabel[i], &assPercent[i]) != 2)
With MAX_ASSESSMENTLEN at 5, the snprintf() generates the format string "%4s%d". The code compiled reads:
Prog 1
and stops. The '1' comes from the 5th character of 'Prog1'; the next assessment name is '20', and then the conversion of 'Quiz' into a number fails, causing the input loop to stop (because only one of two expected items was converted).
Despite the nuisance value, if you want to make your scanf() strings adjust to the size of the data variables it is reading into, you have to do something akin to what I did here - format the string using the correct size values.
i guess, you need to put a
scanf("%s%d", assLabel[i], &assPercent[i]);
space between %s and %d here.
And it is not saving as one. You need to put newline or atlease a space after %s on print to see difference.
add:
when i tried
#include <stdio.h>
int main (int argc, const char * argv[])
{
char a[1][2];
for(int i =0;i<3;i++)
scanf("%s",a[i]);
for(int i =0;i<3;i++)
printf("%s",a[i]);
return 0;
}
with inputs
123456
qwerty
sdfgh
output is:
12qwsdfghqwsdfghsdfgh
that proves that, the size of string array need to be bigger then decleared there.

Non repeating random numbers in Objective-C

I'm using
for (int i = 1, i<100, i++)
int i = arc4random() % array count;
but I'm getting repeats every time. How can I fill out the chosen int value from the range, so that when the program loops I will not get any dupe?
It sounds like you want shuffling of a set rather than "true" randomness. Simply create an array where all the positions match the numbers and initialize a counter:
num[ 0] = 0
num[ 1] = 1
: :
num[99] = 99
numNums = 100
Then, whenever you want a random number, use the following method:
idx = rnd (numNums); // return value 0 through numNums-1
val = num[idx]; // get then number at that position.
num[idx] = val[numNums-1]; // remove it from pool by overwriting with highest
numNums--; // and removing the highest position from pool.
return val; // give it back to caller.
This will return a random value from an ever-decreasing pool, guaranteeing no repeats. You will have to beware of the pool running down to zero size of course, and intelligently re-initialize the pool.
This is a more deterministic solution than keeping a list of used numbers and continuing to loop until you find one not in that list. The performance of that sort of algorithm will degrade as the pool gets smaller.
A C function using static values something like this should do the trick. Call it with
int i = myRandom (200);
to set the pool up (with any number zero or greater specifying the size) or
int i = myRandom (-1);
to get the next number from the pool (any negative number will suffice). If the function can't allocate enough memory, it will return -2. If there's no numbers left in the pool, it will return -1 (at which point you could re-initialize the pool if you wish). Here's the function with a unit testing main for you to try out:
#include <stdio.h>
#include <stdlib.h>
#define ERR_NO_NUM -1
#define ERR_NO_MEM -2
int myRandom (int size) {
int i, n;
static int numNums = 0;
static int *numArr = NULL;
// Initialize with a specific size.
if (size >= 0) {
if (numArr != NULL)
free (numArr);
if ((numArr = malloc (sizeof(int) * size)) == NULL)
return ERR_NO_MEM;
for (i = 0; i < size; i++)
numArr[i] = i;
numNums = size;
}
// Error if no numbers left in pool.
if (numNums == 0)
return ERR_NO_NUM;
// Get random number from pool and remove it (rnd in this
// case returns a number between 0 and numNums-1 inclusive).
n = rand() % numNums;
i = numArr[n];
numArr[n] = numArr[numNums-1];
numNums--;
if (numNums == 0) {
free (numArr);
numArr = 0;
}
return i;
}
int main (void) {
int i;
srand (time (NULL));
i = myRandom (20);
while (i >= 0) {
printf ("Number = %3d\n", i);
i = myRandom (-1);
}
printf ("Final = %3d\n", i);
return 0;
}
And here's the output from one run:
Number = 19
Number = 10
Number = 2
Number = 15
Number = 0
Number = 6
Number = 1
Number = 3
Number = 17
Number = 14
Number = 12
Number = 18
Number = 4
Number = 9
Number = 7
Number = 8
Number = 16
Number = 5
Number = 11
Number = 13
Final = -1
Keep in mind that, because it uses statics, it's not safe for calling from two different places if they want to maintain their own separate pools. If that were the case, the statics would be replaced with a buffer (holding count and pool) that would "belong" to the caller (a double-pointer could be passed in for this purpose).
And, if you're looking for the "multiple pool" version, I include it here for completeness.
#include <stdio.h>
#include <stdlib.h>
#define ERR_NO_NUM -1
#define ERR_NO_MEM -2
int myRandom (int size, int *ppPool[]) {
int i, n;
// Initialize with a specific size.
if (size >= 0) {
if (*ppPool != NULL)
free (*ppPool);
if ((*ppPool = malloc (sizeof(int) * (size + 1))) == NULL)
return ERR_NO_MEM;
(*ppPool)[0] = size;
for (i = 0; i < size; i++) {
(*ppPool)[i+1] = i;
}
}
// Error if no numbers left in pool.
if (*ppPool == NULL)
return ERR_NO_NUM;
// Get random number from pool and remove it (rnd in this
// case returns a number between 0 and numNums-1 inclusive).
n = rand() % (*ppPool)[0];
i = (*ppPool)[n+1];
(*ppPool)[n+1] = (*ppPool)[(*ppPool)[0]];
(*ppPool)[0]--;
if ((*ppPool)[0] == 0) {
free (*ppPool);
*ppPool = NULL;
}
return i;
}
int main (void) {
int i;
int *pPool;
srand (time (NULL));
pPool = NULL;
i = myRandom (20, &pPool);
while (i >= 0) {
printf ("Number = %3d\n", i);
i = myRandom (-1, &pPool);
}
printf ("Final = %3d\n", i);
return 0;
}
As you can see from the modified main(), you need to first initialise an int pointer to NULL then pass its address to the myRandom() function. This allows each client (location in the code) to have their own pool which is automatically allocated and freed, although you could still share pools if you wish.
You could use Format-Preserving Encryption to encrypt a counter. Your counter just goes from 0 upwards, and the encryption uses a key of your choice to turn it into a seemingly random value of whatever radix and width you want.
Block ciphers normally have a fixed block size of e.g. 64 or 128 bits. But Format-Preserving Encryption allows you to take a standard cipher like AES and make a smaller-width cipher, of whatever radix and width you want (e.g. radix 2, width 16), with an algorithm which is still cryptographically robust.
It is guaranteed to never have collisions (because cryptographic algorithms create a 1:1 mapping). It is also reversible (a 2-way mapping), so you can take the resulting number and get back to the counter value you started with.
AES-FFX is one proposed standard method to achieve this. I've experimented with some basic Python code which is based on the AES-FFX idea, although not fully conformant--see Python code here. It can e.g. encrypt a counter to a random-looking 7-digit decimal number, or a 16-bit number.
You need to keep track of the numbers you have already used (for instance, in an array). Get a random number, and discard it if it has already been used.
Without relying on external stochastic processes, like radioactive decay or user input, computers will always generate pseudorandom numbers - that is numbers which have many of the statistical properties of random numbers, but repeat in sequences.
This explains the suggestions to randomise the computer's output by shuffling.
Discarding previously used numbers may lengthen the sequence artificially, but at a cost to the statistics which give the impression of randomness.
The best way to do this is create an array for numbers already used. After a random number has been created then add it to the array. Then when you go to create another random number, ensure that it is not in the array of used numbers.
In addition to using secondary array to store already generated random numbers, invoking random no. seeding function before every call of random no. generation function might help to generate different seq. of random numbers in every run.

Functions to compress and uncompress array of integers

I was recently asked to complete a task for a c++ role, however as the application was decided not to be progressed any further I thought that I would post here for some feedback / advice / improvements / reminder of concepts I've forgotten.
The task was:
The following data is a time series of integer values
int timeseries[32] = {67497, 67376, 67173, 67235, 67057, 67031, 66951,
66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044,
67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620,
66579, 66596, 66713, 66852, 66715};
The series might be, for example, the closing price of a stock each day
over a 32 day period.
As stored above, the data will occupy 32 x sizeof(int) bytes = 128 bytes
assuming 4 byte ints.
Using delta encoding , write a function to compress, and a function to
uncompress data like the above.
Ok, so before this point I had never looked at compression so my solution is far from perfect. The manner in which I approached the problem is by compressing the array of integers into a array of bytes. When representing the integer as a byte I keep the calculate most
significant byte (msb) and keep everything up to this point, whilst throwing the rest away. This is then added to the byte array. For negative values I increment the msb by 1 so that we can
differentiate between positive and negative bytes when decoding by keeping the leading
1 bit values.
When decoding I parse this jagged byte array and simply reverse my
previous actions performed when compressing. As mentioned I have never looked at compression prior to this task so I did come up with my own method to compress the data. I was looking at C++/Cli recently, had not really used it previously so just decided to write it in this language, no particular reason. Below is the class, and a unit test at the very bottom. Any advice / improvements / enhancements will be much appreciated.
Thanks.
array<array<Byte>^>^ CDeltaEncoding::CompressArray(array<int>^ data)
{
int temp = 0;
int original;
int size = 0;
array<int>^ tempData = gcnew array<int>(data->Length);
data->CopyTo(tempData, 0);
array<array<Byte>^>^ byteArray = gcnew array<array<Byte>^>(tempData->Length);
for (int i = 0; i < tempData->Length; ++i)
{
original = tempData[i];
tempData[i] -= temp;
temp = original;
int msb = GetMostSignificantByte(tempData[i]);
byteArray[i] = gcnew array<Byte>(msb);
System::Buffer::BlockCopy(BitConverter::GetBytes(tempData[i]), 0, byteArray[i], 0, msb );
size += byteArray[i]->Length;
}
return byteArray;
}
array<int>^ CDeltaEncoding::DecompressArray(array<array<Byte>^>^ buffer)
{
System::Collections::Generic::List<int>^ decodedArray = gcnew System::Collections::Generic::List<int>();
int temp = 0;
for (int i = 0; i < buffer->Length; ++i)
{
int retrievedVal = GetValueAsInteger(buffer[i]);
decodedArray->Add(retrievedVal);
decodedArray[i] += temp;
temp = decodedArray[i];
}
return decodedArray->ToArray();
}
int CDeltaEncoding::GetMostSignificantByte(int value)
{
array<Byte>^ tempBuf = BitConverter::GetBytes(Math::Abs(value));
int msb = tempBuf->Length;
for (int i = tempBuf->Length -1; i >= 0; --i)
{
if (tempBuf[i] != 0)
{
msb = i + 1;
break;
}
}
if (!IsPositiveInteger(value))
{
//We need an extra byte to differentiate the negative integers
msb++;
}
return msb;
}
bool CDeltaEncoding::IsPositiveInteger(int value)
{
return value / Math::Abs(value) == 1;
}
int CDeltaEncoding::GetValueAsInteger(array<Byte>^ buffer)
{
array<Byte>^ tempBuf;
if(buffer->Length % 2 == 0)
{
//With even integers there is no need to allocate a new byte array
tempBuf = buffer;
}
else
{
tempBuf = gcnew array<Byte>(4);
System::Buffer::BlockCopy(buffer, 0, tempBuf, 0, buffer->Length );
unsigned int val = buffer[buffer->Length-1] &= 0xFF;
if ( val == 0xFF )
{
//We have negative integer compressed into 3 bytes
//Copy over the this last byte as well so we keep the negative pattern
System::Buffer::BlockCopy(buffer, buffer->Length-1, tempBuf, buffer->Length, 1 );
}
}
switch(tempBuf->Length)
{
case sizeof(short):
return BitConverter::ToInt16(tempBuf,0);
case sizeof(int):
default:
return BitConverter::ToInt32(tempBuf,0);
}
}
And then in a test class I had:
void CTestDeltaEncoding::TestCompression()
{
array<array<Byte>^>^ byteArray = CDeltaEncoding::CompressArray(m_testdata);
array<int>^ decompressedArray = CDeltaEncoding::DecompressArray(byteArray);
int totalBytes = 0;
for (int i = 0; i<byteArray->Length; i++)
{
totalBytes += byteArray[i]->Length;
}
Assert::IsTrue(m_testdata->Length * sizeof(m_testdata) > totalBytes, "Expected the total bytes to be less than the original array!!");
//Expected totalBytes = 53
}
This smells a lot like homework to me. The crucial phrase is: "Using delta encoding."
Delta encoding means you encode the delta (difference) between each number and the next:
67497, 67376, 67173, 67235, 67057, 67031, 66951, 66974, 67042, 67025, 66897, 67077, 67082, 67033, 67019, 67149, 67044, 67012, 67220, 67239, 66893, 66984, 66866, 66693, 66770, 66722, 66620, 66579, 66596, 66713, 66852, 66715
would turn into:
[Base: 67497]: -121, -203, +62
and so on. Assuming 8-bit bytes, the original numbers require 3 bytes apiece (and given the number of compilers with 3-byte integer types, you're normally going to end up with 4 bytes apiece). From the looks of things, the differences will fit quite easily in 2 bytes apiece, and if you can ignore one (or possibly two) of the least significant bits, you can fit them in one byte apiece.
Delta encoding is most often used for things like sound encoding where you can "fudge" the accuracy at times without major problems. For example, if you have a change from one sample to the next that's larger than you've left space to encode, you can encode a maximum change in the current difference, and add the difference to the next delta (and if you don't mind some back-tracking, you can distribute some to the previous delta as well). This will act as a low-pass filter, limiting the gradient between samples.
For example, in the series you gave, a simple delta encoding requires ten bits to represent all the differences. By dropping the LSB, however, nearly all the samples (all but one, in fact) can be encoded in 8 bits. That one has a difference (right shifted one bit) of -173, so if we represent it as -128, we have 45 left. We can distribute that error evenly between the preceding and following sample. In that case, the output won't be an exact match for the input, but if we're talking about something like sound, the difference probably won't be particularly obvious.
I did mention that it was an exercise that I had to complete and the solution that I received was deemed not good enough, so I wanted some constructive feedback seeing as actual companies never decide to tell you what you did wrong.
When the array is compressed I store the differences and not the original values except the first as this was my understanding. If you had looked at my code I have provided a full solution but my question was how bad was it?