Store non-binary values into a unique integer - optimization

In 8 bits we can store 8 numbers from 0 to 1 each. We can also say that we can store 8 different piece of data in a range from 0 to 255.
0/1 0/1 0/1 0/1 0/1 0/1 0/1 0/1 → 8 different piece of data
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
[bit] [bit] [bit] [bit] [bit] [bit] [bit] [bit] → 8 bits total
If instead of storing 0 or 1 we need to store 0, 1 or 2, then we will end up using more bits, of course. In this condition, by the way, we can store 0, 1, 2 or 3. However, this is beyond what I need but technically it will use the same amount of data space.
0/1/2 0/1/2 0/1/2 0/1/2 → 4 different piece of data (expectative)
0/1/2/3 0/1/2/3 0/1/2/3 0/1/2/3 → 4 different piece of data (too much for me)
↓ ↓ ↓ ↓
[2bits] [2bits] [2bits] [2bits] → 8 bits total
We can agree that we are "losing" storage capacity. The solution is to do a radix conversion (I don't know if that's the right term) using the numerical base of 3, this way we will achieve the objective of storing exclusively 0, 1 or 2, and in the same 8 bits we can store more information (and, to be honest, I don't know how the bits organize themselves for this to work).
Examples:
00000₃ → int 0
01212₃ → int 50
11111₃ → int 121
12121₃ → int 151
22222₃ → int 242 (max)
In this conversion, we were able to store 5 pieces of information composed of 0, 1 or 2, instead of just 4. And there's still a "space" left, and that's what I'd like to talk about.
I was wondering if there is any way to store mixed base data instead of fixed base (eg. 3, as exemplified above). To be clearer, supposing that I wanted to store two pieces of data composed of 0, 1 or 2 each, and one piece of data composed of a number between 0 and 8. In binary, and still limited to 256, we can do it as follows:
0/1/2 0/1/2 0/1/2/3/4/5/6/7/8 → 3 piece of different data
↓ ↓ ↓
[2bits] [2bits] [ 4 bits ] → 8 bits total
But, it seems to me that we have empty "spaces" left again, since it doesn't seem to me to be using the full capacity possible in 8 bits, and maybe it's still possible to add more data if we use number base conversion instead.
The problem is: I don't know how to handle the data in this way, in a way to reliably merge and split it again.
In the real world: I need to transfer a lot of data that uses very little information (eg numbers 0 to 2, 0 to 5, 0 to 10, etc.). And I'm currently doing bitwise snapping of this data. And in my case, any optimized byte is a very nice gain (if this is really possible, maybe there is a 20~40% data saving).
I understand that it might consume more processing on rebasing, merge and split conversions, but that won't be an issue, because this processing will be done on the client side (merge when sending and split when receiving), and the server manages the same data already optimized (no additional processing).
--
Possible solutions:
Let's imagine that I have a set of three numbers that can be 0, 1 or 2, and another set of three numbers that can go from 0 to 8 (so it is 9 possibilities each). The goal is to merge all these numbers into a single integer (which should use about 2 bytes at best current method).
Solution 1: each number being stored in a single complete byte:
byte A from 0 to 2
byte B from 0 to 2
byte C from 0 to 2
byte D from 0 to 8
byte E from 0 to 8
byte F from 0 to 8
Problem: it will be necessary to consume 6 bytes to store these 6 numbers.
Solution 2: storage through bits:
2 bits to store A
2 bits to store B
2 bits to store C
4 bits to store F
4 bits to store G
4 bits to store H
6 unused bits to complete the last byte
Problems: in addition to 2 bits being "more than necessary" to store numbers from 0 to 2 (A, B and C), and same for 4 bits from 0 to 8, we will use a total of 3 bytes and still have 6 "dead" (unused) bits at end to complete the last byte.
Solution 3: Separate into two sets and convert their respective bases individually:
A, B and C to base 3 consumes 1 byte
D, E and F to base 9 consumes 2 bytes
Problem: despite being a great solution at the moment (and despite the example above, in more complex situations it can be more optimized than solution 2), and that's what I'm using, I believe there's a lot of "left over" space in this union, and maybe it's still possible to squeeze everything into 2 bytes only.
For example, the conversion of 222₃ consumes up to 5 bits, which means that in this byte we still have 3 unused bits. The 888₉ conversion consumes 10 bits. This makes me see the possibility of using only 15 bits with only 1 unused bit (2 bytes).
Then we come to the next solution.
Solution 4: move the bits to further optimize space:
higher bits stores A, B and C
lower bits stores D, E and F
Example:
[5 bits of numbers of base 3] +
[10 bits of numbers of base 9] +
[1 unused bit]
Problem: currently this solution is even better than the one I am currently using. However, I still see a possibility to improve in situations where more information may be available through number base conversion and union.

You pack your values in the following way:
Example:
possible values for
a = [0,2] -> 3 states
b = [0,2] -> 3 states
c = [0,8] -> 9 states
lets assume you have following values
a := 1
b := 2
c := 8
than you can calculate the final number the following way
int number = 0;
int multi = 1;
number = number.add(multi.multiply(a));
multi = multi.multiply(3));
number = number.add(multi.multiply(b));
multi = multi.multiply(3));
number = number.add(multi.multiply(c));
number holds now your packed numbers
to unpack you just need todo
a = number.mod(3)
number = number.divide(3)
b = number.mod(3)
number = number.divide(3)
c = number.mod(9)

Related

Can DEFLATE only compress duplicate strings up to 32 KiB apart?

According to DEFLATE spec:
Compressed representation overview
A compressed data set consists of a series of blocks, corresponding to successive blocks of input
data. The block sizes are arbitrary, except that non-compressible
blocks are limited to 65,535 bytes.
Each block is compressed using a combination of the LZ77 algorithm and
Huffman coding. The Huffman trees for each block are independent of
those for previous or subsequent blocks; the LZ77 algorithm may use a
reference to a duplicated string occurring in a previous block, up to
32K input bytes before.
Each block consists of two parts: a pair of Huffman code trees that
describe the representation of the compressed data part, and a
compressed data part. (The Huffman trees themselves are compressed
using Huffman encoding.) The compressed data consists of a series of
elements of two types: literal bytes (of strings that have not been
detected as duplicated within the previous 32K input bytes), and
pointers to duplicated strings, where a pointer is represented as a
pair <length, backward distance>. The representation used in the
"deflate" format limits distances to 32K bytes and lengths to 258
bytes, but does not limit the size of a block, except for
uncompressible blocks, which are limited as noted above.
So pointers to duplicate strings only go back 32 KiB, but since block size is not limited, could the Huffman code tree store two duplicate strings more than 32 KiB apart as the same code? Then is the limiting factor the block size?
The Huffman tree for distances contains codes 0 to 29 (table below); the code 29, followed by 8191 in "plain" bits, means "distance 32768". That's a hard limit in the definition of Deflate. The block size is not limiting. Actually the block size is not stored anywhere: the block is an infinite stream. If you want to stop the block, you send an End-Of-Block code for that.
Distance Codes
--------------
Extra Extra Extra Extra
Code Bits Dist Code Bits Dist Code Bits Distance Code Bits Distance
---- ---- ---- ---- ---- ------ ---- ---- -------- ---- ---- --------
0 0 1 8 3 17-24 16 7 257-384 24 11 4097-6144
1 0 2 9 3 25-32 17 7 385-512 25 11 6145-8192
2 0 3 10 4 33-48 18 8 513-768 26 12 8193-12288
3 0 4 11 4 49-64 19 8 769-1024 27 12 12289-16384
4 1 5,6 12 5 65-96 20 9 1025-1536 28 13 16385-24576
5 1 7,8 13 5 97-128 21 9 1537-2048 29 13 24577-32768
6 2 9-12 14 6 129-192 22 10 2049-3072
7 2 13-16 15 6 193-256 23 10 3073-4096
To add to Zerte's answer, the references to previous sequences have nothing to do with blocks or block boundaries. Such references can be within blocks, across blocks, and the referenced sequence can cross a block boundary.

How to sew multiple values into one

I need to store 5 values in a single SQL Server column, each range 1-90. The values cannot be repeated. I though of using the 2, 4, 8, 16, 32, 64, ... system but you guess it will get really big, using decimal I risk wrong calculation. Is there a convenient way to:
store the 5 values into a single column so that to avoid having 90 bit column in the table, see my previous post here.
quickly query the database for example to return all records with number X and Y
another option was a string (90) containing flags like 000001000011000 but this way I have to use substrings to query and I fear it will slow down on a table with 25.000 records or more.
First request: You say most are bit. But if not all then you cant use bitwise operator. And can't save it in a single field
In that case you need an aditional table.
Row_id | fieldName | fieldValue
1 | name1 | value1
1 | name2 | value2
.
.
.
1 | name90 | value90
Second request: Save the 5 values is very easy and fast on the aditional table. Just create and index for row_id on both tables.
Third Request: Here you say again can save it as bits. But instead using strings, that is a bad idea.
Your are right, number isnt big enough to hold 90 bit, that is because a number can only hold 32 or 64 bits depending on type.
In that case you need to use two field (64 bits) or three field (32 bits) to store all 90 possible flags.
Again easy to do and really fast.
EDIT
For use multiple fields you have to create categories
Like imagine there are 16 bits split into two 8 bits (0..256)
01234567 89ABCDEF
01010101 11111111
Create fieldUp and fieldDown
SAVE
FieldUp = 01234567
FieldUp = 1 + 4 + 16 + 64
FieldDown = 89ABCDEF
FieldDown = 1 + 2 + 4 + 8 + 16 + 32 + 64 + 128
Then Select a row with FLAGS [b1, b5, bA] would be
SELECT *
FROM TABLE
WHERE FieldUp & (4 + 32)
AND FieldDown & 8
I have resolved saving the numbers comma separated, then in my code i split this field into an array and can process the data. Numbers are not meant for math operations but just as a string.

Is there a way to represent a number in binary where bits have approximately uniform significance?

I'm wondering if it is possible to represent a number as a sequence of bits, each having approximately the same significance, such that if we flip one of the bits, the overall value does not change by much.
For example, we can use sequences of 4-bits, where each group represents a value from 0 to 15 and the overall value is the sum of all these values.
0110 0101 1101 1010 1011 → 6 + 5 + 13 + 10 + 11 = 45
and now flipping any bit can only incur in a maximum difference of 8 in the final value.
Some drawbacks obviously exist with this approach:
values have multiple representations, with some values having more representations than other ones (for example, there are 39280 distinct representations for the number 38, and only 1 for the number 0);
the amount of values that can be represented is greatly reduced (this representation allows for integers from 0 to 75, while 20 bits could normally represent 220 ~ 1 million different integers).
Are there any resources I can find concerning this problem? I can't seem to find anything online, but maybe I'm not searching with the right keywords. What other alternatives exist to my approach? Do they improve on its disadvantages?

I'm new to visual basic and trying to understand how to set individual bits in a byte [closed]

It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center.
Closed 10 years ago.
Hi, I am new to Visual Basic, I have a project where I need to be able to manipulate individual bits in a value.
I need to be able to switch these bits between 1 and 0 and combine multiple occurrences of bits into one variable in my code.
Each bit will represent a single TRUE / FALSE value, so I'm not looking for how to do a single TRUE / FALSE value in one variable, but rather multiple TRUE / FALSE values in one variable.
Can someone please explain to me how I can achieve this please.
Many thanks in advance.
Does it have to be exactly one bit?
Why don't you just use the actual built in VB data type of Boolean for this.
http://msdn.microsoft.com/en-us/library/wts33hb3(v=vs.80).aspx
It's sole reason for existence is so you can define variables that have 2 states, true or false.
Dim myVar As Boolean
myVar = True
myVar = Flase
if myVar = False Then
myVar = True
End If
UPDATE (1)
After reading through the various answers and comments from the OP I now understand what it is the OP is trying to achieve.
As others have said the smallest unit one can use in any of these languages is an 8 bit byte. There is simply no order of data type with a smaller bit size than this.
However, with a bit of creative thinking and a smattering of binary operations, you can refer to the contents of that byte as individual bits.
First however you need to understand the binary number system:
ALL numbers in binary are to the power of two, from right to left.
Each column is the double of it's predecessor, so:
1 becomes 2, 2 becomes 4, 4 becomes 8 and so on
looking at this purely in a binary number your columns would be labelled thus:
128 64 32 16 8 4 2 1 (Remember it's right to left)
this gives us the following:
The bit at position 1 = 1;
The bit at position 2 = 2;
The bit at position 3 = 4;
The bit at position 4 = 8;
and so on.
Using this method on the smallest data type you have (The byte) you can pack 8 bit's into one value. That is you could use one variable to hold 8 separate values of 1 or 0
So while you cannot go any smaller than a byte, you can still reduce memory consumption by packing 8 values into 1 variable.
How do you read and write the values?
Remember the column positions? well you can use something called Bit Shifting and Bit masks.
Bit Shifting is the process of using the
<<
and
>>
operators
A shifting operation takes as a parameter the number of columns to shift.
EG:
Dim byte myByte
myByte = 1 << 4
In this case the variable 'myByte' would become equal to 16, but you would have actually set bit position 5 to a 1, if we illustrate this, it will make better sense:
mybyte = 0 = 00000000 = 0
mybyte = 1 = 00000001 = 1
mybyte = 2 = 00000010 = (1 << 1)
mybyte = 4 = 00000100 = (1 << 2)
mybyte = 8 = 00001000 = (1 << 3)
mybyte = 16 = 00010000 = (1 << 4)
the 0 through to 16 if you note is equal to the right to left column values I mentioned above.
given what Iv'e just explained then, if you wanted to set bits 5, 4 and 1 to be equal to 1 and the rest to be 0, you could simply use:
mybyte = 25(16 + 8 + 1) = 00011001 = (1 << 4) + (1 << 3) + 1
to get your bits back out, into a singleton you just bit shift the other way
retrieved bit = mybyte >> 4 = 00000001
Now there is unfortunately however one small flaw with the bit shifting method.
by shifting back and forth you are highly likely to LOOSE information from any bits you might already have set, in order to prevent this from happening, it's better to combine your bit shifting operations with bit masks and boolean operations such as 'AND' & 'OR'
To understand what these do you first need to understand simple logic principles as follows:
AND
Output is one if both the A and B inputs are 1
Illustrating this graphically
A B | Output
-------------
0 0 | 0
0 1 | 0
1 0 | 0
1 1 | 1
As you can see if a bit position in our input number is a 1 and the same position in our input number B is 1, then we will keep that position in our output number, otherwise we will discard the bit and set it to a 0, take the following example:
00011001 = Bits 5,4 and 1 are set
00010000 = Our mask ONLY has bit 5 set
if we perform
00011001 AND 0010000
we will get a result of
00010000
which we can then shift down by 5
00010000 >> 5 = 00000001 = 1
so by using AND we now have a way of checking an individual bit in our byte for a value of 1:
if ((mybyte AND 16) >> 1) = 1 then
'Bit one is set
else
'Bit one is NOT set
end if
by using different masks, with the different values of 2 in the right to left columns as shown previously, we can easily extract different singular values from our byte and treat them as a simple bit value.
Setting a byte is just as easy, except you perform the operation the opposite way using an 'OR'
OR
Output is one if either the A or B inputs are 1
Illustrating this graphically
A B | Output
-------------
0 0 | 0
0 1 | 1
1 0 | 1
1 1 | 1
eg:
00011001 OR 00000100 = 00011101
as you can see the bit at position 4 has been set.
To answer the fundamental question that started all this off however, you cannot use a data type in VB that has any resolution less than 1 byte, I suspect if you need absolute bit wise accuracy I'm guessing you must be writing either a compression algorithm or some kind of encryption system. :-)
01010100 01110010 01110101 01100101, is the string value of the word "TRUE"
What you want is to store the information in a boolean
Dim v As Boolean
v = True
v = False
or
If number = 84 Then ' 84 = 01010100 = T
v = True
End If
Other info
Technicaly you can't store anything in a bit, the smallest value is a char which is 8 bit. You'll need to learn how to do bitwise operation. Or use the BitArray class.
VB.NET (nor any other .NET language that I know of) has a "bit" data type. The smallest that you can use is a Byte. (Not a Char, they are two-bytes in size). So while you can read and convert a byte of value 84 into a byte with value 1 for true, and convert a byte of value 101 into a byte of value 0 for false, you are not saving any memory.
Now, if you have a small and fixed number of these flags, you CAN store several of them in one of the integer data types (in .NET the largest integer data type is 64 bits). Or if you have a large number of these flags you can use the BitArray class (which uses the same technique but backs it with an array so storage capacity is greater).

Deterministic/non-deterministic state system mapping

I read in a book on non-deterministic mapping there is mapping from Q*∑ to 2Q for M=(Q,∑,trans,q0,F)
where Q is a set of states.
But I am not able to understand how it's 2Q;
if there are 3 states a, b, c, how does it map to 8 states?
I always found that the easiest way to think about these (since the set of states is finite) is as having each of those subsets be an encoding of a base-2 number that ranges from 0 (all bits zero) to 2|Q|-1 (all bits one), where there are as many bits in the number as there are members in the state set, Q. Then, you can just take one of these numbers and map it into a subset by using whether a particular bit in the number is set. Easy!
Here's a worked example where Q = {a,b,c}. In this case, |Q| is 3 (there are three elements) and so 23 is 8. That means we get this if we say that the leading bit is for element a, the next bit is for b, and the trailing bit for c:
0 = 000 = {}
1 = 001 = {c}
2 = 010 = {b}
3 = 011 = {b,c}
4 = 100 = {a}
5 = 101 = {a,c}
6 = 110 = {a,b}
7 = 111 = {a,b,c}
See? That initial three states has been transformed into 8, and we have a natural numbering of them that we could use to create the labels of those states if we chose.
Now, to the interpretations of this within a non-deterministic context. Basically, the non-determinism means that we're uncertain about what state we're in. We represent this by using a pseudo-state that is the set of “real” states that we might be in; if we have total non-determinism then we are in the pseudo-state where all real-states are possible (i.e., {a,b,c}) whereas the pseudo-state where no real-states are possible (i.e., {}) is the converse (and really ought to be impossible to reach in the transition system). In a real system, you're usually not dealing with either of those extremes.
The logic of how you convert the deterministic transition system into a non-deterministic one is rather more complex than I want to go into here. (I had to read a substantial PhD thesis to learn it so it's definitely more than an SO answer's worth!)
2Q means the set of all subsets of Q. For each state q and each letter x from sigma, there is a subset of Q states to which you can go from q with letter x. So yeah, if there are three states abc the set 2Q consists of 8 elements {{}, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}}. It doesn't map to 8 states, it maps to one of these 8 sets. HTH