How can one compute the optimal parameters to a start-step-stop coding scheme? - optimization

A start-step-stop code is a data compression technique that is used to compress numbers that are relatively small.
The code works as follows: It has three parameters, start, step and stop. Start determines the number of bits used to encode the first few numbers. Step determines how many bits to add to the encoding when we run out, and stop determines the maximum number of bits used to encode a number.
So the length of an encoding is given by l = start + step * i.
The "i" value of a particular code is encoded using unary. That is, a number of 1 bits followed by a terminating 0 bit. If we have reached stop then we can drop the terminating 0 bit. If i is zero we only write out the 0 bit.
So a (1, 2, 5) start-step-stop code would work as follows:
Value 0, encoded as: 0 0
Value 1, encoded as: 0 1
Value 2, encoded as: 10 000
Value 9, encoded as: 10 111
Value 10, encoded as: 11 00000
Value 41, encoded as: 11 11111
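As a minimal sketch of the scheme described above (the function name sss_encode and the space between prefix and payload are illustrative choices, not part of the original question), an encoder could look like this:

```python
def sss_encode(value, start, step, stop):
    """Encode a non-negative integer with a (start, step, stop) code.

    Returns the unary prefix and the binary payload separated by a space,
    mirroring the layout of the worked example.
    """
    last = (stop - start) // step   # index i of the final (widest) group
    base = 0                        # smallest value that falls in group i
    for i in range(last + 1):
        width = start + step * i    # payload length l = start + step * i
        if value < base + (1 << width):
            # i one-bits, then a terminating 0 unless we have reached stop.
            prefix = "1" * i + ("" if i == last else "0")
            payload = format(value - base, "b").zfill(width) if width else ""
            return prefix + " " + payload
        base += 1 << width
    raise ValueError("value is too large for this (start, step, stop) code")


for v in (0, 1, 2, 9, 10, 41):
    print(v, sss_encode(v, 1, 2, 5))
```

Run against the (1, 2, 5) code, this reproduces the six encodings listed in the example.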
So, given a file containing several numbers, how can we compute the optimal start-step-stop codes for that file? The optimal parameters are defined as those that will result in the greatest compression ratio.

These "start-step-stop" codes look like another name for Huffman codes. See the basic technique for an outline of the pseudo-code for calculating them.
Essentially this is what the algorithm does:
Before you start the Huffman encoding you need to gather the statistics of each symbol you'll be compressing (its total frequency in the file to compress).
Once you have that, you build a binary tree from it such that the most frequently used symbols are near the top of the tree (and thus use fewer bits) and such that no encoding is a prefix of another, since a shared prefix would make decompression ambiguous.
At the end of the Huffman encoding your start value will be the depth of the shallowest leaf node, your step will always be 1 (logically this makes sense: why would you force more bits than you need, just add one at a time), and your stop value will be the depth of the deepest leaf node.
If the frequency stats aren't sorted it will take O(n log n) to do; if they are sorted by frequency it can be done in O(n).
Huffman codes are guaranteed to have the best average compression for this type of encoding:
Huffman was able to design the most efficient compression method of this type: no other mapping of individual source symbols to unique strings of bits will produce a smaller average output size when the actual symbol frequencies agree with those used to create the code.
This should help you implement the ideal solution to your problem.
Edit: Though similar, this isn't what the OP was looking for.
This academic paper by the creator of these codes describes a generalization of start-step-stop codes, start-stop codes. However, the author briefly describes how to get optimal start-step-stop parameters near the end of section 2. It involves using a statistical random variable, or brute-force finding the best combination. Without any prior knowledge of the file the algorithm is O((log n)^3).
Hope this helps.

The approach I used was a simple brute force solution. The algorithm followed these basic steps:
1. Count the frequency of each number in the file. In the same pass, compute the total amount of numbers in the file and determine the greatest number as maxNumber.
2. Compute the probability of each number as its frequency divided by the total amount of numbers in the file.
3. Determine "optimalStop" as equal to log2(maxNumber). This is the ideal number of bits that should be used to represent maxNumber as in Shannon information theory and therefore a reasonable estimate of the optimal maximum amount of bits used in the encoding of a particular number.
4. For every "start" value from 1 to "optimalStop" repeat steps 5 - 7:
5. For every "step" value from 1 to ("optimalStop" - "start") / 2, repeat steps 6 & 7:
6. Calculate the "stop" value closest to "optimalStop" that satisfies stop = start + step * i for some integer i.
7. Compute the average number of bits that would be used by this encoding. This can be calculated as each number's probability multiplied by its bit length in the given encoding.
8. Pick the encoding with the lowest average number of bits.
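The brute-force search above can be sketched roughly as follows. This is a variant under stated assumptions, not the exact procedure from the answer: instead of estimating stop from log2(maxNumber), it picks the smallest stop whose code range covers the largest number, and it minimises the total bit count (equivalent to minimising the average). The names code_length, best_parameters and max_bits are hypothetical.

```python
from collections import Counter


def code_length(value, start, step, stop):
    """Total bits used for value, or None if the code cannot represent it."""
    last = (stop - start) // step
    base = 0
    for i in range(last + 1):
        width = start + step * i
        if value < base + (1 << width):
            # Unary prefix is i ones plus a 0, except the 0 is dropped at stop.
            prefix = i if i == last else i + 1
            return prefix + width
        base += 1 << width
    return None


def best_parameters(numbers, max_bits=16):
    """Return (total_bits, start, step, stop) with the fewest total bits."""
    freq = Counter(numbers)
    largest = max(freq)
    best = None
    for start in range(1, max_bits + 1):
        for step in range(1, max_bits + 1):
            # Assumption: grow stop until the largest number becomes encodable.
            stop = start
            while code_length(largest, start, step, stop) is None:
                stop += step
            total = sum(count * code_length(v, start, step, stop)
                        for v, count in freq.items())
            if best is None or total < best[0]:
                best = (total, start, step, stop)
    return best


print(best_parameters([0, 1, 0, 1, 2, 9, 41]))
```

Dividing the returned total by the count of numbers gives the average bits per number used in step 7.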

Related

Binary Search: Number of comparisons in the worst case

I am trying to figure out the number of comparisons that binary search does on an array of a given size in the worst case.
Let's say there is an array A with 123,456 elements (or any other number). Binary search is applied to find some element E. The comparison is to determine whether A[i] = E. How many times would this comparison be executed in the worst case?
According to this post, the number of worst case comparisons is 2logn+1.
Result: 50
According to this post, the max. number of binary search comparisons is log2(n+1).
Result: 25
According to this post, the number of comparisons is 2logn-1.
Result: 50
I am confused by the different answers. Can anyone tell me which one is correct and how I can determine the maximum number of comparisons in the worst case?
According to this Wiki page:
In the worst case, binary search makes floor(log2(n)+1) iterations of the comparison loop, where the floor notation denotes the floor function that yields the greatest integer less than or equal to the argument, and log2 is the binary logarithm. This is because the worst case is reached when the search reaches the deepest level of the tree, and there are always floor(log2(n)+1) levels in the tree for any binary search.
Also, it's not enough to consider only comparisons A[i] = E. The binary search also includes comparisons E <= A[mid], where the mid is the midpoint of the index interval.
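To check the floor(log2(n)+1) figure for the 123,456-element case, here is a small sketch that counts loop iterations of a textbook binary search. The function name is illustrative, and it counts iterations of the loop rather than individual element comparisons (each iteration may perform up to two).

```python
import math


def binary_search_iterations(sorted_items, target):
    """Run a standard binary search and return how many loop iterations it took."""
    lo, hi = 0, len(sorted_items) - 1
    iterations = 0
    while lo <= hi:
        iterations += 1
        mid = (lo + hi) // 2
        if sorted_items[mid] == target:
            return iterations
        if sorted_items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return iterations


n = 123_456
# Searching for a value larger than every element forces the deepest path.
worst = binary_search_iterations(range(n), n)
print(worst, math.floor(math.log2(n)) + 1)
```

Both numbers come out as 17 for n = 123,456, which is neither 25 nor 50.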

Generating distinct random numbers efficiently

My main purpose is to spread a buffer over pixels of an image randomly and efficiently, but I'm stuck at generating distinct random numbers. What I simply want is to generate numbers between 0 and N, but I also want these numbers to be distinct. Also note that N usually will be quite large such as 20 million and the algorithm doesn't have to be cryptographically secure.
I can't use a random shuffle since N is quite large. I did some searching and found the linear congruential generator, but the parameter m is required to be prime, and my N sometimes is not.
Lastly, I tried the following approach, but it's neither efficient nor reliable, since it might throw a "maximum call stack size exceeded" error.
next(max: number)
{
    let num = LCG.next()
    if (num <= max) return num
    return next(max)
}
If numbers are distinct, then they are not random. Random numbers can repeat; distinct numbers are selected from an ever decreasing set. It is the difference between selecting numbers with replacement and without replacement.
You want numbers from 0 to 20 million. As you have found, that is too large for a shuffle. Better to use an encryption. Because an encryption is one-to-one, as long as you have distinct inputs you will get distinct outputs. Just encrypt 0, 1, 2, 3, ... and you will get distinct outputs.
You mention a linear congruential PRNG, so I assume that security is not of great importance. 20 million lies between 2^24 and 2^25, so a 26-bit block (an even width, which a balanced Feistel network needs) covers it, and you can write a simple four-round Feistel cipher sized appropriately to do the work. Alternatively, use a standard library cipher with one of the format-preserving methods to keep the output within the bounds you want.
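A rough sketch of the Feistel idea, assuming a SHA-256-based round function and "cycle walking" (re-encrypting until the result falls below N) to stay in range; neither detail comes from the answer, and the key and function name are placeholders. It is not meant to be cryptographically secure.

```python
import hashlib


def feistel_permute(index, n, key=b"demo-key", rounds=4):
    """Map index in [0, n) to a distinct pseudo-random value in [0, n)."""
    # Split the smallest even-width block covering n - 1 into two halves.
    half_bits = max(1, ((n - 1).bit_length() + 1) // 2)
    mask = (1 << half_bits) - 1

    def round_value(r, half):
        # Assumed round function: truncated SHA-256 of key, round number and half.
        data = key + bytes([r]) + half.to_bytes(8, "big")
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") & mask

    value = index
    while True:
        left, right = value >> half_bits, value & mask
        for r in range(rounds):
            left, right = right, left ^ round_value(r, right)
        value = (left << half_bits) | right
        if value < n:          # cycle walking: retry while outside [0, n)
            return value


print([feistel_permute(i, 20_000_000) for i in range(5)])
```

Because each Feistel pass is a bijection on the block, feeding in 0, 1, 2, ... yields distinct outputs without storing the whole range in memory.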

Cryptographic Hash Function

I have an exam tomorrow on cryptography and came across an old exam question on hash functions and finding out the probability of collision of two hash values being the same, but I don't know how to calculate it. The question is:
If the hash value is a 20 bit output and allowable inputs must not exceed 2^64 bits, what is the probability of two randomly chosen values yielding a collision?
Was hoping someone could provide a solution.
Thanks.
Should be 1 / (2 ^ 20). (It should be independent of the length of the input if you consider 2 randomly chosen inputs (and not ALL possible inputs), given the hash function is proper.) So I guess the additional information about the length of the input is just there to drive you crazy.
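The counting argument behind that figure can be written out as a tiny sketch, assuming an ideal hash whose outputs are uniformly distributed (the function name is illustrative):

```python
from fractions import Fraction


def pair_collision_probability(output_bits):
    """Chance that two independent, uniformly distributed hash values are equal."""
    outputs = 2 ** output_bits
    # Of the outputs * outputs equally likely (first, second) pairs,
    # exactly `outputs` of them have both values equal.
    return Fraction(outputs, outputs * outputs)


p = pair_collision_probability(20)
print(p, float(p))
```

For a 20-bit output this gives 1/1048576, a little under one in a million.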

is SHA-512 collision resistant?

According to the books that I have read, SHA (Secure Hash Algorithm) is collision resistant. But if the input space is a 1024-bit number and the output space is a 512-bit message digest, then shouldn't it collide (2^1024)/(2^512) times? As the range is smaller than the domain being mapped, there should be collisions. Please explain where I am going wrong.
The chance of a collision does not depend on the input size. For a 512-bit hash, about 1.4×10^77 hashes are needed before a collision becomes 50% likely; see Probability table.
Maybe your book has also mentioned the definition of collision resistance? It does not mean that no collisions exist (which is clearly not the case), but that you are not able to easily find two messages that produce the same hash.
a hash function H is collision resistant if it is hard to find two
inputs that hash to the same output; that is, two inputs a and b such
that H(a) = H(b), and a ≠ b
From Wikipedia
As you describe: Since the input space (arbitrary size) is larger than the output space (e.g. 512bit for sha512), there always exist collisions.
"Collision resistant" means, it is adequately unlikely for a collision to be found.
Your confusion is answered when considering how large the output space "512 bits" really is:
2^512 (the number of possible configurations of a 512 bit array) is of the order 10^154.
For comparison: The number of atoms in the visible universe is somewhere in the range of 10^80.
A million is 10^6.
So a million of our 'visible universes' has 10^86 atoms.
A million times a million universes has 10^92 atoms.
If you could store a single 512-bit value on a single atom, how many universes would you need to have all possible 512-bit hash values stored?
Starting with a specific 512-bit number (and assuming the hash function is not broken), the probability p of obtaining a collision, assuming you can produce new hashes at a rate R and have a total time t to do so, is:
p = R*t/(2^(512/2))
(The exponent is halved; see "birthday attack". The expected work to find a collision in an n-bit hash is about 2^(n/2) evaluations.)
Let's plug in some example numbers:
The hash rate of the bitcoin network is currently about R = 200*10^15 / s (200 petahashes per second).
Consider the situation that since the beginning of the universe the bitcoin network's current hashing capacity would have been available for the sole purpose of finding a collision for a specific hash value, i.e. for an available time of t=13.787*10^9 years,
then the probability that a collision would have been found by now is about 7 × 10^-41 %
Again, it is hard to appreciate how small this number is.
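The arithmetic can be reproduced with a few lines, using the answer's formula p = R*t/(2^(512/2)) and assuming a year of 365.25 days (the variable names are illustrative):

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year

rate = 200e15                            # hashes per second (R)
age_of_universe = 13.787e9 * SECONDS_PER_YEAR   # seconds (t)

# Formula from the answer: p = R * t / 2^(512/2)
p = rate * age_of_universe / 2 ** 256

print(f"p = {p:.2e} = {p * 100:.2e} %")
```

This prints a probability of roughly 7.5e-43, i.e. about 7.5e-41 %, in line with the figure quoted above.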
Edit: A similar question with a good answer is found here: https://crypto.stackexchange.com/questions/89558/are-sha-256-and-sha-512-collision-resistant

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with lowest time complexity)? So far I think I have a solution that runs in n log n:
First, note that the lowest-standard-deviation-possessing subset of a given size must consist of consecutive numbers. So step 1 is sort the numbers (this is n log n)
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, Wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the lowest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the last n numbers, updating the lowest value whenever a window beats it. To do each computation takes only constant time if you keep track of s1 and s2 running totals: for example, to get std dev of [1, n], just subtract the 0th element from the s1 and s2 totals and add the nth element, then recalculate standard deviation. This means that the entire standard deviation calculating portion of the algorithm takes linear time.
So total time complexity n log n.
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
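The sliding-window procedure described in the question can be sketched like this. The function name is illustrative, and it compares the scaled variance n*s2 - s1^2 instead of taking a square root for every window, which does not change which window wins.

```python
import math


def tightest_window(numbers, n):
    """Return the n sorted-consecutive numbers with the lowest standard deviation."""
    xs = sorted(numbers)                       # O(N log N)
    if not 0 < n <= len(xs):
        raise ValueError("n must be between 1 and len(numbers)")
    s1 = sum(xs[:n])                           # running sum
    s2 = sum(x * x for x in xs[:n])            # running sum of squares
    best_start, best_scaled = 0, n * s2 - s1 * s1
    for i in range(1, len(xs) - n + 1):
        out, new = xs[i - 1], xs[i + n - 1]
        s1 += new - out                        # O(1) update per window
        s2 += new * new - out * out
        scaled = n * s2 - s1 * s1
        if scaled < best_scaled:
            best_start, best_scaled = i, scaled
    std = math.sqrt(max(best_scaled, 0)) / n   # sqrt(n*s2 - s1^2) / n
    return xs[best_start:best_start + n], std


print(tightest_window([100, 11, 1, 50, 12, 10], 3))
```

With floating-point input the running sums can accumulate rounding error, so the max(..., 0) guard keeps the square root well defined.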
Having worked recently on a similar problem, I think both the definition of the clumps and the proposed implementation seem reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
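A minimal sketch of the range-based alternative, using 0-based indexing rather than the x[1]..x[n] notation above (the function name is illustrative):

```python
def narrowest_range(numbers, n):
    """Return the n sorted-consecutive numbers whose max - min is smallest."""
    xs = sorted(numbers)
    if not 0 < n <= len(xs):
        raise ValueError("n must be between 1 and len(numbers)")
    # Compare x[i + n - 1] - x[i] for every window start i; no squares or roots needed.
    start = min(range(len(xs) - n + 1), key=lambda i: xs[i + n - 1] - xs[i])
    return xs[start:start + n]


print(narrowest_range([100, 11, 1, 50, 12, 10], 3))
```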
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.