How to improve the performance of my graph coloring model in MiniZinc?

I have created a model for solving the graph coloring problem in MiniZinc:
include "globals.mzn";
int: n_nodes; % Number of nodes
int: n_edges; % Number of edges
int: domain_ub; % Number of colors
array[int] of int: edges; % All edges of graph as a 1D array
array[1..n_edges, 1..2] of int: edges2d = array2d(1..n_edges, 1..2, edges);
array[1..n_nodes] of var 1..domain_ub: colors;
constraint forall (i in 1..n_edges) (colors[edges2d[i,1]] != colors[edges2d[i,2]]);
solve :: int_search(colors, dom_w_deg, indomain_random) satisfy;
In order to tackle big problems (around 400-500 nodes), I start with an upper bound of the number of colors and solve successive satisfaction problems decrementing the number by one till it becomes unsatisfiable or times out. This method gives me decent results.
In order to improve my results, I added symmetry breaking constraints to the above model:
constraint colors[1] = 1;
constraint forall (i in 2..n_nodes) ( colors[i] in 1..max(colors[1..i-1])+1 );
This, however, brings down my results both speed-wise and quality-wise.
Why is my model performing badly after adding the additional constraints? How should I go about adding the symmetry breaking constraints?

For symmetry breaking in cases where the values are fully interchangeable, I would recommend the seq_precede_chain constraint, which breaks exactly that symmetry. As commented by @hakank, indomain_random is probably not a good idea in combination with symmetry breaking; indomain_min is a safer choice.
For graph coloring in general, it may help performance to run a clique-finding algorithm and post all_different constraints over each clique found. That would have to be done when generating the MiniZinc program for each instance. For comparison, see the Gecode graph coloring example, which uses pre-computed cliques.
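As a rough illustration of the clique idea, here is a minimal Python sketch that greedily grows cliques from an edge list and prints alldifferent constraints that could be appended to a generated model. The function names (greedy_cliques, as_minizinc) and the tiny example graph are assumptions made for this sketch; a greedy pass like this finds some cliques, not necessarily the largest ones.

```python
from collections import defaultdict


def greedy_cliques(n_nodes, edges):
    """Greedily grow one clique per start node; keep distinct cliques of size >= 3."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    found = set()
    # Visit high-degree nodes first: they are more likely to sit in large cliques.
    order = sorted(range(1, n_nodes + 1), key=lambda v: -len(adj[v]))
    for start in order:
        clique = [start]
        candidates = set(adj[start])
        while candidates:
            # Prefer the candidate that keeps the most other candidates alive.
            v = max(candidates, key=lambda c: (len(adj[c] & candidates), -c))
            clique.append(v)
            candidates &= adj[v]
        if len(clique) >= 3:
            found.add(tuple(sorted(clique)))
    return sorted(found)


def as_minizinc(cliques, array_name="colors"):
    """Render each clique as one alldifferent constraint over the color array."""
    lines = []
    for clique in cliques:
        terms = ", ".join(f"{array_name}[{v}]" for v in clique)
        lines.append(f"constraint alldifferent([{terms}]);")
    return lines


# Hypothetical instance: a triangle 1-2-3 with a path 3-4-5 attached.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
cliques = greedy_cliques(5, edges)
mzn_lines = as_minizinc(cliques)
print("\n".join(mzn_lines))
```

Cliques of size 2 are skipped because the pairwise != constraints already cover them.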

I know this is an old question, but I was working on the same problem and wanted to write down what I found, in case it is useful to someone in the future.
To improve the model, the solution is to use a symmetry-breaking constraint, as you did, but MiniZinc has a global constraint called value_precede that can be used in this case.
% A new color J is only allowed to appear after colors 0..J-1 have been seen before (in any order)
constraint forall(j in 1..n-1)(value_precede(j, j+1, map));
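To see what the precedence chain removes, here is a small Python sketch, independent of MiniZinc. It assumes colors are numbered from 0; the helper names (canonical, respects_precedence) are made up for this illustration. Of all the relabelings of one coloring, only the one whose colors first appear in the order 0, 1, 2, ... passes the check.

```python
from itertools import permutations


def canonical(coloring):
    """Relabel colors in order of first appearance: 0, 1, 2, ..."""
    relabel = {}
    out = []
    for c in coloring:
        if c not in relabel:
            relabel[c] = len(relabel)
        out.append(relabel[c])
    return out


def respects_precedence(coloring):
    """True if color j+1 never appears before color j has appeared."""
    highest = -1
    for c in coloring:
        if c > highest + 1:
            return False
        highest = max(highest, c)
    return True


base = [0, 1, 0, 2, 1]
# All 3! = 6 ways of renaming the three colors of the same coloring.
variants = [[p[c] for c in base] for p in permutations(range(3))]
survivors = [v for v in variants if respects_precedence(v)]
print(len(variants), survivors)
```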
Changing the search heuristic does not improve the result much. I tried different configurations, and the best results were obtained with dom_w_deg and indomain_min (on my data files).
Another way to improve the results is to accept any good enough solution that uses fewer colours than the number of colours in the domain.
This model, however, does not always find the optimal result.
include "globals.mzn";
int: n; % Number of nodes
int: e; % Number of edges
int: maxcolors = 17; % Domain of colors
array[1..e,1..2] of int: E; % 2d array, rows = edges, 2 cols = nodes per edge
array[0..n-1] of var 0..maxcolors: c; % Color of each node
constraint forall(i in 1..e)(c[E[i,1]] != c[E[i,2]] ); % Two linked nodes have diff color
constraint c[0] == 0; % Break Symmetry, force first color == 0
% Big symmetry breaker. A new color J is only allowed to appear after colors
% 0..J-1 have been seen before (in any order)
constraint forall(i in 0..n-2)( value_precede(i,i+1, c) );
% Ideally solve would minimize(max(c)), but that's too slow, so we accept any good
% enough solution that's less than or equal to our heuristic "maxcolors"
constraint max(c) <= maxcolors;
solve :: int_search(c, dom_w_deg, indomain_min, complete) satisfy;
output [ show(max(c)+1), "\n", show(c)];
A clear and complete explanation can be found here:


Octave minimization for a many-body Hamiltonian with non-linear constraint

I work in theoretical physics, and I have come upon a problem that requires the minimization of a particular Hamiltonian operator for a system of 8 particles, with one non-linear constraint. Due to the complexity of the system, I cannot define the entire Hamiltonian "in one go", nor the constraint. By this I mean that the quantity I am searching for is defined recursively: it depends on complex summations over quantities calculated for systems of 7 particles, which in turn depend on quantities calculated for systems of 6, and so on, until it reaches a one- or two-particle system, for which said quantities are given as initial values that depend on the elements of a column vector (the argument/minimization parameters). The constraint itself is also of this form, requiring the "overlap" between the states of 8 particles to be exactly 1 (i.e., the state must be normalized). I have been thinking of a way to use fmincon for this, but I've come up short, since my function has an implicit dependence on the parameters, and I can't write the whole thing explicitly. For a better understanding, here is some of the code:
for m=3:npairs+1
for n=3:npairs+1
for i=1:nsps
for j=1:nsps
function [E]=H(x)
E=summation over all i and j of N and p0p for m=n=8 %not actual code
overlap(9,9)=1 %constraint
It's hard to give a specific answer, but I would advise the following to get you started.
First, note that the inner two levels of the nested loop can be vectorised, since i and j always appear as indices (whereas m and n make backreferences, so they cannot be vectorised). So your 4-level loop can be reduced to a 2-level loop containing 4 functions operating over i-by-j matrices.
Second, note that the whole construct can be expressed as a recursive function. If you have suitable base cases for m = 0, n = 0, you can iteratively obtain all i,j matrices for all cases up to m=9,n=9. In particular, you can try to 'memoize' the early steps, and plug them into higher steps, rather than rely on actual recursion.
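The bottom-up ("memoized") structure can be sketched in Python as follows. The real recurrence is not given in the question, so the one used here (a Pascal-like sum of two lower entries) is only a placeholder, and fill_table, base_value and combine are hypothetical names. The point is only that every (m, n) entry is computed once from entries that are already stored, so no actual recursion is needed.

```python
def fill_table(size, base_value, combine):
    """Fill table[m][n] for 0 <= m, n < size, lowest levels first."""
    table = [[None] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            if m == 0 or n == 0:
                # Base cases are given directly (in the real problem they
                # would depend on the minimization parameters x).
                table[m][n] = base_value(m, n)
            else:
                # Higher levels only read entries that are already filled in.
                table[m][n] = combine(table[m - 1][n], table[m][n - 1])
    return table


# Placeholder recurrence standing in for the real physics.
table = fill_table(10, lambda m, n: 1, lambda up, left: up + left)
print(table[9][9])
```

An objective function handed to an optimizer would rebuild this table for each parameter vector and read off the top-level entry.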
Assuming you need to sum with the first two indices fixed to 8 (if I understood correctly), you can easily do this with anonymous functions:
# creating same data
# defining 2 versions of sums
f = @(A,B) [sum(sum(A(8,8,:,:))), sum(sum(B(8,8,:,:)))];
g = @(A,B) sum(sum(A(8,8,:,:))) + sum(sum(B(8,8,:,:)));
the output will be:
octave:21> E1=f(A,B)
E1 =
16 32
octave:22> E2=g(A,B)
E2 = 48

MiniZinc find the set of int

I have a script in MiniZinc that tries to find a set of int, but it is unable to do so. The problem statement: given a set of 2-class features, a minimal support set needs to be found, with the constraints that its length should be less than some k, and that for each set in a given array of sets it should contain at least one of the index values from that set. So suppose that the solution is {3,4,7} and the array of sets (let's call it atmostone) is atmostone = [{1,2,3}, {4,5,6}, {7,8,9}]; then the intersection of the solution with each of the sets in the atmostone array must have exactly length one.
These are the constraints I implemented, but I get a model inconsistency error.
include "globals.mzn";
include "alldifferent.mzn";
int: t; %number of attributes
int: k; %maximum size of support set
int: n; %number of positive instances
int: m; %number of negative instances
int: c; %number of atMostOne Constraints
array [1..n, 1..t] of 0..1: omegap;
array [1..m, 1..t] of 0..1: omegan;
array [int] of set of int: atMostOne;
set of int: K = 1..k;
set of int: T = 1..t;
var set of T: solution;
function array[int] of var opt int : set2array(var set of int: a) =
[i | i in a];
% constraint alldifferent(solution);
constraint length(set2array(solution)) <= k;
constraint forall(i in 1..length(atMostOne))(length(set2array(solution intersect atMostOne[i])) <= 1);
constraint forall(i in 1..n, j in 1..m)(not(omegap[i, fix(solution)] == omegan[j, fix(solution)]));
solve satisfy;
This is the error:
Compiling support_send.mzn
WARNING: model inconsistency detected
Running support_send.mzn
% Top level failure!
Finished in 88msec
The data:
t=8; %number of attributes
k=3; %maximum size of support set
n=5; %number of positive instances
m=3; %number of negative instances
c=4; %number of atMostOne Constraints
omegap=[| 0,0,1,0,1,0,0,0 |
omegan=[| 1,1,0,0,1,0,1,1|
atMostOne =
Any help would be appreciated.
Thank You.
The problems in your model stem from your set variable solution.
The first problem is caused by the set2array function. You might think that this returns an array with the integers that are located in your set; however, that is not true. Instead it returns an array of optional integers. This means that all values that could be in your set are stored in the array, but some of them might be marked absent. In this case it is almost the same as having an array of boolean variables that just say whether a value is in the set or not.
Note that the constraint length(set2array(solution)) <= k is impossible to satisfy. Because solution has more possible values than k, the length of the array will always be greater than k. The constraint you probably want to enforce is card(solution) <= k. The function card(X) returns the cardinality/size of a set. The same problem can be found in the second constraint.
Your final constraint has a different problem: it contains the expression fix(solution). In the context of your model you cannot write this, because solution won't be fixed at compilation time. The expression also evaluates to a set of int. Although you can use sets to access an array (array slicing), this is currently not allowed with variable sets. I would suggest trying to find a different formulation for this constraint. (Because I cannot figure out what it is supposed to do, I'm afraid I cannot suggest anything.)
As a final note, the commented-out constraint, alldifferent(solution), is unnecessary. Because solution is a set, it is guaranteed to contain each value only once.
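To make the intended meaning of the two cardinality constraints concrete, here is a brute-force Python sketch of what card(solution) <= k and card(solution intersect atMostOne[i]) <= 1 select. It is not a replacement for the MiniZinc model; candidate_sets is a hypothetical helper, and the groups are the example sets from the question with t assumed to be 9 so that all of them fit.

```python
from itertools import combinations


def candidate_sets(t, k, at_most_one):
    """Yield every subset of 1..t with at most k elements that shares
    at most one element with each group in at_most_one."""
    for size in range(k + 1):
        for subset in combinations(range(1, t + 1), size):
            chosen = set(subset)
            # card(chosen intersect group) <= 1 for every group
            if all(len(chosen & group) <= 1 for group in at_most_one):
                yield chosen


groups = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
solutions = list(candidate_sets(9, 3, groups))
print(len(solutions))
```

The third constraint of the model (separating positive from negative instances) is left out here, since its intended meaning is not clear from the question.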

Efficient random permutation of n-set-bits

For the problem of producing a bit-pattern with exactly n set bits, I know of two practical methods, but they both have limitations I'm not happy with.
First, you can enumerate all of the possible word values which have that many bits set in a pre-computed table, and then generate a random index into that table to pick out a possible result. This has the problem that as the output size grows the list of candidate outputs eventually becomes impractically large.
Alternatively, you can pick n non-overlapping bit positions at random (for example, by using a partial Fisher-Yates shuffle) and set those bits only. This approach, however, computes a random state in a much larger space than the number of possible results. For example, it may choose the first and second bits out of three, or it might, separately, choose the second and first bits.
This second approach must consume more bits from the random number source than are strictly required. Since it is choosing n bits in a specific order when their order is unimportant, this means that it is making an arbitrary distinction between n! different ways of producing the same result, and consuming at least floor(log_2(n!)) more bits than are necessary.
Can this be avoided?
There is obviously a third approach of iteratively computing and counting off the legal permutations until a random index is reached, but that's simply a space-for-time trade-off on the first approach, and isn't directly helpful unless there is an efficient way to count off those n permutations.
The first approach requires picking a single random number between zero and w! / ((w-n)! * n!) (where w is the output size), as this is the number of possible solutions.
The second approach requires picking n random values between zero and w-1, zero and w-2, etc., and these have a product of w! / (w-n)!, which is n! times larger than the first approach.
This means that the random number source has been forced to produce bits to distinguish n! different results which are all equivalent. I'd like to know if there's an efficient method to avoid relying on this superfluous randomness. Perhaps by using an algorithm which produces an un-ordered list of bit positions, or by directly computing the nth unique permutation of bits.
Seems like you want a variant of Floyd's algorithm:
Algorithm to select a single, random combination of values?
Should be especially useful in your case, because the containment test is a simple bitmask operation. This will require only k calls to the RNG. In the code below, I assume you have randint(limit) which produces a uniform random from 0 to limit-1, and that you want k bits set in a 32-bit int:
mask = 0;
for (j = 32 - k; j < 32; ++j) {
    r = randint(j+1);
    b = 1 << r;
    if (mask & b) mask |= (1 << j);
    else mask |= b;
}
How many bits of entropy you need here depends on how randint() is implemented. If k > 16, run the loop with 32 - k instead and take the bitwise complement of the result.
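For reference, the same loop as a runnable Python sketch. The function name random_mask and the width parameter are assumptions for this example; rng can be any object with a randrange method, so a cryptographic source such as random.SystemRandom() can be passed in instead of the default module.

```python
import random


def random_mask(k, width=32, rng=random):
    """Return an int with exactly k of the low `width` bits set,
    chosen uniformly, using k calls to the RNG (Floyd's algorithm)."""
    mask = 0
    for j in range(width - k, width):
        r = rng.randrange(j + 1)   # uniform in 0..j
        bit = 1 << r
        if mask & bit:
            mask |= 1 << j         # position r is taken, so use j instead
        else:
            mask |= bit
    return mask


m = random_mask(5)
print(f"{m:032b}")
```

Bit j can never be set before iteration j, which is why falling back to it is always safe.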
Your alternative suggestion of generating a single random number representing one combination among the set (mathematicians would call this a rank of the combination) is simpler if you use colex order rather than lexicographic rank. This code, for example:
for (i = k; i >= 1; --i) {
    while ((b = binomial(n, i)) > r) --n;
    buf[i-1] = n;
    r -= b;
}
will fill the array buf[] with indices from 0 to n-1 for the k-combination at colex rank r. In your case, you'd replace buf[i-1] = n with mask |= (1 << n). The binomial() function is binomial coefficient, which I do with a lookup table (see this). That would make the most efficient use of entropy, but I still think Floyd's algorithm would be a better compromise.
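A Python sketch of the same colex unranking, building the bitmask directly. It uses math.comb (Python 3.8+) in place of a lookup table; the name unrank_colex is an assumption for this example, and r is expected to satisfy 0 <= r < comb(n, k).

```python
from math import comb


def unrank_colex(r, n, k):
    """Return the bitmask of the k-combination of {0..n-1} at colex rank r."""
    mask = 0
    for i in range(k, 0, -1):
        # Find the largest position n with comb(n, i) <= r.
        while comb(n, i) > r:
            n -= 1
        mask |= 1 << n
        r -= comb(n, i)
    return mask


# All 3-of-5 combinations, in colex order.
masks = [unrank_colex(r, 5, 3) for r in range(comb(5, 3))]
print([f"{m:05b}" for m in masks])
```

A uniformly random pattern with k of w bits set is then unrank_colex(random.randrange(comb(w, k)), w, k), which needs only one random draw.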
[Expanding my comment:] If you only have a little raw entropy available, then use a PRNG to stretch it further. You only need enough raw entropy to seed a PRNG. Use the PRNG to do the actual shuffle, not the raw entropy. For the next shuffle reseed the PRNG with some more raw entropy. That spreads out the raw entropy and makes less of a demand on your entropy source.
If you know exactly the range of numbers you need out of the PRNG, then you can, carefully, set up your own LCG PRNG to cover the appropriate range while needing the minimum entropy to seed it.
ETA: In C++ there is a next_permutation() method. Try using that. See std::next_permutation Implementation Explanation for more.
Is this a theory problem or a practical problem?
You could still do the partial shuffle, but keep track of the order of the ones and forget the zeroes. There are log(k!) bits of unused entropy in their final order for your future consumption.
You could also just use the recurrence (n choose k) = (n-1 choose k-1) + (n-1 choose k) directly. Generate a random number between 0 and (n choose k)-1. Call it r. Iterate over all of the bits from the nth to the first. If we have to set j of the i remaining bits, set the ith if r < (i-1 choose j-1) and clear it, subtracting (i-1 choose j-1), otherwise.
Practically, I wouldn't worry about the couple of words of wasted entropy from the partial shuffle; generating a random 32-bit word with 16 bits set costs somewhere between 64 and 80 bits of entropy, and that's entirely acceptable. The growth rate of the required entropy is asymptotically worse than the theoretical bound, so I'd do something different for really big words.
For really big words, you might generate n independent bits that are 1 with probability k/n. This immediately blows your entropy budget (and then some), but it only uses linearly many bits. The number of set bits is tightly concentrated around k, though. For a further expected linear entropy cost, I can fix it up. This approach has much better memory locality than the partial shuffle approach, so I'd probably prefer it in practice.
I would use solution number 3, generate the i-th permutation.
But do you need to generate the first i-1 ones?
You can do it a bit faster than that with kind of divide and conquer method proposed here: Returning i-th combination of a bit array and maybe you can improve the solution a bit
From the formula you have given - w! / ((w-n)! * n!) it looks like your problem set has to do with the binomial coefficient which deals with calculating the number of unique combinations and not permutations which deals with duplicates in different positions.
You said:
"There is obviously a third approach of iteratively computing and counting off the legal permutations until a random index is reached, but that's simply a space-for-time trade-off on the first approach, and isn't directly helpful unless there is an efficient way to count off those n permutations.
This means that the random number source has been forced to produce bits to distinguish n! different results which are all equivalent. I'd like to know if there's an efficient method to avoid relying on this superfluous randomness. Perhaps by using an algorithm which produces an un-ordered list of bit positions, or by directly computing the nth unique permutation of bits."
So, there is a way to efficiently compute the nth unique combination, or rank, from the k-indexes. The k-indexes refer to a unique combination. For example, let's say that the n choose k case of 4 choose 3 is taken. This means that there are a total of 4 numbers that can be selected (0, 1, 2, 3), which is represented by n, and they are taken in groups of 3, which is represented by k. The total number of unique combinations can be calculated as n! / (k! * (n-k)!). The rank of zero corresponds to the k-index of (2, 1, 0). Rank one is represented by the k-index group of (3, 1, 0), and so forth.
There is a formula that can be used to very efficiently translate between a k-index group and the corresponding rank without iteration. Likewise, there is a formula for translating between the rank and corresponding k-index group.
I have written a paper on this formula and how it can be seen from Pascal's Triangle. The paper is called Tablizing The Binomial Coefficient.
I have written a C# class which is in the public domain that implements the formula described in the paper. It uses very little memory and can be downloaded from the site. It performs the following tasks:
Outputs all the k-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the k-index to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the entire set.
Converts the index in a sorted binomial coefficient table to the corresponding k-index. The technique used is also much faster than older iterative solutions.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers. This version returns a long value. There is at least one other method that returns an int. Make sure that you use the method that returns a long value.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with at least 2 cases and there are no known bugs.
The following tested example code demonstrates how to use the class and will iterate through each unique combination:
public void Test10Choose5()
{
    String S;
    int Loop;
    int N = 10; // Total number of elements in the set.
    int K = 5; // Total number of elements in each group.
    // Create the bin coeff object required to get all
    // the combos for this N choose K combination.
    BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
    int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
    // The KIndexes array specifies the indexes for a lexicographic element.
    int[] KIndexes = new int[K];
    StringBuilder SB = new StringBuilder();
    // Loop through all the combinations for this N choose K case.
    for (int Combo = 0; Combo < NumCombos; Combo++)
    {
        // Get the k-indexes for this combination.
        BC.GetKIndexes(Combo, KIndexes);
        // Verify that the KIndexes returned can be used to retrieve the
        // rank or lexicographic order of the KIndexes in the table.
        int Val = BC.GetIndex(true, KIndexes);
        if (Val != Combo)
        {
            S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
            Console.WriteLine(S);
        }
        // Build a space-separated string of the k-indexes.
        SB.Remove(0, SB.Length);
        for (Loop = 0; Loop < K; Loop++)
        {
            SB.Append(KIndexes[Loop].ToString());
            if (Loop < K - 1)
                SB.Append(" ");
        }
        S = "KIndexes = " + SB.ToString();
        Console.WriteLine(S);
    }
}
So, the way to apply the class to your problem is by considering each bit in the word size as the total number of items. This would be n in the n! / (k! * (n-k)!) formula. To obtain k, or the group size, simply count the number of bits set to 1. You would have to create a list or array of the class objects for each possible k, which in this case would be 32. Note that the class does not handle N choose N, N choose 0, or N choose 1, so the code would have to check for those cases and return 1 for both the 32 choose 0 case and the 32 choose 32 case. For 32 choose 1, it would need to return 32.
If you need to use values not much larger than 32 choose 16 (the worst case for 32 items, which yields 601,080,390 unique combinations), then you can use 32-bit integers, which is how the class is currently implemented. If you need to use 64-bit integers, then you will have to convert the class to use 64-bit longs. The largest value that a signed 64-bit long can hold is 9,223,372,036,854,775,807, which is 2^63 - 1. The worst case for n choose k when n is 64 is 64 choose 32, which is 1,832,624,140,942,590,534, so a long value will work for all 64 choose k cases. If you need numbers bigger than that, then you will probably want to look into using some sort of big integer class. In C#, the .NET framework has a BigInteger class. If you are working in a different language, it should not be hard to port.
If you are looking for a very good PRNG, one of the fastest and most lightweight with high-quality output is the Tiny Mersenne Twister, or TinyMT for short. I ported the code over to C++ and C#. It can be found here, along with a link to the original author's C code.
Rather than using a shuffling algorithm like Fisher-Yates, you might consider doing something like the following example instead:
// Get 7 random cards.
ulong Card;
ulong SevenCardHand = 0;
for (int CardLoop = 0; CardLoop < 7; CardLoop++)
{
    do
    {
        // The card has a value of between 0 and 51. So, get a random value and
        // left shift it into the proper bit position.
        Card = (1UL << RandObj.Next(CardsInDeck));
    } while ((SevenCardHand & Card) != 0);
    SevenCardHand |= Card;
}
The above code is faster than any shuffling algorithm (at least for obtaining a subset of random cards) since it only works on 7 cards instead of 52. It also packs the cards into individual bits within a single 64 bit word. It makes evaluating poker hands much more efficient as well.
As a side note, the best binomial coefficient calculator I have found that works with very large numbers (it accurately calculated a case that yielded over 15,000 digits in the result) can be found here.

Constrained Single-Objective Optimization

I need to split an array filled with a certain type (let's take water buckets for example) with two values set (in this case weight and volume) into two parts, while keeping the difference between the total weights to a minimum (preferred) and the difference between the total volumes less than 1000 (required). This doesn't need to be a full-fledged genetic algorithm or something similar, but it should be better than what I currently have...
Current Implementation
Not knowing how to do it better, I started by splitting the array into two same-length arrays (the array can contain an odd number of items), filling a possibly empty spot with an item whose values are both 0. The sides don't need to have the same number of items; I just didn't know how to handle it otherwise.
After having these distributed, I'm trying to optimize them like this:
func (main *Main) Optimize() {
	// Phase 1: keep swapping pairs while any swap lowers the weight difference.
	for {
		difference := main.Difference(WEIGHT)
		for i := 0; i < len(main.left); i++ {
			for j := 0; j < len(main.right); j++ {
				if main.DifferenceAfter(i, j, WEIGHT) < main.Difference(WEIGHT) {
					main.left[i], main.right[j] = main.right[j], main.left[i]
				}
			}
		}
		// Stop once a full pass no longer changes the weight difference.
		if difference == main.Difference(WEIGHT) {
			break
		}
	}
	// Phase 2: repair the volume constraint with the best single swap per pass.
	for main.Difference(CAPACITY) > 1000 {
		leftIndex := 0
		rightIndex := 0
		liters := 0
		weight := 100
		for i := 0; i < len(main.left); i++ {
			for j := 0; j < len(main.right); j++ {
				if main.DifferenceAfter(i, j, CAPACITY) < main.Difference(CAPACITY) {
					newLiters := main.Difference(CAPACITY) - main.DifferenceAfter(i, j, CAPACITY)
					newWeight := main.Difference(WEIGHT) - main.DifferenceAfter(i, j, WEIGHT)
					if newLiters > liters && newWeight <= weight || newLiters == liters && newWeight < weight {
						leftIndex = i
						rightIndex = j
						liters = newLiters
						weight = newWeight
					}
				}
			}
		}
		main.left[leftIndex], main.right[rightIndex] = main.right[rightIndex], main.left[leftIndex]
	}
}
main.Difference(const) calculates the absolute difference between the two sides, the constant taken as an argument decides the value to calculate the difference for
main.DifferenceAfter(i, j, const) simulates a swap between the two buckets, i being the left one and j being the right one, and calculates the resulting absolute difference then, the constant again determines the value to check
Basically this starts by optimizing the weight, which is what the first for-loop does. On every iteration, it tries every possible combination of buckets that can be switched and if the difference after that is less than the current difference (resulting in better distribution) it switches them. If the weight doesn't change anymore, it breaks out of the for-loop. While not perfect, this works quite well, and I consider this acceptable for what I'm trying to accomplish.
Then it's supposed to optimize the distribution based on the volume, so that the total difference is less than 1000. Here I tried to be more careful and search for the best combination in a run before switching it. Thus it searches for the bucket switch resulting in the biggest capacity change and is also supposed to search for a tradeoff between this and the weight, though I see the flaw that the first bucket combination tried will set the liters and weight variables, so the combinations considered afterwards are reduced by a large amount.
I think I need to include some more math here, but I'm honestly stuck and don't know how to continue, so I'd like to get some help from you; basically anything that can help me here is welcome.
As previously said, your problem is actually a constrained optimisation problem with a constraint on your difference of volumes.
Mathematically, this means minimising the difference of weights under the constraint that the difference of volumes is less than 1000. The simplest way to express it as a linear optimisation problem would be (the absolute values can be linearised with an auxiliary variable):
min |weights . x|
subject to |volumes . x| < 1000.0
for all i, x[i] = +1 or -1
Where a . b is the vector dot product. Once this problem is solved, all indices where x = +1 correspond to your first array, all indices where x = -1 correspond to your second array.
Unfortunately, 0-1 integer programming is known to be NP-hard. The simplest way of solving it is to perform an exhaustive brute-force exploration of the space, but that requires testing all 2^n possible vectors x (where n is the length of your original weights and volumes vectors), which can quickly get out of hand. There is a lot of literature on this topic, with more efficient algorithms, but they are often highly specific to a particular set of problems and/or constraints. You can google "linear integer programming" to see what has been done on this topic.
I think the simplest might be to perform a heuristic-based brute force search, where you prune your search tree early when it would get you out of your volume constraint, and stay close to your constraint (as a general rule, the solution of linear optimisation problems are on the edge of the feasible space).
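A minimal Python sketch of such a pruned brute-force search, assuming non-negative volumes. The names best_split and max_volume_diff are made up for this example, and the pruning rule is one simple choice among many: a branch is abandoned as soon as the volume difference can no longer be brought under the limit, even if all remaining volume went to the lighter side. The worst case is still exponential, so this only suits small n.

```python
def best_split(weights, volumes, max_volume_diff=1000):
    """Assign each item a sign (+1 = first array, -1 = second array) so that
    |volume difference| < max_volume_diff and |weight difference| is minimal.
    Returns (weight_diff, signs), or (None, None) if no split is feasible."""
    n = len(weights)
    # remaining_volume[i] = total volume of items i..n-1 (not yet assigned).
    remaining_volume = [0] * (n + 1)
    for i in range(n - 1, -1, -1):
        remaining_volume[i] = remaining_volume[i + 1] + volumes[i]

    best = {"diff": None, "signs": None}

    def search(i, weight_diff, volume_diff, signs):
        # Prune: the remaining items cannot repair the volume difference.
        if abs(volume_diff) - remaining_volume[i] >= max_volume_diff:
            return
        if i == n:
            if best["diff"] is None or abs(weight_diff) < best["diff"]:
                best["diff"] = abs(weight_diff)
                best["signs"] = signs[:]
            return
        for s in (1, -1):
            signs.append(s)
            search(i + 1, weight_diff + s * weights[i],
                   volume_diff + s * volumes[i], signs)
            signs.pop()

    search(0, 0, 0, [])
    return best["diff"], best["signs"]


diff, signs = best_split([5, 3, 2, 4], [600, 500, 400, 300])
print(diff, signs)
```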
Here are a couple of articles you might want to read on this kind of optimisations:
UCLA Linear integer programming
MIT course on Integer programming
Carleton course on Binary programming
Articles on combinatorial optimisation & linear integer programming
If you are not familiar with optimisation articles or math in general, the Wikipedia articles provide a good introduction, but most articles on this topic quickly show some (pseudo)code you can adapt right away.
If your n is large, I think at some point you will have to make a trade off between how optimal your solution is and how fast it can be computed. Your solution is probably suboptimal, but it is much faster than the exhaustive search. There might be a better trade off, depending on the exact configuration of your problem.
It seems that in your case, difference of weight is objective, while difference of volume is just a constraint, which means that you are seeking for solutions that optimize difference of weight attribute (as small as possible), and satisfy the condition on difference of volume attribute (total < 1000). In this case, it's a single objective constrained optimization problem.
Whereas, if you are interested in multi-objective optimization, maybe you want to look at the concept of the Pareto frontier. It's good for keeping multiple good solutions with advantages in different objectives, i.e., not losing diversity.

How to 'checksum' an array of noisy floating point numbers?

What is a quick and easy way to 'checksum' an array of floating point numbers, while allowing for a specified small amount of inaccuracy?
e.g. I have two algorithms which should (in theory, with infinite precision) output the same array. But they work differently, and so floating point errors will accumulate differently, though the array lengths should be exactly the same. I'd like a quick and easy way to test if the arrays seem to be the same. I could of course compare the numbers pairwise, and report the maximum error; but one algorithm is in C++ and the other is in Mathematica and I don't want the bother of writing out the numbers to a file or pasting them from one system to another. That's why I want a simple checksum.
I could simply add up all the numbers in the array. If the array length is N, and I can tolerate an error of 0.0001 in each number, then I would check if abs(sum1-sum2)<0.0001*N. But this simplistic 'checksum' is not robust, e.g. to an error of +10 in one entry and -10 in another. (And anyway, probability theory says that the error probably grows like sqrt(N), not like N.) Of course, any checksum is a low-dimensional summary of a chunk of data so it will miss some errors, if not most... but simple checksums are nonetheless useful for finding non-malicious bug-type errors.
Or I could create a two-dimensional checksum, [sum(x[n]), sum(abs(x[n]))]. But is this the best I can do, i.e. is there a different function I might use that would be "more orthogonal" to the sum(x[n])? And if I used some arbitrary functions, e.g. [sum(f1(x[n])), sum(f2(x[n]))], then how should my 'raw error tolerance' translate into 'checksum error tolerance'?
I'm programming in C++, but I'm happy to see answers in any language.
i have a feeling that what you want may be possible via something like gray codes. if you could translate your values into gray codes and use some kind of checksum that was able to correct n bits you could detect whether or not the two arrays were the same except for n-1 bits of error, right? (each bit of error means a number is "off by one", where the mapping would be such that this was a variation in the least significant digit).
but the exact details are beyond me - particularly for floating point values.
i don't know if it helps, but what gray codes solve is the problem of pathological rounding. rounding sounds like it will solve the problem - a naive solution might round and then checksum. but simple rounding always has pathological cases - for example, if we use floor, then 0.9999999 and 1 are distinct. a gray code approach seems to address that, since neighbouring values are always single bit away, so a bit-based checksum will accurately reflect "distance".
[update:] more exactly, what you want is a checksum that gives an estimate of the hamming distance between your gray-encoded sequences (and the gray encoded part is easy if you just care about 0.0001 since you can multiply everything by 10000 and use integers).
and it seems like such checksums do exist: Any error-correcting code can be used for error detection. A code with minimum Hamming distance, d, can detect up to d − 1 errors in a code word. Using minimum-distance-based error-correcting codes for error detection can be suitable if a strict limit on the minimum number of errors to be detected is desired.
so, just in case it's not clear:
multiply by the inverse of the minimum error to get integers
convert to gray code equivalent
use an error detecting code with a minimum hamming distance larger than the error you can tolerate.
but i am still not sure that's right. you still get the pathological rounding in the conversion from float to integer. so it seems like you need a minimum hamming distance that is 1 + len(data) (worst case, with a rounding error on each value). is that feasible? probably not for large arrays.
maybe ask again with better tags/description now that a general direction is possible? or just add tags now? we need someone who does this for a living. [i added a couple of tags]
I've spent a while looking for a deterministic answer, and been unable to find one. If there is a good answer, it's likely to require heavy-duty mathematical skills (functional analysis).
I'm pretty sure there is no solution based on "discretize in some cunning way, then apply a discrete checksum", e.g. "discretize into strings of 0/1/?, where ? means wildcard". Any discretization will have the property that two floating-point numbers very close to each other can end up with different discrete codes, and then the discrete checksum won't tell us what we want to know.
However, a very simple randomized scheme should work fine. Generate a pseudorandom string S from the alphabet {+1,-1}, and compute csx=sum(X_i*S_i) and csy=sum(Y_i*S_i), where X and Y are my original arrays of floating point numbers. If we model the errors as independent Normal random variables with mean 0, then it's easy to compute the distribution of csx-csy. We could do this for several strings S, and then do a hypothesis test that the mean error is 0. The number of strings S needed for the test is fixed, it doesn't grow linearly in the size of the arrays, so it satisfies my need for a "low-dimensional summary". This method also gives an estimate of the standard deviation of the error, which may be handy.
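A minimal Python sketch of this randomized scheme. The name sign_checksums, the default of 32 sign strings, and the seed are assumptions for illustration. Both sides must use exactly the same sign strings, so when the two arrays live in different systems the sign generator has to be reproduced identically there (Python's random.Random is used here only for convenience). The hypothesis test itself is left out; the sketch only computes the checksums to compare.

```python
import random


def sign_checksums(values, n_strings=32, seed=12345):
    """Return n_strings sums of +/- values[i], with signs taken from a
    pseudorandom +1/-1 string that depends only on seed and position."""
    rng = random.Random(seed)
    sums = []
    for _ in range(n_strings):
        total = 0.0
        for v in values:
            total += v if rng.random() < 0.5 else -v
        sums.append(total)
    return sums


n = 1000
x = [0.5 * i for i in range(n)]
noise = random.Random(1)
y = [v + noise.uniform(-1e-4, 1e-4) for v in x]   # small per-entry error
z = list(x)
z[3] += 10.0                                      # two large errors that
z[700] -= 10.0                                    # cancel in a plain sum

small = max(abs(a - b) for a, b in zip(sign_checksums(x), sign_checksums(y)))
large = max(abs(a - b) for a, b in zip(sign_checksums(x), sign_checksums(z)))
print(small, large)
```

The plain sums of x and z agree, but whenever a sign string gives the two corrupted positions opposite signs, the corresponding checksum differs by 20.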
Try this:
#include <complex>
#include <cmath>
#include <cstddef>
#include <iostream>
const size_t no_freqs = 3;
const double freqs[no_freqs] = {0.05, 0.16, 0.39}; // (for example)
int main() {
    std::complex<double> spectral_amplitude[no_freqs];
    for (size_t i = 0; i < no_freqs; ++i) spectral_amplitude[i] = 0.0;
    size_t n_data = 0;
    std::complex<double> datum;
    // Read one value at a time and accumulate its contribution at each frequency.
    while (std::cin >> datum) {
        for (size_t i = 0; i < no_freqs; ++i) {
            spectral_amplitude[i] += datum * std::exp(
                std::complex<double>(0.0, 1.0) * freqs[i] * double(n_data)
            );
        }
        ++n_data; // advance the sample index so the phase depends on position
    }
    std::cout << "Fuzzy checksum:\n";
    for (size_t i = 0; i < no_freqs; ++i) {
        std::cout << real(spectral_amplitude[i]) << "\n";
        std::cout << imag(spectral_amplitude[i]) << "\n";
    }
    std::cout << "\n";
    return 0;
}
It returns just a few, arbitrary points of a Fourier transform of the entire data set. These make a fuzzy checksum, so to speak.
How about computing a standard integer checksum on the data obtained by zeroing the least significant digits of the data, the ones that you don't care about?