Optimal Solution: Get a random sample of items from a data set - objective-c

So I recently had this as an interview question and I was wondering what the optimal solution would be. The code is in Objective-C.
Say we have a very large data set, and we want to get a random sample
of items from it for testing a new tool. Rather than worry about the
specifics of accessing things, let's assume the system provides these
things:
// Return a random number from the set 0, 1, 2, ..., n-2, n-1.
int Rand(int n);

// Interface to implementations other people write.
@interface Dataset : NSObject
// YES when there is no more data.
- (BOOL)endOfData;
// Get the next element and move forward.
- (NSString*)getNext;
@end

// This function reads elements from |input| until the end, and
// returns an array of |k| randomly-selected elements.
- (NSArray*)getSamples:(unsigned)k from:(Dataset*)input
{
    // Describe how this works.
}
Edit: So you are supposed to randomly select items from a given array. So if k = 5, then I would want to randomly select 5 elements from the dataset and return an array of those items. Each element in the dataset has to have an equal chance of getting selected.

This seems like a good time to use Reservoir Sampling. The following is an Objective-C adaptation for this use case:
NSMutableArray* result = [[NSMutableArray alloc] initWithCapacity:k];
int i, j;
// Fill the reservoir with the first k elements (assumes the data set
// has at least k elements).
for (i = 0; i < k; i++) {
    [result setObject:[input getNext] atIndexedSubscript:i];
}
// For the (i+1)-th element, draw j uniformly from 0..i and replace
// slot j if j < k, i.e. keep the element with probability k/(i+1).
for (i = k; ![input endOfData]; i++) {
    j = Rand(i + 1);
    NSString* next = [input getNext];
    if (j < k) {
        [result setObject:next atIndexedSubscript:j];
    }
}
return result;
The code above is not the most efficient reservoir-sampling algorithm, because it generates a random number for every input element past the first k. Slightly more complex algorithms exist under the general category "reservoir sampling". Vitter's 1985 paper describing an algorithm named "Algorithm Z" is an interesting read. I would be curious if people find newer literature on reservoir sampling, too, because that article was published in 1985.
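For reference, here is a minimal C sketch of the skip-based idea (this particular variant is usually called "Algorithm L"); it assumes the data is available as an array and substitutes the C library's rand() for the question's Rand(), so treat it as an illustration rather than a drop-in implementation:

#include <math.h>
#include <stdlib.h>

/* Uniform double in (0, 1), built on the C library's rand(). */
static double uniform01(void) {
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* Skip-based reservoir sampling: instead of drawing one random number
   per element, draw how many elements to skip before the next
   replacement. data/n stand in for the stream. */
void sample_skip(const int *data, int n, int *reservoir, int k) {
    for (int i = 0; i < k; i++)
        reservoir[i] = data[i];
    double w = exp(log(uniform01()) / k);
    int i = k - 1;
    for (;;) {
        i += (int)floor(log(uniform01()) / log(1.0 - w)) + 1;
        if (i >= n)
            break;
        reservoir[rand() % k] = data[i];  /* evict a random slot */
        w *= exp(log(uniform01()) / k);
    }
}

The expected number of random draws is O(k log(n/k)) rather than O(n), which is where the speedup over the basic one-draw-per-element algorithm comes from.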

Interesting question, but as there is no count or similar method in Dataset and you are not allowed to iterate more than once, I can only come up with this solution to get good random samples (note that it buffers the whole data set in memory, and there is no handling for k > data size):
- (NSArray *)getSamples:(unsigned)k from:(Dataset*)input {
    NSMutableArray *source = [[NSMutableArray alloc] init];
    while (![input endOfData]) {
        [source addObject:[input getNext]];
    }
    NSMutableArray *ret = [[NSMutableArray alloc] initWithCapacity:k];
    int count = (int)[source count];
    while ([ret count] < k) {
        // Pick a random remaining element and remove it so it cannot
        // be selected twice.
        int index = Rand(count);
        [ret addObject:[source objectAtIndex:index]];
        [source removeObjectAtIndex:index];
        count--;
    }
    return ret;
}

This is not the answer I gave in the interview, but here is what I wish I had done:
1. Store a pointer to the first element in the dataset.
2. Loop over the dataset to get the count.
3. Reset the dataset to point at the first element.
4. Create an NSMutableDictionary for storing the random indexes.
5. Do a for loop from i = 0 to i = k-1. Each iteration, generate a random value and check whether it exists in the dictionary. If it does, keep generating random values until you get a fresh one.
6. Loop over the dataset. If the current index is in the dictionary, add that element to the array of random subset values.
7. Return the array of random subsets.
A sketch of this two-pass approach appears after this list.
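A minimal C sketch of the two-pass approach, assuming a hypothetical resettable stream and using a flag array where the answer suggests an NSMutableDictionary:

#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stream interface mirroring the question's Dataset. */
typedef struct { const int *items; int n; int pos; } Stream;
static bool end_of_data(Stream *s) { return s->pos >= s->n; }
static int  get_next(Stream *s)    { return s->items[s->pos++]; }
static void reset(Stream *s)       { s->pos = 0; }

/* Two-pass selection: count the data, pick k distinct indexes by
   rejection, then collect the chosen elements on a second pass. */
void sample_two_pass(Stream *s, int k, int *out) {
    int n = 0;
    while (!end_of_data(s)) { get_next(s); n++; }  /* pass 1: count */
    bool *chosen = calloc(n, sizeof(bool));
    for (int picked = 0; picked < k; ) {
        int idx = rand() % n;     /* retry until we hit a fresh index */
        if (!chosen[idx]) { chosen[idx] = true; picked++; }
    }
    reset(s);
    for (int i = 0, j = 0; i < n; i++) {           /* pass 2: collect */
        int v = get_next(s);
        if (chosen[i]) out[j++] = v;
    }
    free(chosen);
}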

There are multiple ways to do this. The first way:
1. Use the input parameter k to dynamically allocate an array of numbers:
unsigned *numsArray = (unsigned *)malloc(sizeof(unsigned) * k);
2. Run a loop that gets k random numbers and stores them in numsArray (being careful to check each new random number against the ones already collected, drawing again on a duplicate).
3. Sort numsArray.
4. Run a loop over the DataSet with your own incrementing counter dataCount and another counter numsCount, both starting at 0. Whenever dataCount equals numsArray[numsCount], grab the current data object, add it to your newly created random list, and increment numsCount.
5. The loop in step 4 can end either when numsCount reaches k or when dataCount reaches the end of the dataset.
6. The only other step needed, before any of this, is to walk the dataset once with getNext to count how large it is, so you can bound your random numbers and check that k is less than or equal to that count.
The second way to do this would be to run through the actual list MULTIPLE times.
// one must assume that once we get to the end, we can start over within the set again
1. Run a while loop that checks for endOfData:
a. count up a count variable that is initialized to 0.
2. Run a loop from 0 through k-1:
a. generate a random number constrained to the list size.
b. run a loop that moves through the dataset until it hits the random element.
c. compare that element with all the elements already in your new list to make sure it isn't already there.
d. store the element in your new list.
There may be ways to speed up the second method by remembering the current list location, so that if you generate a random index past the current pointer you don't have to move through the whole list again to get back to element 0 and then out to the element you wish to retrieve.
A potential third way to do this might be to:
1. Run a loop from 0 through k-1:
a. generate a random number.
b. use it as a skip count, and move that many objects through the list.
c. store the current item from the list in your new list.
The problem with this third method is that without knowing how big the list is, you don't know how to constrain the random skip count. Further, even if you did, the result wouldn't truly be a random subset, because it becomes statistically unlikely that you would ever reach the last element in the list (i.e. not every element is given an equal chance of being selected).
Arguably the FASTEST way to do this is method 1, where you generate the random indexes first and then traverse the list only once (well, actually twice: once to get the size of the dataset, then again to grab the random elements). A sketch of method 1 follows.
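For illustration, a minimal C sketch of method 1, assuming the dataset has already been counted and buffered into an array; the duplicate check is the naive scan described in step 2:

#include <stdlib.h>

/* qsort comparator for unsigned indexes. */
static int cmp_unsigned(const void *a, const void *b) {
    unsigned x = *(const unsigned *)a, y = *(const unsigned *)b;
    return (x > y) - (x < y);
}

/* Method 1: draw k distinct random indexes, sort them, then collect
   the matching elements in a single forward pass. */
void sample_sorted_indexes(const int *data, unsigned n,
                           int *out, unsigned k) {
    unsigned *numsArray = malloc(sizeof(unsigned) * k);
    for (unsigned picked = 0; picked < k; ) {
        unsigned r = (unsigned)(rand() % n);
        int dup = 0;                       /* reject duplicates */
        for (unsigned j = 0; j < picked; j++)
            if (numsArray[j] == r) { dup = 1; break; }
        if (!dup) numsArray[picked++] = r;
    }
    qsort(numsArray, k, sizeof(unsigned), cmp_unsigned);
    unsigned numsCount = 0;
    for (unsigned dataCount = 0; dataCount < n && numsCount < k; dataCount++)
        if (dataCount == numsArray[numsCount])
            out[numsCount++] = data[dataCount];
    free(numsArray);
}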

We need a little probability theory. Like others, I will ignore the case n < k. The probability that the n'th item will be selected into the set of size k is just C(n-1, k-1) / C(n, k), where C is the binomial coefficient. A bit of math shows that this is just k/n. For the rest, note that the selection of the n'th item is independent of all other selections. In other words, "the past doesn't matter."
So an algorithm is:
S = set of up to k elements
n = 0
while not end of input
    v = next value
    n = n + 1
    if |S| < k, add v to S
    else if random(0,1) < k/n, replace a randomly chosen element of S with v
I will let the coders code this one! It's pretty trivial. All you need is an array of size k and one pass over the data.
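Taking up that invitation, here is a C sketch of the pseudocode, assuming for illustration that the stream is exposed as a plain array (rand() % m stands in for an unbiased generator such as the question's Rand()):

#include <stdlib.h>

/* A direct C rendering of the pseudocode above; data/n stand in for
   the input stream, and out must have room for k elements. */
int reservoir_sample(const int *data, int n, int *out, int k) {
    int filled = 0;
    for (int i = 0; i < n; i++) {
        if (filled < k)
            out[filled++] = data[i];      /* |S| < k: add v to S */
        else if (rand() % (i + 1) < k)    /* probability k/(i+1) */
            out[rand() % k] = data[i];    /* replace a random element */
    }
    return filled;                        /* min(n, k) */
}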

If you care about efficiency (as your tags suggest) and the number of items in the population is known, don't use reservoir sampling. That would require you to loop through the entire data set and generate a random number for each item.
Instead, pick five values in the range 0 to n-1. In the unlikely case of a duplicate among the five indexes, replace the duplicate with another random value. Then use the five indexes to do random-access lookups of the corresponding elements in the population.
This is simple. It uses a minimum number of calls to the random number generator, and it accesses memory only for the relevant selections.
If you don't know the number of data elements in advance, you can loop over the data once to get the population size and proceed as above.
If you aren't allowed to iterate over the data more than once, use a chunked form of reservoir sampling: 1) Choose the first five elements as the initial sample. 2) Read in a large chunk of data and choose five new samples from that set (using only five calls to Rand). 3) Pairwise, decide whether to keep the new sample item or the old sample element (with odds proportional to the probabilities for each of the two sample groups). 4) Repeat until all the data has been read.
For example, assume there are 1000 data elements (but we don't know this in advance).
Choose the first five as the initial sample: current_sample = read(5); population = 5.
Then, until all the data has been read:
    Read a chunk of n data points (perhaps n = 200 in this example):
        subpop = read(200);
        m = len(subpop);
        new_sample = choose(5, subpop);
    Loop over the two samples pairwise:
        for (a, b) in (current_sample, new_sample):
            if random(0 to population + m) < population, keep a; otherwise keep b
    population += m
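A sketch of the pairwise merge step in C, under the answer's assumptions (both samples hold k elements; the current sample represents population elements seen so far, the new one a chunk of m):

#include <stdlib.h>

/* Keep the old sample's element with probability population/(population+m),
   as described above; otherwise adopt the new sample's element. */
void merge_samples(int *current, const int *fresh, int k,
                   int population, int m) {
    for (int i = 0; i < k; i++) {
        if (rand() % (population + m) >= population)
            current[i] = fresh[i];
    }
}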

How to effectively get the N lowest values from the collection (Top N) in Kotlin?

Is there any other way besides collectionOrSequence.sortedBy{it.value}.take(n)?
Assume I have a collection with 100500+ elements and I need to find the 10 lowest. I'm afraid that sortedBy will create a new temporary collection from which take will then keep only 10 items.
You could keep a list of the n smallest elements and just update it on demand, e.g.
fun <T : Comparable<T>> top(n: Int, collection: Iterable<T>): List<T> {
    return collection.fold(ArrayList<T>()) { topList, candidate ->
        if (topList.size < n || candidate < topList.last()) {
            // ideally insert at the right place
            topList.add(candidate)
            topList.sort()
            // trim to size
            if (topList.size > n)
                topList.removeAt(n)
        }
        topList
    }
}
That way you only compare each element of your list once against the largest of the top-n elements, which will usually be faster than sorting the entire list: https://pl.kotl.in/SyQPtDTcQ
If you're running on the JVM, you could use Guava's Comparators.least(int, Comparator), which uses a more efficient algorithm than any of these suggestions, taking O(n + k log k) time and O(k) memory to find the lowest k elements in a collection of size n, as opposed to zapl's algorithm (O(nk log k)) or Lior's (O(nk)).
You have more to worry about.
collectionOrSequence.sortedBy{it.value} runs java.util.Arrays.sort, which will run TimSort (or mergeSort if requested).
TimSort is great, but it usually costs n*log(n) operations, which is much more than the O(n) of copying the array.
Each of those O(n*log n) operations will invoke a function (the lambda you provided, {it.value}) - an additional meaningful overhead.
Lastly, java.util.Arrays.sort will convert the collection to an Array and back to a List - 2 additional conversions (which you wanted to avoid, but this is secondary).
The efficient way to do it is probably:
1. Map the values for comparison into a list: O(n) conversions (once per element) rather than O(n*log n) or more.
2. Iterate over the list (or Array) created, collecting the N smallest elements in one pass:
- Keep a list of the N smallest elements found so far and their indexes in the original list. If it is small (e.g. 10 items), a MutableList is a good fit.
- Keep a variable holding the max value in that small list.
- While iterating over the original collection, compare the current element against the max value of the small-values list. If it is smaller, replace the max in the "small list" and find the list's new max value.
3. Use the indexes from the "small list" to extract the 10 smallest elements of the original list.
That would allow you to go from O(n*log n) to O(n); see the sketch below.
Of course, if time is critical - it is always best to benchmark the specific case.
If you managed, on the first step, to extract primitives for the basis of comparison (e.g. int or long) - that would be even more efficient.
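For illustration, here is that one-pass scheme sketched in C with int primitives and a small insertion-sorted buffer (values only; carrying the source indexes along, as suggested above, is a straightforward extension):

/* One-pass "smallest N" selection with a small insertion-sorted buffer
   kept in ascending order (top[n_top - 1] is the current max). */
void smallest_n(const int *data, int len, int *top, int n_top) {
    int count = 0;
    for (int i = 0; i < len; i++) {
        int v = data[i];
        if (count == n_top && v >= top[count - 1])
            continue;                      /* not small enough to matter */
        int j = (count < n_top) ? count++ : n_top - 1;
        while (j > 0 && top[j - 1] > v) {  /* shift larger values up */
            top[j] = top[j - 1];
            j--;
        }
        top[j] = v;
    }
}

With n_top fixed at 10, the inner shift runs at most 10 steps, so the whole scan stays effectively linear in the collection size.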
If the collection has 1k+ values spread randomly, I suggest implementing your own sort method based on a typical quicksort algorithm (sorting in ascending order and taking the first N elements).

Linking Text to an Integer Objective C

The goal of this post is to find a more efficient way to create this method. Right now, as I start adding more and more values, I'm going to have a very messy and confusing app. Any help is appreciated!
I am making a workout app and assign an integer value to each workout. For example:
Where the number is exerciseInt:
01 is High Knees
02 is Jumping Jacks
03 is Jog in Place
etc.
I am making it so there is a feature to randomize the workout. To do this I am using this code:
-(IBAction) setWorkoutIntervals {
    exerciseInt01 = 1 + (rand() % 3);
    exerciseInt02 = 1 + (rand() % 3);
    exerciseInt03 = 1 + (rand() % 3);
}
So basically each workout interval will first be a random workout (between high knees, jumping jacks, and jog in place). What I want to do is make something universal that defines the following, so I don't have to continuously hard-code everything.
Right now I have:
-(void) setLabelText {
    if (exerciseInt01 == 1) {
        exercise01Label.text = [NSString stringWithFormat:@"High Knees"];
    }
    if (exerciseInt01 == 2) {
        exercise01Label.text = [NSString stringWithFormat:@"Jumping Jacks"];
    }
    if (exerciseInt01 == 3) {
        exercise01Label.text = [NSString stringWithFormat:@"Jog in Place"];
    }
}
I can already tell this is about to get really messy once I start specifying images for each workout and adding more workouts. Additionally, my plan was to put the same code for exercise02Label, exercise03Label, etc., which would become extremely redundant and probably unnecessary.
What I'm thinking would be perfect if there would be someway to say
exercise01Label.text = exercise01Int; (I want to say that the label's text equals "Jumping Jacks" based on the current integer value)
How can I make it so I only have to state everything once and make the code less messy and less lengthy?
Three things for you to explore to make your code easier:
1. Count from zero
A number of things can be easier if you count from zero. A simple example is if your first exercise was numbered 0 then your random calculation would just be rand() % 3 (BTW look up uniform random number, there are much better ways to get a random number).
2. Learn about enumerations
An enumeration is a type with a set of named literal values. In (Objective-)C you can also think of them as just a collection of named integer values. For example you might declare:
typedef enum
{
    HighKnees,
    JumpingJacks,
    JogInPlace,
    ExerciseKindCount
} ExerciseCount;
Which declares ExerciseCount as a new type with 4 values. Each of these is equivalent to an integer, here HighKnees is equivalent to 0 and ExerciseKindCount to 3 - this should make you think of the first thing, count from zero...
3. Discover arrays
An array is an ordered collection of items where each item has an index - which is usually an integer or enumeration value. In (Objective-)C there are two basic kinds of arrays: C-style and object-style represented by NSArray and NSMutableArray. For example here is a simple C-style array:
NSString *gExerciseLabels[ExerciseKindCount] =
{   @"High Knees",
    @"Jumping Jacks",
    @"Jog in Place"
};
You've probably guessed by now, the first item of the above array has index 0, back to counting from zero...
Exploring these three things should quickly show you ways to simplify your code. Later you may wish to explore structures and objects.
HTH
A simple way to start is by putting the exercise names in an array. Then you can access the names by index, e.g. exerciseNames[exerciseNumber]. You can also make the list of exercises in a workout an array (of integers), so you would get exerciseNames[exerciseTable[i]], for example. Eventually you will want an object to define an exercise so that you can include images, videos, counts, durations, etc. A sketch combining these ideas follows.
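Putting those pieces together, here is a small C-flavored sketch; in the app the strings would be NSString literals (e.g. @"High Knees") and the printf would become a label assignment:

#include <stdio.h>
#include <stdlib.h>

typedef enum {
    HighKnees,
    JumpingJacks,
    JogInPlace,
    ExerciseKindCount
} Exercise;

/* One table, indexed by the enum, replaces the chain of ifs. */
static const char *gExerciseLabels[ExerciseKindCount] = {
    "High Knees",
    "Jumping Jacks",
    "Jog in Place"
};

int main(void) {
    /* Pick three random exercises, counting from zero. */
    Exercise workout[3];
    for (int i = 0; i < 3; i++)
        workout[i] = (Exercise)(rand() % ExerciseKindCount);
    /* Setting each label becomes a single table lookup. */
    for (int i = 0; i < 3; i++)
        printf("Exercise %d: %s\n", i + 1, gExerciseLabels[workout[i]]);
    return 0;
}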

Time and Space Complexity (for a specific algorithm)

Despite the last 30 minutes I spent trying to understand time and space complexity better, I still can't confidently determine them for the algorithm below:
bool checkSubstr(std::string sub)
{
    // 6 OR(||)-connected if statements (checks whether the parameter
    // is among the items in the list)
}

void checkWords(int start, int end)
{
    int wordList[2] = {0};
    int j = 0;
    if (start < 0)
    {
        start = 0;
    }
    if (end > cAmount)
    {
        end = cAmount - 1;
    }
    if (end - start < 2)
    {
        return;
    }
    for (int i = start; i <= end - 2; i++)
    {
        if (crystals[i] == 'I' || crystals[i] == 'A')
        {
            continue;
        }
        if (checkSubstr(crystals.substr(i, 3)))
        {
            wordList[j] = i;
            j++;
        }
    }
    if (j == 1)
    {
        crystals.erase(wordList[0], 3);
        cAmount -= 3;
        checkWords(wordList[0] - 2, wordList[0] + 1);
    }
    else if (j == 2)
    {
        crystals.erase(wordList[0], (wordList[1] - wordList[0] + 3));
        cAmount -= wordList[1] - wordList[0] + 3;
        checkWords(wordList[0] - 2, wordList[0] + 1);
    }
}
The function basically checks a sub-string of the whole string for predetermined (3-letter, e.g. "SAN") combinations of letters. The sub-string length can be 4-6; there is no real way to determine it, as it depends on the input (pretty sure it's not relevant, although not 100%).
My reasoning:
If there are n letters in the string, worst case scenario, we have to check each of them. Again depending on the input, this can happen 3 ways.
All 6-length sub-strings: if this is the case the function runs n/6 times, each run doing 8 (or 10?) operations, which (I think) means its time complexity is O(n).
All 4-length sub-strings: pretty much the same reason as above, O(n).
4- and 6-length sub-strings mixed: I can't see why this would be different from the previous 2. O(n).
As for the space complexity, I am completely lost. However, I have an idea:
if the function recurs the maximum number of times, it will require
n/4 x the amount used in one run,
which made me think it should be O(n). Although I'm not convinced this is correct, I thought maybe seeing someone else's thought process on this example would help me understand how to calculate time and space complexity better.
Thank you for your time.
EDIT: Let me provide clearer information. We read a combination of 6 different letters into a string; this can be (almost) any combination of any length. 'crystals' is the string, and we are looking for 6 different 3-letter combinations in that list of letters, sort of like a jewel-matching game. The starting list contains no matches (none of the 6 predetermined combinations exist in the first place). Therefore the only way matches can occur from then on is by swaps or by matches disappearing. Once a swap is processed by the top-level code, the function is called to check for matches, and if a match is found the function recurs after deleting the "match" part of the string.
Now let's look at how the code looks for a match. To demonstrate a swap of 2 letters:
ABA B-R ZIB (no spaces or '-' in the actual string; they are used for better demonstration),
where B and R are being swapped. This swap only affects the 6 letters starting at the 2nd letter and ending at the 7th. In other words, the letters that the first A and the last B can form a match with are the same before and after the swap, so there is no point checking for matches that include those letters. So a sub-string of 6 letters is sent to the checking algorithm. Similarly, if a formed match disappears (gets deleted from the string), the range of affected letters is 4. So when I thought of a worst-case scenario, I imagined either 1 swap creating a whole chain reaction that keeps matching until there are not enough letters left to form a match, or each match happening from a swap. Again, I am not saying this is how we should think when calculating time and space complexity, but this is how the code works. Hope this is clear enough; if not, let me know and I can provide more details. It's also important to note that the swap amounts and places are part of the input we read.
EDIT: Here is how the function is called on top level for the first time:
checkWords(swaps[i]-2,swaps[i]+3);
"Sub-string length can be 4-6 no real way to determine, depends on the input (pretty sure it's not relevant, although not 100%)."
That's not what the code shows; the line if (checkSubstr(crystals.substr(i,3))) conveys that substrings always have exactly 3 characters. If the substring length varies, it is relevant, since a naive substring match degrades to O(N*M) in the general case, where N is end-start+1 (the size of the input string) and M is the size of the substring being searched. This happens because in the worst case you'll compare M characters for each of the N characters of the source string.
The rest of this answer assumes that substrings are of size 3, since that's what the code shows.
If substrings are always 3 characters long, it's different: you can essentially assume checkSubstr() is O(1) because you will always compare at most 3 characters. The bulk of the work happens inside the for loop, which is O(N), where N is end-1-start.
After the loop, in the worst case (when one of the ifs is entered), you erase a bunch of characters from crystals. Assuming this is a string backed by an array in memory, that is an O(cAmount) operation, because all elements after wordList[0] must be shifted. The recursive call always passes in a range of size 4; it does not grow or shrink with the size of the input, so you can also say there are O(1) recursive calls.
Thus, time complexity is O(N + cAmount) (where N is end-1-start), and space complexity is O(1).

Efficient random permutation of n-set-bits

For the problem of producing a bit-pattern with exactly n set bits, I know of two practical methods, but they both have limitations I'm not happy with.
First, you can enumerate all of the possible word values which have that many bits set in a pre-computed table, and then generate a random index into that table to pick out a possible result. This has the problem that as the output size grows the list of candidate outputs eventually becomes impractically large.
Alternatively, you can pick n non-overlapping bit positions at random (for example, by using a partial Fisher-Yates shuffle) and set those bits only. This approach, however, computes a random state in a much larger space than the number of possible results. For example, it may choose the first and second bits out of three, or it might, separately, choose the second and first bits.
This second approach must consume more bits from the random number source than are strictly required. Since it is choosing n bits in a specific order when their order is unimportant, this means that it is making an arbitrary distinction between n! different ways of producing the same result, and consuming at least floor(log_2(n!)) more bits than are necessary.
Can this be avoided?
There is obviously a third approach of iteratively computing and counting off the legal permutations until a random index is reached, but that's simply a space-for-time trade-off on the first approach, and isn't directly helpful unless there is an efficient way to count off those n permutations.
clarification
The first approach requires picking a single random number between zero and w!/((w-n)!*n!) - 1 (where w is the output size), as this is the number of possible solutions.
The second approach requires picking n random values between zero and w-1, zero and w-2, etc., and these have a product of w!/(w-n)!, which is n! times larger than the first approach.
This means that the random number source has been forced to produce bits to distinguish n! different results which are all equivalent. I'd like to know if there's an efficient method to avoid relying on this superfluous randomness. Perhaps by using an algorithm which produces an un-ordered list of bit positions, or by directly computing the nth unique permutation of bits.
Seems like you want a variant of Floyd's algorithm:
Algorithm to select a single, random combination of values?
Should be especially useful in your case, because the containment test is a simple bitmask operation. This will require only k calls to the RNG. In the code below, I assume you have randint(limit) which produces a uniform random from 0 to limit-1, and that you want k bits set in a 32-bit int:
mask = 0;
for (j = 32 - k; j < 32; ++j) {
    r = randint(j+1);
    b = 1 << r;
    if (mask & b) mask |= (1 << j);
    else mask |= b;
}
How many bits of entropy you need here depends on how randint() is implemented. If k > 16, set it to 32 - k and negate the result.
Your alternative suggestion of generating a single random number representing one combination among the set (mathematicians would call this a rank of the combination) is simpler if you use colex order rather than lexicographic rank. This code, for example:
for (i = k; i >= 1; --i) {
    while ((b = binomial(n, i)) > r) --n;
    buf[i-1] = n;
    r -= b;
}
will fill the array buf[] with indices from 0 to n-1 for the k-combination at colex rank r. In your case, you'd replace buf[i-1] = n with mask |= (1 << n). The binomial() function is the binomial coefficient, which I do with a lookup table (see this). That would make the most efficient use of entropy, but I still think Floyd's algorithm would be a better compromise.
[Expanding my comment:] If you only have a little raw entropy available, then use a PRNG to stretch it further. You only need enough raw entropy to seed a PRNG. Use the PRNG to do the actual shuffle, not the raw entropy. For the next shuffle reseed the PRNG with some more raw entropy. That spreads out the raw entropy and makes less of a demand on your entropy source.
If you know exactly the range of numbers you need out of the PRNG, then you can, carefully, set up your own LCG PRNG to cover the appropriate range while needing the minimum entropy to seed it.
ETA: In C++ there is a next_permutation() method. Try using that. See std::next_permutation Implementation Explanation for more.
Is this a theory problem or a practical problem?
You could still do the partial shuffle, but keep track of the order of the ones and forget the zeroes. There are log(k!) bits of unused entropy in their final order for your future consumption.
You could also just use the recurrence (n choose k) = (n-1 choose k-1) + (n-1 choose k) directly. Generate a random number between 0 and (n choose k)-1; call it r. Iterate over the bits from the nth down to the first. If we have to set j of the i remaining bits, set the ith if r < (i-1 choose j-1); otherwise clear it and subtract (i-1 choose j-1) from r. A sketch of this appears below.
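A small C sketch of that recurrence-based unranking, assuming word sizes up to 32 bits so the binomial coefficients fit comfortably in 64-bit arithmetic:

#include <stdint.h>

/* binomial(n, k) via the exact multiplicative formula; fine for n <= 32. */
static uint64_t binom(int n, int k) {
    if (k < 0 || k > n) return 0;
    uint64_t c = 1;
    for (int i = 0; i < k; i++) c = c * (n - i) / (i + 1);
    return c;
}

/* Unrank r in [0, C(n,k)) into a word with exactly k of the low n bits
   set, using C(n,k) = C(n-1,k-1) + C(n-1,k) as described above. */
uint32_t unrank_bits(uint64_t r, int n, int k) {
    uint32_t mask = 0;
    for (int i = n; i >= 1; i--) {
        uint64_t with_bit = binom(i - 1, k - 1); /* subsets that set this bit */
        if (r < with_bit) {
            mask |= 1u << (i - 1);
            k--;
        } else {
            r -= with_bit;
        }
    }
    return mask;
}

Each call consumes exactly the entropy it needs: one draw of r in [0, C(n,k)) determines the whole mask.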
Practically, I wouldn't worry about the couple of words of wasted entropy from the partial shuffle; generating a random 32-bit word with 16 bits set costs somewhere between 64 and 80 bits of entropy, and that's entirely acceptable. The growth rate of the required entropy is asymptotically worse than the theoretical bound, so I'd do something different for really big words.
For really big words, you might generate n independent bits that are 1 with probability k/n. This immediately blows your entropy budget (and then some), but it only uses linearly many bits. The number of set bits is tightly concentrated around k, though. For a further expected linear entropy cost, I can fix it up. This approach has much better memory locality than the partial shuffle approach, so I'd probably prefer it in practice.
I would use solution number 3: generate the i-th permutation directly.
But do you need to generate the first i-1 ones?
You can do it a bit faster than that with the kind of divide-and-conquer method proposed here: Returning i-th combination of a bit array. Maybe you can improve the solution a bit.
Background
From the formula you have given - w! / ((w-n)! * n!) - it looks like your problem has to do with the binomial coefficient, which counts unique combinations, as opposed to permutations, which count the same selection in different orders separately.
You said:
"There is obviously a third approach of iteratively computing and counting off the legal permutations until a random index is reached, but that's simply a space-for-time trade-off on the first approach, and isn't directly helpful unless there is an efficient way to count off those n permutations.
...
This means that the random number source has been forced to produce bits to distinguish n! different results which are all equivalent. I'd like to know if there's an efficient method to avoid relying on this superfluous randomness. Perhaps by using an algorithm which produces an un-ordered list of bit positions, or by directly computing the nth unique permutation of bits."
So, there is a way to efficiently compute the nth unique combination, or rank, from the k-indexes. The k-indexes refer to a unique combination. For example, let's say that the n choose k case of 4 choose 3 is taken. This means that there are a total of 4 numbers that can be selected (0, 1, 2, 3), which is represented by n, and they are taken in groups of 3, which is represented by k. The total number of unique combinations can be calculated as n! / (k! * (n-k)!). The rank of zero corresponds to the k-index of (2, 1, 0). Rank one is represented by the k-index group of (3, 1, 0), and so forth.
Solution
There is a formula that can be used to very efficiently translate between a k-index group and the corresponding rank without iteration. Likewise, there is a formula for translating between the rank and corresponding k-index group.
I have written a paper on this formula and how it can be seen from Pascal's Triangle. The paper is called Tablizing The Binomial Coefficient.
I have written a C# class which is in the public domain that implements the formula described in the paper. It uses very little memory and can be downloaded from the site. It performs the following tasks:
Outputs all the k-indexes in a nice format for any N choose K to a file. The K-indexes can be substituted with more descriptive strings or letters.
Converts the k-index to the proper lexicographic index or rank of an entry in the sorted binomial coefficient table. This technique is much faster than older published techniques that rely on iteration. It does this by using a mathematical property inherent in Pascal's Triangle and is very efficient compared to iterating over the entire set.
Converts the index in a sorted binomial coefficient table to the corresponding k-index. The technique used is also much faster than older iterative solutions.
Uses Mark Dominus's method to calculate the binomial coefficient, which is much less likely to overflow and works with larger numbers. This version returns a long value. There is at least one other method that returns an int; make sure that you use the method that returns a long value.
The class is written in .NET C# and provides a way to manage the objects related to the problem (if any) by using a generic list. The constructor of this class takes a bool value called InitTable that when true will create a generic list to hold the objects to be managed. If this value is false, then it will not create the table. The table does not need to be created in order to use the 4 above methods. Accessor methods are provided to access the table.
There is an associated test class which shows how to use the class and its methods. It has been extensively tested with at least 2 cases and there are no known bugs.
The following tested example code demonstrates how to use the class and will iterate through each unique combination:
public void Test10Choose5()
{
    String S;
    int Loop;
    int N = 10; // Total number of elements in the set.
    int K = 5;  // Total number of elements in each group.
    // Create the bin coeff object required to get all
    // the combos for this N choose K combination.
    BinCoeff<int> BC = new BinCoeff<int>(N, K, false);
    int NumCombos = BinCoeff<int>.GetBinCoeff(N, K);
    // The KIndexes array specifies the indexes for a lexicographic element.
    int[] KIndexes = new int[K];
    StringBuilder SB = new StringBuilder();
    // Loop through all the combinations for this N choose K case.
    for (int Combo = 0; Combo < NumCombos; Combo++)
    {
        // Get the k-indexes for this combination.
        BC.GetKIndexes(Combo, KIndexes);
        // Verify that the KIndexes returned can be used to retrieve the
        // rank or lexicographic order of the KIndexes in the table.
        int Val = BC.GetIndex(true, KIndexes);
        if (Val != Combo)
        {
            S = "Val of " + Val.ToString() + " != Combo Value of " + Combo.ToString();
            Console.WriteLine(S);
        }
        SB.Remove(0, SB.Length);
        for (Loop = 0; Loop < K; Loop++)
        {
            SB.Append(KIndexes[Loop].ToString());
            if (Loop < K - 1)
                SB.Append(" ");
        }
        S = "KIndexes = " + SB.ToString();
        Console.WriteLine(S);
    }
}
So, the way to apply the class to your problem is by considering each bit in the word size as the total number of items. This would be n in the n! / (k! * (n-k)!) formula. To obtain k, or the group size, simply count the number of bits set to 1. You would have to create a list or array of the class objects for each possible k, which in this case would be 32. Note that the class does not handle N choose N, N choose 0, or N choose 1, so the code would have to check for those cases and return 1 for both the 32 choose 0 and 32 choose 32 cases. For 32 choose 1, it would need to return 32.
If you need to use values not much larger than 32 choose 16 (the worst case for 32 items - it yields 601,080,390 unique combinations), then you can use 32-bit integers, which is how the class is currently implemented. If you need to use 64-bit integers, then you will have to convert the class to use 64-bit longs. The largest value a signed 64-bit long can hold is 9,223,372,036,854,775,807 (2^63 - 1). The worst case for n choose k when n is 64 is 64 choose 32, which is 1,832,624,140,942,590,534 - so a long value will work for all 64 choose k cases. If you need numbers bigger than that, then you will probably want to look into using some sort of big-integer class. In C#, the .NET framework has a BigInteger class. If you are working in a different language, it should not be hard to port.
If you are looking for a very good PRNG, one of the fastest and most lightweight with high-quality output is the Tiny Mersenne Twister, or TinyMT for short. I ported the code over to C++ and C#; it can be found here, along with a link to the original author's C code.
Rather than using a shuffling algorithm like Fisher-Yates, you might consider doing something like the following example instead:
// Get 7 random cards.
ulong Card;
ulong SevenCardHand = 0;
for (int CardLoop = 0; CardLoop < 7; CardLoop++)
{
    do
    {
        // The card has a value between 0 and 51. So, get a random value and
        // left shift it into the proper bit position.
        Card = (1UL << RandObj.Next(CardsInDeck));
    } while ((SevenCardHand & Card) != 0);
    SevenCardHand |= Card;
}
The above code is faster than any shuffling algorithm (at least for obtaining a subset of random cards) since it only works on 7 cards instead of 52. It also packs the cards into individual bits within a single 64 bit word. It makes evaluating poker hands much more efficient as well.
As a side note, the best binomial coefficient calculator I have found that works with very large numbers (it accurately calculated a case that yielded over 15,000 digits in the result) can be found here.

how to optimize search difference between array / list of object

Premise:
I am using ActionScript with two ArrayCollections containing objects whose values need to be matched...
I need a solution for this (if there is a library in the framework that does it better); otherwise any suggestions are appreciated...
Let's assume I have two lists of elements A and B (no duplicate values) and I need to compare them and remove all the elements present in both, so at the end I should have
in A all the elements that are in A but not in B
in B all the elements that are in B but not in A
now I do something like that:
for (var i:int = 0 ; i < a.length ;)
{
    var isFound:Boolean = false;
    for (var j:int = 0 ; j < b.length ;)
    {
        if (a.getItemAt(i).nome == b.getItemAt(j).nome)
        {
            isFound = true;
            a.removeItemAt(i);
            b.removeItemAt(j);
            break;
        }
        j++;
    }
    if (!isFound)
        i++;
}
I loop over both arrays, and when I find a match I remove the items from both (and don't increase the loop counter, so the for loop progresses correctly).
I was wondering if there is (and I'm sure there is) a better (and less CPU-consuming) way to do this...
If you must use a list and you don't need the abilities of ArrayCollection, I suggest simply converting to AS3 Vectors. The performance increase according to this (http://www.mikechambers.com/blog/2008/09/24/actioscript-3-vector-array-performance-comparison/) is 60% compared to Arrays, and I believe Arrays are already 3x faster than ArrayCollections, from some article I once read. Unfortunately, this solution is still O(n^2) in time.
As an aside, the reason Vectors are faster than ArrayCollections is that you provide type hinting to the VM. The VM knows exactly how large each object in the collection is and performs optimizations based on that.
Another optimization on the Vectors is to sort the data by nome first, before doing the comparisons. You can then add a check to break out of the inner loop once the nome from list B could no longer appear further down in list A due to the ordering.
If you want to do MUCH better than that, use an associative array (an Object in AS3). Of course, this may require more refactoring effort. I am assuming object.nome is a unique string/id for the objects. Simply use the value of nome as the key in objectA and objectB. By doing it this way, you don't need to loop through each element of one list for every element of the other; a sketch of the idea follows.
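For illustration, a minimal C sketch of the associative-array idea, assuming nome is a unique string per object: index one list in a hash set once, then test the other list against it, for an expected O(|A| + |B|) instead of the nested-loop O(|A| * |B|):

#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 1024  /* power of two, larger than the element count */

/* A minimal open-addressing string set, standing in for an AS3 Object
   used as an associative array keyed on nome. */
static const char *table[TABLE_SIZE];

static unsigned hash_str(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33u + (unsigned char)*s++;
    return h & (TABLE_SIZE - 1);
}

static void set_add(const char *s) {
    unsigned i = hash_str(s);
    while (table[i]) i = (i + 1) & (TABLE_SIZE - 1);
    table[i] = s;
}

static int set_contains(const char *s) {
    for (unsigned i = hash_str(s); table[i]; i = (i + 1) & (TABLE_SIZE - 1))
        if (strcmp(table[i], s) == 0) return 1;
    return 0;
}

int main(void) {
    const char *a[] = { "anna", "bruno", "carla" };
    const char *b[] = { "bruno", "dario" };
    size_t na = sizeof a / sizeof a[0], nb = sizeof b / sizeof b[0];
    /* Index B once, then one linear pass over A. */
    for (size_t i = 0; i < nb; i++) set_add(b[i]);
    for (size_t i = 0; i < na; i++)
        if (!set_contains(a[i]))
            printf("only in A: %s\n", a[i]);
    /* The B-minus-A direction is symmetric: index A and scan B. */
    return 0;
}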